[PDF] A new robust approach for multinomial logistic regression with complex design model

Abstract

Robust estimators and Wald-type tests are developed for the multinomial logistic regression based on \phi-divergence measures. The robustness of the proposed estimators and tests is proved through the study of their influence functions and it is also illustrated with two numerical examples and an extensive simulation study.

Full PDF

AA new robust approach for multinomial logisticregression with complex design model

Elena Castilla and Pedro J. Chocano Department of Statistics and O.R., Complutense University of Madrid, Spain Department of Algebra, Geometry and Topology, Complutense University of Madrid, Spain

Abstract

Robust estimators and Wald-type tests are developed for the multinomial logisticregression based on φ -divergence measures. The robustness of the proposed estima-tors and tests is proved through the study of their inﬂuence functions and it is alsoillustrated with two numerical examples and an extensive simulation study. Multinomial logistic regression model, also known as polytomous logistic regression model,is widely used in health and life sciences for analyzing nominal qualitative response vari-ables and their relationship with respect to their corresponding explanatory variables orcovariates. Daniels and Gatsonis [1997] used hieratical multinomial logistic regresion mod-els to examine how the rates of cardiac procedures depend on patient-level characteristics,including age, gender and race. Dreassi [2007] also used multinomial logistic regression todetect uncommon risk factors related to oral cavity, larynx and lung cancers. Recently,Ke et al. [2016] proposed a risk prediction model using semi-varying coeﬃcient multino-mial logistic regression to assess correct prediction rates when classifying the patients withearly rheumatoid arthritis. Further examples of application of these methods can be foundin Blizzard and Hosmer [2007], Bull et al. [2007] and Bertens et al. [2016], among oth-ers. Although most of classical literature deals with the cases of simple random samplingscheme, the application of multinomial logistic regression model under complex surveysetting (with stratiﬁcation, clustering or unequal selection probabilities, for example) canbe found, for example, in Binder [1983], Roberts et al. [1987], Morel [1989] and Morel andNeerchal [2012].Most of the results mentioned above are based on (pseudo) maximum likelihood esti-mators (PMLEs), which are well-known to be eﬃcient, but also non-robust. Therefore,testing procedures based on MLEs may face serious robustness problems. Castilla et al.[2018a] developed density power divergence (DPD) based robust estimators (MDPDEs)and Wald-type tests for multinomial logistic regression model under simple random sam-pling. This approach was extended to complex design in Castilla et al. [2020]. In Castillaet al. [2018b], pseudo minimum φ -divergence estimators (PM φ Es), as well as new estima-tors for the intra-cluster correlation coeﬃcient were developed. Estimators in the CressieRead subfamily with tuning parameter λ > λ = 0) for small samples sizes. However, the robustness issue was notconsidered. In the cited paper of Castilla et al. [2020], some simulation studies showedthat Cressie Read estimators with negative tuning parameter were even more robust thanMDPDEs, in terms of eﬃciency, for low-moderate intra-cluster correlations. However, this1 a r X i v : . [ m a t h . S T ] F e b roblem was not theoretically studied and hypothesis testing was not considered. In thispaper, we prove, through the study of the inﬂuence function, the robustness of PM φ Eswith − < λ < φ Es based Wald-type tests for testing com-posite null hypothesis. In Section 2, we present the multinomial logistic regression modelas well as the framework necessary to deﬁne the PM φ Es. Based on their asymptotic dis-tribution, robust Wald-type tests are developed in Section 3. An extensive simulationstudy and two numerical examples illustrate the robustness of the proposed estimatorsand Wald-type tests, in Section 4 and Section 5 respectively. In Appendix A, the studyof the inﬂuence function of the proposed test statistics is detailed, while in Appendix Bwe present the proofs of the main results. Finally, in Appendix C, some extensions of theMonte Carlo simulation study are presented.

We consider a population Ω partitioned into H strata and the data consist of n h clustersin stratum h . In the i -th cluster ( i = 1 , ..., n h ) within the h -th stratum ( h = 1 , ..., H ) wehave observed for the j -th unit ( j = 1 , ..., m hi ) the values of a categorical response variable Y with d + 1 categories. Note that we assume there are H strata, n h clusters in stratum h and m hi units in cluster i of stratum h . The observed responses of the ( d + 1)-dimensionalvariable Y are denoted by the ( d + 1)-dimensional classiﬁcation vector. y hij = ( y hij , ...., y hij,d +1 ) T , h = 1 , ..., H, i = 1 , ..., n h , j = 1 , ..., m hi , with y hijr = 1 if the j -th unit selected from the i -th cluster of the h -th stratum falls inthe r -th category and y hijl = 0 for l (cid:54) = r . It is very common when working with dummyor qualitative explanatory variables to consider that the k + 1 explanatory variables arecommon for all the individuals in the i -th cluster of the h -th stratum, being denoted as x hi = ( x hi , x hi , ...., x hik ) T , with the ﬁrst one, x hi = 1, associated with the intercept.Let us denote the sampling weight from the i -th cluster of the h -th stratum by w hi .For each i , h and j , the expectation of the r -th element of the random variable Y hij =( Y hij , ..., Y hij,d +1 ) T , corresponding to the realization y hij , is determined by π hir ( β ) = E [ Y hijr | x hi ] = Pr ( Y hijr = 1 | x hi ) = exp { x Thi β r } (cid:80) dl =1 exp { x Thi β l } , r = 1 , ..., d, (1)with β r = ( β r , β r , ..., β rk ) T ∈ R k +1 , r = 1 , ..., d and the associated parameter spacegiven by Θ = R d ( k +1) . It is clear that π hid +1 ( β ) = 11 + (cid:80) dl =1 exp { x Thi β l } . (2)Note that, under homogeneity, the expectation of Y hij does not depends on the unitnumber j . Therefore, from now on, we will denote by (cid:98) Y hi = m hi (cid:88) j =1 Y hij =  m hi (cid:88) j =1 Y hij , ..., m hi (cid:88) j =1 Y hij,d +1  T = ( (cid:98) Y hi , ..., (cid:98) Y hi,d +1 ) T the random vector of counts in the i -th cluster of the h -th stratum and by π hi ( β )the ( d + 1)-dimensional probability vector with the elements given in (1), π hi ( β ) =( π hi ( β ) , ..., π hi,d +1 ( β )) T . 2ollowing this notation we can deﬁne the empirical and the theoretical probabilityvectors of the model as (cid:98) p = 1 τ ( w (cid:98) y T , ..., w n (cid:98) y T n , ..., w H (cid:98) y TH , ..., w Hn H (cid:98) y THn H ) T and (3) π ( β ) = 1 τ ( w m π T ( β ) , ..., w n m n π THn H ( β )) T , (4)respectively, where τ = H (cid:80) h =1 n h (cid:80) i =1 w hi m hi . Probability vectors (3) and (4), both of dimension( d + 1) (cid:80) Hh =1 n h , will play a basic role in the deﬁnition of PM φ Es.

Deﬁnition 2.1

Under homogeneity assumption within the clusters and taking into ac-count the weights w hi , the (weighted) pseudo-maximum likelihood estimator (PMLE), (cid:98) β P ,of β is obtained by maximizing L ( β ) = H (cid:88) h =1 n h (cid:88) i =1 w hi log π Thi ( β ) (cid:98) y hi , (5) where log π Thi ( β ) = (log π hi ( β ) , . . . , log π hid +1 ( β )) . The PMLE can be obtained as the solution of the system of equations u ( β ) = d ( k +1) ,where d ( k +1) is the null vector of dimension d ( k + 1), and u ( β ) = H (cid:88) h =1 n h (cid:88) i =1 u hi ( β ) , u hi ( β ) = w hi r ∗ hi ( β ) ⊗ x hi , (6)where ⊗ is the Kronecker product and r ∗ hi ( β ) = (cid:98) y ∗ hi − m hi π ∗ hi ( β ), denoting with super-script ∗ the vector obtained deleting the last component from the initial vector. Remark 2.2

The distribution of (cid:98) Y hi , might be unknown, as their components jointly,might be correlated. The most common assumption is to consider that (cid:98) Y hi has a multi-nomial sampling scheme, which means that Y hij , j = 1 , ..., m hi are independent randomvariables with covariance matrix Σ hi = m hi ∆ ( π hi ( β )) , with ∆ ( π hi ( β )) = diag( π hi ( β )) − π hi ( β ) π Thi ( β ) ; and since (5) is not an approximation,the term “pseudo” should be dropped. A weaker assumption is to consider that (cid:98) Y hi hasa multinomial sampling scheme with a overdispersion parameter ν hi = 1 + ρ hi ( m hi − ,and Σ hi = ν hi m hi ∆ ( π hi ( β )) , but the distribution of (cid:98) Y hi is not in principle used for the estimators. Distributions suchas Dirichlet Multinomial, Random Clumped and m -inﬂated belong to this family (see Moreland Neerchal [2012], Alonso-Revenga et al. [2017] and Castilla et al. [2018c] for details).In Appendix C.1, the algorithms needed to compute these distributions in the context ofPLR model with complex design are presented. Example 2.3 (Education in Malawi)

The 2010 Malawi Demographic and Health Sur-vey (2010 MDHS, Oﬃce and Macro [2010]) was implemented by the National StatisticalOﬃce (NSO) from June through November 2010, with a nationally representative sample f more than , households. The sample for the 2010 MDHS was designed to providepopulation and health indicator estimation at the national, regional, and district levels.Let us focus on Tables 2.3.1 and 2.3.2 of the cited study, that present data on educationalattainment for female and male household members age and older, divided in ﬁve wealthquintile levels, which are considered as strata. We consider here a response variable with d + 1 = 5 categories: “no education”, “some primary”, “completed primary”, “some sec-ondary”, “completed secondary or more”. For simplicity, the missing observations are nottaken into account. Figure 1 shows the estimated probabilities by the PMLE of each one ofthe response categories for each gender. As expected, the proportion of women who havenever attended any formal schooling is greater than the proportion of men and the propor-tion of the population that has attained education declines with its level. In the ensuingwork, we will present alternative estimators to the PMLE, which are seen to provide betterperformance in terms of robustness. no education some primary completed primary some secondary more wealthy e s t. p r obab ili t i e s Females no education some primary completed primary some secondary more wealthy e s t. p r obab ili t i e s Males

Figure 1: Estimated probabilities for wealth quintiles by gender. PMLE. φ Es: deﬁnition, estimation and asymptotic distribution.

Deﬁnition 2.4

Given the probability vectors (3) and (4), the family of phi-divergencemeasures between these probability vectors is given by d φ ( (cid:98) p , π ( β )) = 1 τ H (cid:88) h =1 n h (cid:88) i =1 w hi m hi d +1 (cid:88) s =1 π his ( β ) φ (cid:18) (cid:98) y his m hi π his ( β ) (cid:19) , (7) where φ ∈ Φ ∗ and Φ ∗ denotes the class of all convex functions φ : [0 , ∞ ) → R ∪ {∞} , suchthat φ (1) = 0 , φ (cid:48)(cid:48) (1) > and we deﬁne φ (0 /

0) = 0 and φ ( p/

0) = p lim u →∞ φ ( u ) /u . Notice that, for φ ( x ) = x log x − x + 1 in (7) , we have the so-called Kullback Leiblerdivergence. For more details about phi-divergence measures see Pardo [2005].Let us now consider the PMLE in (5). It can be shown that it is related to theKullback-Leibler divergence measure as follows d KL ( (cid:98) p , π ( β )) = K − τ L ( β ) , with K being a constant not dependent on β (Castilla et al. [2018b]). Therefore, themaximization of L ( β ) is equivalent to the minimization of d KL ( (cid:98) p , π ( β )), i.e., PMLE4s the one which minimizes the Kullback-Leibler divergence between the empirical andtheoretical probability vectors, 3 and 4, (cid:98) β P = arg min β ∈ Θ d KL ( (cid:98) p , π ( β )) . (8)The deﬁnition of PM φ E arises from the idea of generalize deﬁnition (8) to other phi-divergence measures.

Deﬁnition 2.5

We consider the multinomial logistic regression model with complex sur-vey deﬁned in (1). The PM φ E of β is deﬁned as (cid:98) β φ,P = arg min β ∈ Θ d φ ( (cid:98) p , π ( β )) . Once the PM φ Es are deﬁned, it is necessary to provide the equations needed to obtainthem. From equation (7), it is clear that the PM φ E of β , (cid:98) β φ,P , is obtained by solving thesystem of equations u φ ( β ) = d ( k +1) , where u φ ( β ) = H (cid:88) h =1 n h (cid:88) i =1 u φ,hi ( β ) , (9)with u φ,hi ( β ) = w hi m hi φ (cid:48)(cid:48) (1) ∂ π Thi ( β ) ∂ β f φ,hi ( (cid:98) y hi m hi , β ) , ∂ π Thi ( β ) ∂ β = ( I d × d , d × ) ∆ ( π hi ( β )) ⊗ x hi , and f φ,hi ( (cid:98) y hi m hi , β ) = ( f φ,hi ( (cid:98) y hi m hi , β ) , ..., f φ,hi ( d +1) ( (cid:98) y hi ( d +1) m hi , β )) T ,f φ,his ( x, β ) = xπ his ( β ) φ (cid:48) (cid:18) xπ his ( β ) (cid:19) − φ (cid:18) xπ his ( β ) (cid:19) . Theorem 2.6 establishes the asymptotic distribution of the PM φ Es, which will be thebasis of the deﬁnition of the family of Wald-type tests in Section 3. The proof of thistheorem can be found in Castilla et al. [2018b].

Theorem 2.6

Let (cid:98) β φ,P the PM φ E of parameter β for a multinomial logistic regressionmodel with complex survey, n the total of clusters in all the strata of the sample and η ∗ h an unknown proportion obtained as lim n →∞ n h n = η ∗ h , h = 1 , ..., H . Then, we have √ n ( (cid:98) β φ,P − β ) L −→ n →∞ N (cid:0) d ( k +1) , J − (cid:0) β (cid:1) G (cid:0) β (cid:1) J − (cid:0) β (cid:1)(cid:1) , (10) where β is the true parameter value and J ( β ) = lim n →∞ J n ( β ) = H (cid:88) h =1 η ∗ h lim n h →∞ J ( h ) n h ( β ) , G ( β ) = lim n →∞ G n ( β ) = H (cid:88) h =1 η ∗ h lim n h →∞ G ( h ) n h ( β ) , ith J n ( β ) = 1 n H (cid:88) h =1 n h (cid:88) i =1 w hi m hi ∆ ( π ∗ hi ( β )) ⊗ x hi x Thi , J ( h ) n h ( β ) = 1 n h n h (cid:88) i =1 w hi m hi ∆ ( π ∗ hi ( β )) ⊗ x hi x Thi , G n ( β ) = 1 n H (cid:88) h =1 n h (cid:88) i =1 V [ U hi ( β )] , G ( h ) n h ( β ) = 1 n h n h (cid:88) i =1 V [ U hi ( β )] , V [ U hi ( β )] = w hi V [ (cid:98) Y ∗ hi ] ⊗ x hi x Thi , J ( β ) is the Fisher information matrix, V [ · ] denotes the variance-covariance matrix of arandom vector and U hi ( β ) is the random variable generator of u hi ( β ) , given by (6). Remark 2.7

Matrices J ( β ) and G ( β ) of Theorem 2.6 can be consistently estimated as (cid:98) J n ( (cid:98) β φ,P ) = 1 n H (cid:88) h =1 n h (cid:88) i =1 w hi m hi ∆ ( π ∗ hi ( (cid:98) β φ,P )) ⊗ x hi x Thi (cid:98) G n ( (cid:98) β φ,P ) = 1 n H (cid:88) h =1 n h (cid:88) i =1 (cid:18) u hi ( (cid:98) β φ,P ) − n u ( (cid:98) β φ,P ) (cid:19) (cid:18) u hi ( (cid:98) β φ,P ) − n u ( (cid:98) β φ,P ) (cid:19) T . An important family of phi-divergence measures is obtained by restricting φ from thefamily of convex functions Φ ∗ to the Cressie-Read subfamily, that is to say, φ is of thefollowing form: φ λ ( x ) = (cid:40) λ (1+ λ ) (cid:2) x λ +1 − x − λ ( x − (cid:3) , λ ∈ R − {− , } lim υ → λ υ (1+ υ ) (cid:2) x υ +1 − x − υ ( x − (cid:3) , λ ∈ {− , } . For the Cressie-Read subfamily, it is established that for λ (cid:54) = − u φ λ ( β ) = (cid:88) Hh =1 (cid:88) n i i =1 u φ λ ,hi ( β ) , where u φ λ ,hi ( β ) = w hi ( λ + 1) m λhi ∆ ∗ ( π hi ( β ))diag − ( λ +1) ( π hi ( β )) (cid:98) y λ +1 hi ⊗ x hi , (11)where ∆ ∗ ( π hi ( β )) = ( I d , d ) ∆ ( π hi ( β )) . We can observe that for λ = 0, we have φ λ =0 ( x ) = lim υ → υ (1 + υ ) (cid:2) x υ +1 − x − υ ( x − (cid:3) = x log x − x + 1 , and the associated phi-divergence, coincides with the Kullback divergence. Therefore, thePM φ E of β based on φ λ ( x ) contains as special case the PMLE and u hi ( β ) given in (6)matches u φ,hi ( β ) given in (11). Other important divergences are obtained inside thisfamily: for λ = 1 the chi-square divergence, for λ = 2 / λ = − . φ Es for − < λ < emark 2.8 Along this paper, we are referring to the case of complex sample survey.These all procedures can be easily simpliﬁed to the case of simple sample survey by con-sidering a single stratum and considering “observations” instead of clusters. Some workhas been done within the phi-divergence measures and multinomial logistic regression (seeGupta et al. [2008], Mart´ın and Pardo [2014]) but, to the best of our knowledge, the ro-bustness issue was not previously considered. In Section 5.2, an example is provided toillustrate the application of the proposed methods also in this context.

In the last years, it has been very common in the statistical literature to consider Wald-type tests based on the minimum distance estimators instead of the MLE. The resultingtests have an excellent behaviour in relation to the robustness with a non-signiﬁcant lossof eﬃciency, see for instance, Basu et al. [2017, 2018]. In this section, we will introduceWald-type test statistics based on the PM φ Es as a generalization of the classical Wald testbased on PMLE. As it happened with PM φ Es, the robustness of these proposed Wald-typetests can be proved for − < λ < H : M T β = m against M T β (cid:54) = m , (12)where M is d ( k + 1) × r full rank matrix with r ≤ d ( k + 1) and m an r -vector. Deﬁnition 3.1

Let (cid:98) β φ,P the PM φ E of β and denote (cid:98) V n ( (cid:98) β φ,P ) = (cid:98) J − n ( (cid:98) β φ,P ) (cid:98) G n ( (cid:98) β φ,P ) (cid:98) J − n ( (cid:98) β φ,P ) . Then, the family of Wald-type test statistics for testing the null hypothesis given in (12)is deﬁned as W n ( (cid:98) β φ,P ) = n (cid:16) M T (cid:98) β φ,P − m (cid:17) T (cid:104) M T V n ( (cid:98) β φ,P ) M (cid:105) − (cid:16) M T (cid:98) β φ,P − m (cid:17) . (13) Theorem 3.2

The asymptotic distribution of the Wald-type test statistics, W n ( (cid:98) β φ,P ) ,under the null hypothesis in (12), is a chi-square distribution with r degrees of freedom. Corollary 3.3

Based on Theorem 3.2, the null hypothesis in (12) will be rejected if W n ( (cid:98) β φ,P ) > χ r,α (14) being χ r,α the upper α -th quantile of χ r . The following theorem may be used to approximate the power function for the Wald-type test statistics given in (14).

Theorem 3.4

Let β be the true value of the parameter and let us denote (cid:96) ∗ ( β , β ) = (cid:0) M T β − m (cid:1) T (cid:0) M T V ( β ) M (cid:1) − (cid:0) M T β − m (cid:1) . Then it holds √ n (cid:16) (cid:96) ∗ ( (cid:98) β φ,P , β ) − (cid:96) ∗ (cid:0) β , β (cid:1)(cid:17) L −→ n →∞ N (0 , σ W (cid:0) β (cid:1) ) , where σ W (cid:0) β (cid:1) = 4 (cid:0) M T β − m (cid:1) T (cid:0) M T V (cid:0) β (cid:1) M (cid:1) − (cid:0) M T β − m (cid:1) . heorem 3.5 Let β ∈ Θ , with M T β (cid:54) = m , be the true value of the parameter suchthat (cid:98) β φ,P P −→ n →∞ β . The power function of the Wald-type test given in (14), is given by Π W n ( (cid:98) β φ,P ) (cid:0) β (cid:1) = 1 − Φ n (cid:32) σ W (cid:0) β (cid:1) (cid:32) χ r,α √ n − √ n (cid:96) ∗ ( (cid:98) β φ,P , β ) (cid:33)(cid:33) (15) where Φ n ( x ) uniformly tends to the standard normal distribution as n → ∞ . Corollary 3.6

It is clear that lim n →∞ Π W n ( (cid:98) β φ,P ) (cid:0) β (cid:1) = 1 for all α ∈ (0 , . Therefore, the Wald-type tests are consistent in the sense of Fraser. Remark 3.7

Theorem 3.5 can be applied in the sense of getting the necessary sample sizein order to get that the Wald-type tests have a determinate ﬁx power, i.e., Π W n ( (cid:98) β φ,P ) (cid:0) β (cid:1) ≡ π and size α . The necessary sample size is given by n = (cid:34) A + B + (cid:112) A ( A + 2 B )2 (cid:96) ∗ (cid:0) β , β (cid:1) (cid:35) + 1 , where [ x ] denotes the largest integer less than or equal to x , A = σ W (cid:0) β (cid:1) (cid:0) Φ − (1 − π ) (cid:1) and B = 2 (cid:96) ∗ (cid:0) β , β (cid:1) χ r,α . We may also ﬁnd approximations of the power function of the Wald-type tests given in(13) at an alternative hypothesis close to the null hypothesis. Let β n ∈ Θ − Θ be a givenalternative, and let β ∈ Θ (null hypothesis) the element closest to β n in terms of theEuclidean distance. We may introduce contiguous alternative hypotheses by considering aﬁxed d ∈ R d ( k +1) and to permit β n moving towards β as n increases through the relation H ,n : β n = β + n − / d . (16)Let us now relax the condition M T β = m deﬁning the null hypothesis. Let δ ∈ R r andconsider the following sequence, β n , of parameters moving towards β according to H ∗ ,n : M T β n − m = n − / δ . (17) Theorem 3.8

The asymptotic distribution of W n ( (cid:98) β φ,P ) is given by:(a) Under H ,n , W n ( (cid:98) β φ,P ) L −→ n →∞ χ r (∆) , where ∆ is the parameter of non-centrality givenby ∆ = d T M (cid:2) M T V ( β ) M (cid:3) − M T d , (b) Under H ∗ ,n , W n ( (cid:98) β φ,P ) L −→ n →∞ χ r (∆ ∗ ) , where ∆ ∗ is the parameter of non-centralitygiven by ∆ ∗ = d T M (cid:2) M T V ( β ) M (cid:3) − M T d . Proofs of the results given in this section can be found in Appendix B.8

Monte Carlo Simulation Study

In this section, we develop a simulation study in order to illustrate the robustness of theproposed estimators and Wald-type tests based on them. Following the simulation studiesproposed in Castilla et al. [2018a] and Castilla et al. [2020], we consider H = 4 strata with n h clusters of the same size m , with m = 20 and n h ∈ { , , .., } for h = 1 , . . . , H . Weconsider d + 1 = 3 categories on the response variable, depending on k = 2 explanatoryvariables. The response variable (cid:98) Y hi , described as E [ (cid:98) Y hi ] = m π hi (cid:0) β (cid:1) and V [ (cid:98) Y hi ] = ν m m ∆ ( π hi (cid:0) β (cid:1) ) ,ν m = 1 + ρ ( m − , i = 1 , . . . , n h , h = 1 , . . . , H, is considered to follow the m-Inﬂated multinomial distribution (see Remark 2.2), withparameters ρ = 0 . π hi (cid:0) β (cid:1) , given by the logistic relationship (1) with β = ( β , β , β , β , β , β ) T = (0 , − . , . , . , − . , . T and x hi iid ∼ N ( , I ) for all i = 1 , . . . , n h , h = 1 , . . . , H . In order to study the robustnessissue, these simulations are repeated under contaminated data having 10% outliers. Theseoutliers are generated by permuting the elements of the outcome variable, such that cat-egories 1, 2, 3 are classiﬁed as categories 3, 1, 2 for the outlying observations. Note thatthis view of considering outliers as classiﬁcation errors in the PLR model is, in fact, in linewith the general literature on robust analysis of categorical data (Johnson [1985], Crouxand Haesbroeck [2003]) and is covered with the theory developed in Appendix A, whereour “outlier producing measure” indeed provides classiﬁcation error if the outlier pointyields its mass in a wrong category (see Castilla et al. [2020] for more details).In this scenario, the root of mean square error (RMSE) for the Cressie-Read PM φ Es of β with λ ∈ {− . , − . , , / } is studied, both for the contaminated and not-contaminatedcases (see top of Figure 2). To compute the accuracy in terms of contrast, we consider thetesting problem H : β = − . H : β (cid:54) = − . . For computing the empirical test level, we measured the proportion of Wald-type teststatistics ex- ceeding the corresponding chi-square critical value. The simulated test powerswere also obtained under H in in a similar manner (here we consider β = − . .

05. Both levels and powers are presented in the middle andbottom of Figure 2.It is observed that PMLE ( λ = 0) presents the best behavior in terms of eﬃciencyfor the non-contaminated setting. In addition, PM φ E with λ = 0 .

66 presents a RMSElower than PM φ E with negative values of λ . On the other hand, when it is consideredthe contaminated setting, PM φ E with λ = − . n h . The other negative value considered for PM φ E improvesclearly the RMSE regarding to λ = 0 .

66, again, for high values of n h . However, the greatestdiﬀerence is observed when studying empirical levels and powers. Although PMLE remainsthe best estimator for testing in a pure scenario, for the contaminated setting, betterempirical levels are observed for negative values of the tuning parameter λ , in particular,for λ = − .

5. In terms of powers, negative values of λ present better behavior in bothsettings, non-contaminated and contaminated. Positive values of λ are presented as a goodalternative only in terms of eﬃciency for small sample sizes, in concordance with Castillaet al. [2018b]. Other alternative scenarios are considered in Appendix C.9 l l l l l l l l l n h R M SE l −0.5 −0.3 0 0.66 l l l l l l l l l l n h R M SE l −0.5 −0.3 0 0.66 l l l l l l l l l l n h l e v e l l −0.5 −0.3 0 0.66 l l l l l l l l l l n h l e v e l l −0.5 −0.3 0 0.66 l l l l l l l l l l n h po w e r l −0.5 −0.3 0 0.66 l l l l l l l l l l n h po w e r l −0.5 −0.3 0 0.66 Figure 2: RMSEs (top), emprirical levels (middle) and empirical powers (bottom). Non-contaminated and contaminated settings (left and right, respectively). m-Inﬂated distri-bution. 10

Numerical Examples

Let us continue with the 2010 MDHS presented in Example 2.3. As pointed out there, the2010 MDHS presents data on educational attainment for female and male by its wealthquintile level. In this section, we make a comparison of the behaviour between diﬀerentPM φ Es, when estimating the probabilities of the response categories. For this purpose, andafter estimating the PM φ Es in a grid of tuning parameters λ ∈ {− . , . } in the Cressie-Read subfamily, we measure a pondered standardized mean absolute error (SMAE) of theestimated probabilities against the observed probabilities. This is done by distinguishingthe strata (wealth quintiles), the clusters (female and male) and jointly, as it can be seenin Figure 3. The lowest SMAEs are obtained for negative values of λ . Then, they seem tooﬀer a better behavior than the classical PMLE. l l l l l l l l l l l l l l S M AE wealthy l no educationsome primarycompleted primarysome secondarymore Females l l l l l l l l l l l l l l S M AE wealthy l no educationsome primarycompleted primarysome secondarymore Males l l l l l l l l l l l l l l S M AE wealthy l no educationsome primarycompleted primarysome secondarymore Males and females l l l l l l l l l l l l l l S M AE sex l femalemaletotal Figure 3: Education in Malawi: estimated SMAEs for diﬀerent values of the tuning pa-rameter.

As noted in Remark 2.8, multinomial logistic regression model under complex sampledesign is an extension of the classical one, evaluated under a simple sample design. There-fore, the tools developed in this paper, can be also applied to these cases, in which thedata design may be much simpler. In this section, we study the Mammography experiencedata, a subset of a study by the University of Massachusetts Medical School, introduced11 llllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllll

Number of Observation P r ed i c t ed p r obab ili t y l With outliers Without outliers

First response category l =2/3 lllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllll Number of Observation P r ed i c t ed p r obab ili t y l With outliers Without outliers

Second response category l =2/3 lllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllll Number of Observation P r ed i c t ed p r obab ili t y l With outliers Without outliers

First response category l =0 lllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllll Number of Observation P r ed i c t ed p r obab ili t y l With outliers Without outliers

Second response category l =0 lllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllll Number of Observation P r ed i c t ed p r obab ili t y l With outliers Without outliers

First response category l =−0.5 lllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllll Number of Observation P r ed i c t ed p r obab ili t y l With outliers Without outliers

Second response category l =−0.5 Figure 4: Mamography example: predicted category probabilities of the response variablefor the MLE ( λ = 0) and M φ Es with λ = 2 / λ = − . x i for i ∈ { , , , , , , } can be treated as outliers.So this data set is a perfect candidate to show the robustness performance of the proposedestimators.We compute the minimum φ -divergence estimators (M φ Es) of β for λ ∈ {− . , , / } ,for the full dataset and also for the outliers deleted dataset. Moreover, we plot the cor-responding (estimated) category probabilities for each available distinct covariate values.The left panel of Figure 4 presents these category probabilities for the ﬁrst category, whilethe right panel presents these category probabilities for the second category. Resultsclearly indicate the signiﬁcant variation of the MLE and M φ E with λ = 2 / φ E with λ = − . j=1 f D P D j=2 f D P D Figure 5: Mamography example: comparation of M φ Es and MDPDEs.In Castilla et al. [2018a] this dataset was also analyzed in order to illustrate the robust-ness of other family of estimators, those based on DPD divergences, the MDPDEs. Thisfamily is also parametrized by a tuning parameter, let say, λ ∗ ≥

0, and contains the MLEas a particular case for λ ∗ = 0. The question that could arise here is the point on usingPM φ Es instead of MDPDEs. In this regard, the eﬃciency of both family of estimatorsis compared in the following way: for each pair of tuning parameters ( λ, λ ∗ ) in a grid on[ − . , × [0 , φ Es and MDPDEs are computed and the estimated probabilitiesfor each of the categories of the response variable are obtained for each of the I = 125observations. Then, we count the number of times, in these 125 observations, that theM φ E presents a lower error (the estimated probability is closer to the observed probabil-13ty) than the MDPDE. The higher this value is, the better is the M φ E with respect to theMDPDE. If the value is under [ I/

2] = 63, then the MDPDE is preferable. These resultsare illustrated in the two heat plots (for the ﬁrst and second category, the third is omittedsince it is similar) on Figure 5. We can observe how M φ Es with a low value of λ improvesany MDPDE, while MDPDEs with a large value of λ ∗ only improves PM φ Es with tuningparameters near to 0. The eﬃciency of the MLE is not comparable to any other option.Now, we want to evaluate the robustness of the proposed Wald-type tests. We considerthe problem of testing H : β SY MP T = 0 ,H : β SY MP T = β SY MP T , for the variable SYMPT (“You do not need a mammogram unless you develop symptoms:1, strongly agree; 2, agree; 3, disagree; 4, strongly disagree). The p-values obtained basedon the proposed test are plotted over λ in Figure 6 for both the full and the outlier deleteddata. Clearly, the test decision at the signiﬁcance level α = 0 . λ near to 0. l l l l l l l p − v a l ue l With outliers Without outliers b SYMPT =0 l l l l l l l p − v a l ue l With outliers Without outliers b SYMPT = b SYMPT Figure 6: Mamography example: p-values of the proposed M φ Es based Wald-type tests.

In this paper, we present robust estimators (PM φ E) and Wald-type tests based on themfor the multinomial logistic regression under complex survey, by means of φ -divergencemeasures. In particular, we focus our study in the Cressie-Read subfamily of divergences,which are modelized by a tuning parameter λ . It is theoretically proved and empiricallyillustrated how PM φ Es and Wald-type tests with − < λ < λ . Robustness is usually accompanied with a loss of eﬃciency and other factors,such as sample size, can be also determinant in this decision. One possible way to make thischoice is as follows: in a grid of possible tuning parameters, apply a measure of discrepancyto the data. Then, the tuning parameter that leads to the minimum discrepancy-statisticcan be chosen as the “optimal” one. This is, somehow, the idea followed in the examples(see Figure 3 and 5). Another alternative is the one proposed by Warwick and Jones[2005], which consists on minimizing the estimated mean square error, computed as thesum of the squared (estimated) bias and variance. One of the main drawbacks of thismethod is the fact that it depends on a pilot estimator to estimate the bias. This problemwas also highlighted recently in Basak et al. [2020], where an “iterative Warwick and Jonesalgorithm” (IJW algorithm) is proposed. Application of these methods and a developmentof new ones will be a challenging and interesting problem for further consideration. Acknowledgments:

This research is partially supported by Grant PGC2018-095194-B-I00, Grant FPU16/03104 and Grant BES-2016- 076669 from

Ministerio de Ciencia, Inno-vaci´on y Universidades and

Ministerio de Econom´ıa, Industria y Competitividad (Spain).E. Castilla is member of the

Instituto de Matem´atica Interdisciplinar , Complutense Uni-versity of Madrid. 15

Study of the Robustness of the proposed estimators andWald-type tests

An important concept in robustness theory is the inﬂuence function (Hampel et al. [1986]).For any estimator deﬁned in terms of a statistical functional U ( F ) from the true distri-bution F , its inﬂuence function (IF) is deﬁned as IF ( t, U , F ) = lim ε ↓ U ( F ε ) − U ( F ) ε = ∂ U ( F ε ) ∂ε (cid:12)(cid:12)(cid:12)(cid:12) ε =0 + , (18)where F ε = (1 − ε ) F + ε ∆ t , with ε being the contamination proportion and ∆ t beingthe degenerate distribution at the contamination point t . Thus, the (ﬁrst-order) IF, as afunction of t , measures the standardized asymptotic bias (in its ﬁrst-order approximation)caused by the inﬁnitesimal contamination at the point t . The maximum of this IF over t indicates the extent of bias due to contamination and so smaller its value, the more robustthe estimator is.The classical theory deals with the case of independent and identically distributed (iid)observations (Lindsay et al. [1994]). However, the present set-up of complex survey is notas simple as the iid set-up; in fact the observations within a cluster of a stratum are iid butthe observations in diﬀerent cluster and stratum are independent non-homogeneous. So weneed to modify the deﬁnition of the inﬂuence function accordingly. Recently, Ghosh andBasu [2013, 2015, 2018] have discussed the extended deﬁnition of the inﬂuence function forthe independent but non-homogeneous observations; however, all these works have beendeveloped for DPD-based estimators. Here, we will extend their deﬁnition for the case ofmultinomial logistic regression and PM φ DEs.We ﬁrst need to deﬁne the statistical functional U φ ( β ) corresponding to the PM φ Esas the minimizer of the phi-divergence between the true and the model densities. This isdeﬁned as the minimizer of H φ ( β ) = 1 τ H (cid:88) h =1 n h (cid:88) i =1 w hi m hi d +1 (cid:88) s =1 π his ( β ) φ (cid:18) g his π his ( β ) (cid:19) . Under appropriate diﬀerentiability conditions as the solution of the estimating equations ∂ H φ ( β ) ∂ β = H (cid:88) h =1 n h (cid:88) i =1 ω hi m hi φ (cid:48)(cid:48) (1) ∂ π Thi ( β ) ∂ β f φ,hi ( g hi , β ) = . In particular, for the Cressie-Read subfamily ∂ H φ λ ( β ) ∂ β = H (cid:88) h =1 n h (cid:88) i =1 ω hi m hi λ + 1 ∂ π Thi ( β ) ∂ β diag − ( λ +1) ( π hi ( β )) g λ +1 hi = . (19)For simplicity, let us ﬁrst assume that the contamination is only in one cluster probability g h i for some ﬁx h and i . Consider the contaminated probability vector g h i ,ε = (1 − ε ) π h i ( β ) + εδ t , ε is the contamination proportion and δ t is the degenerate probability at the outlierpoint t = ( t , . . . , t d +1 ) T ∈ { , } d +1 with (cid:80) d +1 s =1 t s = 1 and g hi = (cid:26) π hi ( β ) if ( i, h ) (cid:54) = ( i , h ); g h i ,ε if ( i, h ) = ( i , h ) . Denote the corresponding contaminated full probability vector as g ε which is the sameas g except g h i being replaced by g h i ,ε and let the corresponding contaminated distri-bution vector be G ε . We replace β in (19) by β h i ε = U φ λ ( G ε ). Then, we have ∂ H φ λ ( β ) ∂ β (cid:12)(cid:12)(cid:12)(cid:12) β = β h i ε (20)= H (cid:88) h =1( i,h ) (cid:54) =( i ,h ) n h (cid:88) i =1 (cid:40) ω hi m hi λ + 1 ∂ π Thi ( β ) ∂ β (cid:12)(cid:12)(cid:12)(cid:12) β = β h i ε diag − ( λ +1) ( π hi ( β h i ε )) π λ +1 hi ( β ) (cid:41) + ω h i m h i λ + 1 ∂ π Thi ( β ) ∂ β (cid:12)(cid:12)(cid:12)(cid:12) β = β h i ε diag − ( λ +1) ( π h i ( β h i ε )) (cid:104) (1 − ε ) π λ +1 h i ( β ) + εδ λ +1 t (cid:105) . Now, we are going to get the derivative of (20) with respect to ε . − ( λ + 1) H (cid:88) h =1( i,h ) (cid:54) =( i ,h ) n h (cid:88) i =1 (cid:40) ω hi m hi λ + 1 ∂ π Thi ( β ) ∂ β (cid:12)(cid:12)(cid:12)(cid:12) β = β h i ε diag − ( λ +2) ( π hi ( β h i ε )) × ∂ π Thi ( β ) ∂ β (cid:12)(cid:12)(cid:12)(cid:12) β = β h i ε π λ +1 hi ( β ) ∂ β h i ε ∂ε (cid:41) + H (cid:88) h =1( i,h ) (cid:54) =( i ,h ) n h (cid:88) i =1 (cid:40) ω hi m hi λ + 1 ∂ π Thi ( β ) ∂ β ∂ β T (cid:12)(cid:12)(cid:12)(cid:12) β = β h i ε diag − ( λ +1) ( π hi ( β h i ε )) π λ +1 hi ( β ) ∂ β h i ε ∂ε (cid:41) + (cid:40) ω h i m h i λ + 1 ∂ π Thi ( β ) ∂ β (cid:12)(cid:12)(cid:12)(cid:12) β = β h i ε diag − ( λ +2) ( π h i ( β h i ε )) ∂ π Thi ( β ) ∂ β (cid:12)(cid:12)(cid:12)(cid:12) β = β h i ε × (cid:104) (1 − ε ) π λ +1 h i ( β ) + εδ λ +1 t (cid:105) ∂ β h i ε ∂ε (cid:41) + (cid:40) ω h i m h i λ + 1 ∂ π Thi ( β ) ∂ β ∂ β T (cid:12)(cid:12)(cid:12)(cid:12) β = β h i ε diag − ( λ +1) ( π h i ( β h i ε )) × (cid:104) (1 − ε ) π λ +1 h i ( β ) + εδ λ +1 t (cid:105) ∂ β h i ε ∂ε (cid:41) + (cid:40) ω h i m h i ∂ π Thi ( β ) ∂ β (cid:12)(cid:12)(cid:12)(cid:12) β = β h i ε diag − ( λ +1) ( π h i ( β h i ε )) (cid:104) − π λ +1 h i ( β ) + δ λ +1 t (cid:105)(cid:41) = . Now, evaluating the previous expression in ε = 0 and simplifying, we have17 F ( t h i , U φ λ ( β ) , F β ) H (cid:88) h =1 n h (cid:88) i =1 (cid:40) ω hi m hi λ + 1 ∂ π Thi ( β ) ∂ β ∂ β T (cid:12)(cid:12)(cid:12)(cid:12) β = β diag − ( λ +1) ( π hi ( β )) π λ +1 hi ( β ) − ω hi m hi ∂ π Thi ( β ) ∂ β (cid:12)(cid:12)(cid:12)(cid:12) β = β diag − ( λ +2) ( π hi ( β )) ∂ π Thi ( β ) ∂ β (cid:12)(cid:12)(cid:12)(cid:12) β = β π λ +1 hi ( β ) (cid:41) + (cid:40) ω h i m h i ∂ π Thi ( β ) ∂ β (cid:12)(cid:12)(cid:12)(cid:12) β = β diag − ( λ +1) ( π h i ( β )) (cid:104) − π λ +1 h i ( β ) + δ λ +1 t (cid:105)(cid:41) = . Proposition A.3 follows straightforward.

Theorem A.1

Let us consider the multinomial logistic regression model under complexdesign given in (1). The IF of the PM φ Es with respect to the i cluster in the h stratumis given by IF ( t h i , U φ λ ( β ) , F β ) = Ψ − n,λ ( β ) u ∗ φ λ ,h i ( β ) , (21) where u ∗ φ λ ,h i ( β ) = (cid:34) ω h i m h i ∂ π Thi ( β ) ∂ β (cid:12)(cid:12)(cid:12)(cid:12) β = β diag − ( λ +1) ( π h i ( β )) δ λ +1 t (cid:35) (22) and Ψ n,φ λ ( β ) = − H (cid:88) h =1 n h (cid:88) i =1 (cid:40) ω hi m hi λ + 1 ∂ π Thi ( β ) ∂ β ∂ β T (cid:12)(cid:12)(cid:12)(cid:12) β = β Td +1 (cid:41) + H (cid:88) h =1 n h (cid:88) i =1 (cid:40) ω hi m hi ∂ π Thi ( β ) ∂ β (cid:12)(cid:12)(cid:12)(cid:12) β = β diag − ( λ +2) ( π hi ( β )) ∂ π Thi ( β ) ∂ β (cid:12)(cid:12)(cid:12)(cid:12) β = β π λ +1 hi ( β ) (cid:41) . Remark A.2

Let us consider the right part of the IF (21). Equation (22) can be expressedas u ∗ φ λ ,h i ( β ) = ω h i m h i ∆ ∗ ( π h i ( β )) diag − ( λ +1) ( π h i ( β )) δ λ +1 t ⊗ x h i and, in the particular case λ = 0 (PMLE) u ∗ h i ( β ) = ω h i m h i ( π ∗ h i ( β ) − δ ∗ t ) ⊗ x h i . (23) While the inﬂuence of vertical outliers on the PM φ Es is bounded as t changes its indicativecategory only; by the assumed form of the model probability, the IF is bounded for all − < λ < but unbounded at λ = 0 against “bad” leverage points. Eﬀectively, when x h i increases in (23), the residual ( π ∗ h i ( β ) − δ ∗ t ) will typically tend much faster to zero than x h i to inﬁnity, resulting in a small inﬂuence. But a “bad” leverage point associated to amisclassiﬁed observation will lead to an inﬁnite value of the IF (see Croux and Haesbroeck[2003]). U β ( G ) is given by W n ( U β ( G )) = n (cid:0) M T U β ( G ) − m (cid:1) T (cid:0) M T V n ( U β ( G )) M (cid:1) − (cid:0) M T U β ( G ) − m (cid:1) . The IF of general Wald-type tests under such non-homogeneous set-up has been ex-tensively studied in Basu et al. [2018], for the case of DPD estimators. Here, we canfollow that the ﬁrst-order IF of W n , deﬁned as the ﬁrst order derivative of its value at thecontaminated distribution with respect to ε at ε = 0, become null at the null distribution.Therefore, the ﬁrst order IF is not informative in this case of Wald-type tests, and we needto investigate the second-order IF, let say IF (2) , of W n . Through some computations weobtain the following proposition. Theorem A.3

Let us consider the multinomial logistic regression model under complexdesign given in (1). The second-order IF of the functional associated with the Wald-typetests with respect to the i cluster in the h stratum is given by IF (2) ( t h i , W n , F β )= 2 IF T ( t h i , W n , F β ) M (cid:0) M T V n ( U β ( G )) M (cid:1) − M T IF ( t h i , W n , F β ) , (24) where IF ( t h i , W n , F β ) is the IF of the PM φ Es given in (21).

Remark A.4

Note that, the second-order IF of the proposed Wald-type tests is a quadraticfunction of the corresponding IF of the PM φ Es. Therefore, the boundedness of the inﬂuencefunctions of PM φ E at − < λ < also indicates the boundedness of the IFs of the Wald-type tests functional W n . B Proof of Results

B.1 Proof of Theorem 3.2

Proof.

We have by (10) √ n ( (cid:98) β φ,P − β ) L −→ n →∞ N (cid:0) d +1 , V (cid:0) β (cid:1)(cid:1) , with V (cid:0) β (cid:1) = J − ( β ) G ( β ) J − ( β ). Therefore, √ n ( M T (cid:98) β φ,P − m ) L −→ n →∞ N (cid:0) d +1 , M T V (cid:0) β (cid:1) M (cid:1) . As rank( M ) = r , we have that n ( M T (cid:98) β φ,P − m ) T (cid:0) M T V (cid:0) β (cid:1) M (cid:1) − ( M T (cid:98) β φ,P − m )converge in law to a chi-square distribution with r degrees of freedom. But V n (cid:16)(cid:98) β φ,P (cid:17) isa consistent estimator of V (cid:0) β (cid:1) . Therefore, we have that under H , W n ( (cid:98) β φ,P ) deﬁnedin (13), converges in law to a chi-square distribution with r degrees of freedom.19 .2 Proof of Theorem 3.4 Proof.

A ﬁrst order Taylor expansion of (cid:96) ∗ ( (cid:98) β φ,P , β ) at (cid:98) β φ,P around β gives (cid:96) ∗ ( (cid:98) β φ,P , β ) − (cid:96) ∗ (cid:0) β , β (cid:1) = ∂(cid:96) ∗ (cid:0) β , β (cid:1) ∂ β T (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) β = β ( (cid:98) β φ,P − β ) + o p (cid:16)(cid:13)(cid:13)(cid:13)(cid:98) β φ,P − β (cid:13)(cid:13)(cid:13)(cid:17) . The asymptotic distribution of √ n (cid:16) (cid:96) ∗ ( (cid:98) β φ,P , β ) − (cid:96) ∗ (cid:0) β , β (cid:1)(cid:17) coincides with the asymp-totic distribution of √ n  ∂(cid:96) ∗ (cid:0) β , β (cid:1) ∂ β T (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) β = β ( (cid:98) β φ,P − β )  , but ∂(cid:96) ∗ (cid:0) β , β (cid:1) ∂ β T (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) β = β = 2 (cid:0) M T β − m (cid:1) T (cid:0) M T V (cid:0) β (cid:1) M (cid:1) − M T . Now the result follows.

B.3 Proof of Theorem 3.5

Proof. Π W n ( (cid:98) β φ,P ) (cid:0) β (cid:1) (cid:39) P β ( W n ( (cid:98) β φ,P ) > χ r,α ) = P β ( n(cid:96) ∗ ( (cid:98) β φ,P , (cid:98) β φ,P ) > χ r,α )= P β ( √ n(cid:96) ∗ ( (cid:98) β φ,P , (cid:98) β φ,P ) > χ r,α √ n )= P β (cid:32) √ n (cid:16) (cid:96) ∗ ( (cid:98) β φ,P , β ) − (cid:96) ∗ (cid:0) β , β (cid:1)(cid:17) > χ r,α √ n − √ n(cid:96) ∗ (cid:0) β , β (cid:1)(cid:33) = 1 − P β  √ n (cid:16) (cid:96) ∗ ( (cid:98) β φ,P , β ) − (cid:96) ∗ (cid:0) β , β (cid:1)(cid:17) σ W (cid:0) β (cid:1) ≤ σ W (cid:0) β (cid:1) (cid:32) χ r,α √ n − √ n(cid:96) ∗ (cid:0) β , β (cid:1)(cid:33) = 1 − Φ n (cid:32) √ nσ W ( β ∗ ) (cid:32) χ r,α n − (cid:96) ∗ (cid:0) β , β (cid:1)(cid:33)(cid:33) , where Φ n ( x ) uniformly tends to the standard normal distribution φ ( x ) as n → ∞ . B.4 Proof of Theorem 3.8

Proof.

It is easy to see that M T (cid:98) β φ,P − m = n − / M T d + M T ( (cid:98) β φ,P − β n ) . We know, under H ,n , that √ n ( (cid:98) β φ,P − β n ) L −→ n →∞ N ( d +1 , V ( β n )) and β n −→ n →∞ β .Therefore, as M T β = m , √ n ( M T (cid:98) β φ,P − m ) L −→ n →∞ N (cid:0) M T d , M T V ( β ) M (cid:1) . But it is known that if Z ∈ N ( µ, Σ ), Σ is a symmetric projection of rank k and Σ µ = µ , then Z T Z is a chi-square distribution with k degrees of freedom and non-centralityparameter µ T µ . So considering the quadratic form20 n ( (cid:98) β φ,P ) = Z T Z , with Z = √ n (cid:20) M T V (cid:16)(cid:98) β φ,P (cid:17) − M (cid:21) − / ( M T (cid:98) β φ,P − m ) and Z L −→ n →∞ N (cid:16)(cid:2) M T V ( β ) M (cid:3) − / M T d , I r × r (cid:17) , the application is immediate with the non-centrality parameter being d T M (cid:2) M T V ( β ) M (cid:3) − M T d . The point (b) is straightforward taking into account that the equivalence between thehypotheses (16) and (17) is given by M T d = δ . C Some extensions of the Simulation Study

We extend the simulation study presented in Section 4 to other scenarios. In particular, weﬁrst study the same scheme as in Section 4 but considering two diﬀerent overdispersed dis-tributions for the response variable: the Random-Clumped and the Dirichlet Multinomialdistributions. Results are presented in Figure 7 and Figure 8.Same conclusions as in Section 4 are obtained, illustrating again the robustness ofproposed estimators and Wald-type tests against classical PMLE.

C.1 Algorithms for m-Inﬂated, Random Clumped and Dirichlet multi-nomial distributions in the context of PLR models with complexdesign

We present the algorithms that are needed in order to compute the Random-Clumped(Morel and Nagaraj [1993]), Dirichlet-multinomial (Mosimann [1962]) and m-Inﬂated (Co-hen [1976]) multinomial distributions in the context of PLR models with complex design.We consider, without loss of generality, that the intra-cluster correlation parameter isequal in all the clusters and strata ( ρ hi ≡ ρ , h = 1 , . . . , H , i = 1 , . . . , n h ). m-Inﬂated distribution Goal:

Generation of response variable with m-Inﬂated distribution of parameters ρ and π ( β ), ina scenario with H strata and n h clusters in the stratum h , h = 1 , . . . , H . for h = 1 , . . . , H do for i = 1 , . . . , n h do k ← Ber ( ρ ) if k = 0 then y hi ← M ( m hi , π hi ( β )) else y hi ← m hi × M (1 , π hi ( β )) end if end for end for return y = ( y , . . . , y Hn H ) T andom-Clumped distribution Goal:

Generation of response variable with Random Clumped distribution of parameters ρ and π ( β ), in a scenario with H strata and n h clusters in the stratum h , h = 1 , . . . , H . for h = 1 , . . . , H do for i = 1 , . . . , n h do y (0) ← M (1 , π hi ( β )) k ← Bin ( m hi , ρ ) y (1) ← M ( m hi − k , π hi ( β )) y hi ← y (0) × k + y (1) end for end for return y = ( y , . . . , y Hn H ) T Dirichlet-Multinomial distribution

Goal:

Generation of response variable with Dirichlet Multinomial distribution of parameters ρ and π ( β ), in a scenario with H strata and n h clusters in the stratum h , h = 1 , . . . , H . for h = 1 , . . . , H do for i = 1 , . . . , n h do α ← − ρ ρ π hi ( β ) α ← − ρ ρ (1 − π hi ( β )) y hi ← Bin ( m hi , Beta ( α , α )) for r = 1 , . . . , d do α ← − ρ ρ π hir ( β ) α ← − ρ ρ (1 − (cid:80) rl =1 π hil ( β )) y hir ← Bin ( m hi − (cid:80) r − l =1 y hil , Beta ( α , α )) end for y hi,d +1 ← m hi − (cid:80) dr =1 y hir y hi ← ( y hi, , . . . , y hi,d +1 ) end for end for return y = ( y , . . . , y Hn H ) T l l l l l l l l l n h R M SE l −0.5 −0.3 0 0.66 l l l l l l l l l l n h R M SE l −0.5 −0.3 0 0.66 l l l l l l l l l l n h l e v e l l −0.5 −0.3 0 0.66 l l l l l l l l l l n h l e v e l l −0.5 −0.3 0 0.66 l l l l l l l l l l n h po w e r l −0.5 −0.3 0 0.66 l l l l l l l l l l n h po w e r l −0.5 −0.3 0 0.66 Figure 7: RMSEs (top), emprirical levels (middle) and empirical powers (bottom). Non-contaminated and contaminated settings (left and right, respectively). Random Clumpeddistribution 23 l l l l l l l l l n h R M SE l −0.5 −0.3 0 0.66 l l l l l l l l l l n h R M SE l −0.5 −0.3 0 0.66 l l l l l l l l l l n h l e v e l l −0.5 −0.3 0 0.66 l l l l l l l l l l n h l e v e l l −0.5 −0.3 0 0.66 l l l l l l l l l l n h po w e r l −0.5 −0.3 0 0.66 l l l l l l l l l l n h po w e r l −0.5 −0.3 0 0.66 Figure 8: RMSEs (top), emprirical levels (middle) and empirical powers (bottom). Non-contaminated and contaminated settings (left and right, respectively). Dirichlet Multino-mial distribution. 24 eferences

J. Alonso-Revenga, N. Mart´ın, and L. Pardo. New improved estimators for overdispersionin models with clustered multinomial data and unequal cluster sizes.

Statistics andComputing , (27):193–217, 2017.S. Basak, A. Basu, and M. C. Jones. On the ‘optimal’ density power divergence tuningparameter.

Journal of Applied Statistics , 0(0):1–21, 2020. doi: 10.1080/02664763.2020.1736524.A. Basu, A. Ghosh, A. Mandal, N. Martin, and L. Pardo. A Wald-type test statistic fortesting linear hypothesis in logistic regression models based on minimum density powerdivergence estimator.

Electronic Journal of Statistics , 11(2):2741–2772, 2017.A. Basu, A. Ghosh, A. Mandal, N. Martin, and L. Pardo. Robust Wald-type tests for non-homogeneous observations based on the minimum density power divergence estimator.

Metrika , 81(5):493–522, 2018.L. C. Bertens, K. G. Moons, F. H. Rutten, Y. van Mourik, A. W. Hoes, and J. B. Re-itsma. A nomogram was developed to enhance the use of multinomial logistic regressionmodeling in diagnostic research.

Journal of Clinical Epidemiology , 71:51–57, 2016.D. A. Binder. On the variances of asymptotically normal estimators from complex sur-veys.

International Statistical Review/Revue Internationale de Statistique , pages 279–292, 1983.L. Blizzard and D. Hosmer. The log multinomial regression model for nominal outcomeswith more than two attributes.

Biometrical Journal , 49(6):889–902, 2007.S. B. Bull, J. P. Lewinger, and S. S. Lee. Conﬁdence intervals for multinomial logisticregression in sparse data.

Statistics in Medicine , 26(4):903–918, 2007.E. Castilla, A. Ghosh, N. Mart´ın, and L. Pardo. New statistical robust procedures forpolytomous logistic models.

Biometrics , 74(4):1282–1291, 2018a.E. Castilla, N. Mart´ın, and L. Pardo. Pseudo minimum phi-divergence estimator forthe multinomial logistic regression model with complex sample design.

Advances inStatistical Analysis , 102(3):381–411, 2018b.E. Castilla, N. Mart´ın, and L. Pardo.

A Logistic Regression Analysis Approach for SampleSurvey Data Based on Phi-Divergence Measures , pages 465–474. Springer InternationalPublishing, Cham, 2018c. ISBN 978-3-319-73848-2.E. Castilla, A. Ghosh, N. Mart´ın, and L. Pardo. Robust semiparametric inference forpolytomous logistic regression with complex survey design.

Advances in Data Analysisand Classiﬁcation , 2020. DOI: 10.1007/s11634-020-00430-7.J. E. Cohen. The distribution of the chi-squared statistic under clustered sampling fromcontingency tables.

Journal of the American Statistical Association , 71(355):665–670,1976.C. Croux and G. Haesbroeck. Implementing the Bianco and Yohai estimator for logisticregression.

Computational Statistics and Data Analysis , 44(1-2):273–295, 2003.25. J. Daniels and C. Gatsonis. Hierarchical polytomous regression models with applica-tions to health services research.

Statistics in Medicine , 16(20):2311–2325, 1997.E. Dreassi. Polytomous disease mapping to detect uncommon risk factors for relateddiseases.

Biometrical Journal , 49(4):520–529, 2007.A. Ghosh and A. Basu. Robust estimation for independent non-homogeneous observationsusing density power divergence with applications to linear regression.

Electronic Journalof Statistics , 7:2420–2456, 2013.A. Ghosh and A. Basu. Robust estimation for non-homogeneous data and the selectionof the optimal tuning parameter: the density power divergence approach.

Journal ofApplied Statistics , 42(9):2056–2072, 2015.A. Ghosh and A. Basu. Robust bounded inﬂuence tests for independent non-homogeneousobservations.

Statistica Sinica , 28(3):1133–1155, 2018.A. Gupta, T. Nguyen, and L. Pardo. Residuals for polytomous logistic regression modelsbased on ϕ -divergences test statistics. Statistics , 42(6):495–514, 2008.F. R. Hampel, E. M. Ronchetti, P. J. Rousseeuw, and W. A. Stahel.

Robust statistics: theapproach based on inﬂuence functions . John Wiley & Sons, 1986.D. W. Hosmer and S. Lemeshow. Applied logistic regression. John Wiley & Sons.

NewYork , 2000.W. Johnson. Inﬂuence measures for logistic regression: Another point of view.

Biometrika ,72(1):59–65, 1985.Y. Ke, B. Fu, and W. Zhang. Semi-varying coeﬃcient multinomial logistic regression fordisease progression risk prediction.

Statistics in Medicine , 35(26):4764–4778, 2016.B. G. Lindsay et al. Eﬃciency versus robustness: the case for minimum hellinger distanceand related methods.

The Annals of Statistics , 22(2):1081–1114, 1994.N. Mart´ın. Using cook’s distance in polytomous logistic regression.

British Journal ofMathematical and Statistical Psychology , 68(1):84–115, 2015.N. Mart´ın and L. Pardo. New inﬂuence measures in polytomous logistic regression modelsbased on phi-divergence measures.

Communications in Statistics-Theory and Methods ,43(10-12):2311–2321, 2014.J. G. Morel. Logistic regression under complex survey designs.

Survey Methodology , 15(2):203–223, 1989.J. G. Morel and N. K. Nagaraj. A ﬁnite mixture distribution for modelling multinomialextra variation.

Biometrika , 80(2):363–371, 1993.J. G. Morel and N. K. Neerchal.

Overdispersion models in SAS . SAS Publishing, 2012.J. E. Mosimann. On the compound multinomial distributions, the multivariate b-distribution and correlation among proportions.

Biometrika , 49:65–82, 1962.N. S. Oﬃce and I. Macro. Malawi demographic and health survey.

Zomba, Malawi, andCalverton, Maryland, USA: NSO and ICF Macro , 2010.26. Pardo.

Statistical inference based on divergence measures . CRC press, 2005.G. Roberts, N. Rao, and S. Kumar. Logistic regression analysis of sample survey data.

Biometrika , 74(1):1–12, 1987.J. Warwick and M. Jones. Choosing a robustness tuning parameter.