Robust estimators for generalized linear models with a dispersion parameter
Michael Amiguet (1), Alfio Marazzi (1), Marina Valdora (2) and Victor Yohai (2)
(1) Université de Lausanne, (2) Universidad de Buenos Aires
March 29, 2017
Abstract
Highly robust and efficient estimators for the generalized linear model with a dispersion parameter are proposed. The estimators are based on three steps. In the first step, the maximum rank correlation estimator is used to consistently estimate the slopes up to a scale factor. In the second step, the scale factor, the intercept, and the dispersion parameter are consistently estimated using an MT-estimator of a simple regression model. The combined estimator is highly robust but inefficient. Then, randomized quantile residuals based on the initial estimators are used to detect outliers to be rejected and to define a set S of observations to be retained. Finally, a conditional maximum likelihood (CML) estimator given the observations in S is computed. We show that, under the model, S tends to the complete sample for increasing sample size. Therefore, the CML estimator tends to the unconditional maximum likelihood estimator. It is therefore highly efficient, while maintaining the high degree of robustness of the initial estimator. The case of the negative binomial regression model is studied in detail.

1 Introduction
In recent years, several extensions of the generalized linear models (GLM; Nelder and Wedderburn, 1972) have been proposed to increase flexibility in modelling complex data structures. We consider the case where the response distribution does not necessarily belong to the exponential family and where a dispersion parameter is present. For this case, we propose highly efficient and highly robust estimators. We focus on the Negative Binomial (NB) regression model, but we also consider the Beta regression model as an example with continuous response. NB regression (see Hilbe, 2008) extends Poisson regression for modeling count data in the presence of overdispersion. Beta regression (Ferrari and Cribari-Neto, 2004) is a tool for modelling continuous responses which are restricted to the interval (0, 1).

2 The model

Let F_{µ,α}(y) denote a general family of discrete or continuous distribution functions, where µ is the mean and α is a dispersion parameter, and let f_{µ,α}(y) denote the corresponding probability (density) function. We will focus on two specific examples of families, one discrete and one continuous:

- the NB family:

  f_{µ,α}(y) = [Γ(y + 1/α) / (Γ(1/α) Γ(y + 1))] (αµ + 1)^{−1/α} (αµ / (αµ + 1))^y,  y = 0, 1, 2, …,  α ≥ 0, µ ≥ 0;  (1)

- the Beta family:

  f_{µ,α}(y) = [Γ(1/α) / (Γ(µ/α) Γ((1 − µ)/α))] y^{µ/α − 1} (1 − y)^{(1−µ)/α − 1},  0 < y < 1,  α ≥ 0, 0 < µ < 1.  (2)

In both cases, the parametrization has been chosen so that the expected value is µ. In the NB case, the variance is µ + αµ²; in the Beta case, the variance is µ(1 − µ)/(1 + 1/α). In both cases, fixing µ, the variance increases with α.

We will need the following assumption on F_{µ,α}(y), which is satisfied in our examples:

Assumption A: For any α, if Y₁ ∼ F_{µ₁,α}, Y₂ ∼ F_{µ₂,α}, and µ₁ > µ₂, then Y₁ ≻ Y₂, where "≻" means "stochastically larger".

Suppose that a response Y and a vector X = (X₁, …, X_p)ᵀ of covariates are observed. We consider the following class of regression models

  Y | X = x ∼ F_{h(xᵀβ₀), α₀},  (3)

where h is a strictly increasing known link function and β₀ = (β₀₁, …, β₀p)ᵀ is a vector of coefficients. We assume that X₁ is constantly equal to one, that is, β₀₁ is an intercept. We will use the notations xᵀ = (1, x*ᵀ), β₀ᵀ = (β₀₁, β₀*ᵀ), γ₀ = β₀*/‖β₀*‖, and θ = (β, α).

We assume that a random sample (x₁, y₁), …, (x_n, y_n) is available. The ML estimator of θ₀ = (β₀, α₀) maximizes the log-likelihood of the sample, given by

  L(θ) = Σ_{i=1}^n ln f_{h(x_iᵀβ), α}(y_i).

The ML estimator is very efficient but not robust. We want to obtain highly robust and efficient estimators of β₀ and α₀.

3 Estimation procedure
The proposed procedure starts with the computation of a very robust but not necessarily efficient initial estimator, which provides the tool for outlier identification. Then a conditional ML approach is used, where the outliers are removed, which provides a fully efficient estimator.

Most familiar highly robust estimators of regression, such as LMS, LTS, and S estimators (see, e.g., Maronna et al., 2006), are based on the minimization of a robust measure of the residual scale, such as an M scale (Huber, 1980). These estimators have been used as initial estimators of well-known highly robust and efficient procedures, such as the MM (Yohai, 1987) and TML (Marazzi and Yohai, 2004) estimators. However, for the regression models we are considering here, a different approach has to be used, because the residual distribution may depend on the covariates and residual measures of scale are not available in this case. We therefore propose an approach based on the maximum rank correlation (MRC) estimator introduced by Han (1987a, 1987b). However, the MRC estimator identifies the scaled slopes γ₀ = β₀*/‖β₀*‖, but it does not identify the intercept β₀₁, the dispersion parameter α₀, or the scale factor η₀ = ‖β₀*‖. So we need to estimate these three parameters separately. The complete proposal can then be summarized as follows:

Step 1
Compute the MRC estimator γ̃ of γ₀. In addition, compute robust and consistent estimators β̃₁, α̃, and η̃ of β₀₁, α₀, and η₀. Then, initial estimators of β₀ and α₀ are given by β̃ = (β̃₁, η̃γ̃ᵀ)ᵀ and α̃, respectively.

Step 2
Compute randomized quantile residuals z_i (Dunn and Smyth, 1996) based on the initial model and use them to define cutoff values ã and b̃, so that influential outliers are defined as the observations with z_i ∉ [ã, b̃].

Step 3
Compute a conditional ML estimator of θ₀ given z_i ∈ [ã, b̃].

In the following subsections, we provide a detailed description of each single step.

3.1 The initial estimator

For a given coefficient vector γ = (γ₁, …, γ_p)ᵀ, Kendall's τ correlation coefficient between the responses y_i and the linear combinations γᵀx_i* is given by

  τ(γ) = [1 / (n(n − 1))] Σ_{i ≠ j} I[(γᵀx_j* − γᵀx_i*)(y_j − y_i) ≥ 0]

and the maximum rank correlation (MRC) estimator of γ₀ is defined by

  γ̃ = arg max_{‖γ‖ = 1} τ(γ).  (4)

The robustness of Kendall's τ correlation coefficient has been studied by Alfons et al. (2016). Under Assumption A, the MRC estimator converges strongly to γ₀ for any strictly increasing h (Han, 1987a); it is also root-n consistent and asymptotically normal (Sherman, 1993).

To compute the MRC estimator one can utilize a subsampling procedure. Note that a simple evaluation of the objective function requires O(n²) calculations, but an algorithm using O(n log n) calculations has been proposed by Abrevaya (1999). However, in the Monte Carlo experiments described in Section 4, we used the very fast function maxCorGrid of the R package ccaPP (Alfons, 2015), based on an alternate grid algorithm described in Alfons et al. (2016).

We now turn to the estimation of β₀₁, α₀, and η₀, necessary to complete the initial estimator. We observe that h(xᵀβ₀) = h(β₀₁ + η₀γ₀ᵀx*). Since γ̃ is close to γ₀, we approximate γ₀ᵀx_i* by v_i = γ̃ᵀx_i* and consider the simple regression model with just one covariate:

  Y | v ∼ F_{h(β₀₁ + η₀v), α₀}.  (5)

For this model and a given value α of the unknown α₀, we have many highly robust estimators β̃₁*(α), η̃*(α) of β₀₁ and η₀. Examples are: the conditionally unbiased bounded influence estimator of Künsch et al. (1989), the RQL estimator of Cantoni and Ronchetti (2001), and the weighted MT estimators of Valdora and Yohai (2014).
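To make the rank correlation objective (4) concrete, here is a small sketch in Python (the paper's computations use the R function maxCorGrid from ccaPP; the brute-force angular grid below for p = 2 slopes, and all function names, are ours and purely illustrative):

```python
import numpy as np

def tau_objective(gamma, X, y):
    """Rank correlation objective of the MRC estimator: the fraction of
    ordered pairs (i, j), i != j, on which gamma^T x and y vary concordantly."""
    u = X @ gamma
    du = u[None, :] - u[:, None]          # u_j - u_i over all pairs
    dy = y[None, :] - y[:, None]          # y_j - y_i over all pairs
    off = ~np.eye(len(y), dtype=bool)     # exclude i == j
    return np.mean((du * dy >= 0)[off])

def mrc_naive(X, y, n_angles=360):
    """Maximize tau over unit vectors in R^2 by a naive angular grid.
    Real implementations (Abrevaya, 1999; Alfons et al., 2016) are far faster."""
    best, best_val = None, -np.inf
    for ang in np.linspace(0.0, 2.0 * np.pi, n_angles, endpoint=False):
        g = np.array([np.cos(ang), np.sin(ang)])
        val = tau_objective(g, X, y)
        if val > best_val:
            best, best_val = g, val
    return best
```

With noiseless data y that is strictly increasing in γ₀ᵀx*, the recovered direction is close to γ₀; the sign is identified because reversing γ reverses the concordances.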
Finally, to estimate α₀, we consider a bounded function ψ(y, µ, α) such that, for all µ,

  E_{µ,α}[ψ(y, µ, α)] = 0.  (6)

Then, for any fixed µ, if y₁, …, y_n is a random sample of NB(µ, α), the M estimator of α satisfying the equation

  Σ_{i=1}^n ψ(y_i, µ, α) = 0

is Fisher-consistent for α. Then, an initial consistent estimator α̃ of α₀ is obtained by solving

  Σ_i ψ(y_i, h(β̃₁*(α) + η̃*(α)v_i), α) = 0.  (7)

The Fisher consistency of this estimator is immediate. In fact, asymptotically, h(β̃₁*(α) + η̃*(α)v_i) = E(y_i), and then by (6), E(ψ(y, µ_i, α)) = 0.

Once α̃ is computed, we define the initial estimators of β₀₁ and η₀ by β̃₁ = β̃₁*(α̃), η̃ = η̃*(α̃). In this way we obtain the initial estimators β̃ = (β̃₁, η̃γ̃ᵀ)ᵀ of β₀ and α̃ of α₀. We will assume that:

Assumption B: n^{1/2}(β̃ − β₀) = O_p(1) and n^{1/2}(α̃ − α₀) = O_p(1).

In the simulations of Section 4 and the examples in Section 5, we use a weighted MT estimator for β̃₁*(α), η̃*(α) (see Appendix 8) and the score function ψ of the optimal bounded influence estimator according to Hampel (1972) described in Marazzi and Yohai (2010). It can be proved that, under general conditions, the resulting initial estimators β̃ and α̃ satisfy Assumption B.

3.2 Adaptive cutoff values and outlier detection

We now assume that some preliminary estimator θ̃ = (β̃, α̃) of θ₀ is available, for example the estimator defined in the previous section. Since the residual distribution depends on the covariates, residuals cannot be used in the usual way for the purpose of highlighting outliers. Instead, we use the randomized quantile residuals (RQR) that were proposed in Dunn and Smyth (1996) for exploratory purposes.
Let µ̃_x = h(xᵀβ̃). Then the RQRs are defined by

  z_i = F_{µ̃_{x_i}, α̃}(y_i)

in the continuous case and by

  z_i = F_{µ̃_{x_i}, α̃}(y_i) − u_i f_{µ̃_{x_i}, α̃}(y_i)

in the discrete case, where {u₁, …, u_n} is a sample from a uniform distribution U[0, 1], independent of (x₁, y₁), …, (x_n, y_n).

If θ̃ = θ₀, then {z₁, …, z_n} is a sample from U[0, 1]. Therefore, a fixed lower cutoff value a and a fixed upper cutoff value b for the RQRs are simply given by a low, respectively a large, quantile of U[0, 1] – e.g., a = 0.05 and b = 0.95 – and observations such that z_i ∉ [a, b] may be identified as outliers. However, we propose the use of "adaptive" cutoff values ã and b̃ that, under the assumed model, tend to 0 and 1 respectively, when β̃ and α̃ are consistent estimators. Therefore, under the model, i.e., in the absence of outliers, the fraction of observations that are erroneously identified as outliers tends to 0 when the sample size n → ∞.

To define the adaptive cutoff values, we follow a procedure similar to the ones described in Marazzi and Yohai (2004, Section 3.2) and in Gervini and Yohai (2002). Let F_n denote the empirical cdf of z₁, …, z_n, and let F^R_{n,t} and F^L_{n,t} be the right- and left-truncated versions of F_n at a given t, i.e.,

  F^R_{n,t}(z) = F_n(z)/F_n(t)  if z ≤ t,
  F^L_{n,t}(z) = (F_n(z) − F_n(t))/(1 − F_n(t))  if z ≥ t.

We compare these truncated versions of F_n with the U[0, 1] cdf and look for the largest t such that F^R_{n,t}(z) ≥ z for all z ≥ ζ, where ζ is a value close to one. More precisely, we define an upper cutoff value as

  b̃ = sup{ t : inf_{z ≥ ζ} (F^R_{n,t}(z) − z) ≥ 0 }.

In a similar way, we define a lower cutoff value as

  ã = inf{ t : sup_{z ≤ ζ} (F^L_{n,t}(z) − z) ≤ 0 },

where ζ is close to zero.

We assume that:

Assumption C: The density f_{µ,α}(y) has bounded derivatives with respect to µ and α.

Then, we have the following theorem, proved in Appendix 9.

Theorem 1
Assume B and C. Then n^{1/2} ã = O_p(1) and n^{1/2}(b̃ − 1) = O_p(1).

Usually, a quite high value of ζ is chosen. Our usual choice is ζ = 0.95; however, in the presence of a large proportion of high outliers, it may be convenient to use a lower value, e.g., ζ = 0.90. Similar considerations apply to the choice of the lower cutoff ã: we usually set ζ = 0.05, but ζ = 0.10 would allow removing a larger fraction of small observations, such as "excess zeros" in the NB case. (In fact, a very small ζ could fail to identify many "excess zeros", because each one of them corresponds to several distinct z_i's and may not emerge as an extremely small value.)

3.3 Final estimator

In the final step, we improve the efficiency of the initial estimator using a conditional ML approach. Suppose first that fixed cutoff values a and b are given and the RQRs are computed. Let p_{β,α}(y | x, Z ∈ [a, b]) denote the conditional density of Y given X = x and Z ∈ [a, b], where Z ∼ U[0,
1] represents the RQR. Then, the conditional density is of the form

  p_{β,α}(y | x, Z ∈ [a, b]) = f_{h(xᵀβ), α}(y) W_{a,b}(x, β, α).  (8)

In the continuous case we have

  W_{a,b}(x, β, α) = I(F^{−1}_{µ_x,α}(a) ≤ y ≤ F^{−1}_{µ_x,α}(b)) / (b − a).

In the discrete case, the following expression (9) for W_{a,b}(x, β, α) is derived in Appendix 7. Let, for any c,

  y*_x(c) = max{ y : F_{µ_x,α}(y) ≤ c }  and  t_{c,x} = (F_{µ_x,α}(y*_x(c) + 1) − c) / f_{µ_x,α}(y*_x(c) + 1).

Put T_{a,x} = y*_x(a) + 2, T_{b,x} = y*_x(b), A_x = { y : T_{a,x} ≤ y ≤ T_{b,x} }, and

  Q(x, β, α) = F_{µ_x,α}(T_{b,x}) − F_{µ_x,α}(T_{a,x} − 1) + f_{µ_x,α}(T_{a,x} − 1) t_{a,x} + f_{µ_x,α}(T_{b,x} + 1)(1 − t_{b,x}).

Then

  W_{a,b}(x, β, α) = 1/Q(x, β, α) if y ∈ A_x,
                   = t_{a,x}/Q(x, β, α) if y = T_{a,x} − 1,
                   = (1 − t_{b,x})/Q(x, β, α) if y = T_{b,x} + 1.  (9)

Now suppose that ã and b̃ are the adaptive cutoff values defined above, and consider the adaptive conditional likelihood function

  L_CML(θ) = Σ_{i=1}^n I(ã ≤ z_i ≤ b̃) ln p_{β,α}(y_i | x_i, z_i ∈ [ã, b̃]).

The conditional maximum likelihood (CML) estimator θ̂_CML = (β̂_CML, α̂_CML) is defined by

  θ̂_CML = arg max_θ L_CML(θ).

In the discrete case, a slight modification of this definition is convenient. We note that (see Appendix 7)

  {a ≤ z_i ≤ b} = {T_{a,x_i} ≤ y_i ≤ T_{b,x_i}} ∪ {y_i = T*_{l,x_i}, u_i ≤ t_{a,x_i}} ∪ {y_i = T*_{u,x_i}, u_i ≥ t_{b,x_i}},

where T*_{l,x} = T_{a,x} − 1 and T*_{u,x} = T_{b,x} + 1. Then,

  L_CML(θ) = Σ_{T_{ã,x_i} ≤ y_i ≤ T_{b̃,x_i}} ln p_{β,α}(y_i | x_i, z_i ∈ [ã, b̃])
           + Σ_{y_i = T*_{l,x_i}} I(u_i ≤ t_{ã,x_i}) ln p_{β,α}(y_i | x_i, z_i ∈ [ã, b̃])
           + Σ_{y_i = T*_{u,x_i}} I(u_i ≥ t_{b̃,x_i}) ln p_{β,α}(y_i | x_i, z_i ∈ [ã, b̃]).

Since the u_i's are non-informative, we replace I(u_i ≤ t_{ã,x_i}) and I(u_i ≥ t_{b̃,x_i}) by their expected values, and define

  L_MCML(θ) = Σ_{T_{ã,x_i} ≤ y_i ≤ T_{b̃,x_i}} ln p_{β,α}(y_i | x_i, z_i ∈ [ã, b̃])
            + Σ_{y_i = T*_{l,x_i}} t_{ã,x_i} ln p_{β,α}(y_i | x_i, z_i ∈ [ã, b̃])
            + Σ_{y_i = T*_{u,x_i}} (1 − t_{b̃,x_i}) ln p_{β,α}(y_i | x_i, z_i ∈ [ã, b̃]).

We then define the modified CML (MCML) estimator θ̂_MCML = (β̂_MCML, α̂_MCML) by

  θ̂_MCML = arg max_θ L_MCML(θ).

From (9) and Theorem 1, it is easy to show that

  n^{1/2}(W_{ã,b̃}(x, β, α) − 1) = O_p(1)  (10)

and therefore

  n^{1/2}(p_{β,α}(y | x, Z ∈ [ã, b̃]) − f_{h(xᵀβ), α}(y)) = O_p(1).  (11)

Then, according to (11), both L_CML(θ) and L_MCML(θ) tend, under the model, to the unconditional likelihood function at rate n^{−1/2}. For this reason, we conjecture that both the CML and the MCML estimators have the same asymptotic distribution as the unconditional ML estimator, that is,

  n^{1/2}(θ̂_CML − θ₀) →_D N_p(0, I^{−1}(θ₀))  and  n^{1/2}(θ̂_MCML − θ₀) →_D N_p(0, I^{−1}(θ₀)),

where →_D denotes convergence in distribution, N_p(µ, Σ) the p-variate normal distribution with mean µ and covariance matrix Σ, and I(θ) the information matrix. This implies that θ̂_CML and θ̂_MCML are both fully efficient.
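The two ingredients of Steps 2 and 3 – discrete RQRs and the adaptive upper cutoff – can be sketched in Python for the NB parametrization (1) (scipy's nbinom with n = 1/α and p = 1/(1 + αµ) has mean µ; the function names and the downward scan over candidate values of t are our own implementation choices, not the paper's code):

```python
import numpy as np
from scipy.stats import nbinom

def rqr_nb(y, mu, alpha, rng):
    """Randomized quantile residuals z_i = F(y_i) - u_i * f(y_i), u_i ~ U[0,1],
    for the NB(mu, alpha) parametrization with mean mu and variance mu + alpha*mu^2."""
    n_par, p_par = 1.0 / alpha, 1.0 / (1.0 + alpha * mu)
    F = nbinom.cdf(y, n_par, p_par)
    f = nbinom.pmf(y, n_par, p_par)
    return F - rng.uniform(size=len(y)) * f

def upper_cutoff(z, zeta=0.95):
    """Adaptive upper cutoff: the largest t for which the right-truncated
    empirical cdf F_n(z)/F_n(t) dominates the U[0,1] cdf on [zeta, t]."""
    z = np.sort(z)
    n = len(z)
    for t in np.concatenate(([1.0], z[::-1])):   # candidates, scanned downward
        if t < zeta:
            break
        Ft = np.searchsorted(z, t, side="right") / n
        zs = z[(z >= zeta) & (z <= t)]
        Fz = np.searchsorted(z, zs, side="right") / n
        if Ft > 0 and np.all(Fz / Ft >= zs):
            return t
    return zeta                                   # reject everything above zeta
```

Under the model the z_i are approximately U[0, 1] and the cutoff stays near 1; a cluster of large outliers pushes F_n below the diagonal near 1 and forces the cutoff down.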
Remark 1. Empirical results show that, in order to optimize the finite sample efficiency, with no loss of robustness, it is convenient to iterate the conditional ML estimator as follows. Given a current value of θ̂_CML (or θ̂_MCML), we compute new RQRs. Then, we compute new values of ã and b̃ and use them to update θ̂_CML. Often, the process converges after a few iterations, but it can also move away from the initial value. In the experiments reported in Section 4, we found that two steps are enough: the efficiency did not improve using more iterations. Moreover, in the discrete case, the final estimator slightly depends on the sample {u₁, …, u_n} used to compute ã and b̃. To remove this dependency, we propose to average the final step (MCML) over a few replications of this sample.

Remark 2. In certain circumstances, we may use a very simple alternative procedure to compute robust and consistent estimators of β₀₁, α₀, and η₀ in (5). We first identify a simple model, which is free of the dispersion parameter, and that can be taken as an approximation of (5). For example, the Poisson regression model with mean h(β₀₁ + η₀v) may be taken as an approximation of the NB model. We then use an available robust procedure to estimate β₀₁ and η₀. In the NB case, the conditionally unbiased bounded influence estimator of Künsch et al. (1989), implemented in the R package "robeth" (Marazzi, 1992), is a natural choice. In the Beta regression case, we note that Atkinson (1985) transforms the response so that the transformed dependent variable (e.g., log(y/(1 − y))) assumes values on the real line, and then uses it in a linear regression analysis. Clearly, we may also use a robust regression estimator in this case, e.g., the MM estimator implemented in the R package "robustbase". Finally, we estimate α₀ using (7). Since the approximate model is not the correct one, the estimators do not converge to β₀₁, α₀, and η₀.
Usual robust estimators converge, however, to their population values and can be used to define fixed cutoff values a and b for Z, which also converge to their asymptotic values. The CML (MCML) estimator of (β₀₁, α₀, η₀) given Z ∈ [a, b] is then consistent under (5).

4 Simulations

We present simulation results only for the NB regression model (3)-(1). We compared the initial estimators β̃ and α̃ and the final modified CML estimators β̂ and α̂. In the following, these estimators will be referred to as INI and CML, respectively. All cutoff values were adaptive, with ζ = 0.05 for the lower and ζ = 0.95 for the upper cutoff. In order to compute the MRC estimators we used the function maxCorGrid of the R package ccaPP (Alfons et al., 2015). The INI estimator was completed with the help of the weighted MT estimator described in Appendix 8. In order to estimate α, we used the function ψ defined by the equation for α of the optimal M estimator M80 described in Marazzi and Yohai (2010, p. 174) and available in the "robustGLM" package. To compute the CML estimator, we used the standard R optimizer "optim", reparametrizing α with σ = √α in order to satisfy the constraint α > 0. (For a very small number of contaminated cases, the optimization process diverged; the initial solution was recorded in such cases.) Only two iterations of the CML procedure were computed. For comparison, we also computed the GM estimators of Aeberhard et al. (2014), referred to as ACH in the following. To compute the ACH estimator, we used the R function glmrob.nb (available on the internet) with the parameters bounding.func='T/T', c.tukey.beta=4, c.tukey.sig=4, as suggested by the authors, and the option x-weight=hard, which provides hard rejection weights for the covariate observations. We used the following model:

  y ∼ F_{exp(xᵀβ₀), α₀},  x = (1, x*ᵀ)ᵀ,  x* ∼ N(0, I),  (12)

with β₀ = (1.…, .…, .…, …, …, …)ᵀ and α₀ = 0.… .

We first performed four experiments with samples of size n = 100, 400, 1000, 2000 from (12) without addition of outliers. For each experiment, the number of replications was N = 1000. To measure the quality of an estimator (β̂, α̂) we used the mean absolute estimation error (MAEE) and the mean absolute prediction error (MAPE). The MAEE of β̂ is defined by

  MAEE(β̂) = (1/N) Σ_{i=1}^N ‖β̂_i − β₀‖₁,

where β̂_i is the estimate of β₀ based on the i-th replication and ‖·‖₁ denotes the l₁ norm. The MAEE of α̂ is defined in a similar way by

  MAEE(α̂) = (1/N) Σ_{i=1}^N |α̂_i − α₀|.

The MAPE of the prediction estimator µ̂_x = exp(xᵀβ̂) of µ_x = exp(xᵀβ₀) is defined as

  MAPE(µ̂) = (1/N) Σ_{i=1}^N |µ̂_i − µ_i|,

where µ_i = exp(x_iᵀβ₀), µ̂_i = exp(x_iᵀβ̂_i), and x_i is the i-th replication of x. Table 1 reports the empirical relative efficiencies, measured as the ratios of the MAEE and MAPE of the robust estimators with respect to the corresponding MAEE and MAPE of the ML estimators.

Table 1. Empirical relative efficiencies of coefficients, dispersion, and prediction estimators.
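The two error measures above can be written compactly; a Python sketch (we take ‖·‖ to be the l1 norm, and the helper names are ours):

```python
import numpy as np

def maee(estimates, truth):
    """Mean absolute estimation error: average l1 distance to the true value
    over the replications."""
    return np.mean([np.abs(np.asarray(e) - truth).sum() for e in estimates])

def mape(beta_hats, beta0, xs):
    """Mean absolute prediction error of mu_hat = exp(x^T beta_hat) against
    mu = exp(x^T beta0), with one covariate draw x per replication."""
    return np.mean([abs(np.exp(x @ b) - np.exp(x @ beta0))
                    for b, x in zip(beta_hats, xs)])
```

For a scalar parameter such as α, `maee` reduces to the mean absolute deviation used for MAEE(α̂).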
We observe that the relative efficiencies of the initial estimators were low but could be improved with the help of the final MCML procedure. With the exception of the dispersion estimator for n = 100, our final estimator is much more efficient than the ACH competitor. (The tuning constants of the ACH estimator were apparently chosen by the authors in order to obtain a satisfactory degree of robustness.)

In another simulation, the model (12) was contaminated with 10% of point-wise contamination. Preliminary experiments showed that the estimators were quite sensitive to outlying values of y when x* = (3, …, …, …, …)ᵀ. This value of x is moderately outlying with respect to the majority of the covariate observations. Therefore, we used point contaminations of the form (x*_out, y_out) with x*_out = (3, …, …, …, …)ᵀ and a response y_out varying over a grid of values. For each value of y_out, we generated 1000 samples of size n = 400 according to (12) and then replaced 10% of the observations with identical outliers of the form (x*_out, y_out). Table 2 reports the MAEE and MAPE of the estimators for the different values of y_out. (Outliers were excluded in the computation of the MAPE.) The results are also displayed in Figure 1. Both the MAPE and the MAEE of the proposed estimators were smaller than those of ACH for most values of y_out.

Table 2. MAEE and MAPE of coefficient, dispersion, and prediction estimators for varying y_out.
Figure 1. Mean absolute prediction and estimation errors for varying y_out.

5 Examples

In modern hospital management, stays are classified into "diagnosis related groups" (DRGs; Fetter et al., 1980), which are designed to be as homogeneous as possible with respect to diagnosis, treatment, and resource consumption. The mean cost of stay of each DRG is periodically estimated with the help of administrative data on a national basis and used to determine "standard prices" for hospital funding and reimbursement.
Typical stays are reimbursed according to the standard prices, whereas the reimbursement of exceptional stays (outliers) is subject to special negotiations among the partners. Since it is difficult to measure cost, length of stay (LOS) is often used as a proxy. Outliers are usually defined as observations with a LOS larger than some arbitrary cutoff value. In designing and refining the groups, the relationship between LOS and other variables which are usually available on administrative files has to be assessed and taken into account.

We first reconsider the example described in Marazzi and Yohai (2010). In this example there are no covariables, that is, only the parameters of a NB distribution are estimated. Table 3 shows the LOS of 32 stays classified into DRG "disorders of the nervous system", and we immediately identify three extreme values: 115, 198, 374 days. The arithmetic means with and without these observations are 25.… and …, respectively. We computed the ML estimator (MLE), the estimator M80, and the CML estimator with a = 0.05 and b = 0.95, based on two iterations starting from M80 and averaged over 100 replications of {u₁, …, u_n}. We also computed the three estimators (MLE*, M80*, CML*) after removal of the three outliers. The numerical results are shown in Table 4. They show that M80 and CML provided results which were similar to MLE* and unaffected by the outliers. The average values of ã and b̃ were ā = 0.… and b̄ = 0.953, from which we derived T_ā = 1, T_b̄ = 7, t_ā = 0.61, and t_b̄ = 0.43. This means that, on average, the CML estimator completely rejected LOS−1 values outside [0, 8] and gave weights 0.61 and 0.57 to the extremes of this interval.
Table 3. Length of stay of 32 hospital patients.
Table 4. Estimates of LOS-1 mean and LOS-1 dispersion for disorders of the nervous system.
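The truncation points and boundary weights used in the example above (T_ā = 1, T_b̄ = 7, t_ā, t_b̄) follow from the construction of Appendix 7. A Python sketch for the NB case without covariates (function names are ours; this is an illustration, not the paper's R code):

```python
import numpy as np
from scipy.stats import nbinom

def truncation_points(a, b, mu, alpha):
    """Truncation points T_a, T_b and boundary weights t_a, t_b for
    conditioning the NB(mu, alpha) likelihood on a <= z <= b."""
    n_par, p_par = 1.0 / alpha, 1.0 / (1.0 + alpha * mu)
    F = lambda k: nbinom.cdf(k, n_par, p_par)
    f = lambda k: nbinom.pmf(k, n_par, p_par)

    def y_star(c):                      # max{y : F(y) <= c} (-1 if F(0) > c)
        k = 0
        while F(k) <= c:
            k += 1
        return k - 1

    ya, yb = y_star(a), y_star(b)
    t_a = (F(ya + 1) - a) / f(ya + 1)
    t_b = (F(yb + 1) - b) / f(yb + 1)
    return ya + 2, yb, t_a, t_b         # T_a, T_b, t_a, t_b
```

Observations with y in [T_a, T_b] are retained with full weight; y = T_a − 1 is retained with weight t_a and y = T_b + 1 with weight 1 − t_b.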
In a second example, we considered a sample of 649 hospital stays (256 male and 393 female patients) for the "major diagnostic category" (MDC) "Diseases and Disorders of the Endocrine, Nutritional And Metabolic System". An MDC is simply a group of DRGs associated with a particular medical specialty. The data are shown in Figure 2 (two outliers with LOS = 84 and LOS = 122 fall beyond the upper limit of the figure).

We studied the relationship between LOS − 1 and the Age (x₁, in years) and Sex (x₂ = 0 for males and x₂ = 1 for females) of the patient. We considered a NB model with exponential link and linear predictor β₁ + β₂x₁ + β₃x₂ + β₄x₁x₂. We compared the ML, the ACH, and the complete estimator (called CML in the following) proposed in Section 3. The ACH estimator was computed with the help of the R function glmrob.nb with the tuning parameters suggested by the authors. The CML step – with a = 0.05 and b = 0.95 and two iterations – was replicated 30 times with different vectors {u₁, …, u_n}. The average values of ã and b̃ were ā = 0.004 and b̄ = 0.…, from which we computed T_{ā,x_i}, T_{b̄,x_i}, t_{ā,x_i}, and t_{b̄,x_i} (i = 1, …, n). We found that 65 observations were totally rejected, 62 fell on the lower limits T_{ā,x_i} (receiving an average weight 0.96) and 9 on the upper limits T_{b̄,x_i} (with an average weight 0.…).

Figure 2. Data: LOS-1 and Age of patients and fitted models according to CML and ML.
Table 5. Coefficient (standard errors) and dispersion estimates for disorders of the endocrine system.
We observe that the CML and the ML* coefficient estimates are very close and quite similar to the ACH estimates. (However, the standard errors provided by glmrob.nb are surprisingly large.) We also note that the dispersion parameter is heavily inflated by the contamination. For CML and ML*, the Sex effect (β₃) is significant at the 5% level and the interaction (β₄) is not significant. Instead, for ML, the interaction is significant at the 5% level, but not the effect of Sex. Thus, the classical and the robust inferences are different.

Figure 3 shows three uniform qq-plots of randomized tail probabilities z₁, …, z_n based on different estimates of α and β. In panel (a) the ML estimator has been used, and the sigmoidal shape suggests that the estimated model is incorrect. In panel (b), the z-values were based on the modified CML estimator; the plot is more linear, but it gradually departs from the diagonal for increasing quantiles. This suggests that the robustly fitted model is adequate for a large proportion of the data but not for those corresponding to very large values of z. Panel (c) is based on ML*, and the z-values corresponding to the full outliers according to CML have been removed from the plot; this plot follows the diagonal line very well. Finally, the boxplots in panel (d) compare the distributions of the absolute residuals (without outliers) based on ML, ACH, CML, and ML*; the two latter ones are globally smaller than the former ones. We conclude that CML (and ML*) provide an adequate model for about 90% of the population.

Figure 3. qq-plots of randomized tail probabilities based on ML, CML, and ML* with removal of the extreme z-values from the plot, and boxplots of the absolute residuals of ML, ACH, CML, and ML*.

6 Discussion
In many areas of applied statistics, the data may be affected by a high level of contamination. An example is the analysis of hospital length of stay, where contamination levels as high as 10% are not uncommon. For this reason, different ad hoc rules of trimming had long been used by practitioners to remove outliers from their data (e.g., Marazzi et al., 1998). In these applications, well-founded highly robust procedures are needed.

Maronna et al. (1979) showed that classical M and GM estimators of regression (see, e.g., Huber, 1980; Hampel et al., 1986) were unable to combine a high level of robustness and a high level of efficiency: M and GM estimators can be very efficient, but are very sensitive to outliers in the factor space. This work stimulated the research on high breakdown-point estimation that provided LMS, LTS, and S estimators (see, e.g., Maronna et al., 2006), just to mention three among many other procedures. Then, for the usual linear regression problem, the MM estimators of Yohai (1987) combined high breakdown point and high efficiency with the help of a two-step approach: in the first step, a very robust initial fit (an S estimator) provided the tool for outlier identification; the second step was based on an efficient estimator (an M estimator), where the outliers were downweighted. Since then, similar two-step procedures have been proposed for different models (Marazzi and Yohai, 2004; Locatelli et al., 2010; Agostinelli et al., 2014).

However, the familiar high breakdown-point regression estimators used in the first step are based on minimization of a robust measure of the residual scale and, unfortunately, cannot be used for GLMs with a dispersion parameter, such as NB and Beta regression. The reason is that the residual distribution depends on the covariates, and robust residual measures of scale are not available in this case. In this paper, we propose a more general approach that bypasses residual scales.

Our proposal is an original assembly of well-known procedures. In the initial step, we use the MRC estimator (Han, 1987a) to estimate the slopes up to a scale factor. A very fast algorithm to compute this estimator has recently been proposed in Alfons et al. (2016). We complete the MRC with the help of a weighted MT estimator (Valdora and Yohai, 2014) of a simple negative binomial regression. We then use randomized quantile residuals (Dunn and Smyth, 1996) to determine adaptive cutoff values ã and b̃, using a procedure similar to the one proposed in Marazzi and Yohai (2004). Influential outliers are identified as the observations whose residuals do not belong to [ã, b̃]. Finally, we compute a conditional ML estimator given that the residuals belong to [ã, b̃]. Since, in the absence of outliers, ã → 0 and
1, the CML estimator tends to the ML estimatorfor n → ∞ . It is therefore fully efficient.Monte Carlo simulations confirm that our proposal is very efficient under themodel and very robust under point contamination, both in the response and thecovariate distributions. This kind of contamination is unrealistic; however, it is gen-erally the least favorable one and allows evaluation of the maximal bias an estimatorcan incur. The CML estimator for NB regression also resists to a moderate fractionof excess zeroes in the response. A more vigorous treatment of this peculiarity ofcount data should however be approached with the help of specific models, such ashurdle models (see, e.g. Min and Agresti, 2002, and Cantoni and Zedini, 2009).We have shown that the proposed method is a useful tool for modelling hospitallength of stay as a function of available covariates, while identifying influential outliersaccording to a model based rule. A set of R functions to compute the proposedestimators is made available as an R package.22 ppendices To simplify notations, we just consider the case without covariates; the extension tothe regression case is straightforward. We suppose that ϑ = ( µ , α ) is given and let z = F ϑ ( y ) − uf ϑ ( y ), where u ∼ U [0 , a and b are given cutoff valuesfor z and define, for any c , y ∗ ( c ) = max { y : F ϑ ( y ) ≤ c } . Note that F ϑ ( y ∗ ( a ) + 1) − uf ϑ ( y ∗ ( a ) + 1) ≥ a is equivalent to u ≤ F ϑ ( y ∗ ( a ) + 1) − af ϑ ( y ∗ ( a ) + 1) = t a . Similarly F ϑ ( y ∗ ( b ) + 1) − uf ϑ ( y ∗ ( b ) + 1) ≤ b is equivalent to u ≥ F ϑ ( y ∗ ( b ) + 1) − bf ϑ ( y ∗ ( b ) + 1) = t b . Put T a = y ∗ ( a ) + 2, T b = y ∗ ( b ), and A = { y : T a ≤ y ≤ T b } . We have { a ≤ z ≤ b } = A ∪ { y = T a − , u ≤ t a } ∪ { y = T b + 1 , u ≥ t b } , and then P ϑ ( a ≤ z ≤ b | u ) = P ϑ ( A ) + f ϑ ( T a − I ( u ≤ t a ) + f ϑ ( T b + 1) I ( u ≥ t b ) , P ϑ ( A ) = F ϑ ( T b ) − F ϑ ( T a − v = I ( a ≤ z ≤ b ). 
Since $E[I(u \le t_a)] = P(u \le t_a) = t_a$, the distribution of $y \mid v = 1$ is given by
$$p_\vartheta(y \mid v = 1) = \begin{cases} f_\vartheta(y)/Q(\vartheta) & \text{if } y \in A, \\ f_\vartheta(y)\, t_a / Q(\vartheta) & \text{if } y = T_a - 1, \\ f_\vartheta(y)\,(1 - t_b)/Q(\vartheta) & \text{if } y = T_b + 1, \end{cases}$$
where
$$Q(\vartheta) = E[P_\vartheta(a \le z \le b \mid u)] = P_\vartheta(A) + f_\vartheta(T_a - 1)\, t_a + f_\vartheta(T_b + 1)\,(1 - t_b).$$

We now describe the use of the weighted MT estimator to compute $\tilde\beta_0^*(\alpha)$ and $\tilde\eta^*(\alpha)$ introduced in Subsection 3.1. We consider the simple regression model $Y \mid v \sim F_{h(\beta_0 + \eta_0 v), \alpha}$. Assuming that $\alpha$ is known, the weighted MT estimator of $(\beta_0, \eta_0)$ is defined as follows:
$$(\tilde\beta^*(\alpha), \tilde\eta^*(\alpha)) = \arg\min_{\beta_0, \beta_1} \frac{1}{n} \sum_{i=1}^n w(x_i, \hat M, \hat S)\, \rho\big(t(y_i, \alpha) - m(h(\beta_0 + \beta_1 x_i), \alpha)\big), \quad (13)$$
where $\rho$ is a continuous and bounded function with a unique local minimum at 0, $m$ is the function defined by
$$m(\mu, \alpha) = \arg\min_\gamma E_{\mu,\alpha}\big(\rho(t(y, \alpha) - \gamma)\big), \quad (14)$$
$t(y, \alpha)$ is a variance stabilizing transformation, and $w(x, \hat M, \hat S)$ is a nonnegative nonincreasing function of $|x - \hat M| / \hat S$, where $\hat M$ and $\hat S$ are robust estimators of location and scale of the covariate $x$. Usually, $\rho$ is taken in Tukey's biweight family, given by
$$\rho_c^T(u) = 1 - \max\Big(\big(1 - (u/c)^2\big)^3,\ 0\Big).$$
In our simulations in Section 4 and the examples in Section 5 with the NB distribution, we used the transformation
$$t(y, \alpha) = \begin{cases} s(y, \alpha) & \text{if } 0 < \alpha \le 1.3, \\ s(y, 1.3) & \text{if } \alpha > 1.3, \end{cases} \qquad s(y, \alpha) = \sqrt{1/\alpha - 1/2}\; \operatorname{asinh}\!\left(\sqrt{\frac{y + 3/8}{1/\alpha - 3/4}}\right).$$
This is a modification of the transformation proposed by Yu (2009) to allow values of $\alpha$ larger than $4/3$. We take $w(x, \hat M, \hat S) = I\big(|x - \operatorname{median}(x_i)| / \operatorname{mad}(x_i) < 2\big)$ and $h(z) = \exp(z)$. Since the variance of $t(y, \alpha)$ is almost constant, it is not necessary to divide the argument of $\rho_c$ by a scale estimator. While the efficiency of the estimator increases with $c$, its degree of robustness decreases. Since the weighted MT estimator is used to define an initial estimator whose efficiency will be improved in further steps, the value of $c$ is chosen in order to obtain a satisfactory degree of robustness. By trial and error we obtained a rule taking $c$ proportional to $\sigma(\alpha)$, where, for each $\alpha$, $\sigma(\alpha)$ is a constant approximating the standard deviation of $t(y, \alpha)$. The value of $\sigma(\alpha)$ is obtained by interpolating the values in Table A1:

$\alpha$   .10  .20  .30  .40  .50  .60  .70  .80  .90  1.0  1.1  1.2  1.3
$\sigma$   .41  .40  .39  .37  .36  .35  .33  .32  .30  .29  .27  .26  .24

Table A1. Approximate standard deviations of $t(y, \alpha)$ for the NB distribution.

For the Beta distribution, we have $\operatorname{Var}_{\mu,\alpha}(y) = \mu(1-\mu)/(1+\alpha)$, and a suitable variance stabilizing transformation (Bartlett, 1947) is given by
$$t(y, \alpha) = \int^y \operatorname{Var}_{\mu,\alpha}(y)^{-1/2}\, d\mu = 2\sqrt{1+\alpha}\, \arcsin(\sqrt{y}).$$
In our experiments we used this transformation for $\alpha \in [5, 50]$ and the link function $h(u) = \exp(u)/(1 + \exp(u))$. We follow the same approach as in the NB case. The approximate standard deviations (on a grid of values of $\alpha$ in $[5, 50]$) are given in Table A2:

$\sigma$   .42  .43  .43  .44  .45  .45  .47  .48  .48  .49  .49  .49  .49  .49

Table A2. Approximate standard deviations of $t(y, \alpha)$ for the Beta distribution.

When $\alpha$ is unknown, the estimator $(\tilde\beta^*(\tilde\alpha), \tilde\eta^*(\tilde\alpha))$ simultaneously satisfies equations (13) and (7). To compute an approximate solution we consider a grid of possible values of $\alpha$, namely the values in the tables above. For each $\alpha$ in the grid, we first compute $(\tilde\beta^*(\alpha), \tilde\eta^*(\alpha))$ and then the solution $\tilde\alpha^*$ of (7). The desired approximation is then defined as the vector $(\tilde\beta^*(\tilde\alpha^*), \tilde\eta^*(\tilde\alpha^*))$ for which the difference between $\alpha$ and $\tilde\alpha^*$ is minimal.

We consider the discrete case, where the RQRs are defined by
$$z_i = F_{h(x_i^T \tilde\beta), \tilde\alpha}(y_i) - u_i f_{h(x_i^T \tilde\beta), \tilde\alpha}(y_i), \quad 1 \le i \le n.$$
By Assumption B, there exist $A_0$ and $B_0$ such that, if
$$D_n = \big\{\, n^{1/2} |\tilde\alpha - \alpha| \le A_0,\ \| n^{1/2} (\tilde\beta - \beta) \| \le B_0 \,\big\},$$
we have $P(D_n) \ge 1 - \zeta/2$. Let
$$v_i = F_{h(x_i^T \beta), \alpha}(y_i) - u_i f_{h(x_i^T \beta), \alpha}(y_i), \quad 1 \le i \le n.$$
Then the $v_i$'s are i.i.d. with distribution $U[0,1]$. By Assumption C, there exist $K_1$ and $K_2$ such that
$$z_i \ge v_i - \big(K_1 \|\tilde\beta - \beta\| + K_2 |\tilde\alpha - \alpha|\big),$$
i.e., $z_i \ge v_i - n^{-1/2} B_n$, where $B_n = O_p(1)$. Let $e_0$ be such that, if $M_n = \{B_n \le e_0\}$, then $P(M_n) \ge 1 - \zeta/2$. Let $F_n^z$ and $F_n^v$ be the empirical distribution functions of the $z_i$'s and the $v_i$'s, respectively. Then we have
$$F_n^z(v) \le F_n^v(v + n^{-1/2} B_n). \quad (15)$$
Since $E_n = \sup_v n^{1/2} |F_n^v(v) - v| = O_P(1)$, we get $F_n^v(v) \le v + n^{-1/2} E_n$. Then, putting $G_n = B_n + E_n$, by (15) we obtain
$$F_n^z(v) \le v + n^{-1/2} G_n. \quad (16)$$
In a similar way we get
$$F_n^z(v) \ge v - n^{-1/2} G_n^*, \quad (17)$$
where $G_n^* = O_p(1)$. Let
$$H_{na}(v) = \frac{F_n^z(v) - F_n^z(a)}{1 - F_n^z(a)}$$
and $\mathcal{A} = \{ a : \sup_{v \le \zeta} (H_{na}(v) - v) \le 0 \}$. Then $\tilde a = \inf \mathcal{A}$. Note that $a \in \mathcal{A}$ is equivalent to $F_n^z(v) \le v (1 - F_n^z(a)) + F_n^z(a)$ for all $v \le \zeta$, and this is equivalent to $F_n^z(a)(1 - v) \ge F_n^z(v) - v$ for all $v \le \zeta$. By (16) and (17), a sufficient condition for $a \in \mathcal{A}$ is that
$$(a - n^{-1/2} G_n^*)(1 - \zeta) \ge n^{-1/2} G_n,$$
or equivalently that
$$a \ge n^{-1/2} \left( \frac{G_n}{1 - \zeta} + G_n^* \right).$$
This implies that
$$\tilde a \le n^{-1/2} \left( \frac{G_n}{1 - \zeta} + G_n^* \right),$$
proving that $n^{1/2}\, \tilde a$ is bounded in probability. The proof that $n^{1/2} (\tilde b - 1)$ is bounded in probability is similar.

References
Abrevaya J. (1999). Computation of the maximum rank correlation estimator. Economics Letters, 62, 279-285.

Aeberhard W.H., Cantoni E. and Heritier S. (2014). Robust inference in the negative binomial regression model with an application to falls data. Biometrics, 70, 920-931. DOI: 10.1111/biom.12212

Agostinelli C., Marazzi A. and Yohai V.J. (2014). Robust estimators of the generalized log-gamma distribution. Technometrics, 56(1), 92-101.

Alfons A., Croux C. and Filzmoser P. (2016). Robust maximum association estimators. Journal of the American Statistical Association. In press.

Alfons A. (2015). ccaPP: (Robust) canonical correlation analysis via projection pursuit. R package version 0.3.1, URL http://CRAN.R-project.org/package=ccaPP.

Amiguet M. (2011). Adaptively weighted maximum likelihood estimation of discrete distributions. Ph.D. thesis, Université de Lausanne, Switzerland.

Austin P.C., Rothwell D.M. and Tu J.V. (2002). A comparison of statistical modeling strategies for analyzing length of stay after CABG surgery. Health Services & Outcomes Research Methodology, 3, 107-133. DOI: 10.1023/A:1024260023851

Cadigan N.G. and Chen J. (2001). Properties of robust M-estimators for Poisson and negative binomial data. Journal of Statistical Computation and Simulation, 70, 273-288.

Cantoni E. and Ronchetti E. (2001). Robust inference for generalized linear models. Journal of the American Statistical Association, 96(455), 1022-1030.

Cantoni E. and Zedini A. (2009). A robust version of the hurdle model. Cahiers du département d'économétrie No 2009.07, Faculté des sciences économiques et sociales, Université de Genève.

Carter E.M. and Potts H.W.W. (2014). Predicting length of stay from an electronic patient record system: a primary total knee replacement example. BMC Medical Informatics & Decision Making, 14: 26. DOI: 10.1186/1472-6947-14-26.

Cribari-Neto F. and Zeileis A. (2010). Beta regression in R. Journal of Statistical Software, 34, 1-24.

Cuesta-Albertos J.A., Matrán C. and Mayo-Iscar A. (2008). Trimming and likelihood: robust location and dispersion estimation in the elliptical model. The Annals of Statistics, 36(5), 2284-2318.

Davison A.C. and Snell E.J. (1991). Residuals and diagnostics. In Statistical Theory and Modelling: In Honour of Sir David Cox, D.V. Hinkley, N. Reid and E.J. Snell (editors), 83-106. Chapman and Hall.

Dunn P.K. and Smyth G.K. (1996). Randomized quantile residuals. Journal of Computational and Graphical Statistics, 5(3), 236-244.

Espinheira P.L., Ferrari S.L.P. and Cribari-Neto F. (2008). Influence diagnostics in beta regression. Computational Statistics & Data Analysis, 52(9), 4417-4431.

Ferrari S.L.P. and Cribari-Neto F. (2004). Beta regression for modelling rates and proportions. Journal of Applied Statistics, 31(7), 799-815.

Fetter R.B., Shin Y., Freeman J.L., Averill R.F. and Thompson J.D. (1980). Case mix definition by diagnosis-related groups. Medical Care, 18(1), 1-53.

Gervini D. and Yohai V.J. (2002). A class of robust and fully efficient regression estimators. Annals of Statistics, 30(2), 583-616.

Hampel F.R., Ronchetti E.M., Rousseeuw P.J. and Stahel W.A. (1986). Robust Statistics: The Approach Based on Influence Functions. Wiley, New York.

Han A.K. (1987a). Non-parametric analysis of a generalized regression model: The maximum rank correlation estimator. Journal of Econometrics, 35(2-3), 303-316.

Han A.K. (1987b). A non-parametric analysis of transformations. Journal of Econometrics, 35(2-3), 191-209.

Hilbe J.M. (2008). Negative Binomial Regression. Cambridge University Press.

Huber P.J. (1980). Robust Statistics. Wiley, New York.

Hunger M., Baumert J. and Holle R. (2011). Analysis of SF-6D index data: is beta regression appropriate? Value in Health, 14, 759-767.

Künsch H.R., Stefanski L.A. and Carroll R.J. (1989). Conditionally unbiased bounded-influence estimation in general regression models, with applications to generalized linear models. Journal of the American Statistical Association, 84(406), 460-466.

Locatelli I., Marazzi A. and Yohai V.J. (2010). Robust accelerated failure time regression. Computational Statistics & Data Analysis, 55(1), 874-887.

Marazzi A. (1993). Algorithms, Routines, and S-Functions for Robust Statistics. Wadsworth, Inc., Belmont, California.

Marazzi A., Paccaud F., Ruffieux C. and Beguin C. (1998). Fitting the distributions of length of stay by parametric models. Medical Care, 36(6), 915-927.

Marazzi A. and Yohai V.J. (2004). Adaptively truncated maximum likelihood regression with asymmetric errors. Journal of Statistical Planning and Inference, 122(1-2), 271-291.

Marazzi A. and Yohai V.J. (2010). Optimal robust estimates based on the Hellinger distance. Advances in Data Analysis and Classification. Springer-Verlag.

Maronna R.A., Martin R.D. and Yohai V.J. (2006). Robust Statistics: Theory and Methods. Wiley & Sons, Ltd.

Maronna R., Bustos O. and Yohai V.J. (1979). Bias- and efficiency-robustness of general M-estimators for regression with random carriers. In Smoothing Techniques for Curve Estimation, 91-116.

Min Y. and Agresti A. (2002). Modeling nonnegative data with clumping at zero: A survey. Journal of the Iranian Statistical Society, 1(1-2), 7-33.

Nelder J.A. and Wedderburn R.W.M. (1972). Generalized linear models. Journal of the Royal Statistical Society, Series A, 135(3), 370-384.

Rocha A.V. and Simas A.B. (2010). Influence diagnostics in a general class of beta regression models. Test, 20(1), 95-119.

Seow W.J., Pesatori A.C., Dimont E., Farmer P.B., Albetti B., et al. (2012). Urinary benzene biomarkers and DNA methylation in Bulgarian petrochemical workers: Study findings and comparison of linear and beta regression models. PLOS ONE, 7: e50471.

Sherman R.P. (1993). The limiting distribution of the maximum rank correlation estimator. Econometrica, 61(1), 123-137.

Swearingen C.J., Tilley B.C., Adams R.J., Rumboldt Z., Nicholas J.S., Bandyopadhyay D. and Woolson R.F. (2011). Application of beta regression to analyze ischemic stroke volume in NINDS rt-PA clinical trials. Neuroepidemiology, 37(2), 73-82.

Valdora M. and Yohai V.J. (2014). Robust estimation in generalized linear models. Journal of Statistical Planning and Inference, 146, 31-48.

Venables W.N. and Ripley B.D. (1999). Modern Applied Statistics with S-PLUS. Third edition. Springer.

Yohai V.J. (1987). High breakdown-point and high efficiency robust estimates for regression. Annals of Statistics, 15(2), 642-656.

Yu G. (2009). Variance stabilizing transformations of Poisson, binomial and negative binomial distributions. Statistics and Probability Letters, 79, 1621-1629.
        coefficients (β)       dispersion (α)        prediction (µ)
  n     INI   CML   ACH        INI   CML   ACH       INI   CML   ACH
 100    0.55  0.74  0.71       0.78  0.71  0.76      0.50  0.70  0.76
 400    0.52  0.88  0.75       0.73  0.85  0.79      0.48  0.89  0.76
1000    0.51  0.93  0.78       0.73  0.93  0.83      0.48  0.93  0.78
2000    0.54  0.95  0.75       0.73  0.94  0.83      0.50  0.95  0.75
Table 1. Empirical relative efficiencies of coefficients, dispersion, and prediction estimates.

         (columns: increasing values of y_out)
β  CML   0.72 0.76 0.78 0.55 0.38 0.32 0.33 0.37 0.42 0.45 0.53 0.55 0.48
   ACH   1.27 1.19 1.09 0.67 0.46 0.41 0.45 0.50 0.55 0.59 0.70 0.75 0.88
   INI   0.45 0.09 0.09 0.09 0.09 0.09 0.09 0.09 0.09 0.09 0.09 0.09 0.09
α  CML   0.45 0.25 0.13 0.18 0.23 0.23 0.23 0.21 0.20 0.19 0.15 0.12 0.09
   ACH   0.53 0.12 0.10 0.27 0.29 0.29 0.27 0.26 0.24 0.23 0.19 0.17 0.12
   INI   1.92 1.17 1.11 1.11 1.11 1.11 1.11 0 1.11 1.11 1.11 1.11 1.11
µ  CML   1.59 1.91 1.69 1.19 0.79 0.62 0.63 0.74 0.88 1.04 1.40 1.54 1.34
   ACH   2.68 2.39 2.18 1.40 0.97 0.78 0.78 0.89 1.04 1.19 1.64 1.91 2.63
Table 2. MAEE and MAPE of coefficient, dispersion, and prediction estimates for varying y_out.

LOS        1  2  3  4  5  6  7  8  9  16  115  198  374
frequency  2  6  5  5  4  2  2  1  1   1    1    1    1

Table 3. Length of stay of 32 hospital patients.

(Table 4: estimates of µ and α by MLE, M80, CML, MLE*, M80*, and CML*.)

      ————— coefficients (β) —————                                   α
ML    1.266 (0.134)   0.017 (0.002)    0.064 (0.178)  -0.009 (0.003)  1.067 (0.067)
ACH   1.656 (0.726)   0.004 (0.011)   -1.055 (0.735)   0.012 (0.011)  0.542 (—)
CML   0.899 (0.113)   0.017 (0.002)   -0.269 (0.154)  -0.002 (0.003)  0.593 (0.049)
ML*   0.846 (0.114)   0.016 (0.002)   -0.253 (0.156)  -0.002 (0.003)  0.503 (0.046)

Table 5. Coefficient (standard error) and dispersion estimates for disorders of the endocrine system.
(Figure: three panels, MAEE(b), MAEE(a), and MAEE(m), each with curves for INI, FIN, and ACH.)

Figure 1. Mean absolute prediction and estimation errors for varying y_out.
Figure 2. Data: LOS and Age of 649 patients. Black circles are men, gray circles are women. Full outliers are marked by cross signs (x); borderline observations by plus signs (+). Fitted models according to CML (solid lines) and ML (broken lines): black for men, gray for women.

(Figure: (a) empirical z-quantiles (ML) vs. uniform quantiles; (b) empirical z-quantiles (CML) vs. uniform quantiles; (c) cleaned empirical z-quantiles (ML*) vs. uniform quantiles; (d) boxplots of absolute residuals for ML, ACH, CML, and ML*.)