A logistic regression analysis approach for sample survey data based on phi-divergence measures
aa r X i v : . [ s t a t . M E ] N ov A logistic regression analysis approachfor sample survey data based onphi-divergence measures
Elena Castilla, Nirian Mart´ın and Leandro Pardo ∗ Abstract
A new family of minimum distance estimators for binary logisticregression models based on φ -divergence measures is introduced. The so called“pseudo minimum phi-divergence estimator”(PM φ E) family is presented asan extension of “ minimum phi-divergence estimator” (M φ E) for general sam-ple survey designs and contains, as a particular case, the pseudo maximumlikelihood estimator (PMLE) considered in Roberts et al. [8]. Through a sim-ulation study it is shown that some PM φ Es have a better behaviour, in termsof efficiency, than the PMLE.
Keywords and phrases:
Logistic regression models, Sample survey data,Phi-divergence measures, Maximum likelihood estimator, Pseudo maximumlikelihood estimator.
Suppose that the population of interest is partitioned into I cells or domainsaccording to the levels of one or more factors. Let N i ( i = 1 , ..., I ) denotethe i -th domain size, N = P Ii =1 N i the population domain total and N i , thepopulation counts, out of N i , where the binary response (0 for failure and 1for success) variable is equal to 1. Since N i and N i are fixed but unknownvalues ( i = 1 , ..., I ), b N i denotes the survey estimator of the i -th domain size ∗ Complutense University of Madrid, 28040 Madrid, Spain [email protected]
I had the great honor to have Pedro as a friend; a deeply beloved friend. We haveshared not only many scientific interests in connection with Statistical InformationTheory but also, in accordance with our frequent conversations, many similarities inour youthful life experiences. The research in this paper, written with two coauthors,is dedicated in memoriam to Professor Pedro Gil. 1 Elena Castilla, Nirian Mart´ın and Leandro Pardo N i and b N i the corresponding estimate of the successful events N i . The ratioestimator b p i = b N i / b N i , i = 1 , ..., I , is often used to estimate the populationproportion of successful events, π i = N i N i , i = 1 , ..., I. Standard samplingtheory provides an estimator of the covariance matrix of the b p = ( b p , ..., b p I ) T .Another choice is using the logistic regression π (cid:0) x Ti β (cid:1) = exp { x Ti β } { x Ti β } = exp { β + k P s =1 β j x ij } { β + k P s =1 β j x ij } , i = 1 , ..., I, (1)to modelize the population proportion of successful events, π i = π (cid:0) x Ti β (cid:1) = N i N i , which is assumed to depend on constants x ij , j = 1 , ..., k ( k < I ) de-rived from the factor levels, summarized in a ( k + 1)-vector of known con-stants x i = (1 , x i , ..., x ik ) T , and also on a ( k + 1)-vector of parameters β = ( β , β , ..., β k ) T .Under independent binomial sampling in each domain, it is well-knownthat the maximum likelihood estimator (MLE) of β , b β , is obtained throughiterative calculations from the following likelihood equations X T diag( n ) π ( β ) = X T diag( n ) b q , (2)where X = ( x , ..., x I ) T is a full rank matrix, π ( β ) = (cid:0) π (cid:0) x T β (cid:1) , ..., π (cid:0) x TI β (cid:1)(cid:1) T , b q = ( b q , ..., b q I ) T with b q i = n i /n i , n i being the sample size from the i -thdomain, n = P Ii =1 n i the i -th sample domain total and n i the sample totalof successful events the i -th domain. If we consider the probability vectors b p ∗ = (cid:16) n n b q , n n (1 − b q ) , ..., n I n b q I , n I n (1 − b q I ) (cid:17) T = (cid:18) n n , n − n n , ..., n I n , n I − n I n (cid:19) T = (cid:16) n n , n n , ..., n I n , n I n (cid:17) T , ( n i = n i − n i ) , and p ∗ ( β ) = (cid:0) n n π (cid:0) x T β (cid:1) , n n (cid:0) − π (cid:0) x T β (cid:1)(cid:1) , . . . , n I n π (cid:0) x TI β (cid:1) , n I n (cid:0) − π (cid:0) x TI β (cid:1)(cid:1)(cid:1) T the MLE of β , b β , can be equivalently defined by b β = arg min β ∈ R k +1 d Kullback (cid:0)b p ∗ , p ∗ ( β ) (cid:1) , logistic regression analysis approach for sample survey data 3 where d Kullback (cid:0)b p ∗ , p ∗ ( β ) (cid:1) is the Kullback-Leibler divergence between theprobability vectors b p ∗ and p ∗ ( β ) defined by d Kullback (cid:0)b p ∗ , p ∗ ( β ) (cid:1) = I X i =1 " n i n log n i n i π (cid:0) x Ti β (cid:1) + n i n log n i n i (cid:0) − π (cid:0) x Ti β (cid:1)(cid:1) . In Pardo et al. [5] the minimum phi-divergence estimator (M φ E) was intro-duced, as a natural extension of the MLE, as b β φ = arg min β ∈ R k +1 d φ (cid:0)b p ∗ , p ∗ ( β ) (cid:1) , (3)where d φ (cid:0)b p ∗ , p ∗ ( β ) (cid:1) is the phi-divergence measure between the probabilityvectors b p ∗ and p ∗ ( β ) given by d φ (cid:0)b p ∗ , p ∗ ( β ) (cid:1) = I X i =1 n i n (cid:20) π ( x i , β ) φ (cid:18) n i n i π ( x Ti β ) (cid:19) + (1 − π ( x i , β )) φ (cid:18) n i n i ( − π ( x Ti β )) (cid:19)(cid:21) , with φ ∈ Φ ∗ . By Φ ∗ we are denoting the class of all convex functions, φ ( x ), x >
0, such that at x = 1, φ (1) = φ ′ (1) = 0, and at x = 0, 0 φ (0 /
0) = 0and 0 φ ( p/
0) = p lim u →∞ φ ( u ) u . For every φ ∈ Φ ∗ , differentiable at x = 1, thefunction Ψ ( x ) = φ ( x ) − φ ′ (1) ( x − Φ ∗ . Therefore, we have d ψ (cid:0)b p ∗ , p ∗ ( β ) (cid:1) = d φ (cid:0)b p ∗ , p ∗ ( β ) (cid:1) and ψ has the additional property that ψ ′ (1) = 0. Since the two divergence measuresare equivalent, we can consider the set Φ ∗ to be equivalent to the set Φ = Φ ∗ ∩ { φ : φ ′ (1) = 0 } . For more details see Cressie and Pardo [3] and Pardo [7]. In what follows,we give our theoretical results for φ ∈ Φ , but often apply them to choices offunctions in Φ ∗ .For general sample survey designs we do not have maximum likelihoodestimators due to difficulties in obtaining appropriate likelihood functions.Hence, it is a common practice to use a pseudo maximum likelihood estimator(PMLE) of β , b β P , obtained from (2) by replacing n i /n , i = 1 , ..., I , bythe estimated domain relative size w i = b N i / b N , i = 1 , ..., I, and the sampleproportions b q i = n i /n i , i = 1 , ..., I, by the ratio estimate b p i = b N i / b N i , i =1 , ..., I, X T diag( w ) π ( β ) = X T diag( w ) b p , (4)where w = ( w , ..., w I ) T . Elena Castilla, Nirian Mart´ın and Leandro Pardo
In this paper we extend the concept of M φ E by considering the “pseudominimum phi-divergence estimator” (PM φ E) as a natural extension of thePMLE and we solve some statistical problem for the model considered in (1).In Section 2 we shall introduce the PM φ E for general sample designs and westudy its asymptotic behavior. A numerical example is presented in Section3 and, finally, in Section 4 a simulation study is carried out.
For general sample designs, we should consider the kernel of the weightedloglikelihood ℓ w ( β ) = n I X i =1 w i (cid:2)b p i log π (cid:0) x Ti β (cid:1) + (1 − b p i ) log(1 − π (cid:0) x Ti β (cid:1) ) (cid:3) , which is derived from the kernel of the likelihood for I independent binomialrandom variables ℓ ( β ) = I X i =1 n i (cid:2)b q i log π (cid:0) x Ti β (cid:1) + (1 − b q i ) log(1 − π (cid:0) x Ti β (cid:1) ) (cid:3) = n I X i =1 n i n (cid:2)b q i log π (cid:0) x Ti β (cid:1) + (1 − b q i ) log(1 − π (cid:0) x Ti β (cid:1) ) (cid:3) , replacing n i n by w i = b N i / b N , and b q i = n i /n i by b p i = b N i / b N i , i = 1 , ..., I . Ifwe consider the two probability vectors b p w = ( w b p , w (1 − b p ) , ..., w I b p I , w I (1 − b p I )) T and p w ( β ) = (cid:0) w π (cid:0) x T β (cid:1) , w (cid:0) − π (cid:0) x T β (cid:1)(cid:1) , ..., w I π (cid:0) x TI β (cid:1) , w I (cid:0) − π (cid:0) x TI β (cid:1)(cid:1)(cid:1) T , we get ℓ w ( β ) = − nd Kullback ( b p w , p w ( β )) + k, where k is a constant not depending on β . Therefore the PMLE of β , b β P ,presented in (4) can be defined as b β P = arg max β ∈ R k +1 ℓ w ( β ) = arg min β ∈ R k +1 d Kullback ( b p w , p w ( β )) . logistic regression analysis approach for sample survey data 5 Based on the previous interpretation of the PMLE, in the following definitionwe shall present the PM φ E. Definition 1.
The PM φ E in a general sample design for the parameter β inthe model considered in (1) is defined as b β φ,P = arg min β ∈ R k +1 d φ ( b p w , p w ( β )) , where d φ ( b p w , p w ( β )) = I X i =1 w i (cid:20) π ( x i , β ) φ (cid:18) b p i π ( x Ti β ) (cid:19) + (1 − π ( x i , β )) φ (cid:18) − b p i − π ( x Ti β ) (cid:19)(cid:21) is the phi-divergence measure between the probability vectors b p w and p w ( β ).The following result establishes the asymptotic distribution of the PM φ Eof β , b β φ,P . Theorem 1.
Let us assume that β is the true value of β and w p −→ n →∞ W , W = ( W , ..., W I ) T , W i = N i N , b p p −→ n →∞ π ( β ) , √ n ( b p − π ( β )) L −→ n →∞ N ( , V ) . Then, we have √ n ( b β φ,P − β ) L −→ n →∞ N ( k +1 , V ( β )) , where V ( β ) = ( X T ∆X ) − X T diag ( W ) V diag ( W ) X ( X T ∆X ) − , (5) ∆ = diag { W i π (cid:0) x Ti β (cid:1) (cid:0) − π (cid:0) x Ti β (cid:1)(cid:1) } i =1 ,...,I . Proof.
Based on Theorem 1 in Pardo et al. [5], we have b β φ,P = β +( X T ∆X ) − X T diag (cid:8) c Ti (cid:9) Ii =1 diag − / ( p w ( β )) ( b p w − p w ( β ))+ o (cid:16)(cid:13)(cid:13)(cid:13) diag (cid:8) c Ti (cid:9) Ii =1 diag − / ( p w ( β )) ( b p w − p w ( β )) (cid:13)(cid:13)(cid:13) k +1 (cid:17) , with c i = (cid:0) w i π (cid:0) x Ti β (cid:1) (cid:0) − π (cid:0) x Ti β (cid:1)(cid:1)(cid:1) / (cid:0) − π (cid:0) x Ti β (cid:1)(cid:1) / − π (cid:0) x Ti β (cid:1) / ! , i = 1 , .., I. Elena Castilla, Nirian Mart´ın and Leandro Pardo
Sincediag (cid:8) c Ti (cid:9) Ii =1 diag − / ( p w ( β )) ( b p w − p w ( β )) = diag ( w ) ( b p − π ( β )) , w p −→ n →∞ W and b p p −→ n →∞ π ( β ), it holds √ n ( b β φ,P − β ) = ( X T ∆X ) − X T diag ( W ) √ n ( b p − π ( β )) + o p ( k +1 ) . From the Sluysky’s theorem and taking into account √ n ( b p − π ( β )) L −→ n →∞ N ( k +1 , V ), it follows the desired result. ⊓⊔ Remark 1.
Under independent binomial sampling in each domain, it is well-known that V = diag { π (cid:0) x Ti β (cid:1) (cid:0) − π (cid:0) x Ti β (cid:1)(cid:1) } i =1 ,...,I diag − ( W ) andhence V ( β ) = ( X T ∆X ) − , which matches Theorem 2 in Pardo et al.[5]. Remark 2.
The asymptotic results obtained in the current paper differ fromCastilla et al. [2] in the elements tending to infinite, here the total individualsin the whole sample, n , while in the cited paper is the total number of clusterswhat tends to infinite. In order to obtain the PM φ Es, from a practical point of view, we can givean explicit expression for φ . In this paper we shall focus on the Cressie-Readsubfamily φ λ ( x ) = ( λ (1+ λ ) (cid:2) x λ +1 − x − λ ( x − (cid:3) , λ ∈ R − {− , } lim υ → λ υ (1+ υ ) (cid:2) x υ +1 − x − υ ( x − (cid:3) , λ ∈ {− , } . We can observe that for λ = 0, we have φ λ =0 ( x ) = lim υ → υ (1 + υ ) (cid:2) x υ +1 − x − υ ( x − (cid:3) = x log x − x + 1 , and the associated phi-divergence, coincides with the Kullback divergence,therefore the PM φ Es based on φ λ ( x ) contains as special case the PMLE.We shall consider the example presented in Molina et al. [4]. A randomsubsample of 50 clusters (primary sampling units) containing 1299 house-holds was selected from the 1975 U.K. Family Expenditure Survey. Thesehouseholds are divided into 12 groups of sizes n , ..., n by age of head ofhousehold (4 levels) and number of persons in the household (3 levels). Thebinary response is 1 if the household owns the dwelling it occupies and 0 logistic regression analysis approach for sample survey data 7 otherwise. The number of households r i for which the binary response is 1,together with n i are shown in Table 1 of the cited paper.We denote by β r ) the parameter associated to the level r of the factor“age of head of housholds”, r = 2 , β = 0 and by β s ) theparameter associated to the level s of the factor “number of persons in thehousholds”, s = 2 , , since we assume β = 0. The parameter vector withunknown values will be denote by β = ( β , β , β , β , β , β ) T . The design matrix that we are going to consider for the example under con-sideration is given by X = T = ( x , ..., x ) T and the logistic regression model under consideration is given by π (cid:0) x Ti β (cid:1) = exp (cid:8) x Ti β (cid:9) (cid:8) x Ti β (cid:9) , i = 1 , ..., , equivalent to π (cid:0) x Ti β (cid:1) = exp { β + β r ) + β s ) } { β + β r ) + β s ) } , if the i -th probability is associated with the r -th level of the first variable ( r =1 , ...,
4) and the s -th level of the second variable ( s = 1 , ..., β , β λ,P , for λ ∈ { , / , , } . Table 1
PMCREs for the clustered Family Expenditure Survey data model. λ b β ,λ,P b β ,λ,P b β ,λ,P b β ,λ,P b β ,λ,P b β ,λ,P − − − − − − − − − − − − The following simulation study has been designed by following the previousexample. Since in the logistic regression model there are two factors, the firstone with 4 categories and the second one with 3 categories, in total I = 12domains are taken into account. Let p ( β ) = (cid:0) N N π (cid:0) x T β (cid:1) , N N (cid:0) − π (cid:0) x T β (cid:1)(cid:1) , ..., N I N π (cid:0) x TI β (cid:1) , N I N (cid:0) − π (cid:0) x TI β (cid:1)(cid:1)(cid:1) T be the theoretical probability vector in the logistic regression with complexsampling. The values of the components of p ( β ) are given in Table 2. Intotal n = 1299 individuals are taken from the primary units of the sample, J = 50 clusters, of size m ( j ) = 26, j = 1 , ..., m (50) = 25 ( P j =1 m ( j ) = n ).Since the clusters are mutually independent and there is (possibly) correlationinside each cluster, we consider three possible distributions for( n j ) , n j ) − n j ) , n j ) , n j ) − n j ) , . . . , n , j ) , n j ) − n , j ) ) T corresponding to the j -th cluster (column, in Table 3), j = 1 , ..., J = 50: • Dirichlet-multinomial with parameters ( m ( j ) ; ρ, p ( β )), with ρ ∈ { ( i − } i =1 ; • Random-clumped with parameters ( m ( j ) ; ρ, p ( β )), with ρ ∈ { ( i − } i =1 ; • m ( j ) -inflated with parameters ( m ( j ) ; ρ, p ( β )), with ρ ∈ { ( i − } i =1 .For details about these distributions see Alonso et al. [1]. The values ofinterest for the sample are n i = X j =1 n j ) , i = 1 , ..., I and n i = X j =1 ( n j ) + n j ) ) , i = 1 , ..., I. Table 2
Theoretical values of p ( β ) in the simulation study. i I = 12 N i N π (cid:0) x Ti β (cid:1)
210 3863 65110 614 2935 188281 1740 56110 105185 78204 93196 2151 N i N π (cid:0) x Ti β (cid:1) Notice that the assumptions of Theorem 1 are held. In addition: • If ρ = 0 (multinomial distribution within each cluster), then V is a diago-nal matrix since the elements of b p are uncorrelated. In this case, we obtainMLEs and M φ Es. • If ρ >
0, then V is not a diagonal matrix since the elements of b p arecorrelated. In this case, we obtain PMLEs and PM φ Es. logistic regression analysis approach for sample survey data 9
Table 3
Scheme of a correlated sample generation through clusters. i j · · · j · · · J = 50 sample1 k = 1 k = 2 n n n n · · · · · · n n n n k = 1 k = 2 n n n n · · · · · · n n n n ... ... ... . .. ... ... i k = 1 k = 2 n i n i n i n i n i j ) n i j ) n i n i n i n i ... ... ... ... ... ... I = 12 k = 1 k = 2 n , n , n , n , · · · · · · n , n , n , n , m (1) m (2) · · · m ( j ) · · · m (50) n In these scenarios, the root of the mean square error (RMSE) for the PM-CREs of β are studied, considering different values of the tuning parameter λ ∈ { , / , , } . Note that when λ = 0, the corresponding PMCRE of β isequal to the PMLE.Results of the simulation study with 2,000 samples are shown in Figure1. As expected from a theoretical point of view, the RMSE increases as ρ increases. With independence to the distribution considered, estimators cor-responding to λ ∈ { / , , } present a better performance than the PMLE( λ = 0). This difference becomes more considerable for large values of ρ . In this paper we have considered the problem of estimating the parameters ofthe logistic regression model for sample survey data, introducing the familyof the PM φ Es that contains as a particular case the PMLE. A simulationstudy is carried out in order to see that there are PM φ Es that have a betterbehaviour than the PMLE in relation to the mean square error. (cid:1) R M S E ( (cid:2) ) Dirichlet-multinomial λ (cid:4) R M S E ( (cid:5) ) Random-clumped λ (cid:6) R M S E ( (cid:7) ) m-inflated λ Fig. 1
RMSEs for PMCREs of β with Dirichlet-multinomial (above), Random-clumped(middle) and m-inflated (below) distributions. logistic regression analysis approach for sample survey data 11 Acknowledgements
This research is partially supported by Grants MTM2015-67057-P(MINECO/FEDER) and ECO2015-66593, both from Ministerio de Economia y Com-petitividad (Spain).
References
1. Alonso-Revenga, J.M., Mart´ın, N. & Pardo, L. (2016): New improved estimators foroverdispersion in models with clustered multinomial data and unequal cluster sizes.Statistics and Computing, doi:10.1007/s11222-015-9616-z.2. Castilla, E., Martin, N. and Pardo, L. (2016). Pseudo minimum phi-divergenceestimator for multinomial logistic regression with complex sample design.https://arxiv.org/abs/1606.01009.3. Cressie, N. and L. Pardo (2003). Phi-divergencia statistic. In A.H. El-Shaarawi andW.W. Piegorsch (Eds.), Encyclopedia of Environmetrics, Vol. 3 (pp. 1551–1555). NewYork: Wiley.4. Molina, E. A., Skinner, C. J. (1992). Pseudo-likelihood and quasi-likelihood estimationfor complex sampling schemes. Computational Statistics & Data Analysis 13:395-405.5. Pardo, J. A., Pardo, M. C. and Pardo, L. (2005). Minimum φ -divergence estimator inlogistic regression models. Statistical Papers 47:91-108.6. Pardo, J. A., Pardo, M. C. and Pardo, L. (2006). Testing in logistic regression modelsbased on φφ