Fully Bayesian Estimation under Dependent and Informative Cluster Sampling
OO R I G I N A L A R T I C L E
Fully Bayesian Estimation under Dependent andInformative Cluster Sampling
Luis G. León-Novelo PhD | Terrance D. Savitsky PhD Department of Biostatistics and DataScience, University of Texas Health ScienceCenter at Houston- School of Public Health,Houston, TX, 77030, USA Office of Survey Methods Research,Bureau of Labor Statistics, Washington, DC,20212-0001, USA
Correspondence
Luis León-Novelo PhD, Department ofBiostatistics and Data Science, Universityof Texas Health Science Center at Houston-School of Public Health, 1200 Pressler St.Suite E805, Houston, TX, 77030, USAEmail: [email protected]
Present address † Department of Biostatistics and DataScience, University of Texas Health ScienceCenter at Houston- School of Public Health,1200 Pressler St. Suite E805, Houston, TX,77030, USA
Funding information
Survey data are often collected under multistage samplingdesigns where units are binned to clusters that are sam-pled in a first stage. The unit-indexed population variablesof interest are typically dependent within cluster. We pro-pose a Fully Bayesian method that constructs an exact like-lihood for the observed sample to incorporate unit-levelmarginal sampling weights for performing unbiased infer-ence for population parameters while simultaneously ac-counting for the dependence induced by sampling clustersof units to produce correct uncertainty quantification. Ourapproach parameterizes cluster-indexed random effects inboth a marginal model for the response and a conditionalmodel for published, unit-level sampling weights. We com-pare our method to plug-in Bayesian and frequentist alter-natives in a simulation study and demonstrate that our methodmost closely achieves correct uncertainty quantification formodel parameters, including the generating variances forcluster-indexed random effects. We demonstrate our methodin two applications with NHANES data.
K E Y W O R D S
Cluster sampling; Inclusion probabilities; Informative sampling;
Abbreviations:
BMI, Body Mass Index; CI, Credible Interval; cSRS, cluster simple random sampling; dep- y , association among theresponses within the same PSU; dep- π , association among inclusion probabilities within the same PSU; LS2019, León-Novelo andSavitsky[2019]; IS, Informative Sampling; kcal, kilocalories; NHANES, National Health and Nutrition Examination Survey; MSE, MeanSquare Error; PBF, Proportion of Body Fat; PPS, probability proportional to size; PSU, primary stage sampling unit; PSU-REs, PSU-specific random effects; SE, Standard Error; SLR, Simple Linear Regression; SRS, Simple Random Sampling. * Equally contributing authors. a r X i v : . [ s t a t . M E ] J a n Bayes Estimation/Informative Cluster Sampling
Mixed effects linear model; NHANES; primary stage samplingunit; Sampling weights; Survey sampling | INTRODUCTION
Inference with data from a complex sampling scheme, such as that collected by the National Health and NutritionExamination Survey (NHANES), requires consideration of the sampling design. A common multistage sampling schemein public survey datasets: i Divide survey population into H strata. ii Each stratum contains N h clusters of individuals called primary stage sampling units (PSU) from which J h PSUs areselected. PSU hj is selected with probability π hj . By design, at least one PSU is selected in each stratum, i.e. J h ≥ , (cid:91) h . iii Within each selected PSU, n hj individuals are sampled out of the total N hj in the PSU. Each individual is sampledwith probability π i | hj , i = 1 , . . . , N hj . We call a sampled individual survey participant.The indices i , j , h index individual, PSU and stratum, respectively. The marginal probability of including an individualin the sample is then π (cid:48) ihj = π i | hj π hj .In addition to sampling clusters of dependent individuals, both clusters and individuals-within-clusters are typ-ically selected with unequal sampling inclusion probabilities in order to improve estimation power for a populationsubgroup or to reduce variance of a global estimator. The sample inclusion probabilities are constructed to be corre-lated with or “informative" about the response variable of interest to reduce variance of the estimator. On the onehand, stratification reduces the standard error (SE) of estimates while, on the other hand, clustering tends to increaseit since clustering induces dependence and is used for convenience and to reduce cost. Utilizing unequal inclusionprobabilities can reduce the variance of the estimator where a subset of units drives the computation, such as is thecase where larger-sized employers drive the estimation of total employment for the Current Employment Statisticssurvey administered by the U.S. Bureau of Labor Statistics; more often, the use of unequal inclusion probabilitiestends to increase the variance of the estimator due to the variation in the information about the population reflectedin observed samples. Ignoring PSU and unequal sampling inclusion probabilities underestimates the SE because ofthe dependence among individuals within a PSU and the variation of information about the population reflected ininformative samples drawn from it.The statistical analyst receives variables of interest for each survey participant along with the stratum and PSUidentifiers to which s/he belongs to as well as sampling weights, w ihj ∝ / π ihj . The inclusion probability, π ihj , is aversion of π (cid:48) ihj after adjusting for oversampling of subpopulations and nonresponse.In the context of NHANES, a stratum is defined by the intersection of geography with concentrations of minoritypopulations and a PSU is constructed as a county or a group of geographically continuous counties. Secondary andtertiary stage sampling units include segments (contiguous census blocks) and households. The final unit is an eligibleparticipant in the selected household. NHANES released masked stratum and PSU information to protect participant’sprivacy. Every 2-year NHANES-data cycle (CDC, 2011) releases information obtained from H = 15 strata with n h = 2 PSU per stratum.In this paper, we focus on a two-stage sampling design characterized by the first stage sampling of PSUs and,subsequent, second stage sampling of units without loss of generality. We employ a fully Bayesian estimation ap- ayes Estimation/Informative Cluster Sampling 3 proach that co-models the response variable of interest and the marginal inclusion probabilities as is introduced inLeón-Novelo and Savitsky (2019), hereafter referred to LS2019. We extend their approach by constructing PSU-indexed random effects specified in both the marginal model for the response variable and the conditional model forthe sampling inclusion probabilities.Since, from now on, we do not deal with strata we do not consider point i and subindex h in the discussion above.Sampled individual, ( ij ) , denotes individual i ∈ ( , . . . , n j ) in cluster j ∈ ( , . . . , J pop ) included in the sample, where J pop denotes the total number of PSUs included in the population. Let J ≤ J pop denote the number of PSUs actuallysampled. We assume that the sampling weight, w ij , is proportional to the inverse marginal inclusion probability, π ij ofindividual ij included in the sample; or π ij ∝ / w ij . We denote the vector of predictors associated to individual ij as x ij . The data analyst aims to estimate the parameters, θ , of a population model, p ( y | θ , x ) , that they specify fromthese data. Relabeling PSU indices in the sample so they run from , . . . , J , the analyst observes sample of size, n = (cid:205) Jj =1 n j , and the associated variables, { y ( s ) ij , x ( s ) ij , π ( s ) ij ∝ / w ( s ) ij , j } i =1 ,..., n j , j =1 ,..., J with n j the number of participantsfrom PSU j and superindex ( s ) denotes in the sample. By contrast, y ij without index ( s ) denotes a response of anindividual in the survey population but not, necessarily, a survey participant included in the observed sample.The probability of inclusion of each PSU (denoted as π j in point ii, above) is unknown to the analyst becauseit is not typically published for the observed sample, though the PSU inclusion probabilities are used to constructpublished unit marginal inclusion probabilities, (such that inclusion probabilities within the same PSU tend to be moresimilar or correlated), but the dependence of units in the same PSU may not be fully accounted for by the dependenceon their inclusion probabilities.A sampling design is informative for inference about individuals within a group when y ij (cid:54)⊥ π ij | x ij . A samplingdesign will also be informative for PSUs in the case that ¯ y · j − ¯ y = ( / N j ) (cid:205) N j i =1 y ij − ¯ y (cid:54)⊥ π j | ¯ x j with ¯ x j = ( / N j ) (cid:205) N j i =1 x ij .Even if a sampling design is not informative for individuals and/or groups, however, there are typically scale effectsinduced by within group dependence that must be accounted for to produce correct uncertainty quantification.LS2019 propose a model-based Bayesian approach appropriate under informative sampling that incorporates thesampling weights into the model by modelling both the response given the parameter of interest and the inclusionprobability given the response, π ij | y ij . The main advantages of this approach is that it yields (1) consistent pointestimates [LS2019] (point estimates converge in probability to true values), (2) credible intervals that achieve nominal(frequentist) coverage, and (3) robust inference against mis-specification of π ij | y ij .LS2019 only focus on fixed effect models and ignore the dependence induced by the sampling design; that is,both association among the responses within the same PSU (that we label, dep- y ), and possible association amonginclusion probabilities within the same PSU (that we label, dep- π ). This paper extends the approach of LS2019 toaccount for these associations via mixed effect models. More specifically, we include PSU-specific random effects(PSU-REs) in both the model for the responses and in the model for the inclusion probabilities. We introduce our gen-eral approach in Section 3, but, in the rest of the paper, we focus on the linear regression setting. Competing methodsare summarized in this section as well. In Section 4, we show via simulation that our approach yields credible intervalswith nominal (frequentist) coverage, while the competing methods do not in some simulation scenarios. In Section5 we demonstrate our approach by applying it to two NHANES datasets. The first application models the relation-ship between percentage of body fat to body mass index while the second estimates the daily kcal consumption ofdifferent demographic groups in the U.S. population. Inference under our Fully Bayes approach is compared againstinference under competing plug-in Bayesian and frequentist methods. We provide a final discussion section and anAppendix containing details not discussed, but referred to in the main paper. Bayes Estimation/Informative Cluster Sampling | REVIEW OF LS2019
LS2019 introduces the inclusion probabilities into the Bayesian paradigm by assuming them to be random. In thissection we review their approach to extend it to include PSU information, in the next section. The superpopulationapproach in LS2019 assumes that the finite population of size N , ( y , π , x ) , . . . ( y N , π N , x N ) is a realization of ( y i , π i ) | x i , θ , κ ∼ p ( y i , π i | x i , θ , κ ) = p ( π i | y i , x i , κ ) p ( y i | x i , θ ) , i = 1 , . . . , N . (1)Here, y i is the response for individual i with vector of covariates x i and π i > is proportional to probability ofindividual i being sampled. It is assumed that ( y i , π i ) ⊥ ( y i (cid:48) , π i (cid:48) ) | x i , x i (cid:48) , θ , κ , for i (cid:44) i (cid:48) and x i is not modelled. Notealso that equation above presumes that π i ⊥ θ | y i , x i , κ and y i ⊥ κ | x i , θ . The population parameter θ determinesthe relationship between the x i and y i , and is of main interest. The parameter κ is a nuisance parameter that allowsmodeling the association between π i and y i , though we later see in our simulation study section that it provides insighton the informativeness of the sampling design for a particular response variable of interest. The informative sampleof size n is drawn so that P [ individual i in sample ] = π i (cid:46) (cid:205) Ni (cid:48) =1 π i (cid:48) . Bayes theorem implies, p ( y i , π i | x i , θ , κ , individual i in sample ) = Pr ( individual i in sample | y i , π i , x i , θ , κ ) × p ( y i , π i | x i , θ , κ ) denominator , (2)By the way the informative sample is drawn, and (1), the numerator in (2) is π i (cid:205) Ni (cid:48) =1 π i (cid:48) × p ( π i | y i , x i , κ ) p ( y i | x i , θ ) . (3)The denominator is obtained by integrating out ( y i , π i ) in the numerator, this is (cid:205) Ni (cid:48) =1 π i (cid:48) × ∫ π (cid:63) i p ( π (cid:63) i | y (cid:63) i , x i , κ ) p ( y (cid:63) i | x i , θ ) d π (cid:63) i dy (cid:63) i = 1 (cid:205) Ni (cid:48) =1 π i (cid:48) × E y (cid:63) i | x i , θ (cid:2) E (cid:2) π (cid:63) i | y (cid:63) i , x i , κ (cid:3)(cid:3) (4)The superindex (cid:63) is used to distinguish the quantities integrated out from the ones in the numerator. Plugging (3) and(4) in (2) we obtain equation (5) in LS2019 given by, p s ( y i , π i | x i , θ , κ ) = π i p ( π i | y i , x i , κ ) E y (cid:63) i | x i , θ (cid:2) E ( π (cid:63) i | y (cid:63) i , x i , κ ) (cid:3) × p ( y i | x i , θ ) (5)Where the LHS, p s (· · · | · · · ) , denotes the joint distribution of ( y i , π i ) conditioned on the individual i being in thesample, i.e. , the LHS of (2) that is equal to p (· · · | · · · , individual i in sample ) . Inference is based on this exact likelihoodfor the observed sample with, (cid:96) ( θ , κ ; y ( s ) , π ( s ) , x ( s ) ) = n (cid:214) i =1 (cid:104) p s ( y ( s ) i , π ( s ) i | x i ( s ) , θ , κ ) (cid:105) where the superindex ( s ) is used to emphasize that these are the values observed in the sample. We also relabel theindex i running from , . . . , N in the population so it runs from , . . . , n in the sample. A Bayesian inference model iscompleted by assigning priors to θ and κ .Notice that the π i s are proportional, and not equal, to the inclusion probabilities and besides being positive, no ayes Estimation/Informative Cluster Sampling 5 restriction is imposed on (cid:205) ni =1 π ( s ) i or (cid:205) Ni =1 π i . Also note that under noninformative sampling, i.e. when y i ⊥ π i | x i ,the quantity between curvy brackets in (5) does not depend on y i and therefore inference on θ does not depend onthe inclusion probabilities, or the π i s. In other words, inference using (5) is the same as if treating the sample as an SRSfrom the model y i ∼ p ( y i | x i , θ ) . For the informative sampling case, in theory, we can assume any distribution for y i | x i , θ and π i | y i , x i , κ . In practice, the calculation of E y (cid:63) i | x i , θ [· · · ] in (5) is a computational bottleneck. Theorem 1in LS2019, stated below, provides conditions to obtain a closed form for this expected value.Let v i and u i be subvectors of x i , the covariates used to specify the conditional distribution of π i | y i , x i , κ and y i | x i , θ , respectively; that is, π i | y i , x i , κ ∼ π i | y i , v i , κ and y i | x i , θ ∼ y i | u i , θ . Note that we allow for v i and u i to have common covariates. Let normal ( x | µ , s ) denote the normal distribution pdf with mean µ and variance s evaluated at x , and lognormal (· | µ , s ) denote the lognormal pdf, so that X ∼ lognormal ( µ , s ) is equivalent tolog X ∼ normal ( µ , s ) . Theorem 1 (Theorem 1 in LS2019) If p ( π i | y i , v i , κ ) = lognormal ( π i | h ( y i , v i , κ ) , σ π ) , with the function h ( y i , v i , κ ) ofthe form h ( y i , v i , κ ) = g ( y i , v i , κ ) + t ( v i , κ ) where σ π = σ π ( κ , v i ) , possibly a function of ( κ , v i ) then p s ( y i , π i | u i , v i , θ , κ ) = normal (cid:16) log π i | g ( y i , v i , κ ) + t ( v i , κ ) , σ π (cid:17) exp (cid:8) t ( v i , κ ) + σ π / (cid:9) × M y ( κ ; u i , v i , θ ) × p ( y i | u i , θ ) with M y ( κ ; u i , v i , θ ) : = E y (cid:63) i | u i , θ (cid:2) exp (cid:8) g ( y (cid:63) i , v i , κ ) (cid:9)(cid:3) . If both M y and p ( y i | · · · ) admit closed form expressions, then p s ( y i , π i | · · · ) has a closed form, as well; forexample, when g ( y i , v i , κ ) = κ y y i with κ y ∈ κ ∈ (cid:210) , then M y ( κ ; u i , v i , θ ) is the moment generating function (MGF)of y i | θ evaluated at κ y , which may have a closed form defined on (cid:210) . This implies a closed form for p s ( y i , π i |· · · ) . Analogously, we may consider an interaction between y and v , using g ( y i , v i , κ ) = ( κ y + v ti κ v ) y i ≡ t y i with κ = ( κ y , κ v ) . In this case, we achieve, M y ( t ; · · · ) , which is the MGF evaluated at t . As mentioned in LS2019, theassumption of a lognormal distribution for π i is mathematically appealing. The inclusion probability, ∝ π i , for individual, i , is composed from the product of inclusion probabilities of selection across the stages of the multistage survey design.If each of these stagewise probabilities are lognormal then their product, ∝ π i , is lognormal as well. This is particularlyhelpful in the setting including PSU in next section. | INCLUSION OF PSU INFORMATION INTO FULLY BAYESIAN APPROACH
In Subsection 3.1, we extend the approach in LS2019, reiviewed in Section 2, that co-models the response variableand sampling weights, modifying their notation by adding cluster-indexed parameters, in preparation to include PSUinformation into the analysis in Subsection 3.2 to capture within PSU dependence in the response and sample inclusionprobabilities. In Subsection 3.3 we introduce the Fully Bayes joint population model for the response and the sampleinclusion probabilities in the linear regression case. In Subsections 3.4 and 3.5 we briefly review competing approachesto analyze informative samples that we will compare in a simulation study.
Bayes Estimation/Informative Cluster Sampling | Extend Joint Population Model to incorporate PSU-indexed Parameters
We assume a population with a total of J pop PSUs and size N = (cid:205) Jj =1 N j with N j the number of individuals in PSU j .More specifically, the population consists of ( y , , π | , x , ) , . . . , ( y N , , π N | , x N , ) (cid:124) (cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32) (cid:123)(cid:122) (cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32) (cid:125) PSU , . . . , ( y , j , π | j , x , j ) , . . . , ( y N j , j , π N j | j , x N j , j ) (cid:124) (cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32) (cid:123)(cid:122) (cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32) (cid:125) PSU j , . . . ( y , J pop , π | J pop , x , J pop ) , . . . , ( y N Jpop , J pop , π N J pop | J pop , x N Jpop , J pop ) (cid:124) (cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32) (cid:123)(cid:122) (cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32) (cid:125) PSU J pop and also of, π , π , . . . , π j , . . . , π J pop > with π i | j > , (cid:91) i , j . Notice that, as before, we do not impose any restriction on (cid:205) N j i =1 π i | j or (cid:205) J pop j =1 π j . The sample ofsize n = (cid:205) Jj =1 n j (with n j and J specified by the survey sampler) is drawn in two steps:• Step 1: PSU sampling. Sample J different PSUs j , . . . , j J ∈ { , . . . , J pop } so that P r [ PSU j is in the sample ] = π j (cid:30) J pop (cid:213) j (cid:48) =1 π j (cid:48) • Step 2: Individual sampling. Within each PSU in sample j (cid:96) , (cid:96) = 1 , . . . , J , draw n (cid:96) different individuals so thatindividual i (in the sampled PSU j ) is in the sample with probability P [ Individual ij is in the sample | PSU j is in the sample ] = π i | j (cid:30) N j (cid:213) i (cid:48) =1 π i (cid:48) | j and therefore the marginal inclusion probability is proportional to π ij : = π j π i | j .The superpopulation approach assumes that the population is a realization of the joint distribution for responseand inclusion probabilities, ( y ij , π ij ) | x ij , θ , η yj , κ , π j ∼ p ( y ij , π ij | x ij , θ , η yj , κ , π j ) = p ( π ij | y ij , x ij , κ , π j ) p ( y ij | x ij , θ , η yj ) , (6)We model π ij | y ij with p ( π ij | y ij , x ij , κ , π j ) . The (population) parameter of interest is θ . We have augmentedthe parameters used in LS2019, given in (1), to incorporate η yj and π j that are shared by all observations in PSU j .Parameters, η yj ,induce a correlation in the response for individuals in the same PSU (dep- y ) while π j induces associ-ation of marginal inclusion probabilities (dep- π ) among respondents nested in the same PSU. We will later constructpriors on these parameters to define PSU-indexed random effects. κ is a nuisance parameter used to model the inclu-sion probabilities. After relabeling the sampled PSU indices j , . . . j J to , . . . J , and the indices i in the sample to runfrom i = 1 , . . . , n j , the sample of size n = (cid:205) Jj =1 n j consists of { y ( s ) ij , x ( s ) ij , π ( s ) ij , j } i =1 ,..., n j ; j =1 ,..., J with j indicating fromwhich PSU the individual was sampled, n j the number of participants from PSU j , and J the total number of sampledPSUs. Recall, superindex ( s ) denotes in the sample. The equality in (6) assumes that y ij ⊥ ( κ , π j ) | x ij , θ , η yj and ayes Estimation/Informative Cluster Sampling 7 π ij ⊥ ( θ , η yj ) | y ij , x ij , κ , π j .We extend (5) that captures the joint probability model for the sample by replacing θ and κ with ( ( θ , η y ) and ( κ , π j ) ) to achieve, p s ( y ij , π ij | x ij , θ , η yj , κ , π j ) = π ij p ( π ij | y ij , x ij , κ , π j ) E y (cid:63) ij | x ij , θ , η yj (cid:104) E ( π (cid:63) ij | y (cid:48) ij , x ij , κ , π j ) (cid:105) × p ( y ij | x ij , θ , η yj ) . (7)The subindex s on the joint distribution p s on the LHS denotes that we are conditioning on individual ij being in thesample; that is, p s ( y ij , π ij | . . . ) = p ( y ij , π ij | individual ij is in the sample , . . . ) . In contrast the distributions on theRHS are population distributions. Inference on θ is based on the likelihood (cid:96) ( θ , η y , κ , π ; y ( s ) , π ( s ) , x ( s ) ) = J (cid:214) j =1 n j (cid:214) i =1 (cid:104) p s ( y ( s ) ij , π ( s ) ij | x ( s ) ij , θ , η yj , κ , π j ) (cid:105) with η y : = ( η y , . . . , η yJ ) , π : = ( π , . . . , π J ) . Inference for θ is achieved via the posterior distribution of the modelparameters: p s ( θ , η y , κ , π | y ( s ) , π ( s ) , x ( s ) ) ∝ (cid:96) ( θ , η y , κ ; y ( s ) , π ( s ) , x ( s ) ) × Prior ( θ ) × Prior ( η y ) × Prior ( π ) × Prior ( κ ) . To obtain a closed form of the likelihood we need a closed form of the expected value in the denominator inEquation (7). Theorem 1 (same as Theorem 1 in LS2019) is here extended under our extended PSU-indexed param-eterization, ( ( θ , η y ) and ( κ , π j ) ), to provide conditions that allow a closed form expression of this expected value.Similar to the set-up for Theorem 1, let v and u be subvectors of x , the covariates used to specify the conditionaldistribution of π ij | y , x , κ , π j and y | x , θ , η y , respectively; that is, π ij | y ij , x ij , κ , π j ∼ π ij | y ij , v ij , κ , π j and y ij | x ij , θ , η y ∼ y ij | u ij , θ , η y . Theorem 2 If p ( π ij | y ij , v ij , κ , π j ) = lognormal ( π ij | h ( y ij , v ij , κ , π j ) , σ π ) , with the function h ( y ij , v ij , κ , π j ) of theform h ( y ij , v ij , κ , π j ) = g ( y ij , v ij , κ ) + t ( v ij , κ , π j ) where σ π = σ π ( v ij , κ , π j ) , possibly a function of ( v ij , κ , π j ) then p s ( y ij , π ij | u ij , v ij , θ , η yj , κ , π j ) = normal (cid:16) log π ij | g ( y ij , v ij , κ ) + t ( v ij , κ , π j ) , σ π (cid:17) exp (cid:8) t ( v ij , κ , π j ) + σ π / (cid:9) × M y ( κ ; u ij , v ij , θ , η yj ) × p ( y ij | u ij , θ , η yj ) with M y ( κ ; u ij , v ij , θ ) : = E y (cid:63) ij | u ij , θ , η yj (cid:104) exp (cid:110) g ( y (cid:63) ij , v ij , κ ) (cid:111)(cid:105) . So, analogously to the discussion after Theorem 1, if g ( y , v , κ ) = κ y y with κ y depending on κ and, perhaps, on v ,then M y ( κ ; u ij , v ij , θ , η yj ) : = E y (cid:63) ij | u ij , θ , η yj (cid:104) exp (cid:16) κ y y (cid:63) ij (cid:17)(cid:105) is the moment generating function of y evaluated at κ y . So when both the population distribution of y , p ( y | θ ) , hasa close form and the moment generating function has a close form over the real line, then the likelihood, p s , has a Bayes Estimation/Informative Cluster Sampling closed form, as well. | Inclusion of PSU Information into Conditional Population Model for Weights
The marginal inclusion probability of the individual ij , ∝ π ij , is the product of the probability of selecting PSU j , ∝ π j ,and the probability of selecting the individual i conditioning on PSU being in the sample, ∝ π i | j such that π ij = π j π i | j . Therefore, log π ij = log π i | j + log π j Specifying, log π j ∼ normal ( µ j , σ P SU ) where µ j could depend on PSU covariates (e.g. county population, etc) but, forsimplicity, we assume that it does not and set µ j = 0 . Choosing a normal distribution for log π i | j ∼ normal ( g ( y , v , κ ) + t (cid:48) ( v , κ ) , σ π ) yields log π ij | y ij , v ij , κ , η πj ∼ normal ( g ( y ij , v ij , κ ) + t (cid:48) ( v ij , κ ) + η πj , σ π ) (8)with η πj : = log π j iid ∼ Normal ( , σ P SU ) PSU-specific random effects. So defining t ( v , κ , π j ) : = t (cid:48) ( v , κ ) + log π j = t (cid:48) ( v , κ ) + η πj , the distribution of π ij satisfies the conditions of Theorem 2.This set-up is coherent with our assumption that the data analyst does not have information about the PSU-indexed sampling weights for either the population or sampled units because they are not typically published bysurvey administrators. Nevertheless, our derivation of (8) by factoring the marginal inclusion probabilities, π ij , demon-strates how we may capture within PSU dependence among ( π ij ) by inclusion of random effects, η πj . Our Fully Bayesconstruction has the feature that it does not require availability of PSU-indexed random effects, unlike the pseudo-likelihood approach of Williams and Savitsky (2020). | Linear Regression Joint Population Model
Assume the linear regression model for the population, y ij | u ij , θ , η yj ∼ normal (cid:16) u tij β + η yj , σ y (cid:17) , with θ = ( β , σ y ) (9)with the PSU-specific random effect η yj in (9) playing the roll of η yj in (6). The conditional population model forinclusion probabilities is specified as in (8), with π ij | y ij , v ij , κ , η πj ∼ lognormal (cid:16) κ y y ij + v tij κ v + η πj , σ π (cid:17) , with κ = ( κ y , κ v , σ π ) (10)This construction results from setting, g ( y ij , v ij , κ ) = k y y ij , t ( v ij , κ , π j ) = v tij κ v + η πj (remember η πj = log π j ), and σ π ( κ , v ij , π j ) = σ π in (8). Here β and κ v are vectors of regression coefficients that include an intercept, so the first ayes Estimation/Informative Cluster Sampling 9 entry of both u ij and v ij equals 1. We select prior distributions, β ∼ MVN ( , I ) κ ∼ MVN ( , I ) η y , . . . , η yJ iid ∼ normal ( , σ η y ) η π , . . . , η πJ iid ∼ normal ( , σ η π ) σ y , σ π , σ η y , σ η π iid ∼ normal + ( , ) (11)with normal + ( m , s ) denoting a normal distribution with mean m and variance s restricted to the positive real line;MVN ( m , Σ ) denotes the multivariate normal distribution with mean vector m and variance-covariance matrix Σ ; and I the identity matrix. Since y ∼ normal ( m , s ) admits a closed form expression for moment generating function M y ( t ) = exp ( t m + t s / ) , we apply Theorem 2 to obtain, p s (cid:16) y ij , π ij | u ij , v ij , θ , η yj , κ , η πj (cid:17) = normal (cid:16) log π ij | κ y y ij + v tij κ v + η πj , σ π (cid:17) exp (cid:110) v tij κ v + η πj + σ π / κ y ( u tij β + η yj ) + κ y σ y / (cid:111) (12) × normal (cid:16) y ij | u tij β + η yj , σ y (cid:17) The implementation of the Gibbs sampler is not straightforward. To obtain a posterior sample of the model parameters,we rely on the ‘black box" solver, “Stan" (Carpenter et al., 2016), which performs an efficiently-mixing HamiltonianMonte Carlo sampling algorithm with a feature that non-conjugate model specifications are readily accommodated. | Pseudolikelihood Approach
In a related work, Savitsky and Williams (2019) propose an approach to incorporate sampling weights using a plug-inaugmented pseudolikelihood: J (cid:214) j =1 n j (cid:214) i =1 p ( y ( s ) ij | θ ) w ( s ) ij × J (cid:214) j =1 p ( η yj | σ η y ) w ( s ) j (13)where an augmented pseudolikelihood exponentiates the random effects prior for η yj by the PSU-indexed samplingweight, w ( s ) j , to permit asymptotically unbiased inference about σ η y . They standardize marginal individual samplingweights so that (cid:205) Jj =1 (cid:205) n j i =1 w ( s ) ij = n and the PSU sampling weights are standardized as, (cid:205) Jj =1 w ( s ) j = J to approximatelyreflect the amount of posterior uncertainty in the sample. Nevertheless, as in LS2019, they did not account for eitherdep- y nor dep- π , so resultant credibility intervals are overly optimistic (short). In a related work, Williams and Savitsky(2020) propose to utilize a separate, post-processing step applied to the pseudoposterior samples produces posteriordraws with the sandwich ( H − ∗ J ∗ H − ) variance estimator of the pseudo MLE to account for the dependenceinduced by clustering. Their post-processing step, however, assumes a pseudolikelihood without random effects;that is, they assume a fixed effects model and show asymptotically correct uncertainty quantification under clustersampling where their sandwich adjustment captures the PSU-induced dependence. This approach, however, does notpermit inference on PSU-indexed random effects. Their post-processing algorithm is computationally-intensive as it requires re-sampling PSUs and units within PSUs to estimate the sandwich variance construction.Since it is usually the case that the ( w ( s ) j ) are not published with the survey, they propose to use instead w ( s )· j = (cid:205) n j i =1 w ( s ) ij . The sampling weights not only play a role for individual-indexed pseudolikelihood, but also on the weightingof the prior of the PSU-indexed random effects. Inference on θ is done treating the pseudoposterior as if it were aposterior distribution. In Section 4 we label this approach, “Pseudo.w". In Section 4, we will also study the performanceof the pseudolikelihood approach when the prior of the random effects is not weighted (equivalent to w ( s ) j : = 1 in(13)) and label this approach, “Pseudo".Related frequentist approaches of Pfeffermann et al. (1998) and (Rabe-Hesketh and Skrondal, 2006) also employan augmented likelihood similar to that of (13) where they integrate out the η yj to perform estimation. They also focuson consistent estimation of parameters, rather than correct uncertainty quantification.We will see in the sequel that because our approach uses a joint model for ( y ij , π i | x ij ) , the asymptotic covariancematrix of the joint posterior is H − , which is the same for the MLE, resulting in correct uncertainty quantification.In the context of the linear regression in (9), the quantity between square brackets in (13) matches the likelihoodof the weighted regression model y ij | u ij , θ , η yj ∼ normal (cid:16) u tij β + η yj , σ y / w ij (cid:17) (See Subsection A.3 for details.) So (13)becomes J (cid:214) j =1 n j (cid:214) i =1 normal (cid:16) y ( s ) ij | u tij β + η yj , σ y / w ( s ) ij (cid:17) × J (cid:214) j =1 p ( η yj | σ η y ) w ( s ) j This becomes useful in the applications in stan where one can specify the weighted linear regression model and addthe log of p ( η y | . . . ) to the log of the full conditional or “Target”.. | Frequentist Approach
The frequentist approach is designed-based, assuming the population is fixed. It employs the pseudolikelihood con-struction, but without PSU-REs. The point estimate of θ , called ˜ θ f req , maximizes p pseudo ( θ ; y ( s ) , π ( s ) , x ( s ) ) = (cid:206) Jj =1 (cid:206) n j i =1 (cid:104) p ( y ( s ) ij | x ( s ) ij , θ ) (cid:105) w ( s ) ij with w ( s ) ij ∝ / π ( s ) ij standardized so that (cid:205) ij w ( s ) ij = n .The PSU indices, together with p pseudo , are used to estimate the standard error of ˜ θ f req via resampling methods ( i.e. ,balanced repeated replication, Jack-Knife, Bootstrap) or Taylor series linearization. The R function svyglm in the Rpackage survey(Lumley, 2019) uses the latter (as default) to fit common generalized regression models such as linear,Poisson, logistic, etc.In the context of multiple linear regression with θ = ( β , σ y ) , inference ( i.e. confidence intervals for each entry) forthe ( p + 1 ) -dimension vector of regression coefficients β (that includes an intercept) is based on that, asymptotically, ˜Σ / ( ˜ β f req − β ) ∼ ( p + 1 ) -variate Student-t with degrees of freedom equal to df = P SU − St r at a − p ; and ˜Σ / isthe matrix such that ˜Σ / ˜Σ / = ˜Σ with ˜Σ the estimate of the variance-covariance matrix of ˜ β f req . No stratification isequivalent to having one stratum and the degrees of freedom are then df = P SU − ( p +1 ) . This frequentist approachfor uncertainty quantification is similar to the post processing correction of Williams and Savitsky (2020) in that theanalysis model for the population does not employ a PSU-indexed random effects term; rather, the re-sampling ofclusters captures the dependence within clusters. Both methods perform nearly identically, in practice, so that wefocus on comparing our Fully Bayes approach to this frequentist re-sampling method in the simulation study thatfollows. ayes Estimation/Informative Cluster Sampling 11 | SIMULATION
We perform a Monte Carlo simulation study to compare the performance of our fully Bayes method of (12) thatemploys PSU-indexed random effects in both the models for the response and inclusion probabilities to the pseudo-posterior and frequentist methods, presented in Sections 3.4 and 3.5, respectively. In each Monte Carlo iteration,we generate a population of J pop clusters and N j individuals per cluster. The response variable is generated propor-tionally to size-based group and marginal inclusion probabilities to induce informativeness (dependence between theresponse variable and inclusion probabilities). We next take a sample of groups and, subsequently, individuals withingroup. A clustered simple random sample (cSRS) is also generated from the same population and the population es-timation for the response that includes sampling weights is used to estimate parameters. The cSRS is included toserve as a gold standard for point estimation and uncertainty quantification and is compared to methods estimatedon the informative sample taken from the same population. For each population and sample we apply the estimationapproaches of our Fully Bayes method and associated comparator methods, assessing the bias, MSE and coverageproperties.Here gamma ( a , b ) denotes the gamma distribution with shape and rate parameters a and b ( i.e. , mean a / b ). | Monte Carlo Simulation Scheme
Steps 1-6 describe how the synthetic population dataset is generated, steps 7-8 how the samples are generated and9 how they are analyzed. We use the superindex ‘DG’ to refer to data generating model as opposed to the analysismodel: Generate π i | j iid ∼ gamma ( a π = 2 , b π = 2 ) for i = 1 , . . . , ( N j = 20 ) individuals nested in PSU, j = 1 , . . . , ( J pop = 10 ) total PSUs. The total population size is J pop × N j = 20 , . Define PSU j inclusion probability π tem j : = (cid:205) N j i =1 π i | j (therefore π tem j iid ∼ gamma ( N j a π , b π ) ). Standardize π j : = π tem j / (cid:205) J pop j (cid:48) =1 π tem j (cid:48) , so ( π , , π , , . . . , π , J pop ) ∼ Dirichlet (cid:16) a π × ( N , N , . . . , N J pop ) (cid:17) . (Thus, b π does not play a roll on the distribution of π j .) Generate η y , DGj iid ∼ N ( , σ η y , DG = 0 . ) PSU specific random effects and predictor x ij iid ∼ Uniform ( , ) Generate the response. We consider three simulation scenarios to generate the response, in all of them y ij = β DG + β DG x ij + β π , π j + β π , π i | j + β η y , DG η y , DGj + (cid:15) DGij , with (cid:15) DGij iid ∼ Normal ( , ( σ DGy ) ) , but the last three regression coefficients differ:Scenario β π , β π , β η y , DG S Informative PSU-RE J pop S Non-informative PSU-RE 0 1 1 S No stage is informative 0 0 1with ( σ DGy ) = 0 . , β DG = 0 , β DG = 1 . Note that, in scenario S , β π , = 1 / E ( π j ) = J pop so that β π , E ( π j ) = 1 .Informative random effects are instantiated in Scenario S by generating y ij from π j , where π j , the inclusionprobability for PSU, j , is equivalent to a PSU-indexed random effect. We set the regression coefficient for π j equal to and that for η y , DGj equal to in Scenario S , where we generated random effects as non-informative(uncorrelated with the selection probabilities). Take a clustered simple random sample (cSRS) from the population: From each population dataset we draw twosamples, one informative and the other under two-stage cluster SRS . Both samples contain J = 30 PSUs and n j = 5 individuals, thus sample size is n = (cid:205) Jj =1 n j = 150 . Results under cSRS will serve us as Gold Standard andwill be compared with the results of using methods analysing the informative sample. To implement the clusteredrandom sample we, a. Draw an SRS (without replacement) of size J of PSUs indices from { , . . . , J pop } . b. Within each drawn PSU j , obtain a SRS (without replacement) of size n j . c. Relabel the PSU indices so they run from to J and individual indices so they run from to n j . d. The cSRS consists of {( y ij , x ij , j ) } i =1 ,..., n j ; j =1 ,..., J Generate an informative sample: a. Draw without replacement, J PSU indices j , . . . , j J ∈ { , . . . , J pop } with P r ( j ∈ sample ) = π j . b. For each j (cid:96) drawn, sample, without replacement, n (cid:96) individual indices i ∈ { , . . . , N j (cid:96) } with probability P r ( i ∈ sample from PSU j ) = π i | j / (cid:205) N j i (cid:48) =1 π i (cid:48) | j . c. Define π ij = π j π i | j and relabel the PSU and individual indices so they run from to J and from to n j ,respectively, and add superindex “ ( s ) ” to denote sampled quantities. d. The informative sample consists of {( y ( s ) ij , x ( s ) ij , π ( s ) ij , j ) } i =1 ,..., n j , j =1 ,..., J . Analyze the informative sample by estimating parameters under the following approaches: a. FULL.both : Denotes the approach enumerated in Subsection 3.2 that employs PSU-REs in models for bothresponse and inclusion probability; the model for the response includes PSU-indexed random effects with, y ij = β Ana + β Ana + η y , Anaj + (cid:15) Anaij with (cid:15)
Anaij iid ∼ Normal ( , ( σ Anay ) ) , (14)where the superscript, “ Ana " denotes the model for analysis or estimation as contrasted with the DG modelused for population data generation. We are interested in estimating β Ana and the standard deviation ofthe PSU-REs, σ η y , Ana where η y , Anaj iid ∼ Normal ( , ( σ η y , Ana ) ) . (Note that the estimation of β Ana is unbiasedregardless of the sampling scheme) We, subsequently, use the conditional estimation model for the marginalinclusion probabilities of (10) to include a PSU-indexed random effects term,log π ij | y ij , η π ∼ Normal ( κ + κ y y ij + κ x x ij + η πj , σ π ) . These two formulations describe the joint population model that employs random effects in both the marginalmodel for the response and conditional model for the inclusion probabilities. We leverage the Bayes ruleapproach of Section 3.1 under the linear regression population to produce (12) that adjusts the populationmodel to condition on the observed sample. We use this equation to estimate the Fully Bayes populationmodel on the observed sample.It bears noting that FULL.both assumes that the data analyst does not have access to PSU-indexed samplingweights ( ∝ / π j ). Yet, we show in the simulation study results that FULL.both is able to adjust for informativesampling of PSUs for estimation of population model parameters (e.g., the intercept and PSU random effectsvariance). This relatively good result owes to the inclusion of PSU-indexed random effects in the conditionalmodel for the inclusion probabilities, π ij , because it captures the within PSU dependence among them.Note that the Full.both analysis assumes that log π ij | y ij , . . . is a normal distribution, but that does nothold under any simulation scenario, which allows our assessment of the robustness of FULL.both to modelmisspecification. ayes Estimation/Informative Cluster Sampling 13 b. FULL.y : This alternative is a variation of FULL.both that uses the same population estimation for the responsestated in (14). In this option, however, PSU-REs are excluded from the conditional model for the marginalinclusion probabilities; i.e. , log π ij | y ij , η π ∼ Normal ( κ + κ y y ij + κ x x ij , σ π ) does not include PSU-REs. c. Pseudo.w : Denotes the pseudolikelihood of Section 3.4 where the likelihood is exponentiated by marginalsampling weights for individuals and the prior for the random effects is expontentiated with PSU-indexedsampling weights. In (13) we use w ( s ) j ∝ / π ( s ) j (and not w ( s )· j ) standardized so that (cid:205) Jj =1 w ( s ) j = n d. Pseudo : Denotes the pseudolikelihood that exponentiates the likelihood by marginal sampling weights butdoes not exponentiate the prior distribution for random effects by PSU-indexed sampling weights, as de-scribed in Subsection 3.4. e. Freq : Denotes the frequentist, design-based, inference under simple linear regression model as described inSubsection 3.5. Note that this analysis model does not include PSU- REs because we employ a step thatre-samples the PSUs in order to estimate confidence intervals. To fit the model, we use R function svyglm inlibrary survey (Lumley, 2019). f. Pop : Ignore the informative sampling and fit model in (14) (as if the sample were a cSRS). The inclusionprobabilities do not play a roll in the inference, though the model for the response includes PSU-REs. This isequivalent to Pseudo with all (group and individual) sampling weights set equal to 1. Analyze the cluster simple random sample. cSRS : Fit the model in (14) to the sample taken under a cSRS (generatedin step 6) design. The same as Pop but applied to the cSRS generated in step 6. Save parameter estimates to compute Bias, MSE, coverage probability of central 95% credible intervals and theirexpected length. The parameters of inferential interest are the point estimate of β Ana , T RUE : ˜ β Ana : = E ( β Ana | dat a ) (or ˜ β Ana , f req for Freq), and its central 95% credible (or confidence for Freq) interval lower and upper limits. Wealso produce point and interval estimates for σ Ana , T RUEy for those methods that include PSU-REs in the marginalresponse model ( i.e. exclude Freq). The computation of β Ana , T RUE and σ Ana , T RUEy is discussed in Subsection 4.2.Once we have run steps 1-10 times, we use the quantities stored in step 10 to estimate the bias and MSE of thepoints estimate of β Ana , T RUE (defined below in Subsection 4.2) of each method as the average of ˜ β Ana − β Ana , T RUE andaverage of ( ˜ β Ana − β Ana , T RUE ) , respectively. The coverage and expected length of the 95% credible (or confidence)are estimated as the proportion of times that the credible intervals contain β Ana , T RUE and their average length. Wedo the same for σ T RUE η y , Ana also defined in subsection 4.2. | True Model Parameters under Analysis Model
In this section we compute the true values of the intercept and random effects variance parameters for the analysis(
Ana ) models that are obtained from associating parameters of the analysis model to the data generating ( DG ) model.Having true values for the intercept and random effect variance under our analysis models allows our assessment ofbias, MSE and coverage. We use the superindex “ T RU E " to refer to the true parameter values for the
Ana modelimplied by the simulation true parameter values in the DG model. The true value of the intercept parameter underthe analysis model is achieved by integration, β Ana , T RUE = E ( y ij | x ij = 0 ) = β DG + β π , E ( π j ) + β π , E ( π i | j ) + β η y , DG E ( η y , DG ) , yielding, β Ana , T RUE = 2 , and under simulation scenarios S , S and S , respectively. The true value for thepopulation random effect is, η y , Anaj , T RUE = β π , (cid:2) π j − E ( π j ) (cid:3) + β η y , DG η y , DGj and the true values random errors under the analysis model are (cid:15)
Ana , T RUEij = β π , π i | j + (cid:15) DGij . Since π i | j s are not normally distributed, (cid:15) Ana , T RUEij s arealso not normally distributed, which as earlier mentioned allows our assessment of the robustness of the FULL.bothanalysis model to misspecification. Since normality assumption of the errors of the simple regression model is violated,the variance of random effects for the marginal population model for the response, v ar ( η y , Anaj T RUE ) = β π , v ar ( π j ) + β η y , DG v ar ( η y , DGj ) (cid:124) (cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32) (cid:123)(cid:122) (cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32) (cid:125) σ η y , DG , is different from ( σ T RUE η y , Ana ) . Nevertheless, we may compute σ T RUE η y , Ana by fitting the linear mixed effect population modelin (14), which corresponds to the analysis model for the response, directly to the population dataset via the lmer Rfunction lmer( y ∼ x + ( | PSU index),data= population). In practice, the PSU inclusion probabilities, ( π j ) , are notavailable to the data analyst for individuals in both sampled and non-sampled individuals from the population. Underscenario ai σ T RUE η y , Ana ≈ . , and under the other two scenarios σ T RUE η y , Ana ≈ . . | Simulation Results
Tables 1, 2 and 3 show the simulation results under all scenarios. As expected in all scenarios cSRS credible intervalshave coverage close to nominal level (0.95) and the lowest MSE. In informative scenarios, S and S , (i) Pop performspoorly showing the consequences of ignoring the informative sampling scheme, (ii) all methods to analyze informativesamples yield similar quality point estimators (similar MSE), and (iii) Full.both and Full.y credible intervals maintain nom-inal coverage while Pseudo.w, Pseudo and Freq do not. Under non-informative scenario S , all methods to analyzeinformative samples yield similar results to Pop (now correctly specified model). Interestingly, only Freq pays the priceof the noise introduced by the non-informative sampling weights producing considerable wider confidence intervalsfor β Ana , T RUE than all other methods.Overall, Tables 1-3 show that Full.both and Full.y are the best methods to analyze informative samples, particularlyin terms of uncertainty quantification. But, so far, results have not shown advantage of Full.both over Full.y. To doso, under scenario S , we increase level of informativeness of the PSUs by increasing the value of β π , (See step 5 inSubsection 4.1) from J pop to J pop and J pop . As shown in Table 4, coverage of Full.y credible intervals deterioratesas informativeness increases while Full.both, in contrast to all other methods, maintains coverage similar to cSRS (atnominal level).The strength of Full.both over all other considered approaches is that it accounts for the association among theinclusion probabilities within the same PSU. Table 4 shows that Full.both is the only approach that performs well undersimulation scenario S when the level of informativeness of π ij (or correlation between y ij and π j ) increases. This isFull.both is the only method whose inference quality is not affected when π j ⊥ y ij | x ij (cid:91) i , j . ayes Estimation/Informative Cluster Sampling 15 FULL.both FULL.y Pseudo.w Pseudo Freq Pop cSRS β Ana Bias 0.035 0.046 0.067 0.091 0.015 0.518 0.005MSE 0.028 0.029 0.028 0.032 0.030 0.294 0.01695% CI Coverage 0.947 0.938 0.909 0.883 0.906 0.091 0.95795% CI Length 0.671 0.669 0.537 0.541 0.618 0.619 0.520 σ η y , Ana
Bias 0.016 0.017 0.047 0.045 0.018 -0.009MSE 0.010 0.010 0.010 0.010 0.010 0.00895% CI Coverage 0.960 0.963 0.886 0.911 0.959 0.95195% CI Length 0.425 0.422 0.359 0.359 0.418 0.355
TA B L E 1
Simulation Scenario S : Informative PSU-RE. cSRS analyses the cSRS sample while all otherapproaches analyze the informative sample. CI denotes central credible interval except for Freq where it denotesconfidence interval. FULL.both FULL.y Pseudo.w Pseudo Freq Pop cSRS β Ana Bias 0.008 0.009 0.048 0.055 0.011 0.494 0.001MSE 0.020 0.020 0.022 0.022 0.025 0.263 0.01495% CI Coverage 0.958 0.962 0.911 0.908 0.926 0.064 0.96295% CI Length 0.623 0.624 0.495 0.496 0.565 0.574 0.479 σ η y , Ana
Bias 0.063 0.065 0.108 0.107 0.065 0.049MSE 0.008 0.008 0.018 0.017 0.008 0.00695% CI Coverage 0.967 0.971 0.760 0.752 0.966 0.95695% CI Length 0.332 0.331 0.313 0.312 0.329 0.285
TA B L E 2
Simulation Scenario S : Non informative PSU-REs. cSRS analyses cSRS samples while all otherapproaches analyze informative samples. FULL.both FULL.y Pseudo.w Pseudo Freq Pop cSRS β Ana Bias 0.000 0.000 0.000 -0.000 0.000 -0.000 0.000MSE 0.001 0.001 0.001 0.001 0.001 0.001 0.00195% CI Coverage 0.957 0.964 0.941 0.944 0.947 0.956 0.95195% CI Length 0.103 0.103 0.104 0.104 0.134 0.102 0.103 σ η y , Ana
Bias 0.002 0.002 0.006 0.006 0.002 0.003MSE 0.000 0.000 0.000 0.000 0.000 0.00095% CI Coverage 0.935 0.936 0.931 0.933 0.938 0.95595% CI Length 0.067 0.067 0.068 0.068 0.067 0.068
TA B L E 3
Simulation Scenario S : No stage informative. cSRS analyses cSRS samples while all other approachesdo the same as if the samples were informative. β π , FULL.both FULL.y Pseudo.w Pseudo Freq cSRS J pop .947,.960 .938,.963 .909,.886 .883,911 .906,NA .957,.951 J pop .949,.929 .914,.927 .906,.917 .852,.929 .908,NA .961,.927 J pop .949,.946 .85,.954 .897,.916 .808,.914 .910,NA .957,.949 TA B L E 4
Coverage of central 95% credible (confidence for Freq) intervals for β Ana , T RUE , σ T RUE η y , Ana under scenario S increasing the level of informativeness of the PSUs (by increasing β π , ) . ayes Estimation/Informative Cluster Sampling 17 | APPLICATIONS
The National Health and Nutrition Examination Survey (NHANES) is designed to assess the health and nutritionalstatus of the non-institutionalized civilian population living in one of the 50 U.S states or Washington D.C. Althoughnationally representative, NHANES is designed to oversample specific subpopulations ( e.g. persons 60 and older,African Americans, Asians, and Hispanics) and follows a complex sampling design (CDC, 2017). The NHANES samplingdesign is constructed as multi-stage with stages that include sampling strata and nested primary sampling units (PSUs)that further nest respondents. A PSU is a cluster or grouping of spatially contiguous counties, while a stratum isconstructed as a region nesting multiple PSUs. NHANES publishes respondent-level marginal sampling weights basedon resulting respondent marginal inclusion probabilities in the sample after accounting for clustering. The samplingweights measure the number of people in the population represented by that sampled individual, reflecting unequalprobability of selection, nonresponse adjustment, and adjustment to independent population controls.The survey consists of both interviews and physical examinations. The NHANES interview includes demographic,socioeconomic, dietary, and health-related questions. The examination component consists of medical, dental, andphysiological measurements, as well as laboratory tests. Data obtained from 15 strata with two PSU per stratum arereleased in two-year cycles. Our first example considers examination data and the second dietary data.In both applications the priors are given in (11), and posterior inference under non-frequentist methods is basedon a posterior sample of the model parameters of size 10,000. The Gibbs sampler was run 10,000 iterations, after aburn-in period of another 10,000 iterations, on Stan. | Proportion of Body Fat and BMI
In Heo et al. (2012), here after referred as H2012, the authors model relationship of percentage of body fat (PBF) andbody mass index (BMI in k g / m ) using the simple linear regression (SLR) PBF = β + β ( / BMI ) . H2012 combinedata from three NHANES biannual cycles: 1999-2000,2001-2002 and 2003-2004. They fit a SLR model for eachcombination of sex (men and women), 3 age groups (18–29, 30–49, and 50–84 years of age), and 3 race-ethnicitygroups (non-Hispanic Whites, non-Hispanic Blacks, and Mexican Americans). Table 3 of H2012 reports the estimatedvalues and standard errors of β and β . Their table 4 reports the predicted PBF, ˆ β + ˆ β / BMI, for individuals with BMIlevels of . , , , and that represent BMI cutoffs for underweight, normal, overweight, and obesity classes I,II, and III, respectively. The PBF variable expresses a high rate of missing values. NHANES releases two datasets (percycle) with five sets of imputed PBF values; the first dataset includes participants with observed PBF or with imputedPBF values with low variability, while the second data set participants with high variability in their imputed values.H2012 analysis considers sampling design and multiple imputation of missing values.In this section we mimic their analysis but with the 2005-2006 NHANES dataset. We use a multiple linear regres-sion model where we control for stratification variables in H2012. Since PBF in the − cycle is reported onlyfor − year old participants, we categorize age into groups: − , − , and − years of age and excludedparticipants not in these age ranges. We also include two more race/ethnicity groups: “Other Hispanic” and “OtherRace - Including Multi-Racial”. As in H2012, we exclude participants with missing BMI, women who tested positive inpregnant test or who claimed to be pregnant (for which by design PBF is not measured), or, with PBF imputed valueswith high variability. Our final sample size is n = 4099 . For more detail about the dataset see subsection A.1. Theanalysis model of the non-frequentist methods is the mixed effect linear regression with PBF as the response variable,along with the following predictors: (1/BMI), gender, age group and race ethnicity, with male, 18-29 age group andnon-Hispanic White as reference groups, and PSU-REs. The frequentist analysis model is same (now fixed effect) model but without PSU-REs. We recall that u denotes predictors in the marginal model for y in (9), and construct, u t = (cid:0) , / BMI , ( gender = Female ) , ( Age ∈ [ , ]) , ( Age ∈ [ , ])) , ( R ace / E th = MX-AME ) , ( R ace / E th = Other-Hisp ) , ( R ace / E th = NonHisp black ) , ( R ace / E th = Other or Multiracial ) (cid:1) (15)with dimension p + 1 = 9 , where ( A ) denotes the indicator function of the individual in the set A . We analyze thedataset with the first set of PBF imputed values under the following comparator models used in the simulation study:Full.both, Full.y, Pseudo.w, Pseudo, Freq and Pop. We recall that we jointly model the response and sampling inclusionprobabilities under Full.both and Full.y, with the response denoted as y = PBF, and we use the same predictors inthe conditional model for the inclusion probabilities and the marginal model for the response such that v = u in(10). In the implementation of pseudo.w we use sampling weights, w ( s )· j of (13), that sum individual sampling weightsfor those units nested in each PSU in order to exponentiate the random effects prior distribution. Our analysesconsider PSU information and sampling weights as provided by NHANES but not strata information. We add thecomparator method Freq.strata that does consider strata information, which would be expected to produce moreoptimistic confidence intervals, fitted using the R package Survey (Lumley, 2019). We recall that the NHANES designemploys 15 strata with 2 PSUs per stratum such that frequentist inferences will be based on the Student- t distributionwith df = P SU − st r at a − p degrees of freedom, equal to df = 21 and df = 7 under Freq and Freq.strata, respectively.The left panel in Figure 1 compares violin plots and central 95% credibility (or confidence) intervals for the ex-pected PBF value for a person in the reference group with “normal" BMI or BMI = 18 . (where BMI < . is labeledas underweight), which represents uncertainty intervals of β + β / . . All point estimates are close to the valuereported in Table 4 of H2012 of 14.5% for this group with FULL.both and FULL.y at . % are closest to H2012.The right panel of Figure 1 depicts the same point estimates and uncertainty intervals but now for non-HispanicWhite woman with BMI = 18 . , which are computed computed as the uncertainty intervals of β + β / . β . Here,again, all CIs contain the PBF estimated in H2012 for this group of 26.9%.In both figures, the Frequentist CIs ( i.e. Freq and Freq.strata) are much wider than the other methods, which indi-cates an inefficiency of this Freq.strata despite it’s consideration of strata should produce smaller uncertainty intervals.By contrast, inference under Full-both and Full-y is similar to Pop indicating the possibility of a non-informative design.This is confirmed by the central 95% CI for κ y , (− . , . ) in FULL.both and FULL.y that contains 0 indicating a non-informative sample; more formally, y ij ⊥ π ij | x ij . The posterior mean estimates for the plug-in Pseudo.w and Pseudoexpress slightly more bias (relative to the . % of H2012) because of the use of use of noisy weights, which are notnecessary since the NHANES design is non-informative for PBF. The fully Bayesian methods (FULL.both, FULL.y), bycontrast, performs weight smoothing to mitigate bias induced by noisy weights.Figure 2 displays violin plots of the posterior distribution of the standard deviation of the PSU-REs, σ η y (in themarginal model for y ), for the non-frequentist methods (as we recall that the frequentist comparator methods do notinclude PSU-REs). As before, the inference under Full.Both and Full.y are similar to Pop due to the non-informativenessof the sampling design for PBF.Interpreting the coefficients in the model for y (top rows in Table 5), after controlling for BMI, women have, onaverage, 12.3% higher PBF than men, PBF increases with age and Other or multiracial and MX-AME people have thehighest PDF followed by white and other Hispanic while nonhispanic blacks have the lowest PBF.When implementing our comparator frequentist approaches Lohr (2019) (see also Heeringa et al., 2017, p 62)recommends to fit the model with and without weights. If the point and standard error estimates are different betweenthe two, they recommend that the analyst should explain the difference based on the construction of the sampling ayes Estimation/Informative Cluster Sampling 19 lllllll PopFreq.strataFreqPseudoPseudo.wFULL.yFULL.both 0.14 0.15 0.16 0.17 b + b lllllll b + b + b F I G U R E 1
Left: Violin plots along with, mean (dot) and central 95% credible or confidence interval (horizontalline) for the expected PBF for subjects in the reference group (non-Hispanic White man in age group 18-29) withBMI = 18 . , β + β / . , under all considered methods. Right: the same but for non-Hispanic White woman in agegroup 18-29, β + β / . β . Individuals with BMI < . are labeled as underweight.weights, e.g. oversampling of certain minority or age groups. In our example, inference under Freq and Pop generatesimilar point estimated but Freq yields, in general, greater standard errors. The analyst needs to decide and justifywhich model to use inference under frequentist modeling; though under NHANES guidelines (CDC, 2007) the weightsshould be used for all NHANES 2005-2006 analyses. By contrast, the fully Bayesian approaches do not require thedata analyst to make this choice about whether to use the weighted or unweighted estimates. Inference for modelparameter κ y , under Full.Both or Full.y, informs the analyst if the design is informative; κ y = 0 implies non-informativedesign and the magnitude of κ y is a measure of informativeness (in the scale of y ). Full.both and Full.y correct inferencewhen the design is informative, but also mitigate against bias induced by weights when the sampling design is non-informative (through weight smoothing), as in this particular application, and produces results that are similar to thePop method that ignores the sampling weights. Full.Both, is the only method that, as shown in Subsection 4.3, alsoprovides appropriate model uncertainty estimation when the PSUs are informative (not the case in this application);e.g., when y ij (cid:54)⊥ j | x ij for some i and j . Parameter mean sd 2.5% 97.5%Parameters for y ij | · · · intercept 0.518 0.003 0.512 0.524 / BMI -6.862 0.071 -6.999 -6.721gender: Female 0.123 0.001 0.121 0.125Age: 30-49 0.005 0.001 0.002 0.007Age: 50-69 0.022 0.001 0.019 0.025Race/eth: MX-AME 0.004 0.002 0.001 0.007Race/eth: Other-Hisp -0.001 0.003 -0.007 0.005Race/eth: Non Hisp Black -0.018 0.002 -0.021 -0.015Race/eth: Other-Multiracial 0.007 0.003 0.002 0.013 σ η y σ y π ij | y ij , · · · PBF -0.042 0.223 -0.479 0.397Intercept -0.429 0.128 -0.682 -0.177 / BMI 2.240 1.850 -1.375 5.812gender: Female -0.025 0.031 -0.086 0.038Age: 30-49 -0.549 0.020 -0.587 -0.510Age: 50-69 -0.207 0.021 -0.249 -0.167Race/eth: MX-AME 1.589 0.025 1.539 1.637Race/eth: Other-Hisp 0.454 0.045 0.366 0.542Race/eth: Non Hisp Black 1.277 0.023 1.231 1.323Race/eth: Other-Multiracial 0.367 0.040 0.290 0.446 σ η π σ π TA B L E 5
Inferende under Full.both using the first set of NHANES imputed values of PBF. Column headers: mean,sd, 2.5% and 97.5% denote the posterior expected value, standard deviation, and 0.025 and 97.5 quantiles. ayes Estimation/Informative Cluster Sampling 21 l llll
PopPseudoPseudo.wFULL.yFULL.both 0.0025 0.0050 0.0075 0.0100 σ η y F I G U R E 2
Violin plots of posterior samples of PSU RE, ( σ η y ) along with central 95% credible interval (blackvertical line) and mean (dot) for non frequentist methods. | Kilocalorie Consumption
In this second example, we estimate the daily kilocalories consumed in each one of the gender, age and ethnicitygroupings as used in the previous application with age now having two additional categories: 0-8,9-17,18-29,30-49and 50-80. In this application, u = v is of dimension p + 1 = 10 ; as in (15) after removing its second entry / BMI, andadding two more age groups. The reference categories are White, male and 0-8 age group.We use dietary data from 2015-2016 NHANES cycle. Each participant answers a 24-hour dietary recall interviewin two days: Day 1 and Day 2. The Day 1 recall interview takes place when the participant visits the Mobile ExamCenter (MEC) unit where other NHANES measurements are taken. The Day 2 recall interview is collected by telephoneand it is scheduled for 3 to 10 days later (See CDC (2015) for more details). Based on theses interviews NHANESprovides datasets with estimates of kilocalories (and many other nutrients) ingested by the participant 24 hours beforethe interview along with their dietary sampling weight. In this analysis, we consider the Day 1 dataset and the samplingweights that come in it. See Subsection A.2 for more details.There are 8,506 participants who completed the Day 1 dietary recall, of which this analysis considers the n = 8 , with positive sampling weights or, equivalently, with recall status labeled “complete and reliable” by NHANES CDC(2015). We fitted the same multiple linear regression models as in the previous example with response y = log ( kcal +1 ) with kcal the NHANES estimate of kilocalories consumption based on Day 1 recall interview.Figure 3 depicts violin plots of the estimated mean daily kcal consumption for White males in age groups 0-8 (left)and 30-49 (right). More specifically, the left panel depicts violin plots of exp ( β ) − with the posterior mean pointand distribution estimates for β drawn from the posterior distribution under the (full and plug-in pseudo) Bayesianmethods. The frequentist methods estimate the the distribution of exp ( β ) − using ˆ β + t × SE ( ˆ β ) with t ∼ Student-twith
P SU − St r at a − p degrees of freedom (=20 under Freq and 6 under Freq.strata). The right panel constructs thefrequentist methods using the vector ( β , β ) to be drawn from the distribution ( ˆ β , ˆ β ) t + ˆΣ / , t where the randomvector t = ( t , t ) has entries t , t iid ∼ Student-t, with the same degrees of freedom as above, and ˆΣ / , ˆΣ / , = ˆΣ , with ˆΣ , the estimated variance-covariance matrix of ( ˆ β , ˆ β ) t .Figure 3 shows that for these groups, inference under FULL.both and Full.y are similar but, in contrast to thefirst application, different from Pop. The central 95% credible interval for κ y , (-0.110,-0.048), does not contain zero,indicating that the sampling design is informative and both FULL.both, and also (not shown) Full.y, corrects for this.The point estimates under Pseudo.w, Pseudo, Freq and Freq.strata are close to one another, but different from thefully Bayesian methods, indicating that the weight smoothing provided by the full Bayesian methods is more robustto noise present in the weights that may injure point estimation.Figure 4 depicts the violin plots under Pop, the Bayesian and pseudolikelihhod methods of the standard deviationof the PSU-RE, σ η y (Freq and Freq.strata do not model PSU-RE). Inference for this parameter under the plug-in,pseudo posterior methods differ from the fully Bayesian estimates. Figure 5 shows that the posterior distributionof individual PSU random effects in the marginal response model (PSU-RE) also differs. The figure focuses on twoparticular random effects, η y and η y , that are coherent with the general pattern we see over the random effects;in particular, the fully Bayes methods express less estimation uncertainty than do the pseudo Bayesian methods,indicating a greater estimation efficiency by jointly modeling the response and inclusion probabilities. | DISCUSSION
We have extended our work in LS2019 to include PSU information in a model-based, fully Bayesian analysis of in-formative samples. The extension consists of replacing the fixed effect model by a mixed effect model that includes ayes Estimation/Informative Cluster Sampling 23 l llllll
PopFreq.strataFreqPseudoPseudo.wFULL.yFULL.both1200 1300 1400 1500 1600 exp ( b ) - lllll ll exp ( b + b ) - F I G U R E 3
Violin plots, under all methods, for average kcal consumption for people in the reference groupWhite males 8 year old or younger (left), and White males in the age group 29-48 (right) along with point estimate(dot) central 95% credible, or confidence, interval (horizontal line within violin plot). l llll
PopPseudoPseudo.wFULL.yFULL.both 0.00 0.02 0.04 0.06 σ η y F I G U R E 4
Violin plots, of the estimate of the standard deviation of the PSU-RE, σ η y , under all non-frequentistmethods. lll ll PopPseudoPseudo.wFULL.yFULL.both −0.10 −0.05 0.00 0.05 h l llll −0.05 0.00 0.05 h F I G U R E 5
Violin plots, under all non-frequentist methods, for PSU-RE, η y and η y . PSUs 15 and 17 herecorrespond to PSUs 1 in strata 8 and 14, respectively, in the 2015-2016 demographic NHANES dataset. ayes Estimation/Informative Cluster Sampling 25 PSU/cluster-indexed random effects in the marginal model for y and the conditional model for π | y to capture de-pendence induced by the clustering structure. We have shown via simulation that our fully Bayesian approach yieldscorrect uncertainty quantification, or equivalently CIs, with coverage close to their nominal level, including for therandom effects variances. Competing methods fail to do so in at least one simulation scenario. In particular, Full.bothis the only appropriate method, of all here considered, when the sample design is informative for the selection ofPSUs. The results in simulation scenario S , where the design is not informative, revealed that the method is alsorobust to noise in weights.Our fully Bayesian methods proposed here are mixed effect linear models that not only take into account the pos-sible association of individuals within the same cluster but also, in contrast to the design-based frequentist methods,quantify this association; that is, the within PSU correlation can be estimated.We demonstrated our method with two applications analyzing NHANES data. In the first application, the com-parative methods behave as in simulation scenario S ; that is, the Bayesian methods adjust the inference for nonin-formative design yielding similar results to an analysis assuming cSRS where sampling weights do not play a role. Bycontrast, inference under competing methods is affected by the noise induced from these sampling weights. In thesecond application the design is informative and inference under Bayesian methods differs from other methods.In our two applications, inclusion of the strata information into the frequentist analysis did not yield shorter con-fidence intervals as we expected (Freq.strata not better than Freq), indicating that perhaps strata are homogeneous(with respect to the response/ predictor relationship considered). The next natural step of the method is to includestrata information into our Bayesian analysis. The second application only analyzed data from Day 1 dietary question-naire. To analyze Day 1 and Day 2 data with one model we need to adapt our approach to repeated measures. This isanother current line of research.To implement our Bayesian method, we derived an exact likelihood for the observed sample. In principle, thislikelihood can also be used for maximum likelihood estimation opening the door for model-based frequentist inference.Our approach requires the modeler to specify a distribution for π i | y i , · · · . Estimation requires the computationof an expected value, the denominator in (7). We assume a lognormal conditional likelihood for the marginal inclusionprobability, given the response, where the mean function is linear in the response, both of which facilitate use ofTheorem 2 to obtain a closed form for this expected value. Our simulation study showed that the Bayesian methodis robust against misspecification of these assumptions. Future work is needed to ease conditions in Theorem 2.To sum up, we have presented the first model-based Bayesian estimation approach that accounts for both informa-tive sampling within the individuals in the same PSU and when the PSU is informative to produce correct uncertaintyquantification. A | APPENDIXA.1 | Application Details for PBF and BMI analysisA.1.1 | Data used
Three publicly available datatset were downloaded from the NHANES website.• Demographic Variables & Sample Weights: DEMO_D.XPT (CDC (2007), ). Used columns: – sdmvstra: Stratum to which the participant belongs – sdmvpsu: PSU indicator – wtmec2yr: Full Sample 2 Year MEC Exam Weight – riagendr: Sex – ridageyr: age in years – ridreth1:Race/ethnicity• Body Measures: BMX_D.XPT( ).Used columns: – bmxbmi: BMI ( k g / m )• Dual-Energy X-ray Absorptiometry - Whole Body: dxx_d.XPT ( ) – dxdtopf: Total percent body fat (%) – _MULT_: Imputation Version in , . . . , . For Figures 1 and 2 and Table 5 we use the first set of imputations.This is, rows with _MULT_=1. A.2 | Application Details for Daily Kilocalories AnalysisA.2.1 | Dataset
Two publicly available datatset were downloaded from the NHANES website.• Demographic Variables and Sample Weights: DR1TOT_I.XPT( ).Columns used – sdmvstra: Stratum to which the participant belongs – sdmvpsu: PSU indicator – riagendr: Sex – ridageyr: age in years – ridreth1:Race/ethnicity• Dietary Interview - Total Nutrient Intakes, First Day: DR1TOT_I.XPT( ).Columns used – dr1tkcal: Energy (kcal) – wtdrd1: Dietary day one sample weightNotice that, in this example, following the NHANES guidelines for the analisys of dietary data, we are not using theFull Sample 2 Year MEC Exam Weights “wtmec2yr” included in the demographic dataset DR1TOT_I.XPT but insteaddietary day one sample weights “wtdrd1” constructed based on “MEC sample weights and further adjusting for (a) theadditional non-response and (b) the differential allocation by weekdays (Monday through Thursday), Fridays, Saturdaysand Sundays for the dietary intake data collection” (CDC, 2015). ayes Estimation/Informative Cluster Sampling 27 A.3 | Quantity between Brackets the in Augmented Pseudolikelihood in (13)
MatchesLikelihood under Weighted Linear Regression
The weighted linear regression model is y ij | u ij , θ , η yj ∼ normal (cid:16) u tij β + η yj , σ y / w ij (cid:17) with w ij > known.Using the fact that (cid:104) normal ( y | µ , σ ) (cid:105) w = 1 w / ( π ) ( w − )/ ( σ ) ( w − )/ × normal ( y | µ , σ / w ) we obtain that the expression between brackets in (13), (cid:206) Jj =1 (cid:206) n j i =1 p ( y ( s ) ij | θ ) w ( s ) ij , equals J (cid:214) j =1 n j (cid:214) i =1 (cid:104) normal (cid:16) y ( s ) ij | u tij β + η yj , σ y (cid:17)(cid:105) w ( s ) ij ∝ ( σ ) (cid:16)(cid:205) j , i [ w ( s ) ij − ] (cid:17) / J (cid:214) j =1 n j (cid:214) i =1 normal (cid:16) y ( s ) ij | u tij β + η yj , σ y / w ( s ) ij (cid:17) but, by construction, (cid:205) Jj =1 (cid:205) n j i =1 w ( s ) ij = n and therefore the expontent of σ in the denominator in the expressionabove is zero, and the quantity between brackets is the likelihood of the weighted linear regression model. references Carpenter, B., Gelman, A., Hoffman, M., Lee, D., Goodrich, B., Betancourt, M., Brubaker, M. A., Guo, J., Li, P. and Riddell, A.(2016) Stan: A probabilistic programming language.
Journal of Statistical Software , , 1–37.CDC (2007) Centers for Disease Control (CDC). National Health and Nutrition Examination Survey, 2005-2006 data docu-mentation, codebook, and frequencies, demographic variables and sample weights (demo_d). URL - /DEMO_D.htm . Accessed: March 31, 2020.— (2011) Centers for Disease Control (CDC). Continuous NHANES web tutorial: Key concepts about NHANES survey design.URL .htm . Accessed: May 21, 2018.— (2015) Centers for Disease Control (CDC). NHANES: Measuring Guides for the Dietary Recall Interview. URL . Page last reviewed: November 6, 2015.Accessed: May 14, 2020.— (2017) Centers for Disease Control (CDC). NHANES: About the National Health and Nutrition Examination Survey.URL . Page last reviewed: September 15, 2017. Accessed: May14, 2020.Heeringa, S., West, B. and Berglund, P. (2017) Applied survey data analysis. second edi.Heo, M., Faith, M. S., Pietrobelli, A. and Heymsfield, S. B. (2012) Percentage of body fat cutoffs by sex, age, and race-ethnicity inthe US adult population from NHANES 1999–2004. The American Journal of Clinical Nutrition , , 594–602. URL https://doi.org/ . /ajcn. . .León-Novelo, L. G. and Savitsky, T. D. (2019) Fully bayesian estimation under informative sampling. Electronic Journal ofStatistics , , 1608–1645.Lohr, S. L. (2019) Sampling: Design and Analysis: Design And Analysis . CRC Press.Lumley, T. (2019) survey: analysis of complex survey samples. R package version 3.35-1.Pfeffermann, D., Skinner, C. J., Holmes, D. J., Goldstein, H. and Rasbash, J. (1998) Weighting for unequal selection probabilitiesin multilevel models.
Journal of the Royal Statistical Society: series B (statistical methodology) , , 23–40.Rabe-Hesketh, S. and Skrondal, A. (2006) Multilevel modelling of complex survey data. Journal of the Royal Statistical Society:Series A (Statistics in Society) , , 805–827.Savitsky, T. D. and Williams, M. R. (2019) Bayesian mixed effects model estimation under informative sampling. Williams, M. R. and Savitsky, T. D. (2020) Bayesian estimation under informative sampling with unattenuated dependence.
Bayesian Analysis , , 57–77. URL http://dx.doi.org/ . / -BA1143