Adjusted Logistic Propensity Weighting Methods for Population Inference using Nonprobability Volunteer-Based Epidemiologic Cohorts
Lingxiao Wang , Richard Valliant , and Yan Li The Joint Program in Survey Methodology, University of Maryland, College Park, U.S.A. Research Professor Emeritus at the University of Michigan and University of Maryland * Address correspondence to Yan Li, 1218 Lefrak Hall, 7521 Preinkert Dr, College Park, MD 20742; email: [email protected]
Abstract
Many epidemiologic studies forgo probability sampling and turn to nonprobability volunteer-based samples because of the cost, response burden, and invasiveness of collecting biological samples. However, finite population inference from a nonprobability sample is difficult because the sample lacks population representativeness. To make population-level inferences from nonprobability samples, various inverse propensity score weighting (IPSW) methods have been studied, with the propensity defined as the participation rate of population units in the nonprobability sample. In this paper, we propose an adjusted logistic propensity weighting (ALP) method to estimate the participation rates of nonprobability sample units. Compared with existing IPSW methods, the proposed ALP method is easy to implement with ready-to-use software while producing approximately unbiased estimators of population quantities regardless of the nonprobability sampling rate. The efficiency of the ALP estimator can be further improved by scaling the survey sample weights in propensity estimation. Taylor linearization variance estimators that account for all sources of variability are proposed for ALP estimators of finite population means. The proposed ALP methods are evaluated numerically via simulation studies and empirically using the naïve unweighted National Health and Nutrition Examination Survey III sample, with the 1997 National Health Interview Survey as the reference, to estimate 15-year mortality rates.

Keywords: Nonprobability sample, finite population inference, propensity score weighting, variance estimation, survey sampling

INTRODUCTION
In the big data era, assembling volunteer-based epidemiologic cohorts within integrated healthcare systems that have electronic health records and a large pre-existing base of volunteers, such as the UK Biobank in the UK National Health Service, is increasingly popular because of its cost and time efficiency. However, volunteer-based cohorts are not randomly selected from the underlying finite target population and therefore cannot represent the target population well. As a result, naïve sample estimates obtained from the cohort can be biased for the finite population quantities. For example, the estimated all-cause mortality rate in the UK Biobank was only half that of the UK population, and the Biobank is not representative of the UK population with regard to many sociodemographic, physical, lifestyle, and health-related characteristics.

To support population-level inference from nonprobability samples, various propensity-score weighting and matching methods have been proposed in survey research; they improve the population representativeness of nonprobability samples by using probability-based survey samples as external references. Inverse propensity score weighting (IPSW) methods have been studied with the propensity defined as the participation rate of population units in the nonprobability sample. We review two such methods. Both assume that the units in the nonprobability sample are observed according to some random but unknown mechanism. Because that mechanism is unknown, the inclusion probability of each unit must be estimated. As described in Section 2, both methods are based on estimating a population log-likelihood, although they differ in their details.

Valliant and Dever [6, 7] estimated participation rates by fitting a logistic regression model to the combined nonprobability sample and a reference probability sample.
Sample weights for the probability sample were scaled by a constant so that the scaled probability sample was assumed to represent the complement of the nonprobability sample. Each unit in the nonprobability sample was assigned a weight of one. As a result, the sum of the scaled weights in the combined probability plus nonprobability sample is an estimate of the population size. This method will be referred to as the rescaled design weight (RDW) method. The pseudo-weight for each nonprobability sample unit was estimated by the inverse of the estimated inclusion (or participation) probability. The RDW estimator is biased, especially when the participation rate of the nonprobability sample is large, as noted by Chen et al. As a remedy, Chen et al. estimated the participation rate by manipulating the log-likelihood estimating equation in a somewhat different way. The resulting estimator, denoted by CLW, is consistent and approximately unbiased regardless of the magnitude of the participation rates.

Compared with the CLW method, which requires special programming, the RDW method has the advantage of easy implementation in ready-to-use software such as R, Stata, or SAS. Survey practitioners can simply fit a logistic regression model with scaled survey weights in the probability sample to obtain the estimated participation rates.

In this paper, we propose an adjusted logistic propensity weighting (ALP) method to estimate the participation rates of nonprobability sample units. Like the CLW method, the proposed ALP method relaxes the assumptions required by the RDW method [6, 7] by formulating the method in a different way. Like the RDW method, it retains the advantage of easy implementation by fitting a propensity model with survey weights in ready-to-use software. Taylor linearization variance estimators are proposed for ALP estimates; they account for variability due to geographic clustering in the nonprobability sample, differential pseudo-weights, and the estimation of the propensity scores. Moreover, the proposed ALP method is shown, analytically and numerically, to be at least as efficient as the CLW method, and it can flexibly scale the probability sample weights in propensity estimation to further improve efficiency.

METHODS

2.1. Basic setting
Let $FP = \{1, \ldots, N\}$ represent the finite population of size $N$. We are interested in estimating the finite population mean $\mu = N^{-1} \sum_{i \in FP} y_i$. Suppose a volunteer-based nonprobability sample $s_c$ of size $n_c$ is selected from $FP$ by a self-selection mechanism, with $\delta_i^{(c)}$ ($= 1$ if $i \in s_c$; $0$ otherwise) denoting the indicator of inclusion in $s_c$. The underlying participation rate in the nonprobability sample for a finite population unit is defined as

$$\pi_i^{(c)} = P(i \in s_c \mid FP) = E_c\{\delta_i^{(c)} \mid y_i, \boldsymbol{x}_i\}, \quad i \in FP,$$

where the expectation $E_c$ is with respect to the nonprobability sample selection, and $\boldsymbol{x}_i$ is a vector of self-selection variables, i.e., covariates related to the probability of inclusion in $s_c$. The corresponding implicit nonprobability sample weight is $w_i = 1/\pi_i^{(c)}$ for $i \in FP$. We consider the following assumptions for the nonprobability sample self-selection.

A1. The nonprobability sample selection is uncorrelated with the variable of interest given the covariates, i.e., $\pi_i^{(c)} = E_c\{\delta_i^{(c)} \mid y_i, \boldsymbol{x}_i\} = E_c\{\delta_i^{(c)} \mid \boldsymbol{x}_i\}$ for $i \in FP$.

A2. All finite population units have a positive participation rate, i.e., $\pi_i^{(c)} > 0$ for $i \in FP$.

A3. The nonprobability sample participation is uncorrelated across units given the self-selection variables, i.e., $\mathrm{cov}(\delta_i^{(c)}, \delta_j^{(c)} \mid \boldsymbol{x}_i, \boldsymbol{x}_j) = 0$ for $i \ne j$.

An independent reference survey sample $s_s$ of size $n_s$ is randomly selected from $FP$. The sample inclusion indicator, selection probability, and corresponding sample weights are defined by $\delta_i^{(s)}$ ($= 1$ if $i \in s_s$; $0$ otherwise), $\pi_i^{(s)} = E_s(\delta_i^{(s)} \mid \boldsymbol{x}_i)$, and $d_i = 1/\pi_i^{(s)}$, respectively, where $E_s$ is with respect to the survey sample selection.

Existing logistic propensity weighting methods
In this section, we first briefly introduce the existing RDW and CLW methods and discuss their pros and cons.
Rescaled design weight method (RDW)
Valliant and Dever [6, 7] considered (implicitly) the population likelihood function of the participation rates as

$$L(\boldsymbol{\gamma}) = \prod_{i \in FP} \{\pi_i^{(c)}(\boldsymbol{\gamma})\}^{\delta_i^{(c)}} \{1 - \pi_i^{(c)}(\boldsymbol{\gamma})\}^{1 - \delta_i^{(c)}}, \tag{2.2.1}$$

where $\boldsymbol{\gamma}$ is a vector of unknown parameters for modeling the participation rates $\pi_i^{(c)}(\boldsymbol{\gamma})$. To simplify the notation, we write $\pi_i^{(c)}$ below. For example,

$$\log\left\{\frac{\pi_i^{(c)}}{1 - \pi_i^{(c)}}\right\} = \boldsymbol{\gamma}^T \boldsymbol{x}_i, \quad i \in FP, \tag{2.2.2}$$

under the logistic regression model. Then the log-likelihood function can be written as

$$l(\boldsymbol{\gamma}) = \sum_{i \in FP} \left[\delta_i^{(c)} \log \pi_i^{(c)} + \{1 - \delta_i^{(c)}\} \log\{1 - \pi_i^{(c)}\}\right] = \sum_{i \in s_c} \log \pi_i^{(c)} + \sum_{i \in FP - s_c} \log\{1 - \pi_i^{(c)}\}, \tag{2.2.3}$$

where the set $FP - s_c$ represents the finite population units that are not self-selected into the nonprobability sample. Since $FP - s_c$ is not available in practice, the pseudo log-likelihood function

$$\tilde{l}_p(\boldsymbol{\gamma}) = \sum_{i \in s_c} w_i^* \log \pi_i^{(c)} + \sum_{i \in s_s} w_i^* \log\{1 - \pi_i^{(c)}\} \tag{2.2.4}$$

was constructed to estimate $l(\boldsymbol{\gamma})$, where

$$w_i^* = \begin{cases} 1, & i \in s_c, \\ d_i \dfrac{\hat{N}_s - n_c}{\hat{N}_s}, & i \in s_s, \end{cases}$$

with $\hat{N}_s = \sum_{i \in s_s} d_i$ being the survey estimate of the target finite population size $N$. The scaled weights across the probability sample units therefore total $\sum_{i \in s_s} w_i^* = \hat{N}_s - n_c$. The rationale for rescaling is to weight the survey sample to represent the complement of $s_c$ in the finite population, i.e., the set $FP - s_c$. Under the logistic regression model, the nonprobability sample participation rate $\pi_i^{(c)}$ can be estimated by fitting Model (2.2.2) to the combined sample of $s_c$ and $s_s$ with the scaled weights $w_i^*$, leading to the RDW estimates.

The RDW method has been shown to effectively reduce the bias of the naïve nonprobability sample estimates. However, the sum $\sum_{i \in FP - s_c} \log\{1 - \pi_i^{(c)}\}$ in (2.2.3) is not a fixed finite population total because units in the nonprobability sample are treated as being randomly observed. This leads to a bias, as shown below.
Comparing the expectation of the population log-likelihood function $l(\boldsymbol{\gamma})$ in (2.2.3) with the expectation of the pseudo log-likelihood $\tilde{l}_p(\boldsymbol{\gamma})$ in (2.2.4), and letting $E(\cdot) = E_c E_s(\cdot)$, we have

$$E\{l(\boldsymbol{\gamma})\} = \sum_{i \in FP} \pi_i^{(c)} \log \pi_i^{(c)} + \sum_{i \in FP} \{1 - \pi_i^{(c)}\} \log\{1 - \pi_i^{(c)}\},$$

and

$$\begin{aligned} E\{\tilde{l}_p(\boldsymbol{\gamma})\} &= E_c E_s\{\tilde{l}_p(\boldsymbol{\gamma})\} \\ &= E_c\left[\sum_{i \in FP} \delta_i^{(c)} \log \pi_i^{(c)}\right] + E_s\left[\sum_{i \in FP} \delta_i^{(s)} \cdot \frac{\hat{N}_s - n_c}{\hat{N}_s}\, d_i \log\{1 - \pi_i^{(c)}\}\right] \\ &\doteq \sum_{i \in FP} \pi_i^{(c)} \log \pi_i^{(c)} + \sum_{i \in FP} \left(1 - \frac{n_c}{N}\right) \log\{1 - \pi_i^{(c)}\} \end{aligned}$$

by assuming $E_s(\hat{N}_s) = N$. The difference of the two expectations, denoted by $\Delta_{RDW}$, can be written as

$$\Delta_{RDW} = E\{\tilde{l}_p(\boldsymbol{\gamma})\} - E\{l(\boldsymbol{\gamma})\} = \sum_{i \in FP} \left(\pi_i^{(c)} - \frac{n_c}{N}\right) \log\{1 - \pi_i^{(c)}\},$$

which, in general, is nonzero.
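The size of this gap can be checked numerically. The sketch below evaluates $\sum_{i \in FP} (\pi_i^{(c)} - n_c/N) \log\{1 - \pi_i^{(c)}\}$ for three hypothetical sets of participation rates, replacing $n_c$ by its expectation $\sum_i \pi_i^{(c)}$. The function name `delta_rdw`, the population size, and the rate ranges are illustrative assumptions, not values from the paper. The sign of the result depends only on the order of subtraction; what matters is whether the gap vanishes.

```python
import numpy as np

def delta_rdw(pi_c):
    """Gap between the expected RDW pseudo log-likelihood and the expected
    population log-likelihood, for participation rates pi_c over FP.

    Assumption of this sketch: n_c is replaced by its expectation sum(pi_c).
    """
    pi_c = np.asarray(pi_c, dtype=float)
    rate = pi_c.sum() / pi_c.size            # expected n_c / N
    return float(np.sum((pi_c - rate) * np.log1p(-pi_c)))

rng = np.random.default_rng(0)
N = 10_000                                    # hypothetical population size
equal = np.full(N, 0.3)                       # equal rates for every unit
small = rng.uniform(0.0005, 0.0015, size=N)   # all rates close to zero
large = rng.uniform(0.05, 0.55, size=N)       # large, unequal rates

d_equal, d_small, d_large = (delta_rdw(p) for p in (equal, small, large))
print("equal rates:", d_equal)
print("small rates:", d_small)
print("large rates:", d_large)
```

Under these assumptions the gap is essentially zero for equal rates, tiny for uniformly small rates, and large in magnitude when rates are both large and unequal.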
Accordingly, the nonprobability sample participation rates estimated by solving for $\boldsymbol{\gamma}$ in $\partial \tilde{l}_p(\boldsymbol{\gamma}) / \partial \boldsymbol{\gamma} = 0$ under Model (2.2.2) can be biased, unless either (i) the nonprobability sample units have small participation rates, i.e., both $n_c/N$ and $\pi_i^{(c)}$ are close to 0 for all $i \in FP$, in which case $\log\{1 - \pi_i^{(c)}\} \approx 0$, or (ii) all population units are equally likely to participate in the nonprobability sample, i.e., $\pi_i^{(c)} \equiv n_c/N$.

CLW Method
Chen et al. proposed another IPSW method using the same likelihood function $L(\boldsymbol{\gamma})$ in (2.2.1), but rewriting the population log-likelihood as

$$l(\boldsymbol{\gamma}) = \sum_{i \in s_c} \log\left\{\frac{\pi_i^{(c)}}{1 - \pi_i^{(c)}}\right\} + \sum_{i \in FP} \log\{1 - \pi_i^{(c)}\}. \tag{2.2.5}$$

In contrast to the RDW method, CLW estimated the population total of $\log\{1 - \pi_i^{(c)}\}$ by a weighted reference sample total and constructed the pseudo log-likelihood as

$$l_p(\boldsymbol{\gamma}) = \sum_{i \in s_c} \log\left\{\frac{\pi_i^{(c)}}{1 - \pi_i^{(c)}}\right\} + \sum_{i \in s_s} d_i \log\{1 - \pi_i^{(c)}\}. \tag{2.2.6}$$

Under the same logistic regression model (2.2.2), the participation rate $\pi_i^{(c)}$ was estimated by solving the pseudo estimating equation

$$S_p(\boldsymbol{\gamma}) = \frac{1}{N}\left(\sum_{i \in s_c} \boldsymbol{x}_i - \sum_{i \in s_s} d_i \pi_i^{(c)} \boldsymbol{x}_i\right) = 0, \tag{2.2.7}$$

according to the pseudo log-likelihood (2.2.6). The resulting estimator of the finite population mean was proved to be design consistent. In contrast to the RDW method, CLW does not require condition (i) or (ii) for unbiased estimation of the participation rates $\pi_i^{(c)}$.

In the next section, we propose an adjusted logistic propensity (ALP) method, which corrects the bias in the RDW method.
The proposed ALP method provides consistent estimators of finite population means and is as easy to implement as the RDW method.

Adjusted logistic propensity method (ALP)
Instead of directly modeling the nonprobability sample participation rates $\pi_i^{(c)}$ as in the RDW and CLW methods, we consider the propensity score $p_i = P(i \in s_c \mid s_c \cup^* FP)$, where the notation $\cup^*$ represents the combination of $s_c$ and $FP$, allowing the units of $s_c$ to be duplicated in $s_c \cup^* FP$. Notice that

$$p_i = \frac{P(i \in s_c \mid FP)}{P(i \in s_c \mid FP) + P(i \in FP \mid FP)} \le \frac{1}{2},$$

and the equality holds only if $P(i \in s_c \mid FP) = 1$, i.e., the $FP$ unit $i$ participates in the cohort for sure. Accordingly, the nonprobability sample participation rate is $\pi_i^{(c)} = p_i / (1 - p_i)$. This result is due to

$$\frac{p_i}{1 - p_i} = \frac{P(i \in s_c \mid s_c \cup^* FP)}{P(i \in FP \mid s_c \cup^* FP)} = \frac{P(i \in s_c \mid FP)}{P(i \in FP \mid FP)} = P(i \in s_c \mid FP) = \pi_i^{(c)}. \tag{2.3.1}$$

Suppose the propensity score $p_i$ can be modeled parametrically by $p_i = p(\boldsymbol{x}_i; \boldsymbol{\beta}) = \mathrm{expit}(\boldsymbol{\beta}^T \boldsymbol{x}_i)$, where $\boldsymbol{\beta}$ is a vector of unknown model parameters. That is,

$$\log\left\{\frac{p_i}{1 - p_i}\right\} = \boldsymbol{\beta}^T \boldsymbol{x}_i, \quad i \in s_c \cup^* FP. \tag{2.3.2}$$

Notice that $\boldsymbol{\beta}$, the coefficients in Model (2.3.2), differ from the coefficients $\boldsymbol{\gamma}$ in Model (2.2.2) because the two logistic regression models have different dependent variables.
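As a quick numerical check of the odds relationship in (2.3.1), the following minimal sketch maps a few hypothetical participation rates to propensity scores in the stacked set $s_c \cup^* FP$ and back. The rate values are illustrative only.

```python
import numpy as np

# Hypothetical participation rates pi_i^(c), including the boundary case 1.0.
pi_c = np.array([0.001, 0.05, 0.3, 0.9, 1.0])

# Propensity in the stacked set: P(i in s_c | FP) / {P(i in s_c | FP) + 1}.
p = pi_c / (1.0 + pi_c)

# The odds of p recover the participation rate, as in (2.3.1).
odds = p / (1.0 - p)

print("propensity scores:", p)
print("largest propensity:", p.max())
print("odds equal participation rates:", np.allclose(odds, pi_c))
```

The propensity never exceeds one half, and it reaches one half only for a unit that participates with certainty.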
The corresponding likelihood function can be written as

$$L^*(\boldsymbol{\beta}) = \prod_{i \in s_c \cup^* FP} p_i^{R_i} (1 - p_i)^{(1 - R_i)},$$

where $R_i$ indicates the membership of $s_c$ in $s_c \cup^* FP$ ($= 1$ if $i \in s_c$; $0$ if $i \in FP$), followed by the log-likelihood

$$l^*(\boldsymbol{\beta}) = \sum_{i \in s_c \cup^* FP} \{R_i \log p_i + (1 - R_i) \log(1 - p_i)\} = \sum_{i \in s_c} \log p_i + \sum_{i \in FP} \log(1 - p_i). \tag{2.3.3}$$

The corresponding pseudo log-likelihood is

$$l_p^*(\boldsymbol{\beta}) = \sum_{i \in s_c} \log p_i + \sum_{i \in s_s} d_i \log(1 - p_i). \tag{2.3.4}$$

The maximum pseudo likelihood estimator $\hat{\boldsymbol{\beta}}$ from (2.3.4) can be obtained by solving the pseudo estimating equation $S_p^*(\boldsymbol{\beta}) = 0$, where

$$S_p^*(\boldsymbol{\beta}) = \frac{1}{N + n_c}\left[\sum_{i \in s_c} (1 - p_i) \boldsymbol{x}_i - \sum_{i \in s_s} d_i p_i \boldsymbol{x}_i\right]. \tag{2.3.5}$$

Note that the participation rate is $\pi_i^{(c)} = p_i / (1 - p_i) = \exp(\boldsymbol{\beta}^T \boldsymbol{x}_i)$ by (2.3.1) and (2.3.2). Because $p_i$ is bounded by 0 and $\frac{1}{2}$, $\pi_i^{(c)}$ is automatically bounded by 0 and 1. The ALP estimator of $\mu$ is

$$\hat{\mu}_{ALP} = \frac{\sum_{i \in s_c} w_i^{ALP} y_i}{\sum_{i \in s_c} w_i^{ALP}}, \tag{2.3.6}$$

where $w_i^{ALP} = 1 / \pi_i^{(c)}(\hat{\boldsymbol{\beta}})$ for $i \in s_c$. In addition to being easy to implement with existing survey software, the ALP estimator from (2.3.6) does not require conditions (i) or (ii), unlike RDW.

We consider the following limiting process for the theoretical development [4, 8].
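Before turning to the asymptotics, the ALP steps in (2.3.4) to (2.3.6) can be sketched in code. This is a minimal illustration, not the authors' implementation: it solves the estimating equation (2.3.5) with a hand-written Newton-Raphson loop rather than packaged survey software, and the function name `fit_alp`, the toy population, its coefficients, and the 5% reference sampling rate are all assumptions made for the example.

```python
import numpy as np

def expit(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_alp(x_c, x_s, d_s, max_iter=50, tol=1e-10):
    """Solve sum_{s_c}(1 - p_i) x_i - sum_{s_s} d_i p_i x_i = 0 for beta.

    x_c, x_s: design matrices for s_c and s_s (column 0 assumed to be 1s).
    d_s: survey weights for s_s.
    """
    beta = np.zeros(x_c.shape[1])
    beta[0] = np.log(x_c.shape[0] / d_s.sum())   # crude starting intercept
    for _ in range(max_iter):
        p_c = expit(x_c @ beta)
        p_s = expit(x_s @ beta)
        score = x_c.T @ (1.0 - p_c) - x_s.T @ (d_s * p_s)
        # Negative Hessian of the pseudo log-likelihood (2.3.4).
        info = (x_c.T * (p_c * (1.0 - p_c))) @ x_c \
             + (x_s.T * (d_s * p_s * (1.0 - p_s))) @ x_s
        step = np.linalg.solve(info, score)
        beta = beta + step
        if np.max(np.abs(step)) < tol:
            break
    return beta

# Hypothetical toy population: participation depends on x, and so does y.
rng = np.random.default_rng(7)
N = 50_000
x = rng.normal(size=N)
X = np.column_stack([np.ones(N), x])
y = 1.0 + 2.0 * x + rng.normal(size=N)
pi_c = np.exp(X @ np.array([-3.5, 0.4]))         # true participation rates
in_c = rng.random(N) < pi_c                      # volunteer cohort s_c
in_s = rng.random(N) < 0.05                      # reference survey s_s
d_s = np.full(in_s.sum(), 1.0 / 0.05)            # survey weights

beta_hat = fit_alp(X[in_c], X[in_s], d_s)
w_alp = np.exp(-(X[in_c] @ beta_hat))            # 1 / pi_hat, pi_hat = exp(beta'x)
mu_alp = np.sum(w_alp * y[in_c]) / np.sum(w_alp)

print("population mean  :", y.mean())
print("naive cohort mean:", y[in_c].mean())
print("ALP-weighted mean:", mu_alp)
```

In practice the same fit could be obtained from any weighted logistic regression routine applied to the stacked data, with weight 1 for cohort units and $d_i$ for survey units; the explicit loop is used here only to keep the example self-contained.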
Suppose there is a sequence of finite populations $FP_k$ of size $N_k$, for $k = 1, 2, \ldots$. Cohort $s_{c,k}$ of size $n_{c,k}$ and survey sample $s_{s,k}$ of size $n_{s,k}$ are sampled from each $FP_k$. The sizes of the finite population, the cohort, and the survey sample satisfy $\lim_{k \to \infty} n_{t,k} / N_k = f_t$, where $t = c$ or $s$ and $f_t \le 1$ (regularity condition C1 in Appendix A). In the following, the index $k$ is suppressed for simplicity.

Theorem. Consistency of ALP estimator of finite population mean (see Appendix B)

Under assumptions A1-A3 and the regularity conditions C1-C5 in Appendix A, and assuming the logistic regression model for the propensity scores, the ALP estimate $\hat{\mu}_{ALP}$ is design consistent for $\mu$; in particular, $\hat{\mu}_{ALP} - \mu = O_p(n_c^{-1/2})$, with the finite population variance

$$Var(\hat{\mu}_{ALP}) \doteq N^{-2} \sum_{i \in FP} p_i (1 - 2 p_i) \left\{\frac{y_i - \mu}{p_i} - \boldsymbol{b}^T \boldsymbol{x}_i\right\}^2 + \boldsymbol{b}^T \boldsymbol{D} \boldsymbol{b}, \tag{2.3.7}$$

where $p_i = \mathrm{expit}(\boldsymbol{\beta}^T \boldsymbol{x}_i)$, $\boldsymbol{b}^T = \left\{\sum_{i \in FP} (y_i - \mu) \boldsymbol{x}_i^T\right\}\left\{\sum_{i \in FP} p_i \boldsymbol{x}_i \boldsymbol{x}_i^T\right\}^{-1}$, and $\boldsymbol{D} = N^{-2} V_s\left(\sum_{i \in s_s} d_i p_i \boldsymbol{x}_i\right)$ is the design-based variance-covariance matrix under the probability sampling design for $s_s$.

We prove that, in large samples, $Var(\hat{\mu}_{ALP}) = O(n_c^{-1})$, so the ALP estimator is at least as efficient as the CLW estimator, for which $Var(\hat{\mu}_{CLW}) = O(\min(n_s, n_c)^{-1})$ depends on both the nonprobability and probability sample sizes (see Appendix C).

An alternative method would be to omit the odds transformation and use $p_i$ to approximate the participation rate $\pi_i^{(c)}$. Denote this method by FDW, for full design weight, in contrast to the scaling of the survey sample weights in the RDW method. Consider the expectation of the population log-likelihood function $l(\boldsymbol{\gamma})$ in (2.2.3) and the expectation of the pseudo log-likelihood $l_p^*(\boldsymbol{\beta})$ in (2.3.4) with $\pi_i^{(c)}$ replacing $p_i$ under the FDW method, i.e., $\tilde{l}_p^*(\boldsymbol{\gamma}) = \sum_{i \in s_c} \log \pi_i^{(c)} + \sum_{i \in s_s} d_i \log\{1 - \pi_i^{(c)}\}$. Their difference, denoted by $\Delta_{FDW}$, is

$$\begin{aligned} \Delta_{FDW} &= E\{\tilde{l}_p^*(\boldsymbol{\gamma})\} - E\{l(\boldsymbol{\gamma})\} \\ &= \sum_{i \in FP} \pi_i^{(c)} \log \pi_i^{(c)} + \sum_{i \in FP} \log\{1 - \pi_i^{(c)}\} - \sum_{i \in FP} \pi_i^{(c)} \log \pi_i^{(c)} - \sum_{i \in FP} \{1 - \pi_i^{(c)}\} \log\{1 - \pi_i^{(c)}\} \\ &= \sum_{i \in FP} \pi_i^{(c)} \log\{1 - \pi_i^{(c)}\}. \end{aligned}$$

The bias is negligible only if $\pi_i^{(c)}$ for $i \in FP$ are all close to zero. Thus, the odds transformation step in ALP could be skipped if all nonprobability participation rates are extremely small; in general, however, that step is essential for unbiased estimation.

Variance estimation
Using the finite population variance formula (2.3.7), the first summand can be consistently estimated by

$$\{\hat{N}^{(c)}\}^{-2} \sum_{i \in s_c} (1 - \hat{p}_i)(1 - 2\hat{p}_i) \left\{\frac{y_i - \hat{\mu}_{ALP}}{\hat{p}_i} - \hat{\boldsymbol{b}}^T \boldsymbol{x}_i\right\}^2, \tag{2.4.1}$$

where $\hat{p}_i$ is the predicted propensity score for $i \in s_c$, $\hat{N}^{(c)} = \sum_{i \in s_c} w_i^{ALP}$, and $\hat{\boldsymbol{b}}^T = \left\{\sum_{i \in s_c} (y_i - \hat{\mu}_{ALP}) \boldsymbol{x}_i^T\right\}\left\{\sum_{i \in s_c} \hat{p}_i \boldsymbol{x}_i \boldsymbol{x}_i^T\right\}^{-1}$. The second summand $\boldsymbol{b}^T \boldsymbol{D} \boldsymbol{b}$ is estimated by $\hat{\boldsymbol{b}}^T \hat{\boldsymbol{D}} \hat{\boldsymbol{b}}$, where $\hat{\boldsymbol{D}}$ is the survey design consistent variance estimator of $\boldsymbol{D}$. For example, under stratified multistage cluster sampling with $H$ strata and $a_h$ primary sampling units (PSUs) in stratum $h$ selected with replacement,

$$\hat{\boldsymbol{D}} = \{\hat{N}^{(s)}\}^{-2} \sum_{h=1}^{H} \frac{a_h}{a_h - 1} \sum_{l=1}^{a_h} (\boldsymbol{z}_l - \bar{\boldsymbol{z}})(\boldsymbol{z}_l - \bar{\boldsymbol{z}})^T, \tag{2.4.2}$$

where $\hat{N}^{(s)} = \sum_{i \in s_s} d_i$, $\boldsymbol{z}_l = \sum_{i \in s_{shl}} d_i \hat{p}_i \boldsymbol{x}_i$ is the weighted PSU total for cluster $l$ in stratum $h$, $s_{shl}$ is the set of sample elements in stratum $h$ and cluster $l$, and $\bar{\boldsymbol{z}} = \frac{1}{a_h} \sum_{l=1}^{a_h} \boldsymbol{z}_l$ is the mean of the PSU totals in stratum $h$.

Scaling survey weights in the likelihood for the ALP Method
Unlike the CLW method, the proposed ALP method can flexibly scale the survey weights in estimating equation (2.3.5) to improve efficiency. We multiply the second summand in $S_p^*(\boldsymbol{\beta})$ by a constant $\lambda$, say $\lambda = n_c / \sum_{i \in s_s} d_i$, so that the sum of the scaled survey weights $\lambda d_i$ is $n_c$. Accordingly, the score function becomes

$$S_\lambda^*(\boldsymbol{\beta}) = \sum_{i \in s_c} (1 - p_i) \boldsymbol{x}_i - \lambda \sum_{i \in s_s} d_i p_i \boldsymbol{x}_i.$$

Solving $S_\lambda^*(\boldsymbol{\beta}) = 0$ for $\boldsymbol{\beta}$ gives the vector of estimates $\hat{\boldsymbol{\beta}}_\lambda = (\hat{\beta}_{0,\lambda}, \hat{\boldsymbol{\beta}}_{1,\lambda})$, where $\hat{\beta}_{0,\lambda}$ is the estimate of the intercept. Derivations similar to those in Scott & Wild and Li et al. can be used to prove that $\hat{\boldsymbol{\beta}}_{1,\lambda}$ is design-consistent, with efficiency gains that depend on the variability of the survey weights relative to the nonprobability sample weights (with implicit common value of 1). However, the estimate of the intercept $\hat{\beta}_{0,\lambda}$ can be badly biased with scaled weights. As a result, the estimate of the participation rate $\exp(\hat{\boldsymbol{\beta}}_\lambda^T \boldsymbol{x}_i)$, which includes $\hat{\beta}_{0,\lambda}$, would also be biased. The bias of $\hat{\beta}_{0,\lambda}$, however, does not affect the estimate of the population mean because the scaled ALP-weighted mean,

$$\hat{\mu}_{ALP.S} = \frac{\sum_{i \in s_c} w_i^{ALP.S} y_i}{\sum_{i \in s_c} w_i^{ALP.S}} = \frac{\sum_{i \in s_c} \{\exp(\hat{\boldsymbol{\beta}}_{1,\lambda}^T \boldsymbol{x}_i)\}^{-1} y_i}{\sum_{i \in s_c} \{\exp(\hat{\boldsymbol{\beta}}_{1,\lambda}^T \boldsymbol{x}_i)\}^{-1}},$$

depends on $\hat{\boldsymbol{\beta}}_{1,\lambda}$ only: the factor $\exp(-\hat{\beta}_{0,\lambda})$ is common to all weights and cancels in the ratio. It can be proved that $\hat{\mu}_{ALP.S}$ is a consistent estimator of the finite population mean $\mu$.

The Taylor linearization (TL) variance estimator of $\hat{\mu}_{ALP.S}$ can be obtained by substituting $w_i^{ALP}$, $\hat{\mu}_{ALP}$, $\hat{\boldsymbol{\beta}}$, $\hat{p}_i$, and $d_i$ by $w_i^{ALP.S}$, $\hat{\mu}_{ALP.S}$, $\hat{\boldsymbol{\beta}}_\lambda$, $\hat{p}_{i,\lambda} = \exp(\hat{\boldsymbol{\beta}}_\lambda^T \boldsymbol{x}_i)$, and $\lambda d_i$, respectively, in Formulae (2.4.1) and (2.4.2). Details on the variance and the consistency of ALP.S are discussed in the dissertation by Wang.

SIMULATIONS

3.1. Finite population generation and sample selection
We applied simulation setups similar to those in Chen et al. In the finite population $FP$ of size $N = 500{,}000$, a vector of covariates $\boldsymbol{x}_i = (x_{1i}, x_{2i}, x_{3i}, x_{4i})^{T}$ was generated for $i \in FP$, where $x_{1i} = v_{1i}$, $x_{2i} = v_{2i} + 0.3 x_{1i}$, $x_{3i} = v_{3i} + 0.2(x_{1i} + x_{2i})$, and $x_{4i} = v_{4i} + 0.1(x_{1i} + x_{2i} + x_{3i})$, with $v_{1i} \sim Bernoulli(0.5)$, $v_{2i} \sim Uniform(0, 2)$, $v_{3i} \sim Exponential(1)$, and $v_{4i} \sim \chi^{2}(4)$. The variable of interest was $y_i \sim Normal(\mu_i, 1)$, where $\mu_i = -x_{1i} - x_{2i} + x_{3i} + x_{4i}$ for $i \in FP$. The parameter of interest was the finite population mean $\mu = \frac{1}{N} \sum_{i \in FP} y_i = 3.97$.
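The data-generating steps above can be sketched in Python as follows. This is a minimal illustration assuming NumPy; the function name `generate_finite_population` and the seed are hypothetical choices, so the realized population will not be identical to the one used in the paper.

```python
import numpy as np

def generate_finite_population(N=500_000, seed=0):
    """Generate covariates x_i (N x 4) and outcome y_i as described in the text.
    The seed is an arbitrary assumption, not taken from the paper."""
    rng = np.random.default_rng(seed)
    v1 = rng.binomial(1, 0.5, size=N)      # Bernoulli(0.5)
    v2 = rng.uniform(0.0, 2.0, size=N)     # Uniform(0, 2)
    v3 = rng.exponential(1.0, size=N)      # Exponential(1)
    v4 = rng.chisquare(4, size=N)          # chi-square with 4 df

    x1 = v1.astype(float)
    x2 = v2 + 0.3 * x1
    x3 = v3 + 0.2 * (x1 + x2)
    x4 = v4 + 0.1 * (x1 + x2 + x3)
    x = np.column_stack([x1, x2, x3, x4])

    mu_i = -x1 - x2 + x3 + x4              # unit-level mean of y
    y = rng.normal(mu_i, 1.0)              # y_i ~ Normal(mu_i, 1)
    return x, y

x, y = generate_finite_population()
print(x.shape, y.mean())
```

Taking expectations of the covariates gives $E(x_1) = 0.5$, $E(x_2) = 1.15$, $E(x_3) = 1.33$ and $E(x_4) = 4.298$, so the expected value of $y$ is about 3.98, in line with the reported finite population mean.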
The probability-based survey sample $s_s$ with the target sample size $n_s = 12{,}500$ (sampling fraction $f_s = 2.5\%$) was selected by Poisson sampling, with inclusion probability $\pi_i^{(s)} = (n_s \cdot q_i) / \sum_{i \in FP} q_i$ for $i \in FP$, where $q_i = c + x_{3i} + 0.03 y_i$, with $c$ controlling the variation of the survey weights, $1/\pi_i^{(s)}$. We set $c = -0.26$ so that $\max q_i / \min q_i = 20$. The volunteer-based nonprobability sample $s_c$ (with a target sample size $n_c$) was also selected by Poisson sampling but with different inclusion probabilities $\pi_i^{(c)}$ for $i \in FP$. We considered two scenarios with different functional forms of $\pi_i^{(c)}$, so that the linear logistic regression propensity model assumed by the ALP (and FDW) method was the true model in one scenario, and the one assumed by the CLW method was the true model in the other. In Scenario 1, $\pi_i^{(c)} = \exp(\beta_0 + \boldsymbol{\beta}^{T} \boldsymbol{x}_i)$ was the specified participation rate for the $i$th population unit to be included in the nonprobability sample.
The underlying true propensity model for the ALP (and FDW) methods, shown in (2.3.2), was $\log\{p_i/(1 - p_i)\} = \log\{\pi_i^{(c)}\} = \beta_0 + \boldsymbol{\beta}^{T} \boldsymbol{x}_i$, which implied $\log\{\pi_i^{(c)}/(1 - \pi_i^{(c)})\} = \beta_0 + \boldsymbol{\beta}^{T} \boldsymbol{x}_i - \log\{1 - \pi_i^{(c)}\}$. This model differed from the underlying linear model (2.2.2) assumed by the CLW method by the extra term $-\log\{1 - \pi_i^{(c)}\}$. In Scenario 2, $\pi_i^{(c)} = \mathrm{expit}(\gamma_0 + \boldsymbol{\gamma}^{T} \boldsymbol{x}_i)$ was specified so that $\log\{\pi_i^{(c)}/(1 - \pi_i^{(c)})\} = \gamma_0 + \boldsymbol{\gamma}^{T} \boldsymbol{x}_i$, which was the model (2.2.2) assumed by the CLW method. This model, however, implied that $\log\{p_i/(1 - p_i)\} = \log\{\pi_i^{(c)}\} = \gamma_0 + \boldsymbol{\gamma}^{T} \boldsymbol{x}_i + \log\{1 - \pi_i^{(c)}\}$, which differed from the model assumed by the ALP and FDW methods by the extra term $\log\{1 - \pi_i^{(c)}\}$. Hence, the ALP and CLW estimates of the population mean were each expected to be unbiased in one scenario but not in the other, since both methods assume a linear logistic propensity model.
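The difference between the two scenarios can be made concrete with a short Python sketch. It codes the two participation models, draws a Poisson sample, and evaluates the gap $\log\{\pi_i^{(c)}\} - \mathrm{logit}\{\pi_i^{(c)}\} = \log\{1 - \pi_i^{(c)}\}$, which is negligible for small participation rates and grows with them. The function names, the placeholder standard normal covariates, and the intercept values are assumptions made for illustration; they are not the calibrated values used in the simulation.

```python
import numpy as np

def expit(t):
    return 1.0 / (1.0 + np.exp(-t))

def pi_scenario1(x, beta0, beta):
    """Scenario 1: log(pi) is linear in x (the model assumed by ALP and FDW)."""
    return np.exp(beta0 + x @ beta)

def pi_scenario2(x, gamma0, gamma):
    """Scenario 2: logit(pi) is linear in x (the model assumed by CLW)."""
    return expit(gamma0 + x @ gamma)

def poisson_sample(pi, rng):
    """Poisson sampling: unit i is included independently with probability pi[i]."""
    return rng.random(pi.shape[0]) < pi

rng = np.random.default_rng(1)
x = rng.normal(size=(10_000, 4))               # placeholder covariates (assumption)
slopes = np.array([0.18, 0.18, -0.27, -0.27])  # slope values used in the simulation

for intercept in (-6.0, -4.5, -3.0):           # hypothetical intercepts
    linear = intercept + x @ slopes
    pi2 = pi_scenario2(x, intercept, slopes)
    gap = np.log(pi2) - linear                 # equals log(1 - pi2)
    in_cohort = poisson_sample(pi2, rng)
    print(f"intercept={intercept:5.1f}  mean pi={pi2.mean():.4f}  "
          f"max |gap|={np.abs(gap).max():.4f}  n_c={in_cohort.sum()}")
```

As the intercept, and hence the overall participation rate, increases, the largest gap between the log and logit scales increases as well, which is why the two models are nearly indistinguishable at low participation rates.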
The biases of the FDW and RDW estimates, as measured by $\Delta_{FDW}$ and $\Delta_{RDW}$, depended on $\pi_i^{(c)}$ and went to 0 as $\pi_i^{(c)}$ approached 0. The biases became larger as $\pi_i^{(c)}$ increased in either scenario. In both scenarios, the coefficients were set to $\boldsymbol{\beta} = \boldsymbol{\gamma} = (0.18, 0.18, -0.27, -0.27)^{T}$. The intercepts $\beta_0$ and $\gamma_0$ were controlled so that the expected number of nonprobability sample units $E_c(n_c) = \sum_{i \in FP} \pi_i^{(c)}$ varied over 1,250, 2,500, 5,000, and 10,000, with the corresponding overall participation rate $f_c = E_c(n_c)/N$ being 0.5%, 5%, 10%, or 20%.

3.2. Evaluation Criteria
We examined the performance of five IPSW estimators of the finite population mean $\mu$: (1)-(2) $\hat{\mu}^{ALP}$ and $\hat{\mu}^{ALP.S}$ described in Sections 2.2-2.5; (3) $\hat{\mu}^{FDW}$, using weights from the ALP method omitting the odds transformation; (4) $\hat{\mu}^{CLW}$ proposed by Chen et al; and (5) $\hat{\mu}^{RDW}$ proposed by Valliant & Dever. These were compared with the naïve nonprobability sample mean ($\hat{\mu}^{Naive}$), which did not use weights, and the weighted nonprobability sample mean, $\hat{\mu}^{TW}$, with weights equal to the inverse of the true nonprobability sample inclusion probabilities. Note that $\hat{\mu}^{TW}$ is unavailable in practice because the true nonprobability sample inclusion probabilities are unknown. Relative bias (%RB), empirical variance ($V$), and mean squared error (MSE) were used to evaluate the performance of the IPSW point estimates, calculated by

$$\%RB = \frac{1}{B} \sum_{b=1}^{B} \frac{\hat{\mu}^{(b)} - \mu}{\mu} \times 100, \qquad V = \frac{1}{B-1} \sum_{b=1}^{B} \left[ \hat{\mu}^{(b)} - \frac{1}{B} \sum_{b=1}^{B} \hat{\mu}^{(b)} \right]^{2}, \qquad MSE = \frac{1}{B} \sum_{b=1}^{B} \left\{ \hat{\mu}^{(b)} - \mu \right\}^{2},$$

where $B = 4{,}000$ is the number of simulation runs, $\hat{\mu}^{(b)}$ is one of the point estimates obtained from the $b$-th simulated sample, and $\mu$ is the true finite population mean. We also evaluated the variance estimates using the variance ratio (VR) and the 95% confidence interval coverage probability (CP), which were calculated as

$$VR = \frac{\frac{1}{B} \sum_{b=1}^{B} \hat{v}^{(b)}}{V} \times 100, \qquad CP = \frac{1}{B} \sum_{b=1}^{B} I\left( \mu \in CI^{(b)} \right),$$

where $\hat{v}^{(b)}$ is the proposed analytical variance estimate in the $b$-th simulated sample, and $CI^{(b)} = \left( \hat{\mu}^{(b)} - 1.96 \sqrt{\hat{v}^{(b)}},\ \hat{\mu}^{(b)} + 1.96 \sqrt{\hat{v}^{(b)}} \right)$ is the 95% confidence interval from the $b$-th simulated sample.

3.3. Results
Table 1 presents simulation results for the seven nonprobability sample estimators of the finite population mean. The naïve estimator $\hat{\mu}^{Naive}$, which ignored the underlying sampling scheme, had relative biases ranging from -36.5% to -42.8%, while the true weighted nonprobability sample estimator, $\hat{\mu}^{TW}$, was approximately unbiased in all scenarios. The variance of $\hat{\mu}^{Naive}$ was much smaller than that of the other estimators, but its bias caused the MSE to be extremely high (not reported). Consistent with the bias theory in Section 2, the RDW point estimator $\hat{\mu}^{RDW}$ and the FDW point estimator $\hat{\mu}^{FDW}$ were approximately unbiased when $\pi_i^{(c)}$ was small for all $i \in FP$ and the overall participation rate $f_c = \frac{1}{N} \sum_{i \in FP} \pi_i^{(c)}$ was low, but became more biased as $f_c$ increased. The coverage probabilities decreased correspondingly. As expected, the ALP estimators $\hat{\mu}^{ALP}$ and $\hat{\mu}^{ALP.S}$ (or the CLW estimator $\hat{\mu}^{CLW}$) consistently provided unbiased point estimates in the scenarios where they were expected to be unbiased, i.e., Scenario 1 for $\hat{\mu}^{ALP}$ and $\hat{\mu}^{ALP.S}$, and Scenario 2 for $\hat{\mu}^{CLW}$. When the underlying model was incorrect for an estimator, biases occurred. For example, the relative biases of $\hat{\mu}^{CLW}$ in Scenario 1 were 0.05%, 1.29%, 2.94%, and 7.80% as $f_c$ increased from 0.5% to 5%, 10%, and 20%, respectively. In Scenario 2, the corresponding relative biases for $\hat{\mu}^{ALP}$ were -0.19%, -1.03%, -1.81%, and -2.85%.

Consistent with the theory in Section 2, the ALP estimator $\hat{\mu}^{ALP}$ was more efficient than $\hat{\mu}^{CLW}$, with consistently smaller empirical variances in all scenarios, especially when the nonprobability sample size was much larger than the probability sample size. Among all considered methods, $\hat{\mu}^{ALP.S}$ was approximately unbiased with the smallest variance under Scenario 1, where its model was correct. Under Scenario 2, where its model was misspecified, $\hat{\mu}^{ALP.S}$ was biased but the most efficient, and therefore achieved the smallest MSE.

The variance estimators for $\hat{\mu}^{ALP}$, $\hat{\mu}^{ALP.S}$ and $\hat{\mu}^{CLW}$ performed very well (with VRs near 1), providing coverage probabilities close to the nominal level under the correct propensity models when $f_c$ was large. The coverage below the nominal level (about 88%) when $f_c = 0.5\%$ was due to small-sample bias with skewed distributions of the underlying sampling weights in the selected nonprobability sample.

REAL DATA EXAMPLE
We use the same data example as Wang for illustration. We estimated prospective 15-year all-cause mortality rates for adults in the US using the adult household interview part of the Third U.S. National Health and Nutrition Examination Survey (NHANES III), conducted in 1988-1994, with sample size $n_c = 20{,}050$. We ignored all complex design features of NHANES III and treated it as a nonprobability sample. The coefficient of variation (CV) of sample weights is $n_s = 19{,}738$). The 1994 NHIS used a multistage stratified cluster sample design with 125 strata and 248 pseudo-PSUs. For variance estimation, we collapsed each stratum containing only one PSU with the nearest stratum. Both the NHANES III and NHIS samples were linked to the National Death Index (NDI) for mortality, allowing us to quantify the relative bias of unweighted NHANES estimates, taking the NHIS estimates as the gold standard.

Using NHANES III as the “nonprobability sample” has several advantages for illuminating the performance of the propensity weighting methods. The “nonprobability sample” and the reference survey sample have approximately the same target population, the same data collection mode, and similar questionnaires. This ensures that the pseudo-weighted “nonprobability sample” could potentially represent the target population, and thus enables us to characterize the performance of the propensity weighting methods in real data.

The distributions of selected common covariates in the two samples are presented in Table 2. As expected, the covariates in the weighted NHANES and 1994 NHIS samples have very close distributions because both weighted samples represent approximately the same finite population. In contrast, the covariate distributions in the unweighted NHANES differ considerably from those in the weighted samples, especially for design variables such as age, race/ethnicity, poverty, and region.

The propensity model included main effects of common demographic characteristics (age, sex, race/ethnicity, region, and marital status), socioeconomic status (education level, poverty, and household income), tobacco usage (smoking status and chewing tobacco), health variables (body mass index [BMI] and self-reported health status), and a quadratic term for age. Appendix D shows the final propensity models for the five considered methods.

To evaluate the performance of the five PS-based methods, we used the relative difference from the NHIS estimate, $\%RD = \frac{\hat{\mu} - \hat{\mu}^{NHIS}}{\hat{\mu}^{NHIS}} \times 100$, the TL variance estimate ($V$), and the estimated $MSE = (\hat{\mu} - \hat{\mu}^{NHIS})^{2} + V$, which treated the NHIS estimate as the truth.

Table 3 shows that the weighted 1994 NHIS and the sample-weighted NHANES III (TW) estimates of 15-year all-cause mortality were very close (%RD = 2.6%). In contrast, the naïve NHANES III estimate of overall mortality differed from the NHIS estimate by about 52%, because older people, who have higher mortality, were oversampled (Table 2). All five IPSW methods substantially reduced the bias of the naïve estimate. Consistent with the simulation results, the ALP, FDW, RDW, and CLW methods yielded close estimates when the sample fraction of the nonprobability sample was small ($\hat{f}_c = n_c / \hat{N}_s = 1.06 \times 10^{-4}$, calculated from Table 2). The ALP.S method, by scaling the NHIS sample weights in propensity estimation, reduced more bias than the other methods and was more efficient. Therefore, the ALP.S estimate had the smallest MSE.

DISCUSSION
This paper proposed adjusted logistic propensity weighting methods for population inference using nonprobability samples. The proposed ALP method corrects the bias in the rescaled design weight method (RDW, Valliant 2020) by formulating the problem in an innovative way. Like the RDW method, the proposed ALP method retains the advantage of easy implementation by fitting a propensity model with survey weights in ready-to-use software. The proposed ALP estimators are design consistent. Taylor linearization variance estimators for ALP estimates are derived. Consistency of the ALP finite population mean estimators was proved theoretically and evaluated numerically.

Both the ALP and CLW methods fit the propensity model to the combined nonprobability sample and weighted survey sample. Highly variable weights in the sample lead to low efficiency of the estimated propensity model coefficients. Therefore, the variances of the ALP and CLW estimators of the finite population means can be large. The proposed ALP is shown analytically and numerically to be at least as efficient as the CLW method. The ALP with scaled weights produces consistent propensity estimates and further improves efficiency, as shown in the simulation and the real data example. The CLW estimator with scaled survey weights, albeit more efficient, is biased (simulation results not shown).

It is worth noting that the ALP and CLW methods assume different logistic regression models for propensity score estimation. Propensity is defined as $p_i = P(i \in s_c \mid s_c \cup^{*} FP)$ by ALP in (2.3.2) and as $\pi_i = P(i \in s_c \mid FP)$ by CLW in (2.2.2). Model diagnostics need to be conducted to select the propensity model; this is a focus of our future research. There are a number of shortcomings associated with the estimation of the propensity score using logistic regression.
First, the logistic model is susceptible to model misspecification, requiring assumptions regarding correct variable selection and functional form, including the choice of polynomial terms and multiple-way interactions. If any of these assumptions is incorrect, propensity score estimates can be biased, and balance may not be achieved when conditioning on the estimated PS. Second, implementing a search routine for model specification, such as repeatedly fitting logistic regression models while including or excluding predictor variables, interactions, or transformations of variables, can be computationally infeasible or suboptimal. In this context, parametric regression can be limiting in terms of the model structures that can be searched over, particularly when many potential predictors are present (high-dimensional data). Machine learning methods for estimating the propensity score that incorporate survey weights will be a subject of our future research.

REFERENCE

1. Collins R. What makes UK Biobank special. Lancet. 2012;379(9822):1173-4.
2. Fry A, Littlejohns TJ, Sudlow C, Doherty N, Adamska L, Sprosen T, Collins R, Allen NE. Comparison of sociodemographic and health-related characteristics of UK Biobank participants with those of the general population. American Journal of Epidemiology. 2017;186(9):1026-34.
3. Elliott MR, Valliant R. Inference for nonprobability samples. Statistical Science. 2017;32(2):249-64.
4. Chen Y, Li P, Wu C. Doubly robust inference with nonprobability survey samples. Journal of the American Statistical Association. 2019;1-1.
5. Wang L, Graubard BI, Katki H, Li Y. Improving external validity of epidemiologic cohort analyses: a kernel weighting approach. Journal of the Royal Statistical Society: Series A (Statistics in Society). 2020; doi:10.1111/rssa.12564.
6. Valliant R, Dever JA. Estimating propensity adjustments for volunteer web surveys. Sociological Methods & Research. 2011;40(1):105-37.
7. Valliant R. Comparing alternatives for estimation from nonprobability samples. Journal of Survey Statistics and Methodology. 2020;8(2):231-63.
8. Krewski D, Rao JN. Inference from stratified samples: properties of the linearization, jackknife and balanced repeated replication methods. The Annals of Statistics. 1981;1010-9.
9. Scott AJ, Wild CJ. Fitting logistic models under case-control or choice based sampling. Journal of the Royal Statistical Society: Series B (Methodological). 1986;48(2):170-82.
10. Li Y, Graubard BI, DiGaetano R. Weighting methods for population-based case-control studies with complex sampling. Journal of the Royal Statistical Society: Series C (Applied Statistics). 2011;60(2):165-85.
11. Wang L. Improving external validity of epidemiologic analyses by incorporating data from population-based surveys. Doctoral dissertation, University of Maryland, College Park; 2020.
12. Massey JT. Design and estimation for the National Health Interview Survey, 1985-94. US Department of Health and Human Services, Public Health Service, Centers for Disease Control, National Center for Health Statistics; 1989.
13. Ezzati TM, Massey JT, Waksberg J, Chu A, Maurer KR. Sample design: Third National Health and Nutrition Examination Survey. Vital and Health Statistics. Series 2, Data Evaluation and Methods Research. 1992;(113):1-35.
14. Hartley HO, Rao JN, Kiefer G. Variance estimation with one unit per stratum. Journal of the American Statistical Association. 1969;64(327):841-51.

Table 1. Results from 4,000 simulated survey samples and nonprobability samples with low to high participation rates under various propensity score models

                    Scenario 1: True propensity model for ALP                  Scenario 2: True propensity model for CLW
Estimator                   %RB   V(×10^5)         VR MSE(×10^5)         CP            %RB   V(×10^5)         VR MSE(×10^5)         CP

$f_c = 0.5\%$
$\hat\mu^{Naive}$        -42.76       0.22       0.99                                -42.61       0.22       1.00
$\hat\mu^{TW}$            -0.13       4.38       0.93       4.39       0.90           -0.12       4.38       0.93       4.38       0.90
$\hat\mu^{RDW}$           -0.29       3.73       0.93       3.75       0.87           -0.40       3.63       0.93       3.66       0.87
$\hat\mu^{FDW}$           -0.28       3.73       0.93       3.75       0.87           -0.40       3.63       0.93       3.66       0.87
$\hat\mu^{ALP}$           -0.07       3.70       0.93       3.77       0.88           -0.19       3.66       0.93       3.67       0.88
$\hat\mu^{CLW}$
$\hat\mu^{ALP.S}$         -0.11       3.54       0.92       3.54       0.87           -0.21       3.45       0.92       3.45       0.87

$f_c = 5\%$
$\hat\mu^{Naive}$        -42.74       0.02       0.99                                -41.21       0.02       1.01
$\hat\mu^{TW}$            -0.04       0.50       0.98       0.50       0.92           -0.02       0.46       1.00       0.47       0.93
$\hat\mu^{RDW}$           -2.15       0.56       1.00       1.29       0.66           -3.05       0.43       1.01       1.89       0.45
$\hat\mu^{FDW}$           -2.05       0.57       1.00       1.23       0.68           -2.95       0.43       1.01       1.81       0.47
$\hat\mu^{ALP}$           -0.01       0.62       1.00       0.62       0.94           -1.03       0.47       1.01       0.64       0.85
$\hat\mu^{CLW}$
$\hat\mu^{ALP.S}$         -0.05       0.45       1.00       0.45       0.92           -0.63       0.35       1.02       0.41       0.86

$f_c = 10\%$
$\hat\mu^{Naive}$        -42.74       0.01       1.11                                -39.65       0.01       1.11
$\hat\mu^{TW}$            -0.01       0.25       1.02       0.25       0.94           -0.01       0.22       1.01       0.22       0.94
$\hat\mu^{RDW}$           -4.25       0.34       1.00       3.20       0.17           -5.62       0.22       0.99       5.20       0.02
$\hat\mu^{FDW}$           -3.87       0.35       1.00       2.71       0.24           -5.28       0.22       0.99       4.62       0.03
$\hat\mu^{ALP}$
$\hat\mu^{CLW}$
$\hat\mu^{ALP.S}$         -0.03       0.27       1.02       0.27       0.94           -0.86       0.18       1.01       0.29       0.81

$f_c = 20\%$
$\hat\mu^{Naive}$        -42.75       0.00       1.26                                -36.50       0.01       1.21
$\hat\mu^{TW}$            -0.02       0.15       0.93       0.15       0.93           -0.02       0.11       0.96       0.11       0.93
$\hat\mu^{RDW}$           -8.58       0.21       0.95      11.83       0.00           -9.59       0.10       0.97      14.60       0.00
$\hat\mu^{FDW}$           -7.15       0.23       0.96       8.29       0.01           -8.51       0.11       0.98      11.53       0.00
$\hat\mu^{ALP}$
$\hat\mu^{CLW}$
$\hat\mu^{ALP.S}$         -0.03       0.19       0.96       0.19       0.94           -1.02       0.10       1.00       0.27       0.73

Table 2. Distribution of selected common variables in NHANES III and NHIS

                    NHIS 1994                          NHANES III
Total Count         $n_s =$      $\hat{N}_s =$          $n_c =$      $\hat{N}_s =$
Table 3. Relative difference (%RD) of all-cause 15-year mortality estimates from the NHIS estimate with estimated variance ($V$) and mean squared error (MSE)

Method     Estimate (%)       %RD   V(×10^5)   MSE(×10^5)
NHIS               17.6
TW                 17.1     -2.65
Naïve              26.7     52.16
ALP                18.6      6.08       1.87        13.27
FDW                18.6      6.08       1.87        13.28
RDW                18.6      6.08       1.87        13.28
CLW                18.6      6.07       1.87        13.24
ALP.S              17.2     -2.05       1.08         2.37

Supporting Document for Adjusted Logistic Propensity Weighting Methods for Population Inference using Nonprobability Volunteer-Based Epidemiologic Cohorts

By Lingxiao Wang, Richard Valliant, and Yan Li

A. Regularity Conditions

C1 The finite population size $N$, the cohort sample size $n_c$, and the survey sample size $n_s$ satisfy $\lim_{N \to \infty,\, n_c \to \infty} n_c / N = f_c \in (0, 1)$ and $\lim_{N \to \infty,\, n_s \to \infty} n_s / N = f_s \in (0, 1)$.

C2 There exist constants $c_1$ and $c_2$ such that $0 < c_1 \le N \pi_i^{(c)} / n_c \le c_2$ and $0 < c_1 \le N \pi_i^{(s)} / n_s \le c_2$ for all units $i \in FP$.

C3 The finite population ($FP$) and the sample selection for $s_s$ satisfy $N^{-1} \sum_{i \in s_s} d_i \boldsymbol{r}_i - N^{-1} \sum_{i \in FP} \boldsymbol{r}_i = O_p(n_s^{-1/2})$, where $\boldsymbol{r}_i$ includes $\boldsymbol{x}_i$ and $y_i$, the order in probability is with respect to the probability sampling mechanism used to select $s_s$, and $d_i = 1/\pi_i^{(s)}$.

C4 The $FP$ and the propensity scores $p_i$ satisfy $N^{-1} \sum_{i \in FP} y_i^{2} = O(1)$, $N^{-1} \sum_{i \in FP} \|\boldsymbol{x}_i\|^{3} = O(1)$, and $N^{-1} \sum_{i \in FP} p_i \boldsymbol{x}_i \boldsymbol{x}_i^{T} = O(1)$, with the latter being a positive definite matrix.

C5 The cohort participation and the survey sample selection satisfy $Cov(\delta_i^{(c)}, \delta_j^{(s)}) = 0$ for $i, j \in FP$.

Conditions C1-C3 are regularly used in practice. Under C1, the sample fractions of the nonprobability and probability samples are bounded. Condition C2 indicates that the (implicit) sample weights of nonprobability and probability sample units are bounded, i.e., $\pi_i^{(c)} = O(n_c / N)$ and $\pi_i^{(s)} = O(n_s / N)$, and that the inclusion probabilities for the nonprobability and probability samples do not differ in order of magnitude from those under simple random sampling. Condition C3 guarantees consistency of the Horvitz-Thompson estimators obtained from the probability sample. Condition C4 gives the typical finite moment conditions needed to validate Taylor series expansions. Condition C5 requires that selection of the nonprobability and probability samples be independent, which simplifies the asymptotic variance calculation.

B. Proof of Theorem
We consider the following limiting process (Krewski & Rao, 1981; Chen, Li &Wu, 2019). Suppose there is a sequence of finite populations 𝐹𝑃 (cid:3038) of size 𝑁 (cid:3038) , for 𝑘 (cid:3404) 1, 2, ⋯ . Cohort 𝑠 (cid:3030),(cid:3038) of size 𝑛 (cid:3030),(cid:3038) and survey sample 𝑠 (cid:3046),(cid:3038) of size 𝑛 (cid:3046),(cid:3038) are sampled from each 𝐹𝑃 (cid:3038) . The sequences of the finite population, the cohort and the survey sample have their sizes satisfy lim (cid:3038)→(cid:2998) (cid:3041) (cid:3295),(cid:3286) (cid:3015) (cid:3286) → 𝑓 (cid:3047) where 𝑡 (cid:3404) 𝑐 or 𝑠 and (cid:3047) (cid:3409) 1 (regularity condition C1 in Appendix A). In the following the index 𝑘 is suppressed for simplicity. Let 𝜼 (cid:3021) (cid:3404) (cid:4666)𝜇, 𝜷 (cid:3021) (cid:4667) . The ALP estimate of the finite population mean, 𝜇̂ (cid:3002)(cid:3013)(cid:3017) , given in expression (2.3.6) in the main text, along with the estimates of propensity score model parameters, 𝜷(cid:3553) (solution of 𝑆 (cid:3043)∗ (cid:4666)𝜷(cid:4667) (cid:3404) 0 in expression (2.3.5) in the main text), can be combined as 𝜼(cid:3549) (cid:3021) (cid:3404)(cid:3435)𝜇̂ (cid:3002)(cid:3013)(cid:3017) , 𝜷(cid:3553) (cid:3021) (cid:3439) , which is the solution to the joint pseudo estimating equations Φ(cid:4666)𝜼(cid:4667) (cid:3404) ⎝⎛ 𝑈(cid:4666)𝜇(cid:4667) (cid:3404) 1𝑁 (cid:3533) 𝛿 (cid:3036)(cid:4666)(cid:3030)(cid:4667) 𝑤 (cid:3036) (cid:4666)𝑦 (cid:3036) (cid:3398) 𝜇(cid:4667) (cid:3036)∈(cid:3007)(cid:3017) 𝑆 (cid:3043)∗ (cid:4666)𝜷(cid:4667) (cid:3404) 1𝑁 (cid:3397) 𝑛 (cid:3030) (cid:3533) 𝛿 (cid:3036)(cid:4666)(cid:3030)(cid:4667) (cid:4666)1 (cid:3398) 𝑝 (cid:3036) (cid:4667)𝒙 (cid:3036)(cid:3036)∈(cid:3007)(cid:3017) (cid:3398) 1𝑁 (cid:3397) 𝑛 (cid:3030) (cid:3533) 𝛿 (cid:3036)(cid:4666)(cid:3046)(cid:4667) 𝑑 (cid:3036) 𝑝 (cid:3036) 𝒙 (cid:3036)(cid:3036)∈(cid:3007)(cid:3017) ⎠⎞(cid:3404) 𝟎, (B.1) where 𝑤 (cid:3036) (cid:3404) 1/𝜋 (cid:3036)(cid:4666)(cid:3030)(cid:4667) 
(cid:3404) (cid:4666)1 (cid:3398) 𝑝 (cid:3036) (cid:4667)/𝑝 (cid:3036) . Under the joint randomization of the propensity model (i.e., self-selection of 𝑠 (cid:3030) ) and the sampling design of 𝑠 (cid:3046) , we have 𝐸(cid:4668)Φ(cid:4666)𝜼 (cid:2868) (cid:4667)(cid:4669) (cid:3404) 𝟎 , where 𝜼 (cid:2868)(cid:3021) (cid:3404) (cid:4666)𝜇 (cid:2868) , 𝜷 (cid:2868)(cid:3021) (cid:4667) with 𝜇 (cid:2868) and 𝜷 (cid:2868) being the true value of 𝜇 and 𝜷 respectively. The consistency of 𝜼(cid:3549) follows similar arguments to those in Chen, Li & Wu (2019) (which cited Section 3.2 of Tsiatis (2007)). Under the conditions C1 - C4 , we have Φ(cid:4666)𝜼(cid:3549)(cid:4667) (cid:3404) 𝟎
By applying the first-order Taylor expansion, we have 𝜼(cid:3549) (cid:3398) 𝜼 (cid:2868) (cid:3404)(cid:4662) (cid:4670)𝐸(cid:4668)𝜙(cid:4666)𝜼 (cid:2868) (cid:4667)(cid:4669)(cid:4671) (cid:2879)(cid:2869)
Φ(cid:4666)𝜼 (cid:2868) (cid:4667), (B.2) where
𝐸(cid:4668)𝜙(cid:4666)𝜼(cid:4667)(cid:4669) (cid:3404) 𝐸 (cid:4676) (cid:3105)(cid:2957)(cid:4666)𝜼(cid:4667)(cid:3105)𝜼 (cid:4677) (cid:3404) (cid:3436)𝑈 (cid:3091) 𝑈 𝜷 𝜷 (cid:3440) , and 𝑈 (cid:3091) (cid:3404) 𝐸(cid:4666)𝜕𝑈 𝜕𝜇⁄ (cid:4667) (cid:3404) (cid:3398) 1𝑁 (cid:3533) 𝜋 (cid:3036)(cid:4666)(cid:3030)(cid:4667) 𝑤 (cid:3036)(cid:3036)∈(cid:3007)(cid:3017) (cid:3404) (cid:3398)1, 𝑈 𝜷 (cid:3404) 𝐸(cid:4666)𝜕𝑈 𝜕𝜷 (cid:3021) ⁄ (cid:4667) (cid:3404) 1𝑁 (cid:3533) 𝜋 (cid:3036)(cid:4666)(cid:3030)(cid:4667) (cid:4666)𝑦 (cid:3036) (cid:3398) 𝜇(cid:4667) 𝜕𝑤 (cid:3036) 𝜕𝜷 (cid:3021)(cid:3036)∈(cid:3007)(cid:3017) (cid:3404) (cid:3398) 1𝑁 (cid:3533) (cid:4666)𝑦 (cid:3036) (cid:3398) 𝜇(cid:4667)𝒙 (cid:3036)(cid:3021)(cid:3036)∈(cid:3007)(cid:3017) 𝑆 𝜷 (cid:3404) 𝐸(cid:3435)𝜕𝑆 (cid:3043)∗ 𝜕𝜷⁄ (cid:3439) (cid:3404) (cid:3398) 1𝑁 (cid:3397) 𝑛 (cid:3030) (cid:3533) 𝜋 (cid:3036)(cid:4666)(cid:3030)(cid:4667) ⋅ 𝑝 (cid:3036) (cid:4666)1 (cid:3398) 𝑝 (cid:3036) (cid:4667)𝒙 (cid:3036)(cid:3036)∈(cid:3007)(cid:3017) 𝒙 (cid:3036)(cid:3021) (cid:3398) 1𝑁 (cid:3397) 𝑛 (cid:3030) (cid:3533) 𝑝 (cid:3036) (cid:4666)1 (cid:3398) 𝑝 (cid:3036) (cid:4667)𝒙 (cid:3036) 𝒙 (cid:3036)(cid:3021)(cid:3036)∈(cid:3007)(cid:3017) (cid:3404) (cid:3398) 1𝑁 (cid:3397) 𝑛 (cid:3030) (cid:3533) 𝑝 (cid:3036) 𝒙 (cid:3036) 𝒙 (cid:3036)(cid:3021)(cid:3036)∈(cid:3007)(cid:3017) (cid:4666)negative definite by condition 𝐂𝟒(cid:4667) It follows that 𝜇̂ (cid:3404) 𝜇 (cid:2868) (cid:3397) 𝑂 (cid:3043) (cid:3435)𝑛 (cid:3030)(cid:2879)(cid:2869)/(cid:2870) (cid:3439) , and
$$Var(\hat{\boldsymbol{\eta}}) \doteq \left[E\{\phi(\boldsymbol{\eta}_0)\}\right]^{-1} Var\{\Phi(\boldsymbol{\eta}_0)\}\left[E\{\phi(\boldsymbol{\eta}_0)\}^T\right]^{-1}, \tag{B.3}$$
where
$$\left[E\{\phi(\boldsymbol{\eta})\}\right]^{-1} = \begin{pmatrix} -1 & \dfrac{N + n_c}{N}\boldsymbol{b}^T \\ \mathbf{0} & S_{\boldsymbol{\beta}}^{-1} \end{pmatrix},$$
and $\boldsymbol{b}^T = \left\{\sum_{i \in FP}(y_i - \mu)\boldsymbol{x}_i^T\right\}\left\{\sum_{i \in FP} p_i\boldsymbol{x}_i\boldsymbol{x}_i^T\right\}^{-1}$. The middle part of (B.3), i.e., $Var\{\Phi(\boldsymbol{\eta}_0)\}$, can be calculated by partitioning
$$\Phi(\boldsymbol{\eta}) = \Phi_1 + \Phi_2,$$
where
$$\Phi_1 = \sum_{i \in FP}\begin{Bmatrix} \dfrac{1}{N}\,\delta_i^{(c)} w_i (y_i - \mu) \\ \dfrac{1}{N + n_c}\,\delta_i^{(c)} (1 - p_i)\boldsymbol{x}_i \end{Bmatrix}, \qquad \Phi_2 = -\frac{1}{N + n_c}\sum_{i \in FP}\begin{pmatrix} 0 \\ \delta_i^{(s)} d_i p_i \boldsymbol{x}_i \end{pmatrix}.$$
Notice that $\Phi_1$ and $\Phi_2$ are independent under condition C5, because $\Phi_1$ only involves randomization of cohort participation while $\Phi_2$ only involves survey sample selection. Hence, $Var\{\Phi(\boldsymbol{\eta}_0)\} = Var(\Phi_1) + Var(\Phi_2)$, where
$$Var(\Phi_1) = \sum_{i \in FP} p_i(1 - 2p_i)\begin{Bmatrix} \dfrac{1}{N^2}(y_i - \mu)^2 / p_i^2 & \dfrac{1}{N(N + n_c)}(y_i - \mu)\boldsymbol{x}_i^T / p_i \\ \dfrac{1}{N(N + n_c)}(y_i - \mu)\boldsymbol{x}_i / p_i & \dfrac{1}{(N + n_c)^2}\boldsymbol{x}_i\boldsymbol{x}_i^T \end{Bmatrix}$$
under the assumption of Poisson sampling of the nonprobability sample, and
$$Var(\Phi_2) = \begin{pmatrix} 0 & \mathbf{0}^T \\ \mathbf{0} & \boldsymbol{D} \end{pmatrix},$$
with $\boldsymbol{D}$ being the design-based variance-covariance matrix under the probability sampling design for sample $s_s$. For example, if the survey sample is randomly selected by Poisson sampling, $\boldsymbol{D} = (N + n_c)^{-2}\sum_{i \in FP}(d_i - 1)p_i^2\boldsymbol{x}_i\boldsymbol{x}_i^T$. The finite population variance of $\hat{\mu}^{ALP}$ is the first diagonal element of $Var(\hat{\boldsymbol{\eta}})$, and is given by
$$Var(\hat{\mu}^{ALP}) = \begin{pmatrix} -1 & \boldsymbol{b}^T \end{pmatrix} \cdot \left\{Var(\Phi_1) + Var(\Phi_2)\right\} \cdot \begin{pmatrix} -1 \\ \boldsymbol{b} \end{pmatrix} = N^{-2}\sum_{i \in FP} p_i(1 - 2p_i)\left\{\frac{y_i - \mu}{p_i} - \boldsymbol{b}^T\boldsymbol{x}_i\right\}^2 + \boldsymbol{b}^T\boldsymbol{D}\boldsymbol{b}.$$
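As an illustrative numerical check (not part of the original derivation), the sketch below builds $Var(\Phi_1)$ for a small synthetic finite population and confirms that the quadratic form taken with the first row of $[E\{\phi(\boldsymbol{\eta})\}]^{-1}$ in (B.3) reproduces the closed-form term $N^{-2}\sum p_i(1 - 2p_i)\{(y_i - \mu)/p_i - \boldsymbol{b}^T\boldsymbol{x}_i\}^2$. The synthetic population, the coefficient values, and the function name `phi1_variance_term` are assumptions made only for this illustration.

```python
import numpy as np

def phi1_variance_term(y, X, p, n_c):
    """Return (sandwich, closed_form) for the Var(Phi_1) part of Var(mu_hat_ALP)."""
    N = len(y)
    r = y - y.mean()                       # y_i - mu
    k = p * (1.0 - 2.0 * p)                # Poisson-sampling factor p_i (1 - 2 p_i)
    # b = {sum p_i x_i x_i^T}^{-1} {sum (y_i - mu) x_i}; the matrix is symmetric
    b = np.linalg.solve((X * p[:, None]).T @ X, X.T @ r)

    q = X.shape[1]
    V = np.zeros((q + 1, q + 1))           # block form of Var(Phi_1)
    V[0, 0] = np.sum(k * r**2 / p**2) / N**2
    V[0, 1:] = (X * (k * r / p)[:, None]).sum(axis=0) / (N * (N + n_c))
    V[1:, 0] = V[0, 1:]
    V[1:, 1:] = (X * k[:, None]).T @ X / (N + n_c) ** 2

    # First row of the inverse matrix displayed in (B.3)
    row = np.concatenate(([-1.0], (N + n_c) / N * b))
    sandwich = row @ V @ row
    closed_form = np.sum(k * (r / p - X @ b) ** 2) / N**2
    return sandwich, closed_form

# Synthetic finite population (illustrative values only)
rng = np.random.default_rng(0)
N = 2000
x = rng.normal(size=N)
X = np.column_stack([np.ones(N), x])
beta = np.array([-3.0, 0.5])               # hypothetical propensity coefficients
p = 1.0 / (1.0 + np.exp(-(X @ beta)))      # p_i = expit(beta^T x_i), below 1/2 here
y = 1.0 + x + rng.normal(size=N)
n_c = int(round(np.sum(p / (1.0 - p))))    # expected cohort size, sum of p_i / (1 - p_i)

sandwich, closed_form = phi1_variance_term(y, X, p, n_c)
print(sandwich, closed_form)
```

The two printed numbers should agree up to floating-point error. The design-based term involving $\boldsymbol{D}$ is omitted because it depends on the survey design.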
Note that $p_i = P(i \in s_c \mid s_c \cup^{*} FP) \le 1/2$.

C. Comparing Orders of Magnitude of $Var(\hat{\mu}^{ALP})$ and $Var(\hat{\mu}^{CLW})$
The pseudo-weighted nonprobability sample estimator of the population mean is written as
$$\hat{\mu} = \frac{1}{\sum_{i \in s_c}\tilde{w}_i}\sum_{i \in s_c}\tilde{w}_i y_i,$$
where $\tilde{w}_i$ is the pseudoweight $\tilde{w}_i^{ALP}$ in the ALP estimator $\hat{\mu}^{ALP}$,
$$\tilde{w}_i^{ALP} = \frac{1 - \hat{p}_i}{\hat{p}_i} = \exp^{-1}(\hat{\boldsymbol{\beta}}^T\boldsymbol{x}_i),$$
or the pseudoweight $\tilde{w}_i^{CLW}$ in the CLW estimator $\hat{\mu}^{CLW}$,
$$\tilde{w}_i^{CLW} = \frac{1}{\hat{\pi}_i^{(c)}} = 1 + \exp^{-1}(\hat{\boldsymbol{\gamma}}^T\boldsymbol{x}_i),$$
with $\exp^{-1}(u) = 1/\exp(u)$, and where $\hat{\boldsymbol{\beta}}$ and $\hat{\boldsymbol{\gamma}}$ are the solutions of the pseudo estimating equations $S_p^{*}(\boldsymbol{\beta}) = 0$ and $S_p(\boldsymbol{\gamma}) = 0$ in formulae (2.3.5) and (2.2.7) in the main text, respectively.
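The two pseudoweights and the weighted mean above can be sketched in a few lines. This is a minimal illustration that assumes the coefficient estimates are already available; the function names and the toy inputs are hypothetical.

```python
import numpy as np

def alp_weights(X, beta_hat):
    """ALP pseudoweights (1 - p_hat) / p_hat = exp(-beta_hat^T x_i)."""
    return np.exp(-(X @ beta_hat))

def clw_weights(X, gamma_hat):
    """CLW pseudoweights 1 / pi_hat = 1 + exp(-gamma_hat^T x_i)."""
    return 1.0 + np.exp(-(X @ gamma_hat))

def pseudo_weighted_mean(y, w):
    """Pseudo-weighted estimator of the mean: sum(w * y) / sum(w)."""
    y = np.asarray(y, dtype=float)
    w = np.asarray(w, dtype=float)
    return np.sum(w * y) / np.sum(w)

# Toy cohort with an intercept and one covariate (illustrative values only)
X_c = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])
y_c = np.array([1.0, 2.0, 3.0])
coef = np.array([-2.0, 0.5])     # hypothetical fitted coefficients

mu_alp = pseudo_weighted_mean(y_c, alp_weights(X_c, coef))
mu_clw = pseudo_weighted_mean(y_c, clw_weights(X_c, coef))
print(mu_alp, mu_clw)
```

In practice the ALP and CLW coefficients come from different estimating equations, so the same `coef` is reused here only to keep the example short.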
According to the law of total variance, the finite population variance of $\hat{\mu}$ can be written as
$$V(\hat{\mu}) = E_w\left[V_c(\hat{\mu} \mid \tilde{\boldsymbol{w}})\right] + V_w\left[E_c(\hat{\mu} \mid \tilde{\boldsymbol{w}})\right], \tag{C.1}$$
where $\tilde{\boldsymbol{w}} = (\tilde{w}_1, \ldots, \tilde{w}_N)$ is the vector of pseudo nonprobability sample weights for the finite population; $E_w$ and $V_w$ are with respect to the propensity model; $V_c$ and $E_c$ are with respect to the nonprobability sampling process, and we have
$$E_c(\hat{\mu} \mid \tilde{\boldsymbol{w}}) = \frac{\sum_{i \in FP}\pi_i^{(c)}\tilde{w}_i y_i}{\sum_{i \in FP}\pi_i^{(c)}\tilde{w}_i} + O(n_c^{-1})$$
and
$$V_c(\hat{\mu} \mid \tilde{\boldsymbol{w}}) = \frac{\sum_{i \in FP}\pi_i^{(c)}\left(1 - \pi_i^{(c)}\right)\tilde{w}_i^2\left(y_i - \dfrac{\sum_{j \in FP}\pi_j^{(c)}\tilde{w}_j y_j}{\sum_{j \in FP}\pi_j^{(c)}\tilde{w}_j}\right)^2}{\left(\sum_{i \in FP}\pi_i^{(c)}\tilde{w}_i\right)^2}$$
assuming Poisson sampling. The first term in (C.1), which is $E_w[V_c(\hat{\mu} \mid \tilde{\boldsymbol{w}})]$, has order $O(n_c^{-1})$ for both $\hat{\mu}^{ALP}$ and $\hat{\mu}^{CLW}$ under condition C2.
The second term in (C.1) is approximately
$$V_w\left[E_c(\hat{\mu} \mid \tilde{\boldsymbol{w}})\right] \doteq \left(\frac{\partial E_c(\hat{\mu} \mid \tilde{\boldsymbol{w}})}{\partial \tilde{\boldsymbol{w}}}\right) V(\tilde{\boldsymbol{w}}) \left(\frac{\partial E_c(\hat{\mu} \mid \tilde{\boldsymbol{w}})}{\partial \tilde{\boldsymbol{w}}}\right)^T. \tag{C.2}$$
The middle term in (C.2) is
$$V(\tilde{\boldsymbol{w}}) = \left(\frac{\partial \tilde{\boldsymbol{w}}}{\partial \hat{\mathbf{B}}}\right) V(\hat{\mathbf{B}}) \left(\frac{\partial \tilde{\boldsymbol{w}}}{\partial \hat{\mathbf{B}}}\right)^T = \left[\frac{\partial}{\partial \hat{\mathbf{B}}}\exp^{-1}(\hat{\mathbf{B}}^T\boldsymbol{x})\right]\left\{V(\hat{\mathbf{B}})\right\}\left[\frac{\partial}{\partial \hat{\mathbf{B}}}\exp^{-1}(\hat{\mathbf{B}}^T\boldsymbol{x})\right]^T,$$
where $\hat{\mathbf{B}} = \hat{\boldsymbol{\beta}}$ or $\hat{\boldsymbol{\gamma}}$ is the solution of the pseudo estimating equation $S_p^{*}(\boldsymbol{\beta}) = 0$ or $S_p(\boldsymbol{\gamma}) = 0$ in formulae (2.3.5) and (2.2.7), respectively.
Therefore
$$V_w\left[E_c(\hat{\mu} \mid \tilde{\boldsymbol{w}})\right] \doteq \left(\frac{\partial E_c(\hat{\mu} \mid \tilde{\boldsymbol{w}})}{\partial \tilde{\boldsymbol{w}}}\frac{\partial \tilde{\boldsymbol{w}}}{\partial \hat{\mathbf{B}}}\right) V(\hat{\mathbf{B}}) \left(\frac{\partial E_c(\hat{\mu} \mid \tilde{\boldsymbol{w}})}{\partial \tilde{\boldsymbol{w}}}\frac{\partial \tilde{\boldsymbol{w}}}{\partial \hat{\mathbf{B}}}\right)^T,$$
where
$$\frac{\partial E_c(\hat{\mu} \mid \tilde{\boldsymbol{w}})}{\partial \tilde{\boldsymbol{w}}} = \left\{\pi_1^{(c)}\frac{y_1 - E_c(\hat{\mu} \mid \tilde{\boldsymbol{w}})}{\sum_{i \in FP}\pi_i^{(c)}\tilde{w}_i},\ \cdots,\ \pi_N^{(c)}\frac{y_N - E_c(\hat{\mu} \mid \tilde{\boldsymbol{w}})}{\sum_{i \in FP}\pi_i^{(c)}\tilde{w}_i}\right\}^T,$$
and
$$\left(\frac{\partial E_c(\hat{\mu} \mid \tilde{\boldsymbol{w}})}{\partial \tilde{\boldsymbol{w}}}\frac{\partial \tilde{\boldsymbol{w}}}{\partial \hat{\mathbf{B}}}\right) = -\frac{\sum_{i \in FP}\left[\pi_i^{(c)}\exp^{-1}(\hat{\mathbf{B}}^T\boldsymbol{x}_i)\left\{y_i - E_c(\hat{\mu} \mid \tilde{\boldsymbol{w}})\right\}\boldsymbol{x}_i\right]}{\sum_{i \in FP}\pi_i^{(c)}\tilde{w}_i} = O(1)$$
for both ALP and CLW. To determine the order of $V(\hat{\mathbf{B}})$, we first write
$$\hat{\mathbf{B}} - \mathbf{B} = I^{-1}(\mathbf{B})\,S(\hat{\mathbf{B}}) + o_p\left(S(\hat{\mathbf{B}})\right), \tag{C.3}$$
where $\mathbf{B} = \boldsymbol{\beta}$ or $\boldsymbol{\gamma}$ is the solution to the census estimating equation $S(\mathbf{B}) = 0$, and $I(\mathbf{B}) = \frac{\partial S}{\partial \mathbf{B}}(\mathbf{B})$ is the Hessian matrix. Specifically, for the ALP method the census estimating equation can be obtained by rewriting expression (3) in the main text and differentiating with respect to $\boldsymbol{\beta}$, leading to
$$S(\boldsymbol{\beta}) = \frac{1}{N + n_c}\sum_{i \in s_c \cup^{*} FP}\left\{R_i - p_i(\boldsymbol{\beta})\right\}\boldsymbol{x}_i,$$
where $R_i$ indicates the membership of $s_c$ in $s_c \cup^{*} FP$ (=1 if $i \in s_c$; 0 if $i \in FP$), and $p_i(\boldsymbol{\beta}) = E(R_i \mid \boldsymbol{x}_i; \boldsymbol{\beta}) = \text{expit}(\boldsymbol{\beta}^T\boldsymbol{x}_i)$, defined in (2.3.3) and (2.3.2) in the main text, respectively. The estimate $\hat{\boldsymbol{\beta}}$ is the solution to the pseudo estimating equation $S_p^{*}(\boldsymbol{\beta}) = 0$, where $d_i$ is the basic design weight for $i \in s_s$ and $d_i = 1$ for $i \in s_c$. We have
$$S_p^{*}(\hat{\boldsymbol{\beta}}) = \frac{1}{N + n_c}\sum_{i \in s_c \cup^{*} s_s} d_i\left\{R_i - p_i(\hat{\boldsymbol{\beta}})\right\}\boldsymbol{x}_i = S(\hat{\boldsymbol{\beta}}) + O_p\left(\frac{1}{\sqrt{n_c + n_s}}\right) = 0$$
under condition C3.
Combined with (C.3), this leads to $\hat{\boldsymbol{\beta}} - \boldsymbol{\beta} = O_p\left(\frac{1}{\sqrt{n_c + n_s}}\right)$ with
$$I(\boldsymbol{\beta}) = \frac{\partial S}{\partial \boldsymbol{\beta}}(\boldsymbol{\beta}) = -\frac{1}{N + n_c}\sum_{i \in s_c \cup^{*} FP} p_i(\boldsymbol{\beta})\left\{1 - p_i(\boldsymbol{\beta})\right\}\boldsymbol{x}_i\boldsymbol{x}_i^T = O(1)$$
under condition C4. We have
$$V(\hat{\boldsymbol{\beta}}) = O\left(\frac{1}{n_c + n_s}\right).$$
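To make the pseudo estimating equation $S_p^{*}(\boldsymbol{\beta}) = 0$ concrete, the following sketch solves it by Newton-Raphson on the stacked sample, with $R_i = 1$ and $d_i = 1$ for cohort units and $R_i = 0$ with design weight $d_i$ for survey units. It is a minimal illustration under assumed synthetic data, not the authors' implementation; the function names `alp_score` and `fit_alp_propensity` and all data values are hypothetical.

```python
import numpy as np

def alp_score(beta, X, R, d):
    """Unnormalized S_p^*(beta): sum of d_i {R_i - p_i(beta)} x_i over the stacked sample."""
    p = 1.0 / (1.0 + np.exp(-(X @ beta)))
    return X.T @ (d * (R - p))

def fit_alp_propensity(X_c, X_s, d_s, max_iter=100, tol=1e-10):
    """Solve S_p^*(beta) = 0 by Newton-Raphson (a weighted logistic regression)."""
    X = np.vstack([X_c, X_s])
    R = np.concatenate([np.ones(len(X_c)), np.zeros(len(X_s))])
    d = np.concatenate([np.ones(len(X_c)), np.asarray(d_s, dtype=float)])
    beta = np.zeros(X.shape[1])
    for _ in range(max_iter):
        p = 1.0 / (1.0 + np.exp(-(X @ beta)))
        # Negative Jacobian of the score: sum of d_i p_i (1 - p_i) x_i x_i^T
        info = (X * (d * p * (1.0 - p))[:, None]).T @ X
        step = np.linalg.solve(info, alp_score(beta, X, R, d))
        beta = beta + step
        if np.max(np.abs(step)) < tol:
            break
    return beta, X, R, d

# Synthetic cohort and survey sample (illustrative values only)
rng = np.random.default_rng(1)
n_c, n_s = 300, 200
X_c = np.column_stack([np.ones(n_c), rng.normal(0.5, 1.0, n_c)])
X_s = np.column_stack([np.ones(n_s), rng.normal(0.0, 1.0, n_s)])
d_s = np.full(n_s, 25.0)                      # hypothetical equal design weights

beta_hat, X, R, d = fit_alp_propensity(X_c, X_s, d_s)
w_alp = np.exp(-(X_c @ beta_hat))             # ALP pseudoweights (1 - p_hat) / p_hat
print(beta_hat, alp_score(beta_hat, X, R, d))
```

The normalizing constant $1/(N + n_c)$ is dropped because it does not change the root. Any weighted logistic regression routine should give the same coefficients.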
For the CLW method, the census estimating equation is
$$S(\boldsymbol{\gamma}) = \frac{1}{N}\sum_{i \in FP}\left[\delta_i - \pi_i^{(c)}(\boldsymbol{\gamma})\right]\boldsymbol{x}_i,$$
where $\delta_i$ is the indicator of the population unit $i$ being included in $s_c$ (=1 if $i \in s_c$; 0 otherwise), and $\pi_i^{(c)}(\boldsymbol{\gamma}) = E(\delta_i \mid \boldsymbol{x}_i; \boldsymbol{\gamma}) = \text{expit}(\boldsymbol{\gamma}^T\boldsymbol{x}_i)$. The estimate $\hat{\boldsymbol{\gamma}}$ is the solution to the pseudo estimating equation $S_p(\boldsymbol{\gamma}) = 0$ shown below:
$$S_p(\hat{\boldsymbol{\gamma}}) = \frac{1}{N}\left\{\sum_{i \in s_c}\boldsymbol{x}_i - \sum_{i \in s_s} d_i\,\pi_i^{(c)}(\hat{\boldsymbol{\gamma}})\boldsymbol{x}_i\right\} = \frac{1}{N}\sum_{i \in FP}\delta_i\boldsymbol{x}_i + \frac{1}{N}\sum_{i \in s_s} d_i\left[\delta_i - \hat{\pi}_i^{(c)}\right]\boldsymbol{x}_i - \frac{1}{N}\sum_{i \in s_s} d_i\delta_i\boldsymbol{x}_i = 0. \tag{C.4}$$
Under condition C3, the second and third terms in (C.4) satisfy
$$\frac{1}{N}\sum_{i \in s_s} d_i\left(\delta_i - \hat{\pi}_i^{(c)}\right)\boldsymbol{x}_i = \frac{1}{N}\sum_{i \in FP}\left(\delta_i - \hat{\pi}_i^{(c)}\right)\boldsymbol{x}_i + O_p\left(n_s^{-1/2}\right),$$
and
$$\frac{1}{N}\sum_{i \in s_s} d_i\delta_i\boldsymbol{x}_i = \frac{1}{N}\sum_{i \in FP}\delta_i\boldsymbol{x}_i + O_p\left(n_s^{-1/2}\right).$$
Hence
$$S_p(\hat{\boldsymbol{\gamma}}) = S(\hat{\boldsymbol{\gamma}}) + O_p\left(n_s^{-1/2}\right) = 0,$$
which, combined with (C.3), leads to $\hat{\boldsymbol{\gamma}} - \boldsymbol{\gamma} = O_p\left(n_s^{-1/2}\right)$ with
$$I(\boldsymbol{\gamma}) = -\frac{1}{N}\sum_{i \in FP}\pi_i^{(c)}(\boldsymbol{\gamma})\left[1 - \pi_i^{(c)}(\boldsymbol{\gamma})\right]\boldsymbol{x}_i\boldsymbol{x}_i^T = O(1)$$
under condition C6 in Chen, Li & Wu (2019). We have
$$V(\hat{\boldsymbol{\gamma}}) = O\left(\frac{1}{n_s}\right).$$
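For comparison, the CLW pseudo estimating equation $S_p(\boldsymbol{\gamma}) = 0$ in (C.4) can be solved with the same Newton-Raphson idea; note that the fitted probabilities enter only through the survey sample. The sketch below is a minimal illustration on assumed synthetic data, with hypothetical function names and values, and is not taken from Chen, Li & Wu (2019).

```python
import numpy as np

def clw_score(gamma, X_c, X_s, d_s):
    """Unnormalized S_p(gamma): sum_{s_c} x_i - sum_{s_s} d_i pi_i(gamma) x_i."""
    pi = 1.0 / (1.0 + np.exp(-(X_s @ gamma)))
    return X_c.sum(axis=0) - X_s.T @ (d_s * pi)

def fit_clw_propensity(X_c, X_s, d_s, max_iter=100, tol=1e-10):
    """Solve S_p(gamma) = 0 by Newton-Raphson."""
    d_s = np.asarray(d_s, dtype=float)
    gamma = np.zeros(X_c.shape[1])
    for _ in range(max_iter):
        pi = 1.0 / (1.0 + np.exp(-(X_s @ gamma)))
        # Negative Jacobian of the score: sum_{s_s} d_i pi_i (1 - pi_i) x_i x_i^T
        info = (X_s * (d_s * pi * (1.0 - pi))[:, None]).T @ X_s
        step = np.linalg.solve(info, clw_score(gamma, X_c, X_s, d_s))
        gamma = gamma + step
        if np.max(np.abs(step)) < tol:
            break
    return gamma

# Synthetic cohort and survey sample (illustrative values only)
rng = np.random.default_rng(2)
n_c, n_s = 300, 200
X_c = np.column_stack([np.ones(n_c), rng.normal(0.5, 1.0, n_c)])
X_s = np.column_stack([np.ones(n_s), rng.normal(0.0, 1.0, n_s)])
d_s = np.full(n_s, 25.0)                       # hypothetical equal design weights

gamma_hat = fit_clw_propensity(X_c, X_s, d_s)
w_clw = 1.0 + np.exp(-(X_c @ gamma_hat))       # CLW pseudoweights 1 / pi_hat
print(gamma_hat, clw_score(gamma_hat, X_c, X_s, d_s))
```

Because the weighted sum over $s_s$ estimates a population total using only $n_s$ units, the resulting $\hat{\boldsymbol{\gamma}}$ inherits the $O_p(n_s^{-1/2})$ rate derived above.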
As a result, the second term in (C.1) for the ALP and the CLW methods has the order of $O\left(\frac{1}{n_s + n_c}\right)$ and $O\left(\frac{1}{n_s}\right)$, respectively. Combining the two terms in (C.1), we have
$$V(\hat{\mu}^{ALP}) = O\left(\frac{1}{n_c}\right) + O\left(\frac{1}{n_s + n_c}\right) = O\left(\frac{1}{n_c}\right)$$
and
$$V(\hat{\mu}^{CLW}) = O\left(\frac{1}{n_c}\right) + O\left(\frac{1}{n_s}\right) = O\left(\frac{1}{\min(n_c, n_s)}\right).$$
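To get a feel for the two rates just derived, the short sketch below evaluates the order terms with every unknown constant set to 1. This is an assumption made purely for illustration, so the outputs are not actual variances; they only show how the two orders diverge as $n_c$ grows relative to $n_s$. The function name and the sample sizes are hypothetical.

```python
def variance_orders(n_c, n_s):
    """Order terms of V(mu_hat_ALP) and V(mu_hat_CLW), all constants set to 1."""
    alp = 1.0 / n_c + 1.0 / (n_s + n_c)
    clw = 1.0 / n_c + 1.0 / n_s
    return alp, clw

# Hypothetical cohort sizes n_c against a fixed survey size n_s
for n_c, n_s in [(1_000, 20_000), (20_000, 20_000), (500_000, 20_000)]:
    alp, clw = variance_orders(n_c, n_s)
    print(f"n_c={n_c:>7}, n_s={n_s}: ALP order={alp:.2e}, CLW order={clw:.2e}, ratio={alp / clw:.3f}")
```

When $n_c$ is small relative to $n_s$, both orders are dominated by $1/n_c$ and the ratio is close to 1; the gap opens up only once the cohort is much larger than the survey sample.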
Therefore, in large samples we have $V(\hat{\mu}^{ALP}) \le V(\hat{\mu}^{CLW})$, and the estimator $\hat{\mu}^{ALP}$ is more efficient than $\hat{\mu}^{CLW}$, especially when $n_c \gg n_s$.

D. Supplementary table on estimated coefficients of propensity models
| | RDW | CLW | ALP (FDW) | ALP.S |
|---|---|---|---|---|
| (Intercept) | -8.92 | -8.92 | -8.92 | 0.05 |
| Age (in years) | -0.06 | -0.06 | -0.06 | -0.06 |
| Sex (ref: male) | | | | |
| Female | -0.10 | -0.10 | -0.10 | -0.03 |
| Education level | -0.16 | -0.16 | -0.16 | -0.11 |
| Race/Ethnicity (ref: NH-White) | | | | |
| NH-Black | 1.33 | 1.33 | 1.33 | 1.47 |
| Hispanic | 1.62 | 1.62 | 1.62 | 1.64 |
| NH-Other | -0.35 | -0.35 | -0.35 | -0.28 |
| Poverty (ref: No) | | | | |
| Yes | 0.15 | 0.15 | 0.15 | 0.11 |
| Unknown | -0.01 | -0.01 | -0.01 | 0.01 |
| Health Status | | | | |
| Region (ref: Northeast) | | | | |
| Midwest | 0.25 | 0.25 | 0.25 | 0.15 |
| South | 0.41 | 0.41 | 0.41 | 0.35 |
| West | 0.29 | 0.29 | 0.29 | 0.14 |
| Marital Status (ref: married or living as married) | | | | |
| Single | -0.19 | -0.19 | -0.19 | -0.12 |
| Previously married | -0.01 | -0.01 | -0.01 | -0.02 |
| Smoking (ref: Non-smoker) | | | | |
| Former smoker | 0.12 | 0.12 | 0.12 | 0.10 |
| Current smoker | 0.16 | 0.16 | 0.16 | 0.14 |
| Household Income | -0.01 | -0.01 | -0.01 | -0.01 |
| Chewing tobacco (ref: No) | | | | |
| Yes | -0.35 | -0.35 | -0.35 | -0.34 |
| BMI (ref: normal) | | | | |
| Under-weight | -0.02 | -0.02 | -0.02 | -0.12 |
| Over-weight | 0.03 | 0.03 | 0.03 | 0.01 |
| Obese | -0.06 | -0.06 | -0.06 | -0.04 |