Multilevel calibration weighting for survey data
Eli Ben-Michael, Avi Feller, and Erin Hartman

February 19, 2021
Abstract
A pressing challenge in modern survey research is to find calibration weights when covariates are high dimensional and especially when interactions between variables are important. Traditional approaches like raking typically fail to balance higher-order interactions; and post-stratification, which exactly balances all interactions, is only feasible for a small number of variables. In this paper, we propose multilevel calibration weighting, which enforces tight balance constraints for marginal balance and looser constraints for higher-order interactions. This incorporates some of the benefits of post-stratification while retaining the guarantees of raking. We then correct for the bias due to the relaxed constraints via a flexible outcome model; we call this approach Double Regression with Post-stratification (DRP). We characterize the asymptotic properties of these estimators and show that the proposed calibration approach has a dual representation as a multilevel model for survey response. We assess the performance of this method via an extensive simulation study and show how it can reduce bias in a case study of a large-scale survey of voter intention in the 2016 U.S. presidential election. The approach is available in the multical R package.

∗ We would like to thank Peng Ding, Yair Ghitza, Mark Handcock, Luke Keele, Drew Linzer, Luke Miratrix, Jonathan Robinson, Jas Sekhon, and participants at BigSurv2020 and AsianPolmeth VIII for useful discussion and comments. This research was supported in part by the Hellman Family Fund at UC Berkeley and by the Institute of Education Sciences, U.S. Department of Education, through Grant R305D200010. The opinions expressed are those of the authors and do not represent views of the Institute or the U.S. Department of Education.

1 Introduction
Response rates for face-to-face and telephone polls have plummeted in recent years, leading to increased reliance on non-probability samples and greater attention to the need to adjust for pervasive non-response bias (Kennedy and Hartig, 2019; Caughey et al., 2020). At the same time, there has been a dramatic rise in the size and scale of auxiliary data available for survey research (Japec et al., 2015). Taken together, these changes have transformed the collection and analysis of public opinion data.

Accounting for higher-order interactions is an important challenge when constructing calibration weights for non-probability samples. For example, recent evaluations of public opinion polling in the United States found that failing to adjust for the interaction between key variables such as education, race, and geographic region led to substantial bias during the 2016 election (Kennedy et al., 2018). Traditional approaches, like raking, can perform poorly in this setting, typically balancing marginal distributions while failing to balance higher-order interactions. By contrast, post-stratification, which exactly balances all interactions, is only feasible for a small number of variables. And while approaches like multilevel regression and post-stratification (MRP; Gelman and Little, 1997) use outcome modeling to overcome this, they do not produce a single set of survey weights for all outcomes.

In this paper, we propose two principled approaches to account for higher-order interactions when estimating population quantities from non-probability samples. First, we propose multilevel calibration weighting, which exactly balances the first-order margins and approximately balances interactions, prioritizing balance in lower-order interactions over higher-order interactions. Thus, this approach incorporates some of the benefits of post-stratification while retaining the guarantees of the common-in-practice raking approach.
And unlike outcome modeling approaches like MRP, multilevel calibration weights are estimated once and applied to all survey outcomes, an important practical constraint in many survey settings.

In some cases, however, multilevel calibration weighting alone may be insufficient to achieve good covariate balance on all higher-order interactions, possibly leading to bias; or researchers might only be focused on a single outcome of interest. For this, we propose Double Regression with Post-stratification (DRP), which combines multilevel calibration weights with outcome modeling. Similar to model-assisted survey calibration (Breidt and Opsomer, 2017), this approach uses outcome modeling, taking advantage of flexible modern prediction methods, to estimate and correct for the possible bias from imperfect balance. When the weights alone achieve good balance on higher-order interactions, the adjustment from the outcome model is minimal. When the higher-order imbalance is large, however, the bias correction will also be large and the combined estimator will rely more heavily on the outcome model.

We characterize the numerical and statistical properties of both multilevel calibration weighting and the combined DRP estimator. By linking non-response bias to imbalance between the respondent sample and the overall population, we show how multilevel calibration and DRP control bias, and indicate a tradeoff between lower bias through better balance and lower variance through smaller adjustments. We then describe this behavior asymptotically and show that the bias correction is critical for asymptotic Normality, yielding a non-parametric analog to recent double-robustness results in survey estimation (Chen et al., 2020). Through the Lagrangian dual, we also show that the multilevel calibration approach implicitly fits a multilevel model of non-response, shrinking the coefficients on higher-order interaction terms and partially pooling across cells.

We evaluate the performance of our estimators and demonstrate the importance of adjusting for higher-order interactions in an empirical study predicting Republican vote share in the 2016 US presidential election from a pre-election Pew poll of vote intention. Kennedy et al. (2018) show that, in 2016 election public opinion polling, many surveys failed to accurately account for the shift in public opinion among white voters with no college education, particularly in the Midwestern region of the country.
We evaluate whether accounting for this higher-order interaction of race, education level, and geographic region can, retrospectively, improve public opinion estimates in the publicly available Pew poll. We use the large sample of voters in the post-election Cooperative Congressional Election Study (CCES; Ansolabehere and Schaffner, 2017) to construct population targets and to serve as a "ground truth" for the predictions. We also construct a simulation study calibrated to this example. We show that the multilevel weights substantially improve balance in interactions relative to raking and ad hoc post-stratification. We then show that further bias correction through DRP can significantly improve estimation.

The paper proceeds as follows. Section 2 describes the notation and estimands, and formally describes various common survey weighting procedures such as raking and post-stratification. Section 3 characterizes the estimation error for arbitrary weighting estimators to motivate our multilevel calibration procedure, then describes the procedure. Section 4 proposes the DRP estimator and analyzes its numerical and statistical properties. Section 5 shows the dual relation between multilevel calibration and modelling non-response with a multilevel model. Section 6 reports a simulation study calibrated to the case study, and Section 7 uses these procedures in the application. The methods we develop here are available in the multical R package.
We will consider a finite population of N individuals indexed i = 1, ..., N. Each of these individuals has an outcome Y_i that we observe if they respond to the survey, denoted by a binary variable R_i, where R_i = 1 indicates that unit i responds, and n = Σ_i R_i is the total number of respondents.† This response variable can include respondents to a probability sample or a convenience sample. In addition, each individual is also associated with a set of d categorical covariates X_{i1}, ..., X_{id}, where the jth covariate is categorical with J_j levels, so that the vector of covariates X_i ∈ [J_1] × ... × [J_d].‡ We will assume that we observe the covariates X_i for individuals that respond, but not necessarily for those that do not respond.

Rather than consider these variables individually, we will rewrite the vector X_i as a single categorical covariate, the cell for unit i, S_i ∈ [J], where J = J_1 × ... × J_d. We can then summarize information via these cells. We denote N^P ∈ N^J as the population count vector, with N^P_s = Σ_i 1{S_i = s}, and n^R ∈ N^J as the respondent count vector, with n^R_s = Σ_i R_i 1{S_i = s}. Finally, for each cell s we will consider a set of binary vectors D^{(k)}_s that denote the cell in terms of its kth order interactions, and collect the vectors into matrices D^{(k)} = [D^{(k)}_1 ... D^{(k)}_J]′, and into one combined J × J matrix D = [D^{(1)}, ..., D^{(d)}]. Figure 1 shows an example of D^{(1)} and D^{(2)} with three covariates: age, a binary for self-reported female, and party identification.

Our goal is to estimate the population average outcome, which we can write as a cell-size-weighted average of the within-cell averages, i.e.,

$$\mu \equiv \frac{1}{N}\sum_{i=1}^N Y_i = \sum_{s=1}^J \frac{N^P_s}{N}\,\mu_s, \qquad \text{where } \mu_s \equiv \frac{1}{N^P_s}\sum_{S_i = s} Y_i. \tag{1}$$

† Often the response variable denotes whether a unit is in the sample or part of a target population, and so the respondents are not a subset of the overall population. For simplicity, we treat this as non-response, but our results can be extended to generalize to particular target populations.

‡ We focus on categorical covariates because it is common to only have population information for categorical covariates, and so continuous covariates are often coarsened. However, the procedures we describe below can be straightforwardly adapted for continuous covariates.
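To make the cell representation concrete, the sketch below builds the first- and second-order indicator matrices D^(1) and D^(2) for three categorical covariates, as in Figure 1. The covariate names and level counts are illustrative assumptions, not the paper's data, and this is a minimal sketch rather than the multical package's implementation.

```python
from itertools import combinations, product

# Toy covariates (names and level counts are illustrative assumptions).
levels = {"age": 3, "female": 2, "party": 3}
names = list(levels)
cells = list(product(*[range(levels[v]) for v in names]))  # J = 3*2*3 = 18 cells

def D1(cell):
    """First-order indicators: one dummy per level of each covariate."""
    out = []
    for value, var in zip(cell, names):
        out += [1 if value == l else 0 for l in range(levels[var])]
    return out

def D2(cell):
    """Second-order indicators: one dummy per pair of levels, per covariate pair."""
    out = []
    for (a, va), (b, vb) in combinations(zip(cell, names), 2):
        for la in range(levels[va]):
            for lb in range(levels[vb]):
                out.append(1 if (a, b) == (la, lb) else 0)
    return out

D_1 = [D1(s) for s in cells]   # each row has exactly d = 3 ones
D_2 = [D2(s) for s in cells]   # each row has exactly C(3, 2) = 3 ones
```

Each row of D_1 marks the cell's level on each of the d covariates, and each row of D_2 marks its joint level on each pair of covariates; higher-order blocks D^(3), ..., D^(d) follow the same pattern.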
To estimate the population average, we will rely on the average outcomes we observe within each cell. For cell s, the responder average is

$$\bar{Y}_s \equiv \frac{1}{n^R_s}\sum_{S_i = s} R_i Y_i. \tag{2}$$

We will assume that outcomes are missing at random within cells, so that the cell responder averages are unbiased for the true cell averages (Rubin, 1976):

Assumption 1 (Missing at random within cells). For all cells s = 1, ..., J, E[ Ȳ_s | n^R_s ] = μ_s.

We will denote the propensity score as P(R_i = 1) ≡ π_i, and the probability of responding conditional on being in cell s as π(s) ≡ (1/N^P_s) Σ_{S_i = s} π_i. We assume that this probability is non-zero for all cells.

Assumption 2. π(s) > 0 for all s = 1, ..., J.

These assumptions allow us to identify the overall population average using only the observed data. However, in order for Assumption 1 to be plausible, we will need the cells to be very fine-grained and have covariates that are quite predictive of non-response and the outcome. As we will see below, this creates a trade-off between identification and estimation: the former becomes more plausible with more fine-grained information, while the latter becomes more difficult (see also D'Amour et al., 2020).

We first consider estimating μ by taking a weighted average of the respondents' outcomes, with weights ˆγ_i for unit i. Because the cells are the most fine-grained information we have, we will restrict the weights to be constant within each cell, relying on Assumption 1 that outcomes are missing at random within cells. We denote the estimated weight for cell s as ˆγ(s), and estimate the population average μ via

$$\hat\mu(\hat\gamma) \equiv \frac{1}{N}\sum_{i=1}^N R_i\,\hat\gamma_i\,Y_i = \frac{1}{N}\sum_s n^R_s\,\hat\gamma(s)\,\bar{Y}_s. \tag{3}$$

If the individual probabilities of responding were known, we could choose to weight cell s by the inverse of the propensity score, ˆγ(s) = 1/π(s) (Horvitz and Thompson, 1952). Unfortunately, the response probabilities are unknown.
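As a concrete illustration of Equation (3), the sketch below builds a toy population with known cell response probabilities (all numbers are illustrative assumptions) and checks that the unit-level and cell-level forms of the estimator coincide, using the oracle inverse-propensity weights γ(s) = 1/π(s):

```python
import random
random.seed(0)

# Toy population: 4 cells (counts and propensities are illustrative).
N_P = [40, 25, 20, 15]              # population cell counts, N = 100
pi  = [0.5, 0.4, 0.25, 0.2]         # known response probability per cell
N = sum(N_P)

# Deterministically mark the first round(N_P[s] * pi[s]) units in each cell
# as respondents, and draw an outcome for every unit.
units = []                          # (cell, responds, outcome)
for s, (count, p) in enumerate(zip(N_P, pi)):
    n_resp = round(count * p)
    for j in range(count):
        units.append((s, j < n_resp, random.gauss(s, 1.0)))

gamma = [1.0 / p for p in pi]       # oracle inverse-propensity weights

# Unit-level form of Equation (3): (1/N) sum_i R_i gamma_i Y_i.
mu_unit = sum(gamma[s] * y for s, r, y in units if r) / N

# Cell-level form: (1/N) sum_s n_s^R gamma(s) Ybar_s.
n_R  = [sum(1 for s, r, _ in units if r and s == c) for c in range(4)]
Ybar = [sum(y for s, r, y in units if r and s == c) / n_R[c] for c in range(4)]
mu_cell = sum(n_R[c] * gamma[c] * Ybar[c] for c in range(4)) / N

assert abs(mu_unit - mu_cell) < 1e-9   # the two forms of (3) coincide
```

Because the cell weight multiplies the cell's respondent total either way, the two sums are the same quantity grouped differently.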
Instead, one class of procedures estimates the propensity score as ˆπ(s) and weights cell s by the inverse estimated propensity score, ˆγ(s) = 1/ˆπ(s). One way to estimate the propensity score is as the proportion of the population in cell s that responded, ˆπ(s) = n^R_s / N^P_s; this leads to post-stratification weights ˆγ^ps(s) = N^P_s / n^R_s. Because the outcomes are missing at random within each cell by Assumption 1, these post-stratification weights will lead to an unbiased estimator for the population average μ. However, this estimator is only defined if there is at least one responder within each cell, and therefore is often infeasible in practice. In even moderate-dimensional cases it is unlikely there is at least one respondent for each cell. For instance, in our empirical example there are J = 12,347 distinct cells in the population but only n = 2,… respondents;§ with a smaller subset of the covariates (J = 6…), each cell has at least one respondent. However, when including all eight covariates, the non-empty cells in the sample represent less than a quarter of the population. This motivates the search for alternative approaches.

[Figure 2: Percent of the population that is not represented in the survey, beginning with education and successively interacting with income, religion, race, a binary for self-reported female, age, party identification, and born-again Christian status.]

A common alternative chooses weights so that the weighted marginal distribution of the covariates exactly matches that of the full population. This raking on margins procedure solves a convex optimization problem that finds the minimally "disperse" weights that satisfy this constraint (Deming and Stephan, 1940; Deville and Särndal, 1992; Deville et al., 1993). Specifically, we find weights that solve

$$\begin{aligned} \min_{\gamma} \quad & \sum_s n^R_s\,\gamma(s)^2 \\ \text{subject to} \quad & \sum_s D^{(1)}_s n^R_s\gamma(s) = \sum_s D^{(1)}_s N^P_s \\ & L \le \gamma(s) \le U \quad \forall\, s = 1, \ldots, J. \end{aligned} \tag{4}$$

These are the minimum-variance weights that exactly balance first-order margins. In addition to the exact balance constraint, the weights are restricted to lie between a lower bound L and an upper bound U. Setting the lower bound L = 0 and the upper bound U = ∞ restricts the normalized cell weights n^R_s γ(s)/N to lie in the simplex, which ensures that ˆμ(ˆγ) is between the minimum and maximum outcomes in the sample. In contrast to the post-stratification weights, we typically expect weights to exist that exactly balance on the marginal distributions. However, failing to adjust for higher-order interaction terms has the potential to induce severe bias.

Before turning to our proposal for balancing higher-order interactions, we briefly describe some existing approaches. Chen et al. (2019) and McConville et al. (2017) discuss model-assisted calibration approaches which rely on the LASSO for variable selection.

§ These are (i) education (6 levels), (ii) income (9 levels), (iii) race (4 levels), (iv) a binary for self-reported female (2 levels), (v) age (4 levels), (vi) party ID (3 levels), (vii) born-again Christian (2 levels), and (viii) region (5 levels). Only 30% of the potential combinations exist in the population.
Caughey and Hartman (2017) select higher-order interactions to balance using the LASSO. Hartman et al. (2021) provide a kernel balancing method for matching joint covariate distributions between non-probability samples and a target population. Linzer (2011) provides a latent class model for estimating cell probabilities and marginal effects in highly stratified data.

Finally, in Section 4, we also discuss outcome modeling strategies as well as approaches that combine calibration weights and outcome modeling. A small number of papers have previously explored this combination for non-probability samples. Closest to our setup is Chen et al. (2020), who combine inverse propensity score weights with a linear outcome model and show that the resulting estimator is doubly robust. Related examples include Yang et al. (2020), who give high-dimensional results for a related setting; and Si et al. (2020), who combine weighting and outcome modeling in a Bayesian framework.
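Marginal calibration problems of the kind in Equation (4) are classically computed by iterative proportional fitting (Deming and Stephan, 1940): alternately rescale the weights so that each covariate's weighted margin matches its population target. A minimal sketch on two covariates follows; the counts are toy numbers, and this implements the classical multiplicative raking variant rather than the quadratic objective above.

```python
# Respondent counts n_R[a][b] for covariate A (2 levels) x B (3 levels),
# with target population margins (all toy numbers).
n_R = [[30, 10, 10],
       [20, 15, 15]]                    # n = 100 respondents
target_A = [60, 40]                     # population margin of A (N = 100)
target_B = [40, 30, 30]                 # population margin of B

w = [[1.0] * 3 for _ in range(2)]       # raking weights, start at 1

for _ in range(100):                    # alternate margin adjustments
    for a in range(2):                  # rescale to match the A margin
        tot = sum(n_R[a][b] * w[a][b] for b in range(3))
        for b in range(3):
            w[a][b] *= target_A[a] / tot
    for b in range(3):                  # rescale to match the B margin
        tot = sum(n_R[a][b] * w[a][b] for a in range(2))
        for a in range(2):
            w[a][b] *= target_B[b] / tot

# After convergence, both weighted margins match the population targets.
margin_A = [sum(n_R[a][b] * w[a][b] for b in range(3)) for a in range(2)]
margin_B = [sum(n_R[a][b] * w[a][b] for a in range(2)) for b in range(3)]
```

Note that only the margins are matched: the weighted joint distribution of A and B can still differ from the population's, which is exactly the gap the multilevel calibration weights below are designed to close.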
We now propose multilevel calibration weights, which bridge the gap between post-stratification and raking on the margins. First, we inspect the finite-sample estimation error and mean square error of the weighting estimator ˆμ(γ) for a set of weights γ, differentiating the impact of imbalance in lower- and higher-order terms on the bias. We then use this decomposition to find weights that control the components of the MSE by approximately post-stratifying while maintaining raking on the margins.¶

¶ If we allow unbounded extrapolation and set L = −∞ and U = ∞, the resulting estimator will be equivalent to linear regression weights from a linear regression of the outcome on first-order indicators D^{(1)} (Ben-Michael et al., 2020a). Many other choices of constraints and penalty functions are possible, including penalizing the distance to known design weights; see Deville and Särndal (1992). These can also be incorporated into the discussion below.

3.1 Estimation error

We begin by inspecting the estimation error ˆμ(γ) − μ for weights γ. Define the residual for unit i as ε_i ≡ Y_i − μ_{S_i}, and the average respondent residual in cell s as ε̄_s = (1/n^R_s) Σ_{S_i = s} R_i ε_i. The estimation error decomposes into a term due to imbalance in the cell distributions and a term due to idiosyncratic variation within cells:

$$\hat\mu(\hat\gamma) - \mu = \underbrace{\frac{1}{N}\sum_s \left(n^R_s\hat\gamma(s) - N^P_s\right)\mu_s}_{\text{imbalance in cell distribution}} + \underbrace{\frac{1}{N}\sum_{s=1}^J n^R_s\,\hat\gamma(s)\,\bar\varepsilon_s}_{\text{idiosyncratic error}}. \tag{5}$$

By Assumption 1, which states that outcomes are missing at random within cells, the idiosyncratic error will be zero on average, and so the bias will be due to imbalance in the cell distribution. By Hölder's inequality, we can see that the mean square error, conditioned on the number of respondents in each cell, is

$$\begin{aligned} E\left[\left(\hat\mu(\gamma) - \mu\right)^2 \mid n^R\right] &= \underbrace{\frac{1}{N^2}\left(\sum_s \left(n^R_s\gamma(s) - N^P_s\right)\mu_s\right)^2}_{\text{bias}^2} + \underbrace{\sum_s \left(\frac{n^R_s}{N}\right)^2\gamma(s)^2\,\sigma_s^2}_{\text{variance}} \\ &\le \frac{1}{N^2}\sum_s \mu_s^2 \times \underbrace{\sum_s \left(n^R_s\gamma(s) - N^P_s\right)^2}_{\text{imbalance in cell distribution}} + \bar\sigma^2 \underbrace{\sum_s \left(\frac{n^R_s}{N}\right)^2\gamma(s)^2}_{\text{noise}}, \end{aligned} \tag{6}$$

where σ_s² = Var(Ȳ_s | n^R) and σ̄² = max_s σ_s². We therefore have two competing objectives if we want to control the mean square error for any given realization of our survey. To minimize the bias we want to find weights that control the imbalance between the true and weighted proportions within each cell. To minimize the variance we want to find diffuse weights so that the sum of the squared weights is small.

The decomposition above holds for imbalance measures across all of the strata, without taking into account their multilevel structure. In practice, we expect cells that share features to have similar outcomes on average. We can therefore have finer-grained control by leveraging our representation of the cells into their first-order marginals D^{(1)}_s and interactions of order k, D^{(k)}_s. To do this, consider the infeasible population regression using the D^{(k)} representation as regressors,

$$\min_{\eta}\; \sum_{i=1}^N \left(Y_i - \sum_{k=1}^d D^{(k)}_{S_i} \cdot \eta_k\right)^2. \tag{7}$$

With the solution to this regression, η* = (η*_1, ..., η*_d), we can decompose the population average in cell s based on the interactions between the covariates, μ_s = Σ_{k=1}^d D^{(k)}_s · η*_k. This decomposition in terms of the multilevel structure allows us to understand the role of imbalance in lower- and higher-order interactions on the bias. Plugging this decomposition into Equation (6), we see that the bias term in the conditional MSE further decomposes into the level of imbalance for the kth order interactions weighted by the strength of the interactions:

$$\begin{aligned} E\left[\left(\hat\mu(\gamma) - \mu\right)^2 \mid n^R\right] &= \frac{1}{N^2}\left(\sum_{k=1}^d \eta^*_k \cdot \sum_s \left(n^R_s\gamma(s) - N^P_s\right)D^{(k)}_s\right)^2 + \sum_s \left(\frac{n^R_s}{N}\right)^2\gamma(s)^2\,\sigma_s^2 \\ &\le \frac{1}{N^2}\left(\sum_{k=1}^d \|\eta^*_k\|_2 \left\|\sum_s \left(n^R_s\gamma(s) - N^P_s\right)D^{(k)}_s\right\|_2\right)^2 + \bar\sigma^2\sum_s \left(\frac{n^R_s}{N}\right)^2\gamma(s)^2. \end{aligned} \tag{8}$$

Typically, we expect that the "main effects" might be much stronger than any of the interaction terms, and so the coefficients on the first-order terms, ‖η*_1‖_2, will be large relative to the coefficients for higher-order terms. This is why raking on the margins as in Equation (4)—which only controls the main effects and implicitly assumes an additive functional form—is often seen as a good approximation (Mercer et al., 2018). However, ignoring higher-order interactions entirely can lead to bias. We therefore propose to find weights that prioritize main effects while still minimizing imbalance in interaction terms when feasible.

We now design a convex optimization problem that controls the components of the conditional MSE on the right-hand side of Equation (8). To do this, we find weights that control the imbalance in all interactions in order to control the bias, while penalizing the sum of the squared weights to control the variance.
Specifically, we solve the following optimization problem:

$$\begin{aligned} \min_{\gamma \in \mathbb{R}^J} \quad & \sum_{k=2}^d \frac{1}{\lambda_k}\left\|\sum_s D^{(k)}_s n^R_s\gamma(s) - D^{(k)}_s N^P_s\right\|_2^2 + \sum_s n^R_s\,\gamma(s)^2 \\ \text{subject to} \quad & \sum_s D^{(1)}_s n^R_s\gamma(s) = \sum_s D^{(1)}_s N^P_s \\ & L \le \gamma(s) \le U \quad \forall\, s = 1, \ldots, J, \end{aligned} \tag{9}$$

where the λ_k are smoothing hyper-parameters, discussed below.

We can view this optimization problem as adding an additional objective to the usual raking estimator in Equation (4), optimizing for higher-order balance. As with the raking estimator, the multilevel calibration weights are constrained to exactly balance first-order margins. Subject to this exact marginal constraint, the weights then minimize the imbalance in kth order interactions for all k = 2, ..., d. In this way, the multilevel calibration weights approximately post-stratify by optimizing for balance in higher-order interactions rather than requiring exact balance in all interactions as the post-stratification weights do. Following the bias-variance decomposition in Equations (6) and (8), this objective is penalized by the sum of the squared weights. In addition, we also constrain the weights to be between the user-defined lower and upper limits, L and U. As with the raking estimator, we can also replace the sum of the squared weights with a different penalty function, including penalizing deviations from known sampling weights following Deville and Särndal (1992). Finally, we could solve a variant of Equation (9) without the multilevel structure encoded by the D^{(k)}_s variables. This would treat cells as entirely distinct and perform no aggregation across cells while approximately post-stratifying. From our discussion in Section 3.1, this would ignore the potential bias gains from directly leveraging the multilevel structure.
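To make the objective in Equation (9) concrete, the sketch below evaluates its two components (higher-order imbalance and the dispersion penalty) for candidate weights on a toy two-covariate example. All numbers are illustrative assumptions, the hyper-parameter enters inversely so that a larger λ de-prioritizes higher-order balance, and a real implementation would hand the objective and constraints to a convex solver rather than merely evaluate them.

```python
# Toy setup: J = 4 cells from two binary covariates A and B (toy counts).
cells = [(0, 0), (0, 1), (1, 0), (1, 1)]
D1 = [[1 - a, a, 1 - b, b] for a, b in cells]        # first-order indicators
D2 = [[(1 - a) * (1 - b), (1 - a) * b,
       a * (1 - b), a * b] for a, b in cells]        # second-order indicators

n_R = [35, 15, 25, 25]        # respondent cell counts
N_P = [40, 20, 10, 30]        # population cell counts
lam2 = 10.0                   # hyper-parameter for the second-order block

def objective(gamma):
    """Higher-order imbalance term plus dispersion penalty from Equation (9)."""
    imb2 = [sum(D2[s][j] * (n_R[s] * gamma[s] - N_P[s]) for s in range(4))
            for j in range(4)]
    penalty = sum(n_R[s] * gamma[s] ** 2 for s in range(4))
    return sum(v * v for v in imb2) / lam2 + penalty

def margins(gamma):
    """Weighted first-order margins, which (9) constrains to the targets."""
    return [sum(D1[s][j] * n_R[s] * gamma[s] for s in range(4)) for j in range(4)]

# Post-stratification weights satisfy the marginal constraint and zero out the
# higher-order imbalance, at the price of a larger dispersion penalty.
gamma_ps = [N_P[s] / n_R[s] for s in range(4)]
targets = [60, 40, 50, 50]    # first-order margins implied by N_P
assert all(abs(m - t) < 1e-9 for m, t in zip(margins(gamma_ps), targets))
```

With γ = γ_ps, the imbalance term vanishes and only the penalty Σ_s (N^P_s)²/n^R_s remains; a larger λ_2 shifts the optimum away from post-stratification toward more diffuse weights.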
, d .These hyper-parameters control the relative priority that balancing the higher-order interactionsreceives in the objective in an inverse relationship. If λ k is large, then the weights will be moreregularized and the k th order interaction terms will be less prioritized. In the limit as all λ k → ∞ ,no weight is placed on any interaction terms, and Equation (9) reduces to raking on the margins.Conversely, if λ k is small, more importance will be placed on balancing k th order interactions.For example, if λ = 0, then the optimization problem will rake on margins and second orderinteractions. As all λ k → The hyper-parameters define a bias-variance trade-off. Smaller values will decrease the bias byimproving the balance on higher order interaction terms. This comes at the expense of increasingvariance by decreasing the effective sample size. In practice, we suggest explicitly tracing out thistrade-off. For a sequence of potential hyper-parameter values λ (1) , λ (2) , . . . , set all of the hyper-parameters to be λ k = λ ( j ) . We can then look at the two components of the objective in Equation(9), plotting the level of imbalance (cid:80) dk =2 (cid:13)(cid:13)(cid:13)(cid:80) s D ( k ) s n R s γ s − D ( k ) s N P s (cid:13)(cid:13)(cid:13) against the effective samplesize n eff = ( (cid:80) s n R s ˆ γ ( s ) ) (cid:80) s n R s ˆ γ ( s ) . After fully understanding this trade-off, practitioners can choose a common λ somewhere along the curve. For example, in our analysis in Section 7, we choose λ to achieve 95%of the potential balance improvement in higher-order terms of λ = 0 relative to raking. Finally,note that the optimization problem is on the scale of the population counts. This means that witha common hyper-parameter, the higher-order interactions, which have lower values of D ( k ) (cid:48) N P , willhave less weight by design. 
If we allow unbounded extrapolation and set L = −∞ and U = ∞, the resulting estimator will be equivalent to the MRP estimate (10), with regularization hyper-parameters λ_k (Ben-Michael et al., 2020a).

4 Double regression with post-stratification (DRP)
So far we have focused on multilevel calibration, with weights that exactly match the first-order marginals between the sample and the full population, while approximately balancing higher-order interactions. This approach is independent of the outcomes, and so we can use a single set of weights to estimate the population average for multiple different outcomes. However, in some cases it may not be possible to achieve good covariate balance on higher-order interactions, meaning that our estimates may still be biased. Similar to model-assisted survey calibration with design weights (Breidt and Opsomer, 2017), we can address this by specializing to a particular outcome and by explicitly using outcome information to estimate and correct for the bias. We begin by reviewing outcome modeling, especially multilevel regression with post-stratification (MRP), and then propose double regression with post-stratification (DRP).
A common alternative to the weighting approaches above is to estimate the population average μ using estimates of the cell averages μ_s (Gelman and Little, 1997). These approaches take modelled estimates of the cell averages, ˆμ_s, and post-stratify them to the population totals as

$$\hat\mu^{\text{mrp}} = \frac{1}{N}\sum_s N^P_s\,\hat\mu_s. \tag{10}$$

By smoothing estimates across cells, outcome modeling gives estimates of ˆμ_s even for cells with no respondents, thus sidestepping the primary feasibility problem of post-stratification. We use the term Multilevel Regression with Post-stratification (MRP) to refer to the broad class of outcome modelling approaches that post-stratify modelled cell averages, and discuss particular choices of outcome model in Section 4.3 below.

With weights ˆγ, we can use MRP to estimate the bias due to imbalance in higher-order interactions by taking the difference between the MRP estimate for the population and a hypothetical MRP estimate that uses n^R_s ˆγ(s) as the population cell counts:

$$\widehat{\text{bias}} = \hat\mu^{\text{mrp}} - \frac{1}{N}\sum_s n^R_s\,\hat\gamma(s)\,\hat\mu_s = \frac{1}{N}\sum_s \hat\mu_s \times \left(N^P_s - n^R_s\hat\gamma(s)\right). \tag{11}$$

This uses the outcome model to collapse the imbalance in the J cells into a single diagnostic. Our main proposal is to use this diagnostic to correct for any remaining bias from the multilevel calibration weights. We refer to the estimator as Double Regression with Post-stratification (DRP), as it incorporates two forms of "regression"—a regression of the outcome, ˆμ(s), and a regression of response, ˆγ(s), through the dual problem, as we discuss in Section 5 below. We construct the estimator using weights ˆγ(s) and cell estimates ˆμ_s as

$$\begin{aligned} \hat\mu^{\text{drp}}(\hat\gamma) &= \hat\mu(\hat\gamma) + \frac{1}{N}\sum_s \hat\mu_s \times \underbrace{\left(N^P_s - n^R_s\hat\gamma(s)\right)}_{\text{imbalance in cell } s} \\ &= \hat\mu^{\text{mrp}} + \frac{1}{N}\sum_s n^R_s\,\hat\gamma(s) \times \underbrace{\left(\bar{Y}_s - \hat\mu_s\right)}_{\text{error in cell } s}. \end{aligned} \tag{12}$$

The two lines in Equation (12) give two equivalent perspectives on how the DRP estimator adjusts for imbalance. The first line begins with the multilevel calibration estimate ˆμ(ˆγ) and then adjusts for the estimate of the bias using the outcome model, (1/N) Σ_s ˆμ_s (N^P_s − n^R_s ˆγ(s)). If the population and re-weighted cell counts are substantially different in important cells, the adjustment from the DRP estimator will be large. On the other hand, if the population and re-weighted sample counts are close in all cells, then ˆμ^drp(ˆγ) will be close to ˆμ(ˆγ). In the limiting case of post-stratification, where all the counts are equal, the two estimators will be equivalent, ˆμ^drp(ˆγ^ps) = ˆμ(ˆγ^ps). The second line instead starts with the MRP estimate, ˆμ^mrp, and adjusts the estimate based on the error within each cell. If the outcome model has poor fit in cells that have large weight, then the adjustment will be large.

As we discuss next, this uses outcome information to improve the statistical properties of the estimator, at the expense of specializing it to a particular outcome. More broadly, this estimator is a special case of augmented approximate balancing weights estimators (Athey et al., 2018; Hirshberg and Wager, 2019) and is closely related to generalized regression estimators (Cassel et al., 1976), augmented IPW estimators (Robins et al., 1994; Chen et al., 2020), and bias-corrected matching estimators (Rubin, 1976).

We now show that adjusting for imbalance with an outcome model reduces bias, and that this bias reduction allows for inference through asymptotic normality by ensuring that the bias is asymptotically smaller than the variance. To see this, we can again inspect the estimation error.
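Before turning to the asymptotics, a quick numerical check (all inputs are toy numbers, not from the paper's application) that the two lines of Equation (12) agree:

```python
# Toy cell-level inputs: respondent counts, population counts, calibration
# weights, respondent means, and modelled cell means (all illustrative).
n_R    = [20, 10, 5, 3]
N_P    = [40, 25, 20, 15]
gamma  = [1.9, 2.4, 3.6, 4.0]     # some calibration weights
Ybar   = [1.0, 2.0, 1.5, 3.0]     # respondent cell averages
mu_hat = [1.1, 1.8, 1.6, 2.5]     # outcome-model estimates of cell averages
N = sum(N_P)
J = len(n_R)

mu_w   = sum(n_R[s] * gamma[s] * Ybar[s] for s in range(J)) / N    # Eq. (3)
mu_mrp = sum(N_P[s] * mu_hat[s] for s in range(J)) / N             # Eq. (10)

# First line of (12): weighting estimate plus model-based bias correction.
drp_1 = mu_w + sum(mu_hat[s] * (N_P[s] - n_R[s] * gamma[s])
                   for s in range(J)) / N
# Second line of (12): MRP estimate plus weighted within-cell model errors.
drp_2 = mu_mrp + sum(n_R[s] * gamma[s] * (Ybar[s] - mu_hat[s])
                     for s in range(J)) / N

assert abs(drp_1 - drp_2) < 1e-12
```

With post-stratification weights γ(s) = N^P_s / n^R_s, the correction term in the first line vanishes and the estimator reduces to the weighting estimate, matching the limiting case described above.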
Analogous to Equation (5), the difference between the DRP estimator and the true population average is

$$\hat\mu^{\text{drp}}(\hat\gamma) - \mu = \frac{1}{N}\sum_s \underbrace{\left(N^P_s - n^R_s\hat\gamma(s)\right)}_{\text{imbalance in cell } s} \times \underbrace{\left(\hat\mu_s - \mu_s\right)}_{\text{error in cell } s} + \underbrace{\frac{1}{N}\sum_{s=1}^J n^R_s\,\hat\gamma(s)\,\bar\varepsilon_s}_{\text{noise}}. \tag{13}$$

Comparing to Equation (5), where the estimation error depends solely on the imbalance and the true cell averages, we see that the estimation error for DRP depends on the product of the imbalance from the weights and the estimation error from the outcome model. Therefore, if the model is a reasonable predictor for the true cell averages, the estimation error will decrease.

To formalize this, we consider an asymptotic framework with a sequence of finite populations of size N, and let N → ∞. In this framework, we make two modifications to our setup. First, we strengthen Assumption 2 to hold strictly for all population sizes N, so that min_s π(s) ≥ π* > 0. Second, we take the response indicators R_i to be independent. We detail these and other regularity assumptions on the design in Appendix A.

Theorem 1. If Σ_s (ˆμ_s − μ_s)² = o_p(1), then under the regularity conditions in Assumption A.1, the DRP estimator ˆμ^drp(ˆγ) satisfies

$$\hat\mu^{\text{drp}}(\hat\gamma) - \mu = \frac{1}{N}\sum_{i=1}^N \frac{R_i}{\pi(S_i)}\,\varepsilon_i + o_p\!\left(\frac{1}{\pi^*\sqrt{N}}\right).$$

Furthermore, if $\frac{1}{N\sqrt{V_N}}\sum_{i=1}^N \frac{R_i}{\pi(S_i)}\varepsilon_i \Rightarrow N(0, 1)$ for $V_N = \mathrm{Var}\!\left(\frac{1}{N}\sum_{i=1}^N \frac{R_i}{\pi(S_i)}\varepsilon_i\right)$, then

$$\frac{\hat\mu^{\text{drp}}(\hat\gamma) - \mu}{\sqrt{V_N}} \Rightarrow N(0, 1).$$

Theorem 1 shows us that as long as the modelled cell averages are consistent, the model and the calibration weights combine to ensure that the bias will be negligible relative to the variance, asymptotically. We expect that the residuals ε_i have much lower variance than the raw outcomes, and so asymptotically the DRP estimator will also have lower variance than the oracle Horvitz-Thompson estimator, similar to other model-assisted estimators (Breidt and Opsomer, 2017). However, note that the minimum cell response probability π* will affect the quality of the asymptotic approximation; if individuals in some cells are very unlikely to respond, it will be difficult both to model those averages and to achieve good balance from the respondents. Finally, Theorem 1 is a non-parametric analog to recent double-robustness results in survey estimation, such as from Chen et al. (2020). Rather than estimating a parametric outcome and non-response model and having consistency if either of the models is correct, we use non-parametric formulations and require them to converge at appropriate rates (see Hirshberg and Wager, 2019, for further discussion).

In our analysis in Section 7, we use Theorem 1 to construct confidence intervals for the population total μ. To do this, we start with a plug-in estimate for the variance,

$$\hat{V} = \frac{1}{N^2}\sum_{i=1}^N R_i\,\hat\gamma(S_i)^2\left(Y_i - \hat\mu_{S_i}\right)^2. \tag{14}$$

We then construct approximate level-α confidence intervals via ˆμ^drp(ˆγ) ± z_{1−α/2} √V̂, where z_{1−α/2} is the 1 − α/2 quantile of the standard Normal distribution.

4.3 Choosing an outcome model in MRP and DRP

The choice of outcome model is crucial for both MRP and DRP. As we discussed in Section 3, we often believe that the strata have important hierarchical structure, where main effects and lower-order interactions are more predictive than higher-order interactions.
We consider two broad classes of outcome model that accommodate this structure: multilevel outcome models, which explicitly regularize higher-order interactions; and tree-based models, which implicitly regularize higher-order interactions.
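Throughout, \(D^{(k)}_s\) stacks the indicators for all order-\(k\) interactions of the covariates defining cell \(s\). As a concrete sketch (our own helper, not from the paper's software, and assuming binary-coded covariates for simplicity), these blocks can be constructed as:

```python
import numpy as np
from itertools import combinations

def interaction_blocks(X, d):
    """Build D^(1), ..., D^(d): indicator columns for all interactions of each
    order, from a (J, m) 0/1 matrix X of cell-defining covariates.
    (Categorical covariates would be one-hot encoded first.)"""
    J, m = X.shape
    blocks = []
    for k in range(1, d + 1):
        cols = [np.prod(X[:, list(idx)], axis=1)  # product of k binary columns
                for idx in combinations(range(m), k)]
        blocks.append(np.column_stack(cols))
    return blocks
```

For \(k = 1\) this reproduces the margins; higher \(k\) gives one column per interaction of \(k\) covariates.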
Multilevel outcome model.
We first consider multilevel models, which have a linear form \(\hat{\mu}^{\mathrm{mr}}_s = \hat{\eta}^{\mathrm{mr}} \cdot D_s\), where \(\hat{\eta}^{\mathrm{mr}}\) are the estimated regression coefficients (Gelman and Little, 1997; Ghitza and Gelman, 2013; Gao et al., 2020). MRP directly post-stratifies these model estimates, using the coefficients to predict the value in the population:
\[
\hat{\mu}^{\mathrm{mrp}} = \hat{\eta}^{\mathrm{mr}} \cdot \frac{1}{N}\sum_s N^P_s D_s.
\]
In contrast, the DRP estimator only uses the coefficients to adjust for any remaining imbalance after weighting,
\[
\hat{\mu}^{\mathrm{drp}}(\hat{\gamma}) = \hat{\mu}(\hat{\gamma}) + \hat{\eta}^{\mathrm{mr}} \cdot \left(\frac{1}{N}\sum_s D_s\left(N^P_s - n^R_s\hat{\gamma}(s)\right)\right).
\]
This performs bias correction. When we use a multilevel outcome model, we can also view the DRP estimator as a weighting estimator
\[
\hat{\mu}^{\mathrm{drp}}(\hat{\gamma}) = \frac{1}{N}\sum_s \tilde{\gamma}(s)\, n^R_s\, \bar{Y}_s.
\]
In particular, when using the maximum a posteriori (MAP) estimate of a multilevel model in the corresponding DRP estimator, the outcome model directly adjusts the weights,
\[
\tilde{\gamma}(s) = \hat{\gamma}(s) + \left(N^P - \mathrm{diag}(n^R)\,\hat{\gamma}\right)' D\left(D'\,\mathrm{diag}(n^R)\,D + Q\right)^{-1} D_s,
\]
where \(Q\) is the prior covariance matrix associated with the multilevel model (Breidt and Opsomer, 2017). Importantly, while the multilevel calibration weights \(\hat{\gamma}\) are constrained to be non-negative, the DRP weights allow for extrapolation outside of the support of the data (Ben-Michael et al., 2020a).

Trees and general weighting methods.
More generally, we can consider an outcome model that smooths out the cell averages, using a weighting function between cells \(s\) and \(s'\), \(W(s, s')\), to estimate the population cell average, \(\hat{\mu}_s = \sum_{s'} W(s, s')\, n^R_{s'} \bar{Y}_{s'}\). A multilevel model is a special case that smooths the cell averages by partially pooling together cells with the same lower-order features. In this more general case the DRP estimator is again a weighting estimator, with adjusted weights
\[
\tilde{\gamma}(s) = \hat{\gamma}(s) + \sum_{s'} W(s, s')\left(N^P_{s'} - n^R_{s'}\,\hat{\gamma}(s')\right).
\]
Here the weights are adjusted by a smoothed average of the imbalance in similar cells. In the extreme case where the weight matrix is diagonal with elements \(1/n^R_s\), the DRP estimate reduces to the post-stratification estimate, as above.

One important special case is tree-based methods such as those considered by Montgomery and Olivella (2018) and Bisbee (2019). These estimate the outcome via bagged regression trees, such as random forests (Breiman, 2001), gradient boosted trees (Friedman, 2001), or Bayesian additive regression trees (Chipman et al., 2010). These approaches combine the predictions of different trees, and can be viewed as data-adaptive weighting estimators where the weight for cells \(s\) and \(s'\), \(W(s, s')\), is proportional to the fraction of trees where cells \(s\) and \(s'\) share a leaf node (Athey et al., 2019). Therefore, the DRP estimator will smooth the weights by adjusting them to correct for the imbalance in cells that share many leaf nodes.

Bias-variance tradeoff.
The key difference between MRP and DRP is what role the outcomemodel plays, and how one chooses the model to negotiate the bias-variance tradeoff. Because MRP-style estimators only use the outcome model, the performance of the outcome model completelydetermines the performance of the estimator. For example, in a multilevel model we want to includehigher-order interaction terms in order to reduce the bias. However, this can increase the varianceto an unacceptable degree, so we choose a model with lower variance and higher bias.In contrast, with DRP the role of the model is to correct for any potential bias remainingafter multilevel calibration. Because we can only approximately post-stratify, this bias-correction iskey. It also means that DRP is less reliant on the outcome model, which only needs to adequatelyperform bias correction. Therefore, the bias-variance tradeoff is different for DRP, and we prioritizebias over variance. By including higher order interactions or deeper trees, the model will be ableto adjust for any remaining imbalance in higher order interactions after weighting.
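The weighting representations above share one template. As a sketch (function names are ours, not the paper's implementation), the adjusted DRP weights for a generic smoother \(W\) can be computed as below, with the multilevel MAP case \(W = D(D'\,\mathrm{diag}(n^R)\,D + Q)^{-1}D'\) as a special instance; a tree-based \(W\) would instead be built from leaf co-membership fractions:

```python
import numpy as np

def drp_adjusted_weights(gamma_hat, W, n_resp, N_pop):
    """gamma_tilde(s) = gamma_hat(s) + sum_{s'} W(s,s') (N^P_{s'} - n^R_{s'} gamma_hat(s')).

    gamma_hat: (J,) multilevel calibration weights
    W:         (J, J) smoothing matrix between cells
    n_resp:    (J,) respondent counts n_s^R;  N_pop: (J,) population counts N_s^P
    """
    return gamma_hat + W @ (N_pop - n_resp * gamma_hat)

def map_smoother(D, n_resp, Q):
    """Smoother for the multilevel MAP case: W = D (D' diag(n^R) D + Q)^{-1} D'."""
    A = D.T @ (n_resp[:, None] * D) + Q
    return D @ np.linalg.solve(A, D.T)
```

When the calibration weights already balance every cell exactly, the imbalance term vanishes and the adjustment leaves the weights unchanged; otherwise the correction can push some weights negative, the extrapolation noted above.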
We now show that the multilevel calibration approach is a form of inverse propensity score weighting with a multilevel non-response model. This connection is instructive, especially for DRP, because traditional propensity score models can have steep data requirements, often requiring detailed individual-level data for both the sample and the target population (Chen et al., 2020). By contrast, the data requirements for multilevel calibration weights are somewhat weaker, requiring aggregate data on all interactions of interest.

In particular, when we enforce exact balance on all interactions, the multilevel calibration weights are equivalent to IPW with propensity scores estimated via a fully-saturated generalized linear model (GLM), and both our proposed weights and traditional IPW weights are equivalent to post-stratification weights. As we show, the primary difference between the multilevel calibration approach and a multilevel GLM is in how the propensity score coefficients are regularized. Through the Lagrangian dual, we will see that the multilevel calibration approach implicitly regularizes the coefficients on interactions to guarantee balance, while the multilevel GLM approach does not.
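The exact-balance equivalence can be checked in a toy example (cell counts here are illustrative, not from the paper):

```python
import numpy as np

# With exact balance on all cells, the calibration weights are the
# post-stratification weights N_s^P / n_s^R, i.e. inverse propensity
# weights under the fully saturated response model pi_hat(s) = n_s^R / N_s^P.
N_pop = np.array([100.0, 200.0, 300.0])   # population cell counts N_s^P
n_resp = np.array([10.0, 50.0, 30.0])     # respondent cell counts n_s^R

gamma_ps = N_pop / n_resp                 # post-stratification weights
pi_hat = n_resp / N_pop                   # fully saturated propensity estimate

assert np.allclose(gamma_ps, 1.0 / pi_hat)
assert np.allclose(n_resp * gamma_ps, N_pop)  # re-weighted counts match exactly
```

The interesting differences between the approaches only appear once exact balance is infeasible and regularization kicks in, which is the subject of the dual analysis below.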
We begin by deriving the Lagrangian dual to the optimization problem (9). By inspecting the dual, we can characterize the implicit propensity score model associated with the weights, moving smoothly between raking on margins and post-stratification. This builds on recent results noting the connection between approximate balancing weights estimators and regularized propensity score estimation (e.g. Wang and Zubizarreta, 2020; Hirshberg et al., 2019; Chattopadhyay et al., 2020; Ben-Michael et al., 2020b), as well as a long history linking raking weights to IPW with a propensity score that is log-linear in the first-order marginals (Little and Wu, 1991).

The dual problem involves optimizing over a series of Lagrange multipliers. The raking constraint induces one set of Lagrange multipliers \(\beta^{(1)}\). In the same way, the approximate post-stratification objective induces an additional set of Lagrange multipliers \(\beta^{(k)}\), one for each group of higher order interactions. These dual variables are then chosen to optimize a regularized objective function.

Proposition 1.
If a feasible solution to (9) exists, the Lagrangian dual problem with \(L = 0\) and \(U = \infty\) is
\[
\min_{\beta}\; \underbrace{\frac{1}{N}\sum_{i=1}^{N}\left[ \frac{R_i}{2}\max\left\{0,\; \sum_{k=1}^{d} D^{(k)}_{S_i}\cdot\beta^{(k)}\right\}^2 - \sum_{k=1}^{d} D^{(k)}_{S_i}\cdot\beta^{(k)} \right]}_{\text{loss function } q(\beta)} \;+\; \underbrace{\sum_{k=2}^{d} \frac{\lambda_k}{2}\left\|\beta^{(k)}\right\|_2^2}_{\text{regularization}}, \tag{15}
\]
where \(\beta = (\beta^{(1)}, \ldots, \beta^{(d)})\). If \(\hat{\beta}\) solves (15), the primal weights are recovered as
\[
\hat{\gamma}_s = \max\left\{0,\; \sum_{k=1}^{d} D^{(k)}_{s}\cdot\hat{\beta}^{(k)}\right\} \equiv \gamma(s; \hat{\beta}). \tag{16}
\]
For ease of exposition we have derived the Lagrangian dual for the usual case where \(L = 0\) and \(U = \infty\). For general \(L < U\), \(\gamma(s; \hat{\beta})\) will be truncated at \(L\) and \(U\), and the loss function will change slightly.
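Given a dual solution, recovering the primal weights via Equation (16) is immediate. A minimal sketch in our own notation (solving the dual itself would require an optimizer and is omitted):

```python
import numpy as np

def primal_weights(D_blocks, beta_blocks):
    """gamma(s; beta) = max{0, sum_k D_s^(k) . beta^(k)}  (Eq. 16),
    for the usual case L = 0, U = infinity.

    D_blocks:    list of (J, p_k) matrices, one per interaction order k
    beta_blocks: list of (p_k,) dual coefficient vectors beta^(k)
    """
    score = sum(D @ b for D, b in zip(D_blocks, beta_blocks))
    return np.maximum(0.0, score)
```

Cells whose dual score is negative receive weight zero, which is how the non-negativity constraint on the primal weights manifests in the dual.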
To connect this to propensity score estimation, we can inspect the minimizer of the expected loss, \(\mathbb{E}[q(\beta)]\). The zero gradient condition for the expected loss is
\[
\nabla \mathbb{E}[q(\beta)] = 0 \iff \sum_s N^P_s\, \pi(s)\, \gamma(s; \beta)\, D_s = \sum_s N^P_s\, D_s.
\]
The unique weights that solve the expected zero gradient condition are precisely the inverse propensity weights \(\gamma(s; \beta) = \frac{1}{\pi(s)}\). Therefore, the dual solution is a regularized \(M\)-estimator for the propensity score, with a fully saturated propensity score model that includes all interactions.

We now compare regularization in multilevel calibration weights versus more traditional multilevel GLM estimation for the propensity score. These two models have the same starting point: both are \(M\)-estimators for the propensity score and, in the special case without regularization, both are equivalent to post-stratification weights and to each other. Both estimators also partially pool the propensity score estimates across cells. However, in practical settings where full post-stratification is infeasible, regularization affects the two approaches differently. For multilevel calibration, the regularization in the dual problem (15) ensures a level of balance on interaction terms. By contrast, for the multilevel GLM, the regularization instead controls a different quantity that is only indirectly relevant for estimating the population average.

To see this, we can examine the zero gradient conditions for the two approaches. First, for multilevel calibration weights, the level of partial pooling directly relates to balance in the higher order interactions.
The zero gradient condition for the regularized dual problem (15) implies that the imbalance in the \(k\)th order interactions is
\[
\frac{1}{N}\left\| \sum_s D^{(k)}_s\, n^R_s\, \gamma(s; \hat{\beta}) - \sum_s D^{(k)}_s\, N^P_s \right\|_2 = \lambda_k \left\|\hat{\beta}^{(k)}\right\|_2.
\]
Therefore \(\lambda_k\) directly controls the level of balance in the \(k\)th order interactions, and so the level of regularization controls how far the re-weighted sample is from the target.

We can compare this to the zero gradient condition of the propensity score \(\pi(s; \hat{\beta})\) estimated via a multilevel GLM with equivalent hyper-parameters:
\[
\frac{1}{N}\left\| \sum_{s=1}^{J} D^{(k)}_s\, n^R_s - \sum_{s=1}^{J} D^{(k)}_s\, \pi(s; \hat{\beta})\, N^P_s \right\|_2 = \lambda_k \left\|\hat{\beta}^{(k)}\right\|_2.
\]
Here the hyper-parameter \(\lambda_k\) instead controls the difference between the observed sample counts and the expected counts under the model. This difference is only indirectly related to estimating the population means, in essence estimating the propensity score \(\pi(s)\) rather than the inverse propensity score \(\frac{1}{\pi(s)}\). Therefore, while both approaches are estimators of a fully-interacted propensity score, the regularization in multilevel calibration controls an upper bound on the bias when estimating the population average \(\mu\). In contrast, regularization in the multilevel GLM provides a condition on a quantity that is incidental to estimating \(\mu\).

We now evaluate the statistical behavior of the multilevel calibration and DRP estimators on simulated data based on the 2016 United States presidential election case study. We combine the 2,062 respondents from the mid-October 2016 Pew survey with the 44,909 respondents in the 2016 Congressional Cooperative Election Study (CCES) sample to form a population of size \(N = 46{,}971\). The Pew respondents have \(R_i = 1\), while the remaining units have \(R_i = 0\).
We construct cells by fully interacting the eight variables in Figure 2, with \(J = 12{,}347\) distinct cells in the population. We then calibrate two non-response models to the response structure in this population. First, we fit a random forest model with \(B = 500\) trees, so that the probability of responding in cell \(s\) is
\[
\pi^{\mathrm{rf}}(s) = \sum_{s'} \frac{n^R_{s'}}{B} \sum_{b=1}^{B} \frac{\mathbb{1}\{s' \in L_b(s)\}}{|L_b(s)|}.
\]
We also consider a fourth order model, where the response probability for cell \(s\) is
\[
\pi^{(4)}(s) = \mathrm{logit}^{-1}\left(\sum_{k=1}^{4} \hat{\beta}^{(k)} \cdot D^{(k)}_s\right),
\]
and the coefficient vector \(\hat{\beta}\) is under-regularized so that there is poor overlap. We similarly consider two different outcome models for presidential candidate vote choice. First, we fix the outcomes to be unchanged from the original data; second, we model the probability that unit \(i\) votes Republican (\(Y_i = 1\)) as a fourth order logistic regression model as above, similarly under-regularized.

To generate simulation runs, we re-sample from the population with replacement, so the total number of units \(N\) is fixed while the number of units within each cell \(N^P_s\) varies. We then generate responses and outcomes according to the probabilities above using two pairs: (a) fourth order models for both the response and the outcome, and (b) a random forest response model with the true, deterministic outcomes. We consider using multilevel calibration weighting in Equation (9), balancing first, second, third, and fourth order interactions with \(\lambda_k = 1\), and letting \(\lambda_k \to \infty\) for interactions of higher order. We also consider the DRP estimator, bias correcting with either a third-order ridge regression or a random forest, as well as MRP with these outcome models. Finally, we compare to the oracle Horvitz-Thompson estimator with the true response probabilities.

Figure 3: Bias and RMSE across 1000 simulation runs. RMSE for the oracle Horvitz-Thompson estimator in the under-regularized fourth order model (6.4%) omitted for scale.

Figure 3 shows the bias and root mean square error (RMSE) of these approaches across simulation runs. First looking at the bias, we see that under both data generating processes (DGPs) it is not enough to rake on margins, and there are substantial gains to balancing second and higher order interactions. Next, bias correction can provide large improvements: under both DGPs, DRP reduces the bias relative to raking on the margins alone by nearly the same degree as directly balancing higher order interaction terms. Even in the under-regularized fourth order DGP, where the oracle Horvitz-Thompson estimator performs poorly, we can significantly reduce the bias. DRP also has reduced bias relative to MRP alone with the same outcome model. Focusing on RMSE, we see that the decrease in bias from balancing higher order interactions outweighs the increase in variance only when balancing second order interactions, with third and fourth order interactions having a worse bias-variance trade-off. We see, however, that the bias-variance trade-off for including an outcome model through DRP is favorable under both outcome models and DGPs, with the DRP estimator with a random forest outcome model and raking weights having the lowest RMSE. Finally, MRP with ridge regression has higher RMSE than multilevel calibration and DRP, while MRP with random forest (the oracle estimator for one of the DGPs) has lower RMSE.
We then evaluate the performance of multilevel calibration and DRP when predicting state-specific vote counts from the national pre-election survey of vote intention.

Public opinion polling for the November 8, 2016 U.S. presidential election was notable for how state-level polls, particularly in the Upper Midwest, failed to accurately predict the winning candidate. While national polls were, on average, some of the most accurate in recent public opinion polling, these state-level errors led public opinion researchers to incorrectly predict the winner of the electoral college. Kennedy et al. (2018) attribute these errors to three main sources: (1) a late swing among undecided voters towards Trump, (2) failure to account for non-response related to education level, particularly among white voters, and (3) to a lesser degree, failure to properly predict the composition of the electorate. Our analysis focuses on addressing concern (2) by allowing for deep interactions among important covariates, including race, education level, and region. We address concern (3) by defining the target population and target outcome using the 2016 CCES, a large, high-quality public opinion survey which accurately captures electoral outcomes at the state level. We note, however, that our analysis, which uses the final public release of the Pew electoral poll from mid-October, cannot account for a late break towards Trump among undecided voters, which may contribute to remaining residual bias.

We compute population cell counts \(N^P_s\) from the post-2016 election CCES poll, limiting to those who voted in the election and weighting according to provided survey weights. We consider the balance of three different weighting estimators. First, we rake on margins for eight variables measured in both surveys, equivalent to solving (9) with \(\lambda_k \to \infty\) for \(k \geq 2\) and \(L = 0\), \(U = \infty\). Next, we balance up to 6th order interaction terms, setting a common hyper-parameter \(\lambda_k = \lambda\) for \(k = 2, \ldots, 6\), and \(\lambda_k \to \infty\) for \(k = 7,\)
8. To select \(\lambda\), we solve (9) for a series of potential values, tracing out the bias-variance trade-off in Figure 4. We select a value of \(\lambda = 12\). We also compare to post-stratification on a collapsed set of cells, where we collapse income and age to 3 levels, education to a binary indicator for greater than a high school degree, and race to a binary indicator for white. Finally, we consider bias-correcting the multilevel weights with DRP, with (a) a fourth order ridge regression model and (b) gradient boosted trees, both tuned with cross validation.

More specifically, our goal is to recover the ground truth defined by the CCES, rather than the true electoral outcomes. However, the CCES has been shown to be highly accurate at the state level (Ansolabehere and Schaffner, 2017).

Figure 4: Imbalance in interactions of order \(k = 1, \ldots, 6\) versus the effective sample size. Imbalance measures are scaled as the percent reduction in imbalance relative to raking on margins.

Figure 5 shows the imbalance when weighting by these three approaches for interactions up to order 4. To place the balance on the same scale, we divide the difference between the re-weighted sample and the population in the \(j\)th interaction of order \(k\) by the population count,
\[
\frac{\left|\sum_s D^{(k)}_{sj}\left(n^R_s\,\hat{\gamma}(s) - N^P_s\right)\right|}{\sum_s D^{(k)}_{sj}\, N^P_s}.
\]
By design, both the raking and multilevel calibration weights exactly balance first order margins; however, post-stratifying on a limited set of collapsed cells does not guarantee balance on the margins of the uncollapsed cells, due to missing values. The multilevel calibration weights achieve significantly better balance on second order interactions than do the raking weights or the post-stratification weights. For higher order interactions these gains are still visible but less stark, as it becomes more difficult to achieve good balance.

This improvement in balance comes at some cost to variance. Figure 6a shows the empirical CDF of the respondent weights for the three approaches. The multilevel calibration weights that balance higher order interactions have a greater proportion of large weights, with a longer tail than raking or collapsed post-stratification. These large weights lead to a smaller effective sample size. The multilevel calibration weights yield an effective sample size of 1,099 for a design effect of 1.87, while raking and the collapsed post-stratification weights have effective sample sizes of 1,431 and 1,482, respectively.

Figure 6b plots the point estimates and approximate 95% confidence intervals for the multilevel calibration and DRP approaches, along with the estimated Republican vote share from the weighted CCES sample.
The different weights result in different predictions of the vote share, ranging from a point estimate of 42.5% for raking to 47.5% for post-stratification. Additionally, the somewhat smaller effective sample size for multilevel calibration manifests itself in the standard error, leading to slightly larger confidence intervals.

Figure 5: Covariate imbalance for interactions up to order 4, measured as the difference between the weighted and target count, divided by the target count.

The DRP estimators, bias correcting with either ridge regression or gradient boosted trees, have similar point estimates to multilevel calibration alone. This indicates that the remaining imbalances in higher order interactions after weighting in Figure 5 do not lead to large estimated biases. However, by including an outcome model the DRP estimators significantly reduce the standard errors.

To empirically validate the role of balancing higher order interactions, we use the national pre-election Pew survey to predict Republican vote share within each state. The pre-election survey was designed as a national survey, and so there are substantial differences between the sample and the state-level totals. For each state we compute the population count vector \(N^P\) from the weighted CCES, subset to the state of interest. Here we use a common \(\lambda = 1\). We then impute the Republican vote share for that state via weighting alone and DRP with gradient boosted trees, balancing interactions up to order six. We consider both restricting the sample respondents to be in the same region as the state and including all sample respondents. Figure 7 shows the absolute bias and RMSE across the 50 states as the order increases from raking on first order margins to approximately balancing sixth order interactions. There are substantial gains to bias-correction through DRP when raking on the margins, in terms of both bias and variance. Balancing higher order interactions also improves estimation over raking alone.
And while the relative improvement of DRP over multilevel calibration diminishes as we balance higher order interactions, these gains are still apparent.

Figure 6: (a) Distribution of weights; the dashed line indicates a uniform adjustment \(N/n\). (b) Point estimates and approximate 95% confidence intervals; the thick dashed line is the weighted CCES estimate, and the thinner dashed lines indicate the lower and upper 95% confidence limits.

Finally, while not restricting respondents by region results in lower bias across the 50 states, the higher RMSE shows that the individual estimates of state vote share are poor but average out.

The precipitous decline in response rates for traditional polling approaches, and the increased reliance on convenience samples, has led to the potential for large biases due to non-representative samples, and to an increased emphasis on methods to adjust for this bias. One important driver of bias due to non-representative samples is the difference in response rates among groups defined by fine-grained higher order interactions. A key question is how to adjust for such interactions with even a moderate number of characteristics. While ideally we would post-stratify on all interactions of important covariates simultaneously, the cost of collecting the necessary sample size would be prohibitive, especially with low response rates. In practice, analysts circumvent this via ad hoc approaches, such as only adjusting for first-order marginal characteristics or collapsing cells together.

In this paper we provide two alternative approaches, multilevel calibration weighting and
Double Regression with Post-stratification (DRP), which provide principled ways to combine fine-grained calibration weighting with modern machine learning prediction techniques. The multilevel calibration weights improve on existing practice by approximately post-stratifying in a data-driven way, while at least ensuring exact raking on first order margins. DRP then takes advantage of flexible regression methods to further adjust for differences in fine-grained cells in a parsimonious way.

Figure 7: Absolute bias and MSE when imputing Republican vote share in 50 states from the national Pew survey, restricting to respondents in the same region and unrestricted by region.

For groups where the weights successfully adjust for differences in response rates, the DRP estimate is driven by the weights; for groups that remain over- or under-represented, DRP instead relies on a flexible regression model to estimate and adjust for remaining non-response bias. Through theoretical, numerical, and simulation results, we find that these approaches can significantly improve estimation. Specifically, adjusting for higher-order interactions with multilevel calibration has much lower bias than ignoring them by only raking on the first-order margins. Incorporating flexible outcome estimators such as multilevel regression or tree-based approaches in our DRP estimator further improves upon weighting alone.

However, our proposal is certainly not a panacea, and important questions remain. First, while we choose the value of the hyper-parameters by tracing out the bias-variance trade-off, it might be preferable to select them via data-adaptive measures. For example, Wang and Zubizarreta (2020) propose a cross-validation style approach that takes advantage of the Lagrangian dual formulation. It may be possible to use such approaches in this setting.

Second, the key assumption is that outcomes are missing at random within cells.
While we never expect this to be entirely true, it allows us to make progress on estimation, and with granular enough groups we may hope that this assumption is approximately true. It is important then to characterize how our conclusions would change if this assumption is violated, and the response and the outcome are correlated even within cells. This form of sensitivity analysis is common for matching estimators in observational studies, and has been proposed for inverse probability weighting (Zhao et al., 2019) and balancing estimators (Soriano et al., 2021). We leave adapting these approaches to this setting as future work.

Third, with many higher order interactions it is difficult to find good information on population targets. We may have to combine various data sources collected in different manners, or impute unknown cells in the target population, and uncertainty in the population targets can also lead to increased variance (see Caughey et al., 2020, for a recent review). Fourth, during the survey process we can obtain very detailed auxiliary information on survey respondents that we cannot obtain for the population, even marginally. Incorporating this sort of auxiliary information into the estimation procedure will be important future work.

Finally, the asymptotic theory only considers a fixed number of covariates. With a growing number of covariates, the requirements on the outcome model will likely need to be stronger in order to allow for asymptotic normality of the DRP estimator. Additionally, with categorical covariates, the non-response probability is completely determined by all of the interactions; continuous covariates will require stronger assumptions.

We propose principled procedures to account for non-response bias due to differences in response rates in groups defined by higher order interactions.
While this is a pernicious problem, especially with lowered response rates, it is far from the only form of non-response bias, let alone the only difficulty with modern surveys. We therefore view multilevel calibration and DRP as only one part of the analyst's toolkit, supplementing design and data considerations.

References
Ansolabehere, S. and B. F. Schaffner (2017). CCES Common Content, 2016.

Athey, S., G. W. Imbens, and S. Wager (2018). Approximate residual balancing: debiased inference of average treatment effects in high dimensions. Technical report.

Athey, S., J. Tibshirani, and S. Wager (2019). Generalized random forests. Annals of Statistics 47(2), 1179–1203.

Ben-Michael, E., A. Feller, and J. Rothstein (2020a). The Augmented Synthetic Control Method.

Ben-Michael, E., A. Feller, and J. Rothstein (2020b). Variation in impacts of letters of recommendation on college admissions decisions: Approximate balancing weights for treatment effect heterogeneity in observational studies.

Bisbee, J. (2019). BARP: Improving Mister P using Bayesian additive regression trees. American Political Science Review 113(4), 1060–1065.

Breidt, F. J. and J. D. Opsomer (2017). Model-assisted survey estimation with modern prediction techniques. Statistical Science 32(2), 190–205.

Breiman, L. (2001). Random forests. Machine Learning 45, 5–32.

Cassel, C. M., C.-E. Särndal, and J. H. Wretman (1976). Some results on generalized difference estimation and generalized regression estimation for finite populations. Biometrika 63(3), 615–620.

Caughey, D., A. J. Berinsky, S. Chatfield, E. Hartman, E. Schickler, and J. S. Sekhon (2020). Target Estimation and Adjustment Weighting for Survey Nonresponse and Sampling Bias. Elements in Quantitative and Computational Methods for the Social Sciences. Cambridge University Press.

Caughey, D. and E. Hartman (2017). Target selection as variable selection: Using the lasso to select auxiliary vectors for the construction of survey weights. Available at SSRN 3494436.

Chattopadhyay, A., C. H. Hase, and J. R. Zubizarreta (2020). Balancing versus modeling approaches to weighting in practice. Statistics in Medicine, in press.

Chen, J. K. T., R. L. Valliant, and M. R. Elliott (2019). Calibrating non-probability surveys to estimated control totals using lasso, with an application to political polling. Journal of the Royal Statistical Society: Series C (Applied Statistics) 68(3), 657–681.

Chen, Y., P. Li, and C. Wu (2020). Doubly robust inference with nonprobability survey samples. Journal of the American Statistical Association 115(532), 2011–2021.

Chipman, H. A., E. I. George, and R. E. McCulloch (2010). BART: Bayesian additive regression trees. Annals of Applied Statistics 6(1), 266–298.

Deming, W. E. and F. F. Stephan (1940). On a least squares adjustment of a sampled frequency table when the expected marginal totals are known. The Annals of Mathematical Statistics 11(4), 427–444.

Deville, J. C. and C. E. Särndal (1992). Calibration estimators in survey sampling. Journal of the American Statistical Association 87(418), 376–382.

Deville, J. C., C. E. Särndal, and O. Sautory (1993). Generalized raking procedures in survey sampling. Journal of the American Statistical Association 88(423), 1013–1020.

D'Amour, A., P. Ding, A. Feller, L. Lei, and J. Sekhon (2020). Overlap in observational studies with high-dimensional covariates. Journal of Econometrics.

Friedman, J. (2001). Greedy function approximation: A gradient boosting machine. The Annals of Statistics 29(5), 1189–1232.

Gao, Y., L. Kennedy, D. Simpson, and A. Gelman (2020). Improving multilevel regression and poststratification with structured priors. Bayesian Analysis.

Gelman, A. and T. C. Little (1997). Poststratification into many categories using hierarchical logistic regression. Survey Methodology 23(2), 127–135.

Ghitza, Y. and A. Gelman (2013). Deep interactions with MRP: Election turnout and voting patterns among small electoral subgroups. American Journal of Political Science 57(3), 762–776.

Hartman, E., C. Hazlett, and C. Sterbenz (2021). A kernel balancing approach for reducing specification assumptions in survey weighting. Unpublished manuscript.

Horvitz, D. G. and D. J. Thompson (1952). A generalization of sampling without replacement from a finite universe. Journal of the American Statistical Association 47(260), 663–685.

Japec, L., F. Kreuter, M. Berg, P. Biemer, P. Decker, C. Lampe, J. Lane, C. O'Neil, and A. Usher (2015). Big data in survey research: AAPOR task force report. Public Opinion Quarterly 79(4), 839–880.

Kennedy, C., M. Blumenthal, S. Clement, J. D. Clinton, C. Durand, C. Franklin, K. McGeeney, L. Miringoff, K. Olson, D. Rivers, L. Saad, G. E. Witt, and C. Wlezien (2018). An evaluation of the 2016 election polls in the United States. Public Opinion Quarterly 82(1), 1–33.

Kennedy, C. and H. Hartig (2019). Response rates in telephone surveys have resumed their decline.

Linzer, D. A. (2011). Reliable inference in highly stratified contingency tables: Using latent class models as density estimators. Political Analysis, 173–187.

Little, R. J. and M. M. Wu (1991). Models for contingency tables with known margins when target and sampled populations differ. Journal of the American Statistical Association 86(413), 87–95.

McConville, K. S., F. J. Breidt, T. C. Lee, and G. G. Moisen (2017). Model-assisted survey regression estimation with the lasso. Journal of Survey Statistics and Methodology 5(2), 131–158.

Mercer, A., A. Lau, and C. Kennedy (2018). For weighting online opt-in samples, what matters most. Pew Research Center.

Montgomery, J. M. and S. Olivella (2018). Tree-based models for political science data. American Journal of Political Science 62(3), 729–744.

Robins, J. M., A. Rotnitzky, and L. P. Zhao (1994). Estimation of regression coefficients when some regressors are not always observed. Journal of the American Statistical Association 89(427), 846–866.

Rubin, D. B. (1976). Inference and missing data. Biometrika 63(3), 581–592.

Si, Y., R. Trangucci, J. S. Gabry, and A. Gelman (2020). Bayesian hierarchical weighting adjustment and survey inference. arXiv preprint arXiv:1707.08220.

Soriano, D., E. Ben-Michael, P. Bickel, A. Feller, and S. Pimentel (2021). Sensitivity analysis for balancing weights. Technical report, working paper.

Wainwright, M. J. (2019). High-Dimensional Statistics: A Non-Asymptotic Viewpoint. Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge University Press.

Wang, Y. and J. R. Zubizarreta (2020). Minimal dispersion approximately balancing weights: Asymptotic properties and practical considerations. Biometrika 107(1), 93–105.

Yang, S., J. K. Kim, and R. Song (2020). Doubly robust inference when combining probability and non-probability samples with high dimensional data. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 82(2), 445–465.

Zhao, Q., D. S. Small, and B. B. Bhattacharya (2019). Sensitivity analysis for inverse probability weighting estimators via the percentile bootstrap. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 81(4), 735–761.
Proofs and derivations
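Throughout this appendix, the weights $\hat{\gamma}$ solve a constrained least squares problem that balances the re-weighted respondent cell counts $n_s^R \gamma_s$ against the population counts $N P_s$. As a concrete illustration (not part of the paper; all sizes, probabilities, and features below are synthetic), the modified problem (17) below, which keeps only the non-negativity constraint, is exactly a non-negative least squares problem once the feature matrices are stacked into a single matrix $D$, so the singular-value bound of Lemma 1 can be checked numerically:

```python
import numpy as np
from scipy.optimize import nnls

rng = np.random.default_rng(0)
J, d_feat = 6, 9                       # cells and stacked feature dimension (hypothetical)
D = rng.normal(size=(J, d_feat))       # row s holds all features D_s stacked across levels k
P = rng.dirichlet(np.ones(J))          # population cell shares P_s
N = 10_000
pi = rng.uniform(0.2, 0.8, size=J)     # response probabilities pi(s)
n_R = rng.binomial((N * P).astype(int), pi)  # respondent counts n_s^R

# Problem (17): min_gamma || D'(diag(n_R) gamma - N P) ||_2^2  s.t.  gamma >= 0,
# i.e. NNLS with design D' diag(n_R) and target D'(N P).
A = D.T * n_R                          # (d_feat, J): column s scaled by n_s^R
b = D.T @ (N * P)
gamma_hat, _ = nnls(A, b)

# Lemma 1's bound with kappa = sigma_max(D) / sigma_min(D)
sv = np.linalg.svd(D, compute_uv=False)
kappa = sv[0] / sv[-1]
lhs = np.linalg.norm(n_R * gamma_hat - N * P)
rhs = kappa * np.linalg.norm(n_R / pi - N * P)
assert lhs <= rhs + 1e-8
```

Because $1/\pi$ is feasible and $\hat{\gamma}$ is optimal, the inequality holds for any draw of the synthetic data; the assertion is a numerical confirmation of Lemma 1, not a proof.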
Assumption A.1. There is a sequence of populations of size $N$ with $N \to \infty$ such that:

(a) The response variables $R_i$ are independent.

(b) The response probabilities are bounded away from zero: $\pi(s) \geq \pi_* > 0$, where $\pi_* \sqrt{N} \to \infty$ as $N \to \infty$.

(c) The residuals $\varepsilon_i \equiv Y_i - \mu_{S_i}$ satisfy $\frac{1}{N}\sum_{i=1}^N \varepsilon_i^2 < \infty$ for all population sizes $N$.

(d) The maximum variance across cells conditional on the cell counts, $\sigma^2 \equiv \max_s \sigma_s^2 = \max_s \operatorname{Var}\left(\bar{\varepsilon}_s \mid n_s^R\right)$, converges to 0 in probability.

(e) For any $X = o_p\!\left(\frac{1}{\pi_*\sqrt{N}}\right)$, the variance of the oracle estimator, $V = \frac{1}{N^2}\sum_{i=1}^N \frac{1 - \pi(S_i)}{\pi(S_i)}\,\varepsilon_i^2$, satisfies $X/\sqrt{V} = o_p(1)$.

(f) We find $\hat{\gamma}$ via the modified problem
\[
\min_{\gamma \in \mathbb{R}^J} \; \sum_{k=1}^{d} \left\| \sum_s D_s^{(k)} n_s^R \gamma_s - D_s^{(k)} N P_s \right\|_2^2 \quad \text{subject to} \quad 0 \leq \gamma_s \leq \infty \quad \forall\, s = 1, \ldots, J. \tag{17}
\]

Lemma 1.
Let $\kappa \equiv \|D^{-1}\|_{\mathrm{op}} \|D\|_{\mathrm{op}}$ be the ratio of the maximum and minimum singular values of $D$. The solution to (17) satisfies
\[
\sqrt{\sum_s \left(n_s^R \hat{\gamma}(s) - N P_s\right)^2} \;\leq\; \kappa \sqrt{\sum_s \left(\frac{n_s^R}{\pi(s)} - N P_s\right)^2}.
\]

Proof of Lemma 1.
Slightly abusing notation, denote by $\pi \in (0, 1]^J$ the vector of response probabilities for each cell. The vector $1/\pi$ is feasible for optimization problem (17) while $\hat{\gamma}$ is optimal, and so
\[
\frac{1}{\|D^{-1}\|_{\mathrm{op}}} \left\| \operatorname{diag}(n^R)\hat{\gamma} - N P \right\|_2 \;\leq\; \left\| D' \left( \operatorname{diag}(n^R)\hat{\gamma} - N P \right) \right\|_2 \;\leq\; \left\| D' \left( \operatorname{diag}(n^R)\frac{1}{\pi} - N P \right) \right\|_2 \;\leq\; \|D\|_{\mathrm{op}} \left\| \operatorname{diag}(n^R)\frac{1}{\pi} - N P \right\|_2.
\]
Multiplying by $\|D^{-1}\|_{\mathrm{op}}$ gives the result.

Lemma 2.
Let $\pi_* = \min_s \pi(s)$. For any $\delta > 0$,
\[
\frac{1}{N}\sqrt{\sum_s \left(N P_s - \frac{n_s^R}{\pi(s)}\right)^2} \;\leq\; \frac{1}{\pi_* \sqrt{N}} \left( \sqrt{J \log 5} + \delta \right),
\]
with probability at least $1 - \exp\left(-\pi_*^2 N \delta^2 / 2\right)$.

Proof of Lemma 2. Since $R_i \in \{0, 1\}$ is bounded, it is sub-Gaussian with scale parameter $\frac{1}{2}$, and because the $R_i$ are independent,
\[
\frac{N P_s}{N} - \frac{n_s^R}{N \pi(s)} = P_s - \frac{1}{N \pi(s)} \sum_{S_i = s} R_i
\]
is a mean-zero sub-Gaussian random variable with scale parameter $\frac{\sqrt{N P_s}}{\pi(s) N} \leq \frac{1}{\pi_* \sqrt{N}}$. Now by a discretization argument (Wainwright, 2019),
\[
\frac{1}{N}\sqrt{\sum_s \left(N P_s - \frac{n_s^R}{\pi(s)}\right)^2} \;\geq\; \frac{1}{\pi_* \sqrt{N}} \left( \sqrt{J \log 5} + \delta \right)
\]
with probability at most $\exp\left(-\pi_*^2 N \delta^2 / 2\right)$. This completes the proof.

Lemma 3. If $\sum_s (\hat{\mu}_s - \mu_s)^2 = o_p(1)$, then
\[
\frac{1}{N} \sum_s (\hat{\mu}_s - \mu_s)\left(n_s^R \hat{\gamma}(s) - N P_s\right) = o_p\!\left(\frac{1}{\pi_* \sqrt{N}}\right).
\]

Proof of Lemma 3.
First, note that by Cauchy–Schwarz,
\[
\frac{1}{N} \sum_s (\hat{\mu}_s - \mu_s)\left(n_s^R \hat{\gamma}(s) - N P_s\right) \;\leq\; \sqrt{\sum_s (\hat{\mu}_s - \mu_s)^2} \;\cdot\; \frac{1}{N}\sqrt{\sum_s \left(n_s^R \hat{\gamma}(s) - N P_s\right)^2}.
\]
From Lemmas 1 and 2, the term on the right is $O_p\!\left(\frac{1}{\pi_* \sqrt{N}}\right)$, so since $\sum_s (\hat{\mu}_s - \mu_s)^2 = o_p(1)$, the product is $o_p\!\left(\frac{1}{\pi_* \sqrt{N}}\right)$.

Lemma 4.
Under Assumption A.1d, the solution $\hat{\gamma}$ to (17) satisfies
\[
\frac{1}{N} \sum_s \hat{\gamma}(s)\, n_s^R\, \bar{\varepsilon}_s = \frac{1}{N} \sum_{i=1}^N \frac{R_i}{\pi(S_i)}\, \varepsilon_i + o_p\!\left(\frac{1}{\pi_* \sqrt{N}}\right).
\]

Proof of Lemma 4.
First, we write the noise term as
\[
\frac{1}{N} \sum_s \hat{\gamma}(s)\, n_s^R\, \bar{\varepsilon}_s = \frac{1}{N} \sum_{i=1}^N \frac{R_i}{\pi(S_i)}\, \varepsilon_i + \frac{1}{N} \sum_{i=1}^N R_i \left( \hat{\gamma}(S_i) - \frac{1}{\pi(S_i)} \right) \varepsilon_i.
\]
The variance of the second term, conditional on the cell counts $n^R$, is
\[
\operatorname{Var}\left( \frac{1}{N} \sum_{i=1}^N R_i \left( \hat{\gamma}(S_i) - \frac{1}{\pi(S_i)} \right) \varepsilon_i \;\middle|\; n^R \right) = \frac{1}{N^2} \sum_s \left( \hat{\gamma}(s) - \frac{1}{\pi(s)} \right)^2 \left(n_s^R\right)^2 \sigma_s^2 \;\leq\; \frac{\sigma^2}{N^2} \sum_s \left( \hat{\gamma}(s) - \frac{1}{\pi(s)} \right)^2 \left(n_s^R\right)^2.
\]
So by Chebyshev's inequality, conditional on the cell counts $n^R$ we have that with probability at least $1 - \delta$,
\[
\left| \frac{1}{N} \sum_{i=1}^N R_i \left( \hat{\gamma}(S_i) - \frac{1}{\pi(S_i)} \right) \varepsilon_i \right| \;\leq\; \frac{\sigma}{N \sqrt{\delta}} \sqrt{\sum_s \left( \hat{\gamma}(s) - \frac{1}{\pi(s)} \right)^2 \left(n_s^R\right)^2}.
\]
Now notice that
\[
\sqrt{\sum_s \left( \hat{\gamma}(s) - \frac{1}{\pi(s)} \right)^2 \left(n_s^R\right)^2} = \left\| \operatorname{diag}(n^R)\left( \hat{\gamma} - \frac{1}{\pi} \right) \right\|_2 = \left\| \operatorname{diag}(n^R)\hat{\gamma} - N P + N P - \operatorname{diag}(n^R)\frac{1}{\pi} \right\|_2 \leq \left\| \operatorname{diag}(n^R)\hat{\gamma} - N P \right\|_2 + \left\| N P - \operatorname{diag}(n^R)\frac{1}{\pi} \right\|_2.
\]
From Lemma 1 we can further bound this by
\[
\sqrt{\sum_s \left( \hat{\gamma}(s) - \frac{1}{\pi(s)} \right)^2 \left(n_s^R\right)^2} \;\leq\; (1 + \kappa) \left\| N P - \operatorname{diag}(n^R)\frac{1}{\pi} \right\|_2.
\]
From Lemma 2, $\frac{1}{N}$ times this bound is $O_p\!\left(\frac{1}{\pi_* \sqrt{N}}\right)$. Noting that $\sigma^2 = o_p(1)$ by Assumption A.1d shows that this remainder term is $o_p\!\left(\frac{1}{\pi_* \sqrt{N}}\right)$.

Proof of Theorem 1.
First, we write $\hat{\mu}_{\mathrm{drp}}(\hat{\gamma}) - \mu$ as
\[
\hat{\mu}_{\mathrm{drp}}(\hat{\gamma}) - \mu = \frac{1}{N} \sum_s (\hat{\mu}_s - \mu_s)\left(n_s^R \hat{\gamma}(s) - N P_s\right) + \frac{1}{N} \sum_{i=1}^N R_i\, \hat{\gamma}(S_i)\, \varepsilon_i.
\]
From Lemma 3, the first term is $o_p\!\left(\frac{1}{\pi_* \sqrt{N}}\right)$, and from Lemma 4 the second term is $\frac{1}{N}\sum_{i=1}^N \frac{R_i}{\pi(S_i)}\varepsilon_i + o_p\!\left(\frac{1}{\pi_* \sqrt{N}}\right)$. Combining these gives the first result. Assumption A.1e combined with an application of Slutsky's theorem gives the second result.

Proof of Proposition 1.
We begin by re-writing the optimization problem (9) with $L = 0$ and $U = \infty$ in terms of auxiliary covariates $\mathcal{E}^{(k)} \equiv \sum_s D_s^{(k)} n_s^R \gamma_s - D_s^{(k)} N P_s$. The optimization problem becomes
\[
\begin{aligned}
\min_{\gamma \in \mathbb{R}^J,\, \mathcal{E}} \quad & \sum_{k=2}^d \frac{1}{2\lambda_k} \left\| \mathcal{E}^{(k)} \right\|_2^2 + \frac{1}{2} \sum_s n_s^R \gamma_s^2 \\
\text{subject to} \quad & \sum_s D_s^{(1)} n_s^R \gamma_s = \sum_s D_s^{(1)} N P_s \\
& \sum_s D_s^{(k)} n_s^R \gamma_s - D_s^{(k)} N P_s - \mathcal{E}^{(k)} = 0, \qquad k = 2, \ldots, d \\
& 0 \leq \gamma_s \qquad \forall\, s = 1, \ldots, J.
\end{aligned}
\]
The associated Lagrangian is
\[
\mathcal{L}(\gamma, \mathcal{E}, \beta) \equiv \frac{1}{2} \sum_{s=1}^J n_s^R \gamma_s^2 - \sum_{s=1}^J n_s^R \gamma_s \sum_{k=1}^d D_s^{(k)} \cdot \beta^{(k)} + \sum_{s=1}^J N P_s \sum_{k=1}^d D_s^{(k)} \cdot \beta^{(k)} + \sum_{k=2}^d \left( \frac{1}{2\lambda_k} \left\| \mathcal{E}^{(k)} \right\|_2^2 + \mathcal{E}^{(k)} \cdot \beta^{(k)} \right).
\]
The dual problem minimizes $q(\beta) = -\min_{0 \leq \gamma,\, \mathcal{E}} \mathcal{L}(\gamma, \mathcal{E}, \beta)$ over $\beta$, where
\[
\begin{aligned}
q(\beta) &= - \sum_{s=1}^J n_s^R \min_{0 \leq \gamma_s} \left\{ \frac{1}{2}\gamma_s^2 - \gamma_s \sum_{k=1}^d D_s^{(k)} \cdot \beta^{(k)} \right\} - \sum_{s=1}^J N P_s \sum_{k=1}^d D_s^{(k)} \cdot \beta^{(k)} - \sum_{k=2}^d \min_{\mathcal{E}^{(k)}} \left\{ \frac{1}{2\lambda_k} \left\| \mathcal{E}^{(k)} \right\|_2^2 + \mathcal{E}^{(k)} \cdot \beta^{(k)} \right\} \\
&= \frac{1}{2} \sum_{s=1}^J n_s^R \max\left\{ 0,\; \sum_{k=1}^d D_s^{(k)} \cdot \beta^{(k)} \right\}^2 - \sum_{s=1}^J N P_s \sum_{k=1}^d D_s^{(k)} \cdot \beta^{(k)} + \sum_{k=2}^d \frac{\lambda_k}{2} \left\| \beta^{(k)} \right\|_2^2.
\end{aligned}
\]
Since there exists a feasible solution to (9) by assumption, by Slater's condition $\min_\beta q(\beta)$ is equivalent to the solution to the primal problem. The solution to the inner minimization shows that the primal and dual variables are related by $\hat{\gamma}_s = \max\left\{ 0,\; \sum_{k=1}^d D_s^{(k)} \cdot \hat{\beta}^{(k)} \right\}$.
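As a self-contained numeric check (not from the paper; the two-level setup, feature dimensions, cell counts, and $\lambda_2$ below are all hypothetical), we can minimize this dual objective for $d = 2$ and confirm the primal–dual relation: the weights recovered as $\hat{\gamma}_s = \max\{0, \sum_k D_s^{(k)} \cdot \hat{\beta}^{(k)}\}$ balance the first-order margins exactly, while under the sign conventions above the stationarity condition for $\beta^{(2)}$ forces the second-order imbalance to equal $-\lambda_2 \hat{\beta}^{(2)}$, i.e., it shrinks rather than vanishes:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)
J = 6                                  # number of cells (hypothetical)
N = 200.0                              # population size (hypothetical)
P = rng.dirichlet(np.ones(J))          # population cell shares P_s
n_R = rng.integers(5, 40, size=J).astype(float)  # respondent counts n_s^R
D1 = rng.normal(size=(J, 2))           # first-order features D_s^(1), balanced exactly
D2 = rng.normal(size=(J, 3))           # second-order features D_s^(2), penalized
lam2 = 5.0                             # hyperparameter lambda_2

def q(beta):
    """Dual objective: 0.5*sum_s n_s max{0, eta_s}^2 - sum_s N P_s eta_s + (lam2/2)||b2||^2."""
    b1, b2 = beta[:2], beta[2:]
    eta = D1 @ b1 + D2 @ b2
    gamma = np.maximum(0.0, eta)       # inner minimization over gamma_s >= 0
    return 0.5 * np.sum(n_R * gamma**2) - np.sum(N * P * eta) + 0.5 * lam2 * b2 @ b2

def grad_q(beta):
    b1, b2 = beta[:2], beta[2:]
    gamma = np.maximum(0.0, D1 @ b1 + D2 @ b2)
    r = n_R * gamma - N * P            # cell-level imbalance under the current weights
    return np.concatenate([D1.T @ r, D2.T @ r + lam2 * b2])

res = minimize(q, np.zeros(5), jac=grad_q, method="BFGS", options={"gtol": 1e-12})
b1, b2 = res.x[:2], res.x[2:]
gamma_hat = np.maximum(0.0, D1 @ b1 + D2 @ b2)   # primal weights from the dual solution

imb1 = D1.T @ (n_R * gamma_hat - N * P)   # first-order margins: zero at the optimum
imb2 = D2.T @ (n_R * gamma_hat - N * P)   # second-order margins: equal to -lam2 * b2
```

The two imbalance vectors are exactly the gradient of $q$ at the solution, so checking them doubles as a convergence check on the dual solver.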