Robust Bayesian Inference for Big Data: Combining Sensor-based Records with Traditional Survey Data
Ali Rafei, Carol A. C. Flannagan, Brady T. West, Michael R. Elliott
Survey Methodology Program, University of Michigan; University of Michigan Transportation Research Institute

∗ Corresponding author; address: 426 Thompson St., Rm 4068 ISR, Ann Arbor, MI 48109; email: [email protected]
Abstract:
Big Data often presents as massive non-probability samples. Not only is the selection mechanism often unknown, but larger data volume amplifies the relative contribution of selection bias to total error. Existing bias adjustment approaches assume that the conditional mean structures have been correctly specified for the selection indicator or key substantive measures. In the presence of a reference probability sample, these methods rely on a pseudo-likelihood method to account for the sampling weights of the reference sample, which is parametric in nature. Under a Bayesian framework, handling the sampling weights is an even bigger hurdle. To further protect against model misspecification, we expand the idea of double robustness so that more flexible non-parametric methods, as well as Bayesian models, can be used for prediction. In particular, we employ Bayesian additive regression trees, which not only capture non-linear associations automatically but also permit direct quantification of the uncertainty of point estimates through their posterior predictive draws. We apply our method to sensor-based naturalistic driving data from the second Strategic Highway Research Program, using the 2017 National Household Travel Survey as a benchmark.
Keywords and phrases:
Big Data, non-probability sample, quasi-randomization, prediction model, doubly robust, augmented inverse propensity weighting, Bayesian additive regression trees.
1. Introduction
The 21st century is witnessing a re-emergence of non-probability sampling in various domains (Daas et al., 2015; Lane, 2016; Murdoch and Detsky, 2013; Senthilkumar et al., 2018). On one side, probability sampling, which has dominated the survey methodology realm for decades, is facing new challenges, mainly because of a steady decline in response rates (Groves, 2011; Johnson and Smith, 2017; Miller, 2017). On the other side, new modes of data collection via sensors, web portals, and smart devices have emerged that routinely capture a variety of human activities. These automated processes have led to an ever-accumulating massive volume of unstructured information, so-called "Big Data" (Couper, 2013; Japec et al., 2015; Kreuter and Peng, 2014). Being easy to access, inexpensive to collect, and highly detailed, this broad range of data is perceived to be valuable for producing official statistics as an alternative or supplement to probability surveys (Beresewicz et al., 2018; Kitchin, 2015; Struijs et al., 2014).

Although the immense size makes Big Data desirable for rare event studies, predictive analysis, and small area estimation (Kim et al., 2018; Lohr and Raghunathan, 2017; Rao, 2015), new challenges may arise when it comes to finite population inference (Franke et al., 2016; Japec et al., 2015). Lack of a known sampling mechanism is often the foremost concern. Perhaps what distinguishes these types of organic data most from design-based surveys is self-selection in the data-generating process, which can lead to biased estimates. When the sample is unbalanced with respect to the target population composition, larger data volume increases the relative contribution of selection bias to absolute or squared error. Meng et al. (2018) call this phenomenon a "Big Data Paradox", and these authors show both theoretically and empirically that the impact of selection bias on the effective sample size can be extremely large.

The motivating application in this article comes from naturalistic driving studies (NDS), which are one real-world example of Big Data for rare event investigations. Since traffic collisions are inherently rare events, measuring accurate pre-crash behaviors as well as exposure frequency in normal driving demands accurate long-term follow-up of the population of drivers. Thus, NDS are designed to continuously monitor drivers' behavior via in-vehicle sensors, cameras, and advanced wireless technologies (Guo et al., 2009). The detailed information collected by NDS is considered a rich resource for assessing various aspects of transportation, such as traffic safety, crash causality, and travel patterns (Huisingh et al., 2019; Tan et al., 2017). However, because of the high administrative and technical costs, participants are usually recruited via convenience samples from limited geographical areas. Therefore, inference based on the NDS data may suffer from selection bias.

It is apparent that for a non-probability sample, classical design-based approaches cannot be applied directly for making finite population inference, even though one could imagine that the data elements are selected randomly. The main reason is that the probabilities of selection are missing in a non-probability sample and cannot be estimated from the sample itself (Chen et al., 2020).
Thus, as recommended by the American Association for Public Opinion Research (AAPOR) task force on non-probability samples, adjustment methods should rely on models and external auxiliary information (Baker et al., 2013). In the presence of a relevant probability sample with a set of common auxiliary variables, often called a "reference survey", two general approaches can be taken: (1) quasi-randomization (QR) – estimating the probabilities of being included in the non-probability sample, also known as propensity scores (PS), while treating the Big Data as a quasi-random sample (Lee, 2006; Lee and Valliant, 2009; Valliant and Dever, 2011); and (2) prediction modeling (PM) – fitting models on the non-probability sample to predict the response variable for units in the reference survey (Kim et al., 2018; Kim and Rao, 2012; Rivers, 2007; Wang et al., 2015). In either case, design-based approaches can then be utilized to compute point and interval estimates.

Nevertheless, since both QR and PM methods hinge upon mass imputation, correctly specifying the underlying model is essential in either case (Lenis et al., 2018), and especially for the latter, which relies on extrapolation. To further protect against model misspecification, a third method can be applied by combining the QR approach with the PM method, such that the adjusted estimate of a population quantity, such as the population mean, is consistent if either model holds. In this sense, adjustments by such a method are called doubly robust (DR). Proposed by Robins et al. (1994), augmented inverse propensity weighting (AIPW) is the earliest class of DR methods, which borrows the idea of a generalized difference estimator from Cassel et al. (1976). This prominence led the AIPW estimator to gain popularity quickly in the causal inference setting (Bang and Robins, 2005; Kang et al., 2007; Scharfstein et al., 1999; Tan et al., 2019; Tan, 2006). It was later expanded to adjust for non-response bias in the survey sampling context (Haziza and Rao, 2006; Kim and Park, 2006; Kott, 1994, 2006; Kott and Chang, 2010).

A further extension to multiple robustness has been developed by Han and Wang (2013), where multiple models are specified and consistency is obtained as long as at least one of the models is correctly specified. However, simulation results show that a DR estimator is always less efficient than either a correctly specified QR or PM estimator alone. In situations where both QR and PM are misspecified, even moderately, Kang et al. (2007) show that the AIPW method may not work that well. Cao et al. (2009) conclude that the performance of the QR model in the AIPW estimator depends on how close the inverse PS weighted (IPSW) sum of the selection indicator variable is to the sample size. As a result, they recommend estimating the parameters of the QR model under the restriction that the sum of the quotients of the selection indicators by the PS approximately equals the sample size.

Chen et al. (2020) offer further adjustments to adapt the AIPW estimator to a non-probability sampling setting where an external benchmark survey is available. While their method employs a modified pseudo-likelihood approach to estimate the selection probabilities for the non-probability sample, a parametric model is used to impute the outcome for units of the reference survey. Inspired by Kim and Haziza (2014), the authors propose to estimate the model parameters by simultaneously solving the estimating equations, to maintain the DR property for the variance estimator.
Wu and Sitter (2001) point out that the AIPW estimator resembles inverse propensity weighting followed by a GREG calibration based on the estimated auxiliary totals from the reference survey. This two-step method has been frequently used elsewhere (Brick, 2015; Dutwin and Buskirk, 2017; Lee and Valliant, 2009; Valliant, 2020). Combining a model-assisted method with pseudo-weighting, Valliant (2020) also proposes an equivalent DR weighting approach for inference in non-probability samples. An extension of AIPW to high-dimensional data using LASSO has also been suggested by Yang et al. (2020).

To further weaken or relax the modeling assumptions, the current article aims to propose alternative model-assisted DR methods by incorporating more flexible prediction methods, such as supervised machine learning algorithms, into the AIPW estimator. A notable advantage of such predictive tools is automatic variable selection, along with the ability to capture complex non-linear relationships and high-order interactions. As a result, these algorithmic methods, e.g. tree ensembles, kernels, and neural networks, have been widely used in the contexts of causal inference and incomplete data analysis (Mayer et al., 2020; McConnell and Lindner, 2019; Wendling et al., 2018). However, a major challenge with their use in a non-probability sample is how to handle the selection probabilities of the reference survey when fitting the propensity model. Under a Bayesian framework, incorporating the sampling weights into the regression models is an even bigger hurdle (Gelman et al., 2007). The method provided by Chen et al. (2020) relies on the pseudo-likelihood approach, which is generally limited to parametric models from the exponential family.

To eliminate the need to account for the sampling weights, one possible solution is to generate synthetic populations using the weights in the reference survey (An and Little, 2008; Dong et al., 2014; Zangeneh and Little, 2015). However, in a Big Data setting, a synthetic population can be extremely large even if its size is only a few times that of the combined sample, so fitting models on such a huge dataset can be computationally demanding if not impossible (Mercer, 2018). To augment the prediction model in the AIPW estimator while avoiding this technique, we propose an alternative pseudo-weighting approach that is inspired by Elliott and Valliant (2017). This two-step method computationally separates the propensity model from the sampling weights, allowing a broader range of models, such as algorithmic methods, to be utilized for imputing the missing inclusion probabilities. Because of this feature, one can also perform Bayesian PS modeling or Bayesian AIPW using the well-known two-step method (Kaplan and Chen, 2012; Saarela et al., 2016). A well-calibrated Bayesian method can appropriately capture the uncertainty in the imputed PS or the outcome variable via Markov chain Monte Carlo (MCMC) algorithms, meeting the desirable frequentist repeated sampling properties (Dawid, 1982).

In this paper, in addition to parametric Bayes, we apply Bayesian Additive Regression Trees (BART), which provide a strong non-parametric predictive tool by automatically capturing non-linear associations as well as high-order interactions (Chipman et al., 2007). The idea of BART is based on a sum-of-trees regression approximating the outcome variable as an arbitrary function of the predictors. To avoid overfitting, a set of prior distributions is assigned to the tree structure, including depth, decision rules, and terminal node parameters.
Given the observed data, these priors are updated under a Bayesian framework, and the posterior predictive distribution is simulated via a backfitting MCMC algorithm (Chipman et al., 2010). BART is especially desirable in high-dimensional data, where variable selection is a big challenge (Hill et al., 2011; Spertus and Normand, 2018). In addition, the posterior predictive distribution produced by BART makes it easier to quantify the uncertainty due to the imputation of pseudo-weights and the outcome variable (Kaplan and Chen, 2012; Tan et al., 2019).

The advantages of BART for PS adjustment have been elucidated in Kern et al. (2016) and Wendling et al. (2018). In a non-probability sample setting, Rafei et al. (2020) utilized BART to estimate the pseudo-weights proposed by Elliott and Valliant (2017). Their simulation results revealed that BART produces approximately unbiased estimates when the true underlying PS model is unknown. Using BART, Mercer (2018) compared the AIPW estimator with a prediction model that uses the estimated PS as a predictor in the model, and found that the AIPW estimator performed best in terms of both bias and efficiency. His method, however, simulated a synthetic population to cope with the unequal selection probabilities of the reference survey. In an item non-response setting, Tan et al. (2019) exploited BART to compare the AIPW method with its competitor, the penalized spline of propensity prediction (PSPP) method, where the latter uses BART only for the PS model. According to their simulation study, BART outperformed the GLM when the true models are unknown, but with a slight loss of efficiency. However, PSPP proved to give a smaller root mean square error than AIPW, which contradicted the main finding in Mercer (2018).

To assess the performance of our proposed method under BART, we apply it to the sensor-based Big Data from the second phase of the Strategic Highway Research Program (SHRP2), which is the largest NDS conducted to date. The aim is to adjust for the potential selection bias in the sample mean of some trip-related variables (Antin et al., 2015). To this end, we employ the 2017 National Household Travel Survey (NHTS) as the reference survey, which can serve as a probability sample representing the population of American drivers (Santos et al., 2011). While daily trip measures in SHRP2 are recorded via sensors, NHTS asks respondents to self-report their trip measures through an online travel log. By analyzing the aggregated data at the day level, we compare the DR adjusted sensor-based estimates in SHRP2 with the self-reported weighted estimates in NHTS to assess the performance of our proposed methods in terms of bias and efficiency.

The rest of the article is organized as follows. In Section 2, we develop the theoretical background behind the proposed methods and associated variance estimators. A simulation study is designed in Section 3 to assess the repeated sampling properties of the proposed estimators, i.e. bias and efficiency. Section 4 describes the datasets and auxiliary variables we use in the current study. The results of adjusted estimates based on the combined samples of SHRP2 and NHTS at the day level are presented in Section 5. Finally, Section 6 reviews the strengths and weaknesses of the study in more detail and suggests some future research directions.
2. Methods
Denote by $U$ a finite population of size $N$. We consider the values of a scalar outcome variable, $y_i$, $i = 1, 2, \ldots, N$, and $x_i = [1, x_{i1}, x_{i2}, \ldots, x_{ip}]$, the values of a $p$-dimensional vector of relevant auxiliary variables, $X$. Let $S_B$ be a non-probability sample of size $n_B$ selected from $U$. The goal is to estimate an unknown finite population quantity, e.g. $Q(Y)$. Here, the quantity of interest is considered to be the finite population mean, a function of the outcome variable, i.e. $Q(Y) = \bar{y}_U = \sum_{i=1}^{N} y_i / N$. Suppose $\delta^B_i = I(i \in S_B)$ $(i = 1, \ldots, N)$ is the inclusion indicator variable of the "big data" sample $S_B$ of size $n_B$ in $U$. Further, we initially assume that, given $X$, elements in $S_B$ are independent draws from $U$; later, we relax this assumption by considering $S_B$ to have a single-stage clustering design, as is the case in the real-data application of this article.

Suppose $S_R$ is a parallel reference survey of size $n_R$, for which the same set of covariates, $X$, has been measured. We assume $y_i$ is unobserved for $i \in S_R$; otherwise inference could be drawn directly based on $S_R$. Also, let $\delta^R_i = I(i \in S_R)$ denote the inclusion indicator variable associated with $S_R$ for $i \in U$. To avoid unnecessary complexity, we assume that units of $S_R$ are selected independently. Being a full probability sample implies that the selection mechanism in $S_R$ is ignorable given its design features, i.e. $f(\delta^R_i | y_i, d_i) = f(\delta^R_i | d_i)$ for $i \in U$, where $d_i$ denotes a $q$-dimensional vector of associated design variables. Thus, one can define the selection probabilities and sampling weights in $S_R$ as $\pi^R_i = p(\delta^R_i = 1 | d_i)$ and $w^R_i = 1/\pi^R_i$, respectively, for $i \in U$, which we assume are known.

While $X$ and $D$ may overlap or correlate, we define $x^*_i = [x_i, d_i]^T$, the $(p+q)$-dimensional vector of all auxiliary variables associated with $S_B$ and $S_R$. To be able to make unbiased inference for $S_B$, we consider the following assumptions for $S_B$:

C1. Positivity — $S_B$ actually does have a probabilistic sampling mechanism, albeit unknown. That means $p(\delta^B_i = 1 | x_i) > 0$ for all $x_i$ in $U$.

C2. Ignorability — the selection mechanism of $S_B$ is fully governed by $X$, which implies $Y \perp \delta^B \mid X$. Then, for $i \in U$, the pseudo-inclusion probability associated with $S_B$ is defined as $\pi^B_i = p(\delta^B_i = 1 | x_i)$.

C3. Independence — conditional on $X^*$, $S_R$ and $S_B$ are selected independently, i.e. $\delta^B \perp \delta^R \mid X^*$.

Note that the first two assumptions are collectively called "strong ignorability" by Rosenbaum and Rubin (1983). Considering C1–C3, the joint density of $y_i$, $\delta^B_i$ and $\delta^R_i$ can be factorized as below:

$$f(y_i, \delta^B_i, \delta^R_i \mid x_i, d_i; \theta, \beta, \gamma) = f(y_i \mid x_i, d_i; \theta)\, f(\delta^B_i \mid x_i; \beta)\, f(\delta^R_i \mid d_i; \gamma), \quad \forall i \in U \qquad (2.1)$$

where $\Psi = (\theta, \beta, \gamma)$ are distributional parameters. While $\theta$ and $\beta$ are unknown, $\gamma$ may be known, as $S_R$ is a probability sample. A QR approach involves modeling $f(\delta^B_i \mid x^*_i; \beta)$, whereas a PM approach deals with modeling $f(y_i \mid x^*_i; \theta)$.

Now suppose $S_B$ and $S_R$ have trivial overlap, i.e. $p(\delta^B_i + \delta^R_i = 2) \approx 0$, and that the selection mechanism of $S_B$ depends on $X$ observed for the entire population. Thus, given the combined sample, $S = S_B \cup S_R$, with $n = n_B + n_R$ being the sample size, it is reasonable to expect that the pseudo-inclusion probabilities, the $\pi^B_i$'s, are a function of both $x_i$ and $d_i$ for $i \in S$.
Let $z_i = I(i \in S_B \mid \delta_i = 1)$ be the indicator of subject $i$ belonging to the non-probability sample within the combined sample, where $\delta_i = \delta^B_i + \delta^R_i$. Note that since $S_B \cap S_R = \emptyset$, $\delta_i$ can take only the value 0 or 1, as below:

$$\delta_i = \begin{cases} 0, & \text{if } \delta^R_i = 0 \text{ and } \delta^B_i = 0 \\ 1, & \text{if } \delta^R_i = 1 \text{ or } \delta^B_i = 1 \end{cases} \qquad (2.2)$$

Figure 1 illustrates the data structure in both the finite population and the combined sample.

Fig 1:
Data structure in the population and the combined sample
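As a concrete illustration of this layout, the following minimal Python sketch (toy data; all names are our own illustrative choices, not the authors') assembles the combined sample with the observed and missing pieces depicted in Figure 1:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n_B, n_R = 1000, 100

# Toy stand-ins: x is an auxiliary variable, y the outcome, and pi_R the
# reference-survey inclusion probability (known only for S_R units here)
S_B = pd.DataFrame({"x": rng.normal(size=n_B), "y": rng.normal(size=n_B),
                    "z": 1, "pi_R": np.nan})
S_R = pd.DataFrame({"x": rng.normal(size=n_R), "y": np.nan,
                    "z": 0, "pi_R": rng.uniform(0.001, 0.05, size=n_R)})

# Combined sample S = S_B U S_R; z = I(i in S_B | delta_i = 1), as in Eq. 2.2
S = pd.concat([S_B, S_R], ignore_index=True)
```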
In QR, the non-probability sample is treated as if the self-selection mechanism of population units mimics a stochastic process, but with unknown selection probabilities. Attempts are then made to estimate these missing quantities in $S_B$ based on external information. Conditional on $x^*_i$, suppose $\pi^B_i$ follows a logistic regression model in the finite population. We have

$$\pi^B(x_i; \beta) = p(\delta^B_i = 1 \mid x_i; \beta) = \frac{\exp\{\beta^T x_i\}}{1 + \exp\{\beta^T x_i\}}, \quad \forall i \in U \qquad (2.3)$$

where $\beta$ is a vector of model parameters in $U$. Using a modified pseudo-maximum likelihood (PMLE) approach, Chen et al. (2020) demonstrate that, given $S$, a consistent estimate of $\beta$ can be obtained by solving the following estimating equation with respect to $\beta$:

$$U(\beta) = \sum_{i=1}^{n_B} x_i - \sum_{i=1}^{n_R} \pi^B(x_i; \beta)\, x_i / \pi^R_i = 0 \qquad (2.4)$$

The estimates of the $\pi^B_i$'s, which we also call propensity scores (PS), are obtained by plugging the solution of Eq. 2.4, i.e. $\hat\beta$, into Eq. 2.3. It is important to note that the PS estimator proposed by Chen et al. (2020) depends implicitly on $d_i$ in addition to $x_i$, because $\pi^R_i$ is a deterministic function of $d_i$ for $i \in U$. Under certain regularity conditions, the authors show that the inverse PS weighted (IPSW) mean from $S_B$ yields a consistent and asymptotically unbiased estimate of the population mean.

Obviously, the possible solutions of Eq. 2.4 are not a typical output of logistic regression procedures in existing standard software. With one additional assumption, namely mutual exclusiveness of the two samples, i.e. $S_B \cap S_R = \emptyset$, we show that estimating the $\pi^B_i$'s can be reduced to modeling $Z_i$ for $i \in S$ instead of modeling $\delta^B_i$ for $i \in U$. Intuitively, one can view the selection process of the $i$-th population unit in $S_B$ as being initially selected in the joint sample ($\delta_i = 1$) and then being selected in $S_B$ given the combined sample ($Z_i = 1$). By conditioning on $x^*_i$, the selection probabilities in $S_B$ are factorized as

$$p(\delta^B_i = 1 \mid x^*_i) = p(\delta^B_i = 1, \delta_i = 1 \mid x^*_i) = p(\delta^B_i = 1 \mid \delta_i = 1, x^*_i)\, p(\delta_i = 1 \mid x^*_i) = p(Z_i = 1 \mid x^*_i)\, p(\delta_i = 1 \mid x^*_i), \quad i \in S \qquad (2.5)$$

Note that the last expression in Eq. 2.5 results from the definition of $Z_i$ given $S$. The same factorization can be derived for the selection probabilities in $S_R$. Thus, we have

$$p(\delta^R_i = 1 \mid x^*_i) = p(Z_i = 0 \mid x^*_i)\, p(\delta_i = 1 \mid x^*_i) \qquad (2.6)$$

By dividing the two sides of equations 2.5 and 2.6, one can eliminate $p(\delta_i = 1 \mid x^*_i)$ and obtain the pseudo-selection probabilities in $S_B$ as below:

$$p(\delta^B_i = 1 \mid x^*_i) = p(\delta^R_i = 1 \mid x^*_i)\, \frac{p(Z_i = 1 \mid x^*_i)}{p(Z_i = 0 \mid x^*_i)} \qquad (2.7)$$

It is clear that $p(\delta^R_i = 1 \mid x^*_i) = \pi^R_i$, as $x^*_i$ contains $d_i$ and the sampling design of $S_R$ is known given $d_i$. As will be seen in Section 2.4, conditioning on $d_i$ is vital for the DR estimator, as Chen's method is limited to situations where the dimension of the auxiliary variables must be the same in QR and PM.

Note that Eq. 2.7 is identical to the pseudo-weighting formula Elliott and Valliant (2017) derive for a non-probability sample. Unlike the PMLE approach, modeling $Z_i$ in $S$ can be performed using standard binary logistic regression or any alternative classification method, such as supervised machine learning algorithms. Under a logistic regression model, we have

$$p(Z_i = 1 \mid x^*_i) = \frac{\exp\{\beta^T x^*_i\}}{1 + \exp\{\beta^T x^*_i\}} \qquad (2.8)$$

where $\beta$ denotes the vector of model parameters, estimated via maximum likelihood estimation (MLE).
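As a minimal sketch of this factorization (assuming scikit-learn is available; the function below is illustrative, not the authors' implementation), one can fit the unweighted logistic regression of Eq. 2.8 on the combined sample and convert the fitted propensities into pseudo-inclusion probabilities via Eq. 2.7:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def papw_pseudo_probs(X_B, X_R, pi_R_B):
    """Pseudo-inclusion probabilities for units of S_B (Eqs. 2.7-2.8),
    assuming pi^R_i is known for every i in S_B (pi_R_B)."""
    X = np.vstack([X_B, X_R])                          # combined sample S
    Z = np.r_[np.ones(len(X_B)), np.zeros(len(X_R))]   # Z_i = I(i in S_B)

    # Unweighted logistic regression of Z on x*; a very large C makes the
    # fit effectively an unpenalized MLE
    fit = LogisticRegression(C=1e8, max_iter=1000).fit(X, Z)
    p = fit.predict_proba(X_B)[:, 1]                   # p_i(beta_hat), i in S_B

    # Eq. 2.7: pi^B_i = pi^R_i * p(Z=1|x*) / p(Z=0|x*)
    return pi_R_B * p / (1.0 - p)
```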
In situations where $\pi^R_i$ is known or can be calculated for $i \in S_B$, the estimate of $\pi^B_i$ for $i \in S_B$ is then given by

$$\hat\pi^B_i = \pi^R_i \exp\{\hat\beta^T x^*_i\} = \pi^R_i\, \frac{p_i(\hat\beta)}{1 - p_i(\hat\beta)} \qquad (2.9)$$

where $\hat\beta$ denotes the MLE of the logistic regression model parameters, and $p_i(\hat\beta)$ is shorthand for $p(Z_i = 1 \mid x^*_i; \hat\beta)$. Intuitively, one can envision that the first factor in 2.9 treats $S_B$ as if it were selected under the design of $S_R$, and the second factor attempts to balance the distribution of $x$ in $S_B$ with respect to that in $S_R$.

Having estimated $\pi^B_i$ based on 2.9 for all $i \in S_B$, one can construct the Hájek-type pseudo-weighted estimator of the finite population mean as below:

$$\hat{\bar y}_{PAPW} = \frac{1}{\hat N^B} \sum_{i=1}^{n_B} \frac{y_i}{\hat\pi^B_i} \qquad (2.10)$$

where $\hat N^B = \sum_{i=1}^{n_B} 1/\hat\pi^B_i$. Hereafter, we refer to the estimator in 2.10 as propensity-adjusted probability weighting (PAPW). Under mild regularity conditions, the ignorability assumption in $S_B$ given $x$, the logistic regression model, and the additional assumption of $S_B \cap S_R = \emptyset$, Appendix I.1 shows that this estimator is consistent and asymptotically unbiased for $\bar y_U$.

In situations where $\pi^R_i$ is unknown for $i \in S_B$, Elliott and Valliant (2017) suggest predicting this quantity for units of the non-probability sample. Note that, in this situation, it is no longer required to condition on $d_i$ in addition to $x_i$. Treating $\pi^R_i$ as a random variable for $i \in S_B$ conditional on $x_i$, one can obtain this quantity by regressing the $\pi^R_i$'s on the $x_i$'s in the reference survey. This is because

$$p(\delta^R_i = 1 \mid x_i) = \int p(\delta^R_i = 1 \mid \pi^R_i, x_i)\, p(\pi^R_i \mid x_i)\, d\pi^R_i = \int \pi^R_i\, p(\pi^R_i \mid x_i)\, d\pi^R_i = E(\pi^R_i \mid x_i), \quad i \in S_R \qquad (2.11)$$

However, since the outcome is continuous and bounded, taking values within $(0, 1)$, a Beta regression model is recommended (Ferrari and Cribari-Neto, 2004). Note that $\pi^R_i$ is fixed given $d_i$, as $S_R$ is a probability sample, but conditional on $x_i$, $\pi^R_i$ can be regarded as a random variable.

Rafei et al. (2020) call this approach propensity-adjusted probability prediction (PAPP). This two-step derivation of pseudo-inclusion probabilities is especially useful, as it computationally separates the sampling weights in $S_R$ from the propensity model. When the true model is unknown, this feature enables us to fit a broader and more flexible range of models, such as algorithmic tree-based methods. It is worth noting that modeling $E(\pi^R_i \mid x_i)$ does not impose an additional ignorability assumption on $S_R$ given $x$, because in the extreme case where $\delta^R_i \perp x_i$, the weighted and unweighted distributions of $x$ are identical in $S_R$, and therefore the $\pi^R_i$'s can be safely ignored in propensity modeling.

An alternative approach to dealing with selectivity in Big Data is modeling $f(y \mid x^*)$ (Smith, 1983). In a fully model-based fashion, this essentially involves mass imputing $y$ for the population non-sampled units, $U - S_B$. When $x^*$ is unobserved for non-sampled units, it is recommended that a synthetic population be generated by undoing the selection mechanism of $S_R$ through a non-parametric Bayesian bootstrap method using the design variables in $S_R$ (Dong et al., 2014; Zangeneh and Little, 2015). In the non-probability sample context, Elliott and Valliant (2017) propose an extension of the General Regression Estimator (GREG) when only summary information about $x^*$, such as totals, is known for $U$.
In situations where an external probability sample is available with $x^*$ measured, an alternative is to limit the outcome prediction to the units in $S_R$, and then use design-based approaches to estimate the population quantity (Kim et al., 2018; Rivers, 2007; Yang and Kim, 2018).

However, to the best of our knowledge, none of the prior literature distinguishes the role of $D$ from that of $X$ in the conditional mean structure of the outcome, while the likelihood factorization in Eq. 2.1 indicates that predicting $y$ requires conditioning not only on $x$ but also on $d$. Suppose $U$ is a realization of a repeated random sampling process from a super-population under the following model:

$$E(y_i \mid x^*_i; \theta) = m(x^*_i; \theta), \quad \forall i \in U \qquad (2.12)$$

where $m(x^*_i; \theta)$ can be either a parametric model, with $m$ a continuous differentiable function, or an unspecified non-parametric form. Under the ignorability condition where

$$f(y_i \mid x^*_i, z_i = 1; \theta) = f(y_i \mid x^*_i, z_i = 0; \theta) = f(y_i \mid x_i, d_i; \theta) \qquad (2.13)$$

an MLE estimate of $\theta$ can be obtained by regressing $Y$ on $X^*$ in $S_B$. The predictions for units in $S_R$ are then given by

$$\hat y_i = E(y_i \mid x^*_i, z_i = 0; \hat\theta) = m(x^*_i; \hat\theta), \quad \forall i \in S_R \qquad (2.14)$$

where, under a linear model, $m(x^*_i; \hat\theta) = \hat\theta^T x^*_i$. Once $y$ is imputed for all units in the reference survey, the population mean can be estimated through the Hájek formula as below:

$$\hat{\bar y}_{PM} = \frac{1}{\hat N^R} \sum_{i=1}^{n_R} \frac{\hat y_i}{\pi^R_i} \qquad (2.15)$$

where $\hat y_i = m(x^*_i; \hat\theta)$ for $i \in S_R$, $\hat N^R = \sum_{i=1}^{n_R} w^R_i$, and $\pi^R_i$ is the selection probability for subject $i \in S_R$. The asymptotic properties of the estimator in 2.15, including consistency and unbiasedness, have been investigated by Kim et al. (2018). Note that in situations where $\pi^R_i$ is available for $i \in S_B$, one can use $w^R_i$ instead of the high-dimensional $d_i$ as a predictor in $m(.)$. This method is known as linear-in-the-weight prediction (LWP) (Bang and Robins, 2005; Scharfstein et al., 1999; Zhang and Little, 2011). However, since outcome imputation relies fully on extrapolation, even minor misspecification of the underlying model can be seriously detrimental to bias correction.

To reduce the sensitivity to model misspecification, Chen et al. (2020) reconcile the two aforementioned approaches, i.e. QR and PM, such that estimates remain consistent even if one of the two models is incorrectly specified. Their method involves an extension of the augmented inverse propensity weighting (AIPW) estimator proposed by Robins et al. (1994). When $N$ is known, the expanded AIPW estimator takes the following form:

$$\bar y_{DR} = \frac{1}{N} \sum_{i=1}^{n_B} \frac{y_i - m(x^*_i; \theta)}{\pi^B(x^*_i; \beta)} + \frac{1}{N} \sum_{j=1}^{n_R} \frac{m(x^*_j; \theta)}{\pi^R_j} \qquad (2.16)$$

where, given $x^*$, $\beta$ and $\theta$ are estimated using the modified PMLE and MLE methods mentioned in Sections 2.2 and 2.3, respectively. The theoretical proof of the asymptotic unbiasedness of $\bar y_{DR}$ under correct modeling of $\pi^B(x^*_i; \beta)$ or $m(x^*_i; \theta)$ is reviewed in Appendix I.1. To avoid using $\pi^R$ in modeling $\delta^B_i$, because of the PMLE restrictions discussed in Section 2.2, we suggest in this study estimating $\pi^B_i$ for $i \in S_B$ in Eq. 2.16 based on the PAPW or PAPP method, depending on whether $\pi^R_i$ is available for $i \in S_B$ or not.
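A minimal numpy sketch of the resulting estimator follows; the pseudo-probabilities may come from either PAPW or PAPP, and the Hájek-style normalization used when $N$ is unknown is our own illustrative choice:

```python
import numpy as np

def aipw_mean(y_B, m_B, pi_B_hat, m_R, pi_R, N=None):
    """AIPW estimator of the population mean (Eq. 2.16, with estimated
    pseudo-inclusion probabilities plugged in for pi^B).

    y_B      : outcomes observed in S_B
    m_B, m_R : outcome-model predictions m(x*; theta_hat) for S_B and S_R
    pi_B_hat : estimated pseudo-inclusion probabilities for S_B
    pi_R     : reference-survey inclusion probabilities for S_R
    """
    if N is None:  # Hajek-style normalization when N is unknown
        N_B, N_R = np.sum(1.0 / pi_B_hat), np.sum(1.0 / pi_R)
    else:
        N_B = N_R = N
    bias_corr = np.sum((y_B - m_B) / pi_B_hat) / N_B   # IPW residual term
    projection = np.sum(m_R / pi_R) / N_R              # prediction term
    return bias_corr + projection
```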
In situations where $\pi^R_i$ is known for $i \in S_B$, our proposed DR estimator is then given by

$$\hat{\bar y}_{DR} = \frac{1}{N} \sum_{i=1}^{n_B} \frac{1}{\pi^R_i} \left[\frac{1 - p_i(\beta)}{p_i(\beta)}\right] \{y_i - m(x^*_i; \theta)\} + \frac{1}{N} \sum_{j=1}^{n_R} \frac{m(x^*_j; \theta)}{\pi^R_j} \qquad (2.17)$$

where $p_i(\beta) = p(Z_i = 1 \mid x^*_i; \beta)$. We demonstrate that this form of the AIPW estimator is identical to that defined by Kim and Haziza (2014) in the non-response adjustment context under probability surveys. Assuming that $y_i$ is fully observed for $i \in S_R$, let us define the following HT-type estimator of the population mean:

$$\hat{\bar y}_U = \frac{1}{N} \sum_{i=1}^{n_R} \frac{y_i}{\pi^R_i} \qquad (2.18)$$

Now, one can easily conclude that

$$\hat{\bar y}_{DR} = \frac{1}{N} \sum_{i=1}^{n} \frac{1}{\pi^R_i} \left[ Z_i \left(\frac{1 - p_i(\beta)}{p_i(\beta)}\right) \{y_i - m(x^*_i; \theta)\} + (1 - Z_i)\, m(x^*_i; \theta) \right] = \hat{\bar y}_U + \frac{1}{N} \sum_{i=1}^{n} \frac{1}{\pi^R_i} \left[\frac{Z_i}{p_i(\beta)} - 1\right] \{y_i - m(x^*_i; \theta)\} \qquad (2.19)$$

where $p_i(\beta) = p(Z_i = 1 \mid x^*_i; \beta)$. The formula in 2.19 is similar to the one derived by Kim and Haziza (2014). Therefore, the rest of the theoretical proof of asymptotic unbiasedness, i.e. $\hat{\bar y}_{DR} - \hat{\bar y}_U = O_p(n_B^{-1/2})$, in Kim and Haziza (2014) should hold for our modified AIPW estimator in 2.17 as well.

To preserve the DR property for both the point and variance estimators of $\bar y_{DR}$, as suggested by Kim and Haziza (2014), one can simultaneously solve the following estimating equations given $S$ to obtain the estimate of $(\beta, \theta)$. The aim is to cancel the first-order derivative terms in the Taylor-series expansion of $\hat{\bar y}_{DR} - \hat{\bar y}_U$ under QR and PM. These estimating equations are given by

$$\frac{\partial}{\partial \beta}\left[\hat{\bar y}_{DR} - \hat{\bar y}_U\right] = \frac{1}{N} \sum_{i=1}^{n} \frac{Z_i}{\pi^R_i} \left[\frac{1}{p_i(\beta)} - 1\right] \{y_i - m(x^*_i; \theta)\}\, x^*_i = 0$$
$$\frac{\partial}{\partial \theta}\left[\hat{\bar y}_{DR} - \hat{\bar y}_U\right] = \frac{1}{N} \sum_{i=1}^{n} \frac{1}{\pi^R_i} \left[\frac{Z_i}{p_i(\beta)} - 1\right] \dot m(x^*_i; \theta) = 0 \qquad (2.20)$$

where $\dot m$ is the derivative of $m$ with respect to $\theta$. Under a linear regression model, $\dot m(x^*_i) = x^*_i$. Therefore, given the same regularity conditions, ignorability in $S_B$, the logistic regression model, and the additional imposed assumption of $S_B \cap S_R = \emptyset$, one can show that the proposed DR estimator is consistent and approximately unbiased provided that either the QR or the PM model holds.

It is important to note that the system of equations in 2.20 may not have unique solutions unless the dimension of the covariates in QR and PM is identical. Therefore, the AIPW estimator of Chen et al. (2020) may not be applicable here, as our likelihood factorization suggests that conditioning on $d_i$ is necessary at least for the PM. Furthermore, when $\pi^R_i$ is known for $i \in S_B$, one can replace the $q$-dimensional $d_i$ with the one-dimensional $w^R_i$ in modeling both QR and PM. Bang and Robins (2005) show that estimators based on a linear-in-weight prediction model remain consistent.

A fully Bayesian approach specifies a model for the joint distribution of the selection indicator, $\delta^B_i$, and the outcome variable, $y_i$, for $i \in U$ (An, 2010; McCandless et al., 2009). This requires multiply generating synthetic populations and fitting the QR and PM models on each of them repeatedly (Little and Zheng, 2007; Zangeneh and Little, 2015), which can be computationally expensive in a Big Data setting. While joint modeling may result in good frequentist properties (Little, 2004), feedback occurs between the two models (Zigler et al., 2013).
This can be controversial in the sense that PS estimates should not be informed by the outcome model (Rubin, 2007). Here, we are interested in modeling the PS and the outcome separately through the two-step framework proposed by Kaplan and Chen (2012). The first step involves fitting Bayesian models to multiply impute the PS and the outcome by randomly subsampling the posterior predictive draws, and Rubin's combining rules are utilized in the second step to obtain the final point and interval estimates. This method not only is computationally efficient, as it suffices to fit the models once on the combined sample, but also cuts the undesirable feedback between the models, as they are fitted separately. Bayesian modeling can be performed either parametrically or non-parametrically.

As the first step, we employ Bayesian generalized linear models to handle the multiple imputations of $\pi^B_i$ and $y_i$ for $i \in S$, and of $\pi^R_i$ if it is unknown for $i \in S_B$. Under a standard Bayesian framework, a set of independent prior distributions is assigned to the model parameters, and conditional on the observed data, the associated posterior distributions are simulated through an appropriate MCMC method, such as the Metropolis–Hastings algorithm. We propose the following steps:

Step 1. $(\gamma^T, \phi, \beta^T, \theta^T, \sigma^2) \sim p(\gamma)\, p(\phi)\, p(\beta)\, p(\theta)\, p(\sigma^2)$
Step 2. $\pi^R_i \mid x_i, \gamma, \phi \sim \mathrm{Beta}\big(\phi\, \mathrm{logit}^{-1}(\gamma^T x_i),\ \phi\, [1 - \mathrm{logit}^{-1}(\gamma^T x_i)]\big)$
Step 3. $Z_i \mid x_i, \beta \sim \mathrm{Bernoulli}\big(\mathrm{logit}^{-1}\{\beta^T x_i\}\big)$
Step 4. $Y_i \mid x_i, \theta, \sigma^2 \sim \mathrm{Normal}(\theta^T x_i, \sigma^2)$

where $(\gamma^T, \phi)$, $\beta^T$ and $(\theta^T, \sigma^2)$ are the parameters associated with modeling $\pi^R_i$ in a Beta regression (Step 2), $Z_i$ in a binary logistic regression (Step 3), and $Y_i$ in a linear regression (Step 4), and $p(.)$ denotes a prior density function.
Note that in situations where $\pi^R_i$ is calculable for $i \in S_B$, $x_i$ in Steps 3 and 4 should be replaced by $x^*_i$. It is understood that setting non-informative priors on the model parameters can avoid overestimating the variance in a two-step Bayesian method (Kaplan and Chen, 2012).

Suppose $\hat\Theta^{(m)T} = \big[\hat\gamma^{(m)T}, \hat\phi^{(m)}, \hat\beta^{(m)T}, \hat\theta^{(m)T}, \hat\sigma^{2(m)}\big]$ is the $m$-th unit of an $M$-sized random sample from the MCMC draws associated with the posterior distribution of the model parameters. Then, given that $\pi^R_i$ is known for $i \in S_B$, one can obtain the $m$-th draw of the proposed AIPW estimator as below:

$$\hat{\bar y}^{(m)}_{DR} = \frac{1}{\hat N^B} \sum_{i=1}^{n_B} \frac{y_i - \hat\theta^{(m)T} x^*_i}{\pi^R_i \exp[\hat\beta^{(m)T} x^*_i]} + \frac{1}{\hat N^R} \sum_{j=1}^{n_R} \frac{\hat\theta^{(m)T} x^*_j}{\pi^R_j} \qquad (2.21)$$

In situations where $\pi^R_i$ is unknown for $i \in S_B$, the $m$-th draw of the AIPW estimator is given by

$$\hat{\bar y}^{(m)}_{DR} = \frac{1}{\hat N^B} \sum_{i=1}^{n_B} \left\{\frac{1 + \exp[\hat\gamma^{(m)T} x_i]}{\exp[\hat\gamma^{(m)T} x_i]}\right\} \left\{\frac{y_i - \hat\theta^{(m)T} x^*_i}{\exp[\hat\beta^{(m)T} x_i]}\right\} + \frac{1}{\hat N^R} \sum_{j=1}^{n_R} \frac{\hat\theta^{(m)T} x^*_j}{\pi^R_j} \qquad (2.22)$$

Having $\hat{\bar y}^{(m)}_{DR}$ for all $m = 1, 2, \ldots, M$, Rubin's combining rule for the point estimate (Rubin, 2004) can then be employed to obtain the final AIPW estimator as below:

$$\hat{\bar y}_{DR} = \frac{1}{M} \sum_{m=1}^{M} \hat{\bar y}^{(m)}_{DR} \qquad (2.23)$$

If at least one of the underlying models is correctly specified, we would expect this estimator to be approximately unbiased. Variance estimation under the two-step Bayesian method is discussed in Section 2.6.

Despite the prominent feature of double robustness, for a given non-probability sample, neither QR nor PM has a known modeling structure in practice. When both working models are invalid, the AIPW estimator will be biased, and a non-robust estimator based on PM may produce a more efficient estimate than the AIPW (Kang et al., 2007). To show the advantage of our modified estimator in 2.17 over that proposed by Chen et al. (2020), we employ Bayesian Additive Regression Trees (BART) as a predictive tool for multiply imputing the $\pi^B_i$'s as well as the $y_i$'s in $S$. A brief introduction to BART is provided in Appendix I.2.

Suppose BART approximates a continuous outcome variable through an implicit function $f$ as below:

$$y_i = f(x^*_i) + \epsilon_i, \quad \forall i \in S_B \qquad (2.24)$$

where $\epsilon_i \sim N(0, \sigma^2)$. Accordingly, one can train BART on $S_B$ and multiply impute $y_i$ for $i \in S_R$ using the simulated posterior predictive distribution. Regarding QR, we consider two situations. First, suppose $\pi^R_i$ is known for $i \in S_B$. Under this circumstance, it suffices to model $z_i$ on $x^*_i$ in $S$ to estimate $\pi^B_i$ for $i \in S_B$. For a binary outcome variable, BART utilizes a data augmentation technique with respect to a given link function to map $\{0, 1\}$ values to $\mathbb{R}$ via a probit link. Suppose

$$\Phi^{-1}[p(Z_i = 1 \mid x^*_i)] = h(x^*_i), \quad \forall i \in S \qquad (2.25)$$

where $\Phi^{-1}$ is the inverse CDF of the standard normal distribution.
Hence, using the posterior predictive draws generated by BART in 2.25, $p(Z_i = 1 \mid x^*_i)$, and consequently $\pi^B_i$, can be multiply imputed for $i \in S_B$. For a given imputation $m$ $(m = 1, 2, \ldots, M)$, one can expand the DR estimator in 2.17 as below:

$$\hat{\bar y}^{(m)}_{DR} = \frac{1}{\hat N^B} \sum_{i=1}^{n_B} \frac{1}{\pi^R_i} \left\{\frac{1 - \Phi[\hat h^{(m)}(x^*_i)]}{\Phi[\hat h^{(m)}(x^*_i)]}\right\} \left\{y_i - \hat f^{(m)}(x^*_i)\right\} + \frac{1}{\hat N^R} \sum_{j=1}^{n_R} \frac{\hat f^{(m)}(x^*_j)}{\pi^R_j} \qquad (2.26)$$

where $\hat f^{(m)}(.)$ and $\hat h^{(m)}(.)$ are the constructed sum-of-trees functions associated with the $m$-th MCMC draw in 2.24 and 2.25, respectively, after training BART on the combined sample.

Second, in situations where $\pi^R_i$ is not available for $i \in S_B$, we suggest applying BART to multiply impute the missing $\pi^R_i$'s in $S_B$. Since the outcome is continuous and bounded within $(0, 1)$, a logit transformation of the $\pi^R_i$'s can be used as the outcome variable in BART to map the values to $\mathbb{R}$. Such a model is represented by

$$\log\left(\frac{\pi^R_i}{1 - \pi^R_i}\right) = k(x_i) + \epsilon_i, \quad \forall i \in S_R \qquad (2.27)$$

where $k$ is a sum-of-trees function approximated by BART. Under this circumstance, $\hat{\bar y}_{DR}$ based on the $m$-th draw from the posterior distribution is given by

$$\hat{\bar y}^{(m)}_{DR} = \frac{1}{\hat N^B} \sum_{i=1}^{n_B} \left\{\frac{1 + \exp[\hat k^{(m)}(x_i)]}{\exp[\hat k^{(m)}(x_i)]}\right\} \left\{\frac{1 - \Phi[\hat h^{(m)}(x_i)]}{\Phi[\hat h^{(m)}(x_i)]}\right\} \left\{y_i - \hat f^{(m)}(x^*_i)\right\} + \frac{1}{\hat N^R} \sum_{j=1}^{n_R} \frac{\hat f^{(m)}(x^*_j)}{\pi^R_j} \qquad (2.28)$$

Having $\hat{\bar y}^{(m)}_{DR}$ estimated for $m = 1, 2, \ldots, M$, one can eventually use Rubin's combining rule (Rubin, 2004) to obtain the ultimate point estimate as in 2.23.

To obtain an unbiased variance estimate for the proposed DR estimator, one needs to account for three sources of uncertainty: (i) the uncertainty due to the estimated pseudo-weights in $S_B$, (ii) the uncertainty due to the predicted outcome in both $S_B$ and $S_R$, and (iii) the uncertainty due to sampling itself. In the following, we consider two scenarios.

Scenario 1: $\pi^R_i$ is known for $i \in S_B$. Under this circumstance, we derive the sandwich-type variance estimator for the proposed PAPW estimator in Appendix I.1. For a known $N$, and under the specified regularity assumptions and the logistic regression model, the sample estimator of the variance based on $S$ is given by

$$\widehat{Var}\left(\hat{\bar y}_{PAPW}\right) = \frac{1}{N^2} \sum_{i=1}^{n_B} \left\{1 - \hat\pi^B_i\right\} \left(\frac{y_i - \hat{\bar y}_{PAPW}}{\hat\pi^B_i}\right)^2 - \frac{2\hat b^T}{N^2} \sum_{i=1}^{n_B} \left\{1 - p_i(\hat\beta)\right\} \left(\frac{y_i - \hat{\bar y}_{PAPW}}{\hat\pi^B_i}\right) x^*_i + \hat b^T \left[\frac{1}{N^2} \sum_{i=1}^{n} p_i(\hat\beta)\, x^*_i x^{*T}_i\right] \hat b \qquad (2.29)$$

where

$$\hat b^T = \left\{\frac{1}{N} \sum_{i=1}^{n_B} \left(\frac{y_i - \hat{\bar y}_{PAPW}}{\hat\pi^B_i}\right) x^{*T}_i\right\} \left\{\frac{1}{N} \sum_{i=1}^{n} p_i(\hat\beta)\, x^*_i x^{*T}_i\right\}^{-1} \qquad (2.30)$$

and $\hat\pi^B_i$ is the estimated pseudo-selection probability based on 2.9 for $i \in S_B$.

The derivation of the asymptotic variance estimator for $\hat{\bar y}_{DR}$ is then straightforward and follows Chen et al. (2020). It is given by

$$\widehat{Var}(\hat{\bar y}_{DR}) = \hat V_1 + \hat V_2 - \hat B(\hat V_2) \qquad (2.31)$$

where $V_1 = Var(\hat{\bar y}_{PM})$ under the sampling design of $S_R$, and $V_2$ is the variance of $\hat{\bar y}_{DR} - \hat{\bar y}_U$ under the joint sampling design of $S_R$ and the PS model.
This quantity can be estimated as below:

$$\hat V_2 = \frac{1}{N^2} \sum_{i=1}^{n_B} \left[\frac{1 - \hat\pi^B_i}{(\hat\pi^B_i)^2}\right] \left\{y_i - m(x^*_i; \hat\theta)\right\}^2 \qquad (2.32)$$

Finally, $\hat B(\hat V_2)$ corrects for the bias in $\hat V_2$ under the PM, and is given by

$$\hat B(\hat V_2) = \frac{1}{N^2} \sum_{i=1}^{n} \left[\frac{Z_i}{(\hat\pi^B_i)^2} - \frac{1 - Z_i}{(\pi^R_i)^2}\right] \hat\sigma^2_i \qquad (2.33)$$

where $\hat\sigma^2_i = \widehat{Var}(y_i \mid x_i)$. Since the quantity in 2.33 tends to zero asymptotically under the QR model, the derived variance estimator in 2.31 is DR. Note that such an asymptotic estimator requires $N$ to be known.

Scenario 2: $\pi^R_i$ is unknown for $i \in S_B$. To estimate the variance of $\hat{\bar y}_{DR}$ in 2.17 under the GLM, we employ the bootstrap repeated replication method proposed by Rao and Wu (1988). For a given replication $b$ $(b = 1, 2, \ldots, B)$, we draw replicated bootstrap subsamples, $S^{(b)}_R$ and $S^{(b)}_B$, of sizes $n_R - 1$ and $n_B - 1$ with replacement from $S_R$ and $S_B$, respectively. The sampling weights in $S^{(b)}_R$ are updated as below:

$$w^{(b)}_i = w_i \frac{n_R}{n_R - 1} h_i, \quad \forall i \in S^{(b)}_R \qquad (2.34)$$

where $h_i$ is the number of times the $i$-th unit has been repeated in $S^{(b)}_R$. Let $\hat{\bar y}^{(b)}_{DR}$ be the DR estimate based on the $b$-th combined bootstrap sample, $S^{(b)}$, using Eq. 2.8. The variance estimator is given by

$$\widehat{Var}(\hat{\bar y}_{DR}) = \frac{1}{B} \sum_{b=1}^{B} \left[\hat{\bar y}^{(b)}_{DR} - \bar{\bar y}_{DR}\right]^2 \qquad (2.35)$$

where $\bar{\bar y}_{DR} = \sum_{b=1}^{B} \hat{\bar y}^{(b)}_{DR} / B$. Note that when $S_R$ and $S_B$ are clustered, which is the case in our application, bootstrap subsamples are selected from the primary sampling units (PSU), and $n_R$ and $n_B$ are replaced by their respective PSU counts.

To estimate the variance of $\hat{\bar y}_{DR}$ under a Bayesian framework, whether parametric or non-parametric, we treat $y_i$ for $i \in S_R$, and $\pi^R_i$ and $e_i$ for $i \in S_B$, as missing values in Eq. 2.17, and multiply impute these quantities using the Markov chain Monte Carlo (MCMC) sequence of the posterior predictive distribution generated by BART. For $M$ randomly selected MCMC draws, one can estimate $\hat{\bar y}^{(m)}_{DR}$ for $m = 1, 2, \ldots, M$ based on Eq. 2.17. Following Rubin's combining rule for variance estimation under multiple imputation, the final variance estimate of $\hat{\bar y}_{DR}$ is given as below:

$$\widehat{Var}(\hat{\bar y}_{DR}) = V_W + \left(1 + \frac{1}{M}\right) V_B \qquad (2.36)$$
where $V_W = \sum_{m=1}^{M} \widehat{Var}(\hat{\bar y}^{(m)}_{DR}) / M$, $V_B = \sum_{m=1}^{M} (\hat{\bar y}^{(m)}_{DR} - \bar{\bar y}_{DR})^2 / (M - 1)$, and $\bar{\bar y}_{DR} = \sum_{m=1}^{M} \hat{\bar y}^{(m)}_{DR} / M$. We have shown in Appendix I.1 that the within-imputation component can be approximated by

$$\widehat{Var}(\hat{\bar y}^{(m)}_{DR}) \approx \frac{1}{(\hat N^B)^2} \sum_{i=1}^{n_B} \frac{var(y_i)}{(\hat\pi^B_i)^2} + \frac{1}{(\hat N^R)^2}\, var\left(\frac{1}{\pi^R_i}\right) \left\{\sum_{i=1}^{n_R} \left(\hat y^{(m)}_i\right)^2 + n_R \left(\frac{\hat t_R}{\hat N^R}\right)^2 - 2\frac{\hat t_R}{\hat N^R} \sum_{i=1}^{n_R} \hat y^{(m)}_i\right\} \qquad (2.37)$$

where $\hat t_R = \sum_{i=1}^{n_R} \hat y^{(m)}_i / \pi^R_i$. Note that when $S_R$ or $S_B$ is clustered, under a Bayesian framework, it is important to fit multilevel models to obtain unbiased variance estimates (Zhou et al., 2020). In addition, one needs to account for the intraclass correlation across the sample units in $\widehat{Var}(\hat{\bar y}^{(m)}_{DR})$ for $m = 1, 2, \ldots, M$. Further, one may use the extension of BART with random intercepts to properly specify the working models under a cluster sampling design (Tan et al., 2016).
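The combining step of the two-step Bayesian procedure is mechanical; the sketch below applies Rubin's rules (Eqs. 2.23 and 2.36) to $M$ posterior draws of the point estimate and their within-draw variances (e.g. from Eq. 2.37 or a design-based estimate):

```python
import numpy as np

def rubin_combine(point_draws, var_draws):
    """Combine M draws of the DR estimate and their within-draw variances
    via Rubin's rules (Eqs. 2.23 and 2.36)."""
    point_draws = np.asarray(point_draws)
    M = len(point_draws)
    point = point_draws.mean()                  # Eq. 2.23
    V_W = np.mean(var_draws)                    # within-imputation variance
    V_B = point_draws.var(ddof=1)               # between-imputation variance
    variance = V_W + (1.0 + 1.0 / M) * V_B      # Eq. 2.36
    return point, variance
```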
3. Simulation Study
Three simulations are studied in this section to assess the performance of our proposed methods and associated variance estimators in terms of bias magnitude and other repeated sampling properties. To this end, we consider various situations depending on whether $\pi^R_i$ is available for $i \in S_B$ or not, and whether units of $S_B$ are independent or not.

The design of our first simulation is inspired by the one implemented in Chen et al. (2020). For all three studies, the non-probability samples are given a random selection mechanism with unequal probabilities, but it is later assumed that these selection probabilities are unknown at the stage of analysis, and the goal is to adjust for the selection bias using a parallel probability sample whose sampling mechanism is known. We conduct the simulation under both asymptotic frequentist and two-step Bayesian frameworks.

Consider a finite population of size $N = 10^5$ with $z = \{z_1, z_2, z_3, z_4\}$ being a set of auxiliary variables generated as follows:

$$z_1 \sim Bernoulli(p = 0.5) \quad z_2 \sim U(0, 2) \quad z_3 \sim Exp(\mu = 1) \quad z_4 \sim \chi^2(4) \qquad (3.1)$$

and $x = \{x_1, x_2, x_3, x_4\}$ is defined as a function of $z$ as below:

$$x_1 = z_1 \quad x_2 = z_2 + 0.3 x_1 \quad x_3 = z_3 + 0.2(x_1 + x_2) \quad x_4 = z_4 + 0.1(x_1 + x_2 + x_3) \qquad (3.2)$$

Given $x$, a continuous outcome variable $y$ is defined by

$$y_i = 2 + x_{1i} + x_{2i} + x_{3i} + x_{4i} + \sigma \epsilon_i \qquad (3.3)$$

where $\epsilon_i \sim N(0, 1)$, and $\sigma$ is defined such that the correlation between $y_i$ and $\sum_{k=1}^{4} x_{ki}$ equals $\rho$, which takes one of the values $\{0.2, 0.5, 0.8\}$. Further, associated with the design of $S_B$, a set of selection probabilities is assigned to the population units through the following logistic model:

$$\log\left(\frac{\pi^B_i}{1 - \pi^B_i}\right) = \gamma_0 + 0.1 x_{1i} + 0.2 x_{2i} + 0.1 x_{3i} + 0.2 x_{4i}$$

where $\gamma_0$ is determined such that $\sum_{i=1}^{N} \pi^B_i = n_B$. For $S_R$, we assume $\pi^R_i \propto \gamma_1 + z_i$, where $\gamma_1$ is obtained such that $\max\{\pi^R_i\}/\min\{\pi^R_i\} = 50$. Hence, $\pi^R_i$ is assumed to be known for $i \in S_B$ as long as $z$ is observed in $S_B$. Using these measures of size, we repeatedly draw pairs of samples of sizes $n_R = 100$ and $n_B = 1{,}000$, associated with $S_R$ and $S_B$, from $U$ through a Poisson sampling method. Note that units in both $S_R$ and $S_B$ are independently selected, and $n_R \ll n_B$, which might be the case in a Big Data setting. Extensions with $n_B = 100$ and $n_B = 10{,}000$ for both frequentist and Bayesian methods are provided in Appendix I.3.
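A minimal sketch of the population-generating process in Eqs. 3.1–3.3 follows; the population size and some constants are as reconstructed above (which, in turn, follows the design in Chen et al. (2020)), and $\sigma$ is solved in closed form so that $cor(y, \sum_k x_k)$ equals the target $\rho$:

```python
import numpy as np

rng = np.random.default_rng(2021)
N = 100_000                       # population size, 10^5 as assumed above

# Eq. 3.1: auxiliary variables z
z1 = rng.binomial(1, 0.5, N)
z2 = rng.uniform(0, 2, N)
z3 = rng.exponential(1, N)
z4 = rng.chisquare(4, N)

# Eq. 3.2: correlated predictors x
x1 = z1
x2 = z2 + 0.3 * x1
x3 = z3 + 0.2 * (x1 + x2)
x4 = z4 + 0.1 * (x1 + x2 + x3)

# Eq. 3.3: outcome; cor(y, lin) = sd(lin)/sqrt(var(lin) + sigma^2) = rho
lin = x1 + x2 + x3 + x4
rho = 0.5
sigma = np.sqrt(lin.var() * (1 - rho**2)) / rho
y = 2 + lin + sigma * rng.standard_normal(N)
```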
Once $S_B$ and $S_R$ are drawn from $U$, we assume that the $\pi^B_i$'s for $i \in S_B$ and the $y_i$'s for $i \in S_R$ are unobserved, and the aim is to adjust for the selection bias in $S_B$ based on the combined sample, $S$. The simulation is then iterated $K = 5{,}000$ times, where the bias-adjusted mean, SE, and associated 95% confidence interval (CI) for the mean of $y$ are estimated in each iteration. Under the frequentist approach, the AIPW point estimates are obtained by simultaneously solving the estimating equations in 2.20. In addition, the proposed two-step method is used to derive the AIPW point estimates under parametric Bayes. To estimate the variance, we use the DR asymptotic method proposed by Chen et al. (2020) and the conditional variance formula in Eq. 2.36 under the frequentist and Bayesian approaches, respectively. For the latter, we set flat priors on the model parameters, and simulate the posteriors using 1,000 MCMC draws after omitting an additional 1,000 draws as the burn-in period. We then take a random sample of size $M = 200$ from the posterior draws to obtain the point and variance estimates.

To evaluate the repeated sampling properties of the competing methods, relative bias (rBias), relative root mean square error (rMSE), the nominal coverage rate of 95% CIs (crCI), and the SE ratio (rSE) are calculated as below:

$$rBias(\hat{\bar y}_{DR}) = \frac{100}{K} \sum_{k=1}^{K} \left(\hat{\bar y}^{(k)}_{DR} - \bar y_U\right) / \bar y_U \qquad (3.4)$$

$$rMSE(\hat{\bar y}_{DR}) = 100 \times \sqrt{\frac{1}{K} \sum_{k=1}^{K} \left[\left(\hat{\bar y}^{(k)}_{DR} - \bar y_U\right) / \bar y_U\right]^2} \qquad (3.5)$$

$$crCI(\hat{\bar y}_{DR}) = \frac{100}{K} \sum_{k=1}^{K} I\left(\left|\hat{\bar y}^{(k)}_{DR} - \bar y_U\right| < z_{0.975} \sqrt{var(\hat{\bar y}^{(k)}_{DR})}\right) \qquad (3.6)$$

$$rSE(\hat{\bar y}_{DR}) = \frac{1}{K} \sum_{k=1}^{K} \sqrt{var(\hat{\bar y}^{(k)}_{DR})} \Bigg/ \sqrt{\frac{1}{K-1} \sum_{k=1}^{K} \left(\hat{\bar y}^{(k)}_{DR} - \bar{\bar y}_{DR}\right)^2} \qquad (3.7)$$

where $\hat{\bar y}^{(k)}_{DR}$ denotes the DR adjusted sample mean from iteration $k$, $\bar{\bar y}_{DR} = \sum_{k=1}^{K} \hat{\bar y}^{(k)}_{DR} / K$, $\bar y_U$ is the true finite population mean, and $var(.)$ represents the variance estimate of the adjusted mean based on the sample. Finally, we investigate different scenarios of whether the models are correctly specified or not, to test whether our proposed method is DR. In order to misspecify a model, we remove $x$ from the predictors of the working model.

Table 1 summarizes the results of the first simulation study under the frequentist approach. As illustrated, unweighted estimates of the population mean are biased in both $S_R$ and $S_B$. For the non-robust estimators, when the working model is valid, PM appears to outperform QR consistently in terms of bias correction across different $\rho$ values. While PAPW works slightly better than IPSW with respect to bias when the QR model is true, the latter tends to overestimate the variance slightly, according to the values of rSE. In addition, the smaller value of rMSE indicates that PAPW is more efficient than IPSW. For PM, both crCI and rSE reflect accurate estimates of the variance for all values of $\rho$. When the working model is incorrect, point estimates associated with both QR and PM are biased, but the variance remains unbiased. These findings hold across all three values of $\rho$.

For the DR methods, it is evident that estimates based on both PAPW and IPSW remain unbiased when at least one of the PM or QR models holds. Also, the values of crCI and rSE reveal that the asymptotic variance estimator is DR for both methods. Comparing the rMSE values, the AIPW estimate based on IPSW is slightly more efficient than the one based on PAPW. While the variance estimates remain unbiased under the false-false model specification, point estimates are severely biased. Finally, the performance of both AIPW estimators improves with respect to bias reduction, especially when the QR model is misspecified.

For the Bayesian approach, the simulation results are displayed in Table 2.
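The four replicate-level metrics in Eqs. 3.4–3.7 can be computed directly from the $K$ replicate estimates; a sketch:

```python
import numpy as np
from scipy.stats import norm

def repeated_sampling_metrics(est, var, y_bar_U):
    """rBias, rMSE, 95% CI coverage, and SE ratio (Eqs. 3.4-3.7) over K
    simulation replicates; est and var hold the K point and variance
    estimates, and y_bar_U is the true finite population mean."""
    est, var = np.asarray(est), np.asarray(var)
    z = norm.ppf(0.975)
    rbias = 100 * np.mean((est - y_bar_U) / y_bar_U)
    rmse = 100 * np.sqrt(np.mean(((est - y_bar_U) / y_bar_U) ** 2))
    cr_ci = 100 * np.mean(np.abs(est - y_bar_U) < z * np.sqrt(var))
    r_se = np.mean(np.sqrt(var)) / np.std(est, ddof=1)
    return rbias, rmse, cr_ci, r_se
```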
Table 1
Comparing the performance of the bias adjustment methods and the associated asymptotic variance estimator under the frequentist approach in the first simulation study, for ρ = {0.2, 0.5, 0.8}

                             ρ = 0.2                        ρ = 0.5                        ρ = 0.8
Method               rBias    rMSE    crCI    rSE    rBias    rMSE    crCI    rSE    rBias    rMSE    crCI    rSE
Probability sample (S_R)
  Unweighted          8.528  19.248  92.580  1.009    8.647  11.065  77.400  1.018    8.682   9.719  50.880  1.020
  Fully weighted     -0.029  20.276  94.740  1.001    0.006   8.035  95.080  1.010    0.015   5.008  94.880  1.008
Non-probability sample (S_B)
  Unweighted         31.742  32.230   0.000  1.009   31.937  32.035   0.000  1.012   31.996  32.049   0.000  1.013
  Fully weighted      0.127   6.587  95.440  1.013    0.078   2.583  95.660  1.014    0.061   1.554  95.440  1.012
Non-robust adjustment
 Model specification: True
  PAPW               -1.780   8.088  96.960  1.107   -1.906   4.734  95.680  1.103   -1.947   4.186  94.040  1.100
  IPSW               -3.054  10.934  97.240  1.305   -3.134   8.145  95.220  1.173   -3.160   7.778  92.380  1.067
  PM                  0.490   7.577  95.160  1.007    0.190   4.668  94.620  0.991    0.095   4.204  94.560  0.985
 Model specification: False
  PAPW               26.338  27.089   3.140  1.112   26.434  26.618   0.000  1.123   26.461  26.580   0.000  1.128
  IPSW               28.269  28.917   0.580  1.021   28.474  28.648   0.000  1.018   28.536  28.654   0.000  1.014
  PM                 28.093  28.750   0.640  1.022   28.315  28.494   0.000  1.022   28.382  28.505   0.000  1.021
Doubly robust adjustment
 Model specification: QR–True, PM–True
  AIPW–PAPW           0.238   8.070  95.160  1.017    0.100   4.787  95.020  0.996    0.056   4.235  94.640  0.987
  AIPW–IPSW           0.105   7.861  95.100  1.019    0.053   4.737  94.760  0.996    0.036   4.222  94.600  0.987
 Model specification: QR–True, PM–False
  AIPW–PAPW           0.311   8.197  95.420  1.021    0.172   4.988  95.000  1.013    0.127   4.460  95.180  1.011
  AIPW–IPSW           0.222   7.962  95.460  1.024    0.170   4.901  95.420  1.019    0.152   4.405  95.300  1.018
 Model specification: QR–False, PM–True
  AIPW–PAPW           0.877  13.362  96.900  1.028    0.327   6.089  95.820  1.027    0.154   4.523  95.240  1.006
  AIPW–IPSW           0.609  12.532  96.580  1.025    0.232   5.842  95.500  1.022    0.113   4.464  95.340  1.003
 Model specification: QR–False, PM–False
  AIPW–PAPW          28.301  28.995   1.000  1.024   28.392  28.579   0.000  1.021   28.419  28.546   0.000  1.018
  AIPW–IPSW          28.104  28.762   0.720  1.024   28.313  28.493   0.000  1.023   28.376  28.500   0.000  1.022

PAPW: propensity-adjusted probability weighting; IPSW: inverse propensity score weighting; QR: quasi-randomization; PM: prediction model; AIPW: augmented inverse propensity weighting.

Note that we are no longer able to use the PMLE approach. Instead, we apply the PAPP method, assuming that $\pi^R_i$ is unknown for $i \in S_B$. As illustrated, PAPP performs best with respect to bias among all the non-robust methods. Surprisingly, the magnitude of bias is even smaller under the Bayesian PAPP than under the QR methods examined within the frequentist framework. In addition, estimates under the Bayesian approach appear to be slightly more efficient than those obtained under the frequentist methods.
While the variance is approximately unbiased for $\rho = 0.2$, there is evidence that PM and QR increasingly underestimate and overestimate the true variance, respectively, as the value of $\rho$ increases. Regarding the DR methods, it is evident that AIPW estimates are even less biased and more efficient under the Bayesian approach than under the alternative frequentist method, especially when the PM is misspecified but the QR model holds. It is clear from the table that the DR property holds for all values of $\rho$ when at least one of the working models is correctly specified.

In the previous simulation study, we violated the ignorability assumption in order to misspecify the working model, by dropping a key auxiliary variable. Now, we focus on a situation where the models are misspecified with respect to the functional form of their conditional means. To this end, we consider non-linear associations and two-way interactions in constructing the outcome variables as well as the selection probabilities. This also allows us to examine the flexibility of BART as a non-parametric method when the true functional form of the underlying models is unknown. In addition, to simulate a more realistic situation, this time two separate sets of auxiliary variables are generated: $D$, associated with the design of $S_R$, and $X$, associated with the design of $S_A$. However, we allow the two variables to be correlated through a bivariate Gaussian distribution as below:

$$\begin{pmatrix} d \\ x \end{pmatrix} \sim MVN\left(\begin{pmatrix} 0 \\ 0 \end{pmatrix}, \begin{pmatrix} 1 & \rho \\ \rho & 1 \end{pmatrix}\right) \qquad (3.8)$$
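A sketch of the correlated design-variable draw in Eq. 3.8 (zero means and unit variances are implied by the standard bivariate form reconstructed above):

```python
import numpy as np

rng = np.random.default_rng(3)
rho = 0.2                                   # design-association parameter
cov = [[1.0, rho], [rho, 1.0]]
# Eq. 3.8: (d_i, x_i) ~ bivariate normal with correlation rho
d, x = rng.multivariate_normal([0.0, 0.0], cov, size=100_000).T
```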
Comparing the performance of the bias adjustment methods and associated variance estimator under the two-stepparametric Bayesian approach in the first simulation study for ρ = { . , . , . } ρ = 0 . ρ = 0 . ρ = 0 . Method rBias rMSE crCI rSE rBias rMSE crCI rSE rBias rMSE crCI rSE
Probability sample (S_R)
Unweighted        8.528  19.248  92.580  1.009     8.647  11.065  77.400  1.018     8.682   9.719  50.880  1.020
Fully weighted   -0.029  20.276  94.740  1.001     0.006   8.035  95.080  1.010     0.015   5.008  94.880  1.008
Non-probability sample (S_B)
Unweighted       31.620  32.106   0.000  1.014    31.906  32.003   0.000  1.015    31.993  32.045   0.000  1.017
Fully weighted    0.026   6.615  95.260  1.010     0.052   2.604  95.240  1.007     0.059   1.564  95.240  1.006
Non-robust adjustment
Model specification: True
PAPW             -1.846   8.148  96.340  1.081    -1.890   4.749  96.860  1.163    -1.906   4.195  96.560  1.200
IPSW              0.113   7.566  96.500  1.076     0.117   4.302  97.700  1.140     0.117   3.759  97.900  1.164
PM                0.385   7.534  95.180  1.027     0.151   4.644  95.060  1.001     0.078   4.190  95.000  0.989
Model specification: False
PAPW             26.290  27.041   2.280  1.051    26.499  26.687   0.000  1.071    26.562  26.684   0.000  1.083
IPSW             28.151  28.784   0.500  1.038    28.446  28.612   0.000  1.025    28.535  28.647   0.000  1.015
PM               27.981  28.641   0.840  1.040    28.291  28.472   0.000  1.025    28.384  28.510   0.000  1.015
Doubly robust adjustment
Model specification: QR–True, PM–True
AIPW–PAPW         0.115   8.093  96.940  1.097     0.057   4.764  97.120  1.121     0.037   4.219  97.200  1.130
AIPW–IPSW         0.009   7.803  96.600  1.083     0.019   4.704  96.980  1.106     0.020   4.206  96.960  1.114
Model specification: QR–True, PM–False
AIPW–PAPW        -0.016   7.930  97.180  1.108    -0.080   4.444  97.940  1.166    -0.098   3.842  98.140  1.193
AIPW–IPSW        -0.079   7.648  96.820  1.095    -0.074   4.411  97.700  1.151    -0.069   3.867  97.900  1.175
Model specification: QR–False, PM–True
AIPW–PAPW         0.557   7.693  96.380  1.086     0.214   4.669  96.760  1.092     0.105   4.195  96.600  1.090
AIPW–IPSW         0.392   7.526  95.980  1.067     0.155   4.637  96.340  1.077     0.080   4.189  96.420  1.078
Model specification: QR–False, PM–False
AIPW–PAPW        28.167  28.864   1.360  1.096    28.359  28.549   0.000  1.082    28.416  28.548   0.000  1.068
AIPW–IPSW        27.990  28.647   0.980  1.069    28.289  28.471   0.000  1.059    28.379  28.506   0.000  1.049
PAPW: propensity adjusted probability weighting; IPSW: inverse propensity score weighting; QR: quasi-randomization; PM: prediction model; AIPW: augmented inverse propensity weighting.
Note that ρ controls how strongly the sampling design of S_R is associated with that of S_A. In addition, the values of d_i can be either observed or unobserved for i ∈ S_A. In this simulation, we set ρ = 0.2, but later we check other values of ρ between 0 and 1. For each i ∈ U, we consider the following non-linear model for the outcome:

$$y_i = 2f_k(x_i) - d_i + c\,x_i d_i + \sigma\epsilon_i \qquad (3.9)$$

where ε_i ∼ N(0, 1), c is a fixed interaction coefficient, and σ is determined such that the correlation between y_i and its mean structure, 2f_k(x_i) − d_i + c x_i d_i, attains a prespecified value for i ∈ U. The function f_k(·) is assumed to take one of the following forms:

$$\mathrm{SIN}: f(x) = \sin(x), \qquad \mathrm{EXP}: f(x) = \exp(x/a), \qquad \mathrm{SQR}: f(x) = x^2/b \qquad (3.10)$$

for fixed scaling constants a and b. Selection into S_A and S_R depends on x and d, respectively. Thus, each i ∈ U is assigned two values corresponding to the probabilities of selection into S_R and S_A through logistic functions as below:

$$\pi_R(d_i) = P(\delta^R_i = 1 \mid d_i) = \frac{\exp\{\gamma_1 + c_R d_i\}}{1 + \exp\{\gamma_1 + c_R d_i\}}, \qquad \pi_{Ak}(x_i) = P_k(\delta^A_i = 1 \mid x_i) = \frac{\exp\{\gamma_2 + f_k(x_i)\}}{1 + \exp\{\gamma_2 + f_k(x_i)\}} \qquad (3.11)$$

where c_R is a fixed slope and δ^R_i and δ^A_i are the indicators of being selected into S_R and S_A, respectively. Associated with S_R and S_A, independent samples of size n_R = 100 and n_A = 1,000 were selected randomly from U, with Poisson sampling at the first stage and simple random sampling at the second stage. The sample size per cluster, n_α, was 1 and 50 for S_R and S_A, respectively. The model intercepts, γ_1 and γ_2 in Eq. 3.11, are chosen so that the samples attain their desired expected sizes. We restrict this simulation to Bayesian analysis based on the proposed PAPW and PAPP methods, but focus on how well the non-parametric Bayes performs relative to the parametric Bayes when the true structure of both underlying models is supposed to be unknown. The rest of the simulation design is similar to that defined in Simulation I, except for the way we specify a working model. This is done by including only the main and linear effects of X and D in the PM model, and the main and linear effect of X in the QR model. BART's performance is examined under the assumption that the true functional form of both the QR and PM models is unknown, and thus only main effects are included in BART. Note that dropping a key auxiliary variable, which was the case in Simulation I, leads to a violation of the ignorability assumption, which may not be compensated by the use of more flexible approaches, such as BART.
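To make the setup concrete, here is a minimal sketch of this data-generating process. The population size, the interaction coefficient c, σ, and the logistic intercepts and slope are illustrative placeholders rather than the values used in the study, and only the SIN form of f_k is shown:

```python
import numpy as np

rng = np.random.default_rng(42)
N, rho = 100_000, 0.2            # population size is illustrative

# Eq. (3.8): correlated auxiliary variables d and x
cov = np.array([[1.0, rho], [rho, 1.0]])
d, x = rng.multivariate_normal([0.0, 0.0], cov, size=N).T

# Eq. (3.10): one of the three mean functions (SIN shown)
f = np.sin(x)

# Eq. (3.9): outcome; c and sigma are placeholder values
c, sigma = 0.5, 1.0
y = 2.0 * f - d + c * x * d + sigma * rng.standard_normal(N)

def expit(z):
    return 1.0 / (1.0 + np.exp(-z))

# Eq. (3.11): logistic selection; intercepts and slope are illustrative only
pi_R = expit(-6.0 + 0.5 * d)     # reference (probability) sample S_R
pi_A = expit(-4.0 + f)           # non-probability sample S_A

# Poisson sampling: independent Bernoulli draws with unit-specific probabilities
in_S_R = rng.random(N) < pi_R
in_S_A = rng.random(N) < pi_A
```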
The findings of this simulation for the two-step Bayesian approach, with 1,000 MCMC draws and M = 200, are exhibited numerically in Table 3. Regarding the non-robust methods, both QR and PM estimators show unbiased results across the three defined functions, i.e. SIN, EXP, and SQR, as long as the working GLM is valid, with the minimum value of rBias associated with the PAPP method. According to the rSE values, there is evidence that PAPW and PAPP overestimate the variance, and PM underestimates the variance to some degree, especially under the EXP and SQR scenarios. When the specified GLM is wrong, point estimates are biased for both QR and PM methods across all three functions. However, BART produces approximately unbiased results with smaller values of rMSE than the GLM. In general, the PM method outperforms the QR methods under BART with respect to bias, but results based on the PAPP method are more efficient. In addition, BART tends to overestimate the variance under both QR and PM methods.

When it comes to the DR adjustment, the Bayesian GLM produces unbiased results across all three defined functions if the working model of either QR or PM holds. However, the variance is slightly underestimated for the SIN function when the specified PM model is wrong, and it is overestimated for the EXP function under all model-specification scenarios. As expected, point estimates are biased when the GLM is misspecified for both QR and PM. However, BART tends to produce unbiased estimates consistently across all three functions, and the magnitudes of both rBias and rMSE are smaller for the AIPW estimator based on PAPP than for the AIPW estimator based on PAPW. Finally, as with the non-robust methods, the variance under BART is overestimated compared to the GLM.
Table 3
Comparing the performance of the bias adjustment methods and the associated variance estimators under the two-step parametric Bayesian approach in the second simulation study for ρ = 0.2

                              SIN                              EXP                              SQR
Model-method          rBias    rMSE    crCI    rSE     rBias    rMSE    crCI    rSE     rBias    rMSE    crCI    rSE
Probability sample (S_R)
Unweighted          -17.210  23.109  80.000  0.999    -8.406  11.126  78.300  1.000   -17.302  20.563  65.800  1.002
Fully weighted       -0.623  17.027  94.440  0.987    -0.303   7.947  94.580  0.987    -0.675  13.219  94.000  0.975
Non-probability sample (S_B)
Unweighted           33.063  33.379   0.000  1.003    40.307  40.409   0.000  1.079    49.356  49.570   0.000  1.016
Fully weighted        0.019   6.010  95.120  1.006     0.005   2.755  94.880  1.005     0.009   3.948  94.980  0.992
Non-robust adjustment
Model specification: True
GLM–PAPW             -0.425   9.257  96.300  1.072    -0.185   4.262  98.700  1.257    -0.325   6.649  98.360  1.213
GLM–PAPP              0.018   8.460  95.680  1.018     0.040   3.870  98.560  1.238    -0.037   5.914  98.760  1.222
GLM–PM               -0.411   9.899  94.680  0.982    -0.371   4.504  94.440  0.988    -0.762   8.115  92.520  0.947
Model specification: False
GLM–PAPW              7.180  11.635  86.360  1.027     2.511   5.299  97.220  1.316    52.170  52.559   0.000  1.102
GLM–PAPP              7.647  11.265  78.000  0.954     3.025   5.425  96.180  1.277    53.095  53.397   0.000  1.122
BART–PAPW             4.035  10.078  96.980  1.217     2.811   5.129  98.440  1.472     8.356  11.082  97.180  1.468
BART–PAPP             1.098   8.530  96.660  1.121     1.108   4.120  98.880  1.391     4.482   7.479  98.020  1.401
GLM–PM                5.870  10.542  87.920  0.972    -6.589   9.264  82.520  0.976    48.993  49.409   0.000  0.994
BART–PM               0.577   9.635  96.960  1.115     0.087   4.501  97.540  1.155     0.249   8.276  96.080  1.062
Doubly robust adjustment
Model specification: QR–True, PM–True
GLM–AIPW–PAPW        -0.450   9.930  95.760  1.023    -0.165   4.593  98.180  1.200    -0.458   8.116  96.520  1.089
GLM–AIPW–PAPP        -0.452   9.925  95.780  1.020    -0.162   4.592  98.140  1.193    -0.453   8.106  96.500  1.086
Model specification: QR–True, PM–False
GLM–AIPW–PAPW        -0.279   9.996  93.160  0.926     0.310   5.697  98.780  1.303    -0.338   7.128  97.480  1.154
GLM–AIPW–PAPP        -0.134   9.418  94.120  0.961     0.508   4.977  99.480  1.475    -0.275   7.376  97.580  1.152
Model specification: QR–False, PM–True
GLM–AIPW–PAPW        -0.411  10.098  96.080  1.024    -0.176   4.715  98.460  1.234    -0.771   8.122  95.480  1.057
GLM–AIPW–PAPP        -0.417  10.101  96.020  1.021    -0.173   4.705  98.400  1.229    -0.778   8.119  95.420  1.057
Model specification: QR–False, PM–False
GLM–AIPW–PAPW         9.015  13.176  84.140  1.000     6.735   8.693  94.100  1.456    50.835  51.288   0.000  1.019
GLM–AIPW–PAPP         9.191  12.717  84.860  1.082     6.787   8.181  96.660  1.761    51.667  52.131   0.000  1.047
BART–AIPW–PAPW        0.425  10.071  97.900  1.184     0.122   4.689  99.280  1.407    -0.259   8.349  97.960  1.231
BART–AIPW–PAPP       -0.144   9.794  97.820  1.184    -0.100   4.541  99.280  1.405    -0.245   8.329  97.740  1.203
PAPW: propensity adjusted probability weighting; PAPP: propensity adjusted probability prediction; QR: quasi-randomization; PM: prediction model; AIPW: augmented inverse propensity weighting.
Since the non-probability sample in the application of this study is clustered, we performed a third simulation study. To this end, the hypothetical population is assumed to be clustered, consisting of A equally sized clusters of n_α units each, with N = A × n_α units in total. Then, three cluster-level covariates, {x_1, x_2, d}, are defined with the following distributions:

$$\begin{pmatrix} d_\alpha \\ x_{1\alpha} \\ x^*_{2\alpha} \end{pmatrix} \sim MVN\left( \mathbf{0},\ \begin{pmatrix} 1 & -\rho/2 & \rho \\ -\rho/2 & 1 & -\rho/2 \\ \rho & -\rho/2 & 1 \end{pmatrix} \right), \qquad x_{2\alpha} = I(x^*_{2\alpha} > 0) \qquad (3.12)$$

where d denotes a design variable in S_R, and {x_1, x_2} describe the selection mechanism in S_B. Primarily, we set ρ = 0.8, but later we check other values of ρ between 0 and 1; ρ controls how strongly the sampling design of S_R is associated with that of S_B. Furthermore, we assume that both d and x are observed for units of S_B. Again, to be able to assess BART's performance, we consider non-linear associations with polynomial terms and two-way interactions in constructing the outcome variables as well as the selection probabilities. Two outcome variables are studied, one continuous (y^c) and one binary (y^b), both of which depend on {x, d} as below:

$$y^c_{\alpha i} \mid x_\alpha, d_\alpha \sim N\left(\mu_\alpha,\ \sigma^2 = 1\right), \qquad \mu_\alpha = 1 + \beta_1 x_{1\alpha} + \beta_2 x_{2\alpha} - \beta_3 x^2_{1\alpha} - \beta_4 x_{1\alpha} x_{2\alpha} - \beta_5 d_\alpha + u_\alpha \qquad (3.13)$$

$$y^b_{\alpha i} \mid x_\alpha, d_\alpha \sim Ber(p_\alpha), \qquad \mathrm{logit}(p_\alpha) = \eta_0 - \eta_1 x_{1\alpha} + \eta_2 x_{2\alpha} - \eta_3 x^2_{1\alpha} - \eta_4 x_{1\alpha} x_{2\alpha} - \eta_5 d_\alpha + u_\alpha \qquad (3.14)$$

for fixed coefficients β and η, where u_α ∼ N(0, σ²_u) and σ_u is determined such that the intraclass correlation equals a prespecified value. For each i ∈ U, we then consider the following set of selection probabilities associated with the designs of S_R and S_B:

$$\pi_R(d_\alpha) = P(\delta^R_\alpha = 1 \mid d_\alpha) = \frac{\exp\{\gamma_1 + c_R d_\alpha\}}{1 + \exp\{\gamma_1 + c_R d_\alpha\}}, \qquad \pi_B(x_\alpha) = P(\delta^B_\alpha = 1 \mid x_\alpha) = \frac{\exp\{\gamma_2 + g(x_\alpha)\}}{1 + \exp\{\gamma_2 + g(x_\alpha)\}} \qquad (3.15)$$

where c_R is a fixed slope, g(x_α) is a fixed linear combination of x_{1α}, x_{2α}, x²_{1α}, and x_{1α}x_{2α}, and δ^R and δ^B are the indicators of being selected into S_R and S_B, respectively.
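A minimal sketch of this clustered data-generating process follows; the cluster count, cluster size, coefficients, and σ_u are illustrative placeholders rather than the study's values, and the correlation matrix mirrors the reconstruction in Eq. 3.12:

```python
import numpy as np

rng = np.random.default_rng(7)
A, n_alpha, rho = 2_000, 50, 0.8    # cluster count and size are illustrative

# Eq. (3.12): correlated cluster-level covariates
R = np.array([[1.0, -rho / 2, rho],
              [-rho / 2, 1.0, -rho / 2],
              [rho, -rho / 2, 1.0]])
d, x1, x2_star = rng.multivariate_normal(np.zeros(3), R, size=A).T
x2 = (x2_star > 0).astype(float)    # dichotomized second covariate

# Random intercepts inducing the intraclass correlation; sigma_u is a placeholder
sigma_u = 0.5
u = sigma_u * rng.standard_normal(A)

# Eq. (3.13): continuous outcome with placeholder coefficients
mu = 1 + 0.5 * x1 + 0.5 * x2 - 0.5 * x1**2 - 0.5 * x1 * x2 - 0.5 * d + u
y_c = np.repeat(mu, n_alpha) + rng.standard_normal(A * n_alpha)

# Eq. (3.14): binary outcome; here it reuses the same (placeholder) predictor
p = 1.0 / (1.0 + np.exp(-np.repeat(mu, n_alpha)))
y_b = (rng.random(A * n_alpha) < p).astype(int)
```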
Associated with S_R and S_B, two-stage cluster samples of size n_R = 100 and n_B = 10,000 were selected randomly from U, with Poisson sampling at the first stage and simple random sampling at the second stage. The sample size per cluster, n_α, was 1 and 50 for S_R and S_B, respectively. The model intercepts, γ_1 and γ_2 in Eq. 3.15, are chosen so that the samples attain their desired expected sizes. The rest of the simulation design is similar to that defined in Simulation II, except for the methods we use for point and variance estimation. In addition to the situation where π_Ri is known for i ∈ S_B, we consider a situation where π_R is unobserved for i ∈ S_B and draw the estimates based on PAPP. Furthermore, unlike Simulation I, DR estimates are obtained by separately fitting the QR and PM models, and to get the variance estimates, a bootstrap technique with B = 200 replicates is applied based on Rao and Wu (1988). Finally, under BART, Rubin's combining rules are employed to derive the point and variance estimates based on the random draws of the posterior predictive distribution. As in Simulation II, we consider different scenarios of model specification. To misspecify a model, we include only the main effects in the working model. Also, under BART, no interaction or polynomial terms are included as inputs. Again, we misspecify the functional form of the working models here, whereas in Simulation I we assumed that auxiliary variables were only partially observed when misspecifying a model.
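For reference, these combining rules take the standard multiple-imputation form (Rubin, 2004): letting $\hat{\theta}^{(m)}$ and $\hat{V}^{(m)}$ denote the point and variance estimates computed from the $m$-th posterior predictive draw, $m = 1, \dots, M$,

$$\bar{\theta}_M = \frac{1}{M}\sum_{m=1}^{M}\hat{\theta}^{(m)}, \qquad T_M = \underbrace{\frac{1}{M}\sum_{m=1}^{M}\hat{V}^{(m)}}_{\bar{W}_M} + \left(1+\frac{1}{M}\right)\underbrace{\frac{1}{M-1}\sum_{m=1}^{M}\left(\hat{\theta}^{(m)}-\bar{\theta}_M\right)^{2}}_{B_M},$$

with $\bar{\theta}_M$ the combined point estimate and $T_M$ its total variance.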
The means of the synthesized population U for the outcome variables were ȳ^c_U = 3.39 and ȳ^b_U = 0.40. Figure 2 compares the bias magnitude and efficiency across the non-robust methods. As illustrated, point estimates from both S_R and S_B are biased if the true sampling weights are ignored. After adjustment, for both the continuous and binary outcomes, the bias is close to zero under both QR and PM methods when the working model is correct. However, the lengths of the error bars reveal that the proposed PAPW/PAPP method is more efficient than IPSW. When only main effects are included in the model, all adjusted estimates are biased except for those based on BART. Note that BART cannot be applied under IPSW. Further details about the simulation results for the non-robust methods are displayed in Appendix I.3. We see that IPSW tends to have slightly larger magnitudes of rBias and rMSE for both y^c and y^b. Also, values of rSE close to 1 indicate that Rao & Wu's bootstrap method of variance estimation performs well under both the QR and PM approaches. However, 95% coverage is achieved only when the working model is correct.

In Figure 3, we depict the findings of the simulation corresponding to the DR estimators under different permutations of model specification. One can immediately infer that for all the employed methods, AIPW produces unbiased results when either the PM or the QR model holds. Moreover, in situations where the true underlying models for both PM and QR are unknown, the point estimates based on BART remain unbiased under both the PAPW and PAPP approaches. Furthermore, under the GLM, AIPW estimates based on PAPW/PAPP are slightly less biased and more efficient than those based on IPSW when the PM is incorrect, according to the lengths of the error bars. Details of the numerical results can be found in Appendix I.3, which also compares BART with the GLM in the situation where both working models are wrong. Results showing the performance of the bootstrap variance estimator are provided in Figure 4. The crCI values are all close to the nominal level unless both working models are incorrectly specified. While the same holds for the continuous variable under BART, there is evidence that BART widely underestimates the variance of the AIPW estimator for the binary outcome. Note that the estimation of variance under BART is based on the MCMC draws of the posterior predictive distribution using Rubin's combining rules. To conclude, we observe that when neither the PM nor the QR model is known, BART based on PAPP produces unbiased and efficient estimates with accurate variance.

As the final step, we replicate the simulation for different values of ρ between 0 and 1 and examine how the estimators behave as ρ increases (Figure 5). Generally, it seems that the value of rMSE decreases for all competing methods as ρ increases, but for all values of ρ, PAPW and PAPP are less biased than IPSW. It is only when ρ = 0 for the continuous variable that IPSW outperforms PAPW/PAPP in bias reduction. Moreover, when d is highly correlated with x, there is evidence of better performance by PAPP than IPSW in terms of bias reduction. We believe this is mainly because a stronger association between x and d implies that the additional ignorability assumption under PAPP is better met, while this correlation causes a sort of collinearity in IPSW, leading to a loss of efficiency. The rest of the methods did not show significant changes as ρ increases. Numerical values associated with Figure 5 have been provided in Appendix I.3.
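For concreteness, the sketch below shows a generic AIPW point estimate of the kind evaluated above, in the normalized form often used for non-probability samples (e.g., Chen et al., 2020). The function and argument names are ours, and the weighting conventions of the modified AIPW estimator studied here may differ in detail:

```python
import numpy as np

def aipw_mean(y_B, pw_B, yhat_B, yhat_R, w_R):
    """Doubly robust (AIPW) estimate of a finite population mean.

    y_B    : outcomes observed in the non-probability sample S_B
    pw_B   : estimated pseudo-weights (1 / estimated propensity) for S_B
    yhat_B : PM predictions for units of S_B
    yhat_R : PM predictions for units of the reference sample S_R
    w_R    : design weights of S_R
    """
    # Prediction term: design-weighted mean of predictions over S_R
    pred = np.sum(w_R * yhat_R) / np.sum(w_R)
    # Bias-correction term: pseudo-weighted mean of residuals over S_B
    corr = np.sum(pw_B * (y_B - yhat_B)) / np.sum(pw_B)
    return pred + corr
```

The estimate stays consistent if either the pseudo-weights (QR model) or the predictions (PM model) are based on a correctly specified working model, which is the double robustness property examined in the simulations.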
4. Application to the Strategic Highway Research Program 2 analysis
We briefly describe SHRP2, the non-probability sample, and the NHTS, the probability sample, as well as the variables used for statistical adjustment.
SHRP2 is the largest naturalistic driving study (NDS) conducted to date, with the primary aim of assessing how people interact with their vehicle and traffic conditions while driving (of Sciences, 2013). About A = 3,140 drivers aged 16-95 years were recruited from six geographically dispersed sites across the United States, and over five million trips and 50 million driven miles have been recorded during their participation. The average follow-up time per person was n̄_α = 440 days. A quasi-random approach was initially employed to select participants by random cold calling from a pool of 17,000 pre-registered volunteers.
Fig 2: Comparing the performance of the non-robust approaches for (a) the continuous outcome (Y_c) and (b) the binary outcome (Y_b) when the model is correctly specified. Error bars represent the 2.5% and 97.5% percentiles of the empirical distribution of bias over the simulation iterations. UW: unweighted; FW: fully weighted; PM: prediction model; PAPP: propensity adjusted probability prediction; IPSW: inverse propensity score weighting.

Fig 3: Comparing the performance of the doubly robust estimators under different model-specification scenarios for (a) the continuous outcome (Y_c) and (b) the binary outcome (Y_b). 95% CIs have been generated based on the 2.5% and 97.5% percentiles of the empirical distribution of bias over the simulation iterations. UW: unweighted; FW: fully weighted; PAPP: propensity adjusted probability prediction; IPSW: inverse propensity score weighting.

Fig 4: Comparing the 95% CI coverage rates for the means of (a) the continuous outcome and (b) the binary outcome, and SE ratios for (c) the continuous outcome and (d) the binary outcome, across different DR methods under different model-specification scenarios. UW: unweighted; FW: fully weighted; PAPP: propensity adjusted probability prediction; IPSW: inverse propensity score weighting.

Fig 5:
Comparing the rBias for the means of (a) the continuous outcome and (b) the binary outcome, and rMSE for the means of (c) the continuous outcome and (d) the binary outcome, across different adjustment methods and different values of ρ. UW: unweighted; FW: fully weighted; PAPP: propensity adjusted probability prediction; IPSW: inverse propensity score weighting.

However, because of the low success rate along with budgetary constraints, the investigators later chose to pursue voluntary recruitment. Sites were assigned one of three pre-determined sample sizes according to their population density (Campbell, 2012). While the sample size was equally distributed across females and males within each age group, the youngest and eldest age groups were oversampled, since crash risk is expected to be higher among those subgroups. Thus, one can conclude that the selection mechanism in SHRP2 is a combination of convenience and quota sampling methods. Further description of the study design and recruitment process can be found in Antin et al. (2015).

SHRP2 data are collected in multiple stages. Selected participants are initially asked to complete multiple assessment tests, covering executive function and cognition, visual perception, visual-cognitive, physical, and psychomotor capabilities, personality factors, sleep-related factors, general medical condition, driving knowledge, and so on. In addition, demographic information such as age, gender, household income, education level, and marital status, as well as vehicle characteristics such as vehicle type, model year, manufacturer, and annual mileage, are gathered at the screening stage. A trip in SHRP2 is defined as the time interval during which the vehicle is operating. The in-vehicle sensors record kinematic information, the driver's behaviors, and traffic events continuously from the moment the vehicle is switched on. Encrypted data are stored on a removable hard drive, and participants are asked to provide access to the vehicle every four to six months so that hard drives with accumulated data can be removed and replaced. Trip-related information such as average speed, duration, distance, and GPS trajectory coordinates is then obtained by aggregating the sensor records at the trip level (Campbell, 2012).

In the present study, we use data from the eighth round of the NHTS, conducted from March 2016 through May 2017, as the reference survey. The NHTS is a nationally representative survey, repeated cross-sectionally roughly every seven years, aimed at characterizing personal travel behaviors among the civilian, non-institutionalized population of the United States. The 2017 NHTS was a mixed-mode survey in which households were initially recruited by mail through an address-based sampling (ABS) technique. Within the selected households, all eligible individuals were asked to report their trips, including household members ≤ 15 years old. The overall sample size was 129,696, of which roughly 20% was used for national representativity and the remaining 80% was regarded as add-ons for state-level analyses. The recruitment response rate was 30.4%, of which 51.4% reported their trips via the travel logs (Santos et al., 2011). In the NHTS, a travel day is defined from 4:00 AM of the assigned day to 3:59 AM of the following day, on a typical weekday. A trip is defined as one made by a single person using any mode of transportation. While trip distance was measured by online geocoding, the rest of the trip-related information was self-reported. In total, 264,234 eligible individuals reported their travel.

Because of the critical role of auxiliary variables in maintaining the ignorability assumption for the selection mechanism of the SHRP2 sample, particular attention was paid to identifying and constructing as many common variables as possible in the combined sample that are expected to govern both the selection mechanism and the outcome variables in SHRP2. However, since the SHRP2 sample was gathered from a limited geographical area, generalizing the findings to the American population of drivers requires assuming that no auxiliary variables other than those investigated in this study define the distribution of the outcome variables. This assumption is in fact embedded in the ignorability condition for SHRP2 given the set of common observed covariates. Three distinct sets of variables were considered: (i) demographic information about the drivers, (ii) vehicle characteristics, and (iii) day-level information. These variables and their associated levels/ranges are listed in Table 4.
To reduce the computational burden, we aimed to make inference at the day level, so SHRP2 data were aggregated by day. Accordingly, we constructed several day-level, trip-related outcome variables, such as daily frequency of trips, daily total trip duration, daily total distance driven, mean daily trip average speed, and mean daily start time of trips, which were available in both datasets, as well as daily maximum speed, daily frequency of brakes per mile, and daily percentage of trip time with a full stop, which were available in SHRP2 only. The final sample sizes of the complete day-level datasets were n_B = 837,061 in SHRP2 and n_R = 133,582 in the NHTS.

To make the two datasets more comparable, we filtered out all subjects in the NHTS who were not drivers, were younger than 16 years old, or used public transportation or transportation modes other than cars, SUVs, vans, or light pickup trucks. One major structural difference between the NHTS and SHRP2 was that NHTS participants' trips were recorded for only one randomly assigned weekday, while SHRP2 participants were followed up for several months or years. Therefore, to properly account for the potential intraclass correlation across sample units in SHRP2, we treated SHRP2 participants as clusters for variance estimation. For BART, we fitted random-intercept BART (Tan et al., 2016). In addition, since the π_Ri were not observed for units of SHRP2, we employed the PAPP and IPSW methods to estimate pseudo-weights, and variance estimation under the GLM was based on the Rao & Wu bootstrap method throughout the application section.
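As a rough sketch of this bootstrap scheme, the function below resamples whole clusters (drivers) with replacement and recomputes a weighted mean; the full Rao and Wu (1988) procedure involves additional rescaling for complex designs that is omitted here, and all names are ours:

```python
import numpy as np

def cluster_bootstrap_se(y, w, cluster, B=200, seed=0):
    """Bootstrap SE of a weighted mean, resampling whole clusters (drivers)
    so that intraclass correlation among a participant's days is respected."""
    rng = np.random.default_rng(seed)
    ids = np.unique(cluster)
    reps = np.empty(B)
    for b in range(B):
        chosen = rng.choice(ids, size=ids.size, replace=True)
        idx = np.concatenate([np.flatnonzero(cluster == c) for c in chosen])
        reps[b] = np.average(y[idx], weights=w[idx])
    return reps.std(ddof=1)
```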
Table 4
List of auxiliary variables and associated levels/ranges used to adjust for selection bias in SHRP2

Demographic information
  gender: female, male
  age (yrs): 16-24, 25-34, 35-44, 45-54, 55-64, 65-74, 75+
  race: White, Black, other
  ethnicity: Hispanic, non-Hispanic
  birth country: citizen, alien
  education level: ≤ HS, HS completed, associate, graduate, post-graduate
  household income ($): grouped ranges
Vehicle characteristics
  age (yrs): 0-4, 5-9, 10-14, 15-19, 20+
  type: passenger car, van, SUV, truck
  make: American, European, Asian
  mileage: grouped ranges
Day-level information
  weekend indicator of trip day: yes, no
  season of trip day: winter, spring, summer, fall
5. Results
According to Figure 6, one can visually infer that the largest discrepancies between the sample distribution of auxiliary variables in SHRP2 and that in the population stem from participants' age, race, and population size of the residential area, as well as vehicle age and vehicle type. The youngest and eldest age groups have been oversampled, as have Whites and non-Hispanics. In addition, we found that the proportion of urban dwellers is higher in SHRP2 than in the NHTS. In terms of vehicle characteristics, SHRP2 participants tend to own passenger cars more than the population average, whereas individuals with other vehicle types are underrepresented in SHRP2.
As the first step of QR, we checked whether there is any evidence of a lack of common distributional support between the two studies with respect to the auxiliary variables. Figure 7a compares the kernel density of the estimated PS obtained using BART across the two samples. As illustrated, a notable lack of support appears in the left tail of the PS distribution in SHRP2. This is mainly a consequence of attempting to generalize the results of SHRP2 to the entire population while the SHRP2 sample covers only six states. However, owing to the huge sample size of SHRP2, we believe this does not seriously jeopardize the positivity assumption.
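A crude positivity diagnostic of this kind can be computed directly from the estimated propensity scores; the sketch below (our own construction, not the diagnostic used in the paper) reports the share of SHRP2 units whose estimated score falls outside the range observed in the NHTS:

```python
import numpy as np

def common_support(ps_B, ps_R):
    """Share of non-probability units whose estimated propensity score
    lies outside the reference sample's observed range."""
    lo, hi = ps_R.min(), ps_R.max()
    return np.mean((ps_B < lo) | (ps_B > hi))
```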
The estimated population size of drivers was N̂ ≈ 133 million based on the sampling weights in the NHTS. The available auxiliary variables are strong predictors of the NHTS selection probabilities for SHRP2: the average pseudo-R² for BART was 73% in 10-fold cross-validation.

In Figure 7b, we compare the distribution of estimated pseudo-weights across the QR methods. PAPP based on BART is the only method that does not produce influential weights, and the highest variability in the estimated pseudo-weights belongs to the PAPP method under the GLM. Figure 8 compares the predictive power of BART with the GLM and with classification and regression trees (CART) in modeling Z and Y given X. As can be seen, the largest values of the area under the ROC curve (AUC) and the largest values of (pseudo-)R² in the radar plot across the different trip-related outcome variables are associated with BART. Additionally, Figure 13 in Appendix I.4 exhibits how pseudo-weighting based on PAPP-BART reduces the imbalance in the distribution of X in SHRP2 relative to the weighted distribution of the NHTS.

Figure 9 depicts the adjusted sample means for some trip-related measures that were available in both SHRP2 and the NHTS. The methods compared here encompass PAPP, IPSW, and PM as non-robust approaches, and AIPW with PAPP and AIPW with IPSW as DR approaches. A comparison is also made between the GLM and BART for all methods except those involving IPSW. Our results suggest that, as expected, the oversampling of younger and older drivers leads to underestimating miles driven and length of trips, and overestimating the time of the first trip of the day; other factors may impact these variables, as well as the average speed of a given drive. For three of these four variables (total trip duration, total distance driven, and start hour of daily trips), there appeared to be improvement with respect to bias, taking the NHTS weighted estimates as the benchmark, although only trip duration appears to be fully corrected. In Figure 14 of Appendix I.4, we display the posterior predictive density of mean daily total distance driven under PAPP, PM, and AIPW-PAPP. Note that the narrow variance associated with the PAPP approach is due to the fact that the posterior predictive distribution under pseudo-weighting does not account for the clustering effects in SHRP2; it is in fact V_W in Eq. 2.36 that captures this source of uncertainty in variance estimation.

Among the QR methods, we observed that PAPP based on BART gives the most accurate estimate with respect to bias for this variable. However, the relatively narrow 95% CI associated with BART may indicate that BART does not properly propagate the uncertainty in pseudo-weighting. Regarding the PM, BART seems to perform as well as the GLM, but with wider uncertainty. As a consequence, the AIPW estimator performs the same in terms of bias across the different QR methods. The AIPW estimator based on IPSW, on the other hand, seems to be more efficient than the ones based on PAPP. However, these findings are not consistent across the outcome variables.

Results for the adjusted means of some SHRP2-specific outcome variables are summarized in Figure 10. These variables consist of (a) daily maximum speed, (b) frequency of brakes per mile, and (c) percentage of trip duration during which the vehicle is fully stopped. For daily maximum speed, we take one further step and present the DR adjusted means based on IPSW-GLM and PAPP-BART by some auxiliary variables in Figure 11.
As illustrated, higher levels of mean daily maximum speed are associated with males, the 35-44 age group, Blacks, high-school graduates, Asian cars, and weekends. According to the lengths of the 95% CIs, one can see that AIPW-PAPP-BART consistently produces more efficient estimates than AIPW-IPSW-GLM. Further numerical details of these findings by the auxiliary variables are provided in Tables 11-17 in Appendix I.4.

Fig 6: Comparing the distribution of common auxiliary variables in SHRP2 with the weighted NHTS.

Fig 7: Comparing the distribution of (a) estimated propensity scores between SHRP2 and NHTS using BART and (b) estimated pseudo-weights in SHRP2 across the applied quasi-randomization methods.

Fig 8: Comparing the performance of BART vs GLM in both estimating propensity scores and predicting some trip-related outcomes (logit: AUC = 89.8%; CART: AUC = 86.2%; BART: AUC = 97.1%). The radar plot on the right side displays the values of (pseudo-)R² for BART and GLM. AUC: area under curve; CART: classification and regression trees.

Fig 9: Evaluation of pseudo-weights by comparing weighted estimates of trip-related measures between NHTS and SHRP2: (a) mean daily total trip duration, (b) mean daily total distance driven, (c) mean trip average speed, and (d) mean daily start hour of trips. The dashed line and surrounding shaded area represent weighted estimates and 95% CIs in NHTS, respectively. UW: unweighted; PAPP: propensity adjusted probability prediction; IPSW: inverse propensity score weighting; NA: not applicable.

Fig 10: Adjusted estimates of some SHRP2-specific outcomes: (a) mean daily maximum speed, (b) daily frequency of brakes per mile driven, and (c) daily percentage of stop time. UW: unweighted; PAPP: propensity adjusted probability prediction; IPSW: inverse propensity score weighting.

Fig 11: Bias-adjusted estimates of mean daily maximum speed (MPH) by (a) gender, (b) age group, (c) race, (d) education, (e) vehicle manufacturer, and (f) weekend indicator. UW: unweighted; PAPP: propensity adjusted probability prediction; IPSW: inverse propensity score weighting; NA: not applicable.
6. Discussion
In this study, we proposed a doubly robust (DR) adjustment method for finite population inference in non-probability samples when a well-designed probability sample is available as a benchmark. Combining the idea of pseudo-weighting with prediction modeling, our method involved a modified version of AIPW, which is DR in the sense that estimates are consistent if either underlying model holds. More importantly, the proposed method permitted us to apply a wider class of predictive tools, especially supervised algorithmic methods. To better address model misspecification, our study employed BART to multiply impute both pseudo-inclusion probabilities and the outcome variable. We also proposed a method to estimate the variance of the DR estimator based on the posterior predictive draws simulated by BART. In a simulation study, we then assessed the repeated sampling properties of our proposed estimator. Finally, we applied it to real Big Data from a naturalistic driving study with the aim of reducing potential selection bias in estimates of the finite population mean.

Generally, the simulation findings revealed that our modified AIPW method produces less biased estimates than its competitors, especially when n_R ≪ n_B. When at least one of the models, i.e. QR or PM, is correctly specified, all the DR methods generated unbiased results, though our estimator was substantially more efficient, with narrower 95% CIs. However, when both working models are invalid, our findings suggest that DR estimates based on the GLM can be severely biased, whereas under BART estimates remain approximately unbiased even if the true model structure associated with both QR and PM is unknown to the researcher. In contrast to the conventional IPSW estimator, the proposed estimator produces more stable results in terms of bias and efficiency across different sampling fractions and various degrees of association between the sampling designs of S_R and S_B.

Generally, the results of the application suggest near-total removal of bias for only one of the four variables that can be estimated from the reference survey (daily total distance driven). We believe this failure originates from several sources. First and foremost, the bias observed in the final estimates is very likely mixed with measurement error, because we compared the results of sensor data against a self-reported benchmark. Second, there was evidence of departure from the positivity assumption in SHRP2. Studies show that even a slight lack of common support in the distribution of auxiliary variables may lead to inflated variance and aggravated bias (Hill and Su, 2013). Part of this can be attributed to our attempt to generalize the results to the general population of American drivers, while the SHRP2 data were restricted to six states. Another reason might be deviation from the ignorability assumptions: the associations between the auxiliary variables and the outcome variables were relatively weak and varied across the variables.

A key strength of our study is the ability to combine modern regression methods such as BART with a doubly robust estimator that merges information from a reference survey with a non-probability sample, thereby minimizing model misspecification. We also consider an important application of non-probability sampling methodology to naturalistic driving studies.

Our study was not without weaknesses. First, our adjustment method assumes that the two samples are mutually exclusive. However, in many Big Data scenarios (though not the one we consider), the sampling fraction may be non-trivial, so it is likely that the two samples overlap substantially. In such situations, it is important to check how sensitive our proposed pseudo-weighting approach is to this assumption. Extensions may be plausible to account for duplicate units of the population in the pooled sample. Another drawback is that the combined dataset may be subject to differential measurement error in the variables. This issue is particularly acute in our SHRP2 analysis, because the definition of a trip may not be identical between the two studies: trip measures in SHRP2 are recorded by sensors, whereas in the NHTS they are self-reported and thus based on memory and human estimation. Having such error-prone information either as the outcome or as an auxiliary variable may lead to biased results. Finally, we were unable to use the two-step Bayesian method in the application, because the SHRP2 data are clustered, demanding Bayesian multilevel models to properly estimate the variance of the DR estimators; given the available computational resources, Bayesian multilevel modeling was too heavy to fit on such a large-scale dataset. This prompted us to apply a Bayesian bootstrap technique to the actual data instead of a full Bayesian method.

There are a number of potential future directions for our research. First, we would like to extend the asymptotic variance estimator under PAPP to the case when π_Ri cannot be computed for i ∈ S_B. Alternatively, one may be interested in developing a fully model-based approach, in which a synthetic population is created by undoing the sampling stages via a Bayesian bootstrap method, and attempts are made to impute the outcome for non-sampled units of the population. The synthetic population idea makes it easier to incorporate the design features of the reference survey into adjustments, especially when Bayesian inference is of interest. While correcting for selection bias, one can also adjust for potential measurement error in the outcome variables if there exists a validation dataset where both mismeasured and error-free values of the variables are observed (Kim and Tam, 2020). When combining data from multiple sources, it is also likely that auxiliary variables are subject to differential measurement error; Hong et al. (2017) propose a Bayesian approach to adjust for differential measurement error in a causal inference context. Finally, in a Big Data setting, fitting models can be computationally demanding. To address this issue, it might be worth extending divide-and-recombine techniques to the proposed DR methods.
7. Acknowledgement
The present study was part of the doctoral research of the first author at the Michigan Program in Survey Methodology. The authors would like to thank all the researchers and staff who have been involved in collecting the data for both the 2017 NHTS and SHRP2. Our gratitude also goes to Professors Katharine Abraham, Stanley Presser, and Joseph Sedarski at the University of Maryland, whose valuable comments improved the quality of this article. The findings and conclusions of this paper are those of the authors and do not necessarily represent the views of the Virginia Tech Transportation Institute (VTTI), SHRP2, the Transportation Research Board, or the National Academy of Sciences.
8. Conflict of Interest
The authors declare that there was no conflict of interest in the current research.
References

Abu-Nimeh, S., D. Nappa, X. Wang, and S. Nair (2008). Bayesian additive regression trees-based spam detection for enhanced email privacy. In Availability, Reliability and Security, 2008. ARES 08., pp. 1044–1051. IEEE.
An, H. and R. J. Little (2008). Robust model-based inference for incomplete data via penalized spline propensity prediction. Communications in Statistics–Simulation and Computation 37(9), 1718–1731.
An, W. (2010). Bayesian propensity score estimators: Incorporating uncertainties in propensity scores into causal inference. Sociological Methodology 40(1), 151–189.
Antin, J., K. Stulce, L. Eichelberger, and J. Hankey (2015). Naturalistic driving study: Descriptive comparison of the study sample with national data. Technical report.
Baker, R., J. M. Brick, N. A. Bates, M. Battaglia, M. P. Couper, J. Dever, K. J. Gile, and R. Tourangeau (2013). Summary report of the AAPOR task force on non-probability sampling. Journal of Survey Statistics and Methodology 1, 90–143.
Bang, H. and J. M. Robins (2005). Doubly robust estimation in missing data and causal inference models. Biometrics 61(4), 962–973.
Beresewicz, M., R. Lehtonen, F. Reis, L. Di Consiglio, and M. Karlberg (2018). An overview of methods for treating selectivity in big data sources.
Brick, J. M. (2015). Compositional model inference. JSM Proceedings (Survey Research Methods Section), 299–307.
Campbell, K. L. (2012). The SHRP 2 naturalistic driving study: Addressing driver performance and behavior in traffic safety. TR News (282), 30–35.
Cao, W., A. A. Tsiatis, and M. Davidian (2009). Improving efficiency and robustness of the doubly robust estimator for a population mean with incomplete data. Biometrika 96(3), 723–734.
Cassel, C. M., C. E. Särndal, and J. H. Wretman (1976). Some results on generalized difference estimation and generalized regression estimation for finite populations. Biometrika 63(3), 615–620.
Chen, Y., P. Li, and C. Wu (2020). Doubly robust inference with nonprobability survey samples. Journal of the American Statistical Association 115, 2011–2021.
Chipman, H. A., E. I. George, and R. E. McCulloch (2007). Bayesian ensemble learning. In Advances in Neural Information Processing Systems, pp. 265–272.
Chipman, H. A., E. I. George, R. E. McCulloch, et al. (2010). BART: Bayesian additive regression trees. The Annals of Applied Statistics 4(1), 266–298.
Couper, M. (2013). Is the sky falling? New technology, changing media, and the future of surveys. Keynote presentation at the 5th European Survey Research Association Conference, Ljubljana, Slovenia.
Daas, P. J., M. J. Puts, B. Buelens, and P. A. van den Hurk (2015). Big data as a source for official statistics. Journal of Official Statistics 31(2), 249–262.
Dawid, A. P. (1982). The well-calibrated Bayesian. Journal of the American Statistical Association 77(379), 605–610.
Dong, Q., M. R. Elliott, and T. E. Raghunathan (2014). A nonparametric method to generate synthetic populations to adjust for complex sampling design features. Survey Methodology 40(1), 29.
Dutwin, D. and T. D. Buskirk (2017). Apples to oranges or Gala versus Golden Delicious? Comparing data quality of nonprobability internet samples to low response rate probability samples. Public Opinion Quarterly 81(S1), 213–239.
Elliott, M. R. and R. Valliant (2017). Inference for nonprobability samples. Statistical Science 32(2), 249–264.
Ferrari, S. and F. Cribari-Neto (2004). Beta regression for modelling rates and proportions. Journal of Applied Statistics 31(7), 799–815.
Franke, B., J.-F. Plante, R. Roscher, E.-S. A. Lee, C. Smyth, A. Hatefi, F. Chen, E. Gil, A. Schwing, A. Selvitella, et al. (2016). Statistical inference, learning and models in big data. International Statistical Review 84(3), 371–389.
Gelman, A. et al. (2007). Struggles with survey weighting and regression modeling. Statistical Science 22(2), 153–164.
Green, D. P. and H. L. Kern (2012). Modeling heterogeneous treatment effects in survey experiments with Bayesian additive regression trees. Public Opinion Quarterly 76(3), 491–511.
Groves, R. M. (2011). Three eras of survey research. Public Opinion Quarterly 75(5), 861–871.
Guo, F., J. M. Hankey, et al. (2009). Modeling 100-car safety events: A case-based approach for analyzing naturalistic driving data. Technical report, Virginia Tech Transportation Institute.
Han, P. and L. Wang (2013). Estimation with missing data: Beyond double robustness. Biometrika 100(2), 417–430.
Haziza, D. and J. N. Rao (2006). A nonresponse model approach to inference under imputation for missing survey data. Survey Methodology 32(1), 53.
Hill, J. and Y.-S. Su (2013). Assessing lack of common support in causal inference using Bayesian nonparametrics: Implications for evaluating the effect of breastfeeding on children's cognitive outcomes. The Annals of Applied Statistics, 1386–1420.
Hill, J., C. Weiss, and F. Zhai (2011). Challenges with propensity score strategies in a high-dimensional setting and a potential alternative. Multivariate Behavioral Research 46(3), 477–513.
Hong, H., K. E. Rudolph, and E. A. Stuart (2017). Bayesian approach for addressing differential covariate measurement error in propensity score methods. Psychometrika 82(4), 1078–1096.
Huisingh, C., C. Owsley, E. B. Levitan, M. R. Irvin, P. Maclennan, and G. McGwin (2019). Distracted driving and risk of crash or near-crash involvement among older drivers using naturalistic driving data with a case-crossover study design. The Journals of Gerontology, Series A, Biological Sciences and Medical Sciences 74, 550–555.
Hunsberger, S., B. I. Graubard, and E. L. Korn (2008). Testing logistic regression coefficients with clustered data and few positive outcomes. Statistics in Medicine 27(8), 1305–1324.
Japec, L., F. Kreuter, M. Berg, P. Biemer, P. Decker, C. Lampe, J. Lane, C. O'Neil, and A. Usher (2015). Big data in survey research: AAPOR task force report. Public Opinion Quarterly 79(4), 839–880.
Johnson, T. P. and T. W. Smith (2017). Big data and survey research: Supplement or substitute? In Seeing Cities Through Big Data, pp. 113–125. Springer.
Kang, J. D., J. L. Schafer, et al. (2007). Demystifying double robustness: A comparison of alternative strategies for estimating a population mean from incomplete data. Statistical Science 22(4), 523–539.
Kaplan, D. and J. Chen (2012). A two-step Bayesian approach for propensity score analysis: Simulations and case study. Psychometrika 77(3), 581–609.
Kern, H. L., E. A. Stuart, J. Hill, and D. P. Green (2016). Assessing methods for generalizing experimental impact estimates to target populations. Journal of Research on Educational Effectiveness 9(1), 103–127.
Kim, J. K. and D. Haziza (2014). Doubly robust inference with missing data in survey sampling. Statistica Sinica 24(1), 375–394.
Kim, J. K. and H. Park (2006). Imputation using response probability. Canadian Journal of Statistics 34(1), 171–182.
Kim, J. K., S. Park, Y. Chen, and C. Wu (2018). Combining non-probability and probability survey samples through mass imputation. arXiv preprint arXiv:1812.10694.
Kim, J. K. and J. N. Rao (2012). Combining data from two independent surveys: A model-assisted approach. Biometrika 99(1), 85–100.
Kim, J.-K. and S.-M. Tam (2020). Data integration by combining big data and survey sample data for finite population inference. arXiv preprint arXiv:2003.12156.
Kim, J. K., Z. Wang, Z. Zhu, and N. B. Cruze (2018). Combining survey and non-survey data for improved sub-area prediction using a multi-level model. Journal of Agricultural, Biological and Environmental Statistics 23(2), 175–189.
Kitchin, R. (2015). The opportunities, challenges and risks of big data for official statistics. Statistical Journal of the IAOS 31(3), 471–481.
Kott, P. S. (1994). A note on handling nonresponse in sample surveys. Journal of the American Statistical Association 89(426), 693–696.
Kott, P. S. (2006). Using calibration weighting to adjust for nonresponse and coverage errors. Survey Methodology 32(2), 133–142.
Kott, P. S. and T. Chang (2010). Using calibration weighting to adjust for nonignorable unit nonresponse. Journal of the American Statistical Association 105(491), 1265–1275.
Kreuter, F. and R. D. Peng (2014). Extracting information from big data: Issues of measurement, inference and linkage. Privacy, Big Data, and the Public Good: Frameworks for Engagement, 257.
Lane, J. (2016). Big data for public policy: The quadruple helix. Journal of Policy Analysis and Management 35(3), 708–715.
Lee, S. (2006). Propensity score adjustment as a weighting scheme for volunteer panel web surveys. Journal of Official Statistics 22(2), 329.
Lee, S. and R. Valliant (2009). Estimation for volunteer panel web surveys using propensity score adjustment and calibration adjustment. Sociological Methods & Research 37(3), 319–343.
Lenis, D., B. Ackerman, and E. A. Stuart (2018). Measuring model misspecification: Application to propensity score methods with complex survey data. Computational Statistics & Data Analysis.
Little, R. J. (2004). To model or not to model? Competing modes of inference for finite population sampling. Journal of the American Statistical Association 99(466), 546–556.
Little, R. J. and H. Zheng (2007). The Bayesian approach to the analysis of finite population surveys. Bayesian Statistics 8(1), 1–20.
Lohr, S. L. and T. E. Raghunathan (2017). Combining survey data with other data sources. Statistical Science 32(2), 293–312.
Mayer, I., E. Sverdrup, T. Gauss, J.-D. Moyer, S. Wager, J. Josse, et al. (2020). Doubly robust treatment effect estimation with missing attributes. Annals of Applied Statistics 14(3), 1409–1431.
McCandless, L. C., P. Gustafson, and P. C. Austin (2009). Bayesian propensity score analysis for observational data. Statistics in Medicine 28(1), 94–112.
McConnell, K. J. and S. Lindner (2019). Estimating treatment effects with machine learning. Health Services Research 54(6), 1273–1282.
McGuckin, N. and A. Fucci (2018). Summary of travel trends: 2017 National Household Travel Survey (Report FHWA-PL-18-019). Washington, DC: Federal Highway Administration, US Department of Transportation.
Meng, X.-L. et al. (2018). Statistical paradises and paradoxes in big data (I): Law of large populations, big data paradox, and the 2016 US presidential election. The Annals of Applied Statistics 12(2), 685–726.
Mercer, A. W. (2018). Selection Bias in Nonprobability Surveys: A Causal Inference Approach. Ph.D. thesis.
Miller, P. V. (2017). Is there a future for surveys? Public Opinion Quarterly 81(S1), 205–212.
Murdoch, T. B. and A. S. Detsky (2013). The inevitable application of big data to health care. Journal of the American Medical Association 309(13), 1351–1352.
of Sciences, T. R. B. N. A. (2013). The 2nd Strategic Highway Research Program Naturalistic Driving Study Dataset.
Oman, S. D. and D. M. Zucker (2001). Modelling and generating correlated binary variables. Biometrika 88(1), 287–290.
Rafei, A., C. A. Flannagan, and M. R. Elliott (2020). Big data for finite population inference: Applying quasi-random approaches to naturalistic driving data using Bayesian additive regression trees. Journal of Survey Statistics and Methodology 8(1), 148–180.
Rao, J. N. (2015). Small-Area Estimation. New York: Wiley.
Rao, J. N. and C. Wu (1988). Resampling inference with complex survey data. Journal of the American Statistical Association 83(401), 231–241.
Rivers, D. (2007). Sampling for web surveys. In Joint Statistical Meetings.
Robins, J. M., A. Rotnitzky, and L. P. Zhao (1994). Estimation of regression coefficients when some regressors are not always observed. Journal of the American Statistical Association 89(427), 846–866.
Rosenbaum, P. R. and D. B. Rubin (1983). The central role of the propensity score in observational studies for causal effects. Biometrika 70(1), 41–55.
Rubin, D. B. (2004). Multiple Imputation for Nonresponse in Surveys. New York: Wiley.
Rubin, D. B. (2007). The design versus the analysis of observational studies for causal effects: Parallels with the design of randomized trials. Statistics in Medicine 26(1), 20–36.
Saarela, O., L. R. Belzile, and D. A. Stephens (2016). A Bayesian view of doubly robust causal inference. Biometrika 103(3), 667–681.
Santos, A., N. McGuckin, H. Y. Nakamoto, D. Gray, and S. Liss (2011). Summary of travel trends: 2009 National Household Travel Survey. Technical report.
Scharfstein, D. O., A. Rotnitzky, and J. M. Robins (1999). Adjusting for nonignorable drop-out using semiparametric nonresponse models. Journal of the American Statistical Association 94(448), 1096–1120.
Senthilkumar, S., B. K. Rai, A. A. Meshram, A. Gunasekaran, and S. Chandrakumarmangalam (2018). Big data in healthcare management: A review of literature. American Journal of Theoretical and Applied Business 4(2), 57–69.
Smith, T. (1983). On the validity of inferences from non-random samples. Journal of the Royal Statistical Society A146, 394–403.
Spertus, J. V. and S.-L. T. Normand (2018). Bayesian propensity scores for high-dimensional causal inference: A comparison of drug-eluting to bare-metal coronary stents. Biometrical Journal 60, 721–733.
Struijs, P., B. Braaksma, and P. J. Daas (2014). Official statistics and big data. Big Data & Society 1(1), 1–6.
Tan, Y. V., M. R. Elliott, and C. A. Flannagan (2017). Development of a real-time prediction model of driver behavior at intersections using kinematic time series data. Accident Analysis & Prevention 106, 428–436.
Tan, Y. V., C. A. Flannagan, and M. R. Elliott (2016). Predicting human-driving behavior to help driverless vehicles drive: Random intercept Bayesian additive regression trees. arXiv preprint arXiv:1609.07464.
Tan, Y. V., C. A. Flannagan, and M. R. Elliott (2019). "Robust-squared" imputation models using BART. Journal of Survey Statistics and Methodology 7(4), 465–497.
Tan, Z. (2006). A distributional approach for causal inference using propensity scores. Journal of the American Statistical Association 101(476), 1619–1637.
Valliant, R. (2020). Comparing alternatives for estimation from nonprobability samples. Journal of Survey Statistics and Methodology 8, 231–263.
Valliant, R. and J. A. Dever (2011). Estimating propensity adjustments for volunteer web surveys. Sociological Methods & Research 40(1), 105–137.
Wang, W., D. Rothschild, S. Goel, and A. Gelman (2015). Forecasting elections with non-representative polls. International Journal of Forecasting 31(3), 980–991.
Wendling, T., K. Jung, A. Callahan, A. Schuler, N. Shah, and B. Gallego (2018). Comparing methods for estimation of heterogeneous treatment effects using observational data from health care databases. Statistics in Medicine.
Wu, C. and R. R. Sitter (2001). A model-calibration approach to using complete auxiliary information from survey data. Journal of the American Statistical Association 96(453), 185–193.
Yang, S. and J. K. Kim (2018). Integration of survey data and big observational data for finite population inference using mass imputation. arXiv preprint arXiv:1807.02817.
Yang, S., J. K. Kim, and R. Song (2020). Doubly robust inference when combining probability and non-probability samples with high dimensional data.
Journal of the Royal Statistical Society B82 ,445–465.Zangeneh, S. Z. and R. J. Little (2015). Bayesian inference for the finite population total froma heteroscedastic probability proportional to size sample.
Journal of Survey Statistics andMethodology 3 (2), 162–192.Zhang, G. and R. Little (2011). A comparative study of doubly robust estimators of the mean withmissing data.
Journal of Statistical Computation and Simulation 81 (12), 2039–2058.Zhou, Q., C. McNeal, L. A. Copeland, J. P. Zachariah, and J. J. Song (2020). Bayesian propensity scoreanalysis for clustered observational data.
Statistical Methods & Applications 29 (2), 335–355.Zigler, C. M., K. Watts, R. W. Yeh, Y. Wang, B. A. Coull, and F. Dominici (2013). Model feedback inbayesian propensity score estimation.
Biometrics 69 (1), 263–273.
I. Appendix
I.1. Theoretical proofs
Suppose there exists an infinite sequence of finite populations $U_\nu$ of sizes $N_\nu$, $\nu = 1, 2, \ldots$. Corresponding to $U_\nu$ are a non-probability sample $S_{B,\nu}$ and a probability sample $S_{R,\nu}$, with $n_{B,\nu}$ and $n_{R,\nu}$ being the respective sample sizes. Also, let us assume that $N_\nu \to \infty$, $n_{B,\nu} \to \infty$ and $n_{R,\nu} \to \infty$ as $\nu \to \infty$, while $n_{B,\nu}/N_\nu \to f_B$ and $n_{R,\nu}/N_\nu \to f_R$ with $0 < f_R < f_B < 1$. From now on, we suppress the subscript $\nu$ for notational simplicity. In order to be able to make unbiased inference based on $S_B$, we consider the following conditions:

1. The set of observed auxiliary variables, $X$, fully governs the selection mechanism in $S_B$. This is the ignorability condition, implying $p(\delta_{Bi} = 1 \mid y_i, x_i) = p(\delta_{Bi} = 1 \mid x_i)$ for $i \in U$.
2. $S_B$ actually does have a probability sampling mechanism, albeit an unknown one. This means $p(\delta_{Bi} = 1 \mid x_i) > 0$ for all $i \in U$.
3. Units of $S_R$ and $S_B$ are selected independently from $U$ given the observed auxiliary variables $X^*$, i.e. $\delta_{Ri} \perp \delta_{Bj} \mid X^*$ for $i \neq j$.
4. The sampling fractions $f_R$ and $f_B$ are small enough that the possible overlap between $S_R$ and $S_B$ is negligible, i.e. $S_R \cap S_B = \emptyset$.
5. The true underlying models for $Y \mid X^*$, $\delta_B \mid X$ and $\delta_R \mid X$ are known.

In addition, to be able to derive the asymptotic properties of the proposed estimators, we consider the following regularity conditions, following Chen et al. (2020):

1. For any given $x$, $\partial m(x; \theta)/\partial\theta$ exists and is continuous with respect to $\theta$, $|\partial m(x; \theta)/\partial\theta| \leq h(x; \theta)$ for $\theta$ in a neighborhood of $\theta_0$, and $N^{-1}\sum_{i=1}^{N} h(x_i; \theta_0) = O(1)$.
2. For any given $x$, $\partial^2 m(x; \theta)/\partial\theta\,\partial\theta^T$ exists and is continuous with respect to $\theta$, $\max_{j,l}|\partial^2 m(x; \theta)/\partial\theta_j\,\partial\theta_l| \leq k(x; \theta)$ for $\theta$ in a neighborhood of $\theta_0$, and $N^{-1}\sum_{i=1}^{N} k(x_i; \theta_0) = O(1)$.
3. For $u_i = \{x_i, y_i, m(x_i; \theta_0)\}$, the finite population and the sampling design in $S_R$ satisfy $N^{-1}\sum_{i=1}^{n_R} u_i/\pi_{Ri} - N^{-1}\sum_{i=1}^{N} u_i = O_p(n_R^{-1/2})$.
4. There exist constants $c_1$ and $c_2$ such that $0 < c_1 \leq N\pi_{Bi}/n_B \leq c_2$ and $0 < c_1 \leq N\pi_{Ri}/n_R \leq c_2$ for all $i \in U$.
5. The finite population and the propensity scores satisfy $N^{-1}\sum_{i=1}^{N} y_i^2 = O(1)$, $N^{-1}\sum_{i=1}^{N}\|x_i\|^2 = O(1)$, and $N^{-1}\sum_{i=1}^{N}\pi_{Bi}(1-\pi_{Bi})\,x_i x_i^T$ is a positive definite matrix.

Note that while we assume $\pi_{Ri}$ is calculable for $i \in S_B$ throughout the proofs, extensions can be provided for situations where $\pi_{Ri}$ needs to be predicted for $i \in S_B$.

I.1.1. Asymptotic properties of the PAPW estimator
Since $\hat\beta$ is the MLE of $\beta$ in the logistic regression of $Z_i$ on $x^*_i$, it is clear that $\hat\beta \xrightarrow{p} \beta$. Two immediate results are that $\hat\pi_{Bi} \xrightarrow{p} \pi_{Bi}$ and $E(\hat\pi_{Bi} \mid x^*_i) = \pi_{Bi}$, where $\hat\pi_{Bi}$ is defined as in (2.9). Now we prove the consistency and asymptotic unbiasedness of the PAPW estimator in (2.15). To this end, we show that $\hat{\bar y}_{PAPW} - \bar y_U = O_p(n_B^{-1/2})$. Consider the following set of estimating equations:

$$\Phi_n(\eta) = \begin{pmatrix} n^{-1}\sum_{i=1}^{n} Z_i(y_i - \bar y_U)/\pi_{Bi} \\ n^{-1}\sum_{i=1}^{n} \{Z_i - p_i(\beta)\}\,x^*_i \end{pmatrix} = \begin{pmatrix} N^{-1}\sum_{i=1}^{N} \delta_{Bi}(y_i - \bar y_U)/\pi_{Bi} \\ N^{-1}\sum_{i=1}^{N} \delta_i\{Z_i - p_i(\beta)\}\,x^*_i \end{pmatrix} = 0 \qquad (I.1)$$

where $\eta = (\bar y_U, \beta)$. In the following, we show that $E_{\delta_B}[\Phi_n(\hat\eta) \mid x^*_i] = 0$. We start with the first component of $\Phi_n(\hat\eta)$:

$$E_{\delta_B}\left[\frac{1}{N}\sum_{i=1}^{N}\frac{\delta_{Bi}(y_i - \bar y_U)}{\pi_{Bi}}\,\Big|\,x^*_i\right] = \frac{1}{N}\sum_{i=1}^{N}\frac{E_{\delta_B}(\delta_{Bi}\mid x^*_i)(y_i - \bar y_U)}{\pi_{Bi}} = \frac{1}{N}\sum_{i=1}^{N}\frac{\pi_{Bi}(y_i - \bar y_U)}{\pi_{Bi}} = 0$$

Noting that $E_{\delta_B}[\Phi_n(\hat\eta)] = E_\delta[E_Z\{\Phi_n(\hat\eta)\mid\delta_i = 1\}]$, for the second component we have

$$E_{\delta_B}\left[\frac{1}{N}\sum_{i=1}^{N}\delta_i\{Z_i - p_i(\beta)\}\,x^*_i\,\Big|\,x^*_i\right] = E_\delta\left[\frac{1}{N}\sum_{i=1}^{N}\delta_i\{E_Z(Z_i\mid\delta_i = 1, x^*_i) - p_i(\beta)\}\,x^*_i\right] = E_\delta\left[\frac{1}{N}\sum_{i=1}^{N}\delta_i\{p_i(\beta) - p_i(\beta)\}\,x^*_i\right] = 0 \qquad (I.2)$$

Now, we apply a first-order Taylor approximation to $\Phi_n(\hat\eta)$ around $\eta$:

$$\hat\eta - \eta = [E\{\phi_n(\eta)\}]^{-1}\,\Phi_n(\eta) + O_p(n_B^{-1/2}) \qquad (I.3)$$

where $\phi_n(\eta) = \partial\Phi_n(\eta)/\partial\eta$. Its components are

$$\frac{\partial}{\partial\bar y_U}\left[\frac{1}{N}\sum_{i=1}^{N}\frac{\delta_{Bi}(y_i - \bar y_U)}{\pi_{Bi}}\right] = -\frac{1}{N}\sum_{i=1}^{N}\frac{\delta_{Bi}}{\pi_{Bi}} \qquad (I.4)$$

$$\frac{\partial}{\partial\beta}\left[\frac{1}{N}\sum_{i=1}^{N}\frac{\delta_{Bi}(y_i - \bar y_U)}{\pi_{Bi}}\right] = \frac{\partial}{\partial\beta}\left[\frac{1}{N}\sum_{i=1}^{N}\frac{\delta_{Bi}}{\pi_{Ri}}\left\{\frac{1 - p_i(\beta)}{p_i(\beta)}\right\}(y_i - \bar y_U)\right] = -\frac{1}{N}\sum_{i=1}^{N}\frac{\delta_{Bi}}{\pi_{Bi}}(y_i - \bar y_U)\,x^{*T}_i \qquad (I.5)$$

$$\frac{\partial}{\partial\beta}\left[\frac{1}{N}\sum_{i=1}^{N}\delta_i\{Z_i - p_i(\beta)\}\,x^*_i\right] = -\frac{1}{N}\sum_{i=1}^{N}\delta_i\,p_i(\beta)\{1 - p_i(\beta)\}\,x^*_i x^{*T}_i \qquad (I.6)$$

Therefore, we have

$$\phi_n(\eta) = \begin{pmatrix} -\frac{1}{N}\sum_{i=1}^{N}\frac{\delta_{Bi}}{\pi_{Bi}} & -\frac{1}{N}\sum_{i=1}^{N}\frac{\delta_{Bi}}{\pi_{Bi}}(y_i - \bar y_U)\,x^{*T}_i \\ 0 & -\frac{1}{N}\sum_{i=1}^{N}\delta_i\,p_i(\beta)\{1 - p_i(\beta)\}\,x^*_i x^{*T}_i \end{pmatrix} \qquad (I.7)$$

Thus, it follows that $\hat{\bar y}_{PAPW} = \bar y_U + O_p(n_B^{-1/2})$. Now we turn to deriving the asymptotic variance estimator for $\hat{\bar y}_{PAPW}$. According to the sandwich formula, we have
$$Var(\hat\eta) = [E\{\phi_n(\eta)\}]^{-1}\,Var\{\Phi_n(\eta)\}\,[E\{\phi_n(\eta)\}^T]^{-1} + O_p(n_B^{-1}) \qquad (I.8)$$

Given the fact that

$$E(\delta_i \mid x^*_i) = \frac{p(\delta_{Bi} = 1 \mid x^*_i)}{p(Z_i = 1 \mid x^*_i)} = \frac{\pi_{Ri}}{1 - p_i(\beta)} \qquad (I.9)$$

it can be shown that

$$E\{\phi_n(\eta)\} = \begin{pmatrix} -1 & -\frac{1}{N}\sum_{i=1}^{N}(y_i - \bar y_U)\,x^{*T}_i \\ 0 & -\frac{1}{N}\sum_{i=1}^{N}\pi_{Bi}\{1 - p_i(\beta)\}\,x^*_i x^{*T}_i \end{pmatrix} \qquad (I.10)$$

and

$$[E\{\phi_n(\eta)\}]^{-1} = \begin{pmatrix} -1 & b^T \\ 0 & -\left[\frac{1}{N}\sum_{i=1}^{N}\pi_{Bi}\{1 - p_i(\beta)\}\,x^*_i x^{*T}_i\right]^{-1} \end{pmatrix} \qquad (I.11)$$

where

$$b^T = \left\{\frac{1}{N}\sum_{i=1}^{N}(y_i - \bar y_U)\,x^{*T}_i\right\}\left\{\frac{1}{N}\sum_{i=1}^{N}\pi_{Bi}\{1 - p_i(\beta)\}\,x^*_i x^{*T}_i\right\}^{-1} \qquad (I.12)$$

Now, the goal is to calculate $Var\{\Phi_n(\eta)\}$. We know that

$$Var_{\delta_B}\left(\frac{1}{N}\sum_{i=1}^{N}\frac{\delta_{Bi}(y_i - \bar y_U)}{\pi_{Bi}}\,\Big|\,x^*_i\right) = \frac{1}{N^2}\sum_{i=1}^{N}\frac{(y_i - \bar y_U)^2}{(\pi_{Bi})^2}\,\pi_{Bi}(1 - \pi_{Bi}) = \frac{1}{N^2}\sum_{i=1}^{N}\left\{\frac{1 - \pi_{Bi}}{\pi_{Bi}}\right\}(y_i - \bar y_U)^2 \qquad (I.13)$$

$$Var_{\delta_B}\left(\frac{1}{N}\sum_{i=1}^{N}\delta_i\{Z_i - p_i(\beta)\}\,x^*_i\,\Big|\,x^*_i\right) = E_\delta\left[Var_Z(\,\cdot\mid\delta_i = 1, x^*_i)\right] + Var_\delta\left[E_Z(\,\cdot\mid\delta_i = 1, x^*_i)\right] = \frac{1}{N^2}E_\delta\left(\sum_{i=1}^{N}\delta_i\,Var_Z(Z_i)\,x^*_i x^{*T}_i\,\Big|\,x^*_i\right) + 0 = \frac{1}{N^2}\sum_{i=1}^{N}\pi_{Ri}\,p_i(\beta)\,x^*_i x^{*T}_i \qquad (I.14)$$

$$Cov\left(\frac{1}{N}\sum_{i=1}^{N}\frac{\delta_{Bi}(y_i - \bar y_U)}{\pi_{Bi}},\ \frac{1}{N}\sum_{i=1}^{N}\delta_i\{Z_i - p_i(\beta)\}\,x^*_i\,\Big|\,x^*_i\right) = \frac{1}{N^2}\sum_{i=1}^{N}\{1 - p_i(\beta)\}(y_i - \bar y_U)\,x^*_i \qquad (I.15)$$

Therefore, we have

$$Var\{\Phi_n(\eta)\} = \begin{pmatrix} \frac{1}{N^2}\sum_{i=1}^{N}\left\{\frac{1 - \pi_{Bi}}{\pi_{Bi}}\right\}(y_i - \bar y_U)^2 & \frac{1}{N^2}\sum_{i=1}^{N}\{1 - p_i(\beta)\}(y_i - \bar y_U)\,x^{*T}_i \\ \frac{1}{N^2}\sum_{i=1}^{N}\{1 - p_i(\beta)\}(y_i - \bar y_U)\,x^*_i & \frac{1}{N^2}\sum_{i=1}^{N}\pi_{Bi}\{1 - p_i(\beta)\}\,x^*_i x^{*T}_i \end{pmatrix} \qquad (I.16)$$

The final asymptotic variance estimator of $\hat{\bar y}_{PAPW}$ is given by
$$Var\{\hat{\bar y}_{PAPW}\} = \frac{1}{N^2}\sum_{i=1}^{N}\left\{\frac{1 - \pi_{Bi}}{\pi_{Bi}}\right\}(y_i - \bar y_U)^2 - \frac{2\,b^T}{N^2}\sum_{i=1}^{N}\{1 - p_i(\beta)\}(y_i - \bar y_U)\,x^*_i + b^T\left[\frac{1}{N^2}\sum_{i=1}^{N}\pi_{Bi}\{1 - p_i(\beta)\}\,x^*_i x^{*T}_i\right]b \qquad (I.17)$$

To obtain the variance estimate based on the observed samples $S_B$ and $S_R$, we substitute the population components with their sample estimates:

$$\widehat{Var}\{\hat{\bar y}_{PAPW}\} = \frac{1}{N^2}\sum_{i=1}^{n_B}\{1 - \hat\pi_{Bi}\}\left(\frac{y_i - \bar y_U}{\hat\pi_{Bi}}\right)^2 - \frac{2\,\hat b^T}{N^2}\sum_{i=1}^{n_B}\{1 - p_i(\hat\beta)\}\left(\frac{y_i - \bar y_U}{\hat\pi_{Bi}}\right)x^*_i + \hat b^T\left[\frac{1}{N^2}\sum_{i=1}^{n}p_i(\hat\beta)\{1 - p_i(\hat\beta)\}\,x^*_i x^{*T}_i\right]\hat b \qquad (I.18)$$

where

$$\hat b^T = \left\{\frac{1}{N}\sum_{i=1}^{n_B}\left(\frac{y_i - \bar y_U}{\hat\pi_{Bi}}\right)x^{*T}_i\right\}\left\{\frac{1}{N}\sum_{i=1}^{n}p_i(\hat\beta)\{1 - p_i(\hat\beta)\}\,x^*_i x^{*T}_i\right\}^{-1} \qquad (I.19)$$
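To make the PAPW construction concrete, below is a minimal R sketch of the pseudo-weighting step on simulated data. All data-generating values and variable names are hypothetical, and the sketch assumes the identity $\pi_{Bi} = \pi_{Ri}\,p_i(\beta)/\{1 - p_i(\beta)\}$ implied by (I.9); it illustrates the idea and is not the code used for the analyses in this paper.

```r
## Hypothetical population and samples (illustrative values only)
set.seed(1)
N    <- 5000
x    <- rnorm(N)                       # auxiliary variable
pi_B <- plogis(-1 + 0.8 * x)           # true (unknown) propensity for S_B
pi_R <- rep(0.05, N)                   # known design probabilities for S_R
y    <- 2 + x + rnorm(N)               # outcome

dB <- rbinom(N, 1, pi_B) == 1          # non-probability sample S_B
dR <- rbinom(N, 1, pi_R) == 1          # reference probability sample S_R

## Pool the two samples; Z = 1 flags membership in S_B
pooled <- data.frame(
  z    = c(rep(1, sum(dB)), rep(0, sum(dR))),
  x    = c(x[dB], x[dR]),
  pi_R = c(pi_R[dB], pi_R[dR])
)

fit <- glm(z ~ x, family = binomial, data = pooled)
p   <- fitted(fit)                     # p_i(beta-hat)

## PAPW pseudo-inclusion probabilities via (I.9)
pihat_B <- pooled$pi_R * p / (1 - p)

## Hajek-type PAPW estimate of the population mean from S_B
w <- 1 / pihat_B[pooled$z == 1]
sum(w * y[dB]) / sum(w)
```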
I.1.2. Proof of double robustness

As discussed in Section 2.4, a doubly robust estimator remains consistent as long as at least one of the two underlying models is correctly specified. To prove the double robustness property of the AIPW estimator proposed here, let us initially assume that $\hat\theta \xrightarrow{p} \theta$ if the prediction model (PM) is correctly specified, and that $\hat\phi \xrightarrow{p} \phi$ and $\hat\beta \xrightarrow{p} \beta$ if the pseudo-weighting (PW) model is correctly specified. Given the true probabilities of selection in $S_B$, we know that the HT estimator is design-unbiased for the population mean, i.e.

$$E\left(\frac{1}{N}\sum_{i=1}^{n_B}\frac{y_i}{\pi_{Bi}}\right) = E\left(\frac{1}{N}\sum_{i=1}^{N}\frac{\delta_{Bi}\,y_i}{\pi_{Bi}}\right) = \frac{1}{N}\sum_{i=1}^{N}\frac{E(\delta_{Bi})\,y_i}{\pi_{Bi}} = \frac{1}{N}\sum_{i=1}^{N}\frac{\pi_{Bi}\,y_i}{\pi_{Bi}} = \bar y_U \qquad (I.20)$$

and the same result is obtained for $S_R$. Therefore,

$$E\left(\frac{1}{N}\sum_{i=1}^{n_B}\frac{y_i}{\pi_{Bi}}\right) = E\left(\frac{1}{N}\sum_{i=1}^{n_R}\frac{y_i}{\pi_{Ri}}\right) = \bar y_U \qquad (I.21)$$

Now we have

$$\hat{\bar y}_{DR} \xrightarrow{p} E(\hat{\bar y}_{DR}) = E\left\{\frac{1}{N}\sum_{i=1}^{n_B}\frac{y_i - \hat y_i}{\hat\pi_{Bi}} + \frac{1}{N}\sum_{i=1}^{n_R}\frac{\hat y_i}{\pi_{Ri}}\right\} = E\left\{\frac{1}{N}\sum_{i=1}^{n_B}\frac{y_i - \hat y_i}{\hat\pi_{Bi}} + \frac{1}{N}\sum_{i=1}^{n_B}\frac{\hat y_i}{\pi_{Bi}}\right\} = E\left\{\frac{1}{N}\sum_{i=1}^{n_B}\left[\frac{y_i}{\pi_{Bi}} + (y_i - \hat y_i)\left(\frac{1}{\hat\pi_{Bi}} - \frac{1}{\pi_{Bi}}\right)\right]\right\} \qquad (I.22)$$

and hence

$$\hat{\bar y}_{DR} \xrightarrow{p} E(\hat{\bar y}_{DR}) = \bar y_U + E\left\{\frac{1}{N}\sum_{i=1}^{n_B}(y_i - \hat y_i)\left(\frac{1}{\hat\pi_{Bi}} - \frac{1}{\pi_{Bi}}\right)\right\} = \bar y_U + E\left\{\frac{1}{N}\sum_{i=1}^{n_B}\frac{y_i - \hat y_i}{\pi_{Bi}}\left(\frac{\pi_{Bi}}{\hat\pi_{Bi}} - 1\right)\right\} \qquad (I.23)$$

Under the ignorability assumption in $S_B$, we have $Y \perp \delta_B \mid X, \pi_R$. Hence,

$$\hat{\bar y}_{DR} \xrightarrow{p} \bar y_U + E\left\{E\left[\frac{1}{N}\sum_{i=1}^{n_B}\frac{y_i - \hat y_i}{\pi_{Bi}}\left(\frac{\pi_{Bi}}{\hat\pi_{Bi}} - 1\right)\,\Big|\,x_i, \pi_{Ri}\right]\right\} = \bar y_U + E\left\{\frac{1}{N}\sum_{i=1}^{n_B}\frac{1}{\pi_{Bi}}\,E(y_i - \hat y_i \mid x_i, \pi_{Ri})\,E\left(\frac{\pi_{Bi}}{\hat\pi_{Bi}} - 1\,\Big|\,x_i, \pi_{Ri}\right)\right\} \qquad (I.24)$$

If we assume the pseudo-weighting model is correctly specified, then we expect $\hat\pi_{Bi} \xrightarrow{p} \pi_{Bi}$ and

$$E\left(\frac{\pi_{Bi}}{\hat\pi_{Bi}} - 1\,\Big|\,x_i, \pi_{Ri}\right) \xrightarrow{p} \frac{\pi_{Bi}}{\pi_{Bi}} - 1 = 0 \qquad (I.25)$$

so $\hat{\bar y}_{DR} \xrightarrow{p} \bar y_U$ regardless of whether the PM is correctly specified. In situations where the mean model is correctly specified, we expect $\hat y_i \xrightarrow{p} y_i$. Hence

$$E(y_i - \hat y_i \mid x_i, \pi_{Ri}) \xrightarrow{p} E(y_i - y_i \mid x_i, \pi_{Ri}) = 0 \qquad (I.26)$$

which means that $\hat{\bar y}_{DR} \xrightarrow{p} \bar y_U$ even if the PW model is incorrectly specified.
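As an illustration of the estimator analyzed above, the following minimal R sketch computes the AIPW point estimate $\hat{\bar y}_{DR}$, reusing the simulated objects from the previous sketch. The linear prediction model is only a simple stand-in for the BART-based PM used in the paper.

```r
## Prediction model fitted on S_B (a stand-in for BART)
pm    <- lm(y ~ x, data = data.frame(y = y[dB], x = x[dB]))
yhatB <- predict(pm, newdata = data.frame(x = x[dB]))
yhatR <- predict(pm, newdata = data.frame(x = x[dR]))

wB <- 1 / pihat_B[pooled$z == 1]   # estimated pseudo-weights for S_B
wR <- 1 / pi_R[dR]                 # design weights for S_R
NB <- sum(wB); NR <- sum(wR)       # estimated population sizes

## AIPW estimate: weighted residuals from S_B plus weighted predictions on S_R
ybar_DR <- sum(wB * (y[dB] - yhatB)) / NB + sum(wR * yhatR) / NR
ybar_DR
```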
I.1.3. Variance estimation under the Bayesian approach

As discussed in Section 2.6, in this study we use Rubin's combining rules to estimate the variance of the AIPW estimator under the two-step Bayesian approach. The idea stems from the conditional variance formula, which involves two parts: (1) the within-imputation variance and (2) the between-imputation variance. The latter is straightforward and is obtained by taking the variance of $\hat{\bar y}^{(m)}_{DR}$ across the $M$ MCMC draws. The within-imputation variance requires more attention, as one needs to account for the intraclass correlations due to clustering and to use linearization techniques when dealing with a ratio estimator. It is clear that this component is calculated conditional on the observed $\hat y^{(m)}_i$ for $i \in S$ and $\hat\pi^{B(m)}_i$ for $i \in S_B$:

$$var\left(\hat{\bar y}^{(m)}_{DR}\,\Big|\,\hat\pi^{B(m)}_i, \hat y^{(m)}_i\right) = var\left(\frac{1}{\hat N_B}\sum_{i=1}^{n_B}\frac{y_i - \hat y^{(m)}_i}{\hat\pi_{Bi}} + \frac{1}{\hat N_R}\sum_{i=1}^{n_R}\frac{\hat y^{(m)}_i}{\pi_{Ri}}\,\Big|\,\hat\pi^{B(m)}_i, \hat y^{(m)}_i\right) = \frac{1}{\hat N_B^2}\sum_{i=1}^{n_B}\frac{var(y_i)}{(\hat\pi_{Bi})^2} + var\left(\frac{1}{\hat N_R}\sum_{i=1}^{n_R}\frac{\hat y^{(m)}_i}{\pi_{Ri}}\,\Big|\,\hat y^{(m)}_i\right) \qquad (I.27)$$

For the first component, which equals $var(y)\sum_{i=1}^{n_B}(\hat\pi_{Bi})^{-2}/\hat N_B^2$, it suffices to estimate the variance of $y$. The second component, however, is the variance of a ratio estimator, which requires linearization. Defining $\hat t_R = \sum_{i=1}^{n_R}\hat y^{(m)}_i/\pi_{Ri}$, a Taylor-series approximation of the variance is given by

$$var\left(\frac{\hat t_R}{\hat N_R}\,\Big|\,\hat y^{(m)}_i\right) \approx \frac{1}{\hat N_R^2}\left\{var\left(\hat t_R\,\big|\,\hat y^{(m)}_i\right) + \left(\frac{\hat t_R}{\hat N_R}\right)^2 var(\hat N_R) - 2\left(\frac{\hat t_R}{\hat N_R}\right)cov\left(\hat t_R, \hat N_R\,\big|\,\hat y^{(m)}_i\right)\right\} \qquad (I.28)$$

Since $\hat t_R$ depends on $\hat y^{(m)}_i$, we have

$$var\left(\hat t_R\,\big|\,\hat y^{(m)}_i\right) = \sum_{i=1}^{n_R}\left(\hat y^{(m)}_i\right)^2 var\left(\frac{1}{\pi_{Ri}}\right) \qquad (I.29)$$

$$cov\left(\hat t_R, \hat N_R\,\big|\,\hat y^{(m)}_i\right) = \sum_{i=1}^{n_R}\hat y^{(m)}_i\,var\left(\frac{1}{\pi_{Ri}}\right) \qquad (I.30)$$

Therefore, the variance of the ratio estimator is approximated by

$$var\left(\frac{1}{\hat N_R}\sum_{i=1}^{n_R}\frac{\hat y^{(m)}_i}{\pi_{Ri}}\,\Big|\,\hat y^{(m)}_i\right) \approx \frac{1}{\hat N_R^2}\,var\left(\frac{1}{\pi_{Ri}}\right)\left\{\sum_{i=1}^{n_R}\left(\hat y^{(m)}_i\right)^2 + n_R\left(\frac{\hat t_R}{\hat N_R}\right)^2 - 2\left(\frac{\hat t_R}{\hat N_R}\right)\sum_{i=1}^{n_R}\hat y^{(m)}_i\right\} \qquad (I.31)$$

and the final within-imputation variance is given by

$$var\left(\hat{\bar y}^{(m)}_{DR}\,\Big|\,\hat\pi^{B(m)}_i, \hat y^{(m)}_i\right) \approx \frac{1}{\hat N_B^2}\sum_{i=1}^{n_B}\frac{var(y_i)}{(\hat\pi_{Bi})^2} + \frac{1}{\hat N_R^2}\,var\left(\frac{1}{\pi_{Ri}}\right)\left\{\sum_{i=1}^{n_R}\left(\hat y^{(m)}_i\right)^2 + n_R\left(\frac{\hat t_R}{\hat N_R}\right)^2 - 2\left(\frac{\hat t_R}{\hat N_R}\right)\sum_{i=1}^{n_R}\hat y^{(m)}_i\right\} \qquad (I.32)$$

Note that in situations where either $S_R$ or $S_B$
is a clustered sample, the derivation of the within-imputation variance remains the same, but $y_i$, $\pi_{Ri}$, $\hat\pi^{B(m)}_i$, and $\hat y^{(m)}_i$ then represent the totals for cluster $i$, and $n_R$ and $n_B$ are the numbers of clusters in $S_R$ and $S_B$, respectively.
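For concreteness, here is a minimal R sketch of Rubin's combining rule as applied in this step. The inputs are hypothetical placeholders: `est` would hold the $M$ draws $\hat{\bar y}^{(m)}_{DR}$ and `wvar` the corresponding within-imputation variances from (I.32).

```r
## Rubin's combining rule across M MCMC draws
combine_rubin <- function(est, wvar) {
  M    <- length(est)
  qbar <- mean(est)                # pooled point estimate
  wbar <- mean(wvar)               # within-imputation variance
  b    <- var(est)                 # between-imputation variance
  tvar <- wbar + (1 + 1 / M) * b   # total variance
  df   <- (M - 1) * (1 + wbar / ((1 + 1 / M) * b))^2
  c(estimate = qbar, se = sqrt(tvar), df = df)
}
```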
I.2. Bayesian Additive Regression Trees

BART is a flexible ensemble-of-trees method that can handle non-linear relationships as well as multi-way interaction effects. The idea behind BART is a sum of trees, where each tree is sequentially modified on the basis of the residuals from the other trees. In a tree-based method, the variation in the response variable is explained by hierarchically splitting the sample into more homogeneous subgroups (Green and Kern, 2012). As illustrated in Figure 12, a binary-structured tree consists of a root node, a set of interior nodes, and a set of terminal nodes associated with parameters, along with the decision rules that link these nodes (Abu-Nimeh et al., 2008).
Fig 12: Example of a binary-structured tree model
I.2.1. BART for continuous outcomes
Suppose $y = f(x) + \epsilon$, where $y \in \mathbb{R}$ is a continuous outcome, $x$ denotes an $n \times p$ matrix of covariates, and $\epsilon \sim N(0, \sigma^2)$ is the error term. BART then approximates the outcome as

$$y \approx \sum_{j=1}^{m} f(x; T_j, M_j) \qquad (I.33)$$

where $T_j$ is the $j$-th tree with $b_j$ terminal nodes and associated parameters $M_j = (\mu_{1j}, \mu_{2j}, \ldots, \mu_{b_j j})^T$. BART is a Bayesian approach, since it assigns prior distributions to $T$, $M$, and $\sigma^2$ (Chipman et al., 2010; Tan et al., 2016). Assuming an independence structure between trees, we can define the prior as follows:

$$p[(T_1, M_1), \ldots, (T_m, M_m), \sigma^{-2}] = \left[\prod_{j=1}^{m} p(T_j, M_j)\right]p(\sigma^{-2}) \qquad (I.34)$$

Using the multiplication law of probability, the joint distribution $p(T_j, M_j)$ can be written as

$$p(T_j, M_j) = p(M_j \mid T_j)\,p(T_j) = \left[\prod_{i=1}^{b_j} p(\mu_{ij} \mid T_j)\right]p(T_j) \qquad (I.35)$$

where $i = 1, \ldots, b_j$ indexes the terminal-node parameters of tree $j$. Therefore, the joint distribution in I.34 can be factored as

$$p[(T_1, M_1), \ldots, (T_m, M_m), \sigma^{-2}] = \prod_{j=1}^{m}\left\{\prod_{i=1}^{b_j} p(\mu_{ij} \mid T_j)\right\}p(T_j)\,p(\sigma^{-2}) \qquad (I.36)$$

As suggested by Chipman et al. (2007), the following distributions can be used for $\mu_{ij} \mid T_j$ and $\sigma^{-2}$:

$$\mu_{ij} \mid T_j \sim N(\mu_\mu, \sigma^2_\mu) \qquad (I.37)$$

$$\sigma^{-2} \sim G\!\left(\frac{\nu}{2}, \frac{\nu\lambda}{2}\right) \qquad (I.38)$$

The prior on $T_j$ involves three components of the tree structure: the depth of the tree, the decision rules, and the choice of covariate at a given node. Prior specification for $T_j$ depends on several factors, and detailed discussions can be found in Chipman et al. (2010). Given the data, these parameters are updated through a combination of "Bayesian backfitting" and MCMC Gibbs sampling. The trained trees are then summed to approximate the outcome variable. Finally, $m$ is typically assumed to be fixed, but can be assessed by cross-validation.
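As a minimal illustration (not the paper's actual fitting code), the sum-of-trees model in (I.33) can be fit with the 'BART' package, reusing the simulated `x` and `y` from the earlier sketches:

```r
# install.packages("BART")
library(BART)
xtrain   <- matrix(x, ncol = 1)
fit_bart <- wbart(x.train = xtrain, y.train = y,
                  ntree  = 200,    # m, the number of trees
                  ndpost = 1000)   # retained posterior draws
str(fit_bart$yhat.train)           # posterior draws of f(x_i), ndpost x n
```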
I.2.2. BART for binary outcomes

For a binary outcome, a probit link function is usually employed, in the sense that $y$ is an indicator variable dichotomizing a normally distributed latent continuous outcome $y^*$ at a real value $c$, so that

$$y = \begin{cases} 1 & y^* > c \\ 0 & y^* \leq c \end{cases}, \qquad y^* \sim N(0, 1) \qquad (I.39)$$

Therefore, the model becomes

$$G(x) = \Phi^{-1}[p(y = 1 \mid x)] = \sum_{j=1}^{m} f(x; T_j, M_j) \qquad (I.40)$$

where $\Phi^{-1}[\cdot]$ is the inverse of the standard normal CDF. Since we implicitly assumed $\sigma^2 \equiv 1$, the only priors we need to specify are $p(\mu_{ij} \mid T_j)$ and $p(T_j)$. In order to draw from the posterior distribution of $T_j$ and $\mu_{ij}$, we need to generate the latent continuous variable $y^*$ given $y_k$. Chipman et al. (2010) recommend a data augmentation method based on the following scheme:

$$y^*_k = \begin{cases} \max\{N(G(x_k), 1), 0\} & \text{if } y_k = 1 \\ \min\{N(G(x_k), 1), 0\} & \text{if } y_k = 0 \end{cases} \qquad (I.41)$$

Since the structure of the priors is very similar to BART for continuous outcomes (Tan et al., 2016), we update the estimates $G(x_k)$ after drawing samples of the $T_j$'s and $\mu_{ij}$'s.
BART ’ packages in R . Robust Bayesian Inference for Big Data I.3. Further extensions of the simulation study
I.3. Further extensions of the simulation study

I.3.1. Simulation study I
This subsection provides additional results associated with Simulation I. Table 5 and Table 6 summarize the findings of the simulation in Section 3.1 under the frequentist approach for $n_B = 100$ and $n_B = 10{,}000$, respectively.

Table 5
Comparing the performance of the bias adjustment methods and the associated asymptotic variance estimator under the frequentist approach in the first simulation study for n_R = 100 and n_B = 100

Method | rBias rMSE crCI rSE (ρ = 0.) | rBias rMSE crCI rSE (ρ = 0.) | rBias rMSE crCI rSE (ρ = 0.)
Probability sample (S_R)
Unweighted | 8.528 19.248 92.6 1.009 | 8.647 11.065 77.4 1.018 | 8.682 9.719 50.9 1.02
Fully weighted | -0.029 20.276 94.7 1.001 | 0.006 8.035 95.1 1.010 | 0.015 5.008 94.9 1.008
Non-probability sample (S_B)
Unweighted | 31.895 36.418 57.0 1.014 | 32.213 33.2 1.740 1.008 | 32.310 32.853 0.0 0.995
Fully weighted | 0.171 21.078 94.8 0.996 | 0.247 8.265 94.9 0.999 | 0.268 4.994 94.2 0.995
Non-robust adjustment
Model specification: True
PAPW | -1.192 23.466 95.2 1.018 | -1.205 9.452 95.3 1.015 | -1.211 5.982 95.8 1.007
IPSW | -2.917 26.505 97.3 1.386 | -3.036 12.700 97.0 1.355 | -3.075 9.470 97.0 1.308
PM | 0.372 20.989 94.6 0.994 | 0.148 8.351 94.9 0.995 | 0.077 5.160 95.0 0.992
Model specification: False
PAPW | 27.140 33.436 75.6 1.059 | 27.393 28.814 16.6 1.043 | 27.470 28.276 2.5 1.025
IPSW | 28.372 33.972 67.9 1.012 | 28.711 29.951 8.3 1.002 | 28.815 29.515 0.5 0.99
PM | 28.199 33.790 68.4 1.011 | 28.541 29.771 8.3 1.001 | 28.645 29.337 0.3 0.988
Doubly robust adjustment
Model specification: QR–True, PM–True
AIPW–PAPW | -0.084 22.973 96.4 1.047 | -0.014 8.996 96.2 1.038 | 0.007 5.368 95.5 1.017
AIPW–IPSW | -0.184 22.449 96.3 1.046 | -0.049 8.826 96.1 1.038 | -0.009 5.314 95.9 1.016
Model specification: QR–True, PM–False
AIPW–PAPW | -0.436 23.709 96.4 1.038 | -0.286 9.866 96.6 1.062 | -0.241 6.520 97.2 1.101
AIPW–IPSW | -0.427 23.083 96.4 1.039 | -0.227 9.570 96.6 1.070 | -0.166 6.298 97.5 1.119
Model specification: QR–False, PM–True
AIPW–PAPW | -0.045 29.068 97.3 1.107 | 0.011 11.113 96.9 1.097 | 0.026 6.073 96.2 1.068
AIPW–IPSW | -0.194 28.208 97.5 1.104 | -0.044 10.825 97.1 1.094 | 0.001 5.974 96.5 1.062
Model specification: QR–False, PM–False
AIPW–PAPW | 28.301 34.194 71.3 1.037 | 28.570 29.868 10.9 1.028 | 28.652 29.379 0.7 1.016
AIPW–IPSW | 28.178 33.806 70.4 1.035 | 28.525 29.764 9.4 1.025 | 28.631 29.326 0.5 1.013

PAPW: propensity adjusted probability weighting; IPSW: inverse propensity score weighting; QR: quasi-randomization; PM: prediction model; AIPW: augmented inverse propensity weighting. "Fully weighted" denotes the weighted means when the true sampling weights are known.
Table 6
Comparing the performance of the bias adjustment methods and the associated asymptotic variance estimator under the frequentist approach in the first simulation study for n_R = 100 and n_B = 10,000

Method | rBias rMSE crCI rSE (ρ = 0.) | rBias rMSE crCI rSE (ρ = 0.) | rBias rMSE crCI rSE (ρ = 0.)
Probability sample (S_R)
Unweighted | 8.528 19.248 92.6 1.009 | 8.647 11.065 77.4 1.018 | 8.682 9.719 50.9 1.02
Fully weighted | -0.029 20.276 94.7 1.001 | 0.006 8.035 95.1 1.010 | 0.015 5.008 94.9 1.008
Non-probability sample (S_B)
Unweighted | 30.014 30.066 0.0 1.008 | 30.197 30.207 0.0 1.019 | 30.252 30.257 0.0 1.033
Fully weighted | 0.032 2.083 95.3 1.005 | 0.018 0.816 95.1 1.007 | 0.012 0.490 95.1 1.007
Non-robust adjustment
Model specification: True
PAPW | -2.067 4.582 94.9 1.108 | -2.145 4.120 92.8 1.107 | -2.170 4.072 92.2 1.107
IPSW | -2.618 7.717 94.5 0.958 | -2.673 7.334 91.1 0.923 | -2.692 7.308 90.6 0.979
PM | 0.296 4.515 95.2 0.994 | 0.121 4.134 94.8 0.986 | 0.065 4.095 94.6 0.985
Model specification: False
PAPW | 24.493 24.616 0.0 1.126 | 24.592 24.651 0.0 1.153 | 24.621 24.673 0.0 1.161
IPSW | 26.675 26.804 0.0 0.992 | 26.871 26.949 0.0 0.970 | 26.930 27.002 0.0 0.964
PM | 26.509 26.645 0.0 1.003 | 26.717 26.800 0.0 0.989 | 26.779 26.856 0.0 0.986
Doubly robust adjustment
Model specification: QR–True, PM–True
AIPW–PAPW | 0.180 4.633 95.1 0.994 | 0.080 4.162 94.8 0.986 | 0.047 4.104 94.7 0.985
AIPW–IPSW | 0.052 4.582 95.2 0.995 | 0.035 4.152 94.6 0.987 | 0.028 4.101 94.5 0.985
Model specification: QR–True, PM–False
AIPW–PAPW | 0.262 4.719 95.1 1.000 | 0.163 4.250 94.9 0.997 | 0.130 4.191 94.7 0.996
AIPW–IPSW | 0.188 4.652 95.4 1.002 | 0.171 4.225 95.0 0.998 | 0.164 4.174 94.8 0.998
Model specification: QR–False, PM–True
AIPW–PAPW | 1.376 8.569 94.5 0.953 | 0.503 4.829 95.1 0.995 | 0.231 4.215 95.2 0.992
AIPW–IPSW | 0.864 7.648 94.7 0.948 | 0.322 4.643 95.3 0.990 | 0.152 4.182 95.0 0.989
Model specification: QR–False, PM–False
AIPW–PAPW | 26.696 26.835 0.0 0.998 | 26.779 26.862 0.0 0.987 | 26.803 26.880 0.0 0.985
AIPW–IPSW | 26.520 26.655 0.0 1.001 | 26.718 26.801 0.0 0.989 | 26.777 26.854 0.0 0.986

PAPW: propensity adjusted probability weighting; IPSW: inverse propensity score weighting; QR: quasi-randomization; PM: prediction model; AIPW: augmented inverse propensity weighting. "Fully weighted" denotes the weighted means when the true sampling weights are known.
Table 7
Comparing the performance of the bias adjustment methods and the associated variance estimator under the two-step Bayesian approach in the first simulation study for n_R = 100 and n_B = 100

Method | rBias rMSE crCI rSE (ρ = 0.) | rBias rMSE crCI rSE (ρ = 0.) | rBias rMSE crCI rSE (ρ = 0.)
Probability sample (S_R)
Unweighted | 8.528 19.248 92.6 1.009 | 8.647 11.065 77.4 1.018 | 8.682 9.719 50.9 1.020
Fully weighted | -0.029 20.276 94.7 1.001 | 0.006 8.035 95.1 1.010 | 0.015 5.008 95.0 1.008
Non-probability sample (S_B)
Unweighted | 32.238 36.815 56.3 1.003 | 32.303 33.3 1.620 1.003 | 32.322 32.865 0.0 0.996
Fully weighted | 0.494 21.398 94.3 0.981 | 0.329 8.400 94.0 0.981 | 0.276 5.057 93.6 0.979
Non-robust adjustment
Model specification: True
PAPW | -0.589 24.195 97.4 1.117 | -0.755 9.795 99.0 1.326 | -0.801 6.178 99.8 1.653
IPSW | 1.169 22.844 97.2 1.118 | 1.016 9.163 98.6 1.345 | 0.976 5.719 99.8 1.701
PM | 0.709 21.489 95.280 1.029 | 0.272 8.545 95.580 1.020 | 0.140 5.245 94.640 1.000
Model specification: False
PAPW | 28.008 34.396 76.3 1.091 | 28.027 29.477 19.6 1.116 | 28.022 28.840 3.4 1.141
IPSW | 29.763 35.215 70.2 1.083 | 29.827 31.032 10.0 1.106 | 29.841 30.519 0.8 1.125
PM | 28.588 34.226 70.9 1.055 | 28.658 29.895 10.6 1.050 | 28.691 29.380 0.7 1.042
Doubly robust adjustment
Model specification: QR–True, PM–True
AIPW–PAPW | 0.320 23.802 97.8 1.154 | 0.125 9.306 99.1 1.357 | 0.067 5.493 99.9 1.731
AIPW–IPSW | 0.249 22.778 97.4 1.142 | 0.099 8.976 99.1 1.339 | 0.056 5.387 99.9 1.688
Model specification: QR–True, PM–False
AIPW–PAPW | 0.304 23.858 97.7 1.156 | 0.126 9.386 99.2 1.389 | 0.065 5.661 99.9 1.781
AIPW–IPSW | 0.226 22.814 97.5 1.146 | 0.096 9.041 99.1 1.376 | 0.052 5.543 99.8 1.747
Model specification: QR–False, PM–True
AIPW–PAPW | 0.881 22.077 96.8 1.126 | 0.333 8.742 98.6 1.281 | 0.153 5.303 99.8 1.558
AIPW–IPSW | 0.762 21.483 96.6 1.103 | 0.290 8.554 98.4 1.251 | 0.135 5.246 99.7 1.509
Model specification: QR–False, PM–False
AIPW–PAPW | 28.659 34.756 77.6 1.135 | 28.660 30.013 17.4 1.142 | 28.649 29.399 2.1 1.151
AIPW–IPSW | 28.575 34.237 74.7 1.115 | 28.656 29.903 13.7 1.124 | 28.674 29.368 1.1 1.132

PAPW: propensity adjusted probability weighting; IPSW: inverse propensity score weighting; QR: quasi-randomization; PM: prediction model; AIPW: augmented inverse propensity weighting. "Fully weighted" denotes the weighted means when the true sampling weights are known.
Table 8
Comparing the performance of the bias adjustment methods and the associated variance estimator under the two-step Bayesian approach in the first simulation study for n_R = 100 and n_B = 10,000

Method | rBias rMSE crCI rSE (ρ = 0.) | rBias rMSE crCI rSE (ρ = 0.) | rBias rMSE crCI rSE (ρ = 0.)
Probability sample (S_R)
Unweighted | 8.528 19.248 92.6 1.009 | 8.647 11.065 77.4 1.018 | 8.682 9.719 50.9 1.020
Fully weighted | -0.029 20.276 94.7 1.001 | 0.006 8.035 95.1 1.010 | 0.015 5.008 94.9 1.008
Non-probability sample (S_B)
Unweighted | 30.014 30.066 0.0 1.008 | 30.197 30.207 0.0 1.019 | 30.252 30.257 0.0 1.033
Fully weighted | 0.032 2.083 95.3 1.005 | 0.018 0.816 95.1 1.007 | 0.012 0.490 95.1 1.007
Non-robust adjustment
Model specification: True
PAPW | -2.032 4.578 93.0 1.031 | -2.106 4.111 90.9 1.032 | -2.138 4.062 90.2 1.035
IPSW | -0.015 4.094 95.2 1.011 | -0.036 3.605 95.1 1.004 | -0.042 3.547 95.2 1.002
PM | 0.297 4.517 81.6 0.679 | 0.120 4.136 75.3 0.579 | 0.065 4.094 73.1 0.563
Model specification: False
PAPW | 24.524 24.647 0.0 1.042 | 24.618 24.678 0.0 1.062 | 24.650 24.702 0.0 1.069
IPSW | 26.406 26.518 0.0 0.982 | 26.602 26.662 0.0 0.940 | 26.663 26.717 0.0 0.931
PM | 26.512 26.648 0.0 0.851 | 26.715 26.798 0.0 0.728 | 26.779 26.856 0.0 0.700
Doubly robust adjustment
Model specification: QR–True, PM–True
AIPW–PAPW | 0.178 4.635 84.7 0.721 | 0.079 4.160 77.3 0.607 | 0.047 4.103 75.7 0.588
AIPW–IPSW | 0.058 4.574 83.6 0.705 | 0.036 4.149 77.0 0.601 | 0.028 4.100 75.5 0.585
Model specification: QR–True, PM–False
AIPW–PAPW | 0.151 4.273 94.5 0.971 | 0.050 3.734 93.7 0.943 | 0.025 3.660 93.9 0.941
AIPW–IPSW | 0.106 4.245 94.4 0.966 | 0.083 3.767 93.7 0.945 | 0.075 3.712 93.7 0.941
Model specification: QR–False, PM–True
AIPW–PAPW | 0.496 4.566 83.7 0.709 | 0.193 4.142 76.8 0.599 | 0.096 4.096 75.2 0.581
AIPW–IPSW | 0.312 4.514 82.7 0.695 | 0.127 4.133 76.7 0.595 | 0.068 4.094 74.9 0.579
Model specification: QR–False, PM–False
AIPW–PAPW | 26.709 26.849 0.0 0.893 | 26.786 26.869 0.0 0.751 | 26.808 26.885 0.0 0.717
AIPW–IPSW | 26.521 26.656 0.0 0.870 | 26.718 26.800 0.0 0.740 | 26.777 26.854 0.0 0.709

PAPW: propensity adjusted probability weighting; IPSW: inverse propensity score weighting; QR: quasi-randomization; PM: prediction model; AIPW: augmented inverse propensity weighting. "Fully weighted" denotes the weighted means when the true sampling weights are known.
I.3.2. Simulation study III

Table 9 and Table 10 exhibit the numerical results associated with the plots of Simulation III.

Table 9
Comparing the performance of the bias adjustment methods in the third simulation study for ρ = 0.

Model–method | Continuous outcome (Y_c): rBias rMSE crCI rSE | Binary outcome (Y_b): rBias rMSE crCI rSE
Probability sample (S_R)
Unweighted | 48.705 52.900 30.7 1.015 | 11.304 16.881 88.2 1.022
Fully weighted | 0.080 15.400 96.2 1.025 | 0.131 13.858 95.3 1.026
Non-probability sample (S_B)
Unweighted | 68.309 70.415 0.0 0.156 | 21.763 22.794 0.5 0.181
Fully weighted | 0.137 7.581 95.7 1.023 | 0.074 6.512 94.7 0.99
Non-robust adjustment
Model specification: True
GLM–PAPW | 0.448 10.994 94.7 1.036 | 0.072 7.266 96.2 1.034
GLM–PAPP | 0.204 11.192 93.9 1.037 | 0.080 7.188 96.2 1.031
GLM–IPSW | 0.839 18.138 96.0 1.275 | -0.838 9.458 97.3 1.116
GLM–PM | 0.110 11.157 94.2 1.015 | 0.055 7.401 94.4 0.995
Model specification: False
GLM–PAPW | 7.337 13.187 94.2 1.033 | 5.115 8.502 90.4 1.02
GLM–PAPP | 6.762 13.546 94.2 1.032 | 5.046 8.471 88.5 1.035
GLM–IPSW | 22.513 35.600 99.5 1.155 | 9.390 13.098 89.5 1.099
BART–PAPW | 2.272 10.468 100.0 2.487 | 1.633 7.391 99.5 1.436
BART–PAPP | 3.990 11.469 100.0 2.299 | 0.313 7.243 99.3 1.342
GLM–PM | 37.071 42.523 53.0 1.006 | 12.600 14.932 63.6 1.003
BART–PM | 0.286 11.581 92.7 0.996 | 0.594 9.102 81.2 0.688
Doubly robust adjustment
Model specification: QR–True, PM–True
GLM–AIPW–PAPW | 0.307 11.186 95.0 1.019 | 0.083 7.459 94.2 1.001
GLM–AIPW–PAPP | 0.295 11.187 94.5 1.019 | 0.089 7.439 94.0 0.998
GLM–AIPW–IPSW | 0.372 11.193 95.8 1.037 | 0.120 7.478 94.4 1.003
Model specification: QR–True, PM–False
GLM–AIPW–PAPW | 0.381 12.774 95.5 1.035 | 0.047 7.487 96.2 1.04
GLM–AIPW–PAPP | 0.424 11.934 94.7 1.041 | 0.155 7.275 96.0 1.032
GLM–AIPW–IPSW | -8.223 17.625 92.3 1.181 | -2.842 9.086 95.2 1.047
Model specification: QR–False, PM–True
GLM–AIPW–PAPW | 0.127 11.177 94.7 1.020 | 0.067 7.451 94.0 0.997
GLM–AIPW–PAPP | 0.122 11.172 94.7 1.019 | 0.054 7.438 94.2 0.997
GLM–AIPW–IPSW | 0.117 11.167 94.8 1.020 | 0.055 7.433 94.0 0.998
Model specification: QR–False, PM–False
GLM–AIPW–PAPW | 50.327 53.922 21.9 1.002 | 15.651 17.552 50.3 1.007
GLM–AIPW–PAPP | 50.793 54.215 20.9 1.002 | 15.834 17.605 47.8 1.003
GLM–AIPW–IPSW | 47.867 51.106 27.9 1.163 | 15.112 16.884 53.8 1.051
BART–AIPW–PAPW | 0.276 11.593 94.4 1.035 | 0.701 9.186 81.9 0.698
BART–AIPW–PAPP | 0.261 11.591 94.2 1.031 | 0.682 9.155 81.7 0.697

PAPW: propensity adjusted probability weighting; PAPP: propensity adjusted probability prediction; IPSW: inverse propensity score weighting; QR: quasi-randomization; PM: prediction model; AIPW: augmented inverse propensity weighting.
Table 10
Comparing the values of rBias and rMSE for different methods across different values of ρ

ρ | Continuous outcome (Y_c): Non-robust (PAPW PAPP IPSW PM), Doubly robust (PAPW PAPP IPSW) | Binary outcome (Y_b): Non-robust (PAPW PAPP IPSW PM), Doubly robust (PAPW PAPP IPSW)
rBias
0.0 | -0.179 0.022 -4.772 -0.224 | -0.215 -0.218 -0.235 | -0.537 -0.220 -2.345 -0.399 | -0.464 -0.475 -0.489
    | -0.195 0.493 -4.406 0.048 | -0.161 -0.160 -0.250 | -0.329 0.095 -2.071 -0.144 | -0.159 -0.149 -0.172
rMSE
0.0

NOTE: GLM has been used for prediction, and the underlying models in each method have been correctly specified.
I.4. Supplemental results of the application on SHRP2 data
Table 11
Mean daily trip duration (min) and associated 95% CIs by different covariates across DR adjustment methods

Covariate | n | Unweighted (95% CI) | GLM-AIPW-PAPP (95% CI) | GLM-AIPW-PMLE (95% CI) | BART-AIPW-PAPP (95% CI)
Total
Gender
Male | 407,312 | 70.289 (68.809,71.77) | 72.411 (63.583,81.238) | 70.97 (67.971,73.97) | 70.61 (66.131,75.088)
Female | 429,749 | 67.662 (66.355,68.968) | 70.79 (67.353,74.226) | 69.107 (66.683,71.531) | 68.522 (64.432,72.611)
Age group
Race
White | 745,596 | 67.845 (66.833,68.858) | 71.687 (65.246,78.128) | 68.183 (65.836,70.529) | 67.861 (64.386,71.336)
Black | 43,109 | 86.294 (80.759,91.83) | 74.42 (66.374,82.466) | 81.587 (75.046,88.127) | 79.728 (68.019,91.437)
Asian | 26,265 | 68.723 (63.684,73.761) | 66.792 (58.089,75.495) | 66.777 (60.785,72.769) | 65.958 (53.748,78.169)
Other | 22,091 | 72.284 (66.895,77.674) | 79.723 (69.505,89.942) | 75.924 (69.729,82.118) | 75.314 (63.089,87.539)
Ethnicity
Non-Hisp | 808,098 | 68.699 (67.697,69.701) | 71.999 (66.066,77.933) | 69.337 (67.166,71.507) | 68.555 (64.866,72.244)
Hispanic | 28,963 | 75.681 (70.488,80.873) | 72.068 (63.599,80.536) | 74.545 (69.145,79.944) | 75.449 (66.582,84.316)
Education
<High school | 50,943 | 61.108 (58.134,64.083) | 67.647 (58.129,77.165) | 67.32 (61.588,73.051) | 68.246 (56.385,80.108)
HS completed | 78,045 | 69.025 (65.979,72.071) | 86.848 (58.569,115.128) | 69.752 (64.868,74.637) | 70.399 (61.472,79.325)
College | 237,206 | 68.997 (67.153,70.841) | 70.312 (64.184,76.44) | 70.712 (66.638,74.785) | 70.896 (65.722,76.069)
Graduate | 326,860 | 70.859 (69.188,72.529) | 71.314 (68.333,74.296) | 71.313 (69.073,73.554) | 69.984 (65.783,74.186)
Post-grad | 144,007 | 67.218 (64.984,69.451) | 64.26 (60.143,68.377) | 68.713 (64.864,72.562) | 66.496 (62.395,70.597)
HH income
HH size
Urban size
<50k | 34,987 | 67.602 (62.771,72.432) | 79.22 (59.18,99.26) | 65.75 (59.749,71.751) | 66.109 (57.069,75.149)
50-200k | 119,970 | 62.608 (60.337,64.879) | 65.759 (61.25,70.268) | 65.151 (62.164,68.138) | 67.211 (61.409,73.014)
200-500k | 44,578 | 68.576 (63.52,73.632) | 87.248 (73.018,101.477) | 68.884 (63.664,74.104) | 69.636 (61.746,77.526)
500-1000k | 276,629 | 68.017 (66.289,69.745) | 66.524 (61.364,71.685) | 68.123 (65.323,70.923) | 70.338 (64.971,75.704)
1000k+ | 360,897 | 71.928 (70.451,73.404) | 70.91 (67.926,73.894) | 73.567 (71.441,75.693) | 72.962 (68.493,77.43)
Vehicle make
American | 290,228 | 66.507 (64.905,68.108) | 71.826 (59.917,83.734) | 68.256 (65.302,71.21) | 69.04 (63.968,74.113)
Asian | 528,810 | 70.265 (69,71.53) | 72.7 (69.653,75.747) | 71.602 (69.436,73.768) | 70.211 (66.415,74.007)
European | 18,023 | 69.261 (63.898,74.624) | 66.191 (59.703,72.679) | 71.403 (65.95,76.855) | 69.836 (60.506,79.166)
Vehicle type
Car | 610,245 | 68.686 (67.539,69.834) | 73.853 (65.931,81.776) | 69.706 (67.4,72.012) | 70.236 (66.799,73.673)
Van | 27,866 | 69.2 (64.432,73.968) | 68.389 (61.064,75.714) | 73.096 (66.388,79.804) | 64.905 (54.298,75.512)
SUV | 158,202 | 68.993 (66.851,71.134) | 68.424 (62.145,74.704) | 69.291 (66.318,72.263) | 69.469 (64.351,74.587)
Pickup | 40,748 | 72.361 (66.713,78.008) | 69.934 (59.062,80.805) | 74.495 (64.949,84.04) | 70.256 (58.87,81.643)
Fuel type
Gas/D | 761,292 | 68.637 (67.61,69.664) | 71.334 (66.221,76.446) | 69.895 (67.66,72.131) | 69.443 (65.954,72.931)
Other | 75,769 | 71.986 (68.598,75.373) | 82.674 (72.987,92.361) | 77.039 (72.37,81.708) | 75.696 (67.822,83.571)
Weekend
Weekday | 712,411 | 67.671 (66.701,68.64) | 70.362 (65.734,74.991) | 68.72 (66.616,70.824) | 68.348 (64.806,71.89)
Weekend | 124,650 | 76.196 (75.001,77.392) | 78.646 (71.128,86.164) | 77.649 (75.099,80.199) | 76.577 (73.08,80.074)
Table 12
Mean daily trip distance (mile) and associated 95% CIs by different covariates across DR adjustment methods

Covariate | n | Unweighted (95% CI) | GLM-AIPW-PAPP (95% CI) | GLM-AIPW-PMLE (95% CI) | BART-AIPW-PAPP (95% CI)
Total
Gender
Male | 407,312 | 33.852 (32.963,34.741) | 35.51 (32.247,38.773) | 34.782 (33.146,36.418) | 34.358 (32.254,36.461)
Female | 429,749 | 31.06 (30.27,31.849) | 31.901 (29.932,33.871) | 31.947 (30.601,33.293) | 31.428 (28.965,33.89)
Age group
Race
White | 745,596 | 32.189 (31.554,32.824) | 33.173 (30.862,35.484) | 33.426 (32.11,34.743) | 32.85 (31.118,34.582)
Black | 43,109 | 37.275 (34.577,39.973) | 35.696 (31.364,40.029) | 34.187 (31.146,37.228) | 34.131 (28.834,39.427)
Asian | 26,265 | 30.638 (28.095,33.181) | 29.601 (25.527,33.675) | 28.311 (25.238,31.383) | 28.289 (22.62,33.958)
Other | 22,091 | 32.789 (29.699,35.879) | 37.919 (30.975,44.862) | 35.518 (30.553,40.484) | 35.058 (27.876,42.239)
Ethnicity
Non-Hisp | 808,098 | 32.328 (31.723,32.933) | 32.852 (30.823,34.881) | 33.217 (32.085,34.349) | 32.362 (30.845,33.879)
Hispanic | 28,963 | 34.935 (31.713,38.158) | 36.882 (32.766,40.997) | 35.126 (31.226,39.027) | 36.344 (29.859,42.828)
Education
<High school | 50,943 | 25.659 (23.905,27.412) | 28.248 (24.014,32.482) | 28.977 (25.986,31.967) | 30.351 (22.902,37.8)
HS completed | 78,045 | 32.04 (30.14,33.939) | 36.853 (29.19,44.515) | 33.596 (30.989,36.203) | 33.497 (30.336,36.658)
College | 237,206 | 31.848 (30.812,32.885) | 33.666 (30.509,36.824) | 33.038 (31.177,34.899) | 32.956 (30.622,35.29)
Graduate | 326,860 | 33.879 (32.85,34.908) | 35.56 (32.73,38.389) | 35.093 (33.48,36.706) | 34.091 (31.938,36.244)
Post-grad | 144,007 | 32.637 (31.235,34.039) | 29.935 (27.6,32.269) | 32.801 (30.877,34.725) | 31.414 (29.555,33.273)
HH income
HH size
Urban size
<50k | 34,987 | 36.147 (32.885,39.408) | 34.93 (28.343,41.518) | 34.077 (30.506,37.648) | 32.945 (30.263,35.628)
50-200k | 119,970 | 31.028 (29.388,32.668) | 32.379 (29.47,35.288) | 32.032 (29.692,34.372) | 32.636 (30.058,35.215)
200-500k | 44,578 | 36.416 (33.616,39.216) | 44.143 (36.054,52.231) | 35.585 (33.228,37.942) | 36.461 (32.151,40.771)
500-1000k | 276,629 | 31.973 (30.952,32.994) | 32.453 (27.672,37.234) | 31.005 (29.331,32.679) | 32.781 (28.04,37.522)
1000k+ | 360,897 | 32.366 (31.497,33.236) | 32.475 (30.577,34.373) | 32.371 (31.117,33.626) | 32.302 (30.134,34.471)
Vehicle make
American | 290,228 | 30.948 (29.995,31.9) | 35.285 (31.317,39.254) | 32.784 (31.234,34.335) | 32.956 (30.909,35.004)
Asian | 528,810 | 33.249 (32.476,34.022) | 33.339 (31.554,35.124) | 34.022 (32.623,35.42) | 33.124 (31.156,35.092)
European | 18,023 | 31.719 (28.896,34.542) | 29.905 (26.439,33.37) | 32.979 (30.006,35.951) | 32.05 (27.154,36.946)
Vehicle type
Car | 610,245 | 32.126 (31.428,32.823) | 33.619 (31.253,35.985) | 32.916 (31.691,34.141) | 32.518 (30.426,34.611)
Van | 27,866 | 31.212 (28.225,34.199) | 29.109 (24.54,33.679) | 32.682 (28.052,37.312) | 31.109 (24.519,37.699)
SUV | 158,202 | 32.848 (31.558,34.137) | 33.857 (30.466,37.249) | 33.15 (31.374,34.926) | 32.559 (29.986,35.133)
Pickup | 40,748 | 35.958 (32.813,39.103) | 35.086 (30.375,39.798) | 36.557 (32.917,40.197) | 36.352 (31.036,41.668)
Fuel type
Gas/D | 761,292 | 32.121 (31.502,32.739) | 33.524 (31.522,35.526) | 33.18 (32.006,34.354) | 32.813 (31.021,34.605)
Other | 75,769 | 35.409 (33.302,37.515) | 43.864 (37.513,50.214) | 39.259 (35.942,42.576) | 37.388 (34.271,40.505)
Weekend
Weekday | 712,411 | 31.895 (31.307,32.482) | 33.181 (31.327,35.036) | 32.817 (31.666,33.968) | 32.41 (30.68,34.14)
Weekend | 124,650 | 35.41 (34.689,36.132) | 37.037 (34.351,39.724) | 36.64 (35.09,38.19) | 35.853 (33.806,37.899)
Table 13
Mean daily average speed (MPH) of trips and associated 95% CIs by different covariates across DR adjustment methods

Covariate | n | Unweighted (95% CI) | GLM-AIPW-PAPP (95% CI) | GLM-AIPW-PMLE (95% CI) | BART-AIPW-PAPP (95% CI)
Total
Gender
Male | 407,312 | 25.474 (25.137,25.811) | 26.964 (24.667,29.261) | 26.199 (25.578,26.82) | 25.999 (24.877,27.12)
Female | 429,749 | 24.61 (24.297,24.923) | 24.463 (23.239,25.688) | 24.906 (24.32,25.492) | 24.76 (23.59,25.93)
Age group
Race
White | 745,596 | 25.109 (24.863,25.354) | 25.225 (24.255,26.196) | 26.086 (25.565,26.608) | 25.656 (24.95,26.363)
Black | 43,109 | 24.227 (23.188,25.265) | 27.223 (22.295,32.151) | 23.198 (22.202,24.194) | 23.433 (20.47,26.397)
Asian | 26,265 | 24.35 (23.278,25.423) | 25.203 (23.46,26.947) | 23.038 (21.674,24.402) | 24.508 (21.082,27.934)
Other | 22,091 | 24.76 (23.473,26.046) | 25.049 (23.008,27.091) | 24.68 (23.128,26.232) | 25.956 (22.168,29.744)
Ethnicity
Non-Hisp | 808,098 | 25.039 (24.805,25.274) | 25.023 (24.156,25.89) | 25.674 (25.197,26.152) | 25.246 (24.437,26.055)
Hispanic | 28,963 | 24.777 (23.526,26.028) | 27.808 (24.042,31.574) | 25.233 (23.873,26.593) | 26.284 (22.881,29.688)
Education
<High school | 50,943 | 23.16 (22.31,24.01) | 23.142 (21.364,24.919) | 23.825 (22.322,25.328) | 24.791 (21.638,27.943)
HS completed | 78,045 | 25.192 (24.438,25.945) | 24.851 (22.707,26.995) | 26.025 (25.019,27.031) | 25.165 (22.538,27.793)
College | 237,206 | 24.506 (24.09,24.921) | 25.888 (22.586,29.189) | 24.988 (24.184,25.793) | 24.7 (23.717,25.683)
Graduate | 326,860 | 25.426 (25.043,25.81) | 26.769 (25.764,27.775) | 26.363 (25.795,26.93) | 26.368 (25.077,27.659)
Post-grad | 144,007 | 25.569 (25.027,26.11) | 25.031 (23.955,26.107) | 25.446 (24.775,26.117) | 25.555 (24.399,26.711)
HH income
HH size
Urban size
<50k | 34,987 | 28.437 (27.061,29.813) | 25.595 (22.943,28.247) | 27.951 (26.354,29.548) | 27.097 (25.536,28.659)
50-200k | 119,970 | 24.455 (23.814,25.096) | 25.081 (23.851,26.31) | 24.784 (23.965,25.603) | 25.031 (24.155,25.907)
200-500k | 44,578 | 27.64 (26.634,28.645) | 27.073 (23.546,30.601) | 27.024 (26.049,27.999) | 26.931 (24.582,29.279)
500-1000k | 276,629 | 25.758 (25.355,26.162) | 26.189 (23.701,28.678) | 25.05 (24.289,25.812) | 25.513 (24.121,26.904)
1000k+ | 360,897 | 20.451 (18.941,21.961) | 23.557 (21.359,25.755) | 21.9 (20.41,23.389) | 23.488 (20.639,26.336)
Vehicle make
American | 290,228 | 24.799 (24.402,25.195) | 27.212 (24.331,30.094) | 25.766 (25.047,26.485) | 25.353 (24.079,26.627)
Asian | 528,810 | 25.174 (24.884,25.464) | 24.771 (23.69,25.853) | 25.509 (25.004,26.015) | 25.464 (24.358,26.569)
European | 18,023 | 24.534 (23.553,25.514) | 24.974 (23.083,26.866) | 24.291 (22.942,25.64) | 25.307 (23.285,27.329)
Vehicle type
Car | 610,245 | 24.893 (24.622,25.164) | 25.115 (24.327,25.904) | 25.313 (24.794,25.832) | 25.357 (24.467,26.247)
Van | 27,866 | 23.562 (22.539,24.586) | 23.064 (21.378,24.75) | 23.484 (22.376,24.591) | 23.527 (20.832,26.223)
SUV | 158,202 | 25.398 (24.87,25.925) | 26.495 (22.622,30.369) | 25.635 (24.887,26.384) | 25.008 (23.511,26.505)
Pickup | 40,748 | 26.43 (25.484,27.375) | 26.245 (23.43,29.059) | 26.628 (25.453,27.804) | 25.788 (23.842,27.733)
Fuel type
Gas/D | 761,292 | 24.955 (24.711,25.199) | 25.727 (24.205,27.249) | 25.507 (25.005,26.01) | 25.361 (24.277,26.446)
Other | 75,769 | 25.784 (25.091,26.476) | 27.804 (26.114,29.493) | 27.052 (25.991,28.113) | 26.676 (24.691,28.66)
Weekend
Weekday | 712,411 | 25.077 (24.847,25.308) | 25.744 (24.351,27.138) | 25.598 (25.1,26.096) | 25.425 (24.36,26.49)
Weekend | 124,650 | 24.76 (24.518,25.003) | 25.939 (23.843,28.034) | 25.356 (24.811,25.901) | 25.194 (23.987,26.401)
Table 14
Mean start time of the first daily trips and associated 95% CIs by different covariates across DR adjustment methods

Covariate | n | Unweighted (95% CI) | GLM-AIPW-PAPP (95% CI) | GLM-AIPW-PMLE (95% CI) | BART-AIPW-PAPP (95% CI)
Total
Gender
Male | 407,312 | 13.824 (13.751,13.898) | 13.556 (13.304,13.807) | 13.572 (13.418,13.725) | 13.486 (13.304,13.667)
Female | 429,749 | 13.799 (13.736,13.861) | 13.578 (13.362,13.793) | 13.533 (13.386,13.681) | 13.515 (13.389,13.64)
Age group
Race
White | 745,596 | 13.77 (13.719,13.821) | 13.522 (13.331,13.712) | 13.506 (13.372,13.64) | 13.46 (13.318,13.602)
Black | 43,109 | 14.065 (13.887,14.242) | 13.632 (13.387,13.876) | 13.695 (13.49,13.901) | 13.662 (13.21,14.114)
Asian | 26,265 | 14.351 (14.135,14.567) | 14.281 (13.405,15.157) | 14.051 (13.522,14.58) | 13.841 (13.51,14.172)
Other | 22,091 | 14.055 (13.786,14.324) | 13.349 (12.959,13.739) | 13.585 (13.254,13.917) | 13.492 (12.695,14.289)
Ethnicity
Non-Hisp | 808,098 | 13.803 (13.754,13.851) | 13.563 (13.359,13.767) | 13.547 (13.419,13.675) | 13.49 (13.362,13.617)
Hispanic | 28,963 | 14.049 (13.81,14.288) | 13.602 (13.303,13.901) | 13.544 (13.278,13.81) | 13.558 (13.084,14.031)
Education
<High school | 50,943 | 14.3 (14.177,14.424) | 13.601 (13.416,13.786) | 13.604 (13.394,13.814) | 13.513 (13.106,13.921)
HS completed | 78,045 | 13.895 (13.73,14.06) | 13.425 (12.957,13.893) | 13.509 (13.236,13.781) | 13.478 (13.072,13.884)
College | 237,206 | 14.003 (13.913,14.092) | 13.641 (13.36,13.922) | 13.611 (13.433,13.788) | 13.532 (13.339,13.725)
Graduate | 326,860 | 13.695 (13.617,13.774) | 13.68 (13.378,13.982) | 13.65 (13.5,13.799) | 13.558 (13.42,13.696)
Post-grad | 144,007 | 13.539 (13.431,13.648) | 13.307 (12.878,13.737) | 13.369 (13.197,13.542) | 13.399 (13.122,13.677)
HH income
HH size
Urban size
<50k | 34,987 | 13.52 (13.266,13.773) | 13.383 (12.795,13.972) | 13.337 (12.845,13.829) | 13.328 (13.031,13.625)
50-200k | 119,970 | 13.928 (13.794,14.062) | 13.747 (13.489,14.005) | 13.842 (13.672,14.011) | 13.698 (13.522,13.873)
200-500k | 44,578 | 13.918 (13.705,14.13) | 13.518 (12.933,14.103) | 13.817 (13.571,14.064) | 13.66 (13.276,14.045)
500-1000k | 276,629 | 13.759 (13.678,13.84) | 13.395 (13.213,13.576) | 13.564 (13.45,13.679) | 13.503 (13.305,13.7)
1000k+ | 360,897 | 14.286 (13.859,14.713) | 13.451 (12.893,14.01) | 13.654 (13.317,13.992) | 13.385 (12.683,14.087)
Vehicle make
American | 290,228 | 13.8 (13.714,13.886) | 13.339 (13.067,13.61) | 13.455 (13.258,13.651) | 13.426 (13.221,13.631)
Asian | 528,810 | 13.799 (13.741,13.858) | 13.646 (13.485,13.808) | 13.627 (13.517,13.736) | 13.561 (13.43,13.692)
European | 18,023 | 14.337 (14.094,14.58) | 14.19 (13.492,14.888) | 13.684 (13.334,14.034) | 13.552 (13.171,13.933)
Vehicle type
Car | 610,245 | 13.849 (13.792,13.906) | 13.649 (13.445,13.854) | 13.658 (13.53,13.787) | 13.611 (13.476,13.747)
Van | 27,866 | 13.588 (13.336,13.84) | 13.414 (13.064,13.764) | 13.472 (13.172,13.772) | 13.514 (13.168,13.861)
SUV | 158,202 | 13.773 (13.675,13.871) | 13.623 (13.279,13.966) | 13.497 (13.336,13.657) | 13.528 (13.264,13.793)
Pickup | 40,748 | 13.714 (13.536,13.893) | 13.725 (13.41,14.04) | 13.544 (13.305,13.783) | 13.502 (13.079,13.925)
Fuel type
Gas/D | 761,292 | 13.841 (13.791,13.891) | 13.565 (13.389,13.741) | 13.556 (13.428,13.685) | 13.498 (13.36,13.636)
Other | 75,769 | 13.51 (13.338,13.683) | 13.525 (13.217,13.833) | 13.463 (13.254,13.672) | 13.56 (13.264,13.856)
Weekend
Weekday | 712,411 | 13.824 (13.775,13.872) | 13.576 (13.408,13.744) | 13.558 (13.431,13.684) | 13.502 (13.364,13.64)
Weekend | 124,650 | 13.74 (13.685,13.794) | 13.496 (13.22,13.773) | 13.531 (13.397,13.664) | 13.486 (13.334,13.637)
Table 15
Mean daily maximum speed (MPH) and associated 95% CIs by different covariates across DR adjustment methods

Covariate | n | Unweighted (95% CI) | GLM-AIPW-PAPP (95% CI) | GLM-AIPW-PMLE (95% CI) | BART-AIPW-PAPP (95% CI)
Total
Gender
Male | 407,312 | 60.187 (59.706,60.669) | 62.687 (59.483,65.89) | 60.847 (60.045,61.649) | 60.677 (58.953,62.402)
Female | 429,749 | 59.448 (58.969,59.928) | 60.28 (59.195,61.366) | 60.023 (59.205,60.84) | 59.193 (58.034,60.353)
Age group
Race
White | 745,596 | 59.449 (59.092,59.806) | 60.312 (59.478,61.145) | 60.254 (59.581,60.929) | 59.586 (58.217,60.954)
Black | 43,109 | 64.628 (62.991,66.264) | 68.229 (60.656,75.802) | 62.22 (60.3,64.14) | 62.152 (58.85,65.453)
Asian | 26,265 | 61.08 (59.323,62.836) | 60.573 (58.414,62.733) | 59.081 (56.873,61.288) | 59.464 (55.818,63.11)
Other | 22,091 | 61.008 (59.099,62.917) | 60.986 (58.585,63.387) | 59.968 (58.45,61.485) | 60.988 (55.744,66.231)
Ethnicity
Non-Hisp | 808,098 | 59.718 (59.37,60.066) | 60.221 (59.395,61.048) | 60.232 (59.595,60.869) | 59.528 (58.485,60.571)
Hispanic | 28,963 | 62.31 (60.707,63.914) | 66.308 (60.971,71.645) | 61.976 (60.418,63.533) | 62.437 (58.69,66.184)
Education
<High school | 50,943 | 58.103 (56.954,59.251) | 59.199 (57.013,61.386) | 58.949 (57.325,60.572) | 60.162 (57.18,63.145)
HS completed | 78,045 | 59.865 (58.83,60.901) | 61.116 (59.657,62.576) | 60.812 (59.457,62.166) | 60.382 (57.673,63.092)
College | 237,206 | 59.874 (59.25,60.497) | 62.623 (58.482,66.764) | 59.982 (58.9,61.065) | 59.491 (57.762,61.219)
Graduate | 326,860 | 60.185 (59.62,60.751) | 61.474 (60.049,62.898) | 61.405 (60.379,62.43) | 60.718 (59.63,61.807)
Post-grad | 144,007 | 59.414 (58.566,60.262) | 59.809 (58.019,61.598) | 60.113 (58.801,61.426) | 59.221 (57.561,60.881)
HH income
HH size
Urban size
<50k | 34,987 | 60.422 (58.928,61.917) | 61.118 (58.986,63.251) | 59.964 (58.006,61.921) | 59.665 (57.238,62.093)
50-200k | 119,970 | 56.12 (55.22,57.021) | 57.621 (55.544,59.698) | 57.162 (56.071,58.254) | 57.313 (55.844,58.782)
200-500k | 44,578 | 62.847 (61.27,64.423) | 64.07 (60.75,67.39) | 62.507 (60.978,64.037) | 62.711 (60.338,65.086)
500-1000k | 276,629 | 60.193 (59.62,60.766) | 61.073 (57.067,65.078) | 59.886 (58.928,60.846) | 60.296 (58.82,61.773)
1000k+ | 360,897 | 60.303 (59.8,60.807) | 62.075 (59.415,64.736) | 60.647 (59.977,61.317) | 60.287 (59.099,61.475)
Vehicle make
American | 290,228 | 59.36 (58.773,59.946) | 63.24 (59.63,66.85) | 60.218 (59.339,61.096) | 59.877 (58.058,61.695)
Asian | 528,810 | 60.013 (59.588,60.438) | 60.796 (59.891,61.701) | 60.751 (60.095,61.407) | 60.203 (59.036,61.371)
European | 18,023 | 61.016 (59.074,62.958) | 58.842 (56.171,61.514) | 58.984 (56.944,61.025) | 59.049 (55.175,62.922)
Vehicle type
Car | 610,245 | 59.744 (59.338,60.149) | 60.92 (60.083,61.757) | 60.44 (59.765,61.115) | 60.119 (58.854,61.383)
Van | 27,866 | 57.722 (56.154,59.289) | 58.36 (55.812,60.907) | 58.674 (56.088,61.26) | 58.263 (55.586,60.94)
SUV | 158,202 | 60.093 (59.36,60.825) | 62.613 (57.431,67.795) | 60.444 (59.297,61.59) | 59.511 (57.678,61.345)
Pickup | 40,748 | 61.092 (59.557,62.627) | 62.97 (61.201,64.739) | 61.359 (59.346,63.371) | 60.707 (57.51,63.905)
Fuel type
Gas/D | 761,292 | 59.878 (59.516,60.239) | 61.537 (59.67,63.404) | 60.473 (59.842,61.105) | 59.937 (58.565,61.309)
Other | 75,769 | 59.105 (58.131,60.079) | 61.685 (58.505,64.865) | 61.082 (59.975,62.189) | 60.645 (58.056,63.234)
Weekend
Weekday | 712,411 | 59.684 (59.344,60.023) | 61.322 (59.663,62.982) | 60.295 (59.687,60.902) | 59.801 (58.483,61.119)
Weekend | 124,650 | 60.517 (60.151,60.883) | 62.809 (60.044,65.575) | 61.312 (60.601,62.023) | 60.768 (59.333,62.204)
Table 16
Mean daily frequency of brakes per driven mile and associated 95% CIs by different covariates across DR adjustment methods

Covariate | n | Unweighted (95% CI) | GLM-AIPW-PAPP (95% CI) | GLM-AIPW-PMLE (95% CI) | BART-AIPW-PAPP (95% CI)
Total
Gender
Male | 407,312 | 4.415 (4.247,4.583) | 3.835 (3.139,4.531) | 4.456 (4.129,4.784) | 4.345 (3.789,4.902)
Female | 429,749 | 4.579 (4.43,4.728) | 4.957 (4.518,5.396) | 4.825 (4.471,5.179) | 4.508 (4.04,4.977)
Age group
Race
White | 745,596 | 4.521 (4.401,4.641) | 4.518 (4.018,5.018) | 4.583 (4.272,4.895) | 4.402 (4,4.805)
Black | 43,109 | 4.366 (3.885,4.848) | 3.807 (2.117,5.496) | 5.074 (4.442,5.706) | 5.053 (4.117,5.988)
Asian | 26,265 | 4.256 (3.675,4.837) | 4.349 (3.393,5.304) | 5.222 (4.061,6.383) | 4.483 (3.405,5.561)
Other | 22,091 | 4.319 (3.732,4.907) | 4.197 (3.011,5.382) | 4.438 (3.574,5.302) | 3.705 (2.167,5.243)
Ethnicity
Non-Hisp | 808,098 | 4.488 (4.375,4.601) | 4.56 (4.117,5.003) | 4.597 (4.323,4.871) | 4.442 (4.051,4.832)
Hispanic | 28,963 | 4.813 (4.037,5.588) | 3.842 (2.609,5.075) | 4.84 (3.782,5.898) | 4.3 (2.899,5.701)
Education
<High school | 50,943 | 4.942 (4.522,5.362) | 4.779 (3.58,5.979) | 5.209 (4.526,5.893) | 5.204 (3.91,6.498)
HS completed | 78,045 | 4.163 (3.8,4.526) | 3.495 (1.656,5.335) | 4.241 (3.662,4.819) | 4.179 (3.275,5.084)
College | 237,206 | 4.561 (4.347,4.775) | 4.466 (3.609,5.323) | 4.713 (4.269,5.158) | 4.604 (4.062,5.145)
Graduate | 326,860 | 4.347 (4.165,4.528) | 4.174 (3.765,4.583) | 4.564 (4.201,4.927) | 4.17 (3.59,4.749)
Post-grad | 144,007 | 4.769 (4.509,5.029) | 5.031 (4.514,5.548) | 4.788 (4.387,5.19) | 4.541 (3.857,5.224)
HH income
HH size
Urban size
<50k | 34,987 | 4.051 (3.567,4.535) | 4.313 (2.858,5.768) | 4.293 (3.57,5.015) | 4.24 (3.746,4.734)
50-200k | 119,970 | 4.789 (4.49,5.089) | 4.921 (4.454,5.389) | 4.761 (4.265,5.257) | 4.696 (4.02,5.372)
200-500k | 44,578 | 4.241 (3.825,4.657) | 4.177 (3.147,5.206) | 4.567 (4.075,5.059) | 4.231 (3.049,5.413)
500-1000k | 276,629 | 3.969 (3.793,4.145) | 3.977 (3.342,4.612) | 4.21 (3.867,4.552) | 4.171 (3.549,4.793)
1000k+ | 360,897 | 4.884 (4.703,5.066) | 4.539 (3.985,5.093) | 4.948 (4.599,5.297) | 4.626 (4.025,5.227)
Vehicle make
American | 290,228 | 4.762 (4.548,4.975) | 4.239 (3.381,5.097) | 5.007 (4.549,5.466) | 4.914 (4.347,5.482)
Asian | 528,810 | 4.392 (4.261,4.524) | 4.569 (4.244,4.894) | 4.451 (4.181,4.721) | 4.206 (3.78,4.632)
European | 18,023 | 3.401 (2.946,3.856) | 3.161 (2.359,3.963) | 3.664 (3.006,4.322) | 2.898 (0.958,4.837)
Vehicle type
Car | 610,245 | 4.504 (4.368,4.641) | 4.281 (3.65,4.913) | 4.564 (4.225,4.903) | 4.248 (3.785,4.711)
Van | 27,866 | 4.435 (4.064,4.806) | 5.298 (4.287,6.31) | 4.855 (4.41,5.3) | 4.569 (3.345,5.792)
SUV | 158,202 | 4.351 (4.148,4.555) | 4.222 (3.078,5.365) | 4.56 (4.215,4.904) | 4.381 (3.73,5.032)
Pickup | 40,748 | 5.043 (4.391,5.696) | 4.611 (3.881,5.34) | 5.022 (4.255,5.789) | 5.226 (4.061,6.391)
Fuel type
Gas/D | 761,292 | 4.435 (4.319,4.551) | 4.345 (3.865,4.825) | 4.649 (4.34,4.959) | 4.426 (3.992,4.859)
Other | 75,769 | 5.145 (4.737,5.553) | 4.718 (3.688,5.747) | 4.934 (4.308,5.561) | 4.388 (3.347,5.429)
Weekend
Weekday | 712,411 | 4.492 (4.379,4.605) | 4.365 (3.91,4.82) | 4.639 (4.339,4.938) | 4.413 (3.963,4.864)
Weekend | 124,650 | 4.54 (4.427,4.654) | 4.305 (3.738,4.871) | 4.675 (4.374,4.975) | 4.497 (4.073,4.921)
Table 17
Mean daily percentage of stop time and associated 95% CIs by different covariates across DR adjustment methods. Note that the BART–AIPW-PAPP column is reported on the proportion scale (multiply by 100 for percent).
Covariate  n  Unweighted (95% CI)  GLM-AIPW-PAPP (95% CI)  GLM-AIPW-PMLE (95% CI)  BART–AIPW-PAPP (95% CI)
Total
Gender
Male  407,312  24.618 (24.157,25.079)  24.048 (21.863,26.234)  24.06 (23.158,24.961)  0.242 (0.231,0.252)
Female  429,749  26.371 (25.945,26.797)  27.107 (25.226,28.988)  25.873 (24.968,26.779)  0.261 (0.25,0.271)
Age group
Race
White  745,596  25.216 (24.882,25.55)  25.693 (24.172,27.213)  24.071 (23.292,24.849)  0.245 (0.237,0.253)
Black  43,109  29.711 (28.421,31.001)  27.666 (25.29,30.042)  29.697 (28.071,31.324)  0.294 (0.267,0.32)
Asian  26,265  25.989 (24.582,27.396)  23.631 (20.325,26.937)  26.098 (24.013,28.184)  0.252 (0.216,0.289)
Other  22,091  26.955 (24.952,28.958)  26.442 (23.616,29.269)  26.484 (23.923,29.045)  0.252 (0.21,0.294)
Ethnicity
Non-Hisp  808,098  25.438 (25.118,25.758)  25.622 (24.061,27.183)  24.532 (23.79,25.274)  0.251 (0.242,0.26)
Hispanic  28,963  27.746 (25.946,29.546)  26.345 (24.033,28.657)  27.102 (25.369,28.836)  0.251 (0.224,0.277)
Education
<High school  50,943  27.86 (26.587,29.134)  28.96 (26.302,31.618)  27.223 (25.326,29.119)  0.262 (0.233,0.292)
HS completed  78,045  26.136 (25.022,27.249)  28.155 (25.07,31.241)  25.534 (24.179,26.889)  0.263 (0.24,0.287)
College  237,206  26.881 (26.288,27.474)  27.043 (24.085,30.001)  26.128 (24.88,27.377)  0.264 (0.253,0.276)
Graduate  326,860  24.96 (24.472,25.448)  22.543 (21.102,23.983)  23.625 (22.803,24.447)  0.236 (0.218,0.254)
Post-grad  144,007  23.375 (22.656,24.094)  24.105 (22.558,25.651)  23.845 (22.697,24.992)  0.237 (0.219,0.254)
HH income
HH size
Urban size
<50k  34,987  20.874 (19.097,22.651)  26.022 (21.85,30.194)  21.679 (19.496,23.862)  0.228 (0.21,0.247)
50-200k  119,970  23.798 (22.902,24.694)  23.87 (22.364,25.376)  24.287 (22.881,25.692)  0.239 (0.222,0.257)
200-500k  44,578  22.435 (21.355,23.515)  24.487 (18.513,30.461)  23.436 (22.035,24.837)  0.236 (0.209,0.263)
500-1000k  276,629  25.334 (24.785,25.882)  25.695 (24.531,26.86)  26.128 (25.326,26.93)  0.263 (0.249,0.278)
1000k+  360,897  27.062 (26.615,27.508)  26.883 (25.813,27.953)  27.43 (26.73,28.131)  0.271 (0.26,0.283)
Vehicle make
American  290,228  26.28 (25.728,26.831)  24.962 (22.25,27.673)  24.957 (23.906,26.007)  0.255 (0.244,0.265)
Asian  528,810  25.075 (24.682,25.468)  26.283 (24.65,27.917)  24.94 (24.125,25.755)  0.249 (0.239,0.258)
European  18,023  26.233 (24.82,27.646)  23.294 (19.732,26.857)  25.298 (22.99,27.607)  0.243 (0.21,0.276)
Vehicle type
Car  610,245  25.632 (25.267,25.997)  25.738 (24.14,27.335)  25.054 (24.321,25.788)  0.25 (0.24,0.261)
Van  27,866  27.205 (25.523,28.887)  28.585 (24.943,32.227)  28.5 (25.898,31.103)  0.277 (0.237,0.316)
SUV  158,202  25.274 (24.518,26.031)  25.441 (22.085,28.796)  25.053 (23.917,26.189)  0.254 (0.241,0.267)
Pickup  40,748  23.596 (22.247,24.945)  23.906 (20.704,27.107)  23.839 (21.462,26.217)  0.239 (0.211,0.267)
Fuel type
Gas/D  761,292  25.777 (25.444,26.11)  25.638 (24.128,27.147)  25.059 (24.318,25.801)  0.252 (0.243,0.261)
Other  75,769  22.913 (21.982,23.844)  20.192 (18.427,21.956)  22.057 (20.348,23.765)  0.216 (0.195,0.237)
Weekend
Weekday  712,411  25.395 (25.079,25.712)  25.465 (24.053,26.876)  24.83 (24.105,25.554)  0.25 (0.241,0.259)
Weekend  124,650  26.218 (25.892,26.545)  25.812 (23.822,27.803)  25.623 (24.805,26.442)  0.258 (0.248,0.268)
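To make the DR columns in Tables 16 and 17 concrete, the following is a minimal sketch of how an AIPW point estimate of a population mean can be formed from a non-probability sample and a weighted reference survey. It is illustrative only: all variable and function names are hypothetical, a gradient-boosted model stands in for BART, and the weighted pseudo-likelihood step is reduced to a weighted classifier fit, so it approximates rather than reproduces the PAPP/PMLE estimators used in the tables.

```python
# Hedged sketch of a doubly robust (AIPW) mean estimate combining a
# non-probability sample (e.g., SHRP2) with a weighted reference survey
# (e.g., NHTS). Boosting is a stand-in for the BART models in the paper.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier, GradientBoostingRegressor

def aipw_mean(X_np, y_np, X_ref, w_ref):
    """X_np, y_np: covariates/outcome in the non-probability sample.
    X_ref, w_ref: covariates and sampling weights in the reference sample."""
    # Propensity step: model membership in the non-probability sample on the
    # stacked data; reference units carry their survey weights, loosely
    # mimicking a pseudo-maximum-likelihood fit.
    X_all = np.vstack([X_np, X_ref])
    z = np.concatenate([np.ones(len(X_np)), np.zeros(len(X_ref))])
    sw = np.concatenate([np.ones(len(X_np)), w_ref])
    clf = GradientBoostingClassifier().fit(X_all, z, sample_weight=sw)
    p = clf.predict_proba(X_np)[:, 1]
    pseudo_w = (1.0 - p) / p  # inverse-odds pseudo-weights

    # Prediction step: outcome model fit on the non-probability sample.
    reg = GradientBoostingRegressor().fit(X_np, y_np)
    resid = y_np - reg.predict(X_np)

    # AIPW combination: weighted prediction term over the reference sample
    # plus the pseudo-weighted residual correction (Hajek-normalized).
    prediction = np.sum(w_ref * reg.predict(X_ref)) / np.sum(w_ref)
    correction = np.sum(pseudo_w * resid) / np.sum(pseudo_w)
    return prediction + correction
```

Interval estimates such as the 95% CIs reported above would then come from replication methods or, in the BART case, from posterior predictive draws rather than from this single point estimate.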
Fig 13: Comparing the distribution of common auxiliary variables in pseudo-weighted SHRP2 (PAPP–BART) with weighted NHTS. Panels compare SHRP2 with NHTS on: Sex, Age Group, Race, Ethnicity, Birth Country, Education, HH Income, Home Ownership, Job Status, HH Size, Urban Size, Vehicle Age, Vehicle Make, Vehicle Type, Mileage, Fuel Type, Weekday, and Season.
[Figure, axis and legend text only recovered: density of mean daily total distance driven (mile) by adjustment method (PAPP, PM, AIPW); propensity scores estimated under BART.]
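As a rough companion to the Fig 13 comparison, the snippet below computes the weighted percentage distribution of a categorical auxiliary variable in each source; if the pseudo-weights are well calibrated, the two sets of shares should approximately agree. The data frame and column names (shrp2, nhts, pseudo_w, wt, veh_type) are assumptions for illustration, not the study's actual variable names.

```python
# Hedged sketch of the covariate balance check behind Fig 13.
import pandas as pd

def weighted_share(cat, w):
    """Weighted percentage of each category of a categorical variable."""
    totals = w.groupby(cat).sum()
    return 100.0 * totals / totals.sum()

# Hypothetical usage: compare pseudo-weighted SHRP2 with weighted NHTS.
# print(weighted_share(shrp2["veh_type"], shrp2["pseudo_w"]))
# print(weighted_share(nhts["veh_type"], nhts["wt"]))
```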