Data Fusion for Correcting Measurement Errors
Tracy Schifeling, Jerome P. Reiter, and Maria DeYoreo*

Abstract
Often in surveys, key items are subject to measurement errors. Given just the data, it can be difficult to determine the distribution of this error process, and hence to obtain accurate inferences that involve the error-prone variables. In some settings, however, analysts have access to a data source on different individuals with high quality measurements of the error-prone survey items. We present a data fusion framework for leveraging this information to improve inferences in the error-prone survey. The basic idea is to posit models about the rates at which individuals make errors, coupled with models for the values reported when errors are made. This can avoid the unrealistic assumption of conditional independence typically used in data fusion. We apply the approach to the reported values of educational attainments in the American Community Survey, using the National Survey of College Graduates as the high quality data source. In doing so, we account for the informative sampling design used to select the National Survey of College Graduates. We also present a process for assessing the sensitivity of various analyses to different choices for the measurement error models. Supplemental material is available online.
KEY WORDS: fusion, imputation, measurement error, missing, survey.

* This research was supported by The National Science Foundation under award SES-11-31897. The authors wish to thank Seth Sanders for his input on informative prior specifications, and Mauricio Sadinle for discussion that improved the strategy for accounting for the informative sample design.

1 Introduction
Survey data often contain items that are subject to measurement errors. For example, some respondents might misunderstand a question or accidentally select the wrong response, thereby providing values unequal to their factual values. Left uncorrected, these measurement errors can result in degraded inferences (Kim et al., 2015). Unfortunately, the distribution of the measurement errors typically is not estimable from the survey data alone. One either needs to make strong assumptions about the measurement error process (e.g., as in Curran and Hussong, 2009), or leverage information from some other source of data, as we do here.

One natural source of information is a validation sample, i.e., a dataset with both the reported, possibly erroneous values and the true values measured on the same individuals. These individuals could be a subset of the original survey (Pepe, 1992; Yucel and Zaslavsky, 2005), or a completely distinct set (Raghunathan, 2006; Schenker and Raghunathan, 2007; Schenker et al., 2010; Carrig et al., 2015). With validation data, one can model the relationship between the error-prone and true values, and use the model to replace the error-prone items with multiply-imputed, plausible true values (Reiter, 2008; Siddique et al., 2015).

In many settings, however, it is not possible to obtain validation samples, e.g., because it is too expensive or because someone other than the analyst collected the data. In such cases, another potential source of information is a separate, "gold standard" dataset that includes true (or at least very high quality) measurements of the items subject to error, but not the error-prone measurements. Unlike validation samples, the gold standard dataset alone does not provide enough information to estimate the relationship between the error-prone and true values; it only provides information about the distribution of the true values.
Thus, analysts are faced with a special case of data fusion (Rubin, 1986; Moriarity and Scheuren, 2001; Rassler, 2002; D'Orazio et al., 2006; Reiter, 2012; Fosdick et al., 2016), i.e., integrating information from two databases with disjoint sets of individuals and distinct variables.

One default approach, common in other data fusion contexts, is to assume that the error-prone and true values are conditionally independent given some set of variables X common to both the survey and gold standard data. Effectively, this involves using the gold standard data to estimate a predictive model for the true values from X, and applying the estimated model to impute replacements for all values of the error-prone items in the survey. However, this conditional independence assumption completely disregards the information in the error-prone values, which sacrifices potentially useful information. For example, consider national surveys that ask people to report their educational attainment. We might expect most people to report values accurately and only a modest fraction to make errors. It does not make sense to alter every individual's reported values in the survey, as would be done using a conditional independence approach.

In this article, we develop a framework for leveraging information from gold standard data to improve inferences in surveys subject to measurement errors. The basic idea is to encode plausible assumptions about the error process, e.g., most people do not make errors when reporting educational attainments, and the reporting process, e.g., when people make errors, they are more likely to report higher attainments than actual, into statistical models. We couple those models with distributions for the underlying true data values, and use multiple imputation to create plausible corrections to the error-prone survey values, which then can be analyzed using the methods from Rubin (1987).
This allows us to avoid unrealistic conditional independence assumptions in favor of more scientifically defensible models.

The remainder of this article is organized as follows. In Section 2, we review an example of misreporting of educational attainment in data collected by the Census Bureau, so as to motivate the methodological developments. In Section 3, we introduce the general framework for specifying measurement error models to leverage the information in gold standard data. In Section 4, we apply the framework to handle potential measurement error in educational attainment in the 2010 American Community Survey (ACS), using the 2010 National Survey of College Graduates (NSCG) as a gold standard file. In doing so, we deal with a key complication in the data integration: accounting for the informative sampling design used to sample the NSCG. We also demonstrate how the framework facilitates analysis of the sensitivity of conclusions to different measurement error model specifications. In Section 5, we provide a brief summary.

2 Misreporting of Educational Attainment

To illustrate the potential for reporting errors in educational attainment that can arise in surveys, we examine data from the 1993 NSCG. The 1993 NSCG surveyed individuals who indicated on the 1990 census long form that they had at least a college degree (Fesco et al., 2012). The questionnaire asked about educational attainment, including detailed questions about educational histories. These questions greatly reduce the possibility of respondent error, so that the educational attainment values in the NSCG can be considered a gold standard (Black et al., 2003). The census long form, in contrast, did not include detailed follow up questions, so that reported educational attainment is prone to measurement error.

The Census Bureau linked each individual in the NSCG to their corresponding record in the long form data.
The linked file is available for download from the Inter-university Consortium for Political and Social Research (National Science Foundation).

Table 1: Unweighted cross-tabulation of reported education in the NSCG and census long form from the linked dataset. BA stands for bachelor's degree; MA stands for master's degree; Prof stands for professional degree; and PhD stands for Ph.D. degree. The 14,319 individuals in the group labeled No Degree did not have a college degree, despite reporting otherwise. The 51,396 individuals in the group labeled Other did not have one of (BA, MA, Prof, PhD) and are discarded from subsequent analyses.

                               Census-reported education
    NSCG-reported education     BA       MA      Prof     PhD     Total
    BA                         89580     4109    1241      249    95179
    MA                          1218    33928     655      526    36327
    Prof                         382      359    8648      563     9952
    PhD                           99      193     452     6726     7470
Figure 1: Graphical representation of the data fusion set-up. In the survey data D_E, we only observe the error-prone measurement Z but not the true value Y. In the gold standard data D_G, we only observe Y but not Z. We observe variables X in both samples.

3 The General Framework

As in Figure 1, let D_E and D_G be two data sources comprising distinct individuals, with sample sizes n_E and n_G, respectively. For each individual i in D_G or D_E, let X_i = (X_i1, ..., X_ip) be variables common to both surveys, such as demographic variables. We assume these variables have been harmonized (D'Orazio et al., 2006) across D_G and D_E and are free of errors. Let Y represent the error-free values of some variable of interest, and let Z be an error-prone version of Y. We observe Z but not Y for the n_E individuals in D_E. We observe Y but not Z for the n_G individuals in D_G. For simplicity of notation, we assume no missing values in any variable, although the multiple imputation framework easily handles missing values. Additionally, D_E can include variables for which there is no corresponding variable in D_G. These variables do not play a role in the measurement error modeling, although they can be used in multiple imputation inferences.

We seek to estimate Pr(Y, Z | X), and use it to create multiple imputations for the missing values in Y for the individuals in D_E. We do so for the common setting where (X, Y, Z) are all categorical variables; similar ideas apply for other data types. For j = 1, ..., p, let each X_j have d_j levels. Let Z have d_Z levels and Y have d_Y levels. In many applications d_Z = d_Y, but this need not be the case generally. For example, in the NSCG/ACS application, Z is the educational attainment among those who report a college degree in the ACS, which has d_Z = 4 levels (bachelor's degree, master's degree, professional degree, or Ph.D. degree), and Y is the educational attainment in the NSCG, which has d_Y = 5 levels.
An additional level is needed because some individuals in the NSCG truly do not have a college degree.

For all i in D_E, let E_i be an (unobserved) indicator of a reporting error, that is, E_i = 1 when Y_i ≠ Z_i and E_i = 0 otherwise. Using E enables us to write Pr(Y, Z | X) as a product of three sub-models. For individual i, the full data likelihood (omitting parameters for simplicity) can be factored as

  Pr(Y_i = k, Z_i = l | X_i) = Pr(Y_i = k | X_i)
      × Pr(E_i = e | Y_i = k, X_i) × Pr(Z_i = l | E_i = e, Y_i = k, X_i).   (1)

This separates the true data generation process and the measurement error generation process, which facilitates model specification. In particular, we can use D_G to estimate the true data distribution Pr(Y | X). We then can posit different models for the rates of making errors, Pr(E_i = e | Y_i = k, X_i), and for the reported values when errors are made, Pr(Z_i = l | E_i = 1, Y_i = k, X_i). Intuitively, the error model locates the records for which Y_i ≠ Z_i, and the reporting model captures the patterns of misreported Z_i. Of course, when E_i = 0, Pr(Z_i = Y_i) = 1. A similar factorization is used by Yucel and Zaslavsky (2005), He et al. (2014), Kim et al. (2015), and Manrique-Vallier and Reiter (2016), among others.

By construction, D_G and D_E cannot be used to estimate any of the conditional probabilities Pr(Y | Z, X) directly. Hence, we have to restrict the number and types of parameters in the sub-models in (1). Put another way, if we tried to estimate a fully saturated model for (E, Z | X), we would not be able to identify all the parameters by using D_G and D_E alone. To see this, assume for the moment that all d_X = Π_{j=1}^p d_j possible combinations of X are present in D_G and D_E.
To estimate the distribution of (E, Z | X) using a fully saturated model, we require (d_Y − 1)d_X + (d_Z − 1)d_Y d_X = (d_Y d_Z − 1)d_X independent pieces of information from (D_G, D_E), where each subtraction of one derives from the requirement that probabilities sum to one. However, D_G and D_E together provide only (d_Z − 1)d_X + (d_Y − 1)d_X + d_X = (d_Z + d_Y − 1)d_X independent pieces of information, where we add a d_X to properly account for the sum to one constraint. A key insight here is that since the true data model requires d_Y d_X parameters to estimate the joint distribution for (Y, X), the data can identify at most (d_Z − 1)d_X parameters in the error and reporting models, combined. Related identification issues arise in the context of refreshment sampling to adjust for nonignorable attrition in longitudinal studies (Hirano et al., 2001; Schifeling et al., 2015; Si et al., 2015).

3.1 True data model Pr(Y_i = k | X_i)

One can use any model for (Y | X) that adequately describes the conditional distribution, such as a (multinomial) logistic regression. In the NSCG/ACS application, we use a fully saturated multinomial model, accounting for the informative sampling design in D_G using the approach described in Section 4.1. One also could use a joint distribution for (Y, X), such as a log-linear model or a mixture of multinomials model (Dunson and Xing, 2009; Si and Reiter, 2013).

3.2 Error model Pr(E_i = 1 | Y_i, X_i)

In cases where d_Y = d_Z, a generic form for the error model is

  Pr(E_i = 1 | X_i, Y_i = k) = g(X_i, Y_i, β),   (2)

where g(X_i, Y_i, β) is some function of its arguments and β is some set of unknown parameters. A convenient class of functions that we use here is the logistic regression of E_i on some design vector M_i derived from (X_i, Y_i), with corresponding coefficients β.
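As a concrete sketch of the logistic form in (2), the following computes Pr(E_i = 1) for one illustrative choice of M_i; the particular design vector and coefficient values are hypothetical, not the paper's specification.

```python
import numpy as np

def error_prob(is_male, y, beta):
    """Pr(E_i = 1 | X_i, Y_i) under a logistic model as in (2).

    The design vector M_i here encodes an intercept, a male-sex
    indicator, and the true education level Y_i in {1,...,5}.
    This encoding is illustrative only.
    """
    m = np.array([1.0,                       # intercept
                  1.0 if is_male else 0.0,   # sex effect
                  float(y)])                 # effect of true value Y_i
    return 1.0 / (1.0 + np.exp(-m @ beta))   # logistic link

# With beta = 0, the model reduces to a common error rate of 0.5;
# a strongly negative intercept makes errors rare, as one would
# expect for self-reported education.
p_common = error_prob(True, 3, np.zeros(3))
p_rare = error_prob(False, 1, np.array([-4.0, 0.0, 0.0]))
```

Richer specifications simply append more columns to M_i, e.g., interactions of sex with each level of Y_i, as in the models of Section 4.2.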
The analyst can encode different versions of M_i to represent assumptions about the error process. The simplest specification is to set each M_i equal to a vector of ones, which implies that there is a common probability of error for all individuals. This error model makes sense when the analyst believes the errors in Z occur completely at random; for example, when errors arise simply because respondents accidentally and randomly select the wrong response in the survey, or when all respondents are equally likely to misunderstand the survey question. A more realistic possibility is to allow the probability of error to depend on some variables in X_i but not on Y_i, e.g., men misreport education at different rates than women. This could be encoded by including an intercept for one of the sexes in M_i. Finally, one can allow the probability of error to depend on Y_i itself (for example, people who truly do not have at least a college degree are more likely to misreport) by including some function of it in M_i.

In the case where d_Z ≠ d_Y, as in the NSCG/ACS application, we automatically set E_i = 1 for any individual with Y_i not in {1, ..., d_Z}. For example, we set E_i = 1 for all individuals who are determined in the NSCG not to have a college degree but report so in the ACS. The stochastic part of the error model only applies to individuals who truly have at least a bachelor's degree.

3.3 Reporting model Pr(Z_i | E_i = 1, Y_i, X_i)

When there is no reporting error for individual i, i.e., E_i = 0, we know that Z_i = Y_i. When there is a reporting error, we must model the reported value Z_i. As with (2), one can posit a variety of distributions for the reporting error, which is some function h(X_i, Y_i, α) with parameters α. We now describe a few reporting error models for illustration.
One could use more complicated models, e.g., based on multinomial logistic regression, as well.

A simple model assumes that all feasible values of Z_i are equally likely, as in Manrique-Vallier and Reiter (2016). We have

  Pr(Z_i = l | X_i, Y_i = k, E_i = 1) = 1/(d_Z − 1)   if l ≠ k, k in {1, ..., d_Z},
                                        1/d_Z          if k not in {1, ..., d_Z}.   (3)

Such a reporting model could be reasonable when reporting errors are due to clerical errors. We note that this model does not accurately characterize the reporting errors in the 1993 linked NSCG data, per Table 1.

Alternatively, one can allow the probabilities to depend on Y_i, so that

  (Z_i | X_i, Y_i = k, E_i = 1) ~ Categorical(p_k(1), ..., p_k(d_Z)),   (4)

where each p_k(l) is the probability of reporting Z = l given that Y = k, and p_k(k) = 0. One can further parameterize the reporting model so that the reporting probabilities vary with X. For example, to make the probabilities vary with sex and true education values, we can use

  (Z_i | X_i, Y_i = k, E_i = 1) ~ Categorical(p_{M,k}(1), ..., p_{M,k}(d_Z))   if X_{i,sex} = M,
                                  Categorical(p_{F,k}(1), ..., p_{F,k}(d_Z))   if X_{i,sex} = F.   (5)

3.4 Model specification and estimation

As apparent in Sections 3.2 and 3.3, the error and reporting models can take on many specifications. Without linked data, analysts cannot use exploratory data analysis to inform the model choice. Instead, we recommend that analysts posit scientifically defensible measurement error models, and make post-hoc checks of the sensibility of analyses from those models. We demonstrate this approach in Section 4. For example, analysts can check whether or not the predicted probabilities of errors implied by the model seem plausible. As another diagnostic, analysts can compare the distribution of the imputed values of (Y | X) in D_E to the empirical distribution of (Y | X) in D_G. This is akin to diagnostics in multiple imputation for missing data that compare imputed and observed values (Abayomi et al., 2008). When these distributions differ substantially, it suggests the measurement error model specification (or possibly the true data model) is inadequate. Such diagnostic checks only can reveal problems with the model specification; they do not indicate that a particular specification is correct.

More generally, it is prudent to keep the restrictions on the number of identifiable parameters in mind when specifying the models. At most one can identify the equivalent of (d_Z − 1)d_X parameters in the combined model for (E_i, Z_i | X_i). Generally, for ease of specification and interpretation, we favor rich error models, e.g., with M_i including variables in X_i and Y_i, coupled with simple reporting models like those in Section 3.3.

The exact strategy for estimating the model depends on the features of D_G and D_E. When both datasets can be treated as simple random samples, we suggest using a fully Bayesian approach after concatenating D_G and D_E. Here, one can use typical prior distributions for the true data and error models.
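To make the factorization in (1) computationally concrete, the sketch below evaluates Pr(Y_i = k | Z_i, X_i) for one record by combining a true data model, an error model, and the uniform reporting model in (3). All probability values are illustrative placeholders, not estimates from the paper.

```python
import numpy as np

d_Y = d_Z = 4  # equal numbers of levels, for simplicity

# Illustrative sub-model values for a single covariate pattern x:
pr_y = np.array([0.5, 0.3, 0.15, 0.05])  # Pr(Y = k | X = x)
pr_e = np.full(d_Y, 0.05)                # Pr(E = 1 | Y = k, X = x)

def posterior_y(z):
    """Pr(Y = k | Z = z, X = x), combining the three sub-models in (1)
    with the uniform reporting model (3)."""
    post = np.empty(d_Y)
    for k in range(d_Y):
        if k == z:
            # No error: Z = Y with probability 1 when E = 0.
            post[k] = pr_y[k] * (1 - pr_e[k])
        else:
            # Error: Z uniform over the d_Z - 1 other levels.
            post[k] = pr_y[k] * pr_e[k] / (d_Z - 1)
    return post / post.sum()

# A record reporting the first level (e.g., BA) is, under these
# inputs, very likely to truly hold that level.
post = posterior_y(0)
```

In a Gibbs sampler, one would draw the imputed Y_i from exactly this conditional distribution, with the sub-model probabilities replaced by current parameter draws.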
For reporting models like those in (4) and (5), it is convenient to use independent Dirichlet priors for each (p_k(1), ..., p_k(k − 1), p_k(k + 1), ..., p_k(d_Z)). In the NSCG/ACS application, we create prior distributions for the reporting models using the information from Table 1. Absent such information, analysts can use uniform prior distributions.

When it does not make sense to concatenate D_G and D_E, it can be convenient to use a multi-stage estimation strategy. When imputing missing Y in D_E, all of the information needed from D_G is represented by the parameters of the true data model, θ. Hence, we first can construct a (possibly approximate) posterior distribution of θ using only D_G. We then sample many draws from this distribution. We plug these draws into the Gibbs sampling steps for a Bayesian predictive distribution for (Y_i | Z_i, X_i, θ) for the cases in D_E, thereby generating the multiple imputations. We describe the Gibbs sampler for this step for the NSCG/ACS application in the supplementary material.

4 Adjusting for Reporting Error in Education in the 2010 ACS

We now use the framework to adjust inferences for potential reporting error in educational attainment in the 2010 ACS, using the public use microdata for the 2010 NSCG as the gold standard file D_G. We consider two main analyses that could be affected by reporting error in education. First, we estimate from the ACS the number of science and engineering degrees awarded to women. We base the estimate on an indicator in the ACS for whether or not each individual has such a degree. Second, we examine average incomes across degrees. This focus is motivated in part by the findings of Black et al. (2006, 2008), who found that apparent wage gaps in the 1990 census long form data could be explained by reporting errors in education.

As D_E, we use the subset of ACS microdata that includes only individuals who reported a bachelor's degree or higher and are under age 76.
The resulting sample size is n_E = 600,… . As X, we include gender, age group (24 and younger, 25-39, 40-54, and 55 and older), and an indicator for whether the individual's race is black or something else. In the NSCG, we discarded 38 records with race suppressed, leaving a sample size of n_G = 77,… . We generate M = 50 multiple imputations of the plausible true education values in the 2010 ACS, which we then analyze using the methods of Rubin (1987). For all specifications, the true data model is a saturated multinomial distribution for the five values of Y for each combination of X. We begin by describing how we estimate the parameters of the true data distribution, accounting for the informative sampling design of the NSCG.

4.1 Accounting for the informative sampling design of the NSCG

The 2010 NSCG uses reported education in the 2010 ACS as a stratification variable (Fesco et al., 2012; Finamore, 2013). Its unweighted percentages can over-represent or under-represent degree types in the population; this is most obviously the case for individuals without a college degree (Y_i = 5). We need to account for this informative sampling when estimating parameters of the true data model. We do so with a two stage approach. First, we use survey-weighted inferences to estimate population totals of (Y | X) from the 2010 NSCG. Second, we turn these estimates into an approximate Bayesian posterior distribution for input to fitting the measurement error models used to impute plausible values of Y_i for individuals in the ACS. We now describe this process, which can be used generally when D_G is collected via a complex survey design.

Suppose for the moment that d_Y = d_Z. This is not the case when D_E is the ACS (where d_Z = 4) and D_G is the NSCG (where d_Y = 5); however, we start here to fix ideas. For all possible combinations x, let θ_xk = Pr(Y = k | X = x), and let θ_x = (θ_x1, ..., θ_x d_Y). We seek to use D_G to specify f(θ | X, Y).
To do so, we first parameterize θ_xk = T_xk / Σ_{j=1}^{d_Y} T_xj, where T_xk is the population count of individuals with (X_i = x, Y_i = k). We estimate T_x = (T_x1, ..., T_x d_Y) and the associated covariance matrix of the estimator using standard survey-weighted estimation. Let w_i be the sample weight for all i in D_G. We compute the estimated total and associated variance for each x and k as

  T̂_xk = Σ_{i=1}^{n_G} w_i I(X_i = x, Y_i = k),   (6)

  Var̂(T̂_xk) = [n_G / (n_G − 1)] Σ_{i=1}^{n_G} ( w_i I(X_i = x, Y_i = k) − T̂_xk / n_G )².   (7)

For each k and l, with l ≠ k, we also compute the estimated covariance,

  Cov̂(T̂_xk, T̂_xl) = [n_G / (n_G − 1)] Σ_{i=1}^{n_G} [ ( w_i I(X_i = x, Y_i = k) − T̂_xk / n_G )
                          × ( w_i I(X_i = x, Y_i = l) − T̂_xl / n_G ) ].   (8)

The variance and covariance estimators are the design-based estimators for probability proportional to size sampling with replacement, as is typical of multi-stage complex surveys (Lohr, 2010).

Switching now to a Bayesian modeling perspective, we assume that T_x ~ Log-Normal(μ_x, τ_x), so as to ensure a distribution with positive values for all true totals. We select (μ_x, τ_x) so that each E(T_xk) = T̂_xk and Var(T_x) = Σ̂(T̂_x), the estimated covariance matrix with elements defined by (7) and (8). These are derived from moment matching (Tarmast, 2001). We have

  μ_x[j] = log(T̂_xj) − τ_x[j, j] / 2,   (9)

  τ_x[j, j] = log( 1 + Σ̂_x[j, j] / (T̂_xj)² ),   (10)

  τ_x[j, i] = log( 1 + Σ̂_x[j, i] / (T̂_xj · T̂_xi) ),   (11)

where the notation [j, i] denotes an element in row j and column i of the matrix. We draw T*_x from this log-normal distribution, and transform to draws θ*_x.

Since the 2010 NSCG does not include individuals who claim in the ACS to have less than a bachelor's degree, we cannot use D_G directly to estimate T_x5.
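A sketch of these two steps for a single cell x, using made-up weights: the design-based total and variance in (6)-(7), followed by the univariate analogue of the moment-matched log-normal parameters in (9)-(10). By construction the matched distribution reproduces the design-based moments exactly.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical survey weights and an indicator of (X_i = x, Y_i = k)
w = rng.uniform(50.0, 150.0, size=1000)
ind = rng.random(1000) < 0.3

n_G = len(w)
T_hat = np.sum(w * ind)                                            # eq. (6)
var_hat = n_G / (n_G - 1) * np.sum((w * ind - T_hat / n_G) ** 2)   # eq. (7)

# Moment-matched log-normal parameters (univariate case of (9)-(10)):
tau2 = np.log(1.0 + var_hat / T_hat**2)
mu = np.log(T_hat) - tau2 / 2.0

# Sanity check: recover the design-based moments from (mu, tau2).
mean_back = np.exp(mu + tau2 / 2.0)
var_back = (np.exp(tau2) - 1.0) * np.exp(2.0 * mu + tau2)
```

The multivariate version replaces tau2 with the full matrix τ_x via (10)-(11), after which draws T*_x come from the corresponding multivariate log-normal.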
Instead, we estimate T_{x+} = T_x1 + T_x2 + T_x3 + T_x4 + T_x5 using the ACS data, and estimate (T_x1, T_x2, T_x3, T_x4) from the NSCG using the method described previously; this leads to an estimate for T_x5. More precisely, let the ACS design-based estimator for T_{x+} be T̂_{x+}, with design-based variance estimate σ̂²(T̂_{x+}). We sample a value T*_{x+} ~ Normal(T̂_{x+}, σ̂²(T̂_{x+})). Using an independent sample of values of (T*_x1, ..., T*_x4) from the NSCG, we compute T*_x5 = T*_{x+} − Σ_{j=1}^{4} T*_xj, and set T*_x = (T*_x1, ..., T*_x5). We repeat these steps 10,000 times. We then compute the mean and covariance matrix of the 10,000 draws, which we again plug into (9)-(11). The resulting log-normal distribution is the approximate posterior distribution of θ_x. We include an example of this entire procedure in the supplementary material.

4.2 Measurement error model specifications

The two sets of measurement error models include four that use flat prior distributions and three that use informative prior distributions based on the 1993 linked data. For all error models, we use a logistic regression of E_i on various main effects and interactions of Y_i and X_i. For all reporting models, we use categorical distributions with probabilities that depend on Y_i and possibly X_i. The four models with flat prior distributions are summarized in Table 2.

Table 2: Error and reporting model specifications with flat prior distributions.

             Error model: expression for M_i' β                     Reporting model: Pr(Z_i | Y_i = k, E_i = 1)
  Model 1    β_0 + Σ_{k=2} β_k I(Y_i = k)                           Categorical(p_k(1), ..., p_k(4))
  Model 2    β_0 + Σ_{k=2} β_k^{(M)} I(Y_i = k, X_{i,sex} = M)      Categorical(p_k(1), ..., p_k(4))
  Model 3    β_0 + Σ_{k=2} β_k^{(no)} I(Y_i = k, X_{i,black} = no)
               + Σ_{k=1} β_k^{(yes)} I(Y_i = k, X_{i,black} = yes)  Categorical(p_k(1), ..., p_k(4))
  Model 4    β_0 + Σ_{k=2} β_k^{(M)} I(Y_i = k, X_{i,sex} = M)      Categorical(p_{M,k}(1), ..., p_{M,k}(4)) if X_{i,sex} = M;
               + Σ_{k=1} β_k^{(F)} I(Y_i = k, X_{i,sex} = F)        Categorical(p_{F,k}(1), ..., p_{F,k}(4)) if X_{i,sex} = F

In Model 1, the error and reporting models depend only on Y_i. Models 2 and 3 keep the reporting model as in (4) but expand the error model. In Model 2, the probability of a reporting error can vary with Y_i and sex (X_{i,sex}).

Table 3: Summary of informative prior specifications for the 2010 NSCG/ACS analysis for males with bachelor's degrees.

             Error rate              Reporting probabilities (p_{M,1}(2), p_{M,1}(3), p_{M,1}(4))
  Model 4    Beta(1, 1)              Dirichlet(1, 1, 1)
  Model 5    Beta(.76, 14.24)        Dirichlet(3.54, 1.27, 0.19)
  Model 6    Beta(2724.2, 50862)     Dirichlet(2235.3, 799.7, 123.1)
  Model 7    Beta(500, 99500)        Dirichlet(1, 1, 1)
In Model 3, error probabilities can vary with Y_i and the indicator for black race (X_{i,black}). In Model 4, the error and reporting models both depend on Y and sex.

For Models 5-7, we use the specification in Model 4 and incorporate prior information about the measurement errors from the 1993 linked data. In constructing the priors, we first remove records that have been flagged as having missing education that has been imputed, because these imputations might not closely reflect the actual education values (Black et al., 2003). Table 3 displays the prior distributions for males with bachelor's degrees. Details on how we arrive at these and other groups' prior specifications are in the supplementary material; here, we summarize briefly.

For Model 5, we set the prior distributions for each β_k^{(x)} so that the error rates are centered at the estimate from the 1993 linked data. We also require the central 95% probability interval of the prior distribution on each error rate to be close to (·, ·). For the p_{M,k}(z) and p_{F,k}(z), we center most of the prior distributions at the corresponding estimates from the 1993 linked data. We require the central 95% probability interval of each prior distribution to have support on values of p_{·,k}(z) within ±.10 of the 1993 point estimate, truncating at zero or one as needed. One exception is the reporting probabilities for those with "no college degree" who report "professional" degree, which we center at half the 1993 estimate. The Census Bureau has improved the clarity of the definition of Professional in the 20 years since the 1990 long form, as discussed in the prior specification section of the supplementary material.

For Model 6, we use the same prior means as in Model 5 for both error and reporting models. However, we substantially tighten the prior distributions to make the prior variance accord with the uncertainty in the point estimates from the 1993 linked data. We do so by using prior sample sizes that match those from the 1993 NSCG. For example, the 1993 NSCG included 53,586 males with bachelor's degrees (excluding those records who had their Census education imputed). We therefore use a Beta(2724.2, 50862) prior distribution for the error rate for this group, as in Table 3. We similarly increase the prior sample sizes for the reporting probabilities to match the 1993 NSCG sample sizes.

Model 7 departs from the 1993 linked data estimates and encodes a strong prior belief that almost no one misreports their education except for haphazard mistakes. Here, we set the prior mean for the probability of misreporting education to .005 for all demographic groups. We use a prior sample size of 100,000, making the prior distribution concentrate strongly around .005. For the reporting probabilities, we use a non-informative prior distribution for convenience, since the estimates of the reporting probabilities are strongly influenced by the concentrated prior distributions on the error rates.

Finally, for comparison purposes, we also fit the model based on a conditional independence assumption (CIA). To impute Y_i for individuals in the ACS under the CIA, we sample θ* and then impute (Y* | θ*, X) from the true data model.
Here, we do not use the reported value of Z_i in the imputations.

4.3 Empirical results

We first examine what each model suggests about the extent and nature of the measurement errors in the 2010 ACS. We then use the models to assess sensitivity of results about the substantive questions related to number of degrees and income.
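The point estimates and intervals reported below combine the M completed-data analyses with Rubin's (1987) rules. A minimal sketch, with made-up per-imputation estimates rather than values from the application:

```python
import numpy as np

def rubin_combine(q, u):
    """Rubin's (1987) combining rules for M multiply-imputed analyses.

    q : per-imputation point estimates q_1, ..., q_M
    u : per-imputation variance estimates u_1, ..., u_M
    Returns the MI point estimate, total variance, and degrees of
    freedom for the reference t distribution.
    """
    q, u = np.asarray(q, float), np.asarray(u, float)
    M = len(q)
    qbar = q.mean()                     # MI point estimate
    ubar = u.mean()                     # within-imputation variance
    b = q.var(ddof=1)                   # between-imputation variance
    T = ubar + (1 + 1 / M) * b          # total variance
    nu = (M - 1) * (1 + ubar / ((1 + 1 / M) * b)) ** 2  # Rubin's df
    return qbar, T, nu

# The application uses M = 50; five fabricated completed-data
# estimates suffice to illustrate. A 95% interval is then
# qbar +/- t_{nu, .975} * sqrt(T).
qbar, T, nu = rubin_combine([0.10, 0.12, 0.11, 0.13, 0.09],
                            [0.0004, 0.0005, 0.0004, 0.0006, 0.0005])
```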
Table 4 displays the multiple imputation point estimates and 95% confidence intervals for the proportions of errors by gender and NSCG education, obtained from the M = 50 draws of E_i for all individuals in D_E. We begin by comparing results for the set of models with flat prior distributions (Models 1-4) and the CIA model, then move to the set of models with informative prior distributions (Models 5-7).

The CIA model suggests extremely large error percentages, especially for the highest education levels. These rates seem unlikely to be reality, leading us to reject the CIA model. The overall error rates for Models 1-4 are similar and more realistic than those from the CIA model. The differences in error estimates between Model 2 and Model 1 suggest that the probability of error depends on sex. Comparing results for Model 3 and Model 1, however, we see little evidence of important race effects on the propensity to make errors.

Model 4 generalizes Model 2 by allowing the reporting probabilities to vary by sex. If these probabilities were similar across sex in reality, we would expect the two models to produce similar results. However, the estimated error rates are fairly different; for example, the estimated proportion of errors for female professionals from Model 4 is about double that from Model 2. To determine where the models differ most, we examine the estimated reporting probabilities, displayed in Table 5. Model 4 estimates some significant differences in reporting probabilities by gender. For example, males with bachelor's degrees who make a reporting error are estimated to report a master's degree with probability .96, whereas females with bachelor's degrees who make a reporting error are estimated to report a master's degree with probability .67 and a professional degree with probability .30. Other large differences exist for professional degree holders.
Females with professional degrees who make a reporting error are most likely to report a bachelor's degree, whereas men with professional degrees who make a reporting error are most likely to report a master's degree or Ph.D. We note that some of the estimates for Model 4 are based on small sample sizes, which explains the large standard errors.

Turning to Models 5 – 7, we can see the impact of the informative prior distributions by comparing results in Table 4 under these models to those for Model 4. Moving from Model 4 to Model 5, the most noticeable differences are for women with a Ph.D. and men with a master's degree, for whom Model 5 suggests lower error rates. These groups have smaller sample sizes, so the data do not swamp the effects of the prior distribution. When we make the prior sample sizes very large, as in Models 6 and 7, the information in the prior distribution tends to overwhelm the information in the data. We provide a more thorough investigation of the impact of the prior specifications in the supplementary material.

Of course, we cannot be certain which model most closely reflects the true measurement error mechanism. The best we can do is perform diagnostic tests to see which models, if any, should be discounted as not adequately describing the observed data. For each ACS imputed dataset D_E^(m) under each model, we compute the sample proportions, π̂_xk^(m), and corresponding multiple imputation 95% confidence intervals for all 16 × 5 = 80 unique values of (X, Y). We determine how many of the 80 estimated population percentages of Y | X computed from the 2010 NSCG (using the estimated T̂_x+ from the ACS to back into an estimate of T̂_x) fall within the multiple imputation 95% confidence intervals. Models that yield low rates do not describe the data accurately. For Model 1, 73 of the 80 NSCG population share estimates are contained in the ACS multiple imputation intervals. Corresponding counts are 75 for Model 2, 71 for Model 3, and 76 for Model 4.
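As a rough sketch of this coverage diagnostic (with hypothetical function names and data shapes; the paper's actual implementation is in Matlab), one combines the M completed-data estimates with Rubin's (1987) rules and counts how many gold-standard proportions fall inside the resulting intervals:

```python
import math
from statistics import mean, variance

def mi_interval(estimates, variances, z=1.96):
    """95% interval for a multiply-imputed quantity via Rubin's (1987)
    combining rules. `estimates` and `variances` hold the M completed-data
    point estimates and their variances. A normal reference distribution
    is used here for simplicity; Rubin's t degrees of freedom could be
    substituted for z."""
    m = len(estimates)
    q_bar = mean(estimates)              # MI point estimate
    u_bar = mean(variances)              # within-imputation variance
    b = variance(estimates)              # between-imputation variance
    t_var = u_bar + (1.0 + 1.0 / m) * b  # Rubin's total variance
    half = z * math.sqrt(t_var)
    return q_bar - half, q_bar + half

def coverage_count(gold_props, mi_estimates, mi_variances):
    """Count how many gold-standard proportions (e.g., NSCG shares of
    Y | X) fall inside the corresponding MI 95% intervals."""
    count = 0
    for gold, ests, var_list in zip(gold_props, mi_estimates, mi_variances):
        lo, hi = mi_interval(ests, var_list)
        count += int(lo <= gold <= hi)
    return count
```

In the paper's check, a model whose count over the 80 cells is low relative to the others is judged not to describe the observed data well.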
These results suggest that Model 1 and Model 3 may be inferior to Model 2 and Model 4. For the models with informative prior distributions, the counts are 74 for Model 5, 67 for Model 6, and 54 for Model 7. Although the prior beliefs in Models 6 and 7 seem plausible at first glance, the diagnostic suggests that they do not describe the 2010 data distributions as well as Models 4 and 5.

Considering these results as well as the diagnostic check, if we had to choose one model we would select Model 5. It seems plausible that the probability of misreporting education, as well as the reported value itself when errors are made, depends on both sex and true education level. Additionally, the prior distribution from the 1993 linked data pulls estimates for groups with small sample sizes toward measurement error distributions that seem more plausible on their face. However, one need not use the data fusion framework for measurement error to select a single model; rather, one can use the framework to examine the sensitivity of analyses to the different specifications.

Figure 2 displays the multiply-imputed, survey-weighted inferences for the total number of women with science and engineering degrees, computed using the ACS-specific indicator variable. We show results for Models 4 – 7, the CIA model, and the ACS data without any adjustment for misreporting education. The confidence intervals for Model 4 and Model 5 overlap substantially, suggesting not much practical difference in choosing among these models. However, both are noticeably different from the other models, especially for the Ph.D. and professional degrees. As the prior distributions on the error rates get stronger, the estimated counts increase towards the unadjusted ACS estimates.
[Figure 2: Estimated total number of science and engineering degrees awarded to women under the unadjusted ACS, the CIA model, and Models 4 – 7, with separate panels for bachelor's, master's, professional, and Ph.D. degrees.]
The framework presented in this article offers analysts tools for using the information in a high quality, separate data source to adjust for measurement errors in the database of interest. Key to the framework is to replace the conditional independence assumptions typically used in data fusion with carefully considered measurement error models. This avoids sacrificing information and facilitates analysis of the sensitivity of conclusions to alternative measurement error specifications. Analysts can use diagnostic tests to rule out some measurement error models, and perform sensitivity analyses on others to identify reasonable candidates.
[Figure 3: Estimated average income by education level (BA, MA, Prof., Ph.D., none) under the unadjusted ACS, the CIA model, and Models 4 – 7.]
Figure 3: Multiple imputation point and 95% confidence interval estimates for the average income within each education level. The ACS estimate is the survey-weighted estimate based on the reported education level in the 2010 ACS.

Besides survey sampling contexts like the one considered here involving the ACS and NSCG, the framework offers potential approaches for dealing with possible measurement errors in organic (big) data. This is increasingly important, as data stewards and analysts consider replacing or supplementing high quality but expensive surveys with inexpensive and large-sample organic data. Often, scant attention is paid to the potential impact of measurement errors on inferences from those data. The framework could be used with high quality, validated surveys as the gold standard data, allowing for adjustments to the error-prone organic data.
[Figure 4: Estimated average income by model specification for men (small marker) and women (large marker), with separate panels for bachelor's, master's, professional, and Ph.D. degrees.]
Figure 4: Multiple imputation point and 95% confidence interval estimates for the average income for men and women within each education level. The ACS estimate is the survey-weighted estimate based on the reported education level in the 2010 ACS.
Supplementary Materials
All supplemental files listed below are contained in a single .zip file (supplementary.zip) and can be obtained via a single download.
Supplementary Results: Supplementary details and additional results for the paper. (supp-material-final.pdf)

ACS data:

Matlab code: Matlab files containing the main code MAINCODE edu 2010app report1993.m and the helper functions design.m and dirsamp.m, as well as the parameter files mu.mat and tauSPD.mat. (code.zip)

Prior Distributions: Csv files for the priors used in Model 5, read in by the main Matlab code: femalereportprior1993.csv, malereportprior1993.csv, and betareportprior.csv. (priors.zip)
References
Abayomi, K., Gelman, A., and Levy, M. (2008), “Diagnostics for multivariate impu-tations,”
Journal of the Royal Statistical Society: Series C (Applied Statistics) , 57,273–291.Black, D., Haviland, A., Sanders, S., and Taylor, L. (2006), “Why do minority menearn less? A study of wage differentials among the highly educated,”
The Review ofEconomics and Statistics , 88, 300–313.Black, D., Sanders, S., and Taylor, L. (2003), “Measurement of higher education inthe census and Current Population Survey,”
Journal of the American StatisticalAssociation , 98, 545–554.Black, D. A., Haviland, A. M., Sanders, S. G., and Taylor, L. J. (2008), “Gender wagedisparities among the highly educated,”
Journal of Human Resources , 43, 630–659.Carrig, M., Manrique-Vallier, D., Ranby, K., Reiter, J. P., and Hoyle, R. (2015), “Amultiple imputation-based method for the retrospective harmonization of data sets,”
Multivariate Behavioral Research , 50, 383–397.Curran, P. J. and Hussong, A. M. (2009), “Integrative data analysis: The simultaneousanalysis of multiple data sets,”
Psychological Methods , 14, 81–100.D’Orazio, M., Di Zio, M., and Scanu, M. (2006),
Statistical Matching: Theory andPractice , Hoboken, NJ: Wiley.Dunson, D. B. and Xing, C. (2009), “Nonparametric Bayes modeling of multivariatecategorical data,”
Journal of the American Statistical Association , 104, 1042–1051.Fesco, R. S., Frase, M. J., and Kannankutty, N. (2012), “Using the American Commu-nity Survey as the sampling frame for the National Survey of College Graduates,”Working Paper NCSES 12-201, National Science Foundation, National Center forScience and Engineering Statistics, Arlington, VA.27inamore, J. (2013),
National Survey of College Graduates: About The Survey , Na-tional Center for Science and Engineering Statistics.Fosdick, B. K., DeYoreo, M., and Reiter, J. P. (2016), “Categorical data fusion usingauxiliary information,”
Annals of Applied Statistics , To appear.He, Y., Landrum, M. B., and Zaslavksy, A. M. (2014), “Combining information fromtwo data sources with misreporting and incompleteness to assess hospice-use amongcancer patients: a multiple imputation appraoch,”
Statistics in Medicine , 33, 3710–3724.Hirano, K., Imbens, G., Ridder, G., and Rubin, D. (2001), “Combining panel data setswith attrition and refreshment samples,”
Econometrica , 69, 1645–1659.Kim, H. J., Cox, L. H., Karr, A. F., Reiter, J. P., and Wang, Q. (2015), “Simultane-ous edit-imputation for continuous microdata,”
Journal of the American StatisticalAssociation , 110, 987 – 999.Lohr, S. L. (2010),
Sampling: Design and Analysis , Boston, MA: Brooks/Cole, 2nd ed.Manrique-Vallier, D. and Reiter, J. P. (2016), “Bayesian simultaneous edit and impu-tation for multivariate categorical data,”
Journal of the American Statistical Asso-ciation , To appear.Moriarity, C. and Scheuren, F. (2001), “Statistical matching: A paradigm for assessingthe uncertainty in the procedure,”
Journal of Official Statistics , 17, 407 – 422.National Science Foundation (1993), “National Survey of College Graduates, 1993,”http://doi.org/10.3886/ICPSR06880.v1, iCPSR06880-v1. Ann Arbor, MI: Inter-university Consortium for Political and Social Research [distributor], 2014-10-02.Pepe, M. S. (1992), “Inference using surrogate outcome data and a validation sample,”
Biometrika , 79, 355 – 365.Raghunathan, T. E. (2006), “Combining information from multiple surveys for assess-ing health disparities,”
Allgemeines Statistisches Archiv , 90, 515–526.Rassler, S. (2002),
Statistical Matching , New York: Springer.Reiter, J. (2008), “Multiple imputation when records used for imputation are not usedor disseminated for analysis,”
Biometrika , 95, 933–946.Reiter, J. P. (2012), “Bayesian finite population imputation for data fusion,”
StatisticaSinica , 22, 795 – 811.Rubin, D. B. (1986), “Statistical matching using file concatenation with adjustedweights and multiple imputations,”
Journal of Business & Economic Statistics , 4,87–94. 28 (1987),
Multiple Imputation for Nonresponse in Surveys , New York: John Wiley &Sons.Schenker, N. and Raghunathan, T. E. (2007), “Combining information from multiplesurveys to enhance estimation of measures of health,”
Statistics in Medicine , 26,1802–1811.Schenker, N., Raghunathan, T. E., and Bondarenko, I. (2010), “Improving on analysesof self-reported data in a large-scale health survey by using information from anexamination-based survey,”
Statistics in Medicine , 29, 533–545.Schifeling, T. A., Cheng, C., Reiter, J. P., and Hillygus, D. S. (2015), “Accountingfor nonignorable unit nonresponse and attrition in panel studies with refreshmentsamples,”
Journal of Survey Statistics and Methodology , 3, 265 – 295.Si, Y. and Reiter, J. (2013), “Nonparametric Bayesian multiple imputation for incom-plete categorical variables in large-scale assessment surveys,”
Journal of Educationaland Behavioral Statistics , 38, 499–521.Si, Y., Reiter, J. P., and Hillygus, D. S. (2015), “Semi-parametric selection modelsfor potentially non-ignorable attrition in panel studies with Refreshment Samples,”
Political Analysis , 23, 92–112.Siddique, J., Reiter, J. P., Brincks, A., Gibbons, R. D., Crespi, C. M., and Brown,C. H. (2015), “Multiple imputation for harmonizing longitudinal non-commensuratemeasures in individual participant data meta-analysis,”
Statistics in Medicine , 34,3399–3414.Tarmast, G. (2001), “Multivariate log-normal distribution,” in
International StatisticalInstitute: Seoul 53rd Session .Yucel, R. M. and Zaslavsky, A. M. (2005), “Imputation of binary treatment variableswith measurement error in administrative data,”
Journal of the American Statistical Association, 100, 1123–1132.

Table 4: Error rate estimates from different model specifications. Models 1 – 7 are run for 100,000 MCMC iterations. We save M = 50 completed datasets under each model. For each dataset, we compute the estimated overall error rate, the estimated error rate by gender and imputed Y, and associated variances using ratio estimators that incorporate the ACS final survey weights. The overall estimate is reported once per model.

                 Y=BA            Y=MA            Y=Prof.         Y=PhD           Overall
CIA model
  Male           .37 (.36, .37)  .76 (.75, .76)  .91 (.91, .92)  .94 (.93, .95)  .57 (.55, .58)
  Female         .35 (.35, .36)  .72 (.71, .72)  .95 (.94, .95)  .97 (.96, .97)
Model 1
  Male           .05 (.04, .06)  .10 (.08, .11)  .18 (.15, .21)  .27 (.23, .31)  .17 (.16, .19)
  Female         .05 (.05, .06)  .09 (.08, .10)  .18 (.15, .21)  .28 (.24, .32)
Model 2
  Male           .05 (.04, .06)  .18 (.16, .21)  .27 (.18, .37)  .36 (.30, .42)  .20 (.18, .21)
  Female         .05 (.05, .06)  .12 (.10, .14)  .26 (.20, .33)  .41 (.29, .53)
Model 3
  Male           .05 (.04, .06)  .09 (.08, .11)  .17 (.14, .20)  .25 (.21, .30)  .17 (.16, .19)
  Female         .05 (.05, .06)  .09 (.08, .10)  .17 (.14, .20)  .26 (.21, .31)
Model 4
  Male           .05 (.04, .06)  .19 (.16, .23)  .36 (.26, .46)  .36 (.27, .45)  .22 (.20, .24)
  Female         .09 (.08, .10)  .14 (.11, .17)  .52 (.44, .59)  .55 (.40, .70)
Model 5
  Male           .07 (.06, .08)  .19 (.16, .22)  .23 (.14, .32)  .34 (.27, .41)  .22 (.20, .24)
  Female         .09 (.08, .10)  .12 (.09, .15)  .50 (.43, .57)  .31 (.17, .46)
Model 6
  Male           .05 (.05, .05)  .09 (.08, .10)  .10 (.09, .11)  .10 (.09, .11)  .16 (.14, .17)
  Female         .05 (.04, .05)  .06 (.05, .07)  .16 (.14, .18)  .07 (.06, .09)
Model 7
  Male           .01 (.01, .01)  .01 (.00, .01)  .00 (.00, .01)  .01 (.00, .01)  .11 (.09, .13)
  Female         .01 (.01, .01)  .01 (.01, .01)  .01 (.00, .01)  .01 (.00, .01)

Table 5: Estimated probabilities of reporting Z given true education Y for individuals who make a reporting error, under Model 2 and under Model 4 by sex.

                      Z=BA             Z=MA             Z=Prof.          Z=PhD
Y=BA
  Model 2             -                .95 (.87, 1.00)  .04 (.00, .11)   .01 (.00, .03)
  Model 4 - Male      -                .96 (.90, 1.00)  .02 (.00, .07)   .02 (.00, .05)
  Model 4 - Female    -                .67 (.58, .76)   .30 (.22, .38)   .03 (.00, .07)
Y=MA
  Model 2             .02 (.00, .06)   -                .51 (.43, .59)   .47 (.39, .55)
  Model 4 - Male      .04 (.00, .11)   -                .57 (.48, .66)   .39 (.31, .47)
  Model 4 - Female    .11 (.00, .25)   -                .39 (.26, .52)   .50 (.40, .61)
Y=Prof.
  Model 2             .05 (.00, .16)   .69 (.54, .83)   -                .26 (.14, .38)
  Model 4 - Male      .02 (.00, .06)   .69 (.44, .94)   -                .29 (.04, .54)
  Model 4 - Female    .91 (.79, 1.00)  .06 (.00, .16)   -                .04 (.00, .10)
Y=PhD
  Model 2             .01 (.00, .04)   .39 (.15, .63)   .60 (.36, .83)   -
  Model 4 - Male      .01 (.00, .05)   .21 (.02, .39)   .78 (.60, .96)   -
  Model 4 - Female    .10 (.00, .30)   .77 (.50, 1.00)  .13 (.00, .34)   -
Y=None
  Model 2             .95 (.95, .96)   .03 (.03, .04)   .01 (.01, .01)   .00 (.00, .00)
  Model 4 - Male      .97 (.96, .97)   .03 (.02, .03)   .01 (.00, .01)   .00 (.00, .00)
  Model 4 - Female    .96 (.95, .97)   .04 (.03, .05)   .00 (.00, .00)   .00 (.00, .00)
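The ratio estimators described in the caption of Table 4 can be sketched as follows (a simplified illustration with hypothetical function names; the authors' Matlab code computes these within each completed dataset using the ACS final survey weights, and Rubin's rules then combine the M = 50 completed-data estimates):

```python
def weighted_error_rate(error_flags, weights):
    """Survey-weighted ratio estimate of an error rate: the weighted
    share of records whose imputed error indicator E_i equals 1."""
    num = sum(w for e, w in zip(error_flags, weights) if e == 1)
    den = sum(weights)
    return num / den

def weighted_error_rate_by_group(error_flags, weights, groups):
    """Error rate within each group, e.g., gender crossed with imputed Y."""
    rates = {}
    for g in set(groups):
        flags = [e for e, gg in zip(error_flags, groups) if gg == g]
        wts = [w for w, gg in zip(weights, groups) if gg == g]
        rates[g] = weighted_error_rate(flags, wts)
    return rates
```

Because both the numerator and denominator are weighted sums, this is a ratio estimator, and its variance within each completed dataset can be approximated with standard survey linearization formulas before combining across imputations.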