An Empirical Comparison of Multiple Imputation Methods for Categorical Data
Olanrewaju Akande, Fan Li, and Jerome Reiter*

Abstract
Multiple imputation is a common approach for dealing with missing values in statistical databases. The imputer fills in missing values with draws from predictive models estimated from the observed data, resulting in multiple, completed versions of the database. Researchers have developed a variety of default routines to implement multiple imputation; however, there has been limited research comparing the performance of these methods, particularly for categorical data. We use simulation studies to compare repeated sampling properties of three default multiple imputation methods for categorical data, including chained equations using generalized linear models, chained equations using classification and regression trees, and a fully Bayesian joint distribution based on Dirichlet process mixture models. We base the simulations on categorical data from the American Community Survey. In the circumstances of this study, the results suggest that default chained equations approaches based on generalized linear models are dominated by the default regression tree and Bayesian mixture model approaches. They also suggest competing advantages for the regression tree and Bayesian mixture model approaches, making both reasonable default engines for multiple imputation of categorical data. Supplementary material for this article is available online.

Key words: latent, missing, mixture, nonresponse, tree

* Olanrewaju Akande is PhD student, Department of Statistical Science, Duke University, Durham, NC 27708; Fan Li is Associate Professor, Department of Statistical Science, Duke University, Durham, NC 27708; and Jerome P. Reiter is Professor of Statistical Science, Duke University, Durham, NC 27708. This research was supported by grants from the National Science Foundation (SES-11-31897) and the Alfred P. Sloan Foundation (G-2015-20166003).
1. INTRODUCTION
Nearly all sample surveys and censuses suffer from item nonresponse, i.e., individuals do not answer some questions. It is well known that statistical analyses based on only the complete cases (with all variables observed) or available cases (with all variables observed for the specific analysis) can be problematic. At best, such analyses are inefficient, as they sacrifice information from partially observed responses. At worst, they result in biased inferences when there are systematic differences between the observed data and the missing data (Rubin 1976). See Little and Rubin (2002) for additional discussion of the drawbacks of complete/available case analysis.

A common approach to handling item nonresponse is multiple imputation (MI) (Rubin 1987). In MI, the analyst creates multiple, completed datasets by replacing the missing values with draws from (posterior) predictive distributions estimated with the observed data. The analyst estimates quantities of interest using each completed dataset, and combines the estimates using methods developed by Rubin (1987). This process incorporates the additional uncertainty from the missing data in inferences. The completed datasets also can be released as public use files. For reviews of MI, see Rubin (1996); Schafer (1997); Barnard and Meng (1999); Reiter and Raghunathan (2007); and Harel and Zhou (2007).

When implementing MI, most analysts adopt one of two general classes of strategies: joint modeling (JM) and fully conditional specification (FCS) (van Buuren 2007). In the JM strategy, the analyst specifies a joint distribution for all variables in the data. Imputations are sampled from the implied conditional distributions of the variables with missing data, given all other variables. The JM strategy is appealing in that it accords with the theory in Rubin (1987). In practice, however, specifying accurate joint distributions for large numbers of variables can be challenging.
Indeed, most popular JM approaches — such as AMELIA (Honaker et al. 2011), proc MI in SAS (Yang 2011), and the routine "norm" in R (Schafer 1997) — make a simplifying assumption that the data follow multivariate Gaussian distributions, even for categorical variables. Another option for categorical data is to use a log-linear model; this is implemented in the R package "cat" (Schafer 1997).

In the FCS strategy, the analyst directly specifies and samples from the univariate distributions of each variable conditional on all other variables, without first forming a proper joint distribution. The most popular FCS approach is known as multiple imputation by chained equations (MICE), which uses generalized linear models for each conditional distribution (Raghunathan et al. 2001; van Buuren and Groothuis-Oudshoorn 2011; Royston and White 2011; Su et al. 2011). The main strength of MICE lies in its simplicity and flexibility — one can tailor the predictive model to individual variables, e.g., use a logistic regression for binary variables and a linear regression for continuous variables. However, MICE lacks a theoretical basis in that the specified univariate conditional distributions may not be compatible (Arnold and Press 1989; Gelman and Speed 1993); that is, the set of conditional distributions may not correspond to any joint distribution. Therefore, MICE defines a potentially incompatible Gibbs sampler (Li et al. 2014). Despite this theoretical drawback, simulation-based research suggests that MICE performs well in practice (Brand 1999; van Buuren et al. 1999; Raghunathan et al. 2001; Rubin 2003; van Buuren et al. 2006; van Buuren 2007, to name a few). Software packages implementing MICE include "IVEware" in SAS (Raghunathan et al. 2002), "mice" (van Buuren and Groothuis-Oudshoorn 2011; van Buuren 2012) and "mi" in R (Su et al.
2011), and "mi" and "ice" in Stata (Royston and White 2011).

When the data include a large number of exclusively categorical variables, analysts implementing MI face a challenging task. When using log-linear models for a JM approach, the space of possible models is enormous due to the large number of potential interaction effects, making it difficult to select an imputation model. Similar model selection issues plague MICE approaches, e.g., when specifying a multinomial logistic regression model for a multi-valued variable with missing data. Many analysts default to including main effects only; indeed, this is the default option in most popular MICE packages.

Motivated by these shortcomings, several authors have developed more flexible, default MI engines for both strategies. For JM imputation, Si and Reiter (2013) and Manrique-Vallier and Reiter (2014a,b) use Dirichlet process mixtures of products of multinomial distributions (DPMPM), which are nonparametric Bayesian versions of the latent class models used by Vermunt et al. (2008). The DPMPM imputation routines are implemented in the R software package "NPBayesImpute" (Manrique-Vallier et al. 2014). For FCS imputation, Burgette and Reiter (2010) and van Buuren (2012) use classification and regression trees (CART). MI via trees is available as an option in "mice" in R. Independently, these techniques have shown promise as general-purpose routines for MI, suggesting that they may be preferable to default applications of MICE. However, to our knowledge, these two techniques have never been compared in simulation contexts with high-dimensional, genuine categorical data, nor have they been compared to default implementations of chained equations with generalized linear models. This lack of comparisons makes it difficult, if not impossible, to assess the relative merits of each procedure.

In this article, we compare these three MI methods for categorical data using repeated sampling simulations.
Using hypothetical populations comprising data from the American Community Survey (ACS), we find that the DPMPM and CART imputation engines (which we label MI-DPM and MI-CART, respectively) result in better repeated sampling performance than standard chained equations via generalized linear models (which we label MI-GLM). The results suggest competing advantages for the MI-CART and MI-DPM approaches, depending on the sample size and amount of missing data. Both procedures are sensible default choices for MI of categorical data.

The remainder of this article is organized as follows. In Section 2, we review the three methods for MI with categorical data. In Section 3, we describe the simulation design and results with ACS data. In Section 4, we conclude with general lessons learned and some suggestions for practical implementation.
2. MULTIPLE IMPUTATION FOR CATEGORICAL DATA
We first introduce notation for describing MI in the context of categorical data. Let $Y_{ij} \in \{1, \dots, D_j\}$ be the value of variable $j$ for individual $i$, where $j = 1, \dots, p$ and $i = 1, \dots, n$. For each individual $i$, let $Y_i = (Y_{i1}, \dots, Y_{ip})$. Let $Y = (Y_1, \dots, Y_n)$ be the $n \times p$ matrix comprising the data for all records included in the sample. We write $Y = (Y_{obs}, Y_{mis})$, where $Y_{obs}$ and $Y_{mis}$ are respectively the observed and missing parts of $Y$. We write $Y_{mis} = (Y^{mis}_1, \dots, Y^{mis}_p)$, where $Y^{mis}_j$ represents all missing values for variable $j$, with $j = 1, \dots, p$. Similarly, we write $Y_{obs} = (Y^{obs}_1, \dots, Y^{obs}_p)$ for the corresponding observed data.

In MI, the analyst generates values of $Y_{mis}$ using models estimated with $Y_{obs}$. This results in a completed dataset $Y^{(l)}$. The analyst draws $L$ completed datasets, $\{Y^{(l)}: l = 1, \dots, L\}$, that are available for dissemination or analysis.

In MI-GLM, we generate imputations from a sequence of predictive distributions derived from univariate generalized linear models. The approach proceeds as follows. We first specify an order for imputing the variables. For example, the "mice" package in R by default imputes the variables in the order that they appear in the data matrix (although a different order can be specified by the user), and the "IVEware" package imputes variables in increasing order of the number of missing cases. Suppose that $r \leq p$ variables have missing values, and that the remaining $p - r$ variables are fully complete. Let the ordered list of variables be $(Y_{(1)}, \dots, Y_{(p)})$, where any variable with no missing values is put at the end of the list. We next fill in initial imputations at the missing values. These can be obtained from random draws from the observed marginal distributions for each variable with missing data (the "mice" package in R does this).
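The marginal-draw initialization can be sketched in Python. The paper's simulations use R's "mice"; the standalone helper below, with None marking missingness, is purely illustrative and not the package's code.

```python
import numpy as np

def initial_impute(column, rng):
    """Fill missing entries (None) of a categorical column with draws
    from the empirical (observed-data) marginal distribution."""
    values = list(column)
    observed = [v for v in values if v is not None]
    for i, v in enumerate(values):
        if v is None:
            # sample uniformly from the observed values, i.e., from the
            # empirical marginal distribution of this variable
            values[i] = rng.choice(observed)
    return values
```

Applying this column by column yields the initial completed dataset from which the iterative updating starts.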
Alternatively, for each variable $j$ with missing values, we can sample from the conditional distributions $f(Y_{(j)} \mid Y_{(1)}, \dots, Y_{(j-1)}, Y_{(r+1)}, \dots, Y_{(p)})$, where each model is estimated on the available cases for that model.

With this initial set of completed data, we now use an iterative process akin to a Gibbs sampler to update the imputations. At each iteration $t$ of the updating, we estimate the predictive model $f(Y_{(1)} \mid Y^{obs}_{(1)}, \{Y^{(t-1)}_{(k)}: k > 1\})$, where $Y^{(t-1)}_{(k)}$ includes the set of observed and imputed values for variable $k$ at iteration $t - 1$. We replace $Y^{(t-1)}_{mis(1)}$ with draws from this conditional distribution, $Y^{(t)}_{mis(1)}$. We repeat this process for each variable $j$ with missing data, estimating and imputing from each predictive model $f(Y_{(j)} \mid Y^{obs}_{(j)}, \{Y^{(t)}_{(k)}: k < j\}, \{Y^{(t-1)}_{(k)}: k > j\})$. We repeat the cycle $t$ times. The values at the final iteration are used to create the completed dataset, $Y^{(l)} = (Y_{obs}, Y^{(l)}_{mis})$. The entire process is replicated $L$ times to create the full set of multiple imputations.

For categorical data, most implementations of MI-GLM use logistic regressions for binary variables, multinomial logistic regressions for unordered variables, and cumulative logistic regressions for ordered variables. The default specification of the predictor functions is to include main effects only for all $p - 1$ variables. It is possible to remove predictors from, and add interactions to, certain univariate conditionals using special commands; for "mice" in R, the predictorMatrix argument can be used to add or remove predictors, and the passive imputation option can be used to add interactions. For $Y_{(j)}$ corresponding to unordered categorical variables with large numbers of levels, multinomial logistic regressions can have large numbers of parameters, which can result in unstable coefficient estimates with high standard errors.

CART for categorical data (Breiman et al. 1984), originally developed for classification, can be modified to be an imputation engine. For any outcome $Y_{(j)}$, CART recursively partitions the predictor space in a way that can be effectively represented by a tree structure, with leaves corresponding to the subsets of units.

Figure 1. Illustration of the tree structure in CART. A: African-American; C: Caucasian; H: Hispanic; M: male; F: female. Leaf L1 contains female African-Americans; leaf L2 contains male African-Americans; leaf L3 contains female Caucasians; leaf L4 contains male Caucasians; and leaf L5 contains Hispanics of both genders.

The values of $Y_{(j)}$ in each leaf can be conceived as draws from the conditional distribution of $Y_{(j)}$ for the set of predictor values that satisfy the partitioning criteria that define that leaf. To illustrate, Figure 1 displays a fictional regression tree for an outcome variable and two predictors, gender (male or female) and race/ethnicity (African-American, Caucasian, or Hispanic). To approximate the conditional distribution of $Y_{(j)}$ for a particular gender and race/ethnicity combination, one uses the values in the corresponding leaf. For example, to approximate the distribution of $Y_{(j)}$ for female Caucasians, one uses the observed values of the outcome in leaf L3.

MI-CART operates like MI-GLM, except that CART models are used in place of logistic regressions. Initial imputations can be obtained by running CART on the available cases for $f(Y_{(j)} \mid Y_{(1)}, \dots, Y_{(j-1)}, Y_{(r+1)}, \dots, Y_{(p)})$.
For any value of $(Y_{i(1)}, \dots, Y_{i(j-1)}, Y_{i(r+1)}, \dots, Y_{i(p)})$, including values for previously initialized variables, we sample the initial imputation by dropping down the tree for $Y_{(j)}$ until finding the appropriate leaf, and sample from the values in that leaf. Burgette and Reiter (2010) suggest an additional step before sampling values within the leaves, namely to use Bayesian bootstraps within each leaf (holding constant the number of observations in each leaf) to create the pools from which to sample. The Bayesian bootstrap incorporates uncertainty about the distributions in each leaf. As with MI-GLM, the process is cycled $t$ times, resulting in a completed dataset. We note, however, that this imputation routine does not incorporate uncertainty about the split points in the trees.

CART algorithms, and hence MI-CART, can struggle with categorical predictors with many levels. For example, a categorical predictor with 32 levels results in over two billion potential partitions, making it computationally difficult to run the algorithm in sensible time on standard computers.

The MI-DPM procedure assumes that the distribution of the categorical data can be characterized using a latent class model. We assume that all combinations of variables are possible a priori — that is, there are no structural zeros (Si and Reiter 2013). Each individual belongs to one of $K < \infty$ latent classes. Within each class, all variables follow independent multinomial distributions. To express this as a formal probability model, let $z_i \in \{1, \dots, K\}$ represent the latent class of individual $i$; let $\pi_k = \Pr(z_i = k)$, for $k = 1, \dots, K$, be the probability of being in latent class $k$; and let $\lambda_{kjy} = \Pr(Y_{ij} = y \mid z_i = k)$ be the probability that variable $Y_{ij}$ takes on the value $y$ for records in latent class $k$. Also, let $\pi = (\pi_1, \dots, \pi_K)$, and let $\lambda = \{\lambda_{kjy}: k = 1, \dots, K;\ j = 1, \dots, p;\ y = 1, \dots, D_j\}$. We have

$$Y_{ij} \mid z_i, \lambda \overset{indep}{\sim} \text{Discrete}_{D_j}(\lambda_{z_i j 1}, \dots, \lambda_{z_i j D_j}) \quad \text{for all } i \text{ and } j, \qquad (1)$$

$$z_i \mid \pi \overset{iid}{\sim} \text{Discrete}_K(\pi_1, \dots, \pi_K) \quad \text{for all } i. \qquad (2)$$

We can express the marginal probability of any quantity by averaging over the latent classes; for example,

$$\Pr(Y_{i1} = y_1, \dots, Y_{ip} = y_p \mid \lambda, \pi) = \sum_{k=1}^{K} \pi_k \prod_{j=1}^{p} \lambda_{kjy_j}. \qquad (3)$$

Although variables are independent within classes, averaging over classes can result in dependence among the variables. Dunson and Xing (2009) show that, with large enough $K$, this model is consistent for any joint probability distribution.

For prior distributions, Si and Reiter (2013) and Manrique-Vallier and Reiter (2014b) use a truncated version of the stick-breaking representation of the Dirichlet process proposed by Dunson and Xing (2009),

$$(\lambda_{kj1}, \dots, \lambda_{kjD_j}) \overset{indep}{\sim} \text{Dirichlet}(1, \dots, 1) \qquad (4)$$

$$\pi_k = V_k \prod_{h < k} (1 - V_h) \qquad (5)$$

$$V_k \overset{iid}{\sim} \text{Beta}(1, \alpha) \ \text{for } k = 1, \dots, K - 1, \qquad V_K = 1 \qquad (6)$$

$$\alpha \sim \text{Gamma}(0.25, 0.25). \qquad (7)$$

This prior distribution facilitates efficient computation while allowing the data to dominate the posterior distribution.

To set $K$, we recommend the approach in Si and Reiter (2013) and Manrique-Vallier and Reiter (2014b). The analyst starts with a modest value of $K$. In each iteration of the MCMC, the analyst can compute the number of classes that have at least one individual. If the number of occupied classes reaches $K$ across any of the iterations, then the analyst should increase $K$ and repeat. As long as $K$ is large enough, the estimated posterior distribution is largely insensitive to the value of $K$. The iterative method of selecting $K$ primarily serves to improve computational efficiency, as it avoids estimating parameters for many empty (and essentially irrelevant) classes.

Posterior inferences are obtained via a Gibbs sampler. Missing values are handled within the sampler. Given a draw of the parameters and observed data, one samples a value of the latent class indicator for the record using (2).
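To make the mixture construction concrete, the truncated stick-breaking weights in (5)-(6) and simulation from the model in (1)-(2) can be sketched in Python. This is an illustrative sketch (it assumes, for simplicity, a common number of levels $D$ for all variables); the paper's implementation is the R package "NPBayesImpute", not this code.

```python
import numpy as np

def stick_breaking_weights(alpha, K, rng):
    """Draw mixture weights pi_1..pi_K from the truncated stick-breaking
    construction: V_k ~ Beta(1, alpha), pi_k = V_k * prod_{h<k} (1 - V_h),
    with V_K = 1 so the weights sum to one."""
    V = rng.beta(1.0, alpha, size=K)
    V[-1] = 1.0
    # prod_{h<k} (1 - V_h), with an empty product (= 1) for k = 1
    remaining = np.concatenate(([1.0], np.cumprod(1.0 - V[:-1])))
    return V * remaining

def sample_dpmpm(pi, lam, n, rng):
    """Simulate n records from the latent class model: z_i ~ Discrete(pi),
    then each variable j independently from lam[z_i, j, :]."""
    K, p, D = lam.shape
    z = rng.choice(K, size=n, p=pi)
    Y = np.stack([[rng.choice(D, p=lam[z[i], j]) for j in range(p)]
                  for i in range(n)])
    return z, Y
```

Averaging many such draws approximates the marginal probabilities in (3), since variables are independent only conditionally on the latent class.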
Given a draw of the latent class indicator, one samples values for the missing items using independent draws from (1). The independence of variables within the latent classes facilitates computation and imputation in datasets with many categorical variables. The MI-DPM approach can be adapted for categorical data with structural zeros (Manrique-Vallier and Reiter 2014a), and it is not limited to a maximum number of levels. Because it relies on a coherent joint model, the MI-DPM engine can be the building block for handling more complicated missing data situations, including nonignorable nonresponse (Si et al. 2015) and measurement error (Manrique-Vallier and Reiter 2015).
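The within-sampler imputation step just described can be sketched in Python for a single record: sample the latent class from its posterior given the observed items, then draw each missing item from that class's multinomial. The function name is ours and, for simplicity, a common number of levels is assumed for all variables; this is a sketch, not the "NPBayesImpute" code.

```python
import numpy as np

def impute_record(y, pi, lam, rng):
    """One DPMPM imputation step for a single record. y is a list with
    observed category codes and None at missing items; pi is the class
    probabilities; lam[k, j, :] is class k's multinomial for variable j."""
    K = len(pi)
    # posterior class probabilities: pi_k * prod over observed j of lam[k, j, y_j]
    post = np.array([pi[k] * np.prod([lam[k, j, y[j]]
                                      for j in range(len(y)) if y[j] is not None])
                     for k in range(K)])
    post /= post.sum()
    z = rng.choice(K, p=post)
    # draw each missing item from the sampled class's multinomial
    completed = [rng.choice(lam.shape[2], p=lam[z, j]) if y[j] is None else y[j]
                 for j in range(len(y))]
    return z, completed
```

Because the variables are conditionally independent given the class, the missing items are drawn one at a time without any joint enumeration, which is what keeps imputation tractable with many categorical variables.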
3. REPEATED SAMPLING EVALUATIONS
We base the simulation studies on data from the public use microdata files from the 2012 ACS, available for download from the United States Bureau of the Census. As brief background, the ACS is sent to about 1 in 38 households in the United States, with the goal of enabling estimation of population demographics and housing characteristics at small geographies, e.g., at census tracts. For each sampled household, the survey includes questions about the individuals living in the household (e.g., their ages, races, incomes) and about the characteristics of the housing unit (e.g., number of bedrooms, presence of running water or not, presence of a telephone line or not). The public use data include about 1.5 million housing units.

To construct the population data for the simulation, we treat the housing unit as the unit of analysis. We disregard data on individual people but keep ACS summaries of family compositions that can be viewed as household-level variables (e.g., presence of children or not, number of workers in household). We remove all identification variables (serial number, region, state, area code and division), variables corresponding to allocation flags (binary variables indicating if the corresponding variables have been imputed), variables corresponding to replicate survey weights, and all continuous variables. To simplify comparisons, we eliminate structural zeros — i.e., impossible combinations of categories like a married 12-year-old — by removing variables or levels. We also delete about 830,000 units corresponding to vacant houses and households with single occupants, again to avoid structural zeros. The final data comprise 671,153 housing units and 35 categorical variables. We treat these as a population from which to sample. We treat all variables as unordered categorical variables to simplify imputation modeling and analysis.
The data comprise 11 binary variables (seven of which have marginal probabilities very close to one or zero), 17 variables with three to nine levels, and 7 variables with ten or more levels. The variables are described in Table 1 in the supplementary material.

We simulate the processes of random sampling and nonresponse repeatedly, so as to examine frequentist properties of the three imputation procedures. We take samples at two sizes $n$ and randomly blank either 30% or 45% of the values of each variable independently across variables. This results in data that are missing completely at random (MCAR), which is the most favorable scenario for all MI procedures. We also create a missing at random (MAR) scenario with a 30% missingness rate. In each sample, we use MI-GLM, MI-CART, and MI-DPM to create three sets of $L$ completed datasets. We repeat the process 200 times, each time generating new samples and new missingness patterns. We note that the missing data mechanism tends to result in datasets with 10 or fewer cases with all variables measured, making complete case analysis untenable.

To implement both MI-GLM and MI-CART, we use the "mice" package in R (van Buuren and Groothuis-Oudshoorn 2014). For MI-GLM, we use the default choices for binary and nominal variables with two or more levels: logistic regressions with main effects for all variables. For MI-CART (the "cart" option in "mice"), we use the default arguments for fitting the trees, e.g., at least four observations in any terminal node. We run both MI-GLM and MI-CART for the number of cycles $t$ suggested in van Buuren and Groothuis-Oudshoorn (2011). To implement the DPMPM, we use the "NPBayesImpute" package in R developed by Manrique-Vallier et al. (2014). We set the number of latent classes $K$ to a value that appears sufficiently large based on tuning with initial runs. We run each MCMC chain for a large number of iterations, using an initial portion as burn-in.
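The MCAR blanking step can be sketched in Python. This is illustrative only (function name and NaN coding are our assumptions); the paper's simulations generate missingness within R.

```python
import numpy as np

def blank_mcar(Y, rate, rng):
    """Independently set each entry of the n x p data matrix to missing
    (coded as np.nan in a float copy) with probability `rate` -- the MCAR
    mechanism used in the simulations (rate = 0.30 or 0.45)."""
    Y = Y.astype(float)  # astype copies, so the input is left untouched
    mask = rng.random(Y.shape) < rate
    Y[mask] = np.nan
    return Y
```

Because every entry is blanked independently of both observed and missing values, the resulting missingness is MCAR by construction.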
We evaluate the MI methods using sets of marginal probabilities, bivariate probabilities, and trivariate probabilities. We consider only estimands for which $np$ and $n(1-p)$ are sufficiently large, where $p$ is the probability in the population, to eliminate estimands where the central limit theorem is not likely to hold even with no missing data. For each estimand, we use each set of $L$ completed datasets to compute point and interval estimates via the methods of Rubin (1987). As a brief review, let $q$ be the completed-data point estimator of some estimand $Q$, and let $u$ be the estimator of variance associated with $q$. For $l = 1, \dots, L$, let $q_l$ and $u_l$ be the values of $q$ and $u$ in completed dataset $Y^{(l)}$. We use $\bar{q} = \sum_{l=1}^{L} q_l / L$ as the point estimate of $Q$. We use $T = (1 + 1/L) b + \bar{u}$ as the estimated variance of $\bar{q}$, where $b = \sum_{l=1}^{L} (q_l - \bar{q})^2 / (L - 1)$ and $\bar{u} = \sum_{l=1}^{L} u_l / L$. We form 95% confidence intervals using $(\bar{q} - Q) \sim t_v(0, T)$, where $t_v$ is a $t$-distribution with $v = (L - 1)(1 + \bar{u} / [(1 + 1/L) b])^2$ degrees of freedom. Multiple imputation inferences require that (i) the central limit theorem applies for the complete-data estimate, and (ii) the sampling distribution of the complete-data point estimates is approximately normal. When $p$ is near zero or one, these conditions may not hold, which can result in unreliable inferences, including intervals that include zero or one.

For each estimand, we compare the three MI procedures by two metrics. We also compute these metrics for estimates based on the sampled data before introduction of missing values. First, we compute the proportion of the two hundred 95% confidence intervals that contain the corresponding $Q$ from the full ACS data.
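Rubin's combining rules reviewed above translate directly into code; a minimal Python sketch (the function name is ours):

```python
import numpy as np

def rubin_combine(q, u):
    """Combine completed-data estimates via Rubin's (1987) rules.
    q: length-L point estimates; u: length-L variance estimates.
    Returns (qbar, T, df): pooled estimate, total variance, and the
    degrees of freedom of the reference t distribution."""
    q, u = np.asarray(q, float), np.asarray(u, float)
    L = q.size
    qbar = q.mean()
    ubar = u.mean()                        # within-imputation variance
    b = q.var(ddof=1)                      # between-imputation variance
    T = (1.0 + 1.0 / L) * b + ubar         # total variance of qbar
    df = (L - 1) * (1.0 + ubar / ((1.0 + 1.0 / L) * b)) ** 2
    return qbar, T, df
```

A 95% interval then follows as $\bar{q} \pm t_{df, 0.975}\sqrt{T}$.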
Second, we compute the relative mean squared error (Rel.MSE) for $\bar{q}$, defined as

$$\text{Rel.MSE} = \frac{\sum_{h=1}^{200} (\bar{q}^{(h)} - Q)^2}{\sum_{h=1}^{200} (\hat{q}^{(h)} - Q)^2}, \qquad (8)$$

where $\bar{q}^{(h)}$ is the value of $\bar{q}$ in simulation $h$, and $\hat{q}^{(h)}$ is the estimate of $Q$ from the sampled data before introduction of missing values in simulation $h$.

3.2 Summary of results

In our initial experiments, MI-GLM crashed when the data included more than one nominal variable with more than ten categories. This results from problems estimating multinomial logistic regressions with many outcome levels. Therefore, we consider two types of simulations. In the first type, we remove the seven variables with more than ten categories, leaving 28 variables, in order to allow comparisons across all three procedures. In the second type, we include the seven variables and compare only MI-CART and MI-DPM. We summarize results using graphical displays and tables. In all plots, MI-GLM is abbreviated as GLM, MI-CART as CART, MI-DPM as DPM, and the pre-missing data results as NO. Median coverage rates across all estimands for each scenario are available in the supplementary material. In all tables, the entries are summaries of the Rel.MSEs combined over all relevant probabilities; e.g., the 25th, 50th, and 75th percentiles of the Rel.MSEs for all marginal probabilities. For all results, Monte Carlo standard errors for the coverage rates and the reported summaries of the Rel.MSEs are sufficiently small to rule out chance error as an explanation for apparent differences in the results.

3.2.1 Larger sample size and 30% MCAR

We first create scenarios at the larger sample size using the MCAR mechanism with the 30% missingness rate. Including all 28 variables, we have 98 marginal probabilities, 3343 bivariate probabilities, and 58778 trivariate probabilities. Typically, to create $L$ completed datasets with a standard desktop computer, MI-GLM runs for about 1 hour and 54 minutes, MI-CART runs for about 45 minutes, and MI-DPM runs for about 50 minutes.

Figure 2.
Simulated coverage rates for MI-GLM, MI-CART, MI-DPM, and the pre-missing data intervals at the larger sample size with 30% of values MCAR. We exclude seven variables with more than ten levels, resulting in $p = 28$ variables for imputation and analysis.

Figure 2 displays the simulated coverage rates of the 95% confidence intervals based on the 28 variables. For most estimands, all three MI methods result in reasonable coverage rates. All three distributions are skewed to the left, particularly for the bivariate and trivariate probabilities. Overall among the three MI procedures, MI-GLM tends to result in the most coverage rates far from the nominal 95% level. MI-CART offers the fewest extremely low coverage rates. MI-DPM tends to result in the highest coverage rates, although it also has many low rates for bivariate and trivariate probabilities. Across all MI methods, the lowest coverage rates tend to be associated with seven binary variables with marginal probabilities very close to one in the population. These variables include questions about the presence of bathtubs, refrigerators, running water, sinks, stoves, telephones, and flush toilets in the households.

Table 1 displays the values of Rel.MSE for all $\bar{q}$. The MI-GLM procedure tends to result in the most inaccurate point estimates, and the MI-CART procedure tends to result in the most accurate point estimates.

Table 1. Distributions of relative mean squared errors at the larger sample size with 30% of values MCAR. We exclude seven variables with more than ten levels, resulting in p = 28 variables for analysis with MI-GLM, MI-CART, and MI-DPM.

               Marginal                 Bivariate                Trivariate
           GLM    CART    DPM      GLM    CART    DPM      GLM    CART    DPM
Min.       1.0     1.0    1.0      1.0     0.6    0.6      0.6     0.6    0.4
1st Qu.    1.4     1.2    1.3      1.4     1.2    1.2      1.3     1.1    1.2
Median     1.6     1.4    1.5      1.6     1.4    1.4      1.5     1.3    1.4
3rd Qu.    2.6     1.5    1.9      2.0     1.6    1.7      1.9     1.6    1.6
Max.     27670   27550    6.7    39530   38530  188.4    49040   47150  202.6

Detailed investigations indicate that the differences in performance reflect differences in biases more than differences in variances.
The largest values in the distributions (e.g., the maxima in Table 1) usually involve the seven variables with low probabilities in the population. As a sensitivity analysis, we also remove the seven variables with probabilities near one, and perform an independent simulation of 200 runs on the remaining 21 variables. We are left with 83 marginal probabilities, 2590 bivariate probabilities, and 37216 trivariate probabilities. The relevant figure and table are presented in the online supplementary material. In short, the overall patterns are similar to those in Figure 2 and Table 1. Removing these seven variables also removes most of the extremely low coverage rates for MI-GLM, making it more competitive with MI-CART, although MI-CART continues to result in slightly better coverage rates overall. MI-DPM yields median coverage rates around or slightly above 95%; however, it continues to have longer left tails than the other methods for bivariate and trivariate probabilities. Removing the seven variables also removes the extremely large Rel.MSE values seen in Table 1. The lowest coverage rates and largest Rel.MSEs continue to be associated with probabilities closest to zero or one, which is generally the case in all the simulation scenarios we considered.

Figure 3. Simulated coverage rates for MI-GLM, MI-CART, MI-DPM, and the pre-missing data intervals for other MCAR scenarios with $p = 21$ variables. Panel (a), top, shows results with a 45% missing data rate at the larger sample size; panel (b), bottom, shows results with the smaller sample size and a 30% missing data rate.

As a final sensitivity analysis, we add in the seven variables with more than ten categories, and perform an independent set of 200 simulations comparing MI-CART and MI-DPM only. We continue to exclude the variables with probabilities near one. As a result, the comparison focuses on 28 variables with 177 marginal probabilities, 9049 bivariate probabilities, and 180218 trivariate
To generate L “ completed datasets, typically MI-CART takes about 2 hoursand 45 minutes, and MI-DPM takes about 55 minutes. The relevant figure and table are presentedin the online supplement. In short, including the variables with ten levels does not fundamentallychange the conclusions about MI-CART and MI-DPM. We next consider two other MCAR scenarios to examine whether or not overall conclusionschange when (i) increasing the rate of missingness and (ii) decreasing the sample size. For bothscenarios, we use only the p “ variables described in the first sensitivity analysis of Section3.2.1; that is, we discard the variables with more than 10 levels and the variables associated withmarginal probabilities very close to zero or one.To examine the role of missingness rate, we increase from 30% to 45% of values MCARindependently for each variable. We continue to use n “ . The results are presented inthe top panels of Figure 3 and Table 2. Compared to results with the 30% missing data rate, wesee increased numbers of low coverage rates for all three MI methods. This results from reducednumbers of observations on which to estimate the imputation models. MI-DPM offers the highestcoverage rates for marginal probabilities. The distributions of coverage rates for the three methodsare similar for bivariate and trivariate probabilities, with again MI-DPM tending to yield highercoverate rates. As before, MI-CART tends to offer the smallest values of Rel.MSE, with MI-DPMa close second. MI-GLM tends to have the largest Rel.MSEs.We also consider a simulation with 10% of values MCAR independently for each variable.Not surprisingly, with the low missingness rate, for the most part the Rel.MSEs are similar across19able 2. Distributions of relative mean squared errors for MI-GLM, MI-CART, and MI-DPM forother MCAR scenarios with p “ variables. Top panel is for scenario with n “ and a45% missing data rate, and bottom panel is for scenario with n “ and a 30% missing datarate. 
Marginal Bivariate TrivariateGLM CART DPM GLM CART DPM GLM CART DPMResults with 45% missing dataMin. 1.2 1.1 1.0 1.0 0.7 0.5 0.6 0.5 0.41st Qu. 1.8 1.4 1.4 1.8 1.4 1.5 1.6 1.3 1.4Median 2.4 1.7 1.8 2.3 1.8 1.8 2.0 1.7 1.83rd Qu. 3.2 2.0 2.1 3.1 2.3 2.4 2.6 2.3 2.5Max. 15.1 7.1 17.2 86.8 44.8 367.0 90.5 119.8 342.8Results with n “ Min. 1.0 1.0 1.0 0.9 0.8 0.7 0.9 0.7 0.51st Qu. 1.4 1.1 1.2 1.4 1.1 1.1 1.3 1.1 1.1Median 1.9 1.3 1.3 1.6 1.3 1.3 1.5 1.2 1.23rd Qu. 2.9 1.4 1.5 2.1 1.4 1.4 1.8 1.4 1.4Max. 27.8 14.2 3.5 47.0 21.1 20.4 66.2 21.5 31.6the three methods, although MI-GLM still tends to have the largest Rel.MSEs overall. All threemethods have higher coverage rates than in the previous scenarios. MI-DPM continues to yieldhigher coverage rates for most estimands, but also longer left tails, than MI-CART and MI-GLM,particularly for bivariate and trivariate probabilities. The relevant figure and table are presented inthe online supplementary material.To examine the role of sample size, we return to the 30% missingness rate and set n “ .As evident in the bottom panel of Figure 3, MI-GLM tends to result in the most coverage ratesbelow the nominal 95% level and the most extremely low rates. For most estimands, MI-DPMresults in the highest coverage rates with most of the density concentrated above 95%, although thedistribution has a long left tail as in previous scenarios. For marginal and bivariate probabilities,MI-CART coverage rates tend to concentrate below 95% although with a much shorter left tail20han the other methods. As evident in Table 2, the MI-GLM procedure still tends to result in theleast accurate point estimates, and the MI-CART and MI-DPM procedures give reasonably similarperformance. The large values of Rel.MSE in Table 2 again correspond to low probability eventsin the population. Often MAR is a more plausible assumption than MCAR in practice. 
We therefore design a simulation scenario with MAR, using the same sample size and the 21 variables described previously. We set two variables, household type (HHT), which has 5 levels, and tenure of property/home (TEN), which has 4 levels, to be always fully observed. Among households with HHT = 1, in each sample we randomly and independently blank 15% of values in each of the remaining 19 variables; the corresponding rates for HHT in {2, 3, 4, 5} are 35%, 50%, 50% and 30%, respectively. Similarly, for TEN in {1, 2, 3, 4} we use missing item rates of 40%, 15%, 30% and 5%, respectively. This results in around 40% missing values spread across the 19 variables. We select TEN and HHT as the fully complete variables because they have statistically significant associations with most of the other 19 variables.

The results are displayed in Figure 4 and Table 3. The patterns and conclusions from the MAR scenario are similar to those from the corresponding MCAR scenario. Overall, MI-CART tends to result in the most coverage rates concentrated around 95%, the fewest very low rates, and the most accurate point estimates. For most bivariate and trivariate estimands, MI-DPM still results in coverage rates above 95%, but it has the longest lower tail, sometimes reaching very low rates. MI-GLM results in coverage rates for marginal and bivariate probabilities that are concentrated slightly below 95%, but its lower tail is comparable to that of MI-CART. The large values of Rel.MSE in Table 3 correspond to low probability events in the population.

Figure 4. Simulated coverage rates for MI-GLM, MI-CART, MI-DPM, and the pre-missing data intervals when 30% of values are MAR. We exclude the seven variables with more than ten levels and the seven variables with marginal probabilities near one, resulting in p = 21 variables for imputation and analysis.
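Because missingness in this scenario depends only on the fully observed HHT and TEN, the mechanism is MAR by construction and is straightforward to simulate. Below is a minimal sketch of the HHT-driven part in Python; the rate table mirrors the rates in the text, while the `blank_mar` helper and the restriction to a single driver variable are our simplifications:

```python
import numpy as np

rng = np.random.default_rng(2017)

# Missingness rates keyed by the level of the fully observed variable.
# HHT has 5 levels and TEN has 4 levels, as described in the text.
HHT_RATES = {1: 0.15, 2: 0.35, 3: 0.50, 4: 0.50, 5: 0.30}
TEN_RATES = {1: 0.40, 2: 0.15, 3: 0.30, 4: 0.05}

def blank_mar(values, driver_levels, rates):
    """Independently blank each entry of `values` with the probability
    attached to that household's level of the fully observed driver
    variable. Returns a float array with np.nan marking blanked entries."""
    values = np.asarray(values, dtype=float)
    probs = np.array([rates[level] for level in driver_levels])
    mask = rng.random(len(values)) < probs
    out = values.copy()
    out[mask] = np.nan
    return out

# Toy column of one categorical variable for six households:
col = [1, 2, 1, 3, 2, 1]
hht = [1, 3, 3, 2, 5, 4]
print(blank_mar(col, hht, HHT_RATES))
```

Because the blanking probabilities are functions of fully observed covariates only, the missingness probability never depends on the blanked values themselves, which is exactly what MAR requires.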
4. CONCLUDING REMARKS
The simulation results suggest several general conclusions about the three MI procedures for categorical data. First, default applications of MI-GLM with main effects, which are arguably the most common implementation of multiple imputation, appear to be inferior to MI-CART and MI-DPM overall. These latter procedures automatically find and model important dependence structures that are missed by default applications of MI-GLM. Of course, one could use more complicated predictor functions in the generalized linear models, but with high dimensional variables, selecting appropriate sets of interaction effects to include in the conditional models is daunting.

Table 3. Distributions of relative mean squared errors for MI-GLM, MI-CART, and MI-DPM when 30% of values are MAR. We exclude the seven variables with more than ten levels and the seven variables with marginal probabilities near one, resulting in p = 21 variables for imputation and analysis.

              Marginal               Bivariate               Trivariate
          GLM   CART   DPM      GLM    CART   DPM       GLM    CART   DPM
Min.      1.0   1.0    1.0      0.9    0.8    0.6       0.6    0.6    0.5
1st Qu.   1.2   1.1    1.1      1.3    1.1    1.1       1.1    1.1    1.1
Median    1.5   1.3    1.4      1.5    1.3    1.4       1.4    1.2    1.3
3rd Qu.   1.8   1.5    2.1      1.8    1.5    1.8       1.6    1.4    1.7
Max.      4.5   2.5    9.4      28.2   19.1   144.0     43.6   41.8   195.4

Second, it is difficult to identify a clear winner between MI-CART and MI-DPM. Across the simulations, the median coverage rates for MI-DPM tend to be larger than the median coverage rates for MI-CART, although both tend to be close to 95%. On the other hand, MI-CART tends to result in fewer very low rates than MI-DPM does, and it tends to result in smaller relative mean squared errors. Therefore, analysts concerned with getting at least nominal coverage rates for most estimands, but potentially at the expense of some very low rates, may prefer MI-DPM.
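To make concrete how regression trees pick up dependence structure without user-specified interaction terms, here is a stripped-down sketch of a single chained-equations update for one categorical variable. It uses scikit-learn's `DecisionTreeClassifier` and resamples donors within each leaf; this only approximates the Bayesian bootstrap used by sequential CART imputation (Burgette and Reiter 2010), and `cart_impute_column` is our illustrative helper:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

def cart_impute_column(X, y):
    """One update: fit a tree of y on X over the observed rows, then
    fill each missing y by drawing from the observed y values that
    fall in the same leaf (approximating the leaf distribution)."""
    y = np.asarray(y, dtype=float)
    obs = ~np.isnan(y)
    tree = DecisionTreeClassifier(min_samples_leaf=5, random_state=0)
    tree.fit(X[obs], y[obs].astype(int))
    leaves = tree.apply(X)                     # leaf id for every row
    y_new = y.copy()
    for i in np.where(~obs)[0]:
        pool = y[obs & (leaves == leaves[i])]  # donors sharing the leaf
        y_new[i] = rng.choice(pool)
    return y_new

# Toy data where y is determined by the single predictor column:
X = np.array([[0], [0], [0], [0], [0], [1], [1], [1], [1], [1]] * 5)
y = np.array(([1] * 5 + [2] * 5) * 5, dtype=float)
y[3] = np.nan
y[7] = np.nan
print(cart_impute_column(X, y))
```

In a full chained-equations pass, this update would cycle over every variable with missing values, and the whole cycle would be repeated to produce each of the L completed datasets.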
Analysts willing to accept slightly lower coverage rates for most estimands, with potentially lower risk of very low rates and smaller mean squared errors, may prefer MI-CART. Third, the results suggest that one might favor MI-DPM over MI-CART in situations where one needs to lean on the model more heavily, namely the scenarios with smaller sample sizes and higher missing data rates in our simulations, and favor MI-CART otherwise. Intuitively, by design the joint model underlying the MI-DPM engine tends to shrink probability estimates from low count cells towards those for higher count cells. In modest-sized samples or with high rates of missing data, this can improve accuracy. The effects of this shrinkage generally decrease with sample size. Thus, we conjecture that MI-DPM and MI-CART would give similar results for much larger n; indeed, we found this to be the case in a small number of simulation runs with substantially larger n. Of course, when the sample size is too small, MI-DPM and MI-CART can lack sufficient data to estimate complex relationships accurately, highlighting the importance of checking the quality of imputations, e.g., with predictive checks as in Burgette and Reiter (2010), for any method.

As with any simulation study, the conclusions suggested by these results may not generalize to all other settings. The simulations were based on relatively simple missing data mechanisms, default applications of imputation strategies that were not tuned specifically to the data at hand, and only a subset of the (huge space of) possible estimands. For example, it may be that MI-GLM is most effective among the three imputation procedures when the parameters of interest are a particular GLM that is also used in one of the conditional specifications, although this is not guaranteed when the specified GLM poorly describes the relationships in the data. We considered only data comprising nominal categorical variables.
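A basic version of the kind of quality check mentioned above can be prototyped in a few lines: treat a fully observed dataset (e.g., the complete cases) as a synthetic population, repeatedly sample from it, impose missingness, impute, and track interval coverage. The sketch below uses a simple marginal imputer and a single completed dataset rather than full MI combining, so it understates imputation uncertainty; all function and variable names are ours:

```python
import numpy as np

rng = np.random.default_rng(42)

def coverage_of_marginal(population, n, n_reps, missing_rate, impute):
    """Repeated-sampling check of one imputation routine: treat a fully
    observed binary array as the population, and estimate how often a
    nominal 95% interval for P(X = 1) covers the population value."""
    truth = np.mean(population == 1)
    hits = 0
    for _ in range(n_reps):
        sample = rng.choice(population, size=n, replace=True).astype(float)
        mask = rng.random(n) < missing_rate      # MCAR blanking
        sample[mask] = np.nan
        completed = impute(sample)               # user-supplied imputer
        p_hat = np.mean(completed == 1)
        se = np.sqrt(p_hat * (1 - p_hat) / n)
        if p_hat - 1.96 * se <= truth <= p_hat + 1.96 * se:
            hits += 1
    return hits / n_reps

def marginal_imputer(x):
    """Baseline: fill missing entries with draws from the observed margin."""
    x = x.copy()
    obs = x[~np.isnan(x)]
    x[np.isnan(x)] = rng.choice(obs, size=int(np.isnan(x).sum()))
    return x

population = rng.integers(0, 2, size=5000)       # synthetic binary population
print(coverage_of_marginal(population, n=500, n_reps=200,
                           missing_rate=0.3, impute=marginal_imputer))
```

A real evaluation would replace `marginal_imputer` with the candidate MI engine, generate L completed datasets, and form intervals with the multiple imputation combining rules rather than a single-dataset standard error.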
Of course, many datasets also include ordinal and continuous variables. The MI-CART procedure can be applied readily to such mixed data types, but MI-DPM cannot. One needs a different joint model, such as a general location model (Schafer 1997) or the mixture model of Murray and Reiter (2016).
Thus, one should not generalize the findings here to mixed data types. Additionally, we did not use the ordinal nature of the data when fitting MI-GLM. Models that do so, like cumulative logit models (Agresti 2013), can be more effective than multinomial logistic regressions, provided that their underlying assumptions (e.g., proportional odds) are sensible for the data at hand. Proportional odds assumptions do not always match the data distribution, e.g., when the ordinal variable has most of its mass at the lowest and highest values, so careful model checking is warranted before using such models in MI-GLM in lieu of the weaker multinomial logit assumptions.

The simulation results suggest some useful practical steps to improve MI. First, the performance of the procedures suffered when variables with probabilities nearly equal to one (or zero) were included in the models. Additionally, MI-CART occasionally, and MI-GLM and MI-DPM less frequently, generated completed datasets with b = 0 for these variables, so that the multiple imputation variance estimator and degrees of freedom broke down. Accepting such decreased performance seems unnecessary, since imputation from marginal distributions is likely to work just as effectively for these variables. We suggest that analysts remove such variables before imputation. CART initially suffered the most for probabilities close to zero or one, before we restricted analyses to quantities with np and n(1 - p) sufficiently large. Imposing that restriction and removing those seven variables improved the performance of CART. Second, we recommend against multinomial regression modeling, as is done in default applications of MI-GLM, when some multinomial variables with missing values have many levels. We could not get default versions of MI-GLM even to fit in such circumstances. It may be possible to collapse categories or use fewer predictors in the models, although this may sacrifice model quality. This could be problematic when such variables are central to a secondary analysis.
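The variance estimator and degrees of freedom mentioned in the first recommendation are Rubin's (1987) combining rules: with L completed datasets, the total variance is T = u_bar + (1 + 1/L) b, where u_bar is the average within-imputation variance and b is the between-imputation variance, and the degrees of freedom formula divides by b. The sketch below, with our illustrative `rubin_combine` helper, shows why b = 0 makes the degrees of freedom degenerate:

```python
import numpy as np

def rubin_combine(point_ests, variances):
    """Rubin's (1987) rules for combining L completed-data estimates.
    Returns (qbar, T, df); df is infinite when the between-imputation
    variance b is zero, which is the breakdown noted in the text."""
    q = np.asarray(point_ests, dtype=float)
    u = np.asarray(variances, dtype=float)
    L = len(q)
    qbar = q.mean()                        # MI point estimate
    ubar = u.mean()                        # within-imputation variance
    b = q.var(ddof=1)                      # between-imputation variance
    T = ubar + (1 + 1 / L) * b             # total variance
    if b == 0.0:                           # degenerate case: df undefined
        return qbar, T, np.inf
    df = (L - 1) * (1 + ubar / ((1 + 1 / L) * b)) ** 2
    return qbar, T, df

# Five completed-data estimates of a marginal probability:
q = [0.31, 0.29, 0.30, 0.32, 0.28]
u = [0.0004] * 5
print(rubin_combine(q, u))
```

When a variable is imputed identically in every completed dataset, as happened for the near-degenerate variables above, b = 0 and the degrees of freedom formula blows up, which is why removing such variables before imputation is advisable.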
Third, the lack of a clear winner suggests benefits from assessing the quality of an imputation method for data akin to those at hand, for example, using repeated sampling simulations with other, representative data sources or, when large enough, the complete cases. Such evaluations are not routinely used in applications, although examples exist (e.g., Ezzati-Rice et al. 1995; Raghunathan and Rubin 1997; Schafer et al. 1998; Tang et al. 2005; van Buuren et al. 2006).

SUPPLEMENTARY MATERIALS
The supplementary material contains a table describing the variables used in the simulation, median coverage rates across all scenarios, and figures for the sensitivity checks in Section 3.
REFERENCES
Arnold, B. C. and S. J. Press (1989), "Compatible Conditional Distributions", Journal of the American Statistical Association 84, pp. 152–156.

Barnard, J. and X. L. Meng (1999), "Applications of multiple imputation in medical studies: From AIDS to NHANES", Statistical Methods in Medical Research 8, pp. 17–36.

Brand, J. P. L. (1999), Development, Implementation and Evaluation of Multiple Imputation Strategies for the Statistical Analysis of Incomplete Data Sets, PhD thesis, Erasmus University.

Breiman, L., J. H. Friedman, R. A. Olshen, and C. J. Stone (1984), Classification and Regression Trees, Chapman and Hall/CRC.

Burgette, L. F. and J. P. Reiter (2010), "Multiple imputation for missing data via sequential regression trees", American Journal of Epidemiology.

van Buuren, S. (2007), "Multiple imputation of discrete and continuous data by fully conditional specification", Statistical Methods in Medical Research 16, pp. 219–242.

— (2012), Flexible Imputation of Missing Data, Boca Raton, FL: Chapman and Hall/CRC.

van Buuren, S. and K. Groothuis-Oudshoorn (2011), "mice: Multivariate imputation by chained equations in R", Journal of Statistical Software; software available from The Comprehensive R Archive Network.

van Buuren, S., H. C. Boshuizen, and D. L. Knook (1999), "Multiple imputation of missing blood pressure covariates in survival analysis", Statistics in Medicine 18, pp. 681–694.

van Buuren, S., J. P. L. Brand, C. G. M. Groothuis-Oudshoorn, and D. B. Rubin (2006), "Fully conditional specification in multivariate imputation", Journal of Statistical Computation and Simulation.

Journal of the American Statistical Association.

Proceedings of the Section on Survey Research Methods, American Statistical Association, pp. 668–672.

Gelman, A. and T. P. Speed (1993), "Characterizing a joint probability distribution by conditionals", Journal of the Royal Statistical Society, Series B 55, pp. 185–188.

Harel, O. and X. H. Zhou (2007), "Multiple imputation: Review of theory, implementation and software", Statistics in Medicine 26, pp. 3057–3077.

Honaker, J., G. King, and M. Blackwell (2011), "Amelia II: A program for missing data", Journal of Statistical Software.

Journal of Computational and Graphical Statistics.

Little, R. J. A. and D. B. Rubin, Statistical Analysis with Missing Data, New Jersey: Wiley.

Manrique-Vallier, D. and J. P. Reiter (2014a), "Bayesian estimation of discrete multivariate truncated latent structure models", Journal of Computational and Graphical Statistics 23, pp. 1061–1079.

— (2014b), "Bayesian multiple imputation for large-scale categorical data with structural zeros", Survey Methodology 40, pp. 125–134.

— (2015), "Bayesian simultaneous edit and imputation for multivariate categorical data", Technical Report, Department of Statistical Science, Duke University.

Manrique-Vallier, D., J. P. Reiter, J. Hu, and Q. Wang (2014), "NPBayesImpute: Non-parametric Bayesian multiple imputation for categorical data", The Comprehensive R Archive Network.

Raghunathan, T. E. and D. B. Rubin (1997), "Roles for Bayesian techniques in survey sampling", Proceedings of the Survey Methods Section of the Statistical Society of Canada, pp. 51–55.

Raghunathan, T. E., J. M. Lepkowski, J. van Hoewyk, and P. Solenberger (2001), "A multivariate technique for multiply imputing missing values using a sequence of regression models", Survey Methodology 27, pp. 85–96.

Raghunathan, T. E., P. W. Solenberger, and J. van Hoewyk (2002), "IVEware: Imputation and variance estimation software user guide", Survey Research Center, Institute for Social Research, University of Michigan.

Reiter, J. P. and T. E. Raghunathan (2007), "The multiple adaptations of multiple imputation", Journal of the American Statistical Association.

Journal of Statistical Software.

Rubin, D. B. (1976), "Inference and missing data", Biometrika 63, pp. 581–592.

— (1987), Multiple Imputation for Nonresponse in Surveys, New York: John Wiley & Sons.

— (1996), "Multiple imputation after 18+ years", Journal of the American Statistical Association 91, pp. 473–489.

— (2003), "Nested multiple imputation of NMES via partially incompatible MCMC", Statistica Neerlandica.

Schafer, J. L. (1997), Analysis of Incomplete Multivariate Data, London: Chapman and Hall.

Schafer, J. L., T. M. Ezzati-Rice, W. Johnson, M. Khare, R. J. A. Little, and D. B. Rubin (1998), "The NHANES III multiple imputation project", Proceedings of the Survey Research Methods Section of the American Statistical Association, pp. 28–37.

Si, Y. and J. P. Reiter (2013), "Nonparametric Bayesian multiple imputation for incomplete categorical variables in large-scale assessment surveys", Journal of Educational and Behavioral Statistics.

Political Analysis 23, pp. 92–112.

Su, Y. S., A. Gelman, J. Hill, and M. Yajima (2011), "Multiple imputation with diagnostics (mi) in R: Opening windows into the black box", Journal of Statistical Software.

Tang, L., J. Song, T. R. Belin, and J. Unützer (2005), "A comparison of imputation methods in a longitudinal randomized clinical trial", Statistics in Medicine 24, pp. 2111–2128.

Vermunt, J. K., J. R. van Ginkel, L. A. van der Ark, and K. Sijtsma (2008), "Multiple imputation of incomplete categorical data using latent class analysis", Sociological Methodology 38, pp. 369–397.

Yang, Y. (2011), "Multiple imputation using SAS software".