A flexible and efficient algorithm for joint imputation of general data
Michael W. Robbins∗

August 6, 2020
Abstract
Imputation of data with general structures (e.g., data with continuous, binary, unordered categorical, and ordinal variables) is commonly performed with fully conditional specification (FCS) instead of joint modeling. A key drawback of FCS is that it does not invoke an appropriate data augmentation mechanism and, as such, convergence of the resulting Markov chain Monte Carlo procedure is not assured. Methods that use joint modeling lack these drawbacks but have not been efficiently implemented for data of general structures. We address these issues by developing a new method, coherent multivariate imputation (CMI), that draws imputations from a latent joint multivariate normal model that underpins the generally structured data. This model is constructed using a sequence of flexible conditional linear models, which enables the resulting procedure to be efficiently implemented on high-dimensional datasets in practice. Simulations show that CMI performs well when compared to methods that utilize FCS. Furthermore, the new method is dramatically more computationally efficient than FCS procedures.
KEY WORDS: Missing Data, Multiple Imputation, Joint Modeling, Fully Conditional Specification, Chained Equations, Markov Chain Monte Carlo.
1 Introduction

Missing data present one of the classical problems of statistical analyses. Imputation, in which missing values are replaced with plausible entries according to some sort of statistical model, is a highly popular approach for addressing missing data as it yields completed datasets that can be analyzed with traditional techniques. Modern approaches to imputation have tended to settle within a Bayesian paradigm wherein imputations are sampled at random from a posterior predictive distribution; this begets the multiple imputation framework in which estimators of uncertainty can be adjusted for imputation error through the creation of several imputed datasets. Reviews of missing data, imputation, and multiple imputation are numerous; examples include Rubin (1987), Rubin (1996), Schafer (1999), Carpenter and Kenward (2012), and Little and Rubin (2020).

∗Statistician, RAND Corporation, Pittsburgh, PA 15213 (E-mail: [email protected]).
†Acknowledgments: The author acknowledges funding from grant R21AG058123 from the National Institutes of Health.

Most commonly used imputation procedures generate imputations across iterations of Markov chain Monte Carlo (MCMC) with data augmentation (Tanner and Wong 1987). As such, theoretically valid multivariate imputation requires sampling from the (joint) posterior distribution of the missing data given the observed data. However, procedures that implement sampling from coherent joint distributions (e.g., Quartagno and Carpenter 2019; Schafer 2017; Zhao and Schafer 2018) tend to be incompatible or highly inefficient with data of general structures or high dimensions.

An alternative to joint modeling is fully conditional specification (FCS; Raghunathan et al. 2001; Van Buuren et al. 2006; Van Buuren and Groothuis-Oudshoorn 2010; White et al. 2011), also known as chained equations, wherein each variable is imputed from a conditional model that potentially includes all other variables. This process naturally lends itself to imputation of variables of general structure (e.g., binary, unordered categorical, ordinal); further, transformation (e.g., Robbins and White 2011; Robbins 2014) or predictive mean matching (Little 1988) can be applied to preserve continuous marginal distributions that are non-standard. However, since the conditional models can be, in theory, incompatible with one another, FCS does not represent a valid form of Gibbs sampling, and the imputations are not guaranteed to converge across iterations of MCMC. In spite of these theoretical flaws, FCS performs well in practice (White et al. 2011; Van Buuren 2018) and is widely used and available across a host of software (e.g., Raghunathan et al. 2002; Van Buuren and Groothuis-Oudshoorn 2010; Su et al. 2011; Honaker et al. 2011).

We introduce a new procedure that borrows from earlier ideas (Carpenter and Kenward 2012; Robbins et al. 2013) and addresses the theoretical and empirical issues encountered with FCS. This new algorithm, referred to as coherent multivariate imputation (CMI), imposes a latent multivariate normal process in order to facilitate imputation of continuous, binary, unordered categorical, and ordinal (i.e., ordered categorical) variables. To ensure theoretical validity, CMI draws imputations from a joint model while building that model from a sequence of linear conditional models. Modeling in such a fashion enables flexibility in the selection of conditional relationships that are permitted between variables.
The SWEEP operator (Goodnight 1979) optimizes the computational performance of the algorithm. CMI also runs dramatically faster than FCS in the empirical illustrations presented here.

The remainder of this article proceeds as follows. Fundamental concepts regarding imputation are provided in Section 2. The CMI algorithm is detailed in Section 3; its favorable performance is illustrated in a rigorous simulation study provided in Section 4. The article concludes with discussion in Section 5.

2 Imputation fundamentals

We begin by sketching fundamental concepts for imputation of missing data. Relevant imputation methods are founded on the concept of data augmentation (DA; Tanner and Wong 1987). DA is designed for cases where the desired objective of sampling from a posterior distribution $P(\theta \mid y)$ is difficult, but for some latent variable $z$, sampling from $P(z \mid y, \theta)$ and $P(\theta \mid y, z)$ is simple, where $P(\cdot)$ is general notation for a probabilistic density. As such, DA involves iteratively sampling from $P(z \mid y, \theta)$ and $P(\theta \mid y, z)$ in order to yield valid draws from $P(\theta, z \mid y)$. In missing data models, it is common to let $y$ represent the observed data in the DA formulation, $z$ represent the missing data, and $\theta$ the model parameters. As such, imputation via DA involves iteratively alternating between an imputation step (or I Step), which involves sampling updated imputations from the density of the missing data given the observed data and the parameters sampled from the previous iteration, and a parameter step (P Step) wherein one samples parameters from the density of the parameters given the observed data and the imputations sampled from the preceding I Step.

To illustrate the DA process with more formal notation, let $\chi_{obs}$ denote the observed data and $\chi_{mis}$ denote the missing data, while $\chi = \{\chi_{obs}, \chi_{mis}\}$ gives the complete data. Furthermore, $\Theta$ is a set of model parameters that govern the distribution of $\chi$. The objective is to sample imputations from $P(\chi_{mis} \mid \chi_{obs}, \Theta)$. Letting $\chi_{mis}^{(t)}$ and $\Theta^{(t)}$ represent samples of $\chi_{mis}$ and $\Theta$ drawn at the $t$th iteration, these are updated within the $(t+1)$th iteration as follows:

I Step: Draw $\chi_{mis}^{(t+1)}$ from $P(\chi_{mis} \mid \chi_{obs}, \Theta^{(t)})$.
P Step: Draw $\Theta^{(t+1)}$ from $P(\Theta \mid \chi_{obs}, \chi_{mis}^{(t+1)})$.

As $t \to \infty$, convergence is observed in that $\{\chi_{mis}^{(t)}, \Theta^{(t)}\}$ can be shown to represent a random draw from $P(\chi_{mis}, \Theta \mid \chi_{obs})$. Validity of estimators derived from the imputed data is contingent upon the missing at random assumption (in the nomenclature of Little and Rubin 2020).

Gibbs sampling (Geman and Geman 1984) is used to update imputations within the I Step. Specifically, letting $\chi = \{X_1, \ldots, X_p\}$, within the $(t+1)$th iteration, we sequentially update $X_j^{(t)}$ for each $j$ by replacing values that were originally missing (in $X_j$) with draws from

$$P\left(X_j \,\middle|\, X_1^{(t+1)}, \ldots, X_{j-1}^{(t+1)}, X_{j+1}^{(t)}, \ldots, X_p^{(t)}, \Theta^{(t)}\right), \qquad (1)$$

which serves to create $X_j^{(t+1)}$. In the event that $\chi$ follows a Gaussian distribution, multivariate normal theory can be used to derive the form of each of the above conditional models given a mean vector and covariance matrix extracted from $\Theta$. However, joint modeling in this manner for more general data, which may contain binary, unordered categorical, or ordinal variables, is more complicated. Elaborating, one can construct a joint model via a sequence of conditional models using

$$P(X_1, X_2, \ldots, X_p \mid \Theta) = \prod_{j=1}^{p} P(X_j \mid X_1, \ldots, X_{j-1}, \theta_j^*), \qquad (2)$$

where $\Theta = \{\theta_1^*, \ldots, \theta_p^*\}$.
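To fix ideas, the following is a minimal sketch of the I Step / P Step alternation in R for a bivariate normal with missingness confined to one variable; the regression parameterization and all names are illustrative assumptions rather than the paper's implementation.

```r
# Minimal data augmentation sketch: y2 has missing values, y1 is complete.
set.seed(1)
n   <- 500
y1  <- rnorm(n)
y2  <- 1 + 0.5 * y1 + rnorm(n)
mis <- runif(n) < 0.3                                 # MCAR missingness in y2
y2[mis] <- NA

y2_cur <- ifelse(mis, mean(y2, na.rm = TRUE), y2)     # crude starting values
for (t in 1:100) {
  # P Step: draw (beta, sigma^2) from their posterior given the completed data
  X    <- cbind(1, y1)
  bhat <- solve(crossprod(X), crossprod(X, y2_cur))
  rss  <- sum((y2_cur - X %*% bhat)^2)
  sig2 <- rss / rchisq(1, df = n - 2)                 # scaled inverse chi-square draw
  beta <- bhat + t(chol(sig2 * solve(crossprod(X)))) %*% rnorm(2)
  # I Step: draw imputations for missing y2 given the sampled parameters
  mu_mis      <- beta[1] + beta[2] * y1[mis]
  y2_cur[mis] <- rnorm(sum(mis), mu_mis, sqrt(sig2))
}
```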
Given the specific marginal structure of each $X_j$, models for $P(X_j \mid X_1, \ldots, X_{j-1})$ and $\theta_j^*$ may be easily determined for $j = 1, \ldots, p$, which yields a valid joint density. Nonetheless, sampling from

$$P(X_j \mid X_1, \ldots, X_{j-1}, X_{j+1}, \ldots, X_p) \qquad (3)$$

for $j = 1, \ldots, p$ in a manner that is congenial with the resulting joint density, as is required for valid Gibbs sampling, often presents an intractable problem for general data structures.

Fully conditional specification (FCS) circumvents the above problem by modeling each conditional expression of the form in (3) instead of addressing the joint distribution. As such, in lieu of a P Step, FCS samples model parameters for each conditional model within each phase of the Gibbs sampling. That is, for each $j = 1, \ldots, p$, imputations for $X_j$ at the $(t+1)$th iteration are determined via

$$\theta_j^{(t+1)} \sim P(\theta_j \mid X_1^{(t+1)}, \ldots, X_{j-1}^{(t+1)}, X_j^{(t)}, \ldots, X_p^{(t)}),$$
$$X_j^{(t+1)} \sim P(X_j \mid X_1^{(t+1)}, \ldots, X_{j-1}^{(t+1)}, X_{j+1}^{(t)}, \ldots, X_p^{(t)}),$$

where $\theta_j$ indicates model parameters for the density seen in (3). Since the sequence of conditional expressions given by (3) may define an incoherent joint distribution when modeled separately, there is no guarantee that $\{\chi_{mis}^{(t)}, \Theta^{(t)}\}$ will converge to $P(\chi_{mis}, \Theta \mid \chi_{obs})$ across iterations with FCS; in fact, divergence is possible. Most references that discuss convergence in FCS methods (e.g., White et al. 2011; Van Buuren 2018) recommend the use of a small number of iterations of MCMC (usually as low as five, which is the default in several algorithms), perhaps to hedge against the possibility of divergence.

Others have noted performance issues with FCS when applied in high-dimensional datasets (e.g., Loh et al. 2016); nonetheless, it has observed prevalent usage when applied in large-scale surveys (e.g., Schenker et al. 2006).
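For concreteness, the following is a minimal FCS run using the mice package; the dataset and variable names are hypothetical, and the method strings shown (polyreg, pmm, logreg) are standard mice options for the variable types listed.

```r
library(mice)
set.seed(1)
n   <- 500
x1  <- factor(sample(letters[1:4], n, replace = TRUE))  # unordered categorical
x2  <- rnorm(n)                                         # continuous, fully observed
x3  <- 0.5 * x2 + rnorm(n)                              # continuous
x4  <- factor(rbinom(n, 1, plogis(x2)))                 # binary
dat <- data.frame(x1, x2, x3, x4)
dat[sample(n, 150), c("x1", "x3", "x4")] <- NA          # impose missingness

imp <- mice(dat,
            m      = 5,                     # five multiply imputed datasets
            maxit  = 5,                     # the small default iteration count
            method = c("polyreg",           # polytomous regression for x1
                       "pmm", "pmm",        # predictive mean matching for x2, x3
                       "logreg"),           # logistic regression for x4
            printFlag = FALSE)
fit    <- with(imp, lm(x2 ~ x1 + x3 + x4))  # analyze each completed dataset
pooled <- pool(fit)                         # combine via Rubin's rules
```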
3 Coherent multivariate imputation

Here, we introduce a novel imputation method: the so-called coherent multivariate imputation (CMI). This procedure is designed to accomplish the following:

1. Sample imputations from a coherent joint distribution;
2. Have the flexibility to impute variables of a variety of structures (e.g., continuous, binary, unordered categorical, ordinal);
3. Afford the user the ability to determine which conditional relationships are permitted within the imputation model;
4. Be computationally feasible and efficient in high-dimensional datasets.

In this endeavor, we revisit the data augmentation framework, but instead of assuming that the latent process $z$ (as described at the beginning of Section 2) represents only the missing data whereas the other process $y$ is the observed data, we assume that there is a latent data system that underpins all data values (observed or missing) and that the collected data instead represent available knowledge regarding this system in that some variables may be fully or partially observed.

As in Section 2, let $\chi = \{X_1, \ldots, X_p\}$ denote the collected data (which may contain missing values). We assume that each variable contained in $\chi$ has either a continuous, categorical, binary, or ordinal distribution. Extensions involving semi-continuous data and right-censored data are discussed in Section 5. For simplicity, we assume that binary variables take on the value 0 or 1, and we assume that if $X_j$ is unordered categorical or ordinal with $k_j$ categories, then $X_j \in \{1, \ldots, k_j\}$.

We reformat the data so that if $X_j$ is unordered categorical, it is represented by $k_j - 1$ nested binary variables such that if $X_j$ falls into the $i$th category, the $i$th nested variable is unity and is zero otherwise. Specifically, a categorical variable $X_j$ is reformatted into nested variables $X_{j'}^*, \ldots, X_{j'+k_j-2}^*$ for some index $j'$ as follows:

$$X_{j'+\ell-1}^* = \begin{cases} ?, & \text{if } X_j < \ell \text{ or } X_j = ?, \\ 1, & \text{if } X_j = \ell, \\ 0, & \text{if } X_j > \ell, \end{cases} \qquad (4)$$

for $1 \le \ell \le k_j - 1$, where $?$ denotes a missing value. Missing values within $X_{j'}^*, \ldots, X_{j'+k_j-2}^*$ are imputed. Let $\chi^* = \{X_1^*, \ldots, X_q^*\}$ denote the reformatted data, where $q \ge p$ and where $\chi^*$ contains only continuous, binary, and ordinal variables. Note that variables that are not unordered categorical are copied over from $\chi$ to $\chi^*$. For an unordered categorical variable $X_j$, we suggest ordering the categories from least to most prevalent when creating the nested variables; this will minimize the number of missing values that are artificially imposed.

The formulation in (4) represents a nested version of the manner in which semi-continuous data are frequently handled in imputation algorithms (e.g., Robbins et al. 2013). Specifically, the categorical variable is first broken down into a binary variable that indicates whether or not the original variable falls into the first category and a categorical variable that is set as the value of the original variable but is missing when the original variable falls into the first category. Next, this second (categorical) variable is dissected in a similar manner; this yields a second binary variable that is unity when the original variable fell into the second category, missing when it fell into the first, and zero otherwise, along with a third variable that is unordered categorical and contains missing values for cases where the original categorical variable fell into one of the first two categories. This process is repeated until all categories are embodied by nested binary variables. The advantage of this process is that it allows the nested variables to be (conditionally) independent of one another and is easily reversed following imputation.
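As an illustration, a minimal R sketch of the recoding in (4); the function name is ours, and NA plays the role of the "?" entries.

```r
# Recode a k-category unordered variable x into k-1 nested binary indicators
# per (4): NA if the category is already resolved (x < l) or x is missing.
nest_categorical <- function(x, k = max(x, na.rm = TRUE)) {
  out <- matrix(NA_integer_, nrow = length(x), ncol = k - 1)
  for (l in 1:(k - 1)) {
    out[, l] <- ifelse(is.na(x) | x < l, NA_integer_,
                       ifelse(x == l, 1L, 0L))
  }
  colnames(out) <- paste0("nested", 1:(k - 1))
  out
}

x <- c(1, 2, 3, 4, NA, 2)
nest_categorical(x)
# nested1:  1  0  0  0 NA  0
# nested2: NA  1  0  0 NA  1
# nested3: NA NA  1  0 NA NA
```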
Borrowing from the idea of probit modeling, akin to how it has been previously applied in imputation settings (Carpenter and Kenward 2012), we assume that a multivariate Gaussian distribution underpins $\chi^*$. Specifically, $\psi = \{Z_1, \ldots, Z_q\}$ indicates the underlying latent process. We assume that $\psi \sim N_q(\mu, \Sigma)$ for a mean vector $\mu$ and variance matrix $\Sigma$. The process of observed data $\chi^*$ is generated from the latent process $\psi$ as follows. If $X_j^*$ is continuous,

$$X_j^* = F_j^{-1}(\Phi(Z_j)), \qquad (5)$$

where $F_j(\cdot)$ is the marginal cumulative distribution function (CDF) of $X_j^*$, in that $F_j(x) = \Pr(X_j^* \le x)$ where $\Pr(A)$ gives the probability of event $A$, and where $\Phi(\cdot)$ denotes the CDF of a standard normal random variable. Of course, prior to imputation, the observed data should be transformed to have a standard normal distribution via the inverse transformation $Z_j = \Phi^{-1}(F_j(X_j^*))$. Transformations of this type may be performed with a parametric density (Robbins and White 2011; Robbins et al. 2013) or in a non-parametric manner with a kernel or empirical distribution (Robbins 2014). This formulation serves to link the continuous data via a Gaussian copula (Nelsen 2009). If $X_j^*$ is binary, a probit-type model is imposed:

$$X_j^* = \begin{cases} 0, & \text{if } Z_j < 0, \\ 1, & \text{if } Z_j \ge 0. \end{cases}$$

Lastly, if $X_j^*$ is ordinal where $X_j^* \in \{1, 2, \ldots, k_j\}$, then $X_j^* = i$ if $\tau_{j,i-1} < Z_j \le \tau_{j,i}$ for $i \in \{1, \ldots, k_j\}$, where $\tau_{j,i} = \Phi^{-1}(\Pr(X_j^* \le i))$ for $i \in \{1, \ldots, k_j - 1\}$ and where we set $\tau_{j,0} = -\infty$ and $\tau_{j,k_j} = \infty$.

Note that the latent multivariate normal process can be modeled conditionally upon a set of fully observed predictors (see, for example, Robbins et al. 2013); these variables can obey any distribution and need not be underpinned by a Gaussian density. For simplicity of the exposition, we do not condition on such variables in the following.

The P Step of CMI is akin to that of the so-called iterative sequential regression method of Robbins et al. (2013). The objective of the P Step is to determine values of the mean vector $\mu$ and variance matrix $\Sigma$ of the latent multivariate Gaussian process; however, these quantities are modeled indirectly. Specifically, we build a joint model for $\psi$ by stating linear forms for the conditional models seen in (2), in that $Z_j$ is allowed to depend on variables that precede it in sequence but not those that follow it. That is, we assume

$$Z_j = V_j \beta_j + \sigma_j \epsilon_j, \qquad (6)$$

for $j = 1, \ldots, q$, where $V_j$ denotes an $n \times \kappa_j$ predictor matrix of which the columns are some subset of the columns of $\{1; Z_1; \ldots; Z_{j-1}\}$, with $1$ indicating a vector of ones, and where $\beta_j$ denotes a length-$\kappa_j$ vector of regression coefficients; the flexibility to selectively reduce the size of the predictor set for each conditional model is crucial in our setting as referenced previously. This model imposes that $P(Z_j \mid Z_1, \ldots, Z_{j-1}) = P(Z_j \mid V_j)$. Note that the predictor matrix $V_j$ for a $Z_j$ that corresponds to a nested binary variable within an unordered categorical variable should exclude any other nested variables from that same categorical variable.

Assuming a non-informative prior for $\Theta = \{\beta_1, \sigma_1, \ldots, \beta_q, \sigma_q\}$ in that $P(\Theta) \propto \prod_{j=1}^{q} 1/\sigma_j^2$, the posterior distributions of $\beta_j$ and $\sigma_j$ (assuming fully observed $\psi$) are derived as follows. If $X_j^*$ is binary, we fix $\sigma_j = 1$, which is in accordance with traditional probit modeling. Otherwise,

$$\sigma_j^2 \mid \psi \sim \text{Inv-}\chi^2(n - \kappa_j, s_j^2), \qquad (7)$$

where, letting the superscript $T$ indicate a matrix transpose, $s_j^2 = (Z_j - V_j\hat{\beta}_j)^T(Z_j - V_j\hat{\beta}_j)/(n - \kappa_j)$ with $\hat{\beta}_j = (V_j^T V_j)^{-1} V_j^T Z_j$ and with $\text{Inv-}\chi^2(\cdot, \cdot)$ denoting a scaled inverse chi-square distribution. Likewise,

$$\beta_j \mid \sigma_j, \psi \sim N_{\kappa_j}\left(\hat{\beta}_j, \,\sigma_j^2 (V_j^T V_j)^{-1}\right). \qquad (8)$$

Given imputed values of the latent process, $\psi^{(t)} = \{Z_1^{(t)}; \ldots; Z_q^{(t)}\}$, at the $t$th iteration, the P Step involves sampling $\beta_j^{(t)}$ and $\sigma_j^{(t)}$ from $P(\beta_j, \sigma_j \mid Z_1^{(t)}, \ldots, Z_j^{(t)})$ for $j = 1, \ldots, q$ in accordance with (7), when needed, and (8) above.

We next calculate $\mu^{(t)}$ and $\Sigma^{(t)}$, the mean vector and covariance matrix of the process $\psi$ at the $t$th iteration, from the parameter set $\{\beta_1^{(t)}, \sigma_1^{(t)}, \ldots, \beta_q^{(t)}, \sigma_q^{(t)}\}$ using established formulas for deriving means and covariances from a sequence of regression models; the supplement to Robbins et al. (2013) provides some illustrations of such calculations.
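A minimal sketch of one P Step under (6)-(8), assuming every conditional model uses the full predictor set and ignoring the binary special case (for a binary $Z_j$ one would fix sig2 at one); the moment recursion at the end is one instance of the "established formulas" referenced above. All names are ours.

```r
p_step <- function(Z) {                      # Z: n x q matrix of latent values
  n <- nrow(Z); q <- ncol(Z)
  mu <- numeric(q); Sigma <- matrix(0, q, q)
  for (j in 1:q) {
    V    <- cbind(rep(1, n), Z[, seq_len(j - 1), drop = FALSE])
    kap  <- ncol(V)
    VtVi <- solve(crossprod(V))
    bhat <- VtVi %*% crossprod(V, Z[, j])
    rss  <- sum((Z[, j] - V %*% bhat)^2)
    sig2 <- rss / rchisq(1, df = n - kap)               # draw sigma_j^2 as in (7)
    beta <- bhat + t(chol(sig2 * VtVi)) %*% rnorm(kap)  # draw beta_j as in (8)
    # accumulate implied joint moments: Z_j = b0 + b'Z_{<j} + sigma*eps
    b0 <- beta[1]; b <- beta[-1]
    mu[j] <- b0 + sum(b * mu[seq_len(j - 1)])
    if (j > 1) {
      for (k in seq_len(j - 1))
        Sigma[j, k] <- Sigma[k, j] <- sum(b * Sigma[seq_len(j - 1), k])
    }
    Sigma[j, j] <- sig2 + sum(b * Sigma[seq_len(j - 1), j])
  }
  list(mu = mu, Sigma = Sigma)               # the pair (mu^(t), Sigma^(t))
}
```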
The I Step for the $(t+1)$th iteration of CMI involves sampling $\psi^{(t+1)}$ from $P(\psi \mid \chi_{obs}^*, \mu^{(t)}, \Sigma^{(t)})$, where $\chi_{obs}^*$ includes the fully and partially observed information regarding $\psi$ from $\chi^*$. Since $\chi^*$ is uniquely determined from $\psi$, we do not need to recalculate $\chi^*$ at each iteration in order to align with the data augmentation framework. First, we use $\mu^{(t)}$ and $\Sigma^{(t)}$ to find the parameters that define $P(Z_j \mid Z_1, \ldots, Z_{j-1}, Z_{j+1}, \ldots, Z_q)$ for each $j = 1, \ldots, q$, which is Gaussian by multivariate normal theory. We execute Gibbs sampling from this distribution. As such, for each $j \in \{1, \ldots, q\}$, let

$$\mu_{j|\cdot}^{(t+1)} = E\left[Z_j \mid Z_1^{(t+1)}, \ldots, Z_{j-1}^{(t+1)}, Z_{j+1}^{(t)}, \ldots, Z_q^{(t)}, \mu^{(t)}, \Sigma^{(t)}\right],$$
$$\sigma_{j|\cdot}^{2\,(t+1)} = \text{Var}\left(Z_j \mid Z_1^{(t+1)}, \ldots, Z_{j-1}^{(t+1)}, Z_{j+1}^{(t)}, \ldots, Z_q^{(t)}, \mu^{(t)}, \Sigma^{(t)}\right).$$

Multivariate normal theory is used to determine $\mu_{j|\cdot}^{(t+1)}$ and $\sigma_{j|\cdot}^{2\,(t+1)}$ (for details, see the supplement to Robbins et al. 2013).

If $X_j^*$ is continuous:

• For cases where $X_j^*$ is observed, set $Z_j^{(t+1)} = Z_j$;
• For cases where $X_j^*$ is missing, sample $Z_j^{(t+1)}$ from $N(\mu_{j|\cdot}^{(t+1)}, \sigma_{j|\cdot}^{2\,(t+1)})$.

Note that if $X_j^*$ is binary or ordinal, only partial information is known regarding $Z_j$, even for cases where $X_j^*$ is observed. This information is incorporated in the sampling scheme for binary $X_j^*$ as follows:

• For cases where $X_j^*$ is missing, sample $Z_j^{(t+1)}$ from $N(\mu_{j|\cdot}^{(t+1)}, \sigma_{j|\cdot}^{2\,(t+1)})$;
• For cases with $X_j^* = 0$, draw $Z_j^{(t+1)}$ from $\text{trN}(\mu_{j|\cdot}^{(t+1)}, \sigma_{j|\cdot}^{2\,(t+1)}, -\infty, 0)$;
• For cases with $X_j^* = 1$, draw $Z_j^{(t+1)}$ from $\text{trN}(\mu_{j|\cdot}^{(t+1)}, \sigma_{j|\cdot}^{2\,(t+1)}, 0, \infty)$.

In the above, $\text{trN}(\mu, \sigma^2, a, b)$ is a truncated normal distribution with mean $\mu$, variance $\sigma^2$, and bounds of $a$ and $b$. That is, $X \sim \text{trN}(\mu, \sigma^2, a, b) \Rightarrow X \equiv (Z \mid a < Z < b)$ with $Z \sim N(\mu, \sigma^2)$. To find $Z_j^{(t+1)}$ if $X_j^*$ is ordinal with $k_j$ categories:

• For cases where $X_j^*$ is missing, sample $Z_j^{(t+1)}$ from $N(\mu_{j|\cdot}^{(t+1)}, \sigma_{j|\cdot}^{2\,(t+1)})$;
• For cases with $X_j^* = i$ where $1 \le i \le k_j$, draw $Z_j^{(t+1)}$ from $\text{trN}(\mu_{j|\cdot}^{(t+1)}, \sigma_{j|\cdot}^{2\,(t+1)}, \tau_{j,i-1}, \tau_{j,i})$.

Herein, we again set $\tau_{j,0} = -\infty$ and $\tau_{j,k_j} = \infty$. To initialize the MCMC procedure, we find that setting $\mu_{j|\cdot}^{(0)} = 0$ and $\sigma_{j|\cdot}^{2\,(0)} = 1$ and sampling $\psi^{(0)} = \{Z_1^{(0)}; \ldots; Z_q^{(0)}\}$ according to the rules above performs sufficiently well. Of course, more rigorous options could be implemented.
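A sketch of one coordinate update in the I Step, assuming (mu, Sigma) from the current P Step: cond_moments applies the standard multivariate normal conditioning formulas, and rtruncnorm1 draws a truncated normal by inverse-CDF sampling. Names are illustrative.

```r
# Conditional mean and variance of Z_j given the other coordinates of z.
cond_moments <- function(z, j, mu, Sigma) {
  o <- setdiff(seq_along(mu), j)
  w <- solve(Sigma[o, o], Sigma[o, j])       # Sigma_{-j,-j}^{-1} Sigma_{-j,j}
  m <- mu[j] + sum(w * (z[o] - mu[o]))
  v <- Sigma[j, j] - sum(w * Sigma[o, j])
  c(mean = m, var = v)
}

# One draw from trN(mu, sd^2, a, b) via the inverse-CDF method.
rtruncnorm1 <- function(mu, sd, a, b) {
  u <- runif(1, pnorm(a, mu, sd), pnorm(b, mu, sd))
  qnorm(u, mu, sd)
}

# e.g., for a case with observed binary X*_j = 1 (so Z_j >= 0 is known):
# cm   <- cond_moments(z, j, mu, Sigma)
# z[j] <- rtruncnorm1(cm["mean"], sqrt(cm["var"]), 0, Inf)
```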
After a burn-in period of $b$ iterations, the MCMC procedure is stopped and $\psi^{(b)} = \{Z_1^{(b)}, \ldots, Z_q^{(b)}\}$ indicates the final imputed version of the latent data. The final imputations for the (reformatted) recorded dataset are denoted $\tilde{\chi}^* = \{\tilde{X}_1^*, \ldots, \tilde{X}_q^*\}$ and are derived from $\psi^{(b)}$ as follows. If $X_j^*$ is continuous, $\tilde{X}_j^* = F_j^{-1}(\Phi(Z_j^{(b)}))$; see Robbins et al. (2013) and Robbins (2014) for specifics regarding transformation and untransformation of marginal distributions. If $X_j^*$ is binary,

$$\tilde{X}_j^* = \begin{cases} 0, & \text{if } Z_j^{(b)} < 0, \\ 1, & \text{if } Z_j^{(b)} \ge 0, \end{cases}$$

and if $X_j^*$ is ordinal with $k_j$ categories, $\tilde{X}_j^* = i$ if $\tau_{j,i-1} < Z_j^{(b)} \le \tau_{j,i}$ for $i \in \{1, \ldots, k_j\}$.

The nesting structure described in (4), in which case an unordered categorical variable $X_j \subseteq \chi$ is represented by $\{X_{j'}^*, \ldots, X_{j'+k_j-2}^*\} \subseteq \chi^*$ for some $j'$, is then reversed, creating the final imputed dataset $\tilde{\chi} = \{\tilde{X}_1, \ldots, \tilde{X}_p\}$. This is accomplished by setting

$$\tilde{X}_j = \begin{cases} 1, & \text{if } \tilde{X}_{j'}^* = 1, \\ 2, & \text{if } \tilde{X}_{j'+1}^* = 1 \text{ and } \tilde{X}_{j'}^* = 0, \\ \;\;\vdots \\ k_j - 1, & \text{if } \tilde{X}_{j'+k_j-2}^* = 1 \text{ and } \tilde{X}_i^* = 0 \text{ for each } i \in \{j', \ldots, j'+k_j-3\}, \\ k_j, & \text{if } \tilde{X}_i^* = 0 \text{ for each } i \in \{j', \ldots, j'+k_j-2\}, \end{cases}$$

and by setting other variables contained in $\chi$ equal to their corresponding imputed version in $\tilde{\chi}^*$.

To apply multiple imputation (Rubin 1987, 1996), the entire process illustrated above is repeated independently $m$ times to produce $m$ separately imputed datasets. Well known combining rules are used to aggregate the datasets and adjust estimators for imputation error.

Note that the marginal transformations that are applied to continuous variables in (5) assume that $F_j(x) = \Pr(X_j^* \le x)$ is known for each relevant $j$, and likewise that $\tau_{j,i} = \Phi^{-1}\{\Pr(X_j^* \le i)\}$ is assumed known for each ordinal $X_j^*$. In practice, these quantities are estimated, which may induce bias into the transformations under missingness mechanisms that are not missing completely at random (borrowing the terminology of Little and Rubin 2020). However, earlier studies involving continuous data (Robbins et al. 2013; Robbins 2014) find no evidence of substantial bias stemming from transformations. Note also that the copula framework applied to continuous variables requires that, following the marginal transformations, the transformed variables obey a multivariate normal distribution (i.e., relationships between variables are linear). However, the aforementioned studies (e.g., Robbins et al. 2013; Robbins 2014) have also shown that in practice, bivariate relationships are often more linear following such transformations than before.

The manner in which we handle categorical variables is, to our knowledge, novel. Alternative approaches proposed by other authors do not impose missingness in nested variables (Allison 2002; Honaker et al. 2011; Carpenter and Kenward 2012); imputed values of the categorical variable are then set as the category that observes the highest value among the imputed nested variables. However, rigorous evaluations of this approach are scarce, as noted by Carpenter and Kenward (2012). In contrast, our proposed approach performs well in simulations (see Section 4).
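A matching sketch of the reversal: after imputation, the first nested indicator equal to one determines the category, and all zeros map to the final category $k_j$. The function name is again ours.

```r
# Reverse the nesting: rows of `nested` are imputed 0/1 indicators.
unnest_categorical <- function(nested) {       # nested: n x (k-1) matrix
  k   <- ncol(nested) + 1
  out <- integer(nrow(nested))
  for (i in seq_len(nrow(nested))) {
    hit    <- which(nested[i, ] == 1)
    out[i] <- if (length(hit)) min(hit) else k # all zeros -> final category
  }
  out
}

unnest_categorical(rbind(c(0, 1, 0), c(0, 0, 0)))  # returns 2 and 4
```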
The sweep operator (Beaton 1964; Goodnight 1979) is used to dramatically improve the computational efficiency of the CMI algorithm in both the P Step and the I Step. Specifically, through the use of this operation in the P Step, all information needed for the conditional models of $Z_j^{(t)}$ for each $j = 1, \ldots, q$, as seen in (6), can be calculated through nearly the same amount of computation as would be needed to determine only the quantities necessary for the conditional model for $Z_q^{(t)}$. To elaborate, the sweep operator is a matrix transformation that is applied to a specific column of a symmetric matrix (i.e., "sweeping in" the column), and groups of columns may be "swept in" by applying the operation to the individual columns (and resulting matrices) in sequence. The operation functions so that columns may be "swept in" in any given order without changing the end result.

Assume for the moment that each predictor matrix used for the models in (6) contains the maximum permissible number of predictors (i.e., $V_j^{(t)} = \{1, Z_1^{(t)}, \ldots, Z_{j-1}^{(t)}\}$). If the first $j$ columns of the $(q+1) \times (q+1)$ matrix $A^{(t)} = (V_{q+1}^{(t)})^T V_{q+1}^{(t)}$ are swept in, the result, which is a $(q+1) \times (q+1)$ matrix that we notate with $B_j^{(t)}$, contains submatrices which yield the information needed to determine $\hat{\beta}_j^{(t)}$ and $(s_j^{(t)})^2$ without further matrix computations. Then, the sweep operator can be applied to column $j+1$ of $B_j^{(t)}$ to yield $\hat{\beta}_{j+1}^{(t)}$ and $(s_{j+1}^{(t)})^2$. As such, applying the operation in sequence to columns 1 through $q$ of the matrix $A^{(t)}$ yields $\hat{\beta}_j^{(t)}$ and $(s_j^{(t)})^2$ for each $j = 1, \ldots, q$ in the same number of computations it would take to calculate only $\hat{\beta}_q^{(t)}$ and $(s_q^{(t)})^2$. Note that there also exists a reverse sweep operator that is used to "sweep out" any predictors that have been excluded from specific conditional models.

In the I Step, the reverse sweep operator is applied to each of the columns of $(\Sigma^{(t)})^{-1}$ separately to help find $\mu_{j|\cdot}^{(t)}$ and $\sigma_{j|\cdot}^{2\,(t)}$.
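A sketch of the sweep operator and of the sequential sweeping described above, written in one common sign convention (conventions differ across references); the function names are ours.

```r
# SWEEP on column k of a symmetric matrix A.
sweep_op <- function(A, k) {
  d <- A[k, k]
  B <- A - tcrossprod(A[, k], A[k, ]) / d   # a_ij - a_ik * a_kj / a_kk
  B[k, ] <- A[k, ] / d
  B[, k] <- A[, k] / d
  B[k, k] <- -1 / d
  B
}

set.seed(1)
n <- 200; q <- 4
Z <- matrix(rnorm(n * q), n, q)
Z[, 2] <- 0.6 * Z[, 1] + rnorm(n)           # induce some dependence
A <- crossprod(cbind(1, Z))                 # (q+1) x (q+1); columns: 1, Z_1, ..., Z_q

B <- A
for (j in 1:q) {
  B <- sweep_op(B, j)                       # columns 1..j are now swept in
  beta_hat <- B[1:j, j + 1]                 # coefficients of Z_j on {1, Z_1, ..., Z_{j-1}}
  rss      <- B[j + 1, j + 1]               # residual SS, i.e., (n - kappa_j) * s_j^2
}
# After the loop, beta_hat matches coef(lm(Z[, q] ~ Z[, 1:(q - 1)]))
# up to numerical error; each earlier j was obtained along the way.
```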
CMI applies joint modeling, which avoids the theoretical issues encountered with FCS and guarantees that CMI imputations will converge across iterations of MCMC. Further, strategic use of the sweep operator in CMI ensures that it is more computationally efficient than existing implementations of FCS. Most existing implementations of imputation by joint modeling (e.g., Robbins et al. 2013; Schafer 2017; Zhao and Schafer 2018) do not facilitate general data structures.

The R package jomo (Carpenter and Kenward 2012; Quartagno and Carpenter 2019), which uses a latent Gaussian process to underpin non-continuous variables, is perhaps most closely aligned with CMI in terms of utility, but CMI includes many advancements over jomo. Specifically, jomo does not build the joint model from a sequence of conditional models as seen in (6) but instead attempts to directly estimate the covariance matrix. However, estimation of a covariance matrix that is subject to restrictions (e.g., the diagonal elements that correspond to binary variables must be set to 1) is difficult in practice as the result may not be positive semi-definite. jomo addresses this issue by using a guess-and-check Metropolis-Hastings algorithm, and further applies a guess-and-check method in lieu of sampling from a truncated normal distribution. These issues lead to infeasibility of the algorithm when applied to high dimensional, complex data. Lastly, jomo does not let its user specify dependencies, as efforts to set specific elements of a covariance matrix equal to zero may also result in a matrix that is not positive semi-definite.

The CMI procedure provides a more natural method by which covariance matrices of the latent process can be estimated. For instance, by setting the conditional error variance of the models for binary variables to be one (instead of attempting to restrict diagonal elements of a covariance matrix to be one), we ensure that the resulting covariance matrix will be positive semi-definite and can be estimated using appropriate Bayesian techniques. Furthermore, variables can be dropped from specific conditional models in (6) while maintaining a positive semi-definite covariance matrix, enabling the user to avoid multi-collinearities and impose desired conditional dependence structures.

4 Simulation study

In this section, we perform a simulation study to evaluate the effectiveness of CMI and compare its performance to that of FCS. We use the implementation of FCS available in the R package mice (Van Buuren and Groothuis-Oudshoorn 2010); mice contains flexibility to determine which dependencies are allowed for each variable (by default, all possible dependencies are enabled). Furthermore, one can assign different methods of imputation to each variable, with Gaussian imputation available for continuous variables, logistic regression for binary variables, and polytomous regression for categorical variables. mice also implements predictive mean matching (PMM; Little 1988), which uses a nearest neighbor-type approach based on a predictive model and is often applied to handle continuous variables that may have non-Gaussian marginal distributions. PMM can also be applied to binary, unordered categorical, and ordinal variables.

In this simulation study, we generate a dataset that contains six variables with differing marginal structures, loosely outlined as follows:

• $X_1$ – Unordered categorical
• $X_2$ – Continuous (fully observed)
• $X_3$ – Continuous
• $X_4$ – Binary (generated from a probit-type model)
• $X_5$ – Ordinal (generated from a probit-type model)
• $X_6$ – Binary (generated from a logistic model)

Elaborating, $X_1$ is generated from a multinomial distribution with 4 categories. A latent process that underpins $X_2, \ldots, X_5$ is generated from a multivariate normal distribution while conditioning on $X_1$. Further, $X_6$ is generated from a logistic model conditional on $X_1, \ldots, X_5$. Non-negligible associations exist between all variables. We generate $n = 2{,}500$ observations of each variable.
Missingness is stochastically imposed in the synthetic data using the following three mechanisms. In each case, around a third of the observations are missing (excluding $X_2$).

1. MCAR: Missingness probabilities are independent of any other data characteristics.
2. MAR: Missingness probabilities depend upon only the fully observed variable $X_2$.
3. NMAR: Missingness probabilities in variable $X_j$ depend upon only $X_j$ for $j \in \{1, 3, \ldots, 6\}$.

These mechanisms are designed in line with the nomenclature of Little and Rubin (2020). Further details on the data generating and missingness mechanisms are provided in Section A.1 of the supplemental materials. Note that missingness rates in each variable (with the exception of $X_2$) are approximately 33% under each mechanism.

Next, the missing values are imputed using three methods:

1. logistic: mice is used with logistic regression for binary variables, polytomous regression for the categorical variable, and PMM for the remaining variables.
2. PMM: mice is used with PMM for all variables.
3. CMI: Coherent multivariate imputation as proposed in Section 3. An empirical distribution transformation (Robbins 2014) is applied to continuous variables.

We use 5 iterations of MCMC for the mice methods and 25 iterations for CMI; more iterations of CMI are used because it is notably more computationally efficient. All possible inter-variable dependencies are enabled for each method. To adjust for imputation error, we use multiple imputation (Rubin 1987, 1996) with $m = 40$ independently imputed datasets for each method. This selection of $m$ is in line with the recommendations of Graham et al. (2007) given the missingness rates used here.
We use $N = 5{,}000$ replications for this simulation study; that is, the above process of simulating and imputing data is repeated independently 5,000 times. The following parameters are tracked in each replication for each method:

• Means and the variance-covariance matrix of the dataset $\{X_{1,1}, \ldots, X_{1,4}, X_2, \ldots, X_6\}$, where the $X_{1,k}$ for $k \in \{1, \ldots, 4\}$ are categorical indicators underpinning $X_1$ (although the mean and variance of $X_2$ are excluded). There are 8 mean parameters calculated with 8 variances and 36 covariances.

• Estimated regression coefficients, and standard errors of those coefficients, for all fully specified conditional models of the form $P(X_j \mid X_1, \ldots, X_{j-1}, X_{j+1}, \ldots, X_6)$ for $j \in \{1, \ldots, 6\}$. For continuous and ordinal variables, we fit a basic linear model. For binary variables, we fit a logistic regression, and for the categorical variable, we fit a multinomial log-linear model via the nnet package in R (Venables and Ripley 2002). There are 58 regression parameters tabulated with 58 standard errors on those parameters.

We calculate root-mean squared error (rMSE) for all parameters and coverage rates for a subset of parameters. We let $\hat{\theta}^{[r]}(x)$ denote the value of a parameter $\theta$ estimated at the $r$th replication for imputation method $x$ ($\hat{\theta}^{[r]}(x)$ is calculated as the average of separate estimates of $\theta$ produced for each of the multiply imputed datasets). For method $x$, we calculate the rMSE in the estimate of $\theta$ as follows:

$$\text{rMSE}_\theta(x) = \sqrt{\frac{1}{N}\sum_{r=1}^{N}\left[\hat{\theta}^{[r]}(x) - \theta\right]^2}.$$

Note that unlike coverage, rMSE is calculated for all parameters listed above. To evaluate the relative improvement (or lack thereof) in rMSE offered by CMI over the two mice-based methods, we calculate the following:

$$\Delta\text{MSE}_\theta(x) = \frac{\text{rMSE}_\theta(\text{CMI}) - \text{rMSE}_\theta(x)}{\text{rMSE}_\theta(x)}$$

for $x$ representing the logistic method and for $x$ then representing the PMM method.

Box plots of $\Delta\text{MSE}_\theta(x)$ for the logistic and PMM methods for all 168 parameters are shown in Figure 1. A negative value of $\Delta\text{MSE}_\theta(x)$ indicates that CMI outperforms the respective procedure. The figure indicates that more often than not, CMI provides better rMSE than the mice-based methods. Specifically, for the logistic method, 61.3%, 69.6%, and 62.5% of parameters have a negative value of $\Delta\text{MSE}_\theta(x)$ under the MCAR, MAR, and NMAR mechanisms, respectively. These values are 67.3%, 70.2%, and 57.1% for the PMM method. The exact values of $\text{rMSE}_\theta(x)$ for all methods and mechanisms are shown in tables provided in Section A.2 of the supplemental materials.

If $\theta$ is the mean of a variable or a regression coefficient, we use Rubin's combining rules (Rubin 1987) across the multiply imputed datasets to approximate the variance of $\hat{\theta}^{[r]}(x)$, which we denote $T^{[r]}(x)$ at the $r$th replication. Then, for these parameters, we calculate the coverage of a $(1-\alpha)$% confidence interval around $\theta$ as $N^{-1}\sum_{r=1}^{N} C_\theta^{[r]}(x)$, where

$$C_\theta^{[r]}(x) = \begin{cases} 1, & \text{if } \theta \in \left\{\hat{\theta}^{[r]}(x) \pm t_{1-\alpha/2,\,d^{[r]}}\sqrt{T^{[r]}(x)}\right\}, \\ 0, & \text{otherwise}, \end{cases} \qquad (9)$$

and where $t_{\alpha,\nu}$ is the $100\alpha$th percentile of a $t$ distribution with $\nu$ degrees of freedom (where the degrees of freedom at the $r$th replication, $d^{[r]}$, are calculated from the within- and between-imputation variances).
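For reference, a sketch of Rubin's combining rules that yield $T^{[r]}(x)$, the degrees of freedom $d^{[r]}$, and the interval in (9); the function name is ours.

```r
# qhat and u: length-m vectors of per-dataset point estimates and variances.
pool_rubin <- function(qhat, u, alpha = 0.05) {
  m    <- length(qhat)
  qbar <- mean(qhat)                                  # combined point estimate
  ubar <- mean(u)                                     # within-imputation variance
  b    <- var(qhat)                                   # between-imputation variance
  Tvar <- ubar + (1 + 1/m) * b                        # total variance T
  df   <- (m - 1) * (1 + ubar / ((1 + 1/m) * b))^2    # Rubin (1987) degrees of freedom
  ci   <- qbar + c(-1, 1) * qt(1 - alpha/2, df) * sqrt(Tvar)
  list(estimate = qbar, T = Tvar, df = df, ci = ci)
}
```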
Figure 1: Box plots of the relative difference in rMSE between CMI and the logistic and PMM mice methods for 168 separate parameters under various missingness mechanisms.

Box plots across the 66 parameters for which the coverage rates were calculated are shown in Figure 2 for each method and missingness mechanism. The estimated rates approximate the coverage of a 95% confidence interval for the parameters. NMAR results are excluded from the figure since all methods provide poor coverage under NMAR missingness. Exact rates of coverage are reported in tables provided in Section A.2 of the supplemental materials. Figure 2 shows that CMI tends to provide conservative intervals (the coverage is usually slightly above 95%), and the logistic method tends to provide anti-conservative intervals (coverage is below 95%), whereas PMM provides both conservative and anti-conservative intervals with a greater magnitude of spread across the parameters than seen with the other methods. Note that the appearance of conservative or anti-conservative intervals may be due to difficulties in determining sampling distributions of estimators with finite $n$ and $m$.

Figure 2: Box plots of the simulated coverage rates for the 95% confidence intervals of 66 separate parameters under various methods and missingness mechanisms; panels correspond to MCAR and MAR.

In summary, within our simulation study, CMI performs no worse than the procedures that use FCS; furthermore, arguments can be made that CMI outperforms those methods.

5 Discussion

The proposed CMI method accomplishes the objectives stated at the beginning of Section 3. Specifically, imputations are sampled from a coherent joint distribution in that the data augmentation framework of Tanner and Wong (1987) is obeyed, thereby ensuring MCMC convergence. Furthermore, it is easily applied in large datasets. Existing general imputation procedures (e.g., Van Buuren and Groothuis-Oudshoorn 2010; Su et al. 2011; Quartagno and Carpenter 2019) are usually computationally onerous or simply inoperable when applied to high dimensional data. For example, the proposed CMI method was upwards of 30 times faster than mice in applications. The need for efficient imputation algorithms with high dimensional data is amplified by the fact that big data are becoming increasingly prevalent and that studies have shown the need for exhaustive variable selection when building imputation models (Robbins and White 2014).

CMI contains the flexibility to select predictors that are included in each conditional model of the form (6), which is needed for a variety of reasons:

1. To facilitate the handling of skip logic in that child questions are not allowed to depend upon parent questions; imputation of mixed discrete/continuous variables is often handled in a similar manner (e.g., Robbins et al. 2013).
2. To facilitate the nested structure in (4) so that nested indicators of one categorical variable are not allowed to conditionally depend upon one another.
3. To avoid the possibility of having the number of predictors in the conditional model exceed the number of observed cases for the response variable (or similar restrictions) or to otherwise avoid collinearity issues wherein a predictor matrix would not be of full rank.
4. To invoke expert opinion regarding the interdependence of variables.

However, conditional independence of two variables within models of the form in (2) does not imply conditional independence between those variables in models of the form in (3), nor does it imply marginal pairwise independence of the variables. As such, discretion with the selection of variables for conditional models within CMI should perhaps be viewed as a means of obtaining an intuitively and computationally valid joint distribution as opposed to a means by which impermissible variable relationships are prevented.

Similarly, there is potential that (when one is selective with regards to the predictors used within the conditional models) the ordering of the variables may affect the imputations. However, as the variable ordering used in our data example was natural due to the nearly monotonic nature of the missingness, we did not explore this issue here and leave it for further work.

Although not discussed here directly, the CMI procedure can be generalized for use in data of a broader range of structures than considered herein. For instance, the procedure can be reformatted to handle semi-continuous data using the principles outlined by Robbins et al. (2013); likewise, it could be reformatted to impute variables that have right-censored observations through the use of a Kaplan-Meier-based transformation. Variables that obey a Poisson (or negative binomial) distribution can be imputed using the empirical distribution transformation (as outlined for continuous variables) or, in a manner that is more theoretically justifiable but also more computationally intensive, imputed using the process outlined here for ordinal variables but while incorporating a Poissonian-type model for the cutpoints.
REFERENCES
Allison, P. D. (2002). Missing Data. Sage University Paper Series on Quantitative Applications in the Social Sciences. Thousand Oaks, CA: Sage Publications.

Beaton, A. E. (1964). The use of special matrix operators in statistical calculus. Technical report, Educational Testing Service. RB-64-51.

Carpenter, J. and M. Kenward (2012). Multiple Imputation and its Application. John Wiley & Sons.

Geman, D. and S. Geman (1984). Stochastic relaxation, Gibbs distributions, and the Bayesian reconstruction of images. IEEE Transactions on Pattern Analysis and Machine Intelligence 6, 721–741.

Goodnight, J. H. (1979). A tutorial on the SWEEP operator. The American Statistician 33, 149–158.

Graham, J. W., A. E. Olchowski, and T. D. Gilreath (2007). How many imputations are really needed? Some practical clarifications of multiple imputation theory. Prevention Science 8(3), 206–213.

Honaker, J., G. King, M. Blackwell, et al. (2011). Amelia II: A program for missing data. Journal of Statistical Software 45(7), 1–47.

Little, R. J. and D. B. Rubin (2020). Statistical Analysis with Missing Data (3rd ed.). John Wiley & Sons.

Little, R. J. A. (1988). A test of missing completely at random for multivariate data with missing values. Journal of the American Statistical Association 83, 1198–1202.

Loh, W.-Y., J. Eltinge, M. Cho, and Y. Li (2016). Classification and regression tree methods for incomplete data from sample surveys. arXiv preprint arXiv:1603.01631.

Nelsen, R. B. (2009). An Introduction to Copulas (2nd ed.). New York: Springer.

Quartagno, M. and J. Carpenter (2019). jomo: A Package for Multilevel Joint Modelling Multiple Imputation.

Raghunathan, T., J. Lepkowski, J. Van Hoewyk, and P. Solenberger (2001). A multivariate technique for multiply imputing missing values using a sequence of regression models. Survey Methodology 27, 85–95.

Raghunathan, T. E., P. W. Solenberger, and J. Van Hoewyk (2002). IVEware: Imputation and variance estimation software. Ann Arbor, MI: Survey Methodology Program, Survey Research Center, Institute for Social Research, University of Michigan.

Robbins, M. W. (2014). The utility of nonparametric transformations for imputation of survey data. Journal of Official Statistics 30(4), 675–700.

Robbins, M. W., S. K. Ghosh, and J. D. Habiger (2013). Imputation in high-dimensional economic data as applied to the Agricultural Resource Management Survey. Journal of the American Statistical Association 108(501), 81–95.

Robbins, M. W. and T. K. White (2011). Farm commodity payments and imputation in the Agricultural Resource Management Survey. American Journal of Agricultural Economics 93(2), 606–612.

Robbins, M. W. and T. K. White (2014). Direct payments, cash rents, land values, and the effects of imputation in US farm-level data. Agricultural and Resource Economics Review 43(3), 451–470.

Rubin, D. B. (1987). Multiple Imputation for Nonresponse in Surveys. New York: John Wiley & Sons.

Rubin, D. B. (1996). Multiple imputation after 18+ years. Journal of the American Statistical Association 91, 473–489.

Schafer, J. L. (1999). Multiple imputation: A primer. Statistical Methods in Medical Research 8, 3–15.

Schafer, J. L. (2017). mix: Estimation/Multiple Imputation for Mixed Categorical and Continuous Data. R package version 1.0-10.

Schenker, N., T. E. Raghunathan, P.-L. Chiu, D. M. Makuc, G. Zhang, and A. J. Cohen (2006). Multiple imputation of missing income data in the National Health Interview Survey. Journal of the American Statistical Association 101, 924–933.

Su, Y.-S., M. Yajima, A. E. Gelman, and J. Hill (2011). Multiple imputation with diagnostics (mi) in R: Opening windows into the black box. Journal of Statistical Software 45(2), 1–31.

Tanner, M. A. and W. H. Wong (1987). The calculation of posterior distributions by data augmentation (with discussion). Journal of the American Statistical Association 82, 528–550.

Van Buuren, S. (2018). Flexible Imputation of Missing Data. CRC Press.

Van Buuren, S., J. P. L. Brand, C. G. M. Groothuis-Oudshoorn, and D. B. Rubin (2006). Fully conditional specification in multivariate imputation. Journal of Statistical Computation and Simulation 76, 1049–1064.

Van Buuren, S. and K. Groothuis-Oudshoorn (2010). mice: Multivariate imputation by chained equations in R. Journal of Statistical Software, 1–68.

Venables, W. N. and B. D. Ripley (2002). Modern Applied Statistics with S (4th ed.). New York: Springer.

White, I. R., P. Royston, and A. M. Wood (2011). Multiple imputation using chained equations: Issues and guidance for practice. Statistics in Medicine 30(4), 377–399.

Zhao, J. H. and J. L. Schafer (2018). pan: Multiple Imputation for Multivariate Panel or Clustered Data. R package version 1.6.

Supplementary Materials: A Flexible and Efficient Algorithm for Joint Imputation of General Data
Michael W. Robbins
A.1 Simulation Details
A synthetic dataset $\{X_1, \ldots, X_6\}$ is generated as follows. The first variable, $X_1$, is drawn from a 4-level multinomial distribution with fixed categorical probabilities; we let $\mathbb{X}_1$ denote the corresponding $n \times 4$ matrix of category indicators. We draw $\psi$ from a 4-dimensional multivariate normal distribution with a mean vector of $\mathbf{0}$ and a covariance matrix that has ones along the diagonal and 1/2 on each off-diagonal element. Letting $Z = \{Z_2, \ldots, Z_5\}$, we write $Z = \mathbb{X}_1\gamma + \psi$ for a fixed coefficient vector $\gamma$. In addition, we let $\pi = \mathbb{X}_1\rho + \psi\xi$ for fixed coefficient vectors $\rho$ and $\xi$. We set $X_2 = Z_2$, $X_3 = Z_3$, and $X_4 = 1$ if $Z_4 \le 0$ with $X_4 = 0$ otherwise. In addition, $X_5$ is determined by partitioning $Z_5$ at the cutpoints $-c$, $0$, and $c$ for a fixed constant $c > 0$:

$$X_5 = \begin{cases} 1, & \text{if } Z_5 \le -c, \\ 2, & \text{if } -c < Z_5 \le 0, \\ 3, & \text{if } 0 < Z_5 \le c, \\ 4, & \text{if } Z_5 > c. \end{cases}$$

Lastly, $X_6$ is sampled from a binomial distribution so that $\Pr(X_6 = 1) = 1/[1 + \exp(-\pi)]$.

Missingness is imposed in $\{X_1, X_3, \ldots, X_6\}$ as follows. Let $R_j$ denote an indicator that is unity if $X_j$ is missing and zero otherwise for $j \in \{1, 3, \ldots, 6\}$. Further, $\Pr(R_j = 1) = 1/[1 + \exp(-\beta_{j,0} - \beta_{j,1}X_2 - \beta_{j,2}X_j)]$ for $j \in \{1, 3, \ldots, 6\}$. For all mechanisms, we set $\beta_{j,0} = -\log 2$ to obtain a missingness rate of approximately 1/3 for each variable. Under MCAR missingness, $\beta_{j,1} = 0$ and $\beta_{j,2} = 0$ for all $j$. Likewise, under MAR missingness, the $\beta_{j,1}$ are assigned fixed nonzero values that vary in sign and magnitude across $j$, with $\beta_{j,2} = 0$ for all $j$; under NMAR missingness, the $\beta_{j,2}$ are assigned the analogous fixed nonzero values, with $\beta_{j,1} = 0$ for all $j$.
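A sketch of this generating mechanism in R. The category probabilities, the entries of $\gamma$, $\rho$, and $\xi$, the cutpoint on $Z_5$, and the MAR slope are garbled in the source, so the values below are labeled stand-ins chosen only for illustration.

```r
set.seed(1)
n   <- 2500
X1  <- sample(1:4, n, replace = TRUE)               # stand-in: equal probabilities
X1m <- model.matrix(~ factor(X1) - 1)               # n x 4 indicator matrix
S   <- matrix(1/2, 4, 4); diag(S) <- 1              # unit variances, 1/2 covariances
psi <- matrix(rnorm(n * 4), n, 4) %*% chol(S)       # rows ~ N_4(0, S)
gam <- c(0.5, 0.5, -0.5, -0.5)                      # stand-in coefficient vectors
rho <- c(0.5, 0.5, -0.5, -0.5)
xi  <- c(0.5, -0.5, -0.5, 0.5)
Z   <- as.vector(X1m %*% gam) + psi                 # common shift added to each Z_j
pi_ <- as.vector(X1m %*% rho + psi %*% xi)
X2  <- Z[, 1]; X3 <- Z[, 2]
X4  <- as.integer(Z[, 3] <= 0)                      # probit-type binary
X5  <- cut(Z[, 4], c(-Inf, -0.5, 0, 0.5, Inf),      # ordinal; stand-in c = 0.5
           labels = FALSE)
X6  <- rbinom(n, 1, plogis(pi_))                    # logistic binary

# MAR-style missingness in X3: intercept -log(2) targets a rate near 1/3;
# the slope 0.75 is a stand-in for the garbled beta_{j,1}.
R3 <- rbinom(n, 1, plogis(-log(2) + 0.75 * X2))
X3[R3 == 1] <- NA
```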
A.2 Simulation Tables

Tables A.1–A.15 list the coverage and rMSE for the parameters studied in the simulations of Section 4 under the three different missingness mechanisms.

Table A.1: Coverage rates and rMSE for the means of the simulated variables (where $\{X_{1,1}, \ldots, X_{1,4}\}$ are binary indicators created from $X_1$) across the 5,000 simulated datasets under a MCAR missingness mechanism. The methods labeled "logistic" and "PMM" use mice.

            X_{1,1}  X_{1,2}  X_{1,3}  X_{1,4}  X_2  X_3     X_4     X_5     X_6
Coverage
  logistic  0.9424   0.9358   0.9368   0.9406   —    0.9482  0.9536  0.9524  0.9472
  PMM       0.9556   0.9420   0.9414   0.9524   —    0.9482  0.9552  0.9520  0.9500
  CMI       0.9538   0.9500   0.9530   0.9536   —    0.9482  0.9556  0.9518  0.9474
rMSE
  logistic  0.0103   0.0106   0.0105   0.0104   —    0.0245  0.0115  0.0178  0.0122
  PMM       0.0103   0.0106   0.0105   0.0104   —    0.0245  0.0116  0.0177  0.0122
  CMI       0.0103   0.0106   0.0105   0.0104   —    0.0245  0.0116  0.0178  0.0122
Table A.2: Coverage rates using three methods of imputation for the parameters of the fully-specified regression models of the simulated variables (where $\{X_{1,1}, \ldots, X_{1,4}\}$ are binary indicators created from $X_1$) across the 5,000 simulated datasets under a MCAR missingness mechanism. Rows indicate the outcome variable and columns indicate the predictors. The methods labeled "logistic" and "PMM" use mice.

Table A.3: … (where $\{X_{1,1}, \ldots, X_{1,4}\}$ are binary indicators created from $X_1$) across the 5,000 simulated datasets under a MCAR missingness mechanism. Rows indicate the outcome variable and columns indicate the predictors. The methods labeled "logistic" and "PMM" use mice.

Table A.4: … (where $\{X_{1,1}, \ldots, X_{1,4}\}$ are binary indicators created from $X_1$) across the 5,000 simulated datasets under a MCAR missingness mechanism. The methods labeled "logistic" and "PMM" use mice.

Table A.5: … (where $\{X_{1,1}, \ldots, X_{1,4}\}$ are binary indicators created from $X_1$) across the 5,000 simulated datasets under a MCAR missingness mechanism. Rows indicate the outcome variable and columns indicate the predictors. The methods labeled "logistic" and "PMM" use mice.

Table A.6: Coverage rates and rMSE for the means of the simulated variables (where $\{X_{1,1}, \ldots, X_{1,4}\}$ are binary indicators created from $X_1$) across the 5,000 simulated datasets under a MAR missingness mechanism. The methods labeled "logistic" and "PMM" use mice.

            X_{1,1}  X_{1,2}  X_{1,3}  X_{1,4}  X_2  X_3     X_4     X_5     X_6
Coverage
  logistic  0.9346   0.9264   0.9322   0.9372   —    0.9516  0.9542  0.9464  0.9500
  PMM       0.9550   0.9288   0.9276   0.9446   —    0.9518  0.9540  0.9484  0.9488
  CMI       0.9674   0.9468   0.9662   0.9518   —    0.9510  0.9514  0.9458  0.9476
rMSE
  logistic  0.0109   0.0109   0.0108   0.0104   —    0.0262  0.0121  0.0183  0.0124
  PMM       0.0110   0.0114   0.0112   0.0106   —    0.0263  0.0123  0.0184  0.0125
  CMI       0.0102   0.0108   0.0101   0.0103   —    0.0260  0.0122  0.0182  0.0124

Table A.7: … (where $\{X_{1,1}, \ldots, X_{1,4}\}$ are binary indicators created from $X_1$) across the 5,000 simulated datasets under a MAR missingness mechanism. Rows indicate the outcome variable and columns indicate the predictors. The methods labeled "logistic" and "PMM" use mice.

Table A.8: … (where $\{X_{1,1}, \ldots, X_{1,4}\}$ are binary indicators created from $X_1$) across the 5,000 simulated datasets under a MAR missingness mechanism. Rows indicate the outcome variable and columns indicate the predictors. The methods labeled "logistic" and "PMM" use mice.

Table A.9: … (where $\{X_{1,1}, \ldots, X_{1,4}\}$ are binary indicators created from $X_1$) across the 5,000 simulated datasets under a MAR missingness mechanism. The methods labeled "logistic" and "PMM" use mice.

Table A.10: … (where $\{X_{1,1}, \ldots, X_{1,4}\}$ are binary indicators created from $X_1$) across the 5,000 simulated datasets under a MAR missingness mechanism. Rows indicate the outcome variable and columns indicate the predictors. The methods labeled "logistic" and "PMM" use mice.
Table A.11: Coverage rates and rMSE for the means of the simulated variables (where $\{X_{1,1}, \ldots, X_{1,4}\}$ are binary indicators created from $X_1$) across the 5,000 simulated datasets under a NMAR missingness mechanism. The methods labeled "logistic" and "PMM" use mice.

            X_{1,1}  X_{1,2}  X_{1,3}  X_{1,4}  X_2  X_3     X_4     X_5     X_6
Coverage
  logistic  0.2190   0.8898   0.7450   0.5034   —    0.0000  0.0000  0.7022  0.0526
  PMM       0.2306   0.8920   0.7580   0.5106   —    0.0000  0.0000  0.7014  0.0522
  CMI       0.2272   0.8950   0.7562   0.5106   —    0.0000  0.0000  0.7038  0.0490
rMSE
  logistic  0.0260   0.0111   0.0155   0.0207   —    0.2139  0.0799  0.0272  0.0479
  PMM       0.0261   0.0111   0.0153   0.0209   —    0.2142  0.0804  0.0273  0.0481
  CMI       0.0260   0.0111   0.0154   0.0208   —    0.2133  0.0805  0.0272  0.0481

Table A.12: … (where $\{X_{1,1}, \ldots, X_{1,4}\}$ are binary indicators created from $X_1$) across the 5,000 simulated datasets under a NMAR missingness mechanism. Rows indicate the outcome variable and columns indicate the predictors. The methods labeled "logistic" and "PMM" use mice.

Table A.13: … (where $\{X_{1,1}, \ldots, X_{1,4}\}$ are binary indicators created from $X_1$) across the 5,000 simulated datasets under a NMAR missingness mechanism. Rows indicate the outcome variable and columns indicate the predictors. The methods labeled "logistic" and "PMM" use mice.

Table A.14: … (where $\{X_{1,1}, \ldots, X_{1,4}\}$ are binary indicators created from $X_1$) across the 5,000 simulated datasets under a NMAR missingness mechanism. The methods labeled "logistic" and "PMM" use mice.

Table A.15: … (where $\{X_{1,1}, \ldots, X_{1,4}\}$ are binary indicators created from $X_1$) across the 5,000 simulated datasets under a NMAR missingness mechanism. Rows indicate the outcome variable and columns indicate the predictors. The methods labeled "logistic" and "PMM" use mice.