Missing Data and Prediction: The Pattern Mixture Kernel Submodel
Sarah Fletcher Mercaldo* and Jeffrey D. Blume, Ph.D.**
Department of Biostatistics, Vanderbilt University, Nashville, TN
*email: [email protected]  **email: [email protected]
Summary:
Missing data are a common problem for both the construction and implementation of a prediction algorithm. Pattern mixture kernel submodels (PMKS) - a series of submodels, one for every missing data pattern, each fit using only data from that pattern - are a computationally efficient remedy for both stages. Here we show that PMKS yield the most predictive algorithm among all standard missing data strategies. Specifically, we show that the expected loss of a forecasting algorithm is minimized when each pattern-specific loss is minimized. Simulations and a re-analysis of the SUPPORT study confirm that PMKS generally outperform zero imputation, mean imputation, complete-case analysis, complete-case submodels, and even multiple imputation (MI). The degree of improvement depends strongly on the missingness mechanism and the effect size of the missing predictors. When the data are Missing at Random (MAR), MI can yield comparable forecasting performance but generally at a larger computational cost. We show that predictions from PMKS are equivalent to the limiting predictions of an MI procedure whose mean model depends on the missingness indicators (the MIMI model). Consequently, the MIMI model can be used to assess the MAR assumption in practice. The focus of this paper is on out-of-sample prediction behavior; implications for model inference are only briefly explored.
Key words:
Missing data; Missing-indicator method; Pattern Mixture Models; Prediction models.
1. Introduction
1.1 The Problem
While missing data are problematic for both estimation and prediction, the statistical literature has largely focused on addressing the impact of missing data on estimation procedures and parameter inference. Missing data present a two-fold problem for forecasting: first, in building a model, and second, in using the model to make out-of-sample predictions for individuals with missing predictors. Here we focus on the second problem, specifically evaluating what Wood et al. (2015) define as Pragmatic Model Performance, which refers to the model's performance in a future clinical setting where some individuals may have partly missing predictors.

It is often assumed that imputation methods, because they improve parameter estimation procedures, also improve out-of-sample prediction performance. However, this is just speculation, and our investigation indicates that it is the exception rather than the rule. Ideally, it would be possible to find a prediction rule that uses simple imputation or requires no imputation at all, and thus is much less computationally burdensome and more readily applied in practice. The impact of missing data on out-of-sample prediction is uniformly underestimated. A poor imputation algorithm at the prediction stage can drastically reduce a model's overall prediction performance, and data frequently go missing at this stage in real life.
1.2 Current Approaches to Imputation
Typical strategies for dealing with missing predictors are driven by practical constraints. Common strategies include zero imputation and mean imputation, which are trivial to implement but often lead to poor predictions. Conditional mean imputation and multiple imputation can be implemented with accessible software when fitting a model, but they are rarely used in the clinic when predictors are missing for out-of-sample predictions (Janssen et al., 2009). The advantages and drawbacks of imputation methods for out-of-sample imputation are listed in Table 1. The obvious issue, not well addressed in the literature, is the extent to which these approaches degrade prediction performance, as later shown.

[Table 1 about here.]

Multiple Imputation (MI) draws multiple placeholder values from conditional distributions derived from the observed data (van Buuren, 2012; Janssen et al., 2010; Harrell, 2013), uses the placeholder values to fit the model, and then combines the models using Rubin's rules (Rubin, 2009). When the data are missing at random (MAR), MI can substantially increase inferential efficiency by leveraging information in incomplete data records. The 'best' predictions from a multiply imputed prediction model are the model's predictions averaged over all imputation sets (Vergouwe et al., 2010; Wood et al., 2015). Recently, the popularity of MI as the primary 'principled' strategy for constructing and applying a prediction model in the presence of missing data has grown (Harrell, 2013; Janssen et al., 2009).

However, applying a multiply imputed prediction model to an out-of-sample individual who is missing predictor information is not straightforward. This is because, technically speaking, the predictions need to be re-estimated with the imputation procedure based on the original data and the new out-of-sample record if one wants to apply Predictive Mean Matching or K-Nearest Neighbor imputation techniques (as we did in our simulations and examples). Of course, this requires the original data, the imputation datasets, and substantial on-demand computing power, which is often impractical in real-world settings. Moreover, this approach often has a heavy computational burden and is not easily programmed in web applications (because the imputation algorithm must be repeated for every out-of-sample prediction). One could ignore this step and use the multiply imputed model along with some one-step imputation procedure using the saved chain equations or fitted conditional distributions, but this process is not likely to be congenial with the original fitting approach.
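To make this computational burden concrete, the following minimal sketch shows how a multiply imputed model must score one new record: the imputation engine is re-run with the new record appended, a model is refit on each completed dataset, and the predictions are averaged. It uses scikit-learn's IterativeImputer as a stand-in for a chained-equations engine; the function name and setup are ours, not the exact procedure used in this paper.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.linear_model import LinearRegression

def mi_predict(X_train, y_train, x_new, n_imputations=10):
    """Average predictions over imputation draws for one new record.

    The training data must be re-imputed with the new record appended,
    which is exactly the on-demand cost discussed above.
    """
    preds = []
    X_aug = np.vstack([X_train, x_new])           # append the new record
    for i in range(n_imputations):
        imputer = IterativeImputer(sample_posterior=True, random_state=i)
        X_imp = imputer.fit_transform(X_aug)      # re-run the imputation engine
        model = LinearRegression().fit(X_imp[:-1], y_train)
        preds.append(model.predict(X_imp[-1:])[0])
    return np.mean(preds)                         # prediction averaged over draws
```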
1.3 Proposed Solution

Here we propose an approach that we call the Pattern Mixture Kernel Submodels (PMKS) procedure. PMKS postulates a unique prediction model for each missing data pattern and estimates that model using only the subjects in that pattern. As a result, PMKS requires no imputation algorithm. Details are provided in section 3.1. As a forecasting algorithm, PMKS benefits greatly from the reduction in prediction bias that comes with the pattern-specific approach. The loss in efficiency that can result when the data are MAR, or when the patterns have common parameters, is often small in comparison. Moreover, we will see that PMKS is optimal in the sense of minimizing the expected prediction loss. Because of these advantages, we anticipate that PMKS will have broad impact in the arena of big data, where prediction is paramount and MI is often computationally unfeasible, e.g., when using an entire system of electronic medical records.
1.4 Organization
Section 2 defines our notation and provides a brief background on key missing data concepts. Section 3 describes our proposed methods, provides a simple example, and draws connections between PMKS and MI models in some generality. Section 4 describes extensive simulations of PMKS in order to establish the wide range of settings under which PMKS excels. Section 5 describes the performance of PMKS compared to other imputation strategies applied to the SUPPORT Study, a multi-center two-phase study of 9105 patients, from which a day 3 Physiology Score (SPS) was predicted (Phillips et al., 1995). Section 6 provides some brief concluding remarks.

2. Notation and Background
2.1 Notation
Let $Y = (Y_1, \ldots, Y_n)$ be the vector of $n$ observed responses. With our focus on prediction, we assume all responses are observed. This assumption can be relaxed, but it is not necessary to our discussion here. Predictors (covariates) are denoted by an $(n \times p)$ matrix $X = (X_1, \ldots, X_p)$, where $X_j = (X_{1j}, \ldots, X_{nj})^T$ for $j = 1, \ldots, p$ predictor vectors of length $n$. Let $M = \{M_{ij}\}$ be the $(n \times p)$ matrix of missing data indicators, where $M_{ij} = 1$ if $X_{ij}$ is missing and $M_{ij} = 0$ if $X_{ij}$ is observed, for $i = 1, \ldots, n$ individuals and $j = 1, \ldots, p$ predictors.

To differentiate between models, we will use different Greek symbols for their parameters. For example, parameters in the pattern mixture kernel submodels (PMKS) will be denoted by $\gamma$. Parameters in a traditional regression mean model $E[Y | X] = X\beta$ - those typically of interest in an estimation setting - will be denoted by $\beta$. Notice this is a strong assumption with regard to the missing data, forcing the same mean-response function for every missing data pattern, and implying a MAR mechanism within a single marginal model. Parameters representing the effects of the missingness indicators, $M$, will be denoted by $\delta$; these parameters will distinguish our MIMI model (defined in section 3.5) from a traditional MI model.

2.2 Pattern Mixture and Selection Models
Our approach has roots in the established literature on pattern mixture models (Little, 1993). The traditional pattern mixture model factorization is $P(Y, M | X, \gamma, \pi) = P(Y | X, M, \gamma)\, P(M | X, \pi)$, where $\pi$ is a parameter vector for the missingness mechanism (Little and Rubin, 2014). The pattern-mixture approach allows for a different response model in each missing data pattern. An alternative formulation is the selection model, $P(Y, M | X, \theta, \omega) = P(Y | X, \theta)\, P(M | Y, X, \omega)$, where $\theta$ and $\omega$ are parameter vectors (Little and Rubin, 2014; Little and Wang, 1996). This factorization describes the (single) marginal response model. In this paper, we will not explicitly consider selection models except to use them to generate data from certain missing data mechanisms that cannot be simulated otherwise. While the selection model allows for $Y$ and $M$ to be either independent or dependent, the pattern-mixture model is used when the response model changes by missing data pattern. This flexibility, it turns out, lends an advantage to our proposed PMKS prediction algorithm.

2.3 Missingness Mechanisms
To describe missing data, we use the following missingness mechanisms: Missing Completely at Random (MCAR), Missing at Random (MAR), Missing Not at Random (MNAR), Missing at Random where the missingness depends on $Y$ (MARY), and Missing Not at Random where the missingness depends on $Y$ (MNARY) (Little and Rubin, 2014). The latter two mechanisms can only be simulated in the selection model formulation. A more detailed description of these missingness mechanisms is given in table 3.

If the missingness mechanism is MCAR, then pattern mixture and selection models are equivalent (Little, 1993). When the data are not MCAR, the parameters of the kernel functions associated with the selection and pattern mixture models have different interpretations, and care must be taken when estimating and interpreting them. The selection model describes the marginal relationship of $Y$ on $X$, while the pattern mixture model describes the relationship of $Y$ on $X$ conditional on $M$. Marginal effects from the selection model are generally not identifiable in the context of a pattern mixture model, although some parameterizations can be identified through complete case restrictions that essentially force equality constraints on certain parameters (Little, 1993). Identifiability is obviously a problem when the goal is estimation and data are MNAR. However, here our goal is prediction, and complex re-parameterizations of marginal effects are not a major impediment even if the mapping is not easily reversed. If one marginal model is truly of interest, it is always possible to marginalize over the pattern-specific model. Of course, how that model should be interpreted when the data are not MAR is not immediately clear.

2.4 Complete Case Models and Submodels
Two alternatives to multiple imputation are complete case models and complete case submodels. A complete case analysis simply ignores the records with missing data and estimates a single model. Note that complete-case models routinely have poor prediction performance when the data are not Missing Completely at Random (MCAR) (Knol et al., 2010; Janssen et al., 2009). Complete Case Submodels (CCS) use all available data to fit a submodel for each missing data pattern. That means, for CCS, data records often contribute to multiple patterns/submodels. This approach is vastly different from a complete-case analysis, where just a single model is fit using only subjects with complete data.

Note that the complete case model does not solve the problem of missing out-of-sample predictors; some type of imputation is still required. In contrast, CCS do not have this problem. Their predictions come from the submodel that matches the out-of-sample missing data pattern, and so no imputation is needed. For PMKS, each data record contributes only to a single pattern/submodel, which turns out to be a key difference compared to CCS.
3. Methods
3.1 Pattern Mixture Kernel Submodels
PMKS has roots in the pattern mixture model, using the kernel of the pattern mixture model as a prediction machine. These pattern-specific models are the PMKS. To make predictions, we fit the set of models $\{\hat f_1, \ldots, \hat f_k\}$, where $\hat f_m = \hat f_m(X, M)$ is the pattern mixture model in pattern $m = 1, \ldots, k$, with $k \leq 2^p$ different patterns. Here we consider straightforward prediction algorithms such as $\hat f_m = E(Y | X, M; \hat\gamma_m)$, where $\hat\gamma_m$ is the vector of estimated pattern-specific parameters. However, our findings apply to any generalized linear model or machine learning prediction technique. Although up to $k = 2^p$ different models might have to be fit, in practice only a small fraction of those patterns are observed.

Each PMKS is fit using only subjects from that pattern. This is in contrast to Complete Case Submodels (CCS), where the submodels are fit using all available data. This subtle difference turns out to be critically important, as the CCS approach is only appropriate under certain missing data mechanisms. For comparison, we denote the set of CCS models as $\{\hat g_1, \ldots, \hat g_k\}$, where $\hat g_m = \hat g_m(X) = E(Y | X; \hat\beta^*_m)$ for patterns $m = 1, \ldots, k$. Here we use $\hat\beta^*_m$ as the estimated pattern-specific parameter vector. The asterisk is used to distinguish the CCS parameters from those obtained in a complete case analysis ($\hat\beta$). The difference between $\hat f_m$ and $\hat g_m$ is illustrated below in the context of a simple example. Later, in section 3.4, we discuss how to fit PMKS and CCS when a pattern is not observed or the data are too sparse. CCS have the advantage of being fittable in many of these situations, with the obvious drawback of their strong dependence on the MAR assumption.
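As a concrete sketch (all function and variable names are ours), PMKS can be fit with ordinary least squares by grouping rows on their missingness pattern and regressing on the observed columns only; a new record is then routed to the submodel matching its own pattern, with no imputation anywhere.

```python
import numpy as np

def fit_pmks(X, y):
    """Fit one linear submodel per missing data pattern.

    Each submodel uses only the rows in that pattern and only the
    columns observed in that pattern, so no imputation is required.
    """
    patterns = np.isnan(X)                        # M: missingness indicators
    models = {}
    for pat in np.unique(patterns, axis=0):
        rows = np.all(patterns == pat, axis=1)    # subjects in this pattern
        obs = ~pat                                # observed columns
        Z = np.column_stack([np.ones(rows.sum()), X[np.ix_(rows, obs)]])
        gamma, *_ = np.linalg.lstsq(Z, y[rows], rcond=None)
        models[tuple(pat)] = gamma                # pattern-specific coefficients
    return models

def predict_pmks(models, x_new):
    """Route a new record to the submodel matching its pattern."""
    pat = np.isnan(x_new)
    gamma = models[tuple(pat)]
    return gamma[0] + x_new[~pat] @ gamma[1:]
```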
3.2 A Simple Example

A simple illustration helps to fix ideas. Consider a linear model for a continuous outcome $Y$ with two covariates $X_1, X_2$. There are only four missing data patterns: (1) $X_1, X_2$ both observed; (2) $X_1$ missing, $X_2$ observed; (3) $X_1$ observed, $X_2$ missing; (4) $X_1, X_2$ both missing. The pattern mixture kernel submodels are the set of corresponding response models in each missing data pattern. These are given in table 2.

[Table 2 about here.]

Using the above notation, the estimated PMKS response function for $E[Y | X_2, M_1 = 1, M_2 = 0]$ is $\hat f_2$, while the CCS analogue for $E[Y | X_2]$ is $\hat g_2$. Note that $\gamma_{p,m}$ does not necessarily equal $\beta^*_p$ for any $m = 1, 2, 3, 4$. Also, none of these submodels corresponds to the typical marginal model expressed as $E[Y | X_1, X_2] = \beta_0 + \beta_1 X_1 + \beta_2 X_2$, where the parameters $\beta$ represent traditional direct effects. It is tempting to assume that the model based on pattern 1 would yield proper estimates of this marginal model, but this only happens when the data are MAR on the covariates (and not on the response). This is because that is the only case where the marginal model corresponds exactly with the data generating mechanism in each pattern. That is, it is the only case where the marginal model is the true mean response model for every pattern (White and Carlin, 2010; Bartlett et al., 2014).

3.3 Prediction Performance of PMKS
PMKS is computationally efficient for missing data because it avoids the problem altogether: it fits a series of models, none of which has any missing data. Fitting each submodel is therefore straightforward. The key realization is that minimizing the expected loss in each pattern amounts to minimizing the expected loss marginally. Thus, we need only use standard techniques to fit and cross-validate the pattern-specific models.
3.3.1 Minimizing the Expected Prediction Error.
Minimizing the expected prediction error in each pattern will, in turn, minimize the overall expected prediction error. To see this, note that:

$$E_{Y|X}\left[L\big(Y, \hat f(X)\big)\right] = E_M\left[E_{Y|X,M}\left[L\big(Y, \hat f_m\big)\right]\right] = \sum_M P(M)\, E_{Y|X,M}\left[L\big(Y, \hat f(X, M)\big)\right]$$

where $\hat f_m = \hat f_m(X, M)$. Hence, selecting $\hat f_m$ to minimize the pattern-specific expected loss, $E_{Y|X,M}\left[L\big(Y, \hat f_m(X, M)\big)\right]$, will in turn minimize the overall loss $E_{Y|X}\left[L\big(Y, \hat f(X)\big)\right]$.

Here $L\big(Y, \hat f_m(X, M)\big)$ must be a properly defined pattern-specific loss function. The loss function is flexible; it could be squared error or 0/1 loss (Hastie et al., 2009). But the pattern-specific restriction is important; the result might not hold for certain metrics where predictions in one pattern are compared with predictions in another (for example, the area under the ROC curve). In that case, the overall AUC is not equal to the average of pattern-specific AUCs, which is the property we need in order to take advantage of this approach. Fortunately, most common loss functions of interest in prediction problems have this property.

Note that constructing a prediction model within each missing data pattern effectively resolves the missing data dilemma, because the missing predictors are missing for everyone in that pattern and only marginal effects can be estimated. The result implies that, in practice, prediction models should be constructed and cross-validated within each pattern in order to maximize predictive ability. The only reason to do this marginally is if the MAR assumption is known to hold. Even then, direct estimation of $E_{Y|X}\left[L\big(Y, \hat f(X)\big)\right]$ is complex, in part because only a single model $\hat f$ is used for all predictions, and fitting that model with missing data requires a complex algorithm such as multiple imputation or the EM algorithm. Since each PMKS is directly estimable with routine tools, the right side of the equation can be fit, minimized, and cross-validated rather easily. This simple argument impacts practice because $M$ is often ignored in the modeling stage due to historical concerns about the missing-indicator method.
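A quick numeric check of this identity, as a sketch with arbitrary simulated predictions (all names are ours): the overall mean squared error equals the pattern-probability-weighted average of the pattern-specific mean squared errors.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
y = rng.normal(size=n)
yhat = y + rng.normal(size=n)                    # any predictions will do
m = rng.integers(0, 2, size=n)                   # two missingness patterns

overall = np.mean((y - yhat) ** 2)
weighted = sum(np.mean(m == k) * np.mean((y[m == k] - yhat[m == k]) ** 2)
               for k in (0, 1))
assert np.isclose(overall, weighted)             # E[L] = sum_m P(M=m) E[L | m]
```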
3.3.2 PMKS loss is a weighted average of a full and reduced model.

For a linear model, the squared prediction error is a common and relevant loss function. To examine the bias-variance tradeoff in PMKS, it is helpful to revisit a simple example given by Shmueli (2010), in which the Expected Prediction Error (EPE) is evaluated for a fully specified model (large) versus an underspecified model (small). Suppose data come from the model $f(x) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \epsilon$ with $\epsilon \sim N(0, \sigma^2)$, and the fully specified model is fit as $\hat f(x) = \hat\beta_0 + \hat\beta_1 x_1 + \hat\beta_2 x_2$. Here the expected prediction error is the sum of the bias, variance, and irreducible error of the predictions or fitted values (Hastie et al., 2009):

$$\mathrm{EPE}_L = E\left[\big(Y - \hat f(x_1, x_2)\big)^2\right] = \sigma^2\left(1 + [1\ x_1\ x_2](X_L^T X_L)^{-1}[1\ x_1\ x_2]^T\right)$$

where $\mathrm{EPE}_L$ denotes the EPE of the full model. In contrast, the EPE of the underspecified model, or submodel, is given by:

$$\mathrm{EPE}_S = E\left[\big(Y - \hat f^*(x_1)\big)^2\right] = \big((\gamma_0 + \gamma_1 x_1) - (\beta_0 + \beta_1 x_1 + \beta_2 x_2)\big)^2 + \sigma^2\left(1 + [1\ x_1](X_S^T X_S)^{-1}[1\ x_1]^T\right)$$

where $\hat f^*(x_1) = \hat\beta^*_{0,S} + \hat\beta^*_{1,S} x_1$. Note that in this case $\hat f^*(x_1) = \hat g_2$. The EPE of the PMKS model is just a weighted average of the large and small prediction models:

$$\mathrm{EPE}_{PMKS} = \sum_m P(M = m)\,\mathrm{EPE}_m = \mathrm{EPE}_L\,\big(1 - P(M)\big) + \mathrm{EPE}_S\, P(M)$$

To illustrate the bias-variance tradeoff, we simulated the EPE in figure 1. The simulation fixes $X_1 = x_1$ and draws from the conditional distribution $X_2 \mid X_1 = x_1 \sim N\big(\mu_2 + \frac{\sigma_2}{\sigma_1}\rho_{1,2}(x_1 - \mu_1),\ (1 - \rho_{1,2}^2)\,\sigma_2^2\big)$. The EPE for the correctly specified full model is just the irreducible error, whereas the EPE for the underspecified model increases as the out-of-sample predictor moves away from its population mean.

[Figure 1 about here.]

The out-of-sample prediction error from the large model is given by the green line in figure 1, and is equal to the model variance. If the data were generated from the large model and predictions were given from the small model that includes only $X_1$, then the expected prediction error is approximated by the purple points in figure 1. The yellow points in figure 1 denote the prediction error that arose from the PMKS in this setting; $\hat f_1$ makes predictions when all data are available and $\hat f^*$ makes predictions when only $X_1$ is available. Clearly, PMKS has smaller EPE for every out-of-sample $X_1$. In this case the probability of missingness was 50%, $P(M = 1) = 0.5$.
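The figure 1 behavior can be reproduced with a short simulation. This is a sketch under assumed parameter values ($\beta$, $\rho$, and the missingness probability are our choices), not the exact configuration used for the figure: the full-model EPE stays flat while the small-model EPE grows as the out-of-sample $x_1$ moves away from its mean, and the PMKS EPE is the weighted average of the two.

```python
import numpy as np

rng = np.random.default_rng(1)
beta = np.array([1.0, 3.0, 1.0])                 # assumed (beta0, beta1, beta2)
rho, p_miss, n, n_rep = 0.5, 0.5, 200, 2000

for x1 in np.linspace(-3, 3, 7):                 # out-of-sample X1, in SD units
    err_L, err_S = [], []
    for _ in range(n_rep):
        X1 = rng.normal(size=n)
        X2 = rho * X1 + np.sqrt(1 - rho**2) * rng.normal(size=n)
        y = beta[0] + beta[1]*X1 + beta[2]*X2 + rng.normal(size=n)
        # fit the full (Large) and reduced (Small) models
        ZL = np.column_stack([np.ones(n), X1, X2])
        ZS = np.column_stack([np.ones(n), X1])
        bL, *_ = np.linalg.lstsq(ZL, y, rcond=None)
        bS, *_ = np.linalg.lstsq(ZS, y, rcond=None)
        # a new subject at X1 = x1, with X2 drawn from its conditional law
        x2 = rho * x1 + np.sqrt(1 - rho**2) * rng.normal()
        y_new = beta[0] + beta[1]*x1 + beta[2]*x2 + rng.normal()
        err_L.append((y_new - np.array([1, x1, x2]) @ bL) ** 2)
        err_S.append((y_new - np.array([1, x1]) @ bS) ** 2)
    epe_L, epe_S = np.mean(err_L), np.mean(err_S)
    epe_pmks = (1 - p_miss) * epe_L + p_miss * epe_S
    print(f"x1={x1:+.1f}  EPE_L={epe_L:.2f}  EPE_S={epe_S:.2f}  PMKS={epe_pmks:.2f}")
```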
3.4 Practical Considerations

As discussed earlier, multiple imputation has a substantial computational burden for out-of-sample predictions with missing data, because the imputation algorithm must be re-run with the new individual appended to the data. PMKS, on the other hand, do not need to be recomputed for every new prediction. The upfront computational effort is large for $2^p$ patterns, but it is often minor compared to the MI machinery required in the same setting, and in practice only a fraction of the possible patterns are observed. When data are sparse within a pattern, it may not be possible to fit the PMKS. In such cases it is necessary to make assumptions, and the CCS approach is a reasonable option. This hybrid approach has worked well for us in practice, in large part because the contribution to the EPE from patterns that are too sparse to fit with PMKS is often negligible.

If $p$ is very large, and storing $2^p$ prediction models seems unreasonable, there are several options. First, fit models only for observable patterns, ignoring patterns not observed. Second, only fit models for patterns in which the missing variable, or combination of missing variables, is 'important' to the predictions. Third, if the data are available in real time, it may be possible to fit PMKS on demand, since imputation is not necessary, and this reduces the need to store all $2^p$ models simultaneously. Lastly, shrinkage methods toward the MIMI model, discussed in section 3.5, can indicate how best to borrow strength over the patterns.

It is important to distinguish the computational cost of the in-sample model construction phase from that of the out-of-sample prediction phase. Both PMKS and MI can have high in-sample computation costs, depending on the number of predictors and the data size. But as described in table 1, the out-of-sample computational costs for PMKS are negligible, whereas for MI they can remain intensive, even for a single individual. Importantly, PMKS does not require the missing data mechanism to be consistent between the data used to construct the model and the target population. This is because, conditional on the missingness pattern, the data are effectively MAR. The MAR assumption implies $(M \perp X_j) \mid X_{-j}$ for all $j = 1, \ldots, p$. PMKS reformulates this assumption as $(M \perp X_j) \mid (X_{-j}, M_{-j})$, allowing the MAR assumption to hold within each pattern (Little, 1993).

3.5 PMKS as the limit of the MIMI model
There are some interesting connections between PMKS and MI. PMKS is the limit of a congenial MI procedure when the mean model depends on the missing data indicators. This new MI procedure - which we call the Multiple Imputation with Missingness Indicators (MIMI) model - can be used to assess the MAR assumption in practice. The MIMI model also makes clear in context which elements of the model can be assumed identical across the patterns for inference purposes. The utility of missingness indicators can only be realized through an imputation procedure, an often overlooked and important point. The implications for estimation will be discussed elsewhere, as the focus of this paper is on prediction, but the connection is an important one.

The MIMI model is a multiple imputation model that depends on the indicators $M_i$, $i = 1, \ldots, p$. Typically, the mean model would depend on the missingness indicators. For example, in the earlier example with $p = 2$ covariates, $X_1$ and $X_2$, we might consider the following mean model:

$$E[Y | X_1, X_2, M_1, M_2] = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \delta_1 M_1 + \delta_2 M_2 + \delta_3 X_1 M_1 + \delta_4 X_2 M_1 + \delta_5 X_1 M_2 + \delta_6 X_2 M_2 \quad (1)$$

where the $\beta$ parameters represent the traditional direct effects of interest and the $\delta$ parameters, which we will call auxiliary parameters, explain how the traditional effects change by missing data pattern. If the data are MCAR, then $\delta_i = 0$ for all $i$. Otherwise, the traditional effects might not exist, as the dependencies in the model parameters can be complex. An informal test of the $\delta$ parameters may provide some insight into the observed missing data mechanism. If there is no evidence to suggest that the indicators contribute to the model predictions, this may help to make the case for a MAR mechanism. Shrinkage methods may also be applied to the $\delta$ parameters to help assess which covariates have non-ignorable missing data mechanisms; this will be addressed in future work.

Note that Molenberghs et al. (2008) describe a longitudinal setting in which every MNAR model can be decomposed into a set of MAR models. Molenberghs et al. (2008) rightly assert that this duality is problematic for inference about a parameter. However, as noted in that paper, these representations will yield different predictions, and so the implications are different in our context, where prediction is the objective. The results from Molenberghs et al. (2008) are important in that they align with our results in the non-longitudinal setting, where predictions must match those from the natural pattern mixture model (which is MAR by definition conditional on the pattern) in order to retain predictive optimality.

Unfortunately, model (1) cannot be properly fit unless the missing predictors are imputed. When the missing data are imputed with a proper MI algorithm, the coefficients of (1) are essentially identified by that algorithm's imputation scheme, and the auxiliary parameters become estimable under those assumptions. Model (1) is an example of the Multiple Imputation with Missingness Indicators (MIMI) model. Note that MIMI makes certain assumptions about the missingness mechanism, and these assumptions will lead to different fits of the auxiliary parameters. This flexibility is both good and bad; good because we can use the auxiliary parameters to check assumptions, and bad because the auxiliary parameters are inherently unidentifiable.
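As a sketch, the mean model (1) corresponds to a design matrix that augments the imputed covariates with the missingness indicators and their interactions (the function name is ours; the imputed values would come from a proper MI engine, one design matrix per imputed dataset):

```python
import numpy as np

def mimi_design(X_imputed, M):
    """Design matrix for the MIMI mean model, two-covariate case of (1).

    Columns: 1, X1, X2, M1, M2, X1*M1, X2*M1, X1*M2, X2*M2, where
    X1, X2 are the covariates after imputation and M1, M2 are the
    missingness indicators.
    """
    X1, X2 = X_imputed[:, 0], X_imputed[:, 1]
    M1, M2 = M[:, 0], M[:, 1]
    return np.column_stack([np.ones(len(X1)), X1, X2, M1, M2,
                            X1*M1, X2*M1, X1*M2, X2*M2])
```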
Adding missingness indicators to a model (when the goal is parameter estimation) has historically received criticism. The simple plug-in varieties of the missing-indicator method yield biased parameter estimates even in simple cases where data are missing completely at random (MCAR) (Allison, 2001; Groenwold et al., 2012). The classical missing-indicator method fills in a constant (often zero or the overall mean) for the missing values and augments the data design matrix with a binary indicator for each covariate with missing values. However, when the missing-indicator method is combined with proper imputation methods, the model produces unbiased parameter estimates in the same cases in which complete case estimation is unbiased (Jones, 1996; Dardanoni et al., 2011, 2015). That is, when $M_i$ and $Y$ are conditionally independent given $X_i$, then for any choice of imputation matrix, the OLS estimate of (1) coincides with the OLS estimate of $\beta$ in the complete case model (Bartlett et al., 2014; Jones, 1996; Dardanoni et al., 2011; White and Carlin, 2010). Thus, the MIMI model is essentially an extension of the ideas in Jones (1996) to a more flexible multiple imputation setting.

Although the missing-indicator method has been heavily investigated in the context of inference, it has not been explored for prediction, where a bias-variance tradeoff may be more desirable. The MIMI model leverages multiple imputation to create placeholders for the missing data that do not negatively impact the model's predictive ability. Of course, the value of these imputations does impact the estimation of direct effects and the properties of those estimators.

The connection between the MIMI model and PMKS can be seen through a simple rearrangement of the mean model (1). There are differences; the MIMI model forces constant variance across all missing data patterns, whereas PMKS allows the variance to change by pattern. PMKS are most easily understood as projections of the true pattern-specific model onto the space of observed covariates. As such, slope parameters for observed covariates may be distorted if missing covariates are correlated with observed covariates. In those cases, the model is really estimating the total effect when the direct effect is the quantity of interest. PMKS is a series of models based on the total effects that can be estimated from the data at hand, while MIMI reparameterizes each pattern-specific mean model and averages the direct effects of interest.

Applying the plug-in principle, the MIMI mean model reduces to the PMKS when conditional mean imputation is used to impute missing covariates. Denote the imputed covariate by $X_i^*$; for example, $X_1^* = E[X_1 | X_2] = \alpha_0 + \alpha_1 X_2$ if $X_1$ is missing, and $X_1^* = X_1$ otherwise. Rearranging the MIMI model, we have:

$$E[Y | X_1, X_2, M_1, M_2] = (\beta_0 + \delta_1 M_1 + \delta_2 M_2) + (\beta_1 + \delta_3 M_1 + \delta_5 M_2) X_1^* + (\beta_2 + \delta_4 M_1 + \delta_6 M_2) X_2^*$$

which just reduces to PMKS. To illustrate, for the patterns in our running example:

$$E[Y | X_1, X_2, M_1 = 0, M_2 = 0] = \beta_0 + \beta_1 X_1 + \beta_2 X_2$$

$$E[Y | X_2, M_1 = 1, M_2 = 0] = (\beta_0 + \delta_1) + (\beta_1 + \delta_3) E[X_1 | X_2] + (\beta_2 + \delta_4) X_2$$
$$= (\beta_0 + \delta_1) + (\beta_1 + \delta_3)(\alpha_0 + \alpha_1 X_2) + (\beta_2 + \delta_4) X_2$$
$$= (\beta_0 + \delta_1 + \beta_1\alpha_0 + \delta_3\alpha_0) + (\beta_2 + \delta_4 + \beta_1\alpha_1 + \delta_3\alpha_1) X_2$$
$$= \gamma_{0,2} + \gamma_{2,2} X_2$$

Note that the model $E[Y | X_2, M_1 = 1, M_2 = 0] = \gamma_{0,2} + \gamma_{2,2} X_2$ is the submodel including only the covariate $X_2$, fit within the group of individuals who are missing the covariate $X_1$ (this is why the conditioning on $M$ is important), and is equivalent to $\hat f_2$ in section 3.1. Hence, PMKS and MIMI are two parameterizations of the 'same' model.
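This equivalence can be checked numerically. The following sketch (our own simulated setup and names, with a simple random-missingness mechanism) imputes a missing $X_1$ by its conditional mean given $X_2$, fits OLS on the MIMI design, and confirms that the fitted values in the $M_1 = 1$ pattern match the PMKS submodel of $y$ on $X_2$ alone.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 5000
X2 = rng.normal(3, 1, n)
X1 = 1 + 0.5 * X2 + rng.normal(size=n)
M1 = rng.random(n) < 0.5                         # X1 randomly missing here
y = 1 + 3*X1 + X2 + M1 + 2*X1*M1 + rng.normal(size=n)

# conditional mean imputation of X1 from X2 (alpha fit on complete cases)
A = np.column_stack([np.ones((~M1).sum()), X2[~M1]])
alpha, *_ = np.linalg.lstsq(A, X1[~M1], rcond=None)
X1_star = np.where(M1, alpha[0] + alpha[1] * X2, X1)

# MIMI fit: indicator and interactions with the imputed covariate
Z = np.column_stack([np.ones(n), X1_star, X2, M1, X1_star*M1, X2*M1])
theta, *_ = np.linalg.lstsq(Z, y, rcond=None)
mimi_pred = Z @ theta

# PMKS fit in the pattern with X1 missing: regress y on X2 only
rows = M1
B = np.column_stack([np.ones(rows.sum()), X2[rows]])
gamma, *_ = np.linalg.lstsq(B, y[rows], rcond=None)
pmks_pred = B @ gamma

assert np.allclose(mimi_pred[rows], pmks_pred, atol=1e-6)  # same fitted values
```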
To examine this, we next present simulations in the linear model case under a wide variety of missing data mechanisms. This principle extends to the GLM setting, on the linear predictor scale, and is currently under investigation by the authors. More complex prediction machines that are highly non-linear should exhibit the same general behavior.

4. Simulations

We generated $n$ multivariate normal predictor vectors according to $(x_1, x_2) \sim N(\mu, \Sigma)$, where $\mu = (3, 3)$ and $\Sigma$ are set to provide certain predictor profiles in terms of their correlation. Simulated outcomes $Y$ are generated from various combinations of $x_1$ and $x_2$. The pattern mixture model formulation uses the covariates to induce one of three missing data mechanisms: MCAR, MAR, or MNAR. The outcome $Y$ is then generated from the MIMI mean model using the true $X$ values and the simulated missing data indicators. Here the missingness may only depend on the predictor vector $X$. In contrast, the selection model formulation simulates $Y$ from the marginal model $Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \epsilon$, where $\epsilon \sim N(0, \sigma^2)$, and the missingness may also depend on $Y$. A more complex model can always be reduced to a linear combination of non-missing variables and missing variables, and so this simple example is representative of more complicated situations.

We simulated the following five missing data mechanisms, as defined in table 3: MCAR, MAR, MNAR, MARY, and MNARY. The latter two mechanisms could only be simulated in the selection model formulation. We forced the missing data mechanism to be consistent between the in-sample and out-of-sample populations, and $\nu_0$ is empirically calculated to maintain the desired probability of missingness.

[Table 3 about here.]
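As a sketch of how these mechanisms can be generated (our own function and argument names; only the three covariate-driven mechanisms are shown), the intercept $\nu_0$ is calibrated numerically so the average missingness probability hits the desired target:

```python
import numpy as np
from scipy.optimize import brentq
from scipy.special import expit  # expit(x) = e^x / (1 + e^x)

def make_missing(X1, X2, mechanism, target=0.5, nu=1.0, rng=None):
    """Generate M1, the missingness indicator for X1, per table 3."""
    if rng is None:
        rng = np.random.default_rng()
    driver = {"MCAR": np.zeros_like(X1),
              "MAR":  nu * X2,                   # depends only on observed X2
              "MNAR": nu * X1}[mechanism]        # depends on X1 itself
    # calibrate nu0 so the mean missingness probability equals the target
    nu0 = brentq(lambda v: expit(v + driver).mean() - target, -50, 50)
    return rng.random(len(X1)) < expit(nu0 + driver)
```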
4.1 Parameters

Parameter profiles explored varied the regression coefficients $\beta$, the predictor correlation $\rho$ (including $\rho = 0$), the probability of missingness $P(M_1 = 1)$, and the sample size $n$ (from $n = 50$ up to $n = 1000$). Results below are presented for the profile with $\beta_1 = 3$, $P(M_1 = 1) = 0.50$, and $n = 1000$. For the out-of-sample population we assumed one-by-one enrollment. Missing data were imputed by zero imputation, unconditional mean imputation, single conditional mean imputation using a Bayesian conditional mean model, single conditional mean imputation using a frequentist conditional mean model, or multiple imputation (predictive mean matching, 10 imputations). We fixed the imputation engine based on the in-sample population to closely mimic real-world application of these methods.
4.2 Simulation procedure
We compared the performance of PMKS, complete case model predictions, complete case submodel (CCS) predictions, traditional MI, and the MIMI imputation model. The full simulation procedure was as follows: (1) data are generated, and missing data indicators are generated according to the missing data mechanisms in table 3; (2) missing data are imputed; (3) the MI model, MIMI model, CCS, and PMKS models are fit; (4) step 1 is repeated to obtain a new out-of-sample population; (5) individuals are imputed one by one, using the above imputation procedures, assuming a fixed imputation engine from the in-sample population; (6) individual predictions and performance measures are computed; (7) steps 1 through 6 are repeated 10,000 times. A skeleton of this loop is sketched below.

A squared error loss function was used to compare the performance of the approaches. For example, the squared error loss across all missing data patterns in the PMKS is $\frac{1}{n}\sum_i \sum_j P(M_i = 1)\,(Y_{ij} - \hat Y_{ij})^2$, where $j = 1, \ldots, n$ indexes subjects and $i = 1, \ldots, m$ indexes patterns. This loss is then averaged over the 10,000 simulations to approximate the expected loss. Table 4 shows the average squared imputation error for predictor $x_1$ as a function of imputation strategy and missingness mechanism.
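A compact skeleton of steps (1) through (7) for the PMKS arm (a sketch reusing the fit_pmks, predict_pmks, and make_missing functions sketched earlier; the outcome model here is a simplified version with only a main-effect shift for the missingness indicator):

```python
import numpy as np

def simulate_once(rng, n=1000, rho=0.5):
    """One pass through steps (1)-(6) for the PMKS arm."""
    def draw(n):
        cov = np.array([[1.0, rho], [rho, 1.0]])
        X = rng.multivariate_normal([3.0, 3.0], cov, size=n)     # step 1
        M1 = make_missing(X[:, 0], X[:, 1], "MNAR", rng=rng)
        # outcome from a simplified MIMI mean model (pattern mixture form)
        y = 1 + 3*X[:, 0] + X[:, 1] + 1*M1 + rng.normal(size=n)
        X_obs = X.copy()
        X_obs[M1, 0] = np.nan                    # mask the missing covariate
        return X_obs, y

    X_in, y_in = draw(n)                         # steps 1-2 (PMKS needs no
    models = fit_pmks(X_in, y_in)                # imputation); step 3
    X_out, y_out = draw(n)                       # step 4
    preds = np.array([predict_pmks(models, x) for x in X_out])   # step 5
    return np.mean((y_out - preds) ** 2)         # step 6: squared error loss

# step 7: average over replications to approximate the expected loss
rng = np.random.default_rng(3)
print(np.mean([simulate_once(rng) for _ in range(200)]))
```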
4.3 Simulation Results

Results are presented for the following set of parameters: $\beta_0 = 1$, $\beta_1 = 3$, $\beta_2 = 1$, $\delta_1 = 1$, $\delta_2 = 1$, $P(M = 1) = 0.5$, $\nu_1 = 1$, $\nu_2 = 1$, $\nu_{1,Y} = 1$, $\nu_{2,Y} = 1$. There were negligible differences in pattern-specific and total squared error loss under the MCAR missing data mechanism. For all missing data scenarios, MI and conditional mean imputation resulted in biased parameter estimation. This bias appears most clearly in predictions for observations without missing data (blue dots). When $Y$ is added to the MI model, the model parameters had negligible bias. However, since the out-of-sample $Y$ is missing, the out-of-sample imputations of $x_1$ have greater bias than under the imputation model in which $Y$ is not included, resulting in a higher total prediction error (e.g., see table 4).

When $Y$ is generated from the selection model formulation, all methods perform similarly (apart from MI, as described above) under the MAR missing data mechanism. When data are MNAR, PMKS and the MIMI models have slightly lower total and pattern-specific squared error loss compared to the traditionally available methods. When $Y$ is generated under the pattern mixture formulation with an MNAR missing data mechanism (MNAR PMY), PMKS and MIMI have both lower pattern-specific contributions to the prediction error (PE) in the pattern where $x_1$ is missing and lower total prediction error compared to all other methods.

As might be expected, PMKS and CCS have different out-of-sample prediction performance when the missing data are not Missing at Random (MAR). In fact, PMKS minimizes the expected prediction loss regardless of missingness mechanism, while CCS tends to rival PMKS only when the data are MAR. When the data are modified to induce a Missing Not at Random (MNAR) mechanism, PMKS has optimal predictions on average compared to traditional methods. As both the strength of the missingness mechanism and the coefficient associated with the missing variable increase, the differences among methods increasingly favor PMKS/MIMI.

[Figure 2 about here.]

[Table 4 about here.]

Table 4, which reports out-of-sample imputation error for $x_1$, provides insight into some of the biases seen in figure 2. When $Y$ is included in the imputation model during model construction, parameter estimates tend to be unbiased. However, when $Y$ is included during MI performed using predictive mean matching and chained equations, the imputations of $x_1$ have the largest squared error of all the imputation procedures, for every missing data mechanism, apart from unconditional mean imputation. Although the apparent bias in imputations for missing covariates seems small, the total contribution over all individuals can be substantial. These results show that bias in imputing missing predictors leads to poorer downstream predictions and larger prediction error for the outcome.

Papers have explored in detail the advantages of including $Y$ in the imputation model (Moons et al., 2006). Using $Y$ in the imputation model during model construction leads to unbiased estimates of regression coefficients. While this may be a fine approach in the model building (in-sample) population, it is not practical in the prediction setting, where the outcome is unknown and would itself be imputed as part of the fixed imputation model. Using a chained equations imputation model in which the variable with the least missing data (in these simulations, $Y$) is imputed first can lead to biased imputations for the next missing covariate in the chain (in these simulations, $X_1$). We do not present the situation in which $Y$ is used in the in-sample imputation model to produce unbiased regression estimates but excluded from the out-of-sample imputation model - a combination which would propagate less imputation bias. Even though it may seem that including $Y$ in the imputation model will lower prediction error, careful thought must be given to the practicality of this choice, as well as its statistical implications.
5. Application: SUPPORT Data Example
The Study to Understand Prognoses and Preferences for Outcomes and Risks of Treatments (SUPPORT) was a multi-center, two-phase study of 9105 patients. The primary goal of the study was to estimate survival over a 180-day period for seriously ill hospitalized adults (Phillips et al., 1995). A component of the SUPPORT prognostic model was the SUPPORT day 3 Physiology Score (SPS), and obtaining valid predictions from the SPS model is vital since it is the single most important prognostic factor in the SUPPORT model. The SUPPORT physiology score can range from 0 to 100, and we include the following covariates: partial pressure of oxygen in the arterial blood, mean blood pressure, white blood count, albumin, APACHE III respiration score, temperature, heart rate per minute, bilirubin, creatinine, and sodium.

After excluding one individual missing the SPS score and one individual missing all covariates, 9103 individuals remained in our SPS model, of which 3842 had complete data, 2323 were missing partial pressure of oxygen in the arterial blood, 212 were missing white blood count, 3370 were missing albumin, 2599 were missing bilirubin, and 66 were missing creatinine, resulting in 23 observed missing data patterns out of 1024 possible missing data patterns. Ten-fold cross-validation was used to compare the squared error loss of MI, CCS, MIMI, and PMKS within missing data patterns, as well as the total average squared error loss, weighted by the proportion of individuals in each pattern.
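The weighting is simple; as a sketch (names ours), the total model PE combines the pattern-specific PEs using the observed pattern proportions:

```python
import numpy as np

def weighted_model_pe(pattern_pe, pattern_n):
    """Total PE = sum over patterns of (pattern PE * pattern proportion)."""
    w = np.asarray(pattern_n) / np.sum(pattern_n)
    return float(np.sum(w * np.asarray(pattern_pe)))
```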
5.1 SUPPORT Example Results
For each method, ten-fold cross-validation of the prediction models was implemented. For patterns with too few subjects to support a pattern-specific fit (on the order of $N = (p + 1)$ observations), the CCS fit was used in place of the PMKS fit, as discussed in section 3.4. The largest differences arose between the PMKS/MIMI methods and CCS, for the patterns in which partial pressure of oxygen in the arterial blood was missing.

[Figure 3 about here.]

The original data results are shown in the two left sub-figures of figure 3. The total model PE does not differ between the four methods. When an MNARY mechanism is induced in the SUPPORT data, as shown in the two right sub-figures, PMKS and MIMI outperform CCS and MI. In the patterns for which partial pressure of oxygen in the arterial blood (pafi) is missing, the benefits of PMKS and MIMI compared to CCS and MI are apparent. For these patterns, both the unweighted PE (figure 3, top-right) and weighted PE (figure 3, bottom-right) show this reduction in PE. The model PE, which is the sum of all the pattern-specific contributions, shows approximately a 50% reduction for PMKS/MIMI compared to MI, and a 40% reduction for PMKS/MIMI compared to CCS.
6. Remarks
Statistical literature abounds with imputation methods for model inference, but there are very few practical solutions for obtaining predictions for new individuals who do not present with all of the necessary predictors. In this paper, we have shown that PMKS provides competitive, if not optimal, predictions under a variety of missing data mechanisms, and has large gains in computation time since external data and imputation models are no longer needed to make new predictions. Our method is robust, straightforward to implement, and can easily be extended to any generalized linear model and to models with nonlinear effects (this is ongoing work).

While PMKS is clearly optimal in the sense of minimizing an expected prediction loss, the procedure also has implications for model inference under non-MAR scenarios. It leads to an important extension of classical MI procedures for estimation, where the MAR assumption can be relaxed or assessed. Moreover, PMKS is more computationally efficient than MI procedures. In the age of big data, this is an important consideration and a driving factor in most scientific contexts.
6.1 Remark A: Conditioning $Y$ on $X$ and $M$
One might ask whether we are interested in the model marginalized over $M$, $E[Y | X] = X\beta$, or the conditional model, $E[Y | X, M] = X\beta + M\delta$. This is a philosophical question with many differing viewpoints. For inferential purposes (which we do not consider here) it has been argued that the marginal model is the model of interest; however, many situations can be imagined in which the conditional model is the simpler way to express a complicated marginal model. From the prediction point of view, the PMKS/MIMI model is robust to both situations and therefore is the preferred method to use.

By assuming the marginal model is correct, one assumes that the data are MAR and the $\delta$ parameters are zero. As the number of covariates increases, this assumption seems more plausible in practice. A way to assess it in practice is to evaluate whether the MI model and the PMKS/MIMI model give similar results, as we did in the SUPPORT example.

6.2 Remark B: The Relationship between $Y$ and $M$

The relationship between $Y$ and $M$ plays an important role in our modeling assumptions. An outcome generated from a selection model formulation is assumed to be independent of the missing data mechanism, such that the outcome would be the same regardless of whether covariate information is missing or observed. The pattern mixture model formulation assumes that the missing data mechanism is part of the response model, such that the outcome can depend on the missing data pattern. Both mechanisms are thought to exist in biomedical data, but unfortunately the two approaches represent fundamentally different descriptions of the underlying natural process. The two approaches will only coincide when $\delta_1 = \ldots = \delta_p = 0$ (the MAR assumption). The $\beta$ parameters could be adjusted to account for the differential missingness across the two approaches, but then the models are not comparable.

6.3 Remark C: Extending to Generalized Linear Models and Other Prediction Approaches
We performed the same set of simulations assuming a true logistic regression model, where we used a logarithmic scoring rule to compare methods. The general ordering of results holds and will be explored in future papers. These results extend to random forests as well. For example, our results suggest a random forest should be fit by pattern, using only the data in that pattern.

Care should be taken when developing clinical prediction models when missing data are present. Prediction models should be fit with PMKS, and estimation should be conducted under MIMI, for optimal results, since these methods are robust to a wide variety of missing data mechanisms compared to commonly available MI and CCS methods.
Acknowledgments
The authors wish to thank Professors Robert Greevy, Matt Shotwell, Thomas Stewart, Melinda Aldrich, Frank Harrell, and Nathaniel Mercaldo for critical reading, helpful suggestions, and valuable feedback on the original version of the paper. In addition, the authors would like to thank Eric Grogan and the TREAT Laboratory via funding from the Department of Veterans Affairs Career Development Award (10-024), and Melinda Aldrich via funding from NIH/NCI 5K07CA172294, as well as the Department of Thoracic Surgery for supporting this work.
Conflict of Interest: None declared.
References
Allison, P. D. (2001). Missing Data. SAGE Publications.
Bartlett, J. W., Carpenter, J. R., Tilling, K., and Vansteelandt, S. (2014). Improving upon the efficiency of complete case analysis when covariates are MNAR. Biostatistics.
Dardanoni, V., Modica, S., and Peracchi, F. (2011). Regression with imputed covariates: A generalized missing-indicator approach. Journal of Econometrics.
Dardanoni, V., De Luca, G., Modica, S., and Peracchi, F. (2015). Model averaging estimation of generalized linear models with imputed covariates. Journal of Econometrics.
Groenwold, R. H. H., White, I. R., Donders, A. R. T., Carpenter, J. R., Altman, D. G., and Moons, K. G. M. (2012). Missing covariate data in clinical research: when and when not to use the missing-indicator method for analysis. Canadian Medical Association Journal.
Harrell, F. E. (2013). Regression Modeling Strategies: With Applications to Linear Models, Logistic Regression, and Survival Analysis. Springer Science & Business Media.
Hastie, T., Tibshirani, R., and Friedman, J. (2009). The Elements of Statistical Learning. Springer Series in Statistics. Springer, New York, NY.
Janssen, K. J. M., Donders, A. R. T., Harrell Jr., F. E., Vergouwe, Y., Chen, Q., Grobbee, D. E., and Moons, K. G. M. (2010). Missing covariate data in medical research: to impute is better than to ignore. Journal of Clinical Epidemiology.
Janssen, K. J. M., Vergouwe, Y., Donders, A. R. T., Harrell Jr., F. E., Chen, Q., Grobbee, D. E., and Moons, K. G. M. (2009). Dealing with missing predictor values when applying clinical prediction models. Clinical Chemistry.
Jones, M. P. (1996). Indicator and stratification methods for missing explanatory variables in multiple linear regression. Journal of the American Statistical Association.
Knol, M. J., Janssen, K. J., Donders, A. R. T., Egberts, A. C., Heerdink, E. R., Grobbee, D. E., Moons, K. G., and Geerlings, M. I. (2010). Unpredictable bias when using the missing indicator method or complete case analysis for missing confounder values: an empirical example. Journal of Clinical Epidemiology.
Little, R. J. A. (1993). Pattern-mixture models for multivariate incomplete data. Journal of the American Statistical Association.
Little, R. J. A. and Rubin, D. B. (2014). Statistical Analysis with Missing Data. John Wiley & Sons.
Little, R. J. A. and Wang, Y. (1996). Pattern-mixture models for multivariate incomplete data with covariates. Biometrics.
Molenberghs, G., Beunckens, C., Sotto, C., and Kenward, M. G. (2008). Every missingness not at random model has a missingness at random counterpart with equal fit. Journal of the Royal Statistical Society, Series B.
Moons, K. G. M., Donders, R. A. R. T., Stijnen, T., and Harrell Jr., F. E. (2006). Using the outcome for imputation of missing predictor values was preferred. Journal of Clinical Epidemiology.
Phillips, R. S., et al. (1995). Annals of Internal Medicine.
Rubin, D. B. (2009). Multiple Imputation for Nonresponse in Surveys. Wiley Series in Probability and Statistics. John Wiley & Sons, Hoboken, NJ.
Shmueli, G. (2010). To Explain or to Predict? Statistical Science.
van Buuren, S. (2012). Flexible Imputation of Missing Data. Chapman & Hall/CRC Interdisciplinary Statistics Series. CRC Press.
Vergouwe, Y., Royston, P., Moons, K. G. M., and Altman, D. G. (2010). Development and validation of a prediction model with missing predictor data: a practical approach. Journal of Clinical Epidemiology.
White, I. R. and Carlin, J. B. (2010). Bias and efficiency of multiple imputation compared with complete-case analysis for missing covariate values. Statistics in Medicine.
Wood, A. M., Royston, P., and White, I. R. (2015). The estimation and use of predictions for the assessment of model performance using large samples with multiply imputed data. Biometrical Journal.
[Figure 1 plot: 'Simulated MCAR'. Average prediction error versus out-of-sample $X$ (ticks from $-4\sigma$ to $4\sigma$ around $E[X]$), with series for the Large model, Small model, PMKS, and Weighted Avg.]
Figure 1.
Comparison of Expected Prediction Error for the Large, fully specified model $E[Y | X_1, X_2] = \beta_0 + \beta_1 X_1 + \beta_2 X_2$ and the Small, underspecified model $E[Y | X_1] = \beta^*_{0,S} + \beta^*_{1,S} X_1$. Pattern Mixture Kernel Submodel (PMKS) predictions are a weighted average of the Large and Small models, weighted by $P(M = 1)$.

[Figure 2 panels: pattern-specific contribution to prediction error under MAR, MAR PMY, MNAR, and MNAR PMY, comparing the Complete Case Model, Conditional Mean Imputation (frequentist and Bayesian), Multiple Imputation (with and without $y$), CCS, the corresponding MIMI variants, and PMKS. Red: Total Prediction Error; Blue: Pattern 1 (no missing data); Green: Pattern 2 (missing $x_1$).]
Figure 2.
Simulation results for the parameter set $\beta_0 = 1$, $\beta_1 = 3$, $\beta_2 = 1$, $\delta_1 = 1$, $\delta_2 = 1$, $P(M = 1) = 0.5$, $\nu_1 = 1$, $\nu_2 = 1$, $\nu_{1,Y} = 1$, $\nu_{2,Y} = 1$. The missing data mechanisms Missing at Random (MAR) and Missing Not at Random (MNAR) were generated under a Pattern Mixture Y (PMY) and a Selection Model Y formulation. Red triangles represent the Total Prediction Error (TPE) summed over all missing data patterns. Blue circles represent the PE for pattern 1, where there is no missing data. Green circles represent the PE for pattern 2, in which $x_1$ is missing.
[Figure 3 panels: 'SUPPORT Example' (original data, left) and 'Induced MNARY: SUPPORT Example' (right); the top row shows pattern-specific Prediction Error and the bottom row shows the Pattern Specific Contribution (PE × pattern weight). Model PE, original data: PMKS 82.6, MIMI 84.4, CCS 82.6, MI 85.7. Model PE, induced MNARY: PMKS 84.2, MIMI 83.8, CCS 139.2, MI 169.3.]
Figure 3.
The covariates included in the SPS prediction model are partial pressure of oxygen in the arterial blood (pafi), mean blood pressure (meanbp), white blood count (wblc), albumin (alb), APACHE III respiration score (resp), temperature (temp), heart rate per minute (hrt), bilirubin (bili), creatinine (crea), and sodium (sod). There are 23 patterns present in the SUPPORT data, and missing covariates are denoted with 'X'. $N$ is the total number of subjects in each missing data pattern. Pattern Mixture Kernel Submodels (PMKS), Multiple Imputation with Missingness Indicators (MIMI), Complete Case Submodels (CCS), and traditional Multiple Imputation (MI) methods are all compared. The top two figures are the unweighted pattern-specific PE, and the bottom two figures are the pattern-specific contribution to the PE, in which the partial PE is weighted by the observed proportion of individuals in each pattern.
Table 1
Comparing imputation methods for an out-of-sample individual with missing data.

Method | Out-of-sample imp. requires | Pros | Cons
Zero Imputation | Nothing | Negligible computation time | Zero may not be an appropriate value; probably results in incorrect predictions
Mean Imputation | Unconditional means | Negligible computation time | Only works for the average individual
Conditional Mean Imputation | Conditional mean imputation model for every missing data pattern | Lower computation time; can approximate an MI procedure | Large bias/variance tradeoff for MNAR
CCS | Submodels to be fit | Negligible computation time; may be advantageous if data are MAR; fittable for unobserved patterns | Large bias/variance tradeoff for MNAR
PMKS | Submodels to be fit | Negligible computation time; works for any missingness mechanism | May be less efficient if data are MAR; patterns with low membership may not fit well
MIMI | Original data/conditional distributions; computer/imputation engine | Works for any missingness mechanism; allows for efficient parameter estimation | High computational cost; not viable in the clinic
MI | Original data/conditional distributions; computer/imputation engine | Established method; works when data are MAR | High computational cost; not viable in the clinic; large bias/variance tradeoff for MNAR
Table 2
Comparison of Pattern Mixture Kernel Submodels (PMKS) and Complete Case Submodels (CCS).
Pattern | PMKS ($\hat f_m$) | CCS ($\hat g_m$)
$X_1$ obs, $X_2$ obs | $E[Y | X_1, X_2, M_1 = 0, M_2 = 0] = \gamma_{0,1} + \gamma_{1,1} X_1 + \gamma_{2,1} X_2$ | $E[Y | X_1, X_2] = \beta^*_{0,1} + \beta^*_{1,1} X_1 + \beta^*_{2,1} X_2$
$X_1$ miss, $X_2$ obs | $E[Y | X_2, M_1 = 1, M_2 = 0] = \gamma_{0,2} + \gamma_{2,2} X_2$ | $E[Y | X_2] = \beta^*_{0,2} + \beta^*_{2,2} X_2$
$X_1$ obs, $X_2$ miss | $E[Y | X_1, M_1 = 0, M_2 = 1] = \gamma_{0,3} + \gamma_{1,3} X_1$ | $E[Y | X_1] = \beta^*_{0,3} + \beta^*_{1,3} X_1$
$X_1$ miss, $X_2$ miss | $E[Y | M_1 = 1, M_2 = 1] = \gamma_{0,4}$ | $E[Y] = \beta^*_{0,4}$

$\gamma_{p,m}$ and $\beta^*_{p,m}$ represent the effect of the $p$th covariate in pattern $m$.

Table 3
Missing data mechanisms used for simulation. $\nu_0$ is empirically calculated to allow the probability of missingness to maintain the desired level; $\mathrm{expit}(x) = e^x / (1 + e^x)$.

Mechanism | Missing data mechanism for $X_1$
MCAR | $P(M_1) = \mathrm{expit}(\nu_0)$
MAR | $P(M_1) = \mathrm{expit}(\nu_0 + \nu_2 X_2)$
MARY | $P(M_1) = \mathrm{expit}\big(\nu_0 + \nu_{2,Y}\big(Y/\sigma_y + X_2\sqrt{\mathrm{cor}(Y, X_2)}\big)\big)$
MNAR | $P(M_1) = \mathrm{expit}(\nu_0 + \nu_1 X_1)$
MNARY | $P(M_1) = \mathrm{expit}\big(\nu_0 + \nu_{1,Y}\big(Y/\sigma_y + X_1\sqrt{\mathrm{cor}(Y, X_1)}\big)\big)$
Table 4
Squared imputation error of the true $x_1$ compared to the imputed $x_1$ under different imputation methods and missing data mechanisms: imputation error of $x_1 = \sum_i (x_{1i} - \hat x_{1i})^2$.

Method | MAR | MNAR | MAR PMY | MNAR PMY
Unconditional Mean | 0.56 | 0.56 | 0.76 | -
- | 0.38 | 0.38 | 0.53 | -
- | 0.49 | 0.49 | 0.69 | -
- | 0.47 | 0.47 | 0.61 | -
- | 0.76 | 0.74 | 0.75 | -