Misguided Use of Observed Covariates to Impute Missing Covariates in Conditional Prediction: A Shrinkage Problem
Charles F. Manski (a), Michael Gmeiner (b), and Anat Tambur (c)

February 2021

Abstract

Researchers regularly perform conditional prediction using imputed values of missing data. However, applications of imputation often lack a firm foundation in statistical theory. This paper originated when we were unable to find analysis substantiating claims that imputation of missing data has good frequentist properties when data are missing at random (MAR). We focused on the use of observed covariates to impute missing covariates when estimating conditional means of the form E(y|x, w). Here y is an outcome whose realizations are always observed, x is a covariate whose realizations are always observed, and w is a covariate whose realizations are sometimes unobserved. We examine the probability limit of simple imputation estimates of E(y|x, w) as sample size goes to infinity. We find that these estimates are not consistent when covariate data are MAR. To the contrary, the estimates suffer from a shrinkage problem. They converge to points intermediate between the conditional mean of interest, E(y|x, w), and the mean E(y|x) that conditions only on x. We use a type of genotype imputation to illustrate.

(a) Department of Economics and Institute for Policy Research, Northwestern University, Evanston, IL 60208
(b) Department of Economics, Northwestern University, Evanston, IL 60208
(c) Feinberg School of Medicine, Northwestern University, Chicago, IL 60611

1. Introduction

Researchers regularly perform conditional prediction using imputed values of missing data. Imputations are embedded in widely used public datasets. For example, the U.S. Census Bureau provides hot-deck imputations of missing data in public releases of the Current Population Survey and other major Census surveys (U.S. Census Bureau, 2006, 2011).
Whereas the hot-deck method associates a single imputation with each case of missing data, Rubin (1987, 1996) promoted random multiple imputation (RMI) as a general approach for coping with missing values in public-use data. The adjective “random” refers to drawing imputed values at random from a specified probability distribution and treating the imputations as if they are real data. The adjective “multiple” refers to repetition of the random imputation process, generating multiple pseudo datasets and correspondingly multiple estimates of quantities of interest. Rubin (1996) made this broad recommendation (p. 473): “For the context for which it was envisioned, with database constructors and ultimate users as distinct entities, I firmly believe that multiple imputation is the method of choice for addressing problems due to missing values.”

Imputation is used in medical research that aims to predict health outcomes conditional on patient covariates. Considering missing data in clinical trials, a National Research Council panel (National Research Council, 2010) cautioned against use of single imputation, but the panel argued favorably for RMI. Use of RMI has also been recommended for observational studies in medicine (e.g., Sterne et al., 2009; Azur et al., 2011; Pedersen et al., 2017). Various methods of genotype imputation have been proposed to increase the predictive power of medical risk assessments conditional on patient genotype (e.g., Li et al., 2009). Researchers use an auxiliary database of precise genotypes for a specific subset of persons to impute precise genotypes for persons in the main study population who have only coarse genotyping.

Unfortunately, applications of imputation in general, and RMI in particular, lack a firm foundation in statistical theory. Consider RMI. Rubin originally motivated RMI from a subjective Bayesian perspective. One places a joint subjective distribution on all observed and unobserved quantities.
One wants to compute the posterior subjective distribution of some unobserved quantity conditional on all of the observed ones. Given this, Rubin’s RMI is simply a computational method that uses Monte Carlo integration to approximate the mean of the posterior distribution. Appendix A explains further.

The foundational problem, that imputation conditional on observable data provides no new information beyond these data, arises when one considers RMI from a frequentist perspective, as has generally been the case in practice. Rubin asserted good frequentist properties for RMI, but he formally studied only a highly restricted form of frequentist inference; see Appendix A. Other authors have asserted broadly, but without proof, that RMI has good frequentist properties when data are missing at random (MAR). For example, Sterne et al. (2009) state (p. 5): “under the missing at random assumption multiple imputation should correct biases that may arise in complete cases analyses.” Pedersen et al. (2017) state (p. 157): “Multiple imputation is implemented in most statistical software under the MAR assumption and provides unbiased and valid estimates of associations based on information from the available data.”

This paper originated when we were unable to find analysis substantiating claims that RMI has good frequentist properties when data are MAR. Looking into the matter, we focused on the use of observed covariates to impute missing covariates when estimating conditional means of the form E(y|x, w). Here y is an outcome whose realizations are always observed, x is a covariate whose realizations are always observed, and w is a covariate whose realizations are sometimes unobserved. A prominent example is hot-deck imputation in surveys, which replaces each missing value of w with an observed value of w from another respondent who has a similar value of x. Another prominent example is imputation of refined genotypes.
Here w is a refined genotype that may not be observed in a study population and x is a crude type that is always observed. Then, knowledge of the distribution P(w|x) in a specific subset of persons (typically based on ethnic background) may be used to impute w in the study population.

We examine the probability limit of simple imputation estimates of E(y|x, w) as sample size goes to infinity. We find that these estimates are not consistent when covariate data are MAR. To the contrary, the estimates suffer from a shrinkage problem. They converge to points intermediate between the conditional mean of interest, E(y|x, w), and the mean E(y|x) that conditions only on x. Hence, we conclude that use of observed covariates to impute missing covariates is misguided. Section 2 presents the theoretical analysis. Section 3 uses a type of genotype imputation to illustrate.

2. Imputation of Missing Covariate Data in Conditional Prediction

It is well understood that imputation is not a panacea for inference with missing data. Horowitz and Manski (1998, 2000) studied nonparametric partial identification of conditional means in the absence of assumptions restricting the distribution of missing outcome and covariate data. They observed that using imputations in place of missing data does not generally yield consistent estimates, but they did not study specific imputation methods under specified assumptions on data generation. We do so here. We examine the probability limit of estimates of conditional means that use observed covariates to randomly impute missing covariates.

Formally, consider a population with members characterized by variables (y, x, w, z). Here y ∊ Y is a real outcome with bounded domain Y, whereas x ∊ X and w ∊ W are covariate vectors with finite domains X and W. Realizations of x are always observable, but some realizations of w are not. The binary variable z indicates whether w is observable (z = 1) or not (z = 0).
Let the population distribution of (y, x, w, z) be denoted P. The objective is to learn E(y|x = ξ, w = ω) when P(x = ξ, w = ω) > 0. A random sample of N population members is drawn. One observes (y_i, x_i, z_i) for all i = 1, . . ., N and observes w_i when z_i = 1. Suppose that, if w were always observed, one would estimate P by the empirical distribution P_N and would estimate E(y|x = ξ, w = ω) by its sample analog E_N(y|x = ξ, w = ω). Now consider estimation of E(y|x = ξ, w = ω) when some data on w are missing.
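As a concrete reference point, the complete-data sample analog E_N(y|x = ξ, w = ω) can be sketched in a short simulation. The data-generating process below is our own hypothetical illustration, not one taken from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical DGP: binary covariates x and w, outcome y depends on both.
# With full data, E(y | x = 1, w = 1) = 0.2 + 0.3 + 0.4 = 0.9.
N = 100_000
x = rng.integers(0, 2, N)
w = rng.integers(0, 2, N)
y = 0.2 + 0.3 * x + 0.4 * w + rng.normal(0.0, 0.1, N)

def sample_analog_mean(y, x, w, xi, omega):
    """Complete-data sample analog E_N(y | x = xi, w = omega)."""
    mask = (x == xi) & (w == omega)
    return y[mask].mean()

est = sample_analog_mean(y, x, w, xi=1, omega=1)  # close to 0.9 for large N
```

With all realizations of w observed, this cell mean is a consistent estimate; the question taken up next is what happens when some w are missing and imputed.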
We first give the general form of sample-analog estimates that treat imputed values of w as real data. We then consider estimates that impute missing values of w randomly, conditional on x.

2.1. Imputation in Generality

To cope with missing data on w, let each member of the population be assigned M > 0 imputed values u_m ∊ W, m = 1, . . ., M. Single imputation occurs when M = 1 and multiple imputation when M > 1.
In the sample of size N, let N(1, ξ, ω) be the sub-sample of cases where (z = 1, x = ξ, w = ω). Let N_m(0, ξ, ω) be the sub-sample where (z = 0, x = ξ, u_m = ω). Let N_ξω = |N(1, ξ, ω)|, N_m0ξω = |N_m(0, ξ, ω)|, and π_mNξω ≡ N_ξω/(N_ξω + N_m0ξω). Then, whenever N_ξω + N_m0ξω > 0, the mth imputation estimate of E(y|x = ξ, w = ω) is

(1) θ_mNξω ≡ [1/(N_ξω + N_m0ξω)] [Σ_{i ∊ N(1, ξ, ω)} y_i + Σ_{i ∊ N_m(0, ξ, ω)} y_i]

         = π_mNξω (1/N_ξω) Σ_{i ∊ N(1, ξ, ω)} y_i + (1 − π_mNξω) (1/N_m0ξω) Σ_{i ∊ N_m(0, ξ, ω)} y_i.

Let N → ∞. By the Law of Large Numbers, the probability limit of θ_mNξω is

(2) θ_mξω ≡ E(y|x = ξ, w = ω, z = 1)∙π_mξω + E(y|x = ξ, u_m = ω, z = 0)∙(1 − π_mξω),

where

(3) π_mξω = P(z = 1, x = ξ, w = ω)/[P(z = 1, x = ξ, w = ω) + P(z = 0, x = ξ, u_m = ω)]

          = P(z = 1, w = ω|x = ξ)/[P(z = 1, w = ω|x = ξ) + P(z = 0, u_m = ω|x = ξ)].

In general, θ_mξω does not equal E(y|x, w). By the Law of Iterated Expectations,

(4) E(y|x = ξ, w = ω) = E(y|x = ξ, w = ω, z = 1)∙P(z = 1|x = ξ, w = ω) + E(y|x = ξ, w = ω, z = 0)∙P(z = 0|x = ξ, w = ω).

We can say more about θ_mξω if we study particular classes of imputation methods under specified assumptions. We consider methods that specify a vector G_m(u_m|x = ξ), ξ ∊ X of probability distributions on W. For each person with x = ξ, the imputation method draws u_m at random from G_m(u_m|x = ξ). Random imputation of genotypes as described earlier exemplifies this type of imputation. So does any deterministic imputation method that makes u_m a function of x. With deterministic imputation, G_m(u_m|x = ξ), ξ ∊ X are degenerate distributions.

A first basic result holds for all specifications of G_m and all processes of data generation. Whatever G_m may be, u_m is by construction statistically independent of (y, z) conditional on x.
Hence,

(5a) E(y|x = ξ, u_m = ω, z = 0) = E(y|x = ξ, z = 0),

(5b) P(z = 0, u_m = ω|x = ξ) = P(z = 0|x = ξ) ⋅ G_m(u_m = ω|x = ξ).

It follows that (2)-(3) reduce to

(6) θ_mξω = E(y|x = ξ, w = ω, z = 1)∙π_mξω + E(y|x = ξ, z = 0)∙(1 − π_mξω),

(7) π_mξω = P(z = 1, w = ω|x = ξ)/[P(z = 1, w = ω|x = ξ) + P(z = 0|x = ξ) ⋅ G_m(u_m = ω|x = ξ)].

This finding suggests shrinkage. Whereas E(y|x = ξ, w = ω) is a weighted average of E(y|x = ξ, w = ω, z = 1) and E(y|x = ξ, w = ω, z = 0), θ_mξω is a weighted average of E(y|x = ξ, w = ω, z = 1) and E(y|x = ξ, z = 0). The reason that we use the imprecise word “suggests” is that the weighting may differ in the two cases. The former weights are P(z = 1|x = ξ, w = ω) and P(z = 0|x = ξ, w = ω). The latter are π_mξω and 1 − π_mξω.

The connection to classical shrinkage becomes exact if the missing data are MAR conditional on x, in the sense that (y, w) is statistically independent of z conditional on x. Then

(8a) E(y|x = ξ, w = ω, z = 1) = E(y|x = ξ, w = ω),

(8b) E(y|x = ξ, z = 0) = E(y|x = ξ),

(8c) P(z = 1, w = ω|x = ξ) = P(z = 1|x = ξ) ⋅ P(w = ω|x = ξ).

It follows that (6)-(7) reduce to

(9) θ_mξω = E(y|x = ξ, w = ω)∙π_mξω + E(y|x = ξ)∙(1 − π_mξω),

(10) π_mξω = P(z = 1|x = ξ) ⋅ P(w = ω|x = ξ)/[P(z = 1|x = ξ) ⋅ P(w = ω|x = ξ) + P(z = 0|x = ξ) ⋅ G_m(u_m = ω|x = ξ)].

Thus, θ_mξω is now a weighted average of the conditional means E(y|x = ξ, w = ω) and
E(y|x = ξ). In other words, imputation shrinks estimation of E(y|x, w) toward E(y|x). A further simplification occurs when the distribution used to impute w is G_m(u_m|x) = P(w|x), as has been the intent in genotype imputation. Then π_mξω = P(z = 1|x = ξ) and

(11) θ_mξω = E(y|x = ξ, w = ω)∙P(z = 1|x = ξ) + E(y|x = ξ)∙P(z = 0|x = ξ).

Hence, the asymptotic bias of the imputation estimate is [E(y|x = ξ) − E(y|x = ξ, w = ω)]∙P(z = 0|x = ξ).

3. Illustration: Imputation of HLA Allele-Level Genotypes in Research Predicting Transplant Outcomes

3.1. Background

An important clinical problem in organ transplantation is to predict the outcomes that occur when an organ is transplanted into a recipient. Covariates with predictive power include data characterizing the organ and patient. A prominent consideration is the genetic match between donor and recipient, measured by their Human Leukocyte Antigen (HLA) genotypes. A persistent problem in research predicting transplant outcomes is incomplete information on donor and recipient HLA typing.

The Organ Procurement and Transplantation Network (OPTN) requires transplant centers to provide pre-transplant data to the Scientific Registry of Transplant Recipients (SRTR), which collates the data with reports of transplant outcomes and makes the combined data available for analysis. Researchers use the SRTR data to investigate how transplant outcomes vary with measured attributes of organ quality, patient age/health, and HLA typing.

Until 2010, the only HLA typing information required for donors and recipients was on the HLA (A, B, DR) loci, at the serologic level. Since then, more accurate molecular typing has been required. Over time, additional loci information has been mandated for the donor, including HLA (DQB1, DPB1, DQA1). However, no such requirements have been made for the typing of patients. Many patients are listed for transplant with only HLA (A, B, DR) low-resolution (two-digit) typing information available, although some patients have serologic-equivalent DQ typing. Higher-resolution (four-digit) typing would be beneficial because each low-resolution, two-digit typing represents multiple alleles, whereas each four-digit typing identifies a unique allele. Donor and recipient HLA types that appear matched with two-digit coding may be mismatched with four-digit coding.
A patient may have antibodies to an allele within a low-resolution antigen group, but the donor typing may be of a different allele, against which the recipient does not have a donor-specific antibody. Differences at the allele level can also translate into differences in the assignment of molecular mismatches, currently proposed for use in risk stratification of transplant recipients.

Aiming to refine the predictions possible with incomplete knowledge of HLA data, methods have been developed to use available statistics on the distributions of high-resolution HLA typing within specified ethnic/national sub-populations to impute unobserved typing. Two prominent approaches are imputation of most prevalent types (“winner-take-all”) and RMI. Both approaches use a known distribution of high-resolution typing given low-resolution typing. Articles imputing most prevalent types include Geneugelijk et al. (2017), Tinckam et al. (2016), and Nilsson et al. (2019). Ones using RMI include Gragert et al. (2014) and Kamoun et al. (2017).

Because high-resolution typing w is never observed in the SRTR data, the MAR assumption necessarily holds, with P(z = 1|x = ξ) = 0 for all values of ξ. Hence, equation (11) reduces to the result θ_mξω = E(y|x = ξ). Thus, imputation of high-resolution HLA does not improve prediction of transplant outcomes beyond the information in observable low-resolution SRTR data.

Imputation being ineffective, it is natural to ask what predictions of transplant outcomes can logically be made by combining SRTR and HaploStats data. Formally, what can be learned about P(y|x, w) given knowledge of P(y|x) and P(w|x)? This question has been addressed by Manski, Tambur, and Gmeiner (2019), drawing on earlier research on the ecological inference problem in medical risk assessment (Manski, 2018). Analysis is simplest when the outcome of interest can take two values, say 0 and 1. For example, y = 1 may denote that a graft survives for a specified length of time and y = 0 that it does not.
Then application of basic probability theory shows that, for any values (ξ, ω) of the observed and unobserved attributes, the outcome probability P(y = 1|x = ξ, w = ω) lies between certain lower and upper bounds that are computable given the available information. The lower bound is [P(y = 1|x = ξ) − P(w ≠ ω|x = ξ)]/P(w = ω|x = ξ) and the upper bound is P(y = 1|x = ξ)/P(w = ω|x = ξ). The lower bound is informative, in the sense of being larger than zero, if P(y = 1|x = ξ) exceeds P(w ≠ ω|x = ξ). The upper bound is informative, in the sense of being smaller than one, if P(y = 1|x = ξ) is less than P(w = ω|x = ξ).

For example, suppose the SRTR data show that when a donor and recipient have observed attributes ξ, the frequency with which a graft survives for a given length of time is P(y = 1|x = ξ) = 0.6. Drawing on HaploStats, suppose that ω is the most prevalent pair of haplotypes when the donor and recipient have attributes ξ, with P(w = ω|x = ξ) = 0.8. Then the computable lower bound on P(y = 1|x = ξ, w = ω) is (0.6 − 0.2)/0.8 = 0.5 and the upper bound is 0.6/0.8 = 0.75.

Going beyond the case where the outcome is binary, the ecological inference problem has been studied when the objective is to learn the conditional mean E(y|x, w) for a real-valued outcome. The analysis is mathematically more subtle than when y is binary, but a tractable finding emerges. Knowledge of P(y|x) and P(w|x) yields a computable bound on E(y|x, w). See Manski (2018).

3.3. Using Imputation in Risk-Assessment Models that Condition on Number of HLA Mismatches

Analysis of the ecological inference problem proves that imputation of HLA types cannot be informative about the distribution P(y|x, w) of transplant outcomes when x is observed typing, w is unobserved typing, and knowledge of P(w|x) is used to impute w. This fact does not, however, imply that imputation is useless for all versions of transplant risk assessment.
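The bounds just described are simple enough to compute directly. The function below is our own sketch of the calculation, clipped to the unit interval, with the worked example from the text as a check:

```python
def outcome_prob_bounds(p_y1_given_x, p_w_given_x):
    """Bounds on P(y=1 | x, w=omega) given only P(y=1|x) and P(w=omega|x),
    as described in the text for the binary-outcome ecological inference problem."""
    # Lower bound: [P(y=1|x) - P(w != omega|x)] / P(w = omega|x), floored at 0.
    lower = max(0.0, (p_y1_given_x - (1.0 - p_w_given_x)) / p_w_given_x)
    # Upper bound: P(y=1|x) / P(w = omega|x), capped at 1.
    upper = min(1.0, p_y1_given_x / p_w_given_x)
    return lower, upper

# Worked example from the text: P(y=1|x)=0.6, P(w=omega|x)=0.8.
lo, hi = outcome_prob_bounds(0.6, 0.8)   # (0.6 - 0.2)/0.8 = 0.5 and 0.6/0.8 = 0.75
```

When P(w = ω|x) is small, the bounds are wide and often hit the trivial limits 0 and 1, which is the precise sense in which imputation cannot substitute for actual knowledge of w.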
Transplant researchers often aim to learn not P(y|x, w) but rather P[y|f(x, w)], where f(⋅, ⋅) is a specified many-to-one function of (x, w). In particular, it is common to predict graft survival conditional on the number of (donor, recipient) HLA mismatches at various loci, rather than on the underlying HLA types. Many combinations of types can yield the same number of mismatches.

When researchers perform risk assessment with models that condition only on the number of mismatches, it is theoretically possible that imputations may have predictive power. The frequency distributions of high-resolution types generated by HaploStats condition on all of the low-resolution HLA data that researchers input, not only on the number of mismatches implied by these data. Hence, imputations in principle might add predictive power to that attainable by conditioning only on the number of mismatches.

It does not seem possible to determine theoretically whether imputations have predictive power when used in risk-assessment models that condition only on numbers of mismatches rather than on the totality of observed HLA data. We can, however, use available SRTR data to illustrate such use of imputation. We summarize here and provide details in Appendix B.

We consider use of a logit model to predict five-year graft survival as a function of the numbers of low-resolution mismatches at the (A, B, DR) loci. This model is estimable with the SRTR data. The estimate presented in the appendix shows that the probability of five-year graft survival decreases with the number of mismatches at each of the three loci, with DR mismatch having the strongest and statistically most significant effect. Now suppose that one were to use randomly imputed values of DR rather than actual DR data when estimating the logit model. As described in Appendix B, we use the SRTR data, which contain actual low-resolution (A, B, DR) data, to determine the empirical conditional distribution P(DR|A, B).
This done, we perform RMI, repeatedly imputing DR values and estimating the logit model. We find that imputation of DR does not reveal the actual strong effect of DR mismatch on graft survival. To the contrary, the mean of the RMI coefficients for DR mismatch is close to zero. Thus, imputation is not informative in this illustration of risk assessment conditional on numbers of mismatches.

Appendix A. Rubin’s Bayesian and Frequentist Theory of RMI

A concise statement of the Bayesian theory motivating RMI was given in Rubin (1996), where he considered the posterior distribution for a real parameter Q(Y), Y being a random vector with some components observed and some missing. He wrote (p. 476): “The key Bayesian motivation for multiple imputation is given by result 3.1 in Rubin (1987). . . . the result and its consequences can be easily stated using the simplified notation that the complete-data are Y = (Y_obs, Y_mis), where Y_obs is observed and Y_mis is missing. Specifically, the basic result is P(Q|Y_obs) = ∫P(Q|Y_obs, Y_mis)P(Y_mis|Y_obs)dY_mis.”

In Bayesian language, P(Q|Y_obs) is the posterior predictive distribution of Q conditional on Y_obs, P(Q|Y_obs, Y_mis) is the posterior for Q given (Y_obs, Y_mis), and P(Y_mis|Y_obs) is the posterior for Y_mis given Y_obs. In non-Bayesian language, the equation applies the Law of Total Probability. Rubin supposed that P(Q|Y_obs, Y_mis) and P(Y_mis|Y_obs) are specified subjective distributions, making P(Q|Y_obs) computable. In practice, he focused on the posterior mean of Q; that is, E(Q|Y_obs) = ∫E(Q|Y_obs, Y_mis)P(Y_mis|Y_obs)dY_mis.

Observe that Rubin’s “basic result” does not explicitly refer to RMI. He interpreted it as RMI by considering Monte Carlo integration as a practical approach to approximate E(Q|Y_obs). To perform Monte Carlo integration, one draws repeated values of Y_mis at random from P(Y_mis|Y_obs) and averages the resulting values of E(Q|Y_obs, Y_mis).
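The Monte Carlo approximation just described can be sketched in a toy example. The estimand and the subjective posterior below are our own illustrative assumptions, not Rubin's:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy illustration: Q is the mean of an observed and a missing component,
# and the analyst's subjective posterior P(Y_mis | Y_obs) is Normal(2.0, 1.0).
y_obs = 1.0
m = 200_000                               # number of random draws ("large-m")
draws = rng.normal(2.0, 1.0, size=m)      # Y_mis drawn from P(Y_mis | Y_obs)

# E(Q | Y_obs, Y_mis) for each draw, then averaged over draws:
q_complete = (y_obs + draws) / 2.0
posterior_mean_q = q_complete.mean()
# As m grows, this converges to E(Q|Y_obs) = (1.0 + 2.0)/2 = 1.5.
```

Each draw plays the role of one random imputation; averaging over many draws is exactly the integral ∫E(Q|Y_obs, Y_mis)P(Y_mis|Y_obs)dY_mis approximated numerically.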
Semantically, one may refer to Monte Carlo draws of Y_mis as imputations. Hence, RMI is Monte Carlo integration.

The above motivation for RMI is well-grounded from a subjective Bayesian perspective. A disconnect between the theory and practice of RMI stems from the effort made by Rubin to assert desirable frequentist properties for RMI. To a subjective Bayesian, the posterior mean E(Q|Y_obs) is well-defined and interpretable regardless of whether it equals an objective quantity of scientific interest. A frequentist, however, assumes the existence of an objective quantity of interest, say Q*, and wants to estimate this quantity well in some sense, across repeated samples. In general, the posterior mean E(Q|Y_obs) need not be a good estimate of Q* when P(Q|Y_obs, Y_mis) and P(Y_mis|Y_obs) are simply subjective distributions. To prove good frequentist properties for E(Q|Y_obs) typically requires one to assume that P(Q|Y_obs, Y_mis) and P(Y_mis|Y_obs) are objectively correct. Rubin demonstrated awareness of this core requirement when he wrote (Rubin, 1996, p. 474): “My conclusion is that ‘correctly’ modeling the missing data must be, in general, the data constructor's responsibility.” However, he provided no evidence that data constructors are able to model missing data correctly.

Rubin argued that two desirable frequentist properties for statistical procedures are “randomization validity,” which he interpreted as requiring approximately unbiased point estimates of scientific estimands, and “confidence validity,” requiring that actual coverage probabilities for confidence intervals should be at least as large as nominal coverage probabilities. He wrote (Rubin, 1996, p. 476): “Multiple imputation was designed to satisfy both achievable objectives by using the Bayesian and frequentist paradigms in complementary ways: the Bayesian model-based approach to create procedures, and the frequentist (randomization-based approach) to evaluate procedures.” Continuing, he wrote that if the multiple imputations are “proper” and complete-data inference is randomization-valid, then (p. 477): “the large-m repeated-imputation inference . . . is randomization-valid for the scientific estimand Q, no matter how complex the survey design.”

It is not easy to understand Rubin’s extended verbal discussion of what he means by “proper” multiple imputation. However, we believe that we understand the type of frequentist inference that he had in mind. His symbol m refers to the number of random draws made from P(Y_mis|Y_obs) and, hence, “large-m” refers to asymptotic analysis as m goes to infinity. Thus, he meant that, by the Law of Large Numbers and the Central Limit Theorem, Monte Carlo integration yields a well-behaved estimate of a population mean as the number of pseudo-draws goes to infinity. Randomization validity in this sense means that RMI yields a consistent estimate of E(Q|Y_obs) asymptotically in m. It implies nothing about the quality of RMI in estimation of Q*.

Appendix B. Empirical Illustration of RMI Applied to HLA Genotypes

As in Manski, Tambur, and Gmeiner (2019), we examine data on the outcomes of deceased-donor transplants recorded in the SRTR from 2009 through 2018. Most HLA codings in the SRTR are at the two-digit level. We convert occasional four-digit codings to two-digit following OPTN guidelines, with the exception of the coding 103 for DR. Transplants with codings 2, 3, 5, or 6 for DR are excluded due to ambiguity of these codings. Mismatches are defined as the number of unique antigens the donor has that the recipient does not have. We study transplants for which the donor and patient were both coded as white.
Within this population, we consider each combination of (A, B, DR) antigens separately for the donor and recipient. Rather than use HaploStats to impute two-digit DR genotypes, we use the available SRTR data on (A, B, DR) types. We generate separate estimates of P(DR|A, B) for donors and recipients. Whereas some (A, B) types are common among SRTR transplants with white donors and recipients, other types are sparse. To obtain meaningful estimates of P(DR|A, B), we restrict attention to (A, B) types for which at least 10 observations were present in the data. For each such case, we compute the empirical distribution of DR conditional on (A, B) and use it as the estimate of P(DR|A, B).

When analyzing transplant outcomes, we restrict the sample to adult transplants in which the patient was receiving their first transplant and the KDPI variable can be calculated. We define five-year survival to take the value 0 if an individual has re-transplant, death, or otherwise graft failure within 1,825 days of transplant. The five-year survival variable takes the value 1 if the individual is observed for more than 1,825 days after transplant and the first date of a failure event, if any, occurs after 1,825 days. Our final estimation sample, comprising cases for which five-year survival can be calculated, and for which the (A, B) combination of both patients and donors has more than 10 observations, contains 5,045 transplants.

Coefficients from logit regressions using observable (A, B, DR) data are in Table 1. We find that both B and DR mismatch have negative association with survival probability, the coefficients being strong in magnitude and statistically significant by conventional criteria.

Table 1: Logit Coefficients Using Observed Data

                    Five-Year Survival
A Mismatches        -0.118 (0.051)
DR Mismatches       -0.163 (0.049)
Constant             0.974 (0.053)
N                    5,045

Robust standard errors in parentheses.

We next replace observable DR data with imputations.
We perform RMI with 50 repetitions, imputing DR separately for donors and patients using the empirical distributions P(DR|A, B). In each repetition, we estimate the logit model in Table 1, again using observed mismatches for A and B, but now using imputed DR mismatches. The means and standard deviations of the logit coefficients across the 50 repetitions are shown in Table 2.

Table 2: Average Logit Coefficients Using RMI to Impute DR Mismatches

A Mismatches         .008 (.003)
B Mismatches        -.194 (.010)
DR Mismatches        .003 (.047)

Averages are from 50 RMI draws. Standard deviations in parentheses.

The primary insight is that the average coefficient for DR mismatches is near 0 when using imputed data rather than actual data. Also note that the average coefficients for A and B mismatches are slightly more negative than the analogous coefficients in Table 1. This suggests that the correlation between DR and other mismatches causes the effect of DR to load onto the other mismatch coefficients when the true DR data are not utilized.
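The washing-out of the DR coefficient in Table 2 is the shrinkage of Section 2 at work. A small synthetic simulation, using our own data-generating process rather than the SRTR data, shows the same mechanism for a single conditional mean: when missing w is imputed from P(w|x) under MAR, the cell-mean estimate converges to the weighted average in equation (11), not to E(y|x, w).

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical DGP (our own, for illustration): x always observed, w MAR given x.
N = 400_000
x = rng.integers(0, 2, N)
p_w = np.where(x == 1, 0.7, 0.3)          # P(w = 1 | x)
w = (rng.random(N) < p_w).astype(int)
y = 1.0 * x + 2.0 * w + rng.normal(0.0, 1.0, N)
p_z = np.where(x == 1, 0.4, 0.6)          # P(z = 1 | x): missingness depends on x only
z = (rng.random(N) < p_z).astype(int)

# Single random imputation: draw u from the true P(w | x) when w is missing,
# then treat u as if it were real data.
u = np.where(z == 1, w, (rng.random(N) < p_w).astype(int))

xi, omega = 1, 1
theta = y[(x == xi) & (u == omega)].mean()            # imputation estimate
cc = y[(z == 1) & (x == xi) & (w == omega)].mean()    # complete-case estimate

# Equation (11): theta -> E(y|x,w) * P(z=1|x) + E(y|x) * P(z=0|x).
# Here E(y|x=1,w=1) = 3.0 and E(y|x=1) = 1 + 2*0.7 = 2.4, so the limit is
# 3.0*0.4 + 2.4*0.6 = 2.64: shrunk toward E(y|x), while cc stays near 3.0.
```

The complete-case estimate is consistent here precisely because the data are MAR given x; the imputation estimate is not, despite using the true P(w|x) to impute.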
References

Azur, M., E. Stuart, C. Frangakis, and P. Leaf (2011), “Multiple Imputation by Chained Equations: What Is It and How Does It Work?” International Journal of Methods in Psychiatric Research, 20, 40-49.

Geneugelijk, K., J. Wissing, D. Koppenaal, M. Niemann, and E. Spierings (2017), “Computational Approaches to Facilitate Epitope-Based HLA Matching in Solid Organ Transplantation,” Journal of Immunology Research, https://doi.org/10.1155/2017/9130879.

Gragert, L., S. Fingerson, M. Albrecht, M. Maiers, M. Kalaycio, and B. Hill (2014), “Fine-mapping of HLA associations with chronic lymphocytic leukemia in US populations,” Blood.

Horowitz, J. and C. Manski (1998), “Censoring of Outcomes and Regressors Due to Survey Nonresponse: Identification and Estimation Using Weights and Imputations,” Journal of Econometrics, 84, 37-58.

Horowitz, J. and C. Manski (2000), “Nonparametric Analysis of Randomized Experiments with Missing Covariate and Outcome Data,” Journal of the American Statistical Association, 95, 77-84.

Kamoun, M. et al. (2017), “HLA Amino Acid Polymorphisms and Kidney Allograft Survival,” Transplantation.

Li, Y., C. Willer, S. Sanna, and G. Abecasis (2009), “Genotype Imputation,” Annual Review of Genomics and Human Genetics, 10, 387-406.

Manski, C. (2018), “Credible Ecological Inference for Medical Decisions with Personalized Risk Assessment,” Quantitative Economics, 9, 541-569.

Manski, C., A. Tambur, and M. Gmeiner (2019), “Predicting Kidney Transplant Outcomes with Partial Knowledge of HLA Mismatch,” Proceedings of the National Academy of Sciences.

National Research Council (2010), The Prevention and Treatment of Missing Data in Clinical Trials. Washington, DC: The National Academies Press. https://doi.org/10.17226/12955.

Nilsson, J., D. Ansari, M. Ohlsson, P. Höglund, A. Liedberg, J. Smith, and B. Andersson (2019), “Human Leukocyte Antigen-Based Risk Stratification in Heart Transplant Recipients—Implications for Targeted Surveillance,” Journal of the American Heart Association, https://doi.org/10.1161/JAHA.118.011124.

Pedersen, A., E. Mikkelsen, D. Cronin-Fenton, N. Kristensen, T. My Pham, L. Pedersen, and I. Petersen (2017), “Missing Data and Multiple Imputation in Clinical Epidemiological Research,” Clinical Epidemiology, 9, 157-166.

Rubin, D. (1987), Multiple Imputation for Nonresponse in Surveys, New York: John Wiley & Sons.

Rubin, D. (1996), “Multiple Imputation after 18+ Years,” Journal of the American Statistical Association, 91, 473-489.

Sterne, J., I. White, J. Carlin, M. Spratt, P. Royston, M. Kenward, A. Wood, and J. Carpenter (2009), “Multiple Imputation for Missing Data in Epidemiological and Clinical Research: Potential and Pitfalls,” BMJ.

Tinckam, K., C. Rose, S. Hariharan, and J. Gill (2016), “Re-examining Risk of Repeated HLA Mismatch in Kidney Transplantation,” Journal of the American Society of Nephrology, 2833-2841.

U.S. Census Bureau (2006), Current Population Survey Design and Methodology, Technical Paper 66, Washington, DC: U.S. Census Bureau.

U.S. Census Bureau (2011),