Listwise Deletion in High Dimensions

J. Sophia Wang and Peter M. Aronow * January 28, 2021

Abstract

We consider the properties of listwise deletion when both $n$ and the number of variables grow large. We show that when (i) all data has some idiosyncratic missingness and (ii) the number of variables grows superlogarithmically in $n$, then, for large $n$, listwise deletion will drop all rows with probability 1. We present numerical illustrations to demonstrate finite-$n$ implications. These results suggest that, in practice, using listwise deletion may mean using few of the variables available to the researcher even when $n$ is very large.

* J. Sophia Wang is Graduate Student, Department of Political Science, Yale University. Peter M. Aronow is Associate Professor, Departments of Political Science, Biostatistics, and Statistics and Data Science, Yale University. Thanks to Forrest Crawford and Rosa Kleinman for helpful comments.
arXiv [stat.OT]

Listwise deletion is a commonly used approach for addressing missing data that entails excluding any observations that have missing data for any variable used in an analysis. It constitutes the default behavior for standard data analyses in popular software packages: for example, rows with any missing data are by default omitted by the lm function in R (R Core Team, 2020) and by the regress command in Stata (Stata.com, 2020).

However, scholars have increasingly recognized that listwise deletion may not be a generally appropriate research method to handle missing data. While a common critique focuses on the plausibility of the "missing completely at random" assumption (Schafer, 1997, p. 23; Allison, 2001, pp. 6-7; Cameron and Trivedi, 2005, p. 928; Little and Rubin, 2019, p. 15), issues about efficiency in estimation have also been raised (Berk, 1983, p. 540; Schafer, 1997, p. 38; Allison, 2001, p. 6). Namely, since listwise deletion discards data, the resulting estimators can be inefficient relative to approaches that use more of the data (e.g., imputation methods).

These issues have been raised to an audience of political scientists (King et al., 2001; Lall, 2016; Honaker and King, 2010), but the manner in which listwise deletion can hinder the researcher has been underappreciated. Namely, if the researcher seeks to use many variables with missingness, it may be impossible altogether to draw any statistical conclusion whatsoever. Accordingly, the use of listwise deletion may imply a severe restriction on variables used in an analysis.

The primary purpose of this note is to make this argument rigorous by considering the high-dimensional properties of listwise deletion. We show that when (i) all variables have some idiosyncratic missingness and (ii) the number of variables grows with $n$ at any superlogarithmic rate, listwise deletion will yield no usable data asymptotically with probability 1. We further provide numerical illustrations to shed light on finite-$n$ properties.

Let $n$ be the number of observations. Let $k$ be the number of variables (columns) in the data set. Let $M_{ij}$ be a random indicator variable for whether or not the $j$th variable in the $i$th row is missing, so that $M_{ij} = 0$ denotes an observed entry. Let $\mathbf{M}_{ij}$ represent the random vector collecting the missingness indicators up to variable $j$, $(M_{i1}, M_{i2}, \ldots, M_{ij})$.

Our results depend on two assumptions. First, we assume mutual independence of missingness across rows. This would be violated when, e.g., observations are clustered.

Assumption 1.

All rows of the data $\big( (M_{11}, \ldots, M_{1k}), \ldots, (M_{n1}, \ldots, M_{nk}) \big)$ are mutually independent.

We also assume that there is some idiosyncratic missingness in each variable. Namely, we will assume that there is a factorization of the data such that all conditional probabilities of missingness are bounded away from zero. Substantively, this assumption rules out, e.g., two variables sharing the same missingness pattern with probability 1.

Assumption 2.

There exists a $q_* \in [0, 1)$ such that for all $i$,
• $\Pr(M_{i1} = 0) \le q_*$
• $\Pr(M_{ij} = 0 \mid \mathbf{M}_{i(j-1)} = \mathbf{0}) \le q_*$, for all $j \in \{2, \ldots, k\}$.

With some additional notation, this assumption can be weakened to only require the existence of an index ordering over $j$ such that the condition holds. We also note that Assumption 2 may be unrealistic in settings where missingness is always identical across some groups of variables (e.g., because two or more variables come from a common data source). In such settings, our results can be generalized to the setting where $k$ refers to the number of groups of variables, rather than the number of variables themselves. To ease exposition, we omit formalizing these extensions.

We begin by application of elementary probability theory to yield the following finite-$n$ result.

Lemma 3. Under Assumptions 1 and 2, the probability that listwise deletion removes all rows is $p_{all} \ge (1 - q_*^k)^n$.

Proof:

By the independence assumption, $p_{all} = \prod_{i=1}^{n} \big(1 - \Pr(\mathbf{M}_{ik} = \mathbf{0})\big)$. Denote $q_{ij} = \Pr(M_{ij} = 0 \mid \mathbf{M}_{i(j-1)} = \mathbf{0})$ if $j > 1$, else $q_{ij} = \Pr(M_{ij} = 0)$. By Assumption 2, $q_{ij} \le q_*$ for all $j \in \{1, 2, \ldots, k\}$. By the chain rule of conditional probability, $\Pr(\mathbf{M}_{ik} = \mathbf{0}) = q_{i1} q_{i2} \cdots q_{ik}$. This means that the probability of a single observation containing at least one missing entry is $1 - q_{i1} q_{i2} \cdots q_{ik}$. Since $q_* \ge q_{ij}$ for all $j \in \{1, 2, \ldots, k\}$, $q_*^k \ge q_{i1} q_{i2} \cdots q_{ik}$. Thus $(1 - q_*^k) \le (1 - q_{i1} q_{i2} \cdots q_{ik})$ for every row $i$. Taking the product over the $n$ rows, $(1 - q_*^k)^n$ is a lower bound for the probability of all $n$ observations each containing at least one missing entry. $\square$

We can now consider the asymptotic properties of listwise deletion. To do so, we embed the above problem into a sequence. We let $k_n = f(n)$, where $f$ has range over the natural numbers, and allow $M_{ij,n}$ and therefore $q_{ij,n}$ to vary at each $n$. To ease exposition, we omit notational dependence on $n$.

We have our third and final assumption: $k$ grows superlogarithmically in $n$. Superlogarithmic rates can be extremely slow, and include any polynomial rate of growth (e.g., $n^c$ for any $c > 0$). To see how slow these rates can be, our results would include the rate $\lfloor \log(n)^{1.\ldots} \rfloor$, which would permit use of 2 variables with 10 observations, 8 variables with a thousand observations, and 17 variables with a million observations. This assumption encompasses rates that are slow enough that they would not preclude good asymptotic behavior for most standard estimators. For example, see the minimal assumptions invoked by Lai et al. (1978) for convergence of the least squares estimator.

Assumption 4.

The number of covariates grows superlogarithmically in $n$, so that $\lim_{n \to \infty} \frac{f(n)}{\log(n)} = \infty$.

Assumption 4 can be concisely stated as $k = \omega(\log n)$. The following proposition demonstrates that when Assumptions 1, 2 and 4 hold, then the probability of listwise deletion yielding no usable data tends to 1 as $n \to \infty$.

Proposition 5.

Under Assumptions 1, 2 and 4, $\lim_{n \to \infty} p_{all} = 1$.

Proof:

First we will show that $q_*^{f(n)} = o(n^{-1})$, or $\lim_{n \to \infty} n q_*^{f(n)} = 0$. Note that $\lim_{n \to \infty} n q_*^{f(n)} = \lim_{n \to \infty} e^{\log(n q_*^{f(n)})} = \lim_{n \to \infty} e^{\log n + f(n) \log q_*}$. Since $q_* \in (0, 1)$, $\log q_* < 0$. Since $f(n) = \omega(\log n)$, we can write $\log n + f(n) \log q_* = \log n \, \big(1 + \frac{f(n)}{\log n} \log q_*\big)$, where the bracketed factor tends to $-\infty$. The sequence $\log n + f(n) \log q_*$ therefore diverges to negative infinity, and so $\lim_{n \to \infty} e^{\log n + f(n) \log q_*} = 0 = \lim_{n \to \infty} n q_*^{f(n)}$.

Since $q_* \in (0, 1)$ and $k = f(n) \ge 1$, $-q_*^{f(n)} > -1$ and $-q_*^{f(n)} \le 0$. By Bernoulli's Inequality, since $n \in \mathbb{N}$, $(1 - q_*^{f(n)})^n \ge 1 + n(-q_*^{f(n)}) = 1 - n q_*^{f(n)}$. Thus $1 - n q_*^{f(n)} \le (1 - q_*^{f(n)})^n \le 1$ in the common domain $n \in \mathbb{N}$. Since $\lim_{n \to \infty} 1 = 1$ and $\lim_{n \to \infty} (1 - n q_*^{f(n)}) = 1 - \lim_{n \to \infty} n q_*^{f(n)} = 1$, by the Squeeze Theorem, $\lim_{n \to \infty} (1 - q_*^{f(n)})^n = 1$. Then, since $(1 - q_*^{f(n)})^n \le p_{all} \le 1$ for all $n$, we have $\lim_{n \to \infty} p_{all} = 1$, again by the Squeeze Theorem. $\square$

Thus we have shown that even modest rates of growth in the number of covariates can render any resulting statistical inference asymptotically impossible with listwise deletion.
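As a rough numerical companion to Lemma 3 and Proposition 5, the following Python sketch evaluates the lower bound $(1 - q_*^k)^n$ along one superlogarithmic growth rate and compares it with the Bernoulli-inequality floor $1 - n q_*^k$ used in the proof. The function names, the rate $k = \lceil \sqrt{n} \rceil$, and the value $q_* = 0.75$ are illustrative assumptions made for this example, not choices taken from the paper.

```python
import math


def lower_bound_all_dropped(n: int, k: int, q_star: float) -> float:
    """Lemma 3 lower bound (1 - q_star**k)**n on the probability that
    listwise deletion removes every row (requires 0 <= q_star < 1)."""
    # log1p keeps precision when q_star**k is tiny.
    return math.exp(n * math.log1p(-(q_star ** k)))


def bernoulli_floor(n: int, k: int, q_star: float) -> float:
    """Floor 1 - n * q_star**k obtained from Bernoulli's inequality."""
    return 1.0 - n * q_star ** k


if __name__ == "__main__":
    q_star = 0.75  # illustrative assumption, not a value from the paper
    for n in (100, 1_000, 10_000, 100_000):
        # Assumed polynomial (hence superlogarithmic) growth rate for k.
        k = math.ceil(math.sqrt(n))
        floor = bernoulli_floor(n, k, q_star)
        bound = lower_bound_all_dropped(n, k, q_star)
        print(f"n={n:>7} k={k:>4} 1-n*q^k={floor: .6f} (1-q^k)^n={bound:.6f}")
```

Under these assumed inputs both quantities approach 1 as $n$ grows, with the Bernoulli floor always sitting at or below the Lemma 3 bound. Much slower superlogarithmic rates satisfy the same limit but may need far larger $n$ before the bound moves away from 0.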

Here we demonstrate the finite-$n$ implications of our findings. The following figures show results for different values of an upper bound $q_*$ on the conditional probability of each data entry being observed, the number of observations (rows) $n$, the number of variables (columns) $k$, and a sharp lower bound on $p_{all}$, denoted $\underline{p}_{all} = (1 - q_*^{f(n)})^n$. $\underline{p}_{all}$ is the lowest possible ("best-case") probability that listwise deletion removes all data.

Figure 1 demonstrates how listwise deletion asymptotically removes all data. It plots the sharp lower bound $\underline{p}_{all}$ for the probability that listwise deletion removes all data against the number of variables $k$ in three different settings, $n = 100, 1000, 10000$, one per subfigure. By Lemma 3, we can compute $\underline{p}_{all} = (1 - q_*^k)^n$. In each subfigure, we simultaneously consider three different values for the upper bound of the conditional probability $q_*$ defined in Assumption 2: the red curves represent $\underline{p}_{all}$ calculated with $q_* = 0.\ldots$, the blue curves represent $\underline{p}_{all}$ calculated with $q_* = 0.\ldots$, and the green curves represent $\underline{p}_{all}$ calculated with $q_* = 0.\ldots$. We see that $\underline{p}_{all}$ converges to 1 as $k$ gets large, and does so faster when $q_*$ is smaller. However, the rate of convergence depends heavily on $q_*$. When $q_* = 0.75$ (i.e., each variable has a 25% chance of missingness), the probability is extremely close to 1 even when $n = 100$. However, when the upper bound $q_*$ is as large as $0.\ldots$, $\underline{p}_{all}$ is essentially zero when $n = 100$ and $k = 150$, reflecting the fact that the probability of missingness is essential in determining the properties of listwise deletion.

Figure 2 illustrates, for a given $n$ and $q_*$, how large $k$ can be while still ensuring that $\underline{p}_{all} \le 0.\ldots, 0.\ldots$. We compute this using the result from Lemma 3, $\underline{p}_{all} = (1 - q_*^k)^n$. Since $(1 - q_*^k)^n$ is strictly increasing in $k$, solving this equality for $k$ and rounding down gives, for each target $\underline{p}_{all}$, the largest $k$ that keeps the bound at or below the target:

$k = \left\lfloor \frac{\log(1 - \underline{p}_{all}^{1/n})}{\log(q_*)} \right\rfloor$.

We present two subfigures, with $\underline{p}_{all} = 0.\ldots$ for the first subfigure and $\underline{p}_{all} = 0.\ldots$ for the second subfigure, and we plot $k$ against $n$ for three different upper bounds $q_* = 0.\ldots, 0.\ldots, 0.\ldots$. With more missingness, $q_* = 0.\ldots, 0.\ldots$, even relatively small $k$ can yield missingness of $\underline{p}_{all}$. For example, even with $n = 10\ldots$ and $q_* = 0.\ldots$, we need only $k = 25$ to have a 50% probability that all rows will be missing. However, when missingness is very low, $k$ needs to be very large to cause all data to be missing. For example, with $n = 10\ldots$ and $q_* = 0.\ldots$, we need $k = 724$ to have a 50% probability that all rows will be missing.
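The calculation behind Figure 2 can be sketched in a few lines of Python. This is a minimal sketch rather than the authors' code: the function names are hypothetical, and the inputs in the usage lines ($n = 1000$, $q_* = 0.75$, threshold 0.5) are illustrative assumptions, not confirmed settings from the paper.

```python
import math


def lower_bound_all_dropped(n: int, k: int, q_star: float) -> float:
    """Lower bound (1 - q_star**k)**n from Lemma 3."""
    return math.exp(n * math.log1p(-(q_star ** k)))


def largest_k(n: int, q_star: float, p_max: float) -> int:
    """Largest integer k with (1 - q_star**k)**n <= p_max, found by solving
    (1 - q_star**k)**n = p_max for k and rounding down.
    Assumes 0 < q_star < 1 and 0 < p_max < 1."""
    # 1 - p_max**(1/n), computed with expm1 to avoid cancellation for large n.
    one_minus_root = -math.expm1(math.log(p_max) / n)
    return math.floor(math.log(one_minus_root) / math.log(q_star))


if __name__ == "__main__":
    n, q_star, p_max = 1_000, 0.75, 0.5  # illustrative assumptions
    k = largest_k(n, q_star, p_max)
    print("largest k:", k)
    print("bound at k:    ", lower_bound_all_dropped(n, k, q_star))
    print("bound at k + 1:", lower_bound_all_dropped(n, k + 1, q_star))
```

Printing the bound at $k$ and at $k + 1$ is a quick way to confirm that the rounded-down value sits just below the chosen threshold and that one more variable would push the bound above it.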

[Figure 1: three panels titled "Lower bound on probability that all rows missing", for n = 100, n = 1000, and n = 10000. x-axis: Number of variables; y-axis: Probability; one curve per value of $q_*$.]

Figure 1. Lower bound of probability that all rows missing ($\underline{p}_{all}$) plotted against values of $n$, $k$ and $q_*$.

[Figure 2: two panels, (a) and (b), each titled "Largest k such that probability all data missing is ≤" the panel's threshold. x-axis: n; y-axis: k; one curve per value of $q_*$.]

Figure 2.

Largest $k$ such that $\underline{p}_{all} \le 0.\ldots, 0.\ldots$ plotted against $n$ and $q_*$.

Our final numerical illustration considers a lower bound on the expected proportion of observations that are missing, $1 - q_*^k$, which does not depend on $n$. Figure 3 plots this bound on the expected proportion of data missing versus the number of variables $k$. We see the same qualitative relationship as before: as the number of variables increases, we have a very quick decline in the proportion of usable data. In comparison to Figure 1, the expected proportion of data missing tends faster to 1 for each $q_*$ considered as $k$ gets large, as it is equivalent to the special case for $\underline{p}_{all}$ when $n = 1$.
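The per-row quantity $1 - q_*^k$ is easy to check by simulation. The sketch below is a hedged illustration only: it assumes the simplest case, in which every entry is observed independently with probability exactly $q_*$ (so the bound holds with equality), and the function names, seed, and parameter values are hypothetical choices made for this example.

```python
import random


def expected_share_incomplete(k: int, q_star: float) -> float:
    """1 - q_star**k: share of rows with at least one missing entry when each
    of the k entries is observed independently with probability q_star."""
    return 1.0 - q_star ** k


def simulate_share_incomplete(n: int, k: int, q_star: float, seed: int = 0) -> float:
    """Monte Carlo share of rows that listwise deletion would drop."""
    rng = random.Random(seed)
    dropped = 0
    for _ in range(n):
        # An entry is missing when the draw is >= q_star; a row is dropped
        # as soon as any one of its k entries is missing.
        if any(rng.random() >= q_star for _ in range(k)):
            dropped += 1
    return dropped / n


if __name__ == "__main__":
    q_star = 0.9  # illustrative assumption, not a value from the paper
    for k in (1, 5, 10, 25, 50):
        exact = expected_share_incomplete(k, q_star)
        approx = simulate_share_incomplete(20_000, k, q_star)
        print(f"k={k:>3} exact={exact:.4f} simulated={approx:.4f}")
```

With dependence across variables or entry-specific probabilities below $q_*$, the simulated share would be at least as large as $1 - q_*^k$, which is why the paper treats this expression as a bound rather than an exact value.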

[Figure 3: plot titled "Lower bound on expected proportion of data missing". x-axis: Number of Variables; y-axis: Expected proportion missing; one curve per value of $q_*$.]

Figure 3.

Lower bound of expected proportion of all rows missing ($\underline{p}_{all}$) plotted against values of $n$, $k$ and $q_*$.

Our theoretical and numerical results demonstrate that listwise deletion cannot generally accommodate many variables, and that this problem is not resolved asymptotically. Application of high-dimensional asymptotics reveals that listwise deletion is even more fragile than was previously understood.

This implies that scholars who are committed to listwise deletion may be unable to use all of the variables that are necessary for an otherwise valid data analysis even when $n$ is large. For example, in order to achieve valid inferences in an observational study, a scholar may identify a large number of variables that must be conditioned on. However, if these variables exhibit idiosyncratic missingness, then the use of listwise deletion would require the scholar to exclude variables that would be necessary to attain an unbiased estimate. Neither dropping necessary variables nor dropping many observations is desirable, and thus a researcher in this situation should entertain alternatives to listwise deletion.

We conclude by emphasizing that this note should not be read as advocacy for the generic use of any particular method for addressing missing data. As Arel-Bundock and Pelc (2018) and Pepinsky (2018) demonstrate, no method is best across all settings, and listwise deletion can outperform alternatives (e.g., multiple imputation) depending on the underlying data generating process. Our results provide additional support for the perspective that the most suitable inferential strategy is one chosen based on the specifics of the problem at hand.

References

Allison, P. D. (2001). Missing Data. Sage University Papers Series on Quantitative Applications in Social Sciences. Thousand Oaks, CA: Sage.

Arel-Bundock, V. and K. J. Pelc (2018). When can multiple imputation improve regression estimates? Political Analysis 26(2), 240–245.

Berk, R. (1983). Applications of the general linear model to survey data. In A. B. A. Peter H. Rossi, James D Wright (Ed.), Handbook of Survey Research, Quantitative Studies in Social Relations. New York: Academic Press.

Cameron, A. and P. Trivedi (2005). Microeconometrics: Methods and Applications. New York: Cambridge University Press.

Honaker, J. and G. King (2010). What to do about missing values in time-series cross-section data. American Journal of Political Science 54(2), 561–581.

King, G., J. Honaker, A. Joseph, and K. Scheve (2001). Analyzing incomplete political science data: An alternative algorithm for multiple imputation. American Political Science Review, 49–69.

Lai, T. L., H. Robbins, and C. Z. Wei (1978). Strong consistency of least squares estimates in multiple regression. Proceedings of the National Academy of Sciences of the United States of America 75(7), 3034–3036.

Lall, R. (2016). How multiple imputation makes a difference. Political Analysis 24(4), 414–433.

Little, R. J. and D. B. Rubin (2019). Statistical Analysis with Missing Data, Volume 793. John Wiley & Sons.

Pepinsky, T. B. (2018). A note on listwise deletion versus multiple imputation. Political Analysis 26(4), 480–488.

R Core Team (2020). R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing.

Schafer, J. L. (1997).