[PDF] Parameter estimation in Cox models with missing failure indicators and the OPPERA study

Abstract

In a prospective cohort study, examining all participants for incidence of the condition of interest may be prohibitively expensive. For example, the "gold standard" for diagnosing temporomandibular disorder (TMD) is a physical examination by a trained clinician. In large studies, examining all participants in this manner is infeasible. Instead, it is common to use questionnaires to screen for incidence of TMD and perform the "gold standard" examination only on participants who screen positively. Unfortunately, some participants may leave the study before receiving the "gold standard" examination. Within the framework of survival analysis, this results in missing failure indicators. Motivated by the Orofacial Pain: Prospective Evaluation and Risk Assessment (OPPERA) study, a large cohort study of TMD, we propose a method for parameter estimation in survival models with missing failure indicators. We estimate the probability of being an incident case for those lacking a "gold standard" examination using logistic regression. These estimated probabilities are used to generate multiple imputations of case status for each missing examination that are combined with observed data in appropriate regression models. The variance introduced by the procedure is estimated using multiple imputation. The method can be used to estimate both regression coefficients in Cox proportional hazard models as well as incidence rates using Poisson regression. We simulate data with missing failure indicators and show that our method performs as well as or better than competing methods. Finally, we apply the proposed method to data from the OPPERA study.

Full PDF

aa r X i v : . [ s t a t . M E ] J u l Research Article

Statisticsin Medicine

Received XXXX

Parameter estimation in Cox models withmissing failure indicators and the OPPERAstudy

Naomi C. Brownstein a,b,c , Jianwen Cai c , Gary D. Slade c , and Eric Bair c,d ∗† In a prospective cohort study, examining all participants for incidence of the condition of interest may beprohibitively expensive. For example, the “gold standard” for diagnosing temporomandibular disorder (TMD) is a physical examination by a trained clinician. In large studies, examining all participants in this manner is infeasible.Instead, it is common to use questionnaires to screen for incidence of TMD and perform the “gold standard”examination only on participants who screen positively. Unfortunately, some participants may leave the studybefore receiving the “gold standard” examination. Within the framework of survival analysis, this results in missingfailure indicators. Motivated by the Orofacial Pain: Prospective Evaluation and Risk Assessment (OPPERA) study,a large cohort study of TMD, we propose a method for parameter estimation in survival models with missing failureindicators. We estimate the probability of being an incident case for those lacking a “gold standard” examinationusing logistic regression. These estimated probabilities are used to generate multiple imputations of case status foreach missing examination that are combined with observed data in appropriate regression models. The varianceintroduced by the procedure is estimated using multiple imputation. The method can be used to estimate bothregression coefﬁcients in Cox proportional hazard models as well as incidence rates using Poisson regression.We simulate data with missing failure indicators and show that our method performs as well as or better thancompeting methods. Finally, we apply the proposed method to data from the OPPERA study. Copyright c (cid:13)

Cox regression; missing data; multiple imputation; Poisson regression; survival analysis a Ion Cyclotron Resonance Program, National High Magnetic Field Laboratory, Florida State University, Tallahassee FL, U.S.A. b Department of Statistics, Florida State University, Tallahassee FL, U.S.A c Department of Biostatistics, University of North Carolina at Chapel Hill, Chapel Hill NC, U.S.A. d School of Dentistry, University of North Carolina at Chapel Hill, Chapel Hill NC, U.S.A. ∗ Correspondence to: Eric Bair, CB † E-mail: [email protected]

Statist. Med. , 00 1–8 Copyright c (cid:13)

Prepared using simauth.cls [Version: 2010/03/10 v3.00] tatisticsin Medicine

N. C. Brownstein et al.

1. Introduction

Time-to-event analyses are frequently conducted in medicine, actuarial science, and numerous other ﬁelds of appliedscience. There is a well-developed set of survival analysis methods implemented in standard software. Semi-parametricmethods, such as the Cox proportional hazards model, allow robust estimation of the effects of covariates on the hazardfunction. However, these methods require the analyst to know the failure status of each participant, which may not alwaysbe available.In some cases the outcome of interest may be difﬁcult to ascertain. For example, in oncology studies, researchers maywant to differentiate between deaths due to cancer and deaths due to car accidents or other unrelated causes. Investigatorsmay easily record the mortality of all subjects, but it may be extremely difﬁcult or costly to ﬁnd out exactly why eachsubject died. One possible solution to this problem is delayed event adjudication [1]. This means that possible cases are notidentiﬁed immediately but screened using simple methods that may have poor sensitivity or speciﬁcity. Later, the screenedcandidate cases are re-examined using a more precise, but also more costly and time-consuming, method to determine thetrue event status.The study that motivates our work is Orofacial Pain: Prospective Evaluation and Risk Assessment (OPPERA), aprospective cohort study to identify risk factors for the onset of temporomandibular disorders (TMD). Each (initially TMD- free) OPPERA study participant was followed for a median of 2.8 years to identify cases of ﬁrst-onset TMD. However,it was impractical to perform a physical examination on every participant. It would also have been inefﬁcient given thatmost study participants did not develop the condition. Instead, this “gold standard” examination was performed onlyon participants with positive screens on a quarterly screening questionnaire that was designed to assess recent orofacialpain [2]. However, some participants with positive screens were lost to follow-up before receiving the “gold standard”examination. Thus a time-to-event analysis would have some participants with missing failure indicators.Previous research indicates that when a subset of the failure indicators are missing, one can obtain more accurateestimates of the parameters of interest by using appropriate tools to estimate these missing values [1, 3, 4]. Cookand Kosorok [1] estimate parameters in Cox proportional hazard models with missing failure indicators by weightingobservations according to their probability of being a true case. They show that the estimators are consistent andasymptotically normally distributed. However, the standard error of their proposed estimate cannot be easily obtained usingexisting software without bootstrapping. For the OPPERA data, a separate Cox model was calculated for each putativerisk factor of interest, including approximately three thousand genetic markers. Consequently, applying this method to theOPPERA genetic data would be computationally intractable.In the OPPERA study, the likelihood that a participant with a positive screen was examined was weakly associatedwith demographic variables such as gender, race, or socioeconomic status [2]. This indicated that the failure indicators inthe OPPERA study were not missing completely at random (MCAR). Application of models that assume MCAR failure indicators may result in biased estimates of hazard ratios for covariates of interest. More importantly, a participant’sresponses to their screening questions are predictive of whether or not they are an incident case of TMD. This settingpresents statistical challenges, which require care in order to avoid bias and maintain efﬁciency. Additionally, incidencerate estimates are desired, and none of the methods currently available allow for estimation of the incidence rate. There isa clear need for new methodology to effectively answer the research questions of the OPPERA study.In this paper, we propose a method for parameter and variance estimation in Cox regression models with missing failureindicators. The motivating data set is introduced in section 2. We describe our method in section 3. In section 4, we reportthe results of simulations. Finally, in section 5 we apply our method to the OPPERA study. We conclude with a discussionin section 6.2 (cid:13)

Statist. Med. , 00 1–8

Prepared using simauth.cls . C. Brownstein et al.

Statisticsin Medicine

2. Motivating Data Set: The OPPERA Study

OPPERA is a prospective cohort study designed to identify risk factors for ﬁrst-onset TMD. A total of 3,263 initiallyTMD-free subjects were recruited at four study sites between 2006 and 2008. TMD status was conﬁrmed by physicalexamination of the jaw joints and muscles using the Research Diagnostic Criteria for TMD [5], which is the gold standardfor diagnosing TMD.Upon enrollment in the study, each OPPERA participant was evaluated for a wide variety of possible risk factors forTMD, including psychological distress, previous history of painful conditions, and sensitivity to experimental pain. Fora brief overview of the risk factors of interest in the OPPERA study, see Section S1 in the Supporting Information. SeeOhrbach et al. [6], Fillingim et al. [7], Greenspan et al. [8], Maixner et al. [9], and Smith et al. [10] for a completedescription of the baseline measures that were collected in OPPERA.After enrollment, each participant was asked to complete questionnaires to evaluate recent orofacial pain once everythree months. These questionnaires (hereafter referred to as “screeners”) evaluated the frequency and severity of pain inthe orofacial region during the previous three months. The purpose of the screener was to identify participants who werelikely to have recently developed TMD. For a complete description of the screener, see Slade et al. [11]. Participants witha positive screen were asked to undergo a follow-up physical examination by a clinical expert to diagnose presence or absence of TMD.Of the 3,263 subjects, 2,737 ﬁlled out at least 1 screener, and the remaining 521 did not ﬁll out any screeners. Thetotal number of screeners was 26,666. There were 717 positive screeners, 486 (about 68%) of which were followed by aclinical examination. As reported in Bair et al. [2], case classiﬁcations made by one examiner (hereafter, “Examiner

3. Model

Assume there are n independent participants. For each participant i ( i = 1 , . . . , n ), let C i and T i denote the potentialtimes until censoring and failure, respectively, let V i = min( T i , C i ) , ∆ i = I ( T i ≤ C i ) . Let Z i a p × vector of covariatesmeasured at baseline and let X i be a q × vector of covariates measured at the time of the putative event. We assume thehazard for participant i follows a Cox proportional hazards model λ ( t | z i ) = λ ( t ) exp( β ′ z i ) (1)where λ ( t ) is an unspeciﬁed baseline hazard function. Let ξ i denote the indicator that ∆ i is observed. We observe ( V i , ξ i ) for i = 1 , . . . , n and ∆ i when ξ i = 1 .In the OPPERA study, V i is the length of time for participant i between enrollment in the study and either of two events1. a screener which resulted in a diagnosis of incident TMD2. the last-completed screener before loss-to-follow-up.Note that participants with a positive screen do not ﬁll out additional screeners until they are examined, so V i will bethe time until the positive screen for a participant who has a positive screen but is never examined. If participant i hada positive screen and subsequently was diagnosed with TMD, then ∆ i = 1 . If participant i either had a negative screenon the last quarterly screener before loss-to-follow up or a positive screen and was diagnosed to be free of TMD, then ∆ i = 0 . If participant i had a positive screen on the last screener but was not examined, then ∆ i is missing and ξ i = 0 . The Statist. Med. , 00 1–8 Copyright c (cid:13) Prepared using simauth.cls tatisticsin Medicine

N. C. Brownstein et al.putative risk factors for TMD that were assessed at enrollment are denoted by the vector Z i . Responses to the screener forparticipant i at time V i are denoted by the vector X i . For OPPERA, we also deﬁne Q i = 1 if participant i has a positivescreen on their ﬁnal screener and Q i = 0 otherwise.We assume the failure indicators are missing at random (MAR) as follows: P ( ξ i = 0 | X i , Z i , V i , ∆ i , Q i = 1) = P ( ξ i = 0 | X i , Z i , V i , Q i = 1) (2)In other words, the probability of having a missing failure indicator may depend on measured factors, but it does notdepend on whether or not an event occurred. We will describe how to estimate the probability in (2) in Section 3.2 andthen show how to use this estimate to impute the missing event indicators in Section 3.3. We model the probability that participant i with a missing failure indicator is a case by a logistic regression model basedon X i and V i : P (∆ i = 1 | X i , V i , Z i , ξ i = 0 , Q i = 1) = exp( α ′ X i + γ ′ Z i + ηV i )1 + exp( α ′ X i + γ ′ Z i + ηV i ) I ( Q i = 1) (3) That is, we estimate the probability of examiner-diagnosed TMD in a participant who was not examined as intended.(Here I ( x ) denotes an indicator function.) The probability was estimated using the time between enrollment and their lastpositive screener as well as their answers on that screener. Then, for those individuals who had a positive screen on thelast screener (i.e. those with Q i = 1 ) and were not examined, the estimated probability of being a case is estimated by (3)with the parameters replaced by their respective estimates based on individuals who were examined.Note that this also assumes that there is one observation per subject, which may not be the case in practice. For example,if some participants had a positive screen on more than one screener and are examined at least once, then we have multipleobservations per participant. In that case, ﬁtting a generalized linear mixed effects logistic regression model rather than astandard logistic regression model could account for correlations between the responses of the same participant. However,only a small number of participants in the OPPERA study were examined multiple times after positive screeners, so wesimply discarded all but the most recent screener when analyzing the OPPERA data (thereby avoiding this problem ofrepeated observations). One popular method for handling missing data is multiple imputation. For a comprehensive review on multiple imputation,see Rubin [12]. Our imputation procedure is as follows:1. Estimate the coefﬁcients α , γ , and η in ( α , γ , and η had a prior distributionthat was Cauchy with center 0 and scale 2.5.2. For each observation with a missing failure indicator, sample from the posterior distribution of α , γ , and η to obtainan estimate of the probability that an event occurred for each such observation.3. Generate a Bernoulli random variable with success probability equal to the predicted probability found in step (2).4. Combine the raw data and imputed data from step (3) to form a completed data set.5. Fit the Cox proportional hazards model to the completed data set.6. Record each parameter estimate ˆ β j and covariance matrix ˆ U j .7. Repeat steps (3)-(6) for a total of m times, where m is the desired number of imputations.Next, we combine all of the estimates. The average parameter estimate is ¯ β = 1 m m X j =1 ˆ β j , (4) (cid:13) Statist. Med. , 00 1–8

Prepared using simauth.cls . C. Brownstein et al.

Statisticsin Medicine the within-imputation variance estimate is ¯ U = 1 m m X j =1 ˆ U j , (5)and the between-imputation variance ˆ B = 1 m − m X j =1 ( ˆ β j − ¯ β )( ˆ β j − ¯ β ) ′ . (6)Finally, the estimated covariance matrix is ˆ Var ( ¯ β ) = ¯ U + (cid:18) m (cid:19) ˆ B. (7)It can be shown that ¯ β/ ˆ Var ( ¯ β ) is approximately t distributed with degrees of freedom ( m − (cid:18) m ¯ U ( m + 1) ˆ B (cid:19) (8)(7) and (8) can be used to compute conﬁdence intervals for the multiply imputed parameter estimate ¯ β . Previous sections of this paper described how to estimate hazard ratios in the presence of missing failure indicators. It mayalso be of interest to estimate incidence rates for the same event using Poisson regression instead of Cox regression. Forexample, one of the aims of the OPPERA study is to estimate the incidence rate of ﬁrst-onset TMD.In order to estimate incidence rates, we estimate the case probabilities as described previously based on participantswho had a positive screen and were examined. Then we impute case status as described in section 3.3 for those who hada positive screen but were not examined. However, in this case we ﬁt Poisson regression models, rather than Cox models,to the completed data sets. Finally, we calculate the incidence rate based on the estimates of the regression coefﬁcients inthe Poisson model. Speciﬁcally, we use the data from imputation j to ﬁt the model log( E (∆ ij | X i , Z i , V i )) = µ + τ ′ X i + λ ′ Z i + log( V i ) (9)where ∆ ij denotes the j th imputation for observation i , j = 1 , . . . , m . We combine the m imputations using equation (4)and ¯ µ = 1 m m X j =1 ˆ µ j . (10) The estimated incidence rate for an individual with covariates X ∗ and Z ∗ is given by exp(¯ µ + ¯ τ X ∗ + ¯ λZ ∗ ) . The variabilityof ¯ µ , ¯ τ , and ¯ λ may be estimated using (7), and conﬁdence intervals may be computed based on the t distribution using (8),as described previously.

4. Simulations

Data with missing failure indicators were simulated, and several possible methods were compared with respect tobias, coverage, and conﬁdence interval width. Survival times for 1,000 individuals were generated with exponentiallydistributed failure times under a proportional hazards model with covariates as proposed by Bender et al. [

1) where λ ( t ) = 1 is the baseline hazard. Forour simulations, Z i was a single baseline covariate following a normal distribution with mean 2 and unit variance. In Statist. Med. , 00 1–8 Copyright c (cid:13) Prepared using simauth.cls tatisticsin Medicine

N. C. Brownstein et al.other words, conditional of Z i , the failure times T i followed an exponential distribution with hazard exp( β ′ Z i ) where β ∈ {− . , − . , − } . The censoring times C i followed an exponential distribution with mean 5 (corresponding to ahazard of exp( − log(5)) ≈ exp( − . ). This yielded about 35%, 75% and 90% censoring for β = − . , β = − . , and β = − , respectively. We also deﬁned ∆ i = I ( T i ≤ C i ) . If ∆ i = 0 , the implication is that the follow up period endedbefore the participant developed TMD, meaning that the observation was censored at time C i .Covariates are represented by Z i , a risk factor for TMD measured at enrollment, and X i , a measurement collected onthe last screener. For each observation, a normally distributed covariate X i was generated with mean ∆ i and standarddeviation 0.3. In OPPERA, X i represents a question on the screener evaluating some symptom of ﬁrst-onset TMD, suchas the frequency of jaw pain. This was used to generate Q i = I ( X i > . , an indicator of whether participant i screenedpositive on their last screener. Note that X i depends on ∆ i , since participants who developed ﬁrst-onset TMD are morelikely to report symptoms on their screener, and Q i depends on X i , since the screener is positive if enough symptoms arereported. Also, ξ i = I (∆ i is observed ) corresponds to the indicator of whether participant i came in for their clinical examif Q i = 1 . In all simulations, δ i was used as the failure indicator rather than ∆ i , where δ i is deﬁned as δ i =  ∆ i if Q i = 10 if Q i = 0 In other words, we set the failure indicator δ i = 0 if the ﬁnal screener was negative. This decision was made to reﬂect thefact that OPPERA participants who had a negative screen were not examined. Hence it is possible that some participantsdeveloped ﬁrst-onset TMD but were never examined due to their ﬁnal screener being negative. Thus, the simulations(incorrectly) treat these observations as censored.We created missing failure indicators under the following classical missing data mechanisms of Rubin [14]:1. The probability of having a missing failure indicator is independent of the data. This is known as missing completelyat random (MCAR).2. The probability of having a missing failure indicator depends on an observed covariate. This is known as missing atrandom (MAR).3. The probability of having a missing failure indicator depends on the (potentially unobserved) failure indicator. Thisis known as missing not at random (MNAR).Our method assumes that the data are MAR, which includes MCAR as a special case. Our simulations under MAR andMNAR parallel the study protocol in that failure indicators can only be missing for those with positive screeners. In otherwords, observations were potentially missing if and only if Q i = 1 . (Individuals with negative screeners have Q i = 0 andare assumed to be censored. Those with positive screeners have Q i = 1 and may have missing clinical examinations.)Details and results for MCAR and MNAR data are shown in Sections S2.2 and S2.4 in the Supporting Information. We also considered several simulation scenarios where the logistic regression model for predicting the failure indicator wasmisspeciﬁed; see Section S2.3 in the Supporting Information. For MAR data, we set failure indicators to be missing withprobability P ( ξ i = 0 | X i , Z i , V i , Q i = 1) = exp( − . − . Z i + 0 . V i )1 + exp( − . − . Z i + 0 . V i ) (11)This resulted in approximately 50% of failure indicators being set to missing, which is consistent with the rate of missingfailure indicators in the OPPERA study.In each simulated data set, all observations with observed failure indicators who had a positive screen were used to ﬁt alogistic regression model for case status with covariates Z i , X i and V i . That is, using the complete data (i.e. observationswith Q i = 1 and ξ i = 1 ), we ﬁt the logistic regression model for the event probability conditional on Z i , X i , and V i ,namely logit { P (∆ i = 1 | X i , Z i , V i , Q i = 1 , ξ i = 1) } = α ′ X i + γ ′ Z i + ηV i (12)6 (cid:13) Statist. Med. , 00 1–8

Prepared using simauth.cls . C. Brownstein et al.

Statisticsin Medicine

The estimated probabilities ˆ p i = exp(ˆ α ′ X i +ˆ γ ′ Z i +ˆ ηV i )1+exp(ˆ α ′ X i +ˆ γ ′ Z i +ˆ ηV i ) were calculated for individuals with Q i = 1 (where ˆ α , ˆ γ , and ˆ η aredrawn from their posterior distribution).To evaluate the performance of our method, multiple imputation was employed to calculate 10 imputed estimates of β for each simulation as described in Section 3.3. For each observation i with Q i = 1 and ξ i = 0 , we estimated failureindicators ˆ∆ ij independently for each imputation j .A Cox proportional hazards model was ﬁt for each imputed data set, and the imputed estimates of the regressioncoefﬁcient and their variances were recorded. These were aggregated using equations (4) and (7) to create conﬁdenceintervals for the multiple imputation estimates.The performance of our method was compared with that of the method of Cook and Kosorok [1]. To obtain the estimatesof Cook and Kosorok [1], for each simulated data set, we estimated the probabilities ˆ p i that the (potentially unobserved)event for participant i is a true event, as described previously. We then ﬁt a weighted Cox proportional hazards model tothe data set with weights calculated as follows: Each observation with a missing failure indicator was deleted and replacedwith two new observations. Each such pair of observations had the same failure time and covariates, but different failureindicators and weights. The ﬁrst observation had weight ˆ p i and ˆ∆ i = 1 , and the second observation had weight − ˆ p i and ˆ∆ i = 0 . Participants with fully observed data retained a single observation in the data set with unit weight. The estimatedregression coefﬁcient, ˆ β was recorded. The variance of this estimate was estimated by generating 1,000 bootstrap replicates of each simulated data set andreﬁtting the model for each bootstrap replicate. A set of 1,000 subjects was selected at each bootstrap iteration by samplingfrom the data with replacement. For each bootstrap replicate, the estimated probability ˆ p ∗ i that participant i is a true failurewas calculated. These estimated ˆ p ∗ i ’s were used to calculate a bootstrap estimate ˆ β ∗ of β using a weighted Cox model asdescribed in the previous paragraph. The average parameter estimate, ¯ˆ β and percentile conﬁdence intervals ( β . , β . ) were all recorded, where β θ is the θ th quantile among the 1,000 bootstrap replicates.We also compared our method to the ideal situation in which the true values of ∆ i were observed for all observations(note that ∆ i was used instead of δ i in this case), complete case analysis (meaning that we exclude from the data set allobservations with missing failure indicators), and two ad hoc methods in which we treat the missing indicators either allas censored or all as failures. Results under the assumption of MAR are shown in Table 1. We estimated the bias of eachmethod by calculating the mean difference between the estimated Cox regression coefﬁcient and the true coefﬁcient overthe 1000 simulations. We also calculated the mean width of the conﬁdence intervals produced by each method over the1000 simulations. Similarly, we calculated the empirical coverage probability for the conﬁdence intervals produced byeach method by dividing the number of times that the conﬁdence intervals contained the true value of the parameter by1000. We also report the Monte Carlo error for the coverage rate, which is the error in the empirical coverage probabilitydue to conducting only a ﬁnite number of simulations (which would be p α (1 − α ) /n for n simulations). Finally, the rateof missing information and the average running time of each method was computed. All calculations were performed using R versions 3.0.2 running on a single core of a Dell C6100 server with a 2.93GHz Intel processor. The function “mi.binary” in the “mi” R package was used to generate the imputed values of themissing failure indicators. The functions “boot” and “boot.ci” in the “boot” R package were used to calculate the bootstrapestimates of the standard error of the Cook and Kosorok [1] method. The Cox proportional hazard models were ﬁt usingthe “coxph” function in the “survival” R package. The code used to perform the simulations (and analyze the OPPERAdata) is available in the Supporting Information.The empirical coverage probability of the conﬁdence intervals produced by multiple imputation is close to the nominallevel (0.95) in all simulations. Our multiple imputation method and the method of Cook and Kosorok [1] produced approximately unbiased estimates and valid conﬁdence intervals in all the scenarios we considered. The estimatesproduced by the other methods showed a larger amount of bias and did not always achieve the desired coverage level.Our multiple imputation method also yielded the narrowest conﬁdence intervals in each scenario. Although the methodof Cook and Kosorok [

1] produced conﬁdence intervals that were only slightly wider, this indicates that our proposed

Statist. Med. , 00 1–8 Copyright c (cid:13) Prepared using simauth.cls tatisticsin Medicine

N. C. Brownstein et al.

Table 1.

Simulation Results for MAR β ∗ Method Bias SE (Bias) Width SE (Width) Coverage † Running Time (s.)-0.5 Full Data -0.0008 0.0005 0.1666 0.0004 0.962 0.008Complete Case 0.0033 0.0007 0.2152 0.0004 0.955 0.007Treat all as Censored 0.1058 0.0007 0.2127 0.0004 0.514 0.007Treat all as Failures 0.0018 0.0005 0.1699 0.0004 0.964 0.008Cook & Kosorok -0.0009 0.0005 0.1728 0.0004 0.959 22.0Multiple Imputation -0.0003 0.0005 0.1721 0.0004 0.961 0.49-1.5 Full Data 0.0047 0.0011 0.3176 0.0002 0.938 0.008Complete Case -0.0558 0.0015 0.4317 0.0003 0.927 0.007Treat all as Censored 0.1241 0.0014 0.421 0.0003 0.767 0.007Treat all as Failures 0.0716 0.0011 0.3154 0.0002 0.841 0.007Cook & Kosorok 0.0052 0.0011 0.3399 0.0003 0.942 17.50Multiple Imputation 0.0082 0.0011 0.3353 0.0002 0.942 0.40-3 Full Data -0.0294 0.0025 0.7606 0.0009 0.945 0.007Complete Case -0.2044 0.0036 1.0855 0.0017 0.918 0.008Treat all as Censored 0.0988 0.0034 1.0413 0.0015 0.92 0.008

Treat all as Failures 0.5914 0.0025 0.6293 0.0006 0.085 0.008Cook & Kosorok -0.0302 0.0029 0.9078 0.0017 0.94 17.33Multiple Imputation -0.0042 0.0028 0.8556 0.0014 0.947 0.43 ∗ : The rate of missing information is . when β = − . , . when β = − . , and . when β = − . † : The Monte Carlo error is 0.007. method may have slightly greater power to detect true associations, particularly when the absolute value of β is large. Ourproposed method also tended to have lower bias than the method of Cook and Kosorok [1] when the absolute value of β islarge. The running time of our proposed method was also signiﬁcantly less than the running time of the Cook and Kosorok[1] method. Moreover, for most parameter values, the coverage probabilities for the complete case and ad hoc methodswere signiﬁcantly different ( p < . ) from the nominal rate.In addition, we examined the performance of our proposed methods when we changed the logistic regression model for ∆ i . We investigate two additional types of models: one in which the model contained a variable unrelated to case statusand another in which the model does not include one variable related to case status. As in the previous simulations, thefailure times were generated by (1), censoring was exponential with mean 5, failure indicators were set to be missingcompletely at random or missing at random with probability given in equation (11), Z i ∼ N (2 , , X i ∼ N (∆ i , . and Q i = I ( Y i > . for i = 1 , . . . , n . We also generated X i ∼ N (0 , where Z i , X i , X i were mutually independent and X i was independent of ∆ i and Q i .In the previous simulations, we ﬁt the data to (12) with covariates Z i and X i = X i . The additional simulations insteadused the covariates and parameters as follows:1. ˜ X i = { , X i , X i } ˜ X i = 0 .That is, rather than ﬁtting model (12) to the data, we modeled the case probability with logit { P (∆ i = 1 | X i , Z i , V i , Q i = 1) } = ˜ α ′ ˜ X i + γZ i + ηV i . (13)The results, which are shown in Section S2.3 in the Supporting Information, remained similar under both alternativemodels. This indicates that the proposed methods are robust to misspeciﬁcation of the logistic regression model in some8 (cid:13) Statist. Med. , 00 1–8

Prepared using simauth.cls . C. Brownstein et al.

Statisticsin Medicine situations. Most notably, leaving out one covariate that was weakly related to case status did not markedly decrease theperformance of the method.We also performed some simulations where a random subset of the observations with Q i = 0 were set to have missingfailure indicators. The model to predict ∆ i was ﬁtted using only the observations for which Q i = 1 , but the model wasapplied to all observations with missing failure indicators (including observations where Q i = 0 ). The results are shownin Section S2.3 in the Supporting Information. In this case our method (as well as the Cook and Kosorok [1] method)produced reasonable results when the logistic regression model was speciﬁed correctly or when an extra covariate wasincluded in the model. However, both methods performed poorly when an important covariate was missing from thelogistic regression model.Finally, we conducted simulations to evaluate the method’s ability to estimate incidence rates. A similar multipleimputation strategy was applied to Poisson regression. Our method produced estimates much closer to the true incidencerates than the complete case estimate. In fact, the complete case method underestimated incidence rates by as much as afactor of 3. See Section S2.5 in the Supporting Information for details.

5. Analysis of the OPPERA Study

In this section, we apply our method to estimate hazard ratios and incidence rates in the OPPERA study using m = 10 imputations. We applied our method to the OPPERA cohort to adjust for the effect of participants with missing clinical examinations.(Note that examinations for participants evaluated by Examiner covariates, as shown in Bair et al. [2]. Several other possible predictors of being diagnosed with TMD were identiﬁed, butincluding these additional predictors in the model did not improve the predictive accuracy of the model and hence theywere not included. (In general failure to include a relevant predictor variable when performing multiple imputation willproduce greater error than including an irrelevant variable as evidenced by our simulations, so generally it is better to erron the side of including too many predictors rather than too few. However, in this case, our testing indicated that includedadditional variables did not improve the predictive accuracy of the model and in fact might actually decrease the accuracy.Hence, in this case we favored the more parsimonious model.)Thus, we estimated the probability of being diagnosed with TMD based on the count of non-speciﬁc orofacialsymptoms, time since enrollment, and OPPERA study site. This model was used to perform multiple imputation forthose with no clinical examination. These imputed data sets were used to ﬁt a series of Cox proportional hazards modelsto estimate the hazard ratio (and associated conﬁdence interval and p-value) for each predictor using the methods describedin section 3.3. Examples of predictors include perceived stress, history of comorbid chronic pain conditions, and smokingstatus.

Statist. Med. , 00 1–8 Copyright c (cid:13) Prepared using simauth.cls tatisticsin Medicine

N. C. Brownstein et al.In addition, Bair et al. [2] examined univariate relationships between examination attendance and numerous possiblepredictor variables. Differences between examined and non-examined participants were small and most were notstatistically signiﬁcant. However, a few of the differences were statistically signiﬁcant, indicating that the data were notMCAR, since MCAR requires that the probability of a missing observation does not depend on the data.Table 2 shows the results of applying our method to a subset of the putative risk factors of TMD measured in OPPERA.Due to the large number of putative risk factors measured in OPPERA, we only report the results for a selected subset ofthe variables. All continuous variables were normalized to have mean 0 and standard deviation 1 prior to ﬁtting the Coxmodels. (Thus, the hazard ratios for the continuous variables represent the hazard ratios corresponding to a one-standarddeviation increase in the predictor variable.) In Table 2, all the quantitative sensory testing and psychosocial variableswere continuous, while all of the clinical variables were dichotomous (and hence were not normalized). The small numberof missing values in these predictor variables were (singly) imputed using the EM algorithm; see Greenspan et al. [8] orFillingim et al. [7] for details. For a more detailed description of the OPPERA domains, see Section S1 in the SupportingInformation, Maixner et al. [15], and Slade et al. [16].

Table 2.

Results from the OPPERA Study

Treat All MCIs as Censored Multiple Imputation

HR LCL UCL P HR LCL UCL PClinical VariableIn the last month 3.26 1.83 5.84 < < < < < < < < < < < < < The rate of missing information varied slightly for each putative risk factor. The average rate of missing informationwas approximately . . Compared to the unimputed results, which treated missing failure indicators as censoredobservations, imputation slightly reduced the hazard ratios for most of the psychosocial variables that were measuredin OPPERA. For instance, Table (cid:13) Statist. Med. , 00 1–8

Prepared using simauth.cls . C. Brownstein et al.

Statisticsin Medicine

A similar pattern was observed after applying our imputation method to the measures of experimental pain sensitivity.The mechanical pain aftersensation ratings were strongly associated with ﬁrst-onset TMD before imputation, but theywere only weakly associated with ﬁrst-onset TMD after imputation. The pressure pain algometer ratings were also moreweakly associated with TMD after imputation (and two of three ratings in Table 2 were no longer signiﬁcantly associatedwith ﬁrst-onset TMD at the p < . level).Interestingly, the hazard ratios for the presence of one or more palpation tender points at the temporalis and massetermuscles were also attenuated after imputation. These tender points were evaluated as part of the clinical examination usinga different protocol than the quantitative sensory testing algometer pain ratings. However, both pain measures (algometerand palpation) were measured at the same facial locations. While the palpation ratings were more strongly associated withﬁrst-onset TMD than the algometer ratings both before and after imputation, it is interesting that different pain sensitivitymeasures using different protocols at the same anatomical location were both attenuated by imputation.The effects of other clinical variables were also attenuated after imputation. For example, the hazard ratios associatedwith being unable to open one’s mouth wide in the past month and having two or more comorbid pain conditions were bothnoticeably attenuated after imputation. However, other clinical variables were more strongly associated with ﬁrst-onsetTMD after imputation. For example, having a history of respiratory illness was only weakly associated with ﬁrst-onsetTMD before imputation (HR=1.38, p=0.04), but the association was much stronger after imputation (HR=1.43, p=0.004). Also, being a current smoker was not signiﬁcantly associated with ﬁrst-onset TMD before imputation (HR=1.26, p=0.24)but was associated after imputation (HR=1.49, p=0.02).

In Table 3, the incidence rate of ﬁrst-onset TMD was estimated using two different approaches. First, all missingfailure indicators were treated as censored. Second, the multiple imputation method in this paper was used to estimatethe incidence rate. The estimated TMD incidence rate using multiple imputation was 66% greater than the unimputedestimate. The estimated incidence rate increased by 70% for females and 87% for males. Estimated incidence rates forwhites and Hispanics were 118% and 202% higher, respectively, with imputation. Thus, the incidence rate is likely to beunderestimated without imputation.

Table 3.

Estimated TMD Incidence Rates With and Without ImputationNo MI MI Percent ChangeOverall 2.23 3.78 70%Males 1.87 3.49 87%Females 2.46 4.19 70%

White 1.70 3.70 118%Black 4.20 5.70 36%Hispanic 1.17 3.53 202%Other 1.10 1.86 69%

Incidence rates are given in cases per 100 person-years.

6. Discussion

We have developed a computationally efﬁcient method to adjust for missing failure indicators in time-to-event data usinglogistic regression and multiple imputation. Logistic regression is used to estimate the failure probability for participantswith missing failure indicators. The missing values are imputed, and the standard errors are estimated using our multiple

Statist. Med. , 00 1–8 Copyright c (cid:13) Prepared using simauth.cls tatisticsin Medicine

N. C. Brownstein et al.imputation method. This framework is important in studies where failure status may be measured in stages, which maylead to missing failure status indicators. This is a common occurrence in studies of diseases that are difﬁcult or expensiveto diagnose, such as TMD.The present method is similar to the method of Magder and Hughes [17], who use an iterative procedure for parameterestimation based on the EM algorithm. Our assumption of MAR data renders their iterative method unnecessary. Othermethods [18, 19, 20] depend on the MCAR assumption, which does not hold for the OPPERA study. Chen et al.[21] estimate Cox regression parameters using the EM algorithm and establish their consistency under basic regularityconditions, including missing at random (MAR) failure indicators. However, their approach depends on the assumptionsof piecewise constant proportional hazard functions for the censoring time as well as for the failure time.In each simulation scenario, our multiple imputation method produced the narrowest valid conﬁdence intervals and nosigniﬁcant bias. In particular, the method of Cook and Kosorok [1] produced slightly wider conﬁdence intervals in allbut one of the simulations we considered. The differences were small, so the performance of the two methods appear tobe comparable for most practical purposes. However, we believe that our method has several possible advantages overthe method of Cook and Kosorok [1]. First, bootstrapping is much more intensive computationally than our multipleimputation approach. Calculating bootstrap conﬁdence intervals generally requires at least 1000 bootstrap replicates [22],whereas as few as 10 imputed data sets may be sufﬁcient for multiple imputation [23]. Although the difference in the computing time of the two methods is small for a single ﬁtted model, many such models will be required in the course ofthe OPPERA study. OPPERA has already collected data on approximately 3000 genetic markers and has plans to collectdata on approximately a million genetic markers in a genome-wide association study. Thus, at least 3000 (and potentiallyas many as a million) Cox models will need to be ﬁt, and our proposed method may allow for a signiﬁcant decrease incomputing time. Moreover, our method can also be easily implemented in popular statistical software packages (such asSAS) without additional programming.Additionally, our methodology may easily be extended to other models, such as Poisson regression. We conductedsimulations (Table S9 in the Supporting Information) that showed that our proposed method can be used to estimateincidence rates using Poisson regression, which is one of the research aims of the OPPERA study. In particular, estimatesof the failure rates were biased when missing failure indicators were treated as censored or when the complete case methodwas used, but they were unbiased when we employed the methodology in this paper.Our method may yield increased bias and decreased coverage if the logistic regression model for predicting case statusis inaccurate, as observed in the simulations in Section S2.3 in the Supporting Information. However, this would also betrue for competing methods, including the method of Cook and Kosorok [1].Our proposed also requires that the missing data be MAR. Although it is impossible to test this assumption directly, Bairet al. [2] showed that there were no signiﬁcant differences between those who did and not attend their clinical examinationwith respect to a wide range of demographic variables and putative risk factors for TMD. Thus, the MAR assumption is reasonable for OPPERA. Furthermore, the results of the simulations described in Section S2.4 show that our proposedmethod can produce valid results in some situations even if the MAR assumption is violated.Also, our proposed method is only useful for imputing missing event failure indicators among participants who havepositive screeners. If a participant develops ﬁrst-onset TMD but still has a negative screener, such a participant will betreated as censored, and our method is unable to correct for this misclassiﬁcation. The OPPERA screener was designedto have high sensitivity and modest speciﬁcity, so the number of false negative screens is expected to be low. (Indeed,OPPERA performed clinical examinations on a subset of the participants with negative screeners. Although analysis ofthis data is ongoing, preliminary results suggests that the false negative rate is less than 5%.) Thus, we expect that thesmall number of false negative screens will not meaningfully affect the results of our analysis. Also, note that under oursimulation scenarios, we assumed that some failures were not observed due to a negative screener. Since our proposedmethod gave satisfactory results in these simulation scenarios, it appears that failing to observe some events due to negativescreeners should not signiﬁcantly bias the results.12 (cid:13)

Statist. Med. , 00 1–8

Prepared using simauth.cls . C. Brownstein et al.

Statisticsin Medicine

In the OPPERA study, the hazard ratios associated with some variables were noticeably different after imputation.Although other results remained qualitatively unchanged, we note that even small changes in hazard ratios are important.In addition, estimated incidence rates were signiﬁcantly increased after imputation. Since the results of OPPERA maybecome normative in the orofacial pain literature, precise calculation of the incidence rate of TMD and the hazard ratiosassociated with putative risk factors is important. Thus, imputation is recommended.

Acknowledgements

The authors would like to acknowledge and thank the principal investigators of the OPPERA study, namely WilliamMaixner, Luda Diatchenko, Bruce Weir, Richard Ohrbach, Roger Fillingim, Joel Greenspan, and Ronald Dubner. TheOPPERA study was supported by NIH/NIDCR grant U01DE017018. Naomi Brownstein was supported by NIH/NIEHST32ES007018 and NSF Graduate Research Fellowship Program grant 0646083. Jianwen Cai was supported by NIH/NCIgrant P01CA142538 and NIH/NIEHS grant R01ES021900. Eric Bair was supported by NIH/NIDCR grant R03DE023592,NIH/NCATS grant UL1TR001111, and NIH/NIEHS grant P03ES010126.

References

1. Cook TD, Kosorok MR. Analysis of time-to-event data with incomplete event adjudication.

Journal of the American Statistical Association (468):pp. 1140–1152. URL .2. Bair E, Brownstein NC, Ohrbach R, Greenspan JD, Dubner R, Fillingim RB, Maixner W, Smith SB, Diatchenko L, Gonzalez Y, et al. . Study protocol,sample characteristics, and loss to follow-up: The OPPERA prospective cohort study. The Journal of Pain (12):T2–T19.3. Magaret AS. Incorporating validation subsets into discrete proportional hazards models for mismeasured outcomes. Statistics in Medicine (26):5456–5470, doi:10.1002/sim.3365. URL http://dx.doi.org/10.1002/sim.3365 .4. Dodd LE, Korn EL, Freidlin B, Gray R, Bhattacharya S. An audit strategy for progression-free survival. Biometrics (3):1092–1099, doi:10.1111/j.1541-0420.2010.01539.x. URL http://dx.doi.org/10.1111/j.1541-0420.2010.01539.x .5. Dworkin S, LeResche L. Research diagnostic criteria for temporomandibular disorders: review, criteria, examinations and speciﬁcations, critique. Journal of Craniomandibular Disorders (4):301–355.6. Ohrbach R, Fillingim RB, Mulkey F, Gonzalez Y, Gordon S, Gremillion H, Lim PF, Ribeiro-Dasilva M, Greenspan JD, Knott C, et al. .Clinical ﬁndings and pain symptoms as potential risk factors for chronic TMD: Descriptive data and empirically identiﬁed domainsfrom the OPPERA case-control study. The Journal of Pain (11, Supplement):T27 – T45, doi:10.1016/j.jpain.2011.09.001. URL .7. Fillingim RB, Ohrbach R, Greenspan JD, Knott C, Dubner R, Bair E, Baraian C, Slade GD, Maixner W. Potential psychosocial risk factors for chronicTMD: Descriptive data and empirically identiﬁed domains from the OPPERA case-control study. The Journal of Pain (11, Supplement):T46 –T60, doi:10.1016/j.jpain.2011.08.007. URL .8. Greenspan JD, Slade GD, Bair E, Dubner R, Fillingim RB, Ohrbach R, Knott C, Mulkey F, Rothwell R, Maixner W.Pain sensitivity risk factors for chronic TMD: Descriptive data and empirically identiﬁed domains from the OPPERAcase control study. The Journal of Pain (11, Supplement):T61 – T74, doi:10.1016/j.jpain.2011.08.006. URL .9. Maixner W, Greenspan JD, Dubner R, Bair E, Mulkey F, Miller V, Knott C, Slade GD, Ohrbach R, Diatchenko L, et al. . Potential autonomic risk factors for chronic TMD: Descriptive data and empirically identiﬁed domains from theOPPERA case-control study. The Journal of Pain (11, Supplement):T75 – T91, doi:10.1016/j.jpain.2011.09.002. URL .10. Smith SB, Maixner DW, Greenspan JD, Dubner R, Fillingim RB, Ohrbach R, Knott C, Slade GD, Bair E, Gibson DG, et al. . Potential genetic riskfactors for chronic TMD: Genetic associations from the OPPERA case control study. The Journal of Pain (11, Supplement):T92 – T101,doi:10.1016/j.jpain.2011.08.005. URL .11. Slade GD, Sanders A, Bair E, Brownstein NC, Fillingim RB, Maixner W, Greenspan JD, Ohrbach R. Pre-clinical episodes of orofacial pain symptomsand their association with healthcare behaviors in the OPPERA prospective cohort study. Pain

May 2013; :750–760, doi:10.1016/j.pain.2013.01.014.12. Rubin DB. Multiple imputation after 18+ years.

Journal of the American Statistical Association (434):pp. 473–489. URL . Statist. Med. , 00 1–8 Copyright c (cid:13) Prepared using simauth.cls tatisticsin Medicine

N. C. Brownstein et al.

13. Bender R, Augustin T, Blettner M. Generating survival times to simulate cox proportional hazards models.

Statistics in Medicine

February 2005; :1713–1723.14. Rubin DB. Inference and missing data. Biometrika (3):pp. 581–592. URL .15. Maixner W, Diatchenko L, Dubner R, Fillingim RB, Greenspan JD, Knott C, Ohrbach R, Weir B, Slade GD. Orofacial pain prospective evaluation andrisk assessment study - the OPPERA study. The Journal of Pain

November 2011; (11):T4–T11.16. Slade GD, Bair E, By K, Mulkey F, Baraian C, Rothwell R, Reynolds M, Miller V, Gonzalez Y, Gordon S, et al. . Study methods, recruitment,sociodemographic ﬁndings, and demographic representativeness in the OPPERA study. The Journal of Pain

November 2011; (11):T12–T26.17. Magder LS, Hughes JP. Logistic regression when the outcome is measured with uncertainty. American Journal of Epidemiology (2):195–203.URL .18. McKeague IW, Subramanian S. Product-limit estimators and Cox regression with missing censoring information.

Scandinavian Journal of Statistics (4):pp. 589–601. URL .19. Gijbels I, Lin D, Ying Z. Non- and semi- parametric analysis of failure-time data with missing failure indicators. Lecture Notes-Monograph Series

March 2007; (1):203–223.20. Subramanian S. Efﬁcient estimation of regression coefﬁcients and baseline hazard under proportionality of conditionalhazards. Journal of Statistical Planning and Inference (1-2):81 – 94, doi:10.1016/S0378-3758(99)00153-6. URL .21. Chen P, He R, Shen Js, Sun Jg. Regression analysis of right-censored failure time data with missing censoring indicators. Acta Mathematicae ApplicataeSinica (English Series) :415–426. URL http://dx.doi.org/10.1007/s10255-008-8807-1 , 10.1007/s10255-008-8807-1.22. Efron B, Tibshirani RJ. An introduction to the bootstrap . Chapman and Hall/CRC: Boca Raton, FL, 1993.23. Little RJA, Rubin DB.

Statistical Analysis with Missing Data . Wiley Series in Probability and Mathematical Statistics, 2002. (cid:13) Statist. Med. , 00 1–8

Prepared using simauth.cls . C. Brownstein et al.

Statisticsin Medicine

Web-based Supplementary Materials for “Parameter Estimationin Cox Models with Missing Failure Indicators” by Naomi CBrownstein, Jianwen Cai, Gary Slade, and Eric Bair

S1. Description of the OPPERA Study

The primary objective of the OPPERA study is to identify possible risk factors for developing ﬁrst-onset TMD. SeeMaixner et al. [15], Slade et al. [16], and Bair et al. [2] for a more detailed description of the study. The risk factorsconsidered in OPPERA are classiﬁed into the following domains: sociodemographic, clinical, psychosocial, autonomic,quantitative sensory testing (QST), and genetics. The remainder of this section describes these OPPERA domains in moredetail.First, sociodemographic information was recorded for each OPPERA participant. This includes age, gender, race, andOPPERA study site, as well as educational attainment, income, and marital status. For example, TMD is more commonin females than males and in non-Hispanic whites than in other races. Details are provided in Slade et al. [16].

Clinical risk factors refer to variables that “typically are considered in clinical settings when evaluating patients” [6].These clinical variables may be evaluated via physical examinations or questionnaires. Examples include headaches,back aches, pain in other regions of the body, jaw mobility, jaw noises, and orofacial trauma. OPPERA participants alsoself-reported their health history, including the presence of comorbid pain conditions such as irritable bowel syndrome,ﬁbromyalgia, and dysmenorrhea.Psychosocial factors have also been shown to be associated with TMD [7]. Speciﬁc qualities related to psychosocialfunctioning were evaluated in OPPERA, including general psychological function, affective distress, psychological stress,somatic awareness, and coping/catastrophizing. Affective distress measures include state and trait anxiety and mood.Psychological stress includes perceived stress and measures of post-traumatic stress disorder. Somatic awareness assessessensitivity to physical sensations. Finally, coping/catastrophizing assesses individuals’ ability to handle pain.The association between TMD and the function of the autonomic nervous system was also evaluated. Key measuresof autonomic function include blood pressure, heart rate, and heart rate variability, which were measured during theOPPERA baseline medical examination. In previous studies, TMD was associated with higher heart rates and lower heartrate variability, which are symptoms of dysregulation of the autonomic nervous system. See Maixner et al. [9] for a moredetailed description of the autonomic data collected in OPPERA.The QST variables collected in OPPERA measure sensitivity to experimental pain. Several measures of experimentalpain sensitivity were collected, including pressure pain thresholds measured by algometers, mechanical (pinprick) painsensitivity, and thermal pain sensitivity. See Greenspan et al. [8] for a more detailed description of these QST variables.

Finally, the association between TMD and selected genetic markers was evaluated. A total of 3295 single nucleotidepolymorphisms (SNP’s) were selected from genes that are believed to be associated with pain. See Smith et al. [10] formore detail on how the SNP’s were chosen and their association with TMD.

S2. Results of Additional Simulations

S2.1. Overview of Additional Simulations

In this appendix, we provide the results of additional simulations. We investigate the performance of the method undera variety of missing data mechanisms. We also consider scenarios where the logistic regression model for estimating theprobability of being a case is misspeciﬁed.

Statist. Med. , 00 1–8 Copyright c (cid:13) Prepared using simauth.cls tatisticsin Medicine

N. C. Brownstein et al.

Table S1.

Simulation Results for MCAR β ∗ Method Bias SE (Bias) Width SE (Width) Coverage † Running Time (s.)-0.5 Full Data 0.0003 0.0005 0.1667 0.0001 0.95 0.008Complete Case -0.0518 0.0007 0.2203 0.0001 0.854 0.007Treat all as Censored 0.0005 0.0007 0.2205 0.0001 0.952 0.008Treat all as Failures 0.0056 0.0006 0.1699 0.0001 0.953 0.007Cook & Kosorok 0 0.0006 0.1746 0.0001 0.955 22.36Multiple Imputation 0.0002 0.0006 0.173 0.0001 0.958 0.51-1.5 Full Data -0.0018 0.0011 0.3184 0.0002 0.942 0.008Complete Case -0.1303 0.0014 0.4257 0.0003 0.798 0.007Treat all as Censored -0.0063 0.0014 0.4218 0.0003 0.954 0.007Treat all as Failures 0.0817 0.0011 0.3149 0.0002 0.808 0.007Cook & Kosorok -0.0021 0.0011 0.3426 0.0004 0.943 17.64Multiple Imputation 0.001 0.0011 0.3395 0.0002 0.944 0.39-3 Full Data -0.0152 0.0025 0.7561 0.0008 0.953 0.007Complete Case -0.2332 0.0035 1.0332 0.0015 0.894 0.007Treat all as Censored -0.0191 0.0033 1.0066 0.0014 0.959 0.008

Treat all as Failures 0.6654 0.0024 0.6166 0.0006 0.047 0.007Cook & Kosorok -0.0186 0.0028 0.8938 0.0016 0.939 16.38Multiple Imputation 0.0073 0.0027 0.8493 0.0014 0.958 0.4 ∗ : The rate of missing information is . when β = − . , . when β = − . , and . when β = − . † : The Monte Carlo error is 0.007. Recall that we created missing failure indicators under the following classical missing data mechanisms of Rubin(1976):1. The probability of having a missing failure indicator is independent of the data. This is known as missing completelyat random (MCAR).2. The probability of having a missing failure indicator depends on an observed covariate. This is known as missing atrandom (MAR).3. The probability of having a missing failure indicator depends on the failure indicator itself. This is known as missingnot at random (MNAR).

S2.2. Additional Simulations Under MCAR

In order to more closely parallel the OPPERA study, we simulated data where we randomly set 40% of the failureindicators to be missing for those with Q i = 1 . (Note that our simulations assume that failure indicators can only bemissing when Q i = 1 . Without this assumption the data would not be MCAR in this scenario, since Q i depends on X i ,which is observed.) This setup assumes that the probability that a participant has a non-missing failure indicator dependsonly on whether or not their screener was positive. The logistic regression model in this case included the covariates Z i and X i as before, as well as the time of the screener. Results are shown in Table S1. All methods had a negligibleamount of bias in these scenarios except for the complete case method and the method that treated all missing indicatorsas failures. In these simulations, the complete case method also displayed extreme bias and poor coverage. This indicatesthat a complete case analysis would not be appropriate for a study such as OPPERA. (cid:13) Statist. Med. , 00 1–8

Prepared using simauth.cls . C. Brownstein et al.

Statisticsin Medicine

Table S2.

Results for an Extra Covariate Included in the Logistic Regression Model β ∗ Method Bias SE (Bias) Width SE (Width) Coverage † Running Time (s.)-0.5 Full Data 0.0016 0.0006 0.1667 0.0001 0.942 0.008Complete Case 0.0037 0.0007 0.2156 0.0001 0.938 0.008Treat all as Censored 0.1054 0.0007 0.213 0.0001 0.509 0.008Treat all as Failures 0.004 0.0006 0.1701 0.0001 0.952 0.008Cook & Kosorok 0.0012 0.0006 0.174 0.0001 0.955 22.91Multiple Imputation 0.0018 0.0006 0.1724 0.0001 0.955 0.52-1.5 Full Data -0.0058 0.001 0.3185 0.0002 0.955 0.008Complete Case -0.0693 0.0015 0.4332 0.0004 0.909 0.007Treat all as Censored 0.1139 0.0014 0.4222 0.0003 0.799 0.007Treat all as Failures 0.0622 0.001 0.3161 0.0002 0.877 0.007Cook & Kosorok -0.0056 0.0011 0.3408 0.0003 0.946 18.09Multiple Imputation -0.0024 0.0011 0.3368 0.0002 0.953 0.4-3 Full Data -0.0242 0.0025 0.7606 0.0008 0.962 0.007Complete Case -0.1941 0.0037 1.0845 0.0017 0.924 0.007Treat all as Censored 0.1045 0.0036 1.0405 0.0015 0.906 0.007

Treat all as Failures 0.5995 0.0025 0.6291 0.0006 0.089 0.007Cook & Kosorok -0.0172 0.0028 0.909 0.0017 0.953 16.93Multiple Imputation 0.0111 0.0027 0.8578 0.0014 0.958 0.42 ∗ : The rate of missing information is . when β = − . , . when β = − . , and . when β = − . † : The Monte Carlo error is 0.007. S2.3. Alternative Logistic Regression Models

We considered several scenarios where the logistic regression model for the probability of being a case is misspeciﬁed.Recall that we originally modeled the probability of being a case as P (∆ i = 1 | X i , Z i , V i ) = exp( α ′ X i + γ ′ Z i + ηV i )1 + exp( α ′ X i + γ ′ Z i + ηV i ) (S.1)The original logistic model had the covariates Z i , X i , and V i where Z i ∼ N (2 , and X i ∼ N (∆ i , . are mutuallyindependent for j = 1 , , and i = 1 , . . . , n .Two alternative models were examined:1. The ﬁrst alternative model was of the form (S.1) but used the covariates ˜ X i = { X i , X i } and V i where X i ∼ N (0 , . This scenario was to used to evaluate the robustness of the method when an extraneous covariate is includedin the model. (Note that X i is independent of Z i , X i , ∆ i , and Q i .)2. The second alternative model was generated according to (S.1) but was ﬁt with the covariates Z i and V i . In thecontext of OPPERA, this represents the scenario in which we failed to include a covariate that is associated withﬁrst-onset TMD.Tables S2 and S3 indicate that our method produces valid results even if a noisy variable is added to the model or if animportant variable is not included in the model.Next, we consider the scenario where failure indicators may be missing even if a participant had a negative screener(i.e. Q i = 0 ). For each such simulation, we randomly selected 40% of the observations to have missing failure indicatorswhen Q i = 0 . (The mechanism for missing failure indicators when Q i = 1 is the same as described previously.) In theﬁrst such simulation, the logistic regression model was correctly speciﬁed when Q i = 1 . (However, it will be applied to allobservations with missing failure indicators, including those for which Q i = 0 . Since the true value of the failure indicator Statist. Med. , 00 1–8 Copyright c (cid:13) Prepared using simauth.cls tatisticsin Medicine

N. C. Brownstein et al.

Table S3.

Results when a Relevant Covariate is Omitted from the Logistic Regression Model β ∗ Method Bias SE (Bias) Width SE (Width) Coverage † Running Time (s.)-0.5 Full Data -0.0007 0.0006 0.1666 0.0001 0.939 0.007Complete Case 0.0046 0.0007 0.2153 0.0001 0.948 0.007Treat all as Censored 0.1062 0.0007 0.2128 0.0001 0.497 0.007Treat all as Failures 0.0021 0.0006 0.17 0.0001 0.942 0.007Cook & Kosorok -0.0006 0.0006 0.1738 0.0001 0.945 17.84Multiple Imputation -0.0003 0.0006 0.1726 0.0001 0.947 0.36-1.5 Full Data 0.0012 0.0011 0.3174 0.0002 0.951 0.007Complete Case -0.0581 0.0015 0.4308 0.0004 0.908 0.006Treat all as Censored 0.1237 0.0014 0.4198 0.0003 0.761 0.007Treat all as Failures 0.0697 0.001 0.315 0.0002 0.844 0.007Cook & Kosorok 0.0029 0.0011 0.3449 0.0004 0.942 14.54Multiple Imputation 0.0055 0.0011 0.3411 0.0003 0.939 0.3-3 Full Data -0.0146 0.0025 0.7562 0.0008 0.952 0.007Complete Case -0.1925 0.0038 1.0793 0.0017 0.911 0.007Treat all as Censored 0.1076 0.0035 1.0352 0.0015 0.899 0.007

Treat all as Failures 0.6021 0.0025 0.6261 0.0006 0.097 0.007Cook & Kosorok -0.0181 0.0029 0.9025 0.0017 0.935 14.11Multiple Imputation 0.0074 0.0029 0.8761 0.0015 0.936 0.32 ∗ : The rate of missing information is . when β = − . , . when β = − . , and . when β = − . † : The Monte Carlo error is 0.007. is always 0 when Q i = 0 , the model will be biased for these observations.) In the two remaining simulation scenarios, themodel will be misspeciﬁed even when Q i = 1 by either adding an extra covariate or leaving out a signiﬁcant covariate aswe did in the earlier simulations.The results of these three additional simulations are shown in Tables S4, S5, and S6. The model performs well in twoof the three scenarios, indicating that our methodology is robust against misspeciﬁcation of the logistic regression model.However, when an important covariate is not included in the model, the estimates are badly biased. Empirical coverageranged from 0% to 50%, signiﬁcantly below the nominal rate. This indicates that our method can give incorrect resultsif the predictive accuracy of the logistic regression model is poor. Note that the method of Cook and Kosorok [1] alsoperforms poorly in this scenario. If one cannot accurately estimate which failure indicators are missing, it is unlikely thatany method can produce valid conﬁdence intervals for the Cox regression coefﬁcients. S2.4. Simulations Under MNAR

We examined two possible scenarios where the data is MNAR:1. In the ﬁrst, we set 30% of the censored observations and 50% of the failures to have missing indicators.2. In the second, we set 20% of the censored observations and 60% of the failures to have missing indicators.The results of these simulations are shown in Tables 1 and 2. Bias increased for all methods under both MNAR scenarios.In particular, the complete case method consistently displayed a high amount of bias and did not achieve the desiredcoverage rate. For our imputation method and the method of Cook and Kosorok [

1] may not be valid. On the otherhand, even when the data was not MAR, our method provided an improvement in terms of bias and coverage compared to4 (cid:13)

Statist. Med. , 00 1–8

Prepared using simauth.cls . C. Brownstein et al.

Statisticsin Medicine

Table S4.

Results when the Logistic Regression Model is Applied to Observations with Q i = 0 β ∗ Method Bias SE (Bias) Width SE (Width) Coverage † Running Time (s.)-0.5 Full Data -0.0029 0.0006 0.1669 0.0001 0.951 0.007Complete Case 0.0422 0.0007 0.2164 0.0001 0.864 0.006Treat all as Censored 0.1042 0.0007 0.2132 0.0001 0.516 0.007Treat all as Failure 0.0883 0.0005 0.1526 0.0001 0.372 0.007Cook & Kosorok -0.002 0.0006 0.1781 0.0002 0.945 21.41Multiple Imputation 0.0028 0.0006 0.1779 0.0001 0.951 0.48-1.5 Full Data -0.0031 0.001 0.3187 0.0002 0.956 0.007Complete Case 0.053 0.0015 0.4306 0.0004 0.907 0.006Treat all as Censored 0.12 0.0014 0.4224 0.0003 0.779 0.007Treat all as Failures 0.8199 0.0007 0.204 0.0001 0 0.007Cook & Kosorok -0.0026 0.0011 0.3483 0.0004 0.948 17.785Multiple Imputation 0.0128 0.0011 0.3557 0.0006 0.961 0.38-3 Full Data -0.0244 0.0026 0.7602 0.0009 0.947 0.008Complete Case 0.0024 0.0035 1.0686 0.0017 0.954 0.006Treat all as Censored 0.1043 0.0035 1.0379 0.0015 0.914 0.008

Treat all as Failures 2.5176 0.0008 0.2218 0.0001 0 0.007Cook & Kosorok -0.0104 0.0029 0.9885 0.0028 0.96 18.69Multiple Imputation 0.0909 0.0029 1.0932 0.0053 0.969 0.424 ∗ : The rate of missing information is . when β = − . , . when β = − . , and . when β = − . † : The Monte Carlo error is 0.007. Table S5.

Results when an Extra Covariate is Included in the Logistic Regression Model and the Model is Applied toObservations with Q i = 0 β ∗ Method Bias SE (Bias) Width SE (Width) Coverage † Running Time (s.)-0.5 Full Data -0.0011 0.0005 0.1669 0.0001 0.956 0.008Complete Case 0.0442 0.0007 0.2167 0.0001 0.876 0.006Treat all as Censored 0.1061 0.0007 0.2133 0.0001 0.506 0.007Treat all as Failures 0.0913 0.0005 0.1525 0.0001 0.342 0.008Cook & Kosorok 0.0011 0.0006 0.1783 0.0002 0.959 22.05Multiple Imputation 0.0064 0.0006 0.1791 0.0002 0.956 0.5-1.5 Full Data -0.003 0.001 0.3185 0.0002 0.949 0.007Complete Case 0.0525 0.0014 0.4288 0.0004 0.911 0.006Treat all as Censored 0.1171 0.0014 0.4214 0.0003 0.814 0.007

Treat all as Failures 0.8186 0.0007 0.2041 0.0001 0 0.008Cook & Kosorok -0.0022 0.0011 0.3495 0.0004 0.947 18.37Multiple Imputation 0.0149 0.0011 0.3588 0.0008 0.953 0.397-3 Full Data -0.0098 0.0025 0.7555 0.0008 0.943 0.008Complete Case 0.0042 0.0036 1.0687 0.0017 0.948 0.007Treat all as Censored 0.116 0.0033 1.0332 0.0014 0.918 0.008Treat all as Failures 2.5152 0.0008 0.2218 0.0001 0 0.008Cook & Kosorok -0.0013 0.0028 0.9876 0.0028 0.964 20.22Multiple Imputation 0.1024 0.0028 1.0969 0.0055 0.956 0.46 ∗ : The rate of missing information is . when β = − . , . when β = − . , and . when β = − . † : The Monte Carlo error is 0.007. Statist. Med. , 00 1–8 Copyright c (cid:13) Prepared using simauth.cls tatisticsin Medicine

N. C. Brownstein et al.

Table S6.

Results when a Relevant Covariate is not Included in the Logistic Regression Model and the Model is Appliedto Observations where Q i = 0 β ∗ Method Bias SE (Bias) Width SE (Width) Coverage † Running Time (s.)-0.5 Full Data -0.0027 0.0005 0.1667 0.0001 0.963 0.008Complete Case 0.0418 0.0007 0.2159 0.0001 0.878 0.006Treat all as Censored 0.1018 0.0007 0.2128 0.0001 0.533 0.008Treat all as Failures 0.0887 0.0005 0.1523 0.0001 0.375 0.007Cook & Kosorok 0.0814 0.0005 0.157 0.0001 0.5 19.66Multiple Imputation 0.0816 0.0005 0.1573 0.0001 0.471 0.39-1.5 Full Data -0.0062 0.001 0.3185 0.0002 0.947 0.007Complete Case 0.0517 0.0014 0.4294 0.0004 0.903 0.006Treat all as Censored 0.1154 0.0014 0.4218 0.0003 0.806 0.007Treat all as Failures 0.8183 0.0007 0.2044 0.0001 0 0.007Cook & Kosorok 0.5006 0.0012 0.3808 0.0006 0.004 16.76Multiple Imputation 0.5151 0.0012 0.3916 0.001 0.005 0.32-3 Full Data -0.0126 0.0025 0.756 0.0008 0.952 0.008Complete Case 0.0057 0.0036 1.0711 0.0016 0.947 0.007

Treat all as Censored 0.1073 0.0035 1.0376 0.0014 0.902 0.008Treat all as Failures 2.5132 0.0009 0.2218 0.0001 0 0.007Cook & Kosorok 0.8862 0.0036 1.0736 0.0028 0.2 17.01Multiple Imputation 0.9962 0.0036 1.1777 0.0043 0.117 0.353 ∗ : The rate of missing information is . when β = − . , . when β = − . , and . when β = − . † : The Monte Carlo error is 0.007. the complete case method and the method that treats all missing subjects as failures. Moreover, the coverage probabilitywas slightly greater for our method than for the method of Cook and Kosorok [1]. S2.5. Simulations for Poisson Regression

We performed simulations to evaluate the performance of our method when the desired time-to-event analysis is a Poissonregression model rather than a Cox model. Poisson models are commonly used to estimate incidence rates, which is anobjective of the OPPERA study.The simulations were identical to those described in Section 4 except that the imputed data was used to ﬁt Poissonregression models rather than Cox proportional hazards models. That is, we ﬁt the data from imputations j = 1 , . . . , m to the model log( µ i ) = α + βZ i + log( V i ) . (S.2)where µ i is the expected number of cases and the offset, log( V i ) , is the logarithm of the survival time. We measured thebias, deﬁned as ˆ β minus the true value, for β ∈ {− . , − . , − } . The Cook and Kosorok [1] method does not immediately generalize to Poisson regression. Consequently, we onlycompared our method to the unachievable ideal of no missing data, the complete case method, and the two ad hoc methods.The use of Poisson regression allows us to estimate incidence rates. For each simulation, we estimated the incidencerate based on the coefﬁcients of the Poisson regression model in (S.2). Speciﬁcally, estimated incidence rates for ﬁxedvalues of Z i are given by exp( α + βZ i ) (S.3) The bias, conﬁdence interval width, and coverage probability of each method are shown in Table

S9. We also present theestimated incidence rates for each quartile of the random variable Z i (i.e. the quartiles of the N (2 , distribution) along6 (cid:13) Statist. Med. , 00 1–8

Prepared using simauth.cls . C. Brownstein et al.

Statisticsin Medicine

Table S7.

Simulation Results for MNAR, scenario 1 β ∗ Method Bias SE (Bias) Width SE (Width) Coverage † Running Time (s.)-0.5 Full Data 0.0022 0.0006 0.1668 0.0001 0.932 0.007Complete Case -0.0695 0.0008 0.2419 0.0001 0.787 0.006Treat all as Censored 0.0005 0.0008 0.2422 0.0001 0.943 0.007Treat all as Failures 0.0063 0.0006 0.1702 0.0001 0.929 0.007Cook & Kosorok -0.0014 0.0006 0.1768 0.0001 0.937 19.9Multiple Imputation -0.0011 0.0006 0.175 0.0001 0.936 0.46-1.5 Full Data -0.0019 0.001 0.3182 0.0002 0.965 0.008Complete Case -0.1799 0.0016 0.4704 0.0004 0.699 0.007Treat all as Censored -0.0042 0.0016 0.4624 0.0004 0.95 0.007Treat all as Failures 0.0625 0.001 0.3173 0.0002 0.866 0.007Cook & Kosorok -0.0197 0.0011 0.3505 0.0004 0.934 16.69Multiple Imputation -0.0173 0.0011 0.3447 0.0003 0.959 0.38-3 Full Data -0.0208 0.0025 0.7571 0.0008 0.946 0.007Complete Case -0.3316 0.004 1.1528 0.002 0.846 0.007Treat all as Censored -0.0423 0.0038 1.1107 0.0017 0.952 0.007

Treat all as Failures 0.5224 0.0026 0.648 0.0007 0.18 0.008Cook & Kosorok -0.0677 0.0029 0.9291 0.0019 0.922 16.32Multiple Imputation -0.0507 0.0028 0.8641 0.0014 0.959 0.4 ∗ : The rate of missing information is . when β = − . , . when β = − . , and . when β = − . † : The Monte Carlo error is 0.007. Table S8.

Simulation Results for MNAR, scenario 2 β ∗ Method Bias SE (Bias) Width SE (Width) Coverage † Running Time (s.)-0.5 Full Data -0.0022 0.0005 0.167 0.0001 0.952 0.008Complete Case -0.0976 0.0009 0.2704 0.0002 0.713 0.007Treat all as Censored -0.0017 0.0009 0.2706 0.0001 0.936 0.007Treat all as Failures 0.0009 0.0006 0.1706 0.0001 0.95 0.008Cook & Kosorok -0.0091 0.0006 0.1796 0.0001 0.944 20.85Multiple Imputation -0.0087 0.0006 0.178 0.0001 0.948 0.44-1.5 Full Data -0.0007 0.0011 0.3183 0.0002 0.95 0.007Complete Case -0.2321 0.0018 0.5311 0.0005 0.586 0.007Treat all as Censored -0.0019 0.0017 0.5175 0.0004 0.952 0.007Treat all as Failures 0.0444 0.0011 0.3202 0.0002 0.9 0.007

Cook & Kosorok -0.0397 0.0012 0.3575 0.0004 0.9 16.03Multiple Imputation -0.0382 0.0012 0.3537 0.0003 0.932 0.36-3 Full Data -0.0112 0.0025 0.7565 0.0009 0.948 0.007Complete Case -0.409 0.0043 1.3088 0.0023 0.827 0.007Treat all as Censored -0.0255 0.0041 1.2435 0.002 0.954 0.007Treat all as Failures 0.3801 0.0026 0.682 0.0007 0.425 0.007Cook & Kosorok -0.1038 0.003 0.9663 0.0021 0.919 15.45Multiple Imputation -0.0891 0.0028 0.8839 0.0015 0.959 0.38 ∗ : The rate of missing information is . when β = − . , . when β = − . , and . when β = − . † : The Monte Carlo error is 0.007. with the true (theoretical) rates for each quartile. See Table S10. Statist. Med. , 00 1–8 Copyright c (cid:13) Prepared using simauth.cls tatisticsin Medicine

N. C. Brownstein et al.

Table S9.

Simulation Results for Poisson Models, MAR β ∗ Method Bias SE (Bias) Width SE (Width) Coverage † -0.5 Full Data -0.0036 0.0005 0.1596 0.0001 0.956Complete Case -0.0101 0.0007 0.2067 0.0001 0.948Treat all as Censored 0.0813 0.0007 0.2049 0.0001 0.652Treat all as Failures -0.0002 0.0006 0.1628 0.0001 0.957Multiple Imputation -0.0036 0.0006 0.1652 0.0001 0.961-1.5 Full Data -0.0005 0.0010 0.2864 0.0002 0.945Complete Case -0.1008 0.0015 0.3956 0.0004 0.829Treat all as Censored 0.0704 0.0014 0.3883 0.0003 0.898Treat all as Failures 0.0717 0.0010 0.2857 0.0002 0.820Multiple Imputation 0.0027 0.0011 0.3059 0.0003 0.951-3 Full Data -0.0184 0.0021 0.5994 0.0007 0.958Complete Case -0.2733 0.0032 0.8710 0.0016 0.792Treat all as Censored 0.0206 0.0030 0.8485 0.0012 0.954Treat all as Failures 0.5232 0.0025 0.5544 0.0006 0.085Multiple Imputation 0.0126 0.0023 0.7012 0.0014 0.964 ∗ : The rate of missing information is . when β = − . , . when β = − . , and . when β = − . † : The Monte Carlo error is 0.007. Our method had close to 95% coverage probability when Poisson regression was used. None of the other methodshad proper coverage for all of the simulations. Multiple imputation yielded the least bias of all the methods besides theunachievable ideal of observing all data. It also produced more narrow conﬁdence intervals than the complete case methodand the method that treats all missing failure indicators as censored.The bias evident in parameter estimation was compounded for incidence rates. The complete case method and the twoad hoc methods consistently underestimated incidence. In fact, the complete case method underestimated incidence byabout 30-200%. By contrast, our method differed from the unachievable ideal by only about 4-6%. (cid:13) Statist. Med. , 00 1–8

Prepared using simauth.cls . C. Brownstein et al.

Statisticsin Medicine

Table S10.

Simulation Results for Incidence Rates β Method Q1 SE Q2 SE Q3 SE-0.5 True Rate 0.5154 0.3679 0.2626Full Data 0.5157 0.0003 0.3671 0.0002 0.2615 0.0002