[PDF] Handling Missingness Value on Jointly Measured Time-Course and Time-to-event Data

Abstract

Joint modeling technique is a recent advancement in effectively analyzing the longitudinal history of patients with the occurrence of an event of interest attached to it. This procedure is successfully implemented in biomarker studies to examine parents with the occurrence of tumor. One of the typical problem that influences the necessary inference is the presence of missing values in the longitudinal responses as well as in covariates. The occurrence of missingness is very common due to the dropout of patients from the study. This article presents an effective and detailed way to handle the missing values in the covariates and response variable. This study discusses the effect of different multiple imputation techniques on the inferences of joint modeling implemented on imputed datasets. A simulation study is carried out to replicate the complex data structures and conveniently perform our analysis to show its efficacy in terms of parameter estimation. This analysis is further illustrated with the longitudinal and survival outcomes of biomarkers' study by assessing proper codes in R programming language.

Full PDF

HHandling Missingness Value on Jointly Measured Time-Course andTime-to-event Data

Gajendra K. Vishwakarma a , Atanu Bhattacharjee b , Souvik Banerjee a a Department of Mathematics & Computing,Indian Institute of Technology Dhanbad, Dhanbad-826004, India,E-mail: vishwagk@rediﬀmail.com, [email protected] b Section of Biostatistics, Centre for Cancer Epidemiology,Tata Memorial Center, Navi Mumbai, India b Homi Bhabha National Institute, Mumbai, IndiaE-mail : [email protected]

Abstract

Joint modeling technique is a recent advancement in eﬀectively analyzing the longitudinal historyof patients with the occurrence of an event of interest attached to it. This procedure is successfullyimplemented in biomarker studies to examine parents with the occurrence of tumor. One of thetypical problem that inﬂuences the necessary inference is the presence of missing values in thelongitudinal responses as wel l as in covariates. The occurrence of missingness is very commondue to the dropout of patients from the study. This article presents an eﬀective and detailed way tohandle the missing values in the covariates and response variable. This study discusses the eﬀect ofdiﬀerent multiple imputation techniques on the inferences of joint modeling implemented on imputeddatasets. A simulation study is carried out to replicate the complex data structures and convenientlyperform our analysis to show its eﬃcacy in terms of parameter estimation. This analysis is furtherillustrated with the longitudinal and survival outcomes of biomarkers’ study by assessing propercodes in R programming language.

Keywords:

Joint Modeling, Longitudinal response, Survival outcome, Missing Data, MultipleImputation

Mathematics Subject Classiﬁcation: 62P10

Longitudinal measurements are receiving much attention in clinical research where patients are followedup to an event of interest. In these studies, the progression of disease can be closely monitored for a periodof time and outcomes can be predicted more eﬀcetively by assessing temporal changes in responses. Theanalysis of follow-up studies with time-to-event observations is called Joint modeling of longitudinal andsurvival data (Tsiatis [1995], Faucett & Thomas [1996])This article is motivated by the gene expression data from Gene Expression Omnibus (GEO)database where longitudinal responses were collected for a number of liquid markers. The patientswere given diﬀerent treatments and the responses were collected at diﬀerent time-points. It is examinedwhether any progression of tumor is developed for the patients during or after the treatment period.The problem with the dataset is there are several missing values in the longitudinal responses as wellas covariates which aﬀect the required inferences. The covariate missingness is an important aspect inthis study as it diﬀerentiate the simple response missingness to a more complex missingness structure. Corresponding Author a r X i v : . [ s t a t . M E ] J a n o, it is necessary to develop a proper statistical methodology to impute those missing values to gatheradditional information from the longitudinal and survival patterns.Gene expression data, collected from micro-array experiments, are useful to measure temporalchanges in gene expression. These data play a crucial role in understanding the complex mechanismof genes and its eﬀect on the disease progression. The tumor-speciﬁc gene expression proﬁling has animportant contribution to biomarker detection (Bhattacharjee et al. [2018]). The potentiality of early de-tection of disease or providing valuable information on disease progression presents the protein molecularbiomarkers as a popular choice. Circulating biomarkers, which are obtained in biological liquid samples,have several advantages over tissue-based proﬁling due to their non-invasive nature and eﬃciency in lon-gitudinal tracking of patients. However, the cost of collection and measurement of biomarkers is risingand hence the loss of signiﬁcant amount of longitudinal responses is unaﬀordable and it is essential toretrieve the information lost based on the available data.During the study related to such a vulnerable disease, it often occurs that patients leave the studywithout any prior information (Hui et al. [2013], Bell et al. [2013]). As a result, this type of datasetcontains missing observations due to drop-out of patients. Missingness can be divided into three diﬀerentforms based on dependency pattern on the dataset, namely, Missing Completely at Random (MCAR),Missing at Random (MAR) and Missing not at Random (MNAR). In Bayesian and Likelihood methodol-ogy, MCAR and MAR mechanism are well addressed by introducing ignorable missingness and analyzingthe observed data only (Robins et al. [1995], Hogan et al. [2004], Tsiatis [2007], Bhattacharjee & Vish-wakarma [2019]). The MNAR mechanism comes under the non-ignorable category which requires morespeciﬁcations of the joint distribution of the data and missing structure (Ibrahim & Molenberghs [2009]).There are methods that simply ignore the covariates for which some observations are missing andanalyze the remaining data as complete case analysis (Little & Rubin [2014]). But in this process,the sample size is reduced and eﬃciency is lost. Also when the dataset is not large enough but theproportion of missing values is high, the estimates become biased (Sun et al. [2006], Rubin [1976]). Someattempts have been made to address issues on covariates in joint modeling. Lu analyzed a joint modelingstrategy for skew longitudinal survival data with missingness and mismeasured covariates using a logitmodel (Lu [2017]). A pseudo-likelihood approach was also implemented for longitudinal binary data withnon-ignorable missing response and covariates (Parzen et al. [2006]). Here a separate model has beenconsidered for the missingness in covariates. Also, eﬀort has been made in computing joint model withcovariate subject to a limit of detection Sattar & Sinha [2017]. Work has been performed for missingand left-censored time-varying covariates in joint modeling using some modiﬁed prior distributionalassumption (Chen et al. [2014]). However, there is a clear scarcity of proper modeling technique andimputation strategy for missing covariates in joint modeling context and it needs a detailed analytic andcomputational eﬀort to show a pathway for the solution of this problem.In joint models, a classical linear mixed eﬀect model is popularly used to specify longitudinal mea-surements and Cox proportional hazard model for the survival part. The mixed eﬀect model for longitu-dinal measurements is associated inside the survival model with an association parameter, (for example,see (Wulfson & Tsiatis [1997], Henderson [2000], Rizopoulos [2012])). These characteristics enable jointmodeling to capture the inﬂuence of covariates trajectory on the survival duration of patients. Bayesianand maximum likelihood techniques are eﬃciently utilized for parameter estimation (Armero et al. [2016],Song et al. [2016]). The joint modeling work was stimulated in problems depending on AIDS (Tsiatis[1995]) and currently it is becoming increasingly popular in other areas also (Ibrahim et al. [2010]).There are several R software packages which serve as good tools to ﬁt joint models for a continuouslongitudinal data and a time-to-event process under maximum likelihood method (Rizopoulos [2010]).Bayesian approach has also been employed in several packages to estimate the parameters and providemore accurate results in many instances (Rizopoulos [2016]). Some recent packages use Monte Carlo2xpectation-Maximization algorithm for estimation of parameters and consider multivariate Gaussianprocess for joint modeling (Philipson [2018], Hickey et al. [2018]). The mice (Multivariate Imputationof Chained Equations) package in R provides a good solution for the complex incomplete data prob-lems (Buuren & Groothuis-Oudshoorn [2010]). It generates the missing observations using conditionaldensities by MCMC procedure and speciﬁes diﬀerent imputation models for diﬀerent covariates. Someother packages use hot-deck imputation, k-nearest neighbor imputation, regression imputation etc. Othersuccessful eﬀorts considered expectation-maximization with bootstrapping (EMB) algorithm to imputemissing observations considering the normality assumption and MAR (missing at random) missingnessstructure (Honaker [2011]). Iterative robust model-based imputation approach, implemented through irmi() function in VIM (Kowarik & Templ [2016]) package, also provides a good alternative to imputemissing observations using robust methods (Templ et al. [2011]). We use R open-source software for thesimulation of posterior samples for the parameters using Markov Chain Monte Carlo (MCMC) method.In section 2, we describe the estimation framework of joint models for longitudinal and time-to-eventdata and also explain the imputation strategies for missing data with covariate missingness. In section3, a simulation study is described to explain the mentioned methodology. In section 4, we perform ouranalysis on a real dataset of gene expression to show its eﬃcacy. In section 5 and 6, we discuss someresults on the proposed technique and also add future scopes of this methodology. In this section, we have discussed the model-building techniques to analyze the longitudinal and survivalmeasurements and the joint modeling method to obtain the eﬃcient treatment strategy.Longitudinal data is a collection of repeated observations of a particular subject over a period oftime (Weiss [2005]). The temporal ordering of longitudinal data is important as measurements that areclose to each other tend to similar than the measurements collected at a distant time-point. So, theanalysis also gets diﬃcult with complex variance structure.One of the widely used tools for analysis of longitudinal measurements is mixed eﬀect model. Themixed eﬀect model is deﬁned as, Y i ( t ) = M i ( t ) + (cid:15) i ( t ) = X i ( t ) β + Z i ( t ) b i + (cid:15) i ( t ) ,b i ∼ N (0 , D ) , (cid:15) i ( t ) ∼ N (0 , σ ) (1)where Y i ( t ) denotes the longitudinal measurement of i th patient at time t , M i ( t ) is the true eﬀect ofthe data or, the true longitudinal response which we will model in mixed eﬀect model, (cid:15) i ( t ) is the errorcomponent attached to the data. X i ( t ) & Z i ( t ) are the ﬁxed eﬀects and random eﬀects respectively. Also, β & b i are the coeﬃcients of ﬁxed eﬀect and random eﬀects respectively. We assume the measurementerror component (cid:15) i ( t ) follows a Normal distribution with mean 0 and variance σ and the error isindependent of b i . The estimates of the coeﬃcients can be computed using generalized estimatingequations for unknown correlation structures between outcomes.Dataset consisting of time to the event of interest is referred to as time-to-event data. When theevent of interest is deﬁned as occurrence of death or any disease to the patient enlisted in the subject,then this kind of dataset is referred to as Survival Data. Observations are called censored when the eventof interest does not appear for the patient till the time he is available in the study. Generally, this kindof data is continuous in nature. 3on-parametric methods like Kaplan-Meier estimator is widely used to estimate and plot the survivalprobabilities as a function of time. It can also be used for comparison of multiple groups of subjects.For survival data, often the interest lies in the association of hazard function and related covariates. Apopularly used parametric model for survival data is Cox proportional hazard model which is deﬁned as, h i ( t | K i ) = h ( t ) exp ( K i γ ) (2)where h i ( t | K i ) is the hazard rate for the i th patient who experiences the event of interest at timet, h ( t ) is the baseline hazard function and K i is baseline covariates with corresponding regressioncoeﬃcient γ . Some popular choices of baseline hazard function are piecewise constant function, Weibullrisk function and unspeciﬁed risk function. In most applications, one of the main objectives is to estimatethe γ which explains how the covariates are related to the hazard function.In joint modeling, we associate both the mixed eﬀect model of longitudinal measurements and Coxmodel of survival outcomes in a single model and estimate the coeﬃcients (Wulfson & Tsiatis [1997],Tsiatis & Davidian [2001]). The joint model is explained as, h i ( t | K i , M i )= lim dt → P r ( t ≤ T ∗ i < t + dt | T ∗ i ≥ t, K i , M ij ) /dt (3)= h ( t ) exp ( K i γ + M ij α ) (4)Here α measures the eﬀect of longitudinal process on the time-to-event measurement. It serves as alink between the risk function of cancer infection at time t with the underlying longitudinal model. Theother parameters have their usual meanings as explained in the longitudinal and survival process. Oneof the advantages of this joint model lies in the fact that the parameters are estimated simultaneouslyin this approach.The log-likelihood of the joint model as described by Rizopoulous (Rizopoulos [2012]) is as follows, log p ( T i , δ i , y i ; θ ) = log (cid:90) p ( T i , δ i , y i , b i ; θ ) db i (5)= log (cid:90) p ( T i , δ i | b i ; θ t )[ (cid:89) j p { y i | b i ; θ y } ] p ( b i ; θ b ) db i (6) δ i is the censoring indicator where 1 implies censored observation and 0 implies uncensored obser-vation. θ = ( θ b , θ t , θ y ) denotes full parameter vector where θ t is parameters for time-to-event data, θ y denotes the parameters of longitudinal observations and θ b denotes the parameters for random eﬀectcomponent. The parameters of the joint model are estimated using the maximum likelihood methodand the random eﬀects are predicted using their conditional expectations given all of the data. Inter-pretations of the parameters remain the same as what is explained in random eﬀect model and survivalmodel. Direct maximization of this model is impossible to work on due to its complex nature andnon-parametric components. The expectation-maximization(EM) algorithm is one of the most usedtechniques for optimization purposes of joint likelihood (Wulfson & Tsiatis [1997], Henderson [2000]).In bayesian framework, the parameters of the model are estimated using Markov Chain MonteCarlo(MCMC) method. The posterior distribution of the joint model is deﬁned as, p ( θ, b i | T i , δ i , y i ) ∝ p ( T i , δ i | b i , θ ) p ( y i | b i , θ ) p ( b i | θ ) p ( b i | θ ) p ( θ ) (7)4here θ denotes the vector of all the parameters. So, survival likelihood for patient i is, p ( T i , δ i | b i , β, θ s ) = h i { T i | M i ( T i ) , θ s } δ i S i { T i | M i ( T i ) , θ s } = [ h ( T i | γ s ) exp { γ T K i + αM i ( T i ) } ] δ i × exp (cid:16) − (cid:90) T i h ( s | γ s ) exp { γ T K i + αM i ( s ) } ds (cid:17) (8)Here θ s = ( γ s , γ, α ) where γ s denotes the parameters associated with the B-splines for the baselinehazard. An inverse Wishart prior is considered for the variance-covariance matrix ( D ) and a gammaprior is used for the variance of errors ( σ ).The Bayesian approach oﬀers the use of Markov Chain Monte Carlo (MCMC) sampling algorithmto estimate the posterior distribution of the parameters. The joint models are analyzed using R pro-gramming software. In real-life situations, obtaining a dataset without any missing observations especially for patients withdiseases is a diﬃcult job. A subject can refrain itself to be measured at follow-up times which creates amonotone missing data pattern in longitudinal study. Also, there are some instances when the subject isabsent at one follow-up time and then present again at subsequent time-points, resulting in intermittentmissingness. It is diﬃcult to evaluate the likelihood functions with non-monotone missing data patternsas simple factorization rule cannot be applied here. However, in cases where missing data pattern isconsidered to be missing at random (MAR), some imputation models using the software can be appliedto estimate the missing data. Similarly, in survival studies, datasets regarding the death of a patientoften become unavailable generating censored observations. The covariates or factors impacting uponthe failure time are also not available always throughout the entire study. The diﬀerence of measuringfrequencies and the inability to measure all the covariates at a certain time slot will also result inincomplete data.A very eﬃcient way to analyze dataset consists of missing observations is considering the ignorabilitycondition. This condition comes into the picture when the missingness pattern is MAR and full-dataparameters can be decomposed into two independent parts: one signiﬁes the full data response modeland the other exhibits missing data mechanism. The ignorability condition does not interpret to dropthe missing observations, rather implies that the missingness mechanism can be left unspeciﬁed to drawposterior inference of the full data model. When the ignorability condition is believed to be not asuitable option, we can use a more general approach where missingness depends on the unobserved partof responses given the observed part and covariates.In MAR, the missingness depends only on the observed values, not on the unobserved ones (Templet al. [2011]). In our study, it is found that the patients tend to skip the clinical tests due to healthreasons whenever their condition is worsening or the size of the tumor is big enough. Also, it can befound that some of the patients do not feel obliged to test often when values are not much signiﬁcantand importance of further testing is not felt. So, the missingness depends on the observed value of thebiomarker i.e., the missingness is Missing at Random (MAR).Four diﬀerent mechanisms that have been used in this article for multiple imputations are namely,Expectation-Maximization Bootstrap Algorithm, Predictive Mean Matching, Classiﬁcation & RegressionTree and Bayesian Linear Regression Method. The multiple imputation algorithms with covariate miss-ingness are discussed here. The details on these methods are explained in the following works (Buuren& Groothuis-Oudshoorn [2010], Honaker [2011], Rubin [2004], Doove et al. [2014], Vishwakarma et al.[2016]). 5 lassiﬁcation and Regression Tree (CART)

Cart method is a popular machine learning algorithmwhich identiﬁes predictors and cutpoints in the predictors that are used to split the dataset into morehomogeneous subsamples. This process is repeated to obtain a series of splits that deﬁne a binary tree.The imputation algorithm in CART method is,(i)Draw a bootstrap sample of ( ˙ y obs , ˙ X obs ) of size n from ( y obs , X obs )(ii) Fit ˙ y obs by ˙ X obs by a tree model f ( X ).(iii) Predict n terminal nodes from f ( X mis ).(iv) For y mis , determine in which leaf they will end up according to the tree formed in setp-(iii).(v) Randomly select one donor i j of the leaf ended up in step (iv) and replace the missing value ˙ y j by y j , j = 1 , , ..., n (vi) Repeat the steps (ii) to (v) several numbers of times.(vii) The same procedure is performed for each variable in the dataset replacing the variable con-taining missing observations with the other variables for which the data available. Thus at each iteration,we obtain imputed values for each variable. Expectation-Maximization Bootstrap (EMB) Algorithm

The EMB algorithm is an eﬃcientmethod for multiple imputation technique. Here the objective is to break the distribution of observeddata and missingness into other parts where the missingness likelihood can be explained by observeddata. Under MAR conditions, let D obs denotes the observed data and M the missingness matrix. So,we get, p ( D obs , M | θ ) = p ( M | D obs ) p ( D obs | θ ) (9)where θ is the complete data parameters. In this imputation technique, it is assumed that the completedata follow multivariate normal distribution i.e., D ∼ N k ( µ, Σ). Now, the parameters will be estimatedas L ( θ | D obs ) ∝ p ( D obs | θ ) = (cid:90) p ( D | θ ) dD mis (10)The data is ﬁrst divided in m bootstrapped samples. Then for each sample, the EM algorithm isperformed to produce point estimates of µ and Σ. Thus we impute the missing observations using thoseestimates and original sample units. Analysis is performed for each of the samples and in the end, theresults are combined by averaging the estimated values. Predictive Mean Matching

In this method, the following algorithm is maintained to impute themissing observations.(i) For cases with no missing data, a linear regression of y on X is ﬁtted and estimates of thecoeﬃcients are obtained.(ii) A random draw is made from the posterior distribution of coeﬃcients. Typically this can be arandom draw from a multivariate normal distribution.(iii) Using those random draws of the coeﬃcients, the predicted values of y are generated.(iv) For each missing y , a set of cases with observed y whose predicted values are close to theobserved ones according to any metric deﬁned is chosen.(v) From those cases, we randomly choose one and assign its observed value to the missing observa-tion.(vi) Repeat the process to generate multiple datasets.This purpose of this method is to match cases with missing data to similar cases found in theobserved dataset. 6 ayesian Linear Regression Method In this method, the missing observations are imputed usingBayesian linear regression model. The imputation algorithm is,(i) Calculate V = ( S + diag ( S ) κ ) − with small κ where S = X (cid:48) obs X obs (ii) Calculate the regression weights ˆ β = V X (cid:48) obs y obs (iii) Calculate σ = ( y obs − X obs ˆ β ) (cid:48) ( y obs − X obs ˆ β ) /g where g is a random variable drawn from χ n − q )(iv) Calculate ˙ β = ˆ β + ˙ σ ˙ zV / where V / is computed using Cholesky decomposition and ˙ z are q random draws from N (0 , y ) = X mis ˙ β + ˙ z ˙ σ where ˙ z is n independent variates from N (0 , A simulation study was performed to ﬁnd out the eﬀect of missing observations on the inference andestimates of parameters. First, we simulated a full dataset using ﬁxed estimates of the parameters forboth longitudinal and survival model and also selected an association parameter for the joint model. Thesubmodel for the longitudinal outcome is deﬁned as, y i ( t ) = β + β t + β ∗ ( t stage ) i + β ∗ ( t stage ) i + β ∗ ( t stage ) i + β ∗ ( n stage ) i + b i + (cid:15) i ( t ) (11)where β ’s deﬁne the ﬁxed eﬀect parameters and b deﬁnes the subject speciﬁc variations which areassumed to be normally distributed N (0 , σ ). The variance-covariance matrix D (= σ , as b is univariate.D is used to keep consistency with the notation) of the random eﬀects is left unstructured. The t stage and n stage are two categorical covariates; t stage has 4 levels namely 0,1,2,3 and n stage has 2 levelsnamely 0 and 1. The levels are ordinal in nature and it is assumed that higher the value, greater the eﬀectof the covariate. The levels of t stage and n stage are drawn using equal probability for all levels. Whilechoosing the time component, we kept in mind the treatment plan for patients in the actual scenario.There are 4 patient visit times or sample collection times considered for simulation of longitudinaloutcomes and the visit times are chosen from the following distributions: P oisson (0 . , N (30 , , N (60 , N (90 ,

5) respectively.The survival outcomes and times are generated by, h i ( t ) = h ( t ) exp ( γ ∗ ( t stage ) i + γ ∗ ( t stage ) i + γ ∗ ( t stage ) i + αm i ( t )) (12) m i ( t ) is considered as the true longitudinal response at time t . The starting parameters are dr-wan randomly from diﬀerent distributions to maintain the independence of the regression coeﬃcientsand incorporate the diﬀerence in estimation procedure under diﬀerent methods. The distributions con-sidered for diﬀerent parameters are as follows: β ∼ U (0 , , β ∼ U (0 . , , β ∼ U (0 . , , β ∼ U (0 . , . , β ∼ U (0 . , . , β ∼ U (0 . , . , γ ∼ U ( − . , . , γ ∼ U ( − , , γ ∼ U ( − , , σ ∼ U (0 . , α is generated using U (0 . , . D or, Var(b) is generated using U (0 . , t stage and n stage . However, the missingness in covariates is randomly generated. Thecovariate missingness is considered to be around 20% of the total observations.7 .2 Analysis Multiple Imputation (MI) technique is employed to generate the missing data. Here we use 4 diﬀerentmultiple imputation methods such as Expectation-Maximization Bootstrap Algorithm, Predictive MeanMatching, Classiﬁcation & Regression Tree and Bayesian Linear Regression Method. Imputed datasetsare obtained by using of MICE and Amelia package in R open-source software. For each of the MItechniques, we obtained 4 imputed datasets and analysis was carried out on each dataset. Once theentire analysis is carried out, the results are combined to obtain the estimates of the parameters andthey are compared with the true parameter values considered for simulation.We genearted 5000 sets of starting values of the paramters and repeated the simulation study andcorresponding analysis. For each generated dataset containing missing observations, we employed thementioned missing data techniques and obtained 200 imputed datasets for each of them to incorporatethe sampling ﬂuctuations among the imputed samples. The joint modeling analysis is performed oneach of the those imputed datasets and results are combined. Thus, as a whole, the simulation study isreplicated 5000 times with 200 imputed samples for each generated data for each multiple imputationmethod.A linear mixed eﬀect model is considered for the longitudinal process with t stage and n stage are considered as covariates. Proportional hazard model is considered for the survival outcome withthe baseline hazard function is assumed to follow Weibull distribution. We considered two diﬀerentapproaches for estimating the parameters of joint models namely maximum likelihood approach andbayesian approach. Results for both the methods are compared and the estimated parameters arechecked with the original values. In ML approach of joint modeling, we considered unspeciﬁed modelfor baseline hazard function whereas for bayesian approach, the baseline risk function is approximatedusing splines. Gauss-Hermite quadrature integration was used for the estimation of parameters in MLapproach and Markov Chain Monte Carlo was used for parameter estimate in bayesian procedure.The eﬀect of missing observations in response and covariates has been assessed using two diﬀerentjoint modeling approach. The diﬀerence in parameter values in the simulation study has been examinedand on the basis of the estimates, the multiple imputation technique are judged. The estimates of theparameters are obtained for each generated dataset and the parameter estimates (for ML methods) orposterior means (for Bayesian method) and standard errors (for ML method) or standard deviations(for Bayesian method) are averaged. As the starting parameters are randomly generated from diﬀerentdistributions, we compared the performance of diﬀerent imputation techniques through two performancemeasures namely Coverage Probability and Root Mean Squared Error (RMSE). The performance mea-sures of the parameters for both response and covariate missingness is displayed in Tables 1 & 2 and thesame for only response missingness are presented in Tables 3 & 4.Among diﬀerent multiple imputation techniques, the data imputed by Predictive Mean Matching(PMM) method results in highest coverage and lowest RMSE among the four methods followed by EMBalgorithm. The Bayesian Normal Regression (NORM) and Classiﬁcation and Regression Tree (CART)method performs close to each other while CART methods performs poorest among those four methods.The coverages are higher for joint modeling analysis using maximum likelihood method than bayesianmethod. Also, the models show better performance when the missingness is occurred in response only.The RMSEs are much lower for the association parameter α resulting in better estimate. Among diﬀerentimputation methods, most of the estimates obtained from imputed data using EMB algorithm are foundto be close to the full data analysis. For response missingness only, all the four methods perform betterand they are close to each other while PMM method also performs slightly better than the others.8 Analysis on Gene Expression Study

To perform our analysis, we have used a published dataset on Gene Expression study collected from GeneExpression Omnibus (GEO) database ( ) under the accessionnumber GSE65622. In this study, all the patients were given neoasjuvant chemotherapy (NACT) fol-lowed by chemoradiotherapy (CRT) and treatment was evaluated for four weeks after CRT completion.After 2-4 weeks, radical pelvic surgery was considered. The study consists dataset on 80 patients and theobservations were obtained in the following manner. Serum samples were collected at baseline (Saplingpoint=1), following NACT post-NACT(Sampling point=2), at CRT completion (Sampling point=3),post-CRT (Sampling point=4) and at treatment evaluation (Sampling point=5). With a high-densityantibody array (AHH-BLG-1;RayBiotech Inc.), the samples were analyzed for the presence of 507 pro-teins. Further explanation about the study setting and detailed description of the data can be cited fromMeltzer et al. (2016). The objective of the study is to classify the proteins according to their eﬀect oncancer progression and investigate the association between factors regarding tumor severity and durationof disease reoccurrence.In the biomarker study, measurements on several biomarkers are gathered from the patients. Itis our interest to ﬁnd how the missing observations in

Rantes , a well-established biomarker related topulmonary tuberculosis and HIV disease, aﬀects the analysis of joint modeling. The

Rantes biomarkeris also known as

CCL og ( y i ( t )) = β + β ∗ ( days/ i + β ∗ t stage i + β ∗ n stage i + β ∗ trg score i + b i + (cid:15) i ( t ) (13)where t stage, n stage and trg score are the ordinal covariates where higher value denotes more severeeﬀect.In the survival model, we used Cox Proportional Hazard Model using t-stage and n-stage as covari-ates. The survival submodel is deﬁned as, h i ( t ) = h ( t ) exp ( γ ∗ t stage + γ ∗ t stage + + γ ∗ n stage + γ ∗ n stage + αm i ( t )) (14)The baseline hazard maximum likelihood approach is left unstructured. In bayesian approach, it ismodeled with cubic B-splines with 5 knots are being on the percentiles of the observed event times.Both bayesian and ML (maximum likelihood) framework has been used for joint modeling. In MLframework, the baseline risk function is left unspeciﬁed whereas in bayesian framework, the baselinerisk function is approximated using splines. Also in bayesian framework, a simple association structurehas been used where the longitudinal model is associated with the survival covariates with an associa-tion parameter. We have considered only the observations which have occurred before the event. Theestimates of the parameters are obtained through joint modeling for each generated dataset and the pa-rameter estimates and standard errors (for ML estimate) or standard deviations (for bayesian method)are averaged. Comparison for the bayesian models and ML models was done using Deviance informationcriterion (DIC) & Akaike Information Criterion (AIC) values respectively and they are shown in Tables5 and 6.From the tables, we can ﬁnd that all the imputation models are pretty close to each other andestimates of some of the parameters are diﬀerent in complete case analysis than imputed data analysis.In complete case analysis, we choose only those patients where all the longitudinal outcomes till thetime of event are available. By comparing the AIC values, it can be said that the PMM, CART andEMB algorithm imputation methods are very close to each other whereas NORM method has higherAIC showing a less preferable method. The association parameter between longitudinal biomarker andsurvival event obtained through ML estimate is nearly 1.65 for all the cases except EMB algorithm forwhich the estimate is 2.015. However, for bayesian method, it is approximately 1.10 for all the methodsexcept EMB algorithm for which the estimate is 1.916. Joint modeling of longitudinal and survival data with missing covariates is a challenging area in oncologyresearch. The presence of missing observations makes it diﬃcult to analyze the data and often leads tobiased inference. Available techniques based on complete observations often leads to biased estimate whenthe proportion of missingness is high enough in the dataset. The joint models combine the conventionallinear mixed eﬀect model for longitudinal responses with the renowned Cox proportional hazard modelfor the survival process. Maximizing the likelihood is not always possible in straightway due to thenon-explicit form of the likelihood function which leads to the use of numerical techniques and bayesianmethodology.It is more challenging to compute the conditional distribution of the missingness mechanism presentin the joint models. Listwise deletion method to get the complete data is valid only under MCAR con-ditions, but not for MAR and MNAR situations. More speciﬁcally, when the probability of missingness10s dependent on the unobserved responses and covariates, it requires special techniques to handle suchscenarios.Separate modeling mechanism is necessary to postulate the cases to obtain valid inference for jointmodels. Three main modeling frameworks for missing observations are a pattern mixture model, sharedparameter model, and selection model. However, these models do not take into account the missingnessof the covariates and are mainly concerned with the response variable. In our study, we used multipleimputation techniques to obtain the missing observations in the covariates also. Among several algo-rithms, we took support of four methods for data imputation through EMB algorithm, Correlation andregression tree, Bayesian normal regression, and Predictive mean matching. The Bayesian perspective inthe form of EMB algorithm for imputation has produced better estimates among those methods. Priordistributional assumptions are used to generate statistical inferences from the posterior distributions ofthe unknown variables including the missing data. Well-known packages in R programming softwarewere used to analyze the dataset. In this article, we demonstrated an eﬀective way to analyze data in joint modeling scenario when theresponse and covariates both have missing observations. Missing covariates for joint longitudinal andsurvival data modeling is a challenging task to solve. Our approach shows an eﬃcient way to handlesuch missingness in Biomarker study with diﬀerent multiple imputation techniques and their eﬀects onjoint modeling. Both the joint modeling approach is assessed to extensively examine the eﬀect of missingobservations on diﬀerent estimation procedures. In Biomarker study, the cost of obtaining dataset ishigh, so it is necessary to use proper techniques to handle missing observations.Our procedure is still limited in handling missingness in the parametric approach. However, furtherresearch work is required to accommodate the non-gaussian response process. The semi-parametricapproach could be an attractive alternative choice for multiple imputations. It is more ﬂexible thanothers with extension for non-Gaussian response structure to ﬁt joint models.

The detailed R programming code used in this article is available in the following link https://github.com/souvikbanerjee91/missing-data-in-joint-modeling.git

Authors are deeply indebted to the editor Prof. Narayanaswamy Balakrishnan and learned referee fortheir valuable suggestions leading to improving the quality of contents and presentation of the originalmanuscript. Authors are also thankful to Science and Engineering Research Board, Department ofScience & Technology, Government of India, for providing necessary support to carry out the presentresearch work through project Grant No. EMR/2016/003305.

References

Armero, C., Forn´e, C., Ru´e, M., Forte, A., Perpi˜n´an, H., G´omez, G., & Bar´e, M. (2016) Bayesianjoint ordinal and survival modeling for breast cancer risk assessment.

Statistics in medicine , 35(28),5267-5282. 11acon, K. B., Premack, B. A., Gardner, P., & Schall, T. J. (1995) Activation of dual T cell signalingpathways by the chemokine RANTES.

Science , 269(5231), 1727-1730.Bell, M. L., Kenward, M. G., Fairclough, D. L., & Horton, N. J. (2013) Diﬀerential dropout and bias inrandomised controlled trials: when it matters and when it may not.

BMJ (Clinical research ed.) , 346,e8668.Bernard, P. S., & Wittwer, C. T. (2002) Real-time PCR technology for cancer diagnostics.

ClinicalChemistry , 48(8), 1178-1185.Bhattacharjee, A. and Vishwakarma, G.K., Time-course data prediction for repeatedly measured geneexpression data.

International Journal of Biomathematics , 12(4), 1950033.Bhattacharjee, A., Vishwakarma, G.K., and Thomas, A., Bayesian state-space modeling in gene expres-sion data analysis: an application with biomarker prediction. Mathematical Biosciences, 305, 96-101,2018.Buuren, S. V., & Groothuis-Oudshoorn, K. (2010) mice: Multivariate Imputation by Chained Equationsin R.

Journal of statistical software , 45(3), 1-67.Chen, Q., May, R. C., Ibrahim, J. G., Chu, H., & Cole, S. R. (2014) Joint modeling of longitudinal andsurvival data with missing and left-censored time-varying covariates.

Statistics in medicine , 33(26),4560-4576.Cox, D. R. (1972) Regression models and life-tables.

Journal of the Royal Statistical Society: Series B(Methodological) , 34(2), 187-202.Cox, D. R., & Oakes, D. (1984) Analysis of survival data.

Chapman & Hall/CRC, London .Doove, L. L., Buuren, S. V., & Dusseldorp, E. (2014) Recursive partitioning for missing data imputationin the presence of interaction eﬀects.

Computational Statistics & Data Analysis , 72, 92-104.Faucett, C.L. & Thomas, D.C. (1996) Simultaneously modeling censored survival data and repeatedlymeasured covariates: a Gibbs sampling approach,

Statistics in Medicine , 15(15), 1663-1685.Follmann, D. & Wu, M. C. (1995) An approximate generalized linear model with random eﬀects forinformative missing data.

Biometrics , 51(1), 151–168.Heckman, J. J. (1976) The common structure of statistical models of truncation, sample selection andlimited dependent variables and a simple estimator for such models.

Annals of Economic and SocialMeasurement , 5(4), 475–492.Henderson, R., Diggle, P., Dobson, A. (2000) Joint modeling of longitudinal measurements and eventtime data.,

Biostatistics , 1(4),465– 480.Hickey, G. L., Philipson, P., Jorgensen, A., & Kolamunnage-Dona, R. (2018) joineRML: a joint modeland software package for time-to-event and multivariate longitudinal outcomes.

BMC medical researchmethodology , 18(1), 50.Hogan, J. W., Roy, J. and Korkontzelou, C. (2004) Handling drop-out in longitudinal studies.

Statisticsin medicine , 23(9), 1455-1497.Honaker, J., King, G., & Blackwell, M. (2011) Amelia II: A program for missing data.

Journal ofstatistical software , 45(7), 1-47. 12ui, D., Glitza, I., Chisholm, G., Yennu, S., & Bruera, E. (2013) Attrition rates, reasons, and predictivefactors in supportive care and palliative oncology clinical trials.

Cancer , 119(5), 1098-1105.Ibrahim, J.G., Molenberghs G. (2009) Missing data methods in longitudinal studies: a review.

Test ,18(1), 1-43.Ibrahim J.G., Chu H & Chen L.M. (2010) Basic concepts and methods for joint models of longitudinaland survival data.

Journal of Clinical Oncology , 28(16), 2796–2801.Iqbal, N., & Iqbal, N. (2014) Imatinib: A Breakthrough of Targeted Therapy in Cancer,

ChemotherapyResearch and Practice , Article ID 357027, 9 pages.Kowarik, A., & Templ, M. (2016) Imputation with the R Package VIM.

Journal of Statistical Software ,74(7), 1-16.Little, R. J. A. (1994) A class of pattern-mixture models for normal incomplete data.

Biometrika , 81(3),471–483.Little, R. J., & Rubin, D. B. (2014) Statistical analysis with missing data (Vol. 333).

John Wiley & Sons .Lu, T. (2017) Jointly modeling skew longitudinal survival data with missingness and mismeasured co-variates.

Journal of Applied Statistics , 44(13), 2354-2367.Malvezzi, M., Carioli, G., Bertuccio, P., Boﬀetta, P., Levi, F., La Vecchia, C., & Negri, E. (2017)European cancer mortality predictions for the year 2017, with focus on lung cancer.

Annals of Oncology ,28(5), 1117-1123.Marrugo-Ram´ırez, J., Mir, M., & Samitier, J. (2018) Blood-based cancer biomarkers in liquid biopsy:A promising non-invasive alternative to tissue biopsy.

International Journal of Molecular Sciences ,19(10), 2877.Molenberghs, G. & Kenward, M. G. (2007) Missing Data in Clinical Studies (Vol. 61).

John Wiley &Sons .Padma V. V. (2015) An overview of targeted cancer therapy.

BioMedicine , 5(4), 19.Parzen, M., Lipsitz, S. R., Fitzmaurice, G. M., Ibrahim, J. G., & Troxel, A. (2006) Pseudo-likelihoodmethods for longitudinal binary data with non-ignorable missing responses and covariates.

Statisticsin medicine , 25(16), 2784-2796.Perkel, J. M. (2013) PCR Past, Present & Future.

The Scientist Magazine , 27(12), 63-65.Philipson P, Sousa I, Diggle P, Williamson P, Kolamunnage-Dona R, Henderson R, Hickey G (2018)joineR: Joint modelling of repeated measurements and time-to-event data. R package version 1.2.4,https://github.com/graemeleehickey/joineR/.Pray, L. A. (2008) Gleevec: The breakthrough in cancer treatment.

Nature Education , 1(1), 37.Rizopoulos, D. (2010) JM: An R package for the joint modelling of longitudinal and time-to-event data.

Journal of Statistical Software (Online) , 35(9), 1-33.Rizopoulos D. (2012) Joint models for longitudinal and time-to-event data: With applications in R.

Chapman and Hall/CRC, New York .Rizopoulos, D. (2016) The R package JMbayes for ﬁtting joint models for longitudinal and time-to-eventdata using MCMC.

Journal of Statistical Software , 72(7), 1-46.13obins, J. M., Rotnitzky, A., & Zhao, L. P. (1995) Analysis of semiparametric regression models forrepeated outcomes in the presence of missing data.

Journal of the American Statistical Association ,90(429), 106-121.Rubin, D. B. (1976) Inference and missing data.

Biometrika , 63(3), 581-592.Rubin, D. B. (2004). Multiple imputation for nonresponse in surveys (Vol. 81).

John Wiley & Sons .Sattar, A., & Sinha, S. K. (2019). Joint modeling of longitudinal and survival data with a covariatesubject to a limit of detection.

Statistical methods in medical research , 28(2), 486-502.Shih, Y. C. T., Smieliauskas, F., Geynisman, D. M., Kelly, R. J., & Smith, T. J. (2015) Trends in thecost and use of targeted cancer therapies for the privately insured nonelderly: 2001 to 2011.

Journalof Clinical Oncology , 33(19), 2190.Song, H., Peng, Y., & Tu, D. (2016) Recent Development in the Joint Modeling of Longitudinal Qualityof Life Measurements and Survival Data from Cancer Clinical Trials.

Advanced Statistical Methods inData Science , 153-168. Springer, Singapore.Sorokin, P. (2000) Mylotarg ™ Approved for Patients With CD33+ Acute Myeloid Leukemia.

ClinicalJournal of Oncology nursing , 4(6).Sun, Y., Ma, L., Mathew, J., Wang, W., & Zhang, S. (2006) Mechanical systems hazard estimation usingcondition monitoring.

Mechanical systems and signal processing , 20(5), 1189-1201.Templ, M., Kowarik, A., & Filzmoser, P. (2011). Iterative stepwise regression imputation using standardand robust methods.

Computational Statistics & Data Analysis , 55(10), 2793-2806.Tsiatis, A. A., Degruttola, V., & Wulfsohn, M. S. (1995) Modeling the relationship of survival to lon-gitudinal data measured with error. Applications to survival and CD4 counts in patients with AIDS.

Journal of the American Statistical Association , 90(429), 27-37.Tsiatis, A. A.,Davidian, M., (2001) A semiparametric estimator for the proportional hazards model withlongitudinal covariates measured with error,

Biometrika , 88(2), 447-458.Tsiatis, A. (2007) Semiparametric theory and missing data.

Springer Science & Business Media, NewYork .Vishwakarma, G.K., Bhattacharjee, A., Jose, J. and Ramesh, V. (2016) Cancer patients missing painscore information: Application with imputation techniques.

Epidemiology, Biostatistics and PublicHealth , 13(4), e11916: 1-8.Weiss, R. E., (2005) Modeling longitudinal data.

Springer Science & Business Media, New York .Wu, M. C. & Carroll, R. J. (1988) Estimation and comparison of changes in the presence of informativeright censoring by modeling the censoring process.

Biometrics , 44(1), 175–188.Wulfsohn, M.S. & Tsiatis, A.A. (1997) A joint model for survival and longitudinal data measured witherror,

Biometrics , 53(1), 330-339. 14igure 1: Proﬁle plot of log values of the Rantes Biomarker for diﬀerent patients15 a) (b)(c) (d)

Figure 2: Density comparisons of imputed data for 4 diﬀerent imputation techniques applied onBiomarker data (a) EMB Algorithm, (b) CART method, (c) NORM method, (d) PMM Method. 5imputed datasets per method were generated and plotted with the observed biomarker values for com-parison. 16able 1: Coverage Probability and Root Mean Squared Error of Bayesian Estimates of Coeﬃ-cients in Joint Models on Simulated Biomarker Data with both response and covariate missing-ness (The RMSEs are given in the brackets)Variable FullDataset EMBAlgorithm PMMMethod NORMMethod CARTMethod β β β β β β γ γ γ α σ β β β β β β γ γ γ α σ β β β β β β γ γ γ α σ β β β β β β γ γ γ α D σ -0.066 (0.124) -0.119 (0.125) -0.119 (0.125) -0.142 (0.133) -0.128 (0.135)n stage -0.128 (0.101) -0.140 (0.086) -0.116 (0.085) -0.073 (0.089) -0.148 (0.090)trg score αα ) 1.334 (0.240) 2.015 (0.221) 1.651 (0.214) 1.646 (0.216) 1.679 (0.219)Variance ofrandom component(D) 0.193 (0.214) 0.208 (0.251) 0.188 (0.247) 0.565 (0.248) 0.600 (0.263)AIC 188.441 415.523 419.878 413.261 466.68219able 6: Comparison of Bayesian estimates of Coeﬃcients in Joint Models on diﬀerently imputedBiomarker Data (The posterior standard deviations of the estimates are given in the brackets)Variable Complete Data(Excluding themissingObservations) EMBAlgorithm PMMMethod CARTMethod NORMMethod(Intercept) 4.211 (0.428) 4.273 (0.293) 4.227 (0.312) 4.218 (0.287) 4.242 (0.314)(days/365) 0.536 (0.188) 0.548 (0.154) 0.562 (0.145) 0.536 (0.148) 0.529 (0.165)t stage -0.033 (0.432) -0.083 (0.298) -0.108 (0.316) -0.115 (0.291) -0.102 (0.314)n stage -0.121 (0.313) -0.143 (0.209) -0.142 (0.213) -0.079 (0.198) -0.144 (0.214)trg score αα