Glucose values prediction five years ahead with a new framework of missing responses in reproducing kernel Hilbert spaces, and the use of continuous glucose monitoring technology
Marcos Matabuena, Paulo Félix, Carlos Meijide-Garcia, Francisco Gude
Abstract
The AEGIS study possesses unique information on longitudinal changes in circulating glucose obtained through continuous glucose monitoring (CGM) technology. However, as is usual in longitudinal medical studies, there is a significant amount of missing data in the outcome variables. For example, 40 percent of glycosylated hemoglobin (A1C) biomarker data are missing five years ahead. To reduce the impact of this issue, this article proposes a new data analysis framework based on learning in reproducing kernel Hilbert spaces (RKHS) with missing responses, which can capture non-linear relations between study variables in different supervised modeling tasks. First, we extend the Hilbert-Schmidt dependence measure to test statistical independence in this context, introducing a new bootstrap procedure for which we prove consistency. Next, we adapt or extend existing models of variable selection, regression, and conformal inference to obtain new clinical findings about glucose changes five years ahead with the AEGIS data. The most relevant findings are summarized below: i) we identify new factors associated with long-term glucose evolution; ii) we show the clinical sensitivity of CGM data for detecting changes in glucose metabolism; iii) we can improve clinical interventions based on our algorithms' expected glucose changes according to patients' baseline characteristics.
Motivation and outline of contributions
With advances in digital patient monitoring and personalized medicine, a new clinical paradigm based on optimizing medical decisions according to data-driven approaches is emerging. Diabetes mellitus is an essential reference point for the application of these techniques. It is estimated that approximately 50% of diabetes patients have not been diagnosed yet. Furthermore, adherence and effectiveness of treatments are poor across many patient groups, and disease prevalence is increasing with contemporary lifestyles. In this sense, predictive models that forecast and identify risk factors associated with the evolution of glycemic profiles in the short and long term are vital for identifying patients at risk of disease development, improving early diagnosis, and prescribing optimal dynamic treatments. This paper's primary goal is to study the relationship between the AEGIS study patients' baseline characteristics and the primary biomarker of diabetes diagnosis and control, glycosylated hemoglobin (A1C), five years ahead. In addition, we introduce information about continuous glucose monitoring (CGM) into the models to capture individual glucose homeostasis fluctuations at a high-resolution level. As five-year A1C data registries are missing for approximately 40% of patients, we propose a new data-analysis framework based on RKHS learning with missing responses as a methodological contribution. This machine learning (ML) paradigm can detect complex non-linear relations between study variables and simultaneously analyze data of different natures from several information sources, such as CGM. In particular, we address the statistical independence testing problem via a new Hilbert-Schmidt criterion designed explicitly for this context, and we adapt several existing model-free methods for variable selection, regression modeling, and conformal inference.
Using these models, we achieve new clinical findings: i) we identify some diabetes biomarkers associated with glucose variations in the standard clinical routine, both marginally and from a multivariate perspective; ii) we show the need to incorporate CGM technology to predict glucose changes in the long term; iii) we identify some risk patient phenotypes for which the models' predictive capacity is moderate, and which therefore need more personalized follow-up.
Diabetes mellitus is one of the most critical public health problems, being the ninth leading cause of death worldwide [Zheng et al., 2018, Saeedi et al., 2020]. At present, over 416 and 47 million patients have Type II and Type I diabetes respectively [Saeedi et al., 2019], with estimated health costs of disease management that reach 760 billion dollars [Williams et al., 2020]. Moreover, several projections forecast a significant increase in prevalence in the following decades [Whiting et al., 2011, Cho et al., 2018]. Considering the growth of this pandemic among the general population [Tabish, 2007, Hu et al., 2015, Ginter and Simko, 2013], there is a notable need to pursue new health policies that enable early recognition of at-risk patients and improve the methodology of disease diagnosis in the standard clinical routine. Nowadays, around 50% of patients with diabetes are undiagnosed [Saeedi et al., 2019], and sedentary lifestyles have become more widespread in the population [Finkelstein et al., 2012], being a significant causal factor [Ng et al., 2014] of the progressive increase in the incidence of chronic diseases [Visscher and Seidell, 2001, Zheng et al., 2018]; likewise, the density curve of body mass index along different age groups is taking higher and higher values in the overweight and obesity range [Flegal et al., 2012]. As a consequence, the clinical complications and burden of health costs [Rubin et al., 1994] associated with an impaired glycemic condition in the early stages of the disease, in patients for whom no specific interventions on individual glucose homeostasis control are performed [Walker et al., 2010], will have a stronger impact on the human condition [Zheng et al., 2018, Dabelea et al., 2017]. A new emerging clinical paradigm based on digital and precision medicine [Topol, 2010, Kosorok and Laber, 2019, Schork, 2015] can be a landmark to improve early diagnosis.
In this context, clinical decisions, e.g., treatment prescription, can be optimized through the intensive use of statistical models and machine learning techniques [Kosorok and Moodie, 2015, Kosorok and Laber, 2019, Zhao et al., 2011, Coronato et al., 2020, Cirillo and Valencia, 2019] that exploit the rich source of information generated by patients' monitoring [Li et al., 2017]. In the particular case of diabetes [Ellahham, 2020, Gunasekeran et al., 2020, Zou et al., 2018], the application of these models can be a valuable tool for improving the early identification of patients at high risk of developing diabetes, predicting complications such as retinopathy, and dynamically prescribing optimal treatments [Tsiatis, 2019, Zhang et al., 2012, Goldberg and Kosorok, 2012] in patients with type I diabetes. In this case, the patient variables involved in individual data-driven treatment routines [Luckett et al., 2020] can include non-pharmacological interventions such as physical exercise or diet, insulin pumps, or other common drugs such as metformin [Walker et al., 2010, for Disease Control et al., 2011]. Current advances in device technology allow assessing patients' glucose metabolism at a high-resolution level, capturing individual differences in glucose fluctuations at different time scales via continuous glucose monitoring (CGM) [Zaccardi and Khunti, 2018]. However, current standard clinical biomarkers of diabetes diagnosis and control, such as glycosylated hemoglobin (A1C) or fasting plasma glucose (FPG) [Zhang et al., 2010], capture only partially the temporal complexity of glycemic profiles, measuring summary characteristics such as the mean glucose over the preceding 3 months (A1C) or a glucose value at a specific instant of time in the morning (FPG) [Selvin et al., 2007]. CGM devices have been used primarily in specific risk-management situations concerning patients with type I diabetes [Poolsup et al., 2013].
However, their use is becoming more popular in clinical routines because of decreased costs and technological improvements. As a consequence, more general applications in both diseased and healthy populations are emerging, even outside the field of diabetes [Lu et al., 2020]. Some relevant examples include the acquisition of new clinical knowledge in epidemiological studies, the screening of patients, the evaluation of the prognosis of patients with diabetes, and the optimization of diet [Beck et al., 2019, Zeevi et al., 2015]. Nevertheless, from the methodological point of view, there are essential difficulties in exploiting the data generated by these devices in realistic environments, where patients are monitored in free-living conditions and glucose fluctuations are not temporally aligned between patients, making time series analysis unfeasible. In a recent work, [Matabuena et al., 2020] introduced a new functional profile of CGM data termed glucodensity to overcome these limitations (see Figure 1 for an intuitive explanation of the new data analysis method). The results show that this alternative method can assess glucose homeostasis more accurately than the so-called time-in-range metrics, the current gold standard in CGM data handling [Battelino et al., 2019, Beck et al., 2019, Wilmot et al., 2020, Nguyen et al., 2020]. Moreover, this new proposal overcame some limitations of time-in-range metrics, such as the predefinition of target zones, which can depend on the study population, and the loss of information caused by discretizing the recorded information into different intervals. Several predictive models have been developed in the present body of scientific literature to control and diagnose diabetes [Edlitz and Segal, 2020, Zaitcev et al., 2020, Lee et al., 2013]. However, to the best of our knowledge, these models present several limitations, as they do not incorporate the rich information about individual glucose homeostasis dynamics provided by CGM.
Moreover, authors often fit predictive models with data of an observational nature without applying specific techniques to correct non-randomness in the sampling design, affecting generalization and inference in the predictive models. We believe that CGM technology can introduce new insight into the assessment of future glucose metabolism behavior. In addition, the use of high-quality data, such as those obtained from a random sample of the general population, is essential to obtain robust and reproducible conclusions about model performance. This study design, rather than dealing with observational data, is the gold-standard practice for safely assessing treatment performance in other domains such as clinical trials.

The AEGIS population-based study [Gude et al., 2017] is one of the most representative cohorts in the world analyzing unique clinical characteristics of glucose dynamics over ten years, with a random sample of 1516 individuals from A Estrada (Galicia, Spain). At the beginning of this study, 581 participants were randomly selected, because of the economic and logistical implications of wearing a CGM device, for 3-7 days of monitoring. After a 5-year follow-up, a significant fraction of those individuals did not agree to perform a second monitoring, while some relevant 5-year outcomes such as A1C could not be measured in 40% of the patients. This situation is commonplace in cohort studies, where missing data in different outcomes due to lack of patient follow-up are familiar [Laird, 1988, Tsiatis, 2007]. Generally, in the missing-data literature, when the outcome or response is missing, two different situations are considered [Tsiatis, 2007, Rubin, 1976, Little and Rubin, 2019]. In the first case, the MAR (Missing At Random) assumption ensures that the mechanism of missing data is random and independent of the covariates. In the second case, the MNAR (Missing Not At Random) hypothesis assumes that some covariates have a certain impact on the mechanism of missing data; for instance, in our example, older patients are less willing to perform a second CGM monitoring, so that the probability of not observing a patient increases with age.

Figure 1: Glucodensities estimated from a random sample of the AEGIS study with diabetic and normoglycemic patients (left panel: glucodensities as densities over glucose values in mg/dL; right panel: the same glucodensities represented in the space of quantile functions). Our glucose representation estimates the proportion of time spent by a patient at each glucose concentration over a continuum, a more sophisticated approach to assessing glucose metabolism.

The aim of this paper is twofold. First, we introduce new machine learning (ML) models to test statistical independence, perform variable selection, and predict and make inferences in a context where the response is missing for some individuals, something rather usual in the outcome variables of the AEGIS study. The proposed models are based on the reproducing kernel Hilbert space (RKHS) learning paradigm, one of the most powerful machine learning strategies [Schölkopf et al., 2002, Steinwart and Christmann, 2008], which has been successfully applied in many real applications because of its ability to detect complex non-linear relations between study variables; see, for example, [Ivanciuc et al., 2007, Noble, 2006, Salcedo-Sanz et al., 2014, Deka et al., 2014, Muandet et al., 2016, Ghorbani et al., 2020]. Second, we apply the developed models to the AEGIS database, composed of patients with and without diabetes. As a result, we identify markers associated with A1C-measured glucose values five years ahead and predict future glucose values to anticipate changes in patient conditions.
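The covariate-dependent missingness just described (older patients being less likely to return for follow-up) can be made concrete with a small simulation. The sketch below is our own illustration, not the paper's code: it fits the missingness mechanism π(x) = P(R = 1 | X = x) with a logistic regression solved by Newton's method on a hypothetical "age" covariate, then forms inverse-probability weights of the kind used later in the paper. All variable names and numbers are illustrative assumptions.

```python
import numpy as np

# Illustrative sketch (not the authors' code): simulate a missing-response
# mechanism that depends on age, estimate pi(x) = P(R = 1 | X = x) by
# logistic regression, and build inverse-probability weights.

rng = np.random.default_rng(0)
n = 2000
age = rng.uniform(20, 80, n)                         # hypothetical covariate
X = np.column_stack([np.ones(n), (age - 50) / 10])   # intercept + scaled age

beta_true = np.array([0.5, -0.8])                    # P(observed) decreases with age
p_true = 1.0 / (1.0 + np.exp(-X @ beta_true))
R = rng.binomial(1, p_true)                          # R = 1: response observed


def fit_logistic(X, R, n_iter=25):
    """Maximum-likelihood logistic regression via Newton-Raphson."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))
        grad = X.T @ (R - p)                         # score vector
        hess = (X * (p * (1 - p))[:, None]).T @ X    # Fisher information
        beta = beta + np.linalg.solve(hess, grad)
    return beta


beta_hat = fit_logistic(X, R)
pi_hat = 1.0 / (1.0 + np.exp(-X @ beta_hat))

# IPW weights: unobserved units get weight zero; the sum is close to one.
w = R / (len(R) * np.clip(pi_hat, 1e-3, 1.0))
```

In practice, as the paper notes later, π(·) could equally be estimated with lasso, random forests, or an ensemble such as the Super Learner; logistic regression is used here only because it is the simplest instance.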
With the aim of stratifying patients' risk, different diabetes scores, such as the Finnish (FINDRISC) [Makrilakis et al., 2011] and the German (GDRS) [Mühlenbruch et al., 2018] ones, try to predict the probability of developing diabetes within ten years with a logistic regression, or the time to becoming diabetic with survival models such as Cox regression. The variables included in these models are easily obtained, such as age, sex, anthropometric measurements, or other clinical history information such as lifestyle, family history, and medication. More recently, Yochai Edlitz and Eran Segal [Edlitz and Segal, 2020] proposed several alternative ML models based on gradient boosting machine algorithms that involve variables of greater or lesser measurement complexity, such as the measures discussed above, laboratory biochemical biomarkers, or even genotyping variables. However, the UK Biobank sample is observational, and the authors do not use specific techniques to avoid the biases associated with the sampling mechanism, which limits the generalization and reproducibility of the obtained results. Other articles in the literature predict continuous biomarkers of diabetes diagnosis and control, such as FPG or A1C, in both non-diabetic and diabetic patients, instead of predicting the event "development of diabetes" or another categorical variable [Zaitcev et al., 2020, Lee et al., 2013]. We consider that the latter approach has considerable advantages over the former, mainly for two reasons: i) in terms of statistical inference and interpretation of the results, it is more precise to predict a continuous variable rather than a categorized one, in which there is an evident loss of information; ii) from the clinical point of view, we lose the information on each individual's glucose values with the categorization of the variables. Suppose we do not categorize the variables and instead use the biomarkers' values as continuous variables.
In that case, we can analyze a target sample involving both patients without diabetes and patients with diabetes, as with the AEGIS study database. In this sense, some authors have recently affirmed that the current glucocentric definition of diabetes may be strict in various situations [Vas et al., 2017]. Using CGM technology, with its multitude of measurements of an individual's glucose metabolism, may lead to the establishment of more personalized diagnostic thresholds [Zaccardi and Khunti, 2018]. For these reasons, this study's clinical goal is to predict A1C five years ahead using information provided by a continuous glucose monitoring device, which has never been explored in the literature. We select A1C as the outcome instead of FPG because A1C is a much more reproducible laboratory biochemical measurement between different tests than FPG, and the response variable is subject to less measurement error [Selvin et al., 2007].

RKHS models with missing data
In the last decades, the statistical community has developed a vast amount of new methodological contributions to minimize the bias caused by missing data in covariates and response variables in various unsupervised and supervised modeling tasks [Londschien et al., 2020, Little and Rubin, 2019]. The contributions carried out under the paradigm of statistical learning or ML are sparser and more recent than those in the field of statistics; see, for example, [Muzellec et al., 2020]. In this paper, we restrict our attention to the case where the only variable with missing entries is the response. This situation is common in many longitudinal studies, where failure to retrieve data during some patients' follow-up is frequent and there is no information about the evolution of several related outcomes in different periods.

We chose the RKHS learning paradigm to tackle our missing data problem. The primary rationale for this decision is that the estimators attain the optimal non-parametric convergence rate [Stone, 1982] under certain hypotheses, for example, when the data live in a low-dimensional manifold or a very smooth functional space. Moreover, the methods remain valid with heterogeneous complex data [Borgwardt et al., 2006, Muandet et al., 2016], such as graphs or curves taking values on a continuum, as in our case concerning the distributional-functional representation of glucose profiles built through the concept of glucodensity [Matabuena et al., 2020].

To the best of our knowledge, the methodology available in this context covers only predictive models for regression. We introduce new methods for statistical independence testing, variable selection, and inference on the uncertainty of new predictions. Liu and Goldberg [Liu et al., 2020] propose a kernel ridge regression estimator for either the case in which the missing data mechanism is handled by the propensity score via the IPW (Inverse Probability Weighting) estimator or a double-robust approach [Tsiatis, 2007, Bang and Robins, 2005]. Next, we summarize the state of the art of RKHS-based techniques with complete data. Kernel mean embeddings [Muandet et al., 2016, Gretton et al., 2012] are a powerful tool to build different statistics to test equality between probability distributions and to measure statistical independence in RKHS spaces, and they attract broad interest among practitioners of the ML community and, in particular, of kernel methods. Variable selection in RKHS spaces allows identifying the best subset of predictors without assuming any functional form for the dependence structure between the covariates and the response. Two different strategies were considered: i) learning the gradient of the conditional mean function [Yang et al., 2016]; ii) maximizing the norm of the conditional correlation operator [Chen et al., 2017]. Finally, measuring the uncertainty of predictions is an essential task to support clinical decision-making through these models. This can be done, for example, through the construction of a confidence interval that contains the real value with a certain probability margin. In a setup of complete data, this problem can be addressed with conformal inference [Shafer and Vovk, 2008]. At this moment, no such methodology is available with missing data. However, we exploit recent advances combining causal inference with this area [Lei and Candès, 2020], and we adapt our methodology using the connection between these two areas [Ding et al., 2018].
Let {(X_i, Y_i, R_i)}_{i=1}^n be an independent random sample of a random vector (X, Y, R) taking values in V × ℝ × {0, 1}, where V can denote any set as general as we want, for example, graphs or random functions. Let X denote the covariates, Y the response variable, and R a binary random variable indicating whether the response is observed or not, which we assume to be distributed according to the probability law π(x) = P(R = 1 | X = x), which depends on the covariates X. For this, we suppose that R ⊥ Y | X. Under the MAR missing data mechanism, R ∼ Bernoulli(p), where p is the probability that the event "missing response datum" happens, and this event is independent of the values that the covariates take.

Therefore, we propose a new data analysis framework for different prediction tasks when some of the Y_i are not observed. In particular, Y_i is missing if R_i = 0 (i = 1, …, n). Below, we discuss the modeling tasks that we address, together with the new methodological contributions.

• Statistical independence testing: A cornerstone problem in statistics, epidemiology, and data analysis in general is testing statistical independence between random variables X and Y. In this case, we carry out a hypothesis test using the sample {(X_i, Y_i)}_{i=1}^n, and we check whether there is any evidence that X and Y are not independent, i.e., we reject or not the null hypothesis H_0: X ⊥ Y according to whether the statistic value belongs to the rejection region. To do this, we must calibrate the test under the null hypothesis to determine what results are expected to happen with a certain probability if the null hypothesis holds. In our concrete case, we have to take into account the effects of the mechanism of missing data in the response variable Y in order to design the test and calibrate the null distribution, which is determined by the behavior of the function π(·).
For this first task, we propose a methodology based on kernel mean embeddings, which is valid when the covariate vector and the response live in a separable Hilbert space. In addition, we introduce a new bootstrap procedure to perform the test calibration, adapted to kernel mean embeddings.

• Variable selection: Consider the mean regression problem

Y = m(X) + ε,   (1)

where ε is a random error with mean zero and X = (X_1, …, X_p) is a random vector composed of p covariates. In many problems, it is important to identify the subset of covariates I ⊂ {X_1, …, X_p} that has an impact on the prediction of Y. This problem is remarkable primarily because of the following two factors: i) achieving parsimonious predictive models that generalize well to new cases; ii) discovering the genuine causes associated with diseases or patient prognosis. We will modify the algorithm of Lei Yang, Shaogao Lv, and Junhui Wang so that it remains valid when the response is missing [Yang et al., 2016].

• Prediction and inference: The ultimate goal of any predictive task is to explain the relationship between the variables Y and X, for example, according to the model defined in Equation 1. Here, with real-world data, m(·) can have any functional shape, although it is also common to restrict it to a parametric or semi-parametric form (e.g., an additive structure), or to assume that the conditional mean function lives in a smooth function space. Furthermore, it is crucial to measure the predictions' uncertainty and give a probability region containing the real value with an appropriate level of confidence. An appropriate level can be 90%, to control and secure the reliability of the results returned by the algorithm with a substantial margin of probability. In the prediction task, we will use the kernel ridge regression model proposed by Liu and Goldberg [Liu et al., 2020].
However, using the theory of linear regression, we will compute the leave-one-out cross-validation regularization parameter efficiently while taking the missing data mechanism into account. It is important to note that the regularization parameters of this class of models largely determine model performance, as evinced in some relevant recent papers [Liang et al., 2020, Hastie et al., 2019, Bartlett et al., 2020]. Additionally, using advances in conformal inference recently exploited in causal inference [Lei and Candès, 2020], we will obtain regions with good finite-sample coverage.

As for the glucose prediction clinical study case, our main contributions are the following:

• We identify several markers associated with the evolution of glucose five years ahead.

• With the aim of optimizing medical decisions, we provide a predictive algorithm to forecast expected A1C values five years ahead. We use as covariates the individual's glycemic status and other clinical variables.

• We interpret and discuss the residuals and the predictive capacity of the different models from the clinical point of view, providing interpretable clinical phenotypes for which future forecasts will have large uncertainty.
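The inverse-probability-weighted kernel ridge regression used in the prediction task can be sketched numerically. The snippet below is a simplified weighted variant in the spirit of Liu and Goldberg's estimator, not their exact procedure: π(x) is taken as known for clarity (in practice it is estimated), a Gaussian kernel is used, and the regularization parameter is fixed rather than chosen by leave-one-out cross-validation. All constants are illustrative.

```python
import numpy as np

# Hedged sketch of IPW-weighted kernel ridge regression with missing
# responses: fit only on observed pairs, each weighted by 1/pi(x_i).
# Minimizing sum_i w_i (y_i - f(x_i))^2 + lam * ||f||^2 over the RKHS
# gives, by the representer theorem, alpha = (W K + lam I)^{-1} W y.

rng = np.random.default_rng(1)


def rbf(A, B, gamma=0.5):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)


n = 400
X = rng.uniform(-2, 2, (n, 1))
y = np.sin(2 * X[:, 0]) + 0.1 * rng.normal(size=n)   # true m(x) = sin(2x)

pi = 1.0 / (1.0 + np.exp(X[:, 0] - 1.5))   # response missing more often for large x
R = rng.binomial(1, pi)

Xo, yo, wo = X[R == 1], y[R == 1], 1.0 / pi[R == 1]

lam = 1e-2
K = rbf(Xo, Xo)
W = np.diag(wo)
alpha = np.linalg.solve(W @ K + lam * np.eye(len(yo)), W @ yo)


def predict(Xnew):
    return rbf(Xnew, Xo) @ alpha


Xgrid = np.linspace(-2, 2, 50)[:, None]
err = np.mean((predict(Xgrid) - np.sin(2 * Xgrid[:, 0])) ** 2)
```

The weighting up-weights observed patients who resemble the hard-to-follow ones, which is what corrects the bias induced by covariate-dependent missingness.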
Kernel mean embeddings [Gretton et al., 2007, Muandet et al., 2016], or the equivalent distance correlation in the statistical community [Székely et al., 2007, Székely and Rizzo, 2017, Sejdinovic et al., 2013], are among the most extensive and general methodologies for testing statistical independence with complete data. Subsequently, let us introduce some elementary background on these distances/transformations before explaining the extension to the case where the response variable can be missing.

Consider an arbitrary RKHS H_X associated with the random variable X, uniquely determined by a positive definite symmetric kernel K_X: V × V → ℝ_+, with V an arbitrary set, that satisfies the following two conditions: i) K_X(·, x) ∈ H_X; ii) ⟨f, K_X(·, x)⟩ = f(x) for all f ∈ H_X. Given a V-valued random variable X with probability measure P_X, the kernel mean embedding of X is defined as the function φ_X: s ∈ V ↦ ∫_V K_X(s, x) P_X(dx) = E_{X∼P_X}(K_X(·, X)) ∈ H_X. Roughly speaking, φ_X(·) embeds the data in a new separable Hilbert space that is typically infinite-dimensional. In the following, we suppose that all kernels K(·, ·) used are characteristic, an important property that guarantees the ability to characterize independence or equality in distribution against all alternatives with this methodology. Concretely, we say that K_X(·, ·) is a characteristic kernel if φ_X(·) is injective [Simon-Gabriel and Schölkopf, 2018].

Now, let Y be another random variable that, for the sake of simplicity, we will assume to take values in ℝ, as in the problem definition above, and which we will interpret as the response variable. By definition, testing the null hypothesis H_null: X ⊥ Y is equivalent to testing H_null: P_{X,Y} = P_X P_Y, that is, that the joint probability measure is expressed as the product of the marginal measures. With this aim in mind, we introduce some extra notation.
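Before the formal derivation, the complete-data statistic that this section builds towards can be computed directly from Gram matrices. The sketch below is our own illustration with Gaussian kernels: it uses the standard biased V-statistic form HSIC = (1/n²) tr(K H L H), where H is the centering matrix; the missing-response version developed in the text replaces the uniform 1/n weights by normalized inverse-probability weights.

```python
import numpy as np

# Illustrative computation of the (biased, complete-data) HSIC statistic
# from Gram matrices: HSIC = (1/n^2) * trace(K H L H), with H = I - 11^T/n
# the centering matrix. Larger values indicate stronger dependence.

rng = np.random.default_rng(2)


def rbf_gram(x, gamma=1.0):
    d2 = (x[:, None] - x[None, :]) ** 2
    return np.exp(-gamma * d2)


def hsic(x, y):
    n = len(x)
    K, L = rbf_gram(x), rbf_gram(y)
    H = np.eye(n) - np.ones((n, n)) / n   # centering matrix
    return np.trace(K @ H @ L @ H) / n**2


n = 300
x = rng.normal(size=n)
y_indep = rng.normal(size=n)              # independent of x
y_dep = x**2 + 0.1 * rng.normal(size=n)   # non-linear function of x

stat_indep = hsic(x, y_indep)
stat_dep = hsic(x, y_dep)
```

Note that the dependence in `y_dep` is non-linear and uncorrelated with x, which is exactly the kind of relation a characteristic kernel still detects.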
We denote by φ_Y(·) and φ_{X,Y}(·) the kernel mean embeddings of Y and of the bivariate random variable (X, Y), the latter depending on the kernel K_{(X,Y)}. We must note, firstly, that X and Y can have different dimensions. In addition, it is natural to consider on V × ℝ the RKHS H_X ⊗ H_Y, where ⊗ denotes the tensor product, and where it makes sense to define the global kernel as K_{(X,Y)}((x, y), (x′, y′)) = K_X(x, x′) K_Y(y, y′) for all (x, x′) ∈ V × V and (y, y′) ∈ ℝ × ℝ. Then, a natural way of testing independence is to measure the distance between the functions φ_{X,Y}(·) and φ_X(·) ⊗ φ_Y(·). More specifically, we define the Hilbert-Schmidt independence criterion (HSIC) between P_{X,Y} and P_X P_Y as

HSIC(P_{X,Y}, P_X P_Y) = ||φ_{X,Y} − φ_X ⊗ φ_Y||²_{H_X ⊗ H_Y}.   (2)

Expanding Equation 2, we have

HSIC(P_{X,Y}, P_X P_Y) = ⟨φ_{X,Y} − φ_X ⊗ φ_Y, φ_{X,Y} − φ_X ⊗ φ_Y⟩_{H_X ⊗ H_Y}
= ⟨φ_{X,Y}, φ_{X,Y}⟩_{H_X ⊗ H_Y} + ⟨φ_X ⊗ φ_Y, φ_X ⊗ φ_Y⟩_{H_X ⊗ H_Y} − 2⟨φ_{X,Y}, φ_X ⊗ φ_Y⟩_{H_X ⊗ H_Y},   (3)

and using properties of RKHS and Fubini's theorem, we get
SIC ( P X,Y , P X P Y ) == E ( X,Y,X (cid:48) ,Y (cid:48) ) ( K X ( X, X (cid:48) ) K Y ( Y, Y (cid:48) ))+ E ( X,X (cid:48) ) ( K X ( X, X (cid:48) )) E ( Y,Y (cid:48) ) ( K Y ( Y, Y (cid:48) )) − E ( X,Y ) ( E X (cid:48) ( K X ( X, X (cid:48) )) E Y (cid:48) ( K Y ( Y, Y (cid:48) ))) . (4)Here X (cid:48) , Y (cid:48) are iid. copies of random variables X, Y .To understand the HSIC procedure well, it is essential to remark that thisprocedure consists only of calculating the squared distance between two meanfunctions in the appropriate RKHS space, which was transformed into originaldata to capture all distributional differences between the involved randomvariables.In practice, only a sample { ( X i , Y i ) } ni =1 is observed. Therefore, we mustreplace the population mean by sample mean defined through its empiricaldistribution. Then, the empirical estimator of the Hilbert-Schmidt independencecriterion estimator is given by (cid:92) HSIC ( ˆ P X,Y , ˆ P X ˆ P Y ) == 1 n n (cid:88) i =1 n (cid:88) j =1 ( K X ( X i , X j ) K Y ( Y i , Y j ))+ 1 n n (cid:88) i =1 n (cid:88) j =1 K X ( X i , X j ) n (cid:88) i =1 n n (cid:88) i =1 n (cid:88) j =1 K Y ( Y i , Y j ) − n n (cid:88) i =1 n (cid:88) j =1 n (cid:88) k =1 K X ( X i , X j ) K Y ( Y i , Y k ) . (5)With MNAR data, we observe { ( X i , Y i , R i ) } ni =1 and we have to estimate themissing data mechanism, that is given by the function π ( · ) = P ( R = 1 | X = · ).Several procedures were proposed in the literature for this aim such as logisticregression, lasso, random forest, or a model ensemble as Super Learner amongothers [Van der Laan et al., 2007]. Afterwards, we re-weight the dataset, takinginto account how difficult it is to observe the response of the i th datum. Inparticular, we define the weight associated with the i th datum w i via inverseprobability weighting (IPW) estimator, as w i = R i nπ ( X i ) ( i = 1 , · · · , n ) . 
(6)We define the normalized-weight of w i , as w ∗ i = w i (cid:80) ni =1 w i ( i = 1 , · · · , n ) . (7)We denoted the estimated i th-weigh as ˆ w i and ˆ w ∗ i respectively, after estimateˆ π ( · ). 11/38o get an estimator of HSIC with missing data, it is enough to replace theuniform weight n of the empirical distribution with the normalized weightsˆ W ∗ = ( ˆ w ∗ , · · · , ˆ w ∗ n ) in the Equation 5. Concretely, we have, (cid:92) HSIC ( ˆ P X,Y , ˆ P X ˆ P Y ) = n (cid:88) i =1 n (cid:88) j =1 ˆ w ∗ i ˆ w ∗ j ( K X ( X i , X j ) K Y ( Y i , Y j ))+ n (cid:88) i =1 n (cid:88) j =1 ˆ w ∗ i ˆ w ∗ j K X ( X i , X j ) n (cid:88) i =1 n (cid:88) j =1 ˆ w ∗ i ˆ w ∗ j K Y ( Y i , Y j ) − n (cid:88) i =1 n (cid:88) j =1 n (cid:88) k =1 ˆ w ∗ i ˆ w ∗ j ˆ w ∗ k K X ( X i , X j ) K Y ( Y i , Y k ) . (8)Calibration under the null hypothesis with the precedent statistic is nottrivial, and the permutation approach is generally not valid. We propose abootstrap approach for this case based on simple Efron’s bootstrap [Efron andTibshirani, 1994], which remains valid with glucodensities: several bootstrapprocedures cannot deal with complex constrained distributional objects that donot live in vector spaces.Under the null hyphotesis H null : P X,Y = P X P Y , then φ X,Y ( · ) − φ Y ( · ) ⊗ φ X ( · ) = 0( · ). So, (cid:92) HSIC ( ˆ P X,Y , ˆ P X ˆ P Y ) = < ˆ φ X,Y − ˆ φ Y ⊗ ˆ φ X , ˆ φ X,Y − ˆ φ Y ⊗ ˆ φ X > H X ⊗ H Y = < ˆ φ X,Y − φ X,Y + φ Y ⊗ φ X − ˆ φ Y ⊗ ˆ φ X , ˆ φ X,Y − φ X,Y + φ Y ⊗ φ X − ˆ φ Y ⊗ ˆ φ X > H X ⊗ H Y . (9)Then, a natural bootstrap procedure that allows estimating the p -value of theindependence testing problem can be as follows:1. Select randomly with replacement and equal probability n elements fromthe original sample D = { ( X i , Y i , R i ) } ni =1 m times. We denote by D j ∗ = { ( X j ∗ i , Y j ∗ i , R j ∗ i ) } ni =1 ( j = 1 , · · · m ), the j th random sample obtained.2. 
Calculate \(\widehat{\mathrm{HSIC}}^{j*}(\hat P_{X,Y}, \hat P_X \hat P_Y)\) as

\[
\widehat{\mathrm{HSIC}}^{j*}(\hat P_{X,Y}, \hat P_X \hat P_Y) = \langle \hat\phi_{X,Y} - \hat\phi^{j*}_{X,Y} + \hat\phi^{j*}_Y \otimes \hat\phi^{j*}_X - \hat\phi_Y \otimes \hat\phi_X,\; \hat\phi_{X,Y} - \hat\phi^{j*}_{X,Y} + \hat\phi^{j*}_Y \otimes \hat\phi^{j*}_X - \hat\phi_Y \otimes \hat\phi_X \rangle_{\mathcal H_X \otimes \mathcal H_Y} \quad (j = 1,\dots,m), \tag{10}
\]

where φ̂^{j*}_{X,Y}(·), φ̂^{j*}_X(·), and φ̂^{j*}_Y(·) are the kernel mean embeddings estimated with the bootstrap sample D^{j*} = {(X_i^{j*}, Y_i^{j*}, R_i^{j*})}_{i=1}^n.

3. Estimate the p-value as

\[
\text{p-value} = \frac{1}{m}\sum_{j=1}^m I\big(\widehat{\mathrm{HSIC}}^{j*}(\hat P_{X,Y}, \hat P_X \hat P_Y) \ge \widehat{\mathrm{HSIC}}(\hat P_{X,Y}, \hat P_X \hat P_Y)\big). \tag{11}
\]

Using standard tools of empirical process theory [Van der Vaart and Wellner, 1996, Van de Geer, 2000], we can establish the bootstrap's consistency with missing data in this framework. We give specific details in the Appendix, Section 5.

Learning in RKHS allows the detection of complex dependence relations between X = (X_1,…,X_p) and Y. In particular, given the regression model defined in Equation 1, we can select the truly influential variables using the methods provided in this framework, despite the fact that, as in most modern applications, we have no prior information about the functional form of m(·). Suppose that m(·) is a differentiable function and let g(x) = ∇m(x) = (∂m/∂x_1(x),…,∂m/∂x_p(x)) be its gradient evaluated at the point x ∈ V. We know that x_i in Equation 1 is an irrelevant predictor if ∂m/∂x_i(x) = 0 for all x ∈ V; namely, if the partial derivative along direction i is null, then the i-th covariate does not provide information about the geometry of the prediction function m(·).
Then a naïve approach to select the best subset of variables non-parametrically is to learn the gradient and use the norm of that estimate as a criterion. We note that there exists vast research on finding the best subset of covariates in the setup of the linear regression model, which is currently an active research area; see, for example, the following pieces of contemporary work [Hazimeh and Mazumder, 2018, Bertsimas et al., 2016].

In a small neighbourhood A_x of x, we know by virtue of Taylor's theorem that Y(x) = m(x) + ε ≈ m(z) + g(z)^T(x − z) + ε for all z ∈ A_x. As m(z) + ε = Y(z), we obtain Y(x) ≈ Y(z) + g(z)^T(x − z), or Y(x) − Y(z) ≈ g(z)^T(x − z), which implies E(Y(x) − Y(z)) = g(z)^T(x − z) for all z ∈ A_x if the volume of A_x is small enough. Then the squared error of the true gradient function g(·) can be constructed as

\[
\int_{(V\times\mathbb R)\times(V\times\mathbb R)} \omega(x,z)\,\big(y - v - g(z)^T(x - z)\big)^2 \, dP_{X,Y}(dx,dy)\, dP_{X',Y'}(dz,dv), \tag{12}
\]

where ω(x,z) is an appropriate weight function that is zero when z is not in the same neighborhood as x. In addition, X', Y' denote independent random variables distributed as X, Y, respectively.

With the loss function defined in Equation 12 in mind, the empirical estimator is naturally defined as

\[
\hat g(\cdot) = \operatorname*{arg\,min}_{g = (g_1,\dots,g_p) \in \mathcal H \times \cdots \times \mathcal H} \; \sum_{i=1}^n\sum_{j=1}^n \omega(X_i,X_j)\big(Y_i - Y_j - g(X_i)^T(X_i - X_j)\big)^2 + \lambda \sum_{k=1}^p \lVert g_k \rVert_{\mathcal H}. \tag{13}
\]

Here, H denotes the selected RKHS for each coordinate of the gradient we aim to learn, and λ is the smoothing parameter that, if appropriately selected, avoids over-fitting in the modeling task.

If we observe Equation 13, the objective function is convex and, by definition, the solution of the minimization problem lies inside the Cartesian product of RKHSs, which is itself an RKHS.
Then we can apply the classical representer theorem [Schölkopf et al., 2001] to guarantee that the solution for each component of the gradient is defined explicitly by ĝ_k(x) = Σ_{j=1}^n α̂_{kj} K(X_j, x) (k = 1,…,p), where the α̂_{kj} are real numbers. Replacing this functional form of the solution in Equation 13, the optimization problem becomes

\[
\hat\alpha = \operatorname*{arg\,min}_{\alpha = (\alpha_{kj}) \in \mathbb R^{p\times n}} \; \sum_{i=1}^n\sum_{j=1}^n \omega(X_i,X_j)\Big(Y_i - Y_j - \sum_{k=1}^p \sum_{l=1}^n \alpha_{kl}\, K(X_l, X_i)\,\big(X_i^{(k)} - X_j^{(k)}\big)\Big)^2 + \lambda \sum_{k=1}^p \Big(\sum_{i=1}^n\sum_{j=1}^n \alpha_{ki}\, K(X_i, X_j)\,\alpha_{kj}\Big)^{1/2}, \tag{14}
\]

where X_i^{(k)} denotes the k-th coordinate of X_i. Again, we can adapt the algorithm to MNAR data with the IPW estimator. In particular, we have

\[
\hat\alpha = \operatorname*{arg\,min}_{\alpha = (\alpha_{kj}) \in \mathbb R^{p\times n}} \; \sum_{i=1}^n\sum_{j=1}^n \hat w_i^* \hat w_j^*\, \omega(X_i,X_j)\Big(Y_i - Y_j - \sum_{k=1}^p \sum_{l=1}^n \alpha_{kl}\, K(X_l, X_i)\,\big(X_i^{(k)} - X_j^{(k)}\big)\Big)^2 + \lambda \sum_{k=1}^p \Big(\sum_{i=1}^n\sum_{j=1}^n \alpha_{ki}\, K(X_i, X_j)\,\alpha_{kj}\Big)^{1/2}, \tag{15}
\]

where ŵ_i^* denotes the i-th normalized weight according to the missing-data mechanism; see Equation 7 for more details.

In practice, the smoothing parameter is selected via cross-validation, and the preceding optimization problem is solved with specific gradient optimization techniques for the group lasso [Ida et al., 2019, Yang and Zou, 2015], taking into account the multidimensional block structure of the α̂_{kj} (k = 1,…,p): if a whole block vanishes, that is, α̂_{kj} = 0 for all j = 1,…,n, then ĝ_k(·) = 0. For specific details about this procedure, we refer the reader to the original paper, which concerns complete data.

Given a sample {(X_i, Y_i)}_{i=1}^n, linear ridge regression is based on solving the following optimization problem:

\[
\hat\beta = \operatorname*{arg\,min}_{\beta\in\mathbb R^p} \sum_{i=1}^n \big(Y_i - \langle X_i, \beta\rangle\big)^2 + \lambda \lVert\beta\rVert^2, \tag{16}
\]

whose solution is β̂ = (X^T X + λI)^{-1} X^T Y, where X = (X_1,…,X_n)^T, Y = (Y_1,…,Y_n)^T, and λ > 0. Let H_X be an RKHS with kernel K_X(·,·) = ⟨φ(·), φ(·)⟩.
Then, if in Equation 16 we transform each X_i into φ(X_i) (i = 1,…,n) and suppose that β = Σ_{i=1}^n α_i φ(X_i), we can obtain a solution to the ridge problem with a similar structure, only changing the usual dot product to the inner product of the selected RKHS. In particular, we have α̂ = (K + λI)^{-1} Y, where

\[
K = \begin{pmatrix} K_X(X_1,X_1) & \cdots & K_X(X_1,X_n) \\ \vdots & \ddots & \vdots \\ K_X(X_n,X_1) & \cdots & K_X(X_n,X_n) \end{pmatrix}.
\]

Liu and Goldberg [Liu et al., 2020] propose two different estimators for the setting in which the response is missing. In both cases, the solution has the same closed-form expression, given by the representer theorem [Schölkopf et al., 2001] in the usual way. First, they handle the missing-data mechanism via the IPW estimator and obtain α̂ = (λI + KW)^{-1} W Y. Second, through doubly robust estimation, which combines a preliminary imputation of the missing responses with the IPW estimator, they get

\[
\hat\alpha = (K + \lambda I)^{-1}\big(W Y + (I - W)\,\mu(X)\big), \tag{17}
\]

where W = diag(w_1,…,w_n) is a diagonal matrix containing the weights w_i (see Equation 6), and μ(X) = (μ(X_1),…,μ(X_n))^T, with μ(·) denoting the imputation function.

Doubly robust estimators have optimal asymptotic variance when both the weights w_i and the imputation model are correctly specified, and only one of the two needs to be correctly specified to achieve consistency. However, if either of them fails, the model performance can deteriorate dramatically in finite samples [Kang et al., 2007]. When both models are misspecified, no gain may be obtained from this more sophisticated approach [Vermeulen and Vansteelandt, 2015, Kang et al., 2007].

An essential issue in the performance of these models, one that has received considerable attention in recent years, is the impact of the smoothing parameter on the generalization of the models, which is strongly connected with the problem of minimum-norm interpolation in RKHS.
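A minimal sketch of the two closed-form estimators above (IPW and doubly robust kernel ridge). The Gram matrix K, the weights, and the imputation vector mu are assumed to be supplied by the user; this is an illustration of the formulas, not the authors' implementation.

```python
import numpy as np

def krr_ipw(K, Y, w, lam):
    """IPW kernel ridge: alpha = (lam*I + K W)^{-1} W Y, with W = diag(w_i),
    w_i = R_i / (n * pi_hat(X_i)). Entries of Y with R_i = 0 are multiplied
    by zero, so they may hold any placeholder value."""
    n = len(Y)
    W = np.diag(w)
    return np.linalg.solve(lam * np.eye(n) + K @ W, W @ Y)

def krr_dr(K, Y, w, mu, lam):
    """Doubly robust variant of Eq. (17):
    alpha = (K + lam*I)^{-1} (W Y + (I - W) mu)."""
    n = len(Y)
    W = np.diag(w)
    return np.linalg.solve(K + lam * np.eye(n), W @ Y + (np.eye(n) - W) @ mu)

def krr_predict(K_new, alpha):
    """m_hat at new points, given their kernel evaluations against training X."""
    return K_new @ alpha
```

When no response is missing and w_i = 1 for all i, the doubly robust solution reduces to the standard kernel ridge solution (K + λI)^{-1} Y, irrespective of the imputation function.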
In our setting, we have selected the smoothing parameter through leave-one-out cross-validation, computing its explicit expression using linear regression theory together with the missing-data formulae of the estimators. Further details about recent theoretical advances in this field are available in the following papers [Liang et al., 2020, Hastie et al., 2019, Bartlett et al., 2020].

To end this subsection, we introduce a specific algorithm to perform conformal inference, based on [Lei and Candès, 2020], that provides an interval containing the response of a new observation X_{n+1} with confidence level 1 − α. We randomly split the data {(X_i, Y_i, R_i)}_{i=1}^n, without replacement, into two samples X_training = {(X_i^training, Y_i^training, R_i^training)}_{i=1}^{n_1} and X_test = {(X_i^test, Y_i^test, R_i^test)}_{i=1}^{n_2} of sizes n_1 and n_2, respectively, with n_1 + n_2 = n. The steps of the algorithm are summarized below:

1. Using X_training, fit the mean regression function m̂(·) according to the method provided in Equation 17, and estimate the missing-data mechanism π̂^training(·).

2. For all i = 1,…,n_2 with R_i^test = 1, define the following non-conformity measure: ε̂_i = |Y_i^test − m̂(X_i^test)|.
3. Estimate the empirical distribution F̂_{ε,n+1} using the previous residuals, representing the theoretical residual of observation X_{n+1} with the artificial value of infinity. For this task, we use the IPW estimate with the weights defined in Equations (6, 7) and the function π̂^training(·) calculated in Step 1, where we must also incorporate the weight of X_{n+1}, ŵ_{n+1}.

4. With F̂_{ε,n+1}, calculate the 1 − α quantile, which we denote by q̂_{1−α}.

5. Return [m̂(X_{n+1}) − q̂_{1−α}, m̂(X_{n+1}) + q̂_{1−α}] as the desired interval.

RKHS modeling is a powerful data-analysis paradigm that allows the efficient simultaneous analysis of data of different natures [Borgwardt et al., 2006]. To do this, the critical point is to select a suitable kernel that accurately captures the differences and specific characteristics of each of the information sources examined. In our particular case, we have a continuous probability distribution, multidimensional data, and categorical data, X = (X_gluco, X_mult, X_categ) ∈ V_gluco × V_mult × V_categ. We know that the Gaussian kernel with the standard Euclidean distance is a characteristic and universal kernel for vector-valued real data. Moreover, we can show that the Gaussian kernel preserves the properties mentioned above when considering the set of continuous density functions endowed with the L²-Wasserstein geometry. Under these conditions, we have theoretical guarantees that we can approximate a large variety of functional forms m(·) in each individually fitted model. However, we want to detect possible interactions between the different data sources, and for this we must build a proper global kernel.
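Before turning to the kernel construction, the split-conformal procedure in Steps 1-5 above can be sketched as follows. The residuals, IPW weights, and point prediction are assumed to come from Steps 1-2; the weighted-quantile construction is one concrete reading of Steps 3-4.

```python
import numpy as np

def conformal_interval(resid, w_obs, w_new, m_new, alpha=0.1):
    """Split-conformal interval from an IPW-weighted residual distribution.
    resid: non-conformity scores |Y_i - m_hat(X_i)| on observed test points;
    w_obs: their (unnormalized) IPW weights; w_new: weight of X_{n+1};
    m_new: point prediction m_hat(X_{n+1})."""
    # the unknown residual of X_{n+1} is represented by +infinity (Step 3)
    r = np.append(resid, np.inf)
    p = np.append(w_obs, w_new)
    p = p / p.sum()                      # normalization, as in Eq. (7)
    order = np.argsort(r)
    cum = np.cumsum(p[order])
    # weighted (1 - alpha)-quantile of the residual distribution (Step 4)
    q = r[order][np.searchsorted(cum, 1.0 - alpha)]
    return m_new - q, m_new + q          # Step 5
```

With uniform weights this reduces to the usual split-conformal quantile; concentrating weight on the new point widens the interval, up to an infinite one.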
Based on the connection between positive definite kernels and metrics of negative type [Lyons et al., 2013, Berg et al., 1984, Sejdinovic et al., 2013], we know some properties that allow us to build a simple kernel integrating the three sources. To this end, in our setting, we can define a global Gaussian kernel as

\[
K_X\big(x = (x_{gluco}, x_{mult}, x_{categ}),\, y = (y_{gluco}, y_{mult}, y_{categ})\big) = \exp\Big(-\Big(a\,\frac{\lVert x_{gluco} - y_{gluco}\rVert^2}{\sigma_{gluco}^2} + b\,\frac{\lVert x_{mult} - y_{mult}\rVert^2}{\sigma_{mult}^2} + c\,\frac{\lVert x_{categ} - y_{categ}\rVert^2}{\sigma_{categ}^2}\Big)\Big) \quad \forall x, y \in V, \tag{18}
\]

where σ_gluco, σ_mult, σ_categ > 0 and (a, b, c) ∈ {(a, b, c) ∈ ℝ³ : a + b + c = 1 and 0 ≤ a, b, c ≤ 1}. A more refined variety of strategies to build a global kernel can be found in [Gönen and Alpaydın, 2011].

It is well known that, among a family of kernels with the property of being characteristic or universal [Simon-Gabriel and Schölkopf, 2018], kernel parameter tuning is more sensitive than the choice of kernel itself [Schölkopf et al., 2002] in different modeling tasks. For this reason, we have restricted ourselves to the Gaussian kernel. Besides, there is an available rule to select the bandwidth parameter σ > 0. Let {X_i = (X_i^gluco, X_i^mult, X_i^categ)}_{i=1}^n be the sample, and build the kernel matrix (K)_{i,j} = K(X_i, X_j), i, j = 1,…,n, where K(X_i, X_j) = exp(−‖X_i − X_j‖²/σ²). The heuristic median rule selects σ > 0 as

\[
\sigma = \sqrt{\operatorname{median}\{\lVert X_i - X_j\rVert^2 : 1 \le i < j \le n\}}. \tag{19}
\]

In the sense of [Reddi et al., 2014], we suggest finding the optimal kernel bandwidth parameter σ* in a grid of points of the form σ* = σγ, with γ ranging over a grid of positive values. The same strategy applies to σ_gluco, σ_mult, and σ_categ, and to the choice of the weights a, b, c > 0.

The AEGIS population study, conducted in the Spanish town of A Estrada (Galicia), aims to analyze the steady evolution of different clinical features, such as longitudinal changes in circulating glucose, in 1516 patients over 10 years.
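Before describing the data, here is a minimal sketch of the global kernel (18) and the median heuristic (19). The per-source distance matrices (e.g., 2-Wasserstein for glucodensities, Euclidean for the multivariate block, a discrete 0/1 metric for the categorical block) are assumptions made for illustration.

```python
import numpy as np

def median_heuristic(D):
    """Bandwidth sigma = sqrt(median of squared pairwise distances), Eq. (19);
    D is an n x n matrix of pairwise distances."""
    iu = np.triu_indices_from(D, k=1)
    return np.sqrt(np.median(D[iu] ** 2))

def global_gaussian_kernel(Dg, Dm, Dc, a, b, c, sg, sm, sc):
    """Global Gaussian kernel of Eq. (18): a convex combination (a + b + c = 1)
    of the three normalized squared distances inside a single exponential.
    Dg, Dm, Dc: pairwise distance matrices for the glucodensity, multivariate,
    and categorical blocks; sg, sm, sc: their bandwidths."""
    assert abs(a + b + c - 1.0) < 1e-9 and min(a, b, c) >= 0.0
    expo = a * Dg**2 / sg**2 + b * Dm**2 / sm**2 + c * Dc**2 / sc**2
    return np.exp(-expo)
```

If each distance is of negative type (as for Euclidean distances and, for distributions on the real line, the 2-Wasserstein distance), each factor is a valid Gaussian kernel and their product remains positive definite on the product space.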
In addition, non-routine medical tests such as continuous glucose monitoring are performed every five years on a randomized subset composed of 581 patients. Table 1 shows the basal characteristics of the 581 continuously glucose-monitored patients grouped by sex. The collected variables include age, glycosylated hemoglobin (A1C), fasting plasma glucose (FPG), insulin resistance (HOMA-IR), and body mass index (BMI), along with glycemic variability metrics: continuous overall net glycemic action (CONGA), mean amplitude of glycemic excursions (MAGE), and mean of the daily differences (MODD). As noted above, A1C and FPG are the popular variables for diabetes diagnosis and control in the standard clinical routine. HOMA-IR, or insulin resistance, is an essential variable that is strongly connected to different cellular mechanisms of diabetes development [Rehman and Akash, 2016]. Body mass index is a global variable of health status related to all causes of mortality, the progression of diseases, complications, and the onset of a wide spectrum of diseases such as metabolic syndrome or diabetes. Finally, the different glucose variability metrics examined capture aspects of the oscillation of glucose values along different periods. Glucose variability is a representative characteristic of glucose metabolism, being the third component of dysglycemia [Monnier et al., 2008]. Specific details about how the laboratory measurements and continuous glucose monitoring were performed can be found in [Gude et al., 2017, Matabuena et al., 2020].

[Table 1 reports mean ± standard deviation of Age (years), A1C (%), FPG (mg/dL), HOMA-IR (µUI/mL), BMI (kg/m²), CONGA (mg/dL), MODD, and MAGE (mg/dL) for Men (n = 220) and Women (n = 361); the numeric entries are not recoverable from the source.]

Table 1: Characteristics of AEGIS study participants with CGM monitoring by sex. Mean and standard deviation are shown. BMI: body mass index; FPG: fasting plasma glucose; A1C: glycated haemoglobin; HOMA-IR: homeostasis model assessment-insulin resistance; CONGA: glycemic variability in terms of continuous overall net glycemic action; MODD: mean of daily differences; MAGE: mean amplitude of glycemic excursions.

We address the main challenge of forecasting the relationship between the predictors and the outcome, A1C, five years ahead. Forty percent of the data for this variable are missing. Therefore, we have to use specific missing-data techniques that limit biases, so that the obtained results can be widely generalized and reproduced in other study populations.
With the variables in Table 1, together with the CGM data collected via the glucodensity representation [Matabuena et al., 2020] (Figure 1), we attempt to answer the following open clinical problems:

1. We want to study whether there exists a statistical association between each predictor and A1C five years ahead as the outcome. For this purpose, we use the Hilbert-Schmidt independence criterion that we propose for the missing-data context in Section 2.1, together with a specific bootstrap approach that we design for this task. Moreover, we study the clinical relevance of the marginal associations with a bidimensional plot.

2. We select non-parametrically the subset of variables (Section 2.2) that best explains glucose changes, using the vector-valued data of Table 1. With this analysis, we can obtain new clinical knowledge about the role of this diabetes biomarker in changes of glucose metabolism.

3. We assess the impact of introducing CGM data, via glucodensities, on the models' predictive capacity. For this purpose, we fit two kernel ridge regression models (Section 2.3): one that includes glucodensities and one that does not. For patients whose residuals are large, we use conformal inference to measure the uncertainty and characterize patient phenotypes.

Results
Our aim is to study whether there is any evidence of univariate statistical association, for normoglycemic patients, between the glucose variation measured by A1C(5 years) − A1C(initial) and the variables shown in Table 1. In this particular case, the underlying missing-data mechanism is estimated using univariate logistic regression.

[Figure: scatter plots of the difference A1C(5 years) − A1C(initial) against Age, FPG, HOMA-IR, BMI, A1C, CONGA, MAGE, MODD, and Sex.]

Results in Table 2 show that the only statistically significant variables, with a p-value less than 5%, are the glucodensities and basal A1C. The plots above illustrate that the dependency relations among the vectorial variables are weak, if any of them hold at all. Next, we use multivariate models that exploit potential interactions between variables to improve the association with changes in the A1C variable.

[Table 2 lists the p-value of the marginal independence test for each variable; the numeric entries are not recoverable from the source.]

We have adjusted the model defined in Section 2.2, seeking the subset of variables most strongly associated with A1C values five years ahead. For this purpose, we used 581 patients, and we considered all the variables in Table 1 except sex. In order to avoid overfitting and to improve the reproducibility of the results, we select the model parameters using cross-validation. Finally, we estimate the underlying missing-data mechanism via lasso logistic regression.

In this case, the variables selected by the algorithm are: Age, A1C, FPG, BMI, and MAGE. It is essential to note that the procedure detected higher-order interactions between the covariates and the response variable from a multivariate perspective. Moreover, in contrast to the previous section, both diabetic and non-diabetic patients have been analyzed.
We fit two kernel ridge regression models with the goal of predicting A1C five years ahead. The first includes the non-CGM variables A1C, FPG, Age, and BMI as covariates, and the second also incorporates CGM data via glucodensities. Kernel selection and parameter tuning were calibrated following the indications in Sections 2.4 and 2.5. The R² values of the two models, calculated according to the missing-data mechanism via leave-one-out cross-validation, are 0.[…] and 0.66, respectively. In Figure 2, we plot the residuals against each value of basal A1C. In general, the largest residuals are found in diabetic patients, while in the other cases the distribution of the individual residuals is heterogeneous, depending on the patients' characteristics. These results show that, when the glucodensity is introduced into the models, CGM information can provide valuable extra knowledge on long-term glucose changes.

Figure 3 depicts confidence intervals at a confidence level of 90% after applying the conformal inference methodology to measure the uncertainty of the predictions performed by the regression model.

[Figure: scatter plot of the residuals against the initial A1C values.]

Figure 2: Residuals vs. A1C-initial for the model that includes glucodensity as a covariate.

Below, we consider that a patient has considerable uncertainty in their A1C prediction if the interval length is greater than 1. In this case, we can characterize clinical features that allow assigning a patient to high- or low-variability groups based on the uncertainty of future glucose values. In particular, following Figure 4, we see that if a patient was diagnosed with diabetes before the basal period, there is essential uncertainty in their future glucose values. The same happens if the patient has elevated HOMA-IR, overweight, or advanced age.
[Figure: uncertain predictions of A1C at 5 years vs. A1C-initial, with conformal prediction intervals.]

Figure 3: Conformal inference intervals of the kernel ridge regression predictions for each observed response of A1C in the AEGIS database, at a 90% nominal coverage level.
The incidence and proliferation of diabetes is one of the most critical public health problems in the world [Ginter and Simko, 2013]. With the purpose of gaining new clinical knowledge in this field and supporting medical decision-making, the relationship between the patients' basal characteristics at the start of the AEGIS study and five-year A1C values has been modeled. First, we have identified several biomarkers associated with five-year glucose variations. Second, we have analyzed the predictive capacity of two nonlinear regression models to forecast A1C, showing the advantages of CGM technology for this predictive task. Finally, we have identified the clinical characteristics of patients for whom the fits are unacceptable from a clinical point of view. In particular, we identify some patient phenotypes that need to be tracked with more attention, and we describe them easily using routine biomarkers of standard clinical practice.

Figure 4: Collection of clinical decision rules that allow us to identify whether a patient is going to have substantial uncertainty in their A1C predictions. We establish that there is much uncertainty in the patient's glucose changes if the prediction's conformal interval is longer than 1.

In order to improve clinical decisions, such as the design of optimal dynamic interventions [Tsiatis, 2019], it is essential to design tools that quantify the future state of a patient's homeostasis based on his or her current glycemic profile and other clinical variables. This problem is also fundamental to the identification of patients at risk of developing diabetes, and to the detection of risk situations or other complications such as retinopathy. Here we have seen that we can predict this relationship acceptably five years ahead, measuring A1C with a sample that includes diabetic and non-diabetic patients. However, in some patients the discrepancies are significant.
Changes in a patient's body composition, pharmacological treatments, lifestyle (such as physical activity patterns or diet), or disease development over these five years could partially explain these changes from a biological point of view.

Our phenotypes show that insulin resistance, a previous diabetes diagnosis, and overweight explain whether the predictions show a considerable level of uncertainty according to our model. From a practical point of view, this means that there is more significant variability in glucose changes five years ahead in these patients and, as a result, their future glycemic status is uncertain. It is advisable to perform more routine follow-ups of this group of patients, and the interventions performed should have a personalized focus based on their dynamic evolution under treatment.

The R² value obtained when predicting A1C via leave-one-out cross-validation with the introduction of CGM information is similar to that reported by other authors, even though some of them predict these biomarkers in the short term [Gaynanova et al., 2020, Zaitcev et al., 2020]. Introducing continuous glucose monitoring information through the concept of glucodensity [Matabuena et al., 2020] may provide extra information on glucose fluctuations and more accuracy in the prediction of A1C. However, other studies do not use a random sample, keep the patients under standardized conditions, or analyze different target populations; consequently, a direct and accurate comparison between clinical findings and studies is not an easy issue.

Using statistical dependence measures with normoglycemic patients to test the marginal association of biomarkers with A1C demonstrates that we must use multivariate models to capture the complexity of long-term glucose changes. In this sense, we can naturally introduce several sources of information into the models simultaneously with kernel ridge regression or other RKHS techniques such as the HSIC dependence measure.
These data sources can include glucose profiles through glucodensities or other sources of information not considered in the present work, such as omics data, which may have a significant impact on glucose changes [Gou et al., 2020]. We can detect higher-order interactions between the model variables as well.

From a methodological perspective, new extensions of the proposed models arise naturally, such as doubly robust approaches and longitudinal models that dynamically allow the introduction of changes in patient condition (in body mass index or other relevant variables) and update the model predictions in real time. At the same time, in order to improve model performance, an exciting topic for future work is to explore the possibility of fitting some model parameters with powerful machine learning techniques that combine models, such as the Super Learner [Van der Laan et al., 2007], to create specific semi-parametric models for this domain [Tsiatis, 2007], or to use other distances/kernels in our RKHS framework.

This paper predicts how much the primary variable in diabetes diagnosis and control changes. However, glucose metabolism is very complex, and other decision criteria derived from CGM data, capturing aspects of glucose metabolism that go beyond the glucose mean, may be used in standard clinical routines [Hirsch and Brownlee, 2010, Group, 2018]. In this sense, predicting changes in glucodensities five years ahead using the baseline data of the AEGIS study would be exciting, and it could provide a clear picture of the evolution of glucose values over time at the distributional level.

Conclusion and practical implications
This work proposed a new framework of data analysis based on RKHS learning when some response entries are missing. This situation is commonplace in medical studies when assessing patients' evolution over different periods. Our new tools allow testing statistical independence, selecting variables, predicting, and making inferences about the predictions in the context of missing responses. As a relevant application example, we have illustrated the usefulness of these methods for predicting glucose progression five years ahead, including the novel introduction of continuous glucose monitoring as a predictor. The results show that predicting the evolution of glucose homeostasis with continuous monitoring can provide more information than the widely used classic diabetes biomarkers. Our predictive model can support clinical medical decisions by identifying patients at risk of developing diabetes or complications, in those groups of patients where the model uncertainty is low. Finally, we have characterized the phenotype of patients with large discrepancies between real and predicted values using easily measured variables. We must plan more personalized follow-ups for these patients, taking into account dynamic changes in their conditions.
Acknowledgment
This work has received financial support from the Instituto de Salud Carlos III (ISCIII), Grant/Award Number: PI16/01395; the Ministry of Economy and Competitiveness (SPAIN), European Regional Development Fund (FEDER); the Axencia Galega de Innovación, Consellería de Economía, Emprego e Industria, Xunta de Galicia, Spain, Grant/Award Number: GPC IN607B 2018/01; the Spanish Ministry of Science, Innovation and Universities (grant RTI2018-099646-B-I00); and the Galician Ministry of Education, University and Professional Training (grant 2019-2022 ED431G-2019/04).
References
Arcones and Gine, 1992. Arcones, M. A. and Gine, E. (1992). On the bootstrap of U and V statistics. The Annals of Statistics, pages 655–674.
Bang and Robins, 2005. Bang, H. and Robins, J. M. (2005). Doubly robust estimation in missing data and causal inference models. Biometrics, 61(4):962–973.
Bartlett et al., 2020. Bartlett, P. L., Long, P. M., Lugosi, G., and Tsigler, A. (2020). Benign overfitting in linear regression. Proceedings of the National Academy of Sciences.
Battelino et al., 2019. Battelino, T., Danne, T., Bergenstal, R. M., Amiel, S. A., Beck, R., Biester, T., Bosi, E., Buckingham, B. A., Cefalu, W. T., Close, K. L., et al. (2019). Clinical targets for continuous glucose monitoring data interpretation: recommendations from the international consensus on time in range. Diabetes Care, 42(8):1593–1603.
Beck et al., 2019. Beck, R. W., Bergenstal, R. M., Riddlesworth, T. D., Kollman, C., Li, Z., Brown, A. S., and Close, K. L. (2019). Validation of time in range as an outcome measure for diabetes clinical trials. Diabetes Care, 42(3):400–405.
Berg et al., 1984. Berg, C., Christensen, J. P. R., and Ressel, P. (1984). Harmonic Analysis on Semigroups: Theory of Positive Definite and Related Functions, volume 100. Springer.
Bertsimas et al., 2016. Bertsimas, D., King, A., and Mazumder, R. (2016). Best subset selection via a modern optimization lens. The Annals of Statistics, pages 813–852.
Beutner and Zähle, 2012. Beutner, E. and Zähle, H. (2012). Deriving the asymptotic distribution of U- and V-statistics of dependent data using weighted empirical processes. Bernoulli, pages 803–822.
Boistard et al., 2017. Boistard, H., Lopuhaä, H. P., Ruiz-Gazen, A., et al. (2017). Functional central limit theorems for single-stage sampling designs. The Annals of Statistics, 45(4):1728–1758.
Borgwardt et al., 2006. Borgwardt, K. M., Gretton, A., Rasch, M. J., Kriegel, H.-P., Schölkopf, B., and Smola, A. J. (2006). Integrating structured biological data by kernel maximum mean discrepancy. Bioinformatics, 22(14):e49–e57.
Chen et al., 2017. Chen, J., Stern, M., Wainwright, M. J., and Jordan, M. I. (2017). Kernel feature selection via conditional covariance minimization. Advances in Neural Information Processing Systems, 30:6946–6955.
Cho et al., 2018. Cho, N., Shaw, J., Karuranga, S., Huang, Y., da Rocha Fernandes, J., Ohlrogge, A., and Malanda, B. (2018). IDF diabetes atlas: Global estimates of diabetes prevalence for 2017 and projections for 2045. Diabetes Research and Clinical Practice, 138:271–281.
Cirillo and Valencia, 2019. Cirillo, D. and Valencia, A. (2019). Big data analytics for personalized medicine. Current Opinion in Biotechnology, 58:161–167.
Coronato et al., 2020. Coronato, A., Naeem, M., De Pietro, G., and Paragliola, G. (2020). Reinforcement learning for intelligent healthcare applications: A survey. Artificial Intelligence in Medicine, 109:101964.
Dabelea et al., 2017. Dabelea, D., Stafford, J. M., Mayer-Davis, E. J., D'Agostino, R., Dolan, L., Imperatore, G., Linder, B., Lawrence, J. M., Marcovina, S. M., Mottl, A. K., et al. (2017). Association of type 1 diabetes vs type 2 diabetes diagnosed during childhood and adolescence with complications during teenage years and young adulthood. JAMA, 317(8):825–835.
Deka et al., 2014. Deka, P. C. et al. (2014). Support vector machine applications in the field of hydrology: a review. Applied Soft Computing, 19:372–386.
Ding et al., 2018. Ding, P., Li, F., et al. (2018). Causal inference: A missing data perspective. Statistical Science, 33(2):214–237.
Edlitz and Segal, 2020. Edlitz, Y. and Segal, E. (2020). Prediction of type 2 diabetes mellitus onset using simple logistic regression models. medRxiv.
Efron and Tibshirani, 1994. Efron, B. and Tibshirani, R. J. (1994). An Introduction to the Bootstrap. CRC Press.
Ellahham, 2020. Ellahham, S. (2020). Artificial intelligence in diabetes care. The American Journal of Medicine.
Finkelstein et al., 2012. Finkelstein, E. A., Khavjou, O. A., Thompson, H., Trogdon, J. G., Pan, L., Sherry, B., and Dietz, W. (2012). Obesity and severe obesity forecasts through 2030. American Journal of Preventive Medicine, 42(6):563–570.
Flegal et al., 2012. Flegal, K. M., Carroll, M. D., Kit, B. K., and Ogden, C. L. (2012). Prevalence of obesity and trends in the distribution of body mass index among US adults, 1999-2010. JAMA, 307(5):491–497.
Centers for Disease Control and Prevention, 2011. Centers for Disease Control and Prevention et al. (2011). National diabetes fact sheet: national estimates and general information on diabetes and prediabetes in the United States, 2011. Atlanta, GA: US Department of Health and Human Services, Centers for Disease Control and Prevention, 201(1):2568–2569.
Garreau et al., 2017. Garreau, D., Jitkrittum, W., and Kanagawa, M. (2017). Large sample analysis of the median heuristic. arXiv preprint arXiv:1707.07269.
Gaynanova et al., 2020. Gaynanova, I., Punjabi, N., and Crainiceanu, C. (2020). Modeling continuous glucose monitoring (CGM) data during sleep. Biostatistics.
Ghorbani et al., 2020. Ghorbani, B., Mei, S., Misiakiewicz, T., and Montanari, A. (2020). When do neural networks outperform kernel methods? Advances in Neural Information Processing Systems, 33.
Giné and Zinn, 1990. Giné, E. and Zinn, J. (1990). Bootstrapping general empirical measures. The Annals of Probability, pages 851–869.
Ginter and Simko, 2013. Ginter, E. and Simko, V. (2013). Type 2 diabetes mellitus, pandemic in 21st century. In Diabetes, pages 42–50. Springer.
Goldberg and Kosorok, 2012. Goldberg, Y. and Kosorok, M. R. (2012). Q-learning with censored data. Annals of Statistics, 40(1):529.
Gönen and Alpaydın, 2011. Gönen, M. and Alpaydın, E. (2011). Multiple kernel learning algorithms. The Journal of Machine Learning Research, 12:2211–2268.
Gou et al., 2020. Gou, W., Ling, C.-w., He, Y., Jiang, Z., Fu, Y., Xu, F., Miao, Z., Sun, T.-y., Lin, J.-s., Zhu, H.-l., et al. (2020). Interpretable machine learning framework reveals robust gut microbiome features associated with type 2 diabetes. Diabetes Care.
Gretton et al., 2012. Gretton, A., Borgwardt, K. M., Rasch, M. J., Schölkopf, B., and Smola, A. (2012). A kernel two-sample test. The Journal of Machine Learning Research, 13(1):723–773.
Gretton et al., 2007. Gretton, A., Fukumizu, K., Teo, C., Song, L., Schölkopf, B., and Smola, A. (2007). A kernel statistical test of independence. Advances in Neural Information Processing Systems, 20:585–592.
Group, 2018. Group, B. A. W. (2018). Need for regulatory change to incorporate beyond A1C glycemic metrics. Diabetes Care, 41(6):e92–e94.
Gude et al., 2017. Gude, F., Díaz-Vidal, P., Rúa-Pérez, C., Alonso-Sampedro, M., Fernández-Merino, C., Rey-García, J., Cadarso-Suárez, C., Pazos-Couselo, M., García-López, J. M., and Gonzalez-Quintela, A. (2017). Glycemic variability and its association with demographics and lifestyles in a general adult population. Journal of Diabetes Science and Technology, 11(4):780–790.
Gunasekeran et al., 2020. Gunasekeran, D. V., Ting, D. S., Tan, G. S., and Wong, T. Y. (2020). Artificial intelligence for diabetic retinopathy screening, prediction and management. Current Opinion in Ophthalmology, 31(5):357–365.
Hastie et al., 2019. Hastie, T., Montanari, A., Rosset, S., and Tibshirani, R. J. (2019). Surprises in high-dimensional ridgeless least squares interpolation. arXiv preprint arXiv:1903.08560.
Hazimeh and Mazumder, 2018. Hazimeh, H. and Mazumder, R. (2018). Fast best subset selection: Coordinate descent and local combinatorial optimization algorithms. arXiv preprint arXiv:1803.01454.
He et al., 1996. He, X., Shao, Q.-M., et al. (1996). A general Bahadur representation of M-estimators and its application to linear regression with nonstochastic designs. The Annals of Statistics, 24(6):2608–2630.
Hirsch and Brownlee, 2010. Hirsch, I. B. and Brownlee, M. (2010). Beyond hemoglobin A1c: need for additional markers of risk for diabetic microvascular complications. JAMA, 303(22):2291–2292.
Hu et al., 2015. Hu, F. B., Satija, A., and Manson, J. E. (2015). Curbing the diabetes pandemic: the need for global policy solutions. JAMA, 313(23):2319–2320.
Ida et al., 2019. Ida, Y., Fujiwara, Y., and Kashima, H. (2019). Fast sparse group lasso. In Advances in Neural Information Processing Systems, pages 1702–1710.
Ivanciuc et al., 2007. Ivanciuc, O. et al. (2007). Applications of support vector machines in chemistry. Reviews in Computational Chemistry, 23:291.
Janssen et al., 2003. Janssen, A., Pauls, T., et al. (2003). How do bootstrap and permutation tests work? The Annals of Statistics, 31(3):768–806.
Kang et al., 2007. Kang, J. D., Schafer, J. L., et al. (2007). Demystifying double robustness: A comparison of alternative strategies for estimating a population mean from incomplete data. Statistical Science, 22(4):523–539.
Korolyuk and Borovskich, 2013. Korolyuk, V. S. and Borovskich, Y. V. (2013). Theory of U-Statistics, volume 273. Springer Science & Business Media.
Kosorok, 2007. Kosorok, M. R. (2007). Introduction to Empirical Processes and Semiparametric Inference. Springer Science & Business Media.
Kosorok and Laber, 2019. Kosorok, M. R. and Laber, E. B. (2019). Precision medicine. Annual Review of Statistics and Its Application, 6:263–286.
Kosorok and Moodie, 2015. Kosorok, M. R. and Moodie, E. E. (2015). Adaptive Treatment Strategies in Practice: Planning Trials and Analyzing Data for Personalized Medicine. SIAM.
Laird, 1988. Laird, N. M. (1988). Missing data in longitudinal studies. Statistics in Medicine, 7(1-2):305–315.
Lee et al., 2013. Lee, B. J., Ku, B., Nam, J., Pham, D. D., and Kim, J. Y. (2013). Prediction of fasting plasma glucose status using anthropometric measures for diagnosing type 2 diabetes. IEEE Journal of Biomedical and Health Informatics, 18(2):555–561.
Lei and Candès, 2020. Lei, L. and Candès, E. J. (2020). Conformal inference of counterfactuals and individual treatment effects. arXiv preprint arXiv:2006.06138.
Li et al., 2017. Li, X., Dunn, J., Salins, D., Zhou, G., Zhou, W., Schüssler-Fiorenza Rose, S. M., Perelman, D., Colbert, E., Runge, R., Rego, S., et al. (2017). Digital health: tracking physiomes and activity using wearable biosensors reveals useful health-related information. PLoS Biology, 15(1):e2001402.
Liang et al., 2020. Liang, T., Rakhlin, A., et al. (2020). Just interpolate: Kernel "ridgeless" regression can generalize. Annals of Statistics, 48(3):1329–1347.
Little and Rubin, 2019. Little, R. J. and Rubin, D. B. (2019). Statistical Analysis with Missing Data, volume 793. John Wiley & Sons.
Liu et al., 2020. Liu, T., Goldberg, Y., et al. (2020). Kernel machines with missing responses. Electronic Journal of Statistics, 14(2):3766–3820.
Londschien et al., 2020. Londschien, M., Kovács, S., and Bühlmann, P. (2020). Change point detection for graphical models in the presence of missing values. Journal of Computational and Graphical Statistics, pages 1–32.
Lu et al., 2020. Lu, J., Wang, C., Shen, Y., Chen, L., Zhang, L., Cai, J., Lu, W., Zhu, W., Hu, G., Xia, T., and Zhou, J. (2020). Time in range in relation to all-cause and cardiovascular mortality in patients with type 2 diabetes: A prospective cohort study. Diabetes Care.
Luckett et al., 2020. Luckett, D. J., Laber, E. B., Kahkoska, A. R., Maahs, D. M., Mayer-Davis, E., and Kosorok, M. R. (2020). Estimating dynamic treatment regimes in mobile health using V-learning. Journal of the American Statistical Association, 115(530):692–706. PMID: 32952236.
Lyons et al., 2013. Lyons, R. et al. (2013). Distance covariance in metric spaces. The Annals of Probability, 41(5):3284–3305.
Ma and Wang, 2019. Ma, X. and Wang, J. (2019). Robust inference using inverse probability weighting. Journal of the American Statistical Association, pages 1–10.
Makrilakis et al., 2011. Makrilakis, K., Liatis, S., Grammatikou, S., Perrea, D., Stathi, C., Tsiligros, P., and Katsilambros, N. (2011). Validation of the Finnish diabetes risk score (FINDRISC) questionnaire for screening for undiagnosed type 2 diabetes, dysglycaemia and the metabolic syndrome in Greece. Diabetes & Metabolism, 37(2):144–151.
Matabuena et al., 2020. Matabuena, M., Petersen, A., Vidal, J. C., and Gude, F. (2020). Glucodensities: a new representation of glucose profiles using distributional data analysis. arXiv preprint arXiv:2008.07840.
Monnier et al., 2008. Monnier, L., Colette, C., and Owens, D. R. (2008). Glycemic variability: the third component of the dysglycemia in diabetes. Is it important? How to measure it? Journal of Diabetes Science and Technology, 2(6):1094–1100.
Muandet et al., 2016. Muandet, K., Fukumizu, K., Sriperumbudur, B., and Schölkopf, B. (2016). Kernel mean embedding of distributions: A review and beyond. arXiv preprint arXiv:1605.09522.
Mühlenbruch et al., 2018. Mühlenbruch, K., Paprott, R., Joost, H.-G., Boeing, H., Heidemann, C., and Schulze, M. B. (2018). Derivation and external validation of a clinical version of the German diabetes risk score (GDRS) including measures of HbA1c. BMJ Open Diabetes Research and Care, 6(1):e000524.
Muzellec et al., 2020. Muzellec, B., Josse, J., Boyer, C., and Cuturi, M. (2020). Missing data imputation using optimal transport. In III, H. D. and Singh, A., editors, Proceedings of the 37th International Conference on Machine Learning, volume 119 of Proceedings of Machine Learning Research, pages 7130–7140, Virtual. PMLR.
Ng et al., 2014. Ng, M., Fleming, T., Robinson, M., Thomson, B., Graetz, N., Margono, C., Mullany, E. C., Biryukov, S., Abbafati, C., Abera, S. F., et al. (2014). Global, regional, and national prevalence of overweight and obesity in children and adults during 1980–2013: a systematic analysis for the global burden of disease study 2013. The Lancet, 384(9945):766–781.
Nguyen et al., 2020. Nguyen, M., Han, J., Spanakis, E. K., Kovatchev, B. P., and Klonoff, D. C. (2020). A review of continuous glucose monitoring-based composite metrics for glycemic control. Diabetes Technology & Therapeutics.
Noble, 2006. Noble, W. S. (2006). What is a support vector machine? Nature Biotechnology, 24(12):1565–1567.
Politis and Romano, 1994. Politis, D. N. and Romano, J. P. (1994). Large sample confidence regions based on subsamples under minimal assumptions. The Annals of Statistics, pages 2031–2050.
Poolsup et al., 2013. Poolsup, N., Suksomboon, N., and Kyaw, A. M. (2013). Systematic review and meta-analysis of the effectiveness of continuous glucose monitoring (CGM) on glucose control in diabetes. Diabetology & Metabolic Syndrome, 5(1):39.
Reddi et al., 2014. Reddi, S. J., Ramdas, A., Póczos, B., Singh, A., and Wasserman, L. (2014). On the decreasing power of kernel and distance based nonparametric hypothesis tests in high dimensions. arXiv preprint arXiv:1406.2083.
Rehman and Akash, 2016. Rehman, K. and Akash, M. S. H. (2016). Mechanisms of inflammatory responses and development of insulin resistance: how are they interlinked? Journal of Biomedical Science, 23(1):1–18.
Rubin, 1976. Rubin, D. B. (1976). Inference and missing data. Biometrika, 63(3):581–592.
Rubin et al., 1994. Rubin, R. J., Altman, W. M., and Mendelson, D. N. (1994). Health care expenditures for people with diabetes mellitus, 1992. The Journal of Clinical Endocrinology & Metabolism, 78(4):809A–809F.
Saeedi et al., 2019. Saeedi, P., Petersohn, I., Salpea, P., Malanda, B., Karuranga, S., Unwin, N., Colagiuri, S., Guariguata, L., Motala, A. A., Ogurtsova, K., et al. (2019). Global and regional diabetes prevalence estimates for 2019 and projections for 2030 and 2045: Results from the International Diabetes Federation Diabetes Atlas. Diabetes Research and Clinical Practice, 157:107843.
Saeedi et al., 2020. Saeedi, P., Salpea, P., Karuranga, S., Petersohn, I., Malanda, B., Gregg, E. W., Unwin, N., Wild, S. H., and Williams, R. (2020). Mortality attributable to diabetes in 20–79 years old adults, 2019 estimates: Results from the International Diabetes Federation Diabetes Atlas. Diabetes Research and Clinical Practice, page 108086.
Salcedo-Sanz et al., 2014. Salcedo-Sanz, S., Rojo-Álvarez, J. L., Martínez-Ramón, M., and Camps-Valls, G. (2014). Support vector machines in engineering: an overview. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 4(3):234–267.
Schölkopf et al., 2001. Schölkopf, B., Herbrich, R., and Smola, A. J. (2001). A generalized representer theorem. In International Conference on Computational Learning Theory, pages 416–426. Springer.
Schölkopf et al., 2002. Schölkopf, B., Smola, A. J., Bach, F., et al. (2002). Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. MIT Press.
Schork, 2015. Schork, N. J. (2015). Personalized medicine: time for one-person trials. Nature, 520(7549):609–611.
Sejdinovic et al., 2013. Sejdinovic, D., Sriperumbudur, B., Gretton, A., and Fukumizu, K. (2013). Equivalence of distance-based and RKHS-based statistics in hypothesis testing. The Annals of Statistics, pages 2263–2291.
Selvin et al., 2007. Selvin, E., Crainiceanu, C. M., Brancati, F. L., and Coresh, J. (2007). Short-term variability in measures of glycemia and implications for the classification of diabetes. Archives of Internal Medicine, 167(14):1545–1551.
Shafer and Vovk, 2008. Shafer, G. and Vovk, V. (2008). A tutorial on conformal prediction. Journal of Machine Learning Research, 9(Mar):371–421.
Simon-Gabriel and Schölkopf, 2018. Simon-Gabriel, C.-J. and Schölkopf, B. (2018). Kernel distribution embeddings: Universal kernels, characteristic kernels and kernel metrics on distributions. The Journal of Machine Learning Research, 19(1):1708–1736.
Steinwart and Christmann, 2008. Steinwart, I. and Christmann, A. (2008). Support Vector Machines. Springer Science & Business Media.
Stone, 1982. Stone, C. J. (1982). Optimal global rates of convergence for nonparametric regression. The Annals of Statistics, pages 1040–1053.
Szekely and Rizzo, 2017. Szekely, G. J. and Rizzo, M. L. (2017). The energy of data. Annual Review of Statistics and Its Application, 4:447–479.
Székely et al., 2007. Székely, G. J., Rizzo, M. L., Bakirov, N. K., et al. (2007). Measuring and testing dependence by correlation of distances. The Annals of Statistics, 35(6):2769–2794.
Tabish, 2007. Tabish, S. A. (2007). Is diabetes becoming the biggest epidemic of the twenty-first century? International Journal of Health Sciences, 1(2):V.
Topol, 2010. Topol, E. J. (2010). Transforming medicine via digital innovation. Science Translational Medicine, 2(16):16cm4.
Tsiatis, 2007. Tsiatis, A. (2007). Semiparametric Theory and Missing Data. Springer Science & Business Media.
Tsiatis, 2019. Tsiatis, A. A. (2019). Dynamic Treatment Regimes: Statistical Methods for Precision Medicine. CRC Press.
Van de Geer, 2000. Van de Geer, S. A. (2000). Applications of Empirical Process Theory, volume 91. Cambridge University Press.
Van der Laan et al., 2007. Van der Laan, M. J., Polley, E. C., and Hubbard, A. E. (2007). Super learner. Statistical Applications in Genetics and Molecular Biology, 6(1).
Van der Vaart and Wellner, 1996. Van der Vaart, A. W. and Wellner, J. A. (1996). Weak convergence. In Weak Convergence and Empirical Processes, pages 16–28. Springer.
Vas et al., 2017. Vas, P. R. J., Alberti, K. G., and Edmonds, M. E. (2017). Prediabetes: moving away from a glucocentric definition. The Lancet Diabetes & Endocrinology, 5(11):848–849.
Vermeulen and Vansteelandt, 2015. Vermeulen, K. and Vansteelandt, S. (2015). Bias-reduced doubly robust estimation. Journal of the American Statistical Association, 110(511):1024–1036.
Visscher and Seidell, 2001. Visscher, T. L. and Seidell, J. C. (2001). The public health impact of obesity. Annual Review of Public Health, 22(1):355–375.
Walker et al., 2010. Walker, K., O'Dea, K., Gomez, M., Girgis, S., and Colagiuri, R. (2010). Diet and exercise in the prevention of diabetes. Journal of Human Nutrition and Dietetics, 23(4):344–352.
Whiting et al., 2011. Whiting, D. R., Guariguata, L., Weil, C., and Shaw, J. (2011). IDF diabetes atlas: global estimates of the prevalence of diabetes for 2011 and 2030. Diabetes Research and Clinical Practice, 94(3):311–321.
Williams et al., 2020. Williams, R., Karuranga, S., Malanda, B., Saeedi, P., Basit, A., Besançon, S., Bommer, C., Esteghamati, A., Ogurtsova, K., Zhang, P., et al. (2020). Global and regional estimates and projections of diabetes-related health expenditure: Results from the International Diabetes Federation Diabetes Atlas. Diabetes Research and Clinical Practice, page 108072.
Wilmot et al., 2020. Wilmot, E., Lumb, A., Hammond, P., Murphy, H., Scott, E., Gibb, F., Platts, J., and Choudhary, P. (2020). Time in range: a best practice guide for UK diabetes healthcare professionals in the context of the COVID-19 global pandemic. Diabetic Medicine, page e14433.
Yang et al., 2016. Yang, L., Lv, S., and Wang, J. (2016). Model-free variable selection in reproducing kernel Hilbert space. The Journal of Machine Learning Research, 17(1):2885–2908.
Yang and Zou, 2015. Yang, Y. and Zou, H. (2015). A fast unified algorithm for solving group-lasso penalized learning problems. Statistics and Computing, 25(6):1129–1141.
Zaccardi and Khunti, 2018. Zaccardi, F. and Khunti, K. (2018). Glucose dysregulation phenotypes: time to improve outcomes. Nature Reviews Endocrinology, 14(11):632–633.
Zaitcev et al., 2020. Zaitcev, A., Eissa, M. R., Hui, Z., Good, T., Elliott, J., and Benaissa, M. (2020). A deep neural network application for improved prediction of HbA1c in type 1 diabetes. IEEE Journal of Biomedical and Health Informatics.
Zeevi et al., 2015. Zeevi, D., Korem, T., Zmora, N., Israeli, D., Rothschild, D., Weinberger, A., Ben-Yacov, O., Lador, D., Avnit-Sagi, T., Lotan-Pompan, M., et al. (2015). Personalized nutrition by prediction of glycemic responses. Cell, 163(5):1079–1094.
Zhang et al., 2012. Zhang, B., Tsiatis, A. A., Davidian, M., Zhang, M., and Laber, E. (2012). Estimating optimal treatment regimes from a classification perspective. Stat, 1(1):103–114.
Zhang et al., 2010. Zhang, X., Gregg, E. W., Williamson, D. F., Barker, L. E., Thomas, W., Bullard, K. M., Imperatore, G., Williams, D. E., and Albright, A. L. (2010). A1C level and future risk of diabetes: a systematic review. Diabetes Care, 33(7):1665–1673.
Zhao et al., 2011. Zhao, Y., Zeng, D., Socinski, M. A., and Kosorok, M. R. (2011). Reinforcement learning strategies for clinical trials in nonsmall cell lung cancer. Biometrics, 67(4):1422–1433.
Zheng et al., 2018. Zheng, Y., Ley, S. H., and Hu, F. B. (2018). Global aetiology and epidemiology of type 2 diabetes mellitus and its complications. Nature Reviews Endocrinology, 14(2):88.
Zou et al., 2018. Zou, Q., Qu, K., Luo, Y., Yin, D., Ju, Y., and Tang, H. (2018). Predicting diabetes mellitus with machine learning techniques. Frontiers in Genetics, 9:515.
Appendix: Bootstrap consistency
Lemma 1.
Following the notation established in Sections 1.3 and 2.1, suppose that $E(K_X(X,X')) < \infty$, $E(K_Y(Y,Y')) < \infty$, that $\pi(\cdot) = P(R=1 \mid X = \cdot)$ is a known, twice-differentiable function satisfying $\pi(x) > 0$ for all $x \in V$, and that all the associated probability weights have a regularly varying tail in the sense of [Ma and Wang, 2019]. Under these conditions, the empirical and bootstrap statistics defined in Section 2.1 remain consistent for detecting all second-order finite-moment alternatives with the Hilbert-Schmidt independence measure.

Proof. As in Section 1.3, we assume that we observe $\{(X_i, Y_i, R_i)\}_{i=1}^n$, an independent random sample of a random vector $(X, Y, R)$ taking values in $V \times \mathbb{R} \times \{0,1\}$; for simplicity in the use of empirical-process tools, we take $V$ to be the real line, although the proof remains valid in the general case.

First, note that under standard conditions the empirical process of the distribution function $P_X$, namely $\sqrt{n}(P_{n,X} - P_X)$, converges by the functional central limit theorem to a mean-zero Gaussian process, a Brownian bridge, which we denote by $G_X$ [Van der Vaart and Wellner, 1996].

We define the class of functions $\mathcal{F}_{(X,Y)} = \{f_{t'} : t' \in V \times \mathbb{R}\}$, where, for $t' = (t_x, t_y)$ and argument $((y,x), r)$,
$$f_{t'}((y,x), r) = \big(\mathbb{1}\{t_y \le y, t_x \le x, r = 1\},\ \mathbb{1}\{t_y \ge y, t_x \ge x\},\ \mathbb{1}\{t_y \le y, r = 1\},\ \mathbb{1}\{t_x \le x, r = 1\},\ \mathbb{1}\{t_y \ge y\},\ \mathbb{1}\{t_x \ge x\}\big).$$
If we consider the empirical and population measures of $(Y,R)$ and $(X,Y,R)$, denoted $P_Y$, $P_{X,Y}$, $P_{n,Y}$, $P_{n,X,Y}$, we can view these probability measures as random functions in the space $A(\mathcal{F}_{(X,Y)}) = \{Q \in M_{(X,Y)} : \sup\{\int f\,dQ : f \in \mathcal{F}_{(X,Y)}\} < \infty\}$, where $M_{(X,Y)}$ is the space of all bidimensional probability measures with finite second moment in $\mathbb{R}$. Under these conditions, $\mathcal{F}_Y$ and $\mathcal{F}_{X,Y}$ are Vapnik-Chervonenkis classes and hence Donsker classes [Van der Vaart and Wellner, 1996]. Then
$$\sqrt{n}(P_{n,Y} - P_Y) \xrightarrow{D} G_Y, \qquad \sqrt{n}(P_{n,X,Y} - P_{X,Y}) \xrightarrow{D} G_{X,Y}.$$
Adapting these arguments to empirical bootstrap processes [Van der Vaart and Wellner, 1996], we obtain the analogous convergences
$$\sqrt{n}(P^*_{n,X} - P_{n,X}) \xrightarrow{D} G_X, \qquad \sqrt{n}(P^*_{n,Y} - P_{n,Y}) \xrightarrow{D} G_Y, \qquad \sqrt{n}(P^*_{n,X,Y} - P_{n,X,Y}) \xrightarrow{D} G_{X,Y}.$$

Our target statistic (see Equation 9) can be seen as a mapping of the empirical distribution via the IPW estimator. As the missing-data mechanism given by the (differentiable) function $\pi(\cdot)$ is known, we can apply the functional delta method to the empirical process and to the empirical bootstrap process, obtaining the Gaussian convergence of the estimators modified via the IPW estimator [Van der Vaart and Wellner, 1996]. We denote the dependence of these estimators on the IPW weights by $P_{n,X,Y,w}$, $P^*_{n,X,Y,w}$, and so on, and refer to the limiting empirical process $\sqrt{n}(P^*_{n,X,Y,w} - P_{X,Y}) \xrightarrow{D} G_{X,Y,w}$, and so on.

Our statistic (see Equation 9) depends on differences such as $\phi_X(\cdot)\phi_Y(\cdot) - \hat{\phi}_X(\cdot)\hat{\phi}_Y(\cdot)$, which are mappings of the previous empirical processes into the selected RKHS. Suppose that these mappings, defined by kernel mean embeddings, are Hadamard differentiable [Van der Vaart and Wellner, 1996], as holds for example for the Gaussian kernel. If so, we can apply the functional delta method twice, first to the kernel mean embedding and then to the RKHS inner product, to derive the limit distribution of the empirical bootstrap process.
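The plug-in construction above can be illustrated concretely. The following numpy sketch computes an IPW-weighted (V-statistic) HSIC on the complete cases and calibrates it with a bootstrap that imitates the null hypothesis by resampling the X- and Y-indices independently. The paper's exact statistic (Equation 9) and its residual bootstrap are not reproduced here, so the weighting scheme, the decoupled resampling, and all function names are illustrative assumptions rather than the authors' implementation.

```python
import numpy as np

def gaussian_kernel(a, bandwidth=1.0):
    # Gram matrix of the Gaussian kernel over the rows of `a`.
    a = np.asarray(a, dtype=float).reshape(len(a), -1)
    d2 = np.sum((a[:, None, :] - a[None, :, :]) ** 2, axis=-1)
    return np.exp(-d2 / (2.0 * bandwidth ** 2))

def weighted_hsic(K, L, w):
    # Plug-in (V-statistic) HSIC under the weighted empirical measure
    # p_i proportional to w_i; with uniform weights this reduces to the
    # standard biased estimator trace(K H L H) / n^2.
    p = w / w.sum()
    P = np.outer(p, p)
    joint = np.sum(P * K * L)
    marg = np.sum(P * K) * np.sum(P * L)
    cross = np.sum(p * (K @ p) * (L @ p))
    return joint + marg - 2.0 * cross

def hsic_ipw_test(x, y, r, pi, n_boot=200, seed=0):
    # IPW-weighted HSIC independence test on the complete cases (r == 1),
    # with known missingness probabilities pi = P(R=1 | X=x). The null is
    # imitated by resampling X- and Y-indices independently.
    rng = np.random.default_rng(seed)
    obs = r == 1
    x, y, w = x[obs], y[obs], 1.0 / pi[obs]
    K, L = gaussian_kernel(x), gaussian_kernel(y)
    stat = weighted_hsic(K, L, w)
    n = len(w)
    null = np.empty(n_boot)
    for b in range(n_boot):
        ix = rng.integers(0, n, n)   # bootstrap indices for X
        iy = rng.integers(0, n, n)   # independent indices for Y (H0)
        null[b] = weighted_hsic(K[np.ix_(ix, ix)], L[np.ix_(iy, iy)], w[ix])
    return (1 + np.sum(null >= stat)) / (1 + n_boot)
```

A bandwidth chosen by the median heuristic [Garreau et al., 2017] would be a common alternative to the fixed value used here.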
The explicit limit distribution depends on the missing-data mechanism; however, the limit of such statistics is, in general, an infinite linear combination of chi-square distributions [Korolyuk and Borovskich, 2013]. If the mapping were not differentiable, we would have to resort to quasi-Hadamard differentiability arguments [Beutner and Zähle, 2012], or to other techniques for general bootstrap measures [Giné and Zinn, 1990].

We can prove the consistency of the test statistics as follows. First, we must establish that the statistic converges to its population counterpart, that is, $\widehat{HSIC}(\hat{P}_{X,Y}, \hat{P}_X \hat{P}_Y) \xrightarrow{p} HSIC(P_{X,Y}, P_X P_Y)$. For this, we only need to apply the continuous mapping theorem [Van der Vaart and Wellner, 1996] and take into account that the variance of the empirical kernel mean embeddings (mean functions in the appropriate RKHS) converges to zero as $n$ increases to infinity. To see this, we can use the fact that $P_{n,X,Y,w} \to P_{X,Y}$ in the supremum norm, since the class of functions $\mathcal{F}_{(X,Y)}$ introduced previously is a Glivenko-Cantelli class, being a Vapnik-Chervonenkis class [Van der Vaart and Wellner, 1996]. We then apply the continuous mapping theorem twice, first to the kernel mean embedding and second to the inner product in the selected RKHS (which is continuous, the RKHS being a Hilbert space), and conclude the final convergence using the arithmetic properties of convergence in probability. Second, we must consider the non-negativity of the empirical and population Hilbert-Schmidt dependence measures, together with the fact that, at the population level, the dependence measure is zero by definition if and only if $X$ and $Y$ are independent (the null hypothesis), the statistic diverging stochastically to infinity otherwise.

Since the introduced bootstrap statistic imitates the limit distribution under the null hypothesis, the consistency of the bootstrap is also guaranteed [Janssen et al., 2003].

Note that our statistic can be written as a V-statistic.
In this case, the simple bootstrap is not consistent in general [Arcones and Gine, 1992], and it is necessary to resort to other strategies, such as subsampling [Politis and Romano, 1994] or centering the statistic with respect to its mean; in fact, we do the latter implicitly through the bootstrap residuals that we construct.

Finally, to the best of our knowledge, this is the first paper to apply, and provide theoretical guarantees for, the Efron bootstrap with the Hilbert-Schmidt criterion or distance correlation under missing data. The arguments remain valid in the complete-data case, a framework in which a paper applying this test-calibration strategy through empirical process theory was still missing.

In practice, we do not know the function $\pi(\cdot)$ and must estimate it. This can be done using simple techniques such as logistic regression, or more complex ensemble approaches such as the Super Learner [Van der Laan et al., 2007]. In general, it is impossible to provide a proof similar to that of Lemma 1 without introducing assumptions about the functional class of the missing-data mechanism. In the literature on IPW estimators, research often restricts attention to functions $\pi_{\theta}(\cdot)$, $\theta \in \Theta$, that depend only on a finite set of parameters of a finite-dimensional space; that is, to parametric models. More general approaches rely on the theory of M-estimators, or on the assumption that the parameter of the missing-data mechanism model admits a Bahadur expansion [He et al., 1996]. In these cases, the proof scheme is similar to that used to establish the asymptotic properties of the maximum likelihood estimator via a Taylor expansion.

In this paper, we restrict our attention to the simple case in which $R \perp Y \mid X$ and $\pi_{\beta^*}(x) = (1 + e^{-\beta_0 - \beta_1^{\top} x})^{-1}$; that is, $R \mid X$ follows a generalized linear model under logistic regression, with $\beta^* \in \Theta = \{(\beta_0, \beta_1) \in \mathbb{R} \times \mathbb{R}^p\}$. In this case, we have:

Theorem 2.
Following the notation established in Sections 1.3 and 2.1, suppose that $E(K_X(X,X')) < \infty$, $E(K_Y(Y,Y')) < \infty$, that $\pi(x) = P(R=1 \mid X=x) = (1 + e^{-\beta_0 - \beta_1^{\top} x})^{-1} > 0$, and that the sufficient conditions for the asymptotic Gaussianity of the maximum likelihood estimator hold for all $x \in V$. Moreover, we assume that all the associated probability weights have a regularly varying tail in the sense of [Ma and Wang, 2019]. Under these conditions, the empirical and bootstrap statistics defined in Section 2.1 remain consistent for detecting all second-order finite-moment alternatives with the Hilbert-Schmidt independence measure.

Proof.
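For completeness, the parametric working model of Theorem 2 can be estimated in practice as in the following numpy-only sketch, which fits the logistic missingness model by Newton-Raphson and returns the estimated inverse-probability weights $R_i/\hat{\pi}(X_i)$. The function names, the solver details, and the clipping of $\hat{\pi}$ away from zero are illustrative assumptions; in practice one could equally use standard logistic regression software or the Super Learner mentioned above.

```python
import numpy as np

def fit_logistic(X, r, n_iter=25, ridge=1e-8):
    # Newton-Raphson maximum likelihood for the logistic missingness model
    # P(R=1 | X=x) = 1 / (1 + exp(-b0 - b'x)).
    X1 = np.column_stack([np.ones(len(X)), X])
    beta = np.zeros(X1.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X1 @ beta))
        grad = X1.T @ (r - p)                         # score vector
        hess = X1.T @ (X1 * (p * (1.0 - p))[:, None])  # Fisher information
        beta += np.linalg.solve(hess + ridge * np.eye(len(beta)), grad)
    return beta

def ipw_weights(X, r, clip=1e-3):
    # Estimated inverse-probability weights R_i / pi_hat(X_i). Missing
    # responses (r == 0) receive weight zero; pi_hat is clipped away from
    # zero to keep the weights bounded.
    X = np.asarray(X, dtype=float).reshape(len(X), -1)
    beta = fit_logistic(X, r)
    X1 = np.column_stack([np.ones(len(X)), X])
    pi_hat = np.clip(1.0 / (1.0 + np.exp(-X1 @ beta)), clip, 1.0)
    return r / pi_hat, beta
```

The resulting weights can be passed in place of the known-$\pi$ weights in the weighted statistics discussed above, at the cost of the additional parametric assumptions stated in Theorem 2.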