Glucodensities: a new representation of glucose profiles using distributional data analysis
Marcos Matabuena, Alexander Petersen, Juan C. Vidal, Francisco Gude
Unidad de Epidemiología Clínica, Hospital Clínico Universitario de Santiago de Compostela, Spain
CiTIUS (Centro Singular de Investigación en Tecnoloxías Intelixentes), Universidade de Santiago de Compostela, Spain
Department of Statistics and Applied Probability, University of California, Santa Barbara
∗ [email protected]
August 19, 2020
Abstract
Biosensor data have the potential to improve disease control and detection. However, the analysis of these data under free-living conditions is not feasible with current statistical techniques. To address this challenge, we introduce a new functional representation of biosensor data, termed the glucodensity, together with a data analysis framework based on distances between glucodensities. The new data analysis procedure is illustrated through an application in diabetes with continuous glucose monitoring (CGM) data. In this domain, we show marked improvement with respect to state-of-the-art analysis methods. In particular, our findings demonstrate that i) the glucodensity possesses an extraordinary clinical sensitivity to capture the typical biomarkers used in standard clinical practice in diabetes, ii) previous biomarkers cannot accurately predict the glucodensity, so that the latter is a richer source of information, and iii) the glucodensity is a natural generalization of the time in range metric, this being the gold standard in the handling of CGM data. Furthermore, the new method overcomes many of the drawbacks of time in range metrics, and provides deeper insight into assessing glucose metabolism.
The steadily increasing availability and prominence of biosensor data have given rise to new methodological challenges for their statistical analysis. A primary feature of these data is that the monitored individuals are in free-living conditions, making a direct analysis of the recorded time series between groups of patients problematic if not infeasible. A clear example of such data is found in the study of diabetes, where continuous glucose monitoring (CGM) is increasingly used. The elevation of glucose is distinct between individuals and is influenced by factors such as mealtimes, diet composition, or physical exercise (Ewings et al., 2015). Consequently, an exciting topic of debate is how to exploit the enormous wealth of information recorded by CGM to draw more reliable conclusions about glucose homeostasis than those afforded by cursory summary measures such as fasting plasma glucose (FPG) or glycated hemoglobin (A1c) (Zaccardi and Khunti, 2018).
Since 2010, the American Diabetes Association (ADA) has included measurement of A1c levels for both the diagnosis and the control of diabetes (Association et al., 2018). A1c levels reflect underlying glucose levels over the preceding 2–3 months; testing is convenient because blood samples can be obtained at any time of day, overnight fasting is not required, and A1c within-patient reproducibility is superior to that of fasting plasma glucose and oral glucose tolerance tests (OGTTs) (Selvin et al., 2007). However, recent articles have provided evidence for the need to go beyond A1c and use new measures for glycemic control (Group, 2018; Bergenstal, 2015) in order to capture more diverse aspects of the temporally evolving glucose levels beyond the average, for example, glucose variability and time in range metrics. The time in range metric measures the proportion of time an individual's glucose levels are maintained in different target zones.
In the case of diabetes, these can include ranges corresponding to hypoglycemia and hyperglycemia. In an innovative article, Beck et al. (2019) validated the time in range metric, showing that it is a good predictor of long-term microvascular complications despite measuring glucose values just seven times per day. Lu et al. (2018) reached similar conclusions using CGM technology for only 24 hours in each patient. At the same time, it is well known that two patients may have the same glycosylated hemoglobin and completely different glycemic profiles (Beck et al., 2017). These new approaches and findings have led clinical specialists to consider that continuous glucose measurement during long monitoring periods can lead to more accurate results in research and clinical practice than standard methods (Hirsch et al., 2019). In fact, since 2012, the European Medicines Agency (Committee for Medicinal Products for Human Use et al., 2012) has recommended the use of CGM to validate the effect of drugs for the treatment or prevention of diabetes mellitus.
Traditionally, CGM was designed for real-time risk management in type 1 diabetes and the control of glucose values with insulin pumps (Kovatchev et al., 2009; Feig et al., 2017; DiMeglio et al., 2018). More recent applications of CGM, however, have been more general. They involve, for example, screening patients, optimizing diet, epidemiological studies, assessing patient prognosis, and supporting treatment prescription, and have even been used in healthy populations (Freeman and Lyons, 2008; Hall et al., 2018). In addition to the increasing utility of CGM data, the technology is gradually becoming cheaper, and new devices capable of measuring glucose in a non-invasive way, for example, with glasses (Nichols et al., 2013), are quickly emerging.
All of these advances are facilitating the adoption of CGM in standard clinical practice.
In 2012, a panel of experts discussed how to represent CGM data in an "easy to view format" (Bergenstal et al., 2013). They also analyzed the convenience of using glycemic variability measures and other summary measures such as time in range to extract the recorded information from CGM. In 2019, the ADA established an updated version of clinical standards to use and define target zones with time in range metrics (Battelino et al., 2019). A more recent review of CGM metrics establishes time in range as a gold standard measure (Nguyen et al., 0).
Motivated by the problem of analyzing data gathered via CGM more precisely, while still leveraging the advantages possessed by time in range metrics, we propose an approach based on the construction of a functional profile of glucose values for each subject. Conceptually, the approach is a natural extension of time in range metrics in which the ranges shrink in size and increase in number, so that the new profile effectively measures the proportion of time each patient spends at each specific glucose concentration rather than in a coarsely defined range. As a result, the new functional profile, which we refer to as a glucodensity, automatically and simultaneously captures all parameters arising from individual glucose distributions. Figure 1 illustrates a set of constructed glucodensities that represent the data objects for which we will propose the use of a tailored set of statistical methods.
Mathematically, glucodensities constitute functional distributional data since each glucodensity represents a distribution of glucose concentrations. As such, these complex and constrained curves cannot be directly analyzed with the usual techniques. To overcome this, we introduce a framework for the analysis of glucodensities by compiling suitable methods that are based on the calculation of distances between glucodensities.
We also reveal the superior clinical capacity of our representation compared to classical measures of diabetes. Finally, we demonstrate that our representation has a higher sensitivity than the standard time in range metric to explain the glycemic differences between patients in various settings, including regression analysis. A new Shiny interface to use the methods outlined in this paper is available at https://tec.citius.usc.es/diabetes .
Figure 1: Example of a set of glucodensities estimated from a random sample of the AEGIS population-based study (x-axis: glucose, mg/dL; y-axis: density).
The structure of this paper is as follows. First, we briefly describe the AEGIS study and the methods used. We then formally introduce the concept of glucodensity, the estimation methods, and some essential statistical background to understand the statistical procedures introduced in the paper. Subsequently, we explain the regression models used in the validation of the representation. Afterward, we show the results that demonstrate the superiority of the glucodensity over state-of-the-art glucose representations. Then, we illustrate the use of the glucodensity methodology with real data in two-sample testing and cluster analysis. Finally, we discuss the clinical implications of these results, their limitations, and the new perspectives of the glucodensity method in medicine and device technology.
Biomarker            Clinical significance
A1c                  Gold standard marker in diabetes diagnosis and control
HOMA-IR              Measurements to quantify insulin resistance and β-cell function
CONGA, MODD, MAGE    Summary indices of glucose variability
Table 1: Clinical importance of biomarkers used in the statistical analysis
A subset of the subjects in the A Estrada Glycation and Inflammation Study (AEGIS; trial NCT) provided the sample for the present work. In this cross-sectional study, an age-stratified random sample of the population (aged ≥ 18) was drawn from Spain's National Health System Registry. A detailed description has been published elsewhere (Gude et al., 2017). For one year beginning in March, subjects were periodically examined at their primary care centre, where they (i) completed an interviewer-administered structured questionnaire; (ii) provided a lifestyle description; (iii) were subjected to biochemical measurements; and (iv) were prepared for CGM (lasting 6 days). The subjects who made up the present sample were the 581 (361 women, 220 men) who completed at least 2 days of monitoring, out of an original 622 persons who consented to undergo a 6-day period of CGM. The remaining 41 original subjects were withdrawn from the study due to non-compliance with protocol demands (n = 4) or difficulties in handling the device (n = 37). The characteristics of the participants are shown in Table 2.
The present study was reviewed and approved by the Clinical Research Ethics Committee from Galicia, Spain (CEIC2012-025). Written informed consent was obtained from each participant in the study, which conformed to the current Helsinki Declaration.
Table 2: Characteristics of the participants, reported separately for men (n = 220) and women (n = 361). Variables: age (years), BMI, FPG, A1c, HOMA-IR, CONGA, MODD, MAGE.
BMI - body mass index; FPG - fasting plasma glucose; A1c - glycated haemoglobin; HOMA-IR - homeostasis model assessment-insulin resistance; CONGA - glycemic variability in terms of continuous overall net glycemic action; MODD - mean of daily differences; MAGE - mean amplitude of glycemic excursions.
Glucose was determined in plasma samples from fasting participants by the glucose oxidase peroxidase method. A1c was determined by high-performance liquid chromatography in a Menarini Diagnostics HA-8160 analyser; all A1c values were converted to DCCT-aligned values (Hoelzel et al., 2004). Insulin resistance was estimated using the homeostasis model assessment method (HOMA-IR) as the fasting concentration of plasma insulin (µ units/mL) × plasma glucose (mg/dL) / 405 (Matthews et al., 1985).
Glycaemic variability was measured in terms of continuous overall net glycemic action (CONGA) (McDonnell et al., 2005), the mean amplitude of glycaemic excursions (MAGE) (Service et al., 1970), and the mean of the daily differences (MODD) (Molnar et al., 1972) in glucose concentration.
At the start of each monitoring period, a research nurse inserted a sensor (Enlite™, Medtronic, Inc, Northridge, CA, USA) subcutaneously into the subject's abdomen, and instructed him/her in the use of the iPro™ CGM device (Medtronic, Inc, Northridge, CA, USA). The sensor continuously measures the interstitial glucose level (range 40–400 mg/dL) of the subcutaneous tissue, recording values every 5 min. Participants were also provided with a conventional OneTouch® Verio® Pro glucometer (LifeScan, Milpitas, CA, USA) as well as compatible lancets and test strips for calibrating the CGM. All subjects were asked to make at least three capillary blood glucose measurements (usually before main meals). These readings were taken without checking the current CGM reading. On the seventh day the sensor was removed and the data downloaded and stored for further analysis. If the number of data-acquisition "skips" per day totalled more than 2 h, the entire day's data were discarded.
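The day-level quality rule can be illustrated with a short Python sketch (the study itself used R). It rests on assumptions that are not stated in the text: a "skip" is counted as a missing 5-minute slot, days are consecutive 24-hour blocks counted from the start of monitoring, and the function name `keep_complete_days` is hypothetical.

```python
from collections import defaultdict

SAMPLING_MIN = 5                          # the sensor records a value every 5 min
MAX_MISSING_MIN = 120                     # more than 2 h of skips discards the day
SLOTS_PER_DAY = 24 * 60 // SAMPLING_MIN   # 288 slots of 5 min per day


def keep_complete_days(readings):
    """Drop every day whose missing 5-minute slots total more than 2 h.

    `readings` is a list of (minute_since_start, glucose_mg_dl) pairs.
    Counting a skip as a missing slot is an assumption of this sketch.
    """
    by_day = defaultdict(dict)
    for minute, glucose in readings:
        day, slot = divmod(minute // SAMPLING_MIN, SLOTS_PER_DAY)
        by_day[day][slot] = glucose       # keep one value per 5-minute slot

    kept = []
    for day in sorted(by_day):
        missing_min = (SLOTS_PER_DAY - len(by_day[day])) * SAMPLING_MIN
        if missing_min <= MAX_MISSING_MIN:
            for slot in sorted(by_day[day]):
                minute = (day * SLOTS_PER_DAY + slot) * SAMPLING_MIN
                kept.append((minute, by_day[day][slot]))
    return kept
```

Under these assumptions a day with exactly 2 h of missing slots is retained, since the rule discards only days with more than 2 h of skips.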
The time in range metric was calculated with two different methods. In the first, using the CGM records of the AEGIS study, we estimated the deciles of the CGM records of normoglycemic patients and used these deciles as cut-offs (Table 3). In the second, we used the cut-off points established by the ADA in the 2019 medical guideline (Battelino et al., 2019) (Table 4).
Table 3: Time in range cut-offs given by the deciles of the CGM records of normoglycemic patients (ten ranges).
Table 4: Time in range cut-offs established by the ADA (Battelino et al., 2019) (five ranges).
The density function of each individual was estimated with a non-parametric kernel density estimator. For this purpose, we used a Gaussian kernel and the rule of thumb for the smoothing parameter. In addition, we estimated the quantile representation for the 2-Wasserstein methods using the empirical distribution.
The following three regression models were used: i) the non-parametric kernel functional regression model based on the 2-Wasserstein distance, with the glucodensity as predictor (Ferraty and Vieu, 2006); ii) a global 2-Wasserstein regression model where the glucodensity is the response (Petersen and Müller, 2019); and iii) a $k$-nearest neighbor algorithm in the case of time in range metrics, with $k = 10$ neighbors.
In the case of time in range metrics, we applied the isometric log-ratio (ilr) transformation for compositional data prior to fitting the model. To avoid problems with zeros, a fixed positive constant was added to each range, and the ranges were then normalized to add to 1.
All analyses were carried out using R software. Functional data analysis was performed using the fda.usc package (Febrero-Bande and de la Fuente, 2012), which is freely available at https://cran.r-project.org/ , and our own implementations of the ANOVA test of Dubey and Müller (2019) and of Fréchet regression in Petersen and Müller (2019) using the 2-Wasserstein distance. The glucodensities and their quantile representation were estimated using base R functions.
For patient $i$, denote the gathered glucose monitoring data by pairs $(t_{ij}, X_{ij})$, $j = 1, \dots, m_i$, where the $t_{ij}$ represent recording times that are typically equally spaced across the observation interval, and $X_{ij}$ is the glucose level at time $t_{ij} \in [0, T_i]$. Note that the number of records $m_i$, the spacing between them, and the overall observation length $T_i$ can vary by patient. One can think of these data as discrete observations of a continuous latent process $Y_i(t)$, with $X_{ij} = Y_i(t_{ij})$. The glucodensity for this patient is defined in terms of this latent process as $f_i(x) = F_i'(x)$, where
$$F_i(x) = \frac{1}{T_i} \int_0^{T_i} \mathbf{1}\left(Y_i(t) \le x\right) dt \quad \text{for } \inf_{t \in [0, T_i]} Y_i(t) \le x \le \sup_{t \in [0, T_i]} Y_i(t)$$
is the proportion of the observation interval in which the glucose levels remain at or below $x$. Since the $F_i$ are increasing from 0 to 1, the data to be modeled are a set of probability density functions $f_i$, $i = 1, \dots, n$.
Of course, neither $F_i$ nor the glucodensity $f_i$ is observed in practice, but one can construct an approximation through a density estimate $\tilde f_i(\cdot)$ obtained from the observed sample. In the case of CGM data, the glucodensities may have different support and shape. Therefore, we suggest using a non-parametric approach to estimate each density function. For example, using a kernel-type estimator, we have
$$\tilde f_i(x) = \frac{1}{m_i} \sum_{j=1}^{m_i} K_{h_i}(x - X_{ij}),$$
where $h_i > 0$ is the smoothing parameter, $K$ is the kernel function, and $K_{h_i}(s) = \frac{1}{h_i} K\left(\frac{s}{h_i}\right)$. The choice of $K$ does not have a big impact on the efficiency of the estimator, but the value of $h_i$ is crucial.
Several alternatives for selecting the smoothing parameter have been proposed in the literature, including cross-validation, minimizing the estimated mean integrated squared error (MISE), or a "rule of thumb" derived from the assumption that the density is Gaussian. In this last case, the choice can be explicitly written as $\tilde h_i = 1.06\, \tilde\sigma_i\, m_i^{-1/5}$, where $\tilde\sigma_i$ is the sample standard deviation of the $X_{ij}$. For more details, see Silverman (1986). Other approaches to density function estimation include the use of orthogonal series (e.g., Fourier or wavelet) expansions, splines, or smoothing of histograms. For further details the reader is referred to Antoniadis (1997); Izenman (1991); Müller and Petersen (2014).
Let $[a, b]$ be an interval of the real line, which may be unbounded, and suppose that each glucodensity $f_i$ has support contained in $[a, b]$. From a statistical point of view, the sample $f_1, \dots, f_n$ may be modeled and analyzed using methods of functional data analysis (Ramsay et al., 2005; Wang et al., 2016). However, since the $f_i$ must be non-negative and satisfy $\int_a^b f_i(x)\, dx = 1$, classical methods have in recent years been adapted to account for the nonlinear, distributional structure of density samples (Petersen and Müller, 2016; Hron et al., 2016). The general approach is to define a metric or distance between densities that, in turn, leads to descriptive statistics that respect the unique density properties. For example, define the data space of glucodensities as $A := \{ f : [a, b] \to \mathbb{R}^+ : \int_a^b f(x)\, dx = 1 \text{ and } \int_a^b x^2 f(x)\, dx < \infty \}$. Given two arbitrary glucodensities $f, g \in A$, the 2-Wasserstein distance (Villani, 2008) between $f$ and $g$ is
$$d_W(f, g) = \sqrt{\int_0^1 \left( F^{-1}(t) - G^{-1}(t) \right)^2 dt}, \quad (1)$$
where $F$ and $G$ are the cumulative distribution functions (cdfs) of the density functions $f$ and $g$, and $F^{-1}$ and $G^{-1}$ are the corresponding quantile functions.
The 2-Wasserstein distance is a natural distance to measure the similarity between density functions through its representation in the space of quantile (inverse cdf) functions, and it has already been successfully applied in biological problems. Furthermore, it has computational and modeling advantages compared to the usual $L^2[a, b]$ metric when glucodensities have different support within $[a, b]$.
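As an illustration, the kernel estimate with the rule-of-thumb bandwidth and the quantile-based distance (1) can be sketched in a few lines of standard-library Python (the analyses in the paper were run in R). This is a minimal sketch under assumptions: the function names are hypothetical, the Gaussian kernel is written out explicitly, and the integral in (1) is approximated on a midpoint grid of probabilities.

```python
import math


def silverman_bandwidth(x):
    """Rule-of-thumb bandwidth h = 1.06 * sd * m^(-1/5)."""
    m = len(x)
    mean = sum(x) / m
    sd = math.sqrt(sum((v - mean) ** 2 for v in x) / (m - 1))
    return 1.06 * sd * m ** (-1 / 5)


def glucodensity(x, grid):
    """Gaussian kernel density estimate of the readings x, evaluated on grid."""
    h = silverman_bandwidth(x)
    c = 1.0 / (len(x) * h * math.sqrt(2 * math.pi))
    return [c * sum(math.exp(-0.5 * ((g - v) / h) ** 2) for v in x) for g in grid]


def empirical_quantiles(x, probs):
    """Empirical quantile function (inverse cdf) at probabilities in (0, 1)."""
    xs = sorted(x)
    m = len(xs)
    return [xs[min(m - 1, max(0, math.ceil(p * m) - 1))] for p in probs]


def wasserstein2(x, y, n_grid=200):
    """Approximate the 2-Wasserstein distance (1) between two samples."""
    probs = [(k + 0.5) / n_grid for k in range(n_grid)]   # midpoints of (0, 1)
    qx = empirical_quantiles(x, probs)
    qy = empirical_quantiles(y, probs)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(qx, qy)) / n_grid)
```

Working with quantile functions means the two samples need not share the same support or the same number of records, which matches the situation described for CGM data.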
The 2-Wasserstein distance also has a physical interpretation in the theory of optimal transport.
As glucodensities are distributional data, the direct application of the usual techniques for functional data, such as estimation of the mean and covariance or regression models, may lead to misleading results. Hence, we have chosen to use models based on the 2-Wasserstein distance, although other choices are possible. As a starting point, based on the notion of distance we can generalise the mean and variance to a random variable that takes values in an abstract space with metric structure (Fréchet, 1948). As we will see, similar adaptations can be developed for regression, hypothesis testing, or cluster analysis.
Given a distance $d : A \times A \to \mathbb{R}^+$, of which $d_W$ is one example, and a random variable $f$ defined on $A$, the Fréchet mean of $f$ is
$$\mu_f = \arg\min_{g \in A} E\left( d^2(f, g) \right).$$
The Fréchet variance of $f$ is then $\sigma_f^2 = E\left( d^2(f, \mu_f) \right)$. If the choice of distance is the Wasserstein metric $d_W$, these are given the names of Wasserstein mean and variance, respectively. In the following subsections we extend these Fréchet concepts to statistical methodologies for regression, clustering, and hypothesis testing based on the notion of distance.
Let $f$ be a functional random variable taking values in $(A, d_W)$ and $Y$ a random variable that takes values in the real line. We assume the following regression relationship between $f$ and $Y$, which represent the predictor and response variables, respectively:
$$Y = g(f) + \epsilon, \quad (2)$$
where $g : A \to \mathbb{R}$ is an unknown smooth function, and the random error $\epsilon$ satisfies $E(\epsilon) = 0$.
Given a sample $\{(f_i, Y_i) \in A \times \mathbb{R}\}_{i=1}^n$, most non-parametric estimators $\tilde g(\cdot)$ have the form of a weighted average of the responses
$$\tilde g(x) = \sum_{i=1}^n w_{ni}(x)\, Y_i. \quad (3)$$
In general, the weights $w_{ni}(x)$ depend on the distance between each $f_i$ and $x$, with larger distances receiving lower weights, and satisfy $\sum_{i=1}^n w_{ni}(x) = 1$ (Ferraty and Vieu, 2006). A typical choice would be the Nadaraya–Watson weights
$$w_{ni}(x) = \frac{K\left( d(x, f_i) / h \right)}{\sum_{j=1}^n K\left( d(x, f_j) / h \right)}, \quad (4)$$
where $h$ is a smoothing parameter and $K : \mathbb{R} \to \mathbb{R}$ is a known univariate probability density function called the kernel. For more details about this procedure see Ferraty and Vieu (2006). As an alternative to the above method, we can use kernel methods in Reproducing Kernel Hilbert Spaces (RKHS) (Preda, 2007; Szabó et al., 2016).
In the case of regression models with a density function as response, the literature is not very extensive to date (Nerini and Ghattas, 2007; Han et al., 2019; Petersen and Müller, 2019; Capitaine et al., 2019; Talska et al., 2018).
In this article we use the model proposed in Petersen and Müller (2019), which allows us to incorporate the desired metric $d_W$ and is a direct generalization of classical linear regression. The primary rationale for our use of this model is that, unlike the other approaches mentioned above, there is a methodology developed to perform inferential procedures such as confidence bands and hypothesis testing in order to establish the significance of the input variables in the model (Petersen et al., 2019).
Let $f$ be a random variable (e.g. a glucodensity) that takes values in the space $(A, d_W)$ defined above. Consider a random vector $U \in \mathbb{R}^d$ that contains the set of predictors. Our interest is in the Fréchet regression function, or function of conditional Fréchet means,
$$f(u) := \arg\min_{g \in A} E\left( d_W^2(f, g) \mid U = u \right), \quad u \in \mathbb{R}^d. \quad (5)$$
Petersen and Müller (2019) impose a particular model for $f(u)$ that, in direct analogy to classical linear regression, takes the form of a weighted Fréchet mean:
$$f(u) = \arg\min_{g \in A} E\left( s(U, u)\, d_W^2(f, g) \right), \quad u \in \mathbb{R}^d. \quad (6)$$
Here, the weight function is
$$s(U, u) = 1 + (U - \mu)^T \Sigma^{-1} (u - \mu), \quad \mu = E(U), \quad \Sigma = \mathrm{Cov}(U), \quad (7)$$
and $\Sigma$ is assumed to be positive definite.
Given a sample $(U_i, f_i)$, $i = 1, \dots, n$, of independent pairs each distributed as $(U, f)$, one can proceed to estimate $f(u)$ for any desired input $u$. Due to the intimate connection between the Wasserstein metric and quantile functions as in (1), for most inferential procedures it is sufficient to estimate the conditional Wasserstein mean quantile function $Q(u)$ corresponding to $f(u)$. Let $D$ be the set of quantile functions, $Q_i$ the quantile function corresponding to the random density $f_i$, and define empirical weights $s_{in}(u) = 1 + (U_i - \bar U)^T \tilde\Sigma^{-1} (u - \bar U)$, where $\bar U$ and $\tilde\Sigma$ are the sample mean and covariance of the $U_i$, respectively.
The natural estimator under $d_W$ is the weighted empirical mean quantile function
$$\tilde Q(u) = \arg\min_{Q \in D} \sum_{i=1}^n s_{in}(u)\, \| Q - Q_i \|^2, \quad (8)$$
where $\| \cdot \|$ denotes the $L^2[0, 1]$ norm on $D$.
A straightforward algorithm for computing $\tilde Q(u)$ is given in the Supplementary Material of Petersen et al. (2019). In addition, two algorithms are given there to estimate confidence bands at a given significance level $\alpha$ for both the quantile functional parameter $Q(u)$ and the density parameter $f(u)$.
To validate the glucodensity representation, we use the database from the AEGIS study (Gude et al., 2017). The database contains continuous glucose monitoring data covering between 2 and 6 days for 581 patients from a random sample of a general population. A detailed description of the data is given in Section 2, together with the characteristics of the patients in Table 2. To carry out the validation, we use two different regression models: i) a non-parametric regression model where the only predictor is the glucodensity, and ii) a linear regression model where the response is a glucodensity. Further details on the regression models used can be found in Section 3. The first model was used to predict glycated hemoglobin (A1c) (Kilpatrick, 2000), homeostatic model assessment (HOMA-IR) (Ausk et al., 2010), and the following measures of glycemic variability (Service, 2013; Monnier et al., 2008; Gude et al., 2017): continuous overall net glycemic action (CONGA), mean amplitude of glycemic excursions (MAGE), and mean of daily differences (MODD), through the glucodensity representation. In contrast, the second was used to predict the glucodensity from the five variables above. Figure 1 gives a visualization of the sample of glucodensities used in these models. The biological significance of the variables under consideration is described in Table 1.
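The two estimators used for validation, the Nadaraya–Watson estimator (3)-(4) and the weighted mean quantile function (8), can be sketched in standard-library Python as follows. This is an illustrative sketch, not the authors' R implementation, and it makes several assumptions: every quantile function is stored on a common equally spaced probability grid, the kernel is Gaussian, there is a single scalar predictor ($d = 1$), the variance uses the divisor $n$, and the minimization over $D$ is carried out by a pool-adjacent-violators projection onto non-decreasing sequences. All function names are hypothetical.

```python
import math


def w2_quantiles(q1, q2):
    """2-Wasserstein distance between two quantile functions on a common grid."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(q1, q2)) / len(q1))


def nw_predict(q_new, quantiles, responses, h):
    """Nadaraya-Watson estimate (3)-(4) with a Gaussian kernel on W2 distances."""
    k = [math.exp(-0.5 * (w2_quantiles(q_new, q) / h) ** 2) for q in quantiles]
    return sum(ki * yi for ki, yi in zip(k, responses)) / sum(k)


def project_monotone(values):
    """Least-squares projection onto non-decreasing sequences (pool adjacent violators)."""
    blocks = []                                  # each block is [mean, size]
    for v in values:
        blocks.append([v, 1])
        while len(blocks) > 1 and blocks[-2][0] > blocks[-1][0]:
            m2, n2 = blocks.pop()
            m1, n1 = blocks.pop()
            blocks.append([(m1 * n1 + m2 * n2) / (n1 + n2), n1 + n2])
    return [m for m, size in blocks for _ in range(size)]


def frechet_regression(u_new, u, quantiles):
    """Estimator (8) for one scalar predictor (assumption: d = 1)."""
    n = len(u)
    u_bar = sum(u) / n
    var = sum((ui - u_bar) ** 2 for ui in u) / n
    # Empirical weights s_in(u); they sum to n and can be negative.
    s = [1 + (ui - u_bar) * (u_new - u_bar) / var for ui in u]
    grid = range(len(quantiles[0]))
    avg = [sum(si * q[k] for si, q in zip(s, quantiles)) / n for k in grid]
    # Because the weights sum to n, (8) is the projection of this weighted
    # average onto the set of non-decreasing functions.
    return project_monotone(avg)
```

When `u_new` equals the sample mean of the predictor, every weight equals 1 and the estimate reduces to the pointwise average of the quantile functions, which is the sample Wasserstein mean.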
The aim of the first set of regression analyses is to demonstrate that the glucodensity is sufficiently rich in its information content to recover the aforementioned biomarkers with high precision. To quantify this precision, we estimated the $R^2$ after fitting a non-parametric model for each biomarker as the outcome variable, using the glucodensity as the sole predictor (i.e. independent variable). The $R^2$ estimates for A1c, HOMA-IR, MAGE, MODD, and CONGA were 0.79, 0.79, 0.92, 0.86, and 0.…
Figure 2: Real values vs. estimated values when the glucodensity is the predictor. Panels: A1c (%); HOMA-IR (mass units); MAGE (mg/dL); MODD; CONGA (mg/dL). x-axis: real values; y-axis: estimated values.
In the second regression analysis, with the glucodensity as the outcome variable, we aim to show that the previous measurements commonly used in clinical practice are not capable of capturing the glucodensity with high accuracy. This fact is not completely surprising because, as noted by some authors (Zaccardi and Khunti, 2018), the information provided by a CGM is more precise than that contained in summary measures. To accomplish this, we computed a suitable version of $R^2$ for this task after fitting a regression model where the response is a glucodensity and the previous variables are the predictors. In this case, the estimated $R^2$ was 0.74. As predicted, compared to the previous section's results, we were not able to accurately capture the complex nature of glucodensities, even while using the combined predictive power of several commonly used summary measures. Moreover, in some cases, the differences in prediction can be significant (see Figure 3).
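The text does not spell out which version of $R^2$ was used when the response is a glucodensity. One natural candidate, assumed here for illustration only, replaces squared differences with squared 2-Wasserstein distances and the sample mean with the Wasserstein mean. The sketch below uses standard-library Python, hypothetical function names, and quantile functions stored on a common probability grid.

```python
def w2_squared(q1, q2):
    """Squared 2-Wasserstein distance between quantile functions on a common grid."""
    return sum((a - b) ** 2 for a, b in zip(q1, q2)) / len(q1)


def wasserstein_r2(observed, fitted):
    """1 - (residual Wasserstein variation) / (total Wasserstein variation).

    Assumption of this sketch: the total variation is measured around the
    Wasserstein mean, whose quantile function is the pointwise average of
    the observed quantile functions.
    """
    n = len(observed)
    mean_q = [sum(q[k] for q in observed) / n for k in range(len(observed[0]))]
    residual = sum(w2_squared(q, f) for q, f in zip(observed, fitted))
    total = sum(w2_squared(q, mean_q) for q in observed)
    return 1 - residual / total
```

With this definition a perfect fit gives 1, and predicting the Wasserstein mean for every subject gives 0, mirroring the behaviour of the classical coefficient of determination.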
To illustrate the higher clinical sensitivity of glucodensities compared to time in range metrics, we compared the ability of each representation to predict A1c, HOMA-IR, and the glycemic variability metrics MODD, MAGE, and CONGA, using the data from the AEGIS study. The predictive capacity of the glucodensity representation was illustrated above, and this section gives the corresponding results for time in range metrics, calculated according to two sets of cutoffs. In the first, the deciles of the normoglycemic individuals from the AEGIS study were used, while in the second those proposed by the ADA were used. Tables 3 and 4 in Section 2 show the exact cutoff values for both cases. Since the time in range metrics constitute a sample of standard compositional data, the isometric log-ratio (ilr) transformation was employed in combination with a $k$-nearest neighbor algorithm as a regression model for predicting the scalar variables. Methodological details about this statistical procedure can be found in Section 2.
Figure 3: Residuals in quantile space when predicting glucodensities (x-axis: quantile; y-axis: glucose, mg/dL).
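The time in range pipeline (range proportions, zero replacement, ilr transform, $k$-nearest neighbors) can be sketched in standard-library Python as follows. The sketch is illustrative and makes assumptions that the text does not fix: the value of the positive constant `eps`, the particular orthonormal (pivot) basis used for the ilr coordinates, the convention that a reading equal to a cutoff falls in the upper range, and the hypothetical function names.

```python
import bisect
import math


def time_in_range(glucose, cutoffs):
    """Proportion of readings in each range defined by sorted cutoffs."""
    counts = [0] * (len(cutoffs) + 1)
    for g in glucose:
        counts[bisect.bisect_right(cutoffs, g)] += 1
    return [c / len(glucose) for c in counts]


def ilr(parts, eps=1e-3):
    """Isometric log-ratio coordinates after adding eps and renormalizing to 1."""
    shifted = [p + eps for p in parts]
    total = sum(shifted)
    logs = [math.log(p / total) for p in shifted]
    coords = []
    for k in range(1, len(logs)):
        # Balance between the first k parts and part k + 1 (pivot basis).
        mean_first = sum(logs[:k]) / k
        coords.append(math.sqrt(k / (k + 1)) * (mean_first - logs[k]))
    return coords


def knn_predict(z_new, z_train, y_train, k=10):
    """Average response of the k nearest neighbors in ilr coordinates."""
    order = sorted(range(len(z_train)), key=lambda i: math.dist(z_new, z_train[i]))
    nearest = order[:k]
    return sum(y_train[i] for i in nearest) / len(nearest)
```

A composition with $D$ ranges is mapped to $D - 1$ unconstrained coordinates, so Euclidean distances between the transformed vectors are meaningful for the nearest-neighbor step.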
Figure 4 compares the real and estimated values of the previous five variables under the two time in range metrics under consideration. Table 5 provides the estimates of $R^2$ for each variable and metric. The predictive capacity is significantly worse than that attained by the glucodensity methodology. The superiority of the glucodensity is particularly noteworthy in the case of the HOMA-IR variable, where the association is quite weak for time in range metrics. Even for the other variables where the values of $R^2$ are moderate, the larger residuals seen in diabetes patients with more severe alterations of glucose metabolism indicate that time in range metrics are particularly poorly suited for such patients. Interestingly, we do not observe substantial or consistent differences between the two time in range metrics used, as deciles perform better than ADA criteria for two of the variables, while in other instances the ordering was reversed.

                        A1c    HOMA-IR   CONGA   MAGE   MODD
Normoglycemic cut-off   0.63   0.22      0.68    0.65   0.[...]
ADA cut-off             0.61   0.08      0.73    0.69   0.[...]

Table 5: $R^2$ estimated with the time in range metrics under consideration.

Figure 4: Real values vs. estimated values when the time in range metric is the predictor (panels: A1c, %; HOMA-IR, mass units; MAGE, mg/dL; MODD; CONGA, mg/dL). Blue: time in range metric with cut-offs calculated from the normoglycemic patients of the AEGIS database. Red: time in range metric using the cut-offs suggested by the ADA.
As a special case of regression, suppose we have a sample $f_1, \ldots, f_n$ of glucodensities defined on $(A, d_W)$ belonging to $k$ different groups $G_1, G_2, \ldots, G_k$ that partition $\{1, \ldots, n\}$ and are of size $n_j$ ($j = 1, \ldots, k$), so that $\sum_{j=1}^{k} n_j = n$. If the goal is simply to test whether the Wasserstein means are equal for each group, Petersen et al. (2019) developed testing procedures based on model (6) for this purpose. An advantage of this model is its flexibility, which allows for multiple factor layouts as well as tests for interactions. However, the theoretical properties of these tests require a type of equal variance assumption that may be restrictive for some data sets.

More generally, one may wish to test the null hypothesis that the population distributions of the $k$ groups share common Wasserstein means and variances, against the alternative that at least one of the groups has a different population distribution compared to the others in terms of either its Wasserstein mean or variance. In this scenario, Dubey and Müller (2019) investigated a test statistic based on the group proportions $\lambda_{j,n} = n_j n^{-1}$, the groupwise sample Wasserstein means and variances

$$\tilde{\mu}_j = \arg\min_{g \in A} \sum_{i \in G_j} d_W^2(f_i, g), \qquad \tilde{V}_j = n_j^{-1} \sum_{i \in G_j} d_W^2(f_i, \tilde{\mu}_j),$$

the pooled Wasserstein mean and variance

$$\tilde{\mu}_p = \arg\min_{g \in A} \sum_{j=1}^{k} \sum_{i \in G_j} d_W^2(f_i, g), \qquad \tilde{V}_p = n^{-1} \sum_{j=1}^{k} \sum_{i \in G_j} d_W^2(f_i, \tilde{\mu}_p),$$

and finally the quantities

$$\tilde{\sigma}_j^2 = \frac{1}{n_j} \sum_{i \in G_j} d_W^4(f_i, \tilde{\mu}_j) - \left( \frac{1}{n_j} \sum_{i \in G_j} d_W^2(f_i, \tilde{\mu}_j) \right)^2$$

as estimates of the variance of $\tilde{V}_j$. Then, with $F_n = \tilde{V}_p - \sum_{j=1}^{k} \lambda_{j,n} \tilde{V}_j$ and $R_n = \sum_{j} \cdots$ [...]

[...] $Y, Y' \sim F$ and $Z, Z' \sim G$ that are defined on a (semi)metric space $(\Omega, \rho)$ of negative type, where $\rho : \Omega \times \Omega \to \mathbb{R}$ is the semi-metric.
Though the notation in this section is quite general, in particular we have in mind the case $(\Omega, \rho) = (A, d_W)$ corresponding to glucodensities. The energy distance associated with $\rho$ between the distributions $F$ and $G$ is

$$\epsilon_\rho(F, G) = 2E(\rho(Y, Z)) - E(\rho(Y, Y')) - E(\rho(Z, Z')).$$

Given random samples $Y_1, \ldots, Y_n \overset{iid}{\sim} F$ and $Z_1, \ldots, Z_m \overset{iid}{\sim} G$, the sample energy distance is

$$\tilde{\epsilon}_\rho(F, G) = \frac{2}{nm} \sum_{i=1}^{n} \sum_{j=1}^{m} \rho(Y_i, Z_j) - \frac{1}{n^2} \sum_{i=1}^{n} \sum_{j=1}^{n} \rho(Y_i, Y_j) - \frac{1}{m^2} \sum_{i=1}^{m} \sum_{j=1}^{m} \rho(Z_i, Z_j).$$

The asymptotic distribution of the above statistic under the null hypothesis ($H_0 : F = G$) as well as under the alternative ($H_a : F \neq G$) depends on the chosen semimetric $\rho$. Moreover, its expression is difficult to calculate and to implement in practice. Hence, when using energy distance based methods, the distribution under the null hypothesis is usually calibrated with a permutation method. Alternatives for calibrating the distribution under the null hypothesis include the wild or a weighted bootstrap, as described in Leucht and Neumann (2013); Jiménez-Gamero et al. (2019).

The energy distance can also be extended to handle samples from more than two populations. Given $k$ independent samples $Y_{j1}, \ldots, Y_{jn_j} \overset{iid}{\sim} F_j$, $j = 1, \ldots, k$, the energy distance statistic is $\tilde{\epsilon}_\rho(F_1, \ldots, F_k) = \sum_{1 \le j} \cdots$ [...]

[...] 10. Therefore, there is no statistically significant difference between men and women at the significance level of 5 percent.

Figure 5 shows the glucodensity samples for each gender using their quantile representations. The pointwise means of these quantile functions constitute the quantile function of the sample Wasserstein mean glucodensities. These, together with pointwise standard deviation curves, are also shown in Figure 5. On average, the groups are quite similar.
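The two-sample comparison described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' code: each glucodensity is assumed to be represented by its quantile function evaluated on a common, equally spaced probability grid, so that the 2-Wasserstein distance reduces to an $L^2$ distance between quantile functions and the Wasserstein mean to a pointwise average. The function names, the number of permutations, and the simulated data are hypothetical.

```python
import numpy as np


def wasserstein2(q1, q2):
    """Approximate 2-Wasserstein distance between two distributions given as
    quantile functions on a common, equally spaced probability grid."""
    return np.sqrt(np.mean((np.asarray(q1) - np.asarray(q2)) ** 2))


def wasserstein_mean(Q):
    """Quantile function of the sample Wasserstein mean (pointwise average)."""
    return np.asarray(Q).mean(axis=0)


def pairwise_w2(Q):
    """Matrix of 2-Wasserstein distances between all rows of Q."""
    diff = Q[:, None, :] - Q[None, :, :]
    return np.sqrt(np.mean(diff ** 2, axis=2))


def energy_statistic(D, idx_y, idx_z):
    """Sample energy distance computed from a precomputed distance matrix."""
    d_yz = D[np.ix_(idx_y, idx_z)].mean()
    d_yy = D[np.ix_(idx_y, idx_y)].mean()
    d_zz = D[np.ix_(idx_z, idx_z)].mean()
    return 2.0 * d_yz - d_yy - d_zz


def energy_permutation_test(Q_y, Q_z, n_perm=999, seed=0):
    """Permutation calibration of the two-sample energy distance test."""
    rng = np.random.default_rng(seed)
    Q = np.vstack([Q_y, Q_z])
    n = len(Q_y)
    D = pairwise_w2(Q)
    idx = np.arange(len(Q))
    observed = energy_statistic(D, idx[:n], idx[n:])
    exceed = 0
    for _ in range(n_perm):
        perm = rng.permutation(idx)  # reshuffle the group labels
        if energy_statistic(D, perm[:n], perm[n:]) >= observed:
            exceed += 1
    return observed, (exceed + 1) / (n_perm + 1)


# Simulated quantile functions (sorted samples) for two groups of 20 subjects.
rng = np.random.default_rng(1)
Q_women = np.sort(rng.normal(105.0, 15.0, size=(20, 100)), axis=1)
Q_men = np.sort(rng.normal(108.0, 15.0, size=(20, 100)), axis=1)
statistic, p_value = energy_permutation_test(Q_women, Q_men, n_perm=499)
mean_women, mean_men = wasserstein_mean(Q_women), wasserstein_mean(Q_men)
```

Precomputing the distance matrix once means each permutation only re-indexes it, which keeps the calibration cheap even for a few hundred subjects.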
Certain discrepancies between the two groups are nevertheless observed in terms of their variance, although these are not large enough for the test to show statistically significant differences.

Figure 5: (Left two panels) Glucodensities for men and women of the AEGIS study, plotted as quantile functions; (Third panel) 2-Wasserstein mean quantile functions for each group; (Fourth panel) Cross-sectional standard deviation curves for quantile functions in each group.

Cluster analysis is an essential tool for identifying subgroups of patients with similar characteristics. As an example, with the diabetes patients' data from the AEGIS study, we perform a cluster analysis using three clusters. To establish whether a patient has diabetes, we use the doctor's previous diagnosis or, failing that, whether the individual's current A1c and FPG values fall in the ranges established by the ADA for that category.

Figure 6 contains the results of applying the cluster analysis to the diabetes patients. The algorithm identified three differentiated groups of patients. The first group consists of patients with normal glucose values, probably because they are on medication and the diagnosis of diabetes was made in the past. The second group consists of patients with slightly altered glucose metabolism. Finally, the last group consists of patients with severely altered glucose values and, as can be seen in the glucodensities, their glucose is continuously fluctuating. The two-dimensional graphical representation of the density function of A1c and FPG helps to validate these findings.

Figure 6: Clustering analysis of diabetes patients in the AEGIS study (for each of Clusters 1 to 3: glucodensities, with density against glucose in mg/dL, and the joint density of A1c, %, and FPG, mg/dL).

The primary contribution of this article is to propose a new representation of CGM data called glucodensity. We have validated this representation from a clinical point of view, showing that it is more accurate than time in range metrics.

Diabetes etiology and biological components to capture in a mathematical representation

Diabetes encompasses a heterogeneous group of impairments of glucose metabolism, such as the frequent presence of hyperglycemia or hypoglycemia (Association et al., 2018). Anomalous glucose fluctuations are another essential trait of dysglycemic regulation (Monnier and Colette, 2011; Monnier et al., 2008). The use of glycemic control measures that go beyond average glucose values such as A1c and also capture i) the impact of time spent at each glucose concentration on the glucose deregulation process and ii) the oscillations of glucose associated with cellular damage (Monnier and Colette, 2011) is crucial in the management of patients with diabetes, as well as in the assessment of glucose metabolism with a high degree of precision.

Our proposal accurately captures the components of diabetes mentioned above. Using clinical data, we evaluated the clinical sensitivity against established biomarkers in diabetes. We found a high association between the glucodensity and A1c, HOMA-IR, CONGA, MODD, and MAGE. In the case of the HOMA-IR variable, the predictive ability does not seem excellent, although, to the best of our knowledge, no known marker shows a predictive ability against that variable. However, our model can provide consistent values for moderate and large HOMA-IR values. While the fit for the variable A1c was not perfect, we must consider that the time scales for A1c and the glucodensities were quite different. A1c is a measure that reflects the average glucose over 2 [...] $R^2$ of 0.79 is better than the average glucose recorded by the monitoring period ($R^2$ = 0.[...]) [...] $R^2$ shows a moderate relationship between those variables. However, we are introducing the essential variables of the glucose deregulation process. A possible explanation of this is that the summary measures commonly used in diabetes can hardly capture an individual's glycemic profile. Glucose metabolism is very complex and highly dependent on the patient's conditions. For example, the cellular mechanisms are different in type I and type II diabetes. In the former, there is an inhibition of β-cell function and a consequent lack of insulin production, while insulin secretion is reduced in the latter (Taylor, 2013). In this context, the introduction of the concept of glucodensity provides greater clinical accuracy to the possible decisions derived from such representation compared to traditional methods because we utilize the entire distribution of glucose concentrations of an individual over time.

While time in range metrics may also achieve the previous aim, they do so to a clearly lesser extent than the glucodensity. Our proposal can capture the differences between individuals at each glucose concentration. In contrast, time in range only measures glucose differences along intervals, with the subsequent loss of information.
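The loss of information incurred by interval-based summaries can be illustrated with a small synthetic example. The sketch below is only an illustration under stated assumptions: the cutoffs of 70 and 180 mg/dL, the convention that a reading equal to a cutoff falls in the upper zone, the simulated readings, and the function names are all chosen here for the example and are not taken from the study. Two simulated profiles share exactly the same time in range vector, yet their quantile functions are far apart in 2-Wasserstein distance.

```python
import numpy as np


def time_in_range(glucose, cutoffs=(70.0, 180.0)):
    """Proportion of readings in each zone defined by the cutoffs.

    Convention (an assumption): a reading equal to a cutoff is assigned to
    the zone above it.
    """
    glucose = np.asarray(glucose, dtype=float)
    zone = np.searchsorted(np.asarray(cutoffs), glucose, side="right")
    return np.bincount(zone, minlength=len(cutoffs) + 1) / glucose.size


def quantile_function(glucose, n_grid=100):
    """Empirical quantile function on an equally spaced probability grid."""
    grid = (np.arange(n_grid) + 0.5) / n_grid
    return np.quantile(glucose, grid)


def wasserstein2(q1, q2):
    """Approximate 2-Wasserstein distance between two quantile functions."""
    return np.sqrt(np.mean((q1 - q2) ** 2))


# Two hypothetical profiles of 288 readings (one day at 5-minute sampling),
# both entirely inside the 70 to 180 mg/dL zone.
rng = np.random.default_rng(1)
profile_a = rng.uniform(80.0, 110.0, size=288)   # readings low in the zone
profile_b = rng.uniform(140.0, 175.0, size=288)  # readings high in the zone

tir_a, tir_b = time_in_range(profile_a), time_in_range(profile_b)
distance = wasserstein2(quantile_function(profile_a), quantile_function(profile_b))
```

Here `tir_a` and `tir_b` are identical, so any analysis based on these zones cannot separate the two profiles, whereas `distance` is at least 30 mg/dL because every reading of the second profile exceeds every reading of the first by that amount.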
Time in range metrics are also substantially limited since the target zones must be defined in advance, and these may depend on the study population or the aim of the analysis.

Empirical results demonstrate the advantages of our proposal beyond the theoretical framework. The ability of the glucodensity to predict A1c, HOMA-IR, and the CONGA, MAGE, and MODD variability measures is surprisingly high, much higher than that achieved with the time in range metric despite using two different sets of target zones: the deciles of the glucose values of normoglycemic patients and the target zones prescribed by the ADA.

The estimated $R^2$ between glucodensities and A1c is similar to that reported by other authors between A1c and average glucose values (Nathan et al., 2007). However, in this study, patients are monitored only for 2 [...] $R^2$ between A1c and the mean glucose in our database is only 0.[...]

From a statistical standpoint, glucodensities are a special constrained type of functional data known as distributional data; therefore, it is not possible to directly use the usual statistical techniques. In this paper, we have proposed a framework for the analysis of these distributional data based on distances, with existing techniques for hypothesis testing, cluster analysis, and regression models. However, further methodological development is necessary, for example in mixed models or causal inference methods, where no methodology is available.

A potential limitation of our representation is that it ignores the order of events. Instead, it analyzes only the distribution of glucose values. However, the event sequence may not be a critical component in diabetes modeling. The main factor in microvascular and macrovascular complications is chronic hyperglycemia (Cryer, 2014; Škrha et al., 2016), and this is captured with high accuracy by our models. Moreover, an essential aspect of managing diabetes patients is hypoglycemia control, and our proposal also captures this.
Finally, the third component of dysglycemia (Monnier et al., 2008), glucose variability, can be accurately predicted by our representation, at least through the metrics CONGA, MAGE, and MODD.

The sample size used may also be a limitation from a statistical point of view. Nevertheless, in the field of diabetes, the AEGIS study is one of the world's largest databases and, unlike other studies, is composed of randomly selected individuals from a general population and non-participants (Zeevi et al., 2015). Finally, regarding study validation, perhaps the most reliable way of validating the new representation is in terms of the patients' long-term prognosis. However, to the best of our knowledge, no study with a reasonable sample size has this information from the intensive use of CGM technology. Moreover, we have established the clinical validity using variables that do have a clear and established relationship with the prognosis and prevalence of diabetes, as evidenced in the current literature in the field.

Adopting the concept of glucodensity in clinical practice and biomedical research could be very promising in the following ways.

• To have a simple and more accurate representation of the glycaemic profile of an individual. This representation is especially useful in the management of diabetic patients and to assess the effects of an intervention.
• To establish if there are statistically significant differences between patients subjected to different interventions, for example, in a clinical trial.
• To identify different subtypes of patients based on their glycaemic condition and other variables. Cluster analysis of glucodensities can create new patient subtypes based on the risk of diabetes or other complications. Furthermore, it allows us to describe the etiology of the disease better by creating groups of subjects whose glucose profiles and other clinical characteristics are similar.
• To establish the prognosis or risk of a patient or to analyze the relationship of an individual's glycaemic profile with different clinical variables in epidemiological studies.
• To predict changes in the glycaemic profile based on the characteristics of the individuals and the intervention performed. For example: how does the glucodensity vary according to the diet?
• To recommend the most advantageous treatments for a patient. Following the previous idea, a causal inference model could be fitted where the response is the glucodensity, for example, to establish which diet is the most beneficial for the individual to achieve suitable glucose levels.

We introduce the glucodensity methodology with CGM data. However, our methodology is also valid for data from other biosensors, such as accelerometers used to measure physical activity levels. In this domain, the time in range metric is one of the most used representations, and perhaps the adoption of our approach can lead to better results (Dumuid et al., 2018, 2020). The adoption of the new methodology with other biosensors may be an essential research issue to be addressed in the future.

In the diabetes field, it will be necessary to evaluate the predictive capacity of the glucodensity in the long-term prognosis of patients. In addition, it would be interesting to assess, over more extended monitoring periods, the reproducibility of the constructed representation between days and weeks. One way to accomplish this is to compute the intraclass correlation coefficient (ICC) using, for example, the methodology recently proposed in Xu et al., which is based on distances between functions.
Acknowledgements We thanks Russell Lyons for his discussions on the use of energy-distance based methods withglucodensities.This work has received financial support from Carlos III Health Institute, Grant/Award Num-ber: PI16/01395; Ministry of Economy and Competitiveness (SPAIN) European Regional Devel-opment Fund (FEDER); the Axencia Galega de Innovaci´on, Conseller´ıa de Econom´ıa, Emprego e20ndustria, Xunta de Galicia, Spain, Grant/Award Number: GPC IN607B 2018/01; National Sci-ence Foundation, Grant/Award Number: DMS-1811888; the Spanish Ministry of Economy andCompetitiveness Grant/Award Number: TIN2015-73566-JIN and TIN2017-84796-C21-R. Competing Interests The authors declare no competing interests. References Antoniadis, A. (1997). Wavelets in statistics: A review. Journal of the Italian Statistical Society , 97. URL https://doi.org/10.1007/BF03178905 . Association, A. D. et al. (2018). 6. glycemic targets: standards of medical care in dia-betes—2018. Diabetes Care , S55–S64. Ausk, K. J. , Boyko, E. J. and Ioannou, G. N. (2010). Insulin resistance pre-dicts mortality in nondiabetic individuals in the u.s. Diabetes Care , https://care.diabetesjournals.org/content/33/6/1179.full.pdf , URL https://care.diabetesjournals.org/content/33/6/1179 . Battelino, T. , Danne, T. , Bergenstal, R. M. , Amiel, S. A. , Beck, R. , Biester, T. , Bosi, E. , Buckingham, B. A. , Cefalu, W. T. , Close, K. L. , Cobelli, C. , Dassau, E. , DeVries, J. H. , Donaghue, K. C. , Dovc, K. , Doyle, F. J. , Garg, S. , Grunberger,G. , Heller, S. , Heinemann, L. , Hirsch, I. B. , Hovorka, R. , Jia, W. , Kordonouri, O. , Kovatchev, B. , Kowalski, A. , Laffel, L. , Levine, B. , Mayorov, A. , Mathieu, C. , Mur-phy, H. R. , Nimri, R. , Nørgaard, K. , Parkin, C. G. , Renard, E. , Rodbard, D. , Saboo,B. , Schatz, D. , Stoner, K. , Urakami, T. , Weinzimer, S. A. and Phillip, M. (2019). 
Clini-cal targets for continuous glucose monitoring data interpretation: Recommendations from the in-ternational consensus on time in range. Diabetes Care . https://care.diabetesjournals.org/content/early/2019/06/07/dci19-0028.full.pdf , URL https://care.diabetesjournals.org/content/early/2019/06/07/dci19-0028 . Beck, R. W. , Bergenstal, R. M. , Riddlesworth, T. D. , Kollman, C. , Li, Z. , Brown,A. S. and Close, K. L. (2019). Validation of time in range as an outcome measure for diabetesclinical trials. Diabetes Care , https://care.diabetesjournals.org/content/42/3/400.full.pdf , URL https://care.diabetesjournals.org/content/42/3/400 . Beck, R. W. , Connor, C. G. , Mullen, D. M. , Wesley, D. M. and Bergenstal, R. M. (2017). The fallacy of average: How using hba1c alone to assess glycemic control can be mislead-ing. Diabetes Care , https://care.diabetesjournals.org/content/40/8/994.full.pdf , URL https://care.diabetesjournals.org/content/40/8/994 . Bergenstal, R. M. (2015). Glycemic variability and diabetes complications: Does it mat-ter? simply put, there are better glycemic markers! Diabetes Care , https://care.diabetesjournals.org/content/38/8/1615.full.pdf , URL https://care.diabetesjournals.org/content/38/8/1615 .21 ergenstal, R. M. , Ahmann, A. J. , Bailey, T. , Beck, R. W. , Bissen, J. , Buckingham,B. , Deeb, L. , Dolin, R. H. , Garg, S. K. , Goland, R. , Hirsch, I. B. , Klonoff, D. C. , Kruger, D. F. , Matfin, G. , Mazze, R. S. , Olson, B. A. , Parkin, C. , Peters, A. , Powers, M. A. , Rodriguez, H. , Southerland, P. , Strock, E. S. , Tamborlane, W. and Wesley, D. M. (2013). Recommendations for standardizing glucose reporting and analysis tooptimize clinical decision making in diabetes: The ambulatory glucose profile (agp). DiabetesTechnology & Therapeutics , https://doi.org/10.1089/dia.2013.0051 , URL https://doi.org/10.1089/dia.2013.0051 . Capitaine, L. , Genuer, R. and Thi´ebaut, R. (2019). Fr´echet random forests. . Cryer, P. E. (2014). 
Glycemic goals in diabetes: Trade-off between glycemic control and iatrogenichypoglycemia. Diabetes , https://diabetes.diabetesjournals.org/content/63/7/2188.full.pdf , URL https://diabetes.diabetesjournals.org/content/63/7/2188 . DiMeglio, L. A. , Evans-Molina, C. and Oram, R. A. (2018). Type 1 diabetes. TheLancet , . Dubey, P. and M¨uller, H.-G. (2019). Fr´echet analysis of variance for random ob-jects. Biometrika , http://oup.prod.sis.lan/biomet/article-pdf/106/4/803/30646779/asz052.pdf , URL https://doi.org/10.1093/biomet/asz052 . Dumuid, D. , Pediˇsi´c, ˇZ. , Palarea-Albaladejo, J. , Mart´ın-Fern´andez, J. A. , Hron, K. and Olds, T. (2020). Compositional data analysis in time-use epidemiology: What, why, how. International journal of environmental research and public health , https://pubmed.ncbi.nlm.nih.gov/32224966 . Dumuid, D. , Stanford, T. E. , Martin-Fern´andez, J.-A. , ˇZeljko Pediˇsi´c , Maher, C. A. , Lewis, L. K. , Hron, K. , Katzmarzyk, P. T. , Chaput, J.-P. , Fogelholm, M. , Hu, G. , Lambert, E. V. , Maia, J. , Sarmiento, O. L. , Standage, M. , Barreira, T. V. , Broyles,S. T. , Tudor-Locke, C. , Tremblay, M. S. and Olds, T. (2018). Compositional data analysisfor physical activity, sedentary time and sleep research. Statistical Methods in Medical Research , https://doi.org/10.1177/0962280217710835 , URL https://doi.org/10.1177/0962280217710835 . Ewings, S. M. , Sahu, S. K. , Valletta, J. J. , Byrne, C. D. and Chipperfield, A. J. (2015).A bayesian network for modelling blood glucose concentration and exercise in type 1 diabetes. Statistical Methods in Medical Research , https://doi.org/10.1177/0962280214520732 , URL https://doi.org/10.1177/0962280214520732 . Febrero-Bande, M. and de la Fuente, M. (2012). Statistical computing in functional dataanalysis: The r package fda.usc. Journal of Statistical Software, Articles , . Feig, D. S. , Donovan, L. E. , Corcoy, R. , Murphy, K. E. , Amiel, S. A. , Hunt, K. F. , Asztalos, E. , Barrett, J. F. R. 
, Sanchez, J. J. , de Leiva, A. , Hod, M. , Jovanovic,L. , Keely, E. , McManus, R. , Hutton, E. K. , Meek, C. L. , Stewart, Z. A. , Wysocki,T. , O’Brien, R. , Ruedy, K. , Kollman, C. , Tomlinson, G. , Murphy, H. R. , Grisoni, J. ,22 yrne, C. , Davenport, K. , Neoh, S. , Gougeon, C. , Oldford, C. , Young, C. , Green, L. , Rossi, B. , Rogers, H. , Cleave, B. , Strom, M. , Adelantado, J. M. , Chico, A. I. , Tun-didor, D. , Malcolm, J. , Henry, K. , Morris, D. , Rayman, G. , Fowler, D. , Mitchell,S. , Rosier, J. , Temple, R. , Turner, J. , Canciani, G. , Hewapathirana, N. , Piper, L. , Kudirka, A. , Watson, M. , Bonomo, M. , Pintaudi, B. , Bertuzzi, F. , Daniela, G. , Mion,E. , Lowe, J. , Halperin, I. , Rogowsky, A. , Adib, S. , Lindsay, R. , Carty, D. , Craw-ford, I. , Mackenzie, F. , McSorley, T. , Booth, J. , McInnes, N. , Smith, A. , Stanton,I. , Tazzeo, T. , Weisnagel, J. , Mansell, P. , Jones, N. , Babington, G. , Spick, D. , Mac-Dougall, M. , Chilton, S. , Cutts, T. , Perkins, M. , Scott, E. , Endersby, D. , Dover,A. , Dougherty, F. , Johnston, S. , Heller, S. , Novodorsky, P. , Hudson, S. , Nisbet,C. , Ransom, T. , Coolen, J. , Baxendale, D. , Holt, R. , Forbes, J. , Martin, N. , Wal-bridge, F. , Dunne, F. , Conway, S. , Egan, A. , Kirwin, C. , Maresh, M. , Kearney, G. , Morris, J. , Quinn, S. , Bilous, R. , Mukhtar, R. , Godbout, A. , Daigle, S. , Lubina, A. , Jackson, M. , Paul, E. , Taylor, J. , Houlden, R. , Breen, A. , Banerjee, A. , Bracken-ridge, A. , Briley, A. , Reid, A. , Singh, C. , Newstead-Angel, J. , Baxter, J. , Philip,S. , Chlost, M. , Murray, L. , Castorino, K. , Frase, D. , Lou, O. and Pragnell, M. (2017). Continuous glucose monitoring in pregnant women with type 1 diabetes (conceptt):a multicentre international randomised controlled trial. The Lancet , . Ferraty, F. and Vieu, P. (2006). Nonparametric Functional Data Analysis: Theory and Practice(Springer Series in Statistics) . Springer-Verlag, Berlin, Heidelberg. 
Committee for Medicinal Products for Human Use et al. (2012). Guideline on clinical investigation of medicinal products in the treatment or prevention of diabetes mellitus. London, European Medicines Society.
Franca, G., Vogelstein, J. T. and Rizzo, M. (2020). Kernel k-groups via Hartigan’s method. IEEE Transactions on Pattern Analysis and Machine Intelligence.
Fréchet, M. R. (1948). Les éléments aléatoires de nature quelconque dans un espace distancié. Annales de l’institut Henri Poincaré.
Freeman, J. and Lyons, L. (2008). The use of continuous glucose monitoring to evaluate the glycemic response to food. Diabetes Spectrum. URL https://spectrum.diabetesjournals.org/content/21/2/134.
Group, B. A. W. (2018). Need for regulatory change to incorporate beyond A1c glycemic metrics. Diabetes Care, e92–e94.
Gude, F., Díaz-Vidal, P., Rúa-Pérez, C., Alonso-Sampedro, M., Fernández-Merino, C., Rey-García, J., Cadarso-Suárez, C., Pazos-Couselo, M., García-López, J. M. and Gonzalez-Quintela, A. (2017). Glycemic variability and its association with demographics and lifestyles in a general adult population. Journal of Diabetes Science and Technology. URL https://pubmed.ncbi.nlm.nih.gov/28317402.
Hall, H., Perelman, D., Breschi, A., Limcaoco, P., Kellogg, R., McLaughlin, T. and Snyder, M. (2018). Glucotypes reveal new patterns of glucose dysregulation. PLOS Biology. URL https://doi.org/10.1371/journal.pbio.2005143.
Han, K., Müller, H.-G. and Park, B. U. (2019). Additive functional regression for densities as responses. Journal of the American Statistical Association. URL https://doi.org/10.1080/01621459.2019.1604365.
Hirsch, I. B., Sherr, J. L. and Hood, K. K. (2019). Connecting the dots: Validation of time in range metrics with microvascular outcomes. Diabetes Care. URL https://care.diabetesjournals.org/content/42/3/345.
Hoelzel, W., Weykamp, C., Jeppsson, J.-O., Miedema, K., Barr, J. R., Goodall, I., Hoshino, T., John, W. G., Kobold, U., Little, R., Mosca, A., Mauri, P., Paroni, R., Susanto, F., Takei, I., Thienpont, L., Umemoto, M. and Wiedmeyer, H.-M. (2004). IFCC reference system for measurement of hemoglobin A1c in human blood and the national standardization schemes in the United States, Japan, and Sweden: A method-comparison study. Clinical Chemistry. URL http://clinchem.aaccjnls.org/content/50/1/166.
Hron, K., Menafoglio, A., Templ, M., Hrůzová, K. and Filzmoser, P. (2016). Simplicial principal component analysis for density functions in Bayes spaces. Computational Statistics & Data Analysis.
Izenman, A. J. (1991). Review papers: Recent developments in nonparametric density estimation. Journal of the American Statistical Association. URL https://doi.org/10.1080/01621459.1991.10475021.
Jiménez-Gamero, M., Alba-Fernández, M. and Ariza-López, F. (2019). Approximating the null distribution of a class of statistics for testing independence. Journal of Computational and Applied Mathematics, 131–143.
Kilpatrick, E. S. (2000). Glycated haemoglobin in the year 2000. Journal of Clinical Pathology. URL https://jcp.bmj.com/content/53/5/335.
Kovatchev, B. P., Breton, M., Man, C. D. and Cobelli, C. (2009). In silico preclinical trials: A proof of concept in closed-loop control of type 1 diabetes. Journal of Diabetes Science and Technology. URL https://doi.org/10.1177/193229680900300106.
Leucht, A. and Neumann, M. H. (2013). Dependent wild bootstrap for degenerate U- and V-statistics. Journal of Multivariate Analysis, 257–280.
Lu, J., Ma, X., Zhou, J., Zhang, L., Mo, Y., Ying, L., Lu, W., Zhu, W., Bao, Y., Vigersky, R. A. and Jia, W. (2018). Association of time in range, as assessed by continuous glucose monitoring, with diabetic retinopathy in type 2 diabetes. Diabetes Care. URL https://care.diabetesjournals.org/content/41/11/2370.
Matthews, D., Hosker, J., Rudenski, A., Naylor, B., Treacher, D. and Turner, R. (1985). Homeostasis model assessment: insulin resistance and β-cell function from fasting plasma glucose and insulin concentrations in man. Diabetologia.
McDonnell, C., Donath, S., Vidmar, S., Werther, G. and Cameron, F. (2005). A novel approach to continuous glucose analysis utilizing glycemic variation. Diabetes Technology & Therapeutics. URL https://doi.org/10.1089/dia.2005.7.253.
Molnar, G., Taylor, W. and Ho, M. (1972). Day-to-day variation of continuously monitored glycaemia: a further measure of diabetic instability. Diabetologia.
Monnier, L. and Colette, C. (2011). Glycemic variability: Can we bridge the divide between controversies? Diabetes Care. URL https://care.diabetesjournals.org/content/34/4/1058.
Monnier, L., Colette, C. and Owens, D. R. (2008). Glycemic variability: the third component of the dysglycemia in diabetes. Is it important? How to measure it? Journal of Diabetes Science and Technology.
Müller, H.-G. and Petersen, A. (2014). Density estimation including examples. Wiley StatsRef: Statistics Reference Online.
Nathan, D., Turgeon, H. and Regan, S. (2007). Relationship between glycated haemoglobin levels and mean glucose levels over time. Diabetologia.
Nerini, D. and Ghattas, B. (2007). Classifying densities using functional regression trees: Applications in oceanology. Computational Statistics & Data Analysis.
Nguyen, M., Han, J., Spanakis, E. K., Kovatchev, B. P. and Klonoff, D. C. (0). A review of continuous glucose monitoring-based composite metrics for glycemic control. Diabetes Technology & Therapeutics. PMID: 32069094. URL https://doi.org/10.1089/dia.2019.0434.
Nichols, S. P., Koh, A., Storm, W. L., Shin, J. H. and Schoenfisch, M. H. (2013). Biocompatible materials for continuous glucose monitoring devices. Chemical Reviews. URL https://doi.org/10.1021/cr300387j.
Petersen, A., Liu, X. and Divani, A. A. (2019). Wasserstein F-tests and confidence bands for the Fréchet regression of density response curves.
Petersen, A. and Müller, H.-G. (2016). Functional data analysis for density functions by transformation to a Hilbert space. Ann. Statist. URL https://doi.org/10.1214/15-AOS1363.
Petersen, A. and Müller, H.-G. (2019). Fréchet regression for random objects with Euclidean predictors. Ann. Statist. URL https://doi.org/10.1214/17-AOS1624.
Preda, C. (2007). Regression models for functional data by reproducing kernel Hilbert spaces methods. Journal of Statistical Planning and Inference, 829–840. Special Issue on Nonparametric Statistics and Related Topics: In honor of M.L. Puri.
Ramsay, J. and Silverman, B. (2005). Functional Data Analysis. Springer Series in Statistics, Springer. URL https://books.google.es/books?id=mU3dop5wY_4C.
Selvin, E., Crainiceanu, C. M., Brancati, F. L. and Coresh, J. (2007). Short-term variability in measures of glycemia and implications for the classification of diabetes. Archives of Internal Medicine.
Service, F. J. (2013). Glucose variability. Diabetes. URL https://diabetes.diabetesjournals.org/content/62/5/1398.
Service, F. J., Molnar, G. D., Rosevear, J. W., Ackerman, E., Gatewood, L. C. and Taylor, W. F. (1970). Mean amplitude of glycemic excursions, a measure of diabetic instability. Diabetes. URL http://diabetes.diabetesjournals.org/content/19/9/644.
Silverman, B. W. (1986). Density Estimation for Statistics and Data Analysis. Chapman & Hall, London.
Singh, R., Barden, A., Mori, T. and Beilin, L. (2001). Advanced glycation end-products: a review. Diabetologia. URL https://doi.org/10.1007/s001250051591.
Škrha, J., Šoupal, J. and Prázný, M. (2016). Glucose variability, HbA1c and microvascular complications. Reviews in Endocrine and Metabolic Disorders.
Szabó, Z., Sriperumbudur, B. K., Póczos, B. and Gretton, A. (2016). Learning theory for distribution regression. J. Mach. Learn. Res.
Szekely, G. J. and Rizzo, M. L. (2017). The energy of data. Annual Review of Statistics and Its Application. URL https://doi.org/10.1146/annurev-statistics-060116-054026.
Talska, R., Menafoglio, A., Machalova, J., Hron, K. and Fišerová, E. (2018). Compositional regression with functional response. Computational Statistics & Data Analysis.
Taylor, R. (2013). Type 2 diabetes. Diabetes Care. URL https://care.diabetesjournals.org/content/36/4/1047.
Villani, C. (2008). Optimal transport: old and new, vol. 338. Springer Science & Business Media.
Wang, J.-L., Chiou, J.-M. and Müller, H.-G. (2016). Functional data analysis. Annual Review of Statistics and Its Application.
Xu, M., Reiss, P. T. and Cribben, I. (????). Generalized reliability based on distances. Biometrics. URL https://onlinelibrary.wiley.com/doi/abs/10.1111/biom.13287.
Zaccardi, F. and Khunti, K. (2018). Glucose dysregulation phenotypes—time to improve outcomes. Nature Reviews Endocrinology.
Zeevi, D., Korem, T., Zmora, N., Israeli, D., Rothschild, D., Weinberger, A., Ben-Yacov, O., Lador, D., Avnit-Sagi, T., Lotan-Pompan, M. et al. (2015). Personalized nutrition by prediction of glycemic responses. Cell, 163.