Integration of Clinical Criteria into the Training of Deep Models: Application to Glucose Prediction for Diabetic People
Maxime De Bois, Mounîm A. El Yacoubi, and Mehdi Ammi
Abstract—Standard objective functions used during the training of neural-network-based predictive models do not consider clinical criteria, leading to models that are not necessarily clinically acceptable. In this study, we look at this problem from the perspective of the forecasting of future glucose values for diabetic people, and we propose the coherent mean squared glycemic error (gcMSE) loss function. It penalizes the model during its training not only on the prediction errors, but also on the predicted variation errors, which is important in glucose prediction. Moreover, it makes it possible to adjust the weighting of the different areas in the error space to better focus on dangerous regions. In order to use the loss function in practice, we propose an algorithm that progressively improves the clinical acceptability of the model, so that we can achieve the best tradeoff possible between accuracy and given clinical criteria. We evaluate the approaches using two diabetes datasets, one with type-1 patients and the other with type-2 patients. The results show that using the gcMSE loss function, instead of a standard MSE loss function, improves the clinical acceptability of the models. In particular, the improvements are significant in the hypoglycemia region. We also show that this increased clinical acceptability comes at the cost of a decrease in the average accuracy of the model. Finally, we show that this tradeoff between accuracy and clinical acceptability can be successfully addressed with the proposed algorithm. For given clinical criteria, the algorithm can find the optimal solution that maximizes the accuracy while at the same time meeting the criteria.
Index Terms—deep learning, clinical acceptability, multi-objective optimization, neural network, glucose prediction, diabetes
I. INTRODUCTION
With 4.2 million deaths attributed to it in 2019, diabetes is undoubtedly one of the major diseases of our modern world [1]. Compared to healthy persons, diabetic people experience trouble in the regulation of their blood glucose level. Whereas the pancreas of type-1 diabetic people does not produce insulin, a hormone responsible for the absorption of glucose in the blood, the body cells of type-2 diabetic patients get increasingly resistant to its action. Failing to regulate the blood sugar level puts the patient at risk of entering states of hypoglycemia and hyperglycemia. In hypoglycemia (blood sugar level below 70 mg/dL), the patient faces short-term consequences such as clumsiness, coma, or even death. On the other hand, with hyperglycemia (blood sugar level above 180 mg/dL), the
consequences are more long-term, with an increased risk of cardiovascular diseases or blindness.

M. De Bois is with CNRS-LIMSI and the Université Paris-Saclay, Orsay, France (e-mail: [email protected]). M. A. El Yacoubi is with Samovar, CNRS, Télécom SudParis, Institut Polytechnique de Paris, Évry, France. M. Ammi is with Université Paris 8, Saint-Denis, France.

In recent years, a lot of researchers have been interested in the creation of glucose predictive models [2]. Using past glucose values, carbohydrate (CHO) intakes, and insulin infusion information, the models can forecast future glucose values 30 to 60 minutes ahead of time [2]. For the diabetic patient, being able to know the future values of his/her glycemia could be highly beneficial, as hypo/hyperglycemia could be anticipated. Historically, glucose predictive models are based on autoregressive processes [3]. However, thanks to advances in machine learning and deep learning, as well as the increased availability of data, we are currently witnessing a shift in favor of more complex models, and in particular models based on neural networks. The use of standard feedforward neural networks has been explored with, for instance, the works of Pappada et al. [4], Georga et al. [5], and Ben Ali et al. [6]. Recurrent neural networks, and in particular those based on long short-term memory (LSTM) units, are probably the most popular deep models for glucose prediction. Aliberti et al. showed that they are more accurate than standard autoregressive models [7]. Mirshekarian et al. demonstrated their superiority over support vector regression (SVR) models that use expert physiological features [8]. They have also been shown to benefit from the addition of various input features such as heart rate or skin conductance [9], [10]. Other neural-network-based solutions have been tried out recently.
Among them, we can highlight the promising use of convolutional neural networks [11], [12].

Models based on neural networks are trained by backpropagating the gradient of the average error to the weights of the network. In glucose prediction, as in almost all regression problems, the average error is computed as the mean squared error (MSE). As a consequence, the models are trained to maximize the accuracy of the predictions. However, in the benchmark study we recently conducted [13], we showed that a good accuracy does not ensure that the predictions are clinically acceptable. Indeed, some errors, despite their relatively low magnitude, can be very dangerous for the patient (e.g., errors in the hypoglycemia region). To address this issue, Del Favero et al. proposed the gMSE loss function, which amplifies the weights of the errors based on the observed glycemic region [14]. They showed that using the gMSE instead of the standard MSE decreases the number of dangerous predictions at the cost of reducing the average accuracy of the model. While their methodology is promising, their study has several limitations that we aim to address. First, as the approach has been evaluated using autoregressive models on virtual diabetic patients, it is unclear how it translates to more complex models and to real patients. Also, their approach focuses on only one aspect of the clinical acceptability of the predictions, namely the point clinical accuracy. Another aspect of the clinical acceptability of the predictions is the clinical accuracy of the predicted variations (i.e., the difference between two successive predictions compared to the observed variations), which is taken into account in the widely used continuous glucose-error grid analysis (CG-EGA) metric [15].
Indeed, inaccurate predicted glucose variations can be very dangerous, as they can confuse the patient in the understanding of the future evolution of the glycemia.

Our contributions are:
1) We propose a new loss function called the coherent mean squared glycemic error (gcMSE). Compared to the standard MSE, it includes constraints directly related to the clinical acceptability of the models. In particular, it penalizes the model during its training not only on prediction errors, but also on predicted variation errors [16]. Moreover, it makes it possible to increase the importance of specific regions in the error space (e.g., the hypoglycemia region).
2) The gcMSE faces a multi-objective optimization problem. Indeed, when promoting the learning of a model focused more on making clinically acceptable predictions, we reduce the constraints on its global accuracy. However, for the model to be useful to the diabetic patient, it needs to be accurate. To address this challenge, we propose the PICA algorithm, which iteratively relaxes the accuracy constraints so that the focus shifts progressively in favor of the satisfaction of the clinical constraints. This enables the creation of a model that maximizes the accuracy while at the same time respecting the given clinical constraints.
3) We evaluate the proposed solutions on two diabetes datasets, the IDIAB dataset and the OhioT1DM dataset, characterized by their heterogeneity. Whereas the IDIAB dataset, collected by ourselves, is made of 6 type-2 diabetic patients, the OhioT1DM dataset has been released by Marling et al. and comprises data from 6 type-1 diabetic patients [17].
4) We open-sourced the code written in Python that has been used in this study in a GitHub repository [18].

The paper is organized as follows.
First, after introducing the CG-EGA metric in more detail, we present the whole framework for its integration into the training of deep models. Then, we describe the machine learning pipeline, with the preprocessing of the data, the models we used, and the evaluation process. Finally, we present and discuss the experimental results.

II. INTEGRATING CLINICAL CRITERIA INTO THE TRAINING OF DEEP MODELS
In this section, we propose a method to integrate the clinical criteria of the CG-EGA within the training of deep models. First, we introduce, in detail, how the CG-EGA metric works. Then, we present the gcMSE loss function that integrates the clinical constraints. Finally, we propose a methodology to use this new cost function in practice.
A. Presentation of the CG-EGA
Originally proposed by Kovatchev et al. for the evaluation of the clinical acceptability of blood glucose sensors [15], the continuous glucose-error grid analysis (CG-EGA) is a widely used metric to assess the clinical acceptability of glucose predictive models [2]. It is made of the combination of two different evaluation grids: the point-error grid analysis (P-EGA) and the rate-error grid analysis (R-EGA). While the P-EGA measures the clinical accuracy of the predictions, the R-EGA measures the clinical accuracy of the predicted variations. The predicted variations are computed as the rate of change between two consecutive predictions. Both grids attribute to a given prediction a score from A (best) to E (worst) representing the dangerousness of the prediction. Figure 1 gives a graphical representation of both grids. The scores of both grids are then combined into a final label assessing the clinical acceptability of the prediction. A prediction can either be an accurate prediction (AP), a benign error (BE), or an erroneous prediction (EP). Table I details the reasoning behind the CG-EGA scores.

First, the CG-EGA has a different behavior depending on the glycemic region (hypoglycemia, euglycemia, or hyperglycemia). Essentially, the glycemic region impacts the way bad R-EGA scores (C to E) are accounted for. For instance, in the hypoglycemia region, an lE score in the R-EGA, representing a fast predicted decrease in glycemia while a fast increase is observed, can lead to a benign error (BE) if the last prediction is accurate (A in the P-EGA). In the hypoglycemia region, the CG-EGA implies that it is not dangerous for the patient to predict a decrease in glycemia, as it will not lead to life-threatening actions from the patient. On the other hand, the absence of detection of negative variations in the uD and uE zones is extremely dangerous: the hypoglycemia is becoming much worse, which could result in consequences such as coma or even death.
Overall, for a prediction to be labeled as an accurate prediction (AP), it needs good scores (A or B) in both grids.

In summary, compared to standard accuracy metrics such as the root mean squared error (RMSE), the CG-EGA also evaluates the accuracy of the predicted variations. And, most importantly, these evaluations depend on the observed glycemic region. These aspects should be taken into account if we want to add clinical constraints based on the CG-EGA into the training of the models.
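To make the combination step concrete, the mapping of Table I can be encoded as a simple lookup. The sketch below is a hypothetical helper (the computation of the P-EGA and R-EGA scores themselves is assumed to be done elsewhere, and the function names are ours, not the open-sourced implementation's):

```python
# Score-combination step of the CG-EGA (Table I): one lookup table per
# glycemic region, rows indexed by the R-EGA score, columns by the P-EGA
# score. Zones not listed for a region cannot occur in that region's P-EGA.
CG_EGA_TABLE = {
    "hypo": {  # P-EGA columns: A, D, E
        "A":  {"A": "AP", "D": "EP", "E": "EP"},
        "B":  {"A": "AP", "D": "EP", "E": "EP"},
        "uC": {"A": "BE", "D": "EP", "E": "EP"},
        "lC": {"A": "BE", "D": "EP", "E": "EP"},
        "uD": {"A": "EP", "D": "EP", "E": "EP"},
        "lD": {"A": "BE", "D": "EP", "E": "EP"},
        "uE": {"A": "EP", "D": "EP", "E": "EP"},
        "lE": {"A": "BE", "D": "EP", "E": "EP"},
    },
    "eu": {  # P-EGA columns: A, B, C
        "A":  {"A": "AP", "B": "AP", "C": "EP"},
        "B":  {"A": "AP", "B": "AP", "C": "EP"},
        "uC": {"A": "BE", "B": "BE", "C": "EP"},
        "lC": {"A": "BE", "B": "BE", "C": "EP"},
        "uD": {"A": "BE", "B": "BE", "C": "EP"},
        "lD": {"A": "BE", "B": "BE", "C": "EP"},
        "uE": {"A": "EP", "B": "EP", "C": "EP"},
        "lE": {"A": "EP", "B": "EP", "C": "EP"},
    },
    "hyper": {  # P-EGA columns: A, B, C, D, E
        "A":  {"A": "AP", "B": "AP", "C": "EP", "D": "EP", "E": "EP"},
        "B":  {"A": "AP", "B": "AP", "C": "EP", "D": "EP", "E": "EP"},
        "uC": {"A": "BE", "B": "BE", "C": "EP", "D": "EP", "E": "EP"},
        "lC": {"A": "BE", "B": "BE", "C": "EP", "D": "EP", "E": "EP"},
        "uD": {"A": "BE", "B": "BE", "C": "EP", "D": "EP", "E": "EP"},
        "lD": {"A": "EP", "B": "EP", "C": "EP", "D": "EP", "E": "EP"},
        "uE": {"A": "EP", "B": "EP", "C": "EP", "D": "EP", "E": "EP"},
        "lE": {"A": "EP", "B": "EP", "C": "EP", "D": "EP", "E": "EP"},
    },
}

def cg_ega_label(region: str, p_score: str, r_score: str) -> str:
    """Combine a P-EGA score and an R-EGA score into AP, BE, or EP."""
    return CG_EGA_TABLE[region][r_score][p_score]
```

For instance, the benign hypoglycemia case discussed above (R-EGA lE with P-EGA A) maps to BE.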
B. Coherent Mean Squared Error
In deep learning, the models are trained by backpropagating the gradient of the loss function to the networks' weights. Thus, by modifying the objective function, it is possible to modify the predictive behavior of the model. We can find numerous cost functions in the literature, the most used being the cross-entropy for classification problems and the mean squared error (MSE) for regression problems. Since glucose prediction
[Figure 1: the P-EGA plots predicted vs. true glucose values (mg/dL), and the R-EGA plots predicted vs. true glucose variations (mg/dL/min); both grids are divided into zones A to E, and predictions are labeled AP, BE, or EP.]
Fig. 1: Example of the CG-EGA classification with the P-EGA (left) and R-EGA (right).

TABLE I: Classification of glucose predictions performed by the CG-EGA. Depending on the scores obtained on the P-EGA and R-EGA, a prediction is classified as an accurate prediction (AP), a benign error (BE), or an erroneous prediction (EP).

                  Hypoglycemia   Euglycemia     Hyperglycemia
R-EGA \ P-EGA     A   D   E      A   B   C      A   B   C   D   E
A                 AP  EP  EP     AP  AP  EP     AP  AP  EP  EP  EP
B                 AP  EP  EP     AP  AP  EP     AP  AP  EP  EP  EP
uC                BE  EP  EP     BE  BE  EP     BE  BE  EP  EP  EP
lC                BE  EP  EP     BE  BE  EP     BE  BE  EP  EP  EP
uD                EP  EP  EP     BE  BE  EP     BE  BE  EP  EP  EP
lD                BE  EP  EP     BE  BE  EP     EP  EP  EP  EP  EP
uE                EP  EP  EP     EP  EP  EP     EP  EP  EP  EP  EP
lE                BE  EP  EP     EP  EP  EP     EP  EP  EP  EP  EP
AP: Accurate Prediction; BE: Benign Error; EP: Erroneous Prediction

is a regression task, deep models in the field use the MSE in the model's training. Equation 1 describes the MSE as the squared difference between the observed glucose g and the predicted glucose ĝ, averaged over N samples. In this study, we propose modifications to the MSE cost function to improve the clinical acceptability of the predictions.

MSE(g, ĝ) = (1/N) · Σ_{n=1}^{N} (g_n − ĝ_n)²    (1)

First, as we have seen by analyzing the CG-EGA behavior, it is essential to penalize predicted variation errors in addition to prediction errors. To do this, we can use the coherent mean squared error (cMSE) loss function, previously proposed in a work of ours [16]. The cMSE is the MSE of the predictions, to which is added the MSE of the predicted variations weighted by a coefficient c. Equation 2 describes the cMSE loss function, with Δg and Δĝ representing, respectively, the observed and predicted glucose variations. We call the weighting coefficient c the coherence factor. It represents the relative importance we give to the accuracy of the predicted variations versus the accuracy of the predictions.

cMSE(g, ĝ) = MSE(g, ĝ) + c · MSE(Δg, Δĝ)
           = (1/N) · Σ_{n=1}^{N} [ (g_n − ĝ_n)² + c · (Δg_n − Δĝ_n)² ]    (2)

To be able to use the cMSE, we can use a recurrent neural network (e.g., LSTM) with two outputs (see Figure 2). The two outputs represent the prediction at the given prediction horizon PH and the prediction at
PH − ΔT, ΔT being the time interval between two predictions. For instance, with a prediction interval of 5 minutes and a prediction horizon of 30 minutes, the network outputs the predictions at the horizons of 30 and 25 minutes. These two outputs enable the computation of the predicted variations, as described by Equation 3. The architecture of recurrent neural networks is particularly suited to this task since each sub-module of the unfolded network (see Figure 2) shares the same weights.

Fig. 2: General architecture of a two-output recurrent neural network that has been unrolled H times, where H is the length of the history of input data to the model. X_t are the input data to the model at time t (e.g., glucose, insulin, and carbohydrates at time t), and ĝ_{t+PH} is the model prediction (e.g., blood glucose prediction) at t + PH, where PH is the prediction horizon.

Δĝ_{t+PH} = (ĝ_{t+PH} − ĝ_{t+PH−ΔT}) / ΔT    (3)

C. Coherent Mean Squared Glycemic Error
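As a point of reference before extending the cMSE, Equations 2 and 3 can be sketched numerically. The snippet below is a plain-NumPy illustration with argument names of our choosing, not the study's actual training code (which backpropagates through this loss inside a deep learning framework):

```python
import numpy as np

# Numerical sketch of the cMSE (Equation 2) for a two-output model that
# predicts glucose at both t+PH and t+PH-dt (dt = 5 minutes here).
def cmse(g, g_hat, g_hat_prev, dt=5.0, c=1.0):
    """g, g_hat: observed/predicted glucose at t+PH;
    g_hat_prev: predicted glucose at t+PH-dt (the second network output)."""
    g = np.asarray(g, dtype=float)
    g_hat = np.asarray(g_hat, dtype=float)
    g_hat_prev = np.asarray(g_hat_prev, dtype=float)
    # Observed variations: rate of change between consecutive observations
    # (the first sample has no predecessor, so its variation is set to 0).
    dg = np.diff(g, prepend=g[0]) / dt
    # Predicted variations from the two outputs (Equation 3).
    dg_hat = (g_hat - g_hat_prev) / dt
    return np.mean((g - g_hat) ** 2) + c * np.mean((dg - dg_hat) ** 2)
```

With perfect predictions and coherent variations the loss is 0; any variation mismatch is penalized through the coherence factor c even when the point predictions are exact.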
The analysis of the CG-EGA showed us that glucose prediction errors and predicted variation errors do not have the same clinical importance across the error space (see Table I). Although generally of greater magnitude, these clinical errors are rare and represent only a small portion of the gradient in the updating of the network's weights during its training. Therefore, minimizing the MSE (or, equivalently, the cMSE) does not directly reduce the number of clinical errors. Indeed, most of the weight updates are focused on improving the accuracy of predictions that already have a good clinical acceptability. In the field of multi-class classification, it is very common to weight samples from under-represented classes by artificially increasing their presence within the training set. In their work on object recognition in images, Lin et al. proposed to dynamically weight the training samples according to their difficulty (a sample being considered easy when the probability of the corresponding class is very high, showing a high degree of confidence of the model) [19]. By reducing the weights of samples judged easy, the training of the model focuses on the samples with which it has the most difficulty. Finally, Del Favero et al. proposed, in the context of glucose prediction, to modify the MSE to better account for the dangerous regions of the P-EGA [14]. In particular, they proposed that samples with observed hypoglycemia or hyperglycemia should be given a higher weight. Although this work was evaluated on autoregressive models and virtual patients, their results showed that this new cost function reduces the number of predictions in zones D and E of the P-EGA grid.

Taking inspiration from their work, we propose to dynamically penalize prediction errors as well as predicted variation errors. This new cost function, named the coherent mean squared glycemic error (gcMSE), penalizes predictions differently depending on the P-EGA and R-EGA regions (see Equation 4).
In Equation 4b, P_X and p_x, with X ∈ {A, B, uC, lC, uD, lD, uE, lE} and x ∈ {a, b, uc, lc, ud, ld, ue, le}, represent the P-EGA grid regions and their respective weights. Contrary to the original P-EGA, we have segmented the C, D, and E regions in two, as is already the case for the R-EGA. This allows more flexibility in assigning the weights. Equivalently, in Equation 4c, R_X and r_x represent the regions of the R-EGA grid and their respective weights.

gcMSE(g, ĝ) = P(g, ĝ) · MSE(g, ĝ) + c · R(Δg, Δĝ) · MSE(Δg, Δĝ)    (4)

with

P(g, ĝ) =
  p_a,  if {g, ĝ} ∈ P_A
  p_b,  if {g, ĝ} ∈ P_B
  p_uc, if {g, ĝ} ∈ P_uC
  p_lc, if {g, ĝ} ∈ P_lC
  p_ud, if {g, ĝ} ∈ P_uD
  p_ld, if {g, ĝ} ∈ P_lD
  p_ue, if {g, ĝ} ∈ P_uE
  p_le, if {g, ĝ} ∈ P_lE    (4b)

and

R(Δg, Δĝ) =
  r_a,  if {Δg, Δĝ} ∈ R_A
  r_b,  if {Δg, Δĝ} ∈ R_B
  r_uc, if {Δg, Δĝ} ∈ R_uC
  r_lc, if {Δg, Δĝ} ∈ R_lC
  r_ud, if {Δg, Δĝ} ∈ R_uD
  r_ld, if {Δg, Δĝ} ∈ R_lD
  r_ue, if {Δg, Δĝ} ∈ R_uE
  r_le, if {Δg, Δĝ} ∈ R_lE    (4c)

Using the gcMSE instead of the standard MSE introduces
17 new hyperparameters to be optimized: the coherence factor c, and the weights associated with the P-EGA and R-EGA regions. This tuning being particularly laborious, we propose simplifications reducing the number of hyperparameters:
• First, it is not useful to improve the accuracy of the predicted variations in zones A and B. Indeed, all predictions belonging to these zones are clinically sufficiently accurate. Thus, we can set r_a = r_b = 0.
• From the perspective of maximizing the AP rate, BE and EP predictions can be seen as equally important. This allows us to set most of the C, D, and E zone weights to the same value. Moreover, the coherence factor c alone allows us to weight the importance we give to the accuracy of the predictions versus the accuracy of the predicted variations. Thus, we can decide to set all these weights to 1.
• Only the hypoglycemic P-EGA regions D and E (P_uD and P_uE) require a special treatment, in order to increase the importance of samples in the hypoglycemic region. We denote the weight associated with these areas by p_hypo.
Equations 2a and 2b summarize these design simplifications, allowing the gcMSE cost function to have only 3 hyperparameters: p_ab, p_hypo, and c. The choice of these hyperparameters depends on both the learning objectives and the experimental conditions. The coherence factor c must be chosen depending on the importance of the cost function MSE(Δg, Δĝ) compared to MSE(g, ĝ). The choice of the coefficient p_hypo must be made according to the size of the datasets: when few hypoglycemic samples are available, it is possible to give it a value p_hypo > 1. As for p_ab, it represents the accuracy constraint we impose during training. The lower its value, the more the training of the model focuses on improving its clinical acceptability at the expense of its accuracy.
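The simplified weighting scheme just described (r_a = r_b = 0, most zone weights set to 1, and the three remaining hyperparameters p_ab, p_hypo, and c) can be sketched as follows. This is an illustrative per-sample NumPy version, assuming the P-EGA and R-EGA zone labels of each sample have been precomputed elsewhere:

```python
import numpy as np

# Sketch of the simplified gcMSE weighting. p_zones/r_zones hold the
# precomputed P-EGA/R-EGA zone label of each sample ("A", "B", "uC", ...).
def gcmse(g, g_hat, dg, dg_hat, p_zones, r_zones,
          p_ab=1.0, p_hypo=10.0, c=1.0):
    g, g_hat = np.asarray(g, float), np.asarray(g_hat, float)
    dg, dg_hat = np.asarray(dg, float), np.asarray(dg_hat, float)
    # P(g, g_hat): p_ab in zones A/B, p_hypo in hypoglycemic D/E, 1 elsewhere.
    p_w = np.ones(len(g))
    p_w[np.isin(p_zones, ["A", "B"])] = p_ab
    p_w[np.isin(p_zones, ["uD", "uE"])] = p_hypo
    # R(dg, dg_hat): 0 in zones A/B (already clinically acceptable), 1 else.
    r_w = np.where(np.isin(r_zones, ["A", "B"]), 0.0, 1.0)
    return np.mean(p_w * (g - g_hat) ** 2 + c * r_w * (dg - dg_hat) ** 2)
```

Note that the weights are applied per sample before averaging, which is one natural reading of Equation 4; with p_ab = 1, p_hypo = 1, and all R-zone weights at 1, the function reduces to the cMSE.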
P(g, ĝ) =
  p_ab,   if {g, ĝ} ∈ {P_A, P_B}
  p_hypo, if {g, ĝ} ∈ {P_uD, P_uE}
  1,      otherwise    (2a)

and

R(Δg, Δĝ) =
  0, if {Δg, Δĝ} ∈ {R_A, R_B}
  1, otherwise    (2b)

D. Progressive Improvement of the Clinical Acceptability
In order to be able to use the gcMSE cost function, we need to formulate the learning objective, and in particular the relative importance of improving the clinical acceptability. Indeed, as shown in the work of Del Favero et al., an improvement in clinical acceptability is often matched by a deterioration in statistical accuracy [14].

Research in the field of multi-objective optimization (MOO) highlights the need for selection criteria, which can take the form of a weighting between the different objectives, or of thresholds on the different objectives [20]. Even though there is no standard clinical criterion for glucose prediction models today, we propose to assume their existence. These clinical criteria could take the form of minimum thresholds on the AP rate and/or maximum thresholds on the EP rate of the CG-EGA (e.g., a minimum of 95% of predictions obtaining the AP score in the CG-EGA). Our learning objective in this case would be to maximize the accuracy of the predictions while meeting the clinical criteria.

To achieve this goal, we need to test a large number of different model architectures (hyperparameters), each test involving the training of a neural network. Such training is very expensive in the context of deep learning. Therefore, an efficient training methodology must be used in order to reach the optimal solution. The methodologies generally used to answer multi-objective optimization problems are based on genetic methods (such as NSGA-II [21]). Although faster than a simple grid search, these algorithms involve randomness in the changes made between the different tests, slowing down the convergence.

In order to circumvent this problem, we propose the progressive improvement of clinical acceptability (PICA) algorithm. Starting from a solution that maximizes the model's accuracy without taking into account its clinical acceptability, the constraints on the accuracy are gradually relaxed in favor of its clinical acceptability.
This has the consequence of gradually degrading the statistical accuracy of the model, a degradation that is accompanied by a progressive improvement in clinical acceptability.
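The overall loop can be sketched as follows. All helper names (train, finetune, predict, mase, the criteria callable) are hypothetical placeholders for the study's actual pipeline; the accuracy weight p_ab is relaxed geometrically at each iteration, and the run stops when the clinical criteria are met or the model becomes worse than a naive predictor:

```python
# Illustrative sketch of the PICA loop. criteria(g, g_hat) stands in for the
# clinical criteria C evaluated on the (smoothed) predictions.
def pica(criteria, alpha, beta, train, finetune, predict, mase, max_iter=50):
    base = train(loss="MSE")            # accuracy-only baseline model
    model = base
    g, g_hat = predict(model)
    g_hat = smooth(g_hat, beta)         # exponential smoothing of predictions
    i = 0
    while not criteria(g, g_hat) and mase(g, g_hat) < 1 and i < max_iter:
        i += 1
        # Relax the accuracy constraint: p_ab = alpha**(i-1), alpha in [0, 1].
        # Each iteration fine-tunes from the MSE-trained base model.
        model = finetune(base, loss="gcMSE", p_ab=alpha ** (i - 1))
        g, g_hat = predict(model)
        g_hat = smooth(g_hat, beta)
    # Fail (return None) when the criteria are unreachable (MASE >= 1).
    return model if mase(g, g_hat) < 1 else None

def smooth(y, beta):
    """Exponential smoothing: y*_t = beta * y_t + (1 - beta) * y*_{t-1}."""
    out = [y[0]]
    for v in y[1:]:
        out.append(beta * v + (1 - beta) * out[-1])
    return out
```

The max_iter guard is our addition for safety; the formal stopping conditions are the clinical criteria and the MASE threshold.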
Algorithm 1: Progressive Improvement of the Clinical Acceptability (PICA)

Data: clinical criteria C, model M, update coefficient α, smoothing coefficient β
Result: model maximizing the accuracy while respecting the clinical criteria C, or −1

i ← 0
M_0 ← train(MSE)
g_0, ĝ_0 ← predict(M_0)
ĝ*_0 ← smooth(ĝ_0, β)
while C(M_i) = False and MASE(g_i, ĝ*_i) < 1 do
    i ← i + 1
    gcMSE_i ← gcMSE with p_ab ← α^(i−1)
    M_i ← finetune(M_0, gcMSE_i)
    g_i, ĝ_i ← predict(M_i)
    ĝ*_i ← smooth(ĝ_i, β)
end
if MASE(g_i, ĝ*_i) < 1 then
    return M_i
else
    return −1

Algorithm 1 describes the steps of the PICA algorithm. The update law of the weight p_ab, representing the constraint on the statistical accuracy, is to be chosen according to the experimental conditions. In this study, we use the law defined by Equation 3 (with α ∈ [0, 1] being the speed of the relaxation of the accuracy constraints). As for the MASE metric (mean absolute scaled error, proposed by Hyndman et al. [22], see Equation 4), it is used as a stopping criterion when the clinical criteria are not achievable. The algorithm stops when the MASE exceeds 1, meaning that a naive prediction model (whose prediction is equal to the last known observation) is more accurate than the present model. Finally, we use an exponential smoothing of the predictions. This smoothing attenuates the large fluctuations of the predictions in the first steps of the algorithm. Being light, it brings a significant gain in clinical acceptability in return for a minimal loss of accuracy. For more details on the exponential smoothing of the predictions, we invite the reader to refer to the post-processing steps in Section III-C.

p_ab = α^(i−1)    (3)

MASE(g, ĝ, PH) = [ (1/N) · Σ_{n=1}^{N} |g_n − ĝ_n| ] / [ (1/(N−PH)) · Σ_{n=PH+1}^{N} |g_n − g_{n−PH}| ]    (4)

The PICA algorithm avoids unnecessary iterations, each iteration bringing the model closer and closer to its goal. Moreover, instead of being trained from its initial state, the model is refined from the first model, trained with the standard MSE.
This refinement requires far fewer iterations than a full training, and thus allows the algorithm to run faster. Another approach would have been to refine the model from the previous iteration instead. However, in practice, we were confronted with a local optimum problem, preventing the model from finding a better solution after updating the cost function.

III. METHODS
In this section, we present the whole methodology that has been followed for the evaluation of the proposed losses and of the PICA algorithm. First, we present the experimental datasets and their preprocessing. Then, we provide details about the post-processing of the predictions and the models' evaluation. Finally, we describe the different models with their implementation. We have made the whole implementation of the data pipeline available in a GitHub repository [18].
A. Experimental Data
In this study, we used two datasets, each made of several diabetic patients: the IDIAB dataset and the OhioT1DM dataset. While the IDIAB dataset has been collected by ourselves between 2018 and 2019 after the approval of the French ethical committee (ID RCB 2018-A00312-53), the OhioT1DM dataset has recently been released by Marling et al. [17].
1) IDIAB Dataset (I):
The IDIAB dataset is made of 6 type-2 diabetic patients (5F/1M, age 56.5 ± … years, BMI … ± … kg/m²). The patients had been monitored for 31.17 ± … days.
2) OhioT1DM Dataset (O):
The OhioT1DM dataset is made of data coming from 6 type-1 diabetic patients (2M/4F, aged between 40 and 60 years old, BMI not disclosed) that had been monitored for 8 weeks in free-living conditions. For more information concerning the experimental system, we redirect the reader to [17]. We restrict ourselves to the glucose values, the insulin infusions, and the CHO intakes to remain consistent with the IDIAB data.
B. Preprocessing
The preprocessing stage aims at preparing the data for their use in the training and the evaluation of the models. It is made of several steps, depicted by Figure 3 and described in the following paragraphs.
1) Cleaning:
The glucose time series from the IDIAB dataset possess several erroneous values. These values are characterized by peaks lasting only one sample. We decided to remove these samples from the data, as keeping them would be hurtful for the training as well as for the evaluation of the models. Instead of removing them by hand, we used an automated methodology proposed in our previous work [16]. A sample is flagged as erroneous if the surrounding rates of change are incoherent with the typical distribution of rates of change, and if they are of opposite signs.
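A hedged sketch of such a spike filter is given below; the thresholding rule (k standard deviations of the rate-of-change distribution) is an illustrative stand-in for the exact incoherence criterion of [16]:

```python
import numpy as np

# Sketch of the one-sample spike heuristic: a sample is flagged when the
# rates of change into and out of it are both unusually large (beyond k
# standard deviations of all observed rates) and of opposite signs.
def flag_spikes(glucose, dt=5.0, k=2.0):
    g = np.asarray(glucose, dtype=float)
    roc = np.diff(g) / dt                   # rate of change between samples
    thr = k * np.std(roc)
    flags = np.zeros(len(g), dtype=bool)
    for t in range(1, len(g) - 1):
        before, after = roc[t - 1], roc[t]  # slopes into and out of sample t
        if abs(before) > thr and abs(after) > thr and before * after < 0:
            flags[t] = True
    return flags
```

A single upward spike in an otherwise flat signal is flagged, while genuine monotonic trends (same-sign slopes) are left untouched.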
2) Samples Creation:
The two datasets have been resampled to one sample every 5 minutes, which is the sampling frequency of the OhioT1DM glucose signal. While we took the mean of the glucose signals, the CHO and insulin values have been accumulated. The input samples have been obtained by using a sliding window of length H of 3 hours (36 samples) on the three signals. The prediction objective is, for each sample, the glucose value 30 minutes (6 samples) into the future (prediction horizon, PH, of 30 minutes).
3) Recovering Missing Data:
Both datasets contain numerous missing values coming either from sensor or human errors. Moreover, contrary to the OhioT1DM dataset, the upsampling of the IDIAB glucose signal (from 15 minutes to 5 minutes) has introduced a lot of missing values as well. We can artificially recover some of them by following this strategy for every sample:
1) linearly interpolate the glucose history when the missing value is surrounded by two known glucose values;
2) extrapolate linearly in the opposite case, usually when the missing glucose value is the most recent data;
3) discard samples when the ground truth y_{t+PH} is not known, to prevent training and testing on artificial data.
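The first two recovery steps can be sketched as follows. This is an illustrative NumPy version of our own: leading gaps are simply clamped to the first known value, and the discarding of samples with unknown ground truth (step 3) is assumed to happen when building the dataset:

```python
import numpy as np

# Sketch of the gap-recovery strategy on a glucose history window: interior
# NaNs are linearly interpolated, and trailing NaNs (the most recent values)
# are linearly extrapolated from the slope at the last known point.
def recover_history(history):
    h = np.asarray(history, dtype=float)
    known = np.flatnonzero(~np.isnan(h))
    if len(known) < 2:
        return h                             # not enough data to recover
    # 1) Interpolate NaNs lying between known values. Note that np.interp
    #    clamps values outside the known range instead of extrapolating.
    h = np.interp(np.arange(len(h)), known, h[known])
    # 2) Extrapolate trailing values from the last per-step slope.
    last = known[-1]
    step = h[last] - h[last - 1]
    for t in range(last + 1, len(h)):
        h[t] = h[last] + step * (t - last)
    return h
```

For example, a window with a missing interior value and a missing most-recent value is filled by interpolation and slope extrapolation, respectively.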
4) Splitting:
The datasets are split into training, validation, and testing sets. While the testing set is used for the final evaluation of the models, the validation set is used as a prior evaluation for the optimization of the models' hyperparameters. The testing set is made of the last 10 days for the OhioT1DM dataset and of the last 5 days for the IDIAB
dataset, the latter being around two times smaller. The remaining days have been split into training and validation sets following an 80%/20% distribution with 5 permutations.

Fig. 3: Preprocessing of the data.
5) Feature Scaling:
Finally, the samples have been standardized (zero mean and unit variance) w.r.t. their training set.
C. Post-processing and Evaluation
The evaluation of the predictive models is done following the steps described by Figure 4. In this study, we focus on models that are personalized to the patient and that predict future glucose values with a 30-minute prediction horizon. Before evaluating the predictions, we follow two mandatory post-processing steps. First, we rescale the predictions to their original scale (see the feature scaling preprocessing step). Then, we reconstruct the prediction time series by reordering the predictions. In addition, the predictions made by the models can be smoothed, as is done in the PICA algorithm.
1) Exponential Smoothing:
The PICA algorithm involves the smoothing of the predictions at each iteration. The objective of this smoothing is to reduce excessive fluctuations in the predicted glucose signal. These oscillations are not representative of actual glucose variations and are therefore dangerous for the patient.

We chose the exponential smoothing technique rather than the moving average technique because it gives more weight to recent predictions. Exponential smoothing can be defined recursively, with each value of the smoothed signal being equal to a weighting between the value of the original signal and the previous value of the smoothed signal (see Equation 5, where ĝ*_t represents the smoothed value of the glucose prediction ĝ_t and β the smoothing coefficient) [23].

ĝ*_t =
  ĝ_0,                           if t = 0
  β · ĝ_t + (1 − β) · ĝ*_{t−1},  otherwise    (5)

The higher β is, the stronger the weight given to the original signal, and the less smooth the signal is. The choice of the smoothing coefficient β ∈ [0, 1] must be made carefully. Indeed, too aggressive a smoothing will result in a temporal shift of the signal. In the context of glucose prediction, this would greatly reduce the accuracy of the model, and therefore its usefulness for the patient.

To our knowledge, although common in signal processing (e.g., power consumption prediction [24]), no post-processing smoothing has been done in the glucose prediction literature. We can nevertheless note the occasional use of low-pass filters (which act similarly to the exponential smoothing technique) on the input signal [3], [25].

D. Metrics
To evaluate the models, we use four different metrics: the RMSE, the MAPE, the MASE, and the CG-EGA. For each metric, the performances are averaged over the 5 test subsets of each patient, linked to a 5-fold cross-validation on the training/validation permutations. They are then also averaged over all the patients from the same dataset. The RMSE, MAPE, and MASE metrics give complementary measures of the accuracy of the predictions. While the RMSE is closely related to the prediction scale, the MAPE is scale-independent and is expressed as a percentage. As for the MASE, it measures the average usefulness of the predictions compared to naive predictions (predictions equal to the last known observations). The MASE is computed following Equation 4, presented in the previous section. On the other hand, the CG-EGA measures the clinical acceptability of the predictions by analyzing the clinical accuracy as well as the coherence between successive predictions. In the end, the CG-EGA classifies a prediction either as an accurate prediction (AP), a benign error (BE), or an erroneous prediction (EP). A high AP rate and a low EP rate are necessary for a model to be clinically acceptable. The rates can be either averaged over all the test samples, or over the samples within a specific glycemic region (i.e., hypoglycemia, euglycemia, and hyperglycemia).
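As an illustration of the scale-free stopping metric, the MASE of Equation 4 can be computed as follows (a hypothetical NumPy helper of our own, where ph is the prediction horizon expressed in samples):

```python
import numpy as np

# Sketch of the MASE (Equation 4): the mean absolute error of the model,
# scaled by the mean absolute error of a naive model that predicts the last
# known observation g_{n-PH}. A MASE above 1 means the naive model wins.
def mase(g, g_hat, ph):
    g, g_hat = np.asarray(g, float), np.asarray(g_hat, float)
    num = np.mean(np.abs(g - g_hat))
    den = np.mean(np.abs(g[ph:] - g[:-ph]))  # naive PH-step-ahead error
    return num / den
```

On a steadily rising signal, a model with errors half the size of the naive steps gets a MASE of 0.5, while a grossly wrong model exceeds 1 and would trigger the stopping criterion of the PICA algorithm.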
E. Glucose Predictive Models
The objective of the study is to improve the clinical acceptability of deep models. To this end, we first proposed a new cost function, the cMSE, which penalizes the model during its training not only on prediction errors but also on predicted variation errors. We then proposed the gcMSE, which is the cMSE customized to glucose prediction. In particular, it introduces weighting coefficients based on the CG-EGA to enhance the clinical acceptability of the model. Finally, we proposed the PICA algorithm, which progressively improves the clinical acceptability of the models through the use of the gcMSE function. The models that we present here aim at evaluating these different proposals.

We use as reference models the Support Vector Regression model (SVR) and the Long Short-Term Memory recurrent neural network (LSTM) from the GLYFE benchmark study [26]. As the preprocessing steps are identical in the two studies, the results are fully comparable. The SVR and LSTM models represent, respectively, the best model and the best deep model in this benchmark.

Fig. 4: Post-processing and evaluation of the predictions. For every paired fold, predictions and ground truths are smoothed, rescaled (using the scaler mean and standard deviation), and reshaped (according to the sampling frequency and prediction horizon) before being evaluated with the RMSE, MAPE, MASE, and CG-EGA metrics.

First, to analyze the potential improvement of the clinical acceptability brought by the cMSE and gcMSE cost functions, we evaluate the pcLSTM and gpcLSTM models, respectively. These two models are based on a two-output LSTM architecture which, apart from the presence of the two outputs, is identical to the LSTM model of the GLYFE benchmark study. They are respectively trained to minimize the cMSE and gcMSE loss functions, with a coherence factor c set to 8 for the IDIAB dataset and 2 for the OhioT1DM dataset. This difference between the two datasets is explained by the MSE of the predicted variations being approximately 4 times greater for the OhioT1DM dataset. As for the coefficients p_ab and p_hypo of the gcMSE, we set them to 1 and 10, respectively. These coefficients are identical to those of the first iteration of the PICA algorithm. In addition, we evaluate an additional variant of the gcMSE whose coefficient p_ab is set to 0. This model, denoted gpcLSTM_CA, aims at maximizing the clinical acceptability without taking into account the precision of the model beyond clinical acceptability needs.

The PICA algorithm uses the exponential smoothing technique to stabilize successive predictions. In order to fully evaluate the impact of the cost functions and of the PICA algorithm, we apply the exponential smoothing technique to all the models presented in this study. The smoothed variant of each model is denoted by a superscript asterisk (e.g., LSTM*, pcLSTM*, gpcLSTM*_CA). All these models use a smoothing coefficient of 0.85, as it only slightly degrades the accuracy of the predicted signal.

The PICA algorithm makes a compromise between the gpcLSTM* and gpcLSTM*_CA models, with the emphasis on clinical acceptability increasing progressively over the iterations of the algorithm.
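The two loss functions can be sketched as follows. This is a minimal NumPy illustration of their general shape, not the exact formulation of the paper: the coherence term is assumed to be simply added with weight c, and the per-sample weights standing in for the CG-EGA-based coefficients (e.g., p_hypo = 10, p_ab = 1) are hypothetical:

```python
import numpy as np

def cmse(y_true, y_pred, dy_true, dy_pred, c):
    """cMSE sketch: MSE on the predictions plus, weighted by the coherence
    factor c, an MSE on the predicted variations."""
    return np.mean((y_true - y_pred) ** 2) + c * np.mean((dy_true - dy_pred) ** 2)

def gcmse(y_true, y_pred, dy_true, dy_pred, c, weights):
    """gcMSE sketch: same terms as the cMSE, but each sample is weighted
    according to its position in the CG-EGA error space (the weights here
    are hypothetical stand-ins for the p_ab / p_hypo coefficients)."""
    per_sample = (y_true - y_pred) ** 2 + c * (dy_true - dy_pred) ** 2
    return np.mean(weights * per_sample)
```

With c = 0 and uniform weights, both losses reduce to the standard MSE, which is the baseline loss the reference LSTM model is trained with.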
However, the precision constraint, through the coefficient p_ab, is never set to 0 (model gpcLSTM*_CA), because such a model has a precision far too low to be useful for the diabetic patient. This is why the PICA algorithm stops when the MASE exceeds the value of 1 on the validation set. We denote by gpcLSTM*_PICA the model obtained when the PICA algorithm stops. Its results represent the upper bound of clinical acceptability while maintaining a useful accuracy. In the PICA algorithm, we use the p_ab coefficient update law presented in Equation 3. It involves the coefficient α, the rate at which the accuracy constraint is relaxed, which has been set to 0.9 in this study. A higher coefficient gives better control over the final trade-off, in return for a slower execution time (more iterations before convergence). The PICA algorithm uses exponential smoothing on the model's predictions to increase the stability of the predicted signal. The smoothing coefficient β, as for all the smoothed variants of the other models, was fixed at 0.85.

IV. EXPERIMENTAL RESULTS
A. Presentation of the Experimental Results
In this section we present the experimental results of this study. These results are reported in two tables: Tables II and III. While Table II describes the general results of the different models in terms of RMSE, MAPE, MASE, and general CG-EGA, Table III gives a more detailed description, by glycemic region, of the CG-EGA.

Of our two reference models, SVR and LSTM, the SVR model is the one with the best clinical acceptability (general or regional CG-EGA) for a comparable accuracy. In particular, the SVR model has one of the best clinical acceptabilities in the hypoglycemia region (69.39% and 49.71% AP for the IDIAB and OhioT1DM datasets, respectively). The exponential smoothing improves the clinical acceptability of the SVR model (SVR* model) by -12.79% in AP rate for an increase of +0.90% in RMSE (decrease in accuracy). The LSTM* model is subject to similar changes, with -11.44% AP and +0.98% RMSE. Table III shows that these improvements in clinical acceptability occur in the euglycemia and hyperglycemia regions, and not in the hypoglycemia region (small decrease in AP).

The pcLSTM model and its smoothed variant pcLSTM*, using the cMSE cost function as well as the two-output architecture of the LSTM network, are shown to improve the clinical acceptability while deteriorating the accuracy. In particular, the pcLSTM* model, compared to the LSTM* model, has -24.18% AP and +8.95% RMSE. The improvement in clinical acceptability is greater for the OhioT1DM dataset (-32.19% AP) than for the IDIAB dataset (-16.16% AP). For a comparable decrease in accuracy, the OhioT1DM dataset benefits more from the cMSE cost function than the IDIAB dataset. Moreover, the pcLSTM* model has among the best clinical acceptability scores in the euglycemia and hyperglycemia regions. However, in comparison with the LSTM and LSTM* models, the clinical acceptability in the hypoglycemia region is deteriorated, especially for the OhioT1DM dataset.
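The starred (smoothed) models discussed above are obtained with the recursive exponential smoothing of Equation 5. Below is a minimal NumPy sketch on a hypothetical noisy prediction signal, with the β = 0.85 coefficient used in this study:

```python
import numpy as np

def exponential_smoothing(g_hat, beta=0.85):
    """Recursive exponential smoothing (Equation 5): each smoothed value is
    a weighted combination of the current prediction and the previous
    smoothed value. beta close to 1 stays close to the raw predictions."""
    smoothed = np.empty_like(g_hat)
    smoothed[0] = g_hat[0]
    for t in range(1, len(g_hat)):
        smoothed[t] = beta * g_hat[t] + (1 - beta) * smoothed[t - 1]
    return smoothed

# hypothetical noisy prediction signal (mg/dL)
raw = np.array([110., 140., 112., 138., 115.])
print(exponential_smoothing(raw))
```

Larger β keeps the signal closer to the raw predictions; the value 0.85 was retained because it only slightly degrades accuracy while damping the oscillations between successive predictions.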
Here we represent the decrease, in %, of what is metrically improvable. For the AP, which has a maximum of 100%, the ratio of change between a model B and a reference model A is calculated as (100 − AP_B) / (100 − AP_A).

TABLE II: Mean (with standard deviation) of statistical accuracy (RMSE, MAPE, and MASE) and general clinical acceptability (CG-EGA) for a prediction horizon of 30 minutes and for the IDIAB and OhioT1DM datasets.
Model RMSE MAPE MASE CG-EGA (general)
AP BE EP
IDIAB Dataset
SVR (6.02) (0.44) (0.15) (2.81) (2.06) (1.23)
LSTM 19.85 (6.00) (1.11) (0.10) (2.99) (1.71) (1.82)
SVR * (6.20) (0.44) (0.15) (2.57) (1.69) (1.35)
LSTM * (6.30) (1.21) (0.09) (3.13) (1.75) (2.00)
pcLSTM (5.68) (1.34) (0.11) (3.26) (1.66) (2.07)
pcLSTM * (6.04) (1.40) (0.11) (3.35) (1.73) (2.07)
gpcLSTM (5.64) (0.92) (0.13) (2.66) (1.48) (1.54)
gpcLSTM * (5.94) (0.95) (0.13) (2.84) (1.55) (1.57)
gpcLSTM CA (11.20) (5.55) (0.55) (2.76) (2.56) (0.91)
gpcLSTM *CA (11.18) (5.47) (0.54) (2.87) (2.61) (0.92)
gpcLSTM *PICA (7.15) (1.18) (0.09) (2.74) (1.99) (1.22)

OhioT1DM Dataset
SVR 20.15 (2.33) (2.11) (0.02) (3.91) (2.83) (1.83)
LSTM (2.08) (2.10) (0.02) (4.17) (2.88) (2.11)
SVR * (2.30) (2.12) (0.02) (4.05) (2.72) (1.90)
LSTM * (2.03) (2.10) (0.02) (3.94) (2.51) (2.04)
pcLSTM (2.23) (2.32) (0.03) (3.76) (2.05) (2.14)
pcLSTM * (2.22) (2.35) (0.03) (3.61) (1.94) (2.12)
gpcLSTM (2.69) (2.14) (0.03) (3.63) (2.52) (1.48)
gpcLSTM * (2.69) (2.16) (0.03) (3.45) (2.31) (1.49)
gpcLSTM CA (6.31) (2.76) (0.53) (2.85) (1.66) (1.28)
gpcLSTM *CA (6.27) (2.76) (0.53) (2.88) (1.64) (1.30)
gpcLSTM *PICA (2.49) (2.09) (0.03) (3.59) (2.23) (1.64)

TABLE III: Mean (with standard deviation) of per-region clinical acceptability (CG-EGA) for a prediction horizon of 30 minutes and for the IDIAB and OhioT1DM datasets.
Model CG-EGA (per region)
Hypoglycemia Euglycemia Hyperglycemia
AP BE EP AP BE EP AP BE EP
IDIAB Dataset
SVR (33.51) (0.70) (33.54) (2.01) (1.83) (0.47) (6.09) (3.86) (2.53)
LSTM (30.73) (0.00) (30.73) (1.48) (1.55) (0.38) (5.60) (3.21) (2.45)
SVR * (31.47) (0.35) (31.51) (1.81) (1.66) (0.36) (5.67) (3.23) (2.79)
LSTM * (31.22) (0.00) (31.22) (1.35) (1.46) (0.38) (6.04) (3.67) (2.58)
pcLSTM (29.27) (0.00) (29.27) (0.90) (0.82) (0.20) (5.81) (3.18) (2.80)
pcLSTM * (27.83) (0.00) (27.83) (0.98) (0.91) (0.11) (6.25) (3.48) (2.85)
gpcLSTM (24.95) (0.00) (24.95) (1.11) (0.99) (0.26) (5.12) (2.83) (2.46)
gpcLSTM * (25.17) (0.00) (25.17) (1.17) (1.02) (0.22) (5.60) (3.09) (2.68)
gpcLSTM CA (9.58) (3.43) (8.15) (1.36) (1.03) (0.40) (4.46) (4.52) (2.39)
gpcLSTM *CA (9.53) (3.43) (8.13) (1.32) (0.97) (0.44) (4.69) (4.70) (2.33)
gpcLSTM *PICA (27.85) (1.14) (28.22) (1.18) (1.08) (0.15) (4.84) (3.53) (1.49)

OhioT1DM Dataset
SVR (18.75) (4.02) (18.70) (4.24) (3.26) (1.23) (3.24) (3.01) (1.84)
LSTM (23.17) (3.72) (24.23) (5.33) (4.06) (1.47) (3.70) (2.73) (2.21)
SVR * (21.11) (4.05) (21.65) (4.22) (3.21) (1.22) (3.43) (2.98) (2.00)
LSTM * (23.50) (4.15) (24.17) (4.83) (3.58) (1.37) (3.55) (2.40) (2.24)
pcLSTM (19.11) (3.73) (19.35) (3.43) (2.53) (1.01) (3.64) (2.55) (2.03)
pcLSTM * (18.23) (3.48) (18.55) (3.17) (2.35) (0.96) (3.54) (2.50) (1.96)
gpcLSTM (22.59) (3.83) (22.86) (3.91) (2.90) (1.12) (3.84) (3.20) (2.01)
gpcLSTM * (22.06) (3.15) (22.42) (3.69) (2.77) (1.04) (3.69) (2.95) (2.02)
gpcLSTM CA (8.50) (2.08) (8.01) (2.03) (1.39) (0.74) (5.00) (2.64) (2.63)
gpcLSTM *CA (8.49) (1.97) (8.00) (2.02) (1.34) (0.77) (5.05) (2.69) (2.62)
gpcLSTM *PICA (20.12) (2.38) (20.23) (3.57) (2.57) (1.07) (3.95) (2.66) (2.31)

The gpcLSTM and gpcLSTM* models, using the gcMSE cost function (the cMSE customized to blood glucose prediction), show a degradation of the RMSE and an improvement of the AP rate similar to those of the pcLSTM and pcLSTM* models. However, the gpcLSTM and gpcLSTM* models have a lower EP rate (-19.53% and -20.07%, respectively), suggesting an improved clinical acceptability. Table III shows that this improvement lies mainly in the hypoglycemia region, with much lower EP rates.

The gpcLSTM_CA and gpcLSTM*_CA models use a gcMSE function with the coefficient p_ab set to 0. Thus, these models focus only on improving the clinical acceptability. Not seeking to improve the accuracy of the predictions beyond the required clinical accuracy (P-EGA zone B), these models have very poor RMSE, MAPE, and MASE. Nevertheless, they have the best clinical acceptability, with the highest AP and the lowest EP rates. The improvement is particularly important in the hypoglycemia region, as can be seen in Table III.

The gpcLSTM*_PICA model represents the last iteration of the PICA algorithm with a MASE on the validation set of less than 1. This model is intended to maximize the clinical acceptability while retaining a reasonable accuracy (MASE less than 1).
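The role of gpcLSTM*_PICA can be illustrated with a sketch of the PICA outer loop. Only the geometric relaxation of p_ab by the factor α = 0.9 and the MASE-based stopping criterion come from the text; the train_and_validate routine (and the mock MASE curve used to exercise it) are hypothetical stand-ins for retraining the model with the gcMSE at each iteration:

```python
def pica_schedule(train_and_validate, alpha=0.9, p_ab_init=1.0,
                  mase_limit=1.0, max_iter=100):
    """Relax the accuracy weight p_ab geometrically until the validation
    MASE exceeds the limit; return the last p_ab meeting the criterion."""
    p_ab = p_ab_init
    best_p_ab = None
    for _ in range(max_iter):
        mase = train_and_validate(p_ab)  # stand-in: retrain with gcMSE, return val. MASE
        if mase > mase_limit:            # accuracy constraint violated: stop
            break
        best_p_ab = p_ab                 # last setting still meeting the criterion
        p_ab *= alpha                    # relax the accuracy constraint further
    return best_p_ab

def mock_mase(p_ab):
    # toy stand-in: pretend accuracy degrades (MASE grows) as p_ab shrinks
    return 0.8 / p_ab ** 0.25

print(pica_schedule(mock_mase))
```

The returned value is the last weighting that still met the accuracy criterion, mirroring how gpcLSTM*_PICA is the model obtained when the algorithm stops.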
Compared to the gpcLSTM*_CA model, the gpcLSTM*_PICA model has a slightly lower clinical acceptability (but a better one than all the other models, thanks in particular to its low EP rate).

B. Discussion
The results show that exponential smoothing reduces the benign error (BE) rate in favor of a better AP rate by reducing the amplitude of the variations between successive predictions. This improvement holds for most of the models and has as a counterpart a rather small decrease in the general accuracy of the model. Thus, exponential smoothing, used softly (coefficient β of 0.85), is an efficient method to improve the stability of the prediction signal, making it safer for the diabetic patient. However, it remains useless in the hypoglycemia range, where the majority of clinical prediction errors are due to poor accuracy.

The effects of using the cMSE cost function on the glucose predictions are similar: successive glucose predictions are more consistent with each other, resulting in a large reduction of the BE rate. The effects are greater for the OhioT1DM dataset, which sees its EP rate decrease at the same time. We can explain this by a higher noise in the predicted glucose signal of the OhioT1DM dataset, noise that comes from the initial glucose signal. With its lower sampling frequency, the IDIAB glucose signal is less noisy in comparison. The cMSE allows successive predictions to be made with a rate of change that better reflects the actual rate of change, and thus improves the clinical acceptability. However, as with exponential smoothing, the improvements in clinical acceptability do not generalize to all glycemic regions. In particular, the hypoglycemia region appears to suffer from the use of the cMSE, with an increase in its EP rate, especially for the OhioT1DM dataset.

The action of the gcMSE is more focused on the decrease of the EP rate, as shown by the gpcLSTM, gpcLSTM_CA, and gpcLSTM*_PICA models. In contrast with the exponential smoothing technique and the cMSE cost function, the gcMSE improves all glycemic regions, and in particular the hypoglycemia region.
Moreover, these improvements allow the LSTM neural network to surpass, in clinical acceptability, the SVR model, which is the best model of the GLYFE benchmark study. Figure 5 allows us to appreciate the differences in the predictions of the different models. First, we can see the large variations and noise in the predicted glucose signal of the LSTM model. These oscillations are reduced for the other models, becoming closer to the observed glucose signal. However, when using the cMSE cost function (pcLSTM* signal, in purple), we witness a large loss of accuracy in the hypoglycemia region (between 4:00 and 8:00 am). While the gpcLSTM*_PICA signal is very close to the observed signal in the hypoglycemia region, this comes at the cost of an overall loss in accuracy. Finally, gpcLSTM* is a compromise between the two.

Although we can conclude on the strength of using the gcMSE cost function in the training of deep models predicting future glucose levels in people with diabetes, the different results show us that there are many possible tradeoffs between accuracy and clinical acceptability. The PICA algorithm proposed in this study aims at selecting the best compromise between accuracy and clinical acceptability based on selection criteria. Figure 6 gives a graphical representation of the changes in MASE, general AP rate, and general EP rate of the models throughout the PICA algorithm for all the patients. As previously discussed, there is no clinical criterion for glucose predictive models yet, so the only criterion for stopping the algorithm here was the MASE exceeding 1. The figure first shows us that the number of iterations before stopping the algorithm varies from one dataset to another, and also from one patient to another (25.0 ± …).

Fig. 5: Predictions of the LSTM*, pcLSTM*, gpcLSTM* and gpcLSTM*_PICA models for patient 575 from the OhioT1DM dataset for a given day.

… more stable and easier to predict. Thus, for a future practical use, the clinical criteria must be rigorously standardized.

Finally, we note that the MASE on the testing set (the one reported in Tables II and III) is slightly higher than 1 (1.03 and 1.01 for the IDIAB and OhioT1DM datasets, respectively). Using such a stopping criterion, we could have expected the final MASE on the testing set to be less than 1, as is the case on the validation set. This happens because the test subset is not fully representative of the validation subset. This is due to the generally small quantities of data in the datasets, negatively impacting the representativeness of these subsets. We also note that the standard deviation for the IDIAB dataset is higher, showing that the final value of the MASE is highly variable depending on the subject. Thus, the accuracy of the PICA algorithm would be improved by using more data (which would also improve the performance of the models in general).

V. CONCLUSION
In this study, we proposed a framework for the integration of clinical criteria into the training of deep models. Clinical criteria are often different from the standard statistical metrics used as loss functions. As a consequence, the best model, given the loss function used during its training, is not necessarily the model with the best clinical acceptability. We address this issue from the perspective of the challenging task of predicting future glucose values of diabetic people.

In glucose prediction, the CG-EGA metric measures the clinical acceptability of the predictions. In particular, it assesses the safety of the predictions by looking at the prediction accuracy and the predicted rate-of-change accuracy. Moreover, the metric behaves differently in the different glycemic regions, some errors being more dangerous than others without being high-amplitude errors. Starting from the cMSE loss function we proposed in an earlier work of ours [16], which penalizes the model during its training not only on prediction errors but also on predicted variation errors, we proposed to personalize the loss function to the glucose prediction task.

TABLE IV: Number of patients within a given dataset that can respect different clinical criteria (minimal AP rate or maximal EP rate) through the PICA algorithm.

AP (≥)  EP (≤)  IDIAB  Ohio
80      -       6      6
90      -       6      3
95      -       4      0
97      -       3      0
-       7       6      6
-       5       6      4
-       3       6      3
-       1       4      0
80      7       6      6
90      5       6      3
95      3       4      0
97      1       2      0
Fig. 6: Evolution of the MASE and CG-EGA (AP and EP) metrics throughout the PICA algorithm for the IDIAB and OhioT1DM datasets. Iterations 0 and 0* respectively represent the results of the model trained with the MSE cost function before and after smoothing of the predictions.

Based on the CG-EGA, this personalization, called gcMSE, weights the errors differently depending on the scores obtained in the P-EGA and R-EGA grids. Finally, we proposed the PICA algorithm to obtain the solution that maximizes the accuracy of the model while at the same time respecting given clinical criteria.

We evaluated the different proposed loss functions and the PICA algorithm on two different diabetes datasets, the IDIAB and the OhioT1DM datasets. First, we showed that the cMSE loss function increases the coherence of successive predictions, improving the clinical acceptability of the models. However, this improvement comes at the cost of a decrease in the accuracy of the model. Then, we showed that the gcMSE further improves the clinical acceptability by reducing the rate of life-threatening errors. Finally, we demonstrated the usefulness of the PICA algorithm, which helps in choosing the desired tradeoff between general accuracy and clinical acceptability.

The analysis of different clinical criteria showed that not all the patients were able to meet them easily. This depends on the difficulty of the glucose prediction task for the patient, which varies from patient to patient, but also on the nature of the dataset, and in particular on the devices used for the data collection. These factors would need to be taken into account when creating future regulations for the use of such models by diabetic patients.

ACKNOWLEDGMENT
This work is supported by the "IDI 2017" project funded by the IDEX Paris-Saclay, ANR-11-IDEX-0003-02. We would like to thank the French diabetes health network Revesdiab and Dr. Sylvie JOANNIDIS for their help in building the IDIAB dataset used in this study.

REFERENCES
International Journal for Numerical Methods in Biomedical Engineering, vol. 33, no. 6, p. e2833, 2017.
[3] G. Sparacino, F. Zanderigo, S. Corazza, A. Maran, A. Facchinetti, and C. Cobelli, "Glucose concentration can be predicted ahead in time from continuous glucose monitoring sensor time-series," IEEE Transactions on Biomedical Engineering, vol. 54, no. 5, pp. 931–937, 2007.
[4] S. M. Pappada, B. D. Cameron, P. M. Rosman, R. E. Bourey, T. J. Papadimos, W. Olorunto, and M. J. Borst, "Neural network-based real-time prediction of glucose in patients with insulin-dependent diabetes," Diabetes Technology & Therapeutics, vol. 13, no. 2, pp. 135–141, 2011.
[5] E. I. Georga, V. C. Protopappas, D. Ardigò, D. Polyzos, and D. I. Fotiadis, "A glucose model based on support vector regression for the prediction of hypoglycemic events under free-living conditions," Diabetes Technology & Therapeutics, vol. 15, no. 8, pp. 634–643, 2013.
[6] J. B. Ali, T. Hamdi, N. Fnaiech, V. Di Costanzo, F. Fnaiech, and J.-M. Ginoux, "Continuous blood glucose level prediction of type 1 diabetes based on artificial neural network," Biocybernetics and Biomedical Engineering, vol. 38, no. 4, pp. 828–840, 2018.
[7] A. Aliberti, I. Pupillo, S. Terna, E. Macii, S. Di Cataldo, E. Patti, and A. Acquaviva, "A multi-patient data-driven approach to blood glucose prediction," IEEE Access, vol. 7, pp. 69311–69325, 2019.
[8] S. Mirshekarian, R. Bunescu, C. Marling, and F. Schwartz, "Using LSTMs to learn physiological models of blood glucose behavior," in Engineering in Medicine and Biology Society (EMBC), 2017 39th Annual International Conference of the IEEE. IEEE, 2017, pp. 2887–2891.
[9] S. Mirshekarian, H. Shen, R. Bunescu, and C. Marling, "LSTMs and neural attention models for blood glucose prediction: Comparative experiments on real and synthetic data," in . IEEE, 2019, pp. 706–712.
[10] J. Martinsson, A. Schliep, B. Eliasson, and O. Mogren, "Blood glucose prediction with variance estimation using recurrent neural networks," Journal of Healthcare Informatics Research, pp. 1–18, 2019.
[11] M. De Bois, M. A. E. Yacoubi, and M. Ammi, "Adversarial multi-source transfer learning in healthcare: Application to glucose prediction for diabetic people," arXiv preprint arXiv:2006.15940, 2020.
[12] T. Zhu, K. Li, P. Herrero, J. Chen, and P. Georgiou, "A deep learning algorithm for personalized blood glucose prediction," in KHD@IJCAI, 2018, pp. 64–78.
[13] M. De Bois, M. Ammi, and M. A. E. Yacoubi, "GLYFE: Review and benchmark of personalized glucose predictive models in type-1 diabetes," arXiv preprint arXiv:2006.15946, 2020.
[14] S. Del Favero, A. Facchinetti, and C. Cobelli, "A glucose-specific metric to assess predictors and identify models," IEEE Transactions on Biomedical Engineering, vol. 59, no. 5, pp. 1281–1290, 2012.
[15] B. P. Kovatchev, L. A. Gonder-Frederick, D. J. Cox, and W. L. Clarke, "Evaluating the accuracy of continuous glucose-monitoring sensors: continuous glucose–error grid analysis illustrated by TheraSense FreeStyle Navigator data," Diabetes Care, vol. 27, no. 8, pp. 1922–1928, 2004.
[16] M. De Bois, M. A. El Yacoubi, and M. Ammi, "Prediction-coherent LSTM-based recurrent neural network for safer glucose predictions in diabetic people," in International Conference on Neural Information Processing. Springer, 2019, pp. 510–521.
[17] C. Marling and R. C. Bunescu, "The OhioT1DM dataset for blood glucose level prediction," in KHD@IJCAI, 2018, pp. 60–63.
[18] M. De Bois, "Integration of clinical criteria into the training of deep models: Application to glucose prediction for diabetic people," 2020, doi: 10.5281/zenodo.3904234. [Online]. Available: https://github.com/dotXem/DeepClinicalGlucosePrediction
[19] T.-Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár, "Focal loss for dense object detection," in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 2980–2988.
[20] R. T. Marler and J. S. Arora, "Survey of multi-objective optimization methods for engineering," Structural and Multidisciplinary Optimization, vol. 26, no. 6, pp. 369–395, 2004.
[21] K. Deb, S. Agrawal, A. Pratap, and T. Meyarivan, "A fast elitist non-dominated sorting genetic algorithm for multi-objective optimization: NSGA-II," in International Conference on Parallel Problem Solving from Nature. Springer, 2000, pp. 849–858.
[22] R. J. Hyndman and A. B. Koehler, "Another look at measures of forecast accuracy," International Journal of Forecasting, vol. 22, no. 4, pp. 679–688, 2006.
[23] R. G. Brown, Smoothing, Forecasting and Prediction of Discrete Time Series. Courier Corporation, 2004.
[24] J. W. Taylor and P. E. McSharry, "Short-term load forecasting methods: An evaluation based on European data," IEEE Transactions on Power Systems, vol. 22, no. 4, pp. 2213–2219, 2007.
[25] C. Pérez-Gandía, A. Facchinetti, G. Sparacino, C. Cobelli, E. Gómez, M. Rigla, A. de Leiva, and M. Hernando, "Artificial neural network algorithm for online glucose prediction from continuous glucose monitoring,"