A random shuffle method to expand a narrow dataset and overcome the associated challenges in a clinical study: a heart failure cohort example
Lorenzo Fassina, Alessandro Faragli, Francesco Paolo Lo Muzio, Sebastian Kelle, Carlo Campana, Burkert Pieske, Frank Edelmann, Alessio Alogna
ORIGINAL RESEARCH · November 2020 | Volume 7 | Article 599923
Edited by:
Gaetano Ruocco, Regina Montis Regalis Hospital, Italy
Reviewed by:
Alberto Aimo, Sant'Anna School of Advanced Studies, Italy; Kristen M. Tecson, Baylor Scott & White Research Institute (BSWRI), United States
*Correspondence:
Alessio Alogna [email protected] [email protected]
Specialty section:
This article was submitted to Heart Failure and Transplantation, a section of the journal Frontiers in Cardiovascular Medicine
Received:
28 August 2020
Accepted:
19 October 2020
Published:
20 November 2020
Citation:
Fassina L, Faragli A, Lo Muzio FP, Kelle S, Campana C, Pieske B, Edelmann F and Alogna A (2020) A Random Shuffle Method to Expand a Narrow Dataset and Overcome the Associated Challenges in a Clinical Study: A Heart Failure Cohort Example. Front. Cardiovasc. Med. 7:599923. doi: 10.3389/fcvm.2020.599923
A Random Shuffle Method to Expand a Narrow Dataset and Overcome the Associated Challenges in a Clinical Study: A Heart Failure Cohort Example
Lorenzo Fassina*, Alessandro Faragli, Francesco Paolo Lo Muzio, Sebastian Kelle, Carlo Campana, Burkert Pieske, Frank Edelmann and Alessio Alogna

Department of Electrical, Computer and Biomedical Engineering, University of Pavia, Pavia, Italy; Department of Internal Medicine and Cardiology, Deutsches Herzzentrum Berlin, Berlin, Germany; Department of Internal Medicine and Cardiology, Charité – Universitätsmedizin Berlin, Berlin, Germany; Berlin Institute of Health (BIH), Berlin, Germany; DZHK (German Centre for Cardiovascular Research), Partner Site Berlin, Berlin, Germany; Department of Surgery, Dentistry, Paediatrics and Gynaecology, University of Verona, Verona, Italy; Department of Medicine and Surgery, University of Parma, Parma, Italy; Department of Cardiology, Sant'Anna Hospital, ASST-Lariana, Como, Italy
Heart failure (HF) affects at least 26 million people worldwide, so predicting adverse events in HF patients represents a major target of clinical data science. However, achieving large sample sizes sometimes represents a challenge due to difficulties in patient recruiting and long follow-up times, increasing the problem of missing data. To overcome the issue of a narrow dataset cardinality (in a clinical dataset, the cardinality is the number of patients in that dataset), population-enhancing algorithms are therefore crucial. The aim of this study was to design a random shuffle method to enhance the cardinality of an HF dataset while it is statistically legitimate, without the need of specific hypotheses and regression models. The cardinality enhancement was validated against an established random repeated-measures method with regard to the correctness in predicting clinical conditions and endpoints. In particular, machine learning and regression models were employed to highlight the benefits of the enhanced datasets. The proposed random shuffle method was able to enhance the HF dataset cardinality (711 patients before dataset preprocessing) circa 10 times, and circa 21 times when followed by a random repeated-measures approach. We believe that the random shuffle method could be used in the cardiovascular field and in other data science problems when missing data and a narrow dataset cardinality represent an issue.
Keywords: random shuffle, missing data, narrow dataset cardinality, data science, heart failure
INTRODUCTION
Heart failure (HF) affects at least 26 million people worldwide (1), so predicting adverse events in HF patients represents a major target of clinical data science. Common challenges in clinical studies and trials are as follows (2, 3): (i) troubles in finding patients fitting the eligibility criteria (e.g., rare disease); (ii) difficulties in the enrollment because of a poorly formulated informed consent; (iii) data collection problems; (iv) time delays because of a complicated study design or due to unpredictable events; and (v) financial demands of the clinical practice. All these issues could be the cause of missing data and datasets with narrow cardinality, which are relevant challenges in data science (in a clinical dataset, the cardinality is the number of patients in that dataset).

As a consequence, researchers need to produce novel hypotheses and methods to deal with these issues, which are particularly critical when the dataset is used to build risk models in the field of clinical cardiology. A successful effort to overcome the abovementioned issues is represented by the MAGGIC risk score, developed as a tool of risk stratification for both morbidity and mortality in HF patients (4, 5). To build MAGGIC, Pocock et al. (5) combined 30 datasets to enlarge the patients' cardinality, thereby reaching an astonishing amount of 39,372 patients, and handled the missing patients' values via multiple imputation using chained equations (6, 7). In detail, to deal with missing data, regression equations are defined; the missing values are initially replaced by randomly chosen observed values of each variable, then the missing values are replaced by a random draw from the distribution defined by the regression equations, and, at the end of the last iteration, the final value becomes the chosen imputed value.
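The chained-equations loop just described can be sketched in Python. This is an illustrative re-implementation for purely numeric features with linear regressions, not the actual software used by Pocock et al.; the function name and defaults are our own.

```python
import numpy as np

def mice_impute(X, n_iter=10, rng=None):
    """Simple chained-equations imputation for a numeric matrix X
    (np.nan marks missing values). Each missing entry is first filled
    with a randomly chosen observed value of its column; then it is
    iteratively re-drawn from the distribution defined by a linear
    regression of that column on the other columns."""
    rng = np.random.default_rng(rng)
    X = X.astype(float).copy()
    miss = np.isnan(X)
    # initial fill: random observed value from the same column
    for j in range(X.shape[1]):
        obs = X[~miss[:, j], j]
        X[miss[:, j], j] = rng.choice(obs, size=miss[:, j].sum())
    for _ in range(n_iter):
        for j in range(X.shape[1]):
            rows = miss[:, j]
            if not rows.any():
                continue
            # regress column j on all the other columns (plus intercept)
            others = np.delete(X, j, axis=1)
            A = np.column_stack([np.ones(len(X)), others])
            beta, *_ = np.linalg.lstsq(A[~rows], X[~rows, j], rcond=None)
            sigma = (X[~rows, j] - A[~rows] @ beta).std()
            # random draw from the regression-defined distribution
            X[rows, j] = A[rows] @ beta + rng.normal(0.0, sigma, rows.sum())
    return X
```

At the end of the last sweep, the current draw is kept as the imputed value, mirroring the procedure described above.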
Hence, we can argue that a random procedure could be important to overcome not only the issue of missing data, but also, at the same time, that of narrow dataset cardinality.

The conceptual challenge of missing data is dual: 1) missing patients (i.e., completely missing data but plausible patients, as discussed later), who cause a narrow cardinality of the dataset, and 2) missing data in patients with a partial list of needed values. In the current work, we unify the vision of these two kinds of missing data, searching for them with a random method, our novel random shuffle method, without the use of specific hypotheses and regression models: we only need the original data, and we randomly shuffle them while it is statistically legitimate. "Statistically legitimate" means that, to validate our random shuffle method, the new datasets with enhanced cardinality were compared to those enhanced via an established random repeated-measures method (8, 9).

Indeed, the aim of this work is not to obtain a risk score, but to introduce an innovative method to enlarge the dataset cardinality and boost the statistical performance. Our random shuffle method can be applied in other research fields when both missing data and a limited dataset are issues because of financial, experimental, or ethical limitations.
DATA AND METHODS

Original Dataset
The clinical dataset is composed of a total of 711 German, Austrian, and Italian patients suffering from HF in different stages, admitted to a hospital facility due to either an acute hospitalization or an ambulatory visit, released and followed up for a period of 6 months. Patients were enrolled in two distinct clinical studies: (i) the Aldo-DHF trial (10), a multicenter, randomized, placebo-controlled, double-blind, two-armed, parallel-group study that enrolled patients from 10 trial sites in Germany and Austria (data are available in the Supplementary Materials) and (ii) the STOP-SCO trial, a prospective, multicenter, observational study that enrolled patients from 10 hospitals in Northern Italy (unpublished data, available in the Supplementary Materials). The protocol and amendments were approved by the institutional review board at each participating center, and the trials were conducted in accordance with the principles of the Declaration of Helsinki, Good Clinical Practice guidelines, and local and national regulations. Written informed consent was provided by all patients before any study-related procedures were performed.

The studied endpoints at 6 months were a composite endpoint (all-cause hospitalization plus all-cause mortality) and all-cause hospitalization.

The dataset is organized in rows (patients) and columns (clinical parameters or features). The features are of two types:
i) 13 binary features that show the presence (value = 1) or the absence (value = 0) of the following conditions: peripheral edema; composite endpoint; age > 75 years; angiotensin receptor blocker intake; β-blocker intake; left ventricular ejection fraction at admission > …; … < 50 mL/min; heart rate at release ≥ …; hemoglobin < 12 g/dL for women, < 13 g/dL for men; all-cause hospitalization endpoint; and more than 2 hospitalizations in the last year; and
ii) 6 numerical features: age, heart rate at release, body weight at release, systolic aortic pressure at release, diastolic aortic pressure at release, and left ventricular ejection fraction at admission.

To preprocess the clinical dataset for the removal of patients with missing values, two exclusion criteria were sequentially set: 1) at least one endpoint lacking (composite endpoint, all-cause hospitalization endpoint) and 2) at least one feature lacking (other than endpoints).

After the preceding data cleaning, the 13 binary features were used as dummy variables (11) to group the patients into classes, where the number of classes could be, at maximum, 2¹³ = 8,192. In particular, a self-balancing (12) (also called height-balancing) was applied to the tree of the binary features, obtaining a new sorting of the dataset. In summary, the ordered list of the first 13 columns is the i) list above.

Moreover, because an intraclass-intrafeature random shuffling is possible if and only if the class cardinality is > 1, the mono-example classes (i.e., with a lone patient) were excluded.

After preprocessing, the dataset is composed of 385 patients grouped into 61 classes. Conceptually, each class represents a particular clinical condition; in other words, the class label delimits a dataset subset inside which the shuffling is legitimate and not tautologic [as we show below, in a statistical manner, via the comparison to a MATLAB-implemented repeated-measures fitting followed by its "random" method (8, 9); MATLAB®, The MathWorks, Inc., Natick, MA].

In Figure 1A, for demonstration purposes, we show a simplified representation of the original dataset with four patients analyzed with 3 features and grouped into 2 classes. In Figure 2A, for the sake of example and comparison with the enhancing methods (Figures 2B,C), we plot two original numerical features for two classes (e.g., the 1st and the 3rd of 61 classes). The following sections will describe how to obtain variants of the original dataset.

Repeated-Measure Variant
In MATLAB® (Statistics and Machine Learning Toolbox™), there are already implemented functions such as "fitrm" (acronym for "fit repeated-measures model"), with the associated "random" method permitting to generate new random response values given predictor values (8, 9). In particular, in the fitrm function, the measurements (the 6 numerical features listed above) are the responses, and the class column (with the aforementioned 61 classes) is the predictor variable. The fitrm function produces a repeated-measures model onto which we can apply the random method to randomly generate new response values, that is, new numerical measurements for our 6 numerical features. We called this random generation the "repeated-measures" variant (Figure 1B), and we added it to the original dataset (Figure 1A), obtaining an enhanced dataset (Figure 2B).

Theoretically, it is possible to generate at will without outputting replicated values, but we have introduced a calculus checkpoint to delete any replicated patients in the enhanced dataset.
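A rough Python analogue of the fitrm + random pair can be sketched as per-class sampling from a fitted multivariate normal (this captures the spirit — class as predictor, numerical features as responses — but is not the exact fitrm internals; the function name is our own):

```python
import numpy as np

def repeated_measures_variant(X, classes, rng=None):
    """For each class, fit a multivariate normal to the numerical
    features and draw one synthetic patient per original patient
    (an approximation of MATLAB's fitrm model + 'random' method)."""
    rng = np.random.default_rng(rng)
    new_rows, new_cls = [], []
    for c in np.unique(classes):
        sub = X[classes == c]
        mean = sub.mean(axis=0)
        cov = np.atleast_2d(np.cov(sub, rowvar=False))
        # draw as many synthetic patients as the class cardinality
        draw = rng.multivariate_normal(mean, cov, size=len(sub))
        new_rows.append(draw)
        new_cls.extend([c] * len(sub))
    return np.vstack(new_rows), np.array(new_cls)
```

As in the text, the generated rows would then be appended to the original dataset and any replicated patients deleted.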
Shuffle Variant
In MATLAB®, we have implemented an intraclass random exchange/shuffle of values inside each feature (i.e., each feature is independently shuffled in a random and intraclass manner). We called this random exchange/shuffle the "shuffle" variant (Figure 1C), and we added it to the original dataset (Figure 1A), obtaining an enhanced dataset (Figure 2C).

Shuffling is likely to output replicated patients (especially inside low-cardinality classes), so we have introduced a calculus checkpoint to delete replicated patients in the enhanced dataset.
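A minimal Python sketch of the proposed shuffle variant follows (the authors' MATLAB implementation is in the linked repository; the column names and function name here are illustrative):

```python
import numpy as np
import pandas as pd

def shuffle_variant(df, feature_cols, class_col="cls", rng=None):
    """Intraclass-intrafeature random shuffle: inside each class, every
    feature column is permuted independently of the others, creating
    plausible new patients from existing values. Replicas of original
    patients are dropped afterwards (the 'calculus checkpoint')."""
    rng = np.random.default_rng(rng)
    parts = []
    for _, grp in df.groupby(class_col):
        shuffled = grp.copy()
        for col in feature_cols:
            # each feature is shuffled independently, within the class
            shuffled[col] = rng.permutation(grp[col].to_numpy())
        parts.append(shuffled)
    variant = pd.concat(parts, ignore_index=True)
    enhanced = pd.concat([df, variant], ignore_index=True)
    return enhanced.drop_duplicates().reset_index(drop=True)
```

Because each column is only permuted, the per-class, per-feature value distributions are preserved exactly; only the combinations of values change.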
Hotelling T² Statistic
Hotelling's T² distribution is a multivariate distribution proportional to the F distribution; in particular, it is a generalization of the Student t distribution for multivariate purposes. The Hotelling T² statistic is a generalization of the Student t statistic used in multivariate hypothesis testing (13, 14). In our multivariate problem, we have 6 numerical features, and we would enhance the original dataset without generating a different population (p > 0.05); that is, the enhancement is legitimate while the p-value is not significant (i.e., the enhanced shuffled population is the same as the original dataset or the enhanced repeated-measures one).

Combined Approach
In a combined approach, an enhanced shuffled population was subjected to a repeated-measures processing.
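The legitimacy criterion used throughout — checking that an enhanced population is statistically the same as the original one across the numerical features — can be sketched as a two-sample Hotelling T² test in Python (an illustrative implementation; the authors used a MATLAB routine (14)):

```python
import numpy as np
from scipy import stats

def hotelling_t2(X, Y):
    """Two-sample Hotelling T^2 test: are the multivariate means of X
    and Y (rows = patients, columns = numerical features) the same?
    Returns the T^2 statistic and its p-value via the F distribution."""
    n1, n2, p = len(X), len(Y), X.shape[1]
    d = X.mean(axis=0) - Y.mean(axis=0)
    # pooled covariance of the two samples
    S = ((n1 - 1) * np.cov(X, rowvar=False)
         + (n2 - 1) * np.cov(Y, rowvar=False)) / (n1 + n2 - 2)
    t2 = (n1 * n2) / (n1 + n2) * d @ np.linalg.solve(S, d)
    # T^2 is proportional to an F statistic
    f = (n1 + n2 - p - 1) / (p * (n1 + n2 - 2)) * t2
    pval = stats.f.sf(f, p, n1 + n2 - p - 1)
    return t2, pval
```

A non-significant p-value (p > 0.05) would mark the enhancement as legitimate; a significant one would stop further enlargement.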
Stressing the Enhanced Datasets via Machine Learning and Regression
In our specific cardiology problem (HF), the main goals of having enhanced datasets by enlarging their cardinality, while it is legitimate, are a greater classification/prediction skill (e.g., to predict the patient's class of risk) and a greater regression skill (e.g., to estimate the likelihood of two endpoints: composite endpoint, all-cause hospitalization endpoint). In other words, we are trying to overcome the issues of missing data and datasets with narrow cardinality, which are typically due to financial, experimental, or ethical limitations, without losing the statistical nature of the original dataset, boosting its statistical performance while legitimate (p > 0.05 at the Hotelling T² test).

To highlight the benefits of the enhanced datasets vs. the original one, we have compared their classification/prediction skill and regression skill. In detail, to stress via machine learning, we have used all 19 features (13 binary, 6 numerical) and the column with the class labels as the response column (the enhanced dataset had 61 classes like the original one). A 10-fold cross-validation was applied to calculate the accuracy (%) by the MATLAB® Classification Learner application (methods: fine tree, fine KNN, weighted KNN, linear SVM; all default settings were unchanged).

To stress via regression, we have used 17 features (11 binary, i.e., excluding the 2 endpoints; 6 numerical) and, as the response column, a column containing a specific endpoint (composite endpoint or all-cause hospitalization endpoint). A 10-fold cross-validation was applied to calculate the root mean square error (RMSE) by the MATLAB® Regression Learner application (methods: fine tree, linear, linear SVM; all default settings were unchanged).
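The cross-validated stress test can be emulated in Python without toolboxes; below is a hand-rolled 10-fold cross-validation with a 1-nearest-neighbour classifier as a stand-in for MATLAB's "fine KNN" (an illustrative sketch, not the Classification Learner itself):

```python
import numpy as np

def cv_accuracy_1nn(X, y, k=10, rng=None):
    """k-fold cross-validated accuracy of a 1-nearest-neighbour
    classifier: hold out each fold in turn and classify its patients
    by the label of the closest training patient."""
    rng = np.random.default_rng(rng)
    idx = rng.permutation(len(X))
    folds = np.array_split(idx, k)
    correct = 0
    for fold in folds:
        train = np.setdiff1d(idx, fold)  # all patients outside the fold
        for i in fold:
            dists = np.linalg.norm(X[train] - X[i], axis=1)
            correct += y[train[np.argmin(dists)]] == y[i]
    return correct / len(X)
```

With enhanced datasets, the classes become more densely populated, so a held-out patient is more likely to find a same-class neighbour in the training partition — the intuition behind the accuracy gains reported below.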
RESULTS

Hotelling T² Statistic
The two enhanced populations (repeated-measures, shuffle) were the same as the original one up to a 20× enlargement; that is, we arrived up to 7,700 patients (including the 385 original ones). Further enhancements were not legitimate (p < 0.05). In the combined approach, a 20× shuffled population was subjected to a 2× repeated-measures processing, and we arrived up to 15,199 patients (including the 385 original ones). Further enlargements were not legitimate (p < 0.05).

FIGURE 1 | Simplified representation of the original dataset along with its variants. (A) The simplified original dataset showing four patients (P = patient), each analyzed with three features (F = feature), displayed with different symbols and colors, and grouped into two classes highlighted with the colored boxes. (B) Representation of the "repeated-measure" variant to expand the cardinality of the original dataset. (C) Same as (B), but for our proposed "shuffle" variant.

Stressing the Enhanced Datasets via Machine Learning and Regression
The comprehensive results are presented in the following tables in terms of accuracy (%) and RMSE. Accuracy is a metric for evaluating the performance of machine learning in terms of the fraction of correct classifications. In this example dataset, high accuracy means that a sizable portion of patients was grouped into the correct classes (Table 1).

RMSE is a good estimator for the standard deviation of prediction errors; it informs about how far off we expect the regression model to be on its next prediction. If the RMSE is very small (Tables 2, 3), the predicted value of an endpoint will practically coincide with the observed binary value in the future.

DISCUSSION
To stratify patients according to their cardiovascular event risk in a 6-month follow-up after hospital discharge, the appropriate method of classification needs to be accurately determined in the case of the original dataset. In our case, the fine KNN algorithm implemented in MATLAB® revealed itself to be a good choice (accuracy equal to 93.2%, Table 1). However, the enlargement or enhancement of the cardinality of the original dataset, while it is legitimate, could possibly enable a greater classification/prediction skill. In detail, we have designed and developed a random shuffle method and validated it against the already used random repeated-measures method: the validation has given statistical legitimacy to the random shuffle method (while p > 0.05 at the Hotelling T² statistic), and we have obtained a performance (accuracy up to 100%, independently of the classification method) better than that of the fine KNN dedicated only to the original dataset (Table 1). These results prove that the strategy with binary features, used to define the classes, and our random shuffle method to enhance the dataset can give a particularly good classification performance (Table 1).

To estimate the likelihood of the two endpoints (composite and all-cause hospitalization), a linear regression is already a good choice (Tables 2, 3). However, the enlargement of the cardinality of the original dataset, via both the random repeated-measures method and the random shuffle method or via the combined approach, can give a better performance (RMSE down to 0), as stressed via the fine tree regression method. For example, a fatal clinical set is positive for NT-proBNP > … and heart rate at release ≥ 90 bpm, whereas a rehospitalization clinical set is positive for peripheral edema and left ventricular ejection fraction > ….
FIGURE 2 | Comparison of the simplified original dataset with its enhancements. (A) Plot of two original numerical features for two classes (the 1st and the 3rd of 61 classes). (B) Plot of two numerical features for two classes (the 1st and the 3rd of 61 classes) whose cardinality has been enhanced 2×: original plus one intraclass random generation of values inside each feature according to a fitted repeated-measures model. (C) Plot of two numerical features for two classes (the 1st and the 3rd of 61 classes) whose cardinality has been enhanced 2×: original plus one intraclass random exchange/shuffle of values inside each feature (each feature is independently shuffled in a random and intraclass manner).

TABLE 1 | Machine learning with 10-fold cross-validation to calculate the classification accuracy (%).
Accuracy (%) | Original dataset (385 patients) | 20× Repeated measure (7,700 patients) | 20× Shuffle (7,700 patients) | Combined (15,199 patients)
Fine tree | 86.2 | 100 | 100 | 100
Fine KNN | 93.2 | 100 | 100 | 100
Weighted KNN | 86.0 | 100 | 100 | 100
Linear SVM | 75.3 | 100 | 100 | 100

The names of the classification methods (fine tree, fine KNN, weighted KNN, linear SVM) refer to the preset tools inside the "Model Type" section of the MATLAB® Classification Learner application (all default settings were unchanged).
Clinicians could certainly claim that the abovementioned inferences could easily be made also without the use of mathematical methods or tools of artificial intelligence (e.g., classification/prediction or regression as shown in Tables 1–3). Indeed, we consider such a provocative observation as a major strength of this study, because we have validated the random shuffle method not only by statistics, but also, more importantly, by clinical judgment.

Another clinical strength is that the chosen features are patients' event ratios at hospitalization and follow-up. Thus, by randomly shuffling these features between patients, we are creating in silico plausible patients with a realistic and likely
TABLE 2 | Regression with 10-fold cross-validation, endpoint = composite, to calculate the regression RMSE (root mean square error).

RMSE | Original dataset (385 patients) | 20× Repeated measure (7,700 patients) | 20× Shuffle (7,700 patients) | Combined (15,199 patients)
Fine tree | 0.093 | 0 | 0 | 0
Linear | 2.7 × 10^−… | … | … | …
Linear SVM | 0.108 | 0.066 | 0.065 | 0.065

The names of the regression methods (fine tree, linear, linear SVM) refer to the preset tools inside the "Model Type" section of the MATLAB® Regression Learner application (all default settings were unchanged).
TABLE 3 | Regression with 10-fold cross-validation, endpoint = all-cause hospitalization, to calculate the regression RMSE (root mean square error).

RMSE | Original dataset (385 patients) | 20× Repeated measure (7,700 patients) | 20× Shuffle (7,700 patients) | Combined (15,199 patients)
Fine tree | 0.003 | 0 | 0 | 0
Linear | 1.9 × 10^−… | … | … | …
Linear SVM | 0.146 | 0.065 | 0.065 | 0.065

The names of the regression methods (fine tree, linear, linear SVM) refer to the preset tools inside the "Model Type" section of the MATLAB® Regression Learner application (all default settings were unchanged).

combination of comorbidities and event ratios. Therefore, the enhancement of the dataset cardinality yields not only statistical but also clinical worth.

In conclusion, we have shown that our random shuffle method is validated not only by statistical comparison to an already established method (the random repeated-measures method), but also, more notably, by clinical knowledge and expertise. In addition, in comparison with the random repeated-measures method, a mathematical advantage of the random shuffle method is the absence of a fitting procedure. Consequently, we believe that our random shuffle method can also be applied in other research fields when missing data and the narrow cardinality of a dataset are issues because of financial, experimental, or ethical limitations.
MORE TECHNICAL DISCUSSION

Exclusion Criteria
Three exclusion criteria were sequentially set: 1) at least one endpoint lacking (thus, 116 patients were removed); 2) at least one feature lacking (other than endpoints) (another 67 patients removed); and 3) the mono-example classes (i.e., with a lone patient) were excluded (another 143 patients removed), because the mono-example classes cannot be shuffled. One could certainly observe that exclusion criteria 1 and 2 are particularly selective. For instance, to increase the number of patients after preprocessing, only one endpoint at a time could be considered for patient exclusion; this choice is certainly possible and correct, but it implies the cutting of an entire feature, that is, the other endpoint, and, as a consequence, we would obtain a reduced stratification of the patients. In addition, the random repeated-measures method does not tolerate missing data. Summarizing, the choice was (i) a lower number of patients but with all features, all endpoints, and full stratification or, on the contrary, (ii) a higher number of patients but with a reduced set of features and endpoints and with a reduced stratification. To stress the random shuffle method, we have chosen the first possibility, which is the "worst case" in terms of patients' number and stratification. In any case, the meaning of the random shuffle method remains the same as described above. Moreover, the choice permitted the use of the same data for both classification and regression.
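The grouping and exclusion steps (dummy-variable classes from the binary features, mono-example classes dropped) might look like this in Python/pandas — a sketch with invented column names, the authors having worked in MATLAB:

```python
import pandas as pd

def build_classes(df, binary_cols):
    """Assign each patient a class label given by the tuple of its
    binary features, then drop 'mono-example' classes (with a lone
    patient), since intraclass shuffling needs at least two patients
    per class."""
    labels = df[binary_cols].astype(int).apply(tuple, axis=1)
    df = df.assign(cls=labels.map(lambda t: "".join(map(str, t))))
    counts = df["cls"].value_counts()
    keep = counts[counts > 1].index
    # keep only classes with cardinality > 1, sorted by class label
    return df[df["cls"].isin(keep)].sort_values("cls").reset_index(drop=True)
```

Sorting by the class label plays the role of the re-sorting of the dataset described above, so that each class occupies a contiguous block of rows.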
Cardinality Enhancement
The cardinality of the original dataset could be small because of two concomitant reasons: (i) a small number of classes (low stratification) and (ii) a small number of patients inside the classes. With these traits of the original database, the intraclass-intrafeature random shuffling has "suffocating borders" in which to act, and the database enhancement is also subjected to the deletion of repeated patients: in that case, we can hypothesize that the attainable degree of dataset enhancement is limited by the small cardinality of the original dataset. On the contrary, we see the maximum possibility of enhancement when the number of classes and the number of class patients are both high. On the other hand, we see intermediate possibilities when the classes are few but with many patients in each and, vice versa, when the classes are many but with few patients in each. In our original dataset, the classes were many (61 classes), and some of them had few patients (e.g., before cardinality enhancement, two or three or four patients); for additional details, see the following discussion dedicated to oversampling.
Oversampling
The random shuffle method could also be seen as a new kind of oversampling dedicated to the classes of both the minority (with a low number of patients) and the majority (with a high number of patients). Oversampling is useful when there is an imbalance (related to the number of patients) between majority and minority classes able to downgrade the classification performance (15, 16). The imbalance can be corrected via oversampling inside minority classes and undersampling inside majority ones, e.g., via SMOTE (Synthetic Minority Oversampling Technique) along with a randomly reduced number of patients in the majority classes (15). In a different approach with respect to (15), where the information content is amplified or reduced in minority or majority classes, respectively, we have oversampled both minority and majority classes, while it is statistically legitimate; in other words, we preserve the imbalance (a hallmark of a dataset), and we multiply the information content, while it is statistically legitimate, obtaining an enhanced classification and regression performance. We could also hypothesize that the reinforcement of all classes could improve the "exclusion power" of classification algorithms, permitting them to better predict patients into reinforced minority classes.
Cross-Validation for Oversampled Datasets
One could certainly observe that cross-validation, although a very common and accepted technique to avoid overfitting in classification and regression and so to ameliorate their prediction skill, could be prone to "overoptimism" when applied to oversampled datasets, because similar samples or exact replicas may appear in both the training and test partitions. This issue has been clearly discussed by Santos et al. (17), who found a useful combination of characteristics to obtain a non-overoptimistic oversampling: (i) use of cleaning procedures, (ii) cluster-based synthetization of samples, and (iii) adaptive weighting of minority samples. The last cannot be applied because of the simple nature of the shuffling, but the other two have been comprised in the proposed method: the random shuffle is done in an intraclass manner, and then we delete possible patients' replicas before further analysis; moreover, as a third characteristic, each feature is independently shuffled, so that plausible patients are synthetized, as clinically discussed above. The combination of these three traits of the method makes us confident in the cross-validation done.
CLINICAL LIMITATIONS
The clinical timepoint is to be considered approximately in the middle between those of the two trials used (Aldo-DHF and STOP-SCO). Even if the two trials were different in terms of patients' nationality, we used them together because they represent a real-life heterogeneous set of HF patients who are commonly observed in daily clinics. The risk prediction model at 6 months and an investigation of the differences between the data of the two trials were not purposes of this study and will be addressed in another work.
DATA AVAILABILITY STATEMENT
Data and codes (with MIT License), along with reproducibility instructions, are available in the Supplementary Material and also here: https://github.com/lorfas74/random-shuffle on the GitHub development platform.
ETHICS STATEMENT
The studies involving human participants were reviewed and approved by the institutional review board at each participating center and were conducted in accordance with the principles of the Declaration of Helsinki, Good Clinical Practice guidelines, and local and national regulations. The patients/participants provided their written informed consent to participate in this study.
AUTHOR CONTRIBUTIONS
Random shuffle method (hypothesis, design, and implementation): LF. Statistics, machine learning, regression, and validation: LF, FPLM, AF, and AA. Acquisition of clinical data: AF, AA, SK, CC, FE, and BP. Clinical discussion: AF, AA, FPLM, FE, and BP. Wrote and edited the manuscript: all authors.
FUNDING
AA is a participant in the BIH-Charité Clinician Scientist Program funded by the Charité – Universitätsmedizin Berlin and the Berlin Institute of Health. This work was supported by a PRIN grant (2017AXL54F_002). We acknowledge support from the German Research Foundation (DFG) and the Open Access Publication Fund of Charité – Universitätsmedizin Berlin.
SUPPLEMENTARY MATERIAL
REFERENCES
1. Savarese G, Lund LH. Global public health burden of heart failure. Card Fail Rev. (2017) 3:7–11. doi: 10.15420/cfr.2016:25:2
2. English RA, Lebovitz Y, Giffin RB. Challenges in clinical research [chapter 3]. In: Transforming Clinical Research in the United States: Challenges and Opportunities: Workshop Summary. Washington, DC: National Academies Press (2010). p. 19–36.
3. Singhal R, Rana R. Intricacy of missing data in clinical trials: deterrence and management. Int J Appl Basic Med Res. (2014) 4:S2–5. doi: 10.4103/2229-516X.140706
4. Prieto-Merino D, Pocock SJ. The science of risk models. Eur J Prev Cardiol. (2012) 19:7–13. doi: 10.1177/2047487312448995
5. Pocock SJ, Ariti CA, McMurray JJ, Maggioni A, Kober L, Squire IB, et al. Predicting survival in heart failure: a risk score based on 39372 patients from 30 studies. Eur Heart J. (2013) 34:1404–13. doi: 10.1093/eurheartj/ehs337
6. White IR, Royston P. Imputing missing covariate values for the Cox model. Stat Med. (2009) 28:1982–98. doi: 10.1002/sim.3618
7. White IR, Royston P, Wood AM. Multiple imputation using chained equations: issues and guidance for practice. Stat Med. (2011) 30:377–99. doi: 10.1002/sim.4067
8. MathWorks. MATLAB® Function to Fit Repeated Measures Model.
9. MathWorks. MATLAB® Function to Generate New Random Response Values Given Predictor Values.
10. Edelmann F, et al. Rationale and design of the 'aldosterone receptor blockade in diastolic heart failure' (Aldo-DHF) trial. Eur J Heart Fail. (2010) 12:874–82. doi: 10.1093/eurjhf/hfq087
11. Suits DB. Use of dummy variables in regression equations. J Am Stat Assoc. (1957) 52:548–51. doi: 10.1080/01621459.1957.10501412
12. Knuth D. Balanced trees [section 6.2.3 of volume 3 (Sorting and Searching)]. In: The Art of Computer Programming. Redwood City, CA: Addison-Wesley (1998). p. 458–81.
13. Hotelling H. The generalization of Student's ratio. Ann Math Stat. (1931) 2:360–78. doi: 10.1214/aoms/1177732979
14. Trujillo-Ortiz A. HotellingT2.
15. Chawla NV, Bowyer KW, Hall LO, Kegelmeyer WP. SMOTE: synthetic minority over-sampling technique. J Artif Intell Res. (2002) 16:321–57. doi: 10.1613/jair.953
16. Kaur P, Gosain A. FF-SMOTE: a metaheuristic approach to combat class imbalance in binary classification. Appl Artif Intell. (2019) 33:420–39. doi: 10.1080/08839514.2019.1577017
17. Santos MS, Soares JP, Abreu PH, Araujo H, Santos J. Cross-validation for imbalanced datasets: avoiding overoptimistic and overfitting approaches [Research Frontier]. IEEE Comput Intell Mag. (2018) 13:59–76. doi: 10.1109/MCI.2018.2866730
Conflict of Interest:
The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.
Copyright © 2020 Fassina, Faragli, Lo Muzio, Kelle, Campana, Pieske, Edelmann and Alogna. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.