A random shuffle method to expand a narrow dataset and overcome the associated challenges in a clinical study: a heart failure cohort example
Lorenzo Fassina, Alessandro Faragli, Francesco Paolo Lo Muzio, Sebastian Kelle, Carlo Campana, Burkert Pieske, Frank Edelmann, Alessio Alogna
ORIGINAL RESEARCH · November 2020 | Volume 7 | Article 599923
Edited by:
Gaetano Ruocco, Regina Montis Regalis Hospital, Italy
Reviewed by:
Alberto Aimo, Sant'Anna School of Advanced Studies, Italy; Kristen M. Tecson, Baylor Scott & White Research Institute (BSWRI), United States
*Correspondence:
Alessio Alogna [email protected] [email protected]
Specialty section:
This article was submitted to Heart Failure and Transplantation, a section of the journal Frontiers in Cardiovascular Medicine
Received:
28 August 2020
Accepted:
19 October 2020
Published:
20 November 2020
Citation:
Fassina L, Faragli A, Lo Muzio FP, Kelle S, Campana C, Pieske B, Edelmann F and Alogna A (2020) A Random Shuffle Method to Expand a Narrow Dataset and Overcome the Associated Challenges in a Clinical Study: A Heart Failure Cohort Example. Front. Cardiovasc. Med. 7:599923. doi: 10.3389/fcvm.2020.599923
A Random Shuffle Method to Expand a Narrow Dataset and Overcome the Associated Challenges in a Clinical Study: A Heart Failure Cohort Example
Lorenzo Fassina*, Alessandro Faragli, Francesco Paolo Lo Muzio, Sebastian Kelle, Carlo Campana, Burkert Pieske, Frank Edelmann and Alessio Alogna

Department of Electrical, Computer and Biomedical Engineering, University of Pavia, Pavia, Italy; Department of Internal Medicine and Cardiology, Deutsches Herzzentrum Berlin, Berlin, Germany; Department of Internal Medicine and Cardiology, Charité – Universitätsmedizin Berlin, Berlin, Germany; Berlin Institute of Health (BIH), Berlin, Germany; DZHK (German Centre for Cardiovascular Research), Partner Site Berlin, Berlin, Germany; Department of Surgery, Dentistry, Paediatrics and Gynaecology, University of Verona, Verona, Italy; Department of Medicine and Surgery, University of Parma, Parma, Italy; Department of Cardiology, Sant'Anna Hospital, ASST-Lariana, Como, Italy
Heart failure (HF) affects at least 26 million people worldwide, so predicting adverse events in HF patients represents a major target of clinical data science. However, achieving large sample sizes sometimes represents a challenge due to difficulties in patient recruiting and long follow-up times, increasing the problem of missing data. To overcome the issue of a narrow dataset cardinality (in a clinical dataset, the cardinality is the number of patients in that dataset), population-enhancing algorithms are therefore crucial. The aim of this study was to design a random shuffle method to enhance the cardinality of an HF dataset while it is statistically legitimate, without the need of specific hypotheses and regression models. The cardinality enhancement was validated against an established random repeated-measures method with regard to the correctness in predicting clinical conditions and endpoints. In particular, machine learning and regression models were employed to highlight the benefits of the enhanced datasets. The proposed random shuffle method was able to enhance the HF dataset cardinality (711 patients before dataset preprocessing) circa 10 times, and circa 21 times when followed by a random repeated-measures approach. We believe that the random shuffle method could be used in the cardiovascular field and in other data science problems when missing data and a narrow dataset cardinality represent an issue.
Keywords: random shuffle, missing data, narrow dataset cardinality, data science, heart failure
INTRODUCTION
Heart failure (HF) affects at least 26 million people worldwide (1), so predicting adverse events in HF patients represents a major target of clinical data science. Common challenges in clinical studies and trials are as follows (2, 3): (i) troubles in finding patients fitting the eligibility criteria (e.g., rare disease); (ii) difficulties in the enrollment because of a poorly formulated informed consent; (iii) data collection problems; (iv) time delays because of a complicated study design or due to unpredictable events; and (v) financial demands of the clinical practice. All these issues could be the cause of missing data and datasets with narrow cardinality, which are relevant challenges in data science (in a clinical dataset, the cardinality is the number of patients in that dataset).

As a consequence, researchers need to produce novel hypotheses and methods to deal with these issues, which are particularly critical when the dataset is used to build risk models in the field of clinical cardiology. A successful effort to overcome the abovementioned issues is represented by the MAGGIC risk score, developed as a tool of risk stratification for both morbidity and mortality in HF patients (4, 5). To build MAGGIC, Pocock et al. (5) combined 30 datasets to enlarge the patients' cardinality, thereby reaching an astonishing amount of 39,372 patients, and handled the missing patients' values via multiple imputation using chained equations (6, 7). In detail, to deal with missing data, regression equations are defined; the missing values are initially replaced by randomly chosen observed values of each variable, then the missing values are replaced by a random draw from the distribution defined by the regression equations, and, at the end of the last iteration, the final value becomes the chosen imputed value.
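The chained-equations loop just described can be sketched in Python. This is an illustrative re-implementation for purely numeric features with linear regressions, not the actual software used by Pocock et al.; the function name and defaults are our own.

```python
import numpy as np

def mice_impute(X, n_iter=10, rng=None):
    """Simple chained-equations imputation for a numeric matrix X
    (np.nan marks missing values). Each missing entry is first filled
    with a randomly chosen observed value of its column; then it is
    iteratively re-drawn from the distribution defined by a linear
    regression of that column on the other columns."""
    rng = np.random.default_rng(rng)
    X = X.astype(float).copy()
    miss = np.isnan(X)
    # initial fill: random observed value from the same column
    for j in range(X.shape[1]):
        obs = X[~miss[:, j], j]
        X[miss[:, j], j] = rng.choice(obs, size=miss[:, j].sum())
    for _ in range(n_iter):
        for j in range(X.shape[1]):
            rows = miss[:, j]
            if not rows.any():
                continue
            # regress column j on all the other columns (plus intercept)
            others = np.delete(X, j, axis=1)
            A = np.column_stack([np.ones(len(X)), others])
            beta, *_ = np.linalg.lstsq(A[~rows], X[~rows, j], rcond=None)
            sigma = (X[~rows, j] - A[~rows] @ beta).std()
            # random draw from the regression-defined distribution
            X[rows, j] = A[rows] @ beta + rng.normal(0.0, sigma, rows.sum())
    return X
```

At the end of the last sweep, the current draw is kept as the imputed value, mirroring the procedure described above.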
Hence, we can argue that a random procedure could be important to overcome not only the issue of missing data, but also, at the same time, that of narrow dataset cardinality.

The conceptual challenge of missing data is dual: 1) missing patients (i.e., completely missing data but plausible patients, as discussed later), who cause a narrow cardinality of the dataset, and 2) missing data in patients with a partial list of needed values. In the current work, we unify the vision of these two kinds of missing data, searching for them with a random method, our novel random shuffle method, without the use of specific hypotheses and regression models: we only need the original data, and we randomly shuffle them while it is statistically legitimate. "Statistically legitimate" means that, to validate our random shuffle method, the new datasets with enhanced cardinality were compared to those enhanced via an established random repeated-measures method (8, 9).

Indeed, the aim of this work is not to obtain a risk score, but to introduce an innovative method to enlarge the dataset cardinality and boost the statistical performance. Our random shuffle method can be applied in other research fields when both missing data and a limited dataset are issues because of financial, experimental, or ethical limitations.
DATA AND METHODS

Original Dataset
The clinical dataset is composed of a total of 711 German, Austrian, and Italian patients suffering from HF in different stages, admitted to a hospital facility due to either an acute hospitalization or an ambulatory visit, released and followed up for a period of 6 months. Patients were enrolled in two distinct clinical studies: (i) the Aldo-DHF trial (10), a multicenter, randomized, placebo-controlled, double-blind, two-armed, parallel-group study that enrolled patients from 10 trial sites in Germany and Austria (data are available in the Supplementary Materials) and (ii) the STOP-SCO trial, a prospective, multicenter, observational study that enrolled patients from 10 hospitals in Northern Italy (unpublished data, available in the Supplementary Materials). The protocol and amendments were approved by the institutional review board at each participating center, and the trials were conducted in accordance with the principles of the Declaration of Helsinki, Good Clinical Practice guidelines, and local and national regulations. Written informed consent was provided by all patients before any study-related procedures were performed.

The studied endpoints at 6 months were a composite endpoint (all-cause hospitalization plus all-cause mortality) and all-cause hospitalization.

The dataset is organized in rows (patients) and columns (clinical parameters or features). The features are of two types:
i) 13 binary features that show the presence (value = 1) or the absence (value = 0) of the following conditions: peripheral edema; composite endpoint; age > 75 years; angiotensin receptor blocker intake; β-blocker intake; left ventricular ejection fraction at admission > …; … < 50 mL/min; heart rate at release ≥ …; hemoglobin < 12 g/dL for women, < 13 g/dL for men; all-cause hospitalization endpoint; and more than 2 hospitalizations in the last year; and
ii) 6 numerical features: age, heart rate at release, body weight at release, systolic aortic pressure at release, diastolic aortic pressure at release, and left ventricular ejection fraction at admission.

To preprocess the clinical dataset for the removal of patients with missing values, two exclusion criteria were sequentially set: 1) at least one endpoint lacking (composite endpoint, all-cause hospitalization endpoint) and 2) at least one feature lacking (other than endpoints).

After the preceding data cleaning, the 13 binary features were used as dummy variables (11) to group the patients into classes, where the number of classes could be, at maximum, 2¹³ = 8,192. In particular, a self-balancing (12) (also called height-balancing) was applied to the tree of the binary features, obtaining a new sorting of the dataset. In summary, the ordered list of the first 13 columns is the i) list above.

Moreover, because an intraclass-intrafeature random shuffling is possible if and only if the class cardinality is > 1, the mono-example classes (i.e., with a lone patient) were excluded.

After preprocessing, the dataset is composed of 385 patients grouped into 61 classes. Conceptually, each class represents a particular clinical condition; in other words, the class label delimits a dataset subset inside which the shuffling is legitimate and not tautologic [as we show below, in a statistical manner, via the comparison to a MATLAB-implemented repeated-measures fitting followed by its "random" method (8, 9); MATLAB®, The MathWorks, Inc., Natick, MA].

In Figure 1A, for demonstration purposes, we show a simplified representation of the original dataset with four patients analyzed with 3 features and grouped into 2 classes. In Figure 2A, for the sake of example and comparison with the enhancing methods (Figures 2B,C), we plot two original numerical features for two classes (e.g., the 1st and the 3rd of 61 classes). The following sections will describe how to obtain variants of the original dataset.

Repeated-Measure Variant
In MATLAB® (Statistics and Machine Learning Toolbox™), there are already implemented functions such as "fitrm" (acronym for "fit repeated-measures model"), with the associated "random" method permitting to generate new random response values given predictor values (8, 9). In particular, in the fitrm function, the measurements (the 6 numerical features listed above) are the responses, and the class column (with the aforementioned 61 classes) is the predictor variable. The fitrm function produces a repeated-measures model onto which we can apply the random method to randomly generate new response values, that is, new numerical measurements for our 6 numerical features. We called this random generation the "repeated-measures" variant (Figure 1B), and we added it to the original dataset (Figure 1A), obtaining an enhanced dataset (Figure 2B).

Theoretically, it is possible to generate at will without outputting replicated values, but we have introduced a calculus checkpoint to delete any replicated patients in the enhanced dataset.
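A rough Python analogue of the fitrm + random pair can be sketched as per-class sampling from a fitted multivariate normal (this captures the spirit — class as predictor, numerical features as responses — but is not the exact fitrm internals; the function name is our own):

```python
import numpy as np

def repeated_measures_variant(X, classes, rng=None):
    """For each class, fit a multivariate normal to the numerical
    features and draw one synthetic patient per original patient
    (an approximation of MATLAB's fitrm model + 'random' method)."""
    rng = np.random.default_rng(rng)
    new_rows, new_cls = [], []
    for c in np.unique(classes):
        sub = X[classes == c]
        mean = sub.mean(axis=0)
        cov = np.atleast_2d(np.cov(sub, rowvar=False))
        # draw as many synthetic patients as the class cardinality
        draw = rng.multivariate_normal(mean, cov, size=len(sub))
        new_rows.append(draw)
        new_cls.extend([c] * len(sub))
    return np.vstack(new_rows), np.array(new_cls)
```

As in the text, the generated rows would then be appended to the original dataset and any replicated patients deleted.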
Shuffle Variant
In MATLAB®, we have implemented an intraclass random exchange/shuffle of values inside each feature (i.e., each feature is independently shuffled in a random and intraclass manner). We called this random exchange/shuffle the "shuffle" variant (Figure 1C), and we added it to the original dataset (Figure 1A), obtaining an enhanced dataset (Figure 2C).

Shuffling is likely to output replicated patients (especially inside low-cardinality classes), so we have introduced a calculus checkpoint to delete replicated patients in the enhanced dataset.
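A minimal Python sketch of the proposed shuffle variant follows (the authors' MATLAB implementation is in the linked repository; the column names and function name here are illustrative):

```python
import numpy as np
import pandas as pd

def shuffle_variant(df, feature_cols, class_col="cls", rng=None):
    """Intraclass-intrafeature random shuffle: inside each class, every
    feature column is permuted independently of the others, creating
    plausible new patients from existing values. Replicas of original
    patients are dropped afterwards (the 'calculus checkpoint')."""
    rng = np.random.default_rng(rng)
    parts = []
    for _, grp in df.groupby(class_col):
        shuffled = grp.copy()
        for col in feature_cols:
            # each feature is shuffled independently, within the class
            shuffled[col] = rng.permutation(grp[col].to_numpy())
        parts.append(shuffled)
    variant = pd.concat(parts, ignore_index=True)
    enhanced = pd.concat([df, variant], ignore_index=True)
    return enhanced.drop_duplicates().reset_index(drop=True)
```

Because each column is only permuted, the per-class, per-feature value distributions are preserved exactly; only the combinations of values change.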
Hotelling T² Statistic
Hotelling's T² distribution is a multivariate distribution proportional to the F distribution; in particular, it is a generalization of the Student t distribution for multivariate purposes. The Hotelling T² statistic is a generalization of the Student t statistic used in multivariate hypothesis testing (13, 14). In our multivariate problem, we have 6 numerical features, and we would enhance the original dataset without generating a different population (p > 0.05); that is, the enhancement is legitimate while the p-value is not significant (i.e., the enhanced shuffled population is the same as the original dataset or the enhanced repeated-measures one).

Combined Approach
In a combined approach, an enhanced shuffled population was subjected to a repeated-measures processing.
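The legitimacy criterion used throughout — checking that an enhanced population is statistically the same as the original one across the numerical features — can be sketched as a two-sample Hotelling T² test in Python (an illustrative implementation; the authors used a MATLAB routine (14)):

```python
import numpy as np
from scipy import stats

def hotelling_t2(X, Y):
    """Two-sample Hotelling T^2 test: are the multivariate means of X
    and Y (rows = patients, columns = numerical features) the same?
    Returns the T^2 statistic and its p-value via the F distribution."""
    n1, n2, p = len(X), len(Y), X.shape[1]
    d = X.mean(axis=0) - Y.mean(axis=0)
    # pooled covariance of the two samples
    S = ((n1 - 1) * np.cov(X, rowvar=False)
         + (n2 - 1) * np.cov(Y, rowvar=False)) / (n1 + n2 - 2)
    t2 = (n1 * n2) / (n1 + n2) * d @ np.linalg.solve(S, d)
    # T^2 is proportional to an F statistic
    f = (n1 + n2 - p - 1) / (p * (n1 + n2 - 2)) * t2
    pval = stats.f.sf(f, p, n1 + n2 - p - 1)
    return t2, pval
```

A non-significant p-value (p > 0.05) would mark the enhancement as legitimate; a significant one would stop further enlargement.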
Stressing the Enhanced Datasets via Machine Learning and Regression
In our specific cardiology problem (HF), the main goals of having enhanced datasets by enlarging their cardinality, while it is legitimate, are a greater classification/prediction skill (e.g., to predict the patient's class of risk) and a greater regression skill (e.g., to estimate the likelihood of two endpoints: composite endpoint, all-cause hospitalization endpoint). In other words, we are trying to overcome the issues of missing data and datasets with narrow cardinality, which are typically due to financial, experimental, or ethical limitations, without losing the statistical nature of the original dataset, boosting its statistical performance while legitimate (p > 0.05 at the Hotelling T² test).

To highlight the benefits of the enhanced datasets vs. the original one, we have compared their classification/prediction skill and regression skill. In detail, to stress via machine learning, we have used all 19 features (13 binary, 6 numerical) and the column with the class labels as the response column (the enhanced dataset had 61 classes like the original one). A 10-fold cross-validation was applied to calculate the accuracy (%) by the MATLAB® Classification Learner application (methods: fine tree, fine KNN, weighted KNN, linear SVM; all default settings were unchanged).

To stress via regression, we have used 17 features (11 binary, i.e., excluding the 2 endpoints; 6 numerical) and, as the response column, a column containing a specific endpoint (composite endpoint or all-cause hospitalization endpoint). A 10-fold cross-validation was applied to calculate the root mean square error (RMSE) by the MATLAB® Regression Learner application (methods: fine tree, linear, linear SVM; all default settings were unchanged).
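The cross-validated stress test can be emulated in Python without toolboxes; below is a hand-rolled 10-fold cross-validation with a 1-nearest-neighbour classifier as a stand-in for MATLAB's "fine KNN" (an illustrative sketch, not the Classification Learner itself):

```python
import numpy as np

def cv_accuracy_1nn(X, y, k=10, rng=None):
    """k-fold cross-validated accuracy of a 1-nearest-neighbour
    classifier: hold out each fold in turn and classify its patients
    by the label of the closest training patient."""
    rng = np.random.default_rng(rng)
    idx = rng.permutation(len(X))
    folds = np.array_split(idx, k)
    correct = 0
    for fold in folds:
        train = np.setdiff1d(idx, fold)  # all patients outside the fold
        for i in fold:
            dists = np.linalg.norm(X[train] - X[i], axis=1)
            correct += y[train[np.argmin(dists)]] == y[i]
    return correct / len(X)
```

With enhanced datasets, the classes become more densely populated, so a held-out patient is more likely to find a same-class neighbour in the training partition — the intuition behind the accuracy gains reported below.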
RESULTS

Hotelling T² Statistic
The two enhanced populations (repeated-measures, shuffle) were the same as the original one up to a 20× enlargement; that is, we arrived up to 7,700 patients (including the 385 original ones). Further enhancements were not legitimate (p < 0.05). In the combined approach, a 20× shuffled population was subjected to a 2× repeated-measures processing, and we arrived up to 15,199 patients (including the 385 original ones). Further enlargements were not legitimate (p < 0.05).

FIGURE 1 | Simplified representation of the original dataset along with its variants. (A) The simplified original dataset showing four patients (P = patient), each analyzed with three features (F = feature), displayed with different symbols and colors, and grouped into two classes highlighted with the colored boxes. (B) Representation of the "repeated-measure" variant to expand the cardinality of the original dataset. (C) Same as (B), but for our proposed "shuffle" variant.

Stressing the Enhanced Datasets via Machine Learning and Regression
The comprehensive results are presented in the following tables in terms of accuracy (%) and RMSE. Accuracy is a metric for evaluating the performance of machine learning in terms of the fraction of correct classifications. In this example dataset, high accuracy means that a sizable portion of patients was grouped into the correct classes (Table 1).

RMSE is a good estimator for the standard deviation of prediction errors; it informs about how far off we expect the regression model to be on its next prediction. If the RMSE is very small (Tables 2, 3), the predicted value of an endpoint will practically coincide with the observed binary value in the future.

DISCUSSION
To stratify patients according to their cardiovascular event risk in a 6-month follow-up after hospital discharge, the appropriate method of classification needs to be accurately determined in the case of the original dataset. In our case, the fine KNN algorithm implemented in MATLAB® revealed itself to be a good choice (accuracy equal to 93.2%, Table 1). However, the enlargement or enhancement of the cardinality of the original dataset, while it is legitimate, could possibly enable a greater classification/prediction skill. In detail, we have designed and developed a random shuffle method and validated it against the already used random repeated-measures method: the validation has given statistical legitimacy to the random shuffle method (while p > 0.05 at the Hotelling T² statistic), and we have obtained a performance (accuracy up to 100%, independently of the classification method) better than that of the fine KNN dedicated only to the original dataset (Table 1). These results prove that the strategy with binary features, used to define the classes, and our random shuffle method to enhance the dataset can give a particularly good classification performance (Table 1).

To estimate the likelihood of the two endpoints (composite and all-cause hospitalization), a linear regression is already a good choice (Tables 2, 3). However, the enlargement of the cardinality of the original dataset, via both the random repeated-measures method and the random shuffle method or via the combined approach, can give a better performance (RMSE down to 0), as stressed via the fine tree regression method. For example, a fatal clinical set is positive for NT-proBNP > … and heart rate at release ≥ 90 bpm, whereas a rehospitalization clinical set is positive for peripheral edema and left ventricular ejection fraction > ….
FIGURE 2 | Comparison of the simplified original dataset with its enhancements. (A) Plot of two original numerical features for two classes (the 1st and the 3rd of 61 classes). (B) Plot of two numerical features for two classes (the 1st and the 3rd of 61 classes) whose cardinality has been enhanced 2×: original plus one intraclass random generation of values inside each feature according to a fitted repeated-measures model. (C) Plot of two numerical features for two classes (the 1st and the 3rd of 61 classes) whose cardinality has been enhanced 2×: original plus one intraclass random exchange/shuffle of values inside each feature (each feature is independently shuffled in a random and intraclass manner).

TABLE 1 | Machine learning with 10-fold cross-validation to calculate the classification accuracy (%).
Accuracy (%) | Original dataset (385 patients) | 20× Repeated measure (7,700 patients) | 20× Shuffle (7,700 patients) | Combined (15,199 patients)
Fine tree | 86.2 | 100 | 100 | 100
Fine KNN | 93.2 | 100 | 100 | 100
Weighted KNN | 86.0 | 100 | 100 | 100
Linear SVM | 75.3 | 100 | 100 | 100

The names of the classification methods (fine tree, fine KNN, weighted KNN, linear SVM) refer to the preset tools inside the "Model Type" section of the MATLAB® Classification Learner application (all default settings were unchanged).
Clinicians could certainly claim that the abovementioned inferences could easily be made also without the use of mathematical methods or tools of artificial intelligence (e.g., classification/prediction or regression as shown in Tables 1–3). Indeed, we consider such a provocative observation as a major strength of this study, because we have validated the random shuffle method not only by statistics, but also, more importantly, by clinical judgment.

Another clinical strength is that the chosen features are patients' event ratios at hospitalization and follow-up. Thus, by randomly shuffling these features between patients, we are creating in silico plausible patients with a realistic and likely
TABLE 2 | Regression with 10-fold cross-validation, endpoint = composite, to calculate the regression RMSE (root mean square error).

RMSE | Original dataset (385 patients) | 20× Repeated measure (7,700 patients) | 20× Shuffle (7,700 patients) | Combined (15,199 patients)
Fine tree | 0.093 | 0 | 0 | 0
Linear | 2.7 × 10^−… | … | … | …
Linear SVM | 0.108 | 0.066 | 0.065 | 0.065

The names of the regression methods (fine tree, linear, linear SVM) refer to the preset tools inside the "Model Type" section of the MATLAB® Regression Learner application (all default settings were unchanged).
TABLE 3 | Regression with 10-fold cross-validation, endpoint = all-cause hospitalization, to calculate the regression RMSE (root mean square error).

RMSE | Original dataset (385 patients) | 20× Repeated measure (7,700 patients) | 20× Shuffle (7,700 patients) | Combined (15,199 patients)
Fine tree | 0.003 | 0 | 0 | 0
Linear | 1.9 × 10^−… | … | … | …
Linear SVM | 0.146 | 0.065 | 0.065 | 0.065

The names of the regression methods (fine tree, linear, linear SVM) refer to the preset tools inside the "Model Type" section of the MATLAB® Regression Learner application (all default settings were unchanged).

combination of comorbidities and event ratios. Therefore, the enhancement of the dataset cardinality yields not only statistical but also clinical worth.

In conclusion, we have shown that our random shuffle method is validated not only by statistical comparison to an already established method (the random repeated-measures method), but also, more notably, by clinical knowledge and expertise. In addition, in comparison with the random repeated-measures method, a mathematical advantage of the random shuffle method is the absence of a fitting procedure. Consequently, we believe that our random shuffle method can also be applied in other research fields when missing data and the narrow cardinality of a dataset are issues because of financial, experimental, or ethical limitations.
MORE TECHNICAL DISCUSSION

Exclusion Criteria
Three exclusion criteria were sequentially set: 1) at least one endpoint lacking (thus, 116 patients were removed); 2) at least one feature lacking (other than endpoints) (another 67 patients removed); and 3) the mono-example classes (i.e., with a lone patient) were excluded (another 143 patients removed), because the mono-example classes cannot be shuffled. One could certainly observe that exclusion criteria 1 and 2 are particularly selective. For instance, to increase the number of patients after preprocessing, only one endpoint at a time could be considered for patient exclusion; this choice is certainly possible and correct, but it implies the cutting of an entire feature, that is, the other endpoint, and, as a consequence, we would obtain a reduced stratification of the patients. In addition, the random repeated-measures method does not tolerate missing data. Summarizing, the choice was (i) a lower number of patients but with all features, all endpoints, and full stratification or, on the contrary, (ii) a higher number of patients but with a reduced set of features and endpoints and with a reduced stratification. To stress the random shuffle method, we have chosen the first possibility, which is the "worst case" in terms of patients' number and stratification. In any case, the meaning of the random shuffle method remains the same as described above. Moreover, the choice permitted the use of the same data for both classification and regression.
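The grouping and exclusion steps (dummy-variable classes from the binary features, mono-example classes dropped) might look like this in Python/pandas — a sketch with invented column names, the authors having worked in MATLAB:

```python
import pandas as pd

def build_classes(df, binary_cols):
    """Assign each patient a class label given by the tuple of its
    binary features, then drop 'mono-example' classes (with a lone
    patient), since intraclass shuffling needs at least two patients
    per class."""
    labels = df[binary_cols].astype(int).apply(tuple, axis=1)
    df = df.assign(cls=labels.map(lambda t: "".join(map(str, t))))
    counts = df["cls"].value_counts()
    keep = counts[counts > 1].index
    # keep only classes with cardinality > 1, sorted by class label
    return df[df["cls"].isin(keep)].sort_values("cls").reset_index(drop=True)
```

Sorting by the class label plays the role of the re-sorting of the dataset described above, so that each class occupies a contiguous block of rows.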
Cardinality Enhancement
The cardinality of the original dataset could be small because of two concomitant reasons: (i) a small number of classes (low stratification) and (ii) a small number of patients inside the classes. With these traits of the original database, the intraclass-intrafeature random shuffling has "suffocating borders" in which to act, and the database enhancement is also subjected to the deletion of repeated patients: in that case, we can hypothesize that the attainable degree of dataset enhancement is limited by the small cardinality of the original dataset. On the contrary, we see the maximum possibility of enhancement when the number of classes and the number of class patients are both high. On the other hand, we see intermediate possibilities when the classes are few but with many patients in each and, vice versa, when the classes are many but with few patients in each. In our original dataset, the classes were many (61 classes), and some of them had few patients (e.g., before cardinality enhancement, two or three or four patients); for additional details, see the following discussion dedicated to oversampling.
Oversampling
The random shuffle method could also be seen as a new kind of oversampling dedicated to the classes of both the minority (with a low number of patients) and the majority (with a high number of patients). Oversampling is useful when there is an imbalance (related to the number of patients) between majority and minority classes able to downgrade the classification performance (15, 16). The imbalance can be corrected via oversampling inside minority classes and undersampling inside majority ones, e.g., via SMOTE (Synthetic Minority Oversampling Technique) along with a randomly reduced number of patients in the majority classes (15). In a different approach with respect to (15), where the information content is amplified or reduced in minority or majority classes, respectively, we have oversampled both minority and majority classes, while it is statistically legitimate; in other words, we preserve the imbalance (a hallmark of a dataset), and we multiply the information content, while it is statistically legitimate, obtaining an enhanced classification and regression performance. We could also hypothesize that the reinforcement of all classes could improve the "exclusion power" of classification algorithms, permitting them to better predict patients into reinforced minority classes.
Cross-Validation for Oversampled Datasets
One could certainly observe that cross-validation, although a very common and accepted technique to avoid overfitting in classification and regression and so to ameliorate their prediction skill, could be prone to "overoptimism" when applied to oversampled datasets, because similar samples or exact replicas may appear in both the training and test partitions. This issue has been clearly discussed by Santos et al. (17), who found a useful combination of characteristics to obtain a non-overoptimistic oversampling: (i) use of cleaning procedures, (ii) cluster-based synthetization of samples, and (iii) adaptive weighting of minority samples. The last cannot be applied because of the simple nature of the shuffling, but the other two have been comprised in the proposed method: the random shuffle is done in an intraclass manner, and then we delete possible patients' replicas before further analysis; moreover, as a third characteristic, each feature is independently shuffled, so that plausible patients are synthetized, as clinically discussed above. The combination of these three traits of the method makes us confident in the cross-validation done.
CLINICAL LIMITATIONS
The clinical timepoint is to be considered approximately in the middle between those of the two trials used (Aldo-DHF and STOP-SCO). Even if the two trials were different in terms of patients' nationality, we used them together because they represent a real-life heterogeneous set of HF patients who are commonly observed in daily clinics. The risk prediction model at 6 months and an investigation of the differences between the data of the two trials were not purposes of this study and will be addressed in another work.
DATA AVAILABILITY STATEMENT
Data and codes (with MIT License), along with reproducibility instructions, are available in the Supplementary Material and also here: https://github.com/lorfas74/random-shuffle on the GitHub development platform.
ETHICS STATEMENT
The studies involving human participants were reviewed and approved by the institutional review board at each participating center and were conducted in accordance with the principles of the Declaration of Helsinki, Good Clinical Practice guidelines, and local and national regulations. The patients/participants provided their written informed consent to participate in this study.
AUTHOR CONTRIBUTIONS
Random shuffle method (hypothesis, design, and implementation): LF. Statistics, machine learning, regression, and validation: LF, FPLM, AF, and AA. Acquisition of clinical data: AF, AA, SK, CC, FE, and BP. Clinical discussion: AF, AA, FPLM, FE, and BP. Wrote and edited the manuscript: all authors.
FUNDING
AA is a participant in the BIH-Charité Clinician Scientist Program funded by the Charité – Universitätsmedizin Berlin and the Berlin Institute of Health. This work was supported by a PRIN grant (2017AXL54F_002). We acknowledge support from the German Research Foundation (DFG) and the Open Access Publication Fund of Charité – Universitätsmedizin Berlin.
SUPPLEMENTARY MATERIAL
REFERENCES
1. Savarese G, Lund LH. Global public health burden of heart failure. Card Fail Rev. (2017) 3:7–11. doi: 10.15420/cfr.2016:25:2
2. English RA, Lebovitz Y, Giffin RB. Challenges in clinical research [chapter 3]. In: Transforming Clinical Research in the United States: Challenges and Opportunities: Workshop Summary. Washington, DC: National Academies Press (2010). p. 19–36.
3. Singhal R, Rana R. Intricacy of missing data in clinical trials: deterrence and management. Int J Appl Basic Med Res. (2014) 4:S2–5. doi: 10.4103/2229-516X.140706
4. Prieto-Merino D, Pocock SJ. The science of risk models. Eur J Prev Cardiol. (2012) 19:7–13. doi: 10.1177/2047487312448995
5. Pocock SJ, Ariti CA, McMurray JJ, Maggioni A, Kober L, Squire IB, et al. Predicting survival in heart failure: a risk score based on 39372 patients from 30 studies. Eur Heart J. (2013) 34:1404–13. doi: 10.1093/eurheartj/ehs337
6. White IR, Royston P. Imputing missing covariate values for the Cox model. Stat Med. (2009) 28:1982–98. doi: 10.1002/sim.3618
7. White IR, Royston P, Wood AM. Multiple imputation using chained equations: issues and guidance for practice. Stat Med. (2011) 30:377–99. doi: 10.1002/sim.4067
8. MathWorks. MATLAB® Function to Fit Repeated Measures Model.
9. MathWorks. MATLAB® Function to Generate New Random Response Values Given Predictor Values.
10. Edelmann F, et al. Rationale and design of the 'aldosterone receptor blockade in diastolic heart failure' (Aldo-DHF) trial. Eur J Heart Fail. (2010) 12:874–82. doi: 10.1093/eurjhf/hfq087
11. Suits DB. Use of dummy variables in regression equations. J Am Stat Assoc. (1957) 52:548–51. doi: 10.1080/01621459.1957.10501412
12. Knuth D. Balanced trees [section 6.2.3 of volume 3 (Sorting and Searching)]. In: The Art of Computer Programming. Redwood City, CA: Addison-Wesley (1998). p. 458–81.
13. Hotelling H. The generalization of Student's ratio. Ann Math Stat. (1931) 2:360–78. doi: 10.1214/aoms/1177732979
14. Trujillo-Ortiz A. HotellingT2.
15. Chawla NV, Bowyer KW, Hall LO, Kegelmeyer WP. SMOTE: synthetic minority over-sampling technique. J Artif Intell Res. (2002) 16:321–57. doi: 10.1613/jair.953
16. Kaur P, Gosain A. FF-SMOTE: a metaheuristic approach to combat class imbalance in binary classification. Appl Artif Intell. (2019) 33:420–39. doi: 10.1080/08839514.2019.1577017
17. Santos MS, Soares JP, Abreu PH, Araujo H, Santos J. Cross-validation for imbalanced datasets: avoiding overoptimistic and overfitting approaches [Research Frontier]. IEEE Comput Intell Mag. (2018) 13:59–76. doi: 10.1109/MCI.2018.2866730
Conflict of Interest:
The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.
Copyright © 2020 Fassina, Faragli, Lo Muzio, Kelle, Campana, Pieske, Edelmann and Alogna. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.