Application of Machine Learning to Predict the Risk of Alzheimer's Disease: An Accurate and Practical Solution for Early Diagnostics
Courtney Cochrane, David Castineira, Nisreen Shiban, Pavlos Protopapas
Institute for Applied Computational Science, Harvard John A. Paulson School of Engineering and Applied Sciences, Cambridge, MA, USA
* [email protected]
Abstract
Alzheimer's Disease (AD) ravages the cognitive ability of more than 5 million Americans and creates an enormous strain on the health care system. This paper proposes a machine learning predictive model for AD development without medical imaging and with fewer clinical visits and tests, in hopes of earlier and cheaper diagnoses. Such earlier diagnoses could be critical to the effectiveness of any drug or medical treatment to cure this disease. Our model is trained and validated using demographic, biomarker and cognitive test data from two prominent research studies: the Alzheimer's Disease Neuroimaging Initiative (ADNI) and the Australian Imaging, Biomarker & Lifestyle Flagship Study of Aging (AIBL). We systematically explore different machine learning models, pre-processing methods and feature selection techniques. The most performant model demonstrates greater than 90% accuracy and recall in predicting AD, and the results generalize across sub-studies of ADNI and to the independent AIBL study. We also demonstrate that these results are robust to reducing the number of clinical visits or tests per visit. Using a meta-classification algorithm and longitudinal data analysis, we are able to produce a "lean" diagnostic protocol with only 3 tests and 4 clinical visits that can predict Alzheimer's development with 87% accuracy and 79% recall. This novel work can be adapted into a practical early diagnostic tool for predicting the development of Alzheimer's that maximizes accuracy while minimizing the number of necessary diagnostic tests and clinical visits.
Author summary
The main goal of this paper is to propose a machine learning solution for the problem of predicting the risk of developing Alzheimer's Disease (AD). This is achieved by systematically analyzing medical records from two of the longest longitudinal studies of AD, ADNI and AIBL. We analyze different machine learning algorithms as well as feature selection methods and preprocessing techniques. Our proposed solution encompasses a diagnostic protocol for early testing of AD that has high accuracy and recall while also minimizing the number of diagnostic tests the patient is subjected to and eliminating the need for costly imaging data. This renders our solution both accurate and practical for an early detection program for AD.

Introduction
Alzheimer's Disease (AD) is the most common cause of dementia, a group of brain disorders that cause the loss of intellectual and social skills. AD manifests as a progressive, degenerative disorder that attacks the brain's nerve cells, or neurons, resulting in loss of memory, thinking and language skills, as well as behavioral changes [1]. Currently AD is an irreversible process with no cure. The personal, social and economic impact of AD is profound: in the United States, more than 5 million people aged 65 or over are suffering from Alzheimer's disease, and the estimated national cost of patient care for Alzheimer's and other dementias was $236 billion in 2016 [2]. Further, AD is the sixth leading cause of death in the US (third for older people) [3].

Although Alzheimer's was first identified more than a century ago, effective treatments have proved elusive. Drug and non-drug treatments can help alleviate some cognitive and behavioral symptoms of AD, but there is still no cure. Researchers continue to work on developing treatments that can reverse disease progression and improve the quality of life for people with Alzheimer's. One of the critical challenges in dealing with AD is the lack of understanding about the neurodegenerative process associated with this disease. There are currently two widely-believed, competing hypotheses:

1) The amyloid hypothesis: One prime suspect for AD is a microscopic brain protein fragment called beta-amyloid. This protein is a sticky compound that accumulates in the brain, disrupting communication between brain cells and eventually killing them. Some researchers believe that flaws in the processes governing production, accumulation or disposal of beta-amyloid are the primary cause of Alzheimer's [4].

2) The tau hypothesis: The accumulation of the tau protein is thought to be a major player in the development of Alzheimer's disease. In particular, the tau hypothesis asserts that the formation of neurofibrillary tangles (insoluble twisted fibers that are formed inside the cells) causes the development of AD [5].

Given the huge social and economic impact of any potential treatment for AD, several companies are actively researching this field. Recent research has created optimism that a treatment for AD is close to fruition [6]. Regardless of the treatment, one critical aspect for the practical deployment of any potential AD drug is the ability for this drug to be used widely and preventively [9]. Thus, this work aims to provide a model, using two well-known studies of Alzheimer's disease (ADNI and AIBL), that is both suitable for early detection and practically applicable in clinical settings. To achieve this aim, we strategically evaluate our model's performance with the smallest (and least expensive) feature subsets so that our model can be used for early screening, before any symptoms appear. Our model utilizes features including demographics, biomarkers, and cognitive tests. Due to the prohibitive expense of medical imaging data (e.g., MRI and PET scans) as an early detection test, we remove medical imaging as a possible feature. Our machine learning approach uses metrics derived from longitudinal data analysis, and our analysis evaluates optimal feature selection techniques, data imputation methods, and classification algorithms. Ultimately our goal is to provide a cost-effective pre-screening test battery for Alzheimer's disease.
We explored existing literature regarding Alzheimer's prediction as well as longitudinal data handling. Prediction of Alzheimer's is a popular area of research, with researchers applying a plethora of supervised learning techniques to the problem. We also found existing literature that worked with the same main data set that we use in our study, ADNI. In general we find that Support Vector Machines are the most popular machine learning technique applied to this problem [11] [12]. However, neural networks have recently gained popularity [13], and novel approaches have been attempted, including using Natural Language Processing to find linguistic deficits [12]. The biggest difference between most studies predicting Alzheimer's and our work lies in our exclusion of medical imaging as a feature for the model. Our goal was to produce a model that could aid in early detection, and the cost of medical imaging is a deterrent for people who are unsure about getting tested for the disease. Therefore, we excluded medical imaging (PET and MRI) from our analysis. The literature that uses medical imaging has produced accuracies of 80% [14] and recalls of 85% [15] in predicting conversion to Alzheimer's. The few models that do not include medical imaging data, instead using cognitive tests and demographics, have accuracies of less than 85% [16]. Ultimately, this is a ripe general area for research, but there is a dearth of studies considering prediction without the use of medical imaging data.

We also conducted a literature review for longitudinal data analysis, which is very relevant for the type of data typically associated with AD clinical studies. In our case, the longitudinal nature of the AD data results from the observation of subjects (patients) over time during sequential clinical visits. The difficulty in dealing with this data stems from the inconsistent number of observations across patients and the potentially correlated data within patients. The literature on longitudinal data analysis is extensive [17] [18]. Methods based on summary metrics or statistics [19] have been broadly used, where temporal measurements are summarized into key statistical descriptors (e.g., mean, mode, area under the curve). Another common solution for handling longitudinal data is to fit the data using some type of regression model. Regression models permit inference regarding the dynamic response over time and how this evolution varies with patient characteristics such as treatment assignment or other demographic factors. However, standard regression methods assume that all observations are independent, and this may produce invalid standard errors if the assumptions do not hold. For this reason, advanced regression methods such as Random-Coefficient Models [20] and General Regression Methods [21] have been proposed to overcome some of these limitations. Some of these models are very flexible in allowing for imbalanced data, missing values, differing numbers of time points from subject to subject, and unequal spacing of time point intervals within and across subjects. In addition, recent work has applied machine learning techniques such as Neural Networks [22] and Support Vector Classifiers [23] to longitudinal data. For the work presented here, summary metrics have been shown to provide great results in generating features for predictive modeling, with the added benefit of generating more parsimonious models.
For this study we have utilized two existing repositories for AD studies: ADNI [24] and AIBL [26].

The ADNI (Alzheimer's Disease Neuroimaging Initiative) dataset is an ongoing, longitudinal multi-site study that has been carried out since 2004. It has acquired data and specimens from 1,700 participants at 60 clinical sites across Canada and the United States. The study enrolls selected populations for future treatment, and the subjects include AD patients, mild cognitive impairment subjects, and elderly controls. The successes of the ADNI database include developing standardized methods, improving trial efficiency, and creating an infrastructure for sharing raw and processed data without embargo. The initiative is supported by $67 million in private and public sector donations. The initial phase of the study is known as ADNI1. In 2009, the second phase, ADNIGO, was started, containing 200 participants with Early Mild Cognitive Impairment. In 2011, the third phase, ADNI2, began with 150 participants with Late Mild Cognitive Impairment.

The AIBL (Australian Imaging, Biomarker & Lifestyle Flagship Study of Aging) dataset contains data from a 4.5-year longitudinal study of cognition which started in 2006. It is a large-scale cohort study containing 1,112 participants, conducted at two centers, Perth and Melbourne, in Australia. The study focuses on early detection, specifically in terms of lifestyle interventions. The AIBL data contains 211 AD patients, 133 MCI patients, and 768 healthy volunteers and follows the ADNI1 protocols for data collection. The available data include clinical and cognitive data, image data (extracted from MRI and PET scans), biomarker data including blood, genotype, and ApoE, and dietary and lifestyle data. These latter assessments examine participants' diet, exercise patterns, body composition, and sleep habits [26].

More detailed clinical descriptions of the ADNI and AIBL cohorts have been published previously in [25] and [26], respectively. It is worth noting that AIBL and ADNI have many of the same goals and are designed to identify the biomarkers, cognitive characteristics, and health and lifestyle factors that impact AD.

For this study we used 94 predictors that were reported in the ADNIMERGE table (a special dataset that merges key ADNI tables). These predictors provided specific information about patient demographics, genetics, blood biomarkers and cognitive tests from participants in the different longitudinal multi-center studies.
Model Procedure
We experiment with multiple different pre-processing techniques, feature selection methods, and machine learning models. The specific options we experimented with are delineated below.

• Data Pre-processing: Our data pre-processing includes generation of the labels, conversion of categorical variables, longitudinal data handling and imputation. First, we generated labels for our classification problem by merging features across the different data files in the ADNI dataset. We then excluded 14 subjects who were diagnosed as AD but were later diagnosed as either cognitively normal (CN) or mild cognitive impairment (MCI) in subsequent years (there is currently no way to reverse AD, so this indicates a mistake in the data). We next one-hot encoded all categorical variables and performed feature engineering. For all longitudinal features, we computed a series of summary metrics (mean, standard deviation, absolute changes and time intervals) for each patient that captured their temporal evolution along multiple clinical visits. Finally, we split the data into a training and test set and performed imputation. We investigated imputation by mean or mode (mean for numerical columns and mode for categorical columns), and k-Nearest Neighbors imputation.

• Feature Reduction and Selection: Our data is high-dimensional, so we experimented with two different feature reduction techniques: Singular Value Decomposition and Affinity Propagation. Singular Value Decomposition, or SVD [27], operates by combining information from several (likely) correlated vectors and forming basis vectors which explain most of the variance in the data and are guaranteed to be orthogonal in higher dimensional space. SVD and PCA (Principal Component Analysis, a very popular dimensionality reduction technique) are closely related. Affinity Propagation, or AP [28], on the other hand, is a relatively new clustering algorithm based on the concept of "message passing" between data points. Once we obtain clusters of features, we can compute the so-called exemplars (features that are good representatives of themselves and some other features). This approach provides an elegant feature selection technique. Notice that AP does not require the number of clusters to be determined or estimated before running the algorithm (in contrast to other clustering techniques such as k-means), although a user of this technique must still define some hyperparameters (e.g., preferences) that affect the resulting number of clusters.

• Supervised Learning: The supervised learning module performs five-fold cross-validation and grid search over the hyperparameters and model selection specified in the pipeline. The models implemented in our pipeline were Random Forest, Logistic Regression, k-Nearest Neighbors, Support Vector Machines (SVM), Multi-layer Perceptron (MLP), AdaBoost, Linear SVM, Gradient Boosting, and Decision Trees. More details on these methods can easily be found in the machine learning literature [29] [30]. Using the parameters that maximize recall on the validation set, the model predicts and outputs the labels for the test set.

• Model Evaluation: Finally, the predictions of our model are evaluated against the true labels using the following metrics: confusion matrix, accuracy, recall, precision, F1 score, and ROC curve. In this work, we focus on optimizing model accuracy (percentage of patients correctly labeled by our model) and recall (percentage of patients who develop AD that are correctly identified) [32].
In medical settings, like predicting AD development, it is often crucial to minimize false negatives, and therefore we try to optimize the recall of our models. However, our pipeline automatically computes the full suite of metrics for potential use in further exploration.

We evaluated every possible combination of model parameters (imputation, feature selection and model type) in order to quantify their predictive power. For each possible model, we assess the performance on fifty different random training and test set splits where the ratio of training to test data is 2:1. For these experiments, we use the entire ADNI dataset. We carry out five-fold cross-validation to pick the top four model/hyperparameter combinations that maximize recall on the validation set, and record metrics across one hundred random splits of the train and test set to demonstrate the consistency of our results. A simplified sketch of this end-to-end procedure is shown below.
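The following sketch illustrates this procedure in Python with pandas and scikit-learn. It is a minimal illustration rather than the exact study code: it assumes a per-visit DataFrame `visits` with a patient identifier `RID`, a baseline-relative time column `years_bl`, numeric cognitive-test columns, and a per-patient label Series `labels` (1 = develops AD); the summary metrics, the reduced hyperparameter grid, and the numeric-only imputation are deliberately simplified.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import KNNImputer, SimpleImputer
from sklearn.metrics import accuracy_score, recall_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline

# --- Longitudinal feature engineering: collapse per-visit rows into per-patient summaries ---
def summarize(visits, test_cols):
    grouped = visits.groupby("RID")[test_cols]
    feats = pd.concat(
        {
            "mean": grouped.mean(),
            "std": grouped.std(),
            # absolute change between the first and last recorded value
            "delta": grouped.last() - grouped.first(),
        },
        axis=1,
    )
    feats.columns = ["_".join(col) for col in feats.columns]
    # total time in study, from baseline to last visit
    feats["years_in_study"] = visits.groupby("RID")["years_bl"].max()
    return feats

test_cols = ["MMSE", "CDRSB", "ADAS11", "ADAS13", "FAQ"]  # illustrative subset
X = summarize(visits, test_cols)
y = labels.loc[X.index]

# 2:1 train/test split, as in the experiments described above
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=1 / 3)

# --- Imputation + classifier, tuned by five-fold CV to maximize recall ---
pipe = Pipeline([("impute", SimpleImputer()), ("clf", RandomForestClassifier())])
grid = {
    "impute": [SimpleImputer(strategy="mean"), KNNImputer(n_neighbors=5)],
    "clf__n_estimators": [1000, 1200],
}
search = GridSearchCV(pipe, grid, cv=5, scoring="recall").fit(X_train, y_train)

pred = search.predict(X_test)
print("accuracy:", accuracy_score(y_test, pred), "recall:", recall_score(y_test, pred))
```

In practice the grid would also sweep the other classifiers and feature selection options listed above; the structure of the pipeline is unchanged.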
We also investigated whether our model can generalize, how much longitudinal data we need before we can make accurate predictions, and whether we can produce a "lean" model that minimizes time/money cost while maintaining high accuracy and recall.
We also conducted a series of analyses in order to determine whether our model can generalize across different study protocols. We first explore the robustness of our top two models on the different sub-studies of ADNI. We train and test the model separately on ADNI1, ADNI2, and ADNIGO and record our four performance metrics. Next, we train a model on the ADNI1 dataset and then test separately on the ADNI2 dataset and the ADNIGO dataset. Likewise, we try all other pairs of sub-studies for the training and test sets. We also analyzed our model's performance on the AIBL dataset. Note that AIBL represents a completely different repository of patients from ADNI (i.e., different patients following different protocols), which gives us an excellent opportunity to validate our data-driven solutions for AD prediction. For this particular study we considered the handful of features that are common to both the AIBL and ADNI repositories: Age, Gender, APOE4 (a genetic test) and MMSE (a cognitive test). A sketch of the cross-study evaluation is given below.
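A minimal sketch of the sub-study evaluation follows, assuming a per-patient feature matrix `X`, labels `y`, and a Series `protocol` recording each patient's sub-study (ADNIMERGE exposes the originating protocol in a column named COLPROT); the feature construction is the same as in the earlier sketch.

```python
from itertools import permutations

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, recall_score

def cross_study_eval(X, y, protocol, studies=("ADNI1", "ADNI2", "ADNIGO")):
    """Train on one ADNI sub-study and test on another, for every ordered pair."""
    results = {}
    for train_study, test_study in permutations(studies, 2):
        tr = protocol == train_study
        te = protocol == test_study
        model = RandomForestClassifier(n_estimators=1000).fit(X[tr], y[tr])
        pred = model.predict(X[te])
        results[(train_study, test_study)] = {
            "accuracy": accuracy_score(y[te], pred),
            "recall": recall_score(y[te], pred),
        }
    return results
```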
Longitudinal Data Analysis
As described earlier, one of our goals is to propose a practical data-driven solution for AD prediction that can be used in clinical settings. For this purpose, it is important to understand how the longitudinal dimension of the data (e.g., number of visits) affects the performance of our predictions. To evaluate this, we trained our model assuming that the full history of the patients in the training set was available. Then we tried to predict the label for the test set patients with restricted information from a limited number of visits. This analysis aims to evaluate the number of medical visits necessary for obtaining a given performance in predicting AD. The relationship between the number of visits and total time of study (from baseline to last visit) was also considered in this analysis. A sketch of the visit-restriction step is given below.
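The sketch below shows one way to implement the visit restriction, reusing the `summarize` helper and the fitted `search` model from the earlier pipeline sketch; `test_visits` holds the per-visit records of the held-out patients, and the visit counts in the loop are illustrative.

```python
from sklearn.metrics import accuracy_score, recall_score

def restrict_visits(visits, max_visits):
    """Keep only the earliest `max_visits` clinical visits of each patient."""
    ordered = visits.sort_values(["RID", "years_bl"])
    return ordered.groupby("RID").head(max_visits)

# Train on full histories (done above), then evaluate the test patients with
# longitudinal features recomputed from a truncated visit history.
for k in (2, 4, 6, 8, 12, 16):
    X_test_k = summarize(restrict_visits(test_visits, k), test_cols)
    y_k = y_test.loc[X_test_k.index]
    pred = search.predict(X_test_k)
    print(k, accuracy_score(y_k, pred), recall_score(y_k, pred))
```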
Cost Analysis
To evaluate whether we can limit the cost for a patient while still providing an accurate prediction, we take two approaches. First, we research the time that patients spend on each test and then plot the features of our model against the time needed to obtain those features, with the features ordered by feature importance (as determined by the Random Forest algorithm).

Secondly, we utilize a meta-classification algorithm to produce models that have high accuracies and recalls while minimizing the testing time for the patient. We follow the approach of [33] and build a meta-model, a Decision Tree, that balances accuracy of prediction with time cost (for the patient). First we generate the meta-classification dataset. We group our features into twelve different categories: demographics (e.g., age, gender, education), APOE4 genetic marker data, information about the number of years since the baseline diagnosis, and nine different cognitive test features. We first produce the power set of the original twelve features. Then we take all sets in the power set that have size 1, 2 or 3 (a total of 298 sets to be considered). For each of these sets of features, we train a Random Forest model on the given features. The labels predicted by each of these 298 models become a feature in our meta-classification dataset. The meta-classification model is a decision tree which splits based on the algorithm designed by [33]. Instead of choosing a node to split on based on information gain alone, we choose the node $M_i$ that maximizes

$$\frac{\mathrm{InformationGain}(M_i)}{\mathrm{ExpectedCost}(M_i)},$$

where InformationGain is the Shannon information gain and the expected cost is defined as

$$\mathrm{ExpectedCost}(M_i) = P_L(M_i)\,\mathrm{Cost}(M_i) + \big(1 - P_L(M_i)\big)\Big[\sum_{v \in C_{M_i}} \sum_{j=i+1}^{m} P_L(M_j \mid M_i = v)\,\mathrm{Cost}(M_j \mid M_i = v)\Big] \quad \text{[33]}.$$

Ultimately, this splitting criterion balances information gain with the cost of using the given model. We use this algorithm to create a decision tree that balances feature cost and information gain, as sketched below.
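A simplified sketch of the cost-sensitive splitting criterion follows. It only scores candidate root splits and uses each candidate model's own testing time as its expected cost, omitting the conditional cost of deeper splits from the full criterion of [33]; the array names are illustrative.

```python
import numpy as np

def entropy(y):
    """Shannon entropy of a binary (0/1) label vector."""
    p = np.bincount(np.asarray(y, dtype=int), minlength=2) / len(y)
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

def information_gain(y, split_values):
    """Information gain from splitting the labels on one meta-feature."""
    y = np.asarray(y)
    split_values = np.asarray(split_values)
    gain = entropy(y)
    for v in np.unique(split_values):
        mask = split_values == v
        gain -= mask.mean() * entropy(y[mask])
    return gain

def best_root_split(meta_X, y, costs):
    """Choose the candidate model maximizing information gain per unit cost.

    meta_X : (n_patients, n_models) predicted labels of the 298 candidate models
    costs  : testing time in minutes needed to collect each model's features
    """
    scores = [information_gain(y, meta_X[:, i]) / costs[i]
              for i in range(meta_X.shape[1])]
    return int(np.argmax(scores))
```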
Fig 1.
ROC curves (50 train/test splits) for Random Forest with mean/mode imputation (mean for numerical features and mode for categorical), no feature selection, and 1,000 trees. Small labels over the curve show the different thresholds used to make the class predictions.
Results
Comparison of Models
For all the model parameter combinations, we determine that the best two models (with the highest recall on the validation set) are: 1. Random Forest with mean/mode imputation, no feature selection, and 1,000 trees, and 2. Random Forest with k-Nearest Neighbors imputation, no feature selection, and 1,200 trees. Table 1 shows the performance in terms of accuracy, recall, precision and F1 score. These scores are averaged across one hundred different random training and test set splits. Both of our top models glean an average accuracy greater than 92% and an average recall greater than 91%, with Model 1 having higher accuracy and Model 2 having higher recall. We produce an ROC curve for Model 1 which demonstrates the strong model performance (Figure 1). Further, we identify the features that are most important for Model 1 (Figure 2); a short sketch of how these importances are extracted is given after Table 1. As shown, the top five most important features are related to longitudinal metrics for CDRSB, FAQ, ADAS11, ADAS13, and MMSE, which are all cognitive tests.
Table 1.
Comparison of Metrics for Top 2 Models
            Accuracy    Recall    Precision    F1 Score
Model 1
Model 2
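The ranking in Figure 2 uses the Random Forest's impurity-based feature importances. A minimal sketch of extracting a top-25 ranking from a fitted model follows; `best_rf` (e.g., `search.best_estimator_.named_steps["clf"]` from the earlier pipeline sketch) and `feature_names` are assumed to be available.

```python
import numpy as np

# best_rf: a fitted RandomForestClassifier; feature_names: columns of the design matrix
importances = best_rf.feature_importances_
top25 = np.argsort(importances)[::-1][:25]
for rank, idx in enumerate(top25, start=1):
    print(f"{rank:2d}. {feature_names[idx]:<30} {importances[idx]:.3f}")
```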
Model Performance on ADNI sub-studies
We test Model 1 and Model 2 when trained/tested on ADNI1, ADNI2, and ADNIGO separately. The models based on ADNI1 data and ADNI2 data have similar accuracies and recalls compared to the models trained on the full ADNI dataset. The model trained solely on ADNIGO data has high accuracy but low recall, with high result variance. This is most likely a consequence of the different demographics of the sub-studies: only 10% of the ADNIGO patients developed AD, versus 30% and 50% of patients in ADNI2 and ADNI1, respectively. Further, ADNIGO only has 129 patients versus 789 in ADNI2 and 819 in ADNI1, making the training sample much smaller for ADNIGO.

Fig 2.
Feature Importances for Top 25 Features in Model 1

Finally, for each pair of sub-studies, we train on one study and test on the other to determine whether our model generalizes across the different sub-studies. The different sub-studies have slightly different protocols and patient demographics, so we explore whether the results of one study can accurately predict the AD status of the patients in another. We find greater than 90% recall when training on ADNI1 or ADNI2 and testing on the other two studies. However, when the model is trained on the ADNIGO dataset, the recall when testing on ADNI1 or ADNI2 plummets to approximately 14%. We expect this is due to the small number of Alzheimer's patients in the ADNIGO dataset and the small sample size. Ultimately, this analysis shows that our best models generalize well on different training and testing subsets of ADNI data, as long as there is a sufficiently large training set with a moderate number of Alzheimer's patients.
Validation of Model with AIBL Data
One of the stretch goals for this study was the evaluation of our predictive models on patients that are part of a completely different repository, such as AIBL. To this end, we considered a total of 861 patients from the AIBL database (note: for the ADNI repository we considered 1,737 patients). Since AIBL uses slightly different protocols, a direct merge of the two databases was not possible. Thus, we focused our analysis on a limited number of features that are consistently available for both ADNI and AIBL patients. Although small in size, this set of features includes the demographics of the patients (age and gender), their genetic characterization (APOE4) and their responses to a well-known cognitive test (MMSE). An important consideration is that the distribution of these features across the ADNI and AIBL populations is not exactly the same. The ADNI dataset has more men than women, and AIBL patients tend to be older than ADNI patients. The distributions of MMSE scores and APOE4 results look approximately the same between the two studies, after taking into account the difference in sample sizes. We conducted two different studies involving the AIBL data:
• AIBL Test 1: We merged all patients (i.e., ADNI and AIBL) into a single set and then split them (randomly) into training and test sets (without consideration of which repository patients belong to). From here we ran 100 different simulations using Model 1 (Random Forest with mean/mode imputation, no feature selection, and 1,000 trees).

• AIBL Test 2: In this study we trained a model using all ADNI patients and then assigned all AIBL patients to the test set. We ran 100 different simulations using Model 1.

Results for accuracy, recall, precision and F1 score for these two tests using AIBL data are summarized in Table 2. These results demonstrate that when merging ADNI and AIBL patients into one single study our predictive model still yields high accuracy (90.0%) and recall (85.0%). When training on ADNI patients and testing on AIBL patients, we also obtain high accuracy; however, we observe lower recall and precision. We expect that these results could be improved with domain adaptation and the introduction of more features common to both AIBL and ADNI. A sketch of the two evaluation setups is given below.
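This sketch assumes per-patient DataFrames `adni` and `aibl` that share the common feature columns listed here (the exact column names are illustrative) and corresponding label arrays `y_adni` and `y_aibl`.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, recall_score
from sklearn.model_selection import train_test_split

COMMON = ["AGE", "PTGENDER_Male", "APOE4", "MMSE_mean", "MMSE_std"]  # assumed names

def aibl_test1(adni, aibl, y_adni, y_aibl, n_runs=100):
    """Test 1: pool ADNI and AIBL patients, then split randomly into train/test."""
    X = pd.concat([adni[COMMON], aibl[COMMON]], ignore_index=True)
    y = np.concatenate([y_adni, y_aibl])
    scores = []
    for _ in range(n_runs):
        Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=1 / 3)
        pred = RandomForestClassifier(n_estimators=1000).fit(Xtr, ytr).predict(Xte)
        scores.append((accuracy_score(yte, pred), recall_score(yte, pred)))
    return np.mean(scores, axis=0)  # average accuracy and recall over the runs

def aibl_test2(adni, aibl, y_adni, y_aibl):
    """Test 2: train on all ADNI patients, evaluate on all AIBL patients."""
    model = RandomForestClassifier(n_estimators=1000).fit(adni[COMMON], y_adni)
    pred = model.predict(aibl[COMMON])
    return accuracy_score(y_aibl, pred), recall_score(y_aibl, pred)
```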
Table 2.
Metrics for models that consider AIBL data
              Accuracy    Recall    Precision    F1 Score
AIBL Test 1   90%         85%       82%          83%
AIBL Test 2   93%         80%       69%          74%
Longitudinal Data Analysis
Next, we investigated how the predictive power of our best models is affected by the number of visits available for each patient. The number of visits is strongly correlated with the length of time a patient has been in the study. We also investigate the distribution of the maximum number of years that ADNI patients have been studied (see Figure 3). This figure shows that approximately 80% of the ADNI patients have undergone 4 years or less of study under this protocol, while only 20% of the patients have undergone more than 4 years of study.

Using Model 1 (as used in Table 1) we ran 100 samples for random splits of the ADNI patients into training and validation sets. For the training set we assumed that the full history of visits was known. However, for patients in the test set, we fixed the maximum number of visits that could be used in computing the longitudinal features of the patients. The results, shown in Table 3, clearly indicate that the performance of the predictive model improves as the number of visits for test patients increases. This trend is expected, as additional visits provide more valuable information about the patient's evolution. The main contribution of Table 3 is the quantification of this trend. We see, for example, that achieving 89% accuracy requires at least 6 clinical visits for an average patient (meaning approximately 2.5 to 4.5 years of study). A guarantee of 85% recall on the prediction would require around 16 visits for the average patient (roughly equivalent to 8-10 years of study). Hence this table potentially provides a valuable tool for doctors and patients to understand the number of visits required in order to obtain predictions for AD with a satisfactory confidence level.

Fig 3.
Histogram for maximum number of years ADNI patients have been studied

Table 3.
Longitudinal Analysis: prediction metrics vs. number of visits

Years of study   Patients   Accuracy   Recall   Precision   F1 Score
[0.3 - 1]        574        79         65       79          70
[0.5 - 2]        550        86         75       88          81
[1.5 - 3]        494        87         78       88          83
[2.5 - 4.5]      371        89         81       86          83
[4 - 6]          221        89         83       84          84
[6 - 8]          81         89         80       89          84
[8 - 10]         45         90         85       89          87
Cost Analysis and Meta-Classification
When we plot the time needed to obtain the features for each successive model against the accuracy of these models, we see that after approximately 217 minutes of testing the accuracy of the models plateaus (Figure 4). Note that when a longitudinal metric, for example the mean of CDRSB, is added to a model that already contains another longitudinal metric for the same feature, e.g., the standard deviation of CDRSB, no extra testing time is required to include this feature. Therefore, the x-axis contains duplicated values.

Fig 4.
Accuracy with Successive Features Added to Model versus Time of Medical Tests

An example of a meta-classification Decision Tree is included in Figure 5, with the accuracy and recall scores averaged over fifty random training and test set splits to ensure robustness of the results. The accuracy, recall and time needed to perform the tests are provided for each level in the tree. Note that accuracy increases by level, but recall peaks at Level 1, most likely because this is a more generalizable model. Even with the Level 1 tree, trained on just three cognitive tests (CDRSB, ADAS13, and MOCA), which take a combined 1 hour and 27 minutes, our model is able to predict with better than 90% accuracy and recall. The CDRSB (Clinical Dementia Rating Sum of Boxes) takes approximately 30 minutes and is scored based on the results of an interview with the patient and the patient's caregiver. ADAS11 is one of the most popular cognitive tests for AD, consisting of a 45-minute written test containing 11 questions. Finally, MOCA (Montreal Cognitive Assessment) is a brief written test that takes approximately 12 minutes to complete. Note that CDRSB and ADAS11 rank in the top five most important features in Figure 2. The inclusion of MOCA is most likely a result of its low time cost.

Fig 5.
Example of Decision Tree produced by Meta-Classification. Node feature groups shown in the tree: CDRSB, ADAS13, MOCA; Demographics, ADAS11, MMSE; Demographics, MMSE, RAVLT; Demographics, MMSE; Demographics, ADAS13, ECogPt; CDRSB, FAQ, ECogSP; FAQ, ECogSP.

While we currently only consider the cost of features in terms of the patient's time, this model could easily incorporate the monetary cost of these tests as well. These results show that this type of meta-classification model can perform very well, suggesting its possible implementation as a data-driven diagnostic tool. Using this model, the patient and doctor can weigh whether the added specificity and sensitivity warrant the extra time and cost of a medical test.

We also considered the possibility of combining longitudinal analysis with meta-classification. For this analysis, we took the features in the Level 1 model (CDRSB, ADAS13 and MOCA) and ran an analysis similar to the one presented in the Longitudinal Data Analysis section. Results are presented in Table 4. Once again, this table quantifies the performance of the predictive models (in this case using only the 3 features of the Level 1 tree) for different numbers of clinical visits. Compared to Table 3, the number of visits required to achieve 89% accuracy using only these three features would be larger (16 vs. 6). Nevertheless, these results indicate that good performance for AD prediction can still be achieved using a limited number of visits and a reduced set of features.
Table 4.
Longitudinal Analysis: prediction metrics vs. number of visits using Level 1 tree features from the meta-classification analysis

Years of study   Patients   Accuracy   Recall   Precision   F1 Score
[0.3 - 1]        574        78         64       77          69
[0.5 - 2]        550        85         77       84          81
[1.5 - 3]        494        87         79       86          82
[2.5 - 4.5]      371        87         81       79          81
[4 - 6]          221        87         84       80          82
[6 - 8]          81         87         80       87          82
[8 - 10]         45         89         87       87          86

Discussion
Our work shows that it is possible to build a data-driven model that can confidently predict the risk of developing Alzheimer's in the future with levels of accuracy and recall that are above 90%. The necessary data for such a prediction are patient demographic information, a genetic test (APOE4 genotyping) and a battery of cognitive tests. We demonstrated that imaging data (MRI and PET scans), which are more costly in terms of time and money, are not necessary for highly accurate predictions. We also demonstrated how well our model generalizes by evaluating the model performance on different ADNI sub-studies (testing one against the others and quantifying model performance) and on a cohort of patients that belong to a completely different repository (AIBL). In all cases, our predictive models show very robust performance.

We carefully quantified the impact that the number of clinical visits of data available for a patient has on the predictive performance of our model. We also implemented a meta-classification technique to identify the combination of features that provides the optimal balance between model prediction and feature cost. In each case we have identified models that can still provide a high level of accuracy and recall. We believe our work provides the right framework for a practical deployment of an AD predictive tool in clinical settings. As an example, we have proposed a diagnostic protocol with only 3 tests and 4 clinical visits that can predict AD with 87% accuracy and 79% recall. Ultimately our model framework could be used by physicians and patients together to determine appropriate plans for diagnosis and monitoring of the risk of developing AD.

Any potential model to be deployed in real-world settings will have to perform well relative to a clinician. Based on the literature, physicians can diagnose Alzheimer's with 87% accuracy and 91% recall [34]. Our best models produce equivalent or better predictions relative to physicians for the harder problem of predicting future development of AD. Going forward, a parallel study of model prediction versus physician prediction would be necessary to validate the models and gain doctors' trust in this method. A limitation of this approach is that our training labels are provided by doctors. Those "true" labels carry some level of uncertainty, as AD is a difficult disease to diagnose in vivo. Our predictive models are ultimately only as good as the training data used to build them. Finally, it is important to recognize that this work has focused on proposing models that offer high predictive performance, with no consideration for the interpretability of these models. Expected new FDA regulations for CDS (Clinical Decision Support) software could incentivize the development of more interpretable models.
Acknowledgments
Data used in preparation of this article were obtained from the Alzheimer's Disease Neuroimaging Initiative (ADNI) database (adni.loni.usc.edu). As such, the investigators within ADNI contributed to the design and implementation of ADNI and/or provided data but did not participate in the analysis or writing of this report. A complete listing of ADNI investigators can be found at http://adni.loni.usc.edu/wp-content/uploads/how_to_apply/ADNI_Acknowledgement_List.pdf
References
1. Alzheimer's Foundation of America; 2016. Available from: .
2. Alzheimer's Association; 2016. Available from: .
3. National Institute of Aging; 2016. Available from: .
4. Alzheimer's Association. Beta-amyloid and the Amyloid Hypothesis; 2017. Available from: .
5. Mohandas E, Rajmohan V, Raghunath B. Neurobiology of Alzheimer's disease. Indian J Psychiatry. 2009. doi:10.4103/0019-5545.44908.
6. Fillit H; 2014. Available from: .
7. Sevigny J, Chiao P, Bussière T, Weinreb PH, Williams L, Maier M, et al. The antibody aducanumab reduces A-beta plaques in Alzheimer's disease. Nature. 2016;537(7618):50–56.
8. Biogen. Nature Publishes Results from Pre-Clinical Research and Phase 1b Study of Biogen's Investigational Alzheimer's Disease Treatment Aducanumab; 2016. Available from: http://media.biogen.com/press-release/corporate/nature-publishes-results-pre-clinical-research-and-phase-1b-study-biogens-in.
9. Regalado A; 2016. Available from: .
10. Rasmusson J; 2016. Available from: .
11. Klöppel S, Stonnington CM, Chu C, Draganski B, Scahill RI, Rohrer JD, et al. Automatic classification of MR scans in Alzheimer's disease. Brain. 2008;131(3):681–689.
12. Orimaye SO, Wong JS, Golden KJ, Wong CP, Soyiri IN. Predicting probable Alzheimer's disease using linguistic deficits and biomarkers. BMC Bioinformatics. 2017;18(1):34.
13. Hosseini-Asl E, Gimel'farb G, El-Baz A. Alzheimer's Disease Diagnostics by a Deeply Supervised Adaptable 3D Convolutional Network. arXiv preprint arXiv:1607.00556. 2016.
14. Korolev IO, Symonds LL, Bozoki AC, Alzheimer's Disease Neuroimaging Initiative, et al. Predicting Progression from Mild Cognitive Impairment to Alzheimer's Dementia Using Clinical, MRI, and Plasma Biomarkers via Probabilistic Pattern Classification. PLoS ONE. 2016;11(2):e0138866.
15. Devanand DP, Liu X, Tabert MH, Pradhaban G, Cuasay K, Bell K, et al. Combining early markers strongly predicts conversion from mild cognitive impairment to Alzheimer's disease. Biological Psychiatry. 2008;64(10):871–879.
16. Datta P, Shankle W, Pazzani M. Applying machine learning to an Alzheimer's database. In: Conference proceedings of the AAAI symposium; 1996.
17. Locascio JJ, Atri A. An overview of longitudinal data analysis methods for neurological research. Dementia and Geriatric Cognitive Disorders Extra. 2011;1(1):330–357. doi:10.1159/000330228.
18. van Belle G, Fisher LD, Heagerty PJ, Lumley T. Biostatistics: A Methodology for the Health Sciences. Wiley Series in Probability and Statistics. Wiley; 2004. Available from: https://books.google.com/books?id=KSh8IOrLPzwC.
19. Fitzmaurice G, Davidian M, Verbeke G, Molenberghs G. Longitudinal Data Analysis. Chapman & Hall/CRC Handbooks of Modern Statistical Methods. CRC Press; 2008. Available from: https://books.google.es/books?id=zVBjCvQCoGQC.
20. Rutter CM, Elashoff RM. Analysis of longitudinal data: Random coefficient regression modelling. Statistics in Medicine. 1994;13(12):1211–1231. doi:10.1002/sim.4780131204.
21. Fitzmaurice GM, Laird NM, Ware JH. Applied Longitudinal Analysis. Wiley Series in Probability and Statistics. Wiley; 2011. Available from: https://books.google.com/books?id=qOmxRtdNJpEC.
22. Tandon R, Adak S, Kaye JA. Neural Networks for Longitudinal Studies in Alzheimer's Disease. Artif Intell Med. 2006;36(3):245–255. doi:10.1016/j.artmed.2005.10.007.
23. Chen S, Bowman FD; 2011.
24. Mueller SG, Weiner MW, Thal LJ, Petersen RC, Jack C, Jagust W, et al. The Alzheimer's disease neuroimaging initiative. Neuroimaging Clinics of North America. 2005;15(4):869–877.
25. Petersen RC, Aisen PS, Beckett LA. Alzheimer's Disease Neuroimaging Initiative (ADNI). Neurology. 2010;74(3):201–209. doi:10.1212/WNL.0b013e3181cb3e25.
26. Ellis KA, Bush AI, Darby D, De Fazio D, Foster J, Hudson P, et al. The Australian Imaging, Biomarkers and Lifestyle (AIBL) study of aging: methodology and baseline characteristics of 1112 individuals recruited for a longitudinal study of Alzheimer's disease. International Psychogeriatrics. 2009;21(04):672–687.
27. Wall ME, Rechtsteiner A, Rocha LM. Singular value decomposition and principal component analysis. A Practical Approach to Microarray Data Analysis. 2003; p. 91–109.
28. Frey BJ, Dueck D. Clustering by passing messages between data points. Science. 2007;315.
29. Bishop CM. Pattern Recognition and Machine Learning (Information Science and Statistics). Secaucus, NJ, USA: Springer-Verlag New York, Inc.; 2006.
30. Duda RO, Hart PE, Stork DG. Pattern Classification (2nd Edition). Wiley-Interscience; 2000.
31. Kuhn M, Johnson K. Applied Predictive Modeling. Springer New York; 2013. Available from: https://books.google.com/books?id=xYRDAAAAQBAJ.
32. Descoins A. Why accuracy alone is a bad measure for classification tasks, and what we can do about it; 2013. Available from: https://tryolabs.com/blog/2013/03/25/why-accuracy-alone-bad-measure-classification-tasks-and-what-we-can-do-about-it/.
33. Pichara K, Protopapas P, León D. Meta-classification for variable stars. The Astrophysical Journal. 2016;819(1):18.
34. Mok W, Chow T, Zheng L, Mack W, Miller C. Clinicopathological concordance of dementia diagnoses by community versus tertiary care clinicians. American Journal of Alzheimer's Disease & Other Dementias.