Multimodal Inductive Transfer Learning for Detection of Alzheimer's Dementia and its Severity

Utkarsh Sarawgi*, Wazeer Zulfikar*, Nouran Soliman, Pattie Maes
Massachusetts Institute of Technology
{utkarshs, wazeer, nouran, pattie}@mit.edu
* Equal contribution

Abstract
Alzheimer's disease is estimated to affect around 50 million people worldwide and is rising rapidly, with a global economic burden of nearly a trillion dollars. This calls for scalable, cost-effective, and robust methods for the detection of Alzheimer's dementia (AD). We present a novel architecture that leverages acoustic, cognitive, and linguistic features to form a multimodal ensemble system. It uses specialized artificial neural networks with temporal characteristics to detect AD and its severity, which is reflected through Mini-Mental State Exam (MMSE) scores. We first evaluate it on the ADReSS challenge dataset, which is a subject-independent and balanced dataset matched for age and gender to mitigate biases, and is available through DementiaBank. Our system achieves state-of-the-art test accuracy, precision, recall, and F1-score of 83.3% each for AD classification, and a state-of-the-art test root mean squared error (RMSE) of 4.60 for MMSE score regression. To the best of our knowledge, the system further achieves state-of-the-art AD classification accuracy of 88.0% when evaluated on the full benchmark DementiaBank Pitt database. Our work highlights the applicability and transferability of spontaneous speech to produce a robust inductive transfer learning model, and demonstrates generalizability through a task-agnostic feature space. The source code is available at https://github.com/wazeerzulfikar/alzheimers-dementia
Index Terms: Alzheimer's Dementia Detection, Affective Computing, Human-Computer Interaction, Computational Paralinguistics, Machine Learning, Speech Processing
1. Introduction
Alzheimer's disease is a progressive disorder that causes brain cells to degenerate and is the most common cause of dementia worldwide. It mainly causes cognitive and behavioural deterioration of patients [1], which is reflected through memory loss, language impairment [2], and a decreased ability to express their needs. This in turn affects their quality of life, prognosis, and social relationships. Consequently, it has been imposing increased health risks [3] and a significant financial burden on patients, caregivers, families, and healthcare institutions [4]. The number of people with dementia worldwide was estimated at 47.47 million in 2015 and is projected to reach 135.46 million in 2050 [5]. At the time of writing this paper, someone in the U.S. develops Alzheimer's disease every 66 seconds, and by 2050 this is projected to be every 33 seconds [6]. According to the World Health Organization, the global economic burden is nearly a trillion dollars, which amounts to 1.1% of the global GDP [7], with 63% of people with dementia living in low- and middle-income countries [8]. In this work, we aim to take a significant step towards more reliable, cost-effective, scalable, and noninvasive technologies to detect the onset of Alzheimer's dementia (AD) and predict Mini-Mental State Exam [9] scores to estimate its severity.

Dementia can be strongly characterized by cognitive degeneration leading to language impairment, which primarily occurs due to decline in the semantic and pragmatic levels of language processing [10]. It has been widely reported that AD can be detected more sensitively with the help of a linguistic analysis than with other cognitive examinations [11], and also long before the diagnosis is medically confirmed [12]. The temporal characteristics of spontaneous speech, such as speech tempo, the number of pauses in speech, and their length, are sensitive detectors of the early stage of the disease [13, 14, 15, 16, 17].
Given the relative ease of collecting balanced and representative data of spontaneous speech and their corresponding transcriptions, they can be utilized for early and robust predictions of the onset of AD. Consequently, our research work:
1. Presents a novel architecture comprising domain-specific feature engineering and artificial neural networks for Alzheimer's Dementia (AD) detection and its severity through classification and MMSE score regression (Section 3).
2. Evaluates the system in a subject-independent setting with a carefully curated balanced and stratified dataset matched for age and gender, to help minimize common biases in the tasks (Section 3.1).
3. Achieves state-of-the-art test accuracy, precision, recall, and F1-score for AD classification, and state-of-the-art test RMSE for MMSE score predictions on the ADReSS (Alzheimer's Dementia Recognition through Spontaneous Speech) dataset. To the best of our knowledge, the system further achieves state-of-the-art AD classification accuracy when evaluated on the full benchmark DementiaBank Pitt database (Sections 4 and 5).
4. Spans a multimodal feature space to increase generalizability and robustness, and uses ensemble mechanisms to leverage individual feature sets and model performances.
5. Reflects upon the transferability and interdependence of the two tasks of AD classification and MMSE regression.
2. Related work
Many current AD detection studies use medical imaging [18, 19, 20] with deep neural networks and random forests. Several studies claim that AD can be sensitively detected in early stages by linguistic analysis, which leverages speech and language features to train machine learning models for the detection of AD [13, 14, 15, 16, 17, 21].

In study [22], machine learning methods based on image description were used, reaching an accuracy of 75% on a limited number of subjects enrolled in a longitudinal study. Study [23] used logistic regression trained with spectrogram features extracted from audio files, reaching accuracies of 83.3% and 84.4% on the VBSD and Dem@Care datasets respectively. The data used in each of the above works are limited to around 32 to 36 subjects and are highly imbalanced between the classes and across age and gender. In study [14], different traditional classification algorithms, like logistic regression, SVM, and more, were used to learn speech parameters from dialogues in the Carolina Conversations Collection. The best of their solutions reached 86.5% leave-one-out cross-validation (LOOCV) accuracy with 38 subjects. Works based on data extracted from DementiaBank have reported scores of around 0.87, 0.85, 0.82, 0.80, 0.79, 0.64, and 0.62 [24, 25, 13, 26, 27, 28, 29] for AD classification. Study [30] used speech-related features to get a mean absolute error (MAE) of 3.83 for MMSE scores with longitudinal data derived from DementiaBank. While a number of works have proposed speech and language based approaches to AD recognition through speech, their studies have used different, often unbalanced and acoustically varied datasets, thereby introducing bias and hindering generalization, reproducibility, and comparability of the proposed approaches.
3. Methods and materials
3.1. Data

The DementiaBank Pitt database [31] consists of speech recordings and transcripts of spoken picture descriptions elicited from participants through the Cookie Theft picture from the Boston Diagnostic Aphasia Exam [32]. The database consists of multiple samples per subject corresponding to multiple visits. The full database contains 242 speech samples from 99 healthy control subjects and 255 speech samples from 168 AD subjects. The dataset also provides Mini-Mental Status Examination (MMSE) scores of the subjects, ranging from 0 to 30, which offer a way to quantify cognitive function and screen for cognitive loss by testing the individuals' orientation, attention, calculation, recall, language, and motor skills [9]. A 10-fold cross-validation was used on this database for fair comparison with previously reported results.

The ADReSS Challenge Dataset [29] is a balanced subset consisting of 156 speech samples, each from a unique subject, matched for age and gender and evenly spread across the two classes, AD and non-AD. A stratified train-test split of around 70-30 (108 and 48 subjects) for this dataset was provided by the challenge. The test set was held out for all experimentation until final evaluation. Any cross-validation mentioned in the paper refers to cross-validation using the train split. Normalized speech segments are also provided, but we only use the full audio samples. The MMSE scores provided are used as labels for the regression task.

We first evaluate on the balanced ADReSS dataset and then extend the evaluation to the full DementiaBank Pitt database.
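The two evaluation protocols used in this work, k-fold cross-validation (10-fold on Pitt, 5-fold on ADReSS) and leave-one-out, can be sketched with scikit-learn. The feature matrix and labels below are random stand-ins for illustration only, not the paper's extracted features.

```python
# Sketch of the evaluation protocols described above, with toy data.
import numpy as np
from sklearn.model_selection import StratifiedKFold, LeaveOneOut

rng = np.random.default_rng(0)
X = rng.normal(size=(108, 11))   # e.g. the 108 ADReSS train subjects, toy features
y = np.array([0, 1] * 54)        # balanced AD / non-AD labels

# Stratified k-fold keeps the AD / non-AD balance within every fold
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
folds = list(cv.split(X, y))

# LOOCV: with one sample per subject, this is leave-one-subject-out (LOSO)
loso = list(LeaveOneOut().split(X))
```

Because every ADReSS data point is a unique subject, the leave-one-out splitter already guarantees subject independence between train and validation.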
3.2. Feature engineering

People with dementia show symptoms of cognitive decline, and impairment in memory, communication, and thinking [17]. To include such domain knowledge and context, our system extracts cognitive and acoustic features using three different strategies, which are then prepared and fed into their respective neural models. Similarly extracted features have been repeatedly used to propose speech-recognition-based solutions for automated detection of mild cognitive impairment from spontaneous speech [33, 17]. The following features were extracted upon exploring the data to find the most descriptive set of correlated features for detecting AD and its severity:
• Disfluency: A set of 11 distinct and carefully curated features from the transcripts, like word rate, intervention rate, and different kinds of pause rates, reflecting upon speech impediments like slurring and stuttering. These are normalized by the respective audio lengths and scaled thereafter.
• Acoustic: The ComParE 2013 feature set [34] was extracted from the audio samples using the open-sourced openSMILE v2.1 toolkit, widely used for affect analyses in speech [35]. This provides a total of 6,373 features that include energy, MFCC, and voicing-related low-level descriptors (LLDs), and other statistical functionals. This feature set encodes changes in the speech of a person and has been used as an important noninvasive marker for AD detection [36, 29]. Our system standardizes this set of features using z-score normalization, and uses principal component analysis (PCA) to project the 6,373 features onto a low-dimensional space of 21 orthogonal features with the highest variance. The number of orthogonal features was selected by analyzing the percentage of variance explained by each of the components.
• Interventions: Cognitive features reflect upon potential loss of train of thought and context. Our system extracts the sequence of speakers from the transcripts, categorizing each turn as the subject or the interviewer. To accommodate the variable length of these sequences, they are padded or truncated to a length of 32 steps, found upon analyses and tuning of sequence lengths.

We evaluated each of these features individually and in a combined fashion to highlight the different configurations and compare their performances.
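The three feature preparations above can be sketched end to end. Everything here is an illustrative stand-in under stated assumptions: the disfluency formulas, the random ComParE-sized vectors, and the 0/1 speaker encoding are ours, not the authors' released code.

```python
# Sketch of the three feature-preparation strategies described above.
# All inputs and encodings are illustrative assumptions based on the text.
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline

# 1) Disfluency-style features, normalized by audio length (per-second rates)
def disfluency_features(transcript, n_pauses, audio_len_s):
    return {
        "word_rate": len(transcript.split()) / audio_len_s,
        "pause_rate": n_pauses / audio_len_s,
    }

feats = disfluency_features("the boy is taking the cookie", n_pauses=3, audio_len_s=10.0)

# 2) Acoustic: z-score normalization followed by PCA down to 21 components
rng = np.random.default_rng(0)
X_acoustic = rng.normal(size=(108, 6373))        # one ComParE-sized vector per subject
acoustic_pipeline = make_pipeline(StandardScaler(), PCA(n_components=21))
X_low = acoustic_pipeline.fit_transform(X_acoustic)
explained = acoustic_pipeline.named_steps["pca"].explained_variance_ratio_

# 3) Interventions: encode speaker turns (0 = subject, 1 = interviewer),
#    then pad or truncate every sequence to 32 steps
def pad_or_truncate(seq, length=32, pad_value=0):
    seq = seq[:length]                                # truncate long sequences
    return seq + [pad_value] * (length - len(seq))    # pad short ones

fixed = pad_or_truncate([0, 1, 0, 0, 1])
```

Inspecting `explained` (the per-component explained-variance ratio) is how one would pick the number of retained components, as the paper describes.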
3.3. Models

Figure 1 - (1), (2), and (3) illustrate the architectures of the disfluency, acoustic, and interventions models respectively. The disfluency model is a multi-layer perceptron (MLP) that projects the 11-feature input to a higher-dimensional space for better separability of the binary classes. The acoustic model is an MLP with a single hidden layer that adds non-linearity and regularizes the PCA-decomposed feature space. The interventions model uses a recurrent architecture to learn the temporal relations from the sequence of interventions. These models were trained with the corresponding inputs obtained upon feature engineering (Section 3.2), and one-hot encoded binary class labels.

To leverage the features learnt from classification for regression, transfer learning was done on the trained classification models. The regression module, as shown in Figure 1 - (4), replaced the terminal output layer in the models, and the remaining original layers were frozen. The resultant models were then trained with MMSE scores as labels.

A 5-fold cross-validation setting was adopted for evaluation. The models were also evaluated in a leave-one-out cross-validation (LOOCV) setting, which in the case of the ADReSS dataset is equivalent to leave-one-subject-out cross-validation (LOSO) since each data point is an independent subject. Each training run used a batch size of 8 and the Adam optimizer, with a learning rate of 0.01 to minimize categorical cross-entropy loss for classification, and a learning rate of 0.001 to minimize mean squared error loss for regression. The best models were saved by monitoring the validation loss in each fold.

To leverage all sets of features and models together, a parallel ensemble was performed using the outputs of the three models for each of the two tasks independently.

Figure 1: Architecture of (1) Disfluency, (2) Acoustic, (3) Interventions models, and (4) Regression module.

We experimented with three kinds of ensemble modules for classification:
• Hard: A majority vote was taken between the predictions of the three individual models.
• Soft: To leverage the confidence of the predictions, a weighted sum of the class probabilities was computed for the final decision. The weight used was 1/N, where N is the total number of models.

• Learnt: Instead of weighing the confidence of all the models equally as in soft voting above, we used a logistic regression to learn the weights. A logistic regression voter was trained using class probabilities as inputs.

For regression, the predictions of all the individual models were averaged by the ensemble module.
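The three ensemble modules can be sketched as follows. The per-model class probabilities below are illustrative stand-ins, and training the learnt voter on its own inputs is only for demonstration; in the paper it would be fit on held-out fold outputs.

```python
# Sketch of the hard, soft (1/N-weighted), and learnt ensemble modules.
import numpy as np
from sklearn.linear_model import LogisticRegression

# AD (class 1) probabilities from the three models, for 4 toy samples
probs = np.array([
    [0.9, 0.6, 0.4],
    [0.2, 0.3, 0.6],
    [0.8, 0.7, 0.9],
    [0.4, 0.4, 0.2],
])

# Hard: majority vote over the three per-model decisions
hard = (np.sum(probs > 0.5, axis=1) > 1).astype(int)

# Soft: equal weights 1/N on the class probabilities, then threshold
soft = (probs.mean(axis=1) > 0.5).astype(int)

# Learnt: a logistic regression voter learns the weights from probabilities
y = np.array([1, 0, 1, 0])
voter = LogisticRegression().fit(probs, y)
learnt = voter.predict(probs)
```

For regression, the analogous step is simply `probs`-style per-model MMSE predictions averaged with `np.mean(..., axis=1)`.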
4. Results
The results of the experiments were recorded using a combination of accuracy, precision, recall, and F1-score for classification, and root mean squared error (RMSE) for regression.
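These metrics are all standard and available in scikit-learn; the toy labels and predictions below are illustrative only, not the paper's results.

```python
# Computing the reported metric set on toy predictions.
import numpy as np
from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, f1_score, mean_squared_error)

# Classification metrics (AD = 1, non-AD = 0)
y_true = [1, 0, 1, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 1]
acc = accuracy_score(y_true, y_pred)
prec = precision_score(y_true, y_pred)
rec = recall_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred)

# Regression metric: RMSE over toy MMSE scores (0-30 scale)
mmse_true = np.array([22.0, 28.0, 15.0])
mmse_pred = np.array([20.0, 27.0, 18.0])
rmse = float(np.sqrt(mean_squared_error(mmse_true, mmse_pred)))
```

Taking the square root of `mean_squared_error` keeps the sketch compatible across scikit-learn versions.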
Table 1 shows the 5-fold cross-validation results for the classification task. The individual features achieved competitive performance, although the acoustic model slightly overfits while the interventions model marginally underfits on the data. The ensemble model counteracted these and achieved an increased 5-fold mean training as well as validation accuracy with comparable variance. The low variance generally observed across all runs signifies high model stability across folds.

Table 1: [5-fold cross-validation train and validation accuracy (classification) and RMSE (regression) for the Disfluency, Acoustic, Interventions, and Ensemble models; the mean ± standard deviation values are not recoverable from the source.]

Table 2: [5-fold cross-validation train and validation accuracy for the Hard, Soft, and Learnt ensemble types; the mean ± standard deviation values are not recoverable from the source.]

Figure 2: Receiver Operating Characteristic for Disfluency, Acoustic, and Interventions models, cumulatively calculated over the validation splits of all the folds of 5-fold cross-validation.
Table 3: Baseline comparison for AD classification. Our test results below correspond to the hard ensemble model.

Model               Accuracy  Precision  Recall  F1-score
LOSO
Luz et al. [29]     0.77      0.77       0.76    0.77
Ensemble (ours)     —         —          —       —
TEST
Luz et al. [29]     0.75      0.83       0.62    0.71
Ensemble (ours)     0.83      0.83       0.83    0.83

Table 4: Baseline comparison for MMSE score regression. Our test results correspond to the regression ensemble.

Model               RMSE
LOSO
Luz et al. [29]     4.38
Ensemble (ours)     —
TEST
Luz et al. [29]     5.20
Ensemble (ours)     4.60

[Our LOSO values in Tables 3 and 4 are not recoverable from the source; the test values shown are the 83.3% and 4.60 reported in the abstract.]

Figure 3: Confusion matrices for the hard ensemble classification model (1) cumulatively calculated over the validation splits of all the folds of LOOCV and (2) 5-fold cross-validation, and (3) calculated on the held-out test set.
The same AD classification models were retrained on the DementiaBank Pitt database, and a 10-fold cross-validation was performed for fair comparison with previously reported results. To the best of our knowledge, our hard ensemble model achieves a state-of-the-art accuracy of 0.88.

Table 5: Comparison of AD classification on the DementiaBank Pitt database. All are 10-fold cross-validation results. Our results below correspond to the hard ensemble model.

Model               Accuracy  Precision  Recall  F1-score
Fraser et al. [13]  0.82      —          —       —
Masrani [25]        0.85      —          —       0.85
Kong et al. [24]    0.87      0.86       —       —
Ensemble (ours)     0.88      —          —       —

[Values marked — are not recoverable from the source.]
5. Discussion and Future Work
There has been substantial work using the spontaneous speech samples and manual transcriptions present in the DementiaBank dataset [31]. Some of the highest reported scores for AD classification are 0.87, 0.85, 0.82, 0.80, 0.79, 0.64, and 0.63 [24, 25, 13, 26, 27, 28, 29]. Many of these previous results were obtained on datasets with variable subject dependencies. In such datasets, a data point corresponds to a session, and there can exist multiple sessions per subject. Given the subject-independent setting of the ADReSS dataset, our LOSO method clearly distinguishes the left-out test subject. Hence, the near perfect LOSO results on classification and regression (Tables 3 and 4) demonstrate that every subject individually can be correctly evaluated with the engineered features. Furthermore, almost all previous results are reported using cross-validation, whereas our work is evaluated on a designated held-out test set as well. This helps overcome 'validation overfitting', to which small dataset settings are prone.

Study [30] used speech-related features to obtain a cross-validated mean absolute error (MAE) of 3.83 for MMSE scores with data derived from DementiaBank. Our ensemble regression model recorded a cross-validated MAE of 3.01 on the ADReSS dataset.

Through considerable improvements in both AD classification and MMSE score regression by employing an ensemble of independent models extracting acoustic and cognitive features, our work reveals the potential of multimodal analysis and its applicability to an age- and gender-balanced subject-independent dataset. Future work would include incorporating automated transcription of speech samples in our system. The continuous range of the MMSE scores can provide more insights into the progression of dementia. This can further be leveraged for risk stratification and for analyzing potential causal relationships modelling AD with its symptoms and markers, through a longitudinal dataset.
6. Conclusion
We present a novel architecture that uses domain knowledge for inductive transfer learning for AD classification and MMSE score regression. Our work achieves state-of-the-art accuracy, precision, recall, and F1-score of 83.3% each for AD classification, and a state-of-the-art RMSE of 4.60 for MMSE predictions on the designated held-out test set of the ADReSS challenge. To the best of our knowledge, the system further achieves state-of-the-art AD classification accuracy of 88.0% when evaluated on the full benchmark DementiaBank Pitt database. Our system spans a multimodal feature space to increase generalization and robustness. We aim to extend our work by adding automated transcription, further textual analysis, and personalized context through longitudinal data.

7. References

[1] J. G. Molinuevo, "Role of biomarkers in the early diagnosis of Alzheimer's disease," Revista Española de Geriatría y Gerontología, vol. 46, pp. 39–41, 2011.
[2] L. M. V. Escobar and N. P. Afanador, "Calidad de vida del cuidador familiar y dependencia del paciente con Alzheimer," Avances en Enfermería, vol. 28, no. 1, pp. 116–128, 2010.
[3] R. Schulz and S. R. Beach, "Caregiving as a risk factor for mortality: the caregiver health effects study," JAMA, vol. 282, no. 23, pp. 2215–2219, 1999.
[4] J. M. Atance, A. I. Yusta, and B. G. Grupeli, "Costs study in Alzheimer's disease," Revista Clínica Española, vol. 204, no. 2, pp. 64–69, 2004.
[5] M. Prince, R. Bryce, E. Albanese, A. Wimo, W. Ribeiro, and C. P. Ferri, "The global prevalence of dementia: a systematic review and metaanalysis," Alzheimer's & Dementia, vol. 9, no. 1, pp. 63–75, 2013.
[6] Alzheimer's Association et al., "2016 Alzheimer's disease facts and figures," Alzheimer's & Dementia, vol. 12, no. 4, pp. 459–509, 2016.
[7] World Health Organization et al., "The top 10 causes of death," Fact sheet no. 310, 2017.
[8] World Health Organization, "The epidemiology and impact of dementia: current state and future trends," Geneva, Switzerland: World Health Organization, 2015.
[9] T. N. Tombaugh and N. J. McIntyre, "The mini-mental state examination: a comprehensive review," Journal of the American Geriatrics Society, vol. 40, no. 9, pp. 922–935, 1992.
[10] S. H. Ferris and M. Farlow, "Language impairment in Alzheimer's disease and benefits of acetylcholinesterase inhibitors," Clinical Interventions in Aging, vol. 8, p. 1007, 2013.
[11] G. Szatloczki, I. Hoffmann, V. Vincze, J. Kalman, and M. Pakaski, "Speaking in Alzheimer's disease, is that an early sign? Importance of changes in language abilities in Alzheimer's disease," Frontiers in Aging Neuroscience, vol. 7, p. 195, 2015.
[12] M. Mesulam, A. Wicklund, N. Johnson, E. Rogalski, G. C. Léger, A. Rademaker, S. Weintraub, and E. H. Bigio, "Alzheimer and frontotemporal pathology in subsets of primary progressive aphasia," Annals of Neurology: Official Journal of the American Neurological Association and the Child Neurology Society, vol. 63, no. 6, pp. 709–719, 2008.
[13] K. C. Fraser, J. A. Meltzer, and F. Rudzicz, "Linguistic features identify Alzheimer's disease in narrative speech," Journal of Alzheimer's Disease, vol. 49, no. 2, pp. 407–422, 2016.
[14] S. Luz, S. de la Fuente, and P. Albert, "A method for analysis of patient speech in dialogue for dementia detection," arXiv preprint arXiv:1811.09919, 2018.
[15] B. Mirheidari, D. Blackburn, T. Walker, A. Venneri, M. Reuber, and H. Christensen, "Detecting signs of dementia using word vector representations," in Interspeech, 2018, pp. 1893–1897.
[16] F. Haider, S. de la Fuente, and S. Luz, "An assessment of paralinguistic acoustic features for detection of Alzheimer's dementia in spontaneous speech," IEEE Journal of Selected Topics in Signal Processing, 2019.
[17] M. L. B. Pulido, J. B. A. Hernández, M. Á. F. Ballester, C. M. T. González, J. Mekyska, and Z. Smékal, "Alzheimer's disease and automatic speech analysis: a review," Expert Systems with Applications, p. 113213, 2020.
[18] D. Lu, K. Popuri, G. W. Ding, R. Balachandar, and M. F. Beg, "Multimodal and multiscale deep neural networks for the early diagnosis of Alzheimer's disease using structural MR and FDG-PET images," Scientific Reports, vol. 8, no. 1, pp. 1–13, 2018.
[19] A. Ortiz, F. Lozano, J. M. Gorriz, J. Ramirez, F. J. Martinez-Murcia, A. D. N. Initiative et al., "Discriminative sparse features for Alzheimer's disease diagnosis using multimodal image data," Current Alzheimer Research, vol. 15, no. 1, pp. 67–79, 2018.
[20] S. Sarraf and G. Tofighi, "Deep learning-based pipeline to recognize Alzheimer's disease using fMRI data," IEEE, 2016, pp. 816–820.
[21] F. Di Palo and N. Parde, "Enriching neural models with targeted features for dementia detection," arXiv preprint arXiv:1906.05483, 2019.
[22] V. Rentoumi, L. Raoufian, S. Ahmed, C. A. de Jager, and P. Garrard, "Features and machine learning classification of connected speech samples from patients with autopsy proven Alzheimer's disease with and without additional vascular pathology," Journal of Alzheimer's Disease, vol. 42, no. s3, pp. S3–S17, 2014.
[23] L. Liu, S. Zhao, H. Chen, and A. Wang, "A new machine learning method for identifying Alzheimer's disease," Simulation Modelling Practice and Theory, vol. 99, p. 102023, 2020.
[24] W. Kong, H. Jang, G. Carenini, and T. Field, "A neural model for predicting dementia from language," in Machine Learning for Healthcare Conference, 2019, pp. 270–286.
[25] V. Masrani, "Detecting dementia from written and spoken language," Ph.D. dissertation, University of British Columbia, 2018.
[26] M. Yancheva and F. Rudzicz, "Vector-space topic models for detecting Alzheimer's disease," in Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2016, pp. 2337–2346.
[27] L. Hernández-Domínguez, S. Ratté, G. Sierra-Martínez, and A. Roche-Bergua, "Computer-based evaluation of Alzheimer's disease and mild cognitive impairment patients during a picture description task," Alzheimer's & Dementia: Diagnosis, Assessment & Disease Monitoring, vol. 10, pp. 260–268, 2018.
[28] S. Luz, "Longitudinal monitoring and detection of Alzheimer's type dementia from spontaneous speech data," IEEE, 2017, pp. 45–46.
[29] S. Luz, F. Haider, S. de la Fuente, D. Fromm, and B. MacWhinney, "Alzheimer's dementia recognition through spontaneous speech: The ADReSS challenge," arXiv preprint arXiv:2004.06833, 2020.
[30] M. Yancheva, K. C. Fraser, and F. Rudzicz, "Using linguistic features longitudinally to predict clinical scores for Alzheimer's disease and related dementias," in Proceedings of SLPAT 2015: 6th Workshop on Speech and Language Processing for Assistive Technologies, 2015, pp. 134–139.
[31] J. T. Becker, F. Boiler, O. L. Lopez, J. Saxton, and K. L. McGonigle, "The natural history of Alzheimer's disease: description of study cohort and accuracy of diagnosis," Archives of Neurology, vol. 51, no. 6, pp. 585–594, 1994.
[32] H. Goodglass, E. Kaplan, and B. Barresi, BDAE-3: Boston Diagnostic Aphasia Examination, Third Edition. Lippincott Williams & Wilkins, Philadelphia, PA, 2001.
[33] L. Tóth, I. Hoffmann, G. Gosztolya, V. Vincze, G. Szatlóczki, Z. Bánréti, M. Pákáski, and J. Kálmán, "A speech recognition-based solution for the automatic detection of mild cognitive impairment from spontaneous speech," Current Alzheimer Research, vol. 15, no. 2, pp. 130–138, 2018.
[34] F. Eyben, F. Weninger, F. Gross, and B. Schuller, "Recent developments in openSMILE, the Munich open-source multimedia feature extractor," in Proceedings of the 21st ACM International Conference on Multimedia, 2013, pp. 835–838.
[35] F. Eyben, M. Wöllmer, and B. Schuller, "openSMILE: the Munich versatile and fast open-source audio feature extractor," in Proceedings of the 18th ACM International Conference on Multimedia, 2010, pp. 1459–1462.
[36] K. Lopez-de-Ipiña, J. B. Alonso, J. Solé-Casals, N. Barroso, M. Faundez-Zanuy, M. Ecay-Torres, C. M. Travieso, A. Ezeiza, A. Estanga et al.