Alzheimer's Dementia Detection from Audio and Text Modalities
Edward L. Campbell, Laura Docío-Fernández, Javier Jiménez Raboso, Carmen García-Mateo
GTM Research Group, AtlanTTic Research Center, University of Vigo, Spain; acceXible
[email protected], [email protected], [email protected], [email protected]

Abstract
Automatic detection of Alzheimer's dementia from speech is enhanced when features of both the acoustic waveform and the spoken content are extracted. Audio and text transcription have been widely used in health-related tasks, as spectral and prosodic speech features, as well as semantic and linguistic content, convey information about various diseases. Hence, this paper describes the joint work of the GTM-UVIGO research group and the acceXible startup on the ADReSS challenge at INTERSPEECH 2020. The submitted systems aim to detect patterns of Alzheimer's disease from both the patient's voice and the transcription of their message. Six different systems have been built and compared: four of them are speech-based and the other two are text-based. X-vector, i-vector, and statistical functional speech features are evaluated. As reduced speaking fluency is a common pattern in patients with Alzheimer's disease, rhythmic features are also proposed. For transcription analysis, two systems are proposed: one uses GloVe word embedding features and the other uses several features extracted by language modelling. Several intra-modality and inter-modality score fusion strategies are investigated. The performance of single-modality and multimodal systems is presented. The achieved results are promising, outperforming those of the ADReSS baseline systems.
Index Terms: Alzheimer's disease detection, i-vector, x-vector, speech fluency, word embedding, recurrent neural networks, score-level fusion, ADReSS challenge
1. Introduction
Alzheimer's disease (AD) is a neurodegenerative illness that represents the most common cause of dementia in the world. It is provoked by damage to the neurons involved in thinking, learning and memory. The disease has three main stages. In the first one, called preclinical, patients do not present clear symptoms because the brain initially compensates for them, enabling individuals to continue to function normally. The second one is defined as Mild Cognitive Impairment (MCI). In this stage, patients show greater cognitive decline than expected for their age and have problems expressing and connecting ideas; however, these changes may be noticeable only to family members and friends. The critical phase is the last one, called dementia. It is characterized by noticeable memory, thinking and behavioural symptoms that impair a person's ability to function in daily life [1][2].

Common signs of AD are related to problems with uttering words; consequently, people with AD may have trouble following or joining a conversation, and they may stop in the middle of a sentence with no idea how to continue. As a result, analysis of speech and its transcription may represent a suitable mechanism for detecting AD during the second or third stage of the disease [1][3]. According to the literature [4][5][6], using information from the patient's voice, as well as from its transcription, would ease the early AD detection task.

In this work, four speech-based systems and two text-based systems are proposed for the automatic distinction between patients with and without AD. Those based on the speech signal comprise four approaches to represent its spectral and prosodic content, namely: i-vector and x-vector embeddings, statistical functionals, and rhythmic features. The first three use a support vector machine (SVM) as classifier and the last one uses a linear discriminant analysis (LDA) classifier. As for the systems that use speech transcriptions, one relies on GloVe word embedding features [7] with a recurrent neural network (RNN) as classifier, and the other is based on features extracted by language modelling, using an SVM as classifier. Finally, intra-modality and inter-modality score fusion strategies were applied to improve the final results.

All the individual systems, and their fusion, are evaluated on the AD recognition task within the framework of the ADReSS Challenge [8]. This challenge targets AD detection from spontaneous speech. The data used in the challenge consist of speech recordings, and their transcripts, corresponding to descriptions of the Cookie Theft picture. Specifically, it is a selection of Alzheimer and control patients from the DementiaBank Pitt corpus (http://dementia.talkbank.org/).

The rest of the paper is organized as follows. Sections 2 and 3 describe the speech-based and text-based systems, respectively. Section 4 outlines the experimental framework. The experimental results are presented and discussed in Section 5. Finally, Section 6 draws some conclusions and outlines future work.
2. Speech-based systems
This section describes the speech feature extraction strategies and classification methods, with special attention to the different statistical features extracted from the patient's voice.
2.1. Speech embeddings

Speech embedding features are considered state-of-the-art speech representations for speaker recognition applications. These representations can also be applied to AD detection, as long as they preserve those spectral patterns in the speaker's voice that allow the distinction between patients with and without AD. In this paper, two strategies were analyzed: the first one uses the i-vector paradigm [9], and the second one uses an x-vector [10] based representation. The main characteristics of these approaches are briefly described below.

A. i-vectors

To extract the i-vectors, a universal background model (UBM) and a total variability matrix T must be trained. As speech parameterization, these models use 13 perceptual linear prediction (PLP) cepstral coefficients combined with two pitch-related features (F0 and voicing probability) [11]. These features are augmented with their delta and acceleration coefficients, leading to vectors of dimension 45. This combination was chosen because it yields a representation of speech that includes spectral information as well as prosodic cues, such as rhythm and intonation, that are embedded in the fundamental frequency. The UBM was a diagonal-covariance, 512-component Gaussian mixture model (GMM) trained with data from outside this task, while T was trained using the task training data. The dimension of the i-vectors was set to 125 and they were length normalized.

B. x-vectors

A pretrained time delay deep neural network (TDNN), with 5 time delay layers and two dense layers, trained to discriminate between speakers, was used. The network was implemented with the nnet3 neural network library of the Kaldi Speech Recognition Toolkit [12] (http://kaldi-asr.org/models.html) and trained on the augmented VoxCeleb 1 and VoxCeleb 2 datasets [13]. The input to the TDNN are 30 mel-frequency cepstral coefficients (MFCC), and the embeddings are extracted from the first dense layer with a dimensionality of 512. The output of this layer (the x-vector) is first projected by linear discriminant analysis (LDA) into a 200-dimensional space and then length normalized.

Instead of extracting a single i-vector or x-vector (embedding) to represent the entire audio signal, a set of these vectors is obtained by applying a sliding window. In this way, each audio file is represented by a certain number of embeddings, which are then used for classification.

Both systems use an SVM with a linear kernel as classifier. Since several embeddings are extracted from each audio file, there is also a set of classification results (one per embedding), which must be combined to obtain the patient's AD or non-AD classification. In this work, the mean of the classification scores was used as the score for the final decision.

2.2. Statistical functionals

This system represents the audio waveform with a set of vectors called functionals. These are mainly statistical and regression features extracted from a wide variety of low-level acoustic attributes using the openSMILE software [14]. Specifically, the set of functionals proposed in the AVEC 2013 evaluation (Audio/Visual Emotion and Depression Recognition Challenge) [15] was used, which consists of 2,268 descriptors extracted from 32 low-level features of the audio signal related to energy, spectrum, volume and tone.

As the number of descriptors is very high in relation to the amount of data available, a selection of the most relevant ones was carried out.
For this purpose, a correlation-based feature subset selection (CFS) algorithm [16] was used, resulting in 57 functionals related to the low-level descriptors shown in Table 1. The functionals were extracted using a sliding window of 3 s with an overlap of 1 s. As with the i-vectors and x-vectors, this system uses an SVM with a linear kernel as classifier.

Table 1: Low-level descriptors in the set of 57 functionals

  Energy and spectral features: loudness, MFCC and delta-MFCC, energy in the 250–650 Hz and 1–4 kHz bands, 25% spectral roll-off points, spectral flux, entropy, skewness, psychoacoustic sharpness, harmonicity, flatness
  Voicing-related features: F0, voicing probability, jitter
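As an illustration of the per-window classification scheme shared by the embedding-based and functional-based systems, the sketch below trains a linear SVM on window-level feature vectors and averages the per-window scores into a recording-level decision. The feature dimensionality, the synthetic data and the 0.5 decision threshold are illustrative assumptions, not details of the actual pipeline.

```python
# Illustration of the per-window classification scheme shared by the embedding
# and functional systems: a linear SVM scores every window of a recording and
# the mean score gives the recording-level AD / non-AD decision. The feature
# dimensionality, the synthetic data and the 0.5 threshold are assumptions.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Toy training data: one row per window (in the paper these would be i-vectors,
# x-vectors, or the 57 selected functionals), labelled with the recording label.
X_train = rng.normal(size=(200, 45))
y_train = rng.integers(0, 2, size=200)

clf = SVC(kernel="linear", probability=True).fit(X_train, y_train)

def classify_recording(window_features):
    """Average the per-window AD probabilities and threshold the mean."""
    scores = clf.predict_proba(window_features)[:, 1]
    mean_score = float(scores.mean())
    return mean_score, int(mean_score >= 0.5)

# A recording represented by 12 windows of 45-dimensional features.
score, label = classify_recording(rng.normal(size=(12, 45)))
print(f"recording score = {score:.3f}, predicted label = {label}")
```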
2.3. Rhythmic features

A common pattern in patients with AD is a lack of speaking fluency, which makes rhythm a viable clue for detecting that behaviour. However, prosodic information (e.g., mean energy) was also used, because the correct pronunciation of words does not depend only on rhythm but also on intonation, tone and stress. For this system, the selected parameters were based on [4][17], and they are as follows:

• Number of syllables
• Rate of speech (syllables / original duration)
• Speaking duration
• Average fundamental frequency
• Median fundamental frequency
• Minimum fundamental frequency
• Pronunciation posterior probability
• Average voice interval duration
• Average duration of pairs (consecutive voiced and unvoiced segments)
• Mean energy
• Ratio between the mean energy and its standard deviation

The extraction process was carried out using the Python library My-Voice Analysis (https://github.com/Shahabks/my-voice-analysis), the voice activity detection of the SIDEKIT software [18], and our own algorithms. The classification was performed with the LDA algorithm, projecting the rhythmic feature vector onto a one-dimensional space where every projected point represents a classification score.
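A rough sketch of this rhythm-based classifier is given below. It assumes an 11-dimensional rhythm/prosody vector has already been extracted per subject (synthetic placeholders stand in for the My-Voice Analysis / SIDEKIT features) and only illustrates the LDA projection onto a one-dimensional score.

```python
# Rough sketch of the rhythm-based classifier: it assumes an 11-dimensional
# rhythm/prosody vector per subject has already been extracted (synthetic
# placeholders here instead of the My-Voice Analysis / SIDEKIT pipeline).
# LDA projects each vector onto one dimension and the projection is the score.
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(1)
X = rng.normal(size=(108, 11))            # one rhythm feature vector per subject
y = rng.integers(0, 2, size=108)          # 1 = AD, 0 = non-AD

lda = LinearDiscriminantAnalysis(n_components=1)
scores = lda.fit_transform(X, y).ravel()  # one-dimensional projections = scores
labels = lda.predict(X)                   # hard decisions, if needed
print(scores[:5], labels[:5])
```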
3. Text-based systems
Two different text-based approaches were analysed. The first one is a sequential model based on deep learning, which classifies directly from the sequence of GloVe word embeddings. The second one uses the manual transcripts to extract linguistic information that forms the input features of a classifier.
3.1. Preprocessing

The transcripts contained in the dataset are in CHAT format [19], which facilitates speech annotation and analysis. They include the transcript of both the subject and the investigator in charge of the test, as well as additional non-speech annotations: times, researcher interventions, errors, and morphological or syntactic analysis.

Transcripts are divided into several interventions, i.e. sentences or parts of complete sentences with meaning, and this granularity has been maintained in the preprocessing. Metadata, researcher interventions and the linguistic analysis included in the files have been removed. We keep only the subject's words as input for the classifiers, as this better reflects the real-world task of detecting Alzheimer's based only on speech analysis.

3.2. Sequential model

The first approach consists of training an RNN at intervention level, by classifying whether a given sentence belongs to an AD subject. Then, the probability that the subject has AD given all his/her interventions is computed as the mean of the probabilities of each intervention. By using the interventions as independent samples, the training dataset is increased to a size of 1492 records from the 108 subjects. In the training process, words are tokenized and interventions are padded up to 20 tokens.

The RNN is composed of 3 layers: an Embedding layer with pre-trained weights from the GloVe 50-dimensional representation of words, a long short-term memory (LSTM) layer with 4 units, and an output layer with 1 unit and sigmoid activation. Additionally, dropout at a 10% rate is applied to the Embedding and LSTM layers for regularization, to prevent overfitting. The model is trained for 10 epochs, using a batch size of 16 and Adam optimization.
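A minimal Keras sketch of this sequential model is shown below, following the description above (GloVe 50-dimensional embeddings, a 4-unit LSTM, 10% dropout, sigmoid output, 10 epochs, batch size 16, Adam). The vocabulary size, the random stand-in for the GloVe matrix, the toy token sequences and the binary cross-entropy loss are assumptions, not details taken from the paper.

```python
# Sketch of the intervention-level RNN described above: GloVe 50-d embeddings,
# an LSTM with 4 units, 10% dropout on the Embedding and LSTM layers, a sigmoid
# output, 10 epochs, batch size 16 and Adam. The vocabulary size, the random
# stand-in for the GloVe matrix, the toy token sequences and the binary
# cross-entropy loss are illustrative assumptions.
import numpy as np
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Embedding, Dropout, LSTM, Dense

VOCAB_SIZE, EMB_DIM, MAX_LEN = 5000, 50, 20

# Placeholder for the pretrained GloVe matrix (one 50-d row per token id).
glove_weights = np.random.normal(size=(VOCAB_SIZE, EMB_DIM)).astype("float32")

model = Sequential([
    Embedding(VOCAB_SIZE, EMB_DIM),
    Dropout(0.1),                      # dropout after the embedding layer
    LSTM(4, dropout=0.1),              # 4-unit LSTM with 10% input dropout
    Dense(1, activation="sigmoid"),    # P(intervention belongs to an AD subject)
])
model.build(input_shape=(None, MAX_LEN))
model.layers[0].set_weights([glove_weights])   # load the pretrained embeddings

model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# Toy padded token sequences (one row per intervention) and their labels.
X = np.random.randint(0, VOCAB_SIZE, size=(1492, MAX_LEN))
y = np.random.randint(0, 2, size=(1492,))
model.fit(X, y, epochs=10, batch_size=16, verbose=0)

# Subject-level score: mean of the intervention-level probabilities.
p_ad = float(model.predict(X[:10], verbose=0).mean())
print(f"P(AD | interventions) = {p_ad:.3f}")
```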
3.3. Linguistic features model

In this approach, several linguistic features and indicators are built from the subject's interventions and, using this feature vector as input, an SVM is trained. Unlike the previous method, the classification here is performed at subject level, by considering the full transcript of each participant.

Previous work [20] has shown that certain linguistic features are useful for detecting AD using the Cookie Theft test. Here, 13 features have been built, grouped into 4 categories:

• Extension information, such as the number of interventions, number of words per intervention and mean word length.
• Vocabulary richness, measured as the number of unique words used by the subject.
• Presence of key informational concepts: kitchen, mother, stool, boy and girl.
• Frequency of verbs, nouns, adjectives and pronouns from POS tagging.

Each feature is then rescaled with min-max normalization into the range [0, 1] and an SVM with radial basis function (RBF) kernel and C = 1.0 is trained, whose output is the probability that the subject has AD given the 13-dimensional feature vector.
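The sketch below shows one way such a 13-dimensional feature vector and the RBF SVM could be assembled. The exact feature definitions are simplified relative to the description above, the POS-tag frequencies are left as zero placeholders (a POS tagger would fill them in), and the training data are synthetic.

```python
# Sketch of the linguistic-features system: a 13-dimensional vector per subject
# (extension, vocabulary richness, key concepts, POS frequencies), min-max
# scaling and an RBF SVM with C = 1.0. The exact feature definitions are
# simplified, the POS frequencies are left as zeros (a POS tagger would fill
# them in), and the training data below are synthetic.
import numpy as np
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

KEY_CONCEPTS = ["kitchen", "mother", "stool", "boy", "girl"]

def subject_features(interventions):
    """Build a 13-dim feature vector from a subject's list of interventions."""
    words = [w.lower() for s in interventions for w in s.split()]
    n_words = max(len(words), 1)
    pos_freqs = [0.0, 0.0, 0.0, 0.0]            # verbs, nouns, adjectives, pronouns
    return [
        len(interventions),                      # number of interventions
        n_words / max(len(interventions), 1),    # words per intervention
        float(np.mean([len(w) for w in words])) if words else 0.0,  # mean word length
        len(set(words)) / n_words,               # vocabulary richness
        *[float(c in words) for c in KEY_CONCEPTS],  # presence of key concepts
        *pos_freqs,                              # POS-tag frequencies (placeholder)
    ]

print(subject_features(["the boy is on the stool", "he takes a cookie"]))

# Training sketch: one feature vector per subject, synthetic here.
rng = np.random.default_rng(4)
X = rng.uniform(size=(108, 13))
y = rng.integers(0, 2, size=108)

scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(X)              # rescale each feature to [0, 1]
clf = SVC(kernel="rbf", C=1.0, probability=True).fit(X_scaled, y)
print("P(AD) for first subject:", clf.predict_proba(X_scaled[:1])[0, 1])
```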
4. Experimental framework
The training dataset [8] consists of the recordings and manual transcripts of 108 subjects performing the test known as Cookie Theft, whose objective is to describe an image. Out of the 108 participants, 54 are patients diagnosed with Alzheimer's. Table 2 shows the training data distribution in detail.
Table 2: ADReSS training dataset (number of male and female AD and non-AD participants per five-year age interval, from [50, 55) to [75, 80))

The evaluation metrics for the AD classification task are:
Accuracy = (TN + TP) / N,   Precision π = TP / (TP + FP),   Recall ρ = TP / (TP + FN),   F1 = 2πρ / (π + ρ),

where N is the number of patients, and TP, TN, FP and FN are the number of true positives, true negatives, false positives and false negatives, respectively.

All the systems have been trained using a leave-one-subject-out (LOSO) cross-validation strategy to measure the generalization error. Therefore, each model uses 107 subjects as training data and is validated on the held-out subject.
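A minimal sketch of the LOSO evaluation loop and of these metrics, using scikit-learn with synthetic per-subject features and a linear SVM standing in for any of the systems above:

```python
# Minimal sketch of the leave-one-subject-out evaluation and of the metrics
# defined above, using scikit-learn; the linear SVM and the synthetic
# per-subject features stand in for any of the systems in this paper.
import numpy as np
from sklearn.model_selection import LeaveOneOut
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

rng = np.random.default_rng(2)
X = rng.normal(size=(108, 20))     # one feature vector per subject
y = rng.integers(0, 2, size=108)   # 1 = AD, 0 = non-AD

y_pred = np.empty_like(y)
for train_idx, test_idx in LeaveOneOut().split(X):
    model = SVC(kernel="linear").fit(X[train_idx], y[train_idx])
    y_pred[test_idx] = model.predict(X[test_idx])

print("accuracy :", accuracy_score(y, y_pred))
print("precision:", precision_score(y, y_pred, zero_division=0))
print("recall   :", recall_score(y, y_pred, zero_division=0))
print("F1       :", f1_score(y, y_pred, zero_division=0))
```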
5. Results
This section presents the results of the AD classification task for both the leave-one-subject-out (LOSO) and test settings of the ADReSS challenge.
Table 3 shows the individual results achieved by the four speech-based systems. They indicate that the x-vector, functionals and fluency-based systems have similar performance in terms of accuracy as well as of area under the curve (AUC). The i-vector system was the weakest model, being the only one with an accuracy under 70%, although it is still above the challenge baseline results [8].
Table 3: Results of the proposed speech-based systems

System       Class    Precision  Recall  F1 Score  Accuracy  AUC
i-vector     non-AD   0.7058     0.6666  0.6857    0.6944    0.6806
             AD       0.6842     0.7222  0.7027
x-vector     non-AD   0.7400     0.6851  0.7115    0.7222    0.7615
             AD       0.7068     0.7592  0.7321
functionals  non-AD   0.7454     0.7592  0.7522    0.7500    0.7435
             AD       0.7547     0.7407  0.7476
fluency      non-AD   0.7450     0.7037  0.7238    0.7314    0.7613
             AD       0.7192     0.7592  0.7387
The results for both text-based systems are shown in Table 4. Concerning the RNN model, the AUC on the held-out set is 0.8563. For a selected threshold probability, we also obtain an accuracy of 0.7407, a recall of 0.8704 and an F1-score of 0.7705; the false negative rate is 13%. For the linguistic model, the AUC on the held-out set is 0.7510. For a selected threshold probability, the accuracy is 0.6852, the recall 0.7037 and the F1-score 0.6909; the false negative rate is 30%.
Table 4: Results of the proposed text-based systems

System            Class    Precision  Recall  F1 Score  Accuracy  AUC
RNN model         non-AD   0.7424     0.9074  0.8166    0.7962    0.8563
                  AD       0.8809     0.6851  0.7708
Linguistic model  non-AD   0.7017     0.7407  0.7207    0.7129    0.7510
                  AD       0.7254     0.6851  0.7047
Moreover, two different score fusion strategies were carried out. In the first one (referred to as Fusion I), the scores of every system were normalized by z-norm and merged by averaging. In the second one (referred to as Fusion II), the same normalization strategy was used, but the averages were computed over the speech-based and text-based system scores separately; the two resulting scores were then merged, again by averaging. Figure 2 shows the ROC curves of the described fusion systems. The Fusion I model achieved an accuracy of 0.8056, an F1-score of 0.8073 and an AUC of 0.8930. On the other hand, the Fusion II model presented an accuracy of 0.8241, an F1-score of 0.8190 and an AUC of 0.9023. These results show that fusing the text-based and speech-based modalities improves the detection of AD. For further insight, Figures 1 and 2 compare the ROC curves of all individual systems and of their fusions, respectively.
Figure 1: ROC curve of all systems
Figure 2: ROC curve of fusion
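The sketch below illustrates the two fusion strategies under the assumption of one score per subject and per system: Fusion I z-normalises all system scores and averages them, while Fusion II averages within the speech and text modalities first and then averages the two modality scores. The toy scores are synthetic.

```python
# Sketch of the two score-fusion strategies: Fusion I z-normalises every
# system's scores and averages them all; Fusion II averages within the speech
# and text modalities first and then averages the two modality scores. One
# score per subject and per system is assumed; the toy scores are synthetic.
import numpy as np

def znorm(s):
    return (s - s.mean()) / s.std()

def fusion_i(system_scores):
    """Average the z-normalised scores of all systems."""
    return np.mean([znorm(s) for s in system_scores], axis=0)

def fusion_ii(speech_scores, text_scores):
    """Average within each modality, then average the two modality scores."""
    speech = np.mean([znorm(s) for s in speech_scores], axis=0)
    text = np.mean([znorm(s) for s in text_scores], axis=0)
    return (speech + text) / 2.0

rng = np.random.default_rng(3)
speech = [rng.normal(size=108) for _ in range(4)]   # four speech-based systems
text = [rng.normal(size=108) for _ in range(2)]     # two text-based systems
print(fusion_i(speech + text)[:5])
print(fusion_ii(speech, text)[:5])
```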
Based on the previous results, five systems were submitted to the challenge. Table 5 shows the performance they achieved on the test dataset, which consists of recordings of 48 subjects.

The results achieved by the RNN model were very similar in training and testing, which is a good sign of the absence of overfitting in the training stage, a consequence of the 10% dropout applied to the Embedding and LSTM layers. However, the submitted speech-based systems showed a significant decrease in performance. This decline is likely a consequence of poor generalization capacity. As a result, the performance of the fusion systems also declines.

Table 5: Results on the testing dataset

System      Class    Precision  Recall  F1 Score  Accuracy
Fusion II   non-AD   0.8000     0.6667  0.7273    0.7500
            AD       0.7143     0.8333  0.7692
Fusion I    non-AD   0.8235     0.5833  0.6829    0.7292
            AD       0.6774     0.8750  0.7636
RNN model   non-AD   0.6667     1.0000  0.8000    0.7500
            AD       1.0000     0.5000  0.6667
fluency     non-AD   0.6087     0.5833  0.5957    0.6042
            AD       0.6000     0.6250  0.6122
x-vector    non-AD   0.5553     0.4167  0.4762    0.5417
            AD       0.5333     0.6667  0.5926
6. Conclusions and Further work
Two lines of work have been developed on the ADReSS Challenge dataset: one based on speech processing and the other based on text processing.
It is important to highlight the following points when assessing both solutions:

• Performance: the x-vector, functionals and fluency speech features showed competitive performance in the LOSO cross-validation. However, their performance decreases in the test setting, which would be a result of the low level of generalization achieved at the training stage. On the other hand, the text-based RNN model obtained superior results in both the LOSO and test settings, although with unbalanced recall and precision values in the latter. Nevertheless, the Fusion II strategy was a solution to that problem, presenting the same accuracy as the RNN model but with more balanced measures. As a result, multimodal systems prove to be a better strategy for AD detection than individual systems.

• Complexity: the x-vector and RNN systems, as deep-learning-based models, need a large amount of training data to be able to generalize well.

• Explainability: deep learning models are black-box models and are very difficult to interpret from a human perspective. On the contrary, the linguistic, fluency, functionals and i-vector features created here are explainable, and one could determine their weight in the classification.

Several future research lines have been identified. Firstly, investigating improved classifiers based on deep neural networks and novel acoustic feature extraction algorithms. Secondly, adding new linguistic features and non-verbal information (pauses, silence duration, word mistakes, etc.) to the text-based systems. Thirdly, analysing strategies to increase the generalization capacity of the proposed speech-based systems, for example by finding new rhythmic parameters that better discriminate between patients with and without AD. It is also important to conduct experiments in other experimental settings, for example using other questions of the mini-mental state examination test, to validate the results obtained and, above all, to increase the size of the datasets.
7. Acknowledgements
This work has received financial support from the Spanish Ministerio de Economia y Competitividad through the project Speech&Sign (RTI2018-101372-B-100), and also from the Xunta de Galicia (AtlanTTic and ED431B 2018/60 grants) and the European Regional Development Fund (ERDF).

8. References

[1] Alzheimer's Association, "2019 Alzheimer's disease facts and figures," Alzheimer's & Dementia, vol. 15, no. 3, pp. 321–387, 2019.
[2] P. J. Nestor, P. Scheltens, and J. R. Hodges, "Advances in the early detection of Alzheimer's disease," Nature Medicine, vol. 10, no. 7, pp. S34–S41, 2004.
[3] S. Ahmed, A.-M. F. Haigh, C. A. de Jager, and P. Garrard, "Connected speech as a marker of disease progression in autopsy-proven Alzheimer's disease," Brain, vol. 136, no. 12, pp. 3727–3737, 2013.
[4] J. J. G. Meilán, F. Martínez-Sánchez, J. Carro, D. E. López, L. Millian-Morell, and J. M. Arana, "Speech in Alzheimer's disease: Can temporal and acoustic parameters discriminate dementia?" Dementia and Geriatric Cognitive Disorders, vol. 37, no. 5-6, pp. 327–334, 2014.
[5] K. López-de-Ipiña, J. B. Alonso, J. Solé-Casals, N. Barroso, P. Henriquez, M. Faundez-Zanuy, C. M. Travieso, M. Ecay-Torres, P. Martinez-Lage, and H. Eguiraun, "On automatic diagnosis of Alzheimer's disease based on spontaneous speech analysis and emotional temperature," Cognitive Computation, vol. 7, no. 1, pp. 44–55, 2015.
[6] K. C. Fraser, J. A. Meltzer, and F. Rudzicz, "Linguistic features identify Alzheimer's disease in narrative speech," Journal of Alzheimer's Disease, vol. 49, no. 2, pp. 407–422, 2016.
[7] J. Pennington, R. Socher, and C. D. Manning, "GloVe: Global vectors for word representation," in Conference on Empirical Methods in Natural Language Processing (EMNLP), vol. 14, 2014, pp. 1532–1543.
[8] S. Luz, F. Haider, S. de la Fuente, D. Fromm, and B. MacWhinney, "Alzheimer's dementia recognition through spontaneous speech: The ADReSS Challenge," in Proceedings of INTERSPEECH 2020, Shanghai, China, 2020. [Online]. Available: https://arxiv.org/abs/2004.06833
[9] N. Dehak, P. J. Kenny, R. Dehak, P. Dumouchel, and P. Ouellet, "Front-end factor analysis for speaker verification," IEEE Transactions on Audio, Speech and Language Processing, 2010.
[10] D. Snyder, D. Garcia-Romero, G. Sell, D. Povey, and S. Khudanpur, "X-vectors: Robust DNN embeddings for speaker recognition," in Proc. IEEE ICASSP, 2018, pp. 5329–5333.
[11] P. Ghahremani, B. BabaAli, D. Povey, K. Riedhammer, J. Trmal, and S. Khudanpur, "A pitch extraction algorithm tuned for automatic speech recognition," in Proc. IEEE ICASSP, 2014, pp. 2494–2498.
[12] D. Povey, A. Ghoshal, G. Boulianne, L. Burget, O. Glembek, N. Goel, M. Hannemann, P. Motlicek, Y. Qian, P. Schwarz, J. Silovsky, G. Stemmer, and K. Vesely, "The Kaldi speech recognition toolkit," 2011.
[13] A. Nagrani, J. S. Chung, W. Xie, and A. Zisserman, "VoxCeleb: Large-scale speaker verification in the wild," Computer Speech and Language, vol. 60, 2020.
[14] F. Eyben, M. Wöllmer, and B. W. Schuller, "openSMILE: the Munich versatile and fast open-source audio feature extractor," in ACM Multimedia. ACM, 2010, pp. 1459–1462.
[15] M. Valstar, B. Schuller, K. Smith, F. Eyben, B. Jiang, S. Bilakhia, S. Schnieder, R. Cowie, and M. Pantic, "AVEC 2013 - the continuous audio/visual emotion and depression recognition challenge," in Proceedings of the 3rd International Workshop on Audio/Visual Emotion Challenge (AVEC'13), 2013.
[16] M. A. Hall, "Correlation-based feature subset selection for machine learning," Ph.D. dissertation, University of Waikato, Hamilton, New Zealand, 1998.
[17] M. Ajili, S. Rossato, D. Zhang, and J.-F. Bonastre, "Impact of rhythm on forensic voice comparison reliability," 2018.
[18] A. Larcher, K. A. Lee, and S. Meignier, "An extensible speaker identification SIDEKIT in Python," in Proc. IEEE ICASSP, 2016, pp. 5095–5099.
[19] B. MacWhinney, "The CHILDES project: tools for analyzing talk," Child Language Teaching and Therapy, vol. 8, 2000.
[20] K. C. Fraser, J. A. Meltzer, and F. Rudzicz, "Linguistic features identify Alzheimer's disease in narrative speech," Journal of Alzheimer's Disease, vol. 49, no. 2, pp. 407–422, 2016.