Alzheimer's Dementia Detection from Audio and Text Modalities
Edward L. Campbell, Laura Docío-Fernández, Javier Jiménez Raboso, Carmen García-Mateo
GTM Research Group, AtlanTTic Research Center, University of Vigo, Spain; acceXible
[email protected], [email protected], [email protected], [email protected]

Abstract
Automatic detection of Alzheimer's dementia from speech is enhanced when features of both the acoustic waveform and the spoken content are extracted. Audio and text transcription have been widely used in health-related tasks, as spectral and prosodic speech features, as well as semantic and linguistic content, convey information about various diseases. Hence, this paper describes the joint work of the GTM-UVIGO research group and the acceXible startup on the ADReSS challenge at INTERSPEECH 2020. The submitted systems aim to detect patterns of Alzheimer's disease from both the patient's voice and the transcription of their message. Six different systems have been built and compared: four of them are speech-based and the other two are text-based. X-vector, i-vector, and statistical functional speech features are evaluated. As reduced speaking fluency is a common pattern in patients with Alzheimer's disease, rhythmic features are also proposed. For transcription analysis, two systems are proposed: one uses GloVe word embedding features and the other uses several features extracted by language modelling. Several intra-modality and inter-modality score fusion strategies are investigated. The performance of single-modality and multimodal systems is presented. The achieved results are promising, outperforming those of the ADReSS baseline systems.
Index Terms: Alzheimer's disease detection, i-vector, x-vector, speech fluency, word embedding, recurrent neural networks, score-level fusion, ADReSS challenge
1. Introduction
Alzheimer's disease (AD) is a neurodegenerative illness that represents the most common cause of dementia in the world. It is provoked by damage to the neurons involved in thinking, learning and memory. The disease has three main stages. In the first one, called preclinical, patients do not present clear symptoms because the brain initially compensates for them, enabling individuals to continue to function normally. The second one is defined as Mild Cognitive Impairment (MCI). In this stage, patients show greater cognitive decline than expected for their age and have problems expressing and connecting ideas; however, these changes may be noticeable only to family members and friends. The critical phase is the last one, called dementia. It is characterized by noticeable memory, thinking and behavioural symptoms that impair a person's ability to function in daily life [1][2].

Common signs of AD are related to problems with uttering words; consequently, people with AD may have trouble following or joining a conversation, and they may stop in the middle of a sentence with no idea how to continue. As a result, analysis of speech and its transcription may represent a suitable mechanism for detecting AD during the second or third stage of the disease [1][3]. According to the literature [4][5][6], using information from the patient's voice, as well as from its transcription, would ease the early AD detection task.

In this work, four speech-based systems and two text-based systems are proposed for the automatic distinction between patients with and without AD. Those based on the speech signal comprise four approaches to represent its spectral and prosodic content, namely: i-vector and x-vector embeddings, statistical functionals, and rhythmic features. The first three use a support vector machine (SVM) as classifier and the last one uses a linear discriminant analysis (LDA) classifier. As for the systems that use speech transcriptions, one relies on GloVe word embedding features [7] with a recurrent neural network (RNN) as classifier, and the other is based on features extracted by language modelling, using an SVM as classifier. Finally, intra-modality and inter-modality score fusion strategies were applied to improve the final results.

All the individual systems, and their fusion, are evaluated on the AD recognition task within the framework of the ADReSS Challenge [8]. This challenge targets AD detection from spontaneous speech. The data used in the challenge consist of speech recordings, and their transcripts, corresponding to descriptions of the Cookie Theft picture. Specifically, it is a selection of Alzheimer and control patients from the DementiaBank Pitt corpus (http://dementia.talkbank.org/).

The rest of the paper is organized as follows. Sections 2 and 3 describe the speech-based and text-based systems, respectively. Section 4 outlines the experimental framework. The experimental results are presented and discussed in Section 5. Finally, Section 6 draws some conclusions and outlines future work.
2. Speech-based systems
This section describes the speech feature extraction strategies and classification methods, with special attention to the different statistical features extracted from the patient's voice.
2.1. Speech embeddings

Speech embedding features are considered state-of-the-art speech representations for speaker recognition applications. These representations can also be applied to AD detection, as long as they preserve those spectral patterns in the speaker's voice that allow the distinction between patients with and without AD. In this paper, two strategies were analyzed: the first one uses the i-vector paradigm [9], and the second one uses an x-vector [10] based representation. The main characteristics of these approaches are briefly described below.

A. i-vectors

To extract the i-vectors, a universal background model (UBM) and a total variability matrix T must be trained. As speech parameterization, these models use 13 perceptual linear prediction (PLP) cepstral coefficients combined with two pitch-related features (F0 and voicing probability) [11]. These features are augmented with their delta and acceleration coefficients, leading to vectors of dimension 45. This combination was chosen because it yields a representation of speech that includes spectral information as well as prosodic cues, such as rhythm and intonation, that are embedded in the fundamental frequency. The UBM was a diagonal-covariance, 512-component Gaussian mixture model (GMM) trained with data from outside this task, while T was trained using the task training data. The dimension of the i-vectors was set to 125 and they were length normalized.

B. x-vectors

A pretrained time delay deep neural network (TDNN), with 5 time delay layers and two dense layers, trained to discriminate between speakers, was used. The network was implemented with the nnet3 neural network library of the Kaldi Speech Recognition Toolkit [12] (http://kaldi-asr.org/models.html) and trained on the augmented VoxCeleb 1 and VoxCeleb 2 datasets [13]. The input to the TDNN are 30 mel-frequency cepstral coefficients (MFCC), and the embeddings are extracted from the first dense layer with a dimensionality of 512. The output of this layer (the x-vector) is first projected by linear discriminant analysis (LDA) into a 200-dimensional space and then length normalized.

Instead of extracting a single i-vector or x-vector (embedding) to represent the entire audio signal, a set of these vectors is obtained by applying a sliding window. In this way, each audio file is represented by a certain number of embeddings, which are then used for classification.

Both systems use an SVM with a linear kernel as classifier. Since several embeddings are extracted from each audio file, there is also a set of classification results (one per embedding), which must be combined to obtain the patient's AD or non-AD classification. In this work, the mean of the classification scores was used as the score for the final decision.

2.2. Statistical functionals

This system represents the audio waveform with a set of vectors called functionals. These are mainly statistical and regression features extracted from a wide variety of low-level acoustic attributes using the openSMILE software [14]. Specifically, the set of functionals proposed in the AVEC 2013 evaluation (Audio/Visual Emotion and Depression Recognition Challenge) [15] was used, which consists of 2,268 descriptors extracted from 32 low-level features of the audio signal related to energy, spectrum, volume and tone.

As the number of descriptors is very high in relation to the amount of data available, a selection of the most relevant ones was carried out.
For this purpose, a correlation-based feature subset selection (CFS) algorithm [16] was used, resulting in 57 functionals related to the low-level descriptors shown in Table 1. The functionals were extracted using a sliding window of 3 s with an overlap of 1 s. As with the i-vectors and x-vectors, this system uses an SVM with a linear kernel as classifier.

Table 1: Low-level descriptors in the set of 57 functionals

  Energy and spectral features: loudness, MFCC and delta-MFCC, energy in the 250–650 Hz and 1–4 kHz bands, 25% spectral roll-off points, spectral flux, entropy, skewness, psychoacoustic sharpness, harmonicity, flatness
  Voicing-related features: F0, voicing probability, jitter
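As an illustration of the per-window classification scheme shared by the embedding-based and functional-based systems, the sketch below trains a linear SVM on window-level feature vectors and averages the per-window scores into a recording-level decision. The feature dimensionality, the synthetic data and the 0.5 decision threshold are illustrative assumptions, not details of the actual pipeline.

```python
# Illustration of the per-window classification scheme shared by the embedding
# and functional systems: a linear SVM scores every window of a recording and
# the mean score gives the recording-level AD / non-AD decision. The feature
# dimensionality, the synthetic data and the 0.5 threshold are assumptions.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Toy training data: one row per window (in the paper these would be i-vectors,
# x-vectors, or the 57 selected functionals), labelled with the recording label.
X_train = rng.normal(size=(200, 45))
y_train = rng.integers(0, 2, size=200)

clf = SVC(kernel="linear", probability=True).fit(X_train, y_train)

def classify_recording(window_features):
    """Average the per-window AD probabilities and threshold the mean."""
    scores = clf.predict_proba(window_features)[:, 1]
    mean_score = float(scores.mean())
    return mean_score, int(mean_score >= 0.5)

# A recording represented by 12 windows of 45-dimensional features.
score, label = classify_recording(rng.normal(size=(12, 45)))
print(f"recording score = {score:.3f}, predicted label = {label}")
```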
2.3. Rhythmic features

A common pattern in patients with AD is a lack of speaking fluency, which makes rhythm a viable clue for detecting that behaviour. However, prosodic information (e.g., mean energy) was also used, because the correct pronunciation of words does not depend only on rhythm but also on intonation, tone and stress. For this system, the selected parameters were based on [4][17], and they are as follows:

• Number of syllables
• Rate of speech (syllables / original duration)
• Speaking duration
• Average fundamental frequency
• Median fundamental frequency
• Minimum fundamental frequency
• Pronunciation posterior probability
• Average voice interval duration
• Average duration of pairs (consecutive voiced and unvoiced segments)
• Mean energy
• Ratio between the mean energy and its standard deviation

The extraction process was carried out using the Python library My-Voice Analysis (https://github.com/Shahabks/my-voice-analysis), the voice activity detection of the SIDEKIT software [18], and our own algorithms. The classification was performed with the LDA algorithm, projecting the rhythmic feature vector onto a one-dimensional space where every projected point represents a classification score.
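A rough sketch of this rhythm-based classifier is given below. It assumes an 11-dimensional rhythm/prosody vector has already been extracted per subject (synthetic placeholders stand in for the My-Voice Analysis / SIDEKIT features) and only illustrates the LDA projection onto a one-dimensional score.

```python
# Rough sketch of the rhythm-based classifier: it assumes an 11-dimensional
# rhythm/prosody vector per subject has already been extracted (synthetic
# placeholders here instead of the My-Voice Analysis / SIDEKIT pipeline).
# LDA projects each vector onto one dimension and the projection is the score.
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(1)
X = rng.normal(size=(108, 11))            # one rhythm feature vector per subject
y = rng.integers(0, 2, size=108)          # 1 = AD, 0 = non-AD

lda = LinearDiscriminantAnalysis(n_components=1)
scores = lda.fit_transform(X, y).ravel()  # one-dimensional projections = scores
labels = lda.predict(X)                   # hard decisions, if needed
print(scores[:5], labels[:5])
```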
3. Text-based systems
Two different text-based approaches were analysed. The first one is a sequential model based on deep learning, which classifies directly from the sequence of GloVe word embeddings. The second one uses the manual transcripts to extract linguistic information that forms the input features of a classifier.
3.1. Preprocessing

The transcripts contained in the dataset are in CHAT format [19], which facilitates speech annotation and analysis. They include the transcript of both the subject and the investigator in charge of the test, as well as additional non-speech annotations: times, researcher interventions, errors, and morphological or syntactic analysis.

Transcripts are divided into several interventions, i.e. sentences or parts of complete sentences with meaning, and this granularity has been maintained in the preprocessing. Metadata, researcher interventions and the linguistic analysis included in the files have been removed. We keep only the subject's words as input for the classifiers, as this better reflects the real-world task of detecting Alzheimer's based only on speech analysis.

3.2. Sequential model

The first approach consists of training an RNN at intervention level, by classifying whether a given sentence belongs to an AD subject. Then, the probability that the subject has AD given all his/her interventions is computed as the mean of the probabilities of each intervention. By using the interventions as independent samples, the training dataset is increased to a size of 1492 records from the 108 subjects. In the training process, words are tokenized and interventions are padded up to 20 tokens.

The RNN is composed of 3 layers: an Embedding layer with pre-trained weights from the GloVe 50-dimensional representation of words, a long short-term memory (LSTM) layer with 4 units, and an output layer with 1 unit and sigmoid activation. Additionally, dropout at a 10% rate is applied to the Embedding and LSTM layers for regularization, to prevent overfitting. The model is trained for 10 epochs, using a batch size of 16 and Adam optimization.
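A minimal Keras sketch of this sequential model is shown below, following the description above (GloVe 50-dimensional embeddings, a 4-unit LSTM, 10% dropout, sigmoid output, 10 epochs, batch size 16, Adam). The vocabulary size, the random stand-in for the GloVe matrix, the toy token sequences and the binary cross-entropy loss are assumptions, not details taken from the paper.

```python
# Sketch of the intervention-level RNN described above: GloVe 50-d embeddings,
# an LSTM with 4 units, 10% dropout on the Embedding and LSTM layers, a sigmoid
# output, 10 epochs, batch size 16 and Adam. The vocabulary size, the random
# stand-in for the GloVe matrix, the toy token sequences and the binary
# cross-entropy loss are illustrative assumptions.
import numpy as np
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Embedding, Dropout, LSTM, Dense

VOCAB_SIZE, EMB_DIM, MAX_LEN = 5000, 50, 20

# Placeholder for the pretrained GloVe matrix (one 50-d row per token id).
glove_weights = np.random.normal(size=(VOCAB_SIZE, EMB_DIM)).astype("float32")

model = Sequential([
    Embedding(VOCAB_SIZE, EMB_DIM),
    Dropout(0.1),                      # dropout after the embedding layer
    LSTM(4, dropout=0.1),              # 4-unit LSTM with 10% input dropout
    Dense(1, activation="sigmoid"),    # P(intervention belongs to an AD subject)
])
model.build(input_shape=(None, MAX_LEN))
model.layers[0].set_weights([glove_weights])   # load the pretrained embeddings

model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# Toy padded token sequences (one row per intervention) and their labels.
X = np.random.randint(0, VOCAB_SIZE, size=(1492, MAX_LEN))
y = np.random.randint(0, 2, size=(1492,))
model.fit(X, y, epochs=10, batch_size=16, verbose=0)

# Subject-level score: mean of the intervention-level probabilities.
p_ad = float(model.predict(X[:10], verbose=0).mean())
print(f"P(AD | interventions) = {p_ad:.3f}")
```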
3.3. Linguistic features model

In this approach, several linguistic features and indicators are built from the subject's interventions and, using this feature vector as input, an SVM is trained. Unlike the previous method, the classification here is performed at subject level, by considering the full transcript of each participant.

Previous work [20] has shown that certain linguistic features are useful for detecting AD using the Cookie Theft test. Here, 13 features have been built, grouped into 4 categories:

• Extension information, such as the number of interventions, number of words per intervention and mean word length.
• Vocabulary richness, measured as the number of unique words used by the subject.
• Presence of key informational concepts: kitchen, mother, stool, boy and girl.
• Frequency of verbs, nouns, adjectives and pronouns from POS tagging.

Each feature is then rescaled with min-max normalization into the range [0, 1] and an SVM with radial basis function (RBF) kernel and C = 1.0 is trained, whose output is the probability that the subject has AD given the 13-dimensional feature vector.
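The sketch below shows one way such a 13-dimensional feature vector and the RBF SVM could be assembled. The exact feature definitions are simplified relative to the description above, the POS-tag frequencies are left as zero placeholders (a POS tagger would fill them in), and the training data are synthetic.

```python
# Sketch of the linguistic-features system: a 13-dimensional vector per subject
# (extension, vocabulary richness, key concepts, POS frequencies), min-max
# scaling and an RBF SVM with C = 1.0. The exact feature definitions are
# simplified, the POS frequencies are left as zeros (a POS tagger would fill
# them in), and the training data below are synthetic.
import numpy as np
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

KEY_CONCEPTS = ["kitchen", "mother", "stool", "boy", "girl"]

def subject_features(interventions):
    """Build a 13-dim feature vector from a subject's list of interventions."""
    words = [w.lower() for s in interventions for w in s.split()]
    n_words = max(len(words), 1)
    pos_freqs = [0.0, 0.0, 0.0, 0.0]            # verbs, nouns, adjectives, pronouns
    return [
        len(interventions),                      # number of interventions
        n_words / max(len(interventions), 1),    # words per intervention
        float(np.mean([len(w) for w in words])) if words else 0.0,  # mean word length
        len(set(words)) / n_words,               # vocabulary richness
        *[float(c in words) for c in KEY_CONCEPTS],  # presence of key concepts
        *pos_freqs,                              # POS-tag frequencies (placeholder)
    ]

print(subject_features(["the boy is on the stool", "he takes a cookie"]))

# Training sketch: one feature vector per subject, synthetic here.
rng = np.random.default_rng(4)
X = rng.uniform(size=(108, 13))
y = rng.integers(0, 2, size=108)

scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(X)              # rescale each feature to [0, 1]
clf = SVC(kernel="rbf", C=1.0, probability=True).fit(X_scaled, y)
print("P(AD) for first subject:", clf.predict_proba(X_scaled[:1])[0, 1])
```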
4. Experimental framework
The training dataset [8] consists of the recordings and manual transcripts of 108 subjects performing the test known as Cookie Theft, whose objective is to describe an image. Out of the 108 participants, 54 are patients diagnosed with Alzheimer's. Table 2 shows the training data distribution in detail.
Table 2: ADReSS training dataset (number of male and female AD and non-AD participants per five-year age interval, from [50, 55) to [75, 80))

The evaluation metrics for the AD classification task are:
Accuracy = (TN + TP) / N,   Precision π = TP / (TP + FP),   Recall ρ = TP / (TP + FN),   F1 = 2πρ / (π + ρ),

where N is the number of patients, and TP, TN, FP and FN are the number of true positives, true negatives, false positives and false negatives, respectively.

All the systems have been trained using a leave-one-subject-out (LOSO) cross-validation strategy to measure the generalization error. Therefore, each model uses 107 subjects as training data and is validated on the held-out subject.
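A minimal sketch of the LOSO evaluation loop and of these metrics, using scikit-learn with synthetic per-subject features and a linear SVM standing in for any of the systems above:

```python
# Minimal sketch of the leave-one-subject-out evaluation and of the metrics
# defined above, using scikit-learn; the linear SVM and the synthetic
# per-subject features stand in for any of the systems in this paper.
import numpy as np
from sklearn.model_selection import LeaveOneOut
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

rng = np.random.default_rng(2)
X = rng.normal(size=(108, 20))     # one feature vector per subject
y = rng.integers(0, 2, size=108)   # 1 = AD, 0 = non-AD

y_pred = np.empty_like(y)
for train_idx, test_idx in LeaveOneOut().split(X):
    model = SVC(kernel="linear").fit(X[train_idx], y[train_idx])
    y_pred[test_idx] = model.predict(X[test_idx])

print("accuracy :", accuracy_score(y, y_pred))
print("precision:", precision_score(y, y_pred, zero_division=0))
print("recall   :", recall_score(y, y_pred, zero_division=0))
print("F1       :", f1_score(y, y_pred, zero_division=0))
```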
5. Results
This section presents the results of the AD classification task for both the leave-one-subject-out (LOSO) and test settings of the ADReSS challenge.
Table 3 shows the individual results achieved by the four speech-based systems. They indicate that the x-vector, functionals and fluency-based systems have similar performance in terms of accuracy as well as of area under the curve (AUC). The i-vector system was the weakest model, being the only one with an accuracy under 70%, although it is still above the challenge baseline results [8].
Table 3: Results of the proposed speech-based systems

System       Class    Precision  Recall  F1 Score  Accuracy  AUC
i-vector     non-AD   0.7058     0.6666  0.6857    0.6944    0.6806
             AD       0.6842     0.7222  0.7027
x-vector     non-AD   0.7400     0.6851  0.7115    0.7222    0.7615
             AD       0.7068     0.7592  0.7321
functionals  non-AD   0.7454     0.7592  0.7522    0.7500    0.7435
             AD       0.7547     0.7407  0.7476
fluency      non-AD   0.7450     0.7037  0.7238    0.7314    0.7613
             AD       0.7192     0.7592  0.7387
The results for both text-based systems are shown in Table 4. Concerning the RNN model, the AUC on the held-out set is 0.8563. For a selected threshold probability, we also obtain an accuracy of 0.7407, a recall of 0.8704 and an F1-score of 0.7705; the false negative rate is 13%. For the linguistic model, the AUC on the held-out set is 0.7510. For a selected threshold probability, the accuracy is 0.6852, the recall 0.7037 and the F1-score 0.6909; the false negative rate is 30%.
Table 4: Results of the proposed text-based systems

System            Class    Precision  Recall  F1 Score  Accuracy  AUC
RNN model         non-AD   0.7424     0.9074  0.8166    0.7962    0.8563
                  AD       0.8809     0.6851  0.7708
Linguistic model  non-AD   0.7017     0.7407  0.7207    0.7129    0.7510
                  AD       0.7254     0.6851  0.7047
Moreover, two different score fusion strategies were carried out. In the first one (referred to as Fusion I), the scores of every system were normalized by z-norm and merged by averaging. In the second one (referred to as Fusion II), the same normalization strategy was used, but the averages were computed over the speech-based and text-based system scores separately; the two resulting scores were then merged, again by averaging. Figure 2 shows the ROC curves of the described fusion systems. The Fusion I model achieved an accuracy of 0.8056, an F1-score of 0.8073 and an AUC of 0.8930. On the other hand, the Fusion II model presented an accuracy of 0.8241, an F1-score of 0.8190 and an AUC of 0.9023. These results show that fusing the text-based and speech-based modalities improves the detection of AD. For further insight, Figures 1 and 2 compare the ROC curves of all individual systems and of their fusions, respectively.
Figure 1: ROC curve of all systems
Figure 2: ROC curve of fusion
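The sketch below illustrates the two fusion strategies under the assumption of one score per subject and per system: Fusion I z-normalises all system scores and averages them, while Fusion II averages within the speech and text modalities first and then averages the two modality scores. The toy scores are synthetic.

```python
# Sketch of the two score-fusion strategies: Fusion I z-normalises every
# system's scores and averages them all; Fusion II averages within the speech
# and text modalities first and then averages the two modality scores. One
# score per subject and per system is assumed; the toy scores are synthetic.
import numpy as np

def znorm(s):
    return (s - s.mean()) / s.std()

def fusion_i(system_scores):
    """Average the z-normalised scores of all systems."""
    return np.mean([znorm(s) for s in system_scores], axis=0)

def fusion_ii(speech_scores, text_scores):
    """Average within each modality, then average the two modality scores."""
    speech = np.mean([znorm(s) for s in speech_scores], axis=0)
    text = np.mean([znorm(s) for s in text_scores], axis=0)
    return (speech + text) / 2.0

rng = np.random.default_rng(3)
speech = [rng.normal(size=108) for _ in range(4)]   # four speech-based systems
text = [rng.normal(size=108) for _ in range(2)]     # two text-based systems
print(fusion_i(speech + text)[:5])
print(fusion_ii(speech, text)[:5])
```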
Based on the previous results, five systems were submitted to the challenge. Table 5 shows the performance they achieved on the test dataset, which consists of recordings of 48 subjects.

The results achieved by the RNN model were very similar in training and testing, which is a good sign of the absence of overfitting in the training stage, a consequence of the 10% dropout applied to the Embedding and LSTM layers. However, the submitted speech-based systems showed a significant decrease in performance. This decline is likely a consequence of poor generalization capacity. As a result, the performance of the fusion systems also declines.

Table 5: Results on the testing dataset

System      Class    Precision  Recall  F1 Score  Accuracy
Fusion II   non-AD   0.8000     0.6667  0.7273    0.7500
            AD       0.7143     0.8333  0.7692
Fusion I    non-AD   0.8235     0.5833  0.6829    0.7292
            AD       0.6774     0.8750  0.7636
RNN model   non-AD   0.6667     1.0000  0.8000    0.7500
            AD       1.0000     0.5000  0.6667
fluency     non-AD   0.6087     0.5833  0.5957    0.6042
            AD       0.6000     0.6250  0.6122
x-vector    non-AD   0.5553     0.4167  0.4762    0.5417
            AD       0.5333     0.6667  0.5926
6. Conclusions and Further work
Two lines of work have been developed on the ADReSS Challenge dataset: one based on speech processing and the other based on text processing.
It is important to highlight the following points when assessing both solutions:

• Performance: the x-vector, functionals and fluency speech features showed competitive performance in the LOSO cross-validation. However, their performance decreases in the test setting, which would be a result of the low level of generalization achieved at the training stage. On the other hand, the text-based RNN model obtained superior results in both the LOSO and test settings, although with unbalanced recall and precision values in the latter. Nevertheless, the Fusion II strategy was a solution to that problem, presenting the same accuracy as the RNN model but with more balanced measures. As a result, multimodal systems prove to be a better strategy for AD detection than individual systems.

• Complexity: the x-vector and RNN systems, as deep-learning-based models, need a large amount of training data to be able to generalize well.

• Explainability: deep learning models are black-box models and are very difficult to interpret from a human perspective. On the contrary, the linguistic, fluency, functionals and i-vector features created here are explainable, and one could determine their weight in the classification.

Several future research lines have been identified. Firstly, investigating improved classifiers based on deep neural networks and novel acoustic feature extraction algorithms. Secondly, adding new linguistic features and non-verbal information (pauses, silence duration, word mistakes, etc.) to the text-based systems. Thirdly, analysing strategies to increase the generalization capacity of the proposed speech-based systems, for example by finding new rhythmic parameters that better discriminate between patients with and without AD. It is also important to conduct experiments in other experimental settings, for example using other questions of the mini-mental state examination test, to validate the results obtained and, above all, to increase the size of the datasets.
7. Acknowledgements
This work has received financial support from the Spanish Ministerio de Economia y Competitividad through the project Speech&Sign (RTI2018-101372-B-100), and also from the Xunta de Galicia (AtlanTTic and ED431B 2018/60 grants) and the European Regional Development Fund (ERDF).

8. References

[1] Alzheimer's Association, "2019 Alzheimer's disease facts and figures," Alzheimer's & Dementia, vol. 15, no. 3, pp. 321–387, 2019.
[2] P. J. Nestor, P. Scheltens, and J. R. Hodges, "Advances in the early detection of Alzheimer's disease," Nature Medicine, vol. 10, no. 7, pp. S34–S41, 2004.
[3] S. Ahmed, A.-M. F. Haigh, C. A. de Jager, and P. Garrard, "Connected speech as a marker of disease progression in autopsy-proven Alzheimer's disease," Brain, vol. 136, no. 12, pp. 3727–3737, 2013.
[4] J. J. G. Meilán, F. Martínez-Sánchez, J. Carro, D. E. López, L. Millian-Morell, and J. M. Arana, "Speech in Alzheimer's disease: Can temporal and acoustic parameters discriminate dementia?" Dementia and Geriatric Cognitive Disorders, vol. 37, no. 5-6, pp. 327–334, 2014.
[5] K. López-de-Ipiña, J. B. Alonso, J. Solé-Casals, N. Barroso, P. Henriquez, M. Faundez-Zanuy, C. M. Travieso, M. Ecay-Torres, P. Martinez-Lage, and H. Eguiraun, "On automatic diagnosis of Alzheimer's disease based on spontaneous speech analysis and emotional temperature," Cognitive Computation, vol. 7, no. 1, pp. 44–55, 2015.
[6] K. C. Fraser, J. A. Meltzer, and F. Rudzicz, "Linguistic features identify Alzheimer's disease in narrative speech," Journal of Alzheimer's Disease, vol. 49, no. 2, pp. 407–422, 2016.
[7] J. Pennington, R. Socher, and C. D. Manning, "GloVe: Global vectors for word representation," in Conference on Empirical Methods in Natural Language Processing (EMNLP), vol. 14, 2014, pp. 1532–1543.
[8] S. Luz, F. Haider, S. de la Fuente, D. Fromm, and B. MacWhinney, "Alzheimer's dementia recognition through spontaneous speech: The ADReSS Challenge," in Proceedings of INTERSPEECH 2020, Shanghai, China, 2020. [Online]. Available: https://arxiv.org/abs/2004.06833
[9] N. Dehak, P. J. Kenny, R. Dehak, P. Dumouchel, and P. Ouellet, "Front-end factor analysis for speaker verification," IEEE Transactions on Audio, Speech and Language Processing, 2010.
[10] D. Snyder, D. Garcia-Romero, G. Sell, D. Povey, and S. Khudanpur, "X-vectors: Robust DNN embeddings for speaker recognition," in Proc. IEEE ICASSP, 2018, pp. 5329–5333.
[11] P. Ghahremani, B. BabaAli, D. Povey, K. Riedhammer, J. Trmal, and S. Khudanpur, "A pitch extraction algorithm tuned for automatic speech recognition," in Proc. IEEE ICASSP, 2014, pp. 2494–2498.
[12] D. Povey, A. Ghoshal, G. Boulianne, L. Burget, O. Glembek, N. Goel, M. Hannemann, P. Motlicek, Y. Qian, P. Schwarz, J. Silovsky, G. Stemmer, and K. Vesely, "The Kaldi speech recognition toolkit," 2011.
[13] A. Nagrani, J. S. Chung, W. Xie, and A. Zisserman, "VoxCeleb: Large-scale speaker verification in the wild," Computer Speech and Language, vol. 60, 2020.
[14] F. Eyben, M. Wöllmer, and B. W. Schuller, "openSMILE: the Munich versatile and fast open-source audio feature extractor," in ACM Multimedia. ACM, 2010, pp. 1459–1462.
[15] M. Valstar, B. Schuller, K. Smith, F. Eyben, B. Jiang, S. Bilakhia, S. Schnieder, R. Cowie, and M. Pantic, "AVEC 2013 - the continuous audio/visual emotion and depression recognition challenge," in Proceedings of the 3rd International Workshop on Audio/Visual Emotion Challenge (AVEC'13), 2013.
[16] M. A. Hall, "Correlation-based feature subset selection for machine learning," Ph.D. dissertation, University of Waikato, Hamilton, New Zealand, 1998.
[17] M. Ajili, S. Rossato, D. Zhang, and J.-F. Bonastre, "Impact of rhythm on forensic voice comparison reliability," 2018.
[18] A. Larcher, K. A. Lee, and S. Meignier, "An extensible speaker identification SIDEKIT in Python," in Proc. IEEE ICASSP, 2016, pp. 5095–5099.
[19] B. MacWhinney, "The CHILDES project: tools for analyzing talk," Child Language Teaching and Therapy, vol. 8, 2000.
[20] K. C. Fraser, J. A. Meltzer, and F. Rudzicz, "Linguistic features identify Alzheimer's disease in narrative speech," Journal of Alzheimer's Disease, vol. 49, no. 2, pp. 407–422, 2016.