Exploiting Fully Convolutional Network and Visualization Techniques on Spontaneous Speech for Dementia Detection
Youxiang Zhu, Xiaohui Liang
Department of Computer Science, University of Massachusetts Boston, USA
{Youxiang.Zhu001, Xiaohui.Liang}@umb.edu

Abstract
In this paper, we exploit a Fully Convolutional Network (FCN) to analyze the audio data of spontaneous speech for dementia detection. A fully convolutional network accommodates speech samples with varying lengths, thus enabling us to analyze a speech sample without manual segmentation. Specifically, we first obtain the Mel Frequency Cepstral Coefficient (MFCC) feature map from each participant's audio data and convert the speech classification task on audio data into an image classification task on MFCC feature maps. Then, to solve the data insufficiency problem, we apply transfer learning by adopting a pre-trained backbone Convolutional Neural Network (CNN) model from the MobileNet architecture and the ImageNet dataset. We further build a convolutional layer to produce a heatmap using Otsu's method for visualization, enabling us to understand the impact of the time-series audio segments on the classification results. We demonstrate that our classification model achieves 66.7% accuracy over the testing dataset, higher than the 62.5% accuracy of the baseline model provided in the ADReSS challenge. Through the visualization technique, we can evaluate the impact of audio segments, such as filled pauses from the participants and repeated questions from the investigator, on the classification results.

Index Terms: Alzheimer's disease, MFCC feature map, classification, transfer learning, visualization
1. Introduction
The number of patients with Alzheimer's Disease (AD) over the age of 65 is expected to reach 13.8 million by 2050, placing a huge burden on the public health system [1]. While there is no proven effective treatment for AD, it is important to detect early symptoms of AD so that interventions can be implemented at an early stage. Because screening measures, neuropsychological assessments, and MRI imaging scans are not pragmatic approaches, recent studies have explored spontaneous speech as a practical and low-cost means of early dementia detection. The Pitt corpus [2], one of the largest speech datasets, includes spontaneous speech obtained from a Cookie Theft Picture (CTP) description task. The CTP task has also been explored with computerized agents to automate and mobilize speech collection [3, 4] and in other languages including Mandarin [5, 6], German [7], and Swedish [8]. Other spontaneous speech datasets for AD research include those collected from film-recall tasks [9], story-retelling tasks [10], map-based tasks [11], and human conversations [12].

Researchers have studied linguistic features extracted from transcripts for building classification and regression models. A recent survey showed that effective linguistic features include semantic content, syntax and morphology, pragmatic language, discourse fluency, speech rate, and speech monitoring [13]. The linguistic features were often manually selected based on expert knowledge, and the analysis methods were complex and highly task-dependent. A potential research direction is to automate the linguistic analysis. For example, Croisile et al. manually extracted 23 information units from the picture using language knowledge and found that the analysis based on them was effective in dementia detection [14]. Fraser et al. confirmed that both the initial 23 information units and auto-generated information units are effective in analysis [15]. Yancheva et al. [16] and Fraser et al. [8] further proposed to auto-generate topic models that can recall 97% of the human-annotated information units. Similarly, acoustic-based analysis started with pre-defined features and was recently automated with computational models. Hoffmann et al. considered acoustic features for each utterance [17]. Fraser et al. evaluated the statistical significance of pause and word acoustic features [10]. Tóth et al. considered four descriptors for silent/filled pauses and phonemes [18]. Tóth et al. also implemented a customized automatic speech recognition (ASR) system and automatic feature selection for phones, boundaries, and filled pauses [19, 9]. Haider et al. proposed an automatic acoustic analysis approach using the paralinguistic acoustic features of audio segments [20, 21].

In this paper, we envision an automated speech analysis of the audio data for dementia detection. We observed that Haider et al. segmented the audio data into small pieces by setting the log energy threshold parameter to 65 dB with a maximum duration of 10 seconds [20, 21]. We feel the segmentation may cause critical time-series information loss. Any single small speech segment hardly represents the overall speech sample. In addition, the speech continuity is removed by segmentation, making the model inaccurately capture the time-series characteristics. Thus, our model aims to accommodate a speech sample of each participant as input and preserve the time-series characteristics of the speech samples [22, 23].
Our contributions are as follows. First, we converted a speech classification task on the audio data to an image classification task on the Mel Frequency Cepstral Coefficient (MFCC) feature maps. The feature maps are automatically extracted from the audio data and preserve the time-series characteristics of the speech. Second, we explored the Fully Convolutional Network (FCN) to accommodate speech samples with varying lengths. We employed the transfer learning technique by adopting a pre-trained backbone Convolutional Neural Network (CNN) from the MobileNet architecture and the ImageNet dataset. Compared to the baseline model, ours achieves better accuracy and a more balanced F1 score. Third, we embedded a convolutional layer in our model to enable the visualization of the impact of audio segments on the classification results, thus increasing our understanding of how the classification model works. We found that the visualization technique identifies the filled pauses from the participant and the repeated questions from the investigator as positive signs of AD.
Figure 1: Proposed classification model with transfer learning and visualization.
2. ADReSS Challenge Dataset
We studied the dataset created for the ADReSS challenge [21], which is a part of the Pitt corpus [2], with the numbers of participants balanced for age and gender. The data consists of speech recordings and transcripts of spoken picture descriptions elicited from participants through the Cookie Theft picture from the Boston Diagnostic Aphasia Exam [24, 25]. We studied the full-wave enhanced audio, which contains the audio recordings after noise removal. The training dataset includes speech data from 24 male participants with AD, 30 female participants with AD, 24 male non-AD participants, and 30 female non-AD participants. The ADReSS testing dataset includes speech data from 11 male participants with AD, 13 female participants with AD, 11 male non-AD participants, and 13 female non-AD participants. The complete dataset information can be found in Luz et al. [21].
3. MFCC Feature Maps
Mel-frequency cepstral coefficients have been widely used in speech recognition research [26]. Fraser et al. carried out an acoustic-prosodic analysis on the Pitt corpus using 42 MFCC features [16, 27]. We extracted an MFCC feature map from each participant's entire audio sample. The MFCC feature map is denoted as a (p, t)-matrix, where the hyper-parameter p is set to 64 and t is related to the duration of the speech sample. We used the librosa MFCC function with a sampling rate of 22050 Hz, a window size of 2048, and a step size of 512. In Figure 2, we show the sample MFCC feature maps of participants 001 (non-AD) and 083 (AD), respectively. The data in the first row is scaled for visualization purposes. By extracting the MFCC feature maps, we convert the speech dataset to an image dataset. The advantages of MFCC feature maps are three-fold: i) the conversion from speech to MFCC feature maps can be done automatically; ii) the silent pauses in the audio data are preserved as a distinctive feature in the MFCC feature maps, as shown in Figure 2; iii) we found that the audio dataset contains speech from the investigator and filled pauses from the participant, which are shown to be important [9]. While identifying these audio segments requires expensive human effort or a customized Automatic Speech Recognition (ASR) system, we envision that the MFCC feature maps preserve the time-series structure, and the classification model may continuously learn to deal with these effects.
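For concreteness, the following is a minimal sketch of this extraction step, assuming the librosa and numpy packages; the function name and wrapper are our reconstruction of the parameters stated above, not the authors' code.

```python
import librosa
import numpy as np

def extract_mfcc_map(wav_path: str, n_mfcc: int = 64, sr: int = 22050,
                     n_fft: int = 2048, hop_length: int = 512) -> np.ndarray:
    """Load one participant's audio and return a (p, t) MFCC feature map."""
    # librosa resamples the recording to the requested sampling rate on load
    y, sr = librosa.load(wav_path, sr=sr)
    # p = n_mfcc rows; the number of columns t depends on the speech duration
    return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc,
                                n_fft=n_fft, hop_length=hop_length)

# Example: one feature map per participant; t varies across recordings
# feature_map = extract_mfcc_map("participant.wav")  # shape (64, t)
```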
4. Classification Model
We aim to design a classification model to classify the audio samples into the non-AD and AD groups. After converting each audio sample to an MFCC feature map, we focus on developing an image classification model. To improve the learning effectiveness over the small audio dataset, we apply transfer learning using ImageNet and MobileNet. In the following, we first explain the transfer learning technique and then introduce our model. An overview of our model is shown in Figure 1.

Figure 2: Sample MFCC feature maps of participants 001 (non-AD) and 083 (AD).
We developed a transfer learning technique using the knowledge from image datasets and pre-trained image classification models to overcome the insufficiency of the audio dataset.
ImageNet is an image dataset organized according to the WordNet hierarchy [28]. Each meaningful concept in WordNet, possibly described by multiple words or word phrases, is called a "synset." There are more than 100,000 synsets in WordNet, the majority of which are nouns (80,000+). ImageNet provides, on average, 1000 images to illustrate each synset. Images of each concept are quality-controlled and human-annotated. The ImageNet dataset has been widely used in designing and evaluating image classification models [29].
MobileNet is a lightweight network architecture that significantly reduces the computational overhead as well as the parameter size by replacing the standard convolution filters with depth-wise convolutional filters and point-wise convolutional filters [30]. The total parameters of the MobileNet backbone take 17.2 MB, significantly less than other convolutional neural networks. Considering the limited size of the speech dataset, we thought a smaller model with less complexity, such as MobileNet, was worth testing. The MobileNet architecture is shown in the upper part of Figure 1. With an RGB image as input, the output is the probability that the image belongs to each of the 1000 classes. Denote the input image as a 3-dimensional (h, w, 3)-matrix, where h is the height, w is the width, and 3 represents the RGB channels. A backbone CNN consists of a set of convolution, pooling, and activation operations. We used the full-width (1.0) MobileNet backbone pre-trained at a resolution of 128*128 images. The detailed architecture can be found in the paper [30]. The backbone converts an input (h, w, 3)-matrix to an output (h', w', 1024)-matrix, where (h', w') are functionally related to (h, w), and 1024 represents the feature channel number, i.e., the depth of the backbone CNN. The output (h', w', 1024)-matrix is then fed to a Global Average Pooling (GAP) layer to reduce the dimensions h' and w' and obtain a 1024-dimension feature. A Fully Connected (FC) layer with 1000 neurons is employed to produce the output according to the wanted 1000 classes. Lastly, a softmax activation layer is added to produce the classification results as probabilities for the 1000 classes that add up to 1. The pre-training of MobileNet is time-consuming and may take weeks due to the large ImageNet dataset. The pre-trained parameters of the backbone CNN from MobileNet are made available, though. We used these parameters and saved time on the pre-training.

Our proposed model is shown in the lower part of Figure 1. Our FCN architecture employs the pre-trained backbone CNN module from MobileNet. Denote the MFCC feature map of the audio sample as a (p, t, 1)-matrix, where p is a hyper-parameter set to 64 and t is related to the duration of the speech sample. To match the module input, i.e., an RGB image, we duplicated the MFCC feature map twice and made the MFCC feature map a (p, t, 3)-matrix. In this way, we can feed the MFCC feature map into the backbone CNN module of MobileNet in the same way as an RGB image. The output of the backbone CNN is denoted as a (p', t', 1024)-matrix, where (p', t') are functionally related to (p, t). We employed a GAP-1D (one-dimensional) layer to reduce the p' dimension of the matrix. The t' dimension is preserved to enable the visualization. We further used a 1D convolutional layer with 2 neurons to adapt to the wanted 2 classes. The output of the 1D convolutional layer is used to build a 1D heatmap for visualization. Finally, we added another GAP-1D layer to reduce the t' dimension and a softmax activation layer to produce the classification results as two probabilities for the two classes that add up to 1.
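The following is a minimal Keras sketch of this architecture, assuming tf.keras with ImageNet-pretrained MobileNet weights; the kernel size of the 1D convolutional layer and the exact layer wiring are assumptions where the paper does not specify them.

```python
import tensorflow as tf

def build_fcn(num_classes: int = 2) -> tf.keras.Model:
    # Accept MFCC "images" of any height/width; 3 channels from duplication
    inputs = tf.keras.Input(shape=(None, None, 3))
    # Full-width (1.0) MobileNet backbone with ImageNet weights, no FC head
    backbone = tf.keras.applications.MobileNet(
        include_top=False, weights="imagenet",
        input_shape=(None, None, 3), alpha=1.0)
    x = backbone(inputs)                                   # (p', t', 1024)
    # GAP-1D over the p' dimension, keeping the time dimension t'
    x = tf.keras.layers.Lambda(lambda f: tf.reduce_mean(f, axis=1))(x)
    heat = tf.keras.layers.Conv1D(num_classes, 1)(x)       # (t', 2), used for the heatmap
    x = tf.keras.layers.GlobalAveragePooling1D()(heat)     # GAP-1D over t' -> (2,)
    outputs = tf.keras.layers.Softmax()(x)
    return tf.keras.Model(inputs, outputs)

model = build_fcn()
model.compile(optimizer=tf.keras.optimizers.RMSprop(learning_rate=1e-5),
              loss="categorical_crossentropy", metrics=["accuracy"])
```

Because all layers after the backbone are convolutional or pooling operations, the model accepts feature maps of any length, which is the property exploited at test time.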
5. Evaluation
We implemented the classification model with Keras and TensorFlow. We used mini-batches at each training step and a very small learning rate of 1e-5, while minimizing the cross-entropy loss with the RMSProp optimizer [31]. As the samples in a mini-batch are required to have the same size, we used zero-paddings to pad the samples such that their lengths are equal to the maximum length in the mini-batch. The zero-paddings have limited impact on the classification task because i) they can be easily distinguished from non-zero pauses and ii) the employed GAP and softmax layers produce averages over the height and width dimensions and relative values. We further confirmed with our visualization technique that zero-paddings in training produced minimal impact on the classification results. In the testing phase, we treated each testing sample as a mini-batch with a batch size of one, such that our classification model can take samples of any length.

Due to the limited audio dataset, we designed a set of training strategies as follows. First, we split the provided speech dataset into two halves of equal size. We used one half for training and the other half for validation. We trained our model for a maximum of 1000 epochs and selected the epoch with the highest validation accuracy after the model converges. We performed such training twice by switching the training and validation datasets. We thus obtained two models (M1, M2) that complement each other. Note that k-fold cross-validation is a classical evaluation strategy. Here, we chose k = 2 for a relatively large validation dataset, because a too small validation dataset may not reflect the overall data distribution. A larger validation dataset enables us to better control the learning rate for our model to produce a stable classification accuracy on the validation dataset. We did not use "leave-one-subject-out (LOSO)" because i) LOSO is not suitable for training deep neural network models in terms of computational efficiency; ii) we focus on learning the parameters of the deep neural networks, not model selection; and iii) we focus on the results on the provided test dataset. Second, we merged the above two models (M1, M2) into M3 by averaging the output probabilities of (M1, M2). M3 takes advantage of the entire dataset. Third, we used all the speech samples to train a model M4, where we selected the epoch with the minimum training loss instead of the maximum validation accuracy. Lastly, we merged three of the above models into a model M5 by adding their output probabilities. Note that models (M1, M2, M3) mainly focus on improving the validation accuracy. This strategy is usually adopted when the training dataset is small. Models (M4, M5) consider minimizing the training loss, which is a general approach for deep learning where a large training dataset is available.

We trained and tested our models using two different datasets, both of which are provided by the ADReSS challenge. The testing dataset was provided after the models were trained. Our models output a binary result, non-AD or AD. The evaluation metrics are accuracy = (TN + TP) / N, precision π = TP / (TP + FP), recall ρ = TP / (TP + FN), and F1 score = 2πρ / (π + ρ), where N is the number of participants, and TP, FP, and FN are the numbers of true positives, false positives, and false negatives, respectively.
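The following is a minimal sketch of the zero-padding step described above, assuming numpy; the helper name and batch layout are illustrative assumptions.

```python
import numpy as np

def pad_batch(feature_maps):
    """Zero-pad variable-length (p, t, 3) MFCC maps to the max t in the mini-batch."""
    max_t = max(m.shape[1] for m in feature_maps)
    batch = np.zeros((len(feature_maps), feature_maps[0].shape[0], max_t, 3),
                     dtype=np.float32)
    for i, m in enumerate(feature_maps):
        batch[i, :, :m.shape[1], :] = m   # columns beyond t stay zero-padded
    return batch

# Training: each mini-batch is padded to its own maximum length.
# Testing: a single sample forms its own batch, so no padding is needed.
```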
Table 1: Classification results
Model          Class    Prec.  Recall  F1    Acc.
M1 (val.)      non-AD   0.68   0.54    0.60  0.646
               AD       0.62   0.75    0.68
M2 (val.)      non-AD   0.58   0.58    0.58  0.583
               AD       0.58   0.58    0.58
M3             non-AD   0.63   0.79    0.70  0.667
               AD       0.72   0.54    0.62
M4 (loss)      non-AD   0.63   0.71    0.67  0.646
               AD       0.67   0.58    0.62
M5             non-AD   0.59   0.67    0.63  0.604
               AD       0.62   0.54    0.58
Baseline [21]  non-AD   0.67   0.50    0.57  0.625
               AD       0.60   0.75    0.67

Our models (M1, M2) achieve 64.6% and 58.3% accuracy, respectively, as shown in Table 1. We found these results consistent with the validation accuracies of 62.96% and 61.11% obtained in the training phase. After combining the outputs from M1 and M2, our model M3 achieves the highest accuracy, 66.67%, of our five attempts. We consider the model M3 relatively successful as it outperformed M1, M2, and the baseline model with 62.5%. We think this performance gain of M3 is obtained because it considers all samples in training and inherits the knowledge from the image classification model via transfer learning. Our models M4 and M5 achieve 64.6% and 60.4% accuracy, respectively. Without the validation step in training, these models focus on minimizing the training loss and need more data to improve accuracy. Both models M3 and M4 used all the samples in the training phase, but M3 achieves a higher accuracy than M4. One possible explanation is that in M3, after splitting the training samples into two equal halves, the sample-wise differences in each half become smaller. When M3 merges the outputs of M1 and M2, it simply chooses the model with higher confidence and thus produces a higher accuracy. In general, as the training samples are limited in size and have large sample-wise differences, our model may largely modify the parameters of the pre-trained backbone CNN from MobileNet, resulting in a degeneration of the discriminative ability of the pre-trained model and yielding overfitting. An enhanced data splitting method may help. Note that we currently split the training dataset into two equal-sized halves in a random way, and we envision that a CNN feature-based splitting method may enhance the performance. At last, we found that our five models achieve more balanced F1 scores compared to the baseline model [21].

Figure 3: Visualizing the impacts of audio segments. Utterances (a, b, c) are from participant 001 (non-AD) and (d, e, f, g) are from participant 079 (AD):
(a) *PAR: and &uh she's getting her feet wet from the overflow of the water from the sink .
(b) *PAR: she seems to be oblivious to the fact that the &s sink is overflowing .
(c) *INV: tell me everything that you see going on in that picture .
(d) *PAR: &uh and he's [/] he's in the &c &t cookie jar .
(e) *PAR: and she's [//] &uh &w &uh &h she has [/] &uh has +/ .
(f) *INV: you see going on in the picture .
(g) *INV: okay anything else ?
6. Visualization
One significant contribution of our models is to enable the visualization of the impacts of the audio segments on the classification results. As shown in Figure 1, our model incorporates a 1D convolutional layer with 2 neurons that converts a (t', 1024)-matrix to a (t', 2)-matrix, where t' is functionally related to the time t. For the dimension of size 2, the first row represents the non-AD class, and the second row represents the AD class. We chose the second row (or the first row) and used Otsu's thresholding method to evaluate the impact scores over the time dimension. Otsu's method performs automatic image thresholding [32]. In its simplest form, the algorithm returns a single intensity threshold that separates pixels into two classes, foreground and background. This threshold is determined by minimizing the intra-class intensity variance, or equivalently, by maximizing the inter-class variance. In our visualization module, we used Otsu's method to assign either 0 to smaller values (dark color) or 1 to larger values (yellow color). We also used the nearest-neighbor interpolation technique to scale the vector from size t' to size t.

Figure 3 shows the visualization bar for seven utterances. In general, for non-AD samples, the dark segments contribute more to the non-AD result compared to the yellow segments; for AD samples, the yellow segments contribute more to the AD result compared to the dark segments. Note that Otsu's method produces both dark and yellow segments for both non-AD and AD samples. We have two observations. First, from (a) and (b), dark segments represent quality speech and contribute to the non-AD results; from (d) and (e), yellow segments represent filled pauses and unclear speech and contribute to the AD results. Second, we have an interesting observation on the investigator's audio data, which is mixed with the participant's audio data. From (c), the investigator's speech mixed with the non-AD sample shows yellow, contributing to the AD result as noise. From (f), the investigator's speech mixed with the AD sample shows dark, contributing to the non-AD result as noise. More importantly, from (g), the investigator's speech from the AD sample shows yellow, contributing to the AD result. By cross-checking the transcripts, we found that our model may capture the similar utterance that the investigators used to push the AD participants for more conversation and consider this utterance as a positive sign for the AD result.
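The following is a minimal sketch of the heatmap binarization and upscaling, assuming scikit-image for Otsu's threshold and numpy; the variable ad_row stands for the AD row of the (t', 2) convolutional output and is an assumed name.

```python
import numpy as np
from skimage.filters import threshold_otsu

def binarize_and_upscale(ad_row: np.ndarray, t: int) -> np.ndarray:
    """Binarize the AD-row heatmap with Otsu's threshold, then scale from t' to t."""
    thresh = threshold_otsu(ad_row)
    binary = (ad_row > thresh).astype(np.float32)   # 1 = yellow (AD-leaning), 0 = dark
    # Nearest-neighbor mapping from t' heatmap positions to t time steps
    idx = np.round(np.linspace(0, len(binary) - 1, num=t)).astype(int)
    return binary[idx]
```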
7. Discussion
Data augmentation.
As the training samples are limited, we exploited possible data augmentation techniques. Common image augmentation techniques include rotating or zooming in/out of the images for additional samples. However, these techniques do not apply to the MFCC feature maps due to the different meanings of their different dimensions. Thus, we considered another data augmentation technique, i.e., randomly masking certain periods of an MFCC feature map with zeros. The mask is randomly generated at different positions for every epoch with a length of 200 to 400 units. One advantage of this data augmentation method is its consistency with the data representation of our model input, where zero-paddings were adopted in the mini-batch implementation. However, our attempts did not introduce significant accuracy gains.
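The following is a minimal sketch of this random time-masking, assuming numpy; applying it to a (p, t) MFCC feature map before channel duplication is our assumption about where the mask is inserted.

```python
import numpy as np

def random_time_mask(mfcc: np.ndarray, min_len: int = 200, max_len: int = 400,
                     rng=None) -> np.ndarray:
    """Zero out a randomly placed span of 200-400 frames in a (p, t) MFCC map."""
    if rng is None:
        rng = np.random.default_rng()
    p, t = mfcc.shape
    mask_len = min(int(rng.integers(min_len, max_len + 1)), t)  # guard short samples
    start = int(rng.integers(0, t - mask_len + 1))
    augmented = mfcc.copy()
    augmented[:, start:start + mask_len] = 0.0    # masked span mimics zero-padding
    return augmented
```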
Heterogeneous speech dataset.
The limited speech dataset is the biggest barrier in this research. Researchers have proposed different models and tested them over different datasets, resulting in loosely connected conclusions. The ADReSS challenge is a significant effort to bring researchers together to study the same dataset for producing more meaningful results. Our computational model is fully automated and has the potential to apply to any speech and even multilingual speech. While researchers have explicitly identified silent pauses, filled pauses, and speech duration for building classification models, we envision that these features are preserved in the feature maps, and as more datasets become available, our computational model will self-adapt to both explicit and implicit acoustic features.

8. Conclusions
We proposed a classification model to analyze audio data for dementia detection. Our model employs a fully convolutional network to accommodate audio samples with varying lengths and preserve their time-series characteristics. We extracted the MFCC feature maps from the audio data and converted the speech classification task to an image classification task. We then applied the transfer learning technique to adopt a pre-trained model from the MobileNet architecture. Our model achieves higher accuracy than the baseline model. Finally, we implemented a visualization technique to provide intuitive visual feedback on the impacts of the audio segments on the classification results. We envision that our computational model can be applied to other speech datasets and has the potential to be continuously enhanced with deep learning techniques.
9. Acknowledgements
This research is funded by the US National Institutes of Health, National Institute on Aging, under grant No. 1R01AG067416.
10. References
[2] J. T. Becker, F. Boller, O. L. Lopez, J. Saxton, and K. L. McGonigle, "The natural history of Alzheimer's disease: Description of study cohort and accuracy of diagnosis," Archives of Neurology, vol. 51, no. 6, pp. 585–594, 1994.
[3] B. Mirheidari, D. Blackburn, K. Harkness, T. Walker, A. Venneri, M. Reuber, and H. Christensen, "An avatar-based system for identifying individuals likely to develop dementia," in Interspeech 2017. ISCA, 2017, pp. 3147–3151.
[4] B. Mirheidari, Y. Pan, T. Walker, M. Reuber, A. Venneri, D. Blackburn, and H. Christensen, "Detecting Alzheimer's disease by estimating attention and elicitation path through the alignment of spoken picture descriptions with the picture prompt," arXiv preprint arXiv:1910.00515, 2019.
[5] T. Wang, Q. Yan, J. Pan, F. Zhu, R. Su, Y. Guo, L. Wang, and N. Yan, "Towards the speech features of early-stage dementia: Design and application of the Mandarin elderly cognitive speech database," Proc. Interspeech 2019, pp. 4529–4533, 2019.
[6] Y.-W. Chien, S.-Y. Hong, W.-T. Cheah, L.-H. Yao, Y.-L. Chang, and L.-C. Fu, "An automatic assessment system for Alzheimer's disease based on speech using feature sequence generator and recurrent neural network," Scientific Reports, vol. 9, no. 1, pp. 1–10, 2019.
[7] C. Sattler, H.-W. Wahl, J. Schröder, A. Kruse, P. Schönknecht, U. Kunzmann, T. Braun, C. Degen, I. Nitschke, W. Rahmlow, P. Rammelsberg, J. Siebert, B. Tauber, B. Wendelstein, and A. Zenthöfer, Interdisciplinary Longitudinal Study on Adult Development and Aging (ILSE), 01 2015, pp. 1–10.
[8] K. C. Fraser, N. Linz, B. Li, K. L. Fors, F. Rudzicz, A. König, J. Alexandersson, P. Robert, and D. Kokkinakis, "Multilingual prediction of Alzheimer's disease through domain adaptation and concept-based language modelling," in Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), 2019, pp. 3659–3670.
[9] L. Tóth, I. Hoffmann, G. Gosztolya, V. Vincze, G. Szatlóczki, Z. Bánréti, M. Pákáski, and J. Kálmán, "A speech recognition-based solution for the automatic detection of mild cognitive impairment from spontaneous speech," Current Alzheimer Research, vol. 15, no. 2, pp. 130–138, 2018.
[10] K. C. Fraser, F. Rudzicz, N. Graham, and E. Rochon, "Automatic speech recognition in the diagnosis of primary progressive aphasia," in Proceedings of the Fourth Workshop on Speech and Language Processing for Assistive Technologies, 2013, pp. 47–54.
[11] S. de la Fuente Garcia, C. W. Ritchie, and S. Luz, "Protocol for a conversation-based analysis study: PREVENT-ED investigates dialogue features that may help predict dementia onset in later life," BMJ Open, vol. 9, no. 3, p. e026254, 2019.
[12] B. Mirheidari, D. Blackburn, T. Walker, M. Reuber, and H. Christensen, "Dementia detection using automatic analysis of conversations," Computer Speech & Language, vol. 53, pp. 65–79, 2019.
[13] K. D. Mueller, B. Hermann, J. Mecollari, and L. S. Turkstra, "Connected speech and language in mild cognitive impairment and Alzheimer's disease: A review of picture description tasks," Journal of Clinical and Experimental Neuropsychology, vol. 40, no. 9, pp. 917–939, 2018.
[14] B. Croisile, B. Ska, M.-J. Brabant, A. Duchene, Y. Lepage, G. Aimard, and M. Trillet, "Comparative study of oral and written picture description in patients with Alzheimer's disease," Brain and Language, vol. 53, no. 1, pp. 1–19, 1996.
[15] K. C. Fraser, K. L. Fors, and D. Kokkinakis, "Multilingual word embeddings for the assessment of narrative speech in mild cognitive impairment," Computer Speech & Language, vol. 53, pp. 121–139, 2019.
[16] M. Yancheva and F. Rudzicz, "Vector-space topic models for detecting Alzheimer's disease," in Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2016, pp. 2337–2346.
[17] I. Hoffmann, D. Nemeth, C. D. Dye, M. Pákáski, T. Irinyi, and J. Kálmán, "Temporal parameters of spontaneous speech in Alzheimer's disease," International Journal of Speech-Language Pathology, vol. 12, no. 1, pp. 29–34, 2010.
[18] L. Tóth, G. Gosztolya, V. Vincze, I. Hoffmann, G. Szatlóczki, E. Biró, F. Zsura, M. Pákáski, and J. Kálmán, "Automatic detection of mild cognitive impairment from spontaneous speech using ASR," in Sixteenth Annual Conference of the International Speech Communication Association, 2015.
[19] G. Gosztolya, L. Tóth, T. Grósz, V. Vincze, I. Hoffmann, G. Szatlóczki, M. Pákáski, and J. Kálmán, "Detecting mild cognitive impairment from spontaneous speech by correlation-based phonetic feature selection," in INTERSPEECH, 2016, pp. 107–111.
[20] F. Haider, S. De La Fuente, and S. Luz, "An assessment of paralinguistic acoustic features for detection of Alzheimer's dementia in spontaneous speech," IEEE Journal of Selected Topics in Signal Processing, 2019.
[21] S. Luz, F. Haider, S. de la Fuente, D. Fromm, and B. MacWhinney, "Alzheimer's dementia recognition through spontaneous speech: The ADReSS challenge," 2020.
[22] S. Hershey, S. Chaudhuri, D. P. Ellis, J. F. Gemmeke, A. Jansen, R. C. Moore, M. Plakal, D. Platt, R. A. Saurous, B. Seybold et al., "CNN architectures for large-scale audio classification," in Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2017, pp. 131–135.
[23] Y. Zhang, J. Du, Z. Wang, J. Zhang, and Y. Tu, "Attention based fully convolutional network for speech emotion recognition," IEEE, 2018, pp. 1771–1775.
[24] L. Honig and R. Mayeux, "Natural history of Alzheimer's disease," Aging Clinical and Experimental Research, vol. 13, no. 3, pp. 171–182, 2001.
[25] H. Goodglass, E. Kaplan, and B. Barresi, BDAE-3: Boston Diagnostic Aphasia Examination – Third Edition. Lippincott Williams & Wilkins, Philadelphia, PA, 2001.
[26] L. Muda, M. Begam, and I. Elamvazuthi, "Voice recognition algorithms using mel frequency cepstral coefficient (MFCC) and dynamic time warping (DTW) techniques," arXiv preprint arXiv:1003.4083, 2010.
[27] K. C. Fraser, J. A. Meltzer, and F. Rudzicz, "Linguistic features identify Alzheimer's disease in narrative speech," Journal of Alzheimer's Disease, vol. 49, no. 2, pp. 407–422, 2016.
[28] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, "ImageNet: A large-scale hierarchical image database," in CVPR09, 2009.
[29] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei, "ImageNet Large Scale Visual Recognition Challenge," International Journal of Computer Vision (IJCV), vol. 115, no. 3, pp. 211–252, 2015.
[30] A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam, "MobileNets: Efficient convolutional neural networks for mobile vision applications," arXiv preprint arXiv:1704.04861, 2017.
[31] T. Tieleman and G. Hinton, "Lecture 6.5-RMSProp: Divide the gradient by a running average of its recent magnitude," COURSERA: Neural Networks for Machine Learning, vol. 4, no. 2, pp. 26–31, 2012.
[32] N. Otsu, "A threshold selection method from gray-level histograms," IEEE Transactions on Systems, Man, and Cybernetics, vol. 9, no. 1, pp. 62–66, 1979.
11. List of Acronyms

AD: Alzheimer's Disease
CTP: Cookie Theft Picture
MFCC: Mel Frequency Cepstral Coefficient
ASR: Automatic Speech Recognition
FCN: Fully Convolutional Network
CNN: Convolutional Neural Network
GAP: Global Average Pooling