Exploiting Fully Convolutional Network and Visualization Techniques on Spontaneous Speech for Dementia Detection
Youxiang Zhu, Xiaohui Liang
Department of Computer Science, University of Massachusetts Boston, USA
{Youxiang.Zhu001, Xiaohui.Liang}@umb.edu

Abstract
In this paper, we exploit a Fully Convolutional Network (FCN) to analyze the audio data of spontaneous speech for dementia detection. A fully convolutional network accommodates speech samples with varying lengths, thus enabling us to analyze a speech sample without manual segmentation. Specifically, we first obtain the Mel Frequency Cepstral Coefficient (MFCC) feature map from each participant's audio data and convert the speech classification task on audio data into an image classification task on MFCC feature maps. Then, to solve the data insufficiency problem, we apply transfer learning by adopting a pre-trained backbone Convolutional Neural Network (CNN) model from the MobileNet architecture and the ImageNet dataset. We further build a convolutional layer to produce a heatmap using Otsu's method for visualization, enabling us to understand the impact of the time-series audio segments on the classification results. We demonstrate that our classification model achieves 66.7% accuracy over the testing dataset, higher than the 62.5% accuracy of the baseline model provided in the ADReSS challenge. Through the visualization technique, we can evaluate the impact of audio segments, such as filled pauses from the participants and repeated questions from the investigator, on the classification results.

Index Terms: Alzheimer's disease, MFCC feature map, classification, transfer learning, visualization
1. Introduction
The number of patients with Alzheimer's Disease (AD) over the age of 65 is expected to reach 13.8 million by 2050, placing a huge burden on the public health system [1]. While there is no proven effective treatment for AD, it is important to detect early symptoms of AD so that interventions can be implemented at an early stage. Because screening measures, neuropsychological assessments, and MRI imaging scans are not pragmatic approaches, recent studies have explored spontaneous speech as a practical and low-cost means of early dementia detection. The Pitt corpus [2], one of the largest speech datasets, includes spontaneous speech obtained from a Cookie Theft Picture (CTP) description task. The CTP task has also been explored with computerized agents to automate and mobilize speech collection [3, 4] and in other languages including Mandarin [5, 6], German [7], and Swedish [8]. Other spontaneous speech datasets for AD research include those collected from film-recall tasks [9], story-retelling tasks [10], map-based tasks [11], and human conversations [12].

Researchers have studied linguistic features extracted from transcripts for building classification and regression models. A recent survey showed that effective linguistic features include semantic content, syntax and morphology, pragmatic language, discourse fluency, speech rate, and speech monitoring [13]. The linguistic features were often manually selected based on expert knowledge, and the analysis methods were complex and highly task-dependent. A potential research direction is to automate the linguistic analysis. For example, Croisile et al. manually extracted 23 information units from the picture using language knowledge and found that the analysis based on them was effective in dementia detection [14]. Fraser et al. confirmed that both the initial 23 information units and auto-generated information units are effective in analysis [15]. Yancheva et al. [16] and Fraser et al. [8] further proposed to auto-generate topic models that can recall 97% of the human-annotated information units. Similarly, acoustic-based analysis started with pre-defined features and was recently automated with computational models. Hoffmann et al. considered acoustic features for each utterance [17]. Fraser et al. evaluated the statistical significance of pause and word acoustic features [10]. Tóth et al. considered four descriptors for silent/filled pauses and phonemes [18]. Tóth et al. also implemented a customized automatic speech recognition (ASR) system and automatic feature selection for phones, boundaries, and filled pauses [19, 9]. Haider et al. proposed an automatic acoustic analysis approach using the paralinguistic acoustic features of audio segments [20, 21].

In this paper, we envision an automated speech analysis of the audio data for dementia detection. We observed that Haider et al. segmented the audio data into small pieces by setting the log energy threshold parameter to 65 dB with a maximum duration of 10 seconds [20, 21]. We feel the segmentation may cause critical time-series information loss. Any single small speech segment hardly represents the overall speech sample. In addition, the speech continuity is removed by segmentation, making the model inaccurately capture the time-series characteristics. Thus, our model aims to accommodate a speech sample of each participant as input and preserve the time-series characteristics of the speech samples [22, 23].
Our contributions are as follows. First, we converted a speech classification task on the audio data to an image classification task on the Mel Frequency Cepstral Coefficient (MFCC) feature maps. The feature maps are automatically extracted from the audio data and preserve the time-series characteristics of the speech. Second, we explored the Fully Convolutional Network (FCN) to accommodate speech samples with varying lengths. We employed the transfer learning technique by adopting a pre-trained backbone Convolutional Neural Network (CNN) from the MobileNet architecture and the ImageNet dataset. Compared to the baseline model, ours achieves better accuracy and a more balanced F1 score. Third, we embedded a convolutional layer in our model to enable the visualization of the impact of audio segments on the classification results, thus increasing our understanding of how the classification model works. We found that the visualization technique identifies the filled pauses from the participant and the repeated questions from the investigator as positive signs of AD.
Figure 1: Proposed classification model with transfer learning and visualization.
2. ADReSS Challenge Dataset
We studied the dataset created for the ADReSS challenge [21], which is a part of the Pitt corpus [2], with the numbers of participants balanced for age and gender. The data consists of speech recordings and transcripts of spoken picture descriptions elicited from participants through the Cookie Theft picture from the Boston Diagnostic Aphasia Exam [24, 25]. We studied the full-wave enhanced audio, which contains the audio recordings after noise removal. The training dataset includes speech data from 24 male participants with AD, 30 female participants with AD, 24 male non-AD participants, and 30 female non-AD participants. The ADReSS testing dataset includes speech data from 11 male participants with AD, 13 female participants with AD, 11 male non-AD participants, and 13 female non-AD participants. The complete dataset information can be found in Luz et al. [21].
3. MFCC Feature Maps
Mel-frequency cepstral coefficients have been widely used in speech recognition research [26]. Fraser et al. carried out an acoustic-prosodic analysis on the Pitt corpus using 42 MFCC features [16, 27]. We extracted an MFCC feature map from each participant's entire audio sample. The MFCC feature map is denoted as a (p, t)-matrix, where the hyper-parameter p is set to 64 and t is related to the duration of the speech sample. We used the librosa MFCC function with a sampling rate of 22050 Hz, a window size of 2048, and a step size of 512. In Figure 2, we show the sample MFCC feature maps of participants 001 (non-AD) and 083 (AD), respectively. The data in the first row is scaled for visualization purposes. By extracting the MFCC feature maps, we convert the speech dataset to an image dataset. The advantages of MFCC feature maps are three-fold: i) the conversion from speech to MFCC feature maps can be done automatically; ii) the silent pauses in the audio data are preserved as a distinctive feature in the MFCC feature maps, as shown in Figure 2; iii) we found that the audio dataset contains speech from the investigator and filled pauses from the participant, which are shown to be important [9]. While identifying these audio segments requires expensive human effort or a customized Automatic Speech Recognition (ASR) system, we envision that the MFCC feature maps preserve the time-series structure, and the classification model may continuously learn to deal with these effects.
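For concreteness, the following is a minimal sketch of this extraction step, assuming the librosa and numpy packages; the function name and wrapper are our reconstruction of the parameters stated above, not the authors' code.

```python
import librosa
import numpy as np

def extract_mfcc_map(wav_path: str, n_mfcc: int = 64, sr: int = 22050,
                     n_fft: int = 2048, hop_length: int = 512) -> np.ndarray:
    """Load one participant's audio and return a (p, t) MFCC feature map."""
    # librosa resamples the recording to the requested sampling rate on load
    y, sr = librosa.load(wav_path, sr=sr)
    # p = n_mfcc rows; the number of columns t depends on the speech duration
    return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc,
                                n_fft=n_fft, hop_length=hop_length)

# Example: one feature map per participant; t varies across recordings
# feature_map = extract_mfcc_map("participant.wav")  # shape (64, t)
```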
4. Classification Model
We aim to design a classification model to classify the audio samples into the non-AD and AD groups. After converting each audio sample to an MFCC feature map, we focus on developing an image classification model. To improve the learning effectiveness over the small audio dataset, we apply transfer learning using ImageNet and MobileNet. In the following, we first explain the transfer learning technique and then introduce our model. An overview of our model is shown in Figure 1.

Figure 2: Sample MFCC feature maps of participants 001 (non-AD) and 083 (AD).
We developed a transfer learning technique using the knowledge from image datasets and pre-trained image classification models to overcome the insufficiency of the audio dataset.
ImageNet is an image dataset organized according to the WordNet hierarchy [28]. Each meaningful concept in WordNet, possibly described by multiple words or word phrases, is called a "synset." There are more than 100,000 synsets in WordNet, the majority of which are nouns (80,000+). ImageNet provides, on average, 1000 images to illustrate each synset. Images of each concept are quality-controlled and human-annotated. The ImageNet dataset has been widely used in designing and evaluating image classification models [29].
MobileNet is a lightweight network architecture that significantly reduces the computational overhead as well as the parameter size by replacing the standard convolution filters with depth-wise convolutional filters and point-wise convolutional filters [30]. The total parameters of the MobileNet backbone take 17.2 MB, significantly less than other convolutional neural networks. Considering the limited size of the speech dataset, we thought a smaller model with less complexity, such as MobileNet, was worth testing. The MobileNet architecture is shown in the upper part of Figure 1. With an RGB image as input, the output is the probability that the image belongs to each of the 1000 classes. Denote the input image as a 3-dimensional (h, w, 3)-matrix, where h is the height, w is the width, and 3 represents the RGB channels. A backbone CNN consists of a set of convolution, pooling, and activation operations. We used the full-width (1.0) MobileNet backbone pre-trained at a resolution of 128*128 images. The detailed architecture can be found in the paper [30]. The backbone converts an input (h, w, 3)-matrix to an output (h', w', 1024)-matrix, where (h', w') are functionally related to (h, w), and 1024 represents the feature channel number, i.e., the depth of the backbone CNN. The output (h', w', 1024)-matrix is then fed to a Global Average Pooling (GAP) layer to reduce the dimensions h' and w' and obtain a 1024-dimension feature. A Fully Connected (FC) layer with 1000 neurons is employed to produce the output according to the wanted 1000 classes. Lastly, a softmax activation layer is added to produce the classification results as probabilities for the 1000 classes that add up to 1. The pre-training of MobileNet is time-consuming and may take weeks due to the large ImageNet dataset. The pre-trained parameters of the backbone CNN from MobileNet are made available, though. We used these parameters and saved time on the pre-training.

Our proposed model is shown in the lower part of Figure 1. Our FCN architecture employs the pre-trained backbone CNN module from MobileNet. Denote the MFCC feature map of the audio sample as a (p, t, 1)-matrix, where p is a hyper-parameter set to 64 and t is related to the duration of the speech sample. To match the module input, i.e., an RGB image, we duplicated the MFCC feature map twice and made the MFCC feature map a (p, t, 3)-matrix. In this way, we can feed the MFCC feature map into the backbone CNN module of MobileNet in the same way as an RGB image. The output of the backbone CNN is denoted as a (p', t', 1024)-matrix, where (p', t') are functionally related to (p, t). We employed a GAP-1D (one-dimensional) layer to reduce the p' dimension of the matrix. The t' dimension is preserved to enable the visualization. We further used a 1D convolutional layer with 2 neurons to adapt to the wanted 2 classes. The output of the 1D convolutional layer is used to build a 1D heatmap for visualization. Finally, we added another GAP-1D layer to reduce the t' dimension and a softmax activation layer to produce the classification results as two probabilities for the two classes that add up to 1.
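The following is a minimal Keras sketch of this architecture, assuming tf.keras with ImageNet-pretrained MobileNet weights; the kernel size of the 1D convolutional layer and the exact layer wiring are assumptions where the paper does not specify them.

```python
import tensorflow as tf

def build_fcn(num_classes: int = 2) -> tf.keras.Model:
    # Accept MFCC "images" of any height/width; 3 channels from duplication
    inputs = tf.keras.Input(shape=(None, None, 3))
    # Full-width (1.0) MobileNet backbone with ImageNet weights, no FC head
    backbone = tf.keras.applications.MobileNet(
        include_top=False, weights="imagenet",
        input_shape=(None, None, 3), alpha=1.0)
    x = backbone(inputs)                                   # (p', t', 1024)
    # GAP-1D over the p' dimension, keeping the time dimension t'
    x = tf.keras.layers.Lambda(lambda f: tf.reduce_mean(f, axis=1))(x)
    heat = tf.keras.layers.Conv1D(num_classes, 1)(x)       # (t', 2), used for the heatmap
    x = tf.keras.layers.GlobalAveragePooling1D()(heat)     # GAP-1D over t' -> (2,)
    outputs = tf.keras.layers.Softmax()(x)
    return tf.keras.Model(inputs, outputs)

model = build_fcn()
model.compile(optimizer=tf.keras.optimizers.RMSprop(learning_rate=1e-5),
              loss="categorical_crossentropy", metrics=["accuracy"])
```

Because all layers after the backbone are convolutional or pooling operations, the model accepts feature maps of any length, which is the property exploited at test time.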
5. Evaluation
We implemented the classification model with Keras and TensorFlow. We used mini-batches at each training step and a very small learning rate of 1e-5, while minimizing the cross-entropy loss with the RMSProp optimizer [31]. As the samples in a mini-batch are required to have the same size, we used zero-paddings to pad the samples such that their lengths are equal to the maximum length in the mini-batch. The zero-paddings have limited impact on the classification task because i) they can be easily distinguished from non-zero pauses and ii) the employed GAP and softmax layers produce averages over the height and width dimensions and relative values. We further confirmed with our visualization technique that zero-paddings in training produced minimal impact on the classification results. In the testing phase, we treated each testing sample as a mini-batch with a batch size of one, such that our classification model can take samples of any length.

Due to the limited audio dataset, we designed a set of training strategies as follows. First, we split the provided speech dataset into two halves of equal size. We used one half for training and the other half for validation. We trained our model for a maximum of 1000 epochs and selected the epoch with the highest validation accuracy after the model converges. We performed such training twice by switching the training and validation datasets. We thus obtained two models (M1, M2) that complement each other. Note that k-fold cross-validation is a classical evaluation strategy. Here, we chose k = 2 for a relatively large validation dataset, because a too small validation dataset may not reflect the overall data distribution. A larger validation dataset enables us to better control the learning rate for our model to produce a stable classification accuracy on the validation dataset. We did not use "leave-one-subject-out (LOSO)" because i) LOSO is not suitable for training deep neural network models in terms of computational efficiency; ii) we focus on learning the parameters of the deep neural networks, not model selection; and iii) we focus on the results on the provided test dataset. Second, we merged the above two models (M1, M2) into M3 by averaging the output probabilities of (M1, M2). M3 takes advantage of the entire dataset. Third, we used all the speech samples to train a model M4, where we selected the epoch with the minimum training loss instead of the maximum validation accuracy. Lastly, we merged three of the above models into a model M5 by adding their output probabilities. Note that models (M1, M2, M3) mainly focus on improving the validation accuracy. This strategy is usually adopted when the training dataset is small. Models (M4, M5) consider minimizing the training loss, which is a general approach for deep learning where a large training dataset is available.

We trained and tested our models using two different datasets, both of which are provided by the ADReSS challenge. The testing dataset was provided after the models were trained. Our models output a binary result, non-AD or AD. The evaluation metrics are accuracy = (TN + TP) / N, precision π = TP / (TP + FP), recall ρ = TP / (TP + FN), and F1 score = 2πρ / (π + ρ), where N is the number of participants, and TP, FP, and FN are the numbers of true positives, false positives, and false negatives, respectively.
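The following is a minimal sketch of the zero-padding step described above, assuming numpy; the helper name and batch layout are illustrative assumptions.

```python
import numpy as np

def pad_batch(feature_maps):
    """Zero-pad variable-length (p, t, 3) MFCC maps to the max t in the mini-batch."""
    max_t = max(m.shape[1] for m in feature_maps)
    batch = np.zeros((len(feature_maps), feature_maps[0].shape[0], max_t, 3),
                     dtype=np.float32)
    for i, m in enumerate(feature_maps):
        batch[i, :, :m.shape[1], :] = m   # columns beyond t stay zero-padded
    return batch

# Training: each mini-batch is padded to its own maximum length.
# Testing: a single sample forms its own batch, so no padding is needed.
```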
Table 1: Classification results
Model          Class    Prec.  Recall  F1    Acc.
M1 (val.)      non-AD   0.68   0.54    0.60  0.646
               AD       0.62   0.75    0.68
M2 (val.)      non-AD   0.58   0.58    0.58  0.583
               AD       0.58   0.58    0.58
M3             non-AD   0.63   0.79    0.70  0.667
               AD       0.72   0.54    0.62
M4 (loss)      non-AD   0.63   0.71    0.67  0.646
               AD       0.67   0.58    0.62
M5             non-AD   0.59   0.67    0.63  0.604
               AD       0.62   0.54    0.58
Baseline [21]  non-AD   0.67   0.50    0.57  0.625
               AD       0.60   0.75    0.67

Our models (M1, M2) achieve 64.6% and 58.3% accuracy, respectively, as shown in Table 1. We found these results consistent with the validation accuracies of 62.96% and 61.11% obtained in the training phase. After combining the outputs from M1 and M2, our model M3 achieves the highest accuracy, 66.67%, of our five attempts. We consider the model M3 relatively successful as it outperformed M1, M2, and the baseline model with 62.5%. We think this performance gain of M3 is obtained because it considers all samples in training and inherits the knowledge from the image classification model via transfer learning. Our models M4 and M5 achieve 64.6% and 60.4% accuracy, respectively. Without the validation step in training, these models focus on minimizing the training loss and need more data to improve accuracy. Both models M3 and M4 used all the samples in the training phase, but M3 achieves a higher accuracy than M4. One possible explanation is that in M3, after splitting the training samples into two equal halves, the sample-wise differences in each half become smaller. When M3 merges the outputs of M1 and M2, it simply chooses the model with higher confidence and thus produces a higher accuracy. In general, as the training samples are limited in size and have large sample-wise differences, our model may largely modify the parameters of the pre-trained backbone CNN from MobileNet, resulting in a degeneration of the discriminative ability of the pre-trained model and yielding overfitting. An enhanced data splitting method may help. Note that we currently split the training dataset into two equal-sized halves in a random way, and we envision that a CNN feature-based splitting method may enhance the performance. At last, we found that our five models achieve more balanced F1 scores compared to the baseline model [21].

Figure 3: Visualizing the impacts of audio segments. Utterances (a, b, c) are from participant 001 (non-AD) and (d, e, f, g) are from participant 079 (AD):
(a) *PAR: and &uh she's getting her feet wet from the overflow of the water from the sink .
(b) *PAR: she seems to be oblivious to the fact that the &s sink is overflowing .
(c) *INV: tell me everything that you see going on in that picture .
(d) *PAR: &uh and he's [/] he's in the &c &t cookie jar .
(e) *PAR: and she's [//] &uh &w &uh &h she has [/] &uh has +/ .
(f) *INV: you see going on in the picture .
(g) *INV: okay anything else ?
6. Visualization
One significant contribution of our models is to enable the visualization of the impacts of the audio segments on the classification results. As shown in Figure 1, our model incorporates a 1D convolutional layer with 2 neurons that converts a (t', 1024)-matrix to a (t', 2)-matrix, where t' is functionally related to the time t. For the dimension of size 2, the first row represents the non-AD class, and the second row represents the AD class. We chose the second row (or the first row) and used Otsu's thresholding method to evaluate the impact scores over the time dimension. Otsu's method performs automatic image thresholding [32]. In its simplest form, the algorithm returns a single intensity threshold that separates pixels into two classes, foreground and background. This threshold is determined by minimizing the intra-class intensity variance, or equivalently, by maximizing the inter-class variance. In our visualization module, we used Otsu's method to assign either 0 to smaller values (dark color) or 1 to larger values (yellow color). We also used the nearest-neighbor interpolation technique to scale the vector from size t' to size t.

Figure 3 shows the visualization bar for seven utterances. In general, for non-AD samples, the dark segments contribute more to the non-AD result compared to the yellow segments; for AD samples, the yellow segments contribute more to the AD result compared to the dark segments. Note that Otsu's method produces both dark and yellow segments for both non-AD and AD samples. We have two observations. First, from (a) and (b), dark segments represent quality speech and contribute to the non-AD results; from (d) and (e), yellow segments represent filled pauses and unclear speech and contribute to the AD results. Second, we have an interesting observation on the investigator's audio data, which is mixed with the participant's audio data. From (c), the investigator's speech mixed with the non-AD sample shows yellow, contributing to the AD result as noise. From (f), the investigator's speech mixed with the AD sample shows dark, contributing to the non-AD result as noise. More importantly, from (g), the investigator's speech from the AD sample shows yellow, contributing to the AD result. By cross-checking the transcripts, we found that our model may capture the similar utterance that the investigators used to push the AD participants for more conversation and consider this utterance as a positive sign for the AD result.
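The following is a minimal sketch of the heatmap binarization and upscaling, assuming scikit-image for Otsu's threshold and numpy; the variable ad_row stands for the AD row of the (t', 2) convolutional output and is an assumed name.

```python
import numpy as np
from skimage.filters import threshold_otsu

def binarize_and_upscale(ad_row: np.ndarray, t: int) -> np.ndarray:
    """Binarize the AD-row heatmap with Otsu's threshold, then scale from t' to t."""
    thresh = threshold_otsu(ad_row)
    binary = (ad_row > thresh).astype(np.float32)   # 1 = yellow (AD-leaning), 0 = dark
    # Nearest-neighbor mapping from t' heatmap positions to t time steps
    idx = np.round(np.linspace(0, len(binary) - 1, num=t)).astype(int)
    return binary[idx]
```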
7. Discussion
Data augmentation.
As the training samples are limited, we exploited possible data augmentation techniques. Common image augmentation techniques include rotating or zooming in/out of the images for additional samples. However, these techniques do not apply to the MFCC feature maps due to the different meanings of their different dimensions. Thus, we considered another data augmentation technique, i.e., randomly masking certain periods of an MFCC feature map with zeros. The mask is randomly generated at different positions for every epoch with a length of 200 to 400 units. One advantage of this data augmentation method is its consistency with the data representation of our model input, where zero-paddings were adopted in the mini-batch implementation. However, our attempts did not introduce significant accuracy gains.
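The following is a minimal sketch of this random time-masking, assuming numpy; applying it to a (p, t) MFCC feature map before channel duplication is our assumption about where the mask is inserted.

```python
import numpy as np

def random_time_mask(mfcc: np.ndarray, min_len: int = 200, max_len: int = 400,
                     rng=None) -> np.ndarray:
    """Zero out a randomly placed span of 200-400 frames in a (p, t) MFCC map."""
    if rng is None:
        rng = np.random.default_rng()
    p, t = mfcc.shape
    mask_len = min(int(rng.integers(min_len, max_len + 1)), t)  # guard short samples
    start = int(rng.integers(0, t - mask_len + 1))
    augmented = mfcc.copy()
    augmented[:, start:start + mask_len] = 0.0    # masked span mimics zero-padding
    return augmented
```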
Heterogeneous speech dataset.
The limited speech dataset is the biggest barrier in this research. Researchers have proposed different models and tested them over different datasets, resulting in loosely connected conclusions. The ADReSS challenge is a significant effort to bring researchers together to study the same dataset for producing more meaningful results. Our computational model is fully automated and has the potential to apply to any speech and even multilingual speech. While researchers have explicitly identified silent pauses, filled pauses, and speech duration for building classification models, we envision that these features are preserved in the feature maps, and as more datasets become available, our computational model will self-adapt to both explicit and implicit acoustic features.

8. Conclusions
We proposed a classification model to analyze audio data for dementia detection. Our model employs a fully convolutional network to accommodate audio samples with varying lengths and preserve their time-series characteristics. We extracted the MFCC feature maps from the audio data and converted the speech classification task to an image classification task. We then applied the transfer learning technique to adopt a pre-trained model from the MobileNet architecture. Our model achieves higher accuracy than the baseline model. Finally, we implemented a visualization technique to provide intuitive visual feedback on the impacts of the audio segments on the classification results. We envision that our computational model can be applied to other speech datasets and has the potential to be continuously enhanced with deep learning techniques.
9. Acknowledgements
This research is funded by the US National Institutes of Health, National Institute on Aging, under grant No. 1R01AG067416.
10. References
[2] J. T. Becker, F. Boller, O. L. Lopez, J. Saxton, and K. L. McGonigle, "The natural history of Alzheimer's disease: Description of study cohort and accuracy of diagnosis," Archives of Neurology, vol. 51, no. 6, pp. 585–594, 1994.
[3] B. Mirheidari, D. Blackburn, K. Harkness, T. Walker, A. Venneri, M. Reuber, and H. Christensen, "An avatar-based system for identifying individuals likely to develop dementia," in Interspeech 2017. ISCA, 2017, pp. 3147–3151.
[4] B. Mirheidari, Y. Pan, T. Walker, M. Reuber, A. Venneri, D. Blackburn, and H. Christensen, "Detecting Alzheimer's disease by estimating attention and elicitation path through the alignment of spoken picture descriptions with the picture prompt," arXiv preprint arXiv:1910.00515, 2019.
[5] T. Wang, Q. Yan, J. Pan, F. Zhu, R. Su, Y. Guo, L. Wang, and N. Yan, "Towards the speech features of early-stage dementia: Design and application of the Mandarin elderly cognitive speech database," Proc. Interspeech 2019, pp. 4529–4533, 2019.
[6] Y.-W. Chien, S.-Y. Hong, W.-T. Cheah, L.-H. Yao, Y.-L. Chang, and L.-C. Fu, "An automatic assessment system for Alzheimer's disease based on speech using feature sequence generator and recurrent neural network," Scientific Reports, vol. 9, no. 1, pp. 1–10, 2019.
[7] C. Sattler, H.-W. Wahl, J. Schröder, A. Kruse, P. Schönknecht, U. Kunzmann, T. Braun, C. Degen, I. Nitschke, W. Rahmlow, P. Rammelsberg, J. Siebert, B. Tauber, B. Wendelstein, and A. Zenthöfer, Interdisciplinary Longitudinal Study on Adult Development and Aging (ILSE), 01 2015, pp. 1–10.
[8] K. C. Fraser, N. Linz, B. Li, K. L. Fors, F. Rudzicz, A. König, J. Alexandersson, P. Robert, and D. Kokkinakis, "Multilingual prediction of Alzheimer's disease through domain adaptation and concept-based language modelling," in Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), 2019, pp. 3659–3670.
[9] L. Tóth, I. Hoffmann, G. Gosztolya, V. Vincze, G. Szatlóczki, Z. Bánréti, M. Pákáski, and J. Kálmán, "A speech recognition-based solution for the automatic detection of mild cognitive impairment from spontaneous speech," Current Alzheimer Research, vol. 15, no. 2, pp. 130–138, 2018.
[10] K. C. Fraser, F. Rudzicz, N. Graham, and E. Rochon, "Automatic speech recognition in the diagnosis of primary progressive aphasia," in Proceedings of the Fourth Workshop on Speech and Language Processing for Assistive Technologies, 2013, pp. 47–54.
[11] S. de la Fuente Garcia, C. W. Ritchie, and S. Luz, "Protocol for a conversation-based analysis study: PREVENT-ED investigates dialogue features that may help predict dementia onset in later life," BMJ Open, vol. 9, no. 3, p. e026254, 2019.
[12] B. Mirheidari, D. Blackburn, T. Walker, M. Reuber, and H. Christensen, "Dementia detection using automatic analysis of conversations," Computer Speech & Language, vol. 53, pp. 65–79, 2019.
[13] K. D. Mueller, B. Hermann, J. Mecollari, and L. S. Turkstra, "Connected speech and language in mild cognitive impairment and Alzheimer's disease: A review of picture description tasks," Journal of Clinical and Experimental Neuropsychology, vol. 40, no. 9, pp. 917–939, 2018.
[14] B. Croisile, B. Ska, M.-J. Brabant, A. Duchene, Y. Lepage, G. Aimard, and M. Trillet, "Comparative study of oral and written picture description in patients with Alzheimer's disease," Brain and Language, vol. 53, no. 1, pp. 1–19, 1996.
[15] K. C. Fraser, K. L. Fors, and D. Kokkinakis, "Multilingual word embeddings for the assessment of narrative speech in mild cognitive impairment," Computer Speech & Language, vol. 53, pp. 121–139, 2019.
[16] M. Yancheva and F. Rudzicz, "Vector-space topic models for detecting Alzheimer's disease," in Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2016, pp. 2337–2346.
[17] I. Hoffmann, D. Nemeth, C. D. Dye, M. Pákáski, T. Irinyi, and J. Kálmán, "Temporal parameters of spontaneous speech in Alzheimer's disease," International Journal of Speech-Language Pathology, vol. 12, no. 1, pp. 29–34, 2010.
[18] L. Tóth, G. Gosztolya, V. Vincze, I. Hoffmann, G. Szatlóczki, E. Biró, F. Zsura, M. Pákáski, and J. Kálmán, "Automatic detection of mild cognitive impairment from spontaneous speech using ASR," in Sixteenth Annual Conference of the International Speech Communication Association, 2015.
[19] G. Gosztolya, L. Tóth, T. Grósz, V. Vincze, I. Hoffmann, G. Szatlóczki, M. Pákáski, and J. Kálmán, "Detecting mild cognitive impairment from spontaneous speech by correlation-based phonetic feature selection," in INTERSPEECH, 2016, pp. 107–111.
[20] F. Haider, S. De La Fuente, and S. Luz, "An assessment of paralinguistic acoustic features for detection of Alzheimer's dementia in spontaneous speech," IEEE Journal of Selected Topics in Signal Processing, 2019.
[21] S. Luz, F. Haider, S. de la Fuente, D. Fromm, and B. MacWhinney, "Alzheimer's dementia recognition through spontaneous speech: The ADReSS challenge," 2020.
[22] S. Hershey, S. Chaudhuri, D. P. Ellis, J. F. Gemmeke, A. Jansen, R. C. Moore, M. Plakal, D. Platt, R. A. Saurous, B. Seybold et al., "CNN architectures for large-scale audio classification," in Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2017, pp. 131–135.
[23] Y. Zhang, J. Du, Z. Wang, J. Zhang, and Y. Tu, "Attention based fully convolutional network for speech emotion recognition," IEEE, 2018, pp. 1771–1775.
[24] L. Honig and R. Mayeux, "Natural history of Alzheimer's disease," Aging Clinical and Experimental Research, vol. 13, no. 3, pp. 171–182, 2001.
[25] H. Goodglass, E. Kaplan, and B. Barresi, BDAE-3: Boston Diagnostic Aphasia Examination – Third Edition. Lippincott Williams & Wilkins, Philadelphia, PA, 2001.
[26] L. Muda, M. Begam, and I. Elamvazuthi, "Voice recognition algorithms using mel frequency cepstral coefficient (MFCC) and dynamic time warping (DTW) techniques," arXiv preprint arXiv:1003.4083, 2010.
[27] K. C. Fraser, J. A. Meltzer, and F. Rudzicz, "Linguistic features identify Alzheimer's disease in narrative speech," Journal of Alzheimer's Disease, vol. 49, no. 2, pp. 407–422, 2016.
[28] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, "ImageNet: A large-scale hierarchical image database," in CVPR09, 2009.
[29] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei, "ImageNet Large Scale Visual Recognition Challenge," International Journal of Computer Vision (IJCV), vol. 115, no. 3, pp. 211–252, 2015.
[30] A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam, "MobileNets: Efficient convolutional neural networks for mobile vision applications," arXiv preprint arXiv:1704.04861, 2017.
[31] T. Tieleman and G. Hinton, "Lecture 6.5-RMSProp: Divide the gradient by a running average of its recent magnitude," COURSERA: Neural Networks for Machine Learning, vol. 4, no. 2, pp. 26–31, 2012.
[32] N. Otsu, "A threshold selection method from gray-level histograms," IEEE Transactions on Systems, Man, and Cybernetics, vol. 9, no. 1, pp. 62–66, 1979.
11. List of Acronyms

AD: Alzheimer's Disease
CTP: Cookie Theft Picture
MFCC: Mel Frequency Cepstral Coefficient
ASR: Automatic Speech Recognition
FCN: Fully Convolutional Network
CNN: Convolutional Neural Network
GAP: Global Average Pooling