Classification of Huntington Disease using Acoustic and Lexical Features
Matthew Perez, Wenyu Jin, Duc Le, Noelle Carlozzi, Praveen Dayalu, Angela Roberts, Emily Mower Provost
Computer Science and Engineering, University of Michigan, Ann Arbor, MI; Physical Medicine & Rehabilitation, University of Michigan, Ann Arbor, MI; Michigan Medicine, University of Michigan, Ann Arbor, MI; Department of Communication Sciences and Disorders, Northwestern University, Evanston, IL
[email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected]

Abstract
Speech is a critical biomarker for Huntington Disease (HD), with changes in speech increasing in severity as the disease progresses. Speech analyses are currently conducted using either transcriptions created manually by trained professionals or global rating scales. Manual transcription is both expensive and time-consuming, and global rating scales may lack sufficient sensitivity and fidelity [1]. Ultimately, what is needed is an unobtrusive measure that can cheaply and continuously track disease progression. We present first steps towards the development of such a system, demonstrating the ability to automatically differentiate between healthy controls and individuals with HD using speech cues. The results provide evidence that objective analyses can be used to support clinical diagnoses, moving towards the tracking of symptomatology outside of laboratory and clinical environments.
Index Terms: Huntington disease, speech analysis, clinical application, speech feature extraction, speech recognition
1. Introduction
Huntington disease (HD) is a fatal, autosomal dominant neurodegenerative disease that affects approximately 12 per 100,000 people in the Western world [2]. HD is insidious and progressive, affecting motor skills, speech, cognition, and behavior [3]. The diagnosis of HD is based on unequivocal motor symptoms and is typically made when individuals are in their mid-40s [4, 5]. Current research suggests that speech motor deficits precede the onset of limb and trunk chorea [5], providing an opportunity to leverage changes in speech as a sensitive biomarker. These biomarkers can then be used to support distributed, ecologically valid symptom tracking. This paper builds towards that goal, investigating how the speech signal can be used to automatically detect HD.

HD speech is typically characterized by decreases in the number of words pronounced, syntactic complexity, and speech rate, in addition to increases in paraphasic errors, filler usage, and sentence duration [6, 5]. Language and speech symptoms are common in HD, occurring in approximately 90% of cases [7, 8]. As such, acoustic analyses may provide meaningful therapeutic and diagnostic information for individuals with HD, especially given that preliminary research has shown that HD-related speech deficits can be objectively characterized and increase with disease progression [9]. The development of an objective, non-invasive acoustic biomarker, sensitive enough to detect disease progression in people with premanifest and manifest HD, will provide new avenues for clinical research and treatments.

We present an initial step towards detecting changes in HD severity by first demonstrating the efficacy of the speech signal for detecting the presence of HD. The system includes transcription, feature extraction, and classification. The transcripts are generated either by humans or by training in-domain automatic speech recognition (ASR) systems. The features are clinically inspired and include filler usage, pauses in speech, speech rate, and pronunciation errors. We investigate both static and dynamic feature properties: the static feature sets describe speech behavior using summary statistics, while the dynamic feature sets provide an opportunity to directly model feature variation between utterances. We model the static feature sets using k-Nearest Neighbors (k-NN) with Euclidean distance and Deep Neural Networks (DNN). We model the dynamic feature sets using k-NN with Dynamic Time Warping (DTW) distance and Long Short-Term Memory networks (LSTM). We also investigate the impact of transcription error, moving from manual transcriptions, to transcripts generated using forced alignment to known prompts, and finally to ASR.

Our results demonstrate the efficacy of speech-centered approaches for detecting HD. We show that we can accurately detect HD using a simple static (k-NN) approach, resulting in an accuracy of 0.81 (chance performance is 0.5). We then show that these results can be improved using dynamic feature sets (DTW) or deep methods, which can capture the non-linear relationships in our feature sets (DNN/LSTM). Finally, we demonstrate that in domains with limited lexical variability, manual transcripts can be replaced with ASR transcripts without a degradation in performance, even given a word error rate of 9.4%. This indicates the robustness of the identified speech features.
The novel aspects of our approach include one of the first investigations into automated speech-centered HD detection and a focus on understanding the importance of modeling temporal variability for detecting HD.
2. Related Work
Previous work has demonstrated the feasibility of automated speech assessment for various neurocognitive disorders such as dementia [10], aphasia [11], and Alzheimer's disease [12]. Research has investigated the feasibility of using automatic speech recognition (ASR) to extract lexical features from transcribed audio [13, 14]. However, off-the-shelf ASR systems are not well suited to this domain because of the abnormal speech patterns, high speaker variability, and lack of data that are common in dysarthric speech [15, 16]. Furthermore, these off-the-shelf ASR systems may miss critical cues such as the presence of fillers, stutters, or mispronunciations, all of which contribute to the perception of disordered speech [12].

Table 1: Summary of participant demographics
                  Premanifest (n=12)  Early (n=12)  Late (n=7)  Control (n=31)  All (n=62)
Age, mean (SD)    42.6 (9.8)          52.0 (11.2)   54.6 (9.7)  50.3 (11.1)     49.8 (11.1)
Gender, % male    36.4                41.7          37.5        38.7            38.7
Race, % White     90.9                100           87.5        90.3            91.9
Race, % Black     0                   0             12.5        6.5             4.8
Race, % Other     9.1                 0             0           3.2             3.3
Table 2: Dataset summary: total dataset size (seconds) and average utterance duration (seconds). HC denotes the healthy controls and HD the participants with Huntington disease.
3. Data Description
The data in this study were collected from an HD study conducted at the University of Michigan. The data consist of 62 speakers: 31 healthy and 31 with HD. Of the 31 individuals with HD, 11 are premanifest, 12 are in the early stage, and 8 are in the late stage. HD groups were created using the Total Motor Score (TMS) and the Total Functional Capacity (TFC) score [19] from the Unified Huntington's Disease Rating Scale (UHDRS) [20]. Specifically, individuals were designated as premanifest HD if they had a positive gene test (HD CAG >
35) and a clinician-rated score of less than 4 on the last item of the TMS (which provides an index of clinician-rated diagnostic confidence). Those with clinician-rated scores greater than or equal to 4 on the last item of the TMS were included in the manifest HD group. For those with manifest HD, TFC scores (which provide an index of clinician-rated functional capacity) were used to determine HD stage. Scores range from 0 (low functioning) to 13 (highest level of functioning); TFC sum scores of 7-13 were considered early-stage and sum scores of 0-6 were considered late-stage HD.

The data include both read speech and spontaneous speech sections. This study focuses only on the read speech portion, during which participants read the Grandfather Passage [21]. The Grandfather Passage [22, 23, 24] is a phonetically balanced paragraph containing 129 words and 169 syllables, and is a standard reading passage used in speech-language pathology. This passage is commonly used to test for dysarthric speech [25]. Table 2 shows additional information about the scope of our data, such as the size and number of utterances.
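For illustration, the staging criteria above can be expressed as a small decision rule. The helper below is hypothetical (the function name, interface, and the handling of a non-expanded CAG result are our assumptions), a minimal sketch rather than the study's actual clinical procedure.

```python
def hd_stage(cag_repeats, tms_confidence, tfc_sum):
    """Assign an HD group from UHDRS-derived scores (sketch).

    cag_repeats: CAG repeat length from the gene test.
    tms_confidence: last TMS item, clinician-rated diagnostic confidence (0-4).
    tfc_sum: Total Functional Capacity (TFC) sum score (0-13).
    """
    if cag_repeats <= 35:
        return "control"          # assumption: no expanded allele, not HD
    if tms_confidence < 4:
        return "premanifest"      # gene-positive, below diagnostic threshold
    # Manifest HD: stage by clinician-rated functional capacity.
    return "early" if 7 <= tfc_sum <= 13 else "late"
```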
4. Data Transcription
4.1. Manual Transcription

Recordings were deidentified and transcribed using the CHAT approach [26] and Computerized Language Analysis (CLAN) software. CHAT transcriptions identify speech errors (phonological, semantic, or neologistic), vowel distortions, word repetitions, retracing, assimilations, dialect variances, letter and word omissions, utterances, pauses, glottal sounds unique to HD, vocalizations, spontaneous speech for each participant, and variations in rate, fundamental frequency (F0), and voice quality. Interrater reliability (greater than or equal to 90% agreement) was established between two trained raters and a Ph.D.-level Speech-Language Pathologist. The raters then individually transcribed each recording and their transcriptions were compared. Raters were required to reach a consensus on all identified discrepancies. In cases where consensus could not be reached, the Speech-Language Pathologist was consulted.
Table 3: Performance of the ASR system (per speaker, mean ± SD) for all speakers, HD speech, and healthy (HC) speech: WER (%), insertions (Ins), deletions (Del), and substitutions (Sub).
4.2. Automatic Speech Recognition

Manual transcripts often represent a bottleneck because they are costly and time-intensive to obtain. ASR can provide an alternative. However, off-the-shelf systems are often unusable due to the acoustic mismatch between the healthy speech used to train these systems and the speech patterns of individuals in the target population. We address this by training in-domain acoustic and language models using a specialized lexicon.

The lexicon is initialized using the standard English phone-level pronunciations provided by the CMU pronunciation dictionary [27]. We augment it using the pronunciation errors identified in the manual transcripts. We use a bigram language model estimated over the manual transcripts.

The acoustic model is a monophone Hidden Markov Model (HMM) with three states per phone and a left-to-right topology. The emission probability of each state is estimated using a Gaussian Mixture Model (GMM). We use a monophone acoustic model rather than a triphone model due to the relatively small size of the dataset. The resulting system attains a word error rate (WER) of 9.4% (Table 3). This relatively low WER is strongly attributable to the constrained speech produced by reading the Grandfather Passage.
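For reference, the WER and the insertion/deletion/substitution counts reported in Table 3 follow from a standard Levenshtein alignment between reference and hypothesis word sequences. The sketch below is a minimal illustration (the function name and interface are ours, not the paper's tooling).

```python
def wer_components(ref, hyp):
    """Levenshtein alignment of reference vs. hypothesis word lists.

    Returns (wer, insertions, deletions, substitutions)."""
    # dp[i][j] holds (cost, ins, del, sub) for aligning ref[:i] to hyp[:j].
    n, m = len(ref), len(hyp)
    dp = [[(0, 0, 0, 0)] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = (i, 0, i, 0)          # all deletions
    for j in range(1, m + 1):
        dp[0][j] = (j, j, 0, 0)          # all insertions
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if ref[i - 1] == hyp[j - 1]:
                dp[i][j] = dp[i - 1][j - 1]      # match: no edit
            else:
                c_sub, i_s, d_s, s_s = dp[i - 1][j - 1]
                c_del, i_d, d_d, s_d = dp[i - 1][j]
                c_ins, i_i, d_i, s_i = dp[i][j - 1]
                best = min(c_sub, c_del, c_ins)
                if best == c_sub:
                    dp[i][j] = (c_sub + 1, i_s, d_s, s_s + 1)
                elif best == c_del:
                    dp[i][j] = (c_del + 1, i_d, d_d + 1, s_d)
                else:
                    dp[i][j] = (c_ins + 1, i_i + 1, d_i, s_i)
    cost, ins, dele, sub = dp[n][m]
    return cost / max(n, 1), ins, dele, sub

# Example: one substitution over four reference words -> WER 0.25
print(wer_components("the old man walked".split(),
                     "the old men walked".split()))
```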
Figure 1: System diagram for HD classification (note: seg. = segmentation). Speaker audio is segmented and transcribed in three ways: forced alignment to the human transcription (FA-ORAT), forced alignment to the Grandfather Passage (FA-GF), and ASR (ASRT); features are then extracted and a diagnosis is predicted.
Table 4: Average dimension size (mean ± SD) of the utterance-level and speaker-level feature vectors for FA-ORAT, FA-GF, and ASRT.
4.3. Transcription Approaches

We present three approaches to investigate the effect of error propagation. We first assume the availability of manual transcripts and investigate the feasibility of extracting features for HD classification (force-aligned oracle transcription, FA-ORAT). Next, we assume that the subject prompt is available and ask whether the subject's speech goals can be used as a target for forced alignment (force-aligned grandfather transcription, FA-GF). This allows us to investigate the effect of the mismatch introduced by speech errors. Finally, we assume only that segmentation information is available, noting when speaker utterances begin and end, but that the transcript itself is unknown (ASR transcription, ASRT). This provides insight into the effectiveness of an automatic system that can transcribe, extract features, and predict diagnosis (Figure 1).

All approaches use the acoustic model discussed in Section 4.2. The ASRTs are also discussed in Section 4.2. FA-ORATs are generated by force-aligning the input audio to the manual transcripts. FA-GFs are generated by force-aligning the input audio to the original Grandfather Passage text.

FA-ORAT and ASRT utterances are segmented using the manual transcripts before both ASR and forced alignment (future work will remove the reliance on manual segmentation for ASRT). FA-GF utterances are segmented using the natural sentences within the Grandfather Passage (Figure 1).
5. Feature Extraction
We describe two feature sets: utterance-level (dynamic) and speaker-level (static). The utterance-level features are extracted over each utterance and provide insight into the relationship between the time-series behavior of the features and diagnosis. We normalize the utterance-level features using speaker-dependent z-normalization. The speaker-level features are calculated by applying summary statistics to the normalized utterance-level features, including max, min, mean, SD, range, and quartiles (25th, 50th, 75th). We group all features by subject and remove features that have either zero variance or zero information gain with respect to the target class (Table 4). We perform this once at the utterance level for our dynamic features and again after summary statistics are computed for our static features.
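To make the static/dynamic distinction concrete, the sketch below (hypothetical names; NumPy assumed) derives a speaker-level vector from one speaker's utterance-level feature matrix via z-normalization and summary statistics, as described above; the variance/information-gain filtering step is omitted for brevity.

```python
import numpy as np

def speaker_level_features(utt_feats):
    """utt_feats: (num_utterances, num_features) array for one speaker.

    Returns a static feature vector of summary statistics over the
    speaker-normalized (z-scored) utterance-level features."""
    # Speaker-dependent z-normalization (guard against zero variance).
    std = utt_feats.std(axis=0)
    std[std == 0] = 1.0
    z = (utt_feats - utt_feats.mean(axis=0)) / std
    # Summary statistics: max, min, mean, SD, range, and quartiles.
    stats = [z.max(0), z.min(0), z.mean(0), z.std(0),
             z.max(0) - z.min(0),
             np.percentile(z, 25, axis=0),
             np.percentile(z, 50, axis=0),
             np.percentile(z, 75, axis=0)]
    return np.concatenate(stats)
```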
Filler Features: Fillers are parts of speech that are not purposeful and do not carry formal meaning (e.g., ah, eh, um, uh). Fillers are labeled during the human transcription process. They are preserved in FA-ORAT and estimated in ASRT. They are ignored in FA-GF due to the absence of fillers in the original passage. The utterance-level features include: number of fillers, number of fillers per second, number of fillers per word, number of fillers per phone, total filler duration per utterance, and total filler duration per second.
Pause Features: Pauses are periods without speech that last at least 150 ms [29]. The utterance-level features include: number of pauses, number of pauses per second, number of pauses per word, number of pauses per phone, total pause duration, and total pause duration per second.
Speech Rate Features: Speech rate captures an individual's speaking speed. The utterance-level features include: number of phones, number of phones per second, number of phones per word, number of words, and number of words per second.
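A minimal sketch of the pause and speech-rate computations, assuming word-level timings from a forced alignment (the tuple format and function name are ours; the phone-based counts are omitted for brevity):

```python
def pause_and_rate_features(words, min_pause=0.150):
    """words: list of (word, start_sec, end_sec) from a forced alignment.

    Returns a dict with a subset of the pause and speech-rate features
    described above."""
    dur = words[-1][2] - words[0][1] if words else 0.0
    # Gaps between consecutive words; gaps of at least 150 ms are pauses.
    gaps = [b[1] - a[2] for a, b in zip(words, words[1:])]
    pauses = [g for g in gaps if g >= min_pause]
    n_words = len(words)
    return {
        "num_pauses": len(pauses),
        "pauses_per_sec": len(pauses) / dur if dur else 0.0,
        "pauses_per_word": len(pauses) / n_words if n_words else 0.0,
        "total_pause_dur": sum(pauses),
        "pause_dur_per_sec": sum(pauses) / dur if dur else 0.0,
        "num_words": n_words,
        "words_per_sec": n_words / dur if dur else 0.0,
    }
```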
Goodness of Pronunciation Features: Goodness of Pronunciation (GoP) measures the fitness of a reference acoustic model (trained over all HD and HC speakers) to a given phone by computing the difference between the average acoustic log-likelihood of a force-aligned phoneme and that of an unconstrained phone loop [30]:

GoP(p) = \frac{1}{N} \log \frac{P(O \mid p)}{P(O \mid PL)},  (1)

where p is a sequence of phones, O is the MFCC acoustic observation, N is the number of frames, and PL is the unconstrained phone loop. The utterance-level features include the GoP score for each phone in the utterance.
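Given per-frame acoustic log-likelihoods from the forced alignment and from the unconstrained phone loop, Eq. (1) reduces to an average log-likelihood ratio; a minimal sketch, with the interface assumed:

```python
import numpy as np

def gop(loglik_forced, loglik_phoneloop):
    """Goodness of Pronunciation (Eq. 1) for one phone segment.

    loglik_forced: per-frame log P(O|p) under the force-aligned phone.
    loglik_phoneloop: per-frame log P(O|PL) under an unconstrained
    phone loop decoded over the same frames."""
    n = len(loglik_forced)
    # (1/N) * [log P(O|p) - log P(O|PL)], summing log-likelihoods over frames.
    return (np.sum(loglik_forced) - np.sum(loglik_phoneloop)) / n
```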
6. Methods
We use a Leave-One-Subject-Out (LOSO) paradigm: in each run, a single subject is held out as the test speaker and the model is trained and validated on the remaining speakers. Within the training partition, 80% of the data is used to train the model and 20% is used to validate. This process is repeated over all speakers, and all results presented are accuracies averaged over all subjects in the study. We train four models: k-NN with Euclidean distance (k-NN), k-NN with DTW distance (DTW), Deep Neural Networks (DNN), and Long Short-Term Memory Recurrent Neural Networks (LSTM-RNN).

We hypothesize that HD can be detected using speech features. We test this hypothesis first using k-NN, which assigns a label to an instance based on the plurality of its closest k neighbors. k-NN uses the speaker-level features, while DTW uses the utterance-level features. In both approaches, we sweep over the number of neighbors, k.

We further hypothesize that the relationship between speech features and diagnosis can be more accurately modeled by exploiting non-linear feature interactions. We test this hypothesis using DNNs over the speaker-level features and LSTM-RNNs over the utterance-level features. Both are implemented using Keras with a TensorFlow [31] backend. The DNN is comprised of two fully connected layers with ReLU activation functions, a softmax output layer for binary classification, and dropout layers between the fully connected layers. The LSTM-RNN is comprised of two LSTM layers with recurrent dropout, bias L2 regularization, and kernel L2 regularization, followed by a softmax output layer. In both networks, we perform a hyperparameter sweep over layer width (32, 64, 128) and dropout rate (0.0, 0.2, 0.4). We use an ensemble approach, in which we train five separate models and take the mode of the five predictions as the final prediction of the system.
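For illustration, the sketch below builds networks matching the description above in tf.keras, a close stand-in for the Keras/TensorFlow setup used; the optimizer, loss, and default hyperparameter values are our assumptions, with the paper sweeping width over {32, 64, 128} and dropout over {0.0, 0.2, 0.4}.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def build_dnn(input_dim, width=64, dropout=0.2):
    """DNN over speaker-level features: two ReLU layers with dropout
    in between and a softmax output for binary (HC vs. HD) classification."""
    model = models.Sequential([
        layers.Input(shape=(input_dim,)),
        layers.Dense(width, activation="relu"),
        layers.Dropout(dropout),
        layers.Dense(width, activation="relu"),
        layers.Dropout(dropout),
        layers.Dense(2, activation="softmax"),
    ])
    model.compile(optimizer="adam",                        # assumed optimizer
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model

def build_lstm(num_feats, width=64, dropout=0.2, l2=1e-4):
    """LSTM-RNN over variable-length utterance sequences with recurrent
    dropout and L2 regularization on kernel and bias."""
    reg = tf.keras.regularizers.l2(l2)
    model = models.Sequential([
        layers.Input(shape=(None, num_feats)),             # (timesteps, feats)
        layers.LSTM(width, recurrent_dropout=dropout, return_sequences=True,
                    kernel_regularizer=reg, bias_regularizer=reg),
        layers.LSTM(width, recurrent_dropout=dropout,
                    kernel_regularizer=reg, bias_regularizer=reg),
        layers.Dense(2, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```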
Table 5: Classification results (FA = forced alignment, ORAT = oracle, GF = grandfather, ASRT = ASR system)

Method      FA-ORAT            FA-GF              ASRT
            Accuracy  F1 (HD)  Accuracy  F1 (HD)  Accuracy  F1 (HD)
k-NN        0.81      0.77     0.82      0.79     0.81      0.77
DTW         0.87      0.86     0.84      0.81     0.81      0.77
DNN         0.87      0.87     0.85      0.84     0.85      0.84
LSTM-RNN    0.87      0.86     0.84      0.82     0.85      0.84
Table 6: The confusion matrix derived from the average percentage of classifications across all data processing methods and classifiers (rows = ground truth, columns = prediction)

              Healthy  HD
Healthy       0.95     0.05
Premanifest   0.54     0.46
Early         0.14     0.86
Late          0.02     0.98
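The k-NN (DTW) classifier compares utterance-level feature sequences with Dynamic Time Warping, which accommodates speakers whose sequences differ in length; a minimal sketch of the standard DTW recurrence with a Euclidean local cost (implementation details are our assumptions):

```python
import numpy as np

def dtw_distance(a, b):
    """Dynamic Time Warping distance between two utterance-level
    feature sequences a (n x d) and b (m x d)."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(a[i - 1] - b[j - 1])  # local Euclidean cost
            D[i, j] = cost + min(D[i - 1, j],       # step in a only
                                 D[i, j - 1],       # step in b only
                                 D[i - 1, j - 1])   # step in both
    return D[n, m]
```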
7. Results
The results (Table 5) show the feasibility of HD detection using all presented methods. The FA-ORAT approach results in the most accurate HD predictions, with an accuracy of 0.81 for k-NN, 0.87 for DTW, 0.87 for DNN, and 0.87 for LSTM-RNN. In the majority of cases in Table 5, the accuracy decreases slightly when less accurate transcriptions are used in place of FA-ORAT. For example, we obtain an accuracy of 0.85 for DNN when using either FA-GF or ASRT.

We assess the statistical significance of the changes in performance across classification and transcription approaches using Cochran's Q test, which compares the binary predictions over each of the 62 speakers across all methods and classifiers. We assert significance when p < 0.05. The result demonstrates that there is no overall statistically significant difference across individual classifiers and transcription methods (Q(11) = 15.6, p = 0.157). This suggests that there are multiple opportunities to recognize symptomatology and avenues to research how speech changes are associated with illness. Further, given appropriately constrained content, ASR transcripts can be used as a substitute for manual transcripts when extracting speech features to assess HD symptomatology.

Our system is able to accurately distinguish between healthy participants and individuals with early- and late-stage HD (Table 6). Our results show improved classification for later HD stages, which suggests that our features more accurately capture HD speech for individuals whose disease is more advanced, compared to those at earlier stages. Further, the results point to the difficulty of recognizing premanifest HD, due to similarities in speech compared to both the healthy and HD populations.

We analyze the relationship between feature category and disease stage, focusing on the static ORAT feature set. We aggregate the test sets generated over each run of LOSO (62 sets), retaining only the features that have non-zero variance and information gain across all 62 speakers (GoP, speech rate, pause, and filler features). We then separate the data into the four disease categories (HC, premanifest, early, late) and identify the subset of features that are significantly different between the HC population and each of the disease stages. We assert significance when p < 0.05, using a two-tailed independent-samples t-test, and apply Bonferroni correction to account for the family-wise error rate. We report the percentage of features that are statistically significantly different between the HC population and each disease stage, and note whether the features of the individuals with HD are greater than, less than, or not statistically significantly different from those of the HC population. The results demonstrate that, generally, GoP decreases, speech rate decreases, and the number of pauses increases with disease severity (Table 7).

Table 7: Percentage of features (GoP, speech rate (SR), pauses (P)) extracted from ORAT that are statistically significantly different (p < 0.05) between HC and each disease stage

Stage        Feature  Increase  Decrease  No Change
Premanifest  GoP      0.0       0.06      0.94
             SR       0.04      0.32      0.64
             P        0.35      0.0       0.65
Early        GoP      0.12      0.76      0.12
             SR       0.2       0.48      0.32
             P        0.85      0.0       0.15
Late         GoP      0.12      0.76      0.12
             SR       0.2       0.6       0.2
             P        0.64      0.0       0.36
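A sketch of the per-feature stage-versus-HC comparison behind Table 7, using SciPy's independent-samples t-test with a Bonferroni-corrected threshold (the exact correction family and the array layout are our assumptions):

```python
import numpy as np
from scipy import stats

def significant_feature_shifts(hc, stage, alpha=0.05):
    """Two-tailed independent-samples t-tests per feature, comparing the
    HC group (n_hc x d) to one disease stage (n_stage x d), with a
    Bonferroni-corrected threshold over d features. Returns per-feature
    labels: +1 (increase), -1 (decrease), 0 (no significant change)."""
    d = hc.shape[1]
    t, p = stats.ttest_ind(stage, hc, axis=0)
    sig = p < alpha / d                    # Bonferroni correction
    # Sign of t indicates whether the stage mean is above or below HC.
    return np.where(sig, np.sign(t).astype(int), 0)
```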
8. Conclusion
In this work, we demonstrate the effectiveness of classifying HD using key speech features, including speech rate, pauses, fillers, and GoP. Our experimental results show that automated approaches can be used to generate transcripts, extract features, and classify HD. The accuracy of the presented methods increases with disease stage, which suggests that speech may serve as an effective biomarker for tracking HD progression. Finally, the performance of both the static and dynamic approaches suggests that there are multiple opportunities for tracking symptomatology in this domain. Further improvements to our automated system can be made by increasing ASR performance through the incorporation of additional out-of-domain data (e.g., [32]). We will also investigate the development of new, more descriptive features.
9. Acknowledgements
Work on this manuscript was supported by the National Institutes of Health (NIH), National Center for Advancing Translational Sciences (UL1TR000433). In addition, a portion of this study sample was collected in conjunction with the National Institutes of Health (NIH), National Institute of Neurological Disorders and Stroke (R01NS077946) and/or Enroll-HD (funded by the CHDI Foundation). Lastly, this work was also supported by the National Science Foundation (CAREER-1651740).
10. References

[1] N. Carlozzi, S. Schilling, J.-S. Lai, J. Perlmutter, M. Nance, J. Waljee, J. Miner, S. Barton, S. Goodnight, and P. Dayalu, "HDQLIFE: the development of two new computer adaptive tests for use in Huntington disease, speech difficulties, and swallowing difficulties," Quality of Life Research, vol. 25, no. 10, pp. 2417-2427, 2016.
[2] S. L. Mason, R. E. Daws, E. Soreq, E. B. Johnson, R. I. Scahill, S. J. Tabrizi, R. A. Barker, and A. Hampshire, "Predicting clinical diagnosis in Huntington's disease: An imaging polymarker," Annals of Neurology, 2018.
[3] J. S. Paulsen, "Early detection of Huntington's disease," Future Neurology, vol. 5, no. 1, pp. 85-104, 2010.
[4] W. Hinzen, J. Rosselló, C. Morey, E. Camara, C. Garcia-Gorro, R. Salvador, and R. de Diego-Balaguer, "A systematic linguistic profile of spontaneous narrative speech in pre-symptomatic and early stage Huntington's disease," Cortex, 2017.
[5] A. P. Vogel, C. Shirbin, A. J. Churchyard, and J. C. Stout, "Speech acoustic markers of early stage and prodromal Huntington's disease: A marker of disease onset?" Neuropsychologia, vol. 50, no. 14, pp. 3273-3278, 2012.
[6] W. P. Gordon and J. Illes, "Neurolinguistic characteristics of language production in Huntington's disease: a preliminary report," Brain and Language, vol. 31, no. 1, pp. 1-10, 1987.
[7] A. B. Young, I. Shoulson, J. B. Penney, S. Starosta-Rubinstein, F. Gomez, H. Travers, M. A. Ramos-Arroyo, S. R. Snodgrass, E. Bonilla, H. Moreno et al., "Huntington's disease in Venezuela: neurologic features and functional decline," Neurology, vol. 36, no. 2, pp. 244-244, 1986.
[8] I. Hertrich and H. Ackermann, "Acoustic analysis of speech timing in Huntington's disease," Brain and Language, vol. 47, no. 2, pp. 182-196, 1994.
[9] L. R. Kaploun, J. H. Saxman, P. Wasserman, and K. Marder, "Acoustic analysis of voice and speech characteristics in presymptomatic gene carriers of Huntington's disease: biomarkers for preclinical sign onset?" Journal of Medical Speech-Language Pathology, vol. 19, no. 2, pp. 49-65, 2011.
[10] K. C. Fraser, J. A. Meltzer, N. L. Graham, C. Leonard, G. Hirst, S. E. Black, and E. Rochon, "Automated classification of primary progressive aphasia subtypes from narrative speech transcripts," Cortex, vol. 55, pp. 43-60, 2014.
[11] D. Le, K. Licata, C. Persad, and E. M. Provost, "Automatic assessment of speech intelligibility for individuals with aphasia," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 24, no. 11, pp. 2187-2199, 2016.
[12] L. Tóth, G. Gosztolya, V. Vincze, I. Hoffmann, G. Szatlóczki, E. Biró, F. Zsura, M. Pákáski, and J. Kálmán, "Automatic detection of mild cognitive impairment from spontaneous speech using ASR," in Sixteenth Annual Conference of the International Speech Communication Association, 2015.
[13] K. Fraser, F. Rudzicz, N. Graham, and E. Rochon, "Automatic speech recognition in the diagnosis of primary progressive aphasia," in Proceedings of the Fourth Workshop on Speech and Language Processing for Assistive Technologies, 2013, pp. 47-54.
[14] R. Sadeghian, D. J. Schaffer, and S. A. Zahorian, "Using automatic speech recognition to identify dementia in early stages," The Journal of the Acoustical Society of America, vol. 138, no. 3, pp. 1782-1782, 2015.
[15] L. Zhou, K. C. Fraser, and F. Rudzicz, "Speech recognition in Alzheimer's disease and in its assessment," in INTERSPEECH, 2016, pp. 1948-1952.
[16] K. Mengistu and F. Rudzicz, "Comparing humans and automatic speech recognition systems in recognizing dysarthric speech," Advances in Artificial Intelligence, pp. 291-300, 2011.
[17] B. Peintner, W. Jarrold, D. Vergyri, C. Richey, M. L. G. Tempini, and J. Ogar, "Learning diagnostic models using speech and language measures," in Engineering in Medicine and Biology Society, 2008 (EMBS 2008), 30th Annual International Conference of the IEEE. IEEE, 2008, pp. 4648-4651.
[18] E. C. Guerra and D. F. Lovey, "A modern approach to dysarthria classification," in Engineering in Medicine and Biology Society, 2003: Proceedings of the 25th Annual International Conference of the IEEE, vol. 3. IEEE, 2003, pp. 2257-2260.
[19] I. Shoulson, R. Kurlan, A. Rubin, D. Goldblatt, J. Behr, C. Miller, J. Kennedy, K. A. Bamford, E. D. Caine, D. K. Kido et al., "Assessment of functional capacity in neurodegenerative movement disorders: Huntington's disease as a prototype," Quantification of Neurologic Deficit. Boston: Butterworths, pp. 271-283, 1989.
[20] K. Kieburtz, J. B. Penney, P. Corno, N. Ranen, I. Shoulson, A. Feigin, D. Abwender, J. T. Greenamyre, D. Higgins, F. J. Marshall et al., "Unified Huntington's disease rating scale: reliability and consistency," Neurology, vol. 11, no. 2, pp. 136-142, 2001.
[21] J. Reilly and J. L. Fisher, "Sherlock Holmes and the strange case of the missing attribution: A historical note on the Grandfather Passage," Journal of Speech, Language, and Hearing Research, vol. 55, no. 1, pp. 84-88, 2012.
[22] J. Duffy, Motor Speech Disorders: Substrates, Differential Diagnosis, and Management. St. Louis, MO: Mosby-Year Book, 1995.
[23] R. I. Zraick, D. J. Davenport, S. D. Tabbal, T. J. Hutton, G. M. Hicks, and J. H. Patterson, "Reliability of speech intelligibility ratings using the unified Huntington disease rating scale," Journal of Medical Speech-Language Pathology, vol. 12, no. 1, pp. 31-41, 2004.
[24] F. L. Darley, A. E. Aronson, and J. R. Brown, Motor Speech Disorders. Saunders, 1975.
[25] R. Patel, K. Connaghan, D. Franco, E. Edsall, D. Forgit, L. Olsen, L. Ramage, E. Tyler, and S. Russell, "'The Caterpillar': A novel reading passage for assessment of motor speech disorders," American Journal of Speech-Language Pathology, vol. 22, no. 1, pp. 1-9, 2013.
[26] B. MacWhinney, The CHILDES Project: The Database. Psychology Press, 2000, vol. 2.
[27] R. Weide, "The CMU pronunciation dictionary, release 0.6," Carnegie Mellon University, 1998.
[28] D. Povey, A. Ghoshal, G. Boulianne, L. Burget, O. Glembek, N. Goel, M. Hannemann, P. Motlicek, Y. Qian, P. Schwarz et al., "The Kaldi speech recognition toolkit," in IEEE 2011 Workshop on Automatic Speech Recognition and Understanding, no. EPFL-CONF-192584. IEEE Signal Processing Society, 2011.
[29] B. Roark, M. Mitchell, J.-P. Hosom, K. Hollingshead, and J. Kaye, "Spoken language derived measures for detecting mild cognitive impairment," IEEE Transactions on Audio, Speech, and Language Processing, vol. 19, no. 7, pp. 2081-2090, 2011.
[30] S. M. Witt and S. J. Young, "Phone-level pronunciation scoring and assessment for interactive language learning," Speech Communication, vol. 30, no. 2, pp. 95-108, 2000.
[31] M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard et al., "TensorFlow: A system for large-scale machine learning," in OSDI, vol. 16, 2016, pp. 265-283.
[32] D. Le, K. Licata, and E. M. Provost, "Automatic paraphasia detection from aphasic speech: A preliminary study."