Detecting Parkinson's Disease From an Online Speech-task

Wasifur Rahman a,*, Sangwu Lee a, Md. Saiful Islam b, Victor Nikhil Antony a, Harshil Ratnu a, Mohammad Rafayet Ali a, Abdullah Al Mamun a, Ellen Wagner c, Stella Jensen-Roberts c, Max A. Little d,e, Ray Dorsey c, and Ehsan Hoque a

a Department of Computer Science, University of Rochester, United States
b Department of Computer Science and Engineering, BUET, Bangladesh
c Center for Health + Technology, University of Rochester Medical Center, United States
d School of Computer Science, University of Birmingham, United Kingdom
e MIT Media Lab, MIT, United States
Abstract
In this paper, we envision a web-based framework that can help anyone, anywhere around the world record a short speech task and have the recorded data analyzed to screen for Parkinson's disease (PD). We collected data from 726 unique participants (262 PD, 38% female; 464 non-PD, 65% female; average age: 61) from all over the US and beyond. A small portion of the data was collected in a lab setting to compare quality. The participants were instructed to utter a popular pangram containing all the letters in the English alphabet, "the quick brown fox jumps over the lazy dog ...". We extracted both standard acoustic features (Mel-frequency cepstral coefficients (MFCC), jitter and shimmer variants) and deep learning based features from the speech data. Using these features, we trained several machine learning algorithms. We achieved 0.75 AUC (Area Under the Curve) in determining the presence of self-reported Parkinson's disease by modeling the standard acoustic features with XGBoost, a gradient-boosted decision tree model. Further analysis reveals that the widely used MFCC features and a subset of previously validated dysphonia features designed for detecting Parkinson's from the verbal phonation task (pronouncing 'ahh ...') contain the most distinct information. Our model performed equally well on data collected in a controlled lab environment and 'in the wild', and across different gender and age groups. Using this tool, we can collect data from almost anyone anywhere with a video/audio enabled device, contributing to equity and access in neurological care.
Parkinson's disease (PD) is the fastest growing neurological disease in the world. Unfortunately, an estimated 20% of PD patients remain undiagnosed. This can be largely attributed to the worldwide shortage of neurologists Khadilkar [2013], Howlett [2014], and to limited access to healthcare. An early diagnosis and continuous monitoring, which allows for adjusting medication dosage, are the keys to managing the symptoms of this incurable disease. The current standard of diagnosis requires in-person clinic visits where an expert assesses the disease symptoms while observing the patient perform tasks from the Unified Parkinson's Disease Rating Scale (MDS-UPDRS) Goetz et al. [2008]. The MDS-UPDRS includes 24 motor related tasks to assess speech, facial expression, limb movements, walking, memory and cognitive abilities. Although prior work has shown success by analyzing hand movements Ali et al. [2020], limb movement patterns Lonini et al. [2018], and facial expressions Bandini et al. [2017], speech is especially important because around 90% of PD patients exhibit vocal impairment Ho et al. [1998], Logemann et al. [1978], which is often one of the earliest indicators of PD Duffy [2019].

In this paper, we present our analysis of 726 audio recordings of speech from 262 individuals with PD and 464 without. The speech recordings were collected using a web-based tool called Parkinson's Analysis with Remote Kinetic-tasks (PARK). The PARK tool instructed participants to utter a popular pangram containing all the letters in the English alphabet, "the quick brown fox jumps over the lazy dog...", and recorded it. This allowed us to rapidly collect a dataset that is more likely to contain the real-world variability associated with geographical boundaries, socio-economic status, age groups and a wide variety of heterogeneous recording devices.

* [email protected]
The findings in this paper build on this unique real-world dataset and thus, we believe, could potentially generalize to real-world deployments.

Collecting audio data from individuals often requires in-person visits to the clinic, limiting the number of data points as well as the diversity within the data. Recent advances have allowed collecting tremor data from wearable sensors Kubota et al. [2016] as well as sleep data from RF radio signals Yue et al. [2020]. Existing work on speech and audio analysis utilizes sophisticated equipment for collecting data, which is often noise free Tsanas et al. [2009], Little et al. [2008] and does not contain real-world variability. Given that a significant portion of the population has access to a mobile device with recording capability (for example, 81% of Americans own a smartphone Center [2019]), we opted to use a framework allowing participants to record data from their home. From the recorded audio files, we extracted acoustic features including Mel-frequency cepstral coefficients (MFCCs), known to represent the short-term power spectrum of a sound; jitter/shimmer variants (representing pathological voice quality); pitch related features; spectral power; and dysphonia related features, which are designed to capture PD-induced vocal impairment Little et al. [2008]. Additionally, we extracted features from a deep-learning based encoder – the Problem Agnostic Speech Encoder (PASE) Pascual et al. [2019] – that represents the information contained in a raw audio instance through a list of encoded vectors. These features are modeled with four different machine learning models – Support-Vector-Machine, Random Forest, LightGBM, and XGBoost – to classify individuals with and without PD.

Figure 1: An outline of our approach for solving the speech task of uttering "The quick brown fox ...". While recording data, the participants often take some time to start recording the speech and some more time to exit the data-collection system once the recording is done: those two segments are marked as pre-speech and post-speech respectively. We automatically remove these two components and analyze the audio data corresponding to completing the speech task to predict whether the subject has Parkinson's disease (PD).

Fig. 1 provides an outline of the data-analysis system. Our contributions can be summarized as follows:
• We report findings from one of the largest datasets with real-world variability, containing 726 unique participants recorded mostly from their homes.
• We analyzed the audio features of speech to predict PD vs. non-PD with a 0.7533 AUC (Area Under the Curve) score.
• We provide evidence that our model prioritizes MFCC features and a subset of dysphonia features Little et al. [2008], Tsanas et al. [2012b], consistent with prior literature.
• Our model performs consistently well when tested on gender and age stratified data collected in a controlled lab environment as well as 'in the wild'.
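As a concrete illustration of the MFCC pipeline underlying many of the acoustic features used here, the sketch below computes per-frame MFCCs from a synthetic tone and summarizes them as a per-coefficient mean and mean frame-to-frame variation, loosely mirroring the MeanMFCC/VariationMFCC summaries described later. This is a from-scratch numpy sketch, not the paper's actual implementation; the frame sizes, filter counts, and the variation definition are illustrative assumptions.

```python
import numpy as np

def mel_filterbank(n_filters, n_fft, sr, fmin=0.0, fmax=None):
    """Triangular mel-spaced filterbank over the rfft bins."""
    fmax = fmax or sr / 2
    mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    imel = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    hz_pts = imel(np.linspace(mel(fmin), mel(fmax), n_filters + 2))
    bins = np.floor((n_fft + 1) * hz_pts / sr).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        lo, c, hi = bins[i - 1], bins[i], bins[i + 1]
        for k in range(lo, c):
            fb[i - 1, k] = (k - lo) / max(c - lo, 1)   # rising slope
        for k in range(c, hi):
            fb[i - 1, k] = (hi - k) / max(hi - c, 1)   # falling slope
    return fb

def mfcc(signal, sr, n_mfcc=13, frame_len=400, hop=160, n_fft=512, n_filters=26):
    """Standard chain: frame -> window -> power spectrum -> mel -> log -> DCT."""
    n_frames = 1 + (len(signal) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = signal[idx] * np.hamming(frame_len)
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2 / n_fft
    logmel = np.log(power @ mel_filterbank(n_filters, n_fft, sr).T + 1e-10)
    k = np.arange(n_filters)
    dct = np.cos(np.pi * np.arange(n_mfcc)[:, None] * (2 * k + 1) / (2 * n_filters))
    return logmel @ dct.T                               # (n_frames, n_mfcc)

sr = 16000
t = np.arange(sr) / sr
y = np.sin(2 * np.pi * 220.0 * t)                       # one second of a 220 Hz tone
C = mfcc(y, sr)                                         # per-frame coefficients
mean_mfcc = C.mean(axis=0)                              # mean per coefficient
variation_mfcc = np.abs(np.diff(C, axis=0)).mean(axis=0)  # frame-to-frame variation
```

In practice a library implementation (such as the code sources cited for the feature table) would be used; the sketch only makes the frame/filterbank/DCT chain explicit.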
We collected data from 726 unique participants uttering the sentences "The quick brown fox jumps over the lazy dog. The dog wakes up and follows the fox into the forest, but again the quick brown fox jumps over the lazy dog" using the Parkinson's Analysis with Remote Kinetic-tasks (PARK) Langevin et al. [2019] tool. Fig. 2 contains some representative samples from our dataset. Table 1 shows the demographic information of the study participants. The gender distribution in the dataset is slightly skewed. Among all the participants, 55% were female and 45% were male. However, among participants with PD, only 38% were female, while for non-PD, 65% were female. Fig. 7 shows the age distribution of the participants. We have a healthy balance between the number of PD/non-PD participants in the age range of [40-80] years, but most of the younger ([20-40] years) and older ([80-90] years) participants are from the non-PD and PD groups respectively. Among the 726 participants, 54 completed the audio recording in a lab and the other 672 completed their recording at home. Having participants perform the tasks at home and in the lab allowed us to compare the results across both conditions. No participant appears in both sets, and all of our participants used the identical PARK protocol.

Figure 2: Some screenshots of our subjects while providing the data. Subject A is from India; the others are from the US. All the subjects except B provided data without any supervision. B, D, E, and F have been diagnosed with PD; B and F were diagnosed at the early ages of 36 and 29 respectively. Electronic informed consent was taken from the participants to use their photos for publication.

Table 1: Demographic composition of our dataset

                                      PD         non-PD
No. of participants (N)               262        464
Female/Male                           101/161    300/164
Age (mean ± std)                      . ± .      . ± .
Country (US/other)                    199/63     419/45
Years since diagnosed (mean ± std)    . ± .      N/A

The data were pre-processed, and both standard acoustic features (pitch, jitter, shimmer, MFCC, etc.) and deep-learning based audio embedding features – representing an audio clip as a feature vector – were extracted; we will call these Standard-features and Embedding-features from now on. A complete list of Standard-features is provided in Table 4, followed by a detailed description of the features in 4.3. We extract the Embedding-features from the PASE encoder Pascual et al. [2019], which converts an audio signal into a representative vector.

The rest of this section is organized as follows: results from the models built on the entire dataset (2.1), interpretation of the best model on the entire dataset (2.2), and results and interpretation from specialized models on gender-stratified and age-matched datasets (2.3).

Table 2: Performance on the entire dataset: The performance of various machine learning algorithms using the Standard-features and Embedding-features on a dataset combining data from both the home environment and the lab environment. Models using Standard-features perform better than models using Embedding-features in terms of both binary accuracy and AUC. Although the models perform almost similarly in terms of the AUC metric, XGBoost outperforms the others when the AUC and accuracy metrics are considered simultaneously.

Algorithm        Standard-features       Embedding-features
                 AUC      Accuracy       AUC      Accuracy
SVM              0.751    0.735          0.738    0.692
Random Forest    0.745    0.720          0.726    0.708
LightGBM         0.753    0.720          0.737    0.693
XGBoost
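A minimal sketch of the evaluation protocol behind numbers like those in Table 2 – leave-one-out cross-validation, pooling each held-out prediction, then scoring AUC and accuracy once over the pooled predictions – is shown below. It uses synthetic stand-in features and scikit-learn's GradientBoostingClassifier in place of the paper's actual feature matrix and XGBoost model.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import LeaveOneOut
from sklearn.metrics import roc_auc_score, accuracy_score

rng = np.random.default_rng(0)
# Synthetic stand-ins for the Standard-features matrix and PD labels.
X = rng.normal(size=(60, 5))
y = (X[:, 0] + 0.5 * rng.normal(size=60) > 0).astype(int)

scores = np.zeros(len(y))   # held-out probability for each instance
preds = np.zeros(len(y))    # held-out hard prediction for each instance
for train, test in LeaveOneOut().split(X):
    clf = GradientBoostingClassifier(n_estimators=20, random_state=0)
    clf.fit(X[train], y[train])
    scores[test] = clf.predict_proba(X[test])[:, 1]
    preds[test] = clf.predict(X[test])

auc = roc_auc_score(y, scores)       # one AUC over all pooled held-out scores
acc = accuracy_score(y, preds)       # binary accuracy over held-out predictions
```

With one fit per instance, leave-one-out is the most data-efficient but also the most expensive protocol; it is practical here because each model trains quickly.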
To detect the PD patients in our dataset, we applied four machine learning algorithms: Support-vector-machine (SVM) Cortes and Vapnik [1995], XGBoost Chen and Guestrin [2016], LightGBM Ke et al. [2017], and Random Forest Ho [1995]. We used a leave-one-out cross validation strategy in which each data instance of the dataset is iteratively left out while the other data instances are used to train a model that predicts the left-out instance. We used binary accuracy and Area-Under-Curve (AUC) metrics to report our models' performance. For a binary classifier, AUC denotes the area under the curve produced by plotting the true positive rate versus the false positive rate while varying the decision threshold of the model. Since the dataset is imbalanced, AUC is a better metric than accuracy to demonstrate the performance of our models. Table 2 contains the AUC and accuracy scores of the four machine learning models trained on the Standard-features and Embedding-features separately. Applying XGBoost to the Standard-features gave us the best performance, with an AUC of 0.7533. We also notice that models trained on the interpretable Standard-features work better than those trained on the non-interpretable Embedding-features.

To further focus on the clinical implications of our work, we wanted to interpret the decisions of our classifiers. We use SHAP (SHapley Additive exPlanations) Lundberg and Lee [2017], Lundberg et al. [2020] to recognize the features that are driving the model's performance. We chose SHAP for two reasons: it is well-suited for explaining the output of any machine learning model, and it is the only feature attribution method that fulfills the mathematical definition of fairness. The goal of SHAP is to explain a model's prediction for any instance as a sum of contributions from its feature values; if a data instance can be thought of as X_i = [f_1, f_2, ..., f_N], SHAP will assign a number to each of these f_j features, denoting the impact of that feature – both the magnitude and direction – on the model's prediction. All these local explanations are then aggregated to create a global interpretation for the entire dataset. That global interpretation is presented in Fig. 3.A, where the top 20 most impactful features are presented, ranked from most impact to least. To calculate each feature's impact, all of its SHAP values across all the data instances are gathered, and the mean of their absolute values is calculated. A more technical description of SHAP is provided in Section 4.6.

The features that impacted the model's performance are typically the spectral features: the mean values or the variation of MFCC in each spectrum range. Apart from that, some other complex features, such as RPDE (a measure of uncertainty in F0 estimation), PPE (a measure of inability to maintain a constant F0), and HNR (Harmonic-to-Noise Ratio), also impacted the model's decision.

The characteristics of a person's voice are greatly influenced by their age and gender. In Fig. 4, we see that males and females display changing characteristics in their voice as they get older.
Therefore, age and gender can produce confounding effects when analyzing PD from audio, where the machine learning model uses audio features to detect PD. To minimize the effect of these confounding factors, researchers in the past have trained separate models on data from male and female participants Tsanas et al. [2012a], or analyzed an age-matched dataset by considering only data from participants above the age of 50 Ali et al. [2020], Langevin et al. [2019].
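The global aggregation step described above – turning per-instance SHAP values into a single ranking like the one in Fig. 3 – is simply a mean of absolute values per feature. A sketch with a stand-in matrix follows; in practice the matrix would come from a tree explainer such as shap.TreeExplainer(model).shap_values(X), and the feature names here are an illustrative subset.

```python
import numpy as np

rng = np.random.default_rng(0)
feature_names = ["VariationMFCC5", "MeanMFCC2", "RPDE", "PPE", "HNR"]  # illustrative

# Stand-in for an (n_instances x n_features) SHAP matrix; the per-column scale
# makes column 0 deliberately dominant, mimicking a most-impactful feature.
shap_values = rng.normal(size=(200, 5)) * np.array([3.0, 2.0, 1.5, 1.0, 0.5])

impact = np.abs(shap_values).mean(axis=0)   # mean |SHAP| per feature
order = np.argsort(impact)[::-1]            # most impactful first
ranking = [feature_names[i] for i in order]
```

Each row of shap_values is a local explanation (one signed contribution per feature for one instance); averaging absolute values discards direction and keeps only magnitude, which is what the bar plots in Fig. 3 display.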
Building specialized models for each gender and for age-matched analysis: The performance metrics of the machine learning models trained on the male, female and age-matched datasets are presented in Table 3. Comparing with the metrics presented in Table 2, we can see that the models trained on the male or age-matched datasets performed on par with or better than the models trained on the whole dataset. However, there is a performance drop for the models trained on the female dataset. Table 1 shows that females are over-represented in the non-PD group and under-represented in the PD group, leading to data imbalance and possibly lowering performance for the female-only model.

We also analyzed the features that are driving these specialized models' performance through SHAP analysis. Fig. 3.B displays the most salient features ranked by their SHAP value, along with the distribution of how each feature impacts the model's decision making. The most important features are still dominated by the MFCC related features and complex features like HNR (Harmonic-to-Noise Ratio), relative band power in different frequency ranges (RelBandPower1, RelBandPower3), RPDE (uncertainty in F0 estimation), perturbation in F0 (DdpJitter) and perturbation in amplitude (Apq11Shimmer). However, one noticeable fact is that three pitch and jitter related features – MedianPitch (median principal frequency), StdDevPitch (standard deviation in principal frequency) and MedianJitter (median variation in F0) – also impact the model's prediction, which was not observed in the SHAP analysis run on the all-data model.

Similarly, we interpreted the salient features for the age-matched dataset in Fig. 3.C. We noticed that the most salient features include the dysphonia features RPDE and PPE.
Figure 3: We briefly describe the SHAP analysis of our best performing models on three datasets: (A) the entire dataset, (B) female only, and (C) age matched (all subjects are above age 50). The MFCC features (put within a red box) are highly significant in all cases. Additionally, RPDE (measuring uncertainty in F0 estimation), PPE (measure of inability of maintaining a constant F0), HNR (Harmonic-to-Noise ratio), MedianShimmer (median amplitude perturbation), and relative band power features (RelBandPower1, RelBandPower2, RelBandPower3, indicating the amount of power in several frequency ranges) also impact the model's behaviour significantly. Pitch related features like MedianPitch, MedianJitter, StdDevPitch are also important for the female and age matched models. Dysphonia features (RPDE and PPE) are the most important for the age matched model.

Table 3: Gender and age stratified models: Three separate datasets are constructed: a Male dataset with male subjects, a Female dataset with female subjects, and an age-matched dataset excluding subjects below the age of 50. For each of these datasets, a separate model is constructed and its performance reported below.

Algorithm        Male                Female              Age-matched
                 AUC     Accuracy    AUC     Accuracy    AUC     Accuracy
SVM
Random Forest    0.758   0.702       0.699   0.788       0.739   0.713
LightGBM         0.725   0.665
Some of the most common voice disorders induced by PD are dysphonia (distortion or abnormality of voice), dysarthria (problems with speech articulation), and hypophonia (reduced voice volume). Two speech-related diagnostic tasks are commonly used for detecting PD by exploiting the changing vocal patterns caused by these disorders: (i) sustained phonation, where the subject utters a single vowel for a long time with constant pitch, and (ii) running speech, where the subject speaks a standard sentence. Little et al. [2008] developed features for detecting dysphonia in people with PD. Tsanas et al. [2009] focused on telemonitoring of a self-administered sustained vowel phonation task to predict the UPDRS rating Rating Scales for Parkinson's Disease [2003] – a commonly used indicator for quantifying PD symptoms. These studies train their models with data captured by sophisticated devices (e.g., wearable devices, high resolution video recorders, the Intel at-home-testing-device telemonitoring system) that are often not accessible to all and are difficult to scale. The performance of these models can degrade significantly when classifying data collected in home acoustics. Additionally, completing the sustained phonation task correctly requires following a specific set of guidelines, such as completing the task in one breath, which can be difficult for older individuals.

In contrast, we analyze the running speech task from data collected using a web-based data collection platform that can be accessed by anyone, anywhere in the world, and requires only an internet connected device with an integrated camera and microphone. Besides, the running speech task does not require conforming to specific instructions and is more similar to regular conversation; therefore, the model can potentially be augmented to predict PD from regular conversation – a potential game changer in PD assessment. In the future, user-consented plug-ins could be developed for any application such as Alexa, Google Home, or Zoom where audio is transmitted between persons. Anyone who consents to download the plug-in and uses it while they are on the phone, over Zoom, or giving virtual/in-person presentations could benefit from receiving an informal referral to see a neurologist, when appropriate.
The features that SHAP found to impact modeling decisions are well-supported by previous research. For example, MFCC features have already proven to be useful in a wide range of audio tasks such as speaker recognition Bhattarai et al. [2017], music information retrieval Müller [2007], voice activity detection Kinnunen et al. [2007], and, most importantly, voice quality assessment Tsanas et al. [2011]. Similarly, the high impact of HNR (Harmonic-to-Noise ratio), RPDE (measuring uncertainty in F0 estimation), and PPE (measure of inability of maintaining a constant F0) on the model's output is in congruence with the findings of Little et al. [2008]. However, explaining Fig. 3(A) in light of PD-induced vocal impairment is a difficult task. MFCC features are calculated by converting the audio signal into the frequency domain; they denote how the energy in the signal is distributed across various frequency ranges. Therefore, giving a physical interpretation to the SHAP values corresponding to the MFCC features is not straightforward. Similarly, Little et al. [2008] designed the RPDE and PPE features for modeling the sustained phonation task (uttering 'ahh ...') under the assumption that healthy participants will be able to maintain a smooth and regular voice pattern. In contrast, uttering multiple sentences introduces a lot of variation in the data, adding a wide set of heterogeneous patterns. Therefore, the underlying assumptions behind constructing those features do not hold for our task of uttering multiple sentences.

In Fig. 5, we present an empirical validation of the SHAP output presented in Fig. 3.A. We incrementally add one feature at a time to build a dynamic feature set, train successive models on that feature set, and report the accuracy and AUC. We can see that the performance of our model saturates after adding 7-8 features. Therefore, we can say that the SHAP analysis successfully teases out the most important features driving the model's performance.
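The incremental experiment behind Fig. 5 – adding one ranked feature at a time, retraining, and recording AUC – can be sketched as below. The data, the feature ranking, and the model (scikit-learn's GradientBoostingClassifier with 5-fold cross_val_predict in place of the paper's model and leave-one-out protocol) are all illustrative stand-ins.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(1)
X = rng.normal(size=(80, 6))
y = (X[:, 0] - X[:, 1] + 0.7 * rng.normal(size=80) > 0).astype(int)
ranked = [0, 1, 2, 3, 4, 5]   # stand-in for a SHAP importance ordering

aucs = []
for k in range(1, len(ranked) + 1):
    cols = ranked[:k]         # grow the feature set one ranked feature at a time
    proba = cross_val_predict(
        GradientBoostingClassifier(n_estimators=30, random_state=0),
        X[:, cols], y, cv=5, method="predict_proba")[:, 1]
    aucs.append(roc_auc_score(y, proba))
```

Plotting aucs against k reproduces the kind of saturation curve the paper reports: once the informative features are in, adding lower-ranked features should leave AUC roughly flat.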
We built models inclusive of all genders for several reasons. First, there are potentially shared characteristics among the vocal patterns of both genders that can be relevant for detecting PD. Second, dividing the dataset into two portions would reduce the available training data for each model, which may in turn reduce the generalization capability of each model. Besides, our model analyzes data from patients of all ages. Although most people who are diagnosed with PD are over the age of 60, about 10%-20% of the diagnosed PD population are under the age of 50, and about half of them are under the age of 40 None [2020a]. As anecdotal evidence, Michael J. Fox was diagnosed with PD at the age of 29, and Muhammad Ali had PD by 42. This is unfortunate because these people have the longest to live with the PD symptoms. In our dataset, there is also a minority of PD patients below the age of 50 (Fig. 7). Based on these observations, we believe that our system should provide access to all people irrespective of age. PD does not discriminate by age, and an automated system should not discriminate based on age either; it should provide equitable service to people of all ages. However, these factors can act as confounders in PD analysis. Therefore, we provide additional analysis in 2.3 to ensure that our model is not using the idiosyncrasies of group-specific information to make predictions.

Figure 5: Validating the SHAP output of Fig. 3.A: We started with the most important feature depicted in Fig. 3 (VariationMFCC5) and added the next most salient feature one at a time. For each feature set, we trained a new model and show that model's performance in terms of AUC and accuracy. As evident from the figure, after adding the 8 most salient features (up to PPE), the primary metric, AUC, saturates.

When the data was collected in the lab, the participants had access to a clinician providing support, a consistent recording setup, and dedicated bandwidth.
On the contrary, the data collected in the home setting assumed no assistance, and included the real-world variability of heterogeneous recording setups and inconsistent internet speeds. Theoretically, the data collected in the "lab" and at "home" were very different from each other.

To ensure that our model works equally well without the "clean lab data", we designed two experiments. In experiment 1, we removed the clean lab data, which is around 7% of the entire dataset (54 data points), retrained our model on the remaining 672 participants with the leave-one-out validation procedure, and calculated the performance metrics. In experiment 2, we randomly removed 7% (roughly 54 data points) of the "home" data from the entire dataset (while keeping the "lab" data intact) and built a model with the remaining 93% of the data using the leave-one-out cross-validation method; we repeated this procedure 10 times and report the average performance across these 10 runs. Fig. 6 contains all these performance metrics, including the one achieved by keeping all the data (Table 2). Fig. 6 shows that the AUC metric across these three experiments is fairly consistent, with a very small 0.015 drop in AUC when removing the lab data, demonstrating that our framework performs equally well across "lab" and "home" data.

https://parkinsonsnewstoday.com/2016/06/10/muhammad-alis-advocacy-parkinsons-disease-endures-boxing-legacy/

Figure 6: The performance of our models is fairly consistent across three different experiments: Removing-lab-data, Removing-home-data (7%), and Keeping-all-data.
Using the PARK Langevin et al. [2019] protocol, we have collected one of the largest datasets of participants conducting a series of motor, facial expression and speech tasks following the MDS-UPDRS PD assessment protocol Goetz et al. [2008]. Although we analyze only the speech task in this paper, the dataset can potentially be used to automate the assessment of a large set of MDS-UPDRS tasks and facilitate early stage PD detection, thus improving the quality of life of millions of people worldwide. However, deploying the data-collection protocol on the web and facilitating access for anyone, anywhere around the world comes at a cost. So far, all of our PD participants have been clinically verified to be diagnosed with PD (more in 4.1). Therefore, the labels of the PD data points are reliable. However, our non-PD participants have not gone through any clinical verification. Our data-collection protocol asks them appropriate questions to check whether they have been diagnosed with PD, and collects data when they answer in the negative. However, we cannot discount the possibility that a small subset of our non-PD population is in the very early stage of PD and oblivious to it. At present, there are estimated to be around 1 million PD patients in the US, out of a population of 330 million None [2020b], yielding a PD prevalence rate of 0.3%. However, as our non-PD dataset is largely tilted towards people over the age of 50, the rate in our dataset could be higher than 0.3%. Even if we assume a liberal 1% prevalence rate, the number of individuals with undiagnosed Parkinson's in our control population is likely low (at most 4.6 persons). Therefore, we believe the non-PD data labels to be generally reliable. In the future, we plan to model a tremor score in the [0-4] range for each task – 0 for no tremor and 4 for severe tremor – instead of a binary label, following the MDS-UPDRS protocol, to address this problem more thoroughly.
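The back-of-the-envelope estimate above is simply:

```python
# US prevalence: ~1 million PD patients in a population of ~330 million.
prevalence_us = 1_000_000 / 330_000_000   # ~0.003, i.e., roughly 0.3%

# A liberal 1% rate applied to the 464 non-PD participants:
expected_undiagnosed = 0.01 * 464         # ~4.6 expected undiagnosed controls
```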
The PARK protocol is web-enabled, allowing anyone with access to the internet to contribute data. We plan to augment our dataset by adding more non-native English speakers, more females, and more PD participants. As our PD data is collected through contacts from local PD clinics and our non-PD data through Amazon Mechanical Turk, the majority of our participants are from the US or other English speaking countries. To make our model more robust on data from non-native English speakers, we are in the process of collecting both PD and non-PD data from non-native English speaking countries.

Our best model for the female data performs worse than its male counterpart, as demonstrated in Table 3. We attribute this degraded performance to the PD/non-PD imbalance among female participants in our dataset: the PD/non-PD ratio for females is 101/300 (Table 1). Previous epidemiological studies have shown that both the incidence and prevalence of PD are 1.5-2 times higher in men than in women Van Den Eeden et al. [2003], Haaxma et al. [2007]. Therefore, any randomly sampled dataset for PD will contain more males, contributing to models that are more biased towards males. Our immediate plan is to prioritize collecting balanced data across genders, ages and races and across geographical boundaries, leading to a balanced dataset.

Our dataset also suffers from the ubiquitous problem of data imbalance in diagnostic tests: the number of data points from non-PD participants is 1.8 times that of their PD counterparts. Therefore, there is a risk that the model will be biased towards predicting the majority non-PD class as a default and yield a high false-negative rate. To address this, we plan to recruit more PD participants in the future to make our dataset more balanced.

Although we consider AUC to be a better metric for our dataset, our model performs 10% better, in terms of binary accuracy, than always choosing non-PD as the prediction. To be practically deployable in clinical settings, the performance needs to be improved further.
We will focus on four promising avenues: making the dataset balanced; designing better features capable of modeling the nuanced patterns in our data; making our model resilient to the noise present in our data; and de-confounding the PD prediction from the age and gender variables.

Figure 7: A bar plot showing the age distribution of PD and non-PD subjects in our dataset. Although the number of non-PD subjects is 1.8 times the number of PD subjects, there is a healthy balance between these two groups in the age range of [40-80] years. However, non-PD subjects outnumber their PD counterparts in the age group [20-40]. Similarly, PD subjects significantly outnumber non-PD subjects in the age group of [80-90] years.

For removing noise, we plan to augment the techniques proposed in Poorjam et al. [2019] to automatically enhance our data quality by detecting the segments of data that conform to our experimental design. Besides, as discussed in 2.3, gender and age can appear as confounding variables in the PD prediction task. In this paper, we have shown that our unified model and the stratified gender-specific models have similar performance. However, we plan to build better models that systematically deconfound the effects of both the age and gender variables while still benefiting from them. We can do this by incorporating the causal bootstrapping technique – a re-sampling method that takes into account the causal relationships between variables and negates the effect of spurious, indirect interactions – outlined in Little and Badawy [2019].
Our dataset is collected using Parkinson's Analysis with Remote Kinetic-tasks (PARK) Langevin et al. [2019] – a web-based tool that guides users through a series of motor, facial expression, and speech tasks following the MDS-UPDRS PD assessment protocol Goetz et al. [2008]. While performing the tasks, the users are recorded via the webcam and microphone connected to their PC/laptop, and the recordings are uploaded to a server. In this work, we focus on the running speech task, where the participants were instructed to read the sentences "The quick brown fox jumps over the lazy dog. The dog wakes up and follows the fox into the forest, but again the quick brown fox jumps over the lazy dog." The first sentence is a pangram: it contains all the letters of the English alphabet; thus we get features relevant for pronouncing all phonemes for each subject.

Table 4: Names of all the features, the feature collection protocol, and a short description of each feature. The correlated features were removed and the features in bold text were used in building the models. Feature names are preceded by the loosely defined umbrella category they belong to.

Feature | Code-source | Short description
Pitch:MedianPitch | Little et al. [2007] | Median principal frequency
Pitch:MeanPitch | Boersma and Weenink [2018] | Mean principal frequency
Pitch:StdDevPitch | Little et al. [2007] | Standard deviation in principal frequency
Jitter:MeanJitter | Little et al. [2007] | Perturbation in principal frequency (mean variation)
Jitter:MedianJitter | Little et al. [2007] | Perturbation in principal frequency (median variation)
Jitter:LocalJitter | Boersma and Weenink [2018] | Jitter variant
Jitter:LocalAbsoluteJitter | Boersma and Weenink [2018] | Jitter variant
Jitter:RapJitter | Boersma and Weenink [2018] | Jitter variant
Jitter:Ppq5Jitter | Boersma and Weenink [2018] | Jitter variant
Jitter:DdpJitter | Boersma and Weenink [2018] | Jitter variant
Shimmer:MeanShimmer | Little et al. [2007] | Amplitude perturbation (using mean)
Shimmer:MedianShimmer | Little et al. [2007] | Amplitude perturbation (using median)
Shimmer:LocalShimmer | Boersma and Weenink [2018] | Shimmer variant
Shimmer:LocaldbShimmer | Boersma and Weenink [2018] | Shimmer variant
Shimmer:Apq3Shimmer | Boersma and Weenink [2018] | Shimmer variant
Shimmer:Apq5Shimmer | Boersma and Weenink [2018] | Shimmer variant
Shimmer:Apq11Shimmer | Boersma and Weenink [2018] | Shimmer variant
Shimmer:DdaShimmer | Boersma and Weenink [2018] | Shimmer variant
MFCC:MeanMFCC[0-12] | Little et al. [2007] | 13 features of mean MFCC
MFCC:VariationMFCC[0-12] | Little et al. [2007] | 13 features of mean variation of MFCC
RelBandPower[0-3] | Tsanas et al. [2012b] | Four features capturing relative band power in four spectrum ranges
Harmonic to Noise ratio (HNR) | Boersma and Weenink [2018] | Signal-to-noise ratio
Recurrence period density entropy (RPDE) | Little et al. [2007] | Pitch estimation uncertainty
Detrended Fluctuation Analysis (DFA) | Little et al. [2007] | Measure of stochastic self-similarity in turbulent noise
Pitch period entropy (PPE) | Little et al. [2008] | Measure of inability of maintaining constant pitch

Table 1 describes the demography of our participants in a nutshell, and Figure 7 shows the age distributions of the participants. Among our 726 unique participants, 262 are diagnosed PD patients; the rest are non-PD. We obtained the contact information of the PD patients from local clinics and various PD support groups; the non-PD participants were recruited from Amazon Mechanical Turk. During data collection, informed consent was obtained from all participants. Among the 726 unique participants, only 54 provided data in the lab under the guidance of a study coordinator using the PARK tool; the remaining 672 participants used the PARK system from their homes. All steps of the research – subject recruitment, data collection, data storage, and analysis – were completed in accordance with the protocol agreed upon between the researchers and the Institutional Review Board (IRB) of the University of Rochester.
During data collection, the participants often took some time to start uttering the task sentences, and after uttering them, they often took additional time before stopping the recording. Hence, we have a substantial amount of noisy and irrelevant data at the beginning and end of most data instances (see Fig. 1). To remove this irrelevant data, we use the Penn Phonetics Lab Forced Aligner (P2FA) toolkit (https://github.com/jaekookang/p2fa_py3). Given an audio file and a transcript, it predicts the time boundaries at which each word in the transcript was pronounced; if it cannot recognize a set of words, it skips them and outputs time boundaries only for the words it recognized successfully. The toolkit is built on the research of Yuan and Liberman [2008], who apply a combination of Hidden Markov Models (from the Hidden Markov Model Toolkit, HTK: http://htk.eng.cam.ac.uk/) and Gaussian Mixture Models (GMM) to align the words with the audio. The processing is done in several stages: HMM-based models output the most likely sequence of hidden states for a given audio; those hidden states are combined into phonemes, and the phonemes are combined into words – using a predefined dictionary of words and their corresponding phoneme patterns – through GMM-based models. From the output of this system, we get the starting time of the first word recognized by P2FA and the ending time of the last word it recognized, and we use the audio segment between them for further analysis. For building models capable of predicting PD, we extract two different sets of features: acoustic features (4.3), including MFCC, jitter variants, and shimmer variants; and embedding features (4.4), which represent an acoustic signal as a learned feature vector. The subsequent sections detail the feature extraction process.
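This trimming step can be sketched in a few lines. A minimal example, assuming the aligner's output has already been parsed into (word, start_sec, end_sec) tuples – the parsing itself is omitted, and the tuples below are illustrative, not real P2FA output:

```python
import wave

def trim_to_aligned_words(in_path, out_path, alignments):
    """Keep only the audio between the first and last recognized word.

    `alignments` is a list of (word, start_sec, end_sec) tuples.
    """
    start = min(a[1] for a in alignments)
    end = max(a[2] for a in alignments)
    with wave.open(in_path, "rb") as wf:
        params = wf.getparams()
        sr = wf.getframerate()
        wf.setpos(int(start * sr))            # skip leading noise
        frames = wf.readframes(int((end - start) * sr))
    with wave.open(out_path, "wb") as out:
        out.setparams(params)                 # frame count is fixed on close
        out.writeframes(frames)
```

In our pipeline the boundaries come from P2FA; any aligner producing word-level timestamps would fit the same sketch.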
We extracted features by combining subsets of features collected from several sources: PRAAT features Boersma and Weenink [2018], obtained through the Parselmouth Jadoul et al. [2018] Python interface, and previously validated dysphonia features Little et al. [2007, 2008].

Pitch-related features: Pitch denotes the rate of vibration present in a sound. MedianPitch and MeanPitch denote the median and mean fundamental frequency (pitch) of the audio signal, and StdDevPitch denotes the standard deviation of the fundamental frequency f0.
Jitter-related features: Jitter defines how much a signal deviates from its presumed true periodicity; it is an undesired quantity if our signals are assumed to be periodic. MeanJitter is the jitter measure obtained by calculating the mean variation of f0, and MedianJitter uses the median variation of f0. LocalJitter denotes the average of the absolute differences between consecutive periods of a signal, divided by the average period. RapJitter – the Relative Average Perturbation – is the average absolute difference between a period and the average of that period and its two neighbouring periods, divided by the average period. Ppq5Jitter denotes the five-point Period Perturbation Quotient: the average absolute difference between a period and the average of it and its four closest neighbours, divided by the average period. DdpJitter denotes the average absolute difference between consecutive differences between consecutive periods, divided by the average period.
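As a concrete illustration, the two simplest variants above can be re-implemented directly from their textual definitions. A toy sketch operating on a list of pitch-period durations in seconds (real pipelines obtain both the periods and the measures from Praat rather than computing them by hand):

```python
def local_jitter(periods):
    """Mean absolute difference of consecutive periods / mean period."""
    diffs = [abs(a - b) for a, b in zip(periods, periods[1:])]
    return (sum(diffs) / len(diffs)) / (sum(periods) / len(periods))

def rap_jitter(periods):
    """Relative Average Perturbation: mean deviation of each period from
    its 3-point moving average, normalized by the mean period."""
    devs = [
        abs(periods[i] - (periods[i - 1] + periods[i] + periods[i + 1]) / 3)
        for i in range(1, len(periods) - 1)
    ]
    return (sum(devs) / len(devs)) / (sum(periods) / len(periods))
```

A perfectly periodic signal yields (near) zero for both measures; irregular vocal-fold vibration pushes them up.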
Shimmer-related features: Shimmer is a measurement of amplitude instability in an audio signal; a normal voice will have minimal instability during sustained verbal phonation. MeanShimmer quantifies the mean variation of amplitude in the voice signal, and MedianShimmer uses the median variation of amplitude. LocalShimmer is the average absolute difference between the amplitudes of consecutive periods in a signal, divided by the average amplitude. LocalDBShimmer is the average of the absolute value of the base-10 logarithm of the difference between the amplitudes of consecutive periods, multiplied by 20. Apq3Shimmer is the three-point Amplitude Perturbation Quotient: the average absolute difference between the amplitude of a period and the average of the amplitudes of its two neighbours – left and right – divided by the average amplitude. Apq5Shimmer and Apq11Shimmer are similar to Apq3Shimmer, but use data from four and ten neighbours, respectively, instead of two. DdaShimmer is three times the value of Apq3Shimmer.

MFCC: Mel Frequency Cepstral Coefficients (MFCC) Pols et al. [1977] are used to capture the rate of energy change in different spectral bands of the speech: a negative cepstral coefficient indicates that the majority of the spectral energy in that band is concentrated in the high frequencies, whereas a positive coefficient indicates that the majority is concentrated in the low frequencies. As we get several entries over time for each of the 13 MFCC coefficients, we take the mean (MeanMFCC[0-12]) and the mean variation (VariationMFCC[0-12]) of each.
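The per-coefficient aggregation can be sketched as follows, given a per-frame MFCC matrix (frames × 13 coefficients). We interpret "mean variation" as the mean absolute frame-to-frame change; that interpretation, and the toy 2-coefficient input below, are our assumptions rather than confirmed details of the pipeline:

```python
def aggregate_mfcc(frames):
    """Collapse a per-frame MFCC matrix into MeanMFCC and VariationMFCC.

    `frames` is a list of equal-length coefficient lists, one per frame.
    """
    n_coef = len(frames[0])
    means, variations = [], []
    for c in range(n_coef):
        col = [f[c] for f in frames]             # one coefficient over time
        means.append(sum(col) / len(col))
        deltas = [abs(a - b) for a, b in zip(col, col[1:])]
        variations.append(sum(deltas) / len(deltas))
    return means, variations
```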
Relative band power: Relative band power features are calculated by measuring how much power is present in four frequency windows with boundaries at [0, 500, 1000, 2000, 4000] Hz; the power contained in these four regions is denoted by RelBandPower[0-3]. By applying the FFT, we convert the audio signal into the frequency domain; then we calculate the power contained in the frequencies belonging to each bucket, aggregate them, and calculate the median in each bucket.
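A sketch of this computation with NumPy, summing the per-bin power in each band and normalizing by the total power (the text mentions a median-based aggregation; the plain sum here is a simplification):

```python
import numpy as np

def rel_band_power(signal, sr, edges=(0, 500, 1000, 2000, 4000)):
    """Fraction of spectral power falling in each [lo, hi) band."""
    spectrum = np.abs(np.fft.rfft(signal)) ** 2        # power spectrum
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / sr)
    total = spectrum.sum()
    return [
        spectrum[(freqs >= lo) & (freqs < hi)].sum() / total
        for lo, hi in zip(edges, edges[1:])
    ]
```

A pure 750 Hz tone, for instance, puts essentially all of its power in the second band, [500, 1000) Hz.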
Harmonic-to-noise ratio (HNR): HNR denotes the ratio of the desired signal to the background noise; a higher HNR indicates better audio quality.
Recurrence period density entropy (RPDE): A perfectly recurrent time signal maintains a strict time period. Recurrence period density entropy (RPDE) determines how closely a signal maintains strict periodicity after the signal is reconstructed in phase space Little et al. [2007]. By aggregating the recurrence periods recorded in the signal and calculating the entropy of those periods, we obtain a measure of how much variation is present in them. A perfectly healthy voice is able to maintain sustained vibration, and hence should have an entropy close to zero. Finally, the RPDE values are normalized to the range [0,1] to be used as a feature.
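The entropy step can be illustrated in isolation with a hypothetical helper (the full RPDE method recovers recurrence periods from a phase-space reconstruction, which is omitted here):

```python
import math
from collections import Counter

def normalized_period_entropy(periods):
    """Normalized entropy of a recurrence-period histogram: 0 for a
    perfectly periodic signal, approaching 1 for maximal spread."""
    counts = Counter(periods)
    if len(counts) < 2:
        return 0.0
    n = len(periods)
    h = -sum((c / n) * math.log(c / n) for c in counts.values())
    return h / math.log(len(counts))
```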
Detrended fluctuation analysis (DFA): As the human voice is produced by turbulent airflow through the vocal folds, degeneration of vocal-fold structure (due to age or disease) can produce increased noise in speech Little et al. [2007]. Detrended fluctuation analysis (DFA) measures the extent of the stochastic self-similarity of the noise in the speech signal produced by possible alterations in vocal-fold structure. Such noise can be characterized by a statistical scaling exponent over a range of physical scales; this scaling exponent is comparatively larger for people with voice disorders Little et al. [2007, 2008].
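A generic DFA sketch (not necessarily the exact variant of Little et al.): integrate the signal, remove a local linear trend in windows of increasing size, and fit the log-log slope of the residual fluctuations. Uncorrelated noise gives a scaling exponent near 0.5; stronger self-similarity pushes it toward 1.

```python
import numpy as np

def dfa_exponent(x, scales=(8, 16, 32, 64)):
    y = np.cumsum(x - np.mean(x))                 # integrated profile
    flucts = []
    for s in scales:
        rms = []
        for w in range(len(y) // s):
            seg = y[w * s:(w + 1) * s]
            t = np.arange(s)
            trend = np.polyval(np.polyfit(t, seg, 1), t)  # local linear fit
            rms.append(np.sqrt(np.mean((seg - trend) ** 2)))
        flucts.append(np.mean(rms))
    slope, _ = np.polyfit(np.log(scales), np.log(flucts), 1)
    return slope
```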
Pitch Period Entropy (PPE): Little et al. [2008] introduces Pitch Period Entropy (PPE) to calculate the entropy present in the pitch of an audio signal. First, a standard time signal of pitch is converted into the logarithmic domain to capture the logarithmic nature of speech production and perception. Then, to remove the gender- and person-specific trends present in the pitch – females have higher-pitched voices than males, and there are individual differences in pitch – a standard whitening filter is applied. Finally, we calculate the probability density of the residual signal. For a healthy voice, most of the probability mass is concentrated in a narrow range; however, people with vocal disorders cannot maintain a sustained pitch for long, so their probability distribution is much more dispersed. This dispersion is quantified through entropy, which precisely measures how much disorganization there is in a system: a lower entropy means that the pitch was sustained over a long time, while a higher entropy indicates problems with the vocal cords and possibly dysphonia.
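A simplified sketch of this pipeline. We substitute first-order differencing of the log-pitch for the whitening filter and a fixed-width histogram for the density estimate; both substitutions, and the toy f0 sequences in the comments, are our assumptions:

```python
import math
from collections import Counter

def ppe_sketch(f0_values, n_bins=10):
    """Entropy of the residual log-pitch distribution (crude PPE stand-in)."""
    logs = [math.log(f) for f in f0_values]
    residual = [b - a for a, b in zip(logs, logs[1:])]   # crude whitening
    lo, hi = min(residual), max(residual)
    width = (hi - lo) / n_bins or 1.0                    # avoid zero width
    bins = Counter(min(int((r - lo) / width), n_bins - 1) for r in residual)
    n = len(residual)
    return -sum((c / n) * math.log(c / n) for c in bins.values())
```

A steady pitch concentrates all residuals in one bin (entropy 0); an unstable pitch spreads them across bins and raises the entropy.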
We extracted Problem-Agnostic Speech Encoder (PASE) embeddings Pascual et al. [2019] for our audio files. PASE represents the information contained in a raw audio instance as a list of encoded vectors. To ensure that the encoded vectors contain the same information as the input audio file, PASE is trained to decode various properties of the audio from them: the audio waveform, the log power spectrum, Mel-frequency cepstral coefficients (MFCC), four prosody features (interpolated logarithm of the fundamental frequency, voiced/unvoiced probability, zero-crossing rate, and energy), local info max, etc. To decode all these properties successfully, the encoded vectors must retain the relevant information about the input audio. As these properties represent the inherent characteristics of the input audio rather than any task-specific features, they can be used to solve a host of downstream tasks such as speech classification, speaker recognition, emotion recognition, and – as we demonstrate – PD detection.
For each of the feature sets, we applied a standard set of machine learning algorithms – Support Vector Machine (SVM) Cortes and Vapnik [1995], XGBoost Chen and Guestrin [2016], LightGBM Ke et al. [2017], and Random Forest Ho [1995] – to classify PD vs. non-PD. SVM separates the data into classes while maintaining the maximum possible margin between them; it can use the kernel trick to project the data into a higher-dimensional space and thus gain more expressiveness. Random Forest is an ensemble of decision trees; each tree is built from a subset of the features and learns if-else type decision rules to make a prediction. We also use XGBoost and LightGBM, two algorithms based on gradient boosting, which builds successively better models by refining the current ones. eXtreme Gradient Boosting (XGBoost) provides a framework for fast, distributed gradient boosting, employing sophisticated heuristics for penalizing poorly performing trees and making better use of regularization. LightGBM is a boosting algorithm that employs leaf-wise tree growth; it can estimate the information gain from a smaller subset of the data and thus reach comparable accuracy faster while using less memory.

We used a leave-one-out cross-validation training strategy: one sample of the dataset is left out, the other n-1 samples are used to train a model, and the model predicts the remaining sample. We report our model's performance using binary accuracy and AUC. Area Under the Curve (AUC) is the area under the ROC (Receiver Operating Characteristic) curve, which is constructed by plotting the True Positive Rate against the False Positive Rate while varying the decision threshold. AUC has a maximum value of 1, which denotes that the two classes can be separated perfectly, whereas an AUC of 0.5 indicates that the model has no capability to distinguish between the two classes. Since our dataset is imbalanced, AUC is a much better metric for understanding the true performance of our model.

Since there exists a significant imbalance between the PD and non-PD classes, with PD being the minority class, the model has a tendency to default to the majority non-PD class, which can yield a high false-negative rate. This is particularly bad in our case: our system is envisioned as a tool to help people get a preliminary screening for PD so that they can visit a doctor promptly, and predicting a PD patient as non-PD would provide a false sense of security and allow the disease to progress due to lack of medical care. To tackle this challenge, we used the Synthetic Minority Oversampling Technique (SMOTE) Chawla et al. [2002] and SVMSMOTE Nguyen et al. [2011]. SMOTE creates synthetic data instances for the under-represented class: for each sample in the minority class, it finds the K nearest neighbours of that sample and generates new synthetic samples situated on the straight lines connecting the sample to its neighbours. SVMSMOTE focuses on the boundary between classes: since SVM creates a decision boundary that maintains the maximum possible margin between classes, SVMSMOTE samples data from the boundary region of the minority class such that the boundary is either expanded or consolidated.
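The interpolation mechanism described above can be sketched from scratch (for intuition only; not a substitute for library implementations of SMOTE):

```python
import math
import random

def smote_sketch(minority, k=2, n_new=4, seed=0):
    """Generate synthetic minority samples on lines to nearest neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        neighbours = sorted(                   # k nearest other samples
            (p for p in minority if p is not x),
            key=lambda p: math.dist(x, p),
        )[:k]
        nb = rng.choice(neighbours)
        gap = rng.random()                     # position along the segment
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(x, nb)))
    return synthetic
```

Every synthetic point lies on a segment between two real minority samples, so the oversampled class stays inside the convex hull of the observed data.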
For interpreting the models, we use the SHAP technique based on the Shapley value – a game-theoretic concept for distributing a payout fairly among players Shapley [1953]. In the machine learning context, each individual feature of an instance can be thought of as a player, and the payout is the difference between the instance's prediction and the average prediction. The method rests on a rigorous mathematical foundation and ensures a fair distribution of importance among the features by satisfying the necessary mathematical properties. In principle, the Shapley value can be computed by calculating the average marginal contribution of each feature across all examples Štrumbelj and Kononenko [2014]: for each feature, every possible combination of the other features – defined as a coalition – is considered, and the feature's marginal contribution is averaged over these coalitions.

Please contact the corresponding author for access to the data and code. The maintenance and sharing of data collected through the PARK system must adhere to the Health Insurance Portability and Accountability Act (HIPAA) regulations; therefore, the data can only be shared by adding the interested party to the protocol maintained by the relevant Institutional Review Board (IRB).
The authors declare that there are no competing interests.
Wasifur Rahman, Sangwu Lee, Md Saiful Islam, and Victor Nikhil Antony worked on data analysis, feature extraction, model training, model interpretation, and manuscript preparation. Harshil Ratnu, Abdullah Al Mamun, Ellen Wagner, and Stella Jensen-Roberts helped build, maintain, and coordinate the data collection procedure. Mohammad Rafayet Ali, Max Little, and Ray Dorsey helped improve the manuscript, suggested important experiments, and provided access to critical resources such as code and data. Ehsan Hoque was the PI of the project; he facilitated the entire project and helped shape the narrative of the manuscript.
References
Mohammad Rafayet Ali, Javier Hernandez, E Ray Dorsey, Ehsan Hoque, and Daniel McDuff. Spatio-temporal attention and magnification for classification of Parkinson's disease from videos collected via the internet. In , pages 53–60, 2020.

Andrea Bandini, Silvia Orlandi, Hugo Jair Escalante, Fabio Giovannelli, Massimo Cincotta, Carlos A Reyes-Garcia, Paola Vanni, Gaetano Zaccara, and Claudia Manfredi. Analysis of facial expressions in Parkinson's disease through video-based automatic methods. Journal of Neuroscience Methods, 281:7–20, 2017.

Kritagya Bhattarai, PWC Prasad, Abeer Alsadoon, L Pham, and Amr Elchouemi. Experiments on the MFCC application in speaker recognition using Matlab. In Internet & Technology.

Nitesh V Chawla, Kevin W Bowyer, Lawrence O Hall, and W Philip Kegelmeyer. SMOTE: Synthetic minority over-sampling technique. Journal of Artificial Intelligence Research, 16:321–357, 2002.

Tianqi Chen and Carlos Guestrin. XGBoost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 785–794. ACM, 2016.

Corinna Cortes and Vladimir Vapnik. Support-vector networks. Machine Learning, 20(3):273–297, 1995.

Joseph R Duffy. Motor Speech Disorders E-Book: Substrates, Differential Diagnosis, and Management. Elsevier Health Sciences, 2019.

Christopher G Goetz, Barbara C Tilley, Stephanie R Shaftman, Glenn T Stebbins, Stanley Fahn, Pablo Martinez-Martin, Werner Poewe, Cristina Sampaio, Matthew B Stern, Richard Dodel, et al. Movement Disorder Society-sponsored revision of the Unified Parkinson's Disease Rating Scale (MDS-UPDRS): Scale presentation and clinimetric testing results. Movement Disorders: Official Journal of the Movement Disorder Society, 23(15):2129–2170, 2008.

Arthur Stanley Goldberger. A Course in Econometrics. Harvard University Press, 1991.

Charlotte A Haaxma, Bastiaan R Bloem, George F Borm, Wim JG Oyen, Klaus L Leenders, Silvia Eshuis, Jan Booij, Dean E Dluzen, and Martin WIM Horstink. Gender differences in Parkinson's disease. Journal of Neurology, Neurosurgery & Psychiatry, 78(8):819–824, 2007.

Aileen K Ho, Robert Iansek, Caterina Marigliani, John L Bradshaw, and Sandra Gates. Speech impairment in a large sample of patients with Parkinson's disease. Behavioural Neurology, 11(3):131–137, 1998.

Tin Kam Ho. Random decision forests. In Proceedings of the 3rd International Conference on Document Analysis and Recognition, volume 1, pages 278–282. IEEE, 1995.

William P Howlett. Neurology in Africa. Neurology, 83(7):654–655, 2014.

Yannick Jadoul, Bill Thompson, and Bart de Boer. Introducing Parselmouth: A Python interface to Praat. Journal of Phonetics, 71:1–15, 2018. doi: https://doi.org/10.1016/j.wocn.2018.07.001.

Guolin Ke, Qi Meng, Thomas Finley, Taifeng Wang, Wei Chen, Weidong Ma, Qiwei Ye, and Tie-Yan Liu. LightGBM: A highly efficient gradient boosting decision tree. In Advances in Neural Information Processing Systems, pages 3146–3154, 2017.

SV Khadilkar. Neurology in India. Annals of Indian Academy of Neurology, 16(4):465, 2013.

Tomi Kinnunen, Evgenia Chernenko, Marko Tuononen, Pasi Fränti, and Haizhou Li. Voice activity detection using MFCC features and support vector machine. In Int. Conf. on Speech and Computer (SPECOM07), Moscow, Russia, volume 2, pages 556–561, 2007.

Ken J Kubota, Jason A Chen, and Max A Little. Machine learning for large-scale wearable sensor data in Parkinson's disease: Concepts, promises, pitfalls, and futures. Movement Disorders, 31(9):1314–1326, 2016.

Raina Langevin, Mohammad Rafayet Ali, Taylan Sen, Christopher Snyder, Taylor Myers, E Dorsey, and Mohammed Ehsan Hoque. The PARK framework for automated analysis of Parkinson's disease characteristics. Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies, 3(2):54, 2019.

Max Little, Patrick McSharry, Eric Hunter, Jennifer Spielman, and Lorraine Ramig. Suitability of dysphonia measurements for telemonitoring of Parkinson's disease. Nature Precedings, pages 1–1, 2008.

Max A Little and Reham Badawy. Causal bootstrapping. arXiv preprint arXiv:1910.09648, 2019.

Max A Little, Patrick E McSharry, Stephen J Roberts, Declan AE Costello, and Irene M Moroz. Exploiting nonlinear recurrence and fractal scaling properties for voice disorder detection. BioMedical Engineering OnLine, 6(1):23, 2007.

Jeri A Logemann, Hilda B Fisher, Benjamin Boshes, and E Richard Blonsky. Frequency and cooccurrence of vocal tract dysfunctions in the speech of a large sample of Parkinson patients. Journal of Speech and Hearing Disorders, 43(1):47–57, 1978.

Luca Lonini, Andrew Dai, Nicholas Shawen, Tanya Simuni, Cynthia Poon, Leo Shimanovich, Margaret Daeschler, Roozbeh Ghaffari, John A Rogers, and Arun Jayaraman. Wearable sensors for Parkinson's disease: which data are worth collecting for training symptom detection models. npj Digital Medicine, 1:64, 2018.

Scott M Lundberg and Su-In Lee. A unified approach to interpreting model predictions. In Advances in Neural Information Processing Systems, pages 4765–4774, 2017.

Scott M Lundberg, Gabriel Erion, Hugh Chen, Alex DeGrave, Jordan M Prutkin, Bala Nair, Ronit Katz, Jonathan Himmelfarb, Nisha Bansal, and Su-In Lee. From local explanations to global understanding with explainable AI for trees. Nature Machine Intelligence, 2(1):2522–5839, 2020.

Meinard Müller. Information Retrieval for Music and Motion, volume 2. Springer, 2007.

Hien M Nguyen, Eric W Cooper, and Katsuari Kamei. Borderline over-sampling for imbalanced data classification. International Journal of Knowledge Engineering and Soft Data Paradigms, 2011.

Movement Disorders, 18(7):738–750, 2003.

Santiago Pascual, Mirco Ravanelli, Joan Serrà, Antonio Bonafonte, and Yoshua Bengio. Learning problem-agnostic speech representations from multiple self-supervised tasks. In Proc. of the Conf. of the Int. Speech Communication Association (INTERSPEECH), pages 161–165, 2019. URL http://dx.doi.org/10.21437/Interspeech.2019-2605.

Louis CW Pols et al. Spectral analysis and identification of Dutch vowels in monosyllabic words. Amsterdam: Academische Pers, 1977.

Amir Hossein Poorjam, Mathew Shaji Kavalekalam, Liming Shi, Yordan P Raykov, Jesper Rindom Jensen, Max A Little, and Mads Græsbøll Christensen. Automatic quality control and enhancement for voice-based remote Parkinson's disease detection. arXiv preprint arXiv:1905.11785, 2019.

Lloyd S Shapley. A value for n-person games. Contributions to the Theory of Games, 2(28):307–317, 1953.

Erik Štrumbelj and Igor Kononenko. Explaining prediction models and individual predictions with feature contributions. Knowledge and Information Systems, 41(3):647–665, 2014.

Ingo R Titze and Daniel W Martin. Principles of voice production, 1998.

Athanasios Tsanas, Max A Little, Patrick E McSharry, and Lorraine O Ramig. Accurate telemonitoring of Parkinson's disease progression by noninvasive speech tests. IEEE Transactions on Biomedical Engineering, 57(4):884–893, 2009.

Athanasios Tsanas, Max A Little, Patrick E McSharry, and Lorraine O Ramig. Nonlinear speech analysis algorithms mapped to a standard metric achieve clinically useful quantification of average Parkinson's disease symptom severity. Journal of the Royal Society Interface, 8(59):842–855, 2011.

Athanasios Tsanas, Max A Little, Patrick E McSharry, and Lorraine O Ramig. Using the cellular mobile telephone network to remotely monitor Parkinson's disease symptom severity. IEEE Transactions on Biomedical Engineering, 9, 2012a.

Athanasios Tsanas, Max A Little, Patrick E McSharry, Jennifer Spielman, and Lorraine O Ramig. Novel speech signal processing algorithms for high-accuracy classification of Parkinson's disease. IEEE Transactions on Biomedical Engineering, 59(5):1264–1271, 2012b.

Stephen K Van Den Eeden, Caroline M Tanner, Allan L Bernstein, Robin D Fross, Amethyst Leimpeter, Daniel A Bloch, and Lorene M Nelson. Incidence of Parkinson's disease: variation by age, gender, and race/ethnicity. American Journal of Epidemiology, 157(11):1015–1022, 2003.

Jiahong Yuan and Mark Liberman. Speaker identification on the SCOTUS corpus. Journal of the Acoustical Society of America, 123(5):3878, 2008.

Shichao Yue, Yuzhe Yang, Hao Wang, Hariharan Rahul, and Dina Katabi. BodyCompass: Monitoring sleep posture with wireless signals.