SEP-28k: A Dataset for Stuttering Event Detection From Podcasts With People Who Stutter
Colin Lea*, Vikramjit Mitra*, Aparna Joshi, Sachin Kajarekar, Jeffrey P. Bigham
Apple
ABSTRACT
The ability to automatically detect stuttering events in speech could help speech pathologists track an individual's fluency over time or help improve speech recognition systems for people with atypical speech patterns. Despite increasing interest in this area, existing public datasets are too small to build generalizable dysfluency detection systems and lack sufficient annotations. In this work, we introduce Stuttering Events in Podcasts (SEP-28k), a dataset containing over 28k clips labeled with five event types including blocks, prolongations, sound repetitions, word repetitions, and interjections. Audio comes from public podcasts largely consisting of people who stutter interviewing other people who stutter. We benchmark a set of acoustic models on SEP-28k and the public FluencyBank dataset and highlight how simply increasing the amount of training data improves relative detection performance by 28% and 24% F1 on each. Annotations from over 32k clips across both datasets will be publicly released.
Index Terms — Dysfluencies, stuttering, atypical speech
1. INTRODUCTION
Dysfluencies in speech such as sound repetitions, word repetitions, and blocks are common amongst everyone and are especially prevalent in people who stutter. Frequent occurrences can make social interactions challenging and limit an individual's ability to communicate with ubiquitous speech technology including Alexa, Siri, and Cortana [1, 2, 3, 4, 5]. In this work we investigate the ability to automatically detect dysfluencies, which may be valuable for clinical assessment or for the development of accessible speech recognition technology. This problem is challenging because there are many variations in how a given individual expresses each dysfluency type, in the patterns of dysfluencies between users, and even in how the situation or environment affects their speech. For example, an individual may stutter when conversing but not while reading aloud; when talking with a teacher but not a friend; or when stressed before an exam but not in everyday interactions. The speech pathology community has spent decades characterizing these behaviors and developing diagnostic tools and mitigation strategies [6, 7, 8, 9]; however, there has been limited success in taking these
learnings and applying them to speech recognition technology, where individuals may be frequently cut off or have their speech inaccurately transcribed.

Fig. 1. Speech from someone who stutters may contain events including sound repetitions (orange), interjections (blue), blocks/pauses (green), or other events that make speech recognition challenging.

A major bottleneck in this area is that dysfluency datasets tend to be small and have few or inconsistent annotations that were not designed with speech recognition tasks in mind. Kourkounakis et al. [10] used 800 speech clips (53 minutes) with custom annotations to detect dysfluencies from 25 children who stutter using the UCLASS dataset [11]. Riad et al. [12] performed a similar task using 1429 utterances from 22 adults who stutter with the recent FluencyBank [13] dataset. Bayerl et al. [14] collected a 3.5 hour German dataset with 37 speakers and developed a model for automated stuttering severity assessment. Unfortunately, none of the annotations from these efforts have been released. A core contribution of our paper is the introduction of the Stuttering Events in Podcasts (SEP-28k) dataset, which contains 28k annotated clips (23 hours) of speech curated from public podcasts. We have released these along with annotations for 4k clips (3.5 hours) from FluencyBank targeted at stuttering event detection.

The focus of this paper is the detection of five stuttering event types: blocks, prolongations, sound repetitions, word/phrase repetitions, and interjections. Existing work has explored this problem using traditional signal processing techniques [15, 16, 17], language modeling (LM) [12, 18, 19, 20, 21], and acoustic modeling (AM) [21, 10]. Each approach has been shown to be effective at identifying one or two event types, typically on data from a small number of users. Prolongations, or extended sounds, have been detected using short-window autocorrelations [16] and low-level acoustic models [10]. Word/phrase repetitions, if they are well articulated, are easily detected using LM-based approaches [19], with the caveat that single-syllable words such as in the phrase "I-I-I am" will often be smoothed into "I am" by the underlying acoustic model, and phrases like "I am [am]" may be pruned because the LM has never seen the word "am" repeated before. This is fine for speech recognition but bad for stuttering event analysis. Arjun et al. [16] addressed this repetition problem by segmenting pairs of subsequent words and analyzing correlations in their spectral features. Interjections, including "um", "uh", "you know", and other filler words, are perhaps the easiest type to recognize with a language model if well articulated. Blocks, or gasps/pauses typically within or between words, are difficult to detect because the gasp for breath or pause is often inaudible. Sound repetitions are also challenging because syllables may vary in duration, count, style, and articulation (e.g., "[moh-muh-mm]-ommy").

Efforts in HCI have sought out an understanding of speech recognition needs for users with speech impairments, which is critical for framing problems like ours [1, 2, 22].

Table 1. Distribution of annotations in each dataset where at least two of three annotators applied a given label.

Stuttering Labels | Definition | SEP-28k | FluencyBank
Block | Gasps for air or stuttered pauses | 12.0% | 10.3%
Prolongation | Elongated syllable, e.g., "M[mmm]ommy" | |
Sound Repetition | e.g., "I [pr-pr-pr-]prepared dinner" | |
Word/Phrase Repetition | e.g., "I made [made] dinner" | |
Interjection | e.g., "um," "uh," "you know" | |

Non-dysfluent Labels | Definition | SEP-28k | FluencyBank
Natural pause | A pause in speech (not as part of a stutter event) | 8.5% | 2.7%
Unintelligible | It is difficult to understand the speech | 3.7% | 3.0%
Unsure | An annotator was unsure of their response | 0.1% | 0.4%
No Speech | The clip is silent or only contains background noise | 1.1% | -
Poor Audio Quality | There are microphone or other quality issues | 2.1% | -
Music | Music is playing in the background | 1.1% | -
2. DATA

2.1. Stuttering Events in Podcasts (SEP-28k)
We manually curated a set of podcasts, many of which contain speech from people who stutter talking with other people who stutter, using a two-step process. Shows were initially selected by searching metadata from a podcast search engine with terms related to dysfluencies such as stutter, speech disorder, and stammer. This resulted in approximately 40 shows and hundreds of hours of audio. Many of these were about speech disorders but did not contain high rates of speech from people who stutter. After culling down the data we extracted clips from 385 episodes across 8 shows. Specific show names and links to each episode can be found in the dataset repository.

We extracted 40-250 segments per episode for a total of 28,177 clips. Dysfluency events are more likely to occur shortly before, during, or after a pause, so we used a voice activity detector to extract 3-second intervals near pauses. We varied where we sampled each interval with respect to a breakpoint to capture a more representative set of dysfluencies.
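As a concrete illustration of the sampling step above, the following is a minimal sketch assuming a mono waveform array and a simple energy-based pause detector standing in for the (unspecified) voice activity detector; all function names, thresholds, and the jitter range are illustrative assumptions rather than the exact procedure used to build SEP-28k.

```python
import numpy as np

def find_pause_onsets(wav, sr, frame_ms=25, hop_ms=10, energy_thresh=1e-4, min_pause_s=0.25):
    """Very simple energy-based stand-in for a VAD: returns sample indices
    where sufficiently long low-energy (pause) regions begin."""
    frame, hop = int(sr * frame_ms / 1000), int(sr * hop_ms / 1000)
    energy = np.array([np.mean(wav[i:i + frame] ** 2)
                       for i in range(0, len(wav) - frame, hop)])
    onsets, run_start = [], None
    for idx, is_silent in enumerate(energy < energy_thresh):
        if is_silent and run_start is None:
            run_start = idx
        elif not is_silent and run_start is not None:
            if (idx - run_start) * hop / sr >= min_pause_s:
                onsets.append(run_start * hop)
            run_start = None
    return onsets

def sample_clips_near_pauses(wav, sr, clip_s=3.0, max_jitter_s=1.0, seed=0):
    """Cut fixed-length clips whose position is jittered around each pause
    breakpoint, so dysfluencies before, during, and after the pause are covered."""
    rng = np.random.default_rng(seed)
    clip_len = int(clip_s * sr)
    clips = []
    for onset in find_pause_onsets(wav, sr):
        jitter = int(rng.uniform(-max_jitter_s, max_jitter_s) * sr)
        start = int(np.clip(onset + jitter - clip_len // 2, 0, max(len(wav) - clip_len, 0)))
        clips.append(wav[start:start + clip_len])
    return clips
```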
2.2. FluencyBank

We used all of the FluencyBank [13] interview data, which contains recordings from 32 adults who stutter. As with Riad et al. [12], we found that the temporal alignments for some of the provided transcriptions and dysfluency annotations were inaccurate, so we ignored these and used the same process as for SEP-28k to annotate 4,144 clips (3.5 hours).
2.3. Annotations

Annotating stuttering data is difficult because of ambiguity in what constitutes stuttering for a given individual. Repetitions, for example, can occur during stuttering events or when an individual wants to emphasize a word or phrase. Speech may be unintelligible, which makes it challenging to identify how a word was stuttered. We annotated our data using a variant of time-interval based assessment [8] in which audio recordings are broken into 3-second clips and annotated with binary labels as defined in Table 1. A clip may contain multiple stuttering event types along with non-dysfluency labels such as natural pause and unintelligible speech. SEP-28k was also annotated with no speech, poor audio quality, and music to identify issues specific to this medium.

Clips were annotated by at least three people who received training via written descriptions, examples, and audio clips on how to best identify each dysfluency, but who were not clinicians. We measured inter-annotator agreement with Fleiss' kappa and found that word repetitions, interjections, sound repetitions, and no dysfluencies were more consistent (0.62, 0.57, 0.40, 0.39), while blocks and prolongations had only fair or slight agreement (0.25, 0.11). Blocks can be difficult to assess from audio alone; clinicians often rely on physical signs of gasping for air when making this assessment. As such, results that use the block labels should be treated as more speculative.

We use F1 score and Equal Error Rate (EER) to evaluate dysfluency detection, where each annotation constitutes a binary label. F1 is the harmonic mean of precision (P) and recall (R): F1 = 2PR / (P + R). EER is the point on the Receiver Operating Characteristic (ROC) curve where the false acceptance rate equals the false rejection rate, and reflects how well the two classes are separated; the lower the EER, the better the model. We report results for each label individually and as a combined "Any" label which includes all five stutter types.

SEP-28k is partitioned into three splits containing 25k samples for training, 2k for validation, and 1k for testing. FluencyBank is partitioned across the 32 individuals in the dataset: 26 individuals for training, with the remaining speakers split between validation and test sets (~500 clips for testing). We encourage others to explore alternative splits to tease out differences between speakers, podcasts, or other analyses.
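For reference, a minimal sketch of the two evaluation metrics defined above, computed per label from binary annotations and model scores; this uses scikit-learn's roc_curve and f1_score and is only an illustration, not the authors' evaluation code.

```python
import numpy as np
from sklearn.metrics import f1_score, roc_curve

def equal_error_rate(labels, scores):
    """EER: the operating point where the false acceptance rate (FPR)
    equals the false rejection rate (1 - TPR)."""
    fpr, tpr, _ = roc_curve(labels, scores)
    fnr = 1.0 - tpr
    idx = np.nanargmin(np.abs(fnr - fpr))   # threshold where FAR and FRR are closest
    return float((fpr[idx] + fnr[idx]) / 2.0)

# toy example with hypothetical per-clip scores for one label (e.g., "block")
labels = np.array([0, 1, 1, 0, 1, 0])       # 1 = at least two of three annotators agreed
scores = np.array([0.1, 0.8, 0.4, 0.3, 0.9, 0.2])
print("F1 :", f1_score(labels, scores > 0.5))
print("EER:", equal_error_rate(labels, scores))
```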
3. METHODS
Our approach takes an audio clip, extracts acoustic features per frame, applies a temporal model, and outputs a single set of clip-level dysfluency labels. We investigated baselines inspired by the dysfluency model in [10], as well as alternative input features, model architectures, and loss functions.
Our baseline input is a set of 40-dimensional mel-filterbank energy features (MFB). We use frequency cut-offs at 0 Hz and 8000 Hz, a 25 ms window, and a frame rate of 100 Hz. We compare with three additional feature types:

• F0 (3 dim): pitch, pitch-delta, and voicing features;
• ATV (8 dim): articulatory features in the form of vocal-tract (TV) constriction variables [23]. These define the degree and location of constriction actions within the human vocal tract [23, 24], as implemented in [25];
• FPhone (41 dim): phoneme probabilities extracted from an acoustic model trained on LibriSpeech [26] using a time-depth separable CNN architecture [27].

Pitch, voicing, and articulatory features encode voice quality and often change across dysfluency events. We hypothesize these may improve detection of blocks or gasps. Phoneme probabilities may make it easier to identify sound repetitions, where the same phoneme fires multiple times in a row.
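A minimal sketch of the MFB front end described above, here implemented with torchaudio (the paper does not name a toolkit) and assuming 16 kHz input audio; the window, hop, band limits, and filterbank size follow the text, everything else is an assumption.

```python
import torch
import torchaudio

def mfb_features(wav: torch.Tensor, sample_rate: int = 16000) -> torch.Tensor:
    """40-dim log mel-filterbank energies with a 25 ms window and 10 ms hop
    (100 frames per second), band-limited to 0-8000 Hz."""
    mel = torchaudio.transforms.MelSpectrogram(
        sample_rate=sample_rate,
        n_fft=int(0.025 * sample_rate),
        win_length=int(0.025 * sample_rate),
        hop_length=int(0.010 * sample_rate),
        f_min=0.0,
        f_max=8000.0,
        n_mels=40,
    )(wav)                                          # (40, time)
    return torch.log(mel + 1e-6).transpose(0, 1)    # (time, 40)

# a 3-second clip at 16 kHz yields roughly 300 frames of 40-dim features
print(mfb_features(torch.randn(3 * 16000)).shape)
```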
Fig. 2. Multi-feature acoustic stutter detection model.

The baseline stutter detection model consists of a single-layer LSTM network; an improved model adds convolution layers per feature type and learns how the features should be weighted, as shown in Figure 2. We refer to the latter as ConvLSTM. Feature maps from the convolution layers are combined after batch normalization and fed to the LSTM layer. The temporal convolution size for the MFB features was set to 3 frames and for the remaining features to 5 frames. We use unidirectional recurrent networks where the final state is fed into the per-clip classifier. Both models have two output branches: a fluent/dysfluent prediction and a soft prediction for each of the five event types.
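The following PyTorch sketch mirrors the ConvLSTM description above under stated assumptions: the channel and hidden sizes are illustrative, and the learned feature weighting is reduced to one scalar weight per stream because the exact mechanism is not specified in the text.

```python
import torch
import torch.nn as nn

class ConvLSTMStutterNet(nn.Module):
    """Per-feature 1-D convolution + batch norm, learned stream weights,
    a unidirectional LSTM over the combined maps, and two clip-level heads."""

    def __init__(self, feat_dims={"mfb": 40, "f0": 3, "atv": 8}, conv_ch=64, hidden=128):
        super().__init__()
        self.branches = nn.ModuleDict({
            # 3-frame temporal kernel for MFB, 5 frames for the other streams
            name: nn.Sequential(
                nn.Conv1d(dim, conv_ch, kernel_size=3 if name == "mfb" else 5, padding="same"),
                nn.BatchNorm1d(conv_ch),
                nn.ReLU(),
            )
            for name, dim in feat_dims.items()
        })
        self.stream_weight = nn.ParameterDict(
            {name: nn.Parameter(torch.ones(1)) for name in feat_dims})
        self.lstm = nn.LSTM(conv_ch * len(feat_dims), hidden, batch_first=True)
        self.fluency_head = nn.Linear(hidden, 2)  # fluent vs. dysfluent
        self.event_head = nn.Linear(hidden, 5)    # block, prolongation, sound rep., word rep., interjection

    def forward(self, feats):                     # feats[name]: (batch, time, dim)
        maps = [self.stream_weight[n] * self.branches[n](x.transpose(1, 2))
                for n, x in feats.items()]
        x = torch.cat(maps, dim=1).transpose(1, 2)  # (batch, time, conv_ch * n_streams)
        _, (h, _) = self.lstm(x)                    # final state summarizes the clip
        return self.fluency_head(h[-1]), torch.sigmoid(self.event_head(h[-1]))
```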
The baseline model has a single cross-entropy loss term. Our improved models are trained with a multi-task objective in which the fluent/dysfluent branch has a weighted cross-entropy term with focal loss [28] and the per-dysfluency branch has a concordance correlation coefficient (CCC) loss computed against the inter-annotator agreement for each clip.
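A minimal sketch of the two loss terms, assuming the per-event targets are soft labels derived from annotator agreement (e.g., the fraction of annotators who marked each event); this is an illustration of a focal cross-entropy and a CCC loss, not the authors' exact objective.

```python
import torch
import torch.nn.functional as F

def focal_bce(logits, targets, gamma=2.0, pos_weight=None):
    """Weighted binary cross-entropy with the focal modulation of [28]."""
    bce = F.binary_cross_entropy_with_logits(
        logits, targets, pos_weight=pos_weight, reduction="none")
    p_t = torch.exp(-bce)                    # probability assigned to the true class
    return ((1.0 - p_t) ** gamma * bce).mean()

def ccc_loss(pred, target, eps=1e-8):
    """1 - concordance correlation coefficient between predicted per-event
    scores and soft targets (e.g., fraction of annotators marking the event)."""
    pred_m, tgt_m = pred.mean(), target.mean()
    pred_v, tgt_v = pred.var(unbiased=False), target.var(unbiased=False)
    cov = ((pred - pred_m) * (target - tgt_m)).mean()
    ccc = 2.0 * cov / (pred_v + tgt_v + (pred_m - tgt_m) ** 2 + eps)
    return 1.0 - ccc

# multi-task objective (illustrative): fluent/dysfluent branch + per-event branch
# total = focal_bce(fluency_logit, is_dysfluent) + ccc_loss(event_scores, annotator_fractions)
```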
Table 2. Weighted Accuracy (WA ↑), F1 score (↑), and Equal Error Rate (EER ↓) for each model on FluencyBank (eval): Baseline (LSTM, XEnt) and Improved (ConvLSTM, CCC), each evaluated with the feature sets FPhone, MFB, MFB + F0, and MFB + F0 + ATV.
4. EXPERIMENTS & ANALYSIS

4.1. Model Design
Table 2 compares performance across feature and architecture types. Spectral features with pitch generally perform well, and with the improved model the best performance is achieved when articulatory signals are added. This improvement matches our intuition that variation in intonation and articulation coincides with dysfluent speech. The phoneme-based models perform worst, despite their ability to extract features one might think would be useful for sound repetitions. The ConvLSTM and CCC loss moderately improve F1, likely because this loss explicitly encodes annotator uncertainty.

Table 3 shows performance per dysfluency type. Performance is worse for blocks and word repetitions. These dysfluencies tend to last longer in time and have more variation in expression, which may contribute to the lower performance. Interjections and prolongations tend to have less variability and are easier to detect. Performance on SEP-28k is consistently worse than on FluencyBank, likely given the larger variety of individuals and speaking styles.
The central hypothesis of this work was that existing datasets are too small and contain too few participants for training effective dysfluency detection models. This is corroborated by the results in Figure 3, which shows performance on SEP-28k and FluencyBank while training on different subsets. In the best case, there is a 24% relative F1 improvement on FluencyBank when training on all 25k SEP-28k training samples compared to the 3k FluencyBank set. Even using only 5k SEP-28k clips already outperforms training on FluencyBank alone by 16% relative F1. This could be because there are a larger number of users in the dataset and the data contains more variability in speaking styles. As expected, performance on SEP-28k is worst when training on FluencyBank and increases with larger numbers of training samples.
Table 3. F1 score per dysfluency type with a baseline LSTM model (XEnt loss) trained using single- or multi-task learning (STL, MTL) and the improved ConvLSTM model (CCC loss). Bl=Block, Pro=Prolongation, Snd=Sound Repetition, Wd=Word Repetition, Int=Interjection.

SEP-28k | Bl | Pro | Snd | Wd | Int | Any
Random | 13.7 | 12.8 | 9.5 | 4.3 | 13.6 | 46.0
Baseline (STL) | 54.9 | 65.4 | 57.2 | 60.7 | 64.9 | 61.5
Baseline (MTL) | 56.4 | 65.1 | 60.5 | 56.2 | 69.5 | 64.5
Improved | 55.9 | 68.5 | 63.2 | 60.4 | 71.3 | 66.8

FluencyBank | Bl | Pro | Snd | Wd | Int | Any
Random | 12.9 | 10.7 | 28.2 | 10.3 | 31.7 | 31.7
Baseline (STL) | 58.6 | 63.2 | 60.8 | 61.8 | 57.2 | 73.2
Baseline (MTL) | 54.6 | 67.6 | 74.2 | 55.8 | 75.0 | 74.8
Improved | 56.8 | 67.9 | 74.3 | 59.3 | 82.6 | 80.8
Fig. 3. Test performance when training models only on FluencyBank clips or on subsets of clips from SEP-28k.
5. CONCLUSION
We introduced SEP-28k, which contains over an order of magnitude more annotations than existing public datasets, and added new annotations to FluencyBank. These annotations can be used for many tasks, so we encourage others to explore the data, labels, and splits in ways beyond what is described here. Future work should explore alternative approaches, e.g., using language models, which may improve performance for some dysfluency types that are more difficult to detect. Lastly, while dysfluencies are most common in those who stutter, future work should address how they can be detected in people with other speech disorders, such as dysarthria, where they may be characterized differently.
Acknowledgment: Thanks to Lauren Tooley for countless discussions on the clinical aspects of stuttering.

6. REFERENCES

[1] RN Brewer, L Findlater, J Kaye, W Lasecki, C Munteanu, and A Weber, "Accessible voice interfaces," in CSCW, 2018.
[2] L Clark, BR Cowan, A Roper, S Lindsay, and O Sheers, "Speech diversity and speech interfaces: Considering an inclusive future through stammering," in Conversational User Interfaces, 2020.
[3] K Wheeler, "For people who stutter, the convenience of voice assistant technology remains out of reach," USA Today (online), Jan 2020.
[4] P Soundararajan, "Stammering accessibility and testing for voice assistants & devices," Personal Blog (online), April 2020.
[5] M Corcoran, "When Alexa can't understand you," Slate (online), Oct 2018.
[6] C Van Riper, "The nature of stuttering (2nd ed.)," Applied Psycholinguistics, 1983.
[7] G Riley, "SSI-4: Stuttering severity instrument, fourth edition," Austin, TX: Pro-Ed, 2009.
[8] ARS Valente, LMT Jesus, A Hall, and M Leahy, "Event- and interval-based measurement of stuttering: a review," IJLCD, 2015.
[9] RJ Ingham, AK Cordes, and P Finn, "Time-interval measurement of stuttering: Systematic replication of Ingham, Cordes, and Gow (1993)," Journal of Speech, Language, and Hearing Research, 1993.
[10] T Kourkounakis, A Hajavi, and A Etemad, "Detecting multiple speech disfluencies using a deep residual network with bidirectional long short-term memory," in ICASSP, IEEE, 2020.
[11] S Davis, P Howell, and J Bartrip, "The UCLASS archive of stuttered speech," J. Speech Lang. Hear. Res., 2009.
[12] R Riad, AC Bachoud-Lévi, F Rudzicz, and E Dupoux, "Identification of primary and collateral tracks in stuttered speech," in LREC, 2020.
[13] NB Ratner and B MacWhinney, "Fluency Bank: A new resource for fluency research and practice," Journal of Fluency Disorders, 2018.
[14] SP Bayerl, F Hönig, J Reister, and K Riedhammer, "Towards automated assessment of stuttering and stuttering therapy," in International Conference on Text, Speech, and Dialogue, 2020.
[15] A Dash, N Subramani, T Manjunath, V Yaragarala, and S Tripathi, "Speech recognition and correction of a stuttered speech," in ICACCI, 2018.
[16] A K N, Karthik S, K D, P Chanda, and S Tripathi, "Automatic correction of stutter in disfluent speech," CoCoNet, 2020.
[17] A Czyzewski, A Kaczmarek, and B Kostek, "Intelligent processing of stuttered speech," Journal of Intelligent Information Systems, 2003.
[18] S Alharbi, M Hasan, AJH Simons, S Brumfitt, and P Green, "A lightly supervised approach to detect stuttering in children's speech," in Interspeech, 2018.
[19] P Heeman, R Lunsford, A McMillin, and JS Yaruss, "Using clinician annotations to improve automatic speech recognition of stuttered speech," in Interspeech, 2016.
[20] S Alharbi, AJH Simons, S Brumfitt, and PD Green, "Automatic recognition of children's read speech for stuttering application," in WOCCI, 2017.
[21] P Mahesha and DS Vinod, "Gaussian mixture model based classification of stuttering dysfluencies," Journal of Intelligent Systems, 2016.
[22] S Kane, A Guo, and MR Morris, "Sense and accessibility: Understanding people with physical disabilities' experiences with sensing systems," in ACM ASSETS, October 2020.
[23] V Mitra, CY Espy-Wilson, N Seneviratne, and G Sivaraman, "Noise robust acoustic to articulatory speech inversion," in Interspeech, 2017.
[24] V Mitra, H Nam, CY Espy-Wilson, E Saltzman, and L Goldstein, "Retrieving tract variables from acoustics: A comparison of different machine learning strategies," IEEE Journal of Selected Topics in Signal Processing, 2010.
[25] V Mitra, S Booker, E Marchi, DS Farrar, UD Peitz, B Cheng, E Teves, A Mehta, and D Naik, "Leveraging acoustic cues and paralinguistic embeddings to detect expression from voice," in ICASSP, 2019.
[26] V Panayotov, G Chen, D Povey, and S Khudanpur, "Librispeech: An ASR corpus based on public domain audio books," in ICASSP, 2015.
[27] A Hannun, A Lee, Q Xu, and R Collobert, "Sequence-to-sequence speech recognition with time-depth separable convolutions," in Interspeech, 2019.
[28] T Lin, P Goyal, R Girshick, K He, and P Dollár, "Focal loss for dense object detection," in ICCV, 2017.