SEP-28k: A Dataset for Stuttering Event Detection From Podcasts With People Who Stutter
Colin Lea*, Vikramjit Mitra*, Aparna Joshi, Sachin Kajarekar, Jeffrey P. Bigham
Apple
ABSTRACT
The ability to automatically detect stuttering events in speech could help speech pathologists track an individual's fluency over time or help improve speech recognition systems for people with atypical speech patterns. Despite increasing interest in this area, existing public datasets are too small to build generalizable dysfluency detection systems and lack sufficient annotations. In this work, we introduce Stuttering Events in Podcasts (SEP-28k), a dataset containing over 28k clips labeled with five event types including blocks, prolongations, sound repetitions, word repetitions, and interjections. Audio comes from public podcasts largely consisting of people who stutter interviewing other people who stutter. We benchmark a set of acoustic models on SEP-28k and the public FluencyBank dataset and highlight how simply increasing the amount of training data improves relative detection performance by 28% and 24% F1 on each. Annotations from over 32k clips across both datasets will be publicly released.
Index Terms — Dysfluencies, stuttering, atypical speech
1. INTRODUCTION
Dysfluencies in speech such as sound repetitions, word repetitions, and blocks are common amongst everyone and are especially prevalent in people who stutter. Frequent occurrences can make social interactions challenging and limit an individual's ability to communicate with ubiquitous speech technology including Alexa, Siri, and Cortana [1, 2, 3, 4, 5]. In this work we investigate the ability to automatically detect dysfluencies, which may be valuable for clinical assessment or for the development of accessible speech recognition technology. This problem is challenging because there are many variations in how a given individual expresses each dysfluency type, in the patterns of dysfluencies between users, and even in how the situation or environment affects their speech. For example, an individual may stutter when conversing but not while reading aloud; when talking with a teacher but not a friend; or when stressed before an exam but not in everyday interactions. The speech pathology community has spent decades characterizing these behaviors and developing diagnostic tools and mitigation strategies [6, 7, 8, 9]; however, there has been limited success in taking these
learnings and applying them to speech recognition technology, where individuals may be frequently cut off or have their speech inaccurately transcribed.

Fig. 1. Speech from someone who stutters may contain events including sound repetitions (orange), interjections (blue), blocks/pauses (green), or other events that make speech recognition challenging.

A major bottleneck in this area is that dysfluency datasets tend to be small and have few or inconsistent annotations that were not designed with speech recognition tasks in mind. Kourkounakis et al. [10] used 800 speech clips (53 minutes) with custom annotations to detect dysfluencies from 25 children who stutter using the UCLASS dataset [11]. Riad et al. [12] performed a similar task using 1429 utterances from 22 adults who stutter with the recent FluencyBank [13] dataset. Bayerl et al. [14] collected a 3.5 hour German dataset with 37 speakers and developed a model for automated stuttering severity assessment. Unfortunately, none of the annotations from these efforts have been released. A core contribution of our paper is the introduction of the Stuttering Events in Podcasts (SEP-28k) dataset, which contains 28k annotated clips (23 hours) of speech curated from public podcasts. We have released these along with annotations for 4k clips (3.5 hours) from FluencyBank targeted at stuttering event detection.

The focus of this paper is the detection of five stuttering event types: blocks, prolongations, sound repetitions, word/phrase repetitions, and interjections. Existing work has explored this problem using traditional signal processing techniques [15, 16, 17], language modeling (LM) [12, 18, 19, 20, 21], and acoustic modeling (AM) [21, 10]. Each approach has been shown to be effective at identifying one or two event types, typically on data from a small number of users. Prolongations, or extended sounds, have been detected using short-window autocorrelations [16] and low-level acoustic models [10]. Word/phrase repetitions, if they are well articulated, are easily detected using LM-based approaches [19], with the caveat that single-syllable words such as in the phrase "I-I-I am" will often be smoothed into "I am" by the underlying acoustic model, and phrases like "I am [am]" may be pruned because the LM has never seen the word "am" repeated before. This is fine for speech recognition but bad for stuttering event analysis. Arjun et al. [16] addressed this repetition problem by segmenting pairs of subsequent words and analyzing correlations in their spectral features. Interjections, including "um", "uh", "you know", and other filler words, are perhaps the easiest type to recognize with a language model if well articulated. Blocks, or gasps/pauses typically within or between words, are difficult to detect because the gasp for breath or pause is often inaudible. Sound repetitions are also challenging because syllables may vary in duration, count, style, and articulation (e.g., "[moh-muh-mm]-ommy").

Efforts in HCI have sought out an understanding of speech recognition needs for users with speech impairments, which is critical for framing problems like ours [1, 2, 22].

Table 1. Distribution of annotations in each dataset where at least two of three annotators applied a given label.

Stuttering Labels | Definition | SEP-28k | FluencyBank
Block | Gasps for air or stuttered pauses | 12.0% | 10.3%
Prolongation | Elongated syllable, e.g., "M[mmm]ommy" | |
Sound Repetition | e.g., "I [pr-pr-pr-]prepared dinner" | |
Word/Phrase Repetition | e.g., "I made [made] dinner" | |
Interjection | e.g., "um," "uh," "you know" | |

Non-dysfluent Labels | Definition | SEP-28k | FluencyBank
Natural pause | A pause in speech (not as part of a stutter event) | 8.5% | 2.7%
Unintelligible | It is difficult to understand the speech | 3.7% | 3.0%
Unsure | An annotator was unsure of their response | 0.1% | 0.4%
No Speech | The clip is silent or only contains background noise | 1.1% | -
Poor Audio Quality | There are microphone or other quality issues | 2.1% | -
Music | Music is playing in the background | 1.1% | -
2. DATA

2.1. Stuttering Events in Podcasts (SEP-28k)
We manually curated a set of podcasts, many of which contain speech from people who stutter talking with other people who stutter, using a two-step process. Shows were initially selected by searching metadata from a podcast search engine with terms related to dysfluencies such as stutter, speech disorder, and stammer. This resulted in approximately 40 shows and hundreds of hours of audio. Many of these were about speech disorders but did not contain high rates of speech from people who stutter. After culling down the data we extracted clips from 385 episodes across 8 shows. Specific show names and links to each episode can be found in the dataset repository.

We extracted 40-250 segments per episode for a total of 28,177 clips. Dysfluency events are more likely to occur shortly before, during, or after a pause, so we used a voice activity detector to extract 3-second intervals near pauses. We varied where we sampled each interval with respect to a breakpoint to capture a more representative set of dysfluencies.
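As a concrete illustration of the sampling step above, the following is a minimal sketch assuming a mono waveform array and a simple energy-based pause detector standing in for the (unspecified) voice activity detector; all function names, thresholds, and the jitter range are illustrative assumptions rather than the exact procedure used to build SEP-28k.

```python
import numpy as np

def find_pause_onsets(wav, sr, frame_ms=25, hop_ms=10, energy_thresh=1e-4, min_pause_s=0.25):
    """Very simple energy-based stand-in for a VAD: returns sample indices
    where sufficiently long low-energy (pause) regions begin."""
    frame, hop = int(sr * frame_ms / 1000), int(sr * hop_ms / 1000)
    energy = np.array([np.mean(wav[i:i + frame] ** 2)
                       for i in range(0, len(wav) - frame, hop)])
    onsets, run_start = [], None
    for idx, is_silent in enumerate(energy < energy_thresh):
        if is_silent and run_start is None:
            run_start = idx
        elif not is_silent and run_start is not None:
            if (idx - run_start) * hop / sr >= min_pause_s:
                onsets.append(run_start * hop)
            run_start = None
    return onsets

def sample_clips_near_pauses(wav, sr, clip_s=3.0, max_jitter_s=1.0, seed=0):
    """Cut fixed-length clips whose position is jittered around each pause
    breakpoint, so dysfluencies before, during, and after the pause are covered."""
    rng = np.random.default_rng(seed)
    clip_len = int(clip_s * sr)
    clips = []
    for onset in find_pause_onsets(wav, sr):
        jitter = int(rng.uniform(-max_jitter_s, max_jitter_s) * sr)
        start = int(np.clip(onset + jitter - clip_len // 2, 0, max(len(wav) - clip_len, 0)))
        clips.append(wav[start:start + clip_len])
    return clips
```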
2.2. FluencyBank

We used all of the FluencyBank [13] interview data, which contains recordings from 32 adults who stutter. As with Riad et al. [12], we found that the temporal alignments for some of the provided transcriptions and dysfluency annotations were inaccurate, so we ignored these and used the same process as for SEP-28k to annotate 4,144 clips (3.5 hours).
2.3. Annotations

Annotating stuttering data is difficult because of ambiguity in what constitutes stuttering for a given individual. Repetitions, for example, can occur during stuttering events or when an individual wants to emphasize a word or phrase. Speech may be unintelligible, which makes it challenging to identify how a word was stuttered. We annotated our data using a variant of time-interval based assessment [8] in which audio recordings are broken into 3-second clips and annotated with binary labels as defined in Table 1. A clip may contain multiple stuttering event types along with non-dysfluency labels such as natural pause and unintelligible speech. SEP-28k was also annotated with no speech, poor audio quality, and music to identify issues specific to this medium.

Clips were annotated by at least three people who received training via written descriptions, examples, and audio clips on how to best identify each dysfluency, but who were not clinicians. We measured inter-annotator agreement with Fleiss' kappa and found that word repetitions, interjections, sound repetitions, and no dysfluencies were more consistent (0.62, 0.57, 0.40, 0.39), while blocks and prolongations had only fair or slight agreement (0.25, 0.11). Blocks can be difficult to assess from audio alone; clinicians often rely on physical signs of gasping for air when making this assessment. As such, results that use the block labels should be treated as more speculative.

We use F1 score and Equal Error Rate (EER) to evaluate dysfluency detection, where each annotation constitutes a binary label. F1 is the harmonic mean of precision (P) and recall (R): F1 = 2PR / (P + R). EER is the point on the Receiver Operating Characteristic (ROC) curve where the false acceptance rate equals the false rejection rate, and reflects how well the two classes are separated; the lower the EER, the better the model. We report results for each label individually and as a combined "Any" label which includes all five stutter types.

SEP-28k is partitioned into three splits containing 25k samples for training, 2k for validation, and 1k for testing. FluencyBank is partitioned across the 32 individuals in the dataset: 26 individuals for training, with the remaining speakers split between validation and test sets (~500 clips for testing). We encourage others to explore alternative splits to tease out differences between speakers, podcasts, or other analyses.
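For reference, a minimal sketch of the two evaluation metrics defined above, computed per label from binary annotations and model scores; this uses scikit-learn's roc_curve and f1_score and is only an illustration, not the authors' evaluation code.

```python
import numpy as np
from sklearn.metrics import f1_score, roc_curve

def equal_error_rate(labels, scores):
    """EER: the operating point where the false acceptance rate (FPR)
    equals the false rejection rate (1 - TPR)."""
    fpr, tpr, _ = roc_curve(labels, scores)
    fnr = 1.0 - tpr
    idx = np.nanargmin(np.abs(fnr - fpr))   # threshold where FAR and FRR are closest
    return float((fpr[idx] + fnr[idx]) / 2.0)

# toy example with hypothetical per-clip scores for one label (e.g., "block")
labels = np.array([0, 1, 1, 0, 1, 0])       # 1 = at least two of three annotators agreed
scores = np.array([0.1, 0.8, 0.4, 0.3, 0.9, 0.2])
print("F1 :", f1_score(labels, scores > 0.5))
print("EER:", equal_error_rate(labels, scores))
```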
3. METHODS
Our approach takes an audio clip, extracts acoustic features per frame, applies a temporal model, and outputs a single set of clip-level dysfluency labels. We investigated baselines inspired by the dysfluency model in [10], as well as alternative input features, model architectures, and loss functions.
Our baseline input is a set of 40-dimensional mel-filterbank energy features (MFB). We use frequency cut-offs at 0 Hz and 8000 Hz, a 25 ms window, and a frame rate of 100 Hz. We compare with three additional feature types:

• F0 (3 dim): pitch, pitch-delta, and voicing features;
• ATV (8 dim): articulatory features in the form of vocal-tract (TV) constriction variables [23]. These define the degree and location of constriction actions within the human vocal tract [23, 24], as implemented in [25];
• FPhone (41 dim): phoneme probabilities extracted from an acoustic model trained on LibriSpeech [26] using a time-depth separable CNN architecture [27].

Pitch, voicing, and articulatory features encode voice quality and often change across dysfluency events. We hypothesize these may improve detection of blocks or gasps. Phoneme probabilities may make it easier to identify sound repetitions, where the same phoneme fires multiple times in a row.
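A minimal sketch of the MFB front end described above, here implemented with torchaudio (the paper does not name a toolkit) and assuming 16 kHz input audio; the window, hop, band limits, and filterbank size follow the text, everything else is an assumption.

```python
import torch
import torchaudio

def mfb_features(wav: torch.Tensor, sample_rate: int = 16000) -> torch.Tensor:
    """40-dim log mel-filterbank energies with a 25 ms window and 10 ms hop
    (100 frames per second), band-limited to 0-8000 Hz."""
    mel = torchaudio.transforms.MelSpectrogram(
        sample_rate=sample_rate,
        n_fft=int(0.025 * sample_rate),
        win_length=int(0.025 * sample_rate),
        hop_length=int(0.010 * sample_rate),
        f_min=0.0,
        f_max=8000.0,
        n_mels=40,
    )(wav)                                          # (40, time)
    return torch.log(mel + 1e-6).transpose(0, 1)    # (time, 40)

# a 3-second clip at 16 kHz yields roughly 300 frames of 40-dim features
print(mfb_features(torch.randn(3 * 16000)).shape)
```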
Fig. 2. Multi-feature acoustic stutter detection model.

The baseline stutter detection model consists of a single-layer LSTM network; an improved model adds convolution layers per feature type and learns how the features should be weighted, as shown in Figure 2. We refer to the latter as ConvLSTM. Feature maps from the convolution layers are combined after batch normalization and fed to the LSTM layer. The temporal convolution size for the MFB features was set to 3 frames and for the remaining features to 5 frames. We use unidirectional recurrent networks where the final state is fed into the per-clip classifier. Both models have two output branches: a fluent/dysfluent prediction and a soft prediction for each of the five event types.
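The following PyTorch sketch mirrors the ConvLSTM description above under stated assumptions: the channel and hidden sizes are illustrative, and the learned feature weighting is reduced to one scalar weight per stream because the exact mechanism is not specified in the text.

```python
import torch
import torch.nn as nn

class ConvLSTMStutterNet(nn.Module):
    """Per-feature 1-D convolution + batch norm, learned stream weights,
    a unidirectional LSTM over the combined maps, and two clip-level heads."""

    def __init__(self, feat_dims={"mfb": 40, "f0": 3, "atv": 8}, conv_ch=64, hidden=128):
        super().__init__()
        self.branches = nn.ModuleDict({
            # 3-frame temporal kernel for MFB, 5 frames for the other streams
            name: nn.Sequential(
                nn.Conv1d(dim, conv_ch, kernel_size=3 if name == "mfb" else 5, padding="same"),
                nn.BatchNorm1d(conv_ch),
                nn.ReLU(),
            )
            for name, dim in feat_dims.items()
        })
        self.stream_weight = nn.ParameterDict(
            {name: nn.Parameter(torch.ones(1)) for name in feat_dims})
        self.lstm = nn.LSTM(conv_ch * len(feat_dims), hidden, batch_first=True)
        self.fluency_head = nn.Linear(hidden, 2)  # fluent vs. dysfluent
        self.event_head = nn.Linear(hidden, 5)    # block, prolongation, sound rep., word rep., interjection

    def forward(self, feats):                     # feats[name]: (batch, time, dim)
        maps = [self.stream_weight[n] * self.branches[n](x.transpose(1, 2))
                for n, x in feats.items()]
        x = torch.cat(maps, dim=1).transpose(1, 2)  # (batch, time, conv_ch * n_streams)
        _, (h, _) = self.lstm(x)                    # final state summarizes the clip
        return self.fluency_head(h[-1]), torch.sigmoid(self.event_head(h[-1]))
```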
The baseline model has a single cross-entropy loss term. Our improved models are trained with a multi-task objective in which the fluent/dysfluent branch has a weighted cross-entropy term with focal loss [28] and the per-dysfluency branch has a concordance correlation coefficient (CCC) loss computed against the inter-annotator agreement for each clip.
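A minimal sketch of the two loss terms, assuming the per-event targets are soft labels derived from annotator agreement (e.g., the fraction of annotators who marked each event); this is an illustration of a focal cross-entropy and a CCC loss, not the authors' exact objective.

```python
import torch
import torch.nn.functional as F

def focal_bce(logits, targets, gamma=2.0, pos_weight=None):
    """Weighted binary cross-entropy with the focal modulation of [28]."""
    bce = F.binary_cross_entropy_with_logits(
        logits, targets, pos_weight=pos_weight, reduction="none")
    p_t = torch.exp(-bce)                    # probability assigned to the true class
    return ((1.0 - p_t) ** gamma * bce).mean()

def ccc_loss(pred, target, eps=1e-8):
    """1 - concordance correlation coefficient between predicted per-event
    scores and soft targets (e.g., fraction of annotators marking the event)."""
    pred_m, tgt_m = pred.mean(), target.mean()
    pred_v, tgt_v = pred.var(unbiased=False), target.var(unbiased=False)
    cov = ((pred - pred_m) * (target - tgt_m)).mean()
    ccc = 2.0 * cov / (pred_v + tgt_v + (pred_m - tgt_m) ** 2 + eps)
    return 1.0 - ccc

# multi-task objective (illustrative): fluent/dysfluent branch + per-event branch
# total = focal_bce(fluency_logit, is_dysfluent) + ccc_loss(event_scores, annotator_fractions)
```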
Table 2. Weighted Accuracy (WA ↑), F1 score (↑), and Equal Error Rate (EER ↓) for each model on FluencyBank (eval): Baseline (LSTM, XEnt) and Improved (ConvLSTM, CCC), each evaluated with the feature sets FPhone, MFB, MFB + F0, and MFB + F0 + ATV.
4. EXPERIMENTS & ANALYSIS

4.1. Model Design
Table 2 compares performance across feature and architecture types. Spectral features with pitch generally perform well, and with the improved model the best performance is achieved when articulatory signals are added. This improvement matches our intuition that variation in intonation and articulation coincides with dysfluent speech. The phoneme-based models perform worst, despite their ability to extract features one might think would be useful for sound repetitions. The ConvLSTM and CCC loss moderately improve F1, likely because this loss explicitly encodes annotator uncertainty.

Table 3 shows performance per dysfluency type. Performance is worse for blocks and word repetitions. These dysfluencies tend to last longer in time and have more variation in expression, which may contribute to the lower performance. Interjections and prolongations tend to have less variability and are easier to detect. Performance on SEP-28k is consistently worse than on FluencyBank, likely given the larger variety of individuals and speaking styles.
The central hypothesis of this work was that existing datasets are too small and contain too few participants for training effective dysfluency detection models. This is corroborated by the results in Figure 3, which shows performance on SEP-28k and FluencyBank while training on different subsets. In the best case, there is a 24% relative F1 improvement on FluencyBank when training on all 25k SEP-28k training samples compared to the 3k FluencyBank set. Even using only 5k SEP-28k clips already outperforms training on FluencyBank alone by 16% relative F1. This could be because there are a larger number of users in the dataset and the data contains more variability in speaking styles. As expected, performance on SEP-28k is worst when training on FluencyBank and increases with larger numbers of training samples.
Table 3. F1 score per dysfluency type with a baseline LSTM model (XEnt loss) trained using single- or multi-task learning (STL, MTL) and the improved ConvLSTM model (CCC loss). Bl=Block, Pro=Prolongation, Snd=Sound Repetition, Wd=Word Repetition, Int=Interjection.

SEP-28k | Bl | Pro | Snd | Wd | Int | Any
Random | 13.7 | 12.8 | 9.5 | 4.3 | 13.6 | 46.0
Baseline (STL) | 54.9 | 65.4 | 57.2 | 60.7 | 64.9 | 61.5
Baseline (MTL) | 56.4 | 65.1 | 60.5 | 56.2 | 69.5 | 64.5
Improved | 55.9 | 68.5 | 63.2 | 60.4 | 71.3 | 66.8

FluencyBank | Bl | Pro | Snd | Wd | Int | Any
Random | 12.9 | 10.7 | 28.2 | 10.3 | 31.7 | 31.7
Baseline (STL) | 58.6 | 63.2 | 60.8 | 61.8 | 57.2 | 73.2
Baseline (MTL) | 54.6 | 67.6 | 74.2 | 55.8 | 75.0 | 74.8
Improved | 56.8 | 67.9 | 74.3 | 59.3 | 82.6 | 80.8
Fig. 3. Test performance when training models only on FluencyBank clips or on subsets of clips from SEP-28k.
5. CONCLUSION
We introduced SEP-28k, which contains over an order of magnitude more annotations than existing public datasets, and added new annotations to FluencyBank. These annotations can be used for many tasks, so we encourage others to explore the data, labels, and splits in ways beyond what is described here. Future work should explore alternative approaches, e.g., using language models, which may improve performance for some dysfluency types that are more difficult to detect. Lastly, while dysfluencies are most common in those who stutter, future work should address how they can be detected in people with other speech disorders, such as dysarthria, where they may be characterized differently.
Acknowledgment: Thanks to Lauren Tooley for countless discussions on the clinical aspects of stuttering.

6. REFERENCES

[1] RN Brewer, L Findlater, J Kaye, W Lasecki, C Munteanu, and A Weber, "Accessible voice interfaces," in CSCW, 2018.
[2] L Clark, BR Cowan, A Roper, S Lindsay, and O Sheers, "Speech diversity and speech interfaces: Considering an inclusive future through stammering," in Conversational User Interfaces, 2020.
[3] K Wheeler, "For people who stutter, the convenience of voice assistant technology remains out of reach," USA Today (online), Jan 2020.
[4] P Soundararajan, "Stammering accessibility and testing for voice assistants & devices," Personal Blog (online), April 2020.
[5] M Corcoran, "When Alexa can't understand you," Slate (online), Oct 2018.
[6] C Van Riper, "The nature of stuttering (2nd ed.)," Applied Psycholinguistics, 1983.
[7] G Riley, "SSI-4: Stuttering severity instrument, fourth edition," Austin, TX: Pro-Ed, 2009.
[8] ARS Valente, LMT Jesus, A Hall, and M Leahy, "Event- and interval-based measurement of stuttering: a review," IJLCD, 2015.
[9] RJ Ingham, AK Cordes, and P Finn, "Time-interval measurement of stuttering: Systematic replication of Ingham, Cordes, and Gow (1993)," Journal of Speech, Language, and Hearing Research, 1993.
[10] T Kourkounakis, A Hajavi, and A Etemad, "Detecting multiple speech disfluencies using a deep residual network with bidirectional long short-term memory," in ICASSP, IEEE, 2020.
[11] S Davis, P Howell, and J Bartrip, "The UCLASS archive of stuttered speech," J. Speech Lang. Hear. Res., 2009.
[12] R Riad, AC Bachoud-Lévi, F Rudzicz, and E Dupoux, "Identification of primary and collateral tracks in stuttered speech," in LREC, 2020.
[13] NB Ratner and B MacWhinney, "Fluency Bank: A new resource for fluency research and practice," Journal of Fluency Disorders, 2018.
[14] SP Bayerl, F Hönig, J Reister, and K Riedhammer, "Towards automated assessment of stuttering and stuttering therapy," in International Conference on Text, Speech, and Dialogue, 2020.
[15] A Dash, N Subramani, T Manjunath, V Yaragarala, and S Tripathi, "Speech recognition and correction of a stuttered speech," in ICACCI, 2018.
[16] A K N, Karthik S, K D, P Chanda, and S Tripathi, "Automatic correction of stutter in disfluent speech," CoCoNet, 2020.
[17] A Czyzewski, A Kaczmarek, and B Kostek, "Intelligent processing of stuttered speech," Journal of Intelligent Information Systems, 2003.
[18] S Alharbi, M Hasan, AJH Simons, S Brumfitt, and P Green, "A lightly supervised approach to detect stuttering in children's speech," in Interspeech, 2018.
[19] P Heeman, R Lunsford, A McMillin, and JS Yaruss, "Using clinician annotations to improve automatic speech recognition of stuttered speech," in Interspeech, 2016.
[20] S Alharbi, AJH Simons, S Brumfitt, and PD Green, "Automatic recognition of children's read speech for stuttering application," in WOCCI, 2017.
[21] P Mahesha and DS Vinod, "Gaussian mixture model based classification of stuttering dysfluencies," Journal of Intelligent Systems, 2016.
[22] S Kane, A Guo, and MR Morris, "Sense and accessibility: Understanding people with physical disabilities' experiences with sensing systems," in ACM ASSETS, October 2020.
[23] V Mitra, CY Espy-Wilson, N Seneviratne, and G Sivaraman, "Noise robust acoustic to articulatory speech inversion," in Interspeech, 2017.
[24] V Mitra, H Nam, CY Espy-Wilson, E Saltzman, and L Goldstein, "Retrieving tract variables from acoustics: A comparison of different machine learning strategies," IEEE Journal of Selected Topics in Signal Processing, 2010.
[25] V Mitra, S Booker, E Marchi, DS Farrar, UD Peitz, B Cheng, E Teves, A Mehta, and D Naik, "Leveraging acoustic cues and paralinguistic embeddings to detect expression from voice," in ICASSP, 2019.
[26] V Panayotov, G Chen, D Povey, and S Khudanpur, "Librispeech: An ASR corpus based on public domain audio books," in ICASSP, 2015.
[27] A Hannun, A Lee, Q Xu, and R Collobert, "Sequence-to-sequence speech recognition with time-depth separable convolutions," in Interspeech, 2019.
[28] T Lin, P Goyal, R Girshick, K He, and P Dollár, "Focal loss for dense object detection," in ICCV, 2017.