FluentNet: End-to-End Detection of Speech Disfluency with Deep Learning
Tedd Kourkounakis, Amirhossein Hajavi, and Ali Etemad
Abstract—Strong presentation skills are valuable and sought-after in workplace and classroom environments alike. Of the possible improvements to vocal presentations, disfluencies and stutters in particular remain among the most common and prominent factors of someone's demonstration. Millions of people are affected by stuttering and other speech disfluencies, and the majority of the world has experienced mild stutters while communicating under stressful conditions. While there has been much research in the field of automatic speech recognition and language models, there is not yet a sufficient body of work on disfluency detection and recognition. To this end, we propose an end-to-end deep neural network, FluentNet, capable of detecting a number of different disfluency types. FluentNet consists of a Squeeze-and-Excitation Residual convolutional neural network, which facilitates the learning of strong spectral frame-level representations, followed by a set of bidirectional long short-term memory layers that aid in learning effective temporal relationships. Lastly, FluentNet uses an attention mechanism to focus on the important parts of speech to obtain a better performance. We perform a number of different experiments, comparisons, and ablation studies to evaluate our model. Our model achieves state-of-the-art results by outperforming other solutions in the field on the publicly available UCLASS dataset. Additionally, we present LibriStutter: a disfluency dataset based on the public LibriSpeech dataset with synthesized stutters. We also evaluate FluentNet on this dataset, showing the strong performance of our model versus a number of benchmark techniques.
Index Terms—Speech, stutter, disfluency, deep learning, squeeze-and-excitation, BLSTM, attention.
I. INTRODUCTION

CLEAR and comprehensive speech is the vital backbone to strong communication and presentation skills [1]. Where some occupations consist mainly of presenting, most careers require and thrive from the ability to communicate effectively. Research has shown that oral communication remains one of the more employable skills in the perception of both employers and new graduates [2]. Simple changes to one's speaking patterns, such as volume or the appearance of disfluencies, can have a huge impact on the ability to convey information effectively. By providing simplified, quantifiable data concerning one's speech patterns, as well as feedback on how to change one's speaking habits, drastic improvements could be made to anyone's communication skills [3].

In regard to presentation skills, disfluent speech remains one of the more common factors [4].
All the authors are with the Department of Electrical and Computer Engineering, Queen's University, Kingston, ON, K7L 3N6, Canada, e-mail: [email protected].

Any abnormality or generally uncommon component of one's speech patterns is referred to as a speech disfluency [5]. There are hundreds of different speech disfluencies, often grouped together alongside language and swallowing disorders. Of these afflictions, stuttering proves to be one of the most common and most recognized of the lot [5].

Stuttering, also known as stammering, can be generally defined as a disorder pertaining to the consistency of the flow and fluency of speech. It often involves involuntary additions of sounds and words, and the delay or inability to consistently progress through a phrase. Although labelled as a disorder, stuttering can occur in any person's speech, often induced by stress or nervousness [6]. These cases, however, do not correlate with stammering as a disorder, but are caused by performance anxiety [7]. The use of stutter detection does not only apply to those with long-term stutter impairments, but can appeal to the majority of the world, as it can help with the improvement of communication skills.

As the breadth of applications using machine learning techniques has flourished in recent decades, they have only recently begun to be utilized in the field of speech disfluency and disorder detection. While deep learning has dominated many areas of speech processing, for instance speech recognition [8] [9], speaker recognition [10] [11], and speech synthesis [12] [13], very little work has been done toward the problem of speech disfluency detection. Disfluencies, including stutters, are not easily definable; they come in many shapes and variations. This means that factors such as gender, age, accent, and the language itself will affect the contents of each stutter, greatly complicating the problem space. As well, there are many classes of stutter, each with their own sub-classes and with wildly different structures, making the identification of all stutter types with a single model a difficult task. Even a specific type of stutter applied to a single word can be produced in a wide variety of ways. Where people are adept at identifying stutters through their experience with them, machine learning models have historically struggled with this (as we show in Section II).

Another common issue is the sheer lack of sufficient training data. Many previous works rely on their own manually recorded, transcribed, and labelled datasets, which are often quite small due to the work involved in their creation [14] [15] [16] [17]. There is only one commonly used public dataset, UCLASS [18], that is widely used amongst works in this area, though it too is quite small.

Many disfluency detection solutions provide some form of filler word identification, flagging and counting any spoken interjections (e.g. 'okay', 'right', etc.). However, upon further investigation, these applications simply request a list of interjections from the user and use Speech-to-Text (STT) tools in order to match the spoken word with any interjections in the list.
Though this may work fine for interjections such as 'um' and 'uh' (assuming the STT tool used has the necessary embeddings), it can lead to serious overall errors in classification for most other utterances that are actual words, such as 'like', which is commonly used as a filler word in the English language.

Early works in stutter detection, recognizing the challenges mentioned above, first sought to test the viability of identifying stutters from clean speech. These models primarily focused on machine learning models with very small datasets, consisting of a single stutter type, or even a single word [14], [19]. In more recent years, and due to the rise of automatic speech recognition (ASR), language models have been used to tackle stutter recognition. These works have proven to be strong at identifying certain stutter types, and have been showing ever-improving results [17], [16]. However, due to the uncertainty surrounding relations between cleanly spoken and stuttered word embeddings, it remains difficult for these models to generalize across multiple stutter types. It is hypothesized that by bypassing the use of language models, and by focusing solely on phonetics through the use of convolutional networks, a model can be created that both maintains a strong average accuracy and is effective across all stutter types.

In this paper, we propose a model capable of detecting speech disfluencies. To this end, we design FluentNet, a deep neural network (DNN) for automated speech disfluency detection. The proposed network does not apply any language model aspects, but instead focuses on the direct classification of speech signals. This allows for the avoidance of complex and time-consuming ASR as a pre-processing step in our model, and provides the ability to view the scenario as an end-to-end solution using a single deep neural network. We validate our model on a commonly used benchmark dataset, UCLASS [18]. To tackle the issue of scarce stutter-related speech datasets, we also develop a synthetic dataset based on a non-stuttered speech dataset (LibriSpeech [20]), which we entitle LibriStutter. This dataset is created to mimic stuttered speech and vastly expand the amount of data available for use. Our end-to-end neural network takes spectrogram feature images as inputs, and uses Squeeze-and-Excitation residual (SE-ResNet) blocks for learning the speech embedding. Next, a bidirectional long short-term memory (BLSTM) network is used to learn the temporal relationships, followed by an attention mechanism to focus on the more salient parts of the speech. Experiments show the effectiveness of our approach in generalizing across multiple classes of stutters while maintaining a high accuracy and strong consistency between classes on both datasets.

The key contributions of our work can be summarized as follows: (1) We propose FluentNet, an end-to-end deep neural network capable of detecting several types of speech disfluencies; (2) We develop a synthesized disfluency dataset called LibriStutter based on the publicly available LibriSpeech dataset by artificially generating several types of disfluencies, namely sound, word, and phrase repetitions, as well as prolongations and interjections. The dataset contains the output labels that can be used in training deep learning methods; (3) We evaluate our model (FluentNet) on two datasets,
UCLASS and LibriStutter. The experiments show that our model achieves state-of-the-art results on both datasets, outperforming a number of other baselines as well as previously published work; (4) We make our annotations on the existing UCLASS dataset, along with the entire LibriStutter dataset and its labels, publicly available to contribute to the field and facilitate further research.

This is an extension of our earlier work titled "Detecting Multiple Speech Disfluencies using a Deep Residual Network with Bidirectional Long Short-Term Memory", published in the 2020 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP). That paper focused on tackling the problem of detection and classification of different forms of stutters. The model used a deep residual network and bidirectional long short-term memory layers to classify different types of stutters. In this extended work, we replace the previously used residual blocks of the spectral encoder with residual squeeze-and-excitation blocks. Additionally, we add an attention mechanism after the recurrent network to better focus the network on salient parts of input speech. Furthermore, we develop a new dataset, which we present in this paper and make publicly available. Lastly, we perform thorough experiments, for instance through additional benchmark comparisons and ablation studies. Our experiments show the improvements made by FluentNet over our preliminary work, as validated on both the UCLASS dataset (previously used) as well as the newly developed dataset. This new model provides greater advancement towards end-to-end disfluency detection and classification.

The rest of this paper is organized as follows: a discussion of previous contributions towards stutter recognition in Section II, followed by our methodology including a breakdown of the model in Section III, the datasets and benchmark models applied in Section IV, a discussion of our results in Section V, and our conclusion in the final section.

II. RELATED WORK
There has recently been increasing interest in the fields of deep learning, speech, and audio processing. However, as discussed earlier in Section I, there has been minimal research targeting automated detection of speech disfluencies including stuttering, most likely as a result of insufficient data and the smaller number of potential applications in comparison to other speech-related problems such as speech recognition [21] [9] and speaker recognition [10] [11]. In the following sections we first provide a summary of the types of disfluencies commonly targeted in the area, followed by a review of the existing work that falls under the umbrella of speech disfluency detection and classification.
A. Background: Types of Speech Disfluency
There are a number of different stuttering types, often categorized into four main groups: repetitions, prolongations, interjections, and blocks. A summary of all these disfluency types and examples of each are presented in Table I. The descriptions for each of these categories are as follows.

TABLE I: Types of stutters considered for training and testing labels.

Label | Stutter Disfluency | Description | Example
S | Sound Repetition | Repetition of a phoneme | th-th-this
PW | Part-Word Repetition | Repetition of a syllable | bec-because
W | Word Repetition | Repetition of any word | why why
PH | Phrase Repetition | Repetition of multiple successive words | I know I know that
R | Revision | Repetition of thought, rephrased mid-sentence | I think that- I believe that
I | Interjection | Addition of fabricated words or sounds | um, uh, like
PR | Prolongation | Prolonged sounds | whoooooo is there
B | Block | Involuntary pause within a phrase | I want (pause) to

Repetitions are classified as any part of an utterance repeated at a quick pace. As this definition still remains general, repetitions are often further sub-categorized [5]. These sub-categories have been used in previous works classifying stutter disfluencies [22] [23] [17], and include sound, word, and phrase repetitions, as well as revisions. Sound repetitions (S) are repetitions of a single phoneme, or short sound, often consisting of a single letter. Part-word, or syllable, repetitions (PW), as the name suggests, are classified as the repetition of syllables, which can consist of multiple phonemes. Similarly, word repetitions (W) are defined as any repetition of a single word, and phrase repetitions (PH) are the repetition of phrases consisting of multiple consecutive words. The final repetition-type disfluency is revision (R). Similar to phrase repetitions, revisions consist of repeated phrases, where the repeated segment is rephrased, containing new or different information from the first iteration. A rise in pitch may accompany this disfluency type [24].

Interjections (I), often referred to as filler words, consist of the addition of any utterance that does not logically belong in the spoken phrase. Common interjections in the English language include exclamations, such as 'um' and 'uh', as well as discourse markers such as 'like', 'okay', and 'right'.

Prolongation (PR) stutters are presented as a lengthened or sustained phoneme. The duration of these prolonged utterances varies alongside the severity of the stutter. Similar to repetitions, this disfluency is often accompanied by a rise in pitch.

The final category of stuttering is silent blocks (B), which are sudden cutoffs of vocal utterances. These are often involuntary and are presented as pauses within a given phrase.

B. Stutter Recognition with Classical Machine Learning
Before the focus of stutter recognition targeted maximizing accuracy in the classification of stammers, a number of works were performed to test the viability of stutter detection. In 1995, Howell et al. [14], who later helped to create the UCLASS dataset [18] used in this paper, employed a set of pre-defined words to identify repetition and prolongation stutters. From these, they extracted autocorrelation features, spectral information, and envelope parameters from the audio. Each was used as an input to a fully connected artificial neural network (ANN). Findings showed that the model achieved its strongest classification results on severe disfluencies, and was weakest on mild ones. These models were able to achieve a maximum detection rate of 0.82 on severe prolongation stutters. Howell et al. [15] later furthered their work using a larger set of data, as well as a wider variety of audio parameters. This work also introduced an ANN model for both repetition and prolongation types, and more judges were used to identify stutters, with strict restrictions towards agreement of disfluency labeling. Results showed that the best parameters for disfluency classification were fragmentation spectral measures for whole words, as well as duration and supralexical disfluencies of energy in part-words.

Tan et al. [19] worked on testing the viability of stutter detection through a simplified approach in order to maximize the possible results. By collecting audio samples of clean, stuttered, and artificially generated copies of single pre-chosen words, they were able to reach an average accuracy of 96% on the human samples using a hidden Markov model (HMM). This served as a temporary benchmark towards the best possible average results for stutter detection.

Ravikumar et al. have attempted a variety of classifiers on syllable repetitions, including an HMM [25] and a support vector machine (SVM) [26] using Mel-frequency cepstral coefficient (MFCC) features. Their best results were obtained when classifying this stutter type using the SVM on 15 participants, achieving an accuracy of 94.35%. No other disfluency types were considered.

A detailed summary of previously attempted stutter classification methods, including some of the aforementioned classical models, is available in the form of a review paper [27]. This paper provides background on the use of three different models (ANNs, HMMs, and SVMs) towards the application of stutter recognition. Of the works considered in that 2009 review paper, it was concluded that HMMs achieve the best results in stutter recognition.
C. Stutter Recognition with Deep Learning
With the recent advancements in deep learning, disfluency detection and classification have seen an increase in popularity within the field, with a higher tendency towards end-to-end approaches. ASR has become an increasingly popular method of tackling the problem of disfluency classification. As some stuttered speech results in repeated words, as well as prolonged utterances, these can be represented by word embeddings and sound amplitude features, respectively. To exploit this concept, Alharbi et al. [17] detected sound and word repetitions, as well as revision disfluencies, using task-oriented finite state transducer (FST) lattices. They also utilized amplitude thresholding techniques to detect prolongations in speech. These methods resulted in an average 37% miss rate across the 4 different types of disfluencies.

TABLE II: Summary of previous stutter disfluency classification methods.

Year | Author | Dataset | Features | Classification Method | Results
1995 | Howell et al. [14] | N/A | autocorrelation function, spectral information, envelope parameters | ANN | Acc.: 82%
1997 | Howell et al. [15] | 12 Speech Samples | oscillographic and spectrographic parameters | ANN | Avg. Acc.: 92%
2007 | Tan et al. [19] | 20 Samples (single word) | MFCC | HMM | Acc.: 96%
2009 | Ravikumar et al. [26] | 15 Speech Samples | MFCC | SVM | Acc.: 94.35%
2016 | Zayats et al. [28] | Switchboard Corpus | MFCC | BLSTM w/ Attention | F1: 85.9
2018 | Alharbi et al. [17] | UCLASS | Word Lattice | Finite State Transducer, Amplitude and Time Thresholding | Avg. MR: 37%
2018 | Dash et al. [16] | 60 Speech Samples | Amplitude | STT, Amplitude Thresholding | Acc.: 86%
2019 | Villegas et al. [29] | 68 Participants | Respiratory Biosignals | Perceptron | Acc.: 95.4%
2019 | Santoso et al. [30] | PASD, UUDB | MFCC | BLSTM w/ Attention | F1: 69.1
2020 | Chen et al. [31] | In-house Chinese Corpus | Word & Position Embeddings | CT-Transformer | MR: 38.5%

Dash et al. [16] have used an STT model in order to identify word and phrase repetitions within stuttered speech. To detect prolongation stutters, they integrated a neural network capable of finding optimal cutoff amplitudes for a given speaker, expanding upon simple thresholding methods. As these ASR works required full word embeddings to classify repetitions, they either fared poorly against, or did not attempt, sound or part-word repetitions.

Deep recurrent neural networks (RNNs), namely BLSTMs, have been used to tackle stutter classification. Zayats et al. [28] trained a BLSTM with Integer Linear Programming (ILP) [32] on a set of MFCC features to detect repetitions with an F-score of 85.9. Similarly, a work by Santoso et al. applied a BLSTM followed by an attention mechanism to perform stutter detection based on input MFCC features, obtaining a maximum F-score of 69.1 [30]. More recently, in a study by Chen et al., a Controllable Time-delay Transformer (CT-Transformer) was used to detect speech disfluencies and correct punctuation in real time [31]. In our initial work on stutter classification, we utilized spectrogram features of stuttered audio and used a BLSTM [33] to learn temporal relationships following spectral frame-level representation learning by a ResNet. This model achieved a 91.15% average accuracy across six different stutter categories.

In an interesting recent work, Villegas et al. utilized respiratory biosignals in order to better detect stutters [29]. By correlating respiratory volume and flow, as well as heart rate measurements, with the time when a stutter occurs, they were able to classify block stutters with an accuracy of 95.4% using an MLP.

A 2018 summary and comparison of different features and classification methods for stuttering was conducted by Khara et al. [34]. This work discusses and compares different popular feature extraction methods, classifiers and their uses, as well as their advantages and shortcomings. The paper discusses that MFCC feature extraction has historically provided the strongest results. Similarly, ANNs provide the most flexibility and adaptability compared to other models, especially linear ones.

Table II provides a summary of the related works on disfluency detection and classification.
It can be observed and concluded that disfluency classification has been progressing on one of two fronts: (i) end-to-end speech-based methods, or (ii) language-based models relying on an ASR pre-processing step. Our work in this paper is positioned in the first category in order to avoid the reliance on an ASR step. Moreover, from Table II, we observe that although progress is being made in the area of speech disfluency recognition, the lack of available data remains a hindrance to potential further achievements in the field.

III. PROPOSED METHOD
A. Problem and Solution Overview
Our goal in this section is to design and develop a system that can be used for detecting various types of disfluencies. While one approach to tackle this concept is to design a multi-class problem, another approach is to design an ensemble of single-disfluency detectors. In this paper, given the relatively small size of available stutter datasets, we use the latter approach, which can help reduce the complexity of each binary task. Accordingly, the goal is to design a single network architecture that can be trained separately to detect different disfluency types with each trained instance, where together they could detect a number of different disfluencies. Figure 1 shows the overview of our system. The designed network should possess the capability of learning spectral frame-level representations as well as temporal relationships. Moreover, the model should be able to focus on salient parts of the inputs in order to effectively learn the disfluencies and perform accurately.
B. Proposed Network: FluentNet
We propose an end-to-end network, FluentNet, which uses the short-time Fourier transform (STFT) spectrograms of audio clips as inputs. These inputs are passed through a Squeeze-and-Excitation Residual Network (SE-ResNet) to learn frame-level spectral representations. As most stutter types have distinct spectral and temporal properties, a bidirectional LSTM network is introduced to learn the temporal relationships present among different spectrograms. An attention mechanism is then added to the final recurrent layer to better focus on the necessary features needed for stutter classification. FluentNet's final output reveals a binary classification to detect a specific disfluency type that it has been trained for. The architecture of the network is presented in Figure 2(a). In the following, we describe each of the components of our model in detail.

Fig. 1: Full model overview using FluentNet for disfluency classification.
1) Data Representation:
Input audio clips recorded with a sampling rate of 16 kHz are converted to spectrograms using STFT with 256 filters (frequency bins) to be fed to our end-to-end model. A sample spectrogram can be seen in Figure 2, where the colours represent the amplitude of each frequency bin at a given frame, with blue representing lower amplitudes, and green and yellow representing higher amplitudes. Following common practice in audio signal processing, a 25 ms frame has been used with an overlap of 10 ms.
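To make the front end concrete, the following is a minimal sketch of this spectrogram extraction, using the Librosa library mentioned in Section III-C. The FFT size of 512 (truncated to 256 bins) and the treatment of the 10 ms value as the hop length are our own assumptions, as they are not stated explicitly above.

```python
import numpy as np
import librosa

# A minimal sketch of the spectrogram extraction described above.
# Assumptions (not stated explicitly in the text): the 10 ms value is
# treated as the hop length, and the 256 frequency bins are obtained by
# using an FFT size of 512 and dropping the redundant highest bin.
SAMPLE_RATE = 16000                       # 16 kHz input audio
WIN_LENGTH = int(0.025 * SAMPLE_RATE)     # 25 ms analysis frame -> 400 samples
HOP_LENGTH = int(0.010 * SAMPLE_RATE)     # 10 ms step -> 160 samples
N_FFT = 512                               # yields 257 bins; keep the first 256

def audio_to_spectrogram(path, clip_seconds=4.0):
    """Load a fixed-length clip and return a log-magnitude STFT spectrogram."""
    y, _ = librosa.load(path, sr=SAMPLE_RATE, duration=clip_seconds)
    target_len = int(clip_seconds * SAMPLE_RATE)
    if len(y) < target_len:                       # zero-pad shorter clips
        y = np.pad(y, (0, target_len - len(y)))
    stft = librosa.stft(y, n_fft=N_FFT, win_length=WIN_LENGTH,
                        hop_length=HOP_LENGTH)
    magnitude = np.abs(stft)[:256, :]             # 256 frequency bins x time frames
    return librosa.amplitude_to_db(magnitude)     # log scale, as in the plotted samples
```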
2) Learning Frame-level Spectral Representations:
FluentNet first focuses on learning effective representations from each input spectrogram. To do so, CNN architectures are often used. Though both residual networks [35] and squeeze-and-excitation (SE) networks [36] are relatively new in the field of deep learning, both have proven to improve on previous state-of-the-art models in a variety of different application areas [37], [38]. The ResNet architecture has proven a reliable solution to the vanishing or exploding gradient problems, both common issues when back-propagating through a deep neural network. In many cases, as the model depth increases, the gradients of weights in the model become increasingly smaller, or inversely, explosively larger with each layer. This may eventually prevent the gradients from actually changing the weights, or lead to the weights becoming too large, thus preventing improvements in the model. A ResNet overcomes this by utilizing shortcuts through its CNN blocks, resulting in norm-preserving blocks capable of carrying gradients through very deep models.

Squeeze-and-excitation modules have been recently proposed and have been shown to outperform various DNN models using previous CNN architectures, namely VGG and ResNet, as their backbone architectures [36]. SE networks were first proposed for image classification, reducing the relative error compared to previous works on the ImageNet dataset by approximately 25% [36].

Every kernel within a convolution layer of a CNN results in an added channel (depth) for the output feature map. Whereas recent works have focused on expanding the spectral relationships within these models [39] [40], SE blocks build a stronger focus on channel-wise relationships within a CNN. These blocks consist of two major operations. The squeeze operation aggregates a feature map across both its height and width, resulting in a one-dimensional channel descriptor. The excitation operation consists of fully connected layers providing channel-wise weights, which are then applied back to the original feature map.

To exploit the capabilities of both ResNet and SE architectures and learn effective spectral frame-level representations from the input, we use an SE-ResNet model in our end-to-end network. The network consists of 8 SE-ResNet blocks, as shown in Figure 2(a). Each SE-ResNet block in FluentNet, illustrated in Figure 2(b), consists of three sets of two-dimensional convolution layers, each succeeded by batch normalization and a Rectified Linear Unit (ReLU) activation function. A separate residual connection shares the same input as the block's non-identity branch, and is added back to the non-identity branch before the final ReLU function, but after the SE unit (described below). Each residual connection contains a convolution layer followed by batch normalization. The Squeeze-and-Excitation unit within each SE-ResNet block begins with a global pooling layer. The output is then passed through two fully connected layers: the first followed by a ReLU activation function, and the second succeeded by a sigmoid gating function. The main convolution branch is scaled with the output of the SE unit using channel-wise multiplication.
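As an illustration of the block just described, the Keras sketch below builds one SE-ResNet block under stated assumptions: the filter count, kernel size, and SE reduction ratio are illustrative choices, and the last ReLU is applied after the residual addition as in standard residual designs.

```python
from tensorflow.keras import layers

# A sketch of one SE-ResNet block following the description above. The
# filter count, kernel size, and SE reduction ratio are illustrative
# assumptions; the final ReLU is applied after the residual addition.
def se_resnet_block(x, filters, kernel_size=3, reduction=16):
    # Main (non-identity) branch: three conv layers, each with batch norm;
    # ReLU after the first two.
    out = x
    for i in range(3):
        out = layers.Conv2D(filters, kernel_size, padding="same")(out)
        out = layers.BatchNormalization()(out)
        if i < 2:
            out = layers.ReLU()(out)

    # Squeeze-and-excitation unit: global pooling (squeeze), two fully
    # connected layers (excitation), then channel-wise rescaling.
    se = layers.GlobalAveragePooling2D()(out)
    se = layers.Dense(filters // reduction, activation="relu")(se)
    se = layers.Dense(filters, activation="sigmoid")(se)
    se = layers.Reshape((1, 1, filters))(se)
    out = layers.Multiply()([out, se])

    # Residual connection: a convolution layer followed by batch norm.
    shortcut = layers.Conv2D(filters, 1, padding="same")(x)
    shortcut = layers.BatchNormalization()(shortcut)

    out = layers.Add()([out, shortcut])
    return layers.ReLU()(out)
```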
3) Learning Temporal Relationships:
In order to learn the temporal relationships between the representations learned from the input spectrogram, we use an RNN. In particular, LSTM [41] networks have been shown to be effective for this purpose in the past and are widely used for learning sequences of spectral representations obtained from consecutive segments of time-series data [42] [43] [44].

Each LSTM unit contains a cell state, which holds information contained in previous units, allowing the network to learn temporal relationships. This cell state is part of the LSTM's memory unit, which contains several gates that together control which information from the inputs, as well as from the previous cell and hidden states, will be used to generate the current cell and hidden states. Namely, the forget gate, $f_t$, and input gate, $i_t$, are utilized to learn what information from each of these respective states will be saved within the new current state, $C_t$. This is shown by the following equations:

$f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f)$   (1)

$i_t = \sigma(W_i \cdot [h_{t-1}, x_t] + b_i)$   (2)

$C_t = f_t * C_{t-1} + i_t * \tanh(W_C \cdot [h_{t-1}, x_t] + b_C)$   (3)

where $\sigma$ represents the sigmoid function, and the $*$ operator denotes point-wise multiplication. This new cell state, along with an output gate, $o_t$, is used to generate the hidden state of the unit, $h_t$, as represented by:

$o_t = \sigma(W_o \cdot [h_{t-1}, x_t] + b_o)$   (4)

$h_t = o_t * \tanh(C_t)$   (5)

The cell state and hidden state are then passed to successive LSTM units, allowing the network to learn long-term dependencies.

We used a BLSTM network [45] in FluentNet. BLSTMs consist of two LSTMs advancing in opposite directions, maximizing the available context from relationships with both the past and future. The outputs of these two networks are multiplied together into a single output layer. FluentNet consists of two consecutive BLSTMs, each utilizing LSTM cells with 512 hidden units. A dropout [46] of 20% was also applied to each of these recurrent layers. To avoid overfitting given the size of the dataset, the neurons randomly masked by dropout force the model to be trained using a sparse representation of the given data.

Fig. 2: a) A full workflow of FluentNet is presented. This network consists of 8 SE residual blocks, two BLSTM layers, and a global attention mechanism. b) The breakdown of a single SE-ResNet block in FluentNet is presented.
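A minimal sketch of the recurrent stage described above follows, assuming Keras and treating the stated 20% dropout as input dropout within each LSTM cell (the exact dropout placement is an assumption).

```python
from tensorflow.keras import layers

# A minimal sketch of the recurrent stage described above: two stacked
# BLSTM layers with 512 hidden units each and 20% dropout. Applying the
# dropout as input dropout inside each LSTM cell is an assumption.
def temporal_encoder(frame_features):
    """frame_features: tensor of shape (batch, time, feature_dim)."""
    x = layers.Bidirectional(
        layers.LSTM(512, return_sequences=True, dropout=0.2))(frame_features)
    # return_sequences=True keeps every hidden state so that the attention
    # mechanism can attend over all time steps of the second BLSTM.
    x = layers.Bidirectional(
        layers.LSTM(512, return_sequences=True, dropout=0.2))(x)
    return x
```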
4) Attention:
The recent introduction of attention mechanisms [47] and their subsequent variations [48] has allowed for added focus on the more salient sections of a learned embedding. These mechanisms have recently been applied to speech recognition models to better focus on strong emotional characteristics within utterances [49] [50], and are similarly used in FluentNet to improve focus on the specific parts of utterances with disfluent attributes. FluentNet uses global attention [51], which incorporates all hidden state values of the encoder (in this case the BLSTM). A diagram of the attention model is presented in Figure 3.

The final output value of the second layer of the BLSTM, $h_t$, as well as a context vector, $C_t$, derived through the use of the attention mechanism, are used to generate FluentNet's final classification, $\tilde{h}_t$. This is done by applying a tanh activation function, as shown by:

$\tilde{h}_t = \tanh(W_c [C_t; h_t])$   (6)

The context vector of the global attention is the weighted sum of all hidden state outputs of the encoder. An alignment vector, generated as a relation between $h_t$ and each hidden state value, is passed through a softmax layer, which then provides the weights for the context vector. The dot product was used as the alignment score function for this attention mechanism. The calculation of the context vector can be represented by:

$C_t = \sum_{i=1}^{t} \bar{h}_i \left( \frac{e^{h_t^\top \cdot \bar{h}_i}}{\sum_{i'=1}^{t} e^{h_t^\top \cdot \bar{h}_{i'}}} \right)$   (7)

where $\bar{h}_i$ represents the $i$th BLSTM hidden state's output.
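The global attention of Eqs. (6)-(7) can be sketched as a small custom Keras layer; the output dimensionality of $W_c$ below is an illustrative choice, not a value taken from the paper.

```python
import tensorflow as tf
from tensorflow.keras import layers

# A sketch of the global dot-product attention of Eqs. (6)-(7): alignment
# scores between the final BLSTM output h_t and every hidden state are
# softmax-normalized into weights that form the context vector C_t, which
# is then combined with h_t through tanh(W_c [C_t; h_t]).
class GlobalDotAttention(layers.Layer):
    def __init__(self, units, **kwargs):
        super().__init__(**kwargs)
        self.w_c = layers.Dense(units, activation="tanh", use_bias=False)

    def call(self, hidden_states):
        # hidden_states: (batch, time, dim); the last state acts as the query h_t
        h_t = hidden_states[:, -1, :]                              # (batch, dim)
        scores = tf.einsum("bd,btd->bt", h_t, hidden_states)       # dot-product alignment
        weights = tf.nn.softmax(scores, axis=-1)                   # (batch, time)
        context = tf.einsum("bt,btd->bd", weights, hidden_states)  # C_t
        return self.w_c(tf.concat([context, h_t], axis=-1))        # tanh(W_c [C_t; h_t])
```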
C. Implementation

FluentNet was implemented using Keras [52] with a TensorFlow [53] backend. The model was trained with the learning rate that yielded the strongest results, using a root mean square propagation (RMSProp) optimizer and a binary cross-entropy loss function. All experiments were run on an Nvidia GeForce GTX 1080 Ti GPU. Python's Librosa library [54] was used for audio importing and manipulation towards creating our synthetic dataset, as described later. Each STFT spectrogram was generated from a four-second audio clip. This length of time can encapsulate any stutter apparent in the dataset, with no stutters lasting longer than four seconds.
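Putting the pieces sketched in the previous subsections together, a rough (unofficial) assembly of one binary detector under the stated training configuration (RMSProp optimizer, binary cross-entropy loss) could look as follows. The input time dimension, filter count, and the reshaping between the convolutional and recurrent stages are illustrative assumptions, and no pooling or striding is applied in this sketch.

```python
from tensorflow.keras import layers, models, optimizers

# A rough assembly of the pieces sketched above into one binary detector,
# under the training configuration stated in the text (RMSProp optimizer,
# binary cross-entropy loss). Shapes and filter counts are assumptions.
def build_fluentnet(time_steps=400, n_bins=256, n_blocks=8, filters=64):
    spec = layers.Input(shape=(n_bins, time_steps, 1))    # STFT spectrogram
    x = spec
    for _ in range(n_blocks):                             # 8 SE-ResNet blocks
        x = se_resnet_block(x, filters)
    # Collapse the frequency axis so each time frame becomes one feature vector
    x = layers.Permute((2, 1, 3))(x)                      # (time, freq, channels)
    x = layers.Reshape((time_steps, -1))(x)
    x = temporal_encoder(x)                               # two BLSTM layers
    x = GlobalDotAttention(128)(x)                        # global attention
    out = layers.Dense(1, activation="sigmoid")(x)        # binary disfluency decision
    model = models.Model(spec, out)
    model.compile(optimizer=optimizers.RMSprop(),
                  loss="binary_crossentropy", metrics=["accuracy"])
    return model
```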
Fig. 3: Global attention addition to the binary classifier of the recurrent network.

IV. EXPERIMENTS
A. Datasets
Despite an abundance of datasets for speech-related tasks such as ASR and speaker recognition [20] [55] [56], there is a clear lack of corpora focused on speech disfluencies. An ideal speech disfluency dataset would require the labelling and categorization of each existing disfluent utterance. In this paper, to tackle this problem, in addition to using the UCLASS dataset, which is a commonly used stuttered speech corpus [57] [58] [17], a second dataset was created by adding speech disfluencies into clean speech. This synthetic corpus contributes a drastic expansion to the available training and testing data for disfluency classification. In the following subsections, we describe the UCLASS dataset used in our study, as well as the approach for creating the synthetic dataset, LibriStutter, which we created using the original non-stuttered LibriSpeech dataset.
1) UCLASS:
University College London's Archive of Stuttered Speech (UCLASS) [18] is the most commonly used dataset for disfluency-related studies with machine learning. This corpus came in two releases, in 2004 and 2008, from the university's Division of Psychology and Language Sciences. The dataset consists of 457 audio recordings including monologues, readings, and conversations of children with known stutter disfluency issues. Of those recordings, a select few contain written transcriptions of their respective audio files; these were either standard, phonetic, or orthographic transcriptions. The orthographic format is the best option for manual labelling of the dataset for disfluency, as it tries to transcribe the exact sounds uttered by the speaker using the standard alphabet. This helps to identify the presence of disfluency in an utterance more easily. The resulting applicable data consisted of 25 unique conversations between an examiner and a child between the ages of 8 and 18, totalling just over one hour of audio.

In order to pair the utterances with their transcriptions, each audio file and its corresponding orthographic transcription were passed through a forced time alignment tool. The resulting table related each alphabetical token in the transcription to its matching timestamp within the audio. This process was then manually checked for outlying utterances not matching their transcriptions.

The provided orthographic transcriptions only flagged the existence of disfluencies (through the use of capitalization), but gave no information towards the disfluency type. To build a more detailed dataset and be able to classify the type of disfluency, all utterances were manually labelled as one of the seven represented classes for our model. These included clean (no stutter), sound repetitions, word repetitions, phrase repetitions, revisions, interjections, and prolongations. The annotation methods applied in [22] and [23] were used as guidelines when manually categorizing each utterance. Out of the 8 disfluencies, 6 were used: sound, word, and phrase repetitions, as well as revisions, interjections, and prolongations. Of the usable audio in the dataset, only three instances of 'part-word repetitions' appeared, lacking sufficient positive training samples to feasibly classify these types of stutters. As 'block disfluencies' exist as the absence of sound, they could not feasibly be represented in the orthographic transcriptions, which represent how utterances are performed.
2) LibriStutter:
The 2015 LibriSpeech ASR corpus by Panayotov et al. [20] includes 1000 hours of prompted English speech extracted from audio books derived from the LibriVox project. We used this dataset as the basis for our synthetic stutter dataset, which we name LibriStutter. LibriStutter's creation compensates for two shortcomings of the UCLASS corpus: the small amount of labelled stuttered speech data available and the imbalance of the dataset (several disfluency types in UCLASS consisted of a small number of samples). To allow for a manageable size for LibriStutter and feasible training times, we used a subset of LibriSpeech and set the size of LibriStutter to 20 hours. LibriStutter includes synthetic stutters for sound repetitions, word repetitions, phrase repetitions, prolongations, and interjections. We generated these stutter types by sampling the audio within the same utterance, the details of which are described below. Revisions were excluded from LibriStutter, as this disfluency type requires the speaker to change and revise what was initially said. This would require added speech through the use of complex language models and voice augmentation tools to mimic the revised phrase, both of which fall out of scope for this project.

For each audio file selected from the LibriSpeech dataset, we used the Google Cloud Speech-to-Text API [59] to generate a timestamp corresponding to each spoken word. For every four-second window of speech within a given audio file, either a random disfluency type was inserted and labelled accordingly, or the window was left clean. Each disfluency type underwent a number of processes to best simulate natural stutters.

All repetition stutters relied upon copying existing audio segments already present within each audio file. Sound repetitions were generated by copying the first fraction of a random spoken word within the sample and repeating this short utterance several times before said word. Although repetitions of sounds can occur at the end of words, known as word-final disfluencies, this is rarely the case [60]. One to three repeated sound utterances were added to each stuttered word. After each instance of the repeated sound, a random empty pause of 100 to 350 ms was appended, as this range sounded most natural. Inserted audio may leave sharp cutoffs, especially part-way through an utterance. To avoid this, interpolation was used to smooth the added audio's transition into the existing clip.

Both word and phrase repetitions underwent similar processes to that of sound repetitions. For word repetitions, we repeated one to two copies of a randomly selected word before the original utterance. For phrase repetitions, a similar approach was taken, where instead of repeating a particular word, a phrase consisting of two to three words was repeated. The same pause duration and interpolation techniques used for sound repetitions were applied to both word and phrase repetition disfluencies.

Prolongations consist of sustained sounds, primarily at the end of a word. To mimic this behaviour, the last portion of a word was stretched to simulate prolonged speech. For a randomly chosen word, the latter 20% of the signal was stretched by a factor of 5. This prolonged speech segment replaced the original word ending. As applying time stretching to audio results in a drop in pitch, pitch shifting was used to realign the pitch with the original audio.
The average pitch of the given speech segment was used to normalize the disfluent utterance.

Unlike the aforementioned classes, interjection disfluencies cannot be created from existing speech within a sample, as they require the addition of filler words absent from the original audio (for example 'umm'). Multiple samples of common filler words from UCLASS were isolated and saved separately to create a pool of interjections. To create interjection disfluencies, a random filler word from this pool was inserted between two random words, followed by a short empty pause. The same pitch scaling and normalization method as used for prolongations was applied to match the pitches between the interjection and the audio clip. Interpolation was used, as in repetition disfluencies, to smooth sharp cutoffs caused by the added utterance.
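As an illustration of the generation procedure, the sketch below shows simplified versions of two of the steps described above (word repetition insertion and word-ending prolongation) using Librosa. The pause sampling, the omission of the cross-fade interpolation, and the use of a pitch-preserving phase-vocoder stretch (which removes the need for a separate pitch realignment) are simplifications relative to the actual pipeline.

```python
import numpy as np
import librosa

SR = 16000  # sampling rate assumed throughout

# Simplified versions of two of the LibriStutter generation steps described above.
def insert_word_repetition(audio, word_start, word_end, n_repeats=1):
    """Repeat a word segment (times in seconds) before its original utterance."""
    start, end = int(word_start * SR), int(word_end * SR)
    word = audio[start:end]
    pieces = [audio[:start]]
    for _ in range(n_repeats):
        pause = np.zeros(int(np.random.uniform(0.100, 0.350) * SR))  # 100-350 ms pause
        pieces += [word, pause]
    pieces.append(audio[start:])              # original word and the rest of the clip
    return np.concatenate(pieces)

def prolong_word_ending(audio, word_start, word_end, fraction=0.2, factor=5):
    """Stretch the last 20% of a word by a factor of 5 to mimic a prolongation."""
    start, end = int(word_start * SR), int(word_end * SR)
    split = end - int(fraction * (end - start))
    stretched = librosa.effects.time_stretch(audio[split:end], rate=1.0 / factor)
    return np.concatenate([audio[:split], stretched, audio[end:]])
```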
Fig. 4: Spectrograms of the same stutters found in the UCLASS dataset and generated in the LibriStutter dataset.

TABLE III: Cosine similarity between a UCLASS dataset stutter and a matching LibriStutter stutter, as well as the average of 100 random samples from the LibriStutter dataset.
Stutter | UCLASS vs. LibriStutter | UCLASS vs. Random
Sound Repetition | |
Word Repetition | |
Prolongation | |

To ensure that sufficient realism was incorporated into the dataset, a registered speech language pathologist was consulted for this project. Nonetheless, it should be mentioned that despite our attention to creating a perceptually valid and realistic dataset, the notion of "realism" itself is not a focus of this dataset. Instead, much like synthetic datasets in other areas such as image processing, the aim is for the dataset to be valid enough that machine learning and deep learning methods can be trained and evaluated with it, and later transferred to real large-scale datasets [in the future] with little to no adjustments to the model architectures.

Figure 4 displays side-by-side comparisons of spectrograms of real stuttered data from the UCLASS dataset and artificial stutters from LibriStutter. Each pairing represents a single stutter type, with the same word or sound being spoken in each. It can be observed that the UCLASS stutter samples and their corresponding LibriStutter examples show clear similarities. Moreover, to numerically compare the samples, cosine similarity [61] was calculated between the UCLASS and LibriStutter spectrogram samples shown earlier. To add relevance to these values, a second comparison was performed for each UCLASS spectrogram with respect to 100 random samples from the LibriStutter dataset, and the average score was used as the comparison value. These scores are summarized in Table III. We observe that the UCLASS cosine similarity scores corresponding to the matching LibriStutter samples are noticeably (approximately 10× to 30×) greater than those for random audio samples, confirming that the disfluent utterances contained in LibriStutter share phonetic similarities with real stuttered samples, empirically showing the similarity between a few sample real and synthesized stutters.

The LibriStutter dataset consists of approximately 20 hours of speech data from the LibriSpeech train-clean-100 set (the training set of 100 hours of "clean" speech). In turn, LibriStutter shares a similar makeup to that of its predecessor. It consists of disfluent prompted English speech from audiobooks. It contains 23 male and 27 female speakers, with approximately 53% of the audio coming from males and 47% from females. There are 15000 disfluencies in this dataset, with equal counts for each of the five disfluency types: 3000 sound, word, and phrase repetitions, as well as prolongations and interjections. Each audio file has a corresponding CSV file containing each word or utterance spoken, the start and end time of the utterance, and its disfluency type, if any.

B. Benchmarks
For a thorough analysis of our results, we compare the results obtained by the proposed FluentNet to a number of other models. In particular, we employ two types of solutions for comparison purposes. First, we compare our results to related works and the state-of-the-art as follows:
Alharbi et al. [17]:
This work conducted classification of sound repetitions, word repetitions, revisions, and prolongations on the UCLASS dataset through the application of two different methods. First, an original speech prompt was aligned and then passed to a task-oriented FST to generate word lattices. These lattices were used to detect repeated part-words, words, and phrases within the sample. This method scored perfect results on word repetition classification, though the results on sound repetitions and revisions proved much weaker. To classify prolongation stutters, an autocorrelation algorithm consisting of two thresholds was used: the first to detect speech with similar amplitudes (sustained speech), and another dynamic threshold to decide whether the duration of similar speech would be considered a prolongation. Using this algorithm, perfect prolongation classification was achieved.
Chen et al. [31]:
A CT-Transformer was designed to conduct repetition and interjection disfluency detection on an in-house Chinese speech dataset. Both word and position embeddings of a provided audio sample were passed through a series of CT self-attention layers and fully connected layers. This work was able to achieve an overall disfluency classification miss rate of 38.5% (F1 score of 70.5). Notably, this is one of the few works to have attempted interjection disfluency classification, yielding a miss rate of 25.1%. Note that the performance on repetition disfluencies encompasses all repetition-type stutters, including sound, word, and phrase repetitions, as well as revisions.
Kourkounakis et al. [33]:
As opposed to other current models focusing on ASR and language models, our previous work proposed a model relying solely on acoustic and phonetic features, allowing for the classification of several disfluency types without the need for speech recognition methods. This model applied a deep residual network, consisting of 6 residual blocks (18 convolution layers), and two bidirectional long short-term memory layers to classify six different types of stutters. This work achieved an average miss rate of 10.03% on the UCLASS dataset, and sustained strong accuracy and miss rates across all stutter types, prominently word repetitions and revisions.
Zayats et al. [28]:
A recurrent network was used to classify repetition disfluencies within the Switchboard corpus. It consists of a BLSTM followed by an ILP post-processing method. The input embedding to this network consisted of a vector containing each word's index and part of speech, as well as 18 other disfluency-based features. The method achieved a miss rate of 19.4% across all repetition disfluencies.
Villegas et al. [29]:
This model was used as a reference to compare the effectiveness of respiratory signals towards stutter classification. These features included the means, standard deviations, and distances of respiratory volume, respiratory flow, and heart rate. Sixty-eight participants were used to generate the data for their experiments. The best performing model in this work was an MLP with 40 hidden layers, resulting in an 82.6% average classification accuracy between block and non-block type stutters.
Dash et al. [16]:
This method passed the maximum amplitude of the provided audio sample through a neural network to generate a custom threshold for each sample, trained on a set of 60 speech samples. This amplitude threshold was used to remove any perceived prolongations and interjections. The audio was then passed through an STT tool, which allowed for the removal of any repeated words, phrases, or characters, achieving an overall stutter classification accuracy of 86% on a test set of 50 speech segments.

Note that the latter three works only provide results on a group of disfluency types [28], a single disfluency type [29], or overall stutter classification [16]. As such, only their average disfluency classification results could be compared. Moreover, these works ([31], [28], [29], and [16]) have not used the UCLASS dataset, therefore the comparisons should be taken cautiously.

Next, we also compare the performance of our solution to a number of other models for benchmarking purposes. These models were selected due to their popularity for time-series learning, and the hyperparameters of these models are all tuned to obtain the best possible results given their architectures. These benchmarks are as follows: (i) VGG-16 (Benchmark 1): VGG-16 [62] consists of 16 convolutional or fully connected layers, comprised of groups of two or three convolution layers with ReLU activation, with each grouping being followed by a max pooling layer. The model concludes with three fully connected layers and a final softmax function. (ii) VGG-19 (Benchmark 2): This network is very similar to its VGG-16 counterpart, with the only difference being the addition of three more convolution layers spread throughout the model. (iii) ResNet-18 (Benchmark 3): ResNet-18 was chosen as a benchmark, which contains 18 layers: eight consecutive residual blocks each containing two convolutional layers with ReLU activation, followed by an average pooling layer and a final fully connected layer.

V. RESULTS AND ANALYSIS
A. Validation
In order to rigorously test FluentNet on the UCLASS dataset, a leave-one-subject-out (LOSO) cross validation method was used. The results of models tested on this dataset are reported as the average over 25 experiments, each consisting of audio samples from 24 of the participants as training data, and a single unique participant's audio as the test set. A 10-fold cross validation method was used on the LibriStutter dataset, with a random 90% subset of the samples from each stutter being used for training, along with 90% of the clean samples chosen randomly. The remaining 10% of both clean and stuttered samples were used for testing. All experiments were trained over 30 epochs, with minimal change in loss seen in further epochs.

The two metrics used to measure the performance of the aforementioned experiments were miss rate and accuracy. Miss rate (1 - recall) is used to determine the proportion of disfluencies which were incorrectly classified by the model. To balance out any bias this metric may hold, accuracy was used as a second performance metric.
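A minimal sketch of the LOSO protocol and the two metrics follows; build_fluentnet, spectrograms, labels, and subject_ids are hypothetical placeholders for the model constructor and the prepared UCLASS data, not names from the released code.

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# A minimal sketch of the leave-one-subject-out protocol and the two metrics
# described above. All inputs are assumed to be NumPy arrays.
def loso_evaluate(build_fluentnet, spectrograms, labels, subject_ids, epochs=30):
    miss_rates, accuracies = [], []
    for subject in np.unique(subject_ids):
        test_mask = subject_ids == subject            # one held-out participant
        model = build_fluentnet()
        model.fit(spectrograms[~test_mask], labels[~test_mask], epochs=epochs)
        preds = (model.predict(spectrograms[test_mask]) > 0.5).astype(int).ravel()
        y_true = labels[test_mask]
        miss_rates.append(1.0 - recall_score(y_true, preds))   # miss rate = 1 - recall
        accuracies.append(accuracy_score(y_true, preds))
    return float(np.mean(miss_rates)), float(np.mean(accuracies))
```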
TABLE IV: Percent miss rates (MR ↓) and accuracy (Acc. ↑) of the six stutter types: sound (S), word (W), and phrase (PH) repetitions, interjections (I), prolongations (PR), and revisions (R). LibriStutter rows cover S, W, PH, I, and PR.

Paper | Method | Dataset | Per-type results (MR/Acc for S, W, PH, I, PR, R)
Alharbi et al. [17] | Word Lattice | UCLASS | 60, –, –, –, 25, –
Kourkounakis et al. [33] | ResNet+BLSTM | UCLASS | 18.10/84.10, 3.20
Ours | FluentNet | UCLASS |
Kourkounakis et al. [33] | ResNet+BLSTM | LibriStutter | 19.23/79.80, 5.17/92.52, 6.12/92.52, 31.49/69.22, 9.80/89.44
Benchmark 1 | VGG-16 | LibriStutter | 20.97/79.33, 6.27/92.74, 8.90/91.94, 36.47/64.05, 10.63/89.10
Benchmark 2 | VGG-19 | LibriStutter | 20.79/79.66, 6.45/93.44, 7.92/92.44, 34.46/66.92, 10.78/89.98
Benchmark 3 | ResNet-18 | LibriStutter | 22.47/78.71, 6.22/92.70, 6.74/93.36, 35.56/64.78, 10.52/90.32
Ours | FluentNet | LibriStutter |
B. Performance and Comparison
The results of our model for the recognition of each stutter type are presented for the UCLASS and LibriStutter datasets in Table IV. FluentNet achieves strong results on all the disfluency types within both datasets, outperforming nearly all of the related work as well as the benchmark models.

As some previous works have been designed to tackle specific disfluency types as opposed to providing a general solution for detecting different types of disfluencies, a few of FluentNet's individual class accuracies do not surpass previous works', namely word repetitions and prolongations. In particular, the work by Alharbi et al. [17] offers perfect word repetition classification, as word lattices can easily identify two words repeated one after the other. Amplitude thresholding also proves to be a successful prolongation classification method. It should be noted that FluentNet does achieve strong results for these disfluency types as well. Notably, our work is one of the only ones that has attempted classification of interjection disfluencies. These disfluent utterances lack the unique phonetic and temporal patterns that, for instance, repetition or prolongation disfluencies contain. Moreover, they may be present as a combination of other disfluency types; for example, an interjection can be prolonged or repeated. For these reasons, interjections remain the hardest category, with a 24.05% and 29.78% miss rate on the UCLASS and LibriStutter datasets, respectively. Nonetheless, FluentNet provides good results, especially given that interjections have been historically avoided.

The task-oriented lattices generated in [17] show strong performance on word repetitions and prolongations, but struggle to detect sound repetitions and revisions. Likewise, as presented in [31], the CT-Transformer yields a comparable interjection classification miss rate to that of FluentNet. However, when the same model is applied to repetition stutters, its performance drops severely, hindering its overall disfluency detection capabilities. The use of an attention-based transformer proves a viable method of classifying interjection disfluencies; however, as the results suggest, the convolutional and recurrent architecture in FluentNet allows effective representations to be learned for interjection disfluencies alongside repetitions and prolongations.

FluentNet's results surpass our previous work's across all disfluency types on the LibriStutter dataset, and all but word repetition accuracy on the UCLASS dataset. The results show a greater margin of improvement between the two models on the LibriStutter dataset as compared to UCLASS. Notably, word repetitions and prolongations show a decrease in miss rate of approximately 20% between FluentNet and [33]. This implies the SE and attention mechanisms assist in better representing the disfluent utterances within the stuttered speech found in the synthetic dataset.

An interesting observation is that LibriStutter proves a more difficult dataset compared to UCLASS, as evident from the lower performance of all the solutions including FluentNet. This is likely due to the fact that, given the large number of controllable parameters for each stutter type, LibriStutter is likely to contain a larger variance of stutter styles and variations, resulting in a more difficult problem space.

Table V presents the overall performance of our model with respect to all disfluency types on the UCLASS and LibriStutter datasets.
The results are compared with other works on the respective datasets, and with the benchmarks which we implemented for comparison purposes. We observe that FluentNet achieves an average miss rate and accuracy of 9.35% and 91.75% on the UCLASS dataset, surpassing the other models and setting a new state-of-the-art. A similar trend can be seen for the LibriStutter dataset, where FluentNet outperforms the previous model along with all the benchmark models.

The BLSTM used in [28] yields successful results towards repetition stutter classification by learning temporal relationships between words; however, it remains impaired by its reliance solely on lexical model inputs. On the other hand, as shown by the results, FluentNet is better able to learn these phonetic details through the spectral and temporal representations of speech.

The work from [16] uses similar classification techniques to [17], but improves upon the thresholding technique with the addition of a neural network. Though it achieves an average accuracy of 86% across the same disfluency types used in this work, FluentNet remains a stronger model given its effective spectral frame-level and temporal embeddings. Nonetheless, the results of that work contain only a single overall accuracy value across all of repetition, interjection, and prolongation disfluency detection, and little is discussed on the origin and makeup of the dataset used.

Of the benchmark models without an RNN component, ResNet performs better than both VGG networks on both datasets, indicating that ResNet-style architectures are able to learn effective spectral representations of speech. This further justifies the use of a ResNet as the backbone of our model.
TABLE V: Average percent miss rates (MR) and accuracy (Acc) of disfluency classification models.

Paper | Dataset | Avg. MR ↓ | Avg. Acc. ↑
Zayats et al. [28] | Switchboard | 19.4 | –
Villegas et al. [29] | Custom | – | 82.6
Dash et al. [16] | Custom | – | 86
Chen et al. [31] | Custom | 38.5 | –
Alharbi et al. [17] | UCLASS | 37 | –
Kourkounakis et al. [33] | UCLASS | 10.03 | 91.15
Benchmark 1 (VGG-16) | UCLASS | 13.81 | 86.62
Benchmark 2 (VGG-19) | UCLASS | 12.21 | 87.92
Benchmark 3 (ResNet-18) | UCLASS | 12.14 | 89.14
FluentNet | UCLASS | 9.35 | 91.75
Kourkounakis et al. [33] | LibriStutter | 14.36 | 85.30
Benchmark 1 (VGG-16) | LibriStutter | 16.65 | 83.43
Benchmark 2 (VGG-19) | LibriStutter | 16.08 | 84.49
Benchmark 3 (ResNet-18) | LibriStutter | 16.30 | 83.97
FluentNet | LibriStutter | |
Fig. 5: ROC curves for each stutter type tested on the UCLASS and LibriStutter datasets: (a) UCLASS; (b) LibriStutter.

To further demonstrate the performance of FluentNet, Receiver Operating Characteristic (ROC) curves were generated for each disfluency class on the UCLASS and LibriStutter datasets, as shown in Figures 5(a) and 5(b), respectively. It can be seen that word repetitions, phrase repetitions, revisions, and prolongations show very strong classification performance on both datasets. Sound repetition and interjection classification fare weakest, with LibriStutter again proving to be the more difficult dataset for FluentNet, as previously observed and discussed.
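As an illustration of how such per-class curves can be produced, the sketch below uses scikit-learn's roc_curve on per-segment class scores. The labels and scores here are random placeholders rather than FluentNet outputs, and the plotting details are assumptions.

# Sketch of per-class ROC curves from class scores (placeholder data).
import numpy as np
from sklearn.metrics import roc_curve, auc
import matplotlib.pyplot as plt

stutter_types = ["interjection", "sound repetition", "word repetition",
                 "phrase repetition", "revision", "prolongation"]

plt.figure()
for name in stutter_types:
    # y_true: 1 if the segment contains this stutter type; y_score: model score.
    y_true = np.random.randint(0, 2, size=200)                          # placeholder labels
    y_score = np.clip(y_true * 0.6 + np.random.rand(200) * 0.6, 0, 1)   # placeholder scores
    fpr, tpr, _ = roc_curve(y_true, y_score)
    plt.plot(fpr, tpr, label=f"{name} (AUC = {auc(fpr, tpr):.2f})")

plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.legend()
plt.savefig("roc_uclass.png")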
C. Parameters
Multiple parameters have been tuned in order to maximize the accuracy of FluentNet and the baseline experiments on both datasets. These include convolution window sizes, numbers of training epochs, and learning rates, among others. Each was individually tested in order to find the optimal value for the given model. Note that all of FluentNet's hyper-parameters remain the same across all disfluency types.
Thorough experiments were performed to obtain the optimal architecture of FluentNet (see the illustrative configuration sketch at the end of this subsection). For the SE-ResNet component, we tested different numbers of convolution blocks, ranging from 3 to 12, with each block consisting of 3 convolutional layers. Eight blocks were found to be approximately the optimal depth for training the model on the UCLASS dataset. Similarly, we experimented with different numbers of BLSTM layers, ranging from 0 to 3; the use of 2 layers yielded the best results. Moreover, the use of bidirectional layers proved slightly more effective than unidirectional layers. Lastly, we experimented with a number of different values and strategies for the learning rate and selected the setting that showed the best results.
Figures 6(a) and 6(b) show FluentNet's training accuracy for each stutter type over training epochs on the UCLASS and LibriStutter datasets, respectively. It can be seen that the training accuracy stabilizes after around 20 epochs. Whereas all disfluency types in the UCLASS dataset approach perfect training accuracy, training accuracy plateaus at much lower values for interjections and sound repetitions within the LibriStutter dataset.

Fig. 6: Average training accuracy for FluentNet on the considered stutter types for the UCLASS and LibriStutter datasets: (a) UCLASS; (b) LibriStutter.
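To make the configuration described above concrete, the following is a minimal Keras/TensorFlow sketch of a FluentNet-like model, not the authors' released implementation: eight SE-residual blocks of three 1-D convolutions each, two BLSTM layers, and a simple global soft-attention pooling stage. Input dimensions, filter counts, LSTM widths, and the per-type binary output are our assumptions for illustration.

# Sketch of a FluentNet-like architecture (assumed shapes and sizes).
import tensorflow as tf
from tensorflow.keras import layers

def se_res_block(x, filters, reduction=16, use_se=True):
    """Residual block of three 1-D convolutions, optionally with squeeze-and-excitation."""
    shortcut = x
    y = layers.Conv1D(filters, 3, padding="same", activation="relu")(x)
    y = layers.Conv1D(filters, 3, padding="same", activation="relu")(y)
    y = layers.Conv1D(filters, 3, padding="same")(y)
    if use_se:
        # Squeeze: average over time; excitation: sigmoid gating per channel.
        s = layers.GlobalAveragePooling1D()(y)
        s = layers.Dense(filters // reduction, activation="relu")(s)
        s = layers.Dense(filters, activation="sigmoid")(s)
        y = layers.Multiply()([y, layers.Reshape((1, filters))(s)])
    if shortcut.shape[-1] != filters:
        shortcut = layers.Conv1D(filters, 1, padding="same")(shortcut)
    return layers.Activation("relu")(layers.Add()([y, shortcut]))

def build_fluentnet_sketch(time_steps=300, n_features=128, n_blocks=8,
                           filters=64, use_se=True, use_attention=True):
    inp = layers.Input(shape=(time_steps, n_features))   # spectral frames
    x = inp
    for _ in range(n_blocks):        # eight blocks were found to work well on UCLASS
        x = se_res_block(x, filters, use_se=use_se)
    x = layers.Bidirectional(layers.LSTM(64, return_sequences=True))(x)   # two BLSTM layers
    x = layers.Bidirectional(layers.LSTM(64, return_sequences=True))(x)
    if use_attention:
        # Simple global soft-attention pooling over time steps.
        scores = layers.Dense(1)(x)
        weights = layers.Softmax(axis=1)(scores)
        x = layers.Lambda(lambda t: tf.reduce_sum(t[0] * t[1], axis=1))([x, weights])
    else:
        x = layers.GlobalAveragePooling1D()(x)
    out = layers.Dense(1, activation="sigmoid")(x)        # one binary output per stutter type
    return tf.keras.Model(inp, out)

model = build_fluentnet_sketch()
model.compile(optimizer=tf.keras.optimizers.Adam(),
              loss="binary_crossentropy", metrics=["accuracy"])

The use_se and use_attention flags also make it straightforward to build the ablation variants discussed in the next subsection from a single builder.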
D. Ablation Experiments
To further analyze FluentNet, an ablation study was conducted in order to systematically evaluate how each component contributes to the overall performance. The SE blocks and the attention mechanism were removed, individually and together, in order to analyse the relationship between their absences and how these affect both accuracy and miss rates for each disfluency class. The ablation results for both the UCLASS and LibriStutter datasets are summarized in Table VI. Overall, FluentNet shows higher accuracy and lower miss rates across both datasets and all stutter types compared to the three variants. Although the drop in performance varies across stutter types with the removal of each element, the experiment shows the general advantages of the different components of FluentNet.
The results show that across both datasets, the SE component and the attention mechanism each individually benefit the model for most stutter types. Removing the SE component yields the greatest drop in accuracy and increase in miss rates across nearly all stutter types, making it the single component whose removal has the most negative impact (quantified briefly after Table VI). Removing the global attention mechanism at the final stage of the model also reduces the classification accuracy of FluentNet. With both the SE component and attention removed, the model shows a decline in accuracy and an increase in miss rates across all classes tested. Note that these ablation experiments lead to similar conclusions for both UCLASS and our synthesized dataset (with a slightly higher impact observed on UCLASS vs. LibriStutter), thereby reinforcing the validity of LibriStutter's similarity to real stutters.

TABLE VI: Ablation experiment results, providing miss rates (MR) and accuracy (Acc) for each stutter type and model on the UCLASS and LibriStutter datasets.
(S = sound repetition, W = word repetition, PH = phrase repetition, I = interjection, PR = prolongation, R = revision; values are MR ↓ / Acc ↑ in percent.)

Method                                   Dataset       S             W            PH            I             PR           R            Average
FluentNet                                UCLASS        16.78/84.46   3.43/96.57   3.86/96.26    24.05/81.95   5.34/94.89   2.62/97.38   9.35/91.75
w/o Attention                            UCLASS        16.97/83.13   3.51/96.29   4.23/95.78    24.22/80.78   6.88/92.50   3.25/96.55   9.84/90.84
w/o Squeeze-and-Excitation               UCLASS        17.37/82.01   4.82/95.34   4.81/95.17    24.59/79.84   6.22/93.10   3.14/96.98   10.16/90.41
w/o Squeeze-and-Excitation & Attention   UCLASS        18.18/82.83   4.96/95.04   5.32/93.68    28.89/71.01   8.30/91.72   3.30/96.70   11.49/88.50
FluentNet                                LibriStutter  17.65/82.24   4.11/94.69   5.71/94.32    29.78/70.12   7.88/92.14   –            13.03/86.70
w/o Attention                            LibriStutter  18.91/81.14   4.17/94.01   5.92/93.73    31.26/68.91   8.53/91.24   –            13.76/85.81
w/o Squeeze-and-Excitation               LibriStutter  19.11/80.72   4.95/94.60   5.87/94.15    31.14/70.02   8.80/91.28   –            13.97/86.15
w/o Squeeze-and-Excitation & Attention   LibriStutter  19.23/79.80   5.17/92.52   6.12/92.52    31.49/69.22   9.80/89.44   –            14.36/85.30
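To put the ablation numbers in perspective, the following short calculation, using only the UCLASS average values from Table VI, reports how much the miss rate rises and the accuracy falls when each component is removed.

# Performance cost of removing each component (UCLASS averages from Table VI).
full = {"mr": 9.35, "acc": 91.75}
variants = {"attention removed":        {"mr": 9.84,  "acc": 90.84},
            "SE removed":               {"mr": 10.16, "acc": 90.41},
            "SE and attention removed": {"mr": 11.49, "acc": 88.50}}
for name, v in variants.items():
    print(f"{name}: miss rate +{v['mr'] - full['mr']:.2f} points, "
          f"accuracy -{full['acc'] - v['acc']:.2f} points")
# attention removed:        +0.49 MR, -0.91 Acc
# SE removed:               +0.81 MR, -1.34 Acc
# SE and attention removed: +2.14 MR, -3.25 Acc

Consistent with the discussion above, removing the SE blocks costs more than removing attention alone, and removing both costs the most.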
VI. CONCLUSION
Of the measurable aspects of speech, stuttering continues to be among the most difficult to identify, as the diversity and uniqueness of stutters make them challenging for simple algorithms to model. To this end, we proposed FluentNet, an end-to-end deep neural network designed to accurately classify stuttered speech across six different stutter types: sound, word, and phrase repetitions, as well as revisions, interjections, and prolongations. The model uses a Squeeze-and-Excitation residual network to learn effective spectral frame-level speech representations, followed by recurrent bidirectional long short-term memory layers to learn temporal relationships from stuttered speech. A global attention mechanism is then added to focus on the salient parts of speech in order to accurately detect the targeted disfluencies. Through comprehensive experiments, we demonstrated that FluentNet achieves state-of-the-art results on disfluency classification with respect to other works in the area as well as a number of benchmark models on the public UCLASS dataset. Given the lack of sufficient data to facilitate more in-depth research on disfluency detection, we also developed a synthetic dataset, LibriStutter, based on the public LibriSpeech dataset.
Future work may include improving LibriStutter's realism, which could involve further research into the physical sound generation of stutters and how they translate to audio signals. Whereas this work focuses on the educational and business applications of speech metric analysis, further work may be directed towards medical and therapeutic use cases.
ACKNOWLEDGMENT
The authors would like to thank Prof. Jim Hamilton for his support and valuable discussion throughout this work. We also wish to acknowledge Adrienne Nobbe for her consultation on this project.
REFERENCES
[1] S. H. Ferguson and S. D. Morgan, "Talker differences in clear and conversational speech: Perceived sentence clarity for young adults with normal hearing and older adults with hearing loss," Journal of Speech, Language, and Hearing Research, vol. 61, no. 1, pp. 159–173, 2018.
[2] J. S. Robinson, B. L. Garton, and P. R. Vaughn, "Becoming employable: A look at graduates' and supervisors' perceptions of the skills needed for employability," NACTA Journal.
ACM Conference on Interactive, Mobile, Wearable and Ubiquitous Technologies.
[8] "Transformer-based acoustic modeling for hybrid speech recognition," in ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2020, pp. 6874–6878.
[9] Y. He, T. N. Sainath, R. Prabhavalkar, I. McGraw, R. Alvarez, D. Zhao, D. Rybach, A. Kannan, Y. Wu, R. Pang et al., "Streaming end-to-end speech recognition for mobile devices," in ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2019, pp. 6381–6385.
[10] A. Hajavi and A. Etemad, "A deep neural network for short-segment speaker recognition," INTERSPEECH, 2019.
[11] D. Snyder, D. Garcia-Romero, G. Sell, A. McCree, D. Povey, and S. Khudanpur, "Speaker recognition for multi-speaker conversations using x-vectors," in ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2019, pp. 5796–5800.
[12] R. Prenger, R. Valle, and B. Catanzaro, "WaveGlow: A flow-based generative network for speech synthesis," in ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2019, pp. 3617–3621.
[13] N. Li, S. Liu, Y. Liu, S. Zhao, and M. Liu, "Neural speech synthesis with transformer network," in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, 2019, pp. 6706–6713.
[14] P. Howell and S. Sackin, "Automatic recognition of repetitions and prolongations in stuttered speech," Proceedings of the First World Congress on Fluency Disorders, 1995.
[15] P. Howell, S. Sackin, and K. Glenn, "Development of a two-stage procedure for the automatic recognition of dysfluencies in the speech of children who stutter: II. ANN recognition of repetitions and prolongations with supplied word segment markers," Journal of Speech, Language, and Hearing Research, 1997.
[16] A. Dash, N. Subramani, T. Manjunath, V. Yaragarala, and S. Tripathi, "Speech recognition and correction of a stuttered speech," 2018, pp. 1757–1760.
[17] S. Alharbi, M. Hasan, A. Simons, S. Brumfitt, and P. Green, "A lightly supervised approach to detect stuttering in children's speech," INTERSPEECH, pp. 3433–3437, 2018.
[18] P. Howell, S. Davis, and J. Bartrip, "The University College London archive of stuttered speech (UCLASS)," Journal of Speech, Language, and Hearing Research, vol. 52, pp. 556–569, 2009.
[19] T. Tan, Helbin-Liboh, A. K. Ariff, C. Ting, and S. Salleh, "Application of Malay speech technology in Malay speech therapy assistance tools," International Conference on Intelligent and Advanced Systems, pp. 330–334, 2007.
[20] V. Panayotov, G. Chen, D. Povey, and S. Khudanpur, "LibriSpeech: An ASR corpus based on public domain audio books," in International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2015, pp. 5206–5210.
[21] N. Zeghidour, Q. Xu, V. Liptchinsky, N. Usunier, G. Synnaeve, and R. Collobert, "Fully convolutional speech recognition," arXiv preprint arXiv:1812.06864, 2018.
[22] E. Yairi and N. G. Ambrose, "Early childhood stuttering I: Persistency and recovery rates," Journal of Speech, Language, and Hearing Research, vol. 42, 1999.
[23] F. S. Juste and C. R. F. de Andrade, "Speech disfluency types of fluent and stuttering individuals: Age effects," International Journal of Phoniatrics, Speech Therapy and Communication Pathology.
International Conference on Advanced Computing Technologies, Hyderabad, India, 2008, pp. 514–519.
[26] H. K. M. Ravikumar and R. Rajagopal, "An approach for objective assessment of stuttered speech using MFCC features," Digital Signal Processing Journal, vol. 9, pp. 19–24, 2019.
[27] L. S. Chee, O. C. Ai, and S. Yaacob, "Overview of automatic stuttering recognition system," in Proc. International Conference on Man-Machine Systems, Batu Ferringhi, Penang, Malaysia, 2009, pp. 1–6.
[28] V. Zayats, M. Ostendorf, and H. Hajishirzi, "Disfluency detection using a bidirectional LSTM," INTERSPEECH, pp. 2523–2527, 2016.
[29] B. Villegas, K. M. Flores, K. José Acuña, K. Pacheco-Barrios, and D. Elias, "A novel stuttering disfluency classification system based on respiratory biosignals," 2019, pp. 4660–4663.
[30] J. Santoso, T. Yamada, and S. Makino, "Classification of causes of speech recognition errors using attention-based bidirectional long short-term memory and modulation spectrum," 2019, pp. 302–306.
[31] Q. Chen, M. Chen, B. Li, and W. Wang, "Controllable time-delay transformer for real-time punctuation prediction and disfluency detection," in ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2020, pp. 8069–8073.
[32] K. Georgila, "Using integer linear programming for detecting speech disfluencies," in Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, Companion Volume: Short Papers, 2009, pp. 109–112.
[33] T. Kourkounakis, A. Hajavi, and A. Etemad, "Detecting multiple speech disfluencies using a deep residual network with bidirectional long short-term memory," in ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2020, pp. 6089–6093.
[34] S. Khara, S. Singh, and D. Vir, "A comparative study of the techniques for feature extraction and classification in stuttering," 2018, pp. 887–893.
[35] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778, 2016.
[36] J. Hu, L. Shen, and G. Sun, "Squeeze-and-excitation networks," CoRR, vol. abs/1709.01507, 2017. [Online]. Available: http://arxiv.org/abs/1709.01507
[37] C. Szegedy, S. Ioffe, V. Vanhoucke, and A. Alemi, "Inception-v4, Inception-ResNet and the impact of residual connections on learning," arXiv preprint arXiv:1602.07261, 2016.
[38] A. G. Roy, N. Navab, and C. Wachinger, "Concurrent spatial and channel squeeze & excitation in fully convolutional networks," in International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer, 2018, pp. 421–429.
[39] S. Bell, C. Lawrence Zitnick, K. Bala, and R. Girshick, "Inside-outside net: Detecting objects in context with skip pooling and recurrent neural networks," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 2874–2883.
[40] A. Newell, K. Yang, and J. Deng, "Stacked hourglass networks for human pose estimation," in European Conference on Computer Vision. Springer, 2016, pp. 483–499.
[41] S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.
[42] Y. Ma, H. Peng, and E. Cambria, "Targeted aspect-based sentiment analysis via embedding commonsense knowledge into an attentive LSTM," in AAAI, 2018, pp. 5876–5883.
[43] H. Y. Kim and C. H. Won, "Forecasting the volatility of stock price index: A hybrid model integrating LSTM with multiple GARCH-type models," Expert Systems with Applications, vol. 103, pp. 25–37, 2018.
[44] P. Li, M. Abdel-Aty, and J. Yuan, "Real-time crash risk prediction on arterials based on LSTM-CNN," Accident Analysis & Prevention, vol. 135, p. 105371, 2020.
[45] M. Schuster and K. K. Paliwal, "Bidirectional recurrent neural networks," IEEE Transactions on Signal Processing, vol. 45, no. 11, pp. 2673–2681, 1997.
[46] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, "Dropout: A simple way to prevent neural networks from overfitting," The Journal of Machine Learning Research, vol. 15, no. 1, pp. 1929–1958, 2014.
[47] D. Bahdanau, K. Cho, and Y. Bengio, "Neural machine translation by jointly learning to align and translate," arXiv preprint arXiv:1409.0473, 2014.
[48] A. Hajavi and A. Etemad, "Knowing what to listen to: Early attention for deep speech representation learning," arXiv preprint arXiv:2009.01822, 2020.
[49] S. Mirsamadi, E. Barsoum, and C. Zhang, "Automatic speech emotion recognition using recurrent neural networks with local attention," 2017, pp. 2227–2231.
[50] T. Sun and A. A. Wu, "Sparse autoencoder with attention mechanism for speech emotion recognition," 2019, pp. 146–149.
[51] M.-T. Luong, H. Pham, and C. D. Manning, "Effective approaches to attention-based neural machine translation," arXiv preprint arXiv:1508.04025, 2015.
[52] F. Chollet et al. (2015) Keras. [Online]. Available: https://keras.io
[53] M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard et al., "TensorFlow: A system for large-scale machine learning," in {USENIX} Symposium on Operating Systems Design and Implementation ({OSDI}), 2016, pp. 265–283.
[54] B. McFee, C. Raffel, D. Liang, D. P. Ellis, M. McVicar, E. Battenberg, and O. Nieto, "librosa: Audio and music signal analysis in Python," in Proceedings of the 14th Python in Science Conference, vol. 8, 2015.
[55] J. S. Garofolo, L. F. Lamel, W. M. Fisher, J. G. Fiscus, and D. S. Pallett, "DARPA TIMIT acoustic-phonetic continuous speech corpus CD-ROM. NIST speech disc 1-1.1," STIN, vol. 93, p. 27403, 1993.
[56] A. Nagrani, J. S. Chung, and A. Zisserman, "VoxCeleb: A large-scale speaker identification dataset," arXiv preprint arXiv:1706.08612, 2017.
[57] L. S. Chee, O. C. Ai, M. Hariharan, and S. Yaacob, "MFCC based recognition of repetitions and prolongations in stuttered speech using k-NN and LDA," 2009, pp. 146–149.
[58] O. C. Ai, M. Hariharan, S. Yaacob, and L. S. Chee, "Classification of speech dysfluencies with MFCC and LPCC features," Expert Systems with Applications, vol. 39, no. 2, pp. 2157–2165, 2012.
[59] (2020) Google Cloud Speech-to-Text. [Online]. Available: https://cloud.google.com/speech-to-text/
[60] J. Van Borsel, E. Geirnaert, and R. Van Coster, "Another case of word-final disfluencies," Folia Phoniatrica et Logopaedica, vol. 57, no. 3, pp. 148–162, 2005.
[61] J. Han, M. Kamber, and J. Pei, Data Mining, 3rd ed. Elsevier Inc., 2012.
[62] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," arXiv preprint arXiv:1409.1556.