Detection of Lexical Stress Errors in Non-Native (L2) English with Data Augmentation and Attention
Daniel Korzekwa, Roberto Barra-Chicote, Szymon Zaporowski, Grzegorz Beringer, Jaime Lorenzo-Trueba, Alicja Serafinowicz, Jasha Droppo, Thomas Drugman, Bozena Kostek
Amazon TTS Research; Gdansk University of Technology, Faculty of ETI, Poland
ABSTRACT
This paper describes two novel complementary techniques that improve the detection of lexical stress errors in non-native (L2) English speech: attention-based feature extraction and data augmentation based on Neural Text-To-Speech (TTS). In a classical approach, audio features are usually extracted from fixed regions of speech such as the syllable nucleus. We propose an attention-based deep learning model that automatically derives the optimal syllable-level representation from frame-level and phoneme-level audio features. Training this model is challenging because of the limited amount of incorrect stress patterns. To solve this problem, we propose to augment the training set with incorrectly stressed words generated with Neural TTS. Combining both techniques achieves 94.8% precision and 49.2% recall for the detection of incorrectly stressed words in the L2 English speech of Slavic speakers.
Index Terms — lexical stress, language learning, data augmentation, text-to-speech, attention, speech assessment
1. INTRODUCTION
Computer Assisted Pronunciation Training (CAPT) usually focuses on practicing the pronunciation of phonemes, although there is evidence that practicing lexical stress improves the speech intelligibility of non-native (L2) English speakers [1]. Lexical stress is a syllable-level phonological feature. It is part of the phonological rules that define how words should be spoken in a given language. Stressed syllables are usually longer, louder, and expressed with a higher pitch than their unstressed counterparts [2]. Lexical stress is inter-connected with phonemic representation. For example, placing lexical stress on a different syllable of a word may lead to different phonemic realizations, known as 'vowel reduction' [3].

The focal point of our work is the detection of words with incorrect stress patterns. Training data with human speech is usually highly imbalanced, with few training examples of incorrectly stressed words, which makes training machine learning models for this task challenging. We address this problem by augmenting the training set with synthetic speech generated with Neural Text-To-Speech (TTS) [4]. Neural TTS allows us to generate words with both correct and incorrect stress patterns.

Most existing approaches are based on carefully designed features that are extracted from fixed regions of the speech signal, such as the syllable nucleus [5, 6, 7]. We introduce an attention mechanism [8] to automatically learn an optimal syllable-level representation. The representation is derived from frame-level (F0, intensity) and phoneme-level (duration) audio features and the corresponding phonetic representation of a word. We do not indicate precisely the regions of the audio signal that are important for the detection of lexical stress errors; the attention mechanism does it automatically.

To the best of our knowledge, this paper is the first attempt, for the task of lexical stress error detection, to: i) augment the training data with Neural TTS, and ii) use attention mechanisms to automatically extract syllable-level features.

The paper is structured as follows. Section 2 reviews the related work. Section 3 describes the proposed model. Section 4 reviews the human and synthetic speech corpora. In Section 5, we present our experiments, and Section 6 concludes the paper.
2. RELATED WORK
The existing work focuses on the classification of lexical stress using Neural Networks [9, 6], Support Vector Machines [7, 10], and Fisher's linear discriminant [11]. There are two popular variants: a) discriminating syllables between primary stress/no stress [5], and b) classifying between primary stress/secondary stress/no stress [12, 9]. Accuracy is the most commonly used performance metric, and it indicates the ratio of correctly classified stress patterns on a syllable [12] or word level [7]. On the contrary, following Ferrer et al. [5], we analyze precision and recall metrics because we aim to detect lexical stress errors and not just classify them.

Existing approaches for the classification and detection of lexical stress errors are based on carefully designed features. They start with aligning a speech signal with its phonetic transcription, performed via forced alignment [6, 7]. Alternatively, Automatic Speech Recognition (ASR) can provide both the phonetic transcription and its alignment with the speech signal [12]. Then, prosodic [7] and cepstral features (MFCC, Mel-Spectrogram) [5, 6] are extracted on the syllable [6] or syllable nucleus [5, 7] level.

Fig. 1. Attention-based Deep Learning model for the detection of lexical stress errors.

Shahin et al. [6] computed the features of neighboring vowels, and Li et al. [12] included the features for two preceding and two following syllables in the model. The features are often preprocessed and normalized, firstly to avoid potential confounding variables [5], and secondly to achieve better model generalization by normalizing the duration and pitch on a word level [5, 11]. Li et al. [9] added canonical lexical stress to the input features, which improves the accuracy of the model.

In our approach, we use attention mechanisms to automatically derive the regions of the audio signal that are important for the detection of lexical stress errors. We also use data augmentation through the generation of artificial data with Neural TTS.
3. PROPOSED MODEL
The proposed model consists of three subsystems: a Feature Extractor, an Attention-based Classification Model, and a Lexical Stress Error Detector. It is illustrated in Figure 1.
The Feature Extractor extracts prosodic features and phonemes from the speech signal a and the forced-aligned text t. To obtain the forced alignment, we used the Montreal Forced Aligner toolkit [13] along with an acoustic model pretrained on the LibriSpeech ASR corpus [14]. The prosodic features c = f(a) are formed by: F0, intensity [dB SPL], and phoneme-level durations. The F0 and intensity features are computed at the frame level (time step: 10 ms, window size: 40 ms). The F0 contour is linearly interpolated in unvoiced regions. These raw features are further transformed by the attention-based model into the syllable-level representation.

The Attention-based Classification Model maps frame-level and phoneme-level features to a syllable-level representation. Then, it produces a lexical stress pattern s, modeled as a sequence of Bernoulli random variables s = {s_1, ..., s_K} (stressed/unstressed) over the K syllables of a multi-syllable word, conditioned on the audio a and text t representations. Let us define it as a conditional probability distribution s ~ p(s | a, t, θ), where θ are the parameters of the model.

To extract syllable-level features, we use two dot-product attentions operating on the frame and phoneme levels. The dot-product attention is presented in Equation 1 and follows the notation proposed by Vaswani et al. [8]. It is based on three inputs: Query Q, Keys K, and Values V, where d_k is the dimensionality of K.

    Attention(Q, K, V) = softmax(QK^T / √d_k) V    (1)

The attention inputs are represented as follows. Query: syllable positional embeddings defined by one-hot syllable index encodings. Keys: a sequence of sub-phonemes, each represented by a set of features: phoneme_id, syllable_index, is_vowel, left_or_right_sub_phoneme. All features are one-hot encoded and processed with a Gated Recurrent Unit (GRU) layer [15] (units: 4, dropout: 0.24); the encoded sub-phoneme sequences are then passed through linear dense layers. Values: the F0/intensity and duration features for the frame-level and phoneme-level attentions, respectively.

To model relative prominence, we introduce a differential bi-directional layer that computes the ratios of syllable-level acoustic features for each syllable and its two neighbors (Figure 1). The output of the differential layer is further processed by three dense layers (units: 4, activation: tanh, dropout: 0.24), followed by a linear dense layer (units: 2, dropout: 0.24) that produces a two-dimensional output for each syllable. It is then squeezed by a softmax function to generate lexical stress probabilities.
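To make the pooling step concrete, below is a minimal NumPy sketch of the dot-product attention in Equation 1, pooling frame-level values into one vector per syllable. It is an illustration only, not the authors' MxNet implementation; all dimensions are invented for the example.

    import numpy as np

    def softmax(x, axis=-1):
        # Numerically stable softmax along the given axis.
        e = np.exp(x - x.max(axis=axis, keepdims=True))
        return e / e.sum(axis=axis, keepdims=True)

    def dot_product_attention(Q, K, V):
        # Equation 1: Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V
        d_k = K.shape[-1]
        scores = Q @ K.T / np.sqrt(d_k)  # (num_queries, num_keys)
        return softmax(scores) @ V

    # Hypothetical sizes: a 3-syllable word, 120 frames (10 ms each),
    # key dimension 8, and 2 frame-level features (F0, intensity).
    rng = np.random.default_rng(0)
    Q = rng.normal(size=(3, 8))    # syllable positional embeddings (queries)
    K = rng.normal(size=(120, 8))  # encoded frame/sub-phoneme keys
    V = rng.normal(size=(120, 2))  # frame-level F0 and intensity values

    syllable_repr = dot_product_attention(Q, K, V)
    print(syllable_repr.shape)  # (3, 2): one pooled vector per syllable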
Training of the Classification Model. We train the model on a set of N triplets that contains 1) human-recorded words and 2) synthetic words generated using Neural TTS. A single triplet is represented by {s_n, a_n, t_n}, where n = 1..N is the index of a training example.

The concept of data augmentation can be explained using the framework of Bayesian inference. Consider three random variables: the lexical stress s_n, the audio signal a_n, and the text t_n. All variables are observed for the training examples of human speech. However, for the synthetic speech, we only observe the lexical stress and text variables; the audio signal is unobserved (hidden) because we have to generate it.

To train this model, we derive a negative log-likelihood loss over the joint probability distribution of the lexical stress s and audio a random variables, as depicted in Equation 2. The loss is further approximated with a variational lower bound [16], as presented in Equation 3 (we omit θ for brevity).

    L(θ) = -Σ_n log ∫ p(s_n, a_n | t_n, θ) da_n    (2)

    log ∫ p(s_n, a_n | t_n) da_n ≈ E_{a_n ~ p(a_n | t_n, s_n)} [log p(s_n | a_n, t_n)]    (3)

For the training examples of synthetic speech, the conditional probability distribution over the audio signal a_n ~ p(a_n | s_n, t_n) is estimated with Neural TTS, and for human-recorded words, it is given explicitly.

The model was implemented in MxNet [17], trained with the Stochastic Gradient Descent optimizer (learning rate: 0.1, batch size: 20), and tuned with Bayesian optimization. Training data were split into buckets based on the number of frames in the audio signal. A single bucket contains words with the same number of syllables, with zero-padded acoustic and sub-phoneme sequences.

The Lexical Stress Error Detector reports a lexical stress error if the expected (canonical) and estimated lexical stress for a given syllable do not match and the corresponding probability is higher than a given threshold.
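The detector's decision rule can be sketched in a few lines. The threshold value and array layout below are illustrative assumptions; the paper does not publish its implementation.

    import numpy as np

    def detect_stress_errors(canonical, p_stressed, threshold=0.9):
        # canonical:  expected (canonical) stress per syllable, 0/1.
        # p_stressed: model probability that each syllable is stressed,
        #             taken from the softmax output of the classifier.
        # threshold:  minimum confidence to report an error (illustrative).
        predicted = (p_stressed >= 0.5).astype(int)
        confidence = np.where(predicted == 1, p_stressed, 1.0 - p_stressed)
        # Report an error only when the estimated stress disagrees with
        # the canonical pattern AND the model is sufficiently confident.
        return (predicted != canonical) & (confidence > threshold)

    # 'remind': canonical stress on the 2nd syllable, but the model hears
    # stress on the 1st syllable -> both syllables are flagged.
    print(detect_stress_errors(np.array([0, 1]), np.array([0.97, 0.08])))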
4. SPEECH CORPUS
Our speech corpus consists of human and synthetic speech. The data were split into training and testing sets with disjoint speakers ascribed to each set. The human speech contains L1 and L2 speakers of English. Synthetic data were generated with Neural TTS and are included only in the training set. All audio files were downsampled to a 16 kHz sampling rate. The data are summarized in Table 1, and we provide more details in the following subsections.
Table 1. Train and test set details.

Data set            Speakers (L2)   Words (unique)   Stress errors
Train set (human)   473 (10)        8223 (1528)      425
Train set (TTS)     1 (0)           3937 (1983)      2005
Test set (human)    176 (21)        2108 (378)       189
Due to the limited availability of L2 corpora, we recorded our own L2-English corpus of Slavic speakers. It also allows us to evaluate the model during interactive English learning sessions with our students. The corpus contains speech from 25 speakers (23 Polish, 1 Ukrainian, and 1 Lithuanian): 7 females and 18 males, all between 24 and 40 years old. All speakers read a list of two hundred words. One hundred words were prepared by a professional English teacher, including words frequently mispronounced by Slavic students. The second half consists of the most common words obtained from Google's Trillion Word Corpus [18] based on n-gram frequency analysis. We excluded abbreviations and one-syllable words.

Additionally, L1 and L2 English speech was collected from publicly available speech data sets, including TIMIT [19], Arctic [20], L2-Arctic [21], and Porzuczek [22].
Complementary to the human recordings, synthetic speech was generated with the Neural TTS by Latorre et al. [4]. The Neural TTS consists of two modules. The context-generation module is an attention-based encoder-decoder neural network that generates a mel-spectrogram from a sequence of phonemes. Then, a Neural Vocoder converts it to a speech signal. The Neural Vocoder is a neural network with an architecture similar to the work by [23]. The Neural TTS was trained using the speech of a professional American voice talent. To generate words with different lexical stress patterns, we modify the lexical stress markers associated with the vowels in the phonemic transcription of a word. For example, with the input of /r iy1 m ay0 n d/, we can place lexical stress on the first syllable of the word 'remind'. 1980 popular English words were synthesized with correct and incorrect stress patterns.
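As an illustration of this manipulation, the sketch below moves the primary stress digit between vowels of an ARPAbet-like transcription. The function is hypothetical and ignores secondary stress, which the real TTS front end would also have to handle.

    def set_stress(phonemes, stressed_syllable):
        # phonemes: ARPAbet-like phones; vowels carry a stress digit,
        #           e.g. ['r', 'iy0', 'm', 'ay1', 'n', 'd'] for 'remind'.
        # stressed_syllable: 0-based index of the syllable to stress.
        out, vowel_idx = [], 0
        for ph in phonemes:
            if ph[-1].isdigit():  # only vowels carry stress digits
                digit = '1' if vowel_idx == stressed_syllable else '0'
                out.append(ph[:-1] + digit)
                vowel_idx += 1
            else:
                out.append(ph)
        return out

    # Canonical 'remind' stresses the 2nd syllable; shifting the marker to
    # the 1st syllable yields the paper's example /r iy1 m ay0 n d/.
    print(set_stress(['r', 'iy0', 'm', 'ay1', 'n', 'd'], 0))
    # ['r', 'iy1', 'm', 'ay0', 'n', 'd']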
The L1 corpora were segmented into words and annotated automatically using a proprietary Amazon American English Lexicon, taking into account the syntactic context of the word. Neural TTS speech and the speech of L2 speakers were annotated by 5 American English linguists into 'primary' and 'no stress' categories, keeping the words for which a minimum of 4 out of 5 linguists agreed on the stress pattern. Annotators were not able to distinguish between primary and secondary lexical stress. 81.5% of the synthesized words matched the intended stress patterns with a minimum 4-annotator agreement, which shows that Neural TTS can be used to generate incorrectly stressed speech.
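A sketch of the agreement filter is shown below; the data layout is a hypothetical simplification of the annotation records, not the paper's actual format.

    from collections import Counter

    def filter_by_agreement(annotations, min_agree=4):
        # annotations: word ID -> list of the 5 annotators' labels, each a
        # tuple of per-syllable stress tags (1 = primary, 0 = no stress).
        # Keep a word only if at least min_agree annotators gave the same
        # label; return the majority label for the kept words.
        kept = {}
        for word_id, labels in annotations.items():
            label, votes = Counter(labels).most_common(1)[0]
            if votes >= min_agree:
                kept[word_id] = label
        return kept

    annotations = {
        'remind_spk03': [(0, 1)] * 4 + [(1, 0)],       # 4/5 agree -> kept
        'machine_spk11': [(0, 1)] * 3 + [(1, 0)] * 2,  # only 3/5 -> dropped
    }
    print(sorted(filter_by_agreement(annotations)))  # ['remind_spk03']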
5. EXPERIMENTS
The proposed model (Att_TTS) from Section 3 is compared to three baseline models that are designed to measure the impact of the Neural TTS data augmentation and the attention mechanism. To compare these models, we plotted their precision-recall curves and computed the corresponding area under the curve (AUC).

The Att_NoTTS model has the same architecture as Att_TTS, but the synthetic speech is excluded from the training set. The NoAtt_TTS model uses the same training set as Att_TTS, but it has no attention mechanism; instead, as a syllable-level representation, it uses the mean values of the acoustic features for the corresponding syllable nucleus. The NoAtt_NoTTS model has no attention and does not use Neural TTS data augmentation.

As the state-of-the-art baseline, we use the work by Ferrer et al. [5]. However, a direct comparison is not possible. In their test corpus, 46.4% (191 out of 411) of the words were incorrectly stressed, far more than the 9.4% (189 out of 2109) in our experiment. The fewer lexical stress errors users make, the more challenging they are to detect. They also used a proprietary corpus of L2 English from Japanese speakers. Due to the lack of an available benchmark and standard speech corpora for the task of lexical stress assessment, we could not make a fairer comparison with the state of the art.
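The evaluation protocol can be reproduced with standard tooling. Below is a sketch using scikit-learn (our assumed choice; the paper does not name its evaluation code) on synthetic word-level scores invented for the example.

    import numpy as np
    from sklearn.metrics import precision_recall_curve, auc

    # Hypothetical per-word scores: probability that a word is incorrectly
    # stressed, with binary ground truth (1 = stress error).
    rng = np.random.default_rng(1)
    y_true = rng.integers(0, 2, size=200)
    y_score = np.clip(0.6 * y_true + rng.normal(0.3, 0.25, size=200), 0, 1)

    precision, recall, thresholds = precision_recall_curve(y_true, y_score)
    print(f"PR-AUC: {auc(recall, precision):.3f}")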
First, we compare the Att_NoTTS and NoAtt_NoTTS models. Using the attention mechanism for automatic extraction of syllable-level features significantly improves the detection of lexical stress errors, as illustrated by the precision-recall curves and AUC metric in Figure 2. To be comparable with the study by Ferrer et al., we fix recall at around 50% and compare the models using precision, as shown in Table 2.
Table 2. Precision and recall [%, 95% Confidence Interval] of detecting lexical stress errors, at around 50% recall.

Model         Precision             Recall
Att_TTS       94.8  (89.18-98.03)   49.2  (42.13-56.3)
Att_NoTTS     87.85 (80.67-93.02)   49.74 (42.66-56.82)
NoAtt_TTS     44.39 (37.85-51.09)   50.26 (43.18-57.34)
NoAtt_NoTTS   48.98 (42.04-55.95)   50.79 (43.70-57.86)
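Fixing the operating point at a target recall can be done by scanning the precision-recall curve, as in the sketch below (a hypothetical helper, again using scikit-learn, with invented scores).

    import numpy as np
    from sklearn.metrics import precision_recall_curve

    def precision_at_recall(y_true, y_score, target_recall=0.5):
        # Pick the threshold whose recall is closest to the target and
        # report the precision there, mirroring Table 2's comparison.
        precision, recall, _ = precision_recall_curve(y_true, y_score)
        i = int(np.argmin(np.abs(recall - target_recall)))
        return precision[i], recall[i]

    # Hypothetical scores for 8 test words (1 = stress error).
    y_true = np.array([0, 0, 1, 0, 1, 0, 1, 0])
    y_score = np.array([0.1, 0.3, 0.9, 0.2, 0.4, 0.15, 0.8, 0.6])
    p, r = precision_at_recall(y_true, y_score)
    print(f"precision={p:.2f} at recall={r:.2f}")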
Fig. 2. Precision-recall curves for evaluated systems.

The attention-based Att_NoTTS model can be further improved. Augmenting the training set with incorrectly stressed words (Att_TTS) boosts precision from 87.85% to 94.8%, at a recall level of 50%. Data augmentation helps because it increases the number of words with an incorrect stress pattern in the training set. It prevents the model from exploiting the strong correlation between phonemes and lexical stress in correctly stressed words. Using data augmentation in the simpler no-attention model (NoAtt_TTS) does not help, because NoAtt_TTS uses only prosodic features for fixed regions of speech, so this model cannot overfit to the phonetic input.

Ferrer et al. [5] reported performance similar to our Att_TTS model, with a precision of 95% and a recall of 48.3% on the L2 English speech of Japanese speakers. However, in their testing data, the proportion of incorrectly stressed words is much larger, which makes it easier to detect lexical stress errors.
6. CONCLUSION AND FUTURE WORK
Using an attention-based neural network for the automatic extraction of syllable-level features significantly improves the detection of lexical stress errors in L2 English speech. However, this model has a tendency to classify lexical stress based on highly correlated phonemes. We can counteract this effect by augmenting the training set with incorrectly stressed words generated with Neural TTS. It boosts the performance of the attention-based model by 14.8% in the AUC metric and by 7.9% in precision, while maintaining recall at a level close to 50%. Data augmentation, however, does not help when applied to a simpler model without the attention mechanism.

We found that the current word-level model is not able to correctly classify lexical stress when two words are linked [24]. For example, the two neighboring phonemes /er/ in the text 'her arrange' /hh er - er ey n jh/ are pronounced as a single phoneme. To address this problem, we plan to move away from the assessment of isolated words and extend the current model to detect lexical stress errors at the sentence level.

7. REFERENCES
[1] John Field, "Intelligibility and the listener: The role of lexical stress," TESOL Quarterly, vol. 39, no. 3, pp. 399–423, 2005.
[2] Ye-Jee Jung, Seok-Chae Rhee, et al., "Acoustic analysis of English lexical stress produced by Korean, Japanese and Taiwanese-Chinese speakers," Phonetics and Speech Sciences, vol. 10, no. 1, pp. 15–22, 2018.
[3] Dick R. van Bergem, "Acoustic and lexical vowel reduction," in Phonetics and Phonology of Speaking Styles, 1991.
[4] Javier Latorre et al., "Effect of data reduction on sequence-to-sequence neural TTS," in ICASSP 2019 - 2019 IEEE Intl. Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2019, pp. 7075–7079.
[5] Luciana Ferrer, Harry Bratt, Colleen Richey, Horacio Franco, Victor Abrash, and Kristin Precoda, "Classification of lexical stress using spectral and prosodic features for computer-assisted language learning systems," Speech Communication, vol. 69, pp. 31–45, 2015.
[6] Mostafa Ali Shahin, Julien Epps, and Beena Ahmed, "Automatic classification of lexical stress in English and Arabic languages using deep learning," in INTERSPEECH, 2016, pp. 175–179.
[7] Jin-Yu Chen and Lan Wang, "Automatic lexical stress detection for Chinese learners' of English," IEEE, 2010, pp. 407–411.
[8] Ashish Vaswani et al., "Attention is all you need," in Advances in Neural Information Processing Systems, 2017, pp. 5998–6008.
[9] Kun Li, Shaoguang Mao, Xu Li, Zhiyong Wu, and Helen Meng, "Automatic lexical stress and pitch accent detection for L2 English speech using multi-distribution deep neural networks," Speech Communication, vol. 96, pp. 28–36, 2018.
[10] Junhong Zhao, Hua Yuan, Jia Liu, and S. Xia, "Automatic lexical stress detection using acoustic features for computer assisted language learning," Proc. APSIPA ASC, pp. 247–251, 2011.
[11] Nan Chen and Qianhua He, "Using nonlinear features in automatic English lexical stress detection," IEEE, 2007, pp. 328–332.
[12] Kun Li, Xiaojun Qian, Shiyin Kang, and Helen Meng, "Lexical stress detection for L2 English speech using deep belief networks," in Interspeech, 2013, pp. 1811–1815.
[13] Michael McAuliffe, Michaela Socolof, Sarah Mihuc, Michael Wagner, and Morgan Sonderegger, "Montreal Forced Aligner: Trainable text-speech alignment using Kaldi," in Interspeech, 2017, vol. 2017, pp. 498–502.
[14] Vassil Panayotov, Guoguo Chen, Daniel Povey, and Sanjeev Khudanpur, "Librispeech: An ASR corpus based on public domain audio books," IEEE, 2015, pp. 5206–5210.
[15] Kyunghyun Cho et al., "Learning phrase representations using RNN encoder-decoder for statistical machine translation," arXiv preprint arXiv:1406.1078, 2014.
[16] Michael I. Jordan, Zoubin Ghahramani, Tommi S. Jaakkola, and Lawrence K. Saul, "An introduction to variational methods for graphical models," Machine Learning, vol. 37, no. 2, pp. 183–233, 1999.
[17] Tianqi Chen et al., "MXNet: A flexible and efficient machine learning library for heterogeneous distributed systems," arXiv preprint arXiv:1512.01274, 2015.
[18] Jean-Baptiste Michel et al., "Quantitative analysis of culture using millions of digitized books," Science, vol. 331, no. 6014, pp. 176–182, 2011.
[19] John S. Garofolo, Lori F. Lamel, William M. Fisher, Jonathan G. Fiscus, and David S. Pallett, "DARPA TIMIT acoustic-phonetic continuous speech corpus CD-ROM. NIST speech disc 1-1.1," STIN, vol. 93, p. 27403, 1993.
[20] John Kominek and Alan W. Black, "The CMU Arctic speech databases," in Fifth ISCA Workshop on Speech Synthesis, 2004.
[21] Guanlong Zhao et al., "L2-ARCTIC: A non-native English speech corpus," Perception Sensing Instrumentation Lab, 2018.
[22] Andrzej Porzuczek and Arkadiusz Rojczyk, "English word stress in Polish learners' speech production and metacompetence," Research in Language, vol. 15, no. 4, pp. 313–323, 2017.
[23] Aaron Oord, Yazhe Li, Igor Babuschkin, Karen Simonyan, Oriol Vinyals, Koray Kavukcuoglu, George Driessche, Edward Lockhart, Luis Cobo, Florian Stimberg, et al., "Parallel WaveNet: Fast high-fidelity speech synthesis," in International Conference on Machine Learning. PMLR, 2018, pp. 3918–3926.
[24] Adolf E. Hieke, "Linking as a marker of fluent speech," Language and Speech, 1984.