Context-Aware Prosody Correction for Text-Based Speech Editing
Max Morrison¹*, Lucas Rencker²*, Zeyu Jin³, Nicholas J. Bryan³, Juan-Pablo Caceres³, Bryan Pardo¹
¹ Northwestern University, Evanston, IL, USA
² University of Surrey, Guildford, UK
³ Adobe Research, San Francisco, CA, USA
ABSTRACT
Text-based speech editors expedite the process of editing speech recordings by permitting editing via intuitive cut, copy, and paste operations on a speech transcript. A major drawback of current systems, however, is that edited recordings often sound unnatural because of prosody mismatches around edited regions. In our work, we propose a new context-aware method for more natural-sounding text-based editing of speech. To do so, we 1) use a series of neural networks to generate salient prosody features that are dependent on the prosody of speech surrounding the edit and amenable to fine-grained user control, 2) use the generated features to control a standard pitch-shift and time-stretch method, and 3) apply a denoising neural network to remove artifacts induced by the signal manipulation to yield a high-fidelity result. We evaluate our approach using a subjective listening test, provide a detailed comparative analysis, and conclude with several interesting insights.
Index Terms — speech, prosody generation, pitch-shifting, time-stretching, deep learning
1. INTRODUCTION
Editing speech recordings based on text transcriptions is an emerging audio editing paradigm [1–3]. Given a time-aligned text transcription, content creators can quickly edit a transcript using familiar word processing operations such as cut, copy, and paste and automatically propagate changes to the corresponding audio recording without having to manually edit a raw waveform. However, when cutting or copying-and-pasting a word from one location to another in a recording, a mismatch in prosody [4] can occur between the inserted word and its context (e.g., mismatches in intonation, stress, or rhythm).

Prior work relevant to this problem includes neural prosody generation, speech manipulation (e.g., pitch-shifting and time-stretching), and speech editing. Previous work in neural prosody generation includes generating phoneme durations from text [5, 6] and generating pitch from linguistic features and phoneme durations [6, 7].

* This work was carried out during an internship at Adobe Research.
Fig. 1. Context-aware text-based speech editing. We allow speech recordings (blue) to be manipulated by text-based cut, copy, and paste operations and then perform prosody correction (red) to make the edit sound natural in context.

Recently, FastSpeech 2 [6] was the first neural network to explicitly generate both pitch and duration from text. However, these prosody generators cannot be independently trained and require a complex training setup involving spectrogram supervision and acoustic feature generation. More critically, FastSpeech 2 does not attempt to generate prosodies that sound natural in context with any preceding or following speech.

Relevant speech manipulation work includes fast pitch-shifting and time-stretching techniques suitable for real-time interactive applications. Digital signal processing (DSP) based methods for pitch-shifting and time-stretching speech include Time-Domain Pitch-Synchronous Overlap-and-Add (TD-PSOLA) [8], WORLD [9], and STRAIGHT [10]. These methods are computationally cheap and amenable to interactive applications, but can induce audible artifacts. Real-time neural vocoders such as WaveGlow [11], MelGAN [12], and LPCNet [13] tend to exhibit higher perceptual quality than DSP-based methods, but lack well-known pitch-shifting capabilities. More recently, FastPitch [14] has demonstrated real-time, high-quality pitch-shifting and time-stretching, but requires significant retraining to add a new speaker and lacks context-aware generation. Finally, text-based speech editing methods such as VoCo [2] and Descript [3] are capable of performing cut, copy, and paste operations, but exhibit flattened prosody or boundary artifacts (i.e., prosody discontinuities at edit region boundaries).

Fig. 2. An overview of our proposed system. We first generate new, context-aware prosody from input phonemes using our pitch and duration generators. Next, we impose the generated prosody on the speech by pitch-shifting and time-stretching via TD-PSOLA. Finally, we use HiFi-GAN to mitigate the artifacts caused by pitch-shifting and time-stretching.

In this paper, we propose a new system that improves text-based speech editing by allowing natural-sounding cut, copy, and paste operations. We perform the cut, copy, or paste operation on the specified text and corresponding audio using a phoneme alignment transcription and then correct unnatural prosody regions using a series of four components, as shown in Figure 2. The transformation consists of (1) a neural network that generates phoneme durations (including silence) from text, (2) a neural network that generates pitch from phonemes and corresponding durations, (3) pitch-shifting and time-stretching via TD-PSOLA [8] to impose the generated prosody (i.e., duration and pitch), and (4) a denoising and dereverberation neural network that removes manipulation artifacts from the edited speech [15]. Our pitch and duration generators are amenable to fine- or coarse-grained user control, and can generate natural prosody in context with any preceding and following speech. We evaluate our approach using a subjective listening test, show a detailed comparative analysis, and conclude with several interesting insights.
2. METHODS

2.1. Phoneme duration generation
We generate phoneme duration information (i.e., a value in seconds for each phoneme, including silence) for all words of an edit region using the first neural network in our system. To do so, we break each word into phonemes and generate a duration value in seconds for each phoneme in the input text.

Audio examples are available at https://maxrmorrison.com/sites/context-aware.
Input phonemes are extracted using g2pE [16] and one-hot encoded. The output of our neural network is a real-valued scalar in seconds that is upper-bounded at 0.5 seconds to prevent long silences.

Our duration generator is a sequence-to-sequence model consisting of an encoder and decoder. The encoder contains two 1D convolution blocks followed by a bidirectional gated recurrent unit (GRU). Each convolution block consists of a convolution with 512 channels and a kernel size of 5, a ReLU activation, and batch normalization [17]. The decoder is a unidirectional GRU with 256 channels followed by a linear layer. During training, the network minimizes the mean squared error between the real-valued phoneme durations predicted by the network and the ground truth durations in seconds extracted with the Penn Phonetic Forced Aligner [18]. We also experimented with an autoregressive sequence-to-sequence model, expecting the learned distribution of phoneme durations to be multimodal, but found that the distribution was unimodal; we hypothesize this is due to training on single-speaker data with a consistent reading style.
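For concreteness, the following PyTorch sketch mirrors the architecture described above. Layer sizes follow the text; the class name, phoneme vocabulary size, and the sigmoid-based 0.5-second cap are our assumptions, since the paper does not specify how the bound is enforced.

```python
import torch
import torch.nn as nn

MAX_DURATION = 0.5  # upper bound on predicted durations (seconds), per the text

class ConvBlock(nn.Module):
    """Conv1d (512 channels, kernel size 5) -> ReLU -> BatchNorm, as described."""
    def __init__(self, in_channels):
        super().__init__()
        self.conv = nn.Conv1d(in_channels, 512, kernel_size=5, padding=2)
        self.norm = nn.BatchNorm1d(512)

    def forward(self, x):
        return self.norm(torch.relu(self.conv(x)))

class DurationGenerator(nn.Module):
    """Hypothetical reconstruction of the phoneme duration model."""
    def __init__(self, num_phonemes=72):  # vocabulary size is an assumption
        super().__init__()
        self.blocks = nn.Sequential(ConvBlock(num_phonemes), ConvBlock(512))
        self.encoder = nn.GRU(512, 256, batch_first=True, bidirectional=True)
        self.decoder = nn.GRU(512, 256, batch_first=True)
        self.proj = nn.Linear(256, 1)

    def forward(self, phonemes):
        # phonemes: (batch, time, num_phonemes) one-hot
        x = self.blocks(phonemes.transpose(1, 2)).transpose(1, 2)
        x, _ = self.encoder(x)  # (batch, time, 512)
        x, _ = self.decoder(x)  # (batch, time, 256)
        # One duration per phoneme, capped at 0.5 s (capping mechanism assumed)
        return torch.sigmoid(self.proj(x)).squeeze(-1) * MAX_DURATION

# Training objective: mean squared error against forced-aligned durations
loss_fn = nn.MSELoss()
```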
2.2. Pitch generation

Given the duration of each phoneme in a sentence being edited, we use a second neural network to generate a pitch contour for the edited words. More specifically, we use the existing Controllable DAR (C-DAR) [19] model for pitch generation. C-DAR generates a pitch value for each 10 millisecond frame of speech from one-hot encoded phonemes and linguistic features (see [19]) that have been upsampled according to input phoneme durations. C-DAR predicts a categorical distribution over 128 possible pitch values, which are evenly distributed between -4 and +4 standard deviations from the speaker's average pitch in base-2 log-space. Ground truth pitch values are extracted using a PyTorch port [20] of the Crepe pitch tracker [21]. To minimize double and half frequency errors, we decode the pitch from the sequence of categorical distributions predicted by Crepe using Viterbi decoding. Voiced/unvoiced tokens are extracted by performing hysteresis thresholding on Crepe's network confidence value.
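The bin construction described above is easy to reproduce. The sketch below builds the 128 bin centers in base-2 log space and quantizes a contour to training targets; the speaker statistics shown are hypothetical, and this is an illustration rather than C-DAR's actual implementation.

```python
import numpy as np

def pitch_bins(mean_f0, std_log2_f0, num_bins=128):
    """Bin centers evenly spaced between -4 and +4 standard deviations of the
    speaker's pitch in base-2 log space, per the text."""
    center = np.log2(mean_f0)
    centers = np.linspace(center - 4 * std_log2_f0, center + 4 * std_log2_f0, num_bins)
    return 2.0 ** centers  # back to Hz

def quantize(f0_hz, bins):
    """Map each 10 ms pitch value to its nearest bin index (voiced frames only)."""
    return np.argmin(np.abs(np.log2(f0_hz[:, None]) - np.log2(bins[None, :])), axis=1)

# Hypothetical speaker statistics
bins = pitch_bins(mean_f0=180.0, std_log2_f0=0.25)
targets = quantize(np.array([170.0, 185.0, 220.0]), bins)
```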
2.3. Pitch-shifting and time-stretching

To apply the generated prosody to edited speech, we use Time-Domain Pitch-Synchronous Overlap-and-Add (TD-PSOLA) [8]. We choose TD-PSOLA over the WORLD [9] vocoder due to its superior naturalness and pitch-shifting [19]. TD-PSOLA works by pitch-tracking the speech to determine locations (pitch epochs) at which to split the signal into chunks corresponding to one period. These chunks can then be repeated to perform time-stretching, or shifted to overlap more or less with adjacent chunks to perform pitch-shifting. The modified chunks are combined via overlap-add to create the pitch-shifted and time-stretched speech signal. We use the Praat [22] implementation of TD-PSOLA via the Parselmouth [23] Python package. We provide values for the pitch and time-stretching rate for each 10 millisecond frame of speech being edited. The time-stretching rate is the sequence of ratios between the original and generated phoneme durations, each repeated by the number of frames occupied by the original phoneme.
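Parselmouth exposes Praat's Manipulation objects, which is one way to realize the per-frame control described above. The following is a minimal sketch under our own bookkeeping assumptions, not the paper's released code.

```python
import parselmouth
from parselmouth.praat import call

def psola_edit(wav_path, frame_pitch_hz, frame_stretch, hop=0.01):
    """Impose per-frame pitch (Hz) and time-stretch ratios via TD-PSOLA.

    frame_pitch_hz and frame_stretch each hold one value per 10 ms frame;
    unvoiced frames carry 0 Hz and are skipped in the pitch tier.
    """
    sound = parselmouth.Sound(wav_path)
    manipulation = call(sound, "To Manipulation", hop, 75, 600)
    end = sound.get_total_duration()

    pitch_tier = call("Create PitchTier", "pitch", 0, end)
    duration_tier = call("Create DurationTier", "duration", 0, end)
    for i, (f0, rate) in enumerate(zip(frame_pitch_hz, frame_stretch)):
        t = i * hop
        if f0 > 0:  # skip unvoiced frames
            call(pitch_tier, "Add point", t, f0)
        call(duration_tier, "Add point", t, rate)

    call([manipulation, pitch_tier], "Replace pitch tier")
    call([manipulation, duration_tier], "Replace duration tier")
    return call(manipulation, "Get resynthesis (overlap-add)")

# Toy usage: constant 200 Hz pitch and a 10% time stretch over 500 frames
# (real inputs come from the pitch and duration generators)
# edited = psola_edit("speech.wav", [200.0] * 500, [1.1] * 500)
```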
2.4. Artifact removal

TD-PSOLA induces some artifacts in the edited speech. These artifacts are more noticeable in frames where the duration or pitch change is larger. To mitigate this issue, we use HiFi-GAN [15], a GAN-based neural network that learns to transform noisy and reverberant speech recordings into clean, high-fidelity speech. While HiFi-GAN is only trained to perform speech denoising and dereverberation, we find it very effective at reducing the artifacts induced by pitch-shifting and time-stretching. This allows us to avoid using complex, speaker-dependent neural vocoders that can be difficult to use for interactive applications. HiFi-GAN also operates significantly faster than real-time (i.e., about 20x real-time on a Tesla V100 GPU), making it well-suited for interactive applications. More broadly, we believe the use of neural denoisers to mitigate the artifacts of DSP-based manipulation algorithms is an exciting research direction.
In addition to automatically adjusting unnatural prosody fortext-based speech editing, our method allows users to directlymodify the prosody. That is, users of our system can option-ally provide explicit pitch and/or duration information to con-trol the prosody of the output speech and refine automaticallygenerated results. If only a subsequence of the full prosody isgiven (e.g, the pitch values of a word or two or the durationvalues of a couple phonemes), the remainder is predicted by aneural network to sound natural relative to the user-providedpitch and duration. In addition, our system can generate a va-riety of candidate prosodies and allows users to select theirfavorite. This property is greatly beneficial for our real-worldtarget application and not possible using many existing neuralprosody techniques.
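To make this control surface concrete, the snippet below illustrates the contract described above; the edit function and its signature are hypothetical, not part of a released API.

```python
# Hypothetical control surface: pinned values are kept verbatim, and None
# entries are filled in by the generators via subsequence prediction so that
# they sound natural next to the pinned values and the surrounding speech.
user_durations = {"DH": 0.06, "AH0": None, "K": None, "AE1": 0.12, "T": None}
user_pitch = None  # no pitch constraint: generate the full contour

# edited = edit(recording, transcript, new_text="the cat",
#               durations=user_durations, pitch=user_pitch)
```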
3. EXPERIMENTAL DESIGN

3.1. Task
We evaluate our proposed system on the task of copying and pasting a phrase from the text and speech of one sentence to another sentence containing the same phrase. We call this task "replacement", and use it to study prosody errors across a variety of alternative methods and compare against a perfect ground truth, unedited recording. We perform replacement on repeated phrases of two-to-five words for each dataset listed below. Given that the text is unchanged, there is a high likelihood of the prosody sounding correct without any additional modification. We manually select 20 examples from each speaker where there is a variety of noticeable prosody errors due to direct waveform editing. This is our "naive" condition. All concatenations of waveforms are performed using a 20 millisecond equal-power crossfade to prevent waveform boundary artifacts (see the sketch below). We use A-weighted loudness matching [24] to ensure that the volume of the pasted speech is reasonable given the speech context.

Our goal is not only to understand how our proposed method can modify the prosody compared to a naive method, but also to understand which components of our system are most useful. We use Amazon Mechanical Turk (AMT) to conduct both a mean opinion score (MOS) test and a pairwise comparison test against alternative approaches. We additionally use a basic pretest to make sure all subjects have proper equipment and hearing ability.
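The concatenation step is simple enough to state in code. Below is a sketch of the 20 ms equal-power crossfade; the A-weighted loudness-matching step is omitted, and inputs are assumed to be mono float waveforms.

```python
import numpy as np

def equal_power_crossfade(a, b, sr, fade=0.02):
    """Concatenate waveforms a and b with a 20 ms equal-power crossfade to
    prevent discontinuities at the paste boundary."""
    n = int(fade * sr)
    t = np.linspace(0.0, np.pi / 2, n)
    fade_out, fade_in = np.cos(t), np.sin(t)  # cos^2 + sin^2 = 1 (equal power)
    middle = a[-n:] * fade_out + b[:n] * fade_in
    return np.concatenate([a[:-n], middle, b[n:]])

# Usage (hypothetical arrays): sr = 16000
# pasted = equal_power_crossfade(context_left, inserted_word, sr)
```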
3.2. Data

We demonstrate our system with two single-speaker datasets: one male speaker and one female speaker. Our female speaker is LJSpeech [25], a subset of books read by Linda Johnson (reader ID 11049 on LibriVox). We preprocess LJSpeech using HiFi-GAN [15] to remove reverberation. For our male speaker, we use the book The Everyday Life of Abraham Lincoln read by Bill Boerst (reader ID 4788 on LibriVox). We use the Gentle forced aligner [26] to break the long-form speech and text into sentences for training. We downsample all speech recordings to a 16 kHz sampling rate.
3.3. Training

Our pitch generator [19] and denoising [15] networks are pretrained according to their original implementations. Our duration generator is trained in a single-speaker fashion on each of our datasets. We use two hours of speech for training and two hours for validation. We train the duration generator for 30 epochs with a batch size of 64. Training the duration generator takes less than 30 minutes on one Tesla V100 GPU.

During training, we randomly provide short sequences of the ground truth phoneme durations as input features. We randomly select half of the training samples to have k adjacent ground truth durations, where k is drawn from a uniform distribution, so that the model learns to fill in the remaining durations in a context-aware manner. During inference, the phoneme durations of the speech context are similarly provided as input features. The remaining, unspecified durations are predicted based on the durations of the context, which encourages prosodic continuity. We call this method subsequence prediction (see the sketch below), and it is also used by our pitch generator to enable similar control over the generated pitch [19]. We use subsequence prediction to condition the pitch and duration generators on the preceding and following speech prosody. This makes our prosody generators context-aware.
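A sketch of this augmentation follows; the upper bound of the uniform distribution is not recoverable from the text, so it is left as a parameter, and details beyond the prose are our assumptions.

```python
import torch

def subsequence_inputs(durations, p=0.5, max_k=None):
    """Training-time augmentation: with probability p, reveal k adjacent
    ground-truth durations as input features (k drawn uniformly); the model
    must fill in the remaining durations in a context-aware manner."""
    observed = torch.zeros_like(durations)
    mask = torch.zeros_like(durations)
    if torch.rand(1).item() < p:
        n = durations.shape[0]
        hi = n if max_k is None else min(max_k, n)
        k = torch.randint(0, hi + 1, (1,)).item()
        if k > 0:
            start = torch.randint(0, n - k + 1, (1,)).item()
            observed[start:start + k] = durations[start:start + k]
            mask[start:start + k] = 1.0
    return observed, mask  # concatenated with phoneme features as model input
```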
          Original  Naive  Average  Proposed  -Duration  -Pitch  -HiFi-GAN  -Context  Tacotron
MOS           —       —       —        —          —         —        —          —         —
Pairwise      —       —       —        —          —         —        —          —         —

Table 1. Mean opinion score (MOS) and pairwise comparison results. Both the MOS and pairwise test results show our proposed method outperforms alternatives, including naive copying and pasting and our method without context. Our method is comparable with a more complex TTS system, but still has room for improvement. Pairwise scores are given as percent preference for our method.
3.4. Conditions

We include several additional test conditions in our listening study, including listening test anchors, ablations, and baselines. We use the original speech as a high-quality anchor (Original). For a low-quality anchor, we use an average condition (Average), which uses monotone pitch held constant at the mean frequency of the speech context and per-phoneme duration averages computed on the training data (see the sketch below). We ablate our duration generator by using the durations of the baseline condition instead of generating new durations (-Duration). We similarly ablate our pitch generator by using the pitch of the naive cut condition (-Pitch). We ablate HiFi-GAN by using the raw TD-PSOLA output without postprocessing (-HiFi-GAN). We examine the role of context-awareness by producing examples that use our proposed method without subsequence prediction (-Context). Finally, we compare our method to a baseline that uses prosody generated by a multispeaker model based on Tacotron 2 [27] and vocoded with WaveGlow [11] using the full text of each example (Tacotron). We extract the prosody using pitch-tracking and phoneme alignment, and apply the prosody via TD-PSOLA. This allows us to directly compare the prosody generated by the Tacotron-based model to ours without the additional confound of comparing TD-PSOLA and WaveGlow.
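As a concrete reading of the Average anchor, a minimal sketch (our own construction, with hypothetical inputs) might look like:

```python
import numpy as np

def average_condition(context_f0, phonemes, avg_duration, hop=0.01):
    """Build the low-quality anchor: monotone pitch fixed at the mean
    frequency of the voiced context frames, plus per-phoneme duration
    averages computed on the training data."""
    mean_f0 = float(np.mean(context_f0[context_f0 > 0]))  # voiced frames only
    durations = [avg_duration[p] for p in phonemes]       # seconds per phoneme
    num_frames = int(round(sum(durations) / hop))
    return np.full(num_frames, mean_f0), durations
```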
4. RESULTS
We report our listening test results for both the MOS and pairwise comparison tests in Table 1. A total of 159 and 220 unique listeners participated in the MOS and pairwise comparison tasks, respectively. We obtain a total of 16 answers per voice per condition per utterance.

First, we compare our proposed system with and without context-awareness (-Context). We see that context improves MOS results by a noticeable amount, as well as pairwise preference ratings. This result demonstrates the significance of context-awareness in deep learning models that perform speech editing tasks.

Second, we compare our method to a state-of-the-art TTS method (Tacotron). When we look at both the MOS and pairwise comparison scores, we find our method slightly outperforms Tacotron for the task of text-based speech editing. Given the relative simplicity of our prosody generation networks, this is remarkable, as it shows that simple prosody generators can achieve comparable prosody naturalness without training a complex, multi-speaker acoustic feature predictor or neural vocoder.

Third, we compare our method to our own Naive cut-copy-paste implementation. Both MOS and pairwise comparison results show our proposed method performs better. While it might seem like this should be easy to outperform, it is important to note that the Naive method uses only naturally recorded audio and has no time-stretching, pitch-shifting, or other vocoder artifacts by construction. The manual selection process for this condition produces examples that require large pitch-shifting and time-stretching operations, and TD-PSOLA creates larger artifacts during larger manipulations. Our method surpasses this Naive baseline despite inducing a worst-case scenario for the naturalness of TD-PSOLA.

Fourth, we ablate each of our pitch (-Pitch) and duration (-Duration) generators. We see that each of these components is important, but the effect of pitch manipulation is more significant. This result is in line with our initial analysis of phoneme duration being unimodal, and suggests that further work on context-aware pitch generation may be beneficial.

Finally, we compare our method with and without HiFi-GAN. Both MOS and pairwise comparison results suggest that HiFi-GAN mitigates the speech manipulation artifacts caused by time-stretching and pitch-shifting. We suspect that further advancements in neural denoising and dereverberation systems will directly benefit our proposed method.
5. CONCLUSION
We propose a system that enables high-fidelity, context-aware, text-based speech editing. Our system trains quickly and automatically infers natural-sounding prosody relative to the speech surrounding the edit region, but also allows explicit user adjustment for complete creative control. We anticipate that these additional controls for the production and editing of speech can expedite the creative workflow of both novice and expert content creators. Toward that end, our future work could explore using an efficient, high-fidelity neural vocoder instead of TD-PSOLA, or replacing HiFi-GAN with a denoiser catered specifically to TD-PSOLA that can correct formant errors. The latter would be especially useful as TD-PSOLA can perform real-time manipulation even at high sampling rates.

6. REFERENCES

[1] S. Rubin, F. Berthouzoz, G. J. Mysore, W. Li, and M. Agrawala, "Content-based tools for editing audio stories," in ACM Symposium on User Interface Software and Technology. ACM, 2013.
[2] Z. Jin, G. J. Mysore, S. Diverdi, J. Lu, and A. Finkelstein, "VoCo: Text-based insertion and replacement in audio narration," ACM Transactions on Graphics (TOG), 2017.
[3] Descript, "Descript," https://www.descript.com.
[4] The Handbook of Phonetic Sciences, vol. 5, pp. 640–673, 1997.
[5] Y. Ren, Y. Ruan, X. Tan, T. Qin, S. Zhao, Z. Zhao, and T.-Y. Liu, "FastSpeech: Fast, robust and controllable text to speech," in Advances in Neural Information Processing Systems, 2019.
[6] Y. Ren, C. Hu, T. Qin, S. Zhao, Z. Zhao, and T.-Y. Liu, "FastSpeech 2: Fast and high-quality end-to-end text-to-speech," arXiv preprint arXiv:2006.04558, 2020.
[7] X. Wang, Fundamental Frequency Modeling for Neural-Network-Based Statistical Parametric Speech Synthesis, Ph.D. thesis, The Graduate University for Advanced Studies, 2018.
[8] E. Moulines and F. Charpentier, "Pitch-synchronous waveform processing techniques for text-to-speech synthesis using diphones," Speech Communication, 1990.
[9] M. Morise, F. Yokomori, and K. Ozawa, "WORLD: A vocoder-based high-quality speech synthesis system for real-time applications," IEICE Transactions on Information and Systems, 2016.
[10] H. Kawahara, I. Masuda-Katsuse, and A. de Cheveigné, "Restructuring speech representations using a pitch-adaptive time-frequency smoothing and an instantaneous-frequency-based F0 extraction," Speech Communication, 1999.
[11] R. Prenger, R. Valle, and B. Catanzaro, "WaveGlow: A flow-based generative network for speech synthesis," in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2019.
[12] K. Kumar, R. Kumar, T. de Boissiere, L. Gestin, W. Z. Teoh, J. Sotelo, A. de Brébisson, Y. Bengio, and A. C. Courville, "MelGAN: Generative adversarial networks for conditional waveform synthesis," in Advances in Neural Information Processing Systems, 2019.
[13] J.-M. Valin and J. Skoglund, "LPCNet: Improving neural speech synthesis through linear prediction," in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2019.
[14] A. Łańcucki, "FastPitch: Parallel text-to-speech with pitch prediction," 2020.
[15] J. Su, Z. Jin, and A. Finkelstein, "HiFi-GAN: High-fidelity denoising and dereverberation based on speech deep features in adversarial networks," in Interspeech, 2020.
[16] K. Park and J. Kim, "g2pE," https://github.com/Kyubyong/g2p, 2019.
[17] S. Ioffe and C. Szegedy, "Batch normalization: Accelerating deep network training by reducing internal covariate shift," arXiv preprint arXiv:1502.03167, 2015.
[18] J. Yuan and M. Liberman, "Speaker identification on the SCOTUS corpus," Journal of the Acoustical Society of America, 2008.
[19] M. Morrison, Z. Jin, J. Salamon, N. J. Bryan, and G. J. Mysore, "Controllable neural prosody synthesis," in Interspeech, 2020.
[20] M. Morrison, "torchcrepe," https://github.com/maxrmorrison/torchcrepe, 2020.
[21] J. W. Kim, J. Salamon, P. Li, and J. P. Bello, "CREPE: A convolutional representation for pitch estimation," in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2018.
[22] P. Boersma, "Praat: Doing phonetics by computer," 2006.
[23] Y. Jadoul, B. Thompson, and B. De Boer, "Introducing Parselmouth: A Python interface to Praat," Journal of Phonetics, vol. 71, pp. 1–15, 2018.
[24] R. G. McCurdy, "Tentative standards for sound level meters," Electrical Engineering, vol. 55, no. 3, pp. 260–263, 1936.
[25] K. Ito and L. Johnson, "The LJ Speech dataset," https://keithito.com/LJ-Speech-Dataset/, 2017.
[26] R. M. Ochshorn and M. Hawkins, "Gentle forced aligner," https://github.com/lowerquality/gentle, 2017.
[27] Y. Jia, Y. Zhang, R. Weiss, Q. Wang, J. Shen, F. Ren, P. Nguyen, R. Pang, I. L. Moreno, Y. Wu, et al., "Transfer learning from speaker verification to multispeaker text-to-speech synthesis," in Advances in Neural Information Processing Systems, 2018.