Supervised and Unsupervised Approaches for Controlling Narrow Lexical Focus in Sequence-to-Sequence Speech Synthesis
Slava Shechtman¹, Raul Fernandez², David Haws²

¹IBM Haifa Research Lab, Haifa, Israel
²IBM TJ Watson Research Lab, Yorktown Heights, NY, USA
[email protected], {fernanra, dhaws}@us.ibm.com

ABSTRACT
Although Sequence-to-Sequence (S2S) architectures have become state-of-the-art in speech synthesis, capable of generating outputs that approach the perceptual quality of natural samples, they are limited by a lack of flexibility when it comes to controlling the output. In this work we present a framework capable of controlling the prosodic output via a set of concise, interpretable, disentangled parameters. We apply this framework to the realization of emphatic lexical focus, proposing a variety of architectures designed to exploit different levels of supervision based on the availability of labeled resources. We evaluate these approaches via listening tests that demonstrate we are able to successfully realize controllable focus while maintaining the same, or higher, naturalness over an established baseline, and we explore how the different approaches compare when synthesizing in a target voice with or without labeled data.
Index Terms — prosody control, sequence-to-sequence speech synthesis
1. INTRODUCTION
Sequence-to-Sequence (S2S) speech-synthesis architectures have become the state of the art in the field, providing high-quality outputs that approach or match the perceived quality of natural speech in many studies. Aside from the level of quality attained, there are many attractive features to these models. They are able to jointly model different aspects of a waveform (e.g., segmental and prosodic), so interactions between them can be implicitly learned. They also do away with classical pipeline architectures in favor of a single unified model, which is appealing when some of the modules in the pipeline are difficult to develop (e.g., text processing for a new language). On the other hand, they suffer from well-documented shortcomings, such as lack of interpretability (it can be difficult to tell which parts of the model are responsible for which functions), lack of controllability (it is more difficult to intervene in the model to control some aspect of the synthesis, which is often desired, for instance when providing SSML support), and potential instability (small deviations at inference time can become exacerbated and generate highly degraded speech).

In this work, we address the controllability issue by expanding the S2S architecture with mechanisms that can be exposed to the user to manipulate some property of the output. Although usability factors are not the focus of this work, we nonetheless advocate for a set of properties that will make such controls accessible to the end consumer of the system, namely:

• Interpretability: The listener should be able to clearly hear and identify the effect of varying a control (e.g., speech is slower, faster, higher-pitched, sounds happier, etc.).
• Monotonicity: A design whose perceptual effects vary monotonically as the user varies a control has a more intuitive feel and is more easily tunable.
• Low-dimensionality: The user should not be expected to manipulate a large number of parameters to control the output. The model should either expose a low-dimensional controllable representation, or be able to step in and fill in defaults to obviate the task for the user.
• Disentanglement: Though this may be difficult given the many ways different speech parameters interact, a set of controls that are more decoupled from each other facilitates tuning the output along fairly independent (perceptual) dimensions (e.g., tempo and volume could be tuned separately, without needing to revisit a previously tuned parameter).

We explore the realization and controllability of narrow lexical focus as a case study for the above. Our objective is the realization of an emphatic level of prominence that is distinct from the type of accentuation that we observe in "neutral" broad-focus prosody. Consider the intonational phrase in the examples below when it occurs as a reaction to the context in parentheses. In E1, as a reply to a general question, we see a likely case of broad-focus prosody, where wine acts as the nuclear element and receives some sort of pitch accent. The same accented word, however, might be given a more emphatic degree of prominence when it happens in the context of E2. Furthermore, we can switch the focal point to a different word in the phrase when it is primed by a different context, as in E3. The [...] in these examples delimit the domain of focus, which the speaker may delineate, for instance, by employing a higher degree of disjuncture between the focal element and its context.

• E1: Mary is [pouring the wine]. (What's Mary doing?)
• E2: Mary is pouring the [wine]. (Is Mary pouring the beer?)
• E3: [Mary] is pouring the wine. (Is John pouring the wine?)

We are interested in the prosodic realizations that arise in examples such as E2 and E3 above (but also in other scenarios, such as contrastive emphasis, requesting clarification, etc.). In Sec. 2 we introduce an S2S architecture that supports this type of prosodic control; we review in Sec. 3 how our approach compares to relevant research in the literature, evaluate competing approaches to this question in Sec. 4, and conclude in Sec. 5 with some analysis of these results and an outline of future steps.
2. ARCHITECTURE
The model (Fig. 1) is a variant of the Tacotron2 architecture proposed in [1], augmented with components in the decoder to facilitate both the injection of controls and improved stability during decoding [2]. This sequence-to-sequence model generates an acoustic spectral-prosodic representation that is then fed to an independently trained, LPCNet-based [3] neural vocoder to generate high-quality samples in real time [2].

The Encoder comprises the following components, combined as in Fig. 1 before being sent to the decoder (a sketch of one plausible assembly follows the list):

• The emphasis embedding (A), derived from a Boolean indicator feature encoding emphatic focus within the utterance, as a way to provide direct supervision to the model.
• The embedding of various linguistic symbols (B), extracted from an extended phonetic dictionary comprising phone identity, lexical stress, phrase type, and other symbols for word boundaries and silences. This analysis is carried out externally by a rule-based TTS front-end module, adopted from a unit-selection system [4].
• A front-end encoder (C) consisting of convolutional and bi-directional Long Short-Term Memory (Bi-LSTM) layers (as in [1]), encoding the merged embeddings from (A) and (B).
• A global utterance-level speaker embedding (D), broadcast over the length of the sequence, to support training in a multi-speaker setting.
• A set of 4-dimensional hierarchical prosodic controls (introduced in Sec. 2.1), designed to enable the type of fine, word-level modification needed to realize the prosodic patterns associated with emphatic focus. Since these prosodic controls are a set of statistics extracted from the acoustic signal, the ground-truth values from the training set are used during training (F). At inference time, a separate predictive module (E) steps in to provide default predictions for the hierarchical prosodic trajectories.
• An optional user-exposed control (G) to modify the default predictions generated by (E). In particular, we propose a set of additive controls that are linguistically intuitive and interpretable (Sec. 2.1). Note that the feed-forward operation in block H is placed after the (optional) user request in G. This design choice preserves the interpretability of the quantities the user gets to manipulate (which would not be the case if the order were reversed and the independent prosodic targets were blended via a non-linear feed-forward operation).
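To make the combination of blocks A-H concrete, the following PyTorch-style sketch illustrates one plausible assembly. All layer sizes, tensor layouts, names, and the concatenation-based merge are illustrative assumptions for this sketch (the convolutional layers of the front-end encoder are omitted); it is not the authors' implementation.

```python
import torch
import torch.nn as nn

class ControllableEncoder(nn.Module):
    """Illustrative sketch of encoder blocks A-H (hypothetical sizes/merging)."""
    def __init__(self, n_symbols, n_speakers, emb_dim=512, emph_dim=8, spk_dim=20):
        super().__init__()
        self.symbol_emb = nn.Embedding(n_symbols, emb_dim)   # (B) linguistic symbols
        self.emph_emb = nn.Embedding(2, emph_dim)            # (A) Boolean focus flag
        self.spk_emb = nn.Embedding(n_speakers, spk_dim)     # (D) speaker identity
        # (C) front-end encoder; the convolutional layers of [1] are omitted here.
        self.front_end = nn.LSTM(emb_dim + emph_dim, emb_dim // 2,
                                 batch_first=True, bidirectional=True)
        self.pc_ff = nn.Linear(4, emb_dim)                   # (H) feed-forward on controls

    def forward(self, symbols, emphasis, speaker, pc, user_offset=None):
        # symbols/emphasis: (batch, T) integer tensors; pc: (batch, T, 4) prosodic
        # controls, ground truth during training (F) or predicted defaults (E).
        x = torch.cat([self.symbol_emb(symbols), self.emph_emb(emphasis)], dim=-1)
        h, _ = self.front_end(x)
        if user_offset is not None:   # (G) additive user request, applied BEFORE the
            pc = pc + user_offset     # non-linear block (H), so the manipulated
                                      # quantities remain interpretable
        spk = self.spk_emb(speaker).unsqueeze(1).expand(-1, h.size(1), -1)
        return torch.cat([h, spk, self.pc_ff(pc)], dim=-1)   # sent to the decoder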
The Decoder is an autoregressive network that largely follows the standard Tacotron2 architecture, but with modifications to the attention mechanism, autoregressive feedback, choice of targets, and training losses. These have already been described in [2] and are summarized as follows. The attention is an augmented two-stage attention where the content- and location-based attention of Tacotron2 are followed by a structure-preserving mechanism encouraging monotonicity and unimodality in the alignment matrix. This modification has been found to be crucial for increasing stability during inference, particularly in the presence of external controls. A double-feedback approach is used during training to expose the model both to the previous ground-truth output value (i.e., teacher forcing) and to the previous predicted value (i.e., inference mode); at inference time, the predicted value is replicated. The model is trained in a multi-task fashion to predict the 80-dim mel cepstral features in tandem with the parameters needed as inputs for an independently trained LPCNet neural vocoder. For 22kHz signals, these features (which we denote as "LPC features") consist of a 22-dim vector with 20 cepstral coefficients, log-f0, and f0 correlation. The predicted LPC features are also processed with two post-nets (one to refine the cepstrum, and one to refine the pitch parameters); no post-net refinement is applied on the mel task.

Fig. 1. System architecture. The dashed line indicates that the output of the S2S model is sent to a separately trained neural vocoder, which does not play a role in the optimization of Eqn. 1.

Let y^M_t and y^L_t represent the target sequences for the mel and LPC tasks respectively, ỹ^M_t and ỹ^L_t their final predictions, and ŷ^L_t the "intermediate" LPC-feature prediction (before the post-net). The following differential loss function is then used to train the system:

L = MSE(ỹ^M_t, y^M_t) + λ₁ MSE(ŷ^L_t, y^L_t) + λ₂ MSE(ỹ^L_t, y^L_t) + λ₃ MSE(Δỹ^L_t, Δy^L_t),   (1)

where the λᵢ are fixed scalar weights below 1, the Δ operator applies the first difference in time to a sequence, and MSE(·,·) is the mean-squared error. For the sake of space, we omit some detail in this exposition, and refer the reader to [5, 2] for additional background and formulae.
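For concreteness, here is a minimal sketch of the loss in Eqn. 1, assuming batch-first (batch, time, dim) PyTorch tensors; the `lam` defaults below are placeholders for the fixed scalar weights λᵢ, whose exact values we do not reproduce here.

```python
import torch
import torch.nn.functional as F

def s2s_loss(mel_pred, mel_tgt, lpc_pre, lpc_post, lpc_tgt, lam=(0.5, 0.5, 0.5)):
    """Multi-task training loss of Eqn. 1 (lam are placeholder weights)."""
    def delta(x):
        # First difference in time: the Delta operator of Eqn. 1.
        return x[:, 1:] - x[:, :-1]
    loss = F.mse_loss(mel_pred, mel_tgt)                          # mel task
    loss = loss + lam[0] * F.mse_loss(lpc_pre, lpc_tgt)           # LPC, before post-net
    loss = loss + lam[1] * F.mse_loss(lpc_post, lpc_tgt)          # LPC, after post-net
    loss = loss + lam[2] * F.mse_loss(delta(lpc_post), delta(lpc_tgt))  # differential term
    return loss
```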
This architecture accommodates some variants depending on the availability of labeled resources and on the types of control exposable to a user. Among them, we explore the following:

• Classic Supervision: When labeled data is available, this architecture conditions directly on a Boolean indicator feature. During training, the ground-truth values from the audio signals are used; during inference, a binary request is passed to the system. (This request could be either user-specified for given words, e.g., via mark-up, or inferred from text; we do not address the problem of inference from text here, having done so previously in [6], and instead focus on the realization of prosodic controls assuming an existing request.) This variant corresponds to blocks {A, B, C, D} on the encoder side.
• No Supervision: Under the assumption that no labeled data exists, the architecture defined by {B, C, D, E, F, G, H} provides a way to introduce sensitivity to these prosodic patterns into the S2S system during training and, at inference time, to control their realization via a tunable set of controls (cf. the binary control of the supervised architecture).
• Hybrid:
Though components E through H are motivated by an unsupervised approach, they may facilitate the realization of prosodic patterns even when labeled data exists, by working in tandem with an explicit feature. To investigate this, we consider a "hybrid" approach (defined by the full model {A-G}) that mixes supervised knowledge with the infrastructure designed to tackle the case when we don't have access to it.

2.1. Hierarchical Prosodic-Control Model
Following the motivation for a perceptually interpretable, low-dimensional control mechanism for prosody discussed in Sec. 1, we propose a hierarchical set of four prosodic controls that summarize information about the duration and pitch excursion of a signal over linguistically meaningful and intuitive intervals of the prosodic hierarchy. These controls include global and local properties, and are an extension of the approach in [5], which allowed for controlling global aspects like overall tempo, but which lacked any control to effect the kind of deviation from long-term trends needed to realize local emphatic focus. To arrive at these, let us first define the following statistics:

• S_dur: The log of the average per-phone durations, along a sentence (and excluding any silence).
• S_f0: The log-f0 "spread" (defined as the difference between the 95- and 5-percentiles of log-f0), along a sentence.
• W_dur: The log of the average per-phone durations (as above), along each word.
• W_f0: The log-f0 "spread" (as above), along each word.

Note that the average per-phone durations in the above definitions are estimated as the duration of speech (in seconds) along the relevant spans (word or sentence) divided by the number of phone symbols contained therein, so no fine-level phonetic alignment is required in the computation (only coarse word-level alignments and either phonetic transcriptions or a dictionary). These sentence- and word-level properties are propagated down to the temporal granularity of the phonetic encoder outputs (i.e., phones) to form piecewise functions that are constant within a (sentence or word) unit. From this we define the following four-component prosodic-control target vector:

PC = Norm_σ{ [S_dur, S_f0, W_dur − S_dur, W_f0 − S_f0] },   (2)

where Norm_σ{·} is the linear map [−σ, σ] → [−1, 1], and σ is the global (corpus-wide) variance of each of the statistics in PC. At inference time, the predictions of the prosodic-control subnet are rectified to be piecewise constant, like the oracle values the S2S system was trained with: in the evaluated systems, a mean-pooling function is applied to the prediction so that it is constant between the (known) sentence and word boundaries.
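As an illustration of Eqn. 2, the following numpy sketch derives the four piecewise-constant control targets for one sentence from coarse word-level alignments. The input layout and field names are assumptions made for this sketch, as is the per-component clipping used here to realize the Norm_σ map.

```python
import numpy as np

def log_f0_spread(log_f0):
    # "Spread" of log-f0: difference between its 95th and 5th percentiles.
    return np.percentile(log_f0, 95) - np.percentile(log_f0, 5)

def prosodic_control_targets(words, sigma):
    """Phone-level targets of Eqn. 2 for one sentence.

    words: list of dicts, one per word (silences excluded), with fields
           'dur' (speech duration, s), 'n_phones' (number of phone symbols),
           'log_f0' (1-D array of voiced log-f0 samples within the word).
    sigma: 4-vector of corpus-wide deviations used by the Norm_sigma map.
    """
    total_dur = sum(w['dur'] for w in words)
    total_phones = sum(w['n_phones'] for w in words)
    # Sentence-level statistics: log average per-phone duration, f0 spread.
    s_dur = np.log(total_dur / total_phones)
    s_f0 = log_f0_spread(np.concatenate([w['log_f0'] for w in words]))
    rows = []
    for w in words:
        w_dur = np.log(w['dur'] / w['n_phones'])
        w_f0 = log_f0_spread(w['log_f0'])
        # Word-level terms enter as deviations from the sentence-level trend.
        pc = np.array([s_dur, s_f0, w_dur - s_dur, w_f0 - s_f0])
        # Norm_sigma: linearly map [-sigma, sigma] onto [-1, 1] per component
        # (values outside the range are clipped in this sketch).
        pc = np.clip(pc / np.asarray(sigma), -1.0, 1.0)
        # Replicate down to phone granularity: piecewise constant within units.
        rows.extend([pc] * w['n_phones'])
    return np.stack(rows)  # shape: (total_phones, 4)
```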
Fig. 2. Architecture of the hierarchical prosodic sub-network for predicting targets from encoder-level features.

The architecture of the prosodic-control predictor (Fig. 2) consists of a stack of N blocks, each comprising a concatenation of the speaker embedding with the block's input, a Bi-LSTM, layer normalization [7], and dropout. Models are trained in a multi-speaker fashion via a speaker-embedding layer whose output is fed into every cascaded block. (We will discuss how we instantiate model sizes for the different components of this architecture when we discuss the details of selecting models for evaluation in Sec. 4.) Since the replication to the phone level artificially introduces an over-contribution to the loss, each observation in each of the prosodic targets is down-weighted by this replication factor (e.g., for the sentence-level targets, each phone-level observation in a 10-phone sentence receives a weight of 0.1; a similar approach is applied to the word-level targets). These observation-level weights (uniquely determined by prosodic constituency) are then combined with global target-specific weights α that can be set during training to trade off between the different targets (in this evaluation, the first target received unit weight and the two word-level targets fractional weights below 1). The model is then trained with ADAM [8] to minimize the weighted L1 loss between predictions and targets. A set of 10% of the sentences in the training set is held out to tune structure (e.g., number of hidden units and blocks) and learning-rate hyper-parameters.

Fig. 3. Sample phone-level trajectories of the four prosodic controls for a two-sentence input.

At run time, lexical focus is controlled by the process illustrated in Fig. 4: the prosodic-control predictions generated by component E in Fig. 1, post-processed to be piecewise constant, are offset by four tunable parameters (α, β, γ, δ), where (α, β) are global sentence-level offsets (applied uniformly, and therefore contributing only to the overall expressiveness of the utterance) and (γ, δ) boost the word-level predictions of only those words we wish to make salient (the remaining, non-focal words receive no offset). These run-time hyperparameters can be tuned via an independent development set.

Fig. 4. Boosting the prosodic-control predictions of sentence- and word-level targets to realize focus. The example shows a fragment of an utterance where the word blue is to be emphasized (e.g., I don't want the red one; I want the blue one). The predicted prosodic controls are offset by global and local offsets, where the local offsets are applied to the focal words only.
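A minimal sketch of this run-time boosting (cf. Fig. 4), assuming phone-level (T, 4) control predictions and word spans given as index ranges; the function and argument names are hypothetical, and the default offsets below are placeholders rather than the tuned values of Sec. 4.

```python
import numpy as np

def apply_focus_offsets(pc, word_spans, focal_words,
                        alpha=0.0, beta=0.0, gamma=0.5, delta=0.5):
    """Offset piecewise-constant prosodic-control predictions (cf. Fig. 4).

    pc          -- (T, 4) array: [S_dur, S_f0, W_dur - S_dur, W_f0 - S_f0]
    word_spans  -- list of (start, end) phone-index ranges, one per word
    focal_words -- indices of the words that should carry narrow focus
    alpha, beta -- global sentence-level offsets (duration, f0 spread)
    gamma, delta -- word-level boosts applied to the focal words only
    """
    out = pc.copy()
    # Sentence-level offsets apply uniformly: they only affect the overall
    # expressiveness of the utterance.
    out[:, 0] += alpha
    out[:, 1] += beta
    for i, (start, end) in enumerate(word_spans):
        if i in focal_words:
            # Non-negative boosts: slower tempo and a larger pitch excursion
            # on the focus carrier; all other words remain unboosted.
            out[start:end, 2] += gamma
            out[start:end, 3] += delta
    return out
```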
3. PREVIOUS AND RELATED WORK
Synthesizing emphasis has been previously explored within other architectures like unit selection [9, 10, 11], classical parametric synthesis [12, 13, 14], and pipeline systems using neural networks [15]. Within S2S models, controllability has recently received a moderate amount of attention, with the Global Style Tokens (GST) proposal of [16] being one of the earliest works to discover latent styles in an unsupervised fashion. GST-based approaches have found wide usage (see, e.g., [17, 18, 19]), but because these representations are discovered, rather than explicitly formulated, they often lack a priori interpretability (though post hoc listening often reveals some uniform perceptual quality). Moreover, GST and other approaches where global tempo is controllable [20] operate at the utterance level, and therefore lack the finer-grained control we pursue. Non-GST approaches include works like [21], where direct conditioning on estimated indicators of emotion is used to control the output. Recent work [22, 23] has also looked at the controllability of prosodic properties in Transformer-based neural TTS systems, although at the core of that approach is a move away from S2S models that, for the sake of speed, replaces an S2S teacher with feed-forward student models that decouple prosodic from spectral modeling. We, in contrast, retain the full S2S framework in our implementation. Hierarchical representations and controllability have been explored together in [24, 25], though these approaches lack the level of interpretability needed for fine word-level control. The work of [26] targets interpretable and controllable hierarchical prosodic controls and comes closest to the approach we pursue. However, their disentanglement is data-driven and leaves some residual couplings between the (pitch and duration) dimensions we control separately; we model f0 dynamics (as opposed to levels), which is more perceptually relevant to realizing emphatic focus; and, as we will see in the next section, our controllable systems attain the same or a higher level of quality when introducing this prosodic variation. Prosody transfer across databases bearing different labels is one of the main applications of our framework, as we will discuss in Sec. 4. The works of [27, 28] pursue similar goals although, being based on global sentence-level embeddings, they do not address fine-level control as we do.
4. EVALUATION
The training material comprised four corpora from three professional native speakers of US English, broken down as follows: a set of 10.8K sentences from a male speaker (M); a set of 1K sentences from the same male speaker, where each sentence contains several emphasis-bearing words (M_emp); and two corpora from two distinct female speakers (F1 and F2) containing approximately 17.3K and 11K sentences, respectively. The corpus M_emp was collected by indicating to the speaker the emphasis-bearing words within each sentence, and instructing him to realize an emphatic level of prominence on those target words. His prosodic realizations differ in marked ways from the style of broad-focus prosody in terms of tempo, relative pitch-accent height, and disjuncture from adjacent material. The sentences were intended to serve as elicitors for various cases of narrow focus (e.g., contrast, disambiguation, etc.). Notice that labeled data is available for only one speaker, and that the size of this corpus is considerably smaller than that of the base corpora. A sentence from M_emp contains three emphatic words on average, and the overall percentage of such words was approximately 23%. We define the following data partitions to facilitate the ensuing discussion: a set with all the resources pooled, including the emphatic data, D_emp = {M, 10×M_emp, F1, F2}, and a base set D_base = {M, F1, F2}. Note that D_emp uses 10-fold replication of the M_emp subset to compensate for its lower prior.

We would like to investigate the trade-offs between approaches that use labeled data (when available) and the fully unsupervised approach that is possible within the framework proposed in Sec. 2. To that end, consider the following systems:

• Base (NoEmph):
A baseline S2S system, which uses global (sentence-level) prosodic controls, but no word-level prosodic control. The training set (D_emp) subsumes the emphatic data, but no other emphasis-marking feature is used.
• Base (Sup): A baseline system with Classic Supervision (as in Sec. 2) and global controls, trained with D_emp and an explicit binary feature encoding the location of emphasis.
• PC-Unsup: A fully unsupervised system (as per Sec. 2) with variable prosodic control, where both the S2S and prosody-prediction components are trained with D_base.
• PC-Hybrid: A Hybrid system with variable prosodic control, trained with D_emp and an explicit Boolean emphasis indicator, as in the Base (Sup) model.
Table 1. Summary of the different properties and training strategies among the different systems evaluated.

                    Base (NoEmph)   Base (Sup)   PC-Unsup   PC-Hybrid
Control?            N               Y            Y          Y
Type of control     None            Binary       Tunable    Binary/Tunable
Training data       D_emp           D_emp        D_base     D_emp
Emph. feature?      N               Y            N          Y

The architecture of
Base (NoEmph) with global controls was already presented and evaluated in [5]. Since it lacks fine-grained lexical prosodic control, we do not expect it to perform well on an emphasis-evaluation task. It is used here, however, to provide a strong anchor point with respect to overall quality, ensuring that the alternative proposals do not degrade the naturalness afforded by this approach. A common LPCNet neural vocoder, also trained in a multi-speaker fashion using D_base, was used for all experiments [2].

Model selection and tuning were done as follows. First, for the prosodic sub-network, 10% of the training data was held out to do a grid search over structures and learning rate by tracking the held-out loss. The models thus selected were, for the PC-Unsup condition, a stack of 5 blocks with 175 hidden units in the Bi-LSTM layer, and, for the PC-Hybrid model, a stack of 4 blocks with 200 hidden units in the Bi-LSTM layer. The speaker embedding was of dimension 20 in both cases. Once this was fixed, a development set of 20 sentences not used in training was used to perceptually tune the remaining hyper-parameters of the different configurations, including the dimension of the emphasis-embedding space (dim = 8 for the PC-Hybrid model, and likewise for the Base (Sup) system), and the run-time additive word-level boosting parameters (γ, δ) (see the "control offset" component G in Fig. 1, and Fig. 4) for the PC-Unsup and PC-Hybrid word-level controls (each tuned to a fixed pair of positive values below 1). These word-level offsets were applied only to the item in a sentence intended to be the focus carrier; the predictions of the prosodic-control model remain unboosted for all other lexical items. Sentence-level boosting was not found to provide any advantage over word-level boosting, and the parameters (α, β) (see Fig. 4) were therefore only used for the two reference systems (Base (NoEmph) and Base (Sup)), likewise set to a fixed pair of values below 1. The non-negative boosting values we employ match our theoretical expectations, and what we empirically observe in the M_emp subset: focused items receive more pronounced pitch accents and slower speaking rates/longer durations. We observed that the Base (Sup) system already realized these tempo differences quite well, and we only boosted the pitch excursions when tuning the PC-Hybrid systems. In general, we find that, after tuning, a single set of boosting parameters works quite well across a variety of sentences and voices.

We wish to evaluate how the different multi-speaker approaches we have described fare in a perceptual listening task. In particular, we are interested in examining two test-case scenarios. In the first case, we operate under the assumption that the target synthesis voice matches a speaker for whom we have existing training data (i.e., the matched condition). In the second, and more interesting, case we assume that the target synthesis voice lacks any such labeled resources for training (though some exist for a separate speaker), so that any use the system makes of supervised information is done indirectly by transferring knowledge from one speaker to another (we refer to this as the transplant condition). Notice that the distinction we have just introduced applies to the systems that are sensitive to supervision in some way (i.e., Base (Sup) and PC-Hybrid); system PC-Unsup, by construction, is not.

To evaluate the systems defined in the previous section, while addressing the matched and transplant cases respectively, we conducted two independent listening tests where the target speakers were M (whose training data contains an emphatic subset) and F1 (whose training data does not; in informal listening, we found F1 and F2 to be of comparable quality, so only one voice was selected to keep the test manageable). No natural recordings were included (which could have provided a topline performance), since no common set of utterances with emphasis existed for both voices and we wanted to run parallel tests. Instead, we opted for an evaluation set of 43 unseen sentences, each containing a single focused word.

The listening tests were designed to evaluate the systems in terms of two attributes on 5-point scales: (i) how well they realize narrow focus on a given word, and (ii) the overall quality of the sentence. Listeners were recruited through a crowd-sourcing platform and presented with one audio sample at a time, accompanied by a transcript of the text where the intended focus-carrying word had been capitalized. (Samples and additional listening-test details are available at http://ibm.biz/SLT2021.) To facilitate comprehension of the task, we provided the listeners with the following instructions and collected their responses on the provided 5-point scales:

"The UPPERCASE word (excluding the word "I", if it exists) in the text above should sound emphasized in this sample. Assess the level of emphasis you hear in the UPPERCASE word. It sounds: 1 (neutrally spoken), 2, 3 (somewhat emphasized), 4, 5 (definitely emphasized). Assuming the UPPERCASE word is emphasized as requested, rate the overall quality and naturalness of this audio sample: 1 (Bad), 2 (Poor), 3 (Fair), 4 (Good), 5 (Excellent)."

Table 2. MOS (σ) results for the matched condition (M). For emphasis, all systems are statistically significantly different from each other. For quality, there are no statistically significant differences between the pairs {Base (NoEmph), PC-Unsup} and {Base (Sup), PC-Hybrid}; all other pairwise differences are significant. Significance is assessed at the p = 0.05 level via one-tailed t-tests.

System           Emphasis      Quality
Base (NoEmph)    2.21 (1.3)    3.87 (0.8)
Base (Sup)       4.08 (1.0)    4.10 (0.1)
PC-Unsup         3.35 (1.2)    3.82 (0.9)
PC-Hybrid        3.96 (1.0)    4.08 (0.8)
Each {sentence, system} combination received 25 independent rating tuples (one rating for each of the two attributes). The texts were designed to make the choice of focus semantically congruent with the context-providing sentence. Tables 2-3 summarize the results in terms of Mean Opinion Scores (MOS), standard deviation (σ), and pairwise statistical significance.

Table 3. MOS (σ) results for the transplant condition (F1). All pairwise differences are statistically significant for emphasis. For quality, {Base (Sup), PC-Unsup} are statistically equivalent; all other pairwise differences are statistically significant. Significance is assessed at the p = 0.05 level via one-tailed t-tests.

System           Emphasis      Quality
Base (NoEmph)    2.20 (1.3)    3.87 (0.9)
Base (Sup)       3.71 (1.2)    3.97 (0.9)
PC-Unsup         3.58 (1.1)    3.97 (0.9)
PC-Hybrid        4.02 (1.0)    4.08 (0.8)
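For reference, pairwise comparisons of this kind can be computed with a one-tailed two-sample t-test along the following lines. This helper is a hypothetical sketch (the paper does not specify, for example, whether an equal-variance assumption or a Welch correction was used), shown with illustrative ratings rather than the study's data.

```python
import numpy as np
from scipy import stats

def one_tailed_ttest(scores_a, scores_b):
    """One-tailed two-sample t-test: is system A rated higher than system B?
    Uses Welch's correction (an assumption; the paper does not say)."""
    t, p_two = stats.ttest_ind(scores_a, scores_b, equal_var=False)
    # Convert the two-tailed p-value to one-tailed for the A > B direction.
    p_one = p_two / 2 if t > 0 else 1 - p_two / 2
    return t, p_one

# Illustrative per-rating emphasis scores for two hypothetical systems.
a = np.array([4, 5, 4, 3, 5, 4])
b = np.array([2, 3, 2, 1, 3, 2])
t, p = one_tailed_ttest(a, b)
print(f"t = {t:.2f}, one-tailed p = {p:.4f}")  # significant if p < 0.05
```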
5. DISCUSSION AND CONCLUSIONS
From these evaluations, we can make the following remarks for both speakers. All controllable systems achieved a much higher degree of emphasis than Base (NoEmph) (which, as expected, attained low scores in terms of emphasis realization), and this was achieved at no expense of overall quality, since the remaining systems are statistically better than or equal to it. We hypothesize that this improvement in quality is due to the fact that conditioning on additional prosodic attributes of the outputs steers the model toward more natural (and stable) points during training. We observe differences between the approaches, however, when comparing the matched vs. transplant conditions: when labeled data is available for a target speaker, our experiments suggest that the fully supervised approach offers the best operating point in terms of both quality and emphasis (Table 2). This approach, however, does not generalize to a new target speaker lacking labeled data as well as the hybrid approach does (Table 3). For the latter case, combining supervision with the prosodic-conditioning framework improves performance on both attributes when training a multi-speaker model to enable the transfer of knowledge. Lastly, we see that even lacking any labeled data, the framework is able to provide a good operating point of quality and emphasis control by boosting the predictions of the fully unsupervised model. This is facilitated by our use of a set of controls that are readily interpretable and can be perceptually linked to the task at hand. Though the results are very encouraging, some difficult test cases remain. For instance, we have observed in informal listening the challenge posed by some function words, particularly clitics or words containing only unstressed vowels in broad-focus realizations.

We have introduced and validated a framework that allows a finer degree of control over lexical prosody to guide the realization of narrow focus in S2S synthesis. This framework encompasses a set of user-driven controls that meet the criteria highlighted and advocated for in Sec. 1: they consist of a low-dimensional representation of prosody; they are intuitive, in the sense that changes to the controls map to identifiable perceptual effects in the output; and they offer a mechanism that disentangles different components of prosody (duration and pitch) so that they can be tuned separately. The approach requires only a moderate amount of knowledge external to the framework, in the form of coarse word-level alignments, and we have shown that it can accommodate various degrees of supervision depending on available resources, with different variants bringing different strengths depending on the operating conditions (e.g., synthesizing from a speaker with labeled supervised data vs. transplanting to a novel speaker that lacks such resources).

We note that this framework can also be extended to other levels of the prosodic hierarchy to explore expressive effects beyond localized narrow focus. For instance, incorporating the intonational phrase into the analysis might provide a way to better model the pitch reset associated with parentheticals. Addressing the shortcomings mentioned above and incorporating these extensions remain the subject of ongoing and future work.
6. REFERENCES

[1] J. Shen, R. Pang, R. J. Weiss, M. Schuster, N. Jaitly, Z. Yang, Z. Chen, Y. Zhang, Y. Wang, R. Skerry-Ryan, R. A. Saurous, Y. Agiomyrgiannakis, and Y. Wu, "Natural TTS synthesis by conditioning WaveNet on mel spectrogram predictions," in Proc. ICASSP, Calgary, Canada, 2018, pp. 4779-4783.
[2] S. Shechtman, R. Rabinovitz, A. Sorin, Z. Kons, and R. Hoory, "Controllable sequence-to-sequence neural TTS with LPCNet backend for real-time speech synthesis on CPU," CoRR, 2020.
[3] J.-M. Valin and J. Skoglund, "LPCNet: Improving neural speech synthesis through linear prediction," in Proc. ICASSP, Brighton, U.K., 2019, pp. 5891-5895.
[4] J. Pitrelli, R. Bakis, E. M. Eide, R. Fernandez, W. Hamza, and M. A. Picheny, "The IBM expressive text-to-speech synthesis system for American English," IEEE Trans. Audio, Speech, and Lang. Processing, vol. 14, no. 4, pp. 1099-1108, July 2006.
[5] S. Shechtman and A. Sorin, "Sequence to sequence neural speech synthesis with prosody modification capabilities," in Proc. SSW10, Vienna, Austria, 2019, pp. 275-280.
[6] Y. Mass, S. Shechtman, M. Mordechay, R. Hoory, O. S. Shalom, G. Lev, and D. Konopnicki, "Word emphasis prediction for expressive text to speech," in Proc. Interspeech, Hyderabad, India, 2018, pp. 2868-2872.
[7] J. L. Ba, J. R. Kiros, and G. E. Hinton, "Layer normalization," CoRR, vol. abs/1607.06450, 2016.
[8] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," in Proc. ICLR, San Diego, CA, 2015.
[9] A. Raux and A. W. Black, "A unit selection approach to F0 modeling and its application to emphasis," in Proc. ASRU, Saint Thomas, VI, 2003, pp. 700-705.
[10] R. Fernandez and B. Ramabhadran, "Automatic exploration of corpus-specific properties for expressive text-to-speech: A case study in emphasis," in Proc. SSW6, Bonn, Germany, 2007, pp. 34-39.
[11] V. Strom, A. Nenkova, R. Clark, Y. Vazquez-Alvarez, J. Brenier, S. King, and D. Jurafsky, "Modelling prominence and emphasis improves unit-selection synthesis," in Proc. Interspeech, Antwerp, Belgium, 2007, pp. 1282-1285.
[12] K. Yu, F. Mairesse, and S. Young, "Word-level emphasis modelling in HMM-based speech synthesis," in Proc. ICASSP, Dallas, TX, 2010, pp. 4238-4241.
[13] F. Meng, Z. Wu, H. M. Meng, J. Jia, and L. Cai, "Hierarchical English emphatic speech synthesis based on HMM with limited training data," in Proc. Interspeech, Portland, OR, 2012, pp. 466-469.
[14] Q. T. Do, T. Toda, G. Neubig, S. Sakti, and S. Nakamura, "A hybrid system for continuous word-level emphasis modeling based on HMM state clustering and adaptive training," in Proc. Interspeech, San Francisco, CA, 2016, pp. 3196-3200.
[15] S. Shechtman and M. Mordechay, "Emphatic speech prosody prediction with deep LSTM networks," in Proc. ICASSP, Calgary, Canada, 2018, pp. 5119-5123.
[16] Y. Wang, D. Stanton, Y. Zhang, R. J. Skerry-Ryan, E. Battenberg, J. Shor, Y. Xiao, F. Ren, Y. Jia, and R. A. Saurous, "Style tokens: Unsupervised style modeling, control and transfer in end-to-end speech synthesis," CoRR, vol. abs/1803.09017, 2018.
[17] R. J. Skerry-Ryan, E. Battenberg, Y. Xiao, Y. Wang, D. Stanton, J. Shor, R. J. Weiss, R. Clark, and R. A. Saurous, "Towards end-to-end prosody transfer for expressive speech synthesis with Tacotron," CoRR, vol. abs/1803.09047, 2018.
[18] Y. Lee and T. Kim, "Robust and fine-grained prosody control of end-to-end speech synthesis," in Proc. ICASSP, Brighton, U.K., 2019, pp. 5911-5915.
[19] R. Valle, J. Li, and B. Catanzaro, "Mellotron: Multispeaker expressive voice synthesis by conditioning on rhythm, pitch and global style tokens," in Proc. ICASSP, Barcelona, Spain, 2020, pp. 6189-6193.
[20] J. Park, K. Han, Y. Jeong, and S. W. Lee, "Phonemic-level duration control using attention alignment for natural speech synthesis," in Proc. ICASSP, Brighton, U.K., 2019, pp. 5896-5900.
[21] X. Zhu, S. Yang, G. Yang, and L. Xie, "Controlling emotion strength with relative attribute for end-to-end speech synthesis," in Proc. ASRU, Singapore, 2019, pp. 192-199.
[22] Y. Ren, Y. Ruan, X. Tan, T. Qin, S. Zhao, Z. Zhao, and T.-Y. Liu, "FastSpeech: Fast, robust and controllable text to speech," in Advances in Neural Information Processing Systems 32, 2019, pp. 3171-3180.
[23] Y. Ren, C. Hu, X. Tan, T. Qin, S. Zhao, Z. Zhao, and T.-Y. Liu, "FastSpeech 2: Fast and high-quality end-to-end text to speech," CoRR, 2020.
[24] X. An, Y. Wang, S. Yang, Z. Ma, and L. Xie, "Learning hierarchical representations for expressive speaking style in end-to-end speech synthesis," in Proc. ASRU, Singapore, 2019, pp. 184-191.
[25] W.-N. Hsu, Y. Zhang, R. J. Weiss, H. Zen, Y. Wu, Y. Wang, Y. Cao, Y. Jia, Z. Chen, J. Shen, P. Nguyen, and R. Pang, "Hierarchical generative modeling for controllable speech synthesis," in Proc. ICLR, New Orleans, LA, 2019.
[26] G. Sun, Y. Zhang, R. J. Weiss, Y. Cao, H. Zen, and Y. Wu, "Fully-hierarchical fine-grained prosody modeling for interpretable speech synthesis," in Proc. ICASSP, Barcelona, Spain, 2020, pp. 6264-6268.
[27] V. Klimkov, S. Ronanki, J. Rohnke, and T. Drugman, "Fine-grained robust prosody transfer for single-speaker neural text-to-speech," in Proc. Interspeech, Graz, Austria, 2019, pp. 4440-4444.
[28] Y.-J. Zhang, S. Pan, L. He, and Z.-H. Ling, "Learning latent representations for style control and transfer in end-to-end speech synthesis," in Proc. ICASSP, Brighton, U.K., 2019.