Hierarchical Sequence to Sequence Voice Conversion with Limited Data
Praveen Narayanan, Punarjay Chakravarty, Francois Charette, Gint Puskorius
Ford Greenfield Labs, Palo Alto, CA
{pnaray11, pchakra5, fcharett, gpuskori}@ford.com

Abstract
We present a voice conversion solution using recurrent sequence to sequence modeling with DNNs. Our solution takes advantage of recent advances in attention based modeling in the fields of Neural Machine Translation (NMT), Text-to-Speech (TTS) and Automatic Speech Recognition (ASR). The problem consists of converting between voices in a parallel setting, when <source, target> audio pairs are available. Our seq2seq architecture makes use of a hierarchical encoder to summarize input audio frames. On the decoder side, we use an attention based architecture used in recent TTS works. Since there is a dearth of the large multispeaker voice conversion databases needed for training DNNs, we resort to training the network with a large single speaker dataset as an autoencoder. This is then adapted for the smaller multispeaker voice conversion datasets available for voice conversion. In contrast with other voice conversion works that use F0, duration and linguistic features, our system uses mel spectrograms as the audio representation. Output mel frames are converted back to audio using a wavenet vocoder.

Index Terms: voice conversion, seq2seq, TTS, ASR, DNNs, attention
1. Introduction
Recently, sequence to sequence models have been adapted with great success in producing realistic sounding speech in TTS systems [1, 2, 3, 4, 5]. Likewise, it has been demonstrated that ASR can be handled excellently by seq2seq architectures. In TTS, the system takes in a text or phoneme sequence and outputs a speech representation. In ASR, on the other hand, one feeds in an audio representation, and the system performs the task of classifying audio into text or phonemes. In voice conversion, both the input and output sequences are audio representations. The problem is related to both ASR and TTS: like in ASR, the DNN must learn to summarize input audio frames into a hidden context, and like in TTS, it must decode audio frames from the latent context in a temporal, attentive fashion.

In voice conversion, we seek to convert a speech utterance from a source speaker A to make it sound like an utterance from a target speaker B. There are two pertinent scenarios, the first of which is when both the source and target speakers are uttering the same text (the 'parallel' case), and the second is when the utterances don't match (the 'non-parallel' case). We focus on parallel voice conversion with DNNs in this work. While the larger goal of this work is to address the more important problem of non-parallel voice conversion (producing parallel datasets for conversion is not easy), we start with the arguably simpler task of demonstrating how we can achieve this in the parallel scenario using seq2seq models.

Figure 1: System diagram: our attention based encoder-decoder architecture for voice conversion takes in a mel-spectrogram for the source speaker and outputs the mel-spectrogram for the target speaker.
While we could go about the voice conversion task by first performing ASR on the source voice and then sending the resulting text to a TTS engine, our approach leads to an end-to-end solution wherein one doesn't have to train an ASR and a TTS engine separately. Our approach has a simpler processing pipeline, as it only needs audio transcripts (with no accompanying text or need for segmentation), which can be converted directly to target representations. A notable aspect of this work is that it gets around the problem of limited parallel data for voice conversion by pretraining on a much larger single speaker TTS corpus as an autoencoder and then performing transfer learning on the available, diminutive voice conversion datasets. Without this adaptation technique it becomes difficult to carry out voice conversion effectively without access to larger, expensive-to-obtain parallel datasets.

We train the system using Maximum Likelihood to minimize the L1 error between the generated and target mel spectrograms.
2. Related Work
The traditional pipeline for parallel voice conversion uses Gaussian Mixture Models (GMMs) [6, 7, 8] or Deep Neural Networks (DNNs) [9, 10, 11, 12]. After first aligning source and target features using Dynamic Time Warping (DTW) [13], the model is trained so that it learns to produce the target given the source features for each frame. A disadvantage of these methods is that they need aligned, parallel data. Moreover, conversions performed on spectral features disregard dependencies on other controlling factors such as prosody, fundamental frequency, duration and rhythm. Furthermore, transforming features on a frame basis disregards temporal context dependencies. The dependence on fundamental frequency is often handled by performing a linear transformation in the logarithmic domain. A good review of the topic is found in [14].

Non-negative Matrix Factorization (NMF) [15], traditionally used for sound source separation and speech enhancement, has also been used for VC [16]. NMF factorizes a matrix into two non-negative factors: the basis or dictionary matrix and the activation matrix. In the case of VC with parallel training data, dictionaries for speaker 1 and speaker 2 are first constructed separately. Subsequently, given test source data (speaker 1), the previously learnt dictionary for speaker 1 is used to factorize the source voice into a set of source activations, or contributions of the speaker 1 dictionary to the speaker 1 utterance. The same is done for speaker 2. The activations for the source speaker utterance are then combined with the dictionary atoms of the target speaker to transform the speaker 1 utterance into speaker 2's. NMF based methods, like GMMs, also require alignment of parallel voice samples using Dynamic Programming, and other pre-processing steps such as the Short-Time Fourier Transform (STFT). A sketch of this dictionary/activation scheme is given below.
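As an illustration of the dictionary/activation scheme just described (not the cited works' implementation), here is a minimal sketch using scikit-learn; the spectrogram matrices S1, S2, S_test, the atom count and the prior DTW alignment are assumed inputs:

```python
# Illustrative sketch of NMF-based parallel VC (not from [16]); S1, S2 are
# DTW-aligned magnitude spectrograms (freq x frames) for speakers 1 and 2.
import numpy as np
from sklearn.decomposition import NMF

def learn_dictionary(S, n_atoms=64):
    # Factorize S ~= W @ H into a dictionary W (freq x atoms)
    # and activations H (atoms x frames).
    model = NMF(n_components=n_atoms, init="nndsvda", max_iter=500)
    H = model.fit_transform(S.T).T
    W = model.components_.T
    return W, H, model

W1, _, model1 = learn_dictionary(S1)   # speaker 1 dictionary
W2, _, _ = learn_dictionary(S2)        # speaker 2 dictionary

# At test time: explain a new speaker-1 utterance S_test with speaker 1's
# dictionary, then resynthesize its activations with speaker 2's atoms.
H_test = model1.transform(S_test.T).T
S_converted = W2 @ H_test
```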
Recent sequence to sequence modeling approaches for voice conversion have largely been inspired by advances in seq2seq practice in NMT, TTS and ASR, in that they involve an encoder-decoder model as the underlying machinery. It is often advantageous to classify the input waveform into text or phonemes, and to use that information to inform the decoder model of the content that the input audio representation embodies [17, 18]. Our work is most similar to [17], and we compare and contrast salient aspects of both models. In both, the overall architecture is a seq2seq model inspired by the ASR work [19] (with a hierarchical encoder stack) and by the TTS work Tacotron [1] and its derivatives. However, in [17] the encoder outputs are augmented by features extracted with an ASR model, while our approach comprises end-to-end neural networks without the need for labeled data. There are also several ancillary components in [17], such as additional losses and postprocessing networks. Also, we use a convolutional filter bank and highway network to extract 'context' as a prelude to processing in the multilayer hierarchy of encoder RNNs. Nevertheless, we wish to emphasize that a substantive difference between the two works is the training philosophy: how the data limitation problem is handled. We elaborate on this in the following paragraphs with additional examples from the literature.

Pertinent to our discussion are the seq2seq modeling works [20, 21], in which additional loss terms are introduced to encourage the model to learn alignment and to preserve linguistic context. Alignment is maintained by noting that the attention curve is predominantly diagonal between source and target (in the voice conversion problem), and by including a diagonal penalty matrix in the loss function, a term referred to as guided attention in the TTS work [22] and reproduced below for reference.
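The guided attention penalty of [22] can be written as

$$\mathcal{L}_{att} = \frac{1}{NT}\sum_{n=1}^{N}\sum_{t=1}^{T} A_{nt}\, W_{nt}, \qquad W_{nt} = 1 - \exp\!\left(-\frac{(n/N - t/T)^2}{2g^2}\right)$$

where $A_{nt}$ is the attention matrix over $N$ encoder and $T$ decoder positions, and $g$ controls the width of the band around the diagonal within which attention mass goes unpenalized ($g = 0.2$ in [22]).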
An additional consideration in [20, 21] is to prevent the decoder from 'losing' linguistic context, as would arise when it simply learns to reconstruct the output of the target. This was addressed by using additional neural networks that ensure that the hidden representation produced by the encoder (similar reasoning applies to the decoder) is capable of reconstructing the input, and thereby retains context information. These manifest as additional loss terms (we also note a similarity to cycle consistency losses [23]) that the authors call 'context preservation losses'. Also noteworthy is that these approaches use non-recurrent architectures for their seq2seq modeling.

We suspect that the problems motivating the design of these additional losses have their provenance in the diminutive size of the training corpus (CMU Arctic was used, with only about a thousand utterances per speaker), which is hardly sufficient to learn a diverse representation with good generalization capabilities. In our work, we arrive at a slightly different way to overcome the data limitation problem as compared with these works, which do so by augmenting data with ASR training [17, 18] and by introducing additional losses [20, 21]. Our solution makes use of transfer learning: we first pretrain with a large, single speaker corpus, and then adapt to the smaller, pertinent corpus (CMU Arctic) in question.

Developments on the generative modeling front (primarily Variational Autoencoders [24] and Generative Adversarial Networks [25]) have led to their use in voice conversion problems. In [26], a learned similarity metric obtained through a GAN discriminator is used to correct the oversmoothed speech that results from maximum likelihood training, which imposes a particular form for the loss function (usually the MSE). A conditional VAE-GAN [27] setup is used in [28] to implement voice conversion, with conditioning on speakers, together with a Wasserstein GAN discriminator [29] to fix the blurriness issue associated with VAEs. Moreover, an important apparatus for training non-parallel voice setups consists of cycle consistency losses from the well-known CycleGAN work [23] for images; this forms a building block in the papers [30] and [31]. A natural extension of our work is to explore a generative solution to voice conversion, as in some of the works above, in order to apply our architectural components to non-parallel setups.

Our work is influenced by recent TTS works involving transfer learning and speaker adaptation. The recently published work [32] demonstrates a methodology to adapt a trained wavenet based TTS network to new speakers. Likewise, in [33], a speaker embedding is extracted using a discriminative network for unseen, new speakers, and is then used to condition a TTS pipeline similar to Tacotron. This philosophy is also used in [34], where schemes are used to learn speaker embeddings extracted separately or trained as part of the model during adaptation. In all these contexts, the emphasis is on adapting to small, limited data corpora, thereby circumventing the need to obtain large datasets to train these models from scratch. In our work, we use the same idea to get around the problem of not having enough data in the voice conversion dataset under consideration. However, instead of producing new speaker embeddings, we retrain the model for each new <source, target> pair, a process that is rapid owing to the small size of the corpus.

An interesting alternative to recurrent (or autoregressive) seq2seq modeling for TTS or VC is to use differentiable memory as a way to store speech related information. In the VoiceLoop architecture [35, 36, 37], the input is transformed with a shallow fully connected network into a context, with attention being used to compare against a memory buffer. The memory buffer itself is updated by replacing its first element with a newly computed representation vector. With this approach, which also uses speaker embeddings, the network is able to adapt to new speakers with only a few samples, in addition to having much reduced network complexity (only shallow fully connected layers are used).
3. Architecture
We use an attention based encoder-decoder network for our voice conversion task. The network architecture borrows heavily from recent developments in TTS [1] and ASR [19]. The system takes in an audio representation (a mel-spectrogram) as input and encodes it into a hidden representation in recurrent fashion. This hidden representation is then processed by an attention based decoder into output mel-spectrograms. To convert the mel frames back to audio, we employ a widely used wavenet vocoder implementation available online [38]. In the Tacotron 2 work [2], it was demonstrated that using wavenet as a neural vocoder produced audio samples whose quality was superior to those from the Griffin-Lim procedure used in Tacotron [1]. A system diagram showing the various components of the model is shown in Figure 1. We describe its components in the following subsections.
Figure 2: The Pre-net and the CBH layers that are used to process the input mel-spectrogram frames. Output tensor sizes at each step of processing are indicated by the side of each unit.
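For concreteness, here is a minimal sketch of computing an 80-band mel-spectrogram input of the kind the network consumes; librosa is used for illustration, and the STFT parameters, log compression and file name are our assumptions, not the paper's settings:

```python
# Illustrative 80-band mel-spectrogram extraction; parameters are assumed.
import librosa
import numpy as np

wav, sr = librosa.load("utterance.wav", sr=22050)        # placeholder file
mel = librosa.feature.melspectrogram(y=wav, sr=sr,
                                     n_fft=1024, hop_length=256, n_mels=80)
log_mel = np.log(np.clip(mel, 1e-5, None))               # (80, T) network input
```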
Prenet

The prenet is a bottleneck layer containing full connections with a ReLU nonlinearity and dropout [1, 39]. The purpose of this layer is to enable the model to generalize to unseen input through dropout. Other mechanisms to achieve this effect in sequence models are teacher forcing, scheduled sampling and professor forcing [40, 41, 42]. The prenet processes vectors of 80 dimensions and yields output of the same size, with a dropout ratio of 0.5. A minimal sketch follows.
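A minimal PyTorch sketch of this module, with the FC-80 size and 0.5 dropout from Table 1; keeping dropout active at inference time is a common Tacotron-style choice and an assumption here, not something the paper specifies:

```python
# Bottleneck prenet sketch: full connection + ReLU + dropout.
import torch
import torch.nn as nn
import torch.nn.functional as F

class Prenet(nn.Module):
    def __init__(self, in_dim=80, out_dim=80, p=0.5):
        super().__init__()
        self.fc = nn.Linear(in_dim, out_dim)
        self.p = p

    def forward(self, x):
        # training=True keeps dropout on even in eval mode (an assumption,
        # following common Tacotron-style prenets)
        return F.dropout(F.relu(self.fc(x)), p=self.p, training=True)
```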
CBH Layers

Originally proposed in the context of Neural Machine Translation [43], and later used in [1] where it was named CBHG (Convolutional Banks, Highway and Gated Recurrent Units), this layer serves as a processing mechanism to accumulate 'word' level context when the input is text. For our voice conversion task the effect is similar, in that neighboring speech frames are filtered so as to abstract the equivalent phoneme level representation, mixed with speaker characteristics and prosodic content. Together with the hierarchical RNN encoder units described later, this assemblage can be viewed as an implementation of CBHG. The Pre-net and CBH layers, along with the tensor output sizes at each step of the processing, are shown in Figure 2 and described below.

Convolutional Filter Banks
A bank of 1-D convolutional filters of increasing widths is used to capture n-grams of varying width. The input sequence is convolved with each of these filters and the results from all filters are concatenated together. Each convolution filter preserves the original length of the sequence by padding, extending the original length by w − 1 for a filter of width w, followed by BatchNorm and ReLU operations. The filter maps obtained are then stacked in the channel dimension, with 480 output channels produced per convolution. This is followed by a max-pool operation with stride 1, which maintains the length of the sequence. A 1-D convolutional projection then reduces the representation back to the original size, followed by a final linear layer that also maintains the representation length. A sketch of the filter bank is given below.
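A PyTorch sketch of the bank, with 480 channels per filter width as in Table 1; the largest width K is an assumption (the exact value is elided in the source), and the trailing convolutional projection and linear layer are omitted for brevity:

```python
# 1-D convolution bank sketch: one conv per n-gram width k = 1..K.
import torch
import torch.nn as nn

class ConvBank(nn.Module):
    def __init__(self, in_dim=80, channels=480, K=8):
        super().__init__()
        # padding of k-1 followed by cropping keeps the sequence length fixed
        self.convs = nn.ModuleList(
            [nn.Conv1d(in_dim, channels, k, padding=k - 1) for k in range(1, K + 1)])
        self.bns = nn.ModuleList([nn.BatchNorm1d(channels) for _ in range(K)])
        self.pool = nn.MaxPool1d(kernel_size=2, stride=1, padding=1)

    def forward(self, x):                            # x: (batch, in_dim, T)
        T = x.size(-1)
        outs = [torch.relu(bn(conv(x)[:, :, :T]))    # crop back to length T
                for conv, bn in zip(self.convs, self.bns)]
        y = torch.cat(outs, dim=1)                   # stack in channel dimension
        return self.pool(y)[:, :, :T]                # stride-1 max-pool, length T
```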
Highway Layers

The highway layer resembles a Resnet block: a skip connection provides a shortcut for information flow past the intermediate layers, but with learnable weights that determine the extent of the skip. We use 4 highway layers.
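A minimal sketch of a single highway layer (assuming PyTorch); the gating formulation here is the standard one, with sizes left generic:

```python
# Highway layer sketch: a learned gate t blends the nonlinear path
# against the identity path.
import torch
import torch.nn as nn

class Highway(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.h = nn.Linear(dim, dim)   # candidate transform
        self.t = nn.Linear(dim, dim)   # transform gate

    def forward(self, x):
        t = torch.sigmoid(self.t(x))
        return t * torch.relu(self.h(x)) + (1.0 - t) * x
```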
Hierarchical Recurrent Encoder

We design our encoder as a stack of bidirectional layers, reducing the sequence length by a factor of 2 at each level as the data flows up the stack (Figure 3). This construction was first proposed in [19] in the context of speech recognition with DNNs. The encoder's task is to summarize the audio input into an intermediate hidden representation embodying linguistic content, akin to text. However (and this might be argued to be a desirable attribute of DNN processing), we make no attempt to explicitly disentangle content (text) and voice characteristics (style). We assume that the DNN automatically learns to disentangle content and style as part of the training process, and that during decoding the first speaker's voice characteristics are discarded and the second speaker's voice is injected into the content.

The reasoning behind the hierarchical reduction of timesteps is that speech frames are highly inflated, redundant descriptors of linguistic content mixed with speaker and duration information. A single phoneme could thus span several (∼10) frames.
It therefore makes sense to reduce or cluster the speech frames so that they contain more relevant information. By reducing the number of timesteps, we implicitly perform this clustering operation, distilling the pertinent linguistic content at the top of the stack. The reduction in timesteps is also favorable for learning attention: since the decoder examines all the frames of the encoder to extract attention parameters, it is useful to aggregate relevant information so that the decoder has a smaller set to work with, which speeds up the computation and helps the model learn alignment.

To reduce the number of input timesteps, we accumulate two neighboring frames and pass the concatenated features along to the bidirectional RNN layer above. In our experiments, we use a stack of two recurrent reduction layers, resulting in an overall reduction in the number of timesteps by a factor of 4.

The basic unit of the hierarchical recurrent encoder is the bi-directional GRU. This bi-directional Gated Recurrent Unit (shown separately in the diagram as L → R and R → L) passes over the input sequence twice, from left to right and from right to left, and concatenates the two passes. Each GRU has 150 hidden units and outputs a 150×T matrix, where T is the input sequence length. After concatenation of the L → R and R → L outputs, one gets a 300×T matrix. The sequence length itself is reduced to T/2 after GRU1 and to T/4 after GRU2 as a result of accumulating two neighbouring frames at each step. GRU0 is a pre-processing recurrent unit that does not have this accumulation and reduction of timesteps. The details of the pyramidal encoding, with tensor sizes after each step, are shown in Figure 3; a sketch of one reduction step follows the figure caption below.
Figure 3: Hierarchical bi-directional recurrent encoder, with an indication of the tensor sizes at each step. The number of hidden units in each GRU is 150. Each pyramidal GRU unit (GRU 1 and 2) decreases the sequence length by 1/2. Left-right and right-left GRU units each output a 150×T matrix; these are concatenated to give a 300×T matrix, with T the input sequence length.
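A minimal sketch of one pyramidal reduction step, assuming PyTorch; the sizes follow Figure 3 (150 hidden units per direction), and the even-length assumption is ours:

```python
# One pyramidal BiGRU step: concatenate neighboring frames, halving T.
import torch
import torch.nn as nn

class PyramidalBiGRU(nn.Module):
    def __init__(self, in_dim=300, hidden=150):
        super().__init__()
        # input size doubles because two frames are stacked together
        self.gru = nn.GRU(2 * in_dim, hidden, batch_first=True, bidirectional=True)

    def forward(self, x):                    # x: (batch, T, in_dim), T assumed even
        b, t, d = x.shape
        x = x.reshape(b, t // 2, 2 * d)      # concatenate neighboring frames
        out, _ = self.gru(x)                 # (batch, T/2, 300)
        return out
```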
Decoder

The decoder architecture is inspired by the Tacotron TTS setup [1]. As in the Tacotron work, the decoder has the following components:

1. Prenet
2. Attention RNN
3. Decoder RNNs with residuality

We describe these components in more detail below. Before doing so, however, it is useful to have in mind an overall picture of how data flows through the decoder stack, so we first present a brief high level description of the calculations.

The decoder's task is to transform linguistic content from the source speaker to that of the target speaker in a temporal way, conditioned on frames generated previously. The linguistic content is provided by the hierarchical encoder described previously, which condenses the source speaker's utterances into an intermediate hidden representation. The decoder must therefore ingest this linguistic content and imbue it with the target speaker's voice characteristics. In the current setup, the DNN implicitly adds the target speaker's voice characteristics (i.e. duration, pitch) to the encoder summary. It is designed as a stack of unidirectional RNNs trained to emit output spectrogram frames conditioned on all previously emitted frames, together with the encoder's representation of the context.

The attention modeling ensures that the target's spectrogram frames are aligned with the appropriate frames of the input. Attention computations are ubiquitous in sequence to sequence modeling. While decoding output sequence frames (in any general sequence modeling task, such as NMT, ASR or TTS), attention helps the decoder focus on the appropriate frame of the input sequence so that it can decide what to emit in a more precise way. This aspect is especially important when the sequence length becomes large, for the decoder's task becomes much more difficult when emitting sequential output based on a single, global context provided by the encoder. Moreover, we observe in experiments that attention modeling is essential for the system to generalize to unseen input. Our experiments are in line with the notion that, for the speech model to perform well on unseen data, it is in fact necessary for the model to learn proper alignment.

We now proceed to describe the components of the decoder in more detail.
Prenet

As with the encoder, we transform the target data through a set of bottleneck layers (two in total) with dropout. We use dropout to regularize the model and prevent overfitting, and hence it is an essential component. We use a stack of two prenet layers (full connections with ReLU nonlinearity and a dropout ratio of 0.5), yielding vectors of size 256 each.

Attention RNN

We use attention modeling [44, 45, 2] as a way of focusing the generator on the most relevant section of the input sequence. We have a state sequence output by the encoder (hidden) units at the top of the hierarchical stack, $h = (h_1, \cdots, h_L)$. The sequence output by the decoder units is $s = (s_1, \cdots, s_T)$. The input spectrogram sequence is $x = (x_1, \cdots, x_L)$ and the output spectrogram sequence is $y = (y_1, \cdots, y_T)$. At the $i$th step of the generation process, the recurrent sequence generator (the RNN) generates state $s_i$ using the $y$'s emitted up to that point, the previous state $s_{i-1}$, and the hidden encoder output $(h_1, \cdots, h_L)$. The attention model is used to inform the generator which encoder states $h_j$ are important for the generation of this $s_i$. This is done with an attentional neural network, which learns to produce the attention or alignment vector $\alpha_i$, a vector of normalized importance weights used to weight the hidden encoder states. These are then used to produce the context vector $c_i$, a weighted sum of the encoder states $h_j$:

$$s_i = \mathrm{RNN}(s_{i-1}, [c_i, y_{i-1}]) \qquad (1)$$

$$c_i = \sum_j \alpha_{ij} h_j \qquad (2)$$

The context vector $c_i$, concatenated with the spectrogram output prediction of the previous time step $y_{i-1}$, is used to condition the production of the decoder output $s_i$ for the current time step.

The attention vector $\alpha_{ij}$ is obtained by softmax normalization (to between 0 and 1), with a temperature parameter $\beta$, over the scores $e_{ij}$:

$$\alpha_{ij} = \mathrm{softmax}(\beta e_{ij}) \qquad (3)$$

where $\beta$ is the softmax temperature that sharpens the attention [45].

The scores, or un-normalized attention energies, $e_{ij}$ are the central part of the attention modeling, and are computed for each hidden encoder state $h_j$ separately. There are two ways of calculating these attention energies. Content based attention depends on the content, or encoder hidden state: $e_{ij} = \mathrm{score}(s_{i-1}, h_j)$. Location based attention depends on the location of the previous generator state, or where the attention was previously focused: $e_{ij} = \mathrm{score}(\alpha_{i-1})$. The latter is normally implemented as a 1-D convolutional kernel (with learnt weights) centred around the previous position. We use a hybrid attention model, with both content and location based scoring. Location scoring is done by convolving the previous attention $\alpha_{i-1}$ with $F$; this is then combined with content scoring:

$$f_i = F \ast \alpha_{i-1} \qquad (4)$$

$$e_{ij} = v^T \tanh\!\left((W_1 s_{i-1})^T (W_2 h_j) + U f_{ij}\right) \qquad (5)$$

where the vector $v$ and the matrices $F$, $W_1$, $W_2$ and $U$ are trainable weights, implemented as a feed-forward neural network. We use a form inspired by Luong's multiplicative attention mechanism [44] to determine the mapping between hidden units and attention energies in equation 5.

Figure 4: The decoder RNN. Att represents the attention RNN; RNNa and RNNb represent the first and second layers of the decoder RNNs. Red arrows indicate residual connections, and purple arrows indicate the generated output being fed back to the attention RNN (along with the input) to generate the attention output. The output of the second decoder is transformed to the dimensions of the output spectrogram using a fully connected layer (Project).
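A condensed sketch of the attention step of Eqs. (2)-(5), assuming PyTorch. The 600/300 dimensions follow Table 1, while the attention dimension, the location-convolution width and the filter count are illustrative; the two projections are combined here in the common additive form of [45], whereas Eq. (5) combines them multiplicatively:

```python
# Hybrid content + location attention sketch (additive combination).
import torch
import torch.nn as nn

class HybridAttention(nn.Module):
    def __init__(self, dec_dim=600, enc_dim=300, att_dim=128,
                 loc_filters=32, beta=1.0):
        super().__init__()
        self.W1 = nn.Linear(dec_dim, att_dim, bias=False)   # projects s_{i-1}
        self.W2 = nn.Linear(enc_dim, att_dim, bias=False)   # projects h_j
        self.F = nn.Conv1d(1, loc_filters, kernel_size=31, padding=15)
        self.U = nn.Linear(loc_filters, att_dim, bias=False)
        self.v = nn.Linear(att_dim, 1, bias=False)
        self.beta = beta                                    # softmax temperature

    def forward(self, s_prev, h, alpha_prev):
        # s_prev: (B, dec_dim); h: (B, L, enc_dim); alpha_prev: (B, L)
        f = self.F(alpha_prev.unsqueeze(1)).transpose(1, 2)  # Eq. (4): (B, L, filters)
        e = self.v(torch.tanh(
            self.W1(s_prev).unsqueeze(1) + self.W2(h) + self.U(f))).squeeze(-1)
        alpha = torch.softmax(self.beta * e, dim=-1)         # Eq. (3)
        c = torch.bmm(alpha.unsqueeze(1), h).squeeze(1)      # Eq. (2): context
        return c, alpha
```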
Decoder RNNs

The attention RNN's output is processed by two RNN layers with residuality before being transformed back to audio frames:

$$g_i^1 = \mathrm{RNN}^1(s_i, g_{i-1}^1) + s_i \qquad (6)$$

$$g_i^2 = \mathrm{RNN}^2(g_i^1, g_{i-1}^2) + g_i^1 \qquad (7)$$

Here, the superscripts 1 and 2 represent the first and second decoder layers, and the second term in each equation is the residual signal from that layer's input. In this case, $s_i$ represents the output from the attention RNN, and $g_i^1$ and $g_i^2$ denote the hidden units of the first and second decoder layers. We use the same number of dimensions (600) in all the decoder RNN layers.

Finally, the output of the last residual decoder layer is transformed back to the dimensions of the output (80 mel bins) by sending it through a fully connected layer followed by a ReLU nonlinearity. A sketch of these residual steps is given below.
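A condensed sketch of Eqs. (6) and (7), assuming PyTorch GRUCell units and the 600-cell / 80-bin sizes from Table 1:

```python
# Two residual decoder RNN steps followed by the output projection.
import torch
import torch.nn as nn

class ResidualDecoder(nn.Module):
    def __init__(self, dim=600, out_bins=80):
        super().__init__()
        self.rnn1 = nn.GRUCell(dim, dim)
        self.rnn2 = nn.GRUCell(dim, dim)
        self.project = nn.Linear(dim, out_bins)

    def step(self, s, g1_prev, g2_prev):
        g1 = self.rnn1(s, g1_prev) + s        # Eq. (6): residual around RNN 1
        g2 = self.rnn2(g1, g2_prev) + g1      # Eq. (7): residual around RNN 2
        frame = torch.relu(self.project(g2))  # project to 80 mel bins
        return frame, g1, g2
```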
Figure 5: Feature extractor, depicted through the attention alignment and mel spectrograms produced by training the network to reproduce LJSpeech voices, with source and target being the same.

4. Autoencoder pretraining and transfer learning
Voice conversion with DNNs for parallel data is a difficult undertaking owing to the lack of large multispeaker voice conversion datasets. To get around this problem, we first pretrain our network as an autoencoder on a large single speaker TTS corpus [46], with the source and target voices being the same. After this network is trained (a useful guideline is to check whether the system has learnt alignment), we adapt the network to the smaller multispeaker voice conversion data.

Transfer learning can be seen as a way to mitigate data insufficiency problems in the speech domain. This is particularly trenchant owing to the lack of good quality speech datasets (large corpora with sufficient diversity) that can be obtained inexpensively.

The system is trained using the L1 loss between generated and target voices. The Adam optimizer is used, with a smaller learning rate for the voice adaptation task than for the pretraining task. A schematic of the adaptation step is sketched below.
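A schematic of the adaptation step, assuming PyTorch; the model class, checkpoint name, data iterator and learning rate are placeholders, not the paper's exact settings:

```python
# Pretrain-then-adapt recipe: load autoencoder weights, fine-tune with L1.
import torch
import torch.nn.functional as F

model = Seq2SeqVC()                                           # hypothetical model class
model.load_state_dict(torch.load("ljspeech_autoencoder.pt"))  # pretrained weights
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)     # illustrative rate

for src_mel, tgt_mel in arctic_pairs:        # aligned <source, target> mel pairs
    pred = model(src_mel, tgt_mel)           # teacher-forced decoding
    loss = F.l1_loss(pred, tgt_mel)          # the L1 objective described above
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```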
5. Experimental setup
Our experimental procedure consists of two steps, as mentionedin section 4. We first pretrain the network with a large single-speaker corpus in which the source and the target are the same.After this, we allow the network to adapt to the desired sourceand target data.
Datasets

For autoencoder pretraining, we use the LJSpeech dataset [46]. This dataset contains short utterances from a single female speaker reading passages from audio books, amounting to about 24 hours of audio recorded on a MacBook Pro in a home setting at a sampling rate of 22050 Hz. The main task is to perform voice conversion (by adapting the pretrained network described above) on the much smaller CMU Arctic dataset [47], which contains utterances from several speakers. We used the male speakers "bdl" and "rms" and the female speakers "clb" and "slt" for our experiments, with the data divided into training, test and validation splits. Since this corpus has a sampling rate of 16 kHz, we upsample the data to the pretraining rate, generate audio through the pipeline, and then downsample it back to the original sampling rate, as sketched below. This measure was adopted instead of downsampling the large corpus to the target sampling rate, because we found that the system was unable to learn at the lower rate of 16 kHz.
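A sketch of this resampling round trip, assuming librosa and a 22050 Hz pretraining rate (the LJSpeech rate); the file name is a placeholder:

```python
# Resample Arctic audio up to the pretraining rate and back down afterwards.
import librosa

wav16, _ = librosa.load("arctic_a0001.wav", sr=16000)            # native Arctic rate
wav22 = librosa.resample(wav16, orig_sr=16000, target_sr=22050)  # up to pipeline rate
# ... voice conversion + wavenet vocoding happen at 22050 Hz ...
out16 = librosa.resample(wav22, orig_sr=22050, target_sr=16000)  # back to 16 kHz
```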
Figure 6: Voice conversion from the male (bdl) to the female (slt) voice, depicted through the attention alignment and mel spectrograms produced by adapting to the small CMU Arctic voice corpus.
Table 1: Network architecture hyperparameters. Conv k-c-BN-ReLU-Dropout(f) denotes a convolution of width k with c output channels, BatchNorm and ReLU, with a dropout of f (= 0.1). Conv-Dropout(f)-ReLU-Linear denotes a convolutional projection with dropout of f (= 0.1), followed by ReLU and a linear projection to the same size output. Prenet layers are full connections (e.g. FC-80 is a linear connection to an output of size 80), but with a dropout of 0.5. All other network components use a dropout of 0.1.

Encoder Prenet: FC-80-ReLU-Dropout(0.5)
CBH: Conv k-480-BN-ReLU-Dropout(0.1), k = 1, 2, · · · ; Maxpool (stride = 1) and stack; Conv-Dropout(0.1)-ReLU-Linear; Highway: stack of 4 layers
BiGRU0: 300 cells (f+b); Dropout(0.1)
Hierarchical BiGRU: 2 layers, 300 cells (f+b); Dropout(0.1)
Decoder Prenet: FC-256-ReLU-Dropout(0.5); FC-256-ReLU-Dropout(0.5)
Attention GRU: 600 cells; Dropout(0.1)
Residual GRU 1, 2: 600 cells; Dropout(0.1)
Results

In Figure 5, we present visualizations of the source and target spectrograms, the conversion, and the alignment curve for the pretrained autoencoder feature extractor trained on the large LJSpeech corpus. The alignment curve in this case shows more decoder timesteps than encoder timesteps (by a factor of 4) because of the hierarchical encoding scheme, which reduces the number of timesteps in the encoder.

In Figure 6, we present corresponding visualizations for the transfer learning experiment, wherein we convert from the male (bdl) to the female (slt) voice. Starting with a network whose weights are pretrained with the large LJSpeech corpus as an autoencoder, we allow the network to adapt to the smaller CMU Arctic dataset using paired training examples. As can be seen, while the conversion is plausible, the transfer learning spectrogram is somewhat 'blurry' owing to the limited amount of data and the use of the L1 loss, which makes the spectrograms appear oversmoothed. While the alignment curve is more or less linear, it has a few 'kinks' (unlike the LJSpeech curve), in keeping with the slight differences that arise in the alignment path as compared with the case where source and target are the same.

Wavenet vocoder

We use a popular open source wavenet implementation [38], available online, to recover audio from mel spectrograms. Wavenet is an autoregressive architecture [48] especially designed for audio generation. Related architectures have been used for generative modeling tasks in other domains: ByteNet [49] for text, PixelCNN [50] for images and Video Pixel Networks [51] for videos. At a high level, this type of architecture works on a temporal basis (in the sense that there is a certain temporal ordering of the data) by stacking dilated convolutions with exponentially growing receptive field sizes (e.g. dilations of 1, 2, 4, 8). In some of these domains, masking is carried out so as to only allow information from the past; in wavenet, instead of masking, one simply uses all the inputs from the past, as the data already has an implicit temporal order. The architecture also uses gating and skip connections to allow better information flow through the network stack.

A drawback of this type of architecture is that while training is fast, inference is slow, owing to the sample level autoregressive nature of the setup: every sample generated is conditioned on all previous samples, the upshot being that with raw audio (tens of thousands of samples per second) the calculations become extremely expensive. To alleviate these issues, themes from flow based generative modeling techniques (some of the ideas originally proposed in order to improve the expressiveness of VAE priors [52, 53] by successively transforming them) were adapted for fast inference during the sampling stage [54, 55].

To use wavenet as a vocoder backend, we present it with mel-spectrograms as conditioning features. These are upsampled (with transpose convolutions) to match the target rate. The network's layers use 512 channels for the residual connections, with the gating and skip channel sizes following the implementation [38], and mixtures of logistics are used to model the 16-bit (65536 bins) raw waveform. This architecture was also used as a comparison baseline for the WaveGlow implementation in [55]. A more extensive list of hyperparameters is available online [38]. The receptive field arithmetic behind the dilation stacking is sketched below.
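To make the receptive field growth concrete, a small sketch with an illustrative kernel width and dilation schedule, not the vocoder's exact configuration:

```python
# Why stacked dilated convolutions see exponentially more context: each
# layer adds (kernel_size - 1) * dilation samples to the receptive field.
def receptive_field(n_layers, kernel_size=2,
                    dilations=(1, 2, 4, 8, 16, 32, 64, 128, 256, 512)):
    rf = 1
    for i in range(n_layers):
        rf += (kernel_size - 1) * dilations[i % len(dilations)]
    return rf

print(receptive_field(24))  # -> 2062 samples of past context
```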
6. Conclusions
In this work, we demonstrated a way to overcome data limitations (an all too common malady in the speech world) by extracting linguistic features through pretraining with a large corpus, so that the network learns to reconstruct the input voice. These features serve as a useful starting point for transfer learning on the limited data corpus. The architecture proposed is slightly elaborate, in that it resorts to hierarchically reducing the number of timesteps on the encoder side. The basis for this proposal is the fact that the content embedded in the input waveforms, viewed as words or phoneme like entities, is much smaller than the size of the waveforms (5 words vs. 100 audio frames, 10 phonemes vs. 100 audio frames, etc.). With this intuition, the hierarchical reduction in timesteps can be viewed as a mechanism to extract phoneme like entities by compressing the content in the input mel spectrogram. Our task is, in a sense, to extract a style independent representation on the encoder side. The decoder then learns to inject the target speaker's voice characteristics, using exactly the same type of architecture as in the Tacotron works [1, 2]. The output spectrograms are converted back to audio using a wavenet vocoder, yielding plausible conversions and demonstrating that our approach is indeed legitimate.

The system is sensitive to hyperparameters. We noticed that the capacity of the CBH network is particularly important, and that adding dropout at various places helps in generalizing to the small dataset. However, dropout also leads to 'blurriness'. Cleaning up the output would probably require a postnet, which we have not implemented.

We hope to release code and samples to allow for experimentation.
7. References

[1] Y. Wang, R. J. Skerry-Ryan, D. Stanton, Y. Wu, R. J. Weiss, N. Jaitly, Y. Xiao, Z. Chen, S. Bengio, Q. Le, Y. Agiomyrgiannakis, R. Clark, and R. A. Saurous, "Tacotron: Towards end-to-end speech synthesis," arXiv preprint arXiv:1703.10135, 2017.
[2] J. Shen, R. Pang, R. J. Weiss, N. Jaitly, Z. Yang, Z. Chen, Y. Zhang, Y. Wang, R. Skerry-Ryan, R. A. Saurous, Y. Agiomyrgiannakis, and Y. Wu, "Natural TTS synthesis by conditioning WaveNet on mel spectrogram predictions," arXiv preprint arXiv:1712.05884, 2017.
[3] S. O. Arik, M. Chrzanowski, A. Coates, G. Diamos, A. Gibiansky, Y. Kang, X. Li, J. Miller, A. Ng, J. Raiman, S. Sengupta, and M. Shoeybi, "Deep Voice: Real-time neural text-to-speech," arXiv preprint arXiv:1702.07825, 2017.
[4] S. O. Arik, G. Diamos, A. Gibiansky, J. Miller, K. Peng, W. Ping, and Y. Zhou, "Deep Voice 2: Multi-speaker text-to-speech," arXiv preprint arXiv:1705.08947, 2017.
[5] W. Ping, K. Peng, S. O. Arik, A. Kannan, S. Narang, J. Raiman, and J. Miller, "Deep Voice 3: Scaling text-to-speech with convolutional sequence learning," arXiv preprint arXiv:1710.07654, 2017.
[6] A. Kain and M. Macon, "Spectral voice conversion for text-to-speech synthesis," in Proc. ICASSP, vol. 1, IEEE, 1998, pp. 285-288.
[7] T. Toda, A. W. Black, and K. Tokuda, "Voice conversion based on maximum-likelihood estimation of spectral parameter trajectory," IEEE Trans. Audio, Speech and Lang. Proc., vol. 15, no. 8, pp. 2222-2235, Nov. 2007. https://doi.org/10.1109/TASL.2007.907344
[8]
[9] S. Desai, E. V. Raghavendra, B. Yegnanarayana, A. W. Black, and K. Prahallad, "Voice conversion using artificial neural networks," in Proc. ICASSP, 2009, pp. 3893-3896. https://doi.org/10.1109/ICASSP.2009.4960478
[10] S. Desai, A. W. Black, and B. Yegnanarayana, "Voice conversion using artificial neural networks," IEEE Transactions on Audio, Speech and Language Processing, vol. 18, no. 5, July 2010.
[11] L. Sun, S. Yang, K. Li, and H. Meng, "Voice conversion using deep bidirectional long short-term memory," in Proc. ICASSP, 2015, pp. 4869-4873.
[12] L. Sun, K. Li, S. Kang, and H. Meng, in IEEE International Conference on Multimedia and Expo, 2016.
[13] M. Müller, Information Retrieval for Music and Motion. Springer, 2007.
[14] S. H. Mohammadi and A. Kain, "An overview of voice conversion systems," Speech Communication, vol. 88, pp. 65-82, Apr. 2017. https://doi.org/10.1016/j.specom.2017.01.008
[15] Y. Li, M. Sun, H. Van Hamme, X. Zhang, and J. Yang, "Robust hierarchical learning for non-negative matrix factorization with outliers," IEEE Access, vol. 7, pp. 10546-10558, 2019.
[16] R. Takashima, T. Takiguchi, and Y. Ariki, "Exemplar-based voice conversion using sparse representation in noisy environments," IEICE Transactions on Fundamentals of Electronics, Communications and Computer Sciences, vol. 96, no. 10, pp. 1946-1953, 2013.
[17] J.-X. Zhang, Z.-H. Ling, L.-J. Liu, Y. Jiang, and L.-R. Dai, "Sequence-to-sequence acoustic modeling for voice conversion," arXiv preprint arXiv:1810.06865, 2018.
[18] J. Zhang, Z. Ling, Y. Jiang, L. Liu, C. Liang, and L. Dai, "Improving sequence-to-sequence acoustic modeling by adding text-supervision," CoRR, vol. abs/1811.08111, 2018. http://arxiv.org/abs/1811.08111
[19] W. Chan, N. Jaitly, Q. V. Le, and O. Vinyals, "Listen, attend and spell," arXiv preprint arXiv:1508.01211, 2015.
[20] K. Tanaka, H. Kameoka, T. Kaneko, and N. Hojo, "AttS2S-VC: Sequence-to-sequence voice conversion with attention and context preservation mechanisms," arXiv preprint arXiv:1811.04076, 2018.
[21] H. Kameoka, K. Tanaka, T. Kaneko, and N. Hojo, "ConvS2S-VC: Fully convolutional sequence-to-sequence voice conversion," arXiv preprint arXiv:1811.01609, 2018.
[22] H. Tachibana, K. Uenoyama, and S. Aihara, "Efficiently trainable text-to-speech system based on deep convolutional networks with guided attention," CoRR, vol. abs/1710.08969, 2017. http://arxiv.org/abs/1710.08969
[23] J. Zhu, T. Park, P. Isola, and A. A. Efros, "Unpaired image-to-image translation using cycle-consistent adversarial networks," CoRR, vol. abs/1703.10593, 2017. http://arxiv.org/abs/1703.10593
[24] D. Kingma and M. Welling, "Auto-encoding variational Bayes," arXiv preprint arXiv:1312.6114, 2013.
[25] I. J. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, "Generative adversarial networks," arXiv preprint arXiv:1406.2661, 2014.
[26] T. Kaneko, H. Kameoka, K. Hiramatsu, and K. Kashino, "Sequence-to-sequence voice conversion with similarity metric learned using generative adversarial networks," in INTERSPEECH, 2017.
[27] A. B. L. Larsen, S. K. Sønderby, and O. Winther, "Autoencoding beyond pixels using a learned similarity metric," CoRR, vol. abs/1512.09300, 2015. http://arxiv.org/abs/1512.09300
[28] C. Hsu, H. Hwang, Y. Wu, Y. Tsao, and H. Wang, "Voice conversion from unaligned corpora using variational autoencoding Wasserstein generative adversarial networks," CoRR, vol. abs/1704.00849, 2017. http://arxiv.org/abs/1704.00849
[29] M. Arjovsky, S. Chintala, and L. Bottou, "Wasserstein GAN," arXiv preprint arXiv:1701.07875, 2017.
[30] T. Kaneko and H. Kameoka, "Parallel-data-free voice conversion using cycle-consistent adversarial networks," arXiv preprint arXiv:1711.11293, 2017.
[31] H. Kameoka and T. Kaneko, "StarGAN-VC: Non-parallel many-to-many voice conversion with star generative adversarial networks," arXiv preprint arXiv:1806.02169, 2018.
[32] Y. Chen, Y. M. Assael, B. Shillingford, D. Budden, S. E. Reed, H. Zen, Q. Wang, L. C. Cobo, A. Trask, B. Laurie, Ç. Gülçehre, A. van den Oord, O. Vinyals, and N. de Freitas, "Sample efficient adaptive text-to-speech," CoRR, vol. abs/1809.10460, 2018. http://arxiv.org/abs/1809.10460
[33] Y. Jia, Y. Zhang, R. J. Weiss, Q. Wang, J. Shen, F. Ren, Z. Chen, P. Nguyen, R. Pang, I. Lopez-Moreno, and Y. Wu, "Transfer learning from speaker verification to multispeaker text-to-speech synthesis," CoRR, vol. abs/1806.04558, 2018. http://arxiv.org/abs/1806.04558
[34] S. Ö. Arik, J. Chen, K. Peng, W. Ping, and Y. Zhou, "Neural voice cloning with a few samples," CoRR, vol. abs/1802.06006, 2018. http://arxiv.org/abs/1802.06006
[35] Y. Taigman, L. Wolf, A. Polyak, and E. Nachmani, "Voice synthesis for in-the-wild speakers via a phonological loop," CoRR, vol. abs/1707.06588, 2017. http://arxiv.org/abs/1707.06588
[36] E. Nachmani, A. Polyak, Y. Taigman, and L. Wolf, "Fitting new speakers based on a short untranscribed sample," CoRR, vol. abs/1802.06984, 2018. http://arxiv.org/abs/1802.06984
[37] E. Nachmani and L. Wolf, "Unsupervised polyglot text to speech," CoRR, vol. abs/1902.02263, 2019. http://arxiv.org/abs/1902.02263
[38] R. Yamamoto, "WaveNet vocoder," 2018. https://github.com/r9y9/wavenet_vocoder
[39] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, "Dropout: A simple way to prevent neural networks from overfitting," Journal of Machine Learning Research, vol. 15, pp. 1929-1958, 2014. http://jmlr.org/papers/v15/srivastava14a.html
[40] R. J. Williams and D. Zipser, "A learning algorithm for continually running fully recurrent neural networks," Neural Comput., vol. 1, no. 2, pp. 270-280, Jun. 1989. http://dx.doi.org/10.1162/neco.1989.1.2.270
[41] S. Bengio, O. Vinyals, N. Jaitly, and N. Shazeer, "Scheduled sampling for sequence prediction with recurrent neural networks," arXiv preprint arXiv:1506.03099, 2015.
[42] A. Lamb, A. Goyal, Y. Zhang, S. Zhang, A. Courville, and Y. Bengio, "Professor forcing: A new algorithm for training recurrent networks," arXiv preprint arXiv:1610.09038, 2016.
[43] J. Lee, K. Cho, and T. Hofmann, "Fully character-level neural machine translation without explicit segmentation," arXiv preprint arXiv:1610.03017, 2016.
[44] M.-T. Luong, H. Pham, and C. D. Manning, "Effective approaches to attention-based neural machine translation," arXiv preprint arXiv:1508.04025, 2015.
[45] J. Chorowski, D. Bahdanau, D. Serdyuk, K. Cho, and Y. Bengio, "Attention-based models for speech recognition," arXiv preprint arXiv:1506.07503, 2015.
[46] K. Ito, "The LJ Speech dataset," 2017. https://keithito.com/LJ-Speech-Dataset/
[47] J. Kominek and A. W. Black, "CMU Arctic databases for speech synthesis," Language Technology Institute, Carnegie Mellon University, Pittsburgh, PA, 2003. http://festvox.org/cmuarctic/index.html
[48] A. van den Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N. Kalchbrenner, A. W. Senior, and K. Kavukcuoglu, "WaveNet: A generative model for raw audio," CoRR, vol. abs/1609.03499, 2016. http://arxiv.org/abs/1609.03499
[49] N. Kalchbrenner, L. Espeholt, K. Simonyan, A. van den Oord, A. Graves, and K. Kavukcuoglu, "Neural machine translation in linear time," CoRR, vol. abs/1610.10099, 2016. http://arxiv.org/abs/1610.10099
[50] A. van den Oord, N. Kalchbrenner, O. Vinyals, L. Espeholt, A. Graves, and K. Kavukcuoglu, "Conditional image generation with PixelCNN decoders," CoRR, vol. abs/1606.05328, 2016. http://arxiv.org/abs/1606.05328
[51] N. Kalchbrenner, A. van den Oord, K. Simonyan, I. Danihelka, O. Vinyals, A. Graves, and K. Kavukcuoglu, "Video pixel networks," CoRR, vol. abs/1610.00527, 2016. http://arxiv.org/abs/1610.00527
[52] D. J. Rezende and S. Mohamed, "Variational inference with normalizing flows," arXiv preprint arXiv:1505.05770, 2015.
[53] D. P. Kingma, T. Salimans, and M. Welling, "Improving variational inference with inverse autoregressive flow," CoRR, vol. abs/1606.04934, 2016. http://arxiv.org/abs/1606.04934
[54] A. van den Oord, Y. Li, I. Babuschkin, K. Simonyan, O. Vinyals, K. Kavukcuoglu, G. van den Driessche, E. Lockhart, L. C. Cobo, F. Stimberg, N. Casagrande, D. Grewe, S. Noury, S. Dieleman, E. Elsen, N. Kalchbrenner, H. Zen, A. Graves, H. King, T. Walters, D. Belov, and D. Hassabis, "Parallel WaveNet: Fast high-fidelity speech synthesis," CoRR, vol. abs/1711.10433, 2017. http://arxiv.org/abs/1711.10433
[55] R. Prenger, R. Valle, and B. Catanzaro, "WaveGlow: A flow-based generative network for speech synthesis," arXiv preprint arXiv:1811.00002, 2018.