Any-to-Many Voice Conversion with Location-Relative Sequence-to-Sequence Modeling
Songxiang Liu, Student Member, IEEE, Yuewen Cao, Student Member, IEEE, Disong Wang, Student Member, IEEE, Xixin Wu, Member, IEEE, Xunying Liu, Member, IEEE, and Helen Meng, Fellow, IEEE

Songxiang Liu, Yuewen Cao, Disong Wang, Xunying Liu and Helen Meng are with the Human-Computer Communications Laboratory (HCCL), the Department of Systems Engineering and Engineering Management, The Chinese University of Hong Kong, Hong Kong SAR, China. E-mails: {sxliu, ywcao, dswang, xyliu, hmmeng}@se.cuhk.edu.hk. Xixin Wu is with the Engineering Department, Cambridge University, UK. E-mail: [email protected].
Abstract—This paper proposes an any-to-many location-relative, sequence-to-sequence (seq2seq) based, non-parallel voice conversion approach. In this approach, we combine a bottle-neck feature extractor (BNE) with a seq2seq based synthesis module. During the training stage, an encoder-decoder based hybrid connectionist-temporal-classification-attention (CTC-attention) phoneme recognizer is trained, whose encoder has a bottle-neck layer. A BNE is obtained from the phoneme recognizer and is utilized to extract speaker-independent, dense and rich linguistic representations from spectral features. Then a multi-speaker location-relative attention based seq2seq synthesis model is trained to reconstruct spectral features from the bottle-neck features, conditioning on speaker representations for speaker identity control in the generated speech. To mitigate the difficulties of using seq2seq based models to align long sequences, we down-sample the input spectral features along the temporal dimension and equip the synthesis model with a discretized mixture of logistic (MoL) attention mechanism. Since the phoneme recognizer is trained with a large speech recognition corpus, the proposed approach can conduct any-to-many voice conversion. Objective and subjective evaluations show that the proposed any-to-many approach has superior voice conversion performance in terms of both naturalness and speaker similarity. Ablation studies are conducted to confirm the effectiveness of the feature selection and model design strategies in the proposed approach. The proposed VC approach can readily be extended to support any-to-any VC (also known as one/few-shot VC), and achieves high performance according to objective and subjective evaluations.
Index Terms—any-to-many, voice conversion, location-relative attention, sequence-to-sequence modeling
I. INTRODUCTION

VOICE conversion (VC) aims to convert the non-linguistic information of a speech utterance while keeping the linguistic content unchanged. The non-linguistic information may refer to speaker identity, emotion, accent or pronunciation, to name a few. In this paper, we focus on the problem of speaker identity conversion. Potential applications of VC techniques include entertainment, personalized text-to-speech, pronunciation or accent correction, etc.

Based on the number of source speakers and target speakers that a single VC system can support, we can categorize current VC approaches into one-to-one VC, many-to-one VC, many-to-many VC, any-to-many VC and any-to-any VC.
Conventional VC approaches focus on one-to-one VC, which requires parallel training data between a pair of source-target speakers. At the training stage of the conventional VC pipeline, acoustic features are first extracted from the source and target utterances. The acoustic features of parallel utterances are then aligned frame-by-frame using alignment algorithms, such as dynamic time warping (DTW) [1]. A conversion model is trained to learn the mapping function between time-aligned source and target acoustic features, which can be Gaussian mixture models (GMMs) [2, 3], artificial neural networks (ANNs) [4–7], etc. These approaches perform frame-wise conversion on spectral features, i.e., the converted speech has the same duration as the source speech. This restricts the modeling of the speaking rate and duration. Recent studies show that the alignment phase can be bypassed by using sequence-to-sequence (seq2seq) based models [8, 9] for direct source-target acoustic modeling, and this approach can achieve better VC performance, especially in terms of speaker similarity. Since one-to-one VC is limited to supporting only one particular pair of source and target speakers, VC researchers have explored many-to-one VC approaches to extend the versatility of VC approaches. Among these approaches, the one based on phonetic posteriorgrams (PPGs) is widely used [10–12]. PPGs are computed from an ASR acoustic model and are often assumed to be speaker-independent linguistic representations. The many-to-one VC approaches concatenate a PPG extractor with a target-speaker dependent PPG-to-acoustic synthesis model. Many approaches have been proposed to further extend VC approaches to support many-to-many conversion. These techniques can be classified into two categories. The first category requires text supervision during the training stage. This includes the PPG-based methods and the non-parallel seq2seq based methods [13, 14]. The second category does not require text supervision. This includes the models using auto-encoders [15], variational auto-encoders [16], generative adversarial networks [17–20] and their combinations [21–23].

This paper focuses on developing an any-to-many VC approach, leveraging text supervision during the training stage. Few any-to-many VC approaches have been reported in the literature. Specific many-to-many VC approaches, which can directly be adopted for any-to-many conversion with text supervision during the training stage, include the many-to-many PPG based approaches and the non-parallel seq2seq based approaches. These approaches have the presumption that the linguistic extractor/encoder in the many-to-many VC approaches can generalize well to source speakers that are unseen during the training process.

The use of PPGs in any-to-many VC concatenates the speaker-independent PPG extractor and a multi-speaker conversion model, as shown in Fig. 1 (a). Such an approach has several deficiencies: First, the PPG model is usually trained with HMM-GMM based phonetic alignments using acoustic features.
If the alignments are inaccurate, mispronunciations will occur more often downstream in the VC pipeline. Second, the conversion model usually adopts a neural network (e.g., a bidirectional LSTM model) which maps PPGs frame-wise to acoustic features. Conditioned on the input PPGs, these conversion models predict each acoustic frame independently. This is unfavorable in terms of VC performance, because acoustic frames in an utterance are highly correlated.

The non-parallel seq2seq based approaches to any-to-many VC cascade a seq2seq ASR model and a multi-speaker seq2seq synthesis model, as shown in Fig. 1 (b). These approaches also have several drawbacks despite their strong sequence modeling ability: First, the pipeline is very long, which means that the model contains many parameters, resulting in a complicated and slow training process. Second, the ASR module usually adopts beam search algorithms to reduce recognition errors during inference. This slows down the conversion process. Third, the multi-speaker seq2seq synthesis model usually uses an attention module to align the hidden states of the encoder and decoder. There has been evidence of instability in this attention-based alignment procedure, as it may introduce missing or repeating words, incomplete synthesis, or an inability to generalize to longer utterances [24]. To make the synthesis more robust, our prior work [25] incorporates a rhythm model, which guides the explicit temporal expansion of the hidden representations output from the encoder. The auto-regressive decoder adopts a local attention mechanism within a small window, which further corrects possible alignment errors for high-fidelity synthesis. The prior work focuses on maintaining source speaking styles in the converted speech. The any-to-many conversion performance of this approach, however, needs to be thoroughly examined. In this paper, we present novel modifications to the prior approach [25] and re-design a robust, non-parallel, seq2seq any-to-many VC approach. This pipeline concatenates a seq2seq based phoneme recognizer (Seq2seqPR) and a multi-speaker duration informed attention network (DurIAN) for synthesis. This technique is referred to as Seq2seqPR-DurIAN below. Details are presented in Section III.

To address the deficiencies of previous any-to-many VC approaches, this paper further proposes the use of a bottle-neck feature extractor (BNE) combined with a seq2seq based synthesis model. We first train an end-to-end hybrid connectionist-temporal-classification-attention (CTC-attention) phoneme recognizer, where the encoder has a bottle-neck layer. A BNE is obtained from the phoneme recognizer and is used to extract bottle-neck features (BNFs) as linguistic representations of speech signals. To mitigate the difficulties of using a seq2seq based model to align the linguistic features and spectral features, we down-sample the input speech features along the temporal dimension by a factor of four.
Fig. 1. Schematic diagram of (a) the PPG based and (b) the non-parallel seq2seq based VC approaches.

We then train a multi-speaker seq2seq BNF-to-spectral synthesis model, where each speaker is represented as a one-hot vector. To facilitate the seq2seq modeling, the synthesis model is equipped with a mixture-of-logistic (MoL) location-relative attention module. We refer to this VC approach as BNE-Seq2seqMoL below; details are presented in Section IV.

As mentioned above, the synthesis modules of both any-to-many approaches, namely, the Seq2seqPR-DurIAN and the BNE-Seq2seqMoL, use one-hot vectors to represent speaker identities. Both approaches are able to convert any source speaker to a target speaker in the training set. This paper also explores extending these any-to-many VC approaches to support any-to-any conversion. A speaker encoder is utilized to generate a fixed-dimensional speaker vector from a speech utterance of arbitrary length. The speaker vector is used to condition the synthesis module on a reference speech signal from a desired arbitrary target speaker, so that the generated speech has the speaker identity of that target speaker.

The rest of this paper is organized as follows: Section II reviews related work. Sections III and IV present the Seq2seqPR-DurIAN and BNE-Seq2seqMoL approaches respectively. Experiments are described in Section V and Section VI concludes this paper.
II. RELATED WORK
A. Attention mechanisms in seq2seq based models
Sequence-to-sequence models equipped with attention mechanisms have been very popular in VC and TTS tasks. The BNE-Seq2seqMoL approach proposed in this paper uses a location-relative attention mechanism, which was first introduced by Graves [26]. Inspired by [27], we incorporate a discretized mixture of logistics (MoL) distribution [28] to model the attention weights at each decoding step in the BNE-Seq2seqMoL approach. Modifications are applied to make the alignment process strictly monotonic. Details are presented in Section IV-B.
B. Recognition-synthesis VC approaches
Both the Seq2seqPR-DurIAN and BNE-Seq2seqMoL VC approaches proposed in this paper belong to the class of recognition-synthesis based approaches, where an ASR module is built to extract linguistic representations and a synthesis module is used to predict acoustic features from those linguistic representations. Compared with the non-parallel seq2seq based VC approach proposed in [13], the Seq2seqPR-DurIAN approach utilizes a more robust synthesis module to mitigate possible attention alignment errors, where a duration model is incorporated to provide explicit phoneme-level alignment information. Besides, the Seq2seqPR-DurIAN approach has a simpler and more direct training procedure, whereas the approach in [13] uses a complicated loss function and a training procedure that alternates between a generation step and an adversarial step.

A recent study, which is related to the BNE-Seq2seqMoL proposed in this paper, uses a pre-trained ASR encoder and a pre-trained TTS decoder to initialize the parameters of the ultimate encoder-decoder based VC model [29]. The ASR and TTS models are pre-trained with large-scale corpora in a four-stage process, i.e., TTS decoder pre-training, TTS encoder pre-training, ASR encoder pre-training and ASR decoder pre-training. Then the VC model is fine-tuned on the pre-trained parameters with a small number of parallel utterances between a specific pair of source and target speakers. In comparison, the proposed BNE-Seq2seqMoL adopts a simplified two-stage training scheme, i.e., a seq2seq phoneme recognizer training stage and a multi-speaker MoL attention based seq2seq synthesis model training stage. After this two-stage training, the BNE-Seq2seqMoL approach can directly support any-to-many voice conversion.
III. SEQ2SEQPR-DURIAN BASED VOICE CONVERSION
The Seq2seqPR-DurIAN approach concatenates a seq2seq based phoneme recognizer (Seq2seqPR) and a multi-speaker duration informed attention network (DurIAN). The Seq2seqPR model is adopted to predict an L-length phoneme sequence Y = {y_l ∈ U | l = 1, ..., L} from the spectral feature vectors X of a speech signal, where U is the set of distinct phonemes. The DurIAN model is utilized to generate spectral feature vectors X̂ from an input phoneme sequence Y, conditioned on the speaker representation s to achieve multi-speaker synthesis. Since the Seq2seqPR model and the DurIAN model can be optimized independently, this VC approach does not require parallel data between the source and target speakers. We first present the details of the phoneme recognizer and then describe the DurIAN synthesis model. The conversion procedure and the extension to any-to-any VC are presented in the later parts of this section.

A. Seq2seq phoneme recognizer
We adopt the hybrid CTC-attention model structure for the seq2seq phoneme recognizer, which has a similar network structure to [30], as shown in Fig. 2.
1) CTC and attention based modeling:
CTC is a latent-variable model that monotonically maps an input sequence to an output sequence of shorter length [31]. An additional "blank" symbol is introduced into the frame-wise phoneme sequence Z = {z_t ∈ U ∪ blank | t = 1, ..., T}, where T is the number of spectral frames. By using conditional independence assumptions, the posterior distribution p(Y|X) is factorized as follows:

P(Y|X) = \underbrace{\sum_{Z} \prod_{t} p(z_t \mid z_{t-1}, Y)\, p(z_t \mid X)}_{\triangleq\, p_{\mathrm{ctc}}(Y|X)} \; p(Y)    (1)

We define p_ctc(Y|X) as the CTC objective function, where the frame-wise posterior distribution p(z_t|X) is conditioned on all inputs X and can naturally be modeled with a deep neural network (e.g., an LSTM model). The summation over Z in Eq. 1 can be efficiently computed using a dynamic programming algorithm.

The attention-based approach directly estimates the posterior p(Y|X) based on the probability chain rule as:

P(Y|X) = \underbrace{\prod_{l} p(y_l \mid y_1, \cdots, y_{l-1}, X)}_{\triangleq\, p_{\mathrm{att}}(Y|X)},    (2)

where we define p_att(Y|X) as an attention-based objective function, which can be conveniently modeled with an attention-based encoder-decoder model.
2) Model structure and training objective:
Following [30], we regard the CTC objective as an auxiliary task to train the attention model encoder, which contains a VGG-Prenet and a bidirectional LSTM (BiLSTM) encoder, as shown in Fig. 2. The input spectral features X are 80-dimensional log mel-spectrograms, on which we conduct utterance-level mean-variance normalization before feeding them into the recognizer model. The VGG-Prenet sub-samples the input features by a factor of 4 along the time axis using two VGG-like max pooling layers. Then the hidden feature maps from the VGG-Prenet are fed into the BiLSTM encoder, which contains 4 BiLSTM layers with 512 hidden units per direction. The CTC module has one fully-connected (FC) layer. The attention decoder uses location-sensitive attention and has one decoder LSTM layer with a hidden size of 1024.

The training objective to be maximized is a logarithmic linear combination of the CTC and attention objectives, i.e., p_ctc(Y|X) in Eq. 1 and p_att(Y|X) in Eq. 2:

J_{\mathrm{Seq2seqPR}} = \lambda \log P_{\mathrm{ctc}}(Y|X) + (1 - \lambda) \log P_{\mathrm{att}}(Y|X)    (3)

where λ ∈ [0, 1] is a hyper-parameter weighting the CTC objective against the attention objective. In this paper, λ is set to a fixed value within this range.
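To make the multi-task objective of Eq. 3 concrete, the sketch below shows one way the two loss terms could be combined in PyTorch during training. The tensor shapes, the blank index, the padding index and the default weight `lam` are illustrative assumptions, not the authors' released code.

```python
import torch
import torch.nn.functional as F

def hybrid_ctc_attention_loss(log_probs_ctc, att_logits, targets,
                              input_lengths, target_lengths, lam=0.5):
    """Weighted combination of CTC and attention losses (cf. Eq. 3).

    log_probs_ctc: (T, B, num_phones + 1) log-probabilities from the CTC branch.
    att_logits:    (B, L, num_phones)     per-step logits from the attention decoder.
    targets:       (B, L) padded phoneme indices.
    lam:           weight of the CTC term (the paper's exact value is not reproduced here).
    """
    ctc_loss = F.ctc_loss(log_probs_ctc, targets, input_lengths, target_lengths,
                          blank=0, zero_infinity=True)  # blank index 0 is an assumption
    att_loss = F.cross_entropy(att_logits.transpose(1, 2), targets, ignore_index=-100)
    # Minimizing this sum corresponds to maximizing the log-linear objective in Eq. 3.
    return lam * ctc_loss + (1.0 - lam) * att_loss
```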
B. DurIAN synthesis model

The DurIAN synthesis model used in this paper is inspired by [32] and is trained to predict the mel-spectrogram X from an input phoneme sequence Y, as shown in Fig. 3. Attention based seq2seq TTS models such as Tacotron are error prone in the alignment procedure, which leads to missing or repeating words, incomplete synthesis or an inability to generalize to longer utterances.
Fig. 2. Hybrid CTC-attention model structure [30] for phoneme recognition.
Fig. 3. Duration informed attention network (DurIAN) used in the Seq2seqPR-DurIAN VC approach.

To address this issue, we incorporate a duration module into the synthesis model. A similar idea has been used in [33].

A CBHG encoder [34] is adopted to transform phoneme sequences into hidden representations. In the state expansion procedure, the hidden representations are expanded by repeating along the temporal axis according to the provided phoneme-level duration information, such that the expanded representations have the same number of frames as the spectral features (a sketch of this step is given below). An auto-regressive RNN-based TTS decoder is used to generate mel-spectrograms from the expanded representations, conditioned on the speaker representations (e.g., one-hot vectors) to support multi-speaker generation. In the any-to-many VC setting, the speaker identity is represented with one-hot vectors. A speaker embedding table is jointly optimized with the remaining parts of the DurIAN model. Speaker embedding vectors are appended to every frame of the expanded encoder hidden representations. Note that in the any-to-any VC setting, speaker vectors generated from a pre-trained speaker encoder are used to represent speaker identity. The details are presented in Section III-D. The TTS decoder has a similar network structure to the one in Tacotron 1 [34]. The only difference is that the attention context concatenated with the decoder prenet output is replaced with the corresponding encoder state in the expanded hidden representations. Similar to Tacotron 1, we make the decoder generate r non-overlapping mel-spectrogram frames at each decoding step to accelerate training and synthesis.

The duration module employs an RNN based model which consists of three 512-unit BiLSTM layers. The input to the duration module contains the pre-expanded hidden states from the CBHG encoder and the speaker identity representation. As in the TTS decoder, in the any-to-many VC setting, speakers are represented with one-hot vectors and a speaker embedding table is jointly learned with the duration module. In the any-to-any tasks, speaker vectors from the same speaker encoder are adopted to represent speaker identity. In both cases, speaker vectors are appended to all frames of the CBHG encoder output.

The DurIAN synthesis model is trained with a two-stage scheme. In the first stage, the whole model except the duration module is trained, where the phoneme-level duration information is extracted from speech-text forced alignments. In this paper, we use the open-source Montreal Forced Aligner (MFA, https://montreal-forced-aligner.readthedocs.io). The loss function to minimize in this stage is the mean squared error (MSE) between the ground-truth mel-spectrogram X and the predicted mel-spectrogram X̂. In the second stage, the duration module is trained, where the parameters of the remaining parts are fixed. The loss function is the MSE loss between the reference phoneme-level duration computed from the MFA alignments and the predicted duration.
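As a rough illustration of the state-expansion step, the following snippet repeats each phoneme-level encoder state according to its duration so that the expanded sequence has as many frames as the target mel-spectrogram. It is a minimal sketch, not the DurIAN implementation.

```python
import torch

def expand_states(encoder_states: torch.Tensor, durations: torch.Tensor) -> torch.Tensor:
    """Expand phoneme-level hidden states to frame level.

    encoder_states: (L, D) one hidden vector per phoneme from the CBHG encoder.
    durations:      (L,)   number of frames assigned to each phoneme (e.g., from MFA).
    returns:        (sum(durations), D) frame-level states fed to the TTS decoder.
    """
    return torch.repeat_interleave(encoder_states, durations, dim=0)

# Example: three phonemes lasting 2, 5 and 3 frames respectively.
states = torch.randn(3, 256)
frames = expand_states(states, torch.tensor([2, 5, 3]))  # -> shape (10, 256)
```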
C. Conversion procedure

At the conversion stage, mel-spectrogram features are first computed from a source speech utterance of an arbitrary source speaker. The Seq2seqPR model is then used to recognize the phoneme sequence from the mel-spectrogram. The DurIAN model then generates the converted mel-spectrogram from the recognized phoneme sequence, conditioned on the target speaker representation. The phoneme-level duration information is obtained using the duration module, as illustrated in Fig. 3. Finally, a neural vocoder converts the converted mel-spectrogram into a time-domain waveform.
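The conversion flow can be summarized as the pseudo-pipeline below. All component names (`seq2seq_pr`, `durian`, `duration_module`, `vocoder`, `mel_fn`) are hypothetical placeholders standing in for the trained models described above.

```python
def convert_seq2seqpr_durian(source_wav, target_speaker_id,
                             seq2seq_pr, durian, duration_module, vocoder, mel_fn):
    """Hedged sketch of the Seq2seqPR-DurIAN conversion pipeline."""
    mel = mel_fn(source_wav)                                  # 80-dim log mel-spectrogram
    phonemes = seq2seq_pr.recognize(mel)                      # phoneme sequence of the source speech
    durations = duration_module.predict(phonemes, target_speaker_id)
    mel_converted = durian.synthesize(phonemes, durations, target_speaker_id)
    return vocoder.generate(mel_converted)                    # time-domain waveform
```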
D. Extension to support any-to-any conversion
To extend the Seq2seqPR-DurIAN approach to support any-to-any conversion, we use an additional speaker encoder model to generate the speaker vector, which is used to condition the DurIAN synthesis module to generate speech with the identity of an arbitrary target speaker. The speaker encoder takes an acoustic vector sequence with a varying number of frames computed from a speech signal and outputs a fixed-dimensional speaker embedding vector. The DurIAN model uses the speaker embedding vector computed from a desired target speaker as auxiliary conditioning to control the vocal identity of the generated speech. As in [35], we train the speaker encoder to optimize a generalized end-to-end (GE2E) speaker verification loss. Embeddings of utterances from the same speaker are expected to have high cosine similarity, while those from different speakers are expected to be distant. The speaker encoder used in this paper has the same network structure as the one used in [36].
Fig. 4. Training process illustration of the bottle-neck feature extractor (BNE) used in the proposed BNE-Seq2seqMoL VC approach.

IV. BNE-SEQ2SEQMOL BASED VOICE CONVERSION
The proposed BNE-Seq2seqMoL approach combines a bottle-neck feature extractor (BNE) with a multi-speaker mixture of logistic (MoL) attention based seq2seq synthesis model. The BNE is used to compute dense and rich linguistic features from mel-spectrograms, while the MoL attention based seq2seq model (Seq2seqMoL) is adopted to generate mel-spectrograms auto-regressively. Details of the BNE and the Seq2seqMoL model are presented in Sections IV-A and IV-B respectively. The conversion procedure and the extension to any-to-any VC are presented in the later parts of this section.
A. Bottle-neck feature extractor
We obtain a bottle-neck feature extractor from an end-to-end hybrid CTC-attention phoneme recognizer. The phoneme recognizer has the same network structure as the one introduced in Section III-A, except that we incorporate an additional bottle-neck layer into the recognizer, as illustrated in Fig. 4. The bottle-neck layer is a fully-connected layer with a hidden size of 256. The training objective is the same as that used in Section III-A (see Eq. 3). After training, we drop the CTC module and the attention decoder from the phoneme recognizer and use the remaining part as the bottle-neck feature extractor. The bottle-neck features computed from speech signals are regarded as linguistic representations which are presumed to contain rich and dense linguistic information.
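A minimal sketch of how the extractor could be wrapped once the recognizer is trained is shown below. The sub-module names (`vgg_prenet`, `bilstm_encoder`, `bottleneck`) are assumptions; the wrapper simply keeps the front-end up to the bottleneck layer and discards the CTC module and attention decoder, as described above.

```python
import torch
import torch.nn as nn

class BottleneckExtractor(nn.Module):
    """Keeps only the recognizer front-end up to the 256-dim bottleneck layer.

    The constructor arguments are assumed to be the trained sub-modules of the
    CTC-attention recognizer, each returning its hidden feature sequence; this
    wrapper is an illustrative sketch, not the authors' released code.
    """
    def __init__(self, vgg_prenet, bilstm_encoder, bottleneck):
        super().__init__()
        self.vgg_prenet = vgg_prenet        # downsamples mel frames by a factor of 4
        self.bilstm_encoder = bilstm_encoder
        self.bottleneck = bottleneck        # fully-connected layer, 256 units

    @torch.no_grad()
    def forward(self, mel: torch.Tensor) -> torch.Tensor:
        # mel: (B, T, 80) -> bottleneck features: (B, T // 4, 256)
        hidden = self.bilstm_encoder(self.vgg_prenet(mel))
        return self.bottleneck(hidden)
```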
B. Seq2seqMoL synthesis
The training procedure of the seq2seq based synthesis model is depicted in Fig. 5. A well-trained BNF extractor, as presented in Section IV-A, is adopted as an off-line linguistic feature extractor. The synthesis model can be regarded as an encoder-decoder model, where the encoder contains two simple networks, i.e., a bottle-neck feature prenet and a pitch encoder.
Fig. 5. Schematic illustration of the training stage of the synthesis module in the proposed BNE-Seq2seqMoL VC approach.
1) Bottle-neck feature prenet and pitch encoder:
The bottle-neck feature prenet contains two bidirectional GRU layers, which have 256 hidden units per direction. The pitch encoder employs a convolutional network structure, which takes continuously interpolated logarithmic F0 (Log-F0) and unvoiced/voiced flag (UV) features as input. Log-F0s and UVs are computed with the same frame shift as the one used to extract mel-spectrograms. Since the bottle-neck feature extractor down-samples the mel-spectrograms by a factor of 4 along the time axis, the bottle-neck features have only a quarter of the frames of the corresponding Log-F0s or UVs. To make the BNFs, Log-F0s and UVs have the same time resolution, we also down-sample the Log-F0s and UVs by a factor of 4 along the time axis. This is achieved by using two 1-dimensional convolution layers with a stride of 2, where the hidden dimension is 256. To remove possible speaker information, we add an instance normalization layer without affine transformation after each convolution layer in the pitch encoder.

The outputs of the pitch encoder and the bottle-neck feature prenet are added element-wise. In any-to-many conversion, one-hot vectors are used as the speaker representation, and an additional speaker embedding table is jointly trained with the whole synthesis network. Speaker vectors are concatenated to every frame of the encoder output.
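The following PyTorch sketch illustrates a pitch encoder of the kind described above. The stride-2 convolutions (4x overall downsampling, matching the BNFs) and the non-affine instance normalization follow the description, while the kernel size and activation are assumptions not stated in the text.

```python
import torch
import torch.nn as nn

class PitchEncoder(nn.Module):
    """Sketch of the pitch encoder: two 1-D convolutions with stride 2, each
    followed by instance normalization without affine parameters."""
    def __init__(self, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(2, hidden, kernel_size=3, stride=2, padding=1),
            nn.InstanceNorm1d(hidden, affine=False),  # removes per-utterance scale/offset
            nn.ReLU(),
            nn.Conv1d(hidden, hidden, kernel_size=3, stride=2, padding=1),
            nn.InstanceNorm1d(hidden, affine=False),
            nn.ReLU(),
        )

    def forward(self, logf0_uv: torch.Tensor) -> torch.Tensor:
        # logf0_uv: (B, T, 2) interpolated Log-F0 and UV flag -> (B, ~T // 4, hidden)
        return self.net(logf0_uv.transpose(1, 2)).transpose(1, 2)
```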
2) MoL attention based decoder:
The decoder of the synthesis model adopts a similar auto-regressive network structure to the one used in Tacotron 2, except that a location-relative discretized mixture of logistics (MoL) attention mechanism is used.
Fig. 6. Computation procedure of the mixture of logistic distribution parameters γ_i for the attention weights at decoder step i in the proposed BNE-Seq2seqMoL VC approach.

Let us denote the encoder outputs as {h_j}_{j=1}^{T̃}, where T̃ = T/4 and T is the number of frames in the mel-spectrograms. An attention RNN (Eq. 4) produces hidden state s_i at decoder step i. Then the attention mechanism consumes s_i to produce the alignment α_i ∈ R^{T̃} (Eq. 5). The context vector c_i, which is fed to the decoder RNN, is computed as a weighted average of the encoder states {h_j}_{j=1}^{T̃} using the alignment α_i (Eq. 6). The decoder RNN takes s_i and c_i as input, and its output is used together with the context vector c_i to produce the mel-spectrogram frames at the current decoder step through a linear layer (Eqs. 7 and 8).

s_i = \mathrm{RNN}_{\mathrm{Att}}([x_{i-1}, c_{i-1}], s_{i-1})    (4)
\alpha_i = \mathrm{Attention}(s_i)    (5)
c_i = \sum_{j=1}^{\tilde{T}} \alpha_{i,j} h_j    (6)
d_i = \mathrm{RNN}_{\mathrm{Dec}}([c_i, s_i], d_{i-1})    (7)
x_i = \mathrm{Linear}_{\mathrm{Out}}(d_i, c_i)    (8)

The attention mechanism is similar to the one used in [27], which is a location-relative extension of the purely location-based mechanism proposed in [26]. The attention alignment weights correspond to a learned attention distribution φ(·; γ_i), where γ_i denotes the distribution parameters computed from the attention RNN state s_i using a simple multi-layer perceptron (MLP) network. We use a discretized MoL [28] for the attention distribution φ_i(·; γ_i). At each decoder step, a set of distribution parameters γ_i = {w_i^k, μ_i^k, σ_i^k}_{k=1}^{K} is computed, corresponding to K mixture coefficients, means and scales. In this paper, the number of mixtures is set to 5. The computation procedure is shown below and also illustrated in Fig. 6.

(\hat{w}_i, \hat{\Delta}_i, \hat{\sigma}_i) = \mathrm{MLP}(s_i)    (9)
w_i = \mathrm{SM}(\hat{w}_i), \quad \Delta_i = \mathrm{SP}(\hat{\Delta}_i), \quad \sigma_i = \mathrm{SP}(\hat{\sigma}_i)    (10)
\mu_i = \mu_{i-1} + \Delta_i    (11)

where SM(·) represents the softmax function and SP(·) represents the softplus function. Note that the mean of each logistic component is computed using the recurrence relation in Eq. 11, which makes the mechanism location-relative and monotonic, since Δ_i is constrained to be positive by the softplus function. Given the computed MoL distribution parameters γ_i, the attention weight α_{i,j} is obtained from the discretized attention distribution φ_i(·; γ_i) at decoder step i as:

\alpha_{i,j} = \phi_i(j; \gamma_i) = \sum_{k=1}^{K} w_i^k \left[ \mathrm{sigmoid}\!\left(\frac{j + 0.5 - \mu_i^k}{\sigma_i^k}\right) - \mathrm{sigmoid}\!\left(\frac{j - 0.5 - \mu_i^k}{\sigma_i^k}\right) \right]    (12)

Following Tacotron 1 and 2, a decoder prenet containing two linear layers and a residual convolution based postnet are added to the synthesis decoder. We also let the decoder predict stop tokens, which are used to stop the decoding procedure when the stopping probability reaches a threshold.

The training objective is to minimize the MSE loss between the ground truth mel-spectrogram X and the predicted X̂, in combination with a binary cross-entropy loss on the stop token predictions.
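The attention-weight computation of Eq. 12 can be written compactly as below. This is a standalone sketch that assumes the mixture parameters for the current decoder step have already been produced by the MLP and the activations of Eqs. 9-11.

```python
import torch

def mol_attention_weights(w, mu, sigma, num_frames):
    """Discretized MoL attention weights (cf. Eq. 12) for one decoder step.

    w, mu, sigma: (B, K) mixture weights, means and scales, with w already
                  softmax-normalized and sigma (and the mean increments that
                  produced mu) already passed through softplus.
    num_frames:   number of encoder frames T~.
    returns:      (B, T~) attention weights alpha_{i,j}.
    """
    j = torch.arange(num_frames, dtype=mu.dtype, device=mu.device)   # (T~,)
    j = j.view(1, 1, -1)                                             # (1, 1, T~)
    mu, sigma, w = mu.unsqueeze(-1), sigma.unsqueeze(-1), w.unsqueeze(-1)
    upper = torch.sigmoid((j + 0.5 - mu) / sigma)
    lower = torch.sigmoid((j - 0.5 - mu) / sigma)
    return ((upper - lower) * w).sum(dim=1)                          # sum over the K mixtures
```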
C. Conversion procedure

Given a speech utterance from an arbitrary source speaker, the approach first computes the mel-spectrogram, the continuous Log-F0s and the UV flags. Then the BNF extractor is used to extract linguistic features from the mel-spectrogram. Log-F0s are converted linearly in the log scale from the source to the target using the log-scaled F0 statistics of the source and target speakers, as:

\text{Log-F0}_{vc} = \frac{\sigma_{target}}{\sigma_{source}} \left( \text{Log-F0}_{source} - \mu_{source} \right) + \mu_{target}    (13)

where the μ's and σ's represent the mean and standard deviation of the log-scaled F0.

The bottle-neck features and the Log-F0_vc and UV flags are added element-wise after going through the bottle-neck feature prenet and the pitch encoder respectively. The output is concatenated with the target speaker embedding vector to form the encoder outputs. The MoL attention decoder then generates the converted mel-spectrogram from the encoder outputs in an auto-regressive manner. A neural vocoder is finally used to generate the waveform from the converted mel-spectrogram.
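Eq. 13 amounts to a mean-variance transformation of F0 in the log domain; a small sketch with made-up speaker statistics is shown below.

```python
import numpy as np

def convert_log_f0(log_f0_src, src_mean, src_std, tgt_mean, tgt_std):
    """Linear log-domain F0 transformation (cf. Eq. 13).
    Statistics are the per-speaker mean/std of voiced log-F0 values."""
    return (tgt_std / src_std) * (log_f0_src - src_mean) + tgt_mean

# Example with made-up statistics: a lower-pitched source (~120 Hz) to a higher-pitched target (~220 Hz).
lf0 = np.log(np.array([110.0, 130.0, 125.0]))
converted = convert_log_f0(lf0, src_mean=np.log(120.0), src_std=0.15,
                           tgt_mean=np.log(220.0), tgt_std=0.20)
f0_converted_hz = np.exp(converted)
```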
D. Extension to any-to-any conversion

We use the same speaker encoder model introduced in Section III-D to generate the speaker vector for an arbitrary target speaker. We replace the one-hot speaker representation with the speaker-encoder-generated speaker vector in the BNE-Seq2seqMoL any-to-many approach, such that it supports any-to-any conversion. Details of the speaker encoder are the same as those presented in Section III-D.
V. EXPERIMENTS
A. Datasets
The datasets used in this paper are all publicly available. LibriSpeech (960 hours) [37] is used to train the phoneme recognizers introduced in Sections III-A and IV-A. The LibriSpeech lexicon is used to obtain phoneme sequences from text transcripts.

In any-to-many voice conversion, we use the VCTK corpus [38] and the CMU ARCTIC database [39]. The VCTK corpus contains 44 hours of clean speech from 109 speakers. In this paper, we only use data from 105 VCTK speakers. We choose 600 utterances for the validation set and another 600 utterances for the test set, while the remaining utterances form the training set. The CMU ARCTIC database contains 1132 parallel recordings of English speakers. Data from four speakers are used: two female (clb and slt) and two male (bdl and rms). We choose 50 utterances for validation and another 50 utterances for testing. We randomly choose 250 non-overlapping utterances from the remaining utterances for each of the four speakers respectively, such that they do not have parallel utterances during training.

For any-to-any voice conversion, we use the LibriSpeech (train-other-500), VoxCeleb1 [40] and VoxCeleb2 [41] datasets to train the speaker encoder introduced in Sections III-D and IV-D. In total, there are more than 8K speakers, so that we expect the speaker encoder to generalize to any unseen speaker. The LibriTTS (train-clean-100 and train-clean-360) dataset [42] together with the training set of the VCTK corpus is used to train the synthesis models of the Seq2seqPR-DurIAN and BNE-Seq2seqMoL approaches. The CMU ARCTIC database is used only in the conversion stage in this setting. That is, the four speakers (bdl, clb, rms and slt) are all unseen during the training procedure. This simulates voice conversion from an arbitrary source speaker to an arbitrary target speaker, which forms a pilot version of any-to-any conversion.

B. Features and neural vocoder model
Speech signals used in this paper are all re-sampled to 16 kHz if the original sampling rate is different. Spectral features are all 80-dimensional log mel-spectrograms, except that the speaker encoder takes 40-dimensional log mel-spectrograms as input. The 80-dimensional log mel-spectrograms are computed using a 50 ms Hanning window and a 10 ms frame shift, while the 40-dimensional ones are computed using a 25 ms Hanning window and a 10 ms frame shift. We use the PyWorld toolkit (https://github.com/JeremyCCHsu/Python-Wrapper-for-World-Vocoder) to extract F0s from speech signals, and Log-F0s are obtained by taking the logarithm of the linearly interpolated F0s.

In this paper, the WaveRNN network [43] is used as the neural vocoder. The speech waveform is µ-law quantized into 512-way categorical distributions. The open-sourced PyTorch implementation (https://github.com/fatchord/WaveRNN) is used. Since the mel-spectrograms capture all of the relevant details needed for high quality speech synthesis, we simply use ground-truth mel-spectrograms from multiple speakers to train the WaveRNN, without adding any speaker identity representations. We only use the VCTK training set to train the WaveRNN model.
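For reference, a log mel-spectrogram with the stated frame settings could be computed roughly as follows with librosa; the FFT size and the log floor are assumptions, since the paper does not specify them.

```python
import librosa
import numpy as np

def log_mel_spectrogram(wav_path, sr=16000, n_mels=80, win_ms=50, hop_ms=10):
    """80-dim log mel-spectrogram with a 50 ms Hanning window and 10 ms frame shift."""
    wav, _ = librosa.load(wav_path, sr=sr)
    mel = librosa.feature.melspectrogram(
        y=wav, sr=sr, n_fft=1024,
        win_length=int(sr * win_ms / 1000),   # 50 ms -> 800 samples at 16 kHz
        hop_length=int(sr * hop_ms / 1000),   # 10 ms -> 160 samples at 16 kHz
        window="hann", n_mels=n_mels)
    return np.log(np.maximum(mel, 1e-5)).T    # (num_frames, 80)
```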
C. Comparisons

Four VC approaches are compared in the experiments for any-to-many conversion, as well as for the extension to any-to-any conversion. We compare the proposed Seq2seqPR-DurIAN and BNE-Seq2seqMoL approaches with two other recently proposed approaches, namely, the PPG-based VC and the non-parallel seq2seq based VC, as introduced in Section I. The details of their implementation are presented below.
PPG-VC: This baseline approach has a network architecture similar to the N10 system [44] in VCC2018 [45]. As shown in Fig. 1 (a), this approach consists of a PPG extractor and a multi-speaker conversion model. The PPG extractor adopts an RNN based model structure, which contains 5 bidirectional gated recurrent unit (GRU) layers with 512 hidden units per direction. The multi-speaker conversion model also has an RNN structure, which consists of 4 bidirectional LSTM layers with 256 hidden units per direction. In any-to-many conversion, the speaker identity is represented as a one-hot vector and an additional speaker embedding table is learned together with the other parts of the conversion model. The speaker embedding vector has a size of 256 and is concatenated with the PPGs frame-by-frame. The conversion model is trained with the VCTK and CMU ARCTIC training splits mentioned in Section V-A. In any-to-any conversion, the same speaker encoder introduced in Section III-D is used to generate speaker vectors from mel-spectrograms. This setting is similar to the one in our prior work [46], except that i-vectors and learned speaker embedding vectors are used as speaker identity representations there. We use the LibriTTS (train-clean-100 and train-clean-360) dataset together with the training set of the VCTK corpus to train the conversion model.

The PPG extractor is obtained from a frame-wise phoneme recognizer, which is trained with the LibriSpeech dataset (960 hours). We first use the MFA to align the audio and transcripts at the phoneme level. Then the audio-text alignment information is used to obtain the frame-to-phoneme correspondence between mel-spectrograms and phoneme sequences, with which it is possible to train a frame-wise phoneme recognizer. We regard the probability vectors after the last softmax layer as the PPG features for one utterance.
NonParaSeq2seq-VC: This baseline approach is proposed by [13] (open-source implementation: https://github.com/jxzhanggg/nonparaSeq2seqVC_code), where disentangled linguistic and speaker representations are extracted from acoustic features, and voice conversion is achieved by preserving the linguistic representations of source utterances while replacing the speaker representations with the target ones. Since a speaker encoder is jointly trained with the whole model, this approach can be trivially extended from any-to-many conversion to support any-to-any conversion. We use the VCTK and CMU ARCTIC training splits to train the model for the any-to-many conversion setting, and we use the LibriTTS (train-clean-100 and train-clean-360) dataset together with the training set of the VCTK corpus for the any-to-any setting in this study.
TABLE I
OBJECTIVE EVALUATION RESULTS FOR ANY-TO-MANY VOICE CONVERSION.

Conversion pair | PPG-VC (MCD / F0RMSE / CER / WER) | NonParaSeq2seq-VC | Seq2seqPR-DurIAN | BNE-Seq2seqMoL
F-M     | 6.90 / 48.22 / 5.75 / 8.93 | 7.37 / 49.94 / 17.78 / 26.54 | 7.03 / 52.60 / 7.03 / 9.89  | 6.87 / 44.55 / 3.00 / 4.29
F-F     | 7.03 / 48.22 / 5.54 / 8.46 | 7.36 / 45.43 / 8.14 / 13.46  | 7.04 / 47.25 / 8.76 / 13.39 | 6.99 / 45.44 / 5.12 / 6.97
M-M     | 6.96 / 51.40 / 5.19 / 7.50 | 7.36 / 53.86 / 24.78 / 39.80 | 7.14 / 51.89 / 7.33 / 14.16 | 6.94 / 50.37 / 4.72 / 7.22
M-F     | 7.25 / 56.36 / 5.32 / 7.80 | 7.39 / 50.37 / 19.05 / 25.71 | 7.09 / 45.09 / 8.32 / 14.72 | 7.18 / 56.31 / 3.61 / 5.76
Average | 7.04 / 51.05 / 5.45 / 8.17 | 7.37 / 49.90 / 17.44 / 26.38 | 7.08 / 49.21 / 7.86 / 13.04 | – / – / – / –
TABLE II
OBJECTIVE EVALUATION RESULTS FOR ANY-TO-ANY VOICE CONVERSION.

Conversion pair | PPG-VC (MCD / F0RMSE / CER / WER) | NonParaSeq2seq-VC | Seq2seqPR-DurIAN | BNE-Seq2seqMoL
F-M     | 7.51 / 48.80 / 4.21 / 6.99 | 8.01 / 67.20 / 5.74 / 8.60 | 7.51 / 58.44 / 5.45 / 8.08  | 7.44 / 44.57 / 5.74 / 7.42
F-F     | 7.52 / 49.86 / 4.67 / 7.22 | 7.58 / 47.01 / 4.56 / 6.32 | 7.83 / 63.00 / 9.79 / 6.40  | 7.71 / 48.45 / 3.98 / 7.03
M-M     | 7.64 / 58.37 / 5.05 / 6.73 | 8.16 / 59.13 / 5.72 / 8.80 | 7.50 / 58.34 / 5.97 / 11.89 | 7.53 / 50.07 / 5.28 / 7.86
M-F     | 7.80 / 69.76 / 4.42 / 6.27 | 8.46 / 58.38 / 6.55 / 9.95 | 7.86 / 68.65 / 6.08 / 10.42 | 7.93 / 61.50 / 6.05 / 8.11
Average | – | – | – | –
TABLE III
SUBJECTIVE EVALUATION RESULTS FOR FOUR VC APPROACHES: PPG-VC, NONPARASEQ2SEQ-VC, SEQ2SEQPR-DURIAN AND BNE-SEQ2SEQMOL (CONFIDENCE INTERVALS).
(Rows: conversion pairs F-M, F-F, M-M, M-F and target-speaker recordings; columns: naturalness and similarity MOS for each of the four approaches.)
TABLE IV
SUBJECTIVE EVALUATION RESULTS OF THE PROPOSED BNE-SEQ2SEQMOL APPROACH FOR ANY-TO-ANY VOICE CONVERSION (CONFIDENCE INTERVALS).
(Naturalness and similarity MOS per conversion pair F-M, M-M, F-F, M-F and on average.)

D. Objective evaluations
Mel-cepstrum distortion (MCD), the root mean squared F0 error (F0-RMSE) and the character/word error rate (CER/WER) from an ASR system are used as the metrics for objective evaluation. The MCD is used for evaluating spectral conversion and is computed as:

\mathrm{MCD[dB]} = \frac{10}{\ln 10} \sqrt{2 \sum_{d=1}^{K} \left( \mathrm{MCC}_d^c - \mathrm{MCC}_d^t \right)^2}    (14)

where MCC represents a mel-cepstral coefficient, K is the dimension of the MCCs, and MCC_d^c and MCC_d^t represent the d-th dimensional coefficient of the converted MCCs and the target MCCs, respectively. The PyWorld toolkit is used to extract MCCs in this paper, where we set K = 24.

The F0-RMSE is used to evaluate the F0 conversion and is computed as:

\mathrm{F0\text{-}RMSE[Hz]} = \sqrt{\frac{1}{N} \sum_{i=1}^{N} \left( \mathrm{F0}_i^c - \mathrm{F0}_i^t \right)^2}    (15)

where N is the number of frames, and F0_i^c and F0_i^t are the F0 values at the i-th frame of the converted speech and the target speech respectively.
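The two metrics can be computed as in the sketch below; removing the 0-th cepstral coefficient and the exact time alignment of converted and target frames are assumptions not detailed in the text.

```python
import numpy as np

def mcd_db(mcc_converted: np.ndarray, mcc_target: np.ndarray) -> float:
    """Mel-cepstral distortion (cf. Eq. 14), averaged over time-aligned frames.

    mcc_converted, mcc_target: (N, K) MCCs (K = 24 here), already aligned.
    """
    diff = mcc_converted - mcc_target
    per_frame = (10.0 / np.log(10.0)) * np.sqrt(2.0 * np.sum(diff ** 2, axis=1))
    return float(np.mean(per_frame))

def f0_rmse_hz(f0_converted: np.ndarray, f0_target: np.ndarray) -> float:
    """Root mean squared F0 error (cf. Eq. 15) over N aligned frames."""
    return float(np.sqrt(np.mean((f0_converted - f0_target) ** 2)))
```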
We use a transformer based end-to-end ASR engine (from the ESPnet model zoo, https://github.com/espnet/espnet_model_zoo) to compute the CER and WER of the converted speech to evaluate its intelligibility. The ASR model is trained using the LibriSpeech (960 hours) dataset. The CER and WER for the CMU ARCTIC test set are 2.71% and 4.30%, respectively.

The objective evaluation results in the any-to-many voice conversion setting are shown in Table I. We can see that the proposed BNE-Seq2seqMoL approach achieves the best performance for all four objective metrics on average. The PPG-VC approach has the worst F0-RMSE among the four approaches, which verifies that conversion by frame-wise mapping has a constrained capability to model prosody during the conversion. Comparing the NonParaSeq2seq-VC and the Seq2seqPR-DurIAN approaches, which have similar model architectures, we can see that the re-designed Seq2seqPR-DurIAN has superior results for all four metrics on average. The NonParaSeq2seq-VC has significantly worse CER and WER than the other approaches, and a preliminary listening test finds that there exist repeating, skipping and truncation phenomena in the converted speech. This implies that injecting a duration model, which provides explicit phone-level durational information, makes the conversion more robust. Comparing the results from PPG-VC with those from BNE-Seq2seqMoL shows that the auto-regressive property of the latter can boost VC performance across the four objective metrics. The lowest CER and WER confirm the robustness brought by the location-relative MoL attention to the seq2seq model in the BNE-Seq2seqMoL approach.

The objective evaluation results in the any-to-any voice conversion setting are presented in Table II. We can see that the PPG-VC approach has the best results in terms of MCD, CER and WER. The proposed BNE-Seq2seqMoL gives the lowest F0-RMSE among the four approaches and obtains good results for MCD, CER and WER. In any-to-any VC, the models are trained using a combination of the VCTK training set and the large LibriTTS (train-clean-100, train-clean-360) dataset, which compensates for the deficiency of independent prediction across frames in the PPG-VC approach.

E. Subjective evaluations
Subjective evaluations in terms of both the naturalness and the speaker similarity of the converted speech are conducted (audio demos and source code can be found at https://liusongxiang.github.io/BNE-Seq2SeqMoL-VC/). The standard 5-scale mean opinion score (MOS) test is adopted for both the naturalness and speaker similarity evaluations. In the MOS tests for evaluating naturalness, each group of stimuli contains recorded samples from the target speakers, which are randomly shuffled with the samples generated by the four comparative approaches before being presented to listeners. In the MOS similarity tests, converted speech samples are directly compared with the recorded samples of the target speakers. 10 utterances from the CMU ARCTIC test set are presented for each conversion pair. We invite 10 Chinese speakers who are proficient in English to participate in the evaluations, and they are allowed to replay each sample as many times as necessary.
Fig. 7. (a) Visualization with t-SNE of the bottle-neck features extracted by the BNE in the proposed BNE-Seq2seqMoL VC approach from the utterance "arctic-a0001" in the CMU ARCTIC dataset. (b) Details of the dashed red box in (a), where the numbers represent the sequential order of the frames in the bottle-neck features.
The subjective MOS evaluation results for the any-to-many voice conversion setting are shown in Table III. We can see that the proposed BNE-Seq2seqMoL approach achieves the best results in terms of both naturalness and speaker similarity. In any-to-any conversion, we only conduct MOS tests for the proposed BNE-Seq2seqMoL approach. The results are presented in Table IV. We can see that the proposed approach also achieves superior VC performance even in the one-shot/few-shot voice conversion setting.
F. Cross-speaker property of bottle-neck features
To explore the properties of the bottle-neck features extracted by the BNE in the BNE-Seq2seqMoL approach, t-distributed Stochastic Neighbor Embedding (t-SNE) [47] is used to visualize the bottle-neck features. t-SNE is a non-linear dimensionality reduction technique that is widely used to embed high-dimensional data into a space of two or three dimensions. The goal is to find a faithful representation of the high-dimensional data in a low-dimensional space.

TABLE V
ABLATION STUDIES FOR THE PROPOSED BNE-SEQ2SEQMOL APPROACH. "IN" REPRESENTS INSTANCE NORMALIZATION AND "LSA" REPRESENTS LOCATION-SENSITIVE ATTENTION.

Conversion pair | BNE-Seq2seqMoL (MCD / F0RMSE / CER / WER) | Without Log-F0&UV | Without IN | Use LSA
F-M     | 6.87 / 44.55 / 3.00 / 4.29 | 6.91 / 49.92 / 2.23 / 5.14 | 6.85 / 41.65 / 3.94 / 6.46 | 6.81 / 46.64 / 4.31 / 7.19
F-F     | 6.99 / 45.44 / 5.12 / 6.97 | 7.05 / 47.24 / 3.97 / 6.12 | 6.93 / 45.77 / 4.90 / 6.96 | 7.00 / 44.03 / 4.28 / 7.20
M-M     | 6.94 / 50.37 / 4.72 / 7.22 | 7.00 / 54.20 / 3.93 / 6.80 | 7.04 / 52.81 / 5.52 / 8.51 | 6.90 / 49.71 / 4.53 / 6.44
M-F     | 7.18 / 56.31 / 3.61 / 5.76 | 7.13 / 47.80 / 5.09 / 7.25 | 7.23 / 60.56 / 5.45 / 7.75 | 7.15 / 58.65 / 4.89 / 6.84
Average | 6.99 / – / – / – | – | – | –

Fig. 7a shows the 2-dimensional t-SNE visualization of the bottle-neck features of the utterance "arctic-a0001" from the four speakers (bdl, rms, slt and clb). The 256-dimensional bottle-neck features are fed into t-SNE and the result is obtained after five thousand iterations. In the figure, each bottle-neck feature frame is represented by a dot, and the four colors represent the four speakers. We can see a strong degree of clustering of the bottle-neck features across the different speakers. The details of the red dashed box in Fig. 7a are depicted in Fig. 7b, where the numerical indices represent the sequential order of the first several frames in the bottle-neck features. A similar manifold pattern can be observed across the four examined speakers, which demonstrates the good cross-speaker property of the bottle-neck features used in the proposed BNE-Seq2seqMoL approach. This can also be a reasonable explanation of the superior VC performance of the BNE-Seq2seqMoL approach in both the objective and subjective evaluations.
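A bottleneck-feature visualization along these lines could be reproduced with scikit-learn's t-SNE as sketched below. The variable `bnf_per_speaker` and the plotting details are hypothetical; only the 2-D embedding and the five-thousand-iteration setting follow the text.

```python
import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

def plot_bnf_tsne(bnf_per_speaker, n_iter=5000):
    """bnf_per_speaker: dict mapping speaker name -> (num_frames, 256) bottleneck
    features extracted from the same utterance (a hypothetical input structure)."""
    feats = np.concatenate(list(bnf_per_speaker.values()), axis=0)
    labels = np.concatenate([[name] * len(f) for name, f in bnf_per_speaker.items()])
    emb = TSNE(n_components=2, n_iter=n_iter, init="pca").fit_transform(feats)
    for name in bnf_per_speaker:
        idx = labels == name
        plt.scatter(emb[idx, 0], emb[idx, 1], s=4, label=name)
    plt.legend()
    plt.show()
```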
G. Ablation Studies
In this section, ablation studies are conducted to validate the effectiveness of the feature selection and model design strategies in the proposed BNE-Seq2seqMoL approach. Specifically, three ablation studies are conducted: 1) dropping the Log-F0 and UV features and using only the bottle-neck features as input to the synthesis module; 2) dropping the instance normalization layers in the pitch encoder; 3) using location-sensitive attention (LSA) instead of the location-relative MoL attention.

The objective evaluations of the ablation studies are shown in Table V. We observe that the proposed feature selection and model design for the BNE-Seq2seqMoL approach obtains the best F0-RMSE and WER results, and achieves near-best MCD and CER results. This validates the effectiveness of the feature selection and model design in the BNE-Seq2seqMoL approach.
VI. CONCLUSION
In this paper, we re-design a prior approach [25] to achieve a robust non-parallel seq2seq based any-to-many VC approach. The novel approach concatenates a seq2seq based phoneme recognizer (Seq2seqPR) and a multi-speaker duration informed attention network (DurIAN) for synthesis. This approach is also extended to support any-to-any voice conversion. Thorough examinations, including objective and subjective evaluations, are conducted for this model in the any-to-many as well as the any-to-any settings.

To overcome the deficiencies of the PPG-based and non-parallel seq2seq any-to-many VC approaches, we further propose a new any-to-many VC approach, which combines a bottle-neck feature extractor (BNE) with an MoL attention based seq2seq synthesis model. This approach can easily be extended to any-to-any VC. Objective and subjective evaluation results show its superior VC performance in both the any-to-many and any-to-any VC settings. Ablation studies have been conducted to confirm the effectiveness of the feature selection and model design strategies in the proposed approach. In the future, we will explore the proposed approach in terms of source style transfer and emotion conversion.
REFERENCES

[1] A. Bundy and L. Wallen, "Dynamic time warping," in Catalogue of Artificial Intelligence Tools. Springer, 1984, pp. 32–33.
[2] Y. Stylianou, O. Cappé, and E. Moulines, "Continuous probabilistic transform for voice conversion," IEEE Transactions on Speech and Audio Processing, vol. 6, no. 2, pp. 131–142, 1998.
[3] T. Toda, A. W. Black, and K. Tokuda, "Voice conversion based on maximum-likelihood estimation of spectral parameter trajectory," IEEE Transactions on Audio, Speech, and Language Processing, vol. 15, no. 8, pp. 2222–2235, 2007.
[4] S. Desai, A. W. Black, B. Yegnanarayana, and K. Prahallad, "Spectral mapping using artificial neural networks for voice conversion," IEEE Transactions on Audio, Speech, and Language Processing, vol. 18, no. 5, pp. 954–964, 2010.
[5] S. H. Mohammadi and A. Kain, "Voice conversion using deep neural networks with speaker-independent pre-training," in Spoken Language Technology Workshop (SLT), 2014 IEEE. IEEE, 2014, pp. 19–23.
[6] T. Nakashika, T. Takiguchi, and Y. Ariki, "Voice conversion using RNN pre-trained by recurrent temporal restricted Boltzmann machines," IEEE/ACM Transactions on Audio, Speech and Language Processing (TASLP), vol. 23, no. 3, pp. 580–587, 2015.
[7] L. Sun, S. Kang, K. Li, and H. Meng, "Voice conversion using deep bidirectional long short-term memory based recurrent neural networks," in Acoustics, Speech and Signal Processing (ICASSP), 2015 IEEE International Conference on. IEEE, 2015, pp. 4869–4873.
[8] J.-X. Zhang, Z.-H. Ling, L.-J. Liu, Y. Jiang, and L.-R. Dai, "Sequence-to-sequence acoustic modeling for voice conversion," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 27, no. 3, pp. 631–644, 2019.
[9] K. Tanaka, H. Kameoka, T. Kaneko, and N. Hojo, "AttS2S-VC: Sequence-to-sequence voice conversion with attention and context preservation mechanisms," in ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2019, pp. 6805–6809.
[10] L. Sun, K. Li, H. Wang, S. Kang, and H. Meng, "Phonetic posteriorgrams for many-to-one voice conversion without parallel data training," IEEE, 2016, pp. 1–6.
[11] S. Liu, L. Sun, X. Wu, X. Liu, and H. Meng, "The HCCL-CUHK system for the voice conversion challenge 2018," in Odyssey, 2018, pp. 248–254.
[12] S. Liu, Y. Cao, X. Wu, L. Sun, X. Liu, and H. Meng, "Jointly trained conversion model and WaveNet vocoder for non-parallel voice conversion using mel-spectrograms and phonetic posteriorgrams," Proc. Interspeech 2019, pp. 714–718, 2019.
[13] J. Zhang, Z. Ling, and L.-R. Dai, "Non-parallel sequence-to-sequence voice conversion with disentangled linguistic and speaker representations," IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2019.
[14] H. Kameoka, W.-C. Huang, K. Tanaka, T. Kaneko, N. Hojo, and T. Toda, "Many-to-many voice transformer network," arXiv preprint arXiv:2005.08445, 2020.
[15] K. Qian, Y. Zhang, S. Chang, X. Yang, and M. Hasegawa-Johnson, "AutoVC: Zero-shot voice style transfer with only autoencoder loss," ser. Proceedings of Machine Learning Research, K. Chaudhuri and R. Salakhutdinov, Eds., vol. 97. Long Beach, California, USA: PMLR, 09–15 Jun 2019, pp. 5210–5219. [Online]. Available: http://proceedings.mlr.press/v97/qian19c.html
[16] C.-C. Hsu, H.-T. Hwang, Y.-C. Wu, Y. Tsao, and H.-M. Wang, "Voice conversion from non-parallel corpora using variational auto-encoder," IEEE, 2016, pp. 1–6.
[17] Y. Gao, R. Singh, and B. Raj, "Voice impersonation using generative adversarial networks," IEEE, 2018, pp. 2506–2510.
[18] T. Kaneko and H. Kameoka, "Parallel-data-free voice conversion using cycle-consistent adversarial networks," 2018.
[19] F. Fang, J. Yamagishi, I. Echizen, and J. Lorenzo-Trueba, "High-quality nonparallel voice conversion based on cycle-consistent adversarial network," IEEE, 2018, pp. 5279–5283.
[20] H. Kameoka, T. Kaneko, K. Tanaka, and N. Hojo, "StarGAN-VC: Non-parallel many-to-many voice conversion using star generative adversarial networks," IEEE, 2018, pp. 266–273.
[21] C.-C. Hsu, H.-T. Hwang, Y.-C. Wu, Y. Tsao, and H.-M. Wang, "Voice conversion from unaligned corpora using variational autoencoding Wasserstein generative adversarial networks," in Proc. Interspeech 2017, 2017, pp. 3364–3368. [Online]. Available: http://dx.doi.org/10.21437/Interspeech.2017-63
[22] J.-C. Chou, C.-C. Yeh, H.-Y. Lee, and L.-S. Lee, "Multi-target voice conversion without parallel data by adversarially learning disentangled audio representations," in Proc. Interspeech 2018, 2018, pp. 501–505. [Online]. Available: http://dx.doi.org/10.21437/Interspeech.2018-1830
[23] H. Kameoka, T. Kaneko, K. Tanaka, and N. Hojo, "ACVAE-VC: Non-parallel many-to-many voice conversion with auxiliary classifier variational autoencoder," arXiv preprint arXiv:1808.05092, 2018.
[24] E. Battenberg, R. Skerry-Ryan, S. Mariooryad, D. Stanton, D. Kao, M. Shannon, and T. Bagby, "Location-relative attention mechanisms for robust long-form speech synthesis," in ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2020, pp. 6194–6198.
[25] S. Liu, Y. Cao, S. Kang, N. Hu, X. Liu, D. Su, D. Yu, and H. Meng, "Transferring source style in non-parallel voice conversion," arXiv preprint arXiv:2005.09178, 2020.
[26] A. Graves, "Generating sequences with recurrent neural networks," arXiv preprint arXiv:1308.0850, 2013.
[27] S. Vasquez and M. Lewis, "MelNet: A generative model for audio in the frequency domain," arXiv preprint arXiv:1906.01083, 2019.
[28] T. Salimans, A. Karpathy, X. Chen, and D. P. Kingma, "PixelCNN++: Improving the PixelCNN with discretized logistic mixture likelihood and other modifications," arXiv preprint arXiv:1701.05517, 2017.
[29] W.-C. Huang, T. Hayashi, Y.-C. Wu, H. Kameoka, and T. Toda, "Pretraining techniques for sequence-to-sequence voice conversion," arXiv preprint arXiv:2008.03088, 2020.
[30] S. Kim, T. Hori, and S. Watanabe, "Joint CTC-attention based end-to-end speech recognition using multi-task learning," IEEE, 2017, pp. 4835–4839.
[31] A. Graves, S. Fernández, F. Gomez, and J. Schmidhuber, "Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks," in Proceedings of the 23rd International Conference on Machine Learning, 2006, pp. 369–376.
[32] C. Yu, H. Lu, N. Hu, M. Yu, C. Weng, K. Xu, P. Liu, D. Tuo, S. Kang, G. Lei et al., "DurIAN: Duration informed attention network for multimodal synthesis," arXiv preprint arXiv:1909.01700, 2019.
[33] Y. Ren, Y. Ruan, X. Tan, T. Qin, S. Zhao, Z. Zhao, and T.-Y. Liu, "FastSpeech: Fast, robust and controllable text to speech," in Advances in Neural Information Processing Systems, 2019, pp. 3165–3174.
[34] Y. Wang, R. Skerry-Ryan, D. Stanton, Y. Wu, R. J. Weiss, N. Jaitly, Z. Yang, Y. Xiao, Z. Chen, S. Bengio et al., "Tacotron: Towards end-to-end speech synthesis," Proc. Interspeech 2017, pp. 4006–4010, 2017.
[35] L. Wan, Q. Wang, A. Papir, and I. L. Moreno, "Generalized end-to-end loss for speaker verification," IEEE, 2018, pp. 4879–4883.
[36] S. Liu, D. Wang, Y. Cao, L. Sun, X. Wu, S. Kang, Z. Wu, X. Liu, D. Su, D. Yu, and H. Meng, "End-to-end accent conversion without using native utterances," in ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2020, pp. 6289–6293.
[37] V. Panayotov, G. Chen, D. Povey, and S. Khudanpur, "LibriSpeech: an ASR corpus based on public domain audio books," IEEE, 2015, pp. 5206–5210.
[38] C. Veaux, J. Yamagishi, K. MacDonald et al., "CSTR VCTK corpus: English multi-speaker corpus for CSTR voice cloning toolkit," University of Edinburgh, The Centre for Speech Technology Research (CSTR), 2017.
[39] J. Kominek and A. Black, "The CMU ARCTIC speech databases for speech synthesis research," Tech. Rep. CMU-LTI-03-177, http://festvox.org/cmu_arctic/, 2003.
[40] A. Nagrani, J. S. Chung, and A. Zisserman, "VoxCeleb: a large-scale speaker identification dataset," in Proc. Interspeech, 2017.
[41] J. S. Chung, A. Nagrani, and A. Zisserman, "VoxCeleb2: Deep speaker recognition," in Proc. Interspeech, 2018.
[42] H. Zen, V. Dang, R. Clark, Y. Zhang, R. J. Weiss, Y. Jia, Z. Chen, and Y. Wu, "LibriTTS: A corpus derived from LibriSpeech for text-to-speech," arXiv preprint arXiv:1904.02882, 2019.
[43] N. Kalchbrenner, E. Elsen, K. Simonyan, S. Noury, N. Casagrande, E. Lockhart, F. Stimberg, A. Oord, S. Dieleman, and K. Kavukcuoglu, "Efficient neural audio synthesis," in International Conference on Machine Learning, 2018, pp. 2410–2419.
[44] L.-J. Liu, Z.-H. Ling, Y. Jiang, M. Zhou, and L.-R. Dai, "WaveNet vocoder with limited training data for voice conversion," in Proc. Interspeech, 2018, pp. 1983–1987.
[45] J. Lorenzo-Trueba, J. Yamagishi, T. Toda, D. Saito, F. Villavicencio, T. Kinnunen, and Z. Ling, "The voice conversion challenge 2018: Promoting development of parallel and nonparallel methods," in Proc. Odyssey 2018 The Speaker and Language Recognition Workshop, 2018, pp. 195–202. [Online]. Available: http://dx.doi.org/10.21437/Odyssey.2018-28
[46] S. Liu, J. Zhong, L. Sun, X. Wu, X. Liu, and H. Meng, "Voice conversion across arbitrary speakers based on a single target-speaker utterance," in Proc. Interspeech, 2018, pp. 496–500.
[47] L. van der Maaten and G. Hinton, "Visualizing data using t-SNE," Journal of Machine Learning Research, vol. 9, no. Nov, pp. 2579–2605, 2008.
Songxiang Liu received his B.S. in Automation from the College of Control Science and Engineering, Zhejiang University (ZJU), Hangzhou, China, in 2016. He is currently a Ph.D. candidate at the Human Computer Communications Lab (HCCL) in The Chinese University of Hong Kong, Hong Kong SAR, China. His research focuses on voice conversion, accent conversion, audio adversarial attack and defense, text-to-speech synthesis and automatic speech recognition.
Yuewen Cao received her B.S. degree in communication engineering from Huazhong University of Science & Technology, Wuhan, China, in 2017. She is currently pursuing her Ph.D. degree at the Human Computer Communications Lab (HCCL) in The Chinese University of Hong Kong, Hong Kong SAR, China. Her research interests include speech synthesis and voice conversion.
Disong Wang received his B.S. in Mathematics & Physics Basic Science from the University of Electronic Science and Technology of China (UESTC) in 2015, and his M.E. in Computer Applied Technology from Peking University (PKU) in 2018. He is currently a Ph.D. candidate at the Human Computer Communications Lab (HCCL) in The Chinese University of Hong Kong (CUHK). His research interests include voice conversion, text-to-speech synthesis, automatic speech recognition and their applications to non-standard speech, such as accented voice and dysarthric speech.
Xixin Wu received the B.S., M.S. and Ph.D. degrees from Beihang University, Tsinghua University and The Chinese University of Hong Kong, China, respectively. He has been a Research Assistant at the Machine Intelligence Laboratory, Cambridge University Engineering Department. His research interests include speech synthesis, voice conversion, speech recognition and neural network uncertainty.
Xunying Liu received the bachelor's degree from Shanghai Jiao Tong University, and the Ph.D. degree in speech recognition and the M.Phil. degree in computer speech and language processing, both from the University of Cambridge, Cambridge, UK. He has been a Senior Research Associate at the Machine Intelligence Laboratory, Cambridge University Engineering Department, and since 2016 an Associate Professor in the Department of Systems Engineering and Engineering Management, The Chinese University of Hong Kong. He received the Best Paper Award at ISCA Interspeech 2010. His current research interests include large vocabulary continuous speech recognition, language modelling, noise robust speech recognition, speech synthesis, and speech and language processing. He is a Member of ISCA.