Investigating Deep Neural Structures and their Interpretability in the Domain of Voice Conversion
Samuel J. Broughton, Md Asif Jalal, Roger K. Moore
Dept. Computer Science, University of Sheffield, UK
ABSTRACT
Generative Adversarial Networks (GANs) are machine learning networks built around generating synthetic data. Voice Conversion (VC) is a subset of voice translation that involves translating the paralinguistic features of a source speaker to a target speaker while preserving the linguistic information. The aim of non-parallel conditional GANs for VC is to translate an acoustic speech feature sequence from one domain to another without the use of paired data. In the study reported here, we investigated the interpretability of state-of-the-art implementations of non-parallel GANs in the domain of VC. We show that the learned representations in the repeating layers of a particular GAN architecture remain close to their original randomly initialised parameters, demonstrating that it is the number of repeating layers that is largely responsible for the quality of the output. We also analysed the learned representations of a model trained on one particular dataset when used during transfer learning on another dataset; this showed extremely high levels of similarity across the entire network. Together, these results provide new insight into how the learned representations of deep generative networks change during learning and into the importance of the number of layers.

Index Terms: Voice conversion (VC), generative adversarial networks (GANs), canonical correlation analysis (CCA), SVCCA, non-parallel VC, multi-domain VC
1. INTRODUCTION
Deep learning networks have been shown to exhibit superior abilities in a range of problem domains [1, 2, 3]. However, such networks are black-box representations in terms of their interpretability [4], and this can militate against informed decision making when selecting appropriate network configurations.

One problem domain, voice conversion (VC), or voice style transfer, is a technique aimed at modifying the paralinguistic style of speech while preserving the linguistic information contained therein [5, 6, 7]. VC can be formulated as a regression problem with the aim of building a function in which the features of a source speaker A can be mapped to a target speaker B [8, 6, 7]. Applications of VC include modifying speaker identity in text-to-speech (TTS) systems [9], aiding those with vocal disabilities [10] and generating accents for assisted language conversion in domains such as real-time language translation and device-assisted language learning [11].

(Audio samples available at: https://samuelbroughton.github.io/interpretability-demo-2020.)

Historically, methods employed to achieve VC have included mapping code books [12], Gaussian mixture models (GMMs) [8, 9, 13] and artificial neural networks (ANNs) [14, 15]. However, variations of generative adversarial networks (GANs) [16] have recently shown success in a range of different domains, such as producing convincingly real images and videos [2, 17], enhancing the quality of images [1], generating new music [18] and, of interest here, providing a methodology for achieving VC [19, 20, 21, 3, 22, 23].

Some of the VC methods mentioned above can be categorized as either parallel or non-parallel. Parallel VC requires source and target speaker utterances that are perfectly aligned [6, 7]. Such data can be laborious to collect. Furthermore, once collected, the data would need to be pre-processed with automatic time alignment, which can fail and so require other methods of correction. However, GANs are able to learn mapping functions between data of similar domains and so mitigate the need for a parallel dataset [2]. Recent state-of-the-art non-parallel generative VC architectures include CycleGAN-VC2 [3] and StarGAN-VC2 [21]. Both make use of a gated convolutional neural network (CNN) [24], identity-mapping loss [25] and a 2-1-2D CNN architecture [3].

A major advantage of using the StarGAN [26] framework when compared to CycleGAN [2] is the ability to perform multi-domain conversion whilst only requiring a single generator. With regards to VC, the StarGAN framework allows for learned mapping functions between multiple speakers. Extending this framework, StarGAN-VC [27] and StarGAN-VC2 [21] make various modifications, including updates to the training objective and alterations to the network architecture.

However, despite StarGAN-VC2 demonstrating superior VC in both objective and subjective experiments when compared to StarGAN-VC [21], there has been very little investigation into the interpretability of its network representations, as is the case with many deep multi-layer neural networks, especially GANs [28]. Being able to understand and interpret deep multi-layer GANs would benefit the development of new generative techniques and improve the efficiency of current methods, as interpretability studies have begun to do with discriminative models [29, 30].
The motivation for this paper is to provide some insight into the underlying generation process by focusing on the learned network representations and network depth.

In this work, we conducted an evaluation of learned network representations by performing Singular Vector Canonical Correlation Analysis (SVCCA) [30] in a range of different experiments using an adaptation of the StarGAN-VC2 network. The aim was to provide insight into the interpretability of GANs for VC by addressing the similarity of optimally trained networks and their random initial states. This was achieved by conducting experiments with networks including frozen layers, observing how quickly networks reached their optimal representations, exploring the effects of modifying the size of networks and investigating learned network representations when trained using transfer learning.

The rest of the paper is structured as follows: Section 2 outlines the generative network architecture used, Section 3 discusses SVCCA and the motivation to use it in this work, Section 4 describes the research questions and experimental conditions, Section 5 discusses the results and their probable implications, and Section 6 draws the conclusion.
2. GENERATIVE NETWORK ARCHITECTURE
The network architecture implemented for the experiments presented in this paper was based on StarGAN-VC2 [21], which allows for non-parallel many-to-many learned mappings for VC.
The main objective of the StarGAN framework [26] is to learn many-to-many mapping functions between multiple domains whilst only using a single generator G. StarGAN does this by conditioning itself on 'one-hot' representations of domain codes c ∈ {1, ..., N}, where c and N indicate the domain code and the number of domains, respectively. More specifically, in StarGAN-VC2, G can be formulated as the mapping function G(x, ĉ) → x̂, taking an acoustic input feature sequence x ∈ R^{Q×T} and target domain code ĉ to generate an acoustic output feature sequence x̂. StarGAN-VC2 does this by making use of an adversarial loss [16], a reconstruction or cycle-consistency loss [2] and an identity-mapping loss [25].

Adversarial loss is used in GANs to encourage generated data, conditioned on a target domain code, to be indistinguishable from real data [26]:

$$\mathcal{L}_{\mathrm{adv}} = \mathbb{E}_{(x,c) \sim P(x,c)}[\log D(x, c)] + \mathbb{E}_{x \sim P(x),\, \hat{c} \sim P(\hat{c})}[\log(1 - D(G(x, \hat{c}), \hat{c}))], \tag{1}$$

where D is a real/fake discriminator that attempts to maximise this loss to learn the decision boundary between real and fake features. G attempts to minimise this loss by generating an output indistinguishable from the real acoustic features of target domain ĉ.

As discussed in the StarGAN-VC2 study [21], when handling both hard negative and easy negative samples (e.g. same-speaker domain conversion and different-speaker domain conversion), this condition can make it difficult to bring generated output data close to real target data. Therefore, a source-and-target conditional adversarial loss [21] is used to help G generate an output closer to the real target data. However, during pre-experiments we found that using only the target domain input in G and using both source and target domain inputs in D yielded a better output quality without affecting speaker similarity. The modified source-and-target adversarial objective is defined as:

$$\mathcal{L}_{\mathrm{st\text{-}adv}} = \mathbb{E}_{(x,c) \sim P(x,c),\, \hat{c} \sim P(\hat{c})}[\log D(x, \hat{c}, c)] + \mathbb{E}_{(x,c) \sim P(x,c),\, \hat{c} \sim P(\hat{c})}[\log(1 - D(G(x, \hat{c}), c, \hat{c}))]. \tag{2}$$

Cycle-consistency loss is used in order to guarantee that the converted output feature sequence preserves the source characteristics of the input feature sequence x [2, 26]:

$$\mathcal{L}_{\mathrm{cyc}} = \mathbb{E}_{(x,c) \sim P(x,c),\, \hat{c} \sim P(\hat{c})}[\, \| x - G(G(x, \hat{c}), c) \|_1 \,]. \tag{3}$$

This cyclic constraint encourages G to reconstruct the original input feature x from the generated output x̂ and the source domain code c. This helps G to preserve the linguistic information of the speech [27].

Identity-mapping loss is employed to encourage the preservation of input feature identity within the generated output data [25]:

$$\mathcal{L}_{\mathrm{id}} = \mathbb{E}_{(x,c) \sim P(x,c)}[\, \| G(x, c) - x \|_1 \,]. \tag{4}$$

Identity-mapping loss has previously been used in image-to-image translation for colour preservation [2].

The full objective can be summarised as follows:

$$\mathcal{L}_G = \mathcal{L}_{\mathrm{st\text{-}adv}} + \lambda_{\mathrm{cyc}} \mathcal{L}_{\mathrm{cyc}} + \lambda_{\mathrm{id}} \mathcal{L}_{\mathrm{id}}, \tag{5}$$

$$\mathcal{L}_D = -\mathcal{L}_{\mathrm{st\text{-}adv}}, \tag{6}$$

where λ_cyc and λ_id are hyperparameters weighting each term. Here, G aims to minimise the loss whilst D tries to maximise it.
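To make the interplay of these terms concrete, the following is a minimal PyTorch-style sketch of Eqs. (2)-(6), not the authors' implementation: G(x, code) and D(x, src_code, tgt_code) are assumed callables matching the notation above, with D returning probabilities. Note also that the experiments in Section 4 use a least-squares GAN objective [49] in place of the log form shown here.

```python
import torch

def st_adv_loss(G, D, x, c, c_hat):
    """Source-and-target conditional adversarial loss, Eq. (2).
    Real features of domain c are scored as a (c_hat -> c) conversion,
    converted features as a (c -> c_hat) conversion."""
    eps = 1e-12  # numerical guard for the logs
    real_term = torch.log(D(x, c_hat, c) + eps).mean()
    fake_term = torch.log(1.0 - D(G(x, c_hat), c, c_hat) + eps).mean()
    return real_term + fake_term

def full_objectives(G, D, x, c, c_hat, lam_cyc=10.0, lam_id=5.0):
    """Full objectives of Eqs. (5) and (6); lambda values as in Section 4.
    G minimises loss_g while D minimises loss_d (maximising L_st-adv)."""
    l_st_adv = st_adv_loss(G, D, x, c, c_hat)
    l_cyc = (x - G(G(x, c_hat), c)).abs().mean()        # Eq. (3)
    l_id = (G(x, c) - x).abs().mean()                   # Eq. (4)
    loss_g = l_st_adv + lam_cyc * l_cyc + lam_id * l_id  # Eq. (5)
    loss_d = -l_st_adv                                   # Eq. (6)
    return loss_g, loss_d
```

In practice, the generator output would be detached when computing the discriminator's update, and the two losses would be optimised in alternating steps.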
The fully convolutional GAN architecture used in the study reported here allows for acoustic input feature sequences of arbitrary sizes.

Generator: The input to G was an image of size Q × T of an acoustic feature sequence x, where Q and T are the feature dimension and sequence length, respectively. A 2-1-2D CNN architecture [3, 21] was used to construct G. 2D convolutions are well suited to holding the original data structure, whilst 1D convolutions work well at dynamically changing the data [3]. The implementation specifically used a gated CNN [24], which allowed relevant features to be selected and propagated based on previous layer states. The effectiveness of a gated CNN for VC has already been confirmed in previous studies [27, 20].

Conditional domain-specific style code was injected into the 1D CNN architecture by a modulation-based method [21]. Conditional instance normalisation (CIN) [31, 32] was used to modulate parameters in a domain-specific manner:

$$\mathrm{CIN}(f, \hat{c}) = \gamma_{\hat{c}} \left( \frac{f - \mu(f)}{\sigma(f)} \right) + \beta_{\hat{c}}, \tag{7}$$

where μ(f) and σ(f) are the average and standard deviation of feature f, and γ_ĉ and β_ĉ are domain-specific scale and bias parameters, respectively.

The 1D repeating blocks were not residual because the use of skip connections was reported to result in partial conversion [33].

Real/Fake Discriminator: A 2D gated CNN [24] was used for the architecture of the real/fake discriminator D, which was formulated as a projection discriminator [34], as in StarGAN-VC2 [21]. D outputs a sequence of probabilities indicating how close the input acoustic feature sequence x is to domain c.
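As an illustration, Eq. (7) maps directly onto a small network module. The sketch below is an assumption-laden rendering, not the paper's code: features are taken to be 1D maps of shape (batch, channels, time) and domain codes to be integers indexing learned γ and β tables.

```python
import torch
import torch.nn as nn

class ConditionalInstanceNorm1d(nn.Module):
    """Conditional instance normalisation, Eq. (7): per-channel
    normalisation followed by a domain-specific affine transform."""
    def __init__(self, num_channels: int, num_domains: int):
        super().__init__()
        self.gamma = nn.Embedding(num_domains, num_channels)
        self.beta = nn.Embedding(num_domains, num_channels)
        nn.init.ones_(self.gamma.weight)   # start as identity transform
        nn.init.zeros_(self.beta.weight)

    def forward(self, f: torch.Tensor, c_hat: torch.Tensor) -> torch.Tensor:
        # f: (batch, channels, time); c_hat: (batch,) integer domain codes
        mu = f.mean(dim=2, keepdim=True)
        sigma = f.std(dim=2, keepdim=True) + 1e-8
        gamma = self.gamma(c_hat).unsqueeze(-1)  # (batch, channels, 1)
        beta = self.beta(c_hat).unsqueeze(-1)
        return gamma * (f - mu) / sigma + beta
```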
3. SVCCA ON DEEP NEURAL REPRESENTATIONS
Singular Vector Canonical Correlation Analysis (SVCCA) is an extension of Canonical Correlation Analysis (CCA), a method used in statistics to measure the similarity of two vectors formed by some underlying process [37, 38, 39]. In the case of deep neural networks, these are the "neuron activation vectors" formed from training on a particular dataset [39, 30]. A single neuron activation vector is the output of a single neuron of a layer in a network. Combining the outputs of all neurons for a particular layer in a network results in a set of multidimensional output [39, 30]. Subsequently, CCA can be used to compare the similarity between two layers of the same network, between similar networks using layers of the same or differing dimensionality, or for a given layer at different stages of training [39].

SVCCA is an extension to CCA that involves a pre-processing step [39, 30]. The authors of [30] explain that SVCCA takes the same inputs as CCA, for example two layers of a neural network, l1 and l2, that each contain a set of neuron activation vectors. SVCCA then factorises the vectors by computing a Singular Value Decomposition (SVD) over each layer to obtain subspaces l'1 ⊂ l1 and l'2 ⊂ l2. These subspaces contain the most important variance directions, which can account for 99% of the variance in the input layers l1 and l2 [30]. CCA is then performed on l'1 and l'2 to return the correlation coefficients, providing a measure of similarity of the two layers.

The motivation for using SVCCA in this paper is to provide a similarity metric for the comparison of various layers in the generator network. This allows for the interpretation of learned network representations at different stages of training.
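The procedure can be summarised in a short sketch. The following is a minimal NumPy rendering of SVCCA as described in [30], not the reference implementation; activation matrices are assumed to be arranged as (neurons, datapoints), and the returned value is a CCA distance (one minus the mean canonical correlation), matching the quantity plotted in Section 5.

```python
import numpy as np

def svcca_distance(acts1, acts2, var_kept=0.99):
    """SVCCA between two layers [30]: SVD-reduce each layer to the
    directions explaining `var_kept` of its variance, then run CCA on
    the subspaces and return 1 - mean canonical correlation."""
    def svd_reduce(acts):
        acts = acts - acts.mean(axis=1, keepdims=True)  # centre each neuron
        U, s, Vt = np.linalg.svd(acts, full_matrices=False)
        k = np.searchsorted(np.cumsum(s**2) / np.sum(s**2), var_kept) + 1
        return s[:k, None] * Vt[:k]   # data projected onto top-k directions

    def whiten(a):
        # orthonormal rows spanning a's row space (datapoint space)
        return np.linalg.svd(a, full_matrices=False)[2]

    x, y = svd_reduce(acts1), svd_reduce(acts2)
    # canonical correlations = singular values of the whitened cross-product
    rho = np.linalg.svd(whiten(x) @ whiten(y).T, compute_uv=False)
    return 1.0 - rho.mean()
```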
4. EXPERIMENTS

Datasets: To evaluate our methods, we made use of the Device and Produced Speech Dataset [40], as seen in the multi-speaker VC task of the Voice Conversion Challenge 2018 (VCC2018) [7], and the English Multi-speaker Corpus for the CSTR Voice Cloning Toolkit (VCTK) [41]. We used a subset of both datasets in all experiments except during transfer learning, where the initial model was trained using the VCTK dataset.

From both datasets four speakers were selected, covering all inter- and intra-gender conversions. From the VCTK dataset we selected the speakers labelled p229, p236, p232 and p243; speakers p229 and p236 are female, and speakers p232 and p243 are male. The data from VCC2018 mimicked the data used to test StarGAN-VC2 [21], whereby VCC2SF1 and
VCC2SF2 are female speakers, and
VCC2SM1 and
VCC2SM2 are male speakers. Speaker-wise normalisation was conducted as a pre-process.

For each experiment, 4 × 4 source-and-target pair mappings were learnt for each single model trained on both datasets. All the recordings for both datasets were downsampled to 22.05 kHz. 36 Mel-cepstral coefficients (MCEPs) were extracted from each recording. The logarithmic fundamental frequency (log F0) and aperiodicities (APs) were extracted every 5 ms using the WORLD vocoder [42].

Conversion process: The conversion process mimicked that of StarGAN-VC [27] and StarGAN-VC2 [21] by not using any form of post-filtering [43, 44] or powerful vocoding [45, 46] and just focusing on MCEP conversion. As in previous studies, the WORLD vocoder [42] was used to synthesise speech, directly taking the APs and converting the log F0 using a logarithm Gaussian normalised transformation [47]; a sketch of this pipeline is given after the experiment overview below.

Network implementations: Figure 1 presents the network architectures for G and D, influenced by StarGAN-VC2 [21] and CycleGAN-VC2 [3]. The networks were initially trained for × batch iterations on both datasets. During transfer learning, optimal models trained on the VCTK dataset were selected and trained for an extra × batch iterations on the VCC2018 dataset. During the training of the networks for all experiments, the states of G and D were saved every × batch iterations. All networks were trained using the Adam optimizer [48] with a momentum term β set to . The batch size was set to , and we randomly cropped segments of 512 frames from randomly selected sentences. Learning rates for G and D were both set to , with λ_cyc = 10 and λ_id = 5. L_id was used only for the first iterations. 9 repeating 1D CNN blocks were used for all experiments unless otherwise specified. A least squares GAN [49] was used for the GAN objective.

Fig. 1. Network architectures of the fully convolutional [35] generator and discriminator based on StarGAN-VC2 [21]. In the input, output and reshape layers, 'h', 'w' and 'c' represent the height, width and channel number, respectively. In the Conv2D, Conv1D and ConvT2D convolution layers, 'k', 'c' and 's' represent the kernel size, channel number and stride, respectively. 'IN', 'GLU', 'GSP' and 'FC' denote instance normalisation [36], gated linear unit [24], global sum pooling and fully connected layers, respectively. N = 9 repeating 1D CNN blocks were used for all experiments unless otherwise specified.

Experimental investigation: Experiments were conducted in order to provide insights into questions relating to the interpretability of the trained networks.
Experiment 1 addressed the issue of how similar the learned representations of the optimally trained network are to its random initialisation.
Experiment 2 addressed the question of how similar the learned representations of networks trained via transfer learning on a new dataset are to their previously optimal representations when trained on the original dataset.
Experiment 3 addressed the issue of how similar the learned representations of networks with various frozen repeating layers are.
Experiment 4 addressed the question of how the quality of the output feature sequence changes with networks of a differing number of repeating layers.
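The feature pipeline referred to in the conversion process above can be sketched as follows, assuming the pyworld and pysptk packages as stand-ins for the WORLD analysis [42] and MCEP extraction; the all-pass constant ALPHA is an assumed value commonly used for 22.05 kHz audio, not a figure taken from the paper.

```python
import numpy as np
import pyworld as pw
import pysptk

FS = 22050          # sampling rate used in the experiments
FRAME_PERIOD = 5.0  # analysis shift in ms, as in Section 4
MCEP_ORDER = 35     # 36 MCEPs including the 0th coefficient
ALPHA = 0.455       # all-pass constant for ~22 kHz (assumption)

def extract_features(wav):
    """WORLD analysis [42]: F0, spectral envelope -> MCEPs, aperiodicities."""
    wav = wav.astype(np.float64)
    f0, t = pw.dio(wav, FS, frame_period=FRAME_PERIOD)
    f0 = pw.stonemask(wav, f0, t, FS)        # F0 refinement
    sp = pw.cheaptrick(wav, f0, t, FS)       # spectral envelope
    ap = pw.d4c(wav, f0, t, FS)              # aperiodicities
    mcep = pysptk.sp2mc(sp, order=MCEP_ORDER, alpha=ALPHA)
    return f0, mcep, ap

def logf0_stats(f0_list):
    """Mean/std of log F0 over the voiced frames of one speaker's data."""
    logf0 = np.log(np.concatenate([f[f > 0] for f in f0_list]))
    return logf0.mean(), logf0.std()

def convert_f0(f0, src_stats, tgt_stats):
    """Logarithm Gaussian normalised transformation [47], voiced frames only."""
    (mu_s, sd_s), (mu_t, sd_t) = src_stats, tgt_stats
    out = np.zeros_like(f0)
    voiced = f0 > 0
    out[voiced] = np.exp((np.log(f0[voiced]) - mu_s) / sd_s * sd_t + mu_t)
    return out
```

The converted MCEPs, the transformed F0 and the unmodified APs would then be passed to pw.synthesize to produce the output waveform.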
5. RESULTS AND DISCUSSIONS

Experiment 1: To assess how close the optimally trained networks' learned representations were to their random initialisations, SVCCA was used to compare networks at 0 batch iterations and at their optimal number of batch iterations. The numbers of optimal batch iterations for networks trained on the VCTK and VCC2018 datasets were found to be approximately × and . × , respectively.

Figures 2 and 3 show the CCA distance between the learned representations of layers in the network at different stages of training and their random initialisation. Both figures show a greater similarity between the learned network representations of the repeating 1D CNN layers (R1-R9) and their random initial states when compared to the less similar downsampling and upsampling portions of the network.
Fig. 2. Average CCA distance between the downsampling, repeating and upsampling portions of the trained network and its random initial states (trained on the VCTK dataset).

Both figures show networks trained using the VCTK dataset; however, similar results were observed with the VCC2018 dataset.

The extreme similarity observed at D1 can be seen as a fundamental trait of these networks. During pre-experiments, this trait was also seen in training StarGAN-VC [27]. We removed the GLU of the first downsampling layer to check whether this was preventing the first convolution from learning as much as it could. However, the same trait was still observed.

The results show that the learned network representations of these optimally trained networks remain close to their initial random states, especially in the repeating 1D CNN layers, which are reportedly responsible for the main conversion process [3].
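For reference, the per-layer comparison behind Figures 2 and 3 could be scripted as below, reusing svcca_distance from the Section 3 sketch. collect_activations is a hypothetical helper, and G_init, G_opt, x_probe, c_probe and LAYERS are illustrative placeholders rather than names from the paper.

```python
import torch

def collect_activations(model, probe_batch, layer_names):
    """Gather activations for SVCCA: forward hooks record each named
    layer's output, flattened to (channels, batch*time) so that channels
    act as neurons and frames as datapoints."""
    acts, hooks = {}, []
    modules = dict(model.named_modules())
    for name in layer_names:
        def save(module, inputs, output, name=name):
            out = output.detach()
            acts[name] = out.movedim(1, 0).reshape(out.shape[1], -1).cpu().numpy()
        hooks.append(modules[name].register_forward_hook(save))
    with torch.no_grad():
        model(*probe_batch)
    for h in hooks:
        h.remove()
    return acts

# Comparing a trained checkpoint against the random initialisation, per layer:
acts_init = collect_activations(G_init, (x_probe, c_probe), LAYERS)
acts_opt = collect_activations(G_opt, (x_probe, c_probe), LAYERS)
for name in LAYERS:
    print(name, svcca_distance(acts_init[name], acts_opt[name]))
```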
Experiment 2: The optimal model trained on the VCTK dataset in experiment 1 was used as the initial state for training during transfer learning on the VCC2018 dataset. Figure 4 shows that the learned parameters of the entire network remained extremely close to its initial network representations (the parameters learned from training on the VCTK dataset). The model converged after approximately batch iterations and suffered from partial mode collapse.

The similarity between the target reference and converted synthesised samples of the transfer learning model was poor when compared with the original model trained on the same dataset. This could be due to the difference in speaker regions across the datasets.
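The transfer-learning setup amounts to initialising from the VCTK-optimal checkpoint before training on VCC2018. A sketch, with illustrative file names and with LR_G, LR_D and BETA1 standing in for the learning rates and momentum term of Section 4:

```python
import torch

# Load the VCTK-optimal states (checkpoint names are illustrative)
G.load_state_dict(torch.load("checkpoints/vctk_G_optimal.pt"))
D.load_state_dict(torch.load("checkpoints/vctk_D_optimal.pt"))

# Fresh optimizers; training then continues on the VCC2018 speakers
# exactly as described in Section 4.
opt_G = torch.optim.Adam(G.parameters(), lr=LR_G, betas=(BETA1, 0.999))
opt_D = torch.optim.Adam(D.parameters(), lr=LR_D, betas=(BETA1, 0.999))
```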
Fig. 3. CCA distance between each layer of a network at different stages of training and its random initial states, where "BI" denotes the number of batch iterations trained on the VCTK dataset.
Fig. 4. CCA distance between each layer of a network at different stages of training during transfer learning from the initial states of the previous optimally trained network. Transfer learning was conducted using the VCC2018 dataset from a network previously trained on the VCTK dataset. "BI" denotes the number of batch iterations trained.
Fig. 5. CCA distance between each layer of three different optimally trained networks with varying frozen layers and an optimally trained network with no frozen layers. Network A was trained with the parameters of layers R2 and R3 frozen, network B was trained with the parameters of layers R4 and R5 frozen, and network C was trained with the parameters of layers R6, R7 and R8 frozen. All three networks are extremely similar in terms of their acoustic output when compared with the same optimally trained network with no layers frozen.
Experiment 3: The random initial states of the models trained in experiment 1 were used to train networks with various frozen layers in the repeating portion of the network. A total of three networks were evaluated: the first froze R2 and R3, the second froze R4 and R5, and the third froze R6, R7 and R8. Figure 5 shows the similarity of these networks when compared against the optimally trained model from experiment 1. The repeating 1D layers again showed a high degree of similarity in their learned network representations. All networks were extremely similar in terms of their acoustic output when compared with the optimally trained model.
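Freezing a repeating block so that it keeps its random initial parameters only requires disabling its gradients. A sketch for Network A, assuming the nine repeating blocks are held in a module list G.repeat_blocks (an assumed attribute name):

```python
import torch

# Network A: freeze repeating blocks R2 and R3 (0-indexed entries 1 and 2)
for idx in (1, 2):
    for p in G.repeat_blocks[idx].parameters():
        p.requires_grad = False

# Build the optimizer over the remaining trainable parameters only;
# LR_G stands in for the (elided) generator learning rate.
opt_G = torch.optim.Adam(
    (p for p in G.parameters() if p.requires_grad), lr=LR_G)
```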
Experiment 4: The random initial states of the models trained in experiment 1 were used to train networks with differing numbers of repeating 1D layers. Six models were trained with 3, 5, 7, 11, 13 and 15 repeating layers, in addition to the previously trained model from experiment 1, which had 9 repeating layers. It was observed that, in general, the audio quality of the models using 3, 5, 7 and 9 repeating layers sounded better than that of the models using 11, 13 and 15 layers. However, each model included at least one instance of producing a worse quality of output than its counterparts for various source-target pairs.

It was also observed that, as the number of repeating layers increased, the modification of speaker identity became more pronounced. In other words, the output from models with a greater number of repeating layers had clearer accents than the output of those with fewer repeating layers. However, at some points the modification of speaker identity was so pronounced that the intelligibility of the audio began to deteriorate. Also, as the number of repeating layers of the network increased, so did the overall level of noise. Networks consisting of 3 and 5 repeating layers struggled to converge, whilst networks using 11, 13 and 15 suffered from vanishing gradients.
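Varying the depth of the repeating portion only requires the block count to be a constructor argument. The sketch below is a simplified stand-in: the paper's blocks are gated convolutions modulated by CIN, whereas a plain Conv1d + GLU pair is shown here to illustrate the structure.

```python
import torch.nn as nn

class RepeatStack(nn.Module):
    """Illustrative stand-in for the repeating 1D CNN portion of G,
    with configurable depth as in Experiment 4 (3, 5, ..., 15 blocks)."""
    def __init__(self, channels: int, n_repeat: int = 9):
        super().__init__()
        self.blocks = nn.ModuleList(
            nn.Sequential(
                nn.Conv1d(channels, 2 * channels, kernel_size=5, padding=2),
                nn.GLU(dim=1),   # gated linear unit [24], halves channels
            )
            for _ in range(n_repeat)
        )

    def forward(self, h):
        for block in self.blocks:  # no skip connections (Section 2)
            h = block(h)
        return h
```

Instantiating RepeatStack(channels, n_repeat=3) through n_repeat=15 reproduces the seven depths compared in this experiment.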
6. CONCLUSIONS
In the research reported here, we provide new insights into the interpretability of Generative Adversarial Networks (GANs) for Voice Conversion (VC). Using a network architecture based on StarGAN-VC2 [21], we conducted an investigation into the learned representations of the network over a range of different experimental conditions. The results showed that there is at least one local optimum that lies close to the random initial states of the network. It was also found that it is the number of repeating layers in the network architecture that has a noticeable effect on the quality of the output speech. In general, as the number of repeating layers in the network increased, so too did the noise, and certain aspects of speaker identity became more pronounced. Future work will involve looking further into the importance of network depth in GANs for VC.
7. REFERENCES

[1] Xintao Wang, Ke Yu, Shixiang Wu, Jinjin Gu, Yihao Liu, Chao Dong, Yu Qiao, and Chen Change Loy, "ESRGAN: Enhanced super-resolution generative adversarial networks," in Proceedings of the European Conference on Computer Vision (ECCV), 2018.

[2] Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A. Efros, "Unpaired image-to-image translation using cycle-consistent adversarial networks," in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 2223-2232.

[3] Takuhiro Kaneko, Hirokazu Kameoka, Kou Tanaka, and Nobukatsu Hojo, "CycleGAN-VC2: Improved CycleGAN-based non-parallel voice conversion," in ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2019, pp. 6820-6824.

[4] Quan-shi Zhang and Song-Chun Zhu, "Visual interpretability for deep learning: a survey," Frontiers of Information Technology & Electronic Engineering, vol. 19, no. 1, pp. 27-39, 2018.

[5] Seyed Hamidreza Mohammadi and Alexander Kain, "An overview of voice conversion systems," Speech Communication, vol. 88, pp. 65-82, 2017.

[6] Tomoki Toda, Ling-Hui Chen, Daisuke Saito, Fernando Villavicencio, Mirjam Wester, Zhizheng Wu, and Junichi Yamagishi, "The voice conversion challenge 2016," in Interspeech, 2016, pp. 1632-1636.

[7] Jaime Lorenzo-Trueba, Junichi Yamagishi, Tomoki Toda, Daisuke Saito, Fernando Villavicencio, Tomi Kinnunen, and Zhenhua Ling, "The voice conversion challenge 2018: Promoting development of parallel and nonparallel methods," arXiv preprint arXiv:1804.04262, 2018.

[8] Yannis Stylianou, Olivier Cappé, and Eric Moulines, "Continuous probabilistic transform for voice conversion," IEEE Transactions on Speech and Audio Processing, vol. 6, no. 2, pp. 131-142, 1998.

[9] Hiromichi Kawanami, Yohei Iwami, Tomoki Toda, Hiroshi Saruwatari, and Kiyohiro Shikano, "GMM-based voice conversion applied to emotional speech synthesis," in Eighth European Conference on Speech Communication and Technology, 2003.

[10] Keigo Nakamura, Tomoki Toda, Hiroshi Saruwatari, and Kiyohiro Shikano, "Speaking-aid systems using GMM-based voice conversion for electrolaryngeal speech," Speech Communication, vol. 54, no. 1, pp. 134-146, 2012.

[11] Daniel Felps, Heather Bortfeld, and Ricardo Gutierrez-Osuna, "Foreign accent conversion in computer assisted pronunciation training," Speech Communication, vol. 51, no. 10, pp. 920-932, 2009.

[12] Masanobu Abe, Satoshi Nakamura, Kiyohiro Shikano, and Hisao Kuwabara, "Voice conversion through vector quantization," Journal of the Acoustical Society of Japan (E), vol. 11, no. 2, pp. 71-76, 1990.

[13] Tomoki Toda, Alan W. Black, and Keiichi Tokuda, "Voice conversion based on maximum-likelihood estimation of spectral parameter trajectory," IEEE Transactions on Audio, Speech, and Language Processing, vol. 15, no. 8, pp. 2222-2235, 2007.

[14] Srinivas Desai, E. Veera Raghavendra, B. Yegnanarayana, Alan W. Black, and Kishore Prahallad, "Voice conversion using artificial neural networks," in ICASSP. IEEE, 2009, pp. 3893-3896.

[15] Seyed Hamidreza Mohammadi and Alexander Kain, "Voice conversion using deep neural networks with speaker-independent pre-training," in SLT. IEEE, 2014, pp. 19-23.

[16] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio, "Generative adversarial nets," in Advances in Neural Information Processing Systems, 2014, pp. 2672-2680.

[17] Tero Karras, Samuli Laine, and Timo Aila, "A style-based generator architecture for generative adversarial networks," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 4401-4410.

[18] Hao-Wen Dong, Wen-Yi Hsiao, Li-Chia Yang, and Yi-Hsuan Yang, "MuseGAN: Multi-track sequential generative adversarial networks for symbolic music generation and accompaniment," in Thirty-Second AAAI Conference on Artificial Intelligence, 2018.

[19] Chin-Cheng Hsu, Hsin-Te Hwang, Yi-Chiao Wu, Yu Tsao, and Hsin-Min Wang, "Voice conversion from unaligned corpora using variational autoencoding Wasserstein generative adversarial networks," arXiv preprint arXiv:1704.00849, 2017.

[20] Takuhiro Kaneko, Hirokazu Kameoka, Kaoru Hiramatsu, and Kunio Kashino, "Sequence-to-sequence voice conversion with similarity metric learned using generative adversarial networks," in INTERSPEECH, 2017, pp. 1283-1287.

[21] Takuhiro Kaneko, Hirokazu Kameoka, Kou Tanaka, and Nobukatsu Hojo, "StarGAN-VC2: Rethinking conditional methods for StarGAN-based voice conversion," arXiv preprint arXiv:1907.12279, 2019.

[22] Dipjyoti Paul, Yannis Pantazis, and Yannis Stylianou, "Non-parallel voice conversion using weighted generative adversarial networks," Proc. Interspeech 2019, pp. 659-663, 2019.

[23] Hung-Yi Lee and Yu Tsao, "Generative adversarial network and its applications to speech processing and natural language processing," Speech Processing Laboratory, National Taiwan University. http://speech.ee.ntu.edu.tw/~tlkagk/GAN_3hour.pdf (Accessed: November 19, 2019).

[24] Yann N. Dauphin, Angela Fan, Michael Auli, and David Grangier, "Language modeling with gated convolutional networks," in Proceedings of the 34th International Conference on Machine Learning - Volume 70. JMLR.org, 2017, pp. 933-941.

[25] Yaniv Taigman, Adam Polyak, and Lior Wolf, "Unsupervised cross-domain image generation," arXiv preprint arXiv:1611.02200, 2016.

[26] Yunjey Choi, Minje Choi, Munyoung Kim, Jung-Woo Ha, Sunghun Kim, and Jaegul Choo, "StarGAN: Unified generative adversarial networks for multi-domain image-to-image translation," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 8789-8797.

[27] Hirokazu Kameoka, Takuhiro Kaneko, Kou Tanaka, and Nobukatsu Hojo, "StarGAN-VC: Non-parallel many-to-many voice conversion using star generative adversarial networks," in SLT. IEEE, 2018, pp. 266-273.

[28] Andrey Voynov and Artem Babenko, "RPGAN: GANs interpretability via random routing," arXiv preprint arXiv:1912.10920, 2019.

[29] Aravindh Mahendran and Andrea Vedaldi, "Understanding deep image representations by inverting them," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 5188-5196.

[30] Maithra Raghu, Justin Gilmer, Jason Yosinski, and Jascha Sohl-Dickstein, "SVCCA: Singular vector canonical correlation analysis for deep learning dynamics and interpretability," in Advances in Neural Information Processing Systems, 2017, pp. 6076-6085.

[31] Vincent Dumoulin, Jonathon Shlens, and Manjunath Kudlur, "A learned representation for artistic style," arXiv preprint arXiv:1610.07629, 2016.

[32] Xun Huang and Serge Belongie, "Arbitrary style transfer in real-time with adaptive instance normalization," in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 1501-1510.

[33] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun, "Deep residual learning for image recognition," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770-778.

[34] Takeru Miyato and Masanori Koyama, "cGANs with projection discriminator," arXiv preprint arXiv:1802.05637, 2018.

[35] Jonathan Long, Evan Shelhamer, and Trevor Darrell, "Fully convolutional networks for semantic segmentation," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 3431-3440.

[36] Dmitry Ulyanov, Andrea Vedaldi, and Victor Lempitsky, "Instance normalization: The missing ingredient for fast stylization," arXiv preprint arXiv:1607.08022, 2016.

[37] Harold Hotelling, "Relations between two sets of variates," in Breakthroughs in Statistics, pp. 162-190. Springer, 1992.

[38] David R. Hardoon, Sandor Szedmak, and John Shawe-Taylor, "Canonical correlation analysis: An overview with application to learning methods," Neural Computation, vol. 16, no. 12, pp. 2639-2664, 2004.

[39] Ari Morcos, Maithra Raghu, and Samy Bengio, "Insights on representational similarity in neural networks with canonical correlation," in Advances in Neural Information Processing Systems, 2018, pp. 5727-5736.

[40] Gautham J. Mysore, "Can we automatically transform speech recorded on common consumer devices in real-world environments into professional production quality speech? A dataset, insights, and challenges," IEEE Signal Processing Letters, vol. 22, no. 8, pp. 1006-1010, 2014.

[41] Christophe Veaux, Junichi Yamagishi, and Kirsten MacDonald, "CSTR VCTK Corpus: English multi-speaker corpus for CSTR voice cloning toolkit, [sound]," DataShare, Edinburgh. https://doi.org/10.7488/ds/1994 (Accessed: December 16, 2019).

[42] Masanori Morise, Fumiya Yokomori, and Kenji Ozawa, "WORLD: A vocoder-based high-quality speech synthesis system for real-time applications," IEICE Transactions on Information and Systems, vol. 99, no. 7, pp. 1877-1884, 2016.

[43] Takuhiro Kaneko, Hirokazu Kameoka, Nobukatsu Hojo, Yusuke Ijima, Kaoru Hiramatsu, and Kunio Kashino, "Generative adversarial network-based postfilter for statistical parametric speech synthesis," in ICASSP. IEEE, 2017, pp. 4910-4914.

[44] Takuhiro Kaneko, Shinji Takaki, Hirokazu Kameoka, and Junichi Yamagishi, "Generative adversarial network-based postfilter for STFT spectrograms," in INTERSPEECH, 2017, pp. 3389-3393.

[45] Aaron van den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew Senior, and Koray Kavukcuoglu, "WaveNet: A generative model for raw audio," arXiv preprint arXiv:1609.03499, 2016.

[46] Akira Tamamori, Tomoki Hayashi, Kazuhiro Kobayashi, Kazuya Takeda, and Tomoki Toda, "Speaker-dependent WaveNet vocoder," in Interspeech, 2017, pp. 1118-1122.

[47] Kun Liu, Jianping Zhang, and Yonghong Yan, "High quality voice conversion through phoneme-based linear mapping functions with STRAIGHT for Mandarin," in Fourth International Conference on Fuzzy Systems and Knowledge Discovery (FSKD 2007). IEEE, 2007, vol. 4, pp. 410-414.

[48] Diederik P. Kingma and Jimmy Ba, "Adam: A method for stochastic optimization," arXiv preprint arXiv:1412.6980, 2014.

[49] Xudong Mao, Qing Li, Haoran Xie, Raymond Y. K. Lau, Zhen Wang, and Stephen Paul Smolley, "Least squares generative adversarial networks," in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 2794-2802.