An Overview of Voice Conversion and its Challenges: From Statistical Modeling to Deep Learning
Berrak Sisman, Member, IEEE, Junichi Yamagishi, Senior Member, IEEE, Simon King, Fellow, IEEE, and Haizhou Li, Fellow, IEEE

Berrak Sisman is with the Information Systems Technology and Design (ISTD) Pillar of Singapore University of Technology and Design (SUTD), Singapore. Junichi Yamagishi is with the National Institute of Informatics, Japan, and the University of Edinburgh, United Kingdom. Simon King is with the University of Edinburgh, United Kingdom. Haizhou Li is with the Department of Electrical and Computer Engineering, National University of Singapore.
Abstract—Speaker identity is one of the important characteristics of human speech. In voice conversion, we change the speaker identity from one to another, while keeping the linguistic content unchanged. Voice conversion involves multiple speech processing techniques, such as speech analysis, spectral conversion, prosody conversion, speaker characterization, and vocoding. With the recent advances in theory and practice, we are now able to produce human-like voice quality with high speaker similarity. In this paper, we provide a comprehensive overview of the state-of-the-art voice conversion techniques and their performance evaluation methods, from statistical approaches to deep learning, and discuss their promise and limitations. We will also report the recent Voice Conversion Challenges (VCC), the performance of the current state of technology, and provide a summary of the available resources for voice conversion research.
Index Terms—Voice conversion, speech analysis, speaker characterization, vocoding, voice conversion evaluation, voice conversion challenges.
I. INTRODUCTION
Voice conversion (VC) is a significant aspect of artificial intelligence. It is the study of how to convert one's voice to sound like that of another without changing the linguistic content. Voice conversion belongs to the general technical field of speech synthesis, which converts text to speech or changes the properties of speech, for example, voice identity, emotion, and accents. Stewart, a pioneer in speech synthesis, commented in 1922 [1] that "the really difficult problem involved in the artificial production of speech-sounds is not the making of a device which shall produce speech, but in the manipulation of the apparatus". As voice conversion is focused on the manipulation of voice identity in speech, it represents one of the challenging research problems in speech processing.

There has been a continuous effort in the quest for effective manipulation of speech properties since the debut of computer-based speech synthesis in the 1950s. The rapid development of digital signal processing in the 1970s greatly
facilitated the control of the parameters for speech manipulation. While the original motivation of voice conversion could be simply novelty and curiosity, the technological advancements from statistical modeling to deep learning have made a major impact on many real-life applications and benefited consumers, such as personalized speech synthesis [2], [3], communication aids for the speech-impaired [4], speaker de-identification [5], voice mimicry [6] and disguise [7], and voice dubbing for movies.

In general, a speaker can be characterized by three factors: 1) linguistic factors that are reflected in sentence structure, lexical choice, and idiolect; 2) supra-segmental factors such as the prosodic characteristics of a speech signal; and 3) segmental factors that are related to short-term features, such as spectrum and formants. When the linguistic content is fixed, the supra-segmental and the segmental factors are the relevant factors concerning speaker individuality. An effective voice conversion technique is expected to convert both the supra-segmental and the segmental factors. Despite much progress, voice conversion is still far from perfect. In this paper, we celebrate the technological advances, and at the same time we expose their limitations. We will discuss the state-of-the-art technology from historical and technological perspectives.

A typical voice conversion pipeline includes speech analysis, mapping, and reconstruction modules, as illustrated in Figure 1, which is referred to as the analysis-mapping-reconstruction pipeline. The speech analyzer decomposes the speech signals of a source speaker into features that represent supra-segmental and segmental information, the mapping module changes them towards the target speaker, and finally the reconstruction module re-synthesizes time-domain speech signals. The mapping module has taken centre stage in many of the studies. These techniques can be categorized in different ways, for example, based on the use of training data - parallel vs non-parallel, the type of statistical modeling technique - parametric vs non-parametric, the scope of optimization - frame level vs utterance level, and the workflow of conversion - direct mapping vs inter-lingual. Let's first give an account from the perspective of the use of training data.

The early studies of voice conversion were focused on spectrum mapping using parallel training data, where speech of the same linguistic content is available from both the source and target speaker, for example, vector quantization (VQ) [8] and fuzzy vector quantization [9]. With parallel data, one can align the two utterances using Dynamic Time Warping [10].
The statistical parametric approaches can benefit from more training data for improved performance, to name a few: Gaussian mixture model [11]–[13], partial least squares regression [14], and dynamic kernel partial least squares regression (DKPLS) [15].

One of the successful statistical non-parametric techniques is based on non-negative matrix factorization (NMF) [16] and is known as the exemplar-based sparse representation technique [17]–[20]. It requires a smaller amount of training data than the parametric techniques, and addresses well the over-smoothing problem. The family of sparse representation techniques includes phonetic sparse representation and group sparsity implementation [21], [22], which greatly improved the voice quality on small parallel training datasets.

The studies on voice conversion towards non-parallel training data [23]–[28] open up opportunities for new applications. The challenge is how to establish the mapping between non-parallel source and target utterances. The INCA alignment technique by Erro et al. [27] represents one of the solutions to the non-parallel data alignment problem [29]. With the alignment techniques, one is able to extend the voice conversion techniques from parallel data to non-parallel data, such as the extension to DKPLS [30] and the speaker model alignment method [31]. The Phonetic Posteriorgram (PPG)-based approach [32] represents another direction of research towards non-parallel training data. While the alignment technique doesn't use external resources, the PPG-based approach makes use of an automatic speech recognizer to generate an intermediate phonetic representation [33], [34] as the interlingua between the speakers. Successful applications include Phonetic Sparse Representation [22].

Wu and Li [6], and Mohammadi and Kain [35] provided an overview of voice conversion systems from the perspective of time alignment of speech features followed by feature mapping, which represents the statistical modeling school of thought. The advent of deep learning techniques represents an important technology milestone in voice conversion research [36]. It has not only greatly advanced the state-of-the-art, but also transformed the way we formulate the voice conversion research problems. It also opens up a new direction of research beyond the parallel and non-parallel data paradigm. Nonetheless, the studies on statistical modeling approaches have provided profound insights into many aspects of the research problems that serve as the foundation of today's deep learning methodology. In this paper, we will give an overview of voice conversion research by providing a perspective that reveals the underlying design principles from statistical modeling to deep learning.

Deep learning's contributions to voice conversion can be summarized in three areas. Firstly, it allows the mapping module to learn from a large amount of speech data, and therefore tremendously improves voice quality and similarity to the target speaker. With neural networks, we see the mapping module as a nonlinear transformation function [37] that is trained from data [38], [39]. LSTM represents a successful implementation with parallel training data [40]. Deep learning has also made a great impact on non-parallel data techniques. The joint use of DBLSTM and i-vector [41], the KL-divergence and DNN-based approach [42], variational autoencoder [43], average modeling [44], and DBLSTM-based Recurrent Neural Networks [32], [45] bring the voice quality to a new height.
More recently, Generative Adversarial Networks such as VAW-GAN [46], CycleGAN [47]–[49], and StarGAN [50] further advance the state-of-the-art.

Secondly, deep learning has created a profound impact on vocoding technology. Speech analysis and reconstruction modules are typically implemented using a traditional parametric vocoder [11]–[13], [51]. The parameters of such vocoders are manually tuned according to some over-simplified assumptions in signal processing. As a result, the parametric vocoders offer a suboptimal solution. A neural vocoder is a neural network that learns to reconstruct an audio waveform from acoustic features [52]. For the first time, the vocoder becomes trainable and data-driven. The WaveNet vocoder [53] represents one of the popular neural vocoders, which directly estimates waveform samples from the input feature vectors. It has been studied intensively, for example, the speaker-dependent and speaker-independent WaveNet vocoders [54], [55], the quasi-periodic WaveNet vocoder [56], [57], the adaptive WaveNet vocoder with GANs [58], the factorized WaveNet vocoder [59], and the refined WaveNet vocoder with VAEs [60], which are known for their natural-sounding voice quality. The WaveNet vocoder is also widely adopted in the traditional voice conversion pipeline, such as in GMM [54] and sparse representation [61], [62] systems. Other successful neural vocoders include the WaveRNN vocoder [63] and WaveGlow [64], which are excellent vocoders in their own right.

Thirdly, deep learning represents a departure from the traditional analysis-mapping-reconstruction pipeline. All the above techniques largely follow the voice conversion pipeline as in Figure 1. As the neural vocoder is trainable, it can be trained jointly with the mapping module [58] and even with the analysis module to become an end-to-end solution [53].

Voice conversion research used to be a niche area in speech synthesis. However, it has become a major topic in recent years. In the 45th International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2020), voice conversion papers represent more than one-third of the papers under the speech synthesis category. The growth of the research community was accelerated by collaborative activities across academia and industry, such as the Voice Conversion Challenge (VCC) 2016, which was first launched [65]–[67] at INTERSPEECH 2016. VCC 2016 focuses on the most basic voice conversion task, that is, voice conversion with parallel training data recorded in an acoustic studio. It establishes the evaluation methodology and protocol for performance benchmarking, which are widely adopted in the community. VCC 2018 [68]–[70] proposes a non-parallel training data challenge, and also connects voice conversion with anti-spoofing in speaker verification studies. VCC 2020 puts forward a cross-lingual voice conversion challenge for the first time.
Fig. 1: The typical flow of a voice conversion system. The pink box represents the training of the mapping function, while the blue box applies the mapping function at run-time, in a 3-step pipeline process Y = (R ◦ F ◦ A)(X).

We will provide an overview of the series of challenges and the publicly available resources in this paper.

This paper is organized as follows: In Section II, we present the typical flow of voice conversion that includes feature extraction, feature mapping and waveform generation. In Section III, we study statistical modeling for voice conversion with parallel training data. In Section IV, we study statistical modeling for voice conversion without parallel training data. In Section V, we study the deep learning approaches for voice conversion with parallel training data, and beyond parallel training data. In Section VI, we explain the evaluation techniques for voice conversion. In Sections VII and VIII, we summarize the series of voice conversion challenges, and the publicly available research resources for voice conversion. We conclude in Section IX.

II. TYPICAL FLOW OF VOICE CONVERSION
The goal of voice conversion is to modify a source speaker's voice to sound as if it is produced by a target speaker. In other words, a voice conversion system only modifies the speaker-dependent characteristics of speech, such as formants, fundamental frequency (F0), intonation, intensity and duration, while carrying over the speaker-independent speech content.

The core module of a voice conversion system performs the conversion function. Let's denote the source and target speech signals as X and Y respectively. As will be discussed later, voice conversion is typically applied to some intermediate representation of speech, or speech feature, that characterizes a speech frame. Let's denote the source and target speech features as x and y. The conversion function can be formulated as follows,

y = F(x)    (1)

where F(·) is also called the mapping function in the rest of this paper. As illustrated in Figure 1, a typical voice conversion framework is implemented in three steps: 1) speech analysis, 2) feature mapping, and 3) speech reconstruction, which we call the analysis-mapping-reconstruction pipeline. We discuss each in detail next.

A. Speech Analysis and Reconstruction
Speech analysis and reconstruction are two crucial processes in the 3-step pipeline. The goal of speech analysis is to decompose speech signals into some form of intermediate representation for effective manipulation or modification with respect to the acoustic properties of speech. There have been many useful intermediate representation techniques that were initially studied for speech communication and speech synthesis. They become handy for voice conversion. In general, the techniques can be categorized into model-based representations and signal-based representations.

In model-based representation, we assume that the speech signal is generated according to an underlying physical model, such as the source-filter model, and express a frame of the speech signal as a set of model parameters. By modifying the parameters, we manipulate the input speech. In signal-based representation, we don't assume any models, but rather represent speech as a composition of controllable elements in the time domain or frequency domain. Let's denote the intermediate representation for the source speaker as x; speech analysis can be described by a function,

x = A(X)    (2)

Speech reconstruction can be seen as an inverse function of speech analysis, which operates on the modified parameters and generates an audible speech signal. It works with speech analysis in tandem. For example, a vocoder [51] is used to express a speech frame with a set of controllable parameters that can be converted back into a speech waveform. The Griffin-Lim algorithm is used to reconstruct a speech signal from a modified short-time Fourier transform after amplitude modification [71]. As the output speech quality is affected by the speech reconstruction process, speech reconstruction is also one of the important topics in voice conversion research. Let's denote the modified intermediate representation and the reconstructed speech signal for the target speaker as y and Y = R(y); voice conversion can then be described by a composition of three functions,

Y = (R ◦ F ◦ A)(X) = C(X)    (3)

which represents the typical flow of a voice conversion system as a 3-step pipeline. As the mapping is applied frame-by-frame, the number of converted speech features y is the same as that of the source speech features x if the speech duration is not modified in the process.

While speech analysis and reconstruction make voice conversion possible, just like other signal processing techniques, they inevitably also introduce artifacts. Many studies were devoted to minimizing such artifacts. We next discuss the most commonly used speech analysis and reconstruction techniques in voice conversion.
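As a minimal, runnable illustration of the composition in Eq. (3), the sketch below uses a magnitude STFT as the analysis function A, an identity placeholder for the mapping F, and the Griffin-Lim algorithm mentioned above as the reconstruction R. It assumes the librosa library and a hypothetical input file; a real system would substitute a trained mapping model for the placeholder.

```python
import numpy as np
import librosa

N_FFT, HOP = 1024, 256

def analyze(waveform):
    # A(X): short-time Fourier transform magnitude as the intermediate representation
    return np.abs(librosa.stft(waveform, n_fft=N_FFT, hop_length=HOP))

def convert(features):
    # F(x): identity placeholder; a trained mapping model goes here
    return features

def reconstruct(features):
    # R(y): Griffin-Lim phase reconstruction from the (modified) magnitude spectrogram
    return librosa.griffinlim(features, n_iter=32, hop_length=HOP)

# Y = (R ◦ F ◦ A)(X)
source, sr = librosa.load("source.wav", sr=16000)   # hypothetical input file
converted = reconstruct(convert(analyze(source)))
```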
1) Signal-based Representation:
Pitch Synchronous Overlap and Add (PSOLA) is an example of signal-based representation techniques. It decomposes a speech signal into overlapping speech segments [72], each of which represents one of the successive pitch periods of the speech signal. By overlap-and-adding these speech segments with different pitch periods, we can reconstruct the speech signal with a different intonation. As PSOLA operates directly on the time-domain speech signal [72], the analysis and reconstruction do not introduce significant artifacts. While the PSOLA technique is effective for modification of the fundamental frequency of speech signals, it suffers from several inherent limitations [73], [74]. For example, unvoiced speech is not periodic, and the manipulation of its time-domain signal is not straightforward.

The Harmonic plus Noise Model (HNM) represents another signal-based representation approach. It works under the assumption that a speech signal can be represented as a harmonic component plus a noise component, delimited by the so-called maximum voiced frequency [75]. The harmonic component is modeled as the sum of harmonic sinusoids up to the maximum voiced frequency, while the noise component is modeled as Gaussian noise filtered by a time-varying autoregressive filter. As the HNM decomposition is represented by some controllable parameters, it allows for easy modification of speech [76], [77].
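As a sketch of the HNM decomposition just described, a voiced speech frame can be written as a sum of harmonics below the maximum voiced frequency F_m plus filtered noise. The symbols A_k, phi_k, f_0, h and n are illustrative notation introduced here, not taken from [75]:

```latex
s(t) = \underbrace{\sum_{k=1}^{K} A_k(t)\cos\!\big(2\pi k f_0(t)\,t + \phi_k(t)\big)}_{\text{harmonic part},\; k f_0 \le F_m}
     \;+\; \underbrace{\big(h(t) * n(t)\big)}_{\text{filtered noise part}}
```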
2) Model-based Representation:
The model-based technique assumes that the input signal can be mathematically represented by a model whose parameters vary with time. A typical example is the source-filter model, which represents a speech signal as the outcome of an excitation of the larynx (source) modulated by a transfer (filter) function determined by the shape of the supralaryngeal vocal tract. A vocoder, short for voice coder, was initially developed to minimize the amount of data that is transmitted for voice communication. It encodes speech into slowly changing control parameters, such as linear predictive coding and mel-log spectrum approximation [78], that describe the filter, and re-synthesizes the speech signal with the source information at the receiving end. In voice conversion, we convert the speech signals from a source speaker to mimic the target speaker by modifying the controllable parameters.

The majority of vocoders are designed based on some form of the source-filter model of speech production, such as mixed excitation with a spectral envelope, and glottal vocoders [79]. STRAIGHT, or "Speech Transformation and Representation using Adaptive Interpolation of weiGHTed spectrum", is one of the popular vocoders in speech synthesis and voice conversion [80]. It decomposes a speech signal into: 1) a smooth spectrogram which is free from periodicity in time and frequency; 2) a fundamental frequency (F0) contour which is estimated using a fixed-point algorithm; and 3) a time-frequency periodicity map which captures the spectral shape of the noise and its temporal envelope. STRAIGHT is widely used in voice conversion because its parametric representation facilitates the statistical modeling of speech, which allows for easy manipulation [11], [81], [82].

Parametric vocoders are widely adopted for the analysis and reconstruction of speech in voice conversion studies [8], [9], [11], [12], [46], [47], [83], [84], and continue to play a major role today [17], [21], [22]. The traditional parametric vocoders are designed to approximate the complex mechanics of human speech production under certain simplified assumptions. For example, the interaction between F0 and the formant structure is ignored, and the original phase structure is discarded [85]. The assumption of a stationary process in the short-time window, and of a time-invariant linear filter, also gives rise to "robotic" and "buzzy" voices. Such problems become more serious in voice conversion as we modify both F0 and the formant structure of speech, among others, at the same time. We believe that vocoding can be improved by considering the interaction between the parameters.
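The sketch below illustrates this style of parametric analysis and re-synthesis using WORLD, a STRAIGHT-like vocoder with an open-source Python binding (pyworld), rather than STRAIGHT itself; the constant F0 scaling is a stand-in for a learned conversion of the controllable parameters, and the file names are hypothetical.

```python
import numpy as np
import soundfile as sf
import pyworld as pw

# Load a mono source utterance
x, fs = sf.read("source.wav")
x = x.astype(np.float64)

# Analysis: decompose into F0 contour, smooth spectral envelope, and aperiodicity,
# analogous to the three STRAIGHT components described above
f0, t = pw.harvest(x, fs)          # F0 contour
sp = pw.cheaptrick(x, f0, t, fs)   # smooth spectrogram (spectral envelope)
ap = pw.d4c(x, f0, t, fs)          # time-frequency periodicity (aperiodicity) map

# Naive "conversion": shift F0 by a constant ratio; a real system would also
# transform the spectral envelope towards the target speaker
f0_converted = f0 * 1.5

# Reconstruction: re-synthesize the waveform from the modified parameters
y = pw.synthesize(f0_converted, sp, ap, fs)
sf.write("converted.wav", y, fs)
```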
3) WaveNet Vocoder:
Deep learning offers a solution to some of the inherent problems of parametric vocoders. WaveNet [53] is a deep neural network that learns to generate high-quality time-domain waveforms. As it doesn't assume any mathematical model, it is a data-driven solution that requires a large amount of training data.

The joint probability of a waveform X = {x_1, x_2, ..., x_N} can be factorized as a product of conditional probabilities,

p(X) = ∏_{n=1}^{N} p(x_n | x_1, x_2, ..., x_{n-1})    (4)

A WaveNet is constructed with many residual blocks, each of which consists of 2 × 1 dilated causal convolutions, gated activation units, and 1 × 1 convolutions. Given auxiliary features h, WaveNet can also model the conditional distribution p(X|h) [53]. Eq. (4) can then be written as follows:

p(X|h) = ∏_{n=1}^{N} p(x_n | x_1, x_2, ..., x_{n-1}, h)    (5)

A typical parametric vocoder performs both analysis and reconstruction of speech. However, most of today's WaveNet vocoders only cover the function of speech reconstruction. A WaveNet vocoder takes some intermediate representation of speech as the input auxiliary features, and generates the speech waveform as the output. The WaveNet vocoder [55] remarkably outperforms the traditional parametric vocoders in terms of sound quality. Not only can it learn the relationship between input features and output waveform, but it also learns the interaction among the input features. It has been successfully adopted as part of state-of-the-art speech synthesis [3], [86]–[89] and voice conversion [54], [55], [57], [60]–[62], [86], [90]–[97] systems.

There have been promising studies on using vocoding parameters as the intermediate representations in WaveNet vocoding. A speaker-independent WaveNet vocoder [55] was studied by utilizing the STRAIGHT vocoding parameters, such as F0, aperiodicity, and spectrum, as the inputs of WaveNet. In this way, WaveNet learns a sample-by-sample correspondence between the time-domain waveform and the input vocoding parameters. When such a WaveNet vocoder is trained on speech signals from a large speaker population, we obtain a speaker-independent vocoder [55]. By adapting the speaker-independent WaveNet vocoder with speaker-specific data, we obtain a speaker-dependent vocoder that generates personalized voice output [58], [60]. The study of the WaveNet vocoder also opens up opportunities for the use of other non-vocoding parameters as the input. For example, a recent line of studies adopts the phonetic posteriorgram (PPG) in WaveNet vocoding with promising results in voice conversion with non-parallel training data [94]–[97]. Another study adopts the latent code of an autoencoder and a speaker embedding as the speech representation for the WaveNet vocoder [98].
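To make the architecture concrete, below is a deliberately tiny, hedged sketch of a WaveNet-style conditional model: a stack of dilated causal convolutions with gated activations that outputs, at every position, the logits of p(x_n | x_<n, h) over 256 quantized levels. It is an illustration of Eq. (5), not the full WaveNet of [53]; all sizes are arbitrary.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyWaveNet(nn.Module):
    """A drastically simplified WaveNet-style stack: dilated causal convolutions
    with conditional gated activations, predicting a categorical distribution
    over 256 quantized (e.g., mu-law) levels at every waveform position."""
    def __init__(self, n_levels=256, channels=64, cond_dim=80, n_layers=6):
        super().__init__()
        self.embed = nn.Conv1d(1, channels, kernel_size=1)
        self.cond = nn.Conv1d(cond_dim, 2 * channels, kernel_size=1)
        self.dilations = [2 ** i for i in range(n_layers)]
        self.convs = nn.ModuleList(
            nn.Conv1d(channels, 2 * channels, kernel_size=2, dilation=d)
            for d in self.dilations)
        self.out = nn.Conv1d(channels, n_levels, kernel_size=1)

    def forward(self, x, h):
        # x: (B, 1, N) previous samples (shift the waveform by one for training,
        # so logits at position n depend only on x_<n); h: (B, cond_dim, N)
        z = self.embed(x)
        c = self.cond(h)
        for conv, d in zip(self.convs, self.dilations):
            zp = F.pad(z, (d, 0))                                # left-pad => causal
            f_gate, g_gate = (conv(zp) + c).chunk(2, dim=1)
            z = z + torch.tanh(f_gate) * torch.sigmoid(g_gate)   # gated residual block
        return self.out(z)                                       # logits of p(x_n | x_<n, h)

net = TinyWaveNet()
x = torch.randn(2, 1, 4000)      # stand-in for shifted, normalized past samples
h = torch.randn(2, 80, 4000)     # auxiliary features upsampled to the sample rate
logits = net(x, h)               # (2, 256, 4000): one distribution per position
```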
4) Recent Progress on Neural Vocoders:
More recently, the speaker-independent WaveRNN-based neural vocoder [63] has become popular, as it can generate human-like voices from both in-domain and out-of-domain spectrograms [99]–[101]. Another well-known neural vocoder that achieves high-quality synthesis performance is WaveGlow [64]. WaveGlow is a flow-based network capable of generating high-quality speech from a mel-spectrogram [102]. WaveGlow combines the best of Glow and WaveNet to provide fast, efficient and high-quality audio synthesis, without the need for auto-regression. We note that WaveGlow is implemented using only a single network with a single cost function, that is, maximizing the likelihood of the training data, which makes the training procedure simple and stable [103].

WaveNet [53] uses an auto-regressive (AR) approach to model the distribution of waveform sampling points, which incurs a high computational cost. As an alternative to auto-regression, a neural source-filter (NSF) waveform modeling framework has been proposed [104], [105]. We note that NSF is straightforward to train and fast in waveform generation. It is reported to be 100 times faster than the WaveNet vocoder, yet achieves comparable voice quality on a large speech corpus [106].
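For reference, the single cost function of a flow-based vocoder such as WaveGlow is the exact log-likelihood given by the change-of-variables formula, where f_theta is the invertible network mapping the waveform x (conditioned on the mel-spectrogram h) to a Gaussian latent z = f_theta(x; h). This is the generic normalizing-flow objective rather than WaveGlow's exact layer-wise decomposition:

```latex
\log p_\theta(x \mid h) = \log \mathcal{N}\!\big(f_\theta(x;h);\, 0,\, I\big)
 + \log \left|\det \frac{\partial f_\theta(x;h)}{\partial x}\right|
```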
B. Feature Extraction
With speech analysis, we derive vocoding parameters, usually containing spectral and prosodic components, to represent the input speech. The vocoding parameters characterize the speech in a way that allows us to reconstruct the speech signal later on after transmission. This is particularly important in speech communication. However, such vocoding parameters may not be the best for the transformation of voice identity. More often, the vocoding parameters are further transformed into speech features, which we call feature extraction in Figure 1, for more effective modification of the acoustic properties in voice conversion.

For the spectral component, feature extraction aims to derive low-dimensional representations from the high-dimensional raw spectra. Generally speaking, the spectral features should be able to represent speaker individuality well. They should not only fit the spectral envelope well, but also be convertible back to a spectral envelope. They should also have good interpolation properties that allow for flexible modification.

The magnitude spectrum can be warped to the Mel or Bark frequency scale, which are perceptually meaningful for voice conversion. It can also be transformed into the cepstral domain using a finite number of coefficients via the Discrete Cosine Transform of the log-magnitude. Cepstral coefficients are less correlated. In this way, the high-dimensional magnitude spectrum is transformed into a lower-dimensional feature representation. The commonly used speech features include Mel-cepstral coefficients (MCC), linear predictive cepstral coefficients (LPCC), and line spectral frequencies (LSF). Typically, a speech frame is represented by a feature vector.

Short-time analysis has been the most practical way of speech analysis. Unfortunately, it inherently ignores the temporal context of speech, which is crucial in voice conversion. Many studies have shown that multiple frames [18], [107], dynamic features [62], and phonetic segments serve as effective features in feature mapping.

For the prosodic component, feature extraction can be used to decompose prosodic signals, such as fundamental frequency (F0), aperiodicity (AP), and energy contours, into speaker-dependent and speaker-independent parameters [82]. In this way, we can carry over the speaker-independent prosodic patterns, while converting the speaker-dependent ones during the feature mapping.
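As a minimal illustration of this warp-then-DCT recipe, the sketch below extracts mel-scale cepstral coefficients with librosa. MFCCs are used as an accessible stand-in for the MCC features common in voice conversion, which are normally computed from a vocoder spectral envelope rather than an FFT spectrum; the file name and parameters are hypothetical.

```python
import librosa

# Magnitude spectrum -> mel-scale warping -> log -> DCT -> low-order coefficients
y, sr = librosa.load("speech.wav", sr=16000)
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=24, n_fft=1024, hop_length=256)
print(mfcc.shape)   # (24, n_frames): one 24-dimensional feature vector per frame
```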
C. Feature Mapping
In the typical flow of voice conversion, feature mapping performs the modification of speech features from the source to the target speaker. Spectral mapping seeks to change the voice timbre, while prosody conversion seeks to modify the prosodic features, such as fundamental frequency, intonation and duration. So far, spectral mapping remains the center of many voice conversion studies.

During training, we learn the mapping function F(·) in Eq. (1) from training data. At run-time inference, the mapping function transforms the acoustic features. A large part of this paper is devoted to the study of the mapping function. In Section III, we will discuss the traditional statistical modeling techniques with parallel training data. In Section IV, we will review the statistical modeling techniques that do not require parallel training data. In Section V, we will introduce a number of deep learning approaches, which cover 1) parallel training data of paired speakers, and 2) beyond parallel data of paired speakers.

III. STATISTICAL MODELING FOR VOICE CONVERSION WITH PARALLEL TRAINING DATA
Most of the traditional voice conversion techniques assume the availability of parallel training data. In other words, the mapping function is trained on paired utterances of the same linguistic content spoken by the source and target speakers. Voice conversion studies started with statistical approaches [108] in the late 1980s, which can be grouped into parametric and non-parametric mapping techniques. Parametric techniques make assumptions about the underlying statistical distributions of speech features and their mapping. Non-parametric ones make fewer assumptions about the data, but seek to fit the training data with the best mapping function, while maintaining some ability to generalize to unseen data.

Parametric techniques, such as the Gaussian mixture model (GMM) [109], Dynamic Kernel Partial Least Squares Regression, and the PSOLA mapping technique [73], represent a great success in the recent past. The vector quantization approach to voice conversion is a typical non-parametric technique. It maps codewords between source and target codebooks [8]. In this method, a source feature vector is approximated by the nearest codeword in the source codebook, and mapped to the corresponding codeword in the target codebook. To reduce the quantization error, fuzzy vector quantization was studied [9], [110], where continuous weights for individual clusters are determined at each frame according to the source feature vector. The converted feature vector is defined as a weighted sum of the centroid vectors of the mapping codebook. More recently, the non-negative matrix factorization approach marks a successful non-parametric implementation.

We will discuss a typical frame-level mapping paradigm under the assumption of parallel training data, as illustrated in Figure 2. During the training phase, given parallel training data from a source speaker x and a target speaker y, frame alignment is performed to align the source speech vectors and target speech vectors to obtain the paired speech feature vectors z = {x, y}. Dynamic time warping is a feature-based alignment technique that is commonly used; a sketch is given below. A speech recognizer, which is equipped with phonetic knowledge, can also be used to perform model-based alignment. Frame alignment has been well studied in speech processing. In voice conversion, a large body of literature has been devoted to the design of the frame-level mapping function.
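A sketch of feature-based frame alignment with dynamic time warping, using librosa; the file names are hypothetical, and the resulting index pairs would be used to assemble the joint vectors z = {x, y} for training a frame-level mapping:

```python
import numpy as np
import librosa

# Hypothetical parallel utterances of the same sentence from two speakers
xs, sr = librosa.load("source_utt.wav", sr=16000)
xt, _ = librosa.load("target_utt.wav", sr=16000)

# Frame-level features: one column per frame
X = librosa.feature.mfcc(y=xs, sr=sr, n_mfcc=24)
Y = librosa.feature.mfcc(y=xt, sr=sr, n_mfcc=24)

# Dynamic time warping over the frame-wise cosine cost
D, wp = librosa.sequence.dtw(X=X, Y=Y, metric="cosine")
wp = np.flip(wp, axis=0)   # warping path as (source_frame, target_frame) pairs

# Paired feature vectors z = {x, y} for training a frame-level mapping
pairs = [(X[:, i], Y[:, j]) for i, j in wp]
```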
A. Gaussian Mixture Models

In the Gaussian mixture model (GMM) approach to voice conversion [109], we represent the relationship between two sets of spectral envelopes, from the source and target speakers, using a Gaussian mixture model. The Gaussian mixture model is a continuous parametric function that is trained to model the spectral mapping. In [109], harmonic plus noise (HNM) features are used in the feature mapping, which allows for high-quality modification of speech signals. The GMM approach is seen as an extension of the vector quantization approach [8], [9] that results in improved voice quality. However, the speech quality is affected by some factors, e.g., spectral movement with inappropriate dynamic characteristics caused by the frame-by-frame conversion process, and excessive smoothing of the converted spectra [111]–[113].

To address the frame-by-frame conversion issue, a maximum likelihood estimation technique was studied to model the spectral parameter trajectory [11]. This technique aims to estimate an appropriate spectrum sequence using dynamic acoustic features. To address the over-smoothing issue, or the muffled effect, the joint density Gaussian mixture model (JD-GMM) was studied [2], [11] to jointly model the sequences of spectral features and their variances using maximum likelihood estimation, which increases the global variance of the spectral features. The JD-GMM method involves two phases: an off-line training phase and a run-time conversion phase. During the training phase, a Gaussian mixture model (GMM) is adopted to model the joint probability density p(z) of the paired feature vector sequence z = {x, y}, which represents the joint distribution of source speech x and target speech y:

p(z) = ∑_{k=1}^{K} w_k^{(z)} N(z | μ_k^{(z)}, Σ_k^{(z)})    (6)

μ_k^{(z)} = [μ_k^{(x)}; μ_k^{(y)}],   Σ_k^{(z)} = [Σ_k^{(xx)}, Σ_k^{(xy)}; Σ_k^{(yx)}, Σ_k^{(yy)}]

where K is the number of Gaussian components, and μ_k^{(z)} and Σ_k^{(z)} are the mean vector and the covariance matrix of the k-th Gaussian component N(z | μ_k^{(z)}, Σ_k^{(z)}), respectively. To estimate the model parameters of the JD-GMM, the expectation-maximization (EM) algorithm [114]–[117] is used to maximize the likelihood on the training data.

A post-filter based on modulation spectrum modification is found useful to address the inherent over-smoothing issue in statistical modeling [118], such as the GMM approach; it effectively compensates the global variance. The GMM approach is a parametric solution [119]–[123]. It represents a successful statistical modeling technique that works well with parallel training data. A sketch of JD-GMM training and run-time conversion is given below.
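Below is a compact sketch of JD-GMM training, i.e., fitting Eq. (6) on joint vectors, followed by the classical minimum mean square error conversion E[y|x] written as a posterior-weighted sum of per-component linear regressions. It assumes DTW-aligned feature matrices and scikit-learn's GaussianMixture, and illustrates the shape of the method rather than any particular published implementation.

```python
import numpy as np
from scipy.special import logsumexp
from scipy.stats import multivariate_normal
from sklearn.mixture import GaussianMixture

def train_jdgmm(X, Y, n_components=8):
    """Fit Eq. (6): a joint GMM on z = [x; y]. X, Y: (n_frames, d) aligned features."""
    return GaussianMixture(n_components=n_components,
                           covariance_type="full").fit(np.hstack([X, Y]))

def convert(gmm, X, d):
    """MMSE conversion E[y|x]: posterior-weighted per-component linear regressions."""
    mu_x, mu_y = gmm.means_[:, :d], gmm.means_[:, d:]
    S_xx = gmm.covariances_[:, :d, :d]
    S_yx = gmm.covariances_[:, d:, :d]

    # Posterior P(k|x) under the marginal GMM over x
    log_p = np.stack([multivariate_normal.logpdf(X, mu_x[k], S_xx[k])
                      for k in range(gmm.n_components)], axis=1)
    log_p += np.log(gmm.weights_)
    post = np.exp(log_p - logsumexp(log_p, axis=1, keepdims=True))  # (n_frames, K)

    Y_hat = np.zeros((len(X), d))
    for k in range(gmm.n_components):
        A = S_yx[k] @ np.linalg.inv(S_xx[k])   # regression matrix of component k
        Y_hat += post[:, k:k + 1] * (mu_y[k] + (X - mu_x[k]) @ A.T)
    return Y_hat
```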
B. Dynamic Kernel Partial Least Squares

The family of parametric techniques also includes linear [73], [74] and non-linear mapping functions. With local mapping functions, each frame of speech is typically transformed independently from the neighboring frames, which causes temporal discontinuities in the output [74].

To take into account the time-dependency between speech features, a dynamic kernel partial least squares (DKPLS) technique was studied [15]. This method is based on a kernel transformation of the source features to allow non-linear modeling, and the concatenation of adjacent frames to model the dynamics. The non-linear transformation takes advantage of the global properties of the data, which the GMM approach does not. It was reported that DKPLS outperforms the GMM approach [109] in terms of voice quality. This method is simple and efficient, and does not require massive tuning. More recently, DKPLS-based approaches have been studied to overcome the over-fitting and over-smoothing problems with a feature combination strategy [124].

While statistical modeling for the mapping of spectral features has been well studied, conversion of prosody is often achieved by simply shifting and scaling F0, which is not sufficient for high-quality voice conversion. Hierarchical modeling of prosody, for different linguistic units at several distinct temporal scales, represents an advanced technique for prosody conversion [82], [125]–[127]. DKPLS has created a platform for multi-scale prosody conversion through the wavelet transform [128], which shows significant improvement in naturalness over the F0 shifting and scaling technique.
Fig. 2: Training and run-time inference of voice conversion with parallel training data under the frame-level mapping paradigm. The pink boxes represent the training algorithms of the models that result in the mapping function F(x) in the blue box for run-time inference. Dotted box (1) includes examples of statistical approaches, and (2) includes examples of deep learning approaches.

C. Frequency Warping
Parametric techniques, such as GMM [109] and DKPLS [15], usually suffer from over-smoothing because they use the minimum mean square error [81] or the maximum likelihood [11] function as the optimization criterion. As a result, the system produces acoustic features that represent a statistical average, and fails to capture the desired details of temporal and spectral dynamics.

Additionally, parametric techniques generally employ low-dimensional features, as discussed in Section II.B, such as the Mel-cepstral coefficients (MCC) or line spectral frequencies (LSF), to avoid the curse of dimensionality. The low-dimensional features, however, are doomed to lose spectral details because of their low resolution. Statistical averaging and low-resolution features both lead to the muffled effect of the output speech [129].

To preserve the necessary spectral details during conversion, a number of frequency warping-based methods were introduced. The frequency warping technique directly transforms the high-resolution source spectrum to that of the target speaker through a frequency warping function. In the recent literature, the warping function is either realized by a single parameter, as in the VTLN-based approaches [26], [130]–[133], or represented as a piecewise linear function [73], [129], [134], which has become a mainstream solution. The goal of the piecewise linear warping function is to align a set of frequencies between the source and target spectra by minimizing the spectral distance or maximizing the correlation between the converted and target spectra. More recently, the parametric frequency warping technique was incorporated with a non-parametric exemplar-based technique, achieving good performance [107].
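The sketch below shows the mechanics of a piecewise linear warp applied to one magnitude spectrum frame: anchor frequency pairs define the warping function, and the warped spectrum is read off by interpolation. The anchor values here are hypothetical hand-picked correspondences; in a real system they would be estimated by minimizing a spectral distance to the target.

```python
import numpy as np

def warp_spectrum(spectrum, src_anchors, tgt_anchors, sr=16000):
    """Piecewise linear frequency warping of one magnitude spectrum frame.
    spectrum: (n_bins,) magnitude values on a linear frequency axis."""
    freqs = np.linspace(0, sr / 2, len(spectrum))
    # Inverse warp: for each output (target-axis) frequency, the source
    # frequency it should be read from, interpolated between anchor pairs
    src_freqs = np.interp(freqs, tgt_anchors, src_anchors)
    return np.interp(src_freqs, freqs, spectrum)

# Hypothetical anchor pairs in Hz (e.g., rough formant correspondences),
# always including the band edges
src_anchors = [0, 500, 1500, 3000, 8000]
tgt_anchors = [0, 450, 1700, 3200, 8000]

frame = np.abs(np.fft.rfft(np.random.randn(1024)))   # stand-in spectrum frame
converted = warp_spectrum(frame, src_anchors, tgt_anchors)
```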
D. Non-negative Matrix Factorization
Non-negative matrix factorization (NMF) [135] is an effective data mining technique that has been widely used, especially for the reconstruction of high-quality signals, such as in speech enhancement [136], [137], speech de-noising [138], [139], and noise and speech estimation [140]. It factorizes a matrix into two matrices, a dictionary and an activation matrix, with the property that all three matrices have no negative elements. The NMF-based techniques have been shown to be effective in voice conversion with very limited training data. This marks a major progress of the non-parametric approach to voice conversion since the vector quantization technique was introduced. Successful implementations include non-negative spectrogram deconvolution [141], locally linear embedding (LLE) [142], and unit selection [20]. In NMF-based approaches, a target spectrogram is constructed as a linear combination of exemplars. Therefore, the over-smoothing problem can also arise. To overcome the over-smoothing problem, several effective techniques were developed, which we summarize next.
1) Sparse Representation:
One effective way to alleviate the over-smoothing problem is to apply a sparsity constraint to the activation matrix, referred to as exemplar-based sparse representation.

As illustrated in Figure 3, a pair of dictionaries A and B are first constructed from speech feature vectors, which we call aligned exemplars, from the source and target. [A; B] is also called the coupled dictionary. At run-time, let's consider a speech utterance as a sequence of speech feature vectors that form a spectrogram matrix.
Fig. 3: Illustration of non-negative matrix factorization for exemplar-based sparse representation.

The matrix of a source utterance X can be represented as,

X ≈ A Ĥ    (7)

Due to the non-negative nature of the spectrogram, the NMF technique is employed to estimate the source activation matrix Ĥ, which is constrained to be sparse. Mathematically, we estimate Ĥ by minimizing an objective function,

Ĥ = argmin_{H≥0} d(X, AH) + λ ||H||_1    (8)

where λ is the sparsity penalty factor. To estimate the activation matrix Ĥ, a generalised Kullback-Leibler (KL) divergence is used as the distortion measure d(·,·). It is assumed that the source and target dictionaries A and B can share the same source activation matrix Ĥ. Therefore, the converted spectrogram for the target speaker can be written as,

Ŷ = B Ĥ    (9)

where the activation matrix Ĥ serves as the pivot to transfer the source utterance X to the target utterance Y.

The sparse representation framework continues to attract much attention in voice conversion. The recent studies include its extension to the discriminative graph-embedded NMF approach [19], phonetic sparse representation for spectrum conversion [22], and its application to timbre and prosody conversion [143], [144].
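A compact numpy sketch of Eqs. (7)–(9): the activation matrix is estimated with the standard multiplicative update for the generalised KL divergence, with the sparsity penalty λ folded into the denominator, and the converted spectrogram is then obtained as BĤ. The dictionaries here are random stand-ins; in practice they hold aligned exemplars.

```python
import numpy as np

def estimate_activations(X, A, lam=0.1, n_iter=200, eps=1e-12):
    """Solve Eq. (8): min_{H>=0} KL(X || A H) + lam * ||H||_1 by multiplicative updates."""
    H = np.random.rand(A.shape[1], X.shape[1]) + eps
    ones = np.ones_like(X)
    for _ in range(n_iter):
        # Standard KL-NMF update for H with A fixed; lam enforces sparsity
        H *= (A.T @ (X / (A @ H + eps))) / (A.T @ ones + lam + eps)
    return H

# A, B: coupled source/target dictionaries of aligned exemplars (n_bins, n_exemplars);
# X: source utterance spectrogram (n_bins, n_frames) -- all non-negative
A = np.abs(np.random.rand(513, 100))   # stand-ins for real dictionaries
B = np.abs(np.random.rand(513, 100))
X = np.abs(np.random.rand(513, 120))

H = estimate_activations(X, A)         # Eq. (8)
Y_hat = B @ H                          # Eq. (9): converted spectrogram
```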
2) Phonetic Sparse Representation:
As the frame-level mapping is done at the acoustic feature level, the coupled dictionary [A; B] is therefore called an acoustic dictionary. With the scripts of the training data and a general-purpose speech recognizer, we are able to obtain phonetic labels and their boundaries. Studies have shown that the strategy of dictionary construction plays an important role in voice conversion [145]. The idea of selecting a sub-dictionary according to the run-time speech content shows improved performance [21].

Phonetic sparse representation [22] is an extension of sparse representation for voice conversion. It is built on the idea of phonetic sub-dictionaries, and dictionary selection at run-time. The studies show that multiple phonetic sub-dictionaries consistently outperform a single dictionary in exemplar-based sparse representation voice conversion [21], [22]. However, phonetic sparse representation relies on a speech recognizer at run-time to help select the sub-dictionary.
3) Group Sparse Representation:
Sisman et al. [62] proposed group sparse representation to formulate both exemplar-based sparse representation [141] and phonetic sparse representation [22] under a unified mathematical framework. With the group sparsity regularization, only the phonetic sub-dictionary that is relevant to the input features is likely to be activated at run-time inference. Unlike phonetic sparse representation, which relies on a speech recognizer for both training and run-time inference, group sparse representation only requires the speech recognizer during training when we build the phonetic dictionary. It was reported that group sparse representation provides similar performance to that of phonetic sparse representation when performing both spectrum and prosody conversion [62].

IV. STATISTICAL MODELING FOR VOICE CONVERSION WITH NON-PARALLEL TRAINING DATA
It is easy to understand that it is more straightforward to train a mapping function from parallel than from non-parallel training data. However, parallel training data are not always available. In real-world applications, there are situations where only non-parallel data are available. Intuitively, if we can derive the equivalents of speech frames or segments between speakers from non-parallel data, we are able to establish or refine the mapping function using conventional linear transformation parameter training, such as GMM, DKPLS or frequency warping.

There were a number of attempts to do so. For example, one idea is to find a source-target mapping between unsupervised feature clusters [146]. Another is to use a speech recognizer to index the target training data so that we can retrieve similar frames from the target database for an unknown source frame at run-time [147]. Unfortunately, each of the steps may produce errors that accumulate and may lead to poor parameter estimation [146]. There was also a study using a hidden Markov model (HMM) trained for the target speaker, where the parameters of a GMM-based linear transformation function are estimated in such a way that the converted source vectors exhibit maximum likelihood with respect to the target HMM [148]. This method shows comparable performance with methods using parallel data. However, it requires that the orthography of the training utterances be known, which limits its use.

Next we will discuss three clusters of studies and their representative work: 1) the INCA algorithm, 2) the unit selection algorithm, and 3) the speaker modeling algorithm.
A. INCA Algorithm
INCA refers to an Iterative combination of a Nearest Neighbor search step and a Conversion step Alignment method [27]. It learns a mapping function by finding the nearest neighbor of each source vector in the target acoustic space.
Fig. 4: The training of a frame-level mapping function is an iterative process between the nearest neighbor search step (INCA alignment) and the conversion step (a parametric mapping function).

INCA is based on the hypothesis that an iterative refinement of the basic nearest neighbour method, in tandem with the voice conversion system, will lead to a progressive alignment improvement. The main idea is that the intermediate voice, x_s^k, obtained after the previous nearest neighbour alignment, can be used as the source voice during the next iteration:

x_s^{k+1} = F^k(x_s^k)    (10)

During training, the optimization process is repeated until the current intermediate voice, x_s^k, is close enough to the target voice, y_t. INCA represents a successful framework for the non-parallel training data problem, where the nearest neighbor search step (INCA alignment) and the conversion step (a parametric mapping function) iterate to optimize the mapping function, as illustrated in Figure 4.

INCA was first implemented with the GMM approach [109] for voice conversion to estimate a linear mapping function. As INCA does not require any phonetic or linguistic information, it not only works for non-parallel training data, but also works for cross-lingual voice conversion. Experiments show that the INCA implementation of a cross-lingual system achieves similar performance to its intra-lingual counterpart that is trained on parallel data [27].

INCA was further implemented with the DKPLS approach [15], which was discussed in Section III.B for parallel training data. The idea [30] is to use the INCA alignment algorithm [27] to find the corresponding frames from the source and target datasets, which allows the DKPLS regression to find a non-linear mapping between the aligned datasets. It was reported [30] that the INCA-DKPLS implementation produces high-quality voice that is comparable to the implementation with parallel training data on the same amount of training data. A minimal sketch of the INCA loop is given below.
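In the sketch below, nearest-neighbour alignment alternates with re-estimation of the conversion function, and the converted (intermediate) voice becomes the source of the next iteration, as in Eq. (10). It is a hedged simplification: a plain linear regression stands in for the GMM or DKPLS conversion functions used in the literature, and only one search direction is shown.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import NearestNeighbors

def inca(X_src, Y_tgt, n_iter=10):
    """X_src, Y_tgt: (n_frames, d) feature matrices with different content.
    Returns the last conversion function F^k."""
    nn = NearestNeighbors(n_neighbors=1).fit(Y_tgt)
    x_k = X_src.copy()                                 # intermediate voice x_s^k
    for _ in range(n_iter):
        # Alignment step: nearest target frame for each intermediate frame
        idx = nn.kneighbors(x_k, return_distance=False)[:, 0]
        # Conversion step: re-estimate F^k on the pseudo-pairs, then apply Eq. (10)
        F = LinearRegression().fit(x_k, Y_tgt[idx])
        x_k = F.predict(x_k)                           # x_s^{k+1} = F^k(x_s^k)
    return F
```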
Fig. 5: Run-time inference of the unit selection algorithm, which doesn't model a mapping function with parameters, but rather searches for the output feature sequence directly from the target speaker database, and optimizes the output at the utterance level.
B. Unit Selection Algorithm
The unit selection algorithm has been widely used to generate natural-sounding speech in speech synthesis. It is known to produce high speaker similarity and voice quality [75], [149], [150] because the synthesized waveform is formed of sound units directly from the target speaker [151]. The unit selection algorithm optimizes the selection of units from a voice inventory of a target speaker. It was suggested [152] to make use of a unit selection synthesis system to generate parallel versions of the training sentences from non-parallel data. With the resulting pseudo-parallel data, the statistical modeling techniques for parallel training data that we discuss in Section III can be readily applied. While this approach produces satisfactory voice quality [152], it requires a large speech database to develop the voice inventory, which is not always practical in reality.

Another idea is to follow what we do in unit selection speech synthesis by defining a speech feature vector as a unit [24]. Given an utterance of M speech feature vectors X = {x_1, x_2, ..., x_M} from the source speaker, dynamic programming is applied to find the sequence of feature vectors y_i from the target speaker that minimizes a cost function,

Y = argmin_y ( α ∑_{i=1}^{M} d(x_i, y_i) + (1 − α) ∑_{i=2}^{M} d_c(y_i, y_{i−1}) )    (11)

where d(·) represents the acoustic distance between a source and a target feature vector, while d_c(·) is the concatenative cost between two target feature vectors. With the acoustic distance, we make sure that the retrieved speech features from the target speaker are close to those of the source; with the concatenative cost, we encourage consecutive speech frames from the target speaker database to be retrieved together in a multi-frame segment. As illustrated in Figure 5, the unit selection algorithm is a non-parametric solution because we don't model the conversion with parameters. It optimizes the output by applying dynamic programming to find the best feature vector sequence from the target speaker database. The mapping function Y = F(X) is defined by the cost function of Eq. (11) itself, and optimized at the utterance level; a sketch of this dynamic programming is given below.
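A minimal sketch of the dynamic programming behind Eq. (11): for each source frame, a handful of nearest target frames are kept as candidate units, and a Viterbi pass trades the acoustic distance off against the concatenative cost. Euclidean distance is used for both costs purely for illustration.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def unit_selection(X, database, alpha=0.7, n_candidates=10):
    """X: (M, d) source features; database: (N, d) frames of the target speaker.
    Returns the target frame sequence that minimizes the cost of Eq. (11)."""
    M = len(X)
    nn = NearestNeighbors(n_neighbors=n_candidates).fit(database)
    dist, cand = nn.kneighbors(X)              # acoustic distances d(x_i, y_i)

    cost = alpha * dist[0]                     # best path cost ending at each candidate
    back = np.zeros((M, n_candidates), dtype=int)
    for i in range(1, M):
        prev, cur = database[cand[i - 1]], database[cand[i]]
        # concatenative cost d_c(y_i, y_{i-1}) between all candidate pairs
        dc = np.linalg.norm(prev[:, None, :] - cur[None, :, :], axis=-1)
        total = cost[:, None] + (1 - alpha) * dc
        back[i] = np.argmin(total, axis=0)
        cost = total[back[i], np.arange(n_candidates)] + alpha * dist[i]

    # Backtrack the lowest-cost path through the candidate lattice
    path = [int(np.argmin(cost))]
    for i in range(M - 1, 0, -1):
        path.append(back[i][path[-1]])
    path.reverse()
    return np.stack([database[cand[i][j]] for i, j in enumerate(path)])
```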
C. Speaker Modeling Algorithm

The techniques for text-independent speaker characterization are readily applicable to non-parallel training data, where a speaker can be modeled by a set of parameters, such as a GMM or an i-vector. One can make use of such speaker models to perform voice conversion.

Mouchtaris et al. [153] used a GMM-based technique to model the relationship between reference speakers in advance and apply the relationship to a new speaker. Toda et al. [154] proposed an eigenvoice approach that performs two mappings: one from the source speaker to an eigenvoice (or average voice) trained from reference speakers, and another from the eigenvoice to the target speaker. While these approaches don't require parallel training data between the source and target speakers, they do require parallel data from some reference speakers.

In speaker verification, the joint factor analysis method [155] decomposes a supervector into speaker-independent, speaker-dependent and channel-dependent components, each of which is represented by a low-dimensional set of factors. This aims to disentangle the speaker from other speech content for effective speaker verification. Inspired by this idea, it was argued [156] that a similar decomposition would be useful in voice conversion, where we would like to separate the speaker information from the linguistic content, and apply factor analysis on the speaker-specific component.

With factor analysis, the speaker-specific component can be represented by a low-dimensional set of latent variables via the factor loadings. One of the ideas [156] is to estimate the phonetic component and factor loadings from non-parallel prior data. In this way, during the training process, we only estimate a low-dimensional set of speaker identity factors and a tied covariance matrix instead of a full conversion function from the source-target parallel utterances. Even though parallel utterances are still required for estimating the conversion function, the use of prior data allows us to obtain a reliable model from much fewer training samples than those required by the conventional JD-GMM [157].

Another idea is to perform the voice conversion in the i-vector [155] speaker space, where the i-vector is used to disentangle a speaker from the linguistic content. The primary motivation is that an i-vector can be extracted in an unsupervised manner regardless of speaker or speech content, which opens up new possibilities, especially for non-parallel data scenarios where the source and target speech is of different content or even in different languages [28], [45], [158]. Kinnunen et al. [159] studied a way to shift the acoustic features of input speech towards the target speech in the i-vector space. The idea is to learn a function that maps the i-vector of the source utterance to that of the target. With the mapping function, we are able to convert the source speech frame-by-frame to the target. This technique is free of any parallel data and text transcription.

V. DEEP LEARNING FOR VOICE CONVERSION
Voice conversion is typically a research problem with scarce training data. Deep learning techniques are typically data-driven, relying on big data. However, this is actually a strength of deep learning in voice conversion. Deep learning opens up many possibilities to benefit from abundantly available training data, so that the voice conversion task can focus more on learning the mapping of speaker characteristics. For example, it shouldn't be the job of the voice conversion task to infer low-level detail during speech reconstruction; a neural vocoder can learn from a large database to do so [98]. It shouldn't be the task of voice conversion to learn how to represent the entire phonetic system of a spoken language; a general-purpose acoustic model of a neural ASR [160] or TTS [161] system can learn from a large database to do so. By leveraging the large database, we free the conversion network from using its capacity to represent low-level detail and general information, letting it focus instead on the high-level semantics necessary for speaker identity conversion.

Deep learning techniques also transform the way we implement the analysis-mapping-reconstruction pipeline. For effective mapping, we need to derive an adequate intermediate representation of speech, as discussed in Section II. The concept of embedding in deep learning provides a new way of deriving the intermediate representation, for example, a latent code for the linguistic content, and a speaker embedding for the speaker identity. It also makes the disentanglement of speaker from content much easier.

In this section, we will summarize how deep learning helps address existing research problems, such as parallel and non-parallel data voice conversion. We will also review how deep learning breaks new ground in voice conversion research.
A. Deep Learning for Frame-Aligned Parallel Data
The study of deep learning approaches for voice conversion started with parallel training data, where we use a neural network as an improved regression function to approximate the mapping function y = F(x) under the frame-level mapping paradigm in Figure 2.
1) DNN Mapping Function:
The early studies on DNN-based voice conversion methods focused on spectral transformation. The DNN mapping function, y = F(x), has some clear advantages over other statistical models, such as GMM and DKPLS. For instance, it allows for non-linear mapping between source and target features, and there is little restriction on the dimension of the features to be modeled. We note that conversion of other acoustic features, such as fundamental frequency and energy contour, can also be done similarly [162].

Desai et al. [81] proposed a DNN to map a low-dimensional spectral representation, such as mel-cepstral coefficients (MCEP), from the source to the target speaker. Nakashika et al. [163] proposed to use Deep Belief Nets (DBNs) to extract latent features from source and target cepstrum coefficients, and a neural network with one hidden layer to perform conversion between the latent features. Mohammadi et al. [164] furthered the idea by studying a deep autoencoder over multiple speakers to derive a compact representation of the speech spectral features. A high-dimensional representation of the spectrum has also been used in a more recent work [165] for spectral mapping, together with dynamic features and a parameter generation algorithm [166]. Chen et al. [167] proposed to model the distributions of the spectral envelopes of source and target speakers respectively through layer-wise generative training.

Generally speaking, DNNs for spectrum and/or prosody transformation require a large amount of parallel training data from paired speakers, which is not always feasible. But they open up opportunities for us to make use of speech data from multiple speakers beyond the source and target, to better model the source and the target speakers, and to discover better feature representations for feature mapping. A sketch of a frame-level DNN mapping is given below.
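This is a minimal sketch of such a frame-level DNN mapping, assuming DTW-aligned source and target MCEP matrices (random stand-ins below); the layer sizes and the full-batch training loop are illustrative only.

```python
import torch
import torch.nn as nn

# Aligned frame pairs (n_frames, 24), e.g., MCEPs paired by DTW (stand-in data here)
src = torch.randn(5000, 24)
tgt = torch.randn(5000, 24)

# y = F(x): a nonlinear frame-level regression from source to target features
model = nn.Sequential(
    nn.Linear(24, 256), nn.ReLU(),
    nn.Linear(256, 256), nn.ReLU(),
    nn.Linear(256, 24),
)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

for epoch in range(100):
    optimizer.zero_grad()
    loss = loss_fn(model(src), tgt)   # frame-by-frame mean square error
    loss.backward()
    optimizer.step()

converted = model(src)                # run-time inference: converted features
```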
2) LSTM Mapping Function:
To model the temporal correlation across speech frames in voice conversion, Nakashika et al. [168] explored the use of Recurrent Temporal Restricted Boltzmann Machines (RTRBM), a type of recurrent neural network. The success of Long Short-Term Memory (LSTM) [169], [170] in sequence-to-sequence modeling inspired the study of LSTM in voice conversion, which leads to an improvement in the naturalness and continuity of the speech output.

The LSTM network architecture consists of a set of memory blocks and peephole connections that support the storage of, and access to, long-range contextual information [171] in linear memory cells. It learns the optimal amount of contextual information for voice conversion. A bidirectional LSTM (BLSTM) network is expected to capture sequential information and maintain long-range contextual features from both the forward sequence and the backward sequence [45]. Sun et al. [40] and Ming et al. [172] proposed a deep bidirectional LSTM network (DBLSTM) by stacking multiple hidden layers of the BLSTM network architecture, which is shown to outperform DNN voice conversion even without using dynamic features; a sketch is given below. While the DBLSTM-based voice conversion approach generates high-quality synthesized voice, it typically requires a large speech corpus from the source and target speakers for training, which limits the scope of its applications in practice [40].

Just like the GMM approach, the DNN and LSTM techniques rely on an external frame aligner during training data preparation, as illustrated in Figure 2. At run-time, the conversion process follows the typical flow of the 3-step pipeline, and doesn't change the speech duration during the conversion.
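This sketch shows a DBLSTM-style mapping network: stacked bidirectional LSTM layers followed by a linear projection back to the feature dimension, so the output sequence stays frame-aligned with the input. All hyper-parameters are illustrative.

```python
import torch
import torch.nn as nn

class DBLSTMMapper(nn.Module):
    """Stacked bidirectional LSTM mapping a source feature sequence to target
    features of the same length, in the spirit of DBLSTM voice conversion."""
    def __init__(self, feat_dim=24, hidden=128, num_layers=3):
        super().__init__()
        self.blstm = nn.LSTM(feat_dim, hidden, num_layers=num_layers,
                             bidirectional=True, batch_first=True)
        self.proj = nn.Linear(2 * hidden, feat_dim)   # 2x: forward + backward states

    def forward(self, x):          # x: (batch, n_frames, feat_dim)
        h, _ = self.blstm(x)       # (batch, n_frames, 2 * hidden)
        return self.proj(h)        # same length as input: frame-aligned mapping

model = DBLSTMMapper()
src = torch.randn(8, 200, 24)      # a batch of aligned source sequences (stand-in)
out = model(src)                   # (8, 200, 24) converted feature sequences
```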
B. Encoder-decoder with Attention for Parallel Data
The research problems of voice conversion are centered around alignment and mapping, which are interrelated both during training and at run-time inference, as illustrated in Figure 2. During training, more accurate alignment helps build a better mapping function, which explains why we prefer parallel training data. At run-time inference, the frame-level mapping paradigm doesn't change the duration of the speech during the conversion. While it is possible to model and predict the duration for the voice conversion output, it is not straightforward to incorporate the duration model and the mapping model in a systematic manner. Deep learning provides a new solution to this research problem.
AttentionEncoder Decoder
Source Speech ConvertedSpeech
Fig. 6: Encoder-decoder mechanism with attention for voiceconversion.mapping model in a systematic manner. Deep learningprovides a new solution to this research problem.The attention mechanism [173], [174] in encoder-decoderstructure neural network brings about a paradigm change.The idea of attention was first successfully used in machinetranslation [173], speech recognition [175], and sequence-to-sequence speech synthesis [86], [176]–[178], that led tomany parallel studies in voice conversion [179]–[181]. Withthe attention mechanism, the neural network learns thefeature mapping and alignment at the same time duringtraining. At run-time inference, the network automaticallydecides the output duration according to what it has learnt.In other words, the frame-aligner in Figure 2 is no longerrequired.There are several variations based on recurrent neuralnetworks, such as SCENT [179], and AttS2S-VC [181]. Theyfollow the widely-used architecture of encoder-decoder withattention [180], [182]. Suppose that we have a source speech x = { x , x ,..., x T s }. The encoder network first transformsthe input feature sequences into hidden representations, h = { h , h ,..., h T h } at a lower frame rate with T h < T s , whichare suitable for the decoder to deal with. At each decodertime step, the attention module aggregates the encoderoutputs by attention probabilities and produces a contextvector. Then, the decoder predicts output acoustic featuresframe by frame using context vectors. Furthermore, a post-filtering network is designed to enhance the accuracy ofthe converted acoustic features to generate the convertedspeech y = { y , y ,..., y T y }. During training, the attentionmechanism learns the mapping dynamics between sourcesequence and target sequence. At run-time inference, thedecoder and the attention mechanism interacts to performthe mapping and alignment at the same time. The overallarchitecture is illustrated in Figure 6.While recurrent neural networks represent an effectiveimplementation for sequence-to-sequence conversion, re-cent studies have shown that convolutional neural networkswith gating mechanisms also learn well the long-termdependencies [53], [183]. It employs an attention mecha-nism that effectively makes possible parallel computationsfor encoding and decoding. During decoding, the causalconvolution design allows the model to generate an outputsequence in an autoregressive manner. Kameoka et al. pro-posed a convolutional neural networks implementation forvoice conversion [184], that is called ConvS2S-VC. Recentstudies show that ConvS2S-VC outperforms its recurrentneural network counterparts in both pairwise and many- Generator → Target OriginalFeaturesSource OriginalFeatures
DiscriminatorGenerator → Generator → Generator → Source ConvertedFeatures Target ConvertedFeatures
L1 L1
Converted Features Real, or not?
Discriminator
Converted Features Real, or not?Source SpeechTarget Speech
Training
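The following sketch illustrates a single decoder step of content-based attention as described above: the encoder outputs are scored against the current decoder state, the scores are normalized into attention probabilities, and their weighted sum forms the context vector used to predict the next acoustic frame. The additive-style scoring function and all dimensions are simplified assumptions, not the exact SCENT or AttS2S-VC formulation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

enc_dim, dec_dim, out_dim = 256, 256, 80
score = nn.Linear(enc_dim + dec_dim, 1)     # scoring function (illustrative)
frame_out = nn.Linear(enc_dim + dec_dim, out_dim)

h = torch.randn(1, 120, enc_dim)            # encoder outputs, T_h = 120
s = torch.randn(1, dec_dim)                 # current decoder state

# Score every encoder output against the decoder state.
e = score(torch.cat([h, s.unsqueeze(1).expand(-1, h.size(1), -1)], dim=-1))
alpha = F.softmax(e.squeeze(-1), dim=-1)    # attention probabilities over T_h
context = torch.bmm(alpha.unsqueeze(1), h).squeeze(1)   # context vector
y_t = frame_out(torch.cat([context, s], dim=-1))        # next output frame
```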
Fig. 7: Training a CycleGAN with a cycle-consistency loss of L1 norm for voice conversion with non-parallel training data of paired speakers. The L1 norm represents the least absolute errors.

The encoder-decoder structure with attention marks a departure from the frame-level mapping paradigm. The attention does not perform the mapping frame-by-frame, but rather allows the decoder to attend to multiple speech frames and use a soft combination to predict an output frame in the decoding process. With the attention mechanism, the duration of the converted speech $T_y$ is typically different from that of the source speech $T_s$, to reflect the differences in speaking style between the source and the target. This represents a way to handle both spectral and prosody conversion at the same time. The studies have attributed the improvement in voice quality to the effective attention mechanism. The attention mechanism also represents the first step towards relaxing the rigid requirement of parallel data in voice conversion.

C. Beyond Parallel Data of Paired Speakers
In Sections III and IV, we studied statistical modeling for voice conversion with parallel and non-parallel training data. The advent of deep learning has broken new ground for voice conversion research. We now go beyond the paradigm of parallel and non-parallel training data. By non-parallel training data, we have so far referred to the case where non-parallel utterances from the source and target speakers are still required. However, recent studies show that deep learning has enabled many voice conversion scenarios without the need for parallel data. In this section, we summarize the studies into four scenarios:
1) Non-parallel data of paired speakers,
2) Leveraging TTS systems,
3) Leveraging ASR systems, and
4) Disentangling speaker from linguistic content.
1) Non-parallel data of paired speakers:
Voice conversion with non-parallel training data is a task similar to image-to-image translation, which is to find a mapping from a source domain to a target domain without the need for parallel training data. Let's draw a parallel between image-to-image translation and voice conversion. In image translation, we would like to translate a horse into a zebra, where we preserve the structure of the horse and change its coat to that of a zebra [185]–[190]; in voice conversion, we would like to transform one voice into that of another, while preserving the linguistic and prosodic content.

CycleGAN is based on the concept of adversarial learning [191], which is to train a generative model to find a solution in a min-max game between two neural networks, called the generator ($G$) and the discriminator ($D$). It is known to achieve remarkable results [185] on several tasks where paired training data do not exist, such as image manipulation and synthesis [185], [188], [192]–[195], speech enhancement [196], speech recognition [197], and speech synthesis [198], [199].

As the speech data are non-parallel, alignment is not easily achieved. Kaneko and Kameoka first studied a CycleGAN [47], [48], [200], [201] that incorporates three loss functions: an adversarial loss, a cycle-consistency loss, and an identity-mapping loss, to learn the forward and inverse mappings between the source and target speakers.

The adversarial loss measures how distinguishable the data distribution of the converted features is from that of the source features $x$ or target features $y$. For the forward mapping, it is defined as follows:

$$\mathcal{L}_{ADV}(G_{X \to Y}, D_Y, X, Y) = \mathbb{E}_{y \sim P(y)}[\log D_Y(y)] + \mathbb{E}_{x \sim P(x)}[\log(1 - D_Y(G_{X \to Y}(x)))] \tag{12}$$

The closer the distribution of the converted data is to that of the target data, the smaller the loss becomes.

The adversarial loss only tells us whether $G_{X \to Y}$ follows the distribution of the target data; it does not ensure that the contextual information, which represents the general sentence structure we would like to carry over from source to target, is preserved.
To ensure that we maintain consistent contextual information between $x$ and $G_{X \to Y}(x)$, the cycle-consistency loss, which is presented in Figure 7, is introduced:

$$\mathcal{L}_{CYC}(G_{X \to Y}, G_{Y \to X}) = \mathbb{E}_{x \sim P(x)}[\| G_{Y \to X}(G_{X \to Y}(x)) - x \|_1] + \mathbb{E}_{y \sim P(y)}[\| G_{X \to Y}(G_{Y \to X}(y)) - y \|_1] \tag{13}$$

where $\| \cdot \|_1$ refers to the L1 norm, or least absolute errors, which is known to produce sharper spectral features. This loss encourages $G_{X \to Y}$ and $G_{Y \to X}$ to find an optimal pseudo pair of $(x, y)$ through circular conversion.

To encourage the generator to find the mapping that preserves the underlying linguistic content between the input and output [202], an identity-mapping loss is introduced as follows:

$$\mathcal{L}_{ID}(G_{X \to Y}, G_{Y \to X}) = \mathbb{E}_{x \sim P(x)}[\| G_{Y \to X}(x) - x \|_1] + \mathbb{E}_{y \sim P(y)}[\| G_{X \to Y}(y) - y \|_1] \tag{14}$$

Combining the three loss functions, and writing $G = G_{X \to Y}$ and $F = G_{Y \to X}$, we have the total loss:

$$\mathcal{L}(G, F, D_X, D_Y, X, Y) = \mathcal{L}_{ADV}(G, D_Y, X, Y) + \mathcal{L}_{ADV}(F, D_X, X, Y) + \lambda_{CYC} \mathcal{L}_{CYC}(G, F, X, Y) + \lambda_{ID} \mathcal{L}_{ID}(G, F, X, Y) \tag{15}$$

where $\lambda_{CYC}$ and $\lambda_{ID}$ are trade-off parameters. The optimal mapping functions $G^*$ and $F^*$ are obtained by solving the min-max game defined as:

$$G^*, F^* = \arg \min_{G, F} \max_{D_X, D_Y} \mathcal{L}(G, F, D_X, D_Y, X, Y) \tag{16}$$

CycleGAN represents a successful deep learning implementation to find an optimal pseudo pair from non-parallel data of paired speakers. It does not require any frame alignment mechanism such as dynamic time warping or attention. Experimental results show that, with non-parallel training data, CycleGAN achieves performance comparable to that of a GMM-based system trained on twice the amount of parallel data [47]. Moreover, with the adversarial training, it effectively overcomes the over-smoothing problem, which is known to be one of the main factors leading to speech-quality degradation. We note that, more recently, CycleGAN-VC2, an improved version of CycleGAN-VC, has been studied [201]; it further improves CycleGAN by incorporating three new techniques: an improved objective (two-step adversarial losses), an improved generator (2-1-2D CNN), and an improved discriminator (PatchGAN). CycleGAN has been successfully applied in mono-lingual [48], [203], cross-lingual voice conversion [204], emotional voice conversion [205], [206], and rhythm-flexible voice conversion [207].

Unlike the encoder-decoder structure, CycleGAN follows a generative modeling architecture that does not explicitly model internal representations to support flexible manipulation of, for example, voice identity, duration of speech, and emotion. Therefore, it is more suitable for voice conversion between a specific source and target pair. Nonetheless, it represents an important milestone towards non-parallel-data voice conversion.
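As a summary of Eqs. (12)–(15), the sketch below assembles the CycleGAN generator objective in PyTorch. The four networks are stand-in linear layers, the binary cross-entropy with "real" labels realizes the non-saturating form of the adversarial loss commonly used in practice, and the trade-off weights are illustrative assumptions.

```python
import torch
import torch.nn as nn

dim = 24
G, F = nn.Linear(dim, dim), nn.Linear(dim, dim)     # G: X->Y, F: Y->X
D_X, D_Y = nn.Linear(dim, 1), nn.Linear(dim, 1)     # real/fake discriminators
bce = nn.BCEWithLogitsLoss()
l1 = nn.L1Loss()

x = torch.randn(64, dim)   # non-parallel source frames (stand-in)
y = torch.randn(64, dim)   # non-parallel target frames (stand-in)
real = torch.ones(64, 1)

# Adversarial losses, Eq. (12), for both directions (generator's view:
# converted features should be scored as "real").
adv = bce(D_Y(G(x)), real) + bce(D_X(F(y)), real)
# Cycle-consistency loss, Eq. (13): circular conversion returns the input.
cyc = l1(F(G(x)), x) + l1(G(F(y)), y)
# Identity-mapping loss, Eq. (14): same-domain inputs pass through unchanged.
ident = l1(F(x), x) + l1(G(y), y)
# Total generator loss, Eq. (15), with illustrative trade-off weights.
total = adv + 10.0 * cyc + 5.0 * ident
```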
2) Leveraging TTS systems:
We have discussed deep learning architectures for voice conversion that do not involve text. One of the important aspects of voice conversion is to carry forward the linguistic content from source to target. Voice conversion and TTS systems are similar in the sense that they both aim to generate high-quality speech with the appropriate linguistic content. A TTS system provides a mechanism for the speech to adhere to the linguistic content. The ideas to leverage the TTS mechanism can be motivated in different ways. Firstly, a TTS system is trained on a large speech database and offers a high-quality speech reconstruction mechanism given the linguistic content; secondly, a TTS system is equipped with a quality attention mechanism that is needed by voice conversion.

Fig. 8: The upper panel is a TTS flow, and the lower panel is a voice conversion flow. Both follow a similar encoder-decoder with attention architecture. The voice conversion leverages the TTS system that is linguistically informed.

As illustrated in Figure 8, encoder-decoder models with attention have recently shown considerable success in modeling a variety of complex sequence-to-sequence problems. Tacotron [87], [176], [208] represents one of the successful text-to-speech (TTS) implementations, and it has been extended to voice conversion [3], [179].

Zhang et al. proposed a joint training system architecture for both text-to-speech and voice conversion [3] by extending the model architecture of Tacotron, which features a multi-source sequence-to-sequence model with a dual input and a dual attention mechanism. By taking only text as input, the system performs speech synthesis. The system can also take either voice alone, or both text and voice, as input for voice conversion. The multi-source encoder-decoder model is trained with a decoder that is linguistically informed via the TTS joint training, as illustrated by the shared decoder in Figure 8. Experiments show that the joint training improves the voice conversion task with or without text input at run-time inference.

Park et al. proposed a voice conversion system, known as Cotatron, that is built on top of a multi-speaker Tacotron TTS architecture [161]. At run-time inference, the pre-trained TTS system is used to derive speaker-independent linguistic features of the source speech. This process is guided by the transcription of the input speech; as such, the text transcription of the source speech is required at run-time inference. The system uses the TTS encoder to extract speaker-independent linguistic features, or in other words, to disentangle the speaker identity. The decoder then takes the attention-aligned speaker-independent linguistic features as the input, and the target speaker identity as the condition, to generate the target speaker's voice. In this way, voice conversion leverages the attention mechanism, or shared attention, from TTS, as shown in Figure 8. Cotatron is designed to perform one-to-many voice conversion. A study [209] that shares a similar motivation with [161], but is based on the Transformer instead of Tacotron, suggests transferring knowledge from a learned TTS model to benefit from large-scale, easily accessible TTS corpora.

Fig. 9: Training phase of the average modeling approach that maps PPG features to MCEP features for voice conversion [44].

Zhang et al. [210] proposed to improve the sequence-to-sequence model [179] by using text supervision during training. A multi-task learning structure is designed, which adds auxiliary classifiers to the middle layers of the sequence-to-sequence model to predict linguistic labels as a secondary task. The linguistic labels can be obtained either manually or automatically with alignment tools. With the linguistic label objective, the encoder and decoder are expected to generate meaningful intermediate representations that are linguistically informed. The text transcripts are only required during training. Experiments show that the multi-task learning with linguistic labels effectively improves the alignment quality of the model, and thus alleviates issues such as mispronunciation.

The neural representation of deep learning has facilitated the interaction between TTS and voice conversion. By leveraging TTS systems, we hope to improve the training and run-time inference of voice conversion by adhering to the linguistic content. However, such techniques usually require a large training corpus. Recent studies introduced frameworks for creating limited-data VC systems [209], [211], [212] by bootstrapping from a speaker-adaptive TTS model. How voice conversion can benefit from TTS systems without involving large training data deserves future study.
3) Leveraging ASR systems:
Deep learning approaches for voice conversion typically require a large parallel corpus for training. This is partly because we would like to learn latent representations that describe the phonetic systems. The requirement of training data has limited the scope of potential applications. We know that most ASR systems are already trained with a large corpus. They already describe the phonetic systems well, in different ways. The question is how to leverage the latent representations in ASR systems for voice conversion.

One of the ideas is to use the context posterior probability sequence produced by the ASR model with sequence-to-sequence learning to generate a target speech feature sequence [160]. In this model, the system has an encoder-decoder structure similar to Figure 6, except that it uses a speech recognizer as the encoder, and a speech synthesizer as the decoder. Another study is to guide a sequence-to-sequence voice conversion model with an ASR system, which augments the inputs with bottleneck features [179]. Recently, an end-to-end speech-to-speech sequence transducer, Parrotron [213], was studied. Parrotron learns to convert the speech spectrogram of any speaker, with multiple accents and imperfections, to the voice of a single predefined target speaker. Parrotron accomplishes this by using an auxiliary ASR decoder to predict the transcript of the output speech, conditioned on the encoder latent representation. The multi-task training of Parrotron optimizes the decoder to generate the target voice and, at the same time, constrains the latent representation to retain linguistic information only. The ASR decoder aims to disentangle the speaker's identity from the speech. The above techniques adopt the encoder-decoder with attention architecture.

Another way to look at voice conversion is that speech consists of two components, a speaker-dependent component and a speaker-independent component. If we are able to decompose speech signals into the two components, we can carry over the latter, and only convert the former, to achieve voice conversion. The average modeling technique represents one of the successful implementations [41], where we build a mapping function to convert phonetic posteriorgrams (PPG) [32] to acoustic features. The PPG features are derived from an ASR system and can be considered speaker independent. We train the mapping function from multi-speaker, non-parallel speech data. In this way, one does not need to train a full conversion model for each target speaker. The average model can be adapted towards the target with a small amount of target speech. The training and adaptation of the average model are illustrated in Figure 9.

There were several follow-up studies along this direction. For example, Tian et al. propose a PPG-to-waveform conversion [94], and an average model with speaker identity [155] as a condition [44]. Zhou et al. propose to use PPG as the linguistic features for cross-lingual voice conversion [158]. Liu et al. propose to use PPG for emotional voice conversion [214]. Zhang et al. also show that the average model framework can benefit from a small amount of parallel training data using an error reduction network [215].
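A minimal sketch of the two-phase average modeling recipe of Figure 9 is given below: a speaker-independent PPG-to-MCEP network is first trained on pooled multi-speaker data and then adapted with a small amount of target speech. The PPG dimension, network sizes, and learning rates are illustrative assumptions.

```python
import torch
import torch.nn as nn

ppg_dim, mcep_dim = 144, 24   # e.g. senone posteriors -> MCEPs (illustrative)
average_model = nn.Sequential(
    nn.Linear(ppg_dim, 512), nn.Tanh(),
    nn.Linear(512, 512), nn.Tanh(),
    nn.Linear(512, mcep_dim),
)

def train_step(model, ppg, mcep, lr):
    """One regression step from PPG frames to MCEP frames."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss = nn.functional.mse_loss(model(ppg), mcep)
    loss.backward()
    opt.step()
    return loss.item()

# Phase 1: average model over pooled multi-speaker data (stand-in tensors).
train_step(average_model, torch.randn(4096, ppg_dim),
           torch.randn(4096, mcep_dim), lr=1e-4)
# Phase 2: adaptation with a small amount of target speech, smaller rate.
train_step(average_model, torch.randn(256, ppg_dim),
           torch.randn(256, mcep_dim), lr=1e-5)
```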
4) Disentangling speaker from linguistic content:
In the context of voice conversion, speech can be considered as a composition of the speaker's voice identity and the linguistic content. If we are able to disentangle the speaker from the linguistic content, we can change the speaker identity independently of the linguistic content. The auto-encoder [216] represents one of the common techniques for speech disentanglement and reconstruction. There are other techniques, such as instance normalization [217] and vector quantization [218], [219], that are effective in disentangling the speaker from the content.

An auto-encoder learns to reproduce its input as its output. Therefore, parallel training data is not required. An encoder learns to represent the input with a latent code, and a decoder learns to reconstruct the original input from the latent code. The latent code can be seen as an information bottleneck which, on the one hand, lets pass the information necessary for perfect reconstruction, e.g. the speaker-independent linguistic content, and on the other hand, forces some information to be discarded, e.g. speaker, noise, and channel information [83]. The variational auto-encoder (VAE) [220] is the stochastic version of the auto-encoder, in which the encoder produces distributions over latent representations, rather than deterministic latent codes, while the decoder is trained on samples from these distributions. The variational auto-encoder is more suitable than the deterministic auto-encoder for synthesizing new samples.

Chorowski et al. [98] provide a comparison of three auto-encoding neural networks by studying how they learn a representation from speech data to separate the speaker identity from the linguistic content. It was shown that the discrete representation, that is, the latent code obtained from VQ-VAE, preserves the most linguistic content while also being the most speaker-invariant. Recently, a group latent embedding technique for VQ-VAE was studied to improve the encoding process, which divides the embedding dictionary into groups and uses the weighted average of the atoms in the nearest group as the latent embedding [221].

The concept of a VAE-based voice conversion framework [43] is illustrated in Figure 10. The decoder reconstructs the utterance by conditioning on the latent code extracted by the encoder, and separately on a speaker code, which could be a one-hot vector [43], [222] for a closed set of speakers, or an i-vector [155], a bottleneck speaker representation [223], or a d-vector [224] for an open set of speakers. By explicitly conditioning the decoder on the speaker identity, the encoder is forced to capture speaker-independent information in the latent code from a multi-speaker database.

Fig. 10: A typical auto-encoding network for voice conversion, where the encoders and decoder learn to disentangle the speaker from the linguistic content. At run-time, the linguistic content of the source speech, represented by the latent code, and the speaker embedding of a target speaker are combined to generate the target speech.

Just like other auto-encoders, the VAE decoder tends to generate over-smoothed speech. This can be problematic for voice conversion because the network may generate poor-quality, buzzy-sounding speech. Generative adversarial networks (GANs) [225] were proposed as one of the solutions to the over-smoothing problem. GANs offer a general framework for training a data generator in such a way that it can deceive a real/fake discriminator that attempts to distinguish real data from fake data produced by the generator. By incorporating the GAN concept into the VAE, VAE-GAN was studied for voice conversion with non-parallel training data [46] and in cross-lingual voice conversion [204]. It was shown that VAE-GAN [225] produces more natural-sounding speech than the standard VAE method [43], [223].

A recent study on sequence-to-sequence non-parallel voice conversion [226] shows that it is possible to explicitly model the transfer of other aspects of speech, such as the source rhythm, speaking style, and emotion, to the target speech.
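The following is a minimal sketch of the conditional auto-encoding recipe in Figure 10, with a content encoder, a stand-in speaker embedding table in place of a learned speaker encoder, and a decoder conditioned on both; all dimensions are illustrative assumptions. Training reconstructs each speaker's own speech, while conversion swaps in the target speaker's code.

```python
import torch
import torch.nn as nn

feat_dim, latent_dim, spk_dim, n_speakers = 80, 16, 32, 10
content_encoder = nn.Sequential(nn.Linear(feat_dim, 256), nn.ReLU(),
                                nn.Linear(256, latent_dim))   # bottleneck
speaker_table = nn.Embedding(n_speakers, spk_dim)  # stand-in speaker encoder
decoder = nn.Sequential(nn.Linear(latent_dim + spk_dim, 256), nn.ReLU(),
                        nn.Linear(256, feat_dim))

def reconstruct(frames, speaker_id):
    z = content_encoder(frames)                    # linguistic latent code
    spk = speaker_table(speaker_id).expand(frames.size(0), -1)
    return decoder(torch.cat([z, spk], dim=-1))

# Training: reconstruct each speaker's own speech (no parallel data needed).
x = torch.randn(200, feat_dim)
loss = nn.functional.mse_loss(reconstruct(x, torch.tensor([3])), x)
# Conversion: re-use the source latent code with the *target* speaker's code.
converted = reconstruct(x, torch.tensor([7]))
```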
VI. EVALUATION OF VOICE CONVERSION
Effective assessment of voice quality is required to validate the algorithms, to measure technological progress, and to benchmark a system against the state of the art. Typically, we report the results in terms of objective and subjective measurements.

To provide an objective evaluation, a reference speech is required. The common objective evaluation metrics include Mel-cepstral distortion (MCD) [227] for the spectrum, and PCC [228] and RMSE [229]–[231] for prosody. We note that such metrics are not always correlated with human perception, partly because they measure the distortion of acoustic features rather than the waveform that humans actually listen to.

Subjective evaluation metrics, such as the mean opinion score (MOS) [2], [232]–[234], preference tests [18], [235], and best-worst scaling [236], can represent the intrinsic naturalness and the similarity to the target. We note that, for a subjective evaluation to be meaningful, a large number of listeners is required, which is not always possible in practice.
A. Objective Evaluation

1) Spectrum Conversion:
To provide an objective evaluation, first of all, we need a reference utterance spoken by the target speaker. Ideally, the converted speech is very close to the reference speech. We can measure the differences between them by comparing their spectral distances. However, there is no guarantee that the converted speech and the reference speech are of the same length. In this case, a frame aligner is required to establish the frame-level mapping.

Mel-cepstral distortion (MCD) [227] is commonly used to measure the difference between two spectral features [62], [237]–[239]. It is calculated between the converted and target Mel-cepstral coefficients, or MCEPs [240], [241], $\hat{y}$ and $y$. Suppose that each MCEP vector consists of 24 coefficients; we have $\hat{y} = \{m^c_{k,i}\}$ and $y = \{m^t_{k,i}\}$ at frame $k$, where $i$ denotes the $i$-th coefficient in the converted and target MCEPs:

$$MCD\,[\mathrm{dB}] = \frac{10}{\ln 10} \sqrt{2 \sum_{i=1}^{24} \left( m^t_{k,i} - m^c_{k,i} \right)^2} \tag{17}$$

We note that a lower MCD indicates better performance. However, the MCD value is not always correlated with human perception. Therefore, subjective evaluations, such as MOS and similarity scores, are also conducted.
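Under the assumption that the converted and target MCEP sequences have already been aligned frame-by-frame, Eq. (17) can be computed as in the sketch below; averaging the per-frame values over the utterance, as done here, is a common reporting convention.

```python
import numpy as np

def mel_cepstral_distortion(mcep_converted, mcep_target):
    """Average MCD in dB, Eq. (17), over aligned (frames, 24) MCEP matrices.

    The 10/ln(10) and sqrt(2) constants convert the cepstral Euclidean
    distance to decibels.
    """
    diff = mcep_target - mcep_converted                      # (frames, 24)
    per_frame = (10.0 / np.log(10)) * np.sqrt(2.0 * np.sum(diff ** 2, axis=1))
    return per_frame.mean()

mcd = mel_cepstral_distortion(np.random.randn(100, 24),     # stand-in data
                              np.random.randn(100, 24))
```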
2) Prosody Conversion:
The speech prosody of an utterance is characterized by the phonetic duration, the energy contour, and the pitch contour. To effectively measure how close the prosody patterns of the converted speech are to the reference speech, we need to provide measurements for all three aspects.

The alignment between the converted speech and the reference speech provides information about how much the phonetic durations differ from one another. We can derive the number of frames that deviate from the ideal diagonal path on average, such as the frame disturbance [242], to report the differences in phonetic duration.

The Pearson Correlation Coefficient (PCC) [62], [205] and the Root Mean Squared Error (RMSE) have been widely used as evaluation metrics to measure the linear dependence of prosody contours or energy contours between two speech utterances.

We next take the measurement of two prosody contours as an example. The PCC between the aligned pair of converted and target F0 sequences is given as follows:

$$\rho(F^c, F^t) = \frac{cov(F^c, F^t)}{\sigma_{F^c} \sigma_{F^t}} \tag{18}$$

where $\sigma_{F^c}$ and $\sigma_{F^t}$ are the standard deviations of the converted F0 sequence ($F^c$) and the target F0 sequence ($F^t$), respectively. We note that a higher PCC value represents better F0 conversion performance.

The RMSE between the converted F0 and the corresponding target F0 is defined as:

$$RMSE = \sqrt{\frac{1}{K} \sum_{k=1}^{K} \left( F^c_k - F^t_k \right)^2} \tag{19}$$

where $F^c_k$ and $F^t_k$ denote the converted and target F0 features, respectively, and $K$ is the length of the F0 sequence.
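Both prosody metrics are straightforward to compute on aligned F0 sequences, as in the sketch below; in practice, one would typically restrict the computation to voiced frames, which we omit here for brevity.

```python
import numpy as np

def f0_pcc(f0_converted, f0_target):
    """Pearson correlation between aligned F0 contours, Eq. (18)."""
    return np.corrcoef(f0_converted, f0_target)[0, 1]

def f0_rmse(f0_converted, f0_target):
    """Root mean squared error between aligned F0 contours, Eq. (19)."""
    return np.sqrt(np.mean((f0_converted - f0_target) ** 2))

f0_c = 120 + 10 * np.random.randn(200)   # stand-in converted F0 contour (Hz)
f0_t = 118 + 12 * np.random.randn(200)   # stand-in target F0 contour (Hz)
print(f0_pcc(f0_c, f0_t), f0_rmse(f0_c, f0_t))
```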
B. Subjective Evaluation

The Mean Opinion Score (MOS) has been widely used in listening tests [40], [61], [62], [246]–[251]. In MOS experiments, listeners rate the quality of the converted voice using a 5-point scale: “5” for excellent, “4” for good, “3” for fair, “2” for poor, and “1” for bad. There are several evaluation methods that are similar to MOS, for example: 1) DMOS [252]–[254], which is a “degradation” or “differential” MOS test that requires listeners to rate a sample with respect to a reference, and 2) MUSHRA [255]–[257], which stands for MUltiple Stimuli with Hidden Reference and Anchor, and requires fewer participants than MOS to obtain statistically significant results.

Another popular subjective evaluation is the preference test, also denoted as the AB/ABX test [2], [11], [40], [258]. In AB tests, listeners are presented with two speech samples and asked to indicate which one has more of a certain property, for example, naturalness or similarity. In the ABX test, similar to AB, two samples are given, but an extra reference sample is also given. Listeners need to judge whether A or B is more like X in terms of naturalness, similarity, or even emotional quality [205]. We note that it is not practical to use AB and/or ABX tests for the comparison of many VC systems at the same time. MUSHRA is another type of voice quality test in telecommunication [259], where the reference natural speech and several other converted samples of the same content are presented to the listeners in a random order. The listeners are asked to rate the speech quality of each sample between 0 and 100.

It is known that people are good at picking the extremes, but their preferences for anything in between can be fuzzy and inaccurate when presented with a long list of options. Best-Worst Scaling (BWS) [236] has been proposed for voice conversion quality assessment [22], where listeners are presented with only a few randomly selected options each time. With many such BWS decisions, Best-Worst Scaling can handle a long list of options and generates more discriminating results, such as a voice quality ranking, than MOS and preference tests.

We note that subjective measures can represent the intrinsic naturalness and similarity of a voice conversion system. However, such evaluations can be time-consuming and expensive as they involve a large number of listeners.
C. Evaluation with Deep Learning Approaches
The study of perceptual quality evaluation seeks to approximate human judgement with computational models of psychoacoustic motivation. It provides insights into how humans perceive speech quality in listening tests, and suggests assessment metrics that are required in speech communication, speech enhancement, speech synthesis, voice conversion, and any other speech production or transmission applications. Perceptual Evaluation of Speech Quality (PESQ) [260] is an ITU-T recommendation that is widely used as an industry standard. It provides an objective speech quality evaluation that predicts the human-perceived speech quality.

However, the PESQ formulation requires the presence of reference speech, which considerably restricts its use in voice conversion applications, and motivates the study of perceptual evaluations without the need for reference speech. The metrics that do not require reference speech are called non-intrusive evaluation metrics. For example, Fu et al. propose Quality-Net [261], an end-to-end model to predict PESQ ratings, which serve as a proxy for human ratings. Yoshimura et al. [262] and Patton et al. [263] propose CNN-based naturalness predictors to predict human MOS ratings, among other non-intrusive assessment metrics [264]–[266].

Lo et al. [267] propose MOSNet, another non-intrusive assessment technique based on deep neural networks, which learns to predict human MOS ratings. MOSNet scores are highly correlated with human MOS ratings at the system level, and fairly correlated at the utterance level. While it is a non-intrusive evaluation metric for naturalness, MOSNet can also be modified and re-purposed to predict the similarity scores between target speech and converted speech. It provides similarity scores with fair correlation to human ratings on the VCC 2018 dataset. MOSNet, which is free and open-source, marks a recent advancement towards automatic perceptual quality evaluation [268].

VII. VOICE CONVERSION CHALLENGES
In this section, we give an overview of the series of voice conversion challenges, which provide shared tasks with common datasets and evaluation metrics for a fair comparison of algorithms. The voice conversion challenge (VCC) has been a biennial event since 2016. In a challenge, a common database is provided by the organizers. The participants build voice conversion systems using their own technology, and the organizers evaluate the performance of the converted speech. The main evaluation methodology is a listening test in which crowd-sourced evaluators rank the naturalness and speaker similarity.

The 2016 challenge offers a standard voice conversion task using a parallel training database [269]. The 2018 challenge features a more advanced conversion scenario using a non-parallel database [270]. The 2020 challenge puts forward a cross-lingual voice conversion research problem. A summary of VCC 2016, VCC 2018, and VCC 2020 is provided in Table I.
A. Why is the Challenge Needed?
As described earlier, many of the voice conversion approaches are data-driven; hence, speech data are required to train models and to evaluate conversion. To compare such data-driven methods with each other precisely, a common database that explicitly specifies training and evaluation data is needed. However, such a common database did not exist until 2016. Without common databases, researchers had to re-implement others' systems with their own databases before trying any new ideas. In such a situation, there is no guarantee that the re-implemented system achieves the performance expected from the original work.

To address the same problem, the TTS community gave birth to the first Blizzard Challenge in 2005. Since then, the challenge has defined various standard databases for TTS and has made comparisons of TTS much fairer and easier. The motivations of VCC are exactly the same as those of the Blizzard Challenges. VCC introduced a few standard databases for voice conversion and also defined common training and evaluation protocols. All the converted speech submitted by the participants for the challenges has been released publicly. In this way, researchers can compare the performance of their voice conversion system with that of other state-of-the-art systems without the need for re-implementation.

Another need for standard voice conversion databases arose from the biometric speaker recognition community. As voice conversion technology could be misused for attacking speaker verification systems, anti-spoofing countermeasures are required [271]. This is also called presentation attack detection. Anti-spoofing techniques aim at discriminating between fake artificial inputs presented to biometric authentication systems and genuine inputs. If sufficient knowledge and data regarding the spoofed data are available, a binary classifier can be constructed to reject artificial inputs. Therefore, the common VCC databases are also important for anti-spoofing research. With abundant converted speech data from advanced voice conversion systems, researchers in the biometric community can develop anti-spoofing models to strengthen the defence of speaker recognition systems, and to evaluate their vulnerabilities.
B. Overview of the 2016 Voice Conversion Challenge
We first give an overview of the 2016 voice conversion challenge [269] and its datasets. (The VCC 2016 dataset is available at https://doi.org/10.7488/ds/1575.) As the first shared task in voice conversion, a parallel voice conversion task and its evaluation protocol were defined for VCC 2016. The parallel dataset consists of 162 common sentences uttered by both the source and target speakers. The target and source speakers are four native speakers of American English (two females and two males), respectively. In the challenge, the participants develop the conversion systems and produce converted speech for all possible source-target pair combinations. In total, eight speakers (plus two unused speakers) are included in the VCC 2016 database. The number of test sentences for evaluation is 54.

The main evaluation methodology adopted for the ranking is a subjective evaluation of the perceived naturalness and the speaker similarity of the converted samples to the target speakers. The naturalness is evaluated using the standard five-point scale mean-opinion-score (MOS) test, ranging from 1 (completely unnatural) to 5 (completely natural). The speaker similarity is evaluated using the Same/Different paradigm [272]. Subjects are asked to listen to two audio samples and to judge whether they are speech signals produced by the same speaker on a four-point scale: “Same, absolutely sure”, “Same, not sure”, “Different, not sure”, and “Different, absolutely sure”. As the perceived speaker similarity to a target speaker and the perceived voice quality are not necessarily correlated, it is important to use a scatter plot to observe the trade-off between the two aspects.

TABLE I: Summary of VCC 2016, VCC 2018 and VCC 2020.

Challenge | Language | Task | Training Data | Speakers | Test Utterances
VCC 2016 | monolingual | parallel | 162 paired utterances | 4 source, 4 target | 54 utterances
VCC 2018 | monolingual | parallel | 81 paired utterances | 4 source, 4 target | 35 utterances
VCC 2018 | monolingual | non-parallel | 81 unpaired utterances | 4 source, 4 target | 35 utterances
VCC 2020 | monolingual | parallel + non-parallel | 20 paired, 50 unpaired utterances | 4 source, 4 target | 25 utterances
VCC 2020 | cross-lingual | non-parallel | 70 unpaired utterances | 4 source, 6 target | 25 utterances

In the 2016 challenge, 17 participants submitted their conversion results. Two hundred native listeners of English joined the listening tests. It is reported that the best system, using GMM and waveform filtering, obtained an average of 3.0 on the five-point scale for the naturalness judgement, and about 70% of its converted speech samples were judged to be the same as the target speakers by listeners. However, it was also confirmed that there was still a huge gap between the target natural speech and the converted speech. We observe that achieving good quality and speaker similarity remained an unsolved challenge at that time. More details of VCC 2016 can be found in [272]. Details of the best performing systems are reported in [273].
C. Overview of the 2018 Voice Conversion Challenge
Next, we give an overview of the 2018 voice conversion challenge [270] and its datasets. (The VCC 2018 dataset is available at https://doi.org/10.7488/ds/2337.) VCC 2018 offers two tasks: a parallel and a non-parallel voice conversion task. A dataset and its evaluation protocol are defined for each task. The dataset for the parallel conversion task is similar to that of the 2016 challenge, except that it has a smaller number of common utterances uttered by the source and target speakers. The target and source speakers are four native speakers of American English (two females and two males), respectively, but they are different speakers from those used for the 2016 challenge. As in the 2016 challenge, the participants were asked to develop conversion systems and to produce converted data for all possible source-target pair combinations.

VCC 2018 introduced a non-parallel voice conversion task for the first time. The same target speakers' data as in the parallel task are used as the target. However, the source speakers are four native speakers of American English (two females and two males) different from those of the parallel conversion task, and their utterances are also all different from those of the target speakers. As in the parallel voice conversion task, converted data for all possible source-target pair combinations needed to be produced by the participants. In total, twelve speakers are included in the VCC 2018 database. Each of the source and target speakers has a set of 81 sentences as training data, which is half of that for VCC 2016. The number of test sentences for evaluation is 35.

In the 2018 challenge, 23 participants submitted their conversion results to the parallel conversion task, with 11 of them additionally participating in the non-parallel conversion task. The same evaluation methodology as in the 2016 challenge was adopted, and 260 crowd-sourced native listeners of English joined the listening tests. It was reported that, in both tasks, the best system, using a phone encoder and a neural vocoder, obtained an average of 4.1 on the five-point scale for the naturalness judgement, and about 80% of its converted speech samples were judged to be the same as the target speakers by listeners. It was also reported that the best system has similar performance in both the parallel and non-parallel tasks, in contrast to results reported in the literature.

In VCC 2018, a spoofing countermeasure was introduced as a supplement to the subjective evaluation of voice quality, which brought together the voice conversion and speaker verification research communities. More details of the 2018 challenge can be found in [270]. Details of the best performing systems are reported in [274], [275].

From this challenge, we observed that new speech waveform generation paradigms such as WaveNet and phone encoding have brought significant progress to the voice conversion field. Further improvements have been achieved in the follow-up papers [276], [277], and new VC systems that exceed the challenge's best performance have already been reported.
D. Overview of the 2020 Voice Conversion Challenge
The 2020 voice conversion challenge consists of two tasks: 1) non-parallel training in the same language (English); and 2) non-parallel training across different languages (English-Finnish, English-German, and English-Mandarin). In the first task, each participant trains voice conversion models for all source and target speaker pairs using up to 70 utterances per speaker as training data, including 20 parallel utterances and 50 non-parallel utterances in English. Overall, 16 voice conversion models (i.e., 4 sources by 4 targets) are to be developed. In the second task, each participant develops voice conversion models for all source and target speaker pairs using up to 70 utterances for each speaker (i.e., in English for the source speakers, and in Finnish, German, or Mandarin for the target speakers) as training data. Overall, 24 conversion systems (i.e., 4 sources by 6 targets) are to be developed.

In the 2020 challenge, the participants are allowed to mix and combine different source speakers' data to train speaker-independent models. Moreover, the participants can also use orthographic transcriptions of the released training data to develop their voice conversion systems. Last but not least, the participants are free to perform manual annotations of the released training data, which can effectively improve the quality of the converted speech.

The 2020 challenge organizers also built several baseline systems, including the top system of the previous challenge, on the new database. The code of the CycleVAE-based baseline (https://github.com/bigpon/vcc20_baseline_cyclevae) and of the cascade ASR + TTS based VC (https://github.com/espnet/espnet/tree/master/egs/vcc20) is released so that participants can build the basic systems easily and focus on their own innovation. The 2020 challenge also features a multifaceted evaluation. In addition to the traditional evaluation metrics, the challenge also reports speech recognition, speaker recognition, and anti-spoofing evaluation results on the converted speech. The challenge is underway at the time we submit this manuscript.

E. Relevant Challenges – ASVspoof Challenge
The spoofing capability against automatic speaker verification is a topic related to voice conversion that has also been organized as technology challenges. The ASVspoof series of challenges are such biennial events, which started in 2013. As in the voice conversion challenges, the organizers release a common database, including many pairs of spoofed audio (converted, generated, or replayed audio) and genuine audio, to the participants, who build anti-spoofing models using their own technology. The organizers rank the detection accuracy of the anti-spoofing results submitted by the participants.

In 2015, the first anti-spoofing database, including various types of spoofed audio produced by voice conversion and TTS systems, was constructed. This database became a reference standard in the automatic speaker verification (ASV) community [278], [279]. The main focus of the 2017 challenge was a replay task, where a large quantity of real-world replay speech data was collected [280]. In 2019, an even larger database, including converted, generated, and replayed speech data, was constructed [281]. The best performing systems in the 2016 and 2018 voice conversion challenges were also used for generating advanced spoofed audio [282]. The challenges revealed that some anti-spoofing systems outperform human listeners in detecting spoofed audio.

VIII. RESOURCES
In addition to the voice conversion challenge databases described above, the CMU-Arctic database [283] and the VCTK database [284] are also popular for voice conversion research. The current version of the CMU-Arctic database has 18 English speakers, and each of them reads out the same set of around 1,150 utterances, which are carefully selected from out-of-copyright texts from Project Gutenberg. This is suitable for parallel voice conversion since the sentences are common to all the speakers. The current version (ver. 0.92) of the CSTR VCTK corpus (https://doi.org/10.7488/ds/2645) has speech data uttered by 110 English speakers with various dialects. Each speaker reads out about 400 sentences, which are selected from newspapers, the rainbow passage, and an elicitation paragraph used for the speech accent archive. Since the rainbow passage and the elicitation paragraph are common to all the speakers, this database can be used for both parallel and non-parallel voice conversion.

Since neural networks are data-hungry and generalization to unseen speakers is key for successful conversion, large-scale but lower-quality databases such as LibriTTS and VoxCeleb are also used for training some of the components required for voice conversion (e.g. the speaker encoder). The LibriTTS corpus [285] has 585 hours of transcribed speech data uttered by a total of 2,456 speakers. The recording condition and audio quality are less than ideal, but this corpus is suitable for training speaker encoder networks or generalizing an any-to-any speaker mapping network. The VoxCeleb database [286] is an even larger-scale speech database, consisting of about 2,800 hours of untranscribed speech from over 6,000 speakers. It is an appropriate database for training noise-robust speaker encoder networks.

There are many open-source codes for training VC models. For instance, sprocket [287] supports GMM-based conversion, and ESPnet [288] supports a cascaded ASR and TTS system. In addition, there are many open-source implementations of neural-network based voice conversion written by the community on GitHub (see, e.g., https://paperswithcode.com/task/voice-conversion).
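As a convenience, some of these corpora ship with ready-made loaders; the sketch below uses torchaudio's dataset wrappers, assuming a recent torchaudio release (the exact class names and returned tuples reflect our understanding and should be checked against the installed version).

```python
import torchaudio

# Download and wrap two of the corpora mentioned above (assumed APIs).
arctic = torchaudio.datasets.CMUARCTIC("./data", download=True)
vctk = torchaudio.datasets.VCTK_092("./data", download=True)

# Each VCTK item is expected to yield a waveform plus its transcript and
# speaker metadata, which is what a VC data pipeline typically consumes.
waveform, sample_rate, transcript, speaker_id, utterance_id = vctk[0]
print(sample_rate, speaker_id, transcript)
```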
IX. CONCLUSION

This article provides a comprehensive overview of voice conversion technology, covering the fundamentals and practice up to July 2020. We reveal the underlying technologies and their relationships, from statistical approaches to deep learning, and discuss their promise and limitations. We also study the evaluation techniques for voice conversion. Moreover, we report on the series of voice conversion challenges and on resources that are useful for researchers and engineers starting voice conversion research.

REFERENCES
EFERENCES[1] John Q. Stewart, “An electrical analogue of the vocal organs,”
Nature ,vol. 110, pp. 311–312.[2] Alexander Kain and Michael W Macon, “Spectral voice conversionfor text-to-speech synthesis,” in
Proceedings of the 1998 IEEEInternational Conference on Acoustics, Speech and Signal Processing,ICASSP’98 (Cat. No. 98CH36181) . IEEE, 1998, vol. 1, pp. 285–288.[3] Mingyang Zhang, Xin Wang, Fuming Fang, Haizhou Li, and JunichiYamagishi, “Joint training framework for text-to-speech and voiceconversion using multi-source tacotron and wavenet,” arXiv preprintarXiv:1903.12389 , 2019.[4] Christophe Veaux, Junichi Yamagishi, and Simon King, “Towardspersonalised synthesised voices for individuals with vocal disabili-ties: Voice banking and reconstruction,” 08 2013.[5] Brij Srivastava, Nathalie Vauquier, Md Sahidullah, Aurélien Bel-let, Marc Tommasi, and Emmanuel Vincent, “Evaluating voiceconversion-based privacy protection against informed attackers,” 11
APSIPA Transactions on Signal andInformation Processing , vol. 3, pp. e17, 2014.[7] Chien yu Huang, Yist Y. Lin, Hung yi Lee, and Lin shan Lee,“Defending your voice: Adversarial attack on voice conversion,”
ArXiv , vol. abs/2005.08781, 2020. https://paperswithcode.com/task/voice-conversion [8] Masanobu Abe, Satoshi Nakamura, Kiyohiro Shikano, and HisaoKuwabara, “Voice conversion through vector quantization,” Journalof the Acoustical Society of Japan (E) , vol. 11, no. 2, pp. 71–76, 1990.[9] Kiyohiro Shikano, Satoshi Nakamura, and Masanobu Abe, “SpeakerAdaptation and Voice Conversion by Codebook Mapping,”
IEEEInternational Sympoisum on Circuits and Systems , pp. 594–597, 1991.[10] Elina Helander, Jan Schwarz, Jani Nurminen, Hanna Silen, andMoncef Gabbouj, “On the impact of alignment on voice conversionperformance,” in
Ninth Annual Conference of the InternationalSpeech Communication Association , 2008.[11] Tomoki Toda, Alan W. Black, and Keiichi Tokuda, “Voice conversionbased on maximum-likelihood estimation of spectral parametertrajectory,”
IEEE Transactions on Audio, Speech and LanguageProcessing , vol. 15, no. 8, pp. 2222–2235, 2007.[12] Heiga Zen, Yoshihiko Nankaku, and Keiichi Tokuda, “Probabilisticfeature mapping based on trajectory hmms,” in
Ninth AnnualConference of the International Speech Communication Association ,2008.[13] Kazuhiro Kobayashi, Shinnosuke Takamichi, Satoshi Nakamura, andTomoki Toda, “The NU-NAIST voice conversion system for the VoiceConversion Challenge 2016,” in INTERSPEECH , 2016.[14] Elina Helander, Tuomas Virtanen, Jani Nurminen, and MoncefGabbouj, “Voice conversion using partial least squares regression,”
IEEE Transactions on Audio, Speech, and Language Processing , vol.
18, no. 5, pp. 912–921, 2010.[15] Elina Helander, Hanna Silén, Tuomas Virtanen, and Moncef Gab-bouj, “Voice conversion using dynamic kernel partial least squaresregression,”
IEEE transactions on audio, speech, and languageprocessing , vol. 20, no. 3, pp. 806–817, 2011.[16] Yi Luan, Daisuke Saito, Yosuke Kashiwagi, Nobuaki Minematsu, andKeikichi Hirose, “Semi-supervised noise dictionary adaptation forexemplar-based noise robust speech recognition,” in . IEEE, 2014, pp. 1745–1748.[17] Ryoichi Takashima, Tetsuya Takiguchi, and Yasuo Ariki, “Exemplar-based voice conversion in noisy environment,”
In IEEE SLT , pp.313–317, 2012.[18] Zhizheng Wu, Tuomas Virtanen, Eng Siong Chng, and Haizhou Li,“Exemplar-based sparse representation with residual compensationfor voice conversion,”
IEEE/ACM Transactions on Audio, Speech andLanguage Processing , vol. 22, no. 10, pp. 1506–1521, 2014.[19] Ryo Aihara, Kenta Masaka, Tetsuya Takiguchi, and Yasuo Ariki,“Parallel dictionary learning for multimodal voice conversion usingmatrix factorization,”
In INTERSPEECH , pp. 27–40, 2016.[20] Zeyu Jin, Adam Finkelstein, Stephen DiVerdi, Jingwan Lu, and Gau-tham J Mysore, “Cute: A concatenative method for voice conversionusing exemplar-based unit selection,” in .IEEE, 2016, pp. 5660–5664.[21] Ryo Aihara, Toru Nakashika, Tetsuya Takiguchi, and Yasuo Ariki,“Voice conversion based on non-negative matrix factorization usingphoneme-categorized dictionary,” in ference on Acoustics, Speech and Signal Processing (ICASSP) . IEEE, . IEEE, 2017, pp. 677–684.[23] Mikiko Mashimo, Tomoki Toda, Hiromichi Kawanami, KiyohiroShikano, and Nick Campbell, “Cross-language voice conversionevaluation using bilingual databases,”
IPSJ Journal , 2002.[24] David Sundermann, Harald Hoge, Antonio Bonafonte, HermannNey, Alan Black, and Shri Narayanan, “Text-independent voiceconversion based on unit selection,” in .IEEE, 2006, vol. 1, pp. I–I. [25] Hao Wang, Frank Soong, and Helen Meng, “A spectral space warpingapproach to cross-lingual voice transformation in hmm-based tts,”in . IEEE, 2015, pp. 4874–4878.[26] David Sundermann, Hermann Ney, and H Hoge, “Vtln-basedcrosslanguage voice conversion,”
IEEE ASRU , 2003.[27] D. Erro, A. Moreno, and A. Bonafonte, “Inca algorithm for trainingvoice conversion systems from nonparallel corpora,”
IEEE Transac-tions on Audio, Speech, and Language Processing , vol. 18, no. 5, pp.944–953, 2010. [28] Daniel Erro and Asuncion Moreno, “Frame alignment method forcross-lingual voice conversion,”
INTERSPEECH , 1972.[29] Jianhua Tao, Meng Zhang, Jani Nurminen, Jilei Tian, and Xia Wang,“Supervisory data alignment for text-independent voice conversion,”
IEEE Transactions on Audio, Speech, and Language Processing , vol.18, no. 5, pp. 932–943, 2010.[30] Hanna Silen, Jani Nurminen, Elina Helander, and Moncef Gabbouj,“Voice conversion for non-parallel datasets using dynamic kernelpartial least squares regression,”
IEEE Transactions on Audio, Speech,and Language Processing , vol. 20, no. 3, pp. 806–817, 2012.[31] Peng Song, Yun Jin, Wenming Zheng, and Li Zhao, “Text-independent voice conversion using speaker model alignmentmethod from non-parallel speech,”
In Proceedings of the AnnualConference of the International Speech Communication Association,INTERSPEECH , , no. September, pp. 2308–2312, 2014.[32] Lifa Sun, Kun Li, Hao Wang, Shiyin Kang, and Helen Meng, “Phoneticposteriorgrams for many-to-one voice conversion without paralleldata training,” in . IEEE, 2016, pp. 1–6.[33] Timothy J Hazen, Wade Shen, and Christopher White, “Query-by-example spoken term detection using phonetic posteriorgramtemplates,”
In IEEE ASRU , pp. 421–426, 2009.[34] Keith Kintzley, Aren Jansen, and Hynek Hermansky, “Event selectionfrom phone posteriorgrams using matched filters,”
In INTER-
SPEECH , pp. 1905–1908, 2011.[35] Seyed Hamidreza Mohammadi and Alexander Kain, “An overviewof voice conversion systems,”
Speech Communication , vol. 88, pp.65–82, 2017.[36] M Narendranath, Hema A Murthy, S Rajendran, and B Yegna-narayana, “Transformation of formants for voice conversion usingartificial neural networks,”
Speech communication , vol. 16, no. 2,pp. 207–216, 1995.[37] Kurt Hornik, Maxwell Stinchcombe, and Halbert White, “Multi-layer feedforward networks are universal approximators,”
Neuralnetworks , vol. 2, no. 5, pp. 359–366, 1989.[38] Rabul Hussain Laskar, D Chakrabarty, Fazal Ahmed Talukdar,K Sreenivasa Rao, and Kalyan Banerjee, “Comparing ann and gmmin a voice conversion framework,”
Applied Soft Computing , vol. 12,no. 11, pp. 3332–3342, 2012.[39] Hy Quy Nguyen, Siu Wa Lee, Xiaohai Tian, Minghui Dong, andEng Siong Chng, “High quality voice conversion using prosodicand high-resolution spectral features,”
Multimedia Tools and Appli-cations , vol. 75, no. 9, pp. 5265–5285, 2016.[40] Lifa Sun, Shiyin Kang, Kun Li, and Helen Meng, “Voice conversionusing deep bidirectional long short-term memory based recurrentneural networks,” in . IEEE, 2015, pp. 4869–4873.[41] Jie Wu, Zhizheng Wu, and Lei Xie, “On the use of I-vectors andaverage voice model for voice conversion without parallel data,”
In IEEE International Conference on Acoustics, Speech and SignalProcessing (ICASSP) , 2016.[42] Feng Long Xie, Frank K. Soong, and Haifeng Li, “A KL divergenceand DNN-based approach to voice conversion without parallel training sentences,”
In Proceedings of the Annual Conference of the
International Speech Communication Association, INTERSPEECH ,vol. 08-12-September-2016, pp. 287–291, 2016.[43] Chin-Cheng Hsu, Hsin-Te Hwang, Yi-Chiao Wu, Yu Tsao, and Hsin-Min Wang, “Voice conversion from non-parallel corpora using vari-ational auto-encoder,” in .IEEE, 2016, pp. 1–6.[44] Xiaohai Tian, Junchao Wang, Xu Haihua, Eng Siong Chng, andHaizhou Li, “Average Modeling Approach to Voice Conversionwith Non-Parallel Data,”
Odyssey 2018 The Speaker and Language Recognition Workshop, pp. 1–10, 2018.
[45] Lifa Sun, Hao Wang, Shiyin Kang, Kun Li, and Helen Meng, “Personalized, cross-lingual TTS using phonetic posteriorgrams,” in INTERSPEECH, 2016, pp. 322–326.
[46] Chin-Cheng Hsu, Hsin-Te Hwang, Yi-Chiao Wu, Yu Tsao, and Hsin-Min Wang, “Voice conversion from unaligned corpora using variational autoencoding Wasserstein generative adversarial networks,” arXiv preprint arXiv:1704.00849, 2017.
[47] Takuhiro Kaneko and Hirokazu Kameoka, “Parallel-data-free voice conversion using cycle-consistent adversarial networks,” arXiv preprint arXiv:1711.11293, 2017.
[48] Fuming Fang, Junichi Yamagishi, Isao Echizen, and Jaime Lorenzo-Trueba, “High-quality nonparallel voice conversion based on cycle-consistent adversarial network,” in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2018, pp. 5279–5283.
[49] Jaime Lorenzo-Trueba, Fuming Fang, Xin Wang, Isao Echizen, Junichi Yamagishi, and Tomi Kinnunen, “Can we steal your vocal identity from the Internet?: Initial investigation of cloning Obama’s voice using GAN, WaveNet and low-quality found data,” arXiv preprint arXiv:1803.00860, 2018.
[50] Hirokazu Kameoka, Takuhiro Kaneko, Kou Tanaka, and Nobukatsu Hojo, “StarGAN-VC: Non-parallel many-to-many voice conversion with star generative adversarial networks,” arXiv preprint arXiv:1806.02169, 2018.
[51] Manu Airaksinen, Lauri Juvela, Bajibabu Bollepalli, Junichi Yamagishi, and Paavo Alku, “A comparison between STRAIGHT, glottal, and sinusoidal vocoding in statistical parametric speech synthesis,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 26, no. 9, pp. 1658–1670, 2018.
[52] Xin Wang, Jaime Lorenzo-Trueba, Shinji Takaki, Lauri Juvela, and Junichi Yamagishi, “A comparison of recent waveform generation and acoustic modeling methods for neural-network-based speech synthesis,” in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Calgary, Canada, April 2018, pp. 4804–4808.
[53] Aaron van den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew Senior, and Koray Kavukcuoglu, “WaveNet: A generative model for raw audio,” arXiv preprint arXiv:1609.03499, 2016.
[54] Akira Tamamori, Tomoki Hayashi, Kazuhiro Kobayashi, Kazuya Takeda, and Tomoki Toda, “Speaker-dependent WaveNet vocoder,” in Interspeech, 2017, pp. 1118–1122.
[55] Tomoki Hayashi, Akira Tamamori, Kazuhiro Kobayashi, Kazuya Takeda, and Tomoki Toda, “An investigation of multi-speaker training for WaveNet vocoder,” in IEEE Automatic Speech Recognition and Understanding Workshop (ASRU). IEEE, 2017, pp. 712–718.
[56] Yi-Chiao Wu, Tomoki Hayashi, Patrick Lumban Tobing, Kazuhiro Kobayashi, and Tomoki Toda, “Quasi-periodic WaveNet vocoder: A pitch dependent dilated convolution model for parametric speech generation,” arXiv preprint arXiv:1907.00797, 2019.
[57] Yi-Chiao Wu, Patrick Lumban Tobing, Tomoki Hayashi, Kazuhiro Kobayashi, and Tomoki Toda, “Statistical voice conversion with quasi-periodic WaveNet vocoder,” arXiv preprint arXiv:1907.08940, 2019.
[58] Berrak Sisman, Mingyang Zhang, Sakriani Sakti, Haizhou Li, and Satoshi Nakamura, “Adaptive WaveNet vocoder for residual compensation in GAN-based voice conversion,” in IEEE Spoken Language Technology Workshop (SLT). IEEE, 2018, pp. 282–289.
[59] H. Du, X. Tian, L. Xie, and H. Li, “WaveNet factorization with singular value decomposition for voice conversion,” in IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), 2019, pp. 152–159.
[60] Wen-Chin Huang, Yi-Chiao Wu, Hsin-Te Hwang, Patrick Lumban Tobing, Tomoki Hayashi, Kazuhiro Kobayashi, Tomoki Toda, Yu Tsao, and Hsin-Min Wang, “Refined WaveNet vocoder for variational autoencoder based voice conversion,” IEEE, 2019, pp. 1–5.
[61] Berrak Sisman, Mingyang Zhang, and Haizhou Li, “A voice conversion framework with tandem feature sparse representation and speaker-adapted WaveNet vocoder,” in Interspeech, 2018, pp. 1978–1982.
[62] Berrak Sisman, Mingyang Zhang, and Haizhou Li, “Group sparse representation with WaveNet vocoder adaptation for spectrum and prosody conversion,” IEEE/ACM Transactions on Audio, Speech and Language Processing, 2019.
[63] Nal Kalchbrenner, Erich Elsen, Karen Simonyan, Seb Noury, Norman Casagrande, Edward Lockhart, Florian Stimberg, Aaron van den Oord, Sander Dieleman, and Koray Kavukcuoglu, “Efficient neural audio synthesis,” arXiv preprint arXiv:1802.08435, 2018.
[64] Ryan Prenger, Rafael Valle, and Bryan Catanzaro, “WaveGlow: A flow-based generative network for speech synthesis,” in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Brighton, UK, May 2019, pp. 3617–3621.
[65] Tomoki Toda, Ling-Hui Chen, Daisuke Saito, Fernando Villavicencio, Mirjam Wester, Zhizheng Wu, and Junichi Yamagishi, “The Voice Conversion Challenge 2016,” in Interspeech, 2016, pp. 1632–1636.
[66] Mirjam Wester, Zhizheng Wu, and Junichi Yamagishi, “Multidimensional scaling of systems in the Voice Conversion Challenge 2016,” in SSW, 2016, pp. 38–43.
[67] Mirjam Wester, Zhizheng Wu, and Junichi Yamagishi, “Analysis of the Voice Conversion Challenge 2016 evaluation results,” in Interspeech, 2016, pp. 1637–1641.
[68] Jaime Lorenzo-Trueba, Junichi Yamagishi, Tomoki Toda, Daisuke Saito, Fernando Villavicencio, Tomi Kinnunen, and Zhenhua Ling, “The Voice Conversion Challenge 2018: Promoting development of parallel and nonparallel methods,” arXiv preprint arXiv:1804.04262, 2018.
[69] Jaime Lorenzo-Trueba, Junichi Yamagishi, Tomoki Toda, Daisuke Saito, Fernando Villavicencio, Tomi Kinnunen, Zhenhua Ling, et al., “The Voice Conversion Challenge 2018: Database and results,” 2018.
[70] Patrick Lumban Tobing, Yi-Chiao Wu, Tomoki Hayashi, Kazuhiro Kobayashi, and Tomoki Toda, “NU voice conversion system for the Voice Conversion Challenge 2018,” in Odyssey, 2018, pp. 219–226.
[71] Daniel Griffin and Jae Lim, “Signal estimation from modified short-time Fourier transform,” IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 32, no. 2, pp. 236–243, 1984.
[72] Eric Moulines and Francis Charpentier, “Pitch-synchronous waveform processing techniques for text-to-speech synthesis using diphones,” Speech Communication, vol. 9, no. 5-6, pp. 453–467, 1990.
[73] Hélène Valbret, Eric Moulines, and Jean-Pierre Tubach, “Voice transformation using PSOLA technique,” Speech Communication, vol. 11, no. 2-3, pp. 175–187, 1992.
[74] Levent M. Arslan, “Speaker transformation algorithm using segmental codebooks (STASC),” Speech Communication, vol. 28, no. 3, pp. 211–226, 1999.
[75] Yannis Stylianou, “Applying the harmonic plus noise model in concatenative speech synthesis,” IEEE Transactions on Speech and Audio Processing, vol. 9, no. 1, pp. 21–29, 2001.
[76] Yannis Stylianou and Olivier Cappé, “A system for voice conversion based on probabilistic classification and a harmonic plus noise model,” in Proceedings of the 1998 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP’98). IEEE, 1998, vol. 1, pp. 281–284.
[77] Daniel Erro and Asunción Moreno, “Weighted frequency warping for voice conversion,” in Eighth Annual Conference of the International Speech Communication Association, 2007.
[78] Satoshi Imai, Kazuo Sumita, and Chieko Furuichi, “Mel log spectrum approximation (MLSA) filter for speech synthesis,” Electronics and Communications in Japan (Part I: Communications), vol. 66, no. 2, pp. 10–18, 1983.
[79] M. Airaksinen, L. Juvela, B. Bollepalli, J. Yamagishi, and P. Alku, “A comparison between STRAIGHT, glottal, and sinusoidal vocoding in statistical parametric speech synthesis,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 26, no. 9, pp. 1658–1670, 2018.
[80] Hideki Kawahara, Ikuyo Masuda-Katsuse, and Alain de Cheveigné, “Restructuring speech representations using a pitch-adaptive time–frequency smoothing and an instantaneous-frequency-based F0 extraction: Possible role of a repetitive structure in sounds,” Speech Communication, vol. 27, no. 3-4, pp. 187–207, 1999.
[81] Srinivas Desai, E. Veera Raghavendra, B. Yegnanarayana, Alan W. Black, and Kishore Prahallad, “Voice conversion using artificial neural networks,” in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2009, pp. 3893–3896.
[82] Berrak Sisman and Haizhou Li, “Wavelet analysis of speaker dependent and independent prosody for voice conversion,” in Interspeech, 2018, pp. 52–56.
[83] Wei-Ning Hsu, Yu Zhang, and James Glass, “Unsupervised learning of disentangled and interpretable representations from sequential data,” in Advances in Neural Information Processing Systems, 2017, pp. 1878–1889.
[84] Wei-Ning Hsu, Yu Zhang, and James Glass, “Learning latent representations for speech generation and transformation,” arXiv preprint arXiv:1704.04222, 2017.
[85] Sadaoki Furui, “Digital speech processing, synthesis, and recognition (revised and expanded),” Digital Speech Processing, Synthesis, and Recognition, 2000.
[86] Jonathan Shen, Ruoming Pang, Ron J. Weiss, Mike Schuster, Navdeep Jaitly, Zongheng Yang, Zhifeng Chen, Yu Zhang, Yuxuan Wang, RJ Skerry-Ryan, Rif A. Saurous, Yannis Agiomyrgiannakis, and Yonghui Wu, “Natural TTS synthesis by conditioning WaveNet on mel spectrogram predictions,” arXiv preprint arXiv:1712.05884, 2018.
[87] Rui Liu, Berrak Sisman, Jingdong Li, Feilong Bao, Guanglai Gao, and Haizhou Li, “Teacher-student training for robust Tacotron-based TTS,” arXiv preprint arXiv:1911.02839, 2019.
[88] Zdeněk Hanzlíček, Jakub Vít, and Daniel Tihelka, “WaveNet-based speech synthesis applied to Czech,” in International Conference on Text, Speech, and Dialogue. Springer, 2018, pp. 445–452.
[89] Sercan Ö. Arik, Mike Chrzanowski, Adam Coates, Gregory Diamos, Andrew Gibiansky, Yongguo Kang, Xian Li, John Miller, Andrew Ng, Jonathan Raiman, et al., “Deep Voice: Real-time neural text-to-speech,” in Proceedings of the 34th International Conference on Machine Learning - Volume 70. JMLR.org, 2017, pp. 195–204.
[90] Berrak Sisman, Machine Learning for Limited Data Voice Conversion, Ph.D. thesis, 2019.
[91] Kuan Chen, Bo Chen, Jiahao Lai, and Kai Yu, “High-quality voice conversion using spectrogram-based WaveNet vocoder,” in Interspeech, 2018, pp. 1993–1997.
[92] Nagaraj Adiga, Vassilis Tsiaras, and Yannis Stylianou, “On the use of WaveNet as a statistical vocoder,” in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2018, pp. 5674–5678.
[93] Yi Zhao, Shinji Takaki, Hieu-Thi Luong, Junichi Yamagishi, Daisuke Saito, and Nobuaki Minematsu, “Wasserstein GAN and waveform loss-based acoustic model training for multi-speaker text-to-speech synthesis systems using a WaveNet vocoder,” IEEE Access, vol. 6, 2018.
[94] Proceedings of the Interspeech, Graz, Austria, 2019, pp. 15–19.
[95] Hui Lu, Zhiyong Wu, Runnan Li, Shiyin Kang, Jia Jia, and Helen Meng, “A compact framework for voice conversion using WaveNet conditioned on phonetic posteriorgrams,” in ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2019, pp. 6810–6814.
[96] Hongqiang Du, Xiaohai Tian, Lei Xie, and Haizhou Li, “WaveNet factorization with singular value decomposition for voice conversion,” in IEEE Automatic Speech Recognition and Understanding Workshop (ASRU). IEEE, 2019, pp. 152–159.
[97] Songxiang Liu, Yuewen Cao, Xixin Wu, Lifa Sun, Xunying Liu, and Helen Meng, “Jointly trained conversion model and WaveNet vocoder for non-parallel voice conversion using mel-spectrograms and phonetic posteriorgrams,” in Proc. Interspeech 2019, 2019, pp. 714–718.
[98] Jan Chorowski, Ron Weiss, Samy Bengio, and Aaron van den Oord, “Unsupervised speech representation learning using WaveNet autoencoders,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. PP, pp. 1–1, 2019.
[99] Jaime Lorenzo-Trueba, Thomas Drugman, Javier Latorre, Thomas Merritt, Bartosz Putrycz, Roberto Barra-Chicote, Alexis Moinet, and Vatsal Aggarwal, “Towards achieving robust universal neural vocoding,” in Proc. Interspeech, 2019, pp. 181–185.
[100] Prachi Govalkar, Johannes Fischer, Frank Zalkow, and Christian Dittmar, “A comparison of recent neural vocoders for speech signal reconstruction,” in Proc. 10th ISCA Speech Synthesis Workshop, 2019, pp. 7–12.
[101] Yuan-Hao Yi, Yang Ai, Zhen-Hua Ling, and Li-Rong Dai, “Singing voice synthesis using deep autoregressive neural networks for acoustic modeling,” arXiv preprint arXiv:1906.08977, 2019.
[102] Takuma Okamoto, Tomoki Toda, Yoshinori Shiga, and Hisashi Kawai, “Real-time neural text-to-speech with sequence-to-sequence acoustic model and WaveGlow or single Gaussian WaveRNN vocoders,” in Proc. Interspeech, 2019, pp. 1308–1312.
[103] Soumi Maiti and Michael I. Mandel, “Parametric resynthesis with neural vocoders,” in IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA). IEEE, 2019, pp. 303–307.
[104] Xin Wang, Shinji Takaki, and Junichi Yamagishi, “Neural source-filter-based waveform model for statistical parametric speech synthesis,” in ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2019, pp. 5916–5920.
[105] Xin Wang and Junichi Yamagishi, “Neural harmonic-plus-noise waveform model with trainable maximum voice frequency for text-to-speech synthesis,” arXiv preprint arXiv:1908.10256, 2019.
[106] Xin Wang, Shinji Takaki, and Junichi Yamagishi, “Neural source-filter waveform models for statistical parametric speech synthesis,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 28, pp. 402–415, 2019.
[107] Xiaohai Tian, Siu Wa Lee, Zhizheng Wu, Eng Siong Chng, and Haizhou Li, “An exemplar-based approach to frequency warping for voice conversion,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, pp. 1–10, 2016.
[108] Hisao Kuwabara and Yoshinori Sagisaka, “Acoustic characteristics of speaker individuality: Control and conversion,” Speech Communication, vol. 16, no. 2, pp. 165–173, 1995.
[109] Yannis Stylianou, Olivier Cappé, and Eric Moulines, “Continuous probabilistic transform for voice conversion,” IEEE Transactions on Speech and Audio Processing, vol. 6, no. 2, pp. 131–142, 1998.
[110] Hiroshi Matsumoto and Yasuki Yamashita, “Unsupervised speaker adaptation from short utterances based on a minimized fuzzy objective function,” Journal of the Acoustical Society of Japan (E), vol. 14, no. 5, pp. 353–361, 1993.
[111] Tomoki Toda, Hiroshi Saruwatari, and Kiyohiro Shikano, “Voice conversion algorithm based on Gaussian mixture model with dynamic frequency warping of STRAIGHT spectrum,” in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP). IEEE, 2001, vol. 2, pp. 841–844.
[112] Tomoki Toda, Jinlin Lu, Satoshi Nakamura, and Kiyohiro Shikano, “Voice conversion algorithm based on Gaussian mixture model applied to STRAIGHT,” 2000.
[113] Tomoki Toda, Alan W. Black, and Keiichi Tokuda, “Spectral conversion based on maximum likelihood estimation considering global variance of converted parameter,” in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP’05). IEEE, 2005, vol. 1, pp. I-9.
[114] Todd K. Moon, “The expectation-maximization algorithm,” IEEE Signal Processing Magazine, vol. 13, no. 6, pp. 47–60, 1996.
[115] Chuong B. Do and Serafim Batzoglou, “What is the expectation maximization algorithm?,” Nature Biotechnology, vol. 26, no. 8, pp. 897–899, 2008.
[116] Guorong Xuan, Wei Zhang, and Peiqi Chai, “EM algorithms of Gaussian mixture model and hidden Markov model,” in Proceedings 2001 International Conference on Image Processing. IEEE, 2001, vol. 1, pp. 145–148.
[117] Maya R. Gupta, Yihua Chen, et al., “Theory and use of the EM algorithm,” Foundations and Trends in Signal Processing, vol. 4, no. 3, pp. 223–296, 2011.
[118] Shinnosuke Takamichi, Tomoki Toda, Alan W. Black, and Satoshi Nakamura, “Modulation spectrum-based post-filter for GMM-based voice conversion,” in Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA). IEEE, 2014, pp. 1–4.
[119] Yamato Ohtani, Tomoki Toda, Hiroshi Saruwatari, and Kiyohiro Shikano, “Maximum likelihood voice conversion based on GMM with STRAIGHT mixed excitation,” 2006.
[120] Hiromichi Kawanami, Yohei Iwami, Tomoki Toda, Hiroshi Saruwatari, and Kiyohiro Shikano, “GMM-based voice conversion applied to emotional speech synthesis,” in Eighth European Conference on Speech Communication and Technology, 2003.
[121] Ryo Aihara, Ryoichi Takashima, Tetsuya Takiguchi, and Yasuo Ariki, “GMM-based emotional voice conversion using spectrum and prosody features,” American Journal of Signal Processing, vol. 2, no. 5, pp. 134–138, 2012.
[122] Hsin-Te Hwang, Yu Tsao, Hsin-Min Wang, Yih-Ru Wang, and Sin-Horng Chen, “Incorporating global variance in the training phase of GMM-based voice conversion,” in Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA). IEEE, 2013, pp. 1–6.
[123] Tudor-Cătălin Zorilă, Daniel Erro, and Inma Hernáez, “Improving the quality of standard GMM-based voice conversion systems by considering physically motivated linear transformations,” in Advances in Speech and Language Technologies for Iberian Languages, pp. 30–39. Springer, 2012.
[124] Mostafa Ghorbandoost, Abolghasem Sayadiyan, Mohsen Ahangar, Hamid Sheikhzadeh, Abdoreza Sabzi Shahrebabaki, and Jamal Amini, “Voice conversion based on feature combination with limited training data,” Speech Communication, vol. 67, pp. 113–128, 2015.
[125] Manuel Sam Ribeiro, Junichi Yamagishi, and Robert A. J. Clark, “A perceptual investigation of wavelet-based decomposition of F0 for text-to-speech synthesis,” in Sixteenth Annual Conference of the International Speech Communication Association, 2015.
[126] Manuel Sam Ribeiro, Oliver Watts, Junichi Yamagishi, and Robert A. J. Clark, “Wavelet-based decomposition of F0 as a secondary task for DNN-based speech synthesis with multi-task learning,” in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2016, pp. 5525–5529.
[127] Cheng-Cheng Wang, Zhen-Hua Ling, Bu-Fan Zhang, and Li-Rong Dai, “Multi-layer F0 modeling for HMM-based speech synthesis,” in International Symposium on Chinese Spoken Language Processing (ISCSLP). IEEE, 2008, pp. 1–4.
[128] Gerard Sanchez, Hanna Silen, Jani Nurminen, and Moncef Gabbouj, “Hierarchical modeling of F0 contours for voice conversion,” in Proceedings of the Annual Conference of the International Speech Communication Association (INTERSPEECH), 2014, pp. 2318–2321.
[129] Daniel Erro, Asunción Moreno, and Antonio Bonafonte, “Voice conversion based on weighted frequency warping,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 18, no. 5, pp. 922–931, 2009.
[130] David Sundermann and Hermann Ney, “VTLN-based voice conversion,” in Proceedings of the 3rd IEEE International Symposium on Signal Processing and Information Technology. IEEE, 2003, pp. 556–559.
[131] Matthias Eichner, Matthias Wolff, and Rüdiger Hoffmann, “Voice characteristics conversion for TTS using reverse VTLN,” in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP). IEEE, 2004, vol. 1, pp. I-17.
[132] Anna Přibilová and Jiří Přibil, “Non-linear frequency scale mapping for voice conversion in text-to-speech system with cepstral description,” Speech Communication, vol. 48, no. 12, pp. 1691–1703, 2006.
[133] Robert Vích and Martin Vondra, “Pitch synchronous transform warping in voice conversion,” in Cognitive Behavioural Systems, pp. 280–289. Springer, 2012.
[134] Elizabeth Godoy, Olivier Rosec, and Thierry Chonavel, “Voice conversion using dynamic frequency warping with amplitude scaling, for parallel or nonparallel corpora,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 20, no. 4, pp. 1313–1323, 2011.
[135] Daniel D. Lee and H. Sebastian Seung, “Algorithms for non-negative matrix factorization,” in Advances in Neural Information Processing Systems, 2001, pp. 556–562.
[136] Syu-Siang Wang, Alan Chern, Yu Tsao, Jeih-Weih Hung, Xugang Lu, Ying-Hui Lai, and Borching Su, “Wavelet speech enhancement based on nonnegative matrix factorization,” IEEE Signal Processing Letters, vol. 23, 2016.
[137] Nasser Mohammadiha, Paris Smaragdis, and Arne Leijon, “Supervised and unsupervised speech enhancement using nonnegative matrix factorization,” IEEE Transactions on Audio, Speech and Language Processing, vol. 21, no. 10, pp. 2140–2151, 2013.
[138] K. A. Akarsh, “Speech enhancement using non-negative matrix factorization and enhanced NMF,” International Conference on Circuit, Power and Computing Technologies (ICCPCT), 2015.
[139] Kevin W. Wilson, Bhiksha Raj, Paris Smaragdis, and Ajay Divakaran, “Speech denoising using nonnegative matrix factorization with priors,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2008.
[140] Meng Sun, Yinan Li, Jort F. Gemmeke, and Xiongwei Zhang, “Speech enhancement under low SNR conditions via noise estimation using sparse and low-rank NMF with Kullback-Leibler divergence,” IEEE/ACM Transactions on Audio, Speech and Language Processing, vol. 23, no. 7, pp. 1233–1242, 2015.
[141] Zhizheng Wu, Tuomas Virtanen, Tomi Kinnunen, Eng Siong Chng, and Haizhou Li, “Exemplar-based voice conversion using non-negative spectrogram deconvolution,” 2013.
[142] Yi-Chiao Wu, Hsin-Te Hwang, Chin-Cheng Hsu, Yu Tsao, and Hsin-Min Wang, “Locally linear embedding for exemplar-based spectral conversion,” in Proceedings of the Annual Conference of the International Speech Communication Association (INTERSPEECH), 2016, pp. 1652–1656.
[143] Huaiping Ming, Dongyan Huang, Lei Xie, Shaofei Zhang, Minghui Dong, and Haizhou Li, “Exemplar-based sparse representation of timbre and prosody for voice conversion,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2017, pp. 1537–1546.
[145] Chin-Cheng Hsu, Hsin-Te Hwang, Yi-Chiao Wu, Yu Tsao, and Hsin-Min Wang, “Dictionary update for NMF-based voice conversion using an encoder-decoder network,” vol. 22, no. 3, pp. 293–297, 2016.
[146] Hermann Ney, David Suendermann, Antonio Bonafonte, and Harald Höge, “A first step towards text-independent voice conversion,” in Eighth International Conference on Spoken Language Processing, 2004.
[147] Hui Ye and Steve J. Young, “Voice conversion for unknown speakers,” in INTERSPEECH 2004 - ICSLP, 8th International Conference on Spoken Language Processing, Jeju Island, Korea, October 4-8, 2004. ISCA, 2004.
[148] Hui Ye and Steve Young, “Quality-enhanced voice morphing using maximum likelihood transformations,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 14, pp. 1301–1312, 2006.
[149] Alan W. Black and Nick Campbell, “Optimising selection of units from speech databases for concatenative synthesis,” 1995.
[150] Kei Fujii, Jun Okawa, and Kaori Suigetsu, “High individuality voice conversion based on concatenative speech synthesis,” International Journal of Electrical, Computer, Energetic, Electronic and Communication Engineering, vol. 1, no. 11, pp. 1617–1622, 2007.
[151] Yoshinori Sagisaka, Nobuyoshi Kaiki, Naoto Iwahashi, and Katsuhiko Mimura, “ATR µ-talk speech synthesis system,” in Second International Conference on Spoken Language Processing, 1992.
[152] Daniel Erro, Ferran Diego, and Antonio Bonafonte, “Voice conversion of non-aligned data using unit selection,” 2006.
[153] A. Mouchtaris, J. Van der Spiegel, and P. Mueller, “Nonparallel training for voice conversion based on a parameter adaptation approach,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 14, no. 3, pp. 952–963, 2006.
[154] Tomoki Toda, Yamato Ohtani, and Kiyohiro Shikano, “Eigenvoice conversion based on Gaussian mixture model,” in INTERSPEECH, 2006.
[155] Najim Dehak, Patrick J. Kenny, Réda Dehak, Pierre Dumouchel, and Pierre Ouellet, “Front-end factor analysis for speaker verification,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 19, no. 4, pp. 788–798, 2010.
[156] Z. Wu, T. Kinnunen, E. S. Chng, and H. Li, “Mixture of factor analyzers using priors from non-parallel speech for voice conversion,” IEEE Signal Processing Letters, vol. 19, no. 12, pp. 914–917, 2012.
[157] Yannis Stylianou, Olivier Cappé, and Eric Moulines, “Continuous probabilistic transform for voice conversion,” IEEE Transactions on Speech and Audio Processing, vol. 6, no. 2, pp. 131–142, 1998.
[158] Yi Zhou, Xiaohai Tian, Haihua Xu, Rohan Kumar Das, and Haizhou Li, “Cross-lingual voice conversion with bilingual phonetic posteriorgrams and average modeling,” in International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2019.
[159] Tomi Kinnunen, Lauri Juvela, Paavo Alku, and Junichi Yamagishi, “Non-parallel voice conversion using i-vector PLDA: Towards unifying speaker verification and transformation,” in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2017, pp. 5535–5539.
[160] Hiroyuki Miyoshi, Yuki Saito, Shinnosuke Takamichi, and Hiroshi Saruwatari, “Voice conversion using sequence-to-sequence learning of context posterior probabilities,” arXiv preprint arXiv:1704.02360, 2017.
[161] Seung-won Park, Doo-young Kim, and Myun-chul Joe, “Cotatron: Transcription-guided speech encoder for any-to-many voice conversion without parallel data,” arXiv preprint arXiv:2005.03295, 2020.
[162] Feng-Long Xie, Yao Qian, Frank K. Soong, and Haifeng Li, “Pitch transformation in neural network based voice conversion,” in The 9th International Symposium on Chinese Spoken Language Processing. IEEE, 2014, pp. 197–200.
[163] Toru Nakashika, Ryoichi Takashima, Tetsuya Takiguchi, and Yasuo Ariki, “Voice conversion in high-order eigen space using deep belief nets,” in Interspeech, 2013, pp. 369–372.
[164] Seyed Hamidreza Mohammadi and Alexander Kain, “Voice conversion using deep neural networks with speaker-independent pre-training,” in IEEE Spoken Language Technology Workshop (SLT). IEEE, 2014, pp. 19–23.
[165] Feng-Long Xie, Yao Qian, Yuchen Fan, Frank K. Soong, and Haifeng Li, “Sequence error (SE) minimization training of neural network for voice conversion,” in Fifteenth Annual Conference of the International Speech Communication Association, 2014.
[166] Keiichi Tokuda, Takayoshi Yoshimura, Takashi Masuko, Takao Kobayashi, and Tadashi Kitamura, “Speech parameter generation algorithms for HMM-based speech synthesis,” in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP). IEEE, 2000, vol. 3, pp. 1315–1318.
[167] Ling-Hui Chen, Zhen-Hua Ling, Li-Juan Liu, and Li-Rong Dai, “Voice conversion using deep neural networks with layer-wise generative training,” IEEE Transactions on Audio, Speech and Language Processing, vol. 22, no. 12, pp. 1859–1872, 2014.
[168] Toru Nakashika, Tetsuya Takiguchi, and Yasuo Ariki, “High-order sequence modeling using speaker-dependent recurrent temporal restricted Boltzmann machines for voice conversion,” in Proceedings of the Annual Conference of the International Speech Communication Association (INTERSPEECH), 2014, pp. 2278–2282.
[169] Sepp Hochreiter and Jürgen Schmidhuber, “Long short-term memory,” Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.
[170] Felix A. Gers, Jürgen Schmidhuber, and Fred Cummins, “Learning to forget: Continual prediction with LSTM,” 1999.
[171] Klaus Greff, Rupesh K. Srivastava, Jan Koutník, Bas R. Steunebrink, and Jürgen Schmidhuber, “LSTM: A search space odyssey,” IEEE Transactions on Neural Networks and Learning Systems, vol. 28, no. 10, pp. 2222–2232, 2016.
[172] Huaiping Ming, Dongyan Huang, Lei Xie, Jie Wu, Minghui Dong, and Haizhou Li, “Deep bidirectional LSTM modeling of timbre and prosody for emotional voice conversion,” in Proceedings of the Annual Conference of the International Speech Communication Association (INTERSPEECH), 2016, pp. 2453–2457.
[173] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio, “Neural machine translation by jointly learning to align and translate,” arXiv preprint arXiv:1409.0473, 2014.
[174] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin, “Attention is all you need,” in Advances in Neural Information Processing Systems, 2017, pp. 5998–6008.
[175] W. Chan, N. Jaitly, Q. Le, and O. Vinyals, “Listen, attend and spell: A neural network for large vocabulary conversational speech recognition,” in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2016, pp. 4960–4964.
[176] Yuxuan Wang, RJ Skerry-Ryan, Daisy Stanton, Yonghui Wu, Ron J. Weiss, Navdeep Jaitly, Zongheng Yang, Ying Xiao, Zhifeng Chen, Samy Bengio, et al., “Tacotron: Towards end-to-end speech synthesis,” arXiv preprint arXiv:1703.10135, 2017.
[177] Wei Ping, Kainan Peng, Andrew Gibiansky, Sercan O. Arik, Ajay Kannan, Sharan Narang, Jonathan Raiman, and John Miller, “Deep Voice 3: 2000-speaker neural text-to-speech,” arXiv preprint arXiv:1710.07654, 2017.
[178] Hideyuki Tachibana, Katsuya Uenoyama, and Shunsuke Aihara, “Efficiently trainable text-to-speech system based on deep convolutional networks with guided attention,” in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2018, pp. 4784–4788.
[179] Jing-Xuan Zhang, Zhen-Hua Ling, Li-Juan Liu, Yuan Jiang, and Li-Rong Dai, “Sequence-to-sequence acoustic modeling for voice conversion,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 27, no. 3, pp. 631–644, 2019.
[180] Kyunghyun Cho, Bart van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio, “Learning phrase representations using RNN encoder-decoder for statistical machine translation,” arXiv preprint arXiv:1406.1078, 2014.
[181] K. Tanaka, H. Kameoka, T. Kaneko, and N. Hojo, “AttS2S-VC: Sequence-to-sequence voice conversion with attention and context preservation mechanisms,” in ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2019, pp. 6805–6809.
[182] Minh-Thang Luong, Hieu Pham, and Christopher D. Manning, “Effective approaches to attention-based neural machine translation,” arXiv preprint arXiv:1508.04025, 2015.
[183] Jonas Gehring, Michael Auli, David Grangier, Denis Yarats, and Yann Dauphin, “Convolutional sequence to sequence learning,” arXiv preprint arXiv:1705.03122, 2017.
[184] Hirokazu Kameoka, Kou Tanaka, Takuhiro Kaneko, and Nobukatsu Hojo, “ConvS2S-VC: Fully convolutional sequence-to-sequence voice conversion,” arXiv preprint arXiv:1811.01609, 2018.
[185] Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A. Efros, “Unpaired image-to-image translation using cycle-consistent adversarial networks,” in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 2223–2232.
[186] Kenan E. Ak, Joo Hwee Lim, Jo Yew Tham, and Ashraf A. Kassim, “Attribute manipulation generative adversarial networks for fashion images,” in Proceedings of the IEEE International Conference on Computer Vision, 2019, pp. 10541–10550.
[187] Kenan E. Ak, Ashraf A. Kassim, Joo Hwee Lim, and Jo Yew Tham, “Learning attribute representations with localization for flexible fashion search,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 7708–7717.
[188] Kenan Emir Ak, Deep Learning Approaches for Attribute Manipulation and Text-to-Image Synthesis, Ph.D. thesis, 2019.
[189] Kenan E. Ak, Joo Hwee Lim, Jo Yew Tham, and Ashraf A. Kassim, “Efficient multi-attribute similarity learning towards attribute-based fashion search,” in IEEE Winter Conference on Applications of Computer Vision (WACV). IEEE, 2018, pp. 1671–1679.
[190] Kenan E. Ak, Ning Xu, Zhe Lin, and Yilin Wang, “Incorporating reinforced adversarial learning in autoregressive image generation,” arXiv preprint arXiv:2007.09923, 2020.
[191] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio, “Generative adversarial nets,” in Advances in Neural Information Processing Systems, 2014, pp. 2672–2680.
[192] Xun Huang, Ming-Yu Liu, Serge Belongie, and Jan Kautz, “Multimodal unsupervised image-to-image translation,” in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 172–189.
[193] Jun-Yan Zhu, Richard Zhang, Deepak Pathak, Trevor Darrell, Alexei A. Efros, Oliver Wang, and Eli Shechtman, “Toward multimodal image-to-image translation,” in Advances in Neural Information Processing Systems, 2017, pp. 465–476.
[194] Kenan E. Ak, Joo Hwee Lim, Jo Yew Tham, and Ashraf A. Kassim, “Semantically consistent text to fashion image synthesis with an enhanced attentional generative adversarial network,” Pattern Recognition Letters, 2020.
[195] Kenan Emir Ak, Joo Hwee Lim, Jo Yew Tham, and Ashraf Kassim, “Semantically consistent hierarchical text to fashion image synthesis with an enhanced-attentional generative adversarial network,” in Proceedings of the IEEE International Conference on Computer Vision Workshops, 2019.
[196] Zhong Meng, Jinyu Li, Yifan Gong, and Biing-Hwang (Fred) Juang, “Cycle-consistent speech enhancement,” INTERSPEECH, 2018.
[197] Masato Mimura, Shinsuke Sakai, and Tatsuya Kawahara, “Cross-domain speech recognition using nonparallel corpora with cycle-consistent adversarial networks,” IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), 2017.
[198] Dongsuk Yook, In-Chul Yoo, and Seungho Yoo, “Voice conversion using conditional CycleGAN,” IEEE, 2018, pp. 1460–1461.
[199] Sicong Huang, Qiyang Li, Cem Anil, Xuchan Bao, Sageev Oore, and Roger B. Grosse, “TimbreTron: A WaveNet(CycleGAN(CQT(Audio))) pipeline for musical timbre transfer,” arXiv preprint arXiv:1811.09620, 2018.
[200] Takuhiro Kaneko and Hirokazu Kameoka, “CycleGAN-VC: Non-parallel voice conversion using cycle-consistent adversarial networks,” in European Signal Processing Conference (EUSIPCO). IEEE, 2018.
[201] Takuhiro Kaneko, Hirokazu Kameoka, Kou Tanaka, and Nobukatsu Hojo, “CycleGAN-VC2: Improved CycleGAN-based non-parallel voice conversion,” in ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2019, pp. 6820–6824.
[202] Yaniv Taigman, Adam Polyak, and Lior Wolf, “Unsupervised cross-domain image generation,” arXiv preprint arXiv:1611.02200, 2016.
[203] Patrick Lumban Tobing, Yi-Chiao Wu, Tomoki Hayashi, Kazuhiro Kobayashi, and Tomoki Toda, “Voice conversion with cyclic recurrent neural network and fine-tuned WaveNet vocoder,” in ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2019, pp. 6815–6819.
[204] Berrak Sisman, Mingyang Zhang, Minghui Dong, and Haizhou Li, “On the study of generative adversarial networks for cross-lingual voice conversion,” in IEEE Automatic Speech Recognition and Understanding Workshop (ASRU). IEEE, 2019, pp. 144–151.
[205] Kun Zhou, Berrak Sisman, and Haizhou Li, “Transforming spectrum and prosody for emotional voice conversion with non-parallel training data,” arXiv preprint arXiv:2002.00198, 2020.
[206] Kun Zhou, Berrak Sisman, Mingyang Zhang, and Haizhou Li, “Converting anyone’s emotion: Towards speaker-independent emotional voice conversion,” arXiv preprint arXiv:2005.07025, 2020.
[207] Cheng-chieh Yeh, Po-chun Hsu, Ju-chieh Chou, Hung-yi Lee, and Lin-shan Lee, “Rhythm-flexible voice conversion without parallel data using cycle-GAN over phoneme posteriorgram sequences,” in IEEE Spoken Language Technology Workshop (SLT). IEEE, 2018, pp. 274–281.
[208] Rui Liu, Berrak Sisman, Feilong Bao, Guanglai Gao, and Haizhou Li, “WaveTTS: Tacotron-based TTS with joint time-frequency domain loss,” arXiv preprint arXiv:2002.00417, 2020.
[209] Wen-Chin Huang, Tomoki Hayashi, Yi-Chiao Wu, Hirokazu Kameoka, and Tomoki Toda, “Voice Transformer Network: Sequence-to-sequence voice conversion using Transformer with text-to-speech pretraining,” arXiv preprint arXiv:1912.06813, 2019.
[210] Jing-Xuan Zhang, Zhen-Hua Ling, Yuan Jiang, Li-Juan Liu, Chen Liang, and Li-Rong Dai, “Improving sequence-to-sequence voice conversion by adding text-supervision,” in ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2019, pp. 6785–6789.
[211] Hieu-Thi Luong and Junichi Yamagishi, “Bootstrapping non-parallel voice conversion from speaker-adaptive text-to-speech,” in IEEE Automatic Speech Recognition and Understanding Workshop (ASRU). IEEE, 2019, pp. 200–207.
[212] Hieu-Thi Luong and Junichi Yamagishi, “NAUTILUS: A versatile voice cloning system,” arXiv preprint arXiv:2005.11004, 2020.
[213] Fadi Biadsy, Ron J. Weiss, Pedro J. Moreno, Dimitri Kanvesky, and Ye Jia, “Parrotron: An end-to-end speech-to-speech conversion model and its applications to hearing-impaired speech and speech separation,” arXiv preprint arXiv:1904.04169, 2019.
[214] Songxiang Liu, Yuewen Cao, and Helen Meng, “Multi-target emotional voice conversion with neural vocoders,” arXiv preprint arXiv:2004.03782, 2020.
[215] Mingyang Zhang, Berrak Sisman, Sai Sirisha Rallabandi, Haizhou Li, and Li Zhao, “Error reduction network for DBLSTM-based voice conversion,” in Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC). IEEE, 2018, pp. 823–828.
[216] Anders Boesen Lindbo Larsen, Søren Kaae Sønderby, Hugo Larochelle, and Ole Winther, “Autoencoding beyond pixels using a learned similarity metric,” in Proceedings of The 33rd International Conference on Machine Learning (PMLR), 2016.
[217] Ju-Chieh Chou, Cheng-Chieh Yeh, and Hung-Yi Lee, “One-shot voice conversion by separating speaker and content representations with instance normalization,” arXiv preprint arXiv:1904.05742, 2019.
[218] Da-Yi Wu and Hung-yi Lee, “One-shot voice conversion by vector quantization,” in ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2020, pp. 7734–7738.
[219] Da-Yi Wu, Yen-Hao Chen, and Hung-Yi Lee, “VQVC+: One-shot voice conversion by vector quantization and U-Net architecture,” arXiv preprint arXiv:2006.04154, 2020.
[220] Diederik P. Kingma and Max Welling, “Auto-encoding variational Bayes,” arXiv preprint arXiv:1312.6114, 2013.
[221] Shaojin Ding and Ricardo Gutierrez-Osuna, “Group latent embedding for vector quantized variational autoencoder in non-parallel voice conversion,” in INTERSPEECH, 2019, pp. 724–728.
[222] Wen-Chin Huang, Hsin-Te Hwang, Yu-Huai Peng, Yu Tsao, and Hsin-Min Wang, “Voice conversion based on cross-domain features using variational auto encoders,” in International Symposium on Chinese Spoken Language Processing (ISCSLP). IEEE, 2018, pp. 51–55.
[223] Yanping Li, Kong Aik Lee, Yougen Yuan, Haizhou Li, and Zhen Yang, “Many-to-many voice conversion based on bottleneck features with variational autoencoder for non-parallel training data,” in Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC). IEEE, 2018, pp. 829–833.
[224] Yuki Saito, Yusuke Ijima, Kyosuke Nishida, and Shinnosuke Takamichi, “Non-parallel voice conversion using variational autoencoders conditioned by phonetic posteriorgrams and d-vectors,” in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2018, pp. 5274–5278.
[225] Wen-Chin Huang, Hao Luo, Hsin-Te Hwang, Chen-Chou Lo, Yu-Huai Peng, Yu Tsao, and Hsin-Min Wang, “Unsupervised representation disentanglement using cross domain features and adversarial learning in variational autoencoder based voice conversion,” IEEE Transactions on Emerging Topics in Computational Intelligence, pp. 1–12, 2020.
[226] Songxiang Liu, Yuewen Cao, Shiyin Kang, Na Hu, Xunying Liu, Dan Su, Dong Yu, and Helen Meng, “Transferring source style in non-parallel voice conversion,” arXiv preprint arXiv:2005.09178, 2020.
[227] R. Kubichek, “Mel-cepstral distance measure for objective speech quality assessment,” in Communications, Computers and Signal Processing, 1993, pp. 125–128.
[228] Jacob Benesty, Jingdong Chen, Yiteng Huang, and Israel Cohen, “Pearson correlation coefficient,” in Noise Reduction in Speech Processing, pp. 1–4. Springer, 2009.
[229] Tianfeng Chai and Roland R. Draxler, “Root mean square error (RMSE) or mean absolute error (MAE)? Arguments against avoiding RMSE in the literature,” Geoscientific Model Development, vol. 7, no. 3, pp. 1247–1250, 2014.
[230] Cort J. Willmott and Kenji Matsuura, “Advantages of the mean absolute error (MAE) over the root mean square error (RMSE) in assessing average model performance,” Climate Research, vol. 30, no. 1, pp. 79–82, 2005.
[231] Volodya Grancharov and W. Bastiaan Kleijn, “Speech quality assessment,” in Springer Handbook of Speech Processing, pp. 83–100. Springer, 2008.
[232] Robert C. Streijl, Stefan Winkler, and David S. Hands, “Mean opinion score (MOS) revisited: Methods and applications, limitations and alternatives,” Multimedia Systems, vol. 22, no. 2, pp. 213–227, 2016.
[233] Min Chu, Hu Peng, and Yong Zhao, “Optimization of an objective measure for estimating mean opinion score of synthesized speech,” US Patent 7,386,451, June 10, 2008.
[234] Mahesh Viswanathan and Madhubalan Viswanathan, “Measuring speech quality for text-to-speech systems: Development and assessment of a modified mean opinion score (MOS) scale,” Computer Speech & Language, vol. 19, no. 1, pp. 55–83, 2005.
[235] Alexander Kain and Michael W. Macon, “Design and evaluation of a voice conversion algorithm based on spectral envelope mapping and residual prediction,” in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP). IEEE, 2001, vol. 2, pp. 813–816.
[236] Terry N. Flynn and Anthony A. J. Marley, “Best worst scaling: Theory and methods,” in Handbook of Choice Modelling, pp. 178–201. Edward Elgar Publishing, 2014.
[237] Tomoki Toda, Ling-Hui Chen, Daisuke Saito, Fernando Villavicencio, Mirjam Wester, Zhizheng Wu, and Junichi Yamagishi, “The Voice Conversion Challenge 2016,” in INTERSPEECH, 2016, pp. 1632–1636.
[238] Mingyang Zhang, Berrak Sisman, Li Zhao, and Haizhou Li, “DeepConversion: Voice conversion with limited parallel training data,” Speech Communication, 2020.
[239] Jiahao Lai, Bo Chen, Tian Tan, Sibo Tong, and Kai Yu, “Phone-aware LSTM-RNN for voice conversion,” IEEE, 2016, pp. 177–182.
[240] Alan W. Black, H. Timothy Bunnell, Ying Dou, Prasanna Kumar Muthukumar, Florian Metze, Daniel Perry, Tim Polzehl, Kishore Prahallad, Stefan Steidl, and Callie Vaughn, “Articulatory features for expressive speech synthesis,” in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2012, pp. 4005–4008.
[241] Beth Logan et al., “Mel frequency cepstral coefficients for music modeling,” in ISMIR, 2000, vol. 270, pp. 1–11.
[242] Chitralekha Gupta, Haizhou Li, and Ye Wang, “Perceptual evaluation of singing quality,” in Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC), 2017, pp. 577–586.
[243] Wei Chu and Abeer Alwan, “Reducing F0 frame error of F0 tracking algorithms under noisy conditions with an unvoiced/voiced classification frontend,” in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2009, pp. 3969–3972.
[244] Tomohiro Nakatani, Shigeaki Amano, Toshio Irino, Kentaro Ishizuka, and Tadahisa Kondo, “A method for fundamental frequency estimation and voicing decision: Application to infant utterances recorded in real acoustical environments,” Speech Communication, vol. 50, no. 3, pp. 203–214, 2008.
[245] RJ Skerry-Ryan, Eric Battenberg, Ying Xiao, Yuxuan Wang, Daisy Stanton, Joel Shor, Ron J. Weiss, Rob Clark, and Rif A. Saurous, “Towards end-to-end prosody transfer for expressive speech synthesis with Tacotron,” arXiv preprint arXiv:1803.09047, 2018.
[246] Berrak Sisman, Grandee Lee, Haizhou Li, and Kay Chen Tan, “On the analysis and evaluation of prosody conversion techniques,” IEEE, 2017, pp. 44–47.
[247] Tomomi Watanabe, Takahiro Murakami, Munehiro Namba, Tetsuya Hoya, and Yoshihisa Ishida, “Transformation of spectral envelope for voice conversion based on radial basis function networks,” in Seventh International Conference on Spoken Language Processing, 2002.
[248] Kazuhiro Kobayashi, Shinnosuke Takamichi, Satoshi Nakamura, and Tomoki Toda, “The NU-NAIST voice conversion system for the Voice Conversion Challenge 2016,” in Interspeech, 2016, pp. 1667–1671.
[249] B. Ramani, M. P. Actlin Jeeva, P. Vijayalakshmi, and T. Nagarajan, “Cross-lingual voice conversion-based polyglot speech synthesizer for Indian languages,” in Fifteenth Annual Conference of the International Speech Communication Association, 2014.
[250] Oytun Turk and Levent M. Arslan, “Robust processing techniques for voice conversion,” Computer Speech & Language, vol. 20, no. 4, pp. 441–467, 2006.
[251] Srinivas Desai, Alan W. Black, B. Yegnanarayana, and Kishore Prahallad, “Spectral mapping using artificial neural networks for voice conversion,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 18, no. 5, pp. 954–964, 2010.
[252] Masatsune Tamura, Takashi Masuko, Keiichi Tokuda, and Takao Kobayashi, “Speaker adaptation for HMM-based speech synthesis system using MLLR,” in The Third ESCA/COCOSDA Workshop (ETRW) on Speech Synthesis, 1998.
[253] Volodya Grancharov, David Yuheng Zhao, Jonas Lindblom, and W. Bastiaan Kleijn, “Low-complexity, nonintrusive speech quality assessment,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 14, no. 6, pp. 1948–1956, 2006.
[254] Mirjam Wester, Cassia Valentini-Botinhao, and Gustav Eje Henter, “Are we using enough listeners? No! An empirically-supported critique of Interspeech 2014 TTS evaluations,” in Sixteenth Annual Conference of the International Speech Communication Association, 2015.
[255] Slawomir Zielinski, Philip Hardisty, Christopher Hummersone, and Francis Rumsey, “Potential biases in MUSHRA listening tests,” in Audio Engineering Society Convention 123. Audio Engineering Society, 2007.
[256] Hadas Benisty and David Malah, “Voice conversion using GMM with enhanced global variance,” in Twelfth Annual Conference of the International Speech Communication Association, 2011.
[257] Jakub Vít, Zdeněk Hanzlíček, and Jindřich Matoušek, “On the analysis of training data for WaveNet-based speech synthesis,” in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2018, pp. 5684–5688.
[258] Meng Zhang, Jianhua Tao, Jilei Tian, and Xia Wang, “Text-independent voice conversion based on state mapped codebook,” in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2008, pp. 4605–4608.
[259] ITU-R Recommendation BS.1534-1, “Method for the subjective assessment of intermediate sound quality (MUSHRA),” International Telecommunications Union, Geneva, Switzerland, 2001.
[260] Antony W. Rix, John G. Beerends, Michael P. Hollier, and Andries P. Hekstra, “Perceptual evaluation of speech quality (PESQ) - a new method for speech quality assessment of telephone networks and codecs,” in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP). IEEE, 2001, vol. 2, pp. 749–752.
[261] Szu-Wei Fu, Yu Tsao, Hsin-Te Hwang, and Hsin-Min Wang, “Quality-Net: An end-to-end non-intrusive speech quality assessment model based on BLSTM,” arXiv preprint arXiv:1808.05344, 2018.
[262] Takenori Yoshimura, Gustav Eje Henter, Oliver Watts, Mirjam Wester, Junichi Yamagishi, and Keiichi Tokuda, “A hierarchical predictor of synthetic speech naturalness using neural networks,” in INTERSPEECH, 2016, pp. 342–346.
[263] Brian Patton, Yannis Agiomyrgiannakis, Michael Terry, Kevin Wilson, Rif A. Saurous, and D. Sculley, “AutoMOS: Learning a non-intrusive assessor of naturalness-of-speech,” arXiv preprint arXiv:1611.09207, 2016.
[264] Milos Cernak and Milan Rusko, “An evaluation of synthetic speech using the PESQ measure,” in Proc. European Congress on Acoustics, 2005, pp. 2725–2728.
[265] Dong-Yan Huang, “Prediction of perceived sound quality of synthetic speech,” Proc. APSIPA, 2011.
[266] Ulpu Remes, Reima Karhila, and Mikko Kurimo, “Objective evaluation measures for speaker-adaptive HMM-TTS systems,” in Eighth ISCA Workshop on Speech Synthesis, 2013.
[267] Chen-Chou Lo, Szu-Wei Fu, Wen-Chin Huang, Xin Wang, Junichi Yamagishi, Yu Tsao, and Hsin-Min Wang, “MOSNet: Deep learning-based objective assessment for voice conversion,” arXiv preprint arXiv:1904.08352, 2019.
[268] Jennifer Williams, Joanna Rownicka, Pilar Oplustil, and Simon King, “Comparison of speech representations for automatic quality estimation in multi-speaker text-to-speech synthesis,” arXiv preprint arXiv:2002.12645, 2020.
[269] Tomoki Toda, Ling-Hui Chen, Daisuke Saito, Fernando Villavicencio, Mirjam Wester, Zhizheng Wu, and Junichi Yamagishi, “The Voice Conversion Challenge 2016,” in Interspeech 2016, 2016, pp. 1632–1636.
[270] Jaime Lorenzo-Trueba, Junichi Yamagishi, Tomoki Toda, Daisuke Saito, Fernando Villavicencio, Tomi Kinnunen, and Zhenhua Ling, “The Voice Conversion Challenge 2018: Promoting development of parallel and nonparallel methods,” in Proc. Odyssey 2018 The Speaker and Language Recognition Workshop, 2018, pp. 195–202.
[271] Zhizheng Wu, Nicholas Evans, Tomi Kinnunen, Junichi Yamagishi, Federico Alegre, and Haizhou Li, “Spoofing and countermeasures for speaker verification: A survey,” Speech Communication, vol. 66, pp. 130–153, 2015.
[272] Mirjam Wester, Zhizheng Wu, and Junichi Yamagishi, “Analysis of the Voice Conversion Challenge 2016 evaluation results,” in Interspeech 2016, 2016, pp. 1637–1641.
[273] Kazuhiro Kobayashi, Shinnosuke Takamichi, Satoshi Nakamura, and Tomoki Toda, “The NU-NAIST voice conversion system for the Voice Conversion Challenge 2016,” in Interspeech 2016, 2016, pp. 1667–1671.
[274] Yi-Chiao Wu, Patrick Lumban Tobing, Tomoki Hayashi, Kazuhiro Kobayashi, and Tomoki Toda, “The NU non-parallel voice conversion system for the Voice Conversion Challenge 2018,” in Proc. Odyssey 2018 The Speaker and Language Recognition Workshop, 2018, pp. 211–218.
[275] Li-Juan Liu, Zhen-Hua Ling, Yuan Jiang, Ming Zhou, and Li-Rong Dai, “WaveNet vocoder with limited training data for voice conversion,” in Proc. Interspeech 2018, 2018, pp. 1983–1987.
[276] J. Zhang, Z. Ling, L. Liu, Y. Jiang, and L. Dai, “Sequence-to-sequence acoustic modeling for voice conversion,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 27, no. 3, pp. 631–644, 2019.
[277] J. Zhang, Z. Ling, and L. Dai, “Non-parallel sequence-to-sequence voice conversion with disentangled linguistic and speaker representations,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 28, pp. 540–552, 2020.
[278] Zhizheng Wu, Tomi Kinnunen, Nicholas Evans, Junichi Yamagishi, Cemal Hanilçi, Md. Sahidullah, and Aleksandr Sizov, “ASVspoof 2015: The first automatic speaker verification spoofing and countermeasures challenge,” in Proc. Interspeech, 2015, pp. 2037–2041.
[279] Z. Wu, J. Yamagishi, T. Kinnunen, C. Hanilçi, M. Sahidullah, A. Sizov, N. Evans, M. Todisco, and H. Delgado, “ASVspoof: The automatic speaker verification spoofing and countermeasures challenge,” IEEE Journal of Selected Topics in Signal Processing, vol. 11, no. 4, pp. 588–604, 2017.
[280] Tomi Kinnunen, Md. Sahidullah, Héctor Delgado, Massimiliano Todisco, Nicholas Evans, Junichi Yamagishi, and Kong-Aik Lee, “The ASVspoof 2017 challenge: Assessing the limits of replay spoofing attack detection,” in Proc. Interspeech, 2017, pp. 2–6.
[281] Massimiliano Todisco, Xin Wang, Ville Vestman, Md. Sahidullah, Héctor Delgado, Andreas Nautsch, Junichi Yamagishi, Nicholas Evans, Tomi H. Kinnunen, and Kong Aik Lee, “ASVspoof 2019: Future horizons in spoofed and fake audio detection,” in Proc. Interspeech, 2019, pp. 1008–1012.
[282] Xin Wang, Junichi Yamagishi, Massimiliano Todisco, Hector Delgado, Andreas Nautsch, Nicholas Evans, Md Sahidullah, Ville Vestman, Tomi Kinnunen, Kong Aik Lee, Lauri Juvela, Paavo Alku, Yu-Huai Peng, Hsin-Te Hwang, Yu Tsao, Hsin-Min Wang, Sebastien Le Maguer, Markus Becker, Fergus Henderson, Rob Clark, Yu Zhang, Quan Wang, Ye Jia, Kai Onuma, Koji Mushika, Takashi Kaneda, Yuan Jiang, Li-Juan Liu, Yi-Chiao Wu, Wen-Chin Huang, Tomoki Toda, Kou Tanaka, Hirokazu Kameoka, Ingmar Steiner, Driss Matrouf, Jean-Francois Bonastre, Avashna Govender, Srikanth Ronanki, Jing-Xuan Zhang, and Zhen-Hua Ling, “ASVspoof 2019: A large-scale public database of synthetic, converted and replayed speech,” 2019.
[283] John Kominek and Alan W. Black, “The CMU Arctic speech databases,” in Fifth ISCA Workshop on Speech Synthesis, 2004.
[284] Christophe Veaux, Junichi Yamagishi, Kirsten MacDonald, et al., “CSTR VCTK Corpus: English multi-speaker corpus for CSTR voice cloning toolkit,” 2016.
[285] Heiga Zen, Viet Dang, Rob Clark, Yu Zhang, Ron J. Weiss, Ye Jia, Zhifeng Chen, and Yonghui Wu, “LibriTTS: A corpus derived from LibriSpeech for text-to-speech,” in Proc. Interspeech 2019, 2019, pp. 1526–1530.
[286] Arsha Nagrani, Joon Son Chung, Weidi Xie, and Andrew Zisserman, “VoxCeleb: Large-scale speaker verification in the wild,” Computer Speech & Language, vol. 60, p. 101027, 2020.
[287] Kazuhiro Kobayashi and Tomoki Toda, “sprocket: Open-source voice conversion software,” in Proc. Odyssey 2018 The Speaker and Language Recognition Workshop, 2018, pp. 203–210.
[288] Shinji Watanabe, Takaaki Hori, Shigeki Karita, Tomoki Hayashi, Jiro Nishitoba, Yuya Unno, Nelson Enrique Yalta Soplin, Jahn Heymann, Matthew Wiesner, Nanxin Chen, Adithya Renduchintala, and Tsubasa Ochiai, “ESPnet: End-to-end speech processing toolkit,” in Proc. Interspeech 2018, 2018, pp. 2207–2211.
Berrak Sisman received her Ph.D. degree in Electrical and Computer Engineering from the National University of Singapore in 2020, fully funded by A*STAR Graduate Academy under the Singapore International Graduate Award (SINGA). She is currently an Assistant Professor at the Singapore University of Technology and Design (SUTD). She is also an Affiliated Researcher at the National University of Singapore (NUS). Prior to joining SUTD, she was a Postdoctoral Research Fellow at the National University of Singapore. She was also an exchange Ph.D. student at the University of Edinburgh and a visiting scholar at The Centre for Speech Technology Research, University of Edinburgh, in 2019. She was attached to the RIKEN Advanced Intelligence Project, Japan, in 2018. Her research interests include speech information processing, machine learning, speech synthesis and voice conversion. She has published in leading journals and conferences, including the IEEE/ACM Transactions on Audio, Speech and Language Processing, ASRU, INTERSPEECH and ICASSP. She has served as the Local Arrangement Co-chair of IEEE ASRU 2019, Chair of the Young Female Researchers Mentoring @ASRU2019, and Chair of the INTERSPEECH Student Events in 2018 and 2019.
Junichi Yamagishi received the Ph.D. degree from the Tokyo Institute of Technology (Tokyo Tech), Tokyo, Japan, in 2006. He is currently a Professor with the National Institute of Informatics, Tokyo, Japan, and also a Senior Research Fellow with The Centre for Speech Technology Research, The University of Edinburgh, Edinburgh, UK. Since 2006, he has authored or co-authored over 250 refereed papers in international journals and conferences.
Prof. Yamagishi was a recipient of the Tejima Prize for the best Ph.D. thesis of Tokyo Tech in 2007. He received the Itakura Prize from the Acoustical Society of Japan in 2010, the Kiyasu Special Industrial Achievement Award from the Information Processing Society of Japan in 2013, the Young Scientists' Prize from the Minister of Education, Science and Technology in 2014, the JSPS Prize from the Japan Society for the Promotion of Science in 2016, and the 17th DOCOMO Mobile Science Award from the Mobile Communication Fund, Japan, in 2018. He was one of the organizers of the special sessions on Spoofing and Countermeasures for Automatic Speaker Verification at INTERSPEECH 2013, the 1st/2nd/3rd ASVspoof Evaluations, the Voice Conversion Challenge 2016/2018/2020, and the VoicePrivacy Challenge 2020. He was an Associate Editor of the IEEE/ACM Transactions on Audio, Speech, and Language Processing, a Lead Guest Editor of the IEEE Journal of Selected Topics in Signal Processing Special Issue on Spoofing and Countermeasures for Automatic Speaker Verification, and a member of the IEEE Signal Processing Society Speech and Language Technical Committee. He is now the Chairperson of the ISCA Special Interest Group: Speech Synthesis (SynSig), a member of the Technical Committee for the Asia-Pacific Signal and Information Processing Association Multimedia Security and Forensics, and an IEEE Senior Area Editor of the IEEE/ACM Transactions on Audio, Speech, and Language Processing.
Simon King (M'95–SM'08–F'15) received the M.A. (Cantab) and M.Phil. degrees from the University of Cambridge, Cambridge, U.K., and the Ph.D. degree from the University of Edinburgh, Edinburgh, U.K. He has been with the Centre for Speech Technology Research, University of Edinburgh, since 1993, where he is now Professor of Speech Processing and the Director of the Centre. His research interests include speech synthesis, recognition and signal processing, and he has around 230 publications across these areas. He has served on the ISCA SynSIG Board and currently co-organises the Blizzard Challenge. He has previously served on the IEEE SLTC and as an Associate Editor of the IEEE Transactions on Audio, Speech, and Language Processing, and is currently an Associate Editor of Computer Speech and Language.