CinC-GAN for Effective F0 Prediction for Whisper-to-Normal Speech Conversion

Maitreya Patel, Mirali Purohit, Jui Shah, and Hemant A. Patil
Speech Research Lab, DA-IICT, Gandhinagar-382007, India.
E-mail: {maitreya patel, purohit mirali, jui shah, hemant patil}@daiict.ac.in

Abstract—Recently, Generative Adversarial Network (GAN)-based methods have shown remarkable performance for Voice Conversion and WHiSPer-to-normal SPeeCH (WHSP2SPCH) conversion. One of the key challenges in WHSP2SPCH conversion is the prediction of the fundamental frequency (F0). Recently, authors have proposed the state-of-the-art Cycle-Consistent Generative Adversarial Network (CycleGAN) method for WHSP2SPCH conversion. The CycleGAN-based method uses two different models, one for Mel Cepstral Coefficient (MCC) mapping and another for F0 prediction, where the F0 model is highly dependent on the pre-trained MCC mapping model. This introduces additional non-linear noise in the predicted F0. To suppress this noise, we propose the Cycle-in-Cycle GAN (i.e., CinC-GAN). It is specially designed to increase the effectiveness of F0 prediction without losing the accuracy of MCC mapping. We evaluated the proposed method in a non-parallel setting and analyzed it on speaker-specific and gender-specific tasks. Objective and subjective tests show that CinC-GAN significantly outperforms CycleGAN. In addition, we analyze CycleGAN and CinC-GAN for unseen speakers, and the results show the clear superiority of CinC-GAN.

Index Terms—Whisper-to-Normal Speech, Non-parallel, F0 prediction, CycleGAN, CinC-GAN.

I. INTRODUCTION
Whisper and normal speech are different modes of communication. People generally use the normal mode of speech in regular life; however, in some cases people need to keep their conversation private, such as during phone calls in public places, in meetings, libraries, hospitals, etc., where they adopt the whisper mode of conversation [1]. Whisper and normal speech are cross-domain entities, as they differ in terms of speech production and perception [1]–[3]. Whether a given speech signal is normal or not depends on the arrangement of the larynx, and particularly of the glottis [4]–[7]. Sometimes, because of an accident or disease, people are unable to produce normal speech, because the organs that take part in speech production are affected; losing the normal way of speaking significantly affects a person's life. When people speak in the normal style, the vocal folds vibrate with some specific fundamental frequency (i.e., F0), while this is not the case in whispered speech [1], [8]. In addition, current speech processing systems do not perform efficiently on any kind of speech except normal speech. Therefore, the WHSP2SPCH conversion task is necessary.

One of the challenging problems in WHSP2SPCH conversion is F0 prediction, as F0 is encapsulated in an intricate way in the whispered speech. The presence or absence of F0 is the key difference between normal vs. whispered speech [9]–[11]. At the acoustic level, there is a difference between voiced and unvoiced speech, and statistical voice conversion (VC)-based methods are able to perform such conversion [12]. Attempts have been made in the literature for VC, such as GMM, Conditional Variational AutoEncoders (CVAE), CycleGAN-VC, etc. [13]–[18]. For WHSP2SPCH conversion, attempts have been made in the literature using parallel data only; methods such as LSTM, MSpeC-Net, DiscoGAN, CycleGAN, etc.,
are proposed in the literature [2], [9], [12], [19]–[24]. Moreover, CycleGAN has shown state-of-the-art results for WHSP2SPCH conversion, including F0 prediction, on parallel data, which relies on the availability of a particular speaker's whisper and normal speech [25]. However, this is not always feasible, and it is impractical too. Moreover, parallel data requires time-alignment as a pre-processing step. In addition, the traditional method uses a 2-step sequential approach for WHSP2SPCH conversion [25], [26]. For CycleGAN-based conversion, in the first step, one CycleGAN is trained for cepstral feature mapping of whisper to normal speech, and in the second step, another CycleGAN is trained for F0 prediction, which heavily relies on the previously trained CycleGAN [25]. Because of the imperfect cepstral feature mapping, noise is introduced in the output; due to the non-linear DNN layers, it is non-linear noise. Therefore, significant non-linear noise is added in F0 prediction.

Although CycleGAN gives the state-of-the-art result, there is still a gap between the original and converted normal speech in terms of naturalness [25]. To reduce this gap and overcome the above limitations, we propose CinC-GAN for the non-parallel WHSP2SPCH conversion task, including F0 prediction in non-parallel mode. CinC-GAN is designed specifically for effective F0 prediction, which is an important factor for naturalness. Here, CinC-GAN uses a joint training methodology, where the acoustic mapping and the F0 prediction are done simultaneously. The objective results show that CinC-GAN is able to suppress the non-linear noise in F0 prediction. The F0-RMSE is decreased by 29.8% and 82.2% compared to the baseline for the speaker-specific and gender-specific tasks, respectively. Subjective evaluation shows that CinC-GAN brings the converted normal speech closer to the original normal speech compared to the baseline (CycleGAN). In the objective and subjective evaluations, the gender-specific task contains analysis on seen and unseen speakers.
In addition, CinC-GAN maintains the naturalness for the gender-specific task (for seen and unseen speakers), whereas CycleGAN degrades its result and produces whisper speech.

II. CONVENTIONAL CYCLEGAN

Let x ∈ R^N and y ∈ R^N be the cepstral features of whisper (X) and normal (Y) speech, respectively, where N is the dimension of a feature vector. In CycleGAN, two generators are used, G_X→Y and G_Y→X, where G_X→Y maps the cepstral features of X to Y, whereas the mapping G_Y→X does the opposite (i.e., Y to X). In addition, we have two discriminators, D_X and D_Y, whose role is to predict whether their input is from the distribution X or Y, respectively.

Fig. 1: Conventional CycleGAN. After [27].

In CycleGAN, there are three types of losses, namely the cycle-consistency loss, the adversarial loss, and the identity loss, as described below.

Adversarial loss:
To make the converted normal speech indistinguishable from the original, we use an adversarial loss. Here, we use the least-squares error loss instead of the traditional binary cross-entropy loss, which is defined as:

L_adv(G_X→Y, D_Y) = E_{y∼P_Y(y)}[(D_Y(y) − 1)^2] + E_{x∼P_X(x)}[(D_Y(G_X→Y(x)))^2].  (1)

Cycle-consistent loss:
The main idea behind this loss is to map the distribution between the original and the reconstructed data. In addition, this loss tries to preserve contextual information across different speech. This loss allows us to do non-parallel WHSP2SPCH conversion. The loss is defined as:

L_cyc(G_X→Y, G_Y→X) = E_{x∼P_X(x)}[||G_Y→X(G_X→Y(x)) − x||_1] + E_{y∼P_Y(y)}[||G_X→Y(G_Y→X(y)) − y||_1].  (2)

Identity-mapping loss:
To encourage preservation of the input linguistic content (as suggested in [27]), an identity loss is used:

L_id(G_X→Y, G_Y→X) = E_{x∼P_X(x)}[||G_Y→X(x) − x||_1] + E_{y∼P_Y(y)}[||G_X→Y(y) − y||_1].  (3)

The total loss function is defined as:

L_full = L_adv(G_X→Y, D_Y) + L_adv(G_Y→X, D_X) + λ_cyc L_cyc(G_X→Y, G_Y→X) + λ_id L_id(G_X→Y, G_Y→X),  (4)

where the values of λ_cyc and λ_id are 10 and 5, respectively. Now, for F0 prediction, we train another CycleGAN architecture, where y′ ∈ R^N is the cepstral feature vector of the converted normal speech, extracted from the previously trained CycleGAN for MCC mapping, and z ∈ R is the F0 of the original normal speech.

III. PROPOSED CINC-GAN
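As a concrete reference for the baseline CycleGAN just described, the losses of Eqs. (1)–(4) can be sketched in NumPy. The generators, discriminators, and feature arrays below are toy placeholders for illustration only, not the trained networks:

```python
import numpy as np

# Toy stand-ins (hypothetical) for the trained DNNs, for illustration only.
G_xy = lambda x: x + 1.0                   # whisper -> normal mapping
G_yx = lambda y: y - 1.0                   # normal -> whisper mapping
D_y = lambda v: np.full(v.shape[0], 0.8)   # discriminator score for Y
D_x = lambda v: np.full(v.shape[0], 0.8)   # discriminator score for X

def adv_loss(D, G, src, real):
    # Least-squares adversarial loss, Eq. (1):
    # E[(D(real) - 1)^2] + E[(D(G(src)))^2]
    return float(np.mean((D(real) - 1.0) ** 2) + np.mean(D(G(src)) ** 2))

def cyc_loss(G_ab, G_ba, a, b):
    # Cycle-consistency loss, Eq. (2): L1 reconstruction in both directions.
    return float(np.mean(np.abs(G_ba(G_ab(a)) - a))
                 + np.mean(np.abs(G_ab(G_ba(b)) - b)))

def id_loss(G_ab, G_ba, a, b):
    # Identity-mapping loss, Eq. (3).
    return float(np.mean(np.abs(G_ba(a) - a)) + np.mean(np.abs(G_ab(b) - b)))

x = np.zeros((5, 40))   # toy whisper MCC frames
y = np.ones((5, 40))    # toy normal MCC frames

# Total objective, Eq. (4), with lambda_cyc = 10 and lambda_id = 5.
l_full = (adv_loss(D_y, G_xy, x, y) + adv_loss(D_x, G_yx, y, x)
          + 10.0 * cyc_loss(G_xy, G_yx, x, y)
          + 5.0 * id_loss(G_xy, G_yx, x, y))
```

With these toy mappings the cycle is exact, so the cycle-consistency term vanishes and only the adversarial and identity terms contribute to the total.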
Problem formulation:
The conventional formulation for WHSP2SPCH conversion is y′ = f(x) + n for cepstral feature mapping, where x denotes the whisper speech features, f is the mapping function, and n is the additive noise. Now, for F0 prediction, we formulate the problem as z = g(y′) + n′, which implies that z = g(f(x) + n) + n′, where g is the mapping function and n′ is another additive noise.

Given this formulation, we observed that, due to the use of two separately trained mapping functions, significant non-linear noise is added to the F0 prediction. Hence, for effective F0 prediction and to suppress this noise, we need a more sophisticated mapping function that can be trained simultaneously and that can also rely directly on the input instead of only on f(x).

Proposed solution:
In this paper, we propose a different training method, namely, the Cycle-in-Cycle GAN (CinC-GAN), which is an advanced version of CycleGAN, for WHSP2SPCH conversion. With CycleGAN, we use one model for acoustic feature mapping and a second for F0 prediction, where both of them are trained separately (i.e., sequential training). In CinC-GAN, however, we use an inner cycle for acoustic feature mapping and an outer cycle for F0 prediction, where the outer cycle relies on the cepstral features of the converted normal speech as well as on the input whisper speech (i.e., joint training). This way, we are able to achieve our goal and suppress the effect of the extra noise.

In summary, we propose a Cycle-in-Cycle GAN as shown in Fig. 2. In this approach, we adopt two coupled CycleGANs to learn the mappings X to Y and Y to Z, respectively. In addition, a non-parallel dataset with x ∈ X, y ∈ Y, and z ∈ Z is used for training, where X and Y are the sets of cepstral features of whisper and normal speech, respectively, and Z is the set of F0 values extracted from the normal speech. A detailed description of feature extraction is given in Section IV.

A. Acoustic feature mapping
The inner cycle in Fig. 2 maps the cepstral features of whisper (X) to normal speech (Y). We use two generators, G_X→Y and G_Y→X, where G_X→Y maps x to Y and G_Y→X maps y to X. The discriminators D_X and D_Y confirm whether the generated distribution is from X and Y or not, respectively. Here, we use an adversarial loss, a cycle-consistency loss, and an identity loss. The adversarial loss is defined as:

L_adv(G_X→Y, D_Y) = E_{y∼P_Y(y)}[(D_Y(y) − 1)^2] + E_{x∼P_X(x)}[(D_Y(G_X→Y(x)))^2].  (5)

To map the two different distributions (i.e., normal and whisper speech), we add the generator G_Y→X to map normal-to-whisper speech features. In addition, we use the discriminator D_X to distinguish between real and generated whisper speech. Therefore, we also use a single cycle-consistency loss, i.e.,

L_cyc(G_X→Y) = E_{x∼P_X(x)}[||G_Y→X(G_X→Y(x)) − x||_1].  (6)

In addition, we use an identity loss to preserve the linguistic content, i.e.,

L_id(G_X→Y, G_Y→X) = E_{x∼P_X(x)}[||G_Y→X(x) − x||_1] + E_{y∼P_Y(y)}[||G_X→Y(y) − y||_1].  (7)

B. F0 Prediction
After mapping the cepstral features of whisper to normal speech, we focus on the F0 prediction task. Previous methods try to predict F0 from the cepstral features of the converted normal speech using a CycleGAN that is trained separately (i.e., sequential training). In this paper, however, we propose to predict F0 from the cepstral features of the converted normal speech simultaneously via joint training.

We use the generator G_Y→Z to predict F0 from the converted normal speech (G_X→Y(x)), and G_Z→X is used to map the predicted F0 to whisper speech instead of normal speech. This way, we are able to remove the non-linear noise by including the effect of the original whisper speech and the joint training methodology. In addition, we use the discriminator D_Z to make the generated F0 resemble the original F0. However, to add the effect of whisper speech, we use a fourth generator to generate whisper speech features from the predicted F0 instead of the converted normal speech features. Here, we adopt only two losses, an adversarial loss and a cycle-consistency loss, i.e.,

L_adv(G_Y→Z, D_Z) = E_{z∼P_Z(z)}[(D_Z(z) − 1)^2] + E_{y∼P_Y(y)}[(D_Z(G_Y→Z(y)))^2].  (8)

L_cyc(G_Y→Z) = E_{y∼P_Y(y)}[||G_Z→Y(G_Y→Z(y)) − y||_1].  (9)

Moreover, we add a combined loss through a third discriminator, D_X. This discriminator confirms whether the outputs of the two generators (G_Y→X and G_Z→X) are from the original distribution X or not. This way, both (inner and outer) cycles stay connected with a common measure of reconstruction:

L_adv(G_Y→X, G_Z→X) = E_{x∼P_X(x)}[(D_X(x) − 1)^2] + E_{x∼P_X(x)}[(D_X(G_Y→X(G_X→Y(x))))^2] + E_{x∼P_X(x)}[(D_X(G_Z→X(G_Y→Z(G_X→Y(x)))))^2].  (10)

C. Overall Objective of the Proposed Method
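Before stating the overall objective, the combined adversarial loss of Eq. (10), which chains the outer cycle through the inner one, can be sketched as follows. All mappings below are hypothetical placeholders standing in for the four generators and the shared discriminator D_X:

```python
import numpy as np

# Hypothetical stand-ins for the generators and the shared discriminator D_X.
G_xy = lambda x: 2.0 * x                         # whisper MCC -> normal MCC
G_yx = lambda y: 0.5 * y                         # normal MCC -> whisper MCC
G_yz = lambda y: y.mean(axis=1, keepdims=True)   # normal MCC -> F0
G_zx = lambda z: np.repeat(z, 40, axis=1)        # F0 -> whisper MCC
D_x = lambda v: np.full(v.shape[0], 0.5)         # realness score

def combined_adv_loss(x):
    # Eq. (10): D_X judges (i) real whisper features, (ii) the inner-cycle
    # reconstruction G_Y->X(G_X->Y(x)), and (iii) the outer-cycle chain
    # G_Z->X(G_Y->Z(G_X->Y(x))), keeping both cycles tied to X.
    y_hat = G_xy(x)
    return float(np.mean((D_x(x) - 1.0) ** 2)
                 + np.mean(D_x(G_yx(y_hat)) ** 2)
                 + np.mean(D_x(G_zx(G_yz(y_hat))) ** 2))

x = np.ones((3, 40))
loss = combined_adv_loss(x)   # each of the three terms contributes 0.25 here
```

The design point this illustrates is that the F0 branch is graded against the whisper distribution X rather than against Y, which is how the outer cycle stays anchored to the original input.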
In summary, we train both cycles simultaneously, and we optimize all the generators and discriminators according to the following rule:

L_full = L_adv(G_X→Y, D_Y) + λ1 L_cyc(G_X→Y) + λ2 L_id(G_X→Y, G_Y→X) + λ3 L_adv(G_Y→Z, D_Z) + λ4 L_cyc(G_Y→Z) + λ5 L_adv(G_Y→X, G_Z→X),  (11)

where λ1, λ2, λ3, λ4, and λ5 are the hyperparameters associated with the different loss functions. These parameters define the relative importance of each loss w.r.t. the others. Here, λ1 = 10, λ2 = 5, λ3 = 10, λ4 = 1, and λ5 = 1 are used empirically in all of our experiments, because this choice of hyperparameters gives stable and accurate training, and these hyperparameter values work for any conversion pair.

Fig. 2: Proposed Cycle-in-Cycle GAN. After [28].

IV. EXPERIMENTAL RESULTS
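For reference, the overall objective of Eq. (11) reduces to a weighted sum of the six component losses. A minimal sketch, assuming the λ's pair with the terms in the order listed (the λ subscripts are garbled in the source, so this pairing is an assumption):

```python
def cinc_gan_objective(l_adv_in, l_cyc_in, l_id, l_adv_out, l_cyc_out,
                       l_adv_comb, lam=(10.0, 5.0, 10.0, 1.0, 1.0)):
    # Eq. (11): the inner-cycle adversarial loss plus five weighted terms.
    # lam holds (lambda1, ..., lambda5); the defaults are the empirically
    # chosen values reported in the paper.
    l1, l2, l3, l4, l5 = lam
    return (l_adv_in + l1 * l_cyc_in + l2 * l_id
            + l3 * l_adv_out + l4 * l_cyc_out + l5 * l_adv_comb)

# With every component loss equal to 1, the total is 1 + 10 + 5 + 10 + 1 + 1.
total = cinc_gan_objective(1, 1, 1, 1, 1, 1)
```

Keeping the weighting in one place like this makes it easy to reproduce the paper's relative emphasis on the inner-cycle reconstruction and identity terms.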
A. Dataset and Feature Extraction
For WHSP2SPCH conversion, we have used the Whispered TIMIT (wTIMIT) database [29]. In both approaches, i.e., speaker-specific and gender-specific, we have performed non-parallel training. The speaker-specific task was done on four different speakers, specifically two female and two male speakers. For each speaker, minutes of training data and . minutes of testing data were used. In each gender-specific task, we used four speakers, and in each training, minutes of training data were used. In the gender-specific task, we test on four seen and two unseen speakers, and the test data for each speaker is . minutes. We extract the F0 and MCC (Mel Cepstral Coefficient) features from whisper and normal speech using AHOCODER [30]. In feature extraction, we used a 25 ms window size and a 5 ms frame shift [30].

B. Architecture Details
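The generator and discriminator configurations detailed in this section can be sketched as plain feed-forward stacks. A minimal NumPy sketch, assuming a single 512-unit hidden layer (the exact number of hidden layers is not stated in the text) and untrained random weights for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

def init_mlp(sizes):
    # He-style random initialisation for a fully connected stack.
    return [(rng.normal(0.0, np.sqrt(2.0 / a), (a, b)), np.zeros(b))
            for a, b in zip(sizes[:-1], sizes[1:])]

def forward(layers, x, sigmoid_out=False):
    # ReLU after every layer; discriminators end in a sigmoid instead.
    for i, (W, b) in enumerate(layers):
        x = x @ W + b
        if sigmoid_out and i == len(layers) - 1:
            x = 1.0 / (1.0 + np.exp(-x))
        else:
            x = np.maximum(x, 0.0)
    return x

G_xy = init_mlp([40, 512, 40])   # MCC -> MCC generator (40-512-40)
G_yz = init_mlp([40, 512, 1])    # MCC -> F0 generator (40-512-1)
D_z = init_mlp([1, 512, 1])      # F0 discriminator (1-512-1, sigmoid output)

mcc = rng.normal(size=(8, 40))                    # a batch of 40-dim MCC frames
f0_hat = forward(G_yz, forward(G_xy, mcc))        # predicted F0 per frame
score = forward(D_z, f0_hat, sigmoid_out=True)    # realness score in (0, 1)
```

A real implementation would of course train these stacks with the losses of Section III; the sketch only pins down the layer sizes and activations described below.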
Generators G_X→Y, G_Y→X, and G_Y→Z follow the same configuration in both architectures. G_X→Y and G_Y→X contain 40, 512, and 40 neurons in the input layer, hidden layers, and output layer, respectively. Generator G_Y→Z contains 40, 512, and 1 neurons in the input layer, hidden layers, and output layer, respectively. G_Z→Y has 1, 512, and 40 neurons in the input layer, hidden layers, and output layer, respectively. All layers are followed by the Rectified Linear Unit (ReLU) activation function. All discriminators follow the same configuration in both architectures. D_X, D_Y, and D′_Y have 40, 512, and 1 neurons in the input layer, hidden layers, and output layer, respectively. D_Z has 1, 512, and 1 neurons in the input layer, hidden layers, and output layer, respectively. In all discriminators, the input layer and all hidden layers are followed by the ReLU activation function, and the output layer is followed by a sigmoid activation function. Both architectures are trained for 100 epochs, and the learning rate was set to . . Source code is provided at https://github.com/Maitreyapatel/speech-conversion-between-different-modalities.

TABLE I: MCD analysis of the different WHSP2SPCH systems for the speaker-specific task. Here, % in the bracket indicates the relative reduction in the MCD w.r.t. the baseline.

Method | F1 (US 102) | M1 (US 103) | F2 (US 104) | M2 (US 106)
CycleGAN (Baseline) | | | |
CinC-GAN | | | |

TABLE II: MCD analysis of the different WHSP2SPCH systems for the gender-specific task. Here, % in the bracket indicates the relative reduction in the MCD w.r.t. the baseline.

Method | F-Seen | M-Seen | F-Unseen | M-Unseen
CycleGAN (Baseline) | | | |
CinC-GAN | | | |

C. Objective Evaluation
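The two objective measures used in this section can be sketched as follows. The MCD formula below follows one common convention, (10/ln 10)·sqrt(2·Σ_d (c_d − c′_d)²) with the 0th coefficient excluded, so the constants are an assumption rather than the exact setup of [31]:

```python
import numpy as np

def mcd(c_ref, c_conv):
    # Mel Cepstral Distortion in dB, averaged over frames; the 0th
    # (energy) coefficient is conventionally excluded.
    diff = c_ref[:, 1:] - c_conv[:, 1:]
    per_frame = (10.0 / np.log(10.0)) * np.sqrt(2.0 * np.sum(diff ** 2, axis=1))
    return float(np.mean(per_frame))

def log_f0_rmse(logf0_ref, logf0_pred, voiced):
    # RMSE of log(F0), computed only over frames voiced in both signals.
    d = logf0_ref[voiced] - logf0_pred[voiced]
    return float(np.sqrt(np.mean(d ** 2)))

# Identical features give zero distortion, and lower is better for both.
ref = np.zeros((4, 41))
assert mcd(ref, ref) == 0.0
```

Restricting the RMSE to mutually voiced frames is the usual practice, since log(F0) is undefined in unvoiced regions.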
We applied Mel Cepstral Distortion (MCD)- and Root Mean Square Error (RMSE) of log(F0)-based objective measures to analyze the effectiveness of the WHSP2SPCH conversion systems [31]. MCD is the distance between the converted and the reference cepstral features; a system having a lower MCD is considered a better system. Similarly, the lower the RMSE of log(F0), the better the system.

The effectiveness of CinC-GAN for the WHSP2SPCH conversion system can be clearly seen in the objective results. The analysis of both architectures is done using two different approaches: 1) speaker-specific, in which the model is trained and tested on a single speaker, and 2) gender-specific, in which the model is trained on a specific number of speakers and tested on seen as well as out-of-the-box (unseen) speakers. As shown in Table I, it can be observed that CinC-GAN performs comparably to CycleGAN in terms of MCD. However, CinC-GAN outperforms CycleGAN in terms of RMSE of log(F0) for all the speakers (as shown in Table III). CinC-GAN obtains on average a 29.8% relative reduction in F0-RMSE in the speaker-specific case compared to CycleGAN. Moreover, Table V shows the Kullback-Leibler Divergence (KLD) and Jensen-Shannon Divergence (JSD) between the predicted F0 and the original F0 for the speaker-specific task. Here, we can observe that CinC-GAN outperforms CycleGAN. Therefore, this analysis further strengthens our results.

TABLE V: Results of KLD and JSD for the speaker-specific task.

Speaker | CinC-GAN KL | CinC-GAN JSD | CycleGAN KL | CycleGAN JSD
US 102 | | | |
US 103 | | | |
US 104 | | | |
US 106 | | | |
Average | 7.51 | 4.71 | |

TABLE III: RMSE-based objective analysis of log(F0) for the speaker-specific task. Here, % in the bracket indicates a relative reduction in the RMSE w.r.t. the baseline.

Method | F1 (US 102) | M1 (US 103) | F2 (US 104) | M2 (US 106)
CycleGAN (Baseline) | | | |
CinC-GAN | | 4.6 ( ) | 2.77 ( ) | 3.25 ( )

TABLE IV: RMSE-based objective analysis of log(F0) for the gender-specific task. Here, % in the bracket indicates a relative reduction in the RMSE w.r.t. the baseline.

Method | F-Seen | M-Seen | F-Unseen | M-Unseen
CycleGAN (Baseline) | | | |
CinC-GAN | | 3.16 ( ) | 3.14 ( ) | 3.8 ( )

We further extended our experiments and performed the objective evaluation for the gender-specific task. For this, we trained two CinC-GANs, the first on 4 female speakers and the second on 4 male speakers. We tested both of them on seen speakers with unseen utterances, as well as on unseen speakers. As shown in Table II, CycleGAN and CinC-GAN perform similarly in terms of MCD. However, in terms of F0-RMSE, CinC-GAN outperforms CycleGAN by 82.1% on average, as shown in Table IV. We observed that CycleGAN is not able to predict F0 effectively on the combined dataset, whereas CinC-GAN works quite efficiently in every scenario, even on unseen speakers and unseen utterances.

D. Subjective Evaluation
Fig. 3: MOS score analysis for the speaker-specific and gender-specific tasks (i.e., seen-unseen) with confidence intervals.

For the subjective analysis, the Mean Opinion Score (MOS) was taken to measure the naturalness of the converted speech. In total, 28 subjects (7 females and 21 males, between 18 to 30 years of age and with no known hearing impairments) took part in the subjective test. Here, we randomly played utterances from both systems. In the MOS test, subjects were asked to rate the played utterances on a scale where the lowest score indicates completely whispered speech and the highest score means speech completely converted to normal speech. We can observe that CinC-GAN obtains noticeably higher naturalness in the speaker-specific task. From Fig. 3, we can observe that CinC-GAN significantly outperforms CycleGAN for seen and unseen (out-of-the-box) speakers, respectively, on the gender-specific task. In addition, in this case, CycleGAN fails and produces whisper speech even for seen and unseen speakers, which can be observed in the MOS plot shown in Fig. 3. However, CinC-GAN maintains its performance for seen and unseen speakers, achieving a high MOS on the gender-specific task for unseen speakers as well. Therefore, CinC-GAN leads to the possibility of few-shot learning for WHSP2SPCH for the first time in the literature.

V. SUMMARY AND CONCLUSION
In this paper, we proposed the CinC-GAN to increase the effectiveness of F0 prediction without affecting the accuracy of MCC mapping. The baseline (i.e., CycleGAN) uses sequential training, which adds non-linear noise in F0 prediction. CinC-GAN instead adopts a joint training methodology to decrease this noise. Objective and subjective results show the superiority of CinC-GAN over the baseline. In addition, CycleGAN fails in WHSP2SPCH conversion for the gender-specific task, whereas CinC-GAN maintains its result even for out-of-the-box speakers. This shows the potential of CinC-GAN for few-shot WHSP2SPCH conversion. In the future, we plan to extend our study to zero-shot and one-shot WHSP2SPCH conversion.

REFERENCES

[1] Chi Zhang and John H. L. Hansen,
"Advancements in whispered speech detection for interactive/speech systems," in Hemant A. Patil et al. (Eds.), Signal and Acoustic Modelling for Speech and Communication Disorders, De Gruyter, vol. 5, pp. 9–32, 2018.
[2] Nirmesh J. Shah, Mihir Parmar, Neil Shah, and Hemant A. Patil, "Novel MMSE DiscoGAN for cross-domain whisper-to-speech conversion," in Machine Learning in Speech and Language Processing (MLSLP) Workshop, Google Office, Hyderabad, India, 2018, pp. 1–3.
[3] Aravind Illa, Prasanta Kumar Ghosh, et al., "A comparative study of acoustic-to-articulatory inversion for neutral and whispered speech," in International Conference on Acoustics, Speech, and Signal Processing (ICASSP), New Orleans, USA, 2017, pp. 5075–5079.
[4] Lesly Wallis, Cristina Jackson-Menaldi, Wayne Holland, and Alvaro Giraldo, "Vocal fold nodule vs. vocal fold polyp: Answer from surgical pathologist and voice pathologist point of view," Journal of Voice, vol. 18, no. 1, pp. 125–129, 2004.
[5] Jacqueline A. Mattiske, Jennifer M. Oates, and Kenneth M. Greenwood, "Vocal problems among teachers: A review of prevalence, causes, prevention, and treatment," Journal of Voice, vol. 12, no. 4, pp. 489–499, 1998.
[6] Lucian Sulica, "Vocal fold paresis: An evolving clinical concept," Current Otorhinolaryngology Reports, vol. 1, no. 3, pp. 158–162, 2013.
[7] Adam D. Rubin and Robert T. Sataloff, "Vocal fold paresis and paralysis," Otolaryngologic Clinics of North America, vol. 40, no. 5, pp. 1109–1131, 2007.
[8] Thomas F. Quatieri, Discrete-Time Speech Signal Processing: Principles and Practice, Pearson Education India, 1st ed., 2006.
[9] Hideaki Konno, Mineichi Kudo, Hideyuki Imai, and Masanori Sugimoto, "Whisper to normal speech conversion using pitch estimated from spectrum," Speech Communication, vol. 83, pp. 10–20, 2016.
[10] Werner Meyer-Eppler, "Realization of prosodic features in whispered speech," The Journal of the Acoustical Society of America (JASA), vol. 29, no. 1, pp. 104–106, 1957.
[11] Taisuke Itoh, Kazuya Takeda, and Fumitada Itakura, "Acoustic analysis and recognition of whispered speech," in Automatic Speech Recognition and Understanding (ASRU), Madonna di Campiglio, Italy, 2001, pp. 429–432.
[12] Tomoki Toda, Mikihiro Nakagiri, and Kiyohiro Shikano, "Statistical voice conversion techniques for body-conducted unvoiced speech enhancement," IEEE Transactions on Audio, Speech, and Language Processing, vol. 20, no. 9, pp. 2505–2517, 2012.
[13] Yannis Stylianou, Olivier Cappé, and Eric Moulines, "Continuous probabilistic transform for voice conversion," IEEE Transactions on Speech and Audio Processing, vol. 6, no. 2, pp. 131–142, 1998.
[14] Diederik P. Kingma and Max Welling, "Auto-encoding variational Bayes," arXiv preprint arXiv:1312.6114, 2013. {Last Accessed: May 01, 2014}.
[15] Ling-Hui Chen, Zhen-Hua Ling, Li-Juan Liu, and Li-Rong Dai, "Voice conversion using deep neural networks with layer-wise generative training," IEEE/ACM Transactions on Audio, Speech and Language Processing (TASLP), vol. 22, no. 12, pp. 1859–1872, 2014.
[16] Takuhiro Kaneko and Hirokazu Kameoka, "CycleGAN-VC: Non-parallel voice conversion using cycle-consistent adversarial networks," in European Signal Processing Conference (EUSIPCO), Rome, Italy, 2018, pp. 2100–2104.
[17] Maitreya Patel, Mihir Parmar, Savan Doshi, Nirmesh J. Shah, and Hemant A. Patil, "Novel adaptive generative adversarial network for voice conversion," in Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA), Lanzhou, China, 2019, pp. 1273–1281.
[18] Takuhiro Kaneko, Hirokazu Kameoka, Kou Tanaka, and Nobukatsu Hojo, "CycleGAN-VC2: Improved CycleGAN-based non-parallel voice conversion," in ICASSP, Brighton, UK, 2019.
[19] Harshit Malaviya, Jui Shah, Maitreya Patel, Jalansh Munshi, and Hemant A. Patil, "MSpeC-Net: Multi-domain speech conversion network," in ICASSP, 2020, pp. 7764–7768.
[20] G. Nisha Meenakshi and Prasanta Kumar Ghosh, "Whispered speech-to-neutral speech conversion using bidirectional LSTMs," in INTERSPEECH, Hyderabad, India, 2018, pp. 491–495.
[21] Matthias Janke, Michael Wand, Till Heistermann, Tanja Schultz, and K. Prahallad, "Fundamental frequency generation for whisper-to-audible speech conversion," in ICASSP, Florence, Italy, 2014, pp. 2579–2583.
[22] Ian Vince McLoughlin, Jingjie Li, and Yan Song, "Reconstruction of continuous voiced speech from whispers," in INTERSPEECH, Lyon, France, 2013, pp. 1022–1026.
[23] Ian V. McLoughlin et al., "Reconstruction of phonated speech from whispers using formant-derived plausible pitch modulation," ACM Transactions on Accessible Computing (TACCESS), vol. 6, no. 4, p. 12, 2015.
[24] Viet-Anh Tran, Gérard Bailly, Hélène Lœvenbruck, and Tomoki Toda, "Multimodal HMM-based NAM-to-speech conversion," in INTERSPEECH, Brighton, United Kingdom (UK), 2009, pp. 656–659.
[25] Mihir Parmar, Savan Doshi, Nirmesh J. Shah, Maitreya Patel, and Hemant A. Patil, "Effectiveness of cross-domain architectures for whisper-to-normal speech conversion," in European Signal Processing Conference (EUSIPCO), Coruña, Spain, 2019.
[26] Maitreya Patel, Mihir Parmar, Savan Doshi, Nirmesh Shah, and Hemant Patil, "Novel Inception-GAN for whispered-to-normal speech conversion," in Proc. 10th ISCA Speech Synthesis Workshop, 2019, pp. 87–92.
[27] Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A. Efros, "Unpaired image-to-image translation using cycle-consistent adversarial networks," in ICCV, Venice, Italy, 2017, pp. 1–18.
[28] Yuan Yuan, Siyuan Liu, Jiawei Zhang, Yongbing Zhang, Chao Dong, and Liang Lin, "Unsupervised image super-resolution using cycle-in-cycle generative adversarial networks," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2018, pp. 701–710.
[29] Boon Pang Lim, Computational Differences between Whispered and Non-whispered Speech, Ph.D. Thesis, University of Illinois at Urbana-Champaign, USA, 2011.
[30] D. Erro, I. Sainz, E. Navas, and I. Hernáez, "Improved HNM-based vocoder for statistical synthesizers," in INTERSPEECH, Florence, Italy, 2011, pp. 1809–1812.
[31] T. Toda, A. W. Black, and K. Tokuda, "Voice conversion based on maximum-likelihood estimation of spectral parameter trajectory,"