HumanACGAN: conditional generative adversarial network with human-based auxiliary classifier and its evaluation in phoneme perception
Yota Ueda, Kazuki Fujii, Yuki Saito, Shinnosuke Takamichi, Yukino Baba, Hiroshi Saruwatari
Graduate School of Information Science and Technology, The University of Tokyo, Japan; National Institute of Technology, Tokuyama College, Japan; Faculty of Engineering, Information and Systems, University of Tsukuba, Japan.
ABSTRACT
We propose a conditional generative adversarial network (GAN) incorporating humans' perceptual evaluations. A deep neural network (DNN)-based generator of a GAN can represent a real-data distribution accurately but can never represent a human-acceptable distribution, i.e., the range of data that humans accept as natural regardless of whether the data are real or not. The HumanGAN was proposed to model the human-acceptable distribution: its DNN-based generator is trained using a human-based discriminator, i.e., humans' perceptual evaluations, instead of the GAN's DNN-based discriminator. However, the HumanGAN cannot represent conditional distributions. This paper proposes the HumanACGAN, a theoretical extension of the HumanGAN, to deal with conditional human-acceptable distributions. Our HumanACGAN trains a DNN-based conditional generator by regarding humans as not only a discriminator but also an auxiliary classifier. The generator is trained by deceiving the human-based discriminator, which scores the unconditioned naturalness, and the human-based classifier, which scores the class-conditioned perceptual acceptability. The training can be executed using the backpropagation algorithm involving humans' perceptual evaluations. Our experimental results in phoneme perception demonstrate that the HumanACGAN can successfully train this conditional generator.
Index Terms — Generative adversarial network, human computation, conditional generator, auxiliary classifier, black-box optimization, speech perception
1. INTRODUCTION
Deep generative models of machine learning have contributed to media research [1, 2, 3]. A generative adversarial network (GAN) [1] is one of the strongest generative models and has been applied to speech modeling [4, 5]. The GAN consists of a pair of deep neural networks (DNNs): a generator and a discriminator. The generator is trained to deceive the discriminator, and the discriminator is trained to distinguish between real and generated data. After iterating these two steps, the trained generator represents a real-data distribution and can randomly generate data that follow that distribution.

The GAN cannot represent the outer side of the real-data distribution. However, humans can accept such outlying media as natural. In speech perception, humans can recognize a voice as a human's even though the voice lies outside the real-data distribution (i.e., actual human voices); for example, this range contains synthesized or processed voices. In this study, we call this data range perceived by humans the human-acceptable distribution. The HumanGAN [6] was proposed to model the human-acceptable distribution by using humans as the discriminator of the GAN. The top of Fig. 1 shows the comparison of a GAN and a HumanGAN.
Fig. 1. Comparison of four GANs. We extend the HumanGAN to conditional modeling in the same way the GAN was extended to the ACGAN. While an ACGAN trains a conditional generator with a DNN-based discriminator and an auxiliary classifier, the HumanACGAN trains one with a human-based discriminator and auxiliary classifier. The trained generator of the HumanACGAN can represent human-acceptable distributions conditioned by input class labels.

The HumanGAN regards humans as a black-boxed system that outputs a difference in posterior probabilities given generated data. The DNN-based generator is trained using the backpropagation algorithm, including the human-based discriminator. The trained generator can represent the human-acceptable distribution. However, the HumanGAN's generator cannot achieve more practical generative modeling such as text-to-speech synthesis [7] and voice conversion [8], because it cannot represent conditional distributions. Because the GAN was extended to the conditional GAN [9, 10], we expect that the HumanGAN can also be extended to conditional modeling. Namely, we train the HumanGAN's generator conditioned on a desired class label so that it represents the class-specific human-acceptable distribution. This will contribute to establishing a DNN-based framework for modeling task-oriented perception by humans [11, 12]. In this paper, we propose the
HumanACGAN, aiming to train a DNN-based conditional generator using humans' perceptual evaluations. To this end, we extend the HumanGAN by introducing an auxiliary classifier GAN (ACGAN) [10], a class-conditional extension of the GAN.
Fig. 1 shows a comparison of four GANs: a GAN, an ACGAN, a HumanGAN, and our HumanACGAN. The ACGAN uses a DNN-based auxiliary classifier, in addition to the discriminator of the GAN, to train a conditional DNN-based generator. Our HumanACGAN replaces both the DNN-based discriminator and the auxiliary classifier with humans. The HumanACGAN's generator is trained using human-perception-based discrimination and classification. This training is operated using an extension of the backpropagation-based algorithm that incorporates humans' perceptual evaluations as a discriminator and an auxiliary classifier. We evaluated the HumanACGAN in phoneme perception, a task to train a DNN-based generator that represents phoneme-specific human-acceptable distributions. The experimental results show that 1) the phoneme-conditioned human-acceptable distributions are wider than the real-data ones and that 2) the HumanACGAN can successfully train a generator that represents conditional human-acceptable distributions.
2. RELATED WORKS

2.1. ACGAN
The ACGAN [10] trains a DNN-based conditional generator that represents real-data distributions conditioned on class labels. To achieve this, the ACGAN uses a DNN-based auxiliary classifier in addition to the DNN-based discriminator of the GAN. The generator $G(\cdot)$ transforms prior noise $z = [z_1, \cdots, z_n, \cdots, z_N]$ into data $\hat{x} = [\hat{x}_1, \cdots, \hat{x}_n, \cdots, \hat{x}_N]$, conditioned on class labels $c = [c_1, \cdots, c_n, \cdots, c_N]$, i.e., $\hat{x}_n = G(z_n, c_n)$. $N$ denotes the number of data. The prior noise follows a known probability distribution, e.g., a uniform distribution. Here, let real data be $x = [x_1, \cdots, x_n, \cdots, x_N]$; the real data have the same corresponding class labels $c$. A discriminator $D_{\mathrm{S}}(\cdot)$ and a classifier $D_{\mathrm{C}}(\cdot)$ are used for the generator training. $D_{\mathrm{S}}(\cdot)$ takes $x_n$ or $\hat{x}_n$ as an input and outputs a posterior probability that the input is real data. $D_{\mathrm{C}}(\cdot)$ takes $x_n$ or $\hat{x}_n$ as an input and outputs a posterior probability that the input belongs to each class. The objective functions for training are formulated as
$$L_{\mathrm{S}} = \sum_{n=1}^{N} \log D_{\mathrm{S}}(x_n) + \sum_{n=1}^{N} \log \left(1 - D_{\mathrm{S}}(G(z_n, c_n))\right), \quad (1)$$
$$L_{\mathrm{C}} = \sum_{n=1}^{N} \log D_{\mathrm{C}}(x_n, c_n) + \sum_{n=1}^{N} \log D_{\mathrm{C}}(G(z_n, c_n), c_n). \quad (2)$$
$L_{\mathrm{S}}$ is the objective function of the GAN, and $L_{\mathrm{C}}$ enables training the conditional generator. The generator is trained to maximize $-L_{\mathrm{S}} + \lambda L_{\mathrm{C}}$, where $\lambda$ is a hyperparameter.

The ACGAN trains the generator using real data, and the generator represents the real-data distribution conditioned on the data class. However, because the human-acceptable distribution is wider than the real-data distribution [6], the ACGAN cannot represent the full range of that distribution.

2.2. HumanGAN

The HumanGAN [6] was proposed to represent the human-acceptable distribution, which is wider than a real-data distribution. A DNN-based unconditional generator of the HumanGAN is trained using humans' perceptual evaluations instead of the DNN-based discriminator of the GAN. The generator $G(\cdot)$ transforms prior noise $z_n$ into data $\hat{x}_n$ unconditionally. A human-based discriminator $D_{\mathrm{S}}(\cdot)$ is defined to deal with humans' perceptual evaluations. $D_{\mathrm{S}}(\cdot)$ takes $\hat{x}_n$ as an input and outputs a posterior probability that the input is perceptually acceptable. The objective function is
$$L_{\mathrm{S}} = \sum_{n=1}^{N} D_{\mathrm{S}}(G(z_n)). \quad (3)$$
The generator is trained to maximize $L_{\mathrm{S}}$. A model parameter $\theta$ of $G(\cdot)$ is iteratively updated as $\theta \leftarrow \theta + \alpha \, \partial L_{\mathrm{S}} / \partial \theta$, where $\alpha$ is the learning coefficient and $\partial L_{\mathrm{S}} / \partial \theta = \partial L_{\mathrm{S}} / \partial \hat{x}_n \cdot \partial \hat{x}_n / \partial \theta$.

Fig. 2. Generator training process of the proposed HumanACGAN. A human observes two perturbed data and evaluates their perceptual difference from two views: naturalness and class acceptability. These give global and class-specific gradients, respectively. The evaluations and perturbations are used for backpropagation to train the generator.

$\partial \hat{x}_n / \partial \theta$ can be estimated analytically, but $\partial L_{\mathrm{S}} / \partial \hat{x}_n$ cannot, because the human-based discriminator $D_{\mathrm{S}}(\cdot)$ is not differentiable. The HumanGAN uses the natural evolution strategy (NES) [13] algorithm to approximate the gradient. A small perturbation $\Delta x_n^{(r)}$ is randomly generated from a multivariate Gaussian distribution $\mathcal{N}(0, \sigma^2 I)$ and added to a generated datum $\hat{x}_n$. $r$ is the perturbation index $(1 \le r \le R)$. $\sigma$ and $I$ are the standard deviation and the identity matrix, respectively. Next, a human observes two perturbed data $\{\hat{x}_n + \Delta x_n^{(r)}, \hat{x}_n - \Delta x_n^{(r)}\}$ and evaluates the difference in their posterior probabilities of naturalness:
$$\Delta D_{\mathrm{S}}(\hat{x}_n^{(r)}) \equiv D_{\mathrm{S}}(\hat{x}_n + \Delta x_n^{(r)}) - D_{\mathrm{S}}(\hat{x}_n - \Delta x_n^{(r)}). \quad (4)$$
$\Delta D_{\mathrm{S}}(\hat{x}_n^{(r)})$ ranges from $-1$ to $1$. For instance, a human will answer $\Delta D_{\mathrm{S}}(\hat{x}_n^{(r)}) = 1$ when he/she perceives that $\hat{x}_n + \Delta x_n^{(r)}$ is substantially more acceptable than $\hat{x}_n - \Delta x_n^{(r)}$. $\partial L_{\mathrm{S}} / \partial \hat{x}_n$ is approximated with [13]
$$\frac{\partial L_{\mathrm{S}}}{\partial \hat{x}_n} = \frac{1}{2 \sigma R} \sum_{r=1}^{R} \Delta D_{\mathrm{S}}\left(\hat{x}_n^{(r)}\right) \cdot \Delta x_n^{(r)}. \quad (5)$$
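The following is a minimal sketch of this NES-based gradient approximation, not the authors' implementation. The `query_difference` callable and the default values of `sigma` and `R` are assumptions; in the actual system the callable would be a crowdsourced pairwise evaluation returning $\Delta D_{\mathrm{S}}(\hat{x}_n^{(r)}) \in [-1, 1]$.

```python
import numpy as np

def nes_gradient(x_hat, query_difference, sigma=0.1, R=10, rng=None):
    """Approximate dL/dx_hat for one generated datum via NES (Eqs. (4)-(5)).

    x_hat: generated datum, shape (dim,).
    query_difference: callable (x_plus, x_minus) -> value in [-1, 1];
        a stand-in for the human-evaluated difference Delta D_S.
    sigma, R: NES standard deviation and number of perturbations (assumed defaults).
    """
    rng = rng or np.random.default_rng()
    grad = np.zeros_like(x_hat, dtype=float)
    for _ in range(R):
        # Delta x_n^(r) ~ N(0, sigma^2 I)
        delta = rng.normal(0.0, sigma, size=x_hat.shape)
        # the human answers the difference in posterior probabilities (Eq. (4))
        d_score = query_difference(x_hat + delta, x_hat - delta)
        grad += d_score * delta
    return grad / (2.0 * sigma * R)  # Eq. (5)
```

The estimated gradient is then multiplied by $\partial \hat{x}_n / \partial \theta$, which the DNN framework computes analytically, to update the generator parameters.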
3. HUMANACGAN

3.1. Training
We propose the HumanACGAN. As shown in
Fig. 2, the HumanACGAN has a conditional generator that represents class-specific human-acceptable distributions, while the HumanGAN only deals with one class. As with the ACGAN, our HumanACGAN consists of a conditional generator, a discriminator, and an auxiliary classifier. The DNN-based generator $G(\cdot)$ is the same as that of the ACGAN; it transforms prior noise $z_n$ into data $\hat{x}_n$ conditioned on a class label $c_n$. The discriminator and the auxiliary classifier are redefined to incorporate humans' perceptual evaluations. The human-based discriminator $D_{\mathrm{S}}(\cdot)$ is the same as that of the HumanGAN; it outputs a posterior probability of global (i.e., class-independent) naturalness. In addition, the human-based auxiliary classifier $D_{\mathrm{C}}(\cdot)$ evaluates class acceptability, i.e., whether or not the data can belong to the class $c_n$. $D_{\mathrm{C}}(\cdot)$ takes $\hat{x}_n$ generated from $G(\cdot)$ and the class label $c_n$ as inputs and outputs a posterior probability that the input is perceptually acceptable as the class $c_n$. The objective functions are
$$L_{\mathrm{S}} = \sum_{n=1}^{N} D_{\mathrm{S}}(G(z_n, c_n)), \quad (6)$$
$$L_{\mathrm{C}} = \sum_{n=1}^{N} D_{\mathrm{C}}(G(z_n, c_n), c_n). \quad (7)$$
A model parameter $\theta$ of $G(\cdot)$ is estimated by maximizing $L_{\mathrm{S}} + \lambda L_{\mathrm{C}}$. $\theta$ is iteratively updated as follows:
$$\theta \leftarrow \theta + \alpha \frac{\partial (L_{\mathrm{S}} + \lambda L_{\mathrm{C}})}{\partial \theta}, \quad (8)$$
$$\frac{\partial (L_{\mathrm{S}} + \lambda L_{\mathrm{C}})}{\partial \theta} = \left(\frac{\partial L_{\mathrm{S}}}{\partial \hat{x}_n} + \lambda \frac{\partial L_{\mathrm{C}}}{\partial \hat{x}_n}\right) \cdot \frac{\partial \hat{x}_n}{\partial \theta}. \quad (9)$$
As in the HumanGAN, $\partial \hat{x}_n / \partial \theta$ can be estimated analytically, but $\partial L_{\mathrm{S}} / \partial \hat{x}_n$ and $\partial L_{\mathrm{C}} / \partial \hat{x}_n$ cannot. We formulate the gradient approximation using the NES algorithm. A human observes two perturbed data $\{\hat{x}_n + \Delta x_n^{(r)}, \hat{x}_n - \Delta x_n^{(r)}\}$ and evaluates two kinds of differences in their posterior probabilities. The first is the same as in the HumanGAN, i.e., the human evaluates "to what degree the inputs are perceptually different in the view of naturalness" and answers $\Delta D_{\mathrm{S}}(\hat{x}_n^{(r)})$ for approximating $\partial L_{\mathrm{S}} / \partial \hat{x}_n$ as in Eq. (4). The second is the class-specific question, "to what degree the inputs are different in the view of class acceptability." The difference in the posterior probability $\Delta D_{\mathrm{C}}(\hat{x}_n^{(r)}, c_n)$ is defined as
$$\Delta D_{\mathrm{C}}(\hat{x}_n^{(r)}, c_n) \equiv D_{\mathrm{C}}\left(\hat{x}_n + \Delta x_n^{(r)}, c_n\right) - D_{\mathrm{C}}\left(\hat{x}_n - \Delta x_n^{(r)}, c_n\right). \quad (10)$$
In the same way as in the HumanGAN, $\partial L_{\mathrm{S}} / \partial \hat{x}_n$ can be approximated using $\Delta D_{\mathrm{S}}(\hat{x}_n^{(r)})$. Unlike $\Delta D_{\mathrm{S}}(\cdot)$, $\Delta D_{\mathrm{C}}(\cdot)$ is a class-specific difference; namely, $\Delta D_{\mathrm{C}}(\hat{x}_n^{(r)}, c_n)$ will be 1 when the human perceives that $\hat{x}_n + \Delta x_n^{(r)}$ is substantially more acceptable than $\hat{x}_n - \Delta x_n^{(r)}$ as the presented class. $\Delta D_{\mathrm{C}}(\hat{x}_n^{(r)}, c_n)$ ranges from $-1$ to $1$, and $\partial L_{\mathrm{C}} / \partial \hat{x}_n$ is approximated with
$$\frac{\partial L_{\mathrm{C}}}{\partial \hat{x}_n} = \frac{1}{2 \sigma R} \sum_{r=1}^{R} \Delta D_{\mathrm{C}}\left(\hat{x}_n^{(r)}, c_n\right) \cdot \Delta x_n^{(r)}. \quad (11)$$
Note that the HumanACGAN does not explicitly involve classification problems. In other words, it only needs to estimate the gradient of the multiclass probability function (i.e., the softmax function used in an ACGAN) as the degree of class-specific acceptability.

The HumanGAN [6] suffers from a mode collapse problem [14], gradient vanishing, and limited scalability to the data size and dimensionality. These problems remain in our HumanACGAN. Therefore, our experiments followed the initialization and data preprocessing of the HumanGAN paper [6]. We first estimated histograms of the posterior probabilities and initialized $\theta$ by referring to the histograms. Also, we reduced the data dimensionality using principal component analysis (PCA).
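As an illustration, a minimal sketch of computing the per-datum upstream gradients for this update is given below. It reuses the `nes_gradient` helper sketched above; the callables `query_naturalness_diff` and `query_class_diff` are hypothetical stand-ins for the crowdsourced evaluations, and `sigma`/`R` defaults are placeholders ($\lambda = 2$ is the value used later in the experiments).

```python
import numpy as np

def humanacgan_upstream_grads(x_hat, labels,
                              query_naturalness_diff, query_class_diff,
                              lam=2.0, sigma=0.1, R=10):
    """Per-datum upstream gradients d(L_S + lam * L_C)/d x_hat (Eqs. (5), (9), (11)).

    x_hat: generated data, shape (N, dim); labels: class label c_n for each datum.
    query_naturalness_diff(x_plus, x_minus) and query_class_diff(x_plus, x_minus, c)
    return human-evaluated differences in [-1, 1].
    """
    upstream = np.zeros_like(x_hat, dtype=float)
    for n, (x_n, c_n) in enumerate(zip(x_hat, labels)):
        grad_s = nes_gradient(x_n, query_naturalness_diff, sigma, R)          # naturalness
        grad_c = nes_gradient(x_n, lambda p, m: query_class_diff(p, m, c_n),
                              sigma, R)                                       # class acceptability
        upstream[n] = grad_s + lam * grad_c
    return upstream
```

The returned array is fed to the generator's backward pass as the upstream gradient, so that the framework's autograd supplies $\partial \hat{x}_n / \partial \theta$ and the parameters are updated as in Eq. (8).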
4. EXPERIMENTAL EVALUATION
This section describes an experimental test of the effectiveness of the HumanACGAN using Japanese phonemes as classes.
The two phonemes we used were the Japanese vowels /i/ and /e/. We basically followed the experimental setup of the HumanGAN paper [6]. The data consisted of female speakers' utterances recorded in the JVPD corpus [15]. Before extracting speech features, the speech waveforms were downsampled and their powers were normalized. 513-dimensional log spectral envelopes, fundamental frequency (F0), and aperiodicities (APs) were extracted frame by frame from the speech waveforms using the WORLD vocoder [16, 17]. We extracted the speech features of the vowels /i/ and /e/ using phoneme alignment obtained by Julius [18]. We applied PCA to the log spectral envelopes and used the first and second principal components. The two-dimensional principal components were normalized to have zero mean and unit variance.
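As a rough illustration of this preprocessing, the sketch below reduces pre-extracted log spectral envelopes to the two-dimensional, standardized representation used in the experiments. The use of scikit-learn, the hypothetical input file, and the array name `log_sp` are assumptions; the feature extraction itself (WORLD analysis and Julius alignment) is omitted.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Pre-extracted 513-dimensional log spectral envelopes of /i/ and /e/ frames
log_sp = np.load("log_spectral_envelopes.npy")  # hypothetical file, shape (num_frames, 513)

pca = PCA(n_components=2)
pc = pca.fit_transform(log_sp)            # first and second principal components
scaler = StandardScaler()
pc_normalized = scaler.fit_transform(pc)  # zero mean, unit variance per dimension

# The generator works in this 2-D space; generated points are later de-normalized
# and mapped back toward spectral envelopes for waveform synthesis, e.g.:
# log_sp_reconstructed = pca.inverse_transform(scaler.inverse_transform(generated_points))
```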
Fig. 3. Color maps representing posterior probabilities of naturalness and class acceptability ("accep.") of /i/ and /e/.

Fig. 4. Generated data (points) and gradients (arrows) estimated by our algorithm. "accep." denotes acceptability. The gradients in the upper-left panel are the weighted sum of the other three gradients. We can see that the gradients for naturalness and class acceptability point toward the darker (i.e., higher-posterior) zones.

The speech waveforms to be evaluated were synthesized using features obtained from the DNN-based conditional generator in the following way. First, the first and second principal components were generated by the generator and de-normalized. For the other features, i.e., the F0 and the APs, we used the averages over all speakers. These corresponded to the speech features of one frame. Next, we copied the features over multiple frames to make the perceptual evaluations easy and synthesized a short speech waveform using the WORLD vocoder.

First, we confirmed that the human-acceptable distributions were wider than the real-data distributions. We carried out two tests: evaluations of humans' tolerance of naturalness and of class acceptability. In this experiment, class acceptability denotes whether the data sounded like the presented phoneme. We split the two-dimensional space into grids and generated a speech waveform for every grid. Then, we presented a speech waveform to a listener, and the listener rated the naturalness on a 5-point scale from 1 (bad) to 5 (excellent). Next, we presented a speech waveform and a phoneme label (/i/ or /e/) to a listener, and the listener rated the class acceptability on the same scale.

Fig. 5. Data generated from the initialized or trained generator. The colors of the data points correspond to those of the color maps for class acceptability in Fig. 3. Based on the posterior probabilities shown in Fig. 3, we can say that the training makes the data move to have higher posterior probabilities.
Fig. 6. Posterior probability of data generated from the initialized ("Init") or trained ("Trained") generators. The boxes indicate the first, second (i.e., median), and third quartiles. The line plot indicates the mean value.

The obtained scores of one through five were mapped to corresponding values of the posterior probability ($D_{\mathrm{S}}(\hat{x}_n)$ or $D_{\mathrm{C}}(\hat{x}_n, c_n)$). We used the Lancers crowdsourcing platform [19] to execute the evaluations. The posterior probabilities were averaged for each grid. Each grid was scored by at least five listeners. The total number of listeners was 105.

Fig. 3 shows the results. As described in Section 4.1, the real data were normalized to have zero mean and unit variance, and the ACGAN represents this range. However, as shown in Fig. 3, the human-acceptable distributions, i.e., the darker zones of the color maps, were wider than the real-data distribution for both naturalness and class acceptability. Therefore, we obtained support for the HumanACGAN being able to represent this distribution, which an ACGAN cannot adequately represent.
Next, we executed the HumanACGAN training and qualitatively evaluated the generated data and the approximated gradients. The prior noise followed a two-dimensional uniform distribution. We randomly generated this prior noise before the training and fixed it during the iterations. Half of the data belonged to the class label /i/, and the remaining data belonged to /e/. The generator was a small feed-forward neural network consisting of a two-unit input layer, sigmoid hidden layers, and a two-unit linear output layer. The output layer was conditioned using the two-dimensional one-hot class-label vector. We performed random initialization of the generator until the following conditions were satisfied: the data generated from the initialized generator should cover ranges of higher posterior probabilities of naturalness and class acceptability, and the data of each class should leak into the other class's distribution so that the ability to classify can be determined. We empirically set the hyperparameter $\lambda = 2$ and used the gradient descent method with a fixed learning rate $\alpha$ for the training. We used Chainer [20] for the implementation. The number of generated data $N$, the number of perturbations $R$, the number of training iterations, and the NES standard deviation $\sigma$ were fixed in advance. We carried out two perceptual evaluations during the HumanACGAN training. First, we presented two speech waveforms to a listener, and the listener rated which one was more natural on a 5-point scale: 1: the first one, 3: equal, and 5: the second one. Next, we presented two speech waveforms and one phoneme label (/i/ or /e/) to a listener, and the listener rated which one sounded more like the presented phoneme on the same scale. The obtained scores of one through five were mapped to corresponding values of the difference in the posterior probabilities ($\Delta D_{\mathrm{S}}(\hat{x}_n^{(r)})$ or $\Delta D_{\mathrm{C}}(\hat{x}_n^{(r)}, c_n)$), from positive (the first waveform preferred) through zero to negative (the second waveform preferred).

Fig. 4 shows the gradient of every data point. The posterior probability of Fig. 3 is also drawn for reference, but note that we never used it during the training. We can say that the gradients from both naturalness and class acceptability point toward the respective darker ranges. This qualitatively indicates that the gradients in Eqs. (5) and (11) were properly estimated.
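A minimal Chainer sketch of the conditional generator described above is shown below, as one possible realization rather than the authors' code. The number and width of the hidden layers and the exact conditioning mechanism (here, concatenating the one-hot class vector to the input of the linear output layer) are assumptions.

```python
import chainer
import chainer.functions as F
import chainer.links as L

class ConditionalGenerator(chainer.Chain):
    """Small feed-forward conditional generator: 2-D prior noise -> 2-D data."""

    def __init__(self, n_hidden=64):  # hidden width and depth are assumed values
        super().__init__()
        with self.init_scope():
            self.l1 = L.Linear(2, n_hidden)
            self.l2 = L.Linear(n_hidden, n_hidden)
            # the output layer is conditioned on the 2-D one-hot class-label vector
            self.out = L.Linear(n_hidden + 2, 2)

    def __call__(self, z, c_onehot):
        h = F.sigmoid(self.l1(z))
        h = F.sigmoid(self.l2(h))
        return self.out(F.concat([h, c_onehot], axis=1))
```

Generated points from this network are perturbed, evaluated by listeners, and updated with the upstream gradients sketched in Section 3.1 through standard Chainer backpropagation.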
Fig. 5 shows the data generated from the initialized and trained generators. On the basis of the posterior probabilities shown in Fig. 3, we can say that the training makes the data move so that they have higher posterior probabilities of both naturalness and class acceptability. This qualitatively indicates that the loss functions in Eqs. (6) and (7) improved the generator so that it represents the conditional human-acceptable distributions.
Finally, we quantitatively verified that the training increases the posterior probabilities of naturalness $D_{\mathrm{S}}(\cdot)$ and class acceptability $D_{\mathrm{C}}(\cdot)$. We prepared two types of data: closed and open. The closed data were generated from the prior noise used during the training, and the open data were generated from newly sampled prior noise that was not used during the training. The posterior probabilities of the closed/open data were scored in the same manner as in Section 4.2. The total number of listeners was 160.
Fig. 6 shows the box plots of the posterior probability. The training iterations increased the posterior probabilities of both naturalness and class acceptability for not only the closed data but also the open data. Therefore, we can say that our training can increase the objective values consisting of posterior probabilities and that the generator of the HumanACGAN can represent conditional human-acceptable distributions.
5. CONCLUSION
We proposed the HumanACGAN, which can conditionally represent humans' perceptually acceptable distributions. We reconfigured the discriminator and the auxiliary classifier of an ACGAN to utilize humans' perceptual evaluations. The DNN-based conditional generator of the HumanACGAN was trained using two kinds of human perceptual evaluations, naturalness and class acceptability, which were used for the discriminator and the auxiliary classifier, respectively. We evaluated the effectiveness of the HumanACGAN through qualitative and quantitative experiments. As future work, we plan to improve the scalability of the HumanACGAN to the data size and dimensionality.
Acknowledgements:
Part of this work was supported by the MIC/SCOPE.

REFERENCES

[1] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, "Generative adversarial nets," in Proc. NIPS, Montreal, Canada, Dec. 2014, pp. 2672–2680.
[2] D. Kingma and M. Welling, "Auto-encoding variational Bayes," arXiv, vol. abs/1312.6114, 2013.
[3] L. Dinh, D. Krueger, and Y. Bengio, "NICE: Non-linear independent components estimation," in Proc. ICLR, San Diego, U.S.A., May 2015.
[4] Y. Saito, S. Takamichi, and H. Saruwatari, "Statistical parametric speech synthesis incorporating generative adversarial networks," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 26, no. 1, pp. 84–96, Jan. 2018.
[5] Y. Hono, K. Hashimoto, K. Oura, Y. Nankaku, and K. Tokuda, "Singing voice synthesis based on generative adversarial networks," in Proc. ICASSP, Brighton, United Kingdom, May 2019, pp. 6955–6959.
[6] K. Fujii, Y. Saito, S. Takamichi, Y. Baba, and H. Saruwatari, "HumanGAN: generative adversarial network with human-based discriminator and its evaluation in speech perception modeling," in Proc. ICASSP, Barcelona, Spain, May 2020, pp. 6239–6243.
[7] Y. Sagisaka, "Speech synthesis by rule using an optimal selection of non-uniform synthesis units," in Proc. ICASSP, New York, U.S.A., Apr. 1988, pp. 679–682.
[8] Y. Stylianou, O. Cappé, and E. Moulines, "Continuous probabilistic transform for voice conversion," IEEE Transactions on Speech and Audio Processing, vol. 6, no. 2, pp. 131–142, Mar. 1998.
[9] M. Mirza and S. Osindero, "Conditional generative adversarial nets," arXiv preprint arXiv:1411.1784, 2014.
[10] A. Odena, C. Olah, and J. Shlens, "Conditional image synthesis with auxiliary classifier GANs," in Proc. ICLR, Vancouver, Canada, Apr. 2018.
[11] C. Chiu, Y. Koyama, Y. Lai, T. Igarashi, and Y. Yue, "Human-in-the-loop differential subspace search in high-dimensional latent space," ACM Transactions on Graphics (TOG), vol. 39, no. 4, article 85, 2020.
[12] J. Peterson, J. Suchow, K. Aghi, A. Ku, and T. Griffiths, "Capturing human category representations by sampling in deep feature spaces," arXiv preprint arXiv:1805.07644, May 2018.
[13] A. Ilyas, L. Engstrom, A. Athalye, and J. Lin, "Black-box adversarial attacks with limited queries and information," in Proc. ICML, Stockholm, Sweden, Jul. 2018, vol. 2, pp. 2137–2146.
[14] I. Goodfellow, "NIPS 2016 tutorial: Generative adversarial networks," in Proc. NIPS, Barcelona, Spain, Dec. 2016.
[15] "Vowel database: Five Japanese vowels of males, females, and children along with relevant physical data (JVPD)," http://research.nii.ac.jp/src/en/JVPD.html.
[16] M. Morise, F. Yokomori, and K. Ozawa, "WORLD: a vocoder-based high-quality speech synthesis system for real-time applications," IEICE Transactions on Information and Systems, vol. E99-D, no. 7, pp. 1877–1884, Jul. 2016.
[17] M. Morise, "D4C, a band-aperiodicity estimator for high-quality speech synthesis," Speech Communication, vol. 84, pp. 57–65, Nov. 2016.
[18] A. Lee, T. Kawahara, and K. Shikano, "Julius — an open source real-time large vocabulary recognition engine," in Proc. EUROSPEECH, Aalborg, Denmark, Sep. 2001, pp. 1691–1694.
[19] "Lancers."
[20] S. Tokui, R. Okuta, T. Akiba, Y. Niitani, T. Ogawa, S. Saito, S. Suzuki, K. Uenishi, B. Vogel, and H. Y. Vincent, "Chainer: A deep learning framework for accelerating the research cycle."