HumanACGAN: conditional generative adversarial network with human-based auxiliary classifier and its evaluation in phoneme perception
Yota Ueda, Kazuki Fujii, Yuki Saito, Shinnosuke Takamichi, Yukino Baba, Hiroshi Saruwatari
Graduate School of Information Science and Technology, The University of Tokyo, Japan; National Institute of Technology, Tokuyama College, Japan; Faculty of Engineering, Information and Systems, University of Tsukuba, Japan.
ABSTRACT
We propose a conditional generative adversarial network (GAN) incorporating humans' perceptual evaluations. A deep neural network (DNN)-based generator of a GAN can represent a real-data distribution accurately but can never represent a human-acceptable distribution, i.e., the range of data that humans accept as natural regardless of whether the data are real or not. The HumanGAN was proposed to model the human-acceptable distribution: its DNN-based generator is trained using a human-based discriminator, i.e., humans' perceptual evaluations, instead of the GAN's DNN-based discriminator. However, the HumanGAN cannot represent conditional distributions. This paper proposes the HumanACGAN, a theoretical extension of the HumanGAN, to deal with conditional human-acceptable distributions. Our HumanACGAN trains a DNN-based conditional generator by regarding humans as not only a discriminator but also an auxiliary classifier. The generator is trained by deceiving the human-based discriminator, which scores the unconditioned naturalness, and the human-based classifier, which scores the class-conditioned perceptual acceptability. The training can be executed using the backpropagation algorithm involving humans' perceptual evaluations. Our experimental results in phoneme perception demonstrate that the HumanACGAN can successfully train this conditional generator.
Index Terms — Generative adversarial network, human computation, conditional generator, auxiliary classifier, black-box optimization, speech perception
1. INTRODUCTION
Deep generative models of machine learning have contributed to media research [1, 2, 3]. A generative adversarial network (GAN) [1] is one of the strongest generative models and has been applied to speech modeling [4, 5]. The GAN consists of a pair of deep neural networks (DNNs): a generator and a discriminator. The generator is trained to deceive the discriminator, and the discriminator is trained to distinguish between real and generated data. After iterating these two steps, the trained generator represents a real-data distribution and can randomly generate data that follow that distribution.

The GAN cannot represent the outer side of the real-data distribution. However, humans can accept such outlying media as natural. In speech perception, humans can recognize a voice as a human's even though the voice lies outside the real-data distribution (i.e., actual human voices); for example, this range contains synthesized or processed voices. In this study, we call this data range perceived by humans the human-acceptable distribution. The HumanGAN [6] was proposed to model the human-acceptable distribution by using humans as the discriminator of the GAN. The top of Fig. 1 shows the comparison of a GAN and a HumanGAN.
Fig. 1. Comparison of four GANs. We extend the HumanGAN to conditional modeling in the same way the GAN was extended to the ACGAN. While an ACGAN trains a conditional generator with a DNN-based discriminator and an auxiliary classifier, the HumanACGAN trains one with a human-based discriminator and auxiliary classifier. The trained generator of the HumanACGAN can represent human-acceptable distributions conditioned by input class labels.

The HumanGAN regards humans as a black-boxed system that outputs a difference in posterior probabilities given generated data. The DNN-based generator is trained using the backpropagation algorithm, including the human-based discriminator. The trained generator can represent the human-acceptable distribution. However, the HumanGAN's generator cannot achieve more practical generative modeling such as text-to-speech synthesis [7] and voice conversion [8], because it cannot represent conditional distributions. Because the GAN was extended to the conditional GAN [9, 10], we expect that the HumanGAN can also be extended to conditional modeling. Namely, we train the HumanGAN's generator conditioned on a desired class label so that it represents the class-specific human-acceptable distribution. This will contribute to establishing a DNN-based framework for modeling task-oriented perception by humans [11, 12]. In this paper, we propose the
HumanACGAN, aiming to train a DNN-based conditional generator using humans' perceptual evaluations. To this end, we extend the HumanGAN by introducing an auxiliary classifier GAN (ACGAN) [10], a class-conditional extension of the GAN.
Fig. 1 shows a comparison of four GANs: a GAN, an ACGAN, a HumanGAN, and our HumanACGAN. The ACGAN uses a DNN-based auxiliary classifier, in addition to the discriminator of the GAN, to train a conditional DNN-based generator. Our HumanACGAN replaces both the DNN-based discriminator and the auxiliary classifier with humans. The HumanACGAN's generator is trained using human-perception-based discrimination and classification. This training is operated using an extension of the backpropagation-based algorithm that incorporates humans' perceptual evaluations as a discriminator and an auxiliary classifier. We evaluated the HumanACGAN in phoneme perception, a task to train a DNN-based generator that represents phoneme-specific human-acceptable distributions. The experimental results show that 1) the phoneme-conditioned human-acceptable distributions are wider than the real-data ones and that 2) the HumanACGAN can successfully train a generator that represents conditional human-acceptable distributions.
2. RELATED WORKS

2.1. ACGAN
The ACGAN [10] trains a DNN-based conditional generator that represents real-data distributions conditioned on class labels. To achieve this, the ACGAN uses a DNN-based auxiliary classifier in addition to the DNN-based discriminator of the GAN. The generator $G(\cdot)$ transforms prior noise $z = [z_1, \cdots, z_n, \cdots, z_N]$ into data $\hat{x} = [\hat{x}_1, \cdots, \hat{x}_n, \cdots, \hat{x}_N]$, conditioned on class labels $c = [c_1, \cdots, c_n, \cdots, c_N]$, i.e., $\hat{x}_n = G(z_n, c_n)$. $N$ denotes the number of data. The prior noise follows a known probability distribution, e.g., a uniform distribution. Here, let real data be $x = [x_1, \cdots, x_n, \cdots, x_N]$; the real data have the same corresponding class labels $c$. A discriminator $D_{\mathrm{S}}(\cdot)$ and a classifier $D_{\mathrm{C}}(\cdot)$ are used for the generator training. $D_{\mathrm{S}}(\cdot)$ takes $x_n$ or $\hat{x}_n$ as an input and outputs a posterior probability that the input is real data. $D_{\mathrm{C}}(\cdot)$ takes $x_n$ or $\hat{x}_n$ as an input and outputs a posterior probability that the input belongs to each class. The objective functions for training are formulated as
$$L_{\mathrm{S}} = \sum_{n=1}^{N} \log D_{\mathrm{S}}(x_n) + \sum_{n=1}^{N} \log \left(1 - D_{\mathrm{S}}(G(z_n, c_n))\right), \quad (1)$$
$$L_{\mathrm{C}} = \sum_{n=1}^{N} \log D_{\mathrm{C}}(x_n, c_n) + \sum_{n=1}^{N} \log D_{\mathrm{C}}(G(z_n, c_n), c_n). \quad (2)$$
$L_{\mathrm{S}}$ is the objective function of the GAN, and $L_{\mathrm{C}}$ enables training the conditional generator. The generator is trained to maximize $-L_{\mathrm{S}} + \lambda L_{\mathrm{C}}$, where $\lambda$ is a hyperparameter.

The ACGAN trains the generator using real data, and the generator represents the real-data distribution conditioned on the data class. However, because the human-acceptable distribution is wider than the real-data distribution [6], the ACGAN cannot represent the full range of that distribution.

2.2. HumanGAN

The HumanGAN [6] was proposed to represent the human-acceptable distribution, which is wider than a real-data distribution. A DNN-based unconditional generator of the HumanGAN is trained using humans' perceptual evaluations instead of the DNN-based discriminator of the GAN. The generator $G(\cdot)$ transforms prior noise $z_n$ into data $\hat{x}_n$ unconditionally. A human-based discriminator $D_{\mathrm{S}}(\cdot)$ is defined to deal with humans' perceptual evaluations. $D_{\mathrm{S}}(\cdot)$ takes $\hat{x}_n$ as an input and outputs a posterior probability that the input is perceptually acceptable. The objective function is
$$L_{\mathrm{S}} = \sum_{n=1}^{N} D_{\mathrm{S}}(G(z_n)). \quad (3)$$
The generator is trained to maximize $L_{\mathrm{S}}$. A model parameter $\theta$ of $G(\cdot)$ is iteratively updated as $\theta \leftarrow \theta + \alpha \, \partial L_{\mathrm{S}} / \partial \theta$, where $\alpha$ is the learning coefficient and $\partial L_{\mathrm{S}} / \partial \theta = \partial L_{\mathrm{S}} / \partial \hat{x}_n \cdot \partial \hat{x}_n / \partial \theta$.

Fig. 2. Generator training process of the proposed HumanACGAN. A human observes two perturbed data and evaluates their perceptual difference from two views: naturalness and class acceptability. These give global and class-specific gradients, respectively. The evaluations and perturbations are used for backpropagation to train the generator.

$\partial \hat{x}_n / \partial \theta$ can be estimated analytically, but $\partial L_{\mathrm{S}} / \partial \hat{x}_n$ cannot, because the human-based discriminator $D_{\mathrm{S}}(\cdot)$ is not differentiable. The HumanGAN uses the natural evolution strategy (NES) [13] algorithm to approximate the gradient. A small perturbation $\Delta x_n^{(r)}$ is randomly generated from a multivariate Gaussian distribution $\mathcal{N}(0, \sigma^2 I)$ and added to a generated datum $\hat{x}_n$. $r$ is the perturbation index $(1 \le r \le R)$. $\sigma$ and $I$ are the standard deviation and the identity matrix, respectively. Next, a human observes two perturbed data $\{\hat{x}_n + \Delta x_n^{(r)}, \hat{x}_n - \Delta x_n^{(r)}\}$ and evaluates the difference in their posterior probabilities of naturalness:
$$\Delta D_{\mathrm{S}}(\hat{x}_n^{(r)}) \equiv D_{\mathrm{S}}(\hat{x}_n + \Delta x_n^{(r)}) - D_{\mathrm{S}}(\hat{x}_n - \Delta x_n^{(r)}). \quad (4)$$
$\Delta D_{\mathrm{S}}(\hat{x}_n^{(r)})$ ranges from $-1$ to $1$. For instance, a human will answer $\Delta D_{\mathrm{S}}(\hat{x}_n^{(r)}) = 1$ when he/she perceives that $\hat{x}_n + \Delta x_n^{(r)}$ is substantially more acceptable than $\hat{x}_n - \Delta x_n^{(r)}$. $\partial L_{\mathrm{S}} / \partial \hat{x}_n$ is approximated with [13]
$$\frac{\partial L_{\mathrm{S}}}{\partial \hat{x}_n} = \frac{1}{2 \sigma R} \sum_{r=1}^{R} \Delta D_{\mathrm{S}}\left(\hat{x}_n^{(r)}\right) \cdot \Delta x_n^{(r)}. \quad (5)$$
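The following is a minimal sketch of this NES-based gradient approximation, not the authors' implementation. The `query_difference` callable and the default values of `sigma` and `R` are assumptions; in the actual system the callable would be a crowdsourced pairwise evaluation returning $\Delta D_{\mathrm{S}}(\hat{x}_n^{(r)}) \in [-1, 1]$.

```python
import numpy as np

def nes_gradient(x_hat, query_difference, sigma=0.1, R=10, rng=None):
    """Approximate dL/dx_hat for one generated datum via NES (Eqs. (4)-(5)).

    x_hat: generated datum, shape (dim,).
    query_difference: callable (x_plus, x_minus) -> value in [-1, 1];
        a stand-in for the human-evaluated difference Delta D_S.
    sigma, R: NES standard deviation and number of perturbations (assumed defaults).
    """
    rng = rng or np.random.default_rng()
    grad = np.zeros_like(x_hat, dtype=float)
    for _ in range(R):
        # Delta x_n^(r) ~ N(0, sigma^2 I)
        delta = rng.normal(0.0, sigma, size=x_hat.shape)
        # the human answers the difference in posterior probabilities (Eq. (4))
        d_score = query_difference(x_hat + delta, x_hat - delta)
        grad += d_score * delta
    return grad / (2.0 * sigma * R)  # Eq. (5)
```

The estimated gradient is then multiplied by $\partial \hat{x}_n / \partial \theta$, which the DNN framework computes analytically, to update the generator parameters.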
3. HUMANACGAN

3.1. Training
We propose the HumanACGAN. As shown in
Fig. 2, the HumanACGAN has a conditional generator that represents class-specific human-acceptable distributions, while the HumanGAN only deals with one class. As with the ACGAN, our HumanACGAN consists of a conditional generator, a discriminator, and an auxiliary classifier. The DNN-based generator $G(\cdot)$ is the same as that of the ACGAN; it transforms prior noise $z_n$ into data $\hat{x}_n$ conditioned on a class label $c_n$. The discriminator and the auxiliary classifier are redefined to incorporate humans' perceptual evaluations. The human-based discriminator $D_{\mathrm{S}}(\cdot)$ is the same as that of the HumanGAN; it outputs a posterior probability of global (i.e., class-independent) naturalness. In addition, the human-based auxiliary classifier $D_{\mathrm{C}}(\cdot)$ evaluates class acceptability, i.e., whether or not the data can belong to the class $c_n$. $D_{\mathrm{C}}(\cdot)$ takes $\hat{x}_n$ generated from $G(\cdot)$ and the class label $c_n$ as inputs and outputs a posterior probability that the input is perceptually acceptable as the class $c_n$. The objective functions are
$$L_{\mathrm{S}} = \sum_{n=1}^{N} D_{\mathrm{S}}(G(z_n, c_n)), \quad (6)$$
$$L_{\mathrm{C}} = \sum_{n=1}^{N} D_{\mathrm{C}}(G(z_n, c_n), c_n). \quad (7)$$
A model parameter $\theta$ of $G(\cdot)$ is estimated by maximizing $L_{\mathrm{S}} + \lambda L_{\mathrm{C}}$. $\theta$ is iteratively updated as follows:
$$\theta \leftarrow \theta + \alpha \frac{\partial (L_{\mathrm{S}} + \lambda L_{\mathrm{C}})}{\partial \theta}, \quad (8)$$
$$\frac{\partial (L_{\mathrm{S}} + \lambda L_{\mathrm{C}})}{\partial \theta} = \left(\frac{\partial L_{\mathrm{S}}}{\partial \hat{x}_n} + \lambda \frac{\partial L_{\mathrm{C}}}{\partial \hat{x}_n}\right) \cdot \frac{\partial \hat{x}_n}{\partial \theta}. \quad (9)$$
As in the HumanGAN, $\partial \hat{x}_n / \partial \theta$ can be estimated analytically, but $\partial L_{\mathrm{S}} / \partial \hat{x}_n$ and $\partial L_{\mathrm{C}} / \partial \hat{x}_n$ cannot. We formulate the gradient approximation using the NES algorithm. A human observes two perturbed data $\{\hat{x}_n + \Delta x_n^{(r)}, \hat{x}_n - \Delta x_n^{(r)}\}$ and evaluates two kinds of differences in their posterior probabilities. The first is the same as in the HumanGAN, i.e., the human evaluates "to what degree the inputs are perceptually different in the view of naturalness" and answers $\Delta D_{\mathrm{S}}(\hat{x}_n^{(r)})$ for approximating $\partial L_{\mathrm{S}} / \partial \hat{x}_n$ as in Eq. (4). The second is the class-specific question, "to what degree the inputs are different in the view of class acceptability." The difference in the posterior probability $\Delta D_{\mathrm{C}}(\hat{x}_n^{(r)}, c_n)$ is defined as
$$\Delta D_{\mathrm{C}}(\hat{x}_n^{(r)}, c_n) \equiv D_{\mathrm{C}}\left(\hat{x}_n + \Delta x_n^{(r)}, c_n\right) - D_{\mathrm{C}}\left(\hat{x}_n - \Delta x_n^{(r)}, c_n\right). \quad (10)$$
In the same way as in the HumanGAN, $\partial L_{\mathrm{S}} / \partial \hat{x}_n$ can be approximated using $\Delta D_{\mathrm{S}}(\hat{x}_n^{(r)})$. Unlike $\Delta D_{\mathrm{S}}(\cdot)$, $\Delta D_{\mathrm{C}}(\cdot)$ is a class-specific difference; namely, $\Delta D_{\mathrm{C}}(\hat{x}_n^{(r)}, c_n)$ will be 1 when the human perceives that $\hat{x}_n + \Delta x_n^{(r)}$ is substantially more acceptable than $\hat{x}_n - \Delta x_n^{(r)}$ as the presented class. $\Delta D_{\mathrm{C}}(\hat{x}_n^{(r)}, c_n)$ ranges from $-1$ to $1$, and $\partial L_{\mathrm{C}} / \partial \hat{x}_n$ is approximated with
$$\frac{\partial L_{\mathrm{C}}}{\partial \hat{x}_n} = \frac{1}{2 \sigma R} \sum_{r=1}^{R} \Delta D_{\mathrm{C}}\left(\hat{x}_n^{(r)}, c_n\right) \cdot \Delta x_n^{(r)}. \quad (11)$$
Note that the HumanACGAN does not explicitly involve classification problems. In other words, it only needs to estimate the gradient of the multiclass probability function (i.e., the softmax function used in an ACGAN) as the degree of class-specific acceptability.

The HumanGAN [6] suffers from a mode collapse problem [14], gradient vanishing, and limited scalability to the data size and dimensionality. These problems remain in our HumanACGAN. Therefore, our experiments followed the initialization and data preprocessing of the HumanGAN paper [6]. We first estimated histograms of the posterior probabilities and initialized $\theta$ by referring to the histograms. Also, we reduced the data dimensionality using principal component analysis (PCA).
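As an illustration, a minimal sketch of computing the per-datum upstream gradients for this update is given below. It reuses the `nes_gradient` helper sketched above; the callables `query_naturalness_diff` and `query_class_diff` are hypothetical stand-ins for the crowdsourced evaluations, and `sigma`/`R` defaults are placeholders ($\lambda = 2$ is the value used later in the experiments).

```python
import numpy as np

def humanacgan_upstream_grads(x_hat, labels,
                              query_naturalness_diff, query_class_diff,
                              lam=2.0, sigma=0.1, R=10):
    """Per-datum upstream gradients d(L_S + lam * L_C)/d x_hat (Eqs. (5), (9), (11)).

    x_hat: generated data, shape (N, dim); labels: class label c_n for each datum.
    query_naturalness_diff(x_plus, x_minus) and query_class_diff(x_plus, x_minus, c)
    return human-evaluated differences in [-1, 1].
    """
    upstream = np.zeros_like(x_hat, dtype=float)
    for n, (x_n, c_n) in enumerate(zip(x_hat, labels)):
        grad_s = nes_gradient(x_n, query_naturalness_diff, sigma, R)          # naturalness
        grad_c = nes_gradient(x_n, lambda p, m: query_class_diff(p, m, c_n),
                              sigma, R)                                       # class acceptability
        upstream[n] = grad_s + lam * grad_c
    return upstream
```

The returned array is fed to the generator's backward pass as the upstream gradient, so that the framework's autograd supplies $\partial \hat{x}_n / \partial \theta$ and the parameters are updated as in Eq. (8).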
4. EXPERIMENTAL EVALUATION
This section describes an experimental test of the effectiveness of the HumanACGAN using Japanese phonemes as classes.
The two phonemes we used were the Japanese vowels /i/ and /e/. We basically followed the experimental setup of the HumanGAN paper [6]. The data consisted of female speakers' utterances recorded in the JVPD corpus [15]. Before extracting speech features, the speech waveforms were downsampled and their powers were normalized. 513-dimensional log spectral envelopes, fundamental frequency (F0), and aperiodicities (APs) were extracted frame by frame from the speech waveforms using the WORLD vocoder [16, 17]. We extracted the speech features of the vowels /i/ and /e/ using phoneme alignment obtained by Julius [18]. We applied PCA to the log spectral envelopes and used the first and second principal components. The two-dimensional principal components were normalized to have zero mean and unit variance.
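As a rough illustration of this preprocessing, the sketch below reduces pre-extracted log spectral envelopes to the two-dimensional, standardized representation used in the experiments. The use of scikit-learn, the hypothetical input file, and the array name `log_sp` are assumptions; the feature extraction itself (WORLD analysis and Julius alignment) is omitted.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Pre-extracted 513-dimensional log spectral envelopes of /i/ and /e/ frames
log_sp = np.load("log_spectral_envelopes.npy")  # hypothetical file, shape (num_frames, 513)

pca = PCA(n_components=2)
pc = pca.fit_transform(log_sp)            # first and second principal components
scaler = StandardScaler()
pc_normalized = scaler.fit_transform(pc)  # zero mean, unit variance per dimension

# The generator works in this 2-D space; generated points are later de-normalized
# and mapped back toward spectral envelopes for waveform synthesis, e.g.:
# log_sp_reconstructed = pca.inverse_transform(scaler.inverse_transform(generated_points))
```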
Fig. 3. Color maps representing posterior probabilities of naturalness and class acceptability ("accep.") of /i/ and /e/.

Fig. 4. Generated data (points) and gradients (arrows) estimated by our algorithm. "accep." denotes acceptability. The gradients in the upper-left panel are the weighted sum of the other three gradients. We can see that the gradients for naturalness and class acceptability point toward the darker (i.e., higher-posterior) zones.

The speech waveforms to be evaluated were synthesized using features obtained from the DNN-based conditional generator in the following way. First, the first and second principal components were generated by the generator and de-normalized. For the other features, i.e., the F0 and the APs, we used the averages over all speakers. These corresponded to the speech features of one frame. Next, we copied the features over multiple frames to make the perceptual evaluations easy and synthesized a short speech waveform using the WORLD vocoder.

First, we confirmed that the human-acceptable distributions were wider than the real-data distributions. We carried out two tests: evaluations of humans' tolerance of naturalness and of class acceptability. In this experiment, class acceptability denotes whether the data sounded like the presented phoneme. We split the two-dimensional space into grids and generated a speech waveform for every grid. Then, we presented a speech waveform to a listener, and the listener rated the naturalness on a 5-point scale from 1 (bad) to 5 (excellent). Next, we presented a speech waveform and a phoneme label (/i/ or /e/) to a listener, and the listener rated the class acceptability on the same scale.

Fig. 5. Data generated from the initialized or trained generator. The colors of the data points correspond to those of the color maps for class acceptability in Fig. 3. Based on the posterior probabilities shown in Fig. 3, we can say that the training makes the data move to have higher posterior probabilities.
Fig. 6. Posterior probability of data generated from the initialized ("Init") or trained ("Trained") generators. The boxes indicate the first, second (i.e., median), and third quartiles. The line plot indicates the mean value.

The obtained scores of one through five were mapped to corresponding values of the posterior probability ($D_{\mathrm{S}}(\hat{x}_n)$ or $D_{\mathrm{C}}(\hat{x}_n, c_n)$). We used the Lancers crowdsourcing platform [19] to execute the evaluations. The posterior probabilities were averaged for each grid. Each grid was scored by at least five listeners. The total number of listeners was 105.

Fig. 3 shows the results. As described in Section 4.1, the real data were normalized to have zero mean and unit variance, and the ACGAN represents this range. However, as shown in Fig. 3, the human-acceptable distributions, i.e., the darker zones of the color maps, were wider than the real-data distribution for both naturalness and class acceptability. Therefore, we obtained support for the HumanACGAN being able to represent this distribution, which an ACGAN cannot adequately represent.
Next, we executed the HumanACGAN training and qualitatively evaluated the generated data and the approximated gradients. The prior noise followed a two-dimensional uniform distribution. We randomly generated this prior noise before the training and fixed it during the iterations. Half of the data belonged to the class label /i/, and the remaining data belonged to /e/. The generator was a small feed-forward neural network consisting of a two-unit input layer, sigmoid hidden layers, and a two-unit linear output layer. The output layer was conditioned using the two-dimensional one-hot class-label vector. We performed random initialization of the generator until the following conditions were satisfied: the data generated from the initialized generator should cover ranges of higher posterior probabilities of naturalness and class acceptability, and the data of each class should leak into the other class's distribution so that the ability to classify can be determined. We empirically set the hyperparameter $\lambda = 2$ and used the gradient descent method with a fixed learning rate $\alpha$ for the training. We used Chainer [20] for the implementation. The number of generated data $N$, the number of perturbations $R$, the number of training iterations, and the NES standard deviation $\sigma$ were fixed in advance. We carried out two perceptual evaluations during the HumanACGAN training. First, we presented two speech waveforms to a listener, and the listener rated which one was more natural on a 5-point scale: 1: the first one, 3: equal, and 5: the second one. Next, we presented two speech waveforms and one phoneme label (/i/ or /e/) to a listener, and the listener rated which one sounded more like the presented phoneme on the same scale. The obtained scores of one through five were mapped to corresponding values of the difference in the posterior probabilities ($\Delta D_{\mathrm{S}}(\hat{x}_n^{(r)})$ or $\Delta D_{\mathrm{C}}(\hat{x}_n^{(r)}, c_n)$), from positive (the first waveform preferred) through zero to negative (the second waveform preferred).

Fig. 4 shows the gradient of every data point. The posterior probability of Fig. 3 is also drawn for reference, but note that we never used it during the training. We can say that the gradients from both naturalness and class acceptability point toward the respective darker ranges. This qualitatively indicates that the gradients in Eqs. (5) and (11) were properly estimated.
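A minimal Chainer sketch of the conditional generator described above is shown below, as one possible realization rather than the authors' code. The number and width of the hidden layers and the exact conditioning mechanism (here, concatenating the one-hot class vector to the input of the linear output layer) are assumptions.

```python
import chainer
import chainer.functions as F
import chainer.links as L

class ConditionalGenerator(chainer.Chain):
    """Small feed-forward conditional generator: 2-D prior noise -> 2-D data."""

    def __init__(self, n_hidden=64):  # hidden width and depth are assumed values
        super().__init__()
        with self.init_scope():
            self.l1 = L.Linear(2, n_hidden)
            self.l2 = L.Linear(n_hidden, n_hidden)
            # the output layer is conditioned on the 2-D one-hot class-label vector
            self.out = L.Linear(n_hidden + 2, 2)

    def __call__(self, z, c_onehot):
        h = F.sigmoid(self.l1(z))
        h = F.sigmoid(self.l2(h))
        return self.out(F.concat([h, c_onehot], axis=1))
```

Generated points from this network are perturbed, evaluated by listeners, and updated with the upstream gradients sketched in Section 3.1 through standard Chainer backpropagation.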
Fig. 5 shows the data generated from the initialized and trained generators. On the basis of the posterior probabilities shown in Fig. 3, we can say that the training makes the data move so that they have higher posterior probabilities of both naturalness and class acceptability. This qualitatively indicates that the loss functions in Eqs. (6) and (7) improved the generator so that it represents the conditional human-acceptable distributions.
Finally, we quantitatively verified that the training increases the posterior probabilities of naturalness $D_{\mathrm{S}}(\cdot)$ and class acceptability $D_{\mathrm{C}}(\cdot)$. We prepared two types of data: closed and open. The closed data were generated from the prior noise used during the training, and the open data were generated from newly sampled prior noise that was not used during the training. The posterior probabilities of the closed/open data were scored in the same manner as in Section 4.2. The total number of listeners was 160.
Fig. 6 shows the box plots of the posterior probability. The training iterations increased the posterior probabilities of both naturalness and class acceptability for not only the closed data but also the open data. Therefore, we can say that our training can increase the objective values consisting of posterior probabilities and that the generator of the HumanACGAN can represent conditional human-acceptable distributions.
5. CONCLUSION
We proposed the HumanACGAN, which can conditionally represent humans' perceptually acceptable distributions. We reconfigured the discriminator and the auxiliary classifier of an ACGAN to utilize humans' perceptual evaluations. The DNN-based conditional generator of the HumanACGAN was trained using two kinds of human perceptual evaluations, naturalness and class acceptability, which were used for the discriminator and the auxiliary classifier, respectively. We evaluated the effectiveness of the HumanACGAN through qualitative and quantitative experiments. As future work, we plan to improve the scalability of the HumanACGAN to the data size and dimensionality.
Acknowledgements:
Part of this work was supported by the MIC/SCOPE.

REFERENCES

[1] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, "Generative adversarial nets," in Proc. NIPS, Montreal, Canada, Dec. 2014, pp. 2672–2680.
[2] D. Kingma and M. Welling, "Auto-encoding variational Bayes," arXiv, vol. abs/1312.6114, 2013.
[3] L. Dinh, D. Krueger, and Y. Bengio, "NICE: Non-linear independent components estimation," in Proc. ICLR, San Diego, U.S.A., May 2015.
[4] Y. Saito, S. Takamichi, and H. Saruwatari, "Statistical parametric speech synthesis incorporating generative adversarial networks," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 26, no. 1, pp. 84–96, Jan. 2018.
[5] Y. Hono, K. Hashimoto, K. Oura, Y. Nankaku, and K. Tokuda, "Singing voice synthesis based on generative adversarial networks," in Proc. ICASSP, Brighton, United Kingdom, May 2019, pp. 6955–6959.
[6] K. Fujii, Y. Saito, S. Takamichi, Y. Baba, and H. Saruwatari, "HumanGAN: generative adversarial network with human-based discriminator and its evaluation in speech perception modeling," in Proc. ICASSP, Barcelona, Spain, May 2020, pp. 6239–6243.
[7] Y. Sagisaka, "Speech synthesis by rule using an optimal selection of non-uniform synthesis units," in Proc. ICASSP, New York, U.S.A., Apr. 1988, pp. 679–682.
[8] Y. Stylianou, O. Cappé, and E. Moulines, "Continuous probabilistic transform for voice conversion," IEEE Transactions on Speech and Audio Processing, vol. 6, no. 2, pp. 131–142, Mar. 1998.
[9] M. Mirza and S. Osindero, "Conditional generative adversarial nets," arXiv preprint arXiv:1411.1784, 2014.
[10] A. Odena, C. Olah, and J. Shlens, "Conditional image synthesis with auxiliary classifier GANs," in Proc. ICLR, Vancouver, Canada, Apr. 2018.
[11] C. Chiu, Y. Koyama, Y. Lai, T. Igarashi, and Y. Yue, "Human-in-the-loop differential subspace search in high-dimensional latent space," ACM Transactions on Graphics (TOG), vol. 39, no. 4, article 85, 2020.
[12] J. Peterson, J. Suchow, K. Aghi, A. Ku, and T. Griffiths, "Capturing human category representations by sampling in deep feature spaces," arXiv preprint arXiv:1805.07644, May 2018.
[13] A. Ilyas, L. Engstrom, A. Athalye, and J. Lin, "Black-box adversarial attacks with limited queries and information," in Proc. ICML, Stockholm, Sweden, Jul. 2018, vol. 2, pp. 2137–2146.
[14] I. Goodfellow, "NIPS 2016 tutorial: Generative adversarial networks," in Proc. NIPS, Barcelona, Spain, Dec. 2016.
[15] "Vowel database: Five Japanese vowels of males, females, and children along with relevant physical data (JVPD)," http://research.nii.ac.jp/src/en/JVPD.html.
[16] M. Morise, F. Yokomori, and K. Ozawa, "WORLD: a vocoder-based high-quality speech synthesis system for real-time applications," IEICE Transactions on Information and Systems, vol. E99-D, no. 7, pp. 1877–1884, Jul. 2016.
[17] M. Morise, "D4C, a band-aperiodicity estimator for high-quality speech synthesis," Speech Communication, vol. 84, pp. 57–65, Nov. 2016.
[18] A. Lee, T. Kawahara, and K. Shikano, "Julius — an open source real-time large vocabulary recognition engine," in Proc. EUROSPEECH, Aalborg, Denmark, Sep. 2001, pp. 1691–1694.
[19] "Lancers."
[20] S. Tokui, R. Okuta, T. Akiba, Y. Niitani, T. Ogawa, S. Saito, S. Suzuki, K. Uenishi, B. Vogel, and H. Y. Vincent, "Chainer: A deep learning framework for accelerating the research cycle."