On the Performance of Generative Adversarial Network (GAN) Variants: A Clinical Data Study
Jaesung Yoo, Jeman Park, An Wang, David Mohaisen, Joongheon Kim
School of Electrical Engineering, Korea University, Seoul, Republic of Korea
School of Electrical and Computer Engineering, Georgia Institute of Technology, Atlanta, GA, USA
Department of Electrical Engineering and Computer Science, Case Western Reserve University, Cleveland, OH, USA
Department of Computer Science, University of Central Florida, Orlando, FL, USA
Artificial Intelligence Engineering Research Center, College of Engineering, Korea University, Seoul, Republic of Korea
E-mails: [email protected], [email protected], [email protected], [email protected], [email protected]

Abstract—The Generative Adversarial Network (GAN) is a useful type of neural network in a variety of applications, including generative modeling and feature extraction. Many GAN variants have been proposed from different insights, producing a diverse family of GANs whose performance improves with each generation. This review surveys GAN variants grouped by their common traits and compares them on a clinical data augmentation task.
I. INTRODUCTION
Deep learning has many applications in medical informatics [1]–[4]. The generative adversarial network (GAN), first introduced in 2014 by Goodfellow et al. [5], is one of the most widely used deep learning models today. The notable idea of GAN is the adversarial nets framework: two modules, a generator and a discriminator, compete with adversarial objectives. This allows the generator to learn the distribution of the data through feedback from the discriminator. Since the advent of GAN, it has gained considerable popularity, and numerous variants continue to be proposed. These variants fall largely into two types. The first type copes with the instability in the learning process of the vanilla GAN; these variants approach the learning problems from various insights, resulting in diverse loss functions that achieve better performance. The second type is specialized for particular tasks (e.g., image-to-image translation [6] or super-resolution [7]); these variants construct their model architectures differently than the vanilla GAN.

In this paper, we first analyze the mechanism and chronic learning problems of the vanilla GAN. We then introduce several GAN variants from the perspectives of loss functions and model architectures. Finally, we conduct experiments on surgical data augmentation with different GAN types to compare their performance.

II. ANALYSIS OF THE VANILLA GAN

The vanilla GAN consists of a generator model G and its adversarial discriminator model D. First, D receives fake data generated from G's model distribution, or real data from the real data distribution, and is trained to precisely distinguish real from fake data. Next, G generates fake data and sends it to D; D evaluates how close the fake data is to the real data, and this result is fed back to G. Finally, G is trained so as to keep D from differentiating the data. This process is described by the following value function:

min_G max_D V(D, G) = E[log(D(x))] + E[log(1 − D(G(z)))]   (1)

where x is a sample from the real data distribution and z is the input noise for the generator G. The loss functions can thus be written as:

L_D = −E[log(D(x))] − E[log(1 − D(G(z)))]   (2)
L_G = E[log(1 − D(G(z)))]   (3)

For the sake of better convergence, the following generator loss can also be used, as it provides a steeper gradient away from undesired values:

L_G = −E[log(D(G(z)))]   (4)

This two-player minimax game proceeds with two adversarial gradient descent algorithms that optimize the loss function in opposite directions. This may look straightforward, but it suffers from learning instability, as follows:
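As a concrete reading of Equations (2)–(4), the losses can be evaluated for sample discriminator outputs (a minimal NumPy sketch; the function names and example scores are illustrative, not from the paper):

```python
import numpy as np

def d_loss(d_real, d_fake):
    """Discriminator loss, Eq. (2): -E[log D(x)] - E[log(1 - D(G(z)))]."""
    return -np.mean(np.log(d_real)) - np.mean(np.log(1.0 - d_fake))

def g_loss_saturating(d_fake):
    """Generator loss, Eq. (3): E[log(1 - D(G(z)))]."""
    return np.mean(np.log(1.0 - d_fake))

def g_loss_nonsaturating(d_fake):
    """Alternative generator loss, Eq. (4): -E[log D(G(z))]."""
    return -np.mean(np.log(d_fake))

# A confident discriminator: real samples scored high, fakes scored low.
d_real = np.array([0.9, 0.8, 0.95])
d_fake = np.array([0.1, 0.2, 0.05])

print(d_loss(d_real, d_fake))        # small: D is doing well
print(g_loss_saturating(d_fake))     # close to zero: weak signal for G
print(g_loss_nonsaturating(d_fake))  # large: strong signal for G
```

The example already hints at why Equation (4) is preferred in practice: when D confidently rejects the fakes, the saturating loss (3) is nearly flat, while the non-saturating loss (4) remains large.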
1) Vanishing gradient:
If the discriminator outperforms the generator, the generator can fail to learn because the gradient vanishes [8]. In other words, an optimal discriminator does not provide sufficient feedback for the generator to learn properly.
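The effect can be seen directly from the derivatives of the two generator losses with respect to the discriminator's score d = D(G(z)) (a numerical sketch with illustrative values):

```python
import numpy as np

# D(G(z)): a strong discriminator scores fake samples near 0.
d = np.array([0.01, 0.1, 0.5])

# d/dd log(1 - d) = -1 / (1 - d): nearly flat exactly where D is
# confidently rejecting the fakes, so G receives almost no gradient.
grad_saturating = -1.0 / (1.0 - d)

# d/dd (-log d) = -1 / d: steep where the saturating loss is flat,
# which is why Eq. (4) keeps the generator learning.
grad_nonsaturating = -1.0 / d

print(np.abs(grad_saturating))     # roughly [1.01, 1.11, 2.0]
print(np.abs(grad_nonsaturating))  # roughly [100., 10.,  2.0]
```

At d = 0.01 the saturating loss offers a gradient of magnitude about 1, while the alternative loss offers magnitude 100; the two agree only once D is already fooled (d = 0.5).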
2) Mode collapse:
The desired generator should produce a variety of outputs that resembles the overall data distribution. However, it sometimes generates only a fraction (mode) of the overall data distribution, settling into a specific mode that still minimizes the adversarial loss function. In response to this phenomenon, the discriminator may figure out the generator's specific mode and adjust its weights to criticize the generator. This pushes the generator to move its output distribution to the next fraction of the real data distribution, leading to oscillating modes of the data distribution that might never converge [9].
3) Non-convergence:
As GAN is a minimax game based on an adversarial loss function, the generator and the discriminator are locked in a never-ending loop of oscillation. When the generator improves enough to fool the discriminator, the discriminator's accuracy drops to around 50%. At this point, the discriminator cannot give appropriate feedback and instead gives essentially random feedback, which degrades the generator's performance again. Once the generator's performance worsens, it relearns the appropriate features, and this oscillation continues as the discriminator and generator jointly search for an equilibrium.

Many GAN variants have been proposed to remedy these learning problems through different adjustments, such as loss modifications or novel model architectures.

III. LOSS-BASED GAN

One possible way of stabilizing the training of GAN is to modify its loss functions. Since loss functions provide the guidance that model weights follow through a vast state space, it is important that they faithfully represent the ultimate goals of the optimization problem.
A. Least Squares GAN (LSGAN)
The authors of LSGAN argue that the sigmoid cross-entropy loss function leads to vanishing gradients when updating the generator using fake samples that are on the correct side of the decision boundary but still far from the real data [10]. Because these samples lie in the high-confidence region of the decision boundary, they incur almost no loss in the vanilla GAN; yet, being isolated from the real data, they look unrealistic. To remedy this issue, LSGAN introduces a least-squares loss function that pulls such samples toward the decision boundary:

min_D V(D) = (1/2) E[(D(x) − b)^2] + (1/2) E[(D(G(z)) − a)^2]   (5)
min_G V(G) = (1/2) E[(D(G(z)) − c)^2]   (6)

where a and b are the labels for fake and real data, respectively, and c denotes the value that G wants D to believe for fake data. With these loss functions, LSGAN penalizes samples that are far from the real data even when they are correctly classified, pulling them toward the decision boundary. If a GAN has been learned successfully, its decision boundary passes through the manifold of the real data; accordingly, pulling samples toward the decision boundary brings them closer to the manifold, allowing G to generate more realistic data.

B. Wasserstein GAN (WGAN)
WGAN uses the Wasserstein distance to stabilize training [11]. Along with the Kullback–Leibler divergence (KL-divergence) and the Jensen–Shannon divergence (JS-divergence), the Wasserstein distance is a metric for measuring the distance between two probability distributions. A common drawback of the KL- and JS-divergences is that when two distributions are far apart, the loss function becomes flat, resulting in a vanishing gradient. In applications such as GAN, where the two modules compete fiercely, it is easy for the two distributions to drift far apart, at which point the vanishing gradient occurs and leads to non-convergence. The Wasserstein distance addresses this problem because its gradient does not vanish. The WGAN loss is defined as:

L_WGAN = ||D(x) − D(G(z))||   (7)

Unlike the KL-divergence, the Wasserstein distance contains no logarithms, which makes it fairly linear and can lead to more stable training.

IV. ARCHITECTURE-BASED GAN

GAN can also be improved through variants of its architecture. The following neural networks are GAN variants arising from two insights: the generator can take the form of a Variational Autoencoder (VAE), and the discriminator can have multiple outputs.
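Before turning to the architectural variants, the loss-based objectives of Section III can be lined up side by side (a toy NumPy sketch with scalar discriminator/critic outputs; the choice a = 0, b = c = 1 is a common convention, assumed here rather than fixed by the text):

```python
import numpy as np

d_real = np.array([0.9, 0.8])  # discriminator scores on real data
d_fake = np.array([0.1, 0.3])  # discriminator scores on generated data

# Vanilla GAN discriminator loss, Eq. (2)
vanilla_d = -np.mean(np.log(d_real)) - np.mean(np.log(1 - d_fake))

# LSGAN, Eqs. (5)-(6), with fake label a = 0 and real/target labels b = c = 1
a, b, c = 0.0, 1.0, 1.0
lsgan_d = 0.5 * np.mean((d_real - b) ** 2) + 0.5 * np.mean((d_fake - a) ** 2)
lsgan_g = 0.5 * np.mean((d_fake - c) ** 2)

# Critic score difference in the spirit of Eq. (7); note that a WGAN
# critic is unbounded in practice, so raw scores rather than
# probabilities would be used there.
wgan = np.abs(np.mean(d_real) - np.mean(d_fake))

print(vanilla_d, lsgan_d, lsgan_g, wgan)
```

The least-squares terms keep penalizing correctly classified samples in proportion to their distance from the target labels, and the critic difference contains no logarithms, which is exactly the behavior each variant was designed for.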
A. Variational Autoencoder GAN (VAEGAN)
VAEGAN substitutes the simple generator with a Variational Autoencoder (VAE) [12], thereby inheriting the benefits of both GAN and VAE. The VAE acts as a generator and creates augmented data while the discriminator tries to score them. This structure enhances the power of feature extraction, and during training the VAE structure makes learning stable. After training completes, VAEGAN generates new samples more effectively than GAN, since passing a randomly sampled latent vector through the decoder yields new augmented samples.

The VAE has two submodules: an encoder q and a decoder G'. The encoder encodes the given data into a latent variable z; the decoder then reconstructs the data from the latent variable z. The reconstruction loss is thus:

L_rec = −E_{ẑ∼q(x)}[log G'(x | ẑ)]   (8)

where ẑ is sampled from the distribution q(x). The loss function for VAEGAN is a linear combination of the GAN loss and the VAE reconstruction loss, with one optional additional loss: the Kullback–Leibler divergence (KL-divergence) applied to the latent variable to reduce model complexity. The latent variable z is conventionally constrained to N(0, 1). The KL-divergence between two Gaussians is:

L_KLD = D_KL(q(z|x) || p(z)) = log(σ_p/σ_q) + (σ_q^2 + (μ_q − μ_p)^2) / (2σ_p^2) − 1/2   (9)

where p denotes the prior distribution of the latent space, conventionally assumed to be N(0, 1); thus μ_p = 0 and σ_p = 1. Since the VAE acts as a generator, we can define G so as to reuse the loss function from (4):

G(z) = G'_{ẑ∼q(x)}(ẑ)   (10)

Integrating the losses from GAN and VAE, the final loss for the VAE becomes:

L_VAE = L_G + λ_rec L_rec + λ_KLD L_KLD   (11)

where λ_rec and λ_KLD are hyperparameters denoting the weights of the losses. The loss for the discriminator is the same as Equation (2).
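The Gaussian KL term of Equation (9) and the combined objective of Equation (11) can be sketched as follows (illustrative function names; the λ values here are arbitrary placeholders, not from the paper):

```python
import numpy as np

def kl_gaussians(mu_q, sigma_q, mu_p=0.0, sigma_p=1.0):
    """KL( N(mu_q, sigma_q^2) || N(mu_p, sigma_p^2) ), Eq. (9)."""
    return (np.log(sigma_p / sigma_q)
            + (sigma_q**2 + (mu_q - mu_p)**2) / (2 * sigma_p**2)
            - 0.5)

def vae_loss(l_g, l_rec, l_kld, lam_rec=1.0, lam_kld=0.1):
    """Combined VAEGAN generator objective, Eq. (11)."""
    return l_g + lam_rec * l_rec + lam_kld * l_kld

# The KL term vanishes when the posterior matches the N(0, 1) prior,
# and grows as the posterior mean drifts away from it.
print(kl_gaussians(0.0, 1.0))  # 0.0
print(kl_gaussians(2.0, 1.0))  # 2.0
```

A quick sanity check like this (KL = 0 for identical Gaussians) is a useful way to verify a reconstructed formula before wiring it into a training loop.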
B. Auxiliary Classifier GAN (ACGAN)
ACGAN is a GAN variant in which the discriminator has a classifier output along with the standard true/false output [13]. Since the discriminator also acts as a classifier, the generator can be modified to accept a condition vector and produce samples of different classes, for the sake of versatility and functionality. ACGAN is therefore appropriate for applications with categorical distributions. In ACGAN, the generator is written G(c, z), where c denotes the class vector that acts as a condition vector. The discriminator's outputs are written D_S(x), which distinguishes true/false data, and D_C(c, x), which classifies categories.

The loss functions of ACGAN divide into two parts: distinguishing the source and the class. The loss determined from the source of the data is the same as in Equations (2) and (4):

L_{D_S} = −E[log(D_S(x))] − E[log(1 − D_S(G(c, z)))]   (12)
L_{G_S} = −E[log(D_S(G(c, z)))]   (13)

The loss from the class of the data is defined so as to encourage cooperation between the generator and the discriminator, rather than adversarial behavior:

L_{D_C} = −E[log(D_C(c, x))] − E[log(D_C(c, G(c, z)))]   (14)
L_{G_C} = −E[log(D_C(c, G(c, z)))]   (15)

The final loss functions for the generator and the discriminator are weighted sums of the losses defined in Equations (12) to (15):

L_D = L_{D_S} + λ_{D_C} L_{D_C}   (16)
L_G = L_{G_S} + λ_{G_C} L_{G_C}   (17)

where λ_{D_C} and λ_{G_C} are hyperparameters denoting the weights of the losses.

When class labels are unavailable, variants of ACGAN can be obtained by replacing the categorical cross-entropy class loss with an entropy loss, which lets the model learn to predict in the absence of a class label. In applications where the generator wants to fool the discriminator, the entropy class loss can be maximized rather than minimized.

C. Auxiliary Classifier Variational Autoencoder (ACVAE)
Motivated by VAEGAN and ACGAN, both the generator and discriminator architectures can be modified for improvement. ACVAE uses an auxiliary classifier (AC) as the discriminator and a VAE as the generator [14]. The AC structure allows the generator to create samples of various classes, while the VAE structure allows stable training and flexible generation of samples. The generator of ACVAE is a VAE that receives a class vector c as a condition. Since the VAE has two submodules, an encoder q and a decoder G', there are three possible generator structures for receiving the condition vector: only the encoder q receives c, only the decoder G' receives c, or both submodules receive c. The appropriate structure depends on the application. Assuming the last type of architecture, the encoder is written q(c, x) and the decoder G'(c, z).

Since ACVAE combines the two previous architectural insights, the corresponding loss functions follow as well. The generator G can be redefined to reuse the loss functions from (16) and (17):

G(c, z) = G'_{ẑ∼q(c,x)}(c, ẑ)   (18)

The loss for the VAE is identical to (11), and the loss for the discriminator is identical to (16):

L_VAE = L_G + λ_rec L_rec + λ_KLD L_KLD = L_{G_S} + λ_{G_C} L_{G_C} + λ_rec L_rec + λ_KLD L_KLD   (19)

V. EXPERIMENTS
While machine learning requires vast amounts of data, surgical databases are limited in quantity due to practical constraints. Thus, data augmentation techniques are used in medical applications to aid the learning process of machine learning models [15], [16].

A data augmentation experiment using anesthetic surgical data was performed to compare the results of different GAN variants. The data includes four components: the dosage histories of two anesthetic drugs, Propofol (PPF) and Remifentanil (RFTN); the anesthetized response of the patient, the Bispectral Index (BIS); and the covariates of the patient. Lower BIS indicates a more deeply anesthetized state, and a BIS of around 50 is desired. Data augmentation was performed by the generator, which receives patient covariates and Gaussian noise as input and creates a fabricated drug dosage history. The augmented drug dosages were fed into the pharmacokinetic-pharmacodynamic (PK-PD) model, a traditional patient response model, to compute the BIS response of the augmented data [17]. The BIS response to the augmented drug dosage was used to monitor the training process of the GAN.

Fig. 1. Examples of data augmentation results using different GAN variants: (a) Ground truth, (b) Vanilla GAN, (c) LSGAN, (d) WGAN, (e) VAEGAN. For PPF and RFTN the y-axis is dose/10 sec; for BIS the y-axis is dimensionless. ACGAN and ACVAE are not included as the application does not have classes.

Fig. 1 shows examples of data augmented with the different types of GAN. Real surgical data tends to have a high peak at the start of the surgery, which fully anesthetizes the patient; at the end of the surgery, the drug dosage is lowered to awaken the patient. As seen from Fig. 1, the vanilla GAN succeeds in generating synthetic surgical data with a roughly similar BIS response. However, its drug dosages are spread across the timestamps, unlike the ground truth data, which peaks at certain points; due to the distributed dosage, the BIS response to the vanilla GAN's augmented data keeps increasing. The results of LSGAN and WGAN are roughly similar to each other and better than the vanilla GAN, as the BIS response to their augmented data stays closer to 50. VAEGAN shows the best result, resembling the ground truth data with sharp peaks up front and BIS around 50. This does not ensure, however, that VAEGAN always outperforms the others: deep learning architectures are application-dependent, and so, although GAN variants improve on the vanilla GAN, the performance of each model depends on the application.

VI. CONCLUDING REMARKS
GAN is a powerful structure that invigorates feature extraction, but it is unstable, with uncertain convergence during training. Its adversarial insight is the core motivation for model improvement, yet its vanilla structure is too simple and carries numerous unknown factors affecting its results, which accounts for GAN's instability. Various attempts to further restrict the search space of GAN therefore prevail, and this remains an ongoing matter that needs more research.

The human brain is evidently a massive modular neural network. Since nature and evolution have divided the brains of animals into numerous sub-modules rather than letting them operate as a whole, one may boldly deduce that a harmony of sub-modules performs better than a single large module. GAN is a powerful type of modular neural network, and modularizing neural networks may be one of the main keys to stabilizing large deep neural networks that perform powerful tasks.

ACKNOWLEDGEMENT
This research was supported by the MHW (Ministry of Health and Welfare), Korea (Grant number: HI19C0842), supervised by the KHIDI (Korea Health Industry Development Institute); the project title is DisTIL (Development of Privacy-Reinforcing Distributed Transfer-Iterative Learning Algorithm). All authors in this paper have equal contributions (first authors). J. Kim is the corresponding author of this paper.

REFERENCES

[1] J. Jeon, J. Kim, J. Kim, K. Kim, A. Mohaisen, and J. Kim, "Privacy-preserving deep learning computation for geo-distributed medical big-data platforms," 2019, pp. 3–4.
[2] S. Ahn, J. Kim, E. Lim, W. Choi, A. Mohaisen, and S. Kang, "ShmCaffe: A distributed deep learning platform with shared memory buffer for HPC architecture," in Proc. IEEE International Conference on Distributed Computing Systems (ICDCS), 2018, pp. 1118–1128.
[3] M. Shin, J. Kim, A. Mohaisen, J. Park, and K. H. Lee, "Neural network syntax analyzer for embedded standardized deep learning," in Proc. 2nd ACM International Workshop on Embedded and Mobile Deep Learning (EMDL), 2018, pp. 37–41.
[4] Y. J. Mo, J. Kim, J. Kim, A. Mohaisen, and W. Lee, "Performance of deep learning computation with TensorFlow software library in GPU-capable multi-core computing platforms," in Proc. IEEE International Conference on Ubiquitous and Future Networks (ICUFN), 2017, pp. 240–242.
[5] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, "Generative adversarial nets," in Proc. Neural Information Processing Systems (NIPS), 2014, pp. 2672–2680.
[6] Y. Choi, M. Choi, M. Kim, J.-W. Ha, S. Kim, and J. Choo, "StarGAN: Unified generative adversarial networks for multi-domain image-to-image translation," in Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018, pp. 8789–8797.
[7] C. Ledig, L. Theis, F. Huszár, J. Caballero, A. Cunningham, A. Acosta, A. Aitken, A. Tejani, J. Totz, Z. Wang et al., "Photo-realistic single image super-resolution using a generative adversarial network," in Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017, pp. 4681–4690.
[8] M. Arjovsky and L. Bottou, "Towards principled methods for training generative adversarial networks," arXiv preprint arXiv:1701.04862, 2017.
[9] L. Metz, B. Poole, D. Pfau, and J. Sohl-Dickstein, "Unrolled generative adversarial networks," arXiv preprint arXiv:1611.02163, 2016.
[10] X. Mao, Q. Li, H. Xie, R. Y. Lau, Z. Wang, and S. Paul Smolley, "Least squares generative adversarial networks," in Proc. IEEE International Conference on Computer Vision (ICCV), 2017, pp. 2794–2802.
[11] M. Arjovsky, S. Chintala, and L. Bottou, "Wasserstein generative adversarial networks," in Proc. International Conference on Machine Learning (ICML), 2017, pp. 214–223.
[12] A. B. L. Larsen, S. K. Sønderby, H. Larochelle, and O. Winther, "Autoencoding beyond pixels using a learned similarity metric," in Proc. International Conference on Machine Learning (ICML), 2016, pp. 1558–1566.
[13] A. Odena, C. Olah, and J. Shlens, "Conditional image synthesis with auxiliary classifier GANs," in Proc. International Conference on Machine Learning (ICML), 2017, pp. 2642–2651.
[14] H. Kameoka, T. Kaneko, K. Tanaka, and N. Hojo, "ACVAE-VC: Non-parallel voice conversion with auxiliary classifier variational autoencoder," IEEE/ACM Transactions on Audio, Speech, and Language Processing (TASLP), vol. 27, no. 9, pp. 1432–1443, 2019.
[15] M. Frid-Adar, I. Diamant, E. Klang, M. Amitai, J. Goldberger, and H. Greenspan, "GAN-based synthetic medical image augmentation for increased CNN performance in liver lesion classification," Neurocomputing, vol. 321, pp. 321–331, 2018.
[16] H.-C. Shin, N. A. Tenenholtz, J. K. Rogers, C. G. Schwarz, M. L. Senjem, J. L. Gunter, K. P. Andriole, and M. Michalski, "Medical image synthesis for data augmentation and anonymization using generative adversarial networks," in Proc. International Workshop on Simulation and Synthesis in Medical Imaging (SASHIMI), Springer, 2018, pp. 1–11.
[17] T. G. Short, J. A. Hannam, S. Laurent, D. Campbell, M. Misur, A. F. Merry, and Y. H. Tam, "Refining target-controlled infusion: An assessment of pharmacodynamic target-controlled infusion of propofol and remifentanil using a response surface model of their combined effects on bispectral index."