Combating Mode Collapse in GAN training: An Empirical Analysis using Hessian Eigenvalues
Ricard Durall, Avraam Chatzimichailidis, Peter Labus, Janis Keuper
Fraunhofer ITWM, Germany; IWR, University of Heidelberg, Germany; Chair for Scientific Computing, TU Kaiserslautern, Germany; Fraunhofer Center Machine Learning, Germany; Institute for Machine Learning and Analytics, Offenburg University, Germany
Keywords: Generative Adversarial Network, Second-Order Optimization, Mode Collapse, Stability, Eigenvalues.

Abstract: Generative adversarial networks (GANs) provide state-of-the-art results in image generation. However, despite being so powerful, they still remain very challenging to train. This is in particular caused by their highly non-convex optimization space leading to a number of instabilities. Among them, mode collapse stands out as one of the most daunting ones. This undesirable event occurs when the model can only fit a few modes of the data distribution, while ignoring the majority of them. In this work, we combat mode collapse using second-order gradient information. To do so, we analyse the loss surface through its Hessian eigenvalues, and show that mode collapse is related to the convergence towards sharp minima. In particular, we observe how the eigenvalues of the G are directly correlated with the occurrence of mode collapse. Finally, motivated by these findings, we design a new optimization algorithm called nudged-Adam (NuGAN) that uses spectral information to overcome mode collapse, leading to empirically more stable convergence properties.

Accepted in VISAPP 2021.

Although Deep Neural Networks (DNNs) have exhibited remarkable success in many applications, the optimization process of DNNs remains a challenging task. The main reason for that is the non-convexity of the loss landscape of such networks. While most of the research in the field has focused on single-objective minimization, such as classification problems, there are other models that require the joint minimization of several objectives. Among these models, Generative Adversarial Networks (GANs) (Goodfellow et al., 2014a) are particularly interesting, due to their success at learning entire probability distributions. Since their first appearance, they have been used to improve the performance of a wide range of tasks in computer vision, including image-to-image translation (Abdal et al., 2019; Karras et al., 2020), image inpainting (Iizuka et al., 2017; Yu et al., 2019), semantic segmentation (Xue et al., 2018; Durall et al., 2019) and many more.

GANs are a class of generative models which consist of a generator (G) and a discriminator (D) DNN model. Within an adversarial game, they are trained in such a way that the G learns to produce new samples distributed according to the desired data distribution. Training can be formulated in terms of a minimax optimization of a value function V(G, D):

$$\min_G \max_D V(G, D). \quad (1)$$

While being very powerful and expressive, GANs are known to be notoriously hard to train. This is because their training is equivalent to the search for Nash equilibria in a high-dimensional, highly non-convex optimization space. The standard algorithm for solving this optimization problem is gradient descent-ascent (GDA), where G and D perform alternating update steps using first-order gradient information w.r.t. the loss function. In practice, GDA is often combined with regularization, which has yielded many state-of-the-art results for generative models on various benchmark datasets. However, GDA is known to suffer from undesirable convergence properties that may lead to instabilities, divergence, catastrophic forgetting and mode collapse. The latter term refers to the scenario where only a few modes of the data distribution are generated and the model produces only a limited variety of samples.

Many recent works have studied different approaches to tackle these issues.
Reference (Radford et al., 2015), for instance, was one of the first attempts to use convolutional neural networks in order to improve both the training stability and the visual quality of the generated samples. Other works achieved improvements through the use of new objective functions (Salimans et al., 2016; Arjovsky et al., 2017) and additional regularization terms (Gulrajani et al., 2017; Durall et al., 2020). There have also been recent advances in the theoretical understanding of GAN training. References (Nagarajan and Kolter, 2017; Mescheder et al., 2017), for example, have investigated the convergence properties of GAN training using first-order information. There, it has been shown that a local analysis of the eigenvalues of the Jacobian of the loss function can provide guarantees on local stability properties. Moreover, going beyond first-order gradient information, references (Berard et al., 2019; Fiez et al., 2019) have used the top k-eigenvalues of the Hessian of the loss to investigate the convergence and dynamics of GANs.

In this paper, we conduct an empirical study to obtain new insights concerning stability issues of GANs. In particular, we investigate the general problem of finding local Nash equilibria by examining the characteristics of the Hessian eigenvalue spectrum and the geometry of the loss surface. We thereby verify some of the previous findings that were based on the top k-eigenvalues alone. We hypothesize that mode collapse might stand in close relationship with convergence towards sharp minima, and we show empirical results that support this claim. Finally, we introduce a novel optimizer which uses second-order information to combat mode collapse. We believe that our findings can contribute to understanding the origins of the instabilities encountered in training GANs.

In summary, our contributions are as follows:
• We calculate the full Hessian eigenvalue spectrum during GAN training, allowing us to link mode collapse to anomalies of the eigenvalue spectrum.
• We identify similar patterns in the evolution of the eigenvalue spectrum of G and D by inspecting their top k-eigenvalues.
• We introduce a novel optimizer that uses second-order information to mitigate mode collapse.
• We empirically demonstrate that D finds a local minimum, while G remains in a saddle point.

While gradient-based optimization has been very successful in Deep Learning, applying gradient-based algorithms in game theory, i.e. finding Nash equilibria, has often highlighted their limitations. An intense line of research based on first- and second-order methods has studied the dynamics of gradient descent-ascent by investigating the loss landscape of DNNs. One of the initial first-order approaches (Goodfellow et al., 2014b) studied the properties of the loss landscape along a linear path between two points in parameter space. In doing so, it was demonstrated that DNNs tend to behave similarly to convex loss functions along these paths. In later references (Draxler et al., 2018), non-linear paths between two points were investigated. There, it was shown that the loss surface of DNNs contains paths that connect different minima, having constant low loss along these paths.

In the context of second-order approaches, there has also been notable progress (Sagun et al., 2016; Alain et al., 2019). There, the Hessian w.r.t. the loss function was used to reduce oscillations around critical points in order to obtain faster convergence to Nash equilibria.
The main advantage of second-order methods is the fact that the Hessian provides curvature information of the loss landscape in all directions of parameter space (and not only along the path of steepest descent as with first-order methods). However, this curvature information is local only and very expensive to compute. In the context of GANs, second-order methods have not been investigated in depth. Recent works (Berard et al., 2019; Fiez et al., 2019) have not calculated the full Hessian matrix but resorted to approximations, such as computing only the top-k eigenvalues. To the best of our knowledge, we are the first to use the full Hessian eigenvalue spectrum to study the training behavior of GANs.

Another line of research tries to classify different types of local critical points of the loss surface during training w.r.t. their generalization behavior. In (Hochreiter and Schmidhuber, 1997) it was originally speculated that the width of an optimum is critically related to its generalization properties. Later, (Keskar et al., 2016) extended these conjectures by conducting a set of experiments showing that SGD usually converges to sharper local optima for larger batch sizes. Following this principle, (Chaudhari et al., 2019) proposed an SGD-based method that explicitly forces optimization towards wide valleys. (Izmailov et al., 2018) introduced a novel variant of SGD which averages weights during training. In this way, solutions in flatter regions of the loss landscape could be found, which led to better generalization. This, in turn, has led to measurable improvements in applications such as classification. However, (Dinh et al., 2017) argues that the commonly used definitions of flatness are problematic. By exploiting symmetries of the model, they can change the amount of flatness of a minimum without changing its generalization behavior.
The goal of a generative model is to approximate a real data distribution p_r with a surrogate data distribution p_f. One way to achieve this is to minimize the "distance" between those two distributions. The generative model of a GAN, as originally introduced by Goodfellow et al., does this by minimizing the Jensen-Shannon Divergence between p_r and p_f, using the feedback of the D. From a game theoretical point of view, this optimization problem may be seen as a zero-sum game between two players, represented by the discriminator model and the generator model, respectively. During training, the D tries to maximize the probability of correctly classifying a given input as real or fake by updating its loss function

$$L_D = \mathbb{E}_{x \sim p_r}[\log(D(x))] + \mathbb{E}_{z \sim p_z}[\log(1 - D(G(z)))] \quad (2)$$

through stochastic gradient ascent. Here, x is a data sample and z is drawn randomly. The G, on the other hand, tries to minimize the probability of D classifying its generated data correctly. This is done by updating its loss function

$$L_G = \mathbb{E}_{z \sim p_z}[\log(1 - D(G(z)))] \quad (3)$$

via stochastic gradient descent. As a result, the joint optimization can be viewed as a minimax game between G, which learns how to generate new samples distributed according to p_r, and D, which learns to discriminate between real and generated data. The equilibrium of this game is reached when the G is generating samples that look as if they were drawn from the training data, while the D is left indecisive whether its input is generated or real.

As we explained above, the training of a GAN requires the joint optimization of several objectives, making their convergence intrinsically different from the case of a single objective function. The optimal joint solution to a minimax game is called Nash equilibrium. In practice, since the objectives are non-convex, using local gradient information we can only expect to find local optima, that is local Nash equilibria (LNE) (Adolphs et al., 2018). An LNE is a point for which there exists a local neighborhood in parameter space where neither the G nor the D can unilaterally decrease/increase their respective losses, i.e. their gradients vanish while their second derivative matrices are positive/negative semi-definite:

$$\|\nabla_\theta L_G\| = \|\nabla_\phi L_D\| = 0, \qquad \nabla^2_\theta L_G \succeq 0, \quad \nabla^2_\phi L_D \preceq 0, \quad (4)$$

where θ and φ denote the weights of the G and the D, respectively.

Figure 1: (Left) G losses, either minimization of log(1 − D(G(z))) or maximization of log(D(G(z))). (Right) D loss, maximization of log(D(x)) + log(1 − D(G(z))).

GANs have experienced a dramatic improvement in terms of image quality in recent years. Nowadays, it is possible to generate artificial high-resolution faces indistinguishable from real images to humans (Karras et al., 2020). However, their evaluation and comparison remain a daunting task, and so far there is no consensus as to which metric can best capture strengths and limitations of models. Nevertheless, the Inception Score (IS), proposed by (Salimans et al., 2016), is the most widely adopted metric. It evaluates images generated from the model by determining a numerical value that reasonably correlates with the quality and diversity of the output images.
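For concreteness, the following is a minimal PyTorch sketch of the two objectives in Eqs. (2) and (3). The network architectures, sizes and the stability constant eps are illustrative placeholders and do not correspond to the models used in our experiments.

```python
import torch
import torch.nn as nn

# Illustrative generator and discriminator; the architectures and dimensions
# are placeholders, not the networks used in our experiments.
G = nn.Sequential(nn.Linear(100, 256), nn.ReLU(), nn.Linear(256, 784), nn.Tanh())
D = nn.Sequential(nn.Linear(784, 256), nn.LeakyReLU(0.2), nn.Linear(256, 1), nn.Sigmoid())

def d_loss(x_real, z, eps=1e-8):
    """Eq. (2): L_D = E[log D(x)] + E[log(1 - D(G(z)))], maximized by D."""
    return (torch.log(D(x_real) + eps).mean()
            + torch.log(1.0 - D(G(z)) + eps).mean())

def g_loss(z, eps=1e-8):
    """Eq. (3): L_G = E[log(1 - D(G(z)))], minimized by G."""
    return torch.log(1.0 - D(G(z)) + eps).mean()

# Example batch: x_real stands in for real data, z ~ p_z (standard normal).
x_real = torch.rand(64, 784)
z = torch.randn(64, 100)
print(d_loss(x_real, z).item(), g_loss(z).item())
```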
In minimax GANs, the G attempts to generate samples that minimize the probability of being detected as fake by the D (c.f. Formula 3). However, in practice it turns out to be advantageous to use an alternative cost function which instead ensures that the generated samples have a high probability of being considered real. This modified version of a GAN is called a non-saturating GAN (NSGAN). When training an NSGAN, the G maximizes the alternative objective

$$\mathbb{E}_{z \sim p_z}[\log(D(G(z)))]. \quad (5)$$

The intuition why NSGANs perform better than GANs is as follows. In case the model distribution is highly different from the data distribution, the NSGAN can bring the two distributions closer together since the loss function generates a strong gradient. In fact, the NSGAN will have a vanishing gradient only when the D starts being indecisive whether its input is from the data distribution or the G. This is acceptable, however, since the samples will already have reached the distribution of the real data by that time. Figure 1 shows the loss function of the original and non-saturating D and G, respectively.

During optimization the network tries to converge into a local minimum of the loss landscape. Here we differentiate between flat and sharp minima. Whether a minimum is considered sharp or flat is determined by the loss landscape around the converged point. If the region has approximately the same error, the minimum is considered flat; otherwise we refer to the minimum as sharp. Another method to determine the sharpness of a minimum is by looking at the eigenvalues of the Hessian. These describe the local curvature in every direction of the parameter space. This allows us to see whether our network converges into sharp or flat minima, or whether it converges into a minimum at all. Here, big eigenvalues correspond to a sharp minimum in the corresponding eigendirection.

We observe that high eigenvalues in the G and D lead to a worse IS. Therefore we conclude that mode collapse is linked to the network converging into sharp minima. In order to confirm this, we look at the full eigenvalue density spectrum during training.

Calculating the eigenvalues of the Hessian has a complexity of O(N^3), and storing the Hessian itself in order to compute the eigenvalues scales with O(N^2), where N is the number of parameters in the network. For neural networks that typically have millions of parameters, calculating the eigenvalues of their Hessian is infeasible. We can skip the problem of storing the Hessian by only calculating the Hessian-vector product for different vectors. In combination with the Lanczos algorithm, this allows us to compute the eigenvalues of the Hessian without having to calculate and store the Hessian itself.

The stochastic Lanczos quadrature algorithm (Lanczos, 1950) is a method for the approximation of the eigenvalue density of very large matrices. The eigenvalue density spectrum is given by

$$\phi(t) = \frac{1}{N} \sum_{i=1}^{N} \delta(t - \lambda_i), \quad (6)$$

where N is the number of parameters in the network, λ_i is the i-th eigenvalue of the Hessian and δ is the Dirac delta function. In order to deal with the Dirac delta function, the eigenvalue density spectrum is approximated by a sum of Gaussian functions

$$\phi_\sigma(t) = \frac{1}{N} \sum_{i=1}^{N} f(\lambda_i, t, \sigma), \quad (7)$$

where

$$f(\lambda_i, t, \sigma) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{(t - \lambda_i)^2}{2\sigma^2}\right). \quad (8)$$
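The Gaussian broadening of Eqs. (7) and (8) is straightforward to write down. The following NumPy sketch, with made-up eigenvalues and an arbitrary sigma, shows the quantity that is plotted in the spectra later on; the weighted form anticipates the Lanczos-weighted estimate introduced below.

```python
import numpy as np

def gaussian(t, lam, sigma):
    """f(lambda_i, t, sigma) from Eq. (8)."""
    return np.exp(-(t - lam) ** 2 / (2.0 * sigma ** 2)) / (sigma * np.sqrt(2.0 * np.pi))

def smoothed_density(t, eigvals, sigma=0.1, weights=None):
    """Gaussian-broadened spectral density: Eq. (7) with uniform weights,
    or the Lanczos-weighted estimate when weights omega_i are supplied."""
    if weights is None:
        weights = np.full(len(eigvals), 1.0 / len(eigvals))
    return sum(w * gaussian(t, lam, sigma) for w, lam in zip(weights, eigvals))

# Example: density on a grid for a handful of illustrative eigenvalues.
t_grid = np.linspace(-1.0, 5.0, 200)
phi = smoothed_density(t_grid, eigvals=[-0.2, 0.1, 0.5, 3.0])
```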
We use the Lanczos algorithm with full reorthogonalization in order to compute eigenvalues and eigenvectors of the Hessian and to ensure orthogonality between the different eigenvectors. Since the Hessian is symmetric, we can diagonalize it and all eigenvalues are real. The Lanczos algorithm is used together with the Hessian-vector product for a certain number of iterations. Afterwards, it returns a tridiagonal matrix T. This matrix is diagonalized as

$$T = U L U^T, \quad (9)$$

where L is a diagonal matrix. By setting ω_i = U_{1,i}^2 and l_i = L_{ii} for i = 1, 2, ..., m, the resulting eigenvalues and eigenvectors are used to estimate the true eigenvalue density spectrum:

$$\hat{\phi}^{(v_i)}(t) = \sum_{i=1}^{m} \omega_i f(l_i, t, \sigma), \quad (10)$$

$$\hat{\phi}_\sigma(t) = \frac{1}{k} \sum_{i=1}^{k} \hat{\phi}^{(v_i)}(t). \quad (11)$$

For our experiments we use the toolbox from (Chatzimichailidis et al., 2019), which implements the stochastic Lanczos quadrature algorithm. This allows us to inspect and visualize the spectral information of our models.

To prevent our neural network from reaching sharp minima during optimization, we remove the gradient information in the direction of high eigenvalues. This forces our network to ignore the sharpest minima entirely and instead converge into wider ones. Inspired by (Jastrzebski et al., 2018), we construct an optimizer based on Adam (Kingma and Ba, 2014) which ignores the gradient in the direction of the top-k eigenvectors. In order to achieve this, we use the existing Adam optimizer and remove the directions of steepest descent from its gradient. This means that, given the top-k eigenvectors v_i and the gradient g, we remove the eigenvector directions by

$$g^* = g - \sum_{i=1}^{k} \langle g, v_i \rangle v_i. \quad (12)$$

The resulting gradient g* is then used by the regular Adam optimizer. Using this technique, one can turn many different optimizers into their nudged counterpart by using g* instead of the true gradient. The eigenvectors are computed by using the Lanczos method together with the R-operator. This allows fast computation of eigenvalues and eigenvectors without having to store the full Hessian.
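To make the procedure concrete, the following is a minimal PyTorch sketch of this pipeline; it is illustrative only and not the GradVis-based implementation used in our experiments. The helper names (make_hvp, lanczos_topk, nudged_adam_step) and the default values for k and the number of Lanczos iterations are placeholders. The sketch computes Hessian-vector products by double backpropagation (playing the role of the R-operator), runs a few Lanczos iterations with full reorthogonalization to obtain approximate top-k eigenpairs, and projects these directions out of the gradient as in Eq. (12) before an ordinary Adam step.

```python
import torch

def make_hvp(loss, params):
    """Return the flat gradient and a function v -> H v via double backprop,
    so the Hessian is never formed explicitly."""
    grads = torch.autograd.grad(loss, params, create_graph=True)
    g_flat = torch.cat([g.reshape(-1) for g in grads])
    def hvp(v):
        hv = torch.autograd.grad(g_flat.dot(v), params, retain_graph=True)
        return torch.cat([h.reshape(-1) for h in hv]).detach()
    return g_flat.detach(), hvp

def lanczos_topk(hvp, dim, k=2, iters=20, device="cpu"):
    """Plain Lanczos with full reorthogonalization; returns approximate
    top-k eigenvalues and eigenvectors of the Hessian."""
    Q, alphas, betas = [], [], []
    q = torch.randn(dim, device=device)
    q = q / q.norm()
    beta, q_prev = 0.0, torch.zeros(dim, device=device)
    for _ in range(iters):
        w = hvp(q) - beta * q_prev
        alpha = torch.dot(w, q)
        w = w - alpha * q
        for qi in Q:                              # full reorthogonalization
            w = w - torch.dot(w, qi) * qi
        Q.append(q)
        alphas.append(alpha)
        beta = w.norm()
        if beta < 1e-8:
            break
        betas.append(beta)
        q_prev, q = q, w / beta
    T = torch.diag(torch.stack(alphas))           # tridiagonal matrix T (Eq. 9)
    for i in range(len(alphas) - 1):
        T[i, i + 1] = T[i + 1, i] = betas[i]
    evals, evecs = torch.linalg.eigh(T)           # T = U L U^T
    idx = torch.argsort(evals, descending=True)[:k]
    ritz = torch.stack(Q, dim=1) @ evecs[:, idx]  # eigenvectors in parameter space
    return evals[idx], [ritz[:, i] / ritz[:, i].norm() for i in range(ritz.shape[1])]

def nudged_adam_step(loss, params, adam, k=2):
    """One nudged update: project the top-k eigendirections out of the
    gradient (Eq. 12) and let a standard Adam optimizer apply g*."""
    g, hvp = make_hvp(loss, params)
    _, top_vecs = lanczos_topk(hvp, dim=g.numel(), k=k, device=g.device)
    for v in top_vecs:
        g = g - torch.dot(g, v) * v               # g* = g - sum_i <g, v_i> v_i
    offset = 0
    for p in params:                              # write g* back into .grad
        n = p.numel()
        p.grad = g[offset:offset + n].view_as(p).clone()
        offset += n
    adam.step()
```

In such a sketch, each Lanczos iteration costs one Hessian-vector product, i.e. roughly one extra backward pass, so in practice one would recompute the eigenvectors only every few update steps rather than at every iteration.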
In this section, we present a set of experiments to study the loss properties and the instability issues that might occur when training a GAN. We first use the visualization toolbox to inspect the spectrum of GANs during training, and to corroborate the problematic search of an LNE. Then, we examine the top k-eigenvalues of the G and D, and their evolution throughout the training. Finally, we introduce a novel optimizer, called nudged-Adam, to prevent mode collapse, and we test its performance on several datasets to guarantee reliability across different scenarios.

We start our experimental section with a loss landscape visualization that will serve us as a reference point. We believe that building a solid background will help to provide a better understanding of the non-convergent nature of GANs, in particular concerning the G. In order to do this, we track the spectral density throughout the entire optimization process. Our main goal here is to gather evidence from the curvature that visualizes the general problem of finding LNEs. In order to carry out these experiments, we independently analyse the loss landscapes of the G and of the D using their highest eigenvalue, respectively. We employ the toolbox from (Chatzimichailidis et al., 2019) to visualize the loss landscape and the trajectory of our GAN during training. In this way, we can gain some insights into the optimization process that happens underneath. Note that, to obtain the trajectory, we project all the points of training into the 2D plane of the last epoch. Figure 2 shows the landscape after training for 180 epochs on the NSGAN setup.

By inspecting the landscapes, we observe that (1) the D clearly finds a local minimum and descends towards it, and (2) the G ends up in an unstable saddle point, as suggested by the irregular landscapes surrounding it. These findings agree with the aforementioned second-order literature (Berard et al., 2019).

Figure 2: Logarithmic loss landscapes with trajectory of the same training run, visualized along eigenvectors corresponding to the two highest eigenvalues of NSGAN on MNIST. (Left) G loss landscape. (Right) D loss landscape.

After having gained intuition of the training conditions of GANs and their problem to find and remain at an LNE, we now focus our attention on the issue of mode collapse. In particular, we provide empirical evidence of a plausible relationship between mode collapse and the behaviour of the eigenvalues. To this end, we evaluate the spectrum of our model throughout the optimization process. More specifically, we track the largest eigenvalues of the G and of the D for each epoch, together with the IS.

We start training and evaluating the original non-saturating GAN architecture on the MNIST, Kuzushiji, Fashion and EMNIST datasets (see Figure 3). This reveals a number of patterns that are present in all experiments. (1) The evolution of the eigenvalues of the G and D behaves visually very similarly. In particular, when D exhibits an increasing tendency in its eigenvalues, the G does so as well. (2) Apart from the global shape of the dynamics, it is important to evaluate the local behaviour, i.e. the correlation between the G and D. Here, we observe a strong correlation in all our setups, ranging from 0.72 to 0.90. (3) Furthermore, there seems to exist a connection between the IS and the behaviour of the eigenvalues. When the eigenvalues have a decreasing tendency, the IS tends to increase, while when the eigenvalues increase, the IS deteriorates. Moreover, we see how all our models start to suffer from mode collapse after 25 epochs (approximately when the eigenvalue tendency changes and starts to increase).

The empirical observations found in this analysis lead to the conclusion that eigenvalues can give an indication of the state of convergence of a GAN, as pointed out in (Berard et al., 2019). Furthermore, we found that the eigenvalue evolution is correlated with the likely occurrence of a mode collapse event.
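As a sketch of the bookkeeping behind this analysis, the per-epoch top eigenvalue and the reported correlation can be computed with the helpers from the earlier code sketch (make_hvp, lanczos_topk); the functions below are illustrative, and the loss passed in is whichever mini-batch loss one chooses to probe at the end of each epoch.

```python
import numpy as np

def top_eigenvalue(net, loss):
    """Largest Hessian eigenvalue of `loss` w.r.t. the parameters of `net`,
    reusing make_hvp and lanczos_topk from the sketch above."""
    params = [p for p in net.parameters() if p.requires_grad]
    g, hvp = make_hvp(loss, params)
    evals, _ = lanczos_topk(hvp, dim=g.numel(), k=1, device=g.device)
    return float(evals[0])

def eigenvalue_correlation(history_gen, history_disc):
    """Pearson correlation between the per-epoch top-eigenvalue trajectories
    of the generator and the discriminator."""
    return float(np.corrcoef(history_gen, history_disc)[0, 1])
```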
Figure 3: Evolution of the top k-eigenvalues of the Hessian from generator (gen) and discriminator (disc), and the corresponding IS over the whole training phase. The correlation score is measured between the G and the D. (a) Kuzushiji dataset, correlation 0.80. (b) Fashion dataset, correlation 0.90. (c) EMNIST dataset, correlation 0.72.

Figure 4: (First row) Evolution of the top k-eigenvalues of the Hessian, the IS and randomly generated samples at different epochs of NSGAN on MNIST. (Second row) Comparison of the IS evolution of NSGAN and NuGAN, and randomly generated samples at different epochs of NuGAN.

In the last section we have seen that the growth of the Hessian eigenvalues during the training of a GAN correlates with the occurrence of mode collapse. In order to remove this undesirable effect, we train our NSGAN with the nudged-Adam optimizer (referred to as NuGAN), which is inspired by (Jastrzebski et al., 2018). Figure 4 shows the results together with some visual samples generated at different training epochs. We observe that NuGAN achieves a much more stable IS, and this is also reflected in the generated samples. While NSGAN suffers from mode collapse, NuGAN does not (see the samples at epoch 160). This shows a clear relationship between the behaviour of the IS and the occurrence of mode collapse.

Figure 5 shows the full spectrum of the Hessian at different stages of the training. A remarkable observation here is the presence of negative eigenvalues for the G for both optimizers. This indicates that the critical point reached during training is not an LNE (c.f. Formula 4). Rather, the G reaches only a saddle point in all cases. On the other hand, the D seems to converge to a sharp local minimum when using plain GDA. In fact, it seems that the longer training lasts, the sharper the minimum gets. The D of our NuGAN, however, reaches a much flatter minimum, which can be seen from the presence of much smaller eigenvalues towards the end of training.
Figure 5: Plots of the whole spectrum of the Hessian at different stages of the training on MNIST (epochs 2, 25 and 160). (First row) Results on NSGAN: we can identify an abnormal behaviour (mode collapse) in the generator at epoch 160. (Second row) Results on NuGAN: the spectrum remains stable during the whole training. We can observe how the D in both cases finds local minima, while the G remains in a saddle point the whole time.

A second interesting observation is the connection between the spectrum of the G and the mode collapse. In particular, we observe the occurrence of mode collapse when the spectrum spreads significantly (see the first row of Figure 5). On the other hand, the spectral evolution of our NuGAN (see the second row of Figure 5) does not display any anomaly for the G, and indeed no mode collapse event occurs. In Table 1 we show more quantitative results supporting the benefit of our nudged-Adam optimizer approach. There we report the IS for both optimizers evaluated on four different datasets. Notice that in all cases our method achieves a higher mean and maximum score than the NSGAN baseline. These quantitative results, together with the visual inspection of the image quality, suggest that our NuGAN algorithm has a direct influence on the behavior of the eigenvalues and the loss landscape of our adversarial model, resulting in the avoidance of mode collapse.

Overall, we can summarize that the algorithm does not converge to an LNE, while still achieving good results w.r.t. the evaluation metric (IS). This raises the question whether convergence to an LNE is actually needed in order to achieve good generator performance of a GAN.

Table 1: Mean and max IS for the different datasets and methods (with and without mode collapse). Higher values are better.
Dataset     NSGAN mean   NSGAN max   NuGAN mean   NuGAN max
MNIST       4.30         7.03        7.14         8.46
Kuzushiji   5.24         6.50        6.12         7.20
Fashion     5.74         6.82        6.35         7.20
EMNIST      3.77         7.02        8.53         7.67
In this work, we investigate instabilities that occur during the training of GANs, focusing particularly on the issue of mode collapse. To do this, we analyse the loss surfaces of the G and D neural networks using second-order gradient information, with special attention on the Hessian eigenvalues. Hereby, we empirically show that there exists a correlation between the stability of training and the eigenvalues of the generator network. In particular, we observe that large eigenvalues, which may be an indication of convergence towards a sharp minimum, correlate well with the occurrence of mode collapse. Motivated by this observation, we introduce a novel optimization algorithm that uses second-order information to steer away from sharp optima, thereby preventing the occurrence of mode collapse. Our findings suggest that the investigation of generalization properties of GANs, e.g. by analysing the flatness of the optima found during training, is a promising approach to progress towards more stable GAN training as well.

REFERENCES
Abdal, R., Qin, Y., and Wonka, P. (2019). Image2StyleGAN: How to embed images into the StyleGAN latent space? In Proceedings of the IEEE International Conference on Computer Vision, pages 4432–4441.

Adolphs, L., Daneshmand, H., Lucchi, A., and Hofmann, T. (2018). Local saddle point optimization: A curvature exploitation approach. arXiv preprint arXiv:1805.05751.

Alain, G., Roux, N. L., and Manzagol, P.-A. (2019). Negative eigenvalues of the Hessian in deep neural networks. arXiv preprint arXiv:1902.02366.

Arjovsky, M., Chintala, S., and Bottou, L. (2017). Wasserstein GAN. arXiv preprint arXiv:1701.07875.

Berard, H., Gidel, G., Almahairi, A., Vincent, P., and Lacoste-Julien, S. (2019). A closer look at the optimization landscapes of generative adversarial networks. arXiv preprint arXiv:1906.04848.

Chatzimichailidis, A., Keuper, J., Pfreundt, F.-J., and Gauger, N. R. (2019). GradVis: Visualization and second order analysis of optimization surfaces during the training of deep neural networks. In 2019 IEEE/ACM Workshop on Machine Learning in High Performance Computing Environments (MLHPC), pages 66–74. IEEE.

Chaudhari, P., Choromanska, A., Soatto, S., LeCun, Y., Baldassi, C., Borgs, C., Chayes, J., Sagun, L., and Zecchina, R. (2019). Entropy-SGD: Biasing gradient descent into wide valleys. Journal of Statistical Mechanics: Theory and Experiment, 2019(12):124018.

Dinh, L., Pascanu, R., Bengio, S., and Bengio, Y. (2017). Sharp minima can generalize for deep nets. CoRR, abs/1703.04933.

Draxler, F., Veschgini, K., Salmhofer, M., and Hamprecht, F. A. (2018). Essentially no barriers in neural network energy landscape. arXiv preprint arXiv:1803.00885.

Durall, R., Keuper, M., and Keuper, J. (2020). Watch your up-convolution: CNN based generative deep neural networks are failing to reproduce spectral distributions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7890–7899.

Durall, R., Pfreundt, F.-J., Köthe, U., and Keuper, J. (2019). Object segmentation using pixel-wise adversarial loss. In German Conference on Pattern Recognition, pages 303–316. Springer.

Fiez, T., Chasnov, B., and Ratliff, L. J. (2019). Convergence of learning dynamics in Stackelberg games. arXiv preprint arXiv:1906.01217.

Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., and Bengio, Y. (2014a). Generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2672–2680.

Goodfellow, I. J., Vinyals, O., and Saxe, A. M. (2014b). Qualitatively characterizing neural network optimization problems. arXiv preprint arXiv:1412.6544.

Gulrajani, I., Ahmed, F., Arjovsky, M., Dumoulin, V., and Courville, A. C. (2017). Improved training of Wasserstein GANs. In Advances in Neural Information Processing Systems, pages 5767–5777.

Hochreiter, S. and Schmidhuber, J. (1997). Flat minima. Neural Computation, 9(1):1–42.

Iizuka, S., Simo-Serra, E., and Ishikawa, H. (2017). Globally and locally consistent image completion. ACM Transactions on Graphics (ToG), 36(4):107.

Izmailov, P., Podoprikhin, D., Garipov, T., Vetrov, D., and Wilson, A. G. (2018). Averaging weights leads to wider optima and better generalization. arXiv preprint arXiv:1803.05407.

Jastrzebski, S., Kenton, Z., Ballas, N., Fischer, A., Bengio, Y., and Storkey, A. (2018). On the relation between the sharpest directions of DNN loss and the SGD step length. arXiv preprint arXiv:1807.05031.

Karras, T., Laine, S., Aittala, M., Hellsten, J., Lehtinen, J., and Aila, T. (2020). Analyzing and improving the image quality of StyleGAN. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8110–8119.

Keskar, N. S., Mudigere, D., Nocedal, J., Smelyanskiy, M., and Tang, P. T. P. (2016). On large-batch training for deep learning: Generalization gap and sharp minima. arXiv preprint arXiv:1609.04836.

Kingma, D. P. and Ba, J. (2014). Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.

Lanczos, C. (1950). An iteration method for the solution of the eigenvalue problem of linear differential and integral operators. United States Governm. Press Office, Los Angeles, CA.

Mescheder, L., Nowozin, S., and Geiger, A. (2017). The numerics of GANs. In Advances in Neural Information Processing Systems, pages 1825–1835.

Nagarajan, V. and Kolter, J. Z. (2017). Gradient descent GAN optimization is locally stable. In Advances in Neural Information Processing Systems, pages 5585–5595.

Radford, A., Metz, L., and Chintala, S. (2015). Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434.

Sagun, L., Bottou, L., and LeCun, Y. (2016). Eigenvalues of the Hessian in deep learning: Singularity and beyond. arXiv preprint arXiv:1611.07476.

Salimans, T., Goodfellow, I., Zaremba, W., Cheung, V., Radford, A., and Chen, X. (2016). Improved techniques for training GANs. In Advances in Neural Information Processing Systems, pages 2234–2242.

Xue, Y., Xu, T., Zhang, H., Long, L. R., and Huang, X. (2018). SegAN: Adversarial network with multi-scale L1 loss for medical image segmentation. Neuroinformatics, 16(3-4):383–392.
Yu, J., Lin, Z., Yang, J., Shen, X., Lu, X., and Huang, T. S. (2019). Free-form image inpainting with gated convolution. In Proceedings of the IEEE/CVF International Conference on Computer Vision.