MSR-GAN: Multi-Segment Reconstruction via Adversarial Learning
Mona Zehni, Zhizhen Zhao
Department of ECE and CSL, University of Illinois at Urbana-Champaign
ABSTRACT
Multi-segment reconstruction (MSR) is the problem of estimating a signal given noisy partial observations, where each observation corresponds to a randomly located segment of the signal. While previous works address this problem using template or moment matching, in this paper we address MSR from an unsupervised adversarial learning standpoint, named MSR-GAN. We formulate MSR as a distribution matching problem in which the goal is to recover the signal and the probability distribution of the segment locations such that the distribution of measurements generated through a known forward model is close to that of the real observations. This is achieved by solving a min-max optimization involving a generator-discriminator pair. MSR-GAN is mainly inspired by CryoGAN [1]. However, in MSR-GAN we no longer assume the probability distribution of the latent variables, i.e. the segment locations, is given, and we seek to recover it alongside the unknown signal. For this purpose, we show that the loss at the generator side is originally non-differentiable with respect to the segment distribution. Thus, we propose to approximate it using the Gumbel-Softmax reparametrization trick. Our proposed solution is generalizable to a wide range of inverse problems. Our simulation results and comparisons with various baselines verify the potential of our approach in different settings.
Index Terms— Multi-segment reconstruction, adversarial learning, unsupervised learning, Gumbel-Softmax approximation, categorical distribution.
1. INTRODUCTION
The problem of recovering a signal from a set of noisy partial observations appears in a wide range of applications including genomic sequence assembly [2], puzzle solving [3], tomographic reconstruction [4] and cryo-electron microscopy (cryo-EM) [5, 6], to name a few. In this paper, we focus on multi-segment reconstruction (MSR) [7], where the unknown is a 1D sequence and the measurements are noisy, randomly located partial observations (segments) of this sequence. A schematic illustration of MSR is provided in Fig. 1. MSR is a general form of the multi-reference alignment (MRA) [8] problem, in which the measurements are noisy randomly shifted versions of the signal. While in MRA the length of each measurement is the same as that of the signal, in MSR the measurements can be shorter.

Current efforts devoted to MSR fall into two broad categories: 1) alignment-based and 2) alignment-free. In one form of alignment-based methods, the segment location corresponding to each observation is estimated. The observations are then aligned accordingly and averaged. While these methods have low computational and sample complexity, low signal-to-noise ratio (SNR) of the observations adversely affects their performance. Examples of alignment-based methods applied to MRA and tomographic reconstruction are found in [9, 10]. In other forms of alignment-based methods, the segment locations and the 1D sequence are
Fig. 1. Multi-segment reconstruction (MSR) problem.

jointly updated using alternating steps. An example is the maximum likelihood formulation of MSR, solved using expectation-maximization (EM). Despite the robustness of EM to different noise regimes, it suffers from high computational complexity. This is due to the complexity of the E-step, which requires a whole pass through the measurements at every iteration. This is significantly time-consuming, especially in the presence of a large number of observations.

Alignment-free solutions specifically designed for MRA sidestep the estimation of the random shifts by introducing a set of invariant features. These features constitute the moments of the signal and are estimated from the measurements. The signal is then estimated from the features via an optimization-based framework [8, 11], tensor decomposition [12, 13] using Jennrich's algorithm [14] or spectral decomposition [15, 16]. As these works are specialized for MRA, they do not address the challenges associated with MSR, such as observing only shorter segments of the signal. In [7], we showed how, for MSR, the invariant features can be estimated from the measurements, and how the recovery of the signal is tied to the segment length. Compared to alignment-based solutions, alignment-free methods require only a single pass through the measurements to estimate the features, and are thus computationally more efficient. The estimated features then serve as a compact representation of the measurements, which are functions of the unknown signal and the segment location distribution.

In this paper, we propose an alignment-free adversarial learning based method for MSR. Our goal is to find the unknown 1D signal and the distribution of the segment locations such that the measurements generated from the estimated signal match the real measurements in a distribution sense.
Therefore, we train a generator-discriminator pair, where the discriminator tries to distinguish between the measurements output by the generator and the real ones. Our approach is inspired by CryoGAN [1], in which the goal is to reconstruct a 3D structure given 2D noisy projection images from unknown projection views. Unlike CryoGAN, we assume the distribution of the latent variables, i.e. the segment locations in MSR, is unknown, and we seek to recover it alongside the signal. For this purpose, we modify the loss at the generator side using the Gumbel-Softmax approximation of the categorical distribution, to accommodate gradient-based updates of the segment location distribution. Our simulation results and comparisons with several baselines confirm the feasibility of our approach in various segment length and noise regimes. Our code is available at https://github.com/MonaZI/MSR-GAN.

Fig. 2. An illustration of the MSR-GAN pipeline.
2. SYSTEM MODEL
We consider the following observation model:

    ξ_j = M_{s_j} x + ε_j,   j ∈ {1, 2, ..., N}   (1)

where x ∈ R^d is the underlying signal and ξ_j ∈ R^m, m ≤ d, is the j-th observation. We often refer to m as the segment length. The cyclic masking operator M_s captures m consecutive entries of x starting from index s. In other words, M_s : R^d → R^m and (M_s x)[n] = x[(n + s) mod d]. We also assume the segment location s ∈ {0, 1, ..., d − 1} to be unknown and randomly drawn from a categorical distribution with p as its probability mass function (PMF), where P{s = s_j} = p[s_j]. In addition, the randomly located segment of the signal is contaminated by additive white Gaussian noise ε_j with zero mean and covariance σ^2 I_m (I_m denoting the identity matrix of size m × m). Our goal is to recover x and p given the noisy partial observations {ξ_j}_{j=1}^N.

Note that the distribution of the observations depends on both the signal x and the distribution of the segment locations p. Thus, it is possible to estimate x and p by matching the distribution of observations generated by x and p following (1) to that of the real measurements.
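As a concrete illustration, the forward model (1) takes only a few lines to simulate. The following is a minimal NumPy sketch (the function names are ours for illustration, not from the MSR-GAN code):

```python
import numpy as np

def cyclic_mask(x, s, m):
    """The masking operator M_s: m consecutive entries of x starting at index s, cyclically."""
    d = len(x)
    return x[(s + np.arange(m)) % d]

def simulate_measurements(x, p, m, sigma, N, rng):
    """Draw N observations xi_j = M_{s_j} x + eps_j with s_j ~ Categorical(p)."""
    d = len(x)
    s = rng.choice(d, size=N, p=p)                      # segment locations
    xi = np.stack([cyclic_mask(x, sj, m) for sj in s])  # clean segments
    return xi + sigma * rng.standard_normal((N, m))     # additive white Gaussian noise

# toy example: d = 8, m = 3, uniform segment distribution
rng = np.random.default_rng(0)
x = np.arange(8, dtype=float)
xi = simulate_measurements(x, np.full(8, 1 / 8), m=3, sigma=0.1, N=5, rng=rng)
print(xi.shape)  # (5, 3)
```

Matching the empirical distribution of such simulated batches to that of the real observations is exactly the task handed to the GAN in the next section.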
3. METHOD
We use an unsupervised adversarial learning approach to solve MSR. Our method is unsupervised as it only relies on the given observations and does not use large paired datasets for training. Similar to [1], our method aims to find x and p such that the distribution of the partial noisy measurements generated from (1) matches that of the real measurements {ξ_j^real}_{j=1}^N. To this end, we use a generative adversarial network (GAN) [17]. Unlike common GAN models, we use the known forward model in (1) to map the signal and segment distribution to the measurements. Thus, the generator acts upon x and p and simulates noisy measurements {ξ_j^sim}_{j=1}^M. The discriminator's task is then to distinguish between the real measurements and the fake measurements from the generator. An illustration of MSR-GAN is provided in Fig. 2. Here we use Wasserstein GAN [18] with gradient penalty (WGAN-GP) [19], to benefit from its favorable convergence behaviour. In WGAN, the output of the discriminator is a score: the more the input resembles ξ^real, the higher the score. The min-max formulation of the problem is:

    x̂, p̂ = arg min_{x,p} max_φ L(φ, x, p)   (2)

    L(φ, x, p) = Σ_{b=1}^B [ D_φ(ξ_b^real) − D_φ(ξ_b^sim) − λ GP(ξ_b^int) ]   (3)

Algorithm 1 MSR-GAN
Require: α_φ, α_x, α_p: learning rates for the discriminator, the signal and the segment location distribution. λ: gradient penalty weight. n_disc: the number of iterations of the discriminator (critic) per generator iteration.
Require: Initialize x randomly and p with a uniform distribution, i.e. p[s] = 1/d.
Output: Estimates x̂ and p̂ given {ξ_j^real}_{j=1}^N.
while not converged do
    for t = 0, ..., n_disc − 1 do
        Sample a batch from the real data, {ξ_b^real}_{b=1}^B
        Sample a batch of simulated measurements using the estimated signal and PMF, i.e. {ξ_b^sim}_{b=1}^B where ξ_b^sim = M_s x + ε_b, s ∼ p, ε_b ∼ N(0, σ^2 I_m)
        Generate interpolated samples {ξ_b^int}_{b=1}^B, ξ_b^int = α ξ_b^real + (1 − α) ξ_b^sim with α ∼ Unif(0, 1)
        Update the discriminator using gradient ascent steps with
            ∇_φ L_D(φ) = ∇_φ ( Σ_{b=1}^B [ D_φ(ξ_b^real) − D_φ(ξ_b^sim) − λ GP(ξ_b^int) ] )
    end for
    Sample a batch of {q_{s,b}}_{b=1}^B using (8)
    Update x and p using gradient descent steps with
        ∇_{x,p} L_G(x, p) = ∇_{x,p} ( − Σ_{b=1}^B Σ_{s=0}^{d−1} q_{s,b} D_φ(M_s x + ε_b) )
end while

    GP(ξ_b^int) = ( ‖∇_ξ D_φ(ξ_b^int)‖_2 − 1 )^2   (4)

where L denotes the loss, which is a function of the discriminator's parameters φ, the signal and the PMF. Also, B is the batch size, D_φ denotes the discriminator parameterized by φ, and ξ^sim = M_s x + ε with s ∼ p and ε ∼ N(0, σ^2 I_m). The weight of the gradient penalty term (GP) is λ, and ξ^int is a sample generated by linear interpolation between a real and a simulated measurement, i.e. ξ^int = α ξ^real + (1 − α) ξ^sim, α ∼ Unif(0, 1). To solve the min-max optimization in (2), following common practice, we take alternating steps to update the discriminator's parameters φ and the generator, i.e. x and p, using their gradients.

To update p, we need to take gradients of (3) with respect to p. However, this loss function is related to p through a sampling operator which is non-differentiable (we sample the segment locations based on the distribution p). This is problematic at the generator update steps. Therefore, it is crucial to devise a way to obtain a meaningful gradient with respect to p.
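To make the gradient-penalty term (4) concrete, the following NumPy sketch evaluates it for a toy two-layer critic. The critic here is a stand-in for the actual discriminator, chosen so that its input gradient has a closed form; its architecture and all names are purely illustrative:

```python
import numpy as np

def critic(xi, W1, w2):
    """Toy critic D(xi) = w2 . relu(W1 xi), standing in for the trained discriminator."""
    return float(w2 @ np.maximum(W1 @ xi, 0.0))

def critic_input_grad(xi, W1, w2):
    """Analytic gradient of the toy critic with respect to its input xi."""
    active = (W1 @ xi > 0).astype(float)   # ReLU derivative per hidden unit
    return W1.T @ (w2 * active)

def gradient_penalty(xi_real, xi_sim, W1, w2, rng):
    """One-sample GP term (4): (||grad_xi D(xi_int)||_2 - 1)^2 at a random interpolate."""
    a = rng.uniform()                                # alpha ~ Unif(0, 1)
    xi_int = a * xi_real + (1 - a) * xi_sim
    g = critic_input_grad(xi_int, W1, w2)
    return (np.linalg.norm(g) - 1.0) ** 2

rng = np.random.default_rng(1)
m = 4
W1, w2 = rng.standard_normal((6, m)), rng.standard_normal(6)
gp = gradient_penalty(rng.standard_normal(m), rng.standard_normal(m), W1, w2, rng)
```

In the actual training loop this term is computed by automatic differentiation (e.g. in PyTorch); the closed-form gradient above simply makes the quantity being penalized explicit.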
First, let us take a closer look at the loss function that is minimized at the generator side:

    L_G(x, p) = − Σ_{b=1}^B D_φ(M_{s_b} x + ε_b),   s_b ∼ p,  ε_b ∼ N(0, σ^2 I_m)   (5)
              = − Σ_{b=1}^B Σ_{s=0}^{d−1} δ(s − s_b) D_φ(M_s x + ε_b)   (6)

where δ is the Kronecker delta and δ(s − s_b) denotes the one-hot representation of a sample drawn from a categorical distribution with PMF p. Jang et al. [20] proposed a Gumbel-Softmax reparametrization trick to approximate samples from a categorical distribution with a differentiable function. We use this idea and replace δ(s − s_b), s_b ∼ p, with a sample from the Gumbel-Softmax distribution, i.e.

    L_G(x, p) ≈ − Σ_{b=1}^B Σ_{s=0}^{d−1} q_{s,b} D_φ(M_s x + ε_b)   (7)

where

    q_{s,b} = exp((g_{b,s} + log p[s]) / τ) / Σ_{i=0}^{d−1} exp((g_{b,i} + log p[i]) / τ),   g_{b,s} ∼ Gumbel(0, 1).   (8)

Note that (8) is a continuous approximation of the argmax function, τ is the softmax temperature factor, and q_{s,b} → δ(s − argmax_s (g_{b,s} + log p[s])) as τ → 0. Note that drawing samples via argmax_s (g_{b,s} + log p[s]), g_{b,s} ∼ Gumbel(0, 1), is an efficient way of sampling from the distribution p [20]. Furthermore, to obtain samples from the Gumbel distribution [21], it suffices to transform samples from a uniform distribution using g = −log(−log(u)), u ∼ Unif(0, 1).

Fig. 3. Comparison of MSR-GAN in different noise regimes for 1) known PMF (first column), 2) unknown PMF, fixed to the uniform distribution during training (second column), 3) unknown PMF, recovered during training (third column). The last column plots the ground truth PMF (green dashed curve) alongside the estimated PMFs from MSR-GAN (the same experiment as the third column) in blue and red. Each row corresponds to a different signal and PMF. The relative error of the reconstruction is written underneath each subplot in blue (SNR = ∞) and red (SNR = 1). For all experiments in this figure we use the same architecture for the discriminator with ℓ = 100, and the number of measurements is N = 5 × ….
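The reparametrization in (8) is straightforward to implement. Below is a minimal NumPy sketch (function name is ours): at a moderate temperature the sample q is a smooth weighting over segment locations, while as τ → 0 it approaches a one-hot vector.

```python
import numpy as np

def gumbel_softmax_sample(p, tau, rng):
    """One sample q from (8): q[s] = softmax((g_s + log p[s]) / tau), g_s ~ Gumbel(0, 1)."""
    u = rng.uniform(size=p.shape)
    g = -np.log(-np.log(u))        # Gumbel(0,1) via inverse transform, as in the text
    z = (g + np.log(p)) / tau
    z = z - z.max()                # shift for numerical stability of the softmax
    q = np.exp(z)
    return q / q.sum()

rng = np.random.default_rng(0)
p = np.array([0.1, 0.6, 0.3])
q_soft = gumbel_softmax_sample(p, tau=1.0, rng=rng)    # smooth, differentiable weights
q_hard = gumbel_softmax_sample(p, tau=0.01, rng=rng)   # nearly one-hot as tau -> 0
```

Because q is a deterministic, differentiable function of p once the Gumbel noise g is drawn, gradients of (7) with respect to p flow through q_{s,b}.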
4. IMPLEMENTATION DETAILS
We present the pseudo-code for MSR-GAN in Alg. 1. In all our experiments, we use a batch size of B = 200 and keep the number of real measurements at N = 3 × … unless otherwise mentioned. We have three separate learning rates for the discriminator, the signal and the PMF, denoted by α_φ, α_x and α_p; in most experimental settings we keep α_x = α_p. We reduce the learning rates by a factor of …, with different schedules for the different learning rates. We use SGD [22] as the optimizer for the discriminator and the signal x, with a momentum of …. We also update p using gradient descent steps after normalizing the corresponding gradients. We clip the gradients of the discriminator to have norm …. Following common practice, we train the discriminator n_disc = 4 times per update of x and p. To stabilize the updates with respect to p, we choose τ = 0.… in our experiments. We also use spectral normalization to stabilize the training [23].

Our discriminator architecture consists of three fully connected (FC) layers with sizes ℓ, ℓ/…, and the output, where ℓ is chosen per experiment. We use ReLU [24] for the non-linear activations between the FC layers. We initialize the layers with weights drawn from a normal distribution with zero mean and a standard deviation of 0.…. We train MSR-GAN for … and … iterations for the high and low SNR regimes, respectively. To enforce p to have non-negative values while adding up to one, we set it to be the output of a Softmax layer. Our implementation is in PyTorch and runs on a single GPU.
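The discriminator described above can be sketched as a small fully connected network. The following NumPy forward pass is illustrative only (the actual model is a PyTorch module trained with SGD); the second hidden width ℓ // 2 and the scalar output are our assumptions, since the critic outputs a single score:

```python
import numpy as np

def init_discriminator(m, ell, rng, std=0.02):
    """Three FC layers; the hidden widths (ell, ell // 2) and init std are illustrative."""
    return [std * rng.standard_normal((ell, m)),
            std * rng.standard_normal((ell // 2, ell)),
            std * rng.standard_normal((1, ell // 2))]

def discriminator_score(xi, weights):
    """Forward pass FC -> ReLU -> FC -> ReLU -> FC, returning a scalar critic score."""
    W1, W2, W3 = weights
    h = np.maximum(W1 @ xi, 0.0)
    h = np.maximum(W2 @ h, 0.0)
    return float(W3 @ h)

rng = np.random.default_rng(0)
weights = init_discriminator(m=24, ell=100, rng=rng)   # m = 24, ell = 100 as in Fig. 3
score = discriminator_score(rng.standard_normal(24), weights)
```

The input dimension equals the segment length m, since the critic scores individual measurements.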
5. NUMERICAL RESULTS
In this section we first provide details on our evaluation metrics andbaselines. Next, we discuss our results.
Evaluation metrics and baselines: The SNR of the observations is defined as the variance of the clean measurements divided by the variance of the noise. As the signal and PMF are reconstructed up to a random global shift, we align the reconstructions before comparing them to the ground truths. As the quantitative measure of performance, we use the relative error (rel-error) between the aligned estimated signal x̂ and the ground truth x, defined as rel-error = min_s ‖x − R_s x̂‖_2 / ‖x‖_2, where R_s shifts its input by s ∈ {0, ..., d − 1}. To assess the quality of the estimated PMF, we use the total variation (TV) distance, defined as TV = min_s (1/2) ‖p − R_s p̂‖_1 [25]. We also define a success rate by running the MSR solutions with different initializations: the fraction of initializations that lead to a relative error below a threshold of 0.… is reported as the success rate.
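The shift-aligned metrics above can be sketched as follows. This is a NumPy illustration with our own function names; the 1/2 factor in the TV distance follows the standard definition:

```python
import numpy as np

def rel_error(x, x_hat):
    """min_s ||x - R_s x_hat||_2 / ||x||_2, with R_s a cyclic shift by s."""
    best = min(np.linalg.norm(x - np.roll(x_hat, s)) for s in range(len(x)))
    return best / np.linalg.norm(x)

def tv_distance(p, p_hat):
    """Shift-aligned total variation distance: min_s 0.5 * ||p - R_s p_hat||_1."""
    return min(0.5 * np.abs(p - np.roll(p_hat, s)).sum() for s in range(len(p)))

x = np.array([1.0, 2.0, 3.0, 4.0])
print(rel_error(x, np.roll(x, 2)))   # 0.0: a pure cyclic shift is a perfect reconstruction
```

Minimizing over all d shifts makes both metrics invariant to the global shift ambiguity inherent to MSR.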
Fig. 4. Effect of segment length on the success rate of 1) MSR-GAN (blue curve), 2) MSR-SIF (red curve), 3) EM (green curve). In this experiment the signal length is d = 60, the signal is generated randomly and σ = 0.…. The success rate is computed over … random initializations for each segment length value. All three methods are initialized with the same random x and p.

We compare MSR-GAN to two baselines: 1) estimating shift-invariant features, i.e. moments up to the third order, from the measurements and recovering x and p by solving a non-convex optimization problem [7]. We call this baseline MSR via shift-invariant features (MSR-SIF). We use the Riemannian trust-regions method [26] implemented in Manopt [27] to solve the optimization problem. 2) Expectation-maximization (EM). In this baseline, we formulate MSR as a maximum marginalized likelihood estimation problem and solve it via EM [8, 7].

Effect of knowledge of the PMF on the MSR-GAN results: Figure 3 shows the results of MSR-GAN on different signals with d = 64 and m = 24 in three different scenarios: 1) p is known (first column), 2) p is not known but fixed to a uniform distribution during training (second column), 3) p is not known and we recover it alongside x (third and fourth columns). Note that for all three scenarios, we are using Alg.
1 with the same discriminator architecture and ℓ = 100. However, for the first and second scenarios, we do not update p (we skip the p-update step), keeping it fixed to the true and the uniform distribution, respectively.

Note that when the PMF is known, the results of MSR-GAN closely match the ground truth signal. When the PMF is unknown, if we fix p to be a uniform distribution (see the second column of Fig. 3), we observe that although the reconstructed signal is close to the ground truth, it has a larger relative error compared to the scenarios where the PMF is given (see the first column of Fig. 3) or the PMF is updated jointly with the signal (see the third column of Fig. 3). Updating p jointly with x (Fig. 3, third column) leads to a more accurate reconstruction of the signal. This shows the importance of recovering the distribution of the segments.

Effect of segment length and comparison with baselines: Figure 4 illustrates the effect of the segment length m on the success rate of MSR-GAN compared to the other two baselines. For this experiment, we set the network hyper-parameter ℓ = 300 and test the performance of our algorithm on randomly generated signals of length d = 60. As discussed in [7], solving MSR for smaller segment
Fig. 5. Comparison of MSR-GAN with different baselines in terms of relative error versus the SNR of the observations. In this experiment d = 60 and m = 18. All three methods have been initialized with the same signal and PMF, and the reported results are the median across different initializations and noise realizations for the observations.

length regimes using shift-invariant features is more challenging, as the number of equations provided by the moments for smaller segment lengths can be less than the number of unknowns. Similarly, the EM algorithm fails at shorter segments, i.e. m ≤ …, where the success rate is less than …. EM is more likely to get stuck at a locally optimal solution when the segment length becomes smaller. However, as MSR-GAN solves the inverse problem by matching the distribution of the real measurements via stochastic gradient descent, it achieves higher success rates for smaller segment lengths. In particular, even at m = 15, MSR-GAN achieves a success rate close to ….

Effect of noise and comparison with baselines: In Fig. 5, we investigate the effect of noise on the performance of MSR-GAN compared to the baselines. For this experiment d = 60, m = 18, and for the discriminator's architecture we set ℓ = 300. Note that in the different noise regimes MSR-GAN outperforms MSR-SIF and EM. Here we have a short segment length; thus, as mentioned earlier, solving MSR is more challenging and both baselines get stuck in local minima that are not close to the ground truth solution. Note that if we increase the segment length we observe an improved reconstruction error and success rate for MSR-SIF and EM (as also observed in Fig. 4). This suggests that MSR-GAN is a better solution compared to the baselines in short segment length regimes.
6. CONCLUSION
In this paper, we focused on the multi-segment reconstruction (MSR) problem, where we are given noisy, randomly located segments of an unknown signal and the goal is to recover the signal and the distribution of the segments. We proposed a novel adversarial learning based approach to solve MSR. Our approach relies on distribution matching between the real measurements and the ones generated by the estimated signal and segment distribution. We formulated our problem in a Wasserstein GAN based framework. We showed how the generator loss term is a non-differentiable function of the segment distribution. To facilitate updating the distribution through its gradients, we approximated the loss function at the generator side using the Gumbel-Softmax reparametrization trick. This allowed us to update both the signal and the segment distribution using stochastic gradient descent. Our simulation results and comparisons to various baselines verified the ability of our approach to accurately solve MSR across various noise regimes and segment lengths.

7. REFERENCES

[1] H. Gupta, M. T. McCann, L. Donati, and M. Unser, "CryoGAN: A new reconstruction paradigm for single-particle cryo-EM via deep adversarial learning," bioRxiv, 2020.
[2] A. S. Motahari, G. Bresler, and D. N. C. Tse, "Information theory of DNA shotgun sequencing," IEEE Transactions on Information Theory, vol. 59, pp. 6273–6289, 2013.
[3] G. Paikin and A. Tal, "Solving multiple square jigsaw puzzles with missing pieces," in Computer Vision and Pattern Recognition (CVPR), 2015 IEEE Conference on, 2015, pp. 161–174.
[4] M. Willemink and P. Noël, "The evolution of image reconstruction for CT—from filtered back projection to artificial intelligence," European Radiology, vol. 29, 10 2018.
[5] A. Barnett, L. Greengard, A. Pataki, and M. Spivak, "Rapid solution of the cryo-EM reconstruction problem by frequency marching," SIAM Journal on Imaging Sciences, vol. 10, no. 3, pp. 1170–1195, 2017.
[6] A. Punjani, M. A. Brubaker, and D. J. Fleet, "Building proteins in a day: Efficient 3D molecular structure estimation with electron cryomicroscopy," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 39, pp. 706–718, 2017.
[7] M. Zehni, M. N. Do, and Z. Zhao, "Multi-segment reconstruction using invariant features," 2018, pp. 4629–4633.
[8] T. Bendory, N. Boumal, C. Ma, Z. Zhao, and A. Singer, "Bispectrum inversion with application to multireference alignment," IEEE Transactions on Signal Processing, vol. 66, no. 4, pp. 1037–1050, 2017.
[9] Y. Chen and E. Candès, "The projected power method: An efficient algorithm for joint alignment from pairwise differences," Communications on Pure and Applied Mathematics, vol. 71, 09 2016.
[10] S. Basu and Y. Bresler, "Feasibility of tomography with unknown view angles," IEEE Transactions on Image Processing, vol. 9, no. 6, pp. 1107–1122, Jun 2000.
[11] N. Boumal, T. Bendory, R. R. Lederman, and A. Singer, "Heterogeneous multireference alignment: a single pass approach," ArXiv e-prints, Oct. 2017.
[12] A. S. Bandeira, J. Niles-Weed, and P. Rigollet, "Optimal rates of estimation for multi-reference alignment," Mathematical Statistics and Learning, vol. 2, no. 1, pp. 25–75, 2020.
[13] T. G. Kolda and B. W. Bader, "Tensor decompositions and applications," SIAM Review, vol. 51, no. 3, pp. 455–500, 2009.
[14] R. Harshman, "Foundations of the PARAFAC procedure: Models and conditions for an "explanatory" multi-modal factor analysis," UCLA Working Papers in Phonetics, vol. 16, 1970.
[15] H. Chen, M. Zehni, and Z. Zhao, "A spectral method for stable bispectrum inversion with application to multireference alignment," IEEE Signal Processing Letters, vol. 25, no. 7, pp. 911–915, 2018.
[16] E. Abbe, J. M. Pereira, and A. Singer, "Sample complexity of the boolean multireference alignment problem," IEEE, 2017, pp. 1316–1320.
[17] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, "Generative adversarial nets," in Advances in Neural Information Processing Systems 27, Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger, Eds., pp. 2672–2680. Curran Associates, Inc., 2014.
[18] M. Arjovsky, S. Chintala, and L. Bottou, "Wasserstein generative adversarial networks," in Proceedings of the 34th International Conference on Machine Learning - Volume 70, 2017, pp. 214–223.
[19] I. Gulrajani, F. Ahmed, M. Arjovsky, V. Dumoulin, and A. Courville, "Improved training of Wasserstein GANs," in Proceedings of the 31st International Conference on Neural Information Processing Systems, Red Hook, NY, USA, 2017, NIPS'17, pp. 5769–5779, Curran Associates Inc.
[20] E. Jang, S. Gu, and B. Poole, "Categorical reparameterization with Gumbel-Softmax," arXiv preprint arXiv:1611.01144, 2016.
[21] C. J. Maddison, D. Tarlow, and T. Minka, "A* sampling," in Advances in Neural Information Processing Systems 27, Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger, Eds., pp. 3086–3094. Curran Associates, Inc., 2014.
[22] L. Bottou, "Large-scale machine learning with stochastic gradient descent," Proc. of COMPSTAT, 01 2010.
[23] T. Miyato, T. Kataoka, M. Koyama, and Y. Yoshida, "Spectral normalization for generative adversarial networks," in International Conference on Learning Representations, 2018.
[24] B. Xu, N. Wang, T. Chen, and M. Li, "Empirical evaluation of rectified activations in convolutional network," arXiv preprint arXiv:1505.00853, 2015.
[25] T. Chen and S. Kiefer, "On the total variation distance of labelled Markov chains," in Proceedings of the Joint Meeting of the Twenty-Third EACSL Annual Conference on Computer Science Logic (CSL) and the Twenty-Ninth Annual ACM/IEEE Symposium on Logic in Computer Science (LICS), New York, NY, USA, 2014, CSL-LICS '14, Association for Computing Machinery.
[26] P.-A. Absil, C. G. Baker, and K. A. Gallivan, "Trust-region methods on Riemannian manifolds," Found. Comput. Math., vol. 7, no. 3, pp. 303–330, July 2007.
[27] N. Boumal, B. Mishra, P.-A. Absil, and R. Sepulchre, "Manopt, a Matlab toolbox for optimization on manifolds,"