Learning Implicit Generative Models with Theoretical Guarantees
Yuan Gao∗   Jian Huang†   Yuling Jiao‡   Jin Liu§

February 18, 2020

∗ School of Mathematics and Statistics, Xi’an Jiaotong University, China ([email protected])
† Department of Statistics and Actuarial Science, University of Iowa, Iowa City, IA 52242 ([email protected])
‡ School of Mathematics and Statistics, Wuhan University, Wuhan 430072, China ([email protected])
§ Center of Quantitative Medicine, Duke-NUS Medical School, Singapore ([email protected])
Abstract
We propose a unified framework for implicit generative modeling (UnifiGem) with theoretical guarantees by integrating approaches from optimal transport, numerical ODE, density-ratio (density-difference) estimation and deep neural networks. First, the problem of implicit generative learning is formulated as that of finding the optimal transport map between the reference distribution and the target distribution, which is characterized by a totally nonlinear Monge-Ampère equation. Interpreting the infinitesimal linearization of the Monge-Ampère equation from the perspective of gradient flows in measure spaces leads to the continuity equation or the McKean-Vlasov equation. We then solve the McKean-Vlasov equation numerically using the forward Euler iteration, where the forward Euler map depends on the density ratio (density difference) between the distribution at the current iteration and the underlying target distribution. We further estimate the density ratio (density difference) via deep density-ratio (density-difference) fitting and derive explicit upper bounds on the estimation error. Experimental results on both synthetic datasets and real benchmark datasets support our theoretical findings and demonstrate the effectiveness of UnifiGem.

Keywords: Deep generative model, Optimal transport, Continuity equation, McKean-Vlasov equation, Deep density-ratio (density-difference) fitting, Nonparametric estimation error.
Running title: UnifiGem
Introduction

The ability to efficiently model complex data and sample from complex distributions plays a key role in a variety of prediction and inference tasks in machine learning and statistics [52]. The long-standing methodology for learning an underlying distribution relies on an explicit statistical data model, which can be difficult to specify in many modern machine learning tasks such as image analysis, computer vision and natural language processing. In contrast, implicit generative models do not assume a specific form of the data distribution, but rather learn a nonlinear map to transform a simple reference distribution to the underlying target distribution. This modeling approach has been shown to achieve state-of-the-art performance in many machine learning tasks [49, 64]. Generative adversarial networks (GAN) [21], variational auto-encoders (VAE) [31] and flow-based methods [50] are important representatives of implicit generative models.

GANs model the low-dimensional latent structure via deep nonlinear factors. They are trained by sequential differentiable surrogates of two-sample tests, including the density-ratio test [21, 46, 42, 45, 59] and the density-difference test [37, 58, 36, 6, 9], among others. VAE is a probabilistic deep nonlinear factor model trained with variational inference and stochastic approximation. Several authors have proposed improved versions of VAE by enhancing the disentangled representation power of the learned latent codes and reducing the blurriness of the generated images in vanilla VAE [41, 25, 60, 63]. Flow-based methods learn a diffeomorphism between the reference distribution and the target distribution by maximum likelihood using the change of variables formula. Recent work on flow-based methods has focused on developing training methods and designing neural network architectures that trade off the efficiency of training and sampling against the representation power of the learned map [50, 16, 17, 30, 47, 29, 23].

In this paper, we propose a unified framework (UnifiGem) for implicitly learning an underlying generative model by integrating approaches from optimal transport, numerical ODE, density-ratio (density-difference) estimation and deep neural networks. The key idea of implicit generative learning is to find a nonlinear transform that pushes forward a simple reference distribution to the target distribution. Mathematically, this task is known as finding an optimal transport map characterized by the Monge-Ampère equation. However, solving the Monge-Ampère equation is quite challenging due to its nonlinearity and high dimensionality, even at the population level where the target distribution is assumed known. The infinitesimal linearization of the Monge-Ampère equation can be interpreted from the perspective of gradient flows in measure spaces, which leads to the continuity equation. Therefore, we turn to solving the continuity equation or, equivalently, the characteristic ODE system associated with the continuity equation, which is a kind of McKean-Vlasov equation.
We solve the resulting McKean-Vlasov equation numerically using the forward Euler method and bound the discretization error at the population level. Since the forward Euler map depends on the density ratio (density difference) between the distribution at the current iteration and the underlying target distribution, we estimate the density ratio (density difference) via nonparametric deep density-ratio (density-difference) fitting and derive an explicit estimation error bound. Experimental results on both synthetic datasets and real benchmark datasets support our theoretical findings and demonstrate the effectiveness of UnifiGem.
Notation, background and theory
Let P_2(R^m) denote the space of Borel probability measures on R^m with finite second moments, and let P_a(R^m) denote the subset of P_2(R^m) whose elements are absolutely continuous with respect to the Lebesgue measure (all distributions are assumed to satisfy this assumption hereinafter). Tan_µ P_2(R^m) denotes the tangent space to P_2(R^m) at µ. Let AC_loc(R_+, P_2(R^m)) denote the set of locally absolutely continuous curves µ_t : I → P_2(R^m) with integrable metric derivative |µ′_t| on I ⊂ R_+. Lip_loc(R^m) denotes the set of functions that are Lipschitz continuous on any compact set of R^m. For any ℓ ∈ [1, ∞], we use L^ℓ(µ, R^m) (L^ℓ_loc(µ, R^m)) to denote the L^ℓ space of µ-measurable functions on R^m (on any compact set of R^m). With I, det and tr we refer to the identity map, the determinant and the trace. We use ∇, ∇² and ∆ to denote the gradient or Jacobian operator, the Hessian operator and the Laplace operator, respectively.

We first describe the theoretical background used in deriving UnifiGem, a unified framework to learn the generative model ν implicitly from an i.i.d. sample {X_i}_{i=1}^n ⊂ R^m. The quadratic Wasserstein distance between µ, ν ∈ P_2(R^m) is defined as [61, 2]

$$ W_2(\mu, \nu) = \Big\{ \inf_{\gamma \in \Gamma(\mu,\nu)} \mathbb{E}_{(X,Y)\sim\gamma}\big[\|X - Y\|^2\big] \Big\}^{1/2}, \tag{1} $$

where Γ(µ, ν) denotes the set of couplings of (µ, ν). The static formulation of W_2 in (1) admits the following variational form [8]

$$ W_2^2(\mu, \nu) = \inf_{q_t, v_t} \int_0^1 \mathbb{E}_{X \sim q_t}\big[\|v_t(X)\|^2\big]\, dt, \quad \text{s.t. } \partial_t q_t(x) = -\nabla\cdot\big(q_t(x)\, v_t(x)\big),\ q_0(x) = q(x),\ q_1(x) = p(x), $$

where v_t(x): R_+ × R^m → R^m is a velocity vector field. The Wasserstein distance W_2(µ, ν) measures the optimal quadratic cost of transporting µ onto ν. The corresponding optimal transport map T such that T_#µ = ν is characterized by the Monge-Ampère equation [10, 43, 53].
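To make the quantity in (1) concrete, the following sketch (an illustration we add here, not part of the paper's implementation; the function name is ours) estimates the quadratic Wasserstein distance between two equal-sized empirical samples by solving the underlying assignment problem exactly.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def empirical_w2(x, y):
    """Quadratic Wasserstein distance between two empirical measures with the
    same number of points (uniform weights), via the optimal assignment."""
    # cost[i, j] = ||x_i - y_j||^2
    cost = ((x[:, None, :] - y[None, :, :]) ** 2).sum(-1)
    row, col = linear_sum_assignment(cost)      # optimal coupling (a permutation)
    return np.sqrt(cost[row, col].mean())       # { inf E ||X - Y||^2 }^{1/2}

# toy check: two Gaussian samples with shifted means
rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, size=(500, 2))
y = rng.normal(2.0, 1.0, size=(500, 2))
print(empirical_w2(x, y))   # roughly the mean shift 2*sqrt(2) for large samples
```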
Lemma 2.1. Let µ, ν ∈ P_a(R^m) with densities q and p, respectively. Then (1) admits a unique solution γ = (I, T)_#µ with T = ∇Ψ, µ-a.e., where the potential function Ψ is convex and satisfies the Monge-Ampère equation

$$ \det\big(\nabla^2 \Psi(x)\big) = \frac{q(x)}{p(\nabla \Psi(x))}, \quad x \in \mathbb{R}^m. \tag{2} $$

It is challenging to find the optimal transport map T by solving the totally nonlinear degenerate elliptic Monge-Ampère equation (2). Linearization via a residual type of pushforward map, i.e., letting

$$ T_{t,\Phi} = \nabla \Psi = I + t\, \nabla \Phi \tag{3} $$

with a specially designed function Φ: R^m → R and a small t ∈ R_+, is a commonly used technique to address the difficulty due to the nonlinearity [61]. To be precise, let X ∼ q, X̃ = T_{t,Φ}(X), and denote the distribution of X̃ by q̃. For small t, the map T_{t,Φ} is invertible according to the implicit function theorem, and we have the change of variables formula

$$ \det\big(\nabla^2 \Psi\big)(x) = \big|\det\big(\nabla T_{t,\Phi}\big)(x)\big| = \frac{q(x)}{\tilde q(\tilde x)}, \tag{4} $$

where

$$ \tilde x = T_{t,\Phi}(x). \tag{5} $$

Using the fact that

$$ \frac{d}{dt}\Big|_{t=0} \det(A + tB) = \det(A)\, \mathrm{tr}\big(A^{-1}B\big) \quad \text{for all } A, B \in \mathbb{R}^{m\times m} \text{ with } A \text{ invertible}, $$

and applying a first order Taylor expansion to (4), we have

$$ \log \tilde q(\tilde x) - \log q(x) = -t\, \Delta \Phi(x) + o(t). \tag{6} $$

Let t → {x_t} and its law q_t satisfy

$$ \frac{d x_t}{dt} = \nabla \Phi(x_t), \quad \text{with } x_0 \sim q, \tag{7} $$

$$ \frac{d \ln q_t(x_t)}{dt} = -\Delta \Phi(x_t), \quad \text{with } q_0 = q. \tag{8} $$

Equations (7) and (8), resulting from the linearization of the Monge-Ampère equation (2), can be interpreted as gradient flows in measure spaces [2]. Thanks to this connection, we can resort to solving a continuity equation characterized by a type of McKean-Vlasov equation, an ODE system that is easier to handle.
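As a quick numerical illustration of the expansion behind (6) (a sanity check we add here, not taken from the paper's code), automatic differentiation can be used to verify that log det(I + t∇²Φ(x)) ≈ t ΔΦ(x) for small t, which is exactly the first order term used to pass from (4) to (6).

```python
import torch
from torch.autograd.functional import hessian

def phi(x):
    """A smooth test potential Phi: R^3 -> R (chosen arbitrarily)."""
    return torch.sin(x).sum() + 0.5 * (x ** 2).sum()

x = torch.randn(3)
t = 1e-3

H = hessian(phi, x)                        # (nabla^2 Phi)(x), shape (3, 3)
lhs = torch.logdet(torch.eye(3) + t * H)   # log det(nabla T_{t,Phi}) with T = I + t*grad(Phi)
rhs = t * torch.trace(H)                   # t * Laplacian(Phi)(x) = t * tr(nabla^2 Phi)
print(float(lhs), float(rhs))              # agree up to O(t^2), as in (6)
```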
Gradient flows in P_a(R^m)

For µ ∈ P_a(R^m) with density q, let

$$ L[\mu] = \int_{\mathbb{R}^m} F\big(q(x)\big)\, dx : \; P_a(\mathbb{R}^m) \to \mathbb{R}_+ \cup \{\infty\} \tag{9} $$

be an energy functional satisfying ν ∈ arg min L[·], where F(·): R_+ → R is a twice-differentiable convex function. Among the widely used metrics on P_a(R^m) in implicit generative learning, the following two are important examples of L[·].

• f-divergence [1]:

$$ D_f(\mu \,\|\, \nu) = \int_{\mathbb{R}^m} p(x)\, f\!\left(\frac{q(x)}{p(x)}\right) dx, \tag{10} $$

where f: R_+ → R is a twice-differentiable convex function satisfying f(1) = 0.

• Squared Lebesgue norm of the density difference:

$$ \|\mu - \nu\|^2_{L^2(\mathbb{R}^m)} = \int_{\mathbb{R}^m} |q(x) - p(x)|^2\, dx. \tag{11} $$

Definition. We call {µ_t}_{t∈R_+} ⊂ AC_loc(R_+, P_2(R^m)) a gradient flow of the functional L[·] if {µ_t}_{t∈R_+} ⊂ P_a(R^m) a.e. t ∈ R_+ and the velocity vector field v_t ∈ Tan_{µ_t} P_2(R^m) satisfies v_t ∈ −∂L[µ_t] a.e. t ∈ R_+, where ∂L[·] is the subdifferential of L[·].

The gradient flow {µ_t}_{t∈R_+} of L[·] enjoys the following nice properties.

Theorem 2.1.
(i) The continuity equation

$$ \frac{d}{dt}\mu_t = -\nabla\cdot(\mu_t v_t) \quad \text{in } \mathbb{R}_+ \times \mathbb{R}^m, \quad \text{with } \mu_0 = \mu, \tag{12} $$

holds in the sense of distributions.
(ii) Representation of the velocity fields: if the density q_t of µ_t is differentiable, then

$$ v_t(x) = -\nabla F'\big(q_t(x)\big) \quad \mu_t\text{-a.e. } x \in \mathbb{R}^m. \tag{13} $$

(iii) Energy decay along the gradient flow:

$$ \frac{d}{dt} L[\mu_t] = -\|v_t\|^2_{L^2(\mu_t, \mathbb{R}^m)} \quad \text{a.e. } t \in \mathbb{R}_+. $$

In addition, W_2(µ_t, ν) = O(exp(−λt)) if L[µ] is λ-geodesically convex with λ > 0.
(iv) If {µ_t}_t is the solution of the continuity equation (12) in (i) with v_t(x) specified by (13) in (ii), then {µ_t}_t is a gradient flow of L[·].
Proposition 2.1. If we let Φ be time-dependent in (7)-(8), i.e., Φ_t, then the linearized Monge-Ampère equations (7)-(8) are equivalent to the continuity equation (12), by taking Φ_t(x) = −F′(q_t(x)).

Theorem 2.1 and Proposition 2.1 imply that {µ_t}_t, the solution of the continuity equation (12) with v_t = −∇F′(q_t(x)), approximates the Monge-Ampère equation (2) and converges rapidly to the target distribution ν. Furthermore, the continuity equation has the following representation under mild regularity conditions on the velocity fields.

Theorem 2.2.
Assume ‖v_t‖_{L²(µ_t, R^m)} is integrable over R_+ and v_t(·) ∈ Lip_loc(R^m) with upper bound B_t and Lipschitz constant L_t such that B_t + L_t is integrable over R_+. Then the solution of the continuity equation (12) can be represented as

$$ \mu_t = (X_t)_{\#}\mu, \tag{14} $$

where X_t(x): R_+ × R^m → R^m satisfies the McKean-Vlasov equation

$$ \frac{d}{dt} X_t(x) = v_t\big(X_t(x)\big), \quad \text{with } X_0 \sim \mu, \tag{15} $$

µ-a.e. x ∈ R^m.

We use the forward Euler method to solve the McKean-Vlasov equation (15). Let s > 0 be the step size and set

$$ T_k = I + s\, v_k, \tag{16} $$

$$ X_{k+1} = T_k(X_k), \tag{17} $$

$$ \mu_{k+1} = (T_k)_{\#}\mu_k, \tag{18} $$

where X_0 ∼ µ, µ_0 = µ and k = 0, 1, ..., K. It is well known that, for a finite time horizon T and a fixed compact domain, the Euler discretization of the McKean-Vlasov equation (15) has a global error of O(s) in the supremum norm [35]. Let {µ^s_t : t ∈ [ks, (k+1)s)} be the piecewise linear interpolation between µ_k and µ_{k+1}. The discretization error between µ_t and µ^s_t can be bounded on a finite time interval [0, T).

Proposition 2.2. W_2(µ_t, µ^s_t) = O(s).

Proposition 2.2 and (iii) in Theorem 2.1 imply that the distribution of the particles X_k defined in (17) is close to the target ν for k large enough. The above theoretical results are obtained at the population level, where v_k depends on the target ν. Therefore, it is natural to implicitly learn ν by first estimating the discrete velocity fields v_k at the sample level and then plugging the estimator of v_k into (17). As shown in Lemma 2.2 below, the velocity fields associated with the f-divergence (10) and the Lebesgue norm (11) are determined by the density ratio and the density difference, respectively.
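The forward Euler scheme (16)-(18) amounts to a simple particle update. The sketch below is a minimal illustration of our own (the helper names are ours), assuming a callable velocity(x, k) that returns an estimate of v_k at the particle locations.

```python
import torch

def euler_pushforward(particles, velocity, step_size, num_steps):
    """Forward Euler iteration (16)-(18): X_{k+1} = X_k + s * v_k(X_k).

    particles : tensor of shape (n, m), samples X_0 ~ mu
    velocity  : callable (x, k) -> tensor of shape (n, m), an estimate of v_k
    """
    x = particles.clone()
    for k in range(num_steps):
        x = x + step_size * velocity(x, k)   # X_{k+1} = T_k(X_k), T_k = I + s * v_k
    return x                                 # approximate samples from mu_K

# toy usage: a fixed velocity field pushing the particles towards the origin
x0 = torch.randn(1000, 2) + 5.0
xK = euler_pushforward(x0, lambda x, k: -x, step_size=0.05, num_steps=100)
```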
Lemma 2.2. The velocity fields v_t satisfy

$$ v_t(x) = \begin{cases} -f''\big(r_t(x)\big)\, \nabla r_t(x), & L[\mu] = D_f(\mu \,\|\, \nu), \\[4pt] -\nabla d_t(x), & L[\mu] = \|\mu - \nu\|^2_{L^2(\mathbb{R}^m)}, \end{cases} $$

where r_t(x) = q_t(x)/p(x) and d_t(x) = q_t(x) − p(x), x ∈ R^m.

Several methods have been developed in the literature to estimate the density ratio and the density difference. Examples include probabilistic classification approaches, moment matching and direct density-ratio (density-difference) fitting; see [56, 57, 28, 44] and the references therein.
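For concreteness, here is the specialization of Lemma 2.2 to two common choices of f (a worked example we add for illustration; the general statement is the one above):

```latex
\begin{aligned}
\text{Pearson } \chi^2:\quad & f(u) = (u-1)^2, & f''(u) &= 2,   & v_t(x) &= -2\,\nabla r_t(x),\\
\text{Kullback--Leibler}:\quad & f(u) = u\log u, & f''(u) &= 1/u, & v_t(x) &= -\nabla \log r_t(x).
\end{aligned}
```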
The evaluation of the velocity fields depends on dynamic estimation of a discrepancy (density ratio or density difference) between the pushforward distribution q_t and the target distribution p. Density-ratio and density-difference fitting with the Bregman score provides a unified framework for such discrepancy estimation [20, 13, 56, 57, 28] without estimating each probability distribution separately.

We use a neural network R_φ: R^m → R with parameter φ to parameterize the density ratio r(x) = q(x)/p(x) between a given density q and the target p. Let g: R → R be a differentiable and strictly convex function. The separable Bregman score with base probability measure p, measuring the discrepancy between R_φ and r, is

$$ \begin{aligned} B_{\mathrm{ratio}}(r, R_\phi) &= \mathbb{E}_{X\sim p}\big[g'(R_\phi(X))\big(R_\phi(X) - r(X)\big) - g(R_\phi(X))\big] \\ &= \mathbb{E}_{X\sim p}\big[g'(R_\phi(X))\, R_\phi(X) - g(R_\phi(X))\big] - \mathbb{E}_{X\sim q}\big[g'(R_\phi(X))\big]. \end{aligned} $$

By the strict convexity of g, B_ratio(r, R_φ) ≥ B_ratio(r, r), where the equality holds if and only if R_φ = r.

For deep density-difference fitting, a neural network D_ψ: R^m → R with parameter ψ is utilized to estimate the density difference d(x) = q(x) − p(x) between a given density q and the target p. The separable Bregman score with base probability measure w, measuring the discrepancy between D_ψ and d, can be derived similarly:

$$ B_{\mathrm{diff}}(d, D_\psi) = \mathbb{E}_{X\sim p}\big[w(X)\, g'(D_\psi(X))\big] - \mathbb{E}_{X\sim q}\big[w(X)\, g'(D_\psi(X))\big] + \mathbb{E}_{X\sim w}\big[g'(D_\psi(X))\, D_\psi(X) - g(D_\psi(X))\big]. $$

Here, we focus on the widely used least-squares density-ratio (LSDR) fitting with g(c) = (c − 1)² as a working example:

$$ B_{\mathrm{LSDR}}(r, R_\phi) = \mathbb{E}_{X\sim p}\big[R_\phi(X)^2\big] - 2\,\mathbb{E}_{X\sim q}\big[R_\phi(X)\big] + 1. $$

The scenario of other functions, such as g(c) = c log c − (c + 1) log(c + 1), which corresponds to estimating r via logistic regression (LR), and the case of density-difference fitting can be handled similarly.

The distributions of real data may have a low-dimensional structure with their supports concentrated on a low-dimensional manifold, which may cause the f-divergence to be ill-posed due to non-overlapping supports. Motivated by recent works on smoothing via noise injection [54, 5] and the Tikhonov regularization method for f-GAN [51], we derive a simple weighted gradient penalty to improve deep density-ratio fitting. We consider a noise-convolved form of B_ratio(r, R_φ) with Gaussian noise ε ∼ N(0, αI),

$$ B^{\alpha}_{\mathrm{ratio}}(r, R_\phi) = \mathbb{E}_{X\sim p}\mathbb{E}_{\varepsilon}\big[g'(R_\phi(X+\varepsilon))\, R_\phi(X+\varepsilon) - g(R_\phi(X+\varepsilon))\big] - \mathbb{E}_{X\sim q}\mathbb{E}_{\varepsilon}\big[g'(R_\phi(X+\varepsilon))\big]. $$

A Taylor expansion applied to R_φ gives E_ε[R_φ(x + ε)] = R_φ(x) + (α/2) ΔR_φ(x) + O(α²). Using equations (13)-(17) in [51], we get

$$ B^{\alpha}_{\mathrm{ratio}}(r, R_\phi) \approx B_{\mathrm{ratio}}(r, R_\phi) + \alpha\, \mathbb{E}_{p}\big[g''(R_\phi)\, \|\nabla R_\phi\|^2\big], $$

i.e., E_p[g''(R_φ)‖∇R_φ‖²] serves as a regularizer for deep density-ratio fitting when g is twice differentiable. As a consequence, for g(c) = (c − 1)², the resulting gradient penalty

$$ \mathbb{E}_{p}\big[\|\nabla R_\phi\|^2\big] \tag{19} $$

recovers the well-known squared Sobolev semi-norm in nonparametric statistics.
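A minimal PyTorch sketch of the empirical least-squares density-ratio loss with the gradient penalty (19), i.e., the sample objective that appears in (20) below, is given here; this is our own illustration, not the authors' released code, and the helper name is ours.

```python
import torch

def lsdr_loss(r_net, x_p, x_q, alpha=0.0):
    """Empirical LSDR fitting loss: mean[R(X_i)^2 + alpha*||grad R(X_i)||^2] - 2*mean[R(Y_i)],
    with X_i ~ p (target samples) and Y_i ~ q (current particles)."""
    loss = (r_net(x_p) ** 2).mean() - 2.0 * r_net(x_q).mean()
    if alpha > 0:
        x = x_p.clone().requires_grad_(True)
        grad = torch.autograd.grad(r_net(x).sum(), x, create_graph=True)[0]
        loss = loss + alpha * (grad ** 2).sum(dim=1).mean()
    return loss
```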
Estimation error

Lemma 2.3. For given densities p and q, let r = q/p with C = E_{X∼q}[r(X)] − 1 < ∞. For any α ≥ 0, define the nonnegative functional

$$ B^{\alpha}_{\mathrm{LSDR}}(R) = B_{\mathrm{LSDR}}(r, R) + \alpha\, \mathbb{E}_{p}\big[\|\nabla R\|^2\big] + C. $$

Then r ∈ arg min_{measurable R} B^α_LSDR(R), and B^α_LSDR(R) = 0 if and only if R(x) = r(x) = 1, (q, p)-a.e. x ∈ R^m.

At the population level, according to Lemma 2.3, we can recover the density ratio r by minimizing B^α_LSDR(R). Moreover, the gradient penalty (19) stabilizes and improves the long-time behavior of the Euler iterations at the sample level, where the pushforward distribution should be close to the target as expected. This is supported by our numerical experiments in Section 5.

Let H_{D,W,S,B} be the set of ReLU neural networks R_φ with depth D, width W, size S, and ‖R_φ‖_∞ ≤ B. At the sample level, only i.i.d. data {X_i}_{i=1,...,n} and {Y_i}_{i=1,...,n} sampled from p and q are available. We estimate r with R̂_φ defined as

$$ \widehat{R}_{\phi} \in \arg\min_{R_{\phi} \in \mathcal{H}_{D,W,S,B}} \; \frac{1}{n} \sum_{i=1}^{n} \Big( R_{\phi}(X_i)^2 + \alpha\, \|\nabla R_{\phi}(X_i)\|^2 - 2\, R_{\phi}(Y_i) \Big). \tag{20} $$

Next we bound the nonparametric estimation error ‖R̂_φ − r‖²_{L²(ν)} under the assumption that the support of ν concentrates on a compact low-dimensional manifold and r is Lipschitz continuous. Let 𝓜 ⊆ [−c, c]^m be a Riemannian manifold with dimension much smaller than m, condition number 1/τ, volume V, and geodesic covering regularity R, and let M = O(dim(𝓜) ln(m V R/τ)) ≪ m. Denote 𝓜_ε = {x ∈ [−c, c]^m : inf{‖x − y‖ : y ∈ 𝓜} ≤ ε} for a small ε > 0.

Theorem 2.3.
Assume supp(r) = 𝓜_ε and that r(x) is Lipschitz continuous with bound B and Lipschitz constant L. Suppose the topological parameters of H_{D,W,S,B} in (20) with α = 0 satisfy D = O(log n), W = O(n^{M/(2(M+2))}/log n), S = O(n^{M/(M+2)}/log n), and the sup-norm bound of the networks is taken to be 2B. Then

$$ \mathbb{E}_{\{X_i, Y_i\}_{1}^{n}}\big[\|\widehat{R}_{\phi} - r\|^2_{L^2(\nu)}\big] \le C\,\big(B^2 + cLmM\big)\, n^{-2/(2+M)}, $$

where C is a universal constant.

We are now ready to describe how to implement UnifiGem with i.i.d. data {X_i}_{i=1}^n ⊂ R^m from an unknown target distribution ν. UnifiGem is a particle method, with which we learn a transport map that transforms particles from a simple reference distribution µ, such as the standard normal distribution or the uniform distribution, into particles from the target distribution ν. From Theorems 2.1 and 2.2 and Proposition 2.2 we know that, at the population level, the solution X_t of the McKean-Vlasov equation (15) with a sufficiently large t is a good approximation of such a transform. This solution can be obtained accurately via the forward Euler iteration (16)-(18) with a small step size, i.e., T_K ∘ T_{K−1} ∘ ... ∘ T_0 serves as the desired transform for a large K. As implied by Theorem 2.3, each T_k, k = 1, ..., K, can be estimated with high accuracy by T̂_k = I + s v̂_k, where v̂_k(x) = −f''(R̂_φ(x)) ∇R̂_φ(x). Here R̂_φ is estimated based on (20) with {Y_i}_{i=1,...,n} ∼ q_k. Therefore, the particles

T̂_K ∘ T̂_{K−1} ∘ ... ∘ T̂_0(Ỹ_i), i = 1, ..., n,

serve as samples drawn from the target distribution ν, where the particles {Ỹ_i}_{i=1}^n ⊂ R^m are sampled from a simple reference distribution µ.

In many applications, high-dimensional complex data such as images, texts and natural languages tend to have low-dimensional features. To learn generative models with hidden low-dimensional structures, it is beneficial to have the option of first sampling particles {Z_i}_{i=1}^n from a low-dimensional reference distribution µ̃ ∈ P(R^ℓ) with ℓ ≪ m. We then apply T̂_K ∘ T̂_{K−1} ∘ ... ∘ T̂_0 to the particles

Ỹ_i = G_θ(Z_i), i = 1, ..., n,

where we introduce another deep neural network G_θ: R^ℓ → R^m with parameter θ. We can estimate G_θ by fitting the pairs {(Z_i, Ỹ_i)}_{i=1}^n. We give a detailed description of the UnifiGem algorithm below; a code sketch follows the list.

• Outer loop for modeling low-dimensional latent structure (optional)
  – Sample {Z_i}_{i=1}^n ⊂ R^ℓ from a low-dimensional simple reference distribution µ̃ and let Ỹ_i = G_θ(Z_i), i = 1, 2, ..., n.
  – Inner loop for finding the pushforward map
    ∗ If there are no outer loops, sample Ỹ_i ∼ µ, i = 1, 2, ..., n.
    ∗ Get v̂(x) = −f''(R̂_φ(x)) ∇R̂_φ(x) by solving (20) with Y_i = Ỹ_i. Set T̂ = I + s v̂ with a small step size s.
    ∗ Update the particles Ỹ_i = T̂(Ỹ_i), i = 1, 2, ..., n.
  – End inner loop
  – If there are outer loops, update the parameter θ of G_θ(·) by solving min_θ Σ_{i=1}^n ‖G_θ(Z_i) − Ỹ_i‖²/n.
• End outer loop
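The following PyTorch sketch puts the pieces together (the inner particle loop and the optional outer loop). It is a schematic rendering under our own naming, reuses the lsdr_loss helper sketched earlier, uses the Pearson χ² choice f(u) = (u − 1)² so that v̂ = −2∇R̂_φ, and omits engineering details such as minibatching.

```python
import torch

def ratio_grad(r_net, x):
    """Gradient of the fitted density ratio R_phi at the particle locations."""
    x = x.clone().requires_grad_(True)
    return torch.autograd.grad(r_net(x).sum(), x)[0]

def inner_loop(r_net, data, particles, step_size, num_steps, fit_steps, alpha, lr=5e-4):
    """Forward Euler particle transport with dynamic LSDR fitting (inner loop above)."""
    opt = torch.optim.RMSprop(r_net.parameters(), lr=lr)
    for _ in range(num_steps):
        for _ in range(fit_steps):                        # refit R_phi on (data ~ p, particles ~ q_k)
            opt.zero_grad()
            lsdr_loss(r_net, data, particles, alpha).backward()
            opt.step()
        v = -2.0 * ratio_grad(r_net, particles)           # f''(u) = 2 for f(u) = (u - 1)^2
        particles = (particles + step_size * v).detach()  # T_hat = I + s * v_hat
    return particles

def unifigem(data, g_net, r_net, latent_dim, outer_loops, **inner_kwargs):
    """Optional outer loop: refit the generator G_theta to the transported particles."""
    opt_g = torch.optim.RMSprop(g_net.parameters(), lr=1e-4)
    for _ in range(outer_loops):
        z = torch.randn(data.shape[0], latent_dim)
        particles = inner_loop(r_net, data, g_net(z).detach(), **inner_kwargs)
        opt_g.zero_grad()
        ((g_net(z) - particles) ** 2).sum(dim=1).mean().backward()
        opt_g.step()
    return g_net
```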
UnifiGem is a unified and general framework, since it allows different choices of the energy functional L[·] in (9) and of the density-ratio (density-difference) estimator.

We discuss connections between UnifiGem and the existing related works, especially those that use optimal transport based on Wasserstein distances and gradient flows in measure spaces. Implicit generative learning aims at finding a transform map that pushes forward a simple reference distribution µ to the target ν. The existing implicit generative models, such as VAEs, GANs and flow-based methods, parameterize such a map with a neural network, say G_θ, that solves

$$ \min_{\theta}\; D\big((G_\theta)_{\#}\mu, \nu\big), \tag{21} $$

where D(·, ·) is an integral probability discrepancy. f-GAN [46], including the vanilla GAN [21], and WGAN [6] solve the dual form of (21) by parameterizing the dual variable with another neural network, with D taken to be the f-divergence and the 1-Wasserstein distance, respectively. Based on the fact that the 1-Wasserstein distance can be evaluated from samples via linear programming [55], [38] and [19] proposed training the primal form of WGAN, via a two-stage method that solves the linear program and refits the optimal pairs with a neural network, and via unrolling the Sinkhorn iteration, respectively. SWGAN [15] and MMDGAN [36, 9] use the sliced quadratic Wasserstein distance and the maximum mean discrepancy (MMD) as the discrepancy D, respectively.

Vanilla VAE [31] approximately solves the primal form of (21) with the KL-divergence loss under the framework of variational inference. Several authors have proposed methods that use optimal transport losses, such as various forms of Wasserstein distances between the distribution of the learned latent codes and the prior distribution, as regularizers in VAE to improve performance. These methods include WAE [60], Sliced WAE [32] and Sinkhorn AE [48].

Discrete-time flow-based methods [50, 16, 17, 30, 47, 29] minimize (21) with the KL divergence loss. [23] proposed an ODE flow for fast training using the adjoint equation [12]. By introducing optimal transport tools into maximum likelihood training, [11] and [62] considered continuous-time flows. [11] proposed a gradient flow in measure spaces in the framework of variational inference and then discretized it with the implicit movement minimizing scheme [14, 27]. [62] actually considered gradient flows in measure spaces with time-invariant velocity fields. CFGGAN [26], derived from the perspective of optimization in a functional space, is exactly a special form of UnifiGem with L[·] taken as the KL divergence. SW flow [40] and MMD flow [3] are gradient flows in measure spaces. These methods are the most closely related to our proposed UnifiGem. In SW flow, the energy functional L[·] in (9) is the sliced quadratic Wasserstein distance penalized with an entropy regularizer. We should mention that with SW flow, the target ν may not be the minimizer of such an L[·] even at the population level. MMD flow can be recovered from UnifiGem by first choosing L[·] as the Lebesgue norm and then projecting the corresponding vector fields onto a reproducing kernel Hilbert space; please see the supplementary material for a proof. However, neither SW flow nor MMD flow can model hidden low-dimensional structure with the particle sampling procedure.

The implementation details on numerical settings, network structures, SGD optimizers and hyper-parameters are given in the appendix. All experiments are performed using NVIDIA Tesla K80 GPUs.
The PyTorch code of UnifiGem is available at https://github.com/anonymous/UnifiGem.

We use UnifiGem to learn 2D distributions adapted from [22] with multiple modes and density ridges. We utilize a multilayer perceptron with ReLU activation for dynamic deep density-ratio fitting, without using the gradient penalty. We use UnifiGem without outer loops to push particles from a predrawn pool of 50k i.i.d. Gaussian particles, evolving them over 20k steps. The first row in Figure 1 shows kernel density estimation (KDE) plots of 50k samples from the target distributions (from left to right, including circles), the second row shows KDE plots of the particles transformed by UnifiGem, and the third row displays surface plots of the estimated density-ratio functions at the end of the iteration. As is evident from Figure 1, the KDE plots of the samples generated by UnifiGem are nearly indistinguishable from those of the target samples, and the estimated density-ratio functions are approximately equal to 1, indicating that the learned distribution matches the target well.

Next, we demonstrate the effectiveness of the gradient penalty (19) by visualizing the transport maps learned in the generative learning tasks with learning targets 5 squares and large gaussians, starting from 4 squares and small gaussians, respectively. We use 200 particles connected with grey lines to depict the learned transport maps. As shown in Figure 2(a), the central square of 5 squares was learned better with the gradient penalty, which is consistent with the estimated density ratio in Figure 2(b). For large gaussians, the learned transport map exhibited some optimality under the quadratic Wasserstein distance, as seen from the obvious correspondence between the samples in Figure 2(a), and the gradient penalty also improves the density-ratio estimation as expected.

Finally, we illustrate the convergence of the learning dynamics of UnifiGem on the synthetic datasets pinwheel, checkerboard and 2spirals. As shown in Figure 3, on the three test datasets the dynamics of the estimated density-ratio fitting losses in (20) share common patterns across three stages, i.e., the initialization stage (top panel), the decline stage (middle panel) and the converging stage (bottom panel). Both the left panels (LSDR fitting loss (20) with α = 0) and the right panels (estimated value of the gradient norm E_{X∼q_k}[‖∇R_φ(X)‖²]) show that the estimated LSDR fitting losses in (20) (with α = 0) converge to the theoretical value −1.

Figure 1: KDE plots of the target samples (the first row) and the corresponding generated samples (the second row). The third row shows surface plots of the estimated density ratio after 20k iterations.
(a) Left two figures: maps learned without gradient penalty. Right two figures: maps learned with gradient penalty.
(b) Left two figures: surface plots of the estimated density ratio without gradient penalty. Right two figures: surface plots of the estimated density ratio with gradient penalty.
Figure 2: Learned transport maps and estimated density ratios in learning 5 squares from 4 squares and learning large gaussians from small gaussians.

We show the performance of UnifiGem on the benchmark image datasets MNIST [34], CIFAR10 [33] and CelebA [39].
Figure 3: Convergence of UnifiGem on pinwheel, checkerboard and 2spirals. Top: the initialization stage. Middle: the decline stage. Bottom: the converging stage. Left: LSDR fitting loss (20) with α = 0. Right: estimate of the gradient norm E_{X∼q_k}[‖∇R_φ(X)‖²].
The evolving particles shown in Figure 4 on MNIST and CIFAR10 demonstrate that UnifiGem can transport samples from a multivariate normal distribution into a target distribution of the same dimension without using the outer loop. We further compare UnifiGem using the outer loop with state-of-the-art generative models including WGAN, SNGAN and MMDGAN. We considered different f-divergences, including Pearson χ², KL, JS and logD [18], and different deep density-ratio fitting methods (LSDR and LR). Table 1 shows FID scores [Heusel et al.(2017)] evaluated with five bootstrap samplings of UnifiGem with four divergences on CIFAR10. We can see that UnifiGem attains comparable (and usually better) FID scores relative to the state-of-the-art generative models. Comparisons of the real samples and learned samples on MNIST, CIFAR10 and CelebA are shown in Figure 5, where the high-fidelity learned samples are visually comparable to the real samples.

Figure 4: Particle evolution of UnifiGem on MNIST and CIFAR10.

Table 1: Mean (standard deviation) of FID scores on CIFAR10; the results in the last six rows are adapted from [4].

Models              CIFAR10 (50k)
UnifiGem-LSDR-χ²
UnifiGem-LR-KL      25.9 (0.1)
UnifiGem-LR-JS      25.3 (0.1)
UnifiGem-LR-logD
WGAN-GP             31.1 (0.2)
MMDGAN-GP-L2        31.4 (0.3)
SMMDGAN             31.5 (0.4)
SN-GAN              26.7 (0.2)
SN-SWGAN            28.5 (0.2)
SN-SMMDGAN
Conclusion

UnifiGem is a unified framework for implicit generative learning via finding a transport map between a reference distribution and the target distribution. It is inspired by several fruitful ideas from optimal transport theory, numerical ODE, density-ratio (density-difference) estimation and deep neural networks. We also provide theoretical guarantees for our proposed approach. Numerical results on both synthetic datasets and real benchmark datasets support our theoretical findings and demonstrate that UnifiGem is competitive with the state-of-the-art generative models.

There are two important ingredients in UnifiGem: the energy functional L[·] in (9) and density-ratio (density-difference) estimation. It can be shown that with a suitable choice of L[·] and a density-ratio estimation approach, UnifiGem can recover some existing generative models. Thus our theoretical results also provide insights into the properties of these existing methods. With different combinations of energy functionals and density-ratio (density-difference) estimation approaches, one can develop new theoretically sound learning procedures under UnifiGem. It would be interesting to carry out a thorough comparison between the procedures resulting from such different combinations. In particular, it is desirable to carefully explore conditions and scenarios of the data structures under which certain choices of the energy functional and density-ratio (density-difference) estimator lead to better performance.

Some aspects and results in this paper are of independent interest. For example, density-ratio estimation is an important problem of general interest in machine learning and statistics. The estimation error bound established in Theorem 2.3 for the nonparametric deep density-ratio fitting procedure is new. It is a step forward in the direction of showing that deep nonparametric estimation can circumvent the curse of dimensionality by exploiting the structure of the data [7]. It is of interest to use the techniques developed here to study deep nonparametric regression and classification.

Acknowledgements
The authors are grateful to the anonymous referees, the associate editor and the editor for their helpful comments, which have led to a significant improvement in the quality of the paper. The work of Jian Huang is supported in part by the NSF grant DMS-1916199. The work of Y. Jiao was supported in part by the National Science Foundation of China under Grant 11871474 and by the research fund of KLATASDSMOE. The work of J. Liu is supported by Duke-NUS Graduate Medical School WBS: R913-200-098-263 and MOE2016-T2-2-029 from the Ministry of Education, Singapore.
Appendix

In the appendix, we give the implementation details on numerical settings, network structures, SGD optimizers and hyper-parameters used in the paper, detailed proofs of Lemmas 2.1-2.3, Theorems 2.1-2.3 and Propositions 2.1-2.2, and the proof that MMD flow is a special case of UnifiGem.

Figure 5: Visual comparisons between real images (top 3 panels) and generated images (bottom 3 panels) by UnifiGem-LSDR-χ² on MNIST, CIFAR10 and CelebA.

Experimental details
Experiments on the 2D examples in our work were performed with deep LSDR fitting and the Pearson χ² divergence. For simplicity, the outer loops of UnifiGem were omitted and our algorithm becomes a particle method for approximating solutions of PDEs [Chertock(2017)]. In the inner loops, only a multilayer perceptron (MLP) was utilized for dynamic estimation of the density ratio between the model distribution q_k and the target distribution p. The network structure and hyper-parameters of UnifiGem and deep LSDR fitting were shared across all 2D experiments. We used RMSProp with learning rate 0.0005 and batch size 1k as the SGD optimizer. The details are given in Table 2 and Table 3; a code sketch of the Table 2 network is given after the table. Hereinafter, s is the step size, n is the number of particles, α is the penalty coefficient, and T is the number of LSDR fitting steps in each inner loop.

Table 2: MLP for deep LSDR fitting.

Layer   Details         Output size
1       Linear, ReLU    64
2       Linear, ReLU    64
3       Linear, ReLU    64
4       Linear          1
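For reference, a PyTorch rendering of the Table 2 network and its optimizer setting might look as follows (a sketch of our own; only the layer widths, the RMSProp learning rate and the batch size come from the text above, the rest is assumed):

```python
import torch
import torch.nn as nn

def make_ratio_net(input_dim=2, hidden=64):
    """MLP for deep LSDR fitting as in Table 2: three Linear+ReLU layers of width 64
    followed by a final Linear layer producing a scalar ratio estimate."""
    return nn.Sequential(
        nn.Linear(input_dim, hidden), nn.ReLU(),
        nn.Linear(hidden, hidden), nn.ReLU(),
        nn.Linear(hidden, hidden), nn.ReLU(),
        nn.Linear(hidden, 1),
    )

r_net = make_ratio_net()
optimizer = torch.optim.RMSprop(r_net.parameters(), lr=5e-4)   # batch size 1k in the paper
```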
Table 3: Hyper-parameters in UnifiGem on 2D examples.
Parameter s n α T
Value 0.005 50k 0 or 0.5 5
Datasets.
We evaluated UnifiGem on three benchmark datasets: two smaller datasets, MNIST and CIFAR10, and one larger dataset, CelebA, from the GAN literature. MNIST contains a training set of 60k examples and a test set of 10k examples as 28 × 28 bilevel images, which we resized to 32 × 32 resolution. CIFAR10 consists of a training set of 50k examples and a test set of 10k examples as 32 × 32 color images. We randomly divided the 200k celebrity images in CelebA into training and test sets according to the ratio 9:1. We also pre-processed the CelebA images by first taking a 160 × 160 central crop and then resizing to 64 × 64 resolution. Only the training sets are used to train our models.
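The preprocessing described above can be written with standard torchvision transforms; the following is a sketch of our own under the stated crop and resize sizes (dataset locations and any further normalization are our assumptions, not taken from the paper):

```python
from torchvision import datasets, transforms

# MNIST: 28x28 bilevel images resized to 32x32 resolution
mnist_tf = transforms.Compose([transforms.Resize(32), transforms.ToTensor()])
mnist = datasets.MNIST(root="./data", train=True, download=True, transform=mnist_tf)

# CelebA: 160x160 central crop, then resize to 64x64 resolution
celeba_tf = transforms.Compose([
    transforms.CenterCrop(160),
    transforms.Resize(64),
    transforms.ToTensor(),
])
celeba = datasets.CelebA(root="./data", split="train", download=True, transform=celeba_tf)
```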
Evaluation metrics.

Fréchet Inception Distance (FID) [Heusel et al.(2017)] computes the Wasserstein distance W₂ between summary statistics (mean µ and covariance Σ) of real samples x_s and generated samples g_s in the feature space of the Inception-v3 model [Szegedy et al.(2016)], i.e.,

$$ \mathrm{FID} = \|\mu_x - \mu_g\|^2 + \mathrm{Tr}\big(\Sigma_x + \Sigma_g - 2(\Sigma_x \Sigma_g)^{1/2}\big). $$

Here, FID is reported with the TensorFlow implementation, and lower FID is better.
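A small numpy/scipy sketch of this formula (our own illustration; extracting the Inception-v3 features is omitted, and the reported numbers use the TensorFlow implementation mentioned above):

```python
import numpy as np
from scipy.linalg import sqrtm

def fid_from_stats(mu_x, sigma_x, mu_g, sigma_g):
    """FID between Gaussians fitted to real and generated Inception features."""
    covmean = sqrtm(sigma_x @ sigma_g)
    if np.iscomplexobj(covmean):            # drop tiny imaginary parts from sqrtm
        covmean = covmean.real
    diff = mu_x - mu_g
    return float(diff @ diff + np.trace(sigma_x + sigma_g - 2.0 * covmean))

def fid_from_features(feat_real, feat_gen):
    """Fit means/covariances to (n, d) feature arrays and apply the formula above."""
    return fid_from_stats(feat_real.mean(0), np.cov(feat_real, rowvar=False),
                          feat_gen.mean(0), np.cov(feat_gen, rowvar=False))
```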
Network architectures and hyper-parameter settings.

We employed the ResNet architectures used by [18] in our UnifiGem algorithm. In particular, batch normalization [Ioffe & Szegedy(2015)] and spectral normalization [Miyato et al.(2018)] were omitted from the networks for UnifiGem-LSDR-χ². To train the neural networks, we set the SGD optimizer to RMSProp with learning rate 0.0001 and batch size 100. The inputs {Z_i}_{i=1}^n in UnifiGem with outer loops were vectors generated from a 128-dimensional standard normal distribution on all three datasets. Hyper-parameters are listed in Table 4, where IL denotes the number of inner loops in each outer loop. Even without outer loops, UnifiGem can generate images on MNIST and CIFAR10 as well by making use of a large set of particles; Table 5 shows the corresponding hyper-parameters.

Table 4: Hyper-parameters in UnifiGem with outer loops on real image datasets.

Parameter   ℓ   s   n   α   T   IL
Value 128 0.5 2k 0 1 20
Table 5: Hyper-parameters in UnifiGem without outer loops on real image datasets.
Parameter s n α T
Value 0.5 4k 0 5
Proof.
This is a well-known result [10, 43, 53]; see, for example, Section 1.7.6 on page 54 of [53].
Proof.
We show the results item by item. (i) The continuity equation (12) follows fromthe definition of the gradient flow directly, see, page 281 in [2] for detail.(ii) Recall L [ µ ] is a functional on P a ( R m ). By the classical results in calculus of variation[Gelfand & Silverman(2000)], ∂ L [ q ] ∂q ( x ) = dd t L [ q + tg ] | t =0 = F (cid:48) ( q ( x )) , ∂ L [ q ] ∂q denotes the first order of variation of L [ · ] at q , and q, g are the densities of µ and an arbitrary ξ ∈ P a ( R m ), respectively. Let L F ( z ) = zF (cid:48) ( z ) − F ( z ) : R → R . Some algebra shows, ∇ L F ( q ( x )) = q ( x ) ∇ F (cid:48) ( q ( x )) . Then, it follows from Theorem 10.4.6 in [2] that ∇ F (cid:48) ( q ( x )) = ∂ o L ( µ ) , where, ∂ o L ( µ ) denotes the one in ∂L ( µ ) with minimum length. The above display andthe definition of gradient flow implies the representation of the velocity fields v t .(iii) The first equality follows from chain rule and integration by part, see, Theorem 24.2in [61] for detail. The second one on linear convergence follows from Theorem 24.7 in[61], where the assumption on λ in equation (24.6) is equivalent to the λ -geodeticallyconvex assumption here.(iv) Similar to (i) see, page 281 in [2] for detail. Proof.
The time dependent form of (7)-(8) readsd x t d t = ∇ Φ t ( x t ) , with x ∼ q, d ln q t ( x t )d t = − ∆Φ t ( x t ) , with q = q. By chain rule and substituting the first equation into the second one, we have1 q t ( d q t d t + d q t d x t d x t d t ) = 1 q t ( d q t d t + ∇ q t ∇ Φ t ( x t ))= − ∆Φ t ( x t ) , which implies, d q t d t = − q t ∆Φ t ( x t ) − ∇ q t ∇ Φ t ( x t ) = −∇ · ( q t ∇ Φ t ) . By (13), the above display coincides with the continuity equation (12) with v t = ∇ Φ t = −∇ F (cid:48) ( q t ( x )). Proof.
The Lipschitz assumption of v t implies the existence and uniqueness of theMcKean-Vlasov equation (15) according to the classical results in ODE [Arnold(2012)].By the uniqueness of the continuity equation, see Proposition 8.1.7 in [2], it sufficient toshow µ t = ( X t ) µ defined in equation (14) satisfying the continuity equation (12) in aweak sense. This can be done by the standard test function and soothing approximationarguments, see, Theorem 4.4 in [53] for detail.19 .7 Proof of Proposition 2.2 Proof.
Without loss of generality let K = Ts > { µ st t ∈ [ ks, ( k + 1) s ) is the piecewise linear interpolation between µ k and µ k +1 defined as µ st = ( T k,st ) µ k , where, T k,st = I + ( t − ks ) v k ,µ k is defined in (16)-(18) with v k = v ks , i.e., the continuous velocity in (13) at time ks , k = 0 , .., K − µ = µ. Under the assumption that the velocity fields v t is Lipschitzcontinuous on ( x , µ t ), we can first show similarly as Lemma 10 in [3] W ( µ ks , µ k ) = O ( s ) . ( A µ k and µ ks , and ( X, Y ) ∼ Γ. Let X t = T k,st ( X )and Y t be the solution of (15) with X = Y and t ∈ [ ks, ( k + 1) s ). Then X t ∼ µ st , Y t ∼ µ t and Y t = Y + (cid:90) tks v ˜ t ( Y ˜ t )d˜ t. W ( µ t , µ ks ) ≤ E [ (cid:107) Y t − Y (cid:107) ]= E [ (cid:107) (cid:90) tks v ˜ t ( Y ˜ t )d˜ t (cid:107) ] ≤ E [( (cid:90) tks (cid:107) v ˜ t ( Y ˜ t ) (cid:107) d˜ t ) ] ≤ O ( s ) . ( A W , and the last equality followsfrom the the uniform bounded assumption of v t . Similarly, W ( µ k , µ st ) ≤ E [ (cid:107) X − X t (cid:107) ]= E [ (cid:107) ( t − ks ) v k ( X ) (cid:107) ] ≤ O ( s ) . ( A W ( µ t , µ st ) ≤ W ( µ t , µ ks ) + W ( µ ks , µ k ) + W ( µ k , µ st ) ≤ O ( s ) , where the first inequality follows from the triangle inequality, see for example Lemma5.3 in [53], and the second one follows from ( A − ( A .8 Proof of Lemma 2.2 Proof.
By definition, F ( q t ( x )) = (cid:40) p ( x ) f ( q t ( x ) p ( x ) ) , L [ µ ] = D f ( µ (cid:107) ν ) , ( q t ( x ) − p ( x )) , L [ µ ] = (cid:107) µ − ν (cid:107) L ( R m ) . Direct calculation shows F (cid:48) ( q t ( x )) = (cid:40) f (cid:48) ( q t ( x ) p ( x ) ) , L [ µ ] = D f ( µ (cid:107) ν ) , q t ( x ) − p ( x )) , L [ µ ] = (cid:107) µ − ν (cid:107) L ( R m ) . Then, the desired result follows from the above display and equation (13).
Proof.
By definition, it is easy to check B ( R ) = B ratio ( r, R ) − B ratio ( r, r ) , where, B ratio ( r, R )the Bregman score with the base probability measure p between R and r . Then r ∈ arg min measureable R B ( R ) follow from the fact B ratio ( r, R ) ≥ B ratio ( r, r ) and theequality holds iff R = r . Since B α ( R ) = B ( R ) + α E p [ (cid:107)∇ R (cid:107) ] ≥ , Then, B α ( R ) = 0iff B ( R ) = 0 and E p [ (cid:107)∇ R (cid:107) ] = 0 , which is further equivalent to R = r = constant ( q, p )- a.e. constant = 1 due to r is a density ratio. Proof.
We use B ( R ) to denote B − C for simplicity, i.e., B ( R ) = E X ∼ p [ R ( X ) ] − E X ∼ q [ R ( X )] . ( A α = 0 as (cid:98) R φ ∈ arg min R φ ∈H D , W , S , B (cid:98) B ( R φ )= n (cid:88) i =1 n ( R φ ( X i ) − R φ ( Y i )) . ( A ∈ ∂ B ( r ) . Then, ∀ R directcalculation yields, (cid:107) R − r (cid:107) L ( ν ) = B ( R ) − B ( r ) − (cid:104) ∂ B ( r ) , R − r (cid:105) = B ( R ) − B ( r ) . ( A ∀ ¯ R φ ∈ H D , W , S , B we have, (cid:107) (cid:98) R φ − r (cid:107) L ( ν ) = B ( (cid:98) R φ ) − B ( r )= B ( (cid:98) R φ ) − (cid:98) B ( (cid:98) R φ ) + (cid:98) B ( (cid:98) R φ ) − (cid:98) B ( ¯ R φ )+ (cid:98) B ( ¯ R φ ) − B ( ¯ R φ ) + B ( ¯ R φ ) − B ( r ) ≤ R ∈H D , W , S , B | B ( R ) − (cid:98) B ( R ) | + (cid:107) ¯ R φ − r (cid:107) L ( ν ) , ( A (cid:98) R φ and ¯ R φ and (A6). We prove our thistheorem by upper bounding the expected value of the right hand side term in (A7). Tothis end, we need the following auxiliary results (A8)-(A10). E { Z i } ni [sup R | B ( R ) − (cid:98) B ( R ) | ] ≤ C (2 B + 1) G ( H ) , ( A G ( H ) = E { Z i ,(cid:15) i } ni [ sup R ∈H D , W , S , B | n n (cid:88) i =1 (cid:15) i R ( Z i ) | ]is the Gaussian complexity of H D , W , S , B [Bartlett & Mendelson(2002)]. Proof of (A8) .Let g ( c ) = c − c , z = ( x , y ) ∈ R m × R m , (cid:101) R ( z ) = ( g ◦ R )( z ) = R ( x ) − R ( y ) . Denote Z = ( X, Y ), Z i = ( X i , Y i ) , i = 1 , ..., n with X, X i i.i.d. ∼ p , Y, Y i i.i.d. ∼ q . Let (cid:101) Z i be a i.i.d. copy of Z i , and σ i ( (cid:15) i ) be the i.i.d. Rademacher random (standard normal)variables that are independent with Z i and (cid:101) Z i . Then, B ( R ) = E Z [ (cid:101) R ( Z )] = 1 n E (cid:101) Z i [ (cid:101) R ( (cid:101) Z i )] , and (cid:98) B ( R ) = 1 n n (cid:88) i =1 (cid:101) R ( Z i ) . Denote R ( H ) = 1 n E { Z i ,σ i } ni [ sup R ∈H D , W , S , B | n (cid:88) i =1 (cid:15) i R ( Z i ) | ]22s the Rademacher complexity of H D , W , S , B [Bartlett & Mendelson(2002)]. Then, E { Z i } ni [sup R | B ( R ) − (cid:98) B ( R ) | ]= 1 n E { Z i } ni [sup R | n (cid:88) i =1 ( E (cid:101) Z i [ (cid:101) R ( (cid:101) Z i )] − (cid:101) R ( Z i )) | ] ≤ n E { Z i , (cid:101) Z i } ni [sup R | (cid:101) R ( (cid:101) Z i ) − (cid:101) R ( Z i ) | ]= 1 n E { Z i , (cid:101) Z i ,σ i } ni [sup R | n (cid:88) i =1 σ i ( (cid:101) R ( (cid:101) Z i ) − (cid:101) R ( Z i )) | ] ≤ n E { Z i ,σ i } ni [sup R | n (cid:88) i =1 σ i (cid:101) R ( Z i ) | ]+ 1 n E { (cid:101) Z i ,σ i } ni [sup R | n (cid:88) i =1 σ i (cid:101) R ( (cid:101) Z i ) | ]= 2 R ( g ◦ H ) ≤ B + 1) R ( H ) ≤ C (2 B + 1) G ( H ) , where, the first inequality follows from the Jensen’s inequality, and the second equalityholds since the distribution of σ i ( (cid:101) R ( (cid:101) Z i ) − (cid:101) R ( Z i )) and (cid:101) R ( (cid:101) Z i ) − (cid:101) R ( Z i ) are the same,and the last equality holds since the distribution of the two terms are the same, andlast two inequality follows from the Lipschitz contraction property where the Lipschitzconstant of g on H D , W , S , B is bounded by 2 B + 1 and the relationship between theGaussian complexity and the Rademacher complexity, see for Theorem 12 and Lemma 4in [Bartlett & Mendelson(2002)], respectively. G ( H ) ≤ C B (cid:114) n DS log S log n DS log S exp( − log n DS log S ) . 
( A Proof of (A9) .Since H is negation closed, G ( H ) = E { Z i ,(cid:15) i } ni [ sup R ∈H D , W , S , B n n (cid:88) i =1 (cid:15) i R ( Z i )]= E Z i [ E (cid:15) i [ sup R ∈H D , W , S , B n n (cid:88) i =1 (cid:15) i R ( Z i )] |{ Z i } ni =1 ] . Conditioning on { Z i } ni =1 , ∀ R, (cid:101) R ∈ H D , W , S , B it easy to check V (cid:15) i [ 1 n n (cid:88) i =1 (cid:15) i ( R ( Z i ) − (cid:101) R ( Z i ))] = d H ( R, ˜ R ) √ n , d H ( R, ˜ R ) = √ n (cid:113)(cid:80) ni =1 ( R ( Z i ) − ˜ R ( Z i )) . Observing the diameter of H D , W , S , B under d H is at most B , we have G ( H ) ≤ C √ n E { Z i } ni =1 [ (cid:90) B (cid:113) log N ( H , d H , δ )d δ ] ≤ C √ n E { Z i } ni =1 [ (cid:90) B (cid:113) log N ( H , d H∞ , δ )d δ ] ≤ C √ n (cid:90) B (cid:114) VC H log 6 Bnδ VC H d δ, ≤ C B ( n VC H ) / log( n VC H ) exp( − log ( n VC H )) ≤ C B (cid:114) n DS log S log n DS log S exp( − log n DS log S )where, the first inequality follows from the chaining Theorem 8.1.3 in [Vershynin(2018)],and the second inequality holds due to d H ≤ d H∞ , and in the third inequality we used therelationship between the matric entropy and the VC-dimension of the ReLU networks H D , W , S , B [Anthony & Bartlett(2009)], i.e.,log N ( H , d H∞ , δ ) ≤ VC H log 6 B nδ VC H , and the fourth inequality follows by some calculation, and the last inequality holds dueto the upper bound of VC-dimension for the ReLU network H D , W , S , B satisfyingVC H ≤ C DS log S , see [Bartlett et al.(2019)].For any two integer M, N , there exists a ¯ R φ ∈ H D , W , S , B with width W = max { M N / M + 4 M , N + 14 } , and depth D = 9 M + 12 , and B = 2 B, such that (cid:107) r − ¯ R φ (cid:107) L ( ν ) ≤ C Lm M ( N M ) − / M . ( A . Proof of (A10) .We use Lemma 4.1, Theorem 4.3, 4.4 and following the proof of Theorem 1.3 in[Shen et al.(2019)]. Let A be the random orthoprojector in Theorem 4.4, then it isto check A ( M (cid:15) ) ⊂ A ([ − c, c ] m ) ⊂ [ − c √ m, √ mc ] M . Let ˜ r be a extension of the restrictionof r on M (cid:15) , which is defined similarly as ˜ g on page 30 in [Shen et al.(2019)]. Since weassume the target r is Lipschitz continuous with the bound B and the Lipschitz constant L , let (cid:15) small enough, then by Theorem 4.3, there exist a ReLU network ˜ R φ ∈ H D , W , S , B with width W = max { M N / M + 4 M , N + 14 } , D = 9 M + 12 , and B = 2 B, such that (cid:107) ˜ r − ˜ R φ (cid:107) L ∞ ( M (cid:15) \N ) ≤ cL √ m M ( N M ) − /m , and (cid:107) ˜ R φ (cid:107) L ∞ ( M (cid:15) ) ≤ B + 3 cL √ m M , where, N is a ν − negligible set with ν ( N ) can be arbitrary small. Define ¯ R φ = ˜ R φ ◦ A .Then, following the proof after equation (4.8) in Theorem 1.3 [Shen et al.(2019)], we getour (A10) and (cid:107) ¯ R φ (cid:107) L ∞ ( M (cid:15) \N ) ≤ B, (cid:107) ¯ R φ (cid:107) L ∞ ( N ) ≤ B + 3 cL √ m M . Let DS log S < n , combing the results A (7) − A (10), we have E { X i ,Y i } n [ (cid:107) (cid:98) R φ − r (cid:107) L ( ν ) ] ≤ C (2 B + 1) G ( H ) + C cLm M ( N M ) − / M ≤ C (2 B + 1) C B (cid:114) DS log S n log n DS log S + C cLm M ( N M ) − / M ≤ C ( B + cLm M ) n − / (2+ M ) , where, last inequality holds since we choose M = log n,N = n M M ) / log n , S = n M− M +2 / log n, i.e., D = 9 log n + 12, W = 12 n M M ) / log n + 14 . Proof.
Let H be a reproducing kernel Hilbert space with characteristic kernel K ( x , z ).Recall in MMD flow, L [ µ ] = 12 (cid:107) µ − ν (cid:107) , and ∂ L [ µ ] ∂µ ( x ) = (cid:90) K ( x , z )d µ ( z ) − (cid:90) K ( x , z )d ν ( z ) , v mmd t = −∇ ∂ L [ µ ] ∂µ t = (cid:90) ∇ x K ( x , z )d ν ( z ) − (cid:90) ∇ x K ( x , z )d µ t ( z )= (cid:90) ∇ x K ( x , z ) p ( z )d z − (cid:90) ∇ x K ( x , z ) q t ( z )d z By Lemma 2.2, the vector fields corresponding the Lebesgue norm (cid:107) µ − ν (cid:107) L ( R m ) = (cid:82) R m | q ( x ) − p ( x ) | d x are defined as v t = ∇ p ( x ) − ∇ q t ( x ) . Next, we will show the vector fields v mmd t is exactly by projecting the vector fields v t onto the reproducing kernel Hilbert space H m = H ⊗ m . By the definition of reproducingkernel we have, p ( x ) = (cid:104) p ( · ) , K ( x , · ) (cid:105) H = (cid:90) K ( x , z ) p ( z )d z , and q t ( x ) = (cid:104) q t ( · ) , K ( x , · ) (cid:105) H = (cid:90) K ( x , z ) q t ( z )d z . Hence, v t ( x ) = ∇ p ( x ) − ∇ q t ( x )= (cid:90) ∇ x K ( x , z )( p ( z ) − q t ( z ))d z = v mmd t ( x ) . References [1] S. M. Ali and S. D. Silvey. A general class of coefficients of divergence of onedistribution from another.
Journal of the Royal Statistical Society: Series B(Methodological) , 28(1):131–142, 1966.[2] L. Ambrosio, N. Gigli, and G. Savar´e.
Gradient flows: in metric spaces and in thespace of probability measures . Springer Science & Business Media, 2008.[3] M. Arbel, A. Korba, A. Salim, and A. Gretton. Maximum mean discrepancy gradientflow. In
NeurIPS , 2019.[4] M. Arbel, D. Sutherland, M. Bi´nkowski, and A. Gretton. On gradient regularizersfor MMD GANs. In
NeurIPS , 2018.[5] M. Arjovsky and L. Bottou. Towards principled methods for training generativeadversarial networks. In
ICLR , 2017. 266] M. Arjovsky, S. Chintala, and L. Bottou. Wasserstein generative adversarial networks.In
ICML , 2017.[7] B. Bauer, M. Kohler, et al. On deep learning as a remedy for the curse of dimen-sionality in nonparametric regression.
The Annals of Statistics , 47(4):2261–2285,2019.[8] J.-D. Benamou and Y. Brenier. A computational fluid mechanics solution to themonge-kantorovich mass transfer problem.
Numerische Mathematik , 84(3):375–393,2000.[9] M. Bi´nkowski, D. J. Sutherland, M. Arbel, and A. Gretton. Demystifying MMDGANs. In
ICLR , 2018.[10] Y. Brenier. Polar factorization and monotone rearrangement of vector-valuedfunctions.
Communications on pure and applied mathematics , 44(4):375–417, 1991.[11] C. Chen, C. Li, L. Chen, W. Wang, Y. Pu, and L. C. Duke. Continuous-time flowsfor efficient inference and density estimation. In
ICML , 2018.[12] T. Q. Chen, Y. Rubanova, J. Bettencourt, and D. K. Duvenaud. Neural ordinarydifferential equations. In
NIPS , 2018.[13] A. P. Dawid. The geometry of proper scoring rules.
Annals of the Institute ofStatistical Mathematics , 59(1):77–93, 2007.[14] E. De Giorgi. New problems on minimizing movements. in boundary value problemsfor partial differential equations, res. notes appl. math. vol. 29. pages 81–98, 1993.[15] I. Deshpande, Z. Zhang, and A. G. Schwing. Generative modeling using the slicedwasserstein distance. In
CVPR , 2018.[16] L. Dinh, D. Krueger, and Y. Bengio. NICE: Non-linear independent componentsestimation. In
ICLR , 2015.[17] L. Dinh, J. Sohl-Dickstein, and S. Bengio. Density estimation using Real NVP. In
ICLR , 2017.[18] Y. Gao, Y. Jiao, Y. Wang, Y. Wang, C. Yang, and S. Zhang. Deep generativelearning via variational gradient flow. In
ICML , 2019.[19] A. Genevay, G. Peyre, and M. Cuturi. Learning generative models with sinkhorndivergences. In
ICML , 2018.[20] T. Gneiting and A. E. Raftery. Strictly proper scoring rules, prediction, andestimation.
Journal of the American statistical Association , 102(477):359–378, 2007.[21] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair,A. Courville, and Y. Bengio. Generative adversarial nets. In
NIPS , 2014.2722] W. Grathwohl, R. Chen, J. Bettencourt, and D. Duvenaud. Scalable reversiblegenerative models with free-form continuous dynamics. In
ICLR Workshop , 2019.[23] W. Grathwohl, R. T. Chen, J. Bettencourt, I. Sutskever, and D. Duvenaud. Ffjord:Free-form continuous dynamics for scalable reversible generative models. In
ICLR ,2019.[24] M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, and S. Hochreiter. GANstrained by a two time-scale update rule converge to a local nash equilibrium. In
NIPS , 2017.[25] I. Higgins, L. Matthey, A. Pal, C. Burgess, X. Glorot, M. Botvinick, S. Mohamed, andA. Lerchner. β -VAE: Learning basic visual concepts with a constrained variationalframework. In ICLR , 2017.[26] R. Johnson and T. Zhang. Composite functional gradient learning of generativeadversarial models. In
ICML , 2018.[27] R. Jordan, D. Kinderlehrer, and F. Otto. The variational formulation of thefokker–planck equation.
SIAM journal on mathematical analysis , 29(1):1–17, 1998.[28] T. Kanamori and M. Sugiyama. Statistical analysis of distance estimators withdensity differences and density ratios.
Entropy , 16(2):921–942, 2014.[29] D. P. Kingma and P. Dhariwal. Glow: Generative flow with invertible 1x1 convolu-tions. In
NeurIPS , 2018.[30] D. P. Kingma, T. Salimans, R. Jozefowicz, X. Chen, I. Sutskever, and M. Welling.Improved variational inference with inverse autoregressive flow. In
NIPS , 2016.[31] D. P. Kingma and M. Welling. Auto-encoding variational bayes. In
ICLR , 2014.[32] S. Kolouri, P. E. Pope, C. E. Martin, and G. K. Rohde. Sliced-wasserstein autoen-coder: An embarrassingly simple generative model. In
ICLR , 2019.[33] A. Krizhevsky and G. Hinton. Learning multiple layers of features from tiny images.Technical report, Citeseer, 2009.[34] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning appliedto document recognition.
Proceedings of the IEEE , 86(11):2278–2324, 1998.[35] R. J. LeVeque.
Finite difference methods for ordinary and partial differentialequations: steady-state and time-dependent problems , volume 98. 2007.[36] C.-L. Li, W.-C. Chang, Y. Cheng, Y. Yang, and B. P´oczos. MMD GAN: Towardsdeeper understanding of moment matching network. In
NIPS , 2017.[37] Y. Li, K. Swersky, and R. Zemel. Generative moment matching networks. In
ICML ,2015. 2838] H. Liu, G. Xianfeng, and D. Samaras. A two-step computation of the exact ganwasserstein distance. In
ICML , 2018.[39] Z. Liu, P. Luo, X. Wang, and X. Tang. Deep learning face attributes in the wild. In
ICCV , 2015.[40] A. Liutkus, U. Simsekli, S. Majewski, A. Durmus, F.-R. St¨oter, K. Chaudhuri, andR. Salakhutdinov. Sliced-wasserstein flows: Nonparametric generative modeling viaoptimal transport and diffusions. In
ICML , 2019.[41] A. Makhzani, J. Shlens, N. Jaitly, I. Goodfellow, and B. Frey. Adversarial autoen-coders. In
ICLR , 2016.[42] X. Mao, Q. Li, H. Xie, R. Y. Lau, Z. Wang, and S. P. Smolley. Least squaresgenerative adversarial networks. In
ICCV , 2017.[43] R. J. McCann et al. Existence and uniqueness of monotone measure-preservingmaps.
Duke Mathematical Journal , 80(2):309–324, 1995.[44] S. Mohamed and B. Lakshminarayanan. Learning in implicit generative models. arXiv preprint arXiv:1610.03483 , 2016.[45] Y. Mroueh and T. Sercu. Fisher GAN. In
NIPS , 2017.[46] S. Nowozin, B. Cseke, and R. Tomioka. f -GAN: Training generative neural samplersusing variational divergence minimization. In NIPS , 2016.[47] G. Papamakarios, T. Pavlakou, and I. Murray. Masked autoregressive flow fordensity estimation. In
NIPS , 2017.[48] G. Patrini, S. Bhargav, R. van den Berg, M. Welling, P. Forr´e, T. Genewein,M. Carioni, K. Graz, F. Nielsen, and C. Sony. Sinkhorn autoencoders. In
UAI , 2019.[49] S. Reed, Z. Akata, X. Yan, L. Logeswaran, B. Schiele, and H. Lee. Generativeadversarial text to image synthesis. In
ICML , 2016.[50] D. J. Rezende and S. Mohamed. Variational inference with normalizing flows. In
ICML , 2015.[51] K. Roth, A. Lucchi, S. Nowozin, and T. Hofmann. Stabilizing training of generativeadversarial networks through regularization. In
NIPS , pages 2018–2028, 2017.[52] R. Salakhutdinov. Learning deep generative models.
Annual Review of Statisticsand Its Application , 2:361–385, 2015.[53] F. Santambrogio.
Optimal transport for applied mathematicians . Springer, 2015.[54] C. K. Sønderby, J. Caballero, L. Theis, W. Shi, and F. Husz´ar. Amortised mapinference for image super-resolution. In
ICLR , 2017.2955] B. K. Sriperumbudur, K. Fukumizu, A. Gretton, B. Sch¨olkopf, G. R. Lanckriet,et al. On the empirical estimation of integral probability metrics.
Electronic Journalof Statistics , 6:1550–1599, 2012.[56] M. Sugiyama, T. Kanamori, T. Suzuki, M. D. Plessis, S. Liu, and I. Takeuchi.Density-difference estimation. In
NIPS , 2012.[57] M. Sugiyama, T. Suzuki, and T. Kanamori.
Density ratio estimation in machinelearning . Cambridge University Press, 2012.[58] D. J. Sutherland, H.-Y. Tung, H. Strathmann, S. De, A. Ramdas, A. Smola, andA. Gretton. Generative models and model criticism via optimized maximum meandiscrepancy. In
ICLR , 2017.[59] C. Tao, L. Chen, R. Henao, J. Feng, and L. C. Duke. Chi-square generativeadversarial network. In
ICML , 2018.[60] I. Tolstikhin, O. Bousquet, S. Gelly, and B. Schoelkopf. Wasserstein auto-encoders.In
ICML , 2018.[61] C. Villani.
Optimal transport: old and new , volume 338. Springer Science & BusinessMedia, 2008.[62] L. Zhang, L. Wang, et al. Monge-ampere flow for generative modeling. arXivpreprint arXiv:1809.10188 , 2018.[63] S. Zhang, Y. Gao, Y. Jiao, J. Liu, Y. Wang, and C. Yang. Wasserstein-wassersteinauto-encoders. arXiv preprint arXiv:1902.09323 , 2019.[64] J.-Y. Zhu, T. Park, P. Isola, and A. A. Efros. Unpaired image-to-image translationusing cycle-consistent adversarial networks. In
ICCV , 2017.[Anthony & Bartlett(2009)] Anthony, M. and Bartlett, P. L.
Neural network learning:Theoretical foundations . cambridge university press, 2009.[Arnold(2012)] Arnold, V. I.
Geometrical methods in the theory of ordinary differentialequations , volume 250. Springer Science & Business Media, 2012.[Bartlett & Mendelson(2002)] Bartlett, P. L. and Mendelson, S. Rademacher and gaus-sian complexities: Risk bounds and structural results.
Journal of Machine LearningResearch , 3:463–482, 2002.[Bartlett et al.(2019)] Bartlett, P. L., Harvey, N., Liaw, C., and Mehrabian, A. Nearly-tight vc-dimension and pseudodimension bounds for piecewise linear neural networks.
Journal of Machine Learning Research , 20:1–17, 2019.[Clarke(1990)] Clarke, F. H.
Optimization and nonsmooth analysis , volume 5. Siam,1990.[Gelfand & Silverman(2000)] Gelfand, I. M., Silverman, R. A., et al.
Calculus of varia-tions . 2000. 30Shen et al.(2019)] Shen, Z., Yang, H., and Zhang, S. Deep network approximationcharacterized by number of neurons. arXiv preprint arXiv:1906.05497 , 2019.[Vershynin(2018)] Vershynin, R.
High-dimensional probability: An introduction withapplications in data science , volume 47. Cambridge university press, 2018.[Heusel et al.(2017)] Heusel, M., Ramsauer, H., Unterthiner, T., Nessler, B., and Hochre-iter, S. GANs trained by a two time-scale update rule converge to a local nashequilibrium. In
NIPS , 2017.[Ioffe & Szegedy(2015)] Ioffe, S. and Szegedy, C. Batch normalization: Acceleratingdeep network training by reducing internal covariate shift. In
ICML , 2015.[Miyato et al.(2018)] Miyato, T., Kataoka, T., Koyama, M., and Yoshida, Y. Spectralnormalization for generative adversarial networks. In
ICLR , 2018.[Szegedy et al.(2016)] Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., and Wojna, Z.Rethinking the inception architecture for computer vision. In
CVPR , 2016.[Chertock(2017)] Chertock, A. A practical guide to deterministic particle methods. In