Learning Implicit Generative Models with Theoretical Guarantees
Yuan Gao∗   Jian Huang†   Yuling Jiao‡   Jin Liu§

February 18, 2020

∗ School of Mathematics and Statistics, Xi’an Jiaotong University, China ([email protected])
† Department of Statistics and Actuarial Science, University of Iowa, Iowa City, IA 52242 ([email protected])
‡ School of Mathematics and Statistics, Wuhan University, Wuhan 430072, China ([email protected])
§ Center of Quantitative Medicine, Duke-NUS Medical School, Singapore ([email protected])
Abstract
We propose a unified framework for implicit generative modeling (UnifiGem) with theoretical guarantees by integrating approaches from optimal transport, numerical ODE, density-ratio (density-difference) estimation and deep neural networks. First, the problem of implicit generative learning is formulated as that of finding the optimal transport map between the reference distribution and the target distribution, which is characterized by a totally nonlinear Monge-Ampère equation. Interpreting the infinitesimal linearization of the Monge-Ampère equation from the perspective of gradient flows in measure spaces leads to the continuity equation or the McKean-Vlasov equation. We then solve the McKean-Vlasov equation numerically using the forward Euler iteration, where the forward Euler map depends on the density ratio (density difference) between the distribution at the current iteration and the underlying target distribution. We further estimate the density ratio (density difference) via deep density-ratio (density-difference) fitting and derive explicit upper bounds on the estimation error. Experimental results on both synthetic datasets and real benchmark datasets support our theoretical findings and demonstrate the effectiveness of UnifiGem.

Keywords: Deep generative model, Optimal transport, Continuity equation, McKean-Vlasov equation, Deep density-ratio (density-difference) fitting, Nonparametric estimation error.
Running title: UnifiGem
Introduction

The ability to efficiently model complex data and sample from complex distributions plays a key role in a variety of prediction and inference tasks in machine learning and statistics [52]. The long-standing methodology for learning an underlying distribution relies on an explicit statistical data model, which can be difficult to specify in many modern machine learning tasks such as image analysis, computer vision and natural language processing. In contrast, implicit generative models do not assume a specific form of the data distribution, but rather learn a nonlinear map to transform a simple reference distribution to the underlying target distribution. This modeling approach has been shown to achieve state-of-the-art performance in many machine learning tasks [49, 64]. Generative adversarial networks (GAN) [21], variational auto-encoders (VAE) [31] and flow-based methods [50] are important representatives of implicit generative models.

GANs model the low-dimensional latent structure via deep nonlinear factors. They are trained by sequential differentiable surrogates of two-sample tests, including the density-ratio test [21, 46, 42, 45, 59] and the density-difference test [37, 58, 36, 6, 9], among others. VAE is a probabilistic deep nonlinear factor model trained with variational inference and stochastic approximation. Several authors have proposed improved versions of VAE by enhancing the disentangled representation power of the learned latent codes and reducing the blurriness of the generated images in vanilla VAE [41, 25, 60, 63]. Flow-based methods learn a diffeomorphism between the reference distribution and the target distribution by maximum likelihood using the change of variables formula. Recent work on flow-based methods has focused on developing training methods and designing neural network architectures that trade off the efficiency of training and sampling against the representation power of the learned map [50, 16, 17, 30, 47, 29, 23].

In this paper, we propose a unified framework (UnifiGem) for implicitly learning an underlying generative model by integrating approaches from optimal transport, numerical ODE, density-ratio (density-difference) estimation and deep neural networks. The key idea of implicit generative learning is to find a nonlinear transform that pushes forward a simple reference distribution to the target distribution. Mathematically, this task is known as finding an optimal transport map characterized by the Monge-Ampère equation. However, solving the Monge-Ampère equation is quite challenging due to its nonlinearity and high dimensionality, even at the population level where the target distribution is assumed known. The infinitesimal linearization of the Monge-Ampère equation can be interpreted from the perspective of gradient flows in measure spaces, which leads to the continuity equation. Therefore, we turn to solving the continuity equation or, equivalently, the characteristic ODE system associated with the continuity equation, which is a kind of McKean-Vlasov equation.
We solve the resulting McKean-Vlasov equation numerically using the forward Euler method and bound the discretization error at the population level. Since the forward Euler map depends on the density ratio (density difference) between the distribution at the current iteration and the underlying target distribution, we estimate the density ratio (density difference) via nonparametric deep density-ratio (density-difference) fitting and derive an explicit estimation error bound. Experimental results on both synthetic datasets and real benchmark datasets support our theoretical findings and demonstrate the effectiveness of UnifiGem.
Notation, background and theory
Let P_2(R^m) denote the space of Borel probability measures on R^m with finite second moments, and let P_a(R^m) denote the subset of P_2(R^m) whose elements are absolutely continuous with respect to the Lebesgue measure (all distributions are assumed to satisfy this assumption hereinafter). Tan_µ P_2(R^m) denotes the tangent space to P_2(R^m) at µ. Let AC_loc(R_+, P_2(R^m)) denote the set of locally absolutely continuous curves µ_t : I → P_2(R^m) with integrable metric derivative |µ′_t| on I ⊂ R_+. Lip_loc(R^m) denotes the set of functions that are Lipschitz continuous on any compact set of R^m. For any ℓ ∈ [1, ∞], we use L^ℓ(µ, R^m) (L^ℓ_loc(µ, R^m)) to denote the L^ℓ space of µ-measurable functions on R^m (on any compact set of R^m). With I, det and tr we refer to the identity map, the determinant and the trace. We use ∇, ∇² and ∆ to denote the gradient or Jacobian operator, the Hessian operator and the Laplace operator, respectively.

We first describe the theoretical background used in deriving UnifiGem, a unified framework to learn the generative model ν implicitly from an i.i.d. sample {X_i}_{i=1}^n ⊂ R^m. The quadratic Wasserstein distance between µ, ν ∈ P_2(R^m) is defined as [61, 2]

$$ W_2(\mu, \nu) = \Big\{ \inf_{\gamma \in \Gamma(\mu,\nu)} \mathbb{E}_{(X,Y)\sim\gamma}\big[\|X - Y\|^2\big] \Big\}^{1/2}, \tag{1} $$

where Γ(µ, ν) denotes the set of couplings of (µ, ν). The static formulation of W_2 in (1) admits the following variational form [8]

$$ W_2^2(\mu, \nu) = \inf_{q_t, v_t} \int_0^1 \mathbb{E}_{X \sim q_t}\big[\|v_t(X)\|^2\big]\, dt, \quad \text{s.t. } \partial_t q_t(x) = -\nabla\cdot\big(q_t(x)\, v_t(x)\big),\ q_0(x) = q(x),\ q_1(x) = p(x), $$

where v_t(x): R_+ × R^m → R^m is a velocity vector field. The Wasserstein distance W_2(µ, ν) measures the optimal quadratic cost of transporting µ onto ν. The corresponding optimal transport map T such that T_#µ = ν is characterized by the Monge-Ampère equation [10, 43, 53].
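To make the quantity in (1) concrete, the following sketch (an illustration we add here, not part of the paper's implementation; the function name is ours) estimates the quadratic Wasserstein distance between two equal-sized empirical samples by solving the underlying assignment problem exactly.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def empirical_w2(x, y):
    """Quadratic Wasserstein distance between two empirical measures with the
    same number of points (uniform weights), via the optimal assignment."""
    # cost[i, j] = ||x_i - y_j||^2
    cost = ((x[:, None, :] - y[None, :, :]) ** 2).sum(-1)
    row, col = linear_sum_assignment(cost)      # optimal coupling (a permutation)
    return np.sqrt(cost[row, col].mean())       # { inf E ||X - Y||^2 }^{1/2}

# toy check: two Gaussian samples with shifted means
rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, size=(500, 2))
y = rng.normal(2.0, 1.0, size=(500, 2))
print(empirical_w2(x, y))   # roughly the mean shift 2*sqrt(2) for large samples
```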
Lemma 2.1. Let µ, ν ∈ P_a(R^m) with densities q and p, respectively. Then (1) admits a unique solution γ = (I, T)_#µ with T = ∇Ψ, µ-a.e., where the potential function Ψ is convex and satisfies the Monge-Ampère equation

$$ \det\big(\nabla^2 \Psi(x)\big) = \frac{q(x)}{p(\nabla \Psi(x))}, \quad x \in \mathbb{R}^m. \tag{2} $$

It is challenging to find the optimal transport map T by solving the totally nonlinear degenerate elliptic Monge-Ampère equation (2). Linearization via a residual type of pushforward map, i.e., letting

$$ T_{t,\Phi} = \nabla \Psi = I + t\, \nabla \Phi \tag{3} $$

with a specially designed function Φ: R^m → R and a small t ∈ R_+, is a commonly used technique to address the difficulty due to the nonlinearity [61]. To be precise, let X ∼ q, X̃ = T_{t,Φ}(X), and denote the distribution of X̃ by q̃. For small t, the map T_{t,Φ} is invertible according to the implicit function theorem, and we have the change of variables formula

$$ \det\big(\nabla^2 \Psi\big)(x) = \big|\det\big(\nabla T_{t,\Phi}\big)(x)\big| = \frac{q(x)}{\tilde q(\tilde x)}, \tag{4} $$

where

$$ \tilde x = T_{t,\Phi}(x). \tag{5} $$

Using the fact that

$$ \frac{d}{dt}\Big|_{t=0} \det(A + tB) = \det(A)\, \mathrm{tr}\big(A^{-1}B\big) \quad \text{for all } A, B \in \mathbb{R}^{m\times m} \text{ with } A \text{ invertible}, $$

and applying a first order Taylor expansion to (4), we have

$$ \log \tilde q(\tilde x) - \log q(x) = -t\, \Delta \Phi(x) + o(t). \tag{6} $$

Let t → {x_t} and its law q_t satisfy

$$ \frac{d x_t}{dt} = \nabla \Phi(x_t), \quad \text{with } x_0 \sim q, \tag{7} $$

$$ \frac{d \ln q_t(x_t)}{dt} = -\Delta \Phi(x_t), \quad \text{with } q_0 = q. \tag{8} $$

Equations (7) and (8), resulting from the linearization of the Monge-Ampère equation (2), can be interpreted as gradient flows in measure spaces [2]. Thanks to this connection, we can resort to solving a continuity equation characterized by a type of McKean-Vlasov equation, an ODE system that is easier to handle.
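As a quick numerical illustration of the expansion behind (6) (a sanity check we add here, not taken from the paper's code), automatic differentiation can be used to verify that log det(I + t∇²Φ(x)) ≈ t ΔΦ(x) for small t, which is exactly the first order term used to pass from (4) to (6).

```python
import torch
from torch.autograd.functional import hessian

def phi(x):
    """A smooth test potential Phi: R^3 -> R (chosen arbitrarily)."""
    return torch.sin(x).sum() + 0.5 * (x ** 2).sum()

x = torch.randn(3)
t = 1e-3

H = hessian(phi, x)                        # (nabla^2 Phi)(x), shape (3, 3)
lhs = torch.logdet(torch.eye(3) + t * H)   # log det(nabla T_{t,Phi}) with T = I + t*grad(Phi)
rhs = t * torch.trace(H)                   # t * Laplacian(Phi)(x) = t * tr(nabla^2 Phi)
print(float(lhs), float(rhs))              # agree up to O(t^2), as in (6)
```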
Gradient flows in P_a(R^m)

For µ ∈ P_a(R^m) with density q, let

$$ L[\mu] = \int_{\mathbb{R}^m} F\big(q(x)\big)\, dx : \; P_a(\mathbb{R}^m) \to \mathbb{R}_+ \cup \{\infty\} \tag{9} $$

be an energy functional satisfying ν ∈ arg min L[·], where F(·): R_+ → R is a twice-differentiable convex function. Among the widely used metrics on P_a(R^m) in implicit generative learning, the following two are important examples of L[·].

• f-divergence [1]:

$$ D_f(\mu \,\|\, \nu) = \int_{\mathbb{R}^m} p(x)\, f\!\left(\frac{q(x)}{p(x)}\right) dx, \tag{10} $$

where f: R_+ → R is a twice-differentiable convex function satisfying f(1) = 0.

• Squared Lebesgue norm of the density difference:

$$ \|\mu - \nu\|^2_{L^2(\mathbb{R}^m)} = \int_{\mathbb{R}^m} |q(x) - p(x)|^2\, dx. \tag{11} $$

Definition. We call {µ_t}_{t∈R_+} ⊂ AC_loc(R_+, P_2(R^m)) a gradient flow of the functional L[·] if {µ_t}_{t∈R_+} ⊂ P_a(R^m) a.e. t ∈ R_+ and the velocity vector field v_t ∈ Tan_{µ_t} P_2(R^m) satisfies v_t ∈ −∂L[µ_t] a.e. t ∈ R_+, where ∂L[·] is the subdifferential of L[·].

The gradient flow {µ_t}_{t∈R_+} of L[·] enjoys the following nice properties.

Theorem 2.1.
(i) The continuity equation

$$ \frac{d}{dt}\mu_t = -\nabla\cdot(\mu_t v_t) \quad \text{in } \mathbb{R}_+ \times \mathbb{R}^m, \quad \text{with } \mu_0 = \mu, \tag{12} $$

holds in the sense of distributions.
(ii) Representation of the velocity fields: if the density q_t of µ_t is differentiable, then

$$ v_t(x) = -\nabla F'\big(q_t(x)\big) \quad \mu_t\text{-a.e. } x \in \mathbb{R}^m. \tag{13} $$

(iii) Energy decay along the gradient flow:

$$ \frac{d}{dt} L[\mu_t] = -\|v_t\|^2_{L^2(\mu_t, \mathbb{R}^m)} \quad \text{a.e. } t \in \mathbb{R}_+. $$

In addition, W_2(µ_t, ν) = O(exp(−λt)) if L[µ] is λ-geodesically convex with λ > 0.
(iv) If {µ_t}_t is the solution of the continuity equation (12) in (i) with v_t(x) specified by (13) in (ii), then {µ_t}_t is a gradient flow of L[·].
Proposition 2.1. If we let Φ be time-dependent in (7)-(8), i.e., Φ_t, then the linearized Monge-Ampère equations (7)-(8) are equivalent to the continuity equation (12), by taking Φ_t(x) = −F′(q_t(x)).

Theorem 2.1 and Proposition 2.1 imply that {µ_t}_t, the solution of the continuity equation (12) with v_t = −∇F′(q_t(x)), approximates the Monge-Ampère equation (2) and converges rapidly to the target distribution ν. Furthermore, the continuity equation has the following representation under mild regularity conditions on the velocity fields.

Theorem 2.2.
Assume ‖v_t‖_{L²(µ_t, R^m)} is integrable over R_+ and v_t(·) ∈ Lip_loc(R^m) with upper bound B_t and Lipschitz constant L_t such that B_t + L_t is integrable over R_+. Then the solution of the continuity equation (12) can be represented as

$$ \mu_t = (X_t)_{\#}\mu, \tag{14} $$

where X_t(x): R_+ × R^m → R^m satisfies the McKean-Vlasov equation

$$ \frac{d}{dt} X_t(x) = v_t\big(X_t(x)\big), \quad \text{with } X_0 \sim \mu, \tag{15} $$

µ-a.e. x ∈ R^m.

We use the forward Euler method to solve the McKean-Vlasov equation (15). Let s > 0 be the step size and set

$$ T_k = I + s\, v_k, \tag{16} $$

$$ X_{k+1} = T_k(X_k), \tag{17} $$

$$ \mu_{k+1} = (T_k)_{\#}\mu_k, \tag{18} $$

where X_0 ∼ µ, µ_0 = µ and k = 0, 1, ..., K. It is well known that, for a finite time horizon T and a fixed compact domain, the Euler discretization of the McKean-Vlasov equation (15) has a global error of O(s) in the supremum norm [35]. Let {µ^s_t : t ∈ [ks, (k+1)s)} be the piecewise linear interpolation between µ_k and µ_{k+1}. The discretization error between µ_t and µ^s_t can be bounded on a finite time interval [0, T).

Proposition 2.2. W_2(µ_t, µ^s_t) = O(s).

Proposition 2.2 and (iii) in Theorem 2.1 imply that the distribution of the particles X_k defined in (17) is close to the target ν for k large enough. The above theoretical results are obtained at the population level, where v_k depends on the target ν. Therefore, it is natural to implicitly learn ν by first estimating the discrete velocity fields v_k at the sample level and then plugging the estimator of v_k into (17). As shown in Lemma 2.2 below, the velocity fields associated with the f-divergence (10) and the Lebesgue norm (11) are determined by the density ratio and the density difference, respectively.
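The forward Euler scheme (16)-(18) amounts to a simple particle update. The sketch below is a minimal illustration of our own (the helper names are ours), assuming a callable velocity(x, k) that returns an estimate of v_k at the particle locations.

```python
import torch

def euler_pushforward(particles, velocity, step_size, num_steps):
    """Forward Euler iteration (16)-(18): X_{k+1} = X_k + s * v_k(X_k).

    particles : tensor of shape (n, m), samples X_0 ~ mu
    velocity  : callable (x, k) -> tensor of shape (n, m), an estimate of v_k
    """
    x = particles.clone()
    for k in range(num_steps):
        x = x + step_size * velocity(x, k)   # X_{k+1} = T_k(X_k), T_k = I + s * v_k
    return x                                 # approximate samples from mu_K

# toy usage: a fixed velocity field pushing the particles towards the origin
x0 = torch.randn(1000, 2) + 5.0
xK = euler_pushforward(x0, lambda x, k: -x, step_size=0.05, num_steps=100)
```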
Lemma 2.2. The velocity fields v_t satisfy

$$ v_t(x) = \begin{cases} -f''\big(r_t(x)\big)\, \nabla r_t(x), & L[\mu] = D_f(\mu \,\|\, \nu), \\[4pt] -\nabla d_t(x), & L[\mu] = \|\mu - \nu\|^2_{L^2(\mathbb{R}^m)}, \end{cases} $$

where r_t(x) = q_t(x)/p(x) and d_t(x) = q_t(x) − p(x), x ∈ R^m.

Several methods have been developed in the literature to estimate the density ratio and the density difference. Examples include probabilistic classification approaches, moment matching and direct density-ratio (density-difference) fitting; see [56, 57, 28, 44] and the references therein.
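For concreteness, here is the specialization of Lemma 2.2 to two common choices of f (a worked example we add for illustration; the general statement is the one above):

```latex
\begin{aligned}
\text{Pearson } \chi^2:\quad & f(u) = (u-1)^2, & f''(u) &= 2,   & v_t(x) &= -2\,\nabla r_t(x),\\
\text{Kullback--Leibler}:\quad & f(u) = u\log u, & f''(u) &= 1/u, & v_t(x) &= -\nabla \log r_t(x).
\end{aligned}
```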
The evaluation of the velocity fields depends on dynamic estimation of a discrepancy (density ratio or density difference) between the pushforward distribution q_t and the target distribution p. Density-ratio and density-difference fitting with the Bregman score provides a unified framework for such discrepancy estimation [20, 13, 56, 57, 28] without estimating each probability distribution separately.

We use a neural network R_φ: R^m → R with parameter φ to parameterize the density ratio r(x) = q(x)/p(x) between a given density q and the target p. Let g: R → R be a differentiable and strictly convex function. The separable Bregman score with base probability measure p, measuring the discrepancy between R_φ and r, is

$$ \begin{aligned} B_{\mathrm{ratio}}(r, R_\phi) &= \mathbb{E}_{X\sim p}\big[g'(R_\phi(X))\big(R_\phi(X) - r(X)\big) - g(R_\phi(X))\big] \\ &= \mathbb{E}_{X\sim p}\big[g'(R_\phi(X))\, R_\phi(X) - g(R_\phi(X))\big] - \mathbb{E}_{X\sim q}\big[g'(R_\phi(X))\big]. \end{aligned} $$

By the strict convexity of g, B_ratio(r, R_φ) ≥ B_ratio(r, r), where the equality holds if and only if R_φ = r.

For deep density-difference fitting, a neural network D_ψ: R^m → R with parameter ψ is utilized to estimate the density difference d(x) = q(x) − p(x) between a given density q and the target p. The separable Bregman score with base probability measure w, measuring the discrepancy between D_ψ and d, can be derived similarly:

$$ B_{\mathrm{diff}}(d, D_\psi) = \mathbb{E}_{X\sim p}\big[w(X)\, g'(D_\psi(X))\big] - \mathbb{E}_{X\sim q}\big[w(X)\, g'(D_\psi(X))\big] + \mathbb{E}_{X\sim w}\big[g'(D_\psi(X))\, D_\psi(X) - g(D_\psi(X))\big]. $$

Here, we focus on the widely used least-squares density-ratio (LSDR) fitting with g(c) = (c − 1)² as a working example:

$$ B_{\mathrm{LSDR}}(r, R_\phi) = \mathbb{E}_{X\sim p}\big[R_\phi(X)^2\big] - 2\,\mathbb{E}_{X\sim q}\big[R_\phi(X)\big] + 1. $$

The scenario of other functions, such as g(c) = c log c − (c + 1) log(c + 1), which corresponds to estimating r via logistic regression (LR), and the case of density-difference fitting can be handled similarly.

The distributions of real data may have a low-dimensional structure with their supports concentrated on a low-dimensional manifold, which may cause the f-divergence to be ill-posed due to non-overlapping supports. Motivated by recent works on smoothing via noise injection [54, 5] and the Tikhonov regularization method for f-GAN [51], we derive a simple weighted gradient penalty to improve deep density-ratio fitting. We consider a noise-convolved form of B_ratio(r, R_φ) with Gaussian noise ε ∼ N(0, αI),

$$ B^{\alpha}_{\mathrm{ratio}}(r, R_\phi) = \mathbb{E}_{X\sim p}\mathbb{E}_{\varepsilon}\big[g'(R_\phi(X+\varepsilon))\, R_\phi(X+\varepsilon) - g(R_\phi(X+\varepsilon))\big] - \mathbb{E}_{X\sim q}\mathbb{E}_{\varepsilon}\big[g'(R_\phi(X+\varepsilon))\big]. $$

A Taylor expansion applied to R_φ gives E_ε[R_φ(x + ε)] = R_φ(x) + (α/2) ΔR_φ(x) + O(α²). Using equations (13)-(17) in [51], we get

$$ B^{\alpha}_{\mathrm{ratio}}(r, R_\phi) \approx B_{\mathrm{ratio}}(r, R_\phi) + \alpha\, \mathbb{E}_{p}\big[g''(R_\phi)\, \|\nabla R_\phi\|^2\big], $$

i.e., E_p[g''(R_φ)‖∇R_φ‖²] serves as a regularizer for deep density-ratio fitting when g is twice differentiable. As a consequence, for g(c) = (c − 1)², the resulting gradient penalty

$$ \mathbb{E}_{p}\big[\|\nabla R_\phi\|^2\big] \tag{19} $$

recovers the well-known squared Sobolev semi-norm in nonparametric statistics.
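A minimal PyTorch sketch of the empirical least-squares density-ratio loss with the gradient penalty (19), i.e., the sample objective that appears in (20) below, is given here; this is our own illustration, not the authors' released code, and the helper name is ours.

```python
import torch

def lsdr_loss(r_net, x_p, x_q, alpha=0.0):
    """Empirical LSDR fitting loss: mean[R(X_i)^2 + alpha*||grad R(X_i)||^2] - 2*mean[R(Y_i)],
    with X_i ~ p (target samples) and Y_i ~ q (current particles)."""
    loss = (r_net(x_p) ** 2).mean() - 2.0 * r_net(x_q).mean()
    if alpha > 0:
        x = x_p.clone().requires_grad_(True)
        grad = torch.autograd.grad(r_net(x).sum(), x, create_graph=True)[0]
        loss = loss + alpha * (grad ** 2).sum(dim=1).mean()
    return loss
```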
Estimation error

Lemma 2.3. For given densities p and q, let r = q/p with C = E_{X∼q}[r(X)] − 1 < ∞. For any α ≥ 0, define the nonnegative functional

$$ B^{\alpha}_{\mathrm{LSDR}}(R) = B_{\mathrm{LSDR}}(r, R) + \alpha\, \mathbb{E}_{p}\big[\|\nabla R\|^2\big] + C. $$

Then r ∈ arg min_{measurable R} B^α_LSDR(R), and B^α_LSDR(R) = 0 if and only if R(x) = r(x) = 1, (q, p)-a.e. x ∈ R^m.

At the population level, according to Lemma 2.3, we can recover the density ratio r by minimizing B^α_LSDR(R). Moreover, the gradient penalty (19) stabilizes and improves the long-time behavior of the Euler iterations at the sample level, where the pushforward distribution should be close to the target as expected. This is supported by our numerical experiments in Section 5.

Let H_{D,W,S,B} be the set of ReLU neural networks R_φ with depth D, width W, size S, and ‖R_φ‖_∞ ≤ B. At the sample level, only i.i.d. data {X_i}_{i=1,...,n} and {Y_i}_{i=1,...,n} sampled from p and q are available. We estimate r with R̂_φ defined as

$$ \widehat{R}_{\phi} \in \arg\min_{R_{\phi} \in \mathcal{H}_{D,W,S,B}} \; \frac{1}{n} \sum_{i=1}^{n} \Big( R_{\phi}(X_i)^2 + \alpha\, \|\nabla R_{\phi}(X_i)\|^2 - 2\, R_{\phi}(Y_i) \Big). \tag{20} $$

Next we bound the nonparametric estimation error ‖R̂_φ − r‖²_{L²(ν)} under the assumption that the support of ν concentrates on a compact low-dimensional manifold and r is Lipschitz continuous. Let 𝓜 ⊆ [−c, c]^m be a Riemannian manifold with dimension much smaller than m, condition number 1/τ, volume V, and geodesic covering regularity R, and let M = O(dim(𝓜) ln(m V R/τ)) ≪ m. Denote 𝓜_ε = {x ∈ [−c, c]^m : inf{‖x − y‖ : y ∈ 𝓜} ≤ ε} for a small ε > 0.

Theorem 2.3.
Assume supp(r) = 𝓜_ε and that r(x) is Lipschitz continuous with bound B and Lipschitz constant L. Suppose the topological parameters of H_{D,W,S,B} in (20) with α = 0 satisfy D = O(log n), W = O(n^{M/(2(M+2))}/log n), S = O(n^{M/(M+2)}/log n), and the sup-norm bound of the networks is taken to be 2B. Then

$$ \mathbb{E}_{\{X_i, Y_i\}_{1}^{n}}\big[\|\widehat{R}_{\phi} - r\|^2_{L^2(\nu)}\big] \le C\,\big(B^2 + cLmM\big)\, n^{-2/(2+M)}, $$

where C is a universal constant.

We are now ready to describe how to implement UnifiGem with i.i.d. data {X_i}_{i=1}^n ⊂ R^m from an unknown target distribution ν. UnifiGem is a particle method, with which we learn a transport map that transforms particles from a simple reference distribution µ, such as the standard normal distribution or the uniform distribution, into particles from the target distribution ν. From Theorems 2.1 and 2.2 and Proposition 2.2 we know that, at the population level, the solution X_t of the McKean-Vlasov equation (15) with a sufficiently large t is a good approximation of such a transform. This solution can be obtained accurately via the forward Euler iteration (16)-(18) with a small step size, i.e., T_K ∘ T_{K−1} ∘ ... ∘ T_0 serves as the desired transform for a large K. As implied by Theorem 2.3, each T_k, k = 1, ..., K, can be estimated with high accuracy by T̂_k = I + s v̂_k, where v̂_k(x) = −f''(R̂_φ(x)) ∇R̂_φ(x). Here R̂_φ is estimated based on (20) with {Y_i}_{i=1,...,n} ∼ q_k. Therefore, the particles

T̂_K ∘ T̂_{K−1} ∘ ... ∘ T̂_0(Ỹ_i), i = 1, ..., n,

serve as samples drawn from the target distribution ν, where the particles {Ỹ_i}_{i=1}^n ⊂ R^m are sampled from a simple reference distribution µ.

In many applications, high-dimensional complex data such as images, texts and natural languages tend to have low-dimensional features. To learn generative models with hidden low-dimensional structures, it is beneficial to have the option of first sampling particles {Z_i}_{i=1}^n from a low-dimensional reference distribution µ̃ ∈ P(R^ℓ) with ℓ ≪ m. We then apply T̂_K ∘ T̂_{K−1} ∘ ... ∘ T̂_0 to the particles

Ỹ_i = G_θ(Z_i), i = 1, ..., n,

where we introduce another deep neural network G_θ: R^ℓ → R^m with parameter θ. We can estimate G_θ by fitting the pairs {(Z_i, Ỹ_i)}_{i=1}^n. We give a detailed description of the UnifiGem algorithm below; a code sketch follows the list.

• Outer loop for modeling low-dimensional latent structure (optional)
  – Sample {Z_i}_{i=1}^n ⊂ R^ℓ from a low-dimensional simple reference distribution µ̃ and let Ỹ_i = G_θ(Z_i), i = 1, 2, ..., n.
  – Inner loop for finding the pushforward map
    ∗ If there are no outer loops, sample Ỹ_i ∼ µ, i = 1, 2, ..., n.
    ∗ Get v̂(x) = −f''(R̂_φ(x)) ∇R̂_φ(x) by solving (20) with Y_i = Ỹ_i. Set T̂ = I + s v̂ with a small step size s.
    ∗ Update the particles Ỹ_i = T̂(Ỹ_i), i = 1, 2, ..., n.
  – End inner loop
  – If there are outer loops, update the parameter θ of G_θ(·) by solving min_θ Σ_{i=1}^n ‖G_θ(Z_i) − Ỹ_i‖²/n.
• End outer loop
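The following PyTorch sketch puts the pieces together (the inner particle loop and the optional outer loop). It is a schematic rendering under our own naming, reuses the lsdr_loss helper sketched earlier, uses the Pearson χ² choice f(u) = (u − 1)² so that v̂ = −2∇R̂_φ, and omits engineering details such as minibatching.

```python
import torch

def ratio_grad(r_net, x):
    """Gradient of the fitted density ratio R_phi at the particle locations."""
    x = x.clone().requires_grad_(True)
    return torch.autograd.grad(r_net(x).sum(), x)[0]

def inner_loop(r_net, data, particles, step_size, num_steps, fit_steps, alpha, lr=5e-4):
    """Forward Euler particle transport with dynamic LSDR fitting (inner loop above)."""
    opt = torch.optim.RMSprop(r_net.parameters(), lr=lr)
    for _ in range(num_steps):
        for _ in range(fit_steps):                        # refit R_phi on (data ~ p, particles ~ q_k)
            opt.zero_grad()
            lsdr_loss(r_net, data, particles, alpha).backward()
            opt.step()
        v = -2.0 * ratio_grad(r_net, particles)           # f''(u) = 2 for f(u) = (u - 1)^2
        particles = (particles + step_size * v).detach()  # T_hat = I + s * v_hat
    return particles

def unifigem(data, g_net, r_net, latent_dim, outer_loops, **inner_kwargs):
    """Optional outer loop: refit the generator G_theta to the transported particles."""
    opt_g = torch.optim.RMSprop(g_net.parameters(), lr=1e-4)
    for _ in range(outer_loops):
        z = torch.randn(data.shape[0], latent_dim)
        particles = inner_loop(r_net, data, g_net(z).detach(), **inner_kwargs)
        opt_g.zero_grad()
        ((g_net(z) - particles) ** 2).sum(dim=1).mean().backward()
        opt_g.step()
    return g_net
```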
UnifiGem is a unified and general framework, since it allows different choices of the energy functional L[·] in (9) and of the density-ratio (density-difference) estimator.

We discuss connections between UnifiGem and the existing related works, especially those that use optimal transport based on Wasserstein distances and gradient flows in measure spaces. Implicit generative learning aims at finding a transform map that pushes forward a simple reference distribution µ to the target ν. The existing implicit generative models, such as VAEs, GANs and flow-based methods, parameterize such a map with a neural network, say G_θ, that solves

$$ \min_{\theta}\; D\big((G_\theta)_{\#}\mu, \nu\big), \tag{21} $$

where D(·, ·) is an integral probability discrepancy. f-GAN [46], including the vanilla GAN [21], and WGAN [6] solve the dual form of (21) by parameterizing the dual variable with another neural network, with D taken to be the f-divergence and the 1-Wasserstein distance, respectively. Based on the fact that the 1-Wasserstein distance can be evaluated from samples via linear programming [55], [38] and [19] proposed training the primal form of WGAN, via a two-stage method that solves the linear program and refits the optimal pairs with a neural network, and via unrolling the Sinkhorn iteration, respectively. SWGAN [15] and MMDGAN [36, 9] use the sliced quadratic Wasserstein distance and the maximum mean discrepancy (MMD) as the discrepancy D, respectively.

Vanilla VAE [31] approximately solves the primal form of (21) with the KL-divergence loss under the framework of variational inference. Several authors have proposed methods that use optimal transport losses, such as various forms of Wasserstein distances between the distribution of the learned latent codes and the prior distribution, as regularizers in VAE to improve performance. These methods include WAE [60], Sliced WAE [32] and Sinkhorn AE [48].

Discrete-time flow-based methods [50, 16, 17, 30, 47, 29] minimize (21) with the KL divergence loss. [23] proposed an ODE flow for fast training using the adjoint equation [12]. By introducing optimal transport tools into maximum likelihood training, [11] and [62] considered continuous-time flows. [11] proposed a gradient flow in measure spaces in the framework of variational inference and then discretized it with the implicit movement minimizing scheme [14, 27]. [62] actually considered gradient flows in measure spaces with time-invariant velocity fields. CFGGAN [26], derived from the perspective of optimization in a functional space, is exactly a special form of UnifiGem with L[·] taken as the KL divergence. SW flow [40] and MMD flow [3] are gradient flows in measure spaces. These methods are the most closely related to our proposed UnifiGem. In SW flow, the energy functional L[·] in (9) is the sliced quadratic Wasserstein distance penalized with an entropy regularizer. We should mention that with SW flow, the target ν may not be the minimizer of such an L[·] even at the population level. MMD flow can be recovered from UnifiGem by first choosing L[·] as the Lebesgue norm and then projecting the corresponding vector fields onto a reproducing kernel Hilbert space; please see the supplementary material for a proof. However, neither SW flow nor MMD flow can model hidden low-dimensional structure with the particle sampling procedure.

The implementation details on numerical settings, network structures, SGD optimizers and hyper-parameters are given in the appendix. All experiments are performed using NVIDIA Tesla K80 GPUs.
The PyTorch code of UnifiGem is available at https://github.com/anonymous/UnifiGem.

We use UnifiGem to learn 2D distributions adapted from [22] with multiple modes and density ridges. We utilize a multilayer perceptron with ReLU activation for dynamic deep density-ratio fitting, without using the gradient penalty. We use UnifiGem without outer loops to push particles from a predrawn pool of 50k i.i.d. Gaussian particles, evolving them over 20k steps. The first row in Figure 1 shows kernel density estimation (KDE) plots of 50k samples from the target distributions (from left to right, including circles), the second row shows KDE plots of the particles transformed by UnifiGem, and the third row displays surface plots of the estimated density-ratio functions at the end of the iteration. As is evident from Figure 1, the KDE plots of the samples generated by UnifiGem are nearly indistinguishable from those of the target samples, and the estimated density-ratio functions are approximately equal to 1, indicating that the learned distribution matches the target well.

Next, we demonstrate the effectiveness of the gradient penalty (19) by visualizing the transport maps learned in the generative learning tasks with learning targets 5 squares and large gaussians, starting from 4 squares and small gaussians, respectively. We use 200 particles connected with grey lines to depict the learned transport maps. As shown in Figure 2(a), the central square of 5 squares was learned better with the gradient penalty, which is consistent with the estimated density ratio in Figure 2(b). For large gaussians, the learned transport map exhibited some optimality under the quadratic Wasserstein distance, as seen from the obvious correspondence between the samples in Figure 2(a), and the gradient penalty also improves the density-ratio estimation as expected.

Finally, we illustrate the convergence of the learning dynamics of UnifiGem on the synthetic datasets pinwheel, checkerboard and 2spirals. As shown in Figure 3, on the three test datasets the dynamics of the estimated density-ratio fitting losses in (20) share common patterns across three stages, i.e., the initialization stage (top panel), the decline stage (middle panel) and the converging stage (bottom panel). Both the left panels (LSDR fitting loss (20) with α = 0) and the right panels (estimated value of the gradient norm E_{X∼q_k}[‖∇R_φ(X)‖²]) show that the estimated LSDR fitting losses in (20) (with α = 0) converge to the theoretical value −1.

Figure 1: KDE plots of the target samples (the first row) and the corresponding generated samples (the second row). The third row shows surface plots of the estimated density ratio after 20k iterations.
(a) Left two figures: maps learned without gradient penalty. Right two figures: maps learned with gradient penalty.
(b) Left two figures: surface plots of the estimated density ratio without gradient penalty. Right two figures: surface plots of the estimated density ratio with gradient penalty.
Figure 2: Learned transport maps and estimated density ratios in learning 5 squares from 4 squares and learning large gaussians from small gaussians.

We show the performance of UnifiGem on the benchmark image datasets MNIST [34], CIFAR10 [33] and CelebA [39].
Figure 3: Convergence of UnifiGem on pinwheel, checkerboard and 2spirals. Top: the initialization stage. Middle: the decline stage. Bottom: the converging stage. Left: LSDR fitting loss (20) with α = 0. Right: estimate of the gradient norm E_{X∼q_k}[‖∇R_φ(X)‖²].
The evolving particles shown in Figure 4 on MNIST and CIFAR10 demonstrate that UnifiGem can transport samples from a multivariate normal distribution into a target distribution of the same dimension without using the outer loop. We further compare UnifiGem using the outer loop with state-of-the-art generative models including WGAN, SNGAN and MMDGAN. We considered different f-divergences, including Pearson χ², KL, JS and logD [18], and different deep density-ratio fitting methods (LSDR and LR). Table 1 shows FID scores [Heusel et al.(2017)] evaluated with five bootstrap samplings of UnifiGem with four divergences on CIFAR10. We can see that UnifiGem attains comparable (and usually better) FID scores relative to the state-of-the-art generative models. Comparisons of the real samples and learned samples on MNIST, CIFAR10 and CelebA are shown in Figure 5, where the high-fidelity learned samples are visually comparable to the real samples.

Figure 4: Particle evolution of UnifiGem on MNIST and CIFAR10.

Table 1: Mean (standard deviation) of FID scores on CIFAR10; the results in the last six rows are adapted from [4].

Models              CIFAR10 (50k)
UnifiGem-LSDR-χ²
UnifiGem-LR-KL      25.9 (0.1)
UnifiGem-LR-JS      25.3 (0.1)
UnifiGem-LR-logD
WGAN-GP             31.1 (0.2)
MMDGAN-GP-L2        31.4 (0.3)
SMMDGAN             31.5 (0.4)
SN-GAN              26.7 (0.2)
SN-SWGAN            28.5 (0.2)
SN-SMMDGAN
Conclusion

UnifiGem is a unified framework for implicit generative learning via finding a transport map between a reference distribution and the target distribution. It is inspired by several fruitful ideas from optimal transport theory, numerical ODE, density-ratio (density-difference) estimation and deep neural networks. We also provide theoretical guarantees for our proposed approach. Numerical results on both synthetic datasets and real benchmark datasets support our theoretical findings and demonstrate that UnifiGem is competitive with the state-of-the-art generative models.

There are two important ingredients in UnifiGem: the energy functional L[·] in (9) and density-ratio (density-difference) estimation. It can be shown that with a suitable choice of L[·] and a density-ratio estimation approach, UnifiGem can recover some existing generative models. Thus our theoretical results also provide insights into the properties of these existing methods. With different combinations of energy functionals and density-ratio (density-difference) estimation approaches, one can develop new theoretically sound learning procedures under UnifiGem. It would be interesting to carry out a thorough comparison between the procedures resulting from such different combinations. In particular, it is desirable to carefully explore conditions and scenarios of the data structures under which certain choices of the energy functional and density-ratio (density-difference) estimator lead to better performance.

Some aspects and results in this paper are of independent interest. For example, density-ratio estimation is an important problem of general interest in machine learning and statistics. The estimation error bound established in Theorem 2.3 for the nonparametric deep density-ratio fitting procedure is new. It is a step forward in the direction of showing that deep nonparametric estimation can circumvent the curse of dimensionality by exploiting the structure of the data [7]. It is of interest to use the techniques developed here to study deep nonparametric regression and classification.

Acknowledgements
The authors are grateful to the anonymous referees, the associate editor and the editor for their helpful comments, which have led to a significant improvement in the quality of the paper. The work of Jian Huang is supported in part by the NSF grant DMS-1916199. The work of Y. Jiao was supported in part by the National Science Foundation of China under Grant 11871474 and by the research fund of KLATASDSMOE. The work of J. Liu is supported by Duke-NUS Graduate Medical School WBS: R913-200-098-263 and MOE2016-T2-2-029 from the Ministry of Education, Singapore.
Appendix

In the appendix, we give the implementation details on numerical settings, network structures, SGD optimizers and hyper-parameters used in the paper, detailed proofs of Lemmas 2.1-2.3, Theorems 2.1-2.3 and Propositions 2.1-2.2, and the proof that MMD flow is a special case of UnifiGem.

Figure 5: Visual comparisons between real images (top 3 panels) and generated images (bottom 3 panels) by UnifiGem-LSDR-χ² on MNIST, CIFAR10 and CelebA.

Experimental details
Experiments on the 2D examples in our work were performed with deep LSDR fitting and the Pearson χ² divergence. For simplicity, the outer loops of UnifiGem were omitted and our algorithm becomes a particle method for approximating solutions of PDEs [Chertock(2017)]. In the inner loops, only a multilayer perceptron (MLP) was utilized for dynamic estimation of the density ratio between the model distribution q_k and the target distribution p. The network structure and hyper-parameters of UnifiGem and deep LSDR fitting were shared across all 2D experiments. We used RMSProp with learning rate 0.0005 and batch size 1k as the SGD optimizer. The details are given in Table 2 and Table 3; a code sketch of the Table 2 network is given after the table. Hereinafter, s is the step size, n is the number of particles, α is the penalty coefficient, and T is the number of LSDR fitting steps in each inner loop.

Table 2: MLP for deep LSDR fitting.

Layer   Details         Output size
1       Linear, ReLU    64
2       Linear, ReLU    64
3       Linear, ReLU    64
4       Linear          1
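For reference, a PyTorch rendering of the Table 2 network and its optimizer setting might look as follows (a sketch of our own; only the layer widths, the RMSProp learning rate and the batch size come from the text above, the rest is assumed):

```python
import torch
import torch.nn as nn

def make_ratio_net(input_dim=2, hidden=64):
    """MLP for deep LSDR fitting as in Table 2: three Linear+ReLU layers of width 64
    followed by a final Linear layer producing a scalar ratio estimate."""
    return nn.Sequential(
        nn.Linear(input_dim, hidden), nn.ReLU(),
        nn.Linear(hidden, hidden), nn.ReLU(),
        nn.Linear(hidden, hidden), nn.ReLU(),
        nn.Linear(hidden, 1),
    )

r_net = make_ratio_net()
optimizer = torch.optim.RMSprop(r_net.parameters(), lr=5e-4)   # batch size 1k in the paper
```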
Table 3: Hyper-parameters in UnifiGem on 2D examples.
Parameter s n α T
Value 0.005 50k 0 or 0.5 5
Datasets.
We evaluated UnifiGem on three benchmark datasets: two smaller datasets, MNIST and CIFAR10, and one larger dataset, CelebA, from the GAN literature. MNIST contains a training set of 60k examples and a test set of 10k examples as 28 × 28 bilevel images, which we resized to 32 × 32 resolution. CIFAR10 consists of a training set of 50k examples and a test set of 10k examples as 32 × 32 color images. We randomly divided the 200k celebrity images in CelebA into training and test sets according to the ratio 9:1. We also pre-processed the CelebA images by first taking a 160 × 160 central crop and then resizing to 64 × 64 resolution. Only the training sets are used to train our models.
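The preprocessing described above can be written with standard torchvision transforms; the following is a sketch of our own under the stated crop and resize sizes (dataset locations and any further normalization are our assumptions, not taken from the paper):

```python
from torchvision import datasets, transforms

# MNIST: 28x28 bilevel images resized to 32x32 resolution
mnist_tf = transforms.Compose([transforms.Resize(32), transforms.ToTensor()])
mnist = datasets.MNIST(root="./data", train=True, download=True, transform=mnist_tf)

# CelebA: 160x160 central crop, then resize to 64x64 resolution
celeba_tf = transforms.Compose([
    transforms.CenterCrop(160),
    transforms.Resize(64),
    transforms.ToTensor(),
])
celeba = datasets.CelebA(root="./data", split="train", download=True, transform=celeba_tf)
```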
Evaluation metrics.

Fréchet Inception Distance (FID) [Heusel et al.(2017)] computes the Wasserstein distance W₂ between summary statistics (mean µ and covariance Σ) of real samples x_s and generated samples g_s in the feature space of the Inception-v3 model [Szegedy et al.(2016)], i.e.,

$$ \mathrm{FID} = \|\mu_x - \mu_g\|^2 + \mathrm{Tr}\big(\Sigma_x + \Sigma_g - 2(\Sigma_x \Sigma_g)^{1/2}\big). $$

Here, FID is reported with the TensorFlow implementation, and lower FID is better.
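A small numpy/scipy sketch of this formula (our own illustration; extracting the Inception-v3 features is omitted, and the reported numbers use the TensorFlow implementation mentioned above):

```python
import numpy as np
from scipy.linalg import sqrtm

def fid_from_stats(mu_x, sigma_x, mu_g, sigma_g):
    """FID between Gaussians fitted to real and generated Inception features."""
    covmean = sqrtm(sigma_x @ sigma_g)
    if np.iscomplexobj(covmean):            # drop tiny imaginary parts from sqrtm
        covmean = covmean.real
    diff = mu_x - mu_g
    return float(diff @ diff + np.trace(sigma_x + sigma_g - 2.0 * covmean))

def fid_from_features(feat_real, feat_gen):
    """Fit means/covariances to (n, d) feature arrays and apply the formula above."""
    return fid_from_stats(feat_real.mean(0), np.cov(feat_real, rowvar=False),
                          feat_gen.mean(0), np.cov(feat_gen, rowvar=False))
```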
Network architectures and hyper-parameter settings.

We employed the ResNet architectures used by [18] in our UnifiGem algorithm. In particular, batch normalization [Ioffe & Szegedy(2015)] and spectral normalization [Miyato et al.(2018)] were omitted from the networks for UnifiGem-LSDR-χ². To train the neural networks, we set the SGD optimizer to RMSProp with learning rate 0.0001 and batch size 100. The inputs {Z_i}_{i=1}^n in UnifiGem with outer loops were vectors generated from a 128-dimensional standard normal distribution on all three datasets. Hyper-parameters are listed in Table 4, where IL denotes the number of inner loops in each outer loop. Even without outer loops, UnifiGem can generate images on MNIST and CIFAR10 as well by making use of a large set of particles; Table 5 shows the corresponding hyper-parameters.

Table 4: Hyper-parameters in UnifiGem with outer loops on real image datasets.

Parameter   ℓ   s   n   α   T   IL
Value 128 0.5 2k 0 1 20
Table 5: Hyper-parameters in UnifiGem without outer loops on real image datasets.
Parameter s n α T
Value 0.5 4k 0 5
Proof.
This is a well-known result [10, 43, 53]; see, for example, Section 1.7.6 on page 54 of [53].
Proof.
We show the results item by item. (i) The continuity equation (12) follows fromthe definition of the gradient flow directly, see, page 281 in [2] for detail.(ii) Recall L [ µ ] is a functional on P a ( R m ). By the classical results in calculus of variation[Gelfand & Silverman(2000)], ∂ L [ q ] ∂q ( x ) = dd t L [ q + tg ] | t =0 = F (cid:48) ( q ( x )) , ∂ L [ q ] ∂q denotes the first order of variation of L [ · ] at q , and q, g are the densities of µ and an arbitrary ξ ∈ P a ( R m ), respectively. Let L F ( z ) = zF (cid:48) ( z ) − F ( z ) : R → R . Some algebra shows, ∇ L F ( q ( x )) = q ( x ) ∇ F (cid:48) ( q ( x )) . Then, it follows from Theorem 10.4.6 in [2] that ∇ F (cid:48) ( q ( x )) = ∂ o L ( µ ) , where, ∂ o L ( µ ) denotes the one in ∂L ( µ ) with minimum length. The above display andthe definition of gradient flow implies the representation of the velocity fields v t .(iii) The first equality follows from chain rule and integration by part, see, Theorem 24.2in [61] for detail. The second one on linear convergence follows from Theorem 24.7 in[61], where the assumption on λ in equation (24.6) is equivalent to the λ -geodeticallyconvex assumption here.(iv) Similar to (i) see, page 281 in [2] for detail. Proof.
The time dependent form of (7)-(8) readsd x t d t = ∇ Φ t ( x t ) , with x ∼ q, d ln q t ( x t )d t = − ∆Φ t ( x t ) , with q = q. By chain rule and substituting the first equation into the second one, we have1 q t ( d q t d t + d q t d x t d x t d t ) = 1 q t ( d q t d t + ∇ q t ∇ Φ t ( x t ))= − ∆Φ t ( x t ) , which implies, d q t d t = − q t ∆Φ t ( x t ) − ∇ q t ∇ Φ t ( x t ) = −∇ · ( q t ∇ Φ t ) . By (13), the above display coincides with the continuity equation (12) with v t = ∇ Φ t = −∇ F (cid:48) ( q t ( x )). Proof.
The Lipschitz assumption of v t implies the existence and uniqueness of theMcKean-Vlasov equation (15) according to the classical results in ODE [Arnold(2012)].By the uniqueness of the continuity equation, see Proposition 8.1.7 in [2], it sufficient toshow µ t = ( X t ) µ defined in equation (14) satisfying the continuity equation (12) in aweak sense. This can be done by the standard test function and soothing approximationarguments, see, Theorem 4.4 in [53] for detail.19 .7 Proof of Proposition 2.2 Proof.
Without loss of generality let K = Ts > { µ st t ∈ [ ks, ( k + 1) s ) is the piecewise linear interpolation between µ k and µ k +1 defined as µ st = ( T k,st ) µ k , where, T k,st = I + ( t − ks ) v k ,µ k is defined in (16)-(18) with v k = v ks , i.e., the continuous velocity in (13) at time ks , k = 0 , .., K − µ = µ. Under the assumption that the velocity fields v t is Lipschitzcontinuous on ( x , µ t ), we can first show similarly as Lemma 10 in [3] W ( µ ks , µ k ) = O ( s ) . ( A µ k and µ ks , and ( X, Y ) ∼ Γ. Let X t = T k,st ( X )and Y t be the solution of (15) with X = Y and t ∈ [ ks, ( k + 1) s ). Then X t ∼ µ st , Y t ∼ µ t and Y t = Y + (cid:90) tks v ˜ t ( Y ˜ t )d˜ t. W ( µ t , µ ks ) ≤ E [ (cid:107) Y t − Y (cid:107) ]= E [ (cid:107) (cid:90) tks v ˜ t ( Y ˜ t )d˜ t (cid:107) ] ≤ E [( (cid:90) tks (cid:107) v ˜ t ( Y ˜ t ) (cid:107) d˜ t ) ] ≤ O ( s ) . ( A W , and the last equality followsfrom the the uniform bounded assumption of v t . Similarly, W ( µ k , µ st ) ≤ E [ (cid:107) X − X t (cid:107) ]= E [ (cid:107) ( t − ks ) v k ( X ) (cid:107) ] ≤ O ( s ) . ( A W ( µ t , µ st ) ≤ W ( µ t , µ ks ) + W ( µ ks , µ k ) + W ( µ k , µ st ) ≤ O ( s ) , where the first inequality follows from the triangle inequality, see for example Lemma5.3 in [53], and the second one follows from ( A − ( A .8 Proof of Lemma 2.2 Proof.
By definition, F ( q t ( x )) = (cid:40) p ( x ) f ( q t ( x ) p ( x ) ) , L [ µ ] = D f ( µ (cid:107) ν ) , ( q t ( x ) − p ( x )) , L [ µ ] = (cid:107) µ − ν (cid:107) L ( R m ) . Direct calculation shows F (cid:48) ( q t ( x )) = (cid:40) f (cid:48) ( q t ( x ) p ( x ) ) , L [ µ ] = D f ( µ (cid:107) ν ) , q t ( x ) − p ( x )) , L [ µ ] = (cid:107) µ − ν (cid:107) L ( R m ) . Then, the desired result follows from the above display and equation (13).
Proof.
By definition, it is easy to check B ( R ) = B ratio ( r, R ) − B ratio ( r, r ) , where, B ratio ( r, R )the Bregman score with the base probability measure p between R and r . Then r ∈ arg min measureable R B ( R ) follow from the fact B ratio ( r, R ) ≥ B ratio ( r, r ) and theequality holds iff R = r . Since B α ( R ) = B ( R ) + α E p [ (cid:107)∇ R (cid:107) ] ≥ , Then, B α ( R ) = 0iff B ( R ) = 0 and E p [ (cid:107)∇ R (cid:107) ] = 0 , which is further equivalent to R = r = constant ( q, p )- a.e. constant = 1 due to r is a density ratio. Proof.
We use B ( R ) to denote B − C for simplicity, i.e., B ( R ) = E X ∼ p [ R ( X ) ] − E X ∼ q [ R ( X )] . ( A α = 0 as (cid:98) R φ ∈ arg min R φ ∈H D , W , S , B (cid:98) B ( R φ )= n (cid:88) i =1 n ( R φ ( X i ) − R φ ( Y i )) . ( A ∈ ∂ B ( r ) . Then, ∀ R directcalculation yields, (cid:107) R − r (cid:107) L ( ν ) = B ( R ) − B ( r ) − (cid:104) ∂ B ( r ) , R − r (cid:105) = B ( R ) − B ( r ) . ( A ∀ ¯ R φ ∈ H D , W , S , B we have, (cid:107) (cid:98) R φ − r (cid:107) L ( ν ) = B ( (cid:98) R φ ) − B ( r )= B ( (cid:98) R φ ) − (cid:98) B ( (cid:98) R φ ) + (cid:98) B ( (cid:98) R φ ) − (cid:98) B ( ¯ R φ )+ (cid:98) B ( ¯ R φ ) − B ( ¯ R φ ) + B ( ¯ R φ ) − B ( r ) ≤ R ∈H D , W , S , B | B ( R ) − (cid:98) B ( R ) | + (cid:107) ¯ R φ − r (cid:107) L ( ν ) , ( A (cid:98) R φ and ¯ R φ and (A6). We prove our thistheorem by upper bounding the expected value of the right hand side term in (A7). Tothis end, we need the following auxiliary results (A8)-(A10). E { Z i } ni [sup R | B ( R ) − (cid:98) B ( R ) | ] ≤ C (2 B + 1) G ( H ) , ( A G ( H ) = E { Z i ,(cid:15) i } ni [ sup R ∈H D , W , S , B | n n (cid:88) i =1 (cid:15) i R ( Z i ) | ]is the Gaussian complexity of H D , W , S , B [Bartlett & Mendelson(2002)]. Proof of (A8) .Let g ( c ) = c − c , z = ( x , y ) ∈ R m × R m , (cid:101) R ( z ) = ( g ◦ R )( z ) = R ( x ) − R ( y ) . Denote Z = ( X, Y ), Z i = ( X i , Y i ) , i = 1 , ..., n with X, X i i.i.d. ∼ p , Y, Y i i.i.d. ∼ q . Let (cid:101) Z i be a i.i.d. copy of Z i , and σ i ( (cid:15) i ) be the i.i.d. Rademacher random (standard normal)variables that are independent with Z i and (cid:101) Z i . Then, B ( R ) = E Z [ (cid:101) R ( Z )] = 1 n E (cid:101) Z i [ (cid:101) R ( (cid:101) Z i )] , and (cid:98) B ( R ) = 1 n n (cid:88) i =1 (cid:101) R ( Z i ) . Denote R ( H ) = 1 n E { Z i ,σ i } ni [ sup R ∈H D , W , S , B | n (cid:88) i =1 (cid:15) i R ( Z i ) | ]22s the Rademacher complexity of H D , W , S , B [Bartlett & Mendelson(2002)]. Then, E { Z i } ni [sup R | B ( R ) − (cid:98) B ( R ) | ]= 1 n E { Z i } ni [sup R | n (cid:88) i =1 ( E (cid:101) Z i [ (cid:101) R ( (cid:101) Z i )] − (cid:101) R ( Z i )) | ] ≤ n E { Z i , (cid:101) Z i } ni [sup R | (cid:101) R ( (cid:101) Z i ) − (cid:101) R ( Z i ) | ]= 1 n E { Z i , (cid:101) Z i ,σ i } ni [sup R | n (cid:88) i =1 σ i ( (cid:101) R ( (cid:101) Z i ) − (cid:101) R ( Z i )) | ] ≤ n E { Z i ,σ i } ni [sup R | n (cid:88) i =1 σ i (cid:101) R ( Z i ) | ]+ 1 n E { (cid:101) Z i ,σ i } ni [sup R | n (cid:88) i =1 σ i (cid:101) R ( (cid:101) Z i ) | ]= 2 R ( g ◦ H ) ≤ B + 1) R ( H ) ≤ C (2 B + 1) G ( H ) , where, the first inequality follows from the Jensen’s inequality, and the second equalityholds since the distribution of σ i ( (cid:101) R ( (cid:101) Z i ) − (cid:101) R ( Z i )) and (cid:101) R ( (cid:101) Z i ) − (cid:101) R ( Z i ) are the same,and the last equality holds since the distribution of the two terms are the same, andlast two inequality follows from the Lipschitz contraction property where the Lipschitzconstant of g on H D , W , S , B is bounded by 2 B + 1 and the relationship between theGaussian complexity and the Rademacher complexity, see for Theorem 12 and Lemma 4in [Bartlett & Mendelson(2002)], respectively. G ( H ) ≤ C B (cid:114) n DS log S log n DS log S exp( − log n DS log S ) . 
( A Proof of (A9) .Since H is negation closed, G ( H ) = E { Z i ,(cid:15) i } ni [ sup R ∈H D , W , S , B n n (cid:88) i =1 (cid:15) i R ( Z i )]= E Z i [ E (cid:15) i [ sup R ∈H D , W , S , B n n (cid:88) i =1 (cid:15) i R ( Z i )] |{ Z i } ni =1 ] . Conditioning on { Z i } ni =1 , ∀ R, (cid:101) R ∈ H D , W , S , B it easy to check V (cid:15) i [ 1 n n (cid:88) i =1 (cid:15) i ( R ( Z i ) − (cid:101) R ( Z i ))] = d H ( R, ˜ R ) √ n , d H ( R, ˜ R ) = √ n (cid:113)(cid:80) ni =1 ( R ( Z i ) − ˜ R ( Z i )) . Observing the diameter of H D , W , S , B under d H is at most B , we have G ( H ) ≤ C √ n E { Z i } ni =1 [ (cid:90) B (cid:113) log N ( H , d H , δ )d δ ] ≤ C √ n E { Z i } ni =1 [ (cid:90) B (cid:113) log N ( H , d H∞ , δ )d δ ] ≤ C √ n (cid:90) B (cid:114) VC H log 6 Bnδ VC H d δ, ≤ C B ( n VC H ) / log( n VC H ) exp( − log ( n VC H )) ≤ C B (cid:114) n DS log S log n DS log S exp( − log n DS log S )where, the first inequality follows from the chaining Theorem 8.1.3 in [Vershynin(2018)],and the second inequality holds due to d H ≤ d H∞ , and in the third inequality we used therelationship between the matric entropy and the VC-dimension of the ReLU networks H D , W , S , B [Anthony & Bartlett(2009)], i.e.,log N ( H , d H∞ , δ ) ≤ VC H log 6 B nδ VC H , and the fourth inequality follows by some calculation, and the last inequality holds dueto the upper bound of VC-dimension for the ReLU network H D , W , S , B satisfyingVC H ≤ C DS log S , see [Bartlett et al.(2019)].For any two integer M, N , there exists a ¯ R φ ∈ H D , W , S , B with width W = max { M N / M + 4 M , N + 14 } , and depth D = 9 M + 12 , and B = 2 B, such that (cid:107) r − ¯ R φ (cid:107) L ( ν ) ≤ C Lm M ( N M ) − / M . ( A . Proof of (A10) .We use Lemma 4.1, Theorem 4.3, 4.4 and following the proof of Theorem 1.3 in[Shen et al.(2019)]. Let A be the random orthoprojector in Theorem 4.4, then it isto check A ( M (cid:15) ) ⊂ A ([ − c, c ] m ) ⊂ [ − c √ m, √ mc ] M . Let ˜ r be a extension of the restrictionof r on M (cid:15) , which is defined similarly as ˜ g on page 30 in [Shen et al.(2019)]. Since weassume the target r is Lipschitz continuous with the bound B and the Lipschitz constant L , let (cid:15) small enough, then by Theorem 4.3, there exist a ReLU network ˜ R φ ∈ H D , W , S , B with width W = max { M N / M + 4 M , N + 14 } , D = 9 M + 12 , and B = 2 B, such that (cid:107) ˜ r − ˜ R φ (cid:107) L ∞ ( M (cid:15) \N ) ≤ cL √ m M ( N M ) − /m , and (cid:107) ˜ R φ (cid:107) L ∞ ( M (cid:15) ) ≤ B + 3 cL √ m M , where, N is a ν − negligible set with ν ( N ) can be arbitrary small. Define ¯ R φ = ˜ R φ ◦ A .Then, following the proof after equation (4.8) in Theorem 1.3 [Shen et al.(2019)], we getour (A10) and (cid:107) ¯ R φ (cid:107) L ∞ ( M (cid:15) \N ) ≤ B, (cid:107) ¯ R φ (cid:107) L ∞ ( N ) ≤ B + 3 cL √ m M . Let DS log S < n , combing the results A (7) − A (10), we have E { X i ,Y i } n [ (cid:107) (cid:98) R φ − r (cid:107) L ( ν ) ] ≤ C (2 B + 1) G ( H ) + C cLm M ( N M ) − / M ≤ C (2 B + 1) C B (cid:114) DS log S n log n DS log S + C cLm M ( N M ) − / M ≤ C ( B + cLm M ) n − / (2+ M ) , where, last inequality holds since we choose M = log n,N = n M M ) / log n , S = n M− M +2 / log n, i.e., D = 9 log n + 12, W = 12 n M M ) / log n + 14 . Proof.
Let H be a reproducing kernel Hilbert space with characteristic kernel K ( x , z ).Recall in MMD flow, L [ µ ] = 12 (cid:107) µ − ν (cid:107) , and ∂ L [ µ ] ∂µ ( x ) = (cid:90) K ( x , z )d µ ( z ) − (cid:90) K ( x , z )d ν ( z ) , v mmd t = −∇ ∂ L [ µ ] ∂µ t = (cid:90) ∇ x K ( x , z )d ν ( z ) − (cid:90) ∇ x K ( x , z )d µ t ( z )= (cid:90) ∇ x K ( x , z ) p ( z )d z − (cid:90) ∇ x K ( x , z ) q t ( z )d z By Lemma 2.2, the vector fields corresponding the Lebesgue norm (cid:107) µ − ν (cid:107) L ( R m ) = (cid:82) R m | q ( x ) − p ( x ) | d x are defined as v t = ∇ p ( x ) − ∇ q t ( x ) . Next, we will show the vector fields v mmd t is exactly by projecting the vector fields v t onto the reproducing kernel Hilbert space H m = H ⊗ m . By the definition of reproducingkernel we have, p ( x ) = (cid:104) p ( · ) , K ( x , · ) (cid:105) H = (cid:90) K ( x , z ) p ( z )d z , and q t ( x ) = (cid:104) q t ( · ) , K ( x , · ) (cid:105) H = (cid:90) K ( x , z ) q t ( z )d z . Hence, v t ( x ) = ∇ p ( x ) − ∇ q t ( x )= (cid:90) ∇ x K ( x , z )( p ( z ) − q t ( z ))d z = v mmd t ( x ) . References [1] S. M. Ali and S. D. Silvey. A general class of coefficients of divergence of onedistribution from another.
Journal of the Royal Statistical Society: Series B(Methodological) , 28(1):131–142, 1966.[2] L. Ambrosio, N. Gigli, and G. Savar´e.
Gradient flows: in metric spaces and in thespace of probability measures . Springer Science & Business Media, 2008.[3] M. Arbel, A. Korba, A. Salim, and A. Gretton. Maximum mean discrepancy gradientflow. In
NeurIPS , 2019.[4] M. Arbel, D. Sutherland, M. Bi´nkowski, and A. Gretton. On gradient regularizersfor MMD GANs. In
NeurIPS , 2018.[5] M. Arjovsky and L. Bottou. Towards principled methods for training generativeadversarial networks. In
ICLR , 2017. 266] M. Arjovsky, S. Chintala, and L. Bottou. Wasserstein generative adversarial networks.In
ICML , 2017.[7] B. Bauer, M. Kohler, et al. On deep learning as a remedy for the curse of dimen-sionality in nonparametric regression.
The Annals of Statistics , 47(4):2261–2285,2019.[8] J.-D. Benamou and Y. Brenier. A computational fluid mechanics solution to themonge-kantorovich mass transfer problem.
Numerische Mathematik , 84(3):375–393,2000.[9] M. Bi´nkowski, D. J. Sutherland, M. Arbel, and A. Gretton. Demystifying MMDGANs. In
ICLR , 2018.[10] Y. Brenier. Polar factorization and monotone rearrangement of vector-valuedfunctions.
Communications on pure and applied mathematics , 44(4):375–417, 1991.[11] C. Chen, C. Li, L. Chen, W. Wang, Y. Pu, and L. C. Duke. Continuous-time flowsfor efficient inference and density estimation. In
ICML , 2018.[12] T. Q. Chen, Y. Rubanova, J. Bettencourt, and D. K. Duvenaud. Neural ordinarydifferential equations. In
NIPS , 2018.[13] A. P. Dawid. The geometry of proper scoring rules.
Annals of the Institute ofStatistical Mathematics , 59(1):77–93, 2007.[14] E. De Giorgi. New problems on minimizing movements. in boundary value problemsfor partial differential equations, res. notes appl. math. vol. 29. pages 81–98, 1993.[15] I. Deshpande, Z. Zhang, and A. G. Schwing. Generative modeling using the slicedwasserstein distance. In
CVPR , 2018.[16] L. Dinh, D. Krueger, and Y. Bengio. NICE: Non-linear independent componentsestimation. In
ICLR , 2015.[17] L. Dinh, J. Sohl-Dickstein, and S. Bengio. Density estimation using Real NVP. In
ICLR , 2017.[18] Y. Gao, Y. Jiao, Y. Wang, Y. Wang, C. Yang, and S. Zhang. Deep generativelearning via variational gradient flow. In
ICML , 2019.[19] A. Genevay, G. Peyre, and M. Cuturi. Learning generative models with sinkhorndivergences. In
ICML , 2018.[20] T. Gneiting and A. E. Raftery. Strictly proper scoring rules, prediction, andestimation.
Journal of the American statistical Association , 102(477):359–378, 2007.[21] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair,A. Courville, and Y. Bengio. Generative adversarial nets. In
NIPS , 2014.2722] W. Grathwohl, R. Chen, J. Bettencourt, and D. Duvenaud. Scalable reversiblegenerative models with free-form continuous dynamics. In
ICLR Workshop , 2019.[23] W. Grathwohl, R. T. Chen, J. Bettencourt, I. Sutskever, and D. Duvenaud. Ffjord:Free-form continuous dynamics for scalable reversible generative models. In
ICLR ,2019.[24] M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, and S. Hochreiter. GANstrained by a two time-scale update rule converge to a local nash equilibrium. In
NIPS , 2017.[25] I. Higgins, L. Matthey, A. Pal, C. Burgess, X. Glorot, M. Botvinick, S. Mohamed, andA. Lerchner. β -VAE: Learning basic visual concepts with a constrained variationalframework. In ICLR , 2017.[26] R. Johnson and T. Zhang. Composite functional gradient learning of generativeadversarial models. In
ICML , 2018.[27] R. Jordan, D. Kinderlehrer, and F. Otto. The variational formulation of thefokker–planck equation.
SIAM journal on mathematical analysis , 29(1):1–17, 1998.[28] T. Kanamori and M. Sugiyama. Statistical analysis of distance estimators withdensity differences and density ratios.
Entropy , 16(2):921–942, 2014.[29] D. P. Kingma and P. Dhariwal. Glow: Generative flow with invertible 1x1 convolu-tions. In
NeurIPS , 2018.[30] D. P. Kingma, T. Salimans, R. Jozefowicz, X. Chen, I. Sutskever, and M. Welling.Improved variational inference with inverse autoregressive flow. In
NIPS , 2016.[31] D. P. Kingma and M. Welling. Auto-encoding variational bayes. In
ICLR , 2014.[32] S. Kolouri, P. E. Pope, C. E. Martin, and G. K. Rohde. Sliced-wasserstein autoen-coder: An embarrassingly simple generative model. In
ICLR , 2019.[33] A. Krizhevsky and G. Hinton. Learning multiple layers of features from tiny images.Technical report, Citeseer, 2009.[34] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning appliedto document recognition.
Proceedings of the IEEE , 86(11):2278–2324, 1998.[35] R. J. LeVeque.
Finite difference methods for ordinary and partial differentialequations: steady-state and time-dependent problems , volume 98. 2007.[36] C.-L. Li, W.-C. Chang, Y. Cheng, Y. Yang, and B. P´oczos. MMD GAN: Towardsdeeper understanding of moment matching network. In
NIPS , 2017.[37] Y. Li, K. Swersky, and R. Zemel. Generative moment matching networks. In
ICML ,2015. 2838] H. Liu, G. Xianfeng, and D. Samaras. A two-step computation of the exact ganwasserstein distance. In
ICML , 2018.[39] Z. Liu, P. Luo, X. Wang, and X. Tang. Deep learning face attributes in the wild. In
ICCV , 2015.[40] A. Liutkus, U. Simsekli, S. Majewski, A. Durmus, F.-R. St¨oter, K. Chaudhuri, andR. Salakhutdinov. Sliced-wasserstein flows: Nonparametric generative modeling viaoptimal transport and diffusions. In
ICML , 2019.[41] A. Makhzani, J. Shlens, N. Jaitly, I. Goodfellow, and B. Frey. Adversarial autoen-coders. In
ICLR , 2016.[42] X. Mao, Q. Li, H. Xie, R. Y. Lau, Z. Wang, and S. P. Smolley. Least squaresgenerative adversarial networks. In
ICCV , 2017.[43] R. J. McCann et al. Existence and uniqueness of monotone measure-preservingmaps.
Duke Mathematical Journal , 80(2):309–324, 1995.[44] S. Mohamed and B. Lakshminarayanan. Learning in implicit generative models. arXiv preprint arXiv:1610.03483 , 2016.[45] Y. Mroueh and T. Sercu. Fisher GAN. In
NIPS , 2017.[46] S. Nowozin, B. Cseke, and R. Tomioka. f -GAN: Training generative neural samplersusing variational divergence minimization. In NIPS , 2016.[47] G. Papamakarios, T. Pavlakou, and I. Murray. Masked autoregressive flow fordensity estimation. In
NIPS , 2017.[48] G. Patrini, S. Bhargav, R. van den Berg, M. Welling, P. Forr´e, T. Genewein,M. Carioni, K. Graz, F. Nielsen, and C. Sony. Sinkhorn autoencoders. In
UAI , 2019.[49] S. Reed, Z. Akata, X. Yan, L. Logeswaran, B. Schiele, and H. Lee. Generativeadversarial text to image synthesis. In
ICML , 2016.[50] D. J. Rezende and S. Mohamed. Variational inference with normalizing flows. In
ICML , 2015.[51] K. Roth, A. Lucchi, S. Nowozin, and T. Hofmann. Stabilizing training of generativeadversarial networks through regularization. In
NIPS , pages 2018–2028, 2017.[52] R. Salakhutdinov. Learning deep generative models.
Annual Review of Statisticsand Its Application , 2:361–385, 2015.[53] F. Santambrogio.
Optimal transport for applied mathematicians . Springer, 2015.[54] C. K. Sønderby, J. Caballero, L. Theis, W. Shi, and F. Husz´ar. Amortised mapinference for image super-resolution. In
ICLR , 2017.2955] B. K. Sriperumbudur, K. Fukumizu, A. Gretton, B. Sch¨olkopf, G. R. Lanckriet,et al. On the empirical estimation of integral probability metrics.
Electronic Journalof Statistics , 6:1550–1599, 2012.[56] M. Sugiyama, T. Kanamori, T. Suzuki, M. D. Plessis, S. Liu, and I. Takeuchi.Density-difference estimation. In
NIPS , 2012.[57] M. Sugiyama, T. Suzuki, and T. Kanamori.
Density ratio estimation in machinelearning . Cambridge University Press, 2012.[58] D. J. Sutherland, H.-Y. Tung, H. Strathmann, S. De, A. Ramdas, A. Smola, andA. Gretton. Generative models and model criticism via optimized maximum meandiscrepancy. In
ICLR , 2017.[59] C. Tao, L. Chen, R. Henao, J. Feng, and L. C. Duke. Chi-square generativeadversarial network. In
ICML , 2018.[60] I. Tolstikhin, O. Bousquet, S. Gelly, and B. Schoelkopf. Wasserstein auto-encoders.In
ICML , 2018.[61] C. Villani.
Optimal transport: old and new , volume 338. Springer Science & BusinessMedia, 2008.[62] L. Zhang, L. Wang, et al. Monge-ampere flow for generative modeling. arXivpreprint arXiv:1809.10188 , 2018.[63] S. Zhang, Y. Gao, Y. Jiao, J. Liu, Y. Wang, and C. Yang. Wasserstein-wassersteinauto-encoders. arXiv preprint arXiv:1902.09323 , 2019.[64] J.-Y. Zhu, T. Park, P. Isola, and A. A. Efros. Unpaired image-to-image translationusing cycle-consistent adversarial networks. In
ICCV , 2017.[Anthony & Bartlett(2009)] Anthony, M. and Bartlett, P. L.
Neural network learning:Theoretical foundations . cambridge university press, 2009.[Arnold(2012)] Arnold, V. I.
Geometrical methods in the theory of ordinary differentialequations , volume 250. Springer Science & Business Media, 2012.[Bartlett & Mendelson(2002)] Bartlett, P. L. and Mendelson, S. Rademacher and gaus-sian complexities: Risk bounds and structural results.
Journal of Machine LearningResearch , 3:463–482, 2002.[Bartlett et al.(2019)] Bartlett, P. L., Harvey, N., Liaw, C., and Mehrabian, A. Nearly-tight vc-dimension and pseudodimension bounds for piecewise linear neural networks.
Journal of Machine Learning Research , 20:1–17, 2019.[Clarke(1990)] Clarke, F. H.
Optimization and nonsmooth analysis , volume 5. Siam,1990.[Gelfand & Silverman(2000)] Gelfand, I. M., Silverman, R. A., et al.
Calculus of varia-tions . 2000. 30Shen et al.(2019)] Shen, Z., Yang, H., and Zhang, S. Deep network approximationcharacterized by number of neurons. arXiv preprint arXiv:1906.05497 , 2019.[Vershynin(2018)] Vershynin, R.
High-dimensional probability: An introduction withapplications in data science , volume 47. Cambridge university press, 2018.[Heusel et al.(2017)] Heusel, M., Ramsauer, H., Unterthiner, T., Nessler, B., and Hochre-iter, S. GANs trained by a two time-scale update rule converge to a local nashequilibrium. In
NIPS , 2017.[Ioffe & Szegedy(2015)] Ioffe, S. and Szegedy, C. Batch normalization: Acceleratingdeep network training by reducing internal covariate shift. In
ICML , 2015.[Miyato et al.(2018)] Miyato, T., Kataoka, T., Koyama, M., and Yoshida, Y. Spectralnormalization for generative adversarial networks. In
ICLR , 2018.[Szegedy et al.(2016)] Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., and Wojna, Z.Rethinking the inception architecture for computer vision. In
CVPR , 2016.[Chertock(2017)] Chertock, A. A practical guide to deterministic particle methods. In