COT-GAN: Generating Sequential Data via Causal Optimal Transport
A Preprint
Tianlin Xu
London School of Economics
Li K. Wenliang
University College London
Michael Munn
Beatrice Acciaio*
London School of Economics

*We would like to thank the Erwin Schrödinger Institute for its financial support during the thematic programme on Optimal Transport (May 2019, Vienna).

Abstract
We introduce COT-GAN, an adversarial algorithm to train implicit generative models optimized for producing sequential data. The loss function of this algorithm is formulated using ideas from Causal Optimal Transport (COT), which combines classic optimal transport methods with an additional temporal causality constraint. Remarkably, we find that this causality condition provides a natural framework to parameterize the cost function that is learned by the discriminator as a robust (worst-case) distance, and an ideal mechanism for learning time-dependent data distributions. Following Genevay et al. (2018), we also include an entropic penalization term which allows for the use of the Sinkhorn algorithm when computing the optimal transport cost. Our experiments show effectiveness and stability of COT-GAN when generating both low- and high-dimensional time series data. The success of the algorithm also relies on a new, improved version of the Sinkhorn divergence which demonstrates less bias in learning.
1 Introduction

Dynamical data are ubiquitous in the world, including natural scenes such as video and audio data, and temporal recordings such as physiological and financial traces. Being able to synthesize realistic dynamical data is a challenging unsupervised learning problem and has wide scientific and practical applications. In recent years, training implicit generative models (IGMs) has proven to be a promising approach to data synthesis, driven by the work on generative adversarial networks (GANs) [22].

Nonetheless, training IGMs on dynamical data poses an interesting yet difficult challenge. On one hand, learning complex dependencies between spatial locations and channels for static images has already received significant effort within the research community. On the other hand, temporal dependencies are no less complicated, since the dynamical features are strongly correlated with spatial features. Recent works, including [34, 42, 15, 39, 36], often tackle this problem by separating the model or loss into static and dynamic components.

In this paper, we consider training dynamic IGMs for sequential data. We introduce a new adversarial objective that builds on optimal transport (OT) theory and constrains the transport plans to respect causality: the probability mass moved to the target sequence at time $t$ can only depend on the source sequence up to time $t$ [3, 7]. A reformulation of the causality constraint leads to an adversarial training objective in the spirit of [19], but tailored to sequential data. In addition, we demonstrate that optimizing the original Sinkhorn divergence over mini-batches causes biased parameter estimation, and propose the mixed Sinkhorn divergence, which avoids this problem. Our new framework, Causal Optimal Transport GAN (COT-GAN), outperforms existing methods on a wide range of datasets, from traditional time series to high-dimensional videos.
2 Background

Goodfellow et al. [22] introduced an adversarial scheme for training an IGM. Given a (real) data distribution $\mu = \frac{1}{N}\sum_{i=1}^{N}\delta_{x^i}$, $x^i\in\mathcal{X}$, and a distribution $\zeta$ on some latent space $\mathcal{Z}$, the generator is a function $g:\mathcal{Z}\to\mathcal{X}$ trained so that the induced distribution $\nu = \zeta\circ g^{-1}$ is as close as possible to $\mu$ as judged by a discriminator. The discriminator is a function $f:\mathcal{X}\to[0,1]$ trained to output a high value if the input is real (from $\mu$), and a low value otherwise (from $\nu$). In practice, the two functions are implemented as neural networks $g_\theta$ and $f_\varphi$ with parameters $\theta$ and $\varphi$, and the generator distribution is denoted by $\nu_\theta$. The training objective is then formulated as a zero-sum game between the generator and the discriminator. Different probability divergences were later proposed to evaluate the distance between $\mu$ and $\nu_\theta$ [30, 26, 29, 4]. Notably, the Wasserstein-1 distance was used in [6, 5]:
$$W_1(\mu,\nu) = \inf_{\pi\in\Pi(\mu,\nu)} \mathbb{E}_\pi\big[\|x-y\|\big], \qquad (2.1)$$
where $\Pi(\mu,\nu)$ is the space of transport plans (couplings) between $\mu$ and $\nu$. Its dual form turns out to be a maximization problem over $\varphi$ such that $f_\varphi$ is Lipschitz. Combined with the minimization over $\theta$, a min-max problem can be formulated with a Lipschitz constraint on $f_\varphi$.

The optimization in (2.1) is a special case of the classical (Kantorovich) optimal transport problem. Given probability measures $\mu$ on $\mathcal{X}$, $\nu$ on $\mathcal{Y}$, and a cost function $c:\mathcal{X}\times\mathcal{Y}\to\mathbb{R}$, the optimal transport problem is formulated as
$$\mathcal{W}_c(\mu,\nu) := \inf_{\pi\in\Pi(\mu,\nu)} \mathbb{E}_\pi[c(x,y)]. \qquad (2.2)$$
Here, $c(x,y)$ represents the cost of transporting a unit of mass from $x\in\mathcal{X}$ to $y\in\mathcal{Y}$, and $\mathcal{W}_c(\mu,\nu)$ is thus the minimal total cost to transport the mass from $\mu$ to $\nu$. Obviously, the Wasserstein-1 distance (2.1) corresponds to $c(x,y)=\|x-y\|$. However, when $\mu$ and $\nu$ are supported on finite sets of size $n$, solving (2.2) has super-cubic (in $n$) complexity [14, 31, 32], which is computationally expensive for large datasets.

Instead, Genevay et al. [19] proposed training IGMs by minimizing a regularized Wasserstein distance that can be computed more efficiently by the Sinkhorn algorithm (see [14]). For transport plans with marginals $\mu$ supported on a finite set $\{x^i\}_i$ and $\nu$ on a finite set $\{y^j\}_j$, any $\pi\in\Pi(\mu,\nu)$ is also discrete with support on the set of all possible pairs $\{(x^i,y^j)\}_{i,j}$. Denoting $\pi_{ij}=\pi(x^i,y^j)$, the Shannon entropy of $\pi$ is given by $H(\pi) := -\sum_{i,j}\pi_{ij}\log(\pi_{ij})$. For $\varepsilon>0$, the regularized optimal transport problem then reads as
$$\mathcal{P}_{c,\varepsilon}(\mu,\nu) := \inf_{\pi\in\Pi(\mu,\nu)}\big\{\mathbb{E}_\pi[c(x,y)] - \varepsilon H(\pi)\big\}. \qquad (2.3)$$
Denoting by $\pi^{c,\varepsilon}(\mu,\nu)$ the optimizer in (2.3), one can define a regularized distance by
$$\mathcal{W}_{c,\varepsilon}(\mu,\nu) := \mathbb{E}_{\pi^{c,\varepsilon}(\mu,\nu)}[c(x,y)]. \qquad (2.4)$$
Computing this distance is numerically more stable than solving the dual formulation of the OT problem, as the latter requires differentiating dual Kantorovich potentials; see e.g. [12, Proposition 3].
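For concreteness, the mini-batch computation of (2.3)–(2.4) can be sketched as follows. This is a minimal numpy illustration, not the implementation used in the paper: the function name, the uniform batch weights, the fixed number of iterations and the cost $c(x,y)=\|x-y\|$ are our own choices.

```python
import numpy as np

def sinkhorn_distance(x, y, eps=0.1, n_iter=100):
    """Approximate W_{c,eps}(x_hat, y_hat) for two empirical batches.

    x: (m, d) samples from mu_hat, y: (n, d) samples from nu_hat,
    with cost c(x, y) = ||x - y||.  A log-domain implementation is
    preferable for very small eps (here exp(-C/eps) may underflow).
    """
    m, n = x.shape[0], y.shape[0]
    C = np.linalg.norm(x[:, None, :] - y[None, :, :], axis=-1)  # cost matrix
    K = np.exp(-C / eps)                                        # Gibbs kernel
    a, b = np.full(m, 1.0 / m), np.full(n, 1.0 / n)             # uniform marginals
    u, v = np.ones(m), np.ones(n)
    for _ in range(n_iter):                                     # Sinkhorn fixed-point updates
        u = a / (K @ v)
        v = b / (K.T @ u)
    pi = u[:, None] * K * v[None, :]                            # entropic plan solving (2.3)
    return float(np.sum(pi * C))                                # E_pi[c], eq. (2.4)
```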
To correct the fact that $\mathcal{W}_{c,\varepsilon}(\alpha,\alpha)\neq 0$, Genevay et al. [19] proposed to use the Sinkhorn divergence
$$\widehat{\mathcal{W}}_{c,\varepsilon}(\mu,\nu) := 2\,\mathcal{W}_{c,\varepsilon}(\mu,\nu) - \mathcal{W}_{c,\varepsilon}(\mu,\mu) - \mathcal{W}_{c,\varepsilon}(\nu,\nu) \qquad (2.5)$$
as the objective function, and to learn the cost $c_\varphi(x,y)=\|f_\varphi(x)-f_\varphi(y)\|$ parameterized by $\varphi$, resulting in the following adversarial objective:
$$\inf_\theta\sup_\varphi\ \widehat{\mathcal{W}}_{c_\varphi,\varepsilon}(\mu,\nu_\theta). \qquad (2.6)$$
In practice, a sample version of this cost is used, where $\mu$ and $\nu$ are replaced by distributions of mini-batches randomly extracted from them.

3 Training IGMs with Causal Optimal Transport

3.1 Causal Optimal Transport

We now focus on data that consist of $d$-dimensional (number of channels), $T$-long sequences, so that $\mu$ and $\nu$ are distributions on the path space $\mathbb{R}^{d\times T}$. In this setting we introduce a special class of transport plans, between $\mathcal{X}=\mathbb{R}^{d\times T}$ and $\mathcal{Y}=\mathbb{R}^{d\times T}$, that will be used to define our objective function. On $\mathcal{X}\times\mathcal{Y}$, we denote by $x=(x_1,\ldots,x_T)$ and $y=(y_1,\ldots,y_T)$ the first and second half of the coordinates, and we let $\mathcal{F}^{\mathcal{X}}=(\mathcal{F}^{\mathcal{X}}_t)_{t=1}^T$ and $\mathcal{F}^{\mathcal{Y}}=(\mathcal{F}^{\mathcal{Y}}_t)_{t=1}^T$ be the canonical filtrations (for all $t$, $\mathcal{F}^{\mathcal{X}}_t$ is the smallest $\sigma$-algebra such that $(x_1,\ldots,x_T)\mapsto(x_1,\ldots,x_t)$ is measurable; analogously for $\mathcal{F}^{\mathcal{Y}}$).

Definition 3.1. A transport plan $\pi\in\Pi(\mu,\nu)$ is called causal if
$$\pi(dy_t \mid dx_1,\ldots,dx_T) = \pi(dy_t \mid dx_1,\ldots,dx_t) \quad \text{for all } t=1,\ldots,T-1.$$
The set of all such plans will be denoted by $\Pi_{\mathcal{K}}(\mu,\nu)$.

Roughly speaking, the amount of mass transported by $\pi$ to a subset of the target space $\mathcal{Y}$ belonging to $\mathcal{F}^{\mathcal{Y}}_t$ depends on the source space $\mathcal{X}$ only up to time $t$. Thus, a causal plan transports $\mu$ into $\nu$ in a non-anticipative way, which is a natural request in a sequential framework. In the present paper, we will use causality in the sense of Definition 3.1. However, note that in the literature, the term causality is often used to indicate a mapping in which the output at a given time $t$ depends only on inputs up to time $t$.

Restricting the space of transport plans to $\Pi_{\mathcal{K}}$ in the OT problem (2.2) gives the COT problem
$$\mathcal{W}^{\mathcal{K}}_c(\mu,\nu) := \inf_{\pi\in\Pi_{\mathcal{K}}(\mu,\nu)} \mathbb{E}_\pi[c(x,y)]. \qquad (3.1)$$
COT has already found wide application in dynamic problems in stochastic calculus and mathematical finance, see e.g. [2, 1, 3, 9, 8]. The causality constraint can be equivalently formulated in several ways, see [7, Proposition 2.3]. The one that will be useful for our purposes can be expressed in the following way: let $\mathcal{M}(\mathcal{F}^{\mathcal{X}},\mu)$ be the set of $(\mathcal{X},\mathcal{F}^{\mathcal{X}},\mu)$-martingales, and define
$$\mathcal{H}(\mu) := \big\{(h,M):\ h=(h_t)_{t=1}^{T-1},\ h_t\in\mathcal{C}_b(\mathbb{R}^t),\ M=(M_t)_{t=1}^{T}\in\mathcal{M}(\mathcal{F}^{\mathcal{X}},\mu),\ M_t\in\mathcal{C}_b(\mathbb{R}^t)\big\};$$
then a transport plan $\pi\in\Pi(\mu,\nu)$ is causal if and only if
$$\mathbb{E}_\pi\Big[\textstyle\sum_{t=1}^{T-1} h_t(y_{\le t})\,\Delta_{t+1}M(x_{\le t+1})\Big] = 0 \quad \text{for all } (h,M)\in\mathcal{H}(\mu), \qquad (3.2)$$
where $x_{\le t}:=(x_1,x_2,\ldots,x_t)$ and similarly for $y_{\le t}$, and $\Delta_{t+1}M(x_{\le t+1}) := M_{t+1}(x_{\le t+1}) - M_t(x_{\le t})$. As usual, $\mathcal{C}_b(\mathcal{X})$ denotes the space of continuous, bounded functions on $\mathcal{X}$. Where no confusion can arise, with an abuse of notation we will simply write $h_t(y)$, $M_t(x)$, $\Delta_{t+1}M(x)$ rather than $h_t(y_{\le t})$, $M_t(x_{\le t})$, $\Delta_{t+1}M(x_{\le t+1})$.

3.2 Regularized Causal Optimal Transport

In the same spirit of [19], we include an entropic regularization in the COT problem (3.1) and consider
$$\mathcal{P}^{\mathcal{K}}_{c,\varepsilon}(\mu,\nu) := \inf_{\pi\in\Pi_{\mathcal{K}}(\mu,\nu)}\big\{\mathbb{E}_\pi[c(x,y)] - \varepsilon H(\pi)\big\}. \qquad (3.3)$$
The solution to this problem is then unique due to strict concavity of $H$.
We denote by $\pi^{\mathcal{K}}_{c,\varepsilon}(\mu,\nu)$ the optimizer to the above problem, and define the regularized COT distance by
$$\mathcal{W}^{\mathcal{K}}_{c,\varepsilon}(\mu,\nu) := \mathbb{E}_{\pi^{\mathcal{K}}_{c,\varepsilon}(\mu,\nu)}[c(x,y)].$$

Remark 3.2.
In analogy to the non-causal case, it can be shown that, for discrete $\mu$ and $\nu$ (as in practice), the following limits hold:
$$\mathcal{W}^{\mathcal{K}}_c(\mu,\nu) \xleftarrow[\ \varepsilon\to 0\ ]{} \mathcal{W}^{\mathcal{K}}_{c,\varepsilon}(\mu,\nu) \xrightarrow[\ \varepsilon\to\infty\ ]{} \mathbb{E}_{\mu\otimes\nu}[c(x,y)],$$
where $\mu\otimes\nu$ denotes the independent coupling. See Appendix A.1 for a proof. This means that the regularized COT distance lies between the COT distance and the loss obtained by the independent coupling, and is closer to the former for small $\varepsilon$. Optimizing over the space of causal plans $\Pi_{\mathcal{K}}(\mu,\nu)$ is not straightforward. Nonetheless, the following proposition shows that the problem can be reformulated as a maximization over non-causal problems with respect to a specific family of cost functions.

Proposition 3.3.
The regularized COT problem (3.3) can be reformulated as
$$\mathcal{P}^{\mathcal{K}}_{c,\varepsilon}(\mu,\nu) = \sup_{l\in\mathcal{L}(\mu)} \mathcal{P}_{c+l,\varepsilon}(\mu,\nu), \qquad (3.4)$$
where
$$\mathcal{L}(\mu) := \Big\{\textstyle\sum_{j=1}^{J}\sum_{t=1}^{T-1} h^j_t(y)\,\Delta_{t+1}M^j(x)\ :\ J\in\mathbb{N},\ (h^j,M^j)\in\mathcal{H}(\mu)\Big\}. \qquad (3.5)$$

This means that the optimal value of the regularized COT problem equals the maximum value over the family of regularized OT problems w.r.t. the set of cost functions $\{c+l : l\in\mathcal{L}(\mu)\}$. This result has been proven in [3]. As it is crucial for our analysis, we show it in Appendix A.2.

Proposition 3.3 suggests the following worst-case distance between $\mu$ and $\nu$:
$$\sup_{l\in\mathcal{L}(\mu)} \mathcal{W}_{c+l,\varepsilon}(\mu,\nu), \qquad (3.6)$$
as a regularized Sinkhorn distance that respects the causality constraint on the transport plans.

In the context of training a dynamic IGM, the training dataset is a collection of paths $\{x^i\}_{i=1}^N$ of equal length $T$, $x^i=(x^i_1,\ldots,x^i_T)$, $x^i_t\in\mathbb{R}^d$. As $N$ is usually very large, we proceed as usual by approximating $\mathcal{W}_{c+l,\varepsilon}(\mu,\nu)$ with its empirical mini-batch counterpart. Precisely, for a given IGM $g_\theta$, we fix a batch size $m$ and sample $\{x^i\}_{i=1}^m$ from the dataset and $\{z^i\}_{i=1}^m$ from $\zeta$. Denote the generated samples by $y^i_\theta = g_\theta(z^i)$, and the empirical distributions by
$$\hat{x} = \frac{1}{m}\sum_{i=1}^m \delta_{x^i}, \qquad \hat{y}_\theta = \frac{1}{m}\sum_{i=1}^m \delta_{y^i_\theta}.$$
The empirical distance $\mathcal{W}_{c+l,\varepsilon}(\hat{x},\hat{y}_\theta)$ can be efficiently approximated by the Sinkhorn algorithm.

3.3 Sinkhorn divergence at the level of mini-batches

When implementing the Sinkhorn divergence (2.5) at the level of mini-batches, one canonical candidate clearly is
$$2\,\mathcal{W}_{c_\varphi,\varepsilon}(\hat{x},\hat{y}_\theta) - \mathcal{W}_{c_\varphi,\varepsilon}(\hat{x},\hat{x}) - \mathcal{W}_{c_\varphi,\varepsilon}(\hat{y}_\theta,\hat{y}_\theta), \qquad (3.7)$$
which is indeed what is used in [19]. While the expression in (3.7) does converge in expectation to (2.5) for $m\to\infty$ ([20, Theorem 3]), it is not clear whether it is an adequate loss given data of fixed batch size $m$. In fact, we find that this is not the case, and demonstrate it here empirically.

Figure 1: Regularized distance (2.4), Sinkhorn divergence (2.5) and mixed Sinkhorn divergence (3.8) computed for mini-batches of size $m$ from $\mu$ and $\nu_\theta$, where $\mu=\nu_{0.8}$. Color indicates batch size. Curve and errorbar show the mean and sem estimated from 300 draws of mini-batches. (Rows correspond to $\mathcal{W}_{c,\varepsilon}$, $\widehat{\mathcal{W}}_{c,\varepsilon}$ and $\widehat{\mathcal{W}}^{\mathrm{mix}}_{c,\varepsilon}$; columns to $\varepsilon = 0.1, 1.0, 10.0$; batch sizes $m\in\{10, 30, 100, 300, 1000\}$.)

Example 3.4.
We build an example where the data distribution $\mu$ belongs to a parameterized family of distributions $\{\nu_\theta\}_\theta$, with $\mu=\nu_{0.8}$ (details in Appendix A.3). As shown in Figure 1 (top two rows), neither the expected regularized distance (2.4) nor the Sinkhorn divergence (2.5) reaches its minimum at $\theta=0.8$, especially for small $m$. This means that optimizing $\nu_\theta$ over mini-batches will not lead to $\mu$.

Instead, we propose the following mixed Sinkhorn divergence at the level of mini-batches:
$$\widehat{\mathcal{W}}^{\mathrm{mix}}_{c,\varepsilon}(\hat{x},\hat{x}',\hat{y}_\theta,\hat{y}'_\theta) := \mathcal{W}_{c,\varepsilon}(\hat{x},\hat{y}_\theta) + \mathcal{W}_{c,\varepsilon}(\hat{x}',\hat{y}'_\theta) - \mathcal{W}_{c,\varepsilon}(\hat{x},\hat{x}') - \mathcal{W}_{c,\varepsilon}(\hat{y}_\theta,\hat{y}'_\theta), \qquad (3.8)$$
where $\hat{x}$ and $\hat{x}'$ are the empirical distributions of mini-batches from the data distribution, and $\hat{y}_\theta$ and $\hat{y}'_\theta$ from the IGM distribution $\zeta\circ g_\theta^{-1}$. The idea is to take into account the bias within the distribution $\mu$ and that within the distribution $\nu_\theta$ as well. The proposed divergence finds the correct minimizer for all $m$ in Example 3.4 (Figure 1, bottom), and the improvement is not due solely to the double batch used by Equation (3.8). We further discuss this choice and our findings in Appendix A.3.

3.4 COT-GAN: adversarial training for sequential data

We now combine the results in Section 3.2 and Section 3.3 to formulate an adversarial training algorithm for IGMs. First, we approximate the set of functions (3.5) by truncating the sums at a fixed $J$, and we parameterize $h_{\varphi_1} := (h^j_{\varphi_1})_{j=1}^J$ and $M_{\varphi_2} := (M^j_{\varphi_2})_{j=1}^J$ as two separate neural networks, and let $\varphi := (\varphi_1,\varphi_2)$. To capture the adaptedness of those processes, we employ architectures where the output at time $t$ depends on the input only up to time $t$. The mixed Sinkhorn divergence between $\hat{x}$ and $\hat{y}_\theta$ is then calculated with respect to a parameterized cost function
$$c^{\mathcal{K}}_\varphi(x,y) := c(x,y) + \sum_{j=1}^{J}\sum_{t=1}^{T-1} h^j_{\varphi_1,t}(y)\,\Delta_{t+1}M^j_{\varphi_2}(x). \qquad (3.9)$$
Second, it is not obvious how to directly impose the martingale condition, as constraints involving conditional expectations cannot be easily enforced in practice. Rather, we penalize processes $M$ for which increments at every time step are non-zero on average. For an $(\mathcal{X},\mathcal{F}^{\mathcal{X}})$-adapted process $M^j_{\varphi_2}$ and a mini-batch $\{x^i\}_{i=1}^m$ ($\sim\hat{x}$), we define the martingale penalization for $M_{\varphi_2}=(M^j_{\varphi_2})_{j=1}^J$ as
$$p_{M_{\varphi_2}}(\hat{x}) := \frac{1}{mT}\sum_{j=1}^{J}\sum_{t=1}^{T-1}\left|\,\sum_{i=1}^m \frac{\Delta_{t+1}M^j_{\varphi_2}(x^i)}{\sqrt{\mathrm{Var}[M^j_{\varphi_2}]+\eta}}\,\right|,$$
where $\mathrm{Var}[M]$ is the empirical variance of $M$ over time and batch, and $\eta>0$ is a small constant. Third, we use the mixed normalization introduced in (3.8). Each of the four terms is approximated by running the Sinkhorn algorithm on the cost $c^{\mathcal{K}}_\varphi$ for $L$ iterations.

Altogether, we arrive at the following adversarial objective function for COT-GAN:
$$\widehat{\mathcal{W}}^{\mathrm{mix},L}_{c^{\mathcal{K}}_\varphi,\varepsilon}(\hat{x},\hat{x}',\hat{y}_\theta,\hat{y}'_\theta) - \lambda\, p_{M_{\varphi_2}}(\hat{x}), \qquad (3.10)$$
where $\hat{x}$ and $\hat{x}'$ are empirical measures corresponding to non-overlapping subsets of the dataset, $\hat{y}_\theta$ and $\hat{y}'_\theta$ are the ones corresponding to two samples from $\nu_\theta$, and $\lambda$ is a positive constant. We update $\theta$ to decrease this objective, and $\varphi$ to increase it.
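To make these ingredients concrete, the sketch below assembles the causal cost (3.9), the martingale penalty and the mixed divergence (3.8) for one pair of mini-batches in numpy. It is a sketch under our own assumptions, not the authors' code: `h_fn` and `M_fn` stand in for the adapted networks $h_{\varphi_1}$ and $M_{\varphi_2}$, the base cost is the Euclidean norm between flattened paths, and the gradients needed for training are not handled here.

```python
import numpy as np

def sinkhorn_from_cost(C, eps=0.1, n_iter=100):
    # Sinkhorn iterations on a precomputed cost matrix, as in the earlier sketch.
    m, n = C.shape
    K = np.exp(-C / eps)
    a, b = np.full(m, 1.0 / m), np.full(n, 1.0 / n)
    u, v = np.ones(m), np.ones(n)
    for _ in range(n_iter):
        u, v = a / (K @ v), b / (K.T @ u)
    pi = u[:, None] * K * v[None, :]
    return float(np.sum(pi * C))

def causal_cost_matrix(x, y, h_fn, M_fn):
    """c^K_phi(x^i, y^j) of eq. (3.9) for all pairs in two (m, T, d) batches.

    h_fn(y) -> (m, T, J) and M_fn(x) -> (m, T, J) must be adapted maps:
    their output at time t may only depend on inputs up to time t.
    """
    diff = x[:, None, :, :] - y[None, :, :, :]
    base = np.sqrt((diff ** 2).sum(axis=(2, 3)))      # ||x^i - y^j|| over flattened paths
    h = h_fn(y)                                       # evaluated on target paths
    dM = np.diff(M_fn(x), axis=1)                     # Delta_{t+1} M(x), shape (m, T-1, J)
    causal = np.einsum('jtk,itk->ij', h[:, :-1, :], dM)
    return base + causal

def martingale_penalty(M_fn, x, eta=1e-10):
    """p_{M_phi}(x_hat): penalize mean increments of M that are non-zero over the batch."""
    M = M_fn(x)                                       # (m, T, J)
    dM = np.diff(M, axis=1)
    scale = np.sqrt(M.var(axis=(0, 1)) + eta)         # empirical variance over batch and time
    m, T = x.shape[0], x.shape[1]
    return float(np.abs(dM.sum(axis=0) / scale).sum() / (m * T))

def mixed_sinkhorn_causal(x, x2, y, y2, h_fn, M_fn, eps=0.1):
    """Mixed Sinkhorn divergence (3.8), each term computed under the causal cost."""
    W = lambda a, b: sinkhorn_from_cost(causal_cost_matrix(a, b, h_fn, M_fn), eps)
    return W(x, y) + W(x2, y2) - W(x, x2) - W(y, y2)

# toy adapted maps (running means over time, J = 1), purely for illustration
h_fn = lambda p: np.cumsum(p.mean(-1, keepdims=True), axis=1) / \
                 np.arange(1, p.shape[1] + 1)[None, :, None]
M_fn = h_fn
# objective (3.10): mixed_sinkhorn_causal(x, x2, y, y2, h_fn, M_fn) - lam * martingale_penalty(M_fn, x)
```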
While the generator $g_\theta:\mathcal{Z}\to\mathcal{X}$ acts as in classical GANs, the adversarial role here is played by $h_{\varphi_1}$ and $M_{\varphi_2}$. In this setting, the discriminator, parameterized by $\varphi$, learns a robust (worst-case) distance between the real data distribution $\mu$ and the generated distribution $\nu_\theta$, where the class of cost functions as in (3.9) originates from causality. The algorithm is summarized in Algorithm 1. Its time complexity scales as $\mathcal{O}((J+d)LTm^2)$ for each iteration.

4 Related work

Early video generation literature focuses on dynamic texture modeling [16, 35, 40]. Recent efforts in video generation within the GAN community have been devoted to designing GAN architectures of generator and discriminator to tackle the spatio-temporal dependencies separately, e.g., [39, 34, 36]. VGAN [39] explored a two-stream generator that combines a network for a static background and another one for moving foreground, trained on the original GAN objective. TGAN [34] proposed a new structure capable of generating dynamic background, as well as a weight clipping trick to regularize the discriminator. In addition to a unified generator, MoCoGAN [36] employed two discriminators to judge both the quality of frames locally and the evolution of motions globally.

The broader literature of sequential data generation attempts to capture the dependencies in time by simply deploying recurrent neural networks in the architecture [28, 18, 23, 42]. Among them, TimeGAN [42] demonstrated improvements in time series generation by adding a teacher-forcing component in the loss function. Alternatively, WaveGAN [15] adopted the causal structure of WaveNet [38]. Despite substantial progress made, existing sequential GANs are generally domain-specific. We therefore aim to offer a framework that considers (transport) causality in the objective function and is suitable for more general sequential settings.

Whilst our analysis is built upon [14] and [19], we remark two major differences between COT-GAN and the Sinkhorn GAN in [19]. First, we consider a different family of costs. While [19] learns the cost function $c(f_\varphi(x),f_\varphi(y))$ by parametrizing $f$ with $\varphi$, the family of costs in COT-GAN is found by adding a causal component to $c(x,y)$ in terms of $h_{\varphi_1}$ and $M_{\varphi_2}$. Second is the mixed Sinkhorn divergence we propose, which reduces biases in parameter estimation and can be used as a generic divergence for training IGMs not limited to time series settings.
Algorithm 1: Training COT-GAN by SGD

Data: $\{x^i\}_{i=1}^N$ (real data), $\zeta$ (probability distribution on latent space $\mathcal{Z}$)
Parameters: $\theta_0$, $\varphi_0$, $m$ (batch size), $\varepsilon$ (regularization parameter), $L$ (number of Sinkhorn iterations), $\alpha$ (learning rate), $\lambda$ (martingale penalty coefficient)
Result: $\theta$, $\varphi$
Initialize: $\theta\leftarrow\theta_0$, $\varphi\leftarrow\varphi_0$
for $k = 1, 2, \ldots$ do
    Sample $\{x^i\}_{i=1}^m$ and $\{x'^i\}_{i=1}^m$ from real data;
    Sample $\{z^i\}_{i=1}^m$ and $\{z'^i\}_{i=1}^m$ from $\zeta$;
    $(y^i_\theta, y'^i_\theta) \leftarrow (g_\theta(z^i), g_\theta(z'^i))$;
    Compute $\widehat{\mathcal{W}}^{\mathrm{mix},L}_{c^{\mathcal{K}}_\varphi,\varepsilon}(\hat{x},\hat{x}',\hat{y}_\theta,\hat{y}'_\theta)$ (3.8) by the Sinkhorn algorithm, with $c^{\mathcal{K}}_\varphi$ given by (3.9);
    $\varphi \leftarrow \varphi + \alpha\,\nabla_\varphi\big(\widehat{\mathcal{W}}^{\mathrm{mix},L}_{c^{\mathcal{K}}_\varphi,\varepsilon}(\hat{x},\hat{x}',\hat{y}_\theta,\hat{y}'_\theta) - \lambda\, p_{M_{\varphi_2}}(\hat{x})\big);
    Sample $\{x^i\}_{i=1}^m$ and $\{x'^i\}_{i=1}^m$ from real data;
    Sample $\{z^i\}_{i=1}^m$ and $\{z'^i\}_{i=1}^m$ from $\zeta$;
    $(y^i_\theta, y'^i_\theta) \leftarrow (g_\theta(z^i), g_\theta(z'^i))$;
    Compute $\widehat{\mathcal{W}}^{\mathrm{mix},L}_{c^{\mathcal{K}}_\varphi,\varepsilon}(\hat{x},\hat{x}',\hat{y}_\theta,\hat{y}'_\theta)$ (3.8) by the Sinkhorn algorithm, with $c^{\mathcal{K}}_\varphi$ given by (3.9);
    $\theta \leftarrow \theta - \alpha\,\nabla_\theta\big(\widehat{\mathcal{W}}^{\mathrm{mix},L}_{c^{\mathcal{K}}_\varphi,\varepsilon}(\hat{x},\hat{x}',\hat{y}_\theta,\hat{y}'_\theta)\big);
end

Figure 2: Results on learning the multivariate AR-1 process. Top row shows the auto-correlation coefficient for each channel. Bottom row shows the correlation coefficient between channels averaged over time. The number on top of each panel is the sum of the absolute difference between the correlation coefficients computed from real (leftmost) and generated samples.
5 Experiments

We now validate COT-GAN empirically (code and data are available at github.com/neuripss2020/COT-GAN).

5.1 Low-dimensional time series

For time series that have a relatively small dimensionality $d$ but exhibit complex temporal structure, we compare COT-GAN with the following methods: direct minimization of the Sinkhorn divergences $\widehat{\mathcal{W}}^{\mathrm{mix}}_{c,\varepsilon}$ (3.8) and $\widehat{\mathcal{W}}_{c,\varepsilon}$ (3.7); TimeGAN [42] as reviewed in Section 4;
and Sinkhorn GAN, similar to [19], with cost $c(f_\varphi(x),f_\varphi(y))$ where $\varphi$ is trained to increase the mixed Sinkhorn divergence, with weight clipping. All methods use $c(x,y)=\|x-y\|$. The networks $h$ and $M$ in COT-GAN and $f$ in Sinkhorn GAN share the same architecture. Details of models and datasets are in Appendix B.1.

Autoregressive processes.
We first test whether COT-GAN can learn the temporal and spatial correlation in a multivariate first-order auto-regressive (AR-1) process. Results are shown in Figure 2. COT-GAN samples have correlation structures that best match the real data. Minimizing the mixed divergence produces almost as good correlations as COT-GAN, but with less accurate auto-correlation. Minimizing the original Sinkhorn divergence yields poor results, and neither TimeGAN nor Sinkhorn GAN could capture the correlation structure for this dataset.

Figure 3: Results on EEG data. The same correlations as in Figure 2 are shown.
Noisy oscillations.
The noisy oscillation distribution is composed of sequences of 20-element arrays (1-D images) [41]. Figure 6 in Appendix B.1 shows the data as well as samples generated by the different training methods. To evaluate performance, we estimate two attributes of the samples by Monte Carlo: the marginal distribution of pixel values, and the joint distribution of the location at adjacent time steps. COT-GAN samples match the real data best.
Electroencephalography (EEG).
This dataset is from the UCI repository [17] and contains recordings from 43 healthy subjects, each undergoing around 80 trials. Each data sequence has 64 channels and we model the first 100 time steps. We trained and evaluated each method 16 times with different training and test splits. We evaluated performance by the maximum mean discrepancy (MMD), and the match with data in terms of temporal and channel correlations and frequency spectrum. In addition, we investigated how the coefficient $\lambda$ affects sample quality. We show an example of the data and learned correlations in Figure 3, and summary statistics of all evaluation metrics in Figure 8 in Appendix B.1. COT-GANs generate the best samples compared with the other baselines across all four metrics. A smaller $\lambda$ tends to generate less realistic correlation patterns, but a slightly better match in frequency spectrum.

5.2 Videos

We train COT-GAN on Sprites animations [27, 33] and human action sequences [11], and compare with MoCoGAN [36]. The evaluation metrics are the Fréchet Inception Distance (FID) [24] comparing individual frames, the Fréchet Video Distance (FVD) [37], which compares the video sequences as a whole by mapping samples into features via pretrained 3D convolutional networks, and their kernel counterparts (KID, KVD) [10]. Previous studies suggest that FVD correlates better with human judgement than KVD for videos [37], whereas KID does so better than FID on images [44].

We pre-process the Sprites and human action sequences to have a sequence length of $T=13$ and $T=16$, respectively. Each frame has dimension $64\times 64\times 3$. We employ the same architecture of generator and discriminator to train both datasets. Both the generator and discriminator comprise generic LSTMs with 2-D convolutional layers. The detailed data pre-processing, GAN architectures, hyper-parameter settings, and training techniques are reported in Appendix B.2. We show the real data and samples from COT-GAN side by side in Figure 4.

Table 1: Evaluations for video datasets. Lower value means better sample quality.

Sprites
            FVD      FID     KVD     KID
MoCoGAN   1213.2    281.3   160.1    0.33
COT-GAN

Human actions
            FVD      FID     KVD     KID
MoCoGAN    661.8    128.4    60.4    0.21
COT-GAN
6 Discussion

The performance of COT-GAN suggests that constraining the transport plans to be causal is a promising direction for generating sequential data. The approximations we introduce, such as the mixed Sinkhorn divergence (3.8) and the truncated sum in (3.5), are sufficient to produce good experimental results, and provide opportunities for more theoretical analyses in future studies. Directions of future development include ways to learn from data with flexible lengths, extensions to conditional COT-GAN, and improved methods to enforce the martingale property for $M$ and better parameterize the causality constraint.

References

[1] Acciaio, B., Backhoff-Veraguas, J., and Carmona, R. Extended mean field control problems: stochastic maximum principle and transport perspective. SIAM Journal on Control and Optimization, 57(6), 2019.
[2] Acciaio, B., Backhoff-Veraguas, J., and Zalashko, A. Causal optimal transport and its links to enlargement of filtrations and continuous-time stochastic optimization. Stochastic Processes and their Applications, 2019.
[3] Acciaio, B., Backhoff-Veraguas, J., and Jia, J. Cournot-Nash equilibrium and optimal transport in a dynamic setting. arXiv preprint arXiv:2002.08786, 2020.
[4] Arbel, M., Sutherland, D., Bińkowski, M., and Gretton, A. On gradient regularizers for MMD GANs. In NeurIPS, 2018.
[5] Arjovsky, M. and Bottou, L. Towards principled methods for training generative adversarial networks. arXiv preprint arXiv:1701.04862, 2017.
[6] Arjovsky, M., Chintala, S., and Bottou, L. Wasserstein generative adversarial networks. In ICML, 2017.
[7] Backhoff, J., Beiglböck, M., Lin, Y., and Zalashko, A. Causal transport in discrete time and applications. SIAM Journal on Optimization, 27(4), 2017.
[8] Backhoff, J., Bartl, D., Beiglböck, M., and Wiesel, J. Estimating processes in adapted Wasserstein distance. arXiv preprint arXiv:2002.07261, 2020.
[9] Backhoff-Veraguas, J., Bartl, D., Beiglböck, M., and Eder, M. Adapted Wasserstein distances and stability in mathematical finance. arXiv preprint arXiv:1901.07450, 2019.
[10] Bińkowski, M., Sutherland, D. J., Arbel, M., and Gretton, A. Demystifying MMD GANs. arXiv preprint arXiv:1801.01401, 2018.
[11] Blank, M., Gorelick, L., Shechtman, E., Irani, M., and Basri, R. Actions as space-time shapes. In ICCV, 2005.
[12] Bousquet, O., Gelly, S., Tolstikhin, I., Simon-Gabriel, C.-J., and Schoelkopf, B. From optimal transport to generative modeling: the VEGAN cookbook. arXiv preprint arXiv:1705.07642, 2017.
[13] Cover, T. M. and Thomas, J. A. Elements of Information Theory. John Wiley & Sons, 2012.
[14] Cuturi, M. Sinkhorn distances: Lightspeed computation of optimal transport. In NeurIPS, 2013.
[15] Donahue, C., McAuley, J. J., and Puckette, M. S. Adversarial audio synthesis. In ICLR, 2019.
[16] Doretto, G., Chiuso, A., Wu, Y. N., and Soatto, S. Dynamic textures. International Journal of Computer Vision, 51(2), 2003.
[17] Dua, D. and Graff, C. UCI machine learning repository, 2017. URL http://archive.ics.uci.edu/ml.
[18] Esteban, C., Hyland, S. L., and Rätsch, G. Real-valued (medical) time series generation with recurrent conditional GANs. arXiv preprint arXiv:1706.02633, 2017.
[19] Genevay, A., Peyré, G., and Cuturi, M. Learning generative models with Sinkhorn divergences. In AISTATS, 2018.
[20] Genevay, A., Chizat, L., Bach, F., Cuturi, M., and Peyré, G. Sample complexity of Sinkhorn divergences. In AISTATS, 2019.
[21] Good, I. J. et al. Maximum entropy for hypothesis formulation, especially for multidimensional contingency tables. The Annals of Mathematical Statistics, 34(3), 1963.
[22] Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., and Bengio, Y. Generative adversarial nets. In NeurIPS, 2014.
[23] Haradal, S., Hayashi, H., and Uchida, S. Biosignal data augmentation based on generative adversarial networks. In International Conference of the IEEE Engineering in Medicine and Biology Society, 2018.
[24] Heusel, M., Ramsauer, H., Unterthiner, T., Nessler, B., and Hochreiter, S. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. In NeurIPS, 2017.
[25] Johnson, M. J., Duvenaud, D. K., Wiltschko, A., Adams, R. P., and Datta, S. R. Composing graphical models with neural networks for structured representations and fast inference. In NeurIPS, pp. 2946–2954, 2016.
[26] Li, C.-L., Chang, W.-C., Cheng, Y., Yang, Y., and Póczos, B. MMD GAN: Towards deeper understanding of moment matching network. In NeurIPS, 2017.
[27] Li, Y. and Mandt, S. Disentangled sequential autoencoder. arXiv preprint arXiv:1803.02991, 2018.
[28] Mogren, O. C-RNN-GAN: Continuous recurrent neural networks with adversarial training. arXiv preprint arXiv:1611.09904, 2016.
[29] Mroueh, Y., Li, C.-L., Sercu, T., Raj, A., and Cheng, Y. Sobolev GAN. In ICLR, 2018.
[30] Nowozin, S., Cseke, B., and Tomioka, R. f-GAN: Training generative neural samplers using variational divergence minimization. In NeurIPS, 2016.
[31] Orlin, J. B. A faster strongly polynomial minimum cost flow algorithm. Operations Research, 41(2):338–350, 1993.
[32] Pele, O. and Werman, M. Fast and robust earth mover's distances. In ICCV, pp. 460–467. IEEE, 2009.
[33] Reed, S. E., Zhang, Y., Zhang, Y., and Lee, H. Deep visual analogy-making. In NeurIPS, 2015.
[34] Saito, M., Matsumoto, E., and Saito, S. Temporal generative adversarial nets with singular value clipping. In ICCV, 2017.
[35] Szummer, M. and Picard, R. W. Temporal texture modeling. In International Conference on Image Processing, volume 3, 1996.
[36] Tulyakov, S., Liu, M.-Y., Yang, X., and Kautz, J. MoCoGAN: Decomposing motion and content for video generation. In CVPR, 2018.
[37] Unterthiner, T., van Steenkiste, S., Kurach, K., Marinier, R., Michalski, M., and Gelly, S. FVD: A new metric for video generation. 2019.
[38] van den Oord, A., Dieleman, S., Zen, H., Simonyan, K., Vinyals, O., Graves, A., Kalchbrenner, N., Senior, A. W., and Kavukcuoglu, K. WaveNet: A generative model for raw audio. In ISCA Workshop, 2016.
[39] Vondrick, C., Pirsiavash, H., and Torralba, A. Generating videos with scene dynamics. In NeurIPS, 2016.
[40] Wei, L.-Y. and Levoy, M. Fast texture synthesis using tree-structured vector quantization. In Annual Conference on Computer Graphics and Interactive Techniques, 2000.
[41] Wenliang, L. K. and Sahani, M. A neurally plausible model for online recognition and postdiction in a dynamical environment. In NeurIPS, 2019.
[42] Yoon, J., Jarrett, D., and van der Schaar, M. Time-series generative adversarial networks. In NeurIPS, 2019.
[43] Zhang, X. L., Begleiter, H., Porjesz, B., Wang, W., and Litke, A. Event related potentials during object recognition tasks. Brain Research Bulletin, 38(6), 1995.
[44] Zhou, S., Gordon, M., Krishna, R., Narcomey, A., Fei-Fei, L. F., and Bernstein, M. HYPE: A benchmark for human eye perceptual evaluation of generative models. In NeurIPS, 2019.
COT-GAN: Generating Sequential Data via Causal Optimal Transport
Supplementary Material
A Specifics on regularized Causal Optimal Transport
A.1 Limits of regularized Causal Optimal Transport
In this section we prove the limits stated in Remark 3.2.
Lemma A.1.
Let $\mu$ and $\nu$ be discrete measures, say on path spaces $\mathcal{X}^T$ and $\mathcal{Y}^T$, with $|\mathcal{X}|=m$ and $|\mathcal{Y}|=n$. Then
$$\mathcal{W}^{\mathcal{K}}_{c,\varepsilon}(\mu,\nu) \xrightarrow[\ \varepsilon\to 0\ ]{} \mathcal{W}^{\mathcal{K}}_c(\mu,\nu).$$

Proof.
We mimic the proof of Theorem 4.5 in [3], and note that the entropy of any $\pi\in\Pi(\mu,\nu)$ is uniformly bounded:
$$0 \le H(\pi) \le C := m^T n^T e^{-1}. \qquad (A.1)$$
This yields
$$\inf_{\pi\in\Pi_{\mathcal{K}}(\mu,\nu)}\mathbb{E}_\pi[c] - \varepsilon C + \varepsilon H\big(\pi^{\mathcal{K}}_{c,\varepsilon}(\mu,\nu)\big) \;\le\; \inf_{\pi\in\Pi_{\mathcal{K}}(\mu,\nu)}\big\{\mathbb{E}_\pi[c] - \varepsilon H(\pi)\big\} + \varepsilon H\big(\pi^{\mathcal{K}}_{c,\varepsilon}(\mu,\nu)\big) \;\le\; \inf_{\pi\in\Pi_{\mathcal{K}}(\mu,\nu)}\mathbb{E}_\pi[c] + \varepsilon H\big(\pi^{\mathcal{K}}_{c,\varepsilon}(\mu,\nu)\big). \qquad (A.2)$$
Now, note that $\inf_{\pi\in\Pi_{\mathcal{K}}(\mu,\nu)}\{\mathbb{E}_\pi[c] - \varepsilon H(\pi)\} = \mathcal{W}^{\mathcal{K}}_{c,\varepsilon}(\mu,\nu) - \varepsilon H\big(\pi^{\mathcal{K}}_{c,\varepsilon}(\mu,\nu)\big)$, and that, for $\varepsilon\to 0$, the LHS and RHS in (A.2) both tend to $\mathcal{W}^{\mathcal{K}}_c(\mu,\nu)$.

Lemma A.2.
Let $\mu$ and $\nu$ be discrete measures. Then
$$\mathcal{W}^{\mathcal{K}}_{c,\varepsilon}(\mu,\nu) \xrightarrow[\ \varepsilon\to\infty\ ]{} \mathbb{E}_{\mu\otimes\nu}[c(x,y)].$$

Proof.
Since $\mu$ and $\nu$ are discrete, $\mathbb{E}_\pi[c]$ is uniformly bounded for $\pi\in\Pi_{\mathcal{K}}(\mu,\nu)$. Therefore, for $\varepsilon$ big enough, the optimizer in $\mathcal{P}^{\mathcal{K}}_{c,\varepsilon}(\mu,\nu)$ is $\hat{\pi} := \mathrm{argmax}_{\pi\in\Pi_{\mathcal{K}}(\mu,\nu)} H(\pi) = \mu\otimes\nu$, the independent coupling, for which $H(\mu\otimes\nu) = H(\mu) + H(\nu)$; see [13] and [21]. Therefore, for $\varepsilon$ big enough, we have $\mathcal{W}^{\mathcal{K}}_{c,\varepsilon}(\mu,\nu) = \mathbb{E}_{\mu\otimes\nu}[c(x,y)]$.

A.2 Reformulation of the COT problem
Proof of Proposition 3.3.
The causal constraint (3.2) can be expressed using the following characteristic function: sup l ∈L ( µ ) E π [ l ( x, y )] = (cid:26) if π is causal; + ∞ otherwise. (A.3)This allows to rewrite (3.3) as P K c,ε ( µ, ν ) = inf π ∈ Π( µ,ν ) (cid:40) E π [ c ( x, y )] − εH ( π ) + sup l ∈L ( µ ) E π [ l ( x, y )] (cid:41) = inf π ∈ Π( µ,ν ) sup l ∈L ( µ ) { E π [ c ( x, y ) + l ( x, y )] − εH ( π ) } = sup l ∈L ( µ ) inf π ∈ Π( µ,ν ) { E π [ c ( x, y ) + l ( x, y )] − εH ( π ) } = sup l ∈L ( µ ) P c + l,ε ( µ, ν ) , where the third equality holds by the min-max theorem, thanks to convexity of L ( µ ) , and convexity and compactness of Π( µ, ν ) . 11OT-GAN: Generating Sequential Datavia Causal Optimal Transport A.3 Sinkhorn divergence at the level of mini-batchesEmpirical observation of the bias in Example 3.4.
In the experiment mentioned in Example 3.4, we consider a set of distributions $\nu_\theta$ given by sinusoids with random phase, frequency and amplitude. We let $\mu$ be one element in this set, whose amplitude is uniformly distributed between a minimum of 0.3 and a maximum of 0.8. On the other hand, for each $\nu_\theta$, the amplitude is uniformly distributed between the same minimum 0.3 and a maximum that varies over a grid of values. Thus, the only parameter of the distribution being varied is the maximum amplitude. We may equivalently take the maximum amplitude as a single parameter $\theta$ that parameterizes $\nu_\theta$, so that $\mu=\nu_{0.8}$. Figure 1 illustrates that the sample Sinkhorn divergence (3.7) (or regularized distance (2.4)) does not recover the optimizer 0.8, while the proposed mixed Sinkhorn divergence (3.8) does.

Further discussion.
As mentioned in Section 3.3, when implementing the Sinkhorn divergence (2.5) at the level of mini-batches, one canonical choice is the one adopted in [19], that is
$$2\,\mathcal{W}_{c_\varphi,\varepsilon}(\hat{x},\hat{y}_\theta) - \mathcal{W}_{c_\varphi,\varepsilon}(\hat{x},\hat{x}) - \mathcal{W}_{c_\varphi,\varepsilon}(\hat{y}_\theta,\hat{y}_\theta). \qquad (A.4)$$
What inspired our different choice of the mixed Sinkhorn divergence in (3.8), that is
$$\mathcal{W}_{c_\varphi,\varepsilon}(\hat{x},\hat{y}_\theta) + \mathcal{W}_{c_\varphi,\varepsilon}(\hat{x}',\hat{y}'_\theta) - \mathcal{W}_{c_\varphi,\varepsilon}(\hat{x},\hat{x}') - \mathcal{W}_{c_\varphi,\varepsilon}(\hat{y}_\theta,\hat{y}'_\theta), \qquad (A.5)$$
is the idea of also taking into account the bias within the distribution $\mu$ and that within the distribution $\nu_\theta$, when sampling mini-batches from them.

Clearly, when the batch size $m\to\infty$, both (A.4) and (A.5) converge to (2.5) in expectation, see [20, Theorem 3]. So the main point here is, for a fixed $m\in\mathbb{N}$, which one of the two does a better job of translating the idea of the Sinkhorn divergence to the level of mini-batches. Experiments suggest that (A.5) is indeed the better choice. To support this fact, note that the triangle inequality implies
$$\mathbb{E}\Big[\big|\mathcal{W}_{c_\varphi,\varepsilon}(\hat{x},\hat{y}_\theta) + \mathcal{W}_{c_\varphi,\varepsilon}(\hat{x}',\hat{y}'_\theta) - 2\,\mathcal{W}_{c,\varepsilon}(\mu,\nu)\big|\Big] \le 2\,\mathbb{E}\Big[\big|\mathcal{W}_{c_\varphi,\varepsilon}(\hat{x},\hat{y}_\theta) - \mathcal{W}_{c,\varepsilon}(\mu,\nu)\big|\Big].$$
One can possibly argue that in (A.5) we are using two batches of size $m$, thus simply considering a bigger mini-batch, say of size $2m$, may perform as well. However, we have considered this case and our experiments confirm that the mixed Sinkhorn divergence (A.5) we suggest does perform better than the so-far used (A.4) even when in the latter we allow for a bigger batch size. This reasoning can be pushed further, for example by considering $\mathcal{W}_{c_\varphi,\varepsilon}(\cdot,\cdot)$ for all four combinations of samples with and without the prime. Implementations showed that there is no advantage in doing so, while requiring more computations.
In the limit $\varepsilon\to\infty$, Genevay et al. [19] showed that $\widehat{\mathcal{W}}_{c,\varepsilon}(\mu,\nu)\to \mathrm{MMD}_{-c}(\mu,\nu)$, the maximum mean discrepancy under the kernel defined by $-c(x,y)$. Here we want to point out an interesting fact about the limiting behavior of the mixed Sinkhorn divergence.

Remark A.3.
Given distributions of mini-batches $\hat{x}$ and $\hat{y}$ formed by samples from $\mu$ and $\nu$, respectively, in the limit $\varepsilon\to\infty$, the Sinkhorn divergence $\widehat{\mathcal{W}}_{c,\varepsilon}(\hat{x},\hat{y})$ converges to a biased estimator of $\mathrm{MMD}_{-c}(\mu,\nu)$; given additional mini-batches $\hat{x}'$ and $\hat{y}'$ from $\mu$ and $\nu$, respectively, the mixed Sinkhorn divergence $\widehat{\mathcal{W}}^{\mathrm{mix}}_{c,\varepsilon}(\hat{x},\hat{x}',\hat{y},\hat{y}')$ converges to an unbiased estimator of $\mathrm{MMD}_{-c}(\mu,\nu)$.

Proof. The first part of the statement relies on the fact that
$\mathrm{MMD}_{-c}(\hat{x},\hat{y})$ is a biased estimator of $\mathrm{MMD}_{-c}(\mu,\nu)$. Indeed, we have
$$\widehat{\mathcal{W}}_{c,\varepsilon}(\hat{x},\hat{y}) \xrightarrow[\ \varepsilon\to\infty\ ]{} \mathrm{MMD}_{-c}(\hat{x},\hat{y}) = -\frac{1}{m^2}\sum_{i=1}^m\sum_{j=1}^m\big[c(x^i,x^j) + c(y^i,y^j) - 2\,c(x^i,y^j)\big].$$
Now note that
$$\frac{1}{m^2}\sum_{i=1}^m\sum_{j=1}^m\mathbb{E}[c(x^i,x^j)] = \frac{1}{m^2}\sum_{i=1}^m\mathbb{E}_\mu[c(x^i,x^i)] + \frac{1}{m^2}\sum_{i\neq j}\mathbb{E}_{\mu\otimes\mu}[c(x^i,x^j)] = \frac{m-1}{m}\,\mathbb{E}_{\mu\otimes\mu}[c(x,x')],$$
since $c(x^i,x^i)=0$. A similar result holds for the sum over $c(y^i,y^j)$. On the other hand, $\frac{1}{m^2}\sum_{i,j}\mathbb{E}[c(x^i,y^j)] = \mathbb{E}_{\mu\otimes\nu}[c(x,y)]$. Therefore
$$\mathbb{E}\big[\mathrm{MMD}_{-c}(\hat{x},\hat{y})\big] = -\frac{m-1}{m}\Big[\mathbb{E}_{\mu\otimes\mu}[c(x,x')] + \mathbb{E}_{\nu\otimes\nu}[c(y,y')]\Big] + 2\,\mathbb{E}_{\mu\otimes\nu}[c(x,y)] \neq \mathrm{MMD}_{-c}(\mu,\nu),$$
which completes the proof of the first part of the statement.

For the second part, note that $\mathcal{W}_{c,\varepsilon}(\mu,\nu)\to\mathbb{E}_{\mu\otimes\nu}[c(x,y)]$ as $\varepsilon\to\infty$ [19, Theorem 1], thus
$$\widehat{\mathcal{W}}^{\mathrm{mix}}_{c,\varepsilon}(\hat{x},\hat{x}',\hat{y},\hat{y}') \to \mathbb{E}_{\hat{x}\otimes\hat{y}}[c(x,y)] + \mathbb{E}_{\hat{x}'\otimes\hat{y}'}[c(x',y')] - \mathbb{E}_{\hat{x}\otimes\hat{x}'}[c(x,x')] - \mathbb{E}_{\hat{y}\otimes\hat{y}'}[c(y,y')] = \frac{1}{m^2}\sum_{i=1}^m\sum_{j=1}^m\big[c(x^i,y^j) + c(x'^i,y'^j) - c(x^i,x'^j) - c(y^i,y'^j)\big].$$
The RHS is an unbiased estimator of
$\mathrm{MMD}_{-c}(\mu,\nu)$, since its expectation is
$$\mathbb{E}_{\mu\otimes\nu}[c(x,y)] + \mathbb{E}_{\mu\otimes\nu}[c(x',y')] - \mathbb{E}_{\mu\otimes\mu}[c(x,x')] - \mathbb{E}_{\nu\otimes\nu}[c(y,y')] = \mathrm{MMD}_{-c}(\mu,\nu).$$

Note that the bias refers to the parameter estimate, rather than the divergence itself. The mixed divergence may still be a biased estimate of the true Sinkhorn divergence. However, in the experiment of Example 3.4 we note that the minimum is reached for a parameter $\theta$ close to the real one (Figure 1, bottom). We defer a detailed analysis of the mixed divergence to a future paper.
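Remark A.3 can also be checked numerically. The sketch below compares the two $\varepsilon\to\infty$ limits on a toy pair of Gaussians; the distributions, batch size and number of repetitions are our own choices, and the Sinkhorn computation is replaced directly by its limit.

```python
import numpy as np

rng = np.random.default_rng(0)

def c(a, b):
    # c(x, y) = ||x - y|| for all pairs of rows
    return np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)

def mmd_plugin(x, y):
    # epsilon -> infinity limit of (2.5) on one batch pair: biased,
    # because the zero diagonal terms c(x_i, x_i) enter the within-batch sums
    return 2 * c(x, y).mean() - c(x, x).mean() - c(y, y).mean()

def mmd_mixed(x, x2, y, y2):
    # epsilon -> infinity limit of the mixed divergence (3.8): unbiased
    return c(x, y).mean() + c(x2, y2).mean() - c(x, x2).mean() - c(y, y2).mean()

mu = lambda n: rng.normal(0.0, 1.0, size=(n, 2))            # toy data distribution
nu = lambda n: rng.normal(0.5, 1.0, size=(n, 2))            # toy model distribution
ref = mmd_mixed(mu(2000), mu(2000), nu(2000), nu(2000))     # low-variance reference
m, reps = 10, 5000
plugin = np.mean([mmd_plugin(mu(m), nu(m)) for _ in range(reps)])
mixed = np.mean([mmd_mixed(mu(m), mu(m), nu(m), nu(m)) for _ in range(reps)])
print(f"reference {ref:.3f}   plug-in {plugin:.3f}   mixed {mixed:.3f}")
# the plug-in average drifts away from the reference at small m, the mixed one does not
```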
B Experimental details

B.1 Low dimensional time series
Here we describe details of the experiments in Section 5.1.
Autoregressive process.
The generative process to obtain data $x_t$ for the autoregressive process is
$$x_t = A x_{t-1} + \zeta_t, \qquad \zeta_t \overset{\text{i.i.d.}}{\sim} \mathcal{N}(0,\Sigma),$$
where $\Sigma$ is a fixed covariance matrix mixing an identity component with a constant component that correlates the channels, and $A$ is diagonal with ten evenly spaced values. We initialize $x_0$ from a 10-dimensional standard normal, and ignore the data in the first 10 time steps so that the data sequence begins with a more or less stationary distribution. We use $\varepsilon=10$ for this experiment; the martingale penalty coefficient $\lambda$ is reported under model and training parameters below. Real data and generated samples are shown in Figure 5.
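A minimal sketch of this generative process is given below; since the exact covariance $\Sigma$ and the range of the diagonal entries of $A$ are not restated here, the numerical values in the code are placeholders rather than the paper's settings.

```python
import numpy as np

def sample_ar1(n_samples, T=20, d=10, burn_in=10, seed=0):
    """Multivariate AR-1 sequences x_t = A x_{t-1} + zeta_t.

    A is diagonal with d evenly spaced entries and Sigma mixes an identity
    with a constant matrix to correlate channels; T and all numerical values
    below are placeholders, not the paper's.
    """
    rng = np.random.default_rng(seed)
    A = np.diag(np.linspace(0.1, 0.9, d))                 # placeholder diagonal range
    Sigma = 0.5 * np.eye(d) + 0.5 * np.ones((d, d))       # placeholder noise covariance
    L = np.linalg.cholesky(Sigma)
    x = rng.standard_normal((n_samples, d))               # x_0 ~ N(0, I)
    out = np.empty((n_samples, T, d))
    for t in range(burn_in + T):
        x = x @ A.T + rng.standard_normal((n_samples, d)) @ L.T
        if t >= burn_in:
            out[:, t - burn_in] = x                       # discard the first burn_in steps
    return out
```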
This dataset comprises paths simulated from a noisy, nonlinear dynamical system. Each path is represented as a sequence of $d$-dimensional arrays, $T$ time steps long, and can be displayed as a $d\times T$-pixel image for visualization. At each discrete time step $t\in\{1,\ldots,T\}$, the data at time $t$, given by $x_t\in[0,1]^d$, is determined by the position of a "particle" following noisy, nonlinear dynamics. When shown as an image, each sample path appears visually as a "bump" travelling rightward, moving up and down in a zig-zag pattern as shown in Figure 6 (top left). More precisely, the state of the particle at time $t$ is described by its position and velocity $s_t = (s_{t,1}, s_{t,2})\in\mathbb{R}^2$, and evolves according to
$$s_t = f(s_{t-1}) + \zeta_t, \qquad \zeta_t \sim \mathcal{N}(0,\sigma^2 I), \qquad f(s_{t-1}) = c_t\, A\, s_{t-1},$$
with $\sigma$ a small noise level and the scalar $c_t$ a nonlinear (exponential) function of $\|s_{t-1}\|$ that modulates the speed of the particle,
where $A\in\mathbb{R}^{2\times 2}$ is a rotation matrix, and $s_0$ is uniformly distributed on the unit circle. We take $T=48$ and $d=20$, so that $x_t$ is a vector of evaluations of a Gaussian function at 20 evenly spaced locations, with the peak of the Gaussian following the position of the particle $s_{t,1}$ for each $t$:
$$x_{t,i} = \exp\!\left[-\frac{(\mathrm{loc}(i) - s_{t,1})^2}{2\sigma_b^2}\right],$$
where $\sigma_b$ is a fixed bandwidth and $\mathrm{loc}:\{1,\ldots,d\}\to\mathbb{R}$ maps pixel indices to a grid of evenly spaced points in the space of particle positions. Thus $x_t$, the observation at time $t$, contains information about $s_{t,1}$ but not $s_{t,2}$. A similar data generating process was used in [41], inspired by Johnson et al. [25].

We compare the marginal distribution of the pixel values $x_{t,i}$ and the joint distribution of the bump location ($\mathrm{argmax}_i\, x_{t,i}$) between adjacent time steps. See Figure 6.

Figure 6: 1-D noisy oscillation. Top two rows show two samples from the data distribution and from generators trained by the different methods. Third row shows the marginal distribution of pixel values (y axis clipped at 0.07 for clarity). Bottom row shows the joint distribution of the position of the oscillation at adjacent time steps.
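A numpy sketch that mirrors the structure of this process (a noisily rotating two-dimensional state rendered as a travelling one-dimensional Gaussian bump) is given below. The rotation angle, noise level, bump width and pixel grid are placeholder choices, and the speed modulation $c_t$ is simplified to a constant, so this reproduces the flavour of the dataset rather than its exact law.

```python
import numpy as np

def sample_oscillations(n_samples, T=48, d=20, noise=0.1, width=0.3, seed=0):
    """Noisy rotation in 2-D observed through a 1-D Gaussian bump (cf. Figure 6)."""
    rng = np.random.default_rng(seed)
    theta = 2 * np.pi / T                                  # placeholder rotation speed
    A = np.array([[np.cos(theta), -np.sin(theta)],
                  [np.sin(theta),  np.cos(theta)]])        # rotation matrix
    grid = np.linspace(-1.5, 1.5, d)                       # loc(i): pixel positions
    phi = rng.uniform(0.0, 2 * np.pi, n_samples)
    s = np.stack([np.cos(phi), np.sin(phi)], axis=1)       # s_0 uniform on the unit circle
    x = np.empty((n_samples, T, d))
    for t in range(T):
        s = s @ A.T + noise * rng.standard_normal((n_samples, 2))  # s_t = f(s_{t-1}) + zeta_t
        x[:, t] = np.exp(-(grid[None, :] - s[:, :1]) ** 2 / (2 * width ** 2))
    return x
```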
Electroencephalography. We obtained the EEG dataset from [43] and took the recordings of all the 43 subjects in the control group under the matching condition (S2). For each subject, we choose 75% of the trials as training data and the remaining for evaluation. All data are subtracted by the channel-wise mean, divided by three times the channel-wise standard deviation, and then passed through a tanh nonlinearity. We train and evaluate models 16 times with different splittings. For COT-GAN, we trained three variants
corresponding to three values of $\lambda$, and $\varepsilon=100$ for all OT-based methods. Data and samples are shown in Figure 7.

We use four different metrics to compare sample quality. The relative MMD test compares a test statistic based on $\mathrm{MMD}(\mathcal{D}_{\mathrm{real}}, \mathcal{D}_{\mathrm{alternative}}) - \mathrm{MMD}(\mathcal{D}_{\mathrm{real}}, \mathcal{D}_{\text{COT-GAN},\lambda=1.0})$, where $\mathcal{D}_{\mathrm{real}}$ indicates the real test dataset, $\mathcal{D}_{\text{COT-GAN},\lambda=1.0}$ is sampled from a COT-GAN with $\lambda=1.0$, and $\mathcal{D}_{\mathrm{alternative}}$ is sampled from an alternative method that is one of the following: COT-GAN with one of the other two values of $\lambda$, direct minimization of the mixed or original Sinkhorn divergence, TimeGAN, or Sinkhorn GAN. A larger value of the test statistic indicates that COT-GAN with $\lambda=1.0$ is better compared to the alternative. We do not employ the hypothesis testing framework, but rather use the test statistic as a metric of relative sample quality. We also compute the following quantities on the real and generated samples: a) the temporal correlation coefficient, b) the channel-wise correlation coefficient, and c) the frequency spectrum for each channel averaged over samples. For each of these three features, we use the sum of absolute differences between the features computed from real and synthesized data as a metric of similarity. A small number means the generated data is close to the real data based on the corresponding feature.

As the results in Figure 8 show, the different metrics do not agree in general. Nonetheless, COT-GANs in general outperform the other models. According to MMD and temporal correlation, direct minimization of the mixed Sinkhorn divergence is as good as the best COT-GAN with $\lambda=1.0$. But all COT-GANs do better in channel correlation and frequency spectrum. We noticed that increasing $\lambda$ is helpful for MMD and the two correlations, but not for the frequency spectrum.

Model and training parameters.
The dimensionality of the latent state is 10 at each time step, and there is also a 10-dimensional time-invariant latent state. The generator common to COT-GAN, direct minimization and Sinkhorn GAN comprises a 1-layer (synthetic) or 2-layer (EEG) LSTM network, whose output at each time step is passed through two layers of fully connected
ReLU networks. We used Adam for updating $\theta$ and $\varphi$, with learning rate 0.001. Batch size is 32 for all methods except for direct minimization of the original Sinkhorn divergence, which is trained with batch size 64. These hyperparameters do not substantially affect the results.

The same discriminator architecture is used for both $h$ and $M$ in COT-GAN and for the discriminator of the Sinkhorn GAN. This network has two layers of 1-D causal CNN with stride 1 and filter length 5. Each layer has 32 (synthetic data) or 64 (EEG) neurons at each time step. The activation is ReLU, except at the output, which is linear for the autoregressive process, sigmoid for the noisy oscillation, and tanh for EEG. For COT-GAN, $\lambda=10$ and $\varepsilon=10$ for the synthetic datasets, and the three values of $\lambda$ mentioned above with $\varepsilon=100$ for EEG. The choice of $\varepsilon$ is made based on how fast the Sinkhorn algorithm converges to a particular threshold of the transport plan, and each iteration takes around 1 second on a 2.6GHz Xeon CPU.
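As an illustration only, one plausible TensorFlow/Keras realization of this shared architecture for $h$ and $M$ is sketched below; the final projection layer, its output dimension (standing in for $J$), and the helper name are our additions, so this should not be read as the exact network used in the paper.

```python
import tensorflow as tf

def make_hM_network(n_channels=32, out_dim=8, out_activation=None):
    """Two 1-D causal convolutions (stride 1, filter length 5) with ReLU activations.

    padding='causal' makes the output at time t depend only on inputs up to t,
    matching the adaptedness requirement on h_phi and M_phi; the Dense layer is
    an assumed projection to J output processes per time step.
    """
    return tf.keras.Sequential([
        tf.keras.layers.Conv1D(n_channels, kernel_size=5, strides=1,
                               padding="causal", activation="relu"),
        tf.keras.layers.Conv1D(n_channels, kernel_size=5, strides=1,
                               padding="causal", activation="relu"),
        tf.keras.layers.Dense(out_dim, activation=out_activation),
    ])

# e.g. for the EEG experiments: 64 channels per layer and tanh outputs
h_net = make_hM_network(n_channels=64, out_activation="tanh")
M_net = make_hM_network(n_channels=64, out_activation="tanh")
```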
Figure 7: Data and samples obtained by different methods for EEG data; the number after COT-GAN indicates the value of $\lambda$.

B.2 Video datasets

B.2.1 Sprite animations

Data pre-processing. The sprite sheets can be created and downloaded from gaurav.munjal.us/Universal-LPC-Spritesheet-Character-Generator and github.com/jrconway3/Universal-LPC-spritesheet. The data can be generated with various feature options for clothing, hairstyle and skin color, etc. Combining all feature options gives us 6352 characters in total. Each character performs spellcast, walk, slash, shoot and hurt movements from different directions, making up a total of 21 actions. As the number of frames $T$ ranges from 6 to 13, we pad all actions to have the same length $T=13$ by repeating previous movements in shorter sequences. We then crop the characters from the sheets to be in the center of each frame, which gives a dimension of $64\times 64\times 4$ for each frame. We decide to drop the 4th channel.

Figure 8: EEG data evaluations. Top left, MMD statistic: more positive means COT-GAN with $\lambda=1.0$ is better, and more negative means COT-GAN with $\lambda=1.0$ is worse. The two horizontal lines indicate statistical significance thresholds (two-tailed). For the other panels, a lower value means the feature of the generated data is closer to the feature of the real data.

Table 2: Generator architecture. Input: $z\sim\mathcal{N}(0,I)$.

B.2.2 The Weizmann Action database

Data pre-processing.
The videos in this dataset consist of clips whose lengths range from 2 to 7 seconds. Each second of the original videos contains 25 frames, each of which has dimension 144x180x3. To avoid the absence of objects at the beginning of the videos and to ensure having an entire evolution of motions in each sequence, we skip the first 5 frames, then skip every 2 frames and collect 16 frames to form a whole sequence. Due to limited access to hardware, we also downscale each frame to $64\times 64\times 3$. The training set used contains 89 data points with dimension $16\times 64\times 64\times 3$. To facilitate the use of large datasets in TensorFlow, we pre-shuffled all data used and wrote them into tfrecord files. Links for download can be found on the GitHub repository.
GAN architectures.
We detail the GAN architectures used in the experiment on the Weizmann Action database in Table 2 and Table 3. A latent variable $z$ per time step is sampled from a multivariate standard normal distribution and is then passed to a 2-layer LSTM to generate time-dependent features, followed by a 4-layer deconvolutional neural network (DCONV) that maps the features to frames. In order to connect the two different types of networks, we map the features using a feedforward (dense) layer and reshape them to the desired shape for the DCONV network. In Tables 2 and 3, the DCONV layers have filter size N, kernel size K, strides S and padding option P. We adopted batch-normalisation layers and the LeakyReLU activation function. The two networks parameterizing the processes $h$ and $M$ as the discriminator share the same structure, shown in Table 3.

Note that we did not use a random projector in these experiments. Moreover, we used a fixed length $T=16$ for the LSTM, and the state size in the last LSTM layer corresponds to the dimension of $h_t$ and $M_t$, i.e., $J$ in (3.9). We also applied exponential decay to the learning rate via $\eta_t = \eta_0\, r^{s/c}$, where $\eta_0$ is the initial learning rate, $r$ is the decay rate, $s$ is the current number of training steps and $c$ is the decaying frequency. The batch size $m$ and the number of time steps $T$ are both 16, $\varepsilon=6$, and the number of Sinkhorn iterations is $L=100$ in this experiment. We trained COT-GAN on a single NVIDIA Tesla P100 GPU. Each iteration takes roughly 1.5 seconds.
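The decay rule $\eta_t = \eta_0\, r^{s/c}$ corresponds to the standard exponential decay schedule in Keras; the concrete values below are placeholders, since the rates used in the paper are not restated here.

```python
import tensorflow as tf

# eta_t = eta_0 * r ** (s / c): initial rate eta_0, decay rate r, decay frequency c
lr_schedule = tf.keras.optimizers.schedules.ExponentialDecay(
    initial_learning_rate=1e-3,   # placeholder eta_0
    decay_steps=10_000,           # placeholder decaying frequency c
    decay_rate=0.98,              # placeholder decay rate r
    staircase=False)
optimizer = tf.keras.optimizers.Adam(learning_rate=lr_schedule)
```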
Figure 10: Random samples, with no cherry-picking, from models trained on human actions. Top row, from left to right: real sequences and mixed Sinkhorn minimization; bottom row, from left to right: MoCoGAN and COT-GAN.

C Sprites and human action results without cherry-picking

In this section we show random samples of Sprites and human actions generated by COT-GAN, mixed Sinkhorn minimization, and MoCoGAN without cherry-picking. The background was static for both experiments. In the Sprites experiments (see Figure 9), the samples from mixed Sinkhorn minimization and COT-GAN are both of good quality, whereas those from MoCoGAN only capture a rough pattern in the frames and fail to show a smooth evolution of motions.

In Figure 10, we show a comparison of real and generated samples for human action sequences. Noticeable artifacts in the COT-GAN and mixed Sinkhorn minimization results include blurriness and even the disappearance of the person in a sequence, which normally happens when the clothing of the person has a similar color to the background. MoCoGAN also suffers from this issue and, visually, there appears to be some degree of mode collapse. We used generators of similar capacity across all models and trained COT-GAN, mixed Sinkhorn minimization and MoCoGAN for 65000 iterations.

In Table 4 the evaluation scores are estimated using 10000 generated samples. We increased the sample size from 5000 samples for Table 1 to 10000 samples in order to yield more robust evaluation metrics. For Sprites, COT-GAN performs better than the other two methods on FVD and KVD. However, minimization of the mixed Sinkhorn divergence produces slightly better FID and KID scores when compared to COT-GAN. The results in [37] suggest that FID better captures frame-level quality, while FVD is better suited to temporal coherence in videos. For the human action dataset, COT-GAN is the best performing method across all metrics except KVD. It is also reported in [37] that, while FVD and KVD are highly correlated, the former agrees with human judgement better than the latter.

Table 4: Evaluations for video datasets. Lower value indicates better sample quality.
Sprites
                                            FVD       FID      KVD     KID
MoCoGAN                                   1108.2    280.25    146.8    0.34
min $\widehat{\mathcal{W}}^{\mathrm{mix}}_{c,\varepsilon}$
COT-GAN

Human actions
                                            FVD       FID      KVD     KID
MoCoGAN                                   1034.3     151.3     89.0    0.26
min $\widehat{\mathcal{W}}^{\mathrm{mix}}_{c,\varepsilon}$                  0.13
COT-GAN