Decision-Aware Conditional GANs for Time Series Data
He Sun, Zhun Deng, Hui Chen, David C. Parkes
Harvard University, MIT
[email protected], [email protected], [email protected], [email protected]
Abstract
We introduce the decision-aware time-series conditional generative adversarial network (DAT-CGAN) as a method for time-series generation. The framework adopts a multi-Wasserstein loss on structured decision-related quantities, capturing the heterogeneity of decision-related data and providing new effectiveness in supporting the decision processes of end users. We improve sample efficiency through an overlapped block-sampling method, and provide a theoretical characterization of the generalization properties of DAT-CGAN. The framework is demonstrated on financial time series for a multi-time-step portfolio choice problem. We demonstrate better generative quality in regard to underlying data and different decision-related quantities than strong, GAN-based baselines.
Introduction

High-fidelity time-series simulators are highly desirable across many domains, either due to a lack of data or because there are high stakes in regard to the deployment of automated decision methods. A good simulator can improve sample efficiency for training models and be used to evaluate different decision methods. Consider, for example, the use of a simulator for fine-tuning strategies in financial portfolio choice (simulating asset returns) or to support the design of insurance products (simulating risk outcomes).

A gap with current GAN-based approaches is that they are not decision-aware, but focus instead on generating the underlying data distribution [KFT19, YJvdS19]. With high fidelity this can support good decisions, but without this, these current methods prove to be insufficient. In a financial portfolio choice problem [Mar52], a risk-neutral portfolio manager is sensitive to the first-order moment of the portfolio return distribution, and cares about mis-specification error in regard to this moment. A risk-sensitive portfolio manager cares more about mis-specification error in regard to the higher moments of the return distribution. Thus, a decision-aware simulator should be designed by tailoring to specific kinds of end users. Capturing a specific kind of decision maker requires the simulator to be trained with a loss function that relates to different, decision-related quantities. In the context of portfolio choice, these quantities may include the estimated mean, estimated co-variance, and estimated precision matrix of returns, and, for a given strategy, quantities such as portfolio weights, portfolio return, or utility to the end user.

A particular challenge with time-series data arises when decisions are made over time and based on inference across multiple steps. A simulator needs good fidelity in regard to multiple look-ahead steps. Data scarcity is another challenge, either because of a limited sample size or as the result of non-stationarity. Previous work has made use of bootstrap methods to improve sample efficiency of estimators [Büh02]. However, these methods generally provide no finite-sample guarantees on generalization error. The problem of exposure bias is another challenge when training sequential generative models [RCAZ16]. First studied in the context of language models, this bias arises when models are trained to predict one step forward using previous ground-truth observations, whereas at test time models are used to generate an entire sequence; i.e., the model is trained on a different distribution than the generated data distribution, and errors accumulate. The problem has received attention in natural language processing [BVJS15, RMM+17, e.g.], and, with noisy data, the accumulation of error can lead to poor generation quality.
Our Contributions.
To address these issues, we propose a novel, decision-aware time-series conditional generative adversarial network (DAT-CGAN). We make the training procedure decision-aware by imposing a multi-Wasserstein loss structure, with loss terms on multiple, decision-related quantities, in addition to data. We align training and evaluation by using the same number of look-ahead steps and by using generated quantities, rather than ground-truth quantities, when handling look-ahead during training. This addresses the problem of exposure bias. The design of the generator and discriminator is non-trivial, since the generator needs to capture the structural relationship between different decision-related quantities. We provide the discriminator with access to the same amount of conditioning information as the generator to avoid it being too strong relative to the generator. To improve sample efficiency, we adopt an overlapped block-sampling mechanism. We provide a theoretical characterization of the generalization properties of DAT-CGAN. Specifically, for our framework involving decision-related quantities, we provide non-asymptotic bounds for conditional GANs for time series with overlapping block sampling. In experimental results, we evaluate the framework on a financial portfolio choice problem where a decision-maker cares about decision-related quantities (e.g., the precision matrix and portfolio weights). The results demonstrate that, by defining loss on both utility (to the decision-maker) and asset returns, the DAT-CGAN framework achieves better performance than strong, GAN-based baselines, considering both simulated and real-data scenarios (ETFs).
Related Work.
The literature on GANs for time-series data does not consider decision awareness or provide theoretical guarantees for decision-related quantities, or for conditional GANs with overlapped-sampling schemes. Yoon et al. [YJvdS19] study GANs for general time-series problems, combining a supervised and an unsupervised paradigm. They use a bidirectional RNN as the discriminator, which is forward-looking when providing conditioning variables and unsuitable for financial time series due to its strength relative to the generator. In the context of financial markets, Li et al. [LWL+20] introduce the stock-GAN framework for the generation of order streams (we study asset prices), and evaluate their approach on stylized facts about market micro-structure. Koshiyama et al. [KFT19] use GANs for the calibration and aggregation of trading strategies, generating time-series asset returns for fine-tuning trading strategies. They do not consider decision-related quantities.

Data scarcity is especially apparent in applications to financial data. In asset pricing, for example, the daily equity price time series provides only around 250 samples per year, and pooling across multiple years is not effective when distributions shift over time. Even with high-frequency data, the structure of market participants will tend to change over time, with the effect that the distribution shifts, again limiting the effective sample size.
Preliminaries

A Wasserstein GAN uses the Wasserstein distance as the loss function. The Wasserstein distance between random variables r and r', distributed according to P(r) and P(r'), is

W(P(r), P(r')) = \inf_{\Gamma \in \Pi(P(r), P(r'))} \mathbb{E}_{(r, r') \sim \Gamma}[\, \|r - r'\| \,],

where \|\cdot\| is the L2 norm, and \Pi(P(r), P(r')) is the set of all joint distributions \Gamma(r, r') whose marginals equal P(r) and P(r'). According to the Kantorovich–Rubinstein duality [Vil09], the dual form can be written as

W(P(r), P(r')) = \sup_{\|h\|_L \le 1} \mathbb{E}_{r \sim P(r)}[h(r)] - \mathbb{E}_{r' \sim P(r')}[h(r')],

where h is a 1-Lipschitz function. For Wasserstein GANs, the goal is to minimize the Wasserstein distance between the non-synthetic data and synthetic data. Following Mirza et al. [MO14], we work with conditional GANs, allowing for conditioning variables. For functions D_\theta and G_\eta, parameterized by \theta and \eta, the discriminator and generator respectively, and conditioning variable x, the CGAN problem is

\min_\eta \max_\theta \mathbb{E}_{r \sim P(r|x)}[D_\theta(r, x)] - \mathbb{E}_{z \sim P(z)}[D_\theta(G_\eta(z, x), x)],

where P(r|x) and P(z) denote the distribution of non-synthetic data and the input random seed, respectively. Here, the synthetic data comes from the generator, with r' = G_\eta(z, x) for conditioning x.

The DAT-CGAN Framework

In this section, we discuss the framework of DAT-CGAN. Let (r_1, \ldots, r_T) denote a multivariate time series, where r_t is a d-dimensional column vector. Let x_t denote an m-dimensional multivariate time-series information vector, summarizing relevant information up to time t, including information that could affect the data after t. Let R_{t,k} = (r_{t+1}, \ldots, r_{t+k}) denote a k-length block after time t, where k \in \{1, \ldots, K\}, and K is the number of look-ahead steps to generate. The R_{t,k} blocks can overlap with each other for different t and k. In finance, r_t could be the asset returns at day t; since asset returns are affected by past market information, x_t could be past asset returns, volatility, and other technical indicators, and R_{t,k} could be the k-days-forward asset returns.

To model the decision process of an end user, let f_{j,k}(R_{t,k}, x_t) denote a decision-related quantity (a scalar, vector, or matrix), for j \in \{1, \ldots, J\} and J quantities in total, and define, for each look-ahead period k,

f_k(R_{t,k}, x_t) = (f_{1,k}(R_{t,k}, x_t), \ldots, f_{J,k}(R_{t,k}, x_t))^\top,   (1)

the J decision-related quantities at look-ahead period k given data R_{t,k} and information x_t.
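As a small illustration of the block notation, the following NumPy sketch builds the overlapping look-ahead blocks R_{t,k} from a return matrix, together with a toy information vector x_t; the choice of x_t here (just the most recent return) is a simplified assumption for illustration, not the feature set used in the experiments.

```python
import numpy as np

def build_blocks(returns, K):
    """Construct overlapping look-ahead blocks from a (T, d) return matrix.

    Returns a dict mapping (t, k) -> R_{t,k} = (r_{t+1}, ..., r_{t+k}),
    together with a toy information vector x_t (here simply r_t itself).
    """
    T = returns.shape[0]
    blocks, info = {}, {}
    for t in range(1, T - K):            # blocks for different t overlap in time
        info[t] = returns[t - 1]         # simplified x_t; the paper uses richer features
        for k in range(1, K + 1):
            # rows returns[t], ..., returns[t+k-1] play the role of r_{t+1}, ..., r_{t+k}
            blocks[(t, k)] = returns[t: t + k]
    return blocks, info

# Example: 100 days of 4-asset returns, K = 4 look-ahead steps.
rng = np.random.default_rng(0)
blocks, info = build_blocks(rng.normal(scale=0.01, size=(100, 4)), K=4)
print(blocks[(10, 3)].shape)  # (3, 4): a 3-step block after t = 10
```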
Multi-Wasserstein loss.

Let r'_{t,k} denote the synthetic data generated based on information vector x_t and for look-ahead period k, for k \in \{1, \ldots, K\}. For all t and all k, we want the conditional distribution on synthetic data, P(r'_{t,k} | x_t), to match the conditional distribution on the non-synthetic data, P(r_{t+k} | x_t). Similarly, for all t and all k, we want the conditional distribution on decision-related quantities for synthetic data, P(f_{j,k}(R'_{t,k}, x_t) | x_t), where R'_{t,k} = (r'_{t,1}, \ldots, r'_{t,k}), to match the conditional distribution P(f_{j,k}(R_{t,k}, x_t) | x_t). It will be convenient to write P(R'_{t,K} | x_t) for \{P(r'_{t,k} | x_t)\}_{k \in \{1, \ldots, K\}}. We use the notation r'_{t,k} rather than r'_{t+k} because there is a difference, for example, between r'_{1,2} and r'_{2,1}.

Adopting a separate loss term for each quantity and each k, we define the following multi-Wasserstein objective (written here for conditioning x_t):

\inf_{P(R'_{t,K}|x_t)} \sum_{k=1}^{K} \omega_k L^r_k + \sum_{k=1}^{K}\sum_{j=1}^{J} \lambda_{j,k} L^f_{j,k}, with   (2)

L^r_k = W(P(r_{t+k}|x_t), P(r'_{t,k}|x_t)),
L^f_{j,k} = W(P(f_{j,k}(R_{t,k}, x_t)|x_t), P(f_{j,k}(R'_{t,k}, x_t)|x_t)).

L^r_k denotes the loss for data at k steps forward and L^f_{j,k} denotes the loss for decision-related quantity j at k steps forward. Values \omega_k > 0 and \lambda_{j,k} > 0 weight the relative importance of each loss term. An alternative formulation would impose the Wasserstein distance on a single vector concatenating all quantities; we justify our design choice in the experimental results section.

Surrogate loss.
Let D_{\gamma_k} denote the discriminator for the data at look-ahead period k, defined for parameters \gamma_k. Let r'_{t,k} = G_\eta(z_{t,k}, x_t) denote the synthetic data at look-ahead period k, where G_\eta is the generator, defined for parameters \eta, and where the noise z_{t,k} \sim N(0, I_d). Let D_{\theta_{j,k}} denote the discriminator for decision-related quantity j at look-ahead period k, defined for parameters \theta_{j,k}. In financial applications, f_{j,k}(R_{t,k}, x_t) could be the estimated co-variance of asset returns or the portfolio weights using information up to time t + k. Let Z_{t,k} = (z_{t,1}, \ldots, z_{t,k}) denote a k-length block of random seeds after t. We define the following in-expectation quantities:

E^r_k = \mathbb{E}_{r_{t+k} \sim P(r_{t+k}|x_t)}[D_{\gamma_k}(r_{t+k}, x_t)]   (3)
E^{G_\eta}_k = \mathbb{E}_{z_{t,k} \sim P(z_{t,k})}[D_{\gamma_k}(r'_{t,k}, x_t)]   (4)
E^{f,R}_{j,k} = \mathbb{E}_{R_{t,k} \sim P(R_{t,k}|x_t)}[D_{\theta_{j,k}}(f_{j,k}(R_{t,k}, x_t), x_t)]   (5)
E^{f,G_\eta}_{j,k} = \mathbb{E}_{Z_{t,k} \sim P(Z_{t,k})}[D_{\theta_{j,k}}(f_{j,k}(R'_{t,k}, x_t), x_t)]   (6)

To formulate the DAT-CGAN training problem, we use the Kantorovich–Rubinstein duality for each Wasserstein distance in (2), and sum over the dual forms [Vil09]. This provides a surrogate loss, upper bounding the original objective. The surrogate problem is a min-max optimization problem, with the discriminator loss defined as

\inf_\eta \sup_{\gamma_k, \theta_{j,k}} \sum_{k=1}^{K} \omega_k (E^r_k - E^{G_\eta}_k) + \sum_{k=1}^{K}\sum_{j=1}^{J} \lambda_{j,k}(E^{f,R}_{j,k} - E^{f,G_\eta}_{j,k}).

The generator loss is

\inf_\eta \; -\sum_k \omega_k E^{G_\eta}_k - \sum_{k,j} \lambda_{j,k} E^{f,G_\eta}_{j,k}.

We also write \tilde L^r_k = E^r_k - E^{G_\eta}_k and \tilde L^f_{j,k} = E^{f,R}_{j,k} - E^{f,G_\eta}_{j,k} to denote the discriminator loss for the underlying data and decision-related quantities, respectively.
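As a concrete illustration of the multi-Wasserstein surrogate loss, the following PyTorch-style sketch accumulates one weighted dual term per look-ahead step k and per decision-related quantity j. The critic modules `D_data[k]` and `D_q[j][k]`, and the function `decision_quantity(j, k, blocks, x)`, are placeholders assumed for illustration rather than the paper's actual implementation.

```python
import torch

def surrogate_losses(D_data, D_q, G, r_blocks, x, z_blocks,
                     decision_quantity, omega, lam, K, J):
    """Compute the discriminator and generator surrogate losses.

    r_blocks[k]: real returns at look-ahead step k+1; z_blocks[k]: matching noise;
    x: conditioning information; omega[k], lam[j][k]: loss weights.
    """
    # Roll the generator forward to get synthetic returns r'_{t,1..K}.
    r_synth = [G(z_blocks[k], x) for k in range(K)]
    disc_loss, gen_loss = 0.0, 0.0
    for k in range(K):
        # Data term: E[D_gamma_k(r_{t+k}, x)] - E[D_gamma_k(r'_{t,k}, x)].
        d_real = D_data[k](r_blocks[k], x).mean()
        d_fake = D_data[k](r_synth[k], x).mean()
        disc_loss += omega[k] * (d_real - d_fake)
        gen_loss += -omega[k] * d_fake
        for j in range(J):
            # Decision-related term for quantity j at look-ahead step k.
            q_real = decision_quantity(j, k, r_blocks[: k + 1], x)
            q_fake = decision_quantity(j, k, r_synth[: k + 1], x)
            f_real = D_q[j][k](q_real, x).mean()
            f_fake = D_q[j][k](q_fake, x).mean()
            disc_loss += lam[j][k] * (f_real - f_fake)
            gen_loss += -lam[j][k] * f_fake
    # Discriminators ascend disc_loss; the generator descends gen_loss.
    return disc_loss, gen_loss
```

In practice, the synthetic samples would be detached from the computation graph when updating the discriminators, and the discriminator weights clipped after each update, as in Algorithm 1 below.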
Training procedure.

See Algorithm 1 for the training procedure. Lines 2–3 prepare the data from the decision process as well as conditioning variables. Lines 5–11 train the discriminators. Line 6 performs K-consecutive time-block sampling. Lines 7–8 generate synthetic block samples for each time block, conditioning on the information vector. Lines 9–11 update the discriminators. Lines 12–15 train the generators. Lines 13–14 generate synthetic block samples for each time block, conditioning on the information vector. Line 15 updates the generators. We define sample estimates for expectations (3), (4), (5), and (6) as \hat E^r_k, \hat E^{G_\eta}_k, \hat E^{f,R}_{j,k} and \hat E^{f,G_\eta}_{j,k}, respectively. The quantities (r_{t_i+k}, f_k(R_{t_i,k}, x_{t_i}), x_{t_i}), for all i, are obtained by an overlapped block-sampling scheme (see Figure 1), where different blocks of samples can overlap with other blocks in terms of consecutive samples. We characterize the generalization property of the procedure under suitable mixing conditions. This type of sampling scheme is commonly used in the literature for CGANs [KFT19, e.g.].

Algorithm 1 (DAT-CGAN training).
Hyperparameters: learning rate \alpha; loss weights \omega_k and \lambda_{j,k}; discriminator weight-clipping bounds [l_b, u_b]; series length T = 2048; look-ahead steps K = 4; batch size I = 32; discriminator steps s_D = 1 and generator steps s_G = 5 per batch; number of batches N.
Require: \gamma_{k,0} and \theta_{j,k,0}, initial discriminator parameters; \eta_0, initial generator parameters.
1: for t = 1, k = 1 to T, K do
2:    Compute R_{t,k}
3:    Compute f_{j,k}(R_{t,k}, x_t), \forall j
4: while n_epoch < N do
5:    for s = 0 to s_D do
6:       Make I samples of K-size time blocks; the i-th sample (1 \le i \le I) ranges from time t_i + 1 to t_i + K and consists of data (r_{t_i+k}, f_k(R_{t_i,k}, x_{t_i}), x_{t_i})_{k=1}^{K}
7:       for i = 1, k = 1 to I, K do
8:          Sample z_{t_i,k} \sim P(z_{t_i,k}); compute r'_{t_i,k} = G_\eta(z_{t_i,k}, x_{t_i}) and f_{j,k}(R'_{t_i,k}, x_{t_i}), \forall j
9:       for k = 1 to K do
10:         \gamma_k \leftarrow clip(\gamma_k + \alpha \omega_k \nabla_{\gamma_k} \sum_{k=1}^{K} [\hat E^r_k - \hat E^{G_\eta}_k], l_b, u_b)
11:         \theta_{j,k} \leftarrow clip(\theta_{j,k} + \alpha \lambda_{j,k} \nabla_{\theta_{j,k}} \sum_{k=1}^{K} [\hat E^{f,R}_{j,k} - \hat E^{f,G_\eta}_{j,k}], l_b, u_b), \forall j
12:   for s = 0 to s_G do
13:      for i = 1, k = 1 to I, K do
14:         Sample z_{t_i,k} \sim P(z_{t_i,k}); compute r'_{t_i,k} = G_\eta(z_{t_i,k}, x_{t_i}) and f_{j,k}(R'_{t_i,k}, x_{t_i}), \forall j
15:      \eta \leftarrow \eta - \alpha \nabla_\eta \sum_{k=1}^{K} [\, -\omega_k \hat E^{G_\eta}_k - \sum_{j=1}^{J} \lambda_{j,k} \hat E^{f,G_\eta}_{j,k} \,]

Simulator for Financial Portfolio Choice

We study the DAT-CGAN in application to a financial portfolio choice problem. The end user we have in mind is a portfolio manager who wants to understand the properties of a particular portfolio selection strategy. A good simulator should not only generate high-fidelity synthetic asset return data, but also produce data that generates high-fidelity, decision-related quantities, as are relevant to the portfolio selection problem. These decision-related quantities may include (1) estimated mean asset returns \hat u_{t,k}, (2) the estimated co-variance of asset returns \hat\Sigma_{t,k}, (3) the estimated precision matrix of asset returns \hat H_{t,k}, (4) portfolio weights w_{t,k}, and (5) the portfolio return p_{t,k} and utility U_{t,k} of the corresponding portfolio. These decision-related quantities need to be generated based on conditioning variables, reflecting the current market conditions.

Let r_{t+k} denote the asset return vector at time t + k, where 1 \le k \le K. Let x_t denote the conditioning variables at time t. We assume the end user is a mean-variance portfolio manager, solving the mean-variance portfolio optimization problem in deciding the portfolio weights. For non-synthetic data, the corresponding portfolio optimization problem at time t + k - 1 is

\max_{w_{t,k}: \mathbf{1}^\top w_{t,k} = 1} \; w_{t,k}^\top \hat u_{t,k} - \phi\, w_{t,k}^\top \hat\Sigma_{t,k} w_{t,k}.

The analytical solution for this investment problem is

w_{t,k} = \frac{\hat H_{t,k}}{2\phi}\left(\hat u_{t,k} - \frac{\mathbf{1}^\top \hat H_{t,k}\hat u_{t,k} - 2\phi}{\mathbf{1}^\top \hat H_{t,k}\mathbf{1}}\,\mathbf{1}\right),

with \hat\Sigma^{-1}_{t,k} replaced by \hat H_{t,k} = ((1 - \tau)\hat\Sigma_{t,k} + \tau\Lambda)^{-1} using shrinkage methods, where \Lambda is the identity matrix. Here w_{t,k} are the portfolio weights decided at time t + k - 1 for time t + k, and \phi > 0 is the risk-aversion parameter. \hat u_{t,k} and \hat\Sigma_{t,k} are estimators for the mean and co-variance of the asset return at time t + k, defined as

\hat u_{t,k} = f_{u,k}(R_{t,k-1}, x_t) = MA_\zeta(r_{t+k-1}),
\hat\Sigma_{t,k} = f_{\Sigma,k}(R_{t,k-1}, x_t) = MA_\zeta(r_{t+k-1} r^\top_{t+k-1}) - \hat u_{t,k}\hat u^\top_{t,k},

where MA_\zeta(r_{t+k-1}) = \zeta MA_\zeta(r_{t+k-2}) + (1 - \zeta) r_{t+k-1} is a moving-average operator and \zeta a smoothing parameter. The portfolio manager evaluates the portfolio on the basis of the realized portfolio return, p_{t,k} = w^\top_{t,k} r_{t+k}, and the realized utility of the portfolio return given the risk preference, i.e., U_{t,k} = p_{t,k} - \phi p_{t,k}^2. We show the relationship between these quantities in Figure 2. For the synthetic data, the entire workflow is the same as with the non-synthetic data. The asset return r'_{t,k} is generated by a GAN, where z_{t,k} is the random seed. Similar to the non-synthetic data, we define \hat u'_{t,k} = f_{u,k}(R'_{t,k-1}, x_t), \hat\Sigma'_{t,k} = f_{\Sigma,k}(R'_{t,k-1}, x_t), \hat H'_{t,k}, w'_{t,k}, p'_{t,k} and U'_{t,k}.
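For concreteness, the following NumPy sketch computes one step of this decision chain (moving-average mean and co-variance, shrinkage precision matrix, mean-variance weights, realized return, and realized utility). It is an illustrative reconstruction of the formulas above; the function names and the default smoothing parameter are assumptions, not the paper's code.

```python
import numpy as np

def ma_update(prev, value, zeta):
    # Exponential moving average: MA_zeta(v_t) = zeta * MA_zeta(v_{t-1}) + (1 - zeta) * v_t.
    return zeta * prev + (1.0 - zeta) * value

def portfolio_step(ma_mean, ma_second, r_next, zeta=0.9, tau=0.01, phi=1.0):
    """One step of the decision chain.

    ma_mean / ma_second are moving averages of r and r r^T through time t+k-1,
    so u_hat and Sigma_hat are formed before observing r_next = r_{t+k}.
    zeta is an illustrative value; the paper uses tau = 0.01 and phi = 1.
    """
    u_hat = ma_mean                                                  # \hat u_{t,k}
    Sigma_hat = ma_second - np.outer(u_hat, u_hat)                   # \hat Sigma_{t,k}
    d = len(r_next)
    H_hat = np.linalg.inv((1 - tau) * Sigma_hat + tau * np.eye(d))   # shrinkage precision \hat H_{t,k}
    ones = np.ones(d)
    # Mean-variance weights under the budget constraint 1^T w = 1.
    lam = (ones @ H_hat @ u_hat - 2 * phi) / (ones @ H_hat @ ones)
    w = H_hat @ (u_hat - lam * ones) / (2 * phi)                     # w_{t,k}
    p = w @ r_next                                                   # realized return p_{t,k}
    U = p - phi * p ** 2                                             # realized utility U_{t,k}
    # Update the moving averages with the newly observed return.
    ma_mean = ma_update(ma_mean, r_next, zeta)
    ma_second = ma_update(ma_second, np.outer(r_next, r_next), zeta)
    return (u_hat, Sigma_hat, H_hat, w, p, U), (ma_mean, ma_second)
```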
Theoretical Results

Arora et al. [AGL+17] have shown that training results for GANs that appear successful may be far from the target distribution in terms of standard metrics, such as Jensen–Shannon (JS) divergence and Wasserstein distance. Reconciling this with their good performance comes from recognizing that the Wasserstein GAN optimization is

\min_\eta \max_\theta \mathbb{E}_{r \sim P(r)}[D_\theta(r)] - \mathbb{E}_{z \sim P(z)}[D_\theta(G_\eta(z))],

where D_\theta and G_\eta are instantiated as neural networks, parameterized by \theta and \eta. Arora et al. [AGL+17] consider the following, weaker metric:
Definition 1 (Neural Net Distance for WGAN). Given a family of neural networks \{D_\theta : \theta \in \Theta\} for a set \Theta, and two distributions \mu and \nu, the corresponding neural network distance for the Wasserstein GAN is defined as

D_\Theta(\mu, \nu) = \sup_{\theta \in \Theta} \{ \mathbb{E}_{x \sim \mu}[D_\theta(x)] - \mathbb{E}_{x \sim \nu}[D_\theta(x)] \}.

With this, Arora et al. [AGL+17] build a generalization theory for WGANs under the following generalization property:
Definition 2.
Let P_{data} denote the distribution of non-synthetic data and P_G denote the generated distribution, and let \hat P_{data} and \hat P_G denote the corresponding empirical versions. The generalization gap for the WGAN is defined as |D_\Theta(\hat P_{data}, \hat P_G) - D_\Theta(P_{data}, P_G)|.

As long as the training is successful, i.e., D_\Theta(\hat P_{data}, \hat P_G) is small, and the generalization gap mentioned above is small, we have that D_\Theta(P_{data}, P_G) is small. A natural question in our setting is the following:

Question: for DAT-CGAN, is there a similar generalization guarantee?
To build such a theory for DAT-CGAN, instead of dealing with i.i.d. data as in [AGL+17], we must handle non-i.i.d. time-series data sampled in overlapping blocks. To simplify the exposition, we consider a fixed k = K and a single decision-related quantity (the underlying data can also be viewed as a decision-related quantity, where the corresponding f is the mapping picking the last element in R_{t_i,K}). For multiple but finite values of k, and multiple but finite decision-related quantities, we can use a uniform bound to obtain the corresponding generalization bounds.

We can simplify notation: let \hat P_R(I) and \hat P_{G_\eta,Z}(I) denote the empirical distributions induced by \{(f(R_{t_i,K}, x_{t_i}), x_{t_i})\}_{i=1}^{I} and \{(f(R'_{t_i,K}, x_{t_i}), x_{t_i})\}_{i=1}^{I}, respectively (recall R'_{t_i,K} = (G_\eta(z_{t_i,1}, x_{t_i}), \ldots, G_\eta(z_{t_i,K}, x_{t_i}))), and define

D_\Theta(\hat P_R(I), \hat P_{G_\eta,Z}(I)) = \sup_{\theta \in \Theta}[\hat E^{f,R} - \hat E^{f,G_\eta}],   (7)

where

\hat E^{f,R} = (1/I)\sum_{i=1}^{I} D_\theta(f(R_{t_i,K}, x_{t_i}), x_{t_i}),
\hat E^{f,G_\eta} = (1/I)\sum_{i=1}^{I} D_\theta(f(R'_{t_i,K}, x_{t_i}), x_{t_i}).

Here, \Theta and \Xi are parameter sets. Before formally stating the theoretical results, we need to understand what D_\Theta(\hat P_R(I), \hat P_{G_\eta,Z}(I)) converges to. Notice that for the surrogate loss, taking the expectation with respect to P(r_{t+K} | x_t) for any realization of x_t, i.e., x_t = c for a constant vector c, we need enough samples of r_{t+K} given x_t = c so that the empirical distribution \hat P(r_{t+K} | x_t = c) can well represent the ground-truth distribution P(r_{t+K} | x_t = c). However, in applications, we would not normally have enough samples for an arbitrary value c, especially considering that x_t may be a continuous random vector instead of a categorical one. It is even possible that for all \{t_i\}_{i=1}^{I}, the \{x_{t_i}\}_{i=1}^{I} values are all different from each other. Thus, we need to understand what D_\Theta(\hat P_R(I), \hat P_{G_\eta,Z}(I)) converges to as I \to \infty.

We show that D_\Theta(\hat P_R(I), \hat P_{G_\eta,Z}(I)) converges to a "weaker" version for a given \eta under certain conditions, i.e., that it converges to

D_\Theta(P_R, P_{G_\eta,Z}) = \sup_{\theta \in \Theta}[E^{f,R} - E^{f,G_\eta}],   (8)

where P_R and P_{G_\eta,Z} are the distributions of (f(R_{t,K}, x_t), x_t) and (f(R'_{t,K}, x_t), x_t), respectively, and

E^{f,R} = \mathbb{E}_{x_t}\mathbb{E}_{R_{t,K} \sim P(R_{t,K}|x_t)}[D_\theta(f(R_{t,K}, x_t), x_t)],
E^{f,G_\eta} = \mathbb{E}_{x_t}\mathbb{E}_{Z_{t,K} \sim P(Z_{t,K})}[D_\theta(f(R'_{t,K}, x_t), x_t)].

Compared with the surrogate loss mentioned previously, such as Eq. (3), there is an extra expectation over x_t in D_\Theta(P_R, P_{G_\eta,Z}), which comes from sampling over different \{x_{t_i}\}'s. We can view this as an average version of the surrogate losses under different realizations of x_t. Now we are ready to state a generalization bound regarding |D_\Theta(\hat P_R(I), \hat P_{G_\eta,Z}(I)) - D_\Theta(P_R, P_{G_\eta,Z})|.

In order to handle the issues with non-i.i.d. data and overlapping sampling, we introduce a framework for defining suitable mixing conditions. This kind of framework is commonly used in time-series analysis [Bra07].

Mixing condition framework.
Let X_i \in S for some set S, and X = (X_1, \ldots, X_n). We further denote X_i^j = (X_i, X_{i+1}, \ldots, X_j) as a random vector for 1 \le i < j \le n. Correspondingly, we let x_i^j = (x_i, x_{i+1}, \ldots, x_j) be a subsequence of the realization of X, i.e., (x_1, x_2, \ldots, x_n). We denote the set C = \{y_1^{i-1} \in S^{i-1}, w, w' \in S : P(X_1^i = y_1^{i-1}w) > 0, P(X_1^i = y_1^{i-1}w') > 0\}, and write

\bar\eta_{i,j}(\{X_i\}_{i=1}^{n}) = \sup_C \eta_{i,j}(y_1^{i-1}, w, w'),

where \eta_{i,j}(\{X_i\}_{i=1}^{n}, y_1^{i-1}, w, w') denotes

TV\big( P(X_j^n \mid X_1^i = y_1^{i-1}w),\; P(X_j^n \mid X_1^i = y_1^{i-1}w') \big).

Here, TV is the total variation distance, and P(X_j^n \mid X_1^i = y_1^{i-1}w) is the conditional distribution of X_j^n, conditioning on \{X_1^i = y_1^{i-1}w\}.
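To make the mixing coefficient concrete, the following short NumPy sketch computes \bar\eta_{1,2} for a toy two-state Markov chain of length n = 2, where the quantity reduces to the worst-case total variation distance between the next-state distributions from the two possible current states; the transition matrix is an illustrative assumption and not related to the paper's data.

```python
import numpy as np

# Toy two-state Markov chain with transition matrix P[s, s'].
P = np.array([[0.9, 0.1],
              [0.2, 0.8]])

def tv(p, q):
    # Total variation distance between two discrete distributions.
    return 0.5 * np.abs(p - q).sum()

# For n = 2 and (i, j) = (1, 2), eta-bar_{1,2} is the worst-case TV distance
# between the conditional laws of X_2 given X_1 = w versus X_1 = w'.
eta_bar_12 = max(tv(P[w], P[w_prime])
                 for w in range(2) for w_prime in range(2))
print(eta_bar_12)  # 0.7 for this transition matrix
```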
Assumptions and implications.

First, we make a number of natural boundedness assumptions. We assume the time-series data are of bounded support, i.e., there exists a universal B_r such that max\{\|r_i\|_\infty, \|r_i\|\} \le B_r, where the boundedness of \|r_i\| is implied by the boundedness of \|r_i\|_\infty since the dimension of r_i is finite. We also assume boundedness of the conditioning information \{x_t\}_t, i.e., there exists a universal B_x such that max\{\|x_t\|_\infty, \|x_t\|\} \le B_x. For the discriminators D_\gamma and D_\theta, where \theta \in \Theta \subseteq \mathbb{R}^p, we assume w.l.o.g. that \Theta is a subset of the unit ball of the corresponding dimension. Similarly, for the generative model G_\eta with \eta \in \Xi, we assume \Xi is a subset of the unit ball. (We can always rescale the parameters properly by changing the parameterization as long as \Theta is bounded, and the boundedness of \Theta is naturally satisfied since the training algorithm of the Wasserstein GAN requires weight clipping.) We also need L-Lipschitzness of D_\theta and G_\eta with respect to their parameters, i.e., \|D_{\theta_1}(x) - D_{\theta_2}(x)\| \le L\|\theta_1 - \theta_2\| for any x (and similarly for G_\eta), as well as boundedness of the output range of G, i.e., there exists \Delta such that max\{\|G_\eta(x)\|, \|G_\eta(x)\|_\infty\} \le \Delta for any input x. To characterize the mixing conditions, we assume there exists a universal function \beta such that max\{\bar\eta_{i,j}(\{(r_i, x_i)\}_{i=1}^{T}), \bar\eta_{i,j}(\{x_i\}_{i=1}^{T})\} \le \beta(|j - i|), with \Delta_\beta = \sum_{k=1}^{\infty}\beta(k) < \infty, where the \beta's are the mixing coefficients. Lastly, and as holds for the Wasserstein GAN, there exists a constant \tilde L such that \|D_\theta(x) - D_\theta(x')\| \le \tilde L\|x - x'\| for all \theta.

We first claim the boundedness of the decision-related quantities in DAT-CGAN. We defer the proofs of Lemma 1 and Theorem 1 to the supplemental material.

Lemma 1 (Boundedness of decision-related quantities). Under the assumptions above, the decision-related quantities we consider in the financial portfolio choice problem are all bounded, where the bounds are universal and only depend on B_r.

Let B_f denote the bound on the decision-related quantity, i.e., max\{\|f(R_{t_i,K}, x_{t_i})\|, \|f(R'_{t_i,K}, x_{t_i})\|\} \le B_f for all i. By Lemma 1, we obtain the following generalization bound for |D_\Theta(\hat P_R(I), \hat P_{G_\eta,Z}(I)) - D_\Theta(P_R, P_{G_\eta,Z})|, for each iteration of the training process (referring to each round of the min-max optimization of CGANs).

Theorem 1.
Under the assumptions above, suppose G_{\eta_1}, G_{\eta_2}, \ldots, G_{\eta_M} are the M generators in the M iterations of the training, and let B^* = \sqrt{B_f^2 + B_x^2}\,(K + \Delta_\beta). Then

\sup_{j \in [M]} |D_\Theta(\hat P_R(I), \hat P_{G_{\eta_j},Z}(I)) - D_\Theta(P_R, P_{G_{\eta_j},Z})| \le \varepsilon,

with probability at least

1 - C\exp(p\log(pL/\varepsilon))\,(1 + M)\exp\!\left(-\frac{I\varepsilon^2}{\tilde L^2 B^{*2}}\right),

for some constant C > 0.

Theorem 1 provides, whether for the underlying data or one of the decision-related quantities, that the distribution on non-synthetic data is close to the generated distribution at every iteration of the training process. As with Arora et al. [AGL+17], the guarantee is stated in terms of the neural network distance induced by the discriminator class.

Experimental Results

We study two different financial environments. The first is a simulated environment, and the second is real, based on a basket of ETF time series.
To avoid ambiguity, in this section we use the phrase "simulated" to refer to the simulated ground-truth model, and "synthetic" to refer to the data generated by the DAT-CGAN and other baselines (whether in a simulated or real environment).

Experimental setup.
Assume the risk-preference parameter of the portfolio manager is \phi = 1, with a shrinkage parameter \tau = 0.01 for use in estimating the precision matrix (to avoid issues with a degenerate co-variance matrix). We use the DAT-CGAN simulator with asset returns as the underlying data and with the realized utility of the portfolio as the decision-related quantity. Thus, we also call this a Ret-Utility-GAN. We adopt utility as the decision-related quantity in the loss, using this in addition to data (asset returns). The use of utility controls all the decision-related quantities, since it comes at the end of the decision chain. We find in our experiments that this provides good fidelity for the synthetic distribution. For the conditioning variables for each asset, we use five features: the asset return of the last day and four rolling-average based features, computed by taking the average of asset returns over the past few days.

We perform an ablation study, and compare against:

• (Ret-GAN) A GAN with only the asset return loss, which is a typical model used in the literature [KFT19, ZPH+18].
• (1step-GAN) A GAN with one-step look-ahead asset return and utility, which is designed to represent an approach similar to that used by [LWL+20].
• (Single-GAN) A GAN that imposes a single loss on stacked return and utility quantities, and is designed to test this approach compared with the multi-loss approach.
• (Utility-GAN) A GAN with only the utility loss, designed to test the necessity of adding loss with respect to the underlying data distribution.

For the generator, we use a two-layer feed-forward neural network for each asset (a minimal sketch is given below). The outputs are asset returns, and these are used to compute decision-related quantities, which are then fed into the discriminators. We make use of multiple discriminators, each corresponding to a particular quantity (e.g., underlying data, or a decision-related quantity). For each discriminator, the architecture is a two-layer feed-forward neural network. We train on an Azure GPU standard NV6 instance, which has one Tesla M60 GPU. For evaluation, we calculate the Wasserstein distances with respect to the underlying asset returns, the estimated precision matrix, and the portfolio weights, for both the first and last forward-looking step (this is steps 1 and 3 for the estimated precision matrix and portfolio weights, and steps 1 and 4 for the underlying data).
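The following PyTorch sketch mirrors the architecture described above, assuming the layer sizes reported with the simulated-data experiment ((5+8) → 4 → 1 for each per-asset generator and M → 8 → 1 for each discriminator, with ReLU units); it is an illustrative reconstruction under those stated sizes, not the authors' released code.

```python
import torch
import torch.nn as nn

class AssetGenerator(nn.Module):
    # One generator per asset: 5 conditioning features + 8 random seeds at the input,
    # a hidden layer of width 4, and one synthetic asset return at the output.
    def __init__(self, n_features=5, n_seeds=8, hidden=4):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_features + n_seeds, hidden), nn.ReLU(),
            nn.Linear(hidden, 1))

    def forward(self, x, z):
        return self.net(torch.cat([x, z], dim=-1))

class QuantityDiscriminator(nn.Module):
    # One discriminator per quantity (returns, precision matrix, weights, utility, ...):
    # m_inputs matches the flattened quantity plus the conditioning features.
    def __init__(self, m_inputs, hidden=8):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(m_inputs, hidden), nn.ReLU(),
            nn.Linear(hidden, 1))

    def forward(self, q, x):
        return self.net(torch.cat([q.flatten(start_dim=1), x], dim=-1))
```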
Results: simulated time series.
We first present results on a simulated time series. The data-generating process is given by

r_{t+1} = b_0 r_t + \sum_{i=1}^{4} b_i\, MA_{\zeta_i}(r_t) + \epsilon,

where r_t is the asset return vector, MA_{\zeta_i}(r_t) a moving-average operator, \zeta_i a smoothing parameter, b_i a coefficient, and \epsilon noise. We use a multivariate t-distribution to model the noise, with location parameter \mu = [0, 0, 0, 0]^\top, a positive-definite 4 × 4 shape matrix \Sigma with unit diagonal entries, and \nu = 100 degrees of freedom. The t-distribution simulates the heavy-tail behavior of asset returns [Bol87].

For the generator, the neural network has dimensions (5+8), 4, and 1 for the input, hidden, and output layers. At the input, 5 nodes are based on the hand-crafted conditioning variables and 8 nodes provide random seeds. For the discriminators, the neural networks have dimensions M, 8, 1 from input to output, where M matches the dimension of each of the quantities. Both networks use ReLU units.

The approach is computationally intensive, requiring around one month of training time for a single generative model. The main reason is that we need to compute decision-related quantities, and also solve a series of optimal transport optimization problems to allow for accurate evaluation (using solvers from [FC17]). We stop some runs early to save on compute time.

Figure 3: Wasserstein distances for (a) step-1 asset returns, (b) step-1 precision matrix, (c) step-1 portfolio weights, (d) step-4 asset returns, (e) step-3 precision matrix, and (f) step-3 portfolio weights.
Figures 3a to 3f confirm that Ret-Utility-GAN is the best in terms of Wasserstein distance for asset returns, the estimated precision matrix, and the portfolio weights for the simulated time series. That the Ret-Utility-GAN performs better than Ret-GAN for returns confirms that making use of utility within the loss function provides useful controls on the distribution of the synthetic data. In other comparisons, the performance of the Ret-Utility-GAN is (1) better than Utility-GAN for the portfolio-weights comparison, which shows that considering loss with respect to the underlying data also helps; (2) better than the Single-GAN, which shows that imposing a loss for each quantity is more effective than a single loss in accounting for the heterogeneity of different types of quantities; and (3) better than the 1step-GAN, which shows that the Ret-Utility-GAN is effective in addressing exposure bias.
Results: Real ETF time series.
We use daily price data for each of four U.S. ETFs, i.e., the Materials (XLB), Energy (XLE), Financial (XLF), and Industrial (XLI) ETFs, from 1999 to 2016. The data includes end-of-day price information for each ETF. The entire dataset has more than 17,400 data points (17 years × 250 days × 4 ETFs).

Conclusion
We proposed a novel, decision-aware time-series conditional generative adversarial network (DAT-CGAN) for the multi-step time-series generation problem. The method is decision-aware through the incorporation of loss functions on decision-related quantities, which provides high-fidelity synthetic data not only for the underlying data distribution, but also in supporting the decision processes of end users. The DAT-CGAN makes use of a multi-loss structure, avoids exposure bias by aligning look-ahead periods during training and test, and alleviates problems with data scarcity through an overlapped block-sampling scheme. Moreover, we characterize the generalization properties of DAT-CGAN for both underlying data generation and decision-related quantities. In an application to portfolio selection, we demonstrated better generative quality for decision-related quantities such as the estimated precision matrix and portfolio weights than other strong, GAN-based baselines. In future work, we will study the robustness of the simulator when applied to end users with mildly different risk preferences.

References
[AGL+17] Sanjeev Arora, Rong Ge, Yingyu Liang, Tengyu Ma, and Yi Zhang. Generalization and equilibrium in generative adversarial nets (GANs). In Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 224–232. PMLR, 2017.
[AIMM19] Susan Athey, Guido Imbens, Jonas Metzger, and Evan Munro. Using Wasserstein generative adversarial networks for the design of Monte Carlo simulations. arXiv:1909.02210, 2019.
[Büh02] Peter Bühlmann. Bootstraps for time series. Statistical Science, 17:52–72, 2002.
[Bol87] Tim Bollerslev. A conditionally heteroskedastic time series model for speculative prices and rates of return. Review of Economics and Statistics, 69:542–547, 1987.
[Bra07] R. C. Bradley. Introduction to Strong Mixing Conditions, volumes 1–3. Kendrick Press, Heber City, Utah, 2007.
[BVJS15] Samy Bengio, Oriol Vinyals, Navdeep Jaitly, and Noam Shazeer. Scheduled sampling for sequence prediction with recurrent neural networks. In Advances in Neural Information Processing Systems 28: Annual Conference on Neural Information Processing Systems, pages 1171–1179, 2015.
[FC17] Rémi Flamary and Nicolas Courty. POT: Python Optimal Transport library. https://pythonot.github.io/, 2017.
[KFT19] Adriano Soares Koshiyama, Nick Firoozye, and Philip C. Treleaven. Generative adversarial networks for financial trading strategies fine-tuning and combination. CoRR, abs/1901.01751, 2019.
[KR08] Leonid Aryeh Kontorovich and Kavita Ramanan. Concentration inequalities for dependent random variables via the martingale method. The Annals of Probability, 36(6):2126–2158, 2008.
[LWL+20] Junyi Li, Xintong Wang, Yaoyang Lin, Arunesh Sinha, and Michael P. Wellman. Generating realistic stock market order streams. In The Thirty-Fourth AAAI Conference on Artificial Intelligence, pages 727–734, 2020.
[Mar52] Harry Markowitz. Portfolio selection. The Journal of Finance, 7:77–91, 1952.
[MO14] Mehdi Mirza and Simon Osindero. Conditional generative adversarial nets. CoRR, abs/1411.1784, 2014.
[RCAZ16] Marc'Aurelio Ranzato, Sumit Chopra, Michael Auli, and Wojciech Zaremba. Sequence level training with recurrent neural networks. In Proceedings of the 4th International Conference on Learning Representations (ICLR), 2016.
[RMM+17] Steven J. Rennie, Etienne Marcheret, Youssef Mroueh, Jerret Ross, and Vaibhava Goel. Self-critical sequence training for image captioning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1179–1195, 2017.
[Vil09] Cédric Villani. Optimal Transport: Old and New. Springer, Berlin, 2009.
[YJvdS19] Jinsung Yoon, Daniel Jarrett, and Mihaela van der Schaar. Time-series generative adversarial networks. In Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems, pages 5509–5519, 2019.
[ZPH+18] Xingyu Zhou, Zhisong Pan, Guyu Hu, Siqi Tang, and Cheng Zhao. Stock market prediction on high-frequency data using generative adversarial nets. Mathematical Problems in Engineering, 2018.

Appendix

A Omitted Proofs
A.1 Proof of Lemma 1
We first show that the decision-related quantities in DAT-CGAN are all bounded under our assumptions. Here, by boundedness of a vector or matrix, we mean that the largest entry of the vector or matrix is bounded. As in the theory part of the main paper, we simplify notation, for instance by considering only a fixed k = K and omitting some constants; this does not affect the validity of the application of our theory to the algorithm. The quantities of interest are:

a. \hat u_{t,K} = MA_\zeta(r_{t+K-1}), where MA_\zeta(r_{t+K-1}) = \zeta MA_\zeta(r_{t+K-2}) + (1 - \zeta) r_{t+K-1} and 0 < \zeta < 1;
b. \hat\Sigma_{t,K} = MA_\zeta(r_{t+K-1} r^\top_{t+K-1}) - \hat u_{t,K}\hat u^\top_{t,K};
c. \hat H_{t,K} = ((1 - \tau)\hat\Sigma_{t,K} + \tau\Lambda)^{-1}, where 0 < \tau < 1;
d. w_{t,K} = h(\hat u_{t,K}, \hat H_{t,K}) = \frac{\hat H_{t,K}}{2\phi}\left(\hat u_{t,K} - \frac{\mathbf{1}^\top \hat H_{t,K}\hat u_{t,K} - 2\phi}{\mathbf{1}^\top \hat H_{t,K}\mathbf{1}}\,\mathbf{1}\right);
e. p_{t,K} = \hat w^\top_{t,K} r_{t+K};
f. U(\hat w^\top_{t,K} r_{t+K}) = \hat w^\top_{t,K} r_{t+K} - \phi(\hat w^\top_{t,K} r_{t+K})^2.

Quantities a and b are obviously bounded since r_t is bounded for all t. Quantities e and f are also bounded once we prove that c and d are bounded.

I. For \hat H_{t,K}, we can obtain its boundedness by simply realizing that the determinant of \hat H^{-1}_{t,K} is lower bounded, that the determinants of the adjugate matrices are all upper bounded, and applying Cramer's rule. The lower bound on |\hat H^{-1}_{t,K}| follows from |\hat H^{-1}_{t,K}| = \prod_{j=1}^{d}((1 - \tau)\tau_j + \tau) \ge \tau^d, where the \tau_j's are the eigenvalues of the matrix \hat\Sigma_{t,K} = MA_\zeta(r_{t+K-1} r^\top_{t+K-1}) - \hat u_{t,K}\hat u^\top_{t,K}, which are all non-negative. The determinants of the (d-1) × (d-1) adjugate matrices are clearly upper bounded since every entry of them is bounded.

II. Next, we consider

\hat w_{t,K} = h(\hat u_{t,K}, \hat H_{t,K}) = \frac{\hat H_{t,K}}{2\phi}\left(\hat u_{t,K} - \frac{\mathbf{1}^\top \hat H_{t,K}\hat u_{t,K} - 2\phi}{\mathbf{1}^\top \hat H_{t,K}\mathbf{1}}\,\mathbf{1}\right).

Notice that \tau_{max}(\hat H_{t,K}) \le 1/\tau_{min}(\hat H^{-1}_{t,K}) \le 1/\tau, so for any vector v, |\hat H_{t,K} v| \le \|v\|/\tau. Besides, \tau_{max}(\hat H^{-1}_{t,K}) \le \|\hat H^{-1}_{t,K}\|_F. The Frobenius norm of the matrix \hat H^{-1}_{t,K} is bounded since every entry of the matrix is bounded. Thus, for any v, |\hat H_{t,K} v| \ge \|v\|/\|\hat H^{-1}_{t,K}\|_F, which is bounded below by a positive constant depending only on B_r, \tau, and d. Then, we know that \hat w_{t,K} is bounded. Notice that all the bounds mentioned above can be obtained using only B_r, \tau, d, and \phi.
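The eigenvalue bounds used in step I can be checked numerically: under shrinkage, every eigenvalue of \hat H_{t,K} = ((1 - \tau)\hat\Sigma + \tau\Lambda)^{-1} lies in (0, 1/\tau]. The following NumPy snippet is a small illustrative verification with a random PSD matrix; it is not part of the paper's experiments.

```python
import numpy as np

rng = np.random.default_rng(1)
d, tau = 4, 0.01
A = rng.normal(size=(d, d))
Sigma_hat = A @ A.T                       # a PSD co-variance-like matrix
H_inv = (1 - tau) * Sigma_hat + tau * np.eye(d)
H_hat = np.linalg.inv(H_inv)

eigs = np.linalg.eigvalsh(H_hat)
# Every eigenvalue of H_hat is at most 1/tau, since each eigenvalue of H_inv is at least tau.
assert np.all(eigs <= 1.0 / tau + 1e-9)
assert np.all(eigs > 0)
print(eigs.max(), 1.0 / tau)
```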
A.2 Proof of Theorem 1

For convenience of statement, we restate the mixing condition framework and the corresponding lemma.
Restatement of Result in [KR08]
We consider a simplified variant of Theorem 1.1 in [KR08]. Let X_i \in S, where S is a finite set, and X = (X_1, X_2, \ldots, X_n). We further denote X_i^j = (X_i, X_{i+1}, \ldots, X_j) as a random vector for 1 \le i < j \le n. Correspondingly, we let x_i^j = (x_i, x_{i+1}, \ldots, x_j) be a subsequence of (x_1, x_2, \ldots, x_n). Let

\bar\eta_{i,j} = \sup_{y_1^{i-1} \in S^{i-1},\, w, w' \in S,\, P(X_1^i = y_1^{i-1}w) > 0,\, P(X_1^i = y_1^{i-1}w') > 0} \eta_{i,j}(y_1^{i-1}, w, w'),

where

\eta_{i,j}(y_1^{i-1}, w, w') = TV\big( P(X_j^n \mid X_1^i = y_1^{i-1}w),\; P(X_j^n \mid X_1^i = y_1^{i-1}w') \big).

Here TV is the total variation distance, and P(X_j^n \mid X_1^i = y_1^{i-1}w) is the conditional distribution of X_j^n, conditioning on \{X_1^i = y_1^{i-1}w\}. Let H_n be an n × n upper-triangular matrix, defined by

(H_n)_{ij} = 1 if i = j, \quad (H_n)_{ij} = \bar\eta_{i,j} if i < j, \quad and (H_n)_{ij} = 0 otherwise.

Then \|H_n\|_\infty = \max_{1 \le i \le n} J_{n,i}, where J_{n,i} = 1 + \bar\eta_{i,i+1} + \cdots + \bar\eta_{i,n}, and J_{n,n} = 1.

Lemma 2 (Variant of Result in [KR08]). Let h be an L_h-Lipschitz function (with respect to the Hamming distance) on S^n for some constant L_h > 0. Then, for any t > 0,

P(|h(X) - \mathbb{E}h(X)| \ge t) \le 2\exp\!\left(-\frac{t^2}{2 n L_h^2 \|H_n\|_\infty^2}\right).

Under our assumptions, for any \theta \in \Theta, D_\theta(f(R_{t_i,K}, x_{t_i}), x_{t_i}) \in [L_{f_R}, U_{f_R}], with U_{f_R} - L_{f_R} \le 2\tilde L\sqrt{B_f^2 + B_x^2}. Since Lemma 2 needs finite support for S, we take a detour here in order to extend it to an interval support. We define an \varepsilon-net for the interval [L_{f_R}, U_{f_R}], i.e., P_\varepsilon = \{p_0, p_1, \ldots, p_W\} such that p_0 = L_{f_R}, p_1 = L_{f_R} + \varepsilon, \ldots, p_W = L_{f_R} + W\varepsilon and |U_{f_R} - p_W| \le \varepsilon. We define a function g_{P_\varepsilon}(\cdot) on [L_{f_R}, U_{f_R}] such that, for any x \in [L_{f_R}, U_{f_R}], we have g_{P_\varepsilon}(x) = \arg\min_{p_i \in P_\varepsilon} |x - p_i|. Without loss of generality, we can assume, for all j \in \{0, \ldots, W\}, that P(g_{P_\varepsilon}[D_\theta(f(R_{t_i,K}, x_{t_i}), x_{t_i})] = p_j) > 0; otherwise, we can remove the corresponding p_j and form a new net. From the mixing condition on \{(r_i, x_i)\}_{i=1}^{T}, we can obtain a mixing condition on the overlapping blocks.

Lemma 3.
Under our assumptions, denote \tilde R^x_{t_i,K} = ((r_{t_i+1}, x_{t_i+1}), \ldots, (r_{t_i+K}, x_{t_i+K})). We have

\bar\eta_{i,j}(\{\tilde R^x_{t_m,K}\}_{m=1}^{I}) \le 1 \text{ if } |i - j| \le K - 1, \quad\text{and}\quad \bar\eta_{i,j}(\{\tilde R^x_{t_m,K}\}_{m=1}^{I}) \le \bar\eta_{i+K-1,\,j}(\{(r_i, x_i)\}_{i=1}^{T}) \le \beta(|j - i - K + 1|) \text{ otherwise.}

Proof.
This follows immediately once we observe that, for any output range O, the event (\tilde R^x_{i,K}, \tilde R^x_{i+1,K}) \in O is equivalent to ((r_{i+1}, x_{i+1}), \ldots, (r_{i+K}, x_{i+K}), (r_{i+1+K}, x_{i+1+K})) \in O' for some output range O', and that |t_i - t_j| \ge |i - j|.

Lemma 4.
Under our assumptions, denote \tilde Z^x_{t_i,K} = ((z_{t_i,1}, x_{t_i+1}), \ldots, (z_{t_i,K}, x_{t_i+K})). We have

\bar\eta_{i,j}(\{\tilde Z^x_{t_m,K}\}_{m=1}^{I}) \le 1 \text{ if } |i - j| \le K - 1, \quad\text{and}\quad \bar\eta_{i,j}(\{\tilde Z^x_{t_m,K}\}_{m=1}^{I}) \le \bar\eta_{i+K-1,\,j}(\{(z_i, x_i)\}_{i=1}^{T}) \le \beta(|j - i - K + 1|) \text{ otherwise.}

Proof.
Notice that, unlike the R_{t_i,K}, the \{Z_{t_i,K}\}_i are mutually independent, and the elements within each Z_{t_i,K} are also mutually independent. Thus, the mixing coefficients depend entirely on x. Similarly to Lemma 3, we immediately obtain the result.

Then, we can use Lemma 2 to obtain the following theorem.

Theorem 2.
With overlapping block sampling, for any \varepsilon > 0 and any \theta \in \Theta, we have

P\!\left( \left| \frac{1}{I}\sum_{i=1}^{I} D_\theta(f(R_{t_i,K}, x_{t_i}), x_{t_i}) - \mathbb{E}\, D_\theta(f(R_{t_i,K}, x_{t_i}), x_{t_i}) \right| \ge \varepsilon \right) \le 2\exp\!\left( -\frac{I\varepsilon^2}{\tilde L^2(B_f^2 + B_x^2)(K + \Delta_\beta)^2} \right).

Proof.
By Lemma 3, combined with the assumption that \sum_i \beta(|i|) \le \Delta_\beta, a simple calculation gives, for \{\tilde R^x_{t_m,K}\}_{m=1}^{I},

\|H_I\|_\infty \le K + \Delta_\beta.

The function

\frac{1}{I}\sum_{i=1}^{I} g_{P_\varepsilon}[ D_\theta(f(R_{t_i,K}, x_{t_i}), x_{t_i}) ]

is 2\tilde L\sqrt{B_f^2 + B_x^2}-Lipschitz continuous with respect to the Hamming distance. Then, by Lemma 2, we have

P\!\left( \left| \frac{1}{I}\sum_{i=1}^{I} g_{P_\varepsilon}[D_\theta(f(R_{t_i,K}, x_{t_i}), x_{t_i})] - \mathbb{E}\, g_{P_\varepsilon}[D_\theta(f(R_{t_i,K}, x_{t_i}), x_{t_i})] \right| \ge \varepsilon \right) \le 2\exp\!\left( -\frac{I\varepsilon^2}{\tilde L^2(B_f^2 + B_x^2)(K + \Delta_\beta)^2} \right).

Next, it is easy to see that for any x \in [L_{f_R}, U_{f_R}] we have |g_{P_\varepsilon}[x] - x| \le \varepsilon. Thus, we obtain

P\!\left( \left| \frac{1}{I}\sum_{i=1}^{I} D_\theta(f(R_{t_i,K}, x_{t_i}), x_{t_i}) - \mathbb{E}\, D_\theta(f(R_{t_i,K}, x_{t_i}), x_{t_i}) \right| \ge \varepsilon \right) \le 2\exp\!\left( -\frac{I\varepsilon^2}{\tilde L^2(B_f^2 + B_x^2)(K + \Delta_\beta)^2} \right).

Similarly, we have:
Theorem 3.
With overlapping block sampling, for any \varepsilon > 0, any \theta \in \Theta, and any \eta \in \Xi, we have

P\!\left( \left| \frac{1}{I}\sum_{i=1}^{I} D_\theta(G_\eta(Z_{t_i,K}, x_{t_i}), x_{t_i}) - \mathbb{E}\, D_\theta(G_\eta(Z_{t_i,K}, x_{t_i}), x_{t_i}) \right| \ge \varepsilon \right) \le 2\exp\!\left( -\frac{I\varepsilon^2}{\tilde L^2(B_f^2 + B_x^2)(K + \Delta_\beta)^2} \right).

Now, let us consider the generalization bound under the neural-network distance for a fixed generator.
Lemma 5.
Under the assumptions in the subsection "Assumptions and implications", there exists a universal constant C such that

\left| D_\Theta(\hat P_R(I), \hat P_{G_\eta,Z}(I)) - D_\Theta(P_R, P_{G_\eta,Z}) \right| \le \varepsilon,

with probability at least

1 - C\exp(p\log(pL/\varepsilon))\left[ \exp\!\left( -\frac{I\varepsilon^2}{\tilde L^2(B_f^2 + B_x^2)(K + \Delta_\beta)^2} \right) \right].

Proof.
Recall that for a fixed \theta we have the following two concentration bounds:

P\!\left( \left| \frac{1}{I}\sum_{i=1}^{I} D_\theta(f(R_{t_i,K}, x_{t_i}), x_{t_i}) - \mathbb{E}\, D_\theta(f(R_{t_i,K}, x_{t_i}), x_{t_i}) \right| \ge \varepsilon/6 \right) \le 2\exp\!\left( -\frac{I\varepsilon^2}{576\,\tilde L^2(B_f^2 + B_x^2)(K + \Delta_\beta)^2} \right),

P\!\left( \left| \frac{1}{I}\sum_{i=1}^{I} D_\theta(G_\eta(Z_{t_i,K}, x_{t_i}), x_{t_i}) - \mathbb{E}\, D_\theta(G_\eta(Z_{t_i,K}, x_{t_i}), x_{t_i}) \right| \ge \varepsilon/6 \right) \le 2\exp\!\left( -\frac{I\varepsilon^2}{576\,\tilde L^2(B_f^2 + B_x^2)(K + \Delta_\beta)^2} \right).

Let N_\Theta be an \varepsilon/(6L)-net of \Theta, which is a standard construction satisfying \log|N_\Theta| \le O(p\log(pL/\varepsilon)). Then, by the union bound, we obtain

P\!\left( \sup_{\theta \in \Theta} \left| \frac{1}{I}\sum_{i=1}^{I} D_\theta(f(R_{t_i,K}, x_{t_i}), x_{t_i}) - \mathbb{E}\, D_\theta(f(R_{t_i,K}, x_{t_i}), x_{t_i}) \right| \ge \varepsilon/3 \right) \le |N_\Theta|\exp\!\left( -\frac{I\varepsilon^2}{576\,\tilde L^2(B_f^2 + B_x^2)(K + \Delta_\beta)^2} \right),

P\!\left( \sup_{\theta \in \Theta} \left| \frac{1}{I}\sum_{i=1}^{I} D_\theta(G_\eta(Z_{t_i,K}, x_{t_i}), x_{t_i}) - \mathbb{E}\, D_\theta(G_\eta(Z_{t_i,K}, x_{t_i}), x_{t_i}) \right| \ge \varepsilon/3 \right) \le |N_\Theta|\exp\!\left( -\frac{I\varepsilon^2}{576\,\tilde L^2(B_f^2 + B_x^2)(K + \Delta_\beta)^2} \right).

Let us denote by D_{\theta^*} the optimal discriminator of E^{f,R} - E^{f,G_\eta}. It is easy to see that

D_\Theta(\hat P_R(I), \hat P_{G_\eta,Z}(I)) \ge \frac{1}{I}\sum_{i=1}^{I} D_{\theta^*}(f(R_{t_i,K}, x_{t_i}), x_{t_i}) - \frac{1}{I}\sum_{i=1}^{I} D_{\theta^*}(G_\eta(Z_{t_i,K}, x_{t_i}), x_{t_i})
\ge D_\Theta(P_R, P_{G_\eta,Z}) - \sup_{\theta \in \Theta}\left| \frac{1}{I}\sum_{i=1}^{I} D_\theta(f(R_{t_i,K}, x_{t_i}), x_{t_i}) - \mathbb{E}\, D_\theta(f(R_{t_i,K}, x_{t_i}), x_{t_i}) \right| - \sup_{\theta \in \Theta}\left| \frac{1}{I}\sum_{i=1}^{I} D_\theta(G_\eta(Z_{t_i,K}, x_{t_i}), x_{t_i}) - \mathbb{E}\, D_\theta(G_\eta(Z_{t_i,K}, x_{t_i}), x_{t_i}) \right|
\ge D_\Theta(P_R, P_{G_\eta,Z}) - \varepsilon.

The other direction is similar. Thus, if we further apply the union bound to the generator part, we can obtain the following theorem.
Theorem 4.
Under the assumptions in the subsection "Assumptions and implications", let G_{\eta_1}, G_{\eta_2}, \ldots, G_{\eta_M} denote the generators in each of the M iterations of the training procedure, and let B^* = \sqrt{B_f^2 + B_x^2}\,(K + \Delta_\beta). Then

\sup_{j \in [M]} \left| D_\Theta(\hat P_R(I), \hat P_{G_{\eta_j},Z}(I)) - D_\Theta(P_R, P_{G_{\eta_j},Z}) \right| \le \varepsilon,

with probability at least

1 - C\exp(p\log(pL/\varepsilon))\,(1 + M)\exp\!\left( -\frac{I\varepsilon^2}{\tilde L^2 B^{*2}} \right),

for some universal constant C > 0.

B Additional Illustration for Experimental Results
B.1 Parameters for Financial Portfolio Choice and Simulated Time Series
For non-synthetic data, \hat u_{t,k} and \hat\Sigma_{t,k} are estimators for the mean and co-variance of the asset return at time t + k, defined as

\hat u_{t,k} = f_{u,k}(R_{t,k-1}, x_t) = MA_\zeta(r_{t+k-1}),
\hat\Sigma_{t,k} = f_{\Sigma,k}(R_{t,k-1}, x_t) = MA_\zeta(r_{t+k-1} r^\top_{t+k-1}) - \hat u_{t,k}\hat u^\top_{t,k},

where MA_\zeta(r_{t+k-1}) = \zeta MA_\zeta(r_{t+k-2}) + (1 - \zeta) r_{t+k-1}, with smoothing parameter \zeta \in (0, 1). The synthetic asset return r'_{t,k} is generated by a GAN, where z_{t,k} is the random seed. Similar to the non-synthetic data, we define

\hat u'_{t,k} = f_{u,k}(R'_{t,k-1}, x_t) = MA_\zeta(r'_{t,k-1}),
\hat\Sigma'_{t,k} = f_{\Sigma,k}(R'_{t,k-1}, x_t) = MA_\zeta(r'_{t,k-1} r'^\top_{t,k-1}) - \hat u'_{t,k}\hat u'^\top_{t,k},

where MA_\zeta(r'_{t,1}) = \zeta MA_\zeta(r_t) + (1 - \zeta) r'_{t,1}.

The data-generating process for the simulated time series is given by r_{t+1} = b_0 r_t + \sum_{i=1}^{4} b_i\, MA_{\zeta_i}(r_t) + \epsilon, where r_t is the asset return vector, MA_{\zeta_i}(r_t) a moving-average operator, \zeta_i a smoothing parameter, b_i a coefficient, and \epsilon noise. We fix the smoothing parameters \zeta_1, \ldots, \zeta_4 (with \zeta_4 = 0.92) and the coefficients b_0, \ldots, b_4 (with b_3 = 0.1).
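A minimal simulator for this data-generating process is sketched below. Since several of the coefficient and smoothing-parameter values above are only partially specified, the numerical values used here (and the shape matrix of the t-distributed noise) are illustrative assumptions rather than the paper's exact settings.

```python
import numpy as np

def simulate_returns(T=2048, d=4, seed=0):
    # Illustrative parameter choices; only zeta_4 = 0.92 and b_3 = 0.1 match the paper exactly.
    zetas = np.array([0.5, 0.7, 0.85, 0.92])        # smoothing parameters zeta_1..zeta_4
    b0, bs = 0.05, np.array([0.1, 0.1, 0.1, 0.1])    # b_0 and b_1..b_4
    nu, Sigma = 100, np.eye(d)                       # multivariate-t noise with an assumed unit shape matrix
    rng = np.random.default_rng(seed)

    r = np.zeros((T, d))
    ma = np.zeros((4, d))                            # MA_{zeta_i}(r_t) for i = 1..4
    for t in range(T - 1):
        # Multivariate t noise: a Gaussian draw scaled by an inverse-chi-square factor.
        g = rng.multivariate_normal(np.zeros(d), Sigma)
        eps = g * np.sqrt(nu / rng.chisquare(nu))
        r[t + 1] = b0 * r[t] + bs @ ma + eps         # r_{t+1} = b_0 r_t + sum_i b_i MA_{zeta_i}(r_t) + eps
        ma = zetas[:, None] * ma + (1 - zetas[:, None]) * r[t + 1]
    return r

returns = simulate_returns()
print(returns.shape)
```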
B.2 Generator Network Architecture

The neural network architecture of the generator is shown in Figure 5. x_{t,i}, for 1 \le i \le 5, denotes the feature (state) variable input at time t for the network. These are the asset return of the last day and the four rolling-average based features, computed by taking the average of asset returns over the past few days, as mentioned in the "Experimental setup" section. Inputs z_{t,k,i}, for 1 \le k \le K and 1 \le i \le 8, are random seeds. The h_{t,k,i} units are hidden ReLU nodes, and the output units, r'_{t,k}, provide the synthetic asset returns.

After obtaining r'_{t,k}, we follow the recipe in the "Simulator for Financial Portfolio Choice" section in the main paper to compute the quantities of interest, i.e., \hat u'_{t,k} = f_{u,k}(R'_{t,k-1}, x_t), \hat\Sigma'_{t,k} = f_{\Sigma,k}(R'_{t,k-1}, x_t), \hat H'_{t,k}, w'_{t,k}, p'_{t,k}, and U'_{t,k}.

B.3 Discriminator Network Architecture