Continuous-Time Mean-Variance Portfolio Selection: A Reinforcement Learning Framework
Haoran Wang†  Xun Yu Zhou‡

First draft: February 2019
This version: May 2019
Abstract
We approach the continuous-time mean–variance (MV) portfolio selection with reinforcement learning (RL). The problem is to achieve the best tradeoff between exploration and exploitation, and is formulated as an entropy-regularized, relaxed stochastic control problem. We prove that the optimal feedback policy for this problem must be Gaussian, with time-decaying variance. We then establish connections between the entropy-regularized MV and the classical MV, including the solvability equivalence and the convergence as the exploration weighting parameter decays to zero. Finally, we prove a policy improvement theorem, based on which we devise an implementable RL algorithm. We find that our algorithm outperforms both an adaptive control based method and a deep neural network based algorithm by a large margin in our simulations.
Key words. Reinforcement learning, mean–variance portfolio selection, entropy regularization, stochastic control, value function, Gaussian distribution, policy improvement theorem.

∗ We are grateful for comments from the seminar participants at the Fields Institute. Wang gratefully acknowledges financial support through the FDT Center for Intelligent Asset Management at Columbia. Zhou gratefully acknowledges financial support through a start-up grant at Columbia University and through the FDT Center for Intelligent Asset Management.
† Department of Industrial Engineering and Operations Research, Columbia University, New York, NY 10027, USA. Email: [email protected].
‡ Department of Industrial Engineering and Operations Research, and The Data Science Institute, Columbia University, New York, NY 10027, USA. Email: [email protected].

1 Introduction
Applications of reinforcement learning (RL) to quantitative finance (e.g., algorithmic and high-frequency trading, smart order routing, portfolio management, etc.) have attracted more attention in recent years. One of the main reasons is that the electronic markets prevailing nowadays can provide a sufficient amount of microstructure data for training and adaptive learning, much beyond what human traders and portfolio managers could handle in the old days. Numerous studies have been carried out along this direction. For example, Nevmyvaka et al. (2006) conducted the first large-scale empirical analysis of an RL method applied to optimal order execution and achieved substantial improvement relative to the baseline strategies. Hendricks and Wilcox (2014) improved over the theoretical optimal trading strategies of the Almgren–Chriss model (Almgren and Chriss (2001)) using RL techniques and market attributes. Moody and Saffell (2001) and Moody et al. (1998) studied portfolio allocation problems with transaction costs via direct policy search based RL methods, without resorting to forecast models that rely on supervised learning.

However, most existing works focus only on RL optimization problems with expected utility of discounted rewards. Such criteria are either unable to fully characterize the uncertainty of the decision-making process in financial markets or opaque to typical investors. On the other hand, mean–variance (MV) is one of the most important criteria for portfolio choice. Initiated in the seminal work Markowitz (1952) for portfolio selection in a single period, such a criterion yields an asset allocation strategy that minimizes the variance of the final payoff while targeting some prespecified mean return.
The MV problem has been further investigated in the discrete-time multiperiod setting (Li and Ng (2000)) and the continuous-time setting (Zhou and Li (2000)), along with hedging (Duffie and Richardson (1991)) and optimal liquidation (Almgren and Chriss (2001)), among many other variants and generalizations. The popularity of the MV criterion is due not only to its intuitive and transparent nature in capturing the tradeoff between risk and reward for practitioners, but also to the theoretically interesting issue of time-inconsistency (or Bellman's inconsistency) inherent in the underlying stochastic optimization and control problems.

From the RL perspective, it is computationally challenging to seek the global optimum for Markov Decision Process (MDP) problems under the MV criterion (Mannor and Tsitsiklis (2013)). In fact, variance estimation and control are not as direct as optimizing the expected reward-to-go, which has been well understood in the classical MDP framework for most RL problems. Because most standard MDP performance criteria are linear in expectation, including the discounted sum of rewards and the long-run average reward (Sutton and Barto (2018)), Bellman's consistency equation can be easily derived for guiding policy evaluation and control, leading to many state-of-the-art RL techniques (e.g., Q-learning, temporal difference (TD) learning, etc.). The variance of reward-to-go, however, is nonlinear in expectation and, as a result, most of the well-known learning rules cannot be applied directly.

Existing works on variance estimation and control generally divide into two groups: value based methods and policy based methods. Sobel (1982) obtained the Bellman equation for the variance of reward-to-go under a fixed, given policy. Based on that equation, Sato et al. (2001) derived the TD(0) learning rule to estimate the variance under any given policy. In a related paper, Sato and Kobayashi (2000) applied this value based method to an MV portfolio selection problem.
It is worth noting that, due to their definition of the intermediate value function (i.e., the variance-penalized expected reward-to-go), Bellman's optimality principle does not hold. As a result, it is not guaranteed that a greedy policy based on the latest updated value function will eventually lead to the true global optimal policy. The second approach, the policy based RL, was proposed in Tamar et al. (2013). They also extended the work to linear function approximators and devised actor-critic algorithms for MV optimization problems for which convergence to the local optimum is guaranteed with probability one (Tamar and Mannor (2013)). Related works following this line of research include Prashanth and Ghavamzadeh (2013, 2016), among others. Despite the various methods mentioned above, it remains an open and interesting question in RL to search for the global optimum under the MV criterion.

In this paper, we establish an RL framework for studying the continuous-time MV portfolio selection, with continuous portfolio (control/action) and wealth (state/feature) spaces. The continuous-time formulation is appealing when the rebalancing of portfolios can take place at ultra-high frequency. Such a formulation may also benefit from the large amount of tick data that is available in most electronic markets nowadays. The classical continuous-time MV portfolio model is a special instance of a stochastic linear–quadratic (LQ) control problem (Zhou and Li (2000)). Recently, Wang et al. (2019) proposed and developed a general entropy-regularized, relaxed stochastic control formulation, called an exploratory formulation, to capture explicitly the tradeoff between exploration and exploitation in RL.
They showed that the optimal distributions of the exploratory control policies must be Gaussian for an LQ control problem in the infinite time horizon, thereby providing an interpretation for the Gaussian exploration broadly used both in RL algorithm design and in practice.

While being essentially an LQ control problem, the MV portfolio selection must be formulated in a finite time horizon, which is not covered by Wang et al. (2019). The first contribution of this paper is to present the global optimal solution to the exploratory MV problem. One of the interesting findings is that, unlike its infinite horizon counterpart derived in Wang et al. (2019), the optimal feedback control policy for the finite horizon case is a Gaussian distribution with a time-decaying variance. This suggests that the level of exploration decreases as the time approaches the end of the planning horizon. On the other hand, we will obtain results and observe insights that are parallel to those in Wang et al. (2019), such as the perfect separation between exploitation and exploration in the mean and variance of the optimal Gaussian distribution, the positive effect of a random environment on learning, and the close connections between the classical and the exploratory MV problems.

The main contribution of the paper, however, is to design an interpretable and implementable RL algorithm to learn the global optimal solution of the exploratory MV problem, premised upon a provable policy improvement theorem for continuous-time stochastic control problems with both entropy regularization and control relaxation. This theorem provides an explicit updating scheme for the feedback Gaussian policy, based on the value function of the current policy in an iterative manner. Moreover, it enables us to reduce from a family of general non-parametric policies to a specifically parametrized Gaussian family for exploration and exploitation, irrespective of the choice of an initial policy.
This, together with a carefully chosen initial Gaussian policy at the beginning of the learning process, guarantees the fast convergence of both the policy and the value function to the global optimum of the exploratory MV problem.

We further compare our RL algorithm with two other methods applied to the MV portfolio optimization. The first one is an adaptive control approach that adopts the real-time maximum likelihood estimation of the underlying model parameters. The other one is a recently developed continuous control RL algorithm, the deep deterministic policy gradient method (Lillicrap et al. (2016)), which employs deep neural networks. The comparisons are performed under various simulated market scenarios, including those with both stationary and non-stationary investment opportunities. In nearly all the simulations, our RL algorithm outperforms the other two methods by a large margin, in terms of both performance and training time.

The rest of the paper is organized as follows. In Section 2, we present the continuous-time exploratory MV problem under the entropy-regularized relaxed stochastic control framework. Section 3 provides the complete solution of the exploratory MV problem, along with connections to its classical counterpart. We then provide the policy improvement theorem and a convergence result for the learning problem in Section 4, based on which we devise the RL algorithm for solving the exploratory MV problem. In Section 5, we compare our algorithm with two other methods in simulations under various market scenarios. Finally, we conclude in Section 6.
2 Formulation of the Problem

In this section, we formulate an exploratory, entropy-regularized Markowitz's MV portfolio selection problem in continuous time, in the context of RL. The motivation of a general exploratory stochastic control formulation, of which the MV problem is a special case, was discussed at great length in a previous paper, Wang et al. (2019); so we will frequently refer to that paper.
2.1 Classical continuous-time MV model

We first recall the classical MV problem in continuous time (without RL). For ease of presentation, throughout this paper we consider an investment universe consisting of only one risky asset and one riskless asset. The case of multiple risky assets poses no essential differences or difficulties other than notational complexity.

Let an investment planning horizon T > 0 be fixed, and {W_t, 0 ≤ t ≤ T} a standard one-dimensional Brownian motion defined on a filtered probability space (Ω, F, {F_t}_{0 ≤ t ≤ T}, P) that satisfies the usual conditions. The price process of the risky asset is a geometric Brownian motion governed by

  dS_t = S_t (μ dt + σ dW_t), 0 ≤ t ≤ T,   (1)

with S_0 = s_0 > 0 at t = 0, μ ∈ ℝ and σ > 0 being the mean and volatility parameters, and r > 0 the interest rate of the riskless asset. The Sharpe ratio of the risky asset is defined by ρ = (μ − r)/σ.¹

Denote by {x^u_t, 0 ≤ t ≤ T} the discounted wealth process of an agent who rebalances her portfolio investing in the risky and riskless assets with a strategy u = {u_t, 0 ≤ t ≤ T}. Here u_t is the discounted dollar value put in the risky asset at time t, while satisfying the standard self-financing assumption and other technical conditions that will be spelled out in detail below. It follows from (1) that the wealth process satisfies

  dx^u_t = σ u_t (ρ dt + dW_t), 0 ≤ t ≤ T,   (2)

with an initial endowment x^u_0 = x_0 ∈ ℝ.

The classical continuous-time MV model aims to solve the following constrained optimization problem:

  min_u Var[x^u_T]  subject to  E[x^u_T] = z,   (3)

where {x^u_t, 0 ≤ t ≤ T} satisfies the dynamics (2) under the investment strategy (portfolio) u, and z ∈ ℝ is an investment target set at t = 0 as the desired mean payoff at the end of the investment horizon [0, T].

¹ In practice, the true (yet unknown) investment opportunity parameters μ, σ and r can be time-varying stochastic processes. Most existing quantitative finance methods are devoted to estimating these parameters. In contrast, RL learns the values of various strategies and the optimal value through exploration and exploitation, without assuming any statistical properties of these parameters or estimating them. For the model-based, classical MV problem, however, we assume these parameters are constant and known. In subsequent contexts, all we need is the structure of the problem for our RL algorithm design.
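As a sanity check on the wealth dynamics (2), note that under a constant dollar allocation u the terminal wealth is Gaussian: x_T = x_0 + σuρT + σuW_T, so E[x_T] = x_0 + σuρT and Var[x_T] = σ²u²T. The following sketch verifies this by Monte Carlo; all parameter values are illustrative assumptions, not calibrated quantities.

```python
import numpy as np

# Monte Carlo check of (2): dx_t = sigma * u_t * (rho dt + dW_t).
# For constant u, x_T = x0 + sigma*u*rho*T + sigma*u*W_T, so the terminal
# mean is x0 + sigma*u*rho*T and the terminal variance is sigma^2 u^2 T.
rng = np.random.default_rng(0)
mu, sigma, r = 0.08, 0.2, 0.02          # illustrative market parameters
rho = (mu - r) / sigma                  # Sharpe ratio
x0, u, T, n_paths = 1.0, 0.5, 1.0, 200_000

W_T = rng.standard_normal(n_paths) * np.sqrt(T)
x_T = x0 + sigma * u * rho * T + sigma * u * W_T

assert abs(x_T.mean() - (x0 + sigma * u * rho * T)) < 1e-2
assert abs(x_T.var() - sigma**2 * u**2 * T) < 1e-3
```

The exact simulation is possible here because, for constant u, (2) integrates in closed form; a time-varying strategy would require a discretization scheme instead.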
Due to the variance in its objective, (3) is known to be time-inconsistent.² The problem then becomes descriptive rather than normative, because there is generally no dynamically optimal solution for a time-inconsistent optimization problem. Agents react differently to the same time-inconsistent problem, and a goal of the study becomes to describe the different behaviors when facing such time-inconsistency.³ In this paper we restrict ourselves to the so-called pre-committed strategies of the MV problem, which are optimal at t = 0 only.

To solve (3), one first transforms it into an unconstrained problem by applying a Lagrange multiplier w:⁴

  min_u E[(x^u_T)²] − z² − 2w(E[x^u_T] − z) = min_u E[(x^u_T − w)²] − (w − z)².   (4)

The optimal strategy u* = {u*_t, 0 ≤ t ≤ T} of this unconstrained problem depends on w, and the original constraint E[x^{u*}_T] = z then determines the value of w. We refer to Zhou and Li (2000) for a detailed derivation.

² The original MV problem is to find the Pareto efficient frontier for a two-objective (i.e., maximizing the expected terminal payoff and minimizing its variance) optimization problem. There are a number of equivalent mathematical formulations to find such a frontier, (3) being one of them. In particular, by varying the parameter z one can trace out the frontier. See Zhou and Li (2000) for details.

³ For a detailed discussion of the different behaviors under time-inconsistency, see the seminal paper Strotz (1955). Most of the studies on the continuous-time MV problem in the literature have been devoted to pre-committed strategies; see Zhou and Li (2000); Li et al. (2002); Bielecki et al. (2005); Lim and Zhou (2002); Zhou and Yin (2003).

⁴ Strictly speaking, 2w ∈ ℝ is the Lagrange multiplier.

Given the complete knowledge of the model parameters, the classical, model-based MV problem (3) and many of its variants have been solved rather completely.
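The equality in (4) is a completion-of-square identity that holds for any terminal wealth distribution, as a quick numerical check confirms (the sampled distribution and the values of w and z below are illustrative).

```python
import numpy as np

# Check the completion-of-square identity behind (4):
#   E[(x_T)^2] - z^2 - 2w(E[x_T] - z)  ==  E[(x_T - w)^2] - (w - z)^2,
# which holds pathwise in expectation for any distribution of x_T.
rng = np.random.default_rng(3)
x_T = rng.normal(1.2, 0.5, size=100_000)   # arbitrary terminal-wealth sample
w, z = 2.0, 1.4                            # illustrative multiplier and target

lhs = np.mean(x_T**2) - z**2 - 2 * w * (np.mean(x_T) - z)
rhs = np.mean((x_T - w) ** 2) - (w - z) ** 2
assert abs(lhs - rhs) < 1e-10
```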
When implementing these solutions, one needs to estimate the market parameters from historical time series of asset prices, a procedure known as identification in classical adaptive control. However, it is well known that in practice it is difficult to estimate the investment opportunity parameters, especially the mean return (a.k.a. the mean–blur problem; see, e.g., Luenberger (1998)), with a workable accuracy. Moreover, the classical optimal MV strategies are often extremely sensitive to these parameters, largely due to the procedure of inverting ill-conditioned variance–covariance matrices to obtain optimal allocation weights. In view of these two issues, the Markowitz solution can be greatly irrelevant to the underlying investment objective.

On the other hand, RL techniques do not require, and indeed often skip, any estimation of model parameters. Rather, RL algorithms, driven by historical data, output optimal (or near-optimal) allocations directly. This is made possible by direct interactions with the unknown investment environment, in a learning (exploring) while optimizing (exploiting) fashion. Wang et al. (2019) motivated and proposed a general theoretical framework for exploratory, RL stochastic control problems and carried out a detailed study for the special LQ case, albeit in the setting of the infinite time horizon. We adopt the same framework here, noting the inherent features of an LQ structure and a finite time horizon of the MV problem. Indeed, although the motivation for the exploratory formulation is mostly the same, there are intriguing new insights emerging with this transition from the infinite time horizon to its finite counterpart.

2.2 Exploratory continuous-time MV problem

First, we introduce the "exploratory" version of the state dynamics (2). It was originally proposed in Wang et al. (2019), motivated by repetitive learning in RL.
In this formulation, the control (portfolio) process u = {u_t, 0 ≤ t ≤ T} is randomized, which represents exploration and learning, leading to a measure-valued or distributional control process whose density function is given by π = {π_t, 0 ≤ t ≤ T}. The dynamics (2) are changed to

  dX^π_t = b̃(π_t) dt + σ̃(π_t) dW_t,   (5)

where 0 < t ≤ T and X^π_0 = x_0,

  b̃(π) := ∫_ℝ ρσu π(u) du, π ∈ P(ℝ),   (6)

and

  σ̃(π) := √( ∫_ℝ σ²u² π(u) du ), π ∈ P(ℝ),   (7)

with P(ℝ) being the set of density functions of probability measures on ℝ that are absolutely continuous with respect to the Lebesgue measure. Mathematically, (5) coincides with the relaxed control formulation in classical control theory. Refer to Wang et al. (2019) for a detailed discussion on the motivation of (5).

Denote respectively by μ_t and σ²_t, 0 ≤ t ≤ T, the mean and variance (assuming they exist for now) processes associated with the distributional control process π, i.e.,

  μ_t := ∫_ℝ u π_t(u) du  and  σ²_t := ∫_ℝ u² π_t(u) du − μ²_t.   (8)

Then, it follows immediately that the exploratory dynamics (5) become

  dX^π_t = ρσμ_t dt + σ √(μ²_t + σ²_t) dW_t,   (9)

where 0 < t ≤ T and X^π_0 = x_0. The randomized, distributional control process π = {π_t, 0 ≤ t ≤ T} is to model exploration, whose overall level is in turn captured by its accumulative differential entropy

  H(π) := −∫_0^T ∫_ℝ π_t(u) ln π_t(u) du dt.   (10)

Further, introduce a temperature parameter (or exploration weight) λ > 0 reflecting the tradeoff between exploitation and exploration. The entropy-regularized exploratory MV problem is then, for a fixed w ∈ ℝ,

  min_{π ∈ A(0, x_0)} E[ (X^π_T − w)² + λ ∫_0^T ∫_ℝ π_t(u) ln π_t(u) du dt ] − (w − z)²,   (11)
where A(0, x_0) is the set of admissible distributional controls on [0, T], to be precisely defined below. Once this problem is solved with a minimizer π* = {π*_t, 0 ≤ t ≤ T}, the Lagrange multiplier w can be determined by the additional constraint E[X^{π*}_T] = z.

The optimization objective (11) explicitly encourages exploration, in contrast to the classical problem (4), which concerns exploitation only.

We will solve (11) by dynamic programming. For that we need to define the value functions. For each (s, y) ∈ [0, T) × ℝ, consider the state equation (9) on [s, T] with X^π_s = y. Define the set of admissible controls, A(s, y), as follows. Let B(ℝ) be the Borel algebra on ℝ. A (distributional) control (or portfolio/strategy) process π = {π_t, s ≤ t ≤ T} belongs to A(s, y) if:

(i) for each s ≤ t ≤ T, π_t ∈ P(ℝ) a.s.;
(ii) for each A ∈ B(ℝ), {∫_A π_t(u) du, s ≤ t ≤ T} is F_t-progressively measurable;
(iii) E[ ∫_s^T (μ²_t + σ²_t) dt ] < ∞;
(iv) E[ | (X^π_T − w)² + λ ∫_s^T ∫_ℝ π_t(u) ln π_t(u) du dt | | X^π_s = y ] < ∞.

Clearly, it follows from condition (iii) that the stochastic differential equation (SDE) (9) has a unique strong solution for s ≤ t ≤ T that satisfies X^π_s = y.

Controls in A(s, y) are measure-valued (or, precisely, density-function-valued) stochastic processes, which are also called open-loop controls in the control terminology.
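For a Gaussian density π = N(α, β), the exploratory coefficients (6)-(7) and the entropy integrand in (10) have closed forms: b̃(π) = ρσα, σ̃(π) = σ√(α² + β), and −∫ π ln π du = ½ ln(2πeβ). A quadrature sketch confirms this; the numerical values of ρ, σ, α, β are illustrative.

```python
import numpy as np

# Quadrature check of (6)-(7) and the Gaussian differential entropy used in (10)
# for pi = N(alpha, beta). Closed forms: b_tilde = rho*sigma*alpha,
# sigma_tilde = sigma*sqrt(alpha^2 + beta), entropy = 0.5*ln(2*pi*e*beta).
rho, sigma, alpha, beta = 0.3, 0.2, -0.4, 0.25   # illustrative values

u = np.linspace(alpha - 12 * np.sqrt(beta), alpha + 12 * np.sqrt(beta), 400_001)
pi_u = np.exp(-(u - alpha) ** 2 / (2 * beta)) / np.sqrt(2 * np.pi * beta)

def integrate(f):
    # trapezoidal rule on the fixed grid u
    return float(np.sum((f[1:] + f[:-1]) / 2) * (u[1] - u[0]))

b_tilde = integrate(rho * sigma * u * pi_u)
sigma_tilde = np.sqrt(integrate(sigma**2 * u**2 * pi_u))
entropy = -integrate(pi_u * np.log(pi_u))

assert abs(b_tilde - rho * sigma * alpha) < 1e-6
assert abs(sigma_tilde - sigma * np.sqrt(alpha**2 + beta)) < 1e-6
assert abs(entropy - 0.5 * np.log(2 * np.pi * np.e * beta)) < 1e-6
```

These closed forms are what make the Gaussian family analytically tractable in the derivations of Section 3.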
As in the classical control theory, it is important to distinguish between open-loop controls and feedback (or closed-loop) controls (or policies as in the RL literature, or laws as in the control literature). Specifically, a deterministic mapping π(·; ·, ·) is called an (admissible) feedback control if (i) π(·; t, x) is a density function for each (t, x) ∈ [0, T] × ℝ; (ii) for each (s, y) ∈ [0, T) × ℝ, the following SDE (which is the system dynamics after the feedback policy π(·; ·, ·) is applied)

  dX^π_t = b̃(π(·; t, X^π_t)) dt + σ̃(π(·; t, X^π_t)) dW_t, t ∈ [s, T]; X^π_s = y   (12)

has a unique strong solution {X^π_t, t ∈ [s, T]}, and the open-loop control π = {π_t, t ∈ [s, T]} ∈ A(s, y), where π_t := π(·; t, X^π_t). In this case, the open-loop control π is said to be generated from the feedback policy π(·; ·, ·) with respect to the initial time and state (s, y). It is useful to note that an open-loop control and its admissibility depend on the initial (s, y), whereas a feedback policy can generate open-loop controls for any (s, y) ∈ [0, T) × ℝ, and hence is in itself independent of (s, y).⁵

⁵ Throughout this paper, we use boldfaced π to denote feedback controls and the normal style π to denote open-loop controls.

For each fixed w ∈ ℝ, define

  V(s, y; w) := inf_{π ∈ A(s,y)} E[ (X^π_T − w)² + λ ∫_s^T ∫_ℝ π_t(u) ln π_t(u) du dt | X^π_s = y ] − (w − z)²,   (13)

for (s, y) ∈ [0, T) × ℝ. The function V(·, ·; w) is called the optimal value function of the problem.⁶
Moreover, we define the value function under any given feedback control π:

  V^π(s, y; w) := E[ (X^π_T − w)² + λ ∫_s^T ∫_ℝ π_t(u) ln π_t(u) du dt | X^π_s = y ] − (w − z)²,   (14)

for (s, y) ∈ [0, T) × ℝ, where π = {π_t, t ∈ [s, T]} is the open-loop control generated from π with respect to (s, y) and {X^π_t, t ∈ [s, T]} is the corresponding wealth process.

3 Solving the Exploratory MV Problem

In this section we first solve the exploratory MV problem, and then establish the solvability equivalence between the classical and exploratory problems. The latter is important for understanding the cost of exploration and for devising RL algorithms.
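Before solving (11), note that the value function (14) of a fixed policy can always be estimated by simulation. A minimal sketch, assuming the simple (suboptimal) feedback policy π(·; t, x) = N(0, β) with constant exploration variance β, for which V^π is available by hand: under (9) the wealth is X_T = x_0 + σ√β W_T, and since ∫ π ln π du = −½ ln(2πeβ), one gets V^π(0, x_0; w) = (x_0 − w)² + σ²βT − (λT/2) ln(2πeβ) − (w − z)². All parameter values are illustrative.

```python
import numpy as np

# Monte Carlo evaluation of (14) for the illustrative feedback policy
# pi(.; t, x) = N(0, beta): zero-mean allocation, constant exploration
# variance beta. Closed form for comparison:
#   V^pi(0,x0;w) = (x0-w)^2 + sigma^2*beta*T - (lam*T/2)*ln(2*pi*e*beta) - (w-z)^2.
rng = np.random.default_rng(1)
sigma, lam, beta, T = 0.2, 0.1, 0.5, 1.0
x0, w, z = 1.0, 2.0, 1.4
n_paths = 400_000

X_T = x0 + sigma * np.sqrt(beta) * np.sqrt(T) * rng.standard_normal(n_paths)
entropy_cost = lam * T * (-0.5 * np.log(2 * np.pi * np.e * beta))
V_mc = np.mean((X_T - w) ** 2) + entropy_cost - (w - z) ** 2

V_exact = (x0 - w) ** 2 + sigma**2 * beta * T \
          - (lam * T / 2) * np.log(2 * np.pi * np.e * beta) - (w - z) ** 2
assert abs(V_mc - V_exact) < 5e-3
```

This kind of sampled policy evaluation is the building block that an implementable RL algorithm would iterate on, in place of the closed forms used in this section.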
3.1 Optimal solution

To solve the exploratory MV problem (11), we apply the classical Bellman principle of optimality:

  V(t, x; w) = inf_{π ∈ A(t,x)} E[ V(s, X^π_s; w) + λ ∫_t^s ∫_ℝ π_v(u) ln π_v(u) du dv | X^π_t = x ],

for x ∈ ℝ and 0 ≤ t < s ≤ T. Following standard arguments, we deduce that V satisfies the Hamilton–Jacobi–Bellman (HJB) equation
  v_t(t, x; w) + min_{π ∈ P(ℝ)} ( ½ σ̃²(π) v_xx(t, x; w) + b̃(π) v_x(t, x; w) + λ ∫_ℝ π(u) ln π(u) du ) = 0,   (15)

or, equivalently,

  v_t(t, x; w) + min_{π ∈ P(ℝ)} ∫_ℝ ( ½ σ²u² v_xx(t, x; w) + ρσu v_x(t, x; w) + λ ln π(u) ) π(u) du = 0,   (16)

with the terminal condition v(T, x; w) = (x − w)² − (w − z)². Here v denotes the generic unknown solution to the HJB equation.

⁶ In the control literature, V is called the value function. However, in the RL literature the term "value function" is also used for the objective value under a particular control. So, to avoid ambiguity, we call V the optimal value function.

Applying the usual verification technique and using the fact that π ∈ P(ℝ) if and only if

  ∫_ℝ π(u) du = 1 and π(u) ≥ 0 a.e. on ℝ,   (17)

we can solve the (constrained) optimization problem in the HJB equation (16) to obtain a feedback (distributional) control whose density function is given by

  π*(u; t, x, w) = exp( −(1/λ)( ½σ²u² v_xx(t, x; w) + ρσu v_x(t, x; w) ) ) / ∫_ℝ exp( −(1/λ)( ½σ²u² v_xx(t, x; w) + ρσu v_x(t, x; w) ) ) du
                 = N( u | −(ρ/σ) v_x(t, x; w)/v_xx(t, x; w), λ/(σ² v_xx(t, x; w)) ),   (18)

where we have denoted by N(u | α, β) the Gaussian density function with mean α ∈ ℝ and variance β > 0.
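The inner minimization in (16) is over densities, and its minimizer is the Boltzmann form in (18), which for a quadratic exponent is exactly a Gaussian. A numerical sketch, with illustrative values of v_x and v_xx (and v_xx > 0 as assumed below):

```python
import numpy as np

# Check of (18): the minimizer of
#   int ( 0.5*sigma^2 u^2 v_xx + rho*sigma*u*v_x + lam*ln pi(u) ) pi(u) du
# over densities is pi*(u) ∝ exp(-(0.5*sigma^2 u^2 v_xx + rho*sigma*u*v_x)/lam),
# i.e. the Gaussian N(-rho*v_x/(sigma*v_xx), lam/(sigma^2*v_xx)).
sigma, rho, lam, v_x, v_xx = 0.2, 0.3, 0.1, -1.2, 2.0   # illustrative values

mean = -rho * v_x / (sigma * v_xx)
var = lam / (sigma**2 * v_xx)

u = np.linspace(mean - 10 * np.sqrt(var), mean + 10 * np.sqrt(var), 200_001)
h = u[1] - u[0]
q = 0.5 * sigma**2 * u**2 * v_xx + rho * sigma * u * v_x   # quadratic "energy"
boltzmann = np.exp(-q / lam)
boltzmann /= np.sum(boltzmann) * h                          # normalize on the grid
gauss = np.exp(-(u - mean) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

assert np.max(np.abs(boltzmann - gauss)) < 1e-6

# The Gaussian attains a lower objective than a slightly shifted competitor.
def objective(p):
    return np.sum((q + lam * np.log(p)) * p) * h

shifted = np.exp(-(u - mean - 0.1) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)
assert objective(gauss) < objective(shifted)
```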
In the above representation, we have assumed that v_xx(t, x; w) > 0, which will be verified in what follows. Substituting the candidate optimal Gaussian feedback control policy (18) back into the HJB equation (16), the latter is transformed to

  v_t(t, x; w) − ρ² v_x(t, x; w)² / (2 v_xx(t, x; w)) + (λ/2)( 1 − ln( 2πeλ / (σ² v_xx(t, x; w)) ) ) = 0,   (19)

with v(T, x; w) = (x − w)² − (w − z)². A direct computation yields that this equation has a classical solution

  v(t, x; w) = (x − w)² e^{−ρ²(T−t)} + (λρ²/4)(T² − t²) − (λ/2)( ρ²T − ln(σ²/(πλ)) )(T − t) − (w − z)²,   (20)

which clearly satisfies v_xx(t, x; w) > 0 for any (t, x) ∈ [0, T] × ℝ.
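One can verify that (20) solves (19) by plugging in its closed-form partial derivatives; the following sketch does this at a few sample points, with illustrative parameter values.

```python
import numpy as np

# Plug the candidate value function (20) into the HJB equation (19) and check
# that the residual vanishes, using closed-form partial derivatives of
#   v(t,x;w) = (x-w)^2 e^{-rho^2 (T-t)} + (lam*rho^2/4)(T^2 - t^2)
#              - (lam/2)(rho^2 T - ln(sigma^2/(pi*lam)))(T-t) - (w-z)^2.
rho, sigma, lam, T, w, z = 0.3, 0.2, 0.1, 1.0, 2.0, 1.4   # illustrative values

def residual(t, x):
    f = np.exp(-rho**2 * (T - t))
    v_t = rho**2 * (x - w) ** 2 * f - lam * rho**2 / 2 * t \
          + lam / 2 * (rho**2 * T - np.log(sigma**2 / (np.pi * lam)))
    v_x = 2 * (x - w) * f
    v_xx = 2 * f
    return v_t - rho**2 * v_x**2 / (2 * v_xx) \
           + lam / 2 * (1 - np.log(2 * np.pi * np.e * lam / (sigma**2 * v_xx)))

for t in (0.0, 0.3, 0.9):
    for x in (-1.0, 0.5, 3.0):
        assert abs(residual(t, x)) < 1e-12
```

Note also that v_xx(t, x; w) = 2e^{−ρ²(T−t)} > 0, confirming the assumption made in deriving (18).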
It then follows that the candidate optimal feedback Gaussian control (18) reduces to

  π*(u; t, x, w) = N( u | −(ρ/σ)(x − w), (λ/(2σ²)) e^{ρ²(T−t)} ),  (t, x) ∈ [0, T] × ℝ.   (21)

Finally, the optimal wealth process (9) under π* becomes

  dX*_t = −ρ²(X*_t − w) dt + √( ρ²(X*_t − w)² + (λ/2) e^{ρ²(T−t)} ) dW_t, X*_0 = x_0.   (22)

It has a unique strong solution for 0 ≤ t ≤ T, as can be easily verified. We now summarize the above results in the following theorem.

Theorem 1
The optimal value function of the entropy-regularized exploratory MV problem (11) is given by

  V(t, x; w) = (x − w)² e^{−ρ²(T−t)} + (λρ²/4)(T² − t²) − (λ/2)( ρ²T − ln(σ²/(πλ)) )(T − t) − (w − z)²,   (23)

for (t, x) ∈ [0, T] × ℝ. Moreover, the optimal feedback control is Gaussian, with its density function given by

  π*(u; t, x, w) = N( u | −(ρ/σ)(x − w), (λ/(2σ²)) e^{ρ²(T−t)} ).   (24)

The associated optimal wealth process under π* is the unique strong solution of the SDE

  dX*_t = −ρ²(X*_t − w) dt + √( ρ²(X*_t − w)² + (λ/2) e^{ρ²(T−t)} ) dW_t, X*_0 = x_0.   (25)

Finally, the Lagrange multiplier w is given by w = (ze^{ρ²T} − x_0)/(e^{ρ²T} − 1).

Proof.
For each fixed w ∈ ℝ, the verification arguments aim to show that the optimal value function of problem (11) is given by (23) and that the candidate optimal policy (24) is indeed admissible. A detailed proof follows the same lines as that of Theorem 4 in Wang et al. (2019), and is left for interested readers.

We now determine the Lagrange multiplier w through the constraint E[X*_T] = z. It follows from (25), along with the standard estimate E[max_{t∈[0,T]} (X*_t)²] < ∞ and Fubini's theorem, that

  E[X*_t] = x_0 + E[ ∫_0^t −ρ²(X*_s − w) ds ] = x_0 + ∫_0^t −ρ²( E[X*_s] − w ) ds.

Hence, E[X*_t] = (x_0 − w)e^{−ρ²t} + w. The constraint E[X*_T] = z now becomes (x_0 − w)e^{−ρ²T} + w = z, which gives w = (ze^{ρ²T} − x_0)/(e^{ρ²T} − 1).

There are several interesting points to note in this result. First of all, it follows from Theorem 2 in the next section that the classical and the exploratory MV problems have the same Lagrange multiplier value, due to the fact that the optimal terminal wealths under the respective optimal feedback controls of the two problems turn out to have the same mean.⁷ This latter result is rather surprising at first sight, because the exploration greatly alters the underlying system dynamics (compare the dynamics (2) with (9)).

Second, the variance of the optimal Gaussian policy, which measures the level of exploration, is (λ/(2σ²)) e^{ρ²(T−t)} at time t. So the exploration decays in time: the agent initially engages in exploration at the maximum level, and reduces it gradually (although never to zero) as time passes and approaches the end of the investment horizon. Hence, different from its infinite horizon counterpart studied in Wang et al. (2019), the extent of exploration is no longer constant but, rather, annealing. This is intuitive because, as the RL agent learns more about the random environment as time passes, exploitation becomes more important, since there is a deadline T at which her actions will be evaluated.
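Both the decaying exploration variance in (24) and the mean computation in the proof can be checked numerically; the sketch below uses an Euler-Maruyama discretization of (25) with illustrative parameter values.

```python
import numpy as np

# Two checks on Theorem 1: (i) the exploration variance
# (lam/(2 sigma^2)) e^{rho^2 (T-t)} in (24) decays as t -> T; (ii) simulating
# (25) with w = (z e^{rho^2 T} - x0)/(e^{rho^2 T} - 1) recovers E[X*_T] = z.
rng = np.random.default_rng(2)
rho, sigma, lam, T, x0, z = 0.3, 0.2, 0.1, 1.0, 1.0, 1.4   # illustrative values
w = (z * np.exp(rho**2 * T) - x0) / (np.exp(rho**2 * T) - 1)

def explore_var(t):
    return lam / (2 * sigma**2) * np.exp(rho**2 * (T - t))

grid = np.linspace(0.0, T, 11)
assert all(explore_var(a) > explore_var(b) for a, b in zip(grid, grid[1:]))

n_paths, n_steps = 200_000, 200
dt = T / n_steps
X = np.full(n_paths, x0)
for k in range(n_steps):
    t = k * dt
    diff = np.sqrt(rho**2 * (X - w) ** 2 + 0.5 * lam * np.exp(rho**2 * (T - t)))
    X += -rho**2 * (X - w) * dt + diff * np.sqrt(dt) * rng.standard_normal(n_paths)

assert abs(X.mean() - z) < 0.02   # mean constraint E[X*_T] = z
```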
Naturally, exploitation dominates exploration as time approaches maturity. Theorem 1 presents such a decaying exploration scheme endogenously, which, to the best of our knowledge, has not been derived in the RL literature.

Third, as already noted in Wang et al. (2019), at any given t ∈ [0, T], the variance of the exploratory Gaussian distribution decreases as the volatility of the risky asset increases, with other parameters being fixed. The volatility of the risky asset reflects the level of randomness of the investment universe. This hints that a more random environment contains more learning opportunities, which the RL agent can leverage to reduce her own exploratory endeavor because, after all, exploration is costly.

Finally, the mean of the Gaussian distribution (24) is independent of the exploration weight λ, while its variance is independent of the state x. This highlights a perfect separation between exploitation and exploration, as the former is captured by the mean and the latter by the variance of the optimal Gaussian exploration. This property is also consistent with the LQ case in the infinite horizon studied in Wang et al. (2019).

⁷ Theorem 2 is a reproduction of the results on the classical MV problem obtained in Zhou and Li (2000).

3.2 Solvability equivalence between classical and exploratory MV problems

In this section, we establish the solvability equivalence between the classical and the exploratory, entropy-regularized MV problems. Note that both problems can be and indeed have been solved separately and independently. Here, by "solvability equivalence" we mean that the solution of one problem will lead to that of the other directly, without needing to solve it separately. This equivalence was first discovered in Wang et al. (2019) for the infinite horizon LQ case, and was shown to be instrumental in deriving the convergence result (when the exploration weight λ decays to 0) as well as in analyzing the exploration cost therein.
Here, the discussions are mostly parallel, so they will be brief.

Recall the classical MV problem (4). In order to apply dynamic programming, we again consider the set of admissible controls, A_cl(s, y), for (s, y) ∈ [0, T) × ℝ:

  A_cl(s, y) := { u = {u_t, t ∈ [s, T]} : u is F_t-progressively measurable and E[ ∫_s^T u_t² dt ] < ∞ }.

The (optimal) value function is defined by

  V_cl(s, y; w) := inf_{u ∈ A_cl(s,y)} E[ (x^u_T − w)² | x^u_s = y ] − (w − z)²,   (26)

for (s, y) ∈ [0, T) × ℝ, where w ∈ ℝ is fixed. Once this problem is solved, w can be determined by the constraint E[x*_T] = z, with {x*_t, t ∈ [0, T]} being the optimal wealth process under the optimal portfolio u*.

The HJB equation is

  ω_t(t, x; w) + min_{u ∈ ℝ} ( ½σ²u² ω_xx(t, x; w) + ρσu ω_x(t, x; w) ) = 0, (t, x) ∈ [0, T) × ℝ,   (27)

with the terminal condition ω(T, x; w) = (x − w)² − (w − z)². Standard verification arguments deduce the optimal value function to be

  V_cl(t, x; w) = (x − w)² e^{−ρ²(T−t)} − (w − z)²,

the optimal feedback control policy to be

  u*(t, x; w) = −(ρ/σ)(x − w),   (28)

and the corresponding optimal wealth process to be the unique strong solution to the SDE

  dx*_t = −ρ²(x*_t − w) dt − ρ(x*_t − w) dW_t, x*_0 = x_0.   (29)

Comparing the optimal wealth dynamics (25) and (29) of the exploratory and classical problems, we note that they have the same drift coefficient (but different diffusion coefficients). As a result, the two problems have the same mean of optimal terminal wealth, and hence the same value of the Lagrange multiplier w = (ze^{ρ²T} − x_0)/(e^{ρ²T} − 1) determined by the constraint E[x*_T] = z.

We now provide the solvability equivalence between the two problems. The proof is very similar to that of Theorem 7 in Wang et al. (2019), and is thus omitted.

Theorem 2
The following two statements, (a) and (b), are equivalent.

(a) The function

  v(t, x; w) = (x − w)² e^{−ρ²(T−t)} + (λρ²/4)(T² − t²) − (λ/2)( ρ²T − ln(σ²/(πλ)) )(T − t) − (w − z)², (t, x) ∈ [0, T] × ℝ,

is the optimal value function of the exploratory MV problem (11), and the corresponding optimal feedback control is

  π*(u; t, x, w) = N( u | −(ρ/σ)(x − w), (λ/(2σ²)) e^{ρ²(T−t)} ).

(b) The function ω(t, x; w) = (x − w)² e^{−ρ²(T−t)} − (w − z)², (t, x) ∈ [0, T] × ℝ, is the optimal value function of the classical MV problem (26), and the corresponding optimal feedback control is

  u*(t, x; w) = −(ρ/σ)(x − w).

Moreover, the two problems have the same Lagrange multiplier w = (ze^{ρ²T} − x_0)/(e^{ρ²T} − 1).

It is reasonable to expect that the exploratory problem converges to its classical counterpart as the exploration weight λ decreases to 0. The following result makes this precise.

Theorem 3
Assume that statement (a) (or equivalently, (b)) of Theorem 2 holds. Then, for each (t, x, w) ∈ [0, T] × ℝ × ℝ,

lim_{λ→0} π^*(·; t, x, w) = δ_{u^*(t,x;w)}(·) weakly.

Moreover, lim_{λ→0} |V(t, x; w) − V^cl(t, x; w)| = 0.

Proof. The weak convergence of the feedback controls follows from the explicit forms of π^* and u^* in statements (a) and (b). The pointwise convergence of the value functions follows easily from the forms of V(·) and V^cl(·), together with the fact that lim_{λ→0} (λ/2) ln(σ²/(πλ)) = 0.

Finally, we conclude this section by examining the cost of exploration. Such a cost was originally defined and derived in Wang et al. (2019) for the infinite-horizon setting. Here, the cost associated with the MV problem due to the explicit inclusion of exploration in the objective (11) is defined by

C^{u^*,π^*}(0, x_0; w) := ( V(0, x_0; w) − λ E[ ∫_0^T ∫_ℝ π^*_t(u) ln π^*_t(u) du dt | X^{π^*}_0 = x_0 ] ) − V^cl(0, x_0; w),    (30)

for x_0 ∈ ℝ, where π^* = {π^*_t, t ∈ [0, T]} is the (open-loop) optimal strategy generated by the optimal feedback law π^* with respect to the initial condition X^{π^*}_0 = x_0. This cost is the difference between the two optimal value functions, adjusted for the additional contribution due to the entropy of the optimal exploratory strategy.

Making use of Theorem 2, we have the following result.

Theorem 4
Assume that statement (a) (or equivalently, (b)) of Theorem 2 holds. Then, the exploration cost for the MV problem is

C^{u^*,π^*}(0, x_0; w) = λT/2,  x_0 ∈ ℝ, w ∈ ℝ.    (31)

Proof. Let {π^*_t, t ∈ [0, T]} be the open-loop control generated by the feedback control π^* given in statement (a) with respect to the initial state x_0 at t = 0, namely,

π^*_t(u) = N( u | −(ρ/σ)(X^*_t − w), (λ/(2σ²)) e^{ρ²(T−t)} ),

where {X^*_t, t ∈ [0, T]} is the corresponding optimal wealth process of the exploratory MV problem, starting from the state x_0 at t = 0, when π^* is applied. Then, we easily deduce that

∫_ℝ π^*_t(u) ln π^*_t(u) du = −(1/2) ln( (πeλ/σ²) e^{ρ²(T−t)} ).

The desired result then follows from the definition (30), together with the expressions of V(·) in (a) and V^cl(·) in (b).

The exploration cost depends only on two "agent-specific" parameters: the exploration weight λ > 0 and the investment horizon T > 0. Note that the latter is also the exploration horizon. Our result is intuitive in that the exploration cost increases with both the exploration weight and the exploration horizon; indeed, the dependence is linear in each of the two attributes, λ and T. It is also interesting to note that the cost is independent of the Lagrange multiplier. This suggests that the exploration cost does not increase when the agent is more aggressive (or risk-seeking), as reflected by the expected target z or, equivalently, by the Lagrange multiplier w. (In Wang et al. (2019), for the infinite-horizon LQ case, an analogous result is obtained, which states that the exploration cost is proportional to the exploration weight and inversely proportional to the discount factor. Clearly, here the length of the time horizon, T, plays a role similar to that of the inverse of the discount factor.)

Having laid the theoretical foundation in the previous two sections, we now design an RL algorithm to learn the solution of the entropy-regularized MV problem and to output implementable portfolio allocation strategies, without assuming any knowledge of the underlying parameters. To this end, we first establish a policy improvement theorem, together with a corresponding convergence result. We also provide a self-correcting scheme for learning the true Lagrange multiplier w, based on stochastic approximation. Our RL algorithm bypasses the phase of estimating any model parameters, including the mean return vector and the variance–covariance matrix. It also avoids inverting a typically ill-conditioned variance–covariance matrix in high dimensions, a step that would likely produce non-robust portfolio strategies.

In this paper, rather than relying on the typical framework of discrete-time MDPs (used for most RL problems) and discretizing time and space accordingly, we design an algorithm that learns the solution of the continuous-time exploratory MV problem (11) directly. Specifically, we adopt the approach developed in Doya (2000) to avoid discretization of the state dynamics or the HJB equation. As pointed out in Doya (2000), it is typically challenging to find the right granularity at which to discretize the state, action and time, and naive discretization may lead to poor performance.

Most RL algorithms consist of two iterative procedures: policy evaluation and policy improvement (Sutton and Barto (2018)). The former provides an estimated value function for the current policy, whereas the latter updates the current policy in the right direction so as to improve the value function. A policy improvement theorem (PIT) is therefore a crucial prerequisite for interpretable RL algorithms: it ensures that the iterated value functions are non-increasing (in the case of a minimization problem) and ultimately converge to the optimal value function; see, for example, Section 4.2 of Sutton and Barto (2018). PITs have been proved for discrete-time entropy-regularized RL problems in infinite horizon (Haarnoja et al. (2017)) and for continuous-time classical stochastic control problems (Jacka and Mijatović (2017)). The following result provides a PIT for our exploratory MV portfolio selection problem.
Theorem 5 (Policy Improvement Theorem)
Let w ∈ ℝ be fixed and let π = π(·; ·, ·, w) be an arbitrarily given admissible feedback control policy. Suppose that the corresponding value function V^π(·, ·; w) ∈ C^{1,2}([0, T) × ℝ) ∩ C([0, T] × ℝ) and satisfies V^π_xx(t, x; w) > 0 for any (t, x) ∈ [0, T) × ℝ. Suppose further that the feedback policy π̃ defined by

π̃(u; t, x, w) = N( u | −(ρ/σ) V^π_x(t, x; w)/V^π_xx(t, x; w), λ/(σ² V^π_xx(t, x; w)) )    (32)

is admissible. Then

V^π̃(t, x; w) ≤ V^π(t, x; w),  (t, x) ∈ [0, T] × ℝ.    (33)

(Interpretability is one of the most important and pressing issues in general artificial-intelligence applications in the financial industry due to, among other reasons, regulatory requirements. Jacka and Mijatović (2017) studied classical stochastic control problems with neither distributional controls nor entropy regularization; they did not consider RL and related issues, including exploration.)

Proof. Fix (t, x) ∈ [0, T] × ℝ. Since, by assumption, the feedback policy π̃ is admissible, the open-loop control strategy π̃ = {π̃_v, v ∈ [t, T]} generated from π̃ with respect to the initial condition X^π̃_t = x is admissible. Let {X^π̃_s, s ∈ [t, T]} be the corresponding wealth process under π̃. Applying Itô's formula, we have, for s ∈ [t, T],

V^π(s, X^π̃_s) = V^π(t, x) + ∫_t^s V^π_t(v, X^π̃_v) dv + ∫_t^s ∫_ℝ ( (σ²u²/2) V^π_xx(v, X^π̃_v) + ρσu V^π_x(v, X^π̃_v) ) π̃_v(u) du dv + ∫_t^s σ √(∫_ℝ u² π̃_v(u) du) V^π_x(v, X^π̃_v) dW_v.    (34)

Define the stopping times τ_n := inf{ s ≥ t : ∫_t^s σ² (∫_ℝ u² π̃_v(u) du) (V^π_x(v, X^π̃_v))² dv ≥ n }, for n ≥ 1. Then, from (34), we obtain

V^π(t, x) = E[ V^π(s ∧ τ_n, X^π̃_{s∧τ_n}) − ∫_t^{s∧τ_n} V^π_t(v, X^π̃_v) dv − ∫_t^{s∧τ_n} ∫_ℝ ( (σ²u²/2) V^π_xx(v, X^π̃_v) + ρσu V^π_x(v, X^π̃_v) ) π̃_v(u) du dv | X^π̃_t = x ].    (35)

On the other hand, by standard arguments and the assumption that V^π is smooth, we have

V^π_t(t, x) + ∫_ℝ ( (σ²u²/2) V^π_xx(t, x) + ρσu V^π_x(t, x) + λ ln π(u; t, x) ) π(u; t, x) du = 0,

for any (t, x) ∈ [0, T) × ℝ. It follows that

V^π_t(t, x) + min_{π′ ∈ P(ℝ)} ∫_ℝ ( (σ²u²/2) V^π_xx(t, x) + ρσu V^π_x(t, x) + λ ln π′(u) ) π′(u) du ≤ 0.    (36)

Notice that the minimizer of the Hamiltonian in (36) is given by the feedback policy π̃ in (32). It then follows that (35) implies

V^π(t, x) ≥ E[ V^π(s ∧ τ_n, X^π̃_{s∧τ_n}) + λ ∫_t^{s∧τ_n} ∫_ℝ π̃_v(u) ln π̃_v(u) du dv | X^π̃_t = x ],

for (t, x) ∈ [0, T] × ℝ and s ∈ [t, T]. Now, taking s = T and using V^π(T, x) = V^π̃(T, x) = (x − w)² − (w − z)², together with the assumption that π̃ is admissible, we obtain, by sending n → ∞ and applying the dominated convergence theorem,

V^π(t, x) ≥ E[ V^π̃(T, X^π̃_T) + λ ∫_t^T ∫_ℝ π̃_v(u) ln π̃_v(u) du dv | X^π̃_t = x ] = V^π̃(t, x),

for any (t, x) ∈ [0, T] × ℝ.

The above theorem shows that there are always policies in the Gaussian family that improve the value function of any given, not necessarily Gaussian, policy. Hence, without loss of generality, we can simply focus on Gaussian policies when choosing an initial solution. Moreover, the form of the optimal Gaussian policy (24) in Theorem 1 suggests that a candidate initial feedback policy may take the form π_0(u; t, x, w) = N(u | a(x − w), c_1 e^{c_2(T−t)}). It turns out that, theoretically, such a choice leads to the convergence of both the value functions and the policies in a finite number of iterations.
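The finite-step convergence just claimed can be checked numerically. The sketch below iterates the improvement map implied by (32) on the parameters (a, r) of a Gaussian policy N(a(x − w), c_1 e^{r(T−t)}), using the fact, from the computations in the proof of Theorem 6, that the value function of such a policy depends on x through (x − w)² e^{(2ρσa + σ²a²)(T−t)}. The numerical values of ρ and σ are hypothetical.

```python
# Toy check of the two-iteration convergence behind Theorem 6.  For a
# Gaussian policy N(a*(x-w), c1*exp(r*(T-t))), the value function has
# x-dependence (x-w)^2 * exp((2*rho*sigma*a + sigma^2*a^2)*(T-t)), so
# the improvement scheme (32) maps (a, r) to
#   a' = -rho/sigma,   r' = -(2*rho*sigma*a + sigma^2*a^2)
# (r does not enter, since the variance only affects the t-part of the
# value function).  Parameter values below are hypothetical.
rho, sigma = 0.4, 0.2

def improve(a, r):
    return -rho / sigma, -(2 * rho * sigma * a + sigma**2 * a**2)

a, r = 0.5, 1.0                 # an arbitrary admissible starting policy
for n in range(4):
    a, r = improve(a, r)

# After two improvements the policy is already the optimal one (mean
# coefficient -rho/sigma, variance exponent rho^2) and stays there.
assert abs(a + rho / sigma) < 1e-12 and abs(r - rho**2) < 1e-12
```

Note that the mean coefficient is corrected to −ρ/σ after a single improvement; the second improvement then fixes the variance exponent at ρ², matching the optimal Gaussian policy (24).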
Theorem 6
Let π_0(u; t, x, w) = N(u | a(x − w), c_1 e^{c_2(T−t)}), with a ∈ ℝ, c_2 ∈ ℝ and c_1 > 0. Denote by {π_n(u; t, x, w), (t, x) ∈ [0, T] × ℝ, n ≥ 1} the sequence of feedback policies updated by the policy improvement scheme (32), and by {V^{π_n}(t, x; w), (t, x) ∈ [0, T] × ℝ, n ≥ 1} the sequence of corresponding value functions. Then,

lim_{n→∞} π_n(·; t, x, w) = π^*(·; t, x, w) weakly,    (37)

and

lim_{n→∞} V^{π_n}(t, x; w) = V(t, x; w),    (38)

for any (t, x, w) ∈ [0, T] × ℝ × ℝ, where π^* and V are, respectively, the optimal Gaussian policy (24) and the optimal value function (23).

Proof. It can easily be verified that the feedback policy π_0 generates an open-loop policy that is admissible with respect to the initial (t, x). Moreover, it follows from the Feynman–Kac formula that the corresponding value function V^{π_0} satisfies the PDE

V^{π_0}_t(t, x; w) + ∫_ℝ ( (σ²u²/2) V^{π_0}_xx(t, x; w) + ρσu V^{π_0}_x(t, x; w) + λ ln π_0(u; t, x, w) ) π_0(u; t, x, w) du = 0,    (39)

with terminal condition V^{π_0}(T, x; w) = (x − w)² − (w − z)². Solving this equation, we obtain

V^{π_0}(t, x; w) = (x − w)² e^{(2ρσa + σ²a²)(T−t)} + ∫_t^T c_1 σ² e^{(2ρσa + σ²a² + c_2)(T−s)} ds − (λc_2/4)(T − t)² − (λ/2) ln(2πe c_1)(T − t) − (w − z)².    (40)

It is easy to check that V^{π_0} satisfies the conditions of Theorem 5, so that theorem applies. The improved policy is given by (32), which, in the current case, becomes

π_1(u; t, x, w) = N( u | −(ρ/σ)(x − w), (λ/(2σ²)) e^{−(2ρσa + σ²a²)(T−t)} ).

Again, we can calculate the corresponding value function as V^{π_1}(t, x; w) = (x − w)² e^{−ρ²(T−t)} + F_1(t), where F_1 is a function of t only. Theorem 5 applies once more and yields the improved policy π_2 as exactly the optimal Gaussian policy π^* given in (24), together with the optimal value function V in (23). The desired convergence therefore follows, since for n ≥ 2 neither the policy nor the value function strictly improves any further under the policy improvement scheme (32).

The above convergence result shows that if we choose the initial policy wisely, the learning scheme will, theoretically, converge after a finite number of iterations (two, in fact). When implementing this scheme in practice, of course, the value function of each policy can only be approximated and, hence, the learning process typically takes more iterations to converge. Nevertheless, Theorem 5 provides the theoretical foundation for updating a current policy, while Theorem 6 suggests a good starting point in the policy space. We will make use of both results in the next subsection to design an implementable RL algorithm for the exploratory MV problem.

4.2 The EMV algorithm
In this section, we present an RL algorithm, the EMV (exploratory mean–variance) algorithm, to solve (11). It consists of three concurrently running procedures: policy evaluation, policy improvement, and a self-correcting scheme for learning the Lagrange multiplier w based on stochastic approximation.

For the policy evaluation, we follow the method employed in Doya (2000) for learning the value function V^π under any arbitrarily given admissible feedback policy π. By Bellman's consistency, we have

V^π(t, x) = E[ V^π(s, X_s) + λ ∫_t^s ∫_ℝ π_v(u) ln π_v(u) du dv | X_t = x ],  s ∈ [t, T],    (41)

for (t, x) ∈ [0, T] × ℝ. Rearranging this equation and dividing both sides by s − t, we obtain

E[ (V^π(s, X_s) − V^π(t, X_t))/(s − t) + (λ/(s − t)) ∫_t^s ∫_ℝ π_v(u) ln π_v(u) du dv | X_t = x ] = 0.

Letting s → t gives rise to the continuous-time Bellman's error (or the temporal-difference (TD) error; see Doya (2000))

δ_t := V̇^π_t + λ ∫_ℝ π_t(u) ln π_t(u) du,    (42)

where V̇^π_t = (V^π(t + Δt, X_{t+Δt}) − V^π(t, X_t))/Δt is the total derivative and Δt is the discretization step of the learning algorithm.

The objective of the policy evaluation procedure is to minimize the Bellman's error δ_t. In general, this can be carried out as follows. Denote by V^θ and π^φ, respectively, the parametrized value function and policy (obtained using regressions or neural networks, or by exploiting some of the structure of the problem; see below), with θ, φ being the vectors of weights to be learned. We then minimize

C(θ, φ) = (1/2) E[ ∫_0^T |δ_t|² dt ] = (1/2) E[ ∫_0^T | V̇^θ_t + λ ∫_ℝ π^φ_t(u) ln π^φ_t(u) du |² dt ],

where π^φ = {π^φ_t, t ∈ [0, T]} is generated from π^φ with respect to a given initial state X_0 = x_0 at time 0.
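In discrete time, this objective is approximated by summing squared TD errors over a sampled trajectory. The following minimal sketch computes such a sampled objective for a generic parametrized value function; the toy value function, trajectory and entropy values below are hypothetical and serve only to show the mechanics.

```python
# A minimal sketch of the sampled Bellman-error objective described
# above: delta is the discretized TD error (42) along one sampled
# trajectory, and C is one half of the sum of squared errors times dt.
# V is any parametrized value-function candidate; H[i] is the entropy
# of the policy at time ts[i], so that lambda * int pi ln pi = -lam * H[i].

def bellman_error_objective(V, ts, xs, H, lam):
    dt = ts[1] - ts[0]                     # equal-length intervals assumed
    C = 0.0
    for i in range(len(ts) - 1):
        V_dot = (V(ts[i + 1], xs[i + 1]) - V(ts[i], xs[i])) / dt
        delta = V_dot - lam * H[i]         # TD error
        C += 0.5 * delta**2 * dt
    return C

# Example with a toy quadratic value function on a 3-point grid;
# C is about 1.86 for this toy data.
V = lambda t, x: (x - 1.0)**2 + 0.1 * t
C = bellman_error_objective(V, [0.0, 0.5, 1.0], [0.8, 0.9, 1.1],
                            H=[1.0, 1.0], lam=2.0)
```

Minimizing C over the parameters of V (and of the policy) is exactly the policy-evaluation step; the next paragraphs specialize this to the explicit parametrizations suggested by Theorems 1 and 6.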
To approximate C(θ, φ) in an implementable algorithm, we first discretize [0, T] into small equal-length intervals [t_i, t_{i+1}], i = 0, 1, …, l, where t_0 = 0 and t_{l+1} = T. We then collect a set of samples D = {(t_i, x_i), i = 0, 1, …, l + 1} in the following way. The initial sample is (0, x_0) for i = 0. At each t_i, i = 0, 1, …, l, we sample π^φ_{t_i} to obtain an allocation u_i ∈ ℝ in the risky asset, and then observe the wealth x_{i+1} at the next time instant t_{i+1}. We can now approximate C(θ, φ) by

C(θ, φ) = (1/2) Σ_{(t_i, x_i) ∈ D} ( V̇^θ(t_i, x_i) + λ ∫_ℝ π^φ_{t_i}(u) ln π^φ_{t_i}(u) du )² Δt.    (43)

Instead of following the common practice of representing V^θ and π^φ by (deep) neural networks for continuous RL problems, in this paper we take advantage of the more explicit parametric expressions obtained in Theorem 1 and Theorem 6. This leads to faster learning and convergence, as will be demonstrated in all the numerical experiments below. More precisely, by virtue of Theorem 6, we focus on Gaussian policies whose variance takes the form c_1 e^{c_2(T−t)}, which in turn leads to the entropy parametrization H(π^φ_t) = φ_1 + φ_2(T − t), where φ = (φ_1, φ_2)′, with φ_1 ∈ ℝ and φ_2 > 0, is the parameter vector to be learned.

On the other hand, as suggested by the theoretical optimal value function (23) in Theorem 1, we consider the parametrized V^θ, with θ = (θ_0, θ_1, θ_2, θ_3)′, given by

V^θ(t, x) = (x − w)² e^{−θ_3(T−t)} + θ_2 t² + θ_1 t + θ_0,  (t, x) ∈ [0, T] × ℝ.    (44)

From the policy improvement updating scheme (32), it follows that the variance of the policy π^φ_t is (λ/(2σ²)) e^{θ_3(T−t)}, resulting in the entropy (1/2) ln(πeλ/σ²) + (θ_3/2)(T − t). Equating this with the previously derived form H(π^φ_t) = φ_1 + φ_2(T − t), we deduce

σ² = λπ e^{1 − 2φ_1}  and  θ_3 = 2φ_2 = ρ².    (45)

The improved policy, in turn, becomes, according to (32),

π(u; t, x, w) = N( u | −(ρ/σ)(x − w), (λ/(2σ²)) e^{θ_3(T−t)} ) = N( u | −√(2φ_2/(λπ)) e^{φ_1 − 1/2} (x − w), (1/(2π)) e^{2φ_2(T−t) + 2φ_1 − 1} ),    (46)

where we have assumed that the true (unknown) Sharpe ratio satisfies ρ > 0. Substituting the entropy H(π^φ_t) = φ_1 + φ_2(T − t) into (43), we obtain

C(θ, φ) = (1/2) Σ_{(t_i, x_i) ∈ D} ( V̇^θ(t_i, x_i) − λ(φ_1 + φ_2(T − t_i)) )² Δt,

where V̇^θ(t_i, x_i) = (V^θ(t_{i+1}, x_{i+1}) − V^θ(t_i, x_i))/Δt, with θ_3 = 2φ_2 in the parametrization of V^θ(t_i, x_i). It is now straightforward to devise the updating rules for (θ_1, θ_2)′ and (φ_1, φ_2)′ using stochastic gradient descent algorithms (see, for example, Chapter 8 of Goodfellow et al. (2016)). Precisely, we compute

∂C/∂θ_1 = Σ_{(t_i, x_i) ∈ D} ( V̇^θ(t_i, x_i) − λ(φ_1 + φ_2(T − t_i)) ) Δt;    (47)

∂C/∂θ_2 = Σ_{(t_i, x_i) ∈ D} ( V̇^θ(t_i, x_i) − λ(φ_1 + φ_2(T − t_i)) ) (t²_{i+1} − t²_i);    (48)

∂C/∂φ_1 = −λ Σ_{(t_i, x_i) ∈ D} ( V̇^θ(t_i, x_i) − λ(φ_1 + φ_2(T − t_i)) ) Δt;    (49)

∂C/∂φ_2 = Σ_{(t_i, x_i) ∈ D} ( V̇^θ(t_i, x_i) − λ(φ_1 + φ_2(T − t_i)) ) Δt × ( ( −2(x_{i+1} − w)² e^{−2φ_2(T−t_{i+1})}(T − t_{i+1}) + 2(x_i − w)² e^{−2φ_2(T−t_i)}(T − t_i) ) / Δt − λ(T − t_i) ).    (50)

Moreover, the parameter θ_3 is updated by θ_3 = 2φ_2, and θ_0 is updated based on the terminal condition V^θ(T, x; w) = (x − w)² − (w − z)², which yields

θ_0 = −θ_2 T² − θ_1 T − (w − z)².    (51)

Finally, we provide a scheme for learning the underlying Lagrange multiplier w. Indeed, the constraint E[X_T] = z itself suggests the standard stochastic approximation update

w_{n+1} = w_n − α_n (X_T − z),    (52)

with α_n > 0, n ≥ 1, being the learning rate. In implementation, we replace X_T in (52) by a sample average (1/N) Σ_j x^j_T to obtain a more stable learning process (see, for example, Section 1.1 of Kushner and Yin (2003)), where N ≥ 1 and the x^j_T's are the most recent N terminal wealth values obtained by the time w is to be updated. It is interesting to notice that the learning scheme (52) for w is statistically self-correcting. For example, if the (sample average) terminal wealth is above the target z, the updating rule (52) will decrease w, which in turn decreases the mean of the exploratory Gaussian policy π in view of (46). This implies that there will be less risky allocation in the next step of actions for learning and optimizing, leading, on average, to a decreased terminal wealth.

We now summarize the pseudocode for the EMV algorithm.
Algorithm EMV: Exploratory Mean-Variance Portfolio Selection
Input: Market simulator Market; learning rates α, η_θ, η_φ; initial wealth x_0; target payoff z; investment horizon T; discretization Δt; exploration weight λ; number of iterations M; sample average size N.

Initialize θ, φ and w
for k = 1 to M do
    for i = 1 to ⌊T/Δt⌋ do
        Sample (t^k_i, x^k_i) from Market under π^φ
    end for
    Obtain the collected samples D = {(t^k_i, x^k_i), 1 ≤ i ≤ ⌊T/Δt⌋}
    Update θ ← θ − η_θ ∇_θ C(θ, φ) using (47) and (48)
    Update θ_0 using (51) and θ_3 ← 2φ_2
    Update φ ← φ − η_φ ∇_φ C(θ, φ) using (49) and (50)
    Update π^φ ← N( u | −√(2φ_2/(λπ)) e^{φ_1 − 1/2} (x − w), (1/(2π)) e^{2φ_2(T−t) + 2φ_1 − 1} )
    if k mod N == 0 then
        Update w ← w − α( (1/N) Σ_{j=k−N+1}^{k} x^j_{⌊T/Δt⌋} − z )
    end if
end for

In this section, we compare the performance of our RL algorithm, EMV, with those of two other methods that could be used to solve the classical MV problem (3). The first is traditional maximum likelihood estimation (MLE), which relies on real-time estimation of the drift µ and the volatility σ in the geometric Brownian motion price model (1). Once estimates of µ and σ are available from the most recent price time series, the portfolio allocation can be computed using the optimal allocation (28) of the classical MV problem. The second alternative is based on the deep deterministic policy gradient (DDPG) method developed by Lillicrap et al. (2016), described further below. Throughout this section, the EMV algorithm uses a constant exploration weight λ across all the episodes [0, T] in the learning process. In nearly all of the experiments, our EMV algorithm outperforms the other two methods by large margins. (The discussion of the self-correcting scheme above is based on the assumption that the market has a positive Sharpe ratio; recall (46). The case of a negative Sharpe ratio can be dealt with similarly.)

In the following, we briefly describe the two other methods.

Maximum likelihood estimation (MLE)
MLE is a popular method for estimating the parameters µ and σ of the geometric Brownian motion model (1). We refer interested readers to Section 9.3.2 of Campbell et al. (1997) for a detailed description of this method. At each decision-making time t_i, the MLE estimators of µ and σ are calculated based on the most recent 100 data points of the price. One can then substitute the MLE estimators into the optimal allocation (28) and the expression of the Lagrange multiplier w = (z e^{ρ²T} − x_0)/(e^{ρ²T} − 1) to compute the allocation u_i ∈ ℝ in the risky asset. Such a two-phase procedure is commonly used in adaptive control, where the first phase is identification and the second is optimization (see, for example, Chen and Guo (2012), Kumar and Varaiya (2015)). The real-time estimation procedure also allows MLE to be applied in non-stationary markets with time-varying µ and σ.

Deep deterministic policy gradient (DDPG)
The DDPG method has attracted significant attention since it was introduced in Lillicrap et al. (2016). It has been taken as a state-of-the-art baseline approach for continuous-control RL problems, albeit in discrete time. DDPG learns a deterministic target policy using deep neural networks for both the critic and the actor, with exogenous noise added to encourage exploration (e.g., using OU processes; see Lillicrap et al. (2016) for details). To adapt DDPG to the classical MV setting (without entropy regularization), we make the following adjustments. Since the target policy we aim to learn is a deterministic function of x − w (see (28)), we feed the samples x_i − w, rather than only x_i, to the actor network in order to output the allocation u_i ∈ ℝ. Here, w is the Lagrange multiplier learned up to the decision-making time t_i, obtained from the same self-correcting scheme (52). This modification also enables us to connect the current allocation with the previously obtained sample average of terminal wealth in a feedback loop, through the Lagrange multiplier w. Another modification from the original DDPG is that we include prioritized experience replay (Schaul et al. (2016)), rather than sampling experience uniformly from the replay buffer. We select terminal experiences with higher probability to train the critic and actor networks, to account for the fact that the MV problem has no running cost, but only a terminal cost given by (x_T − w)² − (w − z)² (cf. (4)). This modification significantly improves learning speed and performance.

We first perform numerical simulations in a stationary market environment, where the price process is simulated according to the geometric Brownian motion (1) with constant µ and σ. We take T = 1 and Δt = 1/252, indicating that the MV problem is considered over a one-year period, with daily rebalancing.
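The simulated environment underlying these experiments can be sketched as follows: prices follow the GBM (1), and the discounted wealth is advanced by an Euler step of dx^u_t = σu_t(ρ dt + dW_t), so that only the Sharpe ratio ρ enters the wealth update. The function below is our own illustrative implementation, not the paper's code; the default parameter values (with µ = −30% taken as one representative scenario) are assumptions for demonstration.

```python
# A minimal sketch of one trading episode of length T with daily
# rebalancing.  The wealth update discretizes dx_t = sigma*u_t*(rho*dt + dW_t),
# where u_t is the dollar amount held in the risky asset.  Function name
# and defaults are ours (hypothetical), not from the paper.
import random

def simulate_episode(policy, x0=1.0, mu=-0.3, sigma=0.1, r=0.02,
                     T=1.0, n_steps=252, seed=None):
    rng = random.Random(seed)
    dt = T / n_steps
    rho = (mu - r) / sigma          # Sharpe ratio of the risky asset
    x, path = x0, [(0.0, x0)]
    for i in range(n_steps):
        u = policy(i * dt, x)       # allocation sampled from the policy
        dW = rng.gauss(0.0, dt**0.5)
        x += sigma * u * (rho * dt + dW)
        path.append(((i + 1) * dt, x))
    return path

# Holding u = 0 (all in the riskless asset) keeps discounted wealth flat.
flat = simulate_episode(lambda t, x: 0.0, seed=1)
```

In the experiments, `policy` would sample from the exploratory Gaussian policy (46); the simulator only needs to return the sampled states (t_i, x_i) that form the data set D.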
Reasonable values of the annualized return and volatility are taken from the sets µ ∈ {−50%, −30%, −10%, 0%, 10%, 30%, 50%} and σ ∈ {10%, 20%, 30%, 40%}, respectively. These values are usually considered for a "typical" stock for simulation purposes (see, for example, Hutchinson et al. (1994)). The annualized interest rate is taken to be r = 2%. We consider the MV problem with a 40% annualized target return on the terminal wealth starting from a normalized initial wealth x_0 = 1 and, hence, z = 1.4. We take the number of iterations M = 20000 and the sample size N = 10 for learning the Lagrange multiplier w. The temperature parameter is λ = 2. Across all the simulations in this section, the learning rates are fixed, with α = 0.05 and a common small value for η_θ = η_φ. We keep the same M and N for the DDPG algorithm for a fair comparison; its critic network has 3 hidden layers, and the target networks are updated with a small soft-update parameter τ. We perform simulations for various µ's and σ's. For each method under each market scenario, we report the annualized sample mean M and sample variance V of the last 2000 terminal wealth values, and the corresponding annualized Sharpe ratio SR = (M − 1)/√V.

A few observations are in order. First of all, the EMV algorithm outperforms the other two methods by a large margin on several statistics of the investment outcome, including the sample mean, the sample variance and the Sharpe ratio. In fact, based on the comparison of Sharpe ratios, EMV outperforms MLE in all 28 experiments, and outperforms DDPG in 23 out of the 28. Notice that DDPG yields rather unstable performance across different market scenarios, with some of the sample average terminal wealth values below 0, indicating the occurrence of bankruptcy. The EMV algorithm, on the other hand, achieves a positive annualized return in all the experiments. The advantage of the EMV algorithm over the deep-learning DDPG algorithm is even more significant if we take into account the training time (all the experiments were performed on a MacBook Air laptop). Indeed, DDPG involves the extensive training of two deep neural networks, making it less appealing for high-frequency portfolio rebalancing and trading in practice.

Another advantage of EMV over DDPG is the ease of hyperparameter tuning. Recall that the learning rates are fixed, respectively for EMV and DDPG, across all the experiments for the different µ's and σ's. The performance of the EMV algorithm is little affected by the fixed learning rates, while that of the DDPG algorithm is. Indeed, DDPG has been noted for its notorious brittleness and hyperparameter sensitivity in practice (see, for example, Duan et al. (2016), Henderson et al. (2018)), issues also shared by other deep RL methods for continuous control problems.
Our EMV algorithm does not suffer from such issues, as a result of avoiding deep neural networks in its training and decision-making processes.

The MLE method, although free from any learning-rate tuning, needs to estimate the underlying parameters µ and σ. In all the simulations reported in Table 1, the estimated value of σ is relatively close to its true value, while the drift parameter µ cannot be estimated accurately. This is consistent with the well-documented mean-blur problem, which in turn leads to a higher variance of the terminal wealth (see Table 1) when one applies the estimated µ and σ to select the risky allocation (28).

Table 1: Comparison of the annualized sample mean (M), sample variance (V), Sharpe ratio (SR) and average training time (per experiment) for EMV, MLE and DDPG. Each cell reports M; V; SR.

Market scenario    | EMV                  | MLE                   | DDPG
µ = −50%, σ = 10%  | 1.396; 0.006; 5.107  | 1.556; 0.017; 4.284   | 1.297; 0.107; 0.908
µ = −30%, σ = 10%  | 1.390; 0.016; 3.039  | 1.215; 0.014; 1.833   | 1.401; 0.003; 7.076
µ = −10%, σ = 10%  | 1.330; 0.074; 1.218  | 1.056; 1.365; 0.482   | 0.901; 0.014; -0.833
µ = 0%, σ = 10%    | 1.204; 1.280; 0.180  | 0.926; 38.40; -0.012  | 1.029; 0.038; 0.147
µ = 10%, σ = 10%   | 1.318; 0.171; 0.769  | 1.009; 0.453; 0.014   | 0.951; 0.008; -0.541
µ = 30%, σ = 10%   | 1.385; 0.019; 2.785  | 1.179; 0.059; 0.737   | 0.224; 0.104; -2.405
µ = 50%, σ = 10%   | 1.394; 0.007; 4.772  | 1.459; 0.013; 3.983   | 1.478; 0.667; 0.717
µ = −50%, σ = 20%  | 1.387; 0.022; 2.606  | 1.310; 0.050; 1.387   | -0.551; 0.425; -2.379
µ = −30%, σ = 20%  | 1.359; 0.051; 1.598  | 1.205; 1.319; 0.178   | 1.973; 0.404; 1.531
µ = −10%, σ = 20%  | 1.309; 0.245; 0.625  | 1.036; 2.183; 0.024   | 1.349; 0.368; 0.575
µ = 0%, σ = 20%    | 1.105; 0.727; 0.123  | 0.921; 6.887; -0.301  | 0.988; 0.139; -0.033
µ = 10%, σ = 20%   | 1.221; 0.314; 0.395  | 1.045; 6.751; 0.017   | 1.243; 0.354; 0.408
µ = 30%, σ = 20%   | 1.345; 0.062; 1.387  | 1.155; 1.743; 0.117   | 1.360; 0.050; 1.613
µ = 50%, σ = 20%   | 1.385; 0.027; 2.350  | 1.237; 1.293; 0.208   | 1.385; 0.004; 6.496
µ = −50%, σ = 30%  | 1.353; 0.044; 1.682  | 1.333; 9.465; 0.108   | 0.272; 2.762; -0.438
µ = −30%, σ = 30%  | 1.323; 0.106; 0.992  | 1.092; 5.657; 0.039   | 0.034; 0.924; -1.005
µ = −10%, σ = 30%  | 1.317; 0.696; 0.380  | 1.045; 17.87; 0.011   | 1.371; 0.792; 0.417
µ = 0%, σ = 30%    | 1.079; 0.727; 0.092  | 0.955; 28.84; -0.008  | 1.070; 0.752; 0.081
µ = 10%, σ = 30%   | 1.282; 0.885; 0.300  | 0.885; 24.06; -0.023  | 1.243; 0.825; 0.268
µ = 30%, σ = 30%   | 1.334; 0.131; 0.921  | 0.886; 24.41; -0.023  | 1.210; 0.921; 0.218
µ = 50%, σ = 30%   | 1.350; 0.049; 1.583  | 1.238; 7.505; 0.087   | 0.610; 0.143; -1.030
µ = −50%, σ = 40%  | 1.342; 0.061; 1.385  | 1.284; 11.14; 0.085   | 1.328; 0.501; 0.463
µ = −30%, σ = 40%  | 1.320; 0.146; 0.839  | 1.145; 3.315; 0.080   | 1.212; 0.160; 0.531
µ = −10%, σ = 40%  | 1.241; 0.707; 0.287  | 0.979; 7.960; -0.007  | 1.335; 1.413; 0.282
µ = 0%, σ = 40%    | 1.057; 0.671; 0.070  | 0.950; 31.60; -0.009  | 1.064; 1.467; 0.053
µ = 10%, σ = 40%   | 1.155; 0.591; 0.202  | 1.053; 9.090; 0.017   | 1.242; 1.499; 0.198
µ = 30%, σ = 40%   | 1.320; 0.198; 0.716  | 1.083; 17.46; 0.020   | 0.179; 1.533; -0.663
µ = 50%, σ = 40%   | 1.329; 0.078; 1.174  | 0.963; 43.17; -0.006  | -0.390; 1.577; -1.107
Training time      | <                    | <                     | ≈

(If the hyperparameters are allowed to be tuned on a scenario-by-scenario basis, i.e., not for comparison purposes, the performance of EMV can be further improved, with the Sharpe ratio close to its theoretical maximum, whereas such improvement is more difficult to achieve for DDPG through hyperparameter tuning.)

Finally, we present the learning curves of the three methods in Figure 1 and Figure 2. We plot the changes of the sample mean and sample variance
of every (non-overlapping) 50 terminal wealth values, as the learning proceeds for each method. From these plots, we can see that the EMV algorithm converges faster than the other two methods, achieving relatively good performance even in the early phase of the learning process. This is also consistent with the convergence result in Theorem 6 and the remarks immediately following it in Section 4.1.
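For concreteness, the summary statistics reported in Table 1 can be computed from a batch of terminal wealth samples as follows, using the annualized Sharpe ratio SR = (M − 1)/√V for x_0 = 1 and T = 1. The helper function is ours, and the final check reproduces one EMV entry of Table 1 to within rounding.

```python
import math

def summarize(terminal_wealths):
    """Annualized sample mean M, sample variance V (unbiased), and
    Sharpe ratio SR = (M - 1)/sqrt(V), for x0 = 1 and T = 1 year."""
    n = len(terminal_wealths)
    M = sum(terminal_wealths) / n
    V = sum((x - M)**2 for x in terminal_wealths) / (n - 1)
    return M, V, (M - 1.0) / math.sqrt(V)

# Sanity check against the first EMV row of Table 1:
# M = 1.396 and V = 0.006 give SR = 0.396/sqrt(0.006) ~= 5.11,
# matching the reported 5.107 up to rounding of M and V.
sr = (1.396 - 1.0) / math.sqrt(0.006)
assert abs(sr - 5.107) < 1e-2
```

In the experiments this is applied to the last 2000 terminal wealth values of each run, as described above.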
Figure 1: Learning curves of sample means of terminal wealth (over every 50 iterations) for EMV, MLE and DDPG (µ < 0, σ = 10%). Note the log scale for the variance in Figure 2.
Figure 2: Learning curves of sample variances of terminal wealth (over every 50 iterations) for EMV, MLE and DDPG ($\mu = -30\%$, $\sigma = 10\%$).

When it comes to the application domains of RL, one of the major differences between quantitative finance and other domains (such as AlphaGo; see Silver et al. (2016)) is that, in the former, the underlying unknown investment environment is typically time-varying. In this subsection, we consider the performance of the previous three methods in the non-stationary scenario where the price process is modeled by a stochastic factor model. To have a well-defined learning problem, the stochastic factor needs to change at a much slower time-scale than that of the learning process. Specifically, we take the slow factor model within a multi-scale stochastic volatility model framework (see, for example, Fouque et al. (2003)). The price process follows
$$dS_t = S_t(\mu_t\,dt + \sigma_t\,dW_t), \quad 0 < t \leq T, \quad S_0 = s_0 > 0,$$
with $\mu_t, \sigma_t$, $t \in [0, T]$, being respectively the drift and volatility processes restricted to each simulation episode $[0, T]$ (so they may vary across different episodes). The controlled wealth dynamics over each episode $[0, T]$ is therefore
$$dx_t^u = \sigma_t u_t(\rho_t\,dt + dW_t), \quad 0 < t \leq T, \quad x_0^u = x_0 \in \mathbb{R}. \qquad (53)$$
For simulations in this subsection, we take
$$d\rho_t = \delta\,dt \quad \text{and} \quad d\sigma_t = \sigma_t\big(\delta\,dt + \sqrt{\delta}\,dW_t^1\big), \quad 0 < t \leq MT, \qquad (54)$$
with $\rho_0 \in \mathbb{R}$, $\sigma_0 > 0$, $\delta > 0$, and $d\langle W, W^1\rangle_t = \gamma\,dt$, with $|\gamma| < 1$. Notice that the terminal horizon for the stochastic factor model (54) is $MT$ (recall $M = 20000$ and $T = 1$ as in Subsection 5.1), indicating that the stochastic factors $\rho_t$ and $\sigma_t$ change across all the $M$ episodes. To make sure that their values stay in reasonable ranges after running for all $M$ episodes, we take $\delta$ to be small, and set $\rho_0 = -3$ and $\sigma_0 = 10\%$, corresponding to the case $\mu_0 = -30\%$ initially. We plot the learning curves in Figure 3 and Figure 4. Clearly, the EMV algorithm displays remarkable stability in its learning performance compared to the other two methods, even in this non-stationary market scenario.
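A minimal Euler–Maruyama sketch of the slow factor dynamics (54) is given below, assuming the reconstruction $d\rho_t = \delta\,dt$, $d\sigma_t = \sigma_t(\delta\,dt + \sqrt{\delta}\,dW_t^1)$ with correlation $\gamma$ between $dW$ and $dW^1$; the function name and step count are illustrative, not from the paper.

```python
import numpy as np

def simulate_slow_factors(rho0, sigma0, delta, gamma, horizon, n_steps, seed=1):
    """Euler-Maruyama path of the slow factor model (54) on [0, horizon]:
    d rho_t = delta dt and d sigma_t = sigma_t (delta dt + sqrt(delta) dW1_t),
    where dW1 is correlated with the wealth-equation noise dW (corr = gamma)."""
    rng = np.random.default_rng(seed)
    dt = horizon / n_steps
    # build correlated Brownian increments: dW1 = gamma dW + sqrt(1-gamma^2) dB
    dW = rng.normal(0.0, np.sqrt(dt), n_steps)
    dB = rng.normal(0.0, np.sqrt(dt), n_steps)
    dW1 = gamma * dW + np.sqrt(1.0 - gamma ** 2) * dB
    rho = rho0 + delta * dt * np.arange(1, n_steps + 1)   # deterministic drift
    sigma = sigma0 * np.cumprod(1.0 + delta * dt + np.sqrt(delta) * dW1)
    return rho, sigma
```

With a small $\delta$ the factors drift only slightly over the long horizon $MT$, which is the "reasonable ranges" requirement in the text.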
Figure 3: Learning curves of sample means of terminal wealth (over every 50 iterations) for EMV, MLE and DDPG in the non-stationary market scenario ($\mu = -30\%$, $\sigma = 10\%$, $\gamma = 0$; target $z = 1.4$).
Figure 4: Learning curves of sample variances of terminal wealth (over every 50 iterations) for EMV, MLE and DDPG in the non-stationary market scenario ($\mu = -30\%$, $\sigma = 10\%$, $\gamma = 0$).

A decaying exploration scheme is often desirable for RL, since exploitation ought to gradually take more weight over exploration as more learning iterations have been carried out. Within the current entropy-regularized relaxed stochastic control framework, we have been able to show, in Theorem 3, the convergence from the solution of the exploratory MV problem to that of the classical MV problem as the tradeoff parameter $\lambda \to 0$. Putting together Theorem 3 and Theorem 6, we can reasonably expect a further improvement of the EMV algorithm when it adopts a decaying $\lambda$ scheme, rather than a constant $\lambda$ as in the previous two sections. Our numerical simulation result reported in Figure 5 demonstrates slightly improved performance of the EMV algorithm with a specifically chosen $\lambda$ process that decreases across all the $M$ episodes, given by
$$\lambda_k = \lambda\left(1 - \exp\left(\frac{k - M}{M}\right)\right), \quad \text{for } k = 0, 1, 2, \ldots, M. \qquad (55)$$
It is important to distinguish between decaying during a given learning episode $[0, T]$ and decaying during the entire learning process (across different episodes). The former has been derived in Theorem 1 as the decay of the Gaussian variance in $t$. The latter refers to the notion that less exploration is needed in later learning episodes.
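The schedule (55) is partly garbled in this copy of the text; assuming the form $\lambda_k = \lambda(1 - e^{(k-M)/M})$, a sketch of the decaying-$\lambda$ schedule is:

```python
import math

def decaying_lambda(lam, M):
    """Exploration schedule across episodes k = 0, ..., M, assuming (55)
    takes the form lambda_k = lam * (1 - exp((k - M) / M)).
    It decreases monotonically from lam * (1 - e^{-1}) to exactly 0."""
    return [lam * (1.0 - math.exp((k - M) / M)) for k in range(M + 1)]
```

In the EMV updates, $\lambda_k$ would simply replace the constant $\lambda$ for episode $k$.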
We plot the histogram of the last 2000 values of terminal wealth generated by the EMV algorithm with the decaying $\lambda$ scheme starting from $\lambda = 2$ and by the original EMV algorithm with constant $\lambda = 2$, respectively. The Sharpe ratio increases from 3.039 under the constant $\lambda$ to a higher value under the decaying $\lambda$.

Figure 5: Histograms of terminal wealth for the EMV algorithm with constant $\lambda$ and decaying $\lambda$ ($\mu = -30\%$, $\sigma = 10\%$).

In this paper we have developed an RL framework for continuous-time MV portfolio selection, using the exploratory stochastic control formulation recently proposed and studied in Wang et al. (2019). By recasting the MV portfolio selection as an exploration/learning and exploitation/optimization problem, we are able to derive a data-driven solution, completely skipping any estimation of the unknown model parameters, which is a notoriously difficult, if not insurmountable, task in investment practice. The exploration part is explicitly captured by the relaxed stochastic control formulation and the resulting exploratory state dynamics, as well as by the entropy-regularized objective function of the new optimization problem. We prove that the feedback control distribution that optimally balances exploration and exploitation in the MV setting is a Gaussian distribution with a time-decaying variance. Similar to the case of general LQ problems in Wang et al. (2019), we establish the close connections between the exploratory MV and the classical MV problems, including the solvability equivalence and the convergence as exploration decays to zero.

The RL framework of the classical MV problem also allows us to design a competitive and interpretable RL algorithm, thanks in addition to a proved policy improvement theorem and the explicit functional structures for both the value function and the optimal control policy. The policy improvement theorem yields an updating scheme for the control policy that improves the objective value in each iteration. The explicit structures of the theoretically optimal solution to the exploratory MV problem suggest simple yet efficient function approximators, without having to resort to black-box approaches such as neural network approximations.
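The time-decaying-variance property of the optimal Gaussian policy can be illustrated with a toy sampler; the variance function below uses hypothetical constants ($\lambda e^{c(T-t)}$ with $c = 1$ and a linear mean coefficient), not the paper's exact coefficients, which depend on the market parameters.

```python
import numpy as np

def policy_variance(t, T, lam, c=1.0):
    """Illustrative exploration variance, decreasing in t and reaching
    its minimum lam at the terminal time T (c is a hypothetical constant)."""
    return lam * np.exp(c * (T - t))

def sample_action(t, x, T, lam, rng, mean_coef=-1.0):
    """Draw one action from a Gaussian feedback policy whose mean is linear
    in the wealth x (mean_coef is hypothetical) and whose variance decays
    over the episode: more exploration early on, less near T."""
    return rng.normal(mean_coef * x, np.sqrt(policy_variance(t, T, lam)))
```

This mirrors the structural conclusion above: exploration (the Gaussian variance) is front-loaded within each episode.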
The advantage of our method has been demonstrated by various numerical simulations in both stationary and non-stationary market environments, where our RL algorithm generally outperforms the other two approaches by large margins when solving the MV problem in the continuous control setting.

It should also be noted that the MV problem considered in this paper is almost model-free, in the sense that what is essentially needed for the underlying theory is only the LQ structure: the incremental change in wealth depends linearly on wealth and portfolio, and the objective function is quadratic in the terminal wealth. The former is a reasonable assumption so long as the incremental change in the risky asset price is linear in the price itself (including, but not limited to, the case when the price is lognormal and the non-stationary case studied in Section 5.2), and the latter is an intrinsic property of the MV formulation due to the variance term. Therefore, our algorithm is data-driven on the one hand, yet entirely interpretable on the other (in contrast to the black-box approach).

Interesting future research directions include an empirical study of our RL algorithm using real market data, and the high-dimensional control setting where allocations among multiple risky assets need to be determined at each decision time. For the latter, our exploratory MV formulation would remain intact, and a multivariate Gaussian distribution would be learned using an algorithm similar to Algorithm 1. It would be interesting to compare the performance of our algorithm with other existing methods, including the Fama-French model (Fama and French (1993)) and the distributionally robust method for the MV problem (Blanchet et al. (2018)). Another open question is to design an endogenous, "optimal" decaying scheme for the temperature parameter $\lambda$ as learning advances, an essential quantity that dictates the overall level of exploration and bridges the exploratory MV problem with the classical MV problem.
These questions are left for future investigation.

References
Robert Almgren and Neil Chriss. Optimal execution of portfolio transactions. Journal of Risk, 3:5-40, 2001.

Tomasz R Bielecki, Hanqing Jin, Stanley R Pliska, and Xun Yu Zhou. Continuous-time mean-variance portfolio selection with bankruptcy prohibition. Mathematical Finance, 15(2):213-244, 2005.

Jose Blanchet, Lin Chen, and Xun Yu Zhou. Distributionally robust mean-variance portfolio selection with Wasserstein distances. arXiv preprint arXiv:1802.04885, 2018.

John Y Campbell, Andrew W Lo, and A Craig MacKinlay. The Econometrics of Financial Markets. Princeton University Press, 1997.

Han-Fu Chen and Lei Guo. Identification and Stochastic Adaptive Control. Springer Science & Business Media, 2012.

Kenji Doya. Reinforcement learning in continuous time and space. Neural Computation, 12(1):219-245, 2000.

Yan Duan, Xi Chen, Rein Houthooft, John Schulman, and Pieter Abbeel. Benchmarking deep reinforcement learning for continuous control. In International Conference on Machine Learning, pages 1329-1338, 2016.

Darrell Duffie and Henry R Richardson. Mean-variance hedging in continuous time. The Annals of Applied Probability, 1(1):1-15, 1991.

Eugene F Fama and Kenneth R French. Common risk factors in the returns on stocks and bonds. Journal of Financial Economics, 33(1):3-56, 1993.

Jean-Pierre Fouque, George Papanicolaou, Ronnie Sircar, and Knut Solna. Multiscale stochastic volatility asymptotics. Multiscale Modeling & Simulation, 2(1):22-42, 2003.

Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, 2016.

Tuomas Haarnoja, Haoran Tang, Pieter Abbeel, and Sergey Levine. Reinforcement learning with deep energy-based policies. In Proceedings of the 34th International Conference on Machine Learning, pages 1352-1361, 2017.

Peter Henderson, Riashat Islam, Philip Bachman, Joelle Pineau, Doina Precup, and David Meger. Deep reinforcement learning that matters. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.

Dieter Hendricks and Diane Wilcox. A reinforcement learning extension to the Almgren-Chriss framework for optimal trade execution. In , pages 457-464. IEEE, 2014.

James M Hutchinson, Andrew W Lo, and Tomaso Poggio. A nonparametric approach to pricing and hedging derivative securities via learning networks. The Journal of Finance, 49(3):851-889, 1994.

Saul D Jacka and Aleksandar Mijatović. On the policy improvement algorithm in continuous time. Stochastics, 89(1):348-359, 2017.

Panqanamala Ramana Kumar and Pravin Varaiya. Stochastic Systems: Estimation, Identification, and Adaptive Control, volume 75. SIAM, 2015.

Harold Kushner and G George Yin. Stochastic Approximation and Recursive Algorithms and Applications, volume 35. Springer Science & Business Media, 2003.

Duan Li and Wan-Lung Ng. Optimal dynamic portfolio selection: Multiperiod mean-variance formulation. Mathematical Finance, 10(3):387-406, 2000.

Xun Li, Xun Yu Zhou, and Andrew EB Lim. Dynamic mean-variance portfolio selection with no-shorting constraints. SIAM Journal on Control and Optimization, 40(5):1540-1555, 2002.

Timothy Lillicrap, Jonathan Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. Continuous control with deep reinforcement learning. In International Conference on Learning Representations, 2016.

Andrew EB Lim and Xun Yu Zhou. Mean-variance portfolio selection with random parameters in a complete market. Mathematics of Operations Research, 27(1):101-120, 2002.

David G Luenberger. Investment Science. Oxford University Press, New York, 1998.

Shie Mannor and John N Tsitsiklis. Algorithmic aspects of mean-variance optimization in Markov decision processes. European Journal of Operational Research, 231(3):645-653, 2013.

Harry Markowitz. Portfolio selection. The Journal of Finance, 7(1):77-91, 1952.

Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, and Georg Ostrovski. Human-level control through deep reinforcement learning. Nature, 518(7540):529, 2015.

John Moody and Matthew Saffell. Learning to trade via direct reinforcement. IEEE Transactions on Neural Networks, 12(4):875-889, 2001.

John Moody, Lizhong Wu, Yuansong Liao, and Matthew Saffell. Performance functions and reinforcement learning for trading systems and portfolios. Journal of Forecasting, 17(5-6):441-470, 1998.

Rémi Munos. A study of reinforcement learning in the continuous case by the means of viscosity solutions. Machine Learning, 40(3):265-299, 2000.

Rémi Munos and Paul Bourgine. Reinforcement learning for continuous stochastic control problems. In Advances in Neural Information Processing Systems, pages 1029-1035, 1998.

Yuriy Nevmyvaka, Yi Feng, and Michael Kearns. Reinforcement learning for optimized trade execution. In Proceedings of the 23rd International Conference on Machine Learning, pages 673-680. ACM, 2006.

LA Prashanth and Mohammad Ghavamzadeh. Actor-critic algorithms for risk-sensitive MDPs. In Advances in Neural Information Processing Systems, pages 252-260, 2013.

LA Prashanth and Mohammad Ghavamzadeh. Variance-constrained actor-critic algorithms for discounted and average reward MDPs. Machine Learning, 105(3):367-417, 2016.

Makoto Sato and Shigenobu Kobayashi. Variance-penalized reinforcement learning for risk-averse asset allocation. In International Conference on Intelligent Data Engineering and Automated Learning, pages 244-249. Springer, 2000.

Makoto Sato, Hajime Kimura, and Shigenobu Kobayashi. TD algorithm for the variance of return and mean-variance reinforcement learning. Transactions of the Japanese Society for Artificial Intelligence, 16(3):353-362, 2001.

Tom Schaul, John Quan, Ioannis Antonoglou, and David Silver. Prioritized experience replay. In International Conference on Learning Representations, Puerto Rico, 2016.

David Silver, Aja Huang, Chris J Maddison, Arthur Guez, Laurent Sifre, George Van Den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, and Marc Lanctot. Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587):484, 2016.

Matthew J Sobel. The variance of discounted Markov decision processes. Journal of Applied Probability, 19(4):794-802, 1982.

Robert Henry Strotz. Myopia and inconsistency in dynamic utility maximization. The Review of Economic Studies, 23(3):165-180, 1955.

Richard S Sutton and Andrew G Barto. Reinforcement Learning: An Introduction. MIT Press, 2018.

Aviv Tamar and Shie Mannor. Variance adjusted actor critic algorithms. arXiv preprint arXiv:1310.3697, 2013.

Aviv Tamar, Dotan Di Castro, and Shie Mannor. Temporal difference methods for the variance of the reward to go. In ICML (3), pages 495-503, 2013.

Haoran Wang, Thaleia Zariphopoulou, and Xun Yu Zhou. Exploration versus exploitation in reinforcement learning: A stochastic control approach. arXiv preprint arXiv:1812.01552v3, 2019.

Xun Yu Zhou and Duan Li. Continuous-time mean-variance portfolio selection: A stochastic LQ framework. Applied Mathematics and Optimization, 42(1):19-33, 2000.

Xun Yu Zhou and George Yin. Markowitz's mean-variance portfolio selection with regime switching: A continuous-time model.