Discrete-time portfolio optimization under maximum drawdown constraint with partial information and deep learning resolution
Carmine DE FRANCO*, Johann NICOLLE†, Huyên PHAM‡

November 2, 2020
Abstract
We study a discrete-time portfolio selection problem with partial information and maximum drawdown constraint. Drift uncertainty in the multidimensional framework is modeled by a prior probability distribution. In this Bayesian framework, we derive the dynamic programming equation using an appropriate change of measure, and obtain semi-explicit results in the Gaussian case. The latter case, with a CRRA utility function, is completely solved numerically using recent deep learning techniques for stochastic optimal control problems. We emphasize the informative value of the learning strategy versus the non-learning one by providing empirical performance and sensitivity analysis with respect to the uncertainty of the drift. Furthermore, we show numerical evidence of the close relationship between the non-learning strategy and a no-short-sale constrained Merton problem, by illustrating the convergence of the former towards the latter as the maximum drawdown constraint vanishes.
This paper is devoted to the study of a constrained allocation problem in discrete time with partial information. We consider an investor who is willing to maximize the expected utility of her terminal wealth over a given investment horizon. The risk-averse investor is looking for the optimal portfolio in financial assets under a maximum drawdown constraint. The maximum drawdown is a common metric in finance and represents the largest drop in the portfolio value. Our framework incorporates this constraint by setting a threshold representing the proportion of the current maximum of the wealth process that the investor is willing to keep.

The expected rate of assets' return (drift) is unknown, but information can be learnt by progressive observation of the financial asset prices. The uncertainty about the rate of return is modeled by a probability distribution, i.e., a prior belief on the drift. To take into account the information conveyed by the prices, this prior will be updated using a Bayesian learning approach. An extensive literature exists on parameter uncertainty and especially on filtering and learning techniques in a partial information framework. To cite just a few, see Lakner (1998), Rogers (2001), Cvitanić et al. (2006), Karatzas and Zhao (2001), Bismuth et al. (2019), and De Franco, Nicolle, and Pham (2019a). Some articles deal with risk constraints in a portfolio allocation framework; see for instance the paper by Redeker and Wunderlich (2018), which tackles dynamic risk constraints and compares continuous and discrete time trading. Other papers especially focus on drawdown constraints; see in particular the seminal paper by Grossman and Zhou (1993).

* OSSIAM, E-mail: [email protected]
† OSSIAM and LPSM, Université de Paris, E-mail: [email protected]
‡ LPSM, Université de Paris, E-mail: [email protected]
The deep learning algorithm we use, Hybrid-Now, is particularly suited for solving stochastic control problems in high dimension using deep neural networks. Our main contribution to the literature is twofold: a detailed theoretical study of a discrete-time portfolio selection problem including both drift uncertainty and maximum drawdown constraint, and a numerical resolution using a deep learning approach for an application to a model of three risky assets, leading to a five-dimensional problem. We derive the dynamic programming equation (DPE), which is in general of infinite-dimensional nature, following the change of measure suggested in Elliott et al. (2008). In the Gaussian case, the DPE is reduced to a finite-dimensional equation by exploiting the Kalman filter. In the particular case of a constant relative risk aversion (CRRA) utility function, we further reduce the dimensionality of the problem. Then, we solve the problem numerically in the Gaussian case with CRRA utility functions using the deep learning Hybrid-Now algorithm. Such numerical results allow us to provide a detailed analysis of the performance and allocations of both the learning and non-learning strategies, benchmarked against a comparable equally-weighted strategy. Finally, we assess the performance of the learning strategy compared to the non-learning one with respect to the sensitivity to the uncertainty of the drift. Additionally, we provide empirical evidence of the convergence of the non-learning strategy to the solution of the classical Merton problem when the parameter controlling the maximum drawdown vanishes.

The paper is organized as follows: Section 2 sets up the financial market model and the associated optimization problem. Section 3 describes, in the general case, the change of measure and the Bayesian filtering, the derivation of the dynamic programming equation, and details some properties of the value function. Section 4 focuses on the Gaussian case. Finally, Section 5 presents the neural network techniques used, and shows the numerical results.
On a probability space $(\Omega, \mathcal F, \mathbb P)$ equipped with a discrete filtration $(\mathcal F_k)_{k=0,\dots,N}$ satisfying the usual conditions, we consider a financial market model with one riskless asset, assumed normalized to one, and $d$ risky assets. The price process $(S^i_k)_{k=0,\dots,N}$ of asset $i \in [[1,d]]$ is governed by the dynamics
$$S^i_{k+1} = S^i_k e^{R^i_{k+1}}, \quad k = 0,\dots,N-1, \qquad (1)$$
where $R_{k+1} = (R^1_{k+1},\dots,R^d_{k+1})$ is the vector of the assets' log-returns between time $k$ and $k+1$, modeled as
$$R_{k+1} = B + \epsilon_{k+1}. \qquad (2)$$
The drift vector $B$ is a $d$-dimensional random variable with probability distribution (prior) $\mu_0$ of known mean $b_0 = \mathbb E[B]$ and finite second-order moment. Note that the case of known drift $B$ means that $\mu_0$ is a Dirac distribution. The noise $\epsilon = (\epsilon_k)_k$ is a sequence of centered i.i.d. random vector variables with covariance matrix $\Gamma = \mathbb E[\epsilon_k \epsilon_k']$, assumed to be independent of $B$. We also make the fundamental assumption that the probability distribution $\nu$ of $\epsilon_k$ admits a strictly positive density function $g$ on $\mathbb R^d$ with respect to the Lebesgue measure.

The price process $S$ is observable, and notice by relation (1) that $R$ can be deduced from $S$, and vice-versa. We then denote by $\mathbb F^o = \{\mathcal F^o_k\}_{k=0,\dots,N}$ the observation filtration generated by the process $S$ (hence equivalently by $R$), augmented by the null sets of $\mathcal F$, with the convention that, for $k = 0$, $\mathcal F^o_0$ is the trivial $\sigma$-algebra.

An investment strategy is an $\mathbb F^o$-progressively measurable process $\alpha = (\alpha_k)_{k=0,\dots,N-1}$, valued in $\mathbb R^d$, representing the proportion of the current wealth invested in each of the $d$ risky assets at each time $k = 0,\dots,N-1$. Given an investment strategy $\alpha$ and an initial wealth $x_0 > 0$, the wealth process $X^\alpha$ evolves according to
$$X^\alpha_{k+1} = X^\alpha_k \big(1 + \alpha_k'(e^{R_{k+1}} - \mathbf 1_d)\big), \quad k = 0,\dots,N-1, \qquad X^\alpha_0 = x_0, \qquad (3)$$
where $e^{R_{k+1}}$ is the $d$-dimensional random variable with components $[e^{R_{k+1}}]^i = e^{R^i_{k+1}}$ for $i \in [[1,d]]$, and $\mathbf 1_d$ is the vector in $\mathbb R^d$ with all components equal to 1.

Let us introduce the process $Z^\alpha_k$, the maximum up to time $k$ of the wealth process $X^\alpha$, i.e.,
$$Z^\alpha_k := \max_{0 \le \ell \le k} X^\alpha_\ell, \quad k = 0,\dots,N.$$
The maximum drawdown constrains the wealth $X^\alpha_k$ to remain above a fraction $q \in (0,1)$ of the current historical maximum $Z^\alpha_k$. We then define the set of admissible investment strategies $\mathcal A^q$ as the set of investment strategies $\alpha$ such that
$$X^\alpha_k \ge q Z^\alpha_k, \quad \text{a.s.}, \quad k = 0,\dots,N.$$
In this framework, the portfolio selection problem is formulated as
$$V_0 := \sup_{\alpha \in \mathcal A^q} \mathbb E[U(X^\alpha_N)], \qquad (4)$$
where $U$ is a utility function on $(0,\infty)$ satisfying the standard Inada conditions: continuously differentiable, strictly increasing, concave on $(0,\infty)$, with $U'(0) = \infty$ and $U'(\infty) = 0$.

In this section, we show how Problem (4) can be characterized from dynamic programming in terms of a backward system of equations amenable to algorithms. In a first step, we will update the prior on the drift uncertainty and take advantage of the newest available information by adopting a Bayesian filtering approach. This relies on a suitable change of probability measure.
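The wealth dynamics (3), the running maximum and the drawdown constraint can be sketched in a short simulation. All numerical values below (drift, noise level, constant weights, threshold $q$) are illustrative assumptions, not the paper's calibrated parameters:

```python
import numpy as np

def simulate_wealth(alpha, R, x0=1.0):
    """Roll equation (3) forward: X_{k+1} = X_k (1 + alpha_k'(e^{R_{k+1}} - 1)),
    tracking the running maximum Z_k = max_{l<=k} X_l."""
    N, d = R.shape
    X = np.empty(N + 1)
    Z = np.empty(N + 1)
    X[0] = Z[0] = x0
    for k in range(N):
        X[k + 1] = X[k] * (1.0 + alpha[k] @ (np.exp(R[k]) - 1.0))
        Z[k + 1] = max(Z[k], X[k + 1])
    return X, Z

def satisfies_drawdown(X, Z, q):
    """Admissibility in A^q: X_k >= q Z_k for all k."""
    return bool(np.all(X >= q * Z - 1e-12))

rng = np.random.default_rng(0)
N, d, q = 24, 3, 0.7                              # illustrative horizon, assets, threshold
R = 0.002 + 0.01 * rng.standard_normal((N, d))    # hypothetical log-returns R = B + eps
alpha = np.full((N, d), 0.05)                     # small constant weights in each asset
X, Z = simulate_wealth(alpha, R)
```

With such small weights the total exposure stays well inside the budget of Lemma 3.3, so the drawdown constraint holds along the whole path.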
We start by introducing a change of measure under which $R_1,\dots,R_N$ are mutually independent, identically distributed random variables, independent from the drift $B$, hence behaving like a noise. Following the methodology detailed in Elliott et al. (2008), we define the $\sigma$-algebras
$$\mathcal G_k := \sigma(B, R_1, \dots, R_k), \quad k = 0,\dots,N,$$
and $\mathbb G = (\mathcal G_k)_k$ the corresponding complete filtration. We then define a new probability measure $\bar{\mathbb P}$ on $(\Omega, \bigvee_{k=1}^N \mathcal G_k)$ by
$$\frac{d\bar{\mathbb P}}{d\mathbb P}\Big|_{\mathcal G_k} := \Lambda_k, \quad k = 0,\dots,N, \qquad \Lambda_k := \prod_{\ell=1}^k \frac{g(R_\ell)}{g(\epsilon_\ell)}, \quad k = 1,\dots,N, \qquad \Lambda_0 = 1.$$
The existence of $\bar{\mathbb P}$ comes from Kolmogorov's theorem, since $\Lambda_k$ is a strictly positive martingale with expectation equal to one. Indeed, for all $k = 1,\dots,N$:
• $\Lambda_k > 0$, since $g$ is strictly positive,
• $\Lambda_k$ is $\mathcal G_k$-adapted,
• as $\epsilon_k \perp\!\!\!\perp \mathcal G_{k-1}$, we have
$$\mathbb E[\Lambda_k \,|\, \mathcal G_{k-1}] = \Lambda_{k-1}\, \mathbb E\Big[\frac{g(B + \epsilon_k)}{g(\epsilon_k)} \Big|\, \mathcal G_{k-1}\Big] = \Lambda_{k-1} \int_{\mathbb R^d} \frac{g(B + e)}{g(e)}\, g(e)\, de = \Lambda_{k-1} \int_{\mathbb R^d} g(z)\, dz = \Lambda_{k-1}.$$

Proposition 3.1.
Under $\bar{\mathbb P}$, $(R_k)_{k=1,\dots,N}$ is a sequence of i.i.d. random variables, independent from $B$, having the same probability distribution $\nu$ as $\epsilon_k$.

Proof.
See Appendix 6.1. □
Conversely, we recover the initial measure $\mathbb P$, under which $(\epsilon_k)_{k=1,\dots,N}$ is a sequence of independent and identically distributed random variables with probability density function $g$, where $\epsilon_k = R_k - B$. Denoting by $\bar\Lambda_k$ the Radon-Nikodym derivative $d\mathbb P / d\bar{\mathbb P}$ restricted to the $\sigma$-algebra $\mathcal G_k$:
$$\frac{d\mathbb P}{d\bar{\mathbb P}}\Big|_{\mathcal G_k} = \bar\Lambda_k,$$
we have
$$\bar\Lambda_k = \prod_{i=1}^k \frac{g(R_i - B)}{g(R_i)}.$$
It is clear that, under $\mathbb P$, the return and wealth processes have the form stated in equations (2) and (3). Moreover, from the Bayes formula, the posterior distribution of the drift, i.e., the conditional law of $B$ given the asset price observations, is
$$\mu_k(db) := \mathbb P\big[B \in db \,|\, \mathcal F^o_k\big] = \frac{\pi_k(db)}{\pi_k(\mathbb R^d)}, \quad k = 0,\dots,N, \qquad (5)$$
where $\pi_k$ is the so-called unnormalized conditional law
$$\pi_k(db) := \bar{\mathbb E}\big[\bar\Lambda_k \mathbf 1_{\{B \in db\}} \,|\, \mathcal F^o_k\big], \quad k = 0,\dots,N.$$
We then have the following key linear recurrence relation on the unnormalized conditional law.
Proposition 3.2.
We have the recursive linear relation
$$\pi_\ell = \bar g(R_\ell - \cdot)\, \pi_{\ell-1}, \quad \ell = 1,\dots,N, \qquad (6)$$
with initial condition $\pi_0 = \mu_0$, where
$$\bar g(R_\ell - b) = \frac{g(R_\ell - b)}{g(R_\ell)}, \quad b \in \mathbb R^d,$$
and we recall that $g$ is the probability density function of the identically distributed $\epsilon_k$ under $\mathbb P$.

Proof.
See Appendix 6.2. □

3.2 The static set of admissible controls

In this subsection, we derive some useful characteristics of the space of controls, which will turn out to be crucial in the derivation of the dynamic programming system. Given a time $k \in [[0,N]]$, a current wealth $x = X^\alpha_k > 0$, and a current maximum wealth $z = Z^\alpha_k \ge x$ satisfying the drawdown constraint $qz \le x$ at time $k$ for an admissible investment strategy $\alpha \in \mathcal A^q$, we denote by $A^q_k(x,z) \subset \mathbb R^d$ the set of static controls $a = \alpha_k$ such that the drawdown constraint is satisfied at the next time $k+1$, i.e., $X^\alpha_{k+1} \ge q Z^\alpha_{k+1}$. From relation (3), and noting that $Z^\alpha_{k+1} = \max[Z^\alpha_k, X^\alpha_{k+1}]$, this yields
$$A^q_k(x,z) = \Big\{ a \in \mathbb R^d : 1 + a'(e^{R_{k+1}} - \mathbf 1_d) \ge q \max\Big[\frac{z}{x},\, 1 + a'(e^{R_{k+1}} - \mathbf 1_d)\Big] \ \text{a.s.} \Big\}. \qquad (7)$$
Recalling from Proposition 3.1 that the random variables $R_1,\dots,R_N$ are i.i.d. under $\bar{\mathbb P}$, we notice that the set $A^q_k(x,z)$ does not depend on the current time $k$; we will therefore drop the subscript $k$ in the sequel, and simply write $A^q(x,z)$. Remembering that the support of $\nu$, the probability distribution of $\epsilon_k$, is $\mathbb R^d$, the following lemma characterizes more precisely the set $A^q(x,z)$.

Lemma 3.3.
For any $(x,z) \in S^q := \big\{(x,z) \in (0,\infty)^2 : qz \le x \le z\big\}$, we have
$$A^q(x,z) = \Big\{ a \in \mathbb R^d_+ : |a| \le 1 - q\,\frac{z}{x} \Big\},$$
where $|a| = \sum_{i=1}^d |a_i|$ for $a = (a_1,\dots,a_d) \in \mathbb R^d_+$.

Proof.
See Appendix 6.3. □
Let us prove some properties of the admissible set $A^q(x,z)$.

Lemma 3.4.
For any $(x,z) \in S^q$, the set $A^q(x,z)$ satisfies the following properties:
1. It is decreasing in $q$: for all $q_1 \le q_2$, $A^{q_2}(x,z) \subseteq A^{q_1}(x,z)$.
2. It is continuous in $q$.
3. It is increasing in $x$: for all $x_1 \le x_2$, $A^q(x_1,z) \subseteq A^q(x_2,z)$.
4. It is a convex set.
5. It is homogeneous: $a \in A^q(x,z) \Leftrightarrow a \in A^q(\lambda x, \lambda z)$, for any $\lambda > 0$.

Proof.
See Appendix 6.4. □
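The characterization of Lemma 3.3 and the homogeneity property of Lemma 3.4 are easy to check numerically; the weights and parameter values below are illustrative:

```python
import numpy as np

def in_A_q(a, x, z, q):
    """Membership test from Lemma 3.3: a is admissible iff a >= 0 componentwise
    and |a|_1 <= 1 - q z / x (no short-selling, bounded gross exposure)."""
    a = np.asarray(a, dtype=float)
    return bool(np.all(a >= 0) and a.sum() <= 1.0 - q * z / x)

# Wealth at its historical maximum (x = z = 1), threshold q = 0.7, budget 0.3.
assert in_A_q([0.1, 0.1, 0.05], x=1.0, z=1.0, q=0.7)       # |a| = 0.25 < 0.3
assert not in_A_q([0.2, 0.2, 0.0], x=1.0, z=1.0, q=0.7)    # |a| = 0.4 > 0.3
assert not in_A_q([-0.1, 0.3, 0.0], x=1.0, z=1.0, q=0.7)   # short position excluded
assert not in_A_q([0.05, 0.0, 0.0], x=0.7, z=1.0, q=0.7)   # at the floor: only a = 0
```

The last check illustrates that, once the wealth touches the floor $qz$, the only admissible control is the zero allocation, so the portfolio is frozen in the riskless asset.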
The change of probability detailed in Subsection 3.1 allows us to turn the initial partial information Problem (4) into a full observation problem, as
$$V_0 := \sup_{\alpha \in \mathcal A^q} \mathbb E[U(X^\alpha_N)] = \sup_{\alpha \in \mathcal A^q} \bar{\mathbb E}[\bar\Lambda_N U(X^\alpha_N)] = \sup_{\alpha \in \mathcal A^q} \bar{\mathbb E}\big[\bar{\mathbb E}[\bar\Lambda_N U(X^\alpha_N) \,|\, \mathcal F^o_N]\big] = \sup_{\alpha \in \mathcal A^q} \bar{\mathbb E}\big[U(X^\alpha_N)\, \pi_N(\mathbb R^d)\big], \qquad (8)$$
from the Bayes formula, the law of conditional expectations, and the definition of the unnormalized filter $\pi_N$, valued in $\mathcal M_+$, the set of nonnegative measures on $\mathbb R^d$. In view of Equation (3), Proposition 3.1, and Proposition 3.2, we then introduce the dynamic value function associated to Problem (8) as
$$v_k(x,z,\mu) = \sup_{\alpha \in \mathcal A^q_k(x,z)} J_k(x,z,\mu,\alpha), \quad k \in [[0,N]],\ (x,z) \in S^q,\ \mu \in \mathcal M_+,$$
with
$$J_k(x,z,\mu,\alpha) = \bar{\mathbb E}\big[U\big(X^{k,x,\alpha}_N\big)\, \pi^{k,\mu}_N(\mathbb R^d)\big],$$
where $X^{k,x,\alpha}$ is the solution to Equation (3) on $[[k,N]]$, starting at $X^{k,x,\alpha}_k = x$ at time $k$, controlled by $\alpha \in \mathcal A^q_k(x,z)$, and $(\pi^{k,\mu}_\ell)_{\ell=k,\dots,N}$ is the solution to (6) on $\mathcal M_+$, starting from $\pi^{k,\mu}_k = \mu$, so that $V_0 = v_0(x_0, x_0, \mu_0)$. Here, $\mathcal A^q_k(x,z)$ is the set of admissible investment strategies embedding the drawdown constraint: $X^{k,x,\alpha}_\ell \ge q Z^{k,x,z,\alpha}_\ell$, $\ell = k,\dots,N$, where the maximum wealth process $Z^{k,x,z,\alpha}$ follows the dynamics $Z^{k,x,z,\alpha}_{\ell+1} = \max[Z^{k,x,z,\alpha}_\ell, X^{k,x,\alpha}_{\ell+1}]$, $\ell = k,\dots,N-1$, starting from $Z^{k,x,z,\alpha}_k = z$ at time $k$. The dependence of the value function upon the unnormalized filter $\mu$ means that the probability distribution of the drift is updated at each time step by Bayesian learning from the observed asset prices.

The dynamic programming equation associated to (8) is then written in backward induction as
$$v_N(x,z,\mu) = U(x)\, \mu(\mathbb R^d), \qquad v_k(x,z,\mu) = \sup_{\alpha \in \mathcal A^q_k(x,z)} \bar{\mathbb E}\big[v_{k+1}\big(X^{k,x,\alpha}_{k+1}, Z^{k,x,z,\alpha}_{k+1}, \pi^{k,\mu}_{k+1}\big)\big], \quad k = 0,\dots,N-1.$$
Recalling Proposition 3.2 and Lemma 3.3, this dynamic programming system is written more explicitly as
$$v_N(x,z,\mu) = U(x)\, \mu(\mathbb R^d), \quad (x,z) \in S^q,\ \mu \in \mathcal M_+,$$
$$v_k(x,z,\mu) = \sup_{a \in A^q(x,z)} \bar{\mathbb E}\Big[v_{k+1}\Big(x\big(1 + a'(e^{R_{k+1}} - \mathbf 1_d)\big),\ \max\big[z,\, x\big(1 + a'(e^{R_{k+1}} - \mathbf 1_d)\big)\big],\ \bar g(R_{k+1} - \cdot)\,\mu\Big)\Big], \qquad (9)$$
for $k = 0,\dots,N-1$. Notice from Proposition 3.1 that the expectation in the above formula is only taken with respect to the noise $R_{k+1}$, which is distributed under $\bar{\mathbb P}$ according to the probability distribution $\nu$ with density $g$ on $\mathbb R^d$.

In the case where the utility function is of CRRA (constant relative risk aversion) type, i.e.,
$$U(x) = \frac{x^p}{p}, \quad x > 0, \quad \text{for some } 0 < p < 1, \qquad (10)$$
one can reduce the dimensionality of the problem. For this purpose, we introduce the process $\rho = (\rho_k)_k$ defined as the ratio of the wealth over its running maximum:
$$\rho^\alpha_k = \frac{X^\alpha_k}{Z^\alpha_k}, \quad k = 0,\dots,N.$$
The ratio process $\rho^\alpha$ is valued in $[q,1]$ due to the maximum drawdown constraint. Moreover, recalling (3), observing that $Z^\alpha_{k+1} = \max[Z^\alpha_k, X^\alpha_{k+1}]$, together with the fact that $\max[z,x]^{-1} = \min[z^{-1}, x^{-1}]$, we notice that the ratio process $\rho^\alpha$ can be written in inductive form as
$$\rho^\alpha_{k+1} = \min\Big[1,\, \rho^\alpha_k\big(1 + \alpha_k'(e^{R_{k+1}} - \mathbf 1_d)\big)\Big], \quad k = 0,\dots,N-1.$$
The following result states that the value function inherits the homogeneity property of the utility function.
Lemma 3.5.
For a utility function $U$ as in (10), we have, for all $(x,z) \in S^q$, $\mu \in \mathcal M_+$, $k \in [[0,N]]$,
$$v_k(\lambda x, \lambda z, \mu) = \lambda^p\, v_k(x,z,\mu), \quad \lambda > 0.$$

Proof.
See Appendix 6.5. □
In view of the above lemma, we consider the sequence of functions $w_k$, $k \in [[0,N]]$, defined by
$$w_k(r,\mu) = v_k(r,1,\mu), \quad r \in [q,1],\ \mu \in \mathcal M_+,$$
so that $v_k(x,z,\mu) = z^p\, w_k\big(\tfrac{x}{z}, \mu\big)$, and we call $w_k$ the reduced value function. From the dynamic programming system satisfied by $v_k$, we immediately obtain the backward system for $(w_k)_k$:
$$w_N(r,\mu) = \frac{r^p}{p}\, \mu(\mathbb R^d), \quad r \in [q,1],\ \mu \in \mathcal M_+,$$
$$w_k(r,\mu) = \sup_{a \in A^q(r)} \bar{\mathbb E}\Big[w_{k+1}\Big(\min\big[1,\, r\big(1 + a'(e^{R_{k+1}} - \mathbf 1_d)\big)\big],\ \bar g(R_{k+1} - \cdot)\,\mu\Big)\Big], \qquad (11)$$
for $k = 0,\dots,N-1$, where
$$A^q(r) = \Big\{ a \in \mathbb R^d_+ : a'\mathbf 1_d \le 1 - \frac{q}{r} \Big\}.$$
We end this section by stating some properties of the reduced value function.
Lemma 3.6.
For any $k \in [[0,N]]$, the reduced value function $w_k$ is nondecreasing and concave in $r \in [q,1]$.

Proof.
See proof in Appendix 6.6. □
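Before specializing to the Gaussian case, the backward system (11) can be illustrated in its simplest finite-dimensional instance: a Dirac prior (known drift, so the filter argument drops out) and a single asset. The sketch below, with purely illustrative parameters, solves it on a grid in $r$ with a Monte Carlo expectation and a brute-force search over the admissible interval $A^q(r) = [0, 1 - q/r]$:

```python
import numpy as np

rng = np.random.default_rng(1)
N, q, p = 6, 0.7, 0.5                  # illustrative horizon, drawdown level, CRRA power
b, sigma = 0.03, 0.1                   # known drift and noise std (Dirac prior on B)
r_grid = np.linspace(q, 1.0, 41)       # grid for the ratio r = x / z
growth = np.exp(b + sigma * rng.standard_normal(10_000)) - 1.0   # samples of e^R - 1

w = r_grid**p / p                      # terminal condition w_N(r) = r^p / p
for k in range(N - 1, -1, -1):
    w_next = w.copy()
    for i, r in enumerate(r_grid):
        best = -np.inf
        for a in np.linspace(0.0, 1.0 - q / r, 21):   # admissible actions A^q(r)
            # ratio update: r' = min(1, r (1 + a (e^R - 1))), stays in (q, 1]
            r_new = np.minimum(1.0, r * (1.0 + a * growth))
            best = max(best, np.interp(r_new, r_grid, w_next).mean())
        w[i] = best                    # w_k(r) = sup_a E[ w_{k+1}(r') ]
```

Since $a = 0$ is always admissible, the computed $w_k$ dominates the terminal condition, and it is nondecreasing in $r$ up to the grid and Monte Carlo error, consistently with Lemma 3.6.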
We consider in this section the Gaussian framework, where the noise and the prior belief on the drift are modeled by Gaussian distributions. In this special case, Bayesian filtering reduces to Kalman filtering, and the dynamic programming system becomes a finite-dimensional problem that will be solved numerically. It is convenient to work directly with the posterior distribution of the drift, i.e., the conditional law of the drift $B$ given the asset price observations, also called the normalized filter. From (5) and Proposition 3.2, it is given by the inductive relation
$$\mu_k(db) = \frac{g(R_k - b)\, \mu_{k-1}(db)}{\int_{\mathbb R^d} g(R_k - b)\, \mu_{k-1}(db)}, \quad k = 1,\dots,N. \qquad (12)$$

4.1 Bayesian Kalman filtering

We assume that the probability law $\nu$ of the noise $\epsilon_k$ is Gaussian, $\mathcal N(0,\Gamma)$, with density function
$$g(r) = (2\pi)^{-d/2}\, |\Gamma|^{-1/2}\, e^{-\frac{1}{2} r'\Gamma^{-1} r}, \quad r \in \mathbb R^d. \qquad (13)$$
Assuming also that the prior distribution $\mu_0$ of the drift $B$ is Gaussian with mean $b_0$ and invertible covariance matrix $\Sigma_0$, we deduce by induction from (12) that the posterior distribution $\mu_k$ is also Gaussian, $\mu_k \sim \mathcal N(\hat B_k, \Sigma_k)$, where $\hat B_k = \mathbb E[B \,|\, \mathcal F^o_k]$ and $\Sigma_k$ satisfy the well-known inductive relations
$$\hat B_{k+1} = \hat B_k + K_{k+1}(R_{k+1} - \hat B_k), \quad k = 0,\dots,N-1, \qquad (14)$$
$$\Sigma_{k+1} = \Sigma_k - \Sigma_k(\Sigma_k + \Gamma)^{-1}\Sigma_k, \qquad (15)$$
where $K_{k+1}$ is the so-called Kalman gain given by
$$K_{k+1} = \Sigma_k(\Sigma_k + \Gamma)^{-1}, \quad k = 0,\dots,N-1. \qquad (16)$$
We have the initialization $\hat B_0 = b_0$, and the notation for $\Sigma_k$ is coherent at time $k = 0$, as it corresponds to the covariance matrix of $B$. While the Bayesian estimate $\hat B_k$ of $B$ is updated from the current observation of the log-return $R_k$, notice that $\Sigma_k$ (as well as $K_k$) is deterministic, and is then equal to the covariance matrix of the error between $B$ and its Bayesian estimate, i.e., $\Sigma_k = \mathbb E[(B - \hat B_k)(B - \hat B_k)']$.
Actually, we can explicitly compute $\Sigma_k$ by noting from Equation (12), with $g$ as in (13) and $\mu_0 \sim \mathcal N(b_0, \Sigma_0)$, that
$$\mu_k \sim \frac{\exp\Big(-\frac{1}{2}(b - m_k)'\,(\Sigma_0^{-1} + k\,\Gamma^{-1})\,(b - m_k)\Big)}{\sqrt{(2\pi)^d\, \big|(\Sigma_0^{-1} + k\,\Gamma^{-1})^{-1}\big|}}, \quad \text{with } m_k = (\Sigma_0^{-1} + k\,\Gamma^{-1})^{-1}\Big(\Gamma^{-1}\sum_{j=1}^k R_j + \Sigma_0^{-1} b_0\Big).$$
By identification, we then get
$$\Sigma_k = (\Sigma_0^{-1} + k\,\Gamma^{-1})^{-1} = \Sigma_0(\Gamma + \Sigma_0 k)^{-1}\Gamma. \qquad (17)$$
Moreover, the innovation process $(\tilde\epsilon_k)_k$, defined as
$$\tilde\epsilon_{k+1} = R_{k+1} - \mathbb E[R_{k+1} \,|\, \mathcal F^o_k] = R_{k+1} - \hat B_k, \quad k = 0,\dots,N-1, \qquad (18)$$
is an $\mathbb F^o$-adapted Gaussian process. Each $\tilde\epsilon_{k+1}$ is independent of $\mathcal F^o_k$ (hence the $\tilde\epsilon_k$, $k = 1,\dots,N$, are mutually independent), and is a centered Gaussian vector with covariance matrix
$$\tilde\epsilon_{k+1} \sim \mathcal N(0, \tilde\Gamma_{k+1}), \quad \text{with } \tilde\Gamma_{k+1} = \Sigma_k + \Gamma.$$
We refer to Kalman (1960) and Kalman and Bucy (1961) for these classical properties of the Kalman filter and the innovation process.
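The recursion (14)-(16) and the closed form (17) can be sketched and cross-checked in a few lines; the two-asset covariances, prior and "true" drift below are illustrative assumptions:

```python
import numpy as np

def kalman_step(b_hat, Sigma, R, Gamma):
    """One Bayesian drift update, equations (14)-(16):
    K = Sigma (Sigma + Gamma)^{-1}, b_hat <- b_hat + K (R - b_hat),
    Sigma <- Sigma - K Sigma."""
    K = Sigma @ np.linalg.inv(Sigma + Gamma)
    return b_hat + K @ (R - b_hat), Sigma - K @ Sigma

Gamma = np.array([[0.04, 0.01], [0.01, 0.09]])    # noise covariance (assumed)
Sigma0 = np.array([[0.02, 0.00], [0.00, 0.02]])   # prior covariance of B (assumed)
b_hat, Sigma = np.array([0.05, 0.02]), Sigma0.copy()
B_true = np.array([0.08, -0.01])                  # "true" drift, used only to simulate
rng = np.random.default_rng(2)
for k in range(50):
    R = B_true + rng.multivariate_normal(np.zeros(2), Gamma)   # R_{k+1} = B + eps
    b_hat, Sigma = kalman_step(b_hat, Sigma, R, Gamma)

# Closed form (17): Sigma_k = (Sigma_0^{-1} + k Gamma^{-1})^{-1}, here k = 50.
Sigma_cf = np.linalg.inv(np.linalg.inv(Sigma0) + 50 * np.linalg.inv(Gamma))
```

The recursion reproduces (17) up to floating-point error, and the estimate $\hat B_k$ concentrates around the simulated drift as observations accumulate, while $\Sigma_k$ shrinks deterministically regardless of the observed returns.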
Remark 4.1.
From (14) and (18), we see that the Bayesian estimator $\hat B_k$ follows the dynamics
$$\hat B_{k+1} = \hat B_k + K_{k+1}\tilde\epsilon_{k+1}, \quad k = 0,\dots,N-1, \qquad \hat B_0 = b_0,$$
which implies in particular that $\hat B_k$ has a Gaussian distribution with mean $b_0$ and covariance matrix satisfying
$$\mathrm{Var}(\hat B_{k+1}) = \mathrm{Var}(\hat B_k) + K_{k+1}(\Sigma_k + \Gamma)K_{k+1}' = \mathrm{Var}(\hat B_k) + \Sigma_k(\Sigma_k + \Gamma)^{-1}\Sigma_k.$$
Recalling the inductive relation (15) on $\Sigma_k$, this shows that $\mathrm{Var}(\hat B_k) = \Sigma_0 - \Sigma_k$. Note that, from Equation (15), $(\Sigma_k)_k$ is a decreasing sequence, which ensures that $\mathrm{Var}(\hat B_k)$ is positive semi-definite and nondecreasing with time $k$. ♦

4.2 Finite-dimensional dynamic programming equation

From (18), we see that our initial portfolio selection Problem (4) can be reformulated as a full observation problem with state dynamics given by
$$X^\alpha_{k+1} = X^\alpha_k\big(1 + \alpha_k'(e^{\hat B_k + \tilde\epsilon_{k+1}} - \mathbf 1_d)\big), \qquad \hat B_{k+1} = \hat B_k + K_{k+1}\tilde\epsilon_{k+1}, \quad k = 0,\dots,N-1. \qquad (19)$$
We then define the value function on $[[0,N]] \times S^q \times \mathbb R^d$ by
$$\tilde v_k(x,z,b) = \sup_{\alpha \in \mathcal A^q_k(x,z)} \mathbb E\big[U\big(X^{k,x,b,\alpha}_N\big)\big], \quad k \in [[0,N]],\ (x,z) \in S^q,\ b \in \mathbb R^d,$$
where the pair $(X^{k,x,b,\alpha}, \hat B^{k,b})$ is the process solution to (19) on $[[k,N]]$, starting from $(x,b)$ at time $k$, so that $V_0 = \tilde v_0(x_0, x_0, b_0)$. The associated dynamic programming system satisfied by the sequence $(\tilde v_k)_k$ is
$$\tilde v_N(x,z,b) = U(x), \quad (x,z) \in S^q,\ b \in \mathbb R^d,$$
$$\tilde v_k(x,z,b) = \sup_{a \in A^q(x,z)} \mathbb E\Big[\tilde v_{k+1}\Big(x\big(1 + a'(e^{b+\tilde\epsilon_{k+1}} - \mathbf 1_d)\big),\ \max\big[z,\, x\big(1 + a'(e^{b+\tilde\epsilon_{k+1}} - \mathbf 1_d)\big)\big],\ b + K_{k+1}\tilde\epsilon_{k+1}\Big)\Big],$$
for $k = 0,\dots,N-1$. Notice that in the above formula, the expectation is taken with respect to the innovation vector $\tilde\epsilon_{k+1}$, which is distributed according to $\mathcal N(0, \tilde\Gamma_{k+1})$.

Moreover, in the case of CRRA utility functions $U(x) = x^p/p$, and similarly as in Section 3.4, we have the dimension reduction with
$$\tilde w_k(r,b) = \tilde v_k(r,1,b), \quad r \in [q,1],\ b \in \mathbb R^d,$$
so that $\tilde v_k(x,z,b) = z^p\, \tilde w_k\big(\tfrac{x}{z}, b\big)$, and this reduced value function satisfies the backward system on $[q,1] \times \mathbb R^d$:
$$\tilde w_N(r,b) = \frac{r^p}{p}, \quad r \in [q,1],\ b \in \mathbb R^d,$$
$$\tilde w_k(r,b) = \sup_{a \in A^q(r)} \mathbb E\Big[\tilde w_{k+1}\Big(\min\big[1,\, r\big(1 + a'(e^{b+\tilde\epsilon_{k+1}} - \mathbf 1_d)\big)\big],\ b + K_{k+1}\tilde\epsilon_{k+1}\Big)\Big],$$
for $k = 0,\dots,N-1$.

Remark 4.2 (No short-sale constrained Merton problem). In the limiting case when $q = 0$, the drawdown constraint reduces to a non-negativity constraint on the wealth process, and by Lemma 3.3, this means a no short-selling and no borrowing constraint on the portfolio strategies. When the drift $B$ is also known, equal to $b_0$, and for a CRRA utility function, let us then consider the corresponding constrained Merton problem with value function denoted by $v^M_k$, $k = 0,\dots,N$, which satisfies the standard backward recursion from dynamic programming:
$$v^M_N(x) = \frac{x^p}{p}, \quad x > 0, \qquad v^M_k(x) = \sup_{a \in [0,1]^d,\ a'\mathbf 1_d \le 1} \mathbb E\Big[v^M_{k+1}\Big(x\big(1 + a'(e^{b_0+\epsilon_{k+1}} - \mathbf 1_d)\big)\Big)\Big], \quad k = 0,\dots,N-1. \qquad (20)$$
Searching for a solution of the form $v^M_k(x) = K_k\, x^p/p$, with $K_k \ge 0$, $k \in [[0,N]]$, we see that the sequence $(K_k)_k$ satisfies the recursive relation $K_k = S K_{k+1}$, $k = 0,\dots,N-1$, $K_N = 1$, where
$$S := \sup_{a \in [0,1]^d,\ a'\mathbf 1_d \le 1} \mathbb E\Big[\big(1 + a'(e^{b_0+\epsilon_1} - \mathbf 1_d)\big)^p\Big],$$
by recalling that $\epsilon_1,\dots,\epsilon_N$ are i.i.d. random variables.
It follows that the value function of the constrained Merton problem, the unique solution to the dynamic programming system (20), is equal to
$$v^M_k(x) = S^{N-k}\, \frac{x^p}{p}, \quad k = 0,\dots,N,$$
and the constant optimal control is given by
$$a^M_k = \operatorname*{argmax}_{a \in [0,1]^d,\ a'\mathbf 1_d \le 1} \mathbb E\Big[\big(1 + a'(e^{b_0+\epsilon_1} - \mathbf 1_d)\big)^p\Big], \quad k = 0,\dots,N-1. \quad ♦$$

In this section, we exhibit numerical results to promote the benefits of learning from new information. To this end, we compare the learning strategy (Learning) to the non-learning one (Non-Learning) in the case of the CRRA utility function and a Gaussian distribution for the noise. The prior probability distribution of $B$ is the Gaussian distribution $\mathcal N(b_0, \Sigma_0)$ for Learning, while it is the Dirac distribution concentrated at $b_0$ for Non-Learning. We use deep neural network techniques to compute numerically the optimal solutions for both Learning and Non-Learning. To broaden the analysis, in addition to the learning and non-learning strategies, we have computed an "admissible" equally weighted (EW) strategy. More precisely, this EW strategy shares the quantity $X_k - qZ_k$ equally among the $d$ assets. Eventually, we show numerical evidence that Non-Learning converges to the optimal strategy of the constrained Merton problem when the loss aversion parameter $q$ vanishes.

Neural networks (NN) are able to approximate nonlinear continuous functions, typically the value function and controls of our problem. The principle is to use a large amount of data to train the NN so that it progressively comes close to the target function. It is an iterative process in which the NN is tuned on a training set, then tested on a validation set to avoid over-fitting. For more details, see for instance Hornik (1991) and Géron (2019). The algorithm we use relies on two dense neural networks: the first one is dedicated to the controls ($\mathcal A_{NN}$) and the second one to the value function ($\mathcal{VF}_{NN}$).
Each NN is composed of four layers: an input layer, two hidden layers and an output layer.

(i) The input layer is $(d+1)$-dimensional, since it embeds the conditional expectations of each of the $d$ assets and the ratio $\rho$ of the current wealth to the current historical maximum.

(ii) The two hidden layers give the NN the flexibility to adjust its weights and biases to approximate the solution. From numerical experiments, we see that, given the complexity of our problem, a first hidden layer with $d+20$ neurons and a second one with $d+10$ neurons are a good compromise between speed and accuracy.

(iii) The output layer is $d$-dimensional for the controls, one component for each asset representing the weight of the instrument, and is one-dimensional for the value function. See Figures 1 and 2 for an overview of the NN architectures in the case of $d = 3$ assets.

Figure 1: $\mathcal A_{NN}$ architecture with $d = 3$ assets. Figure 2: $\mathcal{VF}_{NN}$ architecture with $d = 3$ assets.

Parameter                                          A_NN                            VF_NN
Initializer                                        uniform(0,1)                    He uniform
Regularizers                                       L2 norm                         L2 norm
Activation functions                               Elu, Sigmoid for output layer   Elu, Sigmoid for output layer
Optimizer                                          Adam                            Adam
Learning rate, step N-1                            5e-3                            1e-3
Learning rate, steps k = 0,...,N-2                 6.25e-4                         5e-4
Scale                                              1e-3                            1e-3
Number of elements in a training batch             3e2                             3e2
Number of training batches                         1e2                             1e2
Size of the validation batches                     1e3                             1e3
Penalty constant                                   3e-1                            NA
Number of epochs, step N-1                         2e3                             2e3
Number of epochs, steps k = 0,...,N-2              5e2                             5e2
Size of the training set, step N-1                 6e7                             6e7
Size of the training set, steps k = 0,...,N-2      1.5e7                           1.5e7
Size of the validation set, step N-1               2e6                             2e6
Size of the validation set, steps k = 0,...,N-2    5e5                             5e5

Table 1: Parameters for the neural networks of the controls $\mathcal A_{NN}$ and the value function $\mathcal{VF}_{NN}$.

We follow the indications in Géron (2019) to set up and define the values of the various inputs of the neural networks, which are listed in Table 1. To train the NN, we simulate the input data. For the conditional expectation $\hat B_k$, we use its time-dependent Gaussian distribution (see Remark 4.1): $\hat B_k \sim \mathcal N(b_0, \Sigma_0 - \Sigma_k)$, with $\Sigma_k$ as in Equation (17). On the other hand, the training data for $\rho$ are drawn from the uniform distribution between $q$ and 1, the interval where it lies according to the maximum drawdown constraint.

Hybrid-Now algorithm
We use the Hybrid-Now algorithm developed in Bachouch et al. (2018b) in order to solve our problem numerically. This algorithm combines optimal policy estimation by neural networks with the dynamic programming principle, which suits the approach we have developed in Section 4. With the same notations as in Algorithm 1, detailed in the next insert, at time $k$ the algorithm computes the proxy of the optimal control $\hat\alpha_k$ with $\mathcal A_{NN}$, using the known function $\hat V_{k+1}$ calculated at the previous step, and uses $\mathcal{VF}_{NN}$ to obtain a proxy of the value function $\hat V_k$. Starting from the known function $\hat V_N := U$ at terminal time $N$, the algorithm computes $\hat\alpha_k$ and $\hat V_k$ sequentially by backward iteration until time 0. This way, the algorithm loops to build the optimal controls and the value function pointwise, and gives as output the optimal strategy, namely the optimal controls from 0 to $N-1$, over the $N$ time steps.

The maximum drawdown constraint is a time-dependent constraint on the maximal proportion of wealth to invest (recall Lemma 3.3). In practice, it is a constraint on the sum of the weights of the assets, or equivalently on the output of $\mathcal A_{NN}$. For that reason, we have implemented an appropriate penalty function that rejects undesirable values:
$$G_{Penalty}(A, r) = K \max\Big(|A| - \Big(1 - \frac{q}{r}\Big),\, 0\Big), \quad A \in [0,1]^d,\ r \in [q,1].$$
This penalty function ensures that the strategy respects the maximum drawdown constraint at each time step, when the parameter $K$ is chosen sufficiently large.
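A plain-numpy sketch of the control network's forward pass (the architecture of Section 5.1: $d+1$ inputs, Elu hidden layers of $d+20$ and $d+10$ units, sigmoid outputs) together with the penalty term applied to its output. The paper trains this network with TensorFlow 2; the random weights and the exact hinge form of the penalty here are illustrative stand-ins, with $K = 3\mathrm{e}{-1}$ taken from Table 1:

```python
import numpy as np

def elu(x):
    return np.where(x > 0, x, np.expm1(x))

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def make_net(d, rng):
    """Dense control network A_NN: (d+1) inputs -> (d+20) Elu -> (d+10) Elu
    -> d sigmoid outputs. Weights are random stand-ins, not trained values."""
    sizes = [d + 1, d + 20, d + 10, d]
    return [(0.1 * rng.uniform(0.0, 1.0, (m, n)), np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]

def a_nn(params, rho, b_hat):
    """Forward pass on the state (rho, b_hat); sigmoid keeps weights in (0,1)."""
    h = np.concatenate(([rho], b_hat))
    for i, (W, c) in enumerate(params):
        h = h @ W + c
        h = sigmoid(h) if i == len(params) - 1 else elu(h)
    return h

def g_penalty(A, r, q, K=0.3):
    """Penalty charging K times the violation of the budget |A|_1 <= 1 - q/r
    from Lemma 3.3; zero on the admissible set."""
    return K * max(np.sum(np.abs(A)) - (1.0 - q / r), 0.0)

rng = np.random.default_rng(5)
params = make_net(3, rng)
A = a_nn(params, rho=0.9, b_hat=np.array([0.05, 0.02, 0.04]))
pen = g_penalty(A, r=0.9, q=0.7)
```

During training, `pen` is added to the loss minimized over the network parameters, so that gradient descent is pushed back towards the admissible set $A^q(r)$.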
Algorithm 1: Hybrid-Now

Input: the training distributions $\mu^{Unif}$ and $\mu^{Gauss}_k$;    ▷ $\mu^{Unif} = \mathcal U(q,1)$, $\mu^{Gauss}_k = \mathcal N(b_0, \Sigma_0 - \Sigma_k)$
Output: estimate of the optimal strategy $(\hat a_k)_{k=0}^{N-1}$; estimate of the value function $(\hat V_k)_{k=0}^{N-1}$;
Set $\hat V_N = U$;
for $k = N-1, \dots, 0$ do
    Compute:
    $$\hat\beta_k \in \operatorname*{argmin}_{\beta} \mathbb E\Big[G_{Penalty}\big(\mathcal A_{NN}(\rho_k, \hat B_k; \beta),\, \rho_k\big) - \hat V_{k+1}\big(\rho^\beta_{k+1}, \hat B_{k+1}\big)\Big],$$
    where $\rho_k \sim \mu^{Unif}$, $\hat B_k \sim \mu^{Gauss}_k$, $\hat B_{k+1} = \tilde H_k(\hat B_k, \tilde\epsilon_{k+1})$ and $\rho^\beta_{k+1} = F\big(\rho_k, \hat B_k, \mathcal A_{NN}(\rho_k, \hat B_k; \beta), \tilde\epsilon_{k+1}\big)$;
        ▷ $F(\rho, b, a, \epsilon) = \min\big(1,\, \rho\big(1 + \sum_{i=1}^d a_i(e^{b_i + \epsilon_i} - 1)\big)\big)$
        ▷ $\tilde H_k(b, \epsilon) = b + \Sigma_0(\Gamma + \Sigma_0 k)^{-1}\epsilon$
    Set $\hat a_k = \mathcal A_{NN}(\cdot\,; \hat\beta_k)$;    ▷ $\hat a_k$ is the estimate of the optimal control at time $k$.
    Compute:
    $$\hat\theta_k \in \operatorname*{argmin}_{\theta} \mathbb E\Big[\Big(\hat V_{k+1}\big(\rho^{\hat\beta_k}_{k+1}, \hat B_{k+1}\big) - \mathcal{VF}_{NN}\big(\rho_k, \hat B_k; \theta\big)\Big)^2\Big];$$
    Set $\hat V_k = \mathcal{VF}_{NN}(\cdot\,; \hat\theta_k)$;    ▷ $\hat V_k$ is the estimate of the value function at time $k$.

A major argument behind the choice of this algorithm is that it is particularly relevant for problems in which the neural network approximations of the controls and value function at time $k$ are close to the ones at time $k+1$. This is what we expect in our case. We can then take a small learning rate for the Adam optimizer, which enforces the stability of the parameters' update during the gradient-descent-based learning procedure.

Parameter                                          Value
Number of risky assets d                           3
T                                                  1
N                                                  24
p                                                  …
q                                                  …
Annualized drift b_0                               [0.05, 0.025, …]
Annualized covariance matrix of the drift B        …
Annualized volatility of ε                         [0.08, 0.04, …]
Correlation matrix of ε                            … (off-diagonal entry −0.25 between assets 2 and 3)
Annualized covariance matrix of the noise ε        …

Table 2: Values of the parameters used in the simulation.
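As a companion check for the convergence study of Subsection 5.3.3, the Merton constant $S$ of Remark 4.2 (and hence $v^M_k(x) = S^{N-k} x^p/p$) can be estimated by Monte Carlo with a brute-force search over no-short-sale weights. The drift, volatilities and grid density below are illustrative assumptions, not the Table 2 values:

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(4)
p = 0.5                                     # illustrative CRRA exponent
b0 = np.array([0.002, 0.001, 0.004])        # per-period drift (assumed)
vol = np.array([0.02, 0.015, 0.03])         # per-period noise std (assumed independent)
growth = np.exp(b0 + vol * rng.standard_normal((20_000, 3))) - 1.0  # e^{b0+eps} - 1

def objective(a):
    """Monte Carlo estimate of E[(1 + a'(e^{b0+eps} - 1))^p], maximized in S."""
    return np.mean((1.0 + growth @ a) ** p)

# Exhaustive search on a coarse grid of the simplex {a >= 0, sum(a) <= 1}.
grid = np.linspace(0.0, 1.0, 11)
best_val, best_a = -np.inf, None
for a in product(grid, repeat=3):
    if sum(a) <= 1.0 + 1e-12:
        v = objective(np.array(a))
        if v > best_val:
            best_val, best_a = v, np.array(a)

S = best_val    # v^M_k(x) = S^{N-k} x^p / p under this estimate
```

Since $a = 0$ yields exactly 1, the estimate satisfies $S \ge 1$; in the actual study, the Non-Learning strategy with vanishing $q$ is compared against this benchmark.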
In this section, we explain the setup of the simulation and exhibit the main results. We have used TensorFlow 2 and the deep learning techniques for Python developed in Géron (2019). We consider d = 3 risky assets and a riskless asset, whose return is assumed to be 0, on a 1-year investment horizon for the sake of simplicity. We consider 24 portfolio rebalancing dates during the 1-year period, i.e., one every two weeks. This means that we have N = 24 steps in the training of our neural networks. The parameters used in the simulation are detailed in Table 2.

First, we show the numerical results for the learning and the non-learning strategies by presenting a performance and an allocation analysis in Subsection 5.3.1. Then, we add the admissible constrained EW to the two previous ones and use this neutral strategy as a benchmark in Subsection 5.3.2. Ultimately, in Subsection 5.3.3, we illustrate numerically the convergence of the non-learning strategy to the constrained Merton problem when the loss aversion parameter q vanishes.

We simulate Ñ = 1000 trajectories for each strategy and exhibit the performance results with an initial wealth x = 1. Figure 3 illustrates the average historical level of the learning and non-learning strategies with a 95% confidence interval. Learning significantly outperforms Non-Learning, with a narrower confidence interval revealing that less uncertainty surrounds the Learning performance, thus yielding less risk.

An interesting phenomenon, visible in Fig. 3, is the nearly flat curve for Learning between time 0 and time 1. Indeed, whereas Non-Learning starts investing immediately, Learning adopts a safer approach and needs a first time step before allocating a significant proportion of wealth. Given the level of uncertainty surrounding b, this first step allows Learning to fine-tune its allocation by updating the prior belief with the first return available at time 1. On the contrary, Non-Learning, which cannot update its prior, starts investing at time 0.

Fig. 4 shows the ratio of Learning over Non-Learning: a ratio greater than one means that Learning outperforms Non-Learning, and a ratio less than one that it underperforms. It shows the significant outperformance of Learning over Non-Learning, except during the first period, where Learning was not significantly invested and Non-Learning had a positive return. Moreover, this graph reveals the typical increasing concave curve of the value of information described in Keppo et al. (2018), in the context of investment decisions and costs of data analytics, and in De Franco, Nicolle, and Pham (2019a) in the resolution of the Markowitz portfolio selection problem using a Bayesian learning approach.

Figure 3: Historical Learning and Non-Learning levels with a 95% confidence interval.

Figure 4: Historical ratio of Learning over Non-Learning levels.
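To make the simulation setup concrete, the sketch below draws i.i.d. return paths $R_k = B + \epsilon_k$ and compounds wealth as $X_{k+1} = X_k(1 + a_k' Y_{k+1})$ with $Y_{k+1} = e^{R_{k+1}} - \mathbf{1}_d$, as in the model. This is a minimal numpy illustration, not the TensorFlow training code; the drift, noise covariance and the flat 10% policy are assumptions, not the paper's parameters.

```python
import numpy as np

rng = np.random.default_rng(0)

d, N, n_paths = 3, 24, 1000          # assets, rebalancing steps, simulated paths
b = np.array([0.02, -0.01, 0.05])    # hypothetical per-step drift (not Table 2's values)
sigma = 0.03 * np.eye(d)             # hypothetical noise covariance

def simulate_wealth(policy, x0=1.0):
    """Compound wealth X_{k+1} = X_k * (1 + a_k' Y_{k+1}), with Y = exp(R) - 1."""
    X = np.full(n_paths, x0)
    path = [X.copy()]
    for k in range(N):
        R = rng.multivariate_normal(b, sigma, size=n_paths)  # R_{k+1} = B + eps_{k+1}
        Y = np.expm1(R)                                      # Y_{k+1} = e^R - 1
        a = policy(k, X)                                     # proportions in risky assets
        X = X * (1.0 + np.einsum('ij,ij->i', a, Y))          # per-path growth factor
        path.append(X.copy())
    return np.array(path)

# toy policy: rebalance to 10% in each risky asset (riskless gets the rest)
flat = lambda k, X: np.full((X.size, d), 0.10)
levels = simulate_wealth(flat)
```

A learning or non-learning policy would replace `flat` with an allocation rule depending on the (updated or fixed) drift estimate.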
Table 3 gathers relevant statistics for both Learning and Non-Learning: the average total performance, the standard deviation of the terminal wealth $X_T$, and the Sharpe ratio computed as the average total performance over the standard deviation of terminal wealth. The maximum drawdown (MD) is examined through two statistics: denoting by $\mathrm{MD}^{\tilde s}_\ell$ the maximum drawdown of the $\ell$-th trajectory of a strategy $\tilde s$, the average MD is defined as
$$
\text{Avg MD}^{\tilde s} \;=\; \frac{1}{\tilde N} \sum_{\ell=1}^{\tilde N} \mathrm{MD}^{\tilde s}_\ell,
$$
for $\tilde N$ trajectories of the strategy $\tilde s$, and the worst MD is defined as
$$
\text{Worst MD}^{\tilde s} \;=\; \min\big(\mathrm{MD}^{\tilde s}_1, \dots, \mathrm{MD}^{\tilde s}_{\tilde N}\big).
$$
Finally, the Calmar ratio, computed as the ratio of the average total performance over the average maximum drawdown, is the last statistic exhibited.

With the simulated dataset, Learning delivered, on average, a total performance of 9.34%, versus 6.40% for Non-Learning, i.e., a 2.94% excess return. Moreover, the risk metrics are significantly better for Learning than for Non-Learning: Learning exhibits a lower standard deviation of terminal wealth, smaller average and worst maximum drawdowns, and improved Sharpe and Calmar ratios, the latter by roughly 525% (see Table 3).

Statistic                Learning    Non-Learning    Difference
Avg total performance    9.34%       6.40%           2.94%
Std dev. of X_T

Table 3: Performance metrics: Learning and Non-Learning. The differences for ratios are computed as relative improvements.

Fig. 5 shows the average historical allocations. Neither strategy invests in Asset 2, since it has the lowest expected return according to the prior, see Table 2. Whereas Non-Learning focuses on Asset 3, the one with the highest expected return, Learning performs an optimal allocation between Assets 1 and 3, since this strategy is not stuck with the initial estimate given by the prior. Therefore, Learning invests little at time 0, then balances nearly equally Assets 1 and 3, and then invests only in Asset 3 after time step 12. Instead, Non-Learning invests only in Asset 3, from time 0 until the end of the investment horizon.
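The two drawdown statistics above can be computed directly from simulated wealth paths; a minimal numpy sketch, where the two toy paths are assumptions for illustration:

```python
import numpy as np

def max_drawdown(path):
    """Largest relative drop from the running maximum of one wealth path."""
    running_max = np.maximum.accumulate(path)
    return float(np.min(path / running_max - 1.0))   # <= 0, e.g. -0.25 for a 25% drop

def drawdown_stats(paths):
    """paths: (n_steps, n_paths) array of wealth levels for one strategy.
    Returns (Avg MD, Worst MD) over the simulated trajectories."""
    mds = np.array([max_drawdown(paths[:, l]) for l in range(paths.shape[1])])
    return mds.mean(), mds.min()

# two toy trajectories: the first peaks at 1.2 then falls to 0.9 (a 25% drawdown)
paths = np.array([[1.0, 1.0],
                  [1.2, 0.8],
                  [0.9, 1.0]])
avg_md, worst_md = drawdown_stats(paths)
```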
Figure 5: Historical Learning and Non-Learning asset allocations.
The curves in Fig. 6 recall each asset's optimal weight, but the main features are the colored areas that represent the average historical total percentage of wealth invested by each strategy. The dotted line represents the total allocation constraint they should satisfy to be admissible. To satisfy the maximum drawdown constraint, admissible strategies can only invest in risky assets the proportion of wealth that, in theory, could be totally lost. This explains why the non-learning strategy invests at full capacity in the asset that has the maximum expected return according to the prior distribution.

We clearly see that both strategies satisfy their respective constraints. Indeed, looking at the left panel, Learning is far from saturating the constraint: it has invested, on average, roughly 10% of its wealth, while its constraint was set around 30%. Non-Learning invests at full capacity, saturating its allocation constraint. Remark that this constraint is not a straight line, since it depends on the value of the ratio of the current wealth over its current historical maximum, and thus evolves over time.

Figure 6: Historical Learning and Non-Learning total allocations.
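The time-varying cap just discussed follows from the drawdown constraint $X \geq qZ$, with $Z$ the running maximum of wealth: the total risky weight must satisfy $a'\mathbf{1}_d \leq 1 - q\,Z/X$ (cf. the admissibility set characterized in the appendix). A short sketch, where $q = 0.7$ is an assumed value chosen so that the cap at a new wealth high is about 30%:

```python
import numpy as np

def risky_weight_cap(X, Z, q=0.7):
    """Admissible total risky weight: a'1_d <= 1 - q * Z/X."""
    return 1.0 - q * Z / X

# along a wealth path, the cap tightens after drawdowns and resets at new highs
wealth = np.array([1.00, 1.05, 0.95, 1.10])
running_max = np.maximum.accumulate(wealth)
caps = risky_weight_cap(wealth, running_max)
```

At a new high ($Z = X$) the cap equals $1 - q$; after the drop to 0.95 it shrinks, which is why the dotted constraint line in Fig. 6 is not straight.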
In this section, we add a simple constrained equally-weighted (EW) strategy to serve as a benchmark for both Learning and Non-Learning. At each time step, the constrained EW strategy invests, equally across the three assets, the proportion of wealth above the threshold q.

Fig. 7 shows the average historical levels of the three strategies: Learning, Non-Learning and constrained EW. We notice that Non-Learning outperforms constrained EW and that both have similar confidence intervals. It is not surprising to see that Non-Learning outperforms constrained EW, since Non-Learning always bets on Asset 3, the best-performing asset, while constrained EW diversifies the risks equally among the three assets.

Figure 7: Historical Learning, Non-Learning and constrained EW (Const. EW) levels with a 95% confidence interval.
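One way to implement this benchmark, assuming the "proportion of wealth above the threshold" is the admissible risky budget $1 - q\,Z/X$ split equally across the $d$ assets (the value $q = 0.7$ is an assumption):

```python
import numpy as np

def constrained_ew_weights(X, Z, d=3, q=0.7):
    """Split the admissible risky budget 1 - q*Z/X equally across the d assets."""
    budget = max(1.0 - q * Z / X, 0.0)   # never negative, never above the cap
    return np.full(d, budget / d)

# at a new wealth high (X = Z), the budget is 1 - q and each asset gets (1 - q)/d
w = constrained_ew_weights(X=1.0, Z=1.0)
```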
Fig. 8 shows the ratio of Learning over constrained EW: it depicts the same concave shape as Fig. 4. The outperformance of Non-Learning with respect to constrained EW is plotted in Fig. 9 and confirms, on average, the similarity of the two strategies.

Statistic                Const. EW    L        NL       L − Const. EW    NL − Const. EW
Avg total performance    3.85%        9.34%    6.40%    5.49%            2.55%
Std dev. of X_T

Table 4: Performance metrics: Constrained EW (Const. EW) vs Learning (L) and Non-Learning (NL). The differences for ratios are computed as relative improvements.
Figure 8: Ratio of Learning over constrained EW (Const. EW) according to time.

Figure 9: Ratio of Non-Learning over constrained EW (Const. EW) according to time.
Table 4 collects relevant statistics for the three strategies. Learning clearly surpasses constrained EW: it outperforms by 5.49% while reducing the uncertainty on terminal wealth and improving the Sharpe ratio. Moreover, it better handles maximum drawdown regarding both the average and the worst case, exhibiting improvements of 3.17% and 10.09% respectively, and enhancing the Calmar ratio by roughly 647%. Maximum drawdown, both average and worst, is better handled by constrained EW than by Non-Learning, thanks to the diversification capacity of constrained EW. However, the better performance of Non-Learning compensates for the better maximum drawdown handling of constrained EW, entailing a better Calmar ratio for Non-Learning than the 0.82 of constrained EW.
We numerically analyze the impact of the drawdown parameter q, and compare the non-learning strategy (assuming that the drift is equal to b) with the constrained Merton strategy as described in Remark 4.2. Fig. 10 confirms that when the loss aversion parameter q goes to zero, the non-learning strategy approaches the Merton strategy.

Figure 10: Wealth curves resulting from the Merton strategy and the non-learning strategy for different values of q.

In terms of assets' allocation, the Merton strategy saturates the constraint by investing only in the asset with the highest expected return, Asset 3, while the non-learning strategy adopts a similar approach and invests at full capacity in the same asset. To illustrate this point, we easily see that the areas at the top and bottom-left corners converge to the area at the bottom-right corner of Fig. 11.
Figure 11: Asset 3 average weights of the non-learning strategies for different values of q and the Merton strategy.

As q vanishes, we observe evidence of the convergence of the Merton and the non-learning strategies, materialized by a converging allocation pattern and by the resulting wealth trajectories. This should not be surprising, since both have in common that they do not learn from the incoming information conveyed by the prices.

5.4 Sensitivities analysis

In this subsection, we study the effect of changes in the uncertainty about the beliefs on B. These beliefs take the form of an estimate b of B, and of a degree of uncertainty about this estimate, the covariance matrix Σ of B. For the sake of simplicity, we design Σ as a diagonal matrix whose diagonal entries are variances representing the confidence the investor has in her beliefs about the drift. To easily model a change in Σ, we define the modified covariance matrix Σ̃_unc as
$$
\tilde{\Sigma}_{unc} := unc \cdot \Sigma, \quad \text{where } unc > 0.
$$
From now on, the prior of B is N(b, Σ̃_unc). A higher value of unc means a higher uncertainty, materialized by a lower confidence in b, the prior estimate of the expected return of B. We consider learning strategies with values of unc ∈ {1/6, 1, 3, 6, 12}. The value unc = 1 was used for Learning in Subsection 5.3.

Equation (2) implies that the returns' probability distribution depends upon unc. It follows that, for each value of unc, we need to compute both Learning and Non-Learning on a returns sample drawn from the same probability law to make relevant comparisons. Therefore, from a sample of a thousand draws of returns paths, we plot in Fig. 12 the average curves of the excess return of Learning over its associated Non-Learning, for different values of the uncertainty parameter unc.

Figure 12: Excess return of Learning over Non-Learning with a 95% confidence interval for different levels of uncertainty.
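The role of unc in the learning speed can be illustrated with the standard Gaussian conjugate update of the drift posterior after observing k returns $R_i = B + \epsilon_i$, $\epsilon_i \sim N(0, \Gamma)$, with prior $B \sim N(b, unc \cdot \Sigma)$. This is a generic sketch consistent with the Bayesian framework, not the paper's exact Kalman recursion, and all numerical values below are assumptions:

```python
import numpy as np

def posterior_drift(b0, Sigma0, Gamma, returns, unc=1.0):
    """Gaussian conjugate update of the drift prior N(b0, unc*Sigma0)
    after observing returns R_i = B + eps_i, eps_i ~ N(0, Gamma)."""
    S0_inv = np.linalg.inv(unc * Sigma0)
    G_inv = np.linalg.inv(Gamma)
    k = len(returns)
    Sk = np.linalg.inv(S0_inv + k * G_inv)                    # posterior covariance
    bk = Sk @ (S0_inv @ b0 + G_inv @ np.sum(returns, axis=0)) # posterior mean
    return bk, Sk

b0 = np.array([0.01, 0.02])        # hypothetical prior estimate
Sigma0 = 0.01 * np.eye(2)          # hypothetical prior covariance
Gamma = 0.04 * np.eye(2)           # hypothetical noise covariance
obs = [np.array([0.05, -0.01])] * 4

b_low, _ = posterior_drift(b0, Sigma0, Gamma, obs, unc=1/6)
b_high, _ = posterior_drift(b0, Sigma0, Gamma, obs, unc=12.0)
```

With a weaker prior (large unc), the posterior mean moves much further toward the observed sample mean after the same observations, which is exactly why the learning strategy reacts faster when unc is large.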
Looking at Fig. 12, we notice that when uncertainty about b is low, i.e. unc = 1/6, Learning is close to Non-Learning and, unsurprisingly, the associated excess return is small. Then, as we increase the value of unc, the curves steepen increasingly, showing the effect of learning in generating excess return.

Table 5 summarises key statistics for the ten strategies computed in this section. When unc = 1/6, Learning underperforms Non-Learning. This is explained by the fact that Non-Learning has no doubt about b and knows that Asset 3 is the best-performing asset according to its prior, whereas Learning, even with low uncertainty, needs to learn it, generating a lag which explains the underperformance on average. For larger values of unc, Learning outperforms Non-Learning. Nevertheless, an interesting fact is that the relative Sharpe ratio improvement rises from unc = 1/6 to unc = 1, then levels off for larger values of unc, up to unc = 12.

                       unc = 1/6         unc = 1           unc = 3            unc = 6            unc = 12
Statistic              L       NL       L       NL       L        NL        L        NL        L         NL
Avg total performance  3.87%   4.35%    9.45%   6.00%    19.96%   10.25%    90.03%   16.22%    130.07%   30.44%
Std dev. of X_T

Table 5: Performance and risk metrics: Learning (L) vs Non-Learning (NL) for different values of uncertainty unc.

This phenomenon is more visible in Fig. 14, which displays the Sharpe ratio of terminal wealth of Learning and Non-Learning according to the values of unc, and the associated relative improvement. Clearly, looking at Figures 13 and 14, we remark that while increasing unc gives more excess return, too high values of unc in the model turn out to be a drag as far as the Sharpe ratio improvement is concerned.
Figure 13: Average total performance of Learning (L) and Non-Learning (NL), and excess return, for unc ∈ {1/6, 1, 3, 6, 12}.

Figure 14: Sharpe ratio of terminal wealth of Learning (L) and Non-Learning (NL), and relative improvement, for unc ∈ {1/6, 1, 3, 6, 12}.

For any value of unc, Learning handles maximum drawdown significantly better than Non-Learning, whether for the average or for the worst case. This results in a better performance per unit of average maximum drawdown (Calmar ratio) for Learning. We also see that the maximum drawdown constraint is satisfied for every strategy of the sample and for any value of unc, since the worst maximum drawdown always remains above the bound implied by q.

Figure 15: Average maximum drawdown of Learning (L) and Non-Learning (NL) and the gain from learning for unc ∈ {1/6, 1, 3, 6, 12}.

Fig. 15 reveals how the average maximum drawdown behaves with respect to the level of uncertainty. The Non-Learning maximum drawdown behaves linearly with uncertainty: the wider the range of possible values of B, the higher the maximum drawdown is on average. It emphasizes its inability to adapt to an environment in which the returns behave differently from their expectations. Learning, instead, manages to keep a low maximum drawdown for any value of unc. Given the previous remarks, it is obvious that the gain in maximum drawdown from learning grows with the level of uncertainty.

Figures 16-20 represent portfolio allocations averaged over the simulations. They depict, for each value of the uncertainty parameter unc, the average proportion of wealth invested, in each of the three assets, by Learning and Non-Learning.
The purpose is not to compare the graphs across different values of unc, since the allocation is not performed on the same sample of returns. Rather, we can identify trends that typically differentiate the Learning from the Non-Learning allocations. Since the maximum drawdown constraint is enforced through the cap on the total sum of weights that can be invested, the allocations of both Learning and Non-Learning are mainly driven by the expected returns of the assets.

Non-Learning, by definition, does not depend on the value of the uncertainty parameter. Hence, no matter the value of unc, its allocation is easy to characterize, since it saturates its constraint by investing in the asset that has the best expected return according to the prior. In our setup, Asset 3 has the highest expected return, so Non-Learning invests only in it and saturates its constraint of roughly 30% during the whole investment period. The slight change of the average weight in Asset 3 comes from ρ, the ratio of wealth over maximum wealth, changing over time.

Figure 16: Learning and Non-Learning historical assets' allocations with unc = 1/6.

Unlike Non-Learning, depending on the value of unc, Learning can perform more sophisticated allocations, because it can adjust the weights according to the incoming information. Nonetheless, in Fig. 16, when unc is low, Learning and Non-Learning look similar regarding their weights allocation, since both strategies invest, as of time 0, a significant proportion of their wealth only in Asset 3. On the right panel of Fig. 16, the progressive increase in the weight of Asset 3 illustrates the learning process. As time goes by, Learning progressively increases the weight in Asset 3, since it has the highest expected return.
It also explains why Learning underperforms Non-Learning for low values of unc: contrary to Non-Learning, which invests at full capacity in Asset 3, Learning needs to learn that Asset 3 is the optimal choice.

Figure 17: Learning and Non-Learning historical assets' allocations with unc = 1.

Figure 18: Learning and Non-Learning historical assets' allocations with unc = 3.
However, as uncertainty increases, the Learning and Non-Learning strategies start differentiating. When unc ≥ 1, Learning invests little, if any, at time 0. In addition, an increase in unc allows the initial drift to lie in a wider range and generates investment opportunities for Learning. This explains why Learning invests in Asset 1 when unc = 1, 3, 6, 12, although the estimate b for this asset is lower than for Asset 3. In Fig. 19, we see that Learning even invests in Asset 2, which has the lowest expected drift.

Figure 19: Learning and Non-Learning historical assets' allocations with unc = 6.

Figure 20: Learning and Non-Learning historical assets' allocations with unc = 12.

Figures 21-25 illustrate the historical total percentage of wealth allocated by Learning and Non-Learning for different levels of uncertainty. As seen previously, Non-Learning is fully invested in Asset 3 for any value of unc.

Figure 21: Historical total allocations of Learning and Non-Learning with unc = 1/6.

Moreover, Learning always invests less than Non-Learning, for any level of uncertainty. It suggests that Learning yields a more cautious strategy than Non-Learning. This fact, in addition to its wait-and-see approach at time 0 and its ability to better handle maximum drawdown, makes Learning a safer and more conservative strategy than Non-Learning. This can be seen in Fig. 21, where both Learning and Non-Learning have invested in Asset 3, but not at the same pace: Non-Learning goes fully into Asset 3 at time 0, whereas Learning slowly increases its weight in Asset 3, reaching 25% at the final step. When unc is low, there is no added value in choosing Learning over Non-Learning from a performance perspective. Nevertheless, Learning allows for a better management of risk, as Table 5 exhibits. As unc increases, in addition to being cautious, Learning mixes allocations in different assets, see Figures 22-25, while Non-Learning is stuck with the highest expected return asset.

Figure 22: Historical total allocations of Learning and Non-Learning with unc = 1.

Figure 23: Historical total allocations of Learning and Non-Learning with unc = 3.
Learning is able to be opportunistic and changes its allocation given the prices observed. For example, in Fig. 22, Learning starts investing in Assets 1 and 3 at time 1 and, at time 12, stops weighting Asset 1 while keeping Asset 3. Similar remarks can be made for Fig. 23, and Learning puts non-negligible weights in all three risky assets for unc = 6 in Fig. 24.
Figure 24: Historical total allocations of Learning and Non-Learning with unc = 6.

Figure 25: Historical total allocations of Learning and Non-Learning with unc = 12.

Conclusion

We have studied a discrete-time portfolio selection problem taking into account both drift uncertainty and a maximum drawdown constraint. The dynamic programming equation has been derived in the general case thanks to a specific change of measure. More explicit results have been provided in the Gaussian case using the Kalman filter. Moreover, a change of variable has reduced the dimensionality of the problem in the case of CRRA utility functions. Next, we have provided extensive numerical results in the Gaussian case with CRRA utility functions using recent deep neural network techniques. Our numerical analysis has clearly shown and quantified the better risk-return profile of the learning strategy versus the non-learning one. Indeed, besides outperforming the non-learning strategy, the learning one provides a significantly lower standard deviation of terminal wealth and a better-controlled maximum drawdown. Confirming the results established in De Franco, Nicolle, and Pham (2019b), this study exhibits the benefits of learning in providing optimal portfolio allocations.
Appendix
A.1 Proof of Proposition 3.1

For all $k = 1, \dots, N$, the law under $\bar{\mathbb{P}}$ of $R_k$ given the filtration $\mathcal{G}_{k-1}$ is the unconditional law under $\mathbb{P}$ of $\epsilon_k$. Indeed, since $(\bar{\Lambda}_k)_k$ is a $(\mathbb{P}, \mathbb{G})$-martingale, we have from the Bayes formula, for all Borel sets $F \subset \mathbb{R}^d$,
$$
\bar{\mathbb{P}}[R_k \in F \,|\, \mathcal{G}_{k-1}]
= \bar{\mathbb{E}}\big[\mathbf{1}_{\{R_k \in F\}} \,\big|\, \mathcal{G}_{k-1}\big]
= \frac{\mathbb{E}\big[\bar{\Lambda}_k \mathbf{1}_{\{R_k \in F\}} \,\big|\, \mathcal{G}_{k-1}\big]}{\mathbb{E}\big[\bar{\Lambda}_k \,\big|\, \mathcal{G}_{k-1}\big]}
= \mathbb{E}\Big[\tfrac{\bar{\Lambda}_k}{\bar{\Lambda}_{k-1}} \mathbf{1}_{\{R_k \in F\}} \,\Big|\, \mathcal{G}_{k-1}\Big]
= \mathbb{E}\Big[\tfrac{g(B + \epsilon_k)}{g(\epsilon_k)} \mathbf{1}_{\{R_k \in F\}} \,\Big|\, \mathcal{G}_{k-1}\Big]
$$
$$
= \int_{\mathbb{R}^d} \tfrac{g(B + e)}{g(e)} \mathbf{1}_{\{B + e \in F\}}\, g(e)\, de
= \int_{\mathbb{R}^d} g(z)\, \mathbf{1}_{\{z \in F\}}\, dz
= \mathbb{P}[\epsilon_k \in F].
$$
This means that, under $\bar{\mathbb{P}}$, $R_k$ is independent of $B$ and of $R_1, \dots, R_{k-1}$, and that $R_k$ has the same probability distribution as $\epsilon_k$. □

A.2 Proof of Proposition 3.2

For any Borel function $f: \mathbb{R}^d \mapsto \mathbb{R}$ we have, on the one hand, by definition of $\pi_{k+1}$:
$$
\bar{\mathbb{E}}\big[\Lambda_{k+1} f(B) \,\big|\, \mathcal{F}^o_{k+1}\big] = \int_{\mathbb{R}^d} f(b)\, \pi_{k+1}(db),
$$
and, on the other hand, by definition of $\Lambda_k$:
$$
\bar{\mathbb{E}}\big[\Lambda_{k+1} f(B) \,\big|\, \mathcal{F}^o_{k+1}\big]
= \bar{\mathbb{E}}\Big[\Lambda_k f(B)\, \tfrac{g(R_{k+1} - B)}{g(R_{k+1})} \,\Big|\, \mathcal{F}^o_{k+1}\Big]
= \bar{\mathbb{E}}\big[\Lambda_k f(B)\, g(R_{k+1} - B) \,\big|\, \mathcal{F}^o_{k+1}\big]\, \big(g(R_{k+1})\big)^{-1}
= \int_{\mathbb{R}^d} f(b)\, \tfrac{g(R_{k+1} - b)}{g(R_{k+1})}\, \pi_k(db),
$$
where we use in the last equality the fact that $R_{k+1}$ is independent of $B$ under $\bar{\mathbb{P}}$ (recall Proposition 3.1). By identification, we obtain the expected relation. □

A.3 Proof of Lemma 3.3

Since the support of the probability distribution $\nu$ of $\epsilon_k$ is $\mathbb{R}^d$, we notice that the law of the random vector $Y_k := e^{R_k} - \mathbf{1}_d$ has support equal to $(-1, \infty)^d$. Recall from (7) that $a \in \mathcal{A}^q_k(x,z)$ iff
$$
1 + a' Y_{k+1} \;\geq\; q \max\Big[\tfrac{z}{x},\, 1 + a' Y_{k+1}\Big], \quad a.s. \tag{21}
$$
(i) Take some $a \in \mathcal{A}^q_k(x,z)$, and assume that $a^i < 0$ for some $i \in [[1,d]]$. Let us then define the event $\Omega^i_M = \{Y^i_{k+1} \geq M,\ Y^j_{k+1} \in [0,1],\ j \neq i\}$, for $M > 0$, and observe that $\mathbb{P}[\Omega^i_M] > 0$. It follows from (21) that
$$
1 + a^i M + \sum_{j \neq i} |a^j| \;\geq\; q\,\tfrac{z}{x}, \quad \text{on } \Omega^i_M,
$$
which leads to a contradiction for $M$ large enough. This shows that $a^i \geq 0$ for all $i \in [[1,d]]$, i.e. $\mathcal{A}^q_k(x,z) \subset \mathbb{R}^d_+$.

(ii) For $\varepsilon \in (0,1)$, define the event $\Omega_\varepsilon = \{Y^i_{k+1} \leq -1 + \varepsilon,\ i = 1, \dots, d\}$, which satisfies $\mathbb{P}[\Omega_\varepsilon] > 0$. For $a \in \mathcal{A}^q_k(x,z)$, we get from (21), and since $a \in \mathbb{R}^d_+$ by Step (i):
$$
1 - (1 - \varepsilon)\, a' \mathbf{1}_d \;\geq\; q\,\tfrac{z}{x}, \quad \text{on } \Omega_\varepsilon.
$$
By taking $\varepsilon$ small enough, this shows by a contradiction argument that
$$
\mathcal{A}^q_k(x,z) \;\subset\; \Big\{a \in \mathbb{R}^d_+ : 1 - a'\mathbf{1}_d \geq q\,\tfrac{z}{x}\Big\} \;=:\; \tilde{\mathcal{A}}^q(x,z). \tag{22}
$$
(iii) Let us finally check the equality in (22). Fix some $a \in \tilde{\mathcal{A}}^q(x,z)$. Since the random vector $Y_{k+1}$ is valued in $(-1, \infty)^d$, it is clear that
$$
1 + a' Y_{k+1} \;\geq\; 1 - a' \mathbf{1}_d \;\geq\; q\,\tfrac{z}{x} \;\geq\; 0, \quad a.s.,
$$
and thus, since $q \leq 1$, $1 + a' Y_{k+1} \geq q\,[1 + a' Y_{k+1}]$ a.s., which proves (21), hence the equality $\mathcal{A}^q_k(x,z) = \tilde{\mathcal{A}}^q(x,z)$. □

A.4 Proof of Lemma 3.4

Fix $q_1 \leq q_2$ and $(x,z) \in \mathcal{S}^{q_2} \subset \mathcal{S}^{q_1}$. We then have
$$
a \in \mathcal{A}^{q_2}(x,z) \;\Longrightarrow\; a \in \mathbb{R}^d_+ \text{ and } a'\mathbf{1}_d \leq 1 - q_2\,\tfrac{z}{x} \leq 1 - q_1\,\tfrac{z}{x} \;\Longrightarrow\; a \in \mathcal{A}^{q_1}(x,z),
$$
which means that $\mathcal{A}^{q_2}(x,z) \subseteq \mathcal{A}^{q_1}(x,z)$.

Fix $q \in (0,1)$ and set $q_n = q + \tfrac{1}{n}$, $n \in \mathbb{N}^*$. For any $(x,z) \in \mathcal{S}^{q_n}$, we then have $\mathcal{A}^{q_n}(x,z) \subseteq \mathcal{A}^{q_{n+1}}(x,z) \subset \mathcal{A}^{q}(x,z)$, which implies that the sequence of increasing sets $\mathcal{A}^{q_n}(x,z)$ admits a limit equal to
$$
\lim_{n \to \infty} \mathcal{A}^{q_n}(x,z) = \bigcup_{n \geq 1} \mathcal{A}^{q_n}(x,z) = \mathcal{A}^{q}(x,z),
$$
since $\lim_{n \to \infty} q_n = q$. This shows the right continuity of $q \mapsto \mathcal{A}^q(x,z)$. Similarly, by considering the increasing sequence $q_n = q - \tfrac{1}{n}$, $n \in \mathbb{N}^*$, we see that for any $(x,z) \in \mathcal{S}^q$, the sequence of decreasing sets $\mathcal{A}^{q_n}(x,z)$ admits a limit equal to
$$
\lim_{n \to \infty} \mathcal{A}^{q_n}(x,z) = \bigcap_{n \geq 1} \mathcal{A}^{q_n}(x,z) = \mathcal{A}^{q}(x,z),
$$
since $\lim_{n \to \infty} q_n = q$. This proves the continuity in $q$ of the set $\mathcal{A}^q(x,z)$.

Fix $q \in (0,1)$ and $(x_1,z), (x_2,z) \in \mathcal{S}^q$ s.t. $x_1 \leq x_2$. Then,
$$
a \in \mathcal{A}^{q}(x_1,z) \;\Longrightarrow\; a \in \mathbb{R}^d_+ \text{ and } a'\mathbf{1}_d \leq 1 - q\,\tfrac{z}{x_1} \leq 1 - q\,\tfrac{z}{x_2} \;\Longrightarrow\; a \in \mathcal{A}^{q}(x_2,z),
$$
which shows that $\mathcal{A}^q(x_1,z) \subseteq \mathcal{A}^q(x_2,z)$.

Fix $q \in (0,1)$ and $(x,z) \in \mathcal{S}^q$. Then, for any $a_1$, $a_2$ of the set $\mathcal{A}^q(x,z)$, and $\beta \in (0,1)$, the allocation $a = \beta a_1 + (1-\beta) a_2 \in \mathbb{R}^d_+$ satisfies
$$
a'\mathbf{1}_d = \beta\, a_1'\mathbf{1}_d + (1-\beta)\, a_2'\mathbf{1}_d \leq \beta\Big(1 - q\,\tfrac{z}{x}\Big) + (1-\beta)\Big(1 - q\,\tfrac{z}{x}\Big) = 1 - q\,\tfrac{z}{x}.
$$
This proves the convexity of the set $\mathcal{A}^q(x,z)$. The homogeneity property of $\mathcal{A}^q(x,z)$ is obvious from its very definition. □

A.5 Proof of Proposition 3.5

We prove the result by backward induction on time $k$ from the dynamic programming equation for the value function.
• At time $N$, we have for all $\lambda > 0$:
$$
v_N(\lambda x, \lambda z, \mu) = \frac{(\lambda x)^p}{p} = \lambda^p\, v_N(x, z, \mu),
$$
which shows the required homogeneity property.
• Now, assume that the homogeneity property holds at time $k+1$, i.e. $v_{k+1}(\lambda x, \lambda z, \mu) = \lambda^p v_{k+1}(x, z, \mu)$ for any $\lambda > 0$. Then, from the backward relation (9), and the homogeneity property of $\mathcal{A}^q(x,z)$ in Lemma 3.4, it is clear that $v_k$ inherits from $v_{k+1}$ the homogeneity property. □

A.6 Proof of Lemma 3.6

We first show by backward induction that $r \mapsto w_k(r, \cdot)$ is nondecreasing on $[q, 1]$ for all $k \in [[0, N]]$.
• For any $r_1, r_2 \in [q, 1]$ with $r_1 \leq r_2$, and $\mu \in \mathcal{M}_+$, we have at time $N$
$$
w_N(r_1, \mu) = U(r_1)\, \mu(\mathbb{R}^d) \leq U(r_2)\, \mu(\mathbb{R}^d) = w_N(r_2, \mu).
$$
This shows that $w_N(r, \cdot)$ is nondecreasing on $[q, 1]$.
• Now, suppose by induction hypothesis that $r \mapsto w_{k+1}(r, \cdot)$ is nondecreasing. Denoting by $Y_k := e^{R_k} - \mathbf{1}_d$ the random vector valued in $(-1, \infty)^d$, we see that for all $a \in \mathcal{A}^q(r_1)$
$$
\min\big[1,\, r_1(1 + a' Y_{k+1})\big] \leq \min\big[1,\, r_2(1 + a' Y_{k+1})\big], \quad a.s.,
$$
since $1 + a' Y_{k+1} \geq 1 - a'\mathbf{1}_d \geq \tfrac{q}{r_1} \geq 0$. Therefore, from the backward dynamic programming equation (11), and noting that $\mathcal{A}^q(r_1) \subset \mathcal{A}^q(r_2)$, we have
$$
w_k(r_1, \mu) = \sup_{a \in \mathcal{A}^q(r_1)} \mathbb{E}\Big[w_{k+1}\big(\min[1, r_1(1 + a'Y_{k+1})],\, \bar{g}(R_{k+1} - \cdot)\,\mu\big)\Big]
\leq \sup_{a \in \mathcal{A}^q(r_2)} \mathbb{E}\Big[w_{k+1}\big(\min[1, r_2(1 + a'Y_{k+1})],\, \bar{g}(R_{k+1} - \cdot)\,\mu\big)\Big] = w_k(r_2, \mu),
$$
which shows the required nondecreasing property at time $k$.

We now prove the concavity of $r \in [q,1] \mapsto w_k(r, \cdot)$ by backward induction for all $k \in [[0, N]]$. For $r_1, r_2 \in [q, 1]$ and $\lambda \in (0,1)$, set $r = \lambda r_1 + (1 - \lambda) r_2$, and for $a_1 \in \mathcal{A}^q(r_1)$, $a_2 \in \mathcal{A}^q(r_2)$, set $a = \big(\lambda r_1 a_1 + (1 - \lambda) r_2 a_2\big)/r$, which belongs to $\mathcal{A}^q(r)$. Indeed, since $a_1, a_2 \in \mathbb{R}^d_+$, we have $a \in \mathbb{R}^d_+$, and
$$
a'\mathbf{1}_d = \Big(\frac{\lambda r_1 a_1 + (1 - \lambda) r_2 a_2}{r}\Big)'\mathbf{1}_d \leq \frac{\lambda r_1}{r}\Big(1 - \frac{q}{r_1}\Big) + \frac{(1 - \lambda) r_2}{r}\Big(1 - \frac{q}{r_2}\Big) = 1 - \frac{q}{r}.
$$
• At time $N$, for fixed $\mu \in \mathcal{M}_+$, we have
$$
w_N\big(\lambda r_1 + (1 - \lambda) r_2, \mu\big) = U\big(\lambda r_1 + (1 - \lambda) r_2\big)\,\mu(\mathbb{R}^d) \geq \lambda\, w_N(r_1, \mu) + (1 - \lambda)\, w_N(r_2, \mu),
$$
since $U$ is concave. This shows that $w_N(r, \cdot)$ is concave on $[q, 1]$.
• Suppose now the induction hypothesis holds true at time $k+1$: $w_{k+1}(r, \cdot)$ is concave on $[q, 1]$. Then
$$
\lambda\, \mathbb{E}\Big[w_{k+1}\big(\min[1, r_1(1 + a_1'Y_{k+1})],\, \bar{g}(R_{k+1} - \cdot)\,\mu\big)\Big] + (1 - \lambda)\, \mathbb{E}\Big[w_{k+1}\big(\min[1, r_2(1 + a_2'Y_{k+1})],\, \bar{g}(R_{k+1} - \cdot)\,\mu\big)\Big]
$$
$$
\leq\; \mathbb{E}\Big[w_{k+1}\big(\lambda \min[1, r_1(1 + a_1'Y_{k+1})] + (1 - \lambda)\min[1, r_2(1 + a_2'Y_{k+1})],\, \bar{g}(R_{k+1} - \cdot)\,\mu\big)\Big]
\;\leq\; \mathbb{E}\Big[w_{k+1}\big(\min[1, r(1 + a'Y_{k+1})],\, \bar{g}(R_{k+1} - \cdot)\,\mu\big)\Big] \;\leq\; w_k(r, \mu),
$$
where the first inequality uses the induction hypothesis, and the second the concavity of $x \mapsto \min(1, x)$ joint with the nondecreasing monotonicity of $r \mapsto w_{k+1}(r, \cdot)$. Taking the supremum over $a_1 \in \mathcal{A}^q(r_1)$ and $a_2 \in \mathcal{A}^q(r_2)$ in the left-hand side yields $\lambda w_k(r_1, \mu) + (1 - \lambda) w_k(r_2, \mu) \leq w_k(r, \mu)$. This shows the required inductive concavity property of $r \mapsto w_k(r, \cdot)$ on $[q, 1]$. □

References

Bachouch, A., C. Huré, N. Langrené, and H. Pham (2018a). Deep neural networks algorithms for stochastic control problems on finite horizon, part 1: convergence analysis. arXiv:1812.04300.

Bachouch, A., C. Huré, N. Langrené, and H. Pham (2018b). Deep neural networks algorithms for stochastic control problems on finite horizon, part 2: numerical applications. arXiv:1812.05916, to appear in Methodology and Computing in Applied Probability.

Bismuth, A., O. Guéant, and J. Pu (2019). Portfolio choice, portfolio liquidation, and portfolio transition under drift uncertainty.
Mathematics and Financial Economics 13(4), 661–719.

Boyd, S., E. Lindström, H. Madsen, and P. Nystrup (2019). Multi-period portfolio selection with drawdown control. Annals of Operations Research 282(1-2), 245–271.

Cvitanić, J. and I. Karatzas (1994). On portfolio optimization under "drawdown" constraints. IMA Lecture Notes in Mathematics & Applications 65.

Cvitanić, J., A. Lazrak, L. Martellini, and F. Zapatero (2006). Dynamic portfolio choice with parameter uncertainty and the economic value of analysts' recommendations. The Review of Financial Studies 19(4), 1113–1156.

De Franco, C., J. Nicolle, and H. Pham (2019a). Bayesian learning for the Markowitz portfolio selection problem. International Journal of Theoretical and Applied Finance 22(07).

De Franco, C., J. Nicolle, and H. Pham (2019b). Dealing with drift uncertainty: a Bayesian learning approach. Risks 7(1), 5.

Elie, R. and N. Touzi (2008). Optimal lifetime consumption and investment under a drawdown constraint. Finance and Stochastics 12(3), 299.

Elliott, R. J., L. Aggoun, and J. B. Moore (2008). Hidden Markov Models: Estimation and Control. Springer.

Géron, A. (2019). Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems. O'Reilly Media.

Grossman, S. J. and Z. Zhou (1993). Optimal investment strategies for controlling drawdowns. Mathematical Finance 3(3), 241–276.

Hornik, K. (1991). Approximation capabilities of multilayer feedforward networks. Neural Networks 4(2), 251–257.

Kalman, R. E. (1960). A new approach to linear filtering and prediction problems. Transactions of the ASME–Journal of Basic Engineering 82, 35–45.

Kalman, R. E. and R. S. Bucy (1961). New results in linear filtering and prediction theory. Journal of Basic Engineering 83(1), 95–108.

Karatzas, I. and X. Zhao (2001). Bayesian adaptive portfolio optimization. In Option Pricing, Interest Rates and Risk Management. Cambridge University Press.

Keppo, J., H. M. Tan, and C. Zhou (2018). Investment decisions and falling cost of data analytics.

Lakner, P. (1998). Optimal trading strategy for an investor: the case of partial information. Stochastic Processes and their Applications 76(1), 77–97.

Redeker, I. and R. Wunderlich (2018). Portfolio optimization under dynamic risk constraints: Continuous vs. discrete time trading. Statistics & Risk Modeling 35(1-2), 1–21.

Rogers, L. C. G. (2001). The relaxed investor and parameter uncertainty.