Dirichlet policies for reinforced factor portfolios
Eric André ∗ Guillaume Coqueret † November 13, 2020
Abstract
This article aims to combine factor investing and reinforcement learning (RL). The agent learns through sequential random allocations which rely on firms' characteristics. Using Dirichlet distributions as the driving policy, we derive closed forms for the policy gradients and analytical properties of the performance measure. This enables the implementation of REINFORCE methods, which we perform on a large dataset of US equities. Across a large range of implementation choices, our results indicate that RL-based portfolios are very close to the equally-weighted (1/N) allocation. This implies that the agent learns to be agnostic with regard to factors. This is partly consistent with cross-sectional regressions showing a strong time variation in the relationship between returns and firm characteristics.

Keywords: Reinforcement learning; Factor investing; Equally-weighted portfolio; Asset pricing.
JEL classifications : C38; G11; G12
1. Introduction
The traditional econometric approaches to asset pricing have recently seen a surge in competition from machine learning tools. A flow of recent studies has shown the benefits that can be reaped when switching from the conventional linear models to more complex structures such as tree methods or neural networks. Supervised learning algorithms help the econometrician link financial performance (asset returns) to key indicators such as firm characteristics (Gu et al. (2020b)) or latent factors (Kelly et al. (2019), Lettau and Pelger (2020a,b)). Based on large datasets, these black boxes reveal intricate correlations between variables that are not captured by standard linear models, thereby (often) improving cross-sectional fit. Depending on the quality of the sample and the algorithm's architecture, these correlations may nonetheless be spurious. They are likely to hold out-of-sample only if they reflect genuine causality relationships, which are much harder to uncover.

Beyond supervised learning, researchers have resorted to another powerful family of techniques to understand and predict returns: reinforcement learning (RL). Contributions range from the early tests of Neuneier (1996) and Moody et al. (1998) to the more recent work of Deng et al. (2016), Li et al. (2019) and Wang and Zhou (2020). Two common threads between these studies are that they often originate from the field of computer science and that they work with price data only (at high frequencies most of the time). To the best of our knowledge, there are no contributions that seek to harvest the information contained in firm-specific attributes and combine it with reinforcement routines to produce factor-based portfolios. One goal of the present paper is to fill this void.

The main challenge when implementing RL algorithms for the purpose of trading is the modelling of the environment. The infinite dimensions of the state space (firm attributes) and action space (investment policies) make many approaches relying on Markov decision processes (MDP) inadequate. This is because a focal tool in MDP analysis is the value function, which measures the expected gain or reward for any given action or state. In the framework of factor investing, these states and actions cannot be properly discretized without either making overly simplistic assumptions, or rendering the computations intractable.

In order to bypass these technical hurdles, one solution is to resort to the so-called policy gradient approach. In this case, the decisions are made according to a parametric function which probabilistically determines which actions (i.e., investments) to perform. The agent then learns by sequentially updating the policy parameters after receiving flows of rewards (e.g., returns). Most of the time, the policy is modelled by neural networks (NNs), which is a convenient choice, given their flexibility. It is for instance the option chosen by Deng et al. (2016) and Zhang et al. (2019).

*EMLYON Business School, 23 avenue Guy de Collongue, 69130 Ecully, FRANCE. E-mail: [email protected]
†EMLYON Business School, 23 avenue Guy de Collongue, 69130 Ecully, FRANCE. E-mail: [email protected]
Footnote: See for instance Chen et al. (2019), Feng et al. (2019), Gu et al. (2020a) and Gu et al. (2020b). We refer to Pearl (2009) for an exhaustive treatment of causal models and to Arjovsky et al. (2019) and Pfister et al. (2019) for recent perspectives on causality in machine learning models.
One drawback of general-purpose NNs is that their output cannot be directly translated into portfolio weights, because it violates the budget constraint. The core idea of the present paper is to resort to a special class of distributions that circumvents this issue by directly yielding the investment allocations.

Indeed, Dirichlet distributions have the opportune property of being defined on simplexes, which makes them appropriate to model long-only portfolio compositions. In fact, Dirichlet distributions have already been used in related studies. Cover and Ordentlich (1996) find that two such distributions yield portfolio allocations with interesting theoretical properties. More recently, Le Courtois and Xu (2019) rely on Dirichlet distributions to derive robust estimates of the efficient frontier, and Korsos (2013) uses them to estimate the composition of hedge fund portfolio holdings. In a similar vein, Sosnovskiy (2015) shows that Dirichlet laws can be used to approximate the distribution of stock weights in aggregate market indices.

One of the simple but novel contributions of the paper is to link the RL policy to firm-specific attributes. To this purpose, the inspiration comes from earlier work on characteristics-based investing. The idea is to map a linear combination of the characteristics into portfolio weights. While the traditional models aim to optimize expected utility functions, our approach seeks to maximize expected gains. The simplest definition of gain is a portfolio return, but it is possible to adjust it to risk via the sequential Sharpe ratio computations presented in Moody et al. (1998).

Footnote: These references are by no means an exhaustive account of the literature on this subject. On the arXiv repository alone, more than 20 papers including the terms "reinforcement learning" in their title were posted in the quantitative finance (q-fin) section in 2019. We also direct the reader to the survey of Sato (2019) for more references on RL applied to portfolio optimization.
Footnote: See, e.g., Haugen and Baker (1996), Daniel and Titman (1997), Brandt et al. (2009), Hjalmarsson and Manchev (2012) and Ammann et al. (2016). Our approach is closer in spirit to the most recent of these references.
Our contribution is threefold. First, we propose a tractable formulation of the reinforcement learning problem when designing portfolio allocations based on firm-specific attributes. To the best of our knowledge, our approach is the first to articulate the combination of factor investing and RL in such a simple fashion. Second, we employ our methodology on a large dataset of US equities. Our results are qualitatively homogeneous, despite the numerous degrees of freedom in the implementation, and they indicate that the agent is better off ignoring the informational content provided by firm-specific attributes. Finally, we compare the learning process to a simple factor-based quadratic optimization. The two are hard to reconcile, except for one salient stylized fact: both methods recognize a strong common factor within the cross-section of stock returns. Consequently, portfolios allocate almost uniformly across assets.

The paper is structured as follows. In Section 2, we lay out the theoretical foundations of RL-based factor investing. Section 3 is dedicated to a detailed presentation of the dataset and the implementation protocol. Our empirical results are outlined in Section 4. In Section 5, we compare our baseline findings to those stemming from a more classical asset pricing model. Finally, Section 6 concludes.
2. Reinforcement learning meets factor investing
This section is dedicated to the presentation of all concepts and theoretical apparatus developed and required in the paper.
We study a dynamic discrete-time investment problem with finite horizon $T$. The investable universe consists of $N$ assets indexed by $n = 1, \dots, N$. There are $K$ characteristics associated to each asset. We refer to Section 3.1 for a list of those retained in the empirical section of this study. To allow for a bias or non-zero intercept in our model, we add a constant characteristic equal to 1. Therefore, at time $t \in \{0, 1, \dots, T\}$, asset $n$ is described by a $(K+1)$-dimensional vector $\mathbf{x}_{t,n} = \left[x^{(0)}_{t,n} \; \dots \; x^{(k)}_{t,n} \; \dots \; x^{(K)}_{t,n}\right]^\top$, where $x^{(0)}_{t,n} = 1$ is an indicator that is kept fixed through the cross-section of assets.

Among these characteristics are $p_{t,n}$, the time-$t$ price of asset $n$, and $d_{t,n}$, the dividend per share issued between times $t-1$ and $t$. The total return of asset $n$ between $t-1$ and $t$ is therefore
$$r_{t,n} = \frac{p_{t,n} + d_{t,n}}{p_{t-1,n}} - 1. \qquad (1)$$
In our setting, we can work with price returns (omitting dividends) or total returns interchangeably, as they are simply two different drivers of rewards for the investor. We use the bold notations $\mathbf{r}_t$ for the vector of the returns of all assets at time $t$ and $\mathbf{X}_t$ for the $N \times (K+1)$ matrix of characteristics at time $t$:
$$\mathbf{X}_t = \begin{bmatrix} (\mathbf{x}_{t,1})^\top \\ \vdots \\ (\mathbf{x}_{t,n})^\top \\ \vdots \\ (\mathbf{x}_{t,N})^\top \end{bmatrix}.$$
We denote by $\mathcal{M}$ the set of these $N \times (K+1)$ matrices whose first column is $\mathbf{1}$, the vector of ones.

The agent posits a factor model for the returns of the assets
$$\mathbf{r}_{t+1} = f(\mathbf{X}_t) + \boldsymbol{\epsilon}_{t+1}, \qquad (2)$$
where $f$ is a function from $\mathcal{M}$ to $\mathbb{R}^N$ and $\boldsymbol{\epsilon}_t$ is an i.i.d. white noise with mean vector equal to $\mathbf{0}$ and a diagonal covariance matrix $\boldsymbol{\Sigma}_\epsilon$. The diagonal elements of $\boldsymbol{\Sigma}_\epsilon$ are the idiosyncratic variances of the assets, $\sigma^2_n$. Let $P_\epsilon$ denote the law of the r.v. $\boldsymbol{\epsilon}_t$, defined on $\mathbb{R}^N$.

A standard assumption in the finance literature is that the function $f$ is a linear map that can be represented by a $(K+1)$ vector $\boldsymbol{\beta}$, that is, $f(\mathbf{X}_t) = \mathbf{X}_t\boldsymbol{\beta}$. Note however that we do not need this assumption in our study.

We assume that the investment problem can be formulated as a finite horizon Markov Decision Process (MDP). At each time $t$, the agent observes the state $S_t$ of the system (the characteristics of the investable universe and of her portfolio) and then takes an action $A_t$ (a choice of a composition for her portfolio). Finally, the agent obtains a time-$(t+1)$ reward, which is linked to the return of her portfolio between $t$ and $t+1$, and the system transitions to the next state. We now describe formally this MDP.
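To fix ideas, the following sketch simulates one step of this setting under the linear specification $f(\mathbf{X}_t) = \mathbf{X}_t\boldsymbol{\beta}$. It is a minimal illustration and not the paper's code: the dimensions and the distributions of the characteristics, loadings and shocks are placeholder assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
N, K = 100, 12  # assets and characteristics (as in the empirical section)

# Characteristics matrix X_t: a leading constant column x^(0) = 1,
# plus K predictors scaled to [-0.5, 0.5] (see Section 3.1)
X_t = np.hstack([np.ones((N, 1)), rng.uniform(-0.5, 0.5, size=(N, K))])

# Linear factor model of Equation (2): r_{t+1} = X_t beta + eps_{t+1}
beta = rng.normal(0.0, 0.01, size=K + 1)  # hypothetical loadings
eps = rng.normal(0.0, 0.05, size=N)       # i.i.d. zero-mean idiosyncratic shocks
r_next = X_t @ beta + eps                 # vector of next-period returns
```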
Actions. The action $A_t$ taken by the agent at time $t$ is the choice of a vector $\mathbf{w}_t \in \mathbb{R}^N$, which is the composition of her portfolio. We consider the case where there is no short selling. The restriction to positive weights is realistic since most asset managers have long-only constraints. This is typically the case of institutional investors (see Koijen and Yogo, 2019). Therefore $\mathbf{w}_t$ must be in the $(N-1)$-simplex, which is the action space:
$$\Delta = \left\{ (w_1, \dots, w_N) \in \mathbb{R}^N : \sum_{n=1}^N w_n = 1 \text{ and } w_n \geq 0 \;\; \forall n \right\}. \qquad (3)$$
Seen as a subset of $\mathbb{R}^{N-1}$, it is endowed with the inherited Borel $\sigma$-algebra, which we denote $\mathcal{B}(\Delta)$.
Rewards. The agent's objective is to maximise her utility of, or some performance measure of, the terminal value of her portfolio $V_T = V_0 + \sum_{t=0}^{T-1} V_t \rho_{t+1}$, where $\rho_{t+1} := \mathbf{w}_t^\top \mathbf{r}_{t+1}$ is the return of her portfolio between $t$ and $t+1$. We will consider two cases: a risk-insensitive agent who seeks to maximize her profit, and a risk-sensitive agent whose goal is to maximize the differential Sharpe ratio proposed by Moody et al. (1998).

In the first case, the agent's reward at time $t$ for the action taken at time $t-1$ is
$$R_t = \rho_t. \qquad (4)$$
In the second case, it is
$$R_t := SR_t = \frac{\hat{\mu}_t}{K_\kappa \sqrt{\hat{\sigma}_t - \hat{\mu}_t^2}}, \qquad (5)$$
where
$$\hat{\mu}_t = \kappa \rho_t + (1-\kappa)\hat{\mu}_{t-1} \quad \text{is the exponentially weighted (EW) moving average of returns,}$$
$$\hat{\sigma}_t = \kappa \rho_t^2 + (1-\kappa)\hat{\sigma}_{t-1} \quad \text{is the EW moving average of squared returns, and}$$
$$K_\kappa = \sqrt{\frac{1-\kappa/2}{1-\kappa}} \quad \text{is a scaling factor.}$$
We underline that we use uppercase $R_t$ for rewards and lowercase $r_t$ for returns. The two are closely linked, but not equal.

Footnote: See, e.g., Bäuerle and Rieder (2011).
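For concreteness, a minimal sketch of the reward recursion in Equation (5) follows. The value of $\kappa$ and the initialization of the moving averages are placeholder assumptions; in practice the first few steps require a guard against a near-zero variance estimate $\hat{\sigma}_t - \hat{\mu}_t^2$.

```python
import numpy as np

def sharpe_reward(rho, mu_prev, sigma_prev, kappa=0.1):
    """One update of the EW Sharpe ratio reward of Equation (5).

    rho: realized portfolio return rho_t; mu_prev and sigma_prev are the
    previous EW moving averages of returns and of squared returns.
    """
    mu = kappa * rho + (1 - kappa) * mu_prev
    sigma = kappa * rho ** 2 + (1 - kappa) * sigma_prev
    k_kappa = np.sqrt((1 - kappa / 2) / (1 - kappa))    # scaling factor K_kappa
    reward = mu / (k_kappa * np.sqrt(sigma - mu ** 2))  # R_t := SR_t
    return reward, mu, sigma
```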
States. As we want the agent to choose an action using the asset characteristics, $\mathbf{X}_t$ must be included in the state. The chosen reward defines which data must be added to the state: for the risk-insensitive agent $\rho_t$ is enough; for the differential Sharpe ratio, we should also include $\hat{\mu}_t$ and $\hat{\sigma}_t$. In the latter case, it is still $\rho_t$ that deterministically drives the additional data, therefore we will henceforth consider, without loss of generality, the case where a state corresponds to the couple $S_t = (\rho_t, \mathbf{X}_t)$.

The state space is then $\mathcal{S} = \mathbb{R} \times \mathcal{M}$. The set $\mathcal{M}$, seen as a subset of the space of the $N \times (K+1)$ matrices, which can be identified with $\mathbb{R}^{N \times (K+1)}$, is endowed with the inherited product of Borel $\sigma$-algebras that we denote $\mathcal{B}(\mathcal{M})$. The state space itself is endowed with the product $\sigma$-algebra $\mathcal{B}(\mathcal{S}) = \mathcal{B}(\mathbb{R}) \otimes \mathcal{B}(\mathcal{M})$.
Episodes. The agent, having observed the state of the system, takes an action; then the system transitions to the next state and the agent receives a reward. A given realization of this interaction between the agent and her environment is an episode:
$$S_0, A_0, R_1, S_1, A_1, R_2, S_2, A_2, \dots, S_{T-1}, A_{T-1}, R_T, S_T.$$
At any date $t$, the cumulative discounted return can be computed. It is the sum of the future rewards in this episode, possibly discounted at a discount rate $0 < \gamma \leq 1$:
$$G_t = \sum_{l=1}^{T-t} \gamma^{l-1} R_{t+l} = R_{t+1} + \gamma G_{t+1}.$$
For the differential Sharpe ratio, this additive structure is obtained by construction. In the case of the risk-insensitive agent, let us rewrite the growth of the portfolio from $t$ to $T$ as $V_T = V_t \prod_{l=1}^{T-t}(1 + \rho_{t+l})$. Maximizing the profit on this horizon is the same as maximizing $\sum_{l=1}^{T-t} \log(1 + \rho_{t+l})$, which is approximately equal to $\sum_{l=1}^{T-t} \rho_{t+l} = \sum_{l=1}^{T-t} R_{t+l}$ when the horizon of returns is not too long (which implies that the magnitude of the portfolio returns $\rho_{t+l}$ is small).
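The backward recursion $G_t = R_{t+1} + \gamma G_{t+1}$ translates directly into code; a small sketch:

```python
def discounted_gains(rewards, gamma=1.0):
    """Compute G_0, ..., G_{T-1} from an episode's rewards [R_1, ..., R_T]."""
    G, gains = 0.0, []
    for R in reversed(rewards):
        G = R + gamma * G        # G_t = R_{t+1} + gamma * G_{t+1}
        gains.append(G)
    return gains[::-1]           # gains[t] is G_t
```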
Transition probability. How the system transitions to the next state $S_{t+1}$ given some previous state $S_t$ and action $A_t$ is given by the state transition probability
$$\text{Prob}(S_{t+1} \in B \mid S_t, A_t), \quad B \in \mathcal{B}(\mathcal{S}).$$
We assume that the matrix of asset characteristics is a Markov process whose evolution is driven by the transition probabilities
$$P_t^u(M \mid X) = \text{Prob}(\mathbf{X}_u \in M \mid \mathbf{X}_t = X), \quad u > t.$$
When $B$ is a Cartesian product of Borel sets, $B = C \times M$, where $C \in \mathcal{B}(\mathbb{R})$ and $M \in \mathcal{B}(\mathcal{M})$, we obtain the factorization
$$\text{Prob}(S_{t+1} \in B \mid S_t, A_t) = \text{Prob}(\rho_{t+1} \in C \mid S_t, A_t) \cdot P_t^{t+1}(M \mid \mathbf{X}_t).$$
In our specific setting with a factor model, we have a transition function $\mathcal{T}$ that gives the next value of $\rho_{t+1}$ given the state and action at $t$. This is $\rho_{t+1} = \mathcal{T}(\mathbf{X}_t, \mathbf{w}_t, \boldsymbol{\epsilon}_{t+1}) = \mathbf{w}_t^\top(f(\mathbf{X}_t) + \boldsymbol{\epsilon}_{t+1})$. When $S_t$ and $A_t$ are known, the value of $\rho_{t+1}$ is driven by $\boldsymbol{\epsilon}_{t+1}$ and, conversely, if $r \in \mathbb{R}$, then $\mathcal{T}^{-1}(r \mid \mathbf{X}_t, \mathbf{w}_t)$ is the hyperplane orthogonal to the vector $\mathbf{w}_t$, translated by the vector $-f(\mathbf{X}_t)$ and by any vector $\boldsymbol{\alpha}$ such that $\mathbf{w}_t^\top \boldsymbol{\alpha} = r$. Finally, we can write
$$\text{Prob}(S_{t+1} \in C \times M \mid S_t, A_t) = P_\epsilon\left(\boldsymbol{\epsilon}_{t+1} \in \mathcal{T}^{-1}(C \mid \mathbf{X}_t, \mathbf{w}_t)\right) \cdot P_t^{t+1}(M \mid \mathbf{X}_t).$$
Of special interest for our study is the computation of future periods' expected returns given some state and action at date $t$. It is given by the following lemma.

Lemma 1.
$$\int_{\mathcal{S}} \rho_{t+l}\, \text{Prob}(ds \mid S_t = (\rho_t, \mathbf{X}_t), A_t = \mathbf{w}_t) = \begin{cases} \mathbf{w}_t^\top f(\mathbf{X}_t) & l = 1, \\ \int_{\mathcal{M}} \mathbb{E}_\theta[\mathbf{w}_{t+l-1} \mid \xi]^\top f(\xi)\, P_t^{t+l-1}(d\xi \mid \mathbf{X}_t) & l \geq 2, \end{cases}$$
where $\mathbb{E}_\theta[\mathbf{w}_{t+l-1} \mid \xi]$ is the vector expectation of the portfolio composition.

Proof. See Appendix A.1.
To allow for the exploration of all actions versus the exploitation of the optimal action, we will use a stochastic policy that gives the probability of choosing an action $A_t$ given the state $S_t$. Specifically, we will study policies $\pi_\theta$ defined on $\mathcal{B}(\Delta)$ with parameter $\theta = (\theta^{(0)}, \dots, \theta^{(K)})$. At each time step, we will draw from this distribution to select an action:
$$A_t = \mathbf{w}_t \sim \pi_\theta(\cdot \mid S_t).$$
More precisely, we are looking in this study for a policy that only takes into account the asset characteristics, hence we restrict ourselves to policies that take the form
$$\mathbf{w}_t \sim \pi_\theta(\cdot \mid \mathbf{X}_t).$$
We will use the notations $\mathbb{E}_\theta[\cdot]$ or $\mathbb{E}_\pi[\cdot \mid \theta]$ for the expectation under the policy $\pi_\theta$. When there is no ambiguity, we will omit to write that a result holds conditionally on a given policy.

Value function. The value function at $t$ of the state $S_t$ under the policy $\pi_\theta$ is the expected value of the cumulative discounted return from $t$ onward, when this policy is chosen to select the actions at each future time step:
$$V_\theta(t, S_t) = \mathbb{E}_\theta[G_t \mid S_t] = \sum_{l=1}^{T-t} \gamma^{l-1}\, \mathbb{E}_\theta[R_{t+l} \mid S_t].$$
To find the optimal policy, the standard tool is dynamic programming, for which the value function must satisfy the recursive Bellman equation (see Chapter 4 in Sutton and Barto (2018)). However, for the differential Sharpe ratio, it is known that the introduction of the variance in the reward renders the problem time-inconsistent. In this paper, we will use RL algorithms to explore the optimal policies. Nonetheless, in the case of the risk-insensitive agent, the problem can also be solved with dynamic programming, as the next result shows.

Proposition 2.
For the risk-insensitive agent, the time-$t$ expected values of the future rewards are given by
$$\mathbb{E}_\theta[R_{t+l} \mid S_t = (\rho_t, \mathbf{X}_t)] = \begin{cases} \mathbb{E}_\theta[\mathbf{w}_t \mid \mathbf{X}_t]^\top f(\mathbf{X}_t) & l = 1, \\ \int_{\mathcal{M}} \mathbb{E}_\theta[\mathbf{w}_{t+l-1} \mid \xi]^\top f(\xi)\, P_t^{t+l-1}(d\xi \mid \mathbf{X}_t) & l \geq 2. \end{cases}$$
The policy value satisfies the recursive Bellman equation
$$V_\theta(t, \mathbf{X}_t) = \mathbb{E}_\theta[R_{t+1} \mid S_t] + \int_{\mathcal{M}} V_\theta(t+1, \xi)\, P_t^{t+1}(d\xi \mid \mathbf{X}_t), \qquad V_\theta(T-1, \mathbf{X}_{T-1}) = \mathbb{E}_\theta[R_T \mid S_{T-1}].$$

Proof.
See Appendix A.2.
Performance measure.
The performance measure of the policy is its value from some initial state $S_0$: $J(\theta) = V_\theta(0, S_0)$. Our aim is to find a parameter of the policy that maximizes this performance measure:
$$\theta^* \in \underset{\theta}{\arg\max}\; J(\theta). \qquad (6)$$
Before taking on this task, we specify the parametrized form of the policy that we use in this study.

One of the main contributions of this paper is the use of Dirichlet distributions to define the policy of the agent. We find them particularly well suited for describing portfolio weights when short selling is proscribed. We first briefly recall the definition and some of the properties that are used thereafter.
Definition.
The Dirichlet distribution is defined on the $(N-1)$-simplex $\Delta_{N-1}$. It is parametrized by a vector $\mathbf{a} = \left[a_1\; a_2\; \cdots\; a_N\right]^\top$ of concentration parameters, where $a_n > 0$ for $n = 1, \dots, N$. We will use the notation $\sigma$ for the scale parameter
$$\sigma = \sum_{n=1}^N a_n = \mathbf{1}^\top \mathbf{a}.$$
Its probability density function is
$$f(w_1, \dots, w_N \mid \mathbf{a}) = \frac{1}{B(\mathbf{a})} \prod_{n=1}^N w_n^{a_n - 1},$$
where the normalizing constant is the multivariate Beta function, which can be written with the Gamma function as follows:
$$B(\mathbf{a}) = \frac{\prod_{n=1}^N \Gamma(a_n)}{\Gamma(\sigma)}.$$
The Lebesgue measure of $\Delta_{N-1}$ in $\mathbb{R}^N$ being 0, for the pdf to be meaningful and integrate to 1, it must be the density with respect to the Lebesgue measure on $\mathbb{R}^{N-1}$. Hence it is a function defined on $\mathbb{R}^{N-1}$ and we should write
$$f(w_1, \dots, w_{N-1} \mid \mathbf{a}) = \frac{1}{B(\mathbf{a})} \left(\prod_{n=1}^{N-1} w_n^{a_n - 1}\right)\left(1 - \sum_{n=1}^{N-1} w_n\right)^{a_N - 1}.$$
As a consequence, we say that a random vector $\mathbf{W} = [W_1 \cdots W_{N-1}]^\top$ of dimension $N-1$ has a Dirichlet distribution of order $N$ with concentration parameters $\mathbf{a} = [a_1\; a_2\; \cdots\; a_N]^\top$, and we write $\mathbf{W} \sim \text{Dir}_N(\mathbf{a})$. Alternatively, we could write
$$\left[W_1 \;\cdots\; W_{N-1} \;\; 1 - \sum_{n=1}^{N-1} W_n\right]^\top \sim \text{Dir}_N(\mathbf{a}).$$
In the case of a 1-dimensional vector, we write $W \sim \text{Dir}_2(a_1, a_2)$ or $[W \;\; 1-W]^\top \sim \text{Dir}_2(a_1, a_2)$; the density is then
$$f(w \mid a_1, a_2) = \frac{1}{B(a_1, a_2)}\, w^{a_1 - 1}(1-w)^{a_2 - 1},$$
that is, $W \sim \text{Beta}(a_1, a_2)$, and the Dirichlet distribution is sometimes called the multivariate Beta distribution.
Let $\mathbf{W} \sim \text{Dir}_N(\mathbf{a})$. The marginal distributions are Beta distributions: for $n = 1, \dots, N-1$,
$$W_n \sim \text{Beta}(a_n, \sigma - a_n),$$
and the two-dimensional marginal distributions are Dirichlet: for $1 \leq n < m \leq N-1$,
$$\left[W_n \;\; W_m\right]^\top \sim \text{Dir}_3(a_n, a_m, \sigma - a_n - a_m).$$
Let $\psi$ denote the digamma function, the derivative of the natural logarithm of the Gamma function: for $a > 0$,
$$\psi(a) = \frac{d}{da} \ln \Gamma(a) = \frac{\Gamma'(a)}{\Gamma(a)}.$$
The first equality below is well documented and we prove the three additional results in Appendix A.3.

Proposition 3.
Let $\mathbf{W} \sim \text{Dir}_N(\mathbf{a})$. Then, for $1 \leq n \leq N-1$,
$$\mathbb{E}[W_n] = \frac{a_n}{\sigma}, \qquad \mathbb{E}[\ln W_n] = \psi(a_n) - \psi(\sigma), \qquad \mathbb{E}[W_n \ln W_n] = \frac{a_n}{\sigma}\left(\psi(a_n) + \frac{1}{a_n} - \psi(\sigma) - \frac{1}{\sigma}\right),$$
and for $1 \leq m \leq N-1$, $n \neq m$,
$$\mathbb{E}[W_n \ln W_m] = \frac{a_n}{\sigma}\left(\psi(a_m) - \psi(\sigma) - \frac{1}{\sigma}\right).$$
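These identities are easy to verify numerically. The sketch below draws Dirichlet samples and compares Monte Carlo averages with the closed forms of Proposition 3; the concentration vector is an arbitrary choice on our part.

```python
import numpy as np
from scipy.special import digamma

rng = np.random.default_rng(0)
a = np.array([0.5, 1.2, 0.8, 2.0])    # arbitrary concentration parameters
sigma = a.sum()
W = rng.dirichlet(a, size=500_000)    # samples on the simplex

n, m = 0, 1
print(W[:, n].mean(), a[n] / sigma)                            # E[W_n]
print(np.log(W[:, n]).mean(), digamma(a[n]) - digamma(sigma))  # E[ln W_n]
print((W[:, n] * np.log(W[:, n])).mean(),                      # E[W_n ln W_n]
      a[n] / sigma * (digamma(a[n]) + 1 / a[n] - digamma(sigma) - 1 / sigma))
print((W[:, n] * np.log(W[:, m])).mean(),                      # E[W_n ln W_m]
      a[n] / sigma * (digamma(a[m]) - digamma(sigma) - 1 / sigma))
```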
In this paper, we study the policy for which the probability of choosing action $\mathbf{w}_t$ at time $t$ has the Dirichlet distribution with concentration parameters $\mathbf{a}_t = \left[a_{t,1}\; a_{t,2}\; \cdots\; a_{t,N}\right]^\top$, where $a_{t,n} > 0$ for all $n$. We posit that the concentration parameters are functions of the asset characteristics. Two possible forms are studied:
$$\mathbf{a}_t = \begin{cases} \mathbf{X}_t \theta_t & (F1) \\ e^{\mathbf{X}_t \theta_t} & (F2) \end{cases}. \qquad (7)$$
The first form is a simple linear combination which is highly tractable, but may violate the condition that $a_{t,n} > 0$ for some values of $\theta^{(k)}_t$. Indeed, during the learning process, an update in $\theta$ might yield values that are out of the feasible set of $\mathbf{a}_t$. In this case, it is possible to resort to a trick that is widely used in online learning (see, e.g., Section 2.3.1 in Hoi et al., 2018). The idea is simply to find the acceptable solution that is closest to the suggestion from the algorithm. If we call $\theta^*$ the result of an update rule from a given algorithm, then the closest feasible vector is
$$\theta = \underset{z \in \Theta(\mathbf{X}_t)}{\arg\min}\; ||\theta^* - z||, \qquad (8)$$
where $||\cdot||$ is the Euclidean norm and $\Theta(\mathbf{X}_t)$ is the feasible set, that is, the set of vectors $\theta$ such that the $a_{t,n} = \theta^{(0)}_t + \sum_{k=1}^K \theta^{(k)}_t x^{(k)}_{t,n}$ are all nonnegative. The second form of the policy is slightly more complex but remains always valid.

The combination of the Dirichlet distribution with time-varying weights $\mathbf{w}_t$ and parameters $\mathbf{a}_t$ defined above yields a policy $\pi := \pi_\theta$ that depends on exogenous characteristics $\mathbf{X}_t$ as well as $K+1$ parameters, stacked in the vector $\theta_t$. There is a very strong link between this formulation and other methods that link financial performance to firm-specific characteristics, like Brandt et al. (2009) and Ammann et al. (2016). One common feature is that for any $k \neq 0$, the parameter $\theta^{(k)}$ synthesizes the impact of feature $k$ on the whole cross-section of returns. If $\theta^{(k)}$ is positive (resp., negative), then, on average, the corresponding feature is expected to have a positive (resp., negative) effect on returns. The parameter $\theta^{(0)}$ is intended to reflect some idiosyncrasy that is not rendered by the characteristics but that is shared by all assets.

Note that if the idiosyncratic parameter $\theta^{(0)}$ is much larger than the sum $\sum_{k=1}^K \theta^{(k)}_t x^{(k)}_{t,n}$ for all $n$, we converge to the limiting case where $\mathbf{a}_t = \theta^{(0)}\mathbf{1}$. In this case the Dirichlet distribution is symmetric and the portfolio converges to the equally weighted portfolio, as $\mathbb{E}[W_n] = 1/N$ for all $n$.

2.5. The policy gradient method

The optimization problem (6) cannot be solved by dynamic programming when the reward is the differential Sharpe ratio. We thus search for an approximate solution using the method named Policy Gradient (Sutton and Barto, 2018, Chapter 13). This method can deal with the infinite state space $\mathcal{S}$ and seeks to learn a parametrized policy by updating the parameter via gradient ascent in $J$:
$$\theta_{t+1} = \theta_t + \alpha\, \widehat{\nabla J}(\theta_t), \qquad (9)$$
where $\widehat{\nabla J}(\theta_t)$ is a stochastic estimate of the gradient of the performance measure (with respect to $\theta_t$) and $\alpha \in (0,1)$ is a learning rate.
The core result when implementing policy gradient learning is the so-called Policy Gradient Theorem:
$$\nabla J(\theta_t) = \mathbb{E}_\pi\left[G_t\, \nabla \ln \pi(\mathbf{w}_t \mid S_t, \theta_t) \mid \mathbf{X}_t, \theta_t\right], \qquad (10)$$
which is incredibly convenient because the two terms in the expectation are disentangled. We refer to Section 13.3 in Sutton and Barto (2018) for a proof of this result. It is thus imperative to derive analytical expressions for $\nabla \ln \pi_\theta$.
Closed forms for the Dirichlet policy gradients. We recall the two policy options derived from (F1) and (F2) in (7):
$$\pi(\mathbf{w}_t \mid \mathbf{X}_t, \theta_t) = \frac{\Gamma(\sigma_t)}{\prod_{n=1}^N \Gamma(a_{t,n})} \prod_{n=1}^N w_{t,n}^{a_{t,n}-1}, \qquad (11)$$
where
$$a_{t,n} = \begin{cases} \mathbf{x}_{t,n} \theta_t & (F1) \\ \exp(\mathbf{x}_{t,n} \theta_t) & (F2) \end{cases} \quad \text{for all } 1 \leq n \leq N, \quad \text{and} \quad \sigma_t = \sum_{n=1}^N a_{t,n}. \qquad (12)$$

Proposition 4.
For a Dirichlet policy, the gradients are given by
$$\nabla \ln \pi(\mathbf{w}_t \mid \mathbf{X}_t, \theta_t) = \sum_{n=1}^N \left(\psi(\sigma_t) - \psi(a_{t,n}) + \ln w_{t,n}\right) \nabla a_{t,n},$$
where
$$\nabla a_{t,n} = \begin{cases} \mathbf{x}_{t,n}^\top & (F1) \\ \exp(\mathbf{x}_{t,n}\theta_t)\, \mathbf{x}_{t,n}^\top & (F2) \end{cases}.$$

Proof.
From equation (11) we obtain
$$\ln \pi(\mathbf{w}_t \mid \mathbf{X}_t, \theta_t) = \ln \Gamma(\sigma_t) - \sum_{n=1}^N \ln \Gamma(a_{t,n}) + \sum_{n=1}^N (a_{t,n} - 1) \ln w_{t,n}.$$
Then
$$\nabla \ln \pi(\mathbf{w}_t \mid \mathbf{X}_t, \theta_t) = \psi(\sigma_t) \sum_{n=1}^N \nabla a_{t,n} - \sum_{n=1}^N \psi(a_{t,n}) \nabla a_{t,n} + \sum_{n=1}^N \ln w_{t,n}\, \nabla a_{t,n} = \sum_{n=1}^N \left(\psi(\sigma_t) - \psi(a_{t,n}) + \ln w_{t,n}\right) \nabla a_{t,n},$$
where the gradients $\nabla a_{t,n}$ are computed for both cases given in equation (12). $\square$

It can be seen that the policy gradient is the weighted sum of the concentration parameters' gradients. The weight applied to each parameter's gradient is given by the difference between the actual value of the logarithm of the weight of asset $n$ in the portfolio and the expected value of this logarithm under the policy $\pi_{\theta_t}$. Indeed, using Proposition 3, we have
$$\psi(\sigma_t) - \psi(a_{t,n}) + \ln w_{t,n} = \ln w_{t,n} - \mathbb{E}_\pi[\ln w_{t,n} \mid \mathbf{X}_t, \theta_t]. \qquad (13)$$
By Proposition 3, the average weight of asset $n$ in the portfolio is
$$\mathbb{E}_\pi[w_{t,n} \mid S_t, \theta_t] = \frac{a_{t,n}}{\sigma_t}, \qquad (14)$$
hence, the gradient of the concentration parameter $\nabla a_{t,n}$ gives the direction in the parameter space that will increase the relative importance of asset $n$ in the portfolio. Therefore, learning with the policy gradient given by equation (10) will, when $G_t$ is positive, increase the expected weights at time $t+1$ of those assets which had, at time $t$, their realized log-weights above their expected values. In this way, the stochastic policy allows the agent to explore the action space through random deviations from the mean, which are reinforced if they generate a profit.
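Proposition 4 gives the score function in closed form, which is exactly what a REINFORCE implementation needs. A minimal sketch (our own illustration, not the authors' code) covering both specifications:

```python
import numpy as np
from scipy.special import digamma

def dirichlet_score(theta, X, w, form="F1"):
    """Closed-form policy gradient of Proposition 4.

    theta : (K+1,) policy parameters
    X     : (N, K+1) characteristics, first column = 1
    w     : (N,) sampled portfolio weights (a point on the simplex)
    Returns the (K+1,) gradient of ln pi(w | X, theta).
    """
    if form == "F1":
        a = X @ theta                # concentration parameters a_{t,n}
        grad_a = X                   # row n is the gradient of a_{t,n}
    else:                            # F2, exponential form
        a = np.exp(X @ theta)
        grad_a = a[:, None] * X      # chain rule: exp(x theta) * x
    sigma = a.sum()
    coeff = digamma(sigma) - digamma(a) + np.log(w)  # ln w - E[ln w], Eq. (13)
    return coeff @ grad_a
```

For the linear form (F1), the expression is valid as long as all $a_{t,n}$ are strictly positive, which is what the projection (8) enforces.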
Gradient of the performance measure. For the risk-insensitive agent, the gradient of the performance measure takes a simple form.
Proposition 5.
For the risk-insensitive agent, under a Dirichlet policy,
$$\nabla J(\theta_t) = \sum_{n=1}^N \left(\mathbb{E}[r_{t+1,n} \mid \mathbf{X}_t] - \mathbb{E}_\pi[R_{t+1} \mid \mathbf{X}_t, \theta_t]\right) \frac{\nabla a_{t,n}}{\sigma_t},$$
where we recall that $r_{t+1,n}$ is the return of asset $n$ between times $t$ and $t+1$ (while $R_{t+1}$ is the reward).

Proof. See Appendix A.4.

We see that the learning process is "myopic", as the one-step-ahead return is the only one taken into consideration for the update of the parameters. The learning process will, on average, increase (resp. decrease) the weights of those assets whose expected returns are higher (resp. lower) than the portfolio's expected return. This behavior could be expected from the risk-insensitive agent.
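Proposition 5 can also be checked by simulation: the REINFORCE estimate $\mathbb{E}_\pi[G_t \nabla\ln\pi]$ should match the closed form. Below is a small sketch under the linear policy (F1); the expected returns and the parameter values are arbitrary placeholders.

```python
import numpy as np
from scipy.special import digamma

rng = np.random.default_rng(1)
N, K = 5, 2
X = np.hstack([np.ones((N, 1)), rng.uniform(-0.5, 0.5, (N, K))])
theta = np.array([0.5, 0.1, -0.1])       # keeps all a_n positive here
a = X @ theta
sigma = a.sum()
mu_r = rng.normal(0.01, 0.02, N)         # assumed expected returns E[r_{t+1}]

# Closed form of Proposition 5: sum_n (E[r_n] - E[R]) grad(a_n) / sigma
closed = ((mu_r - a @ mu_r / sigma)[:, None] * X / sigma).sum(axis=0)

# Monte Carlo policy-gradient estimate: average of G * score over actions
W = rng.dirichlet(a, size=200_000)
G = W @ mu_r                             # one-step gain given the action
score = (digamma(sigma) - digamma(a) + np.log(W)) @ X
print(closed, (G[:, None] * score).mean(axis=0))  # the two should agree
```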
3. Data and protocols
In this section, we describe the data on which we carry out our empirical analysis. Additionally, we discuss many implementation issues related to the RL framework.

3.1. Data
The dataset comprises firms listed in the US between January 2000 and June 2020, downloaded from Bloomberg. The number of firms through time is depicted in Figure 1. Observations are sampled at a monthly frequency. Each stock is characterized by twelve attributes that correspond to documented predictors (accounting-based, risk-based and momentum-based). These variables are summarized in Table 1. We restrict our analysis to these twelve indicators to be able to easily comment on the associated values of $\theta$. These features naturally serve as the non-constant components of the $\mathbf{X}_t$ matrices in our model.

Fig. 1. Number of firms in the sample, through time.

The features (predictors) are cross-sectionally processed so that for a fixed $t$ and given predictor $j$, $x^{(j)}_t$ is uniformly distributed (over the $[-0.5, 0.5]$ interval) across firms; a minimal sketch of this transformation is given below. Scaling predictors is standard practice both in the machine learning literature and in some recent asset pricing models (e.g., Koijen and Yogo (2019), Kelly et al. (2019) and Freyberger et al. (2020)). In characteristics-based investing (e.g., in Brandt et al. (2009)), the indicators are for instance also demeaned and standardized.
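The sketch below shows one way to implement this rank-based uniformization; the tie-handling convention is an assumption on our part.

```python
import numpy as np
from scipy.stats import rankdata

def uniformize(raw):
    """Map one date's cross-section of a predictor to (approximately) U[-0.5, 0.5].

    raw : array of shape (N,), raw values of one characteristic at date t.
    """
    ranks = rankdata(raw)                  # average ranks in case of ties
    return (ranks - 0.5) / len(raw) - 0.5  # centered uniform grid in [-0.5, 0.5]
```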
Short name  Long name                                 Academic references (alphab. order)
cst         Constant                                  -
cap         Market Capitalization                     Banz (1981); Fama and French (1992)
pb          Price-to-Book ratio                       Asness et al. (2013); Fama and French (1992)
de          Debt-to-Equity ratio                      Barbee Jr et al. (1996); Bhandari (1988)
vol         Realized volatility in the past 30 days   Baker et al. (2011)
prof        Profitability                             Fama and French (2015)
inv         Asset growth                              Cooper et al. (2008); Fama and French (2015)
eps         Earnings per share                        Ball and Brown (1968, 2019)
liq         Trading volume                            Chordia and Swaminathan (2000)
rsi         Relative strength index                   Han et al. (2013)
pe          Price-earnings ratio                      Basu (1983); Easton (2004)
dy          Dividend yield                            Litzenberger and Ramaswamy (1982); Naranjo et al. (1998)
mom         12-1M momentum                            Asness et al. (2013); Jegadeesh and Titman (1993)

Table 1: List of predictors and associated academic references. The Bloomberg fields are, in order, CUR_MKT_CAP, PX_TO_BOOK_RATIO, TOT_DEBT_TO_TOT_EQY, VOLATILITY_30D, PROF_MARGIN, ASSET_GROWTH, IS_EPS, PX_VOLUME, RSI_30D, PE_RATIO. The dividend yield is evaluated as EQY_DPS divided by the lagged value of the closing price field PX_LAST. Momentum is also computed via the closing price (lagged 12-month value divided by lagged one-month value, minus one).

The learning process at the core of our method is the so-called REINFORCE algorithm, which is the most straightforward route towards the policy gradient approach (see Chapter 13 of Sutton and Barto (2018)). We briefly recall the steps in Table 2.
Step   REINFORCE
0      Inputs: a policy $\pi_\theta$, a discount rate $\gamma \in (0,1)$ and a learning rate $\eta \in (0,1)$
1      For i = 1, 2, ..., number of episodes, do:
2        Generate a sequence $S_0, A_0, R_1, \dots, S_{T-1}, A_{T-1}, R_T$, with actions driven by $\pi_\theta$
3        For t = 0, 1, ..., T-1, do:
4          $G \leftarrow \sum_{k=t+1}^T \gamma^{k-t-1} R_k$ (compute the gain)
5          $\theta \leftarrow \theta + \eta\gamma^t G\, \nabla\log(\pi_\theta)$ (update the parameters)

Table 2: Steps of the baseline REINFORCE algorithm.

In spite of its apparent simplicity, the REINFORCE algorithm leaves a lot of room for implementation design. Below, we discuss open options (highlighted in bold font) for the steps defined in Table 2:

- Step 0: the policy is defined by Equation (11) along with one of the specifications (F1) or (F2). The levels of the two rates $\gamma$ and $\eta$ will be discussed below.
- Step 1: the central question is: what is an episode? More precisely, do we need to impose a chronological ordering of events? In traditional RL, this is imperative because actions can have an impact on the environment (the states). This is rarely the case in finance, except when taking very large orders, which large institutions usually avoid to limit the odds of market shifts. The generation of the SARSA sequences in Step 2 can thus be either chronological indeed, or independently drawn from samples of features and returns, akin to bootstrapping. In the latter case, the discounting rate $\gamma$ would lose its meaning and should be set at one.
- Step 2 & 4: the definition of the reward $R_t$ is not unambiguous. A natural choice is to take raw returns. The most prominent extension is when returns need to be adjusted by some risk measure, like in the Sharpe ratio (SR). However, the computation of the rewards in this case is not straightforward. Luckily, Moody et al. (1998) provide a solution to this obstacle. Their idea is to sequentially update the reward using the exponential moving average SR given in Equation (5). With this convention, the first steps of the SARSA sequence rapidly yield a risk-adjusted reward.
- Step 5: this is not an option, but in the case of the linear policy (F1), the updated $\theta$ has to be adjusted so that the $a_n$ lie in the intervals discussed in Section B in the Appendix. Since we will work with bundles of $N = 100$ assets, we choose $a_- = 0.02$ and $a_+ = 1.6$. Thus, the feasible set in the projection (8) is
$$\Theta(\mathbf{X}_t) = \{\theta, \; a_- \leq \mathbf{X}_t\theta \leq a_+\}.$$

The above discussion gives rise to two dichotomies: chronological versus bootstrapped sequences, and return versus Sharpe ratio rewards. Below, we explain how to incorporate these design choices in a series of thorough backtests that rely on market data (and not on simulated samples).

Chronological sequences require temporal depth. Every January (time $t$), the preceding 12 months of data are gathered and the sequences will consist of portfolios held during each month of the sample. The number of episodes is $E$ and their length is 12. Each action at month $s$ ($A_s$) consists in randomly choosing (with replacement) $N$ assets and sampling their portfolio weights according to the current policy $\pi_\theta$. The weights depend both on $\theta$ and on the characteristics of the assets at month $s$. Rewards can either be returns, or Sharpe ratios, computed iteratively as defined in Equation (5).

Bootstrapped sequences do not require much depth. They can be performed every month. The number of episodes is $E$ and their length is one: the learning is performed over values that originate from the past month only. This could be relaxed, but it creates more reactive portfolios, as opposed to learning on chronological sequences. Again, actions $A_s$ consist in randomly choosing (with replacement) $N$ assets and sampling their portfolio weights according to the current policy $\pi_\theta$. Rewards can only be returns.

Both types of learning processes are summarized in Table 3. Because of the numerous degrees of freedom ($\gamma$ and $\eta$ rates, initialization values, random seeds, number of episodes, etc. - see Section 4.1 below), we restrict our study to two alternatives only. The first one links bootstrapped sequences with simple returns, while the second combines chronological sequences with Sharpe ratio rewards.

Chronological method:
For every January, do:
 1  Randomly pick N assets
 2  Extract data from previous year
 3  Initialize θ
 4  For i = 1, ..., E episodes, do:
 5    Sample N stocks randomly
 6    Generate streams A_t and R_t via π_θ
 7    Update θ via (9)
 8  For the next 12 months, do:
 9    Allocate via average policy, Eq. (14)
 10   Store realized returns

Bootstrap method:
For every date t = 2, ..., T-1, do:
 1  Randomly pick N assets
 2  Extract data from previous month
 3  Initialize θ
 4  For i = 1, ..., E episodes, do:
 5    Sample N stocks randomly
 6    Generate action and reward
 7    Update θ via (9)
 8  For date t+1, do:
 9    Allocate via average policy, Eq. (14)
 10   Store realized returns

Table 3: Macro view of backtest stages. The differences in the two REINFORCE implementations are outlined in Section 3.3.

4. Results
Before we move towards a presentation of our results, we expose the richness of the flexibility of the modelling approach. Below, we list the different choices we need to make to run one batch of learning over our whole dataset; a schematic implementation of the resulting learning loop is sketched after this list.

- Choices that we will always compare:
  1. Whether to learn from chronological sequences or bootstrapped ones (see Table 3).
  2. Whether to resort to a linear (F1) or an exponential (F2) policy.

- Choices that we will discuss:
  1. The number of episodes.
  2. The initialization values of $\theta$.
  3. The seed for the quasi-random number generator.
  4. $\eta$, the learning rate in the update of the policy parameters. To simplify scale issues, the gradient in the update is divided by the maximum absolute value of gradient values. This makes the learning rate easier to interpret.

- Choices that are fixed throughout the entire study:
  1. $\gamma$, the reward discounting factor. For bootstrap learning, this parameter is irrelevant. For chronological sequences, since they only last 12 months, there is no major nor obvious gain in using a discount. Thus we set $\gamma = 1$.
  2. $N$, the number of stocks that are integrated in the portfolio (used to compute the reward). As discussed in Section B, it is impossible to consider very large portfolios because of the asymptotic behavior of the functions required in the Dirichlet forms. The most obvious choice is $N = 100$. Larger portfolios impose stringent constraints on the Dirichlet parameters, making the approach impractical. By construction, smaller portfolios lack diversification and may reflect cross-sectional information insufficiently.
  3. The bounds on the Dirichlet parameters (see Section B in the Appendix). They are fixed to $a_- = 0.02$ and $a_+ = 1.6$. These bounds are optimal empirically: going beyond leads to numerical errors.
  4. Rewards. Bootstrapped sequences can only work with simple return rewards. Chronological sequences are more flexible. To reduce the amount of results, we work with the differential Sharpe ratio for temporal learning.
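To make the above choices concrete, here is a compact sketch of one month of bootstrap learning with the linear policy (F1). It is a simplified illustration under stated assumptions: the projection (8) is replaced by a crude clipping of the concentration parameters, and the gradient normalization follows the convention described in the list above.

```python
import numpy as np
from scipy.special import digamma

rng = np.random.default_rng(42)

def reinforce_bootstrap(X, r, theta, n_episodes=500, eta=0.1,
                        a_lo=0.02, a_hi=1.6):
    """One month of bootstrap learning (Table 3, bottom panel), linear policy F1.

    X     : (N, K+1) characteristics of the past month, first column = 1
    r     : (N,) realized returns of the past month (the reward's driver)
    theta : (K+1,) current policy parameters
    """
    for _ in range(n_episodes):
        a = np.clip(X @ theta, a_lo, a_hi)  # crude stand-in for projection (8)
        w = rng.dirichlet(a)                # sample an action (a portfolio)
        G = w @ r                           # episode of length one: gain = reward
        score = (digamma(a.sum()) - digamma(a) + np.log(w)) @ X  # Prop. 4
        grad = G * score                    # REINFORCE estimate of grad J
        theta = theta + eta * grad / max(np.abs(grad).max(), 1e-12)
    return theta                            # allocate next month via a / a.sum()
```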
First and foremost, the Dirichlet policy depends on its parameter vector $\theta$. It is thus natural to start by showing the evolution of the $\theta^{(k)}_t$ for the four specifications we work with. They are shown in Figure 2. While only a few cases are outlined, they are qualitatively representative of all the other parameter configurations studied below.

There is a clear discrepancy between the two learning schemes: chronological sequences (lower panels) lead to the hegemony of the constant variable, while the bootstrapped sequences (upper plots) give more room to the firm characteristics. The latter are also much more volatile through time. Across both learning methods, the exponential policy (to the left) saturates the constant much more often, compared to the linear policy (to the right).

Fig. 2.
Values of $\theta^{(k)}_t$. We plot the value of the parameters through time for our two learning schemes (bootstrapped (upper panel) and chronological (lower panel)) and two policy schemes (linear (right) versus exponential (left) - see Equation (7)). The parameters are the following: the learning rate is $\eta = 0.1$, the number of episodes is $E = 500$, the bounds for the $a_n$ are $[0.02, 1.6]$ and the initial value of the $\theta^{(k)}$ is 1. Finally, the random seed is 42.

This has consequences on the optimal weights derived from the policy parameters via Equation (7). The chosen portfolio weights are simply the mean of the policy distributions: $w_{t,n} = a_{t,n}/\sigma_t$ (see (12) and (14)). In Figure 3, we plot the histogram of these weights. The distributions are grouped by year and then stacked on the graph. Because the number of assets changes through time (see Figure 1), we add two bounds on the plots. The full vertical black line marks the minimum uniform allocation ($1/N$), which is reached in 2016. The dotted line shows the maximum $1/N$ weight, which is implemented in 2001.
Fig. 3.
Distribution of weights. We plot the histogram of portfolio weights for our two learning schemes (bootstrapped (upper panel) and chronological (lower panel)) and two policy schemes (linear (right) versus exponential (left) - see Equation (7)). The histograms are stacked and each color stands for a given year. The full vertical line marks the minimum uniform allocation ($1/N$) over all dates, while the dotted line shows the maximum $1/N$ value. The parameters are the following: the learning rate is $\eta = 0.1$, the number of episodes is $E = 500$, the bounds for the $a_n$ are $[0.02, 1.6]$ and the initial value of the $\theta^{(k)}$ is 1. Finally, the random seed is 42.

First of all, because of the increase in the number of stocks, there is a temporal shift in the distribution of weights. Average weights are smaller in the later years and portfolios are more diversified. Moreover, weights are not very dispersed and appear concentrated around their means, which implies that allocations are relatively close to the EW benchmark and do not make strong bets towards some assets. This is especially true for the lower panel (chronological sequences), where there are almost no outliers beyond the vertical lines. This is consistent with the prominence of the constant in the lower panels of Figure 2.

Again, we underline that these results depend only marginally on the parametric choices described in the captions of the figures. The concentration of portfolios does not depend much on implementation choices, as long as they are realistic (e.g., sufficiently many episodes, or a moderate learning rate).

The ultimate yardstick for sophisticated portfolio construction methods is out-of-sample performance. It is usually presented in several steps: starting with a pure return indicator, and complementing it with other risk-adjusted metrics, like the Sharpe ratio. In Figure 4, we display average returns and Sharpe ratios across our implementation choices.

Footnote: Additional results are available upon request.
[Figure 4: average monthly returns (%) and Sharpe ratios plotted against the number of episodes (100, 500, 1000), for random seeds 42 and 43.]

Fig. 4.
Performance. We plot the average returns (left quadrants) and Sharpe ratios (right quadrants) of the portfolio allocations for our two learning schemes (bootstrapped (upper panel) and chronological (lower panel)), and two policy schemes (linear (orange points) versus exponential (blue points) - see Equation (7)). The dotted horizontal line marks the performance of the EW ($1/N$) portfolio.

The only robust conclusion is that bootstrap sequences perform better than sequential ones. Almost all bootstrap configurations surpass the EW benchmark, while it is the opposite for the portfolios based on chronological learning. The four degrees of freedom (random seed, number of episodes, learning rate and the initial $\theta$) have an impact on average returns that is not consistent across configurations. Notably, this corroborates the sensitivity of RL algorithms to random seeds (see, e.g., Henderson et al. (2017), Islam et al. (2017), and Colas et al. (2018)). Nevertheless, the magnitude of changes is small overall: average returns are scattered between 1.08% and 1.2%, so that the difference with the uniform allocation (1.14%) is not significant. Thus, even though parameter configurations alter the results, the changes are very limited in magnitude and all portfolios remain somewhat in the vicinity of the $1/N$ benchmark.

Footnote: A simple $t$-test of the series of RL-based portfolio returns versus $1/N$ portfolio returns yields $p$-values between 0.82 and 0.998.

5. Insights from a toy factor model

The purpose of this section is to elaborate on the patterns observed in Figure 2, which shows that the most important driver of the policies is the constant term. This preponderance implies that RL-based allocations remain close to the equally-weighted portfolio. We propose a factor-based allocation model that tries to replicate this stylized property. In particular, we investigate if RL-driven decisions can be reproduced by simpler models. We start with a theoretical contribution and subsequently move towards simple statistical estimates to illustrate the forces at work.
We assume that there are $N$ assets on the market. Their future returns are driven by a linear model
$$\mathbf{r} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\epsilon}, \qquad (15)$$
where $\mathbf{X} = X_{nk}$ is a $N \times (K+1)$ matrix of firm-specific characteristics with $N > K+1$. We omit the time index for notational simplicity. The first column of the matrix is constant, with all elements equal to one. The innovations $\boldsymbol{\epsilon}$ and loadings $\boldsymbol{\beta} = (\beta_0, \beta_1, \dots, \beta_K)$ are random and mutually independent. Moreover, we posit that the errors are independent across assets (and independent of loadings), and have zero means and uniform variance:
$$\bar{\boldsymbol{\epsilon}} = \mathbb{E}[\boldsymbol{\epsilon}] = \mathbf{0}_N, \qquad (16)$$
$$\boldsymbol{\Sigma}_\epsilon = \mathbb{E}[\boldsymbol{\epsilon}\boldsymbol{\epsilon}^\top] = \sigma^2_\epsilon \mathbf{I}_N, \qquad (17)$$
where $\mathbf{0}_N$ is a $N$-dimensional vector of zeroes and $\mathbf{I}_N$ the corresponding identity matrix. For analytical tractability concerns, we also need to be more specific with regard to the covariance structure of loadings and characteristics. We assume that both
$$\boldsymbol{\Sigma}_\beta := \mathbb{E}[(\boldsymbol{\beta} - \bar{\boldsymbol{\beta}})(\boldsymbol{\beta} - \bar{\boldsymbol{\beta}})^\top] \qquad (18)$$
$$\text{and} \quad \hat{\boldsymbol{\Sigma}}_X := N^{-1}(\mathbf{X} - \mathbf{1}_N\bar{\mathbf{x}}^\top)^\top(\mathbf{X} - \mathbf{1}_N\bar{\mathbf{x}}^\top) \qquad (19)$$
are diagonal with diagonal values equal to $\sigma^2_{\beta,k}$ and $\sigma^2_{X,k}$ respectively, where we have casually written $\bar{\boldsymbol{\beta}}$ for the mean vector of $\boldsymbol{\beta}$ and $\bar{\mathbf{x}} = N^{-1}\mathbf{X}^\top\mathbf{1}_N$ for the column vector of sample column means of $\mathbf{X}$. The function $\text{diag}(\cdot)$ maps a vector into the corresponding diagonal matrix. Note that since $\mathbf{X}$ is given and non-random, the matrix $\hat{\boldsymbol{\Sigma}}_X$ is its sample covariance matrix. The fact that $\hat{\boldsymbol{\Sigma}}_X$ is diagonal implies that the firm characteristics are uncorrelated, i.e., that they carry information pertaining to companies that is not redundant. The $\beta_k$ are also unrelated, which means that each factor impacts returns regardless of the effect of other firm attributes.

Some representative agent seeks to maximize a standard quadratic function of expected returns. The portfolio is based on firms' characteristics in a linear fashion: $\mathbf{w} = \mathbf{X}\theta$, where $\theta = (\theta_0, \theta_1, \dots, \theta_K)$ drives and reflects the agent's beliefs and preferences with regard to the corresponding factors. This form is on purpose the same as F1 in Equation (7) (Section 2.4), which drives the average portfolio allocation. The utility function is quadratic (as in the standard mean-variance formulation), hence the optimization program is the following:
$$\max_\theta \; \mathbb{E}\left[\theta^\top \mathbf{X}^\top \mathbf{X}\boldsymbol{\beta} - \frac{\gamma}{2}\, \theta^\top \mathbf{X}^\top \left(\mathbf{X}(\boldsymbol{\beta} - \bar{\boldsymbol{\beta}}) + \boldsymbol{\epsilon}\right)\left(\boldsymbol{\epsilon}^\top + (\boldsymbol{\beta} - \bar{\boldsymbol{\beta}})^\top\mathbf{X}^\top\right)\mathbf{X}\theta\right], \quad \text{s.t.} \quad \theta^\top\mathbf{X}^\top\mathbf{1}_N = 1. \qquad (20)$$
The lemma below provides the solution to this problem.

Lemma 6. If $\boldsymbol{\Sigma}_\beta$ and $\boldsymbol{\Sigma}_\epsilon$ are diagonal and assuming (16)-(17), the solution to (20) is
$$\tilde{\theta}^* = \gamma^{-1}\,\text{diag}(\boldsymbol{\sigma}^2)^{-1}\left(\mathbf{I}_{K+1} - \frac{\text{diag}(\boldsymbol{\sigma}^2_\beta)\, \bar{\mathbf{x}}\bar{\mathbf{x}}^\top\, \text{diag}(\boldsymbol{\sigma}^2)^{-1}}{\bar{\mathbf{x}}^\top\, \text{diag}(\boldsymbol{\sigma}^2_\beta)\,\text{diag}(\boldsymbol{\sigma}^2)^{-1}\, \bar{\mathbf{x}}}\right)\left(N^{-1}\bar{\boldsymbol{\beta}} + c\,(\mathbf{X}^\top\mathbf{X})^{-1}\bar{\mathbf{x}}\right), \qquad (21)$$
where $\text{diag}(\boldsymbol{\sigma}^2) = \text{diag}(\boldsymbol{\sigma}^2_X)\text{diag}(\boldsymbol{\sigma}^2_\beta) + N^{-1}\sigma^2_\epsilon \mathbf{I}_{K+1}$ and $c$ is the scaling constant that warrants that the budget constraint is satisfied. If, in addition, $\bar{\mathbf{x}} = [1 \;\; \mathbf{0}_K]^\top$, then
$$\tilde{\theta}^*_0 = N^{-1}, \qquad \tilde{\theta}^*_j = \frac{(\gamma N)^{-1}\bar{\beta}_j}{\sigma^2_{X,j}\sigma^2_{\beta,j} + \sigma^2_\epsilon/N}. \qquad (22)$$
The proof of the lemma is located in Appendix A.5. All other things equal, in the second part of the lemma, $\tilde{\theta}^*_j$ increases with $\bar{\beta}_j$, but decreases with all sources of risk: $\sigma^2_{X,j}$, $\sigma^2_{\beta,j}$, and $\sigma^2_\epsilon$. More importantly, the relative importance of the non-constant factors is strongly linked to $\gamma$. When risk aversion is low, the non-constant factors play a prominent role in the allocation choice.
If, however, risk aversion is high, then the $\tilde{\theta}^*_j$ are negligible and the $1/N$ portfolio is appealing to the investor. Based on our empirical results, the latter situation seems more likely.

One particular subcase of the lemma is when the budget constraint (to the right of Equation (20)) is removed. In the general case, this implies $c = 0$ in (21). If non-constant predictors have a zero mean, then $\tilde{\theta}^*_0$, like the other $\tilde{\theta}^*_j$, is given by the second part of (22). This configuration is interesting because, in practice, the $\theta_j$ values that are derived from RL algorithms are not subject to the budget constraint (see, e.g., step 5 in Table 2).

In addition, the fact that $\tilde{\theta}^*_j$ decreases to zero when $\sigma^2_{\beta,j}$ increases to infinity is consistent with the literature that finds that the EW portfolio is optimal under high model ambiguity (see, e.g., Pflug et al. (2012) and Maillet et al. (2015)). Indeed, if the non-constant loadings of $\boldsymbol{\beta}$ are subject to a very high level of uncertainty, then the investor will naturally and comparatively trust the constant factor much more. This results in an optimal allocation that is uniform across assets.

We illustrate the implications of Lemma 6 by running monthly regressions to estimate the loadings in Equation (15). To ease interpretability, we restrict the analysis to the three most common factors in the literature: size (market capitalization), value (price-to-book) and momentum (12-month to 1-month return). The largest correlation between them is 0.18 on the whole sample, thus the hypothesis of a diagonal covariance matrix is not too far-fetched. Each month, we report the OLS coefficients for Equation (15), where $\mathbf{r}$ is the vector of future one-month returns.

In the upper panel of Figure 5, we depict the estimated $\hat{\beta}_j$ for the three factors plus the constant. In addition, in the lower panel, we plot the scaled unconstrained theta values $\tilde{\theta}_j = \hat{\beta}_j / \left((2N)(\hat{\sigma}^2_{X,j}\hat{\sigma}^2_{\beta,j} + \hat{\sigma}^2_\epsilon/N)\right)$ (i.e., with $\gamma = 2$, which is without loss of generality, as $\gamma$ is only a normalizing constant). One additional reason we resort to unconstrained $\tilde{\theta}_j$ values is that they do not depend on the risk aversion parameter, which only plays the role of a scaling factor.

All betas and unconstrained thetas oscillate strongly, but their means and deviations are telling. The $\tilde{\theta}$ associated to the constant is volatile, but solidly positive on average, and by far dominating in magnitude, while the values for market capitalization are negative (which tends to be consistent with the so-called size anomaly).
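A sketch of this estimation pipeline follows. It is our own illustration: the exact variance estimators used in the paper are not fully specified, so the choices below (time-series variances of the loadings, a pooled idiosyncratic variance) are assumptions.

```python
import numpy as np

def monthly_betas(X_list, r_list):
    """One OLS regression of Equation (15) per month; returns a (T, K+1) array."""
    return np.array([np.linalg.lstsq(X, r, rcond=None)[0]
                     for X, r in zip(X_list, r_list)])

def unconstrained_thetas(betas, sigma_X2, sigma_eps2, N, gamma=2.0):
    """Scaled unconstrained thetas of Equation (22), per month and factor.

    betas      : (T, K+1) monthly OLS loadings, column 0 = constant
    sigma_X2   : (K,) cross-sectional variances of the predictors
    sigma_eps2 : scalar estimate of the idiosyncratic variance
    """
    sigma_b2 = betas[:, 1:].var(axis=0)  # time-series variance of the loadings
    return betas[:, 1:] / (gamma * N * (sigma_X2 * sigma_b2 + sigma_eps2 / N))
```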
[Figure 5 in-plot statistics for the unconstrained thetas (mean, std. dev.): cap -0.184 (0.500), cst 0.278 (1.410), mom 0.102 (0.704), pb -0.044 (0.441).]

Fig. 5.
Panel betas and scaled unconstrained thetas. We plot the estimated panel betas $\hat{\beta}_j$ for each month in the dataset (upper panel). For each of the four features (3 factors + constant), we show the scaled values $\tilde{\theta}_j = \hat{\beta}_j / \left((2N)(\hat{\sigma}^2_{X,j}\hat{\sigma}^2_{\beta,j} + \hat{\sigma}^2_\epsilon/N)\right)$ in the lower panel. Note: the risk aversion parameter (scaling constant) is $\gamma = 2$.

Let us briefly summarize how agents allocate in the two frameworks (RL versus loadings-based):
- When resorting to RL, the agent learns (via the policy gradient) by computing the sensitivity of the performance metric (average return or Sharpe ratio) with respect to variations in the parameters of the policy ($\theta$).
- In a more conventional characteristics-based asset pricing approach, the econometrician will evaluate the exposure of returns to firm-specific attributes ($\boldsymbol{\beta}$). These exposures can then be translated into portfolio weights, when optimizing the average of a given utility function (see Lemma 6 when the utility is quadratic).

Theoretically, there are no reasons why these two approaches should be linked (they are hard to reconcile analytically, even though both seek to give more weight to assets that are expected to outperform - see Section 2.5). However, empirically, we find some consistency between the two methods. One common feature is the dominance of the constant in the upper panel of Figure 2 and in the lower panel of Figure 5. This indicates that both approaches find a strong common factor in the cross-section of returns which cannot be explained by firm-level idiosyncrasies. Heuristically, when the estimated $\hat{\beta}$ (which drives $\tilde{\theta}$) is high, returns are on average high in the cross-section; thus, the sensitivity of performance to variations in $\theta$ should also be positive. This is when the orange line in Figure 2 is either at its ceiling, or increasing. When the policy learns over longer samples (bottom panel of Figure 2), the long-term positivity of $\hat{\beta}$ (which is linked to that of the equity premium) pushes $\theta$ upwards.

We thus wish to further investigate the link between the two learning processes. On the one hand, the driving element in the RL allocation is the policy update $\Delta\theta_t = \theta_t - \theta_{t-1} = \alpha\widehat{\nabla J}(\theta_t)$ (see Equation (9)). On the other hand, we pick the optimal (unconstrained) $\tilde{\theta}_t$ in Equation (22) to proxy for the information that is processed by the asset pricing model during the period between $t-1$ and $t$. In Figure 6, we plot the first values ($y$-axis) against the second ones ($x$-axis).

[Figure 6: one scatter plot per feature (cap, cst, mom, pb); the reported slopes of the fitted lines are 0.005 (p-value 0.763), 0.029 (0.000), -0.011 (0.134) and -0.004 (0.667).]

Fig. 6.
Policy gradients versus optimal asset pricing parameters. We plot the variations in policy parameters $\Delta\theta_t$ versus the unconstrained asset-pricing-based $\tilde{\theta}^*_t$ defined in Equation (22) and plotted in Figure 5. The former correspond to the first policy form (F1) updated from bootstrapped returns. The latter are built with returns from month $t-1$ to $t$ for consistency reasons (i.e., to match the informational set with which $\Delta\theta_t$ is built). The slope of the fitted linear relationship is reported, along with the corresponding $p$-value (in brackets). Note: the risk aversion parameter (scaling constant) is $\gamma = 2$.

There is only one dimension for which the two schemes yield consistent values: the constant (upper right quadrant). On all purely firm-specific characteristics, the approaches seldom agree and the correlation between the two approaches is not significantly different from zero. Thus, apart from the relative agreement on the dominance of the common factor in the cross-section of stocks, it is hard to fully reconcile the two methods.

6. Conclusion
In this article, we combine reinforcement learning with factor investing. The investor learns from the impact of firm-specific characteristics on a chosen performance measure. The technical machinery relies on tractable expressions derived from analytical properties of the Dirichlet distribution. This allows allocation decisions to remain inside a simplex, which is the unique requirement of long-only portfolios.

Empirically, the approach yields weights that are very diversified, akin to the $1/N$ portfolio. One interpretation is that the learning process captures the importance of a common factor that drives the cross-section of stocks beyond their factorial idiosyncrasies. We compare the RL decisions to those of a simple asset pricing model and find major differences for the non-constant characteristics. This shows the peculiarities stemming from the RL-based methodology.

Interestingly, and in spite of a wide range of implementation choices, the fact that the RL portfolios remain in the vicinity of uniform allocations underlines the efficiency of the latter.

Appendix A Proofs
A.1 Proof of Lemma 1
Case $l = 1$. We have
$$\begin{aligned}
\int_{\mathcal{S}} \rho_{t+1}\,\text{Prob}(ds \mid (\rho_t,\mathbf{X}_t), \mathbf{w}_t) &= \int_{\mathbb{R}}\int_{\mathcal{M}} r\, P_\epsilon\left(\mathcal{T}^{-1}(dr \mid \mathbf{X}_t, \mathbf{w}_t)\right) \cdot P_t^{t+1}(d\xi \mid \mathbf{X}_t) \\
&= \int_{\mathbb{R}} r\, P_\epsilon\left(\mathcal{T}^{-1}(dr \mid \mathbf{X}_t, \mathbf{w}_t)\right) \\
&= \int_{\mathbb{R}^N} \mathcal{T}(\mathbf{e})\, P_\epsilon(d\mathbf{e}) \\
&= \int_{\mathbb{R}^N} \mathbf{w}_t^\top\left(f(\mathbf{X}_t) + \mathbf{e}\right) P_\epsilon(d\mathbf{e}) \\
&= \mathbf{w}_t^\top f(\mathbf{X}_t) + \mathbf{w}_t^\top\left(\int_{\mathbb{R}^N} \mathbf{e}\, P_\epsilon(d\mathbf{e})\right),
\end{aligned}$$
which gives the result, as $\boldsymbol{\epsilon}_t$ is a zero-mean white noise.

Case $l \geq 2$. Note that $\rho_{t+l} = \mathcal{T}(\mathbf{X}_{t+l-1}, \mathbf{w}_{t+l-1}, \boldsymbol{\epsilon}_{t+l})$. Therefore,
$$\int_{\mathcal{S}} \rho_{t+l}\,\text{Prob}(ds \mid (\rho_t,\mathbf{X}_t), \mathbf{w}_t) = \int_{\mathbb{R}}\int_{\mathcal{M}} \rho_{t+l}\, P_\epsilon\left(\mathcal{T}^{-1}(dr \mid \mathbf{X}_t, \mathbf{w}_t)\right) \cdot P_t^{t+1}(d\xi \mid \mathbf{X}_t) = \int_{\mathcal{M}} \rho_{t+l}\, P_t^{t+1}(d\xi \mid \mathbf{X}_t).$$
If $l = 2$, we go to the next step; otherwise we iterate:
$$\begin{aligned}
\int_{\mathcal{S}} \rho_{t+l}\,\text{Prob}(ds \mid (\rho_t,\mathbf{X}_t), \mathbf{w}_t) &= \int_{\mathcal{M}}\left(\int_\Delta\left(\int_{\mathcal{S}} \rho_{t+l}\,\text{Prob}\left(ds \mid r, \xi, \mathbf{w}_{t+1}\right)\right)\pi_\theta(d\mathbf{w}_{t+1} \mid \xi)\right)P_t^{t+1}(d\xi \mid \mathbf{X}_t) \\
&= \int_{\mathcal{M}}\left(\int_{\mathbb{R}}\int_{\mathcal{M}} \rho_{t+l}\, P_\epsilon\left(\mathcal{T}^{-1}(dr \mid \xi, \mathbf{w}_{t+1})\right) P_{t+1}^{t+2}(d\xi' \mid \xi)\right)P_t^{t+1}(d\xi \mid \mathbf{X}_t) \\
&= \int_{\mathcal{M}}\left(\int_{\mathcal{M}} \rho_{t+l}\, P_{t+1}^{t+2}(d\xi' \mid \xi)\right)P_t^{t+1}(d\xi \mid \mathbf{X}_t) = \int_{\mathcal{M}} \rho_{t+l}\, P_t^{t+2}(d\xi' \mid \mathbf{X}_t),
\end{aligned}$$
up to the point where
$$\int_{\mathcal{S}} \rho_{t+l}\,\text{Prob}(ds \mid (\rho_t,\mathbf{X}_t), \mathbf{w}_t) = \int_{\mathcal{M}}\left(\int_\Delta\left(\int_{\mathcal{S}} \rho_{t+l}\,\text{Prob}\left(ds \mid r, \xi, \mathbf{w}_{t+l-1}\right)\right)\pi_\theta(d\mathbf{w}_{t+l-1} \mid \xi)\right)P_t^{t+l-1}(d\xi \mid \mathbf{X}_t).$$
The innermost integral is given by the above case $l = 1$, hence
$$\int_{\mathcal{S}} \rho_{t+l}\,\text{Prob}(ds \mid (\rho_t,\mathbf{X}_t), \mathbf{w}_t) = \int_{\mathcal{M}}\left(\int_\Delta \mathbf{w}_{t+l-1}^\top f(\xi)\, \pi_\theta(d\mathbf{w}_{t+l-1} \mid \xi)\right)P_t^{t+l-1}(d\xi \mid \mathbf{X}_t) = \int_{\mathcal{M}}\left(\int_\Delta \mathbf{w}_{t+l-1}\, \pi_\theta(d\mathbf{w}_{t+l-1} \mid \xi)\right)^\top f(\xi)\, P_t^{t+l-1}(d\xi \mid \mathbf{X}_t). \quad \square$$

A.2 Proof of Proposition 2

We first compute the expected values of the reward. For the risk-insensitive agent, $R_t = \rho_t$, so that
$$\mathbb{E}_\theta[R_{t+l} \mid S_t = (\rho_t, \mathbf{X}_t)] = \int_\Delta\left(\int_{\mathcal{S}} \rho_{t+l}\,\text{Prob}(ds \mid S_t, \mathbf{w}_t)\right)\pi_\theta(d\mathbf{w}_t \mid \mathbf{X}_t).$$
Lemma 1 gives the value of the inner integral and we have two cases. If $l = 1$, then
$$\mathbb{E}_\theta[R_{t+1} \mid S_t] = \int_\Delta \mathbf{w}_t^\top f(\mathbf{X}_t)\, \pi_\theta(d\mathbf{w}_t \mid \mathbf{X}_t) = \left(\int_\Delta \mathbf{w}_t\, \pi_\theta(d\mathbf{w}_t \mid \mathbf{X}_t)\right)^\top f(\mathbf{X}_t).$$
If $l \geq 2$,
2, we have E θ [ R t + l | S t ] = Z ∆ (cid:18)Z M E θ [ w t + l − | ξ ] (cid:124) f ( ξ ) P t + l − t ( dξ | X t ) (cid:19) π θ ( d w t | X t ) . w t + l − is sampled under policy π θ when S t + l − is known and is independent of the previous states,while r t + l is given by the state S t + l − and the realization of (cid:15) t + l and is independent of the choiceof the action. Therefore the inner integral is not a function of w t and E θ [ R t + l | S t ] = Z M E θ [ w t + l − | ξ ] (cid:124) f ( ξ ) P t + l − t ( dξ | X t ) . Now we show that the policy value takes a recursive form. First rewrite the policy value for therisk insensitive agent as V θ ( t, S t ) = T − t X l =1 E θ [ R t + l | S t ] = E θ [ R t +1 | S t ] + T − ( t +1) X l =1 E θ [ R t +1+ l | S t ] . Using the expression for the expected values of the reward for l ≥ E θ [ R t +1+ l | S t ] = Z M E θ [ w t + l | ξ ] (cid:124) f ( ξ ) P t + lt ( dξ | X t )= Z M (cid:18)Z M E θ (cid:2) w t + l | ξ (cid:3) (cid:124) f (cid:0) ξ (cid:1) P t + lt +1 ( dξ | ξ ) (cid:19) P t +1 t ( dξ | X t ) . For l = 1, the inner integral is simply E θ [ w t +1 | ξ ] (cid:124) f ( ξ ) = E θ [ R t +1+ l | S t +1 = ξ ]For l ≥
2, it is also Z M E θ (cid:2) w t +1+ l − | ξ (cid:3) (cid:124) f (cid:0) ξ (cid:1) P t +1+ l − t +1 ( dξ | ξ ) = E θ [ R t +1+ l | S t +1 = ξ ] . The rightmost part of the policy value expression can now be written as T − ( t +1) X l =1 E θ [ R t +1+ l | S t ] = Z M T − ( t +1) X l =1 E θ [ R t +1+ l | S t +1 = ξ ] P t +1 t ( dξ | X t ) , where the inner sum is V θ ( t + 1 , ξ ). 25 .3 Proof of Proposition 3 A.3.1 Expectation of ln W n We have W n ∼ Beta ( a n , σ − a n ). Consequently, E [ln W n ] = 1 B ( a n , σ − a n ) Z ln w n w a n − n (1 − w n ) σ − a n − dw n = 1 B ( a n , σ − a n ) Z ln w n ϕ ( w n , a n ) dw n , where ϕ ( w n , a n ) = w a n − n (1 − w n ) σ − a n − = e ( a n −
1) ln w n +( σ − a n −
1) ln(1 − w n ) . As a n cancels out in σ − a n , ∂∂a n ϕ ( w n , a n ) = ln w n ϕ ( w n , a n ) , and E [ln W n ] = 1 B ( a n , σ − a n ) Z ∂∂a n ϕ ( w n , a n ) dw n = 1 B ( a n , σ − a n ) ∂∂a n Z ϕ ( w n , a n ) dw n = 1 B ( a n , σ − a n ) ∂∂a n B ( a n , σ − a n )= dd a n ln B ( a n , σ − a n )= dd a n (ln Γ( a n ) + ln Γ( σ − a n ) − ln Γ( σ ))= (cid:122) ( a n ) − (cid:122) ( σ ) . A.3.2 Expectation of W n ln W n We have W n ∼ Beta ( a n , σ − a n ). Then, E [ W n ln W n ] = 1 B ( a n , σ − a n ) Z w n ln w n w a n − n (1 − w n ) σ − a n − dw n = 1 B ( a n , σ − a n ) Z ln w n ϕ ( w n , a n ) dw n , where ϕ ( w n , a n ) = w a n n (1 − w n ) σ − a n − .
26n addition, E [ W n ln W n ] = 1 B ( a n , σ − a n ) Z ∂∂a n ϕ ( w n , a n ) dw n = 1 B ( a n , σ − a n ) ∂∂a n Z ϕ ( w n , a n ) dw n = 1 B ( a n , σ − a n ) ∂∂a n B ( a n + 1 , σ − a n )= B ( a n + 1 , σ − a n ) B ( a n , σ − a n ) dd a n ln B ( a n + 1 , σ − a n ) . The ratio of Betas simplifies to B ( a n + 1 , σ − a n ) B ( a n , σ − a n ) = Γ ( a n + 1) Γ ( σ − a n ) Γ ( σ )Γ ( σ + 1) Γ ( a n ) Γ ( σ − a n ) = a n Γ ( a n ) Γ ( σ ) σ Γ ( σ ) Γ ( a n ) = a n σ . Therefore we can conclude E [ W n ln W n ] = a n σ dd a n ln B ( a n + 1 , σ − a n )= a n σ dd a n (ln Γ( a n + 1) + ln Γ( σ − a n ) − ln Γ( σ + 1))= a n σ ( (cid:122) ( a n + 1) − (cid:122) ( σ + 1))= a n σ (cid:18) (cid:122) ( a n ) + 1 a n − (cid:122) ( σ ) − σ (cid:19) . A.3.3 Expectation of W n ln W m We have h W n W m i ∼ Dir ( a n,m ), where a n,m = h a n a m σ − a n − a m i . Also, E [ W n ln W m ] = 1 B ( a n,m ) Z Z − w n w n ln w m w a n − n w a m − m (1 − w n − w m ) σ − a n − a m − dw n dw m = 1 B ( a n,m ) Z w a n n (cid:18)Z − w n ln w m w a m − m (1 − w n − w m ) σ − a n − a m − dw m (cid:19) dw n . The inner integral is I = Z λ ln ww a m − ( λ − w ) σ − a n − a m − dw. With the change of variable λt = wI = Z (ln λ + ln t ) λ a m − t a m − λ σ − a n − a m − (1 − t ) σ − a n − a m − λdt = λ σ − a n − (cid:18) ln λ Z t a m − (1 − t ) σ − a n − a m − dt + Z ln t t a m − (1 − t ) σ − a n − a m − dt (cid:19) = λ σ − a n − (cid:18) ln λB ( a m , σ − a n − a m ) + Z ln tϕ ( t, a m ) dt (cid:19) , ϕ ( t, a m ) = t a m − (1 − t ) σ − a n − a m − . As a m cancels out in σ − a n − a m , ∂∂a m ϕ ( t, a m ) = ln tϕ ( t, a m ) . Therefore, Z ln tϕ ( t, a m ) dt = Z ∂∂a m ϕ ( t, a m ) dt = ∂∂a m Z ϕ ( t, a m ) dt = dd a m B ( a m , σ − a n − a m )= B ( a m , σ − a n − a m ) dd a m ln B ( a m , σ − a n − a m ) , and dd a m ln B ( a m , σ − a n − a m ) = dd a m (ln Γ ( a m ) + ln Γ ( σ − a n − a m ) − ln Γ ( σ − a n ))= (cid:122) ( a m ) − (cid:122) ( σ − a n ) . This gives I = (1 − w n ) σ − a n − B ( a m , σ − a n − a m ) (ln (1 − w n ) + (cid:122) ( a m ) − (cid:122) ( σ − a n )) . Back to the expectation, E [ W n ln W m ] = B ( a m , σ − a n − a m ) B ( a n,m ) Z w a n n (1 − w n ) σ − a n − (ln (1 − w n ) + (cid:122) ( a m ) − (cid:122) ( σ − a n )) dw n . The ratio of Betas simplifies to B ( a m , σ − a n − a m ) B ( a n,m ) = Γ ( a m ) Γ ( σ − a n − a m ) Γ ( σ )Γ ( σ − a n ) Γ ( a n ) Γ ( a m ) Γ ( σ − a n − a m )= Γ ( σ )Γ ( σ − a n ) Γ ( a n ) = 1 B ( a n , σ − a n ) , and the integral splits in two parts which are I = ( (cid:122) ( a m ) − (cid:122) ( σ − a n )) Z w a n n (1 − w n ) σ − a n − dw n = ( (cid:122) ( a m ) − (cid:122) ( σ − a n )) B ( a n + 1 , σ − a n )and I = Z ln (1 − w n ) w a n n (1 − w n ) σ − a n − dw n = Z ∂∂a k (cid:16) w a n n (1 − w n ) σ − a n − (cid:17) dw n ≤ k ≤ N , k = n . Thus, I = ∂∂a k (cid:18)Z w a n n (1 − w n ) σ − a n − dw n (cid:19) = B ( a n + 1 , σ − a n ) dd a k ln B ( a n + 1 , σ − a n )= B ( a n + 1 , σ − a n ) ( (cid:122) ( σ − a n ) − (cid:122) ( σ + 1)) . Putting everything together E [ W n ln W m ] = B ( a n + 1 , σ − a n ) B ( a n , σ − a n ) ( (cid:122) ( a m ) − (cid:122) ( σ − a n ) + (cid:122) ( σ − a n ) − (cid:122) ( σ + 1)) . Again, the ratio of Betas simplifies to B ( a n + 1 , σ − a n ) B ( a n , σ − a n ) = Γ ( a n + 1) Γ ( σ − a n ) Γ ( σ )Γ ( σ + 1) Γ ( a n ) Γ ( σ − a n ) = a n Γ ( a n ) Γ ( σ ) σ Γ ( σ ) Γ ( a n ) = a n σ . Finally E [ W n ln W m ] = a n σ (cid:18) (cid:122) ( a m ) − (cid:122) ( σ ) − σ (cid:19) . A.4 Proof of Proposition 5
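Since these three expectations drive the policy gradients, they are easy to sanity-check with Monte Carlo draws. A minimal sketch follows (our own illustrative code; the parameter values are arbitrary, and scipy/numpy are our tooling choices).

```python
import numpy as np
from scipy.special import digamma

rng = np.random.default_rng(42)
a = np.array([0.5, 1.2, 0.8, 2.0])    # arbitrary Dirichlet parameters
sigma = a.sum()
W = rng.dirichlet(a, size=1_000_000)  # each row is one draw from Dir(a)

n, m = 0, 1  # the two components under scrutiny (n != m)

# E[ln W_n] = psi(a_n) - psi(sigma)
print(np.log(W[:, n]).mean(), digamma(a[n]) - digamma(sigma))

# E[W_n ln W_n] = (a_n/sigma)(psi(a_n) + 1/a_n - psi(sigma) - 1/sigma)
print((W[:, n] * np.log(W[:, n])).mean(),
      a[n] / sigma * (digamma(a[n]) + 1 / a[n] - digamma(sigma) - 1 / sigma))

# E[W_n ln W_m] = (a_n/sigma)(psi(a_m) - psi(sigma) - 1/sigma), for m != n
print((W[:, n] * np.log(W[:, m])).mean(),
      a[n] / sigma * (digamma(a[m]) - digamma(sigma) - 1 / sigma))
```

Each pair of printed values should agree up to Monte Carlo error.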
A.4 Proof of Proposition 5

We have
\[
\nabla J(\theta_t) = \mathbb{E}_\pi\big[ G_t\, \nabla \ln\pi(w_t \mid X_t, \theta_t) \mid X_t, \theta_t \big]
= \sum_{l=1}^{T-t} \mathbb{E}_\pi\big[ \rho_{t+l}\, \nabla \ln\pi(w_t \mid X_t, \theta_t) \mid X_t, \theta_t \big]
\]
\[
= \sum_{l=1}^{T-t} \int_\Delta \Big( \int_S \rho_{t+l}\, \nabla \ln\pi(w_t \mid X_t, \theta_t)\, \mathrm{Prob}(ds \mid S_t, w_t) \Big)\, \pi_\theta(dw_t \mid X_t)
= \sum_{l=1}^{T-t} \int_\Delta \Big( \int_S \rho_{t+l}\, \mathrm{Prob}(ds \mid S_t, w_t) \Big)\, \nabla \ln\pi(w_t \mid X_t, \theta_t)\, \pi_\theta(dw_t \mid X_t).
\]
The inner integrals are given by Lemma 1 and
\[
\nabla J(\theta_t) = \int_\Delta w_t^\top f(X_t)\, \nabla \ln\pi(w_t \mid X_t, \theta_t)\, \pi_\theta(dw_t \mid X_t)
+ \sum_{l=2}^{T-t} \int_\Delta \Big( \int_M \mathbb{E}_\theta[w_{t+l-1} \mid \xi]^\top f(\xi)\, P_t^{t+l-1}(d\xi \mid X_t) \Big)\, \nabla \ln\pi(w_t \mid X_t, \theta_t)\, \pi_\theta(dw_t \mid X_t).
\]
As the inner integrals in the second part of the expression are independent of the choice of $w_t$, we can write this second part as
\[
\Big( \int_\Delta \nabla \ln\pi(w_t \mid X_t, \theta_t)\, \pi_\theta(dw_t \mid X_t) \Big) \sum_{l=2}^{T-t} \int_M \mathbb{E}_\theta[w_{t+l-1} \mid \xi]^\top f(\xi)\, P_t^{t+l-1}(d\xi \mid X_t),
\]
and the expectation of the score is null:
\[
\int_\Delta \nabla \ln\pi(w_t \mid X_t, \theta_t)\, \pi_\theta(dw_t \mid X_t) = 0.
\]
Therefore,
\[
\nabla J(\theta_t) = \mathbb{E}_\pi\big[ w_t^\top f(X_t)\, \nabla \ln\pi(w_t \mid X_t, \theta_t) \mid X_t, \theta_t \big].
\]
Let $f_{t,n}$ denote the $n$-th element of the vector $f(X_t)$. Then,
\[
\nabla J(\theta_t) = \mathbb{E}_\pi\bigg[ \Big( \sum_{m=1}^N w_{t,m} f_{t,m} \Big) \Big( \sum_{n=1}^N \big( \psi(\sigma_t) - \psi(a_{t,n}) + \ln w_{t,n} \big)\, \nabla a_{t,n} \Big) \,\Big|\, S_t, \theta_t \bigg]
\]
\[
= \sum_{n=1}^N \big( \psi(\sigma_t) - \psi(a_{t,n}) \big)\, \mathbb{E}_\pi\bigg[ \sum_{m=1}^N w_{t,m} f_{t,m} \,\Big|\, S_t, \theta_t \bigg]\, \nabla a_{t,n}
+ \sum_{n=1}^N \mathbb{E}_\pi\bigg[ \Big( \sum_{m=1}^N w_{t,m} f_{t,m} \Big) \ln w_{t,n} \,\Big|\, S_t, \theta_t \bigg]\, \nabla a_{t,n}.
\]
On the one hand, by the expectation of a Dirichlet distributed random vector,
\[
\mathbb{E}_\pi\bigg[ \sum_{m=1}^N w_{t,m} f_{t,m} \,\Big|\, S_t, \theta_t \bigg] = \sum_{m=1}^N \mathbb{E}_\pi[w_{t,m} \mid S_t, \theta_t]\, f_{t,m} = \sum_{m=1}^N \frac{a_{t,m}}{\sigma_t} f_{t,m}.
\]
On the other hand, using Proposition 3, we obtain
\[
\mathbb{E}_\pi\bigg[ \Big( \sum_{m=1}^N w_{t,m} f_{t,m} \Big) \ln w_{t,n} \,\Big|\, S_t, \theta_t \bigg]
= \sum_{\substack{m=1 \\ m \ne n}}^N \frac{a_{t,m}}{\sigma_t} \Big( \psi(a_{t,n}) - \psi(\sigma_t) - \frac{1}{\sigma_t} \Big) f_{t,m}
+ \frac{a_{t,n}}{\sigma_t} \Big( \psi(a_{t,n}) + \frac{1}{a_{t,n}} - \psi(\sigma_t) - \frac{1}{\sigma_t} \Big) f_{t,n}
\]
\[
= \Big( \psi(a_{t,n}) - \psi(\sigma_t) - \frac{1}{\sigma_t} \Big) \sum_{m=1}^N \frac{a_{t,m}}{\sigma_t} f_{t,m} + \frac{1}{\sigma_t} f_{t,n}.
\]
Combining the two parts, the digamma terms cancel and
\[
\nabla J(\theta_t) = \sum_{n=1}^N \Big( f_{t,n} - \sum_{m=1}^N \frac{a_{t,m}}{\sigma_t} f_{t,m} \Big)\, \frac{\nabla a_{t,n}}{\sigma_t}.
\]
Now note that
\[
\sum_{m=1}^N \frac{a_{t,m}}{\sigma_t} f_{t,m} = \mathbb{E}_\pi[w_t \mid X_t, \theta_t]^\top f(X_t) = \mathbb{E}_\pi[R_{t+1} \mid X_t, \theta_t],
\]
and that $f_{t,n}$ is the expected return of asset $n$ between $t$ and $t+1$ (by Equation (2)).
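The closed form above makes a REINFORCE update cheap to compute. The sketch below (our own illustrative code) compares it with a Monte Carlo score-function estimate; the positive parametrization $a_{t,n} = \exp(\theta^\top x_n)$ is an assumption made for this example, not necessarily the policy form used in the paper.

```python
import numpy as np
from scipy.special import digamma

rng = np.random.default_rng(1)
N, K = 8, 3                          # assets, characteristics
X = rng.normal(size=(N, K))          # hypothetical firm characteristics
theta = rng.normal(scale=0.1, size=K)
f = rng.normal(scale=0.02, size=N)   # stand-in for the expected returns f(X_t)

a = np.exp(X @ theta)                # assumed parametrization a_n = exp(theta' x_n) > 0
sigma = a.sum()
grad_a = a[:, None] * X              # d a_n / d theta, shape (N, K)

# Closed-form gradient from Proposition 5:
# sum_n (f_n - sum_m (a_m/sigma) f_m) * grad(a_n) / sigma
f_bar = (a / sigma) @ f
grad_closed = ((f - f_bar) / sigma) @ grad_a

# Monte Carlo score-function (REINFORCE) estimate of the same quantity:
# E[ (w'f) * sum_n (psi(sigma) - psi(a_n) + ln w_n) grad(a_n) ]
W = rng.dirichlet(a, size=500_000)
score = (digamma(sigma) - digamma(a) + np.log(W)) @ grad_a   # (draws, K)
grad_mc = ((W @ f)[:, None] * score).mean(axis=0)

print(grad_closed)
print(grad_mc)   # close to the closed form, up to Monte Carlo error
```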
A.5 Proof of Lemma 6

The Lagrange formulation of the problem is
\[
L(\theta) = \theta^\top X^\top X \bar\beta - \frac{\gamma}{2}\, \theta^\top X^\top\, \mathbb{E}\Big[ \big( X(\beta - \bar\beta) + \varepsilon \big)\big( \varepsilon + X(\beta - \bar\beta) \big)^\top \Big]\, X\theta + \lambda \big( \theta^\top X^\top \mathbf{1}_N - 1 \big), \tag{23}
\]
whose gradient,
\[
\frac{\partial L}{\partial \theta} = X^\top X \bar\beta - \gamma\, X^\top \big( X \Sigma_\beta X^\top + \sigma_\varepsilon^2 I_N \big) X\theta + \lambda\, X^\top \mathbf{1}_N, \tag{24}
\]
leads, via the first order conditions, to the standard solution
\[
\theta^* = \gamma^{-1} \Big( X^\top \big( X \Sigma_\beta X^\top + \sigma_\varepsilon^2 I_N \big) X \Big)^{-1} \big( X^\top X \bar\beta + c\, X^\top \mathbf{1}_N \big)
= \gamma^{-1} \big( \Sigma_\beta X^\top X + \sigma_\varepsilon^2 I_K \big)^{-1} \Big( \bar\beta + c\, (X^\top X)^{-1} X^\top \mathbf{1}_N \Big), \tag{25}
\]
where $c$ is a constant which ensures that the budget constraint (to the right of Equation (20)) is fulfilled. Note that $X^\top X$ is nonsingular because the characteristics are not redundant and because $N > K + 1$. For the sake of completeness, we derive the expression for the first inverse matrix below.

From (19) and the definition of $\bar{x}$, it holds that
\[
X^\top X = N \big( \hat\Sigma_X + \bar{x}\bar{x}^\top \big) = N \big( \mathrm{diag}(\sigma_X^2) + \bar{x}\bar{x}^\top \big), \tag{26}
\]
so that by the Sherman-Morrison formula, and because $\Sigma_\beta = \mathrm{diag}(\sigma_\beta^2)$,
\[
\big( \Sigma_\beta X^\top X + \sigma_\varepsilon^2 I_K \big)^{-1}
= \Big( N\, \mathrm{diag}(\sigma_X^2)\, \mathrm{diag}(\sigma_\beta^2) + N\, \mathrm{diag}(\sigma_\beta^2)\, \bar{x}\bar{x}^\top + \sigma_\varepsilon^2 I_K \Big)^{-1}
= N^{-1}\, \mathrm{diag}(\sigma^2)^{-1} \left( I_K - \frac{\mathrm{diag}(\sigma_\beta^2)\, \bar{x}\bar{x}^\top\, \mathrm{diag}(\sigma^2)^{-1}}{1 + \bar{x}^\top\, \mathrm{diag}(\sigma^2)^{-1}\, \mathrm{diag}(\sigma_\beta^2)\, \bar{x}} \right), \tag{27}
\]
where $\mathrm{diag}(\sigma^2) = \mathrm{diag}(\sigma_X^2)\, \mathrm{diag}(\sigma_\beta^2) + N^{-1} \sigma_\varepsilon^2 I_K$ -- this form being a strong echo of the structure in Equation (15). This proves the first point.

Now, let us make the extreme simplification, as in our empirical section, that $\bar{x}^\top = [1\ 0\ \cdots\ 0]$, so that firm characteristics have zero sample mean -- apart from the first one. This is not uncommon in the recent literature, as long as the data is preprocessed (see Freyberger et al. (2020), Gu et al. (2020b), Kelly et al. (2019) and Koijen and Yogo (2019)). Then, $X^\top X = N\, \mathrm{diag}(\tilde\sigma_X^2)$, where the modified vector of variances $\tilde\sigma_X^2$ is simply $\sigma_X^2$ with the first element equal to one (instead of zero). The ratio in (27) vanishes (because $\sigma_{\beta,1}^2 = 0$ and $\bar{x}\bar{x}^\top$ is empty apart from its unit first element) and
\[
\theta^* = (N\gamma)^{-1}\, \mathrm{diag}(\sigma^2)^{-1} \big( \bar\beta + c\, \mathrm{diag}(\tilde\sigma_X^2)^{-1} \bar{x} \big),
\]
from which the lower part of (22) is derived (the constant $c$ impacts only $\theta_1^*$). In this case, the budget constraint is only binding for the first asset. Indeed, because $N^{-1} \mathbf{1}_N^\top X = \bar{x}^\top = [1\ 0\ \cdots\ 0]$, the sum of weights linked to the non-constant factors is always equal to zero. Thus $\theta_1$, which is linked to the constant column, must satisfy $N \theta_1^* = 1$.
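Equation (27) can be sanity-checked numerically by comparing the Sherman-Morrison shortcut with a brute-force inversion. The sketch below uses simulated diagonal variances and sample means (all names and values are our own illustrative assumptions).

```python
import numpy as np

rng = np.random.default_rng(7)
N, K = 200, 5
sigma_eps2 = 0.04
sigma_X2 = rng.uniform(0.5, 1.5, size=K)   # characteristic variances
sigma_b2 = rng.uniform(0.01, 0.1, size=K)  # diagonal of Sigma_beta
x_bar = rng.normal(scale=0.1, size=K)      # sample means of the characteristics

# Brute-force inverse of Sigma_beta X'X + sigma_eps^2 I_K,
# with X'X = N (diag(sigma_X^2) + x_bar x_bar') as in (26).
XtX = N * (np.diag(sigma_X2) + np.outer(x_bar, x_bar))
brute = np.linalg.inv(np.diag(sigma_b2) @ XtX + sigma_eps2 * np.eye(K))

# Sherman-Morrison form of (27), with
# diag(sigma^2) = diag(sigma_X^2) diag(sigma_beta^2) + N^{-1} sigma_eps^2 I_K.
d_inv = 1.0 / (sigma_X2 * sigma_b2 + sigma_eps2 / N)        # diag(sigma^2)^{-1}
num = np.outer(sigma_b2 * x_bar, x_bar) * d_inv[None, :]    # diag(s_b^2) xx' diag(s^2)^{-1}
den = 1.0 + x_bar @ (d_inv * sigma_b2 * x_bar)
sherman = (d_inv[:, None] / N) * (np.eye(K) - num / den)

print(np.allclose(brute, sherman))  # True if (27) is implemented correctly
```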
Appendix B Dirichlet distributions and portfolios in high dimensions

One major issue with the Dirichlet distribution is the computation of the scaling constant in high dimension. More precisely, let us consider the log of this quantity:
\[
c = \log(B(\mathbf{a})) = \sum_{n=1}^N \log(\Gamma(a_n)) - \log\bigg( \Gamma\Big( \sum_{n=1}^N a_n \Big) \bigg). \tag{28}
\]
When $N$ is large and the $a_n$ are free, both terms can reach levels that are beyond what machines can handle when exponentiated. Thus, we need to impose restrictions. We proceed in three steps. First, we set lower and upper bounds on the $a_n$. In a second stage, we compute an upper value for $N$ that depends on the range of the $a_n$; this step is the most technical and we provide the details below. The third and last step is to determine a tradeoff.

Before we continue, we recall that the $a_n$ dictate the allocation of the agent and that, on average, the position in asset $n$ is equal to $a_n \big( \sum_{m=1}^N a_m \big)^{-1}$. For obvious risk-management reasons, it is preferable to diversify portfolios. In our framework, we assume that there exists a constant $\delta > 1$ such that
\[
\frac{1}{\delta N} \le a_n \Big( \sum_{m=1}^N a_m \Big)^{-1} \le \frac{\delta}{N}, \qquad n = 1, \ldots, N. \tag{29}
\]
In practice, the minimum value of $\delta$ will be driven by the data, and we discuss realistic ranges below. This constraint measures whether the portfolio is on average well balanced and does not make extreme bets. The smaller $\delta$ is, the higher the diversification of the positions. Notably, under condition (29), the mean of the $a_n$, $m_a$, is such that
\[
\delta^{-1} a_+ \le m_a \le \delta a_-, \qquad \text{with } a_+ = \max_n a_n,\ \ a_- = \min_n a_n. \tag{30}
\]
To make this idea explicit, we fix a maximum threshold $\kappa_{\max}$ beyond which we consider that the two terms in Equation (28) have numerically exploded. The two terms in Equation (28) have very different asymptotics when the $a_n$ are large or small, hence the treatment is not symmetric.

We start with the problems arising when the $a_n$ are large. Given the strong convexity of the $\Gamma$ function, this is more impactful for the second term in (28). We seek an upper bound $a_+$ for the $a_n$ such that this second term remains below $\kappa_{\max}$, i.e.,
\[
\log\bigg( \Gamma\Big( \sum_{n=1}^N a_n \Big) \bigg) \le \kappa_{\max}.
\]
Although the inverse of the $\Gamma$ function exists (at least when its argument is large enough, see Uchiyama (2012)), it is not straightforward to compute. We thus resort to Stirling's formula instead and seek to simplify
\[
\log\left( \sqrt{2\pi \Big( \sum_{n=1}^N a_n - 1 \Big)}\ \bigg( \frac{\sum_{n=1}^N a_n - 1}{e} \bigg)^{\sum_{n=1}^N a_n - 1} \right) \le \kappa_{\max}. \tag{31}
\]
If we omit the first, negligible, square-root term, this is equivalent to
\[
\Big( \sum_{n=1}^N a_n - 1 \Big) \bigg( \log\Big( \sum_{n=1}^N a_n - 1 \Big) - 1 \bigg) \le \kappa_{\max}.
\]
As a first order (rough) approximation, we reduce this expression to
\[
\Big( \sum_{n=1}^N a_n \Big)\, \log\Big( \sum_{n=1}^N a_n \Big) \le \kappa_{\max},
\]
that is, $\sum_{n=1}^N a_n \le \kappa_{\max} / W(\kappa_{\max}) \sim \kappa_{\max} / \log(\kappa_{\max})$, where $W$ is the principal branch of the Lambert function. Its asymptotic behavior for large arguments is indeed $W(z) \sim \log(z)$ (see Section 4.13 in Olver et al. (2010)). Given (30), a rule of thumb constraint that links $N$ and $a_+$ is
\[
a_+ \le \frac{\delta\, \kappa_{\max}}{N \log(\kappa_{\max})} \quad \Longleftrightarrow \quad N \le \frac{\delta\, \kappa_{\max}}{a_+ \log(\kappa_{\max})}, \tag{32}
\]
where we purposefully underline that the condition can also be viewed as a limit on the number of assets.

The first term in (28) relates to the lower bound on the $a_n$. Indeed, as $z$ shrinks to zero, $\Gamma(z)$ is equivalent to $z^{-1}$. Thus, if the $a_n$ are small and $a_-$ is sufficiently close to zero,
\[
\sum_{n=1}^N \log(\Gamma(a_n)) \le N \log\Big( \frac{1}{a_-} \Big) \le \kappa_{\max}
\quad \Longleftrightarrow \quad N \le \kappa_{\max} / \log(a_-^{-1})
\quad \Longleftrightarrow \quad a_- \ge e^{-\kappa_{\max}/N}. \tag{33}
\]
Conditions (32) and (33) link the bounds of the $a_n$ to the number of assets $N$. In Figure 7, we illustrate them by assigning values to $\kappa_{\max}$, $\delta$ and $N$.
Taking $\kappa_{\max} = 100$ allows $B(\mathbf{a})$ to range from $e^{-100}$ to $e^{100}$, which is a large magnitude. In the figure, as the number of assets increases, the range of the $a_n$ shrinks.

For our empirical study, we pick $a_- = 0.02$ and $a_+ = 1.6$. These values are dictated by numerical feasibility: outside this range, we obtain errors.

[Figure 7: the lower and upper bounds are plotted against $N$ (legend: lower; upper for $\delta = 2, 3, 4$), with the annotations "Small portfolio: large range for $a_n$" and "Large portfolio: small range for $a_n$".]

Fig. 7. Intervals for the $a_n$. We show the lower ($a_-$) and upper ($a_+$) bounds for the $a_n$ when the number of assets is fixed to 50, 100 or 200 and $\kappa_{\max} = 100$. They are derived from Equations (32) and (33).
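For given $\kappa_{\max}$, $\delta$ and $N$, the two bounds are immediate to compute. Below is a minimal sketch (our own illustrative code, not from the paper; the function name a_bounds and the parameter values are assumptions for the example).

```python
import numpy as np
from scipy.special import gammaln, lambertw

def a_bounds(N, kappa_max=100.0, delta=2.0):
    """Bounds on the Dirichlet parameters implied by Equations (32)-(33)."""
    a_plus = delta * kappa_max / (N * np.log(kappa_max))   # Equation (32)
    a_minus = np.exp(-kappa_max / N)                       # Equation (33)
    return a_minus, a_plus

for N in (50, 100, 200):
    a_minus, a_plus = a_bounds(N, delta=4.0)
    print(f"N = {N}: a_- >= {a_minus:.3f}, a_+ <= {a_plus:.3f}")

# Sharper cap on sum_n a_n via the Lambert W function (principal branch):
kappa_max = 100.0
print(kappa_max / np.real(lambertw(kappa_max)))  # sum of a_n should stay below this

# Direct magnitude check of log(B(a)) from Equation (28):
a = np.full(100, 0.5)                     # e.g., 100 assets with a_n = 0.5
log_B = gammaln(a).sum() - gammaln(a.sum())
print(log_B)                              # within +/- kappa_max for this choice
```

Consistent with the figure, the admissible interval for the $a_n$ narrows as $N$ grows: for large portfolios, the lower bound (33) rises while the upper bound (32) falls.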
References

Ammann, M., G. Coqueret, and J.-P. Schade (2016). Characteristics-based portfolio choice with leverage constraints. Journal of Banking & Finance 70, 23–37.

Arjovsky, M., L. Bottou, I. Gulrajani, and D. Lopez-Paz (2019). Invariant risk minimization. arXiv Preprint (1907.02893).

Asness, C. S., T. J. Moskowitz, and L. H. Pedersen (2013). Value and momentum everywhere. Journal of Finance 68(3), 929–985.

Baker, M., B. Bradley, and J. Wurgler (2011). Benchmarks as limits to arbitrage: Understanding the low-volatility anomaly. Financial Analysts Journal 67(1), 40–54.

Ball, R. and P. Brown (1968). An empirical evaluation of accounting income numbers. Journal of Accounting Research, 159–178.

Ball, R. and P. Brown (2019). Ball and Brown (1968) after fifty years. Pacific-Basin Finance Journal 53, 410–431.

Banz, R. W. (1981). The relationship between return and market value of common stocks. Journal of Financial Economics 9(1), 3–18.

Barbee Jr, W. C., S. Mukherji, and G. A. Raines (1996). Do sales–price and debt–equity explain stock returns better than book–market and firm size? Financial Analysts Journal 52(2), 56–60.

Basu, S. (1983). The relationship between earnings' yield, market value and return for NYSE common stocks: Further evidence. Journal of Financial Economics 12(1), 129–156.

Bäuerle, N. and U. Rieder (2011). Markov Decision Processes with Applications to Finance. Universitext. Berlin Heidelberg: Springer-Verlag.

Bhandari, L. C. (1988). Debt/equity ratio and expected common stock returns: Empirical evidence. Journal of Finance 43(2), 507–528.

Brandt, M. W., P. Santa-Clara, and R. Valkanov (2009). Parametric portfolio policies: Exploiting characteristics in the cross-section of equity returns. Review of Financial Studies 22(9), 3411–3447.

Chen, L., M. Pelger, and J. Zhu (2019). Deep learning in asset pricing. SSRN Working Paper 3350138.

Chordia, T. and B. Swaminathan (2000). Trading volume and cross-autocorrelations in stock returns. Journal of Finance 55(2), 913–935.

Colas, C., O. Sigaud, and P.-Y. Oudeyer (2018). How many random seeds? Statistical power analysis in deep reinforcement learning experiments. arXiv Preprint (1806.08295).

Cooper, M. J., H. Gulen, and M. J. Schill (2008). Asset growth and the cross-section of stock returns. Journal of Finance 63(4), 1609–1651.

Cover, T. M. and E. Ordentlich (1996). Universal portfolios with side information. IEEE Transactions on Information Theory 42(2), 348–363.

Daniel, K. and S. Titman (1997). Evidence on the characteristics of cross sectional variation in stock returns. Journal of Finance 52(1), 1–33.

Deng, Y., F. Bao, Y. Kong, Z. Ren, and Q. Dai (2016). Deep direct reinforcement learning for financial signal representation and trading. IEEE Transactions on Neural Networks and Learning Systems 28(3), 653–664.

Easton, P. D. (2004). PE ratios, PEG ratios, and estimating the implied expected rate of return on equity capital. Accounting Review 79(1), 73–95.

Fama, E. F. and K. R. French (1992). The cross-section of expected stock returns. Journal of Finance 47(2), 427–465.

Fama, E. F. and K. R. French (2015). A five-factor asset pricing model. Journal of Financial Economics 116(1), 1–22.

Feng, G., N. G. Polson, and J. Xu (2019). Deep learning in characteristics-sorted factor models. SSRN Working Paper 3243683.

Freyberger, J., A. Neuhierl, and M. Weber (2020). Dissecting characteristics nonparametrically. Review of Financial Studies 33(5), 2326–2377.

Gu, S., B. T. Kelly, and D. Xiu (2020a). Autoencoder asset pricing models. Journal of Econometrics, forthcoming.

Gu, S., B. T. Kelly, and D. Xiu (2020b). Empirical asset pricing via machine learning. Review of Financial Studies 33(5), 2223–2273.

Han, Y., K. Yang, and G. Zhou (2013). A new anomaly: The cross-sectional profitability of technical analysis. Journal of Financial and Quantitative Analysis 48(5), 1433–1461.

Haugen, R. A. and N. L. Baker (1996). Commonality in the determinants of expected stock returns. Journal of Financial Economics 41(3), 401–439.

Henderson, P., R. Islam, P. Bachman, J. Pineau, D. Precup, and D. Meger (2017). Deep reinforcement learning that matters. arXiv Preprint (1709.06560).

Hjalmarsson, E. and P. Manchev (2012). Characteristic-based mean-variance portfolio choice. Journal of Banking & Finance 36(5), 1392–1401.

Hoi, S. C., D. Sahoo, J. Lu, and P. Zhao (2018). Online learning: A comprehensive survey. arXiv Preprint (1802.02871).

Islam, R., P. Henderson, M. Gomrokchi, and D. Precup (2017). Reproducibility of benchmarked deep reinforcement learning tasks for continuous control. arXiv Preprint (1708.04133).

Jegadeesh, N. and S. Titman (1993). Returns to buying winners and selling losers: Implications for stock market efficiency. Journal of Finance 48(1), 65–91.

Kelly, B. T., S. Pruitt, and Y. Su (2019). Characteristics are covariances: A unified model of risk and return. Journal of Financial Economics 134(3), 501–524.

Koijen, R. S. and M. Yogo (2019). A demand system approach to asset pricing. Journal of Political Economy 127(4), 1475–1515.

Korsos, L. F. (2013). The Dirichlet portfolio model: Uncovering the hidden composition of hedge fund investments. arXiv Preprint (1306.0938).

Le Courtois, O. and X. Xu (2019). Portfolio optimization in the presence of extreme risks: A Pareto-Dirichlet approach. SSRN Working Paper 3376921.

Lettau, M. and M. Pelger (2020a). Estimating latent asset-pricing factors. Journal of Econometrics 218(1), 1–31.

Lettau, M. and M. Pelger (2020b). Factors that fit the time series and cross-section of stock returns. Review of Financial Studies 33(5), 2274–2325.

Li, Y., W. Zheng, and Z. Zheng (2019). Deep robust reinforcement learning for practical algorithmic trading. IEEE Access 7, 108014–108022.

Litzenberger, R. H. and K. Ramaswamy (1982). The effects of dividends on common stock prices: Tax effects or information effects? Journal of Finance 37(2), 429–443.

Maillet, B., S. Tokpavi, and B. Vaucher (2015). Global minimum variance portfolio optimisation under some model risk: A robust regression-based approach. European Journal of Operational Research 244(1), 289–299.

Moody, J., L. Wu, Y. Liao, and M. Saffell (1998). Performance functions and reinforcement learning for trading systems and portfolios. Journal of Forecasting 17(5-6), 441–470.

Naranjo, A., M. Nimalendran, and M. Ryngaert (1998). Stock returns, dividend yields, and taxes. Journal of Finance 53(6), 2029–2057.

Neuneier, R. (1996). Optimal asset allocation using adaptive dynamic programming. In Advances in Neural Information Processing Systems, pp. 952–958.

Olver, F. W., D. W. Lozier, R. F. Boisvert, and C. W. Clark (2010). NIST Handbook of Mathematical Functions. Cambridge University Press.

Pearl, J. (2009). Causality: Models, Reasoning and Inference. Second Edition, Volume 29. Cambridge University Press.

Pfister, N., P. Bühlmann, and J. Peters (2019). Invariant causal prediction for sequential data. Journal of the American Statistical Association 114(527), 1264–1276.

Pflug, G. C., A. Pichler, and D. Wozabal (2012). The 1/N investment strategy is optimal under high model ambiguity. Journal of Banking & Finance 36(2), 410–417.

Sato, Y. (2019). Model-free reinforcement learning for financial portfolios: A brief survey. arXiv Preprint (1904.04973).

Sosnovskiy, S. (2015). On financial applications of the two-parameter Poisson-Dirichlet distribution. arXiv Preprint (1501.01954).

Sutton, R. S. and A. G. Barto (2018). Reinforcement Learning: An Introduction (2nd Edition). MIT Press.

Uchiyama, M. (2012). The principal inverse of the gamma function. Proceedings of the American Mathematical Society 140(4), 1343–1348.

Wang, H. and X. Y. Zhou (2020). Continuous-time mean-variance portfolio selection: A reinforcement learning framework. Mathematical Finance, forthcoming.

Zhang, Z., S. Zohren, and S. Roberts (2019). Deep reinforcement learning for trading. arXiv Preprint.