G-Learner and GIRL: Goal Based Wealth Management with Reinforcement Learning
Matthew F. Dixon∗ (Department of Applied Math, Illinois Institute of Technology) and Igor Halperin† (Fidelity Investments & NYU Tandon School of Engineering)

February 2020
Abstract
We present a reinforcement learning approach to goal based wealth management problems such as optimization of retirement plans or target date funds. In such problems, an investor seeks to achieve a financial goal by making periodic investments in the portfolio while being employed, and periodically draws from the account when in retirement, in addition to the ability to rebalance the portfolio by selling and buying different assets (e.g. stocks). Instead of relying on a utility of consumption, we present G-Learner: a reinforcement learning algorithm that operates with explicitly defined one-step rewards, does not assume a data generation process, and is suitable for noisy data. Our approach is based on G-learning (Fox et al., 2015), a probabilistic extension of the Q-learning method of reinforcement learning. In this paper, we demonstrate how G-learning, when applied to a quadratic reward and Gaussian reference policy, gives an entropy-regularized Linear Quadratic Regulator (LQR). This critical insight provides a novel and computationally tractable tool for wealth management tasks which scales to high dimensional portfolios. In addition to the solution of the direct problem of G-learning, we also present a new algorithm, GIRL, that extends our goal-based G-learning approach to the setting of Inverse Reinforcement Learning (IRL), where rewards collected by the agent are not observed and should instead be inferred. We demonstrate that GIRL can successfully learn the reward parameters of a G-Learner agent and thus imitate its behavior. Finally, we discuss potential applications of the G-Learner and GIRL algorithms for wealth management and robo-advising.

∗ Matthew Dixon is an Assistant Professor in the Department of Applied Math, Illinois Institute of Technology. E-mail: [email protected].
† Igor Halperin is a Research Professor in Financial Engineering at NYU, and an AI Research associate at Fidelity Investments. E-mail: [email protected]. The views presented in this paper are of the author, and do not necessarily represent the views of his employer. The standard disclaimer applies. The author thanks Lisa Huang for helpful discussions.

1 Introduction
Mean-variance Markowitz optimization (MVO) (Markowitz, 1959) remains one of the most commonly used tools in wealth management. Portfolio objectives in this approach are defined in terms of expected returns and covariances of assets in the portfolio, which may not be the most natural formulation for retail investors. Indeed, the latter typically seek specific financial goals for their portfolios. For example, a contributor to a retirement plan may demand that the value of their portfolio at the age of his or her retirement be at least equal to, or preferably larger than, some target value $P_T$.

Goal-based wealth management offers some valuable perspectives into optimal structuring of wealth management plans such as retirement plans or target date funds. The motivation for operating in terms of wealth goals can be more intuitive (while still tractable) than the classical formulation in terms of expected excess returns and variances. To see this, let $V_T$ be the final wealth in the portfolio, and $P_T$ be a certain target wealth level at the horizon $T$. The goal-based wealth management approach of Browne (1996) and Das et al. (2018) uses the probability $\mathbb{P}\left[V_T - P_T \geq 0\right]$ of the final wealth $V_T$ being above the target level $P_T$ as an objective for maximization by active portfolio management. This probability is the same as the price of a binary option on the terminal wealth $V_T$ with strike $P_T$: $\mathbb{P}\left[V_T - P_T \geq 0\right] = \mathbb{E}_t\left[\mathbb{1}_{V_T > P_T}\right]$. Instead of a utility of wealth such as e.g. a power or logarithmic utility, this approach uses the price of this binary option as the objective function. This idea can also be modified by using a call option-like expectation $\mathbb{E}_t\left[(V_T - P_T)_+\right]$ instead of a binary option. Such an expectation quantifies how much the terminal wealth is expected to exceed the target, rather than simply providing the probability of such an event.

This treatment of the goal-based utility function can be implemented in a reinforcement learning (RL) framework for discrete-time planning problems. In contrast to the Merton consumption approach, RL does not require specific functional forms of the utility, nor does it require that the dynamics of the assets be treated as log-normal. Thus, in theory, RL can be viewed as a data-driven extension of dynamic programming (Sutton and Barto, 2018). In practice, a substantial challenge with the RL framework is the curse of dimensionality: portfolio allocation as a continuous action space Markov Decision Process (MDP) requires techniques such as deep Q-learning or other function approximation methods combined e.g. with the Least Squares Policy Iteration (LSPI) method (Lagoudakis and Parr, 2003). The latter has exponential complexity with an increasing number of stocks in the portfolio, and the former is cumbersome, highly data intensive, and heavily relies on heuristics for operational efficiency. For more details, see e.g. (Dixon et al., 2020).

In this paper, we present G-learning (Fox et al., 2015), a probabilistic extension of Q-learning which scales to high dimensional portfolios while providing a flexible choice of utility functions. To demonstrate the utility of G-learning, we consider a general class of wealth management problems: optimization of a defined contribution retirement plan, where cash is injected (rather than withdrawn) at each time step. In contrast to methods based on a utility of consumption, we adopt a more "RL-native" approach by directly specifying one-step rewards.
Such an approach is sufficiently general to capture other possible settings, such as e.g. a retirement plan in a decumulation (post-retirement) phase, or target based wealth management.¹ Previously, G-learning was applied to dynamic portfolio optimization in (Halperin and Feldshteyn, 2018), while here we extend this approach to portfolio management involving cashflows at intermediate time steps.

¹ The problem of optimal consumption with an investment portfolio is frequently referred to as the Merton consumption problem, after the celebrated work of Robert Merton, who formulated this problem as a continuous-time optimal control problem with log-normal dynamics for asset prices (Merton, 1971). As optimization in problems involving cash injections instead of cash withdrawals formally corresponds to a sign change of the one-step consumption in the Merton formulation, we can collectively refer to all types of wealth management problems involving injections or withdrawals of funds at intermediate time steps as a generalized Merton consumption problem.

A key step in our formulation is that we define actions as absolute (dollar-valued) changes of asset positions, instead of defining them in fractional terms as in the Merton approach (Merton, 1971). This enables a simple transformation of the optimization problem into an unconstrained optimization problem, and provides a semi-analytical solution for a particular choice of the reward function. As will be shown below, this approach offers a tractable setting for both the direct reinforcement learning problem of learning the optimal policy which maximizes the total reward, and its inverse problem, where we observe actions of a financial agent but not the rewards received by the agent. Inference of the reward function from observations of states and actions of the agent is the objective of Inverse Reinforcement Learning (IRL). After we present
G-Learner — a G-learning algorithm for the direct RL problem, we will introduce
GIRL (G-learning IRL): a framework for inference of rewards of financial agents that are "implied" by their observed behavior. The two practical algorithms, G-Learner and GIRL, can be used either separately or in combination, and we will discuss their potential joint applications for wealth management and robo-advising.

The paper is organized as follows. In Section 2, we introduce G-learning and explain how it generalizes the better known Q-learning method for reinforcement learning. Section 3 introduces the problem of portfolio optimization for a defined contribution retirement plan. Then in Section 4, we present the G-Learner: a G-learning algorithm for portfolio optimization with cash injection and consumption. The GIRL algorithm for performing IRL of financial agents is introduced in Section 5. Section 6 presents the results of our implementation and demonstrates the ability of the G-Learner to scale to high dimensional portfolio optimization problems, and the ability of GIRL to make inference of the reward function of a G-Learner agent. Section 7 concludes with ideas for future developments in G-learning for wealth management and robo-advising.
In this section, we provide a short but self-contained overview of G-learning as a probabilistic extension of the popular Q-learning method in reinforcement learning. We assume some familiarity with constructs in dynamic programming and reinforcement learning, see e.g. (Sutton and Barto, 2018), or (Dixon et al., 2020) for a more finance-focused introduction. In particular, we assume that the reader is familiar with the notions of value function, action-value function, and the Bellman optimality equations. Familiarity with Q-learning is desirable but not critical for understanding this section; however, for the benefit of the informed reader, a short informal summary of the differences is as follows:

• Q-learning is an off-policy RL method with a deterministic policy.

• G-learning is an off-policy RL method with a stochastic policy. G-learning can be considered as an entropy-regularized Q-learning, which may be suitable when working with noisy data. Because G-learning operates with stochastic policies, it amounts to a generative RL model.

2.1 Bellman optimality equation
More formally, let $x_t$ be a state vector for an agent that summarizes the knowledge of the environment that the agent needs in order to perform an action $a_t$ at time step $t$.² Let $\hat{R}_t(x_t, a_t)$ be a random reward collected by the agent for taking action $a_t$ at time $t$ when the state of the environment is $x_t$. Assume that all future actions are determined according to a policy $\pi(a_t|x_t)$ which specifies which action $a_t$ to take when the environment is in state $x_t$. We note that the policy $\pi$ can be deterministic as in Q-learning, or stochastic as in G-learning, as we will discuss below.

For a given policy $\pi$, the expected value of the cumulative reward with a discount factor $\gamma$, conditioned on the current state $x_t$, defines the value function

$$ V_t^{\pi}(x_t) := \mathbb{E}_t^{\pi}\left[ \sum_{t'=t}^{T} \gamma^{t'-t} \hat{R}_{t'}(x_{t'}, a_{t'}) \,\Big|\, x_t \right]. \quad (1) $$

Here $\mathbb{E}_t^{\pi}$ stands for the expectation over future states and actions, conditioned on the current state $x_t$ and policy $\pi$.

Let $\pi^{\star}$ be the optimal policy, i.e. the policy that maximizes the total reward. This policy corresponds to the optimal value function, denoted $V_t^{\star}(x_t)$. The latter satisfies the Bellman optimality equation (see e.g. (Sutton and Barto, 2018))

$$ V_t^{\star}(x_t) = \max_{a_t}\left( \hat{R}_t(x_t, a_t) + \gamma \, \mathbb{E}_{t, a_t}\left[ V_{t+1}^{\star}(x_{t+1}) \right] \right). \quad (2) $$

Here $\mathbb{E}_{t, a_t}[\cdot]$ stands for an expectation conditional on the current state $x_t$ and action $a_t$. The optimal policy $\pi^{\star}$ can be obtained from $V^{\star}$ as follows:

$$ \pi_t^{\star}(a_t|x_t) = \arg\max_{a_t}\left( \hat{R}_t(x_t, a_t) + \gamma \, \mathbb{E}_{t, a_t}\left[ V_{t+1}^{\star}(x_{t+1}) \right] \right). \quad (3) $$

The goal of Reinforcement Learning (RL) is to solve the Bellman optimality equation based on samples of data. Assuming that an optimal value function is found by means of RL, solving for the optimal policy $\pi^{\star}$ takes another optimization problem, as formulated in Eq.(3).

Let us begin by reformulating the Bellman optimality equation using a Fenchel-type representation:

$$ V_t^{\star}(x_t) = \max_{\pi(\cdot|x_t) \in \mathcal{P}} \sum_{a_t \in \mathcal{A}_t} \pi(a_t|x_t)\left( \hat{R}_t(x_t, a_t) + \gamma \, \mathbb{E}_{t, a_t}\left[ V_{t+1}^{\star}(x_{t+1}) \right] \right). \quad (4) $$

Here $\mathcal{P} = \{\pi : \pi \geq 0, \, \mathbf{1}^T \pi = 1\}$ denotes the set of all valid distributions. Eq.(4) is equivalent to the original Bellman optimality equation (2), because for any $x \in \mathbb{R}^n$, we have $\max_{i \in \{1, \ldots, n\}} x_i = \max_{\pi \geq 0, \, \|\pi\|_1 \leq 1} \pi^T x$. Note that while we use discrete notations for simplicity of presentation, all formulae below can be equivalently expressed in continuous notations by replacing sums by integrals. For brevity, we will denote the expectation $\mathbb{E}_{x_{t+1}|x_t, a_t}[\cdot]$ as $\mathbb{E}_{t, a}[\cdot]$ in what follows.

The one-step information cost of a learned policy $\pi(a_t|x_t)$ relative to a reference policy $\pi_0(a_t|x_t)$ is defined as follows (Fox et al., 2015):

$$ g^{\pi}(x_t, a_t) := \log \frac{\pi(a_t|x_t)}{\pi_0(a_t|x_t)}. \quad (5) $$

Its expectation over the policy $\pi$ is the Kullback-Leibler (KL) divergence of $\pi(\cdot|x_t)$ and $\pi_0(\cdot|x_t)$:

$$ \mathbb{E}_{\pi}\left[ g^{\pi}(x, a) \,|\, x_t \right] = KL[\pi \| \pi_0](x_t) := \sum_{a_t} \pi(a_t|x_t) \log \frac{\pi(a_t|x_t)}{\pi_0(a_t|x_t)}. \quad (6) $$

The total discounted information cost for a trajectory is defined as follows:

$$ I^{\pi}(x_t) := \sum_{t'=t}^{T} \gamma^{t'-t} \, \mathbb{E}_t^{\pi}\left[ g^{\pi}(x_{t'}, a_{t'}) \,|\, x_t \right]. \quad (7) $$

² Here we assume a discrete-time setting where time $t$ is measured in terms of an integer-valued number of elementary time steps $\Delta t$.
The free energy function $F_t^{\pi}(x_t)$ is defined as the value function (4) augmented by the information cost penalty (7), which is added using a regularization parameter $1/\beta$:

$$ F_t^{\pi}(x_t) := V_t^{\pi}(x_t) - \frac{1}{\beta} I^{\pi}(x_t) = \sum_{t'=t}^{T} \gamma^{t'-t} \, \mathbb{E}_t^{\pi}\left[ \hat{R}_{t'}(x_{t'}, a_{t'}) - \frac{1}{\beta} g^{\pi}(x_{t'}, a_{t'}) \right]. \quad (8) $$

The free energy $F_t^{\pi}(x_t)$ is the entropy-regularized value function, where the amount of regularization can be tuned to the level of noise in the data. The regularization parameter $\beta$ in Eq.(8) controls a trade-off between reward optimization and proximity of the optimal policy to the reference policy, and is often referred to as the "inverse temperature" parameter, using the analogy between Eq.(8) and the free energy in physics, see e.g. (Dixon et al., 2020). The reference policy $\pi_0$ provides a "guiding hand" in the stochastic policy optimization process that we now describe.

A Bellman equation for the free energy function $F_t^{\pi}(x_t)$ is obtained from Eq.(8):

$$ F_t^{\pi}(x_t) = \mathbb{E}_{a_t|x_t}\left[ \hat{R}_t(x_t, a_t) - \frac{1}{\beta} g^{\pi}(x_t, a_t) + \gamma \, \mathbb{E}_{t, a}\left[ F_{t+1}^{\pi}(x_{t+1}) \right] \right]. \quad (9) $$

For a finite-horizon setting with a terminal reward $\hat{R}_T(x_T, a_T)$, Eq.(9) should be supplemented by a terminal condition

$$ F_T^{\pi}(x_T) = \hat{R}_T(x_T, a_T^{\star}), \quad (10) $$

where the final action $a_T^{\star}$ maximizes the terminal reward $\hat{R}_T$ for the given terminal state $x_T$. Eq.(9) can be viewed as a soft probabilistic relaxation of the Bellman equation for the value function, with the KL information cost penalty (5) as a regularization controlled by the inverse temperature $\beta$. In addition to such a regularized value function (free energy), we will next introduce an entropy-regularized Q-function.

Similar to the action-value function, we define the state-action free energy function $G^{\pi}(x, a)$ as (Fox et al., 2015)

$$ G_t^{\pi}(x_t, a_t) = \hat{R}_t(x_t, a_t) + \gamma \, \mathbb{E}\left[ F_{t+1}^{\pi}(x_{t+1}) \,|\, x_t, a_t \right] \quad (11) $$
$$ = \hat{R}_t(x_t, a_t) + \gamma \, \mathbb{E}_{t, a}\left[ \sum_{t'=t+1}^{T} \gamma^{t'-t-1}\left( \hat{R}_{t'}(x_{t'}, a_{t'}) - \frac{1}{\beta} g^{\pi}(x_{t'}, a_{t'}) \right) \right] $$
$$ = \mathbb{E}_{t, a_t}\left[ \sum_{t'=t}^{T} \gamma^{t'-t}\left( \hat{R}_{t'}(x_{t'}, a_{t'}) - \frac{1}{\beta} g^{\pi}(x_{t'}, a_{t'}) \right) \right], $$

where the last equality uses the fact that the first action $a_t$ in the G-function is fixed, and hence $g^{\pi}(x_t, a_t) = 0$. If we now compare this expression with Eq.(8), we obtain the relation between the G-function and the free energy $F_t^{\pi}(x_t)$:

$$ F_t^{\pi}(x_t) = \sum_{a_t} \pi(a_t|x_t)\left[ G_t^{\pi}(x_t, a_t) - \frac{1}{\beta} \log \frac{\pi(a_t|x_t)}{\pi_0(a_t|x_t)} \right]. \quad (12) $$

This functional is maximized by the following distribution $\pi(a_t|x_t)$:

$$ \pi(a_t|x_t) = \frac{1}{Z_t} \, \pi_0(a_t|x_t) \, e^{\beta G_t^{\pi}(x_t, a_t)}, \qquad Z_t = \sum_{a_t} \pi_0(a_t|x_t) \, e^{\beta G_t^{\pi}(x_t, a_t)}. \quad (13) $$

The free energy (12) evaluated at the optimal solution (13) becomes

$$ F_t^{\pi}(x_t) = \frac{1}{\beta} \log Z_t = \frac{1}{\beta} \log \sum_{a_t} \pi_0(a_t|x_t) \, e^{\beta G_t^{\pi}(x_t, a_t)}. \quad (14) $$

Using Eq.(14), the optimal action policy can be written as follows:

$$ \pi(a_t|x_t) = \pi_0(a_t|x_t) \, e^{\beta\left( G_t^{\pi}(x_t, a_t) - F_t^{\pi}(x_t) \right)}. \quad (15) $$
Eqs.(14) and (15), along with the first form of Eq.(11), repeated here for convenience:

$$ G_t^{\pi}(x_t, a_t) = \hat{R}_t(x_t, a_t) + \gamma \, \mathbb{E}_{t, a}\left[ F_{t+1}^{\pi}(x_{t+1}) \,|\, x_t, a_t \right], \quad (16) $$

constitute a system of equations for G-learning (Fox et al., 2015) that should be solved self-consistently for $\pi(a_t|x_t)$, $G_t^{\pi}(x_t, a_t)$ and $F_t^{\pi}(x_t)$ by backward recursion for $t = T-1, \ldots, 0$, with terminal conditions

$$ G_T^{\pi}(x_T, a_T^{\star}) = \hat{R}_T(x_T, a_T^{\star}), \qquad F_T^{\pi}(x_T) = G_T^{\pi}(x_T, a_T^{\star}) = \hat{R}_T(x_T, a_T^{\star}). \quad (17) $$

We will next show how G-learning can be implemented in the context of (direct) reinforcement learning.
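To make the backward recursion (13)-(17) concrete, the following sketch solves it exactly for a small tabular MDP with known rewards and transition probabilities. The toy MDP (two states, two actions) and all numerical values are our own illustrative assumptions, not part of the wealth management model developed below.

```python
import numpy as np

# Toy finite-horizon MDP: 2 states, 2 actions (illustrative only).
n_S, n_A, T, gamma, beta = 2, 2, 5, 0.95, 2.0
rng = np.random.default_rng(0)
R = rng.normal(size=(T + 1, n_S, n_A))             # rewards R_t(x, a)
P = rng.dirichlet(np.ones(n_S), size=(n_S, n_A))   # transition probabilities P(x'|x, a)
pi0 = np.full((n_S, n_A), 1.0 / n_A)               # uniform reference policy pi_0

F = np.zeros((T + 1, n_S))                         # free energy F_t(x)
pi = np.zeros((T + 1, n_S, n_A))                   # learned stochastic policy

# Terminal conditions (17): F_T(x) = R_T(x, a*_T), with a*_T maximizing the terminal reward
F[T] = R[T].max(axis=1)
pi[T] = np.eye(n_A)[R[T].argmax(axis=1)]

for t in reversed(range(T)):
    # Eq.(16): G_t(x, a) = R_t(x, a) + gamma * E[F_{t+1}(x') | x, a]
    G = R[t] + gamma * P @ F[t + 1]
    # Eq.(14): F_t(x) = (1/beta) log sum_a pi_0(a|x) exp(beta * G_t(x, a))
    F[t] = (1.0 / beta) * np.log((pi0 * np.exp(beta * G)).sum(axis=1))
    # Eq.(15): pi(a|x) = pi_0(a|x) exp(beta * (G_t(x, a) - F_t(x)))
    pi[t] = pi0 * np.exp(beta * (G - F[t][:, None]))

print(pi[0])   # optimal stochastic policy at t = 0 (each row sums to one)
```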
In the RL setting, when rewards are observed, the system of Eqs.(14), (15), (16) can be reduced to one non-linear equation. Substituting the augmented free energy (14) into Eq.(16), we obtain

$$ G_t^{\pi}(x_t, a_t) = \hat{R}_t(x_t, a_t) + \mathbb{E}_{t, a}\left[ \frac{\gamma}{\beta} \log \sum_{a_{t+1}} \pi_0(a_{t+1}|x_{t+1}) \, e^{\beta G_{t+1}^{\pi}(x_{t+1}, a_{t+1})} \right]. \quad (18) $$

This equation provides a soft relaxation of the Bellman optimality equation for the action-value Q-function, with the G-function defined in Eq.(11) being an entropy-regularized Q-function (Fox et al., 2015). The "inverse-temperature" parameter $\beta$ in Eq.(18) determines the strength of entropy regularization. In particular, if we take a "zero-temperature" limit $\beta \to \infty$, we recover the original Bellman optimality equation for the Q-function. Because the last term in (18) approximates the $\max(\cdot)$ function when $\beta$ is large but finite, for a particular choice of a uniform reference distribution $\pi_0$, Eq.(18) is known in the literature as "soft Q-learning".

For finite values $\beta < \infty$, in a setting of reinforcement learning with observed rewards, Eq.(18) can be used to specify G-learning (Fox et al., 2015): an off-policy temporal-difference (TD) algorithm that generalizes Q-learning to noisy environments where an entropy-based regularization is appropriate.

The G-learning algorithm of Fox et al. (2015) was specified in a tabulated setting where both the state and action spaces are finite. In our case, we model MDPs in high-dimensional continuous state and action spaces. Respectively, we cannot rely on tabulated G-learning, and need to specify a functional form of the action-value function, or use a non-parametric function approximation such as a neural network to represent its values. An additional challenge is to compute a multidimensional integral (or a sum) over all next-step actions in Eq.(18). Unless a tractable parameterization is used for $\pi_0$ and $G_t$, repeated numerical evaluation of this integral can substantially slow down the learning.

To summarize, G-learning is an off-policy, generative reinforcement learning algorithm with a stochastic policy. In contrast to Q-learning, which produces deterministic policies, G-learning generally produces stochastic policies, while the deterministic Q-learning policies are recovered in the zero-temperature limit $\beta \to \infty$. In the next section, we will build an approach to goal-based wealth management based on G-learning. Later in this paper, we will also consider applications of G-learning for Inverse Reinforcement Learning (IRL).
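Eq.(18) also admits a simple sample-based (temporal-difference) reading: replace the expectation by a single observed transition and move the G-function toward the resulting soft target. The sketch below illustrates this for a synthetic tabular environment with a uniform reference policy; it uses a stationary (time-homogeneous) G-function for simplicity, so it is only meant to convey the update rule, not the continuous-state G-Learner developed in the next sections. All names and numbers are our own.

```python
import numpy as np

n_S, n_A = 4, 3
gamma, beta, lr = 0.95, 2.0, 0.1
rng = np.random.default_rng(1)
G = np.zeros((n_S, n_A))                   # entropy-regularized Q-function (G-function)
pi0 = np.full((n_S, n_A), 1.0 / n_A)       # uniform reference policy

def soft_value(G_row):
    # (1/beta) log sum_a pi_0(a|x') exp(beta G(x', a)); approximates max_a G(x', a) for large beta
    return (1.0 / beta) * np.log(np.sum(pi0[0] * np.exp(beta * G_row)))

# Synthetic environment (illustrative): random rewards and transition probabilities
R = rng.normal(size=(n_S, n_A))
P = rng.dirichlet(np.ones(n_S), size=(n_S, n_A))

x = 0
for step in range(5000):
    a = rng.integers(n_A)                             # behaviour policy: uniform exploration
    x_next = rng.choice(n_S, p=P[x, a])
    r = R[x, a] + 0.1 * rng.normal()                  # noisy observed reward
    td_target = r + gamma * soft_value(G[x_next])     # one-sample estimate of the RHS of Eq.(18)
    G[x, a] += lr * (td_target - G[x, a])             # TD update toward the soft target
    x = x_next

print(np.round(G, 2))
```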
Let us begin by considering a simplified model for retirement planning. We assume a discrete-time process with $T$ steps, so that $T$ is the (integer-valued) time horizon. The investor/planner keeps the wealth in $N$ assets, with $x_t$ being the vector of dollar values of positions in the different assets at time $t$, and $u_t$ being the vector of changes in these positions. We assume that the first asset, with $n = 1$, is risk-free, while the remaining assets are risky, with random returns $r_t$ whose expected values are $\bar{r}_t$. The covariance matrix of returns is $\Sigma_r$, of size $(N-1) \times (N-1)$.

Optimization of a retirement plan involves optimization of both regular contributions to the plan and asset allocations. Let $c_t$ be a cash installment in the plan at time $t$. The pair $(c_t, u_t)$ can thus be considered the action variables in a dynamic optimization problem corresponding to the retirement plan.

We assume that at each time step $t$, there is a pre-specified target value $\hat{P}_{t+1}$ of the portfolio at time $t+1$. We assume that the target value $\hat{P}_{t+1}$ set at step $t$ exceeds the next-step portfolio value $V_{t+1} = (1 + r_t)(x_t + u_t)$, and we seek to impose a penalty for under-performance relative to this target. To this end, we can consider the following expected reward for time step $t$:

$$ R_t(x_t, u_t, c_t) = -c_t - \lambda \, \mathbb{E}_t\left[ \left( \hat{P}_{t+1} - (1 + r_t)(x_t + u_t) \right)_+ \right] - u_t^T \Omega u_t. \quad (19) $$

Here the first term is due to an installment of amount $c_t$ at the beginning of time period $t$, the second term is the expected negative reward from the end of the period for under-performance relative to the target, and the third term approximates transaction costs by a convex functional with the parameter matrix $\Omega$, and serves as an $L_2$ regularization.

The one-step reward (19) is inconvenient to work with due to the rectified non-linearity $(\cdot)_+ := \max(\cdot, 0)$ under the expectation. Another problem is that the decision variables $c_t$ and $u_t$ are not independent, but rather satisfy the following constraint:

$$ \sum_{n=1}^{N} u_{tn} = c_t, \quad (20) $$

which simply means that at every time step, the total change in all positions should equal the cash installment $c_t$ at this time.

We therefore modify the one-step reward (19) in two ways: we replace the first term using Eq.(20), and approximate the rectified non-linearity by a quadratic function. The new one-step reward is

$$ R_t(x_t, u_t) = -\sum_{n=1}^{N} u_{tn} - \lambda \, \mathbb{E}_t\left[ \left( \hat{P}_{t+1} - (1 + r_t)(x_t + u_t) \right)^2 \right] - u_t^T \Omega u_t. \quad (21) $$

The new reward function (21) is attractive on two counts. First, it explicitly resolves the constraint (20) between the cash injection $c_t$ and the portfolio allocation decisions, and thus converts the initial constrained optimization problem into an unconstrained one. We remind the reader that this differs from the Merton model, where allocation variables are defined as fractions of the total wealth and thus are constrained by construction. The approach based on dollar-measured actions both reduces the dimensionality of the optimization problem and makes it unconstrained. When the unconstrained optimization problem is solved, the optimal contribution $c_t$ at time $t$ can be obtained from Eq.(20).

The second attractive feature of the reward (21) is that it is quadratic in the actions $u_t$, and is therefore highly tractable. On the other hand, a well known disadvantage of quadratic rewards (penalties) is that they are symmetric, and penalize both scenarios $V_{t+1} \gg \hat{P}_{t+1}$ and $V_{t+1} \ll \hat{P}_{t+1}$, while in fact we only want to penalize the second class of scenarios. To mitigate this drawback, we can consider target values $\hat{P}_{t+1}$ that are considerably higher than the time-$t$ expectation of the next-period portfolio value. For example, one simple choice could be to set the target portfolio as a linear combination of a portfolio-independent benchmark $B_t$ and the current portfolio growing at a fixed rate $\eta$:

$$ \hat{P}_{t+1} = (1 - \rho) B_t + \rho \eta \, \mathbf{1}^T x_t, \quad (22) $$

where $0 \leq \rho \leq 1$ is a mixture parameter and $\eta > 1$ is a fixed growth rate applied to the current portfolio value $\mathbf{1}^T x_t$. For sufficiently large values of $B_t$ and $\eta$, such a target portfolio would be well above the current portfolio at all times, and thus would serve as a reasonable proxy to the asymmetric measure (19). The advantage of such a parameterization of the target portfolio is that both the "desired growth" parameter $\eta$ and the mixture parameter $\rho$ can be learned from the observed behavior of a financial agent in the setting of Inverse Reinforcement Learning (IRL), as we will discuss in Sec. 5.
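As a sanity check of Eqs.(21)-(22), the following sketch evaluates the target and the expected one-step reward for given positions and trades, using $\mathbb{E}_t\left[(1+r_t)(1+r_t)^T\right] = (1+\bar{r}_t)(1+\bar{r}_t)^T + \mathrm{diag}(0, \Sigma_r)$. The function and variable names, as well as all numerical inputs, are our own placeholders.

```python
import numpy as np

def target_portfolio(x, B_t, rho, eta):
    """Eq.(22): P_hat_{t+1} = (1 - rho) * B_t + rho * eta * 1^T x_t."""
    return (1.0 - rho) * B_t + rho * eta * np.sum(x)

def expected_reward(x, u, r_bar, Sigma_r, B_t, rho, eta, lam, Omega):
    """Eq.(21): -1^T u_t - lam * E[(P_hat_{t+1} - (1 + r_t)^T (x_t + u_t))^2] - u_t^T Omega u_t.

    The expectation uses E[(1+r)(1+r)^T] = (1+r_bar)(1+r_bar)^T + diag(0, Sigma_r),
    where the first asset is risk-free (zero return variance).
    """
    P_hat = target_portfolio(x, B_t, rho, eta)
    xu = x + u
    one_plus_r = 1.0 + r_bar
    Sigma_hat = np.zeros((len(x), len(x)))
    Sigma_hat[1:, 1:] = Sigma_r
    Sigma_hat += np.outer(one_plus_r, one_plus_r)        # Sigma_hat_t of Eq.(23)
    expected_sq = P_hat**2 - 2.0 * P_hat * one_plus_r @ xu + xu @ Sigma_hat @ xu
    return -np.sum(u) - lam * expected_sq - u @ Omega @ u

# Toy usage with 1 risk-free asset + 3 stocks (placeholders, not the paper's experiment)
N = 4
x = np.array([250.0, 250.0, 250.0, 250.0])
u = np.array([-30.0, 10.0, 10.0, 10.0])
r_bar = np.array([0.005, 0.02, 0.015, 0.01])
Sigma_r = 0.02 * np.eye(N - 1)
print(expected_reward(x, u, r_bar, Sigma_r, B_t=1000.0, rho=0.4, eta=1.02,
                      lam=0.001, Omega=1e-4 * np.eye(N)))
```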
In what follows, we use Eq.(22) as our specification of the target portfolio. We note that a quadratic loss specification relative to a target time-dependent wealth level is a popular choice in the recent literature on wealth management. One example is provided by Lin et al. (2019), who develop a dynamic optimization approach with a similar squared loss function for a defined contribution retirement plan. A similar approach which relies on a direct specification of a reward based on a target portfolio level is known as "goal-based wealth management" (Browne, 1996; Das et al., 2018).

The square loss reward specification is very convenient, as it allows one to construct optimal policies semi-analytically. Here we will demonstrate how to build a semi-analytical scheme for computing optimal stochastic consumption-investment policies for a retirement plan; the method is sufficiently general for either a cumulation or a de-cumulation phase. For other specifications of rewards, numerical optimization and function approximations (e.g. neural networks) would be required.

The expected reward (21) can be written in a more explicit quadratic form if we denote asset returns as $r_t = \bar{r}_t + \tilde{\varepsilon}_t$, where the first component $\bar{r}_t^{(1)} = r_f$ is the risk-free rate (as the first asset is risk-free), and $\tilde{\varepsilon}_t = (0, \varepsilon_t)$, where $\varepsilon_t$ is an idiosyncratic noise with covariance $\Sigma_r$ of size $(N-1) \times (N-1)$. Substituting this expression in Eq.(21), we obtain

$$ R_t(x_t, u_t) = -\lambda \hat{P}_{t+1}^2 - u_t^T \mathbf{1} + 2\lambda \hat{P}_{t+1} (x_t + u_t)^T (1 + \bar{r}_t) - \lambda (x_t + u_t)^T \hat{\Sigma}_t (x_t + u_t) - u_t^T \Omega u_t $$
$$ = x_t^T R_t^{(xx)} x_t + u_t^T R_t^{(ux)} x_t + u_t^T R_t^{(uu)} u_t + x_t^T R_t^{(x)} + u_t^T R_t^{(u)} + R_t^{(0)}, $$

where

$$ \hat{\Sigma}_t = \begin{bmatrix} 0 & 0 \\ 0 & \Sigma_r \end{bmatrix} + (1 + \bar{r}_t)(1 + \bar{r}_t)^T $$
$$ R_t^{(xx)} = -\lambda \eta^2 \rho^2 \, \mathbf{1}\mathbf{1}^T + 2\lambda \eta \rho \, \mathbf{1}(1 + \bar{r}_t)^T - \lambda \hat{\Sigma}_t $$
$$ R_t^{(ux)} = 2\lambda \eta \rho \, (1 + \bar{r}_t)\mathbf{1}^T - 2\lambda \hat{\Sigma}_t $$
$$ R_t^{(uu)} = -\lambda \hat{\Sigma}_t - \Omega $$
$$ R_t^{(x)} = -2\lambda \eta \rho (1 - \rho) B_t \, \mathbf{1} + 2\lambda (1 - \rho) B_t (1 + \bar{r}_t) $$
$$ R_t^{(u)} = -\mathbf{1} + 2\lambda (1 - \rho) B_t (1 + \bar{r}_t) \quad (23) $$
$$ R_t^{(0)} = -\lambda (1 - \rho)^2 B_t^2. $$
Assuming that the expected returns $\bar{r}_t$, the covariance matrix $\Sigma_r$ and the benchmark $B_t$ are fixed, the vector of free parameters defining the reward function is thus $\theta := (\lambda, \eta, \rho, \Omega)$.
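A hypothetical helper that assembles the reward coefficients of Eq.(23) from $\theta = (\lambda, \eta, \rho, \Omega)$ and the market inputs might look as follows; the names are ours, and the numbers in the usage example are placeholders.

```python
import numpy as np

def reward_coefficients(r_bar, Sigma_r, B_t, lam, eta, rho, Omega):
    """Assemble the coefficient matrices and vectors of Eq.(23) for a single time step."""
    N = len(r_bar)
    one = np.ones(N)
    one_plus_r = 1.0 + r_bar
    Sigma_hat = np.zeros((N, N))
    Sigma_hat[1:, 1:] = Sigma_r                        # only the risky block carries return variance
    Sigma_hat += np.outer(one_plus_r, one_plus_r)      # Sigma_hat_t of Eq.(23)

    R_xx = (-lam * eta**2 * rho**2 * np.outer(one, one)
            + 2.0 * lam * eta * rho * np.outer(one, one_plus_r)
            - lam * Sigma_hat)
    R_ux = 2.0 * lam * eta * rho * np.outer(one_plus_r, one) - 2.0 * lam * Sigma_hat
    R_uu = -lam * Sigma_hat - Omega
    R_x = 2.0 * lam * (1.0 - rho) * B_t * (one_plus_r - eta * rho * one)
    R_u = -one + 2.0 * lam * (1.0 - rho) * B_t * one_plus_r
    R_0 = -lam * (1.0 - rho)**2 * B_t**2
    return {"xx": R_xx, "ux": R_ux, "uu": R_uu, "x": R_x, "u": R_u, "0": R_0,
            "Sigma_hat": Sigma_hat}

# Placeholder usage with 1 risk-free asset + 3 stocks (illustrative values only)
N = 4
coeffs = reward_coefficients(r_bar=np.array([0.005, 0.02, 0.015, 0.01]),
                             Sigma_r=0.02 * np.eye(N - 1), B_t=1000.0,
                             lam=0.001, eta=1.02, rho=0.4, Omega=1e-4 * np.eye(N))
print(coeffs["uu"].shape)
```

Such a helper would be re-evaluated at every time step, since $\bar{r}_t$ and $B_t$ generally vary with $t$.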
To solve the optimization problem, we use a semi-analytical formulation of G-learning with Gaussian time-varying policies (GTVP). In what follows, we will refer to our specific algorithm implementing G-learning with our model specifications as the G-Learner algorithm, to differentiate our model from more general models that could potentially be constructed using G-learning as a general RL method.

We start by specifying a functional form of the value function as a quadratic form of $x_t$:

$$ F_t^{\pi}(x_t) = x_t^T F_t^{(xx)} x_t + x_t^T F_t^{(x)} + F_t^{(0)}, \quad (24) $$

where $F_t^{(xx)}, F_t^{(x)}, F_t^{(0)}$ are parameters that can depend on time via their dependence on the target values $\hat{P}_{t+1}$ and the expected returns $\bar{r}_t$. The dynamic equation takes the form

$$ x_{t+1} = A_t (x_t + u_t) + (x_t + u_t) \circ \tilde{\varepsilon}_t, \qquad A_t := \mathrm{diag}(1 + \bar{r}_t), \quad \tilde{\varepsilon}_t := (0, \varepsilon_t). \quad (25) $$

Note that the only features used here are the expected asset returns $\bar{r}_t$ for the current period $t$. We assume that the expected asset returns are available as the output of a separate statistical model using e.g. a factor model framework. The present formalism is agnostic to the choice of the expected return model.

Coefficients of the value function (24) are computed backward in time, starting from the last maturity $t = T - 1$.
For $t = T - 1$, the quadratic reward (23) can be optimized analytically by the following action:

$$ u_{T-1} = \tilde{\Sigma}_{T-1}^{-1}\left( \frac{1}{2\lambda} R_{T-1}^{(u)} + \frac{1}{2\lambda} R_{T-1}^{(ux)} x_{T-1} \right), \quad (26) $$

where we defined $\tilde{\Sigma}_{T-1}$ as follows:

$$ \tilde{\Sigma}_{T-1} := \hat{\Sigma}_{T-1} + \frac{1}{\lambda} \Omega. \quad (27) $$

Note that the optimal action is a linear function of the state. Another interesting point to note is that the last term $\sim \Omega$, which describes convex transaction costs in Eq.(23), produces a regularization of the matrix inversion in Eq.(26).

As for the last time step we have $F_{T-1}^{\pi}(x_{T-1}) = \hat{R}_{T-1}$, the coefficients $F_{T-1}^{(xx)}, F_{T-1}^{(x)}, F_{T-1}^{(0)}$ can be computed by plugging Eq.(26) back into Eq.(23), and comparing the result with Eq.(24) with $t = T - 1$:

$$ F_{T-1}^{(xx)} = R_{T-1}^{(xx)} + \frac{1}{2\lambda}\left[ R_{T-1}^{(ux)} \right]^T \left[ \tilde{\Sigma}_{T-1}^{-1} \right]^T R_{T-1}^{(ux)} + \frac{1}{4\lambda^2}\left[ R_{T-1}^{(ux)} \right]^T \left[ \tilde{\Sigma}_{T-1}^{-1} \right]^T R_{T-1}^{(uu)} \tilde{\Sigma}_{T-1}^{-1} R_{T-1}^{(ux)} $$
$$ F_{T-1}^{(x)} = R_{T-1}^{(x)} + \frac{1}{\lambda}\left[ R_{T-1}^{(ux)} \right]^T \left[ \tilde{\Sigma}_{T-1}^{-1} \right]^T R_{T-1}^{(u)} + \frac{1}{2\lambda^2}\left[ R_{T-1}^{(ux)} \right]^T \left[ \tilde{\Sigma}_{T-1}^{-1} \right]^T R_{T-1}^{(uu)} \tilde{\Sigma}_{T-1}^{-1} R_{T-1}^{(u)} \quad (28) $$
$$ F_{T-1}^{(0)} = R_{T-1}^{(0)} + \frac{1}{2\lambda}\left[ R_{T-1}^{(u)} \right]^T \left[ \tilde{\Sigma}_{T-1}^{-1} \right]^T R_{T-1}^{(u)} + \frac{1}{4\lambda^2}\left[ R_{T-1}^{(u)} \right]^T \left[ \tilde{\Sigma}_{T-1}^{-1} \right]^T R_{T-1}^{(uu)} \tilde{\Sigma}_{T-1}^{-1} R_{T-1}^{(u)}. $$
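A minimal sketch of the terminal-step computation (26)-(27) is shown below; the reward coefficients are built inline from Eq.(23) for illustration, and all numerical values are placeholders rather than the paper's experimental settings.

```python
import numpy as np

def terminal_action(x, R_ux, R_u, lam, Omega, Sigma_hat):
    """Eqs.(26)-(27): optimal trade at t = T-1 for the quadratic reward (23)."""
    Sigma_tilde = Sigma_hat + Omega / lam                 # Eq.(27)
    rhs = (R_u + R_ux @ x) / (2.0 * lam)
    return np.linalg.solve(Sigma_tilde, rhs)              # Eq.(26): Sigma_tilde^{-1} (...)

# Illustrative inputs (placeholders): 1 risk-free asset + 2 stocks
lam, rho, eta, B_t = 0.001, 0.4, 1.02, 1000.0
r_bar = np.array([0.005, 0.02, 0.015])
one_plus_r = 1.0 + r_bar
Sigma_hat = np.outer(one_plus_r, one_plus_r)
Sigma_hat[1:, 1:] += 0.02 * np.eye(2)
Omega = 1e-4 * np.eye(3)
R_ux = 2 * lam * eta * rho * np.outer(one_plus_r, np.ones(3)) - 2 * lam * Sigma_hat
R_u = -np.ones(3) + 2 * lam * (1 - rho) * B_t * one_plus_r
x = np.array([300.0, 350.0, 350.0])
print(terminal_action(x, R_ux, R_u, lam, Omega, Sigma_hat))
```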
For an arbitrary time step $t = T - 2, \ldots, 0$, we use Eq.(25) to compute the conditional expectation of the next-period F-function in the Bellman equation as follows:

$$ \mathbb{E}_{t, a}\left[ F_{t+1}^{\pi}(x_{t+1}) \right] = (x_t + u_t)^T\left( A_t^T \bar{F}_{t+1}^{(xx)} A_t + \tilde{\Sigma}_r \circ \bar{F}_{t+1}^{(xx)} \right)(x_t + u_t) + (x_t + u_t)^T A_t^T \bar{F}_{t+1}^{(x)} + \bar{F}_{t+1}^{(0)}, \qquad \tilde{\Sigma}_r := \begin{bmatrix} 0 & 0 \\ 0 & \Sigma_r \end{bmatrix}, \quad (29) $$
where $\bar{F}_{t+1}^{(xx)} := \mathbb{E}_t\left[ F_{t+1}^{(xx)} \right]$, and similarly for $\bar{F}_{t+1}^{(x)}$ and $\bar{F}_{t+1}^{(0)}$. This is a quadratic function of $x_t$ and $u_t$, and has the same structure as the quadratic reward $\hat{R}_t(x_t, a_t)$ in Eq.(23). Plugging both expressions into the Bellman equation

$$ G_t^{\pi}(x_t, u_t) = \hat{R}_t(x_t, u_t) + \gamma \, \mathbb{E}_{t, u}\left[ F_{t+1}^{\pi}(x_{t+1}) \,|\, x_t, u_t \right], $$

we see that the action-value function $G_t^{\pi}(x_t, u_t)$ should also be a quadratic function of $x_t$ and $u_t$:

$$ G_t^{\pi}(x_t, u_t) = x_t^T Q_t^{(xx)} x_t + u_t^T Q_t^{(ux)} x_t + u_t^T Q_t^{(uu)} u_t + x_t^T Q_t^{(x)} + u_t^T Q_t^{(u)} + Q_t^{(0)}, \quad (30) $$

where

$$ Q_t^{(xx)} = R_t^{(xx)} + \gamma\left( A_t^T \bar{F}_{t+1}^{(xx)} A_t + \tilde{\Sigma}_r \circ \bar{F}_{t+1}^{(xx)} \right) $$
$$ Q_t^{(ux)} = R_t^{(ux)} + 2\gamma\left( A_t^T \bar{F}_{t+1}^{(xx)} A_t + \tilde{\Sigma}_r \circ \bar{F}_{t+1}^{(xx)} \right) $$
$$ Q_t^{(uu)} = R_t^{(uu)} + \gamma\left( A_t^T \bar{F}_{t+1}^{(xx)} A_t + \tilde{\Sigma}_r \circ \bar{F}_{t+1}^{(xx)} \right) \quad (31) $$
$$ Q_t^{(x)} = R_t^{(x)} + \gamma A_t^T \bar{F}_{t+1}^{(x)} $$
$$ Q_t^{(u)} = R_t^{(u)} + \gamma A_t^T \bar{F}_{t+1}^{(x)} $$
$$ Q_t^{(0)} = R_t^{(0)} + \gamma \bar{F}_{t+1}^{(0)}. $$

In the continuous-action setting, the free energy (14) is computed by integration over the next-step action:

$$ F_t^{\pi}(x_t) = \frac{1}{\beta} \log \int \pi_0(u_t|x_t) \, e^{\beta G_t^{\pi}(x_t, u_t)} \, du_t. \quad (32) $$

The reference policy $\pi_0(u_t|x_t)$ is Gaussian:

$$ \pi_0(u_t|x_t) = \frac{1}{\sqrt{(2\pi)^n \left| \Sigma_p \right|}} \, e^{-\frac{1}{2}(u_t - \hat{u}_t)^T \Sigma_p^{-1}(u_t - \hat{u}_t)}, \quad (33) $$

where the mean value $\hat{u}_t$ is a linear function of the state $x_t$:

$$ \hat{u}_t = \bar{u}_t + \bar{v}_t x_t. \quad (34) $$

Integration over $u_t$ in Eq.(32) is performed analytically using the well known $n$-dimensional Gaussian integration formula

$$ \int e^{-\frac{1}{2} u^T A u + u^T B} \, d^n u = \sqrt{\frac{(2\pi)^n}{|A|}} \, e^{\frac{1}{2} B^T A^{-1} B}, \quad (35) $$

where $|A|$ denotes the determinant of the matrix $A$.

Note that, unlike in the Merton approach (Merton, 1971) or in traditional Markowitz portfolio optimization (Markowitz, 1959), here we work with unconstrained variables that do not have to sum up to one, and therefore unconstrained multivariate Gaussian integration readily applies. Remarkably, this implies that once the decision variables are chosen appropriately, portfolio optimization for wealth management tasks may in a sense be an easier problem than portfolio optimization that does not involve intermediate cashflows, and is often formulated using self-financing conditions.

Performing the Gaussian integration and comparing the resulting expression with Eq.(24), we obtain for its coefficients:

$$ F_t^{\pi}(x_t) = x_t^T F_t^{(xx)} x_t + x_t^T F_t^{(x)} + F_t^{(0)} $$
$$ F_t^{(xx)} = Q_t^{(xx)} + \frac{1}{2\beta}\left( U_t^T \bar{\Sigma}_p^{-1} U_t - \bar{v}_t^T \Sigma_p^{-1} \bar{v}_t \right) $$
$$ F_t^{(x)} = Q_t^{(x)} + \frac{1}{\beta}\left( U_t^T \bar{\Sigma}_p^{-1} W_t - \bar{v}_t^T \Sigma_p^{-1} \bar{u}_t \right) \quad (36) $$
$$ F_t^{(0)} = Q_t^{(0)} + \frac{1}{2\beta}\left( W_t^T \bar{\Sigma}_p^{-1} W_t - \bar{u}_t^T \Sigma_p^{-1} \bar{u}_t \right) - \frac{1}{2\beta}\left( \log \left| \Sigma_p \right| + \log \left| \bar{\Sigma}_p \right| \right), $$

where we use the auxiliary parameters

$$ U_t = \beta Q_t^{(ux)} + \Sigma_p^{-1} \bar{v}_t, \qquad W_t = \beta Q_t^{(u)} + \Sigma_p^{-1} \bar{u}_t, \qquad \bar{\Sigma}_p = \Sigma_p^{-1} - 2\beta Q_t^{(uu)}. \quad (37) $$

The optimal policy for the given step is given by

$$ \pi(u_t|x_t) = \pi_0(u_t|x_t) \, e^{\beta\left( G_t^{\pi}(x_t, u_t) - F_t^{\pi}(x_t) \right)}. \quad (38) $$
Using here the quadratic action-value function (30) produces a new Gaussian policy $\pi(u_t|x_t)$:

$$ \pi(u_t|x_t) = \frac{1}{\sqrt{(2\pi)^n \left| \tilde{\Sigma}_p \right|}} \, e^{-\frac{1}{2}(u_t - \tilde{u}_t - \tilde{v}_t x_t)^T \tilde{\Sigma}_p^{-1}(u_t - \tilde{u}_t - \tilde{v}_t x_t)}, \quad (39) $$

where

$$ \tilde{\Sigma}_p^{-1} = \Sigma_p^{-1} - 2\beta Q_t^{(uu)}, \qquad \tilde{u}_t = \tilde{\Sigma}_p\left( \Sigma_p^{-1} \bar{u}_t + \beta Q_t^{(u)} \right), \qquad \tilde{v}_t = \tilde{\Sigma}_p\left( \Sigma_p^{-1} \bar{v}_t + \beta Q_t^{(ux)} \right). \quad (40) $$

Therefore, policy optimization for G-learning with quadratic rewards and a Gaussian reference policy amounts to a Bayesian update of the prior distribution (33), with the parameters $\bar{u}_t, \bar{v}_t, \Sigma_p$ updated to the new values $\tilde{u}_t, \tilde{v}_t, \tilde{\Sigma}_p$ defined in Eqs.(40). These quantities depend on time via their dependence on the targets $\hat{P}_t$ and the expected asset returns $\bar{r}_t$.

For a given time step $t$, the G-learning algorithm keeps iterating between the policy optimization step, which updates the policy parameters according to Eq.(40) for fixed coefficients of the F- and G-functions, and the policy evaluation step, which involves Eqs.(30), (31), (36) and solves for the parameters of the F- and G-functions given the policy parameters. Note that convergence of the iterations for $\tilde{u}_t, \tilde{v}_t$ is guaranteed as $\left| \tilde{\Sigma}_p \Sigma_p^{-1} \right| < 1$.
At convergence of the iteration for time step $t$, Eqs.(30), (31), (36) and (39) together solve one step of G-learning. The calculation then proceeds by moving to the previous step $t \to t - 1$, and repeating the calculation, all the way back to the present time.
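The following sketch condenses one backward step of the G-Learner, combining Eqs.(29)-(31) (the G-function coefficients), Eqs.(36)-(37) (the F-function update) and Eq.(40) (the policy update). Re-centering the prior mean at the updated posterior mean between iterations is our reading of the iteration described above, and the helper names and dictionary-based interface are our own conventions, not a verbatim transcription of the paper's implementation.

```python
import numpy as np

def g_learner_step(R, F_next, A_t, Sigma_tilde_r, Sigma_p, u_bar, v_bar, beta, gamma, n_iter=10):
    """One backward step of the G-Learner.

    R       : dict with keys 'xx','ux','uu','x','u','0'  -- reward coefficients of Eq.(23)
    F_next  : dict with keys 'xx','x','0'                 -- value-function coefficients at t+1
    Returns the policy parameters (u_tilde, v_tilde, Sigma_tilde_p) and F-coefficients at t.
    """
    M = A_t.T @ F_next['xx'] @ A_t + Sigma_tilde_r * F_next['xx']   # Eq.(29): A^T F A + Sigma_r_tilde o F
    # Eq.(31): coefficients of the quadratic G-function
    Q_xx = R['xx'] + gamma * M
    Q_ux = R['ux'] + 2.0 * gamma * M
    Q_uu = R['uu'] + gamma * M
    Q_x = R['x'] + gamma * A_t.T @ F_next['x']
    Q_u = R['u'] + gamma * A_t.T @ F_next['x']
    Q_0 = R['0'] + gamma * F_next['0']

    Sigma_p_inv = np.linalg.inv(Sigma_p)
    for _ in range(n_iter):
        # Eq.(40): policy update (Bayesian update of the Gaussian prior)
        Sigma_bar_p = Sigma_p_inv - 2.0 * beta * Q_uu            # Eq.(37), precision-like matrix
        Sigma_tilde_p = np.linalg.inv(Sigma_bar_p)               # covariance of the new policy
        u_tilde = Sigma_tilde_p @ (Sigma_p_inv @ u_bar + beta * Q_u)
        v_tilde = Sigma_tilde_p @ (Sigma_p_inv @ v_bar + beta * Q_ux)
        # Eqs.(36)-(37): policy evaluation -- F-function coefficients at time t
        U = beta * Q_ux + Sigma_p_inv @ v_bar
        W = beta * Q_u + Sigma_p_inv @ u_bar
        F_xx = Q_xx + (U.T @ Sigma_tilde_p @ U - v_bar.T @ Sigma_p_inv @ v_bar) / (2.0 * beta)
        F_x = Q_x + (U.T @ Sigma_tilde_p @ W - v_bar.T @ Sigma_p_inv @ u_bar) / beta
        F_0 = (Q_0 + (W @ Sigma_tilde_p @ W - u_bar @ Sigma_p_inv @ u_bar) / (2.0 * beta)
               - (np.linalg.slogdet(Sigma_p)[1] + np.linalg.slogdet(Sigma_bar_p)[1]) / (2.0 * beta))
        u_bar, v_bar = u_tilde, v_tilde                          # re-center the prior and iterate
    return (u_tilde, v_tilde, Sigma_tilde_p), {'xx': F_xx, 'x': F_x, '0': F_0}
```

In a full implementation this function would be called inside a loop over $t = T-2, \ldots, 0$, seeded with the terminal-step coefficients (28); the resulting G-function coefficients (30) are also what GIRL needs in Section 5 to evaluate the log-policy term.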
The additional step needed from G-learning for the present problem is to find the optimal cash contribution for each time step by using the budget constraint (20). As G-learning produces Gaussian random actions $u_t$, Eq.(20) implies that the time-$t$ optimal contribution $c_t$ is Gaussian distributed with mean $\bar{c}_t = \mathbf{1}^T(\bar{u}_t + \bar{v}_t x_t)$. The expected optimal contribution $\bar{c}_t$ thus has a part $\sim \bar{u}_t$ that is independent of the portfolio value, and a part $\sim \bar{v}_t x_t$ that depends on the current portfolio. This is similar e.g. to a linear specification of the defined contribution with a deterministic policy in Lin et al. (2019).

It should be noted that in practice, we may want to impose constraints on the cash installments $c_t$. For example, we could impose band constraints $0 \leq c_t \leq c_{\max}$ with some upper bound $c_{\max}$. Such constraints can easily be added to the framework. To this end, we need to replace the exactly solvable unconstrained least squares problem with a constrained least squares problem. This can be done without a substantial increase of computational time using efficient off-the-shelf convex optimization software. Note that enforcing constraints on the resulting cash-flows in our approach amounts to optimization with one constraint, instead of two constraints as in the Merton approach.
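Once the policy parameters for a time step are available, the cash installment follows from Eq.(20) by summing the sampled trades. The sketch below also shows a crude way to respect a band constraint $0 \leq c_t \leq c_{\max}$ by clipping the sampled installment; the constrained least-squares treatment mentioned above would instead re-solve for the trades, so this is only a rough proxy. All names and numbers are placeholders.

```python
import numpy as np

def sample_action_and_cash(x, u_tilde, v_tilde, Sigma_tilde_p, c_max=None, rng=None):
    """Sample u_t from the Gaussian policy (39) and read off the cash installment via Eq.(20)."""
    rng = rng or np.random.default_rng()
    mean = u_tilde + v_tilde @ x
    u = rng.multivariate_normal(mean, Sigma_tilde_p)
    c = float(np.sum(u))                    # Eq.(20): c_t = 1^T u_t
    if c_max is not None:
        c = min(max(c, 0.0), c_max)         # crude band constraint 0 <= c_t <= c_max
    return u, c

# Illustrative usage with placeholder policy parameters (1 risk-free asset + 2 stocks)
N = 3
x = np.array([300.0, 350.0, 350.0])
u_tilde = np.array([5.0, 2.0, 2.0])
v_tilde = 0.01 * np.eye(N)
Sigma_tilde_p = 4.0 * np.eye(N)
print(sample_action_and_cash(x, u_tilde, v_tilde, Sigma_tilde_p, c_max=50.0))
```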
So far in this paper, we considered the setting of (direct) reinforcement learning, where the agent (investor) learns while observing the rewards, and optimizes the policy so that the expected cumulative reward (regularized by the KL information cost) is maximized. This setting is suitable when the investor explicitly defines his or her reward function.

In many cases of practical interest, an individual investor may not be able to explain his or her utility function used for trading decision-making, which can instead be rule-driven (or driven by another model not formulated in RL terms). Alternatively, when an agent (investor) is the subject of behavioral inference by a different agent (a researcher or robo-advisor), the latter has access to observed trajectories (states and actions) of the agent, but not to rewards received by the agent. Such cases where rewards are not available belong in the realm of Inverse Reinforcement Learning (IRL), whose objective is to recover both the reward function of the agent and the optimal policy, see e.g. (Dixon et al., 2020) for a review.

In this section, we consider the IRL problem with G-learning, and present an algorithm we call GIRL (G-learning IRL), whose objective is to make inference of the reward function of an individual agent such as a retirement plan contributor or an individual brokerage account holder. That is, we assume that we are given a history of dollar-nominated asset positions in an investment portfolio, jointly with the agent's decisions that include both injections or withdrawals of cash from the portfolio and asset allocation decisions. Additionally, we are given historical values of asset prices and expected asset returns for all assets in the investor universe. As previously in the paper, we consider a portfolio of stocks and a single bond, but the same formalism can be applied to other types of assets.

Assume that we have historical data that includes a set of $D$ trajectories $\zeta_i$, where $i = 1, \ldots, D$, of state-action pairs $(x_t, u_t)$, where trajectory $i$ starts at some time $t_i$ and runs until time $T_i$. Consider a single trajectory $\zeta$ from this collection, and set for this trajectory the start time to $t = 0$ and the end time to $T$. As individual trajectories are considered independent, they will enter additively in the final log-likelihood of the problem. We assume that the dynamics are Markovian in the pair $(x_t, u_t)$, with a generative model $p_{\theta}(x_{t+1}, u_t|x_t) = \pi_{\theta}(u_t|x_t) \, p_{\theta}(x_{t+1}|x_t, u_t)$, where $\theta$ stands for a vector of model parameters, and $\pi_{\theta}$ is the action policy given by Eq.(38).

The probability of observing trajectory $\zeta$ is given by the following expression:

$$ P(x, u \,|\, \theta) = p_0(x_0) \prod_{t=0}^{T-1} \pi_{\theta}(u_t|x_t) \, p_{\theta}(x_{t+1}|x_t, u_t). \quad (41) $$

Here $p_0(x_0)$ is the marginal probability of $x_0$ at the start of the $i$-th demonstration. Assuming that the initial values $x_0$ are fixed, this gives the following log-likelihood for the data $\{x_t, u_t\}_{t=0}^{T}$ observed for trajectory $\zeta$:

$$ LL(\theta) := \log P(x, u \,|\, \theta) = \sum_{t \in \zeta}\left( \log \pi_{\theta}(u_t|x_t) + \log p_{\theta}(x_{t+1}|x_t, u_t) \right). \quad (42) $$

The transition probabilities $p_{\theta}(x_{t+1}|x_t, u_t)$ entering this expression can be obtained from the state equation

$$ x_{t+1} = A_t (x_t + u_t) + (x_t + u_t) \circ \tilde{\varepsilon}_t, \qquad A_t := \mathrm{diag}(1 + \bar{r}_t), \quad \tilde{\varepsilon}_t := (0, \varepsilon_t), \quad (43) $$

where $\varepsilon_t$ is a Gaussian noise with covariance $\Sigma_r$ (see Eq.(25)). Writing $x_t = (x_t^{(1)}, x_t^{(r)})$, where $x_t^{(1)}$ is the value of the bond position and $x_t^{(r)}$ are the values of the positions in risky assets, and similarly for $u_t$ and $A_t$, this produces the transition probabilities

$$ p_{\theta}(x_{t+1}|x_t, u_t) = \frac{e^{-\frac{1}{2} \Delta_t^T \Sigma_r^{-1} \Delta_t}}{\sqrt{(2\pi)^{N-1} \left| \Sigma_r \right|}} \, \delta\left( x_{t+1}^{(1)} - (1 + r_f) x_t^{(1)} \right), \qquad \Delta_t := \frac{x_{t+1}^{(r)}}{x_t^{(r)} + u_t^{(r)}} - \vec{A}_t^{(r)}, \quad (44) $$

where the division in $\Delta_t$ is element-wise, and the factor $\delta\left( x_{t+1}^{(1)} - (1 + r_f) x_t^{(1)} \right)$ captures the deterministic dynamics of the bond part of the portfolio. As this term does not depend on the model parameters, we can drop it from the log-transition probability, along with a constant term $\sim \log(2\pi)$. This produces

$$ \log p_{\theta}(x_{t+1}|x_t, u_t) = -\frac{1}{2} \log \left| \Sigma_r \right| - \frac{1}{2} \Delta_t^T \Sigma_r^{-1} \Delta_t. \quad (45) $$
Substituting Eqs.(38), (30) and (45) into the trajectory log-likelihood (42), we put it in the following form:

$$ LL(\theta) = \sum_{t \in \zeta}\left( \beta\left( G_t^{\pi}(x_t, u_t) - F_t^{\pi}(x_t) \right) - \frac{1}{2} \log \left| \Sigma_r \right| - \frac{1}{2} \Delta_t^T \Sigma_r^{-1} \Delta_t \right), \quad (46) $$

where $G_t^{\pi}(x_t, u_t)$ and $F_t^{\pi}(x_t)$ are defined by Eqs.(30) and (24). The log-likelihood (46) is a function of the model parameter vector $\theta = \left( \lambda, \eta, \rho, \Omega, \Sigma_r, \Sigma_p, \bar{u}_t, \bar{v}_t \right)$ (recall that $\beta$ is a regularization hyper-parameter which should not be optimized in-sample). We can simplify the problem by setting $\bar{v}_t = 0$ and $\bar{u}_t = \bar{u}$ (i.e. we take a constant mean in the prior). In this case, the vector of model parameters to learn with IRL inference is $\theta = \left( \lambda, \eta, \rho, \Omega, \Sigma_r, \Sigma_p, \bar{u} \right)$. A "proper" IRL setting would correspond to only learning the parameters of the reward function $(\lambda, \eta, \rho, \Omega)$, while keeping the parameters $\left( \Sigma_r, \Sigma_p, \bar{u} \right)$ fixed (i.e. estimated outside of the IRL model). Optimization can be performed using available off-the-shelf software. In our implementation, we use the Adam optimization method within PyTorch to optimize the negative log-likelihood function.
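A schematic PyTorch sketch of the GIRL optimization loop is given below. The transition term implements Eq.(45); the policy term $\beta(G_t - F_t)$ of Eq.(46) is passed in as a callable because, in the full algorithm, it requires running the G-Learner recursion under the candidate parameters. For runnability we supply a simplified stand-in policy term (a Gaussian with constant mean), so this example demonstrates the optimization machinery rather than full recovery of $(\lambda, \eta, \rho, \Omega)$; all data and parameter values are synthetic placeholders.

```python
import torch

def transition_log_prob(x_next_r, x_r, u_r, A_r, Sigma_r):
    """Eq.(45): log p(x_{t+1} | x_t, u_t) for the risky assets (constants dropped)."""
    delta = x_next_r / (x_r + u_r) - A_r                        # element-wise, as in Eq.(44)
    return -0.5 * torch.logdet(Sigma_r) - 0.5 * delta @ torch.linalg.solve(Sigma_r, delta)

def girl_nll(theta, trajectories, policy_log_prob):
    """Negative log-likelihood of Eq.(46), summed over trajectories.

    policy_log_prob(theta, x, u) should return beta * (G_t(x, u) - F_t(x)); in the full GIRL
    algorithm it is obtained by running the G-Learner recursion under the candidate theta.
    """
    nll = 0.0
    for traj in trajectories:
        for (x, u, x_next, A_r) in traj:
            nll = nll - policy_log_prob(theta, x, u)
            nll = nll - transition_log_prob(x_next[1:], x[1:], u[1:], A_r, theta["Sigma_r"])
    return nll

# Stand-in policy term (NOT the full G-Learner policy): Gaussian with constant mean
def stand_in_policy_log_prob(theta, x, u):
    diff = u - theta["u_bar"]
    return -0.5 * diff @ torch.linalg.solve(theta["Sigma_p"], diff)

# Toy data: one trajectory of two transitions with 1 bond + 2 stocks (placeholders)
x0 = torch.tensor([300.0, 350.0, 350.0]); u0 = torch.tensor([5.0, 2.0, 2.0])
x1 = torch.tensor([305.0, 360.0, 355.0]); u1 = torch.tensor([4.0, 1.0, 3.0])
x2 = torch.tensor([309.0, 372.0, 348.0])
A_r = torch.tensor([1.02, 1.015])
traj = [(x0, u0, x1, A_r), (x1, u1, x2, A_r)]

theta = {"u_bar": torch.zeros(3, requires_grad=True),
         "log_sig_r": torch.zeros(2, requires_grad=True),
         "Sigma_p": 10.0 * torch.eye(3)}
opt = torch.optim.Adam([theta["u_bar"], theta["log_sig_r"]], lr=0.05)
for it in range(200):
    theta["Sigma_r"] = torch.diag(torch.exp(theta["log_sig_r"]))   # keep Sigma_r positive definite
    loss = girl_nll(theta, [traj], stand_in_policy_log_prob)
    opt.zero_grad(); loss.backward(); opt.step()
print(theta["u_bar"].detach(), torch.exp(theta["log_sig_r"]).detach())
```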
To illustrate the G-Learner and GIRL algorithms for goal based wealth management, we use a simple simulated environment that mimics the working of equity return models (sometimes referred to as "alpha-models"), which are expected in practice to be weak predictors of realised returns. The advantage of such a simulated environment is that it allows us to define the "ground truth" and thus demonstrate the performance of both algorithms. We remind the reader that while we use simulated data to show the performance of our algorithms, the latter are model free, as they are independent of a model of stock-price dynamics.

The investment horizon is set to 7.5 years, and portfolio rebalancing and consumption occur quarterly (over 30 periods). In this simplified setting, the portfolio is assumed to be initially equally weighted, with $1000 allocated equally between $N - 1 = 99$ stocks and a risk-free bond. We assume a fixed risk-free annual rate, $r_f = 0.02$; stock transaction costs are 1.5% of the stock price, and risk-free bond transaction costs are 5%. The benchmark portfolio is initially set equal to the initial value of the portfolio, and is continuously compounded at a constant rate of 50%.

We model the quarterly realized risky asset returns, $r_{t,i}$, of the $i$th asset as being correlated with the expected risky asset returns, $\bar{r}_{t,i}$:

$$ r_{t,i} = \bar{r}_{t,i} + \beta_i'\left( r_M - \mu_M \, dt \right) + \sigma_i \sqrt{1 - (\beta_i')^2} \, dW_{t,i}, \qquad i \in \{1, \ldots, N-1\}, \quad (47) $$

where $\mu_M = 0.05$ is the market drift, $r_M$ are the market returns simulated under a GBM model with volatility $\sigma_M = 0.25$, and $\beta_i'$ is the beta of the $i$th asset. $\sigma_i \equiv \sigma = 0.05$ is the idiosyncratic volatility, $dW_t$ is a driving Brownian motion which is uncorrelated with the market noise, and $dt = 0.25$. $\bar{r}_t$ is assumed to be given by the CAPM:

$$ \bar{r}_t = \alpha + \beta'\left( (1 - c) \mu_M \, dt + c \, r_M \right), \qquad c \in [0, 1], \quad (48) $$

where we choose the oracle coefficient $c = 0.2$. We assume that $\alpha$ and $\beta'$ are uniform random variables across all risky assets, and the risky assets are assumed to have initial dollar values given by uniform random variables. In our experiments, we generate the risky asset returns over $M$ simulation paths. The sample mean realized returns, plotted against the sample mean expected returns, are observed to be weakly correlated.
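The return model (47)-(48) can be simulated along the following lines; since the uniform ranges for $\alpha$ and $\beta'$ (and the number of paths $M$) are not reproduced above, the values below are placeholders rather than the paper's settings.

```python
import numpy as np

def simulate_returns(n_assets, n_periods, n_paths, mu_M=0.05, sigma_M=0.25,
                     sigma_i=0.05, dt=0.25, c=0.2, seed=0):
    """Simulate expected and realized risky-asset returns per Eqs.(47)-(48).

    alpha and beta' are drawn from uniform distributions; the ranges below are
    placeholders, not the values used in the paper's experiments.
    """
    rng = np.random.default_rng(seed)
    alpha = rng.uniform(-0.01, 0.01, size=n_assets)
    beta_p = rng.uniform(0.1, 0.9, size=n_assets)
    # Market returns under a GBM model (log-return approximation over each quarter)
    r_M = mu_M * dt + sigma_M * np.sqrt(dt) * rng.standard_normal((n_paths, n_periods, 1))
    # Eq.(48): expected returns from the CAPM-style relation with oracle coefficient c
    r_bar = alpha + beta_p * ((1.0 - c) * mu_M * dt + c * r_M)
    # Eq.(47): realized returns = expected + market component + idiosyncratic noise
    dW = np.sqrt(dt) * rng.standard_normal((n_paths, n_periods, n_assets))
    r = r_bar + beta_p * (r_M - mu_M * dt) + sigma_i * np.sqrt(1.0 - beta_p**2) * dW
    return r_bar, r

r_bar, r = simulate_returns(n_assets=99, n_periods=30, n_paths=10)
print(np.corrcoef(r_bar.ravel(), r.ravel())[0, 1])   # weak positive correlation, as described above
```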
To demonstrate a G-learning agent for wealth management, we arbitrarily choose the set of parameters in Table 1. Note that the G-learner parameter $\beta$ is not optimized by GIRL, but is simply set to a fixed value. GIRL is given the expected returns $\bar{r}_t$ together with the covariance of the risky asset returns, $\Sigma_r$, the discount factor $\gamma$ for the future value of rewards, the learning rate $\ell$, and the convergence tolerance $\tau$. Consequently, GIRL is observed to imitate the G-Learner: the sample averaged portfolio returns closely track each other in Figure 2. The error in the learned G-Learner parameters results in a marginal decrease in the Sharpe ratio, as reported in the parentheses of the legend of Figure 2. In Figure 3, we show the local behaviour of the loss surface for our problem, illustrating its convex shape and the parameters found by GIRL. GIRL requires approximately 200 iterations to converge.

Table 1: The G-learning agent parameters (ρ, λ, η, ω) used for portfolio allocation, together with the values estimated by GIRL.
Figure 2:
The sample mean portfolio returns are shown over a 30-quarter horizon (7.5 years). The black line shows the sample mean returns for an equally weighted portfolio without rebalancing. The red line shows a G-learning agent, for the parameter values given in Table 1. GIRL imitates the G-learning agent and generates the returns shown by the blue dashed line. Sharpe ratios are shown in parentheses.
An illustration of an optimal solution trajectory obtained without enforcing any constraints is shown in Figure 4, which presents simulation results for the portfolio using the G-Learner. The values of the optimal cash installments are shown in Table 2.

Figure 4: An illustration of the G-Learner for a retirement plan optimization using a portfolio with 100 assets. The values of the optimal cash installments are shown in Table 2.

Figure 3: The loss surface about each of the G-Learner's parameters found by GIRL: (a) λ, (b) ρ, (c) η, (d) ω. The solid circle denotes the exact parameter value. The loss is convex with respect to each parameter.