Integrating prediction in mean-variance portfolio optimization
Andrew Butler and Roy H. Kwon
University of Toronto, Department of Mechanical and Industrial Engineering
February 19, 2021
Abstract
Many problems in quantitative finance involve both predictive forecasting and decision-based optimization. Traditionally, predictive models are optimized with unique prediction-based objectives and constraints, and are therefore unaware of how those predictions will ultimately be used in the context of their final decision-based optimization. We present a stochastic optimization framework for integrating regression-based predictive models in a mean-variance portfolio optimization setting. Closed-form analytical solutions are provided for the unconstrained and equality constrained cases. For the general inequality constrained case, we make use of recent advances in neural-network architecture for efficient optimization of batch quadratic programs. To our knowledge, this is the first rigorous study of integrating prediction in a mean-variance portfolio optimization setting. We present several historical simulations using global futures data and demonstrate the benefits of the integrated approach in comparison to the decoupled alternative.
Keywords:
Data-driven stochastic programming, Regression, Mean-variance optimization, Empirical risk minimization, Differentiable neural networks
1 Introduction

Many problems in quantitative finance can be characterized by the following elements:

1. A sample data set $Y = \{y^{(1)}, ..., y^{(m)}\} = \{y^{(i)}\}_{i=1}^{m}$ of uncertain quantities of interest, $y^{(i)} \in \mathbb{R}^{d_y}$, such as asset returns.
2. A decision, $z \in \mathbb{R}^{d_z}$, often constrained to some feasible region $Z \subset \mathbb{R}^{d_z}$.
3. A nominal objective (cost) function, $c : \mathbb{R}^{d_z} \times \mathbb{R}^{d_y} \to \mathbb{R}$, to be minimized over the decision variable $z \in Z$ in the context of the observed realization $y^{(i)}$.

For example, in portfolio management we are often presented with the following problem: for a particular observation of asset returns, $y^{(i)}$, the objective is to construct a vector of asset weights, $z^*(y^{(i)})$, that minimizes the nominal cost, $c(z, y^{(i)})$, and adheres to the constraint set $Z$. A common choice for nominal cost is the Markowitz mean-variance quadratic objective [24], with typical constraints being that the weights be non-negative and sum to one. Of course, the realizations of asset returns, $\{y^{(i)}\}_{i=1}^{m}$, are not directly observable at decision time and instead must be estimated through associated auxiliary data $X = \{x^{(1)}, ..., x^{(m)}\}$ of covariates of $Y$. In this paper, we consider regression-based prediction models of the form:

$$ \hat{y}^{(i)} = \theta^T x^{(i)}, $$

with regression coefficient matrix $\theta \in \mathbb{R}^{d_x \times d_y}$.

In most applications, estimating $\hat{y}^{(i)}$ requires solving a second, independent prediction optimization problem, with its own unique objective function (e.g. least-squares or maximum likelihood) and feasible region $\Theta$. Continuing with the example above, in order to generate mean-variance efficient portfolios we must supply, at a minimum, an estimate of expected asset returns and covariances. A prototypical framework would first estimate the conditional expectations of asset returns and covariances by ordinary least-squares (OLS) regression and then 'plug-in' those estimates to a mean-variance quadratic program (see for example Goldfarb and Iyengar [17], Clarke et al. [11], Chen et al. [9]).

While the focus of operations research (OR) is on optimal decision making, machine learning (ML), on the other hand, focuses on providing optimal predictions. As exemplified above, OR-based optimization and ML-based predictions are often decoupled processes: first predict, then optimize. Indeed, at first glance, a 'predict, then optimize' approach seems reasonable, if not optimal. That is to say, if a predictive model provides perfect forecast accuracy ($\hat{y}^{(i)} = y^{(i)}$), then generating optimal decisions amounts to solving a deterministic nominal optimization problem.

While a perfect predictive model would lead to optimal decision making, in reality all predictive models make some error. As such, an inefficiency exists in the 'predict, then optimize' paradigm: predictive models are optimized with unique prediction-based objectives and constraints, and thus are unaware of how those predictions will ultimately be used in the context of the nominal decision-based optimization.

Recently there has been a growing body of research on data-driven decision making and the relative merits of decoupled versus integrated predictive decision-making. In this paper, we follow the work of Donti et al. [13], Elmachtoub and Grigas [15] and others, and propose the use of an integrated prediction and optimization framework with direct applications to mean-variance portfolio optimization.
The key distinction is that under the integrated setting, the model parameters, $\theta$, are estimated in order to yield the 'best' decisions, not to provide the 'best' predictions. Specifically, we select $\theta$ such that the resulting optimal decision policy, $z^*(x, \theta)$, produces the lowest average realized nominal cost:

$$ \begin{aligned} \underset{\theta \in \Theta}{\text{minimize}} \quad & \mathbb{E}[c(z^*(x, \theta), y)] \\ \text{subject to} \quad & z^*(x, \theta) = \underset{z \in Z}{\operatorname{argmin}}\ c(z, f(x, \theta)). \end{aligned} \quad (1) $$

Here $f : \mathbb{R}^{d_x} \times \mathbb{R}^{d_\theta} \to \mathbb{R}^{d_y}$ denotes the $\theta$-parameterized prediction model for estimating $\hat{y}$. For the case of linear models, $f(x, \theta) = \theta^T x$.

We denote the objective function of the integrated prediction and optimization (IPO) problem as $L(\theta) = \mathbb{E}[c(z^*(x, \theta), y)]$, where $\mathbb{E}$ is the expectation operator. Solving the IPO problem over decision variables $(\theta, z)$ is challenging for several reasons. First, even in the case where the nominal program is convex, the resulting integrated program is likely not convex in $\theta$ and therefore we have no guarantee that a particular local solution is globally optimal. Secondly, in the case where $L(\theta)$ is differentiable, computing the gradient, $\nabla_\theta L$, remains difficult as it requires differentiation through the argmin operator. Furthermore, solving program (1) through iterative descent methods can be computationally demanding, as at each iteration we must solve several instances of $z^*(x^{(i)}, \theta)$.

In this paper we address the aforementioned challenges and provide an efficient framework for integrating regression predictions into a mean-variance portfolio optimization setting. To our knowledge this is the first rigorous study of a general and direct method for integrating prediction in a mean-variance portfolio optimization setting, with experimentation using real asset price data. The remainder of the paper is outlined as follows. We first review the growing body of literature in the field of integrated predictive decision-making and summarize our primary contributions. In Section 2 we formalize the IPO problem in a general setting. In Section 3 we describe the classical mean-variance portfolio optimization problem and present the integrated formulation. We provide closed-form analytical solutions for the special case where the nominal program is either unconstrained or contains only linear equality constraints. We provide the relevant gradient and Hessian formulae and in practice we propose the use of Newton's method for efficient global convergence.

We then analyze the integrated problem under both linear equality and inequality constraints. We follow the work of Amos and Kolter [2] and describe an efficient routine for overcoming the primary technical challenge of differentiating through the argmin operator. Specifically, we embed the nominal objective program as a specialized quadratic program (QP) layer in a feed-forward neural network. We make use of backpropagation for efficient gradient computation and optimize the integrated program through stochastic gradient descent. We conclude with a simulation study using global futures data and demonstrate that the integrated methodology can provide lower realized costs compared to the 'predict, then optimize' alternative.
In recent years there has been a growing body of research on data-driven decision making and the relative merits of decoupled versus integrated predictive decision-making. For example, Ban and Rudin [4] present a direct empirical risk minimization approach using nonparametric kernel regression as the core prediction method. They consider a data-driven newsvendor problem and demonstrate that their approach outperforms the best-practice benchmark when evaluated out-of-sample. More recently, Kannan et al. [22] present three frameworks for integrating machine learning prediction models within a stochastic optimization setting. Their primary contribution is in using the out-of-sample residuals from leave-one-out prediction models to generate scenarios which are then optimized in the context of a sample average approximation nominal program. Their frameworks are flexible and accommodate parametric and nonparametric prediction models, for which they derive convergence rates and finite sample guarantees.

Bertsimas and Kallus [5] present a general framework for optimizing a conditional stochastic approximation program whereby the conditional density is estimated through a variety of parametric and nonparametric machine learning methods. They generate locally optimal decision policies within the context of the nominal optimization problem and consider the setting where the decision policy affects subsequent realizations of the uncertainty variable. They also consider an empirical risk minimization framework for generating predictive prescriptions and discuss the relative trade-offs of such an approach.

Recently, Elmachtoub and Grigas [15] proposed replacing the prediction-based loss function with a convex surrogate loss function that optimizes prediction variables based on the decision error induced by the prediction, as measured by the nominal objective function. They demonstrate that their 'smart predict, then optimize' (SPO) loss function attains Fisher consistency with the popular least-squares loss function and show through example that optimizing predictions in the context of nominal objectives and constraints can lead to improved overall decision error. The SPO loss function, however, is limited to linear nominal objective functions, and despite convexity can be computationally demanding due to repeated evaluation of the nominal program.

Our approach is most similar to, and is largely inspired by, the work of Amos and Kolter [2] and Donti et al. [13]. Indeed, a fundamental challenge in optimizing the integrated problem (1) through gradient descent is in computing the gradient of the objective function with respect to the prediction model parameters. Specifically, while computing the gradient $\partial L / \partial z^*$ is relatively straightforward, computing the Jacobian $\partial z^* / \partial \theta$ is complicated by virtue of the argmin operator. Amos and Kolter [2], however, demonstrate that for general quadratic programs, the solution to the system of equations provided by the KKT conditions at optimality, $z^*$, provides the necessary ingredients for computing the desired gradient, $\partial L / \partial \theta$, and therefore it is not necessary to calculate the Jacobian $\partial z^* / \partial \theta$ explicitly. Furthermore, Amos and Kolter [2], and more recently Agrawal et al. [1], propose an efficient framework for embedding differentiable optimization problems as layers in an end-to-end trainable neural network.
Such an approach is advantageous as it allows for the use of efficient backpropagation technology, GPU computing and the plethora of robust stochastic gradient descent optimization routines.

Donti et al. [13] present a direct application of the aforementioned work and propose an end-to-end stochastic programming approach for estimating the parameters of probability density functions in the context of their final task-based loss function. They consider applications from power scheduling and battery storage and focus specifically on parametric models in a stochastic programming setting. They demonstrate that their task-based end-to-end approach can result in lower realized costs in comparison to traditional maximum likelihood estimation and a black-box neural network.

While our methodology follows closely that of Donti et al. [13] and Elmachtoub and Grigas [15], in this paper we provide several notable differences and extensions. First, whereas [13] considers the parametric density estimation problem in a stochastic optimization setting, we consider a regression-based estimation with a deterministic nominal optimization problem. Secondly, we consider applications from portfolio optimization; specifically the case where the nominal program is a convex mean-variance quadratic program. Moreover, we demonstrate that in the case where the solution to the nominal program is linear in the parameter $\theta$, we can make use of second-order Hessian information and optimize the integrated problem more efficiently through Newton's method.

The IPO program presented in Equation (1) is the most direct and perhaps simplest expression of an integrated prediction and optimization problem. To our knowledge this is the first rigorous study of these equations in a mean-variance portfolio optimization setting. While Elmachtoub and Grigas [15] provide an application to a mean-variance optimization (MVO) problem, their framework requires optimization of a customized and more complicated SPO surrogate loss function. Moreover, the SPO framework can only support optimization problems with a linear objective and therefore the integration is limited to prediction models that remain linear across all relevant problem variables. Lastly, in order to demonstrate the benefit of the SPO approach, Elmachtoub and Grigas [15] use synthetically generated asset returns and feature data with varying degrees of nonlinearity, which may or may not be an accurate representation of real-world asset prices.

In contrast, the simplicity of our IPO formulation lends itself to a very practical and approachable integrated prediction and portfolio optimization solution. We demonstrate that, in many cases, the integrated problem is no more complicated than a least-squares problem, which can be readily solved through standard quadratic programming. We consider the MVO problem under several constraint settings using both univariate and multivariate regression for the prediction model. Under special circumstances, discussed below, the IPO program admits a closed-form analytical solution and thus obviates the need for a more computationally demanding iterative optimization algorithm. Furthermore, unlike many machine learning applications, the required action of the inverse of the Hessian of the IPO solution is computationally tractable in most practical settings. This is discussed in more detail in Section 3.1.

In this paper, we estimate asset returns through regression and choose to hold the estimate of asset covariances constant.
This is supported by the observation that asset mean returns are both nonstationary and heterogeneous and are therefore likely to be dependent on auxiliary feature data [16, 20, 27], whereas variances and covariances are typically much more stable [33, 14]. Moreover, Chopra and Ziemba [10] and Best and Grauer [6] report that MVO portfolio weights are an order of magnitude more sensitive to the estimate of asset mean returns compared to estimates of asset covariances. That said, the general IPO framework, discussed in Section 3.5, can fully support linear and nonlinear prediction models for all variables of the nominal problem, including estimates of expected returns, expected covariances and constraint variables.

Numerical experiments are performed on a universe of 55 global futures markets, with daily return data starting in January 1995 and extending through October 2020. From an experimental standpoint, our goal is to analyze the performance of integrated prediction and optimization in a portfolio optimization setting with real asset price data, thus filling a notable gap in this fast moving field. In summary, our analytical and experimental contributions are as follows:

1. For univariate regression and the case where the nominal program is either unconstrained or contains only linear equality constraints, the integrated program can be recast as an unconstrained quadratic program. We provide the necessary conditions for convexity and provide an analytical solution for the optimal regression parameterization. Out-of-sample numerical results demonstrate that the integrated approach can provide lower realized cost and superior absolute and risk-adjusted performance.

2. For multivariate regression and the case where the nominal program is either unconstrained or contains only linear equality constraints, the integrated program can be recast as an unconstrained program. The program is not strictly convex in the general case and therefore multiple solutions exist. An interesting, and somewhat surprising, result is that the OLS solution happens to be a particular solution.

3. For univariate and multivariate regression, we consider the case where the nominal program contains both linear equality and inequality constraints and therefore an analytical solution is not possible. We reformulate the integrated problem as a neural network with a linear prediction layer, a differentiable quadratic programming layer and the realized nominal cost loss function. We perform local optimization through stochastic gradient descent and demonstrate that the integrated approach can provide lower realized cost and superior risk-adjusted performance.

2 Problem Formulation

We begin by formally defining the traditional 'predict, then optimize' framework, which will help draw the distinction to the IPO alternative. We assume that we have a nominal cost function $c : \mathbb{R}^{d_z} \times \mathbb{R}^{d_y} \to \mathbb{R}$, which takes as input a decision vector $z \in \mathbb{R}^{d_z}$, constrained to a feasible region $Z \subset \mathbb{R}^{d_z}$, and a quantity of interest $y^{(i)} \in \mathbb{R}^{d_y}$. If the realization $y^{(i)}$ is known, then generating optimal decisions amounts to solving the following deterministic optimization problem:

$$ \underset{z \in Z}{\text{minimize}}\ c(z, y^{(i)}), \quad (2) $$

with optimal solution:

$$ z^*(y^{(i)}) = \underset{z \in Z}{\operatorname{argmin}}\ c(z, y^{(i)}). \quad (3) $$

In reality, we do not know the true value of $y^{(i)}$ at decision time. We therefore estimate the value $y^{(i)}$ through associated auxiliary variables $x^{(i)} \in \mathbb{R}^{d_x}$.
We define a prediction model by the function $f : \mathbb{R}^{d_x} \times \mathbb{R}^{d_\theta} \to \mathbb{R}^{d_y}$, where $f$ belongs to the class of regression models $\mathcal{F} = \{f \mid f(x, \theta) = \theta^T x\}$ and the parameter $\theta$ is constrained to the feasible region $\Theta \subset \mathbb{R}^{d_x \times d_y}$. For a particular $\theta$, we define the estimate $\hat{y}^{(i)}$ as:

$$ \hat{y}^{(i)} = f(x^{(i)}, \theta) = \theta^T x^{(i)}. \quad (4) $$

In traditional parameter estimation, $\theta$ would be chosen in order to minimize a prediction-based loss function $\ell : \mathbb{R}^{d_y} \times \mathbb{R}^{d_y} \to \mathbb{R}$, such as least-squares or negative log-likelihood. Specifically, given the training data set $\mathcal{D} = \{(x^{(i)}, y^{(i)})\}_{i=1}^{m}$, we choose $\hat{\theta}$ such that:

$$ \hat{\theta} = \underset{\theta \in \Theta}{\operatorname{argmin}}\ \mathbb{E}_{\mathcal{D}}[\ell(f(x^{(i)}, \theta), y^{(i)})], \quad (5) $$

where $\mathbb{E}_{\mathcal{D}}$ denotes the expectation operator with respect to the training distribution $\mathcal{D}$. Once $\hat{\theta}$ has been determined, a traditional 'predict, then optimize' framework would then 'plug-in' the estimates $\hat{y}^{(i)} = f(x^{(i)}, \hat{\theta})$ into program (2) in order to generate the optimal decisions $z^*(\hat{y}^{(i)})$.

In this paper, we are interested in optimizing $\theta$ in the context of the nominal cost, $c$, and its constraints, $Z$, in order to minimize the average realized nominal cost of the policy $z^*(x^{(i)}, \theta)$ induced by this parameterization. For a fixed instantiation $(x^{(i)}, \theta)$, we solve the nominal cost program (6) in order to determine the optimal action $z^*(x^{(i)}, \theta)$ corresponding to our observed input $x^{(i)}$. Specifically, we solve:

$$ \underset{z \in Z}{\text{minimize}}\ c(z, f(x^{(i)}, \theta)), \quad (6) $$

with optimal solution given by:

$$ z^*(x^{(i)}, \theta) = \underset{z \in Z}{\operatorname{argmin}}\ c(z, f(x^{(i)}, \theta)). \quad (7) $$

Our objective is therefore to choose $\theta$ in order to minimize the average realized nominal cost induced by the decision policy $z^*(x^{(i)}, \theta)$. The resulting integrated prediction-optimization problem is a stochastic optimization problem, presented in program (8):

$$ \begin{aligned} \underset{\theta \in \Theta}{\text{minimize}} \quad & \mathbb{E}_{\mathcal{D}}[c(z^*(x, \theta), y)] \\ \text{subject to} \quad & z^*(x, \theta) = \underset{z \in Z}{\operatorname{argmin}}\ c(z, f(x, \theta)). \end{aligned} \quad (8) $$

In practice, we are typically presented with discrete observations $\mathcal{D} = \{(x^{(i)}, y^{(i)})\}_{i=1}^{m}$ and therefore we can approximate the expectation by its sample average approximation [30]. The full IPO problem is presented in discrete form in program (9):

$$ \begin{aligned} \underset{\theta \in \Theta}{\text{minimize}} \quad & \frac{1}{m} \sum_{i=1}^{m} c(z^*(x^{(i)}, \theta), y^{(i)}) \\ \text{subject to} \quad & z^*(x^{(i)}, \theta) = \underset{z \in Z}{\operatorname{argmin}}\ c(z, f(x^{(i)}, \theta)) \quad \forall i \in 1, ..., m. \end{aligned} \quad (9) $$

Note that program (9) resembles the well-documented empirical risk minimization [34], where the objective is to choose $\theta$ in order to minimize the average empirical nominal cost. The notable difference, of course, is that for a fixed instantiation $\theta$, the decision policy, $z^*(x^{(i)}, \theta)$, is optimal in the context of its nominal cost program, which assumes predictions of the form $\hat{y}^{(i)} = f(x^{(i)}, \theta)$.

The IPO formulation results in a complicated dependency of the model parameters, $\theta$, on the optimized values, $z^*(x^{(i)}, \theta)$, connected through the argmin function. This problem is frequently encountered in modern machine learning applications and there are several approaches for overcoming this challenge. For example, the problem can be structured as a bi-level optimization problem in which, under special constraint cases, an analytical solution for the gradient exists [18].
Alternatively, heuristic methods, such as the one provided by Sinha et al. [32], propose structuring the problem as a bi-level optimization and use kriging (Gaussian process models) to approximate the mapping to the lower-level optimization problem.

In Section 3, we discuss this relationship more fully in the context of mean-variance optimization. First, in the case where the nominal program is either unconstrained or contains only linear equality constraints, an analytical solution exists and is provided for both univariate and multivariate regression. For the more difficult case where the nominal program includes linear inequality constraints, we follow Amos and Kolter [2] and Donti et al. [13] and re-structure the problem as an end-to-end neural network with a differentiable quadratic programming layer. With appropriate Hessian information, the neural network formulation is capable of solving the unconstrained and linear equality constrained IPO problems in a single Newton step, which is the approach used in practice. For the general inequality constrained case, we make use of stochastic gradient descent for locally solving the IPO problem.
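To make the decoupled baseline concrete before specializing to mean-variance costs, the following sketch contrasts the two estimation criteria of this section. It is a minimal illustration, assuming a NumPy environment and the unconstrained quadratic nominal cost introduced in Section 3; the helper names (`fit_ols`, `solve_nominal`, `ipo_objective`) are ours, not the paper's.

```python
import numpy as np

def fit_ols(X, Y):
    # 'Predict, then optimize' step 1 (eq. (5) with least-squares loss):
    # theta_hat minimizes prediction error, ignoring the nominal cost.
    theta_hat, *_ = np.linalg.lstsq(X, Y, rcond=None)
    return theta_hat                      # shape (d_x, d_y)

def solve_nominal(y_hat, V, delta=1.0):
    # Eq. (7) for the unconstrained mean-variance cost
    # c(z, y) = -z'y + (delta/2) z'Vz, whose argmin is (1/delta) V^{-1} y_hat.
    return np.linalg.solve(delta * V, y_hat)

def ipo_objective(theta, X, Y, V, delta=1.0):
    # Sample-average IPO objective, eq. (9): the realized nominal cost of
    # the policy z*(x_i, theta) evaluated against the true y_i.
    costs = []
    for x_i, y_i in zip(X, Y):
        z = solve_nominal(theta.T @ x_i, V, delta)
        costs.append(-z @ y_i + 0.5 * delta * z @ V @ z)
    return np.mean(costs)
```

Under the decoupled framework, `fit_ols` is applied once and `ipo_objective` plays no role in estimation; the IPO program instead minimizes `ipo_objective` over $\theta$ directly.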
3 IPO: Mean-Variance Optimization

We begin with a brief introduction to mean-variance portfolio optimization. The fundamental objective of portfolio optimization is the determination of next period's optimal asset allocation under conditions of uncertainty. Markowitz [24], a pioneer of Modern Portfolio Theory, proposed that investors' preferences for return and risk are characterized by quadratic utility: mean-variance optimization (MVO).

We consider a universe of $d_z$ assets and denote the matrix of (excess) return observations as $Y = [y^{(1)}, y^{(2)}, ..., y^{(m)}] \in \mathbb{R}^{m \times d_z}$ with $m > d_z$. Let $\mu = \mathbb{E}[Y] \in \mathbb{R}^{d_z}$ be the expected value of $Y$ and $V = \mathbb{E}[(Y - \mu)(Y - \mu)^T]$ be the $d_z \times d_z$ symmetric positive definite covariance matrix. We define the portfolio $z \in \mathbb{R}^{d_z}$, where the element $z_i$ denotes the proportion of total capital invested in the $i$th asset. The mean-variance nominal cost function is presented in equation (10):

$$ c(z, y^{(i)}) = -z^T y^{(i)} + \frac{\delta}{2} z^T V z. \quad (10) $$

Therefore, for a particular return observation $y^{(i)}$, the optimal portfolio weights, $z^*(y^{(i)})$, are given by the solution to the following convex quadratic program:

$$ z^*(y^{(i)}) = \underset{z \in Z}{\operatorname{argmin}}\ -z^T y^{(i)} + \frac{\delta}{2} z^T V z, \quad (11) $$

where $\delta \in \mathbb{R}_+$ is a risk-aversion parameter that controls the trade-off between minimizing variance and maximizing return.

In reality, we do not know the value $y^{(i)}$ at decision time and it is instead estimated through associated auxiliary data. Recall, for our regression-based prediction model, we have the estimate $\hat{y}^{(i)} = f(x^{(i)}, \theta) = \theta^T x^{(i)}$, and therefore under the estimation hypothesis, the mean-variance nominal cost function has the following form:

$$ c(z, \hat{y}^{(i)}) = -z^T \hat{y}^{(i)} + \frac{\delta}{2} z^T V z = -z^T \theta^T x^{(i)} + \frac{\delta}{2} z^T V z. \quad (12) $$

In the IPO framework, the objective is to choose $\theta$ in order to minimize the average realized nominal cost induced by the optimal decision policies $\{z^*(x^{(i)}, \theta)\}_{i=1}^{m}$. Substituting the estimated and realized costs into program (9) gives the full IPO program in discrete form, presented in program (13):

$$ \begin{aligned} \underset{\theta \in \Theta}{\text{minimize}} \quad & \frac{1}{m} \sum_{i=1}^{m} \left( -z^*(x^{(i)}, \theta)^T y^{(i)} + \frac{\delta}{2} z^*(x^{(i)}, \theta)^T V z^*(x^{(i)}, \theta) \right) \\ \text{subject to} \quad & z^*(x^{(i)}, \theta) = \underset{z \in Z}{\operatorname{argmin}}\ -z^T \theta^T x^{(i)} + \frac{\delta}{2} z^T V z \quad \forall i = 1, ..., m, \end{aligned} \quad (13) $$

where as before the objective function $L(\theta)$ is given by:

$$ L(\theta) = \frac{1}{m} \sum_{i=1}^{m} \left( -z^*(x^{(i)}, \theta)^T y^{(i)} + \frac{\delta}{2} z^*(x^{(i)}, \theta)^T V z^*(x^{(i)}, \theta) \right). \quad (14) $$

Note that at this point we have not described the feasible region, $Z$, of the nominal program. In the following subsections we provide analytical results for the case where $Z$ is unconstrained or defined by a set of linear equality constraints. We consider regressions of two forms: the standard multivariate case where $\theta \in \mathbb{R}^{d_x \times d_z}$ and the univariate case where $\theta \in \mathbb{R}^{d_z \times d_z}$ is a diagonal matrix. We then discuss the more general case where $Z$ includes inequality constraints and formalize the neural-network framework.

3.1 Z unconstrained, f(x, θ) univariate

For the case of univariate prediction, the expected return estimate of asset $j$ is given by:

$$ \hat{y}_j^{(i)} = \theta_{jj} x_j^{(i)}. \quad (15) $$

The coefficient matrix $\theta$ is a square matrix with non-zero entries only along the diagonal. Equivalently, we let $\theta \in \mathbb{R}^{d_z}$ and instead consider the prediction model:

$$ f(x^{(i)}, \theta) = \operatorname{diag}(x^{(i)})\, \theta, \quad (16) $$

where $\operatorname{diag}(\cdot)$ denotes the usual diagonal operator.
Proposition 1. Let $\theta \in \mathbb{R}^{d_z}$ with $Z = \mathbb{R}^{d_z}$ and $\Theta = \mathbb{R}^{d_z}$. The IPO program is therefore an unconstrained quadratic program given by:

$$ \underset{\theta}{\text{minimize}}\ -\frac{1}{m\delta}\, \theta^T \sum_{i=1}^{m} \left( \operatorname{diag}(x^{(i)})\, V^{-1} y^{(i)} \right) + \frac{1}{2m\delta}\, \theta^T \sum_{i=1}^{m} \left( \operatorname{diag}(x^{(i)})\, V^{-1} \operatorname{diag}(x^{(i)}) \right) \theta. \quad (17) $$

Furthermore, if there exists an $x^{(i)}$ such that $x_j^{(i)} \neq 0\ \forall j \in 1, ..., d_x$, then $\operatorname{diag}(x^{(i)})\, V^{-1} \operatorname{diag}(x^{(i)}) \succ 0$ and therefore program (17) is an unconstrained convex quadratic program with unique minimum:

$$ \theta^* = \left( \sum_{i=1}^{m} \operatorname{diag}(x^{(i)})\, V^{-1} \operatorname{diag}(x^{(i)}) \right)^{-1} \left( \sum_{i=1}^{m} \operatorname{diag}(x^{(i)})\, V^{-1} y^{(i)} \right). \quad (18) $$

We refer the reader to Appendix A for proof of Proposition 1. We make a few important observations. The first is that for the realistic case where there exists an $x^{(i)}$ such that each $x_j^{(i)}$ is not exactly zero, the optimal regression coefficients, $\theta^*$, are unique. Furthermore, we observe that the solution is independent of the risk-aversion parameter. This is intuitive, as when the nominal program is unconstrained, the risk-aversion parameter simply controls the scale of the resulting solutions $z^*(x^{(i)}, \theta)$. Lastly, the optimal coefficients presented in equation (18) have a form similar to the solution of a weighted least-squares problem (see for example Hastie et al. [19]), with weight matrix $V^{-1}$.

We note that the solution presented in Equation (18) requires the action of the inverse of the Hessian:

$$ \sum_{i=1}^{m} \operatorname{diag}(x^{(i)})\, V^{-1} \operatorname{diag}(x^{(i)}). $$
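Equation (18) can be transcribed directly. The following is a minimal NumPy sketch under our notational assumptions (rows of `X` hold the univariate features $x^{(i)}$, rows of `Y` the realized returns $y^{(i)}$); the paper itself obtains the same solution via a single Newton step:

```python
import numpy as np

def ipo_theta_univariate(X, Y, V):
    # Eq. (18): theta* = (sum_i D_i V^{-1} D_i)^{-1} (sum_i D_i V^{-1} y_i),
    # with D_i = diag(x_i). Note the solution does not depend on delta.
    V_inv = np.linalg.inv(V)
    H = np.zeros_like(V)                  # Hessian of (17), up to 1/(m*delta)
    g = np.zeros(V.shape[0])
    for x_i, y_i in zip(X, Y):
        D_i = np.diag(x_i)
        H += D_i @ V_inv @ D_i            # equivalently V_inv * np.outer(x_i, x_i)
        g += D_i @ V_inv @ y_i
    return np.linalg.solve(H, g)
```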
In many applications of machine learning, such as computer vision or statistical meta-modelling, it is difficult, if not impossible, to solve the inverse problem without customized algorithms or prior knowledge of the data (see for example Jones and Taylor [21], Ranjan et al. [29], Ongie et al. [28]). In many cases, the dimension of the relevant Hessian is either too large for both forward-mapping and inversion in reasonable compute time or is computationally unstable due to near-singularity. In our IPO framework, we fortunately do not encounter these technical difficulties surrounding the action of the inverse. In most practical settings, the dimension of the Hessian matrix is on the order of 10 or 100, whereas the number of observations, $m$, is on the order of 1000 or 10000. The Hessian is therefore likely to be computationally stable, and the action of the inverse is very tractable and no more computationally demanding than the inversion required for solving the nominal program itself.

3.2 Z = {Az = b}, f(x, θ) univariate

We now consider the case where the nominal program is constrained by a set of linear equality constraints: $Z = \{Az = b\}$, where $A \in \mathbb{R}^{d_{eq} \times d_z}$ and $b \in \mathbb{R}^{d_{eq}}$. We assume the non-trivial case where $A$ is not full rank.
Proposition 2. Let $\theta \in \mathbb{R}^{d_z}$ with $Z = \{Az = b\}$ and $\Theta = \mathbb{R}^{d_z}$. Let the columns of $F$ form a basis for the nullspace of $A$, defined as:

$$ \operatorname{Null}(A) = \{z \in \mathbb{R}^{d_z} \mid Az = 0\}. $$

Let $z_0$ be a particular element of $Z$. Then the IPO program is an unconstrained quadratic program:

$$ \underset{\theta}{\text{minimize}}\ \frac{1}{m} \sum_{i=1}^{m} \left( -z^*(x^{(i)}, \theta)^T y^{(i)} + \frac{\delta}{2} z^*(x^{(i)}, \theta)^T V z^*(x^{(i)}, \theta) \right), \quad (19) $$

where

$$ z^*(x^{(i)}, \theta) = \frac{1}{\delta} F (F^T V F)^{-1} F^T \left( \operatorname{diag}(x^{(i)})\, \theta - \delta V z_0 \right) + z_0. $$

Furthermore, if there exists an $x^{(i)}$ such that $x_j^{(i)} \neq 0\ \forall j \in 1, ..., d_x$, then $H \succ 0$ and therefore program (19) is an unconstrained convex quadratic program with unique minimum:

$$ \theta^* = H^{-1} \left( \sum_{i=1}^{m} \operatorname{diag}(x^{(i)})\, F (F^T V F)^{-1} F^T y^{(i)} \right), \quad (20) $$

where

$$ H = \sum_{i=1}^{m} \operatorname{diag}(x^{(i)})\, F (F^T V F)^{-1} F^T \operatorname{diag}(x^{(i)}). $$
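A sketch of Proposition 2, assuming SciPy for the nullspace basis (`scipy.linalg.null_space`) and a particular solution $z_0$ obtained by least squares; the function names are ours:

```python
import numpy as np
from scipy.linalg import null_space

def ipo_theta_equality(X, Y, V, A):
    # Eq. (20): theta* = H^{-1} sum_i D_i P y_i, with
    # P = F (F' V F)^{-1} F' and H = sum_i D_i P D_i, D_i = diag(x_i).
    F = null_space(A)                            # columns span Null(A)
    P = F @ np.linalg.solve(F.T @ V @ F, F.T)
    d_z = V.shape[0]
    H = np.zeros((d_z, d_z))
    g = np.zeros(d_z)
    for x_i, y_i in zip(X, Y):
        D_i = np.diag(x_i)
        H += D_i @ P @ D_i
        g += D_i @ P @ y_i
    return np.linalg.solve(H, g)

def z_star_equality(x_i, theta, V, A, b, delta=1.0):
    # Optimal policy of Proposition 2, with z0 a particular element of Z.
    F = null_space(A)
    P = F @ np.linalg.solve(F.T @ V @ F, F.T)
    z0, *_ = np.linalg.lstsq(A, b, rcond=None)
    return P @ (np.diag(x_i) @ theta - delta * V @ z0) / delta + z0
```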
Corollary 1. Let the following conditions hold true:

1. The feasible region of the nominal program is unconstrained or contains at most linear equality constraints: $Z = \{Az = b\}$.
2. The prediction model is of the form $f(x^{(i)}, \theta) = \operatorname{diag}(x^{(i)})\, \theta$ for $\theta \in \mathbb{R}^{d_z}$.
3. There exists an auxiliary data observation that satisfies $x_j^{(i)} \neq 0\ \forall j \in 1, ..., d_x$.

Then the mean-variance IPO program defined in Equation (13) is a convex quadratic program in the variable $\theta$.

We refer the reader to Appendix B for proof of Proposition 2. Again, the observations of uniqueness of $\theta^*$ and its independence of the risk-aversion parameter hold in the case of the equality constrained IPO problem. Note that in Propositions 1 and 2, we assume that the regression coefficients are unconstrained ($\Theta = \mathbb{R}^{d_z}$). In many applications, however, this may not always be the case. Nonetheless, Corollary 1 states that under the aforementioned conditions, the mean-variance IPO program is a convex quadratic program and therefore in practice the optimal $\theta^*$ solution can be found through Newton's method. We refer the reader to the Appendix for the formulae for the relevant gradients and Hessians. Of course, when $\theta$ is unconstrained, Newton's method converges after a single Newton descent iteration. In practice, we implement a Newton descent method within the neural-network architecture, described in Section 3.5 below.

3.3 Z unconstrained, f(x, θ) multivariate

In the multivariate prediction case, we have $\theta \in \mathbb{R}^{d_x \times d_z}$ and therefore the prediction model is given as:

$$ f(x^{(i)}, \theta) = \theta^T x^{(i)}. \quad (21) $$
Proposition 3. Let $\theta \in \mathbb{R}^{d_x \times d_z}$ with $Z = \mathbb{R}^{d_z}$ and $\Theta = \mathbb{R}^{d_x \times d_z}$. The IPO program is therefore an unconstrained program given by:

$$ \underset{\theta}{\text{minimize}}\ -\frac{1}{m\delta} \sum_{i=1}^{m} \left( x^{(i)T} \theta\, V^{-1} y^{(i)} \right) + \frac{1}{2m\delta} \sum_{i=1}^{m} \left( x^{(i)T} \theta\, V^{-1} \theta^T x^{(i)} \right). \quad (22) $$

Furthermore, when $\sum_{i=1}^{m} x^{(i)} x^{(i)T} \succ 0$, then (22) has a unique minimum:

$$ \theta^* = \left( \sum_{i=1}^{m} x^{(i)} x^{(i)T} \right)^{-1} \left( \sum_{i=1}^{m} x^{(i)} y^{(i)T} \right). \quad (23) $$

Proposition 3 states that when the nominal program is unconstrained and $f(x, \theta)$ is a multivariate regression model, then the ordinary least-squares coefficients are the optimal coefficients, $\theta^*$, of the unconstrained IPO program.

3.4 Z = {Az = b}, f(x, θ) multivariate

As in the univariate case, we introduce the set of (non-trivial) linear equality constraints: $Z = \{Az = b\}$.

Proposition 4. Let $\theta \in \mathbb{R}^{d_x \times d_z}$ with $Z = \{Az = b\}$ and $\Theta = \mathbb{R}^{d_x \times d_z}$. Let the columns of $F$ form a basis for the nullspace of $A$, defined as:

$$ \operatorname{Null}(A) = \{z \in \mathbb{R}^{d_z} \mid Az = 0\}. $$

Let $z_0$ be a particular element of $Z$. Then the IPO program is an unconstrained quadratic program:

$$ \underset{\theta}{\text{minimize}}\ \frac{1}{m} \sum_{i=1}^{m} \left( -z^*(x^{(i)}, \theta)^T y^{(i)} + \frac{\delta}{2} z^*(x^{(i)}, \theta)^T V z^*(x^{(i)}, \theta) \right), \quad (24) $$

where

$$ z^*(x^{(i)}, \theta) = \frac{1}{\delta} F (F^T V F)^{-1} F^T \left( \theta^T x^{(i)} - \delta V z_0 \right) + z_0. $$

Furthermore, when $\sum_{i=1}^{m} x^{(i)} x^{(i)T} \succ 0$, then (24) is convex and the OLS solution, presented in Equation (25), is a particular solution:

$$ \theta^* = \left( \sum_{i=1}^{m} x^{(i)} x^{(i)T} \right)^{-1} \left( \sum_{i=1}^{m} x^{(i)} y^{(i)T} \right). \quad (25) $$
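The coincidence with OLS in equations (23) and (25) is easy to verify numerically; a minimal sketch (function name is ours):

```python
import numpy as np

def ipo_theta_multivariate(X, Y):
    # Eq. (23)/(25): theta* = (sum_i x_i x_i')^{-1} (sum_i x_i y_i'),
    # i.e. the ordinary least-squares coefficient matrix.
    theta_normal = np.linalg.solve(X.T @ X, X.T @ Y)
    theta_lstsq, *_ = np.linalg.lstsq(X, Y, rcond=None)  # same quantity, better conditioned
    assert np.allclose(theta_normal, theta_lstsq)
    return theta_lstsq
```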
3.5 Z = {Az = b, Gz ≤ h}

We now consider the more general case where the feasible region of the nominal program is defined by both linear equality and inequality constraints. Specifically, we have the following nominal program:

$$ \begin{aligned} \underset{z}{\text{minimize}} \quad & -z^T y^{(i)} + \frac{\delta}{2} z^T V z \\ \text{subject to} \quad & Az = b \\ & Gz \leq h, \end{aligned} \quad (26) $$

where $A \in \mathbb{R}^{d_{eq} \times d_z}$, $b \in \mathbb{R}^{d_{eq}}$ and $G \in \mathbb{R}^{d_{iq} \times d_z}$, $h \in \mathbb{R}^{d_{iq}}$ describe the linear equality and inequality constraints, respectively. Unlike our previous examples, there is no known analytical solution to the general inequality constrained quadratic program. Therefore, the strategy of analytically solving for $z^*(x^{(i)}, \theta)$ and substituting the solution into the IPO objective, $L(\theta)$, does not apply here.

Furthermore, because of the inequality constraints, the IPO program is not a convex program. Nonetheless, we can solve for locally optimal solutions by applying (stochastic) gradient descent methods. From the multivariate chain rule, the gradient $\nabla_\theta L$ can be expressed as:

$$ \nabla_\theta L = \frac{\partial L}{\partial z^*} \frac{\partial z^*}{\partial \theta}. \quad (27) $$

In our particular case, the nominal cost function, $c$, is smooth and twice differentiable in the decision vector $z$ and therefore it is relatively straightforward to compute the gradient $\partial L / \partial z^*$. Recall, however, that the primary technical challenge lies in computing the Jacobian $\partial z^* / \partial \theta$, as it requires differentiation through the argmin operator. We will see shortly, however, that rather than forming the Jacobian directly, we can instead compute $\partial L / \partial \theta$ essentially 'for free' by using the resulting system of linear equations provided by the Karush-Kuhn-Tucker (KKT) conditions at the optimal solution $z^*$ to program (26).

We follow the work of [2] and begin by first writing the Lagrangian of program (26):

$$ \mathcal{L}(z, \lambda, \nu) = -z^T y + \frac{\delta}{2} z^T V z + \lambda^T (Gz - h) + \nu^T (Az - b), \quad (28) $$

where $\lambda \in \mathbb{R}^{d_{iq}}$ and $\nu \in \mathbb{R}^{d_{eq}}$ are the dual variables of the inequality and equality constraints, respectively. For compactness, we have temporarily dropped the index notation. The well-known KKT conditions for stationarity, primal feasibility, and complementary slackness are given by equations (29):

$$ \begin{aligned} -y + \delta V z^* + G^T \lambda^* + A^T \nu^* &= 0 \\ G z^* - h &\leq 0 \\ \lambda^* &\geq 0 \\ \lambda^* \cdot (G z^* - h) &= 0 \\ A z^* &= b. \end{aligned} \quad (29) $$
Taking the differentials of these conditions gives the following system of equations:

$$ \begin{bmatrix} \delta V & G^T & A^T \\ \operatorname{diag}(\lambda^*) G & \operatorname{diag}(G z^* - h) & 0 \\ A & 0 & 0 \end{bmatrix} \begin{bmatrix} dz \\ d\lambda \\ d\nu \end{bmatrix} = - \begin{bmatrix} \delta\, dV\, z^* - dy + dG^T \lambda^* + dA^T \nu^* \\ \operatorname{diag}(\lambda^*)\, dG\, z^* - \operatorname{diag}(\lambda^*)\, dh \\ dA\, z^* - db \end{bmatrix}. \quad (30) $$

We make two important observations about the system of equations (30). The first is that the left-side matrix gives the optimality conditions of the convex quadratic problem which, when solving by interior-point methods, must be factorized in order to obtain the solution to the nominal program [7]. Secondly, the right side gives the differentials of the relevant functions at the achieved solution with respect to any of the input parameters. In particular, we seek to compute the Jacobian:

$$ \frac{\partial z^*}{\partial \theta} = \frac{\partial z^*}{\partial y} \frac{\partial y}{\partial \theta}. $$

As explained by Amos and Kolter [2], the Jacobian $\partial z^* / \partial y$ is obtained by letting $dy = I$ (setting all other differential terms to zero) and solving the system of equations for $dz$. From a computation standpoint, the required Jacobian is therefore effectively obtained 'for free' upon factorization of the left matrix when obtaining the solution, $z^*$, to the nominal program.

In practice, however, we never explicitly form the right-side Jacobian matrix and compute $\partial z^* / \partial y$ directly. We instead follow the work of Amos and Kolter [2] and Agrawal et al. [1] and treat the nominal quadratic program as a differentiable layer in a neural network. The IPO-equivalent neural-network structure is depicted in Figure 1. In the forward pass, the input layer takes the auxiliary variables $x^{(i)}$ and passes them to a simple linear layer to produce the estimates $\hat{y}^{(i)}$. The linear layer is either singly connected, in the univariate case, or fully connected, in the multivariate case. The predictions are then passed to a differentiable quadratic programming layer which, for a given input $\hat{y}^{(i)}$, solves the nominal program and returns the optimal portfolio weights $z^*(x^{(i)}, \theta)$. At this point, we cache the factorized left-hand-side matrix at the optimal solution, which will be invoked during backpropagation. Finally, the quality of the portfolio weights is evaluated by the cost function, $c(z^*(x, \theta), y^{(i)})$, in the context of the realized (true) $y^{(i)}$.

The backward pass begins with the gradient $\partial c / \partial z^*$. Following Amos and Kolter [2], we apply the implicit function theorem and compute $\partial c / \partial y$ by multiplying the backward pass vector by the inverse of the transposed left-hand-side matrix, as shown in equation (31). Recall, we have cached the factorized left-side matrix and thus the gradient $\partial c / \partial y$ is effectively provided at no additional computation cost:

$$ \begin{bmatrix} \hat{d}_z \\ \hat{d}_\lambda \\ \hat{d}_\nu \end{bmatrix} = - \begin{bmatrix} \delta V & G^T \operatorname{diag}(\lambda^*) & A^T \\ G & \operatorname{diag}(G z^* - h) & 0 \\ A & 0 & 0 \end{bmatrix}^{-1} \begin{bmatrix} \left( \frac{\partial c}{\partial z^*} \right)^T \\ 0 \\ 0 \end{bmatrix}. \quad (31) $$

Figure 1: IPO program represented as an end-to-end neural network with predictive linear layer, differentiable quadratic programming layer and realized nominal cost loss function.

When the nominal program contains linear inequality constraints, we search for a locally optimal solution using stochastic gradient descent (SGD). In this case, the descent direction, $g_\theta$, at each iteration attempts to approximate the gradient, $\nabla_\theta L$, and is given by:

$$ g_\theta = \sum_{i \in \mathcal{B}} \left( \frac{\partial c}{\partial \theta} \right) \Big|_{(z^*(x^{(i)}, \theta),\, y^{(i)})} \approx \nabla_\theta L, $$

where in standard stochastic gradient descent, $\mathcal{B}$ represents a randomly drawn sample batch.
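For intuition, the backward-pass solve of equation (31) can be written out directly. A minimal NumPy sketch (names ours), assuming the primal-dual solution $(z^*, \lambda^*, \nu^*)$ of program (26) is available; a practical implementation would reuse the cached factorization rather than re-assembling the matrix:

```python
import numpy as np

def grad_c_wrt_y(z_star, lam_star, V, G, h, A, dc_dz, delta=1.0):
    # Assemble the transposed KKT matrix of eq. (31) and solve for the
    # backward-pass vector; dc/dy is its first block, d_hat_z.
    d_z, d_iq, d_eq = V.shape[0], G.shape[0], A.shape[0]
    n = d_z + d_iq + d_eq
    K = np.zeros((n, n))
    K[:d_z, :d_z] = delta * V
    K[:d_z, d_z:d_z + d_iq] = G.T @ np.diag(lam_star)
    K[:d_z, d_z + d_iq:] = A.T
    K[d_z:d_z + d_iq, :d_z] = G
    K[d_z:d_z + d_iq, d_z:d_z + d_iq] = np.diag(G @ z_star - h)
    K[d_z + d_iq:, :d_z] = A
    rhs = np.concatenate([dc_dz, np.zeros(d_iq + d_eq)])
    sol = -np.linalg.solve(K, rhs)
    return sol[:d_z]                      # d_hat_z = dc/dy
```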
When the nominal program does not contain linear inequality constraints, the descent direction is given by the full Newton descent step:

$$ g_\theta = H^{-1} \frac{1}{m} \sum_{i=1}^{m} \left( \frac{\partial c}{\partial \theta} \right) \Big|_{(z^*(x^{(i)}, \theta),\, y^{(i)})}, $$

where $H$ denotes the Hessian, $\partial^2 c / \partial \theta^2$, defined in more detail in the Appendix. Recall that when the regression coefficients are unconstrained, the IPO program is a convex quadratic program and therefore Newton's method will converge after a single step.

Lastly, Equation (31) allows for efficient computation of the gradients with respect to any of the relevant problem variables. While in this paper we are solely concerned with linear prediction models for the vectors of expected returns, $y^{(i)}$, we note that the IPO framework can easily support integrated optimization of prediction model parameters (linear or otherwise) for the remaining problem variables. For the reader's interest, we state the gradients for all other problem variables and refer the reader to Amos and Kolter [2] for their derivation:

$$ \begin{aligned} \frac{\partial c}{\partial V} &= \frac{1}{2} \left( \hat{d}_z z^{*T} + z^* \hat{d}_z^T \right) & \frac{\partial c}{\partial y} &= \hat{d}_z \\ \frac{\partial c}{\partial A} &= \hat{d}_\nu z^{*T} + \nu^* \hat{d}_z^T & \frac{\partial c}{\partial b} &= -\hat{d}_\nu \\ \frac{\partial c}{\partial G} &= \operatorname{diag}(\lambda^*)\, \hat{d}_\lambda z^{*T} + \lambda^* \hat{d}_z^T & \frac{\partial c}{\partial h} &= -\operatorname{diag}(\lambda^*)\, \hat{d}_\lambda \end{aligned} $$
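The network of Figure 1 can be assembled with off-the-shelf differentiable optimization layers. The sketch below uses `cvxpylayers` (Agrawal et al. [1]) in place of the OptNet QP layer of [2], and plain uniform mini-batching in place of the Latin hypercube sampling used in our experiments; it illustrates the architecture under those assumptions, and is not the exact implementation:

```python
import cvxpy as cp
import torch
from cvxpylayers.torch import CvxpyLayer

def make_qp_layer(d_z, delta=1.0):
    # Nominal program (26), here with fully-invested and long-only
    # constraints; y_hat enters as a parameter, so backpropagation
    # differentiates through the argmin via the KKT system (30)-(31).
    z = cp.Variable(d_z)
    y_hat = cp.Parameter(d_z)
    V_sqrt = cp.Parameter((d_z, d_z))     # factor with V = V_sqrt' V_sqrt
    objective = cp.Minimize(-y_hat @ z + 0.5 * delta * cp.sum_squares(V_sqrt @ z))
    problem = cp.Problem(objective, [cp.sum(z) == 1, z >= 0])
    return CvxpyLayer(problem, parameters=[y_hat, V_sqrt], variables=[z])

def train_ipo(X, Y, V, n_iter=500, batch_size=64, delta=1.0, lr=1e-2):
    m, d_x = X.shape
    d_z = Y.shape[1]
    layer = make_qp_layer(d_z, delta)
    V_sqrt = torch.linalg.cholesky(V).mT              # upper Cholesky factor
    theta = torch.zeros(d_x, d_z, requires_grad=True) # linear prediction layer
    optimizer = torch.optim.Adam([theta], lr=lr)
    for _ in range(n_iter):
        idx = torch.randint(0, m, (batch_size,))
        X_b, Y_b = X[idx], Y[idx]
        z_star, = layer(X_b @ theta, V_sqrt)          # batched forward QP solves
        # Realized nominal cost (14), averaged over the mini-batch.
        risk = torch.einsum('bi,ij,bj->b', z_star, V, z_star)
        loss = (-(z_star * Y_b).sum(dim=1) + 0.5 * delta * risk).mean()
        optimizer.zero_grad()
        loss.backward()                               # backward pass through the QP layer
        optimizer.step()
    return theta.detach()
```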
4 Numerical Experiments

Experiment Setup

We consider an asset universe of 55 global futures markets, described in Table 1. The daily price data is given from January 1995 through October 2020, and is provided by Commodity Systems Inc. Futures contracts are rolled on, at most, a monthly basis in order to remain invested in the most liquid contract, as measured by open interest and volume. Arithmetic returns are computed directly from the price data and are in excess of the risk-free rate. The auxiliary feature of interest is the average daily return for each market over the past 20 days. The feature therefore represents a measure of the well-documented 'trend' factor, popular with many Commodity Trading Advisors (CTAs) and hedge funds (see for example [26], [3], [8]). In an effort to mitigate turnover and provide more realistic results, the dependent variable, $y^{(i)}$, is the average realized 5-day forward return. Therefore the portfolio weights, $z^*(x^{(i)}, \theta)$, represent the optimal holding for the next 5 days. In practice, we reform the portfolio at the close of each day, and thus rebalance 1/5th of the exposure on a daily basis. We consider 4 prediction and optimization cases:

1. Unconstrained nominal program with univariate regression.
2. Equality constrained nominal program with univariate regression.
3. Equality and inequality constrained nominal program with univariate regression.
4. Equality and inequality constrained nominal program with multivariate regression.

In all cases, the weights are normalized such that their total exposure (long and short) is equal to one. Note that because the universe is of moderate size and is highly diversified, the ambient volatility of the resulting portfolios is relatively low. In practice, portfolios can be re-sized to an appropriate level of volatility by applying leverage [31] or by altering the asset universe. The results presented below are meant to illustrate the benefit of the integrated framework. All performance is gross of trading costs and in excess of the risk-free rate.

For each experiment, we divide our data into 10 equally partitioned, contiguous segments. We apply a $k$-fold training and testing methodology. That is, for each out-of-sample segment, we optimize our coefficients, $\theta^*$, on the remaining 9 segments and apply the optimal policy $z^*(x^{(i)}, \theta^*)$ to the out-of-sample segment. The equity growth charts, presented below, are then created by stitching together the 10 out-of-sample folds.

The IPO and OLS methodologies are evaluated in absolute and relative terms based on out-of-sample nominal cost and out-of-sample Sharpe ratios. The out-of-sample nominal MVO cost is provided by Equation (10). The Sharpe ratio is the ratio of average excess return to average standard deviation, and is presented in Equation (32):

$$ c_{SR}(z, y^{(i)}) = \frac{z^T y^{(i)}}{\sqrt{z^T V z}}. \quad (32) $$

In each experiment we generate 1000 samples with 252 observations (1 year) from the out-of-sample distribution. We use the same observations across both methodologies for each sample draw. We compare the average negative MVO cost and Sharpe ratios over the resulting samples. Furthermore, we report the dominance ratio, which we informally define as the proportion of samples for which the realized metric of the IPO methodology exceeds that of the OLS methodology, presented in Equation (33):

$$ DR = \frac{1}{N} \sum_{i=1}^{N} \mathbb{I}\left[ c(z^*(x^{(i)}, \theta^*), y^{(i)}) > c(z^*(x^{(i)}, \hat{\theta}), y^{(i)}) \right]. \quad (33) $$

Here, $\mathbb{I}$ denotes the indicator function and $\theta^*$ and $\hat{\theta}$ denote the IPO and OLS optimal coefficients, respectively.
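Equation (33) amounts to a single vectorized comparison. A minimal sketch, assuming `metric_ipo` and `metric_ols` hold the realized metric (negative MVO cost or Sharpe ratio) for the same $N$ paired sample draws:

```python
import numpy as np

def dominance_ratio(metric_ipo, metric_ols):
    # Eq. (33): fraction of paired out-of-sample draws in which the
    # realized metric of the IPO policy exceeds that of the OLS policy.
    return float(np.mean(np.asarray(metric_ipo) > np.asarray(metric_ols)))
```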
All experiments were conducted on an Apple Mac Pro computer using the torch library.

4.1 Z unconstrained, f(x, θ) univariate

To compute the IPO coefficients, we set the risk-aversion parameter to 1 and solve the following IPO program using Newton's method as described in Section 3.5:

$$ \begin{aligned} \underset{\theta}{\text{minimize}} \quad & \frac{1}{m} \sum_{i=1}^{m} \left( -z^*(x^{(i)}, \theta)^T y^{(i)} + \frac{1}{2} z^*(x^{(i)}, \theta)^T V z^*(x^{(i)}, \theta) \right) \\ \text{subject to} \quad & z^*(x^{(i)}, \theta) = \underset{z}{\operatorname{argmin}}\ -z^T \operatorname{diag}(x^{(i)})\, \theta + \frac{1}{2} z^T V z \quad \forall i = 1, ..., m. \end{aligned} \quad (34) $$
For illustrative purposes, the estimated regression coefficients for the first out-of-sample data fold are provided in Figure 2. Note that the IPO method provides very different regression coefficients, in both magnitude and sign, compared to the OLS coefficients. In particular, we observe that practically all of the IPO regression coefficients are negative, in comparison to the OLS coefficients, of which only 55% are negative. Moreover, the magnitudes of the IPO coefficients are roughly 50% larger on average.

Figure 2: Optimal IPO and OLS regression coefficients for the unconstrained nominal mean-variance program and univariate prediction model.

Figure 3 highlights a particularly extreme example of the difference in the regression coefficients and the resulting optimal prediction model. The futures market is the US ten-year bond (TY). We observe that while the OLS coefficient is positive, with a value of approximately 0.10, the IPO optimal coefficient is negative and of roughly equal magnitude.

Figure 3: Comparison of the optimal IPO and OLS predictions for the US ten-year treasury bond future.

The out-of-sample equity growth is presented in Figure 4 and demonstrates that the IPO method provides higher absolute and risk-adjusted performance. In Figure 5 we compare the realized negative MVO costs across 1000 out-of-sample realizations. Figure 5(a) demonstrates an approximate 50% improvement in the average realized out-of-sample cost. Moreover, Figure 5(b) demonstrates that across all sample draws the IPO method results in a lower realized out-of-sample cost. In Figure 6 we compare the realized Sharpe ratios across 1000 out-of-sample realizations. The average Sharpe ratio of the IPO method is approximately 1.30, whereas the average Sharpe ratio of the OLS method is approximately 0.70. We report the dominance ratio in Figure 6(b) at 77%, meaning that the IPO method realizes a higher Sharpe ratio in approximately 77% of samples in comparison to the OLS method.
Figure 4: Out-of-sample log-equity growth of IPO and OLS for the unconstrained nominal mean-variance program and univariate prediction model.
Figure 5: Realized out-of-sample negative MVO cost of the IPO and OLS methods for the unconstrained nominal mean-variance program and univariate prediction model: (a) negative MVO cost distribution; (b) negative MVO cost realizations (IPO dominance ratio: 1.00).
Figure 6: Realized out-of-sample Sharpe ratios of the IPO and OLS methods for the unconstrained nominal mean-variance program and univariate prediction model: (a) out-of-sample Sharpe ratio distribution; (b) out-of-sample Sharpe ratio realizations (IPO dominance ratio: 0.77).

4.2 Z = {Az = b}, f(x, θ) univariate

We now introduce the 'fully-invested' linear equality constraint into the nominal optimization program. Again, we set the risk-aversion parameter to 1 and solve the following IPO program using Newton's method as described in Section 3.5:

$$ \begin{aligned} \underset{\theta}{\text{minimize}} \quad & \frac{1}{m} \sum_{i=1}^{m} \left( -z^*(x^{(i)}, \theta)^T y^{(i)} + \frac{1}{2} z^*(x^{(i)}, \theta)^T V z^*(x^{(i)}, \theta) \right) \\ \text{subject to} \quad & z^*(x^{(i)}, \theta) = \underset{\sum_{j=1}^{d_z} z_j = 1}{\operatorname{argmin}}\ -z^T \operatorname{diag}(x^{(i)})\, \theta + \frac{1}{2} z^T V z \quad \forall i = 1, ..., m. \end{aligned} \quad (35) $$

The estimated regression coefficients for the first out-of-sample data fold are provided in Figure 7. The IPO regression coefficients are very similar to the optimal coefficients in the unconstrained case. Again we observe that the IPO coefficients are predominantly negative and of larger magnitude.

Figure 7: Optimal IPO and OLS regression coefficients for the equality constrained nominal mean-variance program and univariate prediction model.

The out-of-sample equity growth is presented in Figure 8 and demonstrates the benefit of the IPO method. Figure 9 shows an approximate 50% improvement in the average realized out-of-sample cost, with an MVO cost dominance ratio of 100%. The average Sharpe ratio of the IPO method is approximately 1.30, whereas the average Sharpe ratio of the OLS method is only 0.50. We observe that with the addition of the equality constraint the Sharpe ratio of the OLS method decreases, whereas the Sharpe ratio of the IPO method is preserved. Furthermore, the Sharpe dominance ratio is reported in Figure 10(b) at a value of 79%.
Figure 8: Out-of-sample log-equity growth of IPO and OLS for the equality constrained nominal mean-variance program and univariate prediction model.
Figure 9: Realized out-of-sample negative MVO cost of the IPO and OLS methods for the equality constrained nominal mean-variance program and univariate prediction model: (a) negative MVO cost distribution; (b) negative MVO cost realizations (IPO dominance ratio: 1.00).
Figure 10: Realized out-of-sample Sharpe ratios of the IPO and OLS methods for the equality constrained nominal mean-variance program and univariate prediction model: (a) out-of-sample Sharpe ratio distribution; (b) out-of-sample Sharpe ratio realizations (IPO dominance ratio: 0.79).

4.3 Z = {Az = b, Gz ≤ h}, f(x, θ) univariate

We introduce non-negativity constraints into the nominal program. The resulting portfolios are therefore 'long-only' and 'fully-invested'. With inequality constraints, the risk-aversion parameter can have a significant impact on the resulting portfolio constitution. Rather than specifying a particular risk-aversion parameter, we instead follow Cornuejols and Tutuncu [12] and re-cast the nominal program to maximize the Sharpe ratio. This is possible as both the inequality constraints and the Sharpe ratio are homogeneous of degree zero.

In the inequality constrained case we solve the IPO program using the neural-network structure described in Section 3.5, employing both mini-batching and the ADAM stochastic gradient descent routine [23]. The realized cost function is the Sharpe ratio, defined by Equation (32). We choose a mini-batch size of 5% of the daily training observations and randomly sample using a space-filling maximin Latin hypercube design to ensure roughly evenly spaced sampling [25]. The algorithm terminates after 500 iterations of stochastic gradient descent:

$$ \begin{aligned} \underset{\theta}{\text{minimize}} \quad & -\frac{1}{m} \sum_{i=1}^{m} \left( \frac{z^*(x^{(i)}, \theta)^T y^{(i)}}{\sqrt{z^*(x^{(i)}, \theta)^T V z^*(x^{(i)}, \theta)}} \right) \\ \text{subject to} \quad & z^*(x^{(i)}, \theta) = \underset{z^T \operatorname{diag}(x^{(i)})\, \theta = 1,\ z \geq 0}{\operatorname{argmin}}\ z^T V z \quad \forall i = 1, ..., m. \end{aligned} \quad (36) $$
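The inner program of (36) is a standard minimum-variance QP with a unit predicted-return constraint, after which the weights are rescaled to unit exposure. A minimal CVXPY sketch of this nominal solve (the function name is ours; it assumes at least one asset has a positive predicted return, otherwise the program is infeasible):

```python
import cvxpy as cp
import numpy as np

def max_sharpe_weights(y_hat, V):
    # Inner argmin of program (36): minimize variance subject to unit
    # predicted return and non-negativity, following the Sharpe-ratio
    # recast of Cornuejols and Tutuncu [12].
    w = cp.Variable(len(y_hat))
    problem = cp.Problem(cp.Minimize(cp.quad_form(w, V)),
                         [w @ y_hat == 1, w >= 0])
    problem.solve()
    return w.value / np.sum(w.value)   # rescale: fully-invested, long-only z*
```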
Figure 11: Optimal IPO and OLS regression coefficients for the inequality constrained nominal mean-variance program and univariate prediction model.

The out-of-sample equity growth is presented in Figure 12 and again demonstrates that the IPO method can provide superior performance, in both absolute and risk-adjusted terms, over the long term. The difference in performance, however, is not nearly as pronounced as in the prior examples. This is intuitive, as tighter portfolio constraints reduce the effective degrees of freedom and subsequent bets that the portfolio can make at each rebalance. Nonetheless, Figure 13(a) demonstrates an approximate 30% improvement in the average realized out-of-sample Sharpe ratio. The average Sharpe ratio of the IPO method is 1.12, whereas the average Sharpe ratio of the OLS method is approximately 0.86. The Sharpe dominance ratio is reported in Figure 13(b) at a more modest value of 59%.
Figure 12: Out-of-sample log-equity growth of IPO and OLS for the inequality constrained nominal mean-variance program and univariate prediction model.
Figure 13: Realized out-of-sample Sharpe ratios of the IPO and OLS methods for the inequality constrained nominal mean-variance program and univariate prediction model: (a) out-of-sample Sharpe ratio distribution; (b) out-of-sample Sharpe ratio realizations (IPO dominance ratio: 0.59).

4.4 Z = {Az = b, Gz ≤ h}, f(x, θ) multivariate

Lastly, we consider the inequality constrained nominal program under a multivariate prediction setting. The IPO program is equivalent to program (36), with the exception that the multivariate predictions take the form $f(x^{(i)}, \theta) = \theta^T x^{(i)}$.

The out-of-sample equity growth is presented in Figure 14. Note that in the multivariate case the IPO method results in higher risk-adjusted performance but lower absolute performance. This result is not surprising, as the objective function of the IPO program is to maximize the risk-adjusted performance, irrespective of the absolute performance. Interestingly, the average realized Sharpe ratios of the IPO and OLS methods are 1.13 and 0.82, respectively. The Sharpe ratios are therefore comparable to the univariate prediction case and we posit that any benefit gained by the more complex multivariate model is possibly offset by overfitting. Again, the IPO method provides a modest Sharpe dominance ratio, reported in Figure 15(b) at a value of 63%.
Figure 14: Out-of-sample log-equity growth of IPO and OLS for the inequality constrained nominal mean-variance program and multivariate prediction model.
Figure 15: Realized out-of-sample Sharpe ratios of the IPO and OLS methods for the inequality constrained nominal mean-variance program and multivariate prediction model: (a) out-of-sample Sharpe ratio distribution; (b) out-of-sample Sharpe ratio realizations (IPO dominance ratio: 0.63).
5 Conclusion

In this paper we proposed an integrated prediction and optimization (IPO) framework for optimizing regression coefficients in the context of a mean-variance portfolio optimization. We structured the problem as a stochastic program where, for a fixed instantiation of regression coefficients, we solve a series of deterministic nominal mean-variance optimization programs. We investigated the IPO framework under both univariate and multivariate regression settings and considered the nominal program under various forms of constraints. Where possible, we provided closed-form analytical solutions for the optimal regression coefficients and the sufficient conditions for uniqueness. In the general case, we followed the work of Amos and Kolter [2] and Donti et al. [13] and restructured the IPO problem as a neural network with a differentiable quadratic programming layer. For the case where the nominal program is either unconstrained or contains only linear equality constraints, we proposed a second-order method for efficient optimization. We performed several historical simulations under various portfolio constraints and compared the IPO framework with the more traditional ordinary least-squares (OLS) 'predict, then optimize' framework. The numerical results demonstrate the benefits of the IPO framework, which in all cases provided superior performance, as measured by its average realized nominal cost.

We identified that, for our particular numerical example, the multivariate regression did not provide any benefit over the univariate counterpart, suggesting that possibly the multivariate model is overfit or that only a subset of the features are relevant for each asset. Methods for regularizing both the prediction and the nominal optimization program, as well as methods for choosing the 'best' feature subsets, are an interesting area of future research. Future work also includes incorporating other forms of prediction models into the IPO framework as well as exploring methods for performing the more difficult joint prediction of asset returns and covariances.
References

[1] Akshay Agrawal, Brandon Amos, Shane Barratt, Stephen Boyd, Steven Diamond, and J. Zico Kolter. Differentiable convex optimization layers. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 32, pages 9562–9574. Curran Associates, Inc., 2019.
[2] Brandon Amos and J. Zico Kolter. OptNet: Differentiable optimization as a layer in neural networks. CoRR, abs/1703.00443, 2017. URL http://arxiv.org/abs/1703.00443.
[3] Nick Baltas and Robert Kosowski. Momentum strategies in futures markets and trend-following funds, 2012.
[4] Gah-Yi Ban and Cynthia Rudin. The big data newsvendor: Practical insights from machine learning. Operations Research, 67(1):90–108, 2019.
[5] Dimitris Bertsimas and Nathan Kallus. From predictive to prescriptive analytics. Management Science, 66(3):1025–1044, 2020.
[6] Michael J. Best and Robert R. Grauer. On the sensitivity of mean-variance-efficient portfolios to changes in asset means: Some analytical and computational results. The Review of Financial Studies, 4(2):315–342, 1991.
[7] Stephen Boyd and Lieven Vandenberghe. Convex Optimization. Cambridge University Press, 2004.
[8] Benjamin Bruder, Tung-Lam Dao, Jean-Charles Richard, and Thierry Roncalli. Trend filtering methods for momentum strategies. Lyxor Asset Management, 2011.
[9] Binbin Chen, Shih-Feng Huang, and Guangming Pan. High dimensional mean variance optimization through factor analysis. Journal of Multivariate Analysis, 133:140–159, 2015. ISSN 0047-259X.
[10] Vijay Kumar Chopra and William T. Ziemba. The effect of errors in means, variances, and covariances on optimal portfolio choice. The Journal of Portfolio Management, 19(2):6–11, 1993. ISSN 0095-4918. doi: 10.3905/jpm.1993.409440. URL https://jpm.pm-research.com/content/19/2/6.
[11] Roger G. Clarke, Harindra de Silva, and Robert Murdock. A factor approach to asset allocation. The Journal of Portfolio Management, 32(1):10–21, 2005.
[12] Gerard Cornuejols and Reha Tutuncu. Optimization Methods in Finance. Cambridge University Press, 2006.
[13] Priya L. Donti, Brandon Amos, and J. Zico Kolter. Task-based end-to-end model learning. CoRR, abs/1703.04529, 2017. URL http://arxiv.org/abs/1703.04529.
[14] Holger Drees and Catalin Starica. A simple non-stationary model for stock returns, 2002.
[15] Adam N. Elmachtoub and Paul Grigas. Smart "predict, then optimize". arXiv, 2020.
[16] Robert F. Engle. Autoregressive conditional heteroskedasticity with estimates of the variance of United Kingdom inflation. Econometrica, 50(1):987–1008, 1982.
[17] D. Goldfarb and G. Iyengar. Robust portfolio selection problems. Mathematics of Operations Research, 28(1):1–38, 2003.
[18] Stephen Gould, Basura Fernando, Anoop Cherian, Peter Anderson, Rodrigo Santa Cruz, and Edison Guo. On differentiating parameterized argmin and argmax problems with application to bi-level optimization. CoRR, abs/1607.05447, 2016. URL http://arxiv.org/abs/1607.05447.
[19] Trevor Hastie, Robert Tibshirani, and Jerome Friedman. The Elements of Statistical Learning. Springer Series in Statistics. Springer New York Inc., New York, NY, USA, 2001.
[20] Der-Ann Hsu, Robert B. Miller, and Dean W. Wichern. On the stable Paretian behavior of stock-market prices. Journal of the American Statistical Association, 69(345):108–113, 1974.
[21] A. G. Jones and C. Taylor. Solving inverse problems in computer vision by scale space reconstruction. In MVA, 1994.
[22] R. Kannan, G. Bayraksan, and James R. Luedtke. Data-driven sample average approximation with covariate information, 2020.
[23] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. CoRR, abs/1412.6980, 2015.
[24] H. Markowitz. Portfolio selection. Journal of Finance, 7(1):77–91, 1952.
[25] M. D. McKay, R. J. Beckman, and W. J. Conover. A comparison of three methods for selecting values of input variables in the analysis of output from a computer code. Technometrics, 21(2):239–245, 1979.
[26] Tobias J. Moskowitz, Yao Hua Ooi, and Lasse Pedersen. Time series momentum. Journal of Financial Economics, 104(2):228–250, 2012.
[27] R. R. Officer. A time series examination of the market factor of the New York Stock Exchange. PhD dissertation, University of Chicago, 1971.
[28] Gregory Ongie, Ajil Jalal, Christopher A. Metzler, Richard G. Baraniuk, Alexandros G. Dimakis, and Rebecca Willett. Deep learning techniques for inverse problems in imaging, 2020.
[29] P. Ranjan, R. Haynes, and R. Karsten. A computationally stable approach to Gaussian process interpolation of deterministic computer simulation data. Technometrics, 52(4):366–378, 2011.
[30] Alexander Shapiro, Darinka Dentcheva, and Andrzej Ruszczyński. Lectures on Stochastic Programming. Society for Industrial and Applied Mathematics, 2009. doi: 10.1137/1.9780898718751.
[31] William F. Sharpe. Capital asset prices: A theory of market equilibrium under conditions of risk. The Journal of Finance, 19(3):425–442, 1964.
[32] Ankur Sinha, Tanmay Khandait, and Raja Mohanty. A gradient-based bilevel optimization approach for tuning hyperparameters in machine learning, 2020.
[33] Catalin Starica and Clive Granger. Nonstationarities in stock returns. The Review of Economics and Statistics, 87(3):503–522, 2005.
[34] V. Vapnik. Principles of risk minimization for learning theory. In J. Moody, S. Hanson, and R. P. Lippmann, editors, Advances in Neural Information Processing Systems, volume 4, pages 831–838. Morgan-Kaufmann, 1992.
A Proof of Proposition 1
We begin with the following proposition that will become useful later.
Proposition 5.
Let $V \in \mathbb{R}^{m \times m}$ be a symmetric positive definite matrix, let $B \in \mathbb{R}^{m \times n}$, and consider the quadratic form $A = B^T V B$. Then $A$ is a symmetric positive definite matrix if $B$ has full column rank.

Proof. The symmetry of $A$ follows directly from the definition. To prove positive definiteness, let $x \in \mathbb{R}^n$ be a non-zero vector, set $y = Bx$, and consider the quadratic form
$$x^T A x = x^T B^T V B x = y^T V y.$$
Clearly $y^T V y > 0$ for all $y \neq 0$, and $y^T V y = 0 \iff Bx = 0$. But $B$ has full column rank and therefore the only solution to $Bx = 0$ is the trivial solution $x = 0$. It follows that $x^T B^T V B x > 0$ and therefore $A$ is positive definite.

Recall that the nominal problem is a convex quadratic program and for the unconstrained case has the unique global solution presented in equation (37):
$$z^*(x^{(i)}, \theta) = \frac{1}{\delta} V^{-1} f(x^{(i)}, \theta) = \frac{1}{\delta} V^{-1} \mathrm{diag}(x^{(i)})\, \theta. \qquad (37)$$
Direct substitution of $z^*(x^{(i)}, \theta)$ into $L(\theta)$ gives the desired quadratic function:
$$L(\theta) = -\frac{1}{m\delta}\, \theta^T \sum_{i=1}^m \Big( \mathrm{diag}(x^{(i)})\, V^{-1} y^{(i)} \Big) + \frac{1}{2m\delta}\, \theta^T \sum_{i=1}^m \Big( \mathrm{diag}(x^{(i)})\, V^{-1} \mathrm{diag}(x^{(i)}) \Big)\, \theta. \qquad (38)$$
Differentiating with respect to $\theta$ gives the following gradient:
$$\frac{\partial L}{\partial \theta} = -\frac{1}{m\delta} \sum_{i=1}^m \Big( \mathrm{diag}(x^{(i)})\, V^{-1} y^{(i)} \Big)^T + \frac{1}{m\delta}\, \theta^T \sum_{i=1}^m \Big( \mathrm{diag}(x^{(i)})\, V^{-1} \mathrm{diag}(x^{(i)}) \Big). \qquad (39)$$
Differentiating again with respect to $\theta$ gives the relevant Hessian:
$$\frac{\partial^2 L}{\partial \theta^2} = \frac{1}{m\delta} \sum_{i=1}^m \mathrm{diag}(x^{(i)})\, V^{-1} \mathrm{diag}(x^{(i)}). \qquad (40)$$
Applying Proposition 5, it follows that if there exists an $x^{(i)}$ such that $x^{(i)}_j \neq 0\ \forall j \in 1, \ldots, d_x$, then $\mathrm{diag}(x^{(i)})\, V^{-1} \mathrm{diag}(x^{(i)}) \succ 0$ and the Hessian is positive definite. Setting the gradient to zero therefore yields the unique global minimizer:
$$\theta^* = \Big( \sum_{i=1}^m \mathrm{diag}(x^{(i)})\, V^{-1} \mathrm{diag}(x^{(i)}) \Big)^{-1} \Big( \sum_{i=1}^m \mathrm{diag}(x^{(i)})\, V^{-1} y^{(i)} \Big). \qquad (41)$$
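As a sanity check, the closed-form estimator (41) is straightforward to evaluate numerically. Below is a minimal numpy sketch with synthetic data; the constant factors $1/(m\delta)$ cancel inside the linear solve and are therefore omitted.

```python
import numpy as np

rng = np.random.default_rng(0)
m, d = 500, 10                        # d = d_x = d_y = d_z (one feature per asset)
X = rng.normal(size=(m, d))           # features x^{(i)}
Y = rng.normal(size=(m, d))           # realized returns y^{(i)}
Lc = rng.normal(size=(d, d))
V = Lc @ Lc.T + d * np.eye(d)         # symmetric positive definite covariance
V_inv = np.linalg.inv(V)

H = np.zeros((d, d))                  # Hessian term of (40), up to 1/(m*delta)
g = np.zeros(d)                       # gradient term, up to 1/(m*delta)
for x_i, y_i in zip(X, Y):
    D = np.diag(x_i)
    H += D @ V_inv @ D
    g += D @ V_inv @ y_i

theta_star = np.linalg.solve(H, g)    # equation (41)
```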
B Proof of Proposition 2

Recall that the nominal problem is a convex quadratic program with linear equality constraints: $Z = \{ z : Az = b \}$. Let the columns of $F$ form a basis for the nullspace of $A$, defined as
$$\mathrm{Null}(A) = \{ z \in \mathbb{R}^{d_z} \mid A z = 0 \}.$$
Let $\tilde{z}$ be a particular element of $Z$. It follows that for every $w \in \mathbb{R}^{d_z - p}$, where $p$ denotes the number of linearly independent equality constraints, $z = F w + \tilde{z}$ is also an element of $Z$. We follow Boyd and Vandenberghe [7] and recast the nominal program as an unconstrained convex quadratic program:
$$\min_{w}\ c( F w + \tilde{z},\, \hat{y}^{(i)} ), \qquad (42)$$
with unique global minimum:
$$w^*(\hat{y}^{(i)}) = \frac{1}{\delta} ( F^T V F )^{-1} F^T ( \hat{y}^{(i)} - \delta V \tilde{z} ). \qquad (43)$$
For the case where $\hat{y}^{(i)} = \mathrm{diag}(x^{(i)})\, \theta$, we have the following unique solution to the original nominal program:
$$z^*(x^{(i)}, \theta) = F w^*(\hat{y}^{(i)}) + \tilde{z} = \frac{1}{\delta} F ( F^T V F )^{-1} F^T \big( \mathrm{diag}(x^{(i)})\, \theta - \delta V \tilde{z} \big) + \tilde{z},$$
whose derivative with respect to $\theta$ is given by:
$$\frac{\partial z^*}{\partial \theta} = \frac{1}{\delta} F ( F^T V F )^{-1} F^T \mathrm{diag}(x^{(i)}).$$
Recall from equation (14) that $L(\theta)$ has the following form:
$$L(\theta) = \frac{1}{m} \sum_{i=1}^m \Big( -z^*(x^{(i)}, \theta)^T y^{(i)} + \frac{\delta}{2}\, z^*(x^{(i)}, \theta)^T V z^*(x^{(i)}, \theta) \Big). \qquad (44)$$
Differentiating with respect to $\theta$ and applying the multivariate chain rule gives:
$$\frac{\partial L}{\partial \theta} = \frac{\partial L}{\partial z^*} \frac{\partial z^*}{\partial \theta} = \frac{1}{m} \sum_{i=1}^m \big( -y^{(i)} + \delta V z^*(x^{(i)}, \theta) \big)^T \Big( \frac{1}{\delta} F ( F^T V F )^{-1} F^T \mathrm{diag}(x^{(i)}) \Big). \qquad (45)$$
Differentiating again gives the following Hessian:
$$\frac{\partial^2 L}{\partial \theta^2} = \Big( \frac{\partial z^*}{\partial \theta} \Big)^T \frac{\partial^2 L}{\partial z^{*2}} \frac{\partial z^*}{\partial \theta} = \frac{1}{m\delta} \sum_{i=1}^m \mathrm{diag}(x^{(i)})\, F ( F^T V F )^{-1} F^T \mathrm{diag}(x^{(i)}). \qquad (46)$$
Applying Proposition 5, it follows that if there exists an $x^{(i)}$ such that $x^{(i)}_j \neq 0\ \forall j \in 1, \ldots, d_x$, then $\frac{\partial^2 L}{\partial \theta^2} \succ 0$. Setting $\frac{\partial L}{\partial \theta} = 0$ gives the following criterion for optimality:
$$\frac{1}{m} \sum_{i=1}^m z^*(x^{(i)}, \theta)^T V F ( F^T V F )^{-1} F^T \mathrm{diag}(x^{(i)}) = \frac{1}{m\delta} \sum_{i=1}^m y^{(i)T} F ( F^T V F )^{-1} F^T \mathrm{diag}(x^{(i)}).$$
Expanding the left-hand side and simplifying, using the identity $F ( F^T V F )^{-1} F^T V F ( F^T V F )^{-1} F^T = F ( F^T V F )^{-1} F^T$, gives:
$$\frac{1}{m\delta} \sum_{i=1}^m \Big( F ( F^T V F )^{-1} F^T \big( \mathrm{diag}(x^{(i)})\, \theta - \delta V \tilde{z} \big) + \delta \tilde{z} \Big)^T V F ( F^T V F )^{-1} F^T \mathrm{diag}(x^{(i)}) = \frac{1}{m\delta}\, \theta^T \sum_{i=1}^m \mathrm{diag}(x^{(i)})\, F ( F^T V F )^{-1} F^T \mathrm{diag}(x^{(i)}). \qquad (47)$$
It follows then that the unique solution $\theta^*$ is given by:
$$\theta^* = \Big( \sum_{i=1}^m \mathrm{diag}(x^{(i)})\, F ( F^T V F )^{-1} F^T \mathrm{diag}(x^{(i)}) \Big)^{-1} \Big( \sum_{i=1}^m \mathrm{diag}(x^{(i)})\, F ( F^T V F )^{-1} F^T y^{(i)} \Big),$$
thus completing the proof.
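The equality-constrained estimator admits an analogous numerical sketch. The example below assumes a single full-investment constraint $Az = 1$ as an illustrative choice of $Z$, with a nullspace basis $F$ obtained via scipy; all data are synthetic placeholders.

```python
import numpy as np
from scipy.linalg import null_space

rng = np.random.default_rng(1)
m, d = 500, 10
X = rng.normal(size=(m, d))           # features x^{(i)}
Y = rng.normal(size=(m, d))           # realized returns y^{(i)}
Lc = rng.normal(size=(d, d))
V = Lc @ Lc.T + d * np.eye(d)

A = np.ones((1, d))                   # single equality constraint: sum(z) = 1
F = null_space(A)                     # columns form a basis for Null(A)
M = F @ np.linalg.inv(F.T @ V @ F) @ F.T   # F (F'VF)^{-1} F'

H = np.zeros((d, d))
g = np.zeros(d)
for x_i, y_i in zip(X, Y):
    D = np.diag(x_i)
    H += D @ M @ D
    g += D @ M @ y_i

theta_star = np.linalg.solve(H, g)    # closed form from Appendix B
```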
C Proof of Proposition 3

We use the following identities for matrix differentiation of $L$ with respect to $\theta$.

Definition 1. Let $a \in \mathbb{R}^m$, $b \in \mathbb{R}^n$ and $X \in \mathbb{R}^{m \times n}$. We have the following identity:
$$\frac{\partial\, a^T X b}{\partial X} = a b^T.$$
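As a quick numerical sanity check of Definition 1 (not part of the proof), the identity can be compared against a central finite-difference gradient:

```python
import numpy as np

rng = np.random.default_rng(2)
m, n, eps = 4, 3, 1e-6
a, b = rng.normal(size=m), rng.normal(size=n)
X = rng.normal(size=(m, n))

grad_fd = np.zeros((m, n))            # finite-difference gradient of a^T X b
for i in range(m):
    for j in range(n):
        E = np.zeros((m, n))
        E[i, j] = eps
        grad_fd[i, j] = (a @ (X + E) @ b - a @ (X - E) @ b) / (2 * eps)

assert np.allclose(grad_fd, np.outer(a, b), atol=1e-6)   # matches a b^T
```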
Definition 2. Let $a, b \in \mathbb{R}^n$, $C \in \mathbb{R}^{m \times m}$ and $X \in \mathbb{R}^{m \times n}$. We have the following identity:
$$\frac{\partial\, a^T X^T C X b}{\partial X} = C X b\, a^T + C^T X a\, b^T.$$
Once again, the nominal problem is a convex quadratic program and therefore has the unique analytical solution presented in equation (48):
$$z^*(x^{(i)}, \theta) = \frac{1}{\delta} V^{-1} f(x^{(i)}, \theta) = \frac{1}{\delta} V^{-1} \theta^T x^{(i)}. \qquad (48)$$
Direct substitution of $z^*(x^{(i)}, \theta)$ into $L(\theta)$ gives:
$$L(\theta) = -\frac{1}{m\delta} \sum_{i=1}^m \Big( x^{(i)T} \theta\, V^{-1} y^{(i)} \Big) + \frac{1}{2m\delta} \sum_{i=1}^m \Big( x^{(i)T} \theta\, V^{-1} \theta^T x^{(i)} \Big). \qquad (49)$$
Differentiating $L$ with respect to $\theta$ and applying the identities above gives:
$$\frac{\partial L}{\partial \theta} = -\frac{1}{m\delta} \sum_{i=1}^m \Big( x^{(i)} y^{(i)T} V^{-1} \Big) + \frac{1}{m\delta} \sum_{i=1}^m \Big( x^{(i)} x^{(i)T} \theta\, V^{-1} \Big). \qquad (50)$$
Finally, solving the first-order condition $\frac{\partial L}{\partial \theta} = 0$ yields the least-squares solution:
$$\theta^* = \Big( \sum_{i=1}^m x^{(i)} x^{(i)T} \Big)^{-1} \Big( \sum_{i=1}^m x^{(i)} y^{(i)T} \Big). \qquad (51)$$
Uniqueness of course follows directly from the invertibility of $X X^T = \sum_{i=1}^m x^{(i)} x^{(i)T}$.

D Proof of Proposition 4