Machine Learning Optimization Algorithms & Portfolio Allocation
Machine Learning Optimization Algorithms & Portfolio Allocation∗

Sarah Perrin
Artificial Intelligence & Advanced Visual Computing Master
Ecole Polytechnique
[email protected]

Thierry Roncalli
Quantitative Research
Amundi Asset Management, Paris
[email protected]
June 2019
Abstract
Portfolio optimization emerged with the seminal paper of Markowitz (1952). The original mean-variance framework is appealing because it is very efficient from a computational point of view. However, it also has one well-established failing since it can lead to portfolios that are not optimal from a financial point of view (Michaud, 1989). Nevertheless, very few models have succeeded in providing a real alternative solution to the Markowitz model. The main reason lies in the fact that most academic portfolio optimization models are intractable in real life although they present solid theoretical properties. By intractable we mean that they can be implemented for an investment universe with a small number of assets using a lot of computational resources and skills, but they are unable to manage a universe with dozens or hundreds of assets. However, the emergence and the rapid development of robo-advisors means that we need to rethink portfolio optimization and go beyond the traditional mean-variance optimization approach. Another industry and branch of science has faced similar issues concerning large-scale optimization problems. Machine learning and applied statistics have long been associated with linear and logistic regression models. Again, the reason was the inability of optimization algorithms to solve high-dimensional industrial problems. Nevertheless, the end of the 1990s marked an important turning point with the development and the rediscovery of several methods that have since produced impressive results. The goal of this paper is to show how portfolio allocation can benefit from the development of these large-scale optimization algorithms. Not all of these algorithms are useful in our case, but four of them are essential when solving complex portfolio optimization problems. These four algorithms are the coordinate descent, the alternating direction method of multipliers, the proximal gradient method and Dykstra's algorithm. This paper reviews them and shows how they can be implemented in portfolio allocation.
Keywords:
Portfolio allocation, mean-variance optimization, risk budgeting optimization, quadratic programming, coordinate descent, alternating direction method of multipliers, proximal gradient method, Dykstra's algorithm.
JEL classification:
C61, G11.

∗ This survey article has been prepared for the book Machine Learning and Asset Management edited by Emmanuel Jurczenko. We would like to thank Mohammed El Mendili, Edmond Lezmi, Lina Mezghani, Jean-Charles Richard, Jules Roche and Jiali Xu for their helpful comments.

The contribution of Harry Markowitz to economics is considerable. The mean-variance optimization framework marks the beginning of portfolio allocation in finance. In addition to the seminal paper of 1952, Harry Markowitz proposed an algorithm for solving quadratic programming problems in 1956. At that time, very few people were aware of this optimization framework. We can cite Mann (1943) and Martin (1955), but it is widely accepted that Harry Markowitz is the "father of quadratic programming" (Cottle and Infanger, 2010). This is not the first time that economists participated in the development of mathematics, but it is certainly the first time that mathematicians explored a field of research whose main application during the first years of research was exclusively an economic problem.

The success of mean-variance optimization (MVO) is due to the appealing properties of the quadratic utility function, but it should also be assessed in light of the success of quadratic programming (QP). Because it is easy to solve QP problems and because QP solvers are available in mathematical software, solving MVO problems is straightforward and does not require specific skills. This is why mean-variance optimization is a universal method used by all portfolio managers. However, this approach has been widely criticized by academics and professionals. Indeed, mean-variance optimization is very sensitive to input parameters and produces corner solutions. Moreover, the concept of mean-variance diversification is confused with the concept of hedging (Bourgeron et al., 2018).
These different issues make the practice of mean-variance optimization less attractive than the theory (Michaud, 1989). In fact, solving MVO allocation problems requires the right weight constraints to be specified in order to obtain acceptable solutions. It follows that designing the constraints is the most important component of mean-variance optimization. In this case, MVO appears to be a trial-and-error process, not a systematic solution.

The success of the MVO framework is also explained by the fact that there are very few competing portfolio allocation models that can be implemented from an industrial point of view. There are generally two reasons for this. The first one is that some models use input parameters that are difficult to estimate or understand, making these models definitively unusable. The second reason is that other models use a more complex objective function than the simple quadratic utility function. In this case, the computational complexity makes these models less attractive than the standard MVO model. Among these models, some are based on the mean-variance objective function, but introduce regularization penalty functions in order to improve the robustness of the portfolio allocation. Again, these models have little chance of being used if they cannot be cast into a QP problem. However, new optimization algorithms have emerged for solving large-scale machine learning problems. The purpose of this article is to present these new mathematical methods and show that they can be easily applied to portfolio allocation in order to go beyond the MVO/QP model.

This survey article is based on several previous research papers ([8], [37], [61] and [62]) and extensively uses four leading references (Beck, 2017; Boyd et al., 2010; Combettes and Pesquet, 2011; Tibshirani, 2017). It is organized as follows. In section two, we present the mean-variance approach and how it is related to the QP framework.
The third section is dedicated to large-scale optimization algorithms that have been used in machine learning: coordinate descent, alternating direction method of multipliers, proximal gradient and Dykstra's algorithm. Section four shows how these algorithms can be implemented in order to solve portfolio optimization problems and build a more robust asset allocation. Finally, section five offers some concluding remarks.

For example, Leonid Kantorovich made major contributions to the success of linear programming. If we consider the first publications on quadratic programming, most of them were published in Econometrica or illustrated the Markowitz problem (see [1], [2], [4], [24], [26], [30], [40], [72] and [73]).
A quadratic programming (QP) problem is an optimization problem with a quadratic objective function and linear inequality constraints:
$$x^\star = \arg\min_x \frac{1}{2} x^\top Q x - x^\top R \quad \text{s.t. } Sx \le T \quad (1)$$
where $x$ is a $n \times 1$ vector, $Q$ is a $n \times n$ matrix and $R$ is a $n \times 1$ vector. The system $Sx \le T$ allows us to specify linear equality constraints $Ax = B$ or box constraints $x^- \le x \le x^+$. Most numerical packages then consider the following formulation:
$$x^\star = \arg\min_x \frac{1}{2} x^\top Q x - x^\top R \quad \text{s.t. } Ax = B, \; Cx \le D, \; x^- \le x \le x^+ \quad (2)$$
because Problem (2) is equivalent to the canonical problem (1) with the following system of linear inequalities:
$$\begin{pmatrix} -A \\ A \\ C \\ -I_n \\ I_n \end{pmatrix} x \le \begin{pmatrix} -B \\ B \\ D \\ -x^- \\ x^+ \end{pmatrix}$$
This is equivalent to imposing that $Ax \ge B$ and $Ax \le B$. If the space $\Omega$ defined by $Sx \le T$ is non-empty and if $Q$ is a symmetric positive definite matrix, the solution exists because the function $f(x) = \frac{1}{2}x^\top Qx - x^\top R$ is convex. In the general case where $Q$ is a square matrix, the solution may not exist.

The Lagrange function is equal to:
$$\mathcal{L}(x;\lambda) = \frac{1}{2}x^\top Qx - x^\top R + \lambda^\top (Sx - T)$$
We deduce that the dual problem is defined by:
$$\lambda^\star = \arg\max_\lambda \left\{ \inf_x \mathcal{L}(x;\lambda) \right\} \quad \text{s.t. } \lambda \ge 0$$
We have $\partial_x \mathcal{L}(x;\lambda) = Qx - R + S^\top\lambda$. The solution to the equation $\partial_x \mathcal{L}(x;\lambda) = 0$ is then $x = Q^{-1}\left(R - S^\top\lambda\right)$. We finally obtain:
$$\inf_x \mathcal{L}(x;\lambda) = \frac{1}{2}\left(R^\top - \lambda^\top S\right)Q^{-1}\left(R - S^\top\lambda\right) - \left(R^\top - \lambda^\top S\right)Q^{-1}R + \lambda^\top\left(SQ^{-1}\left(R - S^\top\lambda\right) - T\right)$$
$$= \frac{1}{2}R^\top Q^{-1}R - \lambda^\top SQ^{-1}R + \frac{1}{2}\lambda^\top SQ^{-1}S^\top\lambda - R^\top Q^{-1}R + 2\lambda^\top SQ^{-1}R - \lambda^\top SQ^{-1}S^\top\lambda - \lambda^\top T$$
$$= -\frac{1}{2}\lambda^\top SQ^{-1}S^\top\lambda + \lambda^\top\left(SQ^{-1}R - T\right) - \frac{1}{2}R^\top Q^{-1}R$$
We deduce that the dual program is another quadratic programming problem:
$$\lambda^\star = \arg\min_\lambda \frac{1}{2}\lambda^\top \bar{Q}\lambda - \lambda^\top \bar{R} \quad \text{s.t. } \lambda \ge 0 \quad (3)$$
where $\bar{Q} = SQ^{-1}S^\top$ and $\bar{R} = SQ^{-1}R - T$.
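The stacking of constraints above is mechanical and easy to check numerically. Below is a minimal sketch (the function name `to_canonical` is ours, not from the paper) that converts the constraints of formulation (2) into the canonical system $Sx \le T$ of problem (1):

```python
import numpy as np

def to_canonical(A, B, C, D, x_minus, x_plus):
    """Stack Ax = B, Cx <= D and x- <= x <= x+ into a single
    system Sx <= T, as in the equivalence between (2) and (1)."""
    n = len(x_minus)
    I = np.eye(n)
    S = np.vstack([-A, A, C, -I, I])
    T = np.concatenate([-B, B, D, -x_minus, x_plus])
    return S, T
```

Any $x$ that is feasible for (2) then satisfies $Sx \le T$, and conversely, which is how a solver expecting the canonical form can handle the richer formulation.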
Remark 1
This duality property is very important for some machine learning methods. For example, this is the case of support vector machines and kernel methods, which extensively use duality for defining the solution (Cortes and Vapnik, 1995).
There is a substantial literature on methods for solving quadratic programming problems (Gould and Toint, 2000). The research began in the 1950s with several key contributions: Frank and Wolfe (1956), Markowitz (1956), Beale (1959) and Wolfe (1959). Nowadays, QP problems are generally solved using three approaches: active set methods, gradient projection methods and interior point methods. All these algorithms are implemented in standard mathematical programming languages (Matlab, Mathematica, Python, Gauss, R, etc.). This explains the success of QP problems since the 2000s, because they can be easily and rapidly solved.
The concept of portfolio allocation has a long history and dates back to the seminal work of Markowitz (1952). In his paper, Markowitz defined precisely what portfolio selection means: "the investor does (or should) consider expected return a desirable thing and variance of return an undesirable thing". Indeed, Markowitz showed that an efficient portfolio is the portfolio that maximizes the expected return for a given level of risk (corresponding to the variance of portfolio return) or the portfolio that minimizes the risk for a given level of expected return. Even if this framework has been extended to many other allocation problems (index sampling, turnover management, etc.), the mean-variance model remains the optimization approach that is the most widely used in finance.
We consider a universe of $n$ assets. Let $x = (x_1, \ldots, x_n)$ be the vector of weights in the portfolio. We assume that the portfolio is fully invested, meaning that $\sum_{i=1}^n x_i = \mathbf{1}_n^\top x = 1$. We denote $R = (R_1, \ldots, R_n)$ as the vector of asset returns, where $R_i$ is the return of Asset $i$. The return of the portfolio is then equal to $R(x) = \sum_{i=1}^n x_i R_i = x^\top R$. Let $\mu = \mathbb{E}[R]$ and $\Sigma = \mathbb{E}\left[(R-\mu)(R-\mu)^\top\right]$ be the vector of expected returns and the covariance matrix of asset returns. The expected return of the portfolio is equal to:
$$\mu(x) = \mathbb{E}[R(x)] = x^\top \mu$$
whereas its variance is equal to:
$$\sigma^2(x) = \mathbb{E}\left[(R(x)-\mu(x))(R(x)-\mu(x))^\top\right] = x^\top \Sigma x$$
Markowitz (1952) formulated the investor's financial problem as follows:
1. Maximizing the expected return of the portfolio under a volatility constraint ($\sigma$-problem):
$$\max \mu(x) \quad \text{s.t. } \sigma(x) \le \sigma^\star \quad (4)$$
2. Or minimizing the volatility of the portfolio under a return constraint ($\mu$-problem):
$$\min \sigma(x) \quad \text{s.t. } \mu(x) \ge \mu^\star \quad (5)$$
Markowitz's bright idea was to consider a quadratic utility function:
$$\mathcal{U}(x) = x^\top \mu - \frac{\phi}{2} x^\top \Sigma x$$
where $\phi \ge 0$. Since maximizing $\mathcal{U}(x)$ is equivalent to minimizing $-\mathcal{U}(x)$, the Markowitz problems (4) and (5) can be cast into a QP problem:
$$x^\star(\gamma) = \arg\min_x \frac{1}{2} x^\top \Sigma x - \gamma x^\top \mu \quad \text{s.t. } \mathbf{1}_n^\top x = 1 \quad (6)$$
where $\gamma = \phi^{-1}$. Therefore, solving the $\mu$-problem or the $\sigma$-problem is equivalent to finding the optimal value of $\gamma$ such that $\mu(x^\star(\gamma)) = \mu^\star$ or $\sigma(x^\star(\gamma)) = \sigma^\star$. We know that the functions $\mu(x^\star(\gamma))$ and $\sigma(x^\star(\gamma))$ are increasing with respect to $\gamma$ and are bounded. The optimal value of $\gamma$ can then be easily computed using the bisection algorithm. It is obvious that a large part of the success of the Markowitz framework lies in the QP trick. Indeed, Problem (6) corresponds to the QP problem (2) where $Q = \Sigma$, $R = \gamma\mu$, $A = \mathbf{1}_n^\top$ and $B = 1$. Moreover, it is easy to include bounds on the weights, inequalities between asset classes, etc.
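The bisection procedure can be sketched in a few lines. For simplicity we assume only the budget constraint $\mathbf{1}_n^\top x = 1$, in which case problem (6) admits a closed-form solution from the Lagrangian first-order conditions, $x^\star(\gamma) = \Sigma^{-1}(\gamma\mu + \lambda^\star \mathbf{1}_n)$ with $\lambda^\star$ pinned down by the constraint. The function and parameter names below are ours:

```python
import numpy as np

def mvo_weights(gamma, mu, Sigma):
    """Closed-form solution of min 0.5 x'Sx - gamma x'mu s.t. 1'x = 1
    (no box constraints): x = Sigma^{-1} (gamma * mu + lam * 1)."""
    ones = np.ones(len(mu))
    inv_mu = np.linalg.solve(Sigma, mu)
    inv_one = np.linalg.solve(Sigma, ones)
    lam = (1.0 - gamma * ones @ inv_mu) / (ones @ inv_one)
    return gamma * inv_mu + lam * inv_one

def solve_mu_problem(mu_star, mu, Sigma, gamma_max=100.0, iters=100):
    """Bisection on gamma, exploiting that mu(x*(gamma)) is increasing."""
    lo, hi = 0.0, gamma_max
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if mvo_weights(mid, mu, Sigma) @ mu < mu_star:
            lo = mid
        else:
            hi = mid
    return mvo_weights(0.5 * (lo + hi), mu, Sigma)
```

With bounds or inequality constraints, the inner step would call a QP solver instead of the closed form, but the outer bisection logic is unchanged.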
The previous framework can be extended to other portfolio allocation problems. However, from a numerical point of view, the underlying idea is always to find an equivalent QP formulation (Roncalli, 2013).
Portfolio optimization with a benchmark
We now consider a benchmark $b$. We denote $\mu(x \mid b) = (x-b)^\top \mu$ as the expected excess return and $\sigma(x \mid b) = \sqrt{(x-b)^\top \Sigma (x-b)}$ as the tracking error volatility of Portfolio $x$ with respect to Benchmark $b$. The objective function is then:
$$f(x \mid b) = \frac{1}{2}\sigma^2(x \mid b) - \gamma \mu(x \mid b)$$
We can show that the equivalent QP problem is:
$$x^\star(\gamma) = \arg\min_x \frac{1}{2} x^\top \Sigma x - \gamma x^\top \tilde{\mu}$$
where $\tilde{\mu} = \mu + \gamma^{-1}\Sigma b$ is the regularized vector of expected returns. Therefore, portfolio allocation with a benchmark can be viewed as a regularization of the MVO problem and is solved using a QP numerical algorithm. This transformation is called the QP trick.
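This equivalence is easy to verify numerically: without the budget constraint, the first-order condition of the tracking-error problem gives $x = b + \gamma\Sigma^{-1}\mu$, which is exactly the unconstrained minimizer of the regularized problem with $\tilde{\mu}$. A minimal sketch with hypothetical inputs:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4
A = rng.normal(size=(n, n))
Sigma = A @ A.T + n * np.eye(n)      # a positive definite covariance matrix
mu = rng.normal(0.05, 0.02, size=n)  # expected returns
b = np.full(n, 1.0 / n)              # benchmark weights
gamma = 0.5

# Minimizer of 0.5 (x-b)' Sigma (x-b) - gamma (x-b)' mu
x_te = b + gamma * np.linalg.solve(Sigma, mu)

# Minimizer of 0.5 x' Sigma x - gamma x' mu_tilde, mu_tilde = mu + Sigma b / gamma
mu_tilde = mu + (Sigma @ b) / gamma
x_reg = gamma * np.linalg.solve(Sigma, mu_tilde)
```

Both vectors coincide, which illustrates why benchmark-relative allocation reduces to standard MVO with a shifted vector of expected returns.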
Index sampling
The goal of index sampling is to replicate an index portfolio with a smaller number of assets than the index (or the benchmark) $b$. From a mathematical point of view, index sampling can be written as follows:
$$x^\star = \arg\min_x \frac{1}{2}(x-b)^\top \Sigma (x-b) \quad \text{s.t. } \mathbf{1}_n^\top x = 1, \; x \ge \mathbf{0}_n, \; \sum_{i=1}^n \mathbb{1}\left\{x_i > 0\right\} \le n_x \quad (7)$$
The idea is to minimize the volatility of the tracking error such that the number of stocks $n_x$ in the portfolio is smaller than the number of stocks $n_b$ in the benchmark. For example, one would like to replicate the S&P 500 index with only 50 stocks and not the entire 500 stocks that compose this index. Professionals generally solve Problem (7) with the following heuristic algorithm:
1. We set $x^{+(0)} = \mathbf{1}_n$. At iteration $k+1$, we solve the QP problem:
$$x^\star = \arg\min_x \frac{1}{2}(x-b)^\top \Sigma (x-b) \quad \text{s.t. } \mathbf{1}_n^\top x = 1 \text{ and } \mathbf{0}_n \le x \le x^{+(k)}$$
2. We then update the upper bound $x^{+(k)}$ of the QP problem by deleting the asset $i^\star$ with the lowest non-zero optimized weight: $x_i^{+(k+1)} \leftarrow x_i^{+(k)}$ if $i \ne i^\star$ and $x_{i^\star}^{+(k+1)} \leftarrow 0$.
3. We iterate the two steps until $\sum_{i=1}^n \mathbb{1}\left\{x_i^\star > 0\right\} = n_x$.
The purpose of the heuristic algorithm is to delete one asset at each iteration in order to obtain an invested portfolio, which is exactly composed of $n_x$ assets and has a low tracking error volatility. Again, we notice that solving the index sampling problem is equivalent to solving $(n_b - n_x)$ QP problems.

See Appendix A.1 on page 51. We have $i^\star = \left\{i : \arg\inf x_i^\star \mid x_i^\star > 0\right\}$.
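The heuristic above can be sketched as follows, using SciPy's SLSQP solver as a stand-in for a dedicated QP solver (the function name and tolerance are ours):

```python
import numpy as np
from scipy.optimize import minimize

def sample_index(b, Sigma, n_x, tol=1e-8):
    """Heuristic index sampling: solve a sequence of QP problems,
    deleting the smallest non-zero weight at each iteration until
    at most n_x assets remain in the portfolio."""
    n = len(b)
    upper = np.ones(n)                      # upper bounds x^{+(k)}
    while True:
        res = minimize(
            lambda x: 0.5 * (x - b) @ Sigma @ (x - b),
            x0=upper / upper.sum(),
            jac=lambda x: Sigma @ (x - b),
            bounds=[(0.0, u) for u in upper],
            constraints=[{"type": "eq", "fun": lambda x: x.sum() - 1.0}],
            method="SLSQP",
        )
        x = res.x
        active = np.where(x > tol)[0]
        if len(active) <= n_x:
            return x
        i_star = active[np.argmin(x[active])]  # lowest non-zero weight
        upper[i_star] = 0.0                    # delete asset i*
```

Each pass through the loop solves one QP with tighter upper bounds, so replicating a benchmark of $n_b$ assets with $n_x$ assets costs $(n_b - n_x)$ solver calls, as noted above.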
Turnover management
If we denote $\bar{x}$ as the current portfolio and $x$ as the new portfolio, the turnover of Portfolio $x$ with respect to Portfolio $\bar{x}$ is the sum of purchases and sales:
$$\tau(x \mid \bar{x}) = \sum_{i=1}^n (x_i - \bar{x}_i)^+ + \sum_{i=1}^n (\bar{x}_i - x_i)^+ = \sum_{i=1}^n |x_i - \bar{x}_i|$$
Adding a turnover constraint to long-only MVO portfolios leads to the following problem:
$$x^\star = \arg\min_x \frac{1}{2} x^\top \Sigma x - \gamma x^\top \mu \quad \text{s.t. } \sum_{i=1}^n x_i = 1, \; \sum_{i=1}^n |x_i - \bar{x}_i| \le \tau^+, \; 0 \le x_i \le 1$$
where $\tau^+$ is the maximum turnover with respect to the current portfolio $\bar{x}$. Scherer (2007) introduces the additional variables $x_i^-$ and $x_i^+$ such that:
$$x_i = \bar{x}_i + x_i^+ - x_i^-$$
with $x_i^- \ge 0$ and $x_i^+ \ge 0$. We have:
$$\sum_{i=1}^n |x_i - \bar{x}_i| = \sum_{i=1}^n \left|x_i^+ - x_i^-\right| = \sum_{i=1}^n x_i^+ + \sum_{i=1}^n x_i^-$$
because one of the variables $x_i^+$ or $x_i^-$ is necessarily equal to zero due to the minimization problem. The $\gamma$-problem of Markowitz becomes:
$$x^\star = \arg\min_x \frac{1}{2} x^\top \Sigma x - \gamma x^\top \mu \quad \text{s.t. } \sum_{i=1}^n x_i = 1, \; x_i = \bar{x}_i + x_i^+ - x_i^-, \; \sum_{i=1}^n x_i^+ + \sum_{i=1}^n x_i^- \le \tau^+, \; 0 \le x_i, x_i^-, x_i^+ \le 1$$
This is an augmented QP problem in the $3n$ variables $(x, x^-, x^+)$ (see Appendix A.2 on page 51).
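The decomposition into purchases $x^+$ and sales $x^-$ can be illustrated directly; at the optimum one of the two variables is zero for each asset, so the split below uses positive parts (the portfolio numbers are hypothetical):

```python
import numpy as np

x_bar = np.array([0.40, 0.35, 0.25])   # current portfolio
x_new = np.array([0.50, 0.20, 0.30])   # rebalanced portfolio

x_plus = np.maximum(x_new - x_bar, 0.0)   # purchases x_i^+
x_minus = np.maximum(x_bar - x_new, 0.0)  # sales x_i^-

# Turnover as the sum of absolute weight changes
turnover = np.abs(x_new - x_bar).sum()    # = 0.30 here
```

The identities $x = \bar{x} + x^+ - x^-$ and $\tau(x \mid \bar{x}) = \sum_i x_i^+ + \sum_i x_i^-$ hold by construction, which is exactly what the augmented QP formulation exploits.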
Transaction costs
The previous analysis assumes that there is no transaction cost $c(x \mid \bar{x})$ when we rebalance the portfolio from the current portfolio $\bar{x}$ to the new optimized portfolio $x$. If we denote $c_i^-$ and $c_i^+$ as the bid and ask transaction costs, we have:
$$c(x \mid \bar{x}) = \sum_{i=1}^n x_i^- c_i^- + \sum_{i=1}^n x_i^+ c_i^+$$
The net expected return of Portfolio $x$ is then equal to $\mu(x) - c(x \mid \bar{x})$. It follows that the $\gamma$-problem of Markowitz becomes:
$$x^\star = \arg\min_x \frac{1}{2} x^\top \Sigma x - \gamma \left( \sum_{i=1}^n x_i \mu_i - \sum_{i=1}^n x_i^- c_i^- - \sum_{i=1}^n x_i^+ c_i^+ \right)$$
$$\text{s.t. } \sum_{i=1}^n x_i + \sum_{i=1}^n x_i^- c_i^- + \sum_{i=1}^n x_i^+ c_i^+ = 1, \; x_i = \bar{x}_i + x_i^+ - x_i^-, \; 0 \le x_i, x_i^-, x_i^+ \le 1$$
The equality constraint $\mathbf{1}_n^\top x = 1$ becomes $\mathbf{1}_n^\top x + c(x \mid \bar{x}) = 1$ because the rebalancing process has to be financed.

The concurrent model of the Markowitz framework is the risk budgeting approach (Qian, 2005; Maillard et al., 2010; Roncalli, 2013). The goal is to define a convex risk measure $\mathcal{R}(x)$ and to allocate the risk according to some specified risk budgets $RB = (RB_1, \ldots, RB_n)$ where $RB_i > 0$. This approach exploits the Euler decomposition property of the risk measure:
$$\mathcal{R}(x) = \sum_{i=1}^n x_i \frac{\partial \mathcal{R}(x)}{\partial x_i}$$
By denoting $RC_i(x) = x_i \cdot \partial_{x_i} \mathcal{R}(x)$ as the risk contribution of Asset $i$ with respect to portfolio $x$, the risk budgeting (RB) portfolio is defined by the following set of equations:
$$x^\star = \left\{x \in [0,1]^n : RC_i(x) = RB_i\right\}$$
Roncalli (2013) showed that this is equivalent to solving the following non-linear optimization problem:
$$x^\star = \arg\min_x \mathcal{R}(x) - \lambda \sum_{i=1}^n RB_i \cdot \ln x_i \quad \text{s.t. } x_i > 0 \quad (8)$$
where $\lambda > 0$ is an arbitrary positive scalar. Two risk measures are generally used in practice: the portfolio volatility (Maillard et al., 2010):
$$\mathcal{R}(x) = \sqrt{x^\top \Sigma x}$$
and the standard deviation-based risk measure (Roncalli, 2015):
$$\mathcal{R}(x) = -x^\top(\mu - r) + \xi\sqrt{x^\top \Sigma x}$$
where $r$ is the risk-free rate and $\xi$ is a positive scalar. In particular, this last one encompasses the Gaussian value-at-risk ($\xi = \Phi^{-1}(\alpha)$) and the Gaussian expected shortfall ($\xi = (1-\alpha)^{-1}\phi\left(\Phi^{-1}(\alpha)\right)$).

The risk budgeting approach has displaced the MVO approach in many fields of asset management, in particular in the case of factor investing and alternative risk premia. Nevertheless, we notice that Problem (8) is not a quadratic programming problem, but a logarithmic barrier problem. Therefore, the risk budgeting framework opens a new world of portfolio optimization that is not necessarily QP! That is all the more true since MVO portfolios face robustness issues (Bourgeron et al., 2018). Regularization of portfolio allocation has then become the industry standard. Indeed, it is frequent to add $\ell_1$-norm or $\ell_2$-norm penalty functions to the MVO objective function. This type of penalty is, however, tractable in a quadratic programming setting. With the development of robo-advisors, non-linear penalty functions have emerged, in particular the logarithmic barrier penalty function.
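For the volatility risk measure $\mathcal{R}(x) = \sqrt{x^\top\Sigma x}$ introduced above, the marginal risks are $\partial_x \mathcal{R}(x) = \Sigma x / \sqrt{x^\top \Sigma x}$, and the Euler decomposition can be checked numerically. A minimal sketch with hypothetical inputs:

```python
import numpy as np

def risk_contributions(x, Sigma):
    """RC_i = x_i * (Sigma x)_i / sqrt(x' Sigma x) for the volatility
    risk measure; by Euler's theorem they sum to the portfolio volatility."""
    sigma_x = np.sqrt(x @ Sigma @ x)
    return x * (Sigma @ x) / sigma_x

x = np.array([0.5, 0.3, 0.2])
Sigma = np.array([[0.0400, 0.0120, 0.0064],
                  [0.0120, 0.0225, 0.0036],
                  [0.0064, 0.0036, 0.0256]])
rc = risk_contributions(x, Sigma)
```

A risk budgeting portfolio would adjust $x$ until each `rc[i]` matches its budget $RB_i$; the equally-weighted-risk (ERC) portfolio is the special case $RB_i = \mathcal{R}(x)/n$.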
These regularization techniques result in a non-quadratic programming world of portfolio optimization. The success of this non-QP financial world will depend on how quickly and easily these complex optimization problems can be solved. Griveau-Billion et al. (2013), Bourgeron et al. (2018), and Richard and Roncalli (2019) have already proposed numerical algorithms that do the work in some special cases. The next section reviews the candidate algorithms that may compete with QP numerical algorithms.

In fact, the solution $x^\star$ of Problem (8) must be rescaled after the optimization step.

The machine learning industry has experienced a similar trajectory to portfolio optimization. Before the 1990s, statistical learning focused mainly on models that were easy to solve from a numerical point of view. For instance, the linear (and the ridge) regression has an analytical solution, we can solve logistic regression with the Newton-Raphson algorithm, whereas supervised and unsupervised classification models consist in performing a singular value decomposition or a generalized eigenvalue decomposition. The 1990s saw the emergence of three models that have deeply changed the machine learning approach: neural networks, support vector machines and lasso regression.

Neural networks have been extensively studied since the seminal work of Rosenblatt (1958). However, the first industrial application dates back to the publication of LeCun et al. (1989) on handwritten zip code recognition. At the beginning of the 1990s, a fresh craze then emerged with the writing of many handbooks that were appropriate for students, the most popular of which was Bishop (1995). With neural networks, two main issues arise concerning calibration: the large number of parameters to estimate and the absence of a global maximum. The traditional numerical optimization algorithms that were popular in the 1980s cannot be applied to neural networks. New optimization approaches were then proposed.
First, researchers have considered more complex learning rules than the steepest descent (Jacobs, 1988), for example the momentum method of Polyak (1964) or the Nesterov accelerated gradient approach (Nesterov, 1983). Second, the descent method is generally not performed on the full sample of observations, but on a subset of observations that changes at each iteration. This is the underlying idea behind batch gradient descent (BGD), stochastic gradient descent (SGD) and mini-batch gradient descent (MGD). We notice that adaptive learning methods and batch optimization techniques have marked the revival of the gradient descent method.

The development of support vector machines is another important step in the development of machine learning techniques. Like neural networks, they can be seen as an extension of the perceptron. However, they present nice theoretical properties and a strong geometrical framework. While SVMs were first developed for linear classification, they have been extended to non-linear classification and regression. A support vector machine consists in separating hyperplanes and finding the optimal separation by maximizing the margin. The original problem, called the hard margin classification, can be formulated as a quadratic programming problem. However, the dual problem, which is also a QP problem, is generally preferred to the primal problem for solving SVM classification, because of the sparse property of the solution (Cortes and Vapnik, 1995). Over the years, the original hard margin classification has been extended to other problems: soft margin classification with binary hinge loss, soft margin classification with squared hinge loss, least squares SVM regression, ε-SVM regression, and kernel machines (Vapnik, 1998). All these statistical problems share the same calibration framework. The primal problem can be cast into a QP problem, implying that the corresponding dual problem is also a QP problem.
Again, we notice that the success and the prominence of statistical methods are related to the efficiency of the optimization algorithms, and it is obvious that support vector machines have substantially benefited from the QP formulation. From an industrial point of view, support vector machines nevertheless present some limitations. Indeed, if the dimension of the primal QP problem is the number $p$ of features (or parameters), the dimension of the dual QP problem is the number $n$ of observations.

For example, supervised and unsupervised classification models include principal component analysis (PCA), linear/quadratic discriminant analysis (LDA/QDA), the Fisher classification method, etc. Among traditional optimization algorithms, we can cite the quasi-Newton BFGS (Broyden-Fletcher-Goldfarb-Shanno) and DFP (Davidon-Fletcher-Powell) methods, and the Fletcher-Reeves and Polak-Ribiere conjugate gradient methods.

In the lasso regression, the $\ell_2$-norm penalty of the ridge regression is replaced by the $\ell_1$-norm penalty (Tibshirani, 1996). Since the $\ell_1$ regularization forces the solution to be sparse, it was first largely used for variable selection, and then for pattern recognition and robust estimation of linear models. For finding the lasso solution, the technique of augmented QP problems is widely used since it is easy to implement. The extension of the lasso-ridge regularization to the other $\ell_p$ norms is straightforward, but these approaches have never been popular. The main reason is that existing numerical algorithms are not sufficient to make these models tractable.

Therefore, the success of a quantitative model may be explained by two conditions. First, the model must obviously be appealing. Second, the model must be solved by an efficient numerical algorithm that is easy to implement or available in mathematical programming software. As shown previously, quadratic programming and gradient descent methods have been key for many statistical and financial models.
In what follows, we consider four algorithms and techniques that have been popularized by their use in machine learning: coordinate descent, alternating direction method of multipliers, proximal operators and Dykstra's algorithm. In particular, we illustrate how they can be used for solving complex optimization problems.

We consider the following unconstrained minimization problem:
$$x^\star = \arg\min_x f(x) \quad (9)$$
where $x \in \mathbb{R}^n$ and $f(x)$ is a continuous, smooth and convex function. A popular method to find the solution $x^\star$ is to consider the descent algorithm, which is defined by the following rule:
$$x^{(k+1)} = x^{(k)} + \Delta x^{(k)} = x^{(k)} - \eta D^{(k)}$$
where $x^{(k)}$ is the approximated solution of Problem (9) at the $k$-th iteration, $\eta > 0$ is the step size and $D^{(k)}$ is the direction. We notice that the current solution $x^{(k)}$ is updated by going in the opposite direction to $D^{(k)}$ in order to obtain $x^{(k+1)}$. In the case of the gradient descent, the direction is equal to the gradient vector of $f(x)$ at the current point: $D^{(k)} = \nabla f\left(x^{(k)}\right)$. Coordinate descent (CD) is a variant of the gradient descent that minimizes the function along one coordinate at each step:
$$x_i^{(k+1)} = x_i^{(k)} + \Delta x_i^{(k)} = x_i^{(k)} - \eta D_i^{(k)}$$
where $D_i^{(k)} = \nabla_i f\left(x^{(k)}\right)$ is the $i$-th element of the gradient vector. At each iteration, a coordinate $i$ is chosen via a certain rule, while the other coordinates are assumed to be fixed. Coordinate descent is an appealing algorithm, because it transforms a vector-valued problem into a scalar-valued problem that is easier to implement. Algorithm (1) summarizes the CD algorithm. The convergence criterion can be a predefined number of iterations or an error rule between two iterations. The step size $\eta$ can be either a given parameter or computed with a line search, implying that the parameter $\eta^{(k)}$ changes at each iteration.

Algorithm 1
Coordinate descent algorithm (gradient formulation)
The goal is to find the solution $x^\star = \arg\min f(x)$
We initialize the vector $x^{(0)}$ and we note $\eta$ the step size
Set $k \leftarrow 0$
repeat
    Choose a coordinate $i \in \{1, \ldots, n\}$
    $x_i^{(k+1)} \leftarrow x_i^{(k)} - \eta \nabla_i f\left(x^{(k)}\right)$
    $x_j^{(k+1)} \leftarrow x_j^{(k)}$ if $j \ne i$
    $k \leftarrow k + 1$
until convergence
return $x^\star \leftarrow x^{(k)}$

Another formulation of the coordinate descent method is given in Algorithm (2). The underlying idea is to replace the descent approximation by the exact problem. Indeed, the objective of the descent step is to minimize the scalar-valued problem:
$$x_i^\star = \arg\min_\kappa f\left(x_1^{(k)}, \ldots, x_{i-1}^{(k)}, \kappa, x_{i+1}^{(k)}, \ldots, x_n^{(k)}\right) \quad (10)$$

Algorithm 2
Coordinate descent algorithm (exact formulation)
The goal is to find the solution $x^\star = \arg\min f(x)$
We initialize the vector $x^{(0)}$
Set $k \leftarrow 0$
repeat
    Choose a coordinate $i \in \{1, \ldots, n\}$
    $x_i^{(k+1)} \leftarrow \arg\min_\kappa f\left(x_1^{(k)}, \ldots, x_{i-1}^{(k)}, \kappa, x_{i+1}^{(k)}, \ldots, x_n^{(k)}\right)$
    $x_j^{(k+1)} \leftarrow x_j^{(k)}$ if $j \ne i$
    $k \leftarrow k + 1$
until convergence
return $x^\star \leftarrow x^{(k)}$

Coordinate descent is efficient in large-scale optimization problems, in particular when there is an analytical solution to the scalar-valued problem (10). Furthermore, convergence is guaranteed when $f(x)$ is convex and differentiable (Luo and Tseng, 1992; Luo and Tseng, 1993).

Remark 2
Coordinate descent methods were introduced in several handbooks on numerical optimization in the 1980s and 1990s (Wright, 1985). However, the most important step is the contribution of Tseng (2001), who studied the block-coordinate descent method and extended CD algorithms to the case of a non-differentiable and non-convex function $f(x)$.

There are several options for choosing the coordinate of the $k$-th iteration. A natural choice could be to choose the coordinate which minimizes the function:
$$i^\star = \arg\inf \left\{ f_i^\star : i \in \{1, \ldots, n\}, \; f_i^\star = \min_\kappa f\left(\left(\mathbf{1}_n - e_i\right) \circ x^{(k)} + \kappa e_i\right) \right\}$$
However, it is obvious that choosing the optimal coordinate $i^\star$ would require the gradient along each coordinate to be calculated. This causes the coordinate descent to be no longer efficient, since a classic gradient descent would then be of equivalent cost at each iteration and would converge faster because it requires fewer iterations.

The simplest way to implement the CD algorithm is to consider cyclic coordinates, meaning that we cyclically iterate through the coordinates (Tseng, 2001):
$$i = k \bmod n$$
This ensures that all the coordinates are selected during one cycle $\{k-n+1, \ldots, k\}$ in the same order. This approach, called cyclical coordinate descent (CCD), is the most popular and widely used method, even if it is difficult to estimate its rate of convergence.

The second way is to consider random coordinates. Let $\pi_i$ be the probability of choosing the coordinate $i$ at iteration $k$. The simplest approach is to consider uniform probabilities: $\pi_i = 1/n$. A better approach consists in pre-specifying probabilities according to the Lipschitz constants:
$$\pi_i = \frac{L_i^\alpha}{\sum_{j=1}^n L_j^\alpha} \quad (11)$$
Nesterov (2012) considers three schemes: $\alpha = 0$, $\alpha = 1$ and $\alpha = \infty$ (in this last case, we have $i = \arg\max\{L_1, \ldots, L_n\}$).
From a theoretical point of view, the random coordinate descent (RCD) method based on the probability distribution (11) leads to a faster convergence, since coordinates that have a large Lipschitz constant $L_i$ are more likely to be chosen. However, it requires additional calculus to compute the Lipschitz constants, and CCD is often preferred from a practical point of view. In what follows, we only use the CCD algorithm described below. In Algorithm (3), the variable $k$ represents the number of cycles, whereas the number of iterations is equal to $k \cdot n$. For the coordinate $i$, the lower coordinates $j < i$ correspond to the current cycle $k+1$ while the upper coordinates $j > i$ correspond to the previous cycle $k$.

Algorithm 3
Cyclical coordinate descent algorithm
The goal is to find the solution $x^\star = \arg\min f(x)$
We initialize the vector $x^{(0)}$
Set $k \leftarrow 0$
repeat
    for $i = 1 : n$ do
        $x_i^{(k+1)} \leftarrow \arg\min_\kappa f\left(x_1^{(k+1)}, \ldots, x_{i-1}^{(k+1)}, \kappa, x_{i+1}^{(k)}, \ldots, x_n^{(k)}\right)$
    end for
    $k \leftarrow k + 1$
until convergence
return $x^\star \leftarrow x^{(k)}$

Nesterov (2012) assumes that $f(x)$ is convex, differentiable and Lipschitz-smooth for each coordinate: $\left\|\nabla_i f(x + e_i h) - \nabla_i f(x)\right\| \le L_i \|h\|$ where $h \in \mathbb{R}$.

λ-problem of the lasso regression
We consider the linear regression:
$$Y = X\beta + \varepsilon \quad (12)$$
where $Y$ is the $n \times 1$ vector of observations, $X$ is the $n \times p$ design matrix, $\beta$ is the $p \times 1$ vector of coefficients and $\varepsilon$ is the $n \times 1$ vector of residuals; $n$ is the number of observations and $p$ is the number of explanatory variables. The ordinary least squares estimator is:
$$\hat{\beta} = \arg\min_\beta \frac{1}{2} \mathrm{RSS}(\beta)$$
where $\mathrm{RSS}(\beta) = \sum_{i=1}^n \varepsilon_i^2$ is the residual sum of squares. Since we have:
$$\mathrm{RSS}(\beta) = (Y - X\beta)^\top (Y - X\beta)$$
we obtain:
$$\frac{\partial f(\beta)}{\partial \beta_j} = -x_j^\top (Y - X\beta)$$
where $x_j$ is the $n \times 1$ vector corresponding to the $j$-th explanatory variable. Because we can write:
$$X\beta = X_{(-j)}\beta_{(-j)} + x_j\beta_j$$
where $X_{(-j)}$ and $\beta_{(-j)}$ are the design matrix and the beta vector excluding the $j$-th explanatory variable, it follows that:
$$\frac{\partial f(\beta)}{\partial \beta_j} = x_j^\top\left(X_{(-j)}\beta_{(-j)} + x_j\beta_j - Y\right) = x_j^\top X_{(-j)}\beta_{(-j)} + x_j^\top x_j \beta_j - x_j^\top Y$$
At the optimum, we have $\partial_{\beta_j} f(\beta) = 0$ or:
$$\beta_j = \frac{x_j^\top\left(Y - X_{(-j)}\beta_{(-j)}\right)}{x_j^\top x_j} \quad (13)$$
The implementation of the coordinate descent algorithm is straightforward. It suffices to iterate Equation (13) through the coordinates.

The lasso regression problem is a variant of the OLS regression obtained by adding an $\ell_1$-norm regularization (Tibshirani, 1996):
12 ( Y − Xβ ) > ( Y − Xβ ) + λ k β k (14)In this formulation, the residual sum of squares of the linear regression is penalized by aterm that will force a sparse selection of the coordinates. Since the objective function is thesum of two convex norms, the convergence is guaranteed for the lasso problem. Because k β k = P nj =1 | β j | , the first order condition becomes:0 = ∇ i f ( β )= x > j x j β j − x > j (cid:0) Y − X ( − j ) β ( − j ) (cid:1) + λ∂ | β j | In Appendix A.5, we show that the solution is given by: β j = S (cid:0) x > j (cid:0) Y − X ( − j ) β ( − j ) (cid:1) ; λ (cid:1) x > j x j (15)where S ( v ; λ ) is the soft-thresholding operator: S ( v ; λ ) = sign ( v ) · ( | v | − λ ) + It follows that the lasso CD algorithm is a variation of the linear regression CD algorithm byapplying the soft-threshold operator to the residuals x > j (cid:0) Y − X ( − j ) β ( − j ) (cid:1) at each iteration.Let us consider an experiment with n = 10 0000 and p = 50. The design matrix X is built using the uniform distribution while the residuals are simulated using a Gaussiandistribution and a standard deviation of 20%. The beta coefficients are distributed uniformlybetween − − p (cid:29) p . If λ is large, a lot of optimal coordinates are equalto zero and a few cycles are needed to find the optimal values of non-zero coefficients. Coordinate descent can also be applied to the box-constrained QP problem: x ? = arg min x x > Qx − x > R s.t. 
x − ≤ x ≤ x + (16)In Appendix A.6 on page 53, we show that the coordinate update of the CCD algorithm isequal to: x ( k +1) i = T R i − P ji x ( k ) j ( Q i,j + Q j,i ) Q i,i ; x − i , x + i T ( v ; x − , x + ) is the truncation operator: T (cid:0) v ; x − , x + (cid:1) = v (cid:12) (cid:8) x − < v < x + (cid:9) + (17) x − (cid:12) (cid:8) v ≤ x − (cid:9) + x + (cid:12) (cid:8) v ≥ x + (cid:9) Generally, we assume that Q is a symmetric matrix, implying that the CCD update reducesto: x ( k +1) i = T R i − P ji x ( k ) j Q i,j Q i,i ; x − i , x + i ! Remark 3
CCD can be applied to Problem (16) because the box constraint $x^- \leq x \leq x^+$ is pointwise.

Figure 2: CCD algorithm applied to the box-constrained QP problem
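To illustrate the CCD update with the truncation operator of Equation (17), the following Python sketch solves a small box-constrained QP; the matrix `Q`, vector `R` and bounds are arbitrary toy inputs, not the example discussed in the text:

```python
import numpy as np

def ccd_box_qp(Q, R, lb, ub, n_cycles=200):
    """Cyclical coordinate descent for min 1/2 x'Qx - x'R s.t. lb <= x <= ub.

    Assumes Q is symmetric positive definite; each coordinate update is the
    unconstrained minimizer truncated to the box (a pointwise constraint)."""
    n = len(R)
    x = np.zeros(n)
    for _ in range(n_cycles):
        for i in range(n):
            # unconstrained coordinate minimizer: (R_i - sum_{j != i} Q_ij x_j) / Q_ii
            v = (R[i] - Q[i].dot(x) + Q[i, i] * x[i]) / Q[i, i]
            # truncation operator T(v; lb_i, ub_i)
            x[i] = min(max(v, lb[i]), ub[i])
    return x

# toy example (not the matrix used in the text)
Q = np.array([[2.0, 0.5], [0.5, 1.0]])
R = np.array([1.0, 3.0])
x = ccd_box_qp(Q, R, lb=np.zeros(2), ub=np.ones(2))
```

At the solution, coordinates strictly inside the box satisfy the first-order condition $(Qx - R)_i = 0$, while coordinates stuck at a bound carry a sign-consistent gradient.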
We consider an example where $Q$ is a symmetric $5 \times 5$ matrix and $R$ is a $5 \times 1$ vector. In Figure 2, we report the solution obtained with the CCD algorithm. The top panels correspond to the QP problem without any constraint, whereas the bottom panels correspond to the QP problem with a box constraint on each weight $x_i$. We notice that more cyclical iterations are needed when the algorithm is initialized with $x^{(0)} = \mathbf{0}_n$, whereas fewer than 10 cyclical iterations are sufficient if we consider the unit vector $x^{(0)} = \mathbf{1}_n$ (see the discussion on page 25).

The alternating direction method of multipliers (ADMM) is an algorithm introduced by Gabay and Mercier (1976) to solve optimization problems which can be expressed as:
$$\{x^\star, y^\star\} = \arg\min_{(x,y)} f_x(x) + f_y(y) \quad \text{s.t.} \quad Ax + By = c \quad (18)$$
where $A \in \mathbb{R}^{p\times n}$, $B \in \mathbb{R}^{p\times m}$, $c \in \mathbb{R}^p$, and the functions $f_x : \mathbb{R}^n \to \mathbb{R}\cup\{+\infty\}$ and $f_y : \mathbb{R}^m \to \mathbb{R}\cup\{+\infty\}$ are proper closed convex functions. Boyd et al. (2011) show that the ADMM algorithm consists of the following three steps:
1. The $x$-update is:
$$x^{(k+1)} = \arg\min_x \left\{ f_x(x) + \frac{\varphi}{2}\left\|Ax + By^{(k)} - c + u^{(k)}\right\|_2^2 \right\} \quad (19)$$
2. The $y$-update is:
$$y^{(k+1)} = \arg\min_y \left\{ f_y(y) + \frac{\varphi}{2}\left\|Ax^{(k+1)} + By - c + u^{(k)}\right\|_2^2 \right\} \quad (20)$$
3. The $u$-update is:
$$u^{(k+1)} = u^{(k)} + \left(Ax^{(k+1)} + By^{(k+1)} - c\right) \quad (21)$$
In this approach, $u^{(k)}$ is the dual variable of the primal residual $r = Ax + By - c$ and $\varphi$ is the $\ell_2$-norm penalty parameter. The parameter $\varphi$ can be constant or may change at each iteration. The ADMM algorithm benefits from the dual ascent principle and the method of multipliers. The difference with the latter is that the $x$- and $y$-updates are performed in an alternating way. Therefore, it is more flexible, because the updates are equivalent to computing proximal operators for $f_x$ and $f_y$ independently. In practice, ADMM may be slow to converge with high accuracy, but it is fast to converge if we consider modest accuracy. This is why ADMM is a good candidate for solving large-scale machine learning problems, where high accuracy does not necessarily lead to a better solution.

Remark 4
In this paper, we use the notations $f_x^{(k+1)}(x)$ and $f_y^{(k+1)}(y)$ when referring to the objective functions that are defined in the $x$- and $y$-updates. Algorithm (4) summarizes the different ADMM steps. See Appendix A.7 on page 56 for a discussion about the convergence of the ADMM algorithm.
Algorithm 4
ADMM algorithm
The goal is to compute the solution $(x^\star, y^\star)$
We initialize the vectors $x^{(0)}$ and $y^{(0)}$ and we choose a value for the parameter $\varphi$
We set $u^{(0)} = \mathbf{0}_n$ and $k \leftarrow 0$
repeat
    $x^{(k+1)} = \arg\min_x \left\{ f_x^{(k+1)}(x) = f_x(x) + \frac{\varphi}{2}\left\|Ax + By^{(k)} - c + u^{(k)}\right\|_2^2 \right\}$
    $y^{(k+1)} = \arg\min_y \left\{ f_y^{(k+1)}(y) = f_y(y) + \frac{\varphi}{2}\left\|Ax^{(k+1)} + By - c + u^{(k)}\right\|_2^2 \right\}$
    $u^{(k+1)} = u^{(k)} + \left(Ax^{(k+1)} + By^{(k+1)} - c\right)$
    $k \leftarrow k + 1$
until convergence
return $x^\star \leftarrow x^{(k)}$ and $y^\star \leftarrow y^{(k)}$

The appeal of ADMM is that it can separate a complex problem into two sub-problems that are easier to solve. However, most of the time, the optimization problem is not formulated using a separable objective function. The question is then how to formulate the initial problem as a separable problem. We now list some tricks that show how ADMM may be used in practice.
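The three update steps of Algorithm (4) can be sketched in Python for the special case $A = I_n$, $B = -I_n$, $c = \mathbf{0}_n$ (the consensus form $x = y$), where both updates reduce to proximal evaluations; the quadratic test functions below are arbitrary illustrative choices, not taken from the text:

```python
import numpy as np

def admm_consensus(prox_fx, prox_fy, n, phi=1.0, n_iter=200):
    """ADMM for min f_x(x) + f_y(y) s.t. x = y (A = I, B = -I, c = 0).

    prox_fx(v, rho) must return argmin_x f_x(x) + rho/2 ||x - v||^2,
    and similarly for prox_fy; phi is the l2-norm penalty parameter."""
    x = y = u = np.zeros(n)
    for _ in range(n_iter):
        x = prox_fx(y - u, phi)      # x-update
        y = prox_fy(x + u, phi)      # y-update
        u = u + x - y                # u-update (dual variable of the residual x - y)
    return x, y

# example: f_x(x) = 1/2 ||x - a||^2 and f_y(y) = 1/2 ||y - b||^2
a, b = np.array([1.0, 3.0]), np.array([3.0, 5.0])
prox_a = lambda v, rho: (a + rho * v) / (1.0 + rho)
prox_b = lambda v, rho: (b + rho * v) / (1.0 + rho)
x, y = admm_consensus(prox_a, prox_b, n=2)
# the minimizer of 1/2||x-a||^2 + 1/2||x-b||^2 is the midpoint (a+b)/2
```

The splitting matters only through the two proximal maps, which is exactly the flexibility the text emphasizes.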
First trick
We consider a problem of the form $x^\star = \arg\min_x g(x)$. The idea is then to write $g(x)$ as a separable function $g(x) = g_1(x) + g_2(x)$ and to consider the following equivalent ADMM problem:
$$\{x^\star, y^\star\} = \arg\min_{(x,y)} f_x(x) + f_y(y) \quad \text{s.t.} \quad x = y \quad (22)$$
where $f_x(x) = g_1(x)$ and $f_y(y) = g_2(y)$. Usually, the smooth part of $g(x)$ will correspond to $g_1(x)$, while the non-smooth part will be included in $g_2(x)$. The underlying idea is that the $x$-update is straightforward, whereas the $y$-update deals with the tricky part of $g(x)$.

Second trick
If we want to minimize the function $g(x)$ where $x \in \Omega$ is a set of constraints, the optimization problem can be cast into the ADMM form (22) where $f_x(x) = g(x)$, $f_y(y) = \mathbb{1}_\Omega(y)$ and $\mathbb{1}_\Omega(x)$ is the convex indicator function of $\Omega$:
$$\mathbb{1}_\Omega(x) = \begin{cases} 0 & \text{if } x \in \Omega \\ +\infty & \text{if } x \notin \Omega \end{cases} \quad (23)$$
For example, if we want to solve the QP problem (2) given on page 3, we have:
$$f_x(x) = \frac{1}{2}x^\top Q x - x^\top R$$
and:
$$\Omega = \left\{x \in \mathbb{R}^n : Ax = B,\; Cx \leq D,\; x^- \leq x \leq x^+\right\}$$

Third trick
We can combine the first and second tricks. For instance, if we consider the following optimization problem:
$$x^\star = \arg\min_x g_1(x) + g_2(x) \quad \text{s.t.} \quad x \in \Omega_1 \cap \Omega_2$$
the equivalent ADMM form is:
$$\{x^\star, y^\star\} = \arg\min_{(x,y)} \underbrace{\left(g_1(x) + \mathbb{1}_{\Omega_1}(x)\right)}_{f_x(x)} + \underbrace{\left(g_2(y) + \mathbb{1}_{\Omega_2}(y)\right)}_{f_y(y)} \quad \text{s.t.} \quad x = y$$
Let us consider a variant of the QP problem where we add a non-linear constraint $h(x) = 0$. In this case, we can write the set of constraints as $\Omega = \Omega_1 \cap \Omega_2$ where:
$$\Omega_1 = \left\{x \in \mathbb{R}^n : Ax = B,\; Cx \leq D,\; x^- \leq x \leq x^+\right\}$$
and:
$$\Omega_2 = \left\{x \in \mathbb{R}^n : h(x) = 0\right\}$$

Fourth trick
Finally, if we want to minimize the function $g(x) = g(x, Ax+b) = g_1(x) + g_2(Ax + b)$, we can write:
$$\{x^\star, y^\star\} = \arg\min_{(x,y)} g_1(x) + g_2(y) \quad \text{s.t.} \quad y = Ax + b$$
For instance, this trick can be used for a QP problem with a non-linear part:
$$g(x) = \frac{1}{2}x^\top Q x - x^\top R + h(x)$$
If we assume that $Q$ is a symmetric positive-definite matrix, we set $x = L^\top y$ where $L$ is the lower Cholesky matrix such that $LL^\top = Q$, so that $x^\top x = y^\top Q y$. It follows that the ADMM form is equal to:
$$\{x^\star, y^\star\} = \arg\min_{(x,y)} \underbrace{\frac{1}{2}x^\top x}_{f_x(x)} + \underbrace{h(y) - y^\top R}_{f_y(y)} \quad \text{s.t.} \quad x - L^\top y = \mathbf{0}_n$$
We notice that the $x$-update is straightforward because it corresponds to a standard QP problem. If we add a set $\Omega$ of constraints, we specify:
$$f_y(y) = h(y) - y^\top R + \mathbb{1}_\Omega(y)$$

Remark 5
In the previous cases, we have seen that the function $g(x)$ may contain a QP problem; in that case, it is convenient to isolate this QP problem in the $x$-update:
$$x^{(k+1)} = \arg\min_x \left\{ \frac{1}{2}x^\top Q x - x^\top R + \mathbb{1}_\Omega(x) + \frac{\varphi}{2}\left\|x - y^{(k)} + u^{(k)}\right\|_2^2 \right\}$$
Since we have:
$$\frac{\varphi}{2}\left\|x - y^{(k)} + u^{(k)}\right\|_2^2 = \frac{\varphi}{2}x^\top x - \varphi x^\top\left(y^{(k)} - u^{(k)}\right) + \frac{\varphi}{2}\left(y^{(k)} - u^{(k)}\right)^\top\left(y^{(k)} - u^{(k)}\right)$$
we deduce that the $x$-update is a standard QP problem where:
$$f_x^{(k+1)}(x) = \frac{1}{2}x^\top\left(Q + \varphi I_n\right)x - x^\top\left(R + \varphi\left(y^{(k)} - u^{(k)}\right)\right) + \mathbb{1}_\Omega(x) \quad (24)$$
This Cholesky trick has been used by Gonzalvez et al. (2019) to solve trend-following strategies using the ADMM algorithm in the context of Bayesian learning.

$\lambda$-problem of the lasso regression    The $\lambda$-problem of the lasso regression (14) has the following ADMM formulation:
$$\left\{\beta^\star, \bar\beta^\star\right\} = \arg\min_{(\beta,\bar\beta)} \frac{1}{2}(Y - X\beta)^\top(Y - X\beta) + \lambda\left\|\bar\beta\right\|_1 \quad \text{s.t.} \quad \beta - \bar\beta = \mathbf{0}_p$$
Since the $x$-step corresponds to a QP problem (we have $Q = X^\top X$ and $R = X^\top Y$), we use the results given in Remark 5 to find the value of $\beta^{(k+1)}$:
$$\beta^{(k+1)} = \left(Q + \varphi I_p\right)^{-1}\left(R + \varphi\left(\bar\beta^{(k)} - u^{(k)}\right)\right) = \left(X^\top X + \varphi I_p\right)^{-1}\left(X^\top Y + \varphi\left(\bar\beta^{(k)} - u^{(k)}\right)\right)$$
The $y$-step is:
$$\bar\beta^{(k+1)} = \arg\min_{\bar\beta}\left\{\lambda\left\|\bar\beta\right\|_1 + \frac{\varphi}{2}\left\|\beta^{(k+1)} - \bar\beta + u^{(k)}\right\|_2^2\right\} = \arg\min_{\bar\beta}\left\{\frac{1}{2}\left\|\bar\beta - \left(\beta^{(k+1)} + u^{(k)}\right)\right\|_2^2 + \frac{\lambda}{\varphi}\left\|\bar\beta\right\|_1\right\}$$
We recognize the soft-thresholding problem with $v = \beta^{(k+1)} + u^{(k)}$. Finally, the ADMM algorithm is made up of the following steps (Boyd et al., 2011):
$$\begin{cases} \beta^{(k+1)} = \left(X^\top X + \varphi I_p\right)^{-1}\left(X^\top Y + \varphi\left(\bar\beta^{(k)} - u^{(k)}\right)\right) \\ \bar\beta^{(k+1)} = \mathcal{S}\left(\beta^{(k+1)} + u^{(k)}; \varphi^{-1}\lambda\right) \\ u^{(k+1)} = u^{(k)} + \left(\beta^{(k+1)} - \bar\beta^{(k+1)}\right) \end{cases}$$
We consider the example of the lasso regression with $\lambda = 900$ on page 14. By setting $\varphi = \lambda$ and by initializing the algorithm with the OLS estimates, we obtain the convergence shown in Figure 3. We notice that the ADMM algorithm converges more slowly than the CCD algorithm for this example. In practice, we generally observe that the convergence is poor for low and very high values of $\varphi$. However, finding an optimal value of $\varphi$ is difficult. A better approach involves using a varying parameter $\varphi^{(k)}$, such as the method described on page 56.

The $x$- and $y$-update steps of the ADMM algorithm require an $\ell_2$-norm penalized optimization problem to be solved. Proximal operators are special cases of this type of problem when the matrices $A$ or $B$ correspond to the identity matrix $I_n$ or its opposite $-I_n$. Let $f : \mathbb{R}^n \to \mathbb{R}\cup\{+\infty\}$ be a proper closed convex function. The proximal operator $\mathrm{prox}_f(v) : \mathbb{R}^n \to \mathbb{R}^n$ is defined by:
$$\mathrm{prox}_f(v) = x^\star = \arg\min_x\left\{ f_v(x) = f(x) + \frac{1}{2}\left\|x - v\right\|_2^2 \right\} \quad (25)$$
Since the function $f_v(x) = f(x) + \frac{1}{2}\|x - v\|_2^2$ is strongly convex, it has a unique minimum for every $v \in \mathbb{R}^n$ (Parikh and Boyd, 2014). By construction, the proximal operator defines a point $x^\star$ which is a trade-off between minimizing $f(x)$ and being close to $v$.

In many situations, we need to calculate the proximal operator of the scaled function $\lambda f(x)$ where $\lambda > 0$. In this case, we use the notation $\mathrm{prox}_{\lambda f}(v)$ and we have:
$$\mathrm{prox}_{\lambda f}(v) = \arg\min_x\left\{\lambda f(x) + \frac{1}{2}\left\|x - v\right\|_2^2\right\} = \arg\min_x\left\{f(x) + \frac{1}{2\lambda}\left\|x - v\right\|_2^2\right\}$$
For instance, if we consider the $y$-update of the ADMM algorithm with $B = -I_n$, we have:
$$y^{(k+1)} = \arg\min_y\left\{f_y(y) + \frac{\varphi}{2}\left\|y - v_y^{(k+1)}\right\|_2^2\right\} = \arg\min_y\left\{\varphi^{-1}f_y(y) + \frac{1}{2}\left\|y - v_y^{(k+1)}\right\|_2^2\right\} = \mathrm{prox}_{\varphi^{-1}f_y}\left(v_y^{(k+1)}\right)$$
where $v_y^{(k+1)} = Ax^{(k+1)} - c + u^{(k)}$. ADMM is then given by Algorithm (5). The interest of this mathematical formulation is to write the ADMM algorithm in a convenient form such that the $x$-update corresponds to the tricky part of the optimization, while the $y$-update is reduced to an analytical formula.

Algorithm 5
ADMM algorithm in the case $Ax - y = c$
The goal is to compute the solution $(x^\star, y^\star)$
We initialize the vectors $x^{(0)}$ and $y^{(0)}$ and we choose a value for the parameter $\varphi$
We set $u^{(0)} = \mathbf{0}_n$ and $k \leftarrow 0$
repeat
    $x^{(k+1)} = \arg\min_x \left\{ f_x^{(k+1)}(x) = f_x(x) + \frac{\varphi}{2}\left\|Ax - y^{(k)} - c + u^{(k)}\right\|_2^2 \right\}$
    $v_y^{(k+1)} = Ax^{(k+1)} - c + u^{(k)}$
    $y^{(k+1)} = \mathrm{prox}_{\varphi^{-1}f_y}\left(v_y^{(k+1)}\right)$
    $u^{(k+1)} = u^{(k)} + \left(Ax^{(k+1)} - y^{(k+1)} - c\right)$
    $k \leftarrow k + 1$
until convergence
return $x^\star \leftarrow x^{(k)}$ and $y^\star \leftarrow y^{(k)}$

In the case where $f(x) = \mathbb{1}_\Omega(x)$ is the indicator function, the proximal operator is the Euclidean projection onto $\Omega$:
$$\mathrm{prox}_f(v) = \arg\min_x\left\{\mathbb{1}_\Omega(x) + \frac{1}{2}\left\|x - v\right\|_2^2\right\} = \arg\min_{x\in\Omega}\left\{\frac{1}{2}\left\|x - v\right\|_2^2\right\} = \mathcal{P}_\Omega(v)$$
where $\mathcal{P}_\Omega(v)$ is the standard projection of $v$ onto $\Omega$. Parikh and Boyd (2014) then interpret proximal operators as a generalization of the Euclidean projection.

Let us consider the constrained optimization problem $x^\star = \arg\min f(x)$ subject to $x \in \Omega$. Using the second ADMM trick, we have $f_x(x) = f(x)$, $f_y(y) = \mathbb{1}_\Omega(y)$ and $x - y = \mathbf{0}_n$. Therefore, we can use Algorithm (5) since the $v$- and $y$-steps become $v_y^{(k+1)} = x^{(k+1)} + u^{(k)}$ and $y^{(k+1)} = \mathcal{P}_\Omega\left(v_y^{(k+1)}\right)$.

Here, we give the results of Parikh and Boyd (2014) for some simple polyhedra:

    Notation                 $\Omega$                        $\mathcal{P}_\Omega(v)$
    Affine set $[A, B]$      $Ax = B$                        $v - A^\dagger\left(Av - B\right)$
    Hyperplane $[a, b]$      $a^\top x = b$                  $v - \dfrac{a^\top v - b}{\left\|a\right\|_2^2}\,a$
    Halfspace $[c, d]$       $c^\top x \leq d$               $v - \dfrac{\left(c^\top v - d\right)_+}{\left\|c\right\|_2^2}\,c$
    Box $[x^-, x^+]$         $x^- \leq x \leq x^+$           $\mathcal{T}\left(v; x^-, x^+\right)$

where $A^\dagger$ is the Moore-Penrose pseudo-inverse of $A$, and $\mathcal{T}(v; x^-, x^+)$ is the truncation operator.
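A direct Python transcription of these four projections (a sketch, not library code) might be:

```python
import numpy as np

def proj_affine(v, A, B):
    """Projection onto {x : Ax = B} using the Moore-Penrose pseudo-inverse."""
    return v - np.linalg.pinv(A) @ (A @ v - B)

def proj_hyperplane(v, a, b):
    """Projection onto {x : a'x = b}."""
    return v - (a @ v - b) / (a @ a) * a

def proj_halfspace(v, c, d):
    """Projection onto {x : c'x <= d}; leaves feasible points unchanged."""
    return v - max(c @ v - d, 0.0) / (c @ c) * c

def proj_box(v, lb, ub):
    """Truncation operator T(v; lb, ub)."""
    return np.clip(v, lb, ub)
```

For instance, projecting portfolio weights onto the budget hyperplane $\mathbf{1}_n^\top x = 1$ simply recenters them by a constant.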
We notice that the parameter $\varphi$ has no impact on the $y$-update because $\varphi^{-1}f_y(y) = f_y(y) = \mathbb{1}_\Omega(y)$. We then deduce that:
$$\mathrm{prox}_{\varphi^{-1}f_y}\left(v_y^{(k+1)}\right) = \mathrm{prox}_{f_y}\left(v_y^{(k+1)}\right) = \mathcal{P}_\Omega\left(v_y^{(k+1)}\right)$$
There are many properties that are useful for finding the analytical expression of the proximal operator. In what follows, we consider three main properties, but the reader may refer to Combettes and Pesquet (2011), Parikh and Boyd (2014) and Beck (2017) for a more exhaustive list.
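Putting the pieces together, the ADMM steps for the lasso $\lambda$-problem can be sketched as follows; the simulated dataset is an arbitrary illustration, and `phi` defaults to $\lambda$ as in the text:

```python
import numpy as np

def admm_lasso(X, Y, lam, phi=None, n_iter=2000):
    """ADMM for the lasso lambda-problem: min 1/2 ||Y - X b||^2 + lam ||b||_1,
    using the splitting b = b_bar."""
    n, p = X.shape
    phi = lam if phi is None else phi        # the text sets phi = lambda
    XtX, XtY = X.T @ X, X.T @ Y
    M = np.linalg.inv(XtX + phi * np.eye(p)) # factorize once, reuse every iteration
    b_bar, u = np.zeros(p), np.zeros(p)
    soft = lambda v, t: np.sign(v) * np.maximum(np.abs(v) - t, 0.0)
    for _ in range(n_iter):
        b = M @ (XtY + phi * (b_bar - u))    # x-update: QP with closed form
        b_bar = soft(b + u, lam / phi)       # y-update: soft-thresholding
        u = u + b - b_bar                    # u-update
    return b_bar

# small illustrative dataset (arbitrary numbers, not the experiment of the text)
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 5))
beta_true = np.array([2.0, 0.0, 0.0, -1.5, 0.0])
Y = X @ beta_true + 0.1 * rng.standard_normal(100)
beta_hat = admm_lasso(X, Y, lam=10.0)
```

At convergence, the iterate satisfies the lasso subgradient conditions: $x_j^\top(Y - X\beta) = \lambda\,\mathrm{sign}(\beta_j)$ for active coordinates and $|x_j^\top(Y - X\beta)| \leq \lambda$ for the zero ones.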
Separable sum
Let us assume that $f(x) = \sum_{i=1}^n f_i(x_i)$ is fully separable. Then the proximal of $f(v)$ is the vector of the proximal operators applied to each scalar-valued function $f_i(x_i)$:
$$\mathrm{prox}_f(v) = \begin{pmatrix} \mathrm{prox}_{f_1}(v_1) \\ \vdots \\ \mathrm{prox}_{f_n}(v_n) \end{pmatrix}$$
For example, if $f(x) = \lambda\|x\|_1$, we have $f(x) = \lambda\sum_{i=1}^n |x_i|$ and $f_i(x_i) = \lambda|x_i|$. We deduce that the proximal operator of $f(x)$ is the vector formulation of the soft-thresholding operator:
$$\mathrm{prox}_{\lambda\|x\|_1}(v) = \begin{pmatrix} \mathrm{sign}(v_1)\cdot\left(|v_1| - \lambda\right)_+ \\ \vdots \\ \mathrm{sign}(v_n)\cdot\left(|v_n| - \lambda\right)_+ \end{pmatrix} = \mathrm{sign}(v)\odot\left(|v| - \lambda\mathbf{1}_n\right)_+$$
This result has been used to solve the $\lambda$-problem of the lasso regression on page 19.

If we consider the scalar-valued logarithmic barrier function $f(x) = -\lambda\ln x$, we have:
$$f_v(x) = -\lambda\ln x + \frac{1}{2}(x - v)^2 = -\lambda\ln x + \frac{1}{2}x^2 - xv + \frac{1}{2}v^2$$
The first-order condition is $-\lambda x^{-1} + x - v = 0$. We obtain two roots with opposite signs:
$$x^\star = \frac{v \pm \sqrt{v^2 + 4\lambda}}{2}$$
Since the logarithmic barrier imposes $x > 0$, we deduce that the proximal operator is the positive root. In the case of the vector-valued logarithmic barrier $f(x) = -\lambda\sum_{i=1}^n \ln x_i$, it follows that:
$$\mathrm{prox}_f(v) = \frac{v + \sqrt{v\odot v + 4\lambda\mathbf{1}_n}}{2}$$

Moreau decomposition
An important property of the proximal operator is the Moreau decomposition theorem:
$$\mathrm{prox}_f(v) + \mathrm{prox}_{f^*}(v) = v$$
where $f^*$ is the convex conjugate of $f$. This result is used extensively to find the proximal operator of norms, the max function, the sum-of-$k$-largest-values function, etc. (Beck, 2017). In the case of the pointwise maximum function $f(x) = \max x$, we can show that:
$$\mathrm{prox}_{\lambda\max x}(v) = \min\left(v, s^\star\right)$$
where $s^\star$ is the solution of the following equation (see Appendix A.8.1 on page 57):
$$s^\star = \left\{s \in \mathbb{R} : \sum_{i=1}^n \left(v_i - s\right)_+ = \lambda\right\}$$
In the case of the $\ell_p$-norm function $f(x) = \|x\|_p$, we obtain:
$$\mathrm{prox}_{\lambda f}(v) = \begin{cases} \mathcal{S}(v;\lambda) = \mathrm{sign}(v)\odot\left(|v| - \lambda\mathbf{1}_n\right)_+ & \text{if } p = 1 \\[1ex] \left(1 - \dfrac{\lambda}{\max\left(\lambda, \|v\|_2\right)}\right)v & \text{if } p = 2 \\[1ex] \mathrm{sign}(v)\odot\mathrm{prox}_{\lambda\max x}\left(|v|\right) & \text{if } p = \infty \end{cases}$$
If $f(x)$ is an $\ell_q$-norm function, then $f^*(x) = \mathbb{1}_{\mathcal{B}_p}(x)$ where $\mathcal{B}_p$ is the $\ell_p$ unit ball and $p^{-1} + q^{-1} = 1$. Since we have $\mathrm{prox}_{f^*}(v) = \mathcal{P}_{\mathcal{B}_p}(v)$, we deduce that:
$$\mathrm{prox}_f(v) + \mathcal{P}_{\mathcal{B}_p}(v) = v$$
More generally, we have:
$$\mathrm{prox}_{\lambda f}(v) + \lambda\,\mathcal{P}_{\mathcal{B}_p}\left(\frac{v}{\lambda}\right) = v$$
It follows that the projection onto the $\ell_p$ ball can be deduced from the proximal operator of the $\ell_q$-norm function. Let $\mathcal{B}_p(c, \lambda) = \left\{x \in \mathbb{R}^n : \|x - c\|_p \leq \lambda\right\}$ be the $\ell_p$ ball with center $c$ and radius $\lambda$. We obtain:
$$\mathcal{P}_{\mathcal{B}_p(\mathbf{0}_n,\lambda)}(v) = \begin{cases} v - \mathrm{sign}(v)\odot\mathrm{prox}_{\lambda\max x}\left(|v|\right) & \text{if } p = 1 \;(q = \infty) \\ v - \mathrm{prox}_{\lambda\|x\|_2}(v) & \text{if } p = 2 \;(q = 2) \\ \mathcal{T}\left(v; -\lambda, \lambda\right) & \text{if } p = \infty \;(q = 1) \end{cases}$$

Scaling and translation
Let us define $g(x) = f(ax + b)$ where $a \neq 0$. We have:
$$\mathrm{prox}_g(v) = \frac{\mathrm{prox}_{a^2 f}\left(av + b\right) - b}{a}$$
The proof can be found in Beck (2017) on page 138; we have reported it in Appendix A.8.3 on page 58. We can use this property when the center $c$ of the $\ell_p$ ball is not equal to $\mathbf{0}_n$. Since we have $\mathrm{prox}_g(v) = \mathrm{prox}_f(v - c) + c$ where $g(x) = f(x - c)$, and the equivalence $\mathcal{B}_p(\mathbf{0}_n, \lambda) = \left\{x \in \mathbb{R}^n : f(x) \leq \lambda\right\}$ where $f(x) = \|x\|_p$, we deduce that:
$$\mathcal{P}_{\mathcal{B}_p(c,\lambda)}(v) = \mathcal{P}_{\mathcal{B}_p(\mathbf{0}_n,\lambda)}\left(v - c\right) + c$$

$\tau$-problem of the lasso regression    We have previously presented the lasso regression problem by considering the Lagrange formulation ($\lambda$-problem). We now consider the original $\tau$-problem:
$$\hat\beta(\tau) = \arg\min_\beta \frac{1}{2}(Y - X\beta)^\top(Y - X\beta) \quad \text{s.t.} \quad \left\|\beta\right\|_1 \leq \tau$$
The ADMM formulation is:
$$\left\{\beta^\star, \bar\beta^\star\right\} = \arg\min_{(\beta,\bar\beta)} \frac{1}{2}(Y - X\beta)^\top(Y - X\beta) + \mathbb{1}_\Omega\left(\bar\beta\right) \quad \text{s.t.} \quad \beta = \bar\beta$$
where $\Omega = \mathcal{B}_1(\mathbf{0}_p, \tau)$ is the centered $\ell_1$ ball with radius $\tau$. We notice that the $x$-update is:
$$\beta^{(k+1)} = \arg\min_\beta\left\{\frac{1}{2}(Y - X\beta)^\top(Y - X\beta) + \frac{\varphi}{2}\left\|\beta - \bar\beta^{(k)} + u^{(k)}\right\|_2^2\right\} = \left(X^\top X + \varphi I_p\right)^{-1}\left(X^\top Y + \varphi\, v_x^{(k+1)}\right)$$
where $v_x^{(k+1)} = \bar\beta^{(k)} - u^{(k)}$. For the $y$-update, we deduce that:
$$\bar\beta^{(k+1)} = \arg\min_{\bar\beta}\left\{\mathbb{1}_\Omega\left(\bar\beta\right) + \frac{\varphi}{2}\left\|\beta^{(k+1)} - \bar\beta + u^{(k)}\right\|_2^2\right\} = \mathrm{prox}_{f_y}\left(\beta^{(k+1)} + u^{(k)}\right) = \mathcal{P}_\Omega\left(v_y^{(k+1)}\right) = v_y^{(k+1)} - \mathrm{sign}\left(v_y^{(k+1)}\right)\odot\mathrm{prox}_{\tau\max x}\left(\left|v_y^{(k+1)}\right|\right)$$
where $v_y^{(k+1)} = \beta^{(k+1)} + u^{(k)}$. Finally, the $u$-update is defined by $u^{(k+1)} = u^{(k)} + \beta^{(k+1)} - \bar\beta^{(k+1)}$.

Remark 6
The ADMM algorithm is similar for the $\lambda$- and $\tau$-problems, since the only difference concerns the $y$-step. For the $\lambda$-problem, we apply the soft-thresholding operator, while we use the $\ell_1$ projection in the case of the $\tau$-problem. However, our experience shows that the $\tau$-problem is easier to solve with the ADMM algorithm from a practical point of view. The reason is that the $y$-update of the $\tau$-problem is independent of the penalization parameter $\varphi$. This is not the case for the $\lambda$-problem, because the soft-thresholding depends on the value taken by $\varphi^{-1}\lambda$.

We consider the following constrained minimization problem:
$$x^\star = \arg\min_x f(x) \quad \text{s.t.} \quad x \in \Omega$$
where the set $\Omega$ of constraints is fully separable:
$$\mathbb{1}_\Omega(x) = \sum_{i=1}^n \mathbb{1}_{\Omega_i}(x_i)$$
The scalar-valued problem of the CD algorithm becomes:
$$x_i^\star = \arg\min_\kappa f\left(x_1, \ldots, x_{i-1}, \kappa, x_{i+1}, \ldots, x_n\right) + \lambda\sum_{i=1}^n \mathbb{1}_{\Omega_i}(x_i)$$
Nesterov (2012) and Wright (2015) propose the following coordinate update:
$$x_i^\star = \arg\min_\kappa\; \left(\kappa - x_i\right)g_i + \frac{1}{2\eta}\left(\kappa - x_i\right)^2 + \lambda\cdot\mathbb{1}_{\Omega_i}(\kappa)$$
where $g_i = \nabla_i f(x)$ is the first derivative of the function with respect to $x_i$, and $\eta > 0$ and $\lambda$ are positive scalars. The objective function is equivalent to:
$$(\ast) = \left(\kappa - x_i\right)g_i + \frac{1}{2\eta}\left(\kappa - x_i\right)^2 + \lambda\cdot\mathbb{1}_{\Omega_i}(\kappa) = \frac{1}{2\eta}\left(\left(\kappa - x_i\right)^2 + 2\left(\kappa - x_i\right)\eta g_i\right) + \lambda\cdot\mathbb{1}_{\Omega_i}(\kappa) = \frac{1}{2\eta}\left(\kappa - x_i + \eta g_i\right)^2 + \lambda\cdot\mathbb{1}_{\Omega_i}(\kappa) - \frac{\eta}{2}g_i^2$$
By taking $\lambda = \eta^{-1}$, we deduce that:
$$x_i^\star = \arg\min_\kappa\; \mathbb{1}_{\Omega_i}(\kappa) + \frac{1}{2}\left\|\kappa - \left(x_i - \eta g_i\right)\right\|_2^2 = \mathrm{prox}_\psi\left(x_i - \eta g_i\right) = \mathcal{P}_{\Omega_i}\left(x_i - \eta g_i\right)$$
where $\psi(\kappa) = \mathbb{1}_{\Omega_i}(\kappa)$. Extending the CD algorithm in the case of pointwise constraints is then equivalent to implementing a standard CD algorithm and applying the projection onto the $i$-th coordinate at each iteration; this method corresponds to the proximal gradient algorithm. For instance, this algorithm is particularly efficient when we consider box constraints.

We now consider the proximal optimization problem where the function $f(x)$ is the convex sum of basic functions $f_j(x)$:
$$x^\star = \arg\min_x \sum_{j=1}^m f_j(x) + \frac{1}{2}\left\|x - v\right\|_2^2$$
and the proximal of each basic function is known.

The $m = 2$ case    In the previous section, we listed some analytical solutions of the proximal problem when the function $f(x)$ is basic. For instance, we know the proximal solution of the $\ell_1$-norm function $f(x) = \lambda\|x\|_1$ or the proximal solution of the logarithmic barrier function $f(x) = -\lambda\sum_{i=1}^n \ln x_i$. However, we do not know how to compute the proximal operator of $f(x) = f_1(x) + f_2(x)$:
$$x^\star = \arg\min_x f_1(x) + f_2(x) + \frac{1}{2}\left\|x - v\right\|_2^2 = \mathrm{prox}_{f_1 + f_2}(v)$$
Nevertheless, an elegant solution is provided by Dykstra's algorithm (Dykstra, 1983; Bauschke and Borwein, 1994; Combettes and Pesquet, 2011), which is defined by the following iterations:
$$\begin{cases} x^{(k+1)} = \mathrm{prox}_{f_1}\left(y^{(k)} + p^{(k)}\right) \\ p^{(k+1)} = y^{(k)} + p^{(k)} - x^{(k+1)} \\ y^{(k+1)} = \mathrm{prox}_{f_2}\left(x^{(k+1)} + q^{(k)}\right) \\ q^{(k+1)} = x^{(k+1)} + q^{(k)} - y^{(k+1)} \end{cases} \quad (26)$$
where $x^{(0)} = y^{(0)} = v$ and $p^{(0)} = q^{(0)} = \mathbf{0}_n$. This algorithm is obviously related to the Douglas-Rachford splitting framework (see Douglas and Rachford, 1956; Combettes and Pesquet, 2011; Lindstrom and Sims, 2018), where $x^{(k)}$ and $p^{(k)}$ are the variable and the residual associated with $f_1(x)$, and $y^{(k)}$ and $q^{(k)}$ are the variable and the residual associated with $f_2(x)$. Algorithm (26) can be reformulated by introducing the intermediary step $k + 1/2$:
$$\begin{cases} x^{(k+1/2)} = \mathrm{prox}_{f_1}\left(x^{(k)} + p^{(k)}\right) \\ p^{(k+1)} = p^{(k)} - \Delta_{1/2}\, x^{(k+1/2)} \\ x^{(k+1)} = \mathrm{prox}_{f_2}\left(x^{(k+1/2)} + q^{(k)}\right) \\ q^{(k+1)} = q^{(k)} - \Delta_{1/2}\, x^{(k+1)} \end{cases} \quad (27)$$
where $\Delta_h x^{(k)} = x^{(k)} - x^{(k-h)}$.
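A minimal Python sketch of the iterations (26) follows; specializing the two proximal operators to projections (a halfspace and a box, an arbitrary illustrative choice) computes the projection onto the intersection of the two sets:

```python
import numpy as np

def dykstra2(prox1, prox2, v, n_iter=500):
    """Dykstra's algorithm for prox_{f1+f2}(v), given the two individual
    proximal operators and the starting point v."""
    x = v.copy()
    y = v.copy()
    p = np.zeros_like(v)
    q = np.zeros_like(v)
    for _ in range(n_iter):
        x = prox1(y + p)          # prox of f1 applied to variable plus residual
        p = y + p - x             # update the residual associated with f1
        y = prox2(x + q)          # prox of f2 applied to variable plus residual
        q = x + q - y             # update the residual associated with f2
    return y

# example: projection onto {x : 1'x <= 1} intersected with {0 <= x <= 0.6}
proj_half = lambda z: z - max(z.sum() - 1.0, 0.0) / len(z) * np.ones_like(z)
proj_box = lambda z: np.clip(z, 0.0, 0.6)
x_star = dykstra2(proj_half, proj_box, np.array([1.0, 0.8, -0.2]))
```

Unlike plain alternating projections, the residuals $p$ and $q$ make the limit the true Euclidean projection onto the intersection, not merely a feasible point.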
Figure 4 illustrates the splitting method used by Dykstra's algorithm and clearly shows the relationship with the Douglas-Rachford algorithm.

Figure 4: Splitting method of Dykstra's algorithm

The $m > 2$ case    The case $m > 2$ is a generalization of the previous algorithm obtained by considering $m$ residuals:
1. The $x$-update is:
$$x^{(k+1)} = \mathrm{prox}_{f_{j(k)}}\left(x^{(k)} + z^{(k+1-m)}\right)$$
2. The $z$-update is:
$$z^{(k+1)} = x^{(k)} + z^{(k+1-m)} - x^{(k+1)}$$
where $x^{(0)} = v$, $z^{(k)} = \mathbf{0}_n$ for $k < 0$, and $j(k) = \mathrm{mod}(k+1, m)$ denotes the modulo operator taking values in $\{1, \ldots, m\}$. The variable $x^{(k)}$ is updated at each iteration, while the residual $z^{(k)}$ is updated every $m$ iterations. This implies that the basic function $f_j(x)$ is related to the residuals $z^{(j)}, z^{(j+m)}, z^{(j+2m)}$, etc. Following Tibshirani (2017), it is better to write Dykstra's algorithm by using two iteration indices $k$ and $j$. The main index $k$ refers to the cycle (exactly like the coordinate descent algorithm), whereas the sub-index $j$ refers to the constraint number:
1. The $x$-update is:
$$x^{(k+1,j)} = \mathrm{prox}_{f_j}\left(x^{(k+1,j-1)} + z^{(k,j)}\right) \quad (28)$$
2. The $z$-update is:
$$z^{(k+1,j)} = x^{(k+1,j-1)} + z^{(k,j)} - x^{(k+1,j)} \quad (29)$$
where $x^{(1,0)} = v$, $z^{(k,j)} = \mathbf{0}_n$ for $k = 0$, and $x^{(k+1,0)} = x^{(k,m)}$.

Dykstra's algorithm is particularly efficient when we consider the projection problem:
$$x^\star = \mathcal{P}_\Omega(v)$$
where $\Omega = \Omega_1 \cap \Omega_2 \cap \cdots \cap \Omega_m$. Indeed, the solution is found by replacing Equation (28) with:
$$x^{(k+1,j)} = \mathcal{P}_{\Omega_j}\left(x^{(k+1,j-1)} + z^{(k,j)}\right) \quad (30)$$
Let us consider the case $\Omega = \left\{x \in \mathbb{R}^n : Cx \leq D\right\}$ where the number of inequality constraints is equal to $m$. We can write:
$$\Omega = \Omega_1 \cap \Omega_2 \cap \cdots \cap \Omega_m$$
where $\Omega_j = \left\{x \in \mathbb{R}^n : c_{(j)}^\top x \leq d_{(j)}\right\}$, $c_{(j)}^\top$ corresponds to the $j$-th row of $C$ and $d_{(j)}$ is the $j$-th element of $D$. Since the projection $\mathcal{P}_{\Omega_j}$ is known and has been given on page 22, we can find the projection $\mathcal{P}_\Omega$ using Algorithm (6).

Algorithm 6
Dykstra's algorithm for solving the proximal problem with linear inequality constraints
The goal is to compute the solution $x^\star = \mathrm{prox}_f(v)$ where $f(x) = \mathbb{1}_\Omega(x)$ and $\Omega = \left\{x \in \mathbb{R}^n : Cx \leq D\right\}$
We initialize $x^{(0,m)} \leftarrow v$
We set $z^{(0,1)} \leftarrow \mathbf{0}_n, \ldots, z^{(0,m)} \leftarrow \mathbf{0}_n$ and $k \leftarrow 0$
repeat
    $x^{(k+1,0)} \leftarrow x^{(k,m)}$
    for $j = 1:m$ do
        The $x$-update is:
        $$x^{(k+1,j)} = x^{(k+1,j-1)} + z^{(k,j)} - \frac{\left(c_{(j)}^\top x^{(k+1,j-1)} + c_{(j)}^\top z^{(k,j)} - d_{(j)}\right)_+}{\left\|c_{(j)}\right\|_2^2}\,c_{(j)}$$
        The $z$-update is:
        $$z^{(k+1,j)} = x^{(k+1,j-1)} + z^{(k,j)} - x^{(k+1,j)}$$
    end for
    $k \leftarrow k + 1$
until convergence
return $x^\star \leftarrow x^{(k,m)}$

If we now consider the general set of constraints:
$$\Omega = \left\{x \in \mathbb{R}^n : Ax = B,\; Cx \leq D,\; x^- \leq x \leq x^+\right\}$$
we decompose $\Omega$ as the intersection of three basic convex sets:
$$\Omega = \Omega_1 \cap \Omega_2 \cap \Omega_3$$
where $\Omega_1 = \left\{x \in \mathbb{R}^n : Ax = B\right\}$, $\Omega_2 = \left\{x \in \mathbb{R}^n : Cx \leq D\right\}$ and $\Omega_3 = \left\{x \in \mathbb{R}^n : x^- \leq x \leq x^+\right\}$. Using Dykstra's algorithm is equivalent to formulating Algorithm (7).

Algorithm 7
Dykstra's algorithm for solving the proximal problem with general linear constraints
The goal is to compute the solution $x^\star = \mathrm{prox}_f(v)$ where $f(x) = \mathbb{1}_\Omega(x)$ and $\Omega = \left\{x \in \mathbb{R}^n : Ax = B, Cx \leq D, x^- \leq x \leq x^+\right\}$
We initialize $x_3^{(0)} \leftarrow v$
We set $z_1^{(0)} \leftarrow \mathbf{0}_n$, $z_2^{(0)} \leftarrow \mathbf{0}_n$, $z_3^{(0)} \leftarrow \mathbf{0}_n$ and $k \leftarrow 0$
repeat
    $x_0^{(k+1)} \leftarrow x_3^{(k)}$
    $x_1^{(k+1)} \leftarrow x_0^{(k+1)} + z_1^{(k)} - A^\dagger\left(Ax_0^{(k+1)} + Az_1^{(k)} - B\right)$
    $z_1^{(k+1)} \leftarrow x_0^{(k+1)} + z_1^{(k)} - x_1^{(k+1)}$
    $x_2^{(k+1)} \leftarrow \mathcal{P}_{\Omega_2}\left(x_1^{(k+1)} + z_2^{(k)}\right)$  (Algorithm (6))
    $z_2^{(k+1)} \leftarrow x_1^{(k+1)} + z_2^{(k)} - x_2^{(k+1)}$
    $x_3^{(k+1)} \leftarrow \mathcal{T}\left(x_2^{(k+1)} + z_3^{(k)}; x^-, x^+\right)$
    $z_3^{(k+1)} \leftarrow x_2^{(k+1)} + z_3^{(k)} - x_3^{(k+1)}$
    $k \leftarrow k + 1$
until convergence
return $x^\star \leftarrow x_3^{(k)}$

Since we have:
$$\frac{1}{2}\left\|x - v\right\|_2^2 = \frac{1}{2}x^\top x - x^\top v + \frac{1}{2}v^\top v$$
we deduce that the two previous problems can be cast into a QP problem:
$$x^\star = \arg\min_x \frac{1}{2}x^\top I_n x - x^\top v \quad \text{s.t.} \quad x \in \Omega$$
We can then compare the efficiency of Dykstra's algorithm with the QP algorithm. Let us consider the proximal problem where the vector $v$ is defined by the elements $v_i = \ln(i)$, and the set of constraints combines an upper bound on $\sum_{i=1}^n x_i$ and a lower bound on $\sum_{i=1}^n e^{-i} x_i$. Using a Matlab implementation (the QP implementation corresponds to the quadprog function), we find that the computational time of Dykstra's algorithm when $n$ is equal to 10 million is equal to that of the QP algorithm when $n$ is equal to 12 500, meaning that there is a factor of 800 between the two methods!

$\ell_2$-penalized logarithmic barrier function    We consider the following proximal problem:
$$x^\star = \arg\min_x -\lambda\sum_{i=1}^n b_i\ln x_i + \frac{1}{2}\left\|x - v\right\|_2^2 \quad \text{s.t.} \quad \left\|x - c\right\|_2 \leq r$$
In Appendix A.8.6 on page 59, we show that the corresponding Dykstra's algorithm is:
$$\begin{cases} x^{(k+1)} = \dfrac{y^{(k)} + z_1^{(k)} + \sqrt{\left(y^{(k)} + z_1^{(k)}\right)\odot\left(y^{(k)} + z_1^{(k)}\right) + 4\lambda b}}{2} \\[1ex] z_1^{(k+1)} = y^{(k)} + z_1^{(k)} - x^{(k+1)} \\[1ex] y^{(k+1)} = c + \dfrac{r}{\max\left(r, \left\|x^{(k+1)} + z_2^{(k)} - c\right\|_2\right)}\left(x^{(k+1)} + z_2^{(k)} - c\right) \\[1ex] z_2^{(k+1)} = x^{(k+1)} + z_2^{(k)} - y^{(k+1)} \end{cases}$$

The development of the previous algorithms will fundamentally change the practice of portfolio optimization. Until now, we have seen that portfolio managers live in a quadratic programming world. With these new optimization algorithms, we can consider more complex portfolio optimization programs with non-quadratic objective functions, regularization with penalty functions and non-linear constraints.

Table 1: Some objective functions used in portfolio optimization

    Item  Portfolio  $f(x)$                                                                        Reference
    (1)   MVO        $\frac{1}{2}x^\top\Sigma x - \gamma x^\top\mu$                                Markowitz (1952)
    (2)   GMV        $\frac{1}{2}x^\top\Sigma x$                                                   Jagannathan and Ma (2003)
    (3)   MDP        $\ln\left(\sqrt{x^\top\Sigma x}\right) - \ln\left(x^\top\sigma\right)$        Choueifaty and Coignard (2008)
    (4)   KL         $\sum_{i=1}^n x_i\ln\left(x_i/\tilde{x}_i\right)$                             Bera and Park (2008)
    (5)   ERC        $\frac{1}{2}x^\top\Sigma x - \lambda\sum_{i=1}^n\ln x_i$                      Maillard et al. (2010)
    (6)   RB         $\mathcal{R}(x) - \lambda\sum_{i=1}^n \mathcal{RB}_i\cdot\ln x_i$            Roncalli (2015)
    (7)   RQE        $\frac{1}{2}x^\top D x$                                                      Carmichael et al. (2018)

We consider a universe of $n$ assets. Let $x$ be the vector of weights in the portfolio. We denote by $\mu$ and $\Sigma$ the vector of expected returns and the covariance matrix of asset returns. Some models also consider a reference portfolio $\tilde{x}$. In Table 1, we report the main objective functions that are used by professionals. Besides the mean-variance optimized portfolio (MVO) and the global minimum variance portfolio (GMV), we find the equal risk contribution portfolio (ERC), the risk budgeting portfolio (RB) and the most diversified portfolio (MDP).
According to Choueifaty and Coignard (2008), the MDP is defined as the portfolio which maximizes the diversification ratio $\mathrm{DR}(x) = \dfrac{x^\top\sigma}{\sqrt{x^\top\Sigma x}}$, where the vector of volatilities is defined by $\sigma = (\sigma_1, \ldots, \sigma_n)$. (For each model, we write the optimization problem as a minimization problem.) We also include in Table 1 two 'academic' portfolios, which are based on the Kullback-Leibler (KL) information criteria and Rao's quadratic entropy (RQE) measure.

Table 2: Some regularization penalties used in portfolio optimization

    Item  Regularization       $\mathcal{R}(x)$                              Reference
    (8)   Ridge                $\lambda\left\|x - \tilde{x}\right\|_2^2$     DeMiguel et al. (2009)
    (9)   Lasso                $\lambda\left\|x - \tilde{x}\right\|_1$       Brodie et al. (2009)
    (10)  Log-barrier          $-\sum_{i=1}^n \lambda_i\ln x_i$              Roncalli (2013)
    (11)  Shannon's entropy    $\lambda\sum_{i=1}^n x_i\ln x_i$              Yu et al. (2014)

In a similar way, we list in Table 2 some popular regularization penalty functions that are used in the industry (Bruder et al., 2013; Bourgeron et al., 2018). The ridge and lasso regularizations are well-known in statistics and machine learning (Hastie et al., 2009). The log-barrier penalty function comes from the risk budgeting optimization problem, whereas Shannon's entropy is another approach for imposing a sufficient weight diversification.

Table 3: Some constraints used in portfolio optimization

    (12)  No cash and leverage      $\sum_{i=1}^n x_i = 1$
    (13)  No short selling          $x_i \geq 0$
    (14)  Weight bounds             $x_i^- \leq x_i \leq x_i^+$
    (15)  Asset class limits        $c_j^- \leq \sum_{i\in\mathcal{C}_j} x_i \leq c_j^+$
    (16)  Turnover                  $\sum_{i=1}^n \left|x_i - \tilde{x}_i\right| \leq \tau^+$
    (17)  Transaction costs         $\sum_{i=1}^n \left(c_i^-\left(\tilde{x}_i - x_i\right)_+ + c_i^+\left(x_i - \tilde{x}_i\right)_+\right) \leq \boldsymbol{c}^+$
    (18)  Leverage limit            $\sum_{i=1}^n \left|x_i\right| \leq \mathcal{L}^+$
    (19)  Long/short exposure       $-\mathcal{LS}^- \leq \sum_{i=1}^n x_i \leq \mathcal{LS}^+$
    (20)  Benchmarking              $\sqrt{\left(x - \tilde{x}\right)^\top\Sigma\left(x - \tilde{x}\right)} \leq \sigma^+$
    (21)  Tracking error floor      $\sqrt{\left(x - \tilde{x}\right)^\top\Sigma\left(x - \tilde{x}\right)} \geq \sigma^-$
    (22)  Active share floor        $\sum_{i=1}^n \left|x_i - \tilde{x}_i\right| \geq \mathcal{AS}^-$
    (23)  Number of active bets     $\left(x^\top x\right)^{-1} \geq \mathcal{N}^-$

Concerning the constraints, the most famous are the no cash/no leverage and no short selling restrictions.
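To make a few entries of Table 3 concrete, these quantities can be computed as follows (a simple sketch; the example weights are arbitrary):

```python
import numpy as np

def turnover(x, x_tilde):
    """Item (16): sum of absolute weight changes versus the current portfolio."""
    return np.abs(x - x_tilde).sum()

def leverage(x):
    """Item (18): gross exposure of the portfolio."""
    return np.abs(x).sum()

def tracking_error(x, x_tilde, Sigma):
    """Items (20)-(21): ex-ante tracking error with respect to the benchmark."""
    d = x - x_tilde
    return np.sqrt(d @ Sigma @ d)

def effective_bets(x):
    """Item (23): inverse of the Herfindahl index x'x."""
    return 1.0 / (x @ x)
```

For an equally-weighted portfolio of $n$ assets, `effective_bets` returns $n$, its maximum value.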
Weight bounds and asset class limits are also extensively used by practitioners. Turnover and transaction cost management may be an important topic when rebalancing a current portfolio $\tilde{x}$. When managing long/short portfolios, we generally impose leverage or long/short exposure limits. In the case of a benchmarked strategy, we might also want to have a tracking error limit with respect to the benchmark $\tilde{x}$. On the contrary, we can impose a minimum tracking error or active share in the case of active management. Finally, the Herfindahl constraint is used for some smart beta portfolios.

In what follows, we consider several portfolio optimization problems. Most of them are a combination of an objective function, one or two regularization penalty functions and some constraints that have been listed above. From an industrial point of view, it is interesting to implement the proximal operator for each item. In this approach, solving any portfolio optimization problem is equivalent to using CCD, ADMM, Dykstra and the appropriate proximal functions as Lego bricks. ($D$ is the dissimilarity matrix satisfying $D_{i,j} \geq 0$ and $D_{i,j} = D_{j,i}$.)

The global minimum variance (GMV) portfolio corresponds to the following optimization program:
$$x^\star = \arg\min_x \frac{1}{2}x^\top\Sigma x \quad \text{s.t.} \quad \mathbf{1}_n^\top x = 1$$
We know that the solution is $x^\star = \left(\mathbf{1}_n^\top\Sigma^{-1}\mathbf{1}_n\right)^{-1}\Sigma^{-1}\mathbf{1}_n$. In practice, nobody implements the GMV portfolio, because it is a long/short portfolio and it is not robust. Most of the time, professionals impose weight bounds: $0 \leq x_i \leq x^+$. However, this approach generally leads to corner solutions, meaning that a large number of optimized weights are equal to zero or to the upper bound, and very few assets have a weight within the range. With the emergence of smart beta portfolios, the minimum variance portfolio has gained popularity among institutional investors. For instance, we can find many passive indices based on this framework.
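The closed-form GMV solution can be checked numerically; the covariance matrix below is an arbitrary two-asset example:

```python
import numpy as np

def gmv(Sigma):
    """Global minimum variance weights: x = Sigma^{-1} 1 / (1' Sigma^{-1} 1)."""
    ones = np.ones(Sigma.shape[0])
    w = np.linalg.solve(Sigma, ones)   # Sigma^{-1} 1 without forming the inverse
    return w / w.sum()

Sigma = np.array([[0.04, 0.01], [0.01, 0.09]])
x = gmv(Sigma)
```

At the optimum, the first-order condition $\Sigma x = \lambda\mathbf{1}_n$ holds, so all components of $\Sigma x$ are equal.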
In order to increase the robustness of these portfolios, the first generation of minimum variance strategies used relative weight bounds with respect to a benchmark b:

δ⁻ bᵢ ≤ xᵢ ≤ δ⁺ bᵢ   (31)

where 0 < δ⁻ < 1 < δ⁺. For instance, the most popular scheme is to take δ⁻ = 0.5 and δ⁺ = 2. Nevertheless, the constraint (31) produces the illusion that the portfolio is diversified, because the optimized weights are different. In fact, portfolio weights are different because benchmark weights are different. The second generation of minimum variance strategies imposes a global diversification constraint. The most popular solution is based on the Herfindahl index H(x) = Σᵢ₌₁ⁿ xᵢ². This index takes the value 1 for a pure concentrated portfolio (∃ i : xᵢ = 1) and 1/n for an equally-weighted portfolio. Therefore, we can define the number of effective bets as the inverse of the Herfindahl index (Meucci, 2009): N(x) = H(x)⁻¹. The optimization program becomes:

x⋆ = arg minₓ ½ x⊤Σx   (32)
s.t. 1ₙ⊤x = 1, 0ₙ ≤ x ≤ x⁺ and N(x) ≥ N⁻

where N⁻ is the minimum number of effective bets. The Herfindahl constraint is equivalent to:

N(x) ≥ N⁻ ⇔ (x⊤x)⁻¹ ≥ N⁻ ⇔ x⊤x ≤ 1/N⁻

Therefore, a first solution to solve (32) is to consider the following QP problem:

x⋆(λ) = arg minₓ ½ x⊤Σx + λ x⊤x   (33)
s.t. 1ₙ⊤x = 1 and 0ₙ ≤ x ≤ x⁺

The objective function can be written as:

½ x⊤Σx + λ x⊤x = ½ x⊤ (Σ + 2λIₙ) x

Since N(x⋆(∞)) is equal to the number n of assets and N(x⋆(λ)) is an increasing function of λ, Problem (33) has a unique solution if N⁻ ∈ [N(x⋆(0)), n]. There is an optimal value λ⋆ such that for each λ ≥ λ⋆, we have N(x⋆(λ)) ≥ N⁻. Computing the optimal portfolio x⋆(λ⋆) therefore implies finding the solution λ⋆ of the non-linear equation N(x⋆(λ)) = N⁻. We generally use the bisection algorithm to determine the optimal solution λ⋆.

A second method is to consider the ADMM form:

{x⋆, y⋆} = arg min₍ₓ,ᵧ₎ ½ x⊤Σx + 1_{Ω₁}(x) + 1_{Ω₂}(y)   s.t. x = y

where Ω₁ = {x ∈ ℝⁿ : 1ₙ⊤x = 1, 0ₙ ≤ x ≤ x⁺} and Ω₂ = B(0ₙ, √(1/N⁻)) is the ℓ₂ ball of radius 1/√N⁻.
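The bisection scheme for λ⋆ can be sketched as follows; for simplicity the weight bounds are dropped, so that x⋆(λ) has a closed form (the covariance data and the target N⁻ below are assumptions for illustration):

```python
import numpy as np

def x_star(Sigma, lam):
    # Solution of min 0.5 x'(Sigma + 2*lam*I)x  s.t. 1'x = 1 (weight bounds dropped)
    n = Sigma.shape[0]
    z = np.linalg.solve(Sigma + 2.0 * lam * np.eye(n), np.ones(n))
    return z / z.sum()

def n_bets(x):
    # Number of effective bets N(x) = (x'x)^{-1}
    return 1.0 / (x @ x)

def min_var_herfindahl(Sigma, N_min, lam_max=1e4, n_bisect=100):
    # Bisection on lam: N(x*(lam)) increases towards n as lam grows
    lo, hi = 0.0, lam_max
    for _ in range(n_bisect):
        mid = 0.5 * (lo + hi)
        if n_bets(x_star(Sigma, mid)) < N_min:
            lo = mid
        else:
            hi = mid
    return x_star(Sigma, hi)

# Hypothetical 3-asset data: volatilities 10%, 20%, 30%, uniform correlation 0.5
vols = np.array([0.10, 0.20, 0.30])
corr = 0.5 * np.ones((3, 3)) + 0.5 * np.eye(3)
Sigma = np.outer(vols, vols) * corr
x = min_var_herfindahl(Sigma, N_min=2.8)
print(x, n_bets(x))
```

With the box constraints included, the inner problem becomes a QP and the same outer bisection applies unchanged.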
We deduce that the x-update is a QP problem:

x⁽ᵏ⁺¹⁾ = arg minₓ { ½ x⊤(Σ + φIₙ)x − φ x⊤(y⁽ᵏ⁾ − u⁽ᵏ⁾) + 1_{Ω₁}(x) }

whereas the y-update is the projection onto the ℓ₂ ball:

y⁽ᵏ⁺¹⁾ = (x⁽ᵏ⁺¹⁾ + u⁽ᵏ⁾) / max(1, √N⁻ ‖x⁽ᵏ⁺¹⁾ + u⁽ᵏ⁾‖₂)

A better approach is to write the problem as follows:

{x⋆, y⋆} = arg min₍ₓ,ᵧ₎ ½ x⊤Σx + 1_{Ω₁}(x) + 1_{Ω₂}(y)   s.t. x = y

where Ω₁ = Hyperplane[1ₙ, 1] and Ω₂ = Box[0ₙ, x⁺] ∩ B(0ₙ, √(1/N⁻)). In this case, the x- and y-updates become:

x⁽ᵏ⁺¹⁾ = arg minₓ { ½ x⊤(Σ + φIₙ)x − φ x⊤(y⁽ᵏ⁾ − u⁽ᵏ⁾) + 1_{Ω₁}(x) }
       = (Σ + φIₙ)⁻¹ ( φ(y⁽ᵏ⁾ − u⁽ᵏ⁾) + ((1 − 1ₙ⊤(Σ + φIₙ)⁻¹ φ(y⁽ᵏ⁾ − u⁽ᵏ⁾)) / (1ₙ⊤(Σ + φIₙ)⁻¹ 1ₙ)) 1ₙ )

and:

y⁽ᵏ⁺¹⁾ = P_{Box−Ball}( x⁽ᵏ⁺¹⁾ + u⁽ᵏ⁾; 0ₙ, x⁺, 0ₙ, √(1/N⁻) )

where P_{Box−Ball} corresponds to Dykstra's algorithm given in Appendix A.8.8 on page 59 (see Appendix A.4 on page 52 for the derivation of the x-update).

In Table 4, we report the solutions found by the ADMM algorithm for several values of N⁻. When there is no Herfindahl constraint, the portfolio is fully invested in the seventh stock; here the upper bound x⁺ is set to 1ₙ. When N⁻ is equal to the number n of stocks, we verify that the solution corresponds to the equally-weighted portfolio. Between these two limit cases, we see the impact of the Herfindahl constraint on the portfolio diversification.
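The operator P_Box−Ball can be sketched with a generic implementation of Dykstra's algorithm (our own illustration, not the paper's appendix code):

```python
import numpy as np

def proj_box(v, lb, ub):
    # Euclidean projection onto Box[lb, ub]
    return np.clip(v, lb, ub)

def proj_ball(v, center, radius):
    # Euclidean projection onto the l2 ball B(center, radius)
    d = np.linalg.norm(v - center)
    return v if d <= radius else center + radius * (v - center) / d

def proj_box_ball(v, lb, ub, center, radius, n_iter=1000):
    # Dykstra's algorithm: projection onto Box[lb, ub] ∩ B(center, radius)
    x = v.astype(float).copy()
    p = np.zeros_like(x)  # correction term for the box projection
    q = np.zeros_like(x)  # correction term for the ball projection
    for _ in range(n_iter):
        y = proj_box(x + p, lb, ub)
        p = x + p - y
        x = proj_ball(y + q, center, radius)
        q = y + q - x
    return x
```

For the Herfindahl constraint, the ball parameters are center 0ₙ and radius 1/√N⁻; unlike naive alternating projections, the correction terms p and q make the iterates converge to the true projection onto the intersection.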
Table 4: Minimum variance portfolios for several values of N⁻ (in %)

N⁻            1.00    2.00    3.00    4.00    5.00    6.00    6.50    7.00    7.50    8.00
x⋆₁           0.00    3.22    9.60   13.83   15.18   15.05   14.69   14.27   13.75   12.50
x⋆₂           0.00   12.75   14.14   15.85   16.19   15.89   15.39   14.82   14.13   12.50
x⋆₃           0.00    0.00    0.00    0.00    0.00    0.07    2.05    4.21    6.79   12.50
x⋆₄           0.00   10.13   15.01   17.38   17.21   16.09   15.40   14.72   13.97   12.50
x⋆₅           0.00    0.00    0.00    0.00    0.71    5.10    6.33    7.64    9.17   12.50
x⋆₆           0.00    5.36    8.95   12.42   13.68   14.01   13.80   13.56   13.25   12.50
x⋆₇         100.00   68.53   52.31   40.01   31.52   25.13   22.92   20.63   18.00   12.50
x⋆₈           0.00    0.00    0.00    0.50    5.51    8.66    9.41   10.14   10.95   12.50
λ⋆ (in %)     0.00    1.59    3.10    5.90   10.38   18.31   23.45   31.73     ≈49      ∞

As explained before, we can also solve the optimization problem by combining Problem (33) and the bisection algorithm. This is why we have reported the corresponding value λ⋆ in the last row in Table 4. However, this approach is no longer valid if we consider diversification constraints that are not quadratic. For instance, let us consider the generalized minimum variance problem:

x⋆ = arg minₓ ½ x⊤Σx   (34)
s.t. 1ₙ⊤x = 1, 0ₙ ≤ x ≤ x⁺ and D(x) ≥ D⁻

where D(x) is a weight diversification measure and D⁻ is the minimum acceptable diversification. For example, we can use Shannon's entropy, the Gini index or the diversification ratio. In this case, it is not possible to obtain an equivalent QP problem, whereas the ADMM algorithm is exactly the same as previously except for the y-update:

y⁽ᵏ⁺¹⁾ = P_{Box[0ₙ, x⁺] ∩ D}( x⁽ᵏ⁺¹⁾ + u⁽ᵏ⁾ )

where D = {x ∈ ℝⁿ : D(x) ≥ D⁻}. The projection onto D can be easily derived from the proximal operator of the dual function (see the tips and tricks on page 42).

Remark 7
If we compare the computational times, we observe that the best method is the second version of the ADMM algorithm. In our example, the computational time is divided by a factor of eight with respect to the bisection approach. For large-scale problems with many more assets, the computational time is generally divided by an even greater factor. In contrast, the first version of the ADMM algorithm is not efficient, since its computational time is multiplied by a factor of five with respect to the bisection approach.
Another big challenge of the minimum variance portfolio is the management of the turnover between two rebalancing dates. Let x_t be the current portfolio. The optimization program for calibrating the optimal solution x_{t+1} for the next rebalancing date t+1 may include a penalty function c(x | x_t) and/or a weight constraint C(x | x_t) that are parameterized with respect to the current portfolio x_t:

x_{t+1} = arg minₓ ½ x⊤Σx + c(x | x_t)   (35)
s.t. 1ₙ⊤x = 1, 0ₙ ≤ x ≤ x⁺ and x ∈ C(x | x_t)

Again, we can solve this problem using the ADMM algorithm. Thanks to Dykstra's algorithm, the only difficulty is finding the proximal operator of c(x | x_t) or C(x | x_t) when performing the y-update.

Let us define the cost function as:

c(x | x_t) = λ Σᵢ₌₁ⁿ ( cᵢ⁻ (x_{i,t} − xᵢ)₊ + cᵢ⁺ (xᵢ − x_{i,t})₊ )

where cᵢ⁻ and cᵢ⁺ are the bid and ask transaction costs. In Appendix A.8.12 on page 62, we show that the proximal operator is:

prox_{c(x|x_t)}(v) = x_t + S(v − x_t; λc⁻, λc⁺)   (36)

where S(v; λ⁻, λ⁺) = (v − λ⁺)₊ − (v + λ⁻)₋ is the two-sided soft-thresholding operator. If we define the cost constraint C(x | x_t) as a turnover constraint:

C(x | x_t) = { x ∈ ℝⁿ : ‖x − x_t‖₁ ≤ τ⁺ }

the proximal operator is:

P_C(v) = v − sign(v − x_t) ⊙ min(|v − x_t|, s⋆)   (37)

where s⋆ = { s ∈ ℝ : Σᵢ₌₁ⁿ (|vᵢ − x_{t,i}| − s)₊ = τ⁺ }.

Remark 8
These two examples are very basic and show how we can easily introduce turnover management using the ADMM framework. More sophisticated approaches are presented in Section 4.4 on page 42.
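Both proximal operators (36) and (37) have direct vectorized implementations; a sketch, where the cost parameters are assumptions and the threshold s⋆ of Equation (37) is found with a standard sort-based scheme:

```python
import numpy as np

def prox_costs(v, x_t, lam, c_minus, c_plus):
    # Equation (36): x_t + S(v - x_t; lam*c-, lam*c+), two-sided soft-thresholding
    d = v - x_t
    return x_t + np.maximum(d - lam * c_plus, 0.0) - np.maximum(-(d + lam * c_minus), 0.0)

def proj_turnover(v, x_t, tau):
    # Equation (37): projection onto {x : ||x - x_t||_1 <= tau}
    d = v - x_t
    if np.abs(d).sum() <= tau:
        return v.copy()
    u = np.sort(np.abs(d))[::-1]          # sorted absolute deviations
    css = np.cumsum(u)
    k = np.nonzero(u * np.arange(1, d.size + 1) > css - tau)[0][-1]
    s = (css[k] - tau) / (k + 1.0)        # s* solves sum((|d_i| - s)_+) = tau
    return v - np.sign(d) * np.minimum(np.abs(d), s)
```

The turnover projection is the classical sort-based ℓ₁-ball projection applied to the deviation v − x_t, which is why the same threshold s⋆ appears in both formulations.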
In this section, we consider three main models of smart beta portfolios: the equal risk contribution (ERC) portfolio, the risk budgeting (RB) portfolio and the most diversified portfolio (MDP). Specific algorithms for these portfolios based on the CCD method have already been presented in Griveau-Billion et al. (2013) and Richard and Roncalli (2015, 2019). We extend these results to the ADMM algorithm.
The ERC portfolio uses the volatility risk measure σ(x) = √(x⊤Σx) and allocates the weights such that the risk contribution is the same for all the assets of the portfolio (Maillard et al., 2010):

RC_i(x) = xᵢ ∂σ(x)/∂xᵢ = xⱼ ∂σ(x)/∂xⱼ = RC_j(x)

In this case, we can show that the ERC portfolio is the scaled solution x⋆/(1ₙ⊤x⋆) where x⋆ is given by:

x⋆ = arg minₓ ½ x⊤Σx − λ Σᵢ₌₁ⁿ ln xᵢ

and λ is any positive scalar. The first-order condition is (Σx)ᵢ − λxᵢ⁻¹ = 0. It follows that xᵢ(Σx)ᵢ − λ = 0 or:

xᵢ² σᵢ² + xᵢ σᵢ Σ_{j≠i} xⱼ ρᵢ,ⱼ σⱼ − λ = 0

We deduce that the solution is the positive root of this second-degree equation. Finally, we obtain the following iteration for the CCD algorithm:

xᵢ⁽ᵏ⁺¹⁾ = ( −vᵢ⁽ᵏ⁺¹⁾ + √( (vᵢ⁽ᵏ⁺¹⁾)² + 4λσᵢ² ) ) / (2σᵢ²)   (38)

where:

vᵢ⁽ᵏ⁺¹⁾ = σᵢ ( Σ_{j<i} xⱼ⁽ᵏ⁺¹⁾ ρᵢ,ⱼ σⱼ + Σ_{j>i} xⱼ⁽ᵏ⁾ ρᵢ,ⱼ σⱼ )

The ADMM algorithm uses the first trick where f_x(x) = ½ x⊤Σx and f_y(y) = −λ Σᵢ₌₁ⁿ ln yᵢ. It follows that the x- and y-update steps are:

x⁽ᵏ⁺¹⁾ = (Σ + φIₙ)⁻¹ φ (y⁽ᵏ⁾ − u⁽ᵏ⁾)

and:

yᵢ⁽ᵏ⁺¹⁾ = ½ ( (xᵢ⁽ᵏ⁺¹⁾ + uᵢ⁽ᵏ⁾) + √( (xᵢ⁽ᵏ⁺¹⁾ + uᵢ⁽ᵏ⁾)² + 4λφ⁻¹ ) )

We apply the CCD and ADMM algorithms to a given parameter set.
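The CCD iteration (38) translates directly into code; a minimal sketch on a hypothetical four-asset parameter set (the volatilities, the correlation and λ below are assumptions, not the paper's data):

```python
import numpy as np

def ccd_erc(Sigma, lam=1.0, n_cycles=200, tol=1e-12):
    # CCD for x* = argmin 0.5 x'Σx - lam * sum_i log(x_i), cf. Equation (38)
    n = Sigma.shape[0]
    x = np.ones(n) / n
    for _ in range(n_cycles):
        x_old = x.copy()
        for i in range(n):
            # v_i = sigma_i * sum_{j != i} x_j * rho_ij * sigma_j = (Σx)_i - Σ_ii x_i
            v = Sigma[i] @ x - Sigma[i, i] * x[i]
            x[i] = (-v + np.sqrt(v * v + 4.0 * lam * Sigma[i, i])) / (2.0 * Sigma[i, i])
        if np.max(np.abs(x - x_old)) < tol:
            break
    return x / x.sum()  # rescale to the simplex: the ERC portfolio

# Hypothetical data
vols = np.array([0.10, 0.20, 0.25, 0.30])
corr = 0.5 * np.ones((4, 4)) + 0.5 * np.eye(4)
Sigma = np.outer(vols, vols) * corr
x = ccd_erc(Sigma)
rc = x * (Sigma @ x) / np.sqrt(x @ Sigma @ x)  # risk contributions
print(x, rc)
```

At convergence, all risk contributions are equal, which is the defining property of the ERC portfolio and a convenient unit test.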
If we set λ = √(x⁽⁰⁾⊤Σx⁽⁰⁾), x⁽⁰⁾ = n⁻¹1ₙ and φ = 1, the CCD algorithm needs six cycles to converge whereas the ADMM algorithm needs 156 iterations for the chosen convergence criterion ε, where the termination rule is defined as maxᵢ |xᵢ⁽ᵏ⁺¹⁾ − xᵢ⁽ᵏ⁾| ≤ ε. Whatever the values of λ, x⁽⁰⁾ and ε, our experience is that the CCD generally converges within less than 15 cycles even if the number of assets is greater than 1,000. The convergence of the ADMM is more of an issue, because it depends on the parameters λ and φ. In Figure 5, we have reported the number of iterations of the ADMM with respect to φ for several values of ε when λ = 1 and x⁽⁰⁾ = 1ₙ. We verify that it is very sensitive to the value taken by φ. Curiously, the parameter λ has little influence, meaning that the convergence issue mainly concerns the x-update step.

The ERC portfolio has been extended by Roncalli (2013) to the case where the risk budgets are not equal and the risk measure R(x) is convex and coherent:

RC_i(x) = xᵢ ∂R(x)/∂xᵢ = RB_i

where RB_i is the risk budget allocated to Asset i. In this case, we can show that the risk budgeting portfolio is the scaled solution of the following optimization problem:

x⋆ = arg minₓ R(x) − λ Σᵢ₌₁ⁿ RB_i · ln xᵢ

where λ is any positive scalar. Depending on the risk measure, we can use the CCD or the ADMM algorithm.

For example, Roncalli (2015) proposes using the standard deviation-based risk measure:

R(x) = −x⊤(μ − r) + ξ √(x⊤Σx)

In this case, the first-order condition for defining the CCD algorithm is:

−(μᵢ − r) + ξ (Σx)ᵢ / √(x⊤Σx) − λ RB_i / xᵢ = 0

It follows that ξ xᵢ (Σx)ᵢ − (μᵢ − r) xᵢ σ(x) − λσ(x) · RB_i = 0, or equivalently:

αᵢ xᵢ² + βᵢ xᵢ + γᵢ = 0

where αᵢ = ξσᵢ², βᵢ = ξσᵢ Σ_{j≠i} xⱼ ρᵢ,ⱼ σⱼ − (μᵢ − r) σ(x) and γᵢ = −λσ(x) · RB_i.
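Treating σ(x) as fixed within each coordinate update, the positive root of this quadratic gives a CCD iteration for the risk budgeting portfolio; a sketch with hypothetical inputs (the expected returns, risk budgets, ξ and λ are assumptions):

```python
import numpy as np

def ccd_rb(Sigma, mu, rb, r=0.0, xi=1.0, lam=1.0, n_cycles=500, tol=1e-12):
    # CCD for R(x) = -x'(mu - r) + xi*sqrt(x'Σx) with risk budgets rb
    n = Sigma.shape[0]
    x = np.ones(n) / n
    for _ in range(n_cycles):
        x_old = x.copy()
        for i in range(n):
            sigma_x = np.sqrt(x @ Sigma @ x)  # treated as constant within the update
            alpha = xi * Sigma[i, i]
            beta = xi * (Sigma[i] @ x - Sigma[i, i] * x[i]) - (mu[i] - r) * sigma_x
            gamma = -lam * sigma_x * rb[i]
            x[i] = (-beta + np.sqrt(beta * beta - 4.0 * alpha * gamma)) / (2.0 * alpha)
        if np.max(np.abs(x - x_old)) < tol:
            break
    return x / x.sum()

# Hypothetical data
vols = np.array([0.10, 0.20, 0.25, 0.30])
corr = 0.5 * np.ones((4, 4)) + 0.5 * np.eye(4)
Sigma = np.outer(vols, vols) * corr
mu = np.array([0.04, 0.05, 0.06, 0.07])
rb = np.array([0.10, 0.20, 0.30, 0.40])
x = ccd_rb(Sigma, mu, rb)
mr = -mu + Sigma @ x / np.sqrt(x @ Sigma @ x)  # marginal risks of R
rc = x * mr
print(x, rc / rb)
```

Since R is homogeneous of degree one, the risk contributions of the rescaled solution remain proportional to the risk budgets, which is the property checked below.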
We notice that the solution xᵢ depends on the volatility σ(x). Here, we face an endogenous problem, because σ(x) depends on xᵢ. Griveau-Billion et al. (2013) notice that this is not an issue, because we may assume that σ(x) is almost constant between two coordinate iterations of the CCD algorithm. They deduce that the coordinate solution is then the positive root of the second-degree equation:

xᵢ⁽ᵏ⁺¹⁾ = ( −βᵢ⁽ᵏ⁺¹⁾ + √( (βᵢ⁽ᵏ⁺¹⁾)² − 4 αᵢ⁽ᵏ⁺¹⁾ γᵢ⁽ᵏ⁺¹⁾ ) ) / ( 2 αᵢ⁽ᵏ⁺¹⁾ )   (39)

where:

αᵢ⁽ᵏ⁺¹⁾ = ξσᵢ²
βᵢ⁽ᵏ⁺¹⁾ = ξσᵢ ( Σ_{j<i} xⱼ⁽ᵏ⁺¹⁾ ρᵢ,ⱼ σⱼ + Σ_{j>i} xⱼ⁽ᵏ⁾ ρᵢ,ⱼ σⱼ ) − (μᵢ − r) σᵢ⁽ᵏ⁺¹⁾(x)
γᵢ⁽ᵏ⁺¹⁾ = −λ σᵢ⁽ᵏ⁺¹⁾(x) · RB_i
σᵢ⁽ᵏ⁺¹⁾(x) = √(χ⊤Σχ)   with   χ = ( x₁⁽ᵏ⁺¹⁾, ..., x_{i−1}⁽ᵏ⁺¹⁾, xᵢ⁽ᵏ⁾, x_{i+1}⁽ᵏ⁾, ..., xₙ⁽ᵏ⁾ )

In the case of the volatility or the standard deviation-based risk measure, we apply the exact formulation of the CCD algorithm because we have an analytical solution of the first-order condition. This is not always the case, especially when we consider a skewness-based risk measure (Bruder et al., 2016; Lezmi et al., 2018). In this case, we can use the gradient formulation of the CCD algorithm or the ADMM algorithm, which is defined as follows:

x⁽ᵏ⁺¹⁾ = prox_{φ⁻¹R(x)}( y⁽ᵏ⁾ − u⁽ᵏ⁾ )
v_y⁽ᵏ⁺¹⁾ = x⁽ᵏ⁺¹⁾ + u⁽ᵏ⁾
y⁽ᵏ⁺¹⁾ = ½ ( v_y⁽ᵏ⁺¹⁾ + √( v_y⁽ᵏ⁺¹⁾ ⊙ v_y⁽ᵏ⁺¹⁾ + 4λφ⁻¹ · RB ) )
u⁽ᵏ⁺¹⁾ = u⁽ᵏ⁾ + x⁽ᵏ⁺¹⁾ − y⁽ᵏ⁺¹⁾

Choueifaty and Coignard (2008) introduce the concept of diversification ratio, which corresponds to the following expression:

DR(x) = ( Σᵢ₌₁ⁿ xᵢσᵢ ) / σ(x) = x⊤σ / √(x⊤Σx)

By construction, the diversification ratio of a portfolio fully invested in one asset is equal to one, whereas it is larger than one in the general case. The authors then propose building the most diversified portfolio as the portfolio which maximizes the diversification ratio.
It is also the solution to the following minimization problem:

x⋆ = arg minₓ ½ ln(x⊤Σx) − ln(x⊤σ)
s.t. 1ₙ⊤x = 1 and x ∈ Ω

(see Choueifaty et al., 2013). We can also add a weight diversification constraint D(x) ≥ D⁻. For example, we can assume that the number of effective bets N(x) is larger than a minimum acceptable value N⁻. Contrary to the minimum variance portfolio, we do not obtain a QP problem and we observe that the optimization problem is tricky in practice. Thanks to the ADMM algorithm, we can however simplify the optimization problem by splitting the constraints and using the same approach that has already been described on page 34. The x-update consists in finding the regularized standard MDP:

x⁽ᵏ⁺¹⁾ = arg minₓ { ½ ln(x⊤Σx) − ln(x⊤σ) + ½φ ‖x − y⁽ᵏ⁾ + u⁽ᵏ⁾‖₂²   s.t. 1ₙ⊤x = 1 }

whereas the y-update corresponds to the projection onto the intersection of Ω and D = {x ∈ ℝⁿ : D(x) ≥ D⁻}:

y⁽ᵏ⁺¹⁾ = P_{Ω∩D}( x⁽ᵏ⁺¹⁾ + u⁽ᵏ⁾ )

In Table 5, we report results for a given parameter set. The first column corresponds to the long/short MDP (Ω = ℝⁿ). By definition, we cannot compute its number of effective bets because it contains short positions. The other columns correspond to the long-only MDP (Ω = [0,1]ⁿ) when we impose a sufficient number of effective bets N⁻. We notice that the traditional long-only MDP is poorly diversified in terms of weights, since we have N(x) = 2.30.
As for the minimum variance portfolio, the MDP tends to the equally-weighted portfolio when N⁻ tends to the number of assets. In the long/short solution, assets 4 to 8 are shorted.

Table 5: MDP portfolios (in %)

             L/S   Long-only
N⁻                    1.00    3.00    4.00    5.00    6.00    7.00
x⋆₁            –     41.04   35.74   30.29   26.08   22.44      –
x⋆₂            –     50.92   43.91   36.68   31.05   26.12      –
x⋆₃            –      8.05   10.12   11.52   12.33   12.80      –
x⋆₄            –      0.00    2.48    5.12    7.16    8.90      –
x⋆₅            –      0.00    0.92    2.28    3.60    5.02      –
x⋆₆            –      0.00    2.03    4.36    6.28    8.02      –
x⋆₇            –      0.00    3.47    6.68    8.85   10.44      –
x⋆₈            –      0.00    1.32    3.07    4.65    6.27      –
N(x)           –      2.30    3.00    4.00    5.00    6.00    7.00

Today's financial industry is facing a digital revolution in all areas: payment services, online banking, asset management, etc. This is particularly true for the financial advisory industry, which has been impacted in the last few years by the emergence of digitalization and robo-advisors. The demand for robo-advisors is strong, which explains the growth of this business. How does one characterize a robo-advisor? This is not simple, but the underlying idea is to build a systematic portfolio allocation in order to provide a customized advisory service. A robo-advisor has two main objectives. The first objective is to know the investor better than a traditional asset manager. Because of this better knowledge, the robo-advisor may propose a more appropriate asset allocation. The second objective is to perform the task in a systematic way and to build an automated rebalancing process. Ultimately, the goal is to offer a customized solution. In fact, the reality is very different. We generally notice that many robo-advisors are more of a web or digital application than a real robo-advisor. The reason is that portfolio optimization is a very difficult task. In many robo-advisors, asset allocation is then rather human-based or not completely systematic, with the aim of rectifying the shortcomings of mean-variance optimization. Over the next five years, the most important challenge for robo-advisors will be to reduce these discretionary decisions and improve the robustness of their systematic asset allocation process. But this means that robo-advisors must give up the quadratic programming world of portfolio allocation.

In order to make mean-variance optimization more robust, two directions can be followed. The first one has been largely explored and adds a penalty function in order to regularize or sparsify the solution (Brodie et al., 2009; Carrasco and Noumon, 2010; Bruder et al., 2013; Bourgeron et al.
, 2018). The second one consists in changing the objective function and considering risk budgeting portfolios instead of mean-variance optimized portfolios (Maillard et al., 2010; Roncalli, 2013). Even if this second direction has encountered great success, it presents a solution that is not sufficiently flexible in terms of active management. Nevertheless, these two directions are not so divergent. Indeed, Roncalli (2013) shows that risk budgeting optimization can be viewed as a non-linear shrinkage approach of the minimum variance optimization. Richard and Roncalli (2015) then propose a unified approach of smart beta portfolios by considering alternative allocation models as penalty functions of the minimum variance optimization. In particular, they use the logarithmic barrier function in order to regularize minimum variance portfolios. This idea has also been reiterated by de Jong (2018), who considers a mean-variance framework. Therefore, we propose defining the robo-advisor optimization problem as follows:

x⋆_{t+1} = arg minₓ f_Robo(x)   (40)
s.t. 1ₙ⊤x = 1, 0ₙ ≤ x ≤ 1ₙ and x ∈ Ω

where:

f_Robo(x) = ½ (x − b)⊤Σ_t(x − b) − γ (x − b)⊤μ_t
          + ϱ₁ ‖Γ₁(x − x_t)‖₁ + ½ ϱ₂ ‖Γ₂(x − x_t)‖₂²
          + ϱ̃₁ ‖Γ̃₁(x − x̃)‖₁ + ½ ϱ̃₂ ‖Γ̃₂(x − x̃)‖₂²
          − λ Σᵢ₌₁ⁿ RB_i · ln xᵢ   (41)

As an illustration of the demand mentioned above, the growth of this business was 60% per year in the US over the last five years; in Europe, the growth is also impressive, even though the market is smaller, with assets under management increasing 14-fold in the last two years. In Equation (41), b is the benchmark portfolio, x̃ is the reference portfolio and x_t is the current portfolio. This specification is sufficiently broad that it encompasses most models used by the industry. We notice that the objective function is made up of three parts. The first part corresponds to the MVO objective function with a benchmark. If we set b equal to 0ₙ, we obtain the Markowitz utility function.
The second part contains ℓ₁- and ℓ₂-norm penalty functions. The regularization can be done with respect to the current allocation x_t in order to control the rebalancing process and the smoothness of the dynamic allocation. The regularization can also be done with respect to a reference portfolio, which is generally the strategic asset allocation of the fund. The idea is to control the profile of the fund. For example, if the strategic asset allocation is an 80/20 asset mix policy, we do not want the portfolio to present a defensive or balanced risk profile. Finally, the third part of the objective function corresponds to the logarithmic barrier function, where the parameter λ controls the trade-off between MVO optimization and RB optimization. This last part is very important in order to make the dynamic asset allocation more robust. The hyperparameters of the objective function are ϱ₁, ϱ₂, ϱ̃₁, ϱ̃₂ and λ. They are all positive and can also be set to zero in order to deactivate a penalty function. For instance, ϱ₂ and ϱ̃₂ are equal to zero if we don't want to have a shrinkage of the covariance matrix Σ_t. The hyperparameters ϱ₁ and ϱ̃₁ can also be equal to zero because the ℓ₁ regularization is generally introduced when specifying the additional constraints Ω. The parameter γ is not really a hyperparameter, because it is generally calibrated to target volatility or an expected return. We also notice that this model encompasses the Black-Litterman model thanks to the specification of μ_t (Bourgeron et al., 2018). Another important component of this framework is the specification of the set x ∈ Ω. It may include traditional constraints such as weight bounds and/or asset class limits, but we can also add non-linear constraints such as a turnover limit, an active share floor or a weight diversification constraint.
Problem (40) is equivalent to solving:

x⋆_{t+1} = arg minₓ f⁺_Robo(x)

where the objective function can be broken down as follows:

f⁺_Robo(x) = f_MVO(x) + f_ℓ₁(x) + f_ℓ₂(x) + f_RB(x) + 1_{Ω₀}(x) + 1_{Ω}(x)

where:

f_MVO(x) = ½ (x − b)⊤Σ_t(x − b) − γ (x − b)⊤μ_t
f_ℓ₁(x) = ϱ₁ ‖Γ₁(x − x_t)‖₁ + ϱ̃₁ ‖Γ̃₁(x − x̃)‖₁
f_ℓ₂(x) = ½ ϱ₂ ‖Γ₂(x − x_t)‖₂² + ½ ϱ̃₂ ‖Γ̃₂(x − x̃)‖₂²
f_RB(x) = −λ Σᵢ₌₁ⁿ RB_i · ln xᵢ

and Ω₀ = {x ∈ [0,1]ⁿ : 1ₙ⊤x = 1}. The ADMM algorithm is implemented as follows:

{x⋆, y⋆} = arg min f_x(x) + f_y(y)   s.t. x − y = 0ₙ

There are several ways to break down f⁺_Robo into f_x and f_y. However, in order to be efficient, the x- and y-update steps of the ADMM algorithm must be easy to compute. Therefore, we impose that the x-step is solved using QP or CCD methods while the y-step is solved using Dykstra's algorithm, where each component corresponds to an analytical proximal operator. Moreover, we also split the set of constraints Ω into a set of linear constraints Ω_Linear and a set of non-linear constraints Ω_Nonlinear. This leads to defining f_x(x) and f_y(y) as follows:

f_x(x) = f_MVO(x) + f_ℓ₂(x) + 1_{Ω₀}(x) + 1_{Ω_Linear}(x)
f_y(y) = f_ℓ₁(y) + f_RB(y) + 1_{Ω_Nonlinear}(y)   (42)

We notice that f_x(x) has a quadratic form, implying that the x-step may be solved using a QP algorithm. Another formulation is:

f_x(x) = f_MVO(x) + f_ℓ₂(x) + f_RB(x)
f_y(y) = f_ℓ₁(y) + 1_{Ω₀}(y) + 1_{Ω_Linear}(y) + 1_{Ω_Nonlinear}(y)   (43)

In this case, the x-step is solved using the CCD algorithm.
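To make the splitting concrete, the following self-contained sketch runs an ADMM loop on a stripped-down instance: f_x collects a quadratic objective plus the budget hyperplane (solved analytically), and f_y the box constraint (handled by projection). The three-asset data are hypothetical:

```python
import numpy as np

def admm_box_gmv(Sigma, x_plus, phi=0.1, n_iter=10000):
    # ADMM splitting: f_x(x) = 0.5 x'Σx + 1{1'x = 1}, f_y(y) = 1{0 <= y <= x_plus}
    n = Sigma.shape[0]
    ones = np.ones(n)
    M = np.linalg.inv(Sigma + phi * np.eye(n))
    y, u = ones / n, np.zeros(n)
    for _ in range(n_iter):
        b = phi * (y - u)
        nu = (ones @ M @ b - 1.0) / (ones @ M @ ones)
        x = M @ (b - nu * ones)            # x-update: equality-constrained quadratic step
        y = np.clip(x + u, 0.0, x_plus)    # y-update: projection onto the box
        u = u + x - y                      # dual update
    return y

# Hypothetical data: volatilities 15%, 20%, 25%, uniform correlation 0.5
vols = np.array([0.15, 0.20, 0.25])
corr = 0.5 * np.ones((3, 3)) + 0.5 * np.eye(3)
Sigma = np.outer(vols, vols) * corr
y = admm_box_gmv(Sigma, x_plus=0.5)
print(y)
```

In the full robo-advisor problem, the analytical x-update is replaced by a QP or CCD step on Q and R, and the box projection by a Dykstra composition of the proximal operators of f_ℓ₁, f_RB and the non-linear constraints; the loop structure is unchanged.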
If we consider Formulation (42), we have:

f_QP(x) = f_MVO(x) + f_ℓ₂(x)
        = ½ (x − b)⊤Σ_t(x − b) − γ (x − b)⊤μ_t + ½ ϱ₂ ‖Γ₂(x − x_t)‖₂² + ½ ϱ̃₂ ‖Γ̃₂(x − x̃)‖₂²
        = ½ x⊤Qx − x⊤R + C

where Q = Σ_t + ϱ₂Γ₂⊤Γ₂ + ϱ̃₂Γ̃₂⊤Γ̃₂, R = γμ_t + Σ_t b + ϱ₂Γ₂⊤Γ₂x_t + ϱ̃₂Γ̃₂⊤Γ̃₂x̃ and C is a constant. Using the fourth ADMM trick, we deduce that x⁽ᵏ⁺¹⁾ is the solution of the following QP problem:

x⁽ᵏ⁺¹⁾ = arg minₓ ½ x⊤(Q + φIₙ)x − x⊤( R + φ(y⁽ᵏ⁾ − u⁽ᵏ⁾) )
s.t. 1ₙ⊤x = 1 and 0ₙ ≤ x ≤ 1ₙ

Since the proximal operators of f_ℓ₁ and f_RB have already been computed, finding y⁽ᵏ⁺¹⁾ is straightforward with Dykstra's algorithm as long as the proximal of each non-linear constraint is known.

The ADMM-CCD formulation
If we consider Formulation (43), we have:

f_x(x) = f_QP(x) − λ Σᵢ₌₁ⁿ RB_i · ln xᵢ

where the expression of f_QP(x) is computed in Appendix A.9 on page 63. Writing vᵢ = Σ_{j<i} Q_{i,j} xⱼ⁽ᵏ⁺¹⁾ + Σ_{j>i} Q_{i,j} xⱼ⁽ᵏ⁾, the CCD x-update is:

xᵢ⁽ᵏ⁺¹⁾ = ( Rᵢ − vᵢ + √( (vᵢ − Rᵢ)² + 4 λᵢ Q_{i,i} ) ) / ( 2 Q_{i,i} )

where the matrices Q and R are defined as:

Q = Σ_t + ϱ₂Γ₂⊤Γ₂ + ϱ̃₂Γ̃₂⊤Γ̃₂ + φIₙ

and:

R = γμ_t + Σ_t b + ϱ₂Γ₂⊤Γ₂x_t + ϱ̃₂Γ̃₂⊤Γ̃₂x̃ + φ(y⁽ᵏ⁾ − u⁽ᵏ⁾)

and λᵢ = λ · RB_i. As in the ADMM-QP formulation, the y-update step does not pose any particular difficulties.

If we consider the different portfolio optimization approaches presented in Table 1, we have shown how to solve the MVO (1), GMV (2), MDP (3), ERC (4) and RB (5) models. The RQE (7) model is equivalent to the GMV (2) model by replacing the covariance matrix Σ by the dissimilarity matrix D. Below, we implement the Kullback-Leibler model (6) of Bera and Park (2008) using the ADMM framework. Concerning the regularization problems in Table 2, the ridge (8), lasso (9) and log-barrier (10) penalty functions have already been covered. Indeed, ridge and lasso penalizations correspond to the proximal operator of ℓ₂- and ℓ₁-norm functions by applying the translation g(x) = x − x̃. Shannon's entropy (11) penalization is discussed below. For the constraints that are considered in Table 3, imposing no cash and leverage (12) is done with the proximal of the hyperplane Hyperplane[1ₙ, 1], while no short selling and weight bounds correspond to the projections onto Box[0ₙ, ∞] and Box[x⁻, x⁺]. Asset class limits can be implemented using the projection onto the intersection of the two half-spaces Halfspace[1_{Cⱼ}, cⱼ⁺] and Halfspace[−1_{Cⱼ}, −cⱼ⁻]. The proximal of the turnover (16) has already been given in Equation (37) on page 35. If we want to impose an upper limit on transaction costs (17), we use the Moreau decomposition and Equation (36). Finally, Section 4.1.1 on page 31 dealt with the weight diversification problem of the number of active bets.
Therefore, it remains to solve the leverage limit (18), the long/short exposure (19) restrictions and the active management constraints: benchmarking (20), tracking error floor (21) and active share floor (22).

We first consider the μ-problem and the σ-problem. Targeting a minimum expected return μ(x) ≥ μ⋆ can be implemented in the ADMM framework using the proximal operator of the half-space Halfspace[−μ, −μ⋆], since μ(x) ≥ μ⋆ ⇔ x⊤μ ≥ μ⋆ ⇔ −μ⊤x ≤ −μ⋆. In the case of the σ-problem σ(x) ≤ σ⋆, we use the fourth ADMM trick. Let L be the lower Cholesky decomposition of Σ. We have:

σ(x) ≤ σ⋆ ⇔ √(x⊤Σx) ≤ σ⋆ ⇔ √(x⊤(LL⊤)x) ≤ σ⋆ ⇔ ‖y‖₂ ≤ σ⋆

where y = L⊤x. It follows that the proximal of the y-update is the projection onto the ℓ₂ ball B(0ₙ, σ⋆).

If we impose a leverage limit Σᵢ₌₁ⁿ |xᵢ| ≤ L⁺, we have ‖x‖₁ ≤ L⁺ and the proximal of the y-update is the projection onto the ℓ₁ ball B₁(0ₙ, L⁺). If the leverage constraint concerns the long/short limits −LS⁻ ≤ Σᵢ₌₁ⁿ xᵢ ≤ LS⁺, we consider the intersection of the two half-spaces Halfspace[1ₙ, LS⁺] and Halfspace[−1ₙ, LS⁻]. If we consider an absolute leverage |Σᵢ₌₁ⁿ xᵢ| ≤ L⁺, we obtain the previous case with LS⁻ = LS⁺ = L⁺. Portfolio managers can also use another constraint concerning the sum of the k largest values (an example is the 5/10/40 UCITS rule: a UCITS fund may invest no more than 10% of its net assets in transferable securities or money market instruments issued by the same body, with a further aggregate limitation of 40% of net assets on exposures of greater than 5% to single issuers):

f(x) = Σᵢ₌ₙ₋ₖ₊₁ⁿ x₍ᵢ:ₙ₎ = x₍ₙ:ₙ₎ + ... + x₍ₙ₋ₖ₊₁:ₙ₎

where x₍ᵢ:ₙ₎ is the order statistic of x: x₍₁:ₙ₎ ≤ x₍₂:ₙ₎ ≤ ... ≤ x₍ₙ:ₙ₎. Beck (2017) shows that:

prox_{λf}(v) = v − λ P_Ω(v/λ)

where:

Ω = {x ∈ [0,1]ⁿ : 1ₙ⊤x = k} = Box[0ₙ, 1ₙ] ∩ Hyperplane[1ₙ, k]

Bera and Park (2008) propose using a cross-entropy measure as the objective function:

x⋆ = arg minₓ KL(x | x̃)
s.t. 1ₙ⊤x = 1, 0ₙ ≤ x ≤ 1ₙ, μ(x) ≥ μ⋆ and σ(x) ≤ σ⋆
where KL(x | x̃) = Σᵢ₌₁ⁿ xᵢ ln(xᵢ/x̃ᵢ) and x̃ is a reference portfolio which is well-diversified (e.g. the EW or ERC portfolio). In Appendix A.8.9 on page 60, we show that the proximal operator of λ KL(x | x̃) is equal to:

prox_{λKL(·|x̃)}(v) = ( λ W(λ⁻¹ x̃₁ e^{λ⁻¹v₁ − 1}), ..., λ W(λ⁻¹ x̃ₙ e^{λ⁻¹vₙ − 1}) )

where W(x) is the Lambert W function.

Remark 9
Using the previous result and the fact that SE(x) = −KL(x | 1ₙ), we can use Shannon's entropy to define the diversification measure D(x) = SE(x). Therefore, solving Problem (34) is straightforward when we consider the following diversification set:

D = { x ∈ [0,1]ⁿ : −Σᵢ₌₁ⁿ xᵢ ln xᵢ ≥ SE⁻ }

In this case, it is equivalent to maximizing Shannon's entropy because x̃ = 1ₙ.

In the case of the active share, we use the translation property:

AS(x | x̃) = ½ Σᵢ₌₁ⁿ |xᵢ − x̃ᵢ| = ½ ‖x − x̃‖₁

The proximal operator is given in Appendix A.8.11 on page 62. It is interesting to notice that this type of problem cannot be solved using an augmented QP algorithm, since it involves the complement of the ℓ₁ ball and not directly the ℓ₁ ball itself. In this case, we face a maximization problem and not a minimization problem, and the technique of augmented variables does not work.

For tracking error volatility, again we use the fourth ADMM trick:

σ(x | x̃) = √((x − x̃)⊤Σ(x − x̃)) = ‖y‖₂

where y = L⊤x − L⊤x̃. Using our ADMM notations, we have Ax + By = c where A = L⊤, B = −Iₙ and c = L⊤x̃.

Index sampling is based on the cardinality constraint Σᵢ₌₁ⁿ 1{xᵢ > 0} ≤ n_x. It is close to the ℓ₀-norm function ‖x‖₀ = Σᵢ₌₁ⁿ 1{xᵢ ≠ 0}. Beck (2017) derives the proximal of λ‖x‖₀ on pages 137-138 of his monograph. However, it does not help to solve the index sampling problem, because we are interested in computing the projection onto the ℓ₀ ball and not the proximal of the ℓ₀-norm function (we cannot use the Moreau decomposition, because the dual of λ‖x‖_p is not necessarily the ball B_p(0ₙ, λ); for example, the dual of the ℓ₂-norm function is the ℓ₂ ball, but the dual of the ℓ₁-norm function is the ℓ∞ ball). This is why index sampling remains an open problem using the ADMM framework.

The aim of this paper is to propose an alternative solution to the quadratic programming algorithm in the context of portfolio allocation.
In numerical analysis, quadratic programming is a powerful optimization tool, which is computationally very efficient. In portfolio management, the mean-variance optimization model is exactly a quadratic programming model, meaning that it benefits from this computational power. Therefore, the success of the Markowitz allocation model is explained by these two factors: the quadratic utility function and the quadratic programming setup. Many academics and professionals have proposed alternative approaches to the MVO framework, but very few of these models are used in practice. The main reason is that these competing models focus on the objective function and not on the numerical implementation. However, we believe that any model which is not tractable will have little success with portfolio managers. The analogy is obvious if we consider the theory of options. The success of the Black-Scholes model lies in the Black-Scholes analytical formula. Over the last thirty years, many models have been created (e.g. local volatility and stochastic volatility models), but only one can really [...] the ℓp-norm penalty or the logarithmic barrier. Third, we have discussed how to handle non-linear constraints. For instance, we have imposed constraints on active share, volatility targeting, leverage limits, transaction costs, etc. Most importantly, these three non-QP extensions can be combined.

With the development of quantitative strategies (smart beta, factor investing, alternative risk premia, systematic strategies, robo-advisors, etc.), the asset management industry has dramatically changed over the last five years.
This is just the beginning, and we think that alternative data, machine learning methods and artificial intelligence will massively shape investment processes in the future. This paper is an illustration of this trend and shows how machine learning optimization algorithms allow us to move away from the traditional QP world of portfolio management.

References
Barankin, E.W., and Dorfman, R. (1956), A Method for Quadratic Programming, Econometrica, 24, p. 340.

[2] Barankin, E.W., and Dorfman, R. (1958), On Quadratic Programming, University of California Publications in Statistics, 2(13), pp. 285-318.

[3] Bauschke, H.H., and Borwein, J.M. (1994), Dykstra's Alternating Projection Algorithm for Two Sets, Journal of Approximation Theory, 79(3), pp. 418-443.

[4] Beale, E.M.L. (1959), On Quadratic Programming, Naval Research Logistics Quarterly, 6(3), pp. 227-243.

[5] Beck, A. (2017), First-Order Methods in Optimization, MOS-SIAM Series on Optimization, 25, SIAM.

[6] Bera, A.K., and Park, S.Y. (2008), Optimal Portfolio Diversification Using the Maximum Entropy Principle, Econometric Reviews, 27(4-6), pp. 484-512.

[7] Bishop, C.M. (1995), Neural Networks for Pattern Recognition, Oxford University Press.

[8] Bourgeron, T., Lezmi, E., and Roncalli, T. (2018), Robust Asset Allocation for Robo-Advisors, arXiv, 1902.05710.

[9] Brodie, J., Daubechies, I., De Mol, C., Giannone, D., and Loris, I. (2009), Sparse and Stable Markowitz Portfolios, Proceedings of the National Academy of Sciences, 106(30), pp. 12267-12272.

[10] Bruder, B., Gaussel, N., Richard, J-C., and Roncalli, T. (2013), Regularization of Portfolio Allocation, SSRN.

[11] Bruder, B., Kostyuchyk, N., and Roncalli, T. (2016), Risk Parity Portfolios with Skewness Risk: An Application to Factor Investing and Alternative Risk Premia, SSRN.

[12] Boyd, S., Parikh, N., Chu, E., Peleato, B., and Eckstein, J. (2011), Distributed Optimization and Statistical Learning via the Alternating Direction Method of Multipliers, Foundations and Trends in Machine Learning, 3(1), pp. 1-122.

[13] Candelon, B., Hurlin, C., and Tokpavi, S. (2012), Sampling Error and Double Shrinkage Estimation of Minimum Variance Portfolios, Journal of Empirical Finance, 19(4), pp. 511-527.

[14] Carmichael, B., Koumou, G.B., and Moran, K. (2018), Rao's Quadratic Entropy and Maximum Diversification Indexation, Quantitative Finance, 18(6), pp. 1017-1031.

[15] Carrasco, M., and Noumon, N. (2010), Optimal Portfolio Selection Using Regularization, University of Montréal, Discussion Paper.

[16] Chaux, C., Combettes, P.L., Pesquet, J.C., and Wajs, V.R. (2007), A Variational Formulation for Frame-based Inverse Problems, Inverse Problems, 23(4), pp. 1495-1518.

[17] Choueifaty, Y., and Coignard, Y. (2008), Toward Maximum Diversification, Journal of Portfolio Management, 35(1), pp. 40-51.

[18] Choueifaty, Y., Froidure, T., and Reynier, J. (2013), Properties of the Most Diversified Portfolio, Journal of Investment Strategies, 2(2), pp. 49-70.

[19] Combettes, P.L., and Müller, C.L. (2018), Perspective Functions: Proximal Calculus and Applications in High-dimensional Statistics, Journal of Mathematical Analysis and Applications, 457(2), pp. 1283-1306.

[20] Combettes, P.L., and Pesquet, J.C. (2011), Proximal Splitting Methods in Signal Processing, in Bauschke, H.H., Burachik, R.S., Combettes, P.L., Elser, V., Luke, D.R., and Wolkowicz, H. (Eds), Fixed-point Algorithms for Inverse Problems in Science and Engineering, Springer Optimization and Its Applications, 48, pp. 185-212, Springer.

[21] Corless, R.M., Gonnet, G.H., Hare, D.E., Jeffrey, D.J., and Knuth, D.E. (1996), On the Lambert W Function, Advances in Computational Mathematics, 5(1), pp. 329-359.

[22] Cortes, C., and Vapnik, V. (1995), Support-vector Networks, Machine Learning, 20(3), pp. 273-297.

[23] Cottle, R.W., and Infanger, G. (2010), Harry Markowitz and the Early History of Quadratic Programming, in Guerard, J.B. (Ed.), Handbook of Portfolio Construction, Springer, pp. 179-211.

[24] Dantzig, G.B. (1961), Quadratic Programming: A Variant of the Wolfe-Markowitz Algorithms, Operations Research Center, University of California-Berkeley, Research Report 2.

[25] de Jong, M. (2018), Portfolio Optimisation in an Uncertain World, Journal of Asset Management, 19(4), pp. 216-221.

[26] Debreu, G. (1952), Definite and Semidefinite Quadratic Forms, Econometrica, 20(2), pp. 295-300.

[27] DeMiguel, V., Garlappi, L., Nogales, F.J., and Uppal, R. (2009), A Generalized Approach to Portfolio Optimization: Improving Performance by Constraining Portfolio Norms, Management Science, 55(5), pp. 798-812.

[28] Douglas, J., and Rachford, H.H. (1956), On the Numerical Solution of Heat Conduction Problems in Two and Three Space Variables, Transactions of the American Mathematical Society, 82(2), pp. 421-439.

[29] Dykstra, R.L. (1983), An Algorithm for Restricted Least Squares Regression, Journal of the American Statistical Association, 78(384), pp. 837-842.

[30] Frank, M., and Wolfe, P. (1956), An Algorithm for Quadratic Programming, Naval Research Logistics Quarterly, 3, pp. 95-110.

[31] Friedman, J., Hastie, T., and Tibshirani, R. (2010), Regularization Paths for Generalized Linear Models via Coordinate Descent, Journal of Statistical Software, 33(1), pp. 1-22.

[32] Gabay, D., and Mercier, B. (1976), A Dual Algorithm for the Solution of Nonlinear Variational Problems via Finite Element Approximation, Computers & Mathematics with Applications, 2(1), pp. 17-40.

[33] Ghadimi, E., Teixeira, A., Shames, I., and Johansson, M. (2015), Optimal Parameter Selection for the Alternating Direction Method of Multipliers (ADMM): Quadratic Problems, IEEE Transactions on Automatic Control, 60(3), pp. 644-658.

[34] Giselsson, P., and Boyd, S. (2017), Linear Convergence and Metric Selection for Douglas-Rachford Splitting and ADMM, IEEE Transactions on Automatic Control, 62(2), pp. 532-544.

[35] Gonzalvez, J., Lezmi, E., Roncalli, T., and Xu, J. (2019), Financial Applications of Gaussian Processes and Bayesian Optimization, arXiv, 1903.04841.

[36] Gould, N.I.M., and Toint, P.L. (2000), A Quadratic Programming Bibliography, Numerical Analysis Group Internal Report, 1, 142 pages.

[37] Griveau-Billion, T., Richard, J-C., and Roncalli, T. (2013), A Fast Algorithm for Computing High-dimensional Risk Parity Portfolios, SSRN.

[38] Hastie, T., Tibshirani, R., and Friedman, J. (2009), The Elements of Statistical Learning, Second edition, Springer.

[39] He, B.S., Yang, H., and Wang, S.L. (2000), Alternating Direction Method with Self-Adaptive Penalty Parameters for Monotone Variational Inequalities, Journal of Optimization Theory and Applications, 106(2), pp. 337-356.

[40] Hildreth, C. (1957), A Quadratic Programming Procedure, Naval Research Logistics Quarterly, 4, pp. 79-85.

[41] Jacobs, R.A. (1988), Increased Rates of Convergence Through Learning Rate Adaptation, Neural Networks, 1(4), pp. 295-307.

[42] Jagannathan, R., and Ma, T. (2003), Risk Reduction in Large Portfolios: Why Imposing the Wrong Constraints Helps, Journal of Finance, 58(4), pp. 1651-1684.

[43] LeCun, Y., Boser, B., Denker, J.S., Henderson, D., Howard, R.E., Hubbard, W., and Jackel, L.D. (1989), Backpropagation Applied to Handwritten Zip Code Recognition, Neural Computation, 1(4), pp. 541-551.

[44] Lezmi, E., Malongo, H., Roncalli, T., and Sobotka, R. (2018), Portfolio Allocation with Skewness Risk: A Practical Guide, SSRN.

[45] Lindstrom, S.B., and Sims, B. (2018), Survey: Sixty Years of Douglas-Rachford, arXiv, 1809.07181.

[46] Luo, Z.Q., and Tseng, P. (1992), On the Convergence of the Coordinate Descent Method for Convex Differentiable Minimization, Journal of Optimization Theory and Applications, 72(1), pp. 7-35.

[47] Luo, Z.Q., and Tseng, P. (1993), Error Bounds and Convergence Analysis of Feasible Descent Methods: A General Approach, Annals of Operations Research, 46(1), pp. 157-178.

[48] Maillard, S., Roncalli, T., and Teïletche, J. (2010), The Properties of Equally Weighted Risk Contribution Portfolios, Journal of Portfolio Management, 36(4), pp. 60-70.

[49] Mann, H.B. (1943), Quadratic Forms with Linear Constraints, American Mathematical Monthly, 50, pp. 430-433.

[50] Markowitz, H. (1952), Portfolio Selection, Journal of Finance, 7(1), pp. 77-91.

[51] Markowitz, H. (1956), The Optimization of a Quadratic Function Subject to Linear Constraints, Naval Research Logistics Quarterly, 3(1-2), pp. 111-133.

[52] Martin, A.D. (1955), Mathematical Programming of Portfolio Selections, Management Science, 1(2), pp. 152-166.

[53] Meucci, A. (2009), Managing Diversification, Risk, 22(5), pp. 74-79.

[54] Michaud, R.O. (1989), The Markowitz Optimization Enigma: Is 'Optimized' Optimal?, Financial Analysts Journal, 45(1), pp. 31-42.

[55] Nesterov, Y.E. (1983), A Method for Solving the Convex Programming Problem with Convergence Rate $O(1/k^2)$, Doklady Akademii Nauk SSSR, 269, pp. 543-547.

[56] Nesterov, Y. (2004), Introductory Lectures on Convex Optimization: A Basic Course, Applied Optimization, 87, Kluwer Academic Publishers.

[57] Nesterov, Y. (2012), Efficiency of Coordinate Descent Methods on Huge-scale Optimization Problems, SIAM Journal on Optimization, 22(2), pp. 341-362.

[58] Parikh, N., and Boyd, S. (2014), Proximal Algorithms, Foundations and Trends in Optimization, 1(3), pp. 127-239.

[59] Polyak, B.T. (1964), Some Methods of Speeding Up the Convergence of Iteration Methods, USSR Computational Mathematics and Mathematical Physics, 4(5), pp. 1-17.

[60] Qian, E. (2005), Risk Parity Portfolios: Efficient Portfolios Through True Diversification, Panagora Asset Management, September.

[61] Richard, J-C., and Roncalli, T. (2015), Smart Beta: Managing Diversification of Minimum Variance Portfolios, in Jurczenko, E. (Ed.), Risk-based and Factor Investing, ISTE Press – Elsevier.

[62] Richard, J-C., and Roncalli, T. (2019), Constrained Risk Budgeting Portfolios: Theory, Algorithms, Applications & Puzzles, arXiv, 1902.05710.

[63] Roncalli, T. (2013), Introduction to Risk Parity and Budgeting, Chapman & Hall/CRC Financial Mathematics Series.

[64] Roncalli, T. (2015), Introducing Expected Returns into Risk Parity Portfolios: A New Framework for Asset Allocation, Bankers, Markets & Investors, 138, pp. 18-28.

[65] Scherer, B. (2007), Portfolio Construction & Risk Budgeting, Third edition, Risk Books.

[66] Tibshirani, R. (1996), Regression Shrinkage and Selection via the Lasso, Journal of the Royal Statistical Society B, 58(1), pp. 267-288.

[67] Tibshirani, R.J. (2017), Dykstra's Algorithm, ADMM, and Coordinate Descent: Connections, Insights, and Extensions, in Guyon, I., Luxburg, U.V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., and Garnett, R. (Eds), Advances in Neural Information Processing Systems, 30, pp. 517-528.

[68] Tseng, P. (1990), Dual Ascent Methods for Problems with Strictly Convex Costs and Linear Constraints: A Unified Approach, SIAM Journal on Control and Optimization, 28(1), pp. 214-242.

[69] Tseng, P. (2001), Convergence of a Block Coordinate Descent Method for Nondifferentiable Minimization, Journal of Optimization Theory and Applications, 109(3), pp. 475-494.

[70] Vapnik, V. (1998), Statistical Learning Theory, John Wiley & Sons.

[71] Wang, S.L., and Liao, L.Z. (2001), Decomposition Method with a Variable Parameter for a Class of Monotone Variational Inequality Problems, Journal of Optimization Theory and Applications, 109(2), pp. 415-429.

[72] Weston, J.F., and Barenek, W. (1955), Programming Investment Portfolio Construction, Analysts Journal, 11(2), pp. 51-55.

[73] Wolfe, P. (1959), The Simplex Method for Quadratic Programming, Econometrica, 27(3), pp. 382-398.

[74] Wright, S.J. (2015), Coordinate Descent Algorithms, Mathematical Programming, 151(1), pp. 3-34.

[75] Yu, J.R., Lee, W.Y., and Chiou, W.J.P. (2014), Diversified Portfolios with Different Entropy Measures, Applied Mathematics and Computation, 241, pp. 47-63.
Appendix

A Mathematical results
A.1 QP problem when there is a benchmark
Following Roncalli (2013), the excess return $R(x \mid b)$ of Portfolio $x$ with respect to Benchmark $b$ is the difference between the return of the portfolio and the return of the benchmark:
$$R(x \mid b) = R(x) - R(b) = (x - b)^\top R$$
It is easy to show that the expected excess return is equal to:
$$\mu(x \mid b) = \mathbb{E}[R(x \mid b)] = (x - b)^\top \mu$$
whereas the volatility of the tracking error is given by:
$$\sigma(x \mid b) = \sigma(R(x \mid b)) = \sqrt{(x - b)^\top \Sigma (x - b)}$$
The objective function is then:
$$f(x \mid b) = \frac{1}{2}(x - b)^\top \Sigma (x - b) - \gamma (x - b)^\top \mu = \frac{1}{2} x^\top \Sigma x - x^\top (\gamma \mu + \Sigma b) + \left( \frac{1}{2} b^\top \Sigma b + \gamma b^\top \mu \right) = \frac{1}{2} x^\top Q x - x^\top R + C$$
where $C$ is a constant which does not depend on Portfolio $x$. We recognize a QP problem where $Q = \Sigma$ and $R = \gamma \mu + \Sigma b$.

A.2 Augmented QP formulation of the turnover management problem
The augmented QP problem is defined by:
$$X^\star = \arg\min_X \frac{1}{2} X^\top Q X - X^\top R \quad \text{s.t.} \quad \begin{cases} AX = B \\ CX \le D \\ 0_{3n} \le X \le 1_{3n} \end{cases}$$
where $X = (x_1, \ldots, x_n, x_1^-, \ldots, x_n^-, x_1^+, \ldots, x_n^+)$ is a $3n \times 1$ vector, $Q$ is the $3n \times 3n$ matrix:
$$Q = \begin{pmatrix} \Sigma & 0_{n \times n} & 0_{n \times n} \\ 0_{n \times n} & 0_{n \times n} & 0_{n \times n} \\ 0_{n \times n} & 0_{n \times n} & 0_{n \times n} \end{pmatrix}$$
$R = (\gamma\mu, 0_n, 0_n)$ is a $3n \times 1$ vector, $A$ is the $(n+1) \times 3n$ matrix:
$$A = \begin{pmatrix} 1_n^\top & 0_n^\top & 0_n^\top \\ I_n & I_n & -I_n \end{pmatrix}$$
$B = (1, \bar{x})$ is a $(n+1) \times 1$ vector, $C = \begin{pmatrix} 0_n^\top & 1_n^\top & 1_n^\top \end{pmatrix}$ is a $1 \times 3n$ matrix and $D = \tau^+$.

A.3 Augmented QP formulation of the MVO problem with transaction costs
The augmented QP problem of dimension $3n$ is defined by:
$$X^\star = \arg\min_X \frac{1}{2} X^\top Q X - X^\top R \quad \text{s.t.} \quad \begin{cases} AX = B \\ 0_{3n} \le X \le 1_{3n} \end{cases}$$
where $X = (x_1, \ldots, x_n, x_1^-, \ldots, x_n^-, x_1^+, \ldots, x_n^+)$ is a $3n \times 1$ vector, $Q$ is the $3n \times 3n$ matrix:
$$Q = \begin{pmatrix} \Sigma & 0_{n \times n} & 0_{n \times n} \\ 0_{n \times n} & 0_{n \times n} & 0_{n \times n} \\ 0_{n \times n} & 0_{n \times n} & 0_{n \times n} \end{pmatrix}$$
$R = (\gamma\mu, -c^-, -c^+)$ is a $3n \times 1$ vector, $A$ is the $(n+1) \times 3n$ matrix:
$$A = \begin{pmatrix} 1_n^\top & (c^-)^\top & (c^+)^\top \\ I_n & I_n & -I_n \end{pmatrix}$$
and $B = (1, \bar{x})$ is a $(n+1) \times 1$ vector.

A.4 QP problem with a hyperplane constraint
We consider the following QP problem:
$$x^\star = \arg\min_x \frac{1}{2} x^\top Q x - x^\top R \quad \text{s.t.} \quad a^\top x = b$$
The associated Lagrange function is:
$$\mathcal{L}(x; \lambda) = \frac{1}{2} x^\top Q x - x^\top R - \lambda \left( a^\top x - b \right)$$
The first-order conditions are then:
$$\begin{cases} \partial_x \mathcal{L}(x; \lambda) = Qx - R - \lambda a = 0_n \\ \partial_\lambda \mathcal{L}(x; \lambda) = a^\top x - b = 0 \end{cases}$$
We obtain $x = Q^{-1}(R + \lambda a)$. Because $a^\top x - b = 0$, we have $a^\top Q^{-1} R + \lambda a^\top Q^{-1} a = b$ and:
$$\lambda^\star = \frac{b - a^\top Q^{-1} R}{a^\top Q^{-1} a}$$
The optimal solution is then:
$$x^\star = Q^{-1} \left( R + \frac{b - a^\top Q^{-1} R}{a^\top Q^{-1} a} \, a \right)$$

A.5 Derivation of the soft-thresholding operator
We consider the following equation:
$$c x - v + \lambda \, \partial |x| \ni 0$$
where $c > 0$ and $\lambda > 0$. Since we have $\partial |x| = \mathrm{sign}(x)$ for $x \ne 0$, we deduce that:
$$x^\star = \begin{cases} c^{-1}(v + \lambda) & \text{if } x^\star < 0 \\ 0 & \text{if } x^\star = 0 \\ c^{-1}(v - \lambda) & \text{if } x^\star > 0 \end{cases}$$
If $x^\star < 0$, then we have $v + \lambda < 0$; if $x^\star > 0$, then we have $v - \lambda > 0$. This is equivalent to setting $|v| > \lambda$, whereas $x^\star = 0$ implies that $|v| \le \lambda$. We deduce that:
$$x^\star = c^{-1} \cdot \mathcal{S}(v; \lambda)$$
where $\mathcal{S}(v; \lambda)$ is the soft-thresholding operator:
$$\mathcal{S}(v; \lambda) = \begin{cases} 0 & \text{if } |v| \le \lambda \\ v - \lambda \, \mathrm{sign}(v) & \text{otherwise} \end{cases} = \mathrm{sign}(v) \cdot \left( |v| - \lambda \right)_+$$
In Figure 6, we have represented the function $\mathcal{S}(v; \lambda)$ when $\lambda$ is respectively equal to 1 and 2.

Remark 10.
The soft-thresholding operator is the proximal operator of the $\ell_1$-norm $f(x) = \|x\|_1$. Indeed, we have $\mathrm{prox}_f(v) = \mathcal{S}(v; 1)$ and $\mathrm{prox}_{\lambda f}(v) = \mathcal{S}(v; \lambda)$.

A.6 The box-constrained QP problem
We consider the box-constrained QP problem:
$$x^\star = \arg\min_x \frac{1}{2} x^\top Q x - x^\top R \quad \text{s.t.} \quad x^- \le x \le x^+ \tag{44}$$
The objective function is equal to:
$$f(x) = \frac{1}{2} x^\top Q x - x^\top R = \frac{1}{2} \sum_{i=1}^n x_i \sum_{j=1}^n Q_{i,j} x_j - \sum_{i=1}^n x_i R_i$$
We deduce that:
$$\frac{\partial f(x)}{\partial x_i} = x_i Q_{i,i} + \frac{1}{2} \sum_{j \ne i} x_j \left( Q_{i,j} + Q_{j,i} \right) - R_i$$
We notice that:
$$\frac{\partial f(x)}{\partial x_i} = 0 \Leftrightarrow x_i = \frac{1}{Q_{i,i}} \left( R_i - \sum_{j \ne i} x_j \left( \frac{Q_{i,j} + Q_{j,i}}{2} \right) \right)$$
The Lagrange function associated with Problem (44) is equal to:
$$\mathcal{L}(x; \lambda^-, \lambda^+) = f(x) - \sum_{i=1}^n \lambda_i^- \left( x_i - x_i^- \right) - \sum_{i=1}^n \lambda_i^+ \left( x_i^+ - x_i \right)$$
The first-order condition is then:
$$\frac{\partial \mathcal{L}(x; \lambda^-, \lambda^+)}{\partial x_i} = \frac{\partial f(x)}{\partial x_i} - \lambda_i^- + \lambda_i^+ = 0$$
Since the Kuhn-Tucker conditions are:
$$\begin{cases} \min \left( \lambda_i^-, x_i - x_i^- \right) = 0 \\ \min \left( \lambda_i^+, x_i^+ - x_i \right) = 0 \end{cases}$$
we obtain three cases:

1. If no bound is reached, we have $\lambda_i^- = \lambda_i^+ = 0$ and the solution is equal to:
$$x_i^\star = \frac{1}{Q_{i,i}} \left( R_i - \sum_{j \ne i} x_j \left( \frac{Q_{i,j} + Q_{j,i}}{2} \right) \right)$$
2. If the lower bound is reached, we have $\lambda_i^- > 0$, $\lambda_i^+ = 0$ and $x_i^\star = x_i^-$.
3. If the upper bound is reached, we have $\lambda_i^- = 0$, $\lambda_i^+ > 0$ and $x_i^\star = x_i^+$.

A.7 ADMM algorithm
The optimization problem is defined as:
$$\{x^\star, y^\star\} = \arg\min_{(x,y)} f_x(x) + f_y(y) \quad \text{s.t.} \quad Ax + By = c \tag{45}$$
The derivation of the algorithm is fully explained in Boyd et al. (2011). For that, they consider the augmented Lagrange function:
$$\mathcal{L}(x, y; \lambda, \varphi) = f_x(x) + f_y(y) + \lambda^\top (Ax + By - c) + \frac{\varphi}{2} \left\| Ax + By - c \right\|_2^2 \tag{46}$$
where $\varphi > 0$. According to Boyd et al. (2011), the $\ell_2$-norm penalty adds robustness to the dual ascent method and accelerates its convergence. The ADMM algorithm uses the property that the objective function is separable, and consists of the following iterations:
$$x^{(k+1)} = \arg\min_x \mathcal{L}\left(x, y^{(k)}; \lambda^{(k)}, \varphi\right) = \arg\min_x \left\{ f_x(x) + \lambda^{(k)\top} \left( Ax + By^{(k)} - c \right) + \frac{\varphi}{2} \left\| Ax + By^{(k)} - c \right\|_2^2 \right\}$$
and:
$$y^{(k+1)} = \arg\min_y \mathcal{L}\left(x^{(k+1)}, y; \lambda^{(k)}, \varphi\right) = \arg\min_y \left\{ f_y(y) + \lambda^{(k)\top} \left( Ax^{(k+1)} + By - c \right) + \frac{\varphi}{2} \left\| Ax^{(k+1)} + By - c \right\|_2^2 \right\}$$
The update for the dual variable $\lambda$ is then:
$$\lambda^{(k+1)} = \lambda^{(k)} + \varphi \left( Ax^{(k+1)} + By^{(k+1)} - c \right)$$
We repeat the iterations until convergence. Boyd et al. (2011) notice that the previous algorithm can be simplified. Let $r = Ax + By - c$ be the (primal) residual. By combining linear and quadratic terms, we have:
$$\lambda^\top r + \frac{\varphi}{2} \left\| r \right\|_2^2 = \frac{\varphi}{2} \left\| r + u \right\|_2^2 - \frac{\varphi}{2} \left\| u \right\|_2^2$$
where $u = \varphi^{-1} \lambda$ is the scaled dual variable.
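The iterations above can be sketched in a few lines of code. The following is a minimal illustration, not the paper's implementation: it assumes that $f_x$ is the quadratic $\frac{1}{2}x^\top Qx - x^\top R$, that $f_y$ is the indicator of a box, and that $A = I_n$, $B = -I_n$, $c = 0_n$ (so the constraint is $x = y$); the matrix `Q`, vector `R`, bounds and penalty `phi` are illustrative.

```python
import numpy as np

def admm_qp_box(Q, R, lower, upper, phi=1.0, n_iter=500):
    """Scaled-form ADMM sketch for min 1/2 x'Qx - x'R s.t. lower <= x <= upper."""
    n = len(R)
    x, y, u = np.zeros(n), np.zeros(n), np.zeros(n)
    M = Q + phi * np.eye(n)
    for _ in range(n_iter):
        x = np.linalg.solve(M, R + phi * (y - u))  # x-update: prox of the quadratic
        y = np.clip(x + u, lower, upper)           # y-update: projection onto the box
        u = u + x - y                              # scaled dual update: u <- u + r
    return y

# Illustrative data: separable quadratic with box [0, 1]^2
Q = np.diag([2.0, 2.0])
x_star = admm_qp_box(Q, np.array([1.0, 5.0]), np.zeros(2), np.ones(2))
```

Here the unconstrained minimizer is $(0.5, 2.5)$, so the box constraint binds on the second coordinate and the iterations converge to $(0.5, 1.0)$.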
We can then write the Lagrange function (46) as follows:
$$\mathcal{L}(x, y; u, \varphi) = f_x(x) + f_y(y) + \frac{\varphi}{2} \left\| Ax + By - c + u \right\|_2^2 - \frac{\varphi}{2} \left\| u \right\|_2^2$$
Since the last term is a constant, we deduce that the $x$- and $y$-updates become:
$$x^{(k+1)} = \arg\min_x \mathcal{L}\left(x, y^{(k)}; u^{(k)}, \varphi\right) = \arg\min_x \left\{ f_x(x) + \frac{\varphi}{2} \left\| Ax + By^{(k)} - c + u^{(k)} \right\|_2^2 \right\} \tag{47}$$
and:
$$y^{(k+1)} = \arg\min_y \mathcal{L}\left(x^{(k+1)}, y; u^{(k)}, \varphi\right) = \arg\min_y \left\{ f_y(y) + \frac{\varphi}{2} \left\| Ax^{(k+1)} + By - c + u^{(k)} \right\|_2^2 \right\} \tag{48}$$
For the scaled dual variable $u$, we have:
$$u^{(k+1)} = u^{(k)} + r^{(k+1)} = u^{(k)} + \left( Ax^{(k+1)} + By^{(k+1)} - c \right) \tag{49}$$
where $r^{(k+1)} = Ax^{(k+1)} + By^{(k+1)} - c$ is the primal residual at iteration $k+1$. Boyd et al. (2011) also define the variable $s^{(k+1)} = \varphi A^\top B \left( y^{(k+1)} - y^{(k)} \right)$ and refer to $s^{(k+1)}$ as the dual residual at iteration $k+1$.

Under the assumption that the traditional Lagrange function $\mathcal{L}(x, y; \lambda, 0)$ has a saddle point, one can prove that the residual $r^{(k)}$ converges to zero, the objective function $f_x\left(x^{(k)}\right) + f_y\left(y^{(k)}\right)$ converges to the optimal value $f_x(x^\star) + f_y(y^\star)$, and the dual variable $\lambda^{(k)} = \varphi u^{(k)}$ converges to a dual optimal point. However, the rate of convergence is not known and the primal variables $x^{(k)}$ and $y^{(k)}$ do not necessarily converge to the optimal values $x^\star$ and $y^\star$. In general, the stopping criterion is defined with respect to the residuals:
$$\begin{cases} \left\| r^{(k+1)} \right\|_2 \le \varepsilon_1 \\ \left\| s^{(k+1)} \right\|_2 \le \varepsilon_2 \end{cases}$$
Bourgeron et al. (2018) give typical values of $\varepsilon_1$ and $\varepsilon_2$ when implementing this stopping criterion.

From a theoretical point of view, the convergence holds regardless of the choice of the penalization parameter $\varphi > 0$. But the choice of $\varphi$ affects the convergence rate (Ghadimi et al., 2015; Giselsson and Boyd, 2017). In practice, the penalization parameter $\varphi$ may be changed at each iteration, implying that $\varphi$ is replaced by $\varphi^{(k)}$ and the scaled dual variable $u^{(k)}$ is equal to $\lambda^{(k)}/\varphi^{(k)}$. This may improve the convergence and make the performance independent of the initial choice $\varphi^{(0)}$. To update $\varphi^{(k)}$ in practice, He et al. (2000) and Wang and Liao (2001) provide a simple and efficient scheme. On the one hand, the $x$- and $y$-updates in ADMM essentially come from placing a penalty on $\left\| r^{(k)} \right\|_2$. As a consequence, if $\varphi^{(k)}$ is large, $\left\| r^{(k)} \right\|_2$ tends to be small. On the other hand, $s^{(k)}$ depends linearly on $\varphi$. As a consequence, if $\varphi^{(k)}$ is small, $\left\| s^{(k)} \right\|_2$ is small. To keep $\left\| r^{(k)} \right\|_2$ and $\left\| s^{(k)} \right\|_2$ within a factor $\mu$, one may consider:
$$\varphi^{(k+1)} = \begin{cases} \tau_1 \varphi^{(k)} & \text{if } \left\| r^{(k)} \right\|_2 > \mu \left\| s^{(k)} \right\|_2 \\ \varphi^{(k)} / \tau_2 & \text{if } \left\| s^{(k)} \right\|_2 > \mu \left\| r^{(k)} \right\|_2 \\ \varphi^{(k)} & \text{otherwise} \end{cases}$$
where $\mu$, $\tau_1$ and $\tau_2$ are parameters that are greater than one. In practice, we use $\varphi^{(0)} = 1$, $u^{(0)} = 0$, $\mu = 10$ and $\tau_1 = \tau_2 = 2$.

Remark 11.
The constant case $\varphi^{(k+1)} = \varphi^{(k)} = \varphi^{(0)}$ is obtained by setting $\tau_1 = \tau_2 = 1$.

A.8 Proximal operators
A.8.1 Pointwise maximum function
The unit simplex is the generalization of the triangle:
$$\mathcal{S}_n = \left\{ x \in [0,1]^n : x = \sum_{i=0}^n \theta_i e_i, \; \theta_i \ge 0, \; \sum_{i=0}^n \theta_i = 1, \; 1_n^\top x \le 1 \right\}$$
where $e_0$ is the zero vector and $e_i$ are the unit vectors for $i \ge 1$. Beck (2017) shows that $\mathcal{P}_{\mathcal{S}_n}(v) = \left( v - \mu^\star 1_n \right)_+$ where $\mu^\star$ is the root of the equation $1_n^\top \left( v - \mu^\star 1_n \right)_+ = 1$. In the case of the pointwise maximum function $f(x) = \max x$, the Moreau decomposition gives:
$$\mathrm{prox}_{\lambda \max x}(v) = v - \lambda \, \mathcal{P}_{\mathcal{S}_n}\left( \frac{v}{\lambda} \right) = v - \lambda \left( \frac{v}{\lambda} - \mu^\star 1_n \right)_+$$
where $\mu^\star$ is the root of the equation:
$$1_n^\top \left( \frac{v}{\lambda} - \mu^\star 1_n \right)_+ = 1 \Leftrightarrow \sum_{i=1}^n \left( \frac{v_i}{\lambda} - \mu^\star \right)_+ = 1 \Leftrightarrow \sum_{i=1}^n \left( v_i - s^\star \right)_+ = \lambda$$
where $s^\star = \lambda \mu^\star$. It follows that:
$$\mathrm{prox}_{\lambda \max x}(v) = v - \left( v - s^\star 1_n \right)_+ = \min(v, s^\star)$$

A.8.2 $\ell_2$-norm function

The projection onto the unit ball $\mathcal{B}_2(0_n, 1) = \left\{ x \in \mathbb{R}^n : \|x\|_2 \le 1 \right\}$ is equal to:
$$\mathcal{P}_{\mathcal{B}_2(0_n,1)}(v) = \begin{cases} v & \text{if } \|v\|_2 \le 1 \\ \dfrac{v}{\|v\|_2} & \text{if } \|v\|_2 > 1 \end{cases}$$
Since the Moreau decomposition gives $\mathrm{prox}_{\lambda \|x\|_2}(v) + \lambda \, \mathcal{P}_{\mathcal{B}_2(0_n,1)}\left( \frac{v}{\lambda} \right) = v$, we deduce that:
$$\mathrm{prox}_{\lambda \|x\|_2}(v) = v - \lambda \, \mathcal{P}_{\mathcal{B}_2(0_n,1)}\left( \frac{v}{\lambda} \right) = \begin{cases} 0_n & \text{if } \|v\|_2 \le \lambda \\ v - \lambda \dfrac{v}{\|v\|_2} & \text{if } \|v\|_2 > \lambda \end{cases} = \left( 1 - \frac{\lambda}{\max(\lambda, \|v\|_2)} \right) v$$

A.8.3 Scaling and translation

Let $g(x) = f(ax + b)$ where $a \ne 0$. Using the change of variable $y = ax + b$, we have:
$$\mathrm{prox}_g(v) = \arg\min_x \left\{ g(x) + \frac{1}{2} \left\| x - v \right\|_2^2 \right\} = \arg\min_x \left\{ f(ax + b) + \frac{1}{2} \left\| x - v \right\|_2^2 \right\} = \arg\min_y \left\{ f(y) + \frac{1}{2} \left\| \frac{y - b}{a} - v \right\|_2^2 \right\}$$
We deduce that:
$$f(y) + \frac{1}{2} \left\| \frac{y - b}{a} - v \right\|_2^2 = f(y) + \frac{1}{2a^2} \left\| y - b - av \right\|_2^2 = \frac{1}{a^2} \left( a^2 f(y) + \frac{1}{2} \left\| y - (av + b) \right\|_2^2 \right)$$
We conclude that $y^\star = \mathrm{prox}_{a^2 f}(av + b)$ and:
$$\mathrm{prox}_g(v) = \frac{y^\star - b}{a} = \frac{\mathrm{prox}_{a^2 f}(av + b) - b}{a}$$

A.8.4 Projection onto the $\ell_1$ ball

We have:
$$x = \mathcal{P}_{\mathcal{B}_1(c,r)}(v) = \mathcal{P}_{\mathcal{B}_1(0_n,r)}(v - c) + c = (v - c) - \mathrm{sign}(v - c) \odot \mathrm{prox}_{r \max x}(|v - c|) + c = v - \mathrm{sign}(v - c) \odot \min(|v - c|, s^\star)$$
where $s^\star$ is the solution of the following equation:
$$s^\star = \left\{ s \in \mathbb{R} : \sum_{i=1}^n \left( |v_i - c_i| - s \right)_+ = r \right\}$$

Remark 12. $\mathcal{P}_{\mathcal{B}_1(c,r)}(v)$ is sometimes expressed using the soft-thresholding operator (Beck, 2017, page 151), but the two formulas are equivalent.

A.8.5 Projection onto the $\ell_2$ ball

We have:
$$x = \mathcal{P}_{\mathcal{B}_2(c,r)}(v) = \mathcal{P}_{\mathcal{B}_2(0_n,r)}(v - c) + c = (v - c) - \mathrm{prox}_{r \|x\|_2}(v - c) + c = v - \left( 1 - \frac{r}{\max(r, \|v - c\|_2)} \right)(v - c) = c + \frac{r}{\max(r, \|v - c\|_2)} (v - c) \tag{50}$$

A.8.6 $\ell_2$-penalized logarithmic barrier function

We note $f_1(x) = -\lambda \sum_{i=1}^n b_i \ln x_i$ and $f_2(x) = 1_\Omega(x)$ where $\Omega = \mathcal{B}_2(c, r)$.
The Dykstra's algorithm becomes:
$$\begin{cases} v_x^{(k)} = y^{(k)} + z_1^{(k)} \\ x^{(k+1)} = \mathrm{prox}_{f_1}\left( v_x^{(k)} \right) \\ z_1^{(k+1)} = y^{(k)} + z_1^{(k)} - x^{(k+1)} \\ v_y^{(k)} = x^{(k+1)} + z_2^{(k)} \\ y^{(k+1)} = \mathrm{prox}_{f_2}\left( v_y^{(k)} \right) \\ z_2^{(k+1)} = x^{(k+1)} + z_2^{(k)} - y^{(k+1)} \end{cases}$$
It follows that:
$$x^{(k+1)} = \frac{v_x^{(k)} + \sqrt{v_x^{(k)} \odot v_x^{(k)} + 4 \lambda b}}{2}$$
and:
$$y^{(k+1)} = \mathcal{P}_{\mathcal{B}_2(c,r)}\left( v_y^{(k)} \right) = c + \frac{r}{\max\left( r, \left\| v_y^{(k)} - c \right\|_2 \right)} \left( v_y^{(k)} - c \right)$$

A.8.7 Quadratic function
Let $f(x) = \frac{1}{2} x^\top Q x - x^\top R$. We have:
$$\mathrm{prox}_f(v) = \arg\min_x \left\{ \frac{1}{2} x^\top Q x - x^\top R + \frac{1}{2} \left\| x - v \right\|_2^2 \right\} = \arg\min_x \left\{ \frac{1}{2} x^\top (Q + I_n) x - x^\top (R + v) + \frac{1}{2} v^\top v \right\} = (Q + I_n)^{-1} (R + v)$$

A.8.8 Projection onto the intersection of a $\ell_2$ ball and a box

We note $f_1(x) = 1_{\Omega_1}(x)$ and $f_2(x) = 1_{\Omega_2}(x)$ where $\Omega_1 = \mathcal{B}_2(c, r)$ and $\Omega_2 = \mathcal{B}ox[x^-, x^+] = \left\{ x \in \mathbb{R}^n : x^- \le x \le x^+ \right\}$. The Dykstra's algorithm becomes:
$$\begin{cases} x^{(k+1)} = c + \dfrac{r}{\max\left( r, \left\| y^{(k)} + z_1^{(k)} - c \right\|_2 \right)} \left( y^{(k)} + z_1^{(k)} - c \right) \\ z_1^{(k+1)} = y^{(k)} + z_1^{(k)} - x^{(k+1)} \\ y^{(k+1)} = \mathcal{T}\left( x^{(k+1)} + z_2^{(k)}; x^-, x^+ \right) \\ z_2^{(k+1)} = x^{(k+1)} + z_2^{(k)} - y^{(k+1)} \end{cases}$$
This algorithm is denoted by $\mathcal{P}_{\mathcal{B}ox-\mathcal{B}all}(v; x^-, x^+, c, r)$.

A.8.9 Shannon's entropy and Kullback-Leibler divergence
If we consider the scalar function $f(x) = x \ln(x/\tilde{x})$ where $\tilde{x}$ is a constant, we have:
$$\lambda f(x) + \frac{1}{2} \left( x - v \right)^2 = \lambda x \ln \frac{x}{\tilde{x}} + \frac{1}{2} x^2 - xv + \frac{1}{2} v^2$$
The first-order condition is:
$$\lambda \ln \frac{x}{\tilde{x}} + \lambda + x - v = 0 \Leftrightarrow \ln x + \lambda^{-1} x = \ln \tilde{x} + \lambda^{-1} v - 1 \Leftrightarrow x e^{\lambda^{-1} x} = \tilde{x} e^{\lambda^{-1} v - 1} \Leftrightarrow \left( \lambda^{-1} x \right) e^{\lambda^{-1} x} = \lambda^{-1} \tilde{x} e^{\lambda^{-1} v - 1}$$
We deduce that the root is equal to:
$$x^\star = \lambda W\!\left( \lambda^{-1} \tilde{x} e^{\lambda^{-1} v - 1} \right)$$
where $W(x)$ is the Lambert W function satisfying $W(x) e^{W(x)} = x$ (Corless et al., 1996). In the case of the Kullback-Leibler divergence $\mathrm{KL}(x) = \sum_{i=1}^n x_i \ln(x_i / \tilde{x}_i)$, it follows that:
$$\mathrm{prox}_{\lambda \mathrm{KL}(x)}(v) = \begin{pmatrix} \lambda W\!\left( \lambda^{-1} \tilde{x}_1 e^{\lambda^{-1} v_1 - 1} \right) \\ \vdots \\ \lambda W\!\left( \lambda^{-1} \tilde{x}_n e^{\lambda^{-1} v_n - 1} \right) \end{pmatrix}$$

Remark 13. The proximal of Shannon's entropy $\mathrm{SE}(x) = -\sum_{i=1}^n x_i \ln x_i$ is a special case of the previous result with $\tilde{x}_i = 1$ (we use the fact that maximizing $\mathrm{SE}(x)$ is equivalent to minimizing $-\mathrm{SE}(x)$):
$$\mathrm{prox}_{\lambda \mathrm{SE}(x)}(v) = \begin{pmatrix} \lambda W\!\left( \lambda^{-1} e^{\lambda^{-1} v_1 - 1} \right) \\ \vdots \\ \lambda W\!\left( \lambda^{-1} e^{\lambda^{-1} v_n - 1} \right) \end{pmatrix}$$
This result was first obtained by Chaux et al. (2007).
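The elementwise Lambert W formula is straightforward to implement, for instance with `scipy.special.lambertw`. A minimal sketch (the values of `v`, `x_tilde` and `lmbda` are illustrative):

```python
import numpy as np
from scipy.special import lambertw

def prox_kl(v, x_tilde, lmbda):
    """Proximal operator of lambda * KL(x | x_tilde), computed elementwise as
    x_i = lambda * W(lambda^{-1} * x_tilde_i * exp(lambda^{-1} * v_i - 1))."""
    w = lambertw(x_tilde / lmbda * np.exp(v / lmbda - 1.0))
    return lmbda * w.real  # the argument is positive, so W is real-valued

v = np.array([1.0, 0.0, -1.0])
x = prox_kl(v, np.ones(3), 0.5)
```

One can check the result against the scalar first-order condition $\lambda \ln(x_i/\tilde{x}_i) + \lambda + x_i - v_i = 0$.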
A.8.10 Projection onto the complement $\bar{\mathcal{B}}_2(c,r)$ of the $\ell_2$ ball

We consider the following proximal problem:
$$x^\star = \arg\min_x \left\{ 1_\Omega(x) + \frac{1}{2} \left\| x - v \right\|_2^2 \right\}$$
where $\Omega = \left\{ x \in \mathbb{R}^n : \left\| x - c \right\|_2 \ge r \right\}$. This is equivalent to solving:
$$x^\star = \arg\min_x \frac{1}{2} (x - v)^\top (x - v) \quad \text{s.t.} \quad (x - c)^\top (x - c) - r^2 \ge 0$$
The associated Lagrange function is:
$$\mathcal{L}(x; \lambda) = \frac{1}{2} (x - v)^\top (x - v) - \lambda \left( (x - c)^\top (x - c) - r^2 \right)$$
The first-order condition is:
$$\frac{\partial \mathcal{L}(x; \lambda)}{\partial x} = x - v - 2\lambda (x - c) = 0_n$$
whereas the KKT condition is $\min\left( \lambda, (x - c)^\top (x - c) - r^2 \right) = 0$. We distinguish two cases:

1. If $\lambda = 0$, this means that $x^\star = v$ and $(x - c)^\top (x - c) - r^2 \ge 0$.
2. If $\lambda > 0$, we have $(x - c)^\top (x - c) = r^2$. Then we obtain the following system:
$$\begin{cases} x - v - 2\lambda (x - c) = 0_n \\ (x - c)^\top (x - c) = r^2 \end{cases}$$
We deduce that:
$$(x - c) - (v - c) - 2\lambda (x - c) = 0_n \tag{51}$$
and:
$$(x - c)^\top (x - c) - (x - c)^\top (v - c) - 2\lambda (x - c)^\top (x - c) = 0$$
It follows that $r^2 - (x - c)^\top (v - c) - 2\lambda r^2 = 0$, meaning that:
$$\lambda^\star = \frac{r^2 - (x - c)^\top (v - c)}{2 r^2}$$
We notice that:
$$(51) \Leftrightarrow (x - c) - (v - c) - \frac{r^2 - (x - c)^\top (v - c)}{r^2} (x - c) = 0_n \Leftrightarrow -r^2 (v - c) + \left( (x - c)^\top (v - c) \right) (x - c) = 0_n \Leftrightarrow \left( (x - c)^\top (v - c) \right) (x - c) = r^2 (v - c)$$
Because $(x - c)^\top (v - c)$ is a scalar, we deduce that $x - c$ and $v - c$ are two collinear vectors:
$$x - c = \frac{r (v - c)}{\left\| v - c \right\|_2}$$
The optimal solution is:
$$x^\star = c + \frac{r (v - c)}{\left\| v - c \right\|_2}$$

Combining the two cases gives:
$$\mathrm{prox}_{1_\Omega(x)}(v) = c + \frac{r}{\min\left( r, \left\| v - c \right\|_2 \right)} (v - c)$$
This is the formula of the projection onto the $\ell_2$ ball, where the minimum function has replaced the maximum function. See Equation (50).
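A direct implementation of the combined formula is a one-liner. A sketch (the inputs `v`, `c` and `r` are illustrative, and the case $v = c$ is excluded since the projection is then not unique):

```python
import numpy as np

def proj_l2_ball_complement(v, c, r):
    """Projection onto {x : ||x - c||_2 >= r}: same formula as the projection
    onto the l2 ball, with min replacing max (assumes v != c)."""
    return c + r / min(r, np.linalg.norm(v - c)) * (v - c)

proj_l2_ball_complement(np.array([0.5, 0.0]), np.zeros(2), 1.0)  # -> [1., 0.]
```

If $v$ lies inside the ball, it is pushed radially onto the sphere; if it already satisfies $\|v - c\|_2 \ge r$, it is left unchanged.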
A.8.11 Projection onto the complement $\bar{\mathcal{B}}_1(c,r)$ of the $\ell_1$ ball

The proximal problem associated with $\bar{\mathcal{B}}_1(0_n, r)$ is:
$$x^\star = \arg\min_x \frac{1}{2} (x - v)^\top (x - v) \quad \text{s.t.} \quad \left\| x \right\|_1 \ge r$$
We deduce that the Lagrange function is equal to:
$$\mathcal{L}(x; \lambda) = \frac{1}{2} (x - v)^\top (x - v) - \lambda \left( \left\| x \right\|_1 - r \right)$$
The first-order condition is:
$$\frac{\partial \mathcal{L}(x; \lambda)}{\partial x} = x - v - \lambda \, \mathrm{sign}(x) = 0_n$$
whereas the KKT condition is $\min\left( \lambda, \left\| x \right\|_1 - r \right) = 0$. We distinguish two cases:

1. If $\lambda = 0$, this means that $x^\star = v$ and $\left\| v \right\|_1 \ge r$.
2. If $\lambda > 0$, we have $\left\| x \right\|_1 = r$. Then we obtain the following system of equations:
$$\begin{cases} x - v = \lambda \, \mathrm{sign}(x) \\ \left\| x \right\|_1 = r \end{cases}$$
The first condition shows that $x - v$ is a vector whose elements are $+\lambda$ and/or $-\lambda$, whereas the second condition shows that $x$ is on the surface of the $\ell_1$ ball. Unfortunately, there is no unique solution. This is why we assume that $\mathrm{sign}(x) = \mathrm{sign}(v)$ and we modify the sign function: $\mathrm{sign}(a) = 1$ if $a \ge 0$ and $\mathrm{sign}(a) = -1$ if $a < 0$. In this case, there is a unique solution $x^\star = v + \lambda^\star \mathrm{sign}(v)$ where $\lambda^\star = n^{-1} \left( r - \left\| v \right\|_1 \right)$, because $|v + \lambda \, \mathrm{sign}(v)| = |v| + \lambda$.

Combining the two cases implies that:
$$\mathcal{P}_{\bar{\mathcal{B}}_1(0_n,r)}(v) = \begin{cases} v & \text{if } \left\| v \right\|_1 \ge r \\ v + n^{-1} \left( r - \left\| v \right\|_1 \right) \mathrm{sign}(v) & \text{if } \left\| v \right\|_1 < r \end{cases}$$
Using the translation property, we deduce that:
$$\mathcal{P}_{\bar{\mathcal{B}}_1(c,r)}(v) = v + n^{-1} \max\left( r - \left\| v - c \right\|_1, 0 \right) \mathrm{sign}(v - c)$$

A.8.12 The bid-ask linear cost function
If we consider the scalar function:
$$f(x) = \alpha (\gamma - x)_+ + \beta (x - \gamma)_+$$
we have:
$$f_v(x) = \lambda f(x) + \frac{1}{2} \left( x - v \right)^2 = \lambda \alpha (\gamma - x)_+ + \lambda \beta (x - \gamma)_+ + \frac{1}{2} x^2 - xv + \frac{1}{2} v^2$$
Following Beck (2017), we distinguish three cases:

1. If $f_v(x) = \lambda \alpha (\gamma - x) + \frac{1}{2} x^2 - xv + \frac{1}{2} v^2$, then $f_v'(x) = -\lambda \alpha + x - v$ and $x^\star = v + \lambda \alpha$. This implies that $\gamma - x^\star > 0$, that is $v < \gamma - \lambda \alpha$.
2. If $f_v(x) = \lambda \beta (x - \gamma) + \frac{1}{2} x^2 - xv + \frac{1}{2} v^2$, then $f_v'(x) = \lambda \beta + x - v$ and $x^\star = v - \lambda \beta$. This implies that $x^\star - \gamma > 0$, that is $v > \gamma + \lambda \beta$.
3. If $v \in [\gamma - \lambda \alpha, \gamma + \lambda \beta]$, the minimum is not obtained at a point of differentiability. Since $\gamma$ is the only point of non-differentiability, we obtain $x^\star = \gamma$.

Therefore, we can write the proximal operator in the following compact form:
$$x^\star = \gamma + \left( v - \gamma - \lambda \beta \right)_+ - \left( v - \gamma + \lambda \alpha \right)_-$$
where $x_-$ and $x_+$ are the negative part and the positive part of $x$. If we consider the vector-valued function $f(x) = \sum_{i=1}^n \alpha_i (\gamma_i - x_i)_+ + \beta_i (x_i - \gamma_i)_+$, we deduce that:
$$\mathrm{prox}_{\lambda f(x)}(v) = \gamma + \mathcal{S}(v - \gamma; \lambda \alpha, \lambda \beta)$$
where $\mathcal{S}(v; \lambda^-, \lambda^+) = \left( v - \lambda^+ \right)_+ - \left( v + \lambda^- \right)_-$ is the two-sided soft-thresholding operator.

A.9 The QP form of the ADMM-QP problem
We have:
$$f_{QP}(x) = f_{MVO}(x) + f_{\ell_2}(x) = \frac{1}{2} (x - b)^\top \Sigma_t (x - b) - \gamma (x - b)^\top \mu_t + \frac{1}{2} \varrho \left\| \Gamma (x - x_t) \right\|_2^2 + \frac{1}{2} \tilde{\varrho} \left\| \tilde{\Gamma} (x - \tilde{x}) \right\|_2^2$$
$$= \frac{1}{2} x^\top \Sigma_t x - x^\top \Sigma_t b + \frac{1}{2} b^\top \Sigma_t b - \gamma x^\top \mu_t + \gamma b^\top \mu_t + \frac{1}{2} x^\top \left( \varrho \Gamma^\top \Gamma \right) x - x^\top \left( \varrho \Gamma^\top \Gamma \right) x_t + \frac{1}{2} x_t^\top \left( \varrho \Gamma^\top \Gamma \right) x_t + \frac{1}{2} x^\top \left( \tilde{\varrho} \tilde{\Gamma}^\top \tilde{\Gamma} \right) x - x^\top \left( \tilde{\varrho} \tilde{\Gamma}^\top \tilde{\Gamma} \right) \tilde{x} + \frac{1}{2} \tilde{x}^\top \left( \tilde{\varrho} \tilde{\Gamma}^\top \tilde{\Gamma} \right) \tilde{x}$$
$$= \frac{1}{2} x^\top \left( \Sigma_t + \varrho \Gamma^\top \Gamma + \tilde{\varrho} \tilde{\Gamma}^\top \tilde{\Gamma} \right) x - x^\top \left( \gamma \mu_t + \Sigma_t b + \varrho \Gamma^\top \Gamma x_t + \tilde{\varrho} \tilde{\Gamma}^\top \tilde{\Gamma} \tilde{x} \right) + \gamma b^\top \mu_t + \frac{1}{2} \left( b^\top \Sigma_t b + \varrho x_t^\top \Gamma^\top \Gamma x_t + \tilde{\varrho} \tilde{x}^\top \tilde{\Gamma}^\top \tilde{\Gamma} \tilde{x} \right)$$

A.10 The CCD algorithm of a QP form with a logarithmic barrier
A.10 The CCD algorithm of a QP form with a logarithmic barrier

We consider the following optimization problem:
$$x^\star = \arg\min_x \frac{1}{2}x^\top Q x - x^\top R - \sum_{i=1}^n \lambda_i \ln x_i$$
where $Q$ is a positive-definite matrix and $\lambda_i > 0$. The first-order condition with respect to coordinate $x_i$ is:
$$(Qx)_i - R_i - \frac{\lambda_i}{x_i} = 0$$
It follows that $x_i (Qx)_i - R_i x_i - \lambda_i = 0$ or equivalently:
$$Q_{i,i} x_i^2 + \left(\sum_{j \neq i} x_j Q_{i,j} - R_i\right) x_i - \lambda_i = 0$$
The polynomial function is convex because we have $Q_{i,i} > 0$. Since the product of the roots is equal to $-\lambda_i/Q_{i,i} < 0$, we have two solutions with opposite signs. We deduce that the solution is the positive root of the second-degree equation:
$$x_i^\star = \frac{R_i - \sum_{j \neq i} x_j Q_{i,j} + \sqrt{\left(\sum_{j \neq i} x_j Q_{i,j} - R_i\right)^2 + 4\lambda_i Q_{i,i}}}{2 Q_{i,i}}$$
It follows that the CCD algorithm is:
$$x_i^{(k+1)} = \frac{R_i - \sum_{j < i} x_j^{(k+1)} Q_{i,j} - \sum_{j > i} x_j^{(k)} Q_{i,j} + \sqrt{\left(\sum_{j < i} x_j^{(k+1)} Q_{i,j} + \sum_{j > i} x_j^{(k)} Q_{i,j} - R_i\right)^2 + 4\lambda_i Q_{i,i}}}{2 Q_{i,i}}$$
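The update formula above can be sketched as follows (a minimal NumPy illustration with hypothetical names; updating the iterate in place means each sweep automatically uses the already-updated coordinates $j < i$, and the iterate stays strictly positive so the logarithmic barrier remains well-defined):

```python
import numpy as np

def ccd_log_barrier(Q, R, lam, n_iter=200):
    """Coordinate descent for min 1/2 x'Qx - x'R - sum_i lam_i ln x_i.

    Each coordinate update takes the positive root of the second-degree
    equation derived from the first-order condition."""
    Q, R, lam = map(np.asarray, (Q, R, lam))
    n = len(R)
    x = np.ones(n)  # any strictly positive starting point
    for _ in range(n_iter):
        for i in range(n):
            c = Q[i] @ x - Q[i, i] * x[i]  # sum_{j != i} Q_{i,j} x_j
            x[i] = (R[i] - c + np.sqrt((c - R[i]) ** 2 + 4 * lam[i] * Q[i, i])) / (2 * Q[i, i])
    return x

Q = np.array([[2.0, 0.5], [0.5, 1.0]])
x = ccd_log_barrier(Q, R=np.array([1.0, 1.0]), lam=np.array([0.1, 0.1]))
# at the optimum, the first-order condition (Qx)_i - R_i - lam_i / x_i = 0 holds
print(Q @ x - np.array([1.0, 1.0]) - 0.1 / x)
```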
B Data

Parameter set #1

We consider a capitalization-weighted stock index, which is composed of eight stocks. The weights of this benchmark are equal to 23%, 19%, 17%, 13%, 9%, 8%, 6% and 5%. We assume that their volatilities are 21%, 20%, 40%, 18%, 35%, 23%, 7% and 29%. The correlation matrix is defined as follows:
$$\rho = $$

Parameter set #2
We consider a universe of eight stocks. We assume that their volatilities are 25%, 20%, 15%, 18%, 30%, 20%, 15% and 35%. The correlation matrix is defined as follows:
$$\rho = $$

C Notations

• $\mu = (\mu_1, \ldots, \mu_n)$ is the vector of expected returns.
• $\Sigma = (\rho_{i,j}\sigma_i\sigma_j)_{1 \leq i,j \leq n}$ is the covariance matrix, where $\sigma_i$ is the volatility of Asset $i$ and $\rho_{i,j}$ is the correlation between Asset $i$ and Asset $j$.
• $b$ is the vector of benchmark weights.
• $\mu(x) = x^\top \mu$ is the expected return of Portfolio $x$.
• $\sigma(x) = \sqrt{x^\top \Sigma x}$ is the volatility of Portfolio $x$.
• $\mu(x \mid b) = (x - b)^\top \mu$ is the expected excess return of Portfolio $x$ with respect to Benchmark $b$.
• $\sigma(x \mid b) = \sqrt{(x - b)^\top \Sigma (x - b)}$ is the tracking error volatility of Portfolio $x$ with respect to Benchmark $b$.
• $\mathcal{R}(x)$ is a convex risk measure.
• $\mathcal{RB} = (\mathcal{RB}_1, \ldots, \mathcal{RB}_n)$ is the vector of risk budgets.
• $\mathcal{RC}_i(x)$ is the risk contribution of Asset $i$ with respect to Portfolio $x$.
• $\tau(x \mid \tilde{x}) = \sum_{i=1}^n |x_i - \tilde{x}_i|$ is the turnover between Portfolios $x$ and $\tilde{x}$. The maximum acceptable turnover is denoted by $\tau^+$.
• $\boldsymbol{c}(x \mid \tilde{x})$ is the cost function when rebalancing Portfolio $x$ from Portfolio $\tilde{x}$. The maximum acceptable cost is denoted by $\boldsymbol{c}^+$.
• $\mathcal{AS}(x \mid b) = \sum_{i=1}^n |x_i - b_i|$ is the active share of Portfolio $x$ with respect to Benchmark $b$. $\mathcal{AS}^-$ is the minimum acceptable active share.
• $\mathcal{H}(x) = \sum_{i=1}^n x_i^2$ is the Herfindahl index.
• $\mathcal{N}(x) = 1/\mathcal{H}(x)$ is the number of effective bets. $\mathcal{N}^-$ corresponds to the minimum acceptable number of effective bets.
• $\mathcal{DR}(x) = \left(x^\top \sigma\right)/\sqrt{x^\top \Sigma x}$ is the diversification ratio of Portfolio $x$.
• $\mathcal{LS}(x) = \left|\sum_{i=1}^n x_i\right|$ is the long/short exposure of Portfolio $x$.
• $\mathcal{L}(x) = \sum_{i=1}^n |x_i|$ is the leverage of Portfolio $x$.
• $\mathcal{SE}(x) = -\sum_{i=1}^n x_i \ln x_i$ is Shannon's entropy of $x$.
• $\mathcal{KL}(x) = \sum_{i=1}^n x_i \ln(x_i/\tilde{x}_i)$ is the Kullback-Leibler divergence between $x$ and $\tilde{x}$.
• $\mathcal{W}(x)$ is the Lambert $W$ function satisfying $\mathcal{W}(x) e^{\mathcal{W}(x)} = x$.
• $\mathbf{0}_n$ is the vector of zeros.
• $\mathbf{1}_n$ is the vector of ones.
• $e_i$ is the unit vector, i.e. $[e_i]_i = 1$ and $[e_i]_j = 0$ for all $j \neq i$.
• $x^- = \max(-x, 0) = -\min(x, 0)$ is the negative part of $x$.
• $x^+ = \max(x, 0)$ is the positive part of $x$.
• $\mathbb{1}_\Omega(x)$ is the convex indicator function of $\Omega$: $\mathbb{1}_\Omega(x) = 0$ for $x \in \Omega$ and $\mathbb{1}_\Omega(x) = +\infty$ for $x \notin \Omega$.
• $A^\dagger$ is the Moore-Penrose pseudo-inverse matrix of $A$.
• $\|x\|_p = \left(\sum_{i=1}^n |x_i|^p\right)^{1/p}$ is the $\ell_p$ norm.
• $\|x\|_A = \left(x^\top A x\right)^{1/2}$ is the weighted $\ell_2$ norm.
• $x \odot y$ is the Hadamard element-wise product: $[x \odot y]_{i,j} = [x]_{i,j}[y]_{i,j}$.
• $\operatorname{prox}_f(v)$ is the proximal operator of $f(x)$: $\operatorname{prox}_f(v) = \arg\min_x \left\{ f(x) + \frac{1}{2}\|x - v\|_2^2 \right\}$.
• $\mathcal{S}(v; \lambda) = \operatorname{sign}(v) \cdot (|v| - \lambda)^+$ is the soft-thresholding operator.
• $\mathcal{S}(v; \lambda^-, \lambda^+) = (v - \lambda^+)^+ - (v + \lambda^-)^-$ is the two-sided soft-thresholding operator. We have the following property: $\mathcal{S}(v; \lambda) = \mathcal{S}(v; \lambda, \lambda)$.
• $\mathcal{T}(v, x^-, x^+) = \max(x^-, \min(v, x^+))$ is the truncation operator.
• $\mathcal{P}_\Omega(v)$ is the projection of $v$ onto the set $\Omega$: $\mathcal{P}_\Omega(v) = \arg\min_{x \in \Omega} \frac{1}{2}\|x - v\|_2^2 = \operatorname{prox}_{\mathbb{1}_\Omega(x)}(v)$.
• $\mathcal{S}_n$ is the unit simplex with dimension $n$.
• $\mathcal{A}\mathrm{ffineset}[A, B]$ is the affine set $\{x \in \mathbb{R}^n : Ax = B\}$.
• $\mathcal{H}\mathrm{yperplane}[a, b]$ is the hyperplane $\{x \in \mathbb{R}^n : a^\top x = b\}$.
• $\mathcal{H}\mathrm{alfspace}[c, d]$ is the half-space $\{x \in \mathbb{R}^n : c^\top x \leq d\}$.
• $\mathcal{B}\mathrm{ox}[x^-, x^+]$ is the box $\{x \in \mathbb{R}^n : x^- \leq x \leq x^+\}$.
• $\mathcal{B}_p(c, r)$ is the $\ell_p$-ball $\{x \in \mathbb{R}^n : \|x - c\|_p \leq r\}$.
• $\mathcal{D}$ is the weight diversification set $\{x \in \mathbb{R}^n : \mathcal{D}(x) \geq \mathcal{D}^-\}$, where $\mathcal{D}(x)$ is the diversification measure and $\mathcal{D}^-$ is the minimum acceptable value of this measure.
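Several of the statistics listed above are one-liners in code. The following sketch (hypothetical helper, illustrative data) computes a few of them for a long-only portfolio:

```python
import numpy as np

def portfolio_stats(x, Sigma):
    """A few of the statistics defined above, for a long-only weight vector x."""
    sigma_i = np.sqrt(np.diag(Sigma))  # asset volatilities
    vol = np.sqrt(x @ Sigma @ x)       # sigma(x)
    H = np.sum(x ** 2)                 # Herfindahl index H(x)
    return {
        "volatility": vol,
        "herfindahl": H,
        "effective_bets": 1.0 / H,                     # N(x) = 1 / H(x)
        "diversification_ratio": (x @ sigma_i) / vol,  # DR(x)
        "shannon_entropy": -np.sum(x * np.log(x)),     # SE(x)
    }

# equally-weighted portfolio of 4 uncorrelated assets with 20% volatility
x = np.full(4, 0.25)
Sigma = np.diag(np.full(4, 0.2 ** 2))
stats = portfolio_stats(x, Sigma)
print(stats["effective_bets"])  # 4.0 for the equally-weighted portfolio
```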