Analytic approach to variance optimization under an \ell_1 constraint
Imre Kondor, Gábor Papp, Fabio Caccioli

July 16, 2018
Abstract
The optimization of the variance supplemented by a budget constraint and an asymmetric ℓ₁ regularizer is carried out analytically by the replica method borrowed from the theory of disordered systems. The asymmetric regularizer allows us to penalize short and long positions differently, so the present treatment includes the no-short-constrained portfolio optimization problem as a special case. Results are presented for the out-of-sample and the in-sample estimator of the regularized variance, the relative estimation error, the density of the assets eliminated from the portfolio by the regularizer, and the distribution of the optimal portfolio weights. We have studied the dependence of these quantities on the ratio r of the portfolio's dimension N to the sample size T, and on the strength of the regularizer. We have checked the analytic results by numerical simulations, and found general agreement. Regularization extends the interval where the optimization can be carried out, and suppresses the large sample fluctuations, but the performance of ℓ₁ regularization is rather disappointing: if the sample size is large relative to the dimension, i.e. r is small, the regularizer does not play any role, while for r's where the regularizer starts to be felt the estimation error is already so large as to make the whole optimization exercise pointless. We find that ℓ₁ regularization can eliminate at most half the assets from the portfolio; correspondingly, there is a critical ratio r = 2 beyond which the ℓ₁-regularized variance cannot be optimized: the regularized variance becomes constant over the simplex. These facts do not seem to have been noticed in the literature.

Introduction
In this paper we present analytic results for a simple quadratic optimization problem with a linear constraint plus an ℓ₁ regularizer. Although we are going to speak in terms of portfolio optimization, it is important to emphasize that the problem we address is not specific to portfolios, but is a generic feature of quadratic optimization if the dimension is high and the objective function is estimated on the basis of a limited number of observations. We will assume that there is no additional information (like prior knowledge or sparsity) available besides the observations, and wish to find out how much can be learned from the limited data. Our objective function will be the portfolio variance. In order to find the optimum of the variance over the portfolio weights, one has to invert the estimated covariance matrix, which is possible only if its dimension N is not larger than the number of observations T. The ratio r = N/T is a fundamentally important control parameter of the problem. If the number of observations is much larger than the dimension, classical statistics works and the estimated optimum will be very close to the true optimum which can be obtained in the limit T → ∞. If T is not very large relative to N, we are in the high dimensional regime where sample fluctuations can be large and regularizers have to be introduced to rein them in. Regularizers suppress large excursions, and unavoidably introduce bias, but the hope is that a reasonable trade-off can be achieved between the bias and sample fluctuations with a proper choice of the strength of the regularizer. To see whether this hope is fulfilled is one of the aims of this paper.

A common regularizer is ℓ₂ (shrinkage or ridge regression), whose effect has been studied by a number of authors, see [1–7] among many others. In its most recent, non-linear form shrinkage can produce very good quality estimates [8–10]. Another popular regularizer is based on the ℓ₁ norm (lasso) [11].
Lasso is known to lead to sparse estimates, reducing the effective dimension of the problem and stabilizing the estimator. Jagannathan and Ma [12] considered portfolio optimization under a constraint excluding short positions. Although they did not speak about regularization, a no-short constraint is, in fact, a special case of an asymmetric ℓ₁ regularizer. Brodie et al. [13] and DeMiguel et al. [14] studied the effect of ℓ₁ regularization on the performance and stability of portfolio selection. Subsequently, a number of groups investigated various aspects of the application of ℓ₁ and related regularizers in portfolio optimization, e.g. [15–19].

The problem of optimizing the variance under an ℓ₁ constraint is a quadratic programming task which can be solved numerically. Our purpose here is to solve this problem analytically, which, to the best of our knowledge, has not been done before. The method that enables us to do this is borrowed from the theory of disordered systems and goes by the name of the method of replicas [20]. It assumes that the underlying distribution is Gaussian and it works in the Kolmogorov limit, where both N and T go to infinity, but their ratio r = N/T is kept finite.

We will show that the ℓ₁ regularizer does not eliminate the instability, only shifts its value. (A similar effect was observed in the case of the Expected Shortfall risk measure in [21].) The new critical value turns out to be r = 2, corresponding to the fact that ℓ₁ eliminates at most half of the assets from the portfolio.

There is an important difference between our analytic approach and the standard statistical estimation procedure which analyzes a given sample and tests it by cross-validation [22]. Instead, our method allows us to average over the whole ensemble of samples.
Corresponding to this, the step-like effect of ℓ₁, eliminating the dimensions one by one, is replaced upon averaging over the samples by a smooth, monotonically increasing density of the zero weights.

In order to make contact with a previous work in which we treated the case of excluded short positions [23], we are going to consider an asymmetric ℓ₁ regularizer here, with different slopes for positive, resp. negative weights. We find that in the most important results only the right hand side slope appears.

Many dimensionality reduction or cleaning methods focus on the covariance matrix, especially on its spectrum. In contrast, the special version of the replica method we use allows us to derive the distribution of optimal portfolio weights directly.

The plan of the paper is as follows. In Sec. 2 we set up the problem and present some preliminary results. In Sec. 3 we recall some results from [23] where the task of optimizing over the N portfolio weights has been reduced to the optimization of an effective objective function depending on five order parameters. We also spell out the first order conditions (or saddle point conditions) that determine the stationary point of the objective function. The solution to the saddle point equations is analyzed in a number of subsections and the results for various special cases are displayed graphically. Sec. 4 is a summary of the results, while a sketch of the derivation of the effective objective function is provided in the Appendix.

In this section we set up the optimization problem, fix notation and present some preliminary results that will be useful as checks on the replica calculation later. We consider a portfolio of N assets with random returns x_i, i = 1, . . . , N.
For simplicity, we assume that the returns are independent Gaussian random variables with zero expectation value and variance σ_i², which may be different for each asset i. For the time being we assume that we have complete knowledge of the distribution of the returns. If we denote the portfolio weights as w_i, the return on the portfolio is Σ_{i=1}^N w_i x_i, and under the assumption above the variance of the portfolio will be

  σ_p² = Σ_i σ_i² w_i².    (1)

This is to be minimized subject to the budget constraint

  Σ_i w_i = N,    (2)

where we set the budget to be N instead of the usual 1, to have O(1) weights in the limit of large N. Then, with the Lagrange multiplier associated with the budget constraint denoted by λ, we would have to find the minimum of

  F = Σ_i σ_i² w_i² − λ (Σ_{i=1}^N w_i − N)    (3)

over the weights w_i, a trivial task.

So far, the distribution of the returns (in particular, the variances σ_i² of the assets) has been assumed to be known. In real life this is never the case; instead we have to estimate the optimal weights and portfolio variance on the basis of finite samples. Let us assume that we draw these samples from a multivariate distribution of independent Gaussian variables with individual standard deviations σ_i. These samples are constituted of T observations for each asset: x_it, i = 1, 2, . . . , N; t = 1, 2, . . . , T. We wish to learn to what extent it is possible to recover the true optimum of the variance and the optimal weights by averaging over a large number of such samples. Thus we have the optimization problem

  min_w Σ_{i,j} w_i C_ij w_j ,  s.t. Σ_{i=1}^N w_i = N,    (4)

where C_ij is the estimated covariance matrix

  C_ij = (1/T) Σ_{t=1}^T x_it x_jt.    (5)

Substituting (5) into (4), the optimization problem becomes

  min_w (1/T) Σ_{t=1}^T (Σ_i w_i x_it)² ,  s.t. Σ_{i=1}^N w_i = N.
(6)

This is a quadratic optimization problem which can be solved numerically, as long as the covariance matrix is positive definite, which holds with probability one for T ≥ N, that is for r ≤ 1. As r approaches 1 from below, sample fluctuations become larger and larger, until at r = 1 the estimation error diverges, and for r > 1 zero eigenvalues of the covariance matrix appear and the optimization becomes meaningless. A remedy is offered by regularization; here we consider the regularizer based on the ℓ₁ norm (lasso) [11]. It is known to result in sparse estimates, which in the present context means eliminating a part of the assets from the optimal portfolio, thereby reducing its effective dimension. Lasso is extensively used in a variety of problems in high dimensional statistics and machine learning [22, 24]. Its first applications to portfolio optimization are due to Brodie et al. [13] and DeMiguel et al. [14]. For the non-analytic character of lasso a full analytic treatment has, to our knowledge, not been attempted. An analytic approach, valid in the large N limit, will be presented in the next section.

Let us spell out the ℓ₁ regularizer we are going to apply:

  ℓ₁(η₁, η₂) = η₁ Σ_i w_i θ(w_i) − η₂ Σ_i w_i θ(−w_i),    (7)

where η₁ and η₂ are positive coefficients and θ(x) is the Heaviside step function. The regularizer so defined is asymmetric, having different slopes for positive and negative weights. The special case η₁ = η₂ = η corresponds to the usual expression η Σ_i |w_i|. Keeping the two slopes different allows us to penalize long and short positions differently. Our regularized objective function is then

  F = (1/T) Σ_{t=1}^T (Σ_i w_i x_it)² + η₁ Σ_i w_i θ(w_i) − η₂ Σ_i w_i θ(−w_i) − λ (Σ_{i=1}^N w_i − N).    (8)

As the first term is non-negative and the last term vanishes for w_i's satisfying the budget constraint, F is larger than or equal to the minimum of ℓ₁(η₁, η₂), which is Nη₁. Therefore F ≥ Nη₁, where the equality holds when the variance vanishes and the weights minimize the regularizer ℓ₁(η₁, η₂) (which requires that they are on the simplex w_i ≥ 0 ∀i, Σ_i w_i = N). Alternatively, for the value of the objective function per asset we have the inequality

  F/N = f ≥ η₁.    (9)

This will prove important later.

As it stands, (8) is amenable to numerical work, with the returns drawn from a suitable distribution. When the returns are independent Gaussians and N and T are large, one can derive the analytic results displayed in the next section.

A special limit of the above optimization problem is worth considering already at this point, because it provides an important consistency check on the results to be presented later. Let us assume that we have very large samples compared with the number of assets in the portfolio, i.e. T ≫ N, or r = N/T → 0. This means we have complete information about the distribution of returns. Then, for the independent random returns considered here, the covariance matrix C_ij becomes diagonal with diagonal elements σ_i², and the optimization problem becomes

  F = Σ_i σ_i² w_i² + η₁ Σ_i w_i θ(w_i) − η₂ Σ_i w_i θ(−w_i) − λ (Σ_{i=1}^N w_i − N).    (10)

A little reflection shows that the solution of this optimization problem can satisfy the budget constraint Σ_i w_i = N for a positive N only if the Lagrange multiplier λ is larger than the right slope η₁ of the regularizer: λ > η₁. Then the Lagrange multiplier works out to be

  λ = 2N / (Σ_{j=1}^N 1/σ_j²) + η₁,    (11)

the optimal weights are

  w*_i = N / (σ_i² Σ_{j=1}^N 1/σ_j²),    (12)

and the minimal value of the objective function F obtains as

  F* = N² / (Σ_{j=1}^N 1/σ_j²) + N η₁,    (13)

while the minimal value of the objective function per asset is

  F*/N = f* = N / (Σ_{j=1}^N 1/σ_j²) + η₁ = (λ + η₁)/2.    (14)

Note the order of magnitudes in the above formulae: λ, η₁, and w*_i are of O(1), while the sum Σ_{j=1}^N 1/σ_j² and the objective function are O(N). We also have to point out that there is a difference in the notation relative to our earlier papers, especially [23], where we absorbed a factor 1/r in the definition of the objective function f* and the Lagrange multiplier λ. This did not change any of the results there, except sending λ to infinity in the limit r → 0, which resulted in some convenience. In contrast to that paper, instead of considering the special limit η₁ = 0 and η₂ → ∞, here we are going to keep the coefficients of the regularizer finite, so the convention of absorbing 1/r into the objective function would dictate its absorption into η₁, η₂ as well. This would distort some of the figures, and would make the message of the paper harder to grasp. Therefore, in the present paper we have this factor 1/r explicitly written out and kept throughout.

The results obtained above for the Lagrange multiplier, the optimal weights and the optimal value of the objective function in the limit r → 0 will serve as checks on the replica calculation. In this limit the distribution p(w) of the optimal portfolio weights would be a series of sharp spikes

  p(w) = (1/N) Σ_{i=1}^N δ(w − w*_i),    (15)

where δ(x) is the Dirac delta distribution.

Optimizing the variance under an ℓ₁ constraint

Our task is to find the optimum of the objective function in (8), where the returns x_it are assumed to be drawn from the joint probability density of N independent Gaussian variables with zero mean and variance σ_i². Following the special version of the replica method laid out in [21], in [23] we showed how the optimization of (8) could be reduced to that of an effective objective function depending on five "order parameters". The method we applied to achieve this was the method of replicas, borrowed from the statistical mechanics of disordered systems [20]. (We will denote this effective objective function by the same symbol f as its full-information counterpart in the preceding section, and will omit the adjective "effective" in the following.) The derivation has been presented in [21] and also in the appendices of [23], and is sketched in the Appendix to this paper for easier reference.
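The closed-form full-information solution (11)–(14) is easy to verify numerically. The following sketch (our own illustration, not part of the paper; the variances and the value of η are arbitrary, and we take the symmetric case η₁ = η₂ = η so that the regularizer is simply η Σ|w_i|) builds the weights (12), checks the budget constraint and the minimal value (13)–(14), and confirms that budget-preserving perturbations can only increase the objective:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 50
eta = 0.3                                # symmetric slopes eta1 = eta2 = eta (assumption)
sigma2 = rng.uniform(0.5, 3.0, size=N)   # asset variances sigma_i^2, illustrative values

S = (1.0 / sigma2).sum()                 # sum_j 1/sigma_j^2
lam = 2.0 * N / S + eta                  # Lagrange multiplier, eq. (11)
w = N / (sigma2 * S)                     # optimal weights, eq. (12); all positive

def objective(v):
    # eq. (10) restricted to the budget surface, with eta1 = eta2 = eta
    return (sigma2 * v**2).sum() + eta * np.abs(v).sum()

F_star = N**2 / S + N * eta              # minimal objective, eq. (13)
assert np.isclose(w.sum(), N)            # budget constraint (2)
assert np.isclose(objective(w), F_star)  # eq. (13) reproduced
assert np.isclose(objective(w) / N, (lam + eta) / 2)   # per-asset value, eq. (14)

# any budget-preserving perturbation increases the objective
for _ in range(100):
    dv = rng.normal(size=N)
    dv -= dv.mean()                      # keeps sum(v) = N
    assert objective(w + 0.1 * dv) >= F_star
```

The last loop is a crude substitute for a formal optimality proof: since the perturbed weights may go negative, the inequality also illustrates that the simplex bound F ≥ Nη₁ of eq. (9) is not the binding one here.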
In the present section we can start from the expression for the effective objective function f(λ, q₀, ∆, q̂₀, ∆̂), depending on the order parameters λ, q₀, ∆, q̂₀, ∆̂, as given in the Appendix:

  f(λ, q₀, ∆, q̂₀, ∆̂) = q₀/(1 + ∆) − r q̂₀ ∆ − 2r ∆̂ q₀ + λ + min_w ⟨V(w)⟩_{z,σ},    (16)

where

  V(w) = 2r ∆̂ σ² w² − 2r w z σ √(−q̂₀) − λ w + η₁ w θ(w) − η₂ w θ(−w),    (17)

and the double average ⟨. . .⟩_{z,σ} means

  ⟨. . .⟩_{z,σ} = ∫₀^∞ dσ (1/N) Σ_i δ(σ − σ_i) ∫_{−∞}^∞ (dz/√(2π)) e^{−z²/2} . . . ,    (18)

with σ_i the standard deviation of the distribution of returns on asset i. The minimum of the "potential" V is at

  w* = [2r σ z √(−q̂₀) + λ − η₁ θ(w*) + η₂ θ(−w*)] / (4r ∆̂ σ²).    (19)

Substituting this back into (17) and performing the averaging according to the recipe (18), we find

  min_w ⟨V(w)⟩_{z,σ} = (r q̂₀/∆̂) (1/N) Σ_i [ W( (λ − η₁)/(2r σ_i √(−q̂₀)) ) + W( −(λ + η₂)/(2r σ_i √(−q̂₀)) ) ].    (20)

This is then the explicit form of the last term in (16), which thus becomes

  f(λ, q₀, ∆, q̂₀, ∆̂) = q₀/(1 + ∆) − r q̂₀ ∆ − 2r ∆̂ q₀ + λ + (r q̂₀/∆̂) (1/N) Σ_i [ W( (λ − η₁)/(2r σ_i √(−q̂₀)) ) + W( −(λ + η₂)/(2r σ_i √(−q̂₀)) ) ].    (21)

The function W appearing here is the third integral of the standard normal Gaussian density; its precise definition will be given shortly, together with two more functions that appear frequently in the following. Stationarity of (21) with respect to the order parameters gives the first order conditions

  ∆̂ = 1/(2r(1 + ∆)),    (22)

  q̂₀ = −q₀/(r(1 + ∆)²),    (23)

  1/√(q₀ r) = (1/N) Σ_i (1/σ_i) [ Ψ( w₁^(i)/σ_w^(i) ) − Ψ( −w₂^(i)/σ_w^(i) ) ],    (24)

  ∆ = [ (r/N) Σ_i ( Φ( w₁^(i)/σ_w^(i) ) + Φ( −w₂^(i)/σ_w^(i) ) ) ] / [ 1 − (r/N) Σ_i ( Φ( w₁^(i)/σ_w^(i) ) + Φ( −w₂^(i)/σ_w^(i) ) ) ],    (25)

  1/(2r) = (1/N) Σ_i [ W( w₁^(i)/σ_w^(i) ) + W( −w₂^(i)/σ_w^(i) ) ].    (26)

Here r = N/T, as before. The functions Φ, Ψ and W are the successive integrals of the Gaussian density:

  Φ(x) = ∫_{−∞}^x (dt/√(2π)) e^{−t²/2},    (27)

  Ψ(x) = ∫_{−∞}^x dt Φ(t),    (28)

  W(x) = ∫_{−∞}^x dt Ψ(t).    (29)

In the above formulae the following notations have been introduced:

  w₁^(i) = (λ − η₁)/(4r σ_i² ∆̂) = (λ − η₁)(1 + ∆)/(2σ_i²),    (30)

  w₂^(i) = (λ + η₂)/(4r σ_i² ∆̂) = (λ + η₂)(1 + ∆)/(2σ_i²),    (31)

and

  σ_w^(i) = √(q₀ r)/σ_i,    (32)

where in (30), (31) and (32) use has been made of (22) and (23). With this we can eliminate q̂₀ and ∆̂ from our equations. Proceeding similarly in (21) and using (26), we find the expression for the objective function in terms of the remaining three order parameters as

  f = λ − q₀/(1 + ∆)².    (33)

According to the derivation of the objective function in the Appendix, when (24), (25) and (26) are solved and λ, q₀ and ∆ are obtained as functions of the control parameters r, η₁ and η₂, equation (33) gives the in-sample estimate of the objective function.

With (19), the distribution of weights obtains from p(w) = ⟨δ(w − w*)⟩_{z,σ} as

  p(w) = n₀ δ(w) + (1/N) Σ_i (1/(σ_w^(i) √(2π))) e^{−(w − w₁^(i))²/(2 σ_w^(i)²)} θ(w) + (1/N) Σ_i (1/(σ_w^(i) √(2π))) e^{−(w − w₂^(i))²/(2 σ_w^(i)²)} θ(−w).    (34)

The first term in this formula shows that the ℓ₁ regularizer eliminates some of the assets from the portfolio by setting their weight to zero. The density of these assets, n₀, is given by

  n₀ = (1/N) Σ_i [ Φ( w₂^(i)/σ_w^(i) ) − Φ( w₁^(i)/σ_w^(i) ) ].    (35)

The two sums in (34) are made up of truncated Gaussians, the first sum corresponding to the weight distribution of positive (long) positions, the second to negative (short) ones. We see then that the series of discrete, sharp spikes in (15) is broadened by sample fluctuations, and in addition to the positive weights, also negative ones appear. Equation (34) reveals the meaning of the symbols introduced in (30), (31), and (32): w₁^(i) and w₂^(i) are the centers of the estimated positive, resp. negative weight distribution of asset i, and σ_w^(i) is the width of these distributions.

Note how the distribution of optimal weights has been obtained directly from our formalism, without having to go through the calculation of the estimated covariance matrix.

The order parameter q₀ will be of central importance for us. In [25] we showed that

  q̃ = q₀ (1/N) Σ_i 1/σ_i²    (36)

is proportional to the out-of-sample estimate of the variance Σ_ij w_i^est C_ij^true w_j^est:

  q̃ = Σ_ij w_i^est C_ij^true w_j^est / Σ_ij w_i^true C_ij^true w_j^true,    (37)

where C_ij^true is the true covariance matrix, w_i^true the corresponding optimal portfolio weights, and w_i^est the optimal weights corresponding to the estimated covariance matrix. The denominator in (37) serves just to normalise q̃. From the definition it is clear that q̃ ≥ 1, and

  √(q̃ − 1)    (38)

will be called the relative estimation error.
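The first order conditions (24)–(26) are easy to solve numerically. Below is a sketch of such a solver (our own illustration, not part of the paper), specialized to the uniform case σ_i = σ = 1, with Φ, Ψ, W built from the error function. For η₁ = η₂ = 0 one can check by substitution that λ = 2(1 − r), q₀ = 1/(1 − r), ∆ = r/(1 − r) solve the equations exactly, which provides a test of the solver and reproduces the classic in-sample value f = 1 − r via eq. (33):

```python
import numpy as np
from math import erf, exp, pi, sqrt
from scipy.optimize import fsolve

# Standard normal density and its successive integrals, eqs. (27)-(29)
phi = lambda x: exp(-x * x / 2) / sqrt(2 * pi)
Phi = lambda x: 0.5 * (1 + erf(x / sqrt(2)))
Psi = lambda x: x * Phi(x) + phi(x)
W   = lambda x: 0.5 * ((1 + x * x) * Phi(x) + x * phi(x))

def saddle_equations(p, r, eta1, eta2):
    """First order conditions (24)-(26), specialized to sigma_i = 1 for all i."""
    lam, q0, Delta = p
    sw = sqrt(abs(q0) * r)                       # sigma_w, eq. (32); abs() guards iterates
    x1 = (lam - eta1) * (1 + Delta) / (2 * sw)   # w1/sigma_w, from eq. (30)
    x2 = (lam + eta2) * (1 + Delta) / (2 * sw)   # w2/sigma_w, from eq. (31)
    frac = Phi(x1) + Phi(-x2)                    # = 1 - n0, cf. eqs. (25) and (35)
    return (sw * (Psi(x1) - Psi(-x2)) - 1.0,     # eq. (24), budget constraint
            Delta * (1 - r * frac) - r * frac,   # eq. (25)
            2 * r * (W(x1) + W(-x2)) - 1.0)      # eq. (26)

r = 0.3
lam, q0, Delta = fsolve(saddle_equations, x0=(1.5, 1.5, 0.5), args=(r, 0.0, 0.0))
assert np.allclose((lam, q0, Delta),
                   (2 * (1 - r), 1 / (1 - r), r / (1 - r)), atol=1e-6)
print("in-sample f =", lam - q0 / (1 + Delta) ** 2)   # eq. (33); equals 1 - r at eta = 0
```

For η > 0 the same routine can be called with nonzero slopes, using the η = 0 solution as the starting point; this is how the curves in the figures below can be generated.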
The limit r → 0

The limit r → 0 corresponds to T ≫ N. This means we have much more data than the dimension, so in this limit we have to recover the results of Section 2. From (30), (31), and (32) we see that w₁^(i) and w₂^(i) are of order O(1), while σ_w^(i) vanishes. (In the r → 0 limit λ and q₀ will be seen to be of O(1), while ∆ of O(r), shortly.) Then, in the limit r → 0, the arguments of the functions in (24)–(26) go to +∞ and −∞, respectively. For large x, Ψ(x) ∼ x, and Ψ(−x) is exponentially small, so (24) yields

  1/√(q₀ r) ≈ (1/N) Σ_i (1/σ_i) (w₁^(i)/σ_w^(i)),

which, by (30) and (32), leads to

  λ = 2N/(Σ_i 1/σ_i²) + η₁,    (39)

in accordance with (11). We anticipated that in the small r limit ∆ ≈ r. Indeed, as lim_{x→∞} Φ(x) = 1 and lim_{x→−∞} Φ(x) = 0, (25) immediately gives ∆ ≈ r for r → 0. Finally, we obtain q₀ by noting that, for large x, W(x) ∼ x²/2, while W(−x) is exponentially small:

  q₀ = 1/((1/N) Σ_i 1/σ_i²),  for r → 0.    (40)

Eq. (40) then implies that for r → 0, q̃ → 1, which means that the relative estimation error vanishes, a natural result in the limit T/N → ∞.

The value of the objective function at the stationary point is obtained by substituting the above results into eq. (33):

  f = 1/((1/N) Σ_i 1/σ_i²) + η₁,    (41)

in agreement with (14).

Let us turn to the distribution of weights now. As the arguments of the Φ's in (35) go to infinity for r → 0, the Φ's themselves go to 1, so n₀ vanishes in this limit. From (30) and (31) we see that for r → 0 the centers w₁^(i) and w₂^(i) tend to finite limits; in particular

  w₁^(i) → N/(σ_i² Σ_j 1/σ_j²),    (42)

which is the same as the optimal weights found in (12). In the same limit the standard deviations given in (32) vanish, so the Gaussians in (34) go over into Dirac delta functions. Since in the third term in (34) the delta spikes are multiplied by θ(−w), they do not contribute, so the distribution of weights in the r → 0 limit becomes

  p(w) = (1/N) Σ_i δ(w − w*_i),    (43)

where the w*_i are the true optimal weights given in (12). We see then that in the limit r → 0 we indeed recover the full-information results of Section 2.
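The behaviour of q̃ away from r = 0 can also be probed by direct simulation. The sketch below (our own illustration, not from the paper) estimates the unregularized (η₁ = η₂ = 0) minimum-variance weights from a finite sample with true covariance equal to the identity, and evaluates the ratio (37); it is close to 1 for small r and grows markedly as r increases:

```python
import numpy as np

rng = np.random.default_rng(42)

def q_tilde(N, T):
    # sample returns with true covariance = identity (sigma_i = 1 for all i)
    x = rng.normal(size=(T, N))
    C = x.T @ x / T                     # estimated covariance, eq. (5)
    Cinv1 = np.linalg.solve(C, np.ones(N))
    w = N * Cinv1 / Cinv1.sum()         # estimated min-variance weights, sum(w) = N
    # eq. (37) with C_true = I: numerator w.w, denominator N (true weights are all 1)
    return w @ w / N

q_small = q_tilde(N=20, T=2000)         # r = 0.01
q_large = q_tilde(N=50, T=100)          # r = 0.5
assert 1.0 <= q_small < 1.1             # nearly error-free estimation at small r
assert q_large > 1.5                    # strong out-of-sample degradation at r = 0.5
assert q_small < q_large
```

Note that q̃ ≥ 1 is automatic here (by the Cauchy–Schwarz inequality, w·w ≥ (Σw)²/N = N), in agreement with the remark after eq. (37).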
0, then it must take on the full weightof the portfolio. (Remember that we have no constraint on the expected return of theportfolio, and are looking for the global minimum of the risk functional. Therefore, ifthere is a riskless asset in the portfolio, the total wealth must be invested in this asset.)Let us see how our equations lead to such a result.As we will see later, above r = 1 zero modes (belonging to zero value of the variance)appear in the system, and they start competing with the riskless asset. We will studythese zero modes later; in the present subsection we restrict the discussion to the range r <
1, to avoid the complications related to the zero modes. We also assume that thestandard deviation of the riskless asset goes to zero first, and let N go to infinity onlyafter this.Now, if σ is much smaller than the other variances, in (26) a single term dominates,and using the asymptotic behavior of W ( x ) ∼ x / x → ∞ , we find1 N ( λ − η ) (1 + ∆) σ q = 1 . (44)Similarly, from (24) we get ( λ − η )(1 + ∆)2 N σ = 1 . (45)Equations (44) and (45) imply q = N σ . (46)Then by (36) the quantity ˜ q given in (37) is˜ q = q N (cid:88) i σ i ≈ q N σ = 1 . (47)As stated in (38), √ ˜ q − i = 1 is error free – an obvious result.From (30), the weight w (1)1 is w (1)1 = ( λ − η )(1 + ∆)2 σ , (48)which, by (45) leads to w (1)1 = N , (49)so the riskless asset carries the total weight, indeed. he only other weight that could compete with this is w (1)2 , but it is positive andis multiplied by θ ( − w ) in the weight distribution, so it does not contribute, while allother weights are negligible in the σ → σ →
0, some small fluctuations still remain. The standard deviation given in (32)works out to be σ (1) w = √ N r , (50)corresponding to Gaussian fluctuations about the average (49).
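The dominance of the riskless asset is easy to observe in a direct simulation (our own sketch, not part of the paper): when one asset's volatility is much smaller than the rest and T > N, the estimated minimum-variance weights concentrate on that asset, up to the fluctuations of order √(Nr) predicted by (50).

```python
import numpy as np

rng = np.random.default_rng(1)
N, T = 50, 200                       # r = N/T = 0.25 < 1
sigma = np.ones(N)
sigma[0] = 0.01                      # nearly riskless first asset
x = rng.normal(size=(T, N)) * sigma  # independent Gaussian returns

C = x.T @ x / T                      # estimated covariance, eq. (5)
Cinv1 = np.linalg.solve(C, np.ones(N))
w = N * Cinv1 / Cinv1.sum()          # minimum-variance weights with budget sum(w) = N

assert np.isclose(w.sum(), N)
assert w[0] / N > 0.8                # the (nearly) riskless asset carries almost everything
assert np.abs(w[1:]).max() < 1.0     # all other weights are negligible
```

Tightening σ₁ further pushes w[0] toward N, while its residual scatter across repeated samples is consistent with the width √(Nr) ≈ 3.5 of eq. (50).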
Before proceeding, we wish to emphasize again that we are calculating averages over the random samples, rather than trying to infer the behavior of the whole ensemble from studying a single sample. The difference is perhaps most clearly seen in the case of the distribution of optimal portfolio weights. The lasso is known to eliminate some of the variables (setting their weights to zero). For a given sample with a given ratio r = N/T this happens step-wise, i.e. as we increase the strength of the regularizer η (setting η₁ = η₂ = η for simplicity), first one, then two, three, etc. weights will be rendered zero, in descending order of the corresponding variances. In contrast, the averaging over the samples in our formalism results in a density n₀ of zero weights that increases continuously with η and, according to (35), receives contributions from each asset i.
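The elimination of weights on a single sample can be reproduced with any quadratic-programming routine. The sketch below is our own illustration (not the paper's code); the split w = u − v with u, v ≥ 0 is a standard device to make the |w| term smooth, and scipy's SLSQP is just one convenient solver. It minimizes the sample variance plus the symmetric ℓ₁ penalty under the budget constraint and counts the zero weights as η grows:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(7)
N, T = 40, 80                        # r = 1/2
x = rng.normal(size=(T, N))          # unit-variance Gaussian returns
C = x.T @ x / T                      # sample covariance, eq. (5)

def zero_fraction(eta):
    # minimize w'Cw + eta*sum|w|  s.t. sum(w) = N,  via w = u - v with u, v >= 0
    def obj(p):
        w = p[:N] - p[N:]
        return w @ C @ w + eta * p.sum()
    cons = {"type": "eq", "fun": lambda p: p[:N].sum() - p[N:].sum() - N}
    res = minimize(obj, x0=np.ones(2 * N), bounds=[(0, None)] * (2 * N),
                   constraints=cons, method="SLSQP",
                   options={"maxiter": 500, "ftol": 1e-12})
    w = res.x[:N] - res.x[N:]
    return np.mean(np.abs(w) < 1e-3)   # fraction of weights driven to zero

n0_weak, n0_strong = zero_fraction(0.0), zero_fraction(5.0)
assert n0_weak < n0_strong           # a stronger penalty eliminates more weights
assert n0_strong > 0.0
assert n0_strong <= 0.5              # at most half the assets can be eliminated
```

Sweeping η on one fixed sample produces the step-like curves of the left panel of Fig. 1; averaging over many samples smooths them toward the replica density n₀.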
Figure 1:
Elimination of weights with increasing regularization parameter η. Left: The proportion of zero weights n₀ as function of η for two different single samples (blue and red) of a portfolio with the same composition of 100 assets. Note that the step-like functions more or less follow the trend of the theoretical curve (which has been derived in the large N limit and is shown in the figure by a black dashed line), but the fluctuations for N = 100 are still large. Right: The step-like curves have been measured by averaging over 10 sample portfolios with the same composition: half of the 100 assets with σ² = 1, the other half with σ² = 10. The red dashed line shows the replica theoretic contribution of the σ² = 1 assets to the density of zero weights, the blue dashed line shows the same from the σ² = 10 assets. The contribution of the higher volatility component (shown in blue) is larger than that of the lower volatility one (in red): the regularizer eliminates the higher volatility assets with higher probability. Note that averaging over just 10 samples has substantially reduced the fluctuations.

In Figure 1 we show numerical results for a two-variance portfolio and compare them to the results of the replica calculation. The numerical model is constructed from N = 100 assets, each having T = 300 data points (r = 1/3) drawn from a normal distribution. The variance of the returns is set to be σ² = 1 for half of the assets, while the other half has σ² = 10. As expected, the ℓ₁ regularizer mostly eliminates the weights associated with the higher variance group: in the right hand side figure n₀^(1) indicates the proportion of the eliminated weights associated with the higher variance, while n₀^(2) is the contribution of the lower variance assets. To indicate the size of fluctuations for a single portfolio, in the left figure the results for two different samples are shown, compared to the replica result. From these figures one can form an idea how measurements performed on individual samples compare with the sample averages (at r = 1/3).

Let us now show that the contribution to n₀ from asset i will be larger than that from asset j if σ_i > σ_j. Thus we have to show that

  Φ( w₂^(i)/σ_w^(i) ) − Φ( w₁^(i)/σ_w^(i) ) > Φ( w₂^(j)/σ_w^(j) ) − Φ( w₁^(j)/σ_w^(j) ).    (51)

If we introduce the notations

  w₂^(i)/σ_w^(i) = z_i ,  w₂^(j)/σ_w^(j) = z_j ,  w₁^(i)/σ_w^(i) = y_i ,  w₁^(j)/σ_w^(j) = y_j ,    (52)

then from (30)–(32) we see that

  z_j/z_i = y_j/y_i = σ_i/σ_j = a > 1,    (53)

and

  y_i/z_i = y_j/z_j = (λ − η₁)/(λ + η₂) = b < 1.    (54)

(The constant a is obviously positive, and it follows from (9) and (33) that λ ≥ η₁, so b is non-negative.) If we call z_i = z, the other three variables are simply proportional to it: y_i = bz, z_j = az, y_j = abz. The inequality (51) can then be written as

  Φ(z) − Φ(bz) > Φ(az) − Φ(abz).

The definition of Φ, (27), then leads to

  ∫_{bz}^{z} dt e^{−t²/2} > ∫_{abz}^{az} dt e^{−t²/2} ,  a > 1,    (55)

so f(z) = ∫_{bz}^{z} dt e^{−t²/2} must be a decreasing function of z. But

  df/dz = e^{−z²/2} − b e^{−b²z²/2} < 0,

indeed, because b < 1. Thus we have shown that more volatile assets are eliminated from the portfolio by ℓ₁ with higher probability.

Turning now to the distribution of non-zero weights, we see from (34) that the discrete spikes in (43) split into two and get broadened by averaging over the samples. Fig. 2 is an illustration of p(w) in the special case when all the standard deviations σ_i are the same, σ_i = 1 for all i, and η₁ = η₂ = η. As we can see, with increasing r the Gaussians making up the distribution of weights become broader and broader, and the original sharp structure of p(w) becomes washed away.

The question arises how small r must be in order to make it possible to resolve the structure of the weight distribution of a portfolio consisting of, say, just two classes of assets, N/2 of them with volatility σ_i and N/2 with σ_j. A glance at Fig. 3 shows that this is possible as long as the distance between the centers of the two Gaussians is larger than the mean of their standard deviations. From Fig. 3 it is also clear that it is

Figure 2:
The distribution of estimated weights for two values of the ratio r (the larger being r = 0.68) and for different values of the regularizer's strength η, when all the true standard deviations are the same, σ_i = 1 for all i, and η₁ = η₂ = η. Increasing η tends to suppress the negative weights. The vertical dotted line at the origin is meant to represent the Dirac-delta contribution of the zero weights.

Figure 3:
The illustration of the resolution of two different assets.

Figure 4:
Resolution of different assets for r = 0 .
3, resp. r = 0 . η . The plot refers to the case where half of the assets have variance σ i = 1, while theother half σ i = 2, and η ≡ η = η . sufficient to consider the positive weights side of the distributions, so the requirementfor resolvability is( λ − η )(1 + ∆)2 σ j − ( λ − η )(1 + ∆)2 σ i > (cid:18) √ q rσ j + √ q rσ i (cid:19) , (56)that is ( λ − η )(1 + ∆) √ q r (cid:18) σ j − σ i (cid:19) > , (57)where we have assumed σ j < σ i .When we have a large number of observations, i.e. r (cid:28) λ − η , q , and ∆ can bereplaced by their r = 0 values, as given in (39), (40), and ∆ = 0, respectively. Thenthe resolvability of the two peaks will only depend on r and the two volatilities, andthe criterion of resolvability becomes (cid:114) r < ( σ i − σ j ) (cid:113) σ i + σ j , (58)where we have substituted σ i for half of the assets and σ j for the other half. It is thenclear that for small r ’s the inequality (57) is easily satisfied for σ ’s sufficiently far apart.However, as will be seen shortly, with increasing r the coefficient ( λ − η )(1 + ∆) onthe left of (57) decreases rapidly, and the inequality gets violated: sample fluctuationswill wash the structure away.When one tries to estimate the portfolio weights from a single sample of empiricaldata, one is effectively picking the weights from the multimodal distribution p ( w ) like the distribution in Fig. 4, but with many more peaks). If the peaks are wellseparated and narrow, the estimates so obtained will be close to the true weights, butthis assumes small values of r , that is a large number of observations T . If T is notvery large compared to N , the distribution of weights will lose its discrete structure,and the estimated weights may have very little to do with their true values. Fig. 4shows how much the discrete structure is lost already for r = 0 .
3, while for r = 0 . In this subsection we present results for the range of N and T values where their ratiois neither very small, nor very close to r = 2. While at the two extremes it is easyto get analytic results by hand, in the intermediate r range one has to solve the firstorder conditions by help of a computer. The results will be displayed below in a fewfigures. For comparison, the results for η = η = 0 (no regularization) and η = 0, η → ∞ (no short positions allowed) are also shown. When the full regularizer isapplied we set η = η = η , for simplicity. Also, since we have already displayed theresults that depend on the heterogeneity of the portfolio (dominance of the risklessasset, preferential elimination of the large volatility items and the condition for theresolvability of nearby volatilities), we can henceforth set σ i = σ = 1 for all i , forsimplicity again.Without regularization the optimization of variance does not have a meaningfulsolution beyond r = 1 where the first zero eigenvalues of the covariance matrix appear.Then q and ∆ diverge in the limit r → −
0, while λ and the in-sample estimate of the objective function vanish at r = 1. In the absence of regularization the density n of zero weights is identically zero.

Regularization extends the region where the optimization can be carried out from 0 ≤ r < 1 to 0 ≤ r < 2. We can see from Fig. 5 that for small values of the coefficient η of the regularizer n is very small for r < 1, but starts increasing fast above r = 1, ultimately going to 1/2. Note that n can be directly measured by numerical simulations; the agreement between the replica calculation and numerical simulation has already been shown in Fig. 1.

Figure 5: The fraction n of zero weights as a function of r.

There is a simple relationship between the density n of zero weights and the order parameter ∆. From eqs. (25) and (35) one can see that

∆ = r(1 − n) / (1 − r(1 − n)).   (59)

The rapid growth of n above r = 1 translates into a strong increase of ∆. (Without regularization ∆ would diverge at r = 1.) With the regularizer on and n going to 1/2 as r → 2 −
0, ∆ ultimately diverges at r = 2. Eq. (59) can serve as a recipe for the numerical determination of ∆ through n.

Fig. 6 shows the behavior of the order parameter q, related to the estimation error and to the out-of-sample estimate of the objective function; q is a quantity that can be obtained directly from simulations, and the analytical and numerical results are compared in Fig. 6 for various values of η. Without the regularizer q would diverge at r = 1, similarly to ∆. As a vestige of this, for small values of the regularizer's coefficient η, q shows a strong "resonance" around r = 1, but remains finite, and decreases above r = 1 to a finite limit. For larger η's the resonance is suppressed; in particular, in the no-short-selling limit (η₂ → ∞) q is monotonically increasing over the entire interval 0 ≤ r < 2.

Figure 6: Left panel: the behavior of the order parameter q (proportional to the out-of-sample estimate of the variance, and also to the relative estimation error) as a function of the ratio r = N/T for different values of the coefficient η (= η₁ = η₂) of the regularizer. For small values of η, q exhibits a sharp maximum (blue curve) around r = 1, where it would diverge without the regularizer. For larger η the maximum is less pronounced (dashed red curve), and for the largest value of η the resonance around r = 1 is suppressed. Right panel: comparison with numerical simulations for different values of η and N = 50. The agreement between the analytic formula and numerical simulations is already good for a system of size N = 50.

Finally, the in-sample estimator of the objective function f can be obtained from (33) by calculating λ from the stationarity conditions. The results for λ are exhibited in Fig. 7. We shall see shortly that f goes to η₁ as r → 2 − 0, implying that the variance vanishes in this limit.
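Relation (59) is elementary to evaluate; the sketch below (the helper name is ours, not from the paper) tabulates ∆ for a few values of n and r and makes the location of the divergence explicit: ∆ blows up where r(1 − n) → 1, i.e. at r = 1 without regularization (n = 0) and only at r = 2 once n → 1/2.

```python
# Relation (59) between the density n of zero weights and the order
# parameter Delta:  Delta = r(1-n) / (1 - r(1-n)).
# Illustrative helper, our naming.

def delta_from_n(r: float, n: float) -> float:
    """Order parameter Delta as a function of r = N/T and the fraction n
    of weights eliminated by the l1 regularizer, following eq. (59)."""
    x = r * (1.0 - n)
    if x >= 1.0:
        raise ValueError("Delta diverges for r(1-n) >= 1")
    return x / (1.0 - x)

# n = 0 (no regularization): Delta = r/(1-r), diverging at r = 1.
print(delta_from_n(0.5, 0.0))   # -> 1.0
# With n approaching 1/2 the divergence is pushed out toward r = 2.
print(delta_from_n(1.5, 0.4))   # ~ 9.0
```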
Figure 7: The order parameter λ as a function of r. Note the logarithmic scale on the vertical axis.

In order to assess the performance of the regularization, we have to construct the contour lines of the estimation error. For simplicity we consider here a uniform portfolio with all the true variances σᵢ² = 1, and for a first orientation we let the left hand side slope η₂ of the regularizer go to infinity while keeping the right hand side slope η₁ finite. The advantage of such an arrangement is that it excludes all the negative weights: w₂⁽ⁱ⁾ defined in (31) goes to infinity, and Ψ(−w₂⁽ⁱ⁾/σ_w⁽ⁱ⁾), Φ(−w₂⁽ⁱ⁾/σ_w⁽ⁱ⁾) and W(−w₂⁽ⁱ⁾/σ_w⁽ⁱ⁾) all vanish in eqs. (24), (25) and (26). This leads to the much simplified set of equations:

1/√(q r) = Ψ( (λ − η₁)(1 + ∆) / (2√(q r)) )   (60)

∆ = r Φ( (λ − η₁)(1 + ∆) / (2√(q r)) ) / [ 1 − r Φ( (λ − η₁)(1 + ∆) / (2√(q r)) ) ]   (61)

1/(2r) = W( (λ − η₁)(1 + ∆) / (2√(q r)) ).   (62)

Applying the identity W(x) = x Ψ(x) + Φ(x) in the last equation and using the previous two, after some simple manipulations one is led to the result that the argument of the functions Ψ, Φ and W above is equal to √((λ − η₁)/r). Then the equations themselves become

1/√(q r) = Ψ( √((λ − η₁)/r) )   (63)

∆ = r Φ( √((λ − η₁)/r) ) / [ 1 − r Φ( √((λ − η₁)/r) ) ]   (64)

1/(2r) = W( √((λ − η₁)/r) ).   (65)

The last equation gives the solution for λ as

√((λ − η₁)/r) = W⁽⁻¹⁾(1/(2r)),   (66)

where W⁽⁻¹⁾ is the inverse of W. With r increasing, λ decreases and goes to η₁ for r → 2. As we have seen earlier, λ cannot be smaller than η₁, so the square root remains real, and r cannot grow beyond 2. Substituting (66) into (63) and (64), respectively, we obtain the other two order parameters as functions of r. At first it may seem surprising that they depend only on r and do not depend on η₁ at all. (A little reflection shows that this is due to the combined effect of the exclusion of short positions and the budget constraint.) In particular, the order parameter q, which determines the out-of-sample estimator of the objective function and the estimation error, works out to be

q = 1 / ( r [ Ψ( W⁽⁻¹⁾(1/(2r)) ) ]² ).   (67)

This is independent of η₁ but, of course, not independent of the regularization. Without regularization (and a very strong one at that; remember that we let η₂ → ∞ at the beginning of this subsection) we would have q = 1/(1 − r), which blows up at r = 1, whereas (67) smoothly increases from 1 to π as r goes from zero to 2. Because q is independent of η₁, if we constructed the contour lines of q, i.e. the lines of fixed q on the r − η plane, we would get a series of horizontal lines stacked above each other. (The above solution taken at η₁ = 0 is the optimization of the variance with a no-short constraint that we studied in [23].)

When η₂ is finite we have to resort to a computer to construct the contour lines of q. Now we set η₁ = η₂ = η, that is, we consider a symmetric regularizer. The resulting q contour lines are depicted in Fig. 8a. (This figure contains the same information as Fig. 6: the difference is that there q was shown as a function of r, with the value of η as the parameter of the curves, while here we are showing the constant-q lines on the r − η plane, with q as the parameter.)

Figure 8: a. Contour plot of the estimation error √q − 1 under ℓ₁ regularization with η₁ = η₂ = η. There is a critical value of q at π, below which a solution exists for any η.
For low values of q the result is almost insensitive to the regularization. b. Maximal improvement obtained by using regularization, as a function of q.

We recognize the nearly horizontal contour lines immediately: in the lower regions of the figure (below r = 0.3, say) the lines of fixed q are nearly independent of the strength of the regularizer. As we go higher, the effect of the regularizer starts to be felt more and more. The estimation error (√q − 1) on the first five contour lines, from bottom up, is 5%, 10%, 20%, 30%, and 40%, respectively. These lines are nearly horizontal, which means that if we have enough data the strength of the regularizer hardly matters at all; the error would be almost the same even for η = 0. The first line where we can see a definite increase of r with η is the one corresponding to the relative estimation error 0.4. Beyond this point the regularizer is taking over, and the estimation error for a large enough η is determined more by the regularizer than by the size of the sample. We see then that either we have a sufficient amount of data, and then the regularizer does not play a very important role, or it does, but by then the error is so large as to make the whole optimization pointless. In the higher regions of the contour map the optimization is completely determined by the regularizer. The highest contour line corresponds to q = π, with r hitting its critical value of 2. For q increasing further r must decrease (see Fig. 6). With q going to infinity the contour lines shrink to the point η = 0, r = 1, corresponding to the singularity of the unregularized problem.

r = 2

Let us start the analysis of the critical point with eq. (26) and consider the general case where η₁ is different from η₂; we will see that η₂ does not appear in the results around r = 2. From eqs. (26), (30), (31) and (32) it is clear that the limiting behavior of the various quantities depends on λ and ∆, because q remains finite here. The order parameter ∆ diverges for r → 2 −
0, but we will verify later that λ − η₁ goes to zero so fast that the product (λ − η₁)(1 + ∆) still vanishes at r = 2. At the same time (λ + η₂)(1 + ∆) diverges, therefore W(−w₂⁽ⁱ⁾/σ_w⁽ⁱ⁾) vanishes, while according to

W(x) = 1/4 + x/√π + … ,  x → 0,   (68)

the terms containing w₁⁽ⁱ⁾ become

W(w₁⁽ⁱ⁾/σ_w⁽ⁱ⁾) = 1/4 + (1/√π) (λ − η₁)(1 + ∆) / (2σᵢ√(q r)) + … .   (69)

Then eq. (26) becomes

1/(2r) = 1/4 + (λ − η₁)(1 + ∆) / (2√(π q r)) · (1/N) Σᵢ 1/σᵢ,   (70)

or, to leading order in ε = 2 − r,

ε/8 = (λ − η₁)∆ / (2√(2π q)) · (1/N) Σᵢ 1/σᵢ,   (71)

which shows that the product (λ − η₁)∆ indeed vanishes like ∼ ε. Similarly, from eq. (24) with Ψ(0) = 1/√(2π) we get near r = 2

q = π / ( (1/N) Σᵢ 1/σᵢ )²,   r → 2 − 0.   (72)

Accordingly, the r → 2 − 0 limit of the relative estimation error is

q̃ = q (1/N) Σᵢ 1/σᵢ² = π · (1/N) Σᵢ 1/σᵢ² / ( (1/N) Σᵢ 1/σᵢ )²,   (73)

where the expression multiplying π is, by force of the Cauchy inequality, larger than or equal to 1 for any distribution of the true volatilities σᵢ; therefore q̃ is always larger than or equal to 1, as it should be.

The asymptotic behavior of the order parameter ∆ in (25) can be worked out similarly to obtain

∆ = 4/(2 − r),   (74)

where use has been made of the small-x expansion Φ(x) = Φ(0) + x/√π + … . Going back to (71) and using (72) and (74) we find

λ − η₁ = π ε² / ( 8√2 ( (1/N) Σᵢ 1/σᵢ )² ),   (75)

vanishing quadratically for r → 2 − 0. For the density of the zero weights we find

n = 1/2   (76)

in the same limit.

Turning to the distribution of weights, we see that w₁⁽ⁱ⁾ → 0 for all i, so, in addition to the δ-peak at the origin, all the positive weights collapse to zero, but with a finite standard deviation

σ_w⁽ⁱ⁾ = √(2π) / ( σᵢ (1/N) Σⱼ 1/σⱼ ).   (77)

As for the weights w₂⁽ⁱ⁾, they all go to infinity, so the corresponding contributions to (34) vanish exponentially.

Finally, the objective function can be obtained from (33). Here the second term vanishes because of the divergence of ∆, while according to (75) the first term goes to η₁, so

lim_{r → 2−0} f = η₁.   (78)

As we see, η₂ does not appear in any of the results near the critical point, but it is important to realize that its non-zero value ensures the vanishing of all the contributions with w₂⁽ⁱ⁾.

What is happening at the transition at r = 2? To find the answer, we have to go back to the discussion below eq. (6), where we found that the objective function satisfies f ≥ η₁, with equality only when the empirical variance vanishes and the optimal weight vector lies on the simplex. But then eq. (78) implies that it is precisely this that happens at the critical point. According to eq. (6) the variance is the sum of T squares, which vanishes only if each of the T terms vanishes separately. So we need to find a weight vector that points to the simplex and is orthogonal to the T random return vectors. This is exactly the same random geometry problem that we encountered in the case of the no-short-constrained optimization [23]. There we displayed a closed formula for this probability, valid for any N and T:

p(N, T) = 2^{−(N−1)} Σ_{k=T}^{N−1} (N−1 choose k).   (79)

This formula depends only on the symmetry, and not on the concrete form, of the return distribution, and as such it is universal. For N ≤ T the probability of finding such a solution is zero. For N exceeding T the probability starts to increase, becomes 1/2 at N = 2T, and goes to one as N increases further.
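The combinatorial formula (79) is straightforward to evaluate numerically; a minimal sketch (the function name is ours) confirms the values quoted above: zero probability for N ≤ T, exactly 1/2 at N = 2T, and probability approaching one as N grows further.

```python
from math import comb

def p_zero_variance(N: int, T: int) -> float:
    """Probability (79) that a weight vector on the simplex is orthogonal
    to T random return vectors in dimension N:
    p(N, T) = 2^{-(N-1)} * sum_{k=T}^{N-1} C(N-1, k)."""
    return sum(comb(N - 1, k) for k in range(T, N)) / 2 ** (N - 1)

print(p_zero_variance(10, 20))   # N <= T: -> 0.0
print(p_zero_variance(20, 10))   # N = 2T: -> 0.5
```

By the symmetry of the binomial coefficients, the tail sum from k = T to 2T − 1 is exactly half of 2^{2T−1}, which is why p(2T, T) = 1/2 independently of T.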
If N and T go to infinity with their ratio r = N/T held fixed, the function p(N, T) goes over into a step function: the probability that the variance vanishes becomes zero for 0 < r < 2 and one for r > 2. Thus r = 2 corresponds to a sudden transition between a situation where the variance is positive and one where it is zero with probability one, while the objective function becomes identically equal to η₁, corresponding to a flat optimization landscape. This transition is similar to the large number of phase transitions in random high dimensional geometry studied in [26] and [27].

Summary
We have considered the optimization of the variance supplemented by a budget constraint and an asymmetric ℓ₁ regularizer. The present treatment includes as a special case the no-short-constrained portfolio optimization problem [23]. We have presented analytical results for the order parameter q, directly related to the out-of-sample estimator of the objective function and the relative estimation error; for the in-sample estimator of the objective function; for the density of the assets eliminated from the portfolio by the ℓ₁ regularizer; and for the distribution of portfolio weights. We have studied the dependence of these quantities on the ratio r of the portfolio's dimension N to the sample size T, and on the strength of the regularizer. We have checked these analytic results by numerical simulations and found general agreement. As the most conspicuous property of ℓ₁ is the step-like, one by one elimination of the dimensions, we also ran numerical experiments on single samples to reproduce this phenomenon. We have confirmed the appearance of the steps, and checked that the overall trend of the numerical results by and large follows the theoretical curve, which is remarkable, since the measurement was carried out on a single sample of finite size, whereas the theory is meant to work in the limit where both N and T go to infinity and the results are averaged over the whole ensemble of random samples. We have also seen that averaging over merely ten numerical curves is already enough to remove most of the fluctuations. We have repeatedly emphasized that the replica theory we applied in the analytic work is designed to average over infinitely many samples, and thus the results reflect the typical properties of the ensemble. Empirical work, in contrast, usually deals with a single sample, or a small number of samples, and tries to infer the properties of the ensemble from the information contained therein.
Considering the rapid broadening with r of the Gaussians making up the distribution of weights, one can immediately see how misleading a small number of samples can be.

As portfolio optimization is just a simple representative example of quadratic optimization, our results carry a message for this kind of optimization problem at large. The extension of the interval where the optimization can be carried out, the maximal proportion of one half of the dimensions eliminated by ℓ₁, and the "resonance" of the estimation error around the unregularized critical point at r = 1 are important findings, as is the disappointing performance of ℓ₁ in the given context. The poor performance should not be a surprise: in the given problem we were trying to rein in the large fluctuations of a quadratic objective function by a regularizer that increases only linearly. The phase transition taking place at r = 2 belongs to the large family of transitions in random geometrical problems studied in [26] and [27], where they were shown to be universal in the sense that the critical point is independent of the distribution of the data. As a manifestation of this universality, the critical value r = 2 does not depend on the Gaussian nature of the returns that we assumed here for the sake of easy application of the replica method.

To conclude, we would like to call attention to the fact that the transition at r = 2 is very easy to overlook in empirical work. Upon approaching this critical point, the solution of the optimization problem as posed here becomes unstable against "transverse" fluctuations, which leave the length of the weight vector approximately constant but result in large fluctuations of its direction. This corresponds to the weight vector freely roaming over the simplex. In finance terms it would mean the optimal portfolio ending up with a different composition in each sample. It is clear that such a situation is undesirable (such a frequent rebalancing of the portfolio would be technically difficult and would result in high transaction costs), so the investor should keep well away from the point of instability. In numerical work, however, one may use, even inadvertently, some of the standard solvers that often contain a built-in ℓ₂ regularizer without a clear warning about it. The presence of such "hidden" ℓ₂ regularizers in standard quadratic solvers has been pointed out in [23]. Such a regularizer will stabilize the solution and drive it toward the naive portfolio with all the weights equal. In a situation where the original problem is unstable, even a very small ℓ₂ regularizer will suffice to do the job, thereby creating the illusion that a stable solution can be obtained on the basis of a small number of observations.

Appendix A  Derivation of the free energy with the replica method
As stated in the main text, (8), we have to find the optimum of the following objective function:

F = (1/T) Σ_{t=1}^{T} ( Σᵢ wᵢ x_it )² + η₁ Σᵢ wᵢ θ(wᵢ) − η₂ Σᵢ wᵢ θ(−wᵢ) − λ ( Σ_{i=1}^{N} wᵢ − N ),   (80)

where the returns x_it are drawn from the joint probability density of independent Gaussian variables with zero mean and variance σᵢ².

Any optimization problem can be embedded into the formalism of statistical physics by regarding the objective function F as the "energy functional" of a fictitious system, introducing a fictitious inverse temperature γ, and integrating the Boltzmann factor e^{−γF} over the weights in a given sample to get the "partition function" Z. The logarithm of the partition function, ln Z, is essentially a cumulant generating function from which all the quantities of interest can be obtained; in particular, the optimal weights can be found by minimizing the objective function over the weights in the "zero temperature" limit γ → ∞. The effectiveness of this procedure rests on the fact that we work in the limit of large N, where the Boltzmann distribution over the weights is extremely sharp around its maximum. The procedure just described gives us the optimal weights in a given sample of size T. However, if T is not much larger than the dimension N of the portfolio, we are in the realm of high-dimensional statistics, where sample fluctuations are large, and optimizing our portfolio over a single sample can be very misleading. Therefore, in order to capture the typical properties, we have to average over the full ensemble of samples. This is analogous to averaging over the "quenched" random samples in the statistical physics of disordered systems [20], which explains why the methods developed in that theory can be successfully applied in the portfolio optimization context.

In order to average over the samples, we have to average the logarithm of the partition function, which is a random variable fluctuating from sample to sample. Averaging the logarithm of a random variable is hard, while calculating the integer moments ⟨Zⁿ⟩ may be feasible. Now Zⁿ is just the partition function of n independent copies, or replicas, of the system (hence the name of the method). Assuming that we can analytically continue Zⁿ from the integers to real n's, we can make use of the identity

⟨ln Z⟩ = lim_{n→0} ( ⟨Zⁿ⟩ − 1 ) / n,   (81)

valid in the limit n → 0. As our objective function is convex, for N → ∞ all the replicas will go to the same minimum of ln Z, and the simplest analytic continuation will do the job. Because we cannot provide a rigorous proof of this claim, we should regard the results of the replica calculation as heuristic. This is why we performed extensive numerical simulations to back up the analytic results in this paper. The general agreement we found is clear evidence of the correctness of the results. On the other hand, deducing these nontrivial results by a purely numerical approach alone would have been very hard, if not impossible. Let us now carry out the program sketched above.
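Before doing so, the identity (81) can be checked on a toy example (ours, not from the paper): take a lognormal "partition function" Z = e^{g} with g a standard normal variable, for which the moments ⟨Zⁿ⟩ = e^{n²/2} are known in closed form and ⟨ln Z⟩ = ⟨g⟩ = 0.

```python
# Toy check of the replica identity (81): <ln Z> = lim_{n->0} (<Z^n> - 1)/n.
# For Z = exp(g), g ~ N(0,1), the exact moments are <Z^n> = exp(n^2/2),
# so (<Z^n> - 1)/n = (exp(n^2/2) - 1)/n, which tends to <ln Z> = 0.
from math import exp

def replica_estimate(n: float) -> float:
    """(<Z^n> - 1)/n for the lognormal toy model, using the exact moments."""
    return (exp(0.5 * n * n) - 1.0) / n

for n in (1.0, 0.1, 0.01):
    print(n, replica_estimate(n))
# The estimate shrinks toward <ln Z> = 0 as the replica number n -> 0.
```

Here the analytic continuation in n is trivial; the point of the replica method is that the same continuation can be attempted when ⟨Zⁿ⟩ is only computable for integer n.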
The replicated partition function is

Z_n = ⟨ ∫_{−∞}^{∞} ∏_{i=1}^{N} ∏_{a=1}^{n} dw_i^a exp[ −γ ( Σ_{i,j,t,a} w_i^a x_it x_jt w_j^a + T Σ_a g(w⃗^a) ) + T Σ_a λ_a ( Σ_i w_i^a − N ) ] ⟩_{x⃗_t},   (82)

where g(w⃗) = η₁ Σ_i w_i θ(w_i) − η₂ Σ_i w_i θ(−w_i) and, at an appropriate point, we will have to take the limits

lim_{γ→∞} lim_{n→0} (1/(γ n)) ln Z_n,   (83)

where ⟨· · ·⟩ represents an average over the probability distribution of the returns. The above partition function refers to a system of n replicas of the original system, and the index a is introduced to label the different replicas, so that w_i^a represents the i-th weight of the a-th replica. Introducing an integral representation for the delta function and using the properties of Gaussian integrals, the replicated partition function can be written as

Z_n = ⟨ ∫_{−∞}^{∞} ∏_{i,a,t} dw_i^a dφ_at dλ_a exp[ −(1/2) Σ_{a,t} φ_at² + i√γ Σ_{i,t,a} φ_at w_i^a x_it ] × exp[ T Σ_a λ_a ( Σ_i w_i^a − N ) − T γ Σ_a g(w⃗^a) ] ⟩_{x⃗_t}.

Averaging over the probability distribution of the returns gives

Z_n = ∫_{−∞}^{∞} ∏_{i,a,b,t} dw_i^a dQ̂_ab dφ_at dλ_a exp[ −(1/2) Σ_{a,t} φ_at² − (γ/2) Σ_{a,b,t} φ_at Q_ab φ_bt ] × exp[ Σ_{a,b} Q̂_ab ( N Q_ab − Σ_i σ_i² w_i^a w_i^b ) + T Σ_a λ_a ( Σ_i w_i^a − N ) − T γ Σ_a g(w⃗^a) ],

where we have introduced the overlap matrix Q_ab = (1/N) Σ_i σ_i² w_i^a w_i^b and the conjugate variables Q̂_ab to enforce this relation. We can now integrate over the variables φ_at to obtain

Z_n = ∫_{−∞}^{∞} ∏_{i,a,b} dw_i^a dQ̂_ab dλ_a exp[ −(T/2) Tr log( δ_ab + γ Q_ab ) ] × exp[ Σ_{a,b} Q̂_ab ( N Q_ab − Σ_i σ_i² w_i^a w_i^b ) + T Σ_a λ_a ( Σ_i w_i^a − N ) − T γ Σ_a g(w⃗^a) ].

It is at this point that we have to make the analytic continuation in the replica number n. In view of the permutation symmetry of the replicas and the convexity argument put forward earlier, we can choose the replica symmetric ansatz

Q_ab = q + ∆ for a = b,  q for a ≠ b,   (84)

Q̂_ab = q̂ + ∆̂ for a = b,  q̂ for a ≠ b.   (85)

The analytic continuation then consists in simply regarding n as a real variable. To leading order for small n we have

−(T/2) Tr log( δ_ab + γ Q_ab ) = −(T n/2) [ log(1 + γ∆) + γq/(1 + γ∆) ],   (86)

Σ_{a,b} Q̂_ab Q_ab = N n ( q̂∆ + q∆̂ + ∆∆̂ ),   (87)

while the w⃗-dependent part of the partition function can be written as

∫ ∏_a dλ_a d∆̂ dq̂ exp[ (N n/r) ⟨ log ∫ dw e^{ −r∆̂σ²w² + 2rwzσ√(−q̂) + λw − γ g(w) } ⟩_{z,σ} ],   (88)

where ⟨· · ·⟩_{z,σ} denotes the average of an arbitrary function h(z, σ) over the standard normal variable z and the distribution of asset variances:

⟨h(z, σ)⟩_{z,σ} = ∫ dσ (1/N) Σ_i δ(σ − σ_i) ( ∫_{−∞}^{∞} (dz/√(2π)) h(z, σ) e^{−z²/2} ).   (89)

If we now write the partition function as

Z_n = ∫ dλ dq d∆ dq̂ d∆̂ e^{−γ n T F(λ, q, ∆, q̂, ∆̂)},   (90)

we find for f = F/N

f(λ, q, ∆, q̂, ∆̂) = (1/(2γ)) [ log(1 + γ∆) + γq/(1 + γ∆) ] + λ/γ − (1/(rγ)) ( q̂∆ + q∆̂ + ∆∆̂ ) − (1/γ) ⟨ log ∫ dw e^{ −r∆̂σ²w² + 2rwzσ√(−q̂) + λw − γ g(w) } ⟩_{z,σ}.

Noting how the various quantities scale with the inverse temperature, we can perform the change of variables ∆ → ∆/γ, q̂ → γq̂, ∆̂ → γ∆̂, λ → γλ, and taking the limit γ → ∞ we finally have

f(λ, q, ∆, q̂, ∆̂) = q/(2(1 + ∆)) − (1/r)( q̂∆ + ∆̂q ) + λ + min_{w⃗} ⟨ V(w) ⟩_{z,σ},   (91)

where

V(w) = 2r∆̂σ²w² − 2rwzσ√(−q̂) − λw + η₁ w θ(w) − η₂ w θ(−w).   (92)

This is the form of the objective function that we use in the main text. Its minimization is explained in Sec. 3.

References

[1] J. D. Jobson and B. Korkie. Improved estimation for Markowitz portfolios using James-Stein type estimators.
Proceedings of the American Statistical Association (Business and Economic Statistics), 1:279–284, 1979.

[2] P. Jorion. Bayes-Stein estimation for portfolio analysis. Journal of Financial and Quantitative Analysis, 21:279–292, 1986.

[3] O. Ledoit and M. Wolf. Improved estimation of the covariance matrix of stock returns with an application to portfolio selection. Journal of Empirical Finance, 10(5):603–621, 2003.

[4] O. Ledoit and M. Wolf. Honey, I shrunk the sample covariance matrix. J. Portfolio Management, 31:110, 2004.

[5] O. Ledoit and M. Wolf. A well-conditioned estimator for large-dimensional covariance matrices. J. Multivar. Anal., 88:365–411, 2004.

[6] V. Golosnoy and Y. Okhrin. Multivariate shrinkage for optimal portfolio weights. The European Journal of Finance, 13:441–458, 2007.

[7] Takashi Shinzato. Minimal investment risk of a portfolio optimization problem with budget and investment concentration constraints. Journal of Statistical Mechanics: Theory and Experiment, 2017(2):023301, 2017.

[8] O. Ledoit and M. Wolf. Nonlinear shrinkage estimation of large-dimensional covariance matrices. Ann. Statist., 40:1024–1060, 2012.

[9] J. Bun, J.-P. Bouchaud, and M. Potters. My beautiful laundrette: Cleaning correlation matrices for portfolio optimization. 2016.

[10] O. Ledoit and M. Wolf. Direct nonlinear shrinkage estimation of large-dimensional covariance matrices. University of Zurich, Department of Economics, Working Paper No. 264, 2017.

[11] R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B (Methodological), pages 267–288, 1996.

[12] R. Jagannathan and T. Ma. Risk reduction in large portfolios: Why imposing the wrong constraints helps. Journal of Finance, 58:1651–1684, 2003.

[13] J. Brodie, I. Daubechies, C. De Mol, D. Giannone, and I. Loris. Sparse and stable Markowitz portfolios. Proceedings of the National Academy of Sciences, 106(30):12267–12272, 2009.

[14] V. DeMiguel, L. Garlappi, F. J. Nogales, and R. Uppal. A generalized approach to portfolio optimization: Improving performance by constraining portfolio norms. Management Science, 55:798–812, 2009.

[15] D. Giomouridis and S. Paterlini. Regular(ized) hedge funds. J. Financ. Res., 33:223–247, 2010.

[16] M. Carrasco and N. Noumon. Optimal portfolio selection using regularization. 2012.

[17] J. Fan, J. Zhang, and K. Yu. Vast portfolio selection with gross exposure constraints. J. Am. Stat. Assoc., 107:592–606, 2012.

[18] Y.-M. Yen and T.-Y. Yen. Solving norm constrained portfolio optimization via coordinate-wise descent algorithm. Computational Statistics and Data Analysis, 76:737–759, 2014.

[19] B. Fastrich, S. Paterlini, and P. Winker. Cardinality versus q-norm constraints for index tracking. Quantitative Finance, 14(11):2019–2032, 2014.

[20] M. Mézard, G. Parisi, and M. A. Virasoro. Spin Glass Theory and Beyond. World Scientific Lecture Notes in Physics Vol. 9, World Scientific, Singapore, 1987.

[21] F. Caccioli, I. Kondor, M. Marsili, and S. Still. Liquidity risk and instabilities in portfolio optimization. International Journal of Theoretical and Applied Finance, 19(05):1650035, 2016.

[22] T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Second edition. Springer Series in Statistics, Springer, Berlin, 2008.

[23] I. Kondor, G. Papp, and F. Caccioli. Analytic solution to variance optimization with no short positions. Journal of Statistical Mechanics: Theory and Experiment, 2017:123402, 2017.

[24] P. Bühlmann and S. Van De Geer. Statistics for High-Dimensional Data: Methods, Theory and Applications. Springer Science & Business Media, 2011.

[25] Istvan Varga-Haszonits, Fabio Caccioli, and Imre Kondor. Replica approach to mean-variance portfolio optimization. Journal of Statistical Mechanics: Theory and Experiment, 2016(12):123404, 2016.

[26] D. Donoho and J. Tanner. Observed universality of phase transitions in high-dimensional geometry, with implications for modern data analysis and signal processing. Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences, 367:4273–4293, 2009.

[27] D. Amelunxen, M. Lotz, M. B. McCoy, and Joel A. Tropp. Living on the edge: A geometric theory of phase transitions in convex optimization. Inform. Inference, 3(3):224–294, 2013.