Relative utility bounds for empirically optimal portfolios
DMITRY B. ROKHLIN
Abstract.
We consider a single-period portfolio selection problem for an investor maximizing the expected ratio of the portfolio utility and the utility of a best asset taken in hindsight. The decision rules are based on the history of stock returns with unknown distribution. Assuming that the utility function is Lipschitz or Hölder continuous (concavity is not required), we obtain high-probability utility bounds under the sole assumption that the returns are independent and identically distributed. These bounds depend only on the utility function, the number of assets and the number of observations. For concave utilities similar bounds are obtained for the portfolios produced by the exponentiated gradient method. We also use statistical experiments to study risk and generalization properties of empirically optimal portfolios. Herein we consider a model with one risky asset and a dataset containing the stock prices from NYSE.

Mathematics Subject Classification.
Key words and phrases. Portfolio selection; Relative utility; Statistical learning; Empirical utility; Generalization bounds.
The research is supported by the Russian Science Foundation, project 17-19-01038.
arXiv: [q-fin.PM]

1. Introduction

We consider a single-period portfolio selection problem, where the decision rules are based on the history of stock returns. It is assumed that the returns are independent and identically distributed, but their distribution is unknown. We represent the investor's preferences by an expected utility and use the sample average approximation (SAA) (see, e.g., [17]) for the solution of the related expected utility maximization problem. In the terminology of statistical learning theory, our main goal is to obtain high-probability bounds (generalization bounds or utility bounds) for the difference between the optimal utility value and the true utility of the empirically optimal portfolio (the estimation error), as well as for the difference between the true utility and the empirical utility of such a portfolio.

Let us mention two specific features of the problem under consideration, which create some difficulties in applying standard results. First, some classical utility functions, like the power function, are neither bounded nor globally Lipschitz. Second, most classical models, like the Black-Scholes model, assume that the returns are unbounded. Similar unbounded problems appear in general learning theory: see [6] and the many references therein. They require some additional assumptions, problem reformulations and the development of special tools.

In the present paper we pass to relative utility maximization, where the objective function equals the expected ratio of the utility $u$ of some portfolio to the utility of the best portfolio for the returns, which are known in hindsight. This allows us to avoid any assumption on the returns besides the i.i.d. hypothesis. As for $u$, we assume that it belongs to the class of positive, non-decreasing functions satisfying the global Lipschitz or Hölder condition and a specific condition regarding its behavior at zero and infinity. The power function satisfies these assumptions. For the same problem with a concave utility function we study the estimation error for the portfolio produced by the stochastic version of the exponentiated gradient algorithm of [18].

The obtained utility bounds contain only quantities known to the investor: the number of return observations; the number of stocks; constants related to the utility function; and, in the case of the exponentiated gradient algorithm, a data-dependent quantity: see Theorems 1 – 3.

Passing to the relative utility certainly affects the investor's attitude towards risk. In the case of one risky asset it appears that an investor with the relative utility is more risk averse than in the case of the ordinary utility. However, in the case of multiple risky assets our empirical results show that the situation can be the opposite. Furthermore, we present simple statistical experiments demonstrating that typically it is impossible to get a reliable estimate of the optimal portfolio on the basis of daily historical observations. A related phenomenon, which was mainly demonstrated for the risk-return modeling of investor's preferences, is known as the fragility of SAA in portfolio optimization: see [1] and references therein.

Let us mention some papers considering single-period portfolio selection problems in the statistical learning framework. In [8, 10], the authors studied the influence of the portfolio constraints on the out-of-sample performance. The papers [10, 11] presented out-of-sample bounds for the loss probabilities of the portfolios satisfying some empirical VaR- and CVaR-type constraints.
The regularization and cross-validation methods were applied to the mean-variance and mean-CVaR problems in [1]. One can also find in [1] several other references to works considering regularization methods. In [2] the authors considered an expected utility maximization problem with side information and applied a regularization to obtain out-of-sample guarantees for the certainty equivalent of the out-of-sample portfolio value.

The rest of the paper is organized as follows. In Section 2 we state the problem and mention the consistency of the SAA method. Section 3 contains the main result of the paper: Theorem 2, which gives upper bounds for the expected maximum of an empirical process associated with the relative utility function. The Lipschitz and Hölder cases are studied separately. In both cases we consider the Rademacher complexity of the class of relative utility functions, parametrized by the portfolio weights. In the Lipschitz case this quantity is estimated by the Talagrand contraction lemma and the Massart lemma; in the Hölder case we consider the packing numbers and the Dudley entropy integral. The obtained estimates directly lead to high-probability utility bounds via concentration inequalities. Section 4 presents similar bounds for the portfolios produced by the stochastic exponentiated gradient algorithm of [18]. Here we combine its online version with the online-to-batch conversion scheme: see [22].

Sections 5 and 6 deal with statistical experiments related to the analysis of risk and generalization properties of empirically optimal portfolios. Section 5 considers the case of one risky asset obeying the discrete Black-Scholes model, while in Section 6 we analyze a dataset containing daily stock returns from NYSE. The conclusions are already briefly described above. Here we additionally indicate the solution methods used for the empirical utility maximization problems. In Section 5 the problem is one-dimensional, and it is solved simply via the bisection method. In Section 6 we propose a greedy modification of the stochastic exponentiated gradient algorithm to solve the corresponding multidimensional problem. For logarithmic utility the results are compared with [4, 13]. The code for Sections 5, 6 is available at https://github.com/drokhlin/Relative_utility_bounds_code .

2. Problem formulation
Let $(s_k^1, \dots, s_k^d)$ be strictly positive prices of $d$ assets (stocks) at time moments $k = 0, \dots, n+1$, and let $r_k^j = s_k^j / s_{k-1}^j$, $j = 1, \dots, d$, $k = 1, \dots, n+1$, be the total daily returns (price relatives). At time $n$ an investor distributes his wealth $X_n = 1$ between these assets based on the price history $(r_1, \dots, r_n)$. In other words, he selects a portfolio $(\gamma_n^1, \dots, \gamma_n^d)$, where $\gamma_n^j(r_1, \dots, r_n) \ge 0$ is the number of units of the asset $j$ to be bought. So, the wealth will be distributed between $d$ assets in accordance with the fractions (or weights)
\[
\nu_n = \left( \frac{\gamma_n^j s_n^j}{X_n} \right)_{j=1}^d \in \Delta = \Big\{ z \ge 0 : \sum_{j=1}^d z^j = 1 \Big\}.
\]
At time $n+1$ the wealth becomes
\[
X_{n+1} = \langle \gamma_n, s_{n+1} \rangle = \langle \nu_n, r_{n+1} \rangle.
\]
By $\langle a, b \rangle$ we denote the usual scalar product in $\mathbb{R}^d$.

Our standing assumptions concern the investor's utility function and the returns.

Assumption 1.
Investor's utility function $u : (0, \infty) \to (0, \infty)$ is non-decreasing and continuous.

Assumption 2.
The return vectors $(r_k^1, \dots, r_k^d)$, $k = 1, \dots, n+1$, are independent and identically distributed.

Consider the single-period optimization problem
\[
U(\nu) = \mathsf{E} f(\nu, r_{n+1}) := \mathsf{E}\, \frac{u(\langle \nu, r_{n+1} \rangle)}{u(r_{n+1}^*)} \to \max_{\nu \in \Delta}, \qquad r_{n+1}^* := \max_{1 \le j \le d} r_{n+1}^j. \tag{2.1}
\]
The objective function $U(\nu)$ of this problem equals the expected ratio of the utility $u$ of some portfolio $\nu$ to the utility of the best portfolio taken in hindsight, that is, under the assumption that the values $r_{n+1}$ are known. In the latter case the investor simply takes an asset with the largest return. Since $u$ is non-decreasing, the relative utility $f$ takes values in $(0, 1]$. The set $\Delta$ is compact and the function $U$ is continuous, as follows from the continuity of $\nu \mapsto f(\nu, r)$ and the dominated convergence theorem. Hence an optimal solution $\nu^*$ of (2.1) exists.

It is natural to consider the empirical utility maximization problem
\[
\widehat U_n(\nu) = \widehat f_n(\nu, r_{n+1}) = \frac{1}{n} \sum_{k=1}^n \frac{u(\langle \nu, r_k \rangle)}{u(r_k^*)} \to \max_{\nu \in \Delta}. \tag{2.2}
\]
Clearly, this problem also has an optimal solution $\widehat\nu_n$.

Furthermore, consider the empirical process $\nu \mapsto G_n(\nu) = \widehat U_n(\nu) - U(\nu)$. Using the inequalities
\[
\widehat U_n(\nu^*) \le \widehat U_n(\widehat\nu_n), \qquad U(\widehat\nu_n) \le U(\nu^*),
\]
we get
\[
U(\nu^*) - U(\widehat\nu_n) \le U(\nu^*) - \widehat U_n(\nu^*) + \widehat U_n(\widehat\nu_n) - U(\widehat\nu_n) \le U(\nu^*) - \widehat U_n(\nu^*) + \sup_{\nu \in \Delta} G_n(\nu), \tag{2.3}
\]
\[
\widehat U_n(\widehat\nu_n) - U(\nu^*) \le \widehat U_n(\widehat\nu_n) - U(\widehat\nu_n) \le \sup_{\nu \in \Delta} G_n(\nu). \tag{2.4}
\]
Note that when $\nu_n$ is random, by $U(\nu_n)$ we mean the conditional expectation:
\[
U(\nu_n) = \mathsf{E}\big( f(\nu_n, r_{n+1}) \,\big|\, r_1, \dots, r_n \big).
\]
This quantity can be called the "true utility" of $\nu_n$ by analogy with the "true risk" in machine learning: see [23].

In learning theory the difference $U(\nu^*) - U(\widehat\nu_n)$ is called the estimation error [23]. It describes the performance of the empirical utility maximizer $\widehat\nu_n$. The quantity $\widehat U_n(\widehat\nu_n)$ can be regarded as a statistical estimate of the true utility $U(\widehat\nu_n)$ of $\widehat\nu_n$. This estimate is always optimistically biased:
\[
\mathsf{E}\, U(\widehat\nu_n) \le U(\nu^*) = \mathsf{E}\, \widehat U_n(\nu^*) \le \mathsf{E}\, \widehat U_n(\widehat\nu_n).
\]
The difference $\mathsf{E}\, \widehat U_n(\widehat\nu_n) - \mathsf{E}\, U(\widehat\nu_n) \ge 0$ is known as the optimizer's curse [26, 19].

We see that the key quantity is the supremum of the empirical process $G_n$. By the strong law of large numbers, $G_n(\nu) \to 0$ a.s. for a fixed $\nu$. Moreover, since the function $\nu \mapsto u(\langle \nu, r \rangle)/u(r^*)$ is continuous and bounded, the convergence is uniform:
\[
\sup_{\nu \in \Delta} |G_n(\nu)| \to 0 \quad \text{a.s.}, \quad n \to \infty,
\]
by [25, Theorem 7.53]. From (2.3), (2.4) we see that
\[
U(\nu^*) \le \liminf_{n \to \infty} U(\widehat\nu_n), \qquad \limsup_{n \to \infty} \widehat U_n(\widehat\nu_n) \le U(\nu^*).
\]
The reverse inequalities
\[
U(\nu^*) \ge U(\widehat\nu_n), \qquad \liminf_{n \to \infty} \widehat U_n(\widehat\nu_n) \ge \liminf_{n \to \infty} \widehat U_n(\nu^*) = U(\nu^*)
\]
imply that
\[
\widehat U_n(\widehat\nu_n) \to U(\nu^*), \qquad U(\widehat\nu_n) \to U(\nu^*), \quad n \to \infty \quad \text{a.s.}
\]
without further assumptions. Thus, the method of empirical utility maximization is consistent: see the definition in [28, Chapter 3], where convergence in probability is considered. In the next section we provide non-asymptotic bounds for $G_n$.

3. Utility bounds
Let us represent the supremum of the empirical process $G_n$ in the form
\[
\sup_{\nu \in \Delta} G_n(\nu) = \mathsf{E} \sup_{\nu \in \Delta} G_n(\nu) + \sup_{\nu \in \Delta} G_n(\nu) - \mathsf{E} \sup_{\nu \in \Delta} G_n(\nu).
\]
Put $R_n = (r_1, \dots, r_n)$, $\Phi(R_n) = \sup_{\nu \in \Delta} G_n(\nu)$. We have
\[
\begin{aligned}
&|\Phi(r_1, \dots, \tilde r_k, \dots, r_n) - \Phi(r_1, \dots, r_k, \dots, r_n)| \\
&\quad = \left| \sup_\nu \left( \frac{1}{n} \sum_{i \ne k} \frac{u(\langle \nu, r_i \rangle)}{u(r_i^*)} - U(\nu) + \frac{1}{n} \frac{u(\langle \nu, \tilde r_k \rangle)}{u(\tilde r_k^*)} \right) - \sup_\nu \left( \frac{1}{n} \sum_{i \ne k} \frac{u(\langle \nu, r_i \rangle)}{u(r_i^*)} - U(\nu) + \frac{1}{n} \frac{u(\langle \nu, r_k \rangle)}{u(r_k^*)} \right) \right| \\
&\quad \le \sup_\nu \left| \frac{1}{n} \frac{u(\langle \nu, \tilde r_k \rangle)}{u(\tilde r_k^*)} - \frac{1}{n} \frac{u(\langle \nu, r_k \rangle)}{u(r_k^*)} \right| \le \frac{1}{n}.
\end{aligned}
\]
By the McDiarmid concentration inequality (see [20, Theorem D.8]) this bounded differences property implies that
\[
\mathsf{P}\left( \sup_\nu G_n(\nu) - \mathsf{E} \sup_\nu G_n(\nu) \ge \varepsilon \right) = \mathsf{P}\big( \Phi(R_n) - \mathsf{E}\, \Phi(R_n) \ge \varepsilon \big) \le e^{-2 n \varepsilon^2},
\]
or, equivalently,
\[
\mathsf{P}\left( \sup_\nu G_n(\nu) - \mathsf{E} \sup_\nu G_n(\nu) \ge \sqrt{\frac{1}{2n} \ln \frac{1}{\delta}} \right) \le \delta. \tag{3.1}
\]
For the difference $U(\nu^*) - \widehat U_n(\nu^*)$ we have a similar estimate:
\[
\mathsf{P}\left( U(\nu^*) - \widehat U_n(\nu^*) \ge \sqrt{\frac{1}{2n} \ln \frac{1}{\delta}} \right) \le \delta, \tag{3.2}
\]
which follows from the Hoeffding inequality [20, Theorem D.2], a special case of the McDiarmid inequality.

Note that to get the inequalities (3.1), (3.2) we need not impose any growth assumptions on $u$. This is an advantage of the relative utility. Let us formulate the obtained result more explicitly.

Theorem 1.
With probability at least $1 - \delta$ we have
\[
U(\nu^*) - U(\widehat\nu_n) \le \mathsf{E} \sup_{\nu \in \Delta} G_n(\nu) + \sqrt{\frac{2}{n} \ln \frac{2}{\delta}}, \tag{3.3}
\]
\[
\widehat U_n(\widehat\nu_n) - U(\widehat\nu_n) \le \mathsf{E} \sup_{\nu \in \Delta} G_n(\nu) + \sqrt{\frac{1}{2n} \ln \frac{1}{\delta}}. \tag{3.4}
\]
The distinction in constants in the right-hand sides of (3.3), (3.4) is due to the fact that we applied both inequalities (3.1), (3.2) to (2.3) and only the first one to (2.4). In the first case the following argument is used: if
\[
\mathsf{P}\left( \xi_i \ge \sqrt{\frac{1}{2n} \ln \frac{1}{\delta}} \right) \le \delta, \quad i = 1, 2,
\]
then
\[
\mathsf{P}\left( \xi_1 + \xi_2 \ge \sqrt{\frac{2}{n} \ln \frac{2}{\delta}} \right) \le \sum_{i=1}^2 \mathsf{P}\left( \xi_i \ge \sqrt{\frac{1}{2n} \ln \frac{1}{\delta/2}} \right) \le \delta.
\]
Theorem 2 contains the main result of the paper: the upper bounds for $\mathsf{E} \sup_{\nu \in \Delta} G_n(\nu)$.

Theorem 2.
Assume that the utility function $u$ is uniformly Hölder continuous on $(0, \infty)$:
\[
|u(x) - u(y)| \le K |x - y|^\alpha \tag{3.5}
\]
with some $\alpha \in (0, 1]$, $K > 0$. Assume further that
\[
A := \sup_{x > 0} \frac{x^\alpha}{u(x)} < \infty. \tag{3.6}
\]
Then
\[
\mathsf{E} \sup_{\nu \in \Delta} G_n(\nu) \le 2 A K \sqrt{\frac{2 \ln d}{n}}, \quad \alpha = 1, \tag{3.7}
\]
\[
\mathsf{E} \sup_{\nu \in \Delta} G_n(\nu) \le C A K \sqrt{\frac{d - 1}{\alpha n}}, \quad \alpha \in (0, 1), \tag{3.8}
\]
where $C > 0$ is an absolute constant.

Proof. Let $\varepsilon_i$, $i = 1, \dots, n$, be independent Rademacher random variables: $\mathsf{P}(\varepsilon_i = 1) = \mathsf{P}(\varepsilon_i = -1) = 1/2$, which are also independent of $r_1, \dots, r_n$. Consider the empirical Rademacher complexity (see, e.g., [20])
\[
\widehat{\mathcal{R}}(\mathcal{F} \circ R_n) = \frac{1}{n} \mathsf{E}\left( \sup_{\nu \in \Delta} \sum_{i=1}^n \varepsilon_i \frac{u(\langle \nu, r_i \rangle)}{u(r_i^*)} \,\Big|\, R_n \right)
\]
of the set of functions $\mathcal{F} = \{ r \mapsto u(\langle \nu, r \rangle)/u(r^*) : \nu \in \Delta \}$ with respect to the random sequence $R_n = (r_1, \dots, r_n)$. In fact we compute the Rademacher complexity of the following set of $n$-dimensional vectors:
\[
\mathcal{F} \circ R_n := \left\{ \left( \frac{u(\langle \nu, r_1 \rangle)}{u(r_1^*)}, \dots, \frac{u(\langle \nu, r_n \rangle)}{u(r_n^*)} \right) : \nu \in \Delta \right\}.
\]
For clarity recall (see [23]) that the Rademacher complexity of a set $C \subset \mathbb{R}^n$ is defined by the formula
\[
\widehat{\mathcal{R}}(C) = \frac{1}{n} \mathsf{E} \sup_{a \in C} \sum_{i=1}^n \varepsilon_i a_i. \tag{3.9}
\]
Let us consider the case $\alpha = 1$. The symmetrization argument ([27, Lemma 7.4]) gives the bound
\[
\mathsf{E} \sup_{\nu \in \Delta} G_n(\nu) \le 2\, \mathsf{E}\, \widehat{\mathcal{R}}(\mathcal{F} \circ R_n). \tag{3.10}
\]
For $\Psi(x, r) = u(x)/u(r^*)$, $r^* = \max_{1 \le i \le d} r^i$, we have
\[
|\Psi(x, r) - \Psi(y, r)| \le \frac{K}{u(r^*)} |x - y|.
\]
Literally following the proof of Talagrand's contraction lemma, given in [20, Lemma 5.7], we get the inequality
\[
\widehat{\mathcal{R}}(\mathcal{F} \circ R_n) = \frac{1}{n} \mathsf{E}\left( \sup_{\nu \in \Delta} \sum_{i=1}^n \varepsilon_i \Psi(\langle \nu, r_i \rangle, r_i) \,\Big|\, R_n \right) \le \frac{K}{n} \mathsf{E}\left( \sup_{\nu \in \Delta} \sum_{i=1}^n \varepsilon_i \frac{\langle \nu, r_i \rangle}{u(r_i^*)} \,\Big|\, R_n \right) = K \widehat{\mathcal{R}}(\mathcal{H} \circ R_n), \tag{3.11}
\]
\[
\mathcal{H} := \{ r \mapsto \langle \nu, r \rangle / u(r^*) : \nu \in \Delta \}.
\]
Note that the only difference from the Talagrand contraction lemma is that the Lipschitz constant for $x \mapsto \Psi(x, r)$ depends on $r$.

The Rademacher complexity of the set $\mathcal{H}$ equals the Rademacher complexity of its extreme points (as follows from [23, Lemma 26.7]), corresponding to the vectors of the standard basis: $\nu \in \{ e_1, \dots, e_d \}$, $e_i = (\delta_{ij})_{j=1}^d$, where $\delta_{ij}$ is the Kronecker symbol. Thus,
\[
\widehat{\mathcal{R}}(\mathcal{H} \circ R_n) = \widehat{\mathcal{R}}\left( \frac{r^1}{u(r^*)}, \dots, \frac{r^d}{u(r^*)} \right). \tag{3.12}
\]
Here $r^j/u(r^*) = \big( r_1^j/u(r_1^*), \dots, r_n^j/u(r_n^*) \big) \in \mathbb{R}^n$ are the normalized trajectories of the returns, and the right-hand side of (3.12) is computed in accordance with (3.9). The Rademacher complexity of a finite set of vectors can be estimated by Massart's lemma (see [20, Theorem 3.7]). Applying this lemma to the right-hand side of (3.12), we get the inequality
\[
\widehat{\mathcal{R}}\left( \frac{r^1}{u(r^*)}, \dots, \frac{r^d}{u(r^*)} \right) \le \frac{A}{\sqrt{n}} \sqrt{2 \ln d}, \tag{3.13}
\]
since by (3.6),
\[
\| r^j/u(r^*) \|_2 = \sqrt{ \sum_{k=1}^n \left( \frac{r_k^j}{u(r_k^*)} \right)^2 } \le A \sqrt{n},
\]
where $\|a\|_2 = \sqrt{\sum_{i=1}^n a_i^2}$ is the $l_2$-norm. The inequality (3.7) now follows from (3.10) – (3.13).

In the case $\alpha < 1$ first note that for fixed $R_n$ the process
\[
Z_n(\nu) = \frac{1}{n} \sum_{k=1}^n \varepsilon_k \frac{u(\langle \nu, r_k \rangle)}{u(r_k^*)}
\]
is subgaussian (see [27, Definition 5.20]) with respect to the data-dependent pseudometric
\[
\rho(\nu, \nu') = \frac{1}{n} \left( \sum_{k=1}^n \left( \frac{u(\langle \nu, r_k \rangle)}{u(r_k^*)} - \frac{u(\langle \nu', r_k \rangle)}{u(r_k^*)} \right)^2 \right)^{1/2},
\]
defined on $\Delta$. That is,
\[
\mathsf{E}\left( e^{\lambda (Z_n(\nu) - Z_n(\nu'))} \,\Big|\, R_n \right) = \prod_{k=1}^n \mathsf{E}\left[ \exp\left( \frac{\lambda}{n} \varepsilon_k \frac{u(\langle \nu, r_k \rangle) - u(\langle \nu', r_k \rangle)}{u(r_k^*)} \right) \Big|\, R_n \right] \le e^{\lambda^2 \rho^2(\nu, \nu')/2}.
\]
Here we used the elementary inequality $\mathsf{E}\, e^{\lambda \varepsilon_i a} \le e^{\lambda^2 a^2/2}$: [29, Example 2.3].

A set $N \subset \Delta$ is called $\epsilon$-dispersed if $\rho(\nu, \nu') \ge \epsilon$ for $\nu, \nu' \in N$ with $\nu \ne \nu'$. Let $D(\Delta, \rho, \epsilon)$ be the $\epsilon$-packing number of $(\Delta, \rho)$:
\[
D(\Delta, \rho, \epsilon) = \sup \{ |N| : N \text{ is } \epsilon\text{-dispersed} \}.
\]
Here $|N|$ is the cardinality of $N$.
The conditional expectation of the supremum of $Z_n$ is bounded by the Dudley entropy integral ([5, Corollary 13.2]):
\[
\widehat{\mathcal{R}}(\mathcal{F} \circ R_n) = \mathsf{E}\left( \sup_{\nu \in \Delta} Z_n(\nu) \,\Big|\, R_n \right) \le 12 \int_0^{\mathfrak{d}/2} \sqrt{\ln D(\Delta, \rho, \epsilon)}\, d\epsilon, \tag{3.14}
\]
where $\mathfrak{d}$ is the diameter of $\Delta$ with respect to $\rho$.

Conditions (3.5), (3.6) imply that
\[
\rho(\nu, \nu') \le \frac{K}{n} \left( \sum_{k=1}^n \frac{|\langle \nu - \nu', r_k \rangle|^{2\alpha}}{u^2(r_k^*)} \right)^{1/2} \le \frac{K}{n} \left( \sum_{k=1}^n \frac{(r_k^*)^{2\alpha} \|\nu - \nu'\|_1^{2\alpha}}{u^2(r_k^*)} \right)^{1/2} \le \frac{K A}{\sqrt{n}} \|\nu - \nu'\|_1^\alpha, \tag{3.15}
\]
where $\|a\|_1 = \sum_{j=1}^d |a_j|$ is the $l_1$-norm. For the $\epsilon$-packing number of $\Delta$ with the metric induced by $\|\cdot\|_1$, we have the inequality $D(\Delta, \|\cdot\|_1, \epsilon) \le (5/\epsilon)^{d-1}$ (see [9, Proposition C.1]). From (3.15) it follows that if $\rho(\nu, \nu') \ge \epsilon$ then
\[
\|\nu - \nu'\|_1 \ge \left( \frac{\sqrt{n}\, \epsilon}{K A} \right)^{1/\alpha}.
\]
Hence,
\[
D(\Delta, \rho, \epsilon) \le D\left( \Delta, \|\cdot\|_1, \left( \frac{\sqrt{n}\, \epsilon}{K A} \right)^{1/\alpha} \right) \le 5^{d-1} \left( \frac{K A}{\sqrt{n}\, \epsilon} \right)^{(d-1)/\alpha}. \tag{3.16}
\]
Furthermore, by (3.15) the diameter of $\Delta$ with respect to $\rho$ is estimated as
\[
\mathfrak{d} \le 2^\alpha \frac{K A}{\sqrt{n}}, \tag{3.17}
\]
since $\|\nu - \nu'\|_1 \le \|\nu\|_1 + \|\nu'\|_1 \le 2$. Let us substitute the estimates (3.16), (3.17) into (3.14), and perform the change of variables $z = \sqrt{n}\, \epsilon / (2^{\alpha - 1} K A)$:
\[
\begin{aligned}
\widehat{\mathcal{R}}(\mathcal{F} \circ R_n) &\le 12 \int_0^{2^{\alpha-1} K A/\sqrt{n}} \sqrt{ \ln\left( 5^{d-1} \left( \frac{K A}{\sqrt{n}\, \epsilon} \right)^{(d-1)/\alpha} \right) }\, d\epsilon \\
&= 12 \sqrt{\frac{d-1}{\alpha}} \int_0^{2^{\alpha-1} K A/\sqrt{n}} \sqrt{ \ln\left( \frac{5^\alpha K A}{\sqrt{n}\, \epsilon} \right) }\, d\epsilon \\
&= 12 \sqrt{\frac{d-1}{\alpha}}\, \frac{2^{\alpha-1} K A}{\sqrt{n}} \int_0^1 \sqrt{ \ln \frac{5^\alpha}{2^{\alpha-1} z} }\, dz \le C_1 K A \sqrt{\frac{d-1}{\alpha n}},
\end{aligned}
\]
\[
C_1 = 12 \int_0^1 \sqrt{ \ln \frac{5}{z} }\, dz.
\]
Together with (3.10) this completes the proof ($C = 2 C_1$).
$\square$

Condition (3.6) is satisfied most naturally by the power utility function $u(x) = x^\alpha$, $\alpha \in (0, 1]$. This function also satisfies (3.5) with $K = 1$, as easily follows from the inequality ([12, Appendix A, Lemma 5.1])
\[
(x + y)^\alpha \le x^\alpha + y^\alpha, \quad x, y > 0.
\]
For $u(x) = x^\alpha$ the problem (2.1) reduces to the optimization of the ordinary power utility function after the price normalization:
\[
U(\nu) = \mathsf{E} \langle \nu, r_{n+1}/r_{n+1}^* \rangle^\alpha.
\]
The power utility is natural in one more respect: the relative utility (2.1) in this case is independent of the investor's wealth $x$:
\[
\mathsf{E}\, \frac{u(x \langle \nu, r_{n+1} \rangle)}{u(x r_{n+1}^*)} = \mathsf{E} \langle \nu, r_{n+1}/r_{n+1}^* \rangle^\alpha.
\]
This means that one can consider the problems (2.1), (2.2) dynamically in an online manner. At each step the investor acts myopically, similarly to the case of the ordinary logarithmic utility.
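The wealth-invariance property of the relative power utility can be illustrated numerically. The following is a minimal sketch, not the author's code: the function names, the synthetic log-normal sample and the weight vector are hypothetical and serve only as an illustration of the empirical relative utility (2.2) for $u(x) = x^\alpha$.

```python
import random


def relative_power_utility(weights, returns, alpha, wealth=1.0):
    """Ratio u(x<nu, r>) / u(x r*) for the power utility u(x) = x**alpha."""
    portfolio_value = wealth * sum(w * r for w, r in zip(weights, returns))
    best_asset_value = wealth * max(returns)
    return portfolio_value ** alpha / best_asset_value ** alpha


def empirical_relative_utility(weights, sample, alpha):
    """Sample average of the relative utility over observed return vectors."""
    total = sum(relative_power_utility(weights, r, alpha) for r in sample)
    return total / len(sample)


# Hypothetical data: 500 observations of 3 assets with log-normal returns.
rng = random.Random(0)
sample = [[rng.lognormvariate(0.0, 0.02) for _ in range(3)] for _ in range(500)]
nu = [0.5, 0.3, 0.2]

value = empirical_relative_utility(nu, sample, alpha=0.5)
# The ratio for a single observation does not change with the wealth level.
ratio_small = relative_power_utility(nu, sample[0], 0.5, wealth=1.0)
ratio_large = relative_power_utility(nu, sample[0], 0.5, wealth=250.0)
```

Since every summand lies in $(0, 1]$, so does the sample average, in line with the boundedness used for the concentration inequalities of Section 3.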
Remark 1.
Under additional assumptions, condition (3.6) on the utility function can be relaxed. In fact we need only the upper bound for $r_k^*/u(r_k^*)$. Thus, if there exists a riskless asset (cash) with $r_k^1 = 1$, then the supremum in (3.6) can be taken over $[1, \infty)$. Furthermore, if the returns are bounded, then the supremum can be taken over a finite interval. In this case it is usually enough to consider the Lipschitz case $\alpha = 1$.

Remark 2.
Theorems 1, 2 give high-probability error bounds. From (2.3), (2.4) it follows that
\[
\max\big\{ U(\nu^*) - \mathsf{E}\, U(\widehat\nu_n),\ \mathsf{E}\big( \widehat U_n(\widehat\nu_n) - U(\widehat\nu_n) \big) \big\} \le \mathsf{E} \sup_{\nu \in \Delta} G_n(\nu).
\]
Thus, Theorem 2 also provides error bounds in expectation.
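To get a feeling for the size of these bounds, one can evaluate them numerically. The sketch below is only an illustration under stated assumptions: it takes the Lipschitz-case bound in the form $2AK\sqrt{2 \ln d / n}$ (symmetrization, contraction and Massart's lemma), the Hölder-case bound in the form $2 C_1 AK \sqrt{(d-1)/(\alpha n)}$ with $C_1 = 12 \int_0^1 \sqrt{\ln(5/z)}\, dz$, and the concentration term $\sqrt{(2/n) \ln(2/\delta)}$ of (3.3). The function names and the sample values of $n$, $d$, $\delta$ are hypothetical.

```python
import math


def c1_constant():
    """C_1 = 12 * int_0^1 sqrt(ln(5/z)) dz.

    The substitution z = 5 * exp(-t) gives 60 * Gamma(3/2, ln 5), and
    Gamma(3/2, x) = sqrt(x) * exp(-x) + (sqrt(pi) / 2) * erfc(sqrt(x)).
    """
    x = math.log(5.0)
    upper_gamma = (math.sqrt(x) * math.exp(-x)
                   + 0.5 * math.sqrt(math.pi) * math.erfc(math.sqrt(x)))
    return 60.0 * upper_gamma


def complexity_bound(n, d, alpha=1.0, A=1.0, K=1.0):
    """Assumed upper bound for E sup G_n in the Lipschitz or Hoelder case."""
    if alpha == 1.0:
        return 2.0 * A * K * math.sqrt(2.0 * math.log(d) / n)
    return 2.0 * c1_constant() * A * K * math.sqrt((d - 1) / (alpha * n))


def estimation_error_bound(n, d, delta, alpha=1.0, A=1.0, K=1.0):
    """Complexity term plus the concentration term of (3.3)."""
    concentration = math.sqrt(2.0 / n * math.log(2.0 / delta))
    return complexity_bound(n, d, alpha, A, K) + concentration


# Hypothetical setting: ten years of daily data, cash plus one risky asset.
bound_ten_years = estimation_error_bound(n=252 * 10, d=2, delta=0.05)
```

Both terms scale as $n^{-1/2}$, so quadrupling the sample size halves the bound. For the power utility one can take $A = K = 1$; because the relative utility lies in $(0, 1]$, a bound is informative only when it is well below $1$, which in the Hölder case requires a large $n$ due to the size of $C_1$.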
Remark 3.
The obtained error bounds are of order $n^{-1/2}$. In general, the main assumption which allows one to obtain $O(1/n)$ bounds is the strong concavity of $U$: [24, 21]. However, such an assumption requires additional conditions on the returns $r_i$, which we want to avoid in the present paper.

4. Stochastic exponentiated gradient algorithm
In this section we additionally assume that the utility function $u$ is concave. Recall that the subdifferential of $-u$ at any point $y \in (0, \infty)$ is an interval:
\[
\partial(-u)(y) = [-D^- u(y), -D^+ u(y)],
\]
where $D^- u(y)$ and $D^+ u(y)$ are the left and right derivatives: see [16, Chap. I]. We have $D^- u(y) \ge D^+ u(y) \ge 0$, as $u$ is non-decreasing.

We use the exponentiated gradient (EG) algorithm of [18] to solve the empirical utility maximization problem (2.2). Consider the empirical distribution generated by the sample $(r_1, \dots, r_n)$, and a random variable $\widehat r$ with this distribution:
\[
\widehat{\mathsf{P}}(\widehat r = r_k) = \frac{1}{n}, \quad k = 1, \dots, n.
\]
Put
\[
\underline{r}_n = \min_{1 \le k \le n} \min_{1 \le i \le d} r_k^i, \qquad \overline{r}_n = \max_{1 \le k \le n} \max_{1 \le i \le d} r_k^i
\]
and consider the convex functions
\[
\nu \mapsto f_j(\nu) = 1 - \frac{u(\langle \nu, \widehat r_j \rangle)}{u(\widehat r_j^*)} : \Delta \to [0, 1).
\]
From the description of their subdifferentials:
\[
\partial f_j(\nu) = \left\{ \frac{\gamma}{u(\widehat r_j^*)}\, \widehat r_j : \gamma \in [-D^- u(\langle \nu, \widehat r_j \rangle), -D^+ u(\langle \nu, \widehat r_j \rangle)] \right\}
\]
and the inequalities $0 < \underline{r}_n \le \langle \nu, \widehat r_j \rangle$, $j = 1, \dots, m$, we see that the absolute values of the subgradient components are bounded by the constant
\[
L_n = D^- u(\underline{r}_n) \cdot \max_{\underline{r}_n \le x \le \overline{r}_n} \frac{x}{u(x)} = D^- u(\underline{r}_n) \cdot \frac{\overline{r}_n}{u(\overline{r}_n)}.
\]
Indeed, $u(x)/x$ is non-increasing [16, Proposition 1.1.4], and the subdifferential mapping is monotone: $\gamma_1 \le \gamma_2$ whenever $\gamma_i \in \partial(-u)(y_i)$, $0 < y_1 < y_2$; see [16, Theorem 4.2.1]. It follows that the functions $f_j$ are $L_n$-Lipschitz with respect to the $l_1$-norm: see [22, Lemma 2.6].

Apply the exponentiated gradient algorithm to $f_1, \dots, f_m$:
\[
\nu_0^i = 1/d, \quad i = 1, \dots, d, \tag{4.1}
\]
\[
a_j^i = \nu_{j-1}^i \exp\left( \eta\, \frac{D^- u(\langle \nu_{j-1}, \widehat r_j \rangle)}{u(\widehat r_j^*)}\, \widehat r_j^i \right), \qquad \nu_j^i = \frac{a_j^i}{\sum_{l=1}^d a_j^l}, \tag{4.2}
\]
$i = 1, \dots, d$, $j = 1, \dots, m - 1$, where $\eta > 0$ is a parameter.
Note that
\[
-\frac{D^- u(\langle \nu_{j-1}, \widehat r_j \rangle)}{u(\widehat r_j^*)}\, \widehat r_j \in \partial f_j(\nu_{j-1}).
\]
For a moment assume that $\widehat r_j \in (0, \infty)^d$ is an arbitrary sequence. The basic problem of online convex optimization theory is to find a sequence $\nu_0, \dots, \nu_{m-1}$ such that $\nu_{j-1}$ does not depend on $f_j, \dots, f_m$ and the regret
\[
\mathrm{Regret}_m(\nu) = \sum_{j=1}^m f_j(\nu_{j-1}) - \sum_{j=1}^m f_j(\nu) = \sum_{j=1}^m \frac{u(\langle \nu, \widehat r_j \rangle)}{u(\widehat r_j^*)} - \sum_{j=1}^m \frac{u(\langle \nu_{j-1}, \widehat r_j \rangle)}{u(\widehat r_j^*)}
\]
is small uniformly over $\nu \in \Delta$. It is well known that the EG algorithm with $\eta = \frac{1}{L_n} \sqrt{\frac{\ln d}{m}}$ ensures the estimate
\[
\mathrm{Regret}_m(\nu) \le 2 L_n \sqrt{m} \sqrt{\ln d}, \tag{4.3}
\]
see [22, Corollary 2.14] (a constant is corrected).

For an i.i.d. random sequence $\widehat r_j$ we can apply to (4.1), (4.2) the online-to-batch conversion scheme: [22, Chap. 5]. In this case it is natural to call (4.1), (4.2) the stochastic exponentiated gradient (SEG) algorithm. Denote by $\widehat{\mathsf{E}}$ the expectation with respect to the empirical distribution of $r_1, \dots, r_n$. For any fixed $\nu$,
\[
\widehat{\mathsf{E}}\, \frac{u(\langle \nu, \widehat r_j \rangle)}{u(\widehat r_j^*)} = \frac{1}{n} \sum_{k=1}^n \frac{u(\langle \nu, r_k \rangle)}{u(r_k^*)} = \widehat U_n(\nu). \tag{4.4}
\]
Furthermore, since $\nu_{j-1}$ is $\sigma(\widehat r_1, \dots, \widehat r_{j-1})$-measurable, we have
\[
\widehat{\mathsf{E}}\, \frac{u(\langle \nu_{j-1}, \widehat r_j \rangle)}{u(\widehat r_j^*)} = \widehat{\mathsf{E}}\, \widehat{\mathsf{E}}\left( \frac{u(\langle \nu_{j-1}, \widehat r_j \rangle)}{u(\widehat r_j^*)} \,\Big|\, \widehat r_1, \dots, \widehat r_{j-1} \right) = \widehat{\mathsf{E}}\, \frac{1}{n} \sum_{k=1}^n \frac{u(\langle \nu_{j-1}, r_k \rangle)}{u(r_k^*)},
\]
\[
\begin{aligned}
\frac{1}{m} \widehat{\mathsf{E}} \sum_{j=1}^m \frac{u(\langle \nu_{j-1}, \widehat r_j \rangle)}{u(\widehat r_j^*)} &= \frac{1}{m} \sum_{j=1}^m \widehat{\mathsf{E}}\, \frac{1}{n} \sum_{k=1}^n \frac{u(\langle \nu_{j-1}, r_k \rangle)}{u(r_k^*)} = \frac{1}{n} \sum_{k=1}^n \widehat{\mathsf{E}}\, \frac{1}{m} \sum_{j=1}^m \frac{u(\langle \nu_{j-1}, r_k \rangle)}{u(r_k^*)} \\
&\le \widehat{\mathsf{E}}\, \frac{1}{n} \sum_{k=1}^n \frac{u(\langle \overline{\nu}_m, r_k \rangle)}{u(r_k^*)} = \widehat{\mathsf{E}}\, \widehat U_n(\overline{\nu}_m),
\end{aligned} \tag{4.5}
\]
where the inequality follows from the concavity of $u$ and
\[
\overline{\nu}_m = \frac{1}{m} \sum_{j=0}^{m-1} \nu_j. \tag{4.6}
\]
In these calculations $r_1, \dots, r_n$ are regarded as constants. Note that $\nu_j$, $\overline{\nu}_m$ depend also on $n$, but we suppress this dependence in the notation.
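The resampling, the multiplicative update (4.2) and the averaging (4.6) can be sketched in a few lines. This is a minimal illustration for the power utility $u(x) = x^\alpha$, for which $D^- u(y) = \alpha y^{\alpha - 1}$; it is not the implementation from the author's repository, and the function name, the step-size choice $\eta = \sqrt{\ln d / m} / L_n$ and the toy data are assumptions made here.

```python
import math
import random


def seg_portfolio(sample, alpha, m, seed=0):
    """Average SEG portfolio for u(x) = x**alpha, 0 < alpha <= 1 (a sketch).

    sample: list of return vectors r_1, ..., r_n with positive entries.
    m: number of resampled observations, i.e. the number of EG steps.
    """
    rng = random.Random(seed)
    d = len(sample[0])
    r_min = min(min(r) for r in sample)
    r_max = max(max(r) for r in sample)
    # L_n = D^-u(r_min) * r_max / u(r_max) for the power utility.
    lipschitz = alpha * r_min ** (alpha - 1.0) * r_max ** (1.0 - alpha)
    eta = math.sqrt(math.log(d) / m) / lipschitz  # assumed step size

    nu = [1.0 / d] * d   # uniform initial weights
    average = [0.0] * d  # accumulates (nu_0 + ... + nu_{m-1}) / m
    for _ in range(m):
        average = [a + w / m for a, w in zip(average, nu)]
        r = rng.choice(sample)  # draw from the empirical distribution
        portfolio_return = sum(w * x for w, x in zip(nu, r))
        # Left derivative of u at <nu, r>, divided by u(r*).
        slope = alpha * portfolio_return ** (alpha - 1.0) / max(r) ** alpha
        unnormalized = [w * math.exp(eta * slope * x) for w, x in zip(nu, r)]
        total = sum(unnormalized)
        nu = [a / total for a in unnormalized]
    return average


# Toy data: the second asset always outperforms cash.
toy_sample = [[1.0, 1.1]] * 20
weights = seg_portfolio(toy_sample, alpha=0.5, m=2000)
```

The returned vector is a convex combination of points of the simplex, hence itself a portfolio. On the toy data the weight of the dominating asset should exceed that of cash; no new observations are needed to increase `m`, since the algorithm only resamples the given history.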
From (4.3) – (4.5) we get
\[
2 L_n \sqrt{\frac{\ln d}{m}} \ge \frac{\widehat{\mathsf{E}}\, \mathrm{Regret}_m(\nu)}{m} = \frac{1}{m} \widehat{\mathsf{E}} \sum_{j=1}^m \left( \frac{u(\langle \nu, \widehat r_j \rangle)}{u(\widehat r_j^*)} - \frac{u(\langle \nu_{j-1}, \widehat r_j \rangle)}{u(\widehat r_j^*)} \right) \ge \widehat U_n(\nu) - \widehat{\mathsf{E}}\, \widehat U_n(\overline{\nu}_m).
\]
In particular, for an empirical utility maximizer $\widehat\nu_n$,
\[
\widehat U_n(\widehat\nu_n) \le \widehat{\mathsf{E}}\, \widehat U_n(\overline{\nu}_m) + 2 L_n \sqrt{\frac{\ln d}{m}} \le \widehat U_n(\overline{\nu}_m) + \sqrt{\frac{1}{2n} \ln \frac{1}{\delta}} + 2 L_n \sqrt{\frac{\ln d}{m}} \tag{4.7}
\]
with probability at least $1 - \delta$ by Hoeffding's inequality ([20, Theorem D.2]):
\[
\widehat{\mathsf{P}}\big( \widehat{\mathsf{E}}\, \widehat U_n(\overline{\nu}_m) - \widehat U_n(\overline{\nu}_m) \ge \varepsilon \big) = \widehat{\mathsf{P}}\left( \widehat{\mathsf{E}}\, \frac{1}{n} \sum_{k=1}^n \frac{u(\langle \overline{\nu}_m, r_k \rangle)}{u(r_k^*)} - \frac{1}{n} \sum_{k=1}^n \frac{u(\langle \overline{\nu}_m, r_k \rangle)}{u(r_k^*)} \ge \varepsilon \right) \le e^{-2 \varepsilon^2 n}
\]
with $\varepsilon = \sqrt{\frac{1}{2n} \ln \frac{1}{\delta}}$.

We are now able to provide for $\overline{\nu}_m$ an analog of inequality (3.3):
\[
\begin{aligned}
U(\nu^*) - U(\overline{\nu}_m) &= U(\nu^*) - \widehat U_n(\nu^*) + \widehat U_n(\nu^*) - \widehat U_n(\widehat\nu_n) + \widehat U_n(\widehat\nu_n) - \widehat U_n(\overline{\nu}_m) + \widehat U_n(\overline{\nu}_m) - U(\overline{\nu}_m) \\
&\le \big( U(\nu^*) - \widehat U_n(\nu^*) \big) + \big( \widehat U_n(\widehat\nu_n) - \widehat U_n(\overline{\nu}_m) \big) + \sup_{\nu \in \Delta} G_n(\nu).
\end{aligned}
\]
Applying (3.2), (4.7) and (3.1) respectively to the three terms in the right-hand side, we get the following result.
Theorem 3.
Assume that the function $u$ is concave. Then for the average portfolio (4.6), produced by the SEG algorithm (4.1), (4.2), with probability at least $1 - 3\delta$ the following estimate holds true:
\[
U(\nu^*) - U(\overline{\nu}_m) \le \mathsf{E} \sup_{\nu \in \Delta} G_n(\nu) + 3 \sqrt{\frac{1}{2n} \ln \frac{1}{\delta}} + 2 L_n \sqrt{\frac{\ln d}{m}}.
\]

Certainly, the estimates of Theorem 2 can still be applied to $\mathsf{E} \sup_{\nu \in \Delta} G_n(\nu)$. Thus, Theorem 3 gives a high-probability bound for the estimation error of the stochastic exponentiated gradient algorithm. The value of $m$ can be taken sufficiently large to get for the estimation error of $\overline{\nu}_m$ a bound of the same order as for the exact empirical utility maximizer $\widehat\nu_n$. The mentioned value of $m$ is data dependent, since the Lipschitz constant $L_n$ depends on the returns $(r_1, \dots, r_n)$. Note that we need no new data to generate an arbitrarily large sample $\widehat r_1, \dots, \widehat r_m$ used in the SEG algorithm.

5. Power utility: the case of one risky asset
Consider the case $d = 2$. In this section we put upper indexes in brackets. Assume that the investor can keep money in cash: $r_k^{(1)} = 1$, or invest in a risky asset, whose daily returns are log-normal and follow the discrete-time Black-Scholes model:
\[
r_k^{(2)} = \exp\left( \frac{\mu - \sigma^2/2}{T} + \frac{\sigma}{\sqrt{T}} Z_k \right), \quad k = 1, \dots, n. \tag{5.1}
\]
Here $T = 252$ is the number of trading days in a year; $Z_k$ are independent standard normal variables: $Z_k \sim N(0, 1)$; $n$ is the sample size, which we assume to be a multiple of $T$. Put $\mu = 0.$…, which corresponds to
\[
\mathsf{E} \prod_{k=1}^T r_k^{(2)} = e^\mu \approx \ldots
\]
annual expected return for the risky asset, and $\sigma = 0.$…. We have
\[
\ln r_k^{(2)} \sim N\left( \frac{\mu - \sigma^2/2}{T}, \frac{\sigma}{\sqrt{T}} \right) = N(1.\ldots \cdot 10^{-\ldots},\ \ldots \cdot 10^{-\ldots}).
\]
In this section we assume that $u(x) = x^\alpha$, $\alpha \in (0, 1]$. The relative empirical utility maximization problem (2.2) takes the form
\[
\psi(\nu^{(2)}) = \frac{1}{n} \sum_{k=1}^n \langle \nu, r_k/r_k^* \rangle^\alpha = \frac{1}{n} \sum_{k=1}^n \left( \frac{1}{\max\{1, r_k^{(2)}\}} + \frac{r_k^{(2)} - 1}{\max\{1, r_k^{(2)}\}}\, \nu^{(2)} \right)^\alpha \to \max_{\nu^{(2)} \in [0, 1]}. \tag{5.2}
\]
For comparison consider also the ordinary empirical utility:
\[
\varphi(\nu^{(2)}) = \frac{1}{n} \sum_{k=1}^n \langle \nu, r_k \rangle^\alpha = \frac{1}{n} \sum_{k=1}^n \left( 1 + (r_k^{(2)} - 1) \nu^{(2)} \right)^\alpha \to \max_{\nu^{(2)} \in [0, 1]}. \tag{5.3}
\]
For a large $n = T \cdot \ldots = 2.\ldots \cdot 10^{\ldots}$ we applied to $\varphi'(\nu)$, $\psi'(\nu)$ the bisection method optimize.bisect from the module scipy (Python) with the default tolerance parameter. The results, averaged over 100 realizations of $(r_k^{(2)})_{k=1}^n$, are presented in Table 1.

Table 1. Average optimal weight $\nu^{(2)}$ of the risky asset. Columns: $\alpha$, $\varphi$, $\psi$.

We see that the relative utility makes the investor more risk averse. This property can be easily explained. Instead of the power utility function consider a differentiable increasing concave function $u$. Without loss of generality, we can assume that $u(1) = 1$. For the expected utilities, corresponding to (5.2), (5.3), we have
\[
\begin{aligned}
\psi'(\nu^{(2)}) := \frac{\partial U(\nu)}{\partial \nu^{(2)}} &= \mathsf{E}\left( \frac{u'(1 + (r^{(2)} - 1) \nu^{(2)})}{u(\max\{1, r^{(2)}\})}\, (r^{(2)} - 1) \right) \\
&= \mathsf{E}\left( u'(1 + (r^{(2)} - 1) \nu^{(2)}) (r^{(2)} - 1) I_{\{r^{(2)} \le 1\}} \right) + \mathsf{E}\left( \frac{u'(1 + (r^{(2)} - 1) \nu^{(2)})}{u(r^{(2)})}\, (r^{(2)} - 1) I_{\{r^{(2)} > 1\}} \right) \\
&\le \mathsf{E}\left( u'(1 + (r^{(2)} - 1) \nu^{(2)}) (r^{(2)} - 1) \right) = \frac{\partial \widetilde U(\nu)}{\partial \nu^{(2)}} =: \varphi'(\nu^{(2)}),
\end{aligned}
\]
where $\widetilde U(\nu) = \mathsf{E}\, u(\langle \nu, r \rangle)$ is the ordinary expected utility.
The functions $\psi'$, $\varphi'$ are decreasing. It follows that the zero of $\psi'$ does not exceed the zero of $\varphi'$ (for simplicity we assume that each zero is unique). A similar argument works for the empirical utilities.

However, in the next section we will see that this property is not universal. In a model with several risky assets the optimal portfolio corresponding to the relative power utility can be riskier than the one for the ordinary utility.
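The comparison of (5.2) and (5.3) can be sketched in Python as follows. This is only an illustration under stated assumptions, not the code used for Table 1: the values of `mu`, `sigma`, `alpha` and the sample size are hypothetical, and the function names are introduced here. The sketch simulates the returns (5.1), differentiates the empirical utilities in $\nu^{(2)}$, and locates the zero of each derivative with `scipy.optimize.bisect`, falling back to an endpoint of $[0, 1]$ when the derivative does not change sign.

```python
import numpy as np
from scipy.optimize import bisect

T = 252  # trading days per year, as in (5.1)


def simulate_returns(mu, sigma, n, rng):
    """Daily gross returns of the risky asset under the model (5.1)."""
    z = rng.standard_normal(n)
    return np.exp((mu - sigma**2 / 2) / T + sigma / np.sqrt(T) * z)


def utility_derivative(nu, r, alpha, relative):
    """Derivative in nu of the empirical utility (5.3), or of (5.2) if relative."""
    best = np.maximum(1.0, r) if relative else 1.0  # best asset in hindsight
    base = (1.0 + (r - 1.0) * nu) / best
    return np.mean(alpha * base ** (alpha - 1.0) * (r - 1.0) / best)


def optimal_weight(r, alpha, relative):
    """Maximizer over [0, 1] of the concave empirical utility."""
    f = lambda nu: utility_derivative(nu, r, alpha, relative)
    if f(0.0) <= 0.0:   # utility is non-increasing on [0, 1]: stay in cash
        return 0.0
    if f(1.0) >= 0.0:   # utility is non-decreasing on [0, 1]: invest everything
        return 1.0
    return bisect(f, 0.0, 1.0)  # interior zero of the decreasing derivative


# Hypothetical parameters, chosen only for illustration.
rng = np.random.default_rng(0)
r = simulate_returns(mu=0.05, sigma=0.2, n=100 * T, rng=rng)
nu_ordinary = optimal_weight(r, alpha=0.5, relative=False)
nu_relative = optimal_weight(r, alpha=0.5, relative=True)
print(nu_ordinary, nu_relative)
```

On any fixed sample the relative weight cannot exceed the ordinary one, since each term of the derivative of (5.2) is the corresponding term of the derivative of (5.3) multiplied by $\max\{1, r^{(2)}_k\}^{-\alpha} \le 1$ when $r^{(2)}_k > 1$ and left unchanged otherwise.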
Next we argue that if the price of a risky asset follows the Black-Scholes model, neither nor years are enough to make any reliable conclusions concerning the optimal value $\nu^{(*,2)}$ on the basis of daily historical prices.

For α = 0 . in the left panels of Fig. 1 we show the histograms of the optimal weight $\hat\nu^{(2)}_n$ of the risky asset for 200 realizations of daily returns $(r^{(2)}_k)_{k=1}^{n}$, where n = 252 · k , k = 1 , , . To estimate the true utility $U(\nu)$ of $\hat\nu$ we used the empirical mean $\hat U_N(\nu)$ with very large N = 10 . The histograms of the linearly transformed true utilities ( U ( ν̂ ) − U ( w )) · , $w = (1, 0)$, are shown in the right panels of Fig. 1. In the same way we obtained the estimates of the optimal weight of the risky asset, ν ∗ , ≈ . , and of its utility:
( U ( ν̂ ∗ ) − U ( w )) · ≈ . . (5.4)
We see that the optimal portfolio weights concentrate near the optimal value very slowly. In particular, for n = 252 · in most cases $\hat\nu^{(2)}_n$ simply takes the extreme values 0 and 1. Only for n = 252 · is the largest peak near the optimum, but even in this case it is blurred. Note, however, that the true utilities of $\hat\nu^{(2)}_n$ demonstrate somewhat better concentration near the optimum (5.4). These conclusions are not specific to the relative power utility or to a particular value of $\alpha$: for other values of $\alpha$, and for the ordinary power or logarithmic utilities, the results are similar.

Note that the slow concentration phenomenon (which is related to the fragility of SAA in portfolio optimization [1]) does not contradict Theorems 1, 2. Roughly speaking, these theorems give the estimate
\[
U(\nu^*) - U(w) \le U(\hat\nu_n) - U(w) + O\left( \frac{1}{\sqrt{n}} \right)
\]
with high probability. From (5.4) it follows that we need n at least of order to get a nontrivial lower bound for $U(\hat\nu_n) - U(w)$.

Figure 1. Histograms of the optimal weight $\hat\nu^{(2)}_n$ of the risky asset (left panels) and of the linearly transformed true utility ( U ( ν̂_n ) − U ( w )) · , $w = (1, 0)$ (right panels) for 200 realizations of daily returns $(r^{(2)}_k)_{k=1}^{n}$ for n = 252 · k , k = 1 , , . The case of relative power utility with α = 0 . .

6. Experiments with NYSE data

We considered two datasets containing daily stock returns from the New York Stock Exchange (NYSE):
• the first contains 5651 daily returns of 36 stocks for the period ending in 1984,
• the second contains 11178 daily returns of 19 stocks for the period ending in 2006.
Both datasets were taken from . The first is a classical dataset, considered in many papers, starting from [7] (see the references in [13, 14]). The second was first analyzed in [13], where the authors also proposed a simple greedy algorithm for the empirical logarithmic utility maximization:
\[
\frac{1}{n} \sum_{k=1}^{n} \ln \langle \nu, r_k \rangle \to \max_{\nu \in \Delta}.
\]
In this paper we are interested in an application of the exponentiated gradient (EG) algorithm. Note that already in [15] this algorithm was applied to the first NYSE dataset and the logarithmic utility. However, our goal here is different: we want to solve the problem (2.2). Unfortunately, we were unable to do this using the algorithm in the form (4.1), (4.2), or with a time-varying learning rate $\eta$ (e.g., applying the doubling trick; see [22]). So we propose a modification: the greedy doubly stochastic exponentiated gradient (GDSEG) algorithm. For clarity we present its pseudocode for the power utility $u(x) = x^{\alpha}$.

Greedy doubly stochastic exponentiated gradient algorithm (GDSEG) for the power utility
Input: $\bar\eta > 0$: an upper bound for the learning rate; n_attempts: an upper bound for the number of attempts to improve the current portfolio; threshold: an improvement threshold; $\{r^i_k : k \in \{1, \dots, n\}, i \in \{1, \dots, d\}\}$: an array of daily returns; $\alpha \in (0, 1]$
 1: $\nu^i := 1/d$, $i = 1, \dots, d$
 2: if the relative utility is considered then
 3:     $r^i_k := r^i_k / \max_{j=1}^{d} (r^j_k)$, $i = 1, \dots, d$, $k = 1, \dots, n$
 4: end if
 5: attempt := 0
 6: while attempt ≤ n_attempts do
 7:     Choose $k \in \{1, \dots, n\}$ uniformly at random
 8:     Choose $\eta \in [0, \bar\eta]$ uniformly at random
 9:     $a^i := \nu^i \exp\left( \eta r^i_k / \langle \nu, r_k \rangle^{1 - \alpha} \right)$, $w^i := a^i / \sum_{j=1}^{d} a^j$
10:     attempt := attempt + 1
11:     if $\frac{1}{n} \sum_{t=1}^{n} \langle w, r_t \rangle^{\alpha} \ge \frac{1}{n} \sum_{t=1}^{n} \langle \nu, r_t \rangle^{\alpha}$ + threshold then
12:         $\nu := w$, attempt := 0
13:     end if
14: end while
Output: an optimal portfolio $\nu$

The algorithm accepts either the original returns $r_k$ or the scaled returns $r_k / r^*_k$. The first case corresponds to the traditional power utility, the second to the relative power utility. At each point $\nu$ the algorithm tries to make a step according to line 9, corresponding to (4.2), where the return $r_k$ and the learning rate are taken randomly by sampling $k$ and $\eta$ from the uniform distributions over $\{1, \dots, n\}$ and $[0, \bar\eta]$ respectively. In fact, this is a step of a stochastic gradient method with a random learning rate. That is why we call the algorithm “doubly stochastic”. Furthermore, the step is actually performed only if the value of the objective function for the new portfolio $w$ surpasses the current value by at least threshold (line 11). The algorithm stops if no such improvement is obtained in a predefined number of attempts, n_attempts.

For the logarithmic utility one should put $\alpha = 0$ and replace the power function in line 11 by the logarithm. We do not consider the relative utility in this case.

The algorithm was applied to both NYSE datasets with the following parameters: $\bar\eta$ = 1 , n_attempts = 10 , threshold = 10 − . The number of iterations and the results depend on the seed parameter. The average number of attempts to improve the current portfolio over 30 runs of the algorithm was about · for the first dataset and · for the second. In both cases the output portfolio $\nu$ concentrates on only a few stocks: 5 for the first dataset and 3 for the second. We drop ν^i with ν^i < . and normalize the results:
ν^i := ν^i I{ν^i ≥ . } / Σ_{j=1}^{d} ν^j I{ν^j ≥ . }.
For the logarithmic utility the results can be compared with those of [4, 13].
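The GDSEG pseudocode above can be sketched in Python roughly as follows. This is a minimal illustration rather than the implementation behind the reported results: the function name `gdseg`, the default values of `eta_max`, `n_attempts` and `threshold`, and the synthetic test data are all assumptions made here.

```python
import numpy as np


def gdseg(returns, alpha, eta_max=1.0, n_attempts=1000, threshold=1e-9,
          relative=False, seed=0):
    """Greedy doubly stochastic EG sketch for the power utility x**alpha.

    returns: array of shape (n, d) with positive daily gross returns.
    The default parameter values are illustrative assumptions.
    """
    rng = np.random.default_rng(seed)
    r = np.asarray(returns, dtype=float)
    if relative:
        # Scale each day's returns by the best asset in hindsight.
        r = r / r.max(axis=1, keepdims=True)
    n, d = r.shape
    nu = np.full(d, 1.0 / d)                      # start from uniform weights
    utility = lambda v: np.mean((r @ v) ** alpha)  # empirical power utility
    best = utility(nu)
    attempt = 0
    while attempt <= n_attempts:
        k = rng.integers(n)                # random observation
        eta = rng.uniform(0.0, eta_max)    # random learning rate
        a = nu * np.exp(eta * r[k] / (r[k] @ nu) ** (1.0 - alpha))
        w = a / a.sum()                    # exponentiated gradient step
        attempt += 1
        value = utility(w)
        if value >= best + threshold:      # greedy acceptance
            nu, best, attempt = w, value, 0
    return nu


# Synthetic gross returns, used only to exercise the sketch.
rng = np.random.default_rng(1)
data = np.exp(rng.normal(0.0005, 0.01, size=(500, 4)))
nu = gdseg(data, alpha=0.5, n_attempts=200, threshold=1e-8)
print(nu)
```

Because a step is accepted only when it improves the empirical utility by at least `threshold`, the returned portfolio is never worse on the sample than the uniform starting point, and the loop terminates after finitely many accepted steps.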
In Tables 2, 3 we present the minimal and maximal values of each weight obtained in 30 runs of the GDSEG algorithm. The accumulated wealth $X_n = \prod_{t=1}^{n} \langle \nu, r_t \rangle$, in fact, does not depend on a particular output $\nu$:
first dataset: X ≈ . , annual return: . ; second dataset: X ≈ . , annual return: . .
The annual return is computed by the formula $X_n^{252/n}$.

In general the GDSEG algorithm need not be so stable. For the power utility $u(x) = x^{\alpha}$ we implemented the following strategy: take the output $\nu$ corresponding to the largest value of the empirical utility function obtained in 10 experiments. The results for the second NYSE dataset are presented in Table 4. In the sequel we concentrate only on this dataset.

Table 2. Optimal weights for the logarithmic utility, first NYSE dataset: 30 experiments of the GDSEG algorithm
Stock    Weight [4]    Weight GDSEG, [min, max]
comme    0.2767        [0 . , .
espey    0.1953        [0 . , .
iroqu    0.0927        [0 . , .
kinar    0.2507        [0 . , .
meico    0.1845        [0 . , .

Table 3. Optimal weights for the logarithmic utility, second NYSE dataset: 30 experiments of the GDSEG algorithm
Stock    Weight [13]   Weight GDSEG, [min, max]
hp       0.177         [0 . , .
morris   0.747         [0 . , .
schlum   0.076         [0 . , .

Table 4.
Second NYSE dataset: optimal portfolio weights corresponding to the largest value of the empirical power utility function obtained in 10 experiments of the GDSEG algorithm; the accumulated wealth $X_n$, $n = 11178$; the annual returns and the annual volatilities of these portfolios
                 Ordinary utility                            Relative utility
α   Stock    Weight  X_n     Ann. ret.  Ann. volat.    Weight  X_n     Ann. ret.  Ann. volat.
.   hp       0.1792  4100.4  1.206      0.234          0.1782  4100.4  1.206      0.234
    morris   0.7518                                    0.7523
    schlum   0.0690                                    0.0695
.   hp       0.1762  4091.2  1.206      0.237          0.1617  4085.7  1.206      0.238
    morris   0.7766                                    0.7882
    schlum   0.0473                                    0.0501
.   hp       0.1779  4035.7  1.206      0.245          0.1476  3999.7  1.206      0.248
    morris   0.8221                                    0.8524
.   hp       0.1589  4016.1  1.206      0.247          0.1069  3912.5  1.205      0.253
    morris   0.8411                                    0.8931
.   hp       0.0972  3885.4  1.205      0.254          0       3496.7  1.202      0.270
    morris   0.9028                                    1
.   morris   1       3496.7  1.202      0.269          1       3496.7  1.202      0.270

Note that as $\alpha$ grows, the utility maximizer concentrates more on one stock. This effect is stronger for the relative utility. Such behavior can be qualified as riskier: see the annual volatility of portfolio returns in Table 4. This quantity is defined as the empirical standard deviation of $(\langle \hat\nu_n, r_k \rangle)_{k=1}^{n}$, multiplied by $\sqrt{252}$. For the log-optimal portfolio from Table 3 it equals 0.233.

The data used in the above calculations can be considered as a realization of some multidimensional stochastic process. From the example considered in Section 5 it is clear that the values of an empirical utility function can be very sensitive to such realizations. To get more insight into the risk and generalization properties of empirically optimal portfolios, let us try to describe the stock prices by the multidimensional Black-Scholes model:
\[
dS^i_t = S^i_t \mu^i \, dt + S^i_t \sum_{j=1}^{m} \sigma^{ij} \, dW^j_t, \quad i = 1, \dots, d, \tag{6.1}
\]
where $(W^1, \dots, W^m)$ is a standard Wiener process, $\mu$ is the drift vector and $\sigma$ is the volatility matrix.
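To make the model (6.1) concrete, here is a small Python sketch that samples daily gross returns from it. It relies on the standard fact that the log-increments of (6.1) over a step $h$ are Gaussian with mean $(\mu^i - \frac{1}{2} \sum_{j} (\sigma^{ij})^2) h$ and covariance $\sigma \sigma^{\top} h$. The function name, the step $h = 1/252$ and the two-asset parameters below are assumptions chosen for illustration; they are not the values estimated from the NYSE data.

```python
import numpy as np


def simulate_daily_returns(mu, sigma, n, h=1.0 / 252, seed=0):
    """Sample n daily gross return vectors from the model (6.1).

    mu: drift vector of shape (d,); sigma: volatility matrix of shape (d, m).
    """
    rng = np.random.default_rng(seed)
    mu = np.asarray(mu, dtype=float)
    sigma = np.asarray(sigma, dtype=float)
    m = sigma.shape[1]
    # Deterministic part of the log-return over one step of length h.
    drift = (mu - 0.5 * np.sum(sigma**2, axis=1)) * h
    # Independent Wiener increments over one step: N(0, h) for each factor.
    dw = rng.standard_normal((n, m)) * np.sqrt(h)
    return np.exp(drift + dw @ sigma.T)


# Hypothetical two-asset parameters, for illustration only.
mu = np.array([0.10, 0.15])
sigma = np.array([[0.2, 0.0], [0.1, 0.3]])
returns = simulate_daily_returns(mu, sigma, n=100_000)
log_r = np.log(returns)
print(log_r.mean(axis=0))
print(np.cov(log_r, rowvar=False))
```

The sample mean and covariance of the simulated log-returns should be close to the Gaussian parameters above, which gives a quick sanity check of the simulation.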
Solving the system of stochastic differential equations (6.1), we get
\[
S^i_t = S^i_0 \exp\left( \left( \mu^i - \frac{1}{2} \sum_{j=1}^{m} (\sigma^{ij})^2 \right) t + \sum_{j=1}^{m} \sigma^{ij} W^j_t \right), \quad i = 1, \dots, d.
\]
If $t = 1$ corresponds to one year, then the daily log-returns should be approximated as follows:
\[
\ln r^i_k = \left( \mu^i - \frac{1}{2} \sum_{j=1}^{m} (\sigma^{ij})^2 \right) h + \sum_{j=1}^{m} \sigma^{ij} \left( W^j_{kh} - W^j_{(k-1)h} \right), \quad h = 1/252, \quad k = 1, \dots, n. \tag{6.2}
\]
We estimated the expectation vector and the covariance matrix
\[
\left( \mu^i h - \frac{1}{2} \sum_{j=1}^{m} (\sigma^{ij})^2 h \right)_{i=1}^{d}, \qquad \left( \sum_{k=1}^{m} \sigma^{ik} \sigma^{jk} h \right)_{i,j=1}^{d}
\]
of $(\ln r^i_k)_{i=1}^{d}$ for the second NYSE dataset, using the numpy module. This allows us to generate artificial data by (6.2). For the empirically optimal portfolios from Tables 3, 4, as well as for the portfolio with uniform weights $w = (1/d, \dots, 1/d)$, $d = 19$, we computed some statistical characteristics of the annual accumulated wealth X , using these data. The results are collected in Table 5. This table mainly demonstrates the risk properties of empirically optimal portfolios. For example, as $\alpha$ grows, the portfolios become riskier: their expectations and standard deviations increase, but their medians decrease. The portfolios corresponding to the relative power utility are riskier than those for the ordinary one, in contrast to the example in Section 5, but in accordance with Table 4: see again the annual volatility columns.

The considered dataset is favorable for the investor: the stock prices are growing (on average). Moreover, the performance is evaluated with respect to a concrete model. However, even in this case investment decisions based on the historical data are risky. For example, from Table 5 we see that for the log-optimal portfolio there is a 5% chance to lose more than 18% of the initial wealth within 1 year.

Note that the means are larger than the medians.
This is in line with [13], where it is explained that typically $X_n$ is less than $E X_n$ for log-optimal portfolios. We also see that the medians give good estimates for the annual returns from Table 4.

Table 5. Statistical characteristics of the annual accumulated wealth X for the portfolios from Table 4 for the artificial data (6.2) with the parameters estimated for the second NYSE dataset. Averaging was performed over realizations generated by the Black-Scholes model.
Portfolio            Mean   Median  Std. deviation  5th percentile  95th percentile
uniform              1.165  1.152   0.183           0.891           1.487
log-optimal          1.240  1.207   0.294           0.820           1.772
α = 0 .   ordinary   1.240  1.207   0.295           0.819           1.775
          relative   1.240  1.207   0.295           0.819           1.775
α = 0 .   ordinary   1.241  1.207   0.299           0.815           1.785
          relative   1.242  1.207   0.300           0.814           1.787
α = 0 .   ordinary   1.243  1.206   0.310           0.805           1.808
          relative   1.244  1.206   0.314           0.801           1.815
α = 0 .   ordinary   1.244  1.206   0.312           0.803           1.812
          relative   1.245  1.205   0.320           0.794           1.828
α = 0 .   ordinary   1.245  1.205   0.322           0.793           1.831
          relative   1.247  1.202   0.342           0.771           1.872

Finally, we tried to estimate the true utility of the empirically optimal portfolios constructed for trajectories of the Black-Scholes model. We used the same method as in Section 5, but with the GDSEG algorithm instead of bisection. Namely, for α = 0 . we considered 200 trajectories $(r_1, \dots, r_n)$, $n = 11178$, generated by the Black-Scholes model (6.2) with parameters estimated for the second NYSE dataset. For each trajectory the empirically optimal portfolio was computed by the GDSEG algorithm (we picked the best portfolio in 10 experiments). For a fixed trajectory the optimal portfolio concentrated on a small number of stocks (from 1 to 4). For illustration purposes, in Fig. 2(a) we show the average weight of each stock over the 200 optimal portfolios. As in Table 3, the stocks with the largest average weights are number 9 (hp), 16 (morris) and 18 (schlum). The next two positions are occupied by 12 (jnj) and 14 (merck).

The true utility of each portfolio was evaluated by the empirical mean computed for a large sample: n = 10 . In Fig. 2(b), similarly to the left panels in Fig. 1, we see a large cluster of very good portfolios. However, the concentration is far from perfect. Let us mention also that the median ( ≈ . ) of the true utility is greater than the mean ( ≈ . ).

Figure 2. Relative power utility with α = 0 . . (a) Average weight of each stock in the empirically optimal portfolio over 200 realizations of the Black-Scholes model (6.2); (b) Histogram of the evaluated true utility for the same 200 optimal portfolios.

7. Conclusion

In this paper we studied generalization properties of the empirically optimal portfolios for the relative utility maximization problem. We obtained high-probability bounds for the estimation error and for the difference between the empirical and true utilities. Similar bounds were obtained for the portfolios produced by the stochastic exponentiated gradient algorithm. The only assumption imposed on the returns is the i.i.d. hypothesis. The obtained bounds depend only on the information available to the investor. We also performed some statistical experiments demonstrating risk and generalization properties of the empirically optimal portfolios. For a multidimensional problem we proposed the greedy doubly stochastic exponentiated gradient (GDSEG) algorithm.

Let us mention some topics for further study.
• In Theorems 1–3 we considered the case of relative utility functions. To obtain similar bounds for ordinary utilities, in general one needs to analyze the tails of the return distributions. In addition, the results of [6] should be useful for the analysis of this problem.
• The proposed GDSEG algorithm was sufficient for our purposes, but it requires a large amount of computation. It may be interesting to study this algorithm and its improvements in more detail.
• Using side information is an important method for the construction of successful portfolio strategies. The recent papers [3, 2] contain theoretical and practical ideas that can be employed to study this problem in the statistical learning framework.
References
[1] G.-Y. Ban, N. El Karoui, and A.E.B. Lim. Machine learning and portfolio optimization. Management Science, 64(3):1136–1154, 2018.
[2] T. Bazier-Matte and E. Delage. Generalization bounds for regularized portfolio selection with market side information. INFOR: Information Systems and Operational Research, 58(2):374–401, 2020.
[3] D. Bertsimas and N. Kallus. From predictive to prescriptive analytics. Management Science, 66(3):1025–1044, 2020.
[4] A. Borodin, R. El-Yaniv, and V. Gogan. On the competitive theory and practice of portfolio selection (extended abstract). In G.H. Gonnet and A. Viola, editors, LATIN 2000: Theoretical Informatics, pages 173–196, Berlin, Heidelberg, 2000. Springer.
[5] S. Boucheron, G. Lugosi, and P. Massart. Concentration inequalities: A nonasymptotic theory of independence. Oxford University Press, Oxford, 2013.
[6] C. Cortes, S. Greenberg, and M. Mohri. Relative deviation learning bounds and generalization with unbounded loss functions. Ann. Math. Artif. Intell., 85:45–70, 2019.
[7] T.M. Cover. Universal portfolios. Mathematical Finance, 1(1):1–29, 1991.
[8] V. DeMiguel, L. Garlappi, F.J. Nogales, and R. Uppal. A generalized approach to portfolio optimization: Improving performance by constraining portfolio norms. Management Science, 55(5):798–812, 2009.
[9] S. Ghosal and A. van der Vaart. Fundamentals of Nonparametric Bayesian Inference. Cambridge University Press, Cambridge, 2017.
[10] J. Gotoh and A. Takeda. On the role of norm constraints in portfolio selection. Comput. Manag. Sci., 8:323–353, 2011.
[11] J. Gotoh and A. Takeda. Minimizing loss probability bounds for portfolio selection. European Journal of Operational Research, 217(2):371–380, 2012.
[12] A. Gut. Probability: a graduate course. Springer, New York, 2013.
[13] L. Györfi, G. Ottucsák, and A. Urbán. Empirical log-optimal portfolio selections: a survey. In Machine learning for financial engineering, pages 81–118. World Scientific, 2012.
[14] L. Györfi, G. Ottucsák, and H. Walk. The growth optimal investment strategy is secure, too. In G. Consigli, D. Kuhn, and P. Brandimarte, editors, Optimal Financial Decision Making under Uncertainty, pages 201–223. Springer International Publishing, Cham, 2017.
[15] D.P. Helmbold, R.E. Schapire, Y. Singer, and M.K. Warmuth. On-line portfolio selection using multiplicative updates. Mathematical Finance, 8(4):325–347, 1998.
[16] J.-B. Hiriart-Urruty and C. Lemaréchal. Convex Analysis and Minimization Algorithms. Springer-Verlag, Berlin, 1993.
[17] S. Kim, R. Pasupathy, and S. G. Henderson. A guide to sample average approximation. In M.C. Fu, editor, Handbook of Simulation Optimization, pages 207–243. Springer, New York, 2015.
[18] J. Kivinen and M.K. Warmuth. Exponentiated gradient versus gradient descent for linear predictors. Information and computation, 132(1):1–63, 1997.
[19] D. Kuhn, P.M. Esfahani, V.A. Nguyen, and S. Shafieezadeh-Abadeh. Wasserstein distributionally robust optimization: Theory and applications in machine learning. In INFORMS TutORials in Operations Research, chapter 6, pages 130–166. 2019.
[20] M. Mohri, A. Rostamizadeh, and A. Talwalkar. Foundations of Machine Learning. The MIT Press, Cambridge, MA, 2018.
[21] A. Rakhlin, O. Shamir, and K. Sridharan. Making gradient descent optimal for strongly convex stochastic optimization. In Int. Conf. Mach. Learn., pages 449–456, 2012.
[22] S. Shalev-Shwartz. Online learning and online convex optimization. Foundations and Trends® in Machine Learning, 4(2):107–194, 2012.
[23] S. Shalev-Shwartz and S. Ben-David. Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press, New York, 2014.
[24] S. Shalev-Shwartz, O. Shamir, N. Srebro, and K. Sridharan. Learnability, stability and uniform convergence. Journal of Machine Learning Research, 11:2635–2670, 2010.
[25] A. Shapiro, D. Dentcheva, and A. Ruszczynski. Lectures on Stochastic Programming: Modeling and Theory, Second Edition. SIAM, Philadelphia, 2014.
[26] J.E. Smith and R.L. Winkler. The optimizer’s curse: Skepticism and postdecision surprise in decision analysis. Management Science, 52(3):311–322, 2006.
[27] R. van Handel. APC 550: Probability in high dimension. Lecture Notes. Princeton University, https://web.math.princeton.edu/ rvan/APC550.pdf, 2016.
[28] V. Vapnik. Statistical learning theory. Wiley, New York, 1998.
[29] M.J. Wainwright. High-dimensional statistics: A non-asymptotic viewpoint. Cambridge University Press, Cambridge, 2019.
I.I. Vorovich Institute of Mathematics, Mechanics and Computer Sciences and RegionalScientific and Educational Mathematical Center of Southern Federal University