Portfolio Performance Attribution via Shapley Value
Nicholas Moehle, Stephen Boyd, Andrew Ang

February 12, 2021
Abstract
We consider an investment process that includes a number of features, each of which can be active or inactive. Our goal is to attribute or decompose an achieved performance to each of these features, plus a baseline value. There are many ways to do this, which lead to potentially different attributions in any specific case. We argue that a specific attribution method due to Shapley is the preferred method, and discuss methods that can be used to compute this attribution exactly, or when that is not practical, approximately.
Performance of an investment process.
We consider an investment process guided by a portfolio manager, an automated process, or some combination. We are interested in some performance measure of the investment process over some past period, for example P&L, realized return, risk, tracking error, or turnover. Some of these performance measures we prefer to be large (e.g., P&L, realized return), and others we prefer to be small (e.g., risk, tracking error, turnover).
Features.
The process has a number of variations or features, each of which can be active (or on), or inactive (or off). As a simple example, consider an investment process that relies on daily portfolio rebalancing using an optimization method. The features might be a leverage limit, an ESG constraint that limits the securities that can be held, and the use of a novel return forecast developed by some researchers. (In many practical applications, we would have far more than just three features.) We assume that when the process was run over some time period, all the features were active. In our example above, this means that we ran the process with the leverage limit, the ESG constraint, and the forecasts.

Table 1: Example attribution of three performance measures to two features plus baseline. The table on the left shows the data, i.e., the result of backtests with different combinations of the two features. The table on the right shows an example attribution.
Attribution.
The attribution problem asks the question: How much of the performance should we attribute to each of these features, and how much to a baseline or benchmark value? In attribution we are dividing up the performance that was achieved into an amount associated with each feature, and a baseline value, which corresponds to what the performance would have been with all the features off. The amount attributed to a feature can be negative, which means that the feature reduced the performance value. A positive attribution means that the feature increased the performance value. We seek full attribution, which means that the sum of the amounts attributed to the features, plus the baseline value, equals the actual performance achieved.

Continuing our example above, suppose our investment process realizes a return of 8% over one year, with all three features active. An example attribution might be 1% to the leverage limit, -1% to the ESG constraint, 5% to the return forecast, and 3% to baseline performance. We interpret this as saying that the leverage limit was responsible for increasing our return by 1%, while our ESG restrictions depressed the return by 1%; our return forecast was responsible for 5% of the realized return, and 3% is attributed to baseline.

Attribution is closely tied to the idea of marginal performance gain, i.e., the change in performance when a feature is added. Unfortunately, and this is the crux of the problem, the marginal performance gain when adding a feature depends on which features have already been added. We can think of attribution as assigning a single marginal performance gain to a feature, independent of which other features are active.

We can carry out attributions for multiple performance objectives. For example, we can attribute realized risk, realized return, and realized turnover to our features and a baseline. (This is three separate attribution problems.) An example attribution is shown in table 1.

Attribution has many applications.
We can use it to allocate credit, for example to determine bonus payments. For features that are associated with an additional cost, we can assess the cost per unit of performance delivered. For example, if our return forecast in our example incurs a data source cost, we can compare that cost to the return attributed to it. We can use attribution to make changes moving forward; for example, we can consider dropping features that have a negative attribution to P&L. Attribution is a key method for explaining investment performance to clients, and is often required by law.
Challenges.
There are two main challenges in carrying out attribution. The first is that it involves hypothetical or what-if situations. While we directly observe the performance achieved with all features active, we really do not know what the performance would have been with some of the features off. This is addressed by using a high-fidelity simulator or backtester that can simulate what would have happened had some of the features been off. Of course our attribution can only be as accurate as this simulator.

The second main challenge is conceptual or mathematical. Except in a very simple case, which we discuss below, it is hard to exactly define what attribution is. As a result, many different attribution methods are used in practice, leading to different attributions in any specific case. In this paper we will argue that a specific type of attribution, called Shapley attribution, is the best choice. Unfortunately, with more than just a handful of features, computing the Shapley attribution exactly requires carrying out an impractically large number of simulations or backtests. Fortunately, there are methods for computing it approximately, described below, which work well in practice.
The simple case.
We describe here, informally, the one case in which attribution is simple. (We describe this case mathematically below.) Attribution is simple when the performance measure is additive, i.e., a sum of terms, each associated with a feature, plus a constant. Indeed, this particular form directly gives us the attribution. As a simple (and uninteresting) example, consider the total profit of a company over some period, with the features being the independent divisions of the company. Here the total profit is evidently the sum of the profits contributed by the divisions, plus some baseline profit for the company that is unrelated to the profits of the divisions. Roughly speaking, in this simple case there is no interaction among the features; we simply add up the contributions of the features, which directly gives us the attribution, with the baseline value being the difference between the achieved performance and the sum of that attributed to the features.
Interactions.
The challenge is when there are interactions among the features, in terms of how they affect the performance. As a simple example, suppose that feature one and feature two, taken individually, give no increase in performance, but when they are both active, give a substantial improvement in performance. In this case, how should we attribute the performance to these two features? Intuition suggests they should share the credit equally, i.e., have attributions equal to half the increase in performance. As a variation on this situation, suppose that features one and two are the same, so having them both on is the same as having either of them on. Here too intuition suggests they should share the credit.
Cooperative game theory.
Our recommended attribution method uses the Shapley value, an idea originating in cooperative game theory, which attempts to answer the question of how to allocate the earnings of a coalition to individual players [Sha53]. It is derived axiomatically by showing that the only attribution method that satisfies a set of desirable properties is the Shapley value. Because of these desirable properties, the Shapley value is widely considered to be a fair approach to allocating value [Mou04]. The Shapley value has seen a large number of extensions and variations; see [MS+02] for a summary. One important extension assumes certain coalitions of players are disallowed, which changes the resulting allocation [Hil18].

In general, the effort required to compute Shapley values is exponential in the number of players, which can be prohibitive when the number of players is large. In this case, the Shapley values can be approximated using Monte Carlo [CGT09], with error bounds given in Maleki et al. [Mal+13]. When the coalition value function has specific properties (e.g., submodularity), more efficient methods may exist. (See Liben-Nowell et al. [LN+12].)
Attribution in machine learning.
Recently, Shapley values have been used to interpret the output of machine learning prediction models, such as random forests and neural networks [LC01; ŠK14]. In this case, the model inputs (or features) are modeled as players in a coalition, and the resulting prediction performance is the value of each coalition. As a result, mature theory and software exists for approximately computing the Shapley value for games with many players [LL17].
Portfolio performance attribution.
Since the work of Jensen [Jen68], academics have sought to attribute returns of managers to skill (security selection) vs. exposure to rewarded risk factors, like the market portfolio [FF10; Sha92]. Many approaches for performance attribution exist. The simplest methods break up portfolio return into components, which sum to the portfolio return [BHB86; BSB91]. Other standard approaches use the correlation between portfolio weights and returns [GK00; Gri06; Lo07]. These methods are scalable, and are often used to attribute performance to a large number of predictive signals.

Many of these standard approaches use time series of returns or cross-sectional position holdings, which result in unattributed value (or 'residuals'); one advantage of the Shapley value is full attribution, which means the attributions of individual features and the baseline sum to the total portfolio return. In our exposition, we explicitly contrast Shapley attribution to these more widely used techniques. While Shapley values have been applied to risk decompositions [Ort16; MT08; TBT10; BSV18], to our knowledge, ours is the first application of the Shapley value to the general portfolio performance attribution problem, and more generally to any statistic that is produced by an investment process.
In this section we fix our notation and describe our model.
Investment process.
We assume there is an investment process which produces a dynamic (time-varying) portfolio allocation over some time window. The investment process may depend on market conditions, the prior portfolio holdings, the decisions of analysts or portfolio managers, and portfolio optimization techniques, to update the portfolio holdings over time. In practice, the investment process may be very complicated, and we intentionally leave the details unspecified.
Features and configuration.
The investment process has n features that can be (or could have been) included or excluded, i.e., active or inactive. These features might represent the choice of a specific benchmark, choice of sector exposures or asset allocation, or the contributions of a specific analyst or signal. The inclusion or exclusion of feature i is denoted x_i ∈ {0, 1}, with x_i = 0 meaning that the feature is inactive, and x_i = 1 meaning the feature is active. The collection of these feature status values is called a configuration of the investment process, and is represented by the Boolean vector x = (x_1, ..., x_n). For example, x = (1, 0, 1, 1, 0) means that features 1, 3, and 4 are active, while features 2 and 5 are inactive.

We observe that there are 2^n possible configurations, which grows rapidly with the number of features n. For n = 10, there are around 1000 possible configurations; for n = 30, the number is around 10^9.

The full and zero configurations.

The configuration x = (1, 1, ..., 1) = 1 (the vector of all ones) means that all features are active. We refer to this as the full configuration or fully featured configuration. We will assume that the full configuration is the one that was actually used; the other 2^n − 1 configurations are hypothetical. The configuration x = (0, 0, ..., 0) = 0 is called the zero configuration or the baseline configuration or the benchmark configuration. It corresponds to the investment process with all features inactive. In some cases it can be interpreted as investing in a benchmark portfolio.
Performance metric.
This investment process is evaluated using a real-valued performance metric y ∈ R. (In practice, portfolios are evaluated using many metrics, which can be considered separately.) Examples of performance metrics include the portfolio's return, risk, risk-adjusted return, turnover, or average exposure to a particular risk factor, over some investment period. (These can be ex-ante values, evaluated using a contemporaneous model; or realized, ex-post values, evaluated using the actual data.) Note that large values of y can be good (as in return), or bad (as in risk).

We assume that when the full configuration x = 1 was used, the resulting realized performance was y^real, which can be directly observed. In cases when the baseline or zero configuration can be interpreted as investing in a benchmark portfolio, the performance value for this too can be directly observed.

Simulation and backtesting.
We use simulation to judge the performance using other, hypothetical configurations. This typically has the form of a backtesting engine, which can evaluate the performance under the hypothetical configurations. This process is represented by a function f : {0, 1}^n → R, with

    y = f(x) = f(x_1, ..., x_n).

Evaluating the function f requires running a backtest of the investment process under configuration x, and recording the performance y. We assume that the backtests are calibrated so that y^real = f(1), i.e., the backtest simulation result for the configuration we used agrees with the performance we actually observed.

Lift or marginal contribution.
We now introduce a natural concept in attribution, which is the change in performance when we add one feature. Suppose the configuration is x, with x_i = 0, i.e., feature i is inactive. Then x̃ = x + e_i (where e_i is the i-th unit vector) is the configuration obtained by turning feature i on. We define the lift or marginal contribution as the change in performance obtained by adding feature i, i.e.,

    f(x + e_i) − f(x).

This marginal contribution depends on the particular configuration x. In other words, the lift associated with adding a feature depends on which other features are active.

Geometric interpretation.
We can associate the 2^n different possible configurations with the corners of a unit (hyper)cube in R^n. We can create a directed graph with configurations as vertices, by having an edge from configuration x to configuration x̃ if x̃ = x + e_i for some i. In words, an edge goes from one configuration to another that is obtained by adding one feature. We note for future use that we can associate with an edge from x to x̃ = x + e_i a marginal performance change f(x̃) − f(x) = f(x + e_i) − f(x). The number of edges is n 2^(n−1).

An example with n = 3 features is depicted in figure 1. The point (0, 0, 0) represents the baseline configuration, and the point (1, 1, 1) represents the fully featured configuration. The edge from (0, 1, 0) to (1, 1, 0) corresponds to adding feature 1 to the configuration with only feature 2 active. There are a total of 12 edges.

Figure 1:
Visualization of configurations as vertices of a hypercube, for n = 3. The vertices, shown as dots, correspond to configurations. The edges correspond to adding one feature to a configuration.

We would like to attribute the realized performance y^real to the n features, i.e., to determine how much of the performance resulted from each feature. An attribution method determines real values a_1, ..., a_n and b, where a_i is the amount attributed to feature i, and b is the baseline amount. The attribution and baseline amounts can be positive or negative. We will represent the attribution using a vector a = (a_1, ..., a_n) and scalar b.

The attribution is derived from the feature performance function f, i.e., its values for the 2^n different configurations. An attribution method is an algorithm or method that determines the attribution based on evaluating f for some, or possibly all, configurations. We now describe several desirable properties of an attribution method.
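As a concrete illustration of configurations, the performance function f, and lifts, consider the following minimal sketch. The performance values are made up for illustration; in practice each entry of the table would be produced by a backtest.

```python
# Toy performance function f for n = 2 features, stored as a lookup table
# over all 2^n configurations. The numbers are hypothetical; in practice
# each value would come from a backtest of the investment process.
f = {
    (0, 0): 3.0,  # baseline (zero configuration)
    (1, 0): 4.0,  # feature 1 only
    (0, 1): 2.0,  # feature 2 only
    (1, 1): 8.0,  # full configuration (the one actually run)
}

def lift(f, x, i):
    """Marginal contribution of adding feature i to configuration x."""
    assert x[i] == 0, "feature i must be inactive in x"
    x_plus = tuple(1 if j == i else v for j, v in enumerate(x))
    return f[x_plus] - f[x]

# The lift of a feature depends on which other features are active:
lift_alone = lift(f, (0, 0), 0)   # 4.0 - 3.0 = 1.0
lift_with_2 = lift(f, (0, 1), 0)  # 8.0 - 2.0 = 6.0
```

Here feature 1's lift is 1.0 from the baseline but 6.0 when feature 2 is already active; this kind of interaction is exactly what makes attribution nontrivial.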
Full attribution.
We require full attribution, which means that

    y^real = f(1) = a_1 + · · · + a_n + b = 1^T a + b.

This means that the observed performance measure f(1) is fully attributed to the n features, plus the baseline. Even though full attribution is a crucial property of a good attribution method, we will see that commonly used attribution methods do not have it.

Correct baseline value.
We say that an attribution has the correct baseline value if f(0) = b. When the zero configuration corresponds to investing in a benchmark portfolio, this means that b matches the performance of the benchmark portfolio.

Fairness.
We call an attribution method fair if, for any permutation of the features, the attributions are permuted the same way. This property implies that if two features are the same, i.e., they have the same effect on performance, then their attributions must be the same.
Monotonicity.
This means that if we change f in such a way that one feature's marginal contribution does not decrease (no matter which features are already active), then the attribution to this feature also does not decrease.

Additive performance.

We say the performance is additive if f has the form

    f(x) = a^T x + b = b + Σ_{i : x_i = 1} a_i,

for some vector a and scalar b. (If we consider x_i to be real numbers, and not just 0 or 1 as we do here, this corresponds to f being an affine function.) When f is additive, the baseline performance is b, and the marginal increase in the performance when adding feature i is always a_i, independent of what other features are already active.

For an additive function, a and b directly give an attribution which satisfies all four of the desiderata listed above: full attribution, correct baseline, fairness, and monotonicity. The case of additive performance is the easy, or even trivial, case for attribution. It is when f is not additive that it becomes more difficult to assign an attribution.

In this section, we describe several attribution methods, concluding with the Shapley attribution method, which we recommend.
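In the additive case the attribution can simply be read off. A small sketch, with made-up values of a and b:

```python
# Additive performance: f(x) = b + sum over active features of a_i.
# The values of a and b here are hypothetical, for illustration.
a = [1.5, -0.5, 2.0]
b = 3.0

def f_additive(x):
    return b + sum(ai for ai, xi in zip(a, x) if xi == 1)

# The lift of feature 1 is a[0] = 1.5 regardless of the configuration:
lift_from_baseline = f_additive((1, 0, 0)) - f_additive((0, 0, 0))
lift_all_others_on = f_additive((1, 1, 1)) - f_additive((0, 1, 1))

# Full attribution holds by construction: f(1) = a_1 + ... + a_n + b.
y_real = f_additive((1, 1, 1))  # 3.0 + 1.5 - 0.5 + 2.0 = 6.0
```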
One-at-a-time attribution.

We take

    b = f(0),   a_i = f(e_i) − f(0),   i = 1, ..., n.

In other words, we carry out a baseline simulation (if needed) and n additional simulations, each with exactly one feature enabled. We attribute to each feature the increase in performance when it is added to the baseline configuration, i.e., its lift or marginal performance increase from the baseline configuration x = 0. This method is natural, and requires carrying out only n (or n + 1 if we include the baseline) simulations.

One-at-a-time attribution satisfies correct baseline value, fairness, and monotonicity. However, it can (and often does) fail to satisfy full attribution. To see this, consider the simple example with n = 2 and

    f(0) = 0,   f(e_1) = 1,   f(e_2) = 1,   f(1) = 1.   (1)

In this example the presence of either feature 1 or feature 2 gives the full performance value 1. One-at-a-time attribution for this example is b = 0, a_1 = 1, a_2 = 1, so b + a_1 + a_2 = 2, whereas f(1) = 1. Roughly speaking, one-at-a-time attribution over-allocates performance to the features in this example.

We note that one-at-a-time attribution coincides with the attribution described above for additive f. Here, however, the same formula is being applied to any f, not just additive f.

Leave-one-out attribution.

The leave-one-out attribution method is closely related to one-at-a-time attribution. As in one-at-a-time attribution, we set b = f(0). We then carry out n simulations, with x = 1 − e_i, i = 1, ..., n. In other words, for each feature we simulate the performance when it is left out, but all other features are present. We set

    a_i = f(1) − f(1 − e_i),   i = 1, ..., n,

which is the marginal performance increase when adding feature i when all other features are active. Like one-at-a-time attribution, leave-one-out attribution requires carrying out n simulations, plus a baseline simulation. Like one-at-a-time, leave-one-out attribution satisfies correct baseline value, fairness, and monotonicity, but it can fail to achieve full attribution.
To see this, consider the same example described above in (1). The leave-one-out attribution for this example is b = 0, a_1 = 0, and a_2 = 0, i.e., it allocates zero performance to each feature. Roughly speaking, it under-allocates performance in this example.

Sequential attribution.

Another commonly used method is sequential attribution. We start by evaluating the baseline configuration performance b = f(0). We then simulate the configuration x = e_1, i.e., we add the first feature. We continue adding features until we have all features active. We take

    a_i = f(e_1 + · · · + e_i) − f(e_1 + · · · + e_{i−1}),   i = 1, ..., n,

which is the lift of feature i, when the features 1, ..., i − 1 are active. This requires n simulations, plus a baseline simulation.

Sequential attribution satisfies full attribution, since

    b + a_1 + · · · + a_n = f(0) + (f(e_1) − f(0)) + (f(e_1 + e_2) − f(e_1)) + · · · + (f(1) − f(1 − e_n)) = f(1).

It also satisfies correct baseline and monotonicity. But sequential attribution does not satisfy fairness, since the attribution obtained depends on the order in which the features are added. The same example above given in (1) illustrates this. Sequential attribution gives b = 0, a_1 = 1, and a_2 = 0; that is, the first feature gets attributed the full performance and the second gets none. If we swap the two features, we assign the full performance to feature two.

Sequential attribution can also be described in a different, reverse form, sometimes called off-the-top attribution. We first evaluate f(1 − e_n), the performance when feature n is turned off, with all others on. We set a_n = f(1) − f(1 − e_n), the marginal increase in adding feature n when all others are active. We then evaluate f(1 − e_n − e_{n−1}), the performance when features n and n − 1 are both turned off, and set

    a_{n−1} = f(1 − e_n) − f(1 − e_n − e_{n−1}),

the marginal performance when we add feature n − 1, when features 1, ..., n − 2 are active. We continue in this way until we reach a_1 = f(e_1) − f(0).
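The three simple methods above can be sketched and compared on the example in (1):

```python
# One-at-a-time, leave-one-out, and sequential attribution, applied to the
# example in (1): f(0,0) = 0, f(1,0) = 1, f(0,1) = 1, f(1,1) = 1.
f = {(0, 0): 0.0, (1, 0): 1.0, (0, 1): 1.0, (1, 1): 1.0}
n = 2

def unit(i):       # configuration e_i, with only feature i active
    return tuple(1 if j == i else 0 for j in range(n))

def ones_minus(i): # configuration 1 - e_i, with only feature i inactive
    return tuple(0 if j == i else 1 for j in range(n))

b = f[(0,) * n]

a_oat = [f[unit(i)] - b for i in range(n)]                  # one-at-a-time
a_loo = [f[(1,) * n] - f[ones_minus(i)] for i in range(n)]  # leave-one-out

a_seq = []  # sequential, adding features in the order 1, 2
x = [0] * n
for i in range(n):
    prev = f[tuple(x)]
    x[i] = 1
    a_seq.append(f[tuple(x)] - prev)

# a_oat == [1.0, 1.0]: over-allocates (b + 1 + 1 = 2, but f(1,1) = 1)
# a_loo == [0.0, 0.0]: under-allocates
# a_seq == [1.0, 0.0]: full attribution, but depends on the feature order
```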
Geometric interpretation.

We can give a nice geometric interpretation of sequential attribution. We start at node or vertex x = 0 and follow a specific directed path to node e_1, then e_1 + e_2, and so on, ending at the full configuration x = 1. The attribution to feature i is the marginal increase associated with the edge in which feature i is added. An example with n = 3 is shown (in red) in figure 2.

Permuted sequential attribution.

Here we describe a simple extension of sequential attribution that will come up in the sequel. Let π = (k_1, ..., k_n) be a permutation of (1, ..., n), which means that each integer from 1 to n appears as one of the k_i. Define x̃ as x̃_i = x_{k_i}, i = 1, ..., n. This is the configuration vector when we permute the features using π. We define the permuted performance function as f̃(x̃) = f(x).

Figure 2:
Two permutations for sequential attribution. The red path corresponds to the standard permutation (1, 2, 3): (0, 0, 0) → (1, 0, 0) → (1, 1, 0) → (1, 1, 1). The blue path corresponds to the permutation (2, 3, 1): (0, 0, 0) → (0, 1, 0) → (0, 1, 1) → (1, 1, 1).

Permuted sequential attribution permutes the original features to obtain f̃, then uses sequential attribution on f̃, and finally permutes the resulting attribution ã and b̃ back to the original ordering. In sequential attribution, we use the marginal performance contribution when the features are added one by one, in order. Permuted sequential attribution is the same, except that we add the features in the order (k_1, k_2, ..., k_n).

Permuted sequential attribution satisfies full attribution, correct baseline, and monotonicity, but not fairness. Indeed, fairness would require that the attribution obtained is the same for any permutation π. (This is the case if and only if f is additive.)

Geometric interpretation.
We can associate a permutation π with a directed path from 0 to 1 on the vertices of the hypercube, and vice versa, since any such path corresponds to a permutation. We allocate to each feature the marginal change in performance along the edge in which feature i is added. An example with n = 3 is shown (in blue) in figure 2.

Shapley attribution.

Finally we come to the attribution method we endorse, the Shapley method. The Shapley attribution is simply the average of the permuted sequential attributions over all n! permutations. Formally, let a^π and b denote the attribution for permuted sequential attribution with permutation π. (The value of the baseline attribution b does not depend on the permutation.) The Shapley attribution is

    a = (1/n!) Σ_π a^π,   (2)

Method          Full attr.   Baseline   Fairness   Monotonicity   Simulations
One-at-a-time   no           yes        yes        yes            n + 1
Leave-one-out   no           yes        yes        yes            n + 1
Sequential      yes          yes        no         yes            n + 1
Permuted seq.   yes          yes        no         yes            n + 1
Shapley         yes          yes        yes        yes            2^n

Table 2:
Properties of attribution methods. The righthand column gives the number of simulations required to compute the attribution.

where the sum is over all n! permutations. This method satisfies all the desiderata: full attribution, correct baseline, fairness, and monotonicity. Indeed, it has been shown that any attribution method that satisfies these four desiderata must coincide with the Shapley attribution [You85].

The bad news is that evaluating the Shapley attribution requires 2^n simulations, which for n larger than 10 or so is likely to be impractical. This is contrasted with the one-at-a-time, leave-one-out, sequential, and permuted sequential attribution methods, which require only n + 1 simulations. We will address this computational complexity issue in more detail below. We summarize the properties of the different attribution methods in table 2.

Simple example.
Consider the simple example given in (1). There are only 2! = 2 permutations. For π = (1, 2), permuted sequential attribution gives b = 0, a_1 = 1, and a_2 = 0; for π = (2, 1), it gives b = 0, a_1 = 0, and a_2 = 1. The Shapley attribution for this example is

    b = 0,   a_1 = 1/2,   a_2 = 1/2.

Roughly speaking, in this example, features one and two are the same; the presence of either alone gives the full performance. Permuted sequential attribution gives all the credit to the first feature in the sequence, and none to the second. The Shapley attribution averages over the two cases, and splits the credit between the two features equally.
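A direct implementation of equation (2), enumerating all n! permutations (practical only for small n), reproduces this:

```python
from itertools import permutations
from math import factorial

def shapley(f, n):
    """Exact Shapley attribution: the average of the permuted sequential
    attributions over all n! permutations, per equation (2)."""
    a = [0.0] * n
    for perm in permutations(range(n)):
        x = [0] * n
        for i in perm:  # add features in the order given by perm
            prev = f[tuple(x)]
            x[i] = 1
            a[i] += f[tuple(x)] - prev
    return [ai / factorial(n) for ai in a], f[(0,) * n]

# Example (1): either feature alone already gives the full performance.
f = {(0, 0): 0.0, (1, 0): 1.0, (0, 1): 1.0, (1, 1): 1.0}
a, b = shapley(f, 2)  # a == [0.5, 0.5], b == 0.0
```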
Geometric interpretation.
The Shapley attribution for feature i is the average of the marginal performance change when feature i is added, over all directed paths from 0 to 1.

The n = 3 case.

We work out the general Shapley attribution for the case with n = 3. There are 2^n = 8 configurations, and 3! = 6 sequences. We derive formulas for a_1 here; attributions to other features have similar formulas. First we list the n! sequences and the associated marginal performance change for feature 1.

Figure 3:
All green edges correspond to adding the first feature to a configuration without it.
Permutation    Marginal performance change for feature 1
(1, 2, 3)      f(1, 0, 0) − f(0, 0, 0)
(1, 3, 2)      f(1, 0, 0) − f(0, 0, 0)
(2, 1, 3)      f(1, 1, 0) − f(0, 1, 0)
(2, 3, 1)      f(1, 1, 1) − f(0, 1, 1)
(3, 1, 2)      f(1, 0, 1) − f(0, 0, 1)
(3, 2, 1)      f(1, 1, 1) − f(0, 1, 1)

Collecting the terms for a_1 in terms of distinct edges or marginal changes, we have

    a_1 = (2/6)(f(1, 0, 0) − f(0, 0, 0)) + (1/6)(f(1, 1, 0) − f(0, 1, 0))
        + (1/6)(f(1, 0, 1) − f(0, 0, 1)) + (2/6)(f(1, 1, 1) − f(0, 1, 1)).

The numerators 1 and 2 in each term correspond to the number of paths that include that edge. For example, there is only one path or permutation that includes the edge from (0, 1, 0) to (1, 1, 0), namely, the one associated with the permutation (2, 1, 3).

Computing Shapley attributions
In this section we focus on methods to compute the Shapley attribution exactly, or when that is not practical, approximately. We focus on general methods that work without any assumptions about f.

Exact computation.

Computing the Shapley attribution directly using equation (2) requires taking the average over all n! permuted sequential attributions. In these sequential attributions, we end up evaluating f(x) for the same value of x multiple times. To evaluate the Shapley attribution somewhat more efficiently, we use an alternative formula for the Shapley attribution, which sets the baseline value b = f(0) and the attribution to feature i as

    a_i = Σ_{x ∈ X_i} ((1^T x)! (n − 1^T x − 1)! / n!) (f(x + e_i) − f(x)).   (3)

Here X_i is the set of configurations with feature i off, i.e., X_i = {x | x_i = 0}. Using this formula, the Shapley attribution can be computed directly from the values of f at all 2^n configurations.

We note for future use that the coefficients in the sum in (3) sum to one, so they define a probability distribution on the set of configurations with feature i off. The formula shows that the i-th Shapley attribution is the expected value or weighted average of the lift obtained by adding feature i.

Monte Carlo approximation.

Computing the Shapley attribution requires 2^n simulations, which can be prohibitive when n is large, even just a few tens. In this case, we recommend approximating the Shapley attributions using Monte Carlo sampling methods. We can sample either sequences or lifts, using the two formulas (2) and (3) respectively, each of which expresses the Shapley attributions as an expectation. The idea of sampling sequences has been proposed in Castro, Gómez, and Tejada [CGT09], but to our knowledge the method based on sampling lifts has not appeared in the literature.
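Equation (3) can be implemented directly, by enumerating, for each feature i, all configurations with feature i off. A sketch:

```python
from itertools import product
from math import factorial

def shapley_from_subsets(f, n):
    """Shapley attribution via equation (3): for each feature i, a weighted
    average of lifts over all configurations x with x_i = 0."""
    a = [0.0] * n
    for i in range(n):
        for x in product((0, 1), repeat=n):
            if x[i] == 1:
                continue  # only configurations with feature i off
            k = sum(x)    # number of active features, 1^T x
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            x_plus = tuple(1 if j == i else v for j, v in enumerate(x))
            a[i] += weight * (f[x_plus] - f[x])
    return a

# Agrees with the permutation average on the example in (1):
f = {(0, 0): 0.0, (1, 0): 1.0, (0, 1): 1.0, (1, 1): 1.0}
a = shapley_from_subsets(f, 2)  # [0.5, 0.5]
```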
Sampling sequences.

In this method, we use the sum in definition (2) as a basis for Monte Carlo sampling. We sample N permutations of the features, with replacement, and average the permuted sequential attributions corresponding to each. Computing each permuted sequential attribution requires n + 1 simulations. (We can reduce the number of simulations required a bit by caching previously computed values of f(x), and using these when f(x) is needed again.)

The expected value of the attributions corresponding to this method are the Shapley attributions. The Monte Carlo attributions satisfy the full attribution property (since each permuted sequential attribution does). The attributions satisfy fairness approximately, or in expectation.
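A sketch of sampling sequences; the performance function here is a stub lookup table standing in for a real backtesting engine.

```python
import random

def shapley_sample_sequences(f, n, num_samples, seed=0):
    """Approximate Shapley attribution by averaging the permuted sequential
    attributions of randomly sampled permutations (with replacement)."""
    rng = random.Random(seed)
    a = [0.0] * n
    for _ in range(num_samples):
        perm = list(range(n))
        rng.shuffle(perm)  # a uniformly random permutation
        x = [0] * n
        for i in perm:
            prev = f(tuple(x))
            x[i] = 1
            a[i] += f(tuple(x)) - prev
    return [ai / num_samples for ai in a]

# Stub "backtester" for the example in (1).
table = {(0, 0): 0.0, (1, 0): 1.0, (0, 1): 1.0, (1, 1): 1.0}
def f(x):
    return table[x]

a_hat = shapley_sample_sequences(f, 2, num_samples=200)
b = f((0, 0))
# Full attribution holds for every sampled sequence, hence for the average:
full_attr_gap = abs(b + sum(a_hat) - f((1, 1)))
```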
Sampling lifts.

In this method, we use equation (3) as the basis for Monte Carlo sampling. As noted above, the sum in (3) can be expressed as

    a_i = E( f(x + e_i) − f(x) ),   (4)

where the configuration x is a random variable supported on X_i with probability distribution

    Prob(x = x′) = (1^T x′)! (n − 1^T x′ − 1)! / n!.   (5)

To approximate the Shapley attribution of feature i, we first compute b = f(0). We then sample configurations from X_i with distribution (5), and compute the lift of adding feature i to each sampled configuration. The approximate Shapley attribution of feature i is the average of the lifts obtained. This process is then repeated for each feature. (To sample from distribution (5), first sample the number of active features 1^T x, which takes each value k ∈ {0, ..., n − 1} with probability p_k = C(n − 1, k) k! (n − k − 1)! / n! = 1/n. Then randomly sample 1^T x of the n − 1 features other than i to be active.)

The advantage of this method is that it tends to produce better approximations with fewer simulations than sampling sequences, because it samples more frequently the terms in the sum (3) with larger coefficients, therefore forming a more precise approximation of the sum quickly. (This is a form of importance sampling.) Unfortunately, these approximate Shapley attributions satisfy full attribution only in the limit as the number of samples grows, or in expectation. This can be remedied by scaling the approximate Shapley attributions so that full attribution holds, even for a finite number of samples.
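A sketch of sampling lifts. It relies on the observation that, under distribution (5), the number of active features is uniform on {0, ..., n − 1} (the C(n − 1, k) configurations at each level have total probability 1/n); again the performance function is a stub lookup table.

```python
import random

def shapley_sample_lifts(f, n, num_samples, seed=0):
    """Approximate each Shapley attribution by averaging sampled lifts,
    with configurations drawn from distribution (5)."""
    rng = random.Random(seed)
    a = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        total = 0.0
        for _ in range(num_samples):
            # Under (5), the number of active features is uniform on 0..n-1.
            k = rng.randrange(n)
            x = [0] * n
            for j in rng.sample(others, k):  # choose k of the other features
                x[j] = 1
            x_plus = list(x)
            x_plus[i] = 1
            total += f(tuple(x_plus)) - f(tuple(x))
        a.append(total / num_samples)
    return a

table = {(0, 0): 0.0, (1, 0): 1.0, (0, 1): 1.0, (1, 1): 1.0}
def f(x):
    return table[x]

b = f((0, 0))
a_hat = shapley_sample_lifts(f, 2, num_samples=500)
# Rescale so that full attribution holds exactly:
scale = (f((1, 1)) - b) / sum(a_hat)
a_scaled = [ai * scale for ai in a_hat]
```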
Caching simulations.

For both sampling methods, some configurations may appear repeatedly. It is therefore useful to cache the values of configurations, so they can be re-used in future sampled sequences. If we are asked to evaluate f for an x that has already been evaluated, we simply use the already computed value.

Numerical example.

We now demonstrate the approximation techniques for a simple numerical example in which the metric is the convex quadratic function

    f(x) = x^T P x,

where P is symmetric positive semidefinite. It can be shown that its Shapley attribution is b = 0 and a = P1 (where 1 is the vector of all ones).
Figure 4:
Relative error of the approximate Shapley attributions as a function of the number of unique configurations evaluated, when sampling sequences (blue), sampling lifts (green), and when sampling lifts and scaling so that full attribution always holds (red).
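Both approximation schemes compared in figure 4 can be sketched in pure Python; this is a minimal illustration (function names are ours, not the authors' code), for a metric f defined on 0/1 tuples:

```python
import random

def sample_lifts(f, n, samples_per_feature=1000, seed=0, scale=True):
    """Approximate Shapley attribution by sampling lifts, i.e.,
    importance sampling of the terms of the sum (3)."""
    rng = random.Random(seed)
    b = f((0,) * n)
    a = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        total = 0.0
        for _ in range(samples_per_feature):
            k = rng.randrange(n)        # 1^T x is uniform on {0,...,n-1}
            x = [0] * n
            for j in rng.sample(others, k):
                x[j] = 1
            xe = list(x)
            xe[i] = 1
            total += f(tuple(xe)) - f(tuple(x))  # lift of feature i at x
        a[i] = total / samples_per_feature
    if scale:
        s = sum(a)
        if s != 0:                      # rescale so full attribution holds
            gap = f((1,) * n) - b
            a = [ai * gap / s for ai in a]
    return b, a

def sample_sequences(f, n, num_sequences=1000, seed=0):
    """Approximate Shapley attribution by sampling random feature
    sequences; the lifts along each sequence telescope to f(1) - f(0),
    so full attribution holds after every completed sequence."""
    rng = random.Random(seed)
    b = f((0,) * n)
    a = [0.0] * n
    for _ in range(num_sequences):
        order = list(range(n))
        rng.shuffle(order)
        x, prev = [0] * n, b
        for i in order:
            x[i] = 1
            cur = f(tuple(x))
            a[i] += cur - prev
            prev = cur
    return b, [ai / num_sequences for ai in a]
```

Both samplers treat f as a black box, so the caching described above can be added by simply memoizing f (e.g., with functools.lru_cache on the tuple argument).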
We generate P randomly as P = Z^T Z, where the entries of Z are independently drawn from a standard normal distribution. (Thus, P has a Wishart distribution with n degrees of freedom and scale matrix I.) We approximate the Shapley attribution using the two methods given above, sampling sequences and sampling lifts, and compare their accuracy as a function of the number of unique configurations x at which we evaluate f(x). (This gives the number of configurations for which we evaluate f, using caching as described above.) When sampling lifts, we consider two versions, the basic (unscaled) version and the version in which we scale so that full attribution holds. We compare the results using the relative error

    e_rel = ||â − a|| / ||a||,

where a = P1 is the true Shapley attribution, and â is the sampling-based estimate.

The results are shown in figure 4, for a problem instance with n = 10. When sampling sequences, there is no estimate (and therefore no error) until one entire sequence has been evaluated; similarly, when sampling lifts, there is no estimate or error until all features have at least one lift sampled. Both methods converge to the true values once all 2^10 = 1024 configurations have been evaluated. We see that for this example, sampling lifts obtains a lower error than sampling sequences, regardless of the number of configurations evaluated. We also note that when sampling lifts, scaling the approximate Shapley values decreases accuracy, but only slightly. The results of this particular problem instance are typical of many others we have evaluated.

             Benchmark   Country alloc.   Stock sel.   Full portfolio
    x        (0,0)       (1,0)            (0,1)        (1,1)
    f(x)     6.4         5.2              9.4          8.3

Table 3:
Data for the simple returns-based attribution example.
Returns-based attribution. Here we consider a simple return attribution example from Bacon [Bac08]. The metric f(x) is the portfolio return over some time period, expressed in percent. We have n = 2 features: feature 1 is the country allocation decision (the decision of how much to invest in which country) and feature 2 is the stock selection decision (i.e., the decision of which individual stocks to hold within these countries). (When feature 1 is not active, we invest in each country in proportion to its benchmark weight. When feature 2 is not active, then within each country we invest in each security in proportion to its benchmark weight.) Table 3 shows example data for a single year. We revisit this example in appendix A, where we further decompose returns by country.

We now discuss, in detail, how to apply the one-at-a-time, sequential, and Shapley attribution methods to this example, with a geometric interpretation given in figure 5. We will see that the one-at-a-time and sequential methods recover classical attribution methods known in the literature.

One-at-a-time attribution.
One-at-a-time attribution chooses

    b = f(0,0),  a_1 = f(1,0) − f(0,0),  a_2 = f(0,1) − f(0,0).

For this example, one-at-a-time attribution is exactly the classical Brinson–Hood–Beebower method [BHB86]. As discussed in section 3.1, this method does not have full attribution; in fact, the unattributed value is

    f(1,1) − f(1,0) − f(0,1) + f(0,0).

Sequential attribution.
Sequential attribution chooses

    b = f(0,0),  a_1 = f(1,0) − f(0,0),  a_2 = f(1,1) − f(1,0).

This method coincides with the modified Brinson–Hood–Beebower method given in Bacon [Bac08], which justifies choosing feature 1 first in the sequence because the sector allocation decision is often made before security selection decisions. This method eliminates the unattributed value, but violates the fairness property by prioritizing the country allocation decision over the stock selection decision during attribution.

Figure 5:
Geometric interpretation of the returns-based attribution example. The one-at-a-time method attributes based on the lifts f(1,0) − f(0,0) and f(0,1) − f(0,0); the sequential method attributes based on f(1,0) − f(0,0) and f(1,1) − f(1,0).

Shapley attribution.
Shapley attribution chooses b = f(0,0) and

    a_1 = (1/2)( f(1,1) − f(0,1) + f(1,0) − f(0,0) ),
    a_2 = (1/2)( f(1,1) − f(1,0) + f(0,1) − f(0,0) ).

This is the average of the sequential attributions produced by the two sequences (0,0), (1,0), (1,1) and (0,0), (0,1), (1,1); the second sequence corresponds to making the stock selection decision before the sector allocation decision.

Results.
Table 4 shows the results of applying all three attribution methods. As expected, the one-at-a-time method has a non-zero unattributed return. We can see that in the sequential method, this unattributed return is entirely allocated to stock selection. In the Shapley attribution method, the unattributed term is allocated half to the country allocation and half to stock selection. Like the sequential method, Shapley attribution has no unattributed component. However, unlike the sequential method, it treats country allocation and stock selection equally, instead of prioritizing country allocation over stock selection. In this simple and small example, the differences in attribution across the different methods are not very significant. In the next section, we will see an example where Shapley attribution is a substantial improvement over competing methods.

                    Benchmark   Country alloc.   Stock sel.   Unattributed
                    b           a_1              a_2          y − a_1 − a_2 − b
    One at a time   6.4         −1.2             3.0          0.1
    Sequential      6.4         −1.2             3.1          0
    Shapley         6.4         −1.15            3.05         0
Table 4:
Attribution results for the simple returns-based attribution example.
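The attributions in table 4 can be reproduced directly from the four configuration values of table 3; a short sketch (values transcribed from the example data):

```python
# f[(x1, x2)]: portfolio return (percent) for each configuration, from
# table 3; x1 = country allocation active, x2 = stock selection active.
f = {(0, 0): 6.4, (1, 0): 5.2, (0, 1): 9.4, (1, 1): 8.3}
b = f[(0, 0)]

# One at a time (classical Brinson-Hood-Beebower): lifts over the baseline.
oat = (f[(1, 0)] - b, f[(0, 1)] - b)
unattributed = f[(1, 1)] - b - oat[0] - oat[1]

# Sequential, with country allocation (feature 1) taken first.
seq = (f[(1, 0)] - b, f[(1, 1)] - f[(1, 0)])

# Shapley: average of the two possible sequential attributions.
shap = (0.5 * (f[(1, 1)] - f[(0, 1)] + f[(1, 0)] - b),
        0.5 * (f[(1, 1)] - f[(1, 0)] + f[(0, 1)] - b))
```

Note that only the Shapley and sequential attributions satisfy b + a_1 + a_2 = f(1,1) exactly; the one-at-a-time method leaves the unattributed remainder.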
Tax-aware portfolio management. Here we give an example of attribution of multiple performance metrics for a tax-aware portfolio management process. To avoid the wash-sale rule (in which certain capital losses are disallowed), rebalance trades are carried out monthly.
Metrics.
We focus on four performance metrics: realized post-tax return, ex-ante risk, realized capital gains, and portfolio turnover. The return, risk, and turnover are annualized. The realized capital gains are reported in dollars over the five-year simulation.
Trading strategy.
We simulate an investment strategy based on Markowitz portfolio optimization. In this case, the features correspond to different terms in the optimization problem that can be on or off. More specifically, given the configuration x with n = 7 features, we determine the trade list by solving the optimization problem

    maximize    Σ_{i=1}^{5} x_i h^T α^(i) − γ σ(h)^2 − x_6 ℓ(h − h_0) − s^T |h − h_0|
    subject to  x_7 σ(h) ≤ σ_lim
                1^T h = 1,  h ≥ 0.                                    (6)

Here the decision variable is the post-trade portfolio h, expressed as a fraction of the account total; the pre-trade portfolio (which is given) is h_0. We describe the objective function and constraints in more detail below.

The first term in the objective function is an expected return forecast, which is divided into the five alpha vectors α^(1), …, α^(5), corresponding to the momentum, size, quality, value, and minimum volatility factors. The first five components of x control whether these five alpha vectors are on or off. The second term is the (scaled) squared active risk, defined as

    σ(h)^2 = (h − h_b)^T Σ (h − h_b),

where Σ is the return covariance matrix, h_b is the benchmark portfolio, and γ > 0 is the risk-aversion parameter. The third term ℓ(h − h_0) is the immediate tax liability, due to capital gains, required to reach the post-trade portfolio h, and is parametrized by the long- and short-term capital gains rates and the tax lots comprising the initial portfolio. (For more details on ℓ, see Moehle et al. [Moe+20].) The sixth component of x controls whether this term is on or off. The fourth and last term in the objective is a model of transaction cost, where s is the vector of bid-ask spreads for each asset.

The first constraint is a risk limit with parameter σ_lim > 0. (When x_7 = 0, this constraint is deactivated.) The second constraint is a full-investment constraint, and the last constraint specifies that the portfolio is long only.

Note that when x = 0, the portfolio aims to simply track the benchmark portfolio. The full configuration x = 1 means that all seven features are on, i.e., we use all five alpha sources, the capital gain objective term, and the risk limit.

Backtests.
All of our simulations use the S&P 500 as the benchmark portfolio, with data over the period 2002 to 2019. The alpha was obtained using methods similar to those of Kimura et al. [Kim+20]. We use the Barra US Equity model [MOW11] to define Σ and h_b, and used the risk-aversion parameter γ = 80. The tax rates used in ℓ were the long- and short-term capital gains rates (the long-term rate was 0.238), and we took s = 0.001·1, i.e., the bid-ask spread is 10 basis points for all assets. The risk limit is σ_lim = 2%.

Results.
Figures 6 and 7 show the attribution results using the Shapley, one-at-a-time, and leave-one-out methods. For each metric, the leftmost set of bars, labeled 'Base', shows the baseline attribution b for each of the three methods. (The attribution to the baseline is the same for all methods, as described in section 4.) The following seven sets of bars are the attributions a_1, …, a_7 corresponding to the seven features for each of the three methods. Table 5 shows the unattributed amount f(1) − 1^T a − b for each of the three methods and four metrics. For comparison, we show the metrics for the baseline and full configuration.

Table 5: The baseline value b = f(0), the full configuration value f(1), and the unattributed components f(1) − 1^T a − b for all three attribution methods.

By and large, we see the same phenomenon occur for all four metrics: one-at-a-time attribution over-attributes, i.e., it overestimates the contribution of each feature, because when only a single feature is included, it drives the portfolio selection process. On the other hand, the leave-one-out attribution under-attributes, i.e., it underestimates the contribution of each feature, because each single feature makes little difference when competing with the other six. The degree of over- or under-attribution depends on the specific metric and feature in question.

For example, when attributing the risk, this leads to serious problems with the one-at-a-time and leave-one-out attributions that are resolved by Shapley attribution. Under one-at-a-time attribution, the risk limit does not get any 'credit' for risk reduction. This is because the attribution of risk to the risk limit feature is the change in risk obtained by adding it to the benchmark portfolio; because the benchmark portfolio already has low risk, the risk limit has no effect. On the other hand, each of the five signals, when added to the benchmark portfolio, results in high risk. Therefore, with one-at-a-time attribution, risk is severely over-attributed to the five signals. This problem is also apparent in table 5: with the one-at-a-time method, the sum of the attributions to the features and the baseline is over 14%, far larger than the risk of the full configuration. The tax-awareness feature decreases turnover; this is reflected in the negative attribution of turnover to tax awareness with the Shapley and leave-one-out methods.

Conclusion. We propose the use of the Shapley value for portfolio performance attribution. Shapley attribution is the only method that possesses four properties that we believe are crucial for satisfactory portfolio performance attribution: fairness, correct baseline, full attribution, and monotonicity. (A fifth property, additivity, is discussed in appendix A.) We then compare Shapley attribution to other well-known attribution methods. Compared to other attribution methods, the only disadvantage of Shapley attribution is computational: the number of simulations required to carry out Shapley attribution is exponential in the number of features we attribute to. To overcome this, we describe sampling methods that compute approximate Shapley attributions from a smaller number of simulations.
Figure 6:
Attributions of return and risk for the tax-aware portfolio management example.
Figure 7:
Attributions of capital gains and turnover for the tax-aware portfolio management example.
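The over- and under-attribution pattern just described can be reproduced on a toy metric with fully redundant features (a constructed illustration, not data from the backtest): any single active feature already achieves the full value, so one-at-a-time credits every feature fully while leave-one-out credits none.

```python
from itertools import combinations
from math import factorial

def f(x):
    """Toy metric: fully redundant features, any one active suffices."""
    return min(sum(x), 1)

n = 3
e = lambda i: tuple(1 if j == i else 0 for j in range(n))
zeros, ones = (0,) * n, (1,) * n

# One at a time: lift of each feature over the empty configuration.
one_at_a_time = [f(e(i)) - f(zeros) for i in range(n)]
# Leave one out: loss from removing each feature from the full configuration.
leave_one_out = [f(ones) - f(tuple(1 - v for v in e(i))) for i in range(n)]

def shapley(f, n):
    """Exact Shapley attribution by enumeration (small n only)."""
    a = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        total = 0.0
        for k in range(n):
            w = factorial(k) * factorial(n - k - 1) / factorial(n)
            for S in combinations(others, k):
                x = [1 if j in S else 0 for j in range(n)]
                xe = list(x)
                xe[i] = 1
                total += w * (f(tuple(xe)) - f(tuple(x)))
        a.append(total)
    return a
```

Here one_at_a_time sums to 3 and leave_one_out sums to 0, while the Shapley attribution gives each feature 1/3, summing exactly to f(1) − f(0) = 1.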
Acknowledgements.
We would like to thank Eric Kisslinger for supporting us in carrying out the backtests for the tax-aware portfolio management example. We would also like to thank Ronald Kahn and Isaac Mao for useful early discussions and testing of Shapley attribution.

References

[Bac08] C. R. Bacon.
Practical portfolio performance measurement and attribution. Vol. 546. John Wiley & Sons, 2008.
[BHB86] G. P. Brinson, L. R. Hood, and G. L. Beebower. "Determinants of portfolio performance". In:
Financial Analysts Journal
Financial Analysts Journal
Methodology and Computing in Applied Probability
Computers & Operations Research
The Journal of Finance
Active portfolio management. McGraw-Hill, New York, NY, 2000.
[Gri06] R. C. Grinold. "Attribution". In:
The Journal of Portfolio Management
Applied Economics Letters
The Journal of Finance
Applied Stochastic Models in Business and Industry
Advances in Neural Information Processing Systems. 2017, pp. 4765–4774.
[LN+12] D. Liben-Nowell, A. Sharp, T. Wexler, and K. Woods. "Computing the Shapley value in supermodular coalitional games". In:
International Computing and Combinatorics Conference. Springer, 2012, pp. 568–579.
[Lo07] A. W. Lo. "Where do alphas come from?: A new measure of the value of active investment management". In:
A New Measure of the Value of Active Investment Management (May 8, 2007) (2007).
[Mal+13] S. Maleki, L. Tran-Thanh, G. Hines, T. Rahwan, and A. Rogers. "Bounding the estimation error of sampling-based Shapley value approximation". ArXiv preprint. 2013.
[Moe+20] N. Moehle, M. J. Kochenderfer, S. Boyd, and A. Ang. "Tax-aware portfolio construction via convex optimization". ArXiv preprint. 2020.
[Mou04] H. Moulin.
Fair division and collective welfare. MIT Press, 2004.
[MOW11] J. Menchero, D. Orr, and J. Wang. The Barra US equity model (USE4), methodology notes. English. MSCI, May 2011.
[MS+02] D. Monderer, D. Samet, et al. "Variations on the Shapley value". In:
Handbook of Game Theory
Applied Economics Letters
Decisions in Economics and Finance
Contributions to the Theory of Games
Journal of Portfolio Management
Knowledge and Information Systems
Attributing systemic risk to individual institutions. Tech. rep. Bank for International Settlements, 2010.
[You85] H. P. Young. "Monotonic solutions of cooperative games". In:
International Journal of Game Theory
A Additivity
In addition to the desiderata of section 3.1, Shapley attribution is additive. This means that if the metric can be decomposed into multiple components, such that f(x) = f_1(x) + ··· + f_k(x), then the Shapley attribution is given by a = a^(1) + ··· + a^(k) and b = b^(1) + ··· + b^(k), where a^(i) and b^(i) are the attribution of metric f_i to the features. This is especially useful when the metric is separable across time. In this case, f(x) is the value of the metric across a large time window (such as a year), and each f_i(x) is the value of the same metric over a shorter time window (such as a month or quarter). Examples of time-separable metrics are log-returns and squared risk.

                Benchmark   Country alloc.   Stock sel.   Full portfolio
    x           (0,0)       (1,0)            (0,1)        (1,1)
    f_uk(x)     4           4                8            8
    f_jp(x)     −0.8        −1.2             −1.0         −1.5
    f_us(x)     3.2         2.4              2.4          1.8
    f(x)        6.4         5.2              9.4          8.3

Table 6: Data for the returns-based attribution example, when further sub-divided by country.

A.1 Returns-based attribution
Here we return to the returns-based attribution example from section 6.1, where we now decompose the returns by country. Take f_uk(x) to be the weighted return on UK stocks, i.e., the portfolio weight in UK stocks multiplied by the return on UK stocks. (Equivalently, it is the change in value of the UK stocks over the investment period divided by the initial portfolio value.) Define f_jp(x) and f_us(x) similarly. This means that

    f(x) = f_uk(x) + f_jp(x) + f_us(x).

Table 6 shows example data, which are from Bacon [Bac08]. In particular, the benchmark weights are 40% (UK), 20% (Japan), and 40% (US), and the portfolio country allocation was 40%, 30%, and 30%, respectively. The benchmark returns, by country, were 10%, −4%, and 8%; the portfolio returns, by country, were 20%, −5%, and 6%.

Results.
In table 7, we show the results of using the three attribution methods from section 6.1, but now decomposed by country.

                           Benchmark   Country alloc.   Stock sel.   Unattr.
                           b           a_1              a_2          f(1,1) − a_1 − a_2 − b
    One at a time   UK     4           0                4            0
                    Japan  −0.8        −0.4             −0.2         −0.1
                    US     3.2         −0.8             −0.8         0.2
                    Total  6.4         −1.2             3.0          0.1
    Sequential      UK     4           0                4            0
                    Japan  −0.8        −0.4             −0.3         0
                    US     3.2         −0.8             −0.6         0
                    Total  6.4         −1.2             3.1          0
    Shapley         UK     4           0                4            0
                    Japan  −0.8        −0.45            −0.25        0
                    US     3.2         −0.7             −0.7         0
                    Total  6.4         −1.15            3.05         0

Table 7: Attribution results for the returns-based attribution example, decomposed by country.
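The per-country attributions and the additivity property can be checked with a short script; the weights and returns are transcribed from the example data above, and the closed-form two-feature Shapley helper is ours:

```python
# Country data transcribed from the example: weights and returns (percent).
w_bench = {"uk": 0.4, "jp": 0.2, "us": 0.4}
w_port  = {"uk": 0.4, "jp": 0.3, "us": 0.3}
r_bench = {"uk": 10.0, "jp": -4.0, "us": 8.0}
r_port  = {"uk": 20.0, "jp": -5.0, "us": 6.0}

def f_c(c, x):
    """Weighted return of country c at configuration x = (x1, x2), where
    x1 = country allocation active and x2 = stock selection active."""
    w = w_port[c] if x[0] else w_bench[c]
    r = r_port[c] if x[1] else r_bench[c]
    return w * r

def shapley2(f):
    """Closed-form Shapley attribution for a metric on {0,1}^2."""
    b = f((0, 0))
    a1 = 0.5 * ((f((1, 0)) - f((0, 0))) + (f((1, 1)) - f((0, 1))))
    a2 = 0.5 * ((f((0, 1)) - f((0, 0))) + (f((1, 1)) - f((1, 0))))
    return b, a1, a2

per_country = {c: shapley2(lambda x, c=c: f_c(c, x)) for c in w_bench}
total = shapley2(lambda x: sum(f_c(c, x) for c in w_bench))
```

For each of b, a_1, and a_2, the per-country attributions sum to the corresponding entry of the attribution of the total return, which is exactly the additivity property.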