Portfolio Performance Attribution via Shapley Value
Nicholas Moehle, Stephen Boyd, Andrew Ang

February 12, 2021
Abstract
We consider an investment process that includes a number of features, each of which can be active or inactive. Our goal is to attribute or decompose an achieved performance to each of these features, plus a baseline value. There are many ways to do this, which lead to potentially different attributions in any specific case. We argue that a specific attribution method due to Shapley is the preferred method, and discuss methods that can be used to compute this attribution exactly, or when that is not practical, approximately.
Performance of an investment process.
We consider an investment process guided by a portfolio manager, an automated process, or some combination. We are interested in some performance measure of the investment process over some past period, for example P&L, realized return, risk, tracking error, or turnover. Some of these performance measures we prefer to be large (e.g., P&L, realized return), and others we prefer to be small (e.g., risk, tracking error, turnover).
Features.
The process has a number of variations or features, each of which can be active (or on), or inactive (or off). As a simple example, consider an investment process that relies on daily portfolio rebalancing using an optimization method. The features might be a leverage limit, an ESG constraint that limits the securities that can be held, and the use of a novel return forecast developed by some researchers. (In many practical applications, we would have far more than just three features.) We assume that when the process was run over some time period, all the features were active. In our example above, this means that we ran the process with the leverage limit, the ESG constraint, and the forecasts.

Table 1: Example attribution of three performance measures to two features plus baseline. The table on the left shows the data, i.e., the result of backtests with different combinations of the two features. The table on the right shows an example attribution.
Attribution.
The attribution problem asks the question: How much of the performance should we attribute to each of these features, and how much to a baseline or benchmark value? In attribution we are dividing up the performance that was achieved into an amount associated with each feature, and a baseline value, which corresponds to what the performance would have been with all the features off. The amount attributed to a feature can be negative, which means that the feature reduced the performance value. A positive attribution means that the feature increased the performance value. We seek full attribution, which means that the sum of the amounts attributed to the features, plus the baseline value, equals the actual performance achieved.

Continuing our example above, suppose our investment process realizes a return of 8% over one year, with all three features active. An example attribution might be 1% to the leverage limit, -1% to the ESG constraint, 5% to the return forecast, and 3% to baseline performance. We interpret this as saying that the leverage limit was responsible for increasing our return by 1%, while our ESG restrictions depressed the return by 1%; our return forecast was responsible for 5% of the realized return, and 3% is attributed to baseline.

Attribution is closely tied to the idea of marginal performance gain, i.e., the change in performance when a feature is added. Unfortunately, and this is the crux of the problem, the marginal performance gain when adding a feature depends on which features have already been added. We can think of attribution as assigning a single marginal performance gain to a feature, independent of which other features are active.

We can carry out attributions for multiple performance objectives. For example, we can attribute realized risk, realized return, and realized turnover to our features and a baseline. (This is three separate attribution problems.) An example attribution is shown in table 1.

Attribution has many applications.
We can use it to allocate credit, for example to determine bonus payments. For features that are associated with an additional cost, we can assess the cost per unit of performance delivered. For example, if our return forecast in our example incurs a data source cost, we can compare that cost to the return attributed to it. We can use attribution to make changes moving forward; for example, we can consider dropping features that have a negative attribution to P&L. Attribution is a key method for explaining investment performance to clients, and is often required by law.
Challenges.
There are two main challenges in carrying out attribution. The first is that it involves hypothetical or what-if situations. While we directly observe the performance achieved with all features active, we really do not know what the performance would have been with some of the features off. This is addressed by using a high-fidelity simulator or backtester that can simulate what would have happened had some of the features been off. Of course our attribution can only be as accurate as this simulator.

The second main challenge is conceptual or mathematical. Except in a very simple case, which we discuss below, it is hard to exactly define what attribution is. As a result, many different attribution methods are used in practice, leading to different attributions in any specific case. In this paper we will argue that a specific type of attribution, called Shapley attribution, is the best choice. Unfortunately, with more than just a handful of features, computing the Shapley attribution exactly requires carrying out an impractically large number of simulations or backtests. Fortunately, there are methods for computing it approximately, described below, which work well in practice.
The simple case.
We describe here, informally, the one case in which attribution is simple. (We describe this case mathematically below.) Attribution is simple when the performance measure is additive, i.e., a sum of terms, each associated with a feature, plus a constant. Indeed, this particular form directly gives us the attribution. As a simple (and uninteresting) example, consider the total profit of a company over some period, with the features being the independent divisions of the company. Here the total profit is evidently the sum of the profits contributed by the divisions, plus some baseline profit for the company that is unrelated to the profits of the divisions. Roughly speaking, in this simple case there is no interaction among the features; we simply add up the contributions of the features, which directly gives us the attribution, with the baseline value being the difference between the achieved performance and the sum of that attributed to the features.
Interactions.
The challenge is when there are interactions among the features, in terms of how they affect the performance. As a simple example, suppose that feature one and feature two, taken individually, give no increase in performance, but when they are both active, give a substantial improvement in performance. In this case, how should we attribute the performance to these two features? Intuition suggests they should share the credit equally, i.e., have attributions equal to half the increase in performance. As a variation on this situation, suppose that features one and two are the same, so having them both on is the same as having either of them on. Here too intuition suggests they should share the credit.
Cooperative game theory.
Our recommended attribution method uses the Shapley value, an idea originating in cooperative game theory, which attempts to answer the question of how to allocate the earnings of a coalition to individual players [Sha53]. It is derived axiomatically by showing that the only attribution method that satisfies a set of desirable properties is the Shapley value. Because of these desirable properties, the Shapley value is widely considered to be a fair approach to allocating value [Mou04]. The Shapley value has seen a large number of extensions and variations; see [MS+02] for a summary. One important extension assumes certain coalitions of players are disallowed, which changes the resulting allocation [Hil18].

In general, the effort required to compute Shapley values is exponential in the number of players, which can be prohibitive when the number of players is large. In this case, the Shapley values can be approximated using Monte Carlo [CGT09], with error bounds given in Maleki et al. [Mal+13]. When the coalition value function has specific properties (e.g., submodularity), more efficient methods may exist. (See Liben-Nowell et al. [LN+12].)
Attribution in machine learning.
Recently, Shapley values have been used to interpret the output of machine learning prediction models, such as random forests and neural networks [LC01; ŠK14]. In this case, the model inputs (or features) are modeled as players in a coalition, and the resulting prediction performance is the value of each coalition. As a result, mature theory and software exists for approximately computing the Shapley value for games with many players [LL17].
Portfolio performance attribution.
Since the work of Jensen [Jen68], academics have sought to attribute returns of managers to skill (security selection) vs. exposure to rewarded risk factors, like the market portfolio [FF10; Sha92]. Many approaches for performance attribution exist. The simplest methods break up portfolio return into components, which sum to the portfolio return [BHB86; BSB91]. Other standard approaches use the correlation between portfolio weights and returns [GK00; Gri06; Lo07]. These methods are scalable, and are often used to attribute performance to a large number of predictive signals.

Many of these standard approaches use time series of returns or cross-sectional position holdings, which result in unattributed value (or 'residuals'); one advantage of the Shapley value is full attribution, which means the attributions of individual features and the baseline sum to the total portfolio return. In our exposition, we explicitly contrast Shapley attribution to these more widely used techniques. While Shapley values have been applied to risk decompositions [Ort16; MT08; TBT10; BSV18], to our knowledge, ours is the first application of the Shapley value to the general portfolio performance attribution problem, and more generally to any statistic that is produced by an investment process.
In this section we fix our notation and describe our model.
Investment process.
We assume there is an investment process which produces a dynamic (time-varying) portfolio allocation over some time window. The investment process may depend on market conditions, the prior portfolio holdings, the decisions of analysts or portfolio managers, and portfolio optimization techniques, to update the portfolio holdings over time. In practice, the investment process may be very complicated, and we intentionally leave the details unspecified.
Features and configuration.
The investment process has n features that can be (or could have been) included or excluded, i.e., active or inactive. These features might represent the choice of a specific benchmark, choice of sector exposures or asset allocation, or the contributions of a specific analyst or signal. The inclusion or exclusion of feature i is denoted x_i ∈ {0, 1}, with x_i = 0 meaning that the feature is inactive, and x_i = 1 meaning the feature is active. The collection of these feature status values is called a configuration of the investment process, and is represented by the Boolean vector x = (x_1, ..., x_n). For example, x = (1, 0, 1, 1, 0) means that features 1, 3, and 4 are active, while features 2 and 5 are inactive.

We observe that there are 2^n possible configurations, which grows rapidly with the number of features n. For n = 10, there are around 1000 possible configurations; for n = 30, the number is around 10^9.

The full and zero configurations.

The configuration x = (1, 1, ..., 1) = 1 (the vector of all ones) means that all features are active. We refer to this as the full configuration or fully featured configuration. We will assume that the full configuration is the one that was actually used; the other 2^n − 1 configurations are hypothetical. The configuration x = (0, 0, ..., 0) = 0 is called the zero configuration or the baseline configuration or the benchmark configuration. It corresponds to the investment process with all features inactive. In some cases it can be interpreted as investing in a benchmark portfolio.
Performance metric.
This investment process is evaluated using a real-valued performance metric y ∈ R. (In practice, portfolios are evaluated using many metrics, which can be considered separately.) Examples of performance metrics include the portfolio's return, risk, risk-adjusted return, turnover, or average exposure to a particular risk factor, over some investment period. (These can be ex-ante values, evaluated using a contemporaneous model; or realized, ex-post values, evaluated using the actual data.) Note that large values of y can be good (as in return), or bad (as in risk).

We assume that when the full configuration x = 1 was used, the resulting realized performance was y^real, which can be directly observed. In cases when the baseline or zero configuration can be interpreted as investing in a benchmark portfolio, the performance value for this too can be directly observed.

Simulation and backtesting.
We use simulation to judge the performance using other, hypothetical configurations. This typically has the form of a backtesting engine, which can evaluate the performance under the hypothetical configurations. This process is represented by a function f : {0, 1}^n → R, with

    y = f(x) = f(x_1, ..., x_n).

Evaluating the function f requires running a backtest of the investment process under configuration x, and recording the performance y. We assume that the backtests are calibrated so that y^real = f(1), i.e., the backtest simulation result for the configuration we used agrees with the performance we actually observed.

Lift or marginal contribution.
We now introduce a natural concept in attribution, which is the change in performance when we add one feature. Suppose the configuration is x, with x_i = 0, i.e., feature i is inactive. Then x̃ = x + e_i (where e_i is the i-th unit vector) is the configuration obtained by turning feature i on. We define the lift or marginal contribution as the change in performance obtained by adding feature i, i.e.,

    f(x + e_i) − f(x).

This marginal contribution depends on the particular configuration x. In other words, the lift associated with adding a feature depends on which other features are active.

Geometric interpretation.
We can associate the 2^n different possible configurations with the corners of a unit (hyper)cube in R^n. We can create a directed graph with configurations as vertices, by having an edge from configuration x to configuration x̃ if x̃ = x + e_i for some i. In words, an edge goes from one configuration to another that is obtained by adding one feature. We note for future use that we can associate with an edge from x to x̃ = x + e_i a marginal performance change f(x̃) − f(x) = f(x + e_i) − f(x). The number of edges is n 2^(n−1).

An example with n = 3 features is depicted in figure 1. The point (0, 0, 0) represents the baseline configuration, and the point (1, 1, 1) represents the fully featured configuration. The edge from (0, 1, 0) to (1, 1, 0) corresponds to adding feature 1 to the configuration with only feature 2 active. There are a total of 12 edges.

Figure 1:
Visualization of configurations as vertices of a hypercube, for n = 3. The vertices, shown as dots, correspond to configurations. The edges correspond to adding one feature to a configuration.

We would like to attribute the realized performance y^real to the n features, i.e., to determine how much of the performance resulted from each feature. An attribution method determines real values a_1, ..., a_n and b, where a_i is the amount attributed to feature i, and b is the baseline amount. The attribution and baseline amounts can be positive or negative. We will represent the attribution using a vector a = (a_1, ..., a_n) and scalar b.

The attribution is derived from the feature performance function f, i.e., its values for the 2^n different configurations. An attribution method is an algorithm or method that determines the attribution based on evaluating f for some, or possibly all, configurations. We now describe several desirable properties of an attribution method.
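As a concrete illustration of configurations, the performance function f, and lifts, consider the following minimal sketch. The performance values are made up for illustration; in practice each entry of the table would be produced by a backtest.

```python
# Toy performance function f for n = 2 features, stored as a lookup table
# over all 2^n configurations. The numbers are hypothetical; in practice
# each value would come from a backtest of the investment process.
f = {
    (0, 0): 3.0,  # baseline (zero configuration)
    (1, 0): 4.0,  # feature 1 only
    (0, 1): 2.0,  # feature 2 only
    (1, 1): 8.0,  # full configuration (the one actually run)
}

def lift(f, x, i):
    """Marginal contribution of adding feature i to configuration x."""
    assert x[i] == 0, "feature i must be inactive in x"
    x_plus = tuple(1 if j == i else v for j, v in enumerate(x))
    return f[x_plus] - f[x]

# The lift of a feature depends on which other features are active:
lift_alone = lift(f, (0, 0), 0)   # 4.0 - 3.0 = 1.0
lift_with_2 = lift(f, (0, 1), 0)  # 8.0 - 2.0 = 6.0
```

Here feature 1's lift is 1.0 from the baseline but 6.0 when feature 2 is already active; this kind of interaction is exactly what makes attribution nontrivial.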
Full attribution.
We require full attribution, which means that

    y^real = f(1) = a_1 + · · · + a_n + b = 1^T a + b.

This means that the observed performance measure f(1) is fully attributed to the n features, plus the baseline. Even though full attribution is a crucial property of a good attribution method, we will see that commonly used attribution methods do not have it.

Correct baseline value.
We say that an attribution has the correct baseline value if f(0) = b. When the zero configuration corresponds to investing in a benchmark portfolio, this means that b matches the performance of the benchmark portfolio.

Fairness.
We call an attribution method fair if, for any permutation of the features, the attributions are permuted the same way. This property implies that if two features are the same, i.e., they have the same effect on performance, then their attributions must be the same.
Monotonicity.
This means that if we change f in such a way that one feature's marginal contribution does not decrease (no matter which features are already active), then the attribution to this feature also does not decrease.

Additive performance.

We say the performance is additive if f has the form

    f(x) = a^T x + b = b + Σ_{i : x_i = 1} a_i,

for some vector a and scalar b. (If we consider x_i to be real numbers, and not just 0 or 1 as we do here, this corresponds to f being an affine function.) When f is additive, the baseline performance is b, and the marginal increase in the performance when adding feature i is always a_i, independent of what other features are already active.

For an additive function, a and b directly give an attribution which satisfies all four of the desiderata listed above: full attribution, correct baseline, fairness, and monotonicity. The case of additive performance is the easy, or even trivial, case for attribution. It is when f is not additive that it becomes more difficult to assign an attribution.

In this section, we describe several attribution methods, concluding with the Shapley attribution method, which we recommend.
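In the additive case the attribution can simply be read off. A small sketch, with made-up values of a and b:

```python
# Additive performance: f(x) = b + sum over active features of a_i.
# The values of a and b here are hypothetical, for illustration.
a = [1.5, -0.5, 2.0]
b = 3.0

def f_additive(x):
    return b + sum(ai for ai, xi in zip(a, x) if xi == 1)

# The lift of feature 1 is a[0] = 1.5 regardless of the configuration:
lift_from_baseline = f_additive((1, 0, 0)) - f_additive((0, 0, 0))
lift_all_others_on = f_additive((1, 1, 1)) - f_additive((0, 1, 1))

# Full attribution holds by construction: f(1) = a_1 + ... + a_n + b.
y_real = f_additive((1, 1, 1))  # 3.0 + 1.5 - 0.5 + 2.0 = 6.0
```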
One-at-a-time attribution.

We take

    b = f(0),   a_i = f(e_i) − f(0),   i = 1, ..., n.

In other words, we carry out a baseline simulation (if needed) and n additional simulations, each with exactly one feature enabled. We attribute to each feature the increase in performance when it is added to the baseline configuration, i.e., its lift or marginal performance increase from the baseline configuration x = 0. This method is natural, and requires carrying out only n (or n + 1 if we include the baseline) simulations.

One-at-a-time attribution satisfies correct baseline value, fairness, and monotonicity. However, it can (and often does) fail to satisfy full attribution. To see this, consider the simple example with n = 2 and

    f(0) = 0,   f(e_1) = 1,   f(e_2) = 1,   f(1) = 1.   (1)

In this example the presence of either feature 1 or feature 2 gives the full performance value 1. One-at-a-time attribution for this example is b = 0, a_1 = 1, a_2 = 1, so b + a_1 + a_2 = 2, whereas f(1) = 1. Roughly speaking, one-at-a-time attribution over-allocates performance to the features in this example.

We note that one-at-a-time attribution coincides with the attribution described above for additive f. Here, however, the same formula is being applied to any f, not just additive f.

Leave-one-out attribution.

The leave-one-out attribution method is closely related to one-at-a-time attribution. As in one-at-a-time attribution, we set b = f(0). We then carry out n simulations, with x = 1 − e_i, i = 1, ..., n. In other words, for each feature we simulate the performance when it is left out, but all other features are present. We set

    a_i = f(1) − f(1 − e_i),   i = 1, ..., n,

which is the marginal performance increase when adding feature i when all other features are active. Like one-at-a-time attribution, leave-one-out attribution requires carrying out n simulations, plus a baseline simulation. Like one-at-a-time, leave-one-out attribution satisfies correct baseline value, fairness, and monotonicity, but it can fail to achieve full attribution.
To see this, consider the same example described above in (1). The leave-one-out attribution for this example is b = 0, a_1 = 0, and a_2 = 0, i.e., it allocates zero performance to each feature. Roughly speaking, it under-allocates performance in this example.

Sequential attribution.

Another commonly used method is sequential attribution. We start by evaluating the baseline configuration performance b = f(0). We then simulate the configuration x = e_1, i.e., we add the first feature. We continue adding features until we have all features active. We take

    a_i = f(e_1 + · · · + e_i) − f(e_1 + · · · + e_{i−1}),   i = 1, ..., n,

which is the lift of feature i, when the features 1, ..., i − 1 are active. This requires n simulations, plus a baseline simulation.

Sequential attribution satisfies full attribution, since

    b + a_1 + · · · + a_n = f(0) + (f(e_1) − f(0)) + (f(e_1 + e_2) − f(e_1)) + · · · + (f(1) − f(1 − e_n)) = f(1).

It also satisfies correct baseline and monotonicity. But sequential attribution does not satisfy fairness, since the attribution obtained depends on the order in which the features are added. The same example above given in (1) illustrates this. Sequential attribution gives b = 0, a_1 = 1, and a_2 = 0; that is, the first feature gets attributed the full performance and the second gets none. If we swap the two features, we assign the full performance to feature two.

Sequential attribution can also be described in a different, reverse form, sometimes called off-the-top attribution. We first evaluate f(1 − e_n), the performance when feature n is turned off, with all others on. We set a_n = f(1) − f(1 − e_n), the marginal increase in adding feature n when all others are active. We then evaluate f(1 − e_n − e_{n−1}), the performance when features n and n − 1 are both turned off, and set

    a_{n−1} = f(1 − e_n) − f(1 − e_n − e_{n−1}),

the marginal performance when we add feature n − 1, when features 1, ..., n − 2 are active. We continue in this way until we reach a_1 = f(e_1) − f(0).
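The three simple methods above can be sketched and compared on the example in (1):

```python
# One-at-a-time, leave-one-out, and sequential attribution, applied to the
# example in (1): f(0,0) = 0, f(1,0) = 1, f(0,1) = 1, f(1,1) = 1.
f = {(0, 0): 0.0, (1, 0): 1.0, (0, 1): 1.0, (1, 1): 1.0}
n = 2

def unit(i):       # configuration e_i, with only feature i active
    return tuple(1 if j == i else 0 for j in range(n))

def ones_minus(i): # configuration 1 - e_i, with only feature i inactive
    return tuple(0 if j == i else 1 for j in range(n))

b = f[(0,) * n]

a_oat = [f[unit(i)] - b for i in range(n)]                  # one-at-a-time
a_loo = [f[(1,) * n] - f[ones_minus(i)] for i in range(n)]  # leave-one-out

a_seq = []  # sequential, adding features in the order 1, 2
x = [0] * n
for i in range(n):
    prev = f[tuple(x)]
    x[i] = 1
    a_seq.append(f[tuple(x)] - prev)

# a_oat == [1.0, 1.0]: over-allocates (b + 1 + 1 = 2, but f(1,1) = 1)
# a_loo == [0.0, 0.0]: under-allocates
# a_seq == [1.0, 0.0]: full attribution, but depends on the feature order
```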
Geometric interpretation.

We can give a nice geometric interpretation of sequential attribution. We start at node or vertex x = 0 and follow a specific directed path to node e_1, then e_1 + e_2, and so on, ending at the full configuration x = 1. The attribution to feature i is the marginal increase associated with the edge in which feature i is added. An example with n = 3 is shown (in red) in figure 2.

Permuted sequential attribution.

Here we describe a simple extension of sequential attribution that will come up in the sequel. Let π = (k_1, ..., k_n) be a permutation of (1, ..., n), which means that each integer from 1 to n appears as one of the k_i. Define x̃ as x̃_i = x_{k_i}, i = 1, ..., n. This is the configuration vector when we permute the features using π. We define the permuted performance function as f̃(x̃) = f(x).

Figure 2:
Two permutations for sequential attribution. The red path corresponds to the standard permutation (1, 2, 3): (0, 0, 0) → (1, 0, 0) → (1, 1, 0) → (1, 1, 1). The blue path corresponds to the permutation (2, 3, 1): (0, 0, 0) → (0, 1, 0) → (0, 1, 1) → (1, 1, 1).

Permuted sequential attribution permutes the original features to obtain f̃, then uses sequential attribution on f̃, and finally permutes the resulting attribution ã and b̃ back to the original ordering. In sequential attribution, we use the marginal performance contribution when the features are added one by one, in order. Permuted sequential attribution is the same, except that we add the features in the order (k_1, k_2, ..., k_n).

Permuted sequential attribution satisfies full attribution, correct baseline, and monotonicity, but not fairness. Indeed, fairness would require that the attribution obtained is the same for any permutation π. (This is the case if and only if f is additive.)

Geometric interpretation.
We can associate a permutation π with a directed path from 0 to 1 on the vertices of the hypercube, and vice versa, since any such path corresponds to a permutation. We allocate to each feature the marginal change in performance along the edge in which feature i is added. An example with n = 3 is shown (in blue) in figure 2.

Shapley attribution.

Finally we come to the attribution method we endorse, the Shapley method. The Shapley attribution is simply the average of the permuted sequential attributions over all n! permutations. Formally, let a^π and b denote the attribution for permuted sequential attribution with permutation π. (The value of the baseline attribution b does not depend on the permutation.) The Shapley attribution is

    a = (1/n!) Σ_π a^π,   (2)

Method          Full attr.   Baseline   Fairness   Monotonicity   Simulations
One-at-a-time   no           yes        yes        yes            n + 1
Leave-one-out   no           yes        yes        yes            n + 1
Sequential      yes          yes        no         yes            n + 1
Permuted seq.   yes          yes        no         yes            n + 1
Shapley         yes          yes        yes        yes            2^n

Table 2:
Properties of attribution methods. The righthand column gives the number of simulations required to compute the attribution.

where the sum is over all n! permutations. This method satisfies all the desiderata: full attribution, correct baseline, fairness, and monotonicity. Indeed, it has been shown that any attribution method that satisfies these four desiderata must coincide with the Shapley attribution [You85].

The bad news is that evaluating the Shapley attribution requires 2^n simulations, which for n larger than 10 or so is likely to be impractical. This is contrasted with the one-at-a-time, leave-one-out, sequential, and permuted sequential attribution methods, which require only n + 1 simulations. We will address this computational complexity issue in more detail below. We summarize the properties of the different attribution methods in table 2.

Simple example.
Consider the simple example given in (1). There are only 2! = 2 permutations. For π = (1, 2), permuted sequential attribution gives b = 0, a_1 = 1, and a_2 = 0; for π = (2, 1), it gives b = 0, a_1 = 0, and a_2 = 1. The Shapley attribution for this example is

    b = 0,   a_1 = 1/2,   a_2 = 1/2.

Roughly speaking, in this example, features one and two are the same; the presence of either alone gives the full performance. Permuted sequential attribution gives all the credit to the first feature in the sequence, and none to the second. The Shapley attribution averages over the two cases, and splits the credit between the two features equally.
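A direct implementation of equation (2), enumerating all n! permutations (practical only for small n), reproduces this:

```python
from itertools import permutations
from math import factorial

def shapley(f, n):
    """Exact Shapley attribution: the average of the permuted sequential
    attributions over all n! permutations, per equation (2)."""
    a = [0.0] * n
    for perm in permutations(range(n)):
        x = [0] * n
        for i in perm:  # add features in the order given by perm
            prev = f[tuple(x)]
            x[i] = 1
            a[i] += f[tuple(x)] - prev
    return [ai / factorial(n) for ai in a], f[(0,) * n]

# Example (1): either feature alone already gives the full performance.
f = {(0, 0): 0.0, (1, 0): 1.0, (0, 1): 1.0, (1, 1): 1.0}
a, b = shapley(f, 2)  # a == [0.5, 0.5], b == 0.0
```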
Geometric interpretation.
The Shapley attribution for feature i is the average of the marginal performance change when feature i is added, over all directed paths from 0 to 1.

The n = 3 case.

We work out the general Shapley attribution for the case with n = 3. There are 2^n = 8 configurations, and 3! = 6 sequences. We derive formulas for a_1 here; attributions to other features have similar formulas. First we list the n! sequences and the associated marginal performance change for feature 1.

Figure 3:
All green edges correspond to adding the first feature to a configuration without it.
Permutation    Marginal performance change for feature 1
(1, 2, 3)      f(1, 0, 0) − f(0, 0, 0)
(1, 3, 2)      f(1, 0, 0) − f(0, 0, 0)
(2, 1, 3)      f(1, 1, 0) − f(0, 1, 0)
(2, 3, 1)      f(1, 1, 1) − f(0, 1, 1)
(3, 1, 2)      f(1, 0, 1) − f(0, 0, 1)
(3, 2, 1)      f(1, 1, 1) − f(0, 1, 1)

Collecting the terms for a_1 in terms of distinct edges or marginal changes, we have

    a_1 = (2/6)(f(1, 0, 0) − f(0, 0, 0)) + (1/6)(f(1, 1, 0) − f(0, 1, 0))
        + (1/6)(f(1, 0, 1) − f(0, 0, 1)) + (2/6)(f(1, 1, 1) − f(0, 1, 1)).

The numerators 1 and 2 in each term correspond to the number of paths that include that edge. For example, there is only one path or permutation that includes the edge from (0, 1, 0) to (1, 1, 0), namely, the one associated with the permutation (2, 1, 3).

Computing Shapley attributions
In this section we focus on methods to compute the Shapley attribution exactly, or when that is not practical, approximately. We focus on general methods that work without any assumptions about f.

Exact computation.

Computing the Shapley attribution directly using equation (2) requires taking the average over all n! permuted sequential attributions. In these sequential attributions, we end up evaluating f(x) for the same value of x multiple times. To evaluate the Shapley attribution somewhat more efficiently, we use an alternative formula for the Shapley attribution, which sets the baseline value b = f(0) and the attribution to feature i as

    a_i = Σ_{x ∈ X_i} ((1^T x)! (n − 1^T x − 1)! / n!) (f(x + e_i) − f(x)).   (3)

Here X_i is the set of configurations with feature i off, i.e., X_i = {x | x_i = 0}. Using this formula, the Shapley attribution can be computed directly from the values of f at all 2^n configurations.

We note for future use that the coefficients in the sum in (3) sum to one, so they define a probability distribution on the set of configurations with feature i off. The formula shows that the i-th Shapley attribution is the expected value or weighted average of the lift obtained by adding feature i.

Monte Carlo approximation.

Computing the Shapley attribution requires 2^n simulations, which can be prohibitive when n is large, even just a few tens. In this case, we recommend approximating the Shapley attributions using Monte Carlo sampling methods. We can sample either sequences or lifts, using the two formulas (2) and (3) respectively, each of which expresses the Shapley attributions as an expectation. The idea of sampling sequences has been proposed in Castro, Gómez, and Tejada [CGT09], but to our knowledge the method based on sampling lifts has not appeared in the literature.
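Equation (3) can be implemented directly, by enumerating, for each feature i, all configurations with feature i off. A sketch:

```python
from itertools import product
from math import factorial

def shapley_from_subsets(f, n):
    """Shapley attribution via equation (3): for each feature i, a weighted
    average of lifts over all configurations x with x_i = 0."""
    a = [0.0] * n
    for i in range(n):
        for x in product((0, 1), repeat=n):
            if x[i] == 1:
                continue  # only configurations with feature i off
            k = sum(x)    # number of active features, 1^T x
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            x_plus = tuple(1 if j == i else v for j, v in enumerate(x))
            a[i] += weight * (f[x_plus] - f[x])
    return a

# Agrees with the permutation average on the example in (1):
f = {(0, 0): 0.0, (1, 0): 1.0, (0, 1): 1.0, (1, 1): 1.0}
a = shapley_from_subsets(f, 2)  # [0.5, 0.5]
```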
Sampling sequences.

In this method, we use the sum in definition (2) as a basis for Monte Carlo sampling. We sample N permutations of the features, with replacement, and average the permuted sequential attributions corresponding to each. Computing each permuted sequential attribution requires n + 1 simulations. (We can reduce the number of simulations required a bit by caching previously computed values of f(x), and using these when f(x) is needed again.)

The expected value of the attributions corresponding to this method are the Shapley attributions. The Monte Carlo attributions satisfy the full attribution property (since each permuted sequential attribution does). The attributions satisfy fairness approximately, or in expectation.
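A sketch of sampling sequences; the performance function here is a stub lookup table standing in for a real backtesting engine.

```python
import random

def shapley_sample_sequences(f, n, num_samples, seed=0):
    """Approximate Shapley attribution by averaging the permuted sequential
    attributions of randomly sampled permutations (with replacement)."""
    rng = random.Random(seed)
    a = [0.0] * n
    for _ in range(num_samples):
        perm = list(range(n))
        rng.shuffle(perm)  # a uniformly random permutation
        x = [0] * n
        for i in perm:
            prev = f(tuple(x))
            x[i] = 1
            a[i] += f(tuple(x)) - prev
    return [ai / num_samples for ai in a]

# Stub "backtester" for the example in (1).
table = {(0, 0): 0.0, (1, 0): 1.0, (0, 1): 1.0, (1, 1): 1.0}
def f(x):
    return table[x]

a_hat = shapley_sample_sequences(f, 2, num_samples=200)
b = f((0, 0))
# Full attribution holds for every sampled sequence, hence for the average:
full_attr_gap = abs(b + sum(a_hat) - f((1, 1)))
```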
Sampling lifts.

In this method, we use equation (3) as the basis for Monte Carlo sampling. As noted above, the sum in (3) can be expressed as

    a_i = E( f(x + e_i) − f(x) ),   (4)

where the configuration x is a random variable supported on X_i with probability distribution

    Prob(x = x′) = (1^T x′)! (n − 1^T x′ − 1)! / n!.   (5)

To approximate the Shapley attribution of feature i, we first compute b = f(0). We then sample configurations from X_i with distribution (5), and compute the lift of adding feature i to each sampled configuration. The approximate Shapley attribution of feature i is the average of the lifts obtained. This process is then repeated for each feature. (To sample from distribution (5), first sample the number of active features 1^T x, which takes each value k ∈ {0, ..., n − 1} with probability p_k = C(n − 1, k) k! (n − k − 1)! / n! = 1/n. Then randomly sample 1^T x of the n − 1 features other than i to be active.)

The advantage of this method is that it tends to produce better approximations with fewer simulations than sampling sequences, because it samples more frequently the terms in the sum (3) with larger coefficients, therefore forming a more precise approximation of the sum quickly. (This is a form of importance sampling.) Unfortunately, these approximate Shapley attributions satisfy full attribution only in the limit as the number of samples grows, or in expectation. This can be remedied by scaling the approximate Shapley attributions so that full attribution holds, even for a finite number of samples.
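A sketch of sampling lifts. It relies on the observation that, under distribution (5), the number of active features is uniform on {0, ..., n − 1} (the C(n − 1, k) configurations at each level have total probability 1/n); again the performance function is a stub lookup table.

```python
import random

def shapley_sample_lifts(f, n, num_samples, seed=0):
    """Approximate each Shapley attribution by averaging sampled lifts,
    with configurations drawn from distribution (5)."""
    rng = random.Random(seed)
    a = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        total = 0.0
        for _ in range(num_samples):
            # Under (5), the number of active features is uniform on 0..n-1.
            k = rng.randrange(n)
            x = [0] * n
            for j in rng.sample(others, k):  # choose k of the other features
                x[j] = 1
            x_plus = list(x)
            x_plus[i] = 1
            total += f(tuple(x_plus)) - f(tuple(x))
        a.append(total / num_samples)
    return a

table = {(0, 0): 0.0, (1, 0): 1.0, (0, 1): 1.0, (1, 1): 1.0}
def f(x):
    return table[x]

b = f((0, 0))
a_hat = shapley_sample_lifts(f, 2, num_samples=500)
# Rescale so that full attribution holds exactly:
scale = (f((1, 1)) - b) / sum(a_hat)
a_scaled = [ai * scale for ai in a_hat]
```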
Caching simulations.

For both sampling methods, some configurations may appear repeatedly. It is therefore useful to cache the values of configurations, so they can be re-used in future sampled sequences. If we are asked to evaluate f for an x that has already been evaluated, we simply use the already computed value.

Numerical example.

We now demonstrate the approximation techniques for a simple numerical example in which the metric is the convex quadratic function

    f(x) = x^T P x,

where P is symmetric positive semidefinite. It can be shown that its Shapley attribution is b = 0 and a = P1 (where 1 is the vector of all ones).
Figure 4:
Relative error of the approximate Shapley attributions as a function of the number of unique configurations evaluated, when sampling sequences (blue), sampling lifts (green), and when sampling lifts and scaling so that full attribution always holds (red).
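Both approximation schemes compared in figure 4 can be sketched in pure Python; this is a minimal illustration (function names are ours, not the authors' code), for a metric f defined on 0/1 tuples:

```python
import random

def sample_lifts(f, n, samples_per_feature=1000, seed=0, scale=True):
    """Approximate Shapley attribution by sampling lifts, i.e.,
    importance sampling of the terms of the sum (3)."""
    rng = random.Random(seed)
    b = f((0,) * n)
    a = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        total = 0.0
        for _ in range(samples_per_feature):
            k = rng.randrange(n)        # 1^T x is uniform on {0,...,n-1}
            x = [0] * n
            for j in rng.sample(others, k):
                x[j] = 1
            xe = list(x)
            xe[i] = 1
            total += f(tuple(xe)) - f(tuple(x))  # lift of feature i at x
        a[i] = total / samples_per_feature
    if scale:
        s = sum(a)
        if s != 0:                      # rescale so full attribution holds
            gap = f((1,) * n) - b
            a = [ai * gap / s for ai in a]
    return b, a

def sample_sequences(f, n, num_sequences=1000, seed=0):
    """Approximate Shapley attribution by sampling random feature
    sequences; the lifts along each sequence telescope to f(1) - f(0),
    so full attribution holds after every completed sequence."""
    rng = random.Random(seed)
    b = f((0,) * n)
    a = [0.0] * n
    for _ in range(num_sequences):
        order = list(range(n))
        rng.shuffle(order)
        x, prev = [0] * n, b
        for i in order:
            x[i] = 1
            cur = f(tuple(x))
            a[i] += cur - prev
            prev = cur
    return b, [ai / num_sequences for ai in a]
```

Both samplers treat f as a black box, so the caching described above can be added by simply memoizing f (e.g., with functools.lru_cache on the tuple argument).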
We generate P randomly as P = Z^T Z, where the entries of Z are independently drawn from a standard normal distribution. (Thus, P has a Wishart distribution with n degrees of freedom and scale matrix I.) We approximate the Shapley attribution using the two methods given above, sampling sequences and sampling lifts, and compare their accuracy as a function of the number of unique configurations x at which we evaluate f(x). (This gives the number of configurations for which we evaluate f, using caching as described above.) When sampling lifts, we consider two versions, the basic (unscaled) version and the version in which we scale so that full attribution holds. We compare the results using the relative error

    e_rel = ||â − a|| / ||a||,

where a = P1 is the true Shapley attribution, and â is the sampling-based estimate.

The results are shown in figure 4, for a problem instance with n = 10. When sampling sequences, there is no estimate (and therefore no error) until one entire sequence has been evaluated; similarly, when sampling lifts, there is no estimate or error until all features have at least one lift sampled. Both methods converge to the true values once all 2^10 = 1024 configurations have been evaluated. We see that for this example, sampling lifts obtains a lower error than sampling sequences, regardless of the number of configurations evaluated. We also note that when sampling lifts, scaling the approximate Shapley values decreases accuracy, but only slightly. The results of this particular problem instance are typical of many others we have evaluated.

             Benchmark   Country alloc.   Stock sel.   Full portfolio
    x        (0,0)       (1,0)            (0,1)        (1,1)
    f(x)     6.4         5.2              9.4          8.3

Table 3:
Data for the simple returns-based attribution example.
Returns-based attribution. Here we consider a simple return attribution example from Bacon [Bac08]. The metric f(x) is the portfolio return over some time period, expressed in percent. We have n = 2 features: feature 1 is the country allocation decision (the decision of how much to invest in which country) and feature 2 is the stock selection decision (i.e., the decision of which individual stocks to hold within these countries). (When feature 1 is not active, we invest in each country in proportion to its benchmark weight. When feature 2 is not active, then within each country we invest in each security in proportion to its benchmark weight.) Table 3 shows example data for a single year. We revisit this example in appendix A, where we further decompose returns by country.

We now discuss, in detail, how to apply the one-at-a-time, sequential, and Shapley attribution methods to this example, with a geometric interpretation given in figure 5. We will see that the one-at-a-time and sequential methods recover classical attribution methods known in the literature.

One-at-a-time attribution.
One-at-a-time attribution chooses

    b = f(0,0),  a_1 = f(1,0) − f(0,0),  a_2 = f(0,1) − f(0,0).

For this example, one-at-a-time attribution is exactly the classical Brinson–Hood–Beebower method [BHB86]. As discussed in section 3.1, this method does not have full attribution; in fact, the unattributed value is

    f(1,1) − f(1,0) − f(0,1) + f(0,0).

Sequential attribution.
Sequential attribution chooses

    b = f(0,0),  a_1 = f(1,0) − f(0,0),  a_2 = f(1,1) − f(1,0).

This method coincides with the modified Brinson–Hood–Beebower method given in Bacon [Bac08], which justifies choosing feature 1 first in the sequence because the sector allocation decision is often made before security selection decisions. This method eliminates the unattributed value, but violates the fairness property by prioritizing the country allocation decision over the stock selection decision during attribution.

Figure 5:
Geometric interpretation of the returns-based attribution example. The one-at-a-time method attributes based on the lifts f(1,0) − f(0,0) and f(0,1) − f(0,0); the sequential method attributes based on f(1,0) − f(0,0) and f(1,1) − f(1,0).

Shapley attribution.
Shapley attribution chooses b = f(0,0) and

    a_1 = (1/2)( f(1,1) − f(0,1) + f(1,0) − f(0,0) ),
    a_2 = (1/2)( f(1,1) − f(1,0) + f(0,1) − f(0,0) ).

This is the average of the sequential attributions produced by the two sequences (0,0), (1,0), (1,1) and (0,0), (0,1), (1,1); the second sequence corresponds to making the stock selection decision before the sector allocation decision.

Results.
Table 4 shows the results of applying all three attribution methods. As expected, the one-at-a-time method has a non-zero unattributed return. We can see that in the sequential method, this unattributed return is entirely allocated to stock selection. In the Shapley attribution method, the unattributed term is allocated half to the country allocation and half to stock selection. Like the sequential method, Shapley attribution has no unattributed component. However, unlike the sequential method, it treats country allocation and stock selection equally, instead of prioritizing country allocation over stock selection. In this simple and small example, the differences in attribution across the different methods are not very significant. In the next section, we will see an example where Shapley attribution is a substantial improvement over competing methods.

                    Benchmark   Country alloc.   Stock sel.   Unattributed
                    b           a_1              a_2          y − a_1 − a_2 − b
    One at a time   6.4         −1.2             3.0          0.1
    Sequential      6.4         −1.2             3.1          0
    Shapley         6.4         −1.15            3.05         0
Table 4:
Attribution results for the simple returns-based attribution example.
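The attributions in table 4 can be reproduced directly from the four configuration values of table 3; a short sketch (values transcribed from the example data):

```python
# f[(x1, x2)]: portfolio return (percent) for each configuration, from
# table 3; x1 = country allocation active, x2 = stock selection active.
f = {(0, 0): 6.4, (1, 0): 5.2, (0, 1): 9.4, (1, 1): 8.3}
b = f[(0, 0)]

# One at a time (classical Brinson-Hood-Beebower): lifts over the baseline.
oat = (f[(1, 0)] - b, f[(0, 1)] - b)
unattributed = f[(1, 1)] - b - oat[0] - oat[1]

# Sequential, with country allocation (feature 1) taken first.
seq = (f[(1, 0)] - b, f[(1, 1)] - f[(1, 0)])

# Shapley: average of the two possible sequential attributions.
shap = (0.5 * (f[(1, 1)] - f[(0, 1)] + f[(1, 0)] - b),
        0.5 * (f[(1, 1)] - f[(1, 0)] + f[(0, 1)] - b))
```

Note that only the Shapley and sequential attributions satisfy b + a_1 + a_2 = f(1,1) exactly; the one-at-a-time method leaves the unattributed remainder.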
Tax-aware portfolio management. Here we give an example of attribution of multiple performance metrics for a tax-aware portfolio management process. To avoid the wash-sale rule (in which certain capital losses are disallowed), rebalance trades are carried out monthly.
Metrics.
We focus on four performance metrics: realized post-tax return, ex-ante risk, realized capital gains, and portfolio turnover. The return, risk, and turnover are annualized. The realized capital gains are reported in dollars over the five-year simulation.
Trading strategy.
We simulate an investment strategy based on Markowitz portfolio optimization. In this case, the features correspond to different terms in the optimization problem that can be on or off. More specifically, given the configuration x with n = 7 features, we determine the trade list by solving the optimization problem

    maximize    Σ_{i=1}^{5} x_i h^T α^(i) − γ σ(h)^2 − x_6 ℓ(h − h_0) − s^T |h − h_0|
    subject to  x_7 σ(h) ≤ σ_lim
                1^T h = 1,  h ≥ 0.                                    (6)

Here the decision variable is the post-trade portfolio h, expressed as a fraction of the account total; the pre-trade portfolio (which is given) is h_0. We describe the objective function and constraints in more detail below.

The first term in the objective function is an expected return forecast, which is divided into the five alpha vectors α^(1), …, α^(5), corresponding to the momentum, size, quality, value, and minimum volatility factors. The first five components of x control whether these five alpha vectors are on or off. The second term is the (scaled) squared active risk, defined as

    σ(h)^2 = (h − h_b)^T Σ (h − h_b),

where Σ is the return covariance matrix, h_b is the benchmark portfolio, and γ > 0 is the risk-aversion parameter. The third term ℓ(h − h_0) is the immediate tax liability, due to capital gains, required to reach the post-trade portfolio h, and is parametrized by the long- and short-term capital gains rates and the tax lots comprising the initial portfolio. (For more details on ℓ, see Moehle et al. [Moe+20].) The sixth component of x controls whether this term is on or off. The fourth and last term in the objective is a model of transaction cost, where s is the vector of bid-ask spreads for each asset.

The first constraint is a risk limit with parameter σ_lim > 0. (When x_7 = 0, this constraint is deactivated.) The second constraint is a full-investment constraint, and the last constraint specifies that the portfolio is long only.

Note that when x = 0, the portfolio aims to simply track the benchmark portfolio. The full configuration x = 1 means that all seven features are on, i.e., we use all five alpha sources, the capital gain objective term, and the risk limit.

Backtests.
All of our simulations use the S&P 500 as the benchmark portfolio, with data over the period 2002 to 2019. The alpha was obtained using methods similar to those of Kimura et al. [Kim+20]. We use the Barra US Equity model [MOW11] to define Σ and h_b, and used the risk-aversion parameter γ = 80. The tax rates used in ℓ were the long- and short-term capital gains rates (the long-term rate was 0.238), and we took s = 0.001·1, i.e., the bid-ask spread is 10 basis points for all assets. The risk limit is σ_lim = 2%.

Results.
Figures 6 and 7 show the attribution results using the Shapley, one-at-a-time, and leave-one-out methods. For each metric, the leftmost set of bars, labeled 'Base', shows the baseline attribution b for each of the three methods. (The attribution to the baseline is the same for all methods, as described in section 4.) The following seven sets of bars are the attributions a_1, …, a_7 corresponding to the seven features for each of the three methods. Table 5 shows the unattributed amount f(1) − 1^T a − b for each of the three methods and four metrics. For comparison, we show the metrics for the baseline and full configuration.

Table 5: The baseline value b = f(0), the full configuration value f(1), and the unattributed components f(1) − 1^T a − b for all three attribution methods.

By and large, we see the same phenomenon occur for all four metrics: one-at-a-time attribution over-attributes, i.e., it overestimates the contribution of each feature, because when only a single feature is included, it drives the portfolio selection process. On the other hand, the leave-one-out attribution under-attributes, i.e., it underestimates the contribution of each feature, because each single feature makes little difference when competing with the other six. The degree of over- or under-attribution depends on the specific metric and feature in question.

For example, when attributing the risk, this leads to serious problems with the one-at-a-time and leave-one-out attributions that are resolved by Shapley attribution. Under one-at-a-time attribution, the risk limit does not get any 'credit' for risk reduction. This is because the attribution of risk to the risk limit feature is the change in risk obtained by adding it to the benchmark portfolio; because the benchmark portfolio already has low risk, the risk limit has no effect. On the other hand, each of the five signals, when added to the benchmark portfolio, results in high risk. Therefore, with one-at-a-time attribution, risk is severely over-attributed to the five signals. This problem is also apparent in table 5: with the one-at-a-time method, the sum of the attributions to the features and the baseline is over 14%, far larger than the risk of the full configuration. The tax-awareness feature decreases turnover; this is reflected in the negative attribution of turnover to tax awareness with the Shapley and leave-one-out methods.

Conclusion. We propose the use of the Shapley value for portfolio performance attribution. Shapley attribution is the only method that possesses four properties that we believe are crucial for satisfactory portfolio performance attribution: fairness, correct baseline, full attribution, and monotonicity. (A fifth property, additivity, is discussed in appendix A.) We then compare Shapley attribution to other well-known attribution methods. Compared to other attribution methods, the only disadvantage of Shapley attribution is computational: the number of simulations required to carry out Shapley attribution is exponential in the number of features we attribute to. To overcome this, we describe sampling methods that compute approximate Shapley attributions from a smaller number of simulations.
Figure 6:
Attributions of return and risk for the tax-aware portfolio management example.
Figure 7:
Attributions of capital gains and turnover for the tax-aware portfolio management example.
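The over- and under-attribution pattern just described can be reproduced on a toy metric with fully redundant features (a constructed illustration, not data from the backtest): any single active feature already achieves the full value, so one-at-a-time credits every feature fully while leave-one-out credits none.

```python
from itertools import combinations
from math import factorial

def f(x):
    """Toy metric: fully redundant features, any one active suffices."""
    return min(sum(x), 1)

n = 3
e = lambda i: tuple(1 if j == i else 0 for j in range(n))
zeros, ones = (0,) * n, (1,) * n

# One at a time: lift of each feature over the empty configuration.
one_at_a_time = [f(e(i)) - f(zeros) for i in range(n)]
# Leave one out: loss from removing each feature from the full configuration.
leave_one_out = [f(ones) - f(tuple(1 - v for v in e(i))) for i in range(n)]

def shapley(f, n):
    """Exact Shapley attribution by enumeration (small n only)."""
    a = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        total = 0.0
        for k in range(n):
            w = factorial(k) * factorial(n - k - 1) / factorial(n)
            for S in combinations(others, k):
                x = [1 if j in S else 0 for j in range(n)]
                xe = list(x)
                xe[i] = 1
                total += w * (f(tuple(xe)) - f(tuple(x)))
        a.append(total)
    return a
```

Here one_at_a_time sums to 3 and leave_one_out sums to 0, while the Shapley attribution gives each feature 1/3, summing exactly to f(1) − f(0) = 1.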
Acknowledgements.
We would like to thank Eric Kisslinger for supporting us in carrying out the backtests for the tax-aware portfolio management example. We would also like to thank Ronald Kahn and Isaac Mao for useful early discussions and testing of Shapley attribution.

References

[Bac08] C. R. Bacon.
Practical portfolio performance measurement and attribution. Vol. 546. John Wiley & Sons, 2008.
[BHB86] G. P. Brinson, L. R. Hood, and G. L. Beebower. "Determinants of portfolio performance". In:
Financial Analysts Journal
Financial Analysts Journal
Methodology and Computing in Applied Probability
Computers & Operations Research
The Journal of Finance
Active portfolio management. McGraw-Hill, New York, NY, 2000.
[Gri06] R. C. Grinold. "Attribution". In:
The Journal of Portfolio Management
Applied Economics Letters
The Journal of Finance
Applied Stochastic Models in Business and Industry
Advances in Neural Information Processing Systems. 2017, pp. 4765–4774.
[LN+12] D. Liben-Nowell, A. Sharp, T. Wexler, and K. Woods. "Computing the Shapley value in supermodular coalitional games". In:
International Computing and Combinatorics Conference. Springer, 2012, pp. 568–579.
[Lo07] A. W. Lo. "Where do alphas come from?: A new measure of the value of active investment management". In:
A New Measure of the Value of Active Investment Management (May 8, 2007) (2007).
[Mal+13] S. Maleki, L. Tran-Thanh, G. Hines, T. Rahwan, and A. Rogers. "Bounding the estimation error of sampling-based Shapley value approximation". ArXiv preprint. 2013.
[Moe+20] N. Moehle, M. J. Kochenderfer, S. Boyd, and A. Ang. "Tax-aware portfolio construction via convex optimization". ArXiv preprint. 2020.
[Mou04] H. Moulin.
Fair division and collective welfare. MIT Press, 2004.
[MOW11] J. Menchero, D. Orr, and J. Wang. The Barra US equity model (USE4), methodology notes. English. MSCI, May 2011.
[MS+02] D. Monderer, D. Samet, et al. "Variations on the Shapley value". In:
Handbook of Game Theory
Applied Economics Letters
Decisions in Economics and Finance
Contributions to the Theory of Games
Journal of Portfolio Management
Knowledge and Information Systems
Attributing systemic risk to individual institutions. Tech. rep. Bank for International Settlements, 2010.
[You85] H. P. Young. "Monotonic solutions of cooperative games". In:
International Journal of Game Theory
A Additivity
In addition to the desiderata of section 3.1, Shapley attribution is additive. This means that if the metric can be decomposed into multiple components, such that f(x) = f_1(x) + ··· + f_k(x), then the Shapley attribution is given by a = a^(1) + ··· + a^(k) and b = b^(1) + ··· + b^(k), where a^(i) and b^(i) are the attribution of metric f_i to the features. This is especially useful when the metric is separable across time. In this case, f(x) is the value of the metric across a large time window (such as a year), and each f_i(x) is the value of the same metric over a shorter time window (such as a month or quarter). Examples of time-separable metrics are log-returns and squared risk.

                Benchmark   Country alloc.   Stock sel.   Full portfolio
    x           (0,0)       (1,0)            (0,1)        (1,1)
    f_uk(x)     4           4                8            8
    f_jp(x)     −0.8        −1.2             −1.0         −1.5
    f_us(x)     3.2         2.4              2.4          1.8
    f(x)        6.4         5.2              9.4          8.3

Table 6: Data for the returns-based attribution example, when further sub-divided by country.

A.1 Returns-based attribution
Here we return to the returns-based attribution example from section 6.1, where we now decompose the returns by country. Take f_uk(x) to be the weighted return on UK stocks, i.e., the portfolio weight in UK stocks multiplied by the return on UK stocks. (Equivalently, it is the change in value of the UK stocks over the investment period divided by the initial portfolio value.) Define f_jp(x) and f_us(x) similarly. This means that

    f(x) = f_uk(x) + f_jp(x) + f_us(x).

Table 6 shows example data, which are from Bacon [Bac08]. In particular, the benchmark weights are 40% (UK), 20% (Japan), and 40% (US), and the portfolio country allocation was 40%, 30%, and 30%, respectively. The benchmark returns, by country, were 10%, −4%, and 8%; the portfolio returns, by country, were 20%, −5%, and 6%.

Results.
In table 7, we show the results of using the three attribution methods from section 6.1, but now decomposed by country.

                           Benchmark   Country alloc.   Stock sel.   Unattr.
                           b           a_1              a_2          f(1,1) − a_1 − a_2 − b
    One at a time   UK     4           0                4            0
                    Japan  −0.8        −0.4             −0.2         −0.1
                    US     3.2         −0.8             −0.8         0.2
                    Total  6.4         −1.2             3.0          0.1
    Sequential      UK     4           0                4            0
                    Japan  −0.8        −0.4             −0.3         0
                    US     3.2         −0.8             −0.6         0
                    Total  6.4         −1.2             3.1          0
    Shapley         UK     4           0                4            0
                    Japan  −0.8        −0.45            −0.25        0
                    US     3.2         −0.7             −0.7         0
                    Total  6.4         −1.15            3.05         0

Table 7: Attribution results for the returns-based attribution example, decomposed by country.
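The per-country attributions and the additivity property can be checked with a short script; the weights and returns are transcribed from the example data above, and the closed-form two-feature Shapley helper is ours:

```python
# Country data transcribed from the example: weights and returns (percent).
w_bench = {"uk": 0.4, "jp": 0.2, "us": 0.4}
w_port  = {"uk": 0.4, "jp": 0.3, "us": 0.3}
r_bench = {"uk": 10.0, "jp": -4.0, "us": 8.0}
r_port  = {"uk": 20.0, "jp": -5.0, "us": 6.0}

def f_c(c, x):
    """Weighted return of country c at configuration x = (x1, x2), where
    x1 = country allocation active and x2 = stock selection active."""
    w = w_port[c] if x[0] else w_bench[c]
    r = r_port[c] if x[1] else r_bench[c]
    return w * r

def shapley2(f):
    """Closed-form Shapley attribution for a metric on {0,1}^2."""
    b = f((0, 0))
    a1 = 0.5 * ((f((1, 0)) - f((0, 0))) + (f((1, 1)) - f((0, 1))))
    a2 = 0.5 * ((f((0, 1)) - f((0, 0))) + (f((1, 1)) - f((1, 0))))
    return b, a1, a2

per_country = {c: shapley2(lambda x, c=c: f_c(c, x)) for c in w_bench}
total = shapley2(lambda x: sum(f_c(c, x) for c in w_bench))
```

For each of b, a_1, and a_2, the per-country attributions sum to the corresponding entry of the attribution of the total return, which is exactly the additivity property.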