Journal of Machine Learning Research ? (2014) ?-? Submitted 11/14; Published ??/??
Multi-Target Shrinkage
Daniel Bartz* [email protected]
Department of Computer Science, TU Berlin, Marchstraße 23, 10587 Berlin, Germany

Johannes Höhne [email protected]
Department of Computer Science, TU Berlin, Marchstraße 23, 10587 Berlin, Germany

Klaus-Robert Müller* [email protected]
Department of Computer Science, TU Berlin, Marchstraße 23, 10587 Berlin, Germany
Korea University, Seoul, Korea

*Corresponding authors.
Editor: ???
Abstract
Stein showed that the multivariate sample mean is outperformed by "shrinking" to a constant target vector. Ledoit and Wolf extended this approach to the sample covariance matrix and proposed a multiple of the identity as shrinkage target. In a general framework, independent of a specific estimator, we extend the shrinkage concept by allowing simultaneous shrinkage to a set of targets. Application scenarios include settings with (A) additional data sets from potentially similar distributions, (B) non-stationarity, (C) a natural grouping of the data or (D) multiple alternative estimators which could serve as targets. We show that this Multi-Target Shrinkage can be translated into a quadratic program and derive conditions under which the estimation of the shrinkage intensities yields optimal expected squared error in the limit. For the sample mean and the sample covariance as specific instances, we derive conditions under which the optimality of MTS is applicable. We consider two asymptotic settings: the large dimensional limit (LDL), where the dimensionality and the number of observations go to infinity at the same rate, and the finite observations large dimensional limit (FOLDL), where only the dimensionality goes to infinity while the number of observations remains constant. We then show the effectiveness in extensive simulations and on real world data.
Keywords: Covariance Estimation, Shrinkage, Large Dimensional Limit, Linear Discriminant Analysis, Transfer Learning
1. Introduction and Motivation
Shrinkage is a widely applied estimation technique dating back to Charles Stein (Stein, 1956; James and Stein, 1961). Stein showed that the sample mean is not admissible, i.e., that a shrinkage estimator of the mean is always at least as good. The performance gain is achieved by optimizing the bias-variance trade-off between the unbiased, high-variance sample estimate and a biased, low-variance target.

[Figure 1: Geometric illustration of Multi-Target Shrinkage. The unbiased estimate $\hat\theta$ and the two targets $\hat T^1$ and $\hat T^2$ span a convex set. The optimal MTS estimate is the estimate in the convex set with minimum squared distance to the truth $\theta$.]

Over the last years, shrinkage has become very popular for the estimation of covariance matrices. Ledoit and Wolf proposed an analytic formula for covariance shrinkage which allows one to calculate the optimal shrinkage intensity w.r.t. expected squared error (ESE) at low computational cost (Ledoit and Wolf, 2004) and which serves as an alternative to time-consuming cross-validation. Shrinkage has further been applied to wavelets (Donoho and Johnstone, 1995) and density estimators (Sancetta, 2013).

In the following, we propose a generalization of the analytic shrinkage approach, henceforth called Single-Target Shrinkage (STS), to multiple shrinkage targets. Figure 1 illustrates Single- and Multi-Target Shrinkage (MTS) of an unbiased estimator $\hat\theta$ of a parameter $\theta$ for the case of two available shrinkage targets $\hat T^1$ and $\hat T^2$. The convex combinations of the three estimators span a triangle whose color coding visualizes the squared error of each combination. The two standard Single-Target Shrinkage estimators
$$\hat\theta^{STS1}(\lambda_1) = (1-\lambda_1)\,\hat\theta + \lambda_1 \hat T^1, \qquad \hat\theta^{STS2}(\lambda_2) = (1-\lambda_2)\,\hat\theta + \lambda_2 \hat T^2$$
are restricted to the lines connecting $\hat\theta$ with $\hat T^1$ and $\hat T^2$, respectively. For the optimal shrinkage intensities $\lambda^\star_{STS1}$ and $\lambda^\star_{STS2}$, both estimators improve over $\hat\theta$. Further improvement can be achieved by the Multi-Target Shrinkage estimator
$$\hat\theta^{MTS}(\lambda_1, \lambda_2) = (1-\lambda_1-\lambda_2)\,\hat\theta + \lambda_1 \hat T^1 + \lambda_2 \hat T^2,$$
the optimal convex combination of the sample estimate and the two targets.
1. Note that we do not use different symbols for the estimator (a random variable) and the estimate (a realization of the random variable). It will be clear from the context to which we refer.
2. The optimum can lie on the border of the triangle if one of the targets is completely useless. Otherwise it will lie within the triangle.
This is nicely seen in Figure 1, where we have
$$\Delta_{MTS} := \|\theta - \hat\theta^{MTS}\|^2 < \|\theta - \hat\theta^{STS1/STS2}\|^2 =: \Delta_{STS1/STS2}.$$

As an illustration, we consider MTS for the estimation of subject-specific mean images on a data set of handwritten digits (Alimoglu and Alpaydin, 1997; Bache and Lichman, 2013). Assume we want to estimate the mean image of digit 9 of person A from a small number of observations. In this case MTS improves over the sample mean image and over STS by shrinking towards the mean images of two other subjects, T1 and T2. This can be seen in Figure 2: for MTS, the differences to the truth are less pronounced than for STS, and the squared error is smaller.

[Figure 2: Geometric illustration of Multi-Target Shrinkage for handwritten digits. The targets are the mean images of digit 9 for two different subjects ($\Delta_{STS1} = 14.7$, $\Delta_{STS2} = 9.4$, $\Delta_{MTS} = 6.4$).]

The illustrations in Figures 1 and 2 are limited to the case of simultaneous shrinkage to two shrinkage targets. MTS can handle an arbitrary number of shrinkage targets $\hat T^1, \hat T^2, \ldots, \hat T^K$. Figure 3 shows this for the handwritten digits: incorporating more and more targets, the squared error decreases.

There are many application scenarios for Multi-Target Shrinkage:

• similar data sets: assume that $K$ additional data sets from similar distributions exist. Then we can calculate a target $\hat T^k$ on each additional data set and use MTS to decide how useful the other data sets are for the estimation task. This is a special case of transfer learning (see Pan and Yang, 2010, for a recent review). The handwritten digits example (Figure 2) falls into this category.
3. The data set consists of 10992 traces, approximately equally distributed over 44 subjects and the 10 digits 0, 1, ..., 9. We converted the traces into images of size 30 × 30. The mean image over all available observations of subject A serves as a proxy to the truth.
[Figure 3: Decay of the squared error for an increasing number of shrinkage targets ($K = 0$: $\Delta_{\hat\theta} = 14.2$; $K = 1$: $\Delta_{STS} = 9.4$; $K = 43$: $\Delta_{MTS} = 5.6$). Average over $R = 10000$ random choices of digits and subjects.]

• data with group structure: if there is a natural group structure in a data set, one can estimate $\theta$ either (A) on the whole data set or (B) on each group separately.
  – When $\theta$ is independent of group membership, (A) is optimal and MTS yields approximately equal weights.
  – When $\theta$ is very different for each group, (B) is optimal and MTS puts approximately no weight on the targets.
  – When $\theta$ depends on group membership but is similar across groups, MTS provides an optimal weighting of each group which is superior to both (A) and (B).

• non-stationarity: assume that the parameter $\theta$ is non-stationary. MTS can yield a superior estimate of the current value of $\theta$ by treating older segments of the data as shrinkage targets.

• multiple available targets: for covariance shrinkage, a set of biased estimators has been proposed as shrinkage targets: the identity, a multiple of the identity, a diagonal matrix, constant and perfect correlation matrices or, in a finance context, a factor model (see Schäfer and Strimmer, 2005; Ledoit and Wolf, 2003). Which one of these structured estimators constitutes the best target depends on the structure of the true covariance matrix. The choice is based on expert knowledge or cross-validation. In contrast, MTS does not make a choice but yields an optimal weighting of all targets which is equal or superior to the optimal choice.

We have stated above that the optimal STS intensity can be estimated by minimizing the ESE or by a slower cross-validation approach. For MTS, the computational cost of cross-validating $K$ parameters grows exponentially with $K$, which is not feasible. We therefore extend the approach of minimizing the ESE to multiple shrinkage targets.

In Section 3 we introduce the MTS approach independently of a specific estimator and derive a quadratic program for the optimal shrinkage intensities. We then prove conditions under which the MTS estimate on a sequence of statistical models converges to the optimum. For the sample mean (Section 4) and the sample covariance matrix (Section 5) we show when these conditions are fulfilled. We consider two asymptotic settings: the large dimensional limit (LDL), where the dimensionality and the number of observations go to infinity at the same rate, and the finite observations large dimensional limit (FOLDL), where only the number of dimensions goes to infinity while the number of observations remains constant. In both settings MTS is consistent, although we will show that the FOLDL requires stronger restrictions on the covariance structure. Section 6 presents simulations which illustrate the theorems and demonstrate the capabilities of MTS. Section 7 shows applications on real world data.
2. Notation, distributional assumptions and asymptotic framework
General notation
Our notation adheres to the following conventions:

• Matrices $\mathbf M$ and vectors $\mathbf v$ are written in upper case and lower case bold letters, respectively; their entries are given by $M_{ij}$ and $v_i$. $\mathbf m_j$ denotes the $j$-th column of the matrix $\mathbf M$, with entries $m_{ij} \equiv M_{ij}$.

• Quantities with a hat, $\widehat{\mathbf M}$ and $\hat{\mathbf v}$, always denote estimators.

• $\mathrm{Var}(a)$ and $\mathrm{Cov}(a, b)$ denote the variance of $a$ and the covariance between $a$ and $b$, respectively.

• $\widehat{\mathrm{Var}}(a)$ and $\widehat{\mathrm{Cov}}(a, b)$ denote estimators of variance and covariance which have to be specified for each set of parameters $a$ and $b$.

• For asymptotic behaviour, we make use of the Bachmann-Landau symbols $O$, $o$ and $\Theta$. We here only define the less frequently used $\Theta$, which denotes asymptotically bounded from above and below:
$$f = \Theta(g) \iff \exists\, c > 0\ \exists\, C > 0\ \exists\, x_0 > 0\ \forall x > x_0:\quad c \cdot |g(x)| \le |f(x)| \le C \cdot |g(x)|.$$
In Section 3 the general case is analysed:

• We consider the estimation of a set of parameters $\theta = (\theta_1, \theta_2, \ldots, \theta_q) \in \mathbb R^q$ for which we assume the existence of an unbiased estimator $\hat\theta$.

• Optimality is defined w.r.t. the expected squared error (ESE), which we denote by $\Delta$. For example, the ESE of the unbiased estimator $\hat\theta$ is denoted by
$$\Delta_{\hat\theta} := \mathrm E\, \|\hat\theta - \theta\|^2.$$
We always consider the 2-norm (the Frobenius norm for matrix-valued parameters).

• To study the behaviour in the limit, we consider the estimation on a general sequence of models indexed by $p$.

Setting: general — parameters $\theta$, unbiased estimator $\hat\theta$, $q$ parameters.
Setting: mean — $\mu := \mathrm E[x_i]$, $\hat\mu := n^{-1}\sum_i x_i$, $q = p$.
Setting: covariance — $C := \mathrm E[(x_i - \mu)(x_i - \mu)^\top]$, $\hat C := (n-1)^{-1}\sum_i (x_i - \hat\mu)(x_i - \hat\mu)^\top$, $q = p^2$.

Table 1: general, mean and covariance MTS.

Notation for MTS of the mean and the covariance
In Sections 4 and 5 we consider the estimation of the mean and the covariance matrix, respectively. There,

• the sequence index $p$ also denotes the dimensionality of the $n_p$ i.i.d. observations with mean $\mu_p$ and covariance $C_p$, given by the $(p \times n_p)$-matrix $X_p$.

• We consider $K$ additional data sets with means $\mu^k_p$ and covariances $C^k_p$; their $n^k_p$ i.i.d. observations are given by the $(p \times n^k_p)$-matrices $X^k_p$.

• $\gamma^{(k)}_{p,1}, \gamma^{(k)}_{p,2}, \ldots, \gamma^{(k)}_{p,p}$ denote the eigenvalues of $C^{(k)}_p$.

• $Y^{(k)}_p = R^{(k)\top}_p X^{(k)}_p$ denotes the observations in their respective eigenbasis, where the covariance matrices $\Sigma^{(k)}_p = R^{(k)\top}_p C^{(k)}_p R^{(k)}_p$ are diagonal. The mean in the eigenbasis is denoted by $\mu^{Y(k)}_p$.

• For two data sets $X^{(k)}_p$ and $X^{(l)}_p$, we denote $Z^{(k)}_p = R^{(l)\top}_p X^{(k)}_p$. From the context it will be clear which $l$ was used to obtain $Z^{(k)}_p$.

• In the following we will always omit the sequence index $p$ to obtain less cluttered notation.

Table 1 gives an overview of the different MTS scenarios considered in this paper.

Distributional assumptions
We assume:
$$(\forall k:)\quad \frac 1p \sum_{i=1}^p \gamma^{(k)}_i = \Theta(1). \tag{A1}$$
$$(\forall k)\ \exists\, \tau^{(k)}_\gamma:\quad \frac 1p \sum_{i=1}^p \big(\gamma^{(k)}_i\big)^2 = \Theta\big(p^{\tau^{(k)}_\gamma}\big). \tag{A2}$$
$$\exists\, \alpha_1, \beta_1:\quad (1+\beta_1)\, \mathrm E^2[y_i^2] \le \mathrm E[y_i^4] \le (1+\alpha_1)\, \mathrm E^2[y_i^2]. \tag{A3}$$
$$\exists\, \alpha_2, \beta_2:\quad (1+\beta_2)\, \mathrm E^2[y_i^4] \le \mathrm E[y_i^8] \le (1+\alpha_2)\, \mathrm E^2[y_i^4]. \tag{A4}$$

Assumption (A1) states, for each data set, that for an increasing number of dimensions the variance per dimension is bounded from above and below. Assumption (A2) restricts the dispersion of the eigenvalues: for increasing dimensionality, the dispersion is assumed to have a well-defined limit behaviour. Note that (A1) implies $0 \le \tau^{(k)}_\gamma \le 1$.

Asymptotic settings
We consider two different asymptotic settings:

• LDL: the standard setting in Random Matrix Theory and for the analysis of covariance shrinkage is the large dimensional limit ($n, p \to \infty$, $n/p \to c$) (Ledoit and Wolf, 2004). In the LDL the sample mean remains a consistent estimator; this does not hold for the sample covariance matrix. We assume that $n^k/p \to c^k$ holds for the additional data sets.

• FOLDL: in addition we consider the finite observations large dimensional limit ($p \to \infty$, $n = c$, $n^k = c^k$). In the FOLDL, neither the sample covariance nor the sample mean is consistent.

Table 2 gives an overview of the notation in the paper.
3. Multi-Target Shrinkage
In Single-Target Shrinkage, the linear combination of an unbiased estimator $\hat\theta$ with another estimator $\hat T$ (called the shrinkage target) is optimized. In most cases, the linear combination is restricted to be convex (Ledoit and Wolf, 2004; Schäfer and Strimmer, 2005):
$$\hat\theta^{STS}(\lambda) := (1-\lambda)\, \hat\theta + \lambda \hat T.$$
In this manuscript, we generalize to optimizing the convex combination with a set of $K$ targets:
$$\hat\theta^{MTS}(\lambda) := \Big(1 - \sum_{k=1}^K \lambda_k\Big)\hat\theta + \sum_{k=1}^K \lambda_k \hat T^k, \tag{1}$$
where $\lambda = (\lambda_1, \lambda_2, \ldots, \lambda_K) \in \mathbb R^K_{\ge 0}$ is subject to $\sum_k \lambda_k \le 1$. The MTS objective is given by
$$\Delta_{MTS}(\lambda) := \mathrm E\, \big\| \theta - \hat\theta^{MTS}(\lambda) \big\|^2. \tag{2}$$
From the MTS objective we derive a quadratic program for the optimal value of $\lambda$:

Theorem 1 (MTS quadratic program) Let the MTS quadratic program be defined by
$$\Delta_{MTSqp}(\lambda) := \tfrac 12\, \lambda^\top A \lambda - b^\top \lambda \tag{3}$$
with
$$A_{kl} := \sum_{i=1}^q \mathrm E\big[\big(\hat T^k_i - \hat\theta_i\big)\big(\hat T^l_i - \hat\theta_i\big)\big], \qquad b_k := \sum_{i=1}^q \big\{ \mathrm{Var}(\hat\theta_i) - \mathrm{Cov}(\hat T^k_i, \hat\theta_i) \big\}.$$
Then it is equivalent to optimize $\Delta_{MTS}(\lambda)$ and $\Delta_{MTSqp}(\lambda)$:
$$\lambda^\star := \mathop{\arg\min}_{\lambda \in \mathbb R^K_{\ge 0},\ \sum_k \lambda_k \le 1} \Delta_{MTS}(\lambda) = \mathop{\arg\min}_{\lambda \in \mathbb R^K_{\ge 0},\ \sum_k \lambda_k \le 1} \Delta_{MTSqp}(\lambda). \tag{4}$$

Proof see appendix.

5. Setting $\hat T^{K+1} = 0$ and allowing for $\lambda \in \mathbb R^{K+1}$, this turns into an arbitrary linear combination which can deal with arbitrarily rescaled targets. The theoretical results can be extended at the cost of clarity and accessibility.

symbol                      meaning
n                           number of observations
p                           dimensionality / index of the sequence of models
q                           number of parameters
f = Θ(g)                    f asymptotically bounded from above and below by g
f = O(g)                    f asymptotically bounded from above by g
f = o(g)                    f asymptotically dominated by g
θ                           set of parameters
θ̂                           unbiased estimate of the set of parameters
τ_θ̂                         limit behaviour of the unbiased estimator (G1)
Δ_θ̂                         expected squared error, here of the unbiased estimator
μ, μ̂                        mean and sample mean
C, S                        covariance and sample covariance
γ₁^(k), ..., γ_p^(k)        eigenvalues of C^(k)
(symbol with hat)           estimate calculated on the data
τ_γ                         limit behaviour of the average squared eigenvalue (A2)
X                           observations (p × n matrix)
Y                           observations in the eigenbasis (p × n matrix)
R                           rotation into the eigenbasis (p × p matrix)
Z                           observations in the eigenbasis of a different data set (p × n matrix)
(symbol)^k                  for each symbol, k stands for the k-th data set
α₁, β₁                      bounds on the ratio between second and fourth moments (A3)
α₂, β₂                      bounds on the ratio between fourth and eighth moments (A4)
c                           ratio between number of observations and dimensionality n/p
K                           number of shrinkage targets
T^k                         k-th shrinkage target
λ_k                         shrinkage intensity of the k-th shrinkage target
A                           matrix containing estimates of the quality of the targets
b                           vector containing variance of sample estimate and correlation to targets
τ_A^k                       limit behaviour of the quality of target k (G2)
τ_μ^k                       limit behaviour of the quality of the mean of data set k (M1)
τ_C^k                       limit behaviour of the quality of the covariance of data set k (C1)
Q_p                         set of all quadruples consisting of distinct integers between 1 and p
|Q_p|                       cardinality of Q_p

Table 2: overview of the notation.
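To make the optimization concrete, the following minimal sketch solves the constrained quadratic program of eq. (4) — or its plug-in version introduced below — with an off-the-shelf solver. It assumes NumPy and SciPy are available; the function name `mts_intensities` is ours, not from the paper.

```python
import numpy as np
from scipy.optimize import minimize

def mts_intensities(A, b):
    """Minimize 0.5 * lam' A lam - b' lam  subject to  lam >= 0, sum(lam) <= 1."""
    K = len(b)
    objective = lambda lam: 0.5 * lam @ A @ lam - b @ lam
    gradient = lambda lam: A @ lam - b
    simplex = {"type": "ineq", "fun": lambda lam: 1.0 - lam.sum()}
    res = minimize(objective, x0=np.zeros(K), jac=gradient, method="SLSQP",
                   bounds=[(0.0, 1.0)] * K, constraints=[simplex])
    return res.x
```

Since $A$ is a Gram matrix of expectations of inner products, it is positive semi-definite; the program is therefore convex and a local solver returns the global optimum.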
The quadratic program is governed by the parameters $A$ and $b$, quantifying the quality of the targets and of the unbiased estimator, respectively. The vector $b$ contains the variance of the unbiased estimator, adjusted for correlation with the targets. The diagonal elements of the matrix $A$ contain information on the variance and bias of the targets and on their correlation with the unbiased estimator. A target $\hat T^k$ is useful if the entry $A_{kk}$ is small relative to the variance of the unbiased estimator. The off-diagonal elements of the matrix $A$ contain information on the correlation between targets.

The optimal shrinkage intensities $\lambda^\star$ depend on the unknown parameters $A$ and $b$ of the quadratic program, eq. (4). We propose the following estimators:
$$\hat\lambda := \mathop{\arg\min}_{\lambda \in \mathbb R^K_{\ge 0},\ \sum_k \lambda_k \le 1} \hat\Delta_{MTSqp}(\lambda), \qquad \hat\Delta_{MTSqp}(\lambda) := \tfrac 12\, \lambda^\top \hat A \lambda - \hat b^\top \lambda, \tag{5}$$
with
$$\hat A_{kl} := \sum_{i=1}^q \big(\hat T^k_i - \hat\theta_i\big)\big(\hat T^l_i - \hat\theta_i\big), \qquad \hat b_k := \sum_{i=1}^q \big\{ \widehat{\mathrm{Var}}(\hat\theta_i) - \widehat{\mathrm{Cov}}(\hat T^k_i, \hat\theta_i) \big\}, \tag{6}$$
where the unbiased estimator $\hat\theta$, the targets $\hat T^k$ and the estimators of variance and covariance appearing in $\hat b$ depend on the application scenario.

For a general parameter set $\theta$, the following theorem relates the limit behaviour of the estimators in $\hat b$ and of linear combinations of the estimators in $\hat A$ to the limit behaviour of $\Delta_{MTS}(\hat\lambda)$ and $\hat\lambda$:

Theorem 2 (consistency of MTS) Let us assume a sequence of models indexed by $p$ such that
$$\exists\, \tau_{\hat\theta}:\quad \Delta_{\hat\theta} = \Theta\big(p^{\tau_{\hat\theta}}\big), \tag{G1}$$
$$\forall k\ \exists\, \tau^k_A:\quad A_{kk} = \Theta\big(p^{\tau^k_A}\big), \qquad \forall k:\quad b_k = \Theta\big(p^{\tau_{\hat\theta}}\big), \tag{G2}$$
$$\big\| \hat A_{kl} - A_{kl} \big\| = o\big(p^{0.5(\tau^k_A + \tau^l_A)}\big), \qquad \big\| \hat b_k - b_k \big\| = o\big(p^{\tau_{\hat\theta}}\big), \tag{G3}$$
$$\forall k:\quad \min_{\alpha \in \mathbb R^K_{\ge 0},\ \alpha_k = 1} \sum_{i=1}^q \mathrm E\Big[\Big(\sum_{l=1}^K \alpha_l \big(\hat T^l_i - \hat\theta_i\big)\Big)^2\Big] = \Theta\big(p^{\tau^k_A}\big). \tag{G4}$$
We then have
$$\forall k:\quad \lambda^\star_k,\ \hat\lambda_k = O\big(p^{(\tau_{\hat\theta} - \tau^k_A)/2}\big), \tag{i}$$
$$\frac{\Delta_{MTS}(\hat\lambda) - \Delta_{MTS}(\lambda^\star)}{\Delta_{\hat\theta}} = o(1). \tag{ii}$$
If one strengthens (G4) to hold $\forall \alpha \in \mathbb R^K$, we also have
$$\|\lambda^\star - \hat\lambda\| = o(1). \tag{iii}$$

Proof see appendix.
Assumptions (G1) and (G2) state that all estimators have a well-defined limit behaviour w.r.t. ESE. In addition, $\Delta_{\hat\theta}$ and $b_k$ having the same limit behaviour implies that none of the targets is identical to the unbiased estimator. Assumption (G3) states that the relative errors in the entries of the estimators $\hat A_{kl}$ and $\hat b_k$ go to zero in the limit. We call this property consistency of $\hat A$ and $\hat b$. Assumption (G4) states that a linear combination of a set of targets cannot have better limit behaviour w.r.t. ESE than the best single target in the set. This is needed because linear dependence of targets can result in $A$ having small eigenvalues for which the relative error does not go to zero.

To illustrate the assumptions, consider the handwritten digits example. A possible sequence of models consists of images with increasing resolution ($\sqrt p \times \sqrt p$ pixels) and an increasing number of observations for each subject. The sequence of ESEs of the sample estimator for subject A would then have a clear limit behaviour and hence fulfil (G1). The similarity between the digits of subject A and, e.g., subject T1 defines the similarity of the mean images; hence a clear limit behaviour of $A$ (G2) is to be expected. With increasing $p$ and $n$, we can better estimate the variance of the sample mean and the similarity between subjects, and hence the relative errors in $\hat b$ and $\hat A$ would go to zero (G3). Two subjects T1 and T2 whose differences to subject A exactly cancel out in a linear combination would violate assumption (G4); this is highly unlikely.

Part (i) of Theorem 2 states that a target $\hat T^k$ which has worse limit behaviour w.r.t. ESE than the sample estimator $\hat\theta$ does not contribute in the limit. Part (ii) is the most important result: it states that the expected squared error of the MTS estimator at $\hat\lambda$ (normalized by the error of the sample estimator) converges to the ESE at the optimal $\lambda^\star$. We call this property consistency of MTS. Part (iii) shows that $\lambda^\star$ is, under a restriction on the linear dependency of the targets, identifiable and that the estimator $\hat\lambda$ converges to $\lambda^\star$. We call this consistency of the estimator $\hat\lambda$.
6. For an off-diagonal element $A_{kl}$, we consider the error relative to $\sqrt{A_{kk} A_{ll}}$.
7. Note that $\frac{\Delta_{MTS}(\hat\lambda) - \Delta_{MTS}(\lambda^\star)}{\Delta_{MTS}(\lambda^\star)} = o(1)$ does not hold in general.
4. Multi-Target Shrinkage of the mean
In this section we apply the MTS approach to the $p$-dimensional sample mean: $\theta = \mu$, $\hat\theta = \hat\mu = (\hat\mu_1, \hat\mu_2, \ldots, \hat\mu_{q=p})$. As shrinkage targets we take a set of sample means $\hat\mu^1, \hat\mu^2, \ldots, \hat\mu^K$ of additional data sets $X^1, X^2, \ldots, X^K$, drawn from potentially different distributions. We obtain
$$A_{kl} = \sum_{i=1}^p \mathrm E\big[(\hat\mu^k_i - \hat\mu_i)(\hat\mu^l_i - \hat\mu_i)\big], \qquad b_k = \sum_{i=1}^p \big\{ \mathrm{Var}(\hat\mu_i) - \mathrm{Cov}(\hat\mu^k_i, \hat\mu_i) \big\}. \tag{7}$$
$\mathrm{Cov}(\hat\mu^k_i, \hat\mu_i) = 0$ holds, and for the sample estimates $\hat A$ and $\hat b$ we propose
$$\hat A_{kl} := \sum_{i=1}^p (\hat\mu^k_i - \hat\mu_i)(\hat\mu^l_i - \hat\mu_i), \qquad \hat b_k := \hat b := \sum_{i=1}^p \widehat{\mathrm{Var}}(\hat\mu_i), \tag{8}$$
where the estimator of the variance of the sample mean is given by
$$\widehat{\mathrm{Var}}(\hat\mu_i) := \frac 1{n(n-1)} \sum_{t=1}^n (x_{it} - \hat\mu_i)^2.$$
Remark MTS of the mean can be seen as a weighting of each data point. Data points in $X$ are weighted by $(1 - \sum_{l=1}^K \lambda^\star_l)\, n^{-1}$ and data points in $X^k$ are weighted by $\lambda^\star_k\, n_k^{-1}$. Assuming that the distributions of the data sets only differ with respect to their means, the optimal weight of each original data point is larger than or equal to the weight of the data points from the additional data sets. This translates into a constraint on the quadratic program:
$$\forall k:\quad \lambda^\star_k\, n_k^{-1} \le \Big(1 - \sum_{l=1}^K \lambda^\star_l\Big)\, n^{-1}.$$
The constraint is reasonable to impose in many applications and increases numerical stability.
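A minimal sketch of the plug-in quantities of eq. (8), assuming independent data sets so that the covariance term in $b$ vanishes; `mean_mts_coefficients` is a hypothetical helper name, and the solver `mts_intensities` is the sketch from Section 3.

```python
import numpy as np

def mean_mts_coefficients(X, extra):
    """A_hat and b_hat of eq. (8). X: (p, n) observations; extra: list of (p, n_k) arrays."""
    p, n = X.shape
    mu = X.mean(axis=1)
    D = np.stack([Xk.mean(axis=1) - mu for Xk in extra])   # rows: mu^k_hat - mu_hat
    A_hat = D @ D.T                                        # A_hat[k,l] = sum_i (mu^k_i - mu_i)(mu^l_i - mu_i)
    # b_hat = sum_i Var_hat(mu_i), with Var_hat(mu_i) = 1/(n(n-1)) sum_t (x_it - mu_i)^2
    b_hat_value = ((X - mu[:, None]) ** 2).sum() / (n * (n - 1))
    return A_hat, np.full(len(extra), b_hat_value)
```

The weighting constraint of the remark above can be added to the solver as further inequality constraints of the form $\lambda_k n_k^{-1} \le (1 - \sum_l \lambda_l)\, n^{-1}$.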
In this section we establish the conditions under which MTS of the mean is consistent by showing when the estimators of eq. (8) fulfill the assumptions of Theorem 2. We show this for both asymptotic settings.
LDL consistency of MTS of the mean
We first consider the LDL.
Theorem 3 (LDL consistency of MTS of the mean)
Let us assume a sequence of statistical models indexed by $p$ for which (A1), (A2), (A3) and
$$\forall k\ \exists\, \tau^k_\mu \le 2:\quad \|\mu^k - \mu\|^2 = \Theta\big(p^{\tau^k_\mu}\big), \tag{M1}$$
$$\forall k:\quad \tau^k_\gamma < \min(2, \tau^k_\mu) + 1 \quad \text{and} \quad \tau_\gamma < \min(2, \min_k \tau^k_\mu) + 1, \tag{M2}$$
$$\forall k \mid \tau^k_\mu > 0:\quad \min_{\alpha \in \mathbb R^K_{\ge 0},\ \alpha_k = 1} \Big\| \sum_l \alpha_l (\mu^l - \mu) \Big\|^2 = \Theta\big(p^{\tau^k_\mu}\big) \tag{M3}$$
hold. Then assumptions (G1), (G2), (G3) and (G4) of Theorem 2 are fulfilled, MTS of the mean is consistent and
$$\forall k \mid \tau^k_\mu > 0:\quad \lambda^\star_k,\ \hat\lambda_k = O\big(p^{-\tau^k_\mu/2}\big)$$
holds. If (M3) holds for $\alpha \in \mathbb R^K$, $\lambda^\star$ is identifiable and $\hat\lambda$ is consistent.

Proof see appendix.
Assumption (M1) states that the distance between data and target mean needs to have a clear limit behaviour; we exclude unrealistic sequences of models with $\tau^k_\mu > 2$. Assumption (M2) restricts the eigenvalue dispersion in dependence of the mean distances: if the dispersion is too large, strong eigendirections do not average out, and small distances between data and target mean cannot be estimated reliably. Assumption (M3) states that there are no target means which, linearly combined, have better asymptotic behaviour than the single target means.

Theorem 3 states conditions under which MTS of the mean is consistent in the LDL. In addition it states that data sets with increasing mean distance (M1) do not contribute to the MTS estimate in the LDL limit: for $n \to \infty$, these data sets do not remain useful because the sample mean is consistent.

FOLDL consistency of MTS of the mean
We now consider the case where only the dimensionality $p$ goes to infinity, while $n$ remains constant.

Theorem 4 (FOLDL consistency of MTS of the mean)
Let us assume a sequence of statistical models indexed by $p$ for which (A1), (A2), (A3), assumption (M1) from Theorem 3 and
$$\forall k:\quad \tau^k_\gamma < 1 \quad \text{and} \quad \tau_\gamma < 1, \tag{M2′}$$
$$(\forall k:)\quad \sum_{i,\, j \ne i} \mathrm{Cov}\big( (y^{(k)}_{is})^2,\ (y^{(k)}_{js})^2 \big) = o\big(p^2\big) \tag{M4}$$
hold. Then assumptions (G1), (G2), (G3) and (G4) of Theorem 2 are fulfilled, MTS of the mean is consistent and $\hat\lambda$ is a consistent estimator.

In the FOLDL, consistency results from averaging over dimensions. Therefore, consistency requires stronger restrictions on the correlation between dimensions. Assumption (M2′) states that the dispersion of the eigenvalues (A2) has to grow more slowly than $\Theta(p)$; otherwise strong eigendirections exist whose influence on the MTS estimate remains at a constant level along the sequence of models. Assumption (M4) states that the correlation between the squares of uncorrelated variables, on average, converges to zero. Note that identifiability holds even without assumption (M3).
5. Multi-Target Shrinkage of the covariance matrix
In the second application of MTS we consider sample covariance matrices:
$$\theta = C, \qquad \hat\theta = S, \qquad S_{ij} = \frac 1{n-1} \sum_{s=1}^n (x_{is} - \hat\mu_i)(x_{js} - \hat\mu_j).$$
For the sample covariance matrix, we consider two classes of targets:

• As for the sample mean, it is possible to shrink to a set of sample covariance matrices $S^1, \ldots, S^K$ from additional data sets $X^1, X^2, \ldots, X^K$.

• A variety of biased estimators $\hat C^1, \hat C^2, \ldots, \hat C^K$ of the covariance matrix exists which can be used as targets. An overview is given in (Schäfer and Strimmer, 2005). Examples:
  – $\hat T^{id} = p^{-1}\, \mathrm{trace}(S) \cdot I$,
  – $\hat T^{diag} = S \circ I$ (elementwise product),
  – $\hat T^{const.corr.} = S \circ I + F \circ (\mathbb 1 - I)$, where $F_{ij} = \sqrt{S_{ii} S_{jj}}\, \bar r$, $\mathbb 1$ denotes the all-ones matrix and $\bar r$ is the average correlation between dimensions.

In total, we obtain a set of targets $\hat T^1, \hat T^2, \ldots, \hat T^K$ for which we have
$$A_{kl} = \sum_{i,j=1}^p \mathrm E\big[(\hat T^k_{ij} - S_{ij})(\hat T^l_{ij} - S_{ij})\big] \qquad \text{and} \qquad b_k = \sum_{i,j=1}^p \big\{ \mathrm{Var}(S_{ij}) - \mathrm{Cov}(\hat T^k_{ij}, S_{ij}) \big\}.$$
For the sample estimates $\hat A$ and $\hat b$ we propose
$$\hat A_{kl} = \sum_{i,j=1}^p (\hat T^k_{ij} - S_{ij})(\hat T^l_{ij} - S_{ij}) \qquad \text{and} \qquad \hat b_k \equiv \hat b = \sum_{i,j=1}^p \widehat{\mathrm{Var}}(S_{ij}), \tag{9}$$
where the estimator of the variance of the sample covariance is given by
$$\widehat{\mathrm{Var}}(S_{ij}) := \frac n{(n-1)^3} \sum_{s=1}^n \Big( x_{is} x_{js} - \frac 1n \sum_{t=1}^n x_{it} x_{jt} \Big)^2.$$
To keep the notation simple, we assume $\forall k: \mu = \mu^k = 0$.

In this section we establish the conditions under which MTS of the covariance is consistent by showing when the estimators of eq. (9) fulfill the assumptions of Theorem 2. We consider both asymptotic settings.
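The structured example targets and the plug-in quantities of eq. (9), following the variance estimator stated above, can be sketched as follows (assuming centered data; helper names are ours). Note that the intermediate array of products $x_{is} x_{js}$ has size $p^2 n$, so this naive version only illustrates the formulas.

```python
import numpy as np

def covariance_targets(S):
    """The three example targets: scaled identity, diagonal, constant correlation."""
    p = S.shape[0]
    T_id = np.trace(S) / p * np.eye(p)
    T_diag = np.diag(np.diag(S))
    d = np.sqrt(np.diag(S))
    R = S / np.outer(d, d)                        # sample correlation matrix
    r_bar = (R.sum() - p) / (p * (p - 1))         # average off-diagonal correlation
    T_cc = T_diag + r_bar * (np.outer(d, d) - np.diag(d ** 2))
    return T_id, T_diag, T_cc

def cov_mts_coefficients(X, targets):
    """A_hat and b_hat of eq. (9). X: (p, n) centered observations."""
    p, n = X.shape
    S = X @ X.T / (n - 1)
    D = np.stack([(T - S).ravel() for T in targets])
    A_hat = D @ D.T
    W = X[:, None, :] * X[None, :, :]             # W[i, j, s] = x_is * x_js
    V = ((W - W.mean(axis=2, keepdims=True)) ** 2).sum(axis=2)
    b_hat_value = V.sum() * n / (n - 1) ** 3      # sum_ij Var_hat(S_ij)
    return A_hat, np.full(len(targets), b_hat_value)
```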
We first consider the LDL.
Theorem 5 (LDL consistency of MTS of the covariance)
Let us assume a sequence of statistical models indexed by $p$ for which (A1), (A2), (A3), (A4) and
$$\forall k\ \exists\, \tau^k_C \le 2:\quad \|C^k - C\|^2 = \Theta\big(p^{\tau^k_C}\big), \tag{C1}$$
$$\frac{\sum_{(i,j,k,l) \in Q_p} \big(\mathrm{Cov}[y_i y_j,\, y_k y_l]\big)^2}{|Q_p|} = o(1), \tag{C2}$$
where $Q_p$ is the set of all quadruples consisting of distinct integers between 1 and $p$,
$$\forall k:\quad \tau^{(k)}_\gamma < \min\big(2, \min_k \tau^k_C\big), \tag{C3}$$
$$\forall k \mid \tau^k_C > 1:\quad \min_{\alpha \in \mathbb R^K_{\ge 0},\ \alpha_k = 1} \Big\| \sum_l \alpha_l (C^l - C) \Big\|^2 = \Theta\big(p^{\tau^k_C}\big) \tag{C4}$$
hold. Then, for the set of targets in (Schäfer and Strimmer, 2005) and targets given by additional data sets, assumptions (G1), (G2), (G3) and (G4) of Theorem 2 are fulfilled. Hence MTS of the covariance is consistent and
$$\forall k \mid \tau^k_C > 1:\quad \lambda^\star_k,\ \hat\lambda_k = O\big(p^{(1 - \tau^k_C)/2}\big)$$
holds. If (C4) holds for $\alpha \in \mathbb R^K$, $\lambda^\star$ is identifiable and $\hat\lambda$ is consistent.

Proof see appendix.
Assumption (C1) states that the distance of the data covariance matrix to each target covariance needs to have a clear limit behaviour; we exclude unrealistic sequences of models with $\tau^k_C > 2$. Assumption (C2) restricts the average covariance between products of uncorrelated variables. This assumption is quite weak (compare to (Ledoit and Wolf, 2004)). Assumption (C3) limits the eigenvalue dispersion of the data sets in dependence of the distance between data and target covariance; this is analogous to assumption (M2) for MTS of the mean. Assumption (C4) states that there are no additional data sets which, linearly combined, have better limit behaviour than the single data sets.

Theorem 5 shows that MTS of the covariance is consistent in the LDL. We also see that data sets with covariance distance (C1) increasing faster than $O(p)$ do not contribute to the MTS estimator in the LDL limit: for $n \to \infty$, these data sets do not remain useful.

FOLDL consistency of MTS of the covariance
We now consider the case where only the dimensionality $p$ goes to infinity, while $n$ remains constant.

Theorem 6 (FOLDL consistency of MTS of the covariance)
Let us assume a sequence of statistical models indexed by $p$ for which (A1), (A2), (A3), (A4), (C1), (C2) (see Theorem 5) and
$$\forall k:\quad \tau^k_\gamma < 1 \quad \text{and} \quad \tau_\gamma < 1, \tag{C3′}$$
$$\frac{\sum_{(i,j,k,l) \in Q_p} \mathrm{Cov}\big[(y_i y_j)^2,\, (y_k y_l)^2\big]}{|Q_p|} = o(1) \tag{C5}$$
hold. Then, for the set of targets in (Schäfer and Strimmer, 2005) and targets given by additional data sets, assumptions (G1), (G2), (G3) and (G4) of Theorem 2 are fulfilled, and MTS of the covariance and $\hat\lambda$ are consistent.

Proof see appendix.
As for the mean, consistency in the FOLDL requires a restriction (C3′) on the largest eigenvalue (compare to Theorem 4). Assumption (C5) further restricts covariances between uncorrelated random variables. Note that identifiability holds even without assumption (C4).
6. Simulations
Our proposed MTS has more free parameters than standard shrinkage, and therefore the vector of shrinkage intensity estimates $\hat\lambda$ has a higher variance than the single shrinkage intensity estimate $\hat\lambda$ in STS. In this section we provide simulations, for both MTS of the mean and MTS of the covariance, which show that already at moderate data set sizes MTS accurately estimates $\lambda$. We will consider

• expected squared error: this quantity is optimized by MTS. We directly measure the percentage improvement in average loss (PRIAL) with respect to the sample estimator $\hat\theta$:
$$\mathrm{PRIAL}\big(\hat\theta^{shr}\big) = 100 \cdot \frac{\mathrm E\|\hat\theta - \theta\|^2 - \mathrm E\|\hat\theta^{shr} - \theta\|^2}{\mathrm E\|\hat\theta - \theta\|^2}.$$
The PRIAL is a measure relative to the ESE of the sample estimator. A PRIAL of 100 means that the shrinkage estimator has no error, while a PRIAL of 0 means that it yields no improvement. Negative values indicate performance worse than the sample estimator.

• classification accuracies: in classification tasks, the ESE of the covariance matrix is not the quantity of interest: it only serves as a proxy for classification accuracy. We measure accuracy relative to the unbiased estimator:
$$\text{accuracy gain}\big(\hat\theta^{shr}\big) = \text{accuracy}\big(\hat\theta^{shr}\big) - \text{accuracy}\big(\hat\theta\big).$$
We use MTS to estimate
  – means in Linear Discriminant Analysis (LDA),
  – covariances for Common Spatial Patterns as an LDA preprocessing.

[Figure 4: Large dimensional limit (LDL) of MTS of the mean to additional data sets. Average obtained over $R_r = 20$ repetitions for $R_m = 500$ models. Shaded areas show one standard deviation.]

In the first simulation we illustrate the behaviour of MTS of the mean in the large dimensional limit (LDL, $p, n \to \infty$). We generate $n$ standard normal data points of dimensionality $p = n$ with mean $\mu_i = 0$. For the shrinkage targets we generate $K = 4$ standard normal data sets with $n_k = p$ data points and means $\mu^k_i = \pm\eta_k$, where the sign is chosen at random for each dimension and $\eta = (\sqrt{p^{-1}}, 0.5, 1.0, 1.5)/10$. In this setting, the first additional data set $X^1$ has $\tau^1_\mu = 0$ and $X^{2/3/4}$ have $\tau^{2/3/4}_\mu = 1$. This setting fulfills the assumptions of Theorem 3: the targets have a clear limit behaviour (M1), standard normality implies $\tau^{(k)}_\gamma = 0$ (M2), and the means of the targets are sampled independently (M3). The theorem tells us that the MTS estimator will converge and that the targets $\hat T^{2/3/4}$ will not receive any weight in the LDL.

We compare MTS to five versions of STS: STS to each of the targets $\hat T^k = \hat\mu^k$ and STS to the joint target $\hat T^{joint} := \hat\mu^{joint} := 0.25 \cdot \sum_k \hat\mu^k$. Figure 4 shows the dependency of the PRIAL (left) and the shrinkage intensities (right) on the dimensionality $p$.

As predicted for the LDL by Theorem 3, the STS and MTS shrinkage intensities for the targets $\hat\mu^2$, $\hat\mu^3$, $\hat\mu^4$ and $\hat\mu^{joint}$ go to zero: these targets are not useful in the limit. Only the target $\hat\mu^1$ remains useful.
As $n_1 = n$ and the entries of $\mu^1$ converge to the entries of $\mu$, the shrinkage intensity $\hat\lambda_1$ goes to 0.5. The PRIALs reflect this picture: for the asymptotically useless targets, the improvement over the sample mean goes to zero; for $\hat\mu^1$ it goes to a constant. For low $p$ and $n$, it is less relevant that $\mu^2$, $\mu^3$ and $\mu^4$ are different from $\mu$: as a consequence, the joint target is better than $\hat\mu^1$. Over the whole range of $p$, $\hat\mu^{MTS}$ outperforms all STS estimators. For $p \to \infty$, MTS converges to STS to $\hat\mu^1$.
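For a single realization, the PRIAL and the MTS estimate of the mean can be computed from the sketches above (all helper names are from those sketches; the reported curves average this quantity over $R_m$ models and $R_r$ repetitions):

```python
import numpy as np

def prial(theta, theta_hat, theta_shr):
    """Percentage improvement in average loss over the sample estimator."""
    base = np.sum((theta_hat - theta) ** 2)
    return 100.0 * (base - np.sum((theta_shr - theta) ** 2)) / base

# MTS of the mean for one realization (X, extra as in the earlier sketches):
# lam = mts_intensities(*mean_mts_coefficients(X, extra))
# mu_mts = (1 - lam.sum()) * X.mean(axis=1) \
#          + sum(l * Xk.mean(axis=1) for l, Xk in zip(lam, extra))
```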
8. Drawing the means from normal distributions with different variances seems more straightforward. In particular for small dimensionalities, however, it has the disadvantage that the quality of the additional data sets varies a lot, so that often, e.g., $\|\mu^1 - \mu\| > \|\mu^2 - \mu\|$.

[Figure 5: Finite observations large dimensional limit (FOLDL) of MTS of the mean to additional data sets. Average obtained over $R_r = 20$ repetitions for $R_m = 500$ models. Shaded areas show one standard deviation.]

Figure 5 shows convergence in the finite observations large dimensional limit (FOLDL). The experiment is analogous to the one above, only $n = n_k = 50$ is kept fixed. Contrary to the LDL, all shrinkage intensities remain finite. As above, over the whole range of $p$, $\hat\mu^{MTS}$ outperforms all STS estimators.
To test MTS in a classification setting we extended the above simulations to two class means $\mu_{A/B}$ ($p = 50$, $n = 50$). The difference of the class means is identical in each dimension, chosen such that the Bayes optimal classifier achieves 80% accuracy. For both classes there are four additional data sets with $n_k = 100$ and mean differences
$$\Delta\mu^k_{A/B,i} = \mu^k_{A/B,i} - \mu_{A/B,i} = \pm\eta_k, \qquad \eta = 10^\kappa \cdot (0.5, 1, 1.5, 2),$$
where the parameter $\kappa$ governs the similarity of the additional data sets. The covariance of each data set is $C^{(k)}_{A/B} = I$. To make the setting slightly more realistic, we transform the data to have diagonal covariance with eigenvalues $\gamma_i = 10^{2(i-1)/(p-1) - 1}$ (log-spaced between $10^{\pm\alpha}$, $\alpha = 1$). This is achieved by rescaling all data points: $x^{(k),rescaled}_{A/B,it} = x^{(k)}_{A/B,it} \cdot \sqrt{\gamma_i}$.

We train Linear Discriminant Analysis using different mean estimators (a sketch of the resulting classifier is given below). We compare MTS to (A) the sample means $\hat\mu_{A/B}$, where we ignore the additional data sets, (B) pooled means, where we take $\hat\mu^{pooled}_{A/B} := (K+1)^{-1}\big(\sum_k \hat\mu^k_{A/B} + \hat\mu_{A/B}\big)$, and (C) STS, where we shrink both sample means $\hat\mu_{A/B}$ to the corresponding joint target $\hat\mu^{joint}_{A/B} := K^{-1} \sum_k \hat\mu^k_{A/B}$.

9. To increase comparability, we use the sample covariance averaged over all data sets, independently of the estimator of the mean.

[Figure 6: Accuracy gain for MTS for Linear Discriminant Analysis. Average obtained over $R_r = 20$ repetitions for $R_m = 500$ models. Shaded areas show one fourth of a standard deviation.]

Figure 6 (left) shows the gain in classification accuracy relative to the baseline of sample means in dependence of the scale parameter $\kappa$. When the target means are very similar ($\kappa \to -\infty$), pooled means is the optimal solution. For very different target distributions ($\kappa \to \infty$) we cannot improve over the sample means $\hat\mu_{A/B}$. At these extremes, STS to the pooled data performs as well as the superior method; in between it outperforms both. MTS improves on STS by finding a superior weighting of the target means.

For Figure 6 (right), a spike has been added to the covariance model: the largest eigenvalue has been multiplied by 100 and the corresponding direction has been made non-discriminative. The drop in performance indicates that STS and MTS now give too much weight to the targets, especially to the less useful targets $\mu^{3/4}_{A/B}$. All targets are similar to the original data in the non-discriminative direction of the spike, but still vary in quality in the discriminative directions.
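For completeness, a sketch of the binary LDA classifier into which the compared mean estimates are plugged; the covariance $C$ is the averaged sample covariance of footnote 9, and `lda_weights` is a hypothetical name:

```python
import numpy as np

def lda_weights(mu_A, mu_B, C):
    """Binary LDA: classify x as class A iff w @ x > c."""
    w = np.linalg.solve(C, mu_A - mu_B)
    c = w @ (mu_A + mu_B) / 2.0
    return w, c
```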
Whitening – a practical trick
Shrinkage puts too much weight on the direction of highest variance. Whitening the data before MTS (wMTS) helps: wMTS gives equal importance to all directions, yields proper weights for the $\mu^k_{A/B}$ and superior accuracies (a code sketch of this whitening trick is given at the end of this section).

Interestingly, wMTS also performs better than standard MTS when there is no spike in the covariance (left). In this case the estimation of the shrinkage intensities is dominated by the few directions of largest variance. This causes high variance in the shrinkage intensity estimates $\hat\lambda$. Using wMTS, the estimation of the shrinkage intensities becomes an evenly weighted average over dimensions and hence gets more stable. In general, whitening leads to large improvements if the discriminative information is not restricted to the subspace of highest variance.

Here we illustrate the behaviour of MTS of the covariance in the large dimensional limit (LDL, $p, n \to \infty$). We generate $n$ normal data points of dimensionality $p = n$ with diagonal covariance $C$ with logarithmically spaced eigenvalues. For the shrinkage targets we generate $K = 4$ normal data sets with $n_k = p$ data points. The covariance matrices $C^k$ of the additional data sets only differ in the largest eigenvalue, $\gamma^k_{max} = \eta_k \cdot p$, with $\eta = (\sqrt{p^{-1}}, 0.5, 1.0, 1.5)/10$.

[Figure 7: Large dimensional limit (LDL) of MTS of the covariance to additional data sets. Average obtained over $R_r = 20$ repetitions for $R_m = 500$ models. Shaded areas show one standard deviation.]

[Figure 8: Finite observations large dimensional limit (FOLDL) of MTS of the covariance to additional data sets. Average obtained over $R_r = 20$ repetitions for $R_m = 500$ models. Shaded areas show one standard deviation.]
Therefore the first additional data set $X^1$ has $\tau^1_C = 1$ and $X^{2/3/4}$ have $\tau^{2/3/4}_C = 2$; this makes the setting analogous to simulation 1. Figure 7 shows the dependency of the PRIAL (left) and the shrinkage intensities (right) on the dimensionality $p$: the STS and MTS shrinkage intensities for the targets $\hat C^{2/3/4}$ and $\hat C^{joint}$ go to zero; only the target $\hat C^1$ remains useful in the LDL. As $n_1 = n$, the shrinkage intensity goes to 0.5. For the asymptotically useless targets, the PRIAL over the sample covariance goes to zero; for $\hat C^1$ it goes to a constant. For low $p$ and $n$, it is less relevant that $C^{2/3/4}$ are different from $C$: as a consequence, the joint target is better than $\hat C^1$. Over the whole range of $p$, $\hat C^{MTS}$ outperforms all STS estimators.

Figure 8 shows results for the FOLDL, where $n = n_1 = n_2 = n_3 = n_4 = 50$ is kept fixed. As for the mean, all shrinkage intensities remain finite, and over the whole range of $p$, $\hat C^{MTS}$ outperforms all STS estimators.

[Figure 9: MTS of the covariance to identity and additional data sets. Average obtained over $R_r = 20$ repetitions for $R_m = 500$ models. Shaded areas show four standard deviations.]
For MTS of the covariance there is also the possibility to include a biased estimator as a shrinkage target. The most widely used biased estimator is the identity multiplied by the average sample eigenvalue: $\hat T^{id} := \bar\nu I$. In this simulation, we shrink to $\hat T^{id}$ and the covariance matrices of four additional sets of observations. We choose $C$ and $C^k$ diagonal with logarithmically spaced eigenvalues between $10^{-1}$ and $10^{1}$. Each of the additional data sets is rotated randomly, constrained to a rotation angle $\varphi$. We generate multivariate normal random data sets $X$ and $X^{1/2/3/4}$ of size $p = n = 500$, $n_1 = p/2$, $n_2 = p$, $n_3 = 2p$ and $n_4 = 4p$.

Figure 9 shows PRIAL and shrinkage intensities in dependence of the rotation angle $\varphi$. Shrinkage to $\hat T^{id}$ is independent of $\varphi$, while STS to the other data sets is good when the distributions are similar (small rotation angle) and yields only small improvements for very different distributions (large rotation angle). The MTS shrinkage intensities show that for large $\varphi$ MTS yields approximately the same estimate as STS to $\hat T^{id}$, while for small $\varphi$ it yields a weighting of all five targets. This weighting yields a PRIAL superior to each STS estimator.
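Combining the sketches above, MTS of the covariance with mixed target types reads, for one realization (`X`, `S` and `extra` are placeholder names; the helper functions are those from the earlier sketches):

```python
# uses covariance_targets, cov_mts_coefficients, mts_intensities defined above
S_k = [Xk @ Xk.T / (Xk.shape[1] - 1) for Xk in extra]   # data-set targets
T_id, _, _ = covariance_targets(S)                      # identity target
targets = S_k + [T_id]

lam = mts_intensities(*cov_mts_coefficients(X, targets))
C_mts = (1 - lam.sum()) * S + sum(l * T for l, T in zip(lam, targets))
```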
Common Spatial Patterns (CSP).CSP is used for dimension reduction in classification settings where (A) each datapoint is atime series of observations and (B) the discriminative information between two classes liesin the signal variance. Then CSP yields filters for the classes A and B which are defined by ulti-Target Shrinkage � �� � � � �� � �� � � ���������������������������������������������� � ��� ��� ��� ��� ��� ��� ��� ��� ��� ������������������������������������������� ��������������������� � �� � � � �� � �� � � ����������������������� � ��� ��� ��� ��� ��� ��� ��� ��� ��� ������������������������������������������� �� ������ ����� ����� ������� Figure 10: accuracy gain for MTS of the covariance for CSP. Average obtained over R r = 20repetitions for R m = 500 models.the directions where the ratio of the variances is maximal: f A/Bi := arg max f : f ⊥ f A/Bj ∀ j
Var(
X f i ) (cid:17) .For this simulation, a p = 50 dimensional diagonal covariance matrix C with logarith-mically spaced eigenvalues between 10 − and 10 is generated. The covariances of the twoclasses C A,B and a set of different covariances C A/B,k diff are each obtained by rescaling P = 10random eigenvalues of C by p i = (1+ i/P ) , i = 1 , , . . . , P . In addition, we rotate the C A/B,k diff randomly by an angle φ k , φ = (0 , , , C A/B,k ( w ) = (1 − w ) C A/B,k diff + w C A/B . For each class and each target we generate n = n k = 200 data points. The classificationaccuracy is calculated for test trials of length n test = 20.Figure 10 (left) shows the relative classification accuracies of the different covarianceestimation approaches. For w = 1, the target covariances are equal to the class covariancesand S pooled = 1 / ( k + 1)( (cid:80) k S k + S ) is optimal. For w →
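CSP filters are commonly computed via a generalized eigendecomposition; a minimal sketch under that standard formulation (assuming SciPy; the function name is ours):

```python
import numpy as np
from scipy.linalg import eigh

def csp_filters(C_A, C_B, n_filters=3):
    """Generalized eigenvectors of (C_A, C_A + C_B); eigenvalues are sorted
    ascending, so the last/first columns maximize the variance ratio for A/B."""
    _, vecs = eigh(C_A, C_A + C_B)
    return vecs[:, -n_filters:], vecs[:, :n_filters]
```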
[Figure 10: Accuracy gain for MTS of the covariance for CSP. Average obtained over $R_r = 20$ repetitions for $R_m = 500$ models.]

Figure 10 (left) shows the relative classification accuracies of the different covariance estimation approaches. For $w = 1$, the target covariances are equal to the class covariances and $S^{pooled} = \frac 1{K+1}\big(\sum_k S^k + S\big)$ is optimal. For $w \to 0$, the targets do not contain discriminative information, hence the sample covariance becomes optimal. STS to the joint covariance of the additional data sets performs better than the pooled covariance, but is clearly outperformed by MTS. Whitened MTS performs even better.

For Figure 10 (right), a spike has been added to all covariance matrices: the largest eigenvalue has been multiplied by 100 and the corresponding direction was excluded from the random rotations. This strong direction dominates the standard STS and MTS estimates and causes a strong degradation of performance. The performance of whitened MTS, on the other hand, is not affected.
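A minimal sketch of the whitening trick referred to above: the shrinkage intensities are estimated on whitened data, while the final combination can be applied to the unwhitened estimates. Helper names are from the earlier sketches; full whitening is shown here, whereas Section 7 only rescales the leading principal components.

```python
import numpy as np

def whitened_intensities(X, extra):
    """Estimate MTS intensities for the covariance on whitened data (wMTS)."""
    Z = np.concatenate([X] + extra, axis=1)
    C = Z @ Z.T / (Z.shape[1] - 1)                # pooled covariance for whitening
    evals, evecs = np.linalg.eigh(C)
    evals = np.maximum(evals, 1e-12)              # guard against near-zero eigenvalues
    W = evecs @ np.diag(evals ** -0.5) @ evecs.T  # W = C^{-1/2}
    Xw = W @ X
    targets = [W @ Xk @ Xk.T @ W.T / (Xk.shape[1] - 1) for Xk in extra]
    return mts_intensities(*cov_mts_coefficients(Xw, targets))
```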
7. Multi-Target Shrinkage on Real World Data
In this section we spotlight two application scenarios of MTS on real world data, one for MTS of the mean and one for MTS of the covariance. Detailed articles on these applications are in preparation.
In Brain-Computer Interface (BCI) paradigms based on event-related potentials (ERPs), Linear Discriminant Analysis (LDA) is commonly applied to a binary classification problem (targets vs. non-targets). A detailed overview of the state-of-the-art approaches for feature extraction and classification of ERP data in BCI applications is given in (Blankertz et al., 2011).

Generally, a sequence of $k$ different stimuli is presented repetitively in a random order. The user attends to only one stimulus (the target), while neglecting all others (the non-targets). For each stimulus, the brain response is evaluated and it is assessed whether or not the user was attending to it. Then a one-out-of-$k$-class decision has to be taken based on the $k$ binary LDA classifier outputs.

The standard approach is to compute an LDA classifier by pooling all target and all non-target data, thus neglecting the stimulus identity. Alternatives are STS and MTS: we compute a binary classifier for each stimulus, using the mean over the distinct stimulus classes as a shrinkage target (STS) or the mean of each distinct stimulus class as a separate shrinkage target (MTS). In ERPs, the covariance can be considered as general background activity which is independent of the stimulus. Hence, for all approaches we take the pooled covariance.

One data set comprising 21 subjects was reanalyzed (Schreuder et al., 2011). Figure 11 shows the classification accuracies obtained with the MTS mean, compared against the classification accuracies obtained with other estimates of the mean: the pooled sample mean (standard approach), the sample estimate of the stimulus-specific mean and the STS mean estimate. For the STS mean estimator, the pooled mean of the remaining classes was used as the target. The analysis shows the MTS estimator of the mean to be superior to all other approaches.

We reanalyzed a data set from a Brain-Computer Interface based on motor imagery. In the experiment, subjects had to imagine two different movements while brain activity was measured via EEG ($p = 55$ channels, 80 subjects, 150 trials per subject, each trial with $n_{trial} = 390$ measurements (Blankertz et al., 2010)). For each subject the frequency band was optimized. Common Spatial Patterns (CSP) was applied to the class-wise covariance matrices for feature extraction. 1-3 filters per class were chosen by a heuristic (Blankertz et al., 2008) and Linear Discriminant Analysis was applied to log-variance features.

As training is expensive, we are interested in exploiting training data from other subjects. We compare two approaches: STS to the covariance of all other subjects and Multi-Target Shrinkage to all 80 subjects.
10. Note that despite having the same name, there is no relation between the targets in an ERP experiment and shrinkage targets.
[Figure 11: Classification accuracy of the ERP data using several estimates of the mean. Each subject is marked with a circle. Note that all three plots show the same data on the y-axis: the classification accuracy obtained with the MTS mean estimate.]
[Figure 12: Dependency on the number of training trials for motor imagery BCI. Average obtained over $R = 100$ runs.]

Directions of high variance dominate shrinkage estimators (Bartz and Müller, 2013), and the BCI data contains pronounced directions of high variance; the spectrum is heavily tilted. To reduce the impact of the first eigendirections without giving too much importance to low-variance noise directions, we applied a special form of whitening: we rescaled, only for the calculation of the shrinkage intensities, the first five principal components to have the same variance as the sixth principal component. Shrinkage is corrected for auto-correlation (Bartz and Müller, 2014).

Figure 12 (left, middle) shows accuracies for different numbers of training trials per class. One can see that STS outperforms the sample covariance matrix, while it is not possible to estimate the high number of parameters for MTS. For few training trials, wSTS outperforms STS, as the averaging over additional dimensions reduces variance. wMTS yields very good accuracies.

[Figure 13: Subject-wise classification accuracies for motor imagery BCI, 10 training trials. Average obtained over $R = 100$ runs. ∗∗/∗ := significant at $p \le .01$ or $p \le .05$.]
8. Discussion
Shrinkage is a widely applied estimation technique. In the last years, the analytic formula for covariance shrinkage of Ledoit and Wolf (Ledoit and Wolf, 2004) has become very popular: it is a fast and accurate alternative to cross-validation.

In this paper, we pointed out several use cases in which a single shrinkage target is not sufficient. This motivates the usage of multiple shrinkage targets (MTS). We derived formulas for optimal Multi-Target Shrinkage and showed in theory and simulations that MTS yields improvements over standard shrinkage in several situations. As a practical trick, we proposed whitening as a preprocessing step which increases the robustness of MTS. On two real world data sets from the neuroscience domain, our proposed method yields a significant performance enhancement over standard shrinkage.

Future work will explore connections to random matrix theory, the transfer of domain-specific prior knowledge into the proposed framework, the application of MTS to other estimators and the analysis of new real world data sets. In addition, we are interested in incorporating label information into the weighting of the different dimensions and in adaptively whitening only to an extent which sufficiently reduces the variance of the shrinkage intensity estimates.

Acknowledgments
Klaus-Robert Müller gratefully acknowledges funding by the BMBF Big Data Centre (01IS14013A) and the National Research Foundation grant (No. 2012-005741) funded by the Korean government. We thank Pieter-Jan Kindermans, Sebastian Bach, Shinichi Nakajima and Duncan Blythe for valuable discussions and comments.
Appendix A. Proofs
A.1 Proof of Theorem 1 (MTS quadratic program)

Proof
We decompose the ESE into bias and variance:
$$\begin{aligned}
\Delta_{MTS}(\lambda) &= \mathrm E\, \big\| \theta - \hat\theta^{MTS}(\lambda) \big\|^2 = \mathrm E\Big[ \sum_{i=1}^q \big( \hat\theta^{MTS}(\lambda)_i - \theta_i \big)^2 \Big] \qquad (10) \\
&= \mathrm E \sum_{i=1}^q \Big( \Big(1 - \sum_{k=1}^K \lambda_k\Big) \hat\theta_i + \sum_{k=1}^K \lambda_k \hat T^k_i - \theta_i \Big)^2 \\
&= \sum_{i=1}^q \Bigg\{ \Big(1 - \sum_{k=1}^K \lambda_k\Big)^2 \mathrm{Var}(\hat\theta_i) + \sum_{j,k=1}^K \lambda_j \lambda_k\, \mathrm{Cov}\big(\hat T^j_i, \hat T^k_i\big) \\
&\qquad\quad + 2 \sum_{j=1}^K \lambda_j \Big(1 - \sum_{k=1}^K \lambda_k\Big) \mathrm{Cov}\big(\hat T^j_i, \hat\theta_i\big) + \Big( \sum_{k=1}^K \lambda_k\, \mathrm E\big[\hat T^k_i - \hat\theta_i\big] \Big)^2 \Bigg\}.
\end{aligned}$$

This can be simplified to
$$\begin{aligned}
\Delta_{MTS}(\lambda) &= \sum_{i=1}^q \Bigg\{ \sum_{j,k=1}^K \lambda_j \lambda_k\, \mathrm E\big[\big(\hat T^j_i - \hat\theta_i\big)\big(\hat T^k_i - \hat\theta_i\big)\big] + 2 \sum_{k=1}^K \lambda_k \big( \mathrm{Cov}(\hat T^k_i, \hat\theta_i) - \mathrm{Var}(\hat\theta_i) \big) + \mathrm{Var}(\hat\theta_i) \Bigg\} \\
&= \lambda^\top A \lambda - 2\, b^\top \lambda + \sum_{i=1}^q \mathrm{Var}(\hat\theta_i) = 2\, \Delta_{MTSqp}(\lambda) + \text{const.} \qquad (11)
\end{aligned}$$

Therefore the sets of $\lambda$ minimizing $\Delta_{MTS}(\lambda)$ and $\Delta_{MTSqp}(\lambda)$ are identical.

A.2 Proof of Theorem 2 (consistency of MTS)

Proof
From the constraints it follows directly that
$$\|\lambda^\star\| = O(1), \tag{12}$$
and from the definition of $A$ and $b$ it follows that $\forall k: \tau^k_A \ge \tau_{\hat\theta}$.

We first prove (i). We have $\forall k$:
$$\lambda^{\star\top} A \lambda^\star = \sum_{k', l = 1}^K \lambda^\star_{k'} \lambda^\star_l \sum_{i=1}^q \mathrm E\big[\big(\hat T^{k'}_i - \hat\theta_i\big)\big(\hat T^l_i - \hat\theta_i\big)\big] \ge \big(\lambda^\star_k\big)^2 \min_{\alpha \in \mathbb R^K_{\ge 0},\ \alpha_k = 1} \sum_{i=1}^q \mathrm E\Big(\sum_{l=1}^K \alpha_l\big(\hat T^l_i - \hat\theta_i\big)\Big)^2 = \big(\lambda^\star_k\big)^2\, \Theta\big(p^{\tau^k_A}\big), \tag{13}$$
$$b^\top \lambda^\star \overset{(G2),(12)}{=} O\big(p^{\tau_{\hat\theta}}\big). \tag{14}$$
We then have $\forall k$:
$$\Theta\big(p^{\tau_{\hat\theta}}\big) \overset{(G1)}{=} \Delta_{\hat\theta} \ge \Delta_{MTS}(\lambda^\star) \overset{(11)}{=} \lambda^{\star\top} A \lambda^\star - 2\, b^\top \lambda^\star + \sum_i \mathrm{Var}(\hat\theta_i) \overset{(12),(13),(14)}{\ge} \big(\lambda^\star_k\big)^2\, \Theta\big(p^{\tau^k_A}\big) + O\big(p^{\tau_{\hat\theta}}\big).$$
Rearranging yields $\lambda^\star_k = O\big(p^{0.5(\tau_{\hat\theta} - \tau^k_A)}\big)$.

To prove statement (i) for $\hat\lambda_k$, we first define
$$\hat\Delta_{MTS}(\lambda) := \lambda^\top \hat A \lambda - 2\, \hat b^\top \lambda + \sum_{i=1}^q \mathrm{Var}(\hat\theta_i).$$
Using the result on the limit behaviour of $\lambda^\star$, we obtain
$$\lambda^{\star\top} (A - \hat A) \lambda^\star = \sum_{k,l=1}^K \lambda^\star_k \lambda^\star_l \big(A_{kl} - \hat A_{kl}\big) \overset{(G3)}{=} \sum_{k,l=1}^K \lambda^\star_k \lambda^\star_l\, o\big(p^{0.5(\tau^k_A + \tau^l_A)}\big) = o\big(p^{\tau_{\hat\theta}}\big). \tag{15}$$
This allows us to calculate
$$\Delta_{MTS}(\lambda^\star) - \hat\Delta_{MTS}(\lambda^\star) = \lambda^{\star\top} (A - \hat A) \lambda^\star - 2\, \big(b - \hat b\big)^\top \lambda^\star \overset{(12),(15)}{=} o\big(p^{\tau_{\hat\theta}}\big). \tag{16}$$
In addition, we calculate
$$\hat\Delta_{MTS}(\hat\lambda) - \Delta_{MTS}(\hat\lambda) = \hat\lambda^\top (\hat A - A) \hat\lambda - 2\, \big(\hat b - b\big)^\top \hat\lambda = \sum_k \hat\lambda_k^2\, o\big(p^{\tau^k_A}\big) + o\big(p^{\tau_{\hat\theta}}\big). \tag{17}$$
Using these equations, we obtain
$$\Theta\big(p^{\tau_{\hat\theta}}\big) \ge \Delta_{MTS}(\lambda^\star) \overset{(16)}{=} \hat\Delta_{MTS}(\lambda^\star) + o\big(p^{\tau_{\hat\theta}}\big) \ge \hat\Delta_{MTS}(\hat\lambda) + o\big(p^{\tau_{\hat\theta}}\big) \overset{(17)}{=} \Delta_{MTS}(\hat\lambda) + o\big(p^{\tau_{\hat\theta}}\big) + \sum_k \hat\lambda_k^2\, o\big(p^{\tau^k_A}\big) \overset{(13),(14)}{\ge} \hat\lambda_k^2\, \Theta\big(p^{\tau^k_A}\big) + O\big(p^{\tau_{\hat\theta}}\big) + o\big(p^{\tau_{\hat\theta}}\big) + \sum_k \hat\lambda_k^2\, o\big(p^{\tau^k_A}\big).$$
Rearranging yields $\hat\lambda_k = O\big(p^{0.5(\tau_{\hat\theta} - \tau^k_A)}\big)$, which concludes (i).

To prove statement (ii), we have to relate the difference in ESE to the difference in the estimate of the ESE:
$$\big(\Delta_{MTS}(\hat\lambda) - \Delta_{MTS}(\lambda^\star)\big) - \big(\hat\Delta_{MTS}(\hat\lambda) - \hat\Delta_{MTS}(\lambda^\star)\big) = \big(\Delta_{MTS}(\hat\lambda) - \hat\Delta_{MTS}(\hat\lambda)\big) - \big(\Delta_{MTS}(\lambda^\star) - \hat\Delta_{MTS}(\lambda^\star)\big) \overset{(16),(17),(i)}{=} o\big(p^{\tau_{\hat\theta}}\big).$$
Using this and the optimality of $\lambda^\star$ for $\Delta_{MTS}(\lambda)$ and of $\hat\lambda$ for $\hat\Delta_{MTS}(\lambda)$, we obtain
$$0 \le \big(\Delta_{\hat\theta}\big)^{-1} \big(\Delta_{MTS}(\hat\lambda) - \Delta_{MTS}(\lambda^\star)\big) = \Theta\big(p^{-\tau_{\hat\theta}}\big)\Big( \hat\Delta_{MTS}(\hat\lambda) - \hat\Delta_{MTS}(\lambda^\star) + o\big(p^{\tau_{\hat\theta}}\big) \Big) \le o(1),$$
which concludes the proof of (ii).

The proof of part (iii) is similar to that of Theorem 2.1 in (Daniel, 1973). On the convex set we have
$$0 \le \big(\hat\lambda - \lambda^\star\big)^\top \nabla \Delta_{MTS}(\lambda^\star), \tag{18}$$
$$0 \le \big(\lambda^\star - \hat\lambda\big)^\top \nabla \hat\Delta_{MTS}(\hat\lambda), \tag{19}$$
where the gradients are $\nabla \Delta_{MTS}(\lambda) = 2(A\lambda - b)$ and $\nabla \hat\Delta_{MTS}(\lambda) = 2\big(\hat A \lambda - \hat b\big)$.
(19) by minus one and combining the two equations, we obtain
\[
(\hat\lambda - \lambda^\star)^\top \nabla \widehat\Delta_{\mathrm{MTS}}(\hat\lambda) \le (\hat\lambda - \lambda^\star)^\top \nabla \Delta_{\mathrm{MTS}}(\lambda^\star).
\]
Subtracting $(\hat\lambda - \lambda^\star)^\top \nabla \widehat\Delta_{\mathrm{MTS}}(\lambda^\star)$ from both sides, we obtain
\[
(\hat\lambda - \lambda^\star)^\top \Big(\nabla \widehat\Delta_{\mathrm{MTS}}(\hat\lambda) - \nabla \widehat\Delta_{\mathrm{MTS}}(\lambda^\star)\Big)
\le (\hat\lambda - \lambda^\star)^\top \Big(\nabla \Delta_{\mathrm{MTS}}(\lambda^\star) - \nabla \widehat\Delta_{\mathrm{MTS}}(\lambda^\star)\Big).
\]
The left hand side is
\[
2(\hat\lambda - \lambda^\star)^\top \widehat A\, (\hat\lambda - \lambda^\star)
\ge 2\|\hat\lambda - \lambda^\star\|^2 \min_{\|\alpha\|=1} \alpha^\top A \alpha + 2(\hat\lambda - \lambda^\star)^\top \big(\widehat A - A\big)(\hat\lambda - \lambda^\star)
\overset{(G4),\ \alpha \in \mathbb{R}^K}{=} \|\hat\lambda - \lambda^\star\|^2 \cdot \Theta\big(p^{\tau_{\hat\theta}}\big).
\]
The right hand side is
\[
2(\hat\lambda - \lambda^\star)^\top \big(A - \widehat A\big)\lambda^\star + (\hat\lambda - \lambda^\star)^\top \big(\widehat b - b\big) = o\big(p^{\tau_{\hat\theta}}\big)
\]
by (G1), (G2), (G3) and the rates of the $\lambda_k$ given by (i). Therefore, rearranging yields $\|\hat\lambda - \lambda^\star\| = o(1)$.

A.3 Proof of Theorem 3 (LDL consistency of MTS of the mean)

Proof
Without loss of generality, we assume $\mu = \mathbf{0}$. We start by analysing the asymptotic behaviour of $\Delta_{\hat\theta}$, $A_{kk}$ and $b$; then we prove the consistency of $\widehat A_{kl}$ and $\hat b$.

(G1) & (G2): Asymptotic behaviour of $\Delta_{\hat\theta}$, $A_{kk}$ and $b$. We start with the asymptotic behaviour of $\Delta_{\hat\theta} = b_k$. We have
\[
\Delta_{\hat\theta} = b_k = \sum_{i=1}^p \mathrm{Var}(\hat\mu_i) = n^{-1} \sum_{i=1}^p \mathrm{Var}(x_{is}) = n^{-1} \sum_{i=1}^p \gamma_i \overset{(A1)}{=} \Theta(1) \overset{!}{=} \Theta\big(p^{\tau_{\hat\theta}}\big) \iff \tau_{\hat\theta} = 0. \tag{20}
\]
Using this result, we obtain the asymptotic behaviour of $A_{kk}$:
\begin{align}
A_{kk} &= \sum_{i=1}^p \mathbb{E}\Big[\big(\hat\mu^k_i - \hat\mu_i\big)^2\Big]
= \sum_{i=1}^p \mathbb{E}\Big[\big(\hat\mu^k_i\big)^2 - 2\hat\mu^k_i \hat\mu_i + \hat\mu_i^2\Big] \tag{21}\\
&= \sum_{i=1}^p \Big\{ \mathbb{E}\big[(\hat\mu^k_i)^2\big] + \mathbb{E}\big[\hat\mu_i^2\big] \Big\}
= \sum_{i=1}^p \Big\{ \big(\mu^k_i\big)^2 + \mathrm{Var}\big(\hat\mu^k_i\big) + \mathrm{Var}(\hat\mu_i) \Big\}
= \Theta\big(p^{\tau^\mu_k}\big) + \Theta(1) \overset{!}{=} \Theta\big(p^{\tau^A_k}\big) \iff \tau^A_k = \max\big(\tau^\mu_k, 0\big). \nonumber
\end{align}

(G3), part I: Consistency of $\widehat A_{kl}$. As $\widehat A_{kl}$ is unbiased, we have to show that
\[
\mathrm{Var}\big(\widehat A_{kl}\big) = o\big(p^{\tau^A_k + \tau^A_l}\big) = o\big(p^{\max(\tau^\mu_k,0) + \max(\tau^\mu_l,0)}\big). \tag{22}
\]
We introduce the notation $\check x^{(k)}_{is} = x^{(k)}_{is} - \mu^{(k)}_i$ and $\check\mu^{(k)}_i = n_k^{-1} \sum_s \check x^{(k)}_{is}$. We then have
\[
\mathrm{Var}\big(\widehat A_{kl}\big)
= \mathrm{Var}\Bigg( \sum_{i=1}^p \big(\hat\mu^k_i - \hat\mu_i\big)\big(\hat\mu^l_i - \hat\mu_i\big) \Bigg)
= \mathrm{Var}\Bigg( \sum_{i=1}^p \big(\check\mu^k_i - \check\mu_i + \mu^k_i\big)\big(\check\mu^l_i - \check\mu_i + \mu^l_i\big) \Bigg). \tag{23}
\]
To show eq. (22), it is sufficient to show that the variance of each combination of terms in eq. (23) is $o\big(p^{\tau^A_k + \tau^A_l}\big)$. There are three non-constant types of combinations. First, there is the product of a mean and a sample mean:
\begin{align*}
\mathrm{Var}\Bigg(\sum_{i=1}^p \mu^k_i \check\mu^l_i\Bigg)
&= n_l^{-2} \sum_{ij} \mathrm{Cov}\Big(\mu^k_i \sum_s \check x^l_{is},\ \mu^k_j \sum_t \check x^l_{jt}\Big)
= n_l^{-1}\, \mu^{k\top} C^l \mu^k
\overset{(M1)}{=} \frac{\mu^{k\top} C^l \mu^k}{\|\mu^k\|^2}\, \Theta\big(p^{\tau^\mu_k - 1}\big) \\
&\le \max_i \gamma^l_i\, \Theta\big(p^{\tau^\mu_k - 1}\big)
\overset{(M2),(A2)}{=} o\big(p^{\tau^\mu_l + 1}\big)\, \Theta\big(p^{\tau^\mu_k - 1}\big)
= o\big(p^{\tau^\mu_k + \tau^\mu_l}\big) = o\big(p^{\tau^A_k + \tau^A_l}\big).
\end{align*}
Second, there are products of two different sample means:
\begin{align*}
\mathrm{Var}\Bigg(\sum_{i=1}^p \check\mu_i \check\mu^k_i\Bigg)
&= n^{-2} n_k^{-2}\, \mathrm{Var}\Bigg( \sum_{i=1}^p \sum_{s,t} \check x_{is} \check x^k_{it} \Bigg)
= n^{-1} n_k^{-1} \sum_{i,j=1}^p \mathrm{Cov}(x_i, x_j)\, \mathrm{Cov}\big(x^k_i, x^k_j\big)
= n^{-1} n_k^{-1} \sum_{i,j=1}^p \mathrm{Cov}(y_i, y_j)\, \mathrm{Cov}\big(z^k_i, z^k_j\big) \\
&= n^{-1} n_k^{-1} \sum_{i=1}^p \gamma_i\, \mathbb{E}\big[(z^k_i)^2\big]
\le \frac{1}{n n_k} \sum_{i=1}^p \gamma_i \gamma^k_i
\le \frac{p}{n n_k} \sqrt{p^{-1}\sum_{i=1}^p \gamma_i^2}\ \sqrt{p^{-1}\sum_{i=1}^p \big(\gamma^k_i\big)^2} \\
&= \Theta\Big(p^{0.5(\tau_{\gamma^2} + \tau^k_{\gamma^2}) - 1}\Big)
\overset{(M2)}{=} o\big(p^{\max(0,\tau^\mu_k) + \max(0,\tau^\mu_l)}\big) = o\big(p^{\tau^A_k + \tau^A_l}\big).
\end{align*}
The third combination has two sample means:
\begin{align}
\mathrm{Var}\Bigg(\sum_{i=1}^p \check\mu_i^2\Bigg)
&= n^{-4}\, \mathrm{Var}\Bigg( \sum_{i=1}^p \sum_{s,t} y_{is} y_{it} \Bigg)
= n^{-4} \sum_{i,j=1}^p \sum_{s,t,s',t'} \mathrm{Cov}\big(y_{is} y_{it},\ y_{js'} y_{jt'}\big) \nonumber\\
&= n^{-4} \sum_{i,j=1}^p \Bigg\{ \sum_s \mathrm{Cov}\big(y_{is}^2, y_{js}^2\big) + 2\sum_{s,t\ne s} \mathrm{Cov}\big(y_{is} y_{it},\ y_{js} y_{jt}\big) \Bigg\}
\le n^{-3} \sum_{i,j=1}^p \mathrm{Cov}\big(y_i^2, y_j^2\big) + 2 n^{-2} \sum_{i,j=1}^p \mathrm{Cov}(y_i, y_j)^2 \nonumber\\
&\le \frac{p^2}{n^3} \Bigg( p^{-1} \sum_{i=1}^p \sqrt{\mathbb{E}\big[y_i^4\big]} \Bigg)^{\!2} + \frac{2p}{n^2} \Bigg( p^{-1} \sum_{i=1}^p \gamma_i^2 \Bigg)
\overset{(A1),(A3)}{=} O\big(p^{-1}\big) + \Theta\big(p^{\tau_{\gamma^2} - 1}\big)
\overset{(M2)}{=} O\big(p^{-1}\big) + o\big(p^{\max(0,\, \min_k \tau^\mu_k)}\big) = o\big(p^{\tau^A_k + \tau^A_l}\big) \quad \forall k,l. \tag{24}
\end{align}
We have shown that the variance of all terms, and hence $\mathrm{Var}(\widehat A_{kl})$, is $o(p^{\tau^A_k + \tau^A_l})$.

(G3), part II: Consistency of $\hat b$. The estimator $\hat b$ is also unbiased, hence we have to show
\[
\mathrm{Var}(\hat b) = o\big(p^{2\tau_{\hat\theta}}\big) = o(1).
\]
In a first step, we reformulate the variance:
\begin{align*}
\mathrm{Var}(\hat b) &= \mathrm{Var}\Bigg( \sum_{i=1}^p \widehat{\mathrm{Var}}(\hat\mu_i) \Bigg)
= \mathrm{Var}\Bigg( n^{-1}(n-1)^{-1} \sum_{i=1}^p \sum_{t=1}^n \big(x_{it} - \hat\mu_i\big)^2 \Bigg)
= n^{-2}(n-1)^{-2}\, \mathrm{Var}\Bigg( \sum_{i=1}^p \sum_{t=1}^n x_{it}^2 - n^{-1} \sum_{i=1}^p \sum_{s,t=1}^n x_{is} x_{it} \Bigg).
\end{align*}
The variance is $o(1)$ if the variances of both terms in the sum are $o(p^4)$. We start with
\begin{align*}
\mathrm{Var}\Bigg( \sum_{i=1}^p \sum_{t=1}^n x_{it}^2 \Bigg)
&= n\, \mathrm{Var}\Bigg( \sum_{i=1}^p x_{it}^2 \Bigg)
= n\, \mathrm{Var}\Bigg( \sum_{i=1}^p y_{it}^2 \Bigg)
= n \sum_{i,j=1}^p \mathrm{Cov}\big(y_{it}^2, y_{jt}^2\big)
\le n \sum_{i,j=1}^p \sqrt{\mathbb{E}\big[y_{it}^4\big]}\sqrt{\mathbb{E}\big[y_{jt}^4\big]} \\
&\overset{(A3)}{\le} p^2 n (1+\alpha) \Bigg( p^{-1} \sum_{i=1}^p \gamma_i \Bigg)^{\!2} = O\big(p^3\big) = o\big(p^4\big).
\end{align*}
The variance of the second term in the sum is, following the steps in eq. (24),
\[
\mathrm{Var}\Bigg( n^{-1} \sum_{i=1}^p \sum_{s,t} x_{is} x_{it} \Bigg) = O\big(p^3\big) + o\big(p^{\max(0,\,\min_k \tau^\mu_k)+2}\big) \overset{(M1)}{=} o\big(p^4\big).
\]
This concludes the proof that $\mathrm{Var}(\hat b)$ is $o(p^{2\tau_{\hat\theta}}) = o(1)$.

(G4): Restriction on linear combinations. Let $L$ be $\mathbb{R}^K$ or $\mathbb{R}^K_{\ge 0}$. We have
\begin{align}
\Theta\big(p^{\tau^A_k}\big) &\overset{!}{=} \min_{\substack{\alpha \in L\\ \alpha_k=1}} \sum_{i=1}^q \mathbb{E}\Bigg[\bigg( \sum_{l=1}^K \alpha_l \big(\widehat T_{li} - \hat\theta_i\big) \bigg)^{\!2}\Bigg]
= \min_{\substack{\alpha \in L\\ \alpha_k=1}} \sum_{i=1}^q \mathbb{E}\Bigg[\bigg( \sum_{l=1}^K \alpha_l \big(\hat\mu^l_i - \hat\mu_i\big) \bigg)^{\!2}\Bigg] \tag{25}\\
&= \min_{\substack{\alpha \in L\\ \alpha_k=1}} \sum_{i=1}^q \Bigg\{ \bigg( \sum_{l=1}^K \alpha_l \big(\mu^l_i - \mu_i\big) \bigg)^{\!2} + \bigg(\sum_{l=1}^K \alpha_l\bigg)^{\!2} \mathrm{Var}(\hat\mu_i) + \sum_{l=1}^K \alpha_l^2\, \mathrm{Var}\big(\hat\mu^l_i\big) \Bigg\} \nonumber\\
&\overset{(M3)}{\ge} \Theta\big(p^{\tau^\mu_k}\big) + \sum_{i=1}^q \mathrm{Var}\big(\hat\mu^k_i\big) = \Theta\big(p^{\tau^\mu_k}\big) + \Theta(1) = \Theta\big(p^{\max(0,\tau^\mu_k)}\big). \nonumber
\end{align}
This concludes the proof of Theorem 3.
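For the mean, the quantities entering the quadratic program are simple plug-in statistics. The sketch below shows one way to compute $\widehat A$ and $\hat b$ for targets given by sample means of additional, independent data sets; the function name is ours, and the convention for $b$ (no covariance term, following eq. (20)) is an assumption that may differ from the paper's exact definition by a constant factor.

```python
import numpy as np


def mts_mean_coefficients(X, targets):
    """Plug-in estimates of A and b for MTS of the mean.

    X       : (n, p) array, primary data set.
    targets : list of (n_k, p) arrays whose sample means serve as
              shrinkage targets, assumed independent of X.
    """
    n, p = X.shape
    mu_hat = X.mean(axis=0)
    K = len(targets)

    # A_kl = sum_i (mu^k_i - mu_i)(mu^l_i - mu_i), an unbiased estimate
    # of its own expectation (cf. eq. (23)).
    D = np.stack([Xk.mean(axis=0) - mu_hat for Xk in targets])  # (K, p)
    A_hat = D @ D.T

    # b_k = sum_i Var(mu_hat_i), estimated with the unbiased sample
    # variance; the Cov(T, theta) term vanishes for independent targets.
    b_hat = np.full(K, (X.var(axis=0, ddof=1) / n).sum())
    return A_hat, b_hat
```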
A.4 Proof of Theorem 4 (FOLDL consistency of MTS of the mean)

Proof
As above, without loss of generality, we assume $\mu = \mathbf{0}$. We again start by analysing the asymptotic behaviour of $\Delta_{\hat\theta}$, $A_{kk}$ and $b$; then we prove the consistency of $\widehat A_{kl}$ and $\hat b$.

(G1) & (G2): Asymptotic behaviour of $\Delta_{\hat\theta}$, $A_{kk}$ and $b$. From equations (20) and (21) we directly obtain
\[
\tau_{\hat\theta} = 1 \quad \text{and} \quad \forall k: \tau^A_k = 1.
\]

(G3), part I: Consistency of $\widehat A_{kl}$. As for the LDL, we show that all types of terms in eq. (23) are $o(p^{\tau^A_k + \tau^A_l})$. For the FOLDL, this means they have to be $o(p^2)$. Following similar steps as above, we obtain
\begin{align*}
\mathrm{Var}\Bigg(\sum_{i=1}^p \mu^k_i \check\mu^l_i\Bigg)
&= n_l^{-1}\, \mu^{k\top} C^l \mu^k
\le \max_i \gamma^l_i\, \Theta\big(p^{\tau^\mu_k}\big)
\overset{(M2')}{=} o\big(p^{\tau^\mu_k + 1}\big) = o\big(p^2\big), \\
\mathrm{Var}\Bigg(\sum_{i=1}^p \check\mu_i \check\mu^k_i\Bigg)
&\le \frac{p}{n n_k} \sqrt{p^{-1}\sum_{i=1}^p \gamma_i^2}\ \sqrt{p^{-1}\sum_{i=1}^p \big(\gamma^k_i\big)^2}
= o\big(p^{\max(1,\tau^\mu_k) + \max(1,\tau^\mu_l)}\big) = o\big(p^2\big),
\end{align*}
and
\begin{align}
\mathrm{Var}\Bigg(\sum_{i=1}^p \check\mu_i^2\Bigg)
&= n^{-4} \sum_{i,j=1}^p \Bigg\{ \sum_s \mathrm{Cov}\big(y_{is}^2, y_{js}^2\big) + 2 \sum_{s,t\ne s} \mathrm{Cov}\big(y_{is} y_{it},\ y_{js} y_{jt}\big) \Bigg\} \tag{26}\\
&\le n^{-3} \sum_{i,j\ne i} \mathrm{Cov}\big(y_{is}^2, y_{js}^2\big) + (1+\alpha)\frac{p^2}{n^3}\Bigg(p^{-1}\sum_{i=1}^p \gamma_i\Bigg)^{\!2} + \frac{2p}{n^2}\Bigg(p^{-1}\sum_{i=1}^p \gamma_i^2\Bigg)
\overset{(M2'),(M4)}{=} o\big(p^2\big) + o\big(p^2\big). \nonumber
\end{align}
We have shown that the variance of all terms, and hence $\mathrm{Var}(\widehat A_{kl})$, is $o(p^{\tau^A_k + \tau^A_l})$.

(G3), part II: Consistency of $\hat b$. We have to show that $\mathrm{Var}(\hat b)$ is $o(p^{2\tau_{\hat\theta}}) = o(p^2)$:
\begin{align*}
\mathrm{Var}(\hat b) &= \mathrm{Var}\Bigg(\sum_{i=1}^p \widehat{\mathrm{Var}}(\hat\mu_i)\Bigg)
= \mathrm{Var}\Bigg( \frac{1}{n(n-1)} \sum_{i=1}^p \sum_{t=1}^n \big(x_{it} - \hat\mu_i\big)^2 \Bigg) \\
&= \frac{1}{n^2(n-1)^2}\, \mathrm{Var}\Bigg( \sum_{i=1}^p \sum_{t=1}^n x_{it}^2 - n^{-1} \sum_{i=1}^p \sum_{s=1}^n x_{is}^2 - n^{-1} \sum_{i=1}^p \sum_{s,t\ne s} x_{is} x_{it} \Bigg).
\end{align*}
This variance expression is $o(p^2)$ if the variance of each of the three sums is $o(p^2)$. For the first sum, we use eq. (26) and obtain
\[
\mathrm{Var}\Bigg(\sum_{i=1}^p x_{it}^2\Bigg) = \mathrm{Var}\Bigg(\sum_{i=1}^p y_{it}^2\Bigg)
= \sum_{ij} \mathrm{Cov}\big(y_{it}^2, y_{jt}^2\big)
\le \sum_{i,j\ne i} \mathrm{Cov}\big(y_{it}^2, y_{jt}^2\big) + (1+\alpha)\, p \Bigg(p^{-1}\sum_{i=1}^p \gamma_i^2\Bigg)
\overset{(M4),(M2')}{=} o\big(p^2\big) + o\big(p^2\big).
\]
The second sum is proportional to the first sum. For the third sum we obtain, by using eq. (26),
\[
\mathrm{Var}\Bigg( \sum_{i=1}^p \sum_{s,t\ne s} x_{is} x_{it} \Bigg)
= \sum_{ij} \sum_{s,t,s',t'} \mathrm{Cov}\big( x_{is} x_{it},\ x_{js'} x_{jt'} \big) = o\big(p^2\big).
\]
This concludes the proof that $\mathrm{Var}(\hat b)$ is $o(p^{2\tau_{\hat\theta}}) = o(p^2)$.

(G4): Restriction on linear combinations. Similar to eq. (25), we have
\[
\Theta\big(p^{\tau^A_k}\big) \overset{!}{=} \min_{\substack{\alpha \in \mathbb{R}^K\\ \alpha_k=1}} \sum_{i=1}^q \mathbb{E}\Bigg[\bigg(\sum_{l=1}^K \alpha_l \big(\widehat T_{li} - \hat\theta_i\big)\bigg)^{\!2}\Bigg]
\ge \sum_i \mathrm{Var}\big(\hat\mu^k_i\big) = \Theta(p).
\]
This concludes the proof of Theorem 4.
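Theorem 4 can be illustrated numerically: with $n$ and $n_k$ held fixed while $p$ grows, the estimated intensity and the realized error ratio stabilize. A hypothetical simulation, reusing `solve_mts_qp` and `mts_mean_coefficients` from the sketches above (not an experiment from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
n, n_k = 20, 20                              # observations stay fixed (FOLDL)

for p in [50, 200, 800, 3200]:
    mu = np.zeros(p)                         # true mean, WLOG 0 as in the proof
    mu_k = 0.1 * rng.standard_normal(p)      # target population mean, close to mu
    X = mu + rng.standard_normal((n, p))
    X_k = mu_k + rng.standard_normal((n_k, p))

    A_hat, b_hat = mts_mean_coefficients(X, [X_k])
    lam = solve_mts_qp(A_hat, b_hat)

    mts = (1 - lam.sum()) * X.mean(axis=0) + lam[0] * X_k.mean(axis=0)
    ratio = np.sum((mts - mu) ** 2) / np.sum((X.mean(axis=0) - mu) ** 2)
    print(f"p={p:5d}  lambda={lam[0]:.3f}  ESE ratio={ratio:.3f}")
```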
A.5 Proof of Theorem 5 (LDL consistency of MTS of the covariance)

Proof
The estimators $\widehat A$ and $\hat b$ depend on the choice of target. We restrict the proof to targets given by sample covariance matrices of additional data sets. The biased estimators in Schäfer and Strimmer (2005) and Ledoit and Wolf (2003) have smaller variance; consistency can be shown following similar steps.

(G1) & (G2): Asymptotic behaviour of $\Delta_{\hat\theta}$, $b_k$ and $A_{kk}$. We first show the asymptotic behaviour
\[
\Delta_{\hat\theta} = b_k = \sum_{ij} \mathrm{Var}\big(S_{ij}\big) = \Theta(p) \overset{!}{=} \Theta\big(p^{\tau_{\hat\theta}}\big) \iff \tau_{\hat\theta} = 1. \tag{27}
\]
Rotation invariance allows us to analyse in the eigenbasis. The upper bound follows from
\begin{align}
\sum_{i,j} \mathrm{Var}\big(S'_{ij}\big)
&\le \frac{1}{n} \sum_{i,j} \Big\{ \sqrt{\mathrm{Var}\big(y_i^2\big)\,\mathrm{Var}\big(y_j^2\big)} + \mathbb{E}\big[y_i^2\big]\,\mathbb{E}\big[y_j^2\big] - \mathbb{E}[y_i y_j]^2 \Big\} \tag{28}\\
&\le \frac{1}{n} \sum_{i,j} \Big\{ \sqrt{\mathbb{E}\big[y_i^4\big]\,\mathbb{E}\big[y_j^4\big]} + \mathbb{E}\big[y_i^2\big]\,\mathbb{E}\big[y_j^2\big] \Big\}
\le \frac{2}{n} \sum_{i,j} \sqrt{\mathbb{E}\big[y_i^4\big]\,\mathbb{E}\big[y_j^4\big]}
\le \frac{2p^2}{n}(1+\alpha) \Bigg( p^{-1}\sum_i \mathbb{E}\big[y_i^2\big] \Bigg)^{\!2} = \Theta(p). \nonumber
\end{align}
For the lower bound, we distinguish two cases: for $\tau_{\gamma^2} = 1$, we have
\begin{align}
\sum_{i,j} \mathrm{Var}\big(S'_{ij}\big) \ge \sum_i \mathrm{Var}\big(S'_{ii}\big)
= \frac{1}{n} \sum_i \Big\{ \mathbb{E}\big[y_i^4\big] - \mathbb{E}\big[y_i^2\big]^2 \Big\}
\ge \frac{\beta}{n} \sum_i \mathbb{E}\big[y_i^2\big]^2
= \beta\, \frac{p}{n}\, p^{-1}\sum_{i=1}^p \gamma_i^2 = \Theta\big(p^{\tau_{\gamma^2}}\big) = \Theta(p). \tag{29}
\end{align}
For the case $\tau_{\gamma^2} < 1$, we have
\begin{align}
\sum_{i,j} \mathrm{Var}\big(S'_{ij}\big)
&= \frac{1}{n} \sum_{i,j} \Big\{ \mathbb{E}\big[y_i^2 y_j^2\big] - \mathbb{E}[y_i y_j]^2 \Big\} \tag{30}\\
&\ge \frac{1}{n} \sum_{i,j} \Big\{ \mathbb{E}\big[y_i^2\big]\,\mathbb{E}\big[y_j^2\big] - \mathbb{E}[y_i y_j]^2 \Big\}
\ge \frac{1}{n} \Bigg( \sum_i \mathbb{E}\big[y_i^2\big] \Bigg)^{\!2} - \frac{1}{n} \sum_i \mathbb{E}\big[y_i^2\big]^2 \nonumber\\
&\ge \frac{p^2}{n} \Bigg( p^{-1}\sum_i \mathbb{E}\big[x_i^2\big] \Bigg)^{\!2} - \frac{p}{n}\, p^{-1}\sum_i \gamma_i^2
= \Theta(p) - \Theta\big(p^{\tau_{\gamma^2}}\big) = \Theta(p). \nonumber
\end{align}
The asymptotic behaviour of $A_{kk}$ depends on the relationship between the original data $X$ and the additional data set $X^k$:
\[
A_{kk} = \sum_{i,j=1}^p \mathbb{E}\Big[\big(S^k_{ij} - S_{ij}\big)^2\Big]
= \sum_{i,j=1}^p \Big\{ \big(C_{ij} - C^k_{ij}\big)^2 + \mathrm{Var}\big(S^k_{ij}\big) + \mathrm{Var}\big(S_{ij}\big) \Big\}
\overset{(C1),(27)}{=} \Theta\big(p^{\tau^C_k}\big) + \Theta(p)
\overset{!}{=} \Theta\big(p^{\tau^A_k}\big) \iff \tau^A_k = \max\big(1, \tau^C_k\big).
\]

(G3): Consistency of $\widehat A_{kl}$. As the estimator $\widehat A_{kl}$ is unbiased (Bartz and Müller, 2013), we have to show that
\begin{align}
\mathrm{Var}\big(\widehat A_{kl}\big)
= \mathrm{Var}\Bigg( \sum_{i,j=1}^p \big(S^k_{ij} - S_{ij}\big)\big(S^l_{ij} - S_{ij}\big) \Bigg)
= \mathrm{Var}\Bigg( \sum_{i,j=1}^p \Big\{ S^k_{ij} S^l_{ij} - S^k_{ij} S_{ij} - S^l_{ij} S_{ij} + S_{ij}^2 \Big\} \Bigg)
= o\big(p^{\tau^A_k + \tau^A_l}\big) = o\big(p^{\max(1,\tau^C_k) + \max(1,\tau^C_l)}\big). \tag{31}
\end{align}
It suffices to show that the variances of all terms in the sum in eq. (31) are $o(p^{\tau^A_k + \tau^A_l})$.

Variance of $\sum_{ij} S_{ij}^2$: We start with the product of two identical sample covariances,
\[
\sum_{ij} S_{ij}^2 = \sum_{ij} \Bigg( \frac{1}{n} \sum_s y_{is} y_{js} \Bigg)^{\!2}
= \frac{1}{n^2} \sum_{s,t} \Bigg( \sum_i y_{is} y_{it} \Bigg)^{\!2}
= \frac{1}{n^2} \sum_s \Bigg( \sum_i y_{is}^2 \Bigg)^{\!2} + \frac{1}{n^2} \sum_{s,t\ne s} \Bigg( \sum_i y_{is} y_{it} \Bigg)^{\!2}. \tag{32}
\]
Again, it is sufficient to show that the variances of both terms are separately $o(p^{\tau^A_k + \tau^A_l})$. For the first term, we have
\[
\mathrm{Var}\Bigg( \frac{1}{n^2} \sum_s \Big( \sum_i y_{is}^2 \Big)^{\!2} \Bigg)
\le \frac{1}{n^3}\, \mathbb{E}\Bigg[ \Big( \sum_i y_i^2 \Big)^{\!4} \Bigg]
\le \frac{p^4}{n^3}(1+\alpha) \Bigg( p^{-1}\sum_i \mathbb{E}\big[y_i^2\big] \Bigg)^{\!4} = O(p) = o\big(p^{\tau^A_k + \tau^A_l}\big).
\]
Let us now look at the second term in eq. (32):
\[
\mathrm{Var}\Bigg( \frac{1}{n^2} \sum_{s,t\ne s} \Big( \sum_i y_{is} y_{it} \Big)^{\!2} \Bigg)
= \frac{1}{n^4} \sum_{s,t\ne s}\ \sum_{s',t'\ne s'} \mathrm{Cov}\Bigg( \Big(\sum_i y_{is} y_{it}\Big)^{\!2},\ \Big(\sum_i y_{is'} y_{it'}\Big)^{\!2} \Bigg).
\]
The covariance expression only depends on the cardinality of the intersection $\#\big(\{s,t\} \cap \{s',t'\}\big)$, which can take the values 0, 1 and 2. When this cardinality is zero, there is independence between the two factors and the covariance is zero as well. For $\#\big(\{s,t\} \cap \{s',t'\}\big) = 1$, we have $4n(n-1)(n-2)$ expressions of the form
\begin{align*}
\mathrm{Cov}\Bigg( \Big(\sum_i y_{i1} y_{i2}\Big)^{\!2},\ \Big(\sum_i y_{i1} y_{i3}\Big)^{\!2} \Bigg)
&= \mathbb{E}\Bigg[ \Big(\sum_i y_{i1} y_{i2}\Big)^{\!2} \Big(\sum_i y_{i1} y_{i3}\Big)^{\!2} \Bigg]
- \mathbb{E}\Bigg[ \Big(\sum_i y_{i1} y_{i2}\Big)^{\!2} \Bigg]\, \mathbb{E}\Bigg[ \Big(\sum_i y_{i1} y_{i3}\Big)^{\!2} \Bigg] \\
&\le \max\Bigg\{ \mathbb{E}\Bigg[ \Big(\sum_i y_{i1} y_{i2}\Big)^{\!2} \Big(\sum_i y_{i1} y_{i3}\Big)^{\!2} \Bigg],\ \mathbb{E}\Bigg[ \Big(\sum_i y_{i1} y_{i2}\Big)^{\!2} \Bigg]^2 \Bigg\},
\end{align*}
as both terms are positive. For the first term, we have
\begin{align*}
\mathbb{E}\Bigg[ \Big(\sum_i y_{i1} y_{i2}\Big)^{\!2} \Big(\sum_i y_{i1} y_{i3}\Big)^{\!2} \Bigg]
&= \sum_{i,j,i',j'} \mathbb{E}\big[ y_i y_{i'} y_j y_{j'} \big]\, \mathbb{E}[ y_i y_{i'} ]\, \mathbb{E}\big[ y_j y_{j'} \big]
= \sum_{i,j} \mathbb{E}\big[ y_i^2 y_j^2 \big]\, \mathbb{E}\big[ y_i^2 \big]\, \mathbb{E}\big[ y_j^2 \big] \\
&\le p^2 \Bigg( p^{-1} \sum_i \sqrt{\mathbb{E}\big[y_i^4\big]}\ \mathbb{E}\big[y_i^2\big] \Bigg)^{\!2}
\le p^2 (1+\alpha) \Bigg( p^{-1} \sum_i \mathbb{E}\big[y_i^2\big]^2 \Bigg)^{\!2} = O\big(p^{2\tau_{\gamma^2}+2}\big).
\end{align*}
For the second term, we have
\[
\mathbb{E}\Bigg[ \Big(\sum_i y_{i1} y_{i2}\Big)^{\!2} \Bigg]^2
= \Bigg( \sum_{i,j} \mathbb{E}[y_i y_j]^2 \Bigg)^{\!2}
= \Bigg( p\, p^{-1}\sum_i \mathbb{E}\big[y_i^2\big]^2 \Bigg)^{\!2} = O\big(p^{2\tau_{\gamma^2}+2}\big).
\]
Therefore, combined with the prefactors, we have
\[
\frac{4n(n-1)(n-2)}{n^4} \Bigg| \mathrm{Cov}\Bigg( \Big(\sum_i y_{is} y_{it}\Big)^{\!2},\ \Big(\sum_i y_{is} y_{it'}\Big)^{\!2} \Bigg) \Bigg|
= \frac{1}{n}\, O\big(p^{2\tau_{\gamma^2}+2}\big) = O\big(p^{2\tau_{\gamma^2}+1}\big) \overset{(C3)}{=} o\big(p^{\tau^A_k + \tau^A_l}\big),
\]
and hence we have shown that the terms with $\#(\{s,t\} \cap \{s',t'\}) = 1$ are $o(p^{\tau^A_k + \tau^A_l})$.

For $\#(\{s,t\} \cap \{s',t'\}) = 2$, we get $2n(n-1)$ expressions of the form
\[
\Bigg| \mathrm{Cov}\Bigg( \Big(\sum_i y_{is} y_{it}\Big)^{\!2},\ \Big(\sum_i y_{is} y_{it}\Big)^{\!2} \Bigg) \Bigg|
= \Bigg| \mathrm{Cov}\Bigg( \Big(\sum_i y_{i1} y_{i2}\Big)^{\!2},\ \Big(\sum_i y_{i1} y_{i2}\Big)^{\!2} \Bigg) \Bigg|
\le \sum_{i,j,i',j'} \Big| \mathrm{Cov}\big( y_{i1} y_{i2}\, y_{i'1} y_{i'2},\ y_{j1} y_{j2}\, y_{j'1} y_{j'2} \big) \Big|.
\]
We decompose the set of index quadruples into two disjoint subsets, $\{1,\dots,p\}^4 = Q \cup R$, where $Q$ is the set of quadruples of distinct integers and $R$ is the remainder:
\[
= \sum_{(i,j,i',j') \in Q} \big| \mathrm{Cov}\big( y_{i1}y_{i2}y_{i'1}y_{i'2},\ y_{j1}y_{j2}y_{j'1}y_{j'2} \big) \big|
+ \sum_{(i,j,i',j') \in R} \big| \mathrm{Cov}\big( y_{i1}y_{i2}y_{i'1}y_{i'2},\ y_{j1}y_{j2}y_{j'1}y_{j'2} \big) \big|.
\]
The summands of the sum over $Q$ we can bring into a form which is dominated as a consequence of (C2):
\begin{align}
\big| \mathrm{Cov}\big( y_{i1}y_{i2}y_{i'1}y_{i'2},\ y_{j1}y_{j2}y_{j'1}y_{j'2} \big) \big|
&= \big| \mathbb{E}\big[ y_i y_{i'} y_j y_{j'} \big]^2 - \mathbb{E}[ y_i y_{i'} ]^2\, \mathbb{E}\big[ y_j y_{j'} \big]^2 \big|
= \mathbb{E}\big[ y_i y_{i'} y_j y_{j'} \big]^2 \nonumber\\
&= \Big( \mathrm{Cov}\big( y_i y_{i'},\ y_j y_{j'} \big) + \mathbb{E}[y_i y_{i'}]\, \mathbb{E}\big[y_j y_{j'}\big] \Big)^2
= \mathrm{Cov}\big( y_i y_{i'},\ y_j y_{j'} \big)^2. \tag{33}
\end{align}
Taking the prefactors into account, we get
\[
\frac{2n(n-1)}{n^4} \sum_{(i,j,i',j') \in Q} \big| \mathrm{Cov}\big( y_{i1}y_{i2}y_{i'1}y_{i'2},\ y_{j1}y_{j2}y_{j'1}y_{j'2} \big) \big|
\le \frac{p^4}{n^2}\, \frac{ \sum_{(i,j,i',j') \in Q} \mathrm{Cov}\big( y_i y_{i'}, y_j y_{j'} \big)^2 }{ |Q_p| }
\overset{(C2)}{=} O\big(p^2\big)\, o(1) = o\big(p^2\big) = o\big(p^{\tau^A_k + \tau^A_l}\big).
\]
For the sum over $R$, every quadruple has at most three distinct indices, and we have
\begin{align}
\sum_{(i,j,i',j') \in R} \big| \mathrm{Cov}\big( y_{i1}y_{i2}y_{i'1}y_{i'2},\ y_{j1}y_{j2}y_{j'1}y_{j'2} \big) \big|
&\le 4 \sum_{i,j,j'} \Big\{ \big| \mathrm{Cov}\big( y_{i1}^2 y_{i2}^2,\ y_{j1}y_{j2}y_{j'1}y_{j'2} \big) \big|
+ \big| \mathrm{Cov}\big( y_{i1}y_{i2}y_{j1}y_{j2},\ y_{i1}y_{i2}y_{j'1}y_{j'2} \big) \big| \Big\} \nonumber\\
&\le 8 \sum_{i,j,j'} \sqrt{ \mathbb{E}\big[ y_i^4 y_j^4 \big]\, \mathbb{E}\big[ y_i^4 y_{j'}^4 \big] }
\le 8(1+\alpha) \sum_{i,j,j'} \mathbb{E}\big[y_i^4\big]\, \sqrt{\mathbb{E}\big[y_j^4\big]}\, \sqrt{\mathbb{E}\big[y_{j'}^4\big]} \nonumber\\
&\le 8(1+\alpha)^2\, p^3 \Bigg( p^{-1}\sum_i \mathbb{E}\big[y_i^4\big] \Bigg) \Bigg( p^{-1}\sum_j \mathbb{E}\big[y_j^2\big] \Bigg)^{\!2} = O\big(p^{\tau_{\gamma^2}+3}\big). \tag{34}
\end{align}
Together with the prefactors, we obtain
\[
\frac{2n(n-1)}{n^4} \sum_{(i,j,i',j') \in R} \big| \mathrm{Cov}\big( y_{i1}y_{i2}y_{i'1}y_{i'2},\ y_{j1}y_{j2}y_{j'1}y_{j'2} \big) \big|
= \frac{1}{n^2}\, O\big(p^{\tau_{\gamma^2}+3}\big) = O\big(p^{\tau_{\gamma^2}+1}\big) \overset{(C3)}{=} o\big(p^{\tau^A_k + \tau^A_l}\big).
\]
This finishes the proof for the terms with $\#(\{s,t\} \cap \{s',t'\}) = 2$, and in total we have shown that $\mathrm{Var}\big(\sum_{ij} S_{ij}^2\big)$ is $o(p^{\tau^A_k + \tau^A_l})$. For $\mathrm{Var}\big(\sum_{ij} S^k_{ij} S^l_{ij}\big)$ with $k = l$, an analogous proof holds.

Variance of $\sum_{ij} S^k_{ij} S_{ij}$: Let us now analyse the products of different sample covariances in eq. (31):
\begin{align*}
\mathrm{Var}\Bigg( \sum_{ij} S^k_{ij} S_{ij} \Bigg)
&= \mathrm{Var}\Bigg( \frac{1}{n n_k} \sum_{ij} \sum_{st} x^k_{is} x^k_{js}\, x_{it} x_{jt} \Bigg)
= \frac{1}{n n_k} \sum_{ijgh} \mathrm{Cov}\big( x^k_i x^k_j\, x_i x_j,\ x^k_g x^k_h\, x_g x_h \big) \\
&= \frac{1}{n n_k} \sum_{ijgh} \Big\{ \mathrm{Cov}\big( x^k_i x^k_j,\ x^k_g x^k_h \big)\, \mathrm{Cov}\big( x_i x_j,\ x_g x_h \big)
+ C^k_{ij} C^k_{gh}\, \mathrm{Cov}\big( x_i x_j,\ x_g x_h \big)
+ C_{ij} C_{gh}\, \mathrm{Cov}\big( x^k_i x^k_j,\ x^k_g x^k_h \big) \Big\}.
\end{align*}
The first term can be separated into the contributions of the two different data sets:
\[
\frac{1}{n n_k} \sum_{ijgh} \mathrm{Cov}\big( x^k_i x^k_j, x^k_g x^k_h \big)\, \mathrm{Cov}( x_i x_j, x_g x_h )
\le \frac{1}{2 n n_k} \sum_{ijgh} \Big\{ \mathrm{Cov}\big( x^k_i x^k_j, x^k_g x^k_h \big)^2 + \mathrm{Cov}( x_i x_j, x_g x_h )^2 \Big\}.
\]
These terms are rotation invariant, therefore we analyse
\[
\frac{1}{n n_k} \sum_{ijgh} \mathrm{Cov}( y_i y_j, y_g y_h )^2.
\]
For $i, j, g, h$ distinct, this leads directly to assumption (C2). Otherwise, we have
\begin{align*}
\frac{1}{n n_k} \sum_{\substack{ijgh\\ \text{not distinct}}} \mathrm{Cov}( y_i y_j, y_g y_h )^2
&\le \frac{c}{n n_k} \sum_{ijg} \Big\{ \mathrm{Cov}( y_i y_g, y_j y_g )^2 + \mathrm{Cov}\big( y_i y_j, y_g^2 \big)^2 \Big\}
\le \frac{c}{n n_k} \sum_{ijg} \sqrt{\mathbb{E}\big[y_i^4\big]}\, \sqrt{\mathbb{E}\big[y_j^4\big]}\ \mathbb{E}\big[y_g^4\big] \\
&\le \frac{c\,(1+\alpha)^2\, p^3}{n n_k} \Bigg( p^{-1}\sum_i \gamma_i \Bigg)^{\!2} \Bigg( p^{-1}\sum_g \gamma_g^2 \Bigg)
= O\big(p^{\tau_{\gamma^2}+1}\big) = o\big(p^{\tau^A_k + \tau^A_l}\big).
\end{align*}
Next we consider the second term,
\begin{align*}
\frac{1}{n n_k} \sum_{ijgh} C^k_{ij} C^k_{gh}\, \mathrm{Cov}( x_i x_j, x_g x_h )
&= \frac{1}{n n_k} \sum_{ijgh} \Sigma^k_{ij} \Sigma^k_{gh}\, \mathrm{Cov}( z_i z_j, z_g z_h )
\le \frac{1}{n n_k} \sum_{ig} \gamma^k_i \gamma^k_g\, \sqrt{\mathbb{E}\big[z_i^4\big]\, \mathbb{E}\big[z_g^4\big]}
= \frac{1}{n n_k} \Bigg( \sum_i \gamma^k_i \sqrt{\mathbb{E}\big[z_i^4\big]} \Bigg)^{\!2} \\
&\le \frac{(1+\alpha)\, p^2}{n n_k} \Bigg( p^{-1}\sum_i \big(\gamma^k_i\big)^2 \Bigg) \Bigg( p^{-1}\sum_i \gamma_i^2 \Bigg)
= O\big(p^{\tau_{\gamma^2} + \tau^k_{\gamma^2}}\big) = o\big(p^{\tau^A_k + \tau^A_l}\big),
\end{align*}
and the third term is bounded in the same way, with the roles of the two data sets exchanged. With this we have shown that all terms in $\mathrm{Var}\big(\sum_{ij} S^k_{ij} S_{ij}\big)$, and hence $\mathrm{Var}(\widehat A_{kl})$, are $o(p^{\tau^A_k + \tau^A_l})$.

(G3), part II: Consistency of $\hat b_k$. By reformulation we obtain
\begin{align}
\sum_{ij} \widehat{\mathrm{Var}}\big(S_{ij}\big)
&= \sum_{ij} \frac{n}{(n-1)^3} \sum_s \Bigg( y_{is} y_{js} - \frac{1}{n} \sum_{s'} y_{is'} y_{js'} \Bigg)^{\!2}
= \frac{n}{(n-1)^3} \sum_{ij} \Bigg( \sum_s y_{is}^2 y_{js}^2 - \frac{1}{n} \sum_{s,s'} y_{is} y_{js} y_{is'} y_{js'} \Bigg) \nonumber\\
&= \frac{n}{(n-1)^3} \sum_s \Bigg( \sum_i y_{is}^2 \Bigg)^{\!2} - \frac{n^2}{(n-1)^3} \sum_{ij} S_{ij}^2. \tag{35}
\end{align}
Both terms, with different prefactors, have been analysed above: the variance of the first term is $O(p)$, and the variance of the second term is $o(p^2)$. Hence $\mathrm{Var}(\hat b_k)$ is $o\big(p^{2\tau_{\hat\theta}}\big) = o\big(p^2\big)$.

(G4): Restriction on linear combinations. Let $L$ be $\mathbb{R}^K$ or $\mathbb{R}^K_{\ge 0}$. We have
\begin{align}
\Theta\big(p^{\tau^A_k}\big) &\overset{!}{=} \min_{\substack{\alpha \in L\\ \alpha_k = 1}} \sum_{i=1}^q \mathbb{E}\Bigg[\bigg( \sum_{l=1}^K \alpha_l \big(\widehat T_{li} - \hat\theta_i\big) \bigg)^{\!2}\Bigg]
= \min_{\substack{\alpha \in L\\ \alpha_k = 1}} \sum_{ij} \mathbb{E}\Bigg[\bigg( \sum_{l=1}^K \alpha_l \big(S^l_{ij} - S_{ij}\big) \bigg)^{\!2}\Bigg] \tag{36}\\
&= \min_{\substack{\alpha \in L\\ \alpha_k = 1}} \sum_{ij} \Bigg\{ \bigg( \sum_{l=1}^K \alpha_l \big(C^l_{ij} - C_{ij}\big) \bigg)^{\!2} + \bigg( \sum_{l=1}^K \alpha_l \bigg)^{\!2} \mathrm{Var}\big(S_{ij}\big) + \sum_{l=1}^K \alpha_l^2\, \mathrm{Var}\big(S^l_{ij}\big) \Bigg\} \nonumber\\
&\ge \Theta\big(p^{\tau^C_k}\big) + \sum_{ij} \mathrm{Var}\big(S^k_{ij}\big) = \Theta\big(p^{\tau^C_k}\big) + \Theta(p) = \Theta\big(p^{\max(1,\tau^C_k)}\big). \nonumber
\end{align}
This concludes the proof of Theorem 5.
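For the covariance, the analogous plug-in quantities can be computed directly from the sample covariance matrices. A minimal sketch under the same restriction as the proof (targets given by sample covariances of additional data sets); the variance estimator for $S_{ij}$ follows the reformulation in eq. (35), and all names are our own:

```python
import numpy as np


def mts_cov_coefficients(X, targets):
    """Plug-in estimates of A and b for MTS of the covariance.

    X       : (n, p) primary data set.
    targets : list of (n_k, p) data sets whose sample covariance
              matrices serve as shrinkage targets.
    """
    n, p = X.shape
    Xc = X - X.mean(axis=0)
    S = Xc.T @ Xc / (n - 1)

    # A_kl = sum_ij (S^k_ij - S_ij)(S^l_ij - S_ij), cf. eq. (31)
    D = []
    for X_k in targets:
        Xkc = X_k - X_k.mean(axis=0)
        D.append((Xkc.T @ Xkc / (len(X_k) - 1) - S).ravel())
    D = np.stack(D)                                   # (K, p*p)
    A_hat = D @ D.T

    # b_k = sum_ij Var(S_ij), estimated from the per-observation products
    # x_is * x_js (cf. eq. (35)); O(n p^2) memory, fine for a sketch.
    prods = np.einsum('si,sj->sij', Xc, Xc)           # (n, p, p)
    var_S = prods.var(axis=0, ddof=1) * n / (n - 1) ** 2
    b_hat = np.full(len(targets), var_S.sum())
    return A_hat, b_hat
```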
A.6 Proof of Theorem 6 (FOLDL consistency of MTS of the covariance)

Proof
(G1) & (G2): Asymptotic behaviour of $\Delta_{\hat\theta}$, $b_k$ and $A_{kk}$. We first show the asymptotic behaviour
\[
\Delta_{\hat\theta} = b_k = \sum_{ij} \mathrm{Var}\big(S_{ij}\big) = \Theta\big(p^2\big) \overset{!}{=} \Theta\big(p^{\tau_{\hat\theta}}\big) \iff \tau_{\hat\theta} = 2. \tag{37}
\]
The upper bound follows from (compare to eq. (28))
\[
b_k = \sum_{i,j} \mathrm{Var}\big(S'_{ij}\big) \le \frac{2p^2}{n}(1+\alpha) \Bigg( p^{-1}\sum_i \mathbb{E}\big[y_i^2\big] \Bigg)^{\!2} = \Theta\big(p^2\big).
\]
For the lower bound, we again distinguish two cases: for $\tau_{\gamma^2} = 1$, we have (compare to eq. (29))
\[
\sum_{i,j} \mathrm{Var}\big(S'_{ij}\big) \ge \beta\, \frac{p}{n}\, p^{-1} \sum_i \gamma_i^2 = \Theta\big(p^{\tau_{\gamma^2}+1}\big) = \Theta\big(p^2\big).
\]
For the case $\tau_{\gamma^2} < 1$, we have (compare to eq. (30))
\[
\sum_{i,j} \mathrm{Var}\big(S'_{ij}\big) \ge \frac{p^2}{n} \Bigg( p^{-1}\sum_i \mathbb{E}\big[x_i^2\big] \Bigg)^{\!2} - \frac{p}{n}\, p^{-1}\sum_i \gamma_i^2 = \Theta\big(p^2\big) - \Theta\big(p^{\tau_{\gamma^2}+1}\big) = \Theta\big(p^2\big).
\]
For the asymptotic behaviour of $A_{kk}$ we then have
\[
A_{kk} = \sum_{i,j=1}^p \Big\{ \big(C_{ij} - C^k_{ij}\big)^2 + \mathrm{Var}\big(S^k_{ij}\big) + \mathrm{Var}\big(S_{ij}\big) \Big\}
= \Theta\big(p^{\tau^C_k}\big) + \Theta\big(p^2\big) \overset{!}{=} \Theta\big(p^{\tau^A_k}\big) \iff \forall k: \tau^A_k = 2, \tag{38}
\]
where we used the fact that $\sum_{ij} \mathrm{Var}(S^k_{ij})$ has the same limit behaviour as $\sum_{ij} \mathrm{Var}(S_{ij})$.

(G3), part I: Consistency of $\widehat A_{kl}$. The proof is analogous to the one of Theorem 5. We only show that $\mathrm{Var}\big(\sum_{ij} S_{ij}^2\big)$, the expression with the highest variance, is $o(p^{\tau^A_k + \tau^A_l}) = o(p^4)$. We use the same decomposition as above:
\[
\sum_{ij} S_{ij}^2 = \frac{1}{n^2} \sum_s \Bigg( \sum_i y_{is}^2 \Bigg)^{\!2} + \frac{1}{n^2} \sum_{s,t\ne s} \Bigg( \sum_i y_{is} y_{it} \Bigg)^{\!2}. \tag{39}
\]
This asymptotic setting is easier because the sums over $s$ and $t$ are finite sums. For each of the finitely many terms in the first sum in eq. (39), we have
\[
\mathrm{Var}\Bigg( \Big( \sum_i y_{is}^2 \Big)^{\!2} \Bigg) = \sum_{i,j,i',j'} \mathrm{Cov}\big( y_i^2 y_j^2,\ y_{i'}^2 y_{j'}^2 \big)
= \sum_{(i,j,i',j') \in Q} \mathrm{Cov}\big( y_i^2 y_j^2,\ y_{i'}^2 y_{j'}^2 \big) + \sum_{(i,j,i',j') \in R} \mathrm{Cov}\big( y_i^2 y_j^2,\ y_{i'}^2 y_{j'}^2 \big). \tag{40}
\]
For the sum over $Q$, we need assumption (C4):
\[
\sum_{(i,j,i',j') \in Q} \mathrm{Cov}\big( y_i^2 y_j^2,\ y_{i'}^2 y_{j'}^2 \big)
\le p^4\, \frac{ \sum_{(i,j,i',j') \in Q} \mathrm{Cov}\big( y_i^2 y_j^2,\ y_{i'}^2 y_{j'}^2 \big) }{ |Q_p| } \overset{(C4)}{=} o\big(p^4\big).
\]
For the sum over $R$, we have
\begin{align*}
\sum_{(i,j,i',j') \in R} \mathrm{Cov}\big( y_i^2 y_j^2,\ y_{i'}^2 y_{j'}^2 \big)
&\le c \sum_{i,j,i'} \Big\{ \mathrm{Cov}\big( y_i^2 y_j^2,\ y_{i'}^2 y_i^2 \big) + \mathrm{Cov}\big( y_i^2 y_j^2,\ y_{i'}^4 \big) \Big\} \\
&\le c \sum_{i,j,i'} \Big\{ \sqrt{\mathbb{E}\big[y_i^4 y_j^4\big]}\, \sqrt{\mathbb{E}\big[y_{i'}^4 y_i^4\big]} + \sqrt{\mathbb{E}\big[y_i^4 y_j^4\big]}\, \sqrt{\mathbb{E}\big[y_{i'}^8\big]} \Big\} \\
&\le c\,(1+\alpha) \sum_{i,j,i'} \mathbb{E}\big[y_i^4\big]\, \sqrt{\mathbb{E}\big[y_j^4\big]}\, \sqrt{\mathbb{E}\big[y_{i'}^4\big]}
= O\big(p^{\tau_{\gamma^2}+3}\big) \overset{(C3')}{=} o\big(p^4\big).
\end{align*}
For the terms in the second sum in eq. (39), we have
\[
\mathrm{Var}\Bigg( \Big( \sum_i y_{i1} y_{i2} \Big)^{\!2} \Bigg)
= \sum_{i,j,i',j'} \mathrm{Cov}\big( y_{i1}y_{i2}\,y_{i'1}y_{i'2},\ y_{j1}y_{j2}\,y_{j'1}y_{j'2} \big)
\le \sum_{(i,j,i',j') \in Q \cup R} \big| \mathrm{Cov}\big( y_{i1}y_{i2}y_{i'1}y_{i'2},\ y_{j1}y_{j2}y_{j'1}y_{j'2} \big) \big|.
\]
For the sum over $Q$, we simplify using eq. (33) and obtain
\[
\sum_{(i,j,i',j') \in Q} \big| \mathrm{Cov}\big( y_{i1}y_{i2}y_{i'1}y_{i'2},\ y_{j1}y_{j2}y_{j'1}y_{j'2} \big) \big|
= \sum_{(i,j,i',j') \in Q} \mathrm{Cov}\big( y_i y_{i'},\ y_j y_{j'} \big)^2
\le p^4\, \frac{ \sum_{(i,j,i',j') \in Q} \mathrm{Cov}\big( y_i y_{i'}, y_j y_{j'} \big)^2 }{ |Q_p| } \overset{(C4)}{=} o\big(p^4\big).
\]
For the sum over $R$, we have, as in eq. (34),
\[
\sum_{(i,j,i',j') \in R} \big| \mathrm{Cov}\big( y_{i1}y_{i2}y_{i'1}y_{i'2},\ y_{j1}y_{j2}y_{j'1}y_{j'2} \big) \big| = O\big(p^{\tau_{\gamma^2}+3}\big) \overset{(C3')}{=} o\big(p^4\big).
\]
With this we have shown that all terms, and hence $\mathrm{Var}(\widehat A_{kl})$, are $o(p^{\tau^A_k + \tau^A_l})$.

(G3), part II: Consistency of $\hat b_k$. As in eq. (35), we have
\[
\sum_{ij} \widehat{\mathrm{Var}}\big(S_{ij}\big) = \frac{n}{(n-1)^3} \sum_s \Bigg( \sum_i y_{is}^2 \Bigg)^{\!2} - \frac{n^2}{(n-1)^3} \sum_{ij} S_{ij}^2.
\]
The first term is, up to its prefactor, equal to the first sum in eq. (39), and hence its variance is $o(p^4)$. The second term is proportional to the left hand side of eq. (39), and its variance is therefore also $o(p^4)$. In total, $\mathrm{Var}(\hat b_k)$ is $o\big(p^{2\tau_{\hat\theta}}\big)$.

(G4): Restriction on linear combinations. Following the same steps as in eq. (36), we obtain
\[
\Theta\big(p^{\tau^A_k}\big) \overset{!}{=} \min_{\substack{\alpha \in \mathbb{R}^K\\ \alpha_k=1}} \sum_{i=1}^q \mathbb{E}\Bigg[\bigg( \sum_{l=1}^K \alpha_l \big(\widehat T_{li} - \hat\theta_i\big) \bigg)^{\!2}\Bigg]
\ge \sum_{ij} \mathrm{Var}\big(S^k_{ij}\big) = \Theta\big(p^2\big).
\]
This concludes the proof of Theorem 6.
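Putting the pieces together, a hypothetical end-to-end run for covariance MTS with one additional data set, reusing `mts_cov_coefficients` and `solve_mts_qp` from the sketches above; the data-generating model is our own illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
p, n, n_k = 100, 60, 80

# Primary data and one additional data set from a similar distribution;
# for a diagonal C, np.sqrt(C) is its matrix square root.
C = np.diag(np.linspace(0.5, 2.0, p))                # true covariance
X = rng.standard_normal((n, p)) @ np.sqrt(C)
X_k = rng.standard_normal((n_k, p)) @ np.sqrt(1.1 * C)

A_hat, b_hat = mts_cov_coefficients(X, [X_k])
lam = solve_mts_qp(A_hat, b_hat)

Xc = X - X.mean(axis=0)
S = Xc.T @ Xc / (n - 1)
Xkc = X_k - X_k.mean(axis=0)
S_k = Xkc.T @ Xkc / (n_k - 1)

S_mts = (1 - lam.sum()) * S + lam[0] * S_k           # convex combination
print("lambda:", lam,
      "error ratio:", np.linalg.norm(S_mts - C) / np.linalg.norm(S - C))
```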
References

Fevzi Alimoglu and Ethem Alpaydin. Combining multiple representations and classifiers for pen-based handwritten digit recognition. In Proceedings of the Fourth International Conference on Document Analysis and Recognition, volume 2, pages 637–640. IEEE, 1997.

Kevin Bache and Moshe Lichman. UCI machine learning repository. University of California, Irvine, School of Information and Computer Sciences, 2013. URL http://archive.ics.uci.edu/ml.

Daniel Bartz and Klaus-Robert Müller. Generalizing analytic shrinkage for arbitrary covariance structures. In C.J.C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K.Q. Weinberger, editors, Advances in Neural Information Processing Systems 26, pages 1869–1877, 2013.

Daniel Bartz and Klaus-Robert Müller. Covariance shrinkage for autocorrelated data. In Z. Ghahramani, M. Welling, C. Cortes, N.D. Lawrence, and K.Q. Weinberger, editors, Advances in Neural Information Processing Systems 27, pages 1592–1600. Curran Associates, Inc., 2014. URL http://papers.nips.cc/paper/5399-covariance-shrinkage-for-autocorrelated-data.pdf.

Benjamin Blankertz, Ryota Tomioka, Steven Lemm, Motoaki Kawanabe, and Klaus-Robert Müller. Optimizing spatial filters for robust EEG single-trial analysis. IEEE Signal Processing Magazine, 25(1):41–56, 2008.

Benjamin Blankertz, Claudia Sannelli, Sebastian Halder, Eva M. Hammer, Andrea Kübler, Klaus-Robert Müller, Gabriel Curio, and Thorsten Dickhaus. Neurophysiological predictor of SMR-based BCI performance. NeuroImage, 51(4):1303–1309, 2010.

Benjamin Blankertz, Steven Lemm, Matthias Sebastian Treder, Stefan Haufe, and Klaus-Robert Müller. Single-trial analysis and classification of ERP components – a tutorial. NeuroImage, 56:814–825, 2011. URL http://dx.doi.org/10.1016/j.neuroimage.2010.06.048.

James W. Daniel. Stability of the solution of definite quadratic programs. Mathematical Programming, 5:41–53, 1973.

David L. Donoho and Iain M. Johnstone. Adapting to unknown smoothness via wavelet shrinkage. Journal of the American Statistical Association, 90(432):1200–1224, 1995.

William James and Charles Stein. Estimation with quadratic loss. In Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics and Probability, volume 1, pages 361–379, 1961.

Olivier Ledoit and Michael Wolf. Improved estimation of the covariance matrix of stock returns with an application to portfolio selection. Journal of Empirical Finance, 10:603–621, 2003.

Olivier Ledoit and Michael Wolf. A well-conditioned estimator for large-dimensional covariance matrices. Journal of Multivariate Analysis, 88(2):365–411, 2004.

Sinno Jialin Pan and Qiang Yang. A survey on transfer learning. IEEE Transactions on Knowledge and Data Engineering, 22(10):1345–1359, 2010.

Alessio Sancetta. Weak conditions for shrinking multivariate nonparametric density estimators. Journal of Multivariate Analysis, 115:285–300, 2013.

Juliane Schäfer and Korbinian Strimmer. A shrinkage approach to large-scale covariance matrix estimation and implications for functional genomics. Statistical Applications in Genetics and Molecular Biology, 4(1):1175–1189, 2005.

Martijn Schreuder, Thomas Rost, and Michael Tangermann. Listen, you are writing! Speeding up online spelling with a dynamic auditory BCI. Frontiers in Neuroscience, 5(112), 2011. ISSN 1662-453X. doi: 10.3389/fnins.2011.00112.

Charles Stein. Inadmissibility of the usual estimator for the mean of a multivariate normal distribution. In Proceedings of the Third Berkeley Symposium on Mathematical Statistics and Probability, volume 1, pages 197–206, 1956.