[PDF] Multi-Target Shrinkage

Abstract

Stein showed that the multivariate sample mean is outperformed by "shrinking" to a constant target vector. Ledoit and Wolf extended this approach to the sample covariance matrix and proposed a multiple of the identity as shrinkage target. In a general framework, independent of a specific estimator, we extend the shrinkage concept by allowing simultaneous shrinkage to a set of targets. Application scenarios include settings with (A) additional data sets from potentially similar distributions, (B) non-stationarity, (C) a natural grouping of the data or (D) multiple alternative estimators which could serve as targets. We show that this Multi-Target Shrinkage can be translated into a quadratic program and derive conditions under which the estimation of the shrinkage intensities yields optimal expected squared error in the limit. For the sample mean and the sample covariance as specific instances, we derive conditions under which the optimality of MTS is applicable. We consider two asymptotic settings: the large dimensional limit (LDL), where the dimensionality and the number of observations go to infinity at the same rate, and the finite observations large dimensional limit (FOLDL), where only the dimensionality goes to infinity while the number of observations remains constant. We then show the effectiveness in extensive simulations and on real world data.

Full PDF

JJournal of Machine Learning Research ? (2014) ?-? Submitted 11/14; Published ??/??

Multi-Target Shrinkage

Daniel Bartz thankscorresponding authors. [email protected]

Department of Computer Science, TU BerlinMarchstraße 23, 10587 Berlin, Germany

Johannes H¨ohne [email protected]

Department of Computer Science, TU BerlinMarchstraße 23, 10587 Berlin, Germany

Klaus-Robert M¨uller ∗ [email protected] Department of Computer Science, TU BerlinMarchstraße 23, 10587 Berlin, GermanyKorea University, Korea, Seoul

Editor: ???

Abstract

Stein showed that the multivariate sample mean is outperformed by “shrinking” to a con-stant target vector. Ledoit and Wolf extended this approach to the sample covariancematrix and proposed a multiple of the identity as shrinkage target. In a general frame-work, independent of a speciﬁc estimator, we extend the shrinkage concept by allowingsimultaneous shrinkage to a set of targets. Application scenarios include settings with(A) additional data sets from potentially similar distributions, (B) non-stationarity, (C) anatural grouping of the data or (D) multiple alternative estimators which could serve astargets.We show that this

Multi-Target Shrinkage can be translated into a quadratic programand derive conditions under which the estimation of the shrinkage intensities yields optimalexpected squared error in the limit. For the sample mean and the sample covariance asspeciﬁc instances, we derive conditions under which the optimality of MTS is applicable.We consider two asymptotic settings: the large dimensional limit (LDL), where the dimen-sionality and the number of observations go to inﬁnity at the same rate, and the ﬁniteobservations large dimensional limit (FOLDL), where only the dimensionality goes to in-ﬁnity while the number of observations remains constant. We then show the eﬀectivenessin extensive simulations and on real world data.

Keywords:

Covariance estimation, Shrinkage, Large Dimensional Limit, Linear Discrim-inant Analysis, Transfer Learning

1. Introduction and Motivation

Shrinkage is a widely applied estimation technique dating back to Charles Stein (Stein,1956; James and Stein, 1961). Stein showed that the sample mean is not admissible, e.g.that the shrinkage mean estimator is always better. The performance gain is achieved byoptimizing the bias-variance-trade-oﬀ between the unbiased, high variance sample estimateand a biased, low variance target. c (cid:13) a r X i v : . [ s t a t . M E ] D ec artz et al. Truth θ b θ unbiased estimate Target 1 b T Target 2 b T b θ MTS b θ STS1 b θ STS2 ∆ STS1 ∆ STS2 ∆ MTS Figure 1: Geometric illustration of Multi-Target Shrinkage. The unbiased estimate and thetwo targets span a convex set. The optimal MTS estimate is the estimate in theconvex set with minimum squared distance to the truth.Over the last years, shrinkage has become very popular for the estimation of covariancematrices. Ledoit and Wolf proposed an analytic formula for covariance shrinkage whichallows to calculate the optimal shrinkage intensity w.r.t. expected squared error (ESE)with low computational cost (Ledoit and Wolf, 2004) and serves as an alternative to time-consuming cross-validation. Shrinkage has further been applied to wavelets (Donoho andJohnstone, 1995) and density estimators (Sancetta, 2013).In the following, we will propose a generalization of the analytic shrinkage approach,in the following called Single-Target Shrinkage (STS), to multiple shrinkage targets. Fig-ure 1 illustrates Single- and Multi-Target Shrinkage (MTS) of an unbiased estimator (cid:98) θ ofa parameter θ for the case of two available shrinkage targets (cid:98) T and (cid:98) T . The convex com-binations of the three estimators span a triangle whose color coding visualizes the squarederror of each combination . The two standard Single-Target Shrinkage estimators (cid:98) θ STS1 ( λ ) = (1 − λ ) (cid:98) θ + λ (cid:98) T (cid:98) θ STS2 ( λ ) = (1 − λ ) (cid:98) θ + λ (cid:98) T are restricted to the lines connecting (cid:98) θ with (cid:98) T and (cid:98) T , respectively. For the optimalshrinkage intensities λ (cid:63) STS1 and λ (cid:63) STS2 , both estimators improve over (cid:98) θ . Further improvementcan be achieved by the Multi-Target Shrinkage estimator (cid:98) θ MTS ( λ , λ ) = (1 − λ − λ ) (cid:98) θ + λ (cid:98) T + λ (cid:98) T ,

1. Note that we do not use diﬀerent symbols for the estimator (a random variable) and the estimate (arealization of the random variable). It will be clear from the context to which we refer.2. The optimum can lie on the border of the triangle if one of the targets is completely useless. Otherwiseit will lie within the triangle. ulti-Target Shrinkage Truthsample estimate

Target 1 Target 2 ∆ STS1=14.7 ∆ STS2=9.4 ∆ MTS =6.4 Figure 2: Geometric illustration of Multi-Target Shrinkage for handwritten digits. Thetargets are the mean images of digit 9 for two diﬀerent subjects.the optimal convex combination of the sample estimate and the two targets. This is nicelyseen in Figure 1 where we have∆

MTS := (cid:107) θ − (cid:98) θ MTS (cid:107) < (cid:107) θ − (cid:98) θ STS1 / STS2 (cid:107) := ∆

STS1 / STS2 . As an illustration we consider MTS for the estimation of subject-speciﬁc mean imageson a data set of handwritten digits (Alimoglu and Alpaydin, 1997; Bache and Lichman,2013). Assume we want to estimate the mean image of digit 9 of person A from a smallnumber of observations. In this case MTS improves over the sample mean image and STSby shrinking towards the mean images of two other subjects T1 and T2 . This can be seenin Figure 2: for MTS, the diﬀerences to the truth are less pronounced than in STS andthe squared error is smaller.The illustrations Figure 1 and 2 are limited to the case of simultaneous shrinkage to twoshrinkage targets. MTS can handle an arbitrary number of shrinkage targets (cid:98) T , (cid:98) T , . . . , (cid:98) T k .Figure 3 shows this for the handwritten digits: incorporating more and more targets, thesquared error decreases.There are a many application scenarios for Multi-Target Shrinkage: • similar data sets: assume that K additional data sets from similar distributions exist.Then, we can calculate a target (cid:98) T k on each additional data set and use MTS to decidehow useful the other data sets are for the estimation task. This is a special case of transfer learning (see (Pan and Yang, 2010) for a recent review). The handwrittendigits example (Figure 2) falls into this category.

3. The data set consists of 10992 traces, approximately equally distributed over 44 subjects and the 10digits 0 , , . . .

9. We converted the traces into images of size 30 × A serves as a proxy to the truth. artz et al. a v e r age s qua r ed e rr o r K (number of targets)

Truth

K = 0 ∆ ˆ Θ =14.2 K = 1 ∆ STS=9.4

K = 43 ∆ MTS =5.6 Figure 3: Decay of the squared error for increasing number of shrinkage targets. Averageover R = 10000 random choices of digits and subjects. • data with group structure: if there is a natural group structure in a data set, one canestimate θ either (A) on the whole data set or (B) on each group separately. – When θ is independent of group membership, (A) is optimal and MTS yieldsapproximately equal weights. – When θ is very diﬀerent for each group, (B) is optimal and MTS puts approxi-mately no weight on the targets. – When θ is dependent of group membership, but similar, MTS provides an optimalweighting of each group which is superior to both (A) and (B). • non-stationarity: assume that the parameter θ is non-stationary. MTS can yield asuperior estimate of the current value of θ by treating older segments of the data asshrinkage targets. • multiple available targets: for covariance shrinkage, a set of biased estimators hasbeen proposed as shrinkage targets: the identity, a multiple of the identity, a diagonalmatrix, constant and perfect correlation matrices or, in a ﬁnance context, a factormodel (see (Sch¨afer and Strimmer, 2005; Ledoit and Wolf, 2003). Which one of thesestructured estimators constitutes the best target depends on the structure of the truecovariance matrix. The choice is based on expert knowledge or cross-validation. Incontrast, MTS does not make a choice but yields an an optimal weighting of all targetswhich is equal or superior to the optimal choice.We have stated above that the optimal STS can be estimated by minimizing the ESE orby a slower cross-validation approach. For MTS, the computational cost to cross-validate ulti-Target Shrinkage K parameters grows with the power of K which is not feasible. We therefore extend theapproach of minimizing the ESE to multiple shrinkage targets.In Section 3 we will introduce the MTS approach independently of a speciﬁc estimatorand derive a quadratic program for the optimal shrinkage intensities. We then prove con-ditions under which the MTS estimate on a sequence of statistical model converges to theoptimum.For the sample mean (section 4) and the sample covariance matrix (section 5) we showwhen these conditions are fulﬁlled. We consider two asymptotic settings: the large dimen-sional limit (LDL), where the dimensionality and the number of observations go to inﬁnityat the same rate, and the ﬁnite observations large dimensional limit (FOLDL), where onlythe number of dimensions goes to inﬁnity while the number of observations remains con-stant. In both settings MTS is consistent, although we will show that the FOLDL requiresstronger restrictions on the covariance structure.Section 6 presents simulations which illustrate the theorems and demonstrate the capa-bilities of MTS. Section 7 shows applications on real world data.

2. Notation, distributional assumptions and asymptotic framework

General notation

Our notation adheres to the following conventions: • Matrices M and vectors v are written in upper case and lower case bold letters,respectively, their entries are given by M ij and v i . m j denotes the j th column of thematrix M with entries m ij ≡ M ij . • Quantities with a hat, (cid:99) M and (cid:98) v always denote estimators. • Var( a ) and Cov( a, b ) denote the variance of a and the covariance between a and b ,respectively. • (cid:100) Var( a ) and (cid:100) Cov( a, b ) denote estimators of variance and covariance which have to bespeciﬁed for each set of parameters a and b . • For asymptotic behaviour, we make use of the Bachmann-Landau symbols O , o and Θ.We here only deﬁne the less frequently used Θ, which denotes asymptotically boundedfrom above and below : f = Θ( g ) ⇐⇒ ∃ c > ∃ C > ∃ x > ∀ x > x : c · | g ( x ) | ≤ | f ( x ) | ≤ C · | g ( x ) | . Notation for MTS

In section 3 the general case is analysed: • we consider the estimation of a set of parameters θ = ( θ , θ , . . . , θ q ) ∈ R q for whichwe assume the existence of an unbiased estimator (cid:98) θ . • optimality is deﬁned w.r.t. expected squared error (ESE) which we denote by ∆. Forexample, the ESE of the unbiased estimator (cid:98) θ is denoted by∆ (cid:98) θ := E (cid:107) (cid:98) θ − θ (cid:107) . We always consider the 2-norm (the Frobenius norm for multivariate parameters). artz et al. Setting set of parameters unbiased est. θ (cid:98) θ q mean µ := E [ x i ] (cid:98) µ := n − (cid:80) i x i q = p covariance C := E [( x i − µ )( x i − µ ) (cid:62) ] (cid:98) C := n − (cid:80) i ( x i − (cid:98) µ )( x i − (cid:98) µ ) (cid:62) q = p Table 1: general, mean and covariance MTS. • to study the behaviour in the limit, we will consider the estimation on a generalsequence of models indexed by p . Notation for MTS of the mean and the covariance

In sections 4 and 5, we considerthe estimation of the mean and the covariance matrix, respectively. There, • the sequence index p also denotes the dimensionality of n p i.i.d. observations withmean µ p and covariance C p , given by the ( p × n p )-matrix X p . • We consider K additional data sets with mean µ kp and covariance C kp , their n kp i.i.d.observations are given by the ( p × n kp )-matrices X kp . • γ ( k ) p, , γ ( k ) p, , . . . , γ ( k ) p,p denote the eigenvalues of C ( k ) p . • Y ( k ) p = R ( k ) p (cid:62) X ( k ) p denote the observations in their respective eigenbasis, where thecovariance matrices Σ ( k ) p = R ( k ) p (cid:62) C p R ( k ) p are diagonal. The mean in the eigenbasis isdenoted by µ Y ( k ) p . • For two datasets X ( k ) p and X ( l ) p , we denote Z ( k ) p = R ( l ) p (cid:62) X ( k ) p . From the context, itwill be clear which l was used to obtain Z ( k ) p . • in the following, we will always omit the sequence index p to obtain a less clutterednotation.Table 1 gives an overview of the diﬀerent MTS scenarios considered in this paper. Distributional assumptions

We assume( ∀ k :) 1 p p (cid:88) i =1 γ ( k ) i = Θ(1) . (A1) ( ∀ k ) ∃ τ ( k ) γ : 1 p p (cid:88) i =1 γ ( k ) i = Θ (cid:16) p τ ( k ) γ (cid:17) . (A2) ∃ α , β : (1 + β ) E [ y i ] ≤ E [ y i ] ≤ (1 + α ) E [ y i ](A3) ∃ α , β : (1 + β ) E [ y i ] ≤ E [ y i ] ≤ (1 + α ) E [ y i ](A4)The assumption (A1) states, for each data set, that for an increasing number of dimen-sions the variance per dimension is bounded from above and below. ulti-Target Shrinkage The assumption (A2) restricts the dispersion of the eigenvalues: for increasing dimen-sionality, the dispersion is assumed to have a well-deﬁned limit behaviour. Note that (A1)implies 0 ≤ τ ( k ) γ ≤ p . Asymptotic settings

We consider two diﬀerent asymptotic settings: • LDL: the standard setting in Random Matrix Theory and for the analysis of covarianceshrinkage is the large dimensional limit ( n, p → ∞ , n/p → c ) (Ledoit and Wolf, 2004).In the LDL, the sample mean remains a consistent estimator, this does not hold forthe sample covariance matrix. We assume that for the additional data sets n k /p → c k holds. • FOLDL: in addition we consider the ﬁnite observations large dimensional limit ( p →∞ , n = c , n k = c k ). In the FOLDL, neither sample covariance nor sample mean areconsistent.Table 2 gives an overview of the notation in the paper.

3. Multi-Target Shrinkage

In Single-Target Shrinkage, the linear combination of an unbiased estimator (cid:98) θ with anotherestimator (cid:98) T (called the shrinkage target) is optimized. In most cases, the linear combinationis restricted to be convex (Ledoit and Wolf, 2004; Sch¨afer and Strimmer, 2005): (cid:98) θ STS ( λ ) := (1 − λ ) (cid:98) θ + λ (cid:98) T . In this manuscript, we generalize to optimizing the convex combination with a set of K targets (cid:98) θ MTS ( λ ) := (cid:32) − K (cid:88) k =1 λ k (cid:33) (cid:98) θ + K (cid:88) k =1 λ k (cid:98) T k , (1)where λ = ( λ , λ , . . . , λ K ) ∈ R K ≥ is subject to (cid:80) k λ k ≤

1. The MTS objective is given by∆

MTS ( λ ) := E (cid:13)(cid:13)(cid:13) θ − (cid:98) θ MTS ( λ ) (cid:13)(cid:13)(cid:13) . (2)From the MTS objective we derive a quadratic program for the optimal value of λ : Theorem 1 (MTS quadratic program)

Let the MTS quadratic program be deﬁned by ∆ MTSqp ( λ ) := 12 λ (cid:62) A λ − b (cid:62) λ (3)

5. Setting (cid:98) T K +1 = 0 and allowing for λ ∈ R K +1 , this turns into an arbitrary linear combination whichcan deal with arbitrarily rescaled targets. Theoretical results can be extended at the cost of clarity andaccesibility. artz et al. symbol meaning n number of observations p dimensionality / index of the sequence of models q number of parameters f = Θ( g ) f asymptotically bounded from above and below by gf = O ( g ) f asymptotically bounded from above by gf = o ( g ) f asymptotically dominated by g θ set of parameters (cid:98) θ unbiased estimate of the set of parameters τ (cid:98) θ limit behaviour of the unbiased estimator (G1)∆ (cid:98) θ expected squared error, here of the unbiased estimator µ , (cid:98) µ mean and sample mean C , S covariance and sample covariance γ ( k )1 , . . . , γ ( k ) p eigenvalues of C (cid:92) symbol estimate claculated on the data τ γ limit behaviour of the average squared eigenvalue (A2) X observations ( p × n matrix) Y observations in the eigenbasis( p × n matrix) R rotation into the eigenbasis ( p × p matrix) Z observations in the eigenbasis of a diﬀerent data set ( p × n matrix)(symbol) k for each symbol, k stands for the data set kα , β bounds on the ratio between second and fourth moments (A3) α , β bounds on the ratio between fourth and eighth moments (A4) c ratio between number of observations and dimensionality n/pK number of shrinkage targets T k k th shrinkage target λ k shrinkage intensity of the k th shrinkage target A matrix containing estimates of the quality of the targets b vector containing variance of sample estimate and correlation to targets τ kA limit behaviour of the quality of target k (G3) τ kµ limit behaviour of the quality of the mean of data set k (M1) τ k C limit behaviour of the quality of the covariance of data set k (C1) Q p set of all quadruples consisting of distinct integers between 1 and p | Q p | cardinality of Q p Table 2: overview of the notation. ulti-Target Shrinkage with A kl := q (cid:88) i =1 E (cid:104)(cid:16) (cid:98) T ki − ˆ θ i (cid:17) (cid:16) (cid:98) T li − ˆ θ i (cid:17)(cid:105) , b k := q (cid:88) i =1 (cid:110) Var(ˆ θ i ) − Cov( (cid:98) T ki , ˆ θ i ) (cid:111) , Then it is equivalent to optimize ∆ MTS ( λ ) and ∆ MTSqp ( λ ) : λ (cid:63) := arg min λ ∈ R K ≥ (cid:80) k λ k ≤ ∆ MTS ( λ ) = arg min λ ∈ R K ≥ (cid:80) k λ k ≤ ∆ MTSqp ( λ ) . (4) Proof see appendix.

The quadratic program is governed by the parameters A and b , quantifying the qualityof the targets and the unbiased estimator, respectively. The vector b contains the variance ofthe unbiased estimator, adjusted for correlation with the targets. The diagonal elements inthe matrix A contain information on the variance and bias of the targets and the correlationwith the unbiased estimator. A target T k is useful if the entry in A kk is small relative tothe variance of the unbiased estimator. The oﬀ-diagonal elements in the matrix A containinformation on the correlation between targets. The optimal shrinkage intensities λ (cid:63) depend on the unknown parameters A and b of thequadratic program eq. (4). We propose the following estimators: (cid:98) λ := arg min λ ∈ R K ≥ (cid:80) k λ k ≤ (cid:98) ∆ MTSqp ( λ ) , (cid:98) ∆ MTSqp ( λ ) := 12 λ (cid:62) (cid:98) A λ − (cid:98) b (cid:62) λ with (5) (cid:98) A kl := q (cid:88) i =1 (cid:16) (cid:98) T ki − ˆ θ i (cid:17) (cid:16) (cid:98) T li − ˆ θ i (cid:17) , (cid:98) b k := q (cid:88) i =1 (cid:110)(cid:100) Var(ˆ θ i ) − (cid:100) Cov( (cid:98) T ki , ˆ θ i ) (cid:111) , (6)where the unbiased estimator (cid:98) θ , the targets (cid:98) T k and the estimators of variance and covarianceappearing in (cid:98) b depend on the application scenario.For a general parameter set θ , the following theorem relates the limit behaviour of theestimators in (cid:98) b and of linear combinations of the estimators in (cid:98) A to to the limit behaviourof ∆ MTS ( (cid:98) λ ) and (cid:98) λ : Theorem 2 (consistency of MTS) Let us assume a sequence of models indexed by p suchthat ∃ τ ˆ θ : ∆ ˆ θ = Θ ( p τ ˆ θ ) , (G1) ∀ k ∃ τ kA : A kk = Θ (cid:16) p τ kA (cid:17) , ∀ k : b k = Θ ( p τ ˆ θ )(G2) (cid:13)(cid:13)(cid:13) (cid:98) A kl − A kl (cid:13)(cid:13)(cid:13) = o (cid:16) p . τ kA + τ lA ) (cid:17) , (cid:13)(cid:13)(cid:13) ˆ b k − b k (cid:13)(cid:13)(cid:13) = o ( p τ ˆ θ )(G3) ∀ k : min α ∈ R K ≥ α k =1 q (cid:88) i =1 E (cid:34)(cid:32) K (cid:88) l =1 α l ( (cid:98) T li − ˆ θ i ) (cid:33) (cid:35) = Θ (cid:16) p τ kA (cid:17) (G4) artz et al. We then have ∀ k : λ (cid:63)k , ˆ λ k = O (cid:16) p ( τ ˆ θ − τ kA ) / (cid:17) . (i) ∆ MTS ( (cid:98) λ ) − ∆ MTS ( λ (cid:63) )∆ (cid:98) θ = o (1)(ii) If one strenghtens (G4) to hold ∀ α ∈ R K , we also have (cid:107) λ (cid:63) − (cid:98) λ (cid:107) = o (1)(iii) Proof see appendix.

The assumptions (G1) and (G2) state that all estimators have a well-deﬁned limit be-haviour w.r.t. ESE. In addititon, ∆ ˆ θ and b pk having the same limit behaviour implies thatnone of the targets is identical to the unbiased estimator.Assumption (G3) states that the relative errors in the entries of the estimators (cid:98) A kl andˆ b k go to zero in the limit. We call this property consistency of (cid:98) A and (cid:98) b .Assumption (G4) states that the linear combination of a set of targets cannot havebetter limit behaviour w.r.t. ESE than the best single target in the set. This is neededbecause linear dependence of targets can result in A having small eigenvalues for which therelative error does not go to zero.To illustrate the assumptions consider the handwritten digits example. A possible se-quence of models consists of images with increasing resolution ( p × p pixels) and an increasingnumber of observations for each subject. Then the sequence of ESE of the sample estimatorfor subject A would have a clear limit behaviour and hence fullﬁl (G1). The similaritybetween the digits of subjects A and e.g. T1 deﬁnes the similarity of the images. Hence aclear limit behaviour of A (G2) is to be expected. With increasing p and n , we can betterestimate the variance of the sample mean and the similarity between subjects and hencethe relative errors in b and A would go to zero (G3). Two subjects T1 and T2 whose dif-ferences to subject A exactly cancel out in a linear combination would violate Assumption(G4). This is highly unlikely.Part (i) of Theorem 2 states that a target T k which has worse limit behaviour w.r.t.ESE than the sample estimator ˆ θ does not contribute in the limit.Part (ii) is the most important result. It states that the expected squared error of theMTS estimator (cid:98) λ (normalized by the error of the sample estimator) converges to the ESEof the optimal λ (cid:63) . We call this property consistency of MTS .Part (iii) shows that λ (cid:63) is, under a restriction on the linear dependency of the targets,identiﬁable and that the estimator (cid:98) λ converges to λ (cid:63) . We call this consistency of theestimator (cid:98) λ .

6. for an oﬀ-diagonal element A kl , we consider the error relative to √ A kk A ll .7. Note that ∆ MTS ( (cid:98) λ ) − ∆ MTS ( λ (cid:63) )∆ MTS ( λ (cid:63) ) = o (1) does not hold in general. ulti-Target Shrinkage

4. Multi-Target Shrinkage of the mean

In this section we apply the MTS approach on the p -dimensional sample mean: θ = µ , (cid:98) θ = (cid:98) µ = (ˆ µ , ˆ µ , . . . , ˆ µ q = p ) . As shrinkage targets, we take a set of sample means (cid:98) µ , (cid:98) µ , . . . , (cid:98) µ K of additional data sets X , X , . . . X K , drawn from potentially diﬀerent distributions. We obtain A kl = p (cid:88) i =1 E (cid:104)(cid:16) ˆ µ ki − ˆ µ i (cid:17) (cid:16) ˆ µ li − ˆ µ i (cid:17)(cid:105) b k = p (cid:88) i =1 (cid:110) Var(ˆ µ i ) − Cov(ˆ µ ki , ˆ µ i ) (cid:111) . (7)Cov(ˆ µ ki , ˆ µ i ) = 0 holds and for the sample estimates (cid:98) A and (cid:98) b we propose (cid:98) A kl := p (cid:88) i =1 (cid:16) ˆ µ ki − ˆ µ i (cid:17) (cid:16) ˆ µ li − ˆ µ i (cid:17) ˆ b k := ˆ b := p (cid:88) i =1 (cid:100) Var(ˆ µ i ) , (8)where the estimator of the variance of the sample mean is given by (cid:100) Var(ˆ µ i ) := 1 n ( n − n (cid:88) t =1 ( x it − ˆ µ i ) . Remark

MTS of the mean can be seen as a weighting of each data point. Data pointsin X are weighted by (1 − (cid:80) Kl =1 λ (cid:63)l ) n − and data points in X k are weighted by λ (cid:63)k n − k .Assuming that the distributions of the data sets only diﬀer with respect to their means, theoptimal weight of each original data point is larger than or equal to the weight of the datapoints from the additional data sets.This translates into a constraint on the quadratic program: ∀ k : λ (cid:63)k n − k ≤ (1 − K (cid:88) l =1 λ (cid:63)l ) n − . The constraint is reasonable to impose in many applications and increases numerical sta-bility.

In this section we will establish the conditions under which MTS of the mean is consistentby showing when the estimators eq. (8) fulﬁll the assumptions of Theorem 2. We will showthis for both asmptotic settings.

LDL consistency of MTS of the mean

We ﬁrst consider the LDL.

Theorem 3 (LDL consistency of MTS of the mean)

Let us assume a sequence of sta-tistical models indexed by p for which (A1) , (A2) , (A3) and ∀ k ∃ τ kµ ≤ (cid:107) µ k − µ (cid:107) = Θ( p τ kµ ) , (M1) ∀ k : τ kγ < , τ kµ ) + 1 and τ γ < , min k τ kµ ) + 1(M2) ∀ k | τ kµ > α ∈ R K ≥ α k =1 (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:88) l α l ( µ l − µ ) (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) = Θ (cid:16) p τ kµ (cid:17) (M3) artz et al. hold.Then assumptions (G1) , (G2) , (G3) , and (G4) of Theorem 2 are fulﬁlled, MTS of themean is consistent and ∀ k | τ kµ > λ (cid:63)k = ˆ λ k = O (cid:16) p − τ kµ / (cid:17) holds. If (M3) holds for α ∈ R K , λ (cid:63) is identiﬁable and (cid:98) λ is consistent. Proof see appendix.

Assumption (M1) states that the distance between data and target mean needs to havea clear limit behaviour. We exclude unrealistic sequences of models τ kµ > p and hence do not average out, smalldistances beweent data and target mean cannot be estimated reliably.Assumption (M3) states that there are no target means which, linearly combined, havebetter asymptotic behaviour than the single target means.Theorem 3 states conditions und which MTS of the mean is consistent in the LDL. Inaddition it states that data sets with increasing mean distance (M1) do not contribute tothe MTS estimate in the LDL limit: for n → ∞ , these data sets do not remain usefulbecause the sample mean is consistent. FOLDL consistency of MTS of the mean

We now consider the case where only thedimensionality p goes to inﬁnity, while n remains constant. Theorem 4 (FOLDL consistency of MTS of the mean)

Let us assume a sequence ofstatistical models indexed by p for which (A1) , (A2) , (A3) , assumption (M1) from Theo-rem 3 and ∀ k : τ kγ < and τ γ < (cid:48) ) ( ∀ k :) (cid:88) i,j (cid:54) = i Cov (cid:18) y ( k ) is , y ( k ) js (cid:19) = o (cid:0) p (cid:1) (M4) hold. Then assumptions (G1) , (G2) , (G3) , and (G4) of Theorem 2 are fulﬁlled and MTSof the mean is consistent and (cid:98) λ is a consistent estimator. In the FOLDL, consistency results from averaging over dimensions. Therefore, con-sistency requires stronger restrictions on the correlation between dimensions. Assumption(M2 (cid:48) ) states that the dispersion of the eigenvalues (A2) has to grow slower than Θ( p ).Otherwise, strong eigendirections exist whose inﬂuence on the MTS estimate remains ata constant level in the sequence of models. Assumption (M4) states that the correlationbetween squared uncorrelated variables, on average, converges to zero.Note that identiﬁability holds even without Assumption (M3). ulti-Target Shrinkage

5. Multi-Target Shrinkage of the covariance matrix

In the second application of MTS we consider sample covariance matrices: θ = C , (cid:98) θ = S , S ij = n − n (cid:88) s =1 ( x is − ˆ µ i )( x js − ˆ µ j ) . For the sample covariance matrix, we will consider two classes of targets: • as for the sample mean, it is possible to shrink to a set of sample covariance matrices S , . . . , S K from additional data sets X , X , . . . X K . • a variety of biased estimators (cid:98) C , (cid:98) C , . . . , (cid:98) C K of the sample covariance matrix existswhich can be used as targets. An overview is given in (Sch¨afer and Strimmer, 2005).Examples: – T id = trace( S ) · I– T diag = S ◦ I (elementwise product) – T const . corr . = S ◦ I + F ◦ (1 − I ),where F ij = (cid:112) S ii S jj · ¯ r and ¯ r is the average correlation between dimensions.In total, we obtain a set of targets (cid:98) T , (cid:98) T , . . . , (cid:98) T K for which we have A kl = p (cid:88) i,j =1 E (cid:104)(cid:16) (cid:98) T kij − S ij (cid:17) (cid:16) (cid:98) T lij − S ij (cid:17)(cid:105) and b k = p (cid:88) i,j =1 (cid:110) Var( S ij ) − Cov( (cid:98) T kij , S ij ) (cid:111) . For the sample estimates (cid:98) A and (cid:98) b we propose (cid:98) A kl = p (cid:88) i,j =1 (cid:16) (cid:98) T kij − S ij (cid:17) (cid:16) (cid:98) T lij − S ij (cid:17) and ˆ b k ≡ ˆ b = p (cid:88) i,j =1 (cid:100) Var( S ij ) , (9)where the estimator of the variance of the sample covariance is given by (cid:100) Var( S ii (cid:48) ) := 1( n − n (cid:88) s (cid:16) x is x js − n (cid:88) t x it x jt (cid:17) . To keep the notation simple, we assume ∀ k : µ = µ k = 0. In this section we will establish the conditions under which MTS of the mean is consistent byshowing when the estimators eq. (9) fulﬁll the assumptions of Theorem 2. We will considerboth asmptotic settings. artz et al. LDL consistency of MTS of the covariance

We ﬁrst consider the LDL.

Theorem 5 (LDL consistency of MTS of the covariance)

Let us assume a sequenceof statistical models indexed by p for which (A1) , (A2) , (A3) , (A4) and ∀ k ∃ τ kC ≤ (cid:107) C k − C (cid:107) = Θ( p τ k C ) , (C1) (cid:80) i,j,k,l ∈ Q p (cid:0) Cov [ y i y j , y k y l ] (cid:1) | Q p | = o (1)(C2) where Q p is the set of all quadruples consisting of distinct integersbetween 1 and p , τ ( k ) γ < , min k τ k C ) , (C3) ∀ k | τ kC > α ∈ R K ≥ α k =1 (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:88) l α l ( C l − C ) (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) = Θ (cid:16) p τ k C (cid:17) (C4) hold. Then, for the set of targets in (Sch¨afer and Strimmer, 2005) and targets given byadditional data sets, assumptions (G1) , (G2) , (G3) and (G4) of Theorem 2 are fulﬁlled.Hence MTS of the covariance is consistent and ∀ k | τ kC > λ (cid:63)k , ˆ λ k = O (cid:16) p (1 − τ kC ) / (cid:17) holds. If (C4) holds for α ∈ R K , λ (cid:63) is identiﬁable and (cid:98) λ is consistent. Proof see appendix.

Assumption (C1) states that the distance of the data covariance matrices to each targetcovariance needs to have a clear limit behaviour. We exclude unrealistic sequences of modelswith τ kC > C .Assumption (C2) restricts the average covariance between products of uncorrelated vari-ables. This assumption is quite weak (compare to (Ledoit and Wolf, 2004)).Assumption (C3) limits the eigenvalue dispersion of the data sets in dependence of thedistance between data and target covariance. This is analogue to Assumption (M2) forMTS of the mean.Assumption (C4) states that there are no additional data sets which, linearly combined,have better limit behaviour than the single data sets.Theorem 5 shows that MTS of the covariance is consistent in the LDL. We also see thatdata sets with covariance distance (C1) increasing faster than O ( p ) do not contribute tothe MTS estimator in the LDL limit: for n → ∞ , these data sets do not remain useful. FOLDL consistency of MTS of the covariance

We now consider the case where onlythe dimensionality p goes to inﬁnity, while n remains constant. ulti-Target Shrinkage Theorem 6 (FOLDL consistency of MTS of the covariance)

Let us assume a se-quence of statistical models indexed by p for which (A1) , (A2) , (A3) , (A4) , (C1) , (C2) (see Theorem 5) and τ kγ < and τ γ < (cid:48) ) (cid:80) i,j,k,l ∈ Q p Cov (cid:2) ( y i y j ) , ( y k y l ) (cid:3) | Q p | = o (1)(C5) hold. Then, for the set of targets in (Sch¨afer and Strimmer, 2005) and targets given byadditional data sets, assumptions (G1) , (G2) , (G3) , and (G4) of Theorem 2 are fulﬁlled,and MTS of the covariance and (cid:98) λ are consistent. Proof see appendix.

As for the mean, consistency in the FOLDL requires a restriction (C3 (cid:48) ) on the largest eigen-value (compare to Theorem 4) Assumption (C5) further restricts covariances between un-correlated random variables. Note that identiﬁability holds even without Assumption (C4).

6. Simulations

Our proposed MTS has more free parameters than standard shrinkage and therefore thevector of shrinkage intensity estimates (cid:98) λ has a higher variance than the single shrinkageintensity estimate ˆ λ in STS. In this section, we will provide simulations for both MTS ofthe mean and MTS of the covariance which show that already at moderate data set sizes,MTS accurately estimates λ . We will consider • expected squared error: this quantity is optimized by MTS. We directly measure the percentage improvement in average loss (PRIAL) with respect to the sample estima-tor (cid:98) θ : PRIAL (cid:0)(cid:98) θ shr (cid:1) = 100 · E (cid:107) (cid:98) θ − θ (cid:107) − E (cid:107) (cid:98) θ shr − θ (cid:107) E (cid:107) (cid:98) θ − θ (cid:107) . The PRIAL is a measure relative to the ESE of the sample estimator. A PRIAL of100 means that the shrinkage estimator has no error while a PRIAL of 0 means that ityields no improvement. Negative values indicate performance worse than the sampleestimator. • classiﬁcation accuracies: in classiﬁcation tasks, the ESE of the covariance matrix isnot the quantity of interest: it only serves as a proxy for classiﬁcation accuracies. Wemeasure accuracy relative to the unbiased estimator:accuracy gain (cid:0) (cid:98) θ shr (cid:1) = accuracy (cid:0) (cid:98) θ shr (cid:1) − accuracy (cid:0) (cid:98) θ (cid:1) We use MTS to estimate – means in Linear Discriminant Analysis (LDA) – covariances for Common Spatial Patterns as an LDA preprocessing. artz et al. �� Figure 4: Large dimensional limit (LDL) of MTS of the mean to additional data sets. Aver-age obtained over R r = 20 repetitions for R m = 500 models. Shaded areas showone standard deviation. In the ﬁrst simulation we illustrate the behaviour of MTS of the mean in the large dimen-sional limit (LDL, p, n → ∞ ). We generate n standard normal data points of dimensionality p = n with mean µ i = 0. For the shrinkage targets we generate K = 4 standard normaldata sets with n k = p data points and means µ ki = ( ± i η k , where the sign is randomand η = ( √ p − , . , . , . / . In thissetting, the ﬁrst additional data set X has τ µ = 0 and X / / have τ / / µ = 1.This setting fulﬁlls the assumptions of Theorem 3: targets have a clear limit behaviour(M1), from standard normality follows τ ( k ) γ = 0 (M2) and the means of the targets areindependently sampled (M3). The Theorem tells us that the MTS estimator will convergeand that targets (cid:98) T / / will not receive any weight in the LDL.We compare MTS to ﬁve versions of STS: STS to each of the targets (cid:98) T k = (cid:98) µ k and STSto the joint target (cid:98) T joint := (cid:98) µ joint := 0 . · (cid:80) k (cid:98) µ k . Figure 4 shows the dependency of thePRIAL (left) and the shrinkage intensities (right) on the dimensionality p .As predicted for the LDL by Theorem 3, the STS and MTS shrinkage intensities fortargets (cid:98) µ , (cid:98) µ , (cid:98) µ and (cid:98) µ joint go to zero: these targets are not useful in the limit. Only thetarget (cid:98) µ remains useful. As n = n and the entries is µ converge to the entries in µ , theshrinkage intensity (cid:98) λ goes to 0.5.The PRIALs reﬂect this picture: For the asymptotically useless targets, the improvementover the sample mean goes to zero, for (cid:98) µ it goes to a constant. For low p and n , it is lessrelevant that µ , µ and µ are diﬀerent from µ : as a consequence, the joint target is betterthan (cid:98) µ . Over the whole range of p , (cid:98) µ MTS outperforms all STS estimators. For p → ∞ ,MTS converges to STS to (cid:98) µ .

8. Drawing the means from normal distributions with diﬀerent variances seems more straightforward. Inparticular for small dimensionalities it has the disadvantage that the quality of the additional data setsvaries a lot and that often (cid:107) µ − µ (cid:107) > (cid:107) µ − µ (cid:107) . ulti-Target Shrinkage �� Figure 5: Finite observations large dimensional limit (FOLDL) of MTS of the mean toadditional data sets. Average obtained over R r = 20 repetitions for R m = 500models. Shaded areas show one standard deviation.Figure 5 shows convergence for the ﬁnite observations large dimensional limit (FOLDL).The experiment is analogous to the one above, only n = n k = 50 is kept ﬁxed. Contrary tothe LDL, all shrinkage intensities remain ﬁnite. As above, over the whole range of p , (cid:98) µ MTS outperforms all STS estimators.

To test MTS in a classiﬁcation setting we extended the above simulations to two class means µ A/B ( p = 50, n = 50). The diﬀerence of the class means is identical in each dimension,chosen such that the Bayes optimal classiﬁer achieves 80% accuracy. For both classes thereare four additional data sets, n k = 100 with mean diﬀerences∆ µ kA/B,i = µ kA/B,i − µ A/B,i = ( ± i η k , η = 10 κ · (0 . , . , ,

2) where the parameter κ governs the similarity of the additionaldata sets. The covariance of each data set is C ( k ) A/B = I . To make the setting slightlymore realistic, we transform the data to have diagonal covariance with eigenvalues γ i =10 i − / ( p − − (log-spaced between 10 ± α , α = 1). This is achieved by rescaling all datapoints: x ( k ) ,rescaledA/B,it = x ( k ) A/B,it · √ γ i . We train Linear Discriminant Analysis using diﬀferent mean estimators: We compareMTS to (A) sample means (cid:98) µ A/B , where we ignore the additional data sets , (B) pooledmeans where we take (cid:98) µ pooledA/B := ( K + 1) − ( (cid:80) k (cid:98) µ kA/B + (cid:98) µ A/B ), and (C) STS where we shrinkboth sample means (cid:98) µ A/B to the corresponding joint target (cid:98) µ jointA/B := K − (cid:80) k (cid:98) µ kA/B .Figure 6 (left) shows the gain in classiﬁcation accuracy relative to the baseline of samplemeans in dependence of the scale parameter κ . When the target means are very similar

9. to increase comparability, we use the sample covariance averaged over all data sets, independently of theestimator of the mean. artz et al. � �� Figure 6: accuracy gain for MTS for Linear Discriminant Analysis. Average obtained over R r = 20 repetitions for R m = 500 models. Shaded areas show one fourth standarddeviation.( κ → −∞ ), pooled means is the optimal solution. For very diﬀerent target distributions( κ → ∞ ) we cannot improve over the sample means (cid:98) µ A/B . For these extremes, STS to thepooled data performs as well as the superior method, in between it outperforms both. MTSimproves on STS by ﬁnding a superior weighting of the target means.For Figure 6 (right), a spike has been added to the covariance model: The largesteigenvalue has been multiplied by 100 and the corresponding direction has been made non-discriminative. The drop in performance indicates that STS and MTS now give too muchweight to the targets, especially to the less useful targets µ / A/B . All targets are similar tothe original data in the non-discriminative direction of the spike, but still vary in qualityin the discriminative directions.

Whitening – a practical trick

Shrinkage puts too much weight on the direction of high-est variance. Whitening the data before MTS (wMTS) helps: wMTS gives equal importanceto all directions, yields proper weights for the µ kA/B and superior accuracies.Interestingly, wMTS also performs better than standard MTS when there is no spike inthe covariance (left). In this case the estimation of the shrinkage intensities is dominated bythe few directions of largest variance. This causes high variance in the shrinkage intensityestimates (cid:98) µ . Using wMTS, the estimation of the shrinkage intensities becomes an evenlyweighted average over dimensions and hence gets more stable.In general, whitening leads to large improvements if the discriminative information isnot restricted to the subspace of highest variance. Here we illustrate the behaviour of MTS of the covariance in the large dimensional limit(LDL, p, n → ∞ ). We generate n normal data points of dimensionality p = n with co-variance C diagonal with logarithmically spaced eigenvalues. For the shrinkage targets wegenerate K = 4 standard normal data sets with n k = p data points. The covariance matri- ulti-Target Shrinkage �� Figure 7: Large dimensional limit (LDL) of MTS of the covariance to additional data sets.Average obtained over R r = 20 repetitions for R m = 500 models. Shaded areasshow one standard deviation. �� Figure 8: Finite observations large dimensional limit (FOLDL) of MTS of the covarianceto additional data sets. Average obtained over R r = 20 repetitions for R m = 500models. Shaded areas show one standard deviation.ces C k of the additional data sets only diﬀer in the largest eigenvalue γ kmax = η k · p , with η = ( √ p − , . , . , . /

10. Therefore the ﬁrst additional data set X has τ C = 1 and X / / have τ / / C = 2.This makes the setting analog to simulation 1. Figure 7 shows the dependency ofthe PRIAL (left) and the shrinkage intensities (right) on the dimensionality p : the STSand MTS shrinkage intensities for targets (cid:98) C / / and (cid:98) C joint go to zero, only the target (cid:98) C remains useful in the LDL. As n = n , the shrinkage intensity goes to 0.5. For theasymptotically useless targets, the PRIAL over the sample covariance goes to zero, for (cid:98) C it goes to a constant. For low p and n , it is less relevant that C / / are diﬀerent from C : as a consequence, the joint target is better than (cid:98) C . Over the whole range of p , (cid:98) C MTS outperforms all STS estimators. artz et al. �� Figure 9: MTS of the covariance to identity and additional data sets. Average obtainedover R r = 20 repetitions for R m = 500 models. Shaded areas show four standarddeviations.Figure 8 shows results for the FOLDL, where n = n = n = n = n = 50 is kept ﬁxed.As for the mean, all shrinkage intensities remain ﬁnite and over the whole range of p , (cid:98) C MTS outperforms all STS estimators.

For MTS of the covariance there is also the possibility to include a biased estimator asa shrinkage target. The most widely used biased estimator is the identiy multiplied bythe average sample eigenvalue: (cid:98) T id := ν I . In this simulation, we shrink to (cid:98) T id and thecovariance matrices of four additional sets of observations. We choose C and C k diagonalwith logarithmically spaced eigenvalues between 10 − and 10 . Each of the additional datasets is rotated randomly constrained to a rotation angle φ . We generate multivariate normalrandom data sets X and X / / / of size p = n = 500, n = p/ n = p , n = 2 p and n = 4 p .Figure 9 shows PRIAL and shrinkage intensities in dependence of the rotation angle φ .Shrinkage to (cid:98) T id is independent of φ , while STS to the other data is good when distributionsare similar (small rotation angle) and yields only small improvements for very diﬀerentdistributions (large rotation angle). The MTS shrinkage intensities show that for large φ MTS yields approximately the same estimate as STS to T id , while for small φ it yields aweighting of all ﬁve targets. This weighting yields superior PRIAL compared to each STSestimator. In this section we apply MTS to the preprocessing method

Common Spatial Patterns (CSP).CSP is used for dimension reduction in classiﬁcation settings where (A) each datapoint is atime series of observations and (B) the discriminative information between two classes liesin the signal variance. Then CSP yields ﬁlters for the classes A and B which are deﬁned by ulti-Target Shrinkage � �� Figure 10: accuracy gain for MTS of the covariance for CSP. Average obtained over R r = 20repetitions for R m = 500 models.the directions where the ratio of the variances is maximal: f A/Bi := arg max f : f ⊥ f A/Bj ∀ j

Var(

X f i ) (cid:17) .For this simulation, a p = 50 dimensional diagonal covariance matrix C with logarith-mically spaced eigenvalues between 10 − and 10 is generated. The covariances of the twoclasses C A,B and a set of diﬀerent covariances C A/B,k diﬀ are each obtained by rescaling P = 10random eigenvalues of C by p i = (1+ i/P ) , i = 1 , , . . . , P . In addition, we rotate the C A/B,k diﬀ randomly by an angle φ k , φ = (0 , , , C A/B,k ( w ) = (1 − w ) C A/B,k diﬀ + w C A/B . For each class and each target we generate n = n k = 200 data points. The classiﬁcationaccuracy is calculated for test trials of length n test = 20.Figure 10 (left) shows the relative classiﬁcation accuracies of the diﬀerent covarianceestimation approaches. For w = 1, the target covariances are equal to the class covariancesand S pooled = 1 / ( k + 1)( (cid:80) k S k + S ) is optimal. For w →

0, the targets do not containdiscriminative information, hence the sample covariance becomes optimal. STS to the jointcovariance of the additional data sets performs better then the pooled covariance, but isclearly outperformed by MTS. Whitened MTS performs even better.For Figure 10 (right) a spike has been added to all covariance matrices: The largesteigenvalue has been multiplied by 100 and the corresponding direction was excluded from therandom rotations. This strong direction dominates the standard STS and MTS estimatesand causes a strong degradation of performance. The performance of whitened MTS, onthe other hand, is not aﬀected. artz et al.

7. Multi-Target Shrinkage on Real World Data

In this section we will spotlight two application scenarios of MTS on real world data, onefor MTS of the mean estimation and one for MTS of the covariance. Detailed articles onthese applications are in preparation.

In a Brain-Computer Interface (BCI) paradigm based on event related potentials (ERPs),Linear Discriminant Analysis (LDA) is commonly applied to a binary classiﬁcation problem(targets vs. nontargets). A detailed overview of the state-of-the-art approaches for featureextraction and classiﬁcation for ERP data in BCI application is given in (Blankertz et al.,2011).Generally, a sequence of k diﬀerent stimuli are presented repetitively in an random order.The user attends on only one stimulus (target ), while neglecting all others (non-targets).For each stimulus, the brain response is evaluated and it is assessed whether or not the userwas attending. Then, a one-out-of- k -class decision has to be taken based on the k binaryLDA classiﬁer outputs.The standard approach is to compute an LDA classiﬁer by pooling all target and allnon-target data, thus neglecting the stimulus identity. Alternatives are STS and MTS:we compute a binary classiﬁer for each stimulus, using the mean over the distinct stimulusclasses as a shrinkage target (STS) or each mean of each distinct stimulus class as a separateshrinkage target (MTS). In ERP, the covariance can be considered as general backgroundactivity which is independent of the stimulus. Hence, for all approaches we take the pooledcovariance.One data set comprising of 21 subjects was reanalyzed (Schreuder et al., 2011). Figure 11shows the classiﬁcation accuracies when computing the MTS mean, comparing againstclassiﬁcation accuracies obtained with other estimates for the mean. Next to the MTSestimator, the pooled sample mean (standard approach), sample estimate of the stimulusspeciﬁc mean and the STS mean estimate was analyzed. For the STS mean estimator, thepooled mean of the remaining classes was considered as target. The analysis shows theMTS estimator of the mean to be superior to all other approaches. We reanalyzed a data set from a Brain Computer Interface based on motor imagery. Inthe experiment, subjects had to imagine two diﬀerent movements while brain activity wasmeasured via EEG ( p = 55 channels, 80 subjects, 150 trials per subject, each trial with n trial = 390 measurements (Blankertz et al., 2010)). For each subject the frequency bandwas optimized. Common Spatial Patterns (CSP) was applied on the class-wise covariancematrices for feature extraction. 1-3 ﬁlters per class were chosen by a heuristic (Blankertzet al., 2008) and Linear Discriminant Analysis was applied on log-variance features.As training is expensive, we are interested in exploiting training data from other subjects.We compare two approaches: STS to the covariance of all other subjects and Multi-Target

10. Note that despite having the same name, there is no relation between the targets in an ERP experimentand Shrinkage targets. ulti-Target Shrinkage

40 50 60 70 80 90 100405060708090100 100% 0%p=0.00121 ** std Approach M T S

40 50 60 70 80 90 100405060708090100 100% 0%p=8.46e−14 ** sample mean M T S

40 50 60 70 80 90 100405060708090100 76.2% 23.8%p=0.00431 ** STS M T S Figure 11: classiﬁcation accuracy of the ERP data using several estimates of the mean. Asubject is marked with a circle. It should be noted that all three plots showthe same data on the y-axis, being the classiﬁcation accuracy obtained with theMTS mean estimate. number of trials per class a cc u r a cy SSTS C joint wSTS C joint

MTSwMTS 0 5 10 15 2000.010.020.030.040.05 number of trials per class a cc u r a cy − a cc ( s a m p l e c o v ) number of trials per class s h r i n k age i n t en s i t y STSwSTSMTS (sum)MTS (max)wMTS (sum)wMTS (max)

Figure 12: dependency on the number of training trials of motor imagery BCI. Averageobtained over R = 100 runs.Shrinkage to all 80 subjects. Directions of high variance dominate shrinkage estimators(Bartz and M¨uller, 2013) and the BCI data contains pronounced directions of high variance,the spectrum is heavily tilted. To reduce the impact of the ﬁrst eigendirections withoutgiving to much importance to low variance noise directions we applied a special form ofwhitening: we rescaled, only for the calculation of the shrinkage intensities, the ﬁrst ﬁveprincipal components to have the same variance as the sixth principal component. Shrinkageis corrected for auto-correlation (Bartz and M¨uller, 2014).Figure 12 (left, middle) shows accuracies for diﬀerent number of training trials per class.One can see that STS outperforms sample covariance matrices, while it is not possible toestimate the high number of parameters for MTS. For few training trials, wSTS outperformsSTS, as the averaging over additional dimensions reduces variance. wMTS yields very goodaccuracies. artz et al. w M T S sample covariance ** STS C joint ** wSTS C joint subject−wise classification accuracies ** MTS ** Figure 13: subject-wise classiﬁcation accuracies for motor imagery BCI. 10 training trials.Average obtained over R = 100 runs. ∗∗ / ∗ := signiﬁcant at p ≤ .

01 or p ≤ .

8. Discussion

Shrinkage is a widely applied estimation technique. In the last years the analytic formula forcovariance shrinkage of Ledoit and Wolf (Ledoit and Wolf, 2004) has become very popular:it is a fast and accurate alternative to cross-validation.In this paper, we pointed out several use cases in which a single shrinkage target is notsuﬃcient. This motivates the usage of multiple shrinkage targets (MTS). We have derivedformulas for optimal Multi-Target Shrinkage and we have shown in theory and simulationsthat MTS yields improvements over standard shrinkage in several situations. As a practicaltrick, we proposed whitening as a preprocessing step which increases the robustness of MTS.On two real world data sets from the neuroscience domain, our proposed method yieldsa signiﬁcant performance enhancement over standard shrinkage.Future work will explore connections to random matrix theory, consider the transferof domain speciﬁc prior knowledge into the proposed framework, application of MTS toother estimators and the analysis of new real world data sets. In addition we are inter-ested in incorporating label information into the weighting of the diﬀerent dimensions andinto adaptively whitening only to an extent which suﬃciently reduces the variance of theshrinkage estimates. ulti-Target Shrinkage Acknowledgments

Klaus-Robert M¨uller gratefully acknowledges funding by BMBF Big Data Centre (01 IS14013 A) and the National Research Foundation grant (No. 2012-005741) funded by theKorean government. We thank Pieter-Jan Kindermans, Sebastian Bach, Shinichi Nakajimaand Duncan Blythe for valuable discussions and comments.

Appendix A. Proofs

A.1 Proof of Theorem 1 (MTS quadratic program)Proof

We decompose the EMSE into bias and variance∆

MTS ( λ ) = E (cid:13)(cid:13)(cid:13) θ − (cid:98) θ MTS ( λ ) (cid:13)(cid:13)(cid:13) = E (cid:34) q (cid:88) i =1 (cid:16) ˆ θ MTS ( λ ) i − θ i (cid:17) (cid:35) (10)= E  q (cid:88) i =1 (cid:32)(cid:32) − K (cid:88) k =1 λ k (cid:33) ˆ θ i + K (cid:88) k =1 λ k (cid:98) T ki − θ i (cid:33)  = q (cid:88) i =1 (cid:40) (cid:32) − K (cid:88) k =1 λ k (cid:33) Var(ˆ θ i ) + K (cid:88) j,k =1 λ j λ k Cov( (cid:98) T ji , (cid:98) T ki ) abcdef ghi + K (cid:88) j =1 λ j (cid:32) − K (cid:88) k =1 λ k (cid:33) Cov( (cid:98) T ji , ˆ θ i ) abcdef ghi + (cid:40) K (cid:88) k =1 λ k E (cid:104) (cid:98) T ki − ˆ θ i (cid:105)(cid:41)  K (cid:88) j =1 λ j E (cid:104) (cid:98) T ji − ˆ θ i (cid:105) (cid:41) . This can be simpliﬁed to∆

MTS ( λ ) = q (cid:88) i =1 (cid:40) K (cid:88) j,k =1 λ j λ k E (cid:104)(cid:16) (cid:98) T ji − ˆ θ i (cid:17) (cid:16) (cid:98) T ki − ˆ θ i (cid:17)(cid:105) abcdef ghi + 2 K (cid:88) k =1 λ k (cid:0) Cov( (cid:98) T ki , ˆ θ i ) − Var(ˆ θ i ) (cid:1) + Var(ˆ θ i ) (cid:41) = λ (cid:62) A λ − b (cid:62) λ + q (cid:88) i =1 Var(ˆ θ i ) = 2∆ MTSqp ( λ ) + const. (11)Therefore the sets of λ minimizing ∆ MTS ( λ ) and ∆ MTSqp ( λ ) are identical. A.2 Proof of Theorem 2 (consistency of MTS)Proof

From the constraints, it follows directly that (cid:107) λ (cid:63) (cid:107) = O (1) (12) artz et al. and from the deﬁnition of A and b follows ∀ k : τ kA ≥ τ ˆ θ . We ﬁrst prove (i). We have ∀ k : λ (cid:63) (cid:62) A λ (cid:63) = K (cid:88) k (cid:48) ,l =1 λ (cid:63)k (cid:48) λ (cid:63)l q (cid:88) i =1 E (cid:104)(cid:16) (cid:98) T ki − ˆ θ i (cid:17) (cid:16) (cid:98) T li − ˆ θ i (cid:17)(cid:105) , ≥ λ (cid:63)k min α ∈ R K ≥ α k =1 q (cid:88) i =1 E (cid:32) K (cid:88) l =1 α l ( (cid:98) T li − ˆ θ i ) (cid:33) = λ (cid:63)k Θ (cid:16) p τ kA (cid:17) (13) b (cid:62) λ (cid:63) (G2) , (12) = O ( p τ ˆ θ ) , (14)We then have ∀ k :Θ( p τ ˆ θ ) (G1) = ∆ ˆ θ ≥ ∆ MTS ( λ (cid:63) ) (11) = λ (cid:63) (cid:62) A λ (cid:63) − b (cid:62) λ (cid:63) + (cid:88) i Var(ˆ θ i ) (12) , (13) , (14) ≥ λ (cid:63)k Θ( p τ kA ) + O ( p τ ˆ θ ) . Rearranging yields λ (cid:63)k = O ( p . τ ˆ θ − τ kA ) ). To prove statement (i) for ˆ λ k , we ﬁrst deﬁne (cid:98) ∆ MTS ( λ ) := λ (cid:62) (cid:98) A λ − (cid:98) b (cid:62) λ + p (cid:88) i =1 Var(ˆ θ i ) . Using the result on the limit behaviour of λ (cid:63) , we obtain λ (cid:63) (cid:62) ( A − (cid:98) A ) λ (cid:63) = K (cid:88) k,l =1 λ (cid:63)k λ (cid:63)l ( A kl − (cid:98) A kl ) (G3) = K (cid:88) k,l =1 λ (cid:63)k λ (cid:63)l o (cid:16) p . τ kA + τ lA ) (cid:17) = o ( p τ ˆ θ ) (15)This allows us to calculate∆ MTS ( λ (cid:63) ) − (cid:98) ∆ MTS ( λ (cid:63) ) = λ (cid:63) (cid:62) ( A − (cid:98) A ) λ (cid:63) − b − (cid:98) b ) (cid:62) λ (cid:63) (12) , (15) = o ( p τ ˆ θ ) . (16)In addition, we calculate (cid:98) ∆ MTS ( (cid:98) λ ) − ∆ MTS ( (cid:98) λ ) = (cid:98) λ (cid:62) ( (cid:98) A − A ) (cid:98) λ − (cid:98) b − b ) (cid:62) (cid:98) λ = (cid:88) k ˆ λ k o ( p τ kA ) + o ( p τ ˆ θ ) . (17)Using these equations, we obtainΘ( p τ ˆ θ ) ≥ ∆ MTS ( λ (cid:63) ) (16) = (cid:98) ∆ MTS ( λ (cid:63) ) + o ( p τ ˆ θ ) ≥ (cid:98) ∆ MTS ( (cid:98) λ ) + o ( p τ ˆ θ ) (17) = ∆ MTS ( (cid:98) λ ) + o ( p τ ˆ θ ) + (cid:88) k ˆ λ k o ( p τ kA ) (13) , (14) ≥ ˆ λ k Θ( p τ kA ) + O ( p τ ˆ θ ) + o ( p τ ˆ θ ) + (cid:88) k ˆ λ k o ( p τ kA ) . ulti-Target Shrinkage Rearranging yields ˆ λ k = O ( p . τ ˆ θ − τ kA ) ) which concludes (i). To prove statement (ii) we haveto relate the diﬀerence in ESE to the diﬀerence in the estimate of the ESE: (cid:16) ∆ MTS ( (cid:98) λ ) − ∆ MTS ( λ (cid:63) ) (cid:17) − (cid:16) (cid:98) ∆ MTS ( (cid:98) λ ) − (cid:98) ∆ MTS ( λ (cid:63) ) (cid:17) = (cid:16) ∆ MTS ( (cid:98) λ ) − (cid:98) ∆ MTS ( (cid:98) λ ) MTS (cid:17) − (cid:16) ∆ MTS ( λ (cid:63) ) − (cid:98) ∆ MTS ( λ (cid:63) ) (cid:17) (16) , (17) , ( i ) = o ( p τ ˆ θ )Using this and the optimalities of λ (cid:63) for ∆ MTS ( λ ) and (cid:98) λ for (cid:98) ∆ MTS ( λ ), we obtain0 ≤ (∆ ˆ θ ) − (cid:16) ∆ MTS ( (cid:98) λ ) − ∆ MTS ( λ (cid:63) ) (cid:17) = Θ( p − τ ˆ θ ) (cid:16) (cid:98) ∆ MTS ( (cid:98) λ ) − (cid:98) ∆ MTS ( λ (cid:63) ) + o ( p τ ˆ θ ) (cid:17) ≤ o (1)which concludes the proof of (ii).The proof of part (iii) is similar to the one of Theorem 2.1 from (Daniel, 1973). On theconvex set we have 0 ≤ ( (cid:98) λ − λ (cid:63) ) (cid:62) ∇ ∆ MTS ( λ (cid:63) ) (18)0 ≤ ( λ (cid:63) − (cid:98) λ ) (cid:62) ∇ (cid:98) ∆ MTS ( (cid:98) λ ) MTS (19)where the gradients are ∇ ∆ MTS ( λ ) = ( A λ + b ) and ∇ (cid:98) ∆ MTS ( λ ) = (cid:16) (cid:98) A λ + (cid:98) b (cid:17) . Multiplyingeq. (19) by minus one and combining the two equations, we obtain( (cid:98) λ − λ (cid:63) ) ∇ (cid:98) ∆ MTS ( (cid:98) λ ) ≤ ( (cid:98) λ − λ (cid:63) ) (cid:62) ∇ ∆ MTS ( λ (cid:63) ) . Subtracting ( (cid:98) λ − λ (cid:63) ) ∇ (cid:98) ∆ MTS ( λ (cid:63) ) from both sides, we obtain( (cid:98) λ − λ (cid:63) ) (cid:62) (cid:16) ∇ (cid:98) ∆ MTS ( (cid:98) λ ) − ∇ (cid:98) ∆ MTS ( λ (cid:63) ) (cid:17) ≤ ( (cid:98) λ − λ (cid:63) ) (cid:62) (cid:16) ∇ ∆ MTS ( λ (cid:63) ) − ∇ (cid:98) ∆ MTS ( λ (cid:63) ) (cid:17) . The left hand side is( (cid:98) λ − λ (cid:63) ) (cid:62) (cid:98) A ( (cid:98) λ − λ (cid:63) ) ≥ (cid:107) (cid:98) λ − λ (cid:63) (cid:107) min (cid:107) α (cid:107) =1 α (cid:62) A α + ( (cid:98) λ − λ (cid:63) ) (cid:62) ( (cid:98) A − A )( (cid:98) λ − λ (cid:63) ) (G4) ,α ∈ R K = (cid:107) (cid:98) λ − λ (cid:63) (cid:107) · Θ( p τ ˆ θ ) . The right hand side is (cid:16) ( (cid:98) λ − λ (cid:63) ) (cid:62) ( (cid:98) A − A ) λ (cid:63) + ( (cid:98) λ − λ (cid:63) ) (cid:62) ( b − (cid:98) b ) (cid:17) = o ( p τ ˆ θ ) . by (G1), (G2), (G3) and the rates of the λ k given by (i). Therefore, rearranging yields (cid:107) (cid:98) λ − λ (cid:63) (cid:107) = o (1). A.3 Proof of Theorem 3 (LDL consistency of MTS of the mean)Proof

Without loss of generality, we assume µ = . We start by analysing the asymptoticbehaviour of the ∆ ˆ θ , A kk and b , then we prove the consistency of (cid:98) A kl and ˆ b . artz et al. (G1) & (G2) : Asymptotic behaviour of ∆ ˆ θ , A kk and b We start with the asymptoticbehaviour of ∆ ˆ θ = b . We have∆ ˆ θ = b = p (cid:88) i =1 Var(ˆ µ i ) = n − p (cid:88) i =1 Var( x is ) = n − p (cid:88) i =1 γ i (A1) = Θ(1) ! = Θ( p τ ˆ θ ) (20) ⇐⇒ τ ˆ θ = 0Using this result, we obtain the asymptotic behaviour of A kk : A kk = p (cid:88) i =1 E (cid:20)(cid:16) ˆ µ ki − ˆ µ i (cid:17) (cid:21) = p (cid:88) i =1 E (cid:104) (ˆ µ ki ) − µ ki ˆ µ i − ˆ µ i (cid:105) (21)= p (cid:88) i =1 (cid:110) E (cid:104) (ˆ µ ki ) (cid:105) + E (cid:2) ˆ µ i (cid:3) (cid:111) = p (cid:88) i =1 (cid:110) ( µ ki ) + Var (cid:16) ˆ µ ki (cid:17) + Var (ˆ µ i ) (cid:111) = Θ( p τ kµ ) + Θ(1) ! = Θ( p τ kA ) ⇐⇒ τ kA = max( τ kµ , , part I: Consistency of (cid:98) A kl As (cid:98) A kl is unbiased, we have to show thatVar( (cid:98) A kl ) = o ( p τ kA + τ lA ) = o ( p max( τ kµ , τ lµ , ) (22)We introduce the notation ˇ x ( k ) is = x ( k ) is − µ ( k ) i , ˇ µ ( k ) i = n − (cid:88) s ˇ x ( k ) is . We then have Var( (cid:98) A kl ) = Var (cid:32) p (cid:88) i =1 (cid:16) ˆ µ ki − ˆ µ i (cid:17) (cid:16) ˆ µ li − ˆ µ i (cid:17)(cid:33) = Var (cid:32) p (cid:88) i =1 (cid:16) ˇ µ ki − ˇ µ i + µ ki (cid:17) (cid:16) ˇ µ li − ˇ µ i + µ li (cid:17)(cid:33) (23)To show eq. (22), it is suﬃcient to show the that the variance of each combination of termsin eq. (23) is o ( p τ kA + τ lA ). There are three non-constant types of combinations: First, thereis the product of a mean and a sample mean:Var (cid:32) p (cid:88) i =1 µ ki ˇ µ li (cid:33) = n − l (cid:88) ij Cov( µ ki (cid:88) s ˇ x lis , µ kj (cid:88) t ˇ x ljt ) = n − l µ k (cid:62) C l µ k (M1) = µ k (cid:62) C l µ k (cid:107) µ k (cid:107) Θ( p τ kµ − ) = max i γ li Θ( p τ kµ − ) (M2) , (A2) = o ( p τ lµ +1 )Θ( p τ kµ − ) = o ( p τ kµ + τ kµ ) = o ( p τ kA + τ lA ) ulti-Target Shrinkage Second, there are products of two diﬀerent sample means:Var (cid:32) p (cid:88) i =1 ˇ µ i ˇ µ ki (cid:33) = n − n − k Var (cid:32) p (cid:88) i =1 (cid:88) s,t ˇ x is ˇ x kit (cid:33) = n − n − k p (cid:88) i,j =1 Cov ( x i , x j ) Cov (cid:16) x ki , x kj (cid:17) = n − n − k p (cid:88) i,j =1 Cov ( y i , y j ) Cov (cid:16) z ki , z kj (cid:17) = n − n − k p (cid:88) i =1 γ i E [( z ki ) ] ≤ nn k p (cid:88) i =1 γ i γ ki ≤ pnn k (cid:118)(cid:117)(cid:117)(cid:116) p − p (cid:88) i =1 γ i (cid:118)(cid:117)(cid:117)(cid:116) p − p (cid:88) i =1 ( γ ki ) = Θ (cid:16) p . τ kγ + τ kγ ) − (cid:17) (M2) = o (cid:16) p max(0 ,τ kµ )+max(0 ,τ lµ ) (cid:17) = o ( p τ kA + τ lA )The third combination has two sample means:Var (cid:32) p (cid:88) i =1 ˇ µ i (cid:33) = n − Var (cid:32) p (cid:88) i =1 (cid:88) s,t y is y it (cid:33) = n − p (cid:88) i,j =1 (cid:88) s,t,s (cid:48) ,t (cid:48) Cov (cid:0) y is y it , y js (cid:48) y jt (cid:48) (cid:1) = n − p (cid:88) i,j =1 (cid:88) s Cov (cid:0) y is , y js (cid:1) + (cid:88) s,t (cid:54) = s Cov ( y is y it , y js y jt )  ≤ n − p (cid:88) i,j =1 Cov (cid:0) y i , y j (cid:1) + n − p (cid:88) i,j =1 Cov ( y i , y j ) ≤ p n − (cid:32) p − p (cid:88) i =1 (cid:113) E (cid:2) y i (cid:3)(cid:33) + pn − (cid:32) p − p (cid:88) i =1 γ i (cid:33) (A1) , (A3) = O ( p − ) + Θ (cid:16) p τ γ − (cid:17) (M2) = O ( p − ) + o (cid:16) p , min k τ kµ ) (cid:17) = o ( p τ kA + τ lA ) ∀ k, l (24)We have shown that the variance of all terms and hence Var( (cid:98) A kl ) is o ( p τ kA + τ lA ).(G3) , part II: Consistency of ˆ b The estimator ˆ b is also unbiased, hence we have toshow Var(ˆ b ) = o ( p τ ˆ θ ) = o (1) . In a ﬁrst step, we reformulate the variance:Var(ˆ b ) = Var (cid:32) p (cid:88) i =1 (cid:100) Var(ˆ µ i ) (cid:33) = Var (cid:32) n − ( n − − p (cid:88) i =1 n (cid:88) t =1 ( x it − ˆ µ i ) (cid:33) = n − ( n − − Var  p (cid:88) i =1 n (cid:88) t =1 x it − n − p (cid:88) i =1 n (cid:88) s,t =1 x is x it  . artz et al. The variance is o (1) if the variances of both terms in the sum are o ( p ). We start withVar (cid:32) p (cid:88) i =1 n (cid:88) t =1 x it (cid:33) = n Var (cid:32) p (cid:88) i =1 x it (cid:33) = n Var (cid:32) p (cid:88) i =1 y it (cid:33) = n p (cid:88) i,j =1 Cov (cid:0) y it , y jt (cid:1) ≤ n p (cid:88) i,j =1 (cid:113) E (cid:2) y it (cid:3)(cid:114) E (cid:104) y jt (cid:105) (A3) ≤ p n (1 + α ) (cid:32) p − p (cid:88) i =1 γ i (cid:33) = O ( p ) = o ( p ) . The variance of the second term in the sum is, following the steps in eq. (24),Var (cid:32) n − p (cid:88) i =1 n (cid:88) s,t x is x it (cid:33) = O ( p ) + o (cid:16) p , min k τ kµ )+2 (cid:17) (M1) = o ( p ) . This concludes the proof the Var(ˆ b ) is o ( p τ ˆ θ ) = o (1).(G4) : Restriction on linear combinations Let L be R p or R p ≥ . We haveΘ (cid:16) p τ kA (cid:17) ! = min α ∈ L α k =1 q (cid:88) i =1 E (cid:32) K (cid:88) l =1 α l ( (cid:98) T li − ˆ θ i ) (cid:33)  = min α ∈ L α k =1 q (cid:88) i =1 E (cid:32) K (cid:88) l =1 α l (ˆ µ li − ˆ µ i ) (cid:33)  (25)= min α ∈ L α k =1 q (cid:88) i =1 (cid:32) K (cid:88) l =1 α l ( µ li − µ i ) (cid:33) + K (cid:88) l =1 | α l | Var (ˆ µ i ) + K (cid:88) l =1 | α l | Var (cid:16) ˆ µ li (cid:17) (M3) ≥ Θ (cid:16) p τ kµ (cid:17) + q (cid:88) i =1 Var(ˆ µ ki ) = Θ (cid:16) p τ kµ (cid:17) + Θ(1) = Θ (cid:16) p max(0 ,τ kµ ) (cid:17) This concludes the proof of Theorem 3.

A.4 Proof of Theorem 4 (FOLDL consistency of MTS of the mean)Proof

As above, without loss of generality, we assume µ = . We again start by analysingthe asymptotic behaviour of ∆ ˆ θ , A kk and b , then we prove consistency of (cid:98) A kl and ˆ b .(G1) & (G2) : Asymptotic behaviour of ∆ ˆ θ , A kk and b From equations (20) and(21)we directly obtain τ ˆ θ = 1 and ∀ k : τ kA = 1 . (G3) , part I: Consistency of (cid:98) A kl As for the LDL, we show that all types of terms ineq. (23) are o ( p τ kA + τ lA ). For the FOLDL, this means they have to be o ( p ). Following similar ulti-Target Shrinkage steps as above, we obtainVar (cid:32) p (cid:88) i =1 µ ki ˇ µ li (cid:33) = n − l µ k (cid:62) C l µ k = max i γ li Θ( p τ kµ ) (M2 (cid:48) ) = o ( p τ kµ ) = o ( p )Var (cid:32) p (cid:88) i =1 ˇ µ i ˇ µ ki (cid:33) ≤ pnn k (cid:118)(cid:117)(cid:117)(cid:116) p − p (cid:88) i =1 γ i (cid:118)(cid:117)(cid:117)(cid:116) p − p (cid:88) i =1 ( γ ki ) = o (cid:16) p max(1 ,τ kµ )+max(1 ,τ lµ ) (cid:17) = o ( p )Var (cid:32) p (cid:88) i =1 ˇ µ i (cid:33) = n − p (cid:88) i,j =1 (cid:88) s Cov (cid:0) y is , y js (cid:1) + (cid:88) s,t (cid:54) = s Cov ( y is y it , y is y it )  (26) ≤ n (cid:88) i,j (cid:54) = i Cov (cid:16) ( y kis ) , ( y kjs ) (cid:17) + (1 + α ) pn (cid:32) p − p (cid:88) i =1 γ i (cid:33) + pn (cid:32) p − p (cid:88) i =1 γ i (cid:33) (M2 (cid:48) ) , (M4) = o (cid:0) p (cid:1) + o (cid:0) p (cid:1) We have shown that the variance of all terms and hence Var( (cid:98) A kl ) is o ( p τ kA + τ lA ).(G3) , part II: Consistency of ˆ b We have to show that Var(ˆ b ) is o ( p τ ˆ θ ) = o ( p ):Var(ˆ b ) = Var (cid:32) p (cid:88) i =1 (cid:100) Var(ˆ µ i ) (cid:33) = Var (cid:32) n ( n − p (cid:88) i =1 n (cid:88) t =1 ( x it − ˆ µ i ) (cid:33) = 1 n ( n − Var  p (cid:88) i =1 n (cid:88) t =1 x it − n − p (cid:88) i =1 n (cid:88) s =1 x is x is − n − p (cid:88) i =1 n (cid:88) s,t (cid:54) = s =1 x is x it  . This variance expression is o ( p ) if the variance of each of the three sums is o ( p ). For theﬁrst sum, we use eq. (26) and obtainVar (cid:32) p (cid:88) i =1 x it (cid:33) = Var (cid:32) p (cid:88) i =1 y it (cid:33) = (cid:88) ij Cov( y it , y jt ) ≤ n (cid:88) i,j (cid:54) = i Cov (cid:16) ( y kis ) , ( y kjs ) (cid:17) + (1 + α ) pn (cid:32) p − p (cid:88) i =1 γ i (cid:33) = o ( p ) + O ( p ) . The second sum is proportional to the ﬁrst sum. For the third sum we obtain, by usingeq. (26), Var  p (cid:88) i =1 n (cid:88) s,t (cid:54) = s x is x it  = (cid:88) ij (cid:88) s,t,s (cid:48) ,t (cid:48) Cov (cid:0) x is x it , x js (cid:48) x jt (cid:48) (cid:1) = o (cid:0) p (cid:1) . This concludes the proof that Var(ˆ b ) is o ( p τ ˆ θ ) = o ( p ). artz et al. (G4) : Restriction on linear combinations Similar to eq. (25), we haveΘ (cid:16) p τ kA (cid:17) ! = min α ∈ R α k =1 q (cid:88) i =1 E (cid:32) K (cid:88) l =1 α l ( (cid:98) T li − ˆ θ i ) (cid:33)  ≥ (cid:88) i Var(ˆ µ ki ) = Θ ( p ) . This concludes the proof of Theorem 4.

A.5 Proof of Theorem 5 (LDL consistency of MTS of the covariance)Proof

The estimators (cid:98) A and b depend on the choice of target. We restrict the proof ontargets given by sample covariance matrices of additional data sets. The biased estimators inSch¨afer and Strimmer (2005) and Ledoit and Wolf (2003) have smaller variance, consistencycan be shown following similar steps.(G1) & (G2) : Asymptotic behaviour of ∆ ˆ θ , b k and A kk We ﬁrst show the asymptoticbehaviour ∆ ˆ θ = b k = (cid:88) ij Var (cid:0) S ij (cid:1) = Θ ( p ) ! = Θ ( p τ ˆ θ ) ⇐⇒ τ ˆ θ = 1 . (27)Rotation invariance allows us to analyse in the eigenbasis. The upper bound follows from (cid:88) i,j Var (cid:0) S (cid:48) ij (cid:1) ≤ n (cid:88) i,j (cid:110)(cid:113) Var( y i )Var( y j ) + E (cid:2) y i (cid:3) E (cid:2) y j (cid:3) − E [ y i y j ] (cid:111) (28) ≤ n (cid:88) i,j (cid:110)(cid:113) E [ y i ] E [ y j ] + E (cid:2) y i (cid:3) E (cid:2) y j (cid:3) − E [ y i y j ] (cid:111) ≤ n (cid:88) i,j (cid:113) E [ y i ] E [ y j ] ≤ p n (1 + α ) (cid:32) p (cid:88) i E [ y i ] (cid:33) = Θ( p ) . For the lower bound, we distinguish two cases: for τ γ = 1, we have (cid:88) i,j Var (cid:0) S (cid:48) ij (cid:1) ≥ (cid:88) i Var (cid:0) S (cid:48) ii (cid:1) (29)= 1 n (cid:88) i (cid:8) E (cid:2) y i (cid:3) − E (cid:2) y i (cid:3)(cid:9) ≥ n (cid:88) i β E (cid:2) y i (cid:3) = β pn p (cid:88) i γ i = Θ ( p τ γ ) = Θ( p ) . ulti-Target Shrinkage For the case τ γ <

1, we have (cid:88) i,j

Var (cid:0) S (cid:48) ij (cid:1) = 1 n (cid:88) i,j (cid:8) E (cid:2) y i y j (cid:3) − E [ y i y j ] (cid:9) (30) ≥ n (cid:88) i,j (cid:8) E (cid:2) y i (cid:3) E (cid:2) y j (cid:3) − E [ y i y j ] (cid:9) ≥ n (cid:32)(cid:88) i E (cid:2) y i (cid:3)(cid:33) − n (cid:88) i E (cid:2) y i (cid:3) ≥ p n (cid:32) p (cid:88) i E (cid:2) x i (cid:3)(cid:33) − pn p (cid:88) i γ i = Θ( p ) − Θ( p τ γ )) = Θ( p ) . The asymptotic behaviour of A kk depends on the relationship between the original data X and the additional data set X k : A kk = p (cid:88) i,j =1 E (cid:104)(cid:16) S kij − S ij (cid:17) (cid:16) S kij − S ij (cid:17)(cid:105) , = p (cid:88) i,j =1 ( C ij − C kij ) + Var( S kij ) + Var( S ij ) (C1) , (27) = Θ( p τ kC ) + Θ( p ) ! = Θ( p τ kA ) ⇐⇒ τ kA = max(1 , τ kC ) . (G3) : Consistency of (cid:98) A kl As the estimator (cid:98) A kl is unbiased (Bartz and M¨uller, 2013),we have to show thatVar (cid:16) (cid:98) A kl (cid:17) = Var  p (cid:88) i,j =1 (cid:16) S kij − S ij (cid:17) (cid:16) S li,j − S i,j (cid:17) = Var  p (cid:88) i,j =1 S kij S lij − S kij S ij − S lij S ij + S ij  , (31)= o ( p τ kA + τ lA ) = o ( p max(1 ,τ kC )+max(1 ,τ lC ) ) . It suﬃces to show that the variances of all terms in the sum in eq. (31) are o ( p τ kA + τ lA ). Variance of (cid:80) ij S ij We start with the product of two identical sample covariances: (cid:88) ij S ij = (cid:88) ij (cid:32) n (cid:88) s y is y jt (cid:33) = p n (cid:88) st (cid:32) p (cid:88) i y is y it (cid:33) = p n (cid:88) s (cid:32) p (cid:88) i y is (cid:33) + 1 n (cid:88) s,t (cid:54) = s (cid:32)(cid:88) i y is y it (cid:33) . (32) artz et al. Again, it is suﬃcient to show that the variance of both terms separately is o ( p τ kA + τ lA ). Forthe ﬁrst term, we haveVar  p n (cid:88) s (cid:32) p (cid:88) i y is (cid:33)  ≤ p n E (cid:32) p (cid:88) i y i (cid:33)  ≤ p (1 + α ) n E (cid:34) p (cid:88) i y i (cid:35) = O ( p ) = o ( p τ kA + τ lA ) . Let us now look at the second term in eq. (32):Var  n (cid:88) s,t (cid:54) = s (cid:32)(cid:88) i y is y it (cid:33)  = 1 n (cid:88) s,t (cid:54) = s (cid:88) s (cid:48) ,t (cid:48) (cid:54) = s (cid:48) Cov (cid:32)(cid:88) i y is y it (cid:33) , (cid:32)(cid:88) i y is (cid:48) y it (cid:48) (cid:33)  . The covariance expression only depends on the cardinal of the intersection, which we denoteby ( { s, t } ∪ { s (cid:48) , t (cid:48) } ) and which can take the values of 0, 1 and 2. When this cardinality iszero, (cid:0) { s, t } ∪ { s (cid:48) , t (cid:48) } (cid:1) = 0 , there is independence and the covariance is zero as well. For (cid:0) { s, t } ∪ { s (cid:48) , t (cid:48) } (cid:1) = 1 , we have 4 n ( n − n −

2) expressions of the formCov (cid:32)(cid:88) i y i y i (cid:33) , (cid:32)(cid:88) i y i y i (cid:33)  = E (cid:32)(cid:88) i y i y i (cid:33) (cid:32)(cid:88) i y i y i (cid:33)  − E (cid:32)(cid:88) i y i y i (cid:33)  E (cid:32)(cid:88) i y i y i (cid:33)  ≤ max  E (cid:32)(cid:88) i y i y i (cid:33) (cid:32)(cid:88) i y i y i (cid:33)  , E (cid:32)(cid:88) i y i y i (cid:33)  , as both terms are positive. For the ﬁrst term, we have E (cid:32)(cid:88) i y i y i (cid:33) (cid:32)(cid:88) i y i y i (cid:33)  = (cid:88) i,j,i (cid:48) ,j (cid:48) E (cid:2) y i y i (cid:48) y j y j (cid:48) (cid:3) E [ y i y i (cid:48) ] E (cid:2) y j y j (cid:48) (cid:3) = (cid:88) i,j E (cid:2) y i y j (cid:3) E (cid:2) y i (cid:3) E (cid:2) y j (cid:3) ≤ p (cid:32) p (cid:88) i (cid:113) E (cid:2) y i (cid:3) E (cid:2) y i (cid:3)(cid:33) ≤ p (1 + α ) (cid:32) p (cid:88) i E (cid:2) y i (cid:3)(cid:33) = O (cid:0) p τ γ +2 (cid:1) . ulti-Target Shrinkage For the second term, we have E (cid:32)(cid:88) i y i y i (cid:33)  = (cid:88) i,j E [ y i y j ]  = p (cid:32) p (cid:88) i E (cid:2) y i (cid:3)(cid:33) = O (cid:0) p τ γ +2 (cid:1) Therefore, we have, combined with the prefactors,4 n ( n − n − n (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) Cov (cid:32) p (cid:88) i y is y it (cid:33) , (cid:32) p (cid:88) i y is y it (cid:33) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) = 1 n O (cid:0) p τ γ +2 (cid:1) = O (cid:0) p τ γ +1 (cid:1) (C3) = o ( p τ kA + τ lA ) , therefore we have shown that the terms with ( { s, t } ∪ { s (cid:48) , t (cid:48) } ) = 1 are o ( p τ kA + τ lA ).For (cid:0) { s, t } ∪ { s (cid:48) , t (cid:48) } (cid:1) = 2 , we get 2 n ( n −

1) expressions of the form (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)

Cov (cid:32)(cid:88) i y is y it (cid:33) , (cid:32)(cid:88) i y is y it (cid:33) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) = (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) Cov (cid:32)(cid:88) i y i y i (cid:33) , (cid:32)(cid:88) i y i y i (cid:33) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ≤ (cid:88) i,j,i (cid:48) ,j (cid:48) (cid:12)(cid:12) Cov (cid:0) y i y i y i (cid:48) y i (cid:48) , y j y j y j (cid:48) y j (cid:48) (cid:1)(cid:12)(cid:12) . We decompose the set of integers into two disjoint subsets: { , . . . , p } = Q ∪ R , where Q is the set of distinct integers and R is the remainder:= (cid:88) i,j,i (cid:48) ,j (cid:48) ∈ Q (cid:12)(cid:12) Cov (cid:0) y i y i y i (cid:48) y i (cid:48) , y j y j y j (cid:48) y j (cid:48) (cid:1)(cid:12)(cid:12) + (cid:88) i,j,i (cid:48) ,j (cid:48) ∈ R (cid:12)(cid:12) Cov (cid:0) y i y i y i (cid:48) y i (cid:48) , y j y j y j (cid:48) y j (cid:48) (cid:1)(cid:12)(cid:12) . The sum over Q we can bring into a form which is dominated as a consequence of (C2): (cid:12)(cid:12) Cov (cid:0) y i y i y i (cid:48) y i (cid:48) , y j y j y j (cid:48) y j (cid:48) (cid:1)(cid:12)(cid:12) = (cid:12)(cid:12) E (cid:2) y i y i (cid:48) y j y j (cid:48) (cid:3) − E [ y i y i (cid:48) ] E [ y i y i (cid:48) ] (cid:12)(cid:12) = E (cid:2) y i y i (cid:48) y j y j (cid:48) (cid:3) = (cid:0) Cov (cid:0) y i y i (cid:48) , y j y j (cid:48) (cid:1) + E [ y i y i (cid:48) ] E (cid:2) y j y j (cid:48) (cid:3)(cid:1) = (cid:0) Cov (cid:0) y i y i (cid:48) , y j y j (cid:48) (cid:1)(cid:1) . (33)Taking the prefactors into account, we get2 n ( n − n (cid:88) ( i,j,i (cid:48) ,j (cid:48) ) ∈ Q (cid:12)(cid:12) Cov (cid:0) y i y i y i (cid:48) y i (cid:48) , y j y j y j (cid:48) y j (cid:48) (cid:1)(cid:12)(cid:12) ≤ p n (cid:88) ( i,j,i (cid:48) ,j (cid:48) ) ∈ Q (cid:0) Cov (cid:0) y i y i (cid:48) , y j y j (cid:48) (cid:1)(cid:1) | Q p | (C2) = O ( p ) o (1) = o ( p ) = o ( p τ kA + τ lA ) artz et al. For the sum over R , we have (cid:88) ( i,j,i (cid:48) ,j (cid:48) ) ∈ R (cid:12)(cid:12) Cov (cid:0) y i y i y i (cid:48) y i (cid:48) , y j y j y j (cid:48) y j (cid:48) (cid:1)(cid:12)(cid:12) ≤ (cid:88) i,j,j (cid:48) (cid:12)(cid:12) Cov (cid:0) y i y i , y j y j y j (cid:48) y j (cid:48) (cid:1)(cid:12)(cid:12) + 4 | Cov ( y i y i y i (cid:48) y i (cid:48) , y i y i y j y j ) |≤ (cid:88) i,j,j (cid:48) (cid:114) E (cid:2) y i y i (cid:3) E (cid:104) y j y j y j (cid:48) y j (cid:48) (cid:105) + 4 (cid:114) E (cid:104) y i y i y j y j (cid:105) E (cid:104) y i y i y j (cid:48) y j (cid:48) (cid:105) ≤ (cid:88) i,j,j (cid:48) E (cid:2) y i (cid:3) (cid:114) E (cid:104) y j (cid:105)(cid:114) E (cid:104) y j (cid:48) (cid:105) ≤ p (1 + α ) (cid:32) p (cid:88) i E (cid:2) y i (cid:3)(cid:33)  p (cid:88) j E (cid:2) y j (cid:3) = O (cid:0) p τ γ +3 (cid:1) . (34)Together with the prefactors, we obtain2 n ( n − n (cid:88) ( i,j,i (cid:48) ,j (cid:48) ) ∈ R (cid:12)(cid:12) Cov (cid:0) y i y i y i (cid:48) y i (cid:48) , y j y j y j (cid:48) y j (cid:48) (cid:1)(cid:12)(cid:12) = 1 n O (cid:0) p τ γ +3 (cid:1) = O (cid:0) p τ γ +1 (cid:1) (C3) = o ( p τ kA + τ lA ) . This ﬁnishes the proof for the terms with ( { s, t } ∪ { s (cid:48) , t (cid:48) } ) = 2 and in total we have shownthat Var( (cid:80) ij S ij ) is o ( p τ kA + τ lA ). For Var( S kij S kij ) , k = l , an analogue proof holds. Variance of (cid:80) ij S kij S ij Let us now analyse the products of diﬀerent sample covariancesin eq. (31).Var (cid:88) ij S kij S ij  = Var (cid:88) ij (cid:88) st x kis x kjs x it x jt  = 1 nn k (cid:88) ijgh Cov (cid:16) x ki x kj x i x j , x kg x kh x g x h (cid:17) = 1 nn k (cid:88) ijgh Cov (cid:16) x ki x kj , x kg x kh (cid:17) Cov ( x i x j , x g x h ) − C kij C kgh Cov ( x i x j , x g x h ) − C ij C gh Cov (cid:16) x ki x kj , x kg x kh (cid:17) The ﬁrst term can be separated into the contributions from the two diﬀerent data sets:1 nn k (cid:88) ijgh Cov (cid:16) x ki x kj , x kg x kh (cid:17) Cov ( x i x j , x g x h ) ≤ nn k (cid:88) ijgh Cov (cid:16) x ki x kj , x kg x kh (cid:17) + Cov ( x i x j , x g x h )These terms are rotation invariant, therefore we analyse1 nn k (cid:88) ijgh Cov ( y i y j , y g y h ) . ulti-Target Shrinkage For i, j, g, h distinct, this leads directly to assumption (C2). Otherwise, we have1 nn k (cid:88) ijgh Cov ( y i y j , y g y h ) ≤ n k n (cid:88) ijg Cov ( y i y g , y j y g ) + Cov ( y i y j , y g y g ) ≤ n k n (cid:88) ijg E (cid:2) ( y i ) y j (cid:3) E (cid:2) ( y g ) (cid:3) ≤ n k n (cid:88) ijg (cid:112) E [( y i ) ] (cid:113) E [( y j ) ] E (cid:2) ( y g ) (cid:3) ≤ p n k n (cid:32) p (cid:88) i γ i (cid:33) (cid:32) p (cid:88) g γ g (cid:33) = O (cid:0) p τ γ +1 (cid:1) = o ( p τ kA + τ lA ) . Next we consider the second term,1 nn k (cid:88) ijgh C kij C kgh Cov ( x i x j , x g x h ) = 1 nn k (cid:88) ijgh Σ kij Σ kgh Cov ( z i z j , z g z h ) ≤ nn k (cid:88) ig γ ki γ kg (cid:113) E (cid:2) z i (cid:3) E (cid:2) z g (cid:3) = 1 nn k (cid:32)(cid:88) i γ ki (cid:113) E (cid:2) z i (cid:3)(cid:33) ≤ p nn k (cid:32) p (cid:88) i ( γ ki ) + E (cid:2) z i (cid:3)(cid:33) ≤ p (1 + α ) nn k (cid:32) p (cid:88) i ( γ ki ) + γ i (cid:33) = O ( p τ γ ,τ kγ ) ) = o ( p τ kA + τ lA ) . With this we have shown that all terms in Var (cid:16)(cid:80) ij S kij S ij (cid:17) and hence Var( (cid:98) A kl ) is o ( p τ kA + τ lA ).(G3) , part II: Consistency of ˆ b k By reformulation we obtain (cid:88) ij (cid:100) Var S ij = (cid:88) ij (cid:32) n − n (cid:88) s (cid:16) y is y js − n (cid:88) s (cid:48) y is (cid:48) y js (cid:48) (cid:17) (cid:33) = 1( n − n (cid:88) ij (cid:32) (cid:88) s y is y js − n (cid:88) ss (cid:48) y is y js y is (cid:48) y js (cid:48) (cid:33) = p ( n − n (cid:88) s (cid:32) p (cid:88) i y is (cid:33) − n − (cid:88) ij S ij . (35)Both terms, with diﬀerent prefactors, have been analysed above. The variance of ﬁrst termis O ( p ) and the bound on the variance of the second term is n − o ( p ,τ C ) ) = o ( p ).Hence Var(ˆ b ) is o ( p τ ˆ θ ). artz et al. (G4) : Restriction on linear combinations Let L be R p or R p ≥ . We haveΘ (cid:16) p τ kA (cid:17) ! = min α ∈ L α k =1 q (cid:88) i =1 E (cid:32) K (cid:88) l =1 α l ( (cid:98) T li − ˆ θ i ) (cid:33)  = min α ∈ L α k =1 q (cid:88) i =1 E (cid:32) K (cid:88) l =1 α l ( S li − S i ) (cid:33)  (36)= min α ∈ L α k =1 q (cid:88) i =1 (cid:32) K (cid:88) l =1 α l ( S li − S i ) (cid:33) + Var (cid:32) K (cid:88) l =1 α l S i (cid:33) + Var (cid:32) K (cid:88) l =1 α l S li (cid:33) ≥ Θ (cid:16) p τ kC (cid:17) + (cid:88) i Var( S ki ) = Θ (cid:16) p τ kC (cid:17) + Θ( p ) = Θ (cid:16) p max(1 ,τ kC ) (cid:17) This concludes the proof of Theorem 5.

A.6 Proof of Theorem 6 (FOLDL consistency of MTS of the covariance) (G1) & (G2) : Asymptotic behaviour of ∆ ˆ θ , b k and A kk Proof

We ﬁrst show theasymptotic behaviour∆ ˆ θ = b k = (cid:88) ij Var (cid:0) S ij (cid:1) = Θ (cid:0) p (cid:1) ! = Θ ( p τ ˆ θ ) ⇐⇒ τ ˆ θ = 2 (37)The upper bound follows from (compare to eq. (28)) b k = (cid:88) i,j Var (cid:0) S (cid:48) ij (cid:1) ≤ p n (1 + α ) (cid:32) p (cid:88) i E [ y i ] (cid:33) = Θ( p ) . For the lower bound, we again distinguish two cases: for τ γ = 1, we have (compareto eq. (29)) (cid:88) i,j Var (cid:0) S (cid:48) ij (cid:1) = β pn p (cid:88) i γ i = Θ (cid:0) p τ γ (cid:1) = Θ( p ) . For the case τ γ <

1, we have (compare to eq. (30)) (cid:88) i,j

Var (cid:0) S (cid:48) ij (cid:1) ≥ p n (cid:32) p (cid:88) i E (cid:2) x i (cid:3)(cid:33) − pn p (cid:88) i γ i = Θ( p ) − Θ( p τ γ ) = Θ( p ) . For the asymptotic behaviour of A kk we then have A kk = p (cid:88) i,j =1 ( C ij − C kij ) + Var( S kij ) + Var( S ij ) = Θ( p τ kC ) + Θ( p ) ! = Θ( p τ kA ) , (38) ⇐⇒ ∀ k : τ kA = 2where used the fact that (cid:80) ij Var( S kij ) has the same limit behaviour as (cid:80) ij Var( S ij ). ulti-Target Shrinkage (G3) , part I: Consistency of (cid:98) A kl The proof is analogue to the proof in Theorem 5. Weonly show that Var (cid:16)(cid:80) ij S ij (cid:17) , the expression with the highest variance, is o ( p τ kA + τ lA ) = o ( p ).We use the same decomposition as above: (cid:88) ij S ij = 1 n (cid:88) s (cid:32)(cid:88) i y is (cid:33) + 1 n (cid:88) s,t (cid:54) = s (cid:32)(cid:88) i y is y it (cid:33) . (39)This asymptotic setting is easier, because the sums over s and t are ﬁnite sums. We havea ﬁnite number of terms in the ﬁrst sum in eq. (39):Var (cid:32)(cid:88) i y is (cid:33)  = (cid:88) i,j,i (cid:48) ,j (cid:48) Cov (cid:0) y i y j , y i (cid:48) y j (cid:48) (cid:1) = (cid:88) i,j,i (cid:48) ,j (cid:48) ∈ Q Cov (cid:0) y i y j , y i (cid:48) y j (cid:48) (cid:1) + (cid:88) i,j,i (cid:48) ,j (cid:48) ∈ R Cov (cid:0) y i y j , y i (cid:48) y j (cid:48) (cid:1) . (40)For the sum over Q , we need assumption (C4): (cid:88) i,j,i (cid:48) ,j (cid:48) ∈ Q Cov (cid:0) y i y j , y i (cid:48) y j (cid:48) (cid:1) ≤ p (cid:80) i,j,i (cid:48) ,j (cid:48) ∈ Q Cov (cid:16) y i y j , y i (cid:48) y j (cid:48) (cid:17) | Q p | (C4) = o ( p ) . For the sum over R , we have, (cid:88) ( i,i (cid:48) ,j,j (cid:48) ) ∈ R Cov (cid:0) y i y j , y i (cid:48) y j (cid:48) (cid:1) ≤ (cid:88) i,j,i (cid:48) Cov (cid:0) y i y j , y i (cid:48) y i (cid:1) + Cov (cid:0) y i y j , y i (cid:48) (cid:1) ≤ (cid:88) i,j,i (cid:48) (cid:113) E [ y i y j ] (cid:113) E [ y i (cid:48) y i ] + (cid:113) E [ y i y j ] (cid:113) E [ y i (cid:48) ] ≤ (cid:88) i,j,i (cid:48) (cid:113) E [ y i ] E [ y j ] (cid:113) E [ y i (cid:48) ] E [ y i ] + (cid:113) E [ y i ] E [ y j ] (cid:113) E [ y i (cid:48) ] ≤ α ) (cid:88) i,j,i (cid:48) E [ y i ] E [ y j ] E [ y i (cid:48) ] = O (cid:0) p τ γ (cid:1) (C3 (cid:48) ) = o ( p ) . For the terms in the second sum in eq. (39), we haveVar (cid:32)(cid:88) i y i y i (cid:33)  = (cid:88) i,j,i (cid:48) ,j (cid:48) Cov (cid:0) y i y i y j y j , y i (cid:48) y i (cid:48) y j (cid:48) y j (cid:48) (cid:1) ≤ (cid:88) i,j,i (cid:48) ,j (cid:48) ∈ Q ∪ R (cid:12)(cid:12) Cov (cid:0) y i y i y i (cid:48) y i (cid:48) , y j y j y j (cid:48) y j (cid:48) (cid:1)(cid:12)(cid:12) . artz et al. For the sum over Q , we simplify using eq. (33) and obtain (cid:88) i,j,i (cid:48) ,j (cid:48) ∈ Q (cid:12)(cid:12) Cov (cid:0) y i y i y i (cid:48) y i (cid:48) , y j y j y j (cid:48) y j (cid:48) (cid:1)(cid:12)(cid:12) = (cid:88) i,j,i (cid:48) ,j (cid:48) ∈ Q (cid:0) Cov (cid:0) y i y i (cid:48) , y j y j (cid:48) (cid:1)(cid:1) ≤ p (cid:88) ( i,j,i (cid:48) ,j (cid:48) ) ∈ Q (cid:0) Cov (cid:0) y i y i (cid:48) , y j y j (cid:48) (cid:1)(cid:1) | Q p | (C4) = o ( p )For the sum over R , we have, as in eq. (34), (cid:88) i,j,i (cid:48) ,j (cid:48) ∈ R (cid:12)(cid:12) Cov (cid:0) y i y i y i (cid:48) y i (cid:48) , y j y j y j (cid:48) y j (cid:48) (cid:1)(cid:12)(cid:12) = Θ (cid:0) p γ τ (cid:1) (C3 (cid:48) ) = o ( p ) . With this we have shown that all terms and hence Var( (cid:98) A kl ) is o ( p τ kA + τ lA ).(G3) , part II: Consistency of ˆ b k As in eq. (35) we have (cid:88) ij (cid:100) Var( S ij ) = p ( n − n (cid:88) s (cid:32) p (cid:88) i y is (cid:33) − n − (cid:88) ij S ij . The ﬁrst term is equal to the ﬁrst term in eq. (39) and hence its variance o ( p ). The secondterm is proportional to the left hand side of eq. (39) and its variance therefore also o ( p ).In total, Var(ˆ b ) is o ( p τ ˆ θ ).(G4) : Restriction on linear combinations Following the same steps as in eq. (36), weobtain Θ (cid:16) p τ kA (cid:17) ! = min α ∈ R p α k =1 q (cid:88) i =1 E (cid:32) K (cid:88) l =1 α l ( (cid:98) T li − ˆ θ i ) (cid:33)  ≥ (cid:88) i Var( S ki ) = Θ( p )This concludes the proof of Theorem 6. References

Fevzi Alimoglu and Ethem Alpaydin. Combining multiple representations and classiﬁers forpen-based handwritten digit recognition. In

Document Analysis and Recognition, 1997.,Proceedings of the Fourth International Conference on , volume 2, pages 637–640. IEEE,1997.Kevin Bache and Moshe Lichman. UCI machine learning repository. University of California,Irvine, School of Information and Computer Sciences, 2013. URL http://archive.ics.uci.edu/ml .Daniel Bartz and Klaus-Robert M¨uller. Generalizing analytic shrinkage for arbitrary co-variance structures. In C.J.C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K.Q.Weinberger, editors,

Advances in Neural Information Processing Systems 26 , pages 1869–1877. 2013. ulti-Target Shrinkage Daniel Bartz and Klaus-Robert M¨uller. Covariance shrinkage for autocorrelateddata. In Z. Ghahramani, M. Welling, C. Cortes, N.D. Lawrence, and K.Q.Weinberger, editors,

Advances in Neural Information Processing Systems 27 , pages1592–1600. Curran Associates, Inc., 2014. URL http://papers.nips.cc/paper/5399-covariance-shrinkage-for-autocorrelated-data.pdf .Benjamin Blankertz, Ryota Tomioka, Steven Lemm, Motoaki Kawanabe, and Klaus-RobertM¨uller. Optimizing spatial ﬁlters for robust EEG single-trial analysis.

Signal ProcessingMagazine, IEEE , 25(1):41–56, 2008.Benjamin Blankertz, Claudia Sannelli, Sebastian Halder, Eva M Hammer, Andrea K¨ubler,Klaus-Robert M¨uller, Gabriel Curio, and Thorsten Dickhaus. Neurophysiological predic-tor of SMR-based BCI performance.

Neuroimage , 51(4):1303–1309, 2010.Benjamin Blankertz, Steven Lemm, Matthias Sebastian Treder, Stefan Haufe, and Klaus-Robert M¨uller. Single-trial analysis and classiﬁcation of ERP components – a tutorial.56:814–825, 2011. URL http://dx.doi.org/10.1016/j.neuroimage.2010.06.048 .James W. Daniel. Stability of the solution of deﬁnite quadratic programs.

MathematicalProgramming , 5:41–53, 1973.David L Donoho and Iain M Johnstone. Adapting to unknown smoothness via waveletshrinkage.

Journal of the american statistical association , 90(432):1200–1224, 1995.William James and Charles Stein. Estimation with quadratic loss. In

Proceedings of thefourth Berkeley symposium on mathematical statistics and probability , volume 1, pages361–379, 1961.Olivier Ledoit and Michael Wolf. Improved estimation of the covariance matrix of stockreturns with an application to portfolio selection.

Journal of Empirical Finance , 10:603–621, 2003.Olivier Ledoit and Michael Wolf. A well-conditioned estimator for large-dimensional covari-ance matrices.

Journal of Multivariate Analysis , 88(2):365–411, 2004.Sinno Jialin Pan and Qiang Yang. A survey on transfer learning.

Knowledge and DataEngineering, IEEE Transactions on , 22(10):1345–1359, 2010.Alessio Sancetta. Weak conditions for shrinking multivariate nonparametric density esti-mators.

Journal of Multivariate Analysis , 115:285–300, 2013.Juliane Sch¨afer and Korbinian Strimmer. A shrinkage approach to large-scale covariancematrix estimation and implications for functional genomics.

Statistical Applications inGenetics and Molecular Biology , 4(1):1175–1189, 2005.Martijn Schreuder, Thomas Rost, and Michael Tangermann. Listen, you are writing! Speed-ing up online spelling with a dynamic auditory BCI. 5(112), 2011. ISSN 1662-453X. doi:10.3389/fnins.2011.00112. URL . artz et al. Charles Stein. Inadmissibility of the usual estimator for the mean of a multivariate normaldistribution. In

Proc. 3rd Berkeley Sympos. Math. Statist. Probability , volume 1, pages197–206, 1956., volume 1, pages197–206, 1956.

Related Researches

Nonparametric C- and D-vine based quantile regression

by Marija Tepegjozova

On structural and practical identifiability

by Franz-Georg Wieland

Fisher Scoring for crossed factor Linear Mixed Models

by Thomas Maullin-Sapey

A test for comparing conditional ROC curves with multidimensional covariates

by Arís Fanjul-Hevia

Changepoint detection on a graph of time series

by Karl L. Hallgren

RaSE: A Variable Screening Framework via Random Subspace Ensembles

by Ye Tian

Estimating the treatment effect for adherers using multiple imputation

by Junxiang Luo

Nested Group Testing Procedures for Screening

by Yaakov Malinovsky

An empirical comparison and characterisation of nine popular clustering methods

by Christian Hennig

Vine copula mixture models and clustering for non-Gaussian data

by ?zge Sahin

Identifying regions of inhomogeneities in spatial processes via an M-RA and mixture priors

by Marco H. Benedetti

Classification of Categorical Time Series Using the Spectral Envelope and Optimal Scalings

by Zeda Li

Eliciting judgements about dependent quantities of interest: The SHELF extension and copula methods illustrated using an asthma case study

by Björn Holzhauer

Incompletely observed nonparametric factorial designs with repeated measurements: A wild bootstrap approach

by Lubna Amro

Tilted Nonparametric Regression Function Estimation

by Farzaneh Boroumand

Factor-augmented Smoothing Model for Functional Data

by Yuan Gao

Hierarchical Multivariate Directed Acyclic Graph Auto-Regressive (MDAGAR) models for spatial diseases mapping

by Leiwen Gao

Model Calibration via Distributionally Robust Optimization: On the NASA Langley Uncertainty Quantification Challenge

by Yuanlu Bai

Bayesian Fusion: Scalable unification of distributed statistical analyses

by Hongsheng Dai

Mortality Forecasting using Factor Models: Time-varying or Time-invariant Factor Loadings?

by Lingyu He

Statistical Inference for Ordinal Predictors in Generalized Linear and Additive Models with Application to Bronchopulmonary Dysplasia

by Jan Gertheiss

Sensitivity Analysis for Unmeasured Confounding via Effect Extrapolation

by Wen Wei Loh

Splitting strategies for post-selection inference

by Daniel G. Rasines

Covariate adjustment in subgroup analyses of randomized clinical trials: A propensity score approach

by Siyun Yang

Unobserved classes and extra variables in high-dimensional discriminant analysis

by Michael Fop

«

1

2

3

4

»

Submitted on 5 Dec 2014 Updated

arXiv.org Original Source

NASA ADS

Google Scholar

Semantic Scholar