Statistical control for spatio-temporal MEG/EEG source imaging with desparsified multi-task Lasso
Jérôme-Alexis Chevalier, Alexandre Gramfort, Joseph Salmon, Bertrand Thirion
Jérôme-Alexis Chevalier
Inria Saclay, Paris-Saclay, France
[email protected]

Alexandre Gramfort
Inria Saclay, Paris-Saclay, France
[email protected]

Joseph Salmon
IMAG, Université de Montpellier, Montpellier, France
[email protected]

Bertrand Thirion
Inria Saclay, CEA, Paris-Saclay, France
[email protected]
Abstract
Detecting where and when brain regions activate in a cognitive task or in a given clinical condition is the promise of non-invasive techniques like magnetoencephalography (MEG) or electroencephalography (EEG). This problem, referred to as source localization, or source imaging, poses however a high-dimensional statistical inference challenge. While sparsity-promoting regularizations have been proposed to address the regression problem, it remains unclear how to ensure statistical control of false detections. Moreover, M/EEG source imaging requires working with spatio-temporal data and autocorrelated noise. To deal with this, we adapt the desparsified Lasso estimator, an estimator tailored for high-dimensional linear models that asymptotically follows a Gaussian distribution under sparsity and moderate feature correlation assumptions, to temporal data corrupted with autocorrelated noise. We call it the desparsified multi-task Lasso (d-MTLasso). We combine d-MTLasso with spatially constrained clustering to reduce data dimension and with ensembling to mitigate the arbitrary choice of clustering; the resulting estimator is called ensemble of clustered desparsified multi-task Lasso (ecd-MTLasso). With respect to current procedures, the two advantages of ecd-MTLasso are that i) it offers statistical guarantees and ii) it allows one to trade spatial specificity for sensitivity, leading to a powerful adaptive method. Extensive simulations on realistic head geometries, as well as empirical results on various MEG datasets, demonstrate the high recovery performance of ecd-MTLasso and its primary practical benefit: offering a statistically principled way to threshold MEG/EEG source maps.

Source imaging with magnetoencephalography (MEG) and electroencephalography (EEG) delivers insights into brain activity with high temporal and good spatial resolution in a non-invasive way (Baillet et al., 2001).
It however requires solving the bioelectromagnetic inverse problem, which is a high-dimensional ill-posed regression problem. Various approaches have been proposed to regularize the estimation of the regression coefficients that map activity to brain locations. Historically, ℓ2 regularization was considered first (Hämäläinen and Ilmoniemi, 1994), with successive improvements known as dSPM (Dale et al., 2000) and sLORETA (Pascual-Marqui, 2002) that are referred to as "noise normalized" solutions. The reason is that the coefficients are standardized with an estimate of the noise standard deviation, producing outputs that are comparable to T or F statistics, yet not statistically calibrated. These latter techniques have since become standard when using ℓ2 approaches. More recently, alternative approaches based on sparsity assumptions have been proposed with the ambition to improve the spatial specificity of M/EEG source imaging (Matsuura and Okabe, 1995; Haufe et al., 2009; Gramfort et al., 2012; Lucka et al., 2012; Wipf and Nagarajan, 2009). The output of such methods consists of focal sources, as opposed to the blurred images obtained with ℓ2 regularization. However, obtaining statistics ("noise normalized") from sparse or non-linear estimators seems challenging, especially since M/EEG data are spatio-temporal data with complex noise structure. A natural way to deal with the temporal dimension is to consider a multi-task estimator and structured sparse priors based on ℓ2/ℓ1 mixed norms (Ou et al., 2009; Gramfort et al., 2012).

In the statistical literature, some attempts to obtain an estimate of both the regression coefficients and their variance have been proposed for linear models in high dimension (Wasserman and Roeder, 2009; Meinshausen et al., 2009; Bühlmann, 2013). These estimates can then be translated to p-value maps, i.e., maps of p-values associated with each covariate.
Some methods adapted to sparse scenarios have then proposed to debias the Lasso to obtain p-values or confidence intervals (Zhang and Zhang, 2014; van de Geer et al., 2014; Javanmard and Montanari, 2014). We refer to such variants as the desparsified Lasso. Recently, desparsified extensions of the group Lasso have also been considered (Mitra and Zhang, 2016; Stucky and van de Geer, 2018). However, all these previous methods generally lack power when p ≫ n. Here, we propose to address a multi-task setting in the presence of correlated noise, and to deal with the high-dimensional regime where p ≫ n by leveraging data structure, as done by Chevalier et al. (2018). All these challenges need to be considered for M/EEG source imaging.

Our first contribution is to propose the desparsified multi-task Lasso (d-MTLasso), an extension of the desparsified Lasso (d-Lasso) (Zhang and Zhang, 2014; van de Geer et al., 2014) to the multi-task setting (Obozinski et al., 2010). More precisely, we adapt the group formulation by Mitra and Zhang (2016) to the multi-task setting, which enjoys i) a simple test statistic formula with ii) a natural integration of autocorrelated noise and iii) a simplification of the assumptions. Our second contribution is to introduce the ensemble of clustered desparsified multi-task Lasso (ecd-MTLasso), which has two advantages compared to current methods: i) it offers statistical guarantees and ii) it allows one to trade spatial specificity for sensitivity, leading to a powerful adaptive method. Our third contribution is an empirical validation of the theoretical claims. In particular, we run extensive simulations on realistic head geometries, as well as empirical results on various MEG datasets, to demonstrate the high recovery performance of ecd-MTLasso and its primary practical benefit: offering a statistically principled way to threshold MEG/EEG source maps.
In this section, we give the noise model, we provide standard tools for solving the source localization problem and, mainly, we present three new methods with their assumptions and statistical guarantees.
For clarity, we use bold lowercase for vectors and bold uppercase for matrices. For any positive integer p ∈ N*, we write [p] for the set {1, ..., p}. For a vector β, β_j refers to its j-th coordinate. For a matrix X ∈ R^{n×p}, X^{(−j)} refers to the matrix X without the j-th column, X_{i,.} refers to the i-th row, X_{.,j} to the j-th column, and X_{i,j} refers to the element in the i-th row and j-th column. The notation ‖·‖ refers to the Frobenius norm for matrices and to the standard Euclidean norm for vectors. For a covariance matrix M, the Mahalanobis norm is denoted by ‖·‖_{M^{−1}}, and for a given vector a we have ‖a‖²_{M^{−1}} ≜ a⊤ M^{−1} a. For B ∈ R^{p×T}, ‖B‖_{2,1} = Σ_{j=1}^p ‖B_{j,.}‖, and its (row) support is Supp(B) = {j ∈ [p] : B_{j,.} ≠ 0}.

We assume that the underlying model is linear:

    Y = XB + E,    (1)

where Y ∈ R^{n×T} is the signal observed on M/EEG sensors, X ∈ R^{n×p} the design matrix representing the M/EEG forward model, B ∈ R^{p×T} the underlying signal in source space and E ∈ R^{n×T} the noise. We assume that there exist ρ ∈ [0, 1) and σ > 0 such that for all t ∈ [T], E_{.,t} ~ N(0, σ² I_n), and that for all i ∈ [n] and all t ∈ [T−1], Cor(E_{i,t}, E_{i,t+1}) = ρ. For all i ∈ [n], E_{i,.} is Gaussian with Toeplitz covariance, i.e., defining M ∈ R^{T×T} by M_{t,u} = σ² ρ^{|t−u|} for all (t, u) ∈ [T]², we have:

    E_{i,.} ~ N(0, M).    (2)

We further assume that X has been column-wise standardized and denote by Σ̂ ∈ R^{p×p} the empirical covariance matrix of X, i.e., Σ̂ = X⊤X/n with Σ̂_{j,j} = 1. All proofs are given in Appendix D.

To quantify the ability of an M/EEG source imaging technique to obtain a good estimate B̂, a commonly reported quantity is the Peak Localization Error (PLE) (Hauk et al., 2011).
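As a concrete illustration of the data model, the following sketch simulates data following (1) with AR(1) Toeplitz noise as in (2); the problem sizes and the values of σ and ρ are arbitrary toy choices, not those used in the paper.

```python
import numpy as np

def ar1_covariance(T, sigma, rho):
    # Toeplitz covariance of model (2): M[t, u] = sigma^2 * rho^|t - u|
    idx = np.arange(T)
    return sigma ** 2 * rho ** np.abs(idx[:, None] - idx[None, :])

rng = np.random.default_rng(0)
n, p, T = 50, 200, 10          # illustrative sizes (n << p, as in M/EEG)
sigma, rho = 1.0, 0.6          # arbitrary noise level and autocorrelation
M = ar1_covariance(T, sigma, rho)

# Each sensor's time course E[i, :] is drawn from N(0, M)
E = rng.multivariate_normal(np.zeros(T), M, size=n)

# Column-wise standardized design and a row-sparse coefficient matrix
X = rng.standard_normal((n, p))
X = (X - X.mean(axis=0)) / X.std(axis=0)
B = np.zeros((p, T))
B[:3, :] = 1.0                 # Supp(B) = {0, 1, 2}
Y = X @ B + E                  # model (1)
```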
It consists in measuring the distance (in mm) along the cortical surface between the true simulated source and the location with maximum amplitude in the estimate. By contrast, spatial dispersion (SD) measures how much the activity is spread out by the inverse method (Molins et al., 2008).

To quantify the control of statistical errors, we consider a generalization of the Family-Wise Error Rate (FWER) (Hochberg and Tamhane, 1987): the δ-FWER. As illustrated in Figure 5 in appendix, it is the FWER taken with respect to a ground truth dilated spatially by an amount δ, in the present study a distance in mm. A rigorous definition of the δ-FWER is given in Appendix A. The rationale is that detections made outside of the support, but less than δ away from the support, should count as slight inaccuracies of the methods, not as false positives. In an analogous manner, δ-FDR = (1 − δ-precision) has been proposed recently as an extension of the False Discovery Rate (FDR) (Benjamini and Hochberg, 1995) to include a spatial tolerance (Nguyen et al., 2019; Gimenez and Zou, 2019). We thus characterize the selection capabilities of the methods through a δ-precision/recall curve.

The sLORETA and dSPM estimators are derived from the ridge estimator (Hoerl and Kennard, 1970):

    B̂^Ridge = KY  where  K = X⊤(XX⊤ + λ I)^{−1}.    (3)

They are obtained by scaling each row j in B̂^Ridge by an estimate of the noise level at location j. It reads (Lin et al., 2006) B̂^dSPM_{j,t} = B̂^Ridge_{j,t} / σ^dSPM_j and B̂^sLORETA_{j,t} = B̂^Ridge_{j,t} / σ^sLORETA_j, where σ^dSPM_j = sqrt(σ² [KK⊤]_{j,j}) and σ^sLORETA_j = sqrt([K(σ² I + XX⊤)K⊤]_{j,j}). Interestingly, it can be proved that in the absence of noise and when only a single coefficient is non-zero, the sLORETA estimate has its maximum at the correct location (Pascual-Marqui, 2002). Assuming B_{.,t} ~ N(0, I), the covariance of Y reads σ² I + XX⊤.
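The ridge kernel of Equation (3) and the two noise normalizations can be sketched as follows; the toy dimensions and the values of λ and σ² are illustrative assumptions.

```python
import numpy as np

def noise_normalized_estimates(X, Y, lam, sigma2):
    """Ridge kernel (Eq. 3) with dSPM and sLORETA row scalings."""
    n = X.shape[0]
    # K = X^T (X X^T + lam I)^{-1}
    K = X.T @ np.linalg.inv(X @ X.T + lam * np.eye(n))
    B_ridge = K @ Y
    # dSPM scaling: sigma_j = sqrt(sigma^2 [K K^T]_{jj})
    s_dspm = np.sqrt(sigma2 * np.einsum('ij,ij->i', K, K))
    # sLORETA scaling: sigma_j = sqrt([K (sigma^2 I + X X^T) K^T]_{jj})
    C = sigma2 * np.eye(n) + X @ X.T
    s_sloreta = np.sqrt(np.einsum('ij,jk,ik->i', K, C, K))
    return B_ridge / s_dspm[:, None], B_ridge / s_sloreta[:, None]

rng = np.random.default_rng(1)
X = rng.standard_normal((20, 60))   # toy forward model, n=20 sensors
Y = rng.standard_normal((20, 5))    # toy measurements, T=5 time points
B_dspm, B_sloreta = noise_normalized_estimates(X, Y, lam=1.0, sigma2=1.0)
```

Since the sLORETA scaling adds the source term [KXX⊤K⊤]_{j,j} to the dSPM one, the sLORETA map is never larger than the dSPM map in absolute value here.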
Hence, one can consider that sLORETA adds to dSPM an extra term in the sensor covariance matrix that comes from the sources. Note that these methods treat each time instant independently, hence ignoring source and noise temporal autocorrelations.

Let us first recall the definition of the multi-task Lasso (MTLasso) estimator (Obozinski et al., 2010) in our setting. For a tuning parameter λ > 0, it is defined as

    B̂^MTL ∈ argmin_{B ∈ R^{p×T}} { (1/(2n)) ‖Y − XB‖² + λ ‖B‖_{2,1} }.    (4)

In practice, λ is set by cross-validation on a logarithmic grid of fractions of λ_max, where λ_max = ‖X⊤Y‖_{2,∞}. It is well known that, similarly to the Lasso, the MTLasso is biased: it tends to shrink rows with large amplitude towards zero. Below, we provide an adaptation of the desparsified Lasso following the approach by Zhang and Zhang (2014), see also Mitra and Zhang (2016), to ensure statistical control. The approach relies on the introduction of score vectors z_1, ..., z_p in R^n defined by

    z_j = X_{.,j} − X^{(−j)} β̂^{(−j)}_{α_j},    (5)

where, for j ∈ [p], β̂^{(−j)}_{α_j} is the Lasso solution (Tibshirani, 1996; Chen and Donoho, 1994) of the regression of X_{.,j} against X^{(−j)} with regularization parameter α_j. In (Zhang and Zhang, 2014, Table 1), an algorithm for choosing α_j is proposed; we noticed that taking, for all j ∈ [p], α_j = c α_{max,j} := c ‖X^{(−j)⊤} X_{.,j}‖_∞ / n with a small constant c allows a significant computational gain on M/EEG data and yields adequate residuals for C = 1000 (see Sec. 2.6). Note that these score vectors do not depend on Y, and their computation is then equivalent to solving the node-wise Lasso (Meinshausen and Bühlmann, 2006). For such vectors, the noise model in (1) yields

    z_j⊤Y / (z_j⊤X_{.,j}) = B_{j,.} + z_j⊤E / (z_j⊤X_{.,j}) + Σ_{k≠j} (z_j⊤X_{.,k} / (z_j⊤X_{.,j})) B_{k,.}.    (6)

Discarding the noise term and plugging B̂^MTL_{k,.}
as a preliminary estimator of B_{k,.} in (6), we coin the desparsified multi-task Lasso (d-MTLasso), a debiased version of B̂^MTL defined for all j ∈ [p] by

    B̂^{d-MTLasso}_{j,.} = z_j⊤Y / (z_j⊤X_{.,j}) − Σ_{k≠j} (z_j⊤X_{.,k} / (z_j⊤X_{.,j})) B̂^MTL_{k,.}.    (7)

To derive the statistical properties of d-MTLasso, we need the extended Restricted Eigenvalue (RE) property (Lounici et al., 2011, Assumption 3.1), detailed in Appendix B. More precisely, we assume that

(A1) RE(X, s) is verified on X for a sparsity parameter s ≥ |Supp(B)| and a constant κ = κ(s) > 0.

Roughly, A1 can be seen as a combination of sparsity and "moderate" feature correlation assumptions.

Proposition 2.1.
Considering the model in Equation (1), assuming A1, and for a choice of λ large enough in Equation (4), then with high probability:

    √n (B̂^{d-MTLasso} − B) = Λ + Δ,    (8)

with Λ_{j,.} ~ N(0, Ω̂_{j,j} M) for all j ∈ [p], where Ω̂_{j,k} = n z_j⊤z_k / (|z_j⊤X_{.,j}| |z_k⊤X_{.,k}|), and

    ‖Δ‖_{2,∞} = O( s λ √(log p) / κ² ).    (9)

Then, under the j-th null hypothesis H0^{(j)}: "B_{j,.} = 0", and neglecting the term Δ in (8) (see Appendix D.2 for more details), as done by van de Geer et al. (2014), B̂^{d-MTLasso}_{j,.} is Gaussian with zero mean. Finally, using standard results on χ² distributions (see Appendix D.1), we obtain

    n ‖B̂^{d-MTLasso}_{j,.}‖²_{M^{−1}} ~ Ω̂_{j,j} χ²_T.

If M is known, the quantity n ‖B̂^{d-MTLasso}_{j,.}‖²_{M^{−1}} / Ω̂_{j,j} can be used as a decision statistic to obtain a p-value testing the importance of source j, by comparison with the χ²_T distribution. In practice, we need to estimate M by M̂. Notably, assuming that we have an estimator σ̂ of σ that approximately verifies (n − ŝ) σ̂²/σ² ~ χ²_{n−ŝ}, where ŝ = |Supp(B̂^MTL)| (see Sec. 2.5), we take

    f̂_j := n ‖B̂^{d-MTLasso}_{j,.}‖²_{M̂^{−1}} / (T Ω̂_{j,j})    (10)

as a statistic to compare with a Fisher distribution with parameters T and n − ŝ, to compute the p-values. The full d-MTLasso algorithm is given in Algorithm 1. Note that a Python implementation of the procedures presented in this paper is available at https://github.com/ja-che/hidimstat along with some examples.

In Sec. 2.1, the noise is assumed homogeneous across sensors, allowing to obtain a robust estimator. Extending Reid et al. (2016) to multi-task regression, we consider the residuals Ê = Y − X B̂^MTL and the estimated support size ŝ. Defining, for t ∈ [T], σ̂_t² = ‖Ê_{.,t}‖² / (n − ŝ), an estimate of σ is:

    σ̂ = median({σ̂_t, t ∈ [T]}).
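The inference steps above (score vectors of Eq. (5), debiasing of Eq. (7), Fisher statistic of Eq. (10)) can be sketched in a few lines; the fixed node-wise penalty `alpha` replaces the paper's data-driven choice and is purely illustrative.

```python
import numpy as np
from scipy import stats
from sklearn.linear_model import Lasso

def score_vectors(X, alpha=0.1):
    # Node-wise Lasso residuals (Eq. 5); alpha is a fixed illustrative
    # penalty (the paper scales it from alpha_max,j instead)
    n, p = X.shape
    Z = np.empty((n, p))
    for j in range(p):
        X_mj = np.delete(X, j, axis=1)
        beta = Lasso(alpha=alpha, fit_intercept=False,
                     max_iter=5000).fit(X_mj, X[:, j]).coef_
        Z[:, j] = X[:, j] - X_mj @ beta
    return Z

def desparsified_mtlasso(X, Y, B_mtl, Z):
    # Debiased rows (Eq. 7): correct the MTLasso shrinkage row by row
    p, T = B_mtl.shape
    B_deb = np.empty((p, T))
    for j in range(p):
        zj = Z[:, j]
        denom = zj @ X[:, j]
        corr = sum((zj @ X[:, k]) * B_mtl[k] for k in range(p) if k != j)
        B_deb[j] = (zj @ Y - corr) / denom
    return B_deb

def fisher_pvalues(B_deb, Z, X, M_hat, s_hat):
    # Statistic (10): f_j = n ||B_deb_j||^2_{M^-1} / (T Omega_jj),
    # compared with an F(T, n - s_hat) distribution
    n, p = X.shape
    T = B_deb.shape[1]
    M_inv = np.linalg.inv(M_hat)
    pvals = np.empty(p)
    for j in range(p):
        zj = Z[:, j]
        omega_jj = n * (zj @ zj) / (zj @ X[:, j]) ** 2
        f_j = n * (B_deb[j] @ M_inv @ B_deb[j]) / (T * omega_jj)
        pvals[j] = stats.f.sf(f_j, T, n - s_hat)
    return pvals
```

In the noise-free case with the oracle B̂^MTL = B, Equation (7) recovers B exactly, which gives a convenient sanity check of the implementation.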
Taking the median instead of the mean avoids depending on prospective under-fitted time steps and turns out to be more robust empirically. Similarly, defining for all t ∈ [T−1], ρ̂_t = cor_n(Ê_{.,t}, Ê_{.,t+1}) (where cor_n(., .) is the empirical correlation), ρ is estimated by taking ρ̂ = median({ρ̂_t, t ∈ [T−1]}). Then, an estimator M̂ of M is given by M̂_{t,u} = σ̂² ρ̂^{|t−u|}. See the proof of (Lounici et al., 2011, Theorem 3.1).

Algorithm 1: d-MTLasso
input: X ∈ R^{n×p}, Y ∈ R^{n×T}
    B̂^MTL ← MTL(X, Y)                      // cross-validated multi-task Lasso
    Ê ← Y − X B̂^MTL                         // residuals
    ŝ ← |Supp(B̂^MTL)|
    for t ∈ [T] do                           // noise level estimation
        σ̂_t² ← ‖Ê_{.,t}‖² / (n − ŝ)
    σ̂ ← median({σ̂_t, t ∈ [T]})
    Get M̂ thanks to Sec. 2.5
    for j ∈ [p] do
        z_j ← Lasso(X^{(−j)}, X_{.,j})       // cross-validated Lasso
        Ω̂_{j,j} ← n z_j⊤z_j / |z_j⊤X_{.,j}|²
        B̂^{d-MTLasso}_{j,.} ← z_j⊤Y/(z_j⊤X_{.,j}) − Σ_{k≠j} (z_j⊤X_{.,k}/(z_j⊤X_{.,j})) B̂^MTL_{k,.}    // desparsified multi-task Lasso
        f̂_j ← n ‖B̂^{d-MTLasso}_{j,.}‖²_{M̂^{−1}} / (T Ω̂_{j,j})    // inference statistic
    return f̂_1, ..., f̂_p

In the high-dimensional inference scenario considered, the number of sensors is more than one order of magnitude smaller than the number of sources, n ≪ p. Therefore, estimators of conditional association between sources and observations struggle to identify the solution. The setting is even more difficult due to the presence of high correlations between sources (see Figure 6 in appendix). Further gains can however come from a compression of the design matrix (Bühlmann et al., 2013; Mandozzi and Bühlmann, 2016). For this, we introduce a clustering step that reduces data dimensionality while leveraging spatial structure. We consider a spatially-constrained hierarchical clustering algorithm described by Varoquaux et al. (2012) that uses the Ward criterion.
Other clustering schemes might be considered, as long as they yield spatially contiguous regions of the cortical surface. The combination of this clustering algorithm with the d-Lasso or d-MTLasso algorithms will be respectively referred to as clustered desparsified Lasso (cd-Lasso) and clustered desparsified multi-task Lasso (cd-MTLasso). The number of clusters is denoted by C and, for r ∈ [C], we denote by G_r the r-th group. Every cluster representative variable is given by the average of the covariates it contains. Then, reordering conveniently the columns of X, the compressed design matrix Z ∈ R^{n×C} is given by:

    Z = XA,    (11)

where A ∈ R^{p×C} is the block matrix whose r-th column has entries 1/|G_r| on the rows indexed by G_r and zeros elsewhere. We say that the compression of X is of good quality if:

(A2) there exists Γ ∈ R^{C×T} such that Γ_{r,.} = Σ_{j∈G_r} w_j B_{j,.} with w_j ≥ 0 for all j ∈ [p], and the associated compression loss XB − ZΓ is "small enough" with respect to the model noise (see Appendix D.3 for more details);

(A3) RE(Z, s′) is verified on Z for a sparsity parameter s′ ≥ |Supp(Γ)| and a constant κ′ = κ′(s′) > 0.

Proposition 2.2.
Assume Equation (1), A2, A3, a choice of regularization parameter large enough in the MTLasso regression of Y against Z, and that the largest cluster of the compression is of diameter at most δ; then cd-MTLasso controls the δ-FWER.

Figure 1: ecd-MTL overview diagram.
While cd-MTLasso applies d-MTLasso to clustered data, ecd-MTLasso aggregates several cd-MTLasso solutions.
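A minimal sketch of the compression Z = XA of Equation (11); here the Ward clustering is spatially constrained only if an adjacency matrix is passed as `connectivity`, a simplification of the algorithm of Varoquaux et al. (2012).

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

def compress_design(X, n_clusters, connectivity=None):
    # Cluster the p columns of X (Ward linkage; pass a spatial adjacency
    # matrix as `connectivity` to make the clustering spatially constrained),
    # then average the covariates within each cluster: Z = X A (Eq. 11).
    labels = AgglomerativeClustering(
        n_clusters=n_clusters, linkage="ward", connectivity=connectivity
    ).fit_predict(X.T)
    p = X.shape[1]
    A = np.zeros((p, n_clusters))
    for r in range(n_clusters):
        members = np.flatnonzero(labels == r)
        A[members, r] = 1.0 / len(members)   # column r averages cluster G_r
    return X @ A, A, labels
```

Each column of Z is then exactly the mean of the covariates of one cluster, matching the cluster-representative definition given above.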
To reduce the sensitivity of cd-MTLasso to small data perturbations, we propose to randomize over the clustering. We build several clustering solutions, considering B = 100 different random subsamples of the full sample; then we aggregate the p-value maps output by cd-MTLasso. To aggregate the B cd-MTLasso solutions, we use the adaptive quantile aggregation proposed by Meinshausen et al. (2009), detailed in Appendix C. The full procedure of ensembling B cd-MTLasso (resp. cd-Lasso) solutions is called ecd-MTLasso, for ensemble of clustered desparsified multi-task Lasso (resp. ecd-Lasso). The algorithm of ecd-MTLasso is given in Algorithm 2 in appendix. We also give an overview diagram in Figure 1 to clarify the nesting structure of the proposed solutions.

Proposition 2.3.
Assume that for each of the B compressions the hypotheses of Prop. 2.2 are verified; then ecd-MTLasso controls the δ-FWER.

This result is conservative, and mixing several cd-MTLasso solutions usually reduces the spatial tolerance δ. Additional details on the procedure and its computational complexity are deferred to Appendix E.

In this section, we give empirical evidence of the advantages of ecd-MTLasso for source localization. First, in a typical point-source simulation, we compare the methods with respect to the standard PLE metric; notably, we study the effect of i) clustering and ii) integrating the time dimension. In a second simulation with more realistic features, we examine the δ-FWER control property and compare the support recovery properties of all methods. Lastly, working on real MEG data, we show that, contrary to sLORETA, ecd-MTLasso retrieves expected patterns using a universal threshold.

Here, we study how the proposed estimators perform compared to standard ℓ2-regularized approaches, and assess whether time-aware statistical analysis improves upon the static d-Lasso, as it is essential for M/EEG source imaging. We use the head anatomy and the recording setup from the sample dataset publicly available from the MNE software (Gramfort et al., 2014). The design matrix X is computed with a three-shell boundary element model with p = 7498 candidate cortical locations, and a 306-channel Elekta Neuromag Vectorview system with 102 magnetometers and 204 gradiometers. We only keep the gradiometers and remove one defective sensor, leading to n = 203. When considering multiple consecutive time instants to demonstrate the ability of the solver to leverage spatio-temporal data, the source is fixed and the temporal noise autocorrelation ρ is set to a fixed nonzero value.

Figure 2 reports the normalized histograms of PLE for the 7498 locations for the different methods investigated; results on spatial dispersion (SD) are available in Figure 7 in appendix.
While it might seem simplistic to consider a single source, this experiment allows us to demonstrate that d-Lasso improves over sLORETA in the presence of noise (see Figure 2, left). In the same figure, one can observe that clustering degrades this performance, as it carries an intrinsic spatial blur. However, even in this adversarial scenario (Dirac-like source location), cd-Lasso and ecd-Lasso remain competitive w.r.t. sLORETA, avoiding extreme PLE values. Note that, here, a single time point was used (T = 1). A typical choice is C = 1000 clusters for M/EEG data. Note also that |Supp(Γ)| ≤ |Supp(B)| and that Z is generally better conditioned than X, making A3 more plausible than A1.
Figure 2: Peak Localization Error (PLE) histograms. (left): PLE on a fixed time point (T = 1); sLORETA is outperformed by the desparsified Lasso, while cd-Lasso and ecd-Lasso are more concentrated and exhibit a smaller number of very low PLE values but also a smaller number of extreme PLE values. (right): PLE for the desparsified multi-task Lasso (d-MTLasso) with T = 6 compared to d-Lasso (T = 1); more time points improve the results by reducing the PLE.

The right panel in Figure 2 shows that d-MTLasso (T = 6) significantly outperforms d-Lasso (T = 1) in terms of PLE. Leveraging spatio-temporal data indeed increases the signal-to-noise ratio, which enhances spatial specificity. Effects in terms of SD are minor (see appendix, Figure 7).
We now investigate whether the different versions of d-MTLasso control the δ-FWER on a realistic simulation, and compare their support recovery properties. The data are the same as in Sec. 3.1. To simulate the sources, we randomly draw active regions by selecting parcels from a subdivided cortical FreeSurfer parcellation with 448 parcels (Khan et al., 2018). For each selected parcel, we take as sources all the dipoles within a given geodesic distance (in mm) from the center of the parcel, fixing the amplitude at 10 nAm. To evaluate how the methods control the δ-FWER, we perform repeated simulations and count how often active sources are found outside the δ-dilated ground truth. In the left panel of Figure 3, we see that d-MTLasso does not control the δ-FWER, due to the violation of some hypotheses of Prop. 2.1, in particular those regarding source correlation. However, we notice that handling noise autocorrelation reduces the empirical δ-FWER. Using clustering, the assumptions of Prop. 2.2 are more easily met; in particular, the conditioning of the problem is improved (Mattout et al., 2005). Yet cd-MTLasso does not control the δ-FWER for δ = 40 mm, because control requires the largest cluster diameter to be smaller than δ, which may not hold. Finally, randomization via ecd-MTLasso further improves FWER control. Empirically, we observe that the δ-FWER is controlled for δ around twice the average cluster diameter. Then, with the limitation of keeping the compressed design matrix well conditioned (C not too large), we can reduce the tolerance δ by increasing C (empirical support for this claim is given in appendix, Figure 9). We have excluded sLORETA from this study since it does not provide guarantees on the false discoveries. The right panel of Figure 3 shows the δ-precision/recall curve of the different methods.
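The empirical δ-FWER evaluation described above (counting simulation runs with at least one detection outside the δ-dilated ground truth) can be sketched as follows; the distance-matrix interface is an assumption of this sketch.

```python
import numpy as np

def empirical_delta_fwer(detected, support, distances, delta):
    # Fraction of runs with at least one detection farther than delta (mm)
    # from every true source, i.e. outside the delta-dilated support.
    # `detected` / `support` are lists (one entry per run) of source index
    # arrays (support assumed non-empty); `distances` is the p x p matrix
    # of pairwise geodesic distances.
    errors = 0
    for det, sup in zip(detected, support):
        dist_to_support = distances[np.ix_(det, sup)].min(axis=1)
        errors += bool(np.any(dist_to_support > delta))
    return errors / len(detected)
```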
We first notice that d-MTLasso cannot compete with sLORETA, because the high dimensionality of the problem makes the computation of the source importance overly ill-posed. cd-MTLasso improves detection accuracy, but still does not perform as well as sLORETA. However, adding the ensembling step, the δ-precision improves strongly, making ecd-MTLasso much better than sLORETA. In Figure 8 in appendix, we obtain similar results when considering the standard precision-recall curve.

We now report results on three MEG datasets spanning three types of sensory stimuli: auditory, visual and somatosensory. Additional results on EEG datasets are presented in Appendix H. The auditory evoked fields (AEF) and visual evoked fields (VEF) are obtained using stimuli in the left ear and left visual hemifield. The somatosensory evoked fields (SEF) are obtained following electrical stimulation of the left median nerve at the wrist. A detailed description of the data is provided in Appendix F. Experimental results are presented in Figure 4 and Figure 10 (cf. Appendix H). Among the many methods for M/EEG source imaging present in the literature, the methods compared here have in common that they output a statistical map.

Figure 3: δ-FWER and δ-Precision-Recall. (left): δ-FWER control of the different d-MTLasso methods. δ-FWER control is hard for d-MTLasso and cd-MTLasso, as some detections are made far from the true sources, due to remote correlations. Ensembles of clusters allow to limit these false detections. (right): δ-Precision-Recall curves: sLORETA outperforms d-MTLasso AR0 and AR1, because the problem is too high-dimensional for d-MTLasso to work properly. Clustering improves the outcome, and ensembling brings further benefits: ecd-MTLasso outperforms sLORETA.

The ℓ2-regularized sLORETA method is compared to the
debiased sparse estimators presented and evaluated above. The input for all solvers is a time window of data: from t = 50 to t = 100 ms for AEF and VEF, and from t = 30 to t = 40 ms for SEF. During such time intervals, one can expect the sources to originate primarily from the early sensory cortices, whose locations are anatomically known for normal subjects.

First, one can observe that all methods manage to highlight the proper functional sensory units (planum temporale for AEF, calcarine region for VEF and central sulcus for SEF). Considering the sLORETA results, one can observe that at a common threshold of 3.0 on the Student statistic, the estimator is quite spatially specific for VEF, but is overly conservative for AEF and clearly leads to many false positives for SEF. By inspection of the d-MTLasso solution, one can observe that taking into account the autocorrelation of the noise leads to a better calibrated noise variance, and therefore fewer dubious detections. Considering the ecd-MTLasso results, while all maps are also thresholded at a single level, one can see that it retrieves expected patterns without making dubious discoveries.

In Sec. 3.1, we have shown that taking into account the time dimension improves the results in terms of PLE. We have also seen that even in this adversarial point-source scenario (cf. Sec. 3.1), clustered methods remain competitive. In Sec. 3.2, while no control of false discoveries is provided by sLORETA, ecd-MTL is the only method that offers statistical control in practice. Namely, it controls the δ-FWER for δ equal to twice the average cluster diameter. Additionally, in this realistic simulation, ecd-MTL exhibits the best support recovery properties. In Sec.
3.3, working on real MEG data, we show that, contrary to sLORETA, ecd-MTLasso produces calibrated statistics with a universal threshold and retrieves expected patterns without making dubious discoveries. Overall, ecd-MTL offers statistical guarantees and is our privileged method.

Guidelines for statistical inference with ecd-MTLasso on temporal M/EEG data.
First, we give guidelines concerning the number of clusters C. Hoyos-Idrobo et al. (2015) show that clustering improves problem conditioning, meaning that the Restricted Eigenvalue (RE) property (see assumptions A1 and A3) is more likely to be verified. Complementarily, we argue that, keeping C above a hundred (limiting compression loss), the fewer the clusters, the more likely A3 is to be verified, so that Prop. 2.2 and Prop. 2.3 hold, but also the better the sensitivity of ecd-MTL. However, a small C also requires a higher spatial tolerance. We thus hit a fundamental trade-off for statistical inference between sensitivity and spatial specificity. Then, C can be chosen depending on the problem setting: if it is difficult (noisy), it seems natural to lower spatial tolerance expectations (diminish C); in that sense, ecd-MTL is an adaptive method (cf. Figure 9). For the present use case, taking C = 1000 seems an adequate trade-off to ensure δ-FWER control with reasonable spatial tolerance.

Now, we give recommendations for time sampling and window size. Choosing too short a window complicates AR model estimation due to the lack of data, while choosing too large a window may lead to a non-stationary support. We recommend windows of a few tens of ms with a time sampling of a few ms, as keeping T small reduces computation time and should not decrease sensitivity significantly.

Figure 4: Empirical comparison on 3 MEG datasets.
From left to right, one can see sLORETA, d-MTLasso without AR modeling (assuming non-autocorrelated noise), d-MTLasso with an AR1 noise model, and ecd-MTLasso also using an AR1 model. Results correspond to auditory (top), visual (middle) and somatosensory (bottom) evoked fields. Colormaps are fixed across datasets and adjusted based on meaningful statistical thresholds in order to qualitatively illustrate FWER control issues.

Finally, when working with M/EEG data, we recommend using only a fraction of the full data to compute several clustering solutions with spatial constraints and the Ward criterion, to ensure enough diversity.
Limitations.
The main limitation is the fact that mixing different types of sensors violates modeling assumptions both on temporal correlations and on spatial correlations, which is why we had to treat MEG and EEG sensors separately. A possibility to handle heterogeneous sensors is to follow Massias et al. (2018b), but for the temporal part, further developments are required and left for future work. Also left for future work is the possibility of studying larger time windows; a simple solution is to slide a shorter window over the considered period of time. Finally, a more common limitation is the fact that the assumptions are hard to test in practice.

The MEG source imaging problem poses a hard statistical inference challenge: namely that of high-dimensional statistical analysis, furthermore with high correlations in the design. We have proposed an estimator that correctly calibrates the effect sizes and variance, up to a number of hypotheses that are not easily met: some level of sparsity, mild correlation across sensors, and homoscedasticity of the noise. Up to these hypotheses, and up to a spatial tolerance on the exact location of the sources, we provide the first method with statistical guarantees for source imaging. This is made possible by bringing several improvements to the original desparsified Lasso solution: a multi-task formulation that increases power by basing inference on multiple time steps, a clustering step that renders the design less ill-posed, and an ensembling step that mitigates the (hard) choice of clusters. Finally, our privileged method, ecd-MTLasso, runs in less than 10 min on a real dataset on non-specialized hardware, making it usable by practitioners.
Statement of broader impact
Magnetoencephalography (MEG) and electroencephalography (EEG) offer a unique opportunity to image brain activity non-invasively with a temporal resolution on the order of milliseconds. This is relevant for cognitive neuroscience, to describe the sequence of active areas during certain cognitive tasks, but also for clinical neuroscience, where electrophysiology is used for diagnosis (e.g., sleep medicine, epilepsy presurgical mapping). Yet, doing brain imaging with M/EEG requires solving a challenging high-dimensional inverse problem for which statistical guarantees are crucially important. In this work, we address this statistical challenge when using sparsity-promoting regularization and when considering the specificity of M/EEG signals: data are spatio-temporal and the noise is temporally autocorrelated. The proposed algorithm builds on very recent work in optimization to speed up Lasso-type solvers, as well as on work in mathematical statistics on desparsified Lasso estimators. We believe that this work, whose contribution is both on the modeling side and on the inference side, brings sparse estimators closer to wide adoption in the neuroscience community. We would also like to emphasize that the inference framework can be adapted to many other high-dimensional problems where data structure can be leveraged: biomedical data and physical observations (cardiac or brain monitoring, genomics, seismology, etc.), especially those that involve severely ill-posed inverse problems.
Acknowledgements
This research is supported by funding from the French ANR project FastBig (ANR-17-CE23-0011), the KARAIB AI chair (ANR-20-CHIA-0025-01), the European Research Council Starting Grant SLAB (ERC-StG-676943) and Labex DigiCosme (ANR-11-LABEX-0045-DIGICOSME).
References
S. Baillet, J. C. Mosher, and R. M. Leahy. Electromagnetic brain mapping. IEEE Signal Proc. Mag., 18(6):14–30, Nov. 2001.
R. F. Barber and E. J. Candès. A knockoff filter for high-dimensional selective inference. Ann. Statist., 47(5):2504–2537, 2019.
Y. Benjamini and Y. Hochberg. Controlling the false discovery rate: a practical and powerful approach to multiple testing. J. R. Stat. Soc. Ser. B Stat. Methodol., 57(1):289–300, 1995.
P. Bühlmann. Statistical significance in high-dimensional linear models. Bernoulli, 19(4):1212–1242, 2013.
P. Bühlmann and S. van de Geer. Statistics for high-dimensional data: methods, theory and applications. Springer Science & Business Media, 2011.
P. Bühlmann, P. Rütimann, S. van de Geer, and C.-H. Zhang. Correlated variables in regression: clustering and sparse estimation. Journal of Statistical Planning and Inference, 143(11):1835–1858, Nov. 2013.
S. Chen and D. L. Donoho. Basis pursuit. In Proceedings of 1994 28th Asilomar Conference on Signals, Systems and Computers, volume 1, pages 41–44. IEEE, 1994.
J.-A. Chevalier, J. Salmon, and B. Thirion. Statistical inference with ensemble of clustered desparsified lasso. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 638–646, 2018.
A. M. Dale, A. K. Liu, B. R. Fischl, R. L. Buckner, J. W. Belliveau, J. D. Lewine, and E. Halgren. Dynamic statistical parametric mapping. Neuron, 26(1):55–67, 2000.
R. Dezeure, P. Bühlmann, L. Meier, and N. Meinshausen. High-dimensional inference: confidence intervals, p-values and R-software hdi. Statist. Sci., 30(4):533–558, 2015.
O. J. Dunn. Multiple comparisons among means. J. Amer. Statist. Assoc., 56(293):52–64, 1961.
J. R. Gimenez and J. Zou. Discovering conditionally salient features with statistical guarantees. In ICML, pages 2290–2298, 2019.
A. Gramfort, M. Kowalski, and M. Hämäläinen. Mixed-norm estimates for the M/EEG inverse problem using accelerated gradient methods. Phys. Med. Biol., 57(7):1937–1961, 2012.
A. Gramfort, M. Luessi, E. Larson, D. A. Engemann, D. Strohmeier, C. Brodbeck, L. Parkkonen, and M. S. Hämäläinen. MNE software for processing MEG and EEG data. NeuroImage, 86:446–460, 2014.
M. S. Hämäläinen and R. J. Ilmoniemi. Interpreting magnetic fields of the brain: minimum norm estimates. Medical & Biological Engineering & Computing, 32(1):35–42, Jan. 1994.
S. Haufe, V. V. Nikulin, A. Ziehe, K.-R. Müller, and G. Nolte. Estimating vector fields using sparse basis field expansions. In D. Koller, D. Schuurmans, Y. Bengio, and L. Bottou, editors, NeurIPS, pages 617–624. Curran Associates, Inc., 2009.
O. Hauk, D. G. Wakeman, and R. Henson. Comparison of noise-normalized minimum norm estimates for MEG analysis using multiple resolution metrics. NeuroImage, 54(3):1966–1974, 2011.
Y. Hochberg and A. C. Tamhane. Multiple comparison procedures. Wiley Series in Probability and Statistics. John Wiley & Sons, Inc., 1987.
A. E. Hoerl and R. W. Kennard. Ridge regression: biased estimation for nonorthogonal problems. Technometrics, 12(1):55–67, 1970.
A. Hoyos-Idrobo, Y. Schwartz, G. Varoquaux, and B. Thirion. Improving sparse recovery on structured images with bagged clustering. In , pages 73–76. IEEE, 2015.
L. Janson and W. Su. Familywise error rate control via knockoffs. Electron. J. Stat., 10(1):960–975, 2016.
A. Javanmard and A. Montanari. Confidence intervals and hypothesis testing for high-dimensional regression. J. Mach. Learn. Res., 15:2869–2909, 2014.
S. Khan, J. A. Hashmi, F. Mamashli, K. Michmizos, M. G. Kitzbichler, H. Bharadwaj, Y. Bekhti, S. Ganesan, K.-L. A. Garel, S. Whitfield-Gabrieli, R. L. Gollub, J. Kong, L. M. Vaina, K. D. Rana, S. M. Stufflebeam, M. S. Hämäläinen, and T. Kenet. Maturation trajectories of cortical resting-state networks depend on the mediating frequency band. NeuroImage, 174:57–68, 2018.
F. H. Lin, T. Witzel, S. P. Ahlfors, S. M. Stufflebeam, J. W. Belliveau, and M. S. Hämäläinen. Assessing and improving the spatial accuracy in MEG source localization by depth-weighted minimum-norm estimates. NeuroImage, 31(1):160–171, 2006.
K. Lounici, M. Pontil, S. van de Geer, and A. B. Tsybakov. Oracle inequalities and optimal inference under group sparsity. Ann. Statist., 39(4):2164–2204, 2011.
F. Lucka, S. Pursiainen, M. Burger, and C. Wolters. Hierarchical Bayesian inference for the EEG inverse problem using realistic FE head models: depth localization and source separation for focal primary currents. NeuroImage, 61(4):1364–1382, Apr. 2012.
J. Mandozzi and P. Bühlmann. Hierarchical testing in the high-dimensional setting with correlated variables. J. Amer. Statist. Assoc., 111(513):331–343, 2016.
M. Massias, A. Gramfort, and J. Salmon. Celer: a fast solver for the Lasso with dual extrapolation. In ICML, volume 80, pages 3315–3324, 2018a.
M. Massias, O. Fercoq, A. Gramfort, and J. Salmon. Generalized concomitant multi-task lasso for sparse multimodal regression. In International Conference on Artificial Intelligence and Statistics, pages 998–1007, 2018b.
M. Massias, S. Vaiter, A. Gramfort, and J. Salmon. Dual extrapolation for sparse generalized linear models. arXiv preprint arXiv:1907.05830, 2019.
K. Matsuura and Y. Okabe. Selective minimum-norm solution of the biomagnetic inverse problem. IEEE Trans. Biomed. Eng., 42(6):608–615, June 1995.
J. Mattout, M. Pélégrini-Issac, L. Garnero, and H. Benali. Multivariate source prelocalization (MSP): use of functionally informed basis functions for better conditioning the MEG inverse problem. NeuroImage, 26(2):356–373, 2005.
N. Meinshausen and P. Bühlmann. High-dimensional graphs and variable selection with the lasso. Ann. Statist., 34(3):1436–1462, 2006.
N. Meinshausen and P. Bühlmann. Stability selection. J. R. Stat. Soc. Ser. B Stat. Methodol., 72:417–473, 2010.
N. Meinshausen, L. Meier, and P. Bühlmann. P-values for high-dimensional regression. J. Amer. Statist. Assoc., 104(488):1671–1681, 2009.
R. Mitra and C.-H. Zhang. The benefit of group sparsity in group inference with de-biased scaled group lasso. Electron. J. Stat., 10(2):1829–1873, 2016.
A. Molins, S. M. Stufflebeam, E. N. Brown, and M. S. Hämäläinen. Quantification of the benefit from integrating MEG and EEG data in minimum L2-norm estimation. NeuroImage, 42(3):1069–1077, 2008.
T.-B. Nguyen, J.-A. Chevalier, and B. Thirion. Ecko: ensemble of clustered knockoffs for robust multivariate inference on fMRI data. In International Conference on Information Processing in Medical Imaging, pages 454–466, 2019.
G. Obozinski, B. Taskar, and M. I. Jordan. Joint covariate selection and joint subspace selection for multiple classification problems. Statistics and Computing, 20(2):231–252, 2010.
W. Ou, M. S. Hämäläinen, and P. Golland. A distributed spatio-temporal EEG/MEG inverse solver. NeuroImage, 44(3):932–946, Feb. 2009.
R. Pascual-Marqui. Standardized low resolution brain electromagnetic tomography (sLORETA): technical details. Methods Find. Exp. Clin. Pharmacology, 24(D):5–12, 2002.
S. Reid, R. Tibshirani, and J. Friedman. A study of error variance estimation in lasso regression. Stat. Sin., 26(1):35–67, 2016.
B. Stucky and S. van de Geer. Asymptotic confidence regions for high-dimensional structured sparsity. IEEE Trans. Signal Process., 66(8):2178–2190, 2018.
S. Taulu. Spatiotemporal signal space separation method for rejecting nearby interference in MEG measurements. Physics in Medicine and Biology, 51(7):1759–1769, 2006.
J. Taylor and R. J. Tibshirani. Statistical learning and selective inference. Proceedings of the National Academy of Sciences, 112(25):7629–7634, 2015.
R. Tibshirani. Regression shrinkage and selection via the lasso. J. R. Stat. Soc. Ser. B Stat. Methodol., 58(1):267–288, 1996.
S. van de Geer, P. Bühlmann, Y. Ritov, and R. Dezeure. On asymptotically optimal confidence regions and tests for high-dimensional models. Ann. Statist., 42(3):1166–1202, 2014.
G. Varoquaux, A. Gramfort, and B. Thirion. Small-sample brain mapping: sparse recovery on spatially correlated designs with randomization and clustering. In ICML, pages 1375–1382, 2012.
L. Wasserman and K. Roeder. High-dimensional variable selection. Ann. Statist., 37(5A):2178–2201, 2009.
D. Wipf and S. Nagarajan. A unified Bayesian framework for MEG/EEG source imaging. NeuroImage, 44(3):947–966, Feb. 2009.
C.-H. Zhang and S. S. Zhang. Confidence intervals for low dimensional parameters in high dimensional linear models. J. R. Stat. Soc. Ser. B Stat. Methodol., 76(1):217–242, 2014.
Appendix
A Formal definition of δ-FWER control

Now we give a more formal definition of the δ-FWER.

Definition A.1 (δ-family wise error rate). Given a family of (corrected) p-values $\hat{p} = (\hat{p}_j)_{j \in [p]}$ and a threshold $x \in (0, 1)$, the δ-FWER, also denoted $\mathrm{FWER}^{\delta}_{x}(\hat{p})$, is the probability of making at least one false discovery at a distance at least δ from the true support:

$$\mathrm{FWER}^{\delta}_{x}(\hat{p}) = \mathbb{P}\Big(\min_{j \in N^{\delta}} \hat{p}_j \le x\Big), \qquad (12)$$

with $N^{\delta} = \{ j \in [p] : \forall k \in \mathrm{Supp}(B),\ d(j, k) \ge \delta \}$, where $d(j, k)$ is the distance between sources $j$ and $k$.

Definition A.2 (δ-FWER control). We say that the family of (corrected) p-values $\hat{p} = (\hat{p}_j)_{j \in [p]}$ controls the δ-FWER if, for all $x \in (0, 1)$:

$$\mathrm{FWER}^{\delta}_{x}(\hat{p}) \le x. \qquad (13)$$

B Extended Restricted Eigenvalue assumption
Here, we rewrite (Lounici et al., 2011, Assumption 3.1), adjusting it to the multi-task Lasso case (a particular case of the more general group Lasso). Notice that for a given value of $T$, the assumption is equivalent to (Lounici et al., 2011, Assumption 4.1). Let $1 \le s \le p$ be an integer giving an upper bound on the sparsity $|\mathrm{Supp}(B)|$. The extended Restricted Eigenvalue assumption $\mathrm{RE}(X, s)$ is verified on $X$ for sparsity parameter $s$ and constant $\kappa = \kappa(s) > 0$ if:

$$\min \left\{ \frac{\|X\Theta\|}{\sqrt{nT}\,\|\Theta_J\|} : |J| \le s,\ \Theta \in \mathbb{R}^{p \times T} \setminus \{0\},\ \|\Theta_{J^C}\|_{2,1} \le 3\,\|\Theta_J\|_{2,1} \right\} \ge \kappa, \qquad (14)$$

where $J \subset [p]$, $J^C = [p] \setminus J$ denotes its complement, and $\Theta_J$ refers to the matrix $\Theta$ without the rows indexed by $J^C$.

C Adaptive quantile aggregation of p-values and ecd-MTLasso algorithm

In this section, we provide more details on how we aggregate the p-value maps created through the clustering randomization, and then give the full ecd-MTLasso algorithm.

For the $j$-th feature (or source), we have a vector $(p_j^{(b)})_{b \in [B]}$ of p-values, one computed for each of the $B$ clusterings. The final p-value of the $j$-th feature is then given by the adaptive quantile aggregation proposed by Meinshausen et al. (2009):

$$p_j = \min \left\{ (1 - \log(\gamma_{\min}))\ \inf_{\gamma \in (\gamma_{\min}, 1)} \left( \gamma\text{-quantile}\left\{ \frac{p_j^{(b)}}{\gamma} ;\ b \in [B] \right\} \right),\ 1 \right\},$$

where we have taken $\gamma_{\min} = 0.$ in our experiments. Taking a value of $\gamma_{\min}$ that is not too small (e.g., $\gamma_{\min} \ge 0.$) allows recovering sources that have received small p-values several times (e.g., at least for $B/$ different choices of clustering).

We give the full ecd-MTLasso procedure in Algorithm 2.

D Proofs
D.1 Probability lemma

Lemma D.1.
Let $\varepsilon \in \mathbb{R}^T$ be a centered Gaussian random vector with (symmetric positive definite) covariance $M \in \mathbb{R}^{T \times T}$. Then, the random variable $\varepsilon^\top M^{-1} \varepsilon$ follows a $\chi^2_T$ distribution.

Algorithm 2: ecd-MTLasso
input: X ∈ R^{n×p}, Y
param: C = 1000, B = 100
for b = 1, ..., B do
    X^(b) = sample(X)
    A^(b) = Ward(C, X^(b))
    Z^(b) = X A^(b)
    q^(b) = d-MTLasso(Z^(b), Y) · C          // corrected cluster-wise p-values at bootstrap b
    for j = 1, ..., p do
        p_j^(b) = q_r^(b) if j ∈ G_r         // corrected feature-wise p-values at bootstrap b
for j = 1, ..., p do
    p_j = aggregation((p_j^(b))_{b ∈ [B]})   // aggregated corrected feature-wise p-values
return p_j for j ∈ [p]

Proof.
Note first that since $M$ is symmetric positive definite, its square root $N \in \mathbb{R}^{T \times T}$ exists and is a symmetric positive definite matrix satisfying $N^2 = M$. This leads to:

$$\varepsilon^\top M^{-1} \varepsilon = (N^{-1}\varepsilon)^\top (N^{-1}\varepsilon).$$

We have that $N^{-1}\varepsilon$ is a centered Gaussian random vector, and its covariance matrix reads:

$$\mathbb{E}\big[(N^{-1}\varepsilon)(N^{-1}\varepsilon)^\top\big] = N^{-1}\,\mathbb{E}\big[\varepsilon\varepsilon^\top\big]\,N^{-1} = N^{-1} M N^{-1} = N^{-1} N^2 N^{-1} = \mathrm{Id}_T.$$

To conclude, $N^{-1}\varepsilon \in \mathbb{R}^T$ is a centered Gaussian vector with covariance $\mathrm{Id}_T$, hence its squared Euclidean norm $\|N^{-1}\varepsilon\|^2 = (N^{-1}\varepsilon)^\top (N^{-1}\varepsilon)$ follows a $\chi^2_T$ distribution.

D.2 Proof of Prop. 2.1
Now, we give a proof of Prop. 2.1:
Proof.
First, let us fix an index $j \in [p]$. Then, using Equation (7) we have:

$$\sqrt{n}\,\big(\hat{B}^{(\text{d-MTLasso})}_{j,\cdot} - B_{j,\cdot}\big) = \frac{\sqrt{n}\, z_j^\top E}{z_j^\top X_{\cdot,j}} - \sum_{k \ne j} \frac{\sqrt{n}\, z_j^\top X_{\cdot,k}\,\big(\hat{B}^{\mathrm{MTL}}_{k,\cdot} - B_{k,\cdot}\big)}{z_j^\top X_{\cdot,j}} = \Lambda_{j,\cdot} + \Delta_{j,\cdot}, \qquad (15)$$

where $\Lambda_{j,\cdot} = \frac{\sqrt{n}\, z_j^\top E}{z_j^\top X_{\cdot,j}}$ and $\Delta_{j,\cdot} = \sqrt{n} \sum_{k \ne j} P_{j,k}\,\big(B_{k,\cdot} - \hat{B}^{\mathrm{MTL}}_{k,\cdot}\big)$ with $P_{j,k} = \frac{z_j^\top X_{\cdot,k}}{z_j^\top X_{\cdot,j}}$.

Now, we show that $\Lambda_{j,\cdot} \sim \mathcal{N}(0, \hat{\Omega}_{j,j} M)$, or equivalently that $E^\top z_j \sim \mathcal{N}(0, n\,\|z_j\|^2 M)$. It is clear that $E^\top z_j$ is a centered Gaussian vector. Its covariance, denoted $V^{(j)} = \mathbb{E}(E^\top z_j z_j^\top E) \in \mathbb{R}^{T \times T}$, can be computed, for $t, t' \in [T]$, as:

$$V^{(j)}_{t,t'} = \mathbb{E}\big(E_{\cdot,t}^\top z_j\, z_j^\top E_{\cdot,t'}\big) = \mathbb{E}\big(z_j^\top E_{\cdot,t'}\, E_{\cdot,t}^\top z_j\big) \quad (\text{scalar values commute}) \quad = z_j^\top\, \mathbb{E}\big(E_{\cdot,t'} E_{\cdot,t}^\top\big)\, z_j.$$

Then, the noise structure in Equation (2) yields $V^{(j)}_{t,t'} = n\, M_{t,t'}\, z_j^\top z_j = n\,\|z_j\|^2\, M_{t,t'}$.

Now, we show that, with high probability, $\|\Delta\|_{\infty} = O\Big(\frac{s\lambda\sqrt{\log(p)}}{\kappa^2}\Big)$. First, notice that:

$$\|\Delta_{j,\cdot}\|_{\infty} \le \sqrt{n}\, \max_{k \ne j} |P_{j,k}|\ \big\|\hat{B}^{\mathrm{MTL}} - B\big\|_{2,1}.$$

For a convenient choice of the regularization parameters $\alpha$, using Bühlmann and van de Geer (2011, Lemma 2.1) and following the same approach as Dezeure et al. (2015, Appendix A.1), we obtain, with high probability:

$$\sqrt{n}\, \max_{k \ne j} |P_{j,k}| = O\big(\sqrt{\log(p)}\big).$$

Bounds on $\|\hat{B}^{\mathrm{MTL}} - B\|_{2,1}$ are also available in the literature (Lounici et al., 2011) for $\rho = 0$ and can be extended to $\rho > 0$ similarly.
Notably, provided $\rho = 0$, assuming A1 for a sparsity parameter $|\mathrm{Supp}(B^*)| \le s$, a given constant $\kappa = \kappa(s) > 0$, and a choice of $\lambda$ large enough in Equation (4), (Lounici et al., 2011, Theorem 3.1) directly gives the following bound, with high probability:

$$\big\|\hat{B}^{\mathrm{MTL}} - B\big\|_{2,1} = O\Big(\frac{s\lambda}{\kappa^2}\Big).$$

Remark D.1.
Following van de Geer et al. (2014), to neglect $\Delta$ we need $\|\Delta\|_{\infty} = o(1)$. This condition is verified if $s = o\Big(\frac{\kappa^2}{\lambda\sqrt{\log(p)}}\Big)$.

D.3 Proof of Prop. 2.2
Before starting the proof, let us give more precision on assumption A2; the complete assumption is the following:

(A2) there exists $\Gamma \in \mathbb{R}^{C \times T}$ such that $\Gamma_{r,\cdot} = \sum_{j \in G_r} w_j B_{j,\cdot}$ with $w_j \ge 0$ for all $j \in [p]$, so that the associated compression loss $XB - Z\Gamma$ is bounded as follows:

$$\|XB - Z\Gamma\|_{2,2}^2 \le \frac{\xi^2\, T\, \phi(M)}{n} = \frac{\xi^2\, T\, \phi(R)\,\sigma^2}{n}, \qquad (16)$$

where $\xi > 0$ is an arbitrarily small constant, $\phi(M) > 0$ is the smallest eigenvalue of $M$ and $\phi(R) > 0$ is the smallest eigenvalue of $R$, the temporal correlation matrix of the noise defined by $R = M / \sigma^2$. The hypothesis plainly means that the noise induced by the compression of the design matrix is small enough with respect to the model noise.

Now we give a proof of Prop. 2.2:

Proof.
First, we derive the d-MTLasso for the compressed problem, for $r \in [C]$:

$$\hat{\Gamma}^{(\text{d-MTLasso})}_{r,\cdot} = \frac{a_r^\top Y}{a_r^\top Z_{\cdot,r}} - \sum_{l \ne r} \frac{a_r^\top Z_{\cdot,l}\, \hat{\Gamma}^{\mathrm{MTL}}_{l,\cdot}}{a_r^\top Z_{\cdot,r}}, \qquad (17)$$

where the $a_r$'s are the residuals obtained by nodewise Lasso on $Z$, playing the same role as the $z_j$'s in Equation (7). Then, as done in Appendix D.2, we derive:

$$\sqrt{n}\,\big(\hat{\Gamma}^{(\text{d-MTLasso})}_{r,\cdot} - \Gamma_{r,\cdot}\big) = \frac{\sqrt{n}\, a_r^\top E}{a_r^\top Z_{\cdot,r}} - \sum_{l \ne r} \frac{\sqrt{n}\, a_r^\top Z_{\cdot,l}\,\big(\hat{\Gamma}^{\mathrm{MTL}}_{l,\cdot} - \Gamma_{l,\cdot}\big)}{a_r^\top Z_{\cdot,r}} + \frac{\sqrt{n}\, a_r^\top (XB - Z\Gamma)}{a_r^\top Z_{\cdot,r}} = \Lambda'_{r,\cdot} + \Delta'_{r,\cdot} + \Pi_{r,\cdot}. \qquad (18)$$

We treat $\Lambda'$ and $\Delta'$ as in Appendix D.2, assuming that the hypotheses used to bound (hence neglect) $\Delta'$ are verified (notably A3).

Next, for $r \in [C]$, we want to establish that $\frac{n\,\|\Pi_{r,\cdot}\|^2_{M^{-1}}}{T\, \hat{\Omega}'_{r,r}}$ is negligible, i.e., that $\Pi$ has a negligible effect on all decision statistics, where the covariance $\hat{\Omega}'$ has the following generic diagonal term:

$$\hat{\Omega}'_{r,r} = \frac{n\, \|a_r\|^2}{|a_r^\top Z_{\cdot,r}|^2}.$$

Given that

$$\|\Pi_{r,\cdot}\|^2_{M^{-1}} = \frac{n\, \|a_r^\top (XB - Z\Gamma)\|^2_{M^{-1}}}{|a_r^\top Z_{\cdot,r}|^2} \qquad (19)$$

$$\le \frac{n\, \|a_r\|^2}{|a_r^\top Z_{\cdot,r}|^2}\ \frac{\|XB - Z\Gamma\|^2_{2,2}}{\phi(M)}, \qquad (20)$$

where $\|\cdot\|_{2,2}$ denotes the spectral norm, we obtain that

$$\frac{n\,\|\Pi_{r,\cdot}\|^2_{M^{-1}}}{T\, \hat{\Omega}'_{r,r}} \le \frac{n\, \|XB - Z\Gamma\|^2_{2,2}}{T\, \phi(M)} \le \xi^2. \qquad (21)$$

Thus, if A2 is verified for $\xi$ small enough, we can also neglect $\Pi$ in front of $\Lambda'$. Then, neglecting $\Pi$ and $\Delta'$, we have:

$$\sqrt{n}\,\big(\hat{\Gamma}^{(\text{d-MTLasso})}_{r,\cdot} - \Gamma_{r,\cdot}\big) \sim \mathcal{N}\big(0,\ \hat{\Omega}'_{r,r}\, M\big). \qquad (22)$$

Then we can construct p-values that test the $r$-th null hypothesis $H_0^{(r)}$: "$\Gamma_{r,\cdot} = 0$", applying the same technique as in Sec. 2.4.
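As an illustration, a p-value for $H_0^{(r)}$ can be obtained from the $\chi^2_T$ statistic suggested by Lemma D.1 applied to the Gaussian limit in (22). The sketch below is a minimal numerical illustration; the variable names (`gamma_hat_r`, `omega_rr`) are hypothetical stand-ins for the quantities defined above:

```python
import numpy as np
from scipy import stats

def cluster_pvalue(gamma_hat_r, omega_rr, M, n):
    """p-value for H0: Gamma_{r,.} = 0. Under the Gaussian limit,
    sqrt(n) * (Gamma_hat_r - Gamma_r) ~ N(0, omega_rr * M), so by
    Lemma D.1 the statistic
        n * Gamma_hat_r^T (omega_rr * M)^{-1} Gamma_hat_r
    follows a chi^2 distribution with T degrees of freedom under H0."""
    T = len(gamma_hat_r)
    stat = n * gamma_hat_r @ np.linalg.solve(omega_rr * M, gamma_hat_r)
    return stats.chi2.sf(stat, df=T)
```

Since the statistic is $\chi^2_T$-distributed under the null, `stats.chi2.sf` gives a valid p-value; a Bonferroni correction over the $C$ clusters then yields cluster-wise FWER control.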
By correcting these p-values, e.g., using the Bonferroni correction (Dunn, 1961), i.e., multiplying the initial p-values by $C$, we obtain cluster-wise corrected p-values that control the FWER. Since, for all $r \in [C]$, $\Gamma_{r,\cdot}$ is a linear combination of the $B_{j,\cdot}$ for $j \in G_r$, we have $\Gamma_{r,\cdot} \ne 0$ only if there exists $j \in G_r$ such that $B_{j,\cdot} \ne 0$. Then, defining the feature-wise corrected p-values as the corrected p-value of the corresponding cluster, and assuming that clusters are at most of size $\delta$, such corrected p-values control the δ-FWER.

Remark D.2.
In assumption A2, having a positive linear combination is not necessary; a simple linear combination is sufficient. However, we assumed that $\Gamma_{r,\cdot}$ is a positive linear combination of the $B_{j,\cdot}$ for $j \in G_r$ in order to get the following desired property: if additionally, for $r \in [C]$, for all $j \in G_r$ and all $k \in G_r$ we have $\mathrm{sign}(B_{j,\cdot}) = \mathrm{sign}(B_{k,\cdot})$, then $\mathrm{sign}(\Gamma_{r,\cdot}) = \mathrm{sign}(B_{j,\cdot})$ (zero being both positive and negative). This means that if all the feature weights in a cluster have the same sign, there exists a compression verifying A2 such that the cluster weight preserves the sign.

D.4 Proof of Prop. 2.3

Proof.
Assuming the hypotheses of Prop. 2.3 and applying Prop. 2.2, we can, for each of the $B$ compressions of the problem in Equation (1), construct a corrected p-value family that controls the δ-FWER. Applying the quantile aggregation method in Equation (15), we derive a corrected p-value family that accounts for each compression choice. Applying Meinshausen et al. (2009, Theorem 3.2), this aggregated corrected p-value family also controls the δ-FWER.

E Computational aspects
Here we give some elements about the computational aspects of the algorithms we propose. For solving Lasso or multi-task Lasso problems, we rely for additional speed-up on celer (Massias et al., 2018a, 2019), a solver which is much more efficient than standard coordinate descent (speed-up by more than 10x in our experiments).

To compute d-MTLasso, we must solve $p$ Lasso problems of size $(n, p - 1)$ and one multi-task Lasso with cross-validation on a dataset of size $(n, p, T)$. For $n = 200$, $p = 7500$ and $T = 10$, the algorithms can be run on a standard laptop in around hours (using only 1 CPU). However, the algorithm is embarrassingly parallel and requires around minutes if run on a machine with CPUs. To compute cd-MTLasso, we must solve $C$ Lasso problems of size $(n, C - 1)$ and one multi-task Lasso with cross-validation on a dataset of size $(n, C, T)$. For $n = 200$, $C = 1000$ and $T = 10$, it can be run on a standard local device in less than minute (using only 1 CPU). Finally, to compute ecd-MTLasso, we must solve $B$ cd-MTLasso problems. For $B = 100$ ( is already a good value to get most of the advantages of ensembling), $n = 200$, $C = 1000$ and $T = 10$, it can be run on a standard laptop in around hour (using only 1 CPU) and around minute on a machine with CPUs.

Although, when using coordinate-descent-like algorithms, the complexity depends on solver parameters such as the tolerance of the stopping criterion, the complexity in $C$ (or $p$) appears empirically to be cubic, while it is linear in $n$ and $T$. It is also linear in $B$.

F Detailed data description
For AEF and VEF, the data contained one artifactual channel, leading to n = 203, while for SEF the data were preprocessed for removal of environmental noise, leading to an effective number of samples of n = 64 (Taulu, 2006). For the AEF dataset, we report results for AEFs evoked by left auditory stimulation with pure tones of 500 Hz. The analysis window for source estimation was chosen from 50 ms to 200 ms based on visual inspection of the evoked data to capture the dominant N100m component, leading to T = 6. For the SEF dataset, we analyzed SEFs evoked by bipolar electrical stimulation (0.2 ms in duration) of the left median nerve. To capture the main peaks of the evoked response and exclude the strong stimulus artifact, the analysis window was chosen from 18 ms to 200 ms based on visual inspection of the sensor signal.

Preprocessing was done following the standard pipeline of the MNE software (Gramfort et al., 2014). Baseline correction using pre-stimulus data (from -200 ms to 0 ms) was applied. Epochs with peak-to-peak amplitudes exceeding predefined rejection parameters (3 pT for magnetometers, 400 pT/m for gradiometers, and 150 µV for EOG on AEF and VEF, 350 µV for SEF) were assumed to be affected by artifacts and discarded. This resulted in 55 (AEF), 67 (VEF) and 111 (SEF) artifact-free measurements, which were averaged to produce the target matrix Y. The gain matrix was computed using a set of p = 7498 cortical locations and a three-layer boundary element model.

G Related Work
The topic of high-dimensional inference has been addressed in many recent works. Yet, to the bestof our knowledge, none of this literature has been applied to the source localization problem weconsider here. https://github.com/mathurinm/CELER
• The idea of associating clustering with high-dimensional inference can also be found in recent works with application to genetic data: Bühlmann et al. (2013) used a fixed clustering step, which is made adaptive in Mandozzi and Bühlmann (2016). Our contribution deviates from these works in two regards. Unlike Bühlmann et al. (2013), we do not consider that a fixed clustering, however good, captures the essence of the problem: this is why we resort to an ensemble of different clustering solutions. Unlike Mandozzi and Bühlmann (2016), we do not try to narrow down the inference in a hierarchical fashion, because we do not consider that source imaging can in effect be traced down to the vertex level: given the difficulty of the source imaging problem, we find it more satisfactory to outline a region of putative activity.
• Another family of inference methods, based on sample splitting, has been introduced by Meinshausen et al. (2009): train data are used to select regions, test data to assess their statistical significance. The choice of splits can be varied and aggregated upon to mitigate the impact of arbitrary split selection. However, data splitting has a high cost in terms of statistical power, making these approaches weakly sensitive (Taylor and Tibshirani, 2015).
• An alternative method yielding family-wise error rate (FWER) control is stability selection, which builds on bootstrapped randomized sparse regression (Meinshausen and Bühlmann, 2010). Yet, this approach has been found too weakly sensitive and it has not been considered in further statistical inference works, see e.g.,
Dezeure et al. (2015).
• Post-selection inference (Taylor and Tibshirani, 2015) is an approach that typically relies on a sparse estimator (such as the Lasso) and then assesses the significance of the selected variables. It accounts for the selection in the inference process, avoiding the undesirable bias of selecting and testing on the same data. However, we have not found an implementation that scales in a numerically sound way to the problem sizes that we consider here: thousands of features, even after clustering.
• Knockoff inference (with or without clustering) is probably the most recent alternative developed for high-dimensional inference (Barber and Candès, 2019): it consists in appending a noisy copy of the problem features and selecting only the variables that are much more significantly associated than their noisy copy. While this approach is computationally relevant for the problem at hand, it suffers from the arbitrariness of the knockoff variable set used; it yields control of the false discovery rate of the detection problem, which is not directly comparable with the family-wise error rate (FWER) considered here. FWER control is possible with knockoffs (Janson and Su, 2016), yet it is very weakly sensitive.
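In contrast with these alternatives, the ensembling route retained here (Appendix C, Algorithm 2) is cheap to implement once the cluster-wise inference is done. The sketch below only covers the last two loops of Algorithm 2; the corrected cluster-wise p-values and the per-run cluster labels are hypothetical inputs standing in for the clustered d-MTLasso step, and the choice `gamma_min = 0.1` and the grid over γ are illustrative assumptions:

```python
import numpy as np

def aggregate_pvalues(pvals_b, gamma_min=0.1):
    """Adaptive quantile aggregation (Meinshausen et al., 2009) of the
    B p-values of one feature: min(1, (1 - log(gamma_min)) *
    inf_gamma quantile_gamma({p_b / gamma})), with gamma searched over
    a finite grid (illustrative choice)."""
    pvals_b = np.asarray(pvals_b, dtype=float)
    q = min(np.quantile(pvals_b / g, g)
            for g in np.linspace(gamma_min, 1.0, 50))
    return min(1.0, (1.0 - np.log(gamma_min)) * q)

def ensemble_feature_pvalues(cluster_pvals_runs, labels_runs):
    """Map the corrected cluster-wise p-values of each run to features
    through the cluster labels, then aggregate across runs (the last
    two loops of Algorithm 2)."""
    feat_runs = np.stack([np.asarray(q)[np.asarray(lab)]
                          for q, lab in zip(cluster_pvals_runs, labels_runs)])
    return np.array([aggregate_pvalues(feat_runs[:, j])
                     for j in range(feat_runs.shape[1])])
```

A feature that repeatedly receives a small p-value across clusterings keeps a small aggregated p-value, while a feature that is significant under only one arbitrary clustering does not.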
Supplementary figures
Figure 5:
Illustrating spatial tolerance of size δ = 20 mm and δ = 40 mm. The true source in red has a 10 mm radius (distance measured on the cortical surface) and the spatial tolerance extends this region by 20 mm on the left side and by 40 mm on the right side (in yellow). The δ-FWER is the probability of making false discoveries outside of the extended region. A false discovery made in the yellow region is thus counted neither as an error nor as a true positive.

[Figure 6 panels: connected feature correlation density; inter-feature correlation density; connected cluster correlation density (C=1000); inter-cluster correlation density (C=1000)]
Figure 6:
Illustrating correlation in the MNE sample MEG data. (left): Distribution of the maximum correlation between a feature (resp. cluster) and another connected feature (resp. cluster). (Top) the maximum connected feature correlation is close to 0.98 on average. (Bottom) the maximum connected cluster correlation is lower, close to 0.9 on average. Clustering significantly improves conditioning. (right): The density of the inter-feature correlation (top) looks similar to the density of the inter-cluster correlation (bottom). Focusing on the extreme values of correlation, we see a slight decrease of extreme values for the clustered data.
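The connected-feature correlation summarized above can be computed with a few lines of numpy; the design matrix and the neighborhood lists below are illustrative stand-ins for the gain matrix and the cortical mesh adjacency:

```python
import numpy as np

def max_connected_correlation(X, neighbors):
    """For each column j of the design matrix X, return the maximum
    absolute correlation between column j and its connected columns.
    neighbors[j] is the list of indices adjacent to j (e.g. derived
    from the cortical mesh for features, or from cluster adjacency)."""
    # Correlation matrix between the columns of X
    corr = np.corrcoef(X, rowvar=False)
    return np.array([np.abs(corr[j, neighbors[j]]).max()
                     for j in range(X.shape[1])])
```

Applied to features and then to clusters (after averaging columns within each cluster), this yields the two left-hand histograms of Figure 6.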
Figure 7:
Spatial Dispersion (SD) histograms. (left): SD at a fixed time point (Hauk et al., 2011). All methods lead to comparable spatial dispersion. (right): SD for the desparsified multi-task Lasso (d-MTLasso) with an increasing number of time points. See Figure 2 for PLE histograms on the same experiments.
Figure 8: Precision-Recall curves (sLORETA, d-MTLasso AR0, d-MTLasso AR1, cd-MTLasso, ecd-MTLasso). See Figure 3 for δ-Precision-Recall curves computed on the same data.
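The empirical δ-FWER reported in Figure 9 can be estimated over simulation runs as follows; this is a sketch in which the 1-D index distance d(j, k) = |j - k| is an illustrative surrogate for the distance on the cortical surface:

```python
import numpy as np

def empirical_delta_fwer(pval_runs, support, delta, x=0.05):
    """Fraction of simulation runs with at least one p-value <= x on a
    feature at distance >= delta from every true source, i.e. in the
    set N^delta of Definition A.1. pval_runs has shape (runs, p);
    support lists the indices of the true sources."""
    support = np.asarray(support)
    p = pval_runs.shape[1]
    dist = np.array([np.abs(support - j).min() for j in range(p)])
    far = dist >= delta  # features belonging to N^delta
    return (pval_runs[:, far] <= x).any(axis=1).mean()
```

Discoveries inside the δ-tolerance region are simply ignored, matching the accounting of Figure 5.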
Figure 9: ecd-MTLasso empirical δ-FWER and precision-recall for different choices of cluster size. (left): Running the same simulation as in Sec. 3.2, we observe that the spatial tolerance δ can be reduced to 20 mm by increasing the number of clusters. With C = 1000 clusters (resp. C = 2000, C = 4000), the average cluster diameter is around 18 mm (resp. 13 mm and 9 mm). It turns out that the δ-FWER is controlled for around twice the diameter (if the compressed design matrix verifies assumption A1). (right): This decrease in spatial tolerance comes at a price regarding support recovery: the precision-recall curve declines when C is increased. (both): Note that we need to set the hyper-parameter c used to compute the regularization parameters α (see the note accompanying Equation (5)). We found empirically that it should be inversely proportional to C: for C = 1000, c = 0.; for C = 2000, c = 0.; for C = 4000, c = 0..

Figure 10: Comparison on the auditory dataset on both hemispheres.
From left to right are compared sLORETA, d-MTLasso without AR modeling (noise assumed non-autocorrelated), d-MTLasso with an AR1 noise model and ecd-MTLasso also using an AR1. The results correspond to auditory (top) evoked fields. Colormaps are fixed across datasets and adjusted based on meaningful statistical thresholds in order to outline FWER control issues.

Figure 11: