Statistical control for spatio-temporal MEG/EEG source imaging with desparsified multi-task Lasso
Jérôme-Alexis Chevalier, Alexandre Gramfort, Joseph Salmon, Bertrand Thirion
Jérôme-Alexis Chevalier
Inria Saclay, Paris-Saclay, France
[email protected]

Alexandre Gramfort
Inria Saclay, Paris-Saclay, France
[email protected]

Joseph Salmon
IMAG, Université de Montpellier, Montpellier, France
[email protected]

Bertrand Thirion
Inria Saclay, CEA, Paris-Saclay, France
[email protected]
Abstract
Detecting where and when brain regions activate in a cognitive task or in a given clinical condition is the promise of non-invasive techniques like magnetoencephalography (MEG) or electroencephalography (EEG). This problem, referred to as source localization, or source imaging, poses however a high-dimensional statistical inference challenge. While sparsity-promoting regularizations have been proposed to address the regression problem, it remains unclear how to ensure statistical control of false detections. Moreover, M/EEG source imaging requires working with spatio-temporal data and autocorrelated noise. To deal with this, we adapt the desparsified Lasso estimator, an estimator tailored for high-dimensional linear models that asymptotically follows a Gaussian distribution under sparsity and moderate feature correlation assumptions, to temporal data corrupted with autocorrelated noise. We call it the desparsified multi-task Lasso (d-MTLasso). We combine d-MTLasso with spatially constrained clustering to reduce data dimension and with ensembling to mitigate the arbitrary choice of clustering; the resulting estimator is called ensemble of clustered desparsified multi-task Lasso (ecd-MTLasso). With respect to current procedures, the two advantages of ecd-MTLasso are that i) it offers statistical guarantees and ii) it allows one to trade spatial specificity for sensitivity, leading to a powerful adaptive method. Extensive simulations on realistic head geometries, as well as empirical results on various MEG datasets, demonstrate the high recovery performance of ecd-MTLasso and its primary practical benefit: offering a statistically principled way to threshold MEG/EEG source maps.

Source imaging with magnetoencephalography (MEG) and electroencephalography (EEG) delivers insights into brain activity with high temporal and good spatial resolution in a non-invasive way (Baillet et al., 2001).
It however requires solving the bioelectromagnetic inverse problem, which is a high-dimensional ill-posed regression problem. Various approaches have been proposed to regularize the estimation of the regression coefficients that map activity to brain locations. Historically, ℓ2 regularization was considered first (Hämäläinen and Ilmoniemi, 1994), with successive improvements known as dSPM (Dale et al., 2000) and sLORETA (Pascual-Marqui, 2002) that are referred to as "noise normalized" solutions. The reason is that the coefficients are standardized with an estimate of the noise standard deviation, producing outputs that are comparable to T or F statistics, yet not statistically calibrated. These latter techniques have since become standard when using ℓ2 approaches. More recently, alternative approaches based on sparsity assumptions have been proposed with the ambition to improve the spatial specificity of M/EEG source imaging (Matsuura and Okabe, 1995; Haufe et al., 2009; Gramfort et al., 2012; Lucka et al., 2012; Wipf and Nagarajan, 2009). The output of such methods consists of focal sources, as opposed to the blurred images obtained with ℓ2 regularization. However, obtaining statistics ("noise normalized") from sparse or non-linear estimators seems challenging, especially since M/EEG data are spatio-temporal data with complex noise structure. A natural way to deal with the temporal dimension is to consider a multi-task estimator and structured sparse priors based on ℓ2/ℓ1 mixed norms (Ou et al., 2009; Gramfort et al., 2012).

In the statistical literature, some attempts to obtain an estimate of both the regression coefficients and their variance have been proposed for linear models in high dimension (Wasserman and Roeder, 2009; Meinshausen et al., 2009; Bühlmann, 2013). These estimates can then be translated to p-value maps, i.e., maps of p-values associated with each covariate.
Some methods adapted to sparse scenarios have then proposed to debias the Lasso to obtain p-values or confidence intervals (Zhang and Zhang, 2014; van de Geer et al., 2014; Javanmard and Montanari, 2014). We refer to such variants as the desparsified Lasso. Recently, desparsified extensions of the group Lasso have also been considered (Mitra and Zhang, 2016; Stucky and van de Geer, 2018). However, all these previous methods generally lack power when p ≫ n. Here, we propose to address a multi-task setting in the presence of correlated noise, and to deal with the high-dimensional regime where p ≫ n by leveraging data structure, as done by Chevalier et al. (2018). All these challenges need to be considered for M/EEG source imaging.

Our first contribution is to propose the desparsified multi-task Lasso (d-MTLasso), an extension of the desparsified Lasso (d-Lasso) (Zhang and Zhang, 2014; van de Geer et al., 2014) to the multi-task setting (Obozinski et al., 2010). More precisely, we adapt the group formulation by Mitra and Zhang (2016) to the multi-task setting, which enjoys i) a simple test statistic formula with ii) a natural integration of autocorrelated noise and iii) a simplification of the assumptions. Our second contribution is to introduce the ensemble of clustered desparsified multi-task Lasso (ecd-MTLasso), which has two advantages compared to current methods: i) it offers statistical guarantees and ii) it allows one to trade spatial specificity for sensitivity, leading to a powerful adaptive method. Our third contribution is an empirical validation of the theoretical claims. In particular, we run extensive simulations on realistic head geometries, as well as empirical results on various MEG datasets, to demonstrate the high recovery performance of ecd-MTLasso and its primary practical benefit: offering a statistically principled way to threshold MEG/EEG source maps.
In this section, we give the noise model, we provide standard tools for solving the source localization problem and, mainly, we present three new methods with their assumptions and statistical guarantees.
For clarity, we use bold lowercase for vectors and bold uppercase for matrices. For any positive integer p ∈ N*, we write [p] for the set {1, ..., p}. For a vector β, β_j refers to its j-th coordinate. For a matrix X ∈ R^{n×p}, X^{(−j)} refers to the matrix X without the j-th column, X_{i,.} refers to the i-th row, X_{.,j} to the j-th column, and X_{i,j} refers to the element in the i-th row and j-th column. The notation ‖·‖ refers to the Frobenius norm for matrices and to the standard Euclidean norm for vectors. For a covariance matrix M, the Mahalanobis norm is denoted by ‖·‖_{M^{−1}}, and for a given vector a we have ‖a‖²_{M^{−1}} ≜ a⊤ M^{−1} a. For B ∈ R^{p×T}, ‖B‖_{2,1} = Σ_{j=1}^p ‖B_{j,.}‖, and its (row) support is Supp(B) = {j ∈ [p] : B_{j,.} ≠ 0}.

We assume that the underlying model is linear:

    Y = XB + E,    (1)

where Y ∈ R^{n×T} is the signal observed on M/EEG sensors, X ∈ R^{n×p} the design matrix representing the M/EEG forward model, B ∈ R^{p×T} the underlying signal in source space and E ∈ R^{n×T} the noise. We assume that there exist ρ ∈ [0, 1) and σ > 0 such that for all t ∈ [T], E_{.,t} ~ N(0, σ² I_n), and that for all i ∈ [n] and all t ∈ [T−1], Cor(E_{i,t}, E_{i,t+1}) = ρ. For all i ∈ [n], E_{i,.} is Gaussian with Toeplitz covariance, i.e., defining M ∈ R^{T×T} by M_{t,u} = σ² ρ^{|t−u|} for all (t, u) ∈ [T]², we have:

    E_{i,.} ~ N(0, M).    (2)

We further assume that X has been column-wise standardized and denote by Σ̂ ∈ R^{p×p} the empirical covariance matrix of X, i.e., Σ̂ = X⊤X/n with Σ̂_{j,j} = 1. All proofs are given in Appendix D.

To quantify the ability of an M/EEG source imaging technique to obtain a good estimate B̂, a commonly reported quantity is the Peak Localization Error (PLE) (Hauk et al., 2011).
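As a concrete illustration of the data model, the following sketch simulates data following (1) with AR(1) Toeplitz noise as in (2); the problem sizes and the values of σ and ρ are arbitrary toy choices, not those used in the paper.

```python
import numpy as np

def ar1_covariance(T, sigma, rho):
    # Toeplitz covariance of model (2): M[t, u] = sigma^2 * rho^|t - u|
    idx = np.arange(T)
    return sigma ** 2 * rho ** np.abs(idx[:, None] - idx[None, :])

rng = np.random.default_rng(0)
n, p, T = 50, 200, 10          # illustrative sizes (n << p, as in M/EEG)
sigma, rho = 1.0, 0.6          # arbitrary noise level and autocorrelation
M = ar1_covariance(T, sigma, rho)

# Each sensor's time course E[i, :] is drawn from N(0, M)
E = rng.multivariate_normal(np.zeros(T), M, size=n)

# Column-wise standardized design and a row-sparse coefficient matrix
X = rng.standard_normal((n, p))
X = (X - X.mean(axis=0)) / X.std(axis=0)
B = np.zeros((p, T))
B[:3, :] = 1.0                 # Supp(B) = {0, 1, 2}
Y = X @ B + E                  # model (1)
```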
It consists in measuring the distance (in mm) along the cortical surface between the true simulated source and the location with maximum amplitude in the estimate. By contrast, spatial dispersion (SD) measures how much the activity is spread out by the inverse method (Molins et al., 2008).

To quantify the control of statistical errors, we consider a generalization of the Family-Wise Error Rate (FWER) (Hochberg and Tamhane, 1987): the δ-FWER. As illustrated in Figure 5 in appendix, it is the FWER taken with respect to a ground truth dilated spatially by an amount δ, in the present study a distance in mm. A rigorous definition of the δ-FWER is given in Appendix A. The rationale is that detections made outside of the support, but less than δ away from the support, should count as slight inaccuracies of the methods, not as false positives. In an analogous manner, δ-FDR = (1 − δ-precision) has been proposed recently as an extension of the False Discovery Rate (FDR) (Benjamini and Hochberg, 1995) to include a spatial tolerance (Nguyen et al., 2019; Gimenez and Zou, 2019). We thus characterize the selection capabilities of the methods through a δ-precision/recall curve.

The sLORETA and dSPM estimators are derived from the ridge estimator (Hoerl and Kennard, 1970):

    B̂^Ridge = KY  where  K = X⊤(XX⊤ + λ I)^{−1}.    (3)

They are obtained by scaling each row j in B̂^Ridge by an estimate of the noise level at location j. It reads (Lin et al., 2006) B̂^dSPM_{j,t} = B̂^Ridge_{j,t} / σ^dSPM_j and B̂^sLORETA_{j,t} = B̂^Ridge_{j,t} / σ^sLORETA_j, where σ^dSPM_j = sqrt(σ² [KK⊤]_{j,j}) and σ^sLORETA_j = sqrt([K(σ² I + XX⊤)K⊤]_{j,j}). Interestingly, it can be proved that in the absence of noise and when only a single coefficient is non-zero, the sLORETA estimate has its maximum at the correct location (Pascual-Marqui, 2002). Assuming B_{.,t} ~ N(0, I), the covariance of Y reads σ² I + XX⊤.
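The ridge kernel of Equation (3) and the two noise normalizations can be sketched as follows; the toy dimensions and the values of λ and σ² are illustrative assumptions.

```python
import numpy as np

def noise_normalized_estimates(X, Y, lam, sigma2):
    """Ridge kernel (Eq. 3) with dSPM and sLORETA row scalings."""
    n = X.shape[0]
    # K = X^T (X X^T + lam I)^{-1}
    K = X.T @ np.linalg.inv(X @ X.T + lam * np.eye(n))
    B_ridge = K @ Y
    # dSPM scaling: sigma_j = sqrt(sigma^2 [K K^T]_{jj})
    s_dspm = np.sqrt(sigma2 * np.einsum('ij,ij->i', K, K))
    # sLORETA scaling: sigma_j = sqrt([K (sigma^2 I + X X^T) K^T]_{jj})
    C = sigma2 * np.eye(n) + X @ X.T
    s_sloreta = np.sqrt(np.einsum('ij,jk,ik->i', K, C, K))
    return B_ridge / s_dspm[:, None], B_ridge / s_sloreta[:, None]

rng = np.random.default_rng(1)
X = rng.standard_normal((20, 60))   # toy forward model, n=20 sensors
Y = rng.standard_normal((20, 5))    # toy measurements, T=5 time points
B_dspm, B_sloreta = noise_normalized_estimates(X, Y, lam=1.0, sigma2=1.0)
```

Since the sLORETA scaling adds the source term [KXX⊤K⊤]_{j,j} to the dSPM one, the sLORETA map is never larger than the dSPM map in absolute value here.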
Hence, one can consider that sLORETA adds to dSPM an extra term in the sensor covariance matrix that comes from the sources. Note that these methods treat each time instant independently, hence ignoring source and noise temporal autocorrelations.

Let us first recall the definition of the multi-task Lasso (MTLasso) estimator (Obozinski et al., 2010) in our setting. For a tuning parameter λ > 0, it is defined as

    B̂^MTL ∈ argmin_{B ∈ R^{p×T}} { (1/(2n)) ‖Y − XB‖² + λ ‖B‖_{2,1} }.    (4)

In practice, λ is set by cross-validation on a logarithmic grid of fractions of λ_max, where λ_max = ‖X⊤Y‖_{2,∞}. It is well known that, similarly to the Lasso, the MTLasso is biased: it tends to shrink rows with large amplitude towards zero. Below, we provide an adaptation of the desparsified Lasso following the approach by Zhang and Zhang (2014), see also Mitra and Zhang (2016), to ensure statistical control. The approach relies on the introduction of score vectors z_1, ..., z_p in R^n defined by

    z_j = X_{.,j} − X^{(−j)} β̂^{(−j)}_{α_j},    (5)

where, for j ∈ [p], β̂^{(−j)}_{α_j} is the Lasso solution (Tibshirani, 1996; Chen and Donoho, 1994) of the regression of X_{.,j} against X^{(−j)} with regularization parameter α_j. In (Zhang and Zhang, 2014, Table 1), an algorithm for choosing α_j is proposed; we noticed that taking, for all j ∈ [p], α_j = c α_{max,j} := c ‖X^{(−j)⊤} X_{.,j}‖_∞ / n with a small constant c allows a significant computational gain on M/EEG data and yields adequate residuals for C = 1000 (see Sec. 2.6). Note that these score vectors do not depend on Y, and their computation is then equivalent to solving the node-wise Lasso (Meinshausen and Bühlmann, 2006). For such vectors, the noise model in (1) yields

    z_j⊤Y / (z_j⊤X_{.,j}) = B_{j,.} + z_j⊤E / (z_j⊤X_{.,j}) + Σ_{k≠j} (z_j⊤X_{.,k} / (z_j⊤X_{.,j})) B_{k,.}.    (6)

Discarding the noise term and plugging B̂^MTL_{k,.}
as a preliminary estimator of B_{k,.} in (6), we coin the desparsified multi-task Lasso (d-MTLasso), a debiased version of B̂^MTL defined for all j ∈ [p] by

    B̂^{d-MTLasso}_{j,.} = z_j⊤Y / (z_j⊤X_{.,j}) − Σ_{k≠j} (z_j⊤X_{.,k} / (z_j⊤X_{.,j})) B̂^MTL_{k,.}.    (7)

To derive the statistical properties of d-MTLasso, we need the extended Restricted Eigenvalue (RE) property (Lounici et al., 2011, Assumption 3.1), detailed in Appendix B. More precisely, we assume that

(A1) RE(X, s) is verified on X for a sparsity parameter s ≥ |Supp(B)| and a constant κ = κ(s) > 0.

Roughly, A1 can be seen as a combination of sparsity and "moderate" feature correlation assumptions.

Proposition 2.1.
Considering the model in Equation (1), assuming A1, and for a choice of λ large enough in Equation (4), then with high probability:

    √n (B̂^{d-MTLasso} − B) = Λ + Δ,    (8)

with Λ_{j,.} ~ N(0, Ω̂_{j,j} M) for all j ∈ [p], where Ω̂_{j,k} = n z_j⊤z_k / (|z_j⊤X_{.,j}| |z_k⊤X_{.,k}|), and

    ‖Δ‖_{2,∞} = O( s λ √(log p) / κ² ).    (9)

Then, under the j-th null hypothesis H0^{(j)}: "B_{j,.} = 0", and neglecting the term Δ in (8) (see Appendix D.2 for more details), as done by van de Geer et al. (2014), B̂^{d-MTLasso}_{j,.} is Gaussian with zero mean. Finally, using standard results on χ² distributions (see Appendix D.1), we obtain

    n ‖B̂^{d-MTLasso}_{j,.}‖²_{M^{−1}} ~ Ω̂_{j,j} χ²_T.

If M is known, the quantity n ‖B̂^{d-MTLasso}_{j,.}‖²_{M^{−1}} / Ω̂_{j,j} can be used as a decision statistic to obtain a p-value testing the importance of source j, by comparison with the χ²_T distribution. In practice, we need to estimate M by M̂. Notably, assuming that we have an estimator σ̂ of σ that approximately verifies (n − ŝ) σ̂²/σ² ~ χ²_{n−ŝ}, where ŝ = |Supp(B̂^MTL)| (see Sec. 2.5), we take

    f̂_j := n ‖B̂^{d-MTLasso}_{j,.}‖²_{M̂^{−1}} / (T Ω̂_{j,j})    (10)

as a statistic to compare with a Fisher distribution with parameters T and n − ŝ, to compute the p-values. The full d-MTLasso algorithm is given in Algorithm 1. Note that a Python implementation of the procedures presented in this paper is available at https://github.com/ja-che/hidimstat along with some examples.

In Sec. 2.1, the noise is assumed homogeneous across sensors, allowing to obtain a robust estimator. Extending Reid et al. (2016) to multi-task regression, we consider the residuals Ê = Y − X B̂^MTL and the estimated support size ŝ. Defining, for t ∈ [T], σ̂_t² = ‖Ê_{.,t}‖² / (n − ŝ), an estimate of σ is:

    σ̂ = median({σ̂_t, t ∈ [T]}).
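The inference steps above (score vectors of Eq. (5), debiasing of Eq. (7), Fisher statistic of Eq. (10)) can be sketched in a few lines; the fixed node-wise penalty `alpha` replaces the paper's data-driven choice and is purely illustrative.

```python
import numpy as np
from scipy import stats
from sklearn.linear_model import Lasso

def score_vectors(X, alpha=0.1):
    # Node-wise Lasso residuals (Eq. 5); alpha is a fixed illustrative
    # penalty (the paper scales it from alpha_max,j instead)
    n, p = X.shape
    Z = np.empty((n, p))
    for j in range(p):
        X_mj = np.delete(X, j, axis=1)
        beta = Lasso(alpha=alpha, fit_intercept=False,
                     max_iter=5000).fit(X_mj, X[:, j]).coef_
        Z[:, j] = X[:, j] - X_mj @ beta
    return Z

def desparsified_mtlasso(X, Y, B_mtl, Z):
    # Debiased rows (Eq. 7): correct the MTLasso shrinkage row by row
    p, T = B_mtl.shape
    B_deb = np.empty((p, T))
    for j in range(p):
        zj = Z[:, j]
        denom = zj @ X[:, j]
        corr = sum((zj @ X[:, k]) * B_mtl[k] for k in range(p) if k != j)
        B_deb[j] = (zj @ Y - corr) / denom
    return B_deb

def fisher_pvalues(B_deb, Z, X, M_hat, s_hat):
    # Statistic (10): f_j = n ||B_deb_j||^2_{M^-1} / (T Omega_jj),
    # compared with an F(T, n - s_hat) distribution
    n, p = X.shape
    T = B_deb.shape[1]
    M_inv = np.linalg.inv(M_hat)
    pvals = np.empty(p)
    for j in range(p):
        zj = Z[:, j]
        omega_jj = n * (zj @ zj) / (zj @ X[:, j]) ** 2
        f_j = n * (B_deb[j] @ M_inv @ B_deb[j]) / (T * omega_jj)
        pvals[j] = stats.f.sf(f_j, T, n - s_hat)
    return pvals
```

In the noise-free case with the oracle B̂^MTL = B, Equation (7) recovers B exactly, which gives a convenient sanity check of the implementation.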
Taking the median instead of the mean avoids depending on prospective under-fitted time steps and turns out to be more robust empirically. Similarly, defining for all t ∈ [T−1], ρ̂_t = cor_n(Ê_{.,t}, Ê_{.,t+1}) (where cor_n(., .) is the empirical correlation), ρ is estimated by taking ρ̂ = median({ρ̂_t, t ∈ [T−1]}). Then, an estimator M̂ of M is given by M̂_{t,u} = σ̂² ρ̂^{|t−u|}. See the proof of (Lounici et al., 2011, Theorem 3.1).

Algorithm 1: d-MTLasso
input: X ∈ R^{n×p}, Y ∈ R^{n×T}
    B̂^MTL ← MTL(X, Y)                      // cross-validated multi-task Lasso
    Ê ← Y − X B̂^MTL                         // residuals
    ŝ ← |Supp(B̂^MTL)|
    for t ∈ [T] do                           // noise level estimation
        σ̂_t² ← ‖Ê_{.,t}‖² / (n − ŝ)
    σ̂ ← median({σ̂_t, t ∈ [T]})
    Get M̂ thanks to Sec. 2.5
    for j ∈ [p] do
        z_j ← Lasso(X^{(−j)}, X_{.,j})       // cross-validated Lasso
        Ω̂_{j,j} ← n z_j⊤z_j / |z_j⊤X_{.,j}|²
        B̂^{d-MTLasso}_{j,.} ← z_j⊤Y/(z_j⊤X_{.,j}) − Σ_{k≠j} (z_j⊤X_{.,k}/(z_j⊤X_{.,j})) B̂^MTL_{k,.}    // desparsified multi-task Lasso
        f̂_j ← n ‖B̂^{d-MTLasso}_{j,.}‖²_{M̂^{−1}} / (T Ω̂_{j,j})    // inference statistic
    return f̂_1, ..., f̂_p

In the high-dimensional inference scenario considered, the number of sensors is more than one order of magnitude smaller than the number of sources, n ≪ p. Therefore, estimators of conditional association between sources and observations struggle to identify the solution. The setting is even more difficult due to the presence of high correlations between sources (see Figure 6 in appendix). Further gains can however come from a compression of the design matrix (Bühlmann et al., 2013; Mandozzi and Bühlmann, 2016). For this, we introduce a clustering step that reduces data dimensionality while leveraging spatial structure. We consider a spatially-constrained hierarchical clustering algorithm described by Varoquaux et al. (2012) that uses the Ward criterion.
Other clustering schemes might be considered, as long as they yield spatially contiguous regions of the cortical surface. The combination of this clustering algorithm with the d-Lasso or d-MTLasso algorithms will be respectively referred to as clustered desparsified Lasso (cd-Lasso) and clustered desparsified multi-task Lasso (cd-MTLasso). The number of clusters is denoted by C and, for r ∈ [C], we denote by G_r the r-th group. Every cluster representative variable is given by the average of the covariates it contains. Then, reordering conveniently the columns of X, the compressed design matrix Z ∈ R^{n×C} is given by:

    Z = XA,    (11)

where A ∈ R^{p×C} is the block matrix whose r-th column has entries 1/|G_r| on the rows indexed by G_r and zeros elsewhere. We say that the compression of X is of good quality if:

(A2) there exists Γ ∈ R^{C×T} such that Γ_{r,.} = Σ_{j∈G_r} w_j B_{j,.} with w_j ≥ 0 for all j ∈ [p], and the associated compression loss XB − ZΓ is "small enough" with respect to the model noise (see Appendix D.3 for more details);

(A3) RE(Z, s′) is verified on Z for a sparsity parameter s′ ≥ |Supp(Γ)| and a constant κ′ = κ′(s′) > 0.

Proposition 2.2.
Assume Equation (1), A2, A3, a choice of regularization parameter large enough in the MTLasso regression of Y against Z, and that the largest cluster of the compression is of diameter at most δ; then cd-MTLasso controls the δ-FWER.

Figure 1: ecd-MTL overview diagram.
While cd-MTLasso applies d-MTLasso to clustered data, ecd-MTLasso aggregates several cd-MTLasso solutions.
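A minimal sketch of the compression Z = XA of Equation (11); here the Ward clustering is spatially constrained only if an adjacency matrix is passed as `connectivity`, a simplification of the algorithm of Varoquaux et al. (2012).

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

def compress_design(X, n_clusters, connectivity=None):
    # Cluster the p columns of X (Ward linkage; pass a spatial adjacency
    # matrix as `connectivity` to make the clustering spatially constrained),
    # then average the covariates within each cluster: Z = X A (Eq. 11).
    labels = AgglomerativeClustering(
        n_clusters=n_clusters, linkage="ward", connectivity=connectivity
    ).fit_predict(X.T)
    p = X.shape[1]
    A = np.zeros((p, n_clusters))
    for r in range(n_clusters):
        members = np.flatnonzero(labels == r)
        A[members, r] = 1.0 / len(members)   # column r averages cluster G_r
    return X @ A, A, labels
```

Each column of Z is then exactly the mean of the covariates of one cluster, matching the cluster-representative definition given above.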
To reduce the sensitivity of cd-MTLasso to small data perturbations, we propose to randomize over the clustering. We build several clustering solutions, considering B = 100 different random subsamples of the full sample; then we aggregate the p-value maps output by cd-MTLasso. To aggregate the B cd-MTLasso solutions, we use the adaptive quantile aggregation proposed by Meinshausen et al. (2009), detailed in Appendix C. The full procedure of ensembling B cd-MTLasso (resp. cd-Lasso) solutions is called ecd-MTLasso, for ensemble of clustered desparsified multi-task Lasso (resp. ecd-Lasso). The algorithm of ecd-MTLasso is given in Algorithm 2 in appendix. We also give an overview diagram in Figure 1 to clarify the nesting structure of the proposed solutions.

Proposition 2.3.
Assume that for each of the B compressions the hypotheses of Prop. 2.2 are verified; then ecd-MTLasso controls the δ-FWER.

This result is conservative, and mixing several cd-MTLasso solutions usually reduces the spatial tolerance δ. Additional details on the procedure and its computational complexity are deferred to Appendix E.

In this section, we give empirical evidence of the advantages of ecd-MTLasso for source localization. First, in a typical point-source simulation, we compare the methods with respect to the standard PLE metric; notably, we study the effect of i) clustering and ii) integrating the time dimension. In a second simulation with more realistic features, we examine the δ-FWER control property and compare the support recovery properties of all methods. Lastly, working on real MEG data, we show that, contrary to sLORETA, ecd-MTLasso retrieves expected patterns using a universal threshold.

Here, we study how the proposed estimators perform compared to standard ℓ2-regularized approaches, and assess whether time-aware statistical analysis improves upon the static d-Lasso, as it is essential for M/EEG source imaging. We use the head anatomy and the recording setup from the sample dataset publicly available from the MNE software (Gramfort et al., 2014). The design matrix X is computed with a three-shell boundary element model with p = 7498 candidate cortical locations, and a 306-channel Elekta Neuromag Vectorview system with 102 magnetometers and 204 gradiometers. We only keep the gradiometers and remove one defective sensor, leading to n = 203. When considering multiple consecutive time instants to demonstrate the ability of the solver to leverage spatio-temporal data, the source is fixed and the temporal noise autocorrelation ρ is set to a fixed nonzero value.

Figure 2 reports the normalized histograms of PLE for the 7498 locations for the different methods investigated; results on spatial dispersion (SD) are available in Figure 7 in appendix.
While it might seem simplistic to consider a single source, this experiment allows us to demonstrate that d-Lasso improves over sLORETA in the presence of noise (see Figure 2, left). In the same figure, one can observe that clustering degrades this performance, as it carries an intrinsic spatial blur. However, even in this adversarial scenario (Dirac-like source location), cd-Lasso and ecd-Lasso remain competitive w.r.t. sLORETA, avoiding extreme PLE values. Note that, here, a single time point was used (T = 1). A typical choice is C = 1000 clusters for M/EEG data. Note also that |Supp(Γ)| ≤ |Supp(B)| and that Z is generally better conditioned than X, making A3 more plausible than A1.
Figure 2: Peak Localization Error (PLE) histograms. (left): PLE on a fixed time point (T = 1); sLORETA is outperformed by the desparsified Lasso, while cd-Lasso and ecd-Lasso are more concentrated and exhibit a smaller number of very low PLE values but also a smaller number of extreme PLE values. (right): PLE for the desparsified multi-task Lasso (d-MTLasso) with T = 6 compared to d-Lasso (T = 1); more time points improve the results by reducing the PLE.

The right panel in Figure 2 shows that d-MTLasso (T = 6) significantly outperforms d-Lasso (T = 1) in terms of PLE. Leveraging spatio-temporal data indeed increases the signal-to-noise ratio, which enhances spatial specificity. Effects in terms of SD are minor (see appendix, Figure 7).
We now investigate whether the different versions of d-MTLasso control the δ-FWER on a realistic simulation, and compare their support recovery properties. The data are the same as in Sec. 3.1. To simulate the sources, we randomly draw active regions by selecting parcels from a subdivided cortical FreeSurfer parcellation with 448 parcels (Khan et al., 2018). For each selected parcel, we take as sources all the dipoles within a given geodesic distance (in mm) from the center of the parcel, fixing the amplitude at 10 nAm. To evaluate how the methods control the δ-FWER, we perform repeated simulations and count how often active sources are found outside the δ-dilated ground truth. In the left panel of Figure 3, we see that d-MTLasso does not control the δ-FWER, due to the violation of some hypotheses of Prop. 2.1, in particular those regarding source correlation. However, we notice that handling noise autocorrelation reduces the empirical δ-FWER. Using clustering, the assumptions of Prop. 2.2 are more easily met; in particular, the conditioning of the problem is improved (Mattout et al., 2005). Yet cd-MTLasso does not control the δ-FWER for δ = 40 mm, because control requires the largest cluster diameter to be smaller than δ, which may not hold. Finally, randomization via ecd-MTLasso further improves FWER control. Empirically, we observe that the δ-FWER is controlled for δ around twice the average cluster diameter. Then, with the limitation of keeping the compressed design matrix well conditioned (C not too large), we can reduce the tolerance δ by increasing C (empirical support for this claim is given in appendix, Figure 9). We have excluded sLORETA from this study since it does not provide guarantees on the false discoveries. The right panel of Figure 3 shows the δ-precision/recall curve of the different methods.
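The empirical δ-FWER evaluation described above (counting simulation runs with at least one detection outside the δ-dilated ground truth) can be sketched as follows; the distance-matrix interface is an assumption of this sketch.

```python
import numpy as np

def empirical_delta_fwer(detected, support, distances, delta):
    # Fraction of runs with at least one detection farther than delta (mm)
    # from every true source, i.e. outside the delta-dilated support.
    # `detected` / `support` are lists (one entry per run) of source index
    # arrays (support assumed non-empty); `distances` is the p x p matrix
    # of pairwise geodesic distances.
    errors = 0
    for det, sup in zip(detected, support):
        dist_to_support = distances[np.ix_(det, sup)].min(axis=1)
        errors += bool(np.any(dist_to_support > delta))
    return errors / len(detected)
```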
We first notice that d-MTLasso cannot compete with sLORETA, because the high dimensionality of the problem makes the computation of the source importance overly ill-posed. cd-MTLasso improves detection accuracy, but still does not perform as well as sLORETA. However, adding the ensembling step, the δ-precision improves strongly, making ecd-MTLasso much better than sLORETA. In Figure 8 in appendix, we obtain similar results when considering the standard precision-recall curve.

We now report results on three MEG datasets spanning three types of sensory stimuli: auditory, visual and somatosensory. Additional results on EEG datasets are presented in Appendix H. The auditory evoked fields (AEF) and visual evoked fields (VEF) are obtained using stimuli in the left ear and left visual hemifield. The somatosensory evoked fields (SEF) are obtained following electrical stimulation of the left median nerve at the wrist. A detailed description of the data is provided in Appendix F. Experimental results are presented in Figure 4 and Figure 10 (cf. Appendix H). Among the many methods for M/EEG source imaging present in the literature, the methods compared here have in common that they output a statistical map.

Figure 3: δ-FWER and δ-Precision-Recall. (left): δ-FWER control of the different d-MTLasso methods. δ-FWER control is hard for d-MTLasso and cd-MTLasso, as some detections are made far from the true sources, due to remote correlations. Ensembles of clusters allow to limit these false detections. (right): δ-Precision-Recall curves: sLORETA outperforms d-MTLasso AR0 and AR1, because the problem is too high-dimensional for d-MTLasso to work properly. Clustering improves the outcome, and ensembling brings further benefits: ecd-MTLasso outperforms sLORETA.

The ℓ2-regularized sLORETA method is compared to the
debiased sparse estimators presented and evaluated above. The input for all solvers is a time window of data: from t = 50 to t = 100 ms for AEF and VEF, and from t = 30 to t = 40 ms for SEF. During such time intervals, one can expect the sources to originate primarily from the early sensory cortices, whose locations are anatomically known for normal subjects.

First, one can observe that all methods manage to highlight the proper functional sensory units (planum temporale for AEF, calcarine region for VEF and central sulcus for SEF). Considering the sLORETA results, one can observe that at a common threshold of 3.0 on the Student statistic, the estimator is quite spatially specific for VEF, but is overly conservative for AEF and clearly leads to many false positives for SEF. By inspection of the d-MTLasso solution, one can observe that taking into account the autocorrelation of the noise leads to a better calibrated noise variance, and therefore fewer dubious detections. Considering the ecd-MTLasso results, while all maps are also thresholded at a single level, one can see that it retrieves expected patterns without making dubious discoveries.

In Sec. 3.1, we have shown that taking into account the time dimension improves the results in terms of PLE. We have also seen that even in this adversarial point-source scenario (cf. Sec. 3.1), clustered methods remain competitive. In Sec. 3.2, while no control of false discoveries is provided by sLORETA, ecd-MTL is the only method that offers statistical control in practice. Namely, it controls the δ-FWER for δ equal to twice the average cluster diameter. Additionally, in this realistic simulation, ecd-MTL exhibits the best support recovery properties. In Sec.
3.3, working on real MEG data, we show that, contrary to sLORETA, ecd-MTLasso produces calibrated statistics with a universal threshold and retrieves expected patterns without making dubious discoveries. Overall, ecd-MTL offers statistical guarantees and is our privileged method.

Guidelines for statistical inference with ecd-MTLasso on temporal M/EEG data.
First, we give guidelines concerning the number of clusters C. Hoyos-Idrobo et al. (2015) show that clustering improves problem conditioning, meaning that the Restricted Eigenvalue (RE) property (see assumptions A1 and A3) is more likely to be verified. Complementarily, we argue that, keeping C above a hundred (limiting compression loss), the fewer the clusters, the more likely A3 is to be verified, so that Prop. 2.2 and Prop. 2.3 hold, but also the better the sensitivity of ecd-MTL. However, a small C also requires a higher spatial tolerance. We thus hit a fundamental trade-off for statistical inference between sensitivity and spatial specificity. Then, C can be chosen depending on the problem setting: if it is difficult (noisy), it seems natural to lower spatial tolerance expectations (diminish C); in that sense, ecd-MTL is an adaptive method (cf. Figure 9). For the present use case, taking C = 1000 seems an adequate trade-off to ensure δ-FWER control with reasonable spatial tolerance.

Now, we give recommendations for time sampling and window size. Choosing too short a window complicates AR model estimation due to the lack of data, while choosing too large a window may lead to a non-stationary support. We recommend windows of a few tens of ms with a time sampling of a few ms, as keeping T small reduces computation time and should not decrease sensitivity significantly.

Figure 4: Empirical comparison on 3 MEG datasets.
From left to right, one can see sLORETA, d-MTLasso without AR modeling (assuming non-autocorrelated noise), d-MTLasso with an AR1 noise model, and ecd-MTLasso also using an AR1 model. Results correspond to auditory (top), visual (middle) and somatosensory (bottom) evoked fields. Colormaps are fixed across datasets and adjusted based on meaningful statistical thresholds in order to qualitatively illustrate FWER control issues.

Finally, when working with M/EEG data, we recommend using only a fraction of the full data to compute several clustering solutions with spatial constraints and the Ward criterion, to ensure enough diversity.
Limitations.
The main limitation is the fact that mixing different types of sensors violates modeling assumptions both on temporal correlations and on spatial correlations, which is why we had to treat MEG and EEG sensors separately. A possibility to handle heterogeneous sensors is to follow Massias et al. (2018b), but for the temporal part, further developments are required and left for future work. Also left for future work is the possibility of studying larger time windows; a simple solution is to slide a shorter window over the considered period of time. Finally, a more common limitation is the fact that the assumptions are hard to test in practice.

The MEG source imaging problem poses a hard statistical inference challenge: namely that of high-dimensional statistical analysis, furthermore with high correlations in the design. We have proposed an estimator that correctly calibrates the effect sizes and variance, up to a number of hypotheses that are not easily met: some level of sparsity, mild correlation across sensors, and homoscedasticity of the noise. Up to these hypotheses, and up to a spatial tolerance on the exact location of the sources, we provide the first method with statistical guarantees for source imaging. This is made possible by bringing several improvements to the original desparsified Lasso solution: a multi-task formulation that increases power by basing inference on multiple time steps, a clustering step that renders the design less ill-posed, and an ensembling step that mitigates the (hard) choice of clusters. Finally, our privileged method, ecd-MTLasso, runs in less than 10 min on a real dataset on non-specialized hardware, making it usable by practitioners.
Statement of broader impact
Magnetoencephalography (MEG) and electroencephalography (EEG) offer a unique opportunity to image brain activity non-invasively with a temporal resolution on the order of milliseconds. This is relevant for cognitive neuroscience, to describe the sequence of active areas during certain cognitive tasks, but also for clinical neuroscience, where electrophysiology is used for diagnosis (e.g., sleep medicine, epilepsy presurgical mapping). Yet, doing brain imaging with M/EEG requires solving a challenging high-dimensional inverse problem for which statistical guarantees are crucially important. In this work, we address this statistical challenge when using sparsity-promoting regularization and when considering the specificity of M/EEG signals: data are spatio-temporal and the noise is temporally autocorrelated. The proposed algorithm builds on very recent work in optimization to speed up Lasso-type solvers, as well as on work in mathematical statistics on desparsified Lasso estimators. We believe that this work, whose contribution is both on the modeling side and on the inference side, brings sparse estimators closer to wide adoption in the neuroscience community. We would also like to emphasize that the inference framework can be adapted to many other high-dimensional problems where data structure can be leveraged: biomedical data and physical observations (cardiac or brain monitoring, genomics, seismology, etc.), especially those that involve severely ill-posed inverse problems.
Acknowledgements
This research is supported by funding from the French ANR project FastBig (ANR-17-CE23-0011), the KARAIB AI chair (ANR-20-CHIA-0025-01), the European Research Council Starting Grant SLAB (ERC-StG-676943) and Labex DigiCosme (ANR-11-LABEX-0045-DIGICOSME).
References
S. Baillet, J. C. Mosher, and R. M. Leahy. Electromagnetic brain mapping. IEEE Signal Proc. Mag., 18(6):14–30, Nov. 2001.
R. F. Barber and E. J. Candès. A knockoff filter for high-dimensional selective inference. Ann. Statist., 47(5):2504–2537, 2019.
Y. Benjamini and Y. Hochberg. Controlling the false discovery rate: a practical and powerful approach to multiple testing. J. R. Stat. Soc. Ser. B Stat. Methodol., 57(1):289–300, 1995.
P. Bühlmann. Statistical significance in high-dimensional linear models. Bernoulli, 19(4):1212–1242, 2013.
P. Bühlmann and S. van de Geer. Statistics for high-dimensional data: methods, theory and applications. Springer Science & Business Media, 2011.
P. Bühlmann, P. Rütimann, S. van de Geer, and C.-H. Zhang. Correlated variables in regression: clustering and sparse estimation. Journal of Statistical Planning and Inference, 143(11):1835–1858, Nov. 2013.
S. Chen and D. L. Donoho. Basis pursuit. In Proceedings of 1994 28th Asilomar Conference on Signals, Systems and Computers, volume 1, pages 41–44. IEEE, 1994.
J.-A. Chevalier, J. Salmon, and B. Thirion. Statistical inference with ensemble of clustered desparsified lasso. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 638–646, 2018.
A. M. Dale, A. K. Liu, B. R. Fischl, R. L. Buckner, J. W. Belliveau, J. D. Lewine, and E. Halgren. Dynamic statistical parametric mapping. Neuron, 26(1):55–67, 2000.
R. Dezeure, P. Bühlmann, L. Meier, and N. Meinshausen. High-dimensional inference: confidence intervals, p-values and R-software hdi. Statist. Sci., 30(4):533–558, 2015.
O. J. Dunn. Multiple comparisons among means. J. Amer. Statist. Assoc., 56(293):52–64, 1961.
J. R. Gimenez and J. Zou. Discovering conditionally salient features with statistical guarantees. In ICML, pages 2290–2298, 2019.
A. Gramfort, M. Kowalski, and M. Hämäläinen. Mixed-norm estimates for the M/EEG inverse problem using accelerated gradient methods. Phys. Med. Biol., 57(7):1937–1961, 2012.
A. Gramfort, M. Luessi, E. Larson, D. A. Engemann, D. Strohmeier, C. Brodbeck, L. Parkkonen, and M. S. Hämäläinen. MNE software for processing MEG and EEG data. NeuroImage, 86:446–460, 2014.
M. S. Hämäläinen and R. J. Ilmoniemi. Interpreting magnetic fields of the brain: minimum norm estimates. Medical & Biological Engineering & Computing, 32(1):35–42, Jan. 1994.
S. Haufe, V. V. Nikulin, A. Ziehe, K.-R. Müller, and G. Nolte. Estimating vector fields using sparse basis field expansions. In D. Koller, D. Schuurmans, Y. Bengio, and L. Bottou, editors, NeurIPS, pages 617–624. Curran Associates, Inc., 2009.
O. Hauk, D. G. Wakeman, and R. Henson. Comparison of noise-normalized minimum norm estimates for MEG analysis using multiple resolution metrics. NeuroImage, 54(3):1966–1974, 2011.
Y. Hochberg and A. C. Tamhane. Multiple comparison procedures. Wiley Series in Probability and Statistics. John Wiley & Sons, Inc., 1987.
A. E. Hoerl and R. W. Kennard. Ridge regression: biased estimation for nonorthogonal problems. Technometrics, 12(1):55–67, 1970.
A. Hoyos-Idrobo, Y. Schwartz, G. Varoquaux, and B. Thirion. Improving sparse recovery on structured images with bagged clustering. In , pages 73–76. IEEE, 2015.
L. Janson and W. Su. Familywise error rate control via knockoffs. Electron. J. Stat., 10(1):960–975, 2016.
A. Javanmard and A. Montanari. Confidence intervals and hypothesis testing for high-dimensional regression. J. Mach. Learn. Res., 15:2869–2909, 2014.
S. Khan, J. A. Hashmi, F. Mamashli, K. Michmizos, M. G. Kitzbichler, H. Bharadwaj, Y. Bekhti, S. Ganesan, K.-L. A. Garel, S. Whitfield-Gabrieli, R. L. Gollub, J. Kong, L. M. Vaina, K. D. Rana, S. M. Stufflebeam, M. S. Hämäläinen, and T. Kenet. Maturation trajectories of cortical resting-state networks depend on the mediating frequency band. NeuroImage, 174:57–68, 2018.
F. H. Lin, T. Witzel, S. P. Ahlfors, S. M. Stufflebeam, J. W. Belliveau, and M. S. Hämäläinen. Assessing and improving the spatial accuracy in MEG source localization by depth-weighted minimum-norm estimates. NeuroImage, 31(1):160–171, 2006.
K. Lounici, M. Pontil, S. van de Geer, and A. B. Tsybakov. Oracle inequalities and optimal inference under group sparsity. Ann. Statist., 39(4):2164–2204, 2011.
F. Lucka, S. Pursiainen, M. Burger, and C. Wolters. Hierarchical Bayesian inference for the EEG inverse problem using realistic FE head models: depth localization and source separation for focal primary currents. NeuroImage, 61(4):1364–1382, Apr. 2012.
J. Mandozzi and P. Bühlmann. Hierarchical testing in the high-dimensional setting with correlated variables. J. Amer. Statist. Assoc., 111(513):331–343, 2016.
M. Massias, A. Gramfort, and J. Salmon. Celer: a fast solver for the Lasso with dual extrapolation. In ICML, volume 80, pages 3315–3324, 2018a.
M. Massias, O. Fercoq, A. Gramfort, and J. Salmon. Generalized concomitant multi-task lasso for sparse multimodal regression. In International Conference on Artificial Intelligence and Statistics, pages 998–1007, 2018b.
M. Massias, S. Vaiter, A. Gramfort, and J. Salmon. Dual extrapolation for sparse generalized linear models. arXiv preprint arXiv:1907.05830, 2019.
K. Matsuura and Y. Okabe. Selective minimum-norm solution of the biomagnetic inverse problem. IEEE Trans. Biomed. Eng., 42(6):608–615, June 1995.
J. Mattout, M. Pélégrini-Issac, L. Garnero, and H. Benali. Multivariate source prelocalization (MSP): use of functionally informed basis functions for better conditioning the MEG inverse problem. NeuroImage, 26(2):356–373, 2005.
N. Meinshausen and P. Bühlmann. High-dimensional graphs and variable selection with the lasso. Ann. Statist., 34(3):1436–1462, 2006.
N. Meinshausen and P. Bühlmann. Stability selection. J. R. Stat. Soc. Ser. B Stat. Methodol., 72:417–473, 2010.
N. Meinshausen, L. Meier, and P. Bühlmann. P-values for high-dimensional regression. J. Amer. Statist. Assoc., 104(488):1671–1681, 2009.
R. Mitra and C.-H. Zhang. The benefit of group sparsity in group inference with de-biased scaled group lasso. Electron. J. Stat., 10(2):1829–1873, 2016.
A. Molins, S. M. Stufflebeam, E. N. Brown, and M. S. Hämäläinen. Quantification of the benefit from integrating MEG and EEG data in minimum L2-norm estimation. NeuroImage, 42(3):1069–1077, 2008.
T.-B. Nguyen, J.-A. Chevalier, and B. Thirion. Ecko: ensemble of clustered knockoffs for robust multivariate inference on fMRI data. In International Conference on Information Processing in Medical Imaging, pages 454–466, 2019.
G. Obozinski, B. Taskar, and M. I. Jordan. Joint covariate selection and joint subspace selection for multiple classification problems. Statistics and Computing, 20(2):231–252, 2010.
W. Ou, M. S. Hämäläinen, and P. Golland. A distributed spatio-temporal EEG/MEG inverse solver. NeuroImage, 44(3):932–946, Feb. 2009.
R. Pascual-Marqui. Standardized low resolution brain electromagnetic tomography (sLORETA): technical details. Methods Find. Exp. Clin. Pharmacology, 24(D):5–12, 2002.
S. Reid, R. Tibshirani, and J. Friedman. A study of error variance estimation in lasso regression. Stat. Sin., 26(1):35–67, 2016.
B. Stucky and S. van de Geer. Asymptotic confidence regions for high-dimensional structured sparsity. IEEE Trans. Signal Process., 66(8):2178–2190, 2018.
S. Taulu. Spatiotemporal signal space separation method for rejecting nearby interference in MEG measurements. Physics in Medicine and Biology, 51(7):1759–1769, 2006.
J. Taylor and R. J. Tibshirani. Statistical learning and selective inference. Proceedings of the National Academy of Sciences, 112(25):7629–7634, 2015.
R. Tibshirani. Regression shrinkage and selection via the lasso. J. R. Stat. Soc. Ser. B Stat. Methodol., 58(1):267–288, 1996.
S. van de Geer, P. Bühlmann, Y. Ritov, and R. Dezeure. On asymptotically optimal confidence regions and tests for high-dimensional models. Ann. Statist., 42(3):1166–1202, 2014.
G. Varoquaux, A. Gramfort, and B. Thirion. Small-sample brain mapping: sparse recovery on spatially correlated designs with randomization and clustering. In ICML, pages 1375–1382, 2012.
L. Wasserman and K. Roeder. High-dimensional variable selection. Ann. Statist., 37(5A):2178–2201, 2009.
D. Wipf and S. Nagarajan. A unified Bayesian framework for MEG/EEG source imaging. NeuroImage, 44(3):947–966, Feb. 2009.
C.-H. Zhang and S. S. Zhang. Confidence intervals for low dimensional parameters in high dimensional linear models. J. R. Stat. Soc. Ser. B Stat. Methodol., 76(1):217–242, 2014.
Appendix
A Formal definition of δ-FWER control

Now we give a more formal definition of the δ-FWER.

Definition A.1 (δ-family wise error rate). Given a family of (corrected) p-values $\hat{p} = (\hat{p}_j)_{j \in [p]}$ and a threshold $x \in (0, 1)$, the δ-FWER, also denoted $\mathrm{FWER}^{\delta}_{x}(\hat{p})$, is the probability of making at least one false discovery at a distance at least δ from the true support:

$$\mathrm{FWER}^{\delta}_{x}(\hat{p}) = \mathbb{P}\Big(\min_{j \in N^{\delta}} \hat{p}_j \le x\Big), \qquad (12)$$

with $N^{\delta} = \{ j \in [p] : \forall k \in \mathrm{Supp}(B),\ d(j, k) \ge \delta \}$, where $d(j, k)$ is the distance between sources $j$ and $k$.

Definition A.2 (δ-FWER control). We say that the family of (corrected) p-values $\hat{p} = (\hat{p}_j)_{j \in [p]}$ controls the δ-FWER if, for all $x \in (0, 1)$:

$$\mathrm{FWER}^{\delta}_{x}(\hat{p}) \le x. \qquad (13)$$

B Extended Restricted Eigenvalue assumption
Here, we rewrite (Lounici et al., 2011, Assumption 3.1), adjusting it to the multi-task Lasso case (a particular case of the more general group Lasso). Notice that for a given value of $T$, the assumption is equivalent to (Lounici et al., 2011, Assumption 4.1). Let $1 \le s \le p$ be an integer giving an upper bound on the sparsity $|\mathrm{Supp}(B)|$. The extended Restricted Eigenvalue assumption $\mathrm{RE}(X, s)$ is verified on $X$ for sparsity parameter $s$ and constant $\kappa = \kappa(s) > 0$ if:

$$\min \left\{ \frac{\|X\Theta\|}{\sqrt{nT}\,\|\Theta_J\|} : |J| \le s,\ \Theta \in \mathbb{R}^{p \times T} \setminus \{0\},\ \|\Theta_{J^C}\|_{2,1} \le 3\,\|\Theta_J\|_{2,1} \right\} \ge \kappa, \qquad (14)$$

where $J \subset [p]$, $J^C = [p] \setminus J$ denotes its complement, and $\Theta_J$ refers to the matrix $\Theta$ without the rows indexed by $J^C$.

C Adaptive quantile aggregation of p-values and ecd-MTLasso algorithm

In this section, we provide more details on how we aggregate the p-value maps created through the clustering randomization, and then give the full ecd-MTLasso algorithm.

For the $j$-th feature (or source), we have a vector $(p_j^{(b)})_{b \in [B]}$ of p-values, one computed for each of the $B$ clusterings. The final p-value of the $j$-th feature is then given by the adaptive quantile aggregation proposed by Meinshausen et al. (2009):

$$p_j = \min \left\{ (1 - \log(\gamma_{\min}))\ \inf_{\gamma \in (\gamma_{\min}, 1)} \left( \gamma\text{-quantile}\left\{ \frac{p_j^{(b)}}{\gamma} ;\ b \in [B] \right\} \right),\ 1 \right\},$$

where we have taken $\gamma_{\min} = 0.$ in our experiments. Taking a value of $\gamma_{\min}$ that is not too small (e.g., $\gamma_{\min} \ge 0.$) allows recovering sources that have received small p-values several times (e.g., at least for $B/$ different choices of clustering).

We give the full ecd-MTLasso procedure in Algorithm 2.

D Proofs
D.1 Probability lemma

Lemma D.1.
Let $\varepsilon \in \mathbb{R}^T$ be a centered Gaussian random vector with (symmetric positive definite) covariance $M \in \mathbb{R}^{T \times T}$. Then, the random variable $\varepsilon^\top M^{-1} \varepsilon$ follows a $\chi^2_T$ distribution.

Algorithm 2: ecd-MTLasso
input: X ∈ R^{n×p}, Y
param: C = 1000, B = 100
for b = 1, ..., B do
    X^(b) = sample(X)
    A^(b) = Ward(C, X^(b))
    Z^(b) = X A^(b)
    q^(b) = d-MTLasso(Z^(b), Y) · C          // corrected cluster-wise p-values at bootstrap b
    for j = 1, ..., p do
        p_j^(b) = q_r^(b) if j ∈ G_r         // corrected feature-wise p-values at bootstrap b
for j = 1, ..., p do
    p_j = aggregation((p_j^(b))_{b ∈ [B]})   // aggregated corrected feature-wise p-values
return p_j for j ∈ [p]

Proof.
Note first that since $M$ is symmetric positive definite, its square root $N \in \mathbb{R}^{T \times T}$ exists and is a symmetric positive definite matrix satisfying $N^2 = M$. This leads to:

$$\varepsilon^\top M^{-1} \varepsilon = (N^{-1}\varepsilon)^\top (N^{-1}\varepsilon).$$

We have that $N^{-1}\varepsilon$ is a centered Gaussian random vector, and its covariance matrix reads:

$$\mathbb{E}\big[(N^{-1}\varepsilon)(N^{-1}\varepsilon)^\top\big] = N^{-1}\,\mathbb{E}\big[\varepsilon\varepsilon^\top\big]\,N^{-1} = N^{-1} M N^{-1} = N^{-1} N^2 N^{-1} = \mathrm{Id}_T.$$

To conclude, $N^{-1}\varepsilon \in \mathbb{R}^T$ is a centered Gaussian vector with covariance $\mathrm{Id}_T$, hence its squared Euclidean norm $\|N^{-1}\varepsilon\|^2 = (N^{-1}\varepsilon)^\top (N^{-1}\varepsilon)$ follows a $\chi^2_T$ distribution.

D.2 Proof of Prop. 2.1
Now, we give a proof of Prop. 2.1:
Proof.
First, let us fix an index $j \in [p]$. Then, using Equation (7) we have:

$$\sqrt{n}\,\big(\hat{B}^{(\text{d-MTLasso})}_{j,\cdot} - B_{j,\cdot}\big) = \frac{\sqrt{n}\, z_j^\top E}{z_j^\top X_{\cdot,j}} - \sum_{k \ne j} \frac{\sqrt{n}\, z_j^\top X_{\cdot,k}\,\big(\hat{B}^{\mathrm{MTL}}_{k,\cdot} - B_{k,\cdot}\big)}{z_j^\top X_{\cdot,j}} = \Lambda_{j,\cdot} + \Delta_{j,\cdot}, \qquad (15)$$

where $\Lambda_{j,\cdot} = \frac{\sqrt{n}\, z_j^\top E}{z_j^\top X_{\cdot,j}}$ and $\Delta_{j,\cdot} = \sqrt{n} \sum_{k \ne j} P_{j,k}\,\big(B_{k,\cdot} - \hat{B}^{\mathrm{MTL}}_{k,\cdot}\big)$ with $P_{j,k} = \frac{z_j^\top X_{\cdot,k}}{z_j^\top X_{\cdot,j}}$.

Now, we show that $\Lambda_{j,\cdot} \sim \mathcal{N}(0, \hat{\Omega}_{j,j} M)$, or equivalently that $E^\top z_j \sim \mathcal{N}(0, n\,\|z_j\|^2 M)$. It is clear that $E^\top z_j$ is a centered Gaussian vector. Its covariance, denoted $V^{(j)} = \mathbb{E}(E^\top z_j z_j^\top E) \in \mathbb{R}^{T \times T}$, can be computed, for $t, t' \in [T]$, as:

$$V^{(j)}_{t,t'} = \mathbb{E}\big(E_{\cdot,t}^\top z_j\, z_j^\top E_{\cdot,t'}\big) = \mathbb{E}\big(z_j^\top E_{\cdot,t'}\, E_{\cdot,t}^\top z_j\big) \quad (\text{scalar values commute}) \quad = z_j^\top\, \mathbb{E}\big(E_{\cdot,t'} E_{\cdot,t}^\top\big)\, z_j.$$

Then, the noise structure in Equation (2) yields $V^{(j)}_{t,t'} = n\, M_{t,t'}\, z_j^\top z_j = n\,\|z_j\|^2\, M_{t,t'}$.

Now, we show that, with high probability, $\|\Delta\|_{\infty} = O\Big(\frac{s\lambda\sqrt{\log(p)}}{\kappa^2}\Big)$. First, notice that:

$$\|\Delta_{j,\cdot}\|_{\infty} \le \sqrt{n}\, \max_{k \ne j} |P_{j,k}|\ \big\|\hat{B}^{\mathrm{MTL}} - B\big\|_{2,1}.$$

For a convenient choice of the regularization parameters $\alpha$, using Bühlmann and van de Geer (2011, Lemma 2.1) and following the same approach as Dezeure et al. (2015, Appendix A.1), we obtain, with high probability:

$$\sqrt{n}\, \max_{k \ne j} |P_{j,k}| = O\big(\sqrt{\log(p)}\big).$$

Bounds on $\|\hat{B}^{\mathrm{MTL}} - B\|_{2,1}$ are also available in the literature (Lounici et al., 2011) for $\rho = 0$ and can be extended to $\rho > 0$ similarly.
Notably, provided $\rho = 0$, assuming A1 for a sparsity parameter $|\mathrm{Supp}(B^*)| \le s$, a given constant $\kappa = \kappa(s) > 0$, and a choice of $\lambda$ large enough in Equation (4), (Lounici et al., 2011, Theorem 3.1) directly gives the following bound, with high probability:

$$\big\|\hat{B}^{\mathrm{MTL}} - B\big\|_{2,1} = O\Big(\frac{s\lambda}{\kappa^2}\Big).$$

Remark D.1.
Following van de Geer et al. (2014), to neglect $\Delta$ we need $\|\Delta\|_{\infty} = o(1)$. This condition is verified if $s = o\Big(\frac{\kappa^2}{\lambda\sqrt{\log(p)}}\Big)$.

D.3 Proof of Prop. 2.2
Before starting the proof, let us give more precision on assumption A2; the complete assumption is the following:

(A2) there exists $\Gamma \in \mathbb{R}^{C \times T}$ such that $\Gamma_{r,\cdot} = \sum_{j \in G_r} w_j B_{j,\cdot}$ with $w_j \ge 0$ for all $j \in [p]$, so that the associated compression loss $XB - Z\Gamma$ is bounded as follows:

$$\|XB - Z\Gamma\|_{2,2}^2 \le \frac{\xi^2\, T\, \phi(M)}{n} = \frac{\xi^2\, T\, \phi(R)\,\sigma^2}{n}, \qquad (16)$$

where $\xi > 0$ is an arbitrarily small constant, $\phi(M) > 0$ is the smallest eigenvalue of $M$ and $\phi(R) > 0$ is the smallest eigenvalue of $R$, the temporal correlation matrix of the noise defined by $R = M / \sigma^2$. The hypothesis plainly means that the noise induced by the compression of the design matrix is small enough with respect to the model noise.

Now we give a proof of Prop. 2.2:

Proof.
First, we derive the d-MTLasso for the compressed problem, for $r \in [C]$:

$$\hat{\Gamma}^{(\text{d-MTLasso})}_{r,\cdot} = \frac{a_r^\top Y}{a_r^\top Z_{\cdot,r}} - \sum_{l \ne r} \frac{a_r^\top Z_{\cdot,l}\, \hat{\Gamma}^{\mathrm{MTL}}_{l,\cdot}}{a_r^\top Z_{\cdot,r}}, \qquad (17)$$

where the $a_r$'s are the residuals obtained by nodewise Lasso on $Z$, playing the same role as the $z_j$'s in Equation (7). Then, as done in Appendix D.2, we derive:

$$\sqrt{n}\,\big(\hat{\Gamma}^{(\text{d-MTLasso})}_{r,\cdot} - \Gamma_{r,\cdot}\big) = \frac{\sqrt{n}\, a_r^\top E}{a_r^\top Z_{\cdot,r}} - \sum_{l \ne r} \frac{\sqrt{n}\, a_r^\top Z_{\cdot,l}\,\big(\hat{\Gamma}^{\mathrm{MTL}}_{l,\cdot} - \Gamma_{l,\cdot}\big)}{a_r^\top Z_{\cdot,r}} + \frac{\sqrt{n}\, a_r^\top (XB - Z\Gamma)}{a_r^\top Z_{\cdot,r}} = \Lambda'_{r,\cdot} + \Delta'_{r,\cdot} + \Pi_{r,\cdot}. \qquad (18)$$

We treat $\Lambda'$ and $\Delta'$ as in Appendix D.2, assuming that the hypotheses used to bound (hence neglect) $\Delta'$ are verified (notably A3).

Next, for $r \in [C]$, we want to establish that $\frac{n\,\|\Pi_{r,\cdot}\|^2_{M^{-1}}}{T\, \hat{\Omega}'_{r,r}}$ is negligible, i.e., that $\Pi$ has a negligible effect on all decision statistics, where the covariance $\hat{\Omega}'$ has the following generic diagonal term:

$$\hat{\Omega}'_{r,r} = \frac{n\, \|a_r\|^2}{|a_r^\top Z_{\cdot,r}|^2}.$$

Given that

$$\|\Pi_{r,\cdot}\|^2_{M^{-1}} = \frac{n\, \|a_r^\top (XB - Z\Gamma)\|^2_{M^{-1}}}{|a_r^\top Z_{\cdot,r}|^2} \qquad (19)$$

$$\le \frac{n\, \|a_r\|^2}{|a_r^\top Z_{\cdot,r}|^2}\ \frac{\|XB - Z\Gamma\|^2_{2,2}}{\phi(M)}, \qquad (20)$$

where $\|\cdot\|_{2,2}$ denotes the spectral norm, we obtain that

$$\frac{n\,\|\Pi_{r,\cdot}\|^2_{M^{-1}}}{T\, \hat{\Omega}'_{r,r}} \le \frac{n\, \|XB - Z\Gamma\|^2_{2,2}}{T\, \phi(M)} \le \xi^2. \qquad (21)$$

Thus, if A2 is verified for $\xi$ small enough, we can also neglect $\Pi$ in front of $\Lambda'$. Then, neglecting $\Pi$ and $\Delta'$, we have:

$$\sqrt{n}\,\big(\hat{\Gamma}^{(\text{d-MTLasso})}_{r,\cdot} - \Gamma_{r,\cdot}\big) \sim \mathcal{N}\big(0,\ \hat{\Omega}'_{r,r}\, M\big). \qquad (22)$$

Then we can construct p-values that test the $r$-th null hypothesis $H_0^{(r)}$: "$\Gamma_{r,\cdot} = 0$", applying the same technique as in Sec. 2.4.
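As an illustration, a p-value for $H_0^{(r)}$ can be obtained from the $\chi^2_T$ statistic suggested by Lemma D.1 applied to the Gaussian limit in (22). The sketch below is a minimal numerical illustration; the variable names (`gamma_hat_r`, `omega_rr`) are hypothetical stand-ins for the quantities defined above:

```python
import numpy as np
from scipy import stats

def cluster_pvalue(gamma_hat_r, omega_rr, M, n):
    """p-value for H0: Gamma_{r,.} = 0. Under the Gaussian limit,
    sqrt(n) * (Gamma_hat_r - Gamma_r) ~ N(0, omega_rr * M), so by
    Lemma D.1 the statistic
        n * Gamma_hat_r^T (omega_rr * M)^{-1} Gamma_hat_r
    follows a chi^2 distribution with T degrees of freedom under H0."""
    T = len(gamma_hat_r)
    stat = n * gamma_hat_r @ np.linalg.solve(omega_rr * M, gamma_hat_r)
    return stats.chi2.sf(stat, df=T)
```

Since the statistic is $\chi^2_T$-distributed under the null, `stats.chi2.sf` gives a valid p-value; a Bonferroni correction over the $C$ clusters then yields cluster-wise FWER control.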
By correcting these p-values, e.g., using the Bonferroni correction (Dunn, 1961), i.e., multiplying the initial p-values by $C$, we obtain cluster-wise corrected p-values that control the FWER. Since, for all $r \in [C]$, $\Gamma_{r,\cdot}$ is a linear combination of the $B_{j,\cdot}$ for $j \in G_r$, we have $\Gamma_{r,\cdot} \ne 0$ only if there exists $j \in G_r$ such that $B_{j,\cdot} \ne 0$. Then, defining the feature-wise corrected p-values as the corrected p-value of the corresponding cluster, and assuming that clusters are at most of size $\delta$, such corrected p-values control the δ-FWER.

Remark D.2.
In assumption A2, having a positive linear combination is not necessary; a simple linear combination is sufficient. However, we assumed that $\Gamma_{r,\cdot}$ is a positive linear combination of the $B_{j,\cdot}$ for $j \in G_r$ in order to get the following desired property: if additionally, for $r \in [C]$, for all $j \in G_r$ and all $k \in G_r$ we have $\mathrm{sign}(B_{j,\cdot}) = \mathrm{sign}(B_{k,\cdot})$, then $\mathrm{sign}(\Gamma_{r,\cdot}) = \mathrm{sign}(B_{j,\cdot})$ (zero being both positive and negative). This means that if all the feature weights in a cluster have the same sign, there exists a compression verifying A2 such that the cluster weight preserves the sign.

D.4 Proof of Prop. 2.3

Proof.
Assuming the hypotheses of Prop. 2.3 and applying Prop. 2.2, we can, for each of the $B$ compressions of the problem in Equation (1), construct a corrected p-value family that controls the δ-FWER. Applying the quantile aggregation method in Equation (15), we derive a corrected p-value family that accounts for each compression choice. Applying Meinshausen et al. (2009, Theorem 3.2), this aggregated corrected p-value family also controls the δ-FWER.

E Computational aspects
Here we give some elements about the computational aspects of the algorithms we propose. For solving Lasso or multi-task Lasso problems, we rely for additional speed-up on celer (Massias et al., 2018a, 2019), a solver which is much more efficient than standard coordinate descent (speed-up by more than 10x in our experiments).

To compute d-MTLasso, we must solve $p$ Lasso problems of size $(n, p - 1)$ and one multi-task Lasso with cross-validation on a dataset of size $(n, p, T)$. For $n = 200$, $p = 7500$ and $T = 10$, the algorithms can be run on a standard laptop in around hours (using only 1 CPU). However, the algorithm is embarrassingly parallel and requires around minutes if run on a machine with CPUs. To compute cd-MTLasso, we must solve $C$ Lasso problems of size $(n, C - 1)$ and one multi-task Lasso with cross-validation on a dataset of size $(n, C, T)$. For $n = 200$, $C = 1000$ and $T = 10$, it can be run on a standard local device in less than minute (using only 1 CPU). Finally, to compute ecd-MTLasso, we must solve $B$ cd-MTLasso problems. For $B = 100$ ( is already a good value to get most of the advantages of ensembling), $n = 200$, $C = 1000$ and $T = 10$, it can be run on a standard laptop in around hour (using only 1 CPU) and around minute on a machine with CPUs.

Although, when using coordinate-descent-like algorithms, the complexity depends on solver parameters such as the tolerance of the stopping criterion, the complexity in $C$ (or $p$) appears empirically to be cubic, while it is linear in $n$ and $T$. It is also linear in $B$.

F Detailed data description
For AEF and VEF, the data contained one artifactual channel, leading to n = 203, while for SEF the data were preprocessed for removal of environmental noise, leading to an effective number of samples of n = 64 (Taulu, 2006). For the AEF dataset, we report results for AEFs evoked by left auditory stimulation with pure tones of 500 Hz. The analysis window for source estimation was chosen from 50 ms to 200 ms based on visual inspection of the evoked data to capture the dominant N100m component, leading to T = 6. For the SEF dataset, we analyzed SEFs evoked by bipolar electrical stimulation (0.2 ms in duration) of the left median nerve. To capture the main peaks of the evoked response and exclude the strong stimulus artifact, the analysis window was chosen from 18 ms to 200 ms based on visual inspection of the sensor signal.

Preprocessing was done following the standard pipeline of the MNE software (Gramfort et al., 2014). Baseline correction using pre-stimulus data (from -200 ms to 0 ms) was applied. Epochs with peak-to-peak amplitudes exceeding predefined rejection parameters (3 pT for magnetometers, 400 pT/m for gradiometers, and 150 µV for EOG on AEF and VEF, 350 µV for SEF) were assumed to be affected by artifacts and discarded. This resulted in 55 (AEF), 67 (VEF) and 111 (SEF) artifact-free measurements, which were averaged to produce the target matrix Y. The gain matrix was computed using a set of p = 7498 cortical locations and a three-layer boundary element model.

G Related Work
The topic of high-dimensional inference has been addressed in many recent works. Yet, to the bestof our knowledge, none of this literature has been applied to the source localization problem weconsider here. https://github.com/mathurinm/CELER
• The idea of associating clustering with high-dimensional inference can also be found in recent works with application to genetic data: Bühlmann et al. (2013) used a fixed clustering step, which is made adaptive in Mandozzi and Bühlmann (2016). Our contribution deviates from these works in two regards. Unlike Bühlmann et al. (2013), we do not consider that a fixed clustering, however good, captures the essence of the problem: this is why we resort to an ensemble of different clustering solutions. Unlike Mandozzi and Bühlmann (2016), we do not try to narrow down the inference in a hierarchical fashion, because we do not consider that source imaging can in effect be traced down to the vertex level: given the difficulty of the source imaging problem, we find it more satisfactory to outline a region of putative activity.
• Another family of inference methods, based on sample splitting, has been introduced by Meinshausen et al. (2009): train data are used to select regions, test data to assess their statistical significance. The choice of splits can be varied and aggregated upon to mitigate the impact of arbitrary split selection. However, data splitting has a high cost in terms of statistical power, making these approaches weakly sensitive (Taylor and Tibshirani, 2015).
• An alternative method yielding family-wise error rate (FWER) control is stability selection, which builds on bootstrapped randomized sparse regression (Meinshausen and Bühlmann, 2010). Yet, this approach has been found too weakly sensitive and it has not been considered in further statistical inference works, see e.g.,
Dezeure et al. (2015).
• Post-selection inference (Taylor and Tibshirani, 2015) is an approach that typically relies on a sparse estimator (such as the Lasso) and then assesses the significance of the selected variables. It accounts for the selection in the inference process, avoiding the undesirable bias of selecting and testing on the same data. However, we have not found an implementation that scales in a numerically sound way to the problem sizes that we consider here: thousands of features, even after clustering.
• Knockoff inference (with or without clustering) is probably the most recent alternative developed for high-dimensional inference (Barber and Candès, 2019): it consists in appending a noisy copy of the problem features and selecting only the variables that are much more significantly associated than their noisy copy. While this approach is computationally relevant for the problem at hand, it suffers from the arbitrariness of the knockoff variable set used; it yields control of the false discovery rate of the detection problem, which is not directly comparable with the family-wise error rate (FWER) considered here. FWER control is possible with knockoffs (Janson and Su, 2016), yet it is very weakly sensitive.
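In contrast with these alternatives, the ensembling route retained here (Appendix C, Algorithm 2) is cheap to implement once the cluster-wise inference is done. The sketch below only covers the last two loops of Algorithm 2; the corrected cluster-wise p-values and the per-run cluster labels are hypothetical inputs standing in for the clustered d-MTLasso step, and the choice `gamma_min = 0.1` and the grid over γ are illustrative assumptions:

```python
import numpy as np

def aggregate_pvalues(pvals_b, gamma_min=0.1):
    """Adaptive quantile aggregation (Meinshausen et al., 2009) of the
    B p-values of one feature: min(1, (1 - log(gamma_min)) *
    inf_gamma quantile_gamma({p_b / gamma})), with gamma searched over
    a finite grid (illustrative choice)."""
    pvals_b = np.asarray(pvals_b, dtype=float)
    q = min(np.quantile(pvals_b / g, g)
            for g in np.linspace(gamma_min, 1.0, 50))
    return min(1.0, (1.0 - np.log(gamma_min)) * q)

def ensemble_feature_pvalues(cluster_pvals_runs, labels_runs):
    """Map the corrected cluster-wise p-values of each run to features
    through the cluster labels, then aggregate across runs (the last
    two loops of Algorithm 2)."""
    feat_runs = np.stack([np.asarray(q)[np.asarray(lab)]
                          for q, lab in zip(cluster_pvals_runs, labels_runs)])
    return np.array([aggregate_pvalues(feat_runs[:, j])
                     for j in range(feat_runs.shape[1])])
```

A feature that repeatedly receives a small p-value across clusterings keeps a small aggregated p-value, while a feature that is significant under only one arbitrary clustering does not.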
Supplementary figures
Figure 5:
Illustrating spatial tolerance of size δ = 20 mm and δ = 40 mm. The true source in red has a 10 mm radius (distance measured on the cortical surface) and the spatial tolerance extends this region by 20 mm on the left side and by 40 mm on the right side (in yellow). The δ-FWER is the probability of making false discoveries outside of the extended region. A false discovery made in the yellow region is thus counted neither as an error nor as a true positive.

[Figure 6 panels: connected feature correlation density; inter-feature correlation density; connected cluster correlation density (C=1000); inter-cluster correlation density (C=1000)]
Figure 6:
Illustrating correlation in the MNE sample MEG data. (left): Distribution of the maximum correlation between a feature (resp. cluster) and another connected feature (resp. cluster). (Top) the maximum connected feature correlation is close to 0.98 on average. (Bottom) the maximum connected cluster correlation is lower, close to 0.9 on average. Clustering significantly improves conditioning. (right): The density of the inter-feature correlation (top) looks similar to the density of the inter-cluster correlation (bottom). Focusing on the extreme values of correlation, we see a slight decrease of extreme values for the clustered data.
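The connected-feature correlation summarized above can be computed with a few lines of numpy; the design matrix and the neighborhood lists below are illustrative stand-ins for the gain matrix and the cortical mesh adjacency:

```python
import numpy as np

def max_connected_correlation(X, neighbors):
    """For each column j of the design matrix X, return the maximum
    absolute correlation between column j and its connected columns.
    neighbors[j] is the list of indices adjacent to j (e.g. derived
    from the cortical mesh for features, or from cluster adjacency)."""
    # Correlation matrix between the columns of X
    corr = np.corrcoef(X, rowvar=False)
    return np.array([np.abs(corr[j, neighbors[j]]).max()
                     for j in range(X.shape[1])])
```

Applied to features and then to clusters (after averaging columns within each cluster), this yields the two left-hand histograms of Figure 6.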
Figure 7:
Spatial Dispersion (SD) histograms. (left): SD at a fixed time point (Hauk et al., 2011). All methods lead to comparable spatial dispersion. (right): SD for the desparsified multi-task Lasso (d-MTLasso) with an increasing number of time points. See Figure 2 for PLE histograms on the same experiments.
Figure 8: Precision-Recall curves (sLORETA, d-MTLasso AR0, d-MTLasso AR1, cd-MTLasso, ecd-MTLasso). See Figure 3 for δ-Precision-Recall curves computed on the same data.
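The empirical δ-FWER reported in Figure 9 can be estimated over simulation runs as follows; this is a sketch in which the 1-D index distance d(j, k) = |j - k| is an illustrative surrogate for the distance on the cortical surface:

```python
import numpy as np

def empirical_delta_fwer(pval_runs, support, delta, x=0.05):
    """Fraction of simulation runs with at least one p-value <= x on a
    feature at distance >= delta from every true source, i.e. in the
    set N^delta of Definition A.1. pval_runs has shape (runs, p);
    support lists the indices of the true sources."""
    support = np.asarray(support)
    p = pval_runs.shape[1]
    dist = np.array([np.abs(support - j).min() for j in range(p)])
    far = dist >= delta  # features belonging to N^delta
    return (pval_runs[:, far] <= x).any(axis=1).mean()
```

Discoveries inside the δ-tolerance region are simply ignored, matching the accounting of Figure 5.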
Figure 9: ecd-MTLasso empirical δ-FWER and precision-recall for different choices of cluster size. (left): Running the same simulation as in Sec. 3.2, we observe that the spatial tolerance δ can be reduced to 20 mm by increasing the number of clusters. With C = 1000 clusters (resp. C = 2000, C = 4000), the average cluster diameter is around 18 mm (resp. 13 mm and 9 mm). It turns out that the δ-FWER is controlled for around twice the diameter (if the compressed design matrix verifies assumption A1). (right): This decrease in spatial tolerance comes at a price regarding support recovery: the precision-recall curve declines when C is increased. (both): Note that we need to set the hyper-parameter c used to compute the regularization parameters α (see the note accompanying Equation (5)). We found empirically that it should be inversely proportional to C: for C = 1000, c = 0.; for C = 2000, c = 0.; for C = 4000, c = 0..

Figure 10: Comparison on the auditory dataset on both hemispheres.
From left to right are compared sLORETA, d-MTLasso without AR modeling (noise assumed non-autocorrelated), d-MTLasso with an AR1 noise model and ecd-MTLasso also using an AR1. The results correspond to auditory (top) evoked fields. Colormaps are fixed across datasets and adjusted based on meaningful statistical thresholds in order to outline FWER control issues.

Figure 11: