Bayesian Structured Sparsity Priors for EEG Source Localization Technical Report
Facundo Costa, Hadj Batatia, Thomas Oberlin, Jean-Yves Tourneret
ENSEEIHT
October 17, 2018
Abstract
This report introduces a new hierarchical Bayesian model for the EEG source localization problem. This model promotes structured sparsity to search for focal brain activity. This sparsity is obtained via a multivariate Bernoulli Laplacian prior assigned to the brain activity, approximating an $\ell_{20}$ pseudo-norm regularization in a Bayesian framework. A partially collapsed Gibbs sampler is used to draw samples asymptotically distributed according to the posterior associated with the proposed Bayesian model. The generated samples are used to estimate the brain activity and the model hyperparameters jointly in an unsupervised framework. Two different kinds of Metropolis-Hastings moves are introduced to accelerate the convergence of the Gibbs sampler. The first move is based on multiple dipole shifts within each MCMC chain, whereas the second one exploits proposals associated with different MCMC chains. We use both synthetic and real data to compare the performance of the proposed method with the weighted $\ell_{21}$ mixed-norm regularization and a method based on multiple sparse priors, showing that our algorithm presents advantages in several scenarios.

∗ An extended version of a paper submitted for publication: Bayesian Structured Sparsity Priors for EEG Source Localization.
1. INTRODUCTION
The EEG source localization problem continues to receive considerable attention in the literature, resulting in a wide array of methods developed in recent years. These can be classified in two groups: (i) the dipole-fitting models, which represent the brain activity as a small number of dipoles with unknown positions; and (ii) the distributed-source models, which represent it as a large number of dipoles in fixed positions. Dipole-fitting models [1, 2] try to estimate the amplitudes, orientations and positions of a few dipoles that explain the measured data. Unfortunately, these models usually provide solutions that vary with the initial guess of the number of dipoles and with their initial locations, due to the presence of a large number of local minima in their cost function [3]. Several algorithms based on MUSIC were developed to solve this problem [4–7]. In addition, sequential Monte Carlo methods were also investigated to estimate the dipole-fitting model parameters [8]. If the brain activity is composed of a small number of clustered sources, the dipole-fitting algorithms are capable of providing good results [9, 10]. However, their performance deteriorates when detecting multiple spatially extended sources [3]. On the other hand, the distributed-source methods model the brain activity as the result of a large number of discrete dipoles with fixed positions and try to estimate their amplitudes and orientations [3]. Since the number of dipoles used in the brain model is typically much larger than the number of electrodes, the inverse problem is ill-posed in the sense that there is an infinite number of brain activities that can explain the measurements [3]. A regularization is thus needed to incorporate additional information into this inverse problem. One of the simplest regularizations consists of penalizing the $\ell_2$ norm of the solution, as in the minimum norm estimation algorithm [11] or its variants based on the weighted minimum norm: LORETA [12] and sLORETA [13]. However, these methods have been shown to overestimate the size of the active area when the brain activity is focal [3], which is believed to be the case in a number of medical applications. A better way to estimate focal brain activity is to promote sparsity by applying an $\ell_0$ pseudo-norm regularization [14]. Unfortunately, this procedure is known to be intractable in an optimization framework. As a consequence, the $\ell_0$ pseudo-norm is usually approximated by the $\ell_1$ norm via convex relaxation [15], in spite of the fact that these two approaches do not always provide the same solution [14]. In a previous work, we proposed to combine them in a Bayesian framework [16], using the $\ell_0$ penalty to locate the non-zero positions and the $\ell_1$ norm to estimate their amplitudes. However, this $\ell_0 + \ell_1$ method, similarly to $\ell_0$ and $\ell_1$ separately, considers each time sample independently, leading in some cases to unrealistic solutions [17].

To improve source localization, it is possible to exploit the temporal structure of the data by promoting structured sparsity, which is known to yield better results than standard sparsity when applied to strongly group-sparse signals [18]. Structured sparsity has been shown to improve results in several applications including audio restoration [19], image analysis [20] and machine learning [21].
One way of applying structured sparsity to EEG source localization is to use mixed-norm regularizations such as the $\ell_{21}$ mixed norm [17] (also referred to as group lasso). This approach promotes sparsity among different dipoles (via the $\ell_1$ portion of the norm) but groups all the time samples of the same dipole together, forcing them to be either jointly active or inactive (via the $\ell_2$ portion). However, it has several drawbacks, including the manual tuning of the regularization parameter.

In addition to optimization techniques, several approaches have tried to model the time evolution of the dipole activity and to estimate it using either Kalman filtering [22, 23] or particle filters [24–26]. Several Bayesian methods have been used as well, both in dipole-fitting models [27, 28] and distributed-source models [29, 30]. In [29], Friston et al. developed the multiple sparse priors (MSP) approach, in which the brain is parcellated into pre-defined regions and all the dipoles in each region are promoted to be jointly active or inactive. Doing this encourages the brain activity to extend over an area instead of being focused in point-like sources. Conversely, we are mainly interested in estimating point-like focal source activity, which has been proved to be relevant in clinical applications [31]. To do so, we will consider each dipole separately instead of grouping them together. Note that this approach avoids the need of choosing a criterion for brain parcellation as required in the MSP method.

This report introduces a new hierarchical Bayesian model that estimates the brain activity with a structured sparsity constraint by using a multivariate Bernoulli Laplace prior (approximating the weighted $\ell_{20}$ mixed norm). Since the parameters of the proposed model cannot be computed with closed-form expressions, we investigate Markov chain Monte Carlo (MCMC) sampling techniques to draw samples that are asymptotically distributed according to the posterior of the proposed model. We then estimate jointly the brain activity, the model parameters and the hyperparameters in an unsupervised framework. To prevent the sampler from getting stuck around local maxima, specific Metropolis-Hastings moves are introduced, allowing new modes of the posterior to be explored. These moves are based on multiple dipole shifts (moving active dipoles to neighboring positions) and inter-chain proposals (exchanging samples between parallel MCMC chains) that significantly accelerate the convergence speed of the proposed sampler. These proposals generate candidates that are accepted or rejected using a Metropolis-Hastings criterion. The method is applied to both synthetic and real data, showing promising results compared to the more traditional $\ell_{21}$ mixed norm and the MSP method.

The report is organized as follows: Section 2 describes the considered problem. The proposed Bayesian model is presented in Section 3. Section 4 introduces the partially collapsed Gibbs sampler used to generate samples distributed according to the posterior of this model. The Metropolis-Hastings moves used to accelerate the convergence of the sampler are introduced in Section 5. Experimental results conducted on both synthetic and real data are presented in Section 6. Conclusions are finally reported in Section 7. Appendices A and B include the algebraic derivations of the conditional distributions used in the Gibbs sampler and of the acceptance rate of the Metropolis-Hastings moves, respectively.
2. PROBLEM STATEMENT
EEG source localization is an inverse problem that consists in estimating the brain activity of a patient from EEG measurements taken from $M$ electrodes during $T$ time samples. In a distributed-source model, the brain activity is represented by a finite number of dipoles located at fixed positions in the brain cortex. More precisely, we consider $N$ dipoles located on the surface of the brain cortex and oriented orthogonally to it (motivations for this choice can be found in [32]). The EEG measurement matrix $Y \in \mathbb{R}^{M \times T}$ can be written

$$Y = HX + E \qquad (2.1)$$

where $X \in \mathbb{R}^{N \times T}$ contains the dipole amplitudes, $H \in \mathbb{R}^{M \times N}$ represents the head operator (usually called the "leadfield matrix"), and $E$ is the measurement noise. Denote as $m^i$ the $i$-th row of a matrix $M$ and as $m_j$ its $j$-th column. Thus, the EEG source localization problem consists in estimating the matrix $X$ from the known operator $H$ and the measurements $Y$. The next section introduces the hierarchical Bayesian method proposed to solve this inverse problem.
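For concreteness, the following minimal sketch simulates the linear forward model (2.1) with numpy. The dimensions match the synthetic experiments of Section 6 ($M = 41$ electrodes, $N = 212$ dipoles); the random leadfield, the chosen active rows and the damping constant are illustrative placeholders, not values from the report:

```python
import numpy as np

rng = np.random.default_rng(0)
M, N, T = 41, 212, 100            # electrodes, dipoles, time samples

H = rng.standard_normal((M, N))   # placeholder leadfield (a real one comes from a head model)
X = np.zeros((N, T))              # row-sparse brain activity: only 3 active dipoles
t = np.arange(T) / 200.0          # 200 Hz sampling
for i in (10, 80, 150):           # arbitrary active rows
    X[i] = np.exp(-5 * t) * np.sin(2 * np.pi * 10 * t)

sigma_n = 0.1
Y = H @ X + sigma_n * rng.standard_normal((M, T))  # noisy EEG measurements (2.1)
# Since N >> M, recovering X from Y is ill-posed without regularization.
```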
3. BAYESIAN MODEL

3.1. LIKELIHOOD
As is classical in the literature, we consider an additive white Gaussian noise with a constant variance $\sigma_n^2$ over the considered time samples [3]. Note that when this assumption does not hold, it is possible to estimate the noise covariance matrix from measurements that do not contain the signal of interest and to use it to whiten the data [33]. This leads to the Gaussian likelihood

$$f(Y \mid X, \sigma_n^2) = \prod_{t=1}^{T} \mathcal{N}\left(y_t \,\middle|\, H x_t,\, \sigma_n^2 I_M\right) \qquad (3.1)$$

where $I_M$ is the identity matrix of size $M \times M$.
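As a side note, the whitening mentioned above is commonly implemented through the Cholesky factor of a noise covariance estimated from baseline segments. The sketch below is a generic illustration of this preprocessing step (function and variable names are ours, not the report's); in practice the leadfield $H$ must be premultiplied by the same operator so that model (2.1) still holds:

```python
import numpy as np

def whiten(Y, baseline):
    """Whiten measurements Y (M x T) using a noise covariance estimated
    from a baseline segment (M x T0) assumed to contain no signal."""
    cov = np.cov(baseline)            # (M x M) noise covariance estimate
    L = np.linalg.cholesky(cov)       # cov = L @ L.T
    W = np.linalg.inv(L)              # whitening operator
    return W @ Y                      # whitened data has ~identity noise covariance
```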
3.2. PRIOR DISTRIBUTIONS

3.2.1. PRIOR OF THE BRAIN ACTIVITY X

To promote structured sparsity of the source activity, we first consider the weighted $\ell_{20}$ mixed pseudo-norm

$$\|X\|_{20} = \#\left\{i : \sqrt{v_i}\,\|x^i\|_2 \neq 0\right\} \qquad (3.2)$$

where $v_i = \|h_i\|_2^2$ is a weight introduced to compensate for the depth-weighting effect [3, 15] and $\#S$ denotes the cardinality of the set $S$. Since this pseudo-norm leads to intractable computations, we propose to approximate it by a multivariate Bernoulli Laplace prior for each row of $X$

$$f(x^i \mid z_i, \lambda) \propto \begin{cases} \delta(x^i) & \text{if } z_i = 0 \\ \exp\left(-\lambda \sqrt{v_i}\,\|x^i\|_2\right) & \text{if } z_i = 1 \end{cases} \qquad (3.3)$$

where $\lambda$ is the exponential distribution parameter and $z \in \{0, 1\}^N$ is a vector indicating whether the rows of $X$ are non-zero. To make the analysis easier, it is convenient to define the hyperparameter $a = \sigma_n^2 \lambda^2$, which transforms the prior into

$$f(x^i \mid z_i, a, \sigma_n^2) \propto \begin{cases} \delta(x^i) & \text{if } z_i = 0 \\ \exp\left(-\sqrt{\frac{v_i a}{\sigma_n^2}}\,\|x^i\|_2\right) & \text{if } z_i = 1. \end{cases} \qquad (3.4)$$

The elements $z_i$ are assigned a Bernoulli prior distribution with parameter $\omega \in [0, 1]$:

$$f(z_i \mid \omega) = \mathcal{B}(z_i \mid \omega). \qquad (3.5)$$

Note that the Dirac delta function $\delta(\cdot)$ in the prior of $x^i$ promotes sparsity while the Laplace distribution regulates the amplitudes of the non-zero rows. The parameter $\omega$ tunes the balance between them: setting $\omega = 0$ forces $X = 0$, whereas $\omega = 1$ assigns a Laplace prior to every row. In order to obtain tractable conditional distributions, we take advantage of the fact that the multivariate Laplace distribution can be expressed as a Gaussian scale mixture, introducing a latent variable vector $\tau \in (\mathbb{R}^+)^N$ as proposed in [35]. More precisely, we use the following gamma and Bernoulli-Gaussian priors on $\tau_i$ and $x^i$ respectively
$$f(\tau_i \mid a) = \mathcal{G}\left(\tau_i \,\middle|\, \frac{T+1}{2},\, \frac{v_i a}{2}\right) \qquad (3.6)$$

$$f(x^i \mid z_i, \tau_i, \sigma_n^2) = \begin{cases} \delta(x^i) & \text{if } z_i = 0 \\ \mathcal{N}\left(x^i \,\middle|\, 0,\, \sigma_n^2 \tau_i I_T\right) & \text{if } z_i = 1 \end{cases} \qquad (3.7)$$

so that marginalizing (3.6)-(3.7) with respect to $\tau_i$ recovers the prior of $x^i$ (3.4) [35]. In addition, assuming the elements of $\tau$ and $z$ and the rows of $X$ a priori independent leads to the following priors:

$$f(\tau \mid a) = \prod_{i=1}^{N} f(\tau_i \mid a), \qquad f(z \mid \omega) = \prod_{i=1}^{N} f(z_i \mid \omega), \qquad f(X \mid z, \tau, \sigma_n^2) = \prod_{i=1}^{N} f(x^i \mid z_i, \tau_i, \sigma_n^2).$$

3.2.2. PRIOR OF THE NOISE VARIANCE $\sigma_n^2$

The noise variance is assigned a Jeffreys prior

$$f(\sigma_n^2) \propto \frac{1}{\sigma_n^2}\, 1_{\mathbb{R}^+}(\sigma_n^2) \qquad (3.8)$$

where $1_{\mathbb{R}^+}(\xi) = 1$ if $\xi \in \mathbb{R}^+$ and 0 otherwise. This choice is very classical when no information about a scale parameter is available (see [36] for details).
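The scale-mixture construction (3.6)-(3.7) can be checked numerically: drawing $\tau_i$ from its gamma prior and then $x^i \mid \tau_i$ from the Gaussian reproduces the multivariate Laplace prior (3.4). A minimal sketch (parameter values arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
T, a, v_i, sigma2 = 100, 2.0, 1.0, 1.0

def sample_row_prior():
    """Draw one non-zero row x^i from the Bernoulli-Laplace prior (z_i = 1)
    via the Gaussian scale mixture (3.6)-(3.7)."""
    tau = rng.gamma(shape=(T + 1) / 2, scale=2.0 / (v_i * a))   # eq. (3.6), rate v_i*a/2
    return rng.normal(0.0, np.sqrt(sigma2 * tau), size=T)       # eq. (3.7)

# Empirically, the norms ||x^i||_2 follow the density implied by (3.4).
norms = np.array([np.linalg.norm(sample_row_prior()) for _ in range(10000)])
```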
3.3. HYPERPARAMETER PRIORS

The proposed method balances the sparsity of the solution against fidelity to the measurements using two hyperparameters: 1) $\omega$, which adjusts the proportion of non-zero rows, and 2) $a$, which controls the amplitudes of the non-zeros. The corresponding hierarchy of parameters and hyperparameters is shown in Fig. 3.1. In contrast to the $\ell_{21}$ mixed norm, our algorithm does not require these hyperparameters to be adjusted manually but estimates their values from the data by assigning hyperpriors to them, following a so-called hierarchical Bayesian analysis. This section defines the priors assigned to the model hyperparameters.

Figure 3.1: Directed acyclic graph for the proposed Bayesian model.

3.3.1. HYPERPRIOR OF $a$
A conjugate gamma prior is assigned to $a$

$$f(a \mid \alpha, \beta) = \mathcal{G}\left(a \,\middle|\, \alpha, \beta\right) \qquad (3.9)$$

with $\alpha = \beta = 1$. These values of $\alpha$ and $\beta$ yield a vague hyperprior for $a$. The conjugacy of this hyperprior makes the analysis easier in the sense that the conditional distribution of $a$ required in the Gibbs sampler is also a gamma distribution.

3.3.2. HYPERPRIOR OF $\omega$

A uniform prior on $[0, 1]$ is used for $\omega$

$$f(\omega) = \mathcal{U}\left(\omega \,\middle|\, 0, 1\right) \qquad (3.10)$$

reflecting the absence of knowledge about this hyperparameter.

3.4. POSTERIOR DISTRIBUTION
Using the previously described priors and hyperpriors, the posterior distribution of the proposed Bayesian model is

$$f(X, z, \tau, \sigma_n^2, a, \omega \mid Y) \propto f(Y \mid X, \sigma_n^2)\, f(X \mid \tau, z, \sigma_n^2)\, f(z \mid \omega)\, f(\tau \mid a)\, f(\sigma_n^2)\, f(a)\, f(\omega). \qquad (3.11)$$

The following section investigates a partially collapsed Gibbs sampler that is used to sample from the posterior distribution (3.11) and to build estimators of the unknown model parameters and hyperparameters from the generated samples.
4. A PARTIALLY COLLAPSED GIBBS SAMPLER
The posterior distribution (3.11) is intractable and does not allow us to derive closed-form expressions for the estimators of the different parameters and hyperparameters of the proposed model. Thus we propose to draw samples from (3.11) and to use them to estimate the brain activity jointly with the model hyperparameters. More precisely, we investigate a partially collapsed Gibbs sampler that samples the variables $z_i$ and $x^i$ jointly in order to exploit the strong correlation between these two variables. If we denote by $X_{-i}$ the matrix $X$ whose $i$-th row has been replaced by zeros, the resulting sampling strategy is summarized in Algorithm 1. The corresponding conditional distributions are given hereafter and their exact derivations can be found in Appendix A.
Algorithm 1 Partially collapsed Gibbs sampler.
  Initialize $X$, $z$, $a$ and $\tau$ from their prior distributions
  repeat
    Sample $\sigma_n^2$ from $f(\sigma_n^2 \mid Y, X, \tau, z)$
    Sample $\omega$ from $f(\omega \mid z)$
    for $i = 1$ to $N$ do
      Sample $\tau_i$ from $f(\tau_i \mid x^i, \sigma_n^2, a, z_i)$
      Sample $z_i$ from $f(z_i \mid Y, X_{-i}, \sigma_n^2, \tau_i, \omega)$
      Sample $x^i$ from $f(x^i \mid z_i, Y, X_{-i}, \sigma_n^2, \tau_i)$
    end for
    Sample $a$ from $f(a \mid \tau)$
  until convergence

4.0.1. CONDITIONAL DISTRIBUTION OF $\tau_i$

The conditional distribution of $\tau_i$ is a gamma distribution or a generalized inverse Gaussian (GIG) distribution depending on the value of $z_i$. More precisely,

$$f(\tau_i \mid x^i, \sigma_n^2, a, z_i) = \begin{cases} \mathcal{G}\left(\tau_i \,\middle|\, \frac{T+1}{2},\, \frac{v_i a}{2}\right) & \text{if } z_i = 0 \\ \mathcal{GIG}\left(\tau_i \,\middle|\, \frac{1}{2},\, v_i a,\, \frac{\|x^i\|_2^2}{\sigma_n^2}\right) & \text{if } z_i = 1. \end{cases} \qquad (4.1)$$
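As an implementation note, the GIG conditional in (4.1) can be sampled with standard libraries. SciPy's `geninvgauss` uses the density $x^{p-1}\exp(-b(x + 1/x)/2)$ up to a scale factor, so a change of scale maps it to the three-parameter $\mathcal{GIG}(p, a, c)$ form used here; a minimal sketch under that assumption:

```python
import numpy as np
from scipy.stats import geninvgauss

def sample_gig(p, a, c, rng=None):
    """Sample from GIG(p, a, c) with density proportional to
    x^(p-1) * exp(-(a*x + c/x) / 2), the parameterization used in (4.1)."""
    b = np.sqrt(a * c)         # geninvgauss concentration parameter
    scale = np.sqrt(c / a)     # change of scale mapping the two forms
    return geninvgauss.rvs(p, b, scale=scale, random_state=rng)

# Conditional (4.1) for an active row (z_i = 1) would then read:
# tau_i = sample_gig(0.5, v_i * a, np.linalg.norm(x_i)**2 / sigma_n2)
```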
4.0.2. CONDITIONAL DISTRIBUTION OF $x^i$

The conditional distribution of the $i$-th row of $X$ is

$$f(x^i \mid z_i, Y, X_{-i}, \sigma_n^2, \tau_i) = \begin{cases} \delta(x^i) & \text{if } z_i = 0 \\ \mathcal{N}\left(x^i \,\middle|\, \mu_i,\, \sigma_i^2 I_T\right) & \text{if } z_i = 1 \end{cases} \qquad (4.2)$$

with

$$\mu_i = \sigma_i^2\, \frac{(h_i)^T (Y - H X_{-i})}{\sigma_n^2}, \qquad \sigma_i^2 = \frac{\sigma_n^2 \tau_i}{1 + \tau_i (h_i)^T h_i}. \qquad (4.3)$$

4.0.3. CONDITIONAL DISTRIBUTION OF $z_i$

The conditional distribution of $z_i$ is a Bernoulli distribution

$$f(z_i \mid Y, X_{-i}, \sigma_n^2, \tau_i, \omega) = \mathcal{B}\left(z_i \,\middle|\, \frac{k_1}{k_0 + k_1}\right) \qquad (4.4)$$

with

$$k_0 = 1 - \omega, \qquad k_1 = \omega \left(\frac{\sigma_n^2 \tau_i}{\sigma_i^2}\right)^{-\frac{T}{2}} \exp\left(\frac{\|\mu_i\|_2^2}{2\sigma_i^2}\right). \qquad (4.5)$$

4.0.4. CONDITIONAL DISTRIBUTION OF $a$

The conditional distribution of $a \mid \tau$ is the following gamma distribution

$$f(a \mid \tau) = \mathcal{G}\left(a \,\middle|\, \frac{N(T+1)}{2} + \alpha,\; \frac{\sum_i v_i \tau_i}{2} + \beta\right). \qquad (4.6)$$

4.0.5. CONDITIONAL DISTRIBUTION OF $\sigma_n^2$

The distribution of $\sigma_n^2 \mid Y, X, \tau, z$ is the following inverse gamma distribution

$$f(\sigma_n^2 \mid Y, X, \tau, z) = \mathcal{IG}\left(\sigma_n^2 \,\middle|\, \frac{(M + \|z\|_0)\, T}{2},\; \frac{1}{2}\Big[\|HX - Y\|_F^2 + \sum_{i \in I_1} \frac{\|x^i\|_2^2}{\tau_i}\Big]\right) \qquad (4.7)$$

where $I_1 = \{i : z_i = 1\}$.

4.0.6. CONDITIONAL DISTRIBUTION OF $\omega$

Finally, $\omega \mid z$ has the following beta distribution

$$f(\omega \mid z) = \mathcal{B}e\left(\omega \,\middle|\, 1 + \|z\|_0,\; 1 + N - \|z\|_0\right). \qquad (4.8)$$

Figure 4.1: Example of a posterior distribution of $z$ with local maxima.

Figure 4.2: Illustration of the effectiveness of the multiple dipole shift proposals (axial, coronal and sagittal views): (a) ground truth, (b) estimation without proposals, (c) estimation with proposals.
5. CONVERGENCE CONSIDERATIONS

5.1. LOCAL MAXIMA
We have observed that the proposed partially collapsed Gibbs sampler may get stuck around local maxima of the variable $z$ from which it is very difficult to escape in a reasonable number of iterations. Fig. 4.1 illustrates via a simple two-dipole example that if the sampler reaches $z = \{0, 1\}$ it has to go through intermediate states with very low probability in order to move to the correct value $z = \{1, 0\}$. This kind of situation can occur if the dipoles corresponding to $z_1$ and $z_2$ produce similar measurements $Y$ when active, so that the probability of having either of them active is much higher than the probability of having them both on (or off) at the same time.
5.2. MULTIPLE DIPOLE SHIFT PROPOSALS

In order to solve this problem, we introduce a proposal scheme that consists in changing several elements of $z$ simultaneously (which allows the sampler to go from $\{0, 1\}$ to $\{1, 0\}$ in one step in the previous example) after each sampling iteration. The proposal is accepted or rejected using a Metropolis-Hastings criterion, which guarantees that the target distribution is preserved. We have observed that one of the most common ways for the algorithm to get stuck in a local maximum is by failing to estimate the positions of the non-zero elements of $z$. In other words, the sampling scheme detects a non-zero in a position that is not correct but is close (in the sense that it produces similar measurements) to the correct one, which causes the problem described above.

5.2.1. SHIFT-BASED PROPOSALS
Before describing the proposed move, it is interesting to mention that it was inspired by an idea developed by Bourguignon et al. in [37] to perform a spectral analysis of astrophysical data obtained after irregular sampling. The authors of [37] proposed to move a single non-zero element of a binary sequence to a random neighboring position after each iteration of the MCMC sampler. In the presence of a single non-zero element, this move is sufficient to escape from a local maximum of the posterior associated with our EEG source localization model. However, when there are several non-zero elements located at wrong positions, proposing to move each one of them separately may not be sufficient to escape from a local maximum. For this reason, we have generalized the scheme presented in [37] by proposing to move a random subset of $K$ estimated non-zeros simultaneously to random neighboring positions. According to our experiments (some of them described in Section 6), the simple choice $K = 2$ provides good results. Since the proposal modifies $z$, and $\tau$ is strongly coupled with $z$, it is convenient to update their values jointly. The resulting multiple dipole shift proposal is detailed in Algorithm 2. Note that Bourguignon et al. work with one-dimensional data, so they define the neighborhood of the element $z_k$ as $\{z_{k-1}, z_{k+1}\}$. In contrast, we are working with dipoles located in a three-dimensional brain, so the neighborhood definition is non-trivial and will be described in the following.

Algorithm 2 Multiple dipole shift proposal.
  $\bar{z} = z$
  repeat $K$ times
    Set $\mathrm{ind}_{\mathrm{old}}$ to the index of a random non-zero of $z$
    Set $p = [\mathrm{ind}_{\mathrm{old}}, \mathrm{neigh}_\gamma(\mathrm{ind}_{\mathrm{old}})]$
    Set $\mathrm{ind}_{\mathrm{new}}$ to a random element of $p$
    Set $\bar{z}_{\mathrm{ind}_{\mathrm{old}}} = 0$, $\bar{z}_{\mathrm{ind}_{\mathrm{new}}} = 1$
  end
  Sample $\bar{X}$ from $f(\bar{X} \mid \bar{z}, Y, \sigma_n^2, \tau)$
  Sample $\bar{\tau}$ from $f(\bar{\tau} \mid \bar{X}, \sigma_n^2, a, \bar{z})$
  Set $\{z, \tau\} = \{\bar{z}, \bar{\tau}\}$ with probability $\min\left(\frac{f(\bar{z}, \bar{\tau} \mid \cdot)}{f(z, \tau \mid \cdot)}, 1\right)$
  Resample $X$ if the proposal was accepted
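As a rough illustration, the sketch below implements the selection step of Algorithm 2 in Python. The acceptance ratio $f(\bar{z}, \bar{\tau} \mid \cdot)/f(z, \tau \mid \cdot)$ of (5.1) is abstracted behind a hypothetical `log_target` function and `neigh` is a precomputed neighborhood list as defined in Section 5.2.3; this is a schematic, not the authors' Matlab/C implementation:

```python
import numpy as np

def dipole_shift_proposal(z, neigh, K, rng):
    """Propose z_bar by moving K random active dipoles to random
    neighboring positions (Algorithm 2, selection step only)."""
    z_bar = z.copy()
    for _ in range(K):
        active = np.flatnonzero(z_bar)
        if active.size == 0:
            break
        i_old = rng.choice(active)
        candidates = [i_old] + list(neigh[i_old])  # staying put is allowed
        i_new = rng.choice(candidates)
        z_bar[i_old], z_bar[i_new] = 0, 1
    return z_bar

def mh_accept(z, z_bar, log_target, rng):
    """Metropolis-Hastings accept/reject; log_target(z) stands for the
    log of f(z, tau | .) in (5.1) after sampling the matching tau."""
    return np.log(rng.uniform()) < log_target(z_bar) - log_target(z)
```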
Fig. 4.2 illustrates the effect of introducing multiple dipole shift proposals (with $K = 2$) in a practical case. The first row of images is the ground truth (a single dipole activation), while the second row shows the probability of finding each dipole active, for eight parallel MCMC chains without proposals, after 10,000 iterations. As we can see, the activity is in the correct area but the algorithm is not able to converge to the correct value of $z$. After introducing the multiple dipole shift proposals, the sampler converges in fewer than 1,000 iterations, as shown in the third row of the figure.

5.2.2. ACCEPTANCE PROBABILITY
In order to guarantee that the generated samples are asymptotically distributed according to the posterior of the proposed model, all moves resulting from the multiple dipole shift proposal are accepted or rejected with the acceptance probability described in Algorithm 2, which uses the following probability distribution

$$f(z^r, \tau^r \mid Y, a, \sigma_n^2, \omega) \propto (1-\omega)^{C_0}\, \omega^{C_1}\, (\sigma_n^2)^{-\frac{T C_1}{2}}\, |\Sigma|^{\frac{T}{2}} \prod_{i \in I_1} (\tau_i)^{-\frac{T}{2}} \exp\left(-\frac{\sum_{t=1}^{T} K_t}{2}\right) \prod_{i \in r} \mathcal{G}\left(\tau_i \,\middle|\, \frac{T+1}{2},\, \frac{v_i a}{2}\right) \qquad (5.1)$$

where $r = \{i : z_i \neq \bar{z}_i\}$, $I_k = \{i \in r : z_i^r = k\}$ and $C_k = \# I_k$ for $k = \{0, 1\}$, and

$$\Sigma^{-1} = \frac{1}{\sigma_n^2}\left[(H_{I_1})^T H_{I_1} + \mathrm{diag}(\tau^r_{I_1})^{-1}\right]$$

$$\mu_t = -\Sigma\, \frac{(H_{I_1})^T (H_{-r}\, x_t^{-r} - y_t)}{\sigma_n^2}$$

$$K_t = \frac{(H_{-r}\, x_t^{-r} - y_t)^T (H_{-r}\, x_t^{-r} - y_t)}{\sigma_n^2} - (\mu_t)^T \Sigma^{-1} \mu_t.$$

Note that $m^{-s}$ is obtained after removing from the vector $m$ all the rows belonging to the set $s$, $H_{-s}$ is the matrix $H$ whose columns belonging to $s$ have been removed, $\mathrm{diag}(s)$ is the diagonal square matrix whose diagonal elements are the elements of $s$, and $|M|$ is the determinant of the matrix $M$. The exact derivation is included in Appendix B.

5.2.3. NEIGHBORHOOD DEFINITION
The definition of the neighborhood used to move non-zero elements is clearly crucial. Initially, we used a geometrical neighborhood, defined in terms of vertex connectivity in the triangular tessellation modeling the brain cortex. However, this definition usually yields very small neighborhoods, which may cause the proposals not to be flexible enough to help the algorithm escape from local maxima. For this reason we propose a neighborhood definition that considers two dipoles to be neighbors if the correlation between their respective leadfield columns is higher than a certain threshold

$$\mathrm{neigh}_\gamma(i) \triangleq \left\{ j \neq i \,\middle|\, |\mathrm{corr}(h_i, h_j)| \geq \gamma \right\} \qquad (5.2)$$

where $\mathrm{corr}(v_1, v_2)$ is the correlation between the vectors $v_1$ and $v_2$ and where the neighborhood size can be adjusted by setting $\gamma \in [0, 1]$; note that this definition could also be adapted when prior knowledge about $z$ is available. In order to maximize the efficiency of the moves, the value of $\gamma$ has to be selected carefully. A very large value of $\gamma$ results in proposals that are not flexible enough to help the algorithm escape local maxima. A very low value of $\gamma$ results in a very large number of possible proposals, many of them useless, leading to a large number of iterations before useful moves are reached. Our experiments have shown that a good compromise is obtained with an intermediate value of $\gamma$.
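This correlation-based neighborhood can be precomputed once from the leadfield columns; a minimal sketch (the threshold value shown is illustrative):

```python
import numpy as np

def build_neighborhoods(H, gamma=0.8):
    """Precompute neigh_gamma(i) of (5.2) for every dipole i from the
    leadfield H (M x N), using the correlation between its columns."""
    Hc = H - H.mean(axis=0)                    # center each column
    Hc /= np.linalg.norm(Hc, axis=0)           # normalize each column
    C = np.abs(Hc.T @ Hc)                      # |corr(h_i, h_j)| for all pairs
    np.fill_diagonal(C, 0.0)                   # exclude j = i
    return [np.flatnonzero(C[i] >= gamma) for i in range(H.shape[1])]
```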
5.3. INTER-CHAIN PROPOSALS

The multiple dipole shift proposal scheme described previously allows the algorithm to better sample the value of $z$ and is able to find the active dipoles correctly when they are few in number. However, when running multiple MCMC chains in parallel with a larger number of non-zeros present in the ground truth, the different chains may get stuck in different values of $z$. In order to help them converge to the same (most probable) value, it is possible to exchange information between parallel chains to avoid local maxima during their runs, as other approaches do, including Metropolis-coupled MCMC [38], population MCMC [39] and simulated tempering [40, 41].

In this report, we introduce inter-chain moves by proposing to exchange the values of $z$ and $\tau$ between different chains. This move is accepted with the Metropolis-Hastings probability shown in Algorithm 3. One inconvenience introduced by inter-chain proposals is that they require synchronizing the parallel MCMC chain processes, which decreases the iteration speed of the algorithm. In order to minimize this effect, an inter-chain proposal is made after each iteration only with probability $p$ (adjusted by cross-validation) according to Algorithm 3.

Algorithm 3 Inter-chain proposals.
  Define a vector $c = \{1, 2, \ldots, L\}$ where $L$ is the number of chains
  for $i = \{1, 2, \ldots, L\}$
    Choose (and remove) a random element from $c$ and denote it by $k$
    Denote as $\{\bar{z}^k, \bar{\tau}^k\}$ the sampled values of $\{z, \tau\}$ of MCMC chain number $k$
    For chain $i$, set $\{z^i, \tau^i\} = \{\bar{z}^k, \bar{\tau}^k\}$ with probability $\min\left(\frac{f(\bar{z}^k, \bar{\tau}^k \mid \cdot)}{f(z^i, \tau^i \mid \cdot)}, 1\right)$
    Resample $X$ if the proposal has been accepted
  end

Figure 5.1: Illustration of the effectiveness of the inter-chain proposals (axial, coronal and sagittal views): (a) ground truth, (b) estimation without proposals, (c) estimation with proposals.

The benefit of introducing inter-chain proposals is illustrated in Fig. 5.1. The first row of images of this figure displays the five non-zeros present in the ground truth. Without inter-chain proposals, the different chains arrive at different modes after 100,000 iterations, as illustrated in the second row (which displays the probability of finding each dipole active over 8 parallel MCMC chains). The introduction of inter-chain proposals causes all the chains to converge to the same (correct) value of $z$ in fewer than 5,000 iterations, as illustrated in the third row.
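A schematic sketch of the exchange step of Algorithm 3 follows, with the same hypothetical `log_target` abstraction as in the dipole-shift sketch above; the chain-pairing strategy here is a simplification of the "choose and remove" loop, and a real implementation must also resample $X$ after an accepted exchange:

```python
import numpy as np

def interchain_proposal(states, log_target, rng):
    """states: list of per-chain dicts {'z': ..., 'tau': ...}.
    Each chain i considers adopting the state of a random chain k,
    accepted with a Metropolis-Hastings ratio (Algorithm 3)."""
    L = len(states)
    donors = rng.permutation(L)            # one random donor per chain
    for i in range(L):
        k = donors[i]
        if k == i:
            continue
        log_ratio = log_target(states[k]) - log_target(states[i])
        if np.log(rng.uniform()) < log_ratio:
            states[i] = {key: val.copy() for key, val in states[k].items()}
            # the chain should then resample X given the new (z, tau)
    return states
```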
The point estimators used in this study are defined as followsˆ z (cid:44) arg max ¯ z ∈ {0,1} N (cid:179) M ( ¯ z ) (cid:180) (5.3)ˆ p (cid:44) M ( ˆ z ) (cid:88) m ∈ M ( ˆ z ) p ( m ) (5.4)where M ( ¯ z ) is the set of iteration numbers m for which the sampled variable z ( m ) = ¯ z afterthe burn-in period and p stands for any of the variables X , a , σ n , ω and τ . Thus the estima-tor ˆ z (5.3) corresponds to a maximum a posteriori estimator whereas the estimator used forall the other sampled variables (5.4) is a minimum mean square error (MMSE) estimator.However, it is interesting to note that the proposed method provides the full distribution ofthe unknown parameters and is not limited to point-estimate as the methods based on the (cid:96) mixed norm.It will be shown in the next section that in certain conditions, such as low SNR, the a-posterioridistribution of z has several values of comparable probability. These values of z are usuallyminor variations of each other (changing one of the dipoles to a neighboring position for in-stance). In this case, after convergence the algorithm oscillates between several values of z which allows the proposed method to identify several possible solutions (each of them cor-responding to a different value of z ) with their corresponding probabilities. This is an advan-tage over the mixed (cid:96) mixed norm method that is only able to provide a point-estimate ofthe solution.
6. EXPERIMENTAL RESULTS

6.1. SYNTHETIC DATA
Synthetic data are first considered to compare the $\ell_{21}$ mixed norm approach with the proposed method, using a 212-dipole Stok three-sphere head model [42] with 41 electrodes. Three kinds of activations are considered: (1) three dipoles with low SNR, (2) five dipoles with high SNR and (3) multiple dipoles with high SNR.
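The dipole excitations used throughout this section are damped sinusoids; the sketch below shows how such waveforms can be generated (the decay constant is an illustrative choice, not a value given in the report):

```python
import numpy as np

def damped_sinusoid(freq_hz, T=100, fs=200.0, decay=5.0):
    """500 ms damped sinusoidal excitation sampled at fs (here 200 Hz),
    as used for the synthetic activations in this section."""
    t = np.arange(T) / fs
    return np.exp(-decay * t) * np.sin(2 * np.pi * freq_hz * t)

rng = np.random.default_rng(2)
waves = [damped_sinusoid(f) for f in rng.uniform(5, 20, size=3)]  # 5-20 Hz
```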
Figure 6.1: Estimated waveforms for three dipoles with SNR = 30 dB: (a) ground truth, (b) $\ell_{21}$ mixed norm estimation, (c) proposed method.

Figure 6.2: Estimated activity for three dipoles and SNR = 30 dB (axial, coronal and sagittal views): (a) ground truth, (b) $\ell_{21}$, (c) proposed method.
Figure 6.3: Estimated waveforms for three dipoles with SNR = -3 dB: (a) ground truth, (b) $\ell_{21}$ mixed norm estimation, (c) proposed method.

Figure 6.4: Estimated activity for three dipoles and SNR = -3 dB (axial, coronal and sagittal views): (a) ground truth, (b) $\ell_{21}$, (c) proposed method.

Active non-zeros    Percentage of samples
1, 2, 3             43%
1, 2, 4             22%
1, 2, 5             11%
1, 2, 6             7%
1, 2, 7             6%
Others              11%

Table 6.1: Three dipoles with SNR = -3 dB: modes explored after convergence. Positions 1, 2 and 3 correspond to the non-zero elements of the ground truth.

Figure 6.5: Estimated boundaries $\mu \pm 2\sigma$ for the three-dipole simulation with SNR = -3 dB: (a) dipole waveform 1, (b) dipole waveform 2, (c) dipole waveform 3.

6.1.1. THREE DIPOLES WITH LOW SNR
Three dipoles were assigned excitations defined as synthetic damped sinusoidal waves with frequencies between 5 and 20 Hz. These excitations were 500 ms long (a period corresponding to a stationary dipole activity) and sampled at 200 Hz. Different levels of noise were used to compare the performance of the proposed method with that of the weighted $\ell_{21}$ mixed norm. The parameters of our multiple dipole shift proposal were set to $K = 2$, with the neighborhood threshold $\gamma$ chosen as discussed in Section 5.2.3. For the $\ell_{21}$ mixed norm approach, the value of the regularization parameter $\lambda$ was chosen by cross-validation to obtain the best possible result.

The results for SNR = 30 dB are shown in Fig. 6.1 and Fig. 6.2. Both algorithms provide essentially the same solution, which is equal to the ground truth. The only minor difference is that the $\ell_{21}$ mixed norm regularization presents some dipoles (around ten) with very low but non-zero values, while our algorithm only detects the three non-zeros as real activity; this can be seen in the estimated dipole locations in Fig. 6.2.

The estimated dipole locations and waveforms for SNR = -3 dB, shown in Fig. 6.3 and Fig. 6.4, reveal larger differences between the two methods. The approach based on the $\ell_{21}$ norm manages to recover only two of the three non-zero activities at the correct positions and considerably underestimates the amplitude of the activity.
This underestimation is a known problem caused by approximating the $\ell_0$ pseudo-norm with the $\ell_1$ norm, since the latter penalizes high amplitudes while the former penalizes all non-zero values equally. Our algorithm oscillates between several values of $z$ (specified in Table 6.1). However, the most probable value of $z$ found by the algorithm is the correct one, whereas the other ones have one of the active non-zeros moved to a close neighbor.

The proposed method does not only provide a point estimate of the activity but can also be used to quantify the uncertainty associated with this estimation. For instance, Fig. 6.5 shows the confidence intervals of the activity estimation (mean ± two standard deviations). The actual ground-truth activation is clearly located within two standard deviations of the estimated mean value obtained with the proposed algorithm. The histograms of the generated hyperparameters $\omega$, $a$ and $\sigma_n^2$ are shown in Fig. 6.6. They are clearly in good agreement with the actual values of the corresponding parameters. The PSRFs (potential scale reduction factors) are displayed in Fig. 6.7; they tend to 1 as the number of iterations increases, showing that the simulation is converging correctly.

Figure 6.6: Three dipoles with SNR = -3 dB: histograms of the hyperparameters (a) $\omega$, (b) $a$, (c) $\sigma_n^2$. The actual values of $\omega$ and $\sigma_n^2$ are marked with a red vertical line.

Figure 6.7: Three dipoles with SNR = -3 dB: PSRFs of the sampled variables (a) $\omega$, (b) $a$, (c) $\sigma_n^2$, (d) maximum PSRF of $X$.

Figure 6.8: Estimated activity for five dipoles and SNR = 30 dB (axial, coronal and sagittal views): (a) ground truth, (b) $\ell_{21}$ dipole locations, (c) proposed method.

6.1.2. FIVE DIPOLES
In our second kind of experiment, five dipoles were activated with the same damped sinusoidal wave of 5 Hz. The activations were sampled at 200 Hz and scaled in amplitude so that each of them produced the same energy in the measurements. Noise was added to the measurements to obtain SNR = 30 dB. For the $\ell_{21}$ mixed norm regularization, the regularization parameter was set according to the discrepancy principle, which consists in finding a solution $\hat{X}$ such that $\|H\hat{X} - Y\|_F \approx \|HX - Y\|_F$ [43]. Eight MCMC chains were run in parallel for the proposed method. Only the five non-zeros of the estimated activity with the highest energy in the measurements were considered.
Figure 6.9: Estimated waveforms for five dipoles with SNR = 30 dB: (a)-(e) dipole waveforms 1 to 5. Green represents the ground truth, blue the $\ell_{21}$ mixed norm estimation and red the proposed method.

Figure 6.10: Performance measures for multiple dipoles: (a) recovery rate as a function of $P$, (b) proportion of residual energy as a function of $P$, for the $\ell_{21}$ approach and the proposed method (PM).

The results are displayed in Fig. 6.8 and Fig. 6.9. In the first one we can see that the proposed method recovers the five locations perfectly, while the $\ell_{21}$ norm only detects four activations, two of which are not in the correct locations. In the waveforms displayed in Fig. 6.9 we can see that, while both methods recover the general activity pattern, the proposed method closely matches the waveform amplitudes whereas the $\ell_{21}$ mixed norm does not. This is partially due to the fact that it detects some of the activations in the wrong positions (waveforms 4 and 5) and partially due to the tendency to underestimate the activation amplitudes inherent to the $\ell_0$-to-$\ell_1$ convex relaxation.

6.1.3. MULTIPLE DIPOLES
In this section, we compare the detection capabilities of the algorithm with those of the $\ell_{21}$ mixed norm approach by varying the number of non-zeros present in the ground truth. In each simulation of this section, $P$ dipoles were activated with damped sinusoidal waves with frequencies varying between 5 and 20 Hz. The activations were sampled at 200 Hz and scaled in amplitude so that each of them produced the same energy in the measurements. Fifty different sets of localizations were used for the non-zero positions for each value of $P = 1, \ldots, 20$, resulting in a total of 1000 experiments. Noise was added to the measurements to obtain SNR = 30 dB. For the $\ell_{21}$ mixed norm regularization, the regularization parameter was set according to the discrepancy principle.

For each simulation, the $P$ non-zeros of the estimated activity associated with the highest energy in the measurements were considered as the estimated activity, whereas the other elements were considered as residuals. We define the recovery rate as the proportion of non-zeros in the ground truth that are also present in the estimated activity. The average recovery rates of the proposed method and the $\ell_{21}$ mixed norm approach are presented in the first plot of Fig. 6.10 as a function of $P$. For $P \leq 10$ our algorithm detects the non-zeros with an accuracy higher than 90%, which drops to 60.2% for $P = 11$ and 49.7% for $P = 12$. This drop of the recovery rate when a large number of non-zeros is present in the ground truth is well known, since the number of non-zeros that can be recovered correctly is limited by the span of the operator [14]. For comparison, the recovery rate of the $\ell_{21}$ mixed norm regularization is shown in the same plot. Note that our method performs better than the $\ell_{21}$ approach for $P \leq 11$; however, beyond this point, the performance of both methods is very poor, preventing them from being used in practical applications.

The recovery rate is calculated from the $P$ main non-zero elements of the activity. However, it is also interesting to analyze how much activity is present in the residual non-zero elements. Thus, we define the proportion of residual energy as the amount of energy contained in the measurements generated by the residual non-zeros with respect to the total energy in the measurements. This residual energy serves as a measure of the sparsity of the solution. The second plot of Fig. 6.10 shows the residual energy obtained for both algorithms as a function of $P$. The $\ell_{21}$ approach has up to 7.7% of the activity detected in residual non-zeros, whereas our algorithm never exceeds 1.1% and always has lower residual activity than $\ell_{21}$, confirming its good sparsity properties.
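Both performance measures follow directly from their definitions; a sketch (function names are ours):

```python
import numpy as np

def recovery_rate(true_support, est_support):
    """Proportion of ground-truth non-zeros also present in the estimate."""
    return len(set(true_support) & set(est_support)) / len(true_support)

def residual_energy_ratio(H, X_est, main_support):
    """Energy of the measurements generated by the residual non-zeros,
    relative to the total energy of the estimated measurements."""
    X_res = X_est.copy()
    X_res[list(main_support)] = 0.0            # keep only residual rows
    return np.linalg.norm(H @ X_res)**2 / np.linalg.norm(H @ X_est)**2
```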
6.2. REAL DATA EXPERIMENTS

Two real data sets are then considered to validate the proposed method. The first data set corresponds to the auditory evoked responses to a left-ear pure-tone stimulus, while the second one consists of the evoked responses to a facial stimulus. The results of the proposed method are compared with the weighted $\ell_{21}$ mixed norm [17] and the multiple sparse priors method [29].

6.2.1. AUDITORY EVOKED RESPONSES
The default data set of the MNE software [44, 45] is used in this section. It consists of the evoked response to a left-ear auditory pure-tone stimulus, using a realistic BEM (boundary element method) head model sampled with 60 EEG electrodes and 306 MEG sensors. The head model contains 1,844 dipoles located on the cortex, with orientations normal to the brain surface. Two channels that had technical artifacts were ignored. The data were sampled at 600 Hz, low-pass filtered at 40 Hz and downsampled to 150 Hz. The noise covariance matrix was estimated from the 200 ms of data preceding each stimulus and was used to whiten the measurements. Fifty-one epochs were averaged to compute the measurements $Y$. The activity of the source dipoles was estimated jointly for the period from 0 ms to 500 ms after the stimulus. From a clinical perspective, the brain activity is expected to be primarily focused on the auditory cortices, which are located close to the ears in both hemispheres of the brain.

Since the measurements were whitened, it is possible to use the discrepancy principle to adjust the hyperparameter of the $\ell_{21}$ mixed norm. However, this yields activity distributed all over the brain, as shown in the first row of Fig. 6.12. By manually adjusting the hyperparameter to produce a sparser result, the $\ell_{21}$ mixed norm can obtain a solution that has activity in the auditory cortices as expected, shown in the second row of images. In contrast, our algorithm estimates its hyperparameters automatically and finds most of the activity in the auditory cortices without requiring any manual adjustment, as displayed in the third row. On the other hand, the MSP method spreads the activity around the auditory cortex areas since it groups the dipoles together in pre-defined regions.

The whitened measurements are displayed in Fig. 6.11 along with the activity estimations of both the $\ell_{21}$ approach (after manually adjusting the hyperparameter) and our algorithm. The five waveforms estimated by the proposed method, with their $2\sigma$ confidence intervals, are shown in Fig. 6.13. Both results present sharp peaks in the activations of the auditory cortex dipoles between 80 and 100 milliseconds after the application of the stimulus.

Figure 6.11: Measurements and estimated waveforms for the auditory evoked responses: (a) EEG whitened measurements, (b) $\ell_{21}$ mixed norm estimated waveforms, (c) proposed method estimated waveforms.

Figure 6.12: Estimated activity for the auditory evoked responses: (a) weighted $\ell_{21}$ mixed norm with the parameter set by the discrepancy principle, (b) weighted $\ell_{21}$ mixed norm with a manually adjusted parameter, (c) proposed method, (d) MSP algorithm.

Figure 6.13: Estimated waveform means and boundaries $\mu \pm 2\sigma$ for the auditory evoked responses: (a) waveform 1 (center right dipole), (b) waveform 2 (center left dipole 1, closer to the edge), (c) waveform 3 (center left dipole 2), (d) waveform 4 (posterior right dipole), (e) waveform 5 (posterior left dipole).

Figure 6.14: Histograms of the hyperparameters for the auditory evoked responses: (a) $\omega$, (b) $a$, (c) $\sigma_n^2$.

Figure 6.15: PSRFs of the sampled variables for the auditory evoked responses: (a) $\omega$, (b) $a$, (c) $\sigma_n^2$, (d) maximum PSRF of $X$.
Note that the amplitudes estimated by the proposed method are much higher than the ones obtained with the $\ell_{21}$ approach, due to the aforementioned amplitude underestimation of the latter.

Figure 6.16: Measurements and estimated waveforms for the facial evoked responses: (a) EEG whitened measurements, (b) $\ell_{21}$ mixed norm estimated waveforms, (c) proposed method estimated waveforms.

The histograms of the hyperparameters of our algorithm are presented in Fig. 6.14, while their PSRFs are shown in Fig. 6.15. In the PSRFs we can see very abrupt changes in value for all the variables (most noticeably for $X$) at the same iterations. These correspond to the iterations in which proposals were accepted by different chains, and reflect the fact that the PSRFs can take very high values while the chains are in different modes. However, at the end of the simulation all chains converge to the same value of $z$, which causes all the PSRFs to tend to 1, showing the correct convergence of all the chains to the same posterior distribution.
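The PSRF used in these convergence checks is the standard Gelman-Rubin diagnostic; a minimal sketch of its computation for one scalar variable across parallel chains:

```python
import numpy as np

def psrf(chains):
    """Gelman-Rubin potential scale reduction factor.
    chains: (L, n) array of n post-burn-in samples from L parallel chains."""
    L, n = chains.shape
    chain_means = chains.mean(axis=1)
    W = chains.var(axis=1, ddof=1).mean()      # within-chain variance
    B = n * chain_means.var(ddof=1)            # between-chain variance
    var_hat = (n - 1) / n * W + B / n          # pooled variance estimate
    return np.sqrt(var_hat / W)                # tends to 1 at convergence
```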
6.2.2. FACIAL EVOKED RESPONSES

In a second experiment, we used data acquired from a face perception study in which the subject was required to evaluate the symmetry of a mixed set of faces and scrambled faces, one of the default data sets of the SPM software. Faces were presented during 600 ms every 3600 ms. The measurements were taken by the electrodes of a 128-channel ActiveTwo system sampling at 2048 Hz. The measurements were downsampled to 200 Hz and, after artifact rejection, 299 epochs corresponding to the non-scrambled faces were averaged and low-pass filtered at 40 Hz. A T1 MRI scan was then downsampled to generate a 3004-dipole head model.

The estimated activities are shown in Fig. 6.17. As in the previous case, the $\ell_{21}$ mixed norm with the regularization parameter adjusted according to the discrepancy principle spreads the estimated activity around the brain. In contrast, adjusting its regularization parameter manually results in a focal response concentrated in one of the fusiform regions of the temporal lobe associated with the facial recognition process [46], similar to the one obtained by our algorithm. The MSP algorithm spreads the activity over brain regions located more toward the lateral and posterior parts of the brain, further away from the expected area.

Figure 6.17: Estimated activity for the facial evoked responses: (a) weighted $\ell_{21}$ mixed norm with the parameter set by the discrepancy principle, (b) weighted $\ell_{21}$ mixed norm with a manually adjusted parameter, (c) proposed method, (d) MSP algorithm.

Fig. 6.16 shows the EEG measurements and the waveforms estimated by the $\ell_{21}$ approach and the proposed method. As with the auditory evoked response data, they differ in scale due to the underestimation of the activity amplitude by the $\ell_{21}$ approach.

6.3. COMPUTATIONAL COST
It is important to note that the price to pay for the proposed method, despite its several advantages over the $\ell_{21}$ mixed norm approach, is its higher computational complexity, as is typical of MCMC methods when compared with optimization techniques. The low-SNR three-dipole experiment was processed in 6 seconds on a modern Xeon CPU E3-1240 @ 3.4 GHz processor (using a Matlab implementation with MEX files written in C), against 104 ms for the $\ell_{21}$ mixed norm approach. However, it is worth mentioning that the $\ell_{21}$ mixed norm approach requires running the algorithm multiple times in order to adjust the regularization parameter by cross-validation.
7. CONCLUSION
We presented a hierarchical Bayesian model for sparse EEG reconstruction that approximates the $\ell_{20}$ mixed norm in a Bayesian framework by a multivariate Bernoulli Laplacian prior. A partially collapsed Gibbs sampler was used to sample from the target posterior distribution. We introduced multiple dipole shift proposals within each MCMC chain and exchange moves between different chains to improve the convergence speed. Using the generated samples, the source activity was estimated jointly with the model hyperparameters in an unsupervised framework. The proposed method was compared with the $\ell_{21}$ mixed norm and the multiple sparse priors methods in a wide variety of situations, including several multi-dipole synthetic activations and two different sets of real data. Our algorithm presented several advantages, including better recovery of dipole locations and waveforms in low-SNR conditions, the capacity of correctly detecting a higher number of non-zeros, sparser solutions and the avoidance of amplitude underestimation. Finally, the possibility of providing several candidate solutions with their corresponding probabilities is an interesting feature. Future work will be devoted to a generalization of the proposed model to cases where the head model is not precisely known.

A. DERIVATION OF THE CONDITIONAL PROBABILITY DISTRIBUTIONS
We now show in detail the derivation of the conditional probability distributions associated with the model presented in Section 3 of the report.
A.1. POSTERIOR DISTRIBUTION
As specified in the report, the posterior distribution is

$$f(X, z, \tau, \sigma_n^2, a, \omega \mid Y) \propto f(Y \mid X, \sigma_n^2)\, f(X \mid \tau, z, \sigma_n^2)\, f(z \mid \omega)\, f(\tau \mid a)\, f(\sigma_n^2)\, f(a)\, f(\omega). \qquad (A.1)$$

From it we can derive the conditional distributions of all the associated parameters and hyperparameters using Bayes' theorem:

$$f(z_i, x^i \mid Y, X_{-i}, \sigma_n^2, \tau_i, \omega) \propto f(Y \mid X, \sigma_n^2)\, f(x^i \mid \tau_i, z_i, \sigma_n^2)\, f(z_i \mid \omega) \qquad (A.2)$$

$$f(\tau_i \mid x^i, \sigma_n^2, a, z_i) \propto f(x^i \mid \tau_i, z_i, \sigma_n^2)\, f(\tau_i \mid a) \qquad (A.3)$$

$$f(a \mid \tau) \propto f(\tau \mid a)\, f(a) \qquad (A.4)$$

$$f(\sigma_n^2 \mid Y, X, \tau, z) \propto f(Y \mid X, \sigma_n^2)\, f(X \mid \tau, z, \sigma_n^2)\, f(\sigma_n^2) \qquad (A.5)$$

$$f(\omega \mid z) \propto f(z \mid \omega)\, f(\omega) \qquad (A.6)$$
ONDITIONAL DISTRIBUTIONS
Conditional distribution of τ i The conditional distribution of τ i is f ( τ i | x i , σ n , a , z i ) ∝ f ( x i | τ i , z i , σ n ) f ( τ i | a ) (A.7)that is equal to f ( τ i | x i , σ n , a , z i ) = δ ( x i ) G (cid:179) τ i (cid:175)(cid:175)(cid:175) T + , v i a (cid:180) if z i = N (cid:179) x i (cid:175)(cid:175)(cid:175) σ n τ i I T (cid:180) G (cid:179) τ i (cid:175)(cid:175)(cid:175) T + , v i a (cid:180) if z i =
1. (A.8)Based on the development of [35] it can be seen that the conditional distribution of τ i is ageneralized inverse gaussian when z i = z i = f ( τ i | x i , σ n , a , z i ) = G (cid:179) τ i (cid:175)(cid:175)(cid:175) T + , v i a (cid:180) if z i = G I G (cid:179) τ i (cid:175)(cid:175)(cid:175) , v i a , || x i || σ n (cid:180) if z i =
1. (A.9)Conditional distribution of z i and x i As stated in the report, we will sample z i and x i jointly from f ( z i , x i | Y , X − i , σ n , τ i , ω ) ∝ f ( Y | X , σ n ) f ( x i | τ i , z i , σ n ) f ( z i | ω ) (A.10)that is equal to 29 ( z i , x i | Y , X − i , σ n , τ i , ω ) ∝ exp (cid:179) − || HX − Y || σ n (cid:180)(cid:104) (1 − z i ) δ ( x i ) + z i N (cid:179) x i (cid:175)(cid:175)(cid:175) σ n τ i I T (cid:180)(cid:105)(cid:104) (1 − ω ) δ ( z i ) + ωδ ( z i − (cid:105) = exp (cid:179) − || HX − Y || σ n (cid:180)(cid:104) (1 − ω ) δ ( z i ) δ ( x i ) + ωδ ( z i − N (cid:179) x i (cid:175)(cid:175)(cid:175) σ n τ i I T (cid:180)(cid:105) = (1 − ω ) δ ( z i ) exp (cid:179) − || HX − i − Y || σ n (cid:180) δ ( x i ) + ω (2 πτ i σ n ) − T δ ( z i −
1) exp (cid:179) − || HX − Y || σ n (cid:180) exp (cid:179) − || x i || σ n τ i (cid:180) .(A.11)The marginal distribution of z i is of the form f ( z i | Y , X − i , σ n , τ i , ω ) = (cid:90) f ( z i , x i | Y , X − i , σ n , τ i , ω ) d x i ∝ k δ ( z i ) + k δ ( z i −
1) (A.12)with k = (cid:90) (1 − ω ) exp (cid:179) − || HX − i − Y || σ n (cid:180) δ ( x i ) d x i = (1 − ω ) exp (cid:179) − || HX − i − Y || σ n (cid:180) (A.13) k = ω (2 πτ i σ n ) − T (cid:90) exp (cid:104) − σ n (cid:179) || HX − Y || + || x i || τ i (cid:180)(cid:105) d x i . (A.14)This implies that z i has the following Bernoulli distribution f ( z i | Y , X − i , σ n , τ i , ω ) = B (cid:179) z i (cid:175)(cid:175)(cid:175) k k + k (cid:180) . (A.15)To find the value of k we calculate the minus logarithm of the integrand − log (cid:179) exp (cid:104) − σ n (cid:179) || HX − Y || + || x i || τ i (cid:180)(cid:105)(cid:180) = σ n (cid:179) || HX − Y || + || x i || τ i (cid:180) (A.16)and express it as a sum for the different values of t σ n (cid:179) || HX − Y || + || x i || τ i (cid:180) = σ n T (cid:88) t = (cid:179) || Hx t − y t || + ( x ti ) τ i (cid:180) . (A.17)If we denote h i each column of the operator H we can express term number t of the sum as12 σ n (cid:179) || Hx t − y t || + ( x ti ) τ i (cid:180) = σ n (cid:104)(cid:179) (cid:88) j (cid:54)= i h j x tj + h i x ti − y t (cid:180) T (cid:179) (cid:88) j (cid:54)= i h j x tj + h i x ti − y t (cid:180) + ( x ti ) τ i (cid:105) .(A.18)By denoting D ti = Y t − (cid:80) j (cid:54)= i h j x tj and expanding this expression we have 30A.18) = σ n (cid:179) ( h i ) T h i ( x ti ) − x ti ( h i ) T D ti + ( D ti ) T D ti + ( x ti ) τ i (cid:180) . (A.19)Matching the terms of the previous expression with ( x ti − µ ti ) σ i + K ti we have12 σ n (cid:179) ( h i ) T h i ( x ti ) − x ti ( h i ) T D ti + ( D ti ) T D ti + ( x ti ) τ i (cid:180) = ( x ti ) σ i − µ ti x ti σ i + ( µ ti ) σ i + K ti (A.20)12 σ i = ( h i ) T h i + τ i σ n => σ i = σ n τ i + τ i ( h i ) T h i (A.21) µ ti σ i = ( h i ) T D ti σ n => µ ti = σ i ( h i ) T D ti σ n (A.22)( µ ti ) σ i + K ti = ( D ti ) T D ti σ n => K ti = ( D ti ) T D ti σ n − ( µ ti ) σ i . (A.23)Summing over all time samples and applying the function exp( − x ) (to compensate the stepsdone in A.16 and A.18) results inexp (cid:179) − || HX − Y || σ n (cid:180) exp (cid:179) − || x i || σ n τ i (cid:180) = (2 πσ i ) T T (cid:89) t = exp( − K ti ) N (cid:179) x ti (cid:175)(cid:175)(cid:175) µ ti , σ i (cid:180) . (A.24)We can calculate the final value of k by combining (A.14) and (A.24) k = ω (2 πτ i σ n ) − T (cid:90) exp (cid:179) − || HX − Y || σ n (cid:180) exp (cid:179) − || x i || σ n τ i (cid:180) d x i = ω (2 πτ i σ n ) − T (2 πσ i ) T T (cid:89) t = exp( − K ti ) = ω (cid:179) σ n τ i σ i (cid:180) − T exp (cid:179) − || HX − i − Y || σ n (cid:180) exp (cid:179) || µ i || σ i (cid:180) . (A.25)Using (A.11) and (A.24) we obtain the conditional distribution of x i f ( x i | z i , Y , X − i , σ n , τ i ) = (cid:40) δ ( x i ) if z i = N (cid:179) x i (cid:175)(cid:175)(cid:175) µ i , σ i I T (cid:180) if z i =
1. (A.26)Conditional distribution of a The conditional distribution of a is f ( a | τ ) ∝ f ( a ) N (cid:89) i = f ( τ i | x i , σ n , a ) (A.27) f ( a | τ ) ∝ a α − exp (cid:179) − β a (cid:180) N (cid:89) i = (cid:104)(cid:179) av i (cid:180) T + ( τ i ) T − exp (cid:179) − av i τ i (cid:180)(cid:105) ∝ a α + N ( T + − exp (cid:179) − (cid:179) β + N (cid:88) i = [ v i τ i ] (cid:180) a (cid:180) (A.28)31hich corresponds to the following gamma distribution f ( a | τ ) = G (cid:179) a (cid:175)(cid:175)(cid:175) N ( T + + α , (cid:80) i [ v i τ i ]2 + β (cid:180) . (A.29)Conditional distribution of σ n The conditional distribution of σ n is f ( σ n | Y , X , τ , z ) ∝ f ( Y | X , σ n ) f ( X | τ , z , σ n ) f ( σ n ) (A.30) f ( σ n | Y , X , τ , z ) ∝ ( σ n ) − (2 πσ n ) − MT exp (cid:179) − || HX − Y || σ n (cid:180) N (cid:89) i = (cid:104) ( z i − δ ( x i ) + z i N (cid:179) x i (cid:175)(cid:175)(cid:175) σ n τ i I T (cid:180)(cid:105) .Denoting I k = { i : z i = k } for k = {0, 1} and using the identity N (cid:89) i = (cid:104) ( z i − f ( x ) + z i g ( x ) (cid:105) = (cid:89) i ∈ I f ( x ) (cid:89) i ∈ I g ( x ) (A.31)we can express this by f ( σ n | Y , X , τ , z ) ∝ ( σ n ) − (2 πσ n ) − MT exp (cid:179) − || HX − Y || σ n (cid:180) (cid:89) i ∈ I δ ( x i ) (cid:89) i ∈ I (cid:104) (2 πτ i σ n ) − T exp (cid:179) − || x i || τ i σ n (cid:180)(cid:105) .(A.32)The previous expression leads to f ( σ n | Y , X , τ , z ) ∝ ( σ n ) − (cid:179) + ( M +|| z || T (cid:180) exp (cid:179) − σ n (cid:104) || HX − Y || + (cid:88) i ∈ I || x i || τ i (cid:105)(cid:180) (A.33)which corresponds to the following inverse gamma distribution f ( σ n | Y , X , τ , z ) = I G (cid:179) σ n (cid:175)(cid:175)(cid:175) ( M + || z || ) T (cid:104) || HX − Y || + (cid:88) i ∈ I || x i || τ i (cid:105)(cid:180) . (A.34)Conditional distribution of ω The conditional distribution for ω is f ( ω | z ) ∝ f ( z | ω ) f ( ω ) (A.35) f ( ω | z ) ∝ N (cid:89) i = (cid:104) δ ( z i )(1 − ω ) + δ ( z i − ω (cid:105) (A.36)where 1 represents a function that is 1 for 0 < ω < ( ω | z ) ∝ (1 − ω ) N −|| z || ω || z || (cid:89) i ∈ I δ ( z i ) (cid:89) i ∈ I δ ( z i −
1) (A.37)which corresponds to the following Beta distribution f ( ω | z ) = B e (cid:179) ω (cid:175)(cid:175)(cid:175) + || z || , 1 + N − || z || (cid:180) . (A.38)33 . D ERIVATION OF PROPOSAL ACCEPTANCE PROBABILITY
In section 5 we presented two different kind of proposals used in the algorithm: multipledipole shifts and inter-chain proposals. Both of them consists in proposing to change thevalues of the vectors z and τ jointly. In this appendix we will calculate their acceptanceprobability.Using Bayes’ theorem it is easy to show that the acceptance probability is equal to the ratioof the conditional distribution f ( τ , z | Y , a , σ n , ω ) evaluated in the current and proposed val-ues of the vectors. It is possible to improve the speed of the calculation by only consideringthe i th row if the current value of the element z i differs from the proposed value z pi . Denot-ing this subset of rows by r we can calculate f ( τ r , z r , X r | Y , X − r , z − r , τ − r , a , σ n , ω ) and thenintegrate it with respect to X r .Based on the posterior (3.11), we can derive the conditional distribution of the subset r ofrows as f ( τ r , z r , X r | Y , X − r , z − r , τ − r , a , σ n , ω ) ∝ f ( Y | X , σ n ) (cid:89) i ∈ r f ( x i , z i | σ n , τ i , ω ) (cid:89) i ∈ r f ( τ i | a ) (B.1)which implies f ( τ r , z r , X r | .) ∝ T (cid:89) t = N (cid:179) y t (cid:175)(cid:175)(cid:175) Hx t , σ n I M (cid:180) (B.2) (cid:89) i ∈ r (cid:104) (1 − ω ) δ ( z i ) δ ( x i ) + ωδ ( z i − N (cid:179) x i (cid:175)(cid:175)(cid:175) σ n τ i I T (cid:180)(cid:105)(cid:89) i ∈ r G (cid:179) τ i (cid:175)(cid:175)(cid:175) T +
12 , v i a (cid:180) .Denoting the subsets: I k = { r i : z r i = k } (Note that I ∪ I = r and I ∩ I = (cid:59) ) and theircardinals C k = I k for k = {0, 1} and using identity (A.31) allows us to express this as f ( τ r , z r , X r | .) ∝ T (cid:89) t = N (cid:179) y t (cid:175)(cid:175)(cid:175) Hx t , σ n I M (cid:180) (B.3) (cid:104) (1 − ω ) C ω C (cid:179) (cid:89) i ∈ I δ ( x i ) (cid:180)(cid:179) (cid:89) i ∈ I N ( x i | σ n τ i I T ) (cid:180)(cid:105)(cid:89) i ∈ r G (cid:179) τ i (cid:175)(cid:175)(cid:175) T +
12 , v i a (cid:180) .We now integrate the above expression over X r leading to f ( τ r , z r | .) ∝ (cid:90) X r f ( τ r , z r , X r | .) d X r = (1 − ω ) C ω C (cid:89) i ∈ r G (cid:179) τ i (cid:175)(cid:175)(cid:175) T +
Evaluating $x_i = 0$ for $i \in I_0$ because of the $\delta(x_i)$ terms converts this to

    f(\tau_r^2, z_r \mid \cdot) \propto (1 - \omega)^{C_0} \omega^{C_1}\, I \prod_{i \in r} \mathcal{G}\Big( \tau_i^2 \,\Big|\, \frac{T+1}{2}, \frac{v_i^2 a}{2} \Big)    (B.5)

with

    I = \int_{X_{I_1}} \Big[ \Big( \prod_{t=1}^{T} \mathcal{N}\big( y_t \,\big|\, H_{-I_0} x_t^{-I_0}, \sigma_n^2 I_M \big) \Big) \Big( \prod_{i \in I_1} \mathcal{N}\big( x_i \,\big|\, 0, \sigma_n^2 \tau_i^2 I_T \big) \Big) \Big] dX_{I_1}    (B.6)

where $x_t^{-s}$ denotes the vector $x_t$ with all the rows in $s$ eliminated and $H_{-s}$ the matrix $H$ with all the columns in $s$ eliminated. $I$ can also be expressed as

    I = (2\pi\sigma_n^2)^{-\frac{MT}{2}} (2\pi\sigma_n^2)^{-\frac{T C_1}{2}} \prod_{i \in I_1} (\tau_i^2)^{-\frac{T}{2}} \int_{X_{I_1}} \exp\Big( -\sum_{t=1}^{T} C_t \Big) dX_{I_1}    (B.7)

with

    C_t = \frac{1}{2\sigma_n^2} \Big( \|H_{-I_0} x_t^{-I_0} - y_t\|^2 + \sum_{i \in I_1} \frac{(x_t^i)^2}{\tau_i^2} \Big) = \frac{1}{2\sigma_n^2} \Big( \|H_{-r} x_t^{-r} + H_{I_1} x_t^{I_1} - y_t\|^2 + \sum_{i \in I_1} \frac{(x_t^i)^2}{\tau_i^2} \Big).

Denoting $D_t = H_{-r} x_t^{-r} - y_t$, this can be expressed as

    C_t = \frac{1}{2\sigma_n^2} \Big( \|D_t + H_{I_1} x_t^{I_1}\|^2 + \sum_{i \in I_1} \frac{(x_t^i)^2}{\tau_i^2} \Big).    (B.8)
If we denote $Q = \mathrm{diag}(\tau_{I_1}^2)^{-1}$, where $\mathrm{diag}(s)$ is the diagonal square matrix formed by the elements of vector $s$, then

    C_t = \frac{1}{2\sigma_n^2} \Big( (x_t^{I_1})^T \big( H_{I_1}^T H_{I_1} + Q \big) x_t^{I_1} + 2 D_t^T H_{I_1} x_t^{I_1} + D_t^T D_t \Big).    (B.9)

Matching each term between the previous expression and $\frac{1}{2} ( x_t^{I_1} - \mu_t )^T \Sigma^{-1} ( x_t^{I_1} - \mu_t ) + K_t$

    \frac{1}{2\sigma_n^2} \Big( (x_t^{I_1})^T \big( H_{I_1}^T H_{I_1} + Q \big) x_t^{I_1} + 2 D_t^T H_{I_1} x_t^{I_1} + D_t^T D_t \Big) = \frac{1}{2} \Big( (x_t^{I_1})^T \Sigma^{-1} x_t^{I_1} - 2 \mu_t^T \Sigma^{-1} x_t^{I_1} + \mu_t^T \Sigma^{-1} \mu_t \Big) + K_t    (B.10)

we get the following equations

    \frac{1}{\sigma_n^2} \big( H_{I_1}^T H_{I_1} + Q \big) = \Sigma^{-1}    (B.11)

    \frac{D_t^T H_{I_1}}{\sigma_n^2} = -\mu_t^T \Sigma^{-1} \;\Rightarrow\; \mu_t = -\frac{\Sigma H_{I_1}^T D_t}{\sigma_n^2}    (B.12)

    \frac{1}{2} \mu_t^T \Sigma^{-1} \mu_t + K_t = \frac{D_t^T D_t}{2\sigma_n^2} \;\Rightarrow\; K_t = \frac{D_t^T D_t}{2\sigma_n^2} - \frac{1}{2} \mu_t^T \Sigma^{-1} \mu_t.    (B.13)

We can now calculate the value of $I$

    I = (2\pi\sigma_n^2)^{-\frac{MT}{2}} (2\pi\sigma_n^2)^{-\frac{T C_1}{2}} \prod_{i \in I_1} (\tau_i^2)^{-\frac{T}{2}} \exp\Big( -\sum_{t=1}^{T} K_t \Big) \int_{X_{I_1}} \prod_{t=1}^{T} \exp\Big( -\frac{1}{2} (x_t^{I_1} - \mu_t)^T \Sigma^{-1} (x_t^{I_1} - \mu_t) \Big) dX_{I_1} = (2\pi\sigma_n^2)^{-\frac{MT}{2}} (\sigma_n^2)^{-\frac{T C_1}{2}} \prod_{i \in I_1} (\tau_i^2)^{-\frac{T}{2}} \exp\Big( -\sum_{t=1}^{T} K_t \Big) |\Sigma|^{\frac{T}{2}}.    (B.14)

Finally, using the value of $I$ in (B.5) yields

    f(\tau_r^2, z_r \mid \cdot) \propto (1 - \omega)^{C_0} \omega^{C_1} (\sigma_n^2)^{-\frac{T C_1}{2}} |\Sigma|^{\frac{T}{2}} \prod_{i \in I_1} (\tau_i^2)^{-\frac{T}{2}} \exp\Big( -\sum_{t=1}^{T} K_t \Big) \prod_{i \in r} \mathcal{G}\Big( \tau_i^2 \,\Big|\, \frac{T+1}{2}, \frac{v_i^2 a}{2} \Big).    (B.15)
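As a numerical sanity check of the derivation, (B.11)-(B.15) can be evaluated directly. The sketch below is a concrete version of the log_f_collapsed helper assumed earlier, with the fixed quantities passed explicitly; it computes $\log f(\tau_r^2, z_r \mid \cdot)$ up to an additive constant shared by the current and proposed values. All names are assumptions of this example: X is the N x T matrix of rows $x_i$ and r is a list of row indices.

import numpy as np
from scipy.stats import gamma as gamma_dist

def log_f_collapsed(H, X, Y, r, z, tau2, a, v2, sigma_n2, omega):
    # Log of (B.15) up to an additive constant.
    T = Y.shape[1]
    I1 = [i for i in r if z[i] == 1]
    C0, C1 = len(r) - len(I1), len(I1)
    out = C0 * np.log1p(-omega) + C1 * np.log(omega)
    for i in r:  # gamma priors G((T+1)/2, a v_i^2 / 2) on tau_i^2
        out += gamma_dist.logpdf(tau2[i], (T + 1) / 2, scale=2.0 / (a * v2[i]))
    if C1 == 0:
        return out
    keep = [j for j in range(H.shape[1]) if j not in r]
    D = H[:, keep] @ X[keep] - Y                      # columns are the D_t
    H1 = H[:, I1]
    Sigma = sigma_n2 * np.linalg.inv(H1.T @ H1 + np.diag(1.0 / tau2[I1]))  # (B.11)
    Mu = -(Sigma @ H1.T @ D) / sigma_n2               # columns are the mu_t, (B.12)
    K = (np.sum(D * D, axis=0) / (2 * sigma_n2)
         - 0.5 * np.sum(Mu * np.linalg.solve(Sigma, Mu), axis=0))          # (B.13)
    _, logdet = np.linalg.slogdet(Sigma)
    return (out - 0.5 * T * C1 * np.log(sigma_n2) + 0.5 * T * logdet
            - 0.5 * T * np.sum(np.log(tau2[I1])) - K.sum())

Only the rows in $r$ enter the acceptance ratio, so factors identical for the current and proposed values, such as $(2\pi\sigma_n^2)^{-MT/2}$, can safely be dropped.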