A Bayesian Generalized CAR Model for Correlated Signal Detection
D. Andrew Brown ∗ Gauri S. Datta † Nicole A. Lazar † February 9, 2017
Abstract
Over the last decade, large-scale multiple testing has found itself at the forefront of modern data analysis. In many applications the data are correlated, so that the observed test statistic used for detecting a non-null case, or signal, at each location in a dataset carries some information about the chances of a true signal at other locations. Brown, Lazar, Datta, Jang, and McDowell (2014) proposed in the neuroimaging context a Bayesian multiple testing model that accounts for the dependence of each volume element on the behavior of its neighbors through a conditional autoregressive (CAR) model. Here, we propose a generalized CAR model that allows for inclusion of points with no neighbors at all, something that is not possible under conventional CAR models. We consider also neighborhoods based on criteria other than physical location, such as gene pathways in microarray data determined from existing biological knowledge. This generalization provides a unified framework for the simultaneous modeling of dependent and independent cases, resulting in stronger Bayesian learning in the posterior and increased precision in the estimates of interesting signals. We justify the selected prior distribution and prove that the resulting posterior distribution is proper. We illustrate the effectiveness and applicability of our proposed model by using it to analyze both simulated and real microarray data in which the genes exhibit dependence that is determined by physical adjacency on a chromosome or predefined gene pathways.
Key Words: conditional autoregressive model, enrichment, microarray, multiple testing, significance analysis of microarrays, spike-and-slab prior

∗ Corresponding author, Department of Mathematical Sciences, Clemson University, Clemson, SC 29634, Email: [email protected]
† Department of Statistics, University of Georgia, Athens, GA 30602

1 Introduction
The analysis of high throughput data presents many challenges to researchers across a variety of disciplines. Many of the problems that must be dealt with are ubiquitous in the sciences but are exacerbated when the datasets are massive in size. Often, the goal is to detect the presence or absence of a signal over an extremely large number of cases, creating a massive multiple testing problem. Prior to the last two decades, most multiple testing procedures were constructed to control an overall error rate for a relatively small number of simultaneous tests (Efron, 2010). The advent of high throughput technology quickly revealed, however, that classical procedures can be inappropriate in the presence of thousands of simultaneous tests (Benjamini, 2010).

Suppose our data consist of J cases, each of which arises independently from a normal distribution with case-specific mean, y_j ∼ N(θ_j, σ²), j = 1, ..., J. For example, in gene microarray or next-generation RNA sequencing, y_j may be the test statistic quantifying the differential expression of gene j between cancerous and healthy tissue (Efron and Tibshirani, 2002; Li and Tibshirani, 2011). In functional magnetic resonance imaging (fMRI), signals are observed over time at thousands of points in the brain and a test statistic y_j is calculated at each point j to summarize the observed difference in that area's signal between some stimulus condition versus a baseline (Friston, Holmes, Worsley, Poline, Frith, and Frackowiak, 1995). Frequently, a question of interest is whether or not θ_j = 0 at each j; i.e., a hypothesis test is conducted at each of thousands of locations to determine which of the θ_j are non-zero, indicating an interesting signal. The problem is to find a statistical procedure which corrects for multiple testing but does not sacrifice too much sensitivity for the sake of preventing false positives.

A similar issue arises in variable selection, where one is interested in determining which (of possibly many) variables contribute in a meaningful way to the observed response. A Bayesian approach is to assume that the coefficient corresponding to each variable, β_j, belongs to one of two distinct classes, the null class in which β_j ∼ N(0, σ²), or the non-null class, β_j ∼ N(θ_j, σ²), θ_j ≠ 0 (e.g., Mitchell and Beauchamp, 1988; George and McCulloch, 1993; Scott and Berger, 2006; Efron, 2008). Each β_j is assigned an a priori probability p of belonging to the null class. This value may be regarded as the proportion of all possible cases which would be null, so that p close to one reflects a belief that interesting cases are rare. The proportion p is usually unknown, but it can be assigned a prior distribution to reflect the researcher's beliefs about the level of sparsity in the data, or it can be estimated via empirical Bayes. The posterior inclusion probabilities (Barbieri and Berger, 2004) can subsequently be estimated from the posterior distribution. Placing a Beta prior on p induces a multiplicity adjustment in that the model automatically penalizes for the number of tests in a posteriori probability statements. Scott and Berger (2010) discuss this issue and the conditions under which multiplicity correction can be induced.

Much of the work thus far developed is based on the assumption of independent hypothesis tests. This assumption is untenable in many applications.
For example, nontrivial dependence structures are known to exist in neuroimaging data (Lee, Jones, Caffo, and Bassett, 2014; Zhang, Guindani, Versace, and Vannucci, 2014), syndromic surveillance (Banks, Datta, Karr, Lynch, Niemi, and Vera, 2012), gene microarray (Zhao, Kang, and Yu, 2014), and RNA sequencing (RNA-seq; Love, Huber, and Anders, 2014). Correlation can cause the null distribution of the observed test statistics to be over- or under-dispersed relative to the theoretical null under independence. Consequently, either too few or too many test statistics may be declared significant. The deleterious impact correlation can have on empirical Bayes methods and false discovery rate control (FDR; Benjamini and Hochberg, 1995) was investigated in Qiu, Klebanov, and Yakovlev (2005). Efron (2007) focused on the effects of dependence on the distribution of test statistics. While some work has been done on incorporating known dependence structure into Bayesian models for identification of interesting cases (e.g., Smith and Fahrmeir, 2007; Li and Zhang, 2010; Stingo, Chen, Tadesse, and Vannucci, 2011; Lee, Jones, Caffo, and Bassett, 2014; Zhao, Kang, and Yu, 2014; Zhang, Guindani, Versace, and Vannucci, 2014), it has been limited, particularly with respect to exploring the multiple testing adjustments incurred through data-dependent estimation of inclusion probabilities.

Many datasets include isolated observations. For example, genes in microarray data share common pathways, but many genes are in no pathway at all. It is important to include as many cases as possible when evaluating the posterior distribution. A standard CAR structure assumes every observation has at least one neighbor, so that one is forced to either use an inappropriate neighborhood structure or exclude isolated points altogether. In the current paper, we extend a model proposed by Brown, Lazar, Datta, Jang, and McDowell (2014) for Bayesian multiple hypothesis testing and discuss more general neighborhood structures to provide a unified treatment of cases with at least one neighbor and with no neighbor. We use a less restrictive improper prior distribution for the variance components and establish the propriety of the posterior distribution of our model, ensuring that inferences are valid. In addition to allowing the newly proposed model to identify isolated non-null cases, we demonstrate that the inclusion of isolated points results in stronger Bayesian learning and improved estimation of the signal strengths of the selected cases.

We briefly motivate a common model used in Bayesian signal detection in Section 2. This leads to our proposed extension of the so-called spike-and-slab prior to accommodate local dependence. We prove the propriety of the posterior distribution of our proposed model and discuss computation. In Section 3, we simulate correlated microarray data to study our model's performance against a prior assuming independence and against a conventional CAR structure. We also compare it against results obtained from the significance analysis of microarrays (SAM) procedure (Efron, Tibshirani, Storey, and Tusher, 2001; Tusher, Tibshirani, and Chu, 2001; Efron, 2010). We apply our procedure to two real gene microarray datasets in Section 4, one using dependence determined by adjacency on a chromosome, and the other with gene pathways defining the neighborhoods. Section 5 concludes with a discussion of the results.

To facilitate a Bayesian multiple testing correction, we postulate a "two groups model" (Efron, 2008).
That is, we assume two possible cases for each of the observed test quantities, reflected in a Bayesian model through the prior on θ_j,

    π(θ_j | p, τ²) = p δ₀(θ_j) + (1 − p) ϕ_{0,τ²}(θ_j),   j = 1, ..., J,      (1)

where δ₀(·) is the Dirac delta spike at zero and ϕ_{0,τ²}(·) is the Gaussian density function with mean zero and variance τ². So-called "spike-and-slab" or "spike-and-bell" priors of this form are standard in the Bayesian variable selection framework. They were introduced by Mitchell and Beauchamp (1988) for variable selection in linear regression. The mixture model was used in Geweke (1996), who provided a procedure for selecting models subject to order constraints among the variables included in each model. A similar approach was taken in George and McCulloch (1993), who treated each regression coefficient as arising from a mixture of two continuous distributions with different variances for stochastic search variable selection. Literature on Bayesian variable selection was reviewed in Clyde and George (2004) and O'Hara and Sillanpää (2009). Scott and Berger (2006) explored Model (1) and ways in which it induces multiplicity correction, along with graphical displays and decision rules for subsequent inferences.

By allowing p to be determined by the data, the joint posterior distributions obtained from spike-and-slab priors adapt to the number of tests, resulting in the posterior inclusion probabilities, p̄_j := P(θ_j ≠ 0 | y), being penalized to account for the multiple tests (Scott and Berger, 2006, 2010).
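As a concrete illustration of the two-groups model, the following R sketch simulates test statistics under prior (1); the particular values of p, τ², and σ² are arbitrary choices for illustration, not values used in the paper.

    # Simulate J test statistics from the two-groups (spike-and-slab) model:
    # theta_j = 0 with probability p, theta_j ~ N(0, tau2) otherwise,
    # and y_j | theta_j ~ N(theta_j, sigma2).
    set.seed(1)
    J      <- 1000
    p      <- 0.9      # illustrative prior null probability
    tau2   <- 4        # illustrative slab variance
    sigma2 <- 1        # illustrative error variance

    gamma <- rbinom(J, size = 1, prob = 1 - p)            # 1 = non-null indicator
    theta <- gamma * rnorm(J, mean = 0, sd = sqrt(tau2))
    y     <- rnorm(J, mean = theta, sd = sqrt(sigma2))

    # Marginally, null y_j ~ N(0, sigma2) and non-null y_j ~ N(0, sigma2 + tau2).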
Specifically, Scott and Berger (2006) used a Beta(α, 1) prior density on p, where α is a specified value, and π(τ², σ²) = (τ² + σ²)^{-2} as a prior density for the variance components. Under this model, Scott and Berger (2006, Lemma 3) showed that

    p̄_j = 1 − E[ ( 1 + ((1 − p)/p) √(σ²/(σ² + τ²)) exp{ y_j² τ² / (2σ²(σ² + τ²)) } )^{-1} ],      (2)

where the expectation is taken with respect to the joint posterior distribution of p, σ², and τ².

Posterior inference can be sharpened if we exploit correlation among potential predictors in the search for interesting signals. Brown, Lazar, Datta, Jang, and McDowell (2014) proposed allowing the continuous component of (1) to share information across observations by reparameterizing θ_j as γ_jµ_j, where γ_j iid∼ Bern(1 − p) and µ_j is Gaussian, so that y ∼ N(Γµ, σ²I), where y = (y₁, ..., y_J)ᵀ, Γ = diag{γ_i, i = 1, ..., J}, and µ = (µ₁, ..., µ_J)ᵀ. The lattice structure of datasets such as those arising from fMRI and gene microarray makes a conditional autoregressive model (CAR; Besag, 1974) a natural choice for incorporating local dependence into the prior on µ. Since the potential non-null signals are expected to be as much positive as negative, a priori it is reasonable to assume such signals have zero means. Thus, we consider prior distributions of the form µ_j | µ_(−j) ∼ N( Σ_{i=1}^J c_{ji} µ_i, τ_j² ), j = 1, ..., J, where µ_(−j) = (µ₁, ..., µ_{j−1}, µ_{j+1}, ..., µ_J)ᵀ, c_{jj} = 0, and c_{ji} = 0 except when cases j and i are neighbors. The intrinsic autoregressive model (IAR; Besag, York, and Mollié, 1991) emerges by taking c_{ji} = w_{ji}/w_{j.} and τ_j² = τ²/w_{j.}, where w_{ji} ≠ 0 if and only if sites j and i are neighbors and w_{j.} = Σ_{i=1}^J w_{ji}. Under an IAR model for µ, the prior density is given by

    π(µ | τ²) ∝ exp{ −(1/(2τ²)) µᵀ(D_w − W)µ },      (3)

where D_w = diag{w_{j.}, j = 1, ..., J} and W = {w_{ji}}_{j,i=1}^J. Note that (D_w − W)1 = 0, so that the precision matrix has a nontrivial nullspace and hence the IAR is improper. However, a "propriety parameter", ρ, can be used in the conditional distributions so that µ_j | µ_(−j) ∼ N(ρ Σ_{i=1}^J w_{ji}µ_i/w_{j.}, τ²/w_{j.}), with precision matrix τ^{-2}(D_w − ρW). In general, the precision matrix will be nonsingular if λ₁^{-1} < ρ < λ_J^{-1}, where λ₁ and λ_J are the smallest and largest eigenvalues of D_w^{-1/2} W D_w^{-1/2}, respectively (Banerjee, Carlin, and Gelfand, 2015).

Any data that have a lattice structure with known or suspected correlations occurring along predefined networks can be modeled with a CAR model. For instance, genes in microarray are known to express themselves in clusters along a chromosome (e.g., Xiao, Reilly, and Khodursky, 2009), or to behave in concert along specific gene pathways (Subramanian, Tamayo, Mootha, Mukherjee, Ebert, Gillette, Paulovich, Pomeroy, Golub, Lander, and Mesirov, 2005). Thus, neighborhoods can be defined in terms of adjacency on a chromosome or based on genes sharing certain predefined pathways determined from prior knowledge. Care should be taken in defining neighborhoods, though, as microarray / RNA-seq datasets often include genes that are not members of any known pathway and thus are isolated. Including isolated points in the IAR induces zero rows in the precision matrix, a problem that cannot be fixed with a propriety parameter.
In response, we adjust the neighborhood weights to allow for inclusion of the isolated points while avoiding a singular precision matrix.

We modify the usual IAR model by defining the neighborhood weights about µ_j to be c_{ji} = w_{ji}/(d + w_{j.}) with conditional variance τ²/(d + w_{j.}), where d > 0. The consequent precision matrix is τ^{-2}(D_w + dI − W). Then xᵀ(D_w + dI − W)x = Σ_{i=1}^J d x_i² + (1/2) Σ_i Σ_j w_{ij}(x_i − x_j)² ≥ 0, with strict inequality for x ≠ 0. Thus, with d > 0, we are able to include isolated points in the model while maintaining the propriety of the distribution. If we take d = 1, then for any isolated point j′, w_{j′.} = 0 so that E(µ_{j′} | µ_(−j′)) = 0, Var(µ_{j′} | µ_(−j′)) = τ², and µ_{j′} | τ² ∼ N(0, τ²) independently of other points. Hence, we can facilitate conditional independence of µ_{j′} while allowing all J points to share information about plausible values of the hypervariance through the prior distribution on the variance components. Taking this view, we can express the traditional IAR model as a special case in which every point in a dataset has at least one neighbor and d = 0.

Let the joint density of the data and parameters be given by f(y, ψ, µ, p, γ), where ψ contains nuisance parameters modeled in the prior distribution. When γ_j = 0 for all j, µ does not appear in the resulting likelihood and thus is Bayesianly unidentified (Gelfand and Sahu, 1999; Eberly and Carlin, 2000). This means that ∫ f(y, ψ | γ = 0, µ) π(µ) dµ ≡ ∫ f(y, ψ | γ = 0) π(µ) dµ = f(y, ψ | γ = 0) ∫ π(µ) dµ, so that we must have a proper prior on µ for the posterior distribution to be proper, making the inclusion of ρ necessary when d = 0. See McLachlan and Peel (2000, Chapter 4) for further discussion of prior distributions in finite mixture models.

Usually, there is little direct information available about ρ, so estimating it may be difficult. Previous work has shown that appreciable interaction between adjacent points only occurs when ρ is close to its upper bound under the d = 0 model (Banerjee, Carlin, and Gelfand, 2015). To give the data more freedom in determining the spatial association without specific regard for interpretability, we consider the prior π_ρ(ρ) ∝ I(λ₁^{-1} < ρ < λ_J^{-1}). It is important to note that inclusion of ρ is still possible when d > 0, provided that ρ is bounded between the reciprocals of the smallest and largest eigenvalues of (D_w + dI)^{-1/2} W (D_w + dI)^{-1/2}.

An additional advantage of including ρ in the joint model for µ is that, under positive spatial association, the posterior distribution becomes insensitive to the choice of d in the neighborhood weights. This is because as d grows, ρ is allowed to increase as well. In other words, if d₁ < d₂, then λ_{J,1}^{-1} < λ_{J,2}^{-1}, where λ_{J,i} is the largest eigenvalue of (D_w + d_i I)^{-1/2} W (D_w + d_i I)^{-1/2}, i = 1, 2.

There is usually little prior information available about the error variance, σ², or the hypervariance, τ². Gelman (2006) suggested that the priors specified for the variance parameters in hierarchical models may have a disproportionate effect in that they can impose strong restrictions on posterior inference. Conversely, priors used for scale hyperparameters that are intended to be noninformative may, in fact, be too weak by placing considerable probability on unreasonable extreme values in the posterior. To address this possibility, Gelman (2006) proposed the use of a weakly informative (vague but proper) prior on the scale hyperparameter such as the folded-t distribution. Scott and Berger (2006) argued for the use of a joint prior on σ² and τ² with density π_{(τ²,σ²)}(τ², σ²) = (τ² + σ²)^{-2} = (σ²)^{-1}(1 + τ²/σ²)^{-2}(σ²)^{-1} ≡ π_{τ²|σ²}(τ² | σ²) π_{σ²}(σ²), so that a standard improper prior on σ² can be used while scaling τ² by σ². The prior on τ² | σ² is similar to the prior on τ² which results from placing a folded-t prior on τ with scale σ. Supplementary Figure 6 illustrates a slight difference between the two priors for small values of τ², so that the Scott-Berger (SB) prior on τ² is slightly less informative than a folded-t. However, the SB prior and the folded-t-based prior are tail equivalent in the sense that the ratio of the two densities is O(1) as τ² → ∞. Both priors attempt to balance the freedom of the data to inform about this parameter against the need to protect against unreasonable values. We thus follow the precedent set by Scott and Berger (2006) and use the same joint prior distribution on τ² and σ².

This leads to the following model:

    y_j | γ_j, µ_j, σ² ∼ N(γ_j µ_j, σ²),   γ_j | p iid∼ Bern(1 − p),   j = 1, ..., J,
    µ_j | µ_(−j), τ², ρ ∼ N( Σ_{i=1}^J ρ w_{ji} µ_i / (d + w_{j.}),  τ² / (d + w_{j.}) ),   d ≥ 0,   j = 1, ..., J,
    p ∼ Beta(α, 1),   α ≥ 1,
    ρ ∼ Unif(ν₁^{-1}, ν_J^{-1}),
    π_{τ²|σ²}(τ² | σ²) = (1/σ²)(1 + τ²/σ²)^{-2},   τ² > 0,
    π_{σ²}(σ²) = 1/σ²,   σ² > 0,      (4)

where ν₁ and ν_J are the smallest and largest eigenvalues of (D_w + dI)^{-1/2} W (D_w + dI)^{-1/2}, respectively. Since we are using an improper prior in (4), the posterior density is not guaranteed to be integrable. We provide a proof in the Appendix that the posterior distribution is indeed proper.
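The eigenvalue bounds that define the support of ρ in (4) are straightforward to compute. The R sketch below, using an arbitrary small adjacency matrix with one isolated point purely for illustration, constructs D_w + dI and the bounds (ν₁^{-1}, ν_J^{-1}); the specific matrix and the value of d are hypothetical.

    # Toy adjacency matrix: sites 1-2-3 form a chain, site 4 is isolated.
    W <- matrix(0, 4, 4)
    W[1, 2] <- W[2, 1] <- 1
    W[2, 3] <- W[3, 2] <- 1

    d      <- 1                              # generalized CAR offset
    Dstar  <- diag(rowSums(W) + d)           # D_w + d I
    M      <- diag(1 / sqrt(diag(Dstar))) %*% W %*% diag(1 / sqrt(diag(Dstar)))
    ev     <- eigen(M, symmetric = TRUE)$values
    rho_lo <- 1 / min(ev)                    # nu_1^{-1} (negative)
    rho_hi <- 1 / max(ev)                    # nu_J^{-1} (positive)

    # Any rho in (rho_lo, rho_hi) yields a positive definite precision matrix:
    rho <- 0.9 * rho_hi
    all(eigen(Dstar - rho * W, symmetric = TRUE)$values > 0)   # TRUE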
The practical effect of ρ having room to increase along with d in our proposed model can be seen through simulation. We simulate a 20 × 20 array of observations arising from both null and non-null distributions. The activation pattern is created by drawing from an Ising distribution (e.g., Higdon, 1994), p(x) ∝ exp(β Σ_{i∼j} I(x_i = x_j)), x ∈ {0, 1}^{400}, with interaction parameter β = 0.35. The null cases (x_i = 0) are drawn from N(0, 1) and the non-null (x_i = 1) cases are drawn from a normal distribution with mean slightly greater than three. We fit model (4) to these data under d = 0, 1, and 5. See Subsection 2.3 for implementation details. A descriptive measure of spatial association is Moran's I,

    I(y) = n Σ_i Σ_j w_{ij}(y_i − ȳ)(y_j − ȳ) / [ (Σ_{i≠j} w_{ij}) Σ_i (y_i − ȳ)² ],

where values away from zero are evidence of spatial association according to the predefined neighborhood structure (e.g., Banerjee, Carlin, and Gelfand, 2015, Sec. 4.1). Figure 1 displays smoothed histograms of realizations of I(y*) from 2,000 replications each from the three respective posterior predictive distributions, p(y* | y) = ∫ f(y* | θ) π(θ | y) dθ, along with the approximate marginal posterior densities of ρ. For each value of d, the posterior of ρ tends to concentrate near its upper bound and the posterior predictive densities of I are nearly indistinguishable. Regardless of the value of d, ρ adjusts accordingly and the overall spatial association is consistent with the data.

Generally speaking, including as many cases as possible when evaluating the posterior distribution is important. It is often the case that the dataset to be analyzed contains subsets of correlated observations among many independent observations. A standard CAR structure assumes every observation has at least one neighbor, so that the only possibilities for dealing with both types of data are to force every data point into the model via an inappropriate neighborhood structure or to exclude isolated cases. Using an inappropriate dependence structure is clearly undesirable, and excluding isolated points can result in discarding a large proportion of the data. With our proposed approach, adjusting the weights with d > 0 accommodates both types of cases simultaneously; in our illustrations we take d = 0 or d = 1. These are admittedly ad hoc values, and may not be appropriate for all situations.

Figure 1: Smoothed histograms of 2,000 realizations of Moran's I from the corresponding posterior predictive distributions (left panel) and estimated marginal posterior distributions of ρ under model (4) (right panel). The dark vertical line in the left panel is at the observed value, I(y).

We facilitate the Gibbs sampling algorithm (Geman and Geman, 1984) by reparameterizing the model as τ² = ησ². For ease of notation, define D*_w := D_w + dI. Then

    π(µ, γ, σ², η, ρ, p | y) ∝ (σ²)^{-J/2−1} exp{ −(y − Γµ)ᵀ(y − Γµ) / (2σ²) }
        × |ησ²(D*_w − ρW)^{-1}|^{-1/2} exp{ −µᵀ(D*_w − ρW)µ / (2ησ²) }
        × p^{J − Σ_{i=1}^J γ_i + α − 1} (1 − p)^{Σ_{i=1}^J γ_i} (1 + η)^{-2} I(ν₁^{-1} < ρ < ν_J^{-1}).

The full conditional distribution of µ is N(RΓy, σ²R), where R = (Γ + η^{-1}(D*_w − ρW))^{-1} (Carlin and Louis, 2009). To avoid matrix inversion with extremely large J, we update µ element-wise. The full conditional distributions are given in the Supplementary Material.
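The element-wise update can be read off the joint density above: conditional on everything else, µ_j is Gaussian with precision proportional to γ_j + (d + w_{j.})/η. The following R sketch illustrates one sweep of such an update; it is a sketch consistent with the displayed joint density (the exact full conditionals used in the paper are in the Supplementary Material), and all object names are placeholders.

    # One sweep of element-wise updates for mu, derived from the joint density
    # pi(mu, gamma, sigma2, eta, rho, p | y) displayed above.
    # Assumes d > 0 or every site has at least one neighbor.
    update_mu <- function(mu, y, gamma, W, d, sigma2, eta, rho) {
      w_dot <- rowSums(W)
      for (j in seq_along(mu)) {
        prec_j <- (gamma[j] + (d + w_dot[j]) / eta) / sigma2          # conditional precision
        mean_j <- (gamma[j] * y[j] + rho * sum(W[j, ] * mu) / eta) /
                  (gamma[j] + (d + w_dot[j]) / eta)                   # conditional mean
        mu[j]  <- rnorm(1, mean = mean_j, sd = sqrt(1 / prec_j))
      }
      mu
    }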
We use rejection sampling to draw from the full conditional distribution of η. However, the importance ratio determining the acceptance probability tends to zero as η → 0, which means that the iterations can slow down on this step with candidate densities that concentrate on extremely small values of η. Possible alternatives are an adaptive Metropolis algorithm (Haario, Saksman, and Tamminen, 2005) or adaptive rejection sampling (Gilks and Wild, 1992).

Since ρ has bounded support, we follow the suggestion of Carlin and Banerjee (2003) and use slice sampling (Neal, 2003) to sample from the full conditional distribution of ρ. Our experience is that the algorithm performs better with the "doubling" procedure outlined by Neal (2003) to adaptively determine good proposal intervals. Note that the determinant |D*_w − ρW|^{1/2} ∝ |I − ρ(D*_w)^{-1/2} W (D*_w)^{-1/2}|^{1/2} in the conditional density of ρ can be quickly computed using the eigenvalues of (D*_w)^{-1/2} W (D*_w)^{-1/2}, which do not depend on any unknown parameters. These eigenvalues will have already been computed if using the proposed model in (4). Matrix computations can also be eased by calculating µᵀD*_wµ = Σ_{j=1}^J (d + w_{j.})µ_j² and µᵀWµ before searching for an acceptable update for ρ. These only need to be calculated once for each Gibbs iteration.

From (10) in the Supplementary Material, we can see the strong dependence between γ and p in their full conditional distributions. On the k-th iteration of a Gibbs sampling routine, if the sampled draw p^{(k)} is extremely close to 1, then most of the draws γ_i^{(k)}, i = 1, ..., J, will be zero. But then Σ_i γ_i will be close to zero, so that the conditional Beta density will concentrate close to 1, leading to another high value of p, and so on. Thus, in spite of the computationally convenient conditional conjugacy, an MCMC routine can get stuck in the region of the parameter space with most γ_i = 0, which slows mixing. This situation can be ameliorated by reparameterizing the mixing proportion p to eliminate boundary constraints, for instance a logit transformation π := log(p/(1 − p)), and using Langevin-Hastings proposals for π to push the chain toward the posterior mode (Gilks et al., 1996, Ch. 6). Occasionally mixing in ordinary Metropolis proposals in place of Langevin-Hastings proposals offers further improvements when the chain is far from the mode (Carlin and Louis, 2009, Ch. 3).

The quantities of interest are the marginal posterior inclusion probabilities for each signal j, P(γ_j = 1 | y). To estimate this from the posterior sample draws, we recognize that in Model (4), P(γ_j = 1 | y) = E(p*_j), where the expectation is taken with respect to π(µ, γ_(−j), σ², η, ρ, p | y), and p*_j := P(γ_j = 1 | µ, γ_(−j), σ², η, ρ, p, y) is given by

    p*_j = (1 − p) ϕ_{0,σ²}(y_j − µ_j) / [ (1 − p) ϕ_{0,σ²}(y_j − µ_j) + p ϕ_{0,σ²}(y_j) ].      (5)

This quantity can be estimated by N^{-1} Σ_{i=1}^N p̂*_j^{(i)}, where p̂*_j^{(i)} is the plug-in estimate of p*_j evaluated with the i-th draws p^{(i)}, µ_j^{(i)}, σ^{2,(i)} from the approximate joint posterior, and N is the Monte Carlo sample size. Similarly, we can use (2) to estimate the inclusion probabilities under the Scott-Berger model using p^{(i)}, σ^{2,(i)}, τ^{2,(i)} drawn from the appropriate posterior. Both of these estimators are "Rao-Blackwellized" in the sense of Gelfand and Smith (1990) and thus have smaller Monte Carlo variance than other more naive estimators (Carlin and Louis, 2009, Ch. 3).
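A minimal sketch of the Rao-Blackwellized estimator in (5), assuming the retained MCMC draws of p, µ_j, and σ² are already stored as vectors (the object names are placeholders):

    # Rao-Blackwellized estimate of P(gamma_j = 1 | y) from N retained draws.
    # p_draws, mu_j_draws, sigma2_draws: vectors of posterior draws (placeholders).
    inclusion_prob <- function(y_j, p_draws, mu_j_draws, sigma2_draws) {
      sd_draws <- sqrt(sigma2_draws)
      num <- (1 - p_draws) * dnorm(y_j, mean = mu_j_draws, sd = sd_draws)
      den <- num + p_draws * dnorm(y_j, mean = 0, sd = sd_draws)
      mean(num / den)        # average of p*_j over the posterior draws
    }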
To evaluate the performance of our model, we simulate a dataset in a manner similar to that carried out in Xiao, Reilly, and Khodursky (2009), resulting in a correlation structure similar to that which sometimes arises between genes on chromosomes. We consider a situation in which 1,000 gene expressions are recorded for ten subjects, five control (i = 1, ..., 5) and five treatment (i = 6, ..., 10). For the j-th gene on the i-th subject, X_ij is drawn from a normal distribution with mean µ_ij, where µ_ij = 0 for the control subjects at every gene and for the treatment subjects at the null genes. For the treatment subjects, the differentially expressed genes occur in five blocks of twenty, three blocks sharing a common positive mean (the last of these ending at gene 230) and two blocks sharing a common negative mean (the first of these beginning at gene 311). For the control group, the test statistics are thus (X_{i,1}, ..., X_{i,1000})ᵀ = X_i ∼ N(0, I), i = 1, ..., 5. For the treatment group, the test statistics belonging to the null class are again simulated as i.i.d. standard normal. We model correlation among the differentially expressed (non-null) cases by drawing each group of twenty test statistics as X_i^(k) ∼ N(µ^(k), Σ), i = 6, ..., 10, where X_i^(k), k = 1, ..., 5, is the k-th cluster of non-null cases (i.e., µ^(k) is a constant vector with positive entries for k = 1, 2, 3 and a constant vector with negative entries for k = 4, 5) and Σ has the autoregressive structure Σ_{ij} = c^{|i−j|} for a constant 0 < c < 1. This results in null cases that are independent of each other, but correlated non-null cases. Pooled t statistics, t_j, j = 1, ..., 1000, are computed and converted to the z scale through their p-values (Efron, 2010), yielding test statistics y = (y₁, ..., y₁₀₀₀)ᵀ, where y_j = Φ^{-1}(F(t_j)) with F(·) being the distribution function of the t statistics.
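A sketch of this data-generating mechanism for a single non-null cluster of one treatment subject is given below; the cluster mean, the autoregressive parameter, and the use of the MASS package are illustrative choices, not the exact values used in the simulation.

    # Simulate one treatment subject's cluster of 20 correlated non-null
    # expression values with an AR(1)-type covariance Sigma_{ij} = c^{|i-j|}.
    library(MASS)                      # for mvrnorm()

    cluster_size <- 20
    c_ar         <- 0.5                                  # illustrative AR(1) parameter
    mu_k         <- rep(1.5, cluster_size)               # illustrative cluster mean
    Sigma        <- c_ar^abs(outer(1:cluster_size, 1:cluster_size, "-"))

    x_cluster <- mvrnorm(n = 1, mu = mu_k, Sigma = Sigma)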
We analyze the simulated data using both our proposed model and the Scott-Berger (SB) model assuming independence. In our model, we consider the sharing of information across genes using three different neighborhood structures. These neighborhoods, together with their associated adjacency matrices W₁, W₂, and W₃, are displayed graphically in Supplementary Figure 9; W₁ assigns unit weights to first-order neighbors on the chromosome. For further comparison, we implement also the Significance Analysis of Microarrays (SAM) procedure as outlined in Efron (2010). SAM is a popular method for analyzing microarray data designed to approximately control the false discovery rate (FDR). For this procedure, we vary the FDR criterion between 0.05 and 0.15 to study performance across a range of FDR levels commonly used in practice. See Efron (2010, Chapter 4) for further details of the SAM procedure.

Each neighborhood is one in which every location has at least one neighbor so that, in (4), w_{j.} > 0 for all j. We take d = 0, reducing to the usual IAR model like the one considered in Brown, Lazar, Datta, Jang, and McDowell (2014). To enforce sparsity a priori, we choose a high value of α in the prior on p so that its prior probability mass is concentrated close to 1. We take α = 150 here. In the SB model, we find that the best results are obtained using a uniform prior on p, with higher values of the shape parameter negatively affecting the model's ability to detect activation. Lastly, note that the data generating mechanism here is different from what is assumed under either our proposed model or the SB model, allowing us to study robustness as well.

For both Bayesian models, we simulate the posterior distributions via MCMC. We implement the SB model using Gibbs sampling with nested rejection sampling steps for the non-standard distributions. To draw from the posterior of our proposed model, we use the conditional distributions given in the Supplementary Material for a Gibbs sampling procedure with nested rejection and slice sampling steps as described in Subsection 2.3. The algorithms are coded entirely in R (R Core Team, 2015). For our proposed model, a single chain uses a burn-in period of 5,000 iterations followed by an additional 10,000 sampling iterations, thinning to every fifth draw for a final sample of size 2,000. We run three chains in parallel and assess convergence with trace plots and scale reduction factors for selected parameters (Gelman and Rubin, 1992). Upon attaining approximate convergence, the retained draws from each of the three chains are combined for a final Monte Carlo sample size of 6,000. Similarly, we run parallel chains, assess convergence, and combine the draws to simulate the posterior distribution of the SB model. The posterior inclusion probabilities are estimated using (2) for the SB model and (5) for our proposed model. To select simulated genes as differentially expressed, we threshold the estimated posterior inclusion probabilities at 0.95.

Table 1 displays the empirical error rates for the SB model, the SAM procedure, and our model using W₁, W₂, and W₃ above as neighborhoods. Of these models, we observe that the SB model results in the highest overall misclassification proportion, due to the false non-discoveries. In this case, the SB model is overcorrecting for multiplicity to the point that no cases are selected at all (hence the identically zero false discovery proportion). The SAM procedure performs slightly better in terms of non-discoveries and overall misclassification proportion, at the expense of false positives accounting for 13%-16% of all discoveries. In contrast, our proposed model performs better, regardless of the selected neighborhood structure. The first-order neighborhood with unit weights (W₁; top illustration in Supplementary Figure 9) performs best both in terms of non-discoveries and false discoveries, but all of the error rates are very close when compared to the other two approaches.
Table 1: False non-discovery proportions (FNP), false discovery proportions (FDP), and misclassification proportions (MCP) for the simulated gene expression data.
          SB       SAM(0.05)  SAM(0.10)  SAM(0.15)  CAR(W1)  CAR(W2)  CAR(W3)
    FNP   0.1000   0.0938     0.0883     0.0856     0.0546   0.0577   0.0616
    FDP   0.0000   0.1250     0.1333     0.1579     0.0000   0.0217   0.0238
    MCP   0.1000   0.0940     0.0890     0.0870     0.0520   0.0560   0.0600

Figure 2 displays the empirical receiver operating characteristic (ROC) curves for the SB, SAM, and CAR(W₁) models. The ROC curves for the CAR(W₂) and CAR(W₃) models are virtually indistinguishable from the CAR(W₁) curve and so are not displayed. We see that the SB testing model and the SAM procedure are comparable in terms of overall discriminatory power. The approximate areas under the curves (AUC) are given in Table 2. While both of these procedures yield sensitivity slightly superior to that of our model at lower levels of specificity (less than about 0.40), ours still captures more area under its curve. In particular, note that our model attains a very sharp increase in sensitivity at very high levels of specificity.

Figure 2: Empirical ROC curves for the simulated microarray data.
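The empirical ROC curves and AUC values can be computed directly from the estimated inclusion probabilities and the known simulation truth; a minimal sketch (the vector names are placeholders) is:

    # Empirical ROC and AUC from posterior inclusion probabilities.
    # incl_prob: estimated P(gamma_j = 1 | y); truth: 1 for truly non-null genes.
    empirical_roc <- function(incl_prob, truth) {
      thresholds <- sort(unique(c(0, incl_prob, 1)), decreasing = TRUE)
      sens <- sapply(thresholds, function(t) mean(incl_prob[truth == 1] >= t))
      spec <- sapply(thresholds, function(t) mean(incl_prob[truth == 0] < t))
      fpr  <- 1 - spec
      auc  <- sum(diff(fpr) * (head(sens, -1) + tail(sens, -1)) / 2)  # trapezoid rule
      list(sensitivity = sens, specificity = spec, auc = auc)
    }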
Table 2: Approximate areas under each ROC curve displayed in Figure 2.
    Procedure   AUC
    CAR(W1)     0.8969
    SB          0.8686
    SAM         0.8502

Insight into the reasons for the discrepancies between our model and the SB model can be gained by examining the smoothed approximate posterior densities of p, σ², and τ² = ησ², displayed in Figure 3. Incorporating the dependence results in much stronger Bayesian learning about these parameters. In this simulation, 100 out of 1,000 cases are non-null, so that we expect p to be large, though not exactly 0.9 since correlation among the cases reduces the effective sample size (Carlin and Louis, 2009, Ch. 3). That is indeed the case under our model. The SB model, on the other hand, results in considerably lower estimates of p, along with weakly identified distributions of σ² and τ². This weak identifiability contributes to the error rates observed in Table 1.

Figure 3: Smoothed posterior estimates of p, σ², and τ² for the simulated microarray data.
Supplementary Figure 10 plots the estimated inclusion probabilities versus the observed test statistics. All of the test statistics are assigned relatively high probabilities of being non-null under the SB model, but the lack of information about σ² and τ² prevents distinctions being drawn that are strong enough to pass a 0.95 inclusion probability threshold. Our approach, on the other hand, results in stronger statements about the likelihoods of cases being non-null. The 'jagged' quality of the curve corresponding to the CAR(W₁) model is due to the estimated inclusion probabilities being not a function of the y_j values alone, but also of their location with respect to nearby observed values. To see this, consider the circled point in Supplementary Figure 10, which corresponds to the identified statistic in the graphical depiction of the test statistics in Supplementary Figure 11. This relatively extreme observation is in the middle of uninteresting test statistics. This is a truly null case, so the incorporation of a local dependence structure helps to prevent a false discovery.

In addition to neighborhoods determined by physical adjacency, it may be desirable to facilitate the sharing of information within gene sets such as those used in enrichment analysis (Subramanian, Tamayo, Mootha, Mukherjee, Ebert, Gillette, Paulovich, Pomeroy, Golub, Lander, and Mesirov, 2005). To investigate this scenario, we simulate an additional microarray dataset with expression characteristics similar to the simulation carried out in Efron and Tibshirani (2007). We again consider a collection of 1,000 genes, with genes 11-20, 111-130, 211-230, 311-330, 411-430 defining five different gene sets. The set consisting of genes 111-130 is simulated as differentially expressed by drawing those test statistics independently from a normal distribution with positive mean, the set consisting of genes 411-430 is drawn from a normal distribution with negative mean, and the remaining test statistics are drawn from N(0, 1). A physical-adjacency neighborhood matrix W can be used with d = 0 in (4) as before. We compare the performance of this CAR structure to our proposed approach with neighborhoods determined by pathway membership. In the latter case, the adjacency matrix W is determined by defining genes that are in the same set to be neighbors. To include all observations without reducing the rank of the precision matrix, we set d = 1 in (4) so that the marginal distributions of isolated µ_j become N(0, τ²). We implement both models using MCMC with the same burn-in and sampling settings as the previous simulation.

Supplementary Figure 12 displays the empirical ROC curves from our model under both neighborhood assumptions, in which we can see that making incorrect assumptions about the neighborhood structure severely inhibits the model's discriminatory power. In fact, thresholding the posterior inclusion probabilities at 0.95 as before results in no cases being identified at all under the physical adjacency assumption. The misspecification of the correlation results in the model overestimating the noise variability in the data, as is clear upon examination of the histogram in Supplementary Figure 13. Superimposed on the histogram are two mean-zero normal densities with variances equal to the posterior means of σ² under both models. The overestimation of the spread of the null distribution results in poor identification of the non-null cases, indicated with the dark tick marks along the x-axis.
Incorporating a more appropriate neighborhood structure results in improved estimation of the variance components and thus superior discrimination among cases.

Even with knowledge of an approximately correct dependence structure among pathways, a temptation might be to use a standard CAR model by discarding the isolated cases. While this is obviously better than using an entirely incorrect neighborhood assumption, the isolated points still provide information about the parameters common to both the null and non-null components of the data distribution, and hence potentially useful information can be needlessly discarded. Consider the smoothed marginal posterior densities of p, σ², and τ² resulting from both approaches, displayed in Figure 4. Eliminating the isolated cases reduces J in (4), so we obtain considerably less posterior concentration about the error variance, which in turn affects the amount of information available to estimate the second variance component, τ². As is the case in any hypothesis testing scenario, the operating characteristics are directly affected by the amount of information we have about the error variability. This results in the "borderline" cases being misclassified as noise at a 0.95 inclusion probability threshold, thus increasing the false non-discovery proportion. (See Supplementary Figure 14.) The false discovery, false non-discovery, and misclassification proportions at the 0.95 threshold with and without isolated cases are given in Table 3.

Figure 4: Smoothed posterior estimates of p, σ², and τ² with and without the isolated cases.
Table 3: Error rates for the simulated pathway example using pathway-based neighborhoods.

           With Isolated Cases   Without Isolated Cases
    FNP    0.0083                0.1781
    FDP    0.0000                0.0000
    MCP    0.0080                0.1300

In addition to detection, there is often an interest in estimating the true signal strengths of the non-null cases. Figure 5 plots the smoothed approximate posterior densities and approximate 95% credible intervals about the signals, µ, for two typical non-null cases in the simulated gene pathway data. Again, the posteriors are evaluated with and without the isolated cases using neighborhoods defined by pathway membership. The true signal strengths (i.e., the means of the distributions of the test statistics) for these two cases, one negative (top panel) and one positive (bottom panel), are indicated in the plots by vertical lines. The Figure illustrates the reduction in posterior uncertainty that can be obtained by including all of the data points: for both cases, the approximate 95% credible intervals computed with the isolated cases included are markedly narrower than those computed with the isolated cases excluded. Table 4 displays the average widths of the approximate 95% credible intervals over all of the cases in both non-null pathways with and without the isolated observations. By including the isolated points with an appropriate neighborhood structure afforded by our proposed model, we attain an approximate four-fold increase in the precisions of the signal estimates.
Figure 5: Smoothed approximate posterior densities of the signal strengths µ_j for two non-null cases in the simulated gene pathway example. The bars at the top are the approximate 95% posterior credible intervals. The vertical lines indicate the true means of the non-null distributions of the test statistics.

Table 4: Average widths of the approximate 95% credible intervals for the non-null pathways in the simulated gene pathway example.

    Pathway   All Cases         Cases Excluded
    2         0.9603 (0.0137)   1.708 (0.0604)
    5         0.9338 (0.0105)   1.700 (0.0433)

In practice, the most appropriate neighborhood structure to use in the CAR model may not be known. For gene microarray, though, only biologically meaningful dependence structures would usually be considered, so that one would not be faced with completely unguided choices. Even among interpretable neighborhoods, we still might be forced to compare competing neighborhood assumptions in an effort to determine which is best. Consider again the simulated pathway data, only now we do not know whether the dependence is among biologically-determined pathways, or if it is a function of physical adjacency. In this case, we can gauge the spatial dependence by considering Moran's I statistic under different neighborhood assumptions, whence the competing models can be examined by looking at the strength of estimated spatial association according to each. For the simulated pathway data, the observed value of I(y) computed under the pathway-based neighborhoods is substantially farther from zero than the value computed under the physical-adjacency neighborhoods, pointing toward the pathway-based structure. Table 5 reports the root mean square predictive error (RMSPE) and WAIC for the two competing neighborhood models; both measures also favor the pathway-based structure.

Table 5: Root mean square predictive error (RMSPE) and Watanabe-Akaike information criterion (WAIC) for both neighborhood models in the simulated pathway example.
             Pathways   Adjacency
    RMSPE    1.498      1.578
    WAIC     2794.614   3016.502

The true correlation structure among gene expression data is complex. The implementation of our proposed model (and similar CAR models), however, requires neighborhood structures to be specified a priori, and this specification can potentially be incomplete or simplistic. Therefore, to study the performance of our proposed model under partially incorrect neighborhood assumptions, we simulate 1,000 expression levels over ten subjects (five treatment, five control) with genes 11-20, 111-130, 211-230, 311-330, 411-430 defining five different gene sets as before. We again suppose that the second and fifth sets are enriched in the treatment group. In contrast to the previous simulation, assume for each treatment subject that the correlation among signals is induced according to a directed graph, as might occur in metabolic pathways in which a signal originates from a single gene and cascades via several networks to other genes downstream (Stingo, Chen, Tadesse, and Vannucci, 2011). A "parent" gene is selected at random from each of the two active gene sets, and the signals for these parent genes, µ_p, are simulated from a normal distribution with mean slightly greater than 2. Within each pathway, seven additional child nodes are selected at random and their signals µ_{c_1}, ..., µ_{c_7} are drawn independently from a normal distribution whose mean is a fraction of the parent signal µ_p. The remaining signals in the pathway are drawn independently from a normal distribution whose mean is proportional to the sum Σ_{s=1}^7 µ_{c_s} of the child signals. Within each gene set, the observed expression levels X_ik are taken to be independent normal with means µ_k, i = 1, ..., 10, where k indexes over all genes, including the parent nodes and child nodes, and i indexes the subjects. To represent incompleteness, suppose the neighborhood structure determining W omits two genes that are actually members of one of the active sets, and likewise for one of the inactive sets. Lastly, we randomly select 30 isolated genes and draw their expression levels from N(0, Σ) for all subjects, treatment and control, where Σ has an autoregressive structure with entries decaying geometrically in |i − j|. Our proposed model thus incorrectly assumes uniform correlation within each pathway, uses incomplete pathway definitions, and ignores correlation among a subset of genes with no known pathway membership.

Table 6 displays the empirical false non-discovery proportions, false discovery proportions, and overall misclassification proportions for these simulated data under the Bayesian independence testing model and our proposed pathway-determined CAR model with a 0.95 posterior probability threshold, as well as the SAM procedure with three common thresholds. Despite the model misspecification, our model performs comparably to the Bayesian approach assuming (incorrectly) independence and the generally applicable SAM procedure. In fact, our approach still yields the smallest empirical misclassification proportion, though they are all very close. Even though we have partially incorrect assumptions about the correlation structure, our approach is still superior to that of assuming complete independence in terms of predictive capability, as evident in the lower RMSPE and WAIC values, also displayed in Table 6. Further, the assumed neighborhood structure in our model predicts non-zero spatial correlations that are fairly consistent with those observed in the actual data under the same structure. This is evident in Supplementary Figure 15, which plots a histogram of realized I(y*) values from the posterior predictive distribution along with the value observed in the actual data; under the assumed neighborhood structure, the posterior predictive probability P(I(y*) ≥ I(y) | y) is not extreme.

Table 6: Error rates and predictive capabilities of the independence Bayesian testing model (SB) and the CAR testing model under incorrect correlation assumptions, along with error rates from the SAM procedure with three common thresholds. SAM is not used for prediction, so the RMSPE and WAIC are not applicable.
             SB          CAR         SAM(0.05)   SAM(0.10)   SAM(0.15)
    FNP      0.0123      0.0093      0.0093      0.0093      0.0093
    FDP      0.0000      0.0313      0.0606      0.0882      0.1143
    MCP      0.0120      0.0100      0.0110      0.0120      0.0130
    RMSPE    1.3256      0.7590      -           -           -
    WAIC     3470.8036   2778.6636   -           -           -
These simulation results indicate that the overall performance of the mixture prior can be improved by using common information across local neighborhoods, when it is available. This improvement is due to the induced penalty on the inclusion probabilities of statistics surrounded by uninteresting observations, and to improved estimation of the mixing proportion and variance components in the data. By choosing the neighborhood weights appropriately, we demonstrate how our model can accommodate isolated genes that have no neighbors. Our proposed approach is evidently superior to an approach using a conventional CAR model, even when the correct pathway information is used to define the neighborhoods but isolated points are discarded. Incorporating spatial dependence and isolated cases results in much sharper Bayesian learning in the posterior distribution, improving inference and reducing uncertainty. Simple diagnostics such as Moran's I under different assumed neighborhood structures can be helpful when competing neighborhood structures are available, as can consideration of predictive capabilities through measures such as WAIC, which approximates out-of-sample predictive error while averaging over the posterior distribution. We illustrate also that even simplistic correlation assumptions in the Bayesian testing approach still perform competitively with alternatives such as the SAM procedure while predicting dependence features that are consistent with the observed data. Judging from model fit criteria, partially incorrect correlation assumptions are better than ignoring them altogether in an independence model.
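WAIC can be computed directly from the pointwise log-likelihood evaluated at each retained posterior draw. A minimal sketch, assuming a matrix loglik with one row per MCMC draw and one column per observation (the object name is a placeholder):

    # WAIC from an N x J matrix of pointwise log-likelihoods (rows = draws).
    waic <- function(loglik) {
      lppd   <- sum(log(colMeans(exp(loglik))))   # log pointwise predictive density
      p_waic <- sum(apply(loglik, 2, var))        # effective number of parameters
      -2 * (lppd - p_waic)
    }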
As one application of our proposed model, we consider the microarray expression data from Xiao, Reilly, and Khodursky (2009). This dataset contains transcriptional activity on the Escherichia coli chromosome measured as log ratios of transcript abundances between a control and various chemical, physiological, and genetic perturbations comprising 53 experimental conditions. The observed gene expression levels are the average log ratios across conditions. Here, we have 4 replicate measurements on 4,276 genes. Test statistics are calculated by dividing the mean difference by the standard error plus a small constant to reduce the effect of extreme observations, as is done in the SAM procedure. These statistics are probit transformed to yield equivalent z statistics.

The E. coli chromosome has been shown to have correlated expression levels according to gene location, but with a circular structure so that the first and last genes on the chromosome are considered neighbors (Xiao, Reilly, and Khodursky, 2009). This structure leads us to use the adjacency matrix obtained by replacing the last elements of the first row and first column of the first-order adjacency matrix of Section 3 by 1. We take d = 0 since each point has two neighbors.

We simulate the posterior distributions of the independence model and our CAR model using the MCMC algorithms described in Sections 2 and 3. The resulting posterior activation probabilities are thresholded at 0.99 to select genes as being differentially expressed. To evaluate performance, we compare the genes selected as differentially expressed to a list of 41 genes identified in Macnab (1992) as having a known or suspected function in the E. coli chromosome. This list serves as a reference with which we calculate approximate false discovery and false non-discovery proportions.

To illustrate the effect of the shape parameter in the prior for p in our model, we repeatedly simulate the posterior distribution while varying α over selected values between 1 and 1775. For the independence model, we again take p ∼ Unif(0, 1). Table 7 displays the resulting approximate error rates. For small values of α, the sensitivity of the procedure results in generally higher error rates compared to those from assuming independence. However, for higher α, we are able to attain uniformly better performance, with all three error rates outperforming the independence model. The effect of α on the marginal posterior distributions of p and ρ can also be seen in Supplementary Figure 16. Higher α values result in higher estimated values of p, as expected, but they also result in sharper posterior inferences about both p and ρ. We remark that the false discovery proportions are all quite high here. As this analysis is based on a real dataset, though, there is no way of knowing which of these are true false discoveries. Indeed, the high FDP could be a consequence of Macnab (1992) listing the most interesting genes, not necessarily all interesting genes.
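The circular first-order adjacency structure used in this application is easy to construct; a minimal sketch for J genes follows, using a sparse representation (the Matrix package is an implementation convenience, not part of the original analysis).

    # First-order circular adjacency: gene j is a neighbor of genes j-1 and j+1,
    # and the first and last genes on the chromosome are neighbors of each other.
    library(Matrix)
    J  <- 4276
    ii <- c(1:(J - 1), 1)          # row indices of upper-triangle neighbor pairs
    jj <- c(2:J, J)                # column indices
    W  <- sparseMatrix(i = c(ii, jj), j = c(jj, ii), x = 1, dims = c(J, J))

    all(rowSums(W) == 2)           # every gene has exactly two neighbors: TRUE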
Table 7: Error rates for the E. coli data (Xiao et al., 2009) under the independence model and the CAR testing model under selected values of α in π_p(p). For the CAR model, d = 0. The independence model uses a uniform prior on p.

           Independence   α = 1    α = 500   α = 1000   α = 1775
    FNP    0.0021         0.0005   0.0010    0.0009     0.0012
    FDP    0.4074         0.7821   0.5316    0.4219     0.3793
    MCP    0.0072         0.0332   0.0108    0.0072     0.0063

For an additional application under a different neighborhood structure, we consider mRNA expression profile data collected from lymphoblastoid cell lines derived from fifteen males and seventeen females. This dataset was analyzed in Subramanian, Tamayo, Mootha, Mukherjee, Ebert, Gillette, Paulovich, Pomeroy, Golub, Lander, and Mesirov (2005) with gene set enrichment analysis (GSEA), who sought to identify gene sets enriched in males (male > female) and enriched in females (female > male). Each cell line contains measurements on 22,283 genes. The existing catalog for this dataset includes 319 cytogenetic gene sets, 24 for each of the 24 human chromosomes, and 295 associated with cytogenetic bands. For our analysis, we again calculate two-sample t-statistics and normalize to obtain z-statistics.

We consider two variants of our proposed CAR testing approach, both of which use the 319 gene sets to define the neighborhoods, but with one model including the isolated points and the other excluding isolated points so that a conventional CAR model is applicable. With the isolated cases included, we set d = 1 in (4); d = 0 when there are no isolated cases. Otherwise, both models are the same. We take α = 1 in the prior on p so that it becomes uniform. The MCMC algorithm, identical under both models, uses a burn-in of 25,000 iterations, followed by 10,000 sampling iterations, of which every fifth draw is retained. The posterior inclusion probabilities are calculated using (5) and thresholded at 0.99 to identify differentially expressed genes.

The estimated posterior inclusion probabilities under both models are quite similar, though not exactly the same. This can be seen in Supplementary Figure 17, which plots the posterior inclusion probabilities versus the test statistics for both models. The differences result in a couple of disagreements on the selection of differentially expressed genes, listed in Table 8.

Table 8: List of genes selected under the generalized CAR model including isolated cases (gCAR) or a conventional CAR model with isolated cases excluded (CAR). Also listed are the effective sample sizes (ESS) of µ_j^(1), ..., µ_j^(2000) for each gene from the MCMC output, and each gene's membership among the chrX, chrXq13, chrXp22, chrY, chrYp11, and chrYq11 gene sets.

The disagreements between the two approaches can partly be explained by the reliability of the estimates themselves. Table 8 lists the effective sample sizes (ESS) of the MCMC draws of the signals for each identified gene, µ_j^(1), ..., µ_j^(2000) (Kass, Carlin, Gelman, and Neal, 1998). The ESS is an approximation of the number of independent pieces of information about a parameter produced by an MCMC algorithm, where lower numbers reflect higher autocorrelation in the chain and hence slower convergence. The differences between the two approaches considered here are substantial. With a couple of exceptions, the retained values from the Markov chain under the conventional CAR assumption are more highly correlated than in the gCAR, thus reducing the amount of available information about these parameters from a fixed number of iterations. On the other hand, including the isolated cases in this instance generally results in the Markov chain samples being almost as good as an i.i.d. sample from the target distribution.

Table 8 also indicates the pathway membership for each case. There are six gene sets in which the identified individual genes appear, and the clustering of genes along the X and Y chromosomes is apparent.
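Effective sample sizes for the retained draws of each µ_j can be computed with standard MCMC diagnostics software, for example the coda package (the matrix name below is a placeholder):

    # Effective sample size for each signal's retained MCMC draws.
    # mu_draws: matrix with one column per mu_j and one row per retained draw.
    library(coda)
    ess <- effectiveSize(as.mcmc(mu_draws))
    head(sort(ess))          # genes with the slowest-mixing chains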
If one were to use the approach of simply declaring as enriched the pathways including the individual differentially expressed genes, our results would agree closely with those found in the GSEA. In particular, the GSEA identified the Y chromosome (chrY) and two Y bands with at least 15 genes (chrYp11, chrYq11) as being associated with higher expression levels in males. Two of the genes selected by both models appear in the chrX and chrXq13 gene sets. These genes are associated with the set of X chromosome inactivation genes, which is expected to be enriched in females. Note that our proposed gCAR enabled us to identify an isolated gene that does not appear in any of the predefined gene sets. This gene would have been missed entirely if we were to use a conventional CAR structure inside of our Bayesian testing model, since it would have been discarded from the analysis. We notice also a particular gene selected only by the conventional CAR that appears in both the X and Y chromosome gene sets. This curious case could be a false positive, though, as the slower MCMC convergence under the conventional CAR model makes the posterior inference less reliable than its gCAR counterpart.

The applications presented here illustrate two different approaches to defining neighborhoods across which information may be shared when searching for non-null cases under our Bayesian testing model. While the results are sensitive to the choice of hyperparameters as well as the threshold, it is apparent that "good" choices can lead to desirable operating characteristics. Applying our proposed approach to these data yields results consistent with past analyses. These results demonstrate our model's ability to harness shared information between cases without sacrificing the possibility of identifying independent cases, something that would not be possible under the conventional CAR assumption. In this analysis, we found that including the isolated cases substantially reduced the autocorrelation. This results in quicker convergence and much greater sampling efficiency than is obtained from a conventional CAR model. This is an especially important consideration when performing MCMC in a large-scale setting, where the computational burden limits the feasible number of iterations. Our results suggest that there could be a positive effect on the sampling efficiency of an MCMC algorithm by including isolated, independent cases in our generalized CAR structure. This is possibly an interesting topic for future research that we do not pursue here.

We present a unified approach to correlated Bayesian testing whereby isolated cases and neighboring cases can be included in the same analysis. This allows for improved estimation of the signal strengths, the possibility of identifying isolated cases, and sharper posterior inferences about parameters of interest. We suggest some simple diagnostics that can aid a researcher in determining the most appropriate neighborhood structure when choosing among plausible models. When very little prior information is available concerning correlation structure, we note that there exist in the literature proposed techniques for discovering structure in high dimensional data, such as sparse factor modeling (West, 2003) and independent components analysis (Comon, 1994). We demonstrate the robustness of our approach to model misspecification by applying it to simulated data with a complex correlation structure in which the assumptions are partially incorrect.
It performs competitively with well-established procedures.

The results presented here are seen to be quite sensitive to the choice of the shape parameter in the prior for the mixing proportion, p, in our proposed model. This is in part because large α values result in both prior and posterior concentration of p about large values. These large values mean that the Gibbs sampler tends to visit sparser models more often, and thus parameters that appear only in non-null cases are updated less frequently. Certain applications necessitate the use of a prior that enforces known sparsity (e.g., West, 2003; Carvalho, Chang, Lucas, Nevins, Wang, and West, 2008). While the choice of shape hyperparameter does have a considerable effect on subsequent inferences, we demonstrate how finding a good value leads to desirable operating characteristics. In working with our model, we found that an acceptable value of the shape parameter seems to depend upon the strength of the correlation across neighboring observations. The best way to choose this value, or otherwise tune the prior to approximately match the true a priori non-null probability in the data, is still an open problem worthy of further investigation.

A related point is the thresholding of the location-specific a posteriori inclusion probabilities. We use throughout this paper an informal 0.95 decision rule for selecting non-null cases. Decision rules in the Bayesian testing paradigm have been proposed through average risk optimization and consideration of the so-called Bayes false discovery rate (bFDR) (e.g., Efron and Tibshirani, 2002; Tadesse, Ibrahim, Vannucci, and Gentleman, 2005; Bogdan, Ghosh, and Tokdar, 2008). However, most results concerning the relationship between bFDR and frequentist error measures are based on assumed independence in the data, an assumption we do not make here. Performance is also determined in part by the specification of the prior probabilities of the hypotheses. However, the approach of Scott and Berger (2006) on which our model is based enjoys the virtue of inducing an automatic multiplicity adjustment, even in the correlated case (Brown, Lazar, Datta, Jang, and McDowell, 2014). While expression (5) allows our calculations to be viewed as a fully Bayesian treatment of local false discovery rates (Efron, 2010, Ch. 5), much work remains to be done on establishing optimal decision rules for controlling different error rates under dependence.

This paper provides a glimpse at the possibility of facilitating more reliable inference by capturing (or at least approximating) the true dependence structures that are inherent in modern high-dimensional data. While this issue has garnered more interest in the recent literature (e.g., Smith and Fahrmeir, 2007; Li and Zhang, 2010; Stingo, Chen, Tadesse, and Vannucci, 2011; Lee, Jones, Caffo, and Bassett, 2014; Zhang, Guindani, Versace, and Vannucci, 2014; Zhao, Kang, and Yu, 2014), relatively limited work has been done on modeling nontrivial dependence in Bayesian models for signal detection, particularly in the high-dimensional setting where classical multiple testing approaches are no longer appropriate. Here we propose using Markov-type dependence structures in the continuous component of the spike-and-slab mixture via a Gaussian CAR model. An avenue of future work could be the exploration of other dependence structures.
Work has been done on modeling complex measures of distance and covariance structures (e.g., Dryden, Koloydenko, and Zhou, 2009), but there is a need for much further research toward building a flexible class of multiple testing models capable of dealing with a wide variety of dependence types.

Appendix: Proof of Posterior Propriety of the Proposed Model

Let Θ = (µ^T, σ², τ², ρ)^T denote the parameter vector and let F(y, Θ, γ, p) be the joint distribution of the data and the parameters. Then for Model (4), it suffices to show that

∫_{(Θ, γ, p)} dF(y, Θ, γ, p) = Σ_{γ ∈ {0,1}^J} ∫_p ∫_Θ f(y, Θ, γ, p) dΘ dp < ∞, for almost all y.   (6)

First, note that for any γ ∈ {0,1}^J,

f(y, Θ, γ, p) = (2πσ²)^{−J/2} exp{ −(1/(2σ²)) Σ_{j=1}^J (y_j − γ_j µ_j)² }
  × (2π)^{−J/2} |τ²(D*_w − ρW)^{−1}|^{−1/2} exp{ −(1/(2τ²)) µ^T (D*_w − ρW) µ }
  × (σ² + τ²)^{−2} π_ρ(ρ) α p^{J − Σ_j γ_j + α − 1} (1 − p)^{Σ_j γ_j}
  ≡ f(y, Θ | γ) π_{(γ,p)}(γ, p),

where π_{(γ,p)}(γ, p) = α p^{J − Σ_j γ_j + α − 1} (1 − p)^{Σ_j γ_j}. Hence, the integral inside the summation in (6) satisfies

∫_p ∫_Θ f(y, Θ, γ, p) dΘ dp ∝ ∫_Θ f(y, Θ | γ) dΘ,

since ∫_0^1 p^{J − Σ_j γ_j + α − 1} (1 − p)^{Σ_j γ_j} dp < ∞. Also, the summation over γ has 2^J terms, so it suffices to establish that

∫_ρ ∫_{τ²} ∫_{σ²} ∫_µ f(y, µ, σ², τ², ρ | γ) dµ dσ² dτ² dρ < ∞, for all γ ∈ {0,1}^J.

We begin with the case γ_j = 1 for all j ∈ {1, . . . , J} and define f(y | µ) := f(y | µ, γ = 1) to simplify notation. We have that

f(y | µ) = Π_{j=1}^J f(y_j | µ_j, γ_j = 1) = Π_{j=1}^J N(y_j | µ_j, σ²).
Now, since ρ ∈ (ν_1^{−1}, ν_J^{−1}) implies that D*_w − ρW > 0, where the notation A > 0 means that A is positive definite (Banerjee, Carlin, and Gelfand, 2015), the prior on µ is µ ∼ N_J(0, τ²(D*_w − ρW)^{−1}). We can thus integrate f(y | µ) π_µ(µ | τ², ρ) with respect to µ as the convolution of two normal densities. The marginal density of y is then that of a N_J(0, σ²I + τ²(D*_w − ρW)^{−1}) distribution. So the integration yields

f(y, σ², τ², ρ) = π_{(τ², σ²)}(τ², σ²) π_ρ(ρ) N_J(y | 0, σ²I + τ²(D*_w − ρW)^{−1}).

Now, we reparameterize the variance components by defining η := τ²/σ² and integrate over σ² using the inverse gamma integral to obtain

f(y, η, ρ) ∝ π_ρ(ρ) (1 + η)^{−2} |I + η(D*_w − ρW)^{−1}|^{−1/2}
  × ∫_0^∞ (σ²)^{−(J/2)−1} exp{ −(1/(2σ²)) y^T (I + η(D*_w − ρW)^{−1})^{−1} y } dσ²
∝ π_ρ(ρ) (1 + η)^{−2} |I + η(D*_w − ρW)^{−1}|^{−1/2} / ( y^T (I + η(D*_w − ρW)^{−1})^{−1} y )^{J/2}.
We can rewrite the quadratic form in the denominator of the last expression as

y^T (I + η(D*_w − ρW)^{−1})^{−1} y = y^T (D*_w)^{1/2} (D*_w + η(I − ρW*)^{−1})^{−1} (D*_w)^{1/2} y = x^T (D*_w + η(I − ρW*)^{−1})^{−1} x,

where W* = (D*_w)^{−1/2} W (D*_w)^{−1/2} and x = (D*_w)^{1/2} y. Let w_(J) = max_{1≤j≤J} w_{j·}. Then, after substantial simplification (see Supplementary Material), it can be shown that for all ρ ∈ (ν_1^{−1}, ν_J^{−1}),

x^T ((w_(J) + d)I + η(I − ρW*)^{−1})^{−1} x ≥ k (w_(J) + d + η)^{−1}   (7)
⇒ ( x^T ((w_(J) + d)I + η(I − ρW*)^{−1})^{−1} x )^{−J/2} ≤ k′ (w_(J) + d + η)^{J/2},   (8)

where 0 < k′ < ∞ is constant. Similarly, after substantial simplification (see Supplementary Material), we can show that

|I + η(D*_w − ρW)^{−1}|^{−1/2} ≤ k′ Π_{j=1}^J max{(1 − ρν_j)^{1/2}, 1} / (w_(1) + d + η)^{J/2}.   (9)

We have now established that the density function satisfies

f(y, η, ρ) ≤ C (w_(J) + d + η)^{J/2} [ Π_{j=1}^J max{(1 − ρν_j)^{1/2}, 1} / (w_(1) + d + η)^{J/2} ] (1 + η)^{−2} π_ρ(ρ)
  = C [ (w_(J) + d + η) / (w_(1) + d + η) ]^{J/2} [ Π_{j=r_1+1}^{J} (1 − ρν_j)^{1/2} / (1 + η)² ] π_ρ(ρ), for ρ < 0,
  = C [ (w_(J) + d + η) / (w_(1) + d + η) ]^{J/2} [ Π_{j=1}^{J−r_2} (1 − ρν_j)^{1/2} / (1 + η)² ] π_ρ(ρ), for ρ > 0,

where C is a finite positive constant and r_1 and r_2 denote the numbers of negative and positive eigenvalues of W*, respectively (see the Supplementary Material).

Now, π_ρ(ρ) ∝ I(ν_1^{−1} < ρ < ν_J^{−1}),

∫_{ν_1^{−1}}^{0} Π_{j=r_1+1}^{J} (1 − ρν_j)^{1/2} dρ ≤ ∫_{ν_1^{−1}}^{0} (1 − ρν_J)^{(J−r_1)/2} dρ < ∞,

and

∫_{0}^{ν_J^{−1}} Π_{j=1}^{J−r_2} (1 − ρν_j)^{1/2} dρ ≤ ∫_{0}^{ν_J^{−1}} (1 − ρν_1)^{(J−r_2)/2} dρ < ∞.

Also,

[ (w_(J) + d + η) / (w_(1) + d + η) ]^{J/2} (1 + η)^{−2} ≤ k′ (1 + η)^{−2} for all η > 0,
⇒ ∫_0^∞ [ (w_(J) + d + η) / (w_(1) + d + η) ]^{J/2} (1 + η)^{−2} dη < ∞.

From this, it follows that the integral ∫_{ν_1^{−1}}^{ν_J^{−1}} ∫_0^∞ f(y, η, ρ) dη dρ is finite, showing that ∫_Θ f(y, Θ | γ) dΘ < ∞ when γ_1 = γ_2 = · · · = γ_J = 1.

We now turn our attention to the case |{j : γ_j = 1}| = J_1 < J, where |·| denotes the cardinality of a set. Let γ* denote such an arbitrary configuration of γ. Further, let S = {j : γ_j = 1} ⊊ {1, . . . , J}, y_1 = {y_j : j ∈ S}, and y_2 = {y_j : j ∈ S^c}, so that we may partition y as y = (y_1^T, y_2^T)^T. Our strategy here is to show h(y_1, y_2) ≡ h(y) := ∫_Θ f(y, Θ | γ*) dΘ < ∞ by showing that ∫_{y_1} h(y_1, y_2) dy_1 < ∞ for all S^c and y_2 (a.e.). The key elements in the argument are that J_1 < J ⇒ J − J_1 > 0 and that

∫_{R^{J_1}} ∫_{R^J} f(y_1 | µ) π(µ) dµ dy_1 = ( ∫_{R^{J_1}} f(y_1 | µ) dy_1 ) ( ∫_{R^J} π(µ) dµ ).

Now, we have

∫_{y_1} h(y_1, y_2) dy_1 = ∫_{σ²} ∫_{τ²} ∫_ρ ∫_µ ∫_{y_1} f(y_1, y_2, τ², σ², ρ, µ | γ*) dy_1 dµ dρ dτ² dσ²
  ∝ ∫_0^∞ (σ²)^{−(J−J_1)/2} exp{ −(1/(2σ²)) Σ_{j∈S^c} y_j² } ∫_0^∞ (σ² + τ²)^{−2} dτ² dσ²
  = ∫_0^∞ (1/σ²)^{(J−J_1)/2 + 1} exp{ −(1/σ²) ( Σ_{j∈S^c} y_j² / 2 ) } dσ² < ∞,

since the integrand is the kernel of an inverse gamma density over (0, ∞).

The proof is completed by considering the case γ_j = 0 for all j. The preceding argument still applies, with S = ∅ and J_1 = 0 (so that the integration over y_1 is not done). The result is therefore established.

References
Banerjee, S., Carlin, B. P., and Gelfand, A. E. (2015), Hierarchical Modeling and Analysis for Spatial Data, Boca Raton: Chapman & Hall/CRC, 2nd ed.
Banks, D., Datta, G., Karr, A., Lynch, J., Niemi, J., and Vera, F. (2012), "Bayesian CAR models for syndromic surveillance on multiple data streams: Theory and practice," Information Fusion, 13, 105–116.
Barbieri, M. M. and Berger, J. O. (2004), "Optimal predictive model selection," Annals of Statistics, 32, 870–897.
Benjamini, Y. (2010), "Discovering the false discovery rate," Journal of the Royal Statistical Society, Series B, 72, 405–416.
Benjamini, Y. and Hochberg, Y. (1995), "Controlling the false discovery rate: A practical and powerful approach to multiple testing," Journal of the Royal Statistical Society, Series B, 57, 289–300.
Besag, J. (1974), "Spatial interaction and the statistical analysis of lattice systems," Journal of the Royal Statistical Society, Series B, 36, 192–236.
Besag, J., York, J. C., and Mollié, A. (1991), "Bayesian image restoration, with two applications in spatial statistics (with discussion)," Annals of the Institute of Statistical Mathematics, 43, 1–59.
Bogdan, M., Ghosh, J. K., and Tokdar, S. T. (2008), "A comparison of the Benjamini-Hochberg procedure with some Bayesian rules for multiple testing," in Beyond Parametrics in Interdisciplinary Research: Festschrift in Honor of Professor Pranab K. Sen, eds. Balakrishnan, N., Peña, E. A., and Silvapulle, M. J., Beachwood: Institute of Mathematical Statistics, pp. 211–230.
Brown, D. A., Lazar, N. A., Datta, G. S., Jang, W., and McDowell, J. E. (2014), "Incorporating spatial dependence into Bayesian multiple testing of statistical parametric maps in functional neuroimaging," NeuroImage, 84, 97–112.
Carlin, B. and Banerjee, S. (2003), "Hierarchical multivariate CAR models for spatio-temporally correlated survival data," in Bayesian Statistics 7, eds. Bernardo, J. M., Bayarri, M. J., Berger, J. O., Dawid, A. P., Heckerman, D., Smith, A. F. M., and West, M., Oxford: Oxford University Press, pp. 45–63.
Carlin, B. P. and Louis, T. A. (2009), Bayesian Methods for Data Analysis, Boca Raton: Chapman & Hall/CRC, 3rd ed.
Carvalho, C. M., Chang, J., Lucas, J. E., Nevins, J. R., Wang, Q., and West, M. (2008), "High-dimensional sparse factor modeling: Applications in gene expression genomics," Journal of the American Statistical Association, 103, 1438–1456.
Clyde, M. and George, E. I. (2004), "Model uncertainty," Statistical Science, 19, 81–94.
Comon, P. (1994), "Independent component analysis, A new concept?" Signal Processing, 36, 287–314.
Dryden, I. L., Koloydenko, A., and Zhou, D. (2009), "Non-Euclidean statistics for covariance matrices, with applications to diffusion tensor imaging," Annals of Applied Statistics, 3, 1102–1123.
Eberly, L. E. and Carlin, B. P. (2000), "Identifiability and convergence issues for Markov chain Monte Carlo fitting of spatial models," Statistics in Medicine, 19, 2279–2294.
Efron, B. (2007), "Correlation and large-scale simultaneous significance testing," Journal of the American Statistical Association, 102, 93–103.
— (2008), "Microarrays, empirical Bayes, and the two-groups model," Statistical Science, 23, 1–22.
— (2010), Large Scale Inference: Empirical Bayes Methods for Estimation, Testing, and Prediction, New York: Cambridge University Press.
Efron, B. and Tibshirani, R. (2002), "Empirical Bayes methods and false discovery rates for microarrays," Genetic Epidemiology, 23, 70–86.
— (2007), "On testing the significance of sets of genes," Annals of Applied Statistics, 1, 107–129.
Efron, B., Tibshirani, R., Storey, J. D., and Tusher, V. (2001), "Empirical Bayes analysis of a microarray experiment," Journal of the American Statistical Association, 96, 1151–1160.
Friston, K. J., Holmes, A. P., Worsley, K. J., Poline, J.-P., Frith, C. D., and Frackowiak, R. S. J. (1995), "Statistical parametric maps in functional imaging: A general linear approach," Human Brain Mapping, 2, 189–210.
Gelfand, A. E. and Sahu, S. K. (1999), "Identifiability, improper priors, and Gibbs sampling for generalized linear models," Journal of the American Statistical Association, 94, 247–253.
Gelfand, A. E. and Smith, A. F. M. (1990), "Sampling-based approaches to calculating marginal densities," Journal of the American Statistical Association, 85, 398–409.
Gelman, A. (2006), "Prior distributions for variance parameters in hierarchical models," Bayesian Analysis, 1, 515–533.
Gelman, A., Carlin, J. B., Stern, H. S., Dunson, D. B., Vehtari, A., and Rubin, D. B. (2014), Bayesian Data Analysis, Boca Raton: Chapman & Hall/CRC, 3rd ed.
Gelman, A. and Rubin, D. B. (1992), "Inference from iterative simulation using multiple sequences (with discussion)," Statistical Science, 7, 457–511.
Geman, S. and Geman, D. (1984), "Stochastic relaxation, Gibbs distributions and the Bayesian restoration of images," IEEE Transactions on Pattern Analysis and Machine Intelligence, 6, 721–741.
George, E. I. and McCulloch, R. E. (1993), "Variable selection via Gibbs sampling," Journal of the American Statistical Association, 88, 881–889.
Geweke, J. (1996), "Variable selection and model comparison in regression," in Bayesian Statistics 5, eds. Bernardo, J. M., Berger, J. O., Dawid, A. P., and Smith, A. F. M., Oxford: Oxford University Press, pp. 609–620.
Gilks, W., Richardson, S., and Spiegelhalter, D. (eds.) (1996), Markov Chain Monte Carlo in Practice, London: Chapman & Hall.
Gilks, W. R. and Wild, P. (1992), "Adaptive rejection sampling for Gibbs sampling," Journal of the Royal Statistical Society, Series C, 41, 337–348.
Haario, H., Saksman, E., and Tamminen, J. (2005), "Componentwise adaptation for high dimensional MCMC," Computational Statistics, 20, 265–273.
Higdon, D. (1994), "Spatial applications of Markov chain Monte Carlo for Bayesian inference," unpublished doctoral thesis, University of Washington, Department of Statistics.
Kass, R. E., Carlin, B. P., Gelman, A., and Neal, R. (1998), "Markov chain Monte Carlo in practice: A roundtable discussion," The American Statistician, 52, 93–100.
Lee, K.-J., Jones, G. L., Caffo, B. S., and Bassett, S. S. (2014), "Spatial Bayesian variable selection models on functional magnetic resonance imaging time-series data," Bayesian Analysis, 9, 699–732.
Li, F. and Zhang, N. R. (2010), "Bayesian variable selection in structured high-dimensional covariate spaces with applications in genomics," Journal of the American Statistical Association, 105, 1202–1214.
Li, J. and Tibshirani, R. (2011), "Finding consistent patterns: A nonparametric approach for identifying differential expression in RNA-Seq data," Statistical Methods in Medical Research, 22, 519–536.
Love, M. I., Huber, W., and Anders, S. (2014), "Moderated estimation of fold change and dispersion for RNA-Seq data with DESeq2," BioRxiv preprint, doi: http://dx.doi.org/10.1101/002832.
Macnab, R. M. (1992), "Genetics and biogenesis of bacterial flagella," Annual Review of Genetics, 26, 131–158.
McLachlan, G. and Peel, D. (2000), Finite Mixture Models, New York: John Wiley and Sons.
Mitchell, T. J. and Beauchamp, J. J. (1988), "Bayesian variable selection in linear regression," Journal of the American Statistical Association, 83, 1023–1032.
Neal, R. M. (2003), "Slice sampling," Annals of Statistics, 31, 705–741.
O'Hara, R. B. and Sillanpää, M. J. (2009), "A review of Bayesian variable selection methods: What, how, and which," Bayesian Analysis, 4, 85–118.
Qiu, X., Klebanov, L., and Yakovlev, A. (2005), "Correlation between gene expression levels and limitations of the empirical Bayes methodology for finding differentially expressed genes," Statistical Applications in Genetics and Molecular Biology, 4, article 34.
R Core Team (2015), R: A Language and Environment for Statistical Computing, R Foundation for Statistical Computing, Vienna, Austria.
Scott, J. G. and Berger, J. O. (2006), "An exploration of aspects of Bayesian multiple testing," Journal of Statistical Planning and Inference, 136, 2144–2162.
— (2010), "Bayes and empirical-Bayes multiplicity adjustment in the variable-selection problem," Annals of Statistics, 38, 2587–2619.
Serre, D. (2002), Matrices: Theory and Applications, New York: Springer-Verlag.
Smith, M. and Fahrmeir, L. (2007), "Spatial Bayesian variable selection with application to functional magnetic resonance imaging," Journal of the American Statistical Association, 102, 417–431.
Stingo, F. C., Chen, Y. A., Tadesse, M. G., and Vannucci, M. (2011), "Incorporating biological information in the linear models: A Bayesian approach to the selection of pathways and genes," Annals of Applied Statistics, 5, 1978–2002.
Subramanian, A., Tamayo, P., Mootha, V. K., Mukherjee, S., Ebert, B. L., Gillette, M. A., Paulovich, A., Pomeroy, S. L., Golub, T. R., Lander, E. S., and Mesirov, J. P. (2005), "Gene set enrichment analysis: A knowledge-based approach for interpreting genome-wide expression profiles," Proceedings of the National Academy of Sciences, 102, 15545–15550.
Tadesse, M., Ibrahim, J. G., Vannucci, M., and Gentleman, R. (2005), "Wavelet thresholding with Bayesian false discovery rate control," Biometrics, 61, 25–35.
Tusher, V. G., Tibshirani, R., and Chu, G. (2001), "Significance analysis of microarrays applied to the ionizing radiation response," Proceedings of the National Academy of Sciences, 98, 5116–5121.
Watanabe, S. (2010), "Asymptotic equivalence of Bayes cross validation and widely applicable information criterion in singular learning theory," Journal of Machine Learning Research, 11, 3571–3594.
West, M. (2003), "Bayesian factor regression models in the 'large p, small n' paradigm," in Bayesian Statistics 7, eds. Bernardo, J. M., Bayarri, M. J., Berger, J. O., Dawid, A. P., Heckerman, D., Smith, A. F. M., and West, M., Oxford: Oxford University Press, pp. 723–732.
Xiao, G., Reilly, C., and Khodursky, A. B. (2009), "Improved detection of differentially expressed genes through incorporation of gene locations," Biometrics, 65, 805–814.
Zhang, L., Guindani, M., Versace, F., and Vannucci, M. (2014), "A spatio-temporal nonparametric Bayesian variable selection model of fMRI data for clustering correlated time courses," NeuroImage, 95, 162–175.
Zhao, Y., Kang, J., and Yu, T. (2014), "A Bayesian nonparametric mixture model for selecting genes and gene subnetworks," Annals of Applied Statistics, 8, 999–1021.
A Bayesian Generalized CAR Model for Correlated Signal Detection: Supplementary Material
D. Andrew Brown Gauri S. Datta Nicole A. Lazar
Model (2.4) leads to the following full conditional distributions needed for a Gibbs sampling algorithm:

µ_j | µ_(−j), γ, σ², η, ρ, p, y ∼ N( [γ_j y_j + ρ η^{−1} Σ_{i≠j} w_{ji} µ_i] / [γ_j + η^{−1}(w_{j·} + d)], σ² [γ_j + η^{−1}(w_{j·} + d)]^{−1} ), j = 1, . . . , J,

γ_j | µ, γ_(−j), σ², η, ρ, p, y ∼ Bern( (1 − p) exp{−(y_j − µ_j)²/(2σ²)} / [ (1 − p) exp{−(y_j − µ_j)²/(2σ²)} + p exp{−y_j²/(2σ²)} ] ), j = 1, . . . , J,

σ² | µ, γ, η, ρ, p, y ∼ InvGam( J, [ (y − Γµ)^T (y − Γµ) + η^{−1} µ^T (D*_w − ρW) µ ] / 2 ),

p | µ, γ, σ², η, ρ, y ∼ Beta( J − Σ_{i=1}^J γ_i + α, Σ_{i=1}^J γ_i + 1 ),

[η | µ, γ, σ², ρ, p, y] ∝ η^{−J/2} exp{ −(1/(2η)) µ^T (D*_w − ρW) µ / σ² } (1 + η)^{−2}, η > 0,

[ρ | µ, γ, σ², η, p, y] ∝ |D*_w − ρW|^{1/2} exp{ −µ^T (D*_w − ρW) µ / (2ησ²) } I(ν_1^{−1} < ρ < ν_J^{−1}),   (10)

where [· | ·] denotes a conditional density.
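To make the sampling scheme concrete, the following is a minimal sketch of a Gibbs sampler built directly from the full conditionals in (10). It is not the authors' implementation: the function and variable names are invented for illustration, and the η and ρ updates, which have no standard form, are handled with simple random-walk Metropolis steps rather than whatever update the authors actually used (e.g., slice or adaptive rejection sampling). Here γ_j = 1 indicates a non-null case, p is the prior null probability, and W is the symmetric weight matrix with zero diagonal.

    import numpy as np

    def gibbs_gcar(y, W, d=1.0, alpha=1.0, n_iter=5000, seed=0):
        # Sketch of a Gibbs sampler for the generalized CAR testing model.
        # d > 0 allows rows of W that are entirely zero (isolated cases).
        rng = np.random.default_rng(seed)
        y = np.asarray(y, dtype=float)
        J = y.size
        w_dot = W.sum(axis=1)                                # w_{j.}
        Dw_star = np.diag(w_dot + d)                         # D*_w = D_w + d I
        S = np.diag(1.0 / np.sqrt(w_dot + d))
        nu = np.linalg.eigvalsh(S @ W @ S)                   # eigenvalues of W*
        rho_lo, rho_hi = 1.0 / nu.min(), 1.0 / nu.max()      # support of rho

        mu = np.zeros(J); gam = np.ones(J)
        sig2 = np.var(y); eta = 1.0; rho = 0.0; p = 0.5
        mu_save = np.empty((n_iter, J)); gam_save = np.empty((n_iter, J))

        def log_cond_eta_rho(eta_, rho_, mu_, sig2_):
            # Unnormalized log full conditional terms involving eta and rho
            prec = Dw_star - rho_ * W
            quad = mu_ @ prec @ mu_
            _, logdet = np.linalg.slogdet(prec)
            return (0.5 * logdet - 0.5 * J * np.log(eta_)
                    - quad / (2.0 * eta_ * sig2_) - 2.0 * np.log1p(eta_))

        for it in range(n_iter):
            # mu_j | rest: normal full conditional (sequential scan uses latest values)
            for j in range(J):
                prec_j = gam[j] + (w_dot[j] + d) / eta
                mean_j = (gam[j] * y[j] + rho * (W[j] @ mu) / eta) / prec_j
                mu[j] = rng.normal(mean_j, np.sqrt(sig2 / prec_j))
            # gamma_j | rest: Bernoulli
            num = (1.0 - p) * np.exp(-(y - mu) ** 2 / (2.0 * sig2))
            den = num + p * np.exp(-y ** 2 / (2.0 * sig2))
            gam = rng.binomial(1, num / den).astype(float)
            # sigma^2 | rest: inverse gamma with shape J
            quad = mu @ (Dw_star - rho * W) @ mu
            rate = 0.5 * ((y - gam * mu) @ (y - gam * mu) + quad / eta)
            sig2 = 1.0 / rng.gamma(J, 1.0 / rate)
            # p | rest: beta
            p = rng.beta(J - gam.sum() + alpha, gam.sum() + 1.0)
            # eta | rest: log-scale random-walk Metropolis (with proposal Jacobian)
            eta_prop = eta * np.exp(0.3 * rng.normal())
            log_acc = (log_cond_eta_rho(eta_prop, rho, mu, sig2) + np.log(eta_prop)
                       - log_cond_eta_rho(eta, rho, mu, sig2) - np.log(eta))
            if np.log(rng.uniform()) < log_acc:
                eta = eta_prop
            # rho | rest: random-walk Metropolis restricted to (1/nu_1, 1/nu_J)
            rho_prop = rng.normal(rho, 0.05)
            if rho_lo < rho_prop < rho_hi:
                log_acc = (log_cond_eta_rho(eta, rho_prop, mu, sig2)
                           - log_cond_eta_rho(eta, rho, mu, sig2))
                if np.log(rng.uniform()) < log_acc:
                    rho = rho_prop
            mu_save[it] = mu; gam_save[it] = gam
        # Posterior inclusion probabilities: average of the gamma_j draws
        return mu_save, gam_save, gam_save.mean(axis=0)

In practice one would discard a burn-in period and thin the chains, as described for the data analyses above, and compute the posterior inclusion probabilities from the retained γ draws before thresholding. Note also that with d = 0 and isolated cases present, the scaling by (w_{j·} + d)^{−1/2} fails, which is the conventional CAR limitation the generalized model avoids.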
In this Section, we provide a proof that the maximum value of ρ is an increasing function of d in Model (4). We also detail the simplifications used to obtain (7), (8), and (9) in the proof. The simplifications make use of a simple inequality, which we state as a lemma.

ρ and d

Consider two values d_1 < d_2. Let

x*_i = argmax_x { x^T (D_w + d_i I)^{−1/2} W (D_w + d_i I)^{−1/2} x / x^T x }, i = 1, 2,

and let λ_{J,i} be the largest eigenvalue of (D_w + d_i I)^{−1/2} W (D_w + d_i I)^{−1/2}, i = 1, 2. Then it follows from the properties of the Rayleigh quotient that

λ_{J,2} = x*_2^T (D_w + d_2 I)^{−1/2} W (D_w + d_2 I)^{−1/2} x*_2 / (x*_2^T x*_2)
  = (1 / x*_2^T x*_2) Σ_i Σ_j w_ij x*_{i,2} x*_{j,2} / ( √(w_{i·} + d_2) √(w_{j·} + d_2) )
  < (1 / x*_2^T x*_2) Σ_i Σ_j w_ij x*_{i,2} x*_{j,2} / ( √(w_{i·} + d_1) √(w_{j·} + d_1) )
  ≤ x*_1^T (D_w + d_1 I)^{−1/2} W (D_w + d_1 I)^{−1/2} x*_1 / (x*_1^T x*_1) = λ_{J,1},

so λ_{J,1}^{−1} < λ_{J,2}^{−1}. The inequality in the third line follows from the Perron-Frobenius Theorem (e.g., Serre, 2002), which guarantees that the elements of the eigenvector associated with λ_{J,2} are all positive. Thus, the maximum possible value for ρ is an increasing function of d.

Lemma 1. For any positive constants a, b, and c,

b / (ba + c) ≤ max{b, 1} / (a + c).

Proof. Let a, c ∈ R_+. Then, for 0 < b < 1,

b/(ba + c) − 1/(a + c) = [b(a + c) − ba − c] / [(ba + c)(a + c)] = c(b − 1) / [(ba + c)(a + c)] < 0,

and for b > 1,

b/(ba + c) − b/(a + c) = [b(a + c) − b(ba + c)] / [(ba + c)(a + c)] = ba(1 − b) / [(ba + c)(a + c)] < 0.

For b = 1 the two sides are equal, which completes the proof.

Now, the matrix (w_(J) + d)I − D*_w = w_(J)I − D_w is diagonal with nonnegative entries and thus positive semidefinite, which we denote as (w_(J) + d)I − D*_w ≥ 0. Let ν_1 ≤ ν_2 ≤ · · · ≤ ν_J be the ordered eigenvalues of W*. Then the eigenvalues of I − ρW* are 1 − ρν_j, j = 1, . . . , J. Hence,

ρ ∈ (ν_1^{−1}, ν_J^{−1}) ⇒ 1 − ρν_j > 0 for all j ⇒ (I − ρW*) > 0 ⇒ η(I − ρW*)^{−1} > 0.

By adding and subtracting η(I − ρW*)^{−1}, we obtain

(w_(J) + d)I − D*_w = (w_(J) + d)I + η(I − ρW*)^{−1} − (D*_w + η(I − ρW*)^{−1}) ≥ 0.

Making use of the fact that B > 0, A − B ≥ 0 ⇒ B^{−1} − A^{−1} ≥ 0, it follows that

(D*_w + η(I − ρW*)^{−1})^{−1} − ((w_(J) + d)I + η(I − ρW*)^{−1})^{−1} ≥ 0
⇒ x^T (D*_w + η(I − ρW*)^{−1})^{−1} x ≥ x^T ((w_(J) + d)I + η(I − ρW*)^{−1})^{−1} x
⇒ ( x^T (D*_w + η(I − ρW*)^{−1})^{−1} x )^{−J/2} ≤ ( x^T ((w_(J) + d)I + η(I − ρW*)^{−1})^{−1} x )^{−J/2}.

Now, W* is symmetric, so it has a spectral decomposition of the form W* = PMP^T, where P is the orthogonal matrix of eigenvectors of W* and M is the diagonal matrix of eigenvalues. Let u = P^T x, so that x = Pu and

x^T ((w_(J) + d)I + η(I − ρW*)^{−1})^{−1} x = u^T P^T ((w_(J) + d)I + η(I − ρW*)^{−1})^{−1} P u
  = u^T ( (w_(J) + d) P^T P + η P^T (I − ρW*)^{−1} P )^{−1} u
  = u^T ( (w_(J) + d)I + η(I − ρM)^{−1} )^{−1} u
  = Σ_{j=1}^{J} (1 − ρν_j) u_j² / [ (w_(J) + d)(1 − ρν_j) + η ].

Since D*_w is diagonal and the diagonal elements of W are zero, the diagonal elements of W* = (D*_w)^{−1/2} W (D*_w)^{−1/2} are zero and thus tr(W*) = 0 = Σ_{j=1}^J ν_j. It must then be true that there are r_1 > 0 negative and r_2 > 0 positive eigenvalues of W*, since r_0 := r_1 + r_2 = rank(W*) > 0. The summation in the last line can then be separated according to the sign of the eigenvalue in each term as

Σ_{j=1}^{r_1} (1 − ρν_j) u_j² / [ (w_(J) + d)(1 − ρν_j) + η ]   (ν_j < 0)
+ Σ_{j=J−r_2+1}^{J} (1 − ρν_j) u_j² / [ (w_(J) + d)(1 − ρν_j) + η ]   (ν_j > 0)
+ Σ_{j=r_1+1}^{J−r_2} u_j² / (w_(J) + d + η)   (ν_j = 0).   (11)

If ν_1^{−1} < ρ < 0, then 0 < 1 − ρν_j < 1 for j = 1, . . . , r_1 and 1 − ρν_j > 1 for j = J − r_2 + 1, . . . , J, so

(11) ≥ Σ_{j=J−r_2+1}^{J} (1 − ρν_j) u_j² / [ (w_(J) + d)(1 − ρν_j) + η ] + Σ_{j=r_1+1}^{J−r_2} u_j² / (w_(J) + d + η)
  ≥ Σ_{j=J−r_2+1}^{J} u_j² / (w_(J) + d + η) + Σ_{j=r_1+1}^{J−r_2} u_j² / (w_(J) + d + η)
  = (w_(J) + d + η)^{−1} Σ_{j=r_1+1}^{J} u_j²,

where the second inequality follows from noticing that (w_(J) + d + η)^{−1} − (1 − ρν_j)[ (w_(J) + d)(1 − ρν_j) + η ]^{−1} < 0 for ν_j > 0. Similarly, if 0 < ρ < ν_J^{−1}, then

(11) ≥ Σ_{j=1}^{r_1} (1 − ρν_j) u_j² / [ (w_(J) + d)(1 − ρν_j) + η ] + Σ_{j=r_1+1}^{J−r_2} u_j² / (w_(J) + d + η)
  ≥ Σ_{j=1}^{r_1} u_j² / (w_(J) + d + η) + Σ_{j=r_1+1}^{J−r_2} u_j² / (w_(J) + d + η)
  = (w_(J) + d + η)^{−1} Σ_{j=1}^{J−r_2} u_j².

Thus, we have that for all ρ ∈ (ν_1^{−1}, ν_J^{−1}),

x^T ((w_(J) + d)I + η(I − ρW*)^{−1})^{−1} x ≥ k (w_(J) + d + η)^{−1}
⇒ ( x^T ((w_(J) + d)I + η(I − ρW*)^{−1})^{−1} x )^{−J/2} ≤ k′ (w_(J) + d + η)^{J/2},

where 0 < k′ < ∞ is constant. This establishes (7) and (8).

To see that (9) holds, note that

|I + η(D*_w − ρW)^{−1}| = |(D*_w)^{−1/2}| |D*_w + η(I − ρW*)^{−1}| |(D*_w)^{−1/2}| = |D*_w|^{−1} |D*_w + η(I − ρW*)^{−1}| ≡ k |D*_w + η(I − ρW*)^{−1}|.

Letting w_(1) = min_{1≤j≤J} w_{j·}, and again adding and subtracting η(I − ρW*)^{−1}, we obtain

D*_w − (w_(1) + d)I ≥ 0 ⇒ D*_w + η(I − ρW*)^{−1} − ( (w_(1) + d)I + η(I − ρW*)^{−1} ) ≥ 0.

But B > 0, A − B ≥ 0 ⇒ |A| ≥ |B|, so we find that

|D*_w + η(I − ρW*)^{−1}| ≥ |(w_(1) + d)I + η(I − ρW*)^{−1}|
⇒ k |D*_w + η(I − ρW*)^{−1}| ≥ k |(w_(1) + d)I + η(I − ρW*)^{−1}|.

The eigenvalues of (I − ρW*)^{−1} are (1 − ρν_j)^{−1}, j = 1, . . . , J, so it follows that the eigenvalues of (w_(1) + d)I + η(I − ρW*)^{−1} are w_(1) + d + η(1 − ρν_j)^{−1}, j = 1, . . . , J. Therefore,

k |(w_(1) + d)I + η(I − ρW*)^{−1}| = k Π_{j=1}^{J} ( w_(1) + d + η(1 − ρν_j)^{−1} ) = k Π_{j=1}^{J} [ (1 − ρν_j)(w_(1) + d) + η ] / Π_{j=1}^{J} (1 − ρν_j),

and subsequently

|I + η(D*_w − ρW)^{−1}|^{−1/2} ≤ k′ ( Π_{j=1}^{J} [ (1 − ρν_j)(w_(1) + d) + η ] / Π_{j=1}^{J} (1 − ρν_j) )^{−1/2},

where 0 < k′ < ∞ is constant. But 1 − ρν_j > 0 for all j, so by Lemma 1,

(1 − ρν_j) / [ (1 − ρν_j)(w_(1) + d) + η ] ≤ max{1 − ρν_j, 1} / (w_(1) + d + η) for all j ∈ {1, . . . , J}
⇒ Π_{j=1}^{J} (1 − ρν_j) / Π_{j=1}^{J} [ (1 − ρν_j)(w_(1) + d) + η ] ≤ Π_{j=1}^{J} max{1 − ρν_j, 1} / (w_(1) + d + η)^{J},

which, upon taking square roots, yields (9).

Supplementary Figures
Figure 6: Comparison of the half-t prior on τ (Gelman, 2006), after transforming to the τ² scale, and the prior on τ² | σ² suggested by Scott and Berger (2006), denoted by SB. Here, the scale parameter is set to 1 in both densities.

Figure 7: Simulated binary activation pattern drawn from an Ising distribution.

Figure 8: Simulated data using the activation pattern in Figure 7 with non-null mean 3.5.

Figure 9: Neighborhood structures and weights used for the Bayesian CAR model in the simulation study. In each illustration, the center square represents the gene of interest, and the numbers are the weights w_ij assigned to each gene in the neighborhood.

Figure 10: Estimated posterior inclusion probabilities p_i versus test statistics y_i, i = 1, . . . , for the simulated microarray data with neighborhoods determined through physical adjacency. The dashed (jagged) curve results from the CAR(W) model, and the dotted (smooth) line results from the SB model. The circled point corresponds to the indicated statistic depicted in Figure 11.

Figure 11: Graphical depiction of test statistics from the simulated microarray data. The test statistics y_i are arranged in order, i = 1, 2, . . . , going from the lower left to the upper right, row-wise from left to right. The circle indicates the statistic corresponding to the circled (y_i, p_i) point in Figure 10.

Figure 12: Empirical ROC curves for the testing model using the physical-adjacency CAR model and the generalized CAR using pathways to define neighborhoods with isolated points included.

Figure 13: Histogram of test statistics from the simulated pathways example with normal densities superimposed, each with mean zero and standard deviations estimated as √E(σ² | y) from the posterior distributions of both neighborhood structures. The tick marks at the bottom indicate the non-null cases.

Figure 14: Posterior inclusion probabilities estimated under the CAR testing model with and without the isolated cases. The dashed line is at the 0.95 threshold. Notice that several of the P5 cases have been pulled downward, resulting in more false non-discoveries when the isolated cases are excluded.

Figure 15: Histograms of realizations of Moran's I calculated from the posterior predictive distribution of the CAR testing model, p(I(y*) | y) = ∫_θ p(I(y*) | θ) π(θ | y) dθ, under partially incorrect correlation assumptions. The dark vertical line indicates the observed value of I under the assumed neighborhood structure.

Figure 16: Smoothed posterior densities of p and ρ for the E. coli data (Xiao, Reilly, and Khodursky, 2009) with different values of α in the prior p ∼ Beta(α, 1).

Figure 17: Estimated posterior inclusion probabilities for each of the 22,283 genes in the lymphoblastoid cell data (Subramanian, Tamayo, Mootha, Mukherjee, Ebert, Gillette, Paulovich, Pomeroy, Golub, Lander, and Mesirov, 2005) under both CAR testing models with and without isolated cases. The left panel results from excluding the isolated cases (d = 0); the right panel results from including isolated cases (d = 1). The horizontal lines represent the 0.99 threshold.
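The posterior predictive check summarized in Figure 15 can be sketched in code. The helper below computes Moran's I for a data vector under a weight matrix and then approximates p(I(y*) | y) by simulating replicate data sets at retained posterior draws; the function names, and the assumption that posterior draws of (µ, γ, σ²) are available as arrays, are ours for illustration and do not reflect the authors' software.

    import numpy as np

    def morans_I(y, W):
        # Moran's I for observations y under weight matrix W (zero diagonal)
        y = np.asarray(y, dtype=float)
        z = y - y.mean()
        return (len(y) / W.sum()) * (z @ W @ z) / (z @ z)

    def posterior_predictive_moran(y, W, mu_draws, gamma_draws, sig2_draws, seed=0):
        # Draw y* ~ N(gamma * mu, sigma^2 I) at each retained posterior draw,
        # compute Moran's I of the replicate, and return the replicated values
        # together with the observed statistic for comparison.
        rng = np.random.default_rng(seed)
        reps = []
        for mu, gam, s2 in zip(mu_draws, gamma_draws, sig2_draws):
            y_rep = rng.normal(gam * mu, np.sqrt(s2))
            reps.append(morans_I(y_rep, W))
        return np.array(reps), morans_I(y, W)

Plotting a histogram of the replicated values of I against the observed value reproduces the kind of display shown in Figure 15 and provides a simple check of the assumed neighborhood structure.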