Bayesian Nonparametric Variable Selection as an Exploratory Tool for Finding Genes that Matter
Babak Shahbaba and Wesley O. Johnson
Department of Statistics, University of California at Irvine, CA, USA
Abstract:
High-throughput scientific studies involving no clear a priori hypothesis are common. For example, a large-scale genomic study of a disease may examine thousands of genes without hypothesizing that any specific gene is responsible for the disease. In these studies, the objective is to explore a large number of possible factors (e.g., genes) in order to identify a small number that will be considered in follow-up studies, which tend to be more thorough and on smaller scales. For large-scale studies, we propose a nonparametric Bayesian approach based on random partition models. Our model divides the set of candidate factors into several subgroups according to their degrees of relevance, or potential effect, in relation to the outcome of interest. The model allows for a latent rank to be assigned to each factor according to the overall potential importance of its corresponding group. The posterior expectation or mode of these ranks is used to set up a threshold for selecting potentially relevant factors. Using simulated data, we demonstrate that our approach could be quite effective in finding relevant genes compared to several alternative methods. We apply our model to two large-scale studies. The first study involves transcriptome analysis of infection by human cytomegalovirus (HCMV). The objective of the second study is to identify differentially expressed genes between two types of leukemia.
Keywords and phrases:
Random partition models, High throughput studies, Dirichlet process mixtures, Gene expression microarrays.
1. Introduction
High throughput studies are typically aimed at assessment of a large number of factors with respect to their relationship to an outcome of interest (e.g., disease status). The overall objective is to select a subset of factors that might be relevant to the outcome. In this paper, we discuss two such studies as motivating examples. The objective of the first study, conducted by Chan et al. (2008), is to identify genes that are differentially expressed after human cytomegalovirus (HCMV) infection. Infection by HCMV leads to morbidity and mortality in immunocompromised individuals, including AIDS and organ transplant patients. In this study, the expression levels of 12,626 genes were compared between six HCMV-infected (case group) and six mock-infected (control group) samples. The second study, conducted by Armstrong et al. (2002), aims at identifying differentially expressed genes in two types of leukemia: acute myeloid leukemia (AML) and acute lymphoid leukemia (ALL). The data include the expression levels of 10,056 genes for 48 subjects (24 subjects in each group).

Many papers have approached this problem from the point of view of assessing "statistical significance" while controlling the family-wise error rate. Here, we do not attempt to assess statistical significance, but we maintain the same goal of identifying factors that warrant further study. Indeed, the method we propose, though model based, is regarded as exploratory data analysis: we search through a rather large number of genes for a subset of them that might be important or relevant. High throughput studies can be regarded as a starting point to generate a small number of hypotheses that are worthy of further investigation with follow-up studies. Our approach is nonparametric and employs a latent random partition model based on Dirichlet process mixtures (DPM) to divide genes into subgroups. Genes are ranked according to the overall potential importance of their corresponding groups.
The posterior expectation of these ranks could be used to set up a threshold for selecting a relevant set. Our method can be viewed in the context of variable selection, where there could potentially be a very large number of terms in the model but where there is a belief in sparsity, which translates to parsimony.

Without loss of generality, we focus on the analysis of gene expression microarray data as a typical high throughput study. Gene expression studies deal with identifying genetic factors that are potentially important to an outcome of interest. Microarrays measure the expression levels for thousands of genes simultaneously. We denote the genes as G_1, ..., G_N. The outcome of interest is disease status (i.e., diseased vs. healthy), which is fixed in a case-control design. The gene expression data consist of responses on multiple diseased and healthy individuals for each gene; namely, for gene i we observe y_i = (y_ijk : j = 1, ..., n_ik, k = 0, 1), where k = 0 corresponds to healthy and k = 1 to diseased. Disease status is represented by a binary variable, x_ijk, which takes the value one if k = 1 and zero otherwise.

Within the hypothesis testing framework, the usual statistical methods assume that for each gene, G_i, there is a corresponding null hypothesis, H_i, stating that there is no change in gene expression between the two groups (diseased vs. healthy). Based on this assumption and the observed expression data, y_i, a simple test statistic z_i is often computed for each gene such that the distribution of Z_i is known under the null hypothesis. In general, larger values of z_i provide stronger evidence of departure from H_i, and statistics above a certain cutoff are deemed significant, after adjustment to control the family-wise error rate or false discovery rate (FDR). (See, for example, Benjamini and Hochberg, 1995; Hochberg, 1988; Hommel, 1988; Storey, Taylor and Siegmund, 2004; Westfall and Young, 1993.) As argued by several authors (e.g., Guindani, Müller and Zhang, 2009; Storey, 2007; Storey, Dai and Leek, 2007; Sun and Cai, 2009), methods that are based on Z-scores calculated for each test individually ignore information from other tests.

Efron et al. (2001) introduce the local false discovery rate (locFDR), which is the empirical Bayes version of the method proposed by Benjamini and Hochberg (1995) for estimating the false discovery rate (FDR). The fully Bayesian version improves the performance of multiple significance testing by borrowing information across all tests when assessing the relative significance of each one of them. See, for example, Do, Müller and Tang (2005); Müller, Parmigiani and Rice (2007); Newton et al. (2001); Scott and Berger (2006). See Scott and Berger (2010) for discussion regarding the comparison of fully Bayesian and empirical Bayesian adjustment for multiplicity.
In what follows, we present approaches based on the actual data and also based on summary Z statistics for completeness.

Storey, Dai and Leek (2007) and Storey (2007) proposed a related method called the optimal discovery procedure (ODP), which is approximately equivalent to minimizing the missed discovery rate for each fixed FDR. More recently, Cao et al. (2009) proposed a hierarchical Bayesian model, whose estimates are utilized in the ODP. We refer to this method as the Bayesian optimal discovery procedure (BODP). Guindani, Müller and Zhang (2009) (GMZ) showed that the ODP could be interpreted as an approximate Bayes rule under a semiparametric model. They proposed a Bayesian discovery procedure (BDP) that improves the approximation and allows for multiple shrinkage in clusters implied by a Dirichlet process mixture (DPM) model. Using DP priors in the context of multiple hypothesis testing has been discussed by several other authors; see, for example, Bogdan, Gosh and Tokdar (2008); Dahl and Newton (2007); Gopalan and Berry (1993). Both ODP and BDP show improvement over some commonly used procedures such as SAM (Significance Analysis of Microarrays; Tusher, Tibshirani and Chu, 2001), empirical Bayes, and local FDR.

In the remainder of the paper, we discuss our approach in detail and compare its performance to several alternative methods. In Section 2, we present the basic models for the full data and reduced data. We also present the GMZ method and re-cast it for our purposes, since it is related to ours and because of its success in comparison with other methods. Section 3 presents our data analyses, followed by comparisons based on simulated data in Section 4, sensitivity analysis in Section 5, and final conclusions in Section 6.
2. The Basic Model and Method
Let y_ijk be expression values and let x_ijk denote the fixed values for group membership (healthy/diseased) as discussed above. Denote the full data set as y = {y_ijk}. Then the basic model for these data is:

y_ijk | α_i, β_i ∼ N(α_i + β_i x_ijk, σ_i), independently for i = 1, ..., N, j = 1, ..., n_ik, k = 0, 1,   (2.1)

where α_i is interpreted as the expectation of gene expression values for gene i within the control group and β_i is the expected change in expression of this gene for the case group. Variances for gene expression are allowed to vary across genes. The following prior distributions are assumed for α_i and σ_i:

p(σ_i) ∝ 1/σ_i,   α_i | κ ∼ N(0, κ), independently;

we take κ to be very large.

The traditional hypotheses are H_i : β_i = 0 across all genes. So if all H_i are true, there is no difference in gene expression comparing healthy to diseased individuals across all genes. It is here that we can recognize that the problem at hand can be cast in terms of variable selection. For microarray studies, N is generally quite large and the n_ik tend to be small by comparison. There is also a general belief that most of the null hypotheses will be true. We are thus expecting a sparsity of non-zero β_i s.

All that remains is for us to specify a model for the β_i s. There are many possibilities in the Bayesian variable selection literature; see, for example, George and McCulloch (1993) and O'Hara and Sillanpää (2009). Our model for the regression coefficients is hierarchical, where the first level assigns independent normal priors to the β_i s with distinct variances, namely

β_i | τ_i ∼ N(0, τ_i), independently.   (2.2)

This is termed an adaptive shrinkage prior by O'Hara and Sillanpää (2009). At the second level, we assume that the first level variances, τ_i, are themselves iid from some unknown distribution. At the third level we assume a Dirichlet process prior for the unknown distribution.
In notation, we have:

τ_i | G ∼ G (iid),   G ∼ D(G_0, γ),   (2.3)

where G_0 is the expectation of G and γ is a weight. Thus the regression coefficients are modeled as a Dirichlet process mixture (DPM) of mean zero normal distributions where the mixing is on the variance. Moreover, if γ is large then G is approximately G_0, while if it is small, the random G allows for realizations that can depart considerably from G_0, thus resulting in much greater modeling flexibility. It is well known that the DP is discrete with probability one, so this model allows for clustering the variances for the regression coefficients, which induces clustering on the regression coefficients themselves. Regression coefficients associated with large variances will be presumed to be potentially important. Our basic model is thus expressed by (2.1)-(2.3).

Based on Sethuraman's constructive definition of the DP (Sethuraman, 1994), we have the following representation of the DPM for the regression coefficients:

β_i ∼ Σ_{c=1}^∞ p_c N(0, φ_c),

where the φ_c s are iid from G_0 and p_c = b_c Π_{i=1}^{c-1} (1 − b_i), with b_c ∼ Beta(1, γ) independently. For our problem we choose G_0 to be Log-N(m, M). We also assume γ ∼ Log-N(l, L).

The distribution of the β_i s is thus modeled as a countable random mixture of normal distributions with mean zero and different variances. Thus the τ_i for different genes may be the same and equal to one of the values of φ_c. The DPM model naturally creates clusters of genes, where each cluster has its own unique φ. The number of possible clusters is theoretically infinite a priori, but it is data driven: there are only a finite number of genes, so there can only be a finite number of clusters. We can rank the identified groups according to their degree of importance using their corresponding φ.
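The stick-breaking construction above is easy to simulate. The following is a minimal sketch, not the authors' code: it truncates the infinite sum, treats the second Log-N parameter as a log-scale variance, and uses illustrative values for γ, m, and M.

```python
import numpy as np

def draw_betas_dpm(n_genes, gamma=1.0, m=-3.0, M=4.0, trunc=200, rng=None):
    """Illustrative draw of regression coefficients beta_i from the DPM prior:
    a truncated stick-breaking construction with base measure G0 = Log-N(m, M)
    on the variances phi_c, mixing mean-zero normals. All parameter values are
    assumptions for illustration, not the paper's settings."""
    rng = np.random.default_rng() if rng is None else rng
    # Stick-breaking weights: b_c ~ Beta(1, gamma), p_c = b_c * prod_{i<c}(1 - b_i)
    b = rng.beta(1.0, gamma, size=trunc)
    p = b * np.concatenate(([1.0], np.cumprod(1.0 - b[:-1])))
    p = p / p.sum()  # fold the truncation remainder back in
    # Component variances phi_c drawn iid from G0 (M interpreted as a variance)
    phi = rng.lognormal(mean=m, sigma=np.sqrt(M), size=trunc)
    # Each gene picks a component, then draws beta_i ~ N(0, phi_{c_i})
    c = rng.choice(trunc, size=n_genes, p=p)
    beta = rng.normal(0.0, np.sqrt(phi[c]), size=n_genes)
    return beta, c, phi

beta, c, phi = draw_betas_dpm(1000, gamma=1.0)
print("distinct clusters used:", len(np.unique(c)))
```

Because the weights p_c decay geometrically, only a handful of components receive appreciable mass, which is the clustering behavior the text describes.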
Those β's in clusters with φ_c near zero correspond to normal distributions with small variances; their corresponding genes must have relatively small regression coefficients and can be considered least relevant. In contrast, those β_i s that are relatively far from zero would be assigned to normal distributions with larger variances, and the corresponding genes are considered more relevant. Therefore, when comparing two subgroups of genes, the one with the higher value of φ_c is considered to be the more relevant. We let R_i denote the rank of the i th gene, so if R_i = 1, it is in the cluster with the smallest variance.

For inference, it is awkward to work directly with the DP, so it is common to marginalize over G. We also introduce the latent variable c_i, which identifies the cluster to which gene i belongs. Once the magnitudes of all cluster variances are known, the value of R_i is determined by knowing the value of c_i. Since these values are both latent, we cannot say precisely what they are. However, since we use Markov chain Monte Carlo (MCMC) numerical methods, and in particular Gibbs sampling, for approximating the joint posterior distribution, we can obtain samples of these values at each iteration of the Gibbs sampler.

The actual model specification is greatly simplified by marginalization. There are a number of different approaches to marginalization in the literature, but we follow the one developed by MacEachern and Müller (1998). First, it is well known that

P(c_j = c | c_1, ..., c_{j-1}) = N_jc / (j − 1 + γ),   c ∈ {c_1, ..., c_{j-1}},   (2.4)

where N_jc is the number of genes previously assigned to group c, and a new group is formed with the remaining probability γ/(j − 1 + γ). This defines the marginal joint distribution for (c_1, ..., c_N).
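The Polya-urn probabilities that govern the cluster labels can be sketched directly; this is an illustrative simulation of the marginal partition (existing cluster with probability proportional to its occupancy, new cluster with probability proportional to γ), not the paper's sampler.

```python
import numpy as np

def crp_partition(n, gamma=1.0, rng=None):
    """Sequentially assign n items to clusters via the Polya-urn scheme:
    an existing cluster c is chosen with probability N_c / (j - 1 + gamma),
    a new cluster with probability gamma / (j - 1 + gamma)."""
    rng = np.random.default_rng() if rng is None else rng
    c = [0]                       # first item starts cluster 0
    counts = [1]                  # N_c: current cluster occupancy counts
    for j in range(2, n + 1):     # items j = 2, ..., n
        probs = np.array(counts + [gamma]) / (j - 1 + gamma)
        k = rng.choice(len(probs), p=probs)
        if k == len(counts):      # open a new cluster
            counts.append(1)
        else:
            counts[k] += 1
        c.append(k)
    return np.array(c), counts

labels, counts = crp_partition(500, gamma=1.0)
print("clusters formed:", len(counts))
```

With γ = 1 and 500 items, the number of clusters formed is typically small (on the order of log n), consistent with the sparsity belief the model encodes.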
Then the marginalized model replaces (2.1) and (2.3) with (2.4) and (2.5):

y_ijk | α_i, {β_c}, c_i ∼ N(α_i + β_{c_i} x_ijk, σ_i),   φ_c ∼ G_0 (iid),   c = 1, ..., K ≡ max{c_1, ..., c_N}.   (2.5)

This formulation and posterior inferences for it are well developed in the literature. From here, it is straightforward to sample the full conditionals for φ_c | {φ_j : j ≠ c}, y and for c_j | {c_k : k ≠ j}, y. We have used algorithm number 8 in Neal (2000), which closely resembles the "no gaps" algorithm of MacEachern and Müller (1998).

As previously mentioned, at iteration p of the Gibbs sampler, we are able to ascertain the ranks R_i^(p) for i = 1, ..., N, where R_i^(p) = 1 if c_i^(p) = 1 and φ_1^(p) = min_c{φ_c^(p)} ≡ φ_0^(p), and so on. The number of clusters, K^(p), also varies across iterations. We thus obtain the posterior distribution of these ranks and of the number of clusters. We denote the posterior mean of the rank for gene i as R̄_i and use it as a measure of relevance. That is, we declare gene i relevant if R̄_i exceeds a pre-specified cutoff. Alternatively, we can use the posterior mode (i.e., the most frequent rank over B posterior Monte Carlo samples) of the rank R_i for each gene as its degree of relevance. We denote this measure of relevance as R̂_i. As the value of R_i increases, the degree of relevance increases.

We also define a relevance measure that is similar to that of GMZ (see (2.7)). To this end, we denote the smallest value of φ_c at each iteration as φ_0. For gene i, we create a binary indicator, s_i, which is set to 1 when φ_{c_i} = φ_0, and zero otherwise. We can use the B posterior Monte Carlo samples to calculate v_i = 1 − Σ_{b=1}^B I(s_i^(b) = 1)/B. Gene i is then selected as relevant if v_i exceeds a pre-specified cutoff value. As we will see later, the v_i measure obtained from our model could lead to better identification of relevant genes than the corresponding v measure based on the GMZ model. We refer to our final model as BRD (Bayesian Relevance Determination).

Relevance measures are used to decide whether or not to keep regression coefficients in the model. Coefficients that correspond to values that exceed a cutoff are kept, while those that don't are dropped. As a method of comparison of different criteria, in Section 4 we simulate data from various types of models and obtain a numerical approximation to the true receiver operating characteristic (ROC) curves that correspond to each measure. The ROC curve plots the proportion of false positive outcomes against the proportion of true positive outcomes. Here the false positive rate is the proportion of genes that are not differentially expressed for which the regression coefficient was included in the selected model, and the true positive rate is the proportion of genes for which there is a difference between diseased and healthy individuals and the corresponding regression coefficient is left in the model. The area under the curve (AUC) is a measure of the overall performance of the measure as a discriminator among models; if the AUC is one then the correct model is always selected, and if it is 0.5, the measure is equivalent to tossing a fair coin, i.e., useless.
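The rank-based measures R̄_i and R̂_i and the indicator-based measure v_i can all be assembled from the Gibbs output. The following is a minimal sketch under an assumed array layout (B × N cluster labels plus per-iteration variance vectors); it is not the authors' implementation.

```python
import numpy as np

def relevance_measures(c_samp, phi_samp):
    """Summarize B Gibbs samples of cluster labels (B x N integer array c_samp)
    and per-iteration cluster variances (phi_samp, a length-B list of arrays):
      Rbar_i : posterior mean of the gene's rank (1 = smallest-variance cluster),
      Rhat_i : posterior mode of the rank,
      v_i    : 1 - Pr(gene i belongs to the smallest-variance cluster).
    Array layout is an assumption made for this sketch."""
    B, N = c_samp.shape
    ranks = np.empty((B, N), dtype=int)
    for b in range(B):
        phi = np.asarray(phi_samp[b])
        cluster_rank = np.argsort(np.argsort(phi))  # 0 = smallest variance phi_0
        ranks[b] = cluster_rank[c_samp[b]] + 1      # gene ranks start at 1
    Rbar = ranks.mean(axis=0)
    Rhat = np.array([np.bincount(ranks[:, i]).argmax() for i in range(N)])
    v = 1.0 - (ranks == 1).mean(axis=0)             # s_i^(b) = 1 when phi_{c_i} = phi_0
    return Rbar, Rhat, v
```

Note that ranking clusters by variance within each iteration sidesteps label switching: only the ordering of the φ's matters, not the arbitrary cluster labels.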
When the data are reduced to a collection of summary test statistics, our model cannot be applied directly. When the data are reduced from having two samples (cases and controls) for each gene to a single univariate test statistic, model (2.1) no longer applies. Thus, we model the summary statistics, z_i, as follows:

z_i | τ_i ∼ N(0, τ_i),   τ_i | G ∼ G (iid),   G ∼ D(G_0, γ).   (2.6)

Here, the mean zero for the z_i s corresponds to the new H_i : E(z_i) = 0, i.e., the hypothesis that there is no difference in expression for gene i between diseased and healthy individuals. This model allows for the partitioning of genes into groups of increasing relevance just as before. It also allows for the possibility that there is only a single group, which would correspond to no differential gene expression at all. While the details are of course different, inferences for this model proceed in a similar way as was discussed for the full data. Our code for the two models is available online.

We note that the marginal model for the full data based on (2.1)-(2.2) is y_ijk | x_ijk ∼ N(α_i, σ_i + x_ijk τ_i), for all i, j, k, while the comparable model here is z_i ∼ N(0, τ_i), for all i. The remaining parts of both models are identical. Here, we have no interpretation involving variable selection, but rather model selection.

Note that using summary statistics instead of the full data is not recommended in general. As pointed out by Storey, Dai and Leek (2007), too much information is lost by using gene-level summary statistics such as z scores. However, for illustration purposes, we use summary statistics in some of the simulations discussed in Section 4 and for one of the real datasets in Section 3.

By focusing on variances, our method has the ability to detect both location shift and scale change in the distribution of z_i (or β_i in the full data case). To see how τ can capture location shift, suppose all z_i for the relevant group are around 1.
Since the mean is fixed at zero, the variance for the corresponding cluster of relevant genes must be large to accommodate these values. Therefore, by fixing the mean at zero, our method has the capability to detect both location shift and scale change.

The main model proposed and discussed by GMZ (2009), for reduced data, has the following form:

z_i | µ_i ∼ N(µ_i, σ),   i = 1, ..., N,
µ_i | G ∼ G,
G ∼ D(G_0, γ),
G_0 = p h_{{0}}(·) + (1 − p) h_{{0}^c}(·).

The baseline distribution G_0 is a mixture of two terms, one with point mass at zero, and the other with a continuous distribution, N(0, σ̃), technically excluding zero; p is a mixing parameter. The same model was proposed by Bogdan, Gosh and Tokdar (2008). A particularly nice feature of this model is that this particular choice of G_0 results in a cluster of means all taking the value zero, which corresponds to the collection of null hypotheses being true. They also allow for the point mass at zero to be replaced with a small interval around zero.

They define the indicator s_i such that s_i = 1 when µ_i = 0. Then, using B posterior Monte Carlo samples, they use the following measure to set up a threshold in order to divide the genes into "significant" and "non-significant" categories:

v_i = 1 − Σ_{b=1}^B I(s_i^(b) = 1)/B.   (2.7)

GMZ showed that the criterion based on v_i can be approximated by the ODP criterion. They made a number of comparisons with ODP and found that BDP was comparable or better in a number of instances.

The BDP model for reduced data could be employed for relevance determination in the same way that ours is, by simply clustering at each iteration of the Gibbs sampler on the distinct values of |µ_i^(b)| : i = 1, ..., N. Clusters corresponding to the maximum absolute mean are considered most relevant. Their measure, v_i, can also be used to assess what we call relevance in exactly the same way that we use our v_i measure. Using simulated data, we demonstrate in Section 4 that our model for reduced data could provide greater capability, with more parsimonious modeling (e.g., fewer mixture components), to identify relevant genes for further investigation.

GMZ also adapt their model to handle the full data, y (see the last paragraph of their Section 5.2 for details). They employ a DPM that mixes (and consequently clusters) on (µ_i, σ_i), where µ_ik is the mean response for gene i in case-control type k and µ_i0 is forced to zero. They set up their clustering so that µ_i = 0 is a possibility through the same device as before. They create a statistic v_i as before and select genes based on an FDR-based cutoff. Our method based on full data is related to theirs through its use of DPM modeling. However, their clustering is on the mean-variance combination among the cases while ours is on the effects, β_i, among the cases. Our approach amounts to variable selection among these effects, ultimately including or excluding these coefficients in the model based on the magnitudes of their variances, while their approach amounts to using the clustering only through the cluster with point mass at zero. It is not clear how to use their clusters beyond that here, since there is no obvious linear ordering on the two-dimensional clusters.
3. Data Analysis
In this section, we apply BRD, BDP, and locFDR to two real datasets from two scientific studies. For the first study, we analyze the observed data directly. For the second study, we model the summary statistics.

We use the "fdrtool" package in R to run the locFDR model. The thresholds to divide genes into relevant and not relevant are specified based on q-values (Storey, 2002), which are defined as the minimum false discovery rate that is incurred when calling a test significant. (See Storey (2002) for more discussion.) For the BDP model, we used the R code available online. (In this version of the BDP model, DPM clustering is on both mean and variance.) For our model, we used the log-normal distribution with mean −3 and variance 4 for G_0, since this distribution covers the range of feasible values for τ with high probability; its 0.025 and 0.975 quantiles span several orders of magnitude. We also use a log-normal prior for the scale parameter γ. In practice, one could choose a distribution with a much narrower 95% interval and an upper limit closer to zero, since the values of γ and τ tend to be small. We use the posterior mean of rank, R̄, to set the thresholds.

Our first example involves identifying differentially expressed genes due to infection by human cytomegalovirus. This study was conducted by Chan et al. (2008). Out of 12,626 genes, they identified 1,204 genes as statistically and biologically significant. They considered a gene statistically significant if the observed significance level, the p-value, was less than 0.05 based on a one-way ANOVA test. Among the genes considered statistically significant, they selected as biologically significant only those for which at least four of six HCMV samples changed by 1.5-fold or more. Chan et al. (2008) did not adjust their significance cutoff for multiple hypothesis testing.
By using the local false discovery rate and setting the cutoff q-value (Storey, 2002) to 0.05, the number of selected genes is reduced to 361. The number of biologically significant genes among these is 328.

Fig 1. Histogram of gene ranks (R̂) for the HCMV data. The top ranking group (R̂ = 4) includes 27 genes.

For this example, we run our BRD model directly on the observed data, y, as opposed to the summary statistics. The bar plot in Figure 1 shows the distribution of ranks R̂. As can be seen, the genes are divided into four groups with respect to their degree of relevance. The most relevant group, R̂ = 4, includes 27 genes. Of these 27 genes, 18 are among the 361 genes selected based on the local false discovery rate, and 19 are biologically significant (i.e., at least 4 out of 6 samples changed by 1.5-fold). The list of these genes is provided as a supplementary file. Note that some of these genes have relatively high q-values and are not considered relevant based on classical significance tests.

We used the Functional Annotation tool in DAVID Bioinformatics Resources (Huang, Sherman and Lempicki, 2008) to learn more about the selected genes. We found that 7 of these genes are involved in rheumatoid arthritis, 7 in colorectal cancer, 6 in breast cancer, and 5 in inflammatory bowel disease. HCMV is in fact known to be related to inflammatory diseases (e.g., rheumatoid arthritis) and several cancers. Söderberg-Nauclér (2008) reviews the evidence for the involvement of HCMV microinfections in inflammatory diseases and cancer.

We were also interested in finding the pathways that these genes represent. Using DAVID, we found that 6 of these genes are involved in the Toll-like receptor signaling pathway.
Further, we identified several Gene Ontology (GO) biological process terms associated with the selected genes. Specifically, we found that 9 of these genes are related to defense response, and 8 are associated with inflammatory response. The relationship between HCMV and these biological processes is well documented (e.g., Booss et al., 1989; Compton et al., 2003).

We also used the BDP model for this study. The minimum value of v over all genes was 0.998, which means no gene was selected as significant.

As discussed above, we could also use z scores instead of the full data. For our model, this leads to the selection of 23 genes that correspond to R̂ = 5; genes were divided into 5 groups based on their relevance. Six of these genes are among the top 27 that were selected by our model using the full data. For the BDP model, 631 genes were selected using the cutoff 0.01 for v. Among the top 23 genes selected using BDP, only 3 appear in the list of 23 genes based on BRD. Thus the methods can give quite distinct results.

Our second example, a well-known study typically used as a benchmark problem, involves identifying differentially expressed genes between two types of leukemia: acute myeloid leukemia (AML) and acute lymphoid leukemia (ALL). This study was conducted by Armstrong et al. (2002). The data include the expression levels of 10,056 genes for 48 subjects (24 subjects in each group).

The histogram of z scores for the leukemia data is shown in Figure 2. For our model (2.6), we ran 4000 iterations of MCMC, discarded the first 1000 iterations, and used the remaining iterations to obtain posterior distributions. Convergence of the chain was verified based on the trace plots of the hyperparameters. As mentioned above, at each iteration, we denote the smallest value of φ as φ_0. The curve in Figure 2 shows the predictive density based on φ_0. To obtain this curve, we found the density function for N(0, φ_0^(b)) at iteration b, and averaged these over the MCMC iterations.
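This averaging of normal densities over posterior draws can be sketched as follows; `phi0_samples` is a hypothetical array of posterior draws of φ_0, standing in for actual Gibbs output.

```python
import numpy as np

def null_predictive_density(z_grid, phi0_samples):
    """Approximate the predictive density for the least-relevant cluster by
    averaging the N(0, phi0) density over posterior draws of phi0 (the smallest
    cluster variance at each MCMC iteration)."""
    z = np.asarray(z_grid)[:, None]
    phi0 = np.asarray(phi0_samples)[None, :]
    dens = np.exp(-z ** 2 / (2.0 * phi0)) / np.sqrt(2.0 * np.pi * phi0)
    return dens.mean(axis=1)

# Illustrative draws of phi0; in practice these come from the Gibbs sampler.
grid = np.linspace(-10.0, 10.0, 401)
curve = null_predictive_density(grid, np.array([0.8, 1.0, 1.2]))
```

Because each term in the average is a proper density, the averaged curve is itself a proper density, symmetric about zero.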
Fig 2. Histogram of observed z scores for the leukemia data. The superimposed curve is the predictive density based on φ_0, the smallest value of φ at each iteration. The five points identified on the plot are the z scores for the genes that are identified as relevant based on R.

The posterior mode of rank, R̂, is 1 for most genes. However, there are five genes whose posterior mode of rank is R̂ = 2. These genes (in descending order of R̄) are TCL1A, DNTT, CD24, TOP2B, and PSMA6. The values of v for these genes are 0.011, 0.048, 0.06, 0.107, and 0.19, respectively. These genes are represented by individual data points in Figure 2.

The identified genes are all known to be associated with leukemia. Specifically, TCL1A (T-cell leukemia/lymphoma 1A) is known to be upregulated in ALL patients (Zangrando et al., 2009). DNTT is also upregulated in B lineage ALL compared to AML (Farahat et al., 1995). Raife et al. (1994) showed that expression of CD24 predicts monocytic lineage in AML. The resistance of several leukemia cell lines to therapy has been associated with decreased protein expression and/or activity of TOP2B (topoisomerase II) enzymes (Deffie, Batra and Goldenberg, 1989). Lastly, while the effect of PSMA6 on ALL vs. AML is not well studied, Chen et al. (2009) have recently shown that the expression level of PSMA6 in AML-M5 leukemia cells was low compared to normal blood cells.

We also used the BDP model for analyzing the leukemia data, and ranked the genes based on their values of v. The five genes identified by our model are also the top ranking genes using BDP. Using locFDR and setting the cutoff for the q-value at 0.05, we select 2652 genes as significant. The top 5 genes based on this method are TOP2B, TCL1A, PSMA6, CD24, and CD79A. Four of these genes also appear among the top five genes based on BDP and BRD.
4. Simulations
In this section, simulated summary data are used to compare the results of locFDR, BDP, and BRD. We also use the full data to compare our method to BODP (Cao et al., 2009) as well as to locFDR and BDP. We compare the methods using the area under the receiver operating characteristic (ROC) curve (AUC) for identifying relevant genes.

Our first simulation study, Simulation 1, is similar to that of GMZ (2009). We generate 500 z-scores. (Note that we simulate z-scores, as opposed to gene expression values y, for simplicity.) The first 40 z-scores are sampled from N(µ_k, σ²), where σ = 1 and the means µ_k take several fixed nonzero values; the remaining 460 z-scores are sampled from N(0, 1). A good model should identify the first 40 z-scores as relevant. Note that the data follow the BDP model very closely (i.e., the mixture components have different means and the same variance), so we expect it to outperform the BRD model.

Our criterion for comparing methods is the AUC, which is commonly used to determine how well diagnostic procedures discriminate between groups. Table 1 shows the average AUC over 100 simulated data sets; the corresponding standard errors (SE) are presented in parentheses. We see that BRD has a statistically higher average AUC than BDP (p-value < 0.01 using a paired t-test); however, the difference is not substantially large. Since outcomes depend on the specific conditions under which the z-scores are generated, we investigated further, repeating the above simulation with different values of σ. For the first 20 data sets, we set σ = 0.25. Under this scenario, the relevant genes are easy to identify. For the second 20 data sets, we increased σ to 0.5, which makes the identification of relevant genes slightly more difficult. We continued increasing σ to 1, 1.5, 2, and 2.5 for each consecutive set of 20 simulated data sets. Figure 3 compares the two models, BDP and BRD, in terms of average AUC for each value of σ. While BRD performs slightly better than BDP for small values of σ, BDP performs better than BRD for relatively larger values of σ, where identifying relevant genes becomes difficult. This was of course expected, since the data match the BDP model more closely.

Fig 3. Increasing the standard deviation, σ, in Simulation 1 gradually from 0.25 to 2.5. The data are generated according to the BDP model. While BRD performs better than BDP initially, the BDP model outperforms our model for higher values of the standard deviation.

In what follows, we consider alternative scenarios where the data are not generated according to any of the above three models. For our second simulation study, Simulation 2, we allow the distribution of the z-scores for the non-relevant genes to deviate from normality. To this end, we sample 460 p-values for these genes from Beta(6, 4) and obtain their z-scores by applying the inverse cdf of the standard normal distribution to these p-values. Under Beta(6, 4) the p-values concentrate away from zero, so while they are generally considered "non-significant," they are not uniformly distributed between 0 and 1. Therefore, the resulting distribution of the z's is not N(0, 1). For the remaining 40 relevant and moderately relevant genes, we sample p-values from Beta(6, 40), which concentrates near zero, and again obtain z-scores by applying the inverse cdf of the standard normal distribution to these p-values. Table 1 shows the average (over 100 data sets) AUC and the corresponding standard errors for the three models. This time, our model performs substantially better than the two alternative models.

For our third simulation study, we try to make the simulated data as realistic as possible. To this end, we use actual gene expression values and class labels from real biological data obtained by interrogating the mutation status of p53 in cancer cell lines. The data are publicly available.
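The z-score constructions in Simulations 1 and 2 can be sketched as follows. This is a minimal illustration, not the authors' code: the exact nonzero means µ_k are not fully specified above, so mu_k = (-3, -2, 2, 3) is an assumed placeholder, and genes are ranked simply by |z|.

```python
import numpy as np
from scipy import stats
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(1)

# Simulation 1 style: 40 relevant z-scores from a normal mixture with
# nonzero means (mu_k below is an illustrative assumption), 460 nulls.
mu_k = np.array([-3.0, -2.0, 2.0, 3.0])
sigma = 1.0
z_relevant = rng.normal(rng.choice(mu_k, size=40), sigma)
z_null = rng.normal(0.0, 1.0, size=460)

# Simulation 2 style: non-normal nulls via the inverse-cdf trick --
# Beta-distributed p-values mapped through the standard normal quantile
# function, so the resulting z's are not N(0, 1).
p_null = rng.beta(6, 4, size=460)      # "non-significant" but non-uniform
z_null_beta = stats.norm.ppf(p_null)

# AUC of a toy rule that ranks genes by |z|, as a point of reference.
z = np.concatenate([z_relevant, z_null])
labels = np.concatenate([np.ones(40), np.zeros(460)])
auc = roc_auc_score(labels, np.abs(z))
```

Averaging such an AUC over repeated replicates is the quantity that Table 1 reports for the actual models.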
Table 1. Comparing BRD to BODP, locFDR, and BDP based on the area under the ROC curve (AUC). The corresponding standard errors are shown in parentheses.

AUC%           locFDR       BODP         BDP          BRD (v)      BRD (R̄)
Simulation 1   75.8 (0.7)   -            75.8 (0.4)   76.8 (0.4)   77.0 (0.4)
Simulation 2   90.9 (1.5)   -            88.9 (0.4)   93.5 (0.3)   94.8 (0.2)
Simulation 3   64.5 (2.3)   -            82.2 (1.3)   90.3 (0.7)   94.4 (0.4)
Simulation 4   57.9 (1.8)   -            88.5 (1.1)   86.8 (0.8)   91.8 (0.6)
Simulation 5   73.7 (1.1)   76.2 (1.2)   65.1 (1.0)   79.6 (1.0)   77.7 (1.0)
Simulation 6   75.5 (1.1)   72.8 (1.2)   63.9 (1.0)   77.6 (1.1)   78.3 (1.1)

Out of n = 50 cell lines, 17 were classified as normal and the remaining 33 were classified as mutated. The data include 9,703 human cDNAs, which correspond to approximately 8,000 different genes. For each gene, we calculate the z-score using the t-test based on the difference in gene expression between normal and mutated cells. We use these z-scores (approximately 8,000) to simulate 100 data sets. For this, we sort the z-scores according to their absolute values. Then, we randomly sample between 40 and 80 z values from the top 400 genes (i.e., the genes with the highest values of |z|) and sample 460 z values from low-ranking genes (i.e., genes with relatively lower values of |z|). A good model is expected to regard the first set of genes as relevant and the remaining 460 genes as not relevant. As before, we use AUC to compare the above three methods. Our model outperforms the other two methods by a substantial amount (Table 1).

Simulation 4 is similar to Simulation 3 (i.e., we use the actual z-scores from the p53 data), but this time we create two groups of relevant genes with different degrees of relevance. To this end, we randomly select 5 to 10 z-scores from the top 100 genes and regard them as relevant. Then, we remove the top 200 genes from the data. From the remaining genes, we select 20 to 30 of the top-ranking z-scores and regard the corresponding genes as moderately relevant. Finally, we randomly sample 460 of the low-ranking z-scores for the not-relevant group. This time, BDP performs slightly better than our model if we use v as a measure of relevance (Table 1).
The difference, however, is not statistically significant (p-value = 0.23 using a paired t-test). If we use the posterior mean of the rank, R̄, as a measure of relevance, our model performs substantially better than BDP (p-value = 0.02 using a paired t-test).

So far, our simulations have been based on summary statistics. Next, we evaluate the performance of our method based on the full data, as discussed in Section 2.1. For Simulation 5, we randomly sample 10 cell lines and 250 genes from the p53 dataset. We permute the labels (normal vs. mutated) of these cell lines. In this way, all 250 genes become irrelevant with respect to the new labels, denoted as x̃_ijk, where i indexes the genes, j the cell lines, and k = 0, 1. Here, k = 0 denotes the control group and k = 1 the case group after permutation of the labels. For the first five genes (i.e., i = 1, ..., 5), we add a constant to the gene expression values in the case group (k = 1). The constant is set to -1 with probability 0.7 and to 2 with probability 0.3. Therefore, the first five genes in this simulation are considered relevant, while the remaining 245 genes are irrelevant. Results for this simulation are shown in Table 1. As before, our method performs better than BDP and locFDR. It also outperforms BODP. However, the improvement is statistically significant (at the 0.05 level using a paired t-test) based on v only; the improvement based on R̄ is marginally significant (p-value = 0.07). For the BDP model, in consultation with the authors, we used an Inv-Gamma(1, 1) prior distribution for σ_i and set k = 2. (See Equation 16 in GMZ.) For this model, the low value of the AUC is unexpected. While we did perform other simulations with similar results, we recommend caution about generalizing any of these empirical results. Our method was designed with variable selection as a goal, and Simulation 5 has indeed created a variable selection problem, so perhaps this is what gives BRD the advantage. If we calculate the z-scores for the simulated data and run BDP and BRD based on summary statistics, the AUCs increase to 94.6, 97.4, and 97.5 for BDP, BRD using v, and BRD using R̄, respectively. Both models perform substantially better than locFDR.

For our sixth simulation, we follow a procedure similar to Simulation 5, but we increase the sample size to 20 and sample, from the standard normal distribution, the values that are added to the gene expression of the case group. As before, our model provides the highest AUC (Table 1), and the improvement is statistically significant compared to all other methods. Also, similar to Simulation 5, the AUC is unexpectedly low for BDP. However, if we use the z-scores, as opposed to the full data, for this model, the AUC increases to 70.0. For our model, using z-scores reduces the AUC slightly to 75.1 and 75.2 based on v and R̄, respectively.

Fig 4. Average number of mixture components for different values of a when no gene is relevant to the disease. Here, log γ ∼ N(a, 2). The vertical lines show the corresponding confidence interval for each mean. Note that as a increases, the prior puts higher probabilities on a large number of components.

Finally, we compare the two nonparametric models, BDP and BRD, when no gene is related to the outcome of interest and all observed effects are due to chance alone.
In this case, we expect the two models to assign all genes to one group, which will be regarded as not relevant; that is, the mixture distribution should have only one component. To evaluate the two models, we simulated 500 z-scores from the standard normal distribution. We ran both models on these data and obtained the mode of the number of mixture components over the MCMC iterations. We denote this measure as C_m. For the BDP model, the average of C_m over 100 data sets was 2.97 (SE = 0.07). The average of C_m for the BRD model was 1.03 (SE = 0.017), which is substantially lower than that of BDP.
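The qualitative behavior in this no-signal check can be reproduced with an off-the-shelf truncated Dirichlet-process mixture. The sketch below uses scikit-learn's variational implementation as a stand-in for the paper's MCMC samplers (it is not the BDP or BRD code), with the truncation level and the 0.05 weight threshold chosen arbitrarily.

```python
import numpy as np
from sklearn.mixture import BayesianGaussianMixture

rng = np.random.default_rng(0)
z = rng.normal(0.0, 1.0, size=(500, 1))  # 500 null z-scores: no relevant gene

# Truncated variational Dirichlet-process mixture; the concentration
# parameter plays a role analogous to gamma in the text.
dpm = BayesianGaussianMixture(
    n_components=10,                                   # truncation level
    weight_concentration_prior_type="dirichlet_process",
    weight_concentration_prior=1.0,
    max_iter=500,
    random_state=0,
).fit(z)

# Count components that receive non-negligible posterior weight; for
# pure-noise data this count should stay small.
effective = int(np.sum(dpm.weights_ > 0.05))
```

This mirrors the C_m comparison above: a model that behaves well under the null should leave nearly all posterior weight on a single component.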
5. Sensitivity Analysis
A key parameter in our model is γ, a positive scale parameter that controls the number of mixture components represented in the sample; a larger value of γ results in a larger number of components. In the previous section, we showed that our model performs reasonably well when there is no relevant gene in the data, keeping the number of mixture components close to one. In this section, we investigate the influence of the specified prior for γ on the number of identified mixture components, C_m, when no gene is relevant.

In our model, we assumed the following prior distribution for γ:

γ ∼ Log-N(a, b).

To change this prior, we set a to -3, -2, -1, 0, 1, and 2 while fixing b at 2. Increasing a puts more prior probability on large values of γ and, in turn, increases the prior probability of having a large number of components. We simulated 100 data sets for each value of a. Figure 4 shows that the average of C_m increases from 1.03 to 1.2 as we increase a from -3 to 2. Note that for most practical problems, values such as 1 and 2 for a lead to unreasonably wide priors. In practice, moderately negative values of a seem more appropriate, as they place most of the prior probability for γ between 0 and 10.
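The effect of a on the number of occupied clusters can be illustrated by simulating from the prior alone, using the Chinese restaurant process (the partition induced by a Dirichlet process). This sketch is illustrative only: no data enter, and it assumes b = 2 is the standard deviation of log γ.

```python
import numpy as np

rng = np.random.default_rng(0)

def crp_num_clusters(n, gamma, rng):
    """Number of occupied clusters when n items are seated by a
    Chinese restaurant process with concentration parameter gamma."""
    counts = []
    for _ in range(n):
        # Join an existing cluster with probability proportional to its
        # size, or open a new one with probability proportional to gamma.
        probs = np.array(counts + [gamma], dtype=float)
        k = rng.choice(len(probs), p=probs / probs.sum())
        if k == len(counts):
            counts.append(1)   # open a new cluster
        else:
            counts[k] += 1
    return len(counts)

def mean_clusters(a, b=2.0, n=500, reps=50):
    """Average cluster count when gamma ~ Log-N(a, b), b = sd of log gamma."""
    return float(np.mean([
        crp_num_clusters(n, rng.lognormal(mean=a, sigma=b), rng)
        for _ in range(reps)
    ]))

low, high = mean_clusters(-3.0), mean_clusters(2.0)
```

As in Figure 4, larger values of a yield more clusters a priori, which is why moderately negative values of a give a more conservative prior on the number of components.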
6. Discussion
We have proposed a new approach for analyzing large-scale studies whose objective is to identify factors that are relevant to an outcome of interest. Our proposed method divides a set of candidate factors (e.g., genes) into several subgroups according to their degree of relevance. Not only does our method provide a flexible model for analyzing high-throughput studies, it also simplifies the task of selecting a subset of factors for possible follow-up studies.

Our method uses the Dirichlet process to introduce a random grouping of factors (e.g., genes). It is common to refer to such priors as "Polya urn" priors to accentuate the clustering aspect of the prior. Many alternative priors have recently been introduced in the literature. For example, the normalized inverse Gaussian process (Lijoi, Mena and Prunster, 2005) allows different priors on the clustering while keeping the posterior inference simple.

While we have applied our model to the analysis of gene expression microarrays, our approach can be applied to a wide range of problems in which the relevance of many factors is investigated simultaneously. For example, functional neuroimaging is used to study normal vs. pathological brain processes. A typical experiment in this area involves assessing a large number of pixels, each representing a small area of brain tissue. Another possible application of our method is the analysis of single-nucleotide polymorphisms in genome-wide association studies, whose objective is to identify and characterize genetic variants related to common complex diseases.

In this paper, we presented our model for a binary health outcome variable and continuous responses. Our model could easily be extended to multiple health outcomes and discrete responses. This is especially simple if we use summary statistics.

The main challenge of our model is its high computational cost. For the leukemia data, running our code takes 2.92 seconds of CPU time per iteration.
While this is much less than the 32.75 seconds of CPU time per iteration required by the BDP model (perhaps an unfair comparison, since the two codes were developed independently and without comparison in mind), it is possible to reduce the computational cost by using more efficient MCMC algorithms. For example, we could apply the "split-merge" approach, which follows a Metropolis-Hastings procedure that resamples clusters of observations simultaneously rather than incrementally reassigning one observation at a time (Jain and Neal, 2007). Alternatively, one could use the method of Dahl (2009) to find the maximum a posteriori (MAP) estimate (instead of the posterior distribution) of the cluster structure in univariate Dirichlet process mixtures.

Future directions could involve incorporating additional knowledge about the underlying structure of the data. For example, we could incorporate prior information on the interconnectivity among genes. In this way, knowing that a gene, G_j, is differentially expressed could increase the probability of differential expression for other genes related to that gene (i.e., genes that either activate G_j or are activated by it). To this end, we can use relevant biological data to group genes into subsets of related genes and shift the focus of the analysis toward gene sets as opposed to individual genes (Shahbaba et al., 2011).

Another possible research direction involves extending our model to incorporate more information on subjects. For example, we could include clinical measures and demographic variables in our model.

Finally, there is much scope for theoretical investigation of DPM-based approaches. We are currently developing a first attempt at this investigation and expect to continue it in future work.

References
Armstrong, S., Staunton, J., Silverman, L., Pieters, R., den Boer, M., Minden, M., Sallan, S., Lander, E., Golub, T. and Korsmeyer, S. (2002). MLL translocations specify a distinct gene expression profile, distinguishing a unique leukemia. Nature Genetics.
Benjamini, Y. and Hochberg, Y. (1995). Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society, Series B (Methodological).
Bogdan, M., Gosh, J. and Tokdar, S. (2008). A comparison of the Benjamini-Hochberg procedure with some Bayesian rules for multiple testing. In Beyond Parametrics in Interdisciplinary Research: Festschrift in Honor of Professor Pranab K. Sen (N. Balakrishnan, E. Pena and M. Silvapulle, eds.) 211-230. Institute of Mathematical Statistics.
Booss, J., Dann, P. R., Griffith, B. P. and Kim, J. H. (1989). Host defense response to cytomegalovirus in the central nervous system: predominance of the monocyte. Am J Pathol.
Cao, J., Xie, X. J., Zhang, S., Whitehurst, A. and White, M. (2009). Bayesian optimal discovery procedure for simultaneous significance testing. BMC Bioinformatics.
Chan, G., Bivins-Smith, E. R., Smith, M. S., Smith, P. M. and Yurochko, A. D. (2008). Transcriptome analysis reveals human cytomegalovirus reprograms monocyte differentiation toward an M1 macrophage. J Immunol.
Chen, Y.-X., Wang, W.-P., Zhang, P.-Y., Zhang, W.-G., Liu, J. and Ma, X.-R. (2009). Expression of genes psma6 and slc25a4 in patients with acute monocytic leukemia. Journal of Experimental Hematology (in Chinese).
Compton, T., Kurt-Jones, E. A., Boehme, K. W., Belko, J., Latz, E., Golenbock, D. T. and Finberg, R. W. (2003). Human cytomegalovirus activates inflammatory cytokine responses via CD14 and Toll-like receptor 2. J Virol.
Dahl, D. B. (2009). Modal clustering in a class of product partition models. Bayesian Analysis.
Dahl, D. and Newton, M. (2007). Multiple hypothesis testing by clustering treatment effects. Journal of the American Statistical Association.
Deffie, A., Batra, J. and Goldenberg, G. (1989). Direct correlation between DNA topoisomerase II activity and cytotoxicity in adriamycin-sensitive and -resistant P388 leukemia cell lines. Cancer Research.
Do, K., Müller, P. and Tang, F. (2005). A Bayesian mixture model for differential gene expression. Journal of the Royal Statistical Society, Series C (Applied Statistics).
Efron, B., Tibshirani, R., Storey, J. D. and Tusher, V. (2001). Empirical Bayes analysis of a microarray experiment. Journal of the American Statistical Association.
Farahat, N., Lens, D., Morilla, R., Matutes, E. and Catovsky, D. (1995). Differential TdT expression in acute leukemia by flow cytometry: a quantitative study. Leukemia.
George, E. I. and McCulloch, R. E. (1993). Variable selection via Gibbs sampling. Journal of the American Statistical Association.
Gopalan, R. and Berry, D. (1993). Bayesian multiple comparisons using Dirichlet process priors. Journal of the American Statistical Association.
Guindani, M., Müller, P. and Zhang, S. (2009). A Bayesian discovery procedure. Journal of the Royal Statistical Society, Series B (Statistical Methodology).
Hochberg, Y. (1988). A sharper Bonferroni procedure for multiple tests of significance. Biometrika.
Hommel, G. (1988). A stagewise rejective multiple test procedure based on a modified Bonferroni test. Biometrika.
Huang, D. W., Sherman, B. T. and Lempicki, R. A. (2008). Systematic and integrative analysis of large gene lists using DAVID bioinformatics resources. Nature Protocols.
Jain, S. and Neal, R. M. (2007). Splitting and merging components of a nonconjugate Dirichlet process mixture model (with discussion). Bayesian Analysis.
Lijoi, A., Mena, R. H. and Prunster, I. (2005). Hierarchical mixture modeling with normalized inverse-Gaussian priors. Journal of the American Statistical Association.
MacEachern, S. N. and Müller, P. (1998). Estimating mixture of Dirichlet process models. Journal of Computational and Graphical Statistics.
Müller, P., Parmigiani, G. and Rice, K. (2007). FDR and Bayesian multiple comparisons rules. In Bayesian Statistics 8 (J. Bernardo, S. Bayarri, J. Berger, A. Dawid, D. Heckerman, A. Smith and M. West, eds.) Oxford University Press.
Neal, R. M. (2000). Markov chain sampling methods for Dirichlet process mixture models. Journal of Computational and Graphical Statistics.
Newton, M., Kendziorski, C., Richmond, C., Blattner, F. R. and Tsui, K. W. (2001). On differential variability of expression ratios: improving statistical inference about gene expression changes from microarray data. Journal of Computational Biology.
O'Hara, R. and Sillanpää, M. (2009). A review of Bayesian variable selection methods: what, how, and which. Bayesian Analysis.
Raife, T., Lager, D., Kemp, J. and Dick, F. (1994). Expression of CD24 (BA-1) predicts monocytic lineage in acute myeloid leukemia. American Journal of Clinical Pathology.
Scott, J. and Berger, J. (2006). An exploration of aspects of Bayesian multiple testing. Journal of Statistical Planning and Inference.
Scott, J. G. and Berger, J. O. (2010). Bayes and empirical-Bayes multiplicity adjustment in the variable-selection problem. Annals of Statistics.
Sethuraman, J. (1994). A constructive definition of Dirichlet priors. Statistica Sinica.
Shahbaba, B., Tibshirani, R., Shachaf, C. M. and Plevritis, S. K. (2011). Bayesian gene set analysis for identifying significant biological pathways. Journal of the Royal Statistical Society, Series C.
Söderberg-Nauclér, C. (2008). HCMV microinfections in inflammatory diseases and cancer. J Clin Virol.
Storey, J. D. (2002). A direct approach to false discovery rates. Journal of the Royal Statistical Society, Series B (Statistical Methodology).
Storey, J. D. (2007). The optimal discovery procedure: a new approach to simultaneous significance testing. Journal of the Royal Statistical Society, Series B (Statistical Methodology).
Storey, J. D., Dai, J. Y. and Leek, J. T. (2007). The optimal discovery procedure for large-scale significance testing, with applications to comparative microarray experiments. Biostatistics.
Storey, J. D., Taylor, J. E. and Siegmund, D. (2004). Strong control, conservative point estimation and simultaneous conservative consistency of false discovery rates: a unified approach. Journal of the Royal Statistical Society, Series B.
Sun, W. and Cai, T. (2009). Large-scale multiple testing under dependence. Journal of the Royal Statistical Society, Series B (Statistical Methodology).
Tusher, V., Tibshirani, R. and Chu, G. (2001). Significance analysis of microarrays applied to the ionizing radiation response. PNAS.
Westfall, P. H. and Young, S. S. (1993). Resampling-based Multiple Testing: Examples and Methods for P-value Adjustment. John Wiley and Sons.
Zangrando, A., Dell'Orto, M., te Kronnie, G. and Basso, G. (2009). MLL rearrangements in pediatric acute lymphoblastic and myeloblastic leukemias: MLL specific and lineage specific signatures. BMC Medical Genomics.