A nonparametric empirical Bayes framework for large-scale multiple testing
Ryan Martin
Department of Mathematics, Statistics, and Computer Science
University of Illinois at Chicago
[email protected]
Surya T. Tokdar
Department of Statistical Science
Duke University
[email protected]
November 7, 2018
Abstract
We propose a flexible and identifiable version of the two-groups model, motivated by hierarchical Bayes considerations, that features an empirical null and a semiparametric mixture model for the non-null cases. We use a computationally efficient predictive recursion marginal likelihood procedure to estimate the model parameters, even the nonparametric mixing distribution. This leads to a nonparametric empirical Bayes testing procedure, which we call PRtest, based on thresholding the estimated local false discovery rates. Simulations and real-data examples demonstrate that, compared to existing approaches, PRtest's careful handling of the non-null density can give a much better fit in the tails of the mixture distribution which, in turn, can lead to more realistic conclusions.
Keywords and phrases:
Dirichlet process; marginal likelihood; mixture model; predictive recursion; two-groups model.
1 Introduction

Large-scale multiple testing problems arise in many applied fields such as genomics (Dudoit and van der Laan 2008; Schäfer and Strimmer 2005), proteomics (Ghosh 2009), astrophysics (Liang et al. 2004; Miller et al. 2001), and image analysis (Lindquist 2008; Schwartzman et al. 2008), to name a few. An abstract representation of the problem is testing a set of hypotheses H_i: the i-th case manifests a "null" behavior, i = 1, ..., n, based on a corresponding set of z-scores Z_1, ..., Z_n. The null behavior of a single z-score Z_i can be described by the N(0, 1) distribution when Z_i is defined as the Gaussian transform of a test statistic derived for the i-th case, such as the two-sample t-statistic comparing treatment to control. Although this characterization leads to a simple rejection rule for the i-th case in isolation, it is found insufficient when all n tests are to be performed, particularly when n is very large. In fact, one of the major developments of modern statistics has been the philosophical shift from treating the z-scores as mutually independent to treating them as exchangeable (Efron and Tibshirani 2002). Consequently, recent work on large-scale simultaneous testing has focused on Bayesian models and, in particular, empirical Bayes methods that allow for information sharing between cases, even though separate decisions will be made for each case.

An elegant formalization of the large-scale simultaneous testing problem is the two-groups model (Efron 2004, 2007, 2008), which assumes Z_1, ..., Z_n arise from a mixture density

f(z) = π f_0(z) + (1 − π) f_1(z), (1)

with f_0 and f_1, respectively, describing the null and non-null distributions of the z-scores. Efron (2004, 2008) argues that, for a variety of reasons, the case-specific theoretical null distribution N(0, 1) may not be an adequate choice for f_0, and a more appropriate choice is the so-called empirical null distribution N(μ, σ^2), where μ and σ are to be estimated from data.

Following Efron's original treatment, various new methods have been proposed for fitting and drawing inference from the two-groups model of z-scores (Jin and Cai 2007; Muralidharan 2010). These methods, together with related methodology based on p-values or t-scores (e.g., Benjamini and Hochberg 1995; Storey 2003), have been widely used in biological studies with high-throughput data, in particular to identify genes responsible for a phenotypical behavior based on microarray analysis. The single-summary-per-case approach of these methods offers a substantial computational advantage over other approaches to analyzing such data, such as those based on high-dimensional classification techniques (Golub et al. 1999; Lee et al. 2003).

However, currently available methods for fitting (1) do not take full advantage of the two-groups formulation. Motivated by applications to microarray studies, where typically a very small fraction of genes are linked with the phenotype, existing two-groups methods take a conservative approach of encouraging estimates of π close to 1. While this is reasonable for many applications, there are scientific studies where such a conservative approach fails to detect any or a majority of the interesting cases. Figure 1 reports two such microarray studies, a leukemia study by Golub et al. (1999) and a breast cancer study by Hedenfalk et al. (2001); more details are given in Section 6. As shown in the figure, existing methods each produce estimates of the null component π f_0 that cover one or both tails of the z-score histogram, leaving little to be explained by the non-null component (1 − π) f_1. Consequently, zero discoveries of interesting genes are made in one or both tails; see Table 3 in Section 6. High-dimensional classification-based analyses (Golub et al. 1999; Hedenfalk et al. 2001; Lee et al. 2003), on the other hand, identify a number of interesting genes on either tail for each of the two studies.

Figure 1: Density histograms of z-scores from (a) the leukemia microarray data (Golub et al. 1999) and (b) the breast cancer data (Hedenfalk et al. 2001), and estimates of π f_0 based on the methods of Efron (2004) (solid), Jin and Cai (2007) (dashed), and Muralidharan (2010) (dotted).

In this paper we consider a new likelihood-based analysis of the two-groups model, with a regularization on μ, σ, π and a semiparametric specification of the non-null density f_1. We employ a mixture representation of f_1 that gives it heavier tails than f_0, to reflect the belief that z-scores from the non-null cases are likely to be larger in magnitude than those from the null cases. The null weight π is given a beta prior with a center close to one but with a relatively long left tail. Additionally, we use a prior on (μ, σ) to reflect the belief that this vector is likely to be close to (0, 1). For scientific studies where the existing methods discover a fair number of interesting cases, our method makes similar discoveries. But for other studies where existing methods fail, such as the two studies mentioned earlier, our method makes discoveries that are comparable to those found via high-dimensional classification methods.
A similar adaptability property manifests in a simulation study where z-scores are generated according to (1) with π ranging from 0.75 to 0.99; see Table 1 in Section 5.

Despite a nonparametric specification of f_1 and a likelihood-based analysis, our treatment of the two-groups model retains the computational efficiency that is a hallmark of methods based on z-scores. This has been possible due to recent developments of a stochastic algorithm proposed by Newton (2002), called predictive recursion, for estimating mixing densities with respect to an arbitrary dominating measure; see also Newton et al. (1998). Theoretical properties of this algorithm are addressed in Ghosh and Tokdar (2006), Martin and Ghosh (2008), Tokdar et al. (2009), and Martin and Tokdar (2009). Martin and Tokdar (2011) show how this algorithm can be used in a hierarchical mixture model to construct a likelihood function over non-mixing model parameters, marginalized over the mixing density. This marginal likelihood is shown to have strong connections to the marginal likelihood under a Bayesian Dirichlet process mixture model. We adapt this marginal likelihood calculation to the two-groups model, with μ, σ, π, and a scaling parameter in the specification of f_1 serving as the non-mixing parameters.

For the multiple testing problem, we adopt the strategy of mimicking the Bayes oracle rule by thresholding a plug-in estimate of the local false discovery rate, similar to Efron (2004, 2008), Jin and Cai (2007), and Muralidharan (2010). Simulations presented in Section 5 show that the proposed method, called PRtest, is more adaptive to asymmetry in the non-null density f_1 and to the degree of sparsity characterized by π. Performance of PRtest in an interesting example using the artificial microarray data of Choe et al. (2005) is addressed in Section 6. In this example, the set of interesting genes is known, and we find that PRtest performs considerably better than existing methods and strikingly similar to the oracle. Likewise, for the leukemia and hereditary breast cancer studies, we find that the PR-based estimation produces a better fit in the tails of the distribution than that seen in Figure 1 and, consequently, we are able to identify a number of interesting genes in each example. The identified genes are, in fact, consistent with those identified by more sophisticated high-dimensional classification-based techniques.

2 The two-groups model

We take f_0(z) = N(z | μ, σ^2), the normal density with unknown mean μ and variance σ^2. The non-null density f_1 is taken to be a semiparametric mixture of the form

f_1(z) = ∫_U N(z | μ + τσu, σ^2) ψ(u) du, (2)

with ψ a density with respect to the Lebesgue measure on U = [−1, 1] and τ ≥ 1. That identifiability holds while requiring only that ψ be a density is established in the following theorem; see Appendix A for a proof.

Theorem 1. For f_0 and f_1 as described above, the parameters (μ, σ, π, τ, ψ) in our version of the two-groups model are identifiable.

This result is useful because, in general, identifiability is not guaranteed for a two-groups model (1) with an empirical null that involves unknown parameters. For our specification, the key to identifiability is the model feature that f_1, by virtue of averaging over location shifts of f_0, has heavier tails than f_0. This feature is scientifically relevant as it embeds the belief that z-scores in the tails of the histogram are more likely to correspond to non-null cases than null. Efron (2008) incorporates a similar belief through a zero-assumption: most z-scores near zero are from the null component. However, such a zero-assumption can be too strong to allow learning from data, and can lead to an estimate of π f_0 that has heavier tails than any reasonable histogram-smoothing estimate of f, as reported by Strimmer (2008) and illustrated in Figure 1. In comparison, separating f_0 and f_1 by their tails seems more practical; see Section 6.

An important feature of the model is that f_0 and the kernel being mixed in (2) have a common scale. This, along with the assumption that ψ is a density, are the driving forces behind the characteristic function-based proof of Theorem 1, not the distributional form of these densities. In fact, the proof in Appendix A applies when f_0(z) = σ^{-1} p((z − μ)/σ) for a suitably smooth density p symmetric about zero, and f_1 is likewise determined by p. Here we focus only on the case where p is the N(0, 1) density, but other choices, like a Student-t density with fixed degrees of freedom, can also be entertained.
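To fix ideas, the following R sketch evaluates the two-groups density (1) with the non-null component (2) by simple quadrature over U = [−1, 1]. It is a minimal sketch under stated assumptions: the uniform choice of ψ and all parameter values are illustrative, not defaults prescribed by the paper.

```r
## Hedged sketch: f(z) = pi * f0(z) + (1 - pi) * f1(z), where
## f1(z) = integral over [-1, 1] of N(z | mu + tau * sigma * u, sigma^2) psi(u) du.
## The uniform psi and the parameter values are illustrative assumptions.
two_groups_density <- function(z, mu = 0, sigma = 1, tau = 3, pi0 = 0.9,
                               psi = function(u) rep(0.5, length(u)),  # assumed uniform psi
                               ngrid = 200) {
  u  <- seq(-1, 1, length.out = ngrid)   # quadrature grid on U
  du <- u[2] - u[1]
  f0 <- dnorm(z, mean = mu, sd = sigma)
  f1 <- vapply(z, function(zz)
    sum(dnorm(zz, mean = mu + tau * sigma * u, sd = sigma) * psi(u)) * du,
    numeric(1))
  pi0 * f0 + (1 - pi0) * f1
}
curve(two_groups_density(x), -6, 6)  # the mixture shows visibly heavier tails than f0
```

Plotting this density against dnorm() makes the tail-separation property behind Theorem 1 visible: for any τ > 1 and any density ψ, the non-null component spreads mass away from the null kernel's scale.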
3 Mixture models and predictive recursion
It is more convenient to write our specification of f as the mixture model

f(z) = ∫_U p(z | θ, u) Ψ(du) (3)

with parameters θ = (μ, σ, τ), kernel p(z | θ, u) = N(z | μ + τσu, σ^2), and mixing probability measure Ψ on U that assigns a positive mass π at 0 ∈ U and distributes the remaining mass on U according to a Lebesgue density ψ. The collection of all such Ψ is the set P = P(U, ν) of probability measures that are absolutely continuous with respect to the measure ν defined as the sum of the Lebesgue measure on U and a point mass at 0. The ν-density of such a Ψ will be denoted by π 1{u = 0} + (1 − π) ψ(u).

Inference on (θ, Ψ) can be performed in a Bayesian setting with a prior distribution on (θ, Ψ). A popular choice of prior distribution for the nonparametric probability measure Ψ is the Dirichlet process prior (Ferguson 1973). However, there are two practical difficulties in employing this inference framework for our model. First, the Dirichlet process prior entertains only discrete probability measures, thus violating the important absolute continuity property of Ψ with respect to ν. Second, despite recent advances in computing, fitting a Dirichlet process mixture model does not scale well with the number of observations n. For microarray studies, n ranges from thousands to tens of thousands, whereas for more recent single nucleotide polymorphism studies, n can reach several hundreds of thousands. For such massive data sets, fitting a Dirichlet process mixture model can be fairly time consuming, nullifying some of the advantages of the two-groups framework.

As an alternative, we estimate (θ, Ψ) via the predictive recursion (PR) methodology (Martin and Tokdar 2011; Newton 2002). PR is a stochastic algorithm for estimating the mixing distribution Ψ in (3) through fast, recursive updates that have a strong connection with posterior updates for Dirichlet process mixture models. The algorithm accommodates user-specified absolute continuity constraints on the mixing distribution and enjoys attractive convergence properties under mild conditions, with allowance for model misspecification (Ghosh and Tokdar 2006; Martin and Ghosh 2008; Martin and Tokdar 2009; Tokdar et al. 2009). However, Newton's original proposal can estimate the mixing distribution only when the kernel being mixed is known exactly; i.e., for (3), an estimate of Ψ is available only when θ is known. To resolve this difficulty, Martin and Tokdar (2011) introduce a "marginal likelihood" function for the non-mixing parameters θ based on the output of the predictive recursion.

PR Algorithm.
Start with an initial estimate Ψ_0 with ν-density π_0 1{u = 0} + (1 − π_0) ψ_0(u) and a sequence of weights w_1, ..., w_n ∈ (0, 1). For i = 1, ..., n compute

f_{i−1,θ}(Z_i) = ∫ p(Z_i | θ, u) Ψ_{i−1}(du),
Ψ_i(du) = (1 − w_i) Ψ_{i−1}(du) + w_i p(Z_i | θ, u) Ψ_{i−1}(du) / f_{i−1,θ}(Z_i). (4)

Produce Ψ_n as an estimate of Ψ and L_n(θ) = ∏_{i=1}^n f_{i−1,θ}(Z_i) as a marginal likelihood function for θ.
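The following R sketch implements recursion (4) for the two-groups kernel, tracking the atom's mass and the continuous part of Ψ on a grid. It is illustrative only: the uniform initial ψ_0, the grid quadrature, and the default parameter values are assumptions, and this is not the authors' released implementation.

```r
## A minimal sketch of the PR algorithm (4) for the two-groups model, with
## theta = (mu, sigma, tau) and the atom's initial mass pi0 held fixed.
pr_two_groups <- function(z, mu = 0, sigma = 1, tau = 2, pi0 = 0.9,
                          ngrid = 200, gamma = 0.67) {
  u  <- seq(-1, 1, length.out = ngrid)  # grid on U = [-1, 1]
  du <- u[2] - u[1]
  g  <- rep((1 - pi0) / 2, ngrid)       # continuous part (1 - pi) * psi; assumed uniform psi_0
  loglik <- 0
  for (i in seq_along(z)) {
    w  <- (i + 1)^(-gamma)                           # weights w_i = (i + 1)^(-gamma)
    k0 <- dnorm(z[i], mean = mu, sd = sigma)         # kernel at the null atom u = 0
    ku <- dnorm(z[i], mean = mu + tau * sigma * u, sd = sigma)
    fi <- pi0 * k0 + sum(ku * g) * du                # predictive density f_{i-1,theta}(Z_i)
    loglik <- loglik + log(fi)                       # accumulates log L_n(theta)
    pi0 <- (1 - w) * pi0 + w * pi0 * k0 / fi         # update (4): mass at the atom
    g   <- (1 - w) * g + w * g * ku / fi             # update (4): continuous part, pointwise in u
  }
  list(pi0 = pi0, psi = g / (1 - pi0), u = u, loglik = loglik)
}
```

A single pass costs on the order of n times ngrid kernel evaluations, which is what makes L_n(θ) cheap enough to embed inside an optimizer.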
Martin and Tokdar (2011) point out several justifications for labeling L_n(θ) as a likelihood function of θ. For n = 1, L_1(θ) equals the marginal likelihood function of θ, integrating out Ψ under the Bayesian specification Ψ ∼ DP(α, Ψ_0), the Dirichlet process distribution with precision α = (1 − w_1)/w_1 and base measure Ψ_0. For n > 1, this correspondence is not exact, but L_n(θ) can be viewed as a filtering approximation of the corresponding Dirichlet process marginal likelihood function. Additionally, L_n(θ) features an asymptotic concentration property commonly enjoyed by likelihood functions for independent and identically distributed data models (Wald 1949). Specifically, for large n, with Z_1, ..., Z_n independently drawn from a common density f^⋆, log L_n(θ) ≈ −n K^⋆(θ), where K^⋆(θ) equals the minimum Kullback–Leibler divergence between f^⋆ and densities f of the form (3), with Ψ ranging over the set P and all its weak limit points.

4 Regularized estimation and multiple testing

We employ a regularized version of the predictive recursion methodology to estimate (θ, Ψ) for our two-groups model. The regularization is motivated by a hierarchical Bayes formulation of (3) with Ψ ∼ DP(α, Ψ_0), where hyper-prior distributions are specified on the model parameters μ, σ, τ and Ψ_0. We take the ν-density of Ψ_0 to be π 1{u = 0} + (1 − π) ψ_0(u) with a fixed choice of ψ_0(u) ∝ u^2. Among the remaining parameters, σ ∈ (0, ∞), τ ∈ (1, ∞) and π ∈ (0, 1) are taken to be independent, with log σ given a mean-zero normal prior with small variance, log(τ − 1) ∼ N(0, 1), and π given a beta prior with first shape parameter 22.7, centering it close to one with a relatively long left tail. Given σ and the other parameters, μ is assigned a conditional normal prior, centered at zero with variance proportional to σ^2. Values of σ in a moderate range around 1 are typical of z-score data, and the log-normal prior puts nearly all of its mass there. Other priors for σ may also be considered, such as a conjugate scaled inverse-chi distribution. The restriction τ > 1 ensures that f_1 is considerably wider than f_0, and the normal prior for log(τ − 1) supports a large set of values in this range. The 22.7 in the beta prior for π, also used by Bogdan et al. (2008), assigns about 90% of the prior mass to large values of π, reflecting the belief that π is likely to be large. Finally, the prior for μ is scaled to the choice of σ and highly concentrated around the origin, reflecting the belief that the z-scores should have mean close to zero. Finer tuning of this default prior for specific problems is straightforward.

For a predictive recursion analog of this hierarchical Bayesian model, we interpret the predictive recursion likelihood as a function of both θ = (μ, σ, τ) and π. Writing this likelihood as L_n(μ, σ, τ, π) and letting g(μ, σ, τ, π) denote the joint prior density function on these parameters, a regularized version of the predictive recursion marginal log-likelihood function can be written as

ℓ̃_n(μ, σ, τ, π) = log L_n(μ, σ, τ, π) + log g(μ, σ, τ, π). (5)

Estimates of these parameters are obtained by maximizing ℓ̃_n = ℓ̃_n(μ, σ, τ, π). Once these estimates are obtained, predictive recursion is run one last time with the estimated values of these parameters to produce an estimate of Ψ, i.e., of π and of ψ in (1) and (2), respectively. In our implementation, maximization of ℓ̃_n is done by the gradient-based Broyden–Fletcher–Goldfarb–Shanno (BFGS) optimization method; see the sketch below. In Appendix B we provide a variation on the PR algorithm that produces the gradient of log L_n as a by-product.

The predictive recursion methodology depends on two additional factors, namely, the choice of weights w_1, ..., w_n and the order in which the z-scores are processed by the algorithm. Martin and Tokdar (2009) provide an upper bound on the rate of convergence for PR estimates of the mixture f when the weights are of the form w_i = (i + 1)^{−γ}, γ ∈ (2/3, 1); our choice w_i = (i + 1)^{−0.67} is close to the limit γ = 2/3. The PR estimate also depends on the order in which the Z_i values are visited. We reduce this dependence by replacing ℓ̃_n with its average over a number of random permutations of the data sequence. Averaging over permutations increases the overall computation time, but adds stability to parameter estimation (Tokdar et al. 2009). In our experience, averaging over 10 random permutations is sufficient to stabilize the estimates of θ, and the additional computation time required is negligible. To reduce variability due to random permutation, we keep the set of permutations fixed over the process of maximizing ℓ̃_n.
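A hedged sketch of the maximization step: reparameterize to the unconstrained η = (μ, log σ, log(τ − 1), logit π) of Appendix B and hand the negative of (5) to optim's BFGS routine. It reuses pr_two_groups() from the sketch above; the prior variances, the beta prior's second shape parameter, and the scaling of the prior on μ are placeholders, since the text fixes only some of these values.

```r
## Hedged sketch: maximize the regularized log-likelihood (5) by BFGS.
neg_reg_loglik <- function(eta, z) {
  mu <- eta[1]; sigma <- exp(eta[2]); tau <- 1 + exp(eta[3]); pi0 <- plogis(eta[4])
  fit <- pr_two_groups(z, mu, sigma, tau, pi0)
  logprior <- dnorm(eta[2], 0, 0.5, log = TRUE) +  # placeholder prior on log sigma
    dnorm(eta[3], 0, 1, log = TRUE) +              # log(tau - 1) ~ N(0, 1), as in the text
    dbeta(pi0, 22.7, 2, log = TRUE) +              # 22.7 from the text; second shape a placeholder
    dnorm(mu, 0, sigma / 10, log = TRUE)           # placeholder: mu concentrated near zero
  -(fit$loglik + logprior)
}
## est <- optim(c(0, 0, 0, qlogis(0.9)), neg_reg_loglik, z = z, method = "BFGS")
```

In a faithful implementation one would average fit$loglik over a fixed set of random permutations of z before adding the prior, as described above, and supply the analytic gradient of Appendix B through optim's gr argument rather than relying on numerical differencing.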
For multiple testing, we consider the local false discovery rate (Efron 2004), given by

fdr(z) = π f_0(z) / f(z),

which represents the posterior probability that a case with z-score Z = z is null. Sun and Cai (2007) argue that the local false discovery rate is the fundamental quantity for multiple testing. Once regularized PR estimation of (μ, σ, τ, π, ψ) is completed, a plug-in estimate of fdr is readily available, and PRtest is implemented by thresholding this estimate; that is, we declare case i as non-null if the estimated fdr(Z_i) < r for some specified threshold r ∈ (0, 1). In our examples we take r = 0.1. This choice, used by Sun and Cai (2007), is somewhat subjective, but sits between the choice r = 0.2 of Efron (2004, 2008) and the choice r = 0.05 of Jin and Cai (2007) and others.
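The testing rule itself is a few lines once the PR fit is in hand. The sketch below computes the plug-in local fdr from the output of pr_two_groups() above, with the caveat that (mu, sigma, tau) are assumed arguments that must match the values used in the fit.

```r
## Hedged sketch: plug-in local fdr, fdr(z) = pi * f0(z) / f(z), and the
## thresholding rule that flags case i when the estimated fdr(Z_i) < r.
local_fdr <- function(z, fit, mu = 0, sigma = 1, tau = 2) {
  du <- fit$u[2] - fit$u[1]
  f0 <- dnorm(z, mean = mu, sd = sigma)
  f1 <- vapply(z, function(zz)
    sum(dnorm(zz, mean = mu + tau * sigma * fit$u, sd = sigma) * fit$psi) * du,
    numeric(1))
  fit$pi0 * f0 / (fit$pi0 * f0 + (1 - fit$pi0) * f1)
}
## flagged <- which(local_fdr(z, fit) < 0.1)  # r = 0.1, as in the text
```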
5 Simulations

Here we investigate the performance of PRtest in several large-scale simulations where we can compare the results with the benchmark Bayes oracle test. The results will also be compared to those obtained from the Fourier-based method of Jin and Cai (2007) and the mixfdr method of Muralidharan (2010).

For Z_1, ..., Z_n, we assume independence and take the null density as f_0(z) = N(z | μ, σ^2). Here we fix n = 1000, μ = 0, and σ = 1. Four choices of f_1 are considered:

C1: f_1(z) = N(z | 0, σ^2 + ω^2). Taking ω^2 = 13 ≈ 2σ^2 log n ensures the non-null z-scores are "detectable" (Donoho and Johnstone 1994). But, in our experience, the range of z-scores one finds in real data analysis is consistent with smaller signals, so we take ω^2 = 4.

C2: f_1(z) = 0.5 ∫_2^4 N(z | u, σ^2) du. This choice, used by Muralidharan (2010) and Johnstone and Silverman (2004), exhibits asymmetry and has only slightly heavier tails than the null.

C3: f_1(z) = 0.5 N(z | −2, 2) + 0.5 N(z | 2, 2).

C4: f_1(z) = 0.25 ∫_{[−4,−2] ∪ [2,4]} N(z | u, σ^2) du. This is a symmetrized version of C2. A key feature of this choice is that the unobserved signals are bounded away from zero.

For each of the four choices of f_1, we consider six choices of π ranging from 0.75 to 0.99, forming a total of 24 simulation settings. Each setting is replicated 500 times and the results are reported below. Our implementation of PR uses weights w_i = (i + 1)^{−0.67}, and the regularized likelihood ℓ̃_n is averaged over 10 permutations of the data sequence.
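To make the settings concrete, here is a hedged R sketch that draws one replicate from setting C1 (as reconstructed above, with ω^2 = 4) and pushes it through the earlier sketches; the seed and the helper functions pr_two_groups() and local_fdr() are assumptions carried over from Sections 3 and 4.

```r
## Hedged sketch: one replicate of setting C1 with pi = 0.9.
set.seed(1)                                # arbitrary seed, for reproducibility only
n <- 1000; pi0 <- 0.9
nonnull <- runif(n) > pi0                  # case i is non-null with probability 1 - pi
z <- ifelse(nonnull,
            rnorm(n, mean = 0, sd = sqrt(1 + 4)),  # f1 = N(0, sigma^2 + omega^2), omega^2 = 4
            rnorm(n, mean = 0, sd = 1))            # f0 = N(0, 1)
## fit <- pr_two_groups(z); flagged <- which(local_fdr(z, fit) < 0.1)
```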
Table 1 summarizes the estimates of the null proportion π for each simulation setting. Estimates of (μ, σ) are similarly accurate across methods, models, and sparsity levels, so these results are omitted.

       π      Jin–Cai          mixfdr           PRtest
C1   0.75   0.928 (0.019)   0.957 (0.009)   0.918 (0.017)
     0.80   0.929 (0.019)   0.965 (0.007)   0.930 (0.016)
     0.85   0.934 (0.018)   0.971 (0.006)   0.942 (0.014)
     0.90   0.945 (0.015)   0.980 (0.005)   0.960 (0.014)
     0.95   0.961 (0.011)   0.989 (0.003)   0.980 (0.010)
     0.99   0.978 (0.005)   0.995 (0.001)   0.995 (0.003)
C2   0.75   0.905 (0.015)   0.827 (0.016)   0.761 (0.017)
     0.80   0.874 (0.019)   0.860 (0.012)   0.804 (0.014)
     0.85   0.860 (0.023)   0.894 (0.009)   0.851 (0.013)
     0.90   0.869 (0.028)   0.927 (0.007)   0.896 (0.010)
     0.95   0.926 (0.017)   0.962 (0.005)   0.940 (0.009)
     0.99   0.984 (0.007)   0.991 (0.003)   0.980 (0.008)
C3   0.75   0.909 (0.013)   0.857 (0.017)   0.788 (0.016)
     0.80   0.886 (0.015)   0.881 (0.013)   0.828 (0.015)
     0.85   0.871 (0.021)   0.909 (0.011)   0.867 (0.014)
     0.90   0.886 (0.020)   0.937 (0.008)   0.903 (0.014)
     0.95   0.935 (0.012)   0.967 (0.005)   0.937 (0.013)
     0.99   0.980 (0.004)   0.991 (0.003)   0.982 (0.010)
C4   0.75   0.951 (0.007)   0.886 (0.035)   0.784 (0.066)
     0.80   0.934 (0.010)   0.897 (0.015)   0.814 (0.021)
     0.85   0.920 (0.015)   0.920 (0.010)   0.862 (0.018)
     0.90   0.908 (0.025)   0.948 (0.007)   0.901 (0.013)
     0.95   0.929 (0.017)   0.975 (0.005)   0.943 (0.012)
     0.99   0.980 (0.007)   0.995 (0.002)   0.992 (0.005)

Table 1: Mean (standard deviation) of the 500 estimates of π for the method of Jin and Cai (2007), the mixfdr method of Muralidharan (2010), and PRtest, for the four alternative densities (f_1's) described in Section 5.

From the table we find that the maximum PR marginal likelihood estimates are the most adaptive across the range of π values, specifically for choices C2–C4. Of particular interest is PRtest's strong performance in the two most practically realistic cases, namely C3 and C4, which have smooth non-null densities with modes on both the left and right side of zero. Also, the average computation time for PRtest is roughly 3 seconds, compared to a fraction of a second for Jin–Cai.

Figure 2 plots the false non-discovery rate, false discovery rate, power, and Bayes risk against π for the selected methods and the Bayes oracle procedure; the Bayes oracle is the rule based on thresholding the true fdr at level 0.1. The general message is that PRtest is competitive with the other tests in all aspects across a range of sparsity levels. The four tests are similar in terms of false non-discovery rate, particularly for large π, but PRtest is better than mixfdr and Jin–Cai for relatively small π. Also, each of the four tests has relatively small false discovery rates, although the Jin–Cai method has a somewhat unexpected spike, which explains its higher power for large π values. Theoretically, the Bayes oracle test has the smallest Bayes risk uniformly over π, but the PRtest risk sits very close over the entire range of π. This observation suggests that PRtest may be asymptotically optimal in the sense of Bogdan et al. (2011).

Figure 2: Plots of the false non-discovery rate (FNR, top left), false discovery rate (FDR, top right), power (bottom left), and Bayes risk (bottom right) against π for the selected testing procedures in the C3 simulation setting described in Section 5.
6 Real-data examples

6.1 Golden spike data

An interesting spike-in dataset was built by Choe et al. (2005). The dataset itself is artificial, so the set of interesting genes is known, but its careful construction gives it some features of a real control-versus-treatment microarray study. We consider a subset of this data (available in the R package st) consisting of 11,475 genes, of which 1,331 are differentially expressed. Z-scores are obtained by taking a Gaussian transform of the standard two-sample t-test statistics. Figure 3(a) shows a histogram of the observed z-scores, along with the PRtest fit of the two-groups mixture model. The estimated density clearly fits the data very well, and the fdr thresholding method flags 235 genes as down-regulated.

For comparison, Figure 3(b) reports an oracle fit of the two-groups model, where π is estimated as the known proportion of differentially expressed genes, (μ, σ) are estimated by maximum likelihood based on the null z-scores, and f_1 is estimated by a standard Gaussian kernel estimate based on the non-null z-scores; Table 2 reports the parameter estimates. This oracle procedure is, in some sense, the best fdr thresholding procedure one can hope for, and it flags 249 genes as down-regulated.
Figure 3: Histogram of the z-scores for the golden-spike data in Section 6.1, along with fits of the two-groups model using (a) PRtest and (b) the Oracle described in the text. In each plot, overlays are π f_0 (solid black line), (1 − π) f_1 (dashed black line), and f (solid gray line). The estimated fdr and the 0.1 threshold are shown on the negative scale. Numerical values on the top left and right indicate the number of genes flagged as down- and up-regulated, respectively, by the fdr thresholding rule: (a) 235 and 0; (b) 249 and 0.

                              Number of genes
  Method     μ      σ      π    Left   Right    FDR    FNR
  Efron     0.33   1.50   0.99     2      0      0%    12%
  Jin–Cai   0.77   1.45   0.91   306      0      3%     9%
  mixfdr    0.28   1.45   0.97     8      0      0%    12%
  PRtest    0.42   1.34   0.88   235      0      2%    10%
  Oracle                         249      0

Table 2: Parameter estimates and gene classifications for the golden spike data in Section 6.1; Left and Right count genes flagged as down- and up-regulated, respectively.

For further comparison, we applied the methods of Efron, Jin and Cai, and Muralidharan, and the results are summarized in the top panel of Table 2. PRtest and the oracle perform similarly in every respect, while the other methods are substantially different. Only the Jin–Cai method is able to pick out a reasonable set of interesting genes, a bit larger than the sets identified by the oracle and PRtest. However, these additional discoveries result in a 50% increase in false discovery rate.

6.2 Leukemia and breast cancer studies

We applied PRtest, along with the methods of Efron, Jin and Cai, and Muralidharan, to the two microarray gene expression datasets mentioned in Section 1: the leukemia study by Golub et al. (1999) and the hereditary breast cancer study by Hedenfalk et al. (2001). The parameter estimates and gene classifications are summarized in Table 3. In both datasets, PRtest estimates π to be relatively small and identifies a number of interesting genes, while the others identify none; see Figure 4. PRtest's findings in these two datasets are corroborated by the results of Lee et al. (2003), who learn a treatment classifier from gene expression levels and validate it by accurately classifying samples from an independent test set. That is, the set of interesting genes identified by PRtest substantially overlaps with the set of genes Lee et al. (2003) flag as important constituents of their classifier; these are also displayed in Figure 4. For the breast cancer study, some of the genes identified by PRtest and Lee et al. (2003), such as keratin 8, TOB 1, and phosphofructokinase platelet, have known biological connections to breast cancer mutations (Lee et al. 2003, p. 93). The fact that the gene expression levels lead to a well-validated classifier suggests that some genes must be differentially expressed. In this light, it is surprising that the methods of Efron, Jin and Cai, and Muralidharan fail to identify a single interesting gene.

                                 Number of genes
              Method      μ       σ      π     Left   Right
  Leukemia    Efron      0.57    1.18   0.88    276      0
              Jin–Cai    0.95    1.30   0.91    291      0
              mixfdr     0.56    1.35   0.96     71      0
              PRtest     0.23    1.04   0.63    333    226
  BRCA        Efron     −0.33    1.45   1.00      0      0
              Jin–Cai   −0.42    1.44   1.00      0      0
              mixfdr    −0.31    1.38   0.99      0      0
              PRtest    −0.01    1.04   0.45    231     44

Table 3: Results for the two real microarray datasets considered in Section 6.2.

Figure 4: PRtest's fit to z-scores from the leukemia and breast cancer microarray data, panels (a) Leukemia and (b) Breast cancer. Overlaid on the z-score histogram are the estimates of π f_0 (solid, black), (1 − π) f_1 (dashed, black), and f = π f_0 + (1 − π) f_1 (solid, gray). The estimated fdr curve is shown on the negative scale, with the cut-off of 0.1 marked by the horizontal gray dashed line. The 27 genes identified by Lee et al. (2003) in each dataset are marked with a blue bar at their z-scores, with the height of the bar indicating the posterior probability of being included in the classification model. For the leukemia data, red dots are placed on the z-scores of the 50 genes identified in the original study by Golub et al. (1999).
7 Discussion

This paper provides a new and identifiable semiparametric formulation of the two-groups model and a computationally efficient algorithm to estimate the model parameters. This naturally leads to a nonparametric empirical Bayes multiple testing rule based on thresholding the estimated local false discovery rate. In simulations we find that PRtest is comparable to existing methods, including the Bayes oracle. What is particularly interesting is that the PRtest results differ substantially from those of existing methods in the examples of Section 6, and we argue that our findings are, in fact, more believable.

We have chosen to focus only on the case where the null z-scores are normally distributed, though the theory and methods presented here work for other well-behaved parametric families. Normality of null z-scores is indeed a strong structural assumption, but identification of the null from the non-null requires strong parametric shape restrictions on one of the two components.
Assuming a normal null component is natural because, theoretically, the null z-scores should have a standard normal distribution. This is similar to p-value-based methods where the null p-values are assumed to be uniform. A purely statistical verification of this kind of assumption seems quite challenging. One could possibly gain insight on this issue through biological experiments consisting entirely of null cases.

We have justified the continuous location mixture formulation of f_1 in (2) on two grounds: first, it makes the model parameters identifiable and, second, it conforms to the accepted notion that the alternative is more likely than the null to produce z-scores of large magnitude. This latter property is also satisfied by a discrete mixture f_1 = Σ_{j=1}^J π_j N(μ + τσu_j, σ^2), for which the identifiability condition does not hold. But with the regularization to encourage selection of f_0 centered near zero, and the ability of a flexible continuous mixture to approximate a discrete one, PRtest might still perform well in this difficult situation. Our limited simulations seem to indicate that this is true. The case where f_1 is not wider than f_0 also yields a coherent statistical simulation model, but we argue that it corresponds to a biologically untenable abstraction. Indeed, the multiple testing framework accepts the z-scores as scores whose magnitudes (possibly after a small shift of origin) give an ordering of how interesting the cases are relative to each other. The question is to decide how interesting a case must be in order to be labeled as non-null. Accepting the relative ordering is equivalent to accepting that f_1 must be wider than f_0.
R software to implement the proposed PRtest methodology can be found at S. Tokdar's website.
The authors are grateful to the Editor, Associate Editor, and two anonymous referees for their insightful comments and suggestions, and to Professor J. K. Ghosh for helpful discussions. A portion of this work was completed while R. Martin was with the Department of Mathematical Sciences, Indiana University–Purdue University Indianapolis.
A Proof of Theorem 1
Here we prove a more general version of Theorem 1 in the main text. Let p(z) be a probability density function on R, symmetric about zero. Furthermore, assume p is supersmooth in the sense of Fan (1991); see (9) below. In the main text we took p(z) to be the N(0, 1) density but, e.g., a Student-t kernel with known degrees of freedom would also satisfy these conditions.

For the particular choice of p, define the density-valued mapping

M(μ, σ, τ, π, ψ)(z) = π σ^{-1} p((z − μ)/σ) + (1 − π) ∫ σ^{-1} p((z − μ − τσu)/σ) ψ(u) du.

To prove that (μ, σ, τ, π, ψ) are identifiable, we need to show that M is a one-to-one function. Therefore, we start by assuming M(μ_1, σ_1, τ_1, π_1, ψ_1) = M(μ_2, σ_2, τ_2, π_2, ψ_2). Let p^*(t) and ψ^*_k(t) denote the characteristic functions of p(z) and ψ_k(z), respectively, for k = 1, 2. Taking Fourier transforms on both sides of the assumed equality, we must have

exp(itμ_1) p^*(σ_1 t) {π_1 + (1 − π_1) ψ^*_1(σ_1 τ_1 t)} = exp(itμ_2) p^*(σ_2 t) {π_2 + (1 − π_2) ψ^*_2(σ_2 τ_2 t)} (6)

for every t ∈ R, where i is the imaginary unit. Recall that the Riemann–Lebesgue lemma (e.g., Billingsley 1995, Theorem 26.1) says that, for k = 1, 2,

ψ^*_k(t) → 0 as t → ±∞. (7)

Now, suppose σ_2 > σ_1 and assume, without loss of generality, that μ_1 > 0. Choose a sequence {t_s} ⊂ R such that t_s → ∞ and exp(it_s μ_1) ≡ 1 (e.g., t_s = 2πs/μ_1). Then, for large enough s, (7) implies that π_1 + (1 − π_1) ψ^*_1(σ_1 τ_1 t_s) ≠ 0. On rearranging the terms in (6) we get

p^*(σ_1 t_s) / p^*(σ_2 t_s) = exp(it_s μ_2) {π_2 + (1 − π_2) ψ^*_2(σ_2 τ_2 t_s)} / {π_1 + (1 − π_1) ψ^*_1(σ_1 τ_1 t_s)}. (8)

We have assumed that p is supersmooth (Fan 1991), which means that

d_0 |t|^{β_0} exp{−|t|^β / γ} ≤ |p^*(t)| ≤ d_1 |t|^{β_1} exp{−|t|^β / γ} (9)

for all t and for some positive constants d_0, d_1, β_0, β_1, β, and γ. Under this assumption, the modulus of the left-hand side of (8) satisfies

|p^*(σ_1 t_s) / p^*(σ_2 t_s)| ≥ const × |t_s|^{β_0 − β_1} exp{|t_s|^β (σ_2^β − σ_1^β) / γ}.

Since σ_2 > σ_1, the left-hand side of (8) is therefore unbounded as s → ∞, while, by (7), the right-hand side is bounded. This is a contradiction, so we must have σ_2 ≤ σ_1. By symmetry, it follows that σ_1 = σ_2. With this equality, relation (6) easily leads to the equalities μ_1 = μ_2, τ_1 = τ_2, π_1 = π_2 and ψ_1 = ψ_2, completing the proof.
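For completeness, the Fourier-transform step behind (6) is a one-line computation, written out below in LaTeX; it only uses the shift and scale rules for characteristic functions and restates the argument above rather than adding to it.

```latex
\[
\int e^{itz}\,\sigma^{-1}\,p\!\left(\tfrac{z-\mu-\tau\sigma u}{\sigma}\right)dz
   = e^{it(\mu+\tau\sigma u)}\,p^{*}(\sigma t),
\]
\[
\therefore\quad
\int e^{itz}\,M(\mu,\sigma,\tau,\pi,\psi)(z)\,dz
   = e^{it\mu}\,p^{*}(\sigma t)\,
     \bigl\{\pi + (1-\pi)\,\psi^{*}(\sigma\tau t)\bigr\}.
\]
```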
B Gradient of the log PR marginal likelihood

This section provides a variation on the predictive recursion (PR) algorithm that produces the gradient of the log PR marginal likelihood function, based on the development in Martin and Tokdar (2011). The model under consideration here is the following:

f(z) = π N(z | μ, σ^2) + (1 − π) ∫ N(z | μ + τσu, σ^2) ψ(u) du,

where ψ is an unknown mixing density supported on [−1, 1]. The log PR marginal likelihood is ℓ_n(θ) = Σ_{i=1}^n log f_{i−1,θ}(Z_i), where f_{k,θ}(z) is the PR estimate of the mixture density based on Z_1, ..., Z_k and θ = (μ, σ, τ, π), slightly different than in the main text.

Define an unconstrained version of θ, i.e., η = (μ, log σ, log(τ − 1), logit π), where logit x = log{x / (1 − x)}. In what follows, ∇ will denote a gradient with respect to η and, if g is a function of a variable u, then ∇g(u) denotes the gradient with respect to η, pointwise in u. The following algorithm shows how to compute λ_i = f_{i−1,θ(η)}(Z_i) and ∇ log λ_i for i = 1, ..., n.

1. Start with user-specified π_0 and ψ_0, and set ∇π_0 = (0, 0, 0, π_0(1 − π_0)) and ∇ψ_0(u) ≡ (0, 0, 0, 0).
2. For i = 1, ..., n, repeat the following three steps:

(a) For the normal kernel p(z | θ, u) = N(z | μ + στu, σ^2), set

G_0 = N(Z_i | μ, σ^2) and G(u) = N(Z_i | μ + στu, σ^2),

and analytically evaluate the gradients ∇G_0 and ∇G(u):

∇G_0 = (z_0/σ, z_0^2 − 1, 0, 0) · G_0,
∇G(u) = (z(u)/σ, z(u)^2 + τu z(u) − 1, u(τ − 1) z(u), 0) · G(u),

where z_0 = (Z_i − μ)/σ and z(u) = (Z_i − μ − στu)/σ.

(b) Compute

h_i = ∫ G(u) ψ_{i−1}(u) du,
λ_i = π_{i−1} G_0 + (1 − π_{i−1}) h_i,
∇ log h_i = (1/h_i) ∫ {G(u) ∇ψ_{i−1}(u) + ∇G(u) ψ_{i−1}(u)} du,
∇ log λ_i = [∇π_{i−1} G_0 + π_{i−1} ∇G_0 + h_i {(1 − π_{i−1}) ∇ log h_i − ∇π_{i−1}}] / λ_i.

(c) Update

π_i = A_0 π_{i−1},
∇π_i = A_0 ∇π_{i−1} + ∇A_0 π_{i−1},
ψ_i(u) = B A(u) ψ_{i−1}(u),
∇ψ_i(u) = {∇B A(u) + B ∇A(u)} ψ_{i−1}(u) + B A(u) ∇ψ_{i−1}(u),

where

A_0 = 1 + w_i (G_0/λ_i − 1), A(u) = 1 + w_i (G(u)/λ_i − 1), B = (1 − π_{i−1}) / (1 − A_0 π_{i−1}),

and

∇A_0 = w_i {∇G_0 − G_0 ∇ log λ_i} / λ_i,
∇A(u) = w_i {∇G(u) − G(u) ∇ log λ_i} / λ_i,
∇B = {(B A_0 − 1) ∇π_{i−1} + B ∇A_0 π_{i−1}} / (1 − A_0 π_{i−1}).
3. Return the log-likelihood Σ_{i=1}^n log λ_i and its gradient Σ_{i=1}^n ∇ log λ_i.
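As a sanity check on step 2(a), the sketch below compares the analytic gradient of G_0 with a finite-difference approximation in R; the test point and step size are arbitrary assumptions.

```r
## Hedged sketch: finite-difference check of grad G0 = (z0/sigma, z0^2 - 1, 0, 0) * G0,
## where eta = (mu, log sigma, log(tau - 1), logit pi) and only the first two
## coordinates of eta enter G0 = N(z | mu, sigma^2).
G0 <- function(eta, z) dnorm(z, mean = eta[1], sd = exp(eta[2]))
grad_G0 <- function(eta, z) {
  sigma <- exp(eta[2]); z0 <- (z - eta[1]) / sigma
  c(z0 / sigma, z0^2 - 1, 0, 0) * G0(eta, z)
}
eta <- c(0.2, log(1.3), log(2 - 1), qlogis(0.9)); z <- 1.5
fd <- sapply(1:4, function(j) {
  h <- 1e-6
  (G0(replace(eta, j, eta[j] + h), z) - G0(eta, z)) / h
})
max(abs(fd - grad_G0(eta, z)))  # should be of order 1e-6 or smaller
```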
References

Benjamini, Y. and Hochberg, Y. (1995), "Controlling the false discovery rate: a practical and powerful approach to multiple testing," J. Roy. Statist. Soc. Ser. B, 57, 289–300.

Billingsley, P. (1995), Probability and Measure, 3rd ed., New York: John Wiley & Sons.

Bogdan, M., Chakrabarti, A., Frommlet, F., and Ghosh, J. K. (2011), "Asymptotic Bayes-optimality under sparsity of some multiple testing procedures," Ann. Statist., 39, 1551–1579.

Bogdan, M., Ghosh, J. K., and Tokdar, S. T. (2008), "A comparison of the Benjamini–Hochberg procedure with some Bayesian rules for multiple testing," in Beyond Parametrics in Interdisciplinary Research: Festschrift in Honor of Professor Pranab K. Sen, eds. Balakrishnan, N., Peña, E., and Silvapulle, M., Beachwood, OH: IMS, pp. 211–230.

Choe, S. E., Boutros, M., Michelson, A. M., Church, G. M., and Halfon, M. S. (2005), "Preferred analysis methods for Affymetrix GeneChips revealed by a wholly defined control dataset," Genome Biol., 6, R16.

Donoho, D. L. and Johnstone, I. M. (1994), "Minimax risk over l_p-balls for l_q-error," Probab. Theory Related Fields, 99, 277–303.

Dudoit, S. and van der Laan, M. J. (2008), Multiple Testing Procedures with Applications to Genomics, New York: Springer.

Efron, B. (2004), "Large-scale simultaneous hypothesis testing: the choice of a null hypothesis," J. Amer. Statist. Assoc., 99, 96–104.

Efron, B. (2007), "Correlation and large-scale simultaneous significance testing," J. Amer. Statist. Assoc., 102, 93–103.

Efron, B. (2008), "Microarrays, empirical Bayes and the two-groups model," Statist. Sci., 23, 1–22.

Efron, B. and Tibshirani, R. (2002), "Empirical Bayes methods and false discovery rates for microarrays," Genet. Epidemiol., 23, 70–86.

Fan, J. (1991), "On the optimal rates of convergence for nonparametric deconvolution problems," Ann. Statist., 19, 1257–1272.

Ferguson, T. S. (1973), "A Bayesian analysis of some nonparametric problems," Ann. Statist., 1, 209–230.

Ghosh, D. (2009), "Assessing significance of peptide spectrum matches in proteomics: a multiple testing approach," Statistics in Biosciences, 1, 199–213.

Ghosh, J. K. and Tokdar, S. T. (2006), "Convergence and consistency of Newton's algorithm for estimating mixing distribution," in Frontiers in Statistics, eds. Fan, J. and Koul, H., London: Imp. Coll. Press, pp. 429–443.

Golub, T. R., Slonim, D. K., Tamayo, P., Huard, C., Gaasenbeek, M., Mesirov, J. P., Coller, H., Loh, M. L., Downing, J. R., Caligiuri, M. A., Bloomfield, C. D., and Lander, E. S. (1999), "Molecular classification of cancer: class discovery and class prediction by gene expression monitoring," Science, 286, 531–537.

Hedenfalk, I., Duggan, D., Chen, Y., Radmacher, M., Bittner, M., Simon, R., Meltzer, P., Gusterson, B., Esteller, M., Kallioniemi, O., Wilfond, B., Borg, A., Trent, J., Raffeld, M., Yakhini, Z., Ben-Dor, A., Dougherty, E., Kononen, J., Bubendorf, L., Fehrle, W., Pittaluga, S., Gruvberger, S., Loman, N., Johannsson, O., Olsson, H., and Sauter, G. (2001), "Gene-expression profiles in hereditary breast cancer," N. Engl. J. Med., 344, 539–548.

Jin, J. and Cai, T. T. (2007), "Estimating the null and the proportion of non-null effects in large-scale multiple comparisons," J. Amer. Statist. Assoc., 102, 495–506.

Johnstone, I. M. and Silverman, B. W. (2004), "Needles and straw in haystacks: empirical Bayes estimates of possibly sparse sequences," Ann. Statist., 32, 1594–1649.

Lee, K. E., Sha, N., Dougherty, E. R., Vannucci, M., and Mallick, B. K. (2003), "Gene selection: a Bayesian variable selection approach," Bioinformatics, 19, 90–97.

Liang, C.-L., Rice, J. A., de Pater, I., Alcock, C., Axelrod, T., Wang, A., and Marshall, S. (2004), "Statistical methods for detecting stellar occultations by Kuiper belt objects: the Taiwanese-American occultation survey," Statist. Sci., 19, 265–274.

Lindquist, M. A. (2008), "The statistical analysis of fMRI data," Statist. Sci., 23, 439–464.

Martin, R. and Ghosh, J. K. (2008), "Stochastic approximation and Newton's estimate of a mixing distribution," Statist. Sci., 23, 365–382.

Martin, R. and Tokdar, S. T. (2009), "Asymptotic properties of predictive recursion: robustness and rate of convergence," Electron. J. Stat., 3, 1455–1472.

Martin, R. and Tokdar, S. T. (2011), "Semiparametric inference in mixture models with predictive recursion marginal likelihood," Biometrika, 98, 567–582.

Miller, C. J., Genovese, C., Nichol, R. C., Wasserman, L., Connolly, A., Reichart, D., and Hopkins, A. (2001), "Controlling the false-discovery rate in astrophysical data analysis," Astron. J., 122, 3492–3505.

Muralidharan, O. (2010), "An empirical Bayes mixture method for effect size and false discovery rate estimation," Ann. Appl. Stat., 4, 422–438.

Newton, M. A. (2002), "On a nonparametric recursive estimator of the mixing distribution," Sankhyā Ser. A, 64, 306–322.

Newton, M. A., Quintana, F. A., and Zhang, Y. (1998), "Nonparametric Bayes methods using predictive updating," in Practical Nonparametric and Semiparametric Bayesian Statistics, eds. Dey, D., Müller, P., and Sinha, D., Lecture Notes in Statist., vol. 133, New York: Springer, pp. 45–61.

Schäfer, J. and Strimmer, K. (2005), "An empirical Bayes approach to inferring large-scale gene association networks," Bioinformatics, 21, 754–765.

Schwartzman, A., Dougherty, R. F., and Taylor, J. E. (2008), "False discovery rate analysis of brain diffusion direction maps," Ann. Appl. Stat., 2, 153–175.

Storey, J. D. (2003), "The positive false discovery rate: a Bayesian interpretation and the q-value," Ann. Statist., 31, 2013–2035.

Strimmer, K. (2008), "A unified approach to false discovery rate estimation," BMC Bioinformatics, 9, 303.

Sun, W. and Cai, T. T. (2007), "Oracle and adaptive compound decision rules for false discovery rate control," J. Amer. Statist. Assoc., 102, 901–912.

Tokdar, S. T., Martin, R., and Ghosh, J. K. (2009), "Consistency of a recursive estimate of mixing distributions," Ann. Statist., 37, 2502–2522.

Wald, A. (1949), "Note on the consistency of the maximum likelihood estimate," Ann. Math. Statist., 20, 595–601.