Invidious Comparisons: Ranking and Selection as Compound Decisions
JIAYING GU AND ROGER KOENKER
Abstract.
There is an innate human tendency, one might call it the "league table mentality," to construct rankings. Schools, hospitals, sports teams, movies, and myriad other objects are ranked even though their inherent multi-dimensionality would suggest that – at best – only partial orderings were possible. We consider a large class of elementary ranking problems in which we observe noisy, scalar measurements of merit for n objects of potentially heterogeneous precision and are asked to select a group of the objects that are "most meritorious." The problem is naturally formulated in the compound decision framework of Robbins's (1956) empirical Bayes theory, but it also exhibits close connections to the recent literature on multiple testing. The nonparametric maximum likelihood estimator for mixture models (Kiefer and Wolfowitz (1956)) is employed to construct optimal ranking and selection rules. Performance of the rules is evaluated in simulations and an application to ranking U.S. kidney dialysis centers.

1. Introduction
In the wake of Wald's seminal monograph on statistical decision theory there was a growing awareness that the Neyman-Pearson testing apparatus was inadequate for many important statistical tasks. Ranking and selection problems featured prominently in this perception. Motivated by a suggestion of Harold Hotelling, Bahadur (1950) studied selection of the best of several Gaussian populations. Assuming that sample means were observed for each of K populations with means θ_k and common variance, the problem of selecting the best population, θ*, was formulated as choosing weights z_1, ..., z_K to minimize,

L(θ, z) = θ* − ∑_{k=1}^K z_k θ_k / ∑_{k=1}^K z_k.

Bahadur showed that it was minimax to select only the population with the largest sample mean, that is to choose z*_i = 1 if X̄_i = max{X̄_1, ..., X̄_K} and z*_i = 0 otherwise, thereby clearly demonstrating the deficiency of procedures that did preliminary tests of equality of means and then chose z_i > 0 for several populations. Subsequent work sought to optimize the number of selected populations as well as their identities; see Gupta and Panchapakesan (1979) and Bechhofer, Kiefer, and Sobel (1968) for extensive reviews of subsequent developments. Goel and Rubin (1977) pioneered the hierarchical Bayesian approach to selection that has been adopted by numerous authors in the ensuing decades, early on by Berger and Deely (1988) and Laird and Louis (1989).

Version: December 24, 2020. This paper was presented as the Walras-Bowley Lecture at the 2020 World Congress of the Econometric Society, and is dedicated to the memory of Larry Brown who introduced us to empirical Bayes methods. We thank Michael Gilraine, Keisuke Hirano, Robert McMillan, Stanislav Volgushev and Sihai Dave Zhao for useful discussions. Jiaying Gu acknowledges financial support from Social Sciences and Humanities Research Council of Canada.
Portnoy (1982) showed that rankings based on best linear predictors were optimal in Gaussian multivariate variance components models, but cautioned that departures from normality could easily disrupt this optimality. A notable feature of the hierarchical model paradigm is the recognition that sample observations may exhibit heterogeneous precision; this is typically accounted for by assuming known variances for observed sample means. As ranking and selection methods became increasingly relevant in genomic applications there has been renewed interest in loss functions and linkages to the burgeoning literature on multiple testing. Our perspective is informed by recent developments in the nonparametric estimation of mixture models and its relevance for a variety of compound decision problems. This approach seeks to reduce the reliance on Gaussian distributional assumptions that pervades the earlier literature. As we have argued elsewhere, Gu and Koenker (2016a), and Koenker and Gu (2019), nonparametric empirical Bayes methods offer powerful complementary methods to more conventional parametric hierarchical Bayes for multiple testing and compound decision problems. Our primary objective in this paper is to elaborate this assertion for ranking and selection applications. Throughout we try to draw parallels and contrasts with the literature on multiple testing. We will restrict our attention to settings where we observe a scalar estimate of an unobserved latent quality measure accompanied by some measure of its precision, thereby evading more complex multivariate settings, as in Boyd, Cortes, Mohri, and Radovanovic (2012) who employ quantile regression methods.

An important motivation for revived interest in ranking and selection problems in econometrics has been the influential work of Chetty and his collaborators on teacher evaluation and geographic mobility in the U.S.
This has stimulated the important recent work of Mogstad, Romano, Shaikh, and Wilhelm (2020) proposing new resampling methods for constructing confidence sets for ranking and selection for a finite population. In contrast to this inferential approach we focus instead on the complementary perspective of compound decision making, constructing decision rules for selecting the best, or worst, populations subject to control of the expected number of elements selected and, among those selected, the expected proportion of false discoveries.

Before proceeding it is important to acknowledge that despite its universal appeal and application there is something inherently futile about many ranking and selection problems, as intimated by our title. Suppose our latent measure of true quality is Gaussian, as assumed in virtually all of the econometric applications of the selection problem, and we wish to select the top ten percent of individuals given that their true quality is contaminated by Gaussian noise. We will see that conventional linear shrinkage as embodied in the classical James-Stein formula can improve performance considerably over naive maximum likelihood (fixed effects) procedures, and some further improvement is possible by carefully tailoring the decision rules for tail probability loss; however, we find that even oracle decision rules that incorporate complete knowledge of the precise distributional features of the problem cannot achieve better than about even odds that selected individuals have latent ability above the selection thresholds.
When the latent distribution of ability is heavier tailed then selection becomes somewhat easier, and more refined selection rules are more advantageous, but as we will show the selection problem still remains quite challenging. Thus, a secondary objective of the paper is to add another cautionary voice to those who have already questioned the reliability of existing ranking and selection methods. A critical overview of the role of ranking and selection in public policy applications is provided by Goldstein and Spiegelhalter (1996). It is widely acknowledged that league tables as currently employed can be a pernicious influence on policy, a viewpoint underscored in Gelman and Price (1999). While much of this criticism can be attributed to inadequate data collection, we believe that there is also room for methodological improvements.

Section 2 provides a brief overview of compound decision theory and describes nonparametric methods for estimation of Gaussian mixture models. Section 3 introduces a basic framework for our approach to ranking and selection in a setting with homogeneous precision of the observed measurements. In Section 4 we introduce heterogeneous precision of known form, and Section 5 considers settings in which the joint distribution of the observed measurements and their precision determines the form of the ranking and selection rules. Optimal ranking and selection rules are derived in each of these sections under the assumption that the form of the mixing distribution of the unobserved, latent quality of the observations is known. Section 6 introduces feasible ranking and selection rules and conditions under which they attain the same asymptotic performance as the optimal rules. Section 7 then compares several feasible ranking and selection methods, some that ignore the compound decision structure of the problem, some that employ parametric empirical Bayes methods, and some that rely on nonparametric empirical Bayes methods.
Finally, Section 8 describes an empirical application on evaluating the performance of medical dialysis centers in the United States. Proofs of all formal results are collected in Appendix A.

2. The Compound Decision Framework
Robbins (1951) posed a challenge to the nascent minimax decision theory of Wald (1950): Suppose we observe independent Gaussian realizations, Y_i ∼ N(θ_i, 1), i = 1, ..., n, with means θ_i taking either the value +1 or −1. We are asked to estimate the n-vector θ = (θ_1, ..., θ_n) subject to mean absolute error loss,

L(θ̂, θ) = n^{−1} ∑_{i=1}^n |θ̂_i − θ_i|.

When n = 1 Robbins shows that the minimax decision rule is δ(y) = sgn(y); in the least favorable variant of the problem malevolent nature chooses θ_i = ±1 with equal probability, so it is best to guess θ_i = +1 when Y_i is positive, and θ_i = −1 otherwise. When n > 1, this rule remains minimax; each coordinate is treated independently as if viewed in complete isolation. This is also the maximum likelihood estimator, and may be viewed in econometric terms as a classical fixed-effects estimator. But is it at all reasonable?
Doesn't our sample convey information about the relative frequency of ±1's? If we knew p = P(θ_i = 1), then the conditional probability that θ = 1 given Y_i = y is given by,

P(θ = 1 | y) = pφ(y − 1) / [pφ(y − 1) + (1 − p)φ(y + 1)].

We should guess θ̂_i = 1 if this probability exceeds 1/2, giving us the revised decision rule,

δ_p(y) = sgn(y − ½ log((1 − p)/p)).

Each observed Y_i is modified by a simple logistic perturbation before computing the sign. Our observed random sample, y = (y_1, ..., y_n), is informative about p; we have the log likelihood,

ℓ_n(p | y) = ∑_{i=1}^n log(pφ(y_i − 1) + (1 − p)φ(y_i + 1)),

which could be augmented by a prior of some form, if desired, to obtain a posterior mean for p and a plug-in Bayes rule for estimating each of the θ_i's. The Bayes risk of this procedure is substantially less than the minimax risk when p ≠ 1/2, and equal to it when p = 1/2. This is the first principle of compound decision theory: borrowing strength across an entire ensemble of related decision problems yields improved collective performance.

What happens when we relax the restriction on the support of the θ's and allow support on the whole real line? We now have a general Gaussian mixture setting where the observed Y_i's have marginal density given by the convolution, f = φ ∗ G, that is,

f(y) = ∫ φ(y − θ) dG(θ),

and instead of merely needing to estimate one probability we need an estimate of an entire distribution function, G. Kiefer and Wolfowitz (1956), anticipated by an abstract of Robbins (1950), established that the nonparametric maximum likelihood estimator (NPMLE),

Ĝ = argmin_{G ∈ 𝒢} {−∑_{i=1}^n log f(y_i) : f(y_i) = ∫ φ(y_i − θ) dG(θ)},

where 𝒢 is the space of probability measures on ℝ, is a consistent estimator of G. This is an infinite dimensional convex optimization problem with a strictly convex objective subject to linear constraints. See Lindsay (1995) and Koenker and Mizera (2014) for further details on the geometry and computational aspects of the NPMLE problem. Solutions are atomic as a consequence of the Carathéodory theorem, but until quite recently little was known about the growth rate of the number of atoms characterizing the solutions. Polyanskiy and Wu (2020) have established that for G with sub-Gaussian tails the cardinality of the support of Ĝ, i.e. the number of atoms, grows like O(log n). In this respect the NPMLE shares a property of several shape-constrained density estimators. Polyanskiy and Wu call such procedures self-regularizing: without any further penalization, the maximum likelihood estimator in such cases automatically selects a highly parsimonious estimator.
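To make the construction concrete, the following is a minimal sketch that approximates the NPMLE by fixing a grid of candidate atoms for G and iterating the standard EM fixed-point update on the mixture weights. This is not the interior point approach of Koenker and Mizera (2014); the grid, sample size, and two-point mixing distribution below are illustrative assumptions.

```python
import numpy as np
from scipy.stats import norm

def npmle_em(y, grid, n_iter=500):
    """Approximate the Kiefer-Wolfowitz NPMLE of G on a fixed grid of
    candidate atoms by EM fixed-point updates on the mixture weights."""
    A = norm.pdf(y[:, None] - grid[None, :])  # likelihood matrix phi(y_i - theta_j)
    w = np.full(len(grid), 1.0 / len(grid))   # uniform starting weights
    for _ in range(n_iter):
        f = A @ w                             # fitted marginal density at each y_i
        w *= (A / f[:, None]).mean(axis=0)    # EM weight update (mass-preserving)
    return w

rng = np.random.default_rng(0)
theta = rng.choice([-1.0, 1.0], size=2000)    # two-point mixing distribution
y = theta + rng.standard_normal(2000)         # Y_i = theta_i + N(0,1) noise
grid = np.linspace(-4, 4, 161)
w = npmle_em(y, grid)
# The weights concentrate on a few clusters of grid points as iterations proceed.
print(np.round(grid[np.argsort(w)[-5:]], 2))
```

Each EM step can only increase the mixture log-likelihood, so the iteration inherits the self-regularizing behavior discussed above: no tuning parameter beyond the (harmless) grid resolution is needed.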
Thus, in sharp contrast to the well-known difficulties with maximum likelihood for finite dimensional mixture models, or with Gaussian deconvolution employing Fourier methods, there are no awkward problems of tuning parameter selection for the NPMLE.

Having seen that the upper bound on the complexity of the NPMLE Ĝ was only O(log n), one might begin to wonder whether O(log n) mixtures are "complex enough" to adequately represent the process that generated our observed data. Polyanskiy and Wu (2020) also address this concern: they note that for any sub-Gaussian G, there exists a discrete distribution, G_k, with k = O(log n) atoms, such that for f_k = φ ∗ G_k, the total variation distance, TV(f, f_k) = o(1/n), and consequently there is no statistical justification for considering estimators of G whose complexity grows more rapidly than O(log n). This observation is related to the recent literature on generative adversarial networks, e.g. Athey, Imbens, Metzger, and Munro (2019), that targets models and estimators that, when simulated, successfully mimic observed data. This viewpoint also underlies the innovative monograph of Davies (2014).

Other nonparametric maximum likelihood estimators for G are potentially also of interest. Efron (2016, 2019) has proposed an elegant log-spline sieve approach that yields a smooth estimate of G; this has advantages especially from an inferential perspective, at the cost of reintroducing the task of selecting tuning parameters. An early proposal of Laird and Louis (1991) merged parametric empirical Bayes estimation of G with an EM step that pulled the parametric estimate back toward the NPMLE.

Given an estimate, Ĝ, it is straightforward to compute posterior distributions for each of the sample observations, or for that matter, for out-of-sample observations. In effect, we have estimated the prior, as in Robbins's (1951) binary means problem, but we have ignored the variability of Ĝ when we adopt plug-in procedures that use it.
This may account for the improved performance of smoothed estimates of G in certain inferential problems, as conjectured in Koenker (2020). In the sequel we will compare ranking and selection procedures based on various functionals of these posterior distributions. A leading example is the posterior mean, but ranking and selection problems suggest other functionals of potential interest.

If we are asked to estimate the θ_i's subject to quadratic loss, and assuming standard Gaussian noise, the Bayes rule is given by the posterior mean,

(2.1) δ(y) = E(θ | y) = y + f′(y)/f(y).

Efron (2011) refers to this as Tweedie's formula; it appears in Robbins (1956) credited to M.C.K. Tweedie. Appendix A of Gu and Koenker (2016b) provides an elementary derivation. The nonlinear shrinkage term takes a particularly simple affine form when G happens to be Gaussian, since in this case f is itself also Gaussian and the formula reduces to well-known linear shrinkage variants of classical Stein rules.

A striking feature of the Tweedie formula is that the dependence on G of the nonlinear shrinkage term is hidden in the log derivative, which depends only on the marginal density of the Y_i's. This creates a temptation to believe that estimation of G is really superfluous, that it would suffice to simply estimate f instead. This f-modeling temptation should be resisted for a variety of reasons elucidated in Efron (2019), not the least of which is that it is difficult to account for the full probabilistic structure of the problem when estimating f directly. For example, it is known that the Bayes rule for the posterior mean in the Gaussian case is monotone increasing in y, a fact not easily incorporated into the conventional kernel density estimation approach. Koenker and Mizera (2014) consider this point in more detail.
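As a quick numerical check on (2.1), consider the case where G is itself Gaussian, so that the marginal f is Gaussian as well; the sketch below (the prior variance is an arbitrary illustrative choice) verifies that Tweedie's formula then collapses to the familiar linear shrinkage rule.

```python
import numpy as np
from scipy.stats import norm

def tweedie(y, f, df):
    """Posterior mean via Tweedie's formula, delta(y) = y + f'(y)/f(y),
    for unit-variance Gaussian noise."""
    return y + df(y) / f(y)

s2 = 2.0                          # illustrative prior variance: G = N(0, s2)
m2 = 1.0 + s2                     # marginal variance of Y = theta + N(0, 1)
f  = lambda y: norm.pdf(y, scale=np.sqrt(m2))
df = lambda y: -y / m2 * norm.pdf(y, scale=np.sqrt(m2))  # f'(y)

y = np.linspace(-3, 3, 7)
delta = tweedie(y, f, df)
linear = s2 / (1.0 + s2) * y      # classical linear shrinkage rule
print(np.max(np.abs(delta - linear)))  # agreement up to floating point error
```

With a nonparametric Ĝ the same two lines apply, with f and f′ computed from the fitted mixture rather than in closed form.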
More importantly, estimation of G opens the way to improved methods for attacking many practical decision problems including ranking and selection.

We have focused in this brief overview on compound decision problems for Gaussian location mixtures; however, the NPMLE is adaptable to a wide variety of other mixture problems. Some of these other options are described in Koenker and Gu (2017) and are implemented in the R package of Koenker and Gu (2015).

3. Homogeneous Variances
Suppose that you are given real-valued measurements, y_i, i = 1, ..., n, of some attribute like test score performance for students or their teachers, survival rates for hospital surgical procedures, etc., and are told that the measurements are exchangeable and approximately Gaussian with unknown means θ_i and known variances σ_i², assumed provisionally to take the same value σ². Your task, should you decide to accept it, is to choose a group of size not to exceed αn of the elements with the largest θ_i's. One's first inclination might be to view each y_i as the maximum likelihood estimate for the corresponding θ_i, and select the αn largest observed values, but the compound decision framework suggests that it would be better to treat the problems as an ensemble. A second natural inclination might be to compute posterior means of the θ's with some linear or nonlinear shrinkage rule, rank them, and select the αn best, but we will see that this too may be questionable.

3.1. Posterior Tail Probability.
A natural alternative to ranking by the posterior means is to rank by posterior tail probabilities. Let θ_α = G^{−1}(1 − α), and define v_α(y) := P(θ ≥ θ_α | Y = y); then ranking by posterior tail probability gives the decision rule,

δ(y) = 1{v_α(y) ≥ λ_α},

where λ_α is chosen so that P(v_α(Y) ≥ λ_α) = α. This ranking criterion has been proposed by Henderson and Newton (2016), motivated as a ranking device for a fixed quantile level α. It can be interpreted in multiple testing terms: 1 − v_α(y) is the local false discovery rate of Efron, Tibshirani, Storey, and Tusher (2001) and Storey (2002) for testing the hypothesis H_0: θ < θ_α vs. H_A: θ ≥ θ_α. To see this, let h_i be a binary random variable h_i = 1{θ_i ≥ θ_α}; the loss function for observation i is

L(δ_i, θ_i) = λ 1{h_i = 0, δ_i = 1} + 1{h_i = 1, δ_i = 0}.

The compound Bayes risk is,

E[∑_{i=1}^n L(δ_i, θ_i)] = n[α + ∫ δ(y)[(1 − α)λ f_0(y) − α f_1(y)] dy],

where f_0(y) = (1 − α)^{−1} ∫_{−∞}^{θ_α} ϕ(y | θ, σ) dG(θ), f_1(y) = α^{−1} ∫_{θ_α}^{+∞} ϕ(y | θ, σ) dG(θ), ϕ(y | θ, σ) = φ((y − θ)/σ)/σ, and φ(·) is the standard normal density. The Bayes rule for a fixed λ is

δ(y_i) = 1{v_α(y_i) ≥ λ/(1 + λ)},

where v_α(y) = α f_1(y)/f(y) = P(θ ≥ θ_α | Y = y), and f(y) = (1 − α) f_0(y) + α f_1(y). Provided that v_α(y) is monotone in y, a unique λ_α can be found such that P(δ(Y) = 1) = P(v_α(Y) ≥ λ_α/(1 + λ_α)) = α.

Lemma 3.1.
For fixed α, assuming E_{θ|Y}[∇_y log ϕ(y | θ, σ) | Y] < ∞, v_α(y) is monotone in y and the sets Ω_α := {Y : v_α(Y) ≥ λ_α/(1 + λ_α)} have a nested structure; that is, if α_1 > α_2, then Ω_{α_2} ⊆ Ω_{α_1}.

Any implementation of such a Bayes rule requires an estimate of the mixing distribution, G, or something essentially equivalent that would enable us to compute the local false discovery rates v_α(y) and the cut-off θ_α. The NPMLE, or perhaps a smoothed version of it, will provide a natural Ĝ for this task.

3.2. Posterior Tail Expectation and Other Losses.
Rather than assessing loss by simply counting misclassifications we might consider weighting such misclassifications by the magnitude of θ, for example,

L(δ, θ) = ∑_{i=1}^n (1 − δ_i) 1{θ_i ≥ θ_α} θ_i.

This presumes, of course, that we have centered the distribution G in some reasonable way, perhaps by forcing the mean or median to be zero. Minimizing with respect to δ subject to the constraint that P(δ(Y) = 1) = α leads to the Lagrangian,

min_δ ∫∫ (1 − δ(y)) 1{θ ≥ θ_α} θ ϕ(y | θ, σ) dG(θ) dy + λ[∫∫ δ(y) ϕ(y | θ, σ) dG(θ) dy − α],

which is equivalent to

min_δ ∫∫ 1{θ ≥ θ_α}(θ − λ) ϕ(y | θ, σ) dG(θ) dy − ∫ δ(y)[∫ 1{θ ≥ θ_α}(θ − λ) ϕ(y | θ, σ) dG(θ) − ∫ λ 1{θ < θ_α} ϕ(y | θ, σ) dG(θ)] dy.

Ignoring the first term since it doesn't depend upon δ, the oracle Bayes rule becomes: choose δ(y) = 1 if,

∫ 1{θ ≥ θ_α} θ ϕ(y | θ, σ) dG(θ) / ∫ ϕ(y | θ, σ) dG(θ) ≥ λ,

with λ chosen so that P(δ(Y) = 1) = α. Such criteria are closely related to expected shortfall criteria appearing in the literature on risk assessment. Again, the NPMLE can be employed to construct feasible posterior ranking criteria.

Several other loss functions are considered by Lin, Louis, Paddock, and Ridgeway (2006) including some based on global alignment of the ranks. However, if the objective of the ranking exercise is eventually to select the best, or worst, elements it seems difficult to rationalize global loss functions of this type.
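Both criteria are easily computed once G is (estimated as) discrete. The sketch below, with an illustrative three-atom G, computes the posterior tail probability v_α(y) and the posterior tail expectation and checks that both are increasing in y, so that with homogeneous variance they rank, and hence select, identically.

```python
import numpy as np
from scipy.stats import norm

# Illustrative three-atom mixing distribution G; atoms, weights, alpha,
# and sigma are assumptions for this sketch.
atoms = np.array([-1.0, 0.0, 2.0])
probs = np.array([0.4, 0.4, 0.2])
sigma, alpha = 1.0, 0.2
theta_alpha = atoms[np.cumsum(probs) >= 1 - alpha][0]  # G^{-1}(1 - alpha)

def posterior(y):
    """Posterior probabilities of the atoms of G given Y = y."""
    w = probs * norm.pdf((y - atoms) / sigma)
    return w / w.sum()

def v_alpha(y):
    """Posterior tail probability P(theta >= theta_alpha | Y = y)."""
    return posterior(y)[atoms >= theta_alpha].sum()

def tail_expectation(y):
    """Posterior tail expectation E[theta 1{theta >= theta_alpha} | Y = y]."""
    w = posterior(y)
    return (w * atoms)[atoms >= theta_alpha].sum()

ys = np.linspace(-3, 4, 50)
v = np.array([v_alpha(y) for y in ys])
te = np.array([tail_expectation(y) for y in ys])
# Both criteria increase in y, so they induce the same ranking of the y_i's.
print(np.all(np.diff(v) > 0), np.all(np.diff(te) > 0))
```

The monotonicity is a consequence of the monotone likelihood ratio property of the Gaussian kernel, which makes the posterior of θ stochastically increasing in y.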
3.3. False Discovery and the α-Level. Although our loss functions yield distinct criteria for ranking, their decision rules lead to the same selections when the precision of the measurements is homogeneous. When variances are homogeneous there is a cut-off, η_α, and a decision rule, δ_α(Y) = 1{Y ≥ η_α}, determining a common selection for all decision rules.

Lemma 3.2.
For fixed α and homogeneous variance, posterior mean, posterior tail probability and posterior tail expectation all yield the same ranking and therefore the same selection.

The marginal false discovery rate for this selection in our Gaussian mixture setting is,

mFDR = P(θ < θ_α | δ_α(Y) = 1) = α^{−1} P(θ < θ_α, θ + σZ ≥ η_α) = α^{−1} ∫_{−∞}^{θ_α} Φ((θ − η_α)/σ) dG(θ),

where Z is standard Gaussian. The marginal false non-discovery rate is,

mFNR = P(θ ≥ θ_α | δ_α(Y) = 0) = (1 − α)^{−1} ∫_{θ_α}^{∞} Φ((η_α − θ)/σ) dG(θ).

Figure 3.1 shows the false discovery rate and false non-discovery rate for a range of capacity constraints, α, when the mixing distribution, G, is standard Gaussian and σ = 1. In this case, the cut-off value η_α is the (1 − α) quantile of the N(0, 2) marginal distribution of Y. The false discovery rate remains alarmingly high, especially for smaller α, implying that the selected set may consist of a very high proportion of false discoveries. For example, for α = 0.10 the proportion of selected observations with θ below the threshold θ_α is slightly greater than 50 percent.

Figure 3.1. False Discovery Rates and False Non-Discovery Rates for a Standard Gaussian Mixing Distribution.
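The mFDR formula above is easily evaluated numerically; the following sketch reproduces the standard Gaussian case underlying Figure 3.1.

```python
import numpy as np
from scipy.stats import norm
from scipy.integrate import quad

def mfdr(alpha):
    """Marginal FDR for selecting the top alpha fraction when
    G = N(0,1) and sigma = 1, so that Y ~ N(0,2)."""
    theta_a = norm.ppf(1 - alpha)             # (1 - alpha) quantile of G
    eta_a = np.sqrt(2) * norm.ppf(1 - alpha)  # (1 - alpha) quantile of Y
    # alpha^{-1} * P(theta < theta_alpha, Y >= eta_alpha)
    num, _ = quad(lambda t: norm.cdf(t - eta_a) * norm.pdf(t),
                  -np.inf, theta_a)
    return num / alpha

print(round(mfdr(0.10), 3))  # slightly greater than 0.5
```

The computation confirms the discouraging message of the figure: even with full knowledge of G, more than half of a top-decile selection is expected to lie below θ_α.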
Thus far we have implicitly assumed that the size of the selected set is predetermined by the parameter α. Having established a ranking based on a particular loss function, we simply select a subset of size ⌈αn⌉ consisting of the highest ranked observations. In the next subsection we begin to consider modifying this strategy by constraining the probability of false discoveries. This will allow the size of the selected set to adapt to the difficulty of the selection task.

3.4. Guarding against false discovery.
Recognizing the risk of false "discoveries" among those selected, we will consider an expanded loss function,

(3.1) L(δ, θ) = ∑_{i=1}^n h_i(1 − δ_i) + τ_1(∑_{i=1}^n {(1 − h_i)δ_i − γδ_i}) + τ_2(∑_{i=1}^n δ_i − αn),

where h_i = 1{θ_i ≥ θ_α}. If we set τ_1 to zero, then minimizing the expected loss leads to the Bayes rule discussed in Section 3.1. On the other hand, if we set τ_2 to zero, then minimizing expected loss leads to a decision rule that is equivalent to a multiple testing problem with null hypothesis H_i: θ_i ≤ θ_α; the goal is to minimize the expected number of over-looked discoveries subject to the constraint that the marginal FDR rate is controlled at level γ, that is, E[∑_{i=1}^n (1 − h_i)δ_i]/E[∑_{i=1}^n δ_i] ≤ γ.

When τ_1 = 0, the risk can be expressed as,

E_{θ|Y}[L(δ, θ)] = ∑_{i=1}^n (1 − δ_i)v_α(Y_i) + τ_2(∑_{i=1}^n δ_i − αn),

where v_α(y_i) = P(θ_i ≥ θ_α | Y_i = y_i). Taking another expectation over Y, and minimizing over both δ and τ_2, leads to the decision rule,

δ*_i = 1 if v_α(y_i) ≥ τ*, and δ*_i = 0 if v_α(y_i) < τ*.

The Lagrange multiplier is chosen so that the constraint P(δ_i = 1) ≤ α holds with equality: τ* = min{τ : P(v_α(y_i) ≥ τ) ≤ α}. Each selection improves the objective function by v_α(y_i), but incurs a cost of τ*. Since all selections incur the same cost, we may rank according to v_α(y_i), selecting units until the capacity constraint αn is achieved. Selection of the last unit may need to be randomized to exactly satisfy the constraint, as we note below.

When τ_2 = 0 the focus shifts to the marginal FDR, the ratio of the expected number of false discoveries to the expected number of selections. This is slightly different from the original FDR as defined in Benjamini and Hochberg (1995). However, when n is large the two concepts are asymptotically equivalent as shown by Genovese and Wasserman (2002). Our objective becomes,

E_{θ|Y}[L(δ, θ)] = ∑_{i=1}^n (1 − δ_i)v_α(Y_i) + τ_1(∑_{i=1}^n {δ_i(1 − v_α(Y_i)) − γδ_i}).

Taking expectations again over Y and minimizing over both δ and τ_1 yields,

δ*_i = 1 if v_α(y_i) > τ*(1 − v_α(y_i) − γ), and δ*_i = 0 otherwise,

and the Lagrange multiplier takes a value τ* that makes the marginal FDR constraint hold with equality.

When both constraints are incorporated we must balance the power gain from more selections and the cost that occurs from both the capacity constraint and FDR control. The Bayes rule solves,

min_δ E[∑_{i=1}^n (1 − δ_i)v_α(y_i)] + τ_1(E[∑_{i=1}^n {(1 − v_α(y_i))δ_i − γδ_i}]) + τ_2(E[∑_{i=1}^n δ_i] − αn).

Given the discrete nature of the decision function, this problem appears to take the form of a classical knapsack problem; however, following the approach of Basu, Cai, Das, and Sun (2018) we will consider a relaxed version of the problem in which units are selected sequentially until one or the other constraint would be violated, with the final selection randomized to satisfy the constraint exactly.
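The relaxed sequential procedure can be sketched as follows: take units in decreasing order of v_α and stop when either constraint would be violated. The uniform "posterior tail probabilities" below are stand-ins for illustration, and the randomization of the marginal unit is omitted for simplicity.

```python
import numpy as np

def select(v, alpha, gamma):
    """Greedy relaxation: admit units in decreasing order of the posterior
    tail probability v_i, stopping when either the capacity bound alpha*n
    or the estimated FDR bound gamma would be violated.  The estimated FDR
    of the selected set is the mean of the local fdr's, 1 - v_i."""
    n = len(v)
    order = np.argsort(-v)
    chosen = []
    fdr_num = 0.0                  # running sum of 1 - v_i over selections
    for i in order:
        if len(chosen) + 1 > alpha * n:
            break                  # capacity constraint would bind
        if (fdr_num + 1 - v[i]) / (len(chosen) + 1) > gamma:
            break                  # estimated FDR constraint would bind
        chosen.append(i)
        fdr_num += 1 - v[i]
    return np.array(chosen, dtype=int)

rng = np.random.default_rng(1)
v = rng.uniform(size=1000)         # stand-in posterior tail probabilities
sel = select(v, alpha=0.10, gamma=0.10)
print(len(sel))                    # here the capacity constraint binds first
```

With noisier inputs (v values bounded away from one) the FDR check fires first and the selected set falls short of the αn capacity, exactly the adaptive behavior described above.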
Remark. Given the Lagrangian form of our loss function it is natural to consider an optimization perspective for the selection problem. Minimizing the expectation of the loss defined in (3.1) is equivalent to minimizing P[δ_i = 0, θ_i ≥ θ_α] subject to the constraint that P[δ_i = 1, θ_i < θ_α]/P[δ_i = 1] ≤ γ and P[δ_i = 1] ≤ α. So we are looking for a thresholding rule that minimizes the expected number of missed discoveries subject to the capacity constraint and the constraint that the marginal FDR rate of the decision rule is below level γ. This minimization problem is also easily seen – from a testing perspective – to be equivalent to maximizing power of the decision rule δ, P[δ_i = 1 | θ_i ≥ θ_α], subject to the same two constraints.

Proposition 3.3.
For any pair (α, γ) such that γ < 1 − α, the optimal Bayes rule takes the form δ*_i = 1{v_α(y_i) ≥ λ*(α, γ)} where λ*(α, γ) = v_α(t*) with t* = max{t*_1, t*_2} and

t*_1 = min{t : ∫_{−∞}^{θ_α} Φ̃((t − θ)/σ) dG(θ) / ∫_{−∞}^{+∞} Φ̃((t − θ)/σ) dG(θ) − γ ≤ 0},

t*_2 = min{t : ∫_{−∞}^{+∞} Φ̃((t − θ)/σ) dG(θ) − α ≤ 0},

with Φ̃ being the survival function of a standard normal random variable.

Remark.
The optimal cutoff t* depends on the data generating process and also the choice of α and γ. When the data are noisy, the FDR control constraint may be binding before the capacity constraint is reached, and consequently the selected set may be strictly smaller than the pre-specified α proportion. On the other hand, when the signal is strong, the FDR control constraint is unlikely to be binding before the capacity constraint is attained.

We have seen that when variances are homogeneous, the optimal selection rule thresholds on Y, so it is clear then that any ranking based on a monotone transformation of Y will lead to an equivalent selected set. We should also highlight that we have focused on a null hypothesis that depends on α, while the multiple testing literature, for example Efron, Tibshirani, Storey, and Tusher (2001), Sun and Cai (2007) and Basu, Cai, Das, and Sun (2018), typically focuses on the null hypothesis H_i: θ_i = 0. When variances are homogeneous, it doesn't matter whether we use an α-dependent null or the conventional zero null, because the transformation based on the conventional null, P(θ > 0 | Y = y), is also a monotone function of Y, and therefore yields an equivalent decision rule. However, when variances are heterogeneous, this invariance no longer holds; different transformations of the pair (y, σ) lead to distinct decision rules with distinct performance, and using the conventional null hypothesis is no longer advisable for the ranking and selection problem, as we will show in the next section.

4. Heterogeneous Known Variances

The homogeneous variance assumption of the preceding section is unsustainable in most applications. Batting averages are accompanied by a number of "at bats" and mean test score performances are accompanied by student sample sizes.
In this section we will consider the expanded model,

Y_i ∼ N(θ_i, σ_i²), with θ_i ∼ G, σ_i ∼ H, and σ_i ⊥⊥ θ_i.

We will assume that we observe σ_i, an assumption that will be relaxed in the next section.

4.1. Posterior Tail Probability.
With the same alternative hypothesis as before, H_A: θ ≥ θ_α, it is natural to consider the posterior tail probability again, now as a function of the pair (y_i, σ_i),

v_α(y_i, σ_i) = P(θ_i ≥ θ_α | y_i, σ_i) = ∫_{θ_α}^{+∞} σ_i^{−1} φ((y_i − θ)/σ_i) dG(θ) / ∫_{−∞}^{+∞} σ_i^{−1} φ((y_i − θ)/σ_i) dG(θ).

Solving the same decision problem with the loss function specified in (3.1), we have the conditional risk,

E_{θ|Y,σ}[L(δ, θ)] = ∑_{i=1}^n (1 − δ_i)v_α(Y_i, σ_i) + τ_1(∑_{i=1}^n {δ_i(1 − v_α(Y_i, σ_i)) − γδ_i}) + τ_2(∑_{i=1}^n δ_i − αn).

Taking another expectation with respect to the joint distribution of the (Y_i, σ_i)'s, the Bayes rule solves

min_δ E[∑_{i=1}^n (1 − δ_i)v_α(y_i, σ_i)] + τ_1(E[∑_{i=1}^n {(1 − v_α(y_i, σ_i))δ_i − γδ_i}]) + τ_2(E[∑_{i=1}^n δ_i] − αn).

The optimal selection rule can again be characterized as a thresholding rule on v_α(y_i, σ_i).
Proposition 4.1.
For a pre-specified pair (α, γ) such that γ < 1 − α, the optimal Bayes rule takes the form δ*(y, σ) = 1{v_α(y, σ) ≥ λ*(α, γ)} where λ*(α, γ) = max{λ*_1(α, γ), λ*_2(α)} and

λ*_1(α, γ) = min{λ : ∫∫_{−∞}^{θ_α} Φ̃((t_α(λ, σ) − θ)/σ) dG(θ) dH(σ) / ∫∫_{−∞}^{+∞} Φ̃((t_α(λ, σ) − θ)/σ) dG(θ) dH(σ) − γ ≤ 0},

λ*_2(α) = min{λ : ∫∫_{−∞}^{+∞} Φ̃((t_α(λ, σ) − θ)/σ) dG(θ) dH(σ) − α ≤ 0},

denoting Φ̃ = 1 − Φ and with t_α(λ, σ) defined by v_α(t_α(λ, σ), σ) = λ for all λ ∈ [0, 1].

Remark.
Note that although the thresholding value $\lambda^*$ does not depend on the value of $\sigma$, the ranking does depend on $\sigma$. One way to see this is that since $v_\alpha(y, \sigma)$ is monotone in $y$ for all $\sigma > 0$, the optimal rule is equivalent to $\mathbb{1}\{y_i > t_\alpha(\lambda^*, \sigma)\}$, where $t_\alpha(\lambda, \sigma)$ is a function of $\sigma$. For a fixed value of $\lambda^*$, the selection region for $Y$ depends on $\sigma$ in a nonlinear way. Comparing individuals $i$ and $j$, it may be the case that $y_i > y_j$, but $y_j$ belongs to the selection region while $y_i$ does not. An example illustrating this appears below. It should also be emphasized that when variances are heterogeneous, different loss functions need not lead to equivalent selections.

4.2. The Conventional Null Hypothesis.
The posterior tail probability criterion is motivated by viewing the ranking and selection problems as hypothesis testing while allowing the null hypothesis to be $\alpha$ dependent. The particular construction of the null hypothesis turns out to be critical for the ranking exercise. In this subsection we present a simple example to illustrate that a tail probability based on the conventional null hypothesis of zero effect does not lead to a powerful ranking device. Consider data generated from a three component normal mixture model,
$$(4.1) \qquad Y_i \mid \sigma_i \sim 0.85\,\mathcal{N}(-1, \sigma_i^2) + 0.10\,\mathcal{N}(0.5, \sigma_i^2) + 0.05\,\mathcal{N}(5, \sigma_i^2),$$
with $\sigma_i$ uniformly distributed. Instead of $v_\alpha$, we consider the transformation,
$$T(y_i, \sigma_i) = P(\theta_i > 0 \mid y_i, \sigma_i) = \frac{\int_{0}^{+\infty} \sigma_i^{-1}\varphi((y_i - \theta)/\sigma_i)\, dG(\theta)}{\int_{-\infty}^{+\infty} \sigma_i^{-1}\varphi((y_i - \theta)/\sigma_i)\, dG(\theta)}$$
and rank individuals accordingly. This transformation corresponds to the procedure proposed in Sun and McLain (2012), and is motivated for multiple testing problems under the conventional null hypothesis $H_0 : \theta \le 0$. The decision rule $\delta_i^T = \mathbb{1}\{T(y_i, \sigma_i) \ge \lambda\}$ then chooses the cutoff value $\lambda$ that respects both the capacity constraint and the FDR control constraint for selecting the top $\alpha$ proportion.

Figure 4.1 compares the selection regions of the two ranking procedures with $\alpha = 5\%$ and marginal FDR control at level 10%. The solid black line corresponds to the selection boundary using the ranking based on the transformation $v_\alpha$, and the dashed red line corresponds to the selection boundary using the ranking based on the transformation $T$. The black highlighted area below the black selection boundary corresponds to a region where the ranking method based on $T$ selects but the ranking method based on $v_\alpha$ does not. On the other hand, the blue highlighted area corresponds to a region selected by $v_\alpha$, but not by $T$.

Figure 4.1. Selection boundaries based on the model (4.1) with $\alpha = 0.05$ and $\gamma = 0.1$. The solid black curve corresponds to the boundary of the selection region based on the transformation $v_\alpha$. The dashed red curve corresponds to the boundary of the selection region based on the transformation $T$. The density of $\sigma$ is assumed to be uniform.

The transformation $T$ ranks those in the black region higher than those in the blue region because, although they have relatively smaller mean effects $y$, their associated variances are also smaller, indicating stronger evidence that such individuals have a positive $\theta$ than those located in the blue area. However, our task is to find individuals with true effects, $\theta_i$, in the upper tail. For $\alpha = 5\%$, we aim to select all individuals with $\theta = 5$; individuals in the black region present strong evidence that their true effect cannot be too large, because their observed effect $y$ is small and its associated variance is also small, while those in the blue region, whose observed mean effects are associated with larger variances, offer reasonable evidence that their true effect $\theta$ may be large. This evidence is not apparent in the transformation $T$, but is captured in the transformation $v_\alpha$.

Indeed, the average power of the rankings based on the two different transformations $v_\alpha$ and $T$ differs significantly. Defining the power of a selection rule as $\beta(\delta) := P(\theta_i \ge \theta_\alpha, \delta_i = 1)/P(\theta_i \ge \theta_\alpha)$, the proportion of true top $\alpha$ cases selected by the decision rule $\delta$, we find $\beta(\delta^T) = 39\%$ and $\beta(\delta^*) = 69\%$. Thus, although much of the literature relies on ranking and selection rules based on some form of posterior means and the conventional hypothesis testing apparatus, we would caution that such methods can be quite misleading and inefficient.

4.3. Nestedness of Selection Sets.
If we were to relax the capacity constraint to allow a larger proportion, $\alpha_2 > \alpha_1$, to be selected, while maintaining our initial false discovery control, we would expect that members selected under the more stringent capacity constraint should remain selected under the relaxed constraint. We now discuss sufficient conditions under which we obtain this nestedness of the selection sets when using the posterior tail probability rule.

The optimal Bayes rule defines the selection set for each pair $(\alpha, \gamma)$ as
$$\Omega_{\alpha,\gamma} := \{(y, \sigma) : v_\alpha(y, \sigma) \ge \lambda^*(\alpha, \gamma)\}$$
and when $\sigma$ is known, $v_\alpha(y, \sigma)$ is monotone in $y$ for each fixed $\sigma$, as shown in Lemma 3.2, hence the selection set can also be represented as
$$\Omega_{\alpha,\gamma} = \{(y, \sigma) : y \ge t_\alpha(\lambda^*(\alpha, \gamma), \sigma)\}.$$
It is also convenient for later discussion to define
$$\Omega^{FDR}_{\alpha,\gamma} := \{(y, \sigma) : v_\alpha(y, \sigma) \ge \lambda_1^*(\alpha, \gamma)\} = \{(y, \sigma) : y \ge t_\alpha(\lambda_1^*(\alpha, \gamma), \sigma)\}$$
$$\Omega^{C}_{\alpha} := \{(y, \sigma) : v_\alpha(y, \sigma) \ge \lambda_2^*(\alpha)\} = \{(y, \sigma) : y \ge t_\alpha(\lambda_2^*(\alpha), \sigma)\}$$
which are, respectively, the selection sets when the false discovery rate constraint or the capacity constraint is binding. It is easy to see that $\Omega_{\alpha,\gamma} = \Omega^{FDR}_{\alpha,\gamma} \cap \Omega^{C}_{\alpha}$.

Lemma 4.2.
Let the density function of $v_\alpha(y_i, \sigma_i)$ be denoted $f_v(v; \alpha)$ and let
$$\lambda_1^*(\alpha, \gamma) = \min\Big\{\lambda : \frac{\int\!\!\int_{-\infty}^{\theta_\alpha} \tilde\Phi\big((t_\alpha(\lambda, \sigma) - \theta)/\sigma\big)\, dG(\theta)\, dH(\sigma)}{\int\!\!\int_{-\infty}^{+\infty} \tilde\Phi\big((t_\alpha(\lambda, \sigma) - \theta)/\sigma\big)\, dG(\theta)\, dH(\sigma)} - \gamma \le 0\Big\}$$
with $t_\alpha(\lambda, \sigma)$ defined by $v_\alpha(t_\alpha(\lambda, \sigma), \sigma) = \lambda$ and $\tilde\Phi$ the survival function of the standard normal random variable. If $\nabla_\alpha \log f_v(v; \alpha)$ is non-decreasing in $v$, then for fixed $\gamma$, if $\alpha_2 > \alpha_1$, we have $\lambda_1^*(\alpha_2, \gamma) \le \lambda_1^*(\alpha_1, \gamma)$.

Remark.
The density function $f_v(v; \alpha)$ can be viewed as a function of $v$ indexed by the parameter $\alpha$. (An explicit form for $f_v(v; \alpha)$ appears in Section 4.4 for the normal-normal model.) The condition imposed in Lemma 4.2 is equivalent to a monotone likelihood ratio condition, that is, that the likelihood ratio $f_v(v; \alpha_2)/f_v(v; \alpha_1)$ is non-decreasing in $v$ if $\alpha_2 > \alpha_1$. The monotone likelihood ratio condition is stronger in that it implies stochastic dominance between the distributions of the random variables $v_{\alpha_1}$ and $v_{\alpha_2}$, which holds by construction.

Corollary 4.3.
If the condition in Lemma 4.2 holds, then $\Omega^{FDR}_{\alpha_1,\gamma} \subseteq \Omega^{FDR}_{\alpha_2,\gamma}$ for any $\alpha_2 > \alpha_1$.

Remark.
The condition in Lemma 4.2 is sufficient but not necessary for nestedness of $\Omega^{FDR}_{\alpha,\gamma}$, because even when $\lambda_1^*(\alpha_2, \gamma) > \lambda_1^*(\alpha_1, \gamma)$, we can still have $t_{\alpha_2}(\lambda_1^*(\alpha_2, \gamma), \sigma) \le t_{\alpha_1}(\lambda_1^*(\alpha_1, \gamma), \sigma)$.

Lemma 4.4. Let
$$\lambda_2^*(\alpha) = \min\Big\{\lambda : \int\!\!\int_{-\infty}^{+\infty} \tilde\Phi\big((t_\alpha(\lambda, \sigma) - \theta)/\sigma\big)\, dG(\theta)\, dH(\sigma) \le \alpha\Big\}.$$
If for any $\alpha_2 > \alpha_1$, $t_{\alpha_2}(\lambda_2^*(\alpha_2), \sigma) \le t_{\alpha_1}(\lambda_2^*(\alpha_1), \sigma)$ for each $\sigma$, then $\Omega^{C}_{\alpha_1} \subseteq \Omega^{C}_{\alpha_2}$.

Remark. Monotonicity coincides with the condition in Theorem 3 of Henderson and Newton (2016). They demonstrate that it holds when $G$ follows a normal distribution. However, it need not hold, as shown in the counter-example in Section 4.5.

Lemma 4.5. If $\nabla_\alpha \log f_v(v; \alpha)$ is non-decreasing in $v$ and the condition in Lemma 4.4 holds, then for a fixed $\gamma$, the selection region has a nested structure: if $\alpha_2 > \alpha_1$ then $\Omega_{\alpha_1,\gamma} \subseteq \Omega_{\alpha_2,\gamma}$.

4.4. Examples. In this section we consider several examples, beginning with the simplest classical case in which the $\theta_i$ constitute a random sample from the standard Gaussian distribution. This Gaussian assumption on the form of the mixing distribution $G$ underlies almost all of the empirical Bayes literature in applied economics; it is precisely what justifies the linear shrinkage rules that are typically employed.

Example. [Gaussian $G$] Consider the normal-normal model, where $y \mid \theta, \sigma \sim \mathcal{N}(\theta, \sigma^2)$, $\theta \sim \mathcal{N}(0, \sigma_\theta^2)$, and $\sigma \sim H$ with density function $h(\sigma)$. The marginal distribution of $y$ given $\sigma$ is $\mathcal{N}(0, \sigma^2 + \sigma_\theta^2)$ and the joint density of $(y, \sigma)$ takes the form
$$f(y, \sigma) = \frac{1}{\sqrt{2\pi(\sigma^2 + \sigma_\theta^2)}} \exp\Big\{-\frac{y^2}{2(\sigma^2 + \sigma_\theta^2)}\Big\}\, h(\sigma).$$
Given the normal conjugacy, the posterior distribution of $\theta \mid y, \sigma$ is $\mathcal{N}(\rho y, \rho\sigma^2)$ where $\rho = \sigma_\theta^2/(\sigma_\theta^2 + \sigma^2)$. The random variable $v$ is a transformation of the pair $(y, \sigma)$, defined as
$$v = \psi(y, \sigma) := P(\theta \ge \theta_\alpha \mid y, \sigma) = \Phi\big((\rho y - \theta_\alpha)/\sqrt{\rho\sigma^2}\big).$$
For fixed $\sigma$, $\psi$ is monotone increasing in $y$, and $\psi^{-1}(v) = \theta_\alpha/\rho + \sqrt{\sigma^2/\rho}\,\Phi^{-1}(v)$ with $\nabla_v \psi^{-1}(v) = \sqrt{\sigma^2/\rho}\big/\varphi(\Phi^{-1}(v))$.
The joint density of $v$ and $\sigma$ is thus
$$g(v, \sigma) = f(\psi^{-1}(v), \sigma)\, |\nabla_v \psi^{-1}(v)| = \frac{1}{\sqrt{2\pi(\sigma^2 + \sigma_\theta^2)}} \exp\Big\{-\frac{\big(\theta_\alpha/\rho + \sqrt{\sigma^2/\rho}\,\Phi^{-1}(v)\big)^2}{2(\sigma^2 + \sigma_\theta^2)}\Big\} \frac{\sqrt{\sigma^2/\rho}}{\varphi(\Phi^{-1}(v))}\, h(\sigma).$$
Integrating out $\sigma$, we have the marginal density of $v$,
$$f_v(v; \alpha) = \int \frac{1}{\sqrt{2\pi(\sigma^2 + \sigma_\theta^2)}} \exp\Big\{-\frac{\big(\theta_\alpha/\rho + \sqrt{\sigma^2/\rho}\,\Phi^{-1}(v)\big)^2}{2(\sigma^2 + \sigma_\theta^2)}\Big\} \frac{\sqrt{\sigma^2/\rho}}{\varphi(\Phi^{-1}(v))}\, dH(\sigma).$$
The capacity constraint is $P(v \ge \lambda_2^*) = \alpha$, with cut-off value $\lambda_2^*$ satisfying
$$\alpha = P(v \ge \lambda_2^*) = 1 - \int \Phi\Big(\frac{\theta_\alpha\sqrt{\sigma^2 + \sigma_\theta^2}}{\sigma_\theta^2} - \Phi^{-1}(1 - \lambda_2^*)\sqrt{\sigma^2/\sigma_\theta^2}\Big)\, dH(\sigma).$$
To find $\lambda_1^*$, we can use the formula provided in Proposition 4.1. A more direct approach is to recognize, see Section 6, that the FDR control constraint is $\gamma = E[(1 - v)\mathbb{1}\{v \ge \lambda_1^*\}]/P(v \ge \lambda_1^*)$, where the cut-off value $\lambda_1^*$ is defined through
$$\gamma = \int_{\lambda_1^*}^{1} (1 - v)\, f_v(v; \alpha)\, dv \Big/ \int_{\lambda_1^*}^{1} f_v(v; \alpha)\, dv.$$
Let $\lambda^* = \max\{\lambda_1^*, \lambda_2^*\}$; the selection region is then $\{(y, \sigma) : y \ge t_\alpha(\lambda^*, \sigma)\}$ with
$$t_\alpha(\lambda^*, \sigma) = \theta_\alpha/\rho - \Phi^{-1}(1 - \lambda^*)\sqrt{\sigma^2/\rho}.$$
Suppose instead we use the posterior mean of $\theta$ as a ranking device, so $\delta_i^{PM} = \mathbb{1}\{y\rho \ge \omega^*\}$ for some suitably chosen $\omega^*$ that guarantees both capacity and FDR control. For the capacity constraint, the thresholding value solves
$$1 - \alpha = \int P(y\rho < \omega_2^*)\, dH(\sigma) = \int \Phi\Big(\omega_2^*\sqrt{\sigma_\theta^2 + \sigma^2}\big/\sigma_\theta^2\Big)\, dH(\sigma),$$
while FDR control requires a thresholding value that solves
$$\gamma = \frac{\int P(y \ge \omega_1^*/\rho,\; \theta < \theta_\alpha)\, dH(\sigma)}{\int P(y \ge \omega_1^*/\rho)\, dH(\sigma)} = \frac{\int\!\int_{[\omega_1^*/\rho, +\infty)} (1 - \alpha)\, f_0(y)\, dy\, dH(\sigma)}{\int \Big(1 - \Phi\big(\omega_1^*\sqrt{\sigma_\theta^2 + \sigma^2}/\sigma_\theta^2\big)\Big)\, dH(\sigma)},$$
with
$$f_0(y) = \frac{1}{1 - \alpha}\, \frac{1}{\sqrt{2\pi(\sigma_\theta^2 + \sigma^2)}} \exp\Big\{-\frac{y^2}{2(\sigma_\theta^2 + \sigma^2)}\Big\}\, \Phi\Big(\frac{\theta_\alpha - y\rho}{\sqrt{\rho\sigma^2}}\Big)$$
denoting the density of $y$ under the null $\theta < \theta_\alpha$.
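The conjugate algebra for the normal-normal tail probability can be checked directly against brute-force numerical integration. A small sketch follows; the parameter values ($\sigma_\theta = 1$, $\theta_\alpha = 1.645$, and the evaluation point) are hypothetical choices for illustration.

```python
import math

def phi_cdf(z):
    # standard normal CDF via the error function
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def v_closed(y, sigma, sigma_theta, theta_alpha):
    """Closed-form tail probability in the normal-normal model:
    theta | y, sigma ~ N(rho*y, rho*sigma^2), rho = sigma_theta^2/(sigma_theta^2 + sigma^2)."""
    rho = sigma_theta ** 2 / (sigma_theta ** 2 + sigma ** 2)
    return phi_cdf((rho * y - theta_alpha) / math.sqrt(rho * sigma ** 2))

def v_numeric(y, sigma, sigma_theta, theta_alpha, n=100_000, lo=-10.0, hi=10.0):
    """Brute-force midpoint-rule version of P(theta >= theta_alpha | y, sigma)."""
    h = (hi - lo) / n
    num = den = 0.0
    for k in range(n):
        th = lo + (k + 0.5) * h
        w = math.exp(-0.5 * (th / sigma_theta) ** 2)      # N(0, sigma_theta^2) prior kernel
        lik = math.exp(-0.5 * ((y - th) / sigma) ** 2)    # N(theta, sigma^2) likelihood kernel
        den += w * lik
        if th >= theta_alpha:
            num += w * lik
    return num / den

v1 = v_closed(1.2, 0.8, 1.0, 1.645)
v2 = v_numeric(1.2, 0.8, 1.0, 1.645)
```

The two computations agree to numerical-integration accuracy, which is a convenient safeguard when the same code is later extended to non-Gaussian $G$ where no closed form exists.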
Setting $\omega^* = \max\{\omega_1^*, \omega_2^*\}$, the selection region is then $\{(y, \sigma) : y \ge \omega^*/\rho\}$.

Figure 4.2. The left panel plots the selection boundaries for the normal-normal model with $\sigma_\theta = 1$, $\alpha = 0.05$ and $\gamma = 0.2$; the density of $\sigma$ is assumed to be uniform, and units are selected when the pair $(y_i, \sigma_i)$ lies above the curves. The red curves correspond to the selection region boundaries with FDR controlled at level 0.2; solid lines for posterior mean ranking and dashed lines for posterior tail probability ranking. The black curves correspond to the selection boundaries with capacity control at level 0.05. The middle and right panels plot the selected sets under the capacity constraint and the FDR constraint, respectively, distinguishing points selected by both rules ("All agreed") from those selected only by the tail probability rule ("Tailp extra") or only by the posterior mean rule ("PM extra").

Figure 4.2 plots the selection boundaries for both constraints with $\theta \sim \mathcal{N}(0, 1)$ and $\sigma$ uniformly distributed. With $\alpha = 0.05$ and $\gamma = 0.2$, the FDR constraint is binding, but not the capacity constraint. In this example, if we only impose the capacity constraint at 5 percent, even an Oracle totally aware of $G$ will face a false discovery rate of nearly 52 percent. In other words, more than half of those selected to be in the right tail will be individuals with $\theta < \theta_\alpha$ rather than from the intended $\theta \ge \theta_\alpha$ group. This fact motivates our more explicit incorporation of FDR into the selection constraints. We may recall that in the homogeneous variance Gaussian setting we saw in Figure 3.1 that FDR was also very high when $\alpha$ is set at 0.05. Figure 4.2 also depicts the selected sets for a realized sample of 10,000 from the normal-normal model. With the capacity constraint alone, the posterior mean criterion favours individuals with smaller variances. When the FDR constraint is implemented with $\gamma = 0.2$, it becomes the binding constraint in this setting; both criteria become more stringent, a much smaller set of individuals is selected, and there is less conflict between the selections. The corresponding selected sets are plotted in the right panel of Figure 4.2. When the variance parameter $\sigma_\theta^2$ in $G$ is not observed, we can estimate it via the MLE based on the marginal likelihood of $Y$. This leads to the generalized James-Stein estimator proposed in Efron and Morris (1973).

Example. [A Discrete $G$] Suppose $\theta \sim 0.85\,\delta_{-1} + 0.10\,\delta_{2} + 0.05\,\delta_{5}$. Then the marginal density of $y$ given $\sigma$ takes the form
$$f(y \mid \sigma) = \int \sigma^{-1}\varphi((y - \theta)/\sigma)\, dG(\theta) = \frac{1}{\sigma}\Big(0.85\,\varphi((y + 1)/\sigma) + 0.10\,\varphi((y - 2)/\sigma) + 0.05\,\varphi((y - 5)/\sigma)\Big)$$
and the random variable $v$ is a transformation of the pair $(y, \sigma)$, defined as
$$v = \psi(y, \sigma) := P(\theta \ge \theta_\alpha \mid y, \sigma) = \frac{\int_{\theta_\alpha}^{+\infty} \sigma^{-1}\varphi((y - \theta)/\sigma)\, dG(\theta)}{\int_{-\infty}^{+\infty} \sigma^{-1}\varphi((y - \theta)/\sigma)\, dG(\theta)}.$$
The capacity constraint leads to a thresholding rule on $v$ such that $P(v \ge \lambda_2^*) = \alpha$, while FDR control leads to a cutoff value $\lambda_1^*$ defined through $\gamma = E[(1 - v)\mathbb{1}\{v \ge \lambda_1^*\}]/P(v \ge \lambda_1^*)$. Let $\lambda^* = \max\{\lambda_1^*, \lambda_2^*\}$; the selection region is then $\{(y, \sigma) : y \ge t_\alpha(\lambda^*, \sigma)\}$, and can easily be found numerically.

Figure 4.3 plots the selection boundaries for both constraints when $\theta$ follows this discrete distribution. We again set $\alpha = 0.05$ and $\gamma = 0.2$, so we would like to select all the individuals associated with the largest effect size, $\{\theta = 5\}$, while controlling the FDR below 20%. The red curves again correspond to FDR control for the two ranking procedures, while the black curves correspond to capacity control. For the two regions to overlap with $\alpha$ fixed at 0.05, we must be willing to tolerate a considerably larger $\gamma$.

Figure 4.3. The left panel plots the selection boundaries for the normal-discrete model with $\theta \sim G = 0.85\,\delta_{-1} + 0.10\,\delta_{2} + 0.05\,\delta_{5}$, $\alpha = 0.05$ and $\gamma = 0.1$; the density of $\sigma$ is uniform, with solid lines for posterior mean ranking and dashed lines for posterior tail probability ranking. The black curves correspond to the selection region with capacity control at level 0.05. The other panels are structured as in the previous figure.

4.5. A Counterexample: Non-nested Selection Regions.
Thus far we have stressed conditions under which selection regions are nested with respect to $\alpha$, that is, for $\alpha_1 < \alpha_2 < \cdots < \alpha_m$, we have $\Omega_{\alpha_1} \subseteq \Omega_{\alpha_2} \subseteq \cdots \subseteq \Omega_{\alpha_m}$ for the selection regions. However, this need not hold when there is variance heterogeneity, and when nesting fails we can have seemingly anomalous situations in which units are selected by the tail probability rule at some stringent, low $\alpha$, but are then rejected at some less stringent, larger $\alpha$'s. To illustrate this phenomenon we will neglect the FDR constraint and focus on our discrete mixing distribution, $G = 0.85\,\delta_{-1} + 0.10\,\delta_{2} + 0.05\,\delta_{5}$, with $\sigma \sim U[1/2, 3]$. Figure 4.4 depicts Oracle selection boundaries for several $\alpha$ levels, including $\alpha \in \{0.04, 0.05, 0.06\}$. Units are selected when their observed pair $(y_i, \sigma_i)$ lies above these curves for the various $\alpha$'s. When $\sigma$ is small we see, as expected, that selection is nested: if a unit is selected at a low $\alpha$ it stays selected at larger $\alpha$'s. However, when $\sigma = 3$, we see that there are units selected at $\alpha = 0.05$ and even $\alpha = 0.04$ that are nevertheless rejected at $\alpha = 0.06$. The intuition rests on the structure of $G$: when you decide to select with $\alpha = 0.06$ you know that you will have to select a few $\theta = 2$ types, since there are only 5 percent of the $\theta = 5$ types. Your main worry at that point is to try to avoid selecting any $\theta = -1$ types. In contrast, when $\alpha = 0.05$, so that we are trying to vacuum up all of the $\theta = 5$ types, it is worth taking more of a risk with high $\sigma$ types as long as their $y_i$ is reasonably large.

The crossing of the selection boundaries and the non-nestedness of the selection regions is closely tied up with the tail probability criterion and the $\alpha$ dependent feature of the hypothesis.

Figure 4.4. Oracle selection boundaries (with just the capacity constraint) for several $\alpha$ levels for the tail probability criterion (left panel) and the posterior mean criterion (right panel), for a discrete example with $G = 0.85\,\delta_{-1} + 0.10\,\delta_{2} + 0.05\,\delta_{5}$ and $\sigma \sim U[1/2, 3]$.
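The mechanism behind this reversal can be exhibited with two hypothetical units under a discrete $G$ of this form: a high-variance unit with a large observation outranks a low-variance unit with a moderate observation when the target group is $\{\theta = 5\}$, but the ordering flips once $\theta_\alpha$ drops to 2. All numbers below are illustrative.

```python
import numpy as np

def v_alpha(y, sigma, atoms, probs, theta_alpha):
    # posterior tail probability under a discrete mixing distribution G
    lik = probs * np.exp(-0.5 * ((y - atoms) / sigma) ** 2) / sigma
    return lik[atoms >= theta_alpha].sum() / lik.sum()

atoms = np.array([-1.0, 2.0, 5.0])
probs = np.array([0.85, 0.10, 0.05])

# at alpha = 0.05 the target group is {theta = 5}, so theta_alpha = 5;
# relaxing to alpha = 0.06 drops the 1 - alpha quantile of G to theta_alpha = 2
A = (4.0, 3.0)   # large observation, high variance
B = (2.5, 0.5)   # moderate observation, low variance
rank_05 = (v_alpha(*A, atoms, probs, 5.0), v_alpha(*B, atoms, probs, 5.0))
rank_06 = (v_alpha(*A, atoms, probs, 2.0), v_alpha(*B, atoms, probs, 2.0))
```

Unit A ranks above unit B under the $\theta_\alpha = 5$ criterion, and below it under the $\theta_\alpha = 2$ criterion, which is exactly the kind of rank reversal that produces crossing capacity boundaries.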
If we repeat our exercise with the same $G$ and $\sigma$ distribution, but select according to posterior means, we get the nested selection boundaries illustrated in the right panel of Figure 4.4. Proposition 4.6 establishes this to be a general phenomenon for any distribution $G$. It should be noted that the crossing of selection boundaries we have illustrated seems to have been anticipated by Henderson and Newton (2016), who consider similar tail criteria. They propose a ranking scheme that assigns rank equal to the smallest $\alpha$ for which a unit would be selected, as a way to resolve the ambiguities generated by crossing. We do not see a compelling decision theoretic rationale for this revised ranking rule; instead we prefer to maintain some separation between the ranking and selection problems and to focus on risk assessment as a way to reconcile them.

The risk based on the loss function defined in (3.1) clearly depends on $\alpha$ and $\gamma$. More specifically, it consists of three pieces: the leading term has the interpretation of a "missed discovery" probability, which we try to minimize, and the second and third pieces correspond to the FDR and capacity constraints, respectively, each weighted by a Lagrangian multiplier. Focusing on the first term, we have
$$E[H_i(1 - \delta_i)] = E[\mathbb{1}\{\theta_i \ge \theta_\alpha\}(1 - \delta_i)] = P[\theta_i \ge \theta_\alpha] - E[\delta_i \mathbb{1}\{\theta_i \ge \theta_\alpha\}] = P[\theta_i \ge \theta_\alpha] - \int\!\!\int_{\theta_\alpha}^{+\infty} \big[1 - \Phi\big((t_\alpha(\lambda, \sigma) - \theta)/\sigma\big)\big]\, dG(\theta)\, dH(\sigma)$$
where $\lambda$, which depends on $\alpha$ and $\gamma$, is determined by either the false discovery rate control or the capacity constraint, whichever binds.

A feature of discrete mixing distributions, $G$, is that the first term in the loss, $P(\theta_i \ge \theta_\alpha)$, is piece-wise constant, with jumps occurring only at discontinuity points of $G$, while the second term depends on both $\theta_\alpha$ and the cut-off values $t_\alpha(\lambda, \sigma)$. When the capacity constraint binds, there exist ranges of $\alpha$ such that $\theta_\alpha$ remains constant, while $t_\alpha(\lambda, \sigma)$ decreases for each $\sigma$; hence the risk with just the capacity constraint binding is a decreasing function of $\alpha$ on the interval $(0.05, 0.15)$. When the FDR constraint binds, in contrast, the risk stays flat over such ranges because $t_\alpha(\lambda, \sigma)$ is constant. To see this, recall that the cutoff $\lambda$ determined by the FDR constraint is defined as $E[(1 - v_\alpha(Y, \sigma))\mathbb{1}\{v_\alpha(Y, \sigma) \ge \lambda\}] = \gamma P(v_\alpha(Y, \sigma) \ge \lambda)$, so when $\theta_\alpha$ is constant over a range of $\alpha$ the distribution of $v_\alpha(Y, \sigma)$ does not change, and consequently the value $\lambda$ is constant over that range of $\alpha$.

Figure 4.5 evaluates the risk of the optimal selection rule for various $\alpha$ and FDR levels, $\gamma \in \{0.01, 0.05, 0.1, 0.15, 0.3\}$. The solid curves correspond to the risk evaluated at the optimal Bayes rule defined in Proposition 4.1. The dotted line corresponds to the risk evaluated at the Bayes rule when only the capacity constraint is imposed. As $\gamma$ increases, the risk decreases, as expected. For FDR levels as stringent as $\gamma = 0.01$, there is still a range of $\alpha$ for which the capacity constraint becomes binding, and the risk decreases after the initial jump at $\alpha = 0.05$.

Figure 4.5. Oracle risk evaluation (missed discovery probability plotted against $\alpha$) for several $\alpha$ levels for the tail probability criterion, for a discrete example with $G = 0.85\,\delta_{-1} + 0.10\,\delta_{2} + 0.05\,\delta_{5}$ and $\sigma \sim U[1/2, 3]$; curves correspond to FDR levels $\gamma \in \{0.01, 0.05, 0.1, 0.15, 0.3\}$ and to the capacity-only rule.
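The leading "missed discovery" term is straightforward to estimate by Monte Carlo for a capacity-only rule. A sketch under a hypothetical discrete $G$ and a hypothetical uniform $\sigma$ distribution (all numbers illustrative):

```python
import numpy as np

rng = np.random.default_rng(7)
atoms = np.array([-1.0, 2.0, 5.0])
probs = np.array([0.85, 0.10, 0.05])

n = 50_000
theta = rng.choice(atoms, size=n, p=probs)
sigma = rng.uniform(0.5, 3.0, size=n)
y = theta + sigma * rng.standard_normal(n)

def tail_prob(y, sigma, theta_alpha):
    # vectorized posterior tail probability under the discrete G
    lik = probs * np.exp(-0.5 * ((y[:, None] - atoms) / sigma[:, None]) ** 2) / sigma[:, None]
    return lik[:, atoms >= theta_alpha].sum(axis=1) / lik.sum(axis=1)

alpha, theta_alpha = 0.05, 5.0
v = tail_prob(y, sigma, theta_alpha)
delta = v >= np.quantile(v, 1.0 - alpha)            # capacity constraint only
missed = np.mean((theta >= theta_alpha) & ~delta)   # MC estimate of E[H_i(1 - delta_i)]
```

Rerunning this with different values of `alpha` (and with an additional FDR cutoff on `v`) traces out curves of the kind shown in Figure 4.5.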
Although selection based on the tail probability criterion can leadto non-nested selection regions, we conclude this sub-section by demonstrating thatposterior mean selection is necessarily nested. Proposition 4.6. Let the density function of y conditional on θ and σ be denoted as f ( y | θ, σ ) . If selection is based on the posterior mean, δ i = { M ( y, σ ) ≥ c ( α, γ ) } with M ( y, σ ) := (cid:82) θf ( y | θ, σ ) dG ( θ ) (cid:82) f ( y | θ, σ ) dG ( θ ) and c ( α, γ ) is chosen to satisfy both the capacity constraint at level α and the FDRconstraint at level γ , then the selection regions, defined as Λ α,γ = { ( y, σ ) : M ( y, σ ) ≥ c ( α, γ ) } , are nested, that is, for any α > α , Λ α ,γ ⊆ Λ α ,γ . Heterogeneous unknown variances Assuming that the σ ’s are known, up to a common scale parameter, may beplausible in some applications such as baseball batting averages, but it is frequentlymore plausible to adopt the view that we are simply confronted with an estimateavailable perhaps from longitudinal data. In such cases we need to consider the pairs,( y i , S i ) as potentially jointly dependent random variables arising from the longitudinalmodel, Y it = θ i + σ i (cid:15) it , (cid:15) it ∼ iid N (0 , , ( θ i , σ i ) ∼ G, with sufficient statistics, Y i = T − i (cid:80) T i t =1 Y it and S i = ( T i − − (cid:80) T i t =1 ( Y it − Y i ) ,for ( θ i , σ i ). 
Conditional on $(\theta_i, \sigma_i)$, we have $\bar{Y}_i \mid \theta_i, \sigma_i \sim \mathcal{N}(\theta_i, \sigma_i^2/T_i)$, and $S_i \mid \sigma_i$ is distributed as Gamma with shape parameter $r_i = (T_i - 1)/2$, scale parameter $\sigma_i^2/r_i$, and density function denoted $\Gamma(S_i \mid r_i, \sigma_i^2/r_i)$.

Given the loss function (3.1) and defining $\theta_\alpha$ by $\alpha = P(\theta_i \ge \theta_\alpha) = \int\!\int_{\theta_\alpha}^{+\infty} dG(\theta, \sigma)$, the conditional risk is
$$E_{\theta \mid Y, S}\big[L(\delta, \theta)\big] = \sum_{i=1}^n (1 - \delta_i)\, v_\alpha(Y_i, S_i) + \tau_1\Big(\sum_{i=1}^n \big\{\delta_i(1 - v_\alpha(Y_i, S_i)) - \gamma\delta_i\big\}\Big) + \tau_2\Big(\sum_{i=1}^n \delta_i - \alpha n\Big)$$
with
$$v_\alpha(y_i, s_i) = P(\theta_i \ge \theta_\alpha \mid Y_i = y_i, S_i = s_i) = \frac{\int\!\int_{\theta_\alpha}^{+\infty} \Gamma(s_i \mid r_i, \sigma^2/r_i)\, \varphi\big((y_i - \theta)/\sqrt{\sigma^2/T_i}\big)\big/\sqrt{\sigma^2/T_i}\, dG(\theta, \sigma)}{\int\!\int \Gamma(s_i \mid r_i, \sigma^2/r_i)\, \varphi\big((y_i - \theta)/\sqrt{\sigma^2/T_i}\big)\big/\sqrt{\sigma^2/T_i}\, dG(\theta, \sigma)}.$$
Taking expectations with respect to the joint distribution of $(Y, S)$, the optimal Bayes rule solves
$$\min_\delta\; E\Big[\sum_{i=1}^n (1 - \delta_i)\, v_\alpha(y_i, s_i)\Big] + \tau_1\Big(E\Big[\sum_{i=1}^n \big\{(1 - v_\alpha(y_i, s_i))\delta_i - \gamma\delta_i\big\}\Big]\Big) + \tau_2\Big(E\Big[\sum_{i=1}^n \delta_i\Big] - \alpha n\Big).$$
Before characterizing the Bayes rule, we should observe that when the variances $\sigma_i^2$ are not directly observed, the tail probability $v_\alpha(Y, S)$ may no longer have the monotonicity property we have described above.

Lemma 5.1. Consider the transformation $v_\alpha(Y, S) = P(\theta \ge \theta_\alpha \mid Y, S)$; then for fixed $S = s$, the function $v_\alpha(Y, s)$ may not be monotone in $Y$, and for fixed $Y = y$, the function $v_\alpha(y, S)$ may not be monotone in $S$.

Proposition 5.2.
For a pre-specified pair $(\alpha, \gamma)$ such that $\gamma < 1 - \alpha$, the optimal Bayes selection rule takes the form $\delta_i^* = \mathbb{1}\{v_\alpha(Y, S) \ge \lambda^*(\alpha, \gamma)\}$ where $\lambda^*(\alpha, \gamma) = \max\{\lambda_1^*(\alpha, \gamma), \lambda_2^*(\alpha)\}$ with
$$\lambda_1^*(\alpha, \gamma) = \min\big\{\lambda : E\big[(1 - v_\alpha(Y, S) - \gamma)\,\mathbb{1}\{v_\alpha(Y, S) \ge \lambda\}\big] \le 0\big\}$$
and
$$\lambda_2^*(\alpha) = \min\big\{\lambda : P(v_\alpha(Y, S) \ge \lambda) - \alpha \le 0\big\}.$$
Based on the optimal Bayes rule, the selected set is defined as $\Omega_{\alpha,\gamma} = \{(Y, S) : v_\alpha(Y, S) \ge \lambda^*(\alpha, \gamma)\}$. Note that for each prespecified pair $(\alpha, \gamma)$, $\Omega_{\alpha,\gamma}$ is just the $\lambda^*(\alpha, \gamma)$-superlevel set of the function $v_\alpha(Y, S)$. For any $\alpha_2 > \alpha_1$, nestedness of the selected sets would mean that the $\lambda^*(\alpha_1, \gamma)$-superlevel set of the function $v_{\alpha_1}$ must be a subset of the $\lambda^*(\alpha_2, \gamma)$-superlevel set of the function $v_{\alpha_2}$.

Remark. The construction and the form of the optimal selection rule may appear very similar to the case where $\sigma_i$ is observed. However, the crucial difference is that we no longer require independence between $\theta$ and $\sigma$ in this section. In contrast, when $\sigma_i$ is assumed to be directly observed, the independence assumption is critical for all the derivations. For instance, the non-null proportion, defined as $P(\theta_i \ge \theta_\alpha)$, would have to change with $\sigma_i$ if we allowed the distribution of $\theta$ to depend on $\sigma$.

5.1. Two Examples. In this subsection we discuss two explicit examples. In the first example, $G$ is the classical conjugate prior for the Gaussian model with unknown means and variances.

Example. Suppose we have balanced panel data $y_{i1}, \ldots, y_{iT} \sim \mathcal{N}(\theta, \sigma^2)$ with sample means $\bar{Y}_i = T^{-1}\sum_t y_{it}$ and sample variances $S_i = (T - 1)^{-1}\sum_t (y_{it} - \bar{Y}_i)^2$. Further, suppose that $G(\theta, \sigma^2)$ takes the normal-inverse-chi-squared form, $\text{NIX}(\theta_0, \kappa_0, \nu_0, \sigma_0^2) = \mathcal{N}(\theta \mid \theta_0, \sigma^2/\kappa_0)\, \chi^{-2}(\sigma^2 \mid \nu_0, \sigma_0^2)$.
Integrating out $\sigma^2$, the marginal distribution of $\theta$ is a generalized $t$ distribution,
$$\frac{\theta - \theta_0}{\sigma_0/\sqrt{\kappa_0}} \sim t_{\nu_0},$$
where $t_{\nu_0}$ is the $t$-distribution with $\nu_0$ degrees of freedom. Therefore, the $1 - \alpha$ quantile of $\theta$, denoted $\theta_\alpha$, is simply
$$\theta_\alpha = \theta_0 + \frac{\sigma_0}{\sqrt{\kappa_0}}\, F^{-1}_{t_{\nu_0}}(1 - \alpha),$$
where $F^{-1}_{t_{\nu_0}}$ denotes the quantile function of $t_{\nu_0}$. Conjugacy of the distribution $G$ implies that the posterior distribution of $(\theta, \sigma^2 \mid \bar{Y}, S)$ follows $\text{NIX}(\theta_T, \kappa_T, \nu_T, \sigma_T^2) = \mathcal{N}(\theta \mid \theta_T, \sigma^2/\kappa_T)\, \chi^{-2}(\sigma^2 \mid \nu_T, \sigma_T^2)$ with
$$\nu_T = \nu_0 + T, \qquad \kappa_T = \kappa_0 + T, \qquad \theta_T = \frac{\kappa_0\theta_0 + T\bar{Y}}{\kappa_T}, \qquad \sigma_T^2 = \frac{1}{\nu_T}\Big(\nu_0\sigma_0^2 + (T - 1)S + \frac{T\kappa_0}{\kappa_0 + T}(\theta_0 - \bar{Y})^2\Big).$$
Integrating out $\sigma^2$, the marginal posterior of $\theta$ again follows a generalized $t$-distribution,
$$\frac{\theta - \theta_T}{\sigma_T/\sqrt{\kappa_T}} \sim t_{\nu_T},$$
where $t_{\nu_T}$ is the $t$-distribution with $\nu_T$ degrees of freedom. It is then clear that the posterior mean of $\theta$ is simply a linear function of $\bar{Y}$, independent of $S$,
$$E[\theta \mid \bar{Y}, S] = \theta_T = \frac{\kappa_0\theta_0 + T\bar{Y}}{\kappa_T},$$
and the posterior tail probability is given by
$$v_\alpha(\bar{Y}, S) = P(\theta \ge \theta_\alpha \mid \bar{Y}, S) = P\Big(\frac{\theta - \theta_T}{\sigma_T/\sqrt{\kappa_T}} \ge \frac{\theta_\alpha - \theta_T}{\sigma_T/\sqrt{\kappa_T}} \,\Big|\, \bar{Y}, S\Big) = 1 - F_{t_{\nu_T}}\Big(\frac{\theta_\alpha - \theta_T}{\sigma_T/\sqrt{\kappa_T}}\Big).$$
To illustrate this case, suppose $\theta_0 = 0$, $\kappa_0 = 1$, $\sigma_0^2 = 1$, $\nu_0 = 6$ and $T = 9$. It can be verified that $v_\alpha(\bar{Y}, S)$ is in fact a monotone function of $\bar{Y}$ for each fixed $S$ and any $\alpha > 0$, hence in this example we can invert the function $v_\alpha(y, s)$ to obtain the level curves. The left panel of Figure 5.1 shows the level curves of $v_\alpha(\bar{Y}, S)$ and $E(\theta \mid \bar{Y}, S)$ for $\alpha = 5\%$. It is clear that the posterior mean is a constant function of $S$, while the posterior tail probability exhibits more exotic behaviour with respect to $S$, especially for more extreme values of $\bar{Y}$. If we fix $S = s_0$, then $v_\alpha(\bar{Y}, s_0)$ is an increasing function of $\bar{Y}$.
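The conjugate updating is easy to code directly; a minimal sketch using the illustrative hyperparameters from the text ($\theta_0 = 0$, $\kappa_0 = 1$, $\nu_0 = 6$, $\sigma_0^2 = 1$, $T = 9$), showing that the posterior mean ignores $S$ while the posterior scale does not:

```python
def nix_update(ybar, S, T, theta0=0.0, kappa0=1.0, nu0=6.0, s0sq=1.0):
    """Normal-inverse-chi-squared conjugate update from (Ybar, S);
    defaults are the illustrative hyperparameters used in the text."""
    nuT = nu0 + T
    kappaT = kappa0 + T
    thetaT = (kappa0 * theta0 + T * ybar) / kappaT
    sTsq = (nu0 * s0sq + (T - 1) * S
            + T * kappa0 / (kappa0 + T) * (theta0 - ybar) ** 2) / nuT
    return thetaT, kappaT, nuT, sTsq

T = 9
# the posterior mean depends on Ybar alone ...
pm_a = nix_update(1.5, 0.7, T)[0]
pm_b = nix_update(1.5, 2.9, T)[0]
# ... while the tail probability depends on S through sigma_T^2
scale_a = nix_update(1.5, 0.7, T)[3]
scale_b = nix_update(1.5, 2.9, T)[3]
```

Computing $v_\alpha(\bar{Y}, S)$ from these updated hyperparameters then only requires a $t_{\nu_T}$ distribution function, e.g. from a statistics library.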
On the other hand, fixing $\bar{Y} = y_0$: for small $y_0$, $v_\alpha(y_0, S)$ is an increasing function of $S$, while for large $y_0$, $v_\alpha(y_0, S)$ becomes a decreasing function of $S$.

A capacity constraint of size $\alpha$ implies the thresholding rule $P(v_\alpha(\bar{Y}, S) \ge \lambda_2^*) = \alpha$, while FDR control at level $\gamma$ leads to a cutoff value $\lambda_1^*$ defined as
$$\gamma = E\big[(1 - v_\alpha(\bar{Y}, S))\,\mathbb{1}\{v_\alpha(\bar{Y}, S) \ge \lambda_1^*\}\big]\big/P(v_\alpha(\bar{Y}, S) \ge \lambda_1^*).$$
The larger of the two thresholds, denoted $\lambda^* = \max\{\lambda_1^*, \lambda_2^*\}$, defines the selection region based on posterior tail probability ranking, $\Omega_{\alpha,\gamma} = \{(\bar{Y}, S) : v_\alpha(\bar{Y}, S) \ge \lambda^*\}$. For $\alpha = 5\%$ and $\gamma = 10\%$, the selection region based on the tail probability rule is $\{(\bar{Y}, S) : v_\alpha(\bar{Y}, S) \ge \lambda^*\}$, while the posterior mean ranking yields a region of the form $\{(\bar{Y}, S) : E[\theta \mid \bar{Y}, S] \ge \omega^*\}$ for the corresponding posterior mean cutoff $\omega^*$. These selection boundaries are depicted as the red dashed line and the black solid line, respectively, in the right panel of Figure 5.1. In this case, the FDR constraint binds; if only the capacity constraint were in place, we would have a cut-off for the tail probability of 0.40 and a cut-off for the posterior mean of 1.84. Figure 5.2 further shows a comparison of the selected sets based on a sample realization from the model.

In Appendix B we consider a more complex bivariate discrete example that illustrates somewhat more exotic behavior of the decision boundaries and compares the performance of several different ranking and selection rules.

5.2. Variants of the unknown variance model. We have assumed that the only scale heterogeneity is driven by $\sigma_i$ in the above model, but often there may be more heteroskedasticity that should be allowed in $\epsilon_{it}$.
Here we consider the variant
$$Y_{it} = \theta_i + \sigma_i \epsilon_{it}, \qquad \epsilon_{it} \sim \mathcal{N}(0, 1/w_{it}), \qquad (\theta_i, \sigma_i) \sim G,$$
where we also assume that the $w_{it} \sim H$ are known quantities, independent of $(\theta_i, \sigma_i)$. Denoting $w_i = \sum_{t=1}^{T_i} w_{it}$, the sufficient statistics now take the form $\bar{Y}_i = \sum_{t=1}^{T_i} w_{it} Y_{it}/w_i$ and $S_i = (T_i - 1)^{-1}\sum_{t=1}^{T_i} w_{it}(Y_{it} - \bar{Y}_i)^2$. It is easy to show that $\bar{Y}_i \mid \theta_i, \sigma_i \sim \mathcal{N}(\theta_i, \sigma_i^2/w_i)$ and that $S_i \mid \sigma_i$ follows a Gamma distribution with shape parameter $r_i = (T_i - 1)/2$ and scale parameter $\sigma_i^2/r_i$. The decision rules now become functions of the tuple $(\bar{Y}_i, S_i, w_i)$; for instance, the tail probability can be specified as
$$v_\alpha(y_i, s_i, w_i) = P(\theta_i \ge \theta_\alpha \mid y_i, s_i, w_i) = \frac{\int\!\int_{\theta_\alpha}^{+\infty} f(y_i \mid \theta, \sigma^2/w_i)\, \Gamma(s_i \mid r_i, \sigma^2/r_i)\, dG(\theta, \sigma)}{\int\!\int_{-\infty}^{+\infty} f(y_i \mid \theta, \sigma^2/w_i)\, \Gamma(s_i \mid r_i, \sigma^2/r_i)\, dG(\theta, \sigma)}.$$

Figure 5.1. The left panel shows level curves of the posterior mean (red dashed lines) and of the posterior tail probability (black solid lines) for the normal model with $(\theta, \sigma^2) \sim \text{NIX}(0, 1, 6, 1)$ and panel time dimension $T = 9$. The right panel shows the boundary of the selection region based on posterior mean ranking (red dashed line) and on posterior tail probability ranking (solid black line) with $\alpha = 5\%$ and $\gamma = 10\%$.
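The weighted sufficient statistics can be formed in a few lines; a sketch with hypothetical unit-level parameters, where the weighted sum of squares is used in $S_i$ so that its Gamma form is preserved:

```python
import numpy as np

rng = np.random.default_rng(3)
T = 8
theta_i, sigma_i = 1.0, 1.5            # hypothetical unit-level parameters
w = rng.uniform(0.5, 2.0, size=T)      # known observation weights w_it

# Y_it = theta_i + sigma_i * eps_it with eps_it ~ N(0, 1/w_it)
Y = theta_i + sigma_i * rng.standard_normal(T) / np.sqrt(w)

w_tot = w.sum()
Ybar = (w * Y).sum() / w_tot                   # precision-weighted mean
S = (w * (Y - Ybar) ** 2).sum() / (T - 1)      # weighted dispersion statistic

# with unit weights these reduce to the ordinary sample mean and variance
w1 = np.ones(T)
Y1 = theta_i + sigma_i * rng.standard_normal(T)
Ybar1 = (w1 * Y1).sum() / w1.sum()
S1 = (w1 * (Y1 - Ybar1) ** 2).sum() / (T - 1)
```

The final lines verify the natural sanity check that the weighted statistics collapse to the usual homoskedastic ones when all $w_{it} = 1$.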
u and Koenker 25 lll llll ll ll ll l l ll l lll l lll lll lll ll ll ll ll lll llll l ll ll l l ll llllll l ll l lll l lll ll ll ll lll l l llll llll l ll llll ll llll ll l llll l l lll ll llll lll ll l l lll ll ll lll ll ll ll ll llll l ll ll lll lll lllllll l l lll llll l ll lll lll lll lll l lllll ll l lll lll lll ll l ll lll lll l lll lll lll l l ll ll lll l ll lll ll ll l llll ll lll l lll l lllll l l lll lll ll lll lll llll ll ll lll ll ll l l llll lll l ll l l llllll ll lll ll lll ll llll l ll ll ll lll l ll l ll l ll lll l llll l ll l lll l ll ll lll llll llll l ll llll ll ll l llll ll lll ll l ll l ll lll ll ll l llll lll ll l lll l lll lll ll lllll llll l llll llll ll ll ll llll l lllll ll ll ll lll lll l llll l llll ll lll l lll llll ll l ll ll lll ll l ll l llllll l ll lll lll lll l lll l ll l llll lll lll ll l ll ll ll ll ll l l ll lll ll ll l ll ll l llll ll l ll l l lll ll l lllll ll lll ll ll ll lll l ll lllll ll ll ll lll llll l ll ll llll lll l lll ll l ll l llll ll lll ll l lll lll ll lll ll lll ll l lll llll ll ll ll l ll ll lll ll l ll ll ll lll lllll ll ll ll l ll ll llll ll lll l ll ll ll llll lll l ll l lllll l ll ll ll llll lll ll lll lllll ll l llll l l ll ll ll l ll ll lll l ll ll lll ll l llll l l l llll l lll ll lll lll l ll ll l ll ll ll ll lllll ll ll l ll ll l ll l llllll llll l l llll l l ll l l llll lll ll l ll lll ll ll lll l ll l lllll llll lll llll l lll llll llll l ll ll l l ll lll lll ll l ll lll l llllll l ll l lll l ll llll lll ll ll ll llll lll l ll lllll llll l l ll l ll l l lll ll ll ll lll ll l lll ll ll l llll lll ll lll l ll ll ll ll llll llll l llll lll ll ll lll ll l ll l l lll lll l lllll l lll l ll ll ll llll ll lll l ll ll lllll ll lll lll ll ll lll ll l l ll l lllll ll l ll ll lll lll llll ll l llll ll l lll lll l l l lll ll ll l ll l l l llll ll lll ll lll l llll ll lll l lll l l llll lll llll ll ll ll l l ll l ll ll lll l llll lllll l ll ll lll ll lll ll ll lll llll ll lll lll llll ll ll lll ll lll ll llll ll l 
Figure 5.2.
Selection set comparison for one sample realization from the normal model with (θ, σ²) ∼ NIX(0, ·, ·, 1) and panel time dimension T = 9: the left panel shows in grey circles the elements selected by both the posterior mean and posterior tail probability criteria under the capacity constraint; extra elements selected by the posterior mean are marked in green, and extra elements selected by the posterior tail probability rule are marked in red. The right panel shows the comparison of the selected sets under both the capacity and FDR constraints with α = 5% and γ = 10%.

The posterior mean takes the form

E[θ_i | y_i, s_i, w_i] = ∫ θ f(y_i, s_i | θ, σ², w_i) dG(θ, σ²) / ∫ f(y_i, s_i | θ, σ², w_i) dG(θ, σ²).

The threshold values under either the capacity or the FDR constraint can be worked out in a similar fashion. For any ranking statistic δ(Y_i, S_i, w_i) together with a decision rule 1{δ(Y_i, S_i, w_i) ≥ λ}, the capacity constraint requires choosing a threshold λ*(α) such that

α = ∫∫ 1{δ(y, s, w) ≥ λ*(α)} f(y, s | θ, σ², w) dG(θ, σ²) dH(w),

while the threshold that in addition controls the FDR at level γ requires solving for λ*(α, γ) such that

γ = P(δ(y, s, w) ≥ λ*(α, γ); θ < θ_α) / P(δ(y, s, w) ≥ λ*(α, γ)),

which can be represented further as

γ = ∫∫ 1{δ(y, s, w) ≥ λ*(α, γ)} (1 − α) f₀(y, s | θ, σ², w) dG(θ, σ²) dH(w) / ∫∫ 1{δ(y, s, w) ≥ λ*(α, γ)} f(y, s | θ, σ², w) dG(θ, σ²) dH(w),

where f₀(y, s | θ, σ², w) is the density of (y, s) under the null hypothesis θ < θ_α. We can again consider selection regions such as those plotted in Figure 5.1 and Figure B.1 to appreciate how different decision criteria determine the selection.
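To make the contrast between the two ranking criteria concrete, the following sketch (our own illustration, not code from the paper; the three-point prior and the two units are made up) computes the posterior mean and the posterior tail probability under a discrete prior, showing that with heterogeneous variances the two criteria can order units differently:

```python
import numpy as np

def posterior(y, sigma, support, weights):
    """Posterior over a discrete prior G when Y ~ N(theta, sigma^2)."""
    p = weights * np.exp(-0.5 * ((y - support) / sigma) ** 2) / sigma
    return p / p.sum()

# hypothetical three-point prior; the top group is theta >= theta_alpha = 4
support = np.array([0.0, 3.0, 4.0])
weights = np.full(3, 1.0 / 3.0)

pA = posterior(3.0, 0.2, support, weights)  # unit A: precise, centered at 3
pB = posterior(2.0, 2.0, support, weights)  # unit B: noisy

pm_A, pm_B = support @ pA, support @ pB                        # posterior means
v_A, v_B = pA[support >= 4.0].sum(), pB[support >= 4.0].sum()  # tail probabilities
# the posterior mean ranks A above B, the tail probability ranks B above A
```

The precise unit has a high posterior mean but essentially no chance of lying in the top group, while the noisy unit retains appreciable tail probability; this is exactly the disagreement visible in the figure.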
As soon as the ranking statistic depends on w, the selection region of the thresholding rule 1{δ(y, s, w) ≥ λ*} will also depend on the magnitude of w.

6. Asymptotic Adaptivity

The previous sections propose the optimal Bayes rule for minimizing the expected number of missed discoveries subject to both capacity and FDR constraints under various modeling environments. In each of these environments, the optimal Bayes rule takes the form δ* = 1{v_α ≥ λ*}, where v_α is defined as the posterior probability of θ ≥ θ_α conditional on the data. The threshold λ* is defined to satisfy both the capacity and FDR constraints. The optimal Bayes rule involves several unknown quantities, in particular the v_α's and the optimal threshold, λ*, that require knowledge of the distribution of θ_i, or of the joint distribution of (θ_i, σ_i) when the variances are latent. To estimate this distribution of the latent variables, we propose a plug-in procedure very much in the spirit of the empirical Bayes methods pioneered by Robbins (1956). In this section we also establish that the resulting feasible rules achieve asymptotic validity and asymptotically attain the same performance as the infeasible Bayes rule.

We begin by discussing properties of the Oracle procedure, assuming that v_α is known and we only need to estimate the optimal threshold. We establish asymptotic validity of this Oracle procedure and then propose a plug-in method for both v_α and the threshold, thereby establishing the asymptotic validity of the empirical rule. Before presenting the formal results, we introduce the regularity conditions that will be required. We distinguish two cases depending on whether the σ_i's are observed.

Assumption 1. (1) (Variances observed) {Y_i, σ_i, θ_i} are independent and identically distributed, with σ_i and θ_i independent.
The random variables θ_i and σ_i have positive densities with respect to Lebesgue measure on a compact set Θ ⊂ ℝ and on [σ̲², σ̄²] respectively, for some σ̲² > 0 and σ̄² < +∞. (2) (Variances unobserved) Let S_i² be an individual sample variance based on T repeated measurements and Y_i the corresponding sample mean, with T ≥ 2. Suppose further that {Y_i, S_i², θ_i, σ_i²} are independent and identically distributed and that the random variables {θ_i, σ_i²} have a joint distribution G with a joint density positive everywhere on its support.

6.1. Optimal thresholding. Whether σ_i is observed or estimated, the optimal threshold can be defined in a unified manner by λ* = max{λ_1*, λ_2*} with

λ_1* = inf{t ∈ (0, 1) : H_v(t) ≥ 1 − α}
λ_2* = inf{t ∈ (0, 1) : Q(t) ≤ γ}

where H_v denotes the cumulative distribution function of either v_α(y_i, σ_i) or v_α(y_i, s_i), induced by the marginal distribution of the data, either the pair {y_i, σ_i} when variances are observed or the pair {y_i, s_i} otherwise. Hence λ_1* is the 1 − α quantile of H_v. The function Q(t) is defined as Q(t) = E[(1 − v_α)1{v_α ≥ t}] / E[1{v_α ≥ t}]. Its formulation recalls Proposition 5.2, and the existence of λ_2* is guaranteed as long as α < 1 − γ. These thresholds are also equivalent to those defined in Proposition 3.3 and Proposition 4.1. In particular, the thresholds t_1* and t_2* in Proposition 3.3 are cast in terms of Y directly, and it is easy to see that λ_j* = v_α(t_j*) for j = 1, 2; λ_1* and λ_2* in Proposition 4.1 result from invoking the monotonicity of v_α(y, σ) with respect to y for each fixed value of σ. The function Q(t) is the mFDR of the procedure δ = 1{v_α ≥ t} for any α ∈ (0, 1) and threshold t. Monotonicity of Q(t) is crucial to justify this thresholding procedure, ensuring that either the capacity constraint or the mFDR constraint must be binding.
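To make the inversion of the two constraints concrete, the following sketch (synthetic tail probabilities, our own illustration) computes the empirical analogues of H_v and Q and the threshold λ_n = max{λ_1n, λ_2n}:

```python
import numpy as np

def select(v, alpha, gamma):
    """lam_n = max(lam_1n, lam_2n): lam_1n inverts the capacity constraint
    H_n(t) >= 1 - alpha; lam_2n inverts the estimated mFDR, Q_n(t) <= gamma."""
    v = np.asarray(v)
    vs = np.sort(v)
    lam1 = vs[int(np.ceil((1.0 - alpha) * len(v))) - 1]   # 1-alpha quantile of v
    grid = np.unique(vs)
    Q = np.array([(1.0 - v[v >= t]).mean() for t in grid])  # Q_n(t)
    passing = grid[Q <= gamma]
    lam2 = passing[0] if passing.size else 1.0
    lam = max(lam1, lam2)
    return v >= lam, lam

# synthetic posterior tail probabilities: a null-heavy and a signal-rich group
rng = np.random.default_rng(1)
v = np.concatenate([rng.beta(1, 10, 900), rng.beta(10, 1, 100)])
delta, lam = select(v, alpha=0.05, gamma=0.10)
```

Since raising t drops the smallest v's from the average of 1 − v, Q_n is nonincreasing, so once the mFDR constraint is met it stays met and the larger of the two thresholds is the binding one, mirroring the monotonicity argument above.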
Cao, Sun, and Kosorok (2013) have observed that a sufficient condition for monotonicity for a broad class of multiple testing procedures is that the ratio of the densities of the ranking statistic under the null and the alternative be monotone, and they discuss the consequences of violations of this condition. For the posterior tail probability criterion this monotone likelihood ratio condition, as we will see, can be verified directly.

To see this, recall that mFDR is defined as Σ_{i=1}^n P[δ_i = 1, θ_i < θ_α] / Σ_{i=1}^n P(δ_i = 1). It suffices to show that P[δ_i = 1, θ_i < θ_α] = E[(1 − v_{α,i})δ_i]. Note that v_{α,i} = P[θ_i ≥ θ_α | D_i] = α f_1(D_i)/f(D_i), where D_i is the individual data, either {y_i, σ_i} or {y_i, s_i} depending on the model, f_1 is the marginal density of the data when θ_i ≥ θ_α, and f is the marginal density of D_i. Then it is easy to see that P[δ_i = 1, θ_i < θ_α] = (1 − α) ∫_{v_{α,i} ≥ t} f_0(D_i) dD_i = ∫_{v_{α,i} ≥ t} (1 − v_{α,i}) f(D_i) dD_i = E[(1 − v_{α,i})1{v_{α,i} ≥ t}], and hence Q(t) = ∫_t^1 (1 − v) h_v(v) dv / ∫_t^1 h_v(v) dv, where h_v is the density function of v_α. Monotonicity of Q(t) can then be verified by showing that the derivative of the right-hand side with respect to t is nonpositive.

6.2. Oracle Procedure. The only unknown quantity in the Oracle procedure is the threshold, and we now discuss how to estimate it to achieve asymptotic validity. H_v and Q can be estimated by the following quantities:

H_n(t) = (1/n) Σ_{i=1}^n 1{v_{α,i} ≤ t}
Q_n(t) = Σ_{i=1}^n (1 − v_{α,i})1{v_{α,i} ≥ t} / Σ_{i=1}^n 1{v_{α,i} ≥ t}

and the associated thresholds are then defined as

λ_1n = inf{t ∈ [0, 1] : H_n(t) ≥ 1 − α}
λ_2n = inf{t ∈ [0, 1] : Q_n(t) ≤ γ}
λ_n = max{λ_1n, λ_2n}.

Theorem 6.1.
(Asymptotic validity of the Oracle procedure) Under Assumption 1, the procedure δ_i = 1{v_{α,i} ≥ λ_n} asymptotically controls the false discovery rate below γ and the expected proportion of rejections below α for any (α, γ) ∈ [0, 1]² with γ < 1 − α as n → ∞; more specifically,

lim sup_{n→∞} E[ Σ_{i=1}^n 1{θ_i < θ_α, v_{α,i} ≥ λ_n} / (Σ_{i=1}^n 1{v_{α,i} ≥ λ_n} ∨ 1) ] ≤ γ
lim sup_{n→∞} E[ (1/n) Σ_{i=1}^n 1{v_{α,i} ≥ λ_n} ] ≤ α.

6.3. Adaptive procedure. In practice the posterior tail probability also involves the unknown quantity θ_α = G⁻¹(1 − α) that needs to be estimated. We propose a plug-in estimator in the spirit of the empirical Bayes method: estimating G by its nonparametric maximum likelihood estimator Ĝ_n and estimating θ_α by its 1 − α quantile.

Consistency of the nonparametric maximum likelihood estimator, Ĝ_n, was first proven by Kiefer and Wolfowitz (1956) using Wald-type arguments. A Hellinger risk bound for the associated marginal density estimate, adaptivity of Ĝ_n, and a self-regularization property have recently been established in Saha and Guntuboyina (2020) and Polyanskiy and Wu (2020). In particular, the following established result, stated here as an assumption, is crucial for establishing the asymptotic validity of the adaptive procedure.

Assumption 2. The nonparametric maximum likelihood estimator Ĝ_n is strongly consistent for G; that is, for every continuity point k of G, Ĝ_n(k) → G(k) almost surely as n → ∞. Furthermore, the estimated marginal (mixture) density converges almost surely in Hellinger distance.

When variances are homogeneous, or when variances are unknown but we have longitudinal data so that we have a mixture model for the pair {Y_i, S_i²}, the Hellinger convergence is established in van de Geer (1993).
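As a computational aside, the NPMLE Ĝ_n is commonly computed by convex optimization (e.g., the Koenker and Mizera interior point approach); as a minimal self-contained sketch, the following fixed-grid EM iteration approximates the same discrete solution for a Gaussian location mixture with known variances (the grid, sample size, and data here are our own synthetic choices):

```python
import numpy as np

def npmle_em(y, sigma, grid, iters=500):
    """Fixed-grid EM for the Kiefer-Wolfowitz NPMLE of G: maximizes
    sum_i log( sum_j w_j * phi((y_i - g_j)/sigma_i)/sigma_i ) over weights w."""
    L = np.exp(-0.5 * ((y[:, None] - grid[None, :]) / sigma[:, None]) ** 2) / sigma[:, None]
    w = np.full(grid.size, 1.0 / grid.size)
    for _ in range(iters):
        P = L * w
        P /= P.sum(axis=1, keepdims=True)  # E-step: responsibilities over grid points
        w = P.mean(axis=0)                 # M-step: reweight the grid
    return w

# synthetic check: true G puts mass 1/2 at each of -2 and 2, sigma_i = 1
rng = np.random.default_rng(2)
theta = rng.choice([-2.0, 2.0], size=2000)
y = theta + rng.normal(size=2000)
grid = np.linspace(-4.0, 4.0, 81)
w_hat = npmle_em(y, np.ones(2000), grid)
```

Each EM step weakly increases the mixture likelihood, and with well-separated atoms the estimated weights concentrate near the true support points; production code would instead use a convex solver, which converges far faster.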
When variances are heterogeneous but known, the Hellinger bound for the marginal density has recently been established in Jiang (2020).

The plug-in estimators for the posterior tail probability, v_α(y_i, σ_i) when variances are known or v_α(y_i, s_i) when variances are unknown, are then defined respectively as

v̂_α(y_i, σ_i) = ∫_{θ̂_α}^{+∞} σ_i⁻¹ φ((y_i − θ)/σ_i) dĜ_n(θ) / ∫_{−∞}^{+∞} σ_i⁻¹ φ((y_i − θ)/σ_i) dĜ_n(θ)
v̂_α(y_i, s_i) = ∫_{θ̂_α}^{+∞} f(y_i, s_i | θ, σ²) dĜ_n(θ, σ²) / ∫_{−∞}^{+∞} f(y_i, s_i | θ, σ²) dĜ_n(θ, σ²),

where φ is the standard normal density and f is the density function for (y_i, s_i), a product of Gaussian and gamma densities. Abbreviating the estimated posterior tail probability by v̂_{α,i}, we mimic the Oracle procedure and estimate the threshold by λ̂_n = max{λ̂_1n, λ̂_2n}, where

λ̂_1n = inf{t ∈ [0, 1] : (1/n) Σ_{i=1}^n 1{v̂_{α,i} ≤ t} ≥ 1 − α}
λ̂_2n = inf{t ∈ [0, 1] : Σ_{i=1}^n (1 − v̂_{α,i})1{v̂_{α,i} ≥ t} / Σ_{i=1}^n 1{v̂_{α,i} ≥ t} ≤ γ}.

Theorem 6.2. (Asymptotic validity of the adaptive procedure) Under Assumptions 1 and 2, the adaptive procedure δ_i = 1{v̂_{α,i} ≥ λ̂_n} asymptotically controls the false discovery rate below γ and the expected proportion of rejections below α for any (α, γ) ∈ [0, 1]² with α < 1 − γ as n → ∞; more specifically,

lim sup_{n→∞} E[ Σ_{i=1}^n 1{θ_i < θ_α, v̂_{α,i} ≥ λ̂_n} / (Σ_{i=1}^n 1{v̂_{α,i} ≥ λ̂_n} ∨ 1) ] ≤ γ
lim sup_{n→∞} E[ (1/n) Σ_{i=1}^n 1{v̂_{α,i} ≥ λ̂_n} ] ≤ α.

Given the Lagrangian formulation of the compound decision problem, it can clearly be viewed equivalently as a constrained optimization problem; see also the discussion in Remark 3.4. We seek to maximize power, defined as β(t) := P(θ_i ≥ θ_α, δ_i = 1)/α, subject to two constraints: one on the marginal FDR and the other on the selected proportion.
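The plug-in construction for the known-variance case can be sketched as follows (our own illustration; the two-point Ĝ_n is synthetic): θ̂_α is read off the quantile of the estimated mixing distribution, and for a discrete Ĝ_n the mixture integrals in v̂_α reduce to finite sums.

```python
import numpy as np

def tail_prob(y, sigma, grid, w, alpha):
    """Plug-in v-hat_alpha(y_i, sigma_i) = P(theta >= theta_alpha | y_i, sigma_i)
    under a discrete estimate (grid, w) of G."""
    theta_alpha = grid[np.searchsorted(np.cumsum(w), 1.0 - alpha)]  # 1-alpha quantile
    L = np.exp(-0.5 * ((np.atleast_1d(y)[:, None] - grid) / np.atleast_1d(sigma)[:, None]) ** 2)
    L /= np.atleast_1d(sigma)[:, None]
    num = (L * w)[:, grid >= theta_alpha].sum(axis=1)  # mass on the top group
    return num / (L * w).sum(axis=1)

# synthetic G-hat: mass 0.8 at 0 and 0.2 at 3, so theta_alpha = 3 for alpha = 0.10
grid, w = np.array([0.0, 3.0]), np.array([0.8, 0.2])
v = tail_prob(np.array([-1.0, 1.5, 4.0]), 1.0, grid, w, alpha=0.10)
```

For fixed σ the resulting v̂_α is monotone in y, which is the monotone likelihood ratio property invoked earlier.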
For each fixed pair (α, γ), the optimal Bayes rule achieves the best power among all thresholding procedures that respect the two constraints. The next theorem establishes that our feasible, adaptive procedure asymptotically achieves the same power as the oracle rule.

Theorem 6.3. Under Assumptions 1 and 2, the adaptive procedure δ_i = 1{v̂_{α,i} ≥ λ̂_n} attains the same power as the optimal Bayes rule asymptotically. In particular, as n → ∞,

Σ_{i=1}^n 1{θ_i ≥ θ_α, v̂_{α,i} ≥ λ̂_n} / Σ_{i=1}^n 1{θ_i ≥ θ_α} →_p β(λ*).

This result is supported by the simulation evidence presented in the next section.

7. Simulation Evidence

In this section we describe two small simulation exercises designed to illustrate the performance of several competing methods for ranking and selection. As a benchmark for evaluating performance we consider several Oracle methods that presume knowledge of the true distribution, G, generating the θ's, as well as several feasible methods that rely on estimation of G. These are contrasted with more traditional methods based on linear shrinkage rules of the Stein type. The linear shrinkage rule is the posterior mean of θ under the assumption that G follows a Gaussian distribution with unknown mean and variance parameters. This is the estimator commonly used for ranking and selection in applied work, notably Chetty, Friedman and Rockoff (2014a, 2014b) for teacher evaluation and Chetty and Hendren (2018) for studying intergenerational mobility. Typically the linear shrinkage estimator is used in the context of heterogeneous known variances, so this is the model we focus on in our simulation experiments. The linear shrinkage formula defined in (2.1) adapts easily to the heterogeneous variances case and leads to the James-Stein shrinkage rule with heterogeneous known variances. Efron and Morris (1973) introduced some further modifications.
As we have already demonstrated, when variances are heterogeneous the linear shrinkage estimator produces a different ranking than the posterior tail probability rules. A further complication arises when we seek procedures that also control for false discovery. Estimating the false discovery rate for different thresholds requires knowledge of G. If the Gaussian assumption on G underlying the linear shrinkage rules is misplaced, it may yield an inaccurate estimate of FDR, and consequently procedures that fail to control for false discovery.

Performance will be evaluated primarily on the basis of power, which we define as the proportion of the individuals whose true θ_i exceeds the cutoff θ_α = G⁻¹(1 − α) who are actually selected. This is the sample counterpart of P(δ_i = 1, θ_i ≥ θ_α) / P(θ_i ≥ θ_α). FDR is calculated as the sample counterpart of P(δ_i = 1, θ_i < θ_α) / P(δ_i = 1), that is, the proportion of selected individuals whose true θ_i falls below the threshold. While our selection rules are intended to constrain FDR below the γ threshold, as in other testing problems they are not always successful in this objective in finite samples, so empirical power comparisons must be interpreted cautiously in view of this. Nonetheless, asymptotic validity is assured by the results in Section 6. We compare performance for three distinct α levels and three γ levels.

7.1. The Student t Setting. Our first simulation setting focuses on the effect of the tail behavior of the distribution on the performance of competing rules. For these simulations we take G to be a discrete approximation to Student t distributions with five choices of degrees of freedom, including the Cauchy (t_1), supported on a bounded interval. The rules compared are:
OTP: Oracle Tail Probability Rule
OPM: Oracle Posterior Mean Rule
Efron: Efron Tail Probability Rule
KWs: Kiefer-Wolfowitz Smoothed Tail Probability Rule
EM: Efron and Morris (1973) Linear Shrinkage Rule

We illustrate the results in Figure 7.1, where we plot empirical power against the degrees of freedom of the t distribution for a selected set of values of the capacity constraint, α, and the FDR constraint, γ, as indicated at the top of each panel of the figure. The most striking conclusion from this exercise is the dramatic decrease in power as we move toward the Gaussian distribution. At the Cauchy, t_1, power is quite respectable for all choices of α and γ, but power declines rapidly as the degrees of freedom increase, reinforcing our earlier conclusion that the Gaussian case is extremely difficult. We would stress, in view of this finding, that classical linear shrinkage procedures designed for the Gaussian setting are poorly adapted to the heavy tailed settings in which the reliability of selection procedures is potentially greatest.

Figure 7.1. Power Performance for Several Selection Rules with Student t Signal. Capacity and FDR constraints are indicated at the top of each panel in the Figure.

Careful examination of this figure reveals a slight advantage for the posterior tail probability rules over the posterior mean procedures, both for the Oracle rules and for our feasible procedures. There is surprisingly little sacrifice in power in moving from the Oracle methods to the Efron or Kiefer-Wolfowitz rules. The Efron and Morris selection rule is very competitive in the almost Gaussian setting, but sacrifices considerable power in the lower degrees of freedom settings due to the misspecification of the distribution G and the consequent inaccurate estimation of the false discovery rate.

7.2. A Teacher Value-Added Setting.
Our second simulation setting is based on a discrete approximation of the data structure employed in Gilraine, Gu, and McMillan (2020) to study teacher value-added methods. Several longitudinal waves of student test scores from the Los Angeles Unified School District were combined in that study. Here we abstract from many features of the full longitudinal structure of the data, and focus instead on simulating the performance of several selection methods. We maintain our standard known variance model in which we observe Y_i ∼ N(θ_i, σ_i²) with θ_i's drawn iidly from a distribution G̃ estimated by Gilraine, Gu, and McMillan (2020). This distribution was estimated from the full longitudinal LA sample using the nonparametric maximum likelihood estimator of Kiefer and Wolfowitz, then smoothed slightly by convolution with a biweight kernel; it is illustrated in the left panel of Figure 7.2. Variances, in keeping with our hypothesis in Section 4, are drawn from a distribution with density illustrated in the right panel of Figure 7.2. We focus on selection from the left tail of the resulting distribution, since it is those teachers whose jobs are endangered by recent policy recommendations in the literature (see, for instance, Hanushek (2011)).

Figure 7.2. Densities of "True" (Mean) Ability and Standard Deviation for the Teacher Value Added Simulations

We draw samples of size 10,000 from the foregoing distribution and compute performance measures based on 100 replications. The fitted densities for this simulation exercise are based on a sample of roughly 11,000 teachers, so the simulation sample size is chosen to be commensurate with this. In Table 7.1 we report power, FDR and the proportion selected for ten selection rules. The Oracle rules, OTP and OPM, based on ranking by the tail probability and posterior mean criteria, can be considered benchmarks for the remaining feasible procedures.
Only the Oracle procedures can be considered reliable from the perspective of adhering to the capacity and FDR constraints. Consequently, some caution is required in interpreting the power comparisons, since feasible procedures can exhibit good power at the expense of violating these constraints. This is analogous to the familiar difficulty of interpreting power in testing problems when different procedures have differing size. When FDR is constrained to 5%, even the Oracle is only able to select about half of the deserving individuals; OTP is consistently preferable to OPM, as expected, and power performance improves somewhat as the capacity constraint is relaxed. Among the feasible G-modeling selection procedures, the Efron rules have good power performance but fail to meet the FDR constraints. We conjecture that somewhat less aggressive smoothing than the default, df = 5, c = 0.…, would remedy this; the smoothing apparently distorts the estimate of G, and thus leads to weaker power performance. As a further comparison, when the linear shrinkage rules are implemented without any FDR constraint, denoted LPM* and EM* in the table, as they typically would be used in practice, the false discovery proportion is considerably higher than the targeted γ. We also report the performance of the MLE and P-value rules, implemented without FDR control; again both yield a higher FDR rate, making it futile to evaluate their power performance.

8. Ranking and Selection of U.S. Dialysis Centers

Motivated by important prior work on ranking and selection by Lin, Louis, Paddock and Ridgeway (2006, 2009), illustrated by applications to ranking U.S. dialysis centers, we have chosen to maintain this focus to illustrate our own approach. Kidney disease is a growing medical problem in the U.S., and considerable effort has been devoted to data collection and evaluation of the relative performance of the more than 6000 dialysis centers serving the afflicted population.
Centers are evaluated on multiple criteria, but the primary focus of center ranking is their standardized mortality rate, or SMR, the ratio of observed deaths to expected deaths for center patients. Allocating patients to centers is itself a complex task, since patients may move from one center to another in the course of a year. Centers also vary considerably in the mix of patients they serve. Predictions from an estimated Cox proportional hazard model that attempts to account for this heterogeneity are employed to estimate expected deaths for each center.

Our analysis focuses exclusively on the SMR evaluation of centers using longitudinal data from 2004-18 as reported in University of Michigan Kidney Epidemiology and Cost Center (2009-2019). We restrict attention to 3230 centers that have consistently reported SMR data over this sample period. Observed deaths, denoted y_it for center i in year t, are conventionally modeled as Poisson,

y_it ∼ Pois(ρ_i μ_it),

where μ_it is center i's expected deaths as predicted by the Cox model in year t and ρ_i is the center's unobserved mortality rate. We view μ_it as the effective sample size for the center, after adjustment for patient characteristics of the center; center characteristics are explicitly excluded from the Cox model. The classical variance stabilizing transformation for the Poisson brings us back to the Gaussian model,

z_it = √(y_it / μ_it) ∼ N(θ_i, 1/w_it),

where θ_i = √ρ_i and w_it = 4μ_it. Exchangeability of the centers yields a mixture model in which the parameter θ_i is effectively assumed to be drawn iidly from a distribution, G. The predictions of expected mortality, μ_it, are assumed to be sufficiently accurate that we treat w_it as known, and independent of θ_i ∼ G.

[Table 7.1: Comparison of Performance of Several Selection Rules for the Teacher Value-Added Simulation — power, FDR, and the proportion selected for each rule at γ ∈ {5%, 10%} and α ∈ {1%, 3%, 5%, 10%}; the numeric entries are not recoverable from this extraction.]

Over short time horizons like 3 years we assume that centers have a fixed draw of θ_i from G, and thus we have a sufficient statistic for θ_i,

T_i = Σ_{t∈T} w_it z_it / w_i ∼ N(θ_i, 1/w_i),

where the set T is the corresponding three year window and w_i = Σ_{t∈T} w_it. Given these ingredients it is straightforward to construct a likelihood for the mixing distribution, G, and proceed with its estimation.

Our objective is then to select centers based on the posterior distributions of their θ_i's. For example, the posterior tail probability of center i is given by

v_α(t_i, w_i) = P(θ_i ≥ θ_α | t_i, w_i) = ∫_{θ_α}^{+∞} f(t_i | θ, w_i) dG(θ) / ∫_{−∞}^{+∞} f(t_i | θ, w_i) dG(θ),

where f is the density function of T_i conditional on θ_i and w_i. The capacity constraint requires choosing a threshold λ*(α) such that

α = ∫∫ 1{v_α(t, w) ≥ λ*(α)} φ(t | θ, w) dG(θ) dH(w),

which can be approximated by (1/n) Σ_i 1{v_α(t_i, w_i) ≥ λ*(α)} and inverted to obtain the threshold. Based on the discussion in Section 6, for the FDR constraint we choose a threshold λ*(α, γ) such that

(8.1) γ = ∫∫ 1{v_α(t, w) ≥ λ*(α, γ)} (1 − v_α(t, w)) f(t | θ, w) dG(θ) dH(w) / ∫∫ 1{v_α(t, w) ≥ λ*(α, γ)} f(t | θ, w) dG(θ) dH(w),

where H is the marginal distribution of the observed portion of the variance effect. The numerator can be approximated by (1/n) Σ_i (1 − v_α(t_i, w_i)) 1{v_α(t_i, w_i) ≥ λ*(α, γ)} and the denominator by (1/n) Σ_i 1{v_α(t_i, w_i) ≥ λ*(α, γ)}.

Posterior mean ranking, in contrast, is based on

δ(t_i, w_i) = E[θ_i | t_i, w_i] = ∫ θ f(t_i | θ, w_i) dG(θ) / ∫ f(t_i | θ, w_i) dG(θ).
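The path from raw counts to the ranking statistic can be sketched directly (our own illustration with made-up μ_it values): the variance stabilized scores z_it have approximate variance 1/(4μ_it), and T_i pools the years with precision weights w_it = 4μ_it.

```python
import numpy as np

def center_stat(y, mu):
    """Variance stabilized yearly scores and the pooled statistic:
    z_t = sqrt(y_t/mu_t) ~ N(theta, 1/w_t) with w_t = 4*mu_t, and
    T = sum_t w_t z_t / w ~ N(theta, 1/w), where w = sum_t w_t."""
    z = np.sqrt(y / mu)
    w_t = 4.0 * mu
    return (w_t * z).sum() / w_t.sum(), w_t.sum()

rng = np.random.default_rng(3)
mu = np.array([80.0, 100.0, 120.0])  # hypothetical expected deaths over 3 years
y = rng.poisson(1.21 * mu)           # one center with rho = 1.21, so theta = 1.1
T, w = center_stat(y, mu)            # T estimates sqrt(rho) = 1.1 with precision w
```

With T_i and w_i in hand, the posterior quantities v_α(t_i, w_i) and δ(t_i, w_i) above follow by plugging into the estimated mixing distribution G.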
For the capacity constraint we choose a thresholding value $C^*(\alpha)$ such that
\[
\alpha = \int\!\!\int \mathbb{1}\{\delta(t,w) \ge C^*(\alpha)\}\, f(t\,|\,\theta,w)\, dG(\theta)\, dH(w).
\]
For the FDR constraint, we pick a thresholding value $C^*(\alpha,\gamma)$ such that
\[
\gamma = \frac{P(\delta(t,w) \ge C^*(\alpha,\gamma);\ \theta < \theta_\alpha)}{P(\delta(t,w) \ge C^*(\alpha,\gamma))}.
\]
The right hand side of the FDR constraint can be approximated by
\[
\frac{1}{n}\sum_i \mathbb{1}\{\delta(t_i,w_i) \ge C^*(\alpha,\gamma)\}(1 - v_\alpha(t_i,w_i)) \Big/ \frac{1}{n}\sum_i \mathbb{1}\{\delta(t_i,w_i) \ge C^*(\alpha,\gamma)\},
\]
while the right hand side of the capacity constraint can be approximated by
\[
\frac{1}{n}\sum_i \mathbb{1}\{\delta(t_i,w_i) \ge C^*(\alpha)\},
\]
so $C^*(\alpha)$ is simply the empirical $(1-\alpha)$ quantile of the $\delta(t_i,w_i)$.

We will compare the foregoing ranking and selection rules with more naive rules based upon the Poisson and Gaussian MLEs, $\sum_{t\in\mathcal{T}} y_{it}/\sum_{t\in\mathcal{T}} \mu_{it}$ and $T_i$, respectively, a variant of the much maligned P-value, as well as a linear shrinkage procedure. For these rules we do not attempt to control FDR, since this is how they are typically implemented in practice.

To help appreciate the difficulty of the selection task, Table 8.1 reports estimated FDR rates for several selection rules under a range of capacity constraints $\alpha$, for both right and left tail selection, based on the data from 2004-2006. Right tail selection corresponds to identifying centers whose mortality rate is higher than expected; left tail selection to centers with mortality lower than expected. To estimate FDR we require an estimate of the mixing distribution, $G$. For this purpose we use the smoothed version of the Kiefer-Wolfowitz NPMLE introduced in Section 2.

                  α = 4%   α = 10%   α = 15%   α = 20%   α = 25%
  Right Selection
    MLE           0.564    0.481     0.432     0.399       –
    Poisson-MLE   0.564    0.485     0.436     0.402       –
    P-value       0.559    0.479     0.430     0.396       –
    JS            0.547    0.476     0.427     0.395       –
    EM            0.547    0.476     0.427     0.395       –
    PM            0.543    0.474     0.426     0.395       –
    TP            0.542    0.473     0.425     0.394       –
  Left Selection
    MLE           0.643    0.559     0.479     0.437       –
    Poisson-MLE   0.634    0.556     0.478     0.436       –
    P-value       0.660    0.566     0.482     0.440       –
    JS            0.635    0.550     0.475     0.434       –
    EM            0.635    0.550     0.475     0.434       –
    PM            0.632    0.549     0.474     0.434       –
    TP            0.627    0.547     0.472     0.432       –

Table 8.1. FDR rate: 2004-2006

The biweight bandwidth for the smoothing was chosen as the mean absolute deviation from the median of the discrete NPMLE, $\hat G$. The assessment of FDR reported in Table 8.1 reflects the considerable uncertainty associated with the selected set of centers deemed by the capacity constraint to be in the upper (or lower) $\alpha$ quantile, based upon our estimate of the distribution, $G$, of unobserved quality.

The MLE rule ranks centers based on their Gaussian MLE, $T_i$, while the Poisson-MLE rule ranks on $\sum_t y_{it}/\sum_t \mu_{it}$, which is the MLE of $\rho_i$ from the Poisson model. The P-value rule ranks centers based on the variance stabilizing transformation from the Poisson model under the null hypothesis $\rho_i = 1$ against the alternatives $\rho_i > 1$ or $\rho_i < 1$. The James-Stein (JS) linear shrinkage rule ranks centers by $\hat\mu_\theta + (T_i - \hat\mu_\theta)\hat\sigma_\theta^2/(\hat\sigma_\theta^2 + 1/w_i)$, which is the posterior mean of $\theta_i$ based on the model $T_i \sim \mathcal{N}(\theta_i, 1/w_i)$, assuming that the latent variable $\theta_i$ follows a Gaussian distribution with mean $\mu_\theta$ and variance $\sigma_\theta^2$. We also consider the Efron and Morris (1973) estimator (EM), which is a slight modification of the James-Stein estimator.

Finally, PM and TP are the posterior mean of $\theta$ and the posterior tail probability of $\theta \ge \theta_\alpha$ for right selection, and of $\theta \le \theta_\alpha$ for left selection, based on our estimated $\hat G$. For both left and right tail selection, as $\alpha$ increases, the FDR rate decreases, indicating that the selection task becomes easier. All rules that account for the compound decision perspective of the problem have slightly lower FDRs than those that consider each center individually.

The Kidney Epidemiology and Cost Center (2018) assigns ratings of five stars down to one star to centers in fixed proportions.
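The two thresholding steps described above, $C^*(\alpha)$ as an empirical quantile of the ranking statistic and $C^*(\alpha,\gamma)$ as the smallest cutoff whose estimated mFDR is below $\gamma$, can be sketched in a few lines. This is an illustrative sketch on simulated data, not the authors' implementation; the statistic `delta` and the posterior tail probabilities `v` below are toy stand-ins for $\delta(t_i,w_i)$ and $v_\alpha(t_i,w_i)$.

```python
import numpy as np

def capacity_cutoff(delta, alpha):
    """C*(alpha): the empirical (1 - alpha) quantile of the ranking
    statistic, so that a fraction alpha of the centers is selected."""
    return np.quantile(delta, 1.0 - alpha)

def fdr_cutoff(delta, v, gamma):
    """C*(alpha, gamma): smallest cutoff C whose estimated mFDR,
    sum_i 1{delta_i >= C}(1 - v_i) / sum_i 1{delta_i >= C},
    does not exceed gamma; None if no cutoff qualifies."""
    for c in np.sort(delta):          # candidate cutoffs, increasing
        sel = delta >= c              # nonempty: c is itself a delta value
        if (1.0 - v[sel]).mean() <= gamma:
            return c
    return None

rng = np.random.default_rng(0)
theta = rng.normal(size=1000)                     # latent center effects
delta = theta + rng.normal(scale=0.5, size=1000)  # noisy ranking statistic
v = 1.0 / (1.0 + np.exp(-(delta - 1.0)))          # toy tail probabilities
c_cap = capacity_cutoff(delta, alpha=0.10)        # selects ~10% of centers
c_fdr = fdr_cutoff(delta, v, gamma=0.20)          # enforces mFDR <= 0.20
```

The selection under both constraints is the intersection of the two selection sets, i.e. thresholding at the larger of the two cutoffs.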
We will abbreviate these ratings to the conventional academic scale of A-F. To illustrate the conflict between the selection criteria, we plot in Figure 8.1 the centers selected for the grade A category (five stars, which comprises the 22% of centers presumed to have the lowest true mortality rates) with and without FDR control. Centers are characterized by pairs, $(T_i, w_i)$, consisting of their weighted mean standardized mortality, $T_i$, and an estimate of the precision, $w_i$, of these mortality estimates. In each plot the solid curves represent the decision boundaries of the selection rules under comparison. Centers with low mortality and relatively high precision appear toward the northwest in each figure.

Panel (a) of the figure compares the posterior tail probability selection with the MLE, or fixed effect, selection. The selection boundary for the MLE is the (red) vertical line, since the MLE ignores the precision of the estimates entirely. The selection boundary for the tail probability rule is indicated by the (blue) curve. A few centers with high precision excluded by the MLE rule are selected by the TP rule, while, conversely, a few centers with low precision are selected by the MLE rule but excluded by the TP criterion. Panel (b) imposes FDR control with $\gamma = 0.20$ on the TP selection, with an estimated thresholding value implied by the FDR constraint using the smoothed NPMLE. The MLE selection is the same as in Panel (a), without FDR control. We see that under the TP rule with FDR control, the number of selected centers is reduced considerably: instead of selecting the 711 centers allowed by the capacity, it selects only 230 centers. In comparison, the MLE rule under the capacity constraint has an estimated FDR rate of 0.431. Panel (c) compares centers selected by the TP rule with those selected by a James-Stein linear shrinkage rule.
Now the TP rule tolerates a few more low precision centers, while it is the James-Stein rule that demands higher precision for selection. Finally, in Panel (d) we again subject the TP rule to FDR control of 20 percent, while the James-Stein rule continues to adhere only to the capacity constraint. The TP boundary scales back substantially, suggesting that a large proportion of the extra selections made by the James-Stein linear shrinkage rule are likely to be false discoveries. In fact, the estimated FDR rate of the James-Stein rule under the capacity constraint alone is also 0.431, the same as that of the MLE rule.

Given the longitudinal structure of the Dialysis data, it would be possible to consider the models of Section 5 that allow for unobserved variance heterogeneity. We refrain from doing so partly due to space considerations and because we are reluctant to assume stationarity of the random effects over longer time horizons.

8.1. Temporal Stability, Ranking and Selection. Given the longitudinal nature of the data, it is natural to ask, "How stable are rankings over time, and isn't there some temporal dependence in the observed data that should be accounted for?" Surprisingly, the year-to-year dependence in the observed mortality is quite weak. In Figure 8.2 we plot a histogram of estimated AR(1) coefficients for the 3230 centers; it is roughly centered at zero and slightly skewed to the left. We do not draw the conclusion from this that there is no temporal dependence in the observed $y_{it}$, but only that there is considerable heterogeneity in the nature of this dependence, with roughly as many centers exhibiting negative serial dependence as positive. Our approach of considering brief, 3-5 year, windows of presumed stability in center performance is consistent with the procedures of the official ranking agency.
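The per-center AR(1) coefficients summarized in Figure 8.2 can be obtained by fitting a first-order autoregression to each center's annual series. A minimal sketch on simulated data (the series below are assumptions for illustration, not the Dialysis data):

```python
import numpy as np

def ar1_coef(x):
    """OLS estimate of rho in x_t = c + rho * x_{t-1} + e_t."""
    x0, x1 = x[:-1], x[1:]
    x0c = x0 - x0.mean()
    return float(x0c @ (x1 - x1.mean()) / (x0c @ x0c))

rng = np.random.default_rng(1)
n_centers, T = 3230, 14        # 14 annual observations, 2004-2017
rhos = np.empty(n_centers)
for i in range(n_centers):
    rho = rng.uniform(-0.5, 0.5)     # heterogeneous true dependence
    x = np.empty(T)
    x[0] = rng.normal()
    for t in range(1, T):
        x[t] = rho * x[t - 1] + rng.normal()
    rhos[i] = ar1_coef(x)
# With only 14 observations the estimates are noisy and biased downward,
# producing a histogram roughly centered at zero, as in Figure 8.2.
```

With series this short, the familiar small-sample bias of the AR(1) estimator, of order $-(1+3\rho)/T$, pulls the estimates to the left, one mundane reason a histogram of such estimates can appear slightly left-skewed even absent strong dependence.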
In each of these windows we can compute a ranking according to one of the criteria introduced above, and it is of interest to know how much stability we see in these rankings. To address this question we consider rankings based on the posterior tail probability criterion for three-year windows.

Figure 8.1. Contrasting Selections for A-rated centers: The two upper panels compare posterior tail probability selection with MLE (fixed effects) selection, while the lower panels compare TP selection with James-Stein (linear shrinkage) selection. Left panels impose capacity control only, while the right panels impose 20 percent FDR control for the TP rule. The estimated FDR rate for both the MLE and James-Stein selection under capacity constraint, using the smoothed NPMLE estimator for G, is 0.431. Comparisons are based on the 2004-2006 data.

In each of the five 3-year windows we assign centers letter grades, A-F, in the same proportions as the star ratings. Table 8.2 reports the estimated transition matrix between these categories, so entry $i,j$ of the matrix represents the estimated probability of a center in state $i$ moving to state $j$ in the next period.

Figure 8.2. Histogram of estimated AR(1) coefficients for 3230 Dialysis centers based on annual data 2004-2017

        A       B       C       D       F
  A   0.440   0.330   0.200   0.024     –
  B   0.248   0.357   0.328   0.059     –
  C   0.122   0.286   0.440   0.113     –
  D   0.060   0.188   0.436   0.208     –
  F   0.021   0.081   0.352   0.217     –

Table 8.2.
Estimated First Order Markov Transition Matrix: Entry $i,j$ of the matrix estimates the probability of a transition from state $i$ to state $j$, based on posterior tail probability rankings for 3-year longitudinal groupings of the center data.

It is obviously difficult to maintain an "A" rating for more than a couple of periods, but centers with poor performance are also likely to move into the middle of the rankings. Although, as we have seen, there is no guarantee that the posterior tail probability criterion yields a nested ranking, nestedness does hold in this particular application. Posterior mean ranking yields similar transition behavior. The high degree of mobility between rating categories reinforces our conclusion that ranking and selection into rating categories is subject to considerable uncertainty.

Conclusions

Robbins's compound decision framework is well suited to ranking and selection problems, and nonparametric maximum likelihood estimation of mixture models offers a powerful tool for implementing empirical Bayes rules for such problems. Posterior tail probability selection rules perform better than posterior mean rules when precision is heterogeneous. Ranking and selection is especially difficult in Gaussian settings, where classical linear shrinkage methods are most appropriate. Nonparametric empirical Bayes methods can substantially improve upon selection methods based on linear shrinkage and traditional p-values, both in terms of power and false discovery rate, when the latent mixing distribution is not Gaussian.

Appendix A. Proofs

Proof.
[Lemma 3.1] We can write
\[
\nabla_y v_\alpha(y) = \frac{\int_{[\theta_\alpha,+\infty)} \nabla_y \log\varphi(y|\theta,\sigma)\,\varphi(y|\theta,\sigma)\,dG(\theta)}{\int_{(-\infty,+\infty)} \varphi(y|\theta,\sigma)\,dG(\theta)} - \frac{\int_{[\theta_\alpha,+\infty)} \varphi(y|\theta,\sigma)\,dG(\theta)}{\int_{(-\infty,+\infty)} \varphi(y|\theta,\sigma)\,dG(\theta)} \cdot \frac{\int_{(-\infty,+\infty)} \nabla_y \log\varphi(y|\theta,\sigma)\,\varphi(y|\theta,\sigma)\,dG(\theta)}{\int_{(-\infty,+\infty)} \varphi(y|\theta,\sigma)\,dG(\theta)}
\]
\begin{align*}
&= E\big[\mathbb{1}\{\theta \ge \theta_\alpha\}\,\nabla_y \log\varphi(y|\theta,\sigma)\,\big|\,Y\big] - E\big[\mathbb{1}\{\theta \ge \theta_\alpha\}\,\big|\,Y\big]\,E\big[\nabla_y \log\varphi(y|\theta,\sigma)\,\big|\,Y\big]\\
&= \mathrm{Cov}\big[\mathbb{1}\{\theta \ge \theta_\alpha\},\, \nabla_y \log\varphi(y|\theta,\sigma)\,\big|\,Y\big] \ge 0,
\end{align*}
since $\nabla_y \log\varphi(y|\theta,\sigma)$ is increasing in $\theta$ for each fixed $\sigma$, by the Gaussian assumption, and the covariance of monotone functions of $\theta$ is non-negative (see Schmidt (2014)), assuming the existence of $E_{\theta|Y}[\nabla_y \log\varphi(Y|\theta,\sigma)\,|\,Y]$, which we assume. Nesting follows from the monotonicity of the $v_\alpha(y)$ criterion: monotonicity of $v_\alpha(y)$ implies that there exists $t_\alpha$ such that $\mathbb{1}\{v_\alpha(y) \ge \lambda_\alpha/(1+\lambda_\alpha)\} = \mathbb{1}\{y \ge t_\alpha\}$; hence if $\alpha_2 > \alpha_1$ with $P(y \ge t_{\alpha_1}) = \alpha_1$ and $P(y \ge t_{\alpha_2}) = \alpha_2$, then it must be that $t_{\alpha_2} \le t_{\alpha_1}$, implying nestedness.

Proof. [Lemma 3.2] Denote the three decision criteria $v_1(y) = E(\theta\,|\,Y=y)$, $v_2(y) = P(\theta \ge G^{-1}(1-\alpha)\,|\,Y=y)$ and $v_3(y) = E(\theta\,\mathbb{1}\{\theta \ge G^{-1}(1-\alpha)\}\,|\,Y=y)$. Assuming that $E[\theta\,|\,Y] < \infty$, $E_{\theta|Y}[\nabla_y \log\varphi(y|\theta,\sigma)\,|\,Y] < \infty$ and $E_{\theta|Y}[\theta\,\nabla_y \log\varphi(y|\theta,\sigma)\,|\,Y] < \infty$, the calculation leading to the proof of Lemma 3.1 shows
\begin{align*}
\nabla_y v_1(y) &= \mathrm{Cov}(\theta,\, \nabla_y \log\varphi(y|\theta)\,|\,Y=y)\\
\nabla_y v_2(y) &= \mathrm{Cov}(\mathbb{1}\{\theta \ge G^{-1}(1-\alpha)\},\, \nabla_y \log\varphi(y|\theta)\,|\,Y=y)\\
\nabla_y v_3(y) &= \mathrm{Cov}(\theta\,\mathbb{1}\{\theta \ge G^{-1}(1-\alpha)\},\, \nabla_y \log\varphi(y|\theta)\,|\,Y=y).
\end{align*}
Thus, the monotonicity of $\nabla_y \log\varphi(y|\theta,\sigma)$ in $\theta$ implies they all yield identical rankings.

Proof.
[Proposition 3.3] The Bayes rule for the non-randomized selections can be characterized as
\[
\delta_i^* = \begin{cases} 1, & \text{if } v_\alpha(y_i) \ge \tau_1^*(1 - v_\alpha(y_i) - \gamma) + \tau_2^*,\\ 0, & \text{if } v_\alpha(y_i) < \tau_1^*(1 - v_\alpha(y_i) - \gamma) + \tau_2^*, \end{cases}
\]
with Karush-Kuhn-Tucker conditions,
\begin{align}
&\tau_1^*\Big( E\Big[\sum_{i=1}^n \big\{(1 - v_\alpha(y_i))\delta_i^* - \gamma\delta_i^*\big\}\Big]\Big) = 0 \tag{A.1}\\
&\tau_2^*\Big( E\Big[\sum_{i=1}^n \delta_i^*\Big] - \alpha n\Big) = 0 \tag{A.2}\\
&E\Big[\sum_{i=1}^n \big\{(1 - v_\alpha(y_i))\delta_i^* - \gamma\delta_i^*\big\}\Big] \le 0 \tag{A.3}\\
&\frac{1}{n} E\Big[\sum_{i=1}^n \delta_i^*\Big] - \alpha \le 0 \tag{A.4}
\end{align}
and $\tau_1^* \ge 0$, $\tau_2^* \ge 0$. The rule is monotone in $v_\alpha(y)$, and since $v_\alpha(y)$ is monotone in $y$ as shown in Lemma 3.2, it is therefore a thresholding rule on $Y$, $\delta_i^* = \mathbb{1}\{y_i \ge t^*\}$, with cutoff $t^*$ depending on the values of $(\tau_1^*, \tau_2^*, \alpha, \gamma)$. Condition (A.3) is equivalent to a condition on the marginal false discovery rate, mFDR, since it requires
\[
E\Big[\sum_{i=1}^n (1 - v_\alpha(y_i))\delta_i^*\Big] \Big/ E\Big[\sum_{i=1}^n \delta_i^*\Big] \le \gamma,
\]
and we can show that the left hand side quantity is precisely the mFDR, since
\begin{align*}
\mathrm{mFDR}(t^*) &= P(\delta_i^* = 1,\ \theta_i \le \theta_\alpha)\,/\,P(\delta_i^* = 1)\\
&= \int \mathbb{1}\{y \ge t^*\}(1 - v_\alpha(y))\, f(y)\,dy \Big/ \int \mathbb{1}\{y \ge t^*\}\, f(y)\,dy\\
&= E\Big[\sum_{i=1}^n (1 - v_\alpha(y_i))\delta_i^*\Big] \Big/ E\Big[\sum_{i=1}^n \delta_i^*\Big]
= \frac{\int_{-\infty}^{\theta_\alpha} \tilde\Phi((t^* - \theta)/\sigma)\,dG(\theta)}{\int_{-\infty}^{+\infty} \tilde\Phi((t^* - \theta)/\sigma)\,dG(\theta)} \le \gamma.
\end{align*}
For any mixing distribution $G$, as $t^*$ increases, it becomes less likely for condition (A.3) to bind. And as $t^*$ approaches $-\infty$, the left side of (A.3) approaches $1 - \alpha - \gamma$, hence we have restricted $\gamma < 1 - \alpha$ to avoid cases where condition (A.3) never binds. On the other hand, condition (A.4) is equivalent to
\[
P(\delta_i^* = 1) - \alpha = \int_{-\infty}^{+\infty} \tilde\Phi((t^* - \theta)/\sigma)\,dG(\theta) - \alpha \le 0.
\]
As $t^*$ increases, it also becomes less likely that condition (A.4) binds. Therefore, we can define
\begin{align*}
t_1^* &= \min\{t : \mathrm{mFDR}(t) - \gamma \le 0\}\\
t_2^* &= \min\Big\{t : \int_{-\infty}^{+\infty} \tilde\Phi((t - \theta)/\sigma)\,dG(\theta) - \alpha \le 0\Big\}.
\end{align*}
When $t_1^* < t_2^*$, the feasible region for $Y$ defined by inequality (A.4) is a strict subset of that defined by inequality (A.3). When $t_1^* > t_2^*$, the feasible region defined by inequality (A.3) is a strict subset of that defined by inequality (A.4). When $t_1^* = t_2^*$, the feasible regions coincide. This case occurs when $\mathrm{mFDR}(t^*) = \gamma$ and $P(y \ge t^*) = \alpha$, so $E[v_\alpha(Y)\mathbb{1}\{v_\alpha(Y) \ge \lambda^*\}] = \alpha - \alpha\gamma$ with $v_\alpha(t^*) = \lambda^*(\alpha, \gamma)$. Again, the strict thresholding enforced by the statement of the proposition can be relaxed slightly by randomizing the selection probability of the last unit so that the active constraint is satisfied exactly.

Proof. [Proposition 4.1] In the proof we will suppress the dependence of $\lambda^*$ on $(\alpha, \gamma)$. The argument is very similar to the proof of Proposition 3.3, except that now the feasible region defined by constraints (A.3) and (A.4) is a two-dimensional region for $(y_i, \sigma_i)$. Since the posterior tail probability $v_\alpha(y, \sigma)$ is monotone in $y$ for any fixed $\sigma$ as a result of Lemma 3.2, the optimal rule can again be reformulated as a thresholding rule on $Y$, $\delta_i^* = \mathbb{1}\{y_i > t_\alpha(\lambda^*, \sigma_i)\}$, except now the threshold value also depends on $\sigma_i$.

Now consider the constraints (A.3) and (A.4). Condition (A.3) is equivalent to the condition that
\[
\frac{E\big[\sum_{i=1}^n (1 - v_\alpha(y_i, \sigma_i))\delta_i^*\big]}{E\big[\sum_{i=1}^n \delta_i^*\big]} - \gamma = \frac{\int\!\int_{-\infty}^{\theta_\alpha} \tilde\Phi((t_\alpha(\lambda^*, \sigma) - \theta)/\sigma)\,dG(\theta)\,dH(\sigma)}{\int\!\int_{-\infty}^{+\infty} \tilde\Phi((t_\alpha(\lambda^*, \sigma) - \theta)/\sigma)\,dG(\theta)\,dH(\sigma)} - \gamma \le 0.
\]
For any marginal distributions $G$ and $H$ of $(\theta, \sigma)$ and for a fixed pair $(\alpha, \gamma)$, as $\lambda^*$ increases, $t_\alpha(\lambda^*, \sigma)$ also increases for any $\sigma > 0$, so condition (A.3) becomes less likely to bind. Condition (A.4) is equivalent to
\[
P(\delta_i^* = 1) - \alpha = \int\!\!\int \tilde\Phi((t_\alpha(\lambda^*, \sigma) - \theta)/\sigma)\,dG(\theta)\,dH(\sigma) - \alpha \le 0.
\]
So as $\lambda^*$ increases, it is also less likely for condition (A.4) to bind. Thus, when $\lambda_1^* < \lambda_2^*$, the feasible region on $(Y, \sigma)$ defined by inequality constraint (A.4) is a strict subset of that defined by inequality (A.3). When $\lambda_1^* > \lambda_2^*$, the feasible region defined by inequality (A.3) is a strict subset of that defined by inequality (A.4). When $\lambda_1^* = \lambda_2^*$, the feasible regions coincide; this case occurs when
\[
E\big[v_\alpha(Y, \sigma)\,\mathbb{1}\{v_\alpha(Y, \sigma) \ge \lambda^*\}\big] = \alpha - \alpha\gamma,
\]
where the expectation is taken with respect to the joint distribution of $(Y, \sigma)$.

Finally, regarding the existence of $\lambda^*$: for the existence of a solution $\lambda_2^*$ for any $\alpha \in (0,1)$, note that $f_2(\alpha, \lambda) = P(v_\alpha(y, \sigma) > \lambda) - \alpha$ is a decreasing function in $\lambda$, and $\lambda_2^*(\alpha)$ is defined as the zero-crossing point of $f_2(\alpha, \lambda)$. Note that $f_2(\alpha, 0) = 1 - \alpha$ and $f_2(\alpha, 1) = -\alpha$. Therefore for any $\alpha \in (0,1)$ there exists $\lambda_2^*(\alpha) \in (0,1)$ such that $f_2(\alpha, \lambda_2^*(\alpha)) = 0$.

Now consider $f_1(\alpha, \gamma, \lambda) = E[(1 - v_\alpha(y, \sigma) - \gamma)\mathbb{1}\{v_\alpha(y, \sigma) > \lambda\}]$. For a fixed pair $(\alpha, \gamma)$, $\lambda_1^*(\alpha, \gamma)$ is defined as the zero-crossing point of $f_1(\alpha, \gamma, \lambda)$. Note that $f_1(\alpha, \gamma, \lambda)$ first decreases and then increases in $\lambda$, with its minimum achieved at $\lambda = 1 - \gamma$. We also know that $f_1(\alpha, \gamma, 0) = 1 - \gamma - E[v_\alpha(y, \sigma)] = 1 - \gamma - \alpha$ and $f_1(\alpha, \gamma, 1) = 0$. Hence, as long as $\gamma < 1 - \alpha$, the zero-crossing $\lambda_1^*(\alpha, \gamma)$ exists. The condition $\gamma < 1 - \alpha$ is imposed to rule out cases where the FDR constraint never binds.

Proof. [Lemma 4.2] Note that for any cutoff value $\lambda$, the mFDR can be expressed as
\[
\mathrm{mFDR}(\alpha, \lambda) = E\big[(1 - v_\alpha(y_i, \sigma_i))\mathbb{1}\{v_\alpha(y_i, \sigma_i) \ge \lambda\}\big] \big/ P\big(v_\alpha(y_i, \sigma_i) \ge \lambda\big).
\]
Thus, mFDR depends both on the cutoff value $\lambda$ and on $\alpha$, since $v_\alpha$ is a function of $\alpha$ and consequently its density function is also indexed by $\alpha$.

First, we show that $\nabla_\lambda \mathrm{mFDR}(\alpha, \lambda) \le 0$ for any $\alpha \in (0,1)$. Differentiating with respect to $\lambda$ gives
\begin{align*}
\nabla_\lambda \frac{\int_\lambda^1 (1-v) f_v(v;\alpha)\,dv}{\int_\lambda^1 f_v(v;\alpha)\,dv}
&= \frac{-(1-\lambda) f_v(\lambda;\alpha) \int_\lambda^1 f_v(v;\alpha)\,dv + \int_\lambda^1 (1-v) f_v(v;\alpha)\,dv\; f_v(\lambda;\alpha)}{\big(\int_\lambda^1 f_v(v;\alpha)\,dv\big)^2}\\
&= \frac{f_v(\lambda;\alpha)}{\big(\int_\lambda^1 f_v(v;\alpha)\,dv\big)^2}\Big(\int_\lambda^1 (1-v) f_v(v;\alpha)\,dv - \int_\lambda^1 (1-\lambda) f_v(v;\alpha)\,dv\Big) \le 0.
\end{align*}
Next, to establish that $\nabla_\alpha \mathrm{mFDR}(\alpha, \lambda) \le 0$, we differentiate with respect to $\alpha$ to obtain
\begin{align*}
\nabla_\alpha \frac{\int_\lambda^1 (1-v) f_v(v;\alpha)\,dv}{\int_\lambda^1 f_v(v;\alpha)\,dv}
&= \frac{E[(1-v)\nabla_\alpha \log f_v(v;\alpha)\mathbb{1}\{v \ge \lambda\}]}{P(v \ge \lambda)} - \frac{E[(1-v)\mathbb{1}\{v \ge \lambda\}]}{P(v \ge \lambda)}\cdot\frac{E[\nabla_\alpha \log f_v(v;\alpha)\mathbb{1}\{v \ge \lambda\}]}{P(v \ge \lambda)}\\
&= E[(1-v)\nabla_\alpha \log f_v(v;\alpha)\,|\,v \ge \lambda] - E[1-v\,|\,v \ge \lambda]\, E[\nabla_\alpha \log f_v(v;\alpha)\,|\,v \ge \lambda]\\
&= \mathrm{cov}[1-v,\, \nabla_\alpha \log f_v(v;\alpha)\,|\,v \ge \lambda] \le 0,
\end{align*}
where the last inequality holds because $\nabla_\alpha \log f_v(v;\alpha)$ is non-decreasing in $v$.

Now suppose we have the cutoff value $\lambda^*(\alpha_1, \gamma)$ such that
\[
E\big[(1 - v_{\alpha_1}(y_i, \sigma_i))\mathbb{1}\{v_{\alpha_1}(y_i, \sigma_i) > \lambda^*(\alpha_1, \gamma)\}\big] \big/ P\big(v_{\alpha_1}(y_i, \sigma_i) > \lambda^*(\alpha_1, \gamma)\big) = \gamma.
\]
If we maintain the same cutoff value for $v_{\alpha_2}(y_i, \sigma_i)$ with $\alpha_2 > \alpha_1$, then by the second property of mFDR we know
\[
E\big[(1 - v_{\alpha_2}(y_i, \sigma_i))\mathbb{1}\{v_{\alpha_2}(y_i, \sigma_i) \ge \lambda^*(\alpha_1, \gamma)\}\big] \big/ P\big(v_{\alpha_2}(y_i, \sigma_i) \ge \lambda^*(\alpha_1, \gamma)\big) \le \gamma.
\]
If equality holds, then by definition we have $\lambda^*(\alpha_2, \gamma) = \lambda^*(\alpha_1, \gamma)$; if strict inequality holds, then by the first property of mFDR, in order to increase the mFDR level to equal $\gamma$, we must have $\lambda^*(\alpha_2, \gamma) < \lambda^*(\alpha_1, \gamma)$.

Proof. [Corollary 4.3] For any $\alpha_2 > \alpha_1$, we have $v_{\alpha_2}(y, \sigma) \ge v_{\alpha_1}(y, \sigma)$ for all pairs $(y, \sigma) \in \mathbb{R} \times \mathbb{R}^+$. When the condition in Lemma 4.2 holds, then $v_{\alpha_2}(y, \sigma) \ge v_{\alpha_1}(y, \sigma) > \lambda^*(\alpha_1, \gamma) \ge \lambda^*(\alpha_2, \gamma)$, which implies $\Omega^{FDR}_{\alpha_1,\gamma} \subseteq \Omega^{FDR}_{\alpha_2,\gamma}$.

Proof. [Lemma 4.4] $t_\alpha(\lambda^*(\alpha), \sigma)$ defines the boundary of the selection region under the capacity constraint for a fixed level $\alpha$. The condition imposed implies that, as $\alpha$ increases, for each fixed $\sigma$ the thresholding value for $Y$ decreases, hence the nestedness of the selection region.

Proof. [Lemma 4.5] Based on the results in Lemma 4.2 and Lemma 4.4 and the fact that $\Omega_{\alpha,\gamma} = \Omega^{FDR}_{\alpha,\gamma} \cap \Omega^{C}_{\alpha}$, we have nestedness of the selection set.

Proof. [Proposition 4.6] The capacity constraint requires that
\[
\alpha = P(M(y, \sigma) \ge C^*(\alpha)) = \int\!\!\int \mathbb{1}\{M(y, \sigma) \ge C^*(\alpha)\}\, f(y|\theta, \sigma)\,dG(\theta)\,dH(\sigma).
\]
For any $\alpha_2 > \alpha_1$, it is then clear that $C^*(\alpha_2) \le C^*(\alpha_1)$. Given the monotonicity of $M(y, \sigma)$ for each fixed $\sigma$ established in Lemma 3.2, the selection set based on the capacity constraint is nested. For the FDR constraint, we require
\[
\gamma = \frac{\int\!\int_{-\infty}^{\theta_\alpha} \mathbb{1}\{M(y, \sigma) \ge C^*(\alpha, \gamma)\}\, f(y|\theta, \sigma)\,dG(\theta)\,dH(\sigma)}{\int\!\int \mathbb{1}\{M(y, \sigma) \ge C^*(\alpha, \gamma)\}\, f(y|\theta, \sigma)\,dG(\theta)\,dH(\sigma)}. \tag{A.7}
\]
Fixing $\gamma$, it suffices to show that if $\alpha_2 > \alpha_1$, then $C^*(\alpha_2, \gamma) \le C^*(\alpha_1, \gamma)$. First, solve for $C^*(\alpha_1, \gamma)$ from equation (A.7). Now suppose we use this same thresholding value when we increase the capacity to $\alpha_2 > \alpha_1$, and evaluate the right hand side of equation (A.7).
Since $\theta_{\alpha_2} \le \theta_{\alpha_1}$, the numerator decreases,
\[
\int\!\!\int_{-\infty}^{\theta_{\alpha_2}} \mathbb{1}\{M(y, \sigma) \ge C^*(\alpha_1, \gamma)\}\, f(y|\theta, \sigma)\,dG(\theta)\,dH(\sigma) \le \int\!\!\int_{-\infty}^{\theta_{\alpha_1}} \mathbb{1}\{M(y, \sigma) \ge C^*(\alpha_1, \gamma)\}\, f(y|\theta, \sigma)\,dG(\theta)\,dH(\sigma),
\]
while the denominator does not change. The only way to satisfy the equality (A.7) again is to decrease the thresholding value, therefore $C^*(\alpha_2, \gamma) \le C^*(\alpha_1, \gamma)$. The result in the Proposition is then reached since the selection set is the intersection of the selection set under the capacity constraint and that under the FDR constraint.

Proof. [Lemma 5.1] The logarithm of the Gamma density of $S_i$ takes the form
\[
\log \Gamma(S_i\,|\,r_i, \sigma_i^2) = r_i \log(r_i/\sigma_i^2) - \log(\Gamma(r_i)) + (r_i - 1)\log S_i - S_i r_i/\sigma_i^2,
\]
hence
\[
\nabla_s \Gamma(s\,|\,r, \sigma^2) = \Gamma(s\,|\,r, \sigma^2)\Big(\frac{r-1}{s} - \frac{r}{\sigma^2}\Big).
\]
Fixing $y$ and differentiating with respect to $s$, we have
\begin{align*}
\nabla_s v_\alpha(y, s) &= \frac{\int\!\int_{\theta_\alpha}^{+\infty} f(y|\theta,\sigma)\,\Gamma(s|r,\sigma^2)\big[\tfrac{r-1}{s} - \tfrac{r}{\sigma^2}\big]\,dG(\theta,\sigma)}{\int\!\int f(y|\theta,\sigma)\,\Gamma(s|r,\sigma^2)\,dG(\theta,\sigma)}\\
&\quad - \frac{\int\!\int_{\theta_\alpha}^{+\infty} f(y|\theta,\sigma)\,\Gamma(s|r,\sigma^2)\,dG(\theta,\sigma)}{\int\!\int f(y|\theta,\sigma)\,\Gamma(s|r,\sigma^2)\,dG(\theta,\sigma)}\cdot \frac{\int\!\int f(y|\theta,\sigma)\,\Gamma(s|r,\sigma^2)\big[\tfrac{r-1}{s} - \tfrac{r}{\sigma^2}\big]\,dG(\theta,\sigma)}{\int\!\int f(y|\theta,\sigma)\,\Gamma(s|r,\sigma^2)\,dG(\theta,\sigma)}\\
&= E\Big[\mathbb{1}\{\theta \ge \theta_\alpha\}\Big(\frac{r-1}{s} - \frac{r}{\sigma^2}\Big)\,\Big|\,Y = y, S = s\Big] - E\big[\mathbb{1}\{\theta \ge \theta_\alpha\}\,\big|\,Y = y, S = s\big]\, E\Big[\frac{r-1}{s} - \frac{r}{\sigma^2}\,\Big|\,Y = y, S = s\Big]\\
&= -\mathrm{Cov}\Big[\mathbb{1}\{\theta \ge \theta_\alpha\},\, \frac{r}{\sigma^2}\,\Big|\,Y = y, S = s\Big].
\end{align*}
The covariance term can take either sign since we do not restrict the distribution $G$, so $v_\alpha(Y, S)$ need not be monotone in $S$.
On the other hand, if we fix $s$ and differentiate with respect to $y$,
\begin{align*}
\nabla_y v_\alpha(y, s) &= E\Big[\mathbb{1}\{\theta \ge \theta_\alpha\}\Big(-\frac{y - \theta}{\sigma^2/T}\Big)\,\Big|\,Y = y, S = s\Big] - E\big[\mathbb{1}\{\theta \ge \theta_\alpha\}\,\big|\,Y = y, S = s\big]\,E\Big[-\frac{y - \theta}{\sigma^2/T}\,\Big|\,Y = y, S = s\Big]\\
&= \mathrm{Cov}\Big[\mathbb{1}\{\theta \ge \theta_\alpha\},\, \frac{\theta - y}{\sigma^2/T}\,\Big|\,Y = y, S = s\Big].
\end{align*}
Again, the covariance term can take either sign, depending on the correlation of $\theta$ and $\sigma^2$ conditional on $(Y, S)$. Therefore, fixing $S$, $v_\alpha(Y, S)$ need not be a monotone function of $Y$.

Proof. [Proposition 5.2] The proof is very similar to that of Proposition 4.1; the only difference is that we can no longer formulate the decision rule by simply thresholding on $Y$, because the transformation $v_\alpha(Y, S)$ need not be monotone in $Y$ for fixed values of $S$, as shown in Lemma 5.1. Hence $\lambda_1^*(\alpha, \gamma)$ and $\lambda_2^*(\alpha)$ must now be defined directly through the random variable $v_\alpha$. The first constraint states that
\[
E\Big[\sum_{i=1}^n (1 - v_\alpha(y_i, s_i))\delta_i^*\Big] \Big/ E\Big[\sum_{i=1}^n \delta_i^*\Big] \le \gamma \quad \text{with } \delta_i^* = \mathbb{1}\{v_\alpha(y_i, s_i) \ge \lambda^*\}.
\]
For each fixed $\alpha$, let the density function of $v_\alpha(Y, S)$ be denoted $f_v(\cdot\,; \alpha)$; then the constraint can be formulated as
\[
\int_{\lambda^*}^1 (1 - v) f_v(v; \alpha)\,dv \Big/ \int_{\lambda^*}^1 f_v(v; \alpha)\,dv \le \gamma,
\]
which is non-increasing in $\lambda^*$, hence the constraint becomes less likely to bind as $\lambda^*$ increases. On the other hand, the second constraint states that
\[
P(\delta_i^* = 1) - \alpha = \int_{\lambda^*}^1 f_v(v; \alpha)\,dv - \alpha \le 0.
\]
For each fixed $\alpha \in (0,1)$, this also becomes less likely to bind as $\lambda^*$ increases.

Proof. [Theorem 6.1] To prove Theorem 6.1, we first introduce some additional notation and prove several lemmas.
Let
\begin{align*}
H_{n,0}(t) &= 1 - H_n(t) = \frac{1}{n}\sum_{i=1}^n \mathbb{1}\{v_{\alpha,i} \ge t\}\\
H_{n,1}(t) &= \frac{1}{n}\sum_{i=1}^n (1 - v_{\alpha,i})\mathbb{1}\{v_{\alpha,i} \ge t\}\\
Q_n(t) &= H_{n,1}(t)/H_{n,0}(t)\\
V_n(t) &= \frac{1}{n}\sum_{i=1}^n \mathbb{1}\{v_{\alpha,i} \ge t\}\mathbb{1}\{\theta_i \le \theta_\alpha\}\\
H_0(t) &= 1 - H(t) = P(v_{\alpha,i} \ge t)\\
H_1(t) &= E\big[(1 - v_{\alpha,i})\mathbb{1}\{v_{\alpha,i} \ge t\}\big]\\
Q(t) &= H_1(t)/H_0(t).
\end{align*}

Lemma A.1. Under Assumption 1, as $n \to \infty$,
\[
\sup_{t \in [0,1]} |H_{n,0}(t) - H_0(t)| \overset{p}{\to} 0, \qquad \sup_{t \in [0,1]} |H_{n,1}(t) - H_1(t)| \overset{p}{\to} 0.
\]

Proof. [Proof of Lemma A.1] Under Assumption 1 and the fact that $v_{\alpha,i} \in [0,1]$, for each fixed $t \in [0,1]$, as $n \to \infty$,
\[
H_{n,0}(t) \overset{p}{\to} H_0(t), \qquad H_{n,1}(t) \overset{p}{\to} H_1(t).
\]
By the Glivenko-Cantelli theorem, the first result is immediate. To prove the second result, it suffices to show that for any $\epsilon > 0$, as $n \to \infty$,
\[
P\Big(\sup_{t \in [0,1]} |H_{n,1}(t) - H_1(t)| > \epsilon\Big) \to 0.
\]
Since $v_{\alpha,i}$ has a continuous distribution, $H_1(t)$ is a monotonically decreasing and bounded function in $t$, with $H_1(0) = 1 - \alpha$ and $H_1(1) = 0$. The function $H_{n,1}(t)$ is also monotonically decreasing in $t$, so we can find $m_\epsilon < \infty$ points such that $0 = t_0 < t_1 < \cdots < t_{m_\epsilon} = 1$ and, for any $j \in \{1, 2, \ldots, m_\epsilon\}$, $H_1(t_{j-1}) - H_1(t_j) \le \epsilon/2$. For any $t \in [0,1]$, choose $j$ such that $t_{j-1} \le t \le t_j$, so that
\begin{align*}
H_{n,1}(t) - H_1(t) &\le H_{n,1}(t_{j-1}) - H_1(t_j)\\
&= (H_{n,1}(t_{j-1}) - H_1(t_{j-1})) + (H_1(t_{j-1}) - H_1(t_j))\\
&\le |H_{n,1}(t_{j-1}) - H_1(t_{j-1})| + \epsilon/2\\
&\le \max_j |H_{n,1}(t_{j-1}) - H_1(t_{j-1})| + \epsilon/2.
\end{align*}
Similarly, $H_{n,1}(t) - H_1(t) \ge -\max_j |H_{n,1}(t_j) - H_1(t_j)| - \epsilon/2$, hence
\[
\sup_{t \in [0,1]} |H_{n,1}(t) - H_1(t)| \le \max_j |H_{n,1}(t_j) - H_1(t_j)| + \epsilon/2.
\]
Since $m_\epsilon$ is finite, for any $\delta > 0$ there exists $N$ such that for all $n \ge N$,
\[
P\Big(\max_j |H_{n,1}(t_j) - H_1(t_j)| \ge \epsilon/2\Big) \le \delta,
\]
which then implies that
\[
P\Big(\sup_{t \in [0,1]} |H_{n,1}(t) - H_1(t)| \ge \epsilon\Big) \le P\Big(\max_j |H_{n,1}(t_j) - H_1(t_j)| \ge \epsilon/2\Big) \to 0.
\]

Lemma A.2. Under Assumption 1 and $\alpha < 1 - \gamma$, $Q(1 - \gamma) < \gamma$.

Proof. [Proof of Lemma A.2] Define $\bar Q(t) = E[(1 - v_{\alpha,i} - \gamma)\mathbb{1}\{v_{\alpha,i} \ge t\}]$; then $Q(t) = \gamma$ implies $\bar Q(t) = 0$. Since $Q(t)$ is monotonically decreasing in $t$, as shown in the proof of Lemma 4.2, it suffices to prove that $\bar Q(1 - \gamma) < 0$. To this end, note that $\nabla_t \bar Q(t) < 0$ for $t < 1 - \gamma$ and $\nabla_t \bar Q(t) > 0$ for $t > 1 - \gamma$, hence $\bar Q(t)$ attains its minimum value at $t = 1 - \gamma$. Note that $\bar Q(0) = 1 - \gamma - \alpha > 0$ and $\bar Q(1) = 0$, thus $\bar Q(1 - \gamma) < 0$.

We also have $\sup_{t \le 1-\gamma} |Q_n(t) - Q(t)| \overset{p}{\to} 0$, since
\begin{align*}
|Q_n(t) - Q(t)| &= \Big|\frac{H_0(t) H_{n,1}(t) - H_1(t) H_{n,0}(t)}{H_{n,0}(t) H_0(t)}\Big| = \Big|\frac{H_0(t)(H_{n,1}(t) - H_1(t)) - H_1(t)(H_{n,0}(t) - H_0(t))}{H_{n,0}(t) H_0(t)}\Big|\\
&\le \frac{H_0(0)\sup_t |H_{n,1}(t) - H_1(t)| + H_1(0)\sup_t |H_{n,0}(t) - H_0(t)|}{H_0(1-\gamma)\big(H_0(1-\gamma) - \sup_t |H_{n,0}(t) - H_0(t)|\big)} \overset{p}{\to} 0
\end{align*}
uniformly for any $t \le 1 - \gamma$. The last inequality holds because $\min_{t \le 1-\gamma} H_0(t) = H_0(1 - \gamma)$, by monotonicity of $H_0(t)$. With a similar argument, we can also show that $\sup_{t \le 1-\gamma} \big| V_n(t)/H_{n,0}(t) - Q(t) \big| \overset{p}{\to} 0$. Using this result and the fact that $Q(1-\gamma) < \gamma$ by Lemma A.2, we have
\[
P\Big(|Q_n(1-\gamma) - Q(1-\gamma)| < \frac{\gamma - Q(1-\gamma)}{2}\Big) \to 1,
\]
hence $P(Q_n(1-\gamma) < \gamma) \to 1$, and therefore $P(\lambda_n \le 1 - \gamma) \to 1$ for the estimated cutoff $\lambda_n$.
Since $\lambda_n^* \le \lambda_n$ by definition, we also have $P(\lambda_n^* \le 1 - \gamma) \to 1$. On the other hand,
\[
Q_n(\lambda_n) - \frac{V_n(\lambda_n)}{H_{n,0}(\lambda_n)} \ge \inf_{t \le 1-\gamma}\Big(Q_n(t) - Q(t) + Q(t) - V_n(t)/H_{n,0}(t)\Big) = o_p(1).
\]
Since $Q_n(\lambda_n^*) \le Q_n(\lambda_n) \le \gamma$, it follows that
\[
\frac{V_n(\lambda_n)}{H_{n,0}(\lambda_n) \vee n^{-1}} \le \frac{V_n(\lambda_n)}{H_{n,0}(\lambda_n)} \le \gamma + o_p(1).
\]
Since $V_n(\lambda_n)/(H_{n,0}(\lambda_n) \vee n^{-1})$ is bounded above by 1, by Fatou's lemma we have
\[
\limsup_{n \to \infty} E\Big[\frac{V_n(\lambda_n)}{H_{n,0}(\lambda_n) \vee n^{-1}}\Big] \le \gamma.
\]

Lemma A.3. Under Assumptions 1 and 2, as $n \to \infty$, $\hat\theta_\alpha \to \theta_\alpha$ a.s.

Proof. [Proof of Lemma A.3] See Lemma 21.2 in van der Vaart (2000).

Lemma A.4. Under Assumptions 1 and 2, as $n \to \infty$, $\sup_i |\hat v_{\alpha,i} - v_{\alpha,i}| \to 0$ a.s.

Proof. [Proof of Lemma A.4] Note that
\[
\hat v_{\alpha,i} = \frac{\int_{\hat\theta_\alpha}^{+\infty} f(D_i|\theta)\,d\hat G_n(\theta)}{\int_{-\infty}^{+\infty} f(D_i|\theta)\,d\hat G_n(\theta)},
\]
where we denote by $D_i$ the data, with density function $f(D_i|\theta)$. When variances are known, $D_i = \{y_i, \sigma_i\}$ and $f(D_i|\theta) = \sigma_i^{-1}\varphi((y_i - \theta)/\sigma_i)$; when variances are unknown, $D_i = \{y_i, s_i\}$ and $f(D_i|\theta) = (\sigma^2/T)^{-1/2}\varphi\big((y_i - \theta)/\sqrt{\sigma^2/T}\big)\,\Gamma(s_i\,|\,r, \sigma^2/r)$ with $r = (T-1)/2$, where $\varphi(\cdot)$ and $\Gamma(\cdot\,|\,\cdot,\cdot)$ denote the standard normal and gamma density functions, respectively.

We first analyze the denominator and prove
\[
\sup_x \Big|\int_{-\infty}^{+\infty} f(x|\theta)\,d\hat G_n(\theta) - \int_{-\infty}^{+\infty} f(x|\theta)\,dG(\theta)\Big| \to 0 \quad \text{a.s.} \tag{A.8}
\]
Let $f_n(x) = \int f(x|\theta)\,d\hat G_n(\theta)$ and $f(x) = \int f(x|\theta)\,dG(\theta)$. Under Assumption 2, we have $\int \big(\sqrt{f_n(x)} - \sqrt{f(x)}\big)^2\,d\mu(x) \to 0$, which implies $\int |f_n(x) - f(x)|\,dx \to 0$. Since $f_n(x)$ and $f(x)$ are Lipschitz continuous, we proceed by contradiction. Suppose (A.8) does not hold; then there exist $\epsilon > 0$ and a sequence $\{x_n\}_{n \ge 1}$ such that $|f_n(x_n) - f(x_n)| \ge \epsilon$ for all $n$.
By Lipschitz continuity of $f_n$ and $f$, there exists $C$ such that
\[
| f_n(x_n + \delta) - f_n(x_n) | \leq C \|\delta\| \quad \text{and} \quad | f(x_n + \delta) - f(x_n) | \leq C \|\delta\|.
\]
Therefore there exists $\eta > 0$ such that for all $y$ with $\| y - x_n \| \leq \eta$, $| f_n(y) - f(y) | \geq \epsilon/2$, which then implies
\[
\int | f_n(x) - f(x) |\, dx \geq \int_{\{\| y - x_n \| \leq \eta\}} | f_n(y) - f(y) |\, dy \geq \frac{\epsilon}{2} \int_{\{\| y - x_n \| \leq \eta\}} dy,
\]
which contradicts $\int | f_n(x) - f(x) |\, dx \to 0$. It remains to verify that $f_n$ and $f$ are Lipschitz continuous. Note that it suffices to prove that for each fixed parameter $\theta$, $| f(x \mid \theta) - f(y \mid \theta) | \leq C_\theta \| x - y \|$ with $\sup_\theta C_\theta < \infty$. This clearly holds for the Gaussian density, since the Gaussian density is everywhere differentiable and has a bounded first derivative; under Assumption 1 it also holds for the unknown variance case provided $T$ is large enough.

We next prove
(A.9)
\[
\sup_x \Big| \int_{\hat{\theta}_\alpha}^{+\infty} f(x \mid \theta)\, d\hat{G}_n(\theta) - \int_{\theta_\alpha}^{+\infty} f(x \mid \theta)\, dG(\theta) \Big| \to 0 \quad \text{a.s.}
\]
Note that
\[
\Big| \int_{\hat{\theta}_\alpha}^{+\infty} f(x \mid \theta)\, d\hat{G}_n(\theta) - \int_{\theta_\alpha}^{+\infty} f(x \mid \theta)\, dG(\theta) \Big| \leq \Big| \int_{\hat{\theta}_\alpha}^{+\infty} f(x \mid \theta)\, d\hat{G}_n(\theta) - \int_{\theta_\alpha}^{+\infty} f(x \mid \theta)\, d\hat{G}_n(\theta) \Big| + \Big| \int_{\theta_\alpha}^{+\infty} f(x \mid \theta)\, d\hat{G}_n(\theta) - \int_{\theta_\alpha}^{+\infty} f(x \mid \theta)\, dG(\theta) \Big|.
\]
The first term converges to 0 uniformly due to Lemma A.3. To show that the second term also converges to zero uniformly, we make use of the result that if $\hat{G}_n$ converges weakly to $G$, which holds under Assumption 2, then $\sup_{g \in \mathcal{BL}} | \int g\, d\hat{G}_n - \int g\, dG | \to 0$, where $\mathcal{BL}$ is the class of bounded Lipschitz continuous functions. Note that $f(x \mid \theta) 1\{\theta \geq \theta_\alpha\}$ is bounded and continuous except at $\theta = \theta_\alpha$.
So we construct a smoothed version of $f(x \mid \theta) 1\{\theta \geq \theta_\alpha\}$, denoted $g(x \mid \theta)$, by replacing $1\{\theta \geq \theta_\alpha\}$ with a piecewise linear function taking the value zero for $\theta < \theta_\alpha$, the value one for $\theta \geq \theta_\alpha + \epsilon$, and the value $(\theta - \theta_\alpha)/\epsilon$ for $\theta \in [\theta_\alpha, \theta_\alpha + \epsilon]$; then $g \in \mathcal{BL}$. The result (A.9) then holds by showing that
\[
\sup_x \Big| \int_{\theta_\alpha}^{\theta_\alpha + \epsilon} f(x \mid \theta)\, d\hat{G}_n(\theta) + \int_{\theta_\alpha}^{\theta_\alpha + \epsilon} f(x \mid \theta)\, dG(\theta) \Big| \to 0 \quad \text{a.s.},
\]
which holds by Assumptions 1 and 2.

Proof. [Proof of Theorem 6.2] Define analogously
\[
\hat{H}_{n,0}(t) = \frac{1}{n} \sum_{i=1}^n 1\{\hat{v}_{\alpha,i} \geq t\}, \qquad \hat{H}_{n,1}(t) = \frac{1}{n} \sum_{i=1}^n (1 - \hat{v}_{\alpha,i}) 1\{\hat{v}_{\alpha,i} \geq t\}, \qquad \hat{Q}_n(t) = \hat{H}_{n,1}(t)/\hat{H}_{n,0}(t).
\]
We first show that $\sup_{t \in [0,1]} | \hat{H}_{n,0}(t) - H_0(t) | \overset{p}{\to} 0$ and $\sup_{t \in [0,1]} | \hat{H}_{n,1}(t) - H_1(t) | \overset{p}{\to} 0$, for which it suffices to show $\sup_{t \in [0,1]} \big| \hat{H}_{n,1}(t) - H_{n,1}(t) \big| \overset{p}{\to} 0$ and its counterpart for $\hat{H}_{n,0}$. To this end, note that
\[
\sup_{t \in [0,1]} \Big| \frac{1}{n} \sum_i (1 - \hat{v}_{\alpha,i}) 1\{\hat{v}_{\alpha,i} \geq t\} - \frac{1}{n} \sum_i (1 - v_{\alpha,i}) 1\{v_{\alpha,i} \geq t\} \Big| \leq \sup_{t \in [0,1]} \Big| \frac{1}{n} \sum_i (1 - \hat{v}_{\alpha,i}) 1\{\hat{v}_{\alpha,i} \geq t\} - \frac{1}{n} \sum_i (1 - v_{\alpha,i}) 1\{\hat{v}_{\alpha,i} \geq t\} \Big| + \sup_{t \in [0,1]} \Big| \frac{1}{n} \sum_i (1 - v_{\alpha,i}) 1\{\hat{v}_{\alpha,i} \geq t\} - \frac{1}{n} \sum_i (1 - v_{\alpha,i}) 1\{v_{\alpha,i} \geq t\} \Big| \leq \frac{1}{n} \sum_i | \hat{v}_{\alpha,i} - v_{\alpha,i} | + \sup_{t \in [0,1]} \frac{1}{n} \sum_i \big| 1\{\hat{v}_{\alpha,i} \geq t\} - 1\{v_{\alpha,i} \geq t\} \big|.
\]
The convergence of the first term is implied by the result in Lemma A.4.
The second term can be bounded as
\[
\sup_{t \in [0,1]} \frac{1}{n} \sum_i \big| 1\{\hat{v}_{\alpha,i} \geq t\} - 1\{v_{\alpha,i} \geq t\} \big| = \sup_{t \in [0,1]} \frac{1}{n} \sum_i \Big[ 1\{\hat{v}_{\alpha,i} \geq t, v_{\alpha,i} < t\} + 1\{\hat{v}_{\alpha,i} < t, v_{\alpha,i} \geq t\} \Big] \leq \sup_{t \in [0,1]} \frac{1}{n} \sum_i \Big[ 1\{\hat{v}_{\alpha,i} \geq t, t - e < v_{\alpha,i} < t\} + 1\{\hat{v}_{\alpha,i} < t, t \leq v_{\alpha,i} < t + e\} \Big] + \sup_{t \in [0,1]} \frac{1}{n} \sum_i \Big[ 1\{\hat{v}_{\alpha,i} \geq t, v_{\alpha,i} \leq t - e\} + 1\{\hat{v}_{\alpha,i} < t, v_{\alpha,i} \geq t + e\} \Big] \leq \sup_{t \in [0,1]} \frac{1}{n} \sum_i 1\{t - e \leq v_{\alpha,i} \leq t + e\} + \frac{1}{ne} \sum_i | \hat{v}_{\alpha,i} - v_{\alpha,i} | \leq \sup_{t \in [0,1]} | H_0(t + e) - H_0(t - e) | + 2 \sup_{t \in [0,1]} | H_{n,0}(t) - H_0(t) | + \frac{1}{ne} \sum_i | \hat{v}_{\alpha,i} - v_{\alpha,i} |
\]
for some $e > 0$; the first term can be made arbitrarily small by taking $e$ small, using the continuity of $H_0$, and the remaining terms vanish by Lemmas A.1 and A.4. Using similar arguments to those in the proof of Theorem 6.1, we can then establish that $\sup_{t \leq 1-\gamma} \big| \hat{Q}_n(t) - Q(t) \big| \overset{p}{\to} 0$ and $\sup_{t \leq 1-\gamma} \big| \hat{V}_n(t)/\hat{H}_{n,0}(t) - Q(t) \big| \overset{p}{\to} 0$, where $\hat{Q}_n(t) = \hat{H}_{n,1}(t)/\hat{H}_{n,0}(t)$ and $\hat{V}_n(t) = \frac{1}{n} \sum_{i=1}^n 1\{\hat{v}_{\alpha,i} \geq t\} 1\{\theta_i \leq \theta_\alpha\}$, and consequently
\[
\limsup_{n \to \infty} E\Big[ \frac{\hat{V}_n(\hat{\lambda}_n)}{\hat{H}_{n,0}(\hat{\lambda}_n) \vee n^{-1}} \Big] \leq \gamma.
\]

Proof. [Proof of Theorem 6.3] We first show that $\hat{\lambda}_n^1 \overset{p}{\to} \lambda_1^*$ and $\hat{\lambda}_n^2 \overset{p}{\to} \lambda_2^*$; then by the continuous mapping theorem, we have $\hat{\lambda}_n = \max\{\hat{\lambda}_n^1, \hat{\lambda}_n^2\} \overset{p}{\to} \max\{\lambda_1^*, \lambda_2^*\} = \lambda^*$. The second statement follows from Lemma A.3. The first statement holds because, by the argument for Theorem 6.2, we have
(A.10)
\[
\sup_{t \leq 1-\gamma} \big| \hat{Q}_n(t) - Q(t) \big| \overset{p}{\to} 0,
\]
and for any $\epsilon > 0$, $Q(t) > \gamma$ for all $t \leq \lambda_1^* - \epsilon$ while $Q(\lambda_1^* + \epsilon) < \gamma$, by monotonicity of $Q(t)$. Combined with (A.10), we have $\hat{\lambda}_n^1 \overset{p}{\to} \lambda_1^*$.
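In practice the estimated threshold can be computed by a simple scan over candidate values of $t$. The sketch below takes the estimated tail probabilities $\hat{v}_{\alpha,i}$ as given (in the paper they are built from the NPMLE $\hat{G}_n$, which we do not reproduce here; the toy Beta inputs are illustrative assumptions) and returns the smallest threshold at which the estimated false discovery rate $\hat{Q}_n(t)$ falls below $\gamma$:

```python
import numpy as np

def fdr_threshold(v_hat, gamma):
    """Smallest candidate threshold t at which the estimated false discovery
    rate Q_n(t) = H_n1(t) / H_n0(t) <= gamma; returns 1.0 if none is feasible."""
    for t in np.sort(v_hat):                 # candidate thresholds, increasing
        sel = v_hat >= t
        H0 = sel.mean()                      # proportion selected
        H1 = ((1 - v_hat) * sel).mean()      # estimated mass of false discoveries
        if H0 > 0 and H1 / H0 <= gamma:
            return t                         # this is the threshold lambda_n
    return 1.0

rng = np.random.default_rng(1)
# toy tail probabilities: 900 near-null units and 100 strong signals
v_hat = np.concatenate([rng.beta(1, 20, 900), rng.beta(20, 1, 100)])
lam = fdr_threshold(v_hat, gamma=0.10)
selected = v_hat >= lam                      # the selected set
```

Scanning only the sorted $\hat{v}$ values suffices because $\hat{Q}_n(t)$ changes only at observed tail probabilities.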
Now define
\[
H_{n,2}(t) = \frac{1}{n} \sum_{i=1}^n v_{\alpha,i}\, 1\{v_{\alpha,i} \geq t\}, \qquad \hat{H}_{n,2}(t) = \frac{1}{n} \sum_{i=1}^n \hat{v}_{\alpha,i}\, 1\{\hat{v}_{\alpha,i} \geq t\},
\]
\[
\hat{U}_n(t) = \frac{1}{n} \sum_{i=1}^n 1\{\theta_i \geq \theta_\alpha, \hat{v}_{\alpha,i} \geq t\}, \qquad U_n(t) = \frac{1}{n} \sum_{i=1}^n 1\{\theta_i \geq \theta_\alpha, v_{\alpha,i} \geq t\},
\]
and $H_2(t) = P(\theta_i \geq \theta_\alpha, v_{\alpha,i} \geq t) = \alpha \beta(t)$. It suffices to prove that $\frac{1}{n} \sum_{i=1}^n 1\{\theta_i \geq \theta_\alpha, \hat{v}_{\alpha,i} \geq \hat{\lambda}_n\} \overset{p}{\to} H_2(\lambda^*)$. Using a similar argument as for Theorem 6.2, we can show that $\sup_{t \in [0,1]} | \hat{H}_{n,2}(t) - H_2(t) | \overset{p}{\to} 0$ and $\sup_{t \in [0,1]} | \hat{U}_n(t) - H_2(t) | \overset{p}{\to} 0$. Since $\hat{\lambda}_n \overset{p}{\to} \lambda^*$, we then have $\frac{1}{n} \sum_{i=1}^n 1\{\theta_i \geq \theta_\alpha, \hat{v}_{\alpha,i} \geq \hat{\lambda}_n\} \overset{p}{\to} H_2(\lambda^*)$ by continuity of $H_2$.

Appendix B. A Discrete Bivariate Example

In this appendix we consider a case where $G$ is a discrete joint distribution for the pairs $(\theta, \sigma^2)$, supported on three mass points with $\theta$ taking a low value, the value 4, and the value 5. In contrast to the discrete example in Section 4.4, the unobserved variance $\sigma^2$ is now clearly informative about $\theta$ for this distribution $G$.

We will focus on the capacity constraint $\alpha = 0.05$, so $\theta_\alpha = 5$. For $T = 9$, the level curves for the tail probability and the posterior mean are shown in Figure B.1, and the selection set comparison for one sample realization in Figure B.2. The right panel of the figure plots the selection boundaries for the two ranking criteria for $\gamma = 10\%$. The non-monotonicity of $v_\alpha(y, s)$ in both $y$ and $s$ is apparent. The posterior mean criterion, based on $E(\theta \mid Y, S)$, prefers individuals with smaller variances compared to the rule based on the tail probability. Since sample variances $S$ are informative about $\theta$, when the sample variance is small and selection is based on the posterior tail probability, the oracle is aware that such a small sample variance is likely only when $\theta = 4$, and hence makes a selection only when we observe a very large $y$.
As a result, the oracle sets a higher selection threshold on $y$ to avoid selecting individuals with true effect $\theta = 4$. The posterior mean criterion, on the other hand, also tries to use the information in the sample variance, but not as effectively for our selection objective. This can be seen in the level curves in the middle panel of Figure B.1. When the sample variance is small, the posterior mean shrinks very aggressively towards 4, thereby sacrificing valuable information in $y$. For a wide range of values of the sample mean $y$, the posterior mean delivers a value close to 4, thus failing to distinguish between those with $\theta = 5$ and those with $\theta = 4$. Consequently, the posterior mean rule sets a lower threshold on $y$ for the selection region when the sample variance is small, resulting in the inferior power performance shown in Table B.1.

Figure B.1. The left panel plots the level curves for the posterior tail probability criterion and the middle panel depicts the level curves for the posterior mean criterion. The right panel plots the selection boundary based on the posterior mean ranking (shown as the red dashed lines) and the posterior tail probability ranking (shown as the black solid lines) for $\alpha = 5\%$ and $\gamma = 10\%$, with $G(\theta, \sigma^2)$ following a three-point discrete distribution.

Table B.1 reports several performance measures over 200 simulation repetitions with $n = 50{,}000$:

• MLE: ranking of the maximum likelihood estimators, $Y_i$, for each of the $\theta_i$,

Figure B.2.
Selection set comparison for one sample realization from the three-point discrete distribution model: the left panel shows in black circles the elements selected by both the posterior mean and the posterior tail probability criteria under the capacity constraint; extra elements selected by the posterior mean are marked in green and extra elements selected by the posterior tail probability rule are marked in red. The right panel shows the comparison of the selected sets under both the capacity and FDR constraints with $\alpha = 5\%$ and $\gamma = 10\%$.

               γ = 1%                  γ = 5%                  γ = 10%
          Power   FDR  SelProp    Power   FDR  SelProp    Power   FDR  SelProp
PM        0.217  0.010  0.011     0.482  0.050  0.025     0.580  0.100
TP        0.252  0.010  0.013     0.561  0.050  0.030     0.697  0.100
P-value   0.651  0.349  0.050     0.651  0.350  0.050     0.651  0.349  0.050
MLE       0.611  0.390  0.050     0.611  0.390  0.050     0.610  0.390  0.050
PM-NIX    0.611  0.390  0.050     0.611  0.390  0.050     0.610  0.390  0.050
TP-NIX    0.619  0.382  0.050     0.619  0.382  0.050     0.618  0.382  0.050

Table B.1. Performance comparison for ranking procedures based on the posterior mean, the posterior tail probability, the P-value, and the MLE of $\theta_i$. All results are based on 200 simulation repetitions with $n = 50{,}000$, with $G$ following the three-point discrete distribution and $T = 9$, or with $G$ assumed to follow a normal-inverse-chi-squared distribution. For the first two rows, the reported numbers correspond to performance when both the capacity and FDR constraints are in place; for the last four rows, only the capacity constraint is in place.

• P-values: ranking of the P-values of the conventional one-sided test of the null hypothesis $H_0 : \theta < \theta_\alpha$,

• PM-NIX: ranking of the posterior means based on the normal-inverse-chi-squared (NIX) prior distribution,

• TP-NIX: ranking of the posterior tail probabilities based on the NIX prior distribution.

The first two of these selection rules ignore the compound decision perspective of the problem entirely.
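For a discrete bivariate $G$, the oracle posterior quantities compared here are available in closed form, since the posterior over $(\theta, \sigma^2)$ given $(\bar{y}, s^2)$ concentrates on the support points of $G$. The sketch below uses hypothetical support points and weights (placeholders, not the exact specification of this appendix), with the sampling densities $\bar{y} \mid \theta, \sigma^2 \sim N(\theta, \sigma^2/T)$ and $s^2 \mid \sigma^2 \sim \Gamma(r, \sigma^2/r)$, $r = (T-1)/2$:

```python
import math
import numpy as np

T = 9
r = (T - 1) / 2
# Hypothetical three-point support for (theta, sigma^2) and weights;
# illustrative values only, not the paper's exact specification.
support = np.array([[-1.0, 1.00],
                    [ 4.0, 0.25],    # low-variance point with theta = 4
                    [ 5.0, 1.00]])   # the "meritorious" point
w = np.array([0.85, 0.10, 0.05])
theta_alpha = 5.0                    # alpha = 0.05 quantile of the theta-marginal

def joint_density(y, s2, theta, sig2):
    """Density of (ybar, s^2) given (theta, sigma^2): Normal x Gamma."""
    norm = math.exp(-0.5 * (y - theta) ** 2 / (sig2 / T)) / math.sqrt(2 * math.pi * sig2 / T)
    scale = sig2 / r
    gam = s2 ** (r - 1) * math.exp(-s2 / scale) / (scale ** r * math.gamma(r))
    return norm * gam

def posterior(y, s2):
    """Posterior weights over the three support points given (ybar, s^2)."""
    dens = w * np.array([joint_density(y, s2, th, sg) for th, sg in support])
    return dens / dens.sum()

def v_alpha(y, s2):
    """Posterior tail probability P(theta >= theta_alpha | ybar, s^2)."""
    return posterior(y, s2)[support[:, 0] >= theta_alpha].sum()

# A small sample variance makes theta = 4 (the low-variance point) more
# plausible, so at the same ybar the tail probability is far lower than
# when the sample variance is typical of sigma^2 = 1:
print(v_alpha(4.5, 0.2), v_alpha(4.5, 1.0))
```

The printed comparison shows the tail probability at the same $\bar{y}$ dropping sharply when $s^2$ is small, which is the mechanism behind the oracle's higher selection threshold discussed above.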
The other two ranking criteria we consider are based on the posterior mean and the tail probability assuming that $G$ follows a normal-inverse-chi-squared (NIX) distribution (denoted PM-NIX and TP-NIX in Table B.1). The parameters of the NIX distribution are estimated from the data, hence these rules can be viewed as generalizations of the James-Stein estimator for the homogeneous variance case and of the Efron-Morris shrinkage estimator for the known heterogeneous variances case. We refer the reader to Example 5.1 for details of the NIX distribution and the posterior distribution of $(\theta, \sigma^2)$.

We report the power and false discovery rates for the posterior mean (denoted PM) and posterior tail probability (denoted TP) selection rules, as well as the proportion of selected observations, for $\alpha = 5\%$ and for several different $\gamma$ under both capacity and FDR control. For the other four selection rules we impose only the capacity constraint, as is usual in current practice. For the PM and TP rules, the proportion of selected observations reveals whether the FDR constraint or the capacity constraint is binding in each configuration. Ranking based on the posterior tail probability clearly has better power performance in each configuration when compared to the posterior mean ranking. When selecting as few as 5%, the FDR constraint is binding for both the PM and TP rules for all $\gamma \in \{1\%, 5\%, 10\%\}$. Among the other rules, we see that the false discovery rate is around 40%, and PM-NIX has performance identical to the ranking based on the MLE of $\theta$; this can be understood by noting that the posterior mean of $\theta$ under the NIX prior is simply a linear shrinkage of the MLE of $\theta$, and hence does not alter individual rankings between the two methods. TP-NIX behaves similarly to PM-NIX, with slightly better power and a slightly lower false discovery rate.

References

Athey, S., G. Imbens, J. Metzger, and E.
Munro (2019): "Using Wasserstein Generative Adversarial Networks for the Design of Monte Carlo Simulations."

Bahadur, R. (1950): "On a problem in the theory of k populations," Annals of Mathematical Statistics, 21, 362–375.

Bahadur, R. R., and H. Robbins (1950): "The Problem of the Greater Mean," The Annals of Mathematical Statistics, 21, 469–487.

Basu, P., T. Cai, K. Das, and W. Sun (2018): "Weighted False Discovery Rate Control in Large-Scale Multiple Testing," Journal of the American Statistical Association, 113, 1172–1183.

Bechhofer, R., J. Kiefer, and M. Sobel (1968): Sequential Identification and Ranking Procedures. University of Chicago Press.

Bechhofer, R. E. (1954): "A Single-Sample Multiple Decision Procedure for Ranking Means of Normal Populations with Known Variances," The Annals of Mathematical Statistics, 25, 16–39.

Benjamini, Y., and Y. Hochberg (1995): "Controlling the False Discovery Rate: A Practical and Powerful Approach to Multiple Testing," Journal of the Royal Statistical Society, Series B, 57, 289–300.

Berger, J. O., and J. Deely (1988): "A Bayesian Approach to Ranking and Selection of Related Means With Alternatives to Analysis-of-Variance Methodology," Journal of the American Statistical Association, 83, 364–373.

Boyd, S., C. Cortes, M. Mohri, and A. Radovanovic (2012): "Accuracy at the Top," in Advances in Neural Information Processing Systems 25, ed. by F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger, pp. 953–961. Curran Associates, Inc.

Cao, H., W. Sun, and M. R. Kosorok (2013): "The optimal power puzzle: scrutiny of the monotone likelihood ratio assumption in multiple testing," Biometrika, 100(2), 495–502.

Chetty, R., J. Friedman, and J. Rockoff (2014a): "Measuring the impacts of teachers I: Evaluating bias in teacher value-added estimates," American Economic Review, 104, 2593–2632.

(2014b): "Measuring the impacts of teachers II: Teacher value-added and student outcomes in adulthood," American Economic Review, 104, 2633–2679.
Chetty, R., and N. Hendren (2018): "The impacts of neighborhoods on intergenerational mobility II: County-level estimates," The Quarterly Journal of Economics, 133(3), 1163–1228.

Chetty, R., N. Hendren, P. Kline, and E. Saez (2014): "Where is the land of opportunity? The geography of intergenerational mobility in the United States," The Quarterly Journal of Economics, 129, 1553–1624.

Davies, L. (2014): Data Analysis and Approximate Models. CRC Press.

Efron, B. (2011): "Tweedie's Formula and Selection Bias," Journal of the American Statistical Association, 106, 1602–1614.

Efron, B. (2016): "Empirical Bayes deconvolution estimates," Biometrika, 103, 1–20.

(2019): "Bayes, Oracle Bayes and Empirical Bayes," Statistical Science, 34, 177–201.

Efron, B., and C. Morris (1973): "Stein's Estimation Rule and Its Competitors - An Empirical Bayes Approach," Journal of the American Statistical Association, 68, 117–130.

Efron, B., R. Tibshirani, J. Storey, and V. Tusher (2001): "Empirical Bayes Analysis of Microarray Experiments," Journal of the American Statistical Association, 96, 1151–1160.

Gelman, A., and P. N. Price (1999): "All maps of parameter estimates are misleading," Statistics in Medicine, 18, 3221–3234.

Genovese, C., and L. Wasserman (2002): "Operating Characteristics and Extensions of the False Discovery Rate Procedure," Journal of the Royal Statistical Society, Series B, 64, 499–517.

Gilraine, M., J. Gu, and R. McMillan (2020): "A New Method for Estimating Teacher Value-Added," NBER Working Paper No. 27094.

Goel, P. K., and H. Rubin (1977): "On Selecting a Subset Containing the Best Population-A Bayesian Approach," The Annals of Statistics, 5, 969–983.

Goldstein, H., and D. J. Spiegelhalter (1996): "League tables and their limitations: Statistical issues in comparisons of institutional performance (with discussion)," Journal of the Royal Statistical Society, Series A, 159, 385–443.

Gu, J., and R.
Koenker (2016a): "On a Problem of Robbins," International Statistical Review, 84, 224–244.

Gu, J., and R. Koenker (2016b): "Unobserved Heterogeneity in Income Dynamics: An Empirical Bayes Perspective," Journal of Business & Economic Statistics, forthcoming.

Gupta, S. (1956): "On a decision rule for a problem in ranking means," Mimeograph Series No. 150, Institute of Statistics, University of North Carolina, Chapel Hill.

Gupta, S. S., and S. Panchapakesan (1979): Multiple Decision Procedures: Theory and Methodology of Selecting and Ranking Populations. Wiley.

Hanushek, E. A. (2011): "The economic value of higher teacher quality," Economics of Education Review, 30(3), 466–479.

Henderson, N., and M. Newton (2016): "Making the cut: improved ranking and selection for large-scale inference," Journal of the Royal Statistical Society, Series B, 78(4), 781–804.

Jiang, W. (2020): "On general maximum likelihood empirical Bayes estimation of heteroscedastic iid normal means," Electronic Journal of Statistics, 14(1), 2272–2297.

Kidney Epidemiology and Cost Center (2018): "Technical Notes on the Dialysis Facility Compare Quality of Patient Care Star Rating Methodology for the October 2018 Release," University of Michigan, School of Public Health.

Kiefer, J., and J. Wolfowitz (1956): "Consistency of the Maximum Likelihood Estimator in the Presence of Infinitely Many Incidental Parameters," The Annals of Mathematical Statistics, 27, 887–906.

Koenker, R. (2020): "Empirical Bayes Confidence Intervals: An R Vinaigrette," available online.

Koenker, R., and J. Gu (2015): "REBayes: An R Package for Empirical Bayes Methods," available from https://cran.r-project.org/package=REBayes.

Koenker, R., and J. Gu (2017): "REBayes: An R Package for Empirical Bayes Mixture Methods," Journal of Statistical Software, 82, 1–26.

(2019): "Comment: Minimalist G-Modeling," Statistical Science, 34, 209–213.

Koenker, R., and I.
Mizera (2014): "Convex Optimization, Shape Constraints, Compound Decisions and Empirical Bayes Rules," Journal of the American Statistical Association, 109, 674–685.

Laird, N. M., and T. A. Louis (1989): "Empirical Bayes Ranking Methods," Journal of Educational Statistics, 14, 29–46.

(1991): "Smoothing the non-parametric estimate of a prior distribution by roughening," Computational Statistics & Data Analysis, 12, 27–37.

Lin, R., T. Louis, S. Paddock, and G. Ridgeway (2006): "Loss Function Based Ranking in Two-Stage, Hierarchical Models," Bayesian Analysis, 1, 915–946.

(2009): "Ranking USRDS provider specific SMRs from 1998-2001," Health Services Outcomes Research Methodology, 9, 22–38.

Lindsay, B. (1995): "Mixture Models: Theory, Geometry and Applications," in NSF-CBMS Regional Conference Series in Probability and Statistics.

Mogstad, M., J. Romano, A. Shaikh, and D. Wilhelm (2020): "Inferences for ranks with applications to mobility across neighborhoods and academic achievement across countries," preprint.

Polyanskiy, Y., and Y. Wu (2020): "Self-regularizing Property of Nonparametric Maximum Likelihood Estimator in Mixture Models," preprint.

Portnoy, S. (1982): "Maximizing the probability of correctly ordering random variables using linear predictors," Journal of Multivariate Analysis, 12, 256–269.

Robbins, H. (1950): "A Generalization of the Method of Maximum Likelihood: Estimating a Mixing Distribution (Abstract)," The Annals of Mathematical Statistics, 21, 314–315.

(1951): "Asymptotically Subminimax Solutions of Compound Statistical Decision Problems," in Proceedings of the Second Berkeley Symposium on Mathematical Statistics and Probability, vol. I, pp. 131–149. University of California Press: Berkeley.

(1956): "An Empirical Bayes Approach to Statistics," in Proceedings of the Third Berkeley Symposium on Mathematical Statistics and Probability, vol. I, pp. 157–163. University of California Press: Berkeley.

Saha, S., and A.
Guntuboyina (2020): "On the nonparametric maximum likelihood estimator for Gaussian location mixture densities with application to Gaussian denoising," Annals of Statistics, 48(2), 738–762.

Schmidt, K. D. (2014): "On inequalities for moments and the covariance of monotone functions," Insurance: Mathematics and Economics, 55, 91–95.

Storey, J. D. (2002): "A direct approach to false discovery rates," Journal of the Royal Statistical Society, Series B, 64, 479–498.

Sun, W., and T. T. Cai (2007): "Oracle and Adaptive Compound Decision Rules for False Discovery Rate Control," Journal of the American Statistical Association, 102, 901–912.

Sun, W., and A. C. McLain (2012): "Multiple Testing of Composite Null Hypotheses in Heteroscedastic Models," Journal of the American Statistical Association, 107, 673–687.

University of Michigan Kidney Epidemiology and Cost Center (2009–2019): "Dialysis Facility Reports," available from https://data.cms.gov/dialysis-facility-reports.

van de Geer, S. (1993): "Hellinger-consistency of certain nonparametric maximum likelihood estimators," The Annals of Statistics, 21, 14–44.

van der Vaart, A. W. (2000): Asymptotic Statistics, vol. 3. Cambridge University Press.

Wald, A. (1950):