Asymptotically optimal inference in sparse sequence models with a simple data-dependent measure∗

Ryan Martin†

January 11, 2021

∗ This work is partially supported by the U.S. National Science Foundation, DMS–1811802.
† Department of Statistics, North Carolina State University, [email protected]
Abstract
For high-dimensional inference problems, statisticians have a number of competing interests. On the one hand, procedures should provide accurate estimation, reliable structure learning, and valid uncertainty quantification. On the other hand, procedures should be computationally efficient and able to scale to very high dimensions. In this note, I show that a very simple data-dependent measure can achieve all of these desirable properties simultaneously, along with some robustness to the error distribution, in sparse sequence models.
Keywords and phrases: high-dimensional inference; concentration rate; structure learning; uncertainty quantification; variational approximation.
1 Introduction

Dating back at least to Stein (1956, 1981), a fundamental problem in statistical inference is that of estimating a high-dimensional mean vector under additive noise. Specifically, suppose that the observable data is Y^n = (Y_1, ..., Y_n)^T with posited model

Y_i = θ_i + Z_i, i = 1, ..., n, (1)

where the errors Z_1, ..., Z_n are independent and identically distributed (iid) with mean zero and subgaussian tails; see Section 3 for specifics. While I don't assume the error distribution to be normal, I do assume that it's fully known, with no parameters to be estimated. The goal is inference on θ = (θ_1, ..., θ_n)^T.

In practical applications using models like (1), e.g., Efron (2010), the index n is generally quite large, which makes this a high-dimensional inference problem. As is often the case in such problems, certain structural assumptions on θ are necessary in order to achieve accurate estimation. Here, the structural assumption I'll consider is sparsity: most of the entries in the θ vector are zero. This notion of sparsity is consistent with the science: in genomics applications, the biology says that only a relatively small number of genes would be associated with a particular phenotype. Mathematically weaker notions of sparsity are possible, ones that don't assume any exact zeros, which I discuss below in Section 5.

There are now many different approaches to this problem. On the non-Bayesian side, the focus has been on developing shrinkage and thresholding estimators, and notable references include James and Stein (1961), Efron and Morris (1973), Donoho and Johnstone (1994), and Abramovich et al. (2006), among others. On the Bayesian side, the strategy is to construct suitable sparsity-inducing priors. This can be done using continuous shrinkage priors (e.g., Bhadra et al. 2017; Bhattacharya et al. 2015; Carvalho et al. 2010) or with spike-and-slab priors (e.g., Castillo and van der Vaart 2012; Johnstone and Silverman 2004; Martin and Walker 2014). Inspired by the developments in Belitser (2017) and Belitser and Nurushev (2020), here I want to take a different approach. In particular, I construct a so-called data-dependent measure, which is based on a simple approximation to a more complicated (empirical) Bayes formulation developed in Martin et al. (2017) and specialized to the present context by Martin and Ning (2020). Computation of this data-dependent measure requires no Markov chain Monte Carlo (MCMC) or optimization; it has a simple, intuitive, and explicit form. Most importantly, its theoretical properties are optimal, or nearly so, in every relevant sense:

• it concentrates asymptotically around the true sparse mean vector θ⋆ at the minimax optimal rate, adaptive to the unknown sparsity level;
• its mean vector is an asymptotically, adaptively minimax estimator;
• it concentrates asymptotically on a subspace of R^n whose dimension is roughly the same as the effective dimension of θ⋆;
• it consistently selects the non-zero entries of θ⋆;
• and it provides asymptotically valid uncertainty quantification.

The majority of these properties are familiar; perhaps only the last one requires further explanation.
From this data-dependent measure, one can readily extract credible sets that capture a specified fraction of the distribution's mass. A question is if that credible set is also asymptotically a confidence set in the sense that it covers the true θ⋆ with probability approximately equal to that same fraction. This has been a hot topic in the Bayesian literature recently (e.g., Castillo and Szabó 2020; Rousseau and Szabó 2020; Szabó et al. 2015; van der Pas et al. 2017), and here I establish this property for my simple data-dependent measure. The ideas and techniques developed in Belitser (2017) and Belitser and Nurushev (2020) will prove to be useful in this regard.

The remainder of the paper is organized as follows. Section 2 gives some background on an approach for high-dimensional inference that uses empirical or data-driven priors. The data-dependent measure investigated here originally arose as a variational approximation to this empirical Bayes-style posterior distribution. Section 3 investigates the asymptotic concentration properties of the data-dependent measure, providing justification for the claims made in the bulleted list above. Some comments about computation of the data-dependent measure are given in Section 4, along with an illustration to show that its theoretical and computational simplicity don't come at the cost of statistical inefficiency in finite samples. Concluding remarks are given in Section 5 and all the proofs are presented in Appendix A.

2 Background

While I'm not doing so here in this paper, it is common to assume that the Y_i's are independent and normally distributed, i.e., Y_i ~ N(θ_i, σ²), for i = 1, ..., n, with known variance σ². This distributional assumption determines a likelihood function and, in turn, a Bayes or empirical Bayes approach can be taken. One such approach developed recently is that in Martin and Walker (2014), which makes use of an empirical or data-dependent prior, that is, a prior distribution that directly depends on the data. The details here are based on the developments in Martin et al. (2017) for the high-dimensional linear regression problem; see Martin and Walker (2019) for some general theory on empirical priors and properties of the corresponding posterior distributions.

In the approach of Martin and Walker (2019), with Martin and Ning (2020) covering the sequence model special case, the prior is formulated by first expressing θ as (S, θ_S), where S ⊆ {1, 2, ..., n} represents the configuration of zeros and non-zeros and θ_S the configuration-specific parameters. Then the (empirical) prior for θ, denoted by Π_n, with subscript "n" to indicate data dependence, is specified hierarchically as follows; a small sampling sketch is given after the list.

• Marginal prior for S: The prior on the size/cardinality |S| of S has a mass function f_n, and the conditional prior for S, given its cardinality |S| = s, is uniform over the (n choose s) subsets of size s. Several different forms of f_n are discussed in Martin and Ning (2020), but I'll not need these for what I plan to do here.

• Conditional prior for θ_S, given S: With the configuration S given, it's determined that θ_i = 0 for i ∉ S, so only a prior for θ_S = {θ_i : i ∈ S} is necessary. For that prior, the choice in Martin and Ning (2020) is

(θ_S | S) ~ N_{|S|}(Y_S, σ² γ⁻¹ I_{|S|}),

where σ² is the assumed value of the variance in the normal model, γ > 0 is a tuning parameter, and I_{|S|} is the |S| × |S| identity matrix.
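For concreteness, here is a minimal sketch of how one might draw from the empirical prior Π_n just described, assuming the normal working model. The function name, the default settings, and the idea of passing the mass function f_n as a user-supplied vector are all mine, introduced only for illustration; the specific choices of f_n from Martin and Ning (2020) are not reproduced here.

```python
import numpy as np

def sample_empirical_prior(y, f_n, sigma2=1.0, gamma=1.0, seed=0):
    """One draw of theta from the empirical prior Pi_n described above.

    y      : observed sequence (length n), used to center the slab
    f_n    : vector of prior probabilities for |S| = 0, 1, ..., n
    sigma2 : assumed error variance sigma^2
    gamma  : precision factor in the slab N(Y_S, sigma^2 / gamma)
    """
    rng = np.random.default_rng(seed)
    n = len(y)
    s = rng.choice(n + 1, p=f_n)                 # draw the cardinality |S|
    S = rng.choice(n, size=s, replace=False)     # uniform over size-s subsets
    theta = np.zeros(n)
    theta[S] = rng.normal(y[S], np.sqrt(sigma2 / gamma))  # (theta_S | S) slab
    return theta
```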
Note that this conditional prior depends on data through the centering around Y_S = {Y_i : i ∈ S}.

With this empirical prior and the normal likelihood, Martin and Ning (2020) described construction of a corresponding posterior distribution as

Π^n(dθ) ∝ L_n(θ)^α Π_n(dθ),

where L_n(θ) = (2πσ²)^{-n/2} exp{-(2σ²)^{-1} ‖Y - θ‖²} is the likelihood function and α ∈ (0, 1) is a constant; discussion of the role of α is given in Martin and Walker (2019) and the references therein. Again, I'll not be using this posterior, so it's not necessary for me to give an explanation of α here. Computation of the posterior Π^n requires MCMC, but this can be done rather efficiently, as demonstrated in Martin and Ning (2020) and van Erven and Szabó (2020), compared to the proposed Monte Carlo computations for the horseshoe and other priors. The same is true for the high-dimensional regression version; see, e.g., Martin et al. (2017) and Martin and Tang (2020). However, as shown in Ray and Szabó (2020) and elsewhere, simple and computationally efficient variational approximations of these high-dimensional posteriors are possible. This inspired Yang and Martin (2020) to develop a corresponding variational approximation for the empirical prior formulation.

The jumping-off point for Yang and Martin (2020) was the recognition that the empirical prior described above can be written in a very simple form. Indeed, the prior assumes that θ_1, ..., θ_n are independent and the respective marginal distributions are

θ_i ~ λ_n N(Y_i, σ² γ⁻¹) + (1 - λ_n) δ_0, i = 1, ..., n, (2)

where λ_n is the prior inclusion probability, which depends on the sample size n but not on the particular index i. It's relatively easy to show that λ_n = n⁻¹ E|S|, where E|S| is the prior mean for |S| under f_n. For the two kinds of prior mass function f_n they considered, it follows that, for a constant a > 0,

λ_n = O(n^{-(a+1)}), n → ∞. (3)

That λ_n is vanishing with n, and faster than n⁻¹, is consistent with the idea that θ is believed to be sparse. For simplicity in what follows, I'll take λ_n = n^{-(1+a)}.

When the prior can be expressed in the basic spike-and-slab form (2), it is not too difficult to derive a variational approximation to the posterior distribution. Indeed, consider a mean-field approximation family of the form

⊗_{i=1}^n {φ_i N(μ_i, τ_i²) + (1 - φ_i) δ_0}.

This corresponds to independent components, each being a mixture of a normal and a point mass at 0, but with component-specific parameters. The variational approximation proceeds by finding the set of parameters {(μ_i, τ_i², φ_i) : i = 1, ..., n} that minimizes the Kullback–Leibler divergence of the posterior distribution Π^n from the above family. As Yang and Martin (2020) show, there are closed-form expressions, with no update equations as is typical, for those "best" parameters:

μ_i = y_i,
τ_i² = σ² (α + γ)⁻¹, (4)
logit(φ_i) = logit(λ_n) + ½ log{γ / (α + γ)} + (α / 2σ²) y_i².

Here, logit(φ) = log{φ / (1 - φ)}. Note that the variance component τ_i² is actually the same for each i. Moreover, the weights φ_i, which indicate whether a θ_i is zero or non-zero, are increasing in y_i², as one would expect.

My proposal here in this paper is to define a data-dependent measure

Δ_n = ⊗_{i=1}^n {φ_i N(μ_i, τ_i²) + (1 - φ_i) δ_0}, (5)

with the specific parameters (4) plugged in.
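Since everything in (4)-(5) is available in closed form, the whole measure Δ_n can be computed in a few lines. Here is a minimal sketch, assuming the normal working model with known σ²; the function name ddm_params and the default values of a, γ, and α are mine, chosen only for illustration and not as recommendations from the paper.

```python
import numpy as np

def ddm_params(y, sigma2=1.0, a=0.05, gamma=1.0, alpha=0.99):
    """Closed-form parameters (4) of the data-dependent measure (5).

    y      : observed sequence (length n)
    sigma2 : known variance proxy sigma^2
    a      : exponent in the prior inclusion probability lambda_n = n^{-(1+a)}
    gamma  : precision factor in the slab N(y_i, sigma^2 / gamma)
    alpha  : likelihood fraction, 0 < alpha < 1
    """
    n = len(y)
    lam = n ** (-(1.0 + a))                       # lambda_n as in (3)
    mu = y.copy()                                 # mu_i = y_i
    tau2 = sigma2 / (alpha + gamma)               # tau_i^2, same for every i
    logit_phi = (np.log(lam) - np.log1p(-lam)
                 + 0.5 * np.log(gamma / (alpha + gamma))
                 + 0.5 * alpha * y ** 2 / sigma2)
    phi = 1.0 / (1.0 + np.exp(-logit_phi))        # inclusion weights phi_i
    return mu, tau2, phi

# Summaries of Delta_n are then immediate, for example:
#   theta_hat = phi * mu            (mean vector of Delta_n)
#   S_hat     = {i : phi_i > 1/2}   (selected configuration, used in Section 3)
```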
This is different from the perspective in Yang and Martin (2020) because I'm not starting with a normal model, constructing a posterior distribution based on an empirical prior, and then developing a variational approximation. Instead, I'm directly defining Δ_n as the data-dependent measure I intend to use for inference on θ. Consequently, there are no choices about priors to be explained or claims of the variational approximation's accuracy to be justified. Whether Δ_n is a reasonable procedure to use rests entirely on what properties it possesses and, as I show next in Section 3, its properties are optimal in every practical respect.

3 Asymptotic properties

To set the scene, recall that the errors Z_1, ..., Z_n are iid copies of a random variable Z whose distribution is known. Moreover, I'll assume that Z is subgaussian in the sense that the moment-generating function of Z satisfies

E exp(tZ) ≤ exp(σ² t² / 2), all t ∈ R, (6)

where σ² is a scale parameter, often called the variance proxy, such that σ² ≥ V(Z). Of course, this covers the Gaussian case, but there are other examples too, including bounded random variables; see, e.g., Boucheron et al. (2013). One key property of subgaussian random variables is that they have exponential tail probability bounds; see Appendix A.1 below. In addition, the tails are sufficiently thin to ensure that the moment-generating function of (Z/σ)² exists in an interval that contains the origin. What's relevant to the analysis here is the upper endpoint of that interval, which I'll denote as T > 0. The actual endpoint depends on the specific form of the Z distribution, e.g., if Z is Gaussian, then T = 1/2. The largest value that I'm aware of that covers all subgaussian cases simultaneously is smaller; see, e.g., the bounds in Honorio and Jaakkola (2014).

The other key assumption is that the true mean vector θ⋆ is sparse, so I need to make this structural assumption precise. For a generic vector θ ∈ R^n, let S_θ denote its configuration, i.e., S_θ = {i : θ_i ≠ 0}. Let |S_θ| denote the cardinality of the configuration. Of course, |S_θ⋆| ≤ n, but I have in mind cases where the inequality is strict, even/especially cases where |S_θ⋆| ≪ n, since these are the only ones in which accurate estimation of θ⋆ is possible. To characterize this notion of accuracy, recall that the minimax rate (e.g., Donoho et al. 1992) relative to ℓ₂-error is given by

ε_n²(θ⋆) = |S_θ⋆| log(en / |S_θ⋆|). (7)

That is, every estimator θ̂ satisfies

sup_θ⋆ ε_n(θ⋆)⁻² E_θ⋆ ‖θ̂ - θ⋆‖² ≥ c,

for a universal constant c > 0. So we say that an estimator is minimax optimal if equality holds up to a constant,

sup_θ⋆ ε_n(θ⋆)⁻² E_θ⋆ ‖θ̂ - θ⋆‖² ≲ 1, (8)

where "≲" denotes inequality up to a universal constant. The following two results show that the data-dependent measure Δ_n in (5), with suitable choice of α, has this same asymptotic concentration rate property.
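As an informal sanity check of the adaptive-rate claim, and not as a substitute for the theory below, here is a small simulation sketch that compares the squared error of the Δ_n mean vector to ε_n²(θ⋆) across a few sparsity levels. It assumes the hypothetical ddm_params helper from the Section 2 sketch, standard normal errors, and well-separated signals; all numerical settings are arbitrary.

```python
import numpy as np

def eps2(n, s):
    """Squared rate (7): |S_theta| * log(e * n / |S_theta|)."""
    return s * np.log(np.e * n / s)

def mean_sq_error(n, s, reps=200, seed=1):
    """Average squared error of the Delta_n mean vector for an s-sparse truth."""
    rng = np.random.default_rng(seed)
    errs = []
    for _ in range(reps):
        theta_star = np.zeros(n)
        theta_star[:s] = 7.0                      # well-separated signals
        y = theta_star + rng.normal(size=n)       # model (1) with N(0, 1) errors
        mu, tau2, phi = ddm_params(y)             # helper from the Section 2 sketch
        errs.append(np.sum((phi * mu - theta_star) ** 2))
    return np.mean(errs)

n = 500
for s in (5, 10, 25):
    print(s, mean_sq_error(n, s) / eps2(n, s))    # ratio stays roughly stable in s
```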
Theorem 1. For the data-dependent measure Δ_n in (5), suppose that λ_n and α in (4) satisfy, respectively, (3) and α < T, for T determined by the subgaussian error distribution. For ε_n(θ⋆) defined in (7) and any sequence M_n > 0 with M_n → ∞,

sup_θ⋆ E_θ⋆ Δ_n({θ ∈ R^n : ‖θ - θ⋆‖² > M_n ε_n²(θ⋆)}) → 0, n → ∞.
Theorem 2. Under the conditions of Theorem 1, the mean vector θ̂ derived from the data-dependent measure Δ_n in (5) satisfies (8).

An important observation is that the data-dependent measure Δ_n is not aware of the sparsity level |S_θ⋆| of the true θ⋆, and yet it concentrates at that specific optimal rate. This feature is commonly referred to as adaptation, i.e., the concentration rate of Δ_n is adaptive to the unknown sparsity level of θ⋆ that determines the optimal rate.

Of course, the data-dependent measure's (nearly) optimal concentration rate is a plus, but this property alone doesn't imply that Δ_n is learning the low-dimensional structure in θ⋆. The next result demonstrates that the data-dependent measure is indeed learning that structure, in the sense that the dimension of the space on which Δ_n concentrates is roughly the same as the effective dimension |S_θ⋆| of θ⋆.
Theorem 3. Under the conditions of Theorem 1, for any M_n > 0 such that M_n → ∞, the data-dependent measure Δ_n in (5) satisfies

sup_θ⋆ E_θ⋆ Δ_n({θ ∈ R^n : |S_θ| > M_n |S_θ⋆|}) → 0, n → ∞.

Theorem 3 established that Δ_n concentrates on a subspace of roughly the effective dimension of θ⋆, but more can be said. The next result shows that, asymptotically, Δ_n will not assign positive mass to proper supersets of S_θ⋆. To ensure that all the signals are detectable, an additional assumption about the magnitude of those non-zero θ⋆_i values is needed. Specifically, consider

min_{i ∈ S_θ⋆} |θ⋆_i| ≥ H := {σ² (K/α) log n}^{1/2}, for some K > a, (9)

where σ² is the variance proxy of the error distribution, α is as in (4), and a is as in (3). Up to constants, (9) is equivalent to the "beta-min condition" common in the high-dimensional estimation literature.
Theorem 4. Under the conditions of Theorem 1, the data-dependent measure Δ_n satisfies

sup_θ⋆ E_θ⋆ Δ_n({θ ∈ R^n : S_θ ⊃ S_θ⋆}) → 0, n → ∞.

Moreover, if θ⋆ is such that (9) holds, then

sup_θ⋆ E_θ⋆ Δ_n({θ ∈ R^n : S_θ ⊉ S_θ⋆}) → 0, n → ∞.

If all the above conditions hold, then the two conclusions can be combined, giving

E_θ⋆ Δ_n({θ ∈ R^n : S_θ = S_θ⋆}) → 1, n → ∞. (10)

This theorem says that, asymptotically, Δ_n will not support configurations that contain zero entries of θ⋆. Moreover, if the non-zero entries in θ⋆ are sufficiently large, in the sense of (9), then Δ_n will not support configurations that miss any of those non-zero entries either. In the latter case, the only option is that Δ_n asymptotically supports the true configuration S_θ⋆, hence it effectively learns the low-dimensional structure in θ⋆. The result in (10) is often referred to as a selection consistency property, since any reasonable selection procedure based on Δ_n, e.g.,

Ŝ = arg max_S δ_n(S) or Ŝ = {i : φ_i > 0.5},

where δ_n(S) = Δ_n({θ : S_θ = S}), will, for large enough n, identify the correct S_θ⋆.

An important question is if inferences derived from the data-dependent measure are reliable in the sense that they control the frequency of errors, at least asymptotically. There are a number of ways this can be assessed in the present context. One is to consider certain one-dimensional summaries of the n-dimensional vector θ, in particular, linear functionals. Yang and Martin (2020) considered this when treating Δ_n as a variational approximation of the full (empirical) Bayes posterior under a normal model. Another angle is to consider a credible set for the full n-dimensional vector. This approach has been considered for a variety of different kinds of Bayes and empirical Bayes posterior distributions in the literature, e.g., van der Pas et al. (2017), Belitser (2017), Belitser and Nurushev (2020), and Belitser and Ghosal (2019). Here I'm going to derive analogous properties for the simple data-dependent measure Δ_n in (5) under the same general subgaussian error structure as above.

Recall that θ̂ is the mean vector of the data-dependent measure Δ_n. Define a ball centered around θ̂ with radius r > 0:

B_n(r) = {θ ∈ R^n : ‖θ - θ̂‖ ≤ r}.

The goal is to choose a data-dependent radius r = r̂ such that the ball approximately achieves a target coverage probability 1 - ζ and has near-optimal size. There are two natural strategies for selecting the radius r̂ based on Δ_n; a rough computational sketch follows the list. The first is motivated by ensuring the ball has sufficient probability under Δ_n, while the second is motivated by achieving the optimal size.

1. Quantile-based. Set r̂ = inf{r : Δ_n(θ : ‖θ - θ̂‖ ≤ r) ≥ 1 - ζ}.

2. Plug-in estimator-based. Define Ŝ = {i : φ_i > 0.5} and set r̂² = |Ŝ| log(en / |Ŝ|).

The following theorem summarizes the coverage probability and size properties of the two corresponding credible balls.
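Before stating the formal result, here is a rough sketch of how the two radii might be computed in practice: the quantile-based radius by Monte Carlo sampling from Δ_n, the plug-in radius directly from Ŝ. The sampler, the function names, and the number of draws are mine and purely illustrative; they assume the output of the hypothetical ddm_params helper from the Section 2 sketch.

```python
import numpy as np

def sample_ddm(mu, tau2, phi, draws=2000, seed=2):
    """Independent draws from Delta_n in (5): spike-and-slab coordinates."""
    rng = np.random.default_rng(seed)
    n = len(mu)
    slab = rng.normal(mu, np.sqrt(tau2), size=(draws, n))
    keep = rng.random((draws, n)) < phi            # Bernoulli(phi_i) inclusion
    return np.where(keep, slab, 0.0)

def quantile_radius(mu, tau2, phi, zeta=0.05, draws=2000):
    """Monte Carlo version of r_hat = inf{r : Delta_n(||theta - theta_hat|| <= r) >= 1 - zeta}."""
    theta_hat = phi * mu
    dists = np.sqrt(np.sum((sample_ddm(mu, tau2, phi, draws) - theta_hat) ** 2, axis=1))
    return np.quantile(dists, 1.0 - zeta)

def plugin_radius(mu, phi):
    """r_hat with r_hat^2 = |S_hat| log(e n / |S_hat|), where S_hat = {i : phi_i > 1/2}."""
    n = len(mu)
    s_hat = max(int(np.sum(phi > 0.5)), 1)         # guard against an empty S_hat
    return np.sqrt(s_hat * np.log(np.e * n / s_hat))
```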
Theorem 5. Assume the conditions of Theorem 4 hold. Also, let Θ_n ⊂ R^n denote the set where the condition (9) on the minimum signal size holds. Fix a significance level ζ ∈ (0, 1) and a threshold η > 0.

1. Let r̂ denote the Δ_n quantile-based radius defined above. Then there exist constants L and M such that, for all sufficiently large n,

sup_{θ⋆ ∈ Θ_n} P_θ⋆{θ⋆ ∉ B_n(M g_n r̂)} ≤ ζ and sup_θ⋆ P_θ⋆{r̂ > L ε_n(θ⋆)} ≤ η,

where the inflation factor g_n satisfies g_n² = log(en).
2. Let r̂ denote the plug-in estimator-based radius defined above. Then there exist constants L and M such that, for all sufficiently large n,

sup_{θ⋆ ∈ Θ_n} P_θ⋆{θ⋆ ∉ B_n(M r̂)} ≤ ζ and sup_θ⋆ P_θ⋆{r̂ > L ε_n(θ⋆)} ≤ η.
More-or-less explicit expressions for the constants (L, M) are given in the proof, so one could technically use these values for practical implementation. However, I make no claims that these constants are optimal; in fact, it's likely that they're conservative. In any case, the point is simply to say that the data-dependent measure's spread is "right" in the sense that credible balls with slightly larger than optimal size can achieve the nominal coverage probability. The additional inflation factor g_n in the quantile-based credible ball is needed because, apparently, the quantile itself is too small by a logarithmic factor. Similar inflation factors have been needed by other authors proving analogous results (e.g., Belitser and Ghosal 2019).

4 Computation and illustration

Computation of the data-dependent measure Δ_n in (5) is trivial and fast. Virtually every summary has a closed-form expression, so it's straightforward to produce the mean θ̂, to select a set of "active" variables via Ŝ = {i : φ_i > 0.5}, and to extract marginal credible intervals for each θ_i. This can be done almost instantaneously, far faster than the computations using the horseshoe package in R (van der Pas et al. 2016) and at least most of those methods compared in van Erven and Szabó (2020).

There are a host of available methods that provide high-quality estimation and structure learning. The theoretical support for uncertainty quantification using the simple data-dependent measure is the chief novelty here, so that's what I'll focus on. Although the theory presented above is for the joint credible ball, there is good reason (e.g., Martin and Ning 2020; Yang and Martin 2020) to believe that the corresponding marginal credible intervals would be approximately valid too. So my objective in this section is simply to show that the very fast computations do not come at the expense of validity or efficiency. That is, this simple data-dependent measure produces marginal credible intervals which are as good as or better than those from other methods sharing the same theoretical guarantees but carrying a heavier computational burden.

Specifically, I redo the simulation study presented in Martin and Ning (2020) comparing the coverage probability and mean length of the horseshoe and two empirical prior-based credible intervals; a rough sketch of this kind of experiment is given below. Let n = 500 and suppose that the errors are iid standard normal. Similar to Section 2 of van der Pas et al. (2017), consider a case where the first five entries of θ⋆ are relatively large, all equal to 7; the second five are intermediate, all equal to 2; one further entry, θ⋆_k say, is allowed to vary; and the remaining entries of θ⋆ are 0. Of interest is to see how large θ⋆_k needs to be in order for the coverage probability to be approximately equal to 0.95, the nominal level. Figure 1 plots the empirical coverage probability and mean lengths of the four marginal credible intervals for θ_k, as a function of the signal size θ⋆_k. Of course, the coverage probability will be low when the signal size is small, so of primary interest is how quickly the coverage probability climbs to near 0.95 as θ⋆_k increases. It's clear that the data-dependent measure and the empirical prior formulation of Martin and Walker (2014) perform comparably in the sense that both get to the target coverage probability by around θ⋆_k ≈ 6, before the other two.
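For completeness, here is a hedged sketch of the kind of marginal credible-interval computation and coverage experiment summarized in Figure 1. The equal-tailed Monte Carlo interval below is one natural way to read an interval off the spike-and-slab marginal of Δ_n, not necessarily the exact construction behind the figure; the placement of the varying coordinate at index 10 is just for concreteness, and the sketch again assumes the hypothetical ddm_params helper from the Section 2 sketch.

```python
import numpy as np

def marginal_ci(mu_i, tau2, phi_i, rng, level=0.95, draws=4000):
    """Equal-tailed interval for theta_i under its Delta_n marginal,
    phi_i * N(mu_i, tau2) + (1 - phi_i) * delta_0, approximated by Monte Carlo."""
    slab = rng.normal(mu_i, np.sqrt(tau2), size=draws)
    sample = np.where(rng.random(draws) < phi_i, slab, 0.0)
    lo, hi = np.quantile(sample, [(1 - level) / 2, (1 + level) / 2])
    return lo, hi

def coverage_for_signal(signal, k=10, n=500, reps=500, seed=4):
    """Empirical coverage and mean length of the DDM interval for coordinate k."""
    rng = np.random.default_rng(seed)
    cover, length = 0, 0.0
    for _ in range(reps):
        theta_star = np.zeros(n)
        theta_star[:5] = 7.0                      # strong signals
        theta_star[5:10] = 2.0                    # intermediate signals
        theta_star[k] = signal                    # the coordinate under study
        y = theta_star + rng.normal(size=n)
        mu, tau2, phi = ddm_params(y)             # helper from the Section 2 sketch
        lo, hi = marginal_ci(mu[k], tau2, phi[k], rng)
        cover += int(lo <= theta_star[k] <= hi)
        length += hi - lo
    return cover / reps, length / reps

for sig in (2.0, 4.0, 6.0, 8.0):
    print(sig, coverage_for_signal(sig))
```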
Figure 1: Plots of the coverage probability (panel a) and mean length (panel b) of the marginal credible intervals for θ_k, as a function of θ⋆_k, based on the four methods: horseshoe (HS), a beta-binomial empirical prior (EB1; Martin and Walker 2014), a complexity-driven empirical prior (EB2; Martin et al. 2017), and the data-dependent measure (DDM).

Interestingly, these two are no less efficient in terms of interval lengths, since they're all close to the optimal 2 × 1.96 = 3.92 length marked by the horizontal line on the plot. Finally, the data-dependent measure can easily scale to far bigger n, while the other methods would have serious difficulties with problems of this size. A thorough comparison of the proposed data-dependent measure with other fast methods for this problem (e.g., Ray and Szabó 2020; Ročková and George 2018; van Erven and Szabó 2020) would be an interesting direction to pursue.

5 Conclusion

In this paper, I've considered inference on a sparse, high-dimensional mean vector using a simple data-dependent measure. Assuming only subgaussianity of the error distribution, I was able to show that the data-dependent measure has optimal asymptotic convergence properties in virtually every respect. Most notably, I was able to establish that the data-dependent measure provides asymptotically valid uncertainty quantification, in the sense that credible balls centered around the data-dependent measure's mean vector, with suitable data-driven choices of radius, can achieve the nominal coverage probability while maintaining roughly the optimal size. My proofs of the various results presented herein are relatively straightforward, thanks to the simple form of the data-dependent measure under investigation. Moreover, the simple form makes computation, even for very large-scale problems, fast and easy. The numerical illustration in Section 4 shows, however, that the theoretical and computational simplicity don't come at the expense of statistical efficiency or poor finite-sample performance.

One possible extension is to explore the asymptotic behavior of this data-dependent measure under assumptions on θ⋆ that are mathematically weaker than my notion of sparsity here, e.g., under the so-called excessive bias restriction in Belitser (2017). I chose to work here with sparsity because it's a simpler and more intuitive condition, but I expect that more general results are possible, even with only subgaussianity assumptions.

Other kinds of low-dimensional structure in the mean vector can likely be handled using similarly simple data-dependent measures. For example, in a sequence model where the mean has a piecewise constant structure (e.g., Liu et al. 2020; van der Pas and Rockova 2017), sparsity shows up in the successive differences, so things would not be too much different from the case considered here. Clearly there are limitations to how far this kind of simple approach could go, but it's interesting and practically useful to find where the boundary is. In the high-dimensional regression problem, for example, the theoretical support available in, say, Ray and Szabó (2020) for mean-field variational approximations suggests that other simple and more directly defined data-dependent measures could have similar properties. It's perhaps not surprising that posterior concentration rate results could be achieved even if the inherent correlation in the full posterior distribution is ignored, but it would be interesting to see if other structure learning or uncertainty quantification properties were similarly unaffected.

It's a mathematical fact that credible sets in high dimensions can't be both valid confidence sets and of adaptively optimal size. The approach that most investigations have taken, including mine here, is to start with a data-dependent measure that achieves the optimal size property and show that its credible sets approximately achieve the target coverage probability too.
To me, the most interesting take-away message from this paper is that apparently very simple solutions can achieve this "optimal size and approximate coverage" property. If both simple and not-so-simple solutions can achieve the same properties, then arguably the standard is too low. How might we approach these structured high-dimensional problems in a more discriminating way? One idea is to turn the line of reasoning around, that is, to start with something that achieves valid uncertainty quantification and think about how to introduce the assumed structure, e.g., sparsity, in such a way that efficiency is gained but validity isn't lost. The approach I have in mind, with developments underway, is to start with a valid inferential model (e.g., Martin and Liu 2013, 2015), treat the assumed structure as genuine but incomplete prior information encoded as an imprecise probability, and combine it with the inferential model output in an appropriate way that preserves validity. The main difference between this and the standard approach taken in this paper is that validity is given higher priority than efficiency, which I believe to be more appropriate for scientific investigations.

A Proofs
A.1 Preliminary results
The only distributional assumption being made here is that the errors Z , . . . , Z n in (1)are iid copies of a random variable Z with subgaussian tails. As the name suggests, thiscondition implies that Z has some Gaussian-like properties. Here I collect a few relevantfacts about subgaussian random variables that will be used in what follows. • It is well known that the square of a subgaussian random variable is subexponential.I won’t need any specific properties of subexponential random variables, so there’sno need to give a formal definition. All that matters here is that subexponentialrandom variables have a moment-generating function in an interval that containsthe origin. In particular, E e t ( Z/σ ) (cid:46) , t ∈ (0 , T ] . (11)10 ww.researchers.one/articles/21.01.00001 Moreover, since translations don’t effect the tails of a distribution, the moment-generating function of ( Z + u ) , for any u , also exists for some arguments, in par-ticular, when the argument is negative, I get E e − t ( Z + u ) /σ (cid:46) e − tu /σ , t > . (12) • An equivalent definition of subgaussian random variables is that they admit anexponential tail probability bound just like the Gaussian. In particular, P ( | Z | > t ) ≤ e − t / σ , t > . (13)Next are two results that’ll be needed in the proofs of the theorems below. Thesemake use of the properties for subgaussian random variables described above. Lemma 1. If λ n satisfies (3) and α < T , then n (cid:88) i =1 E θ (cid:63)i φ i ≤ | S (cid:63) | + o (1) (cid:46) | S (cid:63) | . Proof.
First, split the sum as n (cid:88) i =1 E θ (cid:63)i φ i = (cid:88) i ∈ S (cid:63) E θ (cid:63)i φ i + (cid:88) i (cid:54)∈ S (cid:63) E θ (cid:63)i φ i . Since φ i ≤
1, the sum over i ∈ S (cid:63) is clearly ≤ | S (cid:63) | . For the sum over i (cid:54)∈ S (cid:63) , note thatall the means are zero and, therefore, all the terms in the sum are the same, i.e., (cid:88) i (cid:54)∈ S (cid:63) E θ (cid:63)i φ i = ( n − | S (cid:63) | ) E φ i . Again, since φ i ≤
1, it follows that E φ i ≤ E e logit( φ i ) = ξ n E e α ( Z/σ ) , where ξ n = exp { logit( λ n ) + log γα + γ } . Since α < T by assumption, it follows from (11) that E φ i (cid:46) ξ n (cid:46) exp { logit( λ n ) } = n − (1+ a ) , which implies (cid:88) i (cid:54)∈ S (cid:63) E θ (cid:63)i φ i (cid:46) n − a = o (1) , as n → ∞ . Combining this with the bound from the sum over i ∈ S (cid:63) completes the proof. Lemma 2.
If the λ_n in (5) satisfies (3), then

E_θ⋆ ∫ ‖θ - θ⋆‖² Δ_n(dθ) ≲ ε_n²(θ⋆).
By the definition of ∆ n in (5), it’s easy to check that (cid:90) (cid:107) θ − θ (cid:63) (cid:107) ∆ n ( dθ ) = n (cid:88) i =1 (cid:90) ( θ i − θ (cid:63)i ) ∆ n ( dθ )= n (cid:88) i =1 (cid:110) φ i (cid:90) ( θ i − θ (cid:63)i ) N ( θ i | Y i , τ i ) dθ i + (1 − φ i ) θ (cid:63) i (cid:111) = n (cid:88) i =1 (cid:2) φ i { τ i + ( Y i − θ (cid:63)i ) } + (1 − φ i ) θ (cid:63) i (cid:3) = n (cid:88) i =1 τ i φ i + (cid:88) i (cid:54)∈ S (cid:63) φ i Y i + (cid:88) i ∈ S (cid:63) { φ i ( Y i − θ (cid:63)i ) + (1 − φ i ) θ (cid:63) i } . Note that τ i are constant in i and do not depend on data, this can come outside thesame (and the following expectation). Taking expectation, as using the fact that φ i ≤ τ n (cid:88) i =1 E φ i + (cid:88) i (cid:54)∈ S (cid:63) E φ i Y i + σ | S (cid:63) | + (cid:88) i ∈ S (cid:63) θ (cid:63) i E (1 − φ i ) . I’ll deal with each term in this sum separately. The first term is (cid:46) | S (cid:63) | by Lemma 1.Second, consider E θ (cid:63)i φ i Y i for i (cid:54)∈ S (cid:63) , which means θ (cid:63)i = 0. For x > E φ i Y i = E φ i Y i | Y i |≤ x + E φ i Y i | Y i | >x . The first term on the right-hand side is x E φ i (cid:46) x n − ( a +1) , as shown in the proof ofLemma 1. The second term is bounded by E Y i | Y i | >x = E Z | Z | >x . For this, we can usethe tail probability bound (13) for Z as follows: E Z | Z | >x (cid:90) ∞ P ( Z | Z | >x > t ) dt = (cid:90) ∞ P {| Z | > max( x, t / ) } dt = (cid:90) x P ( | Z | > x ) dt + (cid:90) ∞ x P ( | Z | > t / ) dt ≤ x e − x / σ + 2 (cid:90) ∞ x e − t/ σ dt (cid:46) ( x + 1) e − x / σ . Take x = { σ log( n/ | S (cid:63) | ) } / , so that E φ i Y i (cid:46) n − ( a +1) log( n/ | S (cid:63) | ) + | S (cid:63) | n − log( en/ | S (cid:63) | ) . Summing over i (cid:54)∈ S (cid:63) gives (cid:88) i (cid:54)∈ S (cid:63) E φ i y i (cid:46) n − a log( n/ | S (cid:63) | ) + | S (cid:63) | log( en/ | S (cid:63) | ) (cid:46) ε n ( θ (cid:63) ) . (14)Lastly, for the third term, recall that1 − φ i = 1 − { ξ − n e − ( α/ σ ) Y i } − , ww.researchers.one/articles/21.01.00001 where ξ n ∝ exp {− logit( λ n ) } . Since z (cid:55)→ (1 + z ) − is convex, Jensen’s inequality says E θ (cid:63)i (1 − φ i ) ≤ − { ξ − n E θ (cid:63)i e − ( α/ σ ) Y i } − . By (12), the expectation satisfies E θ (cid:63)i e − ( α/ σ ) Y i = E e − ( α/ Z + θ (cid:63)i ) /σ ≤ ce − ( α/ θ (cid:63) i /σ , for a constant c >
0. Therefore, E θ (cid:63)i (1 − φ i ) ≤ cξ − n e − kθ (cid:63) i cξ − n e − kθ (cid:63) i , where k = α/ σ . Multiplying by θ (cid:63) i gives θ (cid:63) i E θ (cid:63)i (1 − φ i ) ≤ cξ − n θ (cid:63) i e − kθ (cid:63) i cξ − n e − kθ (cid:63) i . As a function of θ (cid:63) i , this has the form of a gamma density with shape parameter 2 andrate parameter k . Such a density has mode k − . Plugging in that mode, what’s left is abounded sequence in n , so the right-hand side above is (cid:46)
1, which implies (cid:88) i ∈ S (cid:63) θ (cid:63) i E θ (cid:63)i (1 − φ i ) (cid:46) | S (cid:63) | . Putting all the bounds together gives E θ (cid:63) (cid:90) (cid:107) θ − θ (cid:63) (cid:107) ∆ n ( dθ ) (cid:46) | S (cid:63) | + ε n ( θ (cid:63) ) + 1 (cid:46) ε n ( θ (cid:63) ) . A.2 Proofs of Theorems 1–3
Proof of Theorem 1.
By Markov's inequality,

Δ_n({θ : ‖θ - θ⋆‖² > M_n ε_n²(θ⋆)}) ≤ {M_n ε_n²(θ⋆)}⁻¹ ∫ ‖θ - θ⋆‖² Δ_n(dθ).

Taking expectation and applying the bound in Lemma 2 gives

E_θ⋆ Δ_n({θ : ‖θ - θ⋆‖² > M_n ε_n²(θ⋆)}) ≲ M_n⁻¹,

and since M_n → ∞, the claim follows.
By Jensen's inequality,

‖θ̂ - θ⋆‖² ≤ ∫ ‖θ - θ⋆‖² Δ_n(dθ).

Then the claim follows by Lemma 2.
Proof of Theorem 3.
By Markov's inequality,

Δ_n({θ : |S_θ| > M_n |S_θ⋆|}) ≤ {M_n |S_θ⋆|}⁻¹ Σ_{i=1}^n φ_i,

where the sum on the right-hand side is the expectation of |S_θ| under θ ~ Δ_n. Take expectations of both sides and apply the bound in Lemma 1 to get

E_θ⋆ Δ_n({θ : |S_θ| > M_n |S_θ⋆|}) ≲ M_n⁻¹,

and, since M_n → ∞, the claim follows.

A.3 Proof of Theorem 4
Let δ n denote the mass function of the marginal distribution of S θ under θ ∼ ∆ n , i.e., δ n ( S ) = ∆ n ( { θ : S θ = S } ) , S ⊆ { , , . . . , n } . Also, for the given θ (cid:63) , let S (cid:63) = S θ (cid:63) . From the simple form of ∆ n , it’s easy to check that δ n ( S ) = (cid:89) i ∈ S φ i · (cid:89) i (cid:54)∈ S (1 − φ i ) . This leads to a convenient bound δ n ( S ) ≤ δ n ( S ) δ n ( S (cid:63) ) = (cid:89) i ∈ S ∩ S (cid:63)c e logit( φ i ) (cid:89) i ∈ S c ∩ S (cid:63) e − logit( φ i ) . Since each φ i only depends on Y i , and these are independent, we can interchange theorder of expectation and product. Also, for those i ∈ S (cid:63)c , with θ (cid:63)i = 0, the φ i ’s are iid,so each term in that product has the same expectation. Therefore, E θ (cid:63) δ n ( S ) ≤ (cid:8) E e logit( φ ) (cid:9) | S ∩ S (cid:63)c | (cid:89) i ∈ S c ∩ S (cid:63) E θ (cid:63)i e − logit( φ i ) . By the moment-generating function bounds in (11) and (12), E e logit( φ ) ≤ c ξ n E θ (cid:63)i e − logit( φ i ) ≤ c ξ − n e − kθ (cid:63) i , where k = α/ σ , ξ n = exp { logit( λ n ) } ∼ n − (1+ a ) , and c and c are the hidden propor-tionality constants in (11) and (12), respectively.Note also that, by Theorem 3, the δ n -probability of the event “ | S | > M n | S (cid:63) | ” hasvanishing expectation for any M n → ∞ . Thanks to this, I can immediately restrict myattention to those S such that “ | S | ≤ M n | S (cid:63) | ” in what follows.Now consider two distinct cases separately, namely, S ⊃ S (cid:63) and S (cid:54)⊇ S (cid:63) . First, forany S ⊃ S (cid:63) , it follows that | S c ∩ S (cid:63) | = 0. So E θ (cid:63) δ n ( S : S ⊃ S (cid:63) ) ≤ (cid:88) S : S ⊃ S (cid:63) , | S |≤ M n | S (cid:63) | { E e logit( φ ) } | S ∩ S (cid:63)c | = M n | S (cid:63) | (cid:88) t =1 (cid:18) n − | S (cid:63) | t (cid:19) { E e logit( φ ) } t ≤ M n | S (cid:63) | (cid:88) t =1 { e ( n − | S (cid:63) | ) E e logit( φ ) } t (cid:46) ( n − | S (cid:63) | ) ξ n . By definition of ξ n , the upper bound is vanishing as n → ∞ , proving the first claim.14 ww.researchers.one/articles/21.01.00001 Next, for any S (cid:54)⊇ S (cid:63) , there must be at least one component in S (cid:63) that is not included in S . So, E θ (cid:63) δ n ( S : S (cid:54)⊇ S (cid:63) ) ≤ (cid:88) S : S (cid:54)⊇ S (cid:63) , | S |≤ M n | S (cid:63) | (cid:104)(cid:8) E e logit( φ ) (cid:9) | S ∩ S (cid:63)c | (cid:89) i ∈ S c ∩ S (cid:63) E θ (cid:63)i e − logit( φ i ) (cid:105) ≤ (cid:88) S : S (cid:54)⊇ S (cid:63) , | S |≤ M n | S (cid:63) | ( c ξ n ) | S ∩ S (cid:63)c | ( c ξ − n e − kH ) | S c ∩ S (cid:63) | = M n | S (cid:63) | (cid:88) s =0 s ∧ ( | S (cid:63) |− (cid:88) t =0 (cid:18) | S (cid:63) | t (cid:19)(cid:18) n − | S (cid:63) | s − t (cid:19) ( c ξ n ) s − t ( c ξ − n e − kH ) | S (cid:63) |− t ≤ M n | S (cid:63) | (cid:88) s =0 s ∧ ( | S (cid:63) |− (cid:88) t =0 { c ( n − | S (cid:63) | ) ξ n } s − t { c | S (cid:63) | ξ − n e − kH } | S (cid:63) |− t . (In the above derivation, s represents | S | and t represents | S ∩ S (cid:63) | , which implies s − t = | S ∩ S (cid:63)c | and | S (cid:63) | − t = | S c ∩ S (cid:63) | .) Note that t < | S (cid:63) | because S (cid:54)⊇ S (cid:63) implies that S can’t include all the entries in S (cid:63) . This means that there is a constant factor | S (cid:63) | ξ − n e − kH , (15)which goes to 0 as n → ∞ if H is as in (9). The terms that involve ( n − | S (cid:63) | ) ξ n alsovanish as in the previous case above. 
So all the terms in the sum are geometrically small,hence the sum is bounded. But since the common factor (15) vanishes, the upper bounditself vanishes, proving the second claim of the theorem. A.4 Proof of Theorem 5
For any data-dependent radius ˆ ρ , “ B n ( ˆ ρ ) (cid:54)(cid:51) θ (cid:63) ” is equivalent to “ (cid:107) ˆ θ − θ (cid:63) (cid:107) > ˆ ρ ,” and thefollowing decomposition, which holds for any deterministic R >
0, is helpful: P θ (cid:63) {(cid:107) ˆ θ − θ (cid:63) (cid:107) > ˆ ρ } ≤ P θ (cid:63) {(cid:107) ˆ θ − θ (cid:63) (cid:107) > ˆ ρ, ˆ ρ > R } + P θ (cid:63) { ˆ ρ ≤ R }≤ P θ (cid:63) {(cid:107) ˆ θ − θ (cid:63) (cid:107) > R } + P θ (cid:63) { ˆ ρ ≤ R } . (16)The first term in (16) has nothing to do with the radius, so this can be approached thesame way for both types of credible balls. Indeed, by Markov’s inequality, P θ (cid:63) {(cid:107) ˆ θ − θ (cid:63) (cid:107) > R } ≤ R − E θ (cid:63) (cid:107) ˆ θ − θ (cid:63) (cid:107) . By Theorem 2, the expectation in the upper bound is no more than M (cid:48) ε n ( θ (cid:63) ) for someconstant M (cid:48) > ε n ( θ (cid:63) ) in (7). Therefore, if R is a suitable multiple of ε n ( θ (cid:63) ), thenthe first term in (16) can be made less than a fraction of ζ . The specific constants dependon how ˆ r is defined, and the details for each case are presented below. Proof for the quantile-based radius.
Start with a bound on the non-coverage proba-bility for the quantile-based radius. Here I’ll bound the non-coverage probability by asum of three terms, each will be bounded by ζ/
3, for sufficiently large n . The first of thesethree terms comes from the above analysis, so we set R = (3 M (cid:48) /ζ ) ε n ( θ (cid:63) ) and conclude P θ (cid:63) {(cid:107) ˆ θ − θ (cid:63) (cid:107) > R } ≤ ζ/ . ww.researchers.one/articles/21.01.00001 Next, to bound the second term in (16), recall that ˆ ρ = M g n ˆ r , where ˆ r is based on the(1 − ζ )-quantile of ∆ n and M is some sufficiently large constant yet to be determined.Define (cid:101) R = ( M g n ) − / R . Then by definition of ˆ r , and Markov’s inequality (again), P θ (cid:63) { ˆ ρ ≤ R } = P θ (cid:63) { ˆ r ≤ (cid:101) R }≤ P θ (cid:63) { ∆ n ( (cid:107) θ − ˆ θ (cid:107) ≤ (cid:101) R ) ≥ − ζ }≤ (1 − ζ ) − E θ (cid:63) ∆ n ( (cid:107) θ − ˆ θ (cid:107) ≤ (cid:101) R ) . So the second term in (16) can be upper-bounded if the above expectation can be upper-bounded. Towards this, let S (cid:63) = S θ (cid:63) and use the total probability formula to write∆ n ( (cid:107) θ − ˆ θ (cid:107) ≤ (cid:101) R ) = (cid:88) S ∆ n ( (cid:107) θ − ˆ θ (cid:107) ≤ (cid:101) R | S ) δ n ( S ) ≤ { − δ n ( S (cid:63) ) } + ∆ n ( (cid:107) θ − ˆ θ (cid:107) ≤ (cid:101) R | S (cid:63) ) . Under the conditions of Theorem 4, the first term has expectation that vanishes with n ,hence is eventually smaller than ζ/
3. So it remains to look at the second term in theabove display. The conditional distribution of θ under ∆ n , given S (cid:63) , is θ ∼ N | S (cid:63) | ( Y S (cid:63) , τ I ) ⊗ δ | S (cid:63)c | , where τ = σ ( α + γ ) − from (4) is deterministic. For such a θ , (cid:107) θ − ˆ θ (cid:107) d = (cid:88) i ∈ S (cid:63) { τ G i + (1 − φ i ) Y i } + (cid:88) i (cid:54)∈ S (cid:63) ( φ i Y i ) ≥ (cid:88) i ∈ S (cid:63) { τ G i + (1 − φ i ) Y i } , where G S (cid:63) = { G i : i ∈ S (cid:63) } are iid N (0 , n ( (cid:107) θ − ˆ θ (cid:107) ≤ (cid:101) R | S (cid:63) ) ≤ P (cid:110)(cid:88) i ∈ S (cid:63) { τ G i + (1 − φ i ) Y i } ≤ (cid:101) R (cid:111) ≤ P { τ (cid:107) G S (cid:63) (cid:107) ≤ (cid:101) R } , where the second line follows by Anderson’s inequality. Note that (cid:107) G S (cid:63) (cid:107) ∼ ChiSq ( | S (cid:63) | ),which means that (cid:107) G S (cid:63) (cid:107) scales like | S (cid:63) . Consider two cases: | S (cid:63) | = O (1) and | S (cid:63) | → ∞ .In the former case, (cid:107) G S (cid:63) (cid:107) is stochastically bounded, so the upper bound in the abovedisplay can be made less than ζ/ (cid:101) R is small or, equivalently, if M is sufficiently large.For the latter case, with | S (cid:63) | → ∞ , the above probability can be bounded as (cid:107) G S (cid:63) (cid:107) ≤ τ − (cid:101) R ⇐⇒ (cid:107) G S (cid:63) (cid:107) − | S (cid:63) | ≤ τ − (cid:101) R − | S (cid:63) |⇐⇒ (cid:107) G S (cid:63) (cid:107) − | S (cid:63) | ≤ − w | S (cid:63) | , where w = 1 − ( τ | S (cid:63) | ) − (cid:101) R . To see that w > (cid:101) R < τ | S (cid:63) | forsufficiently large n , plug in the definition of (cid:101) R to get(3 M (cid:48) /ζ ) ε n ( θ (cid:63) ) M g n < τ | S (cid:63) | ⇐⇒ log( en/ | S (cid:63) | )log( en ) < M τ M (cid:48) /ζ . ww.researchers.one/articles/21.01.00001 Note that the right-most inequality holds if
M > M (cid:48) / ( τ ζ ). Therefore, P { τ (cid:107) G S (cid:63) (cid:107) ≤ (cid:101) R } ≤ P {(cid:107) G S (cid:63) (cid:107) − | S (cid:63) | ≤ − w | S (cid:63) |} . Using a standard tail probability bound for chi-square random variables (e.g., Laurentand Massart 2000, Lemma 1), the probability in the above display is bounded by P { τ (cid:107) G S (cid:107) ≤ (cid:101) R } ≤ P {(cid:107) G S (cid:63) (cid:107) − | S (cid:63) | ≤ − | S (cid:63) | x ) / } ≤ e − x , where x = w | S (cid:63) | /
4. Since | S (cid:63) | → ∞ , this upper bound will eventually be less than ζ/ M as above. Putting everything together, all three terms upper bounding thenon-coverage probability are less than ζ/ n sufficiently large.Now, for the size of the credible ball. Write ε n = ε n ( θ (cid:63) ). By definition of ˆ r ,ˆ r > Lε n ⇐⇒ ∆ n ( (cid:107) θ − ˆ θ (cid:107) ≤ Lε n ) < − ζ ⇐⇒ ∆ n ( (cid:107) θ − ˆ θ (cid:107) > Lε n ) > ζ. For the constant M (cid:48) used above from Theorem 2, and for the specified threshold η > L ≥ M (cid:48) { ζη/ − / } . Then the triangle inequality implies∆ n ( (cid:107) θ − ˆ θ (cid:107) > Lε n ) ≤ ∆ n ( (cid:107) θ − θ (cid:63) (cid:107) > M (cid:48) ε n ) + 1 {(cid:107) ˆ θ − θ (cid:63) (cid:107) > ( ζη/ − M (cid:48) ε n } . Using Markov’s inequality twice gives P θ (cid:63) { ˆ r > Lε n } ≤ ζ − E θ (cid:63) ∆ n ( (cid:107) θ − ˆ θ (cid:107) > Lε n ) ≤ ζ − (cid:110) E θ (cid:63) ∆ n ( (cid:107) θ − θ (cid:63) (cid:107) > M (cid:48) ε n ) + ζη E θ (cid:63) (cid:107) ˆ θ − θ (cid:63) (cid:107) M (cid:48) ε n (cid:111) . The first term in the curly brackets is vanishing as n → ∞ by Theorem 1 and, hence, willeventually be less than ζη/
2. By Theorem 2 and the definition of M (cid:48) , the second termin the curly brackets is no more than ζη/
2. Therefore, the upper bound is no more than η , which proves the claim. Proof for the plug-in estimator-based radius.
Following the same argument as above,let R = (2 M (cid:48) /ζ ) ε n ( θ (cid:63) ). Then Theorem 2 implies that the first term in (16) is no morethan ζ/
2. For the second term in (16), note that x (cid:55)→ x log( en/x ) is increasing on [0 , n ] . (17)This implies that, for sufficiently small c > | ˆ S | log en | ˆ S | ≤ c | S (cid:63) | log en | S (cid:63) | = ⇒ | ˆ S | log en | ˆ S | ≤ c | S (cid:63) | log enc | S (cid:63) | ⇐⇒ | ˆ S | ≤ c | S (cid:63) | . Therefore, to get a bound on P θ (cid:63) { ˆ ρ ≤ R } , with ˆ ρ = M ˆ r , it suffices to bound P θ (cid:63) {| ˆ S | ≤ c | S (cid:63) |} , where c = 2 M (cid:48) /M ζ , which can be made small with suitable choice of M . Note that | ˆ S | is a sum of independent but non-identically distributed Bernoulli random variables, soits expected value is µ (cid:63) = E θ (cid:63) | ˆ S | = (cid:88) i ∈ S (cid:63) P θ (cid:63)i ( φ i > ) + ( n − | S (cid:63) | ) P ( φ i > ) . ww.researchers.one/articles/21.01.00001 Both lower and upper bounds for µ (cid:63) are needed. Towards this, P ( φ i > ) = P { logit( φ i ) > } = P (cid:2) ( Z/σ ) > σ α { logit( λ n ) + log γα + γ } (cid:3) . From the tail probability bound (13), it follows that P ( φ i > ) (cid:46) n − (1+ a ) , i (cid:54)∈ S (cid:63) , and, therefore, µ (cid:63) ≤ | S (cid:63) | + n − (1+ a ) ( n − | S (cid:63) | ) = { o ( n − a ) }| S (cid:63) | . (18)For the lower bound on µ (cid:63) , µ (cid:63) ≥ (cid:88) i ∈ S (cid:63) P θ (cid:63)i ( φ i > ) ≥ | S (cid:63) | P H ( φ i > ) , where H is the minimum (non-zero) signal size in (9). Then P H ( φ i > ) = 1 − P H { logit( φ i ) < } = 1 − P H (cid:2) − α σ Y i > −{ logit( λ n ) + log γα + γ } (cid:3) . Apply the exponential function to both sides of the inequality inside P H ( · · · ) and thenuse Markov’s inequality, the bound in (12), the size of H in (9), and an argument likethat at the end of the proof of Theorem 4 to get P H (cid:2) − α σ Y i > −{ logit( λ n ) + log γα + γ } (cid:3) = o (1) , n → ∞ . Therefore, µ (cid:63) ≥ { − o (1) }| S (cid:63) | . Using the standard Chernoff bounds for Bernoulli randomvariables, we get P θ (cid:63) {| ˆ S | ≤ c | S (cid:63) |} = inf t> e t | S (cid:63) | E θ (cid:63) e − t | ˆ S | = exp (cid:110) c | S (cid:63) | log µ (cid:63) c | S (cid:63) | + c | S (cid:63) | − µ (cid:63) (cid:111) . Plug in the lower and upper bounds for µ (cid:63) to get P θ (cid:63) {| ˆ S | ≤ c | S (cid:63) |} ≤ exp (cid:104) | S (cid:63) | (cid:110) c log 1 + o ( n − a ) c − (1 − c ) + o (1) (cid:111)(cid:105) . The term inside {· · · } is negative for c < n sufficiently large, so the right-hand sidecan be upper-bounded by ζ/ M is a sufficiently large multiple of 2 M (cid:48) /ζ .Finally, for the size of the credible ball, by the same monotonicity property (17) usedabove, it follows that P θ (cid:63) { ˆ r > Lε n } ≤ P θ (cid:63) {| ˆ S | > L | S (cid:63) |} . By Markov’s inequality and the upper bound on µ (cid:63) = E θ (cid:63) | ˆ S | in (18), P θ (cid:63) {| ˆ S | > L | S (cid:63) |} ≤ { o ( n − a ) }| S (cid:63) | L | S (cid:63) | . Therefore, if
L > η − , then the size condition P θ (cid:63) { ˆ r > Lε n } ≤ η holds.18 ww.researchers.one/articles/21.01.00001 References
Abramovich, F., Benjamini, Y., Donoho, D. L., and Johnstone, I. M. (2006). Adapting tounknown sparsity by controlling the false discovery rate.
Ann. Statist. , 34(2):584–653.Belitser, E. (2017). On coverage and local radial rates of credible sets.
Ann. Statist. ,45(3):1124–1151.Belitser, E. and Ghosal, S. (2019). Empirical Bayes oracle uncertainty quantification.
Ann. Statist., to appear.
Belitser, E. and Nurushev, N. (2020). Needles and straw in a haystack: Robust confidence for possibly sparse sequences.
Bernoulli , 26(1):191–225.Bhadra, A., Datta, J., Polson, N. G., and Willard, B. (2017). The horseshoe+ estimatorof ultra-sparse signals.
Bayesian Anal. , 12(4):1105–1131.Bhattacharya, A., Pati, D., Pillai, N. S., and Dunson, D. B. (2015). Dirichlet-Laplacepriors for optimal shrinkage.
J. Amer. Statist. Assoc. , 110(512):1479–1490.Boucheron, S., Lugosi, G., and Massart, P. (2013).
Concentration Inequalities . OxfordUniversity Press, Oxford. A Nonasymptotic Theory of Independence, With a forewordby Michel Ledoux.Carvalho, C. M., Polson, N. G., and Scott, J. G. (2010). The horseshoe estimator forsparse signals.
Biometrika , 97(2):465–480.Castillo, I. and Szab´o, B. (2020). Spike and slab empirical Bayes sparse credible sets.
Bernoulli , 26(1):127–158.Castillo, I. and van der Vaart, A. (2012). Needles and straw in a haystack: posteriorconcentration for possibly sparse sequences.
Ann. Statist. , 40(4):2069–2101.Donoho, D. L. and Johnstone, I. M. (1994). Minimax risk over l p -balls for l q -error. Probab.Theory Related Fields , 99(2):277–303.Donoho, D. L., Johnstone, I. M., Hoch, J. C., and Stern, A. S. (1992). Maximum entropyand the nearly black object.
J. Roy. Statist. Soc. Ser. B , 54(1):41–81. With discussionand a reply by the authors.Efron, B. (2010).
Large-Scale Inference , volume 1 of
Institute of Mathematical StatisticsMonographs . Cambridge University Press, Cambridge.Efron, B. and Morris, C. (1973). Stein’s estimation rule and its competitors—an empiricalBayes approach.
J. Amer. Statist. Assoc. , 68:117–130.Honorio, J. and Jaakkola, T. (2014). Tight Bounds for the Expected Risk of LinearClassifiers and PAC-Bayes Finite-Sample Guarantees. In Kaski, S. and Corander, J.,editors,
Proceedings of the Seventeenth International Conference on Artificial Intelli-gence and Statistics , volume 33 of
Proceedings of Machine Learning Research, pages 384–392, Reykjavik, Iceland. PMLR.
James, W. and Stein, C. (1961). Estimation with quadratic loss. In
Proc. 4th Berke-ley Sympos. Math. Statist. and Prob., Vol. I , pages 361–379. Univ. California Press,Berkeley, Calif.Johnstone, I. M. and Silverman, B. W. (2004). Needles and straw in haystacks: empiricalBayes estimates of possibly sparse sequences.
Ann. Statist. , 32(4):1594–1649.Laurent, B. and Massart, P. (2000). Adaptive estimation of a quadratic functional bymodel selection.
Ann. Statist. , 28(5):1302–1338.Liu, C., Martin, R., and Shen, W. (2020). Empirical priors and posterior concentrationin a piecewise polynomial sequence model. arXiv:1712.03848 .Martin, R. and Liu, C. (2013). Inferential models: a framework for prior-free posteriorprobabilistic inference.
J. Amer. Statist. Assoc. , 108(501):301–313.Martin, R. and Liu, C. (2015).
Inferential Models: Reasoning with Uncertainty , volume147 of
Monographs on Statistics and Applied Probability . CRC Press, Boca Raton, FL.Martin, R., Mess, R., and Walker, S. G. (2017). Empirical Bayes posterior concentrationin sparse high-dimensional linear models.
Bernoulli , 23(3):1822–1847.Martin, R. and Ning, B. (2020). Empirical priors and coverage of posterior credible setsin a sparse normal mean model.
Sankhy¯a A. , 82:477–498. Special issue in memory ofJayanta K. Ghosh.Martin, R. and Tang, Y. (2020). Empirical priors for prediction in sparse high-dimensionallinear regression.
J. Mach. Learn. Res. , 21(144):1–30.Martin, R. and Walker, S. G. (2014). Asymptotically minimax empirical Bayes estimationof a sparse normal mean vector.
Electron. J. Stat. , 8(2):2188–2206.Martin, R. and Walker, S. G. (2019). Data-dependent priors and their posterior concen-tration rates.
Electron. J. Stat. , 13(2):3049–3081.Ray, K. and Szab´o, B. (2020). Variational Bayes for high-dimensional linear regressionwith sparse priors.
J. Amer. Statist. Assoc. , to appear; arXiv:1904.07150 .Rousseau, J. and Szabo, B. (2020). Asymptotic frequentist coverage properties ofBayesian credible sets for sieve priors.
Ann. Statist. , 48(4):2155–2179.Roˇckov´a, V. and George, E. I. (2018). The spike-and-slab LASSO.
J. Amer. Statist.Assoc. , 113(521):431–444.Stein, C. (1956). Inadmissibility of the usual estimator for the mean of a multivariatenormal distribution. In
Proceedings of the Third Berkeley Symposium on MathematicalStatistics and Probability, 1954–1955, vol. I , pages 197–206, Berkeley and Los Angeles.University of California Press.Stein, C. M. (1981). Estimation of the mean of a multivariate normal distribution.
Ann.Statist. , 9(6):1135–1151.Szab´o, B., van der Vaart, A. W., and van Zanten, J. H. (2015). Frequentist coverage ofadaptive nonparametric Bayesian credible sets.
Ann. Statist., 43(4):1391–1428.
Advances in Neural Information Processing Systems 30 ,pages 2089–2099. Curran Associates, Inc.van der Pas, S., Scott, J., Chakraborty, A., and Bhattacharya, A. (2016). horseshoe:Implementation of the Horseshoe Prior . R package version 0.1.0.van der Pas, S., Szab´o, B., and van der Vaart, A. (2017). Uncertainty quantification forthe horseshoe (with discussion).
Bayesian Anal. , 12(4):1221–1274. With a rejoinderby the authors.van Erven, T. and Szab´o, B. (2020). Fast exact Bayesian inference for sparse signals inthe normal sequence model.
Bayesian Anal., to appear; arXiv:1810.10883.
Yang, Y. and Martin, R. (2020). Empirical priors and variational approximations of the posterior in high-dimensional linear models. arXiv:2007.15930.