Confidently Comparing Estimators with the c-value
Brian L. Trippe, Sameer K. Deshpande, and Tamara Broderick
Computer Science and Artificial Intelligence Laboratory, MIT
Correspondence to: Brian L. Trippe ([email protected])
February 22, 2021
Abstract
Modern statistics provides an ever-expanding toolkit for estimating unknown parameters. Consequently, applied statisticians frequently face a difficult decision: retain a parameter estimate from a familiar method or replace it with an estimate from a newer or more complex one. While it is traditional to compare estimators using risk, such comparisons are rarely conclusive in realistic settings. In response, we propose the “c-value” as a measure of confidence that a new estimate achieves smaller loss than an old estimate on a given dataset. We show that it is unlikely that a computed c-value is large and that the new estimate has larger loss than the old. Therefore, just as a small p-value provides evidence to reject a null hypothesis, a large c-value provides evidence to use a new estimate in place of the old. For a wide class of problems and estimators, we show how to compute a c-value by first constructing a data-dependent high-probability lower bound on the difference in loss. The c-value is frequentist in nature, but we show that it can provide validation of Bayesian estimates in real data applications involving hierarchical models and Gaussian processes.
Keywords:
Decision Theory, Normal Means, Model Selection, Shrinkage, Empirical Bayes

1 Introduction
Modern statistics provides an expansive toolkit of sophisticated methodology for estimating unknown parameters. However, this abundance of different estimators often presents practitioners already familiar with one mode of analysis with a difficult challenge: choosing between the output of a familiar method (e.g. a maximum likelihood estimate (MLE)) and that of a more complicated method (e.g. the posterior mean of a hierarchical Bayesian model). From a practical perspective, abandoning a familiar approach in favor of a more complicated one is unreasonable without some sort of assurance that the latter provides a more accurate estimate. Our goal is to determine whether it is safe to abandon a default estimate in favor of an alternative, and to provide an assessment of the degree of confidence we should have in this decision.

Traditionally, decisions between estimators are based on risk, the loss averaged over all possible realizations of the data (Lehmann & Casella 2006, Chapters 4-5). We note two limitations of using risk. First, it is rare that one estimator within a given pair will have smaller risk across all possible parameter values. Instead, it is more often the case that one estimator will have smaller risk for some unknown parameter values but larger risk for other parameter values. Second, one estimator may have lower risk than another but incur higher loss on a majority of datasets; see Appendix A for an example.

In this work we propose a framework for choosing between estimators based on their performance on the observed dataset rather than their average performance. Specifically, we introduce the “c-value” (“c” for confidence in the new estimate), which we construct using a data-dependent high-probability lower bound on the difference in loss. We show that it is unlikely that simultaneously the c-value is large and the alternative estimate has larger loss than the default. We then demonstrate how to use the c-value to select between two estimates in a principled, data-driven way. Critically, the c-value requires no assumptions on the unknown parameter; our guarantees hold uniformly across all parameters. Before presenting our general methodology, we discuss a few applications for which we will demonstrate the utility of the c-value.
1.1 Hierarchical modeling of educational testing data

Hierarchical modeling of educational testing data from multiple schools has a long history and is widely considered a classic application of Bayesian methodology (Lindley & Smith 1972, Rubin 1981, Gelman et al. 2013). While a default approach may be to analyze data from each school separately, hierarchical Bayesian approaches attempt to “share strength” by partially pooling data across related schools. However, such analyses often rely on subjective or strongly simplifying prior assumptions. Consequently, it is not always clear whether these latter models actually yield improved estimates of school-specific parameters.

As a first application of our methodology, we revisit Hoff (2020)'s estimates of average student reading ability at several schools in the 2002 Educational Longitudinal Study. Our analysis is quite confident that Hoff (2020)'s estimates, which partially pool data across schools with similar characteristics, yield smaller squared error than the MLE, which estimated reading scores separately for each school. By considering a perturbed version of the dataset, we verify that our methodology does not always favor more complex alternative estimators.
1.2 Estimating neighborhood-level violent crime

Considerable empirical evidence links a community's exposure to violent crime and adverse behavioral, mental, and physical health outcomes among its residents (Buka et al. 2001, Kondo et al. 2018). Although overall violent crime rates in the U.S. have decreased over the last two decades, there is considerable variation in time trends at the neighborhood level (Balocchi & Jensen 2019, Balocchi et al. 2019). A critical first step in understanding what drives neighborhood-level variation is accurate estimation of the actual amount of violent crime that occurs in each neighborhood.

Typically, researchers rely on the reported counts of violent crime aggregated at small spatial resolutions (e.g. at the census tract level). However, in light of sampling variability due to the relative infrequency of certain crime types in small areas, it is natural to wonder if auxiliary data can be used to improve estimates of violent crime incidence.

As a second application of our framework, we analyze the number of violent crimes reported per square mile in several neighborhoods in the city of Philadelphia. Our analysis suggests that one can obtain improved estimates of this violent crime density by using a simple hierarchical Bayesian model that incorporates information about non-violent crime incidence. We then consider a yet-more-complex estimator that additionally incorporates spatial information. Our further analysis does not provide high confidence that the more complex spatial model provides an improvement over the simpler hierarchical model.
1.3 Estimating ocean currents

Accurate estimation of ocean current dynamics is critical for forecasting the dispersion of oceanic contaminations (Poje et al. 2014). While it is commonplace to model ocean flow dynamics at or above the mesoscale (roughly 10 km), Lodise et al. (2020) have recently advocated modeling dynamics at both the mesoscale and the submesoscale (roughly 0.1–10 km). They specifically proposed a Gaussian process model that accounts for variation across multiple resolutions to estimate ocean currents from positional data taken from hundreds of free-floating buoys.

In a third application of our framework, we find that the multi-resolution procedure produces a large c-value, indicating that accounting for variation across multiple scales enables more accurate estimates than are obtained when accounting only for mesoscale variation.

1.4 Organization of the article & contributions
We formally present our general framework, including the c-value, in Section 2 and highlight similarities and differences between our framework and existing work on preliminary testing and post-selection inference in Section 2.1. Our approach to computing c-values depends on the availability of high-confidence lower bounds on the difference in the losses of the two estimates that hold uniformly across the parameter space. Sections 3 to 5 provide these bounds for several models and classes of estimators for squared error loss. In Section 3, we illustrate our general strategy by constructing the necessary bound for the canonical normal means problem. Then, in Section 4, we generalize this derivation to compare affine estimates of normal means with correlated observations. We extend our framework to empirical Bayes estimators and a non-Gaussian likelihood in Section 5. We provide empirical validation along the way as well as illustrative applications to the aforementioned datasets in Section 6.
2 The c-value

We now describe our approach for quantifying confidence in the statement that one estimate of an unknown parameter is superior to another. We begin by introducing some notation and building up to a definition of the c-value, before stating our main results. This development is very general, and leaves practical instantiations to the subsequent sections. We include proofs of the results of this section in Appendix B.

Suppose that we observe data y according to some distribution that depends on an unknown parameter θ. We consider deciding between two estimates, θ̂(y) and θ∗(y), of θ on the basis of a loss function L(θ, ·). Our focus is on asymmetric situations in which θ̂(·) is a standard or more familiar estimator while θ∗(·) is a more complicated or less familiar estimator. For simplicity, we will refer to θ̂(·) as the default estimator and θ∗(·) as the alternative estimator.

We next define the “win” obtained by using θ∗(y) rather than θ̂(y) as the difference in loss, W(θ, y) := L(θ, θ̂(y)) − L(θ, θ∗(y)). While a typical comparison based on risk would proceed by taking the expectation of W(θ, y) over all possible datasets drawn for fixed θ, we maintain focus on the single observed dataset. Notably, the win is positive whenever the alternative estimate achieves a smaller loss than the default estimate. As such, if we knew that W(θ, y) > 0 for our observed data y and unknown parameter θ, then we would prefer to use the alternative θ∗(y) instead of the default θ̂(y). Since θ is unknown, definitively concluding that W(θ, y) > 0 is not possible. Instead, we seek a function b(y, α), depending only on the data and a pre-specified level α ∈ [0, 1], that satisfies for all θ

\[ P_{\theta}\left[ W(\theta, y) \ge b(y, \alpha) \right] \ge \alpha. \tag{1} \]

For values of α close to 1, b(y, α) is a high-probability lower bound on the win that holds uniformly across all possible values of the unknown parameter θ. Loosely speaking, if b(y, α) > 0 for α close to 1, then we can conclude with high confidence that the alternative estimate has smaller loss than the default estimate.

This lower bound allows us to define a measure of confidence, which we call the “c-value,” that θ∗(y) is superior to θ̂(y),

\[ c(y) := \inf_{\alpha \in [0,1]} \left\{ \alpha \mid b(y, \alpha) \le 0 \right\}. \tag{2} \]

The c-value marks a meaningful boundary in the space of confidence levels; it is the largest value such that for every level α < c(y), we have confidence α that the win is positive.

Remark 2.1. An alternative definition for the c-value is c⁺(y) = sup_{α ∈ [0,1]} {α | b(y, α) ≥ 0}. Although c⁺(y) = c(y) when b(y, ·) is continuous and strictly decreasing in α, c⁺(·) may be overconfident otherwise. We detail a particularly pathological example in Appendix B.1.

Our first main result formalizes the interpretation of c(y) as a measure of confidence.

Theorem 2.2.
Let b(·, ·) be any function satisfying the condition in Equation (1). Then for any θ and α ∈ [0, 1],

\[ P_{\theta}\left[ W(\theta, y) \le 0 \text{ and } c(y) > \alpha \right] \le 1 - \alpha. \tag{3} \]

The result follows directly from the definition of c(·) and the condition on b(·, ·). Informally, Theorem 2.2 assures us that the event that simultaneously (A) the c-value is large and (B) θ∗(y) does not provide smaller loss than θ̂(y) is unlikely. Consequently, just as a small p-value provides evidence to reject a null hypothesis, a large c-value provides evidence to abandon the default estimate in favor of the alternative.

The strategy described above necessarily involves using the data twice, once to compute the two estimates and once more to compute the c-value to choose between them. Accordingly, one might justly ask if this “double-dipping” into the dataset is likely to damage the quality of the resulting estimate. To address this question, we formalize this two-step procedure with a single estimator

\[ \theta^{\dagger}(y, \alpha) := \mathbb{1}[c(y) \le \alpha]\, \hat{\theta}(y) + \mathbb{1}[c(y) > \alpha]\, \theta^{*}(y), \tag{4} \]

which picks between the two estimates θ̂(y) and θ∗(y) based on the value c(y) and a pre-specified level α ∈ [0, 1]. Our second main result guarantees that when using the two-staged θ†(·, α) with α close to 1 we will rarely incur a larger loss than that of the default estimate.

Theorem 2.3.
Let b(·, ·) be any function that satisfies the condition in Equation (1). Then for any θ and α ∈ [0, 1],

\[ P_{\theta}\left[ L\big(\theta, \theta^{\dagger}(y, \alpha)\big) > L\big(\theta, \hat{\theta}(y)\big) \right] \le 1 - \alpha. \tag{5} \]

When the alternative estimator is comparatively more complex or less familiar than the default estimator, such reassurance is highly desirable.

Overview of the remainder of the paper. The c-value is useful insofar as the lower bound b(y, α) is sufficiently tight and readily computable. It remains to show that such practical bounds exist. A primary contribution of this work is the explicit construction of these bounds in settings of practical interest. In what follows, we (A) illustrate one approach for constructing and computing b(y, α), (B) explore our proposed bounds’ empirical properties on simulated data, and (C) demonstrate their practical utility on real-world data.
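To make these definitions concrete, the following is a minimal sketch (our own illustration, not code from the paper) of how Equation (2) and the two-stage estimator of Equation (4) translate to code, assuming a user-supplied bound function b(y, α) satisfying Equation (1).

```python
import numpy as np

def c_value(b, y, grid_size=10_000):
    """Approximate the c-value of Equation (2) by scanning levels alpha in [0, 1].

    `b` is any bound function b(y, alpha) satisfying Equation (1); the c-value
    is the infimum of the levels alpha at which the lower bound on the win
    drops to zero or below. If b is monotone decreasing in alpha, a bisection
    search would be faster than this grid scan.
    """
    for alpha in np.linspace(0.0, 1.0, grid_size):
        if b(y, alpha) <= 0.0:
            return alpha  # first level at which the bound crosses zero
    return 1.0  # b(y, alpha) > 0 for every alpha on the grid

def two_stage_estimate(b, y, theta_default, theta_alternative, alpha=0.95):
    """The two-stage estimator of Equation (4): keep the default unless c(y) > alpha."""
    return theta_alternative if c_value(b, y) > alpha else theta_default
```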
2.1 Related work

Hypothesis testing, p-values, and pre-test estimation. Our proposed c-value bears a resemblance to the p-value in hypothesis testing (Casella & Berger 2002, Section 8.3.4), but with a few key differences. Indeed, just as a small p-value can provide support to reject a simple null hypothesis in favor of a more complex alternative, a large c-value can provide support for rejecting a simple default estimate in favor of a more complex one. Furthermore, both tools provide a frequentist notion of confidence based on the idea of repeated sampling. From this perspective, the two-step estimator θ†(·, α) resembles a preliminary testing estimator. Preliminary testing links the choice between estimators to the outcome of a hypothesis test for the null hypothesis that θ lies in some pre-specified subspace (Wallace 1977).

The similarities to hypothesis testing go only so far. Notably, in this work our target of inference (whether W(θ, y) is positive) is random. Hypothesis tests, in contrast, concern only fixed statements about parameters, with nulls and alternatives corresponding to disjoint subsets of an underlying parameter space (Casella & Berger 2002, Definition 8.1.3). Our approach does not admit an interpretation as testing a fixed hypothesis.

Nevertheless, the connection to p-values can help us understand some limitations of the c-value. First, just as hypothesis tests may incur Type II errors (i.e. failures to reject a false null), for certain models and estimators there may be no b(·, ·) that consistently detects improvements by the alternative estimate. Second, even if good choices of b(·, ·) exist, it could be challenging to derive them analytically. This analytical challenge is reminiscent of difficulties for hypothesis testing in many models, wherein conservative, super-uniform p-values are used when analytic quantile functions are unavailable. Third, we note that it may be tempting to interpret c-values as the conditional probability that the alternative estimate is superior to the default; however, just as it is incorrect to interpret a p-value as the probability that the null hypothesis is true, such an interpretation of the c-value would also be incorrect.
Post-selection inference. In recent years, there has been considerable progress on understanding the behavior of inferential procedures that, like θ†(·, α), use the data twice, first to select amongst different models and then again to fit the selected model. Important recent work has focused on computing p-values and confidence intervals for linear regression parameters that are valid after selection with the lasso (Lockhart et al. 2014, Lee et al. 2016, Taylor & Tibshirani 2018) and arbitrary selection procedures (Berk et al. 2013). Somewhat more closely related to our focus on estimation are Tibshirani & Rosset (2019) and Tian (2020), which both bound prediction error after model selection. Unlike these papers, which study the effects of selection on downstream inference, we effectively perform inference on the selection itself. That is, we quantify the confidence that we pick the estimator with smaller loss.
Additional related work. We review more loosely related work in Appendix C. In particular, we discuss connections to Bayesian model checking and prior predictive p-values, as well as thematically related methodologies that seek to provide frequentist guarantees for Bayesian procedures.

3 c-values for estimating normal means

In this section, we derive a bound b(y, α) and compute the c-value for comparing hierarchical Bayesian estimates to maximum likelihood estimates (MLE) of the mean of a multivariate normal from a single vector observation (i.e. the normal means problem). Our goal is to illustrate a simple instance of a general strategy for lower bounding the win that we will later generalize to more complex estimators and models. In Section 3.1, we define the model and the estimators that we consider. In Section 3.2, we introduce our lower bound b(·, ·) and present a theorem that guarantees this bound satisfies Equation (1). Then, in Section 3.3, we empirically verify the utility of the resulting c-value and study the performance of the estimator θ†(·, α) that chooses between the default and alternative estimators based on the c-value (Equation (4)). Several details, including the proof of Theorem 3.1, are left to Appendix D.

3.1 Model and estimators

Let θ ∈ R^N be an unknown vector and consider estimating θ from a noisy vector observation y = θ + ε, where ε ∼ N(0, I_N), under squared error loss L(θ, θ̂) := ‖θ̂ − θ‖². For simplicity, we focus on the case of isotropic noise with variance one; we remove this restriction in Section 4. For our demonstration, we take the MLE θ̂(y) = y to be the default estimate. As the alternative estimator, we consider a shrinkage estimator that was first studied extensively by Lindley & Smith (1972),

\[ \theta^{*}(y) = \frac{y + \tau^{-2}\,\bar{y}\,\mathbf{1}_{N}}{1 + \tau^{-2}}, \]

where 1_N is the vector of all ones, τ² > 0 is a fixed scalar, and ȳ := N^{-1} 1_N^⊤ y is the mean of the observed y_n's. Operationally, θ∗(y) shrinks each coordinate of the MLE towards the grand mean ȳ.

3.2 Construction of the lower bound

To lower bound the win, we first rewrite θ∗(y) = θ̂(y) − Gy, where G := (1 + τ²)^{-1} P_⊥ and P_⊥ := I_N − N^{-1} 1_N 1_N^⊤ is the projection onto the subspace orthogonal to 1_N. The win in squared error loss may then be written as

\[ W(\theta, y) := \|\hat{\theta}(y) - \theta\|^{2} - \|\theta^{*}(y) - \theta\|^{2} = 2\,\epsilon^{\top} G y - \|G y\|^{2}. \tag{6} \]

Observe that we can compute ‖Gy‖² directly from our data. As a result, in order to lower bound the win W(θ, y), it suffices to lower bound 2ε^⊤Gy.
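As a one-line check of Equation (6): since θ̂(y) − θ = ε and θ∗(y) − θ = ε − Gy, expanding the square gives

\[ W(\theta, y) = \|\epsilon\|^{2} - \|\epsilon - Gy\|^{2} = \|\epsilon\|^{2} - \left( \|\epsilon\|^{2} - 2\,\epsilon^{\top} G y + \|Gy\|^{2} \right) = 2\,\epsilon^{\top} G y - \|Gy\|^{2}. \]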
As we detail in Appendix D.1, 2ε^⊤Gy follows a scaled and shifted non-central chi-squared distribution,

\[ 2\,\epsilon^{\top} G y \;\sim\; \frac{2}{1+\tau^{2}} \left[ \chi^{2}_{N-1}\!\left( \tfrac{1}{4}\|P_{\perp}\theta\|^{2} \right) - \tfrac{1}{4}\|P_{\perp}\theta\|^{2} \right], \]

where χ²_{N−1}(λ) denotes the non-central chi-squared distribution with N − 1 degrees of freedom and non-centrality parameter λ. Thus for any α ∈ (0, 1) and any fixed value of ‖P_⊥θ‖²,

\[ W(\theta, y) \;\ge\; \frac{2}{1+\tau^{2}}\, F^{-1}_{N-1}\!\left( 1-\alpha;\; \tfrac{1}{4}\|P_{\perp}\theta\|^{2} \right) - \frac{\|P_{\perp}\theta\|^{2}}{2(1+\tau^{2})} - \|Gy\|^{2} \tag{7} \]

with probability α, where F^{-1}_{N−1}(1 − α; λ) denotes the inverse cumulative distribution function of χ²_{N−1}(λ) evaluated at 1 − α. Were ‖P_⊥θ‖² known, the right hand side of Equation (7) would immediately provide a valid bound. However, since ‖P_⊥θ‖² is not typically known, we use the data to address our uncertainty in this quantity. We obtain our bound by forming a one-sided confidence interval for ‖P_⊥θ‖² that holds simultaneously with Equation (7).

Bound 3.1 (Normal means: Lindley and Smith estimate vs. MLE). Observe y = θ + ε with ε ∼ N(0, I_N) and consider θ̂(y) = y vs. θ∗(y) = (y + τ^{−2} ȳ 1_N)/(1 + τ^{−2}). We propose

\[ b(y, \alpha) := \inf_{\lambda \in \left[0,\, U\left(y, \frac{1-\alpha}{2}\right)\right]} \left\{ \frac{2}{1+\tau^{2}}\, F^{-1}_{N-1}\!\left( \frac{1-\alpha}{2};\; \frac{\lambda}{4} \right) - \frac{\lambda}{2(1+\tau^{2})} \right\} - \frac{\|P_{\perp} y\|^{2}}{(1+\tau^{2})^{2}} \tag{8} \]

as an α-confidence lower bound on the win, where

\[ U\!\left(y, \frac{1-\alpha}{2}\right) := \inf_{\delta > 0} \left\{ \delta \;\middle|\; \|P_{\perp} y\|^{2} \le F^{-1}_{N-1}\!\left( \frac{1+\alpha}{2};\; \delta \right) \right\} \tag{9} \]

is a high-confidence upper bound on ‖P_⊥θ‖².

Bound 3.1 relies on a high-confidence upper bound on ‖P_⊥θ‖², but a two-sided interval could in principle provide a valid bound as well. In Appendix D.3 we provide an intuitive justification for the choice of an upper bound. Theorem 3.1 justifies the use of Bound 3.1 for computing c-values.

Theorem 3.1. Define c(y) := inf_{α ∈ [0,1]} {α | b(y, α) ≤ 0} for b(·, ·) in Bound 3.1. Then c(y) is a valid c-value, satisfying the guarantees of Theorems 2.2 and 2.3.

Remark 3.2 (Computation). Equation (8) in Bound 3.1 can be readily computed. Notably, many standard statistical software packages provide numerical approximation to non-central χ² quantiles. Further, the one-dimensional optimization problems in Equations (8) and (9) can be solved numerically.

Remark 3.3. The alternative estimator θ∗(y) considered in this section is the posterior mean of θ corresponding to the hierarchical prior θ | µ ∼ N(µ 1_N, τ² I_N) with a further improper hyper-prior on µ. This prior expresses a belief that θ lies close to the one-dimensional subspace spanned by 1_N. Using a similar approach to the one above, we can derive lower bounds on the win for a more general class of estimators that shrink the MLE towards a pre-specified D-dimensional subspace. See Appendix D.4 for details and an application to a real dataset on which a large computed c-value indicates an improved estimate.
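As a concrete companion to Remark 3.2, here is a minimal sketch (our own illustration, not the authors' code) of Bound 3.1 using SciPy's non-central χ² quantiles; the two one-dimensional problems in Equations (8) and (9) are handled by root-finding and bounded minimization.

```python
import numpy as np
from scipy import stats, optimize

def bound_3_1(y, alpha, tau2):
    """High-probability lower bound on the win (Equation (8)), Lindley-Smith vs. MLE."""
    N = len(y)
    P_perp_y = y - y.mean()                 # projection orthogonal to 1_N
    s = P_perp_y @ P_perp_y                 # ||P_perp y||^2
    half = (1.0 - alpha) / 2.0              # split alpha across the two bounds

    # Equation (9): upper confidence bound U on ||P_perp theta||^2 by root-finding;
    # the quantile F^{-1}((1+alpha)/2; delta) is increasing in the non-centrality delta.
    def gap(delta):
        return stats.ncx2.ppf(1.0 - half, df=N - 1, nc=max(delta, 1e-12)) - s
    U = 0.0 if gap(0.0) >= 0.0 else optimize.brentq(gap, 0.0, s + 100.0 * np.sqrt(s) + 100.0)

    # Equation (8): minimize over lambda = ||P_perp theta||^2 in [0, U].
    def objective(lam):
        q = stats.ncx2.ppf(half, df=N - 1, nc=max(lam / 4.0, 1e-12))
        return 2.0 * q / (1.0 + tau2) - lam / (2.0 * (1.0 + tau2))
    res = optimize.minimize_scalar(objective, bounds=(0.0, max(U, 1e-12)), method="bounded")
    return res.fun - s / (1.0 + tau2) ** 2
```

Scanning α as in the Section 2 sketch then yields the c-value.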
3.3 Empirical evaluation

To explore the empirical properties of Bound 3.1, we carried out a simulation study in which we simulated 500 datasets with N = 50 as y ∼ N(θ, I_N) for several values of θ. For each simulated dataset y, we computed the win W(θ, y), the proposed lower bound b(y, α), and the c-value c(y). Conveniently, for this likelihood, the distributions of W(θ, y) and b(y, α) depend on θ only through N^{−1}‖P_⊥θ‖². Consequently, we can exhaustively assess how our procedure behaves for different θ by varying this norm. Throughout our simulation study, we fixed τ² = 1.

We first checked that the empirical probability that the win W(θ, y) exceeded the bound b(y, α) in Bound 3.1 was at least as large as the nominal probability α (Figure 1a). Across various choices of N^{−1}‖P_⊥θ‖², we see that b(·, α) is conservative, typically providing higher than nominal coverage. Surprisingly, the gap between the actual and nominal coverages does not seem to depend heavily on θ, suggesting we could potentially obtain a tighter bound by calibrating b(y, α) to its actual coverage.

We next checked that our computed c-values successfully detected improvements by the alternative estimate. Recall that the alternative estimate θ∗(y) shrinks all components of y towards the global mean ȳ. Further, recall that by construction θ†(y, α) = θ∗(y) if and only if c(y) > α. Intuitively, then, we would expect the alternative estimator to improve over the MLE and for the two-staged θ†(·, α) to select θ∗(·) when θ is close to the subspace spanned by 1_N and N^{−1}‖P_⊥θ‖² is small. Figure 1b, which plots the probability that θ†(·, α) selects θ∗(·) across different values of θ and α, confirms this intuition; when N^{−1}‖P_⊥θ‖² is small, we very often obtain large c-values and select the alternative estimator.

For completeness, we also considered the risk profile of the two-stage estimator θ†(·, α) (Figure 1c). Specifically, for different choices of θ we computed a Monte Carlo estimate of the expected squared error loss. For the most part, the risk of θ†(·, α) lies between the risks of θ̂(·) and θ∗(·). However, the risk of the two-stage estimator appears to exceed the risks of the default and alternative estimators for a narrow range of values of ‖P_⊥θ‖². While it is tempting to characterize this excess risk as the price we must pay for “double-dipping” into our data, we note that the bump in risk appears to be non-trivial only for very small values of α. Recall again that we recommend choosing θ∗(y) in place of θ̂(y) only when c(y) is close to 1. As such, we do not expect this type of risk increase to be much of a concern in practice.

[Figure 1: Bound calibration and the two-stage estimator for a hierarchical normal model in simulation. (a) Empirical coverage of the lower bound b(·, α) across different levels α; coverage is nearly identical across the parameter space. (b) Probability of selecting the alternative estimate across the parameter space; selection probability is higher for lower thresholds α. (c) Risk profiles of the two-stage estimators for different choices of α, as well as the MLE θ̂(·) and the Bayes estimator θ∗(·). Each data point is computed from 500 replicates with N = 50.]

Unlike conventional p-values under a null hypothesis, we should not expect the distribution of informative c-values to be uniform; indeed, for parameters such that the win is consistently positive or negative we should hope to observe c-values concentrated near 1 or 0, respectively. In agreement with this expectation, Figure 4 illustrates that when ‖P_⊥θ‖² is small c-values concentrate near 1, while when ‖P_⊥θ‖² is large they concentrate near zero.

4 c-values for affine estimates with correlated observations

We now generalize the situation described in the previous section in two ways. First, we consider correlated Gaussian noise with covariance Σ, where Σ is any N × N positive definite covariance matrix, rather than restricting to Σ = I_N. Second, we let our default and alternative estimates, θ̂(y) and θ∗(y), be arbitrary affine transformations of the data y. Though these two estimates take similar functional forms in this section, we remain concerned with asymmetric comparisons wherein θ∗(y) is comparatively more complex or less standard than θ̂(y).

In Section 3, we relied on exact quantiles of a non-central χ² distribution to guarantee exact coverage in Theorem 3.1. In the present case, we encounter sums of differently scaled non-central χ² random variables, which do not admit analytically tractable quantiles. However, by approximating these sums with Gaussians with matched means and variances, we can proceed in essentially the same manner as in Section 3 to derive an approximate lower bound on the win. After introducing the bound, we comment on the key steps in its derivation to highlight the approximations involved, but leave details of intermediate steps to Appendix E. We conclude with a non-asymptotic bound on the error introduced by these approximations on the coverage of the proposed bound on the win.

Approximate Bound 4.1 (Correlated Gaussian likelihood: arbitrary affine estimates).
Observe y = θ + ε with ε ∼ N(0, Σ) and consider θ̂(y) = Ay + k vs. θ∗(y) = Cy + ℓ, where A, C ∈ R^{N×N} are linear operators and k, ℓ ∈ R^N are N-vectors. We propose

\[ b(y, \alpha) = \|\hat{\theta} - y\|^{2} - \|\theta^{*} - y\|^{2} + 2\,\mathrm{tr}[(A - C)\Sigma] + 2\, z_{\frac{1-\alpha}{2}} \sqrt{ U\!\left( \|G(y)\|_{\Sigma}^{2}, \tfrac{1-\alpha}{2} \right) + \tfrac{1}{2} \left\| \Sigma^{\frac{1}{2}} (A + A^{\top} - C - C^{\top}) \Sigma^{\frac{1}{2}} \right\|_{F}^{2} } \tag{10} \]

as an approximate high-probability lower bound for the win. In this expression, tr[·] denotes the trace of a matrix, G(y) := (A − C)y + (k − ℓ), ‖·‖_Σ denotes the Σ quadratic norm of a vector (‖v‖_Σ := √(v^⊤Σv)), ‖·‖_F denotes the Frobenius norm of a matrix, and z_α denotes the α-quantile of the standard normal. Additionally,

\[ U\!\left( \|G(y)\|_{\Sigma}^{2}, \tfrac{1-\alpha}{2} \right) := \inf_{\delta > 0} \left\{ \delta \;\middle|\; \|G(y)\|_{\Sigma}^{2} \le \left( \delta + \left\| \Sigma^{\frac{1}{2}} (A - C) \Sigma^{\frac{1}{2}} \right\|_{F}^{2} \right) + z_{\frac{1+\alpha}{2}} \sqrt{ 2 \left\| \Sigma^{\frac{1}{2}} (A - C) \Sigma (A - C)^{\top} \Sigma^{\frac{1}{2}} \right\|_{F}^{2} + 4 \left\| \Sigma^{\frac{1}{2}} (A - C) \Sigma^{\frac{1}{2}} \right\|_{OP}^{2} \delta } \right\} \tag{11} \]

is an approximate high-confidence upper bound on ‖G(θ)‖²_Σ, where ‖·‖_OP denotes the L2 operator norm of a matrix.

To derive Approximate Bound 4.1 we again start by rewriting the alternative estimate as θ∗(y) = θ̂(y) − G(y), where now G(·) is an affine transformation of y, G(y) := (A − C)y + (k − ℓ). We next write the squared error win of using θ∗(y) in place of θ̂(y) as

\[ W(\theta, y) = 2\,\epsilon^{\top} G(y) + \left( \|\hat{\theta}(y) - y\|^{2} - \|\theta^{*}(y) - y\|^{2} \right) \tag{12} \]

and observe that it suffices to obtain a high-probability lower bound for the first term. For tractability, we approximate the distribution of ε^⊤G(y) by a normal with matched mean and variance. As we will soon see, this approximation is accurate when N is large and A − C is well conditioned; in this case ε^⊤G(y) may be written as the sum of many uncorrelated terms of similar size. The mean and variance may be expressed as

\[ \mathbb{E}[\epsilon^{\top} G(y)] = \mathrm{tr}[(A - C)\Sigma], \qquad \mathrm{Var}[\epsilon^{\top} G(y)] = \|G(\theta)\|_{\Sigma}^{2} + \tfrac{1}{2} \left\| \Sigma^{\frac{1}{2}} (A + A^{\top} - C - C^{\top}) \Sigma^{\frac{1}{2}} \right\|_{F}^{2}. \tag{13} \]

With these moments in hand, we form a probability α lower bound approximately as

\[ W(\theta, y) \ge \|\hat{\theta}(y) - y\|^{2} - \|\theta^{*}(y) - y\|^{2} + 2\,\mathrm{tr}[(A - C)\Sigma] + 2\, z_{1-\alpha} \sqrt{ \|G(\theta)\|_{\Sigma}^{2} + \tfrac{1}{2} \left\| \Sigma^{\frac{1}{2}} (A + A^{\top} - C - C^{\top}) \Sigma^{\frac{1}{2}} \right\|_{F}^{2} }. \tag{14} \]

However, as before, in order to use this approximate bound we require a simultaneous upper bound on a norm of a transformation of the unknown parameter, in this case ‖G(θ)‖²_Σ. We compute one by considering the test statistic ‖G(y)‖²_Σ and again appealing to approximate normality. In particular, we characterize the dependence of the distribution of this statistic on ‖G(θ)‖²_Σ through its mean and variance. We find its mean as

\[ \mathbb{E}\left[ \|G(y)\|_{\Sigma}^{2} \right] = \|G(\theta)\|_{\Sigma}^{2} + \left\| \Sigma^{\frac{1}{2}} (A - C) \Sigma^{\frac{1}{2}} \right\|_{F}^{2} \tag{15} \]

and upper bound its variance by

\[ \mathrm{Var}\left[ \|G(y)\|_{\Sigma}^{2} \right] \le 2 \left\| \Sigma^{\frac{1}{2}} (A - C) \Sigma (A - C)^{\top} \Sigma^{\frac{1}{2}} \right\|_{F}^{2} + 4 \left\| \Sigma^{\frac{1}{2}} (A - C) \Sigma^{\frac{1}{2}} \right\|_{OP}^{2} \|G(\theta)\|_{\Sigma}^{2}. \tag{16} \]
Using the two quantities above and an appeal to approximate normality, we propose the approximate high-confidence upper bound, U(‖G(y)‖²_Σ, (1−α)/2), in Equation (11). As before, by splitting our α across these two bounds we obtain the desired expression, Equation (10) in Approximate Bound 4.1.
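To make the bound concrete, here is a minimal sketch (our own illustration, not code from the paper) of Approximate Bound 4.1 in Python; it computes U via the closed-form quadratic root discussed in the computation remark at the end of this section. A Cholesky factor of Σ stands in for the symmetric square root, since every norm below is invariant to that choice.

```python
import numpy as np
from scipy.stats import norm

def approximate_bound_4_1(y, alpha, A, k, C, l, Sigma):
    """Approximate high-probability lower bound on the win (Equation (10)).

    Default estimate Ay + k vs. alternative Cy + l, with y ~ N(theta, Sigma).
    """
    L = np.linalg.cholesky(Sigma)
    D = A - C
    Sym = L.T @ (D + D.T) @ L              # Sigma^{1/2}(A + A^T - C - C^T)Sigma^{1/2}
    M = L.T @ D @ L                        # Sigma^{1/2}(A - C)Sigma^{1/2}
    Mmix = L.T @ D @ Sigma @ D.T @ L       # Sigma^{1/2}(A - C)Sigma(A - C)^T Sigma^{1/2}

    G_y = D @ y + (k - l)
    g2 = G_y @ Sigma @ G_y                 # ||G(y)||_Sigma^2

    # Upper confidence bound U on ||G(theta)||_Sigma^2 (Equation (11)), taken as
    # the larger root of x^2 - (2*gamma + eta^2*nu) x + (gamma^2 - eta^2*rho) = 0.
    gamma = g2 - np.sum(M ** 2)
    eta = norm.ppf((1.0 + alpha) / 2.0)
    rho = 2.0 * np.sum(Mmix ** 2)
    nu = 4.0 * np.linalg.norm(M, ord=2) ** 2
    disc = max((2.0 * gamma + eta ** 2 * nu) ** 2 - 4.0 * (gamma ** 2 - eta ** 2 * rho), 0.0)
    U = max(((2.0 * gamma + eta ** 2 * nu) + np.sqrt(disc)) / 2.0, 0.0)

    theta_hat, theta_star = A @ y + k, C @ y + l
    return (theta_hat - y) @ (theta_hat - y) - (theta_star - y) @ (theta_star - y) \
        + 2.0 * np.trace(D @ Sigma) \
        + 2.0 * norm.ppf((1.0 - alpha) / 2.0) * np.sqrt(U + 0.5 * np.sum(Sym ** 2))
```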
Approximation Quality. Due to the two Gaussian approximations, Approximate Bound 4.1 does not provide nominal coverage by construction. Our next result shows that little error is introduced when N is large enough and the problem is well conditioned.

Theorem 4.1 (Berry–Esseen bound). Let α ∈ (0, 1) and consider b(·, α) in Approximate Bound 4.1. If both A and C are symmetric, then

\[ P_{\theta}\left[ W(\theta, y) \ge b(y, \alpha) \right] \ge \alpha - \frac{\sqrt{2}}{\sqrt{N}}\, C_{0}\, \kappa\!\left( \Sigma^{\frac{1}{2}} (A - C) \Sigma^{\frac{1}{2}} \right)^{3}, \tag{17} \]

where κ(·) denotes the condition number of its matrix argument (i.e. the ratio of its largest to smallest singular values) and C₀ is a universal constant (Berry 1941, Theorem 1).

Remark 4.2. Theorem 4.1 is a special case of a more general result that we provide in Appendix E.4, which does not require A and C to be symmetric. We highlight this special case here because the bound takes a simpler form from which the dependence on the conditioning of A − C is clearer, and because this condition is satisfied for many important estimates. Notably, A and C are symmetric in all applications discussed in this paper.

Though Theorem 4.1 provides an expected O(N^{−1/2}) drop in approximation error, the bound itself may be too loose to be useful in practice. However, in Section 6.1 we show in simulation that Approximate Bound 4.1 provides the nominal coverage and tends to be substantially conservative. This conservatism likely owes to slack from (A) the operator norm bound in Equation (16) and (B) the union bound ensuring that the confidence interval for ‖G(θ)‖²_Σ and the quantile in Equation (14) hold simultaneously. We conclude this section with a remark about computation of Approximate Bound 4.1.

Remark 4.3 (Computation of b(y, α)). A naive approach to computing b(y, α) in Equation (10) involves finding U(‖G(y)‖²_Σ, (1−α)/2) with a binary search. For more rapid computation, we can recognize U(‖G(y)‖²_Σ, (1−α)/2) as the root of a quadratic. Specifically, define γ := ‖G(y)‖²_Σ − ‖Σ^{1/2}(A − C)Σ^{1/2}‖²_F, η := z_{(1+α)/2}, ρ := 2‖Σ^{1/2}(A − C)Σ(A − C)^⊤Σ^{1/2}‖²_F, and ν := 4‖Σ^{1/2}(A − C)Σ^{1/2}‖²_{OP}; then from Equation (11) we have that the δ that achieves the infimum satisfies γ = δ + η√(ρ + νδ). Rearranging, we find that U(‖G(y)‖²_Σ, (1−α)/2) is the larger root of

\[ x^{2} - (2\gamma + \eta^{2}\nu)\, x + (\gamma^{2} - \eta^{2}\rho) = 0. \]

5 Extensions: empirical Bayes and non-Gaussian likelihoods

Up to this point, we focused on estimating normal means with fixed affine estimators. Now we extend our c-value framework in two important directions. In Section 5.1, we derive c-values for empirical Bayesian estimates of normal means. We then move beyond Gaussian likelihoods in Section 5.2 and derive c-values for regularized logistic regression. We defer all proofs and details of synthetic data experiments to Appendices F and G.
5.1 Empirical Bayes estimation of normal means

While many Bayesian estimates are affine in the data for fixed settings of prior parameters, when prior parameters are chosen using the data, the resulting empirical Bayesian estimates are not affine in general. In this subsection we explore computation of approximate high-confidence lower bounds on the win of empirical Bayesian estimators. In particular, we consider an approach that essentially amounts to ignoring the randomness in estimated prior parameters and computing the bound as if the prior were fixed. For simplicity, we focus on a particularly simple empirical Bayesian estimator for the normal means problem that coincides with the James–Stein estimator (Efron & Morris 1973). We find that, in the high-dimensional limit, bounds obtained with this naive approach achieve at least the desired nominal coverage. Finally, we show in simulation that the approximate bound has favorable finite sample coverage properties.
Empirical Bayes for estimation of normal means.
Consider a sequence of real-valued parameters θ₁, θ₂, . . . and corresponding observations y_n ∼ N(θ_n, 1), independently. For each N ∈ N, let Θ_N := [θ₁, θ₂, . . . , θ_N]^⊤ and Y_N := [y₁, y₂, . . . , y_N]^⊤ denote the first N parameters and observations, respectively. We consider the MLE for Θ_N (i.e. Y_N) as our default, which we denote by Θ̂_N(Y_N) = Y_N, and we take the James–Stein estimate as our alternative; we compare on the basis of squared error loss. We write the James–Stein estimate on the first N data points as

\[ \Theta^{*}_{N}(Y_N) := \left( 1 - \big(1 + \hat{\tau}^{2}_{N}(Y_N)\big)^{-1} \right) Y_N, \quad \text{where} \quad \hat{\tau}^{2}_{N}(Y_N) := \|Y_N\|^{2} / (N - 2) - 1. \]

Θ∗_N(Y_N) corresponds to the Bayes estimate under the prior θ_n i.i.d. ∼ N(0, τ̂²_N) (Efron & Morris 1973). For this comparison, the win is

\[ W_N(Y_N, \Theta_N) := \|\hat{\Theta}_N(Y_N) - \Theta_N\|^{2} - \|\Theta^{*}_{N}(Y_N) - \Theta_N\|^{2}, \]

and Appendix F details the associated bound b_N(Y_N, α) obtained with Bound D.1.
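As an illustration of this plug-in approach (our own sketch; the paper's exact bound b_N is developed in Appendix F), one can form τ̂²_N and then treat the resulting James–Stein map as a fixed linear estimator inside the machinery of Section 4. The clamp at zero below is our own stabilization choice, analogous to positive-part shrinkage.

```python
import numpy as np

def james_stein_plugin(y):
    """Empirical Bayes (James-Stein) estimate and the implied 'fixed-prior' affine map.

    Returns the estimate and the pieces (C, l) that, treating tau_hat^2 as fixed,
    cast the estimator in the affine form of Approximate Bound 4.1 (A = I, k = 0).
    """
    N = len(y)
    tau2_hat = y @ y / (N - 2) - 1.0                  # method-of-moments prior variance
    shrink = 1.0 - 1.0 / (1.0 + max(tau2_hat, 0.0))   # clamp at zero for stability
    C = shrink * np.eye(N)
    return shrink * y, C, np.zeros(N)

# Example: feed the implied affine map into approximate_bound_4_1 with Sigma = I_N.
```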
In the following theorem, we lower bound the win by applying our earlier machinery for Bayes rules with fixed priors. We find that the desired coverage is obtained in the high-dimensional limit.

Theorem 5.1. For each N ∈ N, let τ̄²_N := N^{−1} Σ_{n=1}^{N} θ²_n. If the sequence τ̄²₁, τ̄²₂, . . . is bounded, then for any α ∈ [0, 1],

\[ \lim_{N \to \infty} P\left[ W_N(Y_N, \Theta_N) \ge b_N(Y_N, \alpha) \right] \ge \alpha. \]

The key step in the proof of Theorem 5.1 is establishing an O_p(N^{−1/2}) rate of convergence of τ̂²_N − τ̄²_N to zero; under this condition the empirical Bayes estimate and bound converge to the analogous estimates and bounds computed with the prior variance fixed to τ̄²_N. Accordingly, we expect similar results to hold for other models and empirical Bayes estimates when the standard deviations of the empirical Bayes estimates of the prior parameters drop as O_p(N^{−1/2}).

Remark 5.2. Theorem 5.1 easily extends to cover the case in which we consider a sequence of random (rather than fixed) parameters drawn i.i.d. from a Bayesian prior, which is a more classical setup for guarantees of empirical Bayesian methods; see e.g. Robbins (1964). Specifically, our proof goes through in this Bayesian setting so long as the sequence τ̄²₁, τ̄²₂, . . . is bounded in probability. This condition is satisfied, for example, when the θ_n are i.i.d. from any prior with a finite second moment.

To check finite sample coverage, we performed a simulation and evaluated calibration of the associated c-values (Figure 5). Despite the empirical Bayes step, the c-values appear to be similarly conservative to those computed with the exact bound in Figure 1a. Furthermore, this calibration profile does not appear to be sensitive to the magnitude of the unknown parameter.

5.2 Logistic regression

In this subsection we illustrate how to compute an approximate high-confidence lower bound on the win in squared error loss with a logistic regression likelihood. Our key insight is that by appealing to limiting behavior, this non-Gaussian problem may be approached with the same machinery developed in Section 4.
Notation and estimates.
Consider a collection of M data points with random covariates X_M := [x₁, x₂, . . . , x_M]^⊤ ∈ R^{M×N} and responses Y_M := [y₁, y₂, . . . , y_M]^⊤ ∈ {1, −1}^M. For the m-th data point, assume

\[ y_m \;\sim\; p(\cdot \mid x_m; \theta) := \left(1 + \exp\{-x_m^{\top}\theta\}\right)^{-1} \delta_{1} + \left(1 + \exp\{x_m^{\top}\theta\}\right)^{-1} \delta_{-1}, \tag{18} \]

independently, where θ ∈ R^N is an unknown parameter of covariate effects and δ₁ and δ₋₁ denote Dirac masses on 1 and −1, respectively.

In this section, we choose the MLE as our default, θ̂(X_M, Y_M) := argmax_θ log p(Y_M | X_M; θ), and we choose our alternative to be a Bayesian maximum a posteriori (MAP) estimate under a standard normal prior (θ ∼ N(0, I_N)):

\[ \theta^{*}(X_M, Y_M) := \operatorname*{arg\,max}_{\theta} \left\{ \log p(Y_M \mid X_M; \theta) - \frac{1}{2}\|\theta\|^{2} \right\}. \]

While a first choice for a Bayesian estimate might be the posterior mean, the MAP is an effective and widely used alternative to the MLE in practice. Notably, the MAP estimate is easier to compute and is often close to the posterior mean; see Huggins et al. (2018, Proposition 6.2) and Schervish (1995, Theorem 7.116). In particular, the distance between the posterior mean and the MAP estimate decays at an O(M^{−1}) rate with the number of observations M.
Approximating the Bayes estimate as an affine transformation. In moving away from a Gaussian likelihood we have forfeited prior-to-likelihood conjugacy. In previous sections, conjugacy provided analytically convenient expressions for Bayes estimates. In order to regain analytical tractability, we appeal to a Gaussian approximation of the likelihood, defined with a second order Taylor approximation of the log likelihood about the MLE. This approximation is equivalent to approximating the distribution of the MLE as θ̂(X_M, Y_M) ∼ N(θ, Σ̃_M), where

\[ \tilde{\Sigma}_M := \left[ -\nabla^{2}_{\theta} \log p(Y_M \mid X_M; \theta)\Big|_{\theta = \hat{\theta}(X_M, Y_M)} \right]^{-1}. \]

As such, we regain conjugacy, and we obtain an approximate Bayes estimate as an affine transformation of the MLE,

\[ \tilde{\theta}^{*}(X_M, Y_M) = \left[ I_N + \tilde{\Sigma}_M \right]^{-1} \hat{\theta}(X_M, Y_M). \tag{19} \]

As we show in Appendix G, θ̃∗(X_M, Y_M) is a very close approximation of θ∗(X_M, Y_M), with distance decreasing at an O_p(M^{−1}) rate.
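As a sketch of this construction (ours; the function and variable names are illustrative, and we fit the MLE with SciPy rather than any tooling from the paper), the pieces of Equation (19) might be assembled as follows.

```python
import numpy as np
from scipy.optimize import minimize

def logistic_mle_and_affine_map(X, y):
    """MLE for logistic regression with responses y in {1, -1}, plus the
    affine approximation to the MAP under a N(0, I) prior (Equation (19))."""
    M, N = X.shape

    def neg_log_lik(theta):
        return np.sum(np.logaddexp(0.0, -y * (X @ theta)))

    theta_hat = minimize(neg_log_lik, np.zeros(N), method="BFGS").x

    # Observed information at the MLE; its inverse plays the role of Sigma_tilde.
    p = 1.0 / (1.0 + np.exp(-X @ theta_hat))
    H = X.T @ ((p * (1.0 - p))[:, None] * X)      # negative Hessian of log-likelihood
    Sigma_tilde = np.linalg.inv(H)

    # Equation (19): affine-in-the-MLE approximation of the MAP estimate.
    C = np.linalg.inv(np.eye(N) + Sigma_tilde)
    theta_star_approx = C @ theta_hat

    # theta_hat plays the role of the 'data' in Approximate Bound 4.1,
    # with A = I_N, k = 0, C as above, l = 0, and Sigma = Sigma_tilde.
    return theta_hat, theta_star_approx, C, Sigma_tilde
```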
An approximate bound and an asymptotic guarantee. We leverage the form in Equation (19) to compute Approximate Bound 4.1 as a lower bound on the win in squared error of using the MAP estimate in place of the MLE. In particular, we take y := θ̂(X_M, Y_M) as the data in Approximate Bound 4.1 (this corresponds to A = I_N and k = 0) and approximate the distribution of ε := θ̂(X_M, Y_M) − θ as N(0, Σ̃_M). Further, to compute the bound, we approximate θ∗(X_M, Y_M) by θ̃∗(X_M, Y_M) as in Equation (19), corresponding to C = [I_N + Σ̃_M]^{−1} and ℓ = 0. While the precise coverage of this bound is difficult to analyze, our next result reveals favorable properties in the large sample limit.

Theorem 5.3.
Consider a sequence of random covariates x₁, x₂, . . . and responses y₁, y₂, . . . distributed as in Equation (18). For each M ∈ N, let W_M := ‖θ̂(X_M, Y_M) − θ‖² − ‖θ∗(X_M, Y_M) − θ‖² be the win of using the MAP estimate in place of the MLE. Finally, let b_M(α) be the level-α approximate bound on W_M described above. If x₁, x₂, . . . are i.i.d. with finite third moment and with positive definite covariance, then for any α ∈ (0, 1),

\[ \lim_{M \to \infty} P_{\theta}\left[ W_M \ge b_M(\alpha) \right] \ge \alpha. \]

Theorem 5.3 guarantees that in the large sample limit, b_M(·) has at least nominal coverage. We provide a proof of the theorem and demonstrate its favorable empirical properties in simulation in Appendix G.

6 Applications

We now demonstrate our approach on the three applications introduced in Section 1. Our goal in this section is to demonstrate how one can compute and interpret c-values in realistic workflows. In analogy to hypothesis testing, where a p-value cutoff of 0.05 is standard for rejecting a null, we require a c-value of at least 0.95 to accept the alternative estimate; with this threshold, we expect to incorrectly reject the default estimate in at most 5% of our decisions. For all applications, we provide substantial additional details in Appendix H.
6.1 Estimating average reading ability across schools

In this section we apply our methodology to a model and dataset considered by Hoff (2020, Section 3.2), in which the goal is to estimate the average student reading ability at different schools in the 2002 Educational Longitudinal Study. At each of N = 676 schools, between 5 and 50 tenth grade students were given a standardized test of reading ability. We let y = [y₁, y₂, . . . , y_N]^⊤ denote the average scores, and for each school, indexed by n, model y_n ∼ N(θ_n, σ²_n), independently, where θ = [θ₁, θ₂, . . . , θ_N]^⊤ denotes the school-level means and each σ_n is the school-level standard error; specifically σ_n := σ/√size_n, where σ denotes a student-level standard deviation and size_n is the number of students tested at school n. For convenience, we let Σ := diag([σ²₁, σ²₂, . . . , σ²_N]) so that we may write y ∼ N(θ, Σ). The goal is to estimate the school-level performances θ.

Following Hoff (2020), we perform small area inference with the Fay-Herriot model (Fay & Herriot 1979) to estimate θ under the assumption that similar schools may have similar student performances. Specifically, we consider a vector of D = 8 attributes of each school, X = [x₁, x₂, . . . , x_N]^⊤; these include participation levels in a free lunch program, enrollment, and other characteristics such as region and school type. We model the school-level mean as a priori distributed as θ ∼ N(Xβ, τ² I_N), where β is an unknown D-vector of fixed effects and τ² is an unknown scalar that describes variation in θ not captured by the covariates. Following Hoff (2020), we take an empirical Bayesian approach and estimate β, τ², and σ² with lme4 (Bates et al. 2015). We then compare the posterior mean — which is affine in y for fixed β, τ², and σ² — as an alternative to the MLE as a default; we use Approximate Bound 4.1. Specifically, we take

\[ \theta^{*}(y) := \mathbb{E}[\theta \mid y; \beta, \tau^{2}, \sigma^{2}] = [I_N + \tau^{-2}\Sigma]^{-1} y + [I_N + \tau^{2}\Sigma^{-1}]^{-1} X\beta \]

and θ̂(y) = y. We compute a large c-value, indicating we may be quite confident that θ∗(y) is more accurate than θ̂(y).

We should not always expect to obtain a large c-value for an alternative estimate, however. We next describe a case where we expect the alternative estimate to be less accurate than the default, and we check that we obtain a small c-value. In particular, we now let our alternative estimate be the posterior mean under the same model as above but with the covariates, X, randomly permuted across schools. In this situation, the responses y have no relation to the covariates, and we should not expect an improvement. Indeed, on this dataset we compute a c-value of exactly zero. However, we recall that just as a large p-value in hypothesis testing does not provide evidence that a null hypothesis is true, a small c-value does not provide direct evidence that the alternative estimate is less accurate than the default.

We provide additional details for all parts of this application in Appendix H.1. There, we demonstrate in a simulation study that our bounds remain substantially conservative for these estimators and model even with an empirical Bayes step.
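Concretely, once lme4 (or any mixed-model fit) supplies estimates of β, τ², and the σ²_n, the comparison reduces to two affine maps and a call to the bound from Section 4. A minimal sketch (ours; `approximate_bound_4_1` refers to the illustrative function above):

```python
import numpy as np

def school_c_value_inputs(y, X, beta_hat, tau2_hat, sigma2_n):
    """Cast the Fay-Herriot comparison (posterior mean vs. MLE) into the
    affine form required by Approximate Bound 4.1."""
    N = len(y)
    Sigma = np.diag(sigma2_n)                   # school-level sampling variances
    A, k = np.eye(N), np.zeros(N)               # default: the MLE, theta_hat(y) = y
    C = np.linalg.inv(np.eye(N) + Sigma / tau2_hat)
    l = np.linalg.inv(np.eye(N) + tau2_hat * np.linalg.inv(Sigma)) @ (X @ beta_hat)
    return A, k, C, l, Sigma

# b(y, alpha) = approximate_bound_4_1(y, alpha, *school_c_value_inputs(...));
# scanning alpha as in Section 2 then yields the c-value.
```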
6.2 Estimating violent crime density in Philadelphia

As a second application, we consider estimating the areal density of violent crimes (i.e. counts per square mile) reported in each of Philadelphia's N = 384 census tracts. Following Balocchi et al. (2019), we work with the inverse hyperbolic sine transformed density. Letting y_n be the observed transformed density of reported violent crimes in census tract n, we model y_n ∼ N(θ_n, σ²_y), independently, where θ_n represents the underlying transformed density and σ²_y is the noise variance. While one might interpret θ_n as the true density of violent crime in census tract n, we note that the implicit assumption of zero-mean error in each tract may not be realistic. Namely, systematic biases may impact the rates at which police receive and respond to calls and file incident reports in different parts of the city. Unfortunately, we are unable to probe this possibility with the available data. Nevertheless, our goal is to estimate the vector of unknown rates, θ = [θ₁, θ₂, . . . , θ_N]^⊤, from y = [y₁, y₂, . . . , y_N]^⊤. The observations y are a simple proxy of transformed violent crime density, but they are noisy. So it is natural to wonder if we might obtain a more accurate estimate of θ.

Figure 2 plots the transformed densities of both violent and non-violent crimes reported in October 2018 in each census tract. Immediately, we see that, for any particular census tract, the observed densities of the two types of crime are similar. Further, we observe considerable spatial correlation in each plot. It is tempting to use a Bayesian hierarchical model that exploits this structure in order to produce more accurate estimates of θ. In this application, we consider iteratively refining an estimate of θ by (A) incorporating the observed non-violent crime data and then by (B) carefully accounting for the observed spatial correlation. At each step of our refinement, we use a c-value to decide whether to continue. Before proceeding, we make a remark about our sequential approach.

[Figure 2: Transformed densities of reported (a) violent and (b) non-violent crimes in each census tract in Philadelphia in October 2018.]

Remark 6.1. Consider using c-values and a chosen level α to choose one of three estimates (say θ̂(y), θ∗(y), and θ◦(y)) in two stages. Suppose we first choose θ∗(y) over θ̂(y) only if the associated c-value is greater than α. Second, only if we chose θ∗(y), we next choose θ◦(y) over θ∗(y) only if the new c-value associated with those estimates exceeds α. Then a union bound guarantees that θ◦(y) will be incorrectly chosen with probability at most 2(1 − α).

We begin by seeing if we can improve upon the MLE, θ̂(y) = y, by leveraging the auxiliary dataset of transformed non-violent crimes in each tract, z₁, z₂, . . . , z_N. To this end, we model these auxiliary data analogously to y; in each tract n, we let η_n be the unknown transformed density and independently model z_n ∼ N(η_n, σ²_z). We next introduce a hierarchical prior that captures the apparent similarity between θ and η within each tract. Specifically, for each tract n we decompose θ_n = µ_n + δ_n^y and η_n = µ_n + δ_n^z, where µ_n is a shared mean for the transformed densities of violent and non-violent reports, and δ_n^y and δ_n^z represent deviations from the shared mean specific to each crime type. Rather than encode explicit prior beliefs about µ_n, we express ignorance in these quantities with an improper uniform prior. Additionally, we model δ_n^y, δ_n^z i.i.d. ∼ N(0, σ²_δ). We fix the values of σ²_y, σ²_z, and σ²_δ using historical data. We then compute the posterior mean of θ as an alternative estimate, θ∗(y). Thanks to the Gaussian conjugacy of this model, θ∗(y) is affine in the data y, and a closed form expression is available.
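Generically, any such conjugate Gaussian posterior mean reduces to an affine map Cy + ℓ, which is all Approximate Bound 4.1 requires. A small sketch of that reduction (ours, written for a proper prior θ ∼ N(m, V) with y ∼ N(θ, Σ); the improper shared-mean prior above arises as a limiting case):

```python
import numpy as np

def conjugate_posterior_affine(m, V, Sigma):
    """Posterior mean of theta under theta ~ N(m, V), y | theta ~ N(theta, Sigma),
    expressed as the affine map E[theta | y] = C y + l."""
    S = np.linalg.inv(V + Sigma)
    C = V @ S                  # weight on the data
    l = Sigma @ S @ m          # weight on the prior mean
    return C, l
```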
See Appendix H.2 for additional details. A c-value of 0.999982 suggests that we should be highly confident that θ∗(y) is a more accurate estimate of θ than θ̂(y).

We next consider additionally sharing strength amongst spatially adjacent census tracts. To this end, consider a second model with spatially correlated variance components: θ_n = µ_n + δ_n^y + κ_n^y and η_n = µ_n + δ_n^z + κ_n^z. The additional terms κ^y = [κ₁^y, κ₂^y, . . . , κ_N^y]^⊤ and κ^z = [κ₁^z, κ₂^z, . . . , κ_N^z]^⊤ capture a priori spatial correlations; we model κ^y, κ^z i.i.d. ∼ N(0, K), where K is an N × N covariance matrix determined by a squared exponential covariance function (Rasmussen & Williams 2006, Chapter 4) that depends on the distance between the centroids of the census tracts. Once again, we exploit conjugacy in this second hierarchical model to derive the posterior mean θ◦(y) in closed form. As θ◦(y) is also an affine transformation of y, we can use Approximate Bound 4.1 to compute the c-value for comparing θ◦(y) to θ∗(y). The c-value for this comparison is only 0.843, providing much weaker support for using θ◦(y) over θ∗(y). Because this c-value is less than 0.95, we conclude our analysis content with θ∗(y) as our final estimate.

6.3 Estimating ocean currents

Accurate understanding of ocean current dynamics is important for forecasting the dispersion of oceanic contaminations, such as after the Deepwater Horizon oil spill (Poje et al. 2014). Lodise et al. (2020) have recently advocated for a statistical approach to inferring ocean currents from observations of free-floating, GPS-trackable buoys. Their approach seeks to provide improved estimates by incorporating variation at the submesoscale (roughly 0.1–10 km) in addition to more commonly considered mesoscale variation (roughly 10 km and above). In this section we apply our methodology to assess if this approach provides improved estimates relative to a baseline including only mesoscale variation.

In our analysis, we consider a segment of the CARTHE Grand Lagrangian Drifter (GLAD) deployment dataset (Özgökmen 2013). Specifically, we model a set of 50 buoys with velocities estimated at 3 hour intervals over one day (N = 400 observations total). Each observation n consists of latitudinal and longitudinal ocean current velocity measurements y_n = [y_n^{(1)}, y_n^{(2)}]^⊤ ∈ R² and associated spatio-temporal coordinates [lat_n, lon_n, t_n]. Following Lodise et al. (2020), we model each measurement as a noisy observation of an underlying time varying vector-field, distributed independently as y_n ∼ N(F(lat_n, lon_n, t_n), σ²_ε I₂), where F : R³ → R² denotes the time evolving vector-field of ocean currents and σ²_ε is the error variance. Our goal is to estimate F at the observation points, θ := [θ₁, θ₂, . . . , θ_N]^⊤, where for each n, θ_n = [θ_n^{(1)}, θ_n^{(2)}]^⊤ = F(lat_n, lon_n, t_n).

Following Lodise et al. (2020), we place a Gaussian process prior on F to encode expected spatio-temporal structure while allowing for variation at multiple scales. Specifically, we model F ∼ GP(0, k(·, ·)), where

\[ k\big(\theta_n^{(i)}, \theta_{n'}^{(i)}\big) = k_1\big(\theta_n^{(i)}, \theta_{n'}^{(i)}\big) + k_2\big(\theta_n^{(i)}, \theta_{n'}^{(i)}\big), \quad i \in \{1, 2\}. \tag{20} \]

Here k₁ and k₂ are squared exponential kernels with spatial and temporal length-scales that reflect mesoscale and submesoscale variations, respectively; see Appendix H.3 for details.
For simplicity, we model the latitudinal and longitudinal components of F independently. We take the posterior mean of θ under this model as the alternative estimate, θ∗(y).
As a baseline, we consider an analogous estimate with covariance function k(θ_n^{(i)}, θ_{n'}^{(i)}) = k₁(θ_n^{(i)}, θ_{n'}^{(i)}) + k₂(θ_n^{(i)}, θ_{n'}^{(i)}) 𝟙[n = n'], which maintains the same marginal variance but excludes submesoscale covariances. We take the posterior mean under this model as the default estimate θ̂(y). Both θ∗(y) and θ̂(y) may be written as affine transformations of y. Using Approximate Bound 4.1, we compute a large c-value, which allows us to confidently conclude that modeling both mesoscale and submesoscale variation can yield more accurate estimates of ocean currents than mesoscale modeling alone.
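The comparison again reduces to two affine maps: with kernel matrices K₁ and K₂ over the observation coordinates, each posterior mean is a standard Gaussian process smoother. A minimal sketch for one velocity component (ours; the kernel matrices and noise variance are placeholders for the quantities fit in Appendix H.3):

```python
import numpy as np

def gp_smoothers(K1, K2, sigma2_eps):
    """Posterior-mean maps for the multi-resolution model (C_star) and the
    mesoscale-only baseline (C_hat), each an affine map y -> C y (with l = 0)."""
    N = K1.shape[0]
    K_star = K1 + K2                         # mesoscale + submesoscale prior covariance
    K_hat = K1 + np.diag(np.diag(K2))        # same marginal variance, no submesoscale covariance
    Sigma = sigma2_eps * np.eye(N)           # observation noise covariance
    C_star = K_star @ np.linalg.inv(K_star + Sigma)
    C_hat = K_hat @ np.linalg.inv(K_hat + Sigma)
    return C_star, C_hat, Sigma
```

Feeding A = C_hat, C = C_star (with k = ℓ = 0) and Sigma into `approximate_bound_4_1` then yields the c-value comparison described above.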
7 Discussion

We have provided a simple method for quantifying confidence in improvements provided by certain Bayesian estimates without relying on subjective assumptions about the parameter of interest. Our approach has compelling theoretical properties, and we have demonstrated its utility on several data analyses of recent interest. However, the scope of the current work has several limitations. The present paper has explored the use of the c-value only for problems of moderate dimensionality (N between 20 and 700). Loosely speaking, we suspect c-values may be underpowered to robustly identify substantial improvements provided by estimates in lower dimensional problems. Further investigation into such dimension dependence is an important direction for future work. In addition, our approach depends crucially on a high-probability lower bound that is inherently specific to the underlying model of the data, a loss function, and the pair of estimators. In the present work, we have shown how to derive and compute this bound for models with general Gaussian likelihoods, when accuracy may be measured in terms of squared error loss, and when both estimates are affine transformations of the data. We have provided a first step to extending beyond simple Gaussian models with the application to logistic regression; while we have not yet explored the efficacy of this extension on real data, we view our work as an important starting point for generalizing to broader model classes and estimation problems. We believe that further extensions to the classes of models, estimates, and losses for which c-values can be computed provide fertile ground for future work.
The authors thank Jonathan H. Huggins for the suggestions to consider Berry–Esseen bounds and the extension to logistic regression, Lorenzo Masoero and Hannah Diehl for insightful comments on the manuscript, and Matthew Stephens and Lucas Janson for useful early conversations. This work was supported in part by ONR Award N00014-18-S-F006, an ARPA-E project with program director David Tew, and an NSF CAREER Award. BLT is supported by an NSF GRFP.

References
Balocchi, C., Deshpande, S. K., George, E. I. & Jensen, S. T. (2019), 'Crime in Philadelphia: Bayesian clustering with particle optimization'. arXiv:1912.00111.
Balocchi, C. & Jensen, S. T. (2019), 'Spatial modeling of trends in crime over time in Philadelphia', The Annals of Applied Statistics 13(4), 2235–2259.
Bates, D., Mächler, M., Bolker, B. & Walker, S. (2015), 'Fitting linear mixed-effects models using lme4', Journal of Statistical Software 67(1), 1–48.
Berk, R., Brown, L., Buja, A., Zhang, K. & Zhao, L. (2013), 'Valid post-selection inference', The Annals of Statistics 41(2), 802–837.
Berry, A. C. (1941), 'The accuracy of the Gaussian approximation to the sum of independent variates', Transactions of the American Mathematical Society 49(1), 122–136.
Blei, D. M. (2014), 'Build, compute, critique, repeat: data analysis with latent variable models', Annual Review of Statistics and Its Application 1(1), 203–232.
Box, G. E. (1980), 'Sampling and Bayes' inference in scientific modelling and robustness', Journal of the Royal Statistical Society: Series A 143(4), 383–404.
Box, G. E. & Hunter, W. G. (1962), 'A useful method for model-building', Technometrics 4(3), 301–318.
Buka, S. L., Stichick, T. L., Birdthistle, I. & Earls, F. (2001), 'Youth exposure to violence: prevalence, risks, and consequences', American Journal of Orthopsychiatry 71(3), 298–310.
Burbidge, J. B., Magee, L. & Robb, A. L. (1988), 'Alternative transformations to handle extreme values of the dependent variable', Journal of the American Statistical Association 83(401), 123–127.
Casella, G. & Berger, R. L. (2002), Statistical Inference, Duxbury, Pacific Grove, CA.
Efron, B. & Morris, C. (1973), 'Stein's estimation rule and its competitors — an empirical Bayes approach', Journal of the American Statistical Association 68(341), 117–130.
Fay, R. E. & Herriot, R. A. (1979), 'Estimates of income for small places: an application of James–Stein procedures to census data', Journal of the American Statistical Association 74(366a), 269–277.
Gelman, A., Carlin, J. B., Stern, H. S., Dunson, D. B., Vehtari, A. & Rubin, D. B. (2013), Bayesian Data Analysis, Chapman and Hall/CRC.
Hoff, P. D. (2020), 'Smaller p-values via indirect information', Journal of the American Statistical Association pp. 1–35.
Hoff, P. & Yu, C. (2019), 'Exact adaptive confidence intervals for linear regression coefficients', Electronic Journal of Statistics 13(1), 94–119.
Huggins, J. H., Campbell, T., Kasprzak, M. & Broderick, T. (2018), 'Practical bounds on the error of Bayesian posterior approximations: A nonasymptotic approach'. arXiv:1809.09505.
Kondo, M. C., Andreyeva, E., South, E. C., MacDonald, J. M. & Branas, C. C. (2018), 'Neighborhood interventions to reduce violence', Annual Review of Public Health 39, 253–271.
Lee, J. D., Sun, D. L., Sun, Y. & Taylor, J. E. (2016), 'Exact post-selection inference, with application to the lasso', The Annals of Statistics 44(3), 907–927.
Lehmann, E. L. & Casella, G. (2006), Theory of Point Estimation, Springer Science & Business Media.
Lindley, D. V. & Smith, A. F. (1972), 'Bayes estimates for the linear model', Journal of the Royal Statistical Society: Series B 34(1), 1–18.
Lockhart, R., Taylor, J., Tibshirani, R. J. & Tibshirani, R. (2014), 'A significance test for the lasso', The Annals of Statistics 42(2), 413–486.
Lodise, J., Özgökmen, T., Gonçalves, R. C., Iskandarani, M., Lund, B., Horstmann, J., Poulain, P.-M., Klymak, J., Ryan, E. H. & Guigand, C. (2020), 'Investigating the formation of submesoscale structures along mesoscale fronts and estimating kinematic quantities using Lagrangian drifters', Fluids 5(3), 1–38.
Mathai, A. M. & Provost, S. B. (1992), Quadratic Forms in Random Variables: Theory and Applications, Dekker.
Morris, C. N. (1983), 'Parametric empirical Bayes inference: theory and applications', Journal of the American Statistical Association 78(381), 47–55.
Özgökmen, T. M. (2013), 'GLAD experiment CODE-style drifter trajectories (low-pass filtered, 15 minute interval records), Northern Gulf of Mexico near DeSoto Canyon, July–October 2012. Harte Research Institute, Texas A&M University–Corpus Christi.' URL: https://data.gulfresearchinitiative.org/data/R1.x134.073:0004
Poje, A. C., Özgökmen, T. M., Lipphardt, B. L., Haus, B. K., Ryan, E. H., Haza, A. C., Jacobs, G. A., Reniers, A., Olascoaga, M. J., Novelli, G., Griffa, A., Beron-Vera, F. J., Chen, S. S., Coelho, E., Hogan, P. J., Kirwan, A. D. J., Huntley, H. S. & Mariano, A. J. (2014), 'Submesoscale dispersion in the vicinity of the Deepwater Horizon spill', Proceedings of the National Academy of Sciences 111(35), 12693–12698.
Pratt, J. W. (1963), 'Shorter confidence intervals for the mean of a normal distribution with known variance', The Annals of Mathematical Statistics 34, 574–586.
Rasmussen, C. E. & Williams, C. K. (2006), Gaussian Processes for Machine Learning, MIT Press.
Robbins, H. (1964), 'The empirical Bayes approach to statistical decision problems', The Annals of Mathematical Statistics 35(1), 1–20.
Rubin, D. B. (1981), 'Estimation in parallel randomized experiments', Journal of Educational Statistics 6(4), 377–401.
Schervish, M. J. (1995), Theory of Statistics, Springer Science & Business Media.
Taylor, J. & Tibshirani, R. (2018), 'Post-selection inference for ℓ1-penalized likelihood models', Canadian Journal of Statistics 46(1), 41–61.
Tian, X. (2020), 'Prediction error after model search', The Annals of Statistics 48(2), 763–784.
Tibshirani, R. & Rosset, S. (2019), 'Excess optimism: how biased is the apparent error of an estimator tuned by SURE?', Journal of the American Statistical Association 114(526), 697–712.
Trippe, B. L., Huggins, J. H., Agrawal, R. & Broderick, T. (2019), 'LR-GLM: High-dimensional Bayesian inference using low-rank data approximations', in 'Proceedings of the 36th International Conference on Machine Learning', Vol. 97 of Proceedings of Machine Learning Research, PMLR, pp. 6315–6324.
Van der Vaart, A. W. (2000), Asymptotic Statistics, Cambridge University Press.
Asymptotic Statistics , Cambridge University Press.Wallace, T. D. (1977), ‘Pretest estimation in regression: a survey’,
American Journalof Agricultural Economics (3), 431–443.Woody, S., Padilla, O. H. M. & Scott, J. G. (2020), ‘Optimal post-selection inferencefor sparse signals: a nonparametric empirical-Bayes approach’. arXiv:1810.11042.23 UPPLEMENTARY MATERIAL
A Pitfalls of risk when choosing between estimators
Before proceeding, we require some additional notation and definitions. We denote therisk of an arbitrary estimator θ (cid:48) ( · ) by R ( θ, θ (cid:48) ) = E θ (cid:104) L (cid:0) θ, θ (cid:48) ( y ) (cid:1)(cid:105) . Given two estimators θ (cid:48) ( · ) and θ † ( · ) we say that θ (cid:48) ( · ) dominates θ † ( · ) if, for all values of θ, R ( θ, θ (cid:48) ) ≤ R ( θ, θ † )and R ( θ, θ (cid:48) ) < R ( θ, θ † ) for at least one value of θ. If we were able to show that one of ˆ θ ( · ) or θ ∗ ( · ) dominates the other, it would betempting to always select the dominating estimator. Unfortunately, it is very oftenthe case that neither estimator dominates the other. In other words, it may be thecase that R ( θ, θ ∗ ) < R ( θ, ˆ θ ) for all values of θ in some non-trivial subset of the spaceΘ but R ( θ, θ ∗ ) > R ( θ, ˆ θ ) for some θ / ∈ Θ . Lindley & Smith (1972) provide a simpleillustration of this dilemma in the following normal means problem: suppose that weobserve an N -vector normally distributed about its mean and with identity covariance, I N , as y ∼ N ( θ, I N ) , and wish to compare the default estimate ˆ θ ( y ) = y of θ and thealternative estimate θ ∗ ( y ) = y + y N /τ /τ for a fixed value of τ > , where y := N − (cid:80) Nn =1 y n and N is the N -vector of ones.Lindley & Smith (1972) showed that R ( θ, θ ∗ ) < R ( θ, ˆ θ ) if and only if (cid:107) θ − θ N (cid:107) < (cid:112) ( N − τ ) , (21)where θ := N − (cid:80) Nn =1 θ n . Without strong assumptions about the value of θ, whichwe may be unable or unwilling to make, a simple comparison of risk functions canprove inconclusive. Interestingly, in the setting considered by Lindley & Smith (1972),it is possible to construct θ so that (A) R ( θ, θ ∗ ) < R ( θ, ˆ θ ) but (B) P θ [ L ( θ, θ ∗ ( y )) >L ( θ, ˆ θ ( y ))] > . . In particular, for N = 2 , τ = 1 , and (cid:107) θ − θ N (cid:107) = 2 . , θ ∗ ( · ) hasslightly smaller risk than the MLE, but the MLE has smaller loss in 3397 out of 5000simulated datasets, or about 68% of the time. In other words, even if we were to assumethat θ satisfied Equation (21), for the majority of datasets y that we might observe, thealternative estimator incurs higher loss than the default. The situation above highlightsan important, but in our mind under-discussed, limitation of risk: the loss averaged overall possible unrealized datasets may not be close to the loss incurred on an observeddataset.This disagreement between risk and the probability of having smaller loss can beespecially pronounced when the distribution of the loss of one of the estimators isheavy-tailed. For example, consider a scalar parameter θ = 0 , a deterministic defaultestimate ˆ θ = 1 , and an alternative estimate distributed as θ ∗ ∼ α δ √ α (1+ (cid:15) ) + (1 − α ) δ , where δ x denotes a Dirac mass on x and (cid:15) >
0. Then θ ∗ ( · ) has larger risk than ˆ θ ( · ) (1 + (cid:15) rather than 1), but has smaller loss with probability 1 − α . By taking α → ∞ , we see24hat θ ∗ ( · ) may have smaller loss than ˆ θ ( · ) with arbitrarily high probability. This exampleis particularly extreme; our intent is merely to illustrate that large disagreements could,at least in principal, arise in practical settings. B Additional details related to Section 2
Proof of Theorem 2.2
Proof.
The result follows directly from the definition of c ( y ) and the conditions on b ( · , · ).More explicitly, P θ (cid:2) W ( θ, y ) ≤ c ( y ) > α (cid:3) ≤ P θ (cid:2) W ( θ, y ) ≤ b ( y, α ) > (cid:3) ≤ P θ (cid:2) W ( θ, y ) < b ( y, α ) (cid:3) ≤ − α, where the first line follows from the definition of the c-value and the final line followsfrom Equation (1). Proof of Theorem 2.3
Proof.
The condition L ( θ, θ † ( y, α )) > L ( θ, ˆ θ ( y )) can occur only when both (A) 0 >W ( θ, y ) and (B) θ † ( · , α ) evaluates to θ ∗ ( · ) rather than ˆ θ ( · ) . Event (B) implies c ( y ) > α and therefore b ( y, α ) >
0. By transitivity, b ( y, α ) > > W ( θ, y ) = ⇒ b ( y, α ) >W ( θ, y ). By assumption, the event b ( y, α ) > W ( θ, y ) occurs with probability at most1 − α . B.1 Defining c-values as a supremum vs. infimum
In this section we describe a pathological model and construction of a lower boundfunction for which the two possible definitions of the c-value described in Remark 2.1lead to notably different behaviours.Consider a variant of the normal means problem. Let θ ∈ R be an unknown meanand observe y := (cid:34) θ + (cid:15)u (cid:35) , where (cid:15) ∼ N (0 ,
1) and u ∼ U ([0 , , u is ancillary to θ (i.e. its distribution does not depend on θ ). We will construct apathological b ( y, α ) that depends on y only through u and will therefore be ancillaryto θ as well. We begin by constructing a countably infinite collection of independentuniform random variables from u, indexed by the rationals Q , S ( u ) := { u r } r ∈ Q . Such acountably infinite collection may be obtained by segmenting the decimal expansion of25 ; for example, if we let d i denote the i th digit of u, we could obtain this sequence bydefining uniform random variables with decimal expansions u : = [ d , d , d , d , d . . . ] ,u : = [ d , d , d , d . . . ] ,u : = [ d , d , d . . . ] ,u : = [ d , d , . . . ] ,u : = [ d , . . . ] , and so on, and then mapping from { u i } i ∈ N to S ( u ).Next, define b ( y, α ) := (cid:40) ( − [ u α <α ] ∞ if α ∈ Q −∞ otherwise . For any bounded default and alternative estimators, the win will be finite and the bound b ( y, α ) holds if an only if it evaluates to −∞ . Because b ( y, α ) = −∞ with probability atleast α, even though b ( y, α ) is ancillary to θ , it still satisfies the condition in Equation (1)for every θ and α ∈ [0 , c + ( y ) := sup α ∈ [0 , { α | b ( y, α ) ≥ } vs. c − ( y ) := inf α ∈ [0 , { α | b ( y, α ) ≤ } , where c − ( y ) = c ( y ) is the definition we have chosen in Section 2. Note that c − ( y ) ≤ c + ( y ) , and that if b ( y, α ) is continuous and strictly decreasing in α for every y, then c − ( y ) = c + ( y ) . In this almost surely discontinuous case, however, we have that c + ( y ) a.s. =1 . and c − ( y ) a.s. = 0 . Since estimators exist for which W ( θ, y ) < c + ( y ) . In the present paper, c − ( y ) = c + ( y ) for all bounds considered. Our preference fordefining the c-value as c − ( y ) derives from simplicity; we may disregard edge cases likethe one above, which would complicate our proofs. However for the reason describedin this section, we emphasize that using c − ( y ) rather than c + ( y ) may have practicalimplications when these quantities differ. C Additional related work
Bayesian model checking.
Working in a Bayesian context, Box & Hunter (1962)and Box (1980) advocate a dynamic, iterative approach to modeling and inference(“Box’s loop” (Blei 2014)), in which models for an observed dataset are successivelyproposed, evaluated, and refined. Specifically, one first assesses the suitability of agiven Bayesian model with the prior predictive probability of an observed statistic. Asufficiently small prior predictive p-value indicates that the observed data is an atypicalrealization from the proposed model; in this case, one then refines and re-assessesthe model. Through this lens, our proposed estimation procedure resembles a single26teration through Box’s loop: we discard the default estimator in favor of an alternativeif the c-value is sufficiently extreme. However, prior predictive p-values are limitedby the requirement for a “good” choice of prior — namely a prior that provides anadequate description of the data. Indeed, one might lack such a prior, but the associatedBayes estimate could still be superior to a given alternative. For example, our approachcan choose a Bayes rule with an improper prior over a default estimator, while theprior predictive p-value is undefined. Or our approach can also be used to choose anestimator that is not Bayesian in origin. We emphasize as well that our contribution inthe present work is a direct quantification of confidence in the relative performance ofestimates that is absent from earlier work.
Non-asymptotic frequentist guarantees for “Bayesian” procedures.
As wewill demonstrate, our procedure allows a practitioner to benefit from a Bayesian modelwhile still providing frequentist guarantees that do not depend on validity of a Bayesianprior. This flavor is not unique to our proposal.Most notably, empirical Bayesian methods (Morris 1983) avoid dependence oncertain subjective choices in the specification of the prior by selecting a prior fromthe same data used to estimate the parameter. While many frequentist properties ofthese estimators are typically unavailable due to the difficulty of analysis, in some casesauthors have established favorable properties. For example, Efron & Morris (1973)famously show that an empirical Bayesian estimator dominates the maximum likelihoodestimate.Outside of decision theory, Bayesian models have played a role in the constructionof smaller confidence intervals with exact frequentist coverage, initially by Pratt (1963)and more recently in the context of empirical Bayes inference by Hoff & Yu (2019) andpost-selection inference by Woody et al. (2020). In the context of hypothesis testing,Hoff (2020) leveraged a similar approach to increase power across multiple hypotheseswhile maintaining exact coverage. The objectives of these papers are distinct from ourown; we do not consider hypothesis testing or forming confidence intervals. But theseworks are thematically related nonetheless; in particular, their methods can utilize aBayesian model to provide improved statistical procedures that maintain frequentistguarantees, and they do not rely on subjective assumptions about unknown parameters.
D Additional details related to Section 3
D.1 Distribution of win term
We here provide a derivation of the distributional form of 2 (cid:15) (cid:62) Gy given in Section 3.2.In Section 3.2 we found that2 (cid:15) (cid:62) Gy ∼
21 + τ (cid:20) χ N − ( 14 (cid:107) P ⊥ θ (cid:107) ) − (cid:107) P ⊥ θ (cid:107) (cid:21) , χ N − ( λ ) denotes the non-central chi-squared distribution with N − λ .Recall that Gy = (1 + τ ) − P ⊥ ( θ + (cid:15) ) . As such we can rewrite2 (cid:15) (cid:62) Gy = 21 + τ (cid:104) (cid:15) (cid:62) P ⊥ (cid:15) + (cid:15) (cid:62) P ⊥ θ (cid:105) = 21 + τ (cid:104) ( P ⊥ (cid:15) ) (cid:62) ( P ⊥ (cid:15) ) + ( P ⊥ (cid:15) ) (cid:62) ( P ⊥ θ ) (cid:105) // since P ⊥ = P ⊥ P ⊥ = 21 + τ (cid:20) (cid:107) P ⊥ (cid:15) + 12 P ⊥ θ (cid:107) − (cid:107) P ⊥ θ (cid:107) (cid:21) // by completing the square= 21 + τ (cid:20) χ N − ( 14 (cid:107) P ⊥ θ (cid:107) ) − (cid:107) P ⊥ θ (cid:107) (cid:21) , as desired, where in the last line the degrees of freedom parameter is N − P ⊥ projects into an N − R N . D.2 Proof of Theorem 3.1
We here provide a proof of Theorem 3.1.
Proof.
The proof amounts to showing that b ( · , · ) achieves at least nominal coverage,i.e. for any θ and α ∈ [0 , P (cid:2) W ( y, θ ) ≥ b ( y, α ) (cid:3) ≥ α . By construction, W ( θ, y ) ≥ b ( y, α ) may be violated only if either (A) (cid:107) P ⊥ θ (cid:107) (cid:54)∈ [0 , U ( y, − α )] or (B) W ( θ, y ) < τ F − N − ( − α ; (cid:107) P ⊥ θ (cid:107) ) − (cid:107) P ⊥ θ (cid:107) τ ) − (cid:107) P ⊥ y (cid:107) (1+ τ ) . Noticing that (cid:107) P ⊥ y (cid:107) ∼ χ N − ( (cid:107) P ⊥ θ (cid:107) ) , we can recognize [0 , U ( − α )] as valid confidence interval for (cid:107) P ⊥ θ (cid:107) and see that (A)occurs with probability at most − α . Next, comparing to Equation (7), we see that (B)represents 2 (cid:15) (cid:62) Gy falling below its − α quantile and thus occurs with probability atmost − α . Therefore the union bound guarantees that b ( y, α ) obtains at least nominalcoverage. D.3 Why an upper bound on (cid:107) P ⊥ θ (cid:107) ? We here provide justification for the use of a high-confidence upper bound on (cid:107) P ⊥ θ (cid:107) in Bound 3.1. Recall that Equation (7) provides a lower bound on W ( θ, y ) if we cancontrol (cid:107) P ⊥ θ (cid:107) . However, it is not immediately obvious what sort of control on (cid:107) P ⊥ θ (cid:107) will yield the tightest bound; should we have derived a two-sided interval or a lowerbound instead of an upper bound? We answer this question by appealing to a normalapproximation of the non-central χ for intuition. This approximation will be close whenthe degrees of freedom parameter is large. Specifically, by replacing the non-central χ quantile with that of a normal with matched first and second moments we may28pproximate the lower bound as W ( θ, y ) ∼ ≥
21 + τ (cid:104) N − − ( (cid:107) P ⊥ θ (cid:107) + 2 N − z α (cid:105) − (cid:107) P ⊥ y (cid:107) (1 + τ ) , (22)where z α is the α quantile of the standard normal.Equation (22) is monotone decreasing in (cid:107) P ⊥ θ (cid:107) for any α > . As such, we canexpect this quantile to be smallest for large values of (cid:107) P ⊥ θ (cid:107) , and for this reasonseek to find a high-confidence upper bound on (cid:107) P ⊥ θ (cid:107) . Indeed, in agreement withEquation (22) we have found empirically that the infimum in Equation (8) is alwaysachieved at this upper bound, and conjecture that this is true in general. D.4 Shrinking towards an arbitrary subspace
We now show how the approach developed in Section 3 immediately extends to a broaderclass of models in the spirit of those considered by Morris (1983). In particular, let θ again be an unknown N -vector and X ∈ R N × D be a design matrix where for each n , X n is a D -vector of covariates associated with θ n . If we believe that the parameterscan be roughly described as scattered around a linear function of these covariates withvariance τ , we might consider trying to improve our estimates by estimating the lineardependence and interpolating between the sample estimate and the associated linearapproximation. Following Morris (1983), we obtain this type of shrinkage with theestimate θ ∗ ( y ) := y + τ − X ( X (cid:62) X ) − X (cid:62) y τ − , which is the posterior mean of the Bayesian model that assumes for each n , θ n ∼N ( X (cid:62) n β, τ ) a priori. Here β is an unknown D -vector of coefficients that is given animproper uniform prior.For this setting, we propose the following bound. Bound D.1 (Normal Means: Flexible shrinkage estimate vs. MLE) . Observe y = θ + (cid:15) with (cid:15) ∼ N (0 , I N ) and consider estimates ˆ θ ( y ) = y and θ ∗ ( y ) := y + τ − X ( X (cid:62) X ) − X (cid:62) y τ − , where τ is a scalar and X is an N by D matrix of covariates. We propose b ( y, α ) = inf λ ∈ [0 ,U ( y, − α )]
21 + τ F − N − D (cid:18) − α , λ (cid:19) − λ τ ) − (cid:107) P ⊥ X y (cid:107) (1 + τ ) (23) as a high-probability lower bound on the win. In this expression, F − N − D (1 − α, λ ) denotesthe inverse cumulative distribution function of the non-central χ with N − D degrees offreedom and non-centrality parameter λ evaluated at − α. P ⊥ X := I N − X ( X (cid:62) X ) − X (cid:62) is the projection onto the subspace orthogonal to the column-space of X.U ( y, − α ) := inf δ> (cid:110) δ (cid:12)(cid:12)(cid:12) (cid:107) P ⊥ X y (cid:107) ≤ F − N − D (1 − α, δ ) (cid:111) (24)29 s a high-confidence upper bound on (cid:107) P ⊥ X θ (cid:107) . This bound is identical to Bound 3.1 except that it projects to a different subspace,and loses D degrees of freedom in the χ random variables, rather than 1. Indeed,this is a strict generalization, as we obtain our earlier example when we take X = N .Bound D.1 is also computable (for the same reasons discussed in Remark 3.2) and valid,as we see in the next proposition. Proposition D.1.
Equation (23) in Bound D.1 satisfies the conditions of Theorem 2.2.In particular, for any θ and α ∈ [0 , , P θ (cid:2) W ( y, θ ) ≥ b ( y, α ) (cid:3) ≥ α .Proof. Proposition D.1 follows from an argument very closely analogous to the proofof Theorem 3.1. We first rewrite θ ∗ ( y ) as θ ∗ ( y ) = y − Gy for G := (1 + τ ) − P ⊥ X .Equation (6) then holds exactly as before (i.e. W ( θ, y ) = 2 (cid:15) (cid:62) Gy − (cid:107) Gy (cid:107) ). The twoterms are treated as in Theorem 3.1; the only differences are that the norm underconsideration is (cid:107) P ⊥ X θ (cid:107) rather than (cid:107) P ⊥ θ (cid:107) , and the change in degrees of freedom from N − N − D .Figure 3: The estimate shrinking towards a quadratic fit provides a significant improve-ment ( c = 0 . σ = 0 .
025 and τ = 0 . c = 0 . θ ∗ ( · ) may be30igure 4: Distribution of c-values across several choices of N − (cid:107) P ⊥ θ (cid:107) .greater than that of ˆ θ ( · ) for many possible θ , this analysis supports the conclusion thatfor the true unknown θ and observed y , θ ∗ ( y ) is superior. D.5 Distribution of c-values
As mentioned in Section 3.3, we do not in general expect to see a uniform distributionof c-values. Figure 4 illustrates the dependence of the distribution of c-values on theparameter, in the same simulations detailed in Figures 1a to 1c.
E Affine estimators supplementary information
E.1 Step by step derivation of Equation (12)
The win of using θ ∗ ( y ) in place of ˆ θ ( y ) may be expressed as W ( θ, y ) = (cid:107) ˆ θ ( y ) − θ (cid:107) − (cid:107) θ ∗ ( y ) − θ (cid:107) = (cid:16) (cid:107) ˆ θ ( y ) (cid:107) + (cid:107) θ (cid:107) − θ (cid:62) ˆ θ ( y ) (cid:17) − (cid:16) (cid:107) θ ∗ ( y ) (cid:107) + (cid:107) θ (cid:107) − θ (cid:62) θ ∗ ( y ) (cid:17) = − θ (cid:62) G ( y ) + (cid:16) (cid:107) ˆ θ ( y ) (cid:107) − (cid:107) θ ∗ ( y ) (cid:107) (cid:17) // where G ( y ) := ˆ θ ( y ) − θ ∗ ( y )= 2 (cid:15) (cid:62) G ( y ) − y (cid:62) G ( y ) + (cid:16) (cid:107) ˆ θ ( y ) (cid:107) − (cid:107) θ ∗ ( y ) (cid:107) (cid:17) = 2 (cid:15) (cid:62) G ( y ) + (cid:16) (cid:107) ˆ θ ( y ) − y (cid:107) − (cid:107) θ ∗ ( y ) − y (cid:107) (cid:17) . (25)31 .2 Derivation of Equation (13) Observe that E [ (cid:15) (cid:62) G ( y )] = E [ (cid:15) (cid:62) G ( θ ) + (cid:15) (cid:62) ( A − C ) (cid:15) ]= E [ (cid:15) ] (cid:62) G ( θ ) + E [tr[( A − C ) (cid:15)(cid:15) (cid:62) ]]= tr[( A − C )Σ]andVar[ (cid:15) (cid:62) G ( y )] = Var[ (cid:15) (cid:62) G ( θ )] + Var[ (cid:15) (cid:62) ( A − C ) (cid:15) ]// since (cid:15) (cid:62) G ( θ ) and (cid:15) (cid:62) ( A − C ) (cid:15) are uncorrelated= ( G ( θ )) (cid:62) Σ( G ( θ )) + 2tr[ A + A (cid:62) − C − C (cid:62) A + A (cid:62) − C − C (cid:62) (cid:107) G ( θ ) (cid:107) + 12 tr[(( A + A (cid:62) − C − C (cid:62) )Σ) ]= (cid:107) G ( θ ) (cid:107) + 12 (cid:107) Σ ( A + A (cid:62) − C − C (cid:62) )Σ (cid:107) F , where (cid:107) · (cid:107) Σ and (cid:107) · (cid:107) F denote the Σ quadratic norm and Frobenius norm, respectively.The third line of the derivation above obtains from recognizing Var[ (cid:15) (cid:62) ( A − C ) (cid:15) ] as aquadratic form (Mathai & Provost 1992, Chapter 2). E.3 Derivations of Equations (15) and (16)
Equations (15) and (16) characterize the dependence of the distribution of (cid:107) G ( y ) (cid:107) on (cid:107) G ( θ ) (cid:107) through its mean and variance. Recognizing (cid:107) G ( y ) (cid:107) as a quadratic form(Mathai & Provost 1992, Chapter 2), with G ( y ) ∼ N (cid:16) G ( θ ) , ( A − C )Σ( A − C ) (cid:62) (cid:17) , wefind its mean as E [ (cid:107) G ( y ) (cid:107) ] = G ( θ ) (cid:62) Σ G ( θ ) + tr[Σ(( A − C )Σ( A − C ) (cid:62) )]= (cid:107) G ( θ ) (cid:107) + tr[Σ ( A − C )Σ( A − C ) (cid:62) Σ ]= (cid:107) G ( θ ) (cid:107) + (cid:107) Σ ( A − C )Σ (cid:107) F . For the variance, we similarly rely on the known variance of a quadratic form. Startingfrom that expression, we upper bound the variance asVar[ (cid:107) G ( y ) (cid:107) ] = 2tr (cid:20) Σ (cid:16) ( A − C )Σ( A − C ) (cid:62) (cid:17) Σ (cid:16) ( A − C )Σ( A − C ) (cid:62) (cid:17)(cid:21) +4 G ( θ ) (cid:62) Σ (cid:16) ( A − C )Σ( A − C ) (cid:62) (cid:17) Σ G ( θ )= 2 (cid:107) Σ ( A − C )Σ( A − C ) (cid:62) Σ (cid:107) F + 4 (cid:107) (cid:16) Σ ( A − C ) (cid:62) Σ (cid:17) Σ G ( θ ) (cid:107) ≤ (cid:107) Σ ( A − C )Σ( A − C ) (cid:62) Σ (cid:107) F + 4 (cid:107) Σ ( A − C )Σ (cid:107) (cid:107) G ( θ ) (cid:107) , (26)where (cid:107) · (cid:107) OP denotes the L .4 The Berry–Esseen bound: Theorem 4.1 We here prove Theorem 4.1, a non-asymptotic upper bound on the error introduced bythe two Gaussian approximations in Approximate Bound 4.1. We begin by restatingkey notation for convenience. We then state a more general variant of the bound thatremoves the restriction that the operators A and C be symmetric, and we show how itreduces to the simpler quantity stated in Theorem 4.1. Finally, we present a proof ofthe theorem as well as several supporting lemmas. Notation and statement of the theorem its more general form.
Recall thatwe are concerned with the coverage of Approximate Bound 4.1 b ( y, α ) = (cid:107) ˆ θ − y (cid:107) − (cid:107) θ ∗ − y (cid:107) + 2tr[( A − C )Σ] +2 z − α (cid:114) U ( (cid:107) G ( y ) (cid:107) , − α (cid:107) Σ ( A + A (cid:62) − C − C (cid:62) )Σ (cid:107) F . In this equation, G ( y ) := ( A − C ) y + ( k − (cid:96) ), z α denotes the α -quantile of the standardnormal, and U (cid:0) (cid:107) G ( y ) (cid:107) , − α (cid:1) = inf δ> (cid:26) δ (cid:12)(cid:12)(cid:12)(cid:12) (cid:107) G ( y ) (cid:107) ≤ ( δ + (cid:107) Σ ( A − C )Σ (cid:107) F ) + z − α (cid:113) (cid:107) Σ ( A − C )Σ( A − C ) (cid:62) Σ (cid:107) F + 4 (cid:107) Σ ( A − C )Σ (cid:107) δ (cid:27) is a high-confidence upper bound on (cid:107) G ( θ ) (cid:107) . For convenience, we introduce˜ F − ( (cid:107) G ( θ ) (cid:107) , α ) := 2tr[( A − C )Σ] + 2 z α (cid:114) (cid:107) G ( θ ) (cid:107) + 12 (cid:107) Σ ( A + A (cid:62) − C − C (cid:62) )Σ (cid:107) F , (27)to denote the inverse CDF of our normal approximation to the distribution of 2 (cid:15) (cid:62) G ( y )evaluated at α. As such, we may write b ( y, α ) = (cid:107) ˆ θ − y (cid:107) − (cid:107) θ ∗ − y (cid:107) + ˜ F − (cid:18) U ( (cid:107) G ( y ) (cid:107) , − α , − α (cid:19) . Finally, recall that to prove the theorem we desire to show P θ (cid:2) W ( θ, y ) ≥ b ( y, α ) (cid:3) ≥ α − √ √ N C · κ (Σ ( A − C )Σ ) for any θ and α ∈ [0 , , where C < .
88 is a universal constant, in the case when both A and C are symmetric. We accomplish this by first proving a more general boundholds even in the non-symmetric case, P θ (cid:2) W ( θ, y ) ≥ b ( y, α ) (cid:3) ≥ α − √ √ N C (cid:104) κ (Σ ( A − C )Σ ) + κ (Σ ( A + A (cid:62) − C − C (cid:62) )Σ ) (cid:105) . (28)33he special case obtains by replacing A (cid:62) and C (cid:62) with A and C, respectively, and notingthat κ ( M ) ≥ κ ( M ) for any matrix, M. A key tool in this proof is the classic result of Berry (1941), which we restate below.
Theorem E.1 (Berry, 1941, Theorem 1) . Let X , X , . . . , X N be random variables. Foreach n ∈ { , , . . . , N } , let σ n and ρ n denote the variance and third central moment of X n , respectively. Define λ n := ρ n σ n if σ n > and λ n = 0 otherwise. Define σ := (cid:80) Nn =1 σ n and X := N − (cid:80) Nn =1 X n . Then sup x (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) F X ( x ) − Φ (cid:18) x − E [ X ] σ (cid:19)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) < C max n λ n σ , where C ≤ . is a universal constant and F X ( · ) is the cumulative distribution functionof X . Proof of Theorem 4.1
The desired bound may be stated equivalently as, for any α ∈ [0 , , P θ (cid:2) W ( θ, y ) < b ( y, α ) (cid:3) < (1 − α ) + 5 √ √ N C (cid:104) κ (Σ ( A − C )Σ ) + κ (Σ ( A + A (cid:62) − C − C (cid:62) )Σ ) (cid:105) . (29)We first rewrite the condition W ( θ, y ) < b ( y, α ) as 2 (cid:15) (cid:62) G ( y ) < ˜ F − (cid:16) U ( (cid:107) G ( y ) (cid:107) , − α ) , − α (cid:17) (recall Equation (25)). Since ˜ F − is monotonically decreasing in its first argument,this condition may occur only if either 2 (cid:15) (cid:62) G ( y ) < ˜ F − (cid:16) (cid:107) G ( θ ) (cid:107) , − α (cid:17) or (cid:107) G ( θ ) (cid:107) >U ( (cid:107) G ( y ) (cid:107) , − α ) . Therefore, by the union bound, we have that P θ (cid:2) W ( θ, y ) < b ( y, α ) (cid:3) < P θ (cid:34) (cid:15) (cid:62) G ( y ) < ˜ F − (cid:18) (cid:107) G ( θ ) (cid:107) , − α (cid:19)(cid:35) + P θ (cid:20) (cid:107) G ( θ ) (cid:107) > U ( (cid:107) G ( y ) (cid:107) , − α (cid:21) . (30)Lemmas E.1 and E.2 provide that P θ (cid:20) (cid:15) (cid:62) G ( y ) < ˜ F − (cid:16) (cid:107) G ( θ ) (cid:107) , − α (cid:17)(cid:21) < − α + √ √ N C κ (Σ ( A + A (cid:62) − C − C (cid:62) )Σ ) and P θ (cid:104) (cid:107) G ( θ ) (cid:107) > U ( (cid:107) G ( y ) (cid:107) , − α ) (cid:105) < − α + √ √ N C κ (Σ ( A − C )Σ ) , respectively. Substituting these two bounds into Equation (30)we obtain Equation (29) as desired. Lemma E.1.
Let y = θ + (cid:15) be a random N -vector with (cid:15) ∼ N (0 , Σ) . Let ˜ F − be thenormal approximation to the inverse CDF of (cid:15) (cid:62) G ( y ) in Equation (27) . Then for any α ∈ [0 , , P θ (cid:20) (cid:15) (cid:62) G ( y ) < ˜ F − (cid:16) (cid:107) G ( θ ) (cid:107) , α (cid:17)(cid:21) < α + 5 √ √ N C κ (Σ ( A + A (cid:62) − C − C (cid:62) )Σ ) . roof. Note first that for any α we may rewrite P θ (cid:20) (cid:15) (cid:62) G ( y ) < ˜ F − (cid:16) (cid:107) G ( θ ) (cid:107) , α (cid:17)(cid:21) = F (cid:20) ˜ F − (cid:16) (cid:107) G ( θ ) (cid:107) , α (cid:17)(cid:21) = α + (cid:40) F (cid:20) ˜ F − (cid:16) (cid:107) G ( θ ) (cid:107) , α (cid:17)(cid:21) − ˜ F (cid:20) ˜ F − (cid:16) (cid:107) G ( θ ) (cid:107) , α (cid:17)(cid:21)(cid:41) , where F and ˜ F are the exact and approximate CDFs of 2 (cid:15) (cid:62) G ( y ) , respectively. Recallingthat the normal approximation comes from matching moments to 2 (cid:15) (cid:62) G ( y ), we havethat for any v, ˜ F ( v ) = Φ( v − E [2 (cid:15) (cid:62) G ( y )] √ Var[2 (cid:15) (cid:62) G ( y )] ) . Therefore, it will suffice to obtain that forevery v, (cid:12)(cid:12)(cid:12) ˜ F ( v ) − F ( v ) (cid:12)(cid:12)(cid:12) = (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) F ( v ) − Φ (cid:32) v − E [2 (cid:15) (cid:62) G ( y )] (cid:112) Var[2 (cid:15) (cid:62) G ( y )] (cid:33)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ≤ √ √ N C κ (Σ ( A + A (cid:62) − C − C (cid:62) )Σ ) . We will obtain this result by writing 2 (cid:15) (cid:62) G ( y ) a sum of independent random variablesand using a Berry–Esseen Theorem (Theorem E.1) to bound the error of this normalapproximation.Lemma E.3 allows us to write 2 (cid:15) (cid:62) G ( y ) = 2 (cid:15) (cid:62) ( A − C ) (cid:15) + 2 (cid:2) ( A − C ) θ + ( k − (cid:96) ) (cid:3) (cid:62) (cid:15) as a shifted sum of N differently-scaled, independent non-central χ random variables.We denote these N random variables by X , X , . . . , X N . Lemma E.3 additionally tellsus that the scaling parameters of these non-central χ random variables will be theeigenvalues of Σ ( A + A (cid:62) − C (cid:62) − C )Σ , which we denote by λ ≥ λ ≥ · · · ≥ λ N ≥ . To use Theorem E.1 we require the ratios of the third to second central moments ofthese random variables, as well as the variance of the sum. Specifically,sup v ∈ R (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) Φ( v − E [2 (cid:15) (cid:62) G ( y )] (cid:112) Var[2 (cid:15) (cid:62) G ( y )] ) − F ( v )) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) < C max n ρ ( X n )Var[ X n ] (cid:112) Var[2 (cid:15) (cid:62) G ( y )] , where for each index n, ρ ( X n ) := E [( X n − E [ X n ]) ] is the third central moment of X n , and C < .
88 is a universal constant.Conveniently, as we show in Lemma E.4, for each n, ρ ( X n )Var[ X n ] ≤ λ n . Further, since (cid:112)
Var[2 (cid:15) (cid:62) G ( y )] > (cid:113) (cid:80) Nn =1 λ n > √ N λ N (recall that Equation (13) provides thatVar[2 (cid:15) (cid:62) G ( y )] = 4 (cid:107) G ( θ ) (cid:107) + 2 (cid:107) Σ ( A + A (cid:62) − C − C (cid:62) )Σ (cid:107) F ) we may additionally seethat sup v ∈ R (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) Φ (cid:32) v − E [2 (cid:15) (cid:62) G ( y )] (cid:112) Var[2 (cid:15) (cid:62) G ( y )] (cid:33) − F ( v )) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) < C √ N max n λ n min n λ n = C √ √ N κ (cid:16) Σ ( A + A (cid:62) − C − C (cid:62) )Σ (cid:17) where κ ( · ) denotes the condition number of its matrix argument, as desired.35 emma E.2. Let y = θ + (cid:15) be a random N -vector with (cid:15) ∼ N (0 , Σ) . Let U ( (cid:107) G ( y ) (cid:107) , α ) be the approximate high-confidence upper bound on (cid:107) G ( θ ) (cid:107) . Then for any α ∈ [ , , P θ (cid:2) (cid:107) G ( θ ) (cid:107) > U ( (cid:107) G ( y ) (cid:107) , − α ) (cid:3) < − α + √ √ N C κ (Σ ( A − C )Σ ) . Proof.
Our proof of the lemma follows roughly the same approach taken to proveLemma E.1. First note that the condition that (cid:107) G ( θ ) (cid:107) > U ( (cid:107) G ( y ) (cid:107) , − α ) impliesthat (cid:107) G ( y ) (cid:107) ≤ ( (cid:107) G ( θ ) (cid:107) + (cid:107) Σ ( A − C )Σ (cid:107) F ) + z − α (cid:113) (cid:107) Σ ( A − C )Σ( A − C ) (cid:62) Σ (cid:107) F + 4 (cid:107) Σ ( A − C )Σ (cid:107) (cid:107) G ( θ ) (cid:107) ≤ E [ G ( y ) (cid:107) ] + z − α (cid:112) Var[ G ( y )]for any α ∈ [ , , where the first line follows from the definition of U ( (cid:107) G ( y ) (cid:107) , − α ) . The second line follows from the observations that (A) z − α < (cid:107) G ( y ) (cid:107) (Equation (16)).We now proceed to upper bound the probability of the event in the display equationabove. First consider a normal approximation to the distribution of (cid:107) G ( y ) (cid:107) Σ withmatched moments, and denote its inverse CDF by F †− ( θ, α ) . We may then write theprobability of the event above as P (cid:104) (cid:107) G ( y ) (cid:107) ≤ E [ G ( y ) (cid:107) ] + z α (cid:112) Var[ G ( y )] (cid:105) = F (cid:104) F †− ( θ, α ) (cid:105) = α + (cid:26) F (cid:104) ¯ F − ( θ, α ) (cid:105) − F † (cid:104) F †− ( θ, α ) (cid:105)(cid:27) , where F ( · ) and F † ( · ) denote the exact and approximate CDFs of (cid:107) G ( y ) (cid:107) . It will sufficeto show that for any v, | F ( v ) − F † ( v ) | ≤ √ √ N κ (Σ ( A − C )Σ ) . As in Lemma E.1 we obtain this result through the Berry–Esseen theorem. In thiscase, the variable of interest is (cid:107) G ( y ) (cid:107) = (cid:15) (cid:62) ( A − C ) (cid:62) Σ( A − C ) (cid:15) +2 (cid:15) (cid:62) (cid:2) ( A − C ) θ + ( k − (cid:96) ) (cid:3) . As in this previous lemma, we use Lemma E.3 to write this variable as a shifted sum ofindependent, scaled non-central χ random variables, this time with scaling parametersequal to the eigenvalues Σ ( A − C ) (cid:62) Σ( A − C )Σ . Recognizing that the eigenvalues ofthe matrix M (cid:62) M are the squares of the singular values of M for any matrix M, weobtain the desired result. Lemma E.3.
Let X be a random N -vector distributed as X ∼ (cid:15) (cid:62) A(cid:15) + b (cid:62) (cid:15) where A ∈ R N × N , b ∈ R N , and (cid:15) ∼ N (0 , Σ) . Then X is distributed as a shifted sum ofdifferently scaled, independent non-central χ random variables. In particular, if welet U diag( λ ) U (cid:62) be the eigen-decomposition of Σ ( A + A (cid:62) )Σ , then we can write X d = (cid:80) Nn =1 Y n − (cid:107) diag( λ ) − U (cid:62) Σ b (cid:107) , where each Y n indep ∼ λ n χ ( λ − n e (cid:62) n U (cid:62) Σ b ) , where e n is the n th basis vector. roof. The proof of the lemma proceeds through a long algebraic rearrangement. Inparticular we rewrite X as X = 2 (cid:15) (cid:62) A(cid:15) + b (cid:62) (cid:15) = δ (cid:62) Σ ( A + A (cid:62) )Σ δ + b (cid:62) Σ δ // defining δ := Σ − (cid:15) so that δ ∼ N (0 , I N ) . = δ (cid:62) U diag( λ ) U (cid:62) δ + b (cid:62) Σ U diag( λ ) − diag( λ ) U (cid:62) δ // Letting U diag( λ ) U (cid:62) := Σ ( A + A (cid:62) )Σ be an eigen-decomposition,// with U (cid:62) U = I N and λ ∈ R N + d = δ (cid:62) diag( λ ) δ + b (cid:62) Σ U diag( λ ) − diag( λ ) δ = N (cid:88) n =1 ( λ n δ n + 12 λ − n e (cid:62) n U (cid:62) Σ b ) − b (cid:62) Σ U diag( λ ) − U (cid:62) Σ b d = − b (cid:62) ( A + A (cid:62) ) − b N (cid:88) n =1 λ n χ ( 12 λ − n e (cid:62) n U (cid:62) Σ b ) , where each e n denotes the n th basis vector and each of the scaled non-central χ randomvariables in the last line are independent. Lemma E.4.
Consider a scaled non-central chi-squared random variable, X ∼ sχ ( λ ) ,where s and λ are scaling and non-centrality parameters, respectively. Denote thesecond and third central moments of X by σ = Var[ X ] and ρ = E (cid:2) ( X − E [ X ]) (cid:3) . Then ρσ ≤ s. Proof.
Recall that the second and third central moments of the scaled non-central χ have known forms, σ = 2 s (1 + 2 λ ) and ρ = 8 s (1 + 3 λ ). Therefore we may write ρσ = 8 s (1 + 3 λ )2 s (1 + 2 λ ) ≤ s (cid:18)
11 + 3 λ λ (cid:19) = 4 · s = 10 s, as desired. F Empirical Bayes supplementary details
F.1 Additional figure
Figure 5 shows the calibration in the simulation experiment described in Section 5.1.37igure 5: Calibration of approximate high-confidence bounds on the win of an empiricalBayes estimate over the MLE in simulation. Each series depicts calibration for a differentchoice of the parameter θ ( N = 50). F.2 Asymptotic coverage of the empirical Bayes estimate
Theorem 5.1 shows that we can apply the machinery developed for Bayes rules withfixed priors to lower bound the win with at least the desired coverage asymptotically.We here consider a scaling of win, W N (Θ N , Y N ) := 1 √ N (cid:104) (cid:107) Y N − Θ N (cid:107) − (cid:107) Θ ∗ N ( Y N ) − Θ N (cid:107) (cid:105) . We use a special case of Bound D.1 in Appendix D.4 with no covariates (i.e. D = 0),and we treat the estimate ˆ τ N ( Y N ) as if it were fixed rather than estimated from thedata. For each N , this bound is b N ( Y N , α ) := 1 √ N inf λ ∈ [0 ,U ( Y N , − α )]
21 + ˆ τ N F − (cid:20) χ N ( λ , − α (cid:21) − λ τ N ) − (cid:107) Y N (cid:107) (1 + ˆ τ N ) where F − (cid:2) χ N ( λ ) , − α (cid:3) denotes the inverse cumulative distribution function of thenon-central χ with N degrees of freedom and non-centrality parameter λ, evaluated at1 − α and U ( Y N , − α ) := inf δ ≥ (cid:110) δ (cid:12)(cid:12)(cid:12) (cid:107) Y N (cid:107) ≤ F − (cid:2) χ N ( δ ) , − α (cid:3)(cid:111) is a high-confidenceupper bound on (cid:107) θ (cid:107) .For our theorem and its proof, a key quantity is, for each N , the sample secondmoment for the first N parameters, which we denote by τ N := N − (cid:80) Nn =1 θ n . Weemphasize, however, that while it may be convenient to describe τ N as a samplemoment, θ is fixed in Theorem 5.1 and throughout this analysis. Proof of Theorem 5.1.
We prove the theorem by showing that for any α , the gapbetween the win W N (Θ N , Y N ) and the bound b N ( Y N , α ) computed for the empirical38ayes estimate converges in distribution to the gap between the analogous win andbound computed for the same estimates but with prior variance fixed as τ = τ N .We denote these latter quantities by W ∗ N (Θ N , Y N ) and b ∗ N ( Y N , α ) , and note that since τ N is fixed P [ W ∗ N (Θ N , Y N ) ≥ b ∗ ( Y N , α )] ≥ α by construction (Proposition D.1). Forconvenience, we denote W N (Θ N , Y N ) by W N , b N ( Y N , α ) by b N , W ∗ N (Θ N , Y N ) by W ∗ N , and b ∗ N ( Y N , α ) by b ∗ N . Observe that we can write W N − b N = W N − b N W ∗ N − b ∗ N ( W ∗ N − b ∗ N ) . By Lemma F.4, W ∗ N − b ∗ N is asymptotically Gaussian, and by Lemma F.2 W N − b N W ∗ N − b ∗ N p → W N − b N approaches the distribution of W ∗ N − b ∗ N insupremum norm. Since b ∗ N obtains the desired coverage by construction, the resultfollows. Supporting lemmas.Lemma F.1.
If the sequence τ N is bounded, then τ N − ˆ τ N is O p ( N − ) , where O p ( · ) denotes stochastic convergence in probability.Proof. Note that for each N , (cid:107) Y N (cid:107) ∼ χ N ( N τ N ). Therefore we have that E [ (cid:107) Y N (cid:107) ] = N + N τ N and Var[ (cid:107) Y N (cid:107) ] = 2( N + 2 N τ N ). So, recalling that ˆ τ N := (cid:107) Y N (cid:107) N − − (cid:107) Y N (cid:107) − ( N − N − we may writeˆ τ N = (cid:107) Y N (cid:107) − E [ (cid:107) Y N (cid:107) ] N − N + N τ N ) − ( N − N − (cid:107) Y N (cid:107) − E [ (cid:107) Y N (cid:107) ] N + τ N + O ( 1 N ) . And so | ˆ τ N − τ N | ≤ (cid:12)(cid:12)(cid:12) (cid:107) Y N (cid:107) − E [ (cid:107) Y N (cid:107) ] N (cid:12)(cid:12)(cid:12) + O ( 1 N )= (cid:113) τ N √ N (cid:12)(cid:12)(cid:12) (cid:107) Y N (cid:107) − E [ (cid:107) Y N (cid:107) ] (cid:112) Var[ (cid:107) Y N (cid:107) ] (cid:12)(cid:12)(cid:12) + O ( 1 N ) . By Chebyshev’s inequality, (cid:107) Y N (cid:107) − E [ (cid:107) Y N (cid:107) ] √ Var[ (cid:107) Y N (cid:107) ] is bounded in probability and we can seethat | ˆ τ N − τ N | is O p ( N − ). Lemma F.2.
Let W ∗ N and b ∗ N denote the win and its bound evaluated for τ = τ N , rather than the empirical Bayes estimate. Then W N − b N W ∗ N − b ∗ N = 1 + τ N − ˆ τ N τ N = 1 + O p ( 1 √ N ) . roof. Recall that we may decompose W N as W N (Θ N , Y N ) = 1 √ N (cid:34)
21 + ˆ τ N (cid:15) (cid:62) N Y N − τ N ) (cid:107) Y N (cid:107) (cid:35) and that our bound is b N ( Y N , α ) = 1 √ N (cid:40) inf λ ∈ [0 ,U ( Y N , − α )]
21 + ˆ τ N F − (cid:20) χ N ( λ , − α (cid:21) − λ τ N ) − (cid:107) Y N (cid:107) (1 + ˆ τ N ) (cid:41) , where U ( Y N , α ) does not depend on ˆ τ N .As such, W N − b N = 2 √ N (1 + ˆ τ N ) (cid:40) (cid:15) (cid:62) N Y N − inf λ ∈ [0 ,U ( Y N , − α )] F − (cid:20) χ N ( λ , − α (cid:21) + λ (cid:41) , and we can see that W N − b N W ∗ N − b ∗ N = 1 + τ N τ N = 1 + τ N − ˆ τ N τ N . By Lemma F.1 the second term is O p ( N − ), as desired. Lemma F.3.
Let λ , λ , . . . be a sequence of reals satisfying, for each N , N − λ N < κ for some constant κ . Let F − χ N ( λ N , α ) denote the inverse CDF of a non-central χ with N degrees of freedom and non-centrality parameter λ N . Then for any α ∈ (0 , , √ N (cid:20) F − χ N ( λ N , α ) − ( N + λ N ) (cid:21) = (cid:114) λ N N z α + O ( 1 √ N ) , where z α is the α -quantile of the standard normal.Proof. Note that a χ N ( λ N ) random variable is equal in distribution to a sum of N i.i.d. χ ( N − λ N ) random variables. Let σ N := Var[ χ ( N − λ N )] = 2 + 4 N − λ N and notethat each σ N ≥ . Let ρ N := 8 + 24 N − λ N be third central moment of these variatesand note that each ρ N ≤ κ .Let F χ N ( λ N ) ( x ) denote the CDF of a non-central χ random variable with N degreesof freedom and non-centrality parameter λ N evaluated at x . By the Berry–Esseentheorem (Berry 1941, Theorem 1), for all x (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) F χ N ( λ N ) ( x ) − Φ (cid:20) x − ( N + λ N ) √ N + 4 λ N (cid:21)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ≤ C ρσ √ N ≤ C (8 + 24 κ )2 √ N = O ( 1 √ N ) , C ≤ .
88 is a universal constant. Since Φ( · ) is continuously differentiable andinvertible, we obtain the same convergence rate for the inverse CDFs. That is, for any α ∈ (0 , , F − χ N ( λ N , α ) − ( N + λ N ) √ N + 4 λ N − z α = O ( 1 √ N ) . Rescaling these terms by N − √ N + 4 λ N and rearranging, we find1 √ N (cid:20) F − χ N ( λ N , α ) − ( N + λ N ) (cid:21) = (cid:114) λ N N z α + O ( 1 √ N )as desired. Lemma F.4.
Let b ∗ N and W ∗ N again denote the win and bounds evaluated for thevariance τ = τ N rather than the empirical Bayes estimate. If the sequence τ N isbounded, then ( W ∗ N − b ∗ N ) − c N d N → N (0 , for some sequences of constants c , c , . . . and d , d , . . . .Proof. Let κ be such that for all N , τ N < κ .Recall that we may write W ∗ N − b ∗ N = 2 √ N (1 + τ N ) (cid:40) (cid:15) (cid:62) N Y N − inf λ ∈ [0 ,U ( Y N , − α )] F − (cid:20) χ N ( λ , − α (cid:21) + λ (cid:41) . (31)To prove the lemma, we build off of the normal approximation described in Ap-pendix D.1. Note first that an application of Chebyshev’s inequality provides that N − U ( Y N , − α ) − τ N is O p ( N − ), so that N − U ( Y N , − α ) < κ with probability ap-proaching 1. Next, by Lemma F.3,1 √ N (cid:40) F − (cid:20) χ N ( λ N , − α (cid:21) − (cid:20) λ N N (cid:21)(cid:41) = (cid:114) λ N N z − α + O ( 1 √ N ) , for any sequence λ , λ , . . . that satisfies, for each N , N − λ N < κ. Notably, since any sequence of λ N ’s achieving the infima in Equation (31) will satisfy41his condition, we may substitute this expression in and rewrite W ∗ N − b ∗ N as W ∗ N − b ∗ N = 21 + τ N (cid:15) (cid:62) N Y N √ N − √ N (cid:40) inf λ N ∈ [0 ,U ( Y N , − α )] F − (cid:20) χ N ( λ N , − α (cid:21) − (cid:20) λ N N (cid:21)(cid:41) − √ N = 21 + τ N (cid:34) (cid:15) (cid:62) N Y N − N √ N − inf λ N ∈ [0 ,U ( Y N , − α )] z − α (cid:114) λ N N + O p ( 1 √ N ) (cid:35) = 21 + τ N (cid:15) (cid:62) N Y N − N √ N − z − α (cid:115) U ( Y N , − α ) N + O p ( 1 √ N ) = 21 + τ N (cid:34) (cid:15) (cid:62) N Y N − N √ N − z − α (cid:113) τ N + O p ( 1 √ N ) (cid:35) // Since τ N − U ( Y N , − α ) N is O p ( 1 √ N ) . Finally, note that (cid:15) (cid:62) Y N is approximately normal with mean N and variance N (2 + τ N ). Furthermore, the distribution of this quantity approaches that of a normal atthe same O ( N − ) rate in the supremum norm (one may make this precise with aBerry–Esseen bound). This allows us to write W ∗ N − b ∗ N ∼
21 + τ N (cid:20)(cid:113) τ N x − (cid:113) τ N z − α (cid:21) + O p ( 1 √ N ) ∼ (cid:113) τ N τ N ( x − z − α ) + O p ( 1 √ N )for x ∼ N (0 , d N := (2 (cid:113) τ N ) / (1 + τ N ) and c N := − d N z − α , and noting that the lower order term does not influence the limitingdistribution of d − N (cid:2) ( W ∗ N − b ∗ N ) − c N (cid:3) . G Logistic regression supplementary material
This section provides supplementary information related to Section 5.2. We begin byreviewing notation for convenience in Appendix G.1. In Appendix G.2 we then providea proposition demonstrating the asymptotic rate of convergence of the approximation ofthe MAP estimate to the exact MAP estimate, as well a proof and supporting lemmas.Appendix G.3 then provides a proof of Theorem 5.3. Appendix G.4 gives additionaldetails on the simulation experiments. 42 .1 Preliminaries and notation
Consider logistic regression with random N -vector covariates x , x , . . . and responses y , y , . . . , where for each data point m, y m | x m , θ ∼ (1 + exp {− x (cid:62) m θ } ) − δ + (1 +exp { x (cid:62) m θ } ) − δ − for some unknown parameter θ ∈ R N . We use X M = [ x , x , . . . , x M ] (cid:62) and Y M = [ y , y , . . . , y M ] (cid:62) to denote the first M data points.One choice of an estimate for θ after observing M observations is the MLE,ˆ θ M := arg max θ log p ( Y M | X M , θ ) . Another possibility is the MAP estimate under a standard normal prior θ ∗ M := arg max θ log p ( Y M | X M , θ ) − (cid:107) θ (cid:107) . The approach in Section 5.2 involves an approximation to this estimate involving aGaussian approximation to the likelihood, defined by a 2nd order Taylor approximationof the log posterior formed at ˆ θ M . In particular, by Bayes’ rule, the log posterior is, upto an additive constant,log p M ( θ ) := log p ( Y M | X M , θ ) − (cid:107) θ (cid:107) and we use the approximationlog ˜ p M ( θ ) := log p ( Y M | X M , ˆ θ M ) − (cid:107) θ (cid:107) −
12 ( θ − ˆ θ M ) (cid:62) H M (ˆ θ M )( θ − ˆ θ M ) , (32)where H M (ˆ θ M ) = ∇ θ − log p ( Y M | X M , θ ) (cid:12)(cid:12) θ =ˆ θ M is the Hessian of the negative loglikelihood, computed at the MLE.The approximation we use for computing our proposed bound is then the maximizerof this approximation ˜ θ ∗ M := arg max θ log ˜ p M ( θ ) . In Section 5.2 we found that we could express ˜ θ ∗ M as˜ θ ∗ M = (cid:104) I N + ˜Σ M (cid:105) − ˆ θ M , where ˜Σ M := H M (ˆ θ M ) − is an approximation to the covariance of ˆ θ M . This solutionmay be seen by considering the first order optimality condition (i.e. setting the gradientof log ˜ p M ( θ ) to zero). G.2 Asymptotic approximation quality
We here show that, in the large sample limit, ˜ θ ∗ provides a very close approximation ofthe MAP estimate, θ ∗ . roposition G.1 (Asymptotic approximation quality) . Consider Bayesian logistic re-gression with a Gaussian prior θ ∼ N (0 , I N ) . Let x , x , . . . be a sequence of random i.i.d.covariates satisfying E [ x m x (cid:62) m ] (cid:31) and with bounded third moment, and let y , y , . . . be responses distributed as in Equation (18) . Denote by X M := [ x , x , . . . , x M ] (cid:62) and Y M := [ y , y , . . . , y M ] (cid:62) the covariates and labels of the first M data points. Considerthe MAP estimate of θ after observing M data points, θ ∗ M := arg max θ p ( θ | Y M , X M ) and the approximation ˜ θ ∗ M := (cid:104) I N + ˜Σ M (cid:105) − ˆ θ M , (33) where ˆ θ M := arg max θ p ( Y M | X M ; θ ) and ˜Σ M := (cid:104) −∇ θ log p ( Y M | X M ; θ ) (cid:12)(cid:12) θ =ˆ θ M (cid:105) − . Then (cid:107) ˜ θ ∗ M − θ ∗ M (cid:107) ∈ O p ( M − ) , where O p denotes stochastic convergence in probability. The O p ( M − ) convergence rate established in Proposition G.1 is very fast in com-parison to the O p ( M − ) convergence rate of the MLE, as well as to the O p ( M − ) rateof convergence of the MAP to the posterior mean. Notably, this asymptotic rate isconsistent with rates observed in simulation (Figure 6a). Proof.
We here show that (cid:107) θ ∗ M − ˜ θ ∗ M (cid:107) is O p ( M − ). Our route to proving this relieson Lemma G.1 (Trippe et al. 2019, Lemma E.1), which will provide a sequence ofbounds on (cid:107) θ ∗ M − ˜ θ ∗ M (cid:107) that depend on the norms of the gradients of log p M ( · ) at˜ θ ∗ M , c M := (cid:107)∇ θ log p M (˜ θ ∗ M ) (cid:107) , and a sequence of strong log-concavity constants α M for log p M ( · ) which hold on the interval { tθ ∗ M + (1 − t )˜ θ ∗ M | t ∈ [0 , } . In particular,Lemma G.1 provides that (cid:107) θ ∗ M − ˜ θ ∗ M (cid:107) ≤ c M α M and we obtain the result by showing that α M grows as Ω p ( M ) and c M drops as O p ( M − ).We first use Lemma G.3 to show that the strong log-concavity constants of log p M in a neighborhood of radius (cid:15) of θ, B (cid:15) ( θ ) grow as Ω p ( M ) . This allows us to establishthat (cid:107) ˜ θ ∗ M − ˆ θ M (cid:107) is O p ( M − ) (Lemma G.4). Since both ˆ θ M and θ ∗ M converge stronglyto θ under these conditions (see e.g. Van der Vaart (2000, Theorem 10.10)), the interval { tθ ∗ M + (1 − t )˜ θ ∗ M | t ∈ [0 , } is then contained within B (cid:15) ( θ ) with probability approaching1 . Consequently, the constants of strong log concavity of log p M on this interval, whichwe take as α , α , . . . , must grow as Ω p ( M ) as well.Now all that remains is to show that c M drops as O p ( M − ) . Recall from abovethat (cid:107) ˜ θ ∗ M − ˆ θ M (cid:107) is O ( M − ) . This fact and the boundedness of the higher derivatives of ∇ log p M will allow us to use Taylor’s theorem to obtain the desired rate.However, before proceeding to a more detailed derivation of this rate, we introducesome additional notation. Let φ ( y, a ) denote the GLM mapping function, such that φ ( y, a = x (cid:62) θ ) = log p ( y | x, θ )= − log(1 + exp {− yx (cid:62) θ } )and note that all higher derivatives with respect to a are bounded. In particular, thirdderivative satisfies φ (cid:48)(cid:48)(cid:48) ( a ) := d da φ ( y, a ) ≤ √ , y as an argument, because these higher derivatives do notdepend on y .We now proceed to derive a stochastic rate of convergence of (cid:107)∇ θ log p M (˜ θ ∗ M ) (cid:107) . Weobtain this through a long derivation involving a series of upper bounds. 
(cid:107)∇ θ log p M (˜ θ ∗ M ) (cid:107) = (cid:107)∇ θ (log p M − log ˜ p M )(˜ θ ∗ M ) (cid:107) = (cid:107)∇ θ (log p M − log ˜ p M )(ˆ θ M ) + (˜ θ ∗ M − ˆ θ M ) (cid:62) ∇ θ (log p M − log ˜ p M )( θ (cid:48) M ) (cid:107) // By Taylor’s theorem, for some θ (cid:48) M ∈ { t ˆ θ M + (1 − t )˜ θ ∗ M | t ∈ [0 , } = (cid:107) (˜ θ ∗ M − ˆ θ M ) (cid:62) ∇ θ (log p M ( θ (cid:48) M ) − log ˜ p M ( θ (cid:48) M )) (cid:107) // Since ∇ θ log ˜ p M (ˆ θ ) = ∇ θ log p M (ˆ θ )= (cid:107) (˜ θ ∗ M − ˆ θ M ) (cid:62) (cid:104) ∇ θ log p ( Y M | X M , θ (cid:48) M ) − ∇ θ log p ( Y M | X M , ˆ θ M ) (cid:105) (cid:107) // Since log ˜ p M is a second degree approximation defined at ˆ θ M ≤ (cid:107) ˜ θ ∗ M − ˆ θ M (cid:107) M (cid:88) m =1 (cid:107)∇ θ log p ( y m | x m , θ (cid:48) M ) − ∇ θ log p ( y m | x m , ˆ θ M ) (cid:107) OP = (cid:107) ˜ θ ∗ M − ˆ θ M (cid:107) M (cid:88) m =1 (cid:107) θ (cid:48) M − ˆ θ M (cid:107) · (cid:107) (cid:90) t =0 ∂∂t ∇ θ log p ( y m | x m , θ ) (cid:12)(cid:12) θ = t ˆ θ M +(1 − t ) θ (cid:48) M (cid:107) OP // By the fundamental theorem of calculus ≤ (cid:107) ˜ θ ∗ M − ˆ θ M (cid:107) M (cid:88) m =1 (cid:107) (cid:90) t =0 ∂∂t ∇ θ log p ( y m | x m , θ ) (cid:12)(cid:12) θ = t ˆ θ M +(1 − t ) θ (cid:48) M (cid:107) OP ≤ (cid:107) ˜ θ ∗ M − ˆ θ M (cid:107) M (cid:88) m =1 (cid:107) x m (cid:107) (max a φ (cid:48)(cid:48)(cid:48) ( a )) = 16 √ (cid:107) ˜ θ ∗ M − ˆ θ M (cid:107) M (cid:88) m =1 (cid:107) x m (cid:107) ≤ O p ( 1 M ) O p ( M ) = O p ( 1 M ) , where the final line requires that the covariates have bounded third moment. Supporting LemmasLemma G.1 (Trippe et al., 2019, Lemma E.1) . Let f, g be twice differentiable functionsmapping R N → R and attaining minima at θ f = arg min θ f ( θ ) and θ g = arg min θ g ( θ ) ,respectively. Additionally, assume that f is α –strongly convex for some α > on theset { tθ f + (1 − t ) θ g | t ∈ [0 , } and that (cid:107)∇ θ f ( θ g ) − ∇ θ g ( θ g ) (cid:107) = (cid:107)∇ θ f ( θ g ) (cid:107) ≤ c . Then (cid:107) θ f − θ g (cid:107) ≤ cα . (34)45 emma G.2 (uniform law of large numbers) . Let H M ( θ ) be as defined in Equation (32) and define H ( θ ) := E [ ∇ θ log p ( y | x ; θ )] , where the expectation is taken under the true θ . If E [ x x (cid:62) ] exists and is positive definite then sup θ (cid:48) ∈ B (cid:15) ( θ ) (cid:107) M H M ( θ (cid:48) ) − H ( θ (cid:48) ) (cid:107) a.s. → . according to p , where B (cid:15) ( θ ) is a closed neighborhood of θ of radius (cid:15), for any (cid:15) > . Proof.
Since the each of the M data points { ( x m , y m ) } ∞ m =1 are i.i.d. by assumption, M − H M converges point-wise by the law of large numbers. However, we are additionallyinterested in uniform convergence; a number of different uniform laws of large numberssuffice for this. Because H is continuously differentiable in θ (recall that for any x m , d dθ log p ( y m | x m , θ ) is bounded) it is therefore Lipschitz continuous on the boundedset B (cid:15) ( θ ). As such one can construct a bounded envelope for H on this set, whichamounts to a sufficient condition for uniform convergence on B (cid:15) , see Van der Vaart (2000,Theorem 19.4 - Glivenko-Cantelli). We refer the reader to Van der Vaart (2000, Chapter19) for technical background, and in particular to Van der Vaart (2000, Example 19.8)which walks through an example closely related to the present case. Lemma G.3.
Consider logistic regression with random covariates, x , x , . . . . Let B (cid:15) ( θ ) be a closed neighborhood of radius (cid:15) > around θ and for each M define α M := inf θ (cid:48) ∈ B (cid:15) ( θ ) λ min (cid:104) ∇ θ log p M ( θ (cid:48) ) (cid:105) to be the constant of strong log-concavity constant of log p M ( · ) on B (cid:15) ( θ ) , where λ min ( · ) denotes the smallest eigenvalue of its matrix argument. If the covariates are i.i.d. andsatisfy E [ x x (cid:62) ] (cid:31) , then α M is Ω p ( M ) . Proof.
Consider the scaled Hessians of log p M ( · ), M − H M ( · ) . By Lemma G.2, M − H M ( · )converges uniformly to its expectation, H ( θ ) := E [ ∇ θ log p ( y | x , θ )] on B (cid:15) ( θ ) . Since H ( θ ) (cid:31) B (cid:15) ( θ ) , we have thatinf θ (cid:48) ∈ B (cid:15) ( θ ) λ min ( 1 M H M ( θ )) a.s. → inf θ (cid:48) ∈ B (cid:15) ( θ ) λ min ( H M ( θ )) > . Therefore α M := inf θ (cid:48) ∈ B (cid:15) ( θ ) λ min (cid:2) ∇ θ log p M ( θ (cid:48) ) (cid:3) is Ω p ( M ) . Lemma G.4.
Let ˆ θ and ˜ θ ∗ be the MLE and the approximation to the MAP defined inEquation (19) , respectively. If the covariates, x , x , . . . are i.i.d. and satisfy E [ x x (cid:62) ] (cid:31) , then (cid:107) ˆ θ M − ˜ θ ∗ M (cid:107) is O p ( M − ) . Proof.
Recall that ˜ θ ∗ M = (cid:104) I N + ˜Σ M (cid:105) − ˆ θ M , M := H M (ˆ θ M ) − . Lemma G.3 provides that the constants of strong log-concavityfor log p M grow as Ω p ( M ) in a neighborhood of θ. Therefore, since ˆ θ M converges stronglyto θ, we can see that λ min ( H M (ˆ θ M )) is Ω p ( M ) . Next, we rewrite (cid:107) ˜ θ ∗ M − ˆ θ M (cid:107) = (cid:107) (cid:104) I N + ˜Σ M (cid:105) − ˆ θ M − ˆ θ M (cid:107) = (cid:107) (cid:104) I N + H M (ˆ θ M ) (cid:105) − ˆ θ M (cid:107)≤ (cid:107) (cid:104) I N + H M (ˆ θ M ) (cid:105) − (cid:107) OP (cid:107) ˆ θ M (cid:107)≤ (cid:107) ˆ θ M (cid:107) λ min (cid:16) H M (ˆ θ M ) (cid:17) . which one can see is O p ( M − ) since (cid:107) ˆ θ M (cid:107) is bounded in probability. G.3 Proof of Theorem 5.3
G.3 Proof of Theorem 5.3

Before proving the theorem, we begin by explicitly writing out the win and our proposed bound defined in Section 5.2. For clarity, we introduce a subscript $M$ to index the size of the dataset on which these quantities are computed. Specifically, recalling that in this case we have $A = I_N$ and $C = (I_N + \tilde{\Sigma}_M)^{-1}$, and noting that therefore $A - C = (I_N + \tilde{\Sigma}_M^{-1})^{-1}$, we have
$$
b_M(\alpha) = 2\,\mathrm{tr}\big[(I_N + \tilde{\Sigma}_M^{-1})^{-1}\tilde{\Sigma}_M\big] + 2 z_{1-\alpha}\sqrt{U_M\big(\|G_M(\hat{\theta}_M)\|^2_{\tilde{\Sigma}_M},\, 1-\alpha\big) + 2\big\|\tilde{\Sigma}_M(I_N + \tilde{\Sigma}_M^{-1})^{-1}\big\|_F^2} - \|\tilde{\theta}^*_M - \hat{\theta}_M\|^2,
$$
where $G_M(\hat{\theta}_M) := (I_N + \tilde{\Sigma}_M^{-1})^{-1}\hat{\theta}_M$ and
$$
\begin{aligned}
U_M\big(\|G_M(\hat{\theta}_M)\|^2_{\tilde{\Sigma}_M},\, 1-\alpha\big) := \inf_{\delta > 0}\bigg\{\delta \;\bigg|\; \|G_M(\hat{\theta}_M)\|^2_{\tilde{\Sigma}_M} \le{}& \big(\delta + \big\|\tilde{\Sigma}_M(I_N + \tilde{\Sigma}_M^{-1})^{-1}\big\|_F^2\big) \; + \\
& z_{1-\alpha}\sqrt{2\big\|\tilde{\Sigma}_M(I_N + \tilde{\Sigma}_M^{-1})^{-1}\tilde{\Sigma}_M(I_N + \tilde{\Sigma}_M^{-1})^{-1}\big\|_F^2 + 4\big\|\tilde{\Sigma}_M(I_N + \tilde{\Sigma}_M^{-1})^{-1}\tilde{\Sigma}_M(I_N + \tilde{\Sigma}_M^{-1})^{-1}\big\|_{OP}\,\delta}\bigg\}
\end{aligned} \tag{35, 36}
$$
is an approximate high-confidence upper bound on $\|G_M(\theta)\|^2_{\tilde{\Sigma}_M}$. For convenience, we abbreviate $U_M(\|G_M(\hat{\theta}_M)\|^2_{\tilde{\Sigma}_M}, 1-\alpha)$ by $U_M$.
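Because the right hand side of the constraint in Equations (35) and (36) is increasing in $\delta$, the infimum defining $U_M$ can be computed by a one-dimensional search. The sketch below is ours; the arguments `centering`, `var_f`, and `var_op` are placeholder names standing in for the norm terms of the reconstructed display above:

```python
import numpy as np

# Our sketch: compute U_M = inf{delta > 0 : constraint(delta) holds} by bisection.
# `stat` is the observed statistic ||G_M(theta_hat)||^2 and `centering`, `var_f`,
# `var_op` are placeholders for the Frobenius- and operator-norm terms above.
def U_M(stat, z, centering, var_f, var_op, hi=1.0, tol=1e-12):
    def satisfied(delta):
        return stat <= delta + centering + z * np.sqrt(2 * var_f + 4 * var_op * delta)

    while not satisfied(hi):      # grow the bracket until the constraint holds
        hi *= 2.0
    lo = 0.0
    while hi - lo > tol:          # bisect down to the infimum
        mid = 0.5 * (lo + hi)
        lo, hi = (lo, mid) if satisfied(mid) else (mid, hi)
    return hi
```

Bisection is valid here because the constraint, once satisfied, remains satisfied for all larger $\delta$.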
Next, we recall that we may decompose the win in squared error loss for using $\theta^*_M$ in place of $\hat{\theta}_M$ as
$$W_M(\theta) = 2\epsilon_M^\top (I_N + \tilde{\Sigma}_M^{-1})^{-1}\hat{\theta}_M - \|\theta^*_M - \hat{\theta}_M\|^2, \quad \text{where } \epsilon_M := \hat{\theta}_M - \theta.$$

Proof. Proving the theorem amounts to showing that for any $\theta$ and $\alpha \in (0, 1)$,
$$\lim_{M \to \infty} P_\theta\big[W_M(\theta) \ge b_M(\alpha)\big] \ge \alpha.$$
Lemma G.6 provides that $M^{1.5}(W_M(\theta) - b_M(\alpha))$ converges in distribution to $2\sqrt{\theta^\top H(\theta)^{-3}\theta}\,(\delta - z_{1-\alpha})$ for $\delta \sim \mathcal{N}(0, 1)$. Thus for any $\theta$,
$$P_\theta\big[W_M(\theta) - b_M(\alpha) > 0\big] \to 1 - \Phi(z_{1-\alpha}) = 1 - (1-\alpha) = \alpha.$$
This establishes that $b_M(\cdot)$ attains nominal coverage asymptotically, as desired.

Lemma G.5. $\big|U_M - \|\tilde{\Sigma}_M\theta\|^2_{\tilde{\Sigma}_M}\big|$ is $O_p(M^{-3.5})$.

Proof. Recall that we can rearrange Equation (35) to see that $U_M$ satisfies
$$
\big\|(I_N + \tilde{\Sigma}_M^{-1})^{-1}\hat{\theta}_M\big\|^2_{\tilde{\Sigma}_M} = U_M + 2\big\|\tilde{\Sigma}_M^2(I_N + \tilde{\Sigma}_M)^{-1}\big\|_F^2 + z_{1-\alpha}\sqrt{2\big\|\big[\tilde{\Sigma}_M^2(I_N + \tilde{\Sigma}_M)^{-1}\big]^2\big\|_F^2 + 4\big\|\big[\tilde{\Sigma}_M^2(I_N + \tilde{\Sigma}_M)^{-1}\big]^2\big\|_{OP}\,U_M},
$$
where we have simplified $\tilde{\Sigma}_M(I_N + \tilde{\Sigma}_M^{-1})^{-1}$ to $\tilde{\Sigma}_M^2(I_N + \tilde{\Sigma}_M)^{-1}$. We next further simplify the condition above by replacing two quantities with simplifying approximations plus lower order terms. First note that we may write
$$
\begin{aligned}
\big\|(I_N + \tilde{\Sigma}_M^{-1})^{-1}\hat{\theta}_M\big\|^2_{\tilde{\Sigma}_M}
&= \big\|\tilde{\Sigma}_M\hat{\theta}_M - \tilde{\Sigma}_M^2(I_N + \tilde{\Sigma}_M)^{-1}\hat{\theta}_M\big\|^2_{\tilde{\Sigma}_M} \\
&= \|\tilde{\Sigma}_M\hat{\theta}_M\|^2_{\tilde{\Sigma}_M} + \big\|\tilde{\Sigma}_M^2(I_N + \tilde{\Sigma}_M)^{-1}\hat{\theta}_M\big\|^2_{\tilde{\Sigma}_M} - 2\hat{\theta}_M^\top \tilde{\Sigma}_M^4(I_N + \tilde{\Sigma}_M)^{-1}\hat{\theta}_M \\
&= \|\tilde{\Sigma}_M(\theta + \epsilon_M)\|^2_{\tilde{\Sigma}_M} + O_p(M^{-4}) \\
&= \|\tilde{\Sigma}_M\theta\|^2_{\tilde{\Sigma}_M} + \|\tilde{\Sigma}_M\epsilon_M\|^2_{\tilde{\Sigma}_M} + 2\epsilon_M^\top\tilde{\Sigma}_M^3\theta + O_p(M^{-4}) \\
&= \|\tilde{\Sigma}_M\theta\|^2_{\tilde{\Sigma}_M} + O_p(M^{-3.5}).
\end{aligned}
$$
Second, we write
$$
\begin{aligned}
z_{1-\alpha}\sqrt{2\big\|\big[\tilde{\Sigma}_M^2(I_N + \tilde{\Sigma}_M)^{-1}\big]^2\big\|_F^2 + 4\big\|\big[\tilde{\Sigma}_M^2(I_N + \tilde{\Sigma}_M)^{-1}\big]^2\big\|_{OP}\,U_M}
&= z_{1-\alpha}\sqrt{O_p(M^{-8}) + 4\big\|\tilde{\Sigma}_M^2(I_N + \tilde{\Sigma}_M)^{-1}\big\|_{OP}^2\,U_M} \\
&= 2 z_{1-\alpha}\big\|\tilde{\Sigma}_M^2(I_N + \tilde{\Sigma}_M)^{-1}\big\|_{OP}\sqrt{U_M} + O_p(M^{-4}).
\end{aligned}
$$
As such, we may see that $U_M$ satisfies
$$
\|\tilde{\Sigma}_M\theta\|^2_{\tilde{\Sigma}_M} - U_M = 2\big\|\tilde{\Sigma}_M^2(I_N + \tilde{\Sigma}_M)^{-1}\big\|_F^2 + 2 z_{1-\alpha}\big\|\tilde{\Sigma}_M^2(I_N + \tilde{\Sigma}_M)^{-1}\big\|_{OP}\sqrt{U_M} + O_p(M^{-3.5}) = 2 z_{1-\alpha}\big\|\tilde{\Sigma}_M^2(I_N + \tilde{\Sigma}_M)^{-1}\big\|_{OP}\sqrt{U_M} + O_p(M^{-3.5}), \tag{37}
$$
where we have dropped $2\|\tilde{\Sigma}_M^2(I_N + \tilde{\Sigma}_M)^{-1}\|_F^2$ since it is $O_p(M^{-4})$. We next observe that $U_M$ must be $O_p(M^{-3})$: since $2 z_{1-\alpha}\|\tilde{\Sigma}_M^2(I_N + \tilde{\Sigma}_M)^{-1}\|_{OP}\sqrt{U_M} \ge 0$, Equation (37) implies that $U_M \le \|\tilde{\Sigma}_M\theta\|^2_{\tilde{\Sigma}_M} + O_p(M^{-3.5}) = O_p(M^{-3})$. In turn, this implies that $2 z_{1-\alpha}\|\tilde{\Sigma}_M^2(I_N + \tilde{\Sigma}_M)^{-1}\|_{OP}\sqrt{U_M}$ is $O_p(M^{-3.5})$, which together with Equation (37) allows us to see that $\big|U_M - \|\tilde{\Sigma}_M\theta\|^2_{\tilde{\Sigma}_M}\big|$ is $O_p(M^{-3.5})$, as desired.
Lemma G.6. Let $\alpha \in (0, 1)$ and $\theta \in \mathbb{R}^N$. Consider the sequence of wins, $W_M(\theta)$, and bounds, $b_M(\alpha)$, computed for logistic regression. Then
$$M^{1.5}\big(W_M(\theta) - b_M(\alpha)\big) \overset{d}{\to} 2\sqrt{\theta^\top H(\theta)^{-3}\theta}\,(\delta - z_{1-\alpha}),$$
where $\delta \sim \mathcal{N}(0, 1)$.

Proof.
We prove the lemma by first writing $W_M$ and $b_M$ using simplifying approximations and lower order terms. The result is obtained by manipulating a scaling of the difference between the two expressions and considering the limit in $M$. Note first that we may write
$$
\begin{aligned}
W_M(\theta) &:= -2\epsilon_M^\top(\theta^*_M - \hat{\theta}_M) - \|\theta^*_M - \hat{\theta}_M\|^2 \\
&= -2\epsilon_M^\top(\tilde{\theta}^*_M - \hat{\theta}_M) - \|\tilde{\theta}^*_M - \hat{\theta}_M\|^2 + O_p(M^{-2}) \\
&= 2\epsilon_M^\top(I_N + \tilde{\Sigma}_M^{-1})^{-1}\hat{\theta}_M - \|\tilde{\theta}^*_M - \hat{\theta}_M\|^2 + O_p(M^{-2}) \\
&= 2\epsilon_M^\top\tilde{\Sigma}_M\hat{\theta}_M - \|\tilde{\theta}^*_M - \hat{\theta}_M\|^2 + O_p(M^{-2}) \\
&= 2\epsilon_M^\top\tilde{\Sigma}_M\theta - \|\tilde{\theta}^*_M - \hat{\theta}_M\|^2 + O_p(M^{-2}).
\end{aligned}
$$
Next we write
$$
\begin{aligned}
b_M(\alpha) &= 2\,\mathrm{tr}\big[(I_N + \tilde{\Sigma}_M^{-1})^{-1}\tilde{\Sigma}_M\big] + 2 z_{1-\alpha}\sqrt{U_M + 2\big\|\tilde{\Sigma}_M^2(I_N + \tilde{\Sigma}_M)^{-1}\big\|_F^2} - \|\tilde{\theta}^*_M - \hat{\theta}_M\|^2 \\
&= 2 z_{1-\alpha}\sqrt{\|\tilde{\Sigma}_M\theta\|^2_{\tilde{\Sigma}_M} + O_p(M^{-3.5})} - \|\tilde{\theta}^*_M - \hat{\theta}_M\|^2 + O_p(M^{-3}) \\
&= 2 z_{1-\alpha}\|\tilde{\Sigma}_M\theta\|_{\tilde{\Sigma}_M} - \|\tilde{\theta}^*_M - \hat{\theta}_M\|^2 + O_p(M^{-2}),
\end{aligned}
$$
where the second line uses Lemma G.5. By considering a scaled difference between these two terms we find
$$M^{1.5}\big(W_M(\theta) - b_M(\alpha)\big) = 2 M^{1.5}\epsilon_M^\top\tilde{\Sigma}_M\theta - 2 M^{1.5} z_{1-\alpha}\|\tilde{\Sigma}_M\theta\|_{\tilde{\Sigma}_M} + O_p(M^{-0.5}).$$
Recognizing that $\epsilon_M$ is asymptotically normal with mean zero and covariance $\Sigma_M \approx \tilde{\Sigma}_M$, so that $2\epsilon_M^\top\tilde{\Sigma}_M\theta$ is asymptotically normal with variance $4\|\tilde{\Sigma}_M\theta\|^2_{\tilde{\Sigma}_M}$, and noting that Lemma G.2 implies
$$M^{1.5}\|\tilde{\Sigma}_M\theta\|_{\tilde{\Sigma}_M} = \sqrt{\theta^\top\big(H_M(\hat{\theta}_M)/M\big)^{-3}\theta} \overset{a.s.}{\to} \sqrt{\theta^\top H(\theta)^{-3}\theta},$$
the result obtains with $\delta \sim \mathcal{N}(0, 1)$.

G.4 Empirical validation of logistic regression bound in simulation

Figure 6: c-values for logistic regression in simulation. (a) Empirical rates of convergence of distances amongst various estimates and the true parameter with N = 2. In simulation with N = 25 and M = 1000, (b) c-values are able to detect improvements, sometimes with high confidence, and (c) the approximate bound has greater than nominal coverage. See Appendix G.4 for details.

We here explore the behaviour of our proposed approximation, bound, and the associated c-values empirically on simulated data. Figure 6a shows the distance between various estimates and the true parameter for a range of sample sizes in simulation. Due to the log-log scale, the slopes of the series in this plot reflect the polynomial rates of convergence. Notably, we see the fast $O_p(M^{-2})$ rate of convergence of our approximation to the MAP estimate, $\tilde{\theta}^*_M$, to the exact MAP estimate, $\theta^*_M$.

Figure 6b demonstrates that our approach is able to detect improvements (i.e. we can obtain high c-values). Furthermore, our proposed bound has similar coverage properties as in the Gaussian case (Figure 6c). In the experiments for Figures 6b and 6c, we simulated the parameter as $\theta \sim \mathcal{N}(0, I_N)$ and, in each replicate, simulated the covariates for each data point $m$ as $x_m \overset{i.i.d.}{\sim} \mathcal{N}(0, N^{-1}I_N)$.

Two of the series in Figure 6a are distances between the posterior mean of $\theta$, $\mathbb{E}[\theta \mid X, Y] = \int p(\theta \mid X, Y)\,\theta\,d\theta$, and other estimates. Because this model is non-conjugate, this estimate does not have an analytic form; as such, we approximated these quantities with Gauss-Hermite quadrature.
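A single replicate of this simulation is easy to reproduce. The sketch below is ours, not the released notebooks; it computes the realized win for the approximate posterior mean, but not the accompanying bound:

```python
import numpy as np

# Our sketch of one replicate of the Appendix G.4 simulation: draw theta ~ N(0, I_N)
# and x_m ~ N(0, I_N / N), fit the MLE by Newton's method, form the approximate
# posterior mean, and compare realized squared-error losses.
rng = np.random.default_rng(3)
N, M = 25, 1000
theta = rng.normal(size=N)
X = rng.normal(size=(M, N)) / np.sqrt(N)
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-X @ theta)))

theta_hat = np.zeros(N)
for _ in range(25):                                  # Newton iterations for the MLE
    p = 1.0 / (1.0 + np.exp(-X @ theta_hat))
    H = (X * (p * (1 - p))[:, None]).T @ X           # negative log-likelihood Hessian
    theta_hat += np.linalg.solve(H, X.T @ (y - p))

theta_tilde = np.linalg.solve(np.eye(N) + np.linalg.inv(H), theta_hat)
win = np.sum((theta_hat - theta) ** 2) - np.sum((theta_tilde - theta) ** 2)
print(win)   # typically positive: the shrunken estimate achieves smaller loss
```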
For each sample size $M$, we performed 25 replicate simulations. In the experiments that went into Figures 6b and 6c, we used $N = 25$ and $M = 1000$. See logistic regression approximations.ipynb and logistic regression c values and operating characteristics.ipynb for details.

Figure 7: Calibration of the lower bounds $b(y, \alpha)$ in small area inference with an empirical Bayes step (5000 replicates). The coverage on the y-axis is a Monte Carlo estimate of $P_\theta\big[W(\theta, y) \ge b(y, \alpha)\big]$. Each series corresponds to a set of simulations within which we excluded a different subset of schools based on a minimum number of students tested.
H Additional details on applications
In this section, we provide additional details associated with the applications in Section 6.
H.1 Estimation from educational testing data
Conservatism of c-values with the empirical Bayes step.
The application in Section 6.1 diverges from the scenarios covered by our theory in Sections 3 and 4 in its use of the empirical Bayes step to estimate $\beta$, $\tau$, and $\sigma$. As a result, our theory does not provide that $c(y)$ satisfies the guarantee of Theorem 2.2. However, given the favorable asymptotic and empirical properties of the empirical Bayes procedure established in Section 5.1, we conjectured that the looseness in the lower bound $b(y, \alpha)$ would be sufficiently large to compensate for any error introduced by these departures from the assumptions of our theory. To investigate this, we performed a simulation study in which we used this empirical Bayes step and confirmed that the c-values retained at least nominal coverage (Figure 7). To ensure that the simulated data had similar characteristics to the real data, we simulated 5000 datasets by drawing hypothetical school level means according to the assumed generative model with the parameters ($\beta$, $\tau$, and $\sigma$) fit on the real dataset. In each simulation, we re-estimated the fixed effects and variances (again using lme4), and computed the associated MLE, Bayes estimates, and bounds across a range of confidence levels. We then computed the empirical coverage of these bounds and found them to be conservative across all tested levels.
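Schematically, this calibration study takes the following form. The sketch below is ours; the actual experiments refit the hierarchical model with lme4 in R, and `eb_estimates_and_bound` is a hypothetical placeholder for that empirical Bayes refit and the bound construction of Section 3:

```python
import numpy as np

# Our schematic of the Figure 7 calibration study. `eb_estimates_and_bound` is a
# hypothetical placeholder standing in for the lme4 refit and the bound b(y, alpha).
def eb_estimates_and_bound(y, alpha):
    raise NotImplementedError("stands in for the lme4 refit and b(y, alpha)")

def empirical_coverage(beta, tau, sigma, n_schools, alpha, reps=5000, seed=4):
    rng = np.random.default_rng(seed)
    hits = 0
    for _ in range(reps):
        theta = beta + tau * rng.normal(size=n_schools)  # hypothetical school means
        y = theta + sigma * rng.normal(size=n_schools)   # simulated school averages
        theta_hat, theta_star, b = eb_estimates_and_bound(y, alpha)
        win = np.sum((theta_hat - theta) ** 2) - np.sum((theta_star - theta) ** 2)
        hits += win >= b                                 # is W(theta, y) >= b(y, alpha)?
    return hits / reps   # conservative when this exceeds alpha at every level
```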
Additional preprocessing and calibration details. Hoff (2020) considered only schools at which 2 or more students took the reading test. We excluded an additional 8 schools with fewer than 5 students tested because we expected that the high variance in these observations could introduce too much slack into our bound as a result of the poor conditioning of $\Sigma(A - C)\Sigma$ (recall the operator norm bound in Equation (11), derived in Equation (26)). Consistent with this hypothesis, we computed a c-value of 0.88 when we included these additional schools, and a larger c-value when we further restricted to the 657 schools with at least 10 students tested. To further validate this hypothesis of increased conservatism, we simulated additional datasets with these different thresholds on school size and evaluated the calibration of the computed bounds (Figure 7). We observed that the coverage for the simulations with the smallest threshold was noticeably higher at large $\alpha$, in agreement with this hypothesis.

H.2 Estimation of violent crime rates in Philadelphia
Dependence on the order in which estimates are compared.
In Section 6.2 we chose to report one among three estimates as described in Remark 6.1. We note, however, that this paradigm is sensitive to the order in which the different estimates are considered. For this set of three models, if we had first compared $\theta^\circ(y)$ as the alternative to $\hat{\theta}(y)$ as the default, we would have rejected $\hat{\theta}(y)$ (with a high c-value) but computed only a low c-value
for comparing $\theta^*(y)$ as the alternative against $\theta^\circ(y)$ as the default. The potential cost of ending up with a worse estimate as a result of considering these estimates in sequence may be understood as a cost of looking at the data an additional time.

Selection of prior parameters from historical data.
The parameters $\sigma^2_\delta$, $\sigma^2_z$, and $\sigma^2_y$ were selected based on historical data. Specifically, we estimated $\sigma^2_y$ and $\sigma^2_z$ as the averages of the sample variances of the violent and non-violent report rates, respectively, computed within each census block in the preceding years. For the first model described in Section 6.2, we then estimated $\sigma^2_\delta$ using these same historical data to reflect the prior belief that half of the variability across the unknown rates is common across the two response types.

For the second model considered, we selected the signal variance and length scale of this covariance function by drawing hypothetical datasets of crime levels from the prior predictive distributions and selecting those which produced the most reasonable looking patterns. In particular, we chose the length scale to be one sixth of the maximum distance between the centroids of census blocks, and the signal variance to reflect the prior belief that one third of the variability in the unknown rates was explained by the spatial component. In addition, we chose a smaller value for $\sigma^2_\delta$ in this second model, so that the total implied variance would be the same. See supplementary code in Philly reported crime estimation.ipynb for additional details.

Derivation of $\theta^*$ (posterior mean in the first model). As mentioned in the main text, since the prior and likelihoods for this model are independent across each census block, we can compute the posterior mean for each block independently. Let $\pi(\cdot)$ denote the joint density of all variables. Then, since $z_n \perp y_n \mid \theta_n$, we have that
$$\pi(\theta_n \mid y_n, z_n) \propto \pi(\theta_n \mid z_n)\,\pi(y_n \mid \theta_n, z_n) = \pi(\theta_n \mid z_n)\,\pi(y_n \mid \theta_n).$$
Next observe that by construction, $z_n - \theta_n = \epsilon_{z,n} + \delta_{z,n} - \delta_{y,n} \sim \mathcal{N}(0, 2\sigma^2_\delta + \sigma^2_z)$, and so $\theta_n \mid z_n \sim \mathcal{N}(z_n, 2\sigma^2_\delta + \sigma^2_z)$. Since again by construction we have that $y_n \mid \theta_n \sim \mathcal{N}(\theta_n, \sigma^2_y)$, Gaussian conjugacy provides that $\theta_n \mid y_n, z_n \sim \mathcal{N}(\mathbb{E}[\theta_n \mid y_n, z_n], \mathrm{Var}[\theta_n \mid y_n, z_n])$, where
$$\mathrm{Var}[\theta_n \mid y_n, z_n] = \frac{1}{\sigma_y^{-2} + (2\sigma^2_\delta + \sigma^2_z)^{-1}} = \frac{\sigma^2_y(2\sigma^2_\delta + \sigma^2_z)}{\sigma^2_y + 2\sigma^2_\delta + \sigma^2_z}$$
and
$$
\begin{aligned}
\mathbb{E}[\theta_n \mid y_n, z_n] &= \mathrm{Var}[\theta_n \mid y_n, z_n]\big(\mathrm{Var}[\theta_n \mid z_n]^{-1}\mathbb{E}[\theta_n \mid z_n] + \mathrm{Var}[y_n \mid \theta_n]^{-1}y_n\big) \\
&= \frac{\sigma^2_y(2\sigma^2_\delta + \sigma^2_z)}{\sigma^2_y + 2\sigma^2_\delta + \sigma^2_z}\Big[(2\sigma^2_\delta + \sigma^2_z)^{-1}z_n + \sigma_y^{-2}y_n\Big] \\
&= \frac{2\sigma^2_\delta + \sigma^2_z}{2\sigma^2_\delta + \sigma^2_y + \sigma^2_z}\,y_n + \frac{\sigma^2_y}{2\sigma^2_\delta + \sigma^2_y + \sigma^2_z}\,z_n,
\end{aligned}
$$
as desired. Analogously, for the second model considered in Section 6.2, we find the posterior mean as
$$\theta^\circ(y) = \big[I_N + \sigma^2_y(2K + 2\sigma^2_\delta I_N + \sigma^2_z I_N)^{-1}\big]^{-1} y + \big[I_N + \sigma_y^{-2}(2K + 2\sigma^2_\delta I_N + \sigma^2_z I_N)\big]^{-1} z.$$
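Both posterior means are straightforward to implement from the expressions above. The sketch below is ours, with sig2_* denoting the variance parameters $\sigma^2_\delta$, $\sigma^2_z$, and $\sigma^2_y$:

```python
import numpy as np

# Our implementation of the posterior means derived above.
def theta_star(y, z, sig2_d, sig2_z, sig2_y):
    # First model: blocks are independent, so E[theta_n | y_n, z_n] applies coordinatewise.
    denom = 2 * sig2_d + sig2_y + sig2_z
    return (2 * sig2_d + sig2_z) / denom * y + sig2_y / denom * z

def theta_circ(y, z, K, sig2_d, sig2_z, sig2_y):
    # Second model: V = 2K + (2 sig2_d + sig2_z) I_N plays the role of Var[theta | z].
    N = len(y)
    V = 2 * K + (2 * sig2_d + sig2_z) * np.eye(N)
    term_y = np.linalg.solve(np.eye(N) + sig2_y * np.linalg.inv(V), y)
    term_z = np.linalg.solve(np.eye(N) + V / sig2_y, z)
    return term_y + term_z
```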
Additional dataset details. The data considered in this application are counts of police responses categorized as associated with violent and non-violent crimes in October 2018. These were obtained from opendataphilly.org. The observed data we model are the inverse hyperbolic sine transform of the number of recorded police responses per square mile. For all practical purposes, these values can be interpreted as log densities (see, e.g., Burbidge et al. (1988)).

H.3 Gaussian process kernel selection for estimation of ocean currents
We here provide additional details of the Gaussian process covariance functions used in Section 6.3. The first covariance function described, which incorporates covariation at two scales, is defined, for both the longitudinal and latitudinal components ($i \in \{1, 2\}$) and for each pair of buoys $n$ and $n'$, as
$$
\begin{aligned}
k\big(\theta^{(i)}_n, \theta^{(i)}_{n'}\big) ={}& \sigma^2_1 \exp\bigg\{-\bigg[\frac{(\mathrm{lat}_n - \mathrm{lat}_{n'})^2}{r^2_{1,\mathrm{lat}}} + \frac{(\mathrm{lon}_n - \mathrm{lon}_{n'})^2}{r^2_{1,\mathrm{lon}}} + \frac{(t_n - t_{n'})^2}{r^2_{1,t}}\bigg]\bigg\} \\
&+ \sigma^2_2 \exp\bigg\{-\bigg[\frac{(\mathrm{lat}_n - \mathrm{lat}_{n'})^2}{r^2_{2,\mathrm{lat}}} + \frac{(\mathrm{lon}_n - \mathrm{lon}_{n'})^2}{r^2_{2,\mathrm{lon}}} + \frac{(t_n - t_{n'})^2}{r^2_{2,t}}\bigg]\bigg\},
\end{aligned}
$$
where $\sigma^2_1$, $r_{1,\mathrm{lat}}$, $r_{1,\mathrm{lon}}$, and $r_{1,t}$ parameterize the mesoscale variation in currents, whereas $\sigma^2_2$, $r_{2,\mathrm{lat}}$, $r_{2,\mathrm{lon}}$, and $r_{2,t}$ parameterize the submesoscale variation. As in Lodise et al. (2020), the latitudinal and longitudinal components of $F$ are modeled as a priori independent. We chose these parameters by maximizing the marginal likelihood (Rasmussen & Williams 2006, Chapter 5) on an independent subset of the GLAD dataset. Estimates of the underlying currents are obtained as the posterior mean of $F$ under this model, which we take as the alternative, $\theta^*(y)$.

The second covariance function captures covariation among observations only at the mesoscale. In this case, the Gaussian process prior has covariance function
$$k\big(\theta^{(i)}_n, \theta^{(i)}_{n'}\big) = \sigma^2_1 \exp\bigg\{-\bigg[\frac{(\mathrm{lat}_n - \mathrm{lat}_{n'})^2}{r^2_{1,\mathrm{lat}}} + \frac{(\mathrm{lon}_n - \mathrm{lon}_{n'})^2}{r^2_{1,\mathrm{lon}}} + \frac{(t_n - t_{n'})^2}{r^2_{1,t}}\bigg]\bigg\} + \sigma^2_2\,\mathbb{1}[n = n'],$$
which maintains the same marginal variance but excludes submesoscale covariances. We take the posterior mean under this model as the default estimate $\hat{\theta}(y)$. See submesoscale GP c value.ipynb for additional details.
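For reference, the two covariance functions above can be implemented directly. The sketch below is ours; the argument names are our own, with s1 and s2 the signal variances and r1 and r2 the triples of length scales:

```python
import numpy as np

# Our sketch of the two covariance functions above. p and q are (lat, lon, t)
# triples for two buoy observations; r1 and r2 are matching length-scale triples.
def k_two_scale(p, q, s1, r1, s2, r2):
    d2_1 = sum(((a - b) / r) ** 2 for a, b, r in zip(p, q, r1))  # mesoscale distance
    d2_2 = sum(((a - b) / r) ** 2 for a, b, r in zip(p, q, r2))  # submesoscale distance
    return s1 * np.exp(-d2_1) + s2 * np.exp(-d2_2)

def k_mesoscale_only(p, q, s1, r1, s2, same_point):
    # Second model: same mesoscale term, white-noise term replacing the submesoscale one.
    d2_1 = sum(((a - b) / r) ** 2 for a, b, r in zip(p, q, r1))
    return s1 * np.exp(-d2_1) + s2 * float(same_point)
```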