Nonparametric hierarchical Bayesian quantiles
Luke Bornn
Department of Statistics and Actuarial Science, Simon Fraser University [email protected]
Neil Shephard
Department of Economics and Department of Statistics, Harvard University [email protected]
Reza Solgi
Department of Statistics, Harvard University [email protected]
May 12, 2016
Abstract
Here we develop a method for performing nonparametric Bayesian inference on quantiles. Relying on geometric measure theory and employing a Hausdorff base measure, we are able to specify meaningful priors for the quantile while treating the distribution of the data otherwise nonparametrically. We further extend the method to a hierarchical model for quantiles of subpopulations, linking subgroups together solely through their quantiles. Our approach is computationally straightforward, allowing for censored and noisy data. We demonstrate the proposed methodology on simulated data and an applied problem from sports statistics, where it is observed to stabilize and improve inference and prediction.
Keywords: Censoring; Hausdorff measure; Hierarchical models; Nonparametrics; Quantile.
Consider learning about β, the τ ∈ (0, 1) quantile of the random variable Z. This will be based on data D = {z_1, ..., z_n}, where we assume z_i, i = 1, ..., n, are scalars and initially that they are independent and identically distributed. We will perform nonparametric Bayesian inference on β given D. The importance of quantiles is emphasized by, for example, Parzen (1979, 2004), Koenker and Bassett (1978) and Koenker (2005). By solving this problem we will also deliver a nonparametric Bayesian hierarchical quantile model, which allows us to analyze data with subpopulations only linked through quantiles. The methods extend to censored and partially observed data.

In early work on Bayesian inference on quantiles, Section 4.4 of Jeffreys (1961) used a "substitute likelihood"

    s(β) = C(n, n_β) τ^{n_β} (1 − τ)^{n − n_β},    n_β = Σ_{i=1}^n 1(z_i ≤ β),

where C(n, k) denotes the binomial coefficient. See also Boos and Monahan.

Yu and Moyeed (2001) carried out Bayesian analysis of quantiles using a likelihood based on an asymmetric Laplace distribution for the regression residuals e_i = y_i − x_i′β (see also Koenker and Machado (1999) and Tsionas (2003)), L(D|β) = exp{−Σ_{i=1}^n ρ_τ(e_i)}, where ρ_τ(·) is the "check function" (Koenker and Bassett (1978)),

    ρ_τ(e) = |e|{(1 − τ)1(e < 0) + τ1(e ≥ 0)},    e ∈ R.    (1)

Here ρ_τ(e) is continuous everywhere, convex and differentiable at all points except e = 0. This Bayesian posterior is relatively easy to compute using mixture representations of Laplace distributions. Papers which extend this tradition include Kozumi and Kobayashi (2011), Li et al. (2010), Tsionas (2003), Kottas and Krnjajic (2009) and Yang et al. (2015). Unfortunately the Laplace distribution is misspecified and so typically yields inference which is overly optimistic. Yang et al. (2015) and Feng et al.
(2015) discuss how to overcome some of these challenges; see the related works by Chernozhukov and Hong (2003) and Müller (2013).

Closer to our paper is Hjort and Petrone (2007), who assume the distribution function of Z is a Dirichlet process with parameter aF, focusing on when a ↓ 0. Hjort and Walker (2009) write down nonparametric Bayesian priors on the quantile function. Our focus is on using informative priors for β, but our use of a non-informative prior for the distribution of Z aligns with that of Hjort and Petrone (2007).

Our paper is related to Bornn et al. (2016), who develop a Bayesian nonparametric approach to moment based estimation. Their methods do not cover our case; the differences are brought out in the next section. The intellectual root is similar though: the quantile model only specifies a part of the distribution, so we complete the model by using Bayesian nonparametrics.

Hierarchical models date back to Stein (1966), while linear regression versions were developed by Lindley and Smith (1972). Discussions of the literature include Morris and Lysy (2012) and Efron (2010). Our focus is on developing models where the quantiles of individual subpopulations are thought of as drawn from a common population-wide mixing distribution, but where all other features of the subpopulations are nonparametric and uncommon across the populations. The mixing distribution is also nonparametrically specified. There are some linkages with deconvolution problems, see for example Butucea and Comte (2009) and Cavalier and Hengartner (2009), but our work is discrete and not linear. It is more related to, for example, Robbins (1956), Carlin and Louis (2008), McAuliffe et al. (2006) and Efron (2013) on empirical Bayes methods.

Here we report a simple to use method for handling this problem, which scales effectively with the sample size and the number of subpopulations. The method extends to allow for censored data. Our hierarchical method is illustrated on an example drawn from sports statistics.

In Section 2 we discuss our modelling framework and how we define Bayesian inference on quantiles. Particular focus is on uniqueness and priors. A flexible way of building tractable models is developed. This gives an analytic expression for the posterior on a quantile. A Monte Carlo analysis is carried out to study the bias, precision and coverage of our proposed method, and also compares the results to those for sample quantiles using central limit theories and bootstraps. In Section 3 we extend the analysis by introducing a nonparametric hierarchical quantile model and show how to handle it using very simple simulation methods. A detailed study is made of individual sporting careers using the hierarchical model, borrowing strength across careers when the careers are short and data are limited. In Section 4 we extend the analysis to data which are censored and this is applied in practice to our sporting career example. Section 5 concludes, while an Appendix contains various proofs of results stated in the main body of the paper.
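Before turning to the formal setup, the check function (1) and the quantile-as-minimizer definition can be illustrated with a minimal sketch. This is our own illustrative code, not the paper's implementation; function names are assumptions.

```python
import numpy as np

def check_loss(e, tau):
    """Koenker-Bassett check function: rho_tau(e) = |e| * ((1-tau) 1{e<0} + tau 1{e>=0})."""
    e = np.asarray(e, dtype=float)
    return np.abs(e) * np.where(e < 0, 1 - tau, tau)

def empirical_quantile(z, tau):
    """tau-quantile of a sample as a minimizer of the average check loss.

    The empirical loss is piecewise linear with knots at the data points,
    so at least one minimizer is always a data point.
    """
    z = np.asarray(z, dtype=float)
    losses = [check_loss(z - b, tau).mean() for b in z]
    return z[int(np.argmin(losses))]
```

For instance, `empirical_quantile([1, 2, 3, 4, 5], 0.5)` returns the sample median 3.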
We use the conventional modern definition of the τ quantile β, that is

    β = argmin_b E{ρ_τ(Z − b)}.

To start, suppose Z has known finite support S = {s_1, ..., s_J}, and write

    Pr(Z = s_j | θ) = θ_j,    1 ≤ j ≤ J,

with θ = (θ_1, θ_2, ..., θ_{J−1})′ ∈ Θ_θ, and Θ_θ ⊆ ∆, where ∆ is the simplex, ∆ = {θ : ι′θ < 1, θ_j > 0}, and define θ_J = 1 − ι′θ, in which ι is a vector of ones. The function

    Ψ(b, θ) = E_θ{ρ_τ(Z − b)} = Σ_{j=1}^J θ_j ρ_τ(s_j − b)

is continuous everywhere, convex and differentiable at all points except when b ∈ S.

We define the "Bayesian nonparametric quantile" problem as learning from data the unknowns (β, θ′)′ ∈ Θ_{β,θ}, where Θ_{β,θ} ⊆ R × ∆ ⊂ R^J. Each point within Θ_{β,θ} is a pair (β, θ) which satisfies both the probability axioms and

    β = argmin_b Σ_{j=1}^J θ_j ρ_τ(s_j − b).

Here θ almost surely determines β; this will be formalized in Proposition 1.

Figure 1:
This plot shows the 0.4-quantile with support S = {−1, 0, 1}. Plotted are Ψ(b, θ) and its directional derivatives with respect to b, ∇Ψ(b, θ), for three values of θ. In the left and center panels the quantiles are 0 and 1, respectively, while in the right panel the optimization does not have a unique solution; uniqueness holds with probability 1, non-uniqueness with probability 0.
Example 1. Figure 1 sets τ = 0.4 and S = {−1, 0, 1}. In the left panel we plot Ψ(b, θ) and its directional derivatives with respect to b, ∇Ψ(b, θ); the resulting quantile is β = 0. In the center panel the value of θ implies β = 1. Here β is not unique iff θ_1 = 0.4 or θ_1 + θ_2 = 0.4, which are 0 probability events. An example of the latter case is shown in the right panel, where Ψ(b, θ) is minimized on the whole interval [0, 1].

Footnote: Figure 1 demonstrates that Bayesian nonparametric quantile estimation is not a special case of Bayesian nonparametric ψ-type M-estimators, and so not a special case of moment estimation. This means we are outside the framework developed by Bornn et al. (2016).

Footnote: Recall, for a generic function f(b), the corresponding directional derivative is ∇_v f(b) = lim_{h↓0} {f(b + hv) − f(b)}/h.

Proposition 1 formalizes the connection between β and θ.

Proposition 1
Without loss of generality, assume s_1 < · · · < s_J. Then β is unique iff τ ∉ {θ_1, θ_1 + θ_2, ..., θ_1 + · · · + θ_{J−1}}. If θ has a continuous distribution with respect to Lebesgue measure, then with probability 1, for each θ there is a unique quantile β ∈ S and, with probability 1,

    ∂β/∂θ′ = 0.    (2)

Proposition 1 means we can partition the simplex into J + 1 sets, ∆ = (∪_{k=1}^J A_k) ∪ N, where N is a zero Lebesgue measure set and the sets A_k = {θ ∈ ∆ : s_k = argmin_b Ψ(b, θ)}, 1 ≤ k ≤ J, contain all the values of θ which deliver the quantile β = s_k = argmin_b Ψ(b, θ). We write this compactly as β = t(θ), β ∈ S, θ ∈ ∆, with the corresponding set index k = k(θ), 1 ≤ k ≤ J, θ ∈ ∆, so that β = s_{k(θ)}.
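The mapping β = t(θ) and the uniqueness condition of Proposition 1 can be sketched numerically; this toy implementation (our own naming, assuming numpy) returns the minimizing support point together with a flag for the zero-probability non-unique case:

```python
import numpy as np

def quantile_from_theta(theta, support, tau):
    """Map theta to beta = t(theta) on a finite ordered support.

    Per Proposition 1, beta is unique iff tau is not hit exactly by a partial
    sum of theta; in that case beta = s_k for the smallest k with
    theta_1 + ... + theta_k > tau.
    """
    theta = np.asarray(theta, dtype=float)
    support = np.asarray(support, dtype=float)
    partial = np.cumsum(theta)
    # non-unique exactly when tau equals one of the first J-1 partial sums
    unique = not np.any(np.isclose(partial[:-1], tau))
    k = int(np.searchsorted(partial, tau, side="right"))  # first partial sum > tau
    return support[k], unique
```

With τ = 0.4 and S = {−1, 0, 1} this reproduces Example 1: θ = (0.2, 0.5, 0.3) gives β = 0, θ = (0.2, 0.1, 0.7) gives β = 1, and any θ with θ_1 + θ_2 = 0.4 is flagged as non-unique (the panel values of θ here are our own illustrative choices consistent with the example).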
Example 1 (continued). Figure 2 is a ternary plot showing all possible values of θ = (θ_1, θ_2)′ and θ_3 = 1 − θ_1 − θ_2, with the implied value of β overlaid for τ = 0.4. The values of θ which deliver distinct values of β are collected into the sets A_1 (where β = s_1), A_2 (where β = s_2) and A_3 (where β = s_3). The interior lines marking the boundaries between these sets are the zero measure events collected into N. The union of the disjoint sets A_1, A_2, A_3 and N makes up the simplex ∆.

The set of admissible pairs (β, θ) is denoted by Θ_{β,θ} ⊆ S × ∆. Now Θ_{β,θ} is a lower dimensional space as β = t(θ). Using the Hausdorff measure, we are able to assign measures to the lower dimensional subsets of S × ∆, and therefore we can define probability density functions with respect to Hausdorff measure on manifolds within S × ∆.

Footnote: If D = S then the empirical quantile is β̂ = argmin_b Σ_{j=1}^J ρ_τ(s_j − b), which is non-unique if τJ is an integer (e.g. if τ = 0.5, this happens whenever J is even).

Footnote: Assume E ⊆ R^n, d ∈ [0, +∞) and δ ∈ (0, +∞]. The Hausdorff premeasure of E is defined as

    H^d_δ(E) = v_d inf_{E ⊆ ∪_j E_j, diam(E_j) < δ} Σ_{j=1}^∞ {diam(E_j)/2}^d,

where v_d = Γ(1/2)^d / Γ(d/2 + 1) is the volume of the unit d-sphere and diam(E_j) is the diameter of E_j. H^d_δ(E) is a nonincreasing function of δ, and the d-dimensional Hausdorff measure of E is defined as its limit as δ goes to zero, H^d(E) = lim_{δ→0+} H^d_δ(E). The Hausdorff measure is an outer measure. Moreover H^n defined on R^n coincides with Lebesgue measure. See Federer (1969) for more details.

Figure 2: Ternary plot of θ_1, θ_2 and θ_3 = 1 − θ_1 − θ_2 and the implied quantiles β at level τ = 0.4, β ∈ {s_1, s_2, s_3}. A_k is the set of probabilities θ_1, θ_2 where β = s_k.

One approach to building a joint prior p(β, θ) is to place a prior on β ∈ S, which we write as p(β), β ∈ S, and then build a conditional prior density p(θ | β = s_k), θ ∈ A_k, recalling A_k ⊆ ∆. Then the joint density with respect to Hausdorff measure on Θ_{β,θ} is

    p(β, θ) = p(β) p(θ | β).

For the quantile problem, with probability one β = t(θ), so the "area formula" of Federer (1969) (see also Diaconis et al. (2013) and Bornn et al. (2016)) implies the marginal density for the probabilities is induced as

    p(θ) = p(β, θ),    β = t(θ),

as Proposition 1 shows that ∂β/∂θ′ = 0. Here the right hand side is the density of the prior with respect to Hausdorff measure defined on Θ_{β,θ}, while the left hand side is the implied density of the prior distribution of θ with respect to Lebesgue measure defined on the simplex ∆.

The model's likelihood is

    Π_{j=1}^J θ_j^{n_j},    n_j = Σ_{i=1}^n 1(z_i = s_j).

Then the posterior distribution of β, θ will be

    p(θ | D) = p(β = s_{k(θ)}, θ | D) ∝ p(β = s_{k(θ)}) p(θ | β = s_{k(θ)}) Π_{j=1}^J θ_j^{n_j},    β = t(θ).    (3)

This means that

    p(β = s_k | D) = ∫_{A_k} p(θ | D) dθ ∝ p(β = s_k) ∫_{A_k} p(θ | β = s_k) Π_{j=1}^J θ_j^{n_j} dθ.

p(θ | β) models

Assume f_∆(θ) is the density function of a continuous distribution on ∆, and define

    c_k = Pr_{f_∆}(β = s_k) = ∫_{A_k} f_∆(θ) dθ.

Then one way to build an explicit prior for p(θ | β) is to set

    p(θ | β = s_k) = {f_∆(θ)/c_k} 1_{A_k}(θ),    θ ∈ A_k.

Proposition 2 shows how to compute {c_k}.

Proposition 2
Here c_1 = 1 − Pr(θ_1 < τ), c_J = Pr(Σ_{j=1}^{J−1} θ_j < τ), and

    c_k = Pr(Σ_{j=1}^{k−1} θ_j < τ) − Pr(Σ_{j=1}^{k} θ_j < τ),    k = 2, ..., J − 1.

This conditional distribution can be combined with a fully flexible prior Pr(β = s_k) = b_k, where b_k > 0 for 1 ≤ k ≤ J and Σ_{k=1}^J b_k = 1. Returning to the general case, this implies the joint density

    p(θ) = p(β = s_{k(θ)}, θ) = {b_{k(θ)}/c_{k(θ)}} f_∆(θ),    (4)

which in turn means Pr(β = s_k) = ∫_{A_k} p(β, θ) dθ = b_k, the scientific marginal for β. Note that p(θ) is discontinuous at the set boundaries (that is, at the zero Lebesgue measure set N), and p(θ) ≠ f_∆(θ) unless b_k = c_k for all k.

From (3) the posterior distribution of β, θ will be

    p(β = s_{k(θ)}, θ | D) ∝ {b_{k(θ)}/c_{k(θ)}} f_∆(θ) Π_{j=1}^J θ_j^{n_j},    and    Pr(β = s_k | D) ∝ {b_k/c_k} ∫_{A_k} f_∆(θ) Π_{j=1}^J θ_j^{n_j} dθ.

The Dirichlet case is particularly convenient.

Dirichlet special case
Let f_∆ be the Dirichlet density, f_D(θ; α) = B(α)^{−1} Π_{j=1}^J θ_j^{α_j − 1}, where α = (α_1, ..., α_J) is the vector of positive parameters and B(α) is the multivariate beta function. Then c_k can be computed via Proposition 2 using the distribution function of

    θ^+_k ∼ Be(α^+_k, α^+_J − α^+_k),    where generically x^+_k = Σ_{j=1}^k x_j.

To mark their dependence on α, in the Dirichlet case we write c_k = c_k(α). We will refer to

    p(θ | β = s_k) = {f_D(θ; α)/c_k} 1_{A_k}(θ),    θ ∈ A_k,    (5)

as the density function of D_J(α, k), the Dirichlet distribution truncated to A_k.

This result can be used to power a simple prior to posterior calculation.
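Under the Dirichlet prior, the weights c_k(α) of Proposition 2 reduce to differences of Beta distribution functions of the partial sums θ^+_k. A small sketch (assuming scipy is available; the function name is ours):

```python
import numpy as np
from scipy.stats import beta

def c_weights(alpha, tau):
    """c_k(alpha) = Pr(beta = s_k) under theta ~ Dirichlet(alpha).

    Uses theta^+_k ~ Be(alpha^+_k, alpha^+_J - alpha^+_k), so that
    c_k = Pr(theta^+_{k-1} < tau) - Pr(theta^+_k < tau), with the conventions
    Pr(theta^+_0 < tau) = 1 and Pr(theta^+_J < tau) = 0.
    """
    a_plus = np.cumsum(np.asarray(alpha, dtype=float))
    B = np.concatenate(([1.0],
                        beta.cdf(tau, a_plus[:-1], a_plus[-1] - a_plus[:-1]),
                        [0.0]))
    return B[:-1] - B[1:]
```

For example, `c_weights([1, 1, 1], 0.5)` returns [0.25, 0.5, 0.25], and the weights always sum to one since they telescope.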
Proposition 3. When f_∆(θ) = f_D(θ; α), then

    Pr(β = s_k | D) = {1/C(α, n)} {c_k(α + n)/c_k(α)} Pr(β = s_k),

where n = (n_1, ..., n_J). Here C(α, n) is the normalizing constant, which is computed via enumeration, C(α, n) = Σ_{k=1}^J {c_k(α + n)/c_k(α)} b_k. Further,

    p(θ | D) = {1/C(α, n)} {Pr(β = s_{k(θ)})/c_{k(θ)}(α)} f_D(θ; α + n).

The Bayesian posterior mean or quantiles of the posterior can be computed by enumeration unless J is massive, in which case simulation can be used.

Footnote: So Pr(θ^+_k < τ) = I_τ(α^+_k, α^+_J − α^+_k) = B_k, in which I_τ(a, b) = B(τ; a, b)/B(a, b) is the regularized incomplete beta function and B(τ; a, b) = ∫_0^τ x^{a−1}(1 − x)^{b−1} dx is the incomplete beta function. When α^+_k and α^+_J − α^+_k are large, some care has to be taken in computing c_k. We have written c_k = B_{k−1} − B_k = B_k{exp(log B_{k−1} − log B_k) − 1}, so log c_k = log B_k + log{exp(log B_{k−1} − log B_k) − 1}. Now B(x; a, b) = F(a + b, 1; a + 1; x) a^{−1} x^a (1 − x)^b, where F is the Gauss hypergeometric function. Hence we can compute log c_k accurately.

Here D is simulated from the long right hand tailed z_i ~iid −log χ², so the τ-quantile is β_τ = −log{F⁻¹_{χ²}(1 − τ)}. The empirical quantile β̂_τ will be used to benchmark the Bayesian procedures. The distribution of β̂_τ will be computed using its limiting distribution

    √n (β̂_τ − β_τ) →_d N(0, τ(1 − τ)/f_z(β_τ)²)

and by bootstrapping. In the case of the limiting distribution, f_z(β_τ) has been estimated by a kernel density estimator, with normal kernel and Silverman's optimal bandwidth. We build two Bayesian estimators:

Figure 3:
The left hand side shows the prior distribution of β for the discrete method, and the right shows the corresponding posterior for the first replication. Notice the posterior has many small atoms marked in short green lines. These points originate from the prior and represent around one data point.

1. Discrete.
The support points are the equally spaced grid s_j = −10 + 50(j − 1)/(J − 1), j = 1, ..., J, and the prior is

    Pr(β_τ = s_k) ∝ exp{−λ|s_k − δ_τ|},    (6)

where λ = 0.1 and δ_τ = β_τ + γ_τ, with γ_τ > 0; we let γ_τ increase as τ deviates from 0.5. This prior is not centered at the true value of the quantile and is more contaminated for the tail quantiles. In particular, in our simulations we use γ values of 2.333 and 6 for the two quantile levels studied, and set α_j = 1/J. The prior (6), for τ = 0.5, is shown in Figure 3 together with the associated posterior for one replication of simulated data.

2. Data. The support S is the data (therefore J = n), α_j = 1/J, and the prior heights (6) sit on those J points of support (so the prior changes in each replication).

Table 1 reports the results from 25,000 replications, comparing the five different modes of inference. It shows the asymptotic distribution of the empirical quantile provides a poor guide when working within a thin tail even when n is quite large. In the center of the distribution it is satisfactory by the time n hits 40. The bootstrap performs poorly in the tail when n is tiny, but is solid when n is large.

Not surprisingly the bootstrap of the empirical quantile β̂ and the Bayesian method using support from the data are very similar. Assuming no ties, straightforward computations lead to

    Pr(β̂ = s_j) = F_B(⌈τJ⌉ − 1; J, (j − 1)/J) − F_B(⌈τJ⌉ − 1; J, j/J).

                     τ = 0.5                          tail quantile
n = 10
Bias       -0.157   0.152   0.345   0.206     1.245  -0.174   1.083  -0.167
√n·SE       2.221   2.119   2.155   2.136     7.909   5.087   4.764   5.206
RMSE        0.720   0.687   0.764   0.706     2.794   1.618   1.856   1.655
Coverage    0.913   0.943   0.936   0.897     0.805   0.645   0.932   0.638
n = 40
Bias       -0.039   0.038   0.085   0.054    -0.224   0.050   0.438   0.098
√n·SE       2.309   2.184   2.214   2.203     5.639   5.403   5.731   5.560
RMSE        0.367   0.347   0.360   0.353     0.919   0.856   1.007   0.885
Coverage    0.945   0.945   0.937   0.940     0.810   0.912   0.944   0.910
n = 160
Bias        0.020   0.010   0.021   0.014     0.072   0.014   0.102   0.025
√n·SE       2.358   2.258   2.266   2.262     6.130   5.650   5.767   5.700
RMSE        0.187   0.179   0.180   0.179     0.490   0.447   0.467   0.451
Coverage    0.955   0.948   0.946   0.953     0.922   0.944   0.952   0.947
n = 320
Bias       -0.003   0.002   0.005   0.003    -0.015   0.002   0.023   0.004
√n·SE       2.343   2.296   2.297   2.297     6.029   5.856   5.885   5.867
RMSE        0.093   0.091   0.091   0.091     0.239   0.231   0.234   0.232
Coverage    0.952   0.950   0.948   0.947     0.930   0.947   0.940   0.951
Table 1: Monte Carlo experiment using 25,000 replications and a highly biased prior. Coverage probability is based on a nominal 95% confidence or credible interval. Bayesian estimators are the posterior mean. RMSE denotes root mean square error. Boot denotes bootstrap; CLT is implemented using a kernel estimator for the asymptotic standard error.

Here F_B(·; n, p) is the binomial cumulative distribution function with size parameter n and probability of success p. Interestingly, for large J, this is a close approximation to c_j(α + n) when α is small. This connection will become more explicit in the next subsection.

The discrete Bayesian procedure is by far the most reliable, performing quite well for all n. It does have a large bias for small n, caused by the poor prior, but the coverage is encouraging. Overall, the Bayesian estimators perform well in small as well as moderate to large samples. The two Bayesian procedures have roughly the same properties.

Some interesting connections can be established by thinking of α as being small.
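The prior-to-posterior map of Proposition 3, which produces the Bayesian entries compared above, can be sketched by enumeration. The prior weights, Dirichlet parameters and counts below are illustrative choices of ours, not the paper's simulation design:

```python
import numpy as np
from scipy.stats import beta

def c_weights(alpha, tau):
    """c_k(alpha) of Proposition 2 under Dirichlet(alpha): Beta CDF differences."""
    a_plus = np.cumsum(np.asarray(alpha, dtype=float))
    B = np.concatenate(([1.0],
                        beta.cdf(tau, a_plus[:-1], a_plus[-1] - a_plus[:-1]),
                        [0.0]))
    return B[:-1] - B[1:]

def posterior_on_quantile(prior_b, alpha, counts, tau):
    """Proposition 3: Pr(beta = s_k | D) proportional to b_k c_k(alpha+n)/c_k(alpha)."""
    prior_b = np.asarray(prior_b, dtype=float)
    alpha = np.asarray(alpha, dtype=float)
    n = np.asarray(counts, dtype=float)
    w = prior_b * c_weights(alpha + n, tau) / c_weights(alpha, tau)
    return w / w.sum()

# Toy check: data concentrated on s_1 pulls the posterior on the median onto s_1.
post = posterior_on_quantile([1/3, 1/3, 1/3], [1.0, 1.0, 1.0], [10, 0, 0], tau=0.5)
```

With these illustrative inputs, nearly all posterior mass ends up on s_1, as it should when all observations equal the smallest support point.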
Proposition 4. Conditioning on the data, if α_k ↓ 0 and α_k/α_l → 1, then c_k(α) → 1/J, and

    c_k(α + n) → Σ_{j = n^+_{k−1}}^{n^+_k − 1} f_B(j; n − 1, τ),    k = 1, 2, ..., J,

where, for k = 0, 1, ..., n, f_B(k; n, p) = C(n, k) p^k (1 − p)^{n−k} is the binomial probability mass function with size parameter n and probability of success p, and n^+_0 = 0.

The reason n − 1, not n, appears in the limit of c_k(α + n) is that S only has n elements, so j runs from 0 to n − 1. The proposition means that if there are no ties in the data and D = S, then

    c_k(α + n) → f_B(k − 1; J − 1, τ),    k = 1, 2, ..., J,    Pr(β = s_k | D) → C(n)^{−1} f_B(k − 1; J − 1, τ) Pr(β = s_k).

Here C(n) is the normalizing constant, computed via enumeration, C(n) = Σ_{k=1}^J f_B(k − 1; J − 1, τ) b_k.

The result in Proposition 4 is close to, but different from, Jeffreys's substitution likelihood s(β) = f_B(k; J, τ), for s_k ≤ β < s_{k+1}, where s_0 = −∞ and s_{J+1} = ∞ (Jeffreys has n + 1 categories to choose from, not n, as he allows data outside the supposed S). s(β) is a piecewise constant, non-integrable function (which means it needs proper priors to make sense) in β ∈ R, while for us β ∈ S (and the posterior is always proper).

The prior and posterior distribution of β in the Bayesian bootstrap are Pr(β = s_k) = c_k(α) and Pr(β = s_k | D) = c_k(α + n), respectively. Therefore, Proposition 3 demonstrates that the choice of b_k = c_k(α) delivers the Bayesian bootstrap (here the results are computed analytically rather than via simulation). If a Bayesian bootstrap were run, each draw would be weighted by w_k = b_k/c_k(α) to produce a Bayesian analysis using a proper prior; w_k is the ratio of the priors and does not depend upon the data. Finally, Proposition 4 implies that as α ↓ 0, c_k(α) → J^{−1}. This demonstrates that, in the Bayesian bootstrap, the implied prior of β is the uniform discrete distribution on the support of the data. In many applications this is an inappropriate prior.
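The small-α binomial limit of Proposition 4 can be checked numerically in the no-ties case D = S, where c_k(α + n) converges to f_B(k − 1; J − 1, τ). The sketch below (our own code, assuming scipy) perturbs the unit counts by a tiny α:

```python
import numpy as np
from scipy.stats import beta, binom

def c_weights(alpha, tau):
    """c_k(alpha) under Dirichlet(alpha), via Beta CDFs of the partial sums."""
    a_plus = np.cumsum(np.asarray(alpha, dtype=float))
    B = np.concatenate(([1.0],
                        beta.cdf(tau, a_plus[:-1], a_plus[-1] - a_plus[:-1]),
                        [0.0]))
    return B[:-1] - B[1:]

# No ties and D = S: one observation on each of J support points, alpha near zero.
J, tau, eps = 11, 0.3, 1e-9
n = np.ones(J)
limit = binom.pmf(np.arange(J), J - 1, tau)   # f_B(k-1; J-1, tau)
approx = c_weights(eps + n, tau)              # c_k(alpha + n) with alpha ~ 0
```

Here `approx` matches `limit` to numerical precision; with unit counts the Beta CDF differences equal binomial probabilities exactly, so only the tiny eps perturbation separates the two.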
Remark 1. To simulate from

    p(θ | D) = {1/C(α, n)} {b_{k(θ)}/c_{k(θ)}(α)} f_D(θ; α + n),    b_k = Pr(β = s_k),

write m_k = b_k/{c_k(α) C(α, n)}, m′_k = b_k/c_k(α), M = max(m_1, ..., m_J) and M′ = max(m′_1, ..., m′_J). Now p(θ | D) ≤ M f_D(θ; α + n) for any θ. We can sample from p(θ | D) by drawing from Dirichlet(α + n) and accepting with probability m_{k(θ)}/M = m′_{k(θ)}/M′. The overall acceptance rate is 1/M. If the prior on β is weakly informative then m′_k ≈ 1 for each k, and so the acceptance rate m′_{k(θ)}/M′ ≈ 1.

If J is large and α ↓ 0 then, by a normal approximation to the binomial,

    log f_B(k − 1; J − 1, τ) ≈ −{(k − 1) − (J − 1)τ}² / {2(J − 1)τ(1 − τ)}

up to a constant, unless τ is in the tails or J is small. So the resulting trivial approximations to the main posterior quantities are

    Ê(β | D) = Σ_{j=1}^J w*_j s_j,    w*_k = w_k b_k / Σ_{j=1}^J w_j b_j,    w_k = exp[−{(k − 1) − (J − 1)τ}² / {2(J − 1)τ(1 − τ)}],

    Var̂(β | D) = Σ_{j=1}^J w*_j {s_j − Ê(β | D)}²,    F̂_{β|D}(β) = Σ_{j=1}^J w*_j 1(s_j ≤ β).

When the prior is flat, this is a kernel weighted average of the data where the weights are determined by the ordering of the data. So large weights are placed on data with ranks (k − 1)/(J − 1) close to τ. This is very close to the literature on kernel quantiles, e.g. Parzen (1979), Azzalini (1981), Yang (1985) and Sheather and Marron (1990).

Assume a population is indexed by i = 1, 2, ..., I subpopulations, and that our random variable Z again has known discrete support, S = {s_1, ..., s_J}. Then we assume within the i-th subpopulation

    Pr(Z = s_j | θ, i) = θ^(i)_j,    (7)

thus allowing the distribution to change across the subpopulations. Here θ^(i) = (θ^(i)_1, ..., θ^(i)_{J−1}), θ^(i)_J = 1 − ι′θ^(i), and θ = (θ^(1), ..., θ^(I)). We assume the data D = {Z_1, ..., Z_n} are conditionally independent draws from (7).
We assume that each time we see a datapoint we also see which subpopulation it comes from. The data from the i-th population will be written as D_i. For the i-th subpopulation, the Bayesian nonparametric τ quantile is defined as

    β_i = argmin_b Σ_{j=1}^J θ^(i)_j ρ_τ(s_j − b).

Collecting terms β = (β_1, β_2, ..., β_I)′, the crucial assumption in our model is that

    f(θ | β) = Π_{i=1}^I f(θ^(i) | β_i).

This says the distributions across subpopulations are conditionally independent given the quantiles. That is, the single quantiles are the only feature which is shared across subpopulations.

We assume β_i ∈ S, and the {β_i} are i.i.d. across i from the shared distribution Pr(β_i = s_j | i, π) = π_j, i = 1, 2, ..., I, where π = (π_1, ..., π_{J−1}) and π_J = 1 − ι′π. We write a prior on π as f(π). Then the prior on the hierarchical parameters is

    f(β, π) = f(π) f(β | π) = f(π) Π_{i=1}^I f(β_i | π).

This structured distribution will allow us to pool quantile information across subpopulations. Our task is to make inference on (β_1, β_2, ..., β_I)′ from D. When taken together, we call this a "nonparametric hierarchical quantile model". This can also be thought of as related to the Robbins (1956) empirical Bayes method, but here each step is nonparametric. By Bayes theorem,

    f(β, π | D) ∝ f(β, π) f(D | β, π).    (8)

We will access this joint density using simulation.

• Algorithm 1: β, π | D Gibbs sampler
1. Sample from Pr(β | D, π) = Π_{i=1}^I Pr(β_i | D_i, π).
2. Sample from f(π | D, β) = f(π | β).

In the Dirichlet case, we can sample from Pr(β_i | D_i, π) using Proposition 3. If f(π) is Dirichlet, then π | β ∼ Dirichlet(λ + ν), where ν = (ν_1, ..., ν_J), in which ν_j = Σ_{i=1}^I 1(β_i = s_j).

We illustrate the hierarchical model using a dataset of the number of runs (a non-negative integer) scored in each innings by the most recent (by debut) I = 300 English test players. "Tests" are international matches, typically played over 5 days. Here we look only at games involving the English national team. This team plays matches against Australia, Bangladesh, India, New Zealand, Pakistan, South Africa, Sri Lanka, West Indies and Zimbabwe. Batsmen can bat up to twice in each test, but some players fail to get to bat in an individual game due to the weather or due to the match situation. Some players are elite batsmen and score many runs; others specialize in other aspects of the game and have poor batting records without any runs.

The database starts on 14th December 1951 and ends on 22nd January 2016. Some of these players never bat, others have long careers, the longest of which we see in our database is 235 innings, covering well over 100 test matches. In test matches batsmen can continue their innings for potentially a very long time and so can accumulate very high scores. An innings can be left incomplete for a number of reasons, so the score is right-censored; such innings are marked as being "not out".
The statistical challenge is with the data.Batting records are full of heterogeneity, highly skewed, partially censored and heavy tailed data.It is a good test case for our methods.Interesting academic papers on the statistics of batting includes Kimber and Hansford (1993),which is a sustained statistical analysis of estimating the average performance of batsmen just usingtheir own scores. Elderton (1945) is a pioneering cricket statistics paper in the same spirit. Morerecent papers include Philipson and Boys (2015) and Brewer (2013).Our initial aim will be to make inference on the 0 . s j E ( : j D ) Posterior on median for 4 players s j P ( - s j j D ) : | DA KhanJC ButtlerKF BarringtonACS Pigott Figure 4:
The left hand side shows the posterior distribution of the population probabilities of the quantiles π_j = P(β = s_j), τ = 0.5. The right hand side shows the posterior distribution of the median of several players along with the posterior distribution of π. Notice in the case of Barrington there is only one innings which finished in the range 36 to 44 inclusive, which makes estimating the median unexpectedly hard (given how large a sample we have) and encourages the Bayesian method to aggressively shrink the estimator of the median.

The common support of the data for all the players is S = {0, 1, ..., 350}, therefore J = 351. The prior distribution of θ^(i) is a Dirichlet distribution with α = (α_1, ..., α_J), where α_j = 4α̃_j + 1/J, with α̃_j decaying exponentially in s_j and Σ_{j=1}^J α̃_j = 1. (The empirical probability mass function of batting scores of all English players in the matches started between 1930 and 1949 is approximately proportional to such an exponential; therefore our Dirichlet prior for θ is approximately centered around this empirical probability mass function, with a large variability.) We assume π ∼ Dirichlet(λ), where λ_j = λ̃_j + 1/J, in which λ̃_j is proportional to a Gaussian-shaped function of s_j for j = 1, ..., J, and Σ_{j=1}^J λ̃_j = 5. In the left hand side of Figure 4 we have depicted E(π | D) for τ = 0.5.
Batsman         β̃_1/2   Q_05   Q_95   β̂_1/2   n_i    Batsman          β̃_1/2   Q_05   Q_95   β̂_1/2   n_i
A Khan          12.8     1      27     –       0      CJ Tavare        17.4     13     25     19.5    56
ACS Pigott       7.8     4      19     6       2      PCR Tufnell       1.2      0      2     1       59
A McGrath       13.1     4      27     34      5      MS Panesar        1.6      0      4     1       64
AJ Hollioake     4.9     2      12     3       6      CM Old            8.4      7     11     9       66
JB Mortimore    16.7     9      19     11.5    12     JA Snow           5.5      4      8     6       71
DS Steele       18.8     7      38     43      16     DW Randall       14.6      9     19     15      79
PJW Allott       6.6     4      14     6.5     18     RC Russell       14.6     10     20     15      86
JC Buttler      17.4     13     27     13.5    20     MR Ramprakash    18.7     14     21     19      92
W Larkins       12.8     7      25     11      25     PD Collingwood   23.2     19     28     25      115
NG Cowans        4.1     3      7      3       29     RGD Willis        4.1      4      5     5       128
JK Lever         5.4     4      10     6       31     KF Barrington    31.0     25     46     46      131
M Hendrick       3.4     1      4      2       35     APE Knott        17.6     13     24     19      149
DR Pringle       7.8     4      9      8       50     IT Botham        20.7     15     27     21      161
C White         11.5     7      19     10.5    50     DI Gower         26.9     25     28     27      204
GO Jones        15.9     10     22     14      53     AJ Stewart       25.6     19     28     27      235
Table 2:
Estimated median batting scores, treating not outs as if they were completed innings (i.e. ignoring right censoring). The batsmen are ordered by sample size (i.e. the number of innings the batsman had). The table shows, for each batsman, the mean of the Bayesian posterior of the median given the data, β̃_1/2 = E(β_1/2 | D), the sample median β̂_1/2 and the sample size n_i. Q_05 and Q_95 are the estimates of the Bayesian 5% and 95% quantiles of the posterior distribution of the median, and so indicate how uncertain we are about the Bayesian estimator of the median. All the Bayesian quantities are estimated by simulation.

In the right hand side plot in Figure 4, the posterior distribution function of the median of scores for several players is compared with the posterior distribution function of π (the black curve). For the first player, A. Khan (the blue curve), no data are available as he never batted, and the distribution is indistinguishable from that for E(π | D). A.C.S. Pigott played two innings for England, scoring 4 and 8 not out. The light blue curve shows that even with just two data points a lot of the posterior mass on the median has moved to the left, but the median is very imprecisely estimated (the estimate of the median is 7.8, while E_{π|D}(β) = 12.9). J.C. Buttler's 20 actual scores were 85, 70, 45, 0, 59*, 13, 3*, 35*, 67, 14, 10, 73, 27, 7, 13, 11, 9, 12, 1, 42. His scores are not particularly heavy-tailed and so the median is reasonably well determined (the estimate of the median is 17.4). For K.F. Barrington the median is estimated at 31 but is surprisingly not well determined (with 95% credible region [25, 46]).
Figure 5:
Sample median (arrow nocks) and mean of the posterior distribution of the median (arrow heads) against the sample size for all players. The blue arrows indicate the estimates which were moved upwards, and the estimates which were moved down are shown by the red arrows. The dashed line is the expected value of β under E(π | D).

Figure 5 plots the Bayesian posterior mean of the median and the sample median β̂_1/2 against the batsman's sample size n_i. Blue arrows show that the Bayesian posterior mean of the median is below the sample median, that is, it is shrunk down. Red arrows are the opposite: the Bayesian estimator is above the sample median, so is moved upwards. The picture shows there is typically more shrinkage for small sample sizes. But also, high sample medians are typically shrunk more than low sample medians, and there are more medians which are moved up than down. All this makes sense: the data are highly skewed, so high scores can occur due to high medians or by chance. Hence until we have seen a lot of high scores, we should shrink a high median down towards a more common value.

Of interest is β_τ, the τ-th quantile, as a function of τ. Here we estimate that relationship pointwise, building a separate hierarchical model for each value of τ. The only change we will employ is to set λ̃_j ∝ exp{−(s_j − μ_τ)²/σ_τ²}, allowing μ_τ = 15 + 15Φ⁻¹(τ) and σ_τ = 15.
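Each pointwise fit runs Algorithm 1. The following is a minimal sketch of one Gibbs sweep under the Dirichlet specification; the inputs, sizes and function names are illustrative assumptions of ours, not the paper's data or code:

```python
import numpy as np
from scipy.stats import beta

rng = np.random.default_rng(1)

def c_weights(alpha, tau):
    """c_k(alpha) under Dirichlet(alpha), via Beta CDFs of the partial sums."""
    a_plus = np.cumsum(np.asarray(alpha, dtype=float))
    B = np.concatenate(([1.0],
                        beta.cdf(tau, a_plus[:-1], a_plus[-1] - a_plus[:-1]),
                        [0.0]))
    return B[:-1] - B[1:]

def gibbs(counts, alpha, lam, tau, iters=200):
    """Algorithm 1: alternate beta_i | D_i, pi (Proposition 3 with prior weights pi)
    and pi | beta ~ Dirichlet(lambda + nu), where nu_j = #{i : beta_i = s_j}."""
    counts = np.asarray(counts, dtype=float)   # I x J matrix of within-group counts
    I, J = counts.shape
    c0 = c_weights(alpha, tau)
    pi = rng.dirichlet(lam)
    beta_idx = np.zeros(I, dtype=int)
    draws = []
    for _ in range(iters):
        for i in range(I):                      # step 1: beta_i | D_i, pi
            w = pi * c_weights(alpha + counts[i], tau) / c0
            beta_idx[i] = rng.choice(J, p=w / w.sum())
        nu = np.bincount(beta_idx, minlength=J)  # step 2: pi | beta
        pi = rng.dirichlet(lam + nu)
        draws.append(beta_idx.copy())
    return np.array(draws), pi
```

Posterior summaries of each subpopulation quantile then follow from the retained `draws` of the support indices, e.g. posterior means of s[draws] column by column.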
Figure 6: The left hand side shows the posterior distribution of the population probabilities of the quantiles, π_j = Pr(β = s_j), for τ = 0.30; the right hand side shows the corresponding result for τ = 0.90. In each panel s_j is plotted against E(π_j|D).

Figure 6 shows E(π|D) for the two quantile levels τ = 0.30 and τ = 0.90. Notice, of course, how different they are, with a great deal of mass on low scores when τ = 0.30 and vastly more scatter for τ = 0.90. This is because even the very best batsmen fail with a substantial probability, frequently recording very low scores. In the right hand tail, the difference between the skill levels of the players is much more stark, with enormous scatter.

We now turn to individual players. The dashed blue line in the left hand side of Figure 7 shows the empirical quantile function for P.J.W. Allott, while the associated Bayesian quantile function E(β_τ|D) is plotted with a full blue line. The results are computed pointwise over a grid of values of τ. The Bayesian function is also shown with a central 90% interval around the estimate.

The right hand side shows the same object but for K.F. Barrington, who tended to score very highly and also played a great deal (his n_i is around 8 times larger than Allott's). In both players' cases the lower quantiles are very precisely estimated and not very different, but at higher quantile levels the uncertainty is material and the differences in level stretch out. Further, at these higher levels the 90% intervals are typically skewed, with a longer right hand tail.

The Bayesian quantile functions seem shrunk more for Barrington, which looks odd as Allott has a smaller sample size. But Barrington typically has much higher scores (and so more variable ones), so his quantiles are intrinsically harder to estimate and are more strongly shrunk. His exceptionalism is reduced by the shrinkage.
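A simplified, non-hierarchical version of such a pointwise quantile function, with central 90% intervals, can be sketched with the Bayesian bootstrap of Rubin (1981), which is the uncensored special case of the paper's Algorithm 3: reweight one player's innings with flat Dirichlet weights and recompute the weighted quantile. The innings listed here are hypothetical, and this sketch omits the hierarchical shrinkage discussed above.

```python
import random

def weighted_quantile(xs_sorted, probs, tau):
    """Generalized inverse: smallest x whose cumulative probability >= tau."""
    cum = 0.0
    for x, p in zip(xs_sorted, probs):
        cum += p
        if cum >= tau:
            return x
    return xs_sorted[-1]

def bb_quantile_posterior(scores, tau, draws=2000, rng=random.Random(1)):
    """Bayesian bootstrap draws of the tau-quantile: Dirichlet(1,...,1)
    reweighting of the observed scores, then recompute the quantile."""
    xs = sorted(scores)
    out = []
    for _ in range(draws):
        g = [rng.gammavariate(1.0, 1.0) for _ in xs]   # Gamma -> Dirichlet weights
        t = sum(g)
        out.append(weighted_quantile(xs, [x / t for x in g], tau))
    return sorted(out)

# hypothetical innings for one player
scores = [85, 70, 45, 0, 59, 13, 3, 35, 67, 14, 10, 73, 27, 7, 13, 11, 9, 12, 1, 42]
for tau in (0.1, 0.5, 0.9):
    d = bb_quantile_posterior(scores, tau)
    lo, hi = d[len(d) // 20], d[-(len(d) // 20)]   # central 90% interval
    print(tau, lo, hi)
```

As in the figures, the intervals widen markedly at the higher quantile levels, where only a few large scores carry the information.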
Figure 7:
The pointwise estimated quantile function for two cricketers: P.J.W. Allott and K.F. Barrington. These calculations ignore the impact of censoring. Horizontal lines denote 90% posterior intervals with 5% in each tail. The curve for Allott uses his 18 innings; Barrington had 131 innings.
For a moment we now leave the cricket example. We should note that we have so far ignored the fact that some innings were not completed and were marked "not out", a form of censoring. We now develop methods to overcome this deficiency.
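Not out innings are conventionally recorded with a trailing asterisk (e.g. the 59* in the scores listed earlier). Before treating them as right censored, one needs to separate the numeric score from the censoring flag. A minimal helper, with hypothetical input, might look like:

```python
def parse_innings(raw):
    """Split a list of cricket scores, where a trailing '*' marks a
    not-out (right censored) innings, into values and censoring flags."""
    values, censored = [], []
    for s in raw:
        s = s.strip()
        flag = s.endswith("*")
        values.append(int(s.rstrip("*")))
        censored.append(flag)
    return values, censored

# e.g. part of a career such as the one listed above: 59*, 13, 3*, 35*, 67
vals, cens = parse_innings(["59*", "13", "3*", "35*", "67"])
print(vals)   # [59, 13, 3, 35, 67]
print(cens)   # [True, False, True, True, False]
```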
Truncated data
Here we show how this methodology can be extended to models with truncated data. The probabilistic aspect of the model is unaltered: we assume the support S is sorted and known, and Pr(Z = s_j | θ) = θ_j. However, in addition to some fully observed data, D_1 = {z_1, ..., z_N}, there exist N′ additional data, D_2 = {s_{l_1}, ..., s_{l_{N′}}}, which we know have been right truncated. We assume the non-truncated versions of the data are independent over i, with U_i ≥ s_{l_i}, 1 ≤ i ≤ N′, U_i ∈ S, Pr(U_i = s_j | θ) = θ_j. We write U = {U_1, ..., U_{N′}}. Therefore our data are D = D_1 ∪ D_2.

Inference on (β, π) is carried out by augmenting it with U and employing a Gibbs sampler in order to draw from p(β, π, U | D). We implement this by adding a first step to Algorithm 1.

• Algorithm 2: β, π, U | D Gibbs sampler
1. Sample from Pr(U | β, D, π).
2. Sample from Pr(β | D, U, π).
3. Sample from f(π | β), returning to 1.

Sampling from U | (β = s_k, D) is not standard, but is also not difficult. We carry this out through data augmentation:

1. Sample from Pr(U | β, D, π) by:
   (a) Sample θ | (β = s_k, D) ∼ D_J(α + n, k).
   (b) Sample U | (β = s_k, θ, D).

Step 1(b) is straightforward, while 1(a) is the truncated Dirichlet defined in (5). Appendix C shows how to simulate from D_J(α, k) exactly. As a side remark, it is tempting to sample θ | U, D and U | θ, D, but this fails in practice; the reasons are described in detail in Appendix B.

A Bayesian bootstrap for the censored data

A Bayesian bootstrap algorithm can be developed to deal with the censored data (however, its extension to the hierarchical model is not straightforward, since priors on β cannot be incorporated in this algorithm). Independent draws from the Bayesian bootstrap posterior distribution can be obtained by the following algorithm.

• Algorithm 3: Bayesian bootstrap with censored data
1. Draw θ* ∼ Dirichlet(α + n).
2. For 1 ≤ i ≤ N′, draw U_i from {s_{l_i}, ..., s_J} with probability Pr(U_i = s_j) = θ*_j / Σ_{k=l_i}^J θ*_k; set n′_j = Σ_{i=1}^{N′} 1(U_i = s_j) and n′ = (n′_1, ..., n′_J).
3. Draw θ ∼ Dirichlet(α + n + n′). Set β = t(θ). Go to 1.

In cricket at least 9% of the scores in each team innings must be not out (at least one of the eleven batsmen is left not out), so right censoring is statistically important. Not outs are particularly important for weaker batsmen, who are often left not out at the end of the team's innings. In Section 3.2 we ignored this feature of batting; here we return to it to correct the results.

Figure 8 shows the estimated pointwise quantile function for Barrington and Allott, taking into account the not outs. Both are shifted upwards, particularly Allott in the right hand tail. However, Allott's right hand tail is not precisely estimated.

Figure 8: The censoring-adjusted pointwise estimated quantile function for two cricketers: P.J.W. Allott and K.F. Barrington. The solid lines are the estimates accounting for the censored observations, and the dashed lines are obtained by ignoring that they are censored. Horizontal lines denote 90% posterior intervals with 5% in each tail. The curve for Allott uses his 18 innings; Barrington had 131 innings.

Table 3 shows the Bayesian results for our selected 30 players, updating Table 2 to reflect the role of right censoring. Here n′_i denotes the number of not out, that is right censored, innings the player had. In many cases this is between 10% and 20% of the innings, but for some players it is far higher. R.G.D. Willis is the leading example, with 55 not outs in 128 innings. A leading bowler, he usually batted towards the end of the innings and was often left not out. His posterior mean of the median is inflated greatly by the statistical treatment of the censoring. Further, the interval between Q05 and Q95 is widened substantially. Other players are hardly affected, e.g. M.R. Ramprakash, who had 6 not outs in 92 innings.

Table 4 shows a ranking of players by the mean of the posteriors of the quantiles, at three different quantile levels. The rankings change greatly with the quantile level. For small levels we can think of this as being about consistency; for the median it is about typical performance; for the 90% quantile it is about the upside potential to bat long. A remarkable result is J.B. Bolus, who has a very high β_{0.3} quantile. His career innings were the following: 14, 43, 33, 15, 88, 22, 25, 57, 39, 35, 58, 67. He only played for a single year, but never really failed in a single innings. However, he never managed to put together a very long memorable innings, and this meant his Test career was cut short by the team selectors. They seem not to value reliability so highly.

Again K.F. Barrington is the standout batsman. He is very strong at all the different quantiles. Notice though that he still had a 30% chance of scoring 14 or less, which would be regarded by many cricket watchers as a failure. But once his innings was established his record was remarkably strong, typically playing long innings.

In this paper we provide a Bayesian analysis of quantiles by embedding the quantile problem in a larger inference challenge. This delivers quite simple ways of performing inference on a single quantile.
The frequentist performance of our methods is similar to that of the bootstrap. We extend the framework to introduce a hierarchical quantile model, where each subpopulation's distribution is modeled nonparametrically but linked through a nonparametric mixing distribution placed on the quantile. This allows non-linear shrinkage, adjusting to skewed and sparse data in an automatic manner.

This approach is illustrated by the analysis of a large database from sports statistics of 300 Test cricketers. Each person's batting performance is modeled nonparametrically and separately, but linked through a quantile which is drawn from a common distribution. This allows us to shrink each cricketer's performance, a particular advantage in cases where the careers are very short.

The modeling approach is extended to allow for truncated data. This is implemented using simulation based inference. Again this is illustrated in practice by looking at not outs in batting innings, where we think of the data as right censored.

Table 3: Estimated median batting scores. The sample median is compared with two Bayesian estimators, where β̃_{1/2} = E(β_{1/2}|D). n_i is the number of innings, n′_i denotes the number of not outs, which are treated as right censored data, and β̂_{1/2} is the empirical median. In the first model the not outs are assumed to be right censored observations; in the second model they are treated as if they were completed innings. Q05 and Q95 denote the estimated 5% and 95% points of the relevant posterior distribution.

                    Not outs censored           Ignoring censoring
Batsman             β̃½    Q05   Q95           β̃½    Q05   Q95         n_i   n′_i   β̂½
A Khan              14.7    4     30           12.7    2     27           0     0     –
ACS Pigott           9.7    4     27            7.8    4     19           2     1     4
A McGrath           14.3    4     31           13      4     27           5     0    34
AJ Hollioake         5.5    4     14            4.7    2     12           6     0     2
JB Mortimore        16.4    9     20           16.7    9     19          12     2    11
DS Steele           19.4    7     37           18.8    7     35          16     0    42
PJW Allott           7.8    4     14            6.7    4     14          18     3     6
JC Buttler          17.7   13     27           17.1   13     27          20     3    13
W Larkins           12.6    7     27           13.3    7     25          25     1    11
NG Cowans            5.9    4     10            4.1    3      7          29     7     3
JK Lever             6.7    4     11            5.3    4    8.5          31     5     6
M Hendrick           6.3    4     10            3.4    1      5          35    15     2
DR Pringle           8.3    7     10            7.7    4      9          50     4     8
C White             13.1    8     19           11.7    7     19          50     7    10
GO Jones            17.1   10     22           15.9   10     19          53     4    14
CJ Tavare           17.4  12.5    25           17.3   13     25          56     2    22
PCR Tufnell          4.8    1      9            1.2    0      2          59    29     1
MS Panesar           4      4      4            1.6    0      4          64    21     1
CM Old               8.9    7     13            8.4    7     11          66     9     9
JA Snow              7.9    4      9            5.5    4      8          71    14     6
DW Randall          14.6    9     19           14.5   10     19          79     5    15
RC Russell          16.2   9.5    24           14.6   12     20          86    16    15
MR Ramprakash       18.8   14     21           18.6   14     21          92     6    19
PD Collingwood      25     19     30           23.3   19     28         115    10    25
RGD Willis           8.5    7     10            4.1    4      5         128    55     5
KF Barrington       33.7   27     48           31     25     46         131    15    46
APE Knott           18.9   14     27           17.5   13     24         149    15    19
IT Botham           20.6   15     27           20.7   15     27         161     6    21
DI Gower            27.5   26     32           27     25     28         204    18    27
AJ Stewart          26.3   19    29.5          25.6   19     28         235    21    27

Table 4: Best 20 players ranked based on the mean of the posteriors of the quantiles, at three different quantile levels. The table has three blocks of columns (0.3 quantile, 0.5 quantile, 0.9 quantile), each listing rank, batsman, β̃_τ, Q05 and Q95.
References
Azzalini, A. (1981). A note on the estimation of a distribution function and quantiles by a kernel method. Biometrika 68, 326–328.

Boos, D. and J. F. Monahan (1986). Bootstrap methods using prior information. Biometrika 73, 77–83.

Bornn, L., N. Shephard, and R. Solgi (2016). Moment conditions and Bayesian nonparametrics. Unpublished paper: arXiv:1507.08645.

Brewer, B. J. (2013). Getting your eye in: A Bayesian analysis of early dismissals in cricket. Unpublished paper: School of Mathematics and Statistics, The University of New South Wales.

Butucea, C. and F. Comte (2009). Adaptive estimation of linear functionals in the convolution model and applications. Bernoulli 15, 69–98.

Carlin, B. P. and T. A. Louis (2008). Bayes and Empirical Bayes Methods for Data Analysis (3rd ed.). Chapman and Hall.

Cavalier, L. and N. W. Hengartner (2009). Estimating linear functionals in Poisson mixture models. Journal of Nonparametric Statistics 21, 713–728.

Chamberlain, G. and G. Imbens (2003). Nonparametric applications of Bayesian inference. Journal of Business and Economic Statistics 21, 12–18.

Chernozhukov, V. and H. Hong (2003). An MCMC approach to classical inference. Journal of Econometrics 115, 293–346.

Diaconis, P., S. Holmes, and M. Shahshahani (2013). Sampling from a manifold. In G. Jones and X. Shen (Eds.), Advances in Modern Statistical Theory and Applications. Institute of Mathematical Statistics.

Dunson, D. and J. Taylor (2005). Approximate Bayesian inference for quantiles. Journal of Nonparametric Statistics 17, 385–400.

Efron, B. (2010). Large-Scale Inference: Empirical Bayes Methods for Estimation, Testing, and Prediction. Cambridge University Press.

Efron, B. (2013). Empirical Bayes modeling, computation, and accuracy. Unpublished paper, Department of Statistics, Stanford University.

Elderton, W. (1945). Cricket scores and some skew correlation distributions. Journal of the Royal Statistical Society 108, 1–11.

Federer, H. (1969). Geometric Measure Theory. New York: Springer-Verlag.

Feng, Y., Y. Chen, and X. He (2015). Bayesian quantile regression with approximate likelihood. Bernoulli 21, 832–850.

Hjort, N. and S. Petrone (2007). Nonparametric quantile inference with Dirichlet processes. In V. Nair (Ed.), Advances in Statistical Modeling and Inference. Essays in Honor of Kjell A. Doksum, pp. 463–492. World Scientific.

Hjort, N. L. and S. G. Walker (2009). Quantile pyramids for Bayesian nonparametrics. The Annals of Statistics 37, 105–131.

Jeffreys, H. (1961). Theory of Probability. Oxford: Oxford University Press.

Kimber, A. C. and A. R. Hansford (1993). A statistical analysis of batting in cricket. Journal of the Royal Statistical Society, Series A 156, 443–455.

Koenker, R. (2005). Quantile Regression. Cambridge: Cambridge University Press.

Koenker, R. and G. Bassett (1978). Regression quantiles. Econometrica 46, 33–50.

Koenker, R. and J. Machado (1999). Goodness of fit and related inference processes for quantile regression. Journal of the American Statistical Association 94, 1296–1309.

Kottas, A. and M. Krnjajic (2009). Bayesian semiparametric modelling in quantile regression. Scandinavian Journal of Statistics 36, 297–319.

Kozumi, H. and G. Kobayashi (2011). Gibbs sampling methods for Bayesian quantile regression. Journal of Statistical Computation and Simulation 81, 1565–1578.

Lancaster, T. and S. J. Jun (2010). Bayesian quantile regression methods. Journal of Applied Econometrics 25, 287–307.

Lavine, M. (1995). On an approximate likelihood for quantiles. Biometrika 82, 220–222.

Lazar, N. A. (2003). Bayesian empirical likelihood. Biometrika 90, 319–326.

Li, Q., R. Xi, and N. Lin (2010). Bayesian regularized quantile regression. Bayesian Analysis 5, 1–24.

Lindley, D. V. and A. F. M. Smith (1972). Bayes estimates for the linear model. Journal of the Royal Statistical Society, Series B 34, 1–41.

McAuliffe, J. D., D. M. Blei, and M. I. Jordan (2006). Nonparametric empirical Bayes for the Dirichlet process mixture model. Statistics and Computing 16, 5–14.

Morris, C. N. and M. Lysy (2012). Shrinkage estimation in multilevel normal models. Statistical Science 27, 115–134.

Muller, U. (2013). Risk of Bayesian inference in misspecified models, and the sandwich covariance matrix. Econometrica 81, 1805–1849.

Parzen, E. (1979). Nonparametric statistical data modeling. Journal of the American Statistical Association 74, 105–121.

Parzen, E. (2004). Quantile probability and statistical data modeling. Statistical Science 19, 652–662.

Philipson, P. and R. Boys (2015). Who is the greatest? A Bayesian analysis of test match cricketers. Unpublished paper: New England Symposium on Statistics in Sports.

Robbins, H. (1956). An empirical Bayes approach to statistics. In Proceedings of the Third Berkeley Symposium on Mathematical Statistics and Probability, Volume 1, pp. 157–163. University of California Press.

Rockafellar, R. (1970). Convex Analysis. Princeton, New Jersey: Princeton University Press.

Rubin, D. B. (1981). The Bayesian bootstrap. Annals of Statistics 9, 130–134.

Sheather, S. J. and J. S. Marron (1990). Kernel quantile estimators. Journal of the American Statistical Association 85, 410–416.

Stein, C. M. (1966). An approach to the recovery of interblock information in balanced incomplete block designs. In Research Papers in Statistics (Festschrift J. Neyman), pp. 351–366. London: Wiley.

Tsionas, E. G. (2003). Bayesian quantile regression. Journal of Statistical Computation and Simulation 73, 659–674.

Yang, S. S. (1985). A smooth nonparametric estimator of a quantile function. Journal of the American Statistical Association 80, 1004–1011.

Yang, Y. and X. He (2012). Bayesian empirical likelihood for quantile regression. The Annals of Statistics 40, 1102–1131.

Yang, Y., H. J. Wang, and X. He (2015). Posterior inference in Bayesian quantile regression with asymmetric Laplace likelihood. International Statistical Review. Forthcoming.

Yu, K. and R. A. Moyeed (2001). Bayesian quantile regression. Statistics and Probability Letters 54, 437–447.
Appendix
A.1 Proof of Proposition 1
As Ψ(b, θ) is a convex function of b, it either has a unique minimizer on R or its optimal set is a closed interval [β_l, β_u], where s_1 ≤ β_l < β_u ≤ s_J (Rockafellar (1970)). In the latter case there exists β_m ∈ [β_l, β_u] \ S at which ∂Ψ(b, θ)/∂b exists and is equal to zero,

    ∂Ψ(b, θ)/∂b |_{b=β_m} = (1 − τ)τ′ − τ(1 − τ′) = 0,

where τ′ = Σ_{j=1}^k θ_j and k = max{j : s_j < β_m}. For a specific value of β_m this equality holds if τ′ = τ, that is, Σ_{j=1}^k θ_j = τ. This implies that the minimizer of Ψ(b, θ) is not unique if and only if τ ∈ {θ_1, θ_1 + θ_2, ..., θ_1 + ⋯ + θ_{J−1}}, and this is a zero measure event if θ is non-singular.

Now assume β* is the unique minimizer of Ψ(b, θ) for θ = θ*. This implies that the directional derivatives of Ψ(b, θ) in b are strictly positive at the optimal point: ∇_v Ψ(β*, θ*) > 0 for v ∈ {−1, 1}. However, ∇_v Ψ(b, θ) is an affine (and therefore continuous) function of θ,

    ∇_v Ψ(b, θ) = Σ_{j=1}^J [(1 − τ)θ_j v 1(s_j < b) − τθ_j v 1(s_j > b) + θ_j ρ_τ(−v) 1(s_j = b)],

therefore there exists an open ball centered at θ* with radius δ > 0, B_δ(θ*), such that for all v ∈ {−1, 1} and all θ ∈ B_δ(θ*), ∇_v Ψ(β*, θ) > 0. Hence, for any θ ∈ B_δ(θ*), the objective function Ψ(b, θ) has a unique minimizer at β*: ∀ θ ∈ B_δ(θ*), β* = argmin_b Ψ(b, θ), and this implies ∂β/∂θ′ |_{θ=θ*} = 0.

A.2 Proof of Proposition 2
For k = 1, A_1 = {θ : θ_1 > τ}, therefore c_1 = Pr(θ ∈ A_1) = Pr(θ_1 > τ). For 2 ≤ k ≤ J − 1, we have

    A_k = {θ : Σ_{j=1}^{k−1} θ_j < τ and Σ_{j=1}^k θ_j > τ} = {θ : Σ_{j=1}^{k−1} θ_j < τ} \ {θ : Σ_{j=1}^k θ_j ≤ τ}.

Since {θ : Σ_{j=1}^k θ_j ≤ τ} ⊂ {θ : Σ_{j=1}^{k−1} θ_j < τ}, then c_k = Pr(θ ∈ A_k) = Pr(Σ_{j=1}^{k−1} θ_j < τ) − Pr(Σ_{j=1}^k θ_j < τ). Finally, for k = J, A_J = {θ : Σ_{j=1}^{J−1} θ_j < τ}, and so c_J = Pr(Σ_{j=1}^{J−1} θ_j < τ).

A.3 Proof of Proposition 3
In the Dirichlet case the posterior distribution of (β, θ) is

    p(β = s_k, θ | D) ∝ {Π_{j=1}^J θ_j^{n_j}} {b_k 1_{A_k}(θ)/c_k(α)} f_D(θ; α) = {1/c(α + n)} {b_k 1_{A_k}(θ)/c_k(α)} f_D(θ; α + n),

where n = (n_1, ..., n_J) and c(α + n) = ∫_Δ f_D(θ; α + n) dθ. The right hand side integrates to C(α, n) = Σ_{k=1}^J c_k(α + n) b_k / c_k(α); hence the normalized posterior is p(β = s_k, θ | D) = m_k 1_{A_k}(θ) f_D(θ; α + n), where m_k = b_k / {C(α, n) c_k(α)}.

The posterior distribution of β can be found analytically. Recalling that for quantiles the area formula implies p(θ | D) = p(β, θ | D),

    Pr(β = s_k | D) = ∫_{A_k} p(θ | D) dθ = c_k(α + n) b_k / {C(α, n) c_k(α)}.

A.4 Proof of Proposition 4
Note that

    lim_{α,β→0} B(α, β) = lim_{α,β→0} Γ(α)Γ(β)/Γ(α + β) = lim_{α,β→0} {Γ(α + 1)Γ(β + 1)/Γ(α + β + 1)} (α + β)/(αβ) = lim_{α,β→0} (α + β)/(αβ),

    lim_{α,β→0} α B(τ; α, β) = lim_{α,β→0} τ^α (1 − τ)^β (1 + {(α + β)/(α + 1)}τ + {(α + β)(α + β + 1)/((α + 1)(α + 2))}τ² + ⋯) = 1.

Hence lim_{α,β→0} I_τ(α, β) = β/(α + β). Assume all the elements of α go to 0 at the same rate; then, writing α^+_k = α_1 + ⋯ + α_k,

    c_k(α) = I_τ(α^+_{k−1}, α^+_J − α^+_{k−1}) − I_τ(α^+_k, α^+_J − α^+_k) → lim_{ε↓0} [I_τ((k − 1)ε, (J − k + 1)ε) − I_τ(kε, (J − k)ε)] = (J − k + 1)/J − (J − k)/J = 1/J.

As α ↓ 0, θ^+_k →_L Beta(n^+_k, n − n^+_k), where n^+_k = n_1 + ⋯ + n_k, with n^+_0 = 0 and n^+_J = n. In that limit Pr(θ^+_k < τ) = I_τ(n^+_k, n − n^+_k). For τ ∈ (0, 1) and positive integers k and n, I_τ(k, n) = 1 − F_B(k − 1; n + k − 1, τ), where F_B(·; m, τ) denotes the Binomial(m, τ) distribution function and f_B its mass function. So, for 2 ≤ k ≤ J − 1,

    c_k = Pr(θ^+_{k−1} < τ) − Pr(θ^+_k < τ) = I_τ(n^+_{k−1}, n − n^+_{k−1}) − I_τ(n^+_k, n − n^+_k)
        = F_B(n^+_k − 1; n − 1, τ) − F_B(n^+_{k−1} − 1; n − 1, τ) = Σ_{k′=n^+_{k−1}}^{n^+_k − 1} f_B(k′; n − 1, τ),

    c_1 = 1 − Pr(θ^+_1 < τ) = F_B(n^+_1 − 1; n − 1, τ) = Σ_{k′=0}^{n^+_1 − 1} f_B(k′; n − 1, τ) = Σ_{k′=n^+_0}^{n^+_1 − 1} f_B(k′; n − 1, τ),

    c_J = Pr(θ^+_{J−1} < τ) = 1 − F_B(n^+_{J−1} − 1; n − 1, τ) = Σ_{k′=n^+_{J−1}}^{n−1} f_B(k′; n − 1, τ) = Σ_{k′=n^+_{J−1}}^{n^+_J − 1} f_B(k′; n − 1, τ).

If there are no ties, then n = J and z_{(1)} = s_1 < ⋯ < z_{(J)} = s_J, and the result holds.

B A basic simulator which does not work
The U = (U_1, ..., U_{N′}) will be treated as missing data, and the inference can be performed by sampling from π, θ, U | D. To do this we would need to sample from p(θ, U | D). One approach to sampling from this is using a Gibbs sampler.

• Algorithm: θ, U | D Gibbs sampler

1. Draw from θ | D, U ∼ Dirichlet(α + n + n′), where n′ = (n′_1, ..., n′_J) and n′_j = Σ_{i=1}^{N′} 1(U_i = s_j).
2. Draw from p(U | θ, D) = p(U | θ), with Pr(U_i = s_j | θ) = {θ_j / Σ_{k=l_i}^J θ_k} 1_{{s_{l_i},...,s_J}}(s_j), i = 1, 2, ..., N′.

This Gibbs sampler sometimes performs well. However, if some of the missing data are constrained to fall in a block with no other data, which we call "isolated missingness", then there are numerical difficulties. In the case of right censored data, a censored data point suffers from isolated missingness if s_{l_i} > max{z_1, ..., z_N}. Assume α is small and U_i = s_j. Then, for θ simulated in Step 1, with high probability we have θ_j / Σ_{k=l_i}^J θ_k ≈ 1, and so at Step 2 with high probability U_i = s_j again. The result is a highly correlated Gibbs chain and this form of simulation is highly likely to fail.

C Constrained Dirichlet sampling
Assume θ ∼ Dirichlet(α), and consider simulating from θ | β = s_k for k = 1, ..., J. This is equivalent to simulating from θ ∼ Dirichlet(α) 1_{A_k}(θ). Let D_J(α, k) denote the (J − 1)-dimensional Dirichlet distribution with parameter α = (α_1, ..., α_J), truncated to A_k, for k = 1, ..., J, with density p(θ) = f_D(θ; α) 1_{A_k}(θ)/c_k(α). Below we show how we can sample from D_3(α, k); once this is developed the generalization to J > 3 is straightforward.

• Algorithm: sampling from D_3(α, k)

1. A_1: Draw from θ ∼ Dirichlet(α) 1_{A_1}(θ) by: draw θ_1 from Beta(α_1, α_2 + α_3) 1_{(τ,1)}(θ_1); then draw θ_2 from (1 − θ_1)Beta(α_2, α_3).

2. A_2: Draw from θ ∼ Dirichlet(α) 1_{A_2}(θ) by: draw θ_1 from Beta(α_1, α_2 + α_3) 1_{(0,τ)}(θ_1); then draw θ_2 from (1 − θ_1)Beta(α_2, α_3) until θ_1 + θ_2 > τ. However, this rejection step could be inefficient. Now

    f(θ_1 | θ ∈ A_2) = {1/c_2(α)} f_B(θ_1; α_1, α_2 + α_3) [1 − F_B({τ − θ_1}/{1 − θ_1}; α_2, α_3)] 1_{(0,τ)}(θ_1),

therefore we can instead use the more reliable alternative: draw θ_1 from f(θ_1 | θ ∈ A_2); then draw θ_2/(1 − θ_1) from Beta(α_2, α_3) truncated to ({τ − θ_1}/{1 − θ_1}, 1).

3. A_3: Draw from θ ∼ Dirichlet(α) 1_{A_3}(θ) by drawing θ_3 from Beta(α_3, α_1 + α_2) 1_{(1−τ,1)}(θ_3); then draw θ_2 from (1 − θ_3)Beta(α_2, α_1).

• Algorithm: sampling from D_J(α, k) for J > 3
1. For k = 1: draw (θ_1, S, θ_J) from D_3((α_1, α^+_{J−1} − α^+_1, α_J), 1); then draw (θ_2, ..., θ_{J−1}) from S·Dirichlet(α_2, ..., α_{J−1}).
2. For 2 ≤ k ≤ J − 1: draw (S_1, θ_k, S_2) from D_3((α^+_{k−1}, α_k, α^+_J − α^+_k), 2); then draw (θ_1, ..., θ_{k−1}) from S_1·Dirichlet(α_1, ..., α_{k−1}); draw (θ_{k+1}, ..., θ_J) from S_2·Dirichlet(α_{k+1}, ..., α_J).
3. For k = J: draw (S, θ_{J−1}, θ_J) from D_3((α^+_{J−2}, α_{J−1}, α_J), 3); then draw (θ_1, ..., θ_{J−2}) from S·Dirichlet(α_1, ..., α_{J−2}).

D Sampling from truncated beta distribution
Here we simulate θ from Beta(α, β) truncated to the interval (L, U), where 0 ≤ L < U ≤ 1 (with L ≠ 0 or U ≠ 1). Inverse transform sampling will be numerically infeasible if F_B(L; α, β) and F_B(U; α, β) are both very close to 0 or 1. We distinguish four cases; all other cases can be transformed to one of these by Beta(α, β) =_d 1 − Beta(β, α).

• Algorithm: α < 1, β < 1, and U < 1. Draws from Beta(α, β) truncated to (L, U):

1. Draw u ∼ Uniform(0, 1) and set z* = (1/α) ln(L^α + (U^α − L^α)u).
2. Draw v ∼ Uniform(0, 1). If v ≤ ((1 − e^{z*})/(1 − U))^{β−1}, set z = z*; otherwise go to step 1.
3. Return θ = e^z.

This is a rejection sampler for z = ln θ, with proposal density f_Z(z) = α e^{αz}/(U^α − L^α) 1_{(ln L, ln U)}(z).

• Algorithm: α < 1, β > 1, and 0 < L. Draws from Beta(α, β) truncated to (L, U):

1. Draw u ∼ Uniform(0, 1) and set θ* = 1 − [(1 − L)^β − {(1 − L)^β − (1 − U)^β}u]^{1/β}.
2. Draw v ∼ Uniform(0, 1). If v ≤ (θ*/L)^{α−1}, set θ = θ*; otherwise go to step 1.

This is a rejection sampler for θ, with proposal density f(θ) = β(1 − θ)^{β−1}/{(1 − L)^β − (1 − U)^β} 1_{(L,U)}(θ).

• Algorithm: α > 1, β < 1. If U^{α−1}{(1 − L)^β − (1 − U)^β}/{β B(α, β)} ≤ 1, then generate θ from Beta(α, β) and accept if L ≤ θ ≤ U. Otherwise, if 0 < L, the following rejection algorithm is more efficient:

1. Draw u ∼ Uniform(0, 1) and set θ* = 1 − [(1 − L)^β − {(1 − L)^β − (1 − U)^β}u]^{1/β}.
2. Draw v ∼ Uniform(0, 1). If v ≤ (θ*/U)^{α−1}, set θ = θ*; otherwise go to step 1.

• Algorithm: α > 1, β > 1, and α/(α + β) < U. If L < α/(α + β), then generate θ from Beta(α, β) and accept if L ≤ θ ≤ U. Otherwise, define

    λ = {(β − 1)L − (α − 1)(1 − L)}/{L(1 − L)}.

The following returns a draw from Beta(α, β) truncated to (L, U):

1. Draw u ∼ Uniform(0, 1) and set θ* = L − (1/λ) ln[1 − {1 − e^{−λ(U−L)}}u].
2. Draw v ∼ Uniform(0, 1). If v ≤ (θ*/L)^{α−1} {(1 − θ*)/(1 − L)}^{β−1} e^{λ(θ*−L)}, set θ = θ*; otherwise go to step 1.

This rejection algorithm for θ uses the exponential proposal density f(θ) = λ e^{−λ(θ−L)}/{1 − e^{−λ(U−L)}} 1_{(L,U)}(θ).
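The first of the four cases above (α < 1, β < 1, U < 1) can be sketched in Python as follows. This is a transcription of the reconstructed algorithm with hypothetical parameter values, not the authors' own code.

```python
import math
import random

def trunc_beta_small(alpha, beta, L, U, rng=random.Random(0)):
    """Rejection draw from Beta(alpha, beta) truncated to (L, U), for the
    case alpha < 1, beta < 1 and U < 1 (first algorithm of Appendix D).
    Proposal: z = log(theta) with density proportional to exp(alpha*z) on
    (log L, log U); accept with probability ((1 - e^z)/(1 - U))**(beta - 1)."""
    assert 0 < L < U < 1 and alpha < 1 and beta < 1
    La, Ua = L ** alpha, U ** alpha
    while True:
        u = rng.random()
        z = math.log(La + (Ua - La) * u) / alpha       # inverse-CDF draw of z
        if rng.random() <= ((1.0 - math.exp(z)) / (1.0 - U)) ** (beta - 1.0):
            return math.exp(z)

# hypothetical parameters: Beta(0.5, 0.5) truncated to (0.2, 0.8)
draws = [trunc_beta_small(0.5, 0.5, 0.2, 0.8) for _ in range(2000)]
print(min(draws), max(draws))
```

Since the ratio of target to proposal is proportional to (1 − e^z)^{β−1}, which is maximized at z = ln U, the acceptance probability is bounded by one, as a valid rejection sampler requires.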