Convergence Rates for Bayesian Estimation and Testing in Monotone Regression
Electronic Journal of Statistics
Vol. 0 (2020) 1–8, ISSN: 1935-7524
Moumita Chakraborty
Department of Operations Research, North Carolina State University, Raleigh, NC 27695, U.S.A.
e-mail: [email protected]

and

Subhashis Ghosal∗
Department of Statistics, North Carolina State University, Raleigh, NC 27695, U.S.A.
e-mail: [email protected]
Abstract:
Shape restrictions such as monotonicity on functions often arise naturally in statistical modeling. We consider a Bayesian approach to the problem of estimation of a monotone regression function and testing for monotonicity. We construct a prior distribution using piecewise constant functions. For estimation, a prior imposing monotonicity of the heights of these steps is sensible, but the resulting posterior is harder to analyze theoretically. We consider a "projection-posterior" approach, where a conjugate normal prior is used, but the monotonicity constraint is imposed on posterior samples by a projection map onto the space of monotone functions. We show that the resulting posterior contracts at the optimal rate n^{-1/3} under the L_1-metric and at a nearly optimal rate under the empirical L_p-metrics for 0 < p ≤ 2. The projection-posterior approach is also computationally more convenient. We also construct a Bayesian test for the hypothesis of monotonicity using the posterior probability of a shrinking neighborhood of the set of monotone functions. We show that the resulting test has a universal consistency property and obtain the separation rate which ensures that the resulting power function approaches one.
Keywords and phrases:
Monotonicity, Posterior contraction, Bayesian testing, Projection-posterior.
1. Introduction
We consider the nonparametric regression model Y = f(X) + ε for a response variable Y with respect to a one-dimensional predictor variable X ∈ [0,1] (without loss of generality) and ε a mean-zero random error with finite variance σ². Instead of the more commonly imposed smoothness condition, f is assumed to be a monotone increasing function on [0,1]. We observe n replications (Y_1, X_1), ..., (Y_n, X_n), where the design points X_1, ..., X_n are either deterministic or are randomly sampled from a fixed distribution G. The error ε is assumed to be distributed independently of the predictor X.

∗ Research is partially supported by NSF grant number DMS-1916419.

The problem has been widely studied in the frequentist literature, and is commonly known as isotonic regression. Barlow and Brunk [5] obtained the greatest convex minorant (GCM) of a cumulative sum diagram as the least-squares estimator under the monotonicity constraint. The Pool-Adjacent-Violators Algorithm (PAVA) describes a method of successive approximation to the GCM, and is the most commonly used algorithm for isotonic regression (see Ayer et al. [2], Barlow et al. [4], or De Leeuw et al. [10]). Brunk [9] showed that the estimated value of the regression function at a point converges at the rate n^{-1/3}, and evaluated its asymptotic distribution. Durot [11] established the n^{-1/3} rate of convergence of the isotonic regression estimator under the L_1-metric.

A Bayesian approach to the monotone regression problem involves putting a prior on functions under the monotonicity constraint. Since step functions can approximate monotone functions, a natural approach is to put priors on step heights under the monotonicity constraint, and possibly also on the locations and the number of intervals. For smoother sample paths, higher-order splines can be used instead of the indicator functions of intervals.
Shively [21] used a mixture of constrained normal distributions as a prior for spline coefficients. Bayesian nonparametric methods have been developed also for other shape-constrained problems, such as monotone density estimation and the current status censoring model. Salomond [19] established the nearly minimax rate n^{-1/3} for a decreasing density using a mixture of uniform densities as a prior. Testing for monotonicity of a regression function has been studied in the frequentist literature by Akakpo [1], Hall and Heckman [16], Baraud et al. [3], Ghosal et al. [12] and Bowman et al. [8]. A Bayesian approach to testing monotonicity was proposed by Salomond [20].

A difficulty with the usual Bayesian approach to isotonic regression is that the monotonicity constraint on the coefficients makes both posterior computation and the study of posterior concentration with increasing sample size a lot more challenging. This is especially the case if the true regression function lies on the boundary of the set of monotone functions, since then the prior puts relatively less mass in the neighborhood of the true regression function. A very useful approach that can still utilize the conjugacy structure is provided by a "projection-posterior" distribution. In this approach, the monotonicity constraint on the step heights is initially ignored, so that they may be given independent normal priors, and hence the posterior distribution is also normal, allowing easy sampling and large-sample analysis of posterior concentration. Then a posterior distribution is directly induced by a projection map that projects a step function to the nearest monotone function in terms of the L_1-distance or some other metric. A similar idea based on a Gaussian process prior was used by Lin and Dunson [17] for monotone regression.
Bhaumik and Ghosal [6, 7] used this idea of embedding in an unrestricted space and then projecting a conjugate posterior in regression models driven by ordinary differential equations. In this paper, we pursue the projection-posterior approach and show that the resulting projection-posterior distribution concentrates at the optimal rate n^{-1/3} in terms of the L_1-distance. We obtain nearly optimal posterior concentration under an empirical L_p-distance for 0 < p ≤ 2. We also construct a Bayesian test for the hypothesis of monotonicity based on the posterior distribution of the difference between the unrestricted posterior sample and its projection. We show that the resulting test is universally consistent, in that the Type I error probability goes to zero and the power goes to one at any fixed alternative, regardless of smoothness. For a sequence of smooth alternatives, we also compute the needed separation from the null region to obtain high power. Our proposed test is similar in spirit to Salomond's [20] test in that both are based on the posterior probability of a slightly extended null region, but our use of the L_1-metric on the function, or the Hellinger metric on the density of Y, leads to the universal consistency.

The paper is organized as follows. In the next section, we formally introduce the modeling assumptions and the prior, and describe the projection-posterior approach. In Section 3, we present results on posterior contraction rates of the projection posterior distribution. In Section 4, we derive asymptotic properties of the proposed Bayesian tests. Proofs of the main results are given in Section 5 and those of the auxiliary results in Section 6.
2. Model, prior and projection posterior
The following notation will be used throughout the paper. Let I_m stand for the m × m identity matrix. By Z ∼ N_J(µ, Σ), we mean that Z has a J-dimensional normal distribution with mean µ and covariance matrix Σ. For a vector x, the Euclidean norm will be denoted by ‖x‖. The transpose of a vector x is denoted by x^T and that of a matrix A is denoted by A^T. If f is a function and H a measure, the L_p-norm of f is given by ‖f‖_{p,H} = (∫|f|^p dH)^{1/p} for 1 ≤ p < ∞, and the L_p-distance between two functions f and g is given by d_{p,H}(f, g) = ‖f − g‖_{p,H} for 1 ≤ p < ∞ and d_{p,H}(f, g) = ∫|f − g|^p dH for 0 < p < 1. The indicator function will be denoted by 1{·}. For sequences a_n and b_n, a_n ≲ b_n means that a_n/b_n is bounded, a_n ≍ b_n means that both a_n ≲ b_n and b_n ≲ a_n, and a_n ≪ b_n means that a_n/b_n → 0. For a random variable Y and a sequence of random variables X_n, X_n →_P Y means that X_n converges to Y in P-probability.

Let F and F_+ respectively denote the space of real-valued measurable functions and of monotone increasing functions on [0,1]. For K > 0, let F_+(K) = {f ∈ F_+ : |f| ≤ K}. For f : [0,1] → R and d a distance on F, let the projection of f on F_+ be the function f* that minimizes d(f, h) over h ∈ F_+. The topological closure of F_+ is denoted by F̄_+. The ε-covering number of a set A with respect to a metric d, denoted by N(ε, A, d), is the minimum number of balls of radius ε needed to cover A. Let G_n(x) = n^{−1} Σ_{i=1}^n 1{X_i ≤ x}, the empirical distribution of the predictors X_1, ..., X_n.

A prior distribution on the regression function f will be given by a random step function f(x) = Σ_{j=1}^J θ_j 1{x ∈ I_j}, x ∈ (0,1], where I_1, ..., I_J are disjoint intervals partitioning [0,1], given by I_j = (ξ_{j−1}, ξ_j], j = 1, ..., J − 1, and I_J = [ξ_{J−1}, ξ_J]. The knot points are 0 = ξ_0 < ξ_1 < ··· < ξ_{J−1} < ξ_J = 1. With a given set of J knots, the corresponding collection of step functions is denoted by F_J. The counts of these intervals are denoted by N_j = Σ_{i=1}^n 1{X_i ∈ I_j}, j = 1, ..., J. For the prior, J or ξ = (ξ_1, ..., ξ_{J−1}) or both may be given, or these may be distributed according to a prior. Depending on their choices, the following three types of prior distributions will be considered in this paper.

1. Type 1 prior: The number of steps J is deterministic (and will depend on the sample size n), and the knots are equidistant: ξ_j = j/J, j = 1, ..., J − 1.
2. Type 2 prior: The number of steps J is deterministic, and P((ξ_1, ..., ξ_{J−1}) = S) = 1/(n choose J−1) for every S ⊂ {X_1, ..., X_n} with #S = J − 1, that is, the knots are sampled randomly without replacement from the observed values of the predictor variables (only applicable for deterministic X with distinct values).
3. Type 3 prior: The knots are equidistant and the number of steps J is given a prior satisfying

exp[−b_1 j (log j)^{t_1}] ≤ Π(J = j) ≤ exp[−b_2 j (log j)^{t_2}]   (2.1)

for some b_1, b_2 > 0 and 0 ≤ t_2 ≤ t_1 ≤ 1.

Given σ and J, the coefficients θ_1, ..., θ_J are given independent normal priors θ_j | σ ∼ N(ζ_j, σ²λ_j), with B_1 < λ_j < B_2 for some B_1, B_2 > 0 and bounded |ζ_1|, ..., |ζ_J|. We write Λ = diag(λ_1, ..., λ_J), the diagonal matrix with entries λ_1, ..., λ_J. The Type 1 prior will be used to obtain optimal posterior contraction in the L_1-distance, the Type 2 prior for posterior contraction in terms of an empirical L_p-distance, while the Type 3 prior will be used for testing monotonicity against smooth alternatives of unspecified smoothness.

The variance parameter σ² is either estimated by maximizing the marginal likelihood, or is given an inverse-gamma prior σ² ∼ IG(β_1, β_2) with β_1, β_2 > 0. Let Y = (Y_1, ..., Y_n)^T, X = (X_1, ..., X_n)^T, D_n = (Y, X), ε = (ε_1, ..., ε_n)^T, B = ((1{X_i ∈ I_j})), an n × J matrix, and θ = (θ_1, ..., θ_J)^T. Thus the model can be written as Y = Bθ + ε, and the prior (given J and σ) as θ | J ∼ N_J(ζ, σ²Λ) with ζ = (ζ_1, ..., ζ_J)^T. Then θ | (D_n, J, σ, ξ) ∼ N_J((B^T B + Λ^{−1})^{−1}(B^T Y + Λ^{−1}ζ), σ²(B^T B + Λ^{−1})^{−1}), that is, the θ_j are a posteriori independent with

θ_j | (ξ, σ, J, D_n) ∼ N((N_j Ȳ_j + ζ_j/λ_j)/(N_j + 1/λ_j), σ²/(N_j + 1/λ_j)),   (2.2)

where Ȳ_j is the average of the Y_i with X_i ∈ I_j. The marginal distribution of the observations Y (given X and J, σ, ξ) is

Y | (σ, ξ, J, X) ∼ N_n(Bζ, σ²(BΛB^T + I_n)).   (2.3)

As the coefficients θ have not been restricted to the cone of monotone increasing values Q := {(q_1, ..., q_J) : q_1 ≤ q_2 ≤ ··· ≤ q_J}, the resulting regression function f = Σ_{j=1}^J θ_j 1_{I_j} may not be monotone. In order to comply with the monotonicity restriction, a sampled value of the function f from its posterior (obtained through the posterior sampling of θ) is projected on the set of monotone functions F_+ on [0,1] to obtain f* ∈ F_+ nearest to f with respect to some distance d. The induced distribution of f* will be called the projection-posterior distribution. It will be denoted by Π*_n and will be the basis of inference on the regression function f. By its definition, the projection-posterior distribution is restricted to F_+.

We also find that the projection f* of a step function f = Σ_{j=1}^J θ_j 1_{I_j} ∈ F_J is itself a step function f* = Σ_{j=1}^J θ*_j 1_{I_j} ∈ F_J, with θ*_1 ≤ ··· ≤ θ*_J. For the L_2(G_n)-distance, these values are obtained by the weighted isotonization procedure

minimize Σ_{j=1}^J N_j (θ_j − θ*_j)²  subject to  θ*_1 ≤ ··· ≤ θ*_J.   (2.4)

The optimizing values θ*_1, ..., θ*_J can be computed using the PAVA and can be characterized as the left-derivatives at the points n^{−1} Σ_{k=1}^j N_k of the greatest convex minorant of the graph of the line segments connecting the points {(0, 0), (N_1/n, N_1θ_1/n), ..., (Σ_{k=1}^J N_k/n, Σ_{k=1}^J N_k θ_k/n)} (cf. Lemma 2.1 of Groeneboom and Jongbloed [15]). The same solution is obtained even if the L_2(G_n)-distance is replaced by a member of a wider class of distances; see Theorem 2.1 of Groeneboom and Jongbloed [15].

We make one of the following design assumptions (DD) or (DR) on the predictor X, and the assumption (E) on the error variables.

Condition (DD) (Deterministic predictor). The predictor variable X is deterministic, assuming values X_1, ..., X_n, and the counts N_1, ..., N_J of the J equispaced intervals I_1, ..., I_J satisfy, for J → ∞, max{N_j : 1 ≤ j ≤ J}/n → 0 and sup{|G_n(x) − G(x)| : x ∈ [0,1]} = o(J^{−1}), where G has a positive and continuous density g on [0,1].

Condition (DR) (Random predictor). The predictor X is sampled independently from a distribution G, having a density g which is bounded and bounded away from zero on [0,1].

Condition (E) (True error distribution). The error variables ε_1, ..., ε_n are i.i.d.
sub-Gaussian with mean 0 and variance σ_0².

We denote the true value of the regression function by f_0, write the vector of function values at the observed points as F_0 = (f_0(X_1), ..., f_0(X_n)), and denote the corresponding true distribution by P_0. Let E_0(·) and Var_0(·) be the expectation and variance operators taken under the true distribution P_0.

The error variance σ² may be estimated by maximizing the marginal likelihood of σ². From (2.3), it follows that the marginal maximum likelihood estimate of σ² is given by

σ̂_n² = n^{−1}(Y − Bζ)^T (BΛB^T + I_n)^{−1} (Y − Bζ).   (2.5)

The plug-in posterior distribution of f is then obtained by substituting σ̂_n² for σ² in (2.2). If instead we equip σ² with the inverse-gamma prior σ² ∼ IG(β_1, β_2), then a fully Bayes procedure can be based on the posterior distribution

σ² | D_n ∼ IG(β_1 + n/2, β_2 + (Y − Bζ)^T (BΛB^T + I_n)^{−1} (Y − Bζ)/2).   (2.6)
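The conjugate posterior (2.2) and the weighted isotonization (2.4) make sampling from the projection posterior straightforward. The following sketch, a hypothetical illustration rather than the authors' code, draws θ from the unrestricted normal posterior and projects it onto the monotone cone by a weighted PAVA; the hyperparameter choices ζ_j = 0 and λ_j = 1, and all function names, are assumptions made for the example.

```python
import numpy as np

def weighted_pava(theta, weights):
    """Weighted isotonic projection (2.4): minimize sum_j w_j (theta_j - t_j)^2
    subject to t_1 <= ... <= t_J, via the Pool-Adjacent-Violators Algorithm."""
    vals, wts, sizes = [], [], []
    for v, w in zip(theta, weights):
        vals.append(v); wts.append(w); sizes.append(1)
        # Merge adjacent blocks while monotonicity is violated; each block
        # keeps its weighted mean value, total weight, and size.
        while len(vals) > 1 and vals[-2] > vals[-1]:
            w_new = wts[-2] + wts[-1]
            vals[-2:] = [(wts[-2] * vals[-2] + wts[-1] * vals[-1]) / w_new]
            wts[-2:] = [w_new]
            sizes[-2:] = [sizes[-2] + sizes[-1]]
    return np.repeat(vals, sizes)

def projection_posterior_draw(y, x, J, sigma, rng):
    """One draw from the projection posterior under a Type 1 prior
    (equidistant knots; illustrative hyperparameters zeta_j = 0, lambda_j = 1)."""
    bins = np.minimum((x * J).astype(int), J - 1)    # index of interval I_j
    N = np.bincount(bins, minlength=J)               # interval counts N_j
    S = np.bincount(bins, weights=y, minlength=J)    # sums of Y_i over I_j
    post_mean = S / (N + 1.0)                        # (N_j Ybar_j)/(N_j + 1/lambda_j)
    post_sd = sigma / np.sqrt(N + 1.0)
    theta = rng.normal(post_mean, post_sd)           # unrestricted draw from (2.2)
    return weighted_pava(theta, N + 1e-12)           # tiny weight guards empty intervals
```

Repeating `projection_posterior_draw` yields samples from Π*_n, from which posterior means and credible bands restricted to F_+ can be formed.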
3. Posterior contraction rates under monotonicity
To establish posterior contraction rates for f with unknown σ, we need to effectively control the range of values of σ. It will be shown in Lemma 6.2 that the maximum marginal likelihood estimator of σ² in the plug-in Bayes approach, and the marginal posterior distribution of σ² in the fully Bayes approach, are consistent for any f_0 ∈ F_+, and the convergence is also uniform over F_+(K) for any fixed K > 0. This allows us to treat σ as essentially known in studying the posterior contraction.

As mentioned in the last section, we impose monotonicity on f by projecting f on F_+ and use the projection-posterior distribution for inference. The following argument shows that the concentration property of the posterior at any monotone function is not weakened by this procedure. Let Π*_n stand for the projection-posterior distribution given by

Π*_n(B) = Π(f : f* ∈ B | D_n),  B ⊂ F,   (3.1)

where f* is the projection of f on F_+ with respect to some metric d on the space of regression functions. Then for the true regression function f_0 ∈ F_+ and ε > 0, we have that

Π*_n(d(f, f_0) > ε) ≤ Π(f : 2 d(f, f_0) > ε | D_n),   (3.2)

and hence the contraction rate of the unrestricted posterior is inherited by the projection posterior. To see this, note that d(f*, f) ≤ d(f_0, f) by the property of the projection. Hence, using the triangle inequality,

d(f*, f_0) ≤ d(f*, f) + d(f, f_0) ≤ d(f_0, f) + d(f, f_0) = 2 d(f, f_0).   (3.3)

For p ≥
1, the L_p-projection of a step function is easily computable, by algorithms similar to the PAVA (see Section 3.1 of De Leeuw et al. [10]).

L_1-metric

In this subsection, we derive the posterior contraction rate with respect to the L_1-metric. An important factor determining this rate is the approximation rate of monotone functions by step functions. For the L_1-metric, step functions with regularly placed knots are adequate for the optimal approximation rate (see Lemma 6.3), and hence it is sufficient to consider a Type 1 prior. In the following theorem, we derive the contraction rate at a monotone function in the L_1-metric by directly bounding posterior moments.

Theorem 3.1.
Let f_0 ∈ F_+, and assume that Condition (E) holds. Let the prior on f be of Type 1, with J → ∞ and J ≪ n. Let σ² be estimated using the plug-in Bayes approach or endowed with the inverse-gamma prior using a fully Bayes approach. Assume that either X is deterministic and Condition (DD) holds, or X is random and Condition (DR) holds. Then for ε_n = max{J^{−1}, (J/n)^{1/2}} and every M_n → ∞,

(a) E_0 Π*_n(‖f − f_0‖_{1,G_n} > M_n ε_n) → 0 for the fixed design;
(b) E_0 Π*_n(‖f − f_0‖_{1,G} > M_n ε_n) → 0 for the random design.

In particular, if we choose J ≍ n^{1/3}, the projection posterior contracts at the minimax rate ε_n = n^{−1/3}. Moreover, the convergence is uniform over F_+(K) for any K > 0.

Under Condition (DR), the L_1(G)-distance is equivalent to the usual L_1-metric on [0,1]. The condition on X in the theorem above is needed only to conclude, using Lemma 6.2, that the estimator (or the posterior) for σ² is consistent. The conclusion is only used to get an upper bound for σ. If instead we assume an upper bound for σ (and change the prior on σ² to comply with the bound, if the fully Bayes procedure is used), we can remove these conditions.

L_p-metric

When the metric under consideration is L_p with p >
1, step functions based on equidistant knots do not have the optimal approximation property. To restore this ability, we need to allow arbitrary knots (see Lemma 6.3) and put a prior on these. Then the theory of posterior contraction for general (independent, not identically distributed) observations of Ghosal and van der Vaart [13] can be applied, by computing the prior concentration rate near the truth and bounding the metric entropy of a suitable subset of the parameter space, called a sieve. However, due to their ordering requirement and the possibly very uneven allocation of the knots used for the construction of the optimal approximation, the concentration of the prior distribution of ξ near the values appearing in the optimal approximation may be low, and hence the posterior contraction rate may suffer. The problem can be avoided by choosing knots from the observed values of X when the predictor variable is deterministic and the empirical L_p-norm ‖f‖_{p,G_n} is used. Then the optimal rate (up to a logarithmic factor) can be obtained.

Theorem 3.2.
Let X be deterministic, assuming values X_1, ..., X_n. Let f_0 ∈ F_+ and the prior on f be of Type 2, with log J ≍ log n. Let ε_1, ..., ε_n be i.i.d. normal with mean zero and variance σ_0², which is estimated using the plug-in Bayes approach or is endowed with the inverse-gamma prior using a fully Bayes approach. Then for any 0 < p ≤ 2, E_0 Π*_n(‖f − f_0‖_{p,G_n} > M_n ε_n) → 0, where ε_n = max{√((J log n)/n), J^{−1}}. In particular, the best rate ε_n = (n/log n)^{−1/3} is obtained by choosing J ≍ (n/log n)^{1/3}. Moreover, the convergence is uniform over F_+(K) for any K > 0.

If instead of choosing J deterministically, we put a prior on J following (2.1), then the contraction rate is given by n^{−1/3}(log n)^{(5−3t_2)/6}.

Clearly, with a prior on J given by (2.1), the best rate (n/log n)^{−1/3} is obtained when t_1 = t_2 = 1. A Poisson (or a suitably truncated Poisson) prior meets the requirement. Again, Condition (DD) is used only to derive the consistency of the estimator (or the posterior) of σ², and the condition can be removed if σ is assumed to be bounded.

It would be interesting to obtain nearly optimal contraction rates for the continuous L_p-metric, but we do not know an appropriate prior on the knot locations that would allow sufficient prior concentration to yield the desired result. For the continuous L_p-metric, the weak approximation with equal intervals allows only a sub-optimal approximation rate J^{−1/p} (see Lemma 6.3), and consequently a suboptimal posterior contraction rate (n/log n)^{−1/(p+2)}.
4. Bayesian testing for monotonicity of f

A natural test for the hypothesis of monotonicity is given by the posterior probability of F_+: reject the hypothesis if Π(f ∈ F_+ | D_n) is smaller than 1/2, say. However, if the true regression function f_0 ∈ F_+ belongs to the boundary of F_+, then even if the posterior is consistent at f_0, the posterior probability Π(f ∈ F_+ | D_n) may be low, because a large part of a neighborhood of f_0 may fall outside F_+. In order to avoid such false rejections, one may quantify a test based on a discrepancy measure d(f, F_+) between f sampled from the posterior and the set of monotone functions F_+ (that is, a nonnegative function of f that vanishes exactly on F_+), or equivalently, d(f, f*), where f* is the projection of f on F_+. A reasonable test can be based on the posterior probability Π(F_+^{τ_n} | D_n), for a sequence τ_n → 0, of the τ_n-neighborhood F_+^{τ_n} = {f : d(f, F_+) < τ_n} of F_+. This approach was also pursued by Salomond [19, 20], with a discrepancy measure given by d(f, F_+) = max{(θ_j − θ_i) : 1 ≤ j ≤ i ≤ J} for f = Σ_{j=1}^J θ_j 1_{I_j} (with equidistant knots) and a cut-off τ_n = √((J log n)/n). This test has probability of Type I error going to zero and has high power against smooth alternatives, if appropriately separated from the null. However, the power of this test at a non-smooth alternative may not go to one. This prompts us to propose an alternative test, based on the L_1-distance as the discrepancy measure, which has the property of universal consistency, that is, the power at any fixed alternative goes to one.

Let H(α, L) be the Hölder space of α-smooth functions with Hölder norm bounded by L (see Definition C.4 of Ghosal and van der Vaart [14]).

Theorem 4.1.
Consider a Type 1 prior with J ≍ n^{1/3}. Let σ² be estimated using the plug-in Bayes approach or endowed with the inverse-gamma prior using a fully Bayes approach. Assume that X is random and Condition (DR) holds, and that the errors satisfy Condition (E). For d(f_1, f_2) = ∫|f_1 − f_2| dG, consider the test defined by φ_n = 1{Π(d(f, F_+) ≤ M_n n^{−1/3} | D_n) < γ}, where 0 < γ < 1 is a predetermined constant and M_n → ∞ is a fixed, slowly growing sequence. Then the following assertions hold.

(a) (Consistency under H_0): For any fixed f_0 ∈ F_+, E_0 φ_n → 0, and further the convergence is uniform over F_+(K).

(b) (Universal consistency): For any fixed f_0 integrable on [0,1] with f_0 ∉ F̄_+, E_0(1 − φ_n) → 0.

(c) (High power at converging smooth alternatives): For any 0 < α ≤ 1 and L > 0, sup{E_0(1 − φ_n) : f_0 ∈ H(α, L), d(f_0, F_+) > ρ_n(α)} → 0, where ρ_n(α) = C n^{−α/3} for some C > 0 if α < 1, and ρ_n(α) = C M_n n^{−1/3} for any C > 0 if α = 1.

In the above theorem, the L_1(G)-distance may be replaced by the L_1-distance under the Lebesgue measure, since under Condition (DR) these two metrics are equivalent. In this case, part (c) may be strengthened by replacing the Hölder space H(α, L) by the Sobolev space with (1, α)-Sobolev norm bounded by L (see Definition C.6 of Ghosal and van der Vaart [14]). Also, if G is replaced by the empirical distribution G_n (and assuming that Condition (DD) holds instead of Condition (DR) if X is deterministic), the conclusions in parts (a) and (c) will still hold; the proof is very similar. If σ has a known bound, then Condition (DD) or Condition (DR) is not needed.

The procedure involving the test φ_n is computationally simple, as it does not involve a prior on J. The algorithm for median isotonic regression (see Robertson and Wright [18] and De Leeuw et al. [10]) allows us to compute d(f, F_+) very efficiently. However, with a deterministic choice of J, the posterior contraction is not adaptive over classes of functions with different smoothness α. Therefore an order of separation n^{−α/3} (up to a logarithmic factor) is needed, which is larger than the optimal order n^{−α/(1+2α)} of separation for α <
1. Adaptation can, however, be restored by using a prior on J and letting the cut-off value for the discrepancy with F_+ depend on J, as in Salomond [20], if the class of regression functions is uniformly bounded.

Theorem 4.2. Let the prior on f be of Type 3 with J given a Poisson prior, and let σ be bounded and given a positive prior density with bounded support containing the true value σ_0. Assume that X ∼ G and G satisfies Condition (DR). Let φ_n = 1{Π(d(f, F_+) ≤ M √((J log n)/n) | D_n) < γ}, where d is the Hellinger distance between the densities p_f(y, x) = (2πσ²)^{−1/2} exp[−(y − f(x))²/(2σ²)] g(x) induced by f, 0 < γ < 1 is a predetermined constant, and M > 0 is a sufficiently large constant.

(a) (Consistency under H_0): For any fixed f_0 ∈ F_+, E_0 φ_n → 0, and the convergence is uniform over F_+(K).

(b) (Universal consistency): For any fixed f_0 integrable on [0,1] with f_0 ∉ F̄_+, E_0(1 − φ_n) → 0.

(c) (Adaptive power at converging smooth alternatives): For f_0 ∉ F_+, f_0 ∈ H(α, L), there exists C depending on α and L only such that sup{E_0(1 − φ_n) : f_0 ∈ H(α, L), d(f_0, F_+) > C(n/log n)^{−α/(1+2α)}} → 0.

In the theorem, G can be replaced by the uniform distribution in the definition of the test. In this case, the Hölder space H(α, L) in part (c) can be replaced by the Sobolev space with (2, α)-Sobolev norm bounded by L.

Unlike Theorem 4.1, the proof requires the application of the general theory of posterior contraction. The weaker Hellinger distance for separation is used so that a test required for the application of the theory is available automatically, without requiring the regression functions to be bounded by a constant, a condition that would rule out the conjugate normal prior needed in the proof. An alternative is to use the empirical L_1-distance and conclude parts (a) and (c) only, assuming that N_j ≍ n/J uniformly in j = 1, ..., J.
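To illustrate how a test of the above kind can be carried out in practice, the sketch below estimates the posterior probability Π(d(f, F_+) ≤ τ_n | D_n) by Monte Carlo from unrestricted draws of the conjugate posterior (2.2) and their isotonic projections, and rejects monotonicity when that probability falls below γ. This is an illustrative, assumption-laden sketch, not the authors' implementation: it takes ζ_j = 0, λ_j = 1, M_n = √(log n), and, for simplicity, uses the least-squares (PAVA) projection as a stand-in for the exact L_1 projection computed by median isotonic regression.

```python
import numpy as np

def pava(v, w):
    """Weighted least-squares isotonic projection via pool-adjacent-violators."""
    vals, wts, sizes = [], [], []
    for vi, wi in zip(v, w):
        vals.append(vi); wts.append(wi); sizes.append(1)
        while len(vals) > 1 and vals[-2] > vals[-1]:
            wn = wts[-2] + wts[-1]
            vals[-2:] = [(wts[-2] * vals[-2] + wts[-1] * vals[-1]) / wn]
            wts[-2:] = [wn]
            sizes[-2:] = [sizes[-2] + sizes[-1]]
    return np.repeat(vals, sizes)

def monotonicity_test(y, x, J, sigma, gamma=0.5, n_draws=500, seed=0):
    """Monte Carlo version of the test phi_n: reject monotonicity when the
    posterior probability of {d(f, F_+) <= tau_n} is below gamma."""
    n = len(y)
    rng = np.random.default_rng(seed)
    bins = np.minimum((x * J).astype(int), J - 1)   # equidistant intervals I_j
    N = np.bincount(bins, minlength=J).astype(float)
    S = np.bincount(bins, weights=y, minlength=J)
    post_mean = S / (N + 1.0)                       # posterior mean from (2.2), zeta_j = 0, lambda_j = 1
    post_sd = sigma / np.sqrt(N + 1.0)
    tau_n = np.sqrt(np.log(n)) * n ** (-1 / 3)      # M_n n^{-1/3} with illustrative M_n = sqrt(log n)
    below = 0
    for _ in range(n_draws):
        theta = rng.normal(post_mean, post_sd)      # unrestricted posterior draw
        theta_star = pava(theta, N + 1e-12)         # projection on the monotone cone
        d = np.sum(N * np.abs(theta - theta_star)) / n   # empirical L_1 discrepancy d(f, F_+)
        below += d <= tau_n
    return below / n_draws < gamma                  # True means: reject monotonicity
```

On simulated data, a clearly decreasing regression function keeps essentially no posterior mass in the τ_n-neighborhood of F_+ and is rejected, while a monotone truth keeps that posterior mass near one, in line with parts (a) and (b) above.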
5. Proofs of the main results
Proof of Theorem 3.1.
In view of (3.2), it is enough to obtain the contraction rate of the unrestricted posterior. We prove the result for the plug-in Bayes approach; the fully Bayes case can be dealt with similarly. From Lemma 6.2, we get a shrinking neighborhood U_n of σ_0 with P_0(σ̂ ∈ U_n) → 1. Hence for the purpose of the proof, we may assume that σ̂ ∈ U_n.

We first consider the case that X is deterministic. Let f_J = Σ_{j=1}^J θ_{0j} 1_{I_j} with θ_{0j} = N_j^{−1} Σ_{i: X_i ∈ I_j} f_0(X_i) for all 1 ≤ j ≤ J. By Lemma 6.3(a), ‖f_J − f_0‖_{1,G_n} ≲ J^{−1}, and the bound is also uniform for f_0 ∈ F_+(K). To complete the proof, we now show that

E_0 Π(‖f − f_J‖_{1,G_n} > M_n √(J/n) | D_n) → 0 for every M_n → ∞.   (5.1)

Since f = θ_j and f_J = θ_{0j} on I_j, we have ‖f − f_J‖_{1,G_n} = n^{−1} Σ_{j=1}^J N_j |θ_j − θ_{0j}|. Hence by the Cauchy–Schwarz inequality followed by Markov's inequality,

Π(‖f − f_J‖_{1,G_n} > M_n √(J/n) | D_n) ≲ (M_n² J)^{−1} Σ_{j=1}^J N_j E(|θ_j − θ_{0j}|² | D_n).   (5.2)

For 1 ≤ j ≤ J, we bound E(|θ_j − θ_{0j}|² | D_n) = Var(θ_j | D_n) + |E(θ_j | D_n) − θ_{0j}|², bound the expectation of both terms, and substitute in (5.2) to obtain the desired result. For the first term,

N_j Var(θ_j | D_n) ≤ sup{N_j σ²/(N_j + λ_j^{−1}) : σ ∈ U_n} ≲ 1.   (5.3)

For the second term, we bound E_0[N_j |E(θ_j | D_n) − θ_{0j}|²] as

E_0[N_j |(N_j Ȳ_j + ζ_j/λ_j)/(N_j + 1/λ_j) − N_j^{−1} Σ_{i: X_i ∈ I_j} f_0(X_i)|²] ≲ 1 + E_0 |N_j^{−1/2} Σ_{i: X_i ∈ I_j} (Y_i − f_0(X_i))|²,

using the boundedness of ζ_j and λ_j^{−1}; the second term in the last expression is bounded by σ_0² by the moment inequality.

For random predictors, we use the ‖·‖_{1,G}-distance, which involves another integration with respect to X_1, ..., X_n on the left side of (5.1).

Proof of Theorem 3.2.
Because of (3.2), it suffices to obtain the contraction rate of the unrestricted posterior. Since for 0 < p < 2 the L_p(G_n)-distance is dominated by the L_2(G_n)-distance, it suffices to prove the result for p = 2. We shall apply the general theory of posterior contraction (Ghosal and van der Vaart [14], Chapter 8) using the sieve

P_n = {f = Σ_{j=1}^J θ_j 1_{[ξ_{j−1}, ξ_j)} : ξ_1, ..., ξ_{J−1} ∈ {X_1, ..., X_n}, max_j |θ_j| ≤ n}.   (5.4)

Let p^{(n)}_{f,σ} denote the joint density of Y_1, ..., Y_n for a regression function f. We verify the conditions of Theorem 8.26 of Ghosal and van der Vaart [14] for ε_n = max{√((J log n)/n), J^{−1}}. Note that by Lemma 6.2, we can restrict σ to an arbitrarily small neighborhood of σ_0, so the test construction in Lemma 8.27 of Ghosal and van der Vaart [14] is applicable.

By direct calculations, the Kullback–Leibler divergence and the squared Kullback–Leibler variation are respectively equal to

K(p^{(n)}_{f_0,σ_0}; p^{(n)}_{f,σ}) = E_0 log(p^{(n)}_{f_0,σ_0}/p^{(n)}_{f,σ}) = (n/(2σ²)) ‖f − f_0‖²_{2,G_n} + (n/2)[σ_0²/σ² − 1 − log(σ_0²/σ²)],

V_{2,0}(p^{(n)}_{f_0,σ_0}; p^{(n)}_{f,σ}) = Var_0 log(p^{(n)}_{f_0,σ_0}/p^{(n)}_{f,σ}) = (n/2)(σ_0²/σ² − 1)² + n(σ_0²/σ⁴) ‖f − f_0‖²_{2,G_n}.

Therefore, for sufficiently small ε, there exists C_1 > 0 such that

B_{n,2}((f_0, σ_0), ε) := {(f, σ) : K(p^{(n)}_{f_0,σ_0}, p^{(n)}_{f,σ}) ≤ nε², V_{2,0}(p^{(n)}_{f_0,σ_0}; p^{(n)}_{f,σ}) ≤ nε²} ⊃ {(f, σ) : ‖f − f_0‖_{2,G_n} ≤ C_1 ε, |σ − σ_0| ≤ C_1 ε}.

By Lemma 6.3, there exists f_J = Σ_{j=1}^J θ_{0j} 1_{I_{0j}}, where I_{01}, ..., I_{0J} form an interval partition with knots {ξ_{0,1}, ..., ξ_{0,J−1}} ⊂ {X_1, ..., X_n} and ‖f_J − f_0‖_{2,G_n} ≲ ε_n. By the prior independence of f and σ, and because −log Π(|σ − σ_0| ≤ Cε_n) ≲ log(1/ε_n) ≲ log n, it suffices to lower bound

Π(‖f − f_J‖_{2,G_n} ≤ C_1 ε_n) = Π(Σ_{j=1}^J N_j(θ_j − θ_{0j})² ≤ C_1² nε_n² | ξ = ξ_0) Π(ξ = ξ_0) ≥ Π(∩_{j=1}^J {|θ_j − θ_{0,j}| ≤ C_1 ε_n}) / (n choose J−1),

since Σ_{i=1}^n (f(X_i) − f_J(X_i))² = Σ_{j=1}^J N_j(θ_j − θ_{0j})² and Σ_{j=1}^J N_j = n. The last expression is at least of the order (C_2 ε_n)^J n^{−(J−1)} for some C_2 > 0. Putting these together, we have −log Π(B_{n,2}((f_0, σ_0), ε_n)) ≲ J[log(1/ε_n) + log J] ≲ J log n ≲ nε_n², fulfilling the condition of prior probability concentration needed for the posterior contraction rate ε_n.

Observe that the metric entropy log N(ε_n, P_n, ‖·‖_{p,G_n}) of the sieve P_n in (5.4) is bounded above by a multiple of J log(n/ε_n) ≲ J log n ≲ nε_n². Finally, the prior probability Π(P_n^c) of the complement of the sieve P_n is bounded by a multiple of J e^{−n²/2} ≪ e^{−cnε_n²} for any c > 0, establishing condition (8.33) of Ghosal and van der Vaart [14]. This establishes the rate ε_n = max{√((J log n)/n), J^{−1}} when J is chosen deterministically. Clearly, the best choice is J ≍ (n/log n)^{1/3}, giving the nearly optimal rate (n/log n)^{−1/3}.

When J is given a prior, to lower bound Π(B_{n,2}((f_0, σ_0), ε)), we intersect the set with {J = J_0}, where J_0 ≍ (n/log n)^{1/3}. This gives an additional factor e^{−b_1 J_0 (log J_0)^{t_1}}, which is absorbed in e^{−c n ε̄_n²} by adjusting the constant, for a pre-rate ε̄_n = (n/log n)^{−1/3}, because t_1 ≤ 1. Modify the sieve in (5.4) by intersecting with {J ≤ J_1}, where J_1 is to be determined. The prior probability of the complement P_n^c then contributes an extra factor, a constant multiple of e^{−b_2 J_1 (log J_1)^{t_2}}, to J_1 e^{−n²/2}. To obtain the final rate, we need to choose J_1 such that J_1 (log n)^{t_2} exceeds a sufficiently large multiple of n ε̄_n², and then the rate is given by √((J_1 log n)/n) = n^{−1/3}(log n)^{(5−3t_2)/6}.

Proof of Theorem 4.1. (a) Let f_0 ∈ F_+. Using the definition of the projection,

E_0 Π(‖f − f*‖_{1,G} > M_n n^{−1/3} | D_n) ≤ E_0 Π(‖f − f_0‖_{1,G} > M_n n^{−1/3} | D_n) → 0

for J ≍ n^{1/3} by Theorem 3.1. It then follows that E_0 φ_n = P_0(Π(d(f, F_+) ≤ M_n n^{−1/3} | D_n) < γ) →
0. Further, the convergence is uniform over f ∈ F + ( K )for any K > f / ∈ ¯ F + be fixed and integrable. Using the properties of the pro-jection, d ( f , F + ) = k f − f ∗ k ,G is bounded by k f − f ∗ k ,G , which, by thetriangle inequality, is further bounded above by k f − f k ,G + k f − f ∗ k ,G = k f − f k ,G + d ( f, F + ) . hakraborty and Ghosal/Bayesian Monotone Regression This leads to d ( f, F + ) ≥ d ( f , F + ) − k f − f k ,G , and hence Π( d ( f, F + ) ≤ M n n − / (cid:12)(cid:12) D n ) ≤ Π( k f − f k ,G + M n n − / ≥ d ( f , F + ) (cid:12)(cid:12) D n ) . Let θ j = R I j f dG/G ( I j ), 1 ≤ j ≤ J . Then as shown in the proof of Theorem3.1, Π( k f − f J k ,G > M n p J/n (cid:12)(cid:12) D n ) → P
0, and hence for J ≍ n / , we haveΠ( k f − f J k ,G > M n n − / (cid:12)(cid:12) D n ) → P
0. Next, since f is integrable, by themartingale convergence theorem, k f − f J k ,G →
0. henceE Π( k f − f k ,G + M n n − / ≥ d ( f , F + ) | D n ) ≤ E Π (cid:16) k f − f J k ,G ≥ d ( f , F + ) − k f J − f k ,G − M n n − / (cid:12)(cid:12) D n (cid:17) → d ( f , F + ) is fixed and positive. This implies that the probability of Type2 error P (Π( d ( f, F + ) ≤ M n n − / | D n ) ≥ γ ) → f / ∈ F + and f ∈ H ( α, L ) such that d ( f , F ) ≥ ρ n ( α ). Consider thestep function f J of f as in part (b). By a well-known fact from approxima-tion theory, we have that k f − f J k ,G ≤ C ( L ) J − α for some constant C ( L )depending only on L . For instance, the bound follows from de Boor [ ? ] as stepfunctions with equidistant points are B-splines of order 1. Hence for J ≍ n / ,by we have Π( k f − f k ,G > M n n − / + C ( L ) n − α/ | D n ) → P
0, uniformly forall f ∈ H ( α, L ). Thus d ( f, F + ) is d ( f, f ∗ ) ≥ d ( f , f ∗ ) − d ( f, f ) ≥ d ( f , F + ) − d ( f, f ) ≥ ρ n ( α ) − d ( f, f ) , so thatΠ( d ( f, F + ) ≤ M n n − / | D n ) ≤ Π( k f − f k ,G ≥ ρ n ( α ) − M n n − / | D n ) → P α < ρ n ( α ) − M n n − / ≥ M n n − / + C ( L ) n − α/ for C > C ( L ), while for α = 1, ρ n ( α ) − M n n − / ≥ M n n − / + C ( L ) n − α/ for C >
1; the last follows because M n → ∞ . Proof of Theorem 4.2.
Let $f_0$ be a bounded, measurable true regression function (irrespective of monotonicity or smoothness). For a given $J$, consider $f_{0J} = \sum_{j=1}^{J} \theta_{0j} \mathbb{1}_{I_j}$ with $\theta_{0j} = \int_{I_j} f_0 \, dG / G(I_j)$, $j = 1, \ldots, J$. First, we show that for a given $\gamma' > 0$ and sufficiently large $M$,
$$\mathrm{E}_0 \Pi\big(\|f - f_{0J}\|_{2,G} \ge M \sqrt{(J \log n)/n},\ J \le J_n \,\big|\, \mathbb{D}_n\big) < \gamma', \quad (5.5)$$
provided that $\log J_n \asymp \log n$. We write the expression inside the expectation as
$$\sum_{J=1}^{J_n} \Pi(J \,|\, \mathbb{D}_n)\, \Pi\Big(\sum_{j=1}^{J} (\theta_j - \theta_{0j})^2 G(I_j) \ge M^2 J (\log n)/n \,\Big|\, \mathbb{D}_n\Big), \quad (5.6)$$
and bound
$$\Pi\Big(\sum_{j=1}^{J} (\theta_j - \theta_{0j})^2 G(I_j) \ge M^2 J (\log n)/n \,\Big|\, \mathbb{D}_n\Big) \le \frac{n \sum_{j=1}^{J} G(I_j) \big[\mathrm{Var}(\theta_j \,|\, \mathbb{D}_n) + (\mathrm{E}(\theta_j \,|\, \mathbb{D}_n) - \theta_{0j})^2\big]}{M^2 J \log n}. \quad (5.7)$$
In view of Condition (DR), the $G(I_j)$ are of the order $1/J$, and by Lemma 6.1, the $N_j$ are of the order $n/J$ in probability uniformly in $j = 1, \ldots, J$. Under the boundedness assumption on the prior parameters and the sampling variance, $\mathrm{Var}(\theta_j \,|\, \mathbb{D}_n) \lesssim 1/N_j \lesssim J/n$ with high probability, from the standard expressions for the normal-normal conjugate setting (see the proof of Theorem 3.1).

To estimate $(\mathrm{E}(\theta_j \,|\, \mathbb{D}_n) - \theta_{0j})^2$, with $\bar Y_j$ standing for $N_j^{-1} \sum_{i: X_i \in I_j} Y_i$ and $\bar\epsilon_j$ standing for $N_j^{-1} \sum_{i: X_i \in I_j} \epsilon_i$, we first observe that $|\bar\epsilon_j|^2 \lesssim N_j^{-1} \log n \lesssim (J \log n)/n$ with high probability. Here we have used the maximal inequality based on the squared-exponential Orlicz norm (see Lemma 2.2.2 of van der Vaart and Wellner [22]) and the fact that the number of variables in $\{\bar\epsilon_j : j \le J \le J_n\}$ is at most polynomial in $J_n$. By the same argument and the boundedness of $f_0$, we also have
$$\Big| N_j^{-1} \sum_{i: X_i \in I_j} f_0(X_i) - \theta_{0j} \Big|^2 \lesssim N_j^{-1} \log n \lesssim (J \log n)/n$$
with high probability. Also, $|\bar Y_j|$ is uniformly bounded with high probability, because $Y_i = f_0(X_i) + \epsilon_i$.
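The bin-wise conjugate posterior used above has a closed form. The following is a minimal numerical sketch (our own illustrative notation, not the paper's code): the height $\theta_j$ has prior $N(m, v)$, and the $N_j$ responses in bin $j$ are $N(\theta_j, \sigma^2)$ given $\theta_j$, so the posterior variance is of the order $\sigma^2/N_j$, as used in (5.7).

```python
# Minimal sketch of the normal-normal conjugate update for one step height.
# Assumptions (ours, for illustration): theta_j ~ N(m, v) a priori; the N_j
# responses in bin j are i.i.d. N(theta_j, sigma^2) given theta_j.

def bin_posterior(y_bar, n_bin, sigma2, m=0.0, v=1.0):
    """Posterior mean and variance of a step height given the bin average."""
    precision = n_bin / sigma2 + 1.0 / v          # posterior precision
    post_var = 1.0 / precision
    post_mean = post_var * (n_bin * y_bar / sigma2 + m / v)
    return post_mean, post_var

mean, var = bin_posterior(y_bar=0.7, n_bin=100, sigma2=1.0)
# The posterior variance never exceeds sigma^2 / N_j, the order used in (5.7).
assert var <= 1.0 / 100
```

The shrinkage toward the prior mean disappears as $N_j$ grows, which is why the posterior mean is close to $\bar Y_j$ up to the $(J \log n)/n$ terms bounded above.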
Putting these bounds in the expression for $\mathrm{E}(\theta_j \,|\, \mathbb{D}_n)$, we conclude that $(\mathrm{E}(\theta_j \,|\, \mathbb{D}_n) - \theta_{0j})^2 \lesssim (J \log n)/n$. Putting these estimates in (5.7), we find that the expression is bounded by a multiple of $M^{-2}$ with high probability simultaneously for all $J \le J_n$. Hence by (5.6), it follows that (5.5) holds.

We also observe that, if the posterior contracts at the rate $\epsilon_n$ at $f_0$ in the sense that $\mathrm{E}_0 \Pi(f: d(f, f_0) > M_0 \epsilon_n \,|\, \mathbb{D}_n) \to 0$ for some $M_0 > 0$, then
$$\mathrm{E}_0 \Pi(J: d(f_{0J}, f_0) > M_0 \epsilon_n \,|\, \mathbb{D}_n) \to 0. \quad (5.8)$$
This follows because $f_{0J}$ is the closest element to $f_0$ in $\mathcal{F}_J$, so if for a given $J$, $d(f_{0J}, f_0) > M_0 \epsilon_n$, then $\Pi(J \,|\, \mathbb{D}_n) \le \Pi(f: d(f, f_0) > M_0 \epsilon_n \,|\, \mathbb{D}_n)$.

(a) If $f_0 \in \mathcal{F}^+$, then $f_{0J} \in \mathcal{F}^+$. By Lemma 6.3, the $L_2$-approximation rate of $\mathcal{F}_J$ with equidistant intervals at a monotone function is $J^{-1/2}$. Then standard arguments as in the proof of Theorem 3.2 show that the prior probability of a Kullback-Leibler neighborhood of size $\epsilon$ is bounded below by $\exp\{-C \epsilon^{-2} \log(1/\epsilon)\}$. The required test with respect to $d$ is automatically available, while the sieve can be chosen as in Theorem 3.2 and its entropy can be bounded in the same way by noting that $d$ is bounded by the $L_2(G)$-metric, leading to a (suboptimal) contraction rate $\epsilon_n = (n/\log n)^{-1/4}$. It also follows that for $J_n$ a large constant multiple of $\epsilon_n^{-2}$, the prior probability of $J > J_n$ is exponentially small compared with the prior concentration, and hence $\{J > J_n\}$ has a small posterior probability. Since $\log J_n \lesssim \log n$, it follows that (5.5) holds.

(b) Let $f_0 \notin \bar{\mathcal{F}}^+$ be fixed and bounded. By the martingale convergence theorem, $\|f_{0J} - f_0\|_{2,G} \to 0$ as $J \to \infty$, so for a given $\epsilon > 0$, we can get $J_0$ (depending on $\epsilon$ but not depending on $n$) such that $\|f_{0J_0} - f_0\|_{2,G} < \epsilon/2$. Then for some $\delta > 0$, we have
$$\Pi(\|f - f_0\|_{2,G} < \epsilon) \ge \Pi(J = J_0)\, \Pi(\max\{|\theta_j - \theta_{0j}|: 1 \le j \le J_0\} < \delta) > 0.$$
Further, for $\bar J_n$ an arbitrarily small multiple of $n/\log n$, the excess prior probability $\Pi(J > \bar J_n)$ can be bounded by $e^{-bn}$ for some $b > 0$. Considering a sieve $\mathcal{P}_n = \{f = \sum_{j=1}^{J} \theta_j \mathbb{1}_{I_j},\ \max_j |\theta_j| \le n,\ J \le \bar J_n\}$, standard estimates give a bound for its metric entropy an arbitrarily small multiple of $n$. Therefore it follows (see Theorem 6.17 of Ghosal and van der Vaart [14]) that $\mathrm{E}_0 \Pi(J > \bar J_n \,|\, \mathbb{D}_n) \to 0$ and that the posterior is consistent at $f_0$ with respect to $d$, because $d(f, f_0) \le \|f - f_0\|_{2,G}$. Observe that for any $f \in \mathcal{F}_J$,
$$d(f, \mathcal{F}^+) = d(f, f^*) \ge d(f_0, f^*) - d(f, f_{0J}) - d(f_{0J}, f_0). \quad (5.9)$$
Since $f_0 \notin \bar{\mathcal{F}}^+$, the first term is a fixed positive number. The second term is bounded by $\sqrt{(J \log n)/n}$ with high posterior probability, and $J$ can be restricted to be at most $\bar J_n$, which can be taken to be an arbitrarily small multiple of $n/\log n$. Hence we can make the second term as small as we like, with high posterior probability. By (5.8) and posterior consistency, the third term can also be made arbitrarily small with high posterior probability. This shows that $d(f, \mathcal{F}^+)$ is larger than some fixed positive number with high posterior probability, and hence it will exceed $\sqrt{(J \log n)/n}$ with high posterior probability for all $J \le \bar J_n$, prompting the test to reject the null hypothesis of monotonicity with true probability tending to one.

(c) Let $f_0 \notin \mathcal{F}^+$ and $f_0 \in \mathcal{H}(\alpha, L)$ such that $d(f_0, \mathcal{F}^+) \ge \rho_n(\alpha)$. The proof is very similar to part (b) with the following changes. First, by the well-known approximation rate $J^{-\alpha}$ at functions in $\mathcal{H}(\alpha, L)$ by step functions, and standard arguments as used in parts (a) and (b) giving prior concentration and metric entropy bounds, the posterior contraction rate at $f_0$ with respect to $d$ is $\epsilon_n = (n/\log n)^{-\alpha/(2\alpha+1)}$. Also, with high posterior probability, $J$ can be restricted to less than $\bar J_n \asymp n\epsilon_n^2/\log n = (n/\log n)^{1/(2\alpha+1)}$. This bounds the second term by a multiple of $(n/\log n)^{-\alpha/(2\alpha+1)}$ with high posterior probability. Finally, by (5.8), the third term is also bounded by a multiple of $(n/\log n)^{-\alpha/(2\alpha+1)}$ with high posterior probability. Therefore, the expression on the right side of (5.9) is larger than $M\sqrt{(J \log n)/n}$ with high posterior probability. Thus the test rejects the null hypothesis of monotonicity with true probability tending to one.
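The test above requires the distance $d(f, \mathcal{F}^+)$ of a posterior draw from the monotone class; on a grid this reduces to the $L_2$ projection onto nondecreasing vectors, i.e., isotonic regression computed by the Pool-Adjacent-Violators Algorithm. The following is a minimal sketch with our own function names and equal weights, not the paper's implementation:

```python
# Sketch: PAVA computes the L2 projection of a vector onto nondecreasing
# vectors; d(f, F+) is then approximated by the discrepancy between a
# posterior draw and its projection. Names are ours, for illustration only.

def pava(y):
    """L2 projection of y onto nondecreasing sequences (equal weights)."""
    blocks = []  # each block holds [sum, count]; block means stay increasing
    for value in y:
        blocks.append([float(value), 1])
        # pool adjacent blocks while their means violate monotonicity
        while len(blocks) > 1 and blocks[-2][0] * blocks[-1][1] > blocks[-1][0] * blocks[-2][1]:
            s, c = blocks.pop()
            blocks[-1][0] += s
            blocks[-1][1] += c
    out = []
    for s, c in blocks:
        out.extend([s / c] * c)
    return out

def distance_to_monotone(f_vals):
    """Empirical L1 discrepancy between a draw and its monotone projection."""
    proj = pava(f_vals)
    return sum(abs(a - b) for a, b in zip(f_vals, proj)) / len(f_vals)

assert pava([3.0, 1.0, 2.0]) == [2.0, 2.0, 2.0]   # violators are pooled
assert pava([1.0, 2.0, 3.0]) == [1.0, 2.0, 3.0]   # monotone input unchanged
```

In the test of Theorem 4.2, one would evaluate `distance_to_monotone` at each posterior draw of the step function on a grid and reject monotonicity when the posterior fraction of draws within $M_n n^{-1/3}$ of $\mathcal{F}^+$ falls below $\gamma$.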
6. Auxiliary results

Lemma 6.1. If the predictors are random, Condition (DR) holds and $n/J \gg \log J$, then for
$$A_n = \big\{ a_1 n/(2J) \le \min(N_1, \ldots, N_J) \le \max(N_1, \ldots, N_J) \le 2 a_2 n/J \big\},$$
we have $\mathrm{P}(A_n) \to 1$. In other words, $N_1, \ldots, N_J$ are simultaneously of the order $n/J$ in probability.

Proof.
From $N_j \sim \mathrm{Bin}(n, G(I_j))$ and $a_1/J \le G(I_j) \le a_2/J$ for every $1 \le j \le J$, a standard large deviation estimate for $\mathrm{P}(N_j \ge 2 a_2 n/J)$ is $2 e^{-Cn/J}$ for some constant $C > 0$, and similarly for $\mathrm{P}(N_j \le a_1 n/(2J))$. Adding these probabilities over $j = 1, \ldots, J$, we get the desired result, because the factor $\log J$ can be absorbed in $n/J$.

Lemma 6.2.
Let the predictors be deterministic satisfying Condition (DD) or be random satisfying Condition (DR). Let $f_0 \in \mathcal{F}^+$, let the prior on $f$ be of the type described above, and let Condition (E) hold. Then for $J \to \infty$ such that $J \ll n$, we have:
(a) the maximum marginal likelihood estimator $\hat\sigma_n$ converges in probability to $\sigma_0$ at the rate $\max\{n^{-1/2}, J/n, J^{-1}\}$;
(b) if $\sigma^2 \sim \mathrm{IG}(\beta_1, \beta_2)$ with $\beta_1 > 0$, $\beta_2 > 0$, then the marginal posterior distribution of $\sigma$ contracts at the rate $\max\{n^{-1/2}, J/n, J^{-1}\}$.

Proof. (a) Let $f_0 \in \mathcal{F}^+$. We first show that there exists $\theta_J = (\theta_1, \ldots, \theta_J)$ such that $n^{-1}\|F_0 - B\theta_J\|^2 \lesssim J^{-1}$ for deterministic $X$, and $n^{-1}\mathrm{E}_G\|F_0 - B\theta_J\|^2 \lesssim J^{-1}$ for random $X$. On a set with $\min\{N_j: 1 \le j \le J\} > 0$, let $\theta_j = N_j^{-1}\sum_{i: X_i \in I_j} f_0(X_i)$. Using the monotonicity of $f_0$, we write $n^{-1}\|F_0 - B\theta_J\|^2$ as
$$\frac{1}{n}\sum_{j=1}^{J}\sum_{i: X_i \in I_j}(f_0(X_i) - \theta_j)^2 \le \frac{1}{n}\sum_{j=1}^{J}\sum_{i: X_i \in I_j}\big(f_0(j/J) - f_0((j-1)/J)\big)^2 = \sum_{j=1}^{J}\frac{N_j}{n}\big(f_0(j/J) - f_0((j-1)/J)\big)^2. \quad (6.1)$$
For deterministic $X$, by Condition (DD) and the monotonicity of $f_0$, (6.1) is bounded by
$$\max_{1 \le j \le J}\frac{N_j}{n}\sum_{j=1}^{J}\big(f_0(j/J) - f_0((j-1)/J)\big)^2 \le \max_{1 \le j \le J}\frac{N_j}{n}\big(f_0(1) - f_0(0)\big)^2 \lesssim J^{-1}. \quad (6.2)$$
For random $X$, using the fact that $N_j \sim \mathrm{Bin}(n, G(I_j))$, the expectation of (6.1) under $G$ equals $\sum_{j=1}^{J} G(I_j)(f_0(j/J) - f_0((j-1)/J))^2$, which, in view of Condition (DR), is bounded by $\max_{1 \le j \le J} G(I_j)(f_0(1) - f_0(0))^2 \lesssim J^{-1}$.

Now assume that $X$ is fixed, satisfying Condition (DD); the random case can be dealt with similarly, by taking the expectation with respect to $G$ and using Condition (DR). We imitate the proof of Proposition 4.1(a) of Yoo and Ghosal [23], but assuming that $f_0$ is monotone instead of smooth. Define $U = (B\Lambda B^{T} + I_n)^{-1}$. We write
$$|\mathrm{E}(\hat\sigma_n^2) - \sigma_0^2| \le |n^{-1}\sigma_0^2\,\mathrm{tr}(U) - \sigma_0^2| + n^{-1}(F_0 - B\zeta)^{T}U(F_0 - B\zeta)$$
and bound it by a constant multiple of
$$n^{-1}\big[\mathrm{tr}(I_n - U) + (F_0 - B\theta_J)^{T}U(F_0 - B\theta_J) + (B\theta_J - B\zeta)^{T}U(B\theta_J - B\zeta)\big]. \quad (6.3)$$
Among these terms, only the middle term, arising out of the approximation of the true function by step functions, is different; the other two terms are bounded by $J/n$, considering step functions as B-splines of order 1 in one dimension. The second term can also be bounded by a multiple of $J^{-1}$ in the same way as in Yoo and Ghosal [23], using the $L_2$-approximation rate $J^{-1/2}$ for a monotone function, leading to an upper bound a multiple of $J/n + J^{-1}$ for the expression in (6.3). To complete the proof of part (a), we bound $\mathrm{Var}(\hat\sigma_n^2)$ by a multiple of $n^{-1}$. Again, we can follow the same steps as in the proof of Proposition 4.1(a) of Yoo and Ghosal [23], with the approximation rate for a smooth function replaced by the approximation rate $J^{-1/2}$ for a monotone function. We also observe that the bounds obtained in the proof are uniform over $f_0 \in \mathcal{F}^+(K)$ for any $K > 0$.
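The step-approximation bound in (6.1)-(6.2) is straightforward to check numerically. Below is a small sketch with our own illustrative choices of $n$, $J$ and a monotone test function, not taken from the paper:

```python
# Numerical check of (6.1)-(6.2): for a monotone f, the squared error of the
# bin-wise mean approximation is at most max_j (N_j / n) * (f(1) - f(0))^2.
# The grid and the function f are our own illustrative choices.

n, J = 1200, 30
xs = [(i + 0.5) / n for i in range(n)]      # deterministic design on [0, 1]
fvals = [x ** 2 for x in xs]                # a monotone increasing function

bins = [[] for _ in range(J)]               # equidistant intervals I_1,...,I_J
for x, v in zip(xs, fvals):
    bins[min(int(x * J), J - 1)].append(v)

theta = [sum(b) / len(b) for b in bins]     # bin-wise means theta_j
sq_error = sum((v - theta[j]) ** 2 for j, b in enumerate(bins) for v in b) / n
bound = max(len(b) for b in bins) / n * (fvals[-1] - fvals[0]) ** 2

assert sq_error <= bound                    # the bound of order 1/J holds
```

Here each bin contains $n/J$ design points, so the bound is of the order $J^{-1}$, matching (6.2); for this smooth choice of $f$ the actual error is much smaller still.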
Lemma 6.3.
Let $p \ge 1$ and $K > 0$. Then for every $f \in \mathcal{F}^+(K)$ and $J \ge 1$, there exist $\theta_1 \le \cdots \le \theta_J$ from $[-K, K]$ such that the following assertions hold.
(a) For any partition into intervals $I_1, \ldots, I_J$ and probability measure $H$ satisfying $H(I_j) \le M/J$, with $f_J = \sum_{j=1}^{J}\theta_j \mathbb{1}_{I_j} \in \mathcal{F}^+(K)$ we have that $\int |f - f_J|^p \, dH \le M(2K)^p/J$.
(b) For any probability measure $H$ and $1 \le p < \infty$, there exist knots $\xi_1 < \xi_2 < \cdots < \xi_{J-1} < \xi_J = 1$ from the topological support of $H$ such that for any $f \in \mathcal{F}^+(K)$, there exists a function of the form $f_J = \sum_{j=1}^{J}\theta_j \mathbb{1}_{I_j} \in \mathcal{F}^+(K)$ satisfying $\int |f - f_J|^p \, dH \le (2K)^p/J^p$, where $I_j = [\xi_{j-1}, \xi_j)$, $j = 1, \ldots, J-1$, and $I_J = [\xi_{J-1}, \xi_J]$.

Proof. (a) We bound the discrepancy $\int |f - f_J|^p \, dH = \sum_{j=1}^{J} \int_{I_j} |f - f_J|^p \, dH$ by
$$\sum_{j=1}^{J} H(I_j)\,\big|f(j/J) - f((j-1)/J)\big|^p \le M J^{-1} \sum_{j=1}^{J} \big|f(j/J) - f((j-1)/J)\big|^p,$$
which is bounded by $M J^{-1} |f(1) - f(0)|^p$ by the estimate $\sum a_k^p \le (\sum a_k)^p$ for positive numbers $a_1, \ldots, a_k$ and $p \ge 1$.

(b) By the bracketing entropy bound for monotone functions in van der Vaart and Wellner [22], for every $\epsilon > 0$, there exist $J = J(\epsilon) \lesssim \epsilon^{-1}$, knots $0 \le \xi_1 < \cdots < \xi_{J-1} \le 1$ and heights $\theta_1, \ldots, \theta_J$ such that $f_J = \sum_{j=1}^{J}\theta_j\mathbb{1}_{I_j}$ satisfies $\|f - f_J\|_{p,H} < \epsilon$, where $I_1, \ldots, I_J$ form an interval partition of $[0, 1]$ with knots $0 = \xi_0 < \xi_1 < \cdots < \xi_{J-1} \le \xi_J = 1$. For instance, one of the lower brackets in their construction of an $\epsilon$-bracketing will satisfy the approximation property. The roles of $\epsilon$ and $J$ can be reversed, in that, given $J$, we can first obtain $\epsilon > 0$ such that $J(\epsilon)$ is within $J$. Finally, we need to conclude that the knot points $\xi_1 < \cdots < \xi_{J-1}$ can be chosen from the support of $H$. The construction in van der Vaart and Wellner [22] assumed, without loss of generality, that $H$ is uniform. For a general $H$, the quantile transform is applied, transforming the $j$th knot $\xi_j$ to $H^{-1}(\xi_j)$, which belongs to the support of $H$.

References

[1]
Akakpo, N., Balabdaoui, F. and Durot, C. (2014). Testing monotonicity via local least concave majorants. Bernoulli.
[2] Ayer, M., Brunk, H. D., Ewing, G. M., Reid, W. T. and Silverman, E. (1955). An empirical distribution function for sampling with incomplete information. Ann. Math. Statist.
[3] Baraud, Y., Huet, S. and Laurent, B. (2005). Testing convex hypotheses on the mean of a Gaussian vector. Application to testing qualitative hypotheses on a regression function. Ann. Statist.
[4] Barlow, R. E., Bartholomew, D. J., Bremner, J. M. and Brunk, H. D. (1972). Statistical Inference under Order Restrictions. The Theory and Application of Isotonic Regression. Wiley Series in Probability and Mathematical Statistics. John Wiley & Sons, London-New York-Sydney. MR0326887
[5] Barlow, R. E. and Brunk, H. D. (1972). The isotonic regression problem and its dual. J. Amer. Statist. Assoc.
[6] Bhaumik, P. and Ghosal, S. (2015). Bayesian two-step estimation in differential equation models. Electron. J. Statist.
[7] Bhaumik, P. and Ghosal, S. (2017). Efficient Bayesian estimation and uncertainty quantification in ordinary differential equation models. Bernoulli.
[8] Bowman, A. W., Jones, M. C. and Gijbels, I. (1998). Testing monotonicity of regression. J. Comput. Graph. Statist.
[9] Brunk, H. D. (1970). Estimation of isotonic regression. In Nonparametric Techniques in Statistical Inference (Proc. Sympos., Indiana Univ., Bloomington, Ind., 1969).
[10] De Leeuw, J., Hornik, K. and Mair, P. (2009). Isotone optimization in R: Pool-Adjacent-Violators Algorithm (PAVA) and active set methods. J. Stat. Softw.
[11] Durot, C. (2002). Sharp asymptotics for isotonic regression. Probab. Theory Relat. Fields.
[12] Ghosal, S., Sen, A. and van der Vaart, A. W. (2000). Testing monotonicity of regression. Ann. Statist.
[13] Ghosal, S. and van der Vaart, A. (2007). Convergence rates of posterior distributions for non-i.i.d. observations. Ann. Statist.
[14] Ghosal, S. and van der Vaart, A. (2017). Fundamentals of Nonparametric Bayesian Inference. Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge University Press, Cambridge. MR3587782
[15] Groeneboom, P. and Jongbloed, G. (2014). Nonparametric Estimation under Shape Constraints. Estimators, Algorithms and Asymptotics. Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge University Press, New York. MR3445293
[16] Hall, P. and Heckman, N. E. (2000). Testing for monotonicity of a regression mean by calibrating for linear functions. Ann. Statist.
[17] Lin, L. and Dunson, D. B. (2014). Bayesian monotone regression using Gaussian process projection. Biometrika.
[18] Robertson, T. and Wright, F. T. (1973). Multiple isotonic median regression. Ann. Statist.
[19] Salomond, J.-B. (2014). Adaptive Bayes test for monotonicity. In The Contribution of Young Researchers to Bayesian Statistics. Springer Proc. Math. Stat.
[20] Salomond, J.-B. (2018). Testing un-separated hypotheses by estimating a distance. Bayesian Anal.
[21] Shively, T. S., Sager, T. W. and Walker, S. G. (2009). A Bayesian approach to non-parametric monotone function estimation. J. R. Stat. Soc. Ser. B Stat. Methodol.
[22] van der Vaart, A. W. and Wellner, J. A. (1996). Weak Convergence and Empirical Processes: With Applications to Statistics. Springer-Verlag, New York.
[23] Yoo, W. W. and Ghosal, S. (2016). Supremum norm posterior contraction and credible sets for nonparametric multivariate regression. Ann. Statist. 44.