Asymptotic properties and approximation of Bayesian logspline density estimators for communication-free parallel computing methods
A. Miroshnikov, K. Kotsiopoulos, E. M. Conlon
Abstract
In this article we perform an asymptotic analysis of Bayesian parallel density estimators which are based on logspline density estimation. The parallel estimator we introduce is in the spirit of a kernel density estimator introduced in recent studies. We provide a numerical procedure that produces the density estimator itself in place of the sampling algorithm. We then derive an error bound for the mean integrated squared error for the full data posterior density estimator. We also investigate the parameters that arise from logspline density estimation and the numerical approximation procedure. Our investigation identifies specific choices of parameters for logspline density estimation that result in the error bound scaling appropriately in relation to these choices.
Keywords: logspline density estimation, asymptotic properties, error analysis, parallel algorithms.
Mathematics Subject Classification (2000):
Contents
1 Introduction
2 Notation and hypotheses
3 Analysis of MISE for $\hat p = \hat c\,\hat p^*$, $\hat c = \hat\lambda^{-1}$
4 Analysis of MISE for the numerical estimator $\tilde p$
5 Simulated experiments
6 Appendix
6.6 Lagrange interpolation
References
1 Introduction

The recent advances in data science and big data research have brought challenges in analyzing large data sets in full. These massive data sets may be too large to read into a computer's memory in full, and they may be located on different machines. In addition, processing these data sets is time consuming. To alleviate these difficulties, many parallel computing methods have recently been developed. One such approach partitions large data sets into subsets, where each subset is analyzed on a separate machine using parallel Markov chain Monte Carlo (MCMC) methods [8, 9, 10]; here, communication between machines is required for each MCMC iteration, increasing computation time.

Due to the limitations of methods requiring communication between machines, a number of alternative communication-free parallel MCMC methods have been developed for Bayesian analysis of big data [5, 6]. For these approaches, Bayesian MCMC analysis is performed on each subset independently, and the subset posterior samples are combined to estimate the full data posterior distribution. Neiswanger, Wang and Xing [5] introduced a parallel kernel density estimator that first approximates each subset posterior density and then estimates the full data posterior by multiplying together the subset posterior estimators. The authors of [5] show that the estimator they use is asymptotically exact; they then develop an algorithm that generates samples from the posterior distribution approximating the full data posterior estimator. Though the estimator is asymptotically exact, the algorithm of [5] does not perform well for posteriors that have non-Gaussian shape. This under-performance is attributed to the method of construction of the subset posterior densities; this method produces near-Gaussian posteriors even if the true underlying distribution is non-Gaussian.
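The communication-free scheme described above can be sketched in a few lines: each subset is sampled independently, so the per-subset loops could run on separate machines with no exchange of state. This is a minimal sketch, not the implementation of [5] or [6]; the toy random-walk Metropolis sampler, the Gaussian-mean model, and all function names are our own illustrative assumptions.

```python
import numpy as np

def metropolis(logpost, theta0, n_samples, step=0.5, rng=None):
    """Toy random-walk Metropolis sampler (a stand-in for any MCMC engine)."""
    rng = rng or np.random.default_rng(0)
    theta, lp = theta0, logpost(theta0)
    out = np.empty(n_samples)
    for i in range(n_samples):
        prop = theta + step * rng.standard_normal()
        lp_prop = logpost(prop)
        if np.log(rng.uniform()) < lp_prop - lp:   # accept/reject
            theta, lp = prop, lp_prop
        out[i] = theta
    return out

def subset_log_posterior(data_m, M):
    """Hypothetical subset posterior for a unit-variance Gaussian mean:
    likelihood of the subset times the prior N(0, 10^2) raised to 1/M,
    following the sub-posterior construction used in [5]."""
    def logpost(theta):
        loglik = -0.5 * np.sum((data_m - theta) ** 2)
        logprior = -0.5 * theta ** 2 / 100.0
        return loglik + logprior / M
    return logpost

# Partition the data into M disjoint subsets and sample each
# sub-posterior independently -- no inter-machine communication.
rng = np.random.default_rng(1)
data = rng.normal(2.0, 1.0, size=300)
M = 3
subsets = np.array_split(data, M)
draws = [metropolis(subset_log_posterior(s, M), 0.0, 2000) for s in subsets]
```

In a real deployment each entry of `draws` would be produced on its own machine; only the finished sample arrays (or density estimates built from them) are ever collected.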
Another limitation of the method of Neiswanger, Wang and Xing is its use in high-dimensional parameter spaces, since it becomes impractical to carry out this method when the number of model parameters increases.

Miroshnikov and Conlon [6] introduced a new approach for parallel MCMC that addresses the limitations of [5]. Their method performs well for non-Gaussian posterior distributions and only analyzes densities marginally for each parameter, so that the size of the parameter space is not a limitation. The authors use logspline density estimation for each subset posterior, and the subsets are combined by a direct numeric product of the subset posterior estimates. However, note that this technique does not produce joint posterior estimates, as in [5].

The estimator introduced in [6] follows the ideas of Neiswanger et al. [5]. Specifically, let $p(x|\theta)$ be the likelihood of the full data given the parameter $\theta \in \mathbb{R}$. We partition $x$ into $M$ disjoint subsets $x_m$, with $m \in \{1, 2, \ldots, M\}$. For each subset we draw $N$ samples $\theta_{m1}, \theta_{m2}, \ldots, \theta_{mN}$ whose distribution is given by the subset posterior density $p(\theta|x_m)$. Given the prior $p(\theta)$ and the data sets $x_1, x_2, \ldots, x_M$, and assuming that they are independent from each other, the posterior density, see [5], is expressed by
$$p(\theta|x) \propto p(\theta) \prod_{m=1}^{M} p(x_m|\theta) = \prod_{m=1}^{M} p_m(\theta) =: p^*(\theta), \quad \text{where } p(\theta|x_m) := p_m(\theta) = p(x_m|\theta)\, p(\theta)^{1/M}. \tag{1}$$
In our work, we investigate the properties of the estimator $\hat p(\theta|x)$, defined in [6], that has the form
$$\hat p(\theta|x) \propto \prod_{m=1}^{M} \hat p_m(\theta) =: \hat p^*(\theta), \tag{2}$$
where $\hat p_m(\theta)$ is the logspline density estimator of $p_m(\theta)$ and where we suppressed the information about the data $x$.

The estimated product $\hat p^*$ of the subset posterior densities is, in general, unnormalized. This motivates us to define the normalization constant $\hat c$ for the estimated product $\hat p^*$.
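The combination step in equation (2) can be sketched numerically: given the $M$ subset density estimates evaluated on a common grid, form their pointwise product and normalize by quadrature. This is a minimal sketch under our own assumptions (a shared uniform grid, a hypothetical function name); it is not the estimator construction of [6] itself.

```python
import numpy as np

def combine_subset_posteriors(grid, subset_densities):
    """Pointwise product of subset posterior estimates, as in (2):
    p_hat*(theta) = prod_m p_hat_m(theta), normalized on `grid`.
    `grid` is a uniform 1-D grid; `subset_densities` is a list of the
    M subset density estimates evaluated on that grid."""
    # Multiply in log space to avoid underflow when M is large.
    log_prod = np.sum(np.log(np.vstack(subset_densities)), axis=0)
    log_prod -= log_prod.max()                  # stabilize exponentiation
    unnorm = np.exp(log_prod)
    dx = grid[1] - grid[0]
    # Trapezoid-rule estimate of the normalization constant c_hat.
    c_hat = (np.sum(unnorm) - 0.5 * (unnorm[0] + unnorm[-1])) * dx
    return unnorm / c_hat

# Toy check: the product of N(-1, 1) and N(1, 1) subset posteriors
# is proportional to a N(0, 1/2) density.
grid = np.linspace(-8.0, 8.0, 2001)
gauss = lambda mu: np.exp(-0.5 * (grid - mu) ** 2) / np.sqrt(2 * np.pi)
dens = combine_subset_posteriors(grid, [gauss(-1.0), gauss(1.0)])
```

The log-space product mirrors the fact that the paper works throughout with $\log \hat p_m$; the rescaling by the maximum cancels in the normalization.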
Thus, the normalized density $\hat p$, one of the main points of interest in our work, is given by
$$\hat p(\theta) = \hat c^{-1}\, \hat p^*(\theta), \quad \text{where } \hat c = \int \hat p^*(\theta)\, d\theta.$$
Computing the normalization constant analytically is a difficult task since the subset posterior densities are not explicitly calculated, with the exception of a finite number of points $\left(\theta_i, \hat p_m(\theta_i)\right)$, where $i \in \{1, \ldots, n\}$. By taking the product of these values for each $i$ we obtain the value of $\hat p^*(\theta_i)$. This allows us to numerically approximate the unnormalized product $\hat p^*$ by using Lagrange interpolation polynomials. This approximation is denoted by $\tilde p^*$. Then we approximate the constant $\hat c$ by numerically integrating $\tilde p^*$. The approximation of the normalization constant $\hat c$ is denoted by $\tilde c$, given by
$$\tilde c = \int \tilde p^*(\theta)\, d\theta, \quad \text{and we set } \tilde p(\theta) := \tilde c^{-1}\, \tilde p^*(\theta).$$
The newly defined density $\tilde p$ acts as the estimator for the full-data posterior $p$.

In this paper, we establish error estimates between the three densities via the mean integrated squared error, or MISE, defined for two functions $f, g$ as
$$\mathrm{MISE}(f, g) := \mathbb{E} \int \left( f(\theta) - g(\theta) \right)^2 d\theta. \tag{3}$$
Thus, our work involves two types of approximations: 1) the construction of $\hat p^*$ using logspline density estimators and 2) the construction of the interpolation polynomial $\tilde p^*$. The methodology of logspline density estimation was introduced in [2], and corresponding error estimates between the estimator and the density it is approximating are presented in [3, 4]. These error estimates depend on three factors: i) the number of samples $N_m$ drawn from the subset posterior density, ii) the number of knots $K_m + 1$ used to create the order-$k$ B-splines, and iii) the step-size of those knots, which we denote by $h_m$.

In our work we estimate the MISE between the functions $\hat p^*$ and $p^*$ by adapting the estimation techniques introduced in [3, 4].
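The interpolation-plus-quadrature step described above can be sketched as follows: interpolate a pointwise-evaluable function by piecewise Lagrange polynomials of degree $l$ on a uniform grid, then integrate the interpolant to approximate the normalization constant. This is a minimal sketch under our own assumptions; `f` stands in for the product $\hat p^*$, and the function names are hypothetical.

```python
import numpy as np

def lagrange_basis(x_nodes, tau, theta):
    """Lagrange basis polynomial l_tau(theta) over the nodes x_nodes."""
    out = np.ones_like(theta, dtype=float)
    for jj, xj in enumerate(x_nodes):
        if jj != tau:
            out *= (theta - xj) / (x_nodes[tau] - xj)
    return out

def composite_interpolant(f, a, b, n_sub, l, n_eval=4001):
    """Piecewise Lagrange interpolant of degree l: [a, b] is split into
    n_sub sub-intervals, each carrying l + 1 equally spaced nodes shared
    with its neighbours (n_sub * l + 1 nodes in total).  Returns an
    evaluation grid, the interpolant values on it, and the trapezoid-rule
    integral of the interpolant (the approximate normalization constant)."""
    nodes = np.linspace(a, b, n_sub * l + 1)
    fvals = f(nodes)
    theta = np.linspace(a, b, n_eval)
    out = np.empty_like(theta)
    for i in range(n_sub):
        lo, hi = nodes[i * l], nodes[(i + 1) * l]
        mask = (theta >= lo) & (theta <= hi)
        seg_nodes = nodes[i * l : (i + 1) * l + 1]
        seg = np.zeros(mask.sum())
        for tau in range(l + 1):
            seg += fvals[i * l + tau] * lagrange_basis(seg_nodes, tau, theta[mask])
        out[mask] = seg
    dx = theta[1] - theta[0]
    integral = (np.sum(out) - 0.5 * (out[0] + out[-1])) * dx
    return theta, out, integral

# Usage: interpolate an unnormalized bump and approximate its constant.
theta, p_tilde, lam_tilde = composite_interpolant(
    lambda x: np.exp(-x ** 2), -3.0, 3.0, n_sub=8, l=2)
```

Degree-$l$ interpolation reproduces polynomials of degree at most $l$ exactly, which is the property the error analysis below exploits through the derivative bound on $\hat p^*$.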
We then utilize this analysis to establish a similar estimate for the normalized densities $\hat p$ and $p$:
$$\mathrm{MISE}(p^*, \hat p^*) = O\!\left( \left( \exp\left\{ \sum_{m=1}^{M} \frac{K_m + 1 - k}{N_m^{1/2}} + h_{\max}^{j+1} \sum_{m=1}^{M} \left\| \frac{d^{j+1} \log(p_m)}{d\theta^{j+1}} \right\|_\infty \right\} - 1 \right)^{\!2}\, \right),$$
where $h_{\max} = \max_m \{h_m\}$ and $j+1$ is the number of continuous derivatives of $p$. Notice that the exponential contains two terms, where the first depends on the number of samples and the number of knots and the other depends on the placement of the spline knots. Both terms converge to zero, and for MISE to scale optimally both terms must converge at the same rate. To this end, we choose $h_{\max}$ and each $K_m$ to be functions of the vector $N = \{N_1, \ldots, N_M\}$ that scale appropriately with the norm $\|N\|$. This simplifies the above estimate to
$$\mathrm{MISE}(p^*, \hat p^*) = O\!\left( M^{-2\beta} \|N\|^{-2\beta} \right),$$
where the parameter $\beta \in (0, 1/2)$ is related to the convergence of the logspline density estimators.

The estimate for MISE between $\tilde p^*$ and $\hat p^*$ is obtained in a similar way by utilizing Lagrange interpolation error bounds, as described in [7]. This error depends on two factors: i) the step-size $\Delta x$ of the grid points chosen to construct the polynomial, where the grid points correspond to the coordinates $\left(\theta_i, \hat p_m(\theta_i)\right)$ discussed earlier, and ii) the degree $l$ of the Lagrange polynomial. The estimate obtained is also shown to hold for the normalized densities $\tilde p$ and $\hat p$:
$$\mathrm{MISE}(\hat p^*, \tilde p^*) = O\!\left( \left( \frac{\Delta x\, M}{h_{\min}(N)} \right)^{\!2(l+1)} \right),$$
where $h_{\min}(N)$ is the minimal distance between the spline knots and is chosen to asymptotically scale with the norm of the vector of samples $N$, see Section 2.

We then combine both estimates to obtain a bound for MISE for the densities $p$ and $\tilde p$. We obtain
$$\mathrm{MISE}(p, \tilde p) = O\!\left( M^{-2\beta} \|N\|^{-2\beta} + \left( \frac{\Delta x\, M}{h_{\min}(N)} \right)^{\!2(l+1)} \right).$$
In order for MISE to scale optimally, the two terms in the sum must converge to zero at the same rate. As before with the distance between $\hat p^*$ and $p^*$, we choose $\Delta x$ to scale appropriately with the norm of the vector $N$. This leads to the optimum error bound for the distance between the estimator $\tilde p$ and the density $p$,
$$\mathrm{MISE}(p, \tilde p) = O\!\left( \|N\|^{-2\beta} \right) \quad \text{where we choose} \quad \Delta x = O\!\left( \|N\|^{-\beta\left(\frac{1}{l+1} + \frac{1}{j+1}\right)} \right). \tag{4}$$
The paper is arranged as follows. In Section 2 we set notation and hypotheses that form the foundation of the analysis. In Section 3 we derive an asymptotic expansion for the MISE of the non-normalized estimator, which is central to the analysis performed in subsequent sections. We also perform there the analysis of MISE for the full data set posterior density estimator $\hat p$. In Section 4, we perform the analysis for the numerical estimator $\tilde p$.
In Section 5 we showcase our simulated experiments and discuss the results. Finally, in the appendix we provide supplementary lemmas and theorems employed in Section 3 and Section 4.

2 Notation and hypotheses

For the convenience of the reader, we collect in this section all hypotheses and results relevant to our analysis and present the notation that is utilized throughout the article.

(H1)
Motivated by the form of the posterior density in Neiswanger et al. [5], we consider a probability density function of the form $p(\theta) \propto p^*(\theta)$, where
$$p^*(\theta) := \prod_{m=1}^{M} p_m(\theta), \tag{5}$$
and we assume that the $p_m(\theta)$, $m \in \{1, \ldots, M\}$, have compact support on the interval $[a, b]$.

(H2) For each $m \in \{1, \ldots, M\}$, $p_m(\theta)$ is a probability density function. We consider the estimator of $p$ in the form $\hat p(\theta) \propto \hat p^*(\theta)$, where
$$\hat p^*(\theta) := \prod_{m=1}^{M} \hat p_m(\theta), \tag{H2-a}$$
and for each $m \in \{1, \ldots, M\}$, $\hat p_m(\theta)$ is the logspline density estimator of the probability density $p_m(\theta)$, a map $\hat p_m : \mathbb{R} \times \Omega_{n_m}^m \to \mathbb{R}$ defined by
$$\hat p_m(\theta, \omega) = f_m\big(\theta, \hat y(\theta_{m1}, \ldots, \theta_{m n_m})\big), \quad \omega \in \Omega_{n_m}^m. \tag{H2-b}$$
We also consider the additional estimators $\bar p_m$ of $p_m$ as defined in (71), and
$$\bar p^*(\theta) := \prod_{m=1}^{M} \bar p_m(\theta).$$
Here $\theta_{m1}, \theta_{m2}, \ldots, \theta_{m n_m} \sim p_m$ are independent identically distributed random variables and $f_m$ is the logspline density estimate introduced in Definition (37) with $N_m$ knots and B-splines of order $k_m$. We set
$$\Omega_{n_m}^m = \left\{ \omega \in \Omega : \hat y = \hat y(\theta_{m1}, \ldots, \theta_{m n_m}) \in \mathbb{R}^{L_m + 1} \text{ exists} \right\}, \tag{6}$$
where $L_m := N_m - k_m$.

The mean integrated squared error of the estimator $\hat p^*$ of the product $p^*$ is defined by
$$\mathrm{MISE}[N] := \mathrm{MISE}(p^*, \hat p^*) = \mathbb{E} \int \left( \hat p^*(\theta; \omega) - p^*(\theta) \right)^2 d\theta, \tag{7}$$
where we use the notation $N = (N_m)_{m=1}^{M}$. We assume that the probability density functions $p_1, \ldots, p_M$ satisfy the following hypotheses:

(H3) The number of samples for each subset is parameterized by a governing parameter $n$ as follows:
$$N(n) = \{N_1(n), N_2(n), N_3(n), \ldots, N_M(n)\} : \mathbb{N} \to \mathbb{N}^M$$
such that for all $m \in \{1, 2, \ldots, M\}$
$$D_1 \le \frac{N_m(n)}{n} \le D_2, \qquad \lim_{n \to \infty} N_m(n) = \infty. \tag{8}$$
Note that $C_1 \|N(n)\| \le N_m(n) \le C_2 \|N(n)\|$.

(H4) For each $m \in \{1, \ldots, M\}$, $k_1 = k_2 = \cdots = k_M = k$ for some fixed $k$ in $\mathbb{N}$.
The number of knots for each $m$ is parameterized by $n$ as follows:
$$K(n) = \{K_1(n), K_2(n), K_3(n), \ldots, K_M(n)\} : \mathbb{N} \to \mathbb{N}^M, \tag{9}$$
where $K_m(n) + 1$ is the number of knots for B-splines on the interval $[a, b]$, and thus
$$L(n) = \{L_1(n), L_2(n), L_3(n), \ldots, L_M(n)\} : \mathbb{N} \to \mathbb{N}^M \quad \text{with} \quad L_m(n) = K_m(n) - k,$$
and we require
$$\lim_{n \to \infty} L_m(n) = \infty \quad \text{and} \quad \lim_{n \to \infty} \frac{L_m(n)}{N_m(n)^{1/2 - \beta}} = 0, \qquad 0 < \beta < \tfrac{1}{2}.$$

(H5) For the knots $T_{K_m(n)} = (t_i^m)_{i=0}^{K_m(n)}$, we write
$$\bar h_m = \max_{k-1 \le i \le K_m(n) - k} (t_{i+1}^m - t_i^m) \quad \text{and} \quad \underline h_m = \min_{k-1 \le i \le K_m(n) - k} (t_{i+1}^m - t_i^m). \tag{10}$$

(H6) For each $m \in \{1, \ldots, M\}$, $j \in \{1, \ldots, k-1\}$ and density $p_m \in C^{j+1}([a, b])$, there exists $C_{m,s} \ge 0$ such that
$$\left| \frac{d^{j+1} \log(p_m(\theta))}{d\theta^{j+1}} \right| < C_{m,s} \quad \text{for all } \theta. \tag{11}$$

(H7) Let $\|\cdot\|_2$ denote the $L^2$-norm on $[a, b]$. For $p^*$ defined as in (H1), there exists $C^* \ge 0$ such that
$$\|p^*\|_2^2 = \int \left( p^*(\theta) \right)^2 d\theta < C^*. \tag{12}$$

(H8) For each subset $x_m$, the B-splines are created by choosing a uniform knot sequence. Thus,
$$\bar h_m = \underline h_m = h_m, \quad \text{for } m \in \{1, \ldots, M\}. \tag{13}$$
Let
$$h_{\min} = \min_{1 \le m \le M} \{h_m\} \quad \text{and} \quad h_{\max} = \max_{1 \le m \le M} \{h_m\}. \tag{14}$$
We assume that $h_{\min}$, $h_{\max}$ scale in a similar way to the number of samples, i.e.
$$c_1 \|N(n)\|^{-\beta} \le h_{\min}^{j+1}(n) \le h_{\max}^{j+1}(n) \le c_2 \|N(n)\|^{-\beta},$$
where $j \in \{1, \ldots, k-1\}$ is the same as in hypothesis (H6).

3 Analysis of
MISE for $\hat p$

Suppose we are given a data set $x$ that is partitioned into $M \ge 2$ subsets $x_m$, $m \in \{1, \ldots, M\}$. We are interested in the subset posterior densities $p_m(\theta) = p(\theta|x_m)$. For each such density we apply the analysis from before. Let $\hat p_m$ and $\bar p_m$, $m \in \{1, \ldots, M\}$, be the corresponding logspline estimators as defined in (70) and (71) respectively. By definition of $\hat p_m$, it is equal to the logspline density estimate on $\Omega_{n_m}^m \subset \Omega$, where $\Omega_{n_m}^m$ is the set defined in (69) for $\hat p_m$.

Definition 1.
For $m \in \{1, \ldots, M\}$, let $\Omega_{n_m}^m$ be the set defined in (6). We then set
$$\Omega_{M,N} := \bigcap_{m=1}^{M} \Omega_{n_m}^m, \quad \text{where } N = (n_1, \ldots, n_M),$$
which is the set where the maximizer for the log-likelihood exists given each data subset and thus all logspline density estimators $\hat p_m$ exist.

Lemma 2.
Suppose the conditions in (H3) and (H4) hold. Given the previous definition, we have that
$$\lim_{n \to \infty} P\left( \Omega_{M, N(n)} \right) = 1.$$

Proof.
By Theorem 53 we have that
$$P\left( \Omega \setminus \Omega_{M,N(n)} \right) = P\left( \bigcup_{m=1}^{M} \left(\Omega_{N_m(n)}^m\right)^c \right) \le \sum_{m=1}^{M} P\left( \left(\Omega_{N_m(n)}^m\right)^c \right) \le \sum_{m=1}^{M} e^{-N_m(n)\,\epsilon(L_m(n)+1)\,\delta_m(D)},$$
and the result follows by taking $n$ to infinity.

Since the probability of the set where the estimators $\hat p_m$ exist for all $m \in \{1, \ldots, M\}$ tends to 1, it makes sense to carry out our analysis for a conditional MISE on the set $\Omega_{M,N(n)}$. From a practical standpoint, we will never encounter the set where the maximizer of the log-likelihood does not exist.

At this point, let us state a bound for $|\hat p^*(\theta; \omega) - p^*(\theta)|$, which will be essential in our analysis of MISE.

Lemma 3.
Suppose the hypotheses (H1)-(H7) hold and that we are restricted to the sample subspace $\Omega_{M,N(n)}$. We then have the following:

(a) There exists a positive constant $R_1 = R_1(M)$ such that
$$\|\log(\hat p^*(\cdot, \omega)) - \log(\bar p^*(\cdot))\|_\infty \le R_1 \sum_{m=1}^{M} \frac{L_m(n) + 1}{\sqrt{N_m(n)}}.$$

(b) There exists a positive constant $R_2 = R_2(M, k, j, F_p, \gamma(T_{K_1(n)}), \ldots, \gamma(T_{K_M(n)}))$ such that
$$\|\log(p^*) - \log(\bar p^*)\|_\infty \le R_2\, \bar h_{\max}^{j+1} \sum_{m=1}^{M} \left\| \frac{d^{j+1}\log(p_m)}{d\theta^{j+1}} \right\|_\infty, \quad \text{where } \bar h_{\max} = \max_m \{\bar h_m\}.$$

(c) Using the bounds from (a) and (b) we have
$$|\hat p^*(\theta;\omega) - p^*(\theta)| \le \left( \exp\left\{ R_1 \sum_{m=1}^{M} \frac{L_m(n)+1}{\sqrt{N_m(n)}} + R_2\, \bar h_{\max}^{j+1} \sum_{m=1}^{M} \left\| \frac{d^{j+1}\log(p_m)}{d\theta^{j+1}} \right\|_\infty \right\} - 1 \right) p^*(\theta).$$

Proof. (a) The bound can be shown by writing
$$\|\log(\hat p^*(\cdot,\omega)) - \log(\bar p^*(\cdot))\|_\infty = \left\| \log\Big(\prod_{m=1}^M \hat p_m(\cdot;\omega)\Big) - \log\Big(\prod_{m=1}^M \bar p_m(\cdot)\Big) \right\|_\infty \le \sum_{m=1}^M \|\log(\hat p_m(\cdot;\omega)) - \log(\bar p_m(\cdot))\|_\infty$$
and then applying Theorem 56. For each $m \in \{1,\ldots,M\}$ there will be an $M_m$ appearing in the bound, and we can take $R_1 = \max_m\{M_m\}$.

(b) Similarly to part (a), we can write
$$\|\log(p^*(\cdot)) - \log(\bar p^*(\cdot))\|_\infty = \left\| \log\Big(\prod_{m=1}^M p_m(\cdot)\Big) - \log\Big(\prod_{m=1}^M \bar p_m(\cdot)\Big) \right\|_\infty \le \sum_{m=1}^M \|\log(p_m(\cdot)) - \log(\bar p_m(\cdot))\|_\infty$$
and then apply Lemma 47. For each $m \in \{1, \ldots, M\}$ there will be constants $M'_m$ and $C_m(k,j)$ appearing, and we can take $R_2 = \max_m\{M'_m C_m(k,j)\}$.

(c) To see why this is true, we write
$$|\hat p^*(\theta;\omega) - p^*(\theta)| = p^*(\theta)\left| \frac{\hat p^*(\theta;\omega)}{p^*(\theta)} - 1 \right| = p^*(\theta)\,\big|\exp\{\log(\hat p^*(\theta;\omega)) - \log(p^*(\theta))\} - 1\big|.$$
If $\hat p^*(\theta;\omega) \ge p^*(\theta)$, then
$$\big|\exp\{\log(\hat p^*(\theta;\omega)) - \log(p^*(\theta))\} - 1\big| = \exp\{\log(\hat p^*(\theta;\omega)) - \log(p^*(\theta))\} - 1.$$
If $\hat p^*(\theta;\omega) < p^*(\theta)$, then
$$\big|\exp\{\log(\hat p^*(\theta;\omega)) - \log(p^*(\theta))\} - 1\big| = 1 - \exp\{\log(\hat p^*(\theta;\omega)) - \log(p^*(\theta))\} \le \exp\{\log(p^*(\theta)) - \log(\hat p^*(\theta;\omega))\} - 1,$$
since $1 - e^{-x} \le e^x - 1$ for any $x \ge 0$. This implies
$$|\hat p^*(\theta;\omega) - p^*(\theta)| \le p^*(\theta)\left(\exp\{|\log(\hat p^*(\theta;\omega)) - \log(p^*(\theta))|\} - 1\right) \le p^*(\theta)\left(\exp\{|\log(\hat p^*(\theta;\omega)) - \log(\bar p^*(\theta))| + |\log(\bar p^*(\theta)) - \log(p^*(\theta))|\} - 1\right),$$
and the result follows by applying the bounds from (a) and (b).

Theorem 4. (Conditional
MISE for unnormalized $p^*$, $\hat p^*$ and $M \ge 2$.) Assume the conditions (H1)-(H7) hold. Given $M \ge 2$ we have
$$\mathrm{MISE}(p^*, \hat p^* \,|\, \Omega_{M,N(n)}) \le \left( \exp\left\{ R_1 \sum_{m=1}^M \frac{L_m(n)+1}{\sqrt{N_m(n)}} + R_2\, \bar h_{\max}^{j+1} \sum_{m=1}^M \left\|\frac{d^{j+1}\log(p_m)}{d\theta^{j+1}}\right\|_\infty \right\} - 1 \right)^{\!2} \|p^*\|_2^2, \tag{15}$$
where $R_1$, $R_2$ are as in Lemma 3. In addition, if (H8) holds, then MISE scales optimally in regards to the number of samples:
$$\sqrt{\mathrm{MISE}(p^*, \hat p^*)} = O\!\left( M n^{-\beta} \right) = O\!\left( M^{-\beta} \|N(n)\|^{-\beta} \right). \tag{16}$$

Proof.
By definition of the conditional MISE and Lemma 3, we have
$$\mathrm{MISE}(p^*, \hat p^* \,|\, \Omega_{M,N(n)}) = \mathbb{E}_{\Omega_{M,N(n)}} \int \left( \hat p^*(\theta;\omega) - p^*(\theta) \right)^2 d\theta \le \mathbb{E}_{\Omega_{M,N(n)}} \int \left( \exp\left\{ R_1 \sum_{m=1}^M \frac{L_m(n)+1}{\sqrt{N_m(n)}} + R_2\, \bar h_{\max}^{j+1} \sum_{m=1}^M \left\|\frac{d^{j+1}\log(p_m)}{d\theta^{j+1}}\right\|_\infty \right\} - 1 \right)^{\!2} \left( p^*(\theta) \right)^2 d\theta,$$
which implies (15). Next, if (H8) holds, then (16) follows directly.

Remark 5.
It is interesting to note how the number of knots, their placement and the number of samples all play a role in the above bound. If we want to be accurate, all of the parameters $L_m(n)$, $N_m(n)$ and $\bar h_{\max}$ must be chosen appropriately.

3.2 Analysis for the renormalization constant

We will now consider the error that arises for MISE when one renormalizes the product of the estimators so that it is a probability density. The renormalization can affect the error since $p^*$ and $\hat p^*$ are rescaled. We define the renormalization constant and its estimator to be
$$\lambda = \int p^*(\theta)\, d\theta \quad \text{and} \quad \hat\lambda = \hat\lambda(\omega) = \int \hat p^*(\theta;\omega)\, d\theta. \tag{17}$$
Therefore, we are interested in analyzing
$$\mathrm{MISE}(p, \hat p) = \mathrm{MISE}(c\, p^*, \hat c\, \hat p^*), \quad \text{where } c = \lambda^{-1},\ \hat c = \hat\lambda^{-1}.$$
We first state the following lemma for $\lambda$ and $\hat\lambda(\omega)$.

Lemma 6.
Let $\lambda$ and $\hat\lambda(\omega)$ be defined as in (17). Suppose that (H8) holds and that we are restricted to the sample subspace $\Omega_{M,N(n)}$. Then we have
$$\left| \frac{\hat\lambda(\omega)}{\lambda} - 1 \right| = O\!\left( M^{-\beta} \|N(n)\|^{-\beta} \right). \tag{18}$$

Proof.
By definition of $\lambda$ and $\hat\lambda(\omega)$, we have
$$|\lambda - \hat\lambda(\omega)| = \left| \int p^*(\theta)\, d\theta - \int \hat p^*(\theta;\omega)\, d\theta \right| \le \left( \exp\left\{ R_1 \sum_{m=1}^M \frac{L_m(n)+1}{\sqrt{N_m(n)}} + R_2\, \bar h_{\max}^{j+1} \sum_{m=1}^M \left\|\frac{d^{j+1}\log(p_m)}{d\theta^{j+1}}\right\|_\infty \right\} - 1 \right) \lambda,$$
where the inequality is justified by Lemma 3(c). Dividing by $\lambda$, the result then follows by hypothesis (H8).

What the above lemma shows is that, when restricted to the sample subspace $\Omega_{M,N(n)}$, the space where the logspline density estimators $\hat p_m$, $m \in \{1,\ldots,M\}$, are all defined, the renormalization constant $\hat c$ of the product of the estimators approximates the true renormalization constant $c$.

Knowing now how $\hat\lambda(\omega)$ scales, we can start analyzing $\mathrm{MISE}(p, \hat p)$ on the sample subspace. However, to make the analysis slightly easier we introduce a new functional, denoted $\overline{\mathrm{MISE}}$. This new functional is asymptotically equivalent to MISE, as we will show, thus providing us with the means to view how MISE scales without having to analyze it directly.

Definition 7.
Suppose $M \ge 2$ and hypotheses (H1)-(H2) hold. Given the sample subspace $\Omega_{M,N(n)}$, we define the functional
$$\overline{\mathrm{MISE}}\left( p, \hat p \,|\, \Omega_{M,N(n)} \right) = \mathbb{E}_{\Omega_{M,N(n)}} \left[ \left( \frac{\hat\lambda(\omega)}{\lambda} \right)^{\!2} \int \left( \hat p(\theta;\omega) - p(\theta) \right)^2 d\theta \right]. \tag{19}$$

Proposition 8.
The functional $\overline{\mathrm{MISE}}$ is asymptotically equivalent to MISE on $\Omega_{M,N(n)}$, in the sense that
$$\lim_{\|N(n)\| \to \infty} \frac{\overline{\mathrm{MISE}}\left(p, \hat p \,|\, \Omega_{M,N(n)}\right)}{\mathrm{MISE}\left(p, \hat p \,|\, \Omega_{M,N(n)}\right)} = 1. \tag{20}$$

Proof.
Notice that $\overline{\mathrm{MISE}}$ can be written as
$$\overline{\mathrm{MISE}}\left(p,\hat p\,|\,\Omega_{M,N(n)}\right) = \mathbb{E}_{\Omega_{M,N(n)}}\left[ \left(\frac{\hat\lambda}{\lambda}\right)^{\!2} \int \left( \hat p(\theta;\omega) - p(\theta) \right)^2 d\theta \right] = \mathbb{E}_{\Omega_{M,N(n)}}\left[ \left( \left(\frac{\hat\lambda}{\lambda} - 1\right)^{\!2} + 2\left(\frac{\hat\lambda}{\lambda} - 1\right) + 1 \right) \int \left( \hat p(\theta;\omega) - p(\theta) \right)^2 d\theta \right],$$
and thus by Lemma 6
$$\overline{\mathrm{MISE}}\left(p,\hat p\,|\,\Omega_{M,N(n)}\right) = (1 + E(n))\,\mathrm{MISE}\left(p,\hat p\,|\,\Omega_{M,N(n)}\right), \quad \text{where } E(n) = O\!\left( M^{-\beta}\|N(n)\|^{-\beta} \right),$$
which then implies the result.

We conclude our analysis with the next theorem, which states how MISE scales for the renormalized estimators.

Theorem 9.
Let $M \ge 2$. Assume the conditions (H1)-(H8) hold. Then
$$\mathrm{MISE}\left(p, \hat p \,|\, \Omega_{M,N(n)}\right) = O\!\left( M^{-2\beta} \|N(n)\|^{-2\beta} \right). \tag{21}$$

Proof.
We will do the work for $\overline{\mathrm{MISE}}$, and the result will follow from Proposition 8. Notice that $\overline{\mathrm{MISE}}$ can be written as below, where we let $\mathbb{E}_n(\cdot) = \mathbb{E}(\cdot\,|\,\Omega_{M,N(n)})$:
$$\overline{\mathrm{MISE}}\left(p,\hat p\,|\,\Omega_{M,N(n)}\right) = \mathbb{E}_n\left[ \left(\frac{\hat\lambda}{\lambda}\right)^{\!2} \int (p - \hat p)^2\, d\theta \right] = \|p\|_2^2\, \mathbb{E}_n\left[ \left(\frac{\hat\lambda}{\lambda} - 1\right)^{\!2} \right] + \lambda^{-2}\, \mathrm{MISE}_n(p^*, \hat p^*) - 2\lambda^{-1}\, \mathbb{E}_n \int \left(\frac{\hat\lambda}{\lambda} - 1\right)(\hat p^* - p^*)\, p\, d\theta = J_1 + J_2 + J_3.$$
We now determine how each of the $J_i$, $i \in \{1,2,3\}$, scales. For $J_1$, by Lemma 6 we have
$$J_1 = O\!\left( M^{-2\beta}\|N(n)\|^{-2\beta} \right);$$
for $J_2$ we have, from (H8),
$$J_2 = O\!\left( M^{-2\beta}\|N(n)\|^{-2\beta} \right);$$
and for $J_3$ we have, from Lemmas 3(c) and 6 together with the Cauchy-Schwarz inequality,
$$|J_3| \le 2\lambda^{-1} \left( \mathbb{E}_n \int \left|\frac{\hat\lambda}{\lambda} - 1\right| |\hat p^* - p^*|\, p\, d\theta \right) \le 2\lambda^{-1} \left( \mathbb{E}_n\left[ \left(\frac{\hat\lambda}{\lambda}-1\right)^{\!2} \int p^2\, d\theta \right] \right)^{1/2} \left( \mathrm{MISE}_n(p^*, \hat p^*) \right)^{1/2}.$$
Thus, by hypotheses (H7)-(H8),
$$|J_3| = O\!\left( M^{-2\beta}\|N(n)\|^{-2\beta} \right).$$

4 Analysis of MISE for the numerical estimator $\tilde p$

So far we have estimated the error that arises between the unknown density $p$ and the full-data estimator $\hat p$. However, in practice it is difficult to evaluate the renormalization constant
$$\hat\lambda(\omega) = \int \hat p^*(\theta)\, d\theta = \int \prod_{m=1}^{M} \hat p_m(\theta)\, d\theta$$
defined in (17). The difficulty is due to the process of generating MCMC samples, on account of which $\hat p^*$ is not explicitly known. In order to circumvent this issue, our idea is to approximate the integral above numerically. To accomplish this, we interpolate $\hat p^*$ using Lagrange polynomials. This procedure leads to
Unfortunately, to estimate the error by considering thatkind of approximation given an arbitrary grid of points for Lagrange polynomials, independent of theset of knots ( t i ) for B-splines gives a stringent condition on the smoothness of B-splines we incorporate.It turns out that we have to utilize B-splines of order at least k = 4. For this reason we consider usingLagrange polynomials of order l + 1 which satisfy l < k − We remind the reader the model we deal with throughout our work. We recall that the (marginal)posterior of the parameter θ ∈ R (which is a component of a multidimensional parameter θθθ ∈ R d ) giventhe data x = { x , x , . . . , x M } partitioned into M disjoint sets x m , m = 1 , . . . , M is assumed to have the form p ( θ | x ) ∝ M (cid:89) m =1 p m ( θ ) (22)with p ( θ | x m ) denoting the (marginal) posterior density of θ given data x m .The estimator ˆ p ( θ | x ) of the posterior p ( θ | x ) is taken to beˆ p ( θ | x ) ∝ M (cid:89) m =1 ˆ p m ( θ ) (23)where ˆ p m ( θ ) stands for the logspline density estimator of the sub-posterior density p m ( θ ). Recall fromDefinition 37 and hypotheses (H1)-(H5) that for each m ∈ { , . . . , M } , the estimator ˆ p m has the formˆ p m ( θ ) = exp ( B m ( θ ; ˆ y m ) − c (ˆ y m )) (24)where B m ( θ ; ˆ y m ) = L m ( n ) (cid:88) j =0 ˆ y mj B j,k,T Km ( n ) ( θ )and c (ˆ y m ) = log (cid:18)(cid:90) exp ( B m ( θ ; ˆ y m ) dθ ) (cid:19) The vector ˆ y m = (ˆ y m , . . . , ˆ y mL m ( n ) ) is the argument that maximizes the log-likelihood, as described inequation (65) and we also remind the reader that this maximizer exists for all m ∈ { , . . . , M } as wecarry out our analysis on the sample subspace Ω M, N ( n ) .Together with the hypotheses stated in section 3, we now add the next proposition which will benecessary for our work later on. Proposition 10.
Suppose hypotheses (H1)-(H8) hold. Given the space $\Omega_{M,N(n)}$, we have that the estimator $\hat p_m$ is bounded and its derivatives satisfy
$$\left| \hat p_m^{(\alpha)}(\theta) \right| \le C(\alpha, k, p_m)\, \|N(n)\|^{\alpha\beta/(j+1)} \quad \text{for } \theta \in (a,b) \text{ and } \alpha < k - 1,$$
where the constant $C(\alpha, k, p_m)$ depends on the order $k$ of the B-splines, the order $\alpha$ of the derivative and the density $p_m$.

Proof. Observe that the estimator $\hat p_m$ can be expressed as
$$\hat p_m(\theta) = \exp\left[ \sum_{j=0}^{L_m(n)} \hat y_{mj}\, B_{j,k}(\theta) - c(\hat y_m) \right] = \exp\left[ \sum_{j=0}^{L_m(n)} (\hat y_{mj} - c(\hat y_m))\, B_{j,k}(\theta) \right].$$
Then, applying Faà di Bruno's formula, we obtain
$$|\hat p_m^{(\alpha)}(\theta)| \le \hat p_m(\theta) \sum_{k_1 + 2k_2 + \cdots + \alpha k_\alpha = \alpha} \frac{\alpha!}{k_1!\, k_2! \cdots k_\alpha!} \prod_{i=1}^{\alpha} \left( \frac{\left| \frac{d^i}{d\theta^i} \sum_{j=0}^{L_m(n)} (\hat y_{mj} - c(\hat y_m)) B_{j,k}(\theta) \right|}{i!} \right)^{\!k_i}, \quad \text{for } \theta \in [t_i, t_{i+1}],$$
where $k_1, \ldots, k_\alpha$ are nonnegative integers, and if $k_i > 0$ for $i \ge k$ then that term in the sum above is zero, since almost everywhere $B^{(i)}_{j,k}(\theta) = 0$. By De Boor's formula [1, p. 132], we can estimate the derivative of a spline as follows:
$$\left| \frac{d^i}{d\theta^i} \sum_{j=0}^{L_m(n)} (\hat y_{mj} - c(\hat y_m)) B_{j,k}(\theta) \right| = \left| \frac{d^i}{d\theta^i} \log \hat p_m(\theta) \right| \le C\, \frac{\|\log \hat p_m\|_\infty}{h_m^i},$$
where the constant $C$ depends only on the order $k$ of the B-splines. Therefore, we can bound $|\hat p_m^{(\alpha)}(\theta)|$ as follows:
$$|\hat p_m^{(\alpha)}(\theta)| \le \hat p_m(\theta) \sum_{k_1+2k_2+\cdots+\alpha k_\alpha = \alpha} \frac{\alpha!}{k_1!\cdots k_\alpha!} \prod_{i=1}^{\alpha} \left( \frac{C \|\log \hat p_m\|_\infty}{i!\, h_m^i} \right)^{\!k_i} \le \hat p_m(\theta)\, \frac{C^\alpha \|\log \hat p_m\|_\infty^\alpha}{h_m^\alpha} \sum_{k_1+2k_2+\cdots+\alpha k_\alpha = \alpha} \frac{\alpha!}{k_1!\, k_2! \cdots k_\alpha!}.
$$
The above leads to the following bound:
$$\left| \hat p_m^{(\alpha)}(\theta) \right| \le \hat p_m(\theta) \left( 1 + \frac{C^\alpha \|\log \hat p_m\|_\infty^\alpha}{h_m^\alpha} \sum_{\zeta=1}^{\alpha} \frac{\alpha!}{\zeta!} (\alpha - \zeta + 1)^\zeta \right) \le C(k, \alpha)\, \hat p_m(\theta) \left( 1 + \frac{\|\log \hat p_m\|_\infty^\alpha}{h_m^\alpha} \right),$$
where $C(k,\alpha)$ is a constant that depends on the order $k$ and on $\alpha$. Next, recalling hypotheses (H3), (H4), (H6) and (H8), we obtain
$$\hat p_m(\theta) \le |\hat p_m(\theta) - p_m(\theta)| + p_m(\theta) \le \|p_m\|_\infty (1 + c\|N(n)\|^{-\beta})$$
and
$$\|\log \hat p_m\|_\infty \le \|\log \hat p_m - \log \bar p_m\|_\infty + \|\log \bar p_m - \log p_m\|_\infty + \|\log p_m\|_\infty \le c\|N(n)\|^{-\beta} + \|\log p_m\|_\infty,$$
where we also used Lemma 47, Theorem 56 and Lemma 3. Therefore,
$$\left| \hat p_m^{(\alpha)}(\theta) \right| \le C(k,\alpha)\, \|p_m\|_\infty (1 + \|N(n)\|^{-\beta}) \left( 1 + \frac{\|N(n)\|^{-\alpha\beta} + \|\log p_m\|_\infty^\alpha}{h_m^\alpha} \right) \le C(\alpha, k, p_m)\, \frac{1}{h_m^\alpha} = C(\alpha, k, p_m)\, (h_m^{j+1})^{-\alpha/(j+1)} \sim C(\alpha, k, p_m)\, \|N(n)\|^{\alpha\beta/(j+1)}.$$
The final result follows immediately, since the index $i$ was chosen arbitrarily and all interior knots are simple. This concludes the proof.

Remark 11.
Remark 30, in Section 6, allowed us to extend the bound to all $\theta \in (a,b)$ in the proof above. In fact, we can also extend the bound to the closed interval $[a,b]$. Since $a = t_0$ and $b = t_{K_m(n)}$ are knots with multiplicity $k$, any B-spline that is not continuous at those knots is just a polynomial that has been cut off, which means there is no blow-up. Thus, we can extend the bound by considering right-hand and left-hand limits of derivatives at $a$ and $b$, respectively. From this point on we consider that the bound in Proposition 10 holds for all $\theta \in [a,b]$.

Lemma 12. Assume hypotheses (H1)-(H8) hold. Suppose that for each $m = 1, \ldots, M$ the sub-posterior estimator $\hat p_m(\theta)$ is $\alpha$-times differentiable on $[a,b]$ for some positive integer $\alpha < k - 1$. Then the estimator $\hat p^*$ satisfies
$$\left| \frac{d^\alpha}{d\theta^\alpha} \hat p^*(\theta) \right| = \left| (\hat p_1 \cdots \hat p_M)^{(\alpha)}(\theta) \right| \le C(\alpha, k, p_1, \ldots, p_M)\, \|N(n)\|^{\alpha\beta/(j+1)}\, M^\alpha \quad \text{for } \theta \in [a,b], \tag{25}$$
where $C(\alpha, k, p_1, \ldots, p_M)$ depends on the order $k$ of the B-splines, the order $\alpha$ of the derivative and the densities $p_1, \ldots, p_M$.

Proof. Let $\theta \in [a,b]$. By Proposition 10 we have
$$|\hat p_m^{(\alpha)}(\theta)| \le C(\alpha, k, p_m)\, \|N(n)\|^{\alpha\beta/(j+1)}.$$
Then, using the general Leibniz rule and employing the above inequality, we obtain
$$\left| \frac{d^\alpha}{d\theta^\alpha} \hat p^*(\theta) \right| = \left| (\hat p_1 \cdots \hat p_M)^{(\alpha)}(\theta) \right| = \left| \sum_{i_1 + \cdots + i_M = \alpha} \frac{\alpha!}{i_1! \cdots i_M!}\, \hat p_1^{(i_1)} \cdots \hat p_M^{(i_M)} \right| \le \sum_{i_1+\cdots+i_M=\alpha} \frac{\alpha!}{i_1!\cdots i_M!}\, C(i_1,k,p_1)\|N(n)\|^{i_1\beta/(j+1)} \cdots C(i_M,k,p_M)\|N(n)\|^{i_M\beta/(j+1)} = \|N(n)\|^{\alpha\beta/(j+1)} \sum_{i_1+\cdots+i_M=\alpha} \frac{\alpha!}{i_1!\cdots i_M!}\, C(i_1,k,p_1)\cdots C(i_M,k,p_M).$$
From the proof of Proposition 10, notice that $C(i,k,p_m) \le C(j,k,p_m)$ for positive integers $i \le j$. Therefore, we have
$$\left| \frac{d^\alpha}{d\theta^\alpha}\hat p^*(\theta) \right| \le C(\alpha,k,p_1,\ldots,p_M)\,\|N(n)\|^{\alpha\beta/(j+1)} \sum_{i_1+\cdots+i_M=\alpha}\frac{\alpha!}{i_1!\cdots i_M!},$$
where $C(\alpha,k,p_1,\ldots,p_M) = C(\alpha,k,p_1)\cdots C(\alpha,k,p_M)$, and the result follows from the multinomial theorem, since the sum equals $M^\alpha$. This concludes the proof.

By Remark 30, in Section 6, we have that B-splines of order $k$, and therefore any splines that arise from these, have $k - 2$ continuous derivatives on $(a,b)$. Thus, in order to utilize Lemma 59, we must have that the degree of the Lagrange polynomials be at most k −
2, i.e. l ≤ k −
3. Since l ≥ k ≥ ≤ l ≤ k − N ∈ N be the number of sub-intervals of [ a, b ] on each of which we will interpolate the product ofestimators by the polynomial of degree l . Thus each sub-interval has to be further subdivided into l intervals. Define the partition X of [ a, b ] such that X = { a = x < x < x < · · · < x Nl = b } and x i +1 − x i = b − aNl = ∆ x. (26)For each i = 0 , . . . , N −
1, recalling the formula (87), we define the (random) Lagrange polynomialˆ q i ( θ ) := l (cid:88) τ =0 ˆ p ∗ ( x il + τ ) l τ,i ( θ ) with l τ,i ( θ ) := (cid:89) j ∈{ ,...,l }\{ τ } (cid:18) θ − x il + j x il + τ − x il + j (cid:19) , (27)which is a polynomial that interpolates the estimator ˆ p ∗ ( θ ) on the interval [ x il , x ( i +1) l ]. We next definean interpolant estimator ˜ p ∗ to be a random composite polynomial given by˜ p ∗ ( θ ) := (cid:40) , θ ∈ R \ [ a, b ]ˆ q i ( θ ) , θ ∈ [ x il , x ( i +1) l ] (28) hich approximates the estimator ˆ p ∗ on the whole interval [ a, b ].We are now ready to estimate the mean integrated squared error given byMISE (cid:0) p ∗ , ˜ p ∗ | Ω M, N ( n ) (cid:1) = E (cid:90) (cid:0) p ∗ ( θ ) − ˜ p ∗ ( θ ) (cid:1) dθ (29) Lemma 13.
Assume that hypotheses (H1)-(H8) hold and let $\tilde p^*$ be the interpolant of $\hat p^*$ defined in (28), given the partition $X$ from (26). Provided $1 \le l \le k-3$, the following estimate holds:
$$\mathrm{MISE}(\hat p^*, \tilde p^*\,|\,\Omega_{M,N(n)}) = E\int_a^b\big(\hat p^*(\theta)-\tilde p^*(\theta)\big)^2\,d\theta \le \bigg(\frac{(\Delta x)^{l+1}}{4(l+1)}\,\|N(n)\|^{(l+1)\beta/(j+1)}\,M^{l+1}\bigg)^2\, C(l+1,k,p_1,\dots,p_M,(a,b)), \tag{30}$$
where the constant $C(l+1,k,p_1,\dots,p_M,(a,b))$ depends on the order $l+1$ of the Lagrange polynomials, the order $k$ of the B-splines, the densities $p_1,\dots,p_M$ and the length of the interval $(a,b)$.

Proof. Let $i \in \{0,\dots,N-1\}$. By Lemma 59, Lemma 12, and (28), for any $\theta \in [x_{il}, x_{(i+1)l}]$ we have
$$\big|\hat p^*(\theta) - \tilde p^*(\theta)\big| = \big|\hat p^*(\theta) - \hat q_i(\theta)\big| \le \bigg(\sup_{\theta\in[x_{il},x_{(i+1)l}]}\Big|\frac{d^{\,l+1}}{d\theta^{\,l+1}}\hat p^*(\theta)\Big|\bigg)\frac{(\Delta x)^{l+1}}{4(l+1)} \le \frac{(\Delta x)^{l+1}}{4(l+1)}\,C(l+1,k,p_1,\dots,p_M)\,\|N(n)\|^{(l+1)\beta/(j+1)}\,M^{l+1}. \tag{31}$$
Thus we conclude that
$$E\int_a^b\big(\hat p^*(\theta)-\tilde p^*(\theta)\big)^2\,d\theta = \sum_{i=0}^{N-1} E\int_{x_{il}}^{x_{(i+1)l}}\big(\hat p^*(\theta)-\hat q_i(\theta)\big)^2\,d\theta \le \bigg(\frac{(\Delta x)^{l+1}}{4(l+1)}\,\|N(n)\|^{(l+1)\beta/(j+1)}\,M^{l+1}\bigg)^2\, C(l+1,k,p_1,\dots,p_M,(a,b)),$$
where $C(l+1,k,p_1,\dots,p_M,(a,b)) = C(l+1,k,p_1,\dots,p_M)^2\,(b-a)$.

Now that we have bounded the error between $\hat p^*$ and $\tilde p^*$, we define the renormalization constant $\tilde c$ and the density estimator $\tilde p$ of $\hat p$:
$$\frac{1}{\tilde c} = \tilde\lambda = \int_a^b \tilde p^*(\theta)\,d\theta \quad\text{and}\quad \tilde p := \tilde c\,\tilde p^*. \tag{32}$$
The question now is how close $\tilde\lambda$ is to $\hat\lambda$. This is answered in the following lemma.

Lemma 14.
Given the definitions of $\hat\lambda$ and $\tilde\lambda$ in (17) and (32) respectively, the distance between the two renormalization constants is bounded by
$$|\hat\lambda - \tilde\lambda| \le \bigg(\frac{(\Delta x)^{l+1}}{4(l+1)}\,\|N(n)\|^{(l+1)\beta/(j+1)}\,M^{l+1}\bigg)\, R(l+1,k,p_1,\dots,p_M,(a,b)), \tag{33}$$
where the constant $R(l+1,k,p_1,\dots,p_M,(a,b)) = C(l+1,k,p_1,\dots,p_M)\,(b-a)$.

Proof. We write
$$|\hat\lambda - \tilde\lambda| \le \int_a^b |\hat p^*(\theta) - \tilde p^*(\theta)|\,d\theta,$$
and then apply the Lagrange interpolation error bound from Lemma 59, as in (31).

We continue by following the same steps as in subsection 3.2. The idea is to introduce a functional that scales the same as $\mathrm{MISE}(\hat p, \tilde p\,|\,\Omega_{M,N(n)})$.

Definition 15. Suppose $M \ge 1$ and hypotheses (H1), (H2) and (H8) hold. Given the sample subspace $\Omega_{M,N(n)}$ we define the functional
$$\widetilde{\mathrm{MISE}}\big(\hat p, \tilde p\,|\,\Omega_{M,N(n)}\big) = E_{\Omega_{M,N(n)}}\Bigg[\bigg(\frac{\tilde\lambda}{\hat\lambda(\omega)}\bigg)^2 \int \big(\hat p(\theta;\omega) - \tilde p(\theta)\big)^2\,d\theta\Bigg]. \tag{34}$$

Proposition 16.
The functional $\widetilde{\mathrm{MISE}}$ is asymptotically equivalent to $\mathrm{MISE}$ on $\Omega_{M,N(n)}$, in the sense that
$$\lim_{\Delta x \to 0} \frac{\widetilde{\mathrm{MISE}}\big(\hat p, \tilde p\,|\,\Omega_{M,N(n)}\big)}{\mathrm{MISE}\big(\hat p, \tilde p\,|\,\Omega_{M,N(n)}\big)} = 1. \tag{35}$$

Proof. Notice that $\widetilde{\mathrm{MISE}}$ can be written as
$$\widetilde{\mathrm{MISE}}\big(\hat p, \tilde p\,|\,\Omega_{M,N(n)}\big) = E_{\Omega_{M,N(n)}}\Bigg[\bigg(\frac{\tilde\lambda - \hat\lambda}{\hat\lambda} + 1\bigg)^2 \int \big(\hat p(\theta;\omega) - \tilde p(\theta)\big)^2\,d\theta\Bigg] = E_{\Omega_{M,N(n)}}\Bigg[\bigg(\lambda^{-2}\Big(\frac{\lambda}{\hat\lambda}\Big)^2\big(\tilde\lambda - \hat\lambda\big)^2 + 2\lambda^{-1}\frac{\lambda}{\hat\lambda}\big(\tilde\lambda - \hat\lambda\big) + 1\bigg) \int \big(\hat p(\theta;\omega) - \tilde p(\theta)\big)^2\,d\theta\Bigg].$$
Thus, by Lemmas 6 and 14, where the former implies
$$\frac{\lambda}{\hat\lambda} \le \frac{1}{1 - C\,M^{-\beta}\|N(n)\|^{-\beta}}$$
for $n$ large enough that $1 - C\,M^{-\beta}\|N(n)\|^{-\beta} > 0$, we have
$$\widetilde{\mathrm{MISE}}\big(\hat p, \tilde p\,|\,\Omega_{M,N(n)}\big) = \big(1 + E(n)\big)\,\mathrm{MISE}\big(\hat p, \tilde p\,|\,\Omega_{M,N(n)}\big) \quad\text{with}\quad E(n) = O\big(M^{l+1}(\Delta x)^{l+1}\big),$$
which implies the result.

Theorem 17.
Let $M \ge 1$ and assume that conditions (H1)-(H8) hold. Then
$$\widetilde{\mathrm{MISE}}\big(\hat p, \tilde p\,|\,\Omega_{M,N(n)}\big) = O\bigg[\Big(\|N(n)\|^{\beta/(j+1)}\,(\Delta x)\,M\Big)^{2(l+1)}\bigg]. \tag{36}$$

Proof. We carry out the work for $\widetilde{\mathrm{MISE}}$, and the result then follows from Proposition 16. Let $E_n(\cdot) = E(\cdot\,|\,\Omega_{M,N(n)})$. Notice that $\widetilde{\mathrm{MISE}}$ can be written as
$$\widetilde{\mathrm{MISE}}\big(\hat p, \tilde p\,|\,\Omega_{M,N(n)}\big) = E_n\Bigg[\bigg(\frac{\tilde\lambda}{\hat\lambda}\bigg)^2\int\big(\hat p(\theta;\omega)-\tilde p(\theta)\big)^2 d\theta\Bigg] = E_n\int\hat\lambda^{-2}\big(\tilde\lambda\,\hat p - \tilde p^*\big)^2 d\theta \le \frac{\lambda^{-2}}{\big(1 - C M^{-\beta}\|N(n)\|^{-\beta}\big)^2}\,E_n\int\Big((\tilde\lambda-\hat\lambda)(\hat p - p) + (\tilde\lambda-\hat\lambda)\,p + (\hat p^* - \tilde p^*)\Big)^2 d\theta \le \frac{\lambda^{-2}}{\big(1 - C M^{-\beta}\|N(n)\|^{-\beta}\big)^2}\,\big(J_1 + J_2 + J_3 + J_4 + J_5 + J_6\big),$$
where we used the decomposition $\tilde\lambda\,\hat p - \tilde p^* = (\tilde\lambda-\hat\lambda)(\hat p - p) + (\tilde\lambda-\hat\lambda)\,p + (\hat p^* - \tilde p^*)$, together with the bound on $\lambda/\hat\lambda$ from the proof of Proposition 16, and where
$$J_1 = E_n\int(\tilde\lambda-\hat\lambda)^2(\hat p - p)^2\,d\theta, \qquad J_2 = E_n\int(\tilde\lambda-\hat\lambda)^2 p^2\,d\theta, \qquad J_3 = E_n\int(\hat p^*-\tilde p^*)^2\,d\theta,$$
$$J_4 = 2\,E_n\int(\tilde\lambda-\hat\lambda)^2(\hat p - p)\,p\,d\theta, \qquad J_5 = 2\,E_n\int(\tilde\lambda-\hat\lambda)(\hat p - p)(\hat p^*-\tilde p^*)\,d\theta, \qquad J_6 = 2\,E_n\int(\tilde\lambda-\hat\lambda)(\hat p^*-\tilde p^*)\,p\,d\theta.$$
By hypotheses (H1)-(H8) and Lemmas 9, 13 and 14, we obtain
$$|J_1| \le C\Big(\|N(n)\|^{\beta/(j+1)}(\Delta x)\,M\Big)^{2(l+1)}\cdot M^{-\beta}\|N(n)\|^{-2\beta},$$
$$|J_2| + |J_3| + |J_6| \le C\Big(\|N(n)\|^{\beta/(j+1)}(\Delta x)\,M\Big)^{2(l+1)},$$
$$|J_4| + |J_5| \le C\Big(\|N(n)\|^{\beta/(j+1)}(\Delta x)\,M\Big)^{2(l+1)}\cdot M^{-\beta/2}\|N(n)\|^{-\beta},$$
which for large $n$ implies the result.

Theorem 18.
Assume that hypotheses (H1)-(H8) hold. Let $\tilde p$ be the renormalized interpolant of $\hat p$ as defined in (28) and (32), given the partition $X$. We then have the estimate
$$\mathrm{MISE}(p, \tilde p\,|\,\Omega_{M,N(n)}) = E\int_a^b\big(p(\theta) - \tilde p(\theta)\big)^2\,d\theta \le C\Bigg[M^{-\beta}\|N(n)\|^{-2\beta} + \bigg((\Delta x)\,\|N(n)\|^{\beta/(j+1)}\,M\bigg)^{2(l+1)}\Bigg], \tag{37}$$
where the constant $C$ depends on the order $k$ of the B-splines, the degree $l$ of the interpolating polynomial, the densities $p_1,\dots,p_M$ and the length of the interval $(a,b)$. Furthermore, if $\Delta x$ is chosen as a function of the vector of samples $N(n)$, then MISE scales optimally with respect to $N(n)$, in the sense that
$$\mathrm{MISE}(p, \tilde p\,|\,\Omega_{M,N(n)}) \le C\,\|N(n)\|^{-2\beta} \quad\text{when}\quad \Delta x = O\Big(\|N(n)\|^{-\beta\left(\frac{1}{l+1} + \frac{1}{j+1}\right)}\Big). \tag{38}$$

Proof. Observe that
$$\mathrm{MISE}(p, \tilde p\,|\,\Omega_{M,N(n)}) \le 2\,E\int_a^b\big(p(\theta) - \hat p(\theta)\big)^2\,d\theta + 2\,E\int_a^b\big(\hat p(\theta) - \tilde p(\theta)\big)^2\,d\theta =: 2(I_1 + I_2).$$
(37) then follows from Theorem 9 and Theorem 17. Using that estimate we can ask the following question. Suppose that we choose $\Delta x$ to be a function of the number of samples, so that
$$c_1\,\|N(n)\|^{-\alpha} \le \Delta x(n) \le c_2\,\|N(n)\|^{-\alpha} \tag{39}$$
for some constants $c_1, c_2$ and $\alpha > 0$. Clearly, one would not like $\Delta x$ to be excessively small, in order to avoid the difficulties with round-off error that appear in computation. On the other hand, one would like the error to converge to zero as fast as possible. Thus let us find the smallest rate $\alpha$ for which the asymptotic rate achieves its maximum. To this end we define the function
$$R(\alpha) := -\lim_{\|N(n)\|\to\infty} \log_{\|N(n)\|} \mathrm{MISE}(p, \tilde p\,|\,\Omega_{M,N(n)}),$$
which describes the asymptotic rate of convergence of the mean integrated squared error. By (37) we have
$$R(\alpha) = \begin{cases} 2\beta, & \alpha \ge \beta\Big(\dfrac{1}{l+1} + \dfrac{1}{j+1}\Big), \\[2mm] 2(l+1)\Big(\alpha - \dfrac{\beta}{j+1}\Big), & \alpha < \beta\Big(\dfrac{1}{l+1} + \dfrac{1}{j+1}\Big). \end{cases}$$
The smallest rate for which $R(\alpha)$ achieves its maximum value of $2\beta$ is therefore $\alpha = \beta\big(\frac{1}{l+1} + \frac{1}{j+1}\big)$. This concludes the proof.

Numerical Experiments
This numerical experiment, as well as the following ones, is designed to investigate the relationship between the approximated value of $\mathrm{MISE}(p, \tilde p\,|\,\Omega_{M,N(n)})$ and the bound given by (38). One iteration of the experiment generates $M = 3$ subsets of a predetermined number of MCMC samples, each drawn from a normal distribution with mean 2, for $m = 1, 2, 3$; $\tilde p$ is computed a hundred times by re-sampling, in order to obtain an approximation to MISE and its standard deviation. For this specific example, we perform ten iterations, starting with 20,000 samples and increasing that number by 10,000 for each experiment. In the experiments we ran, we chose the parameters so that the optimal rate of convergence for MISE was obtained. Thus, $\beta = 1/2$ and cubic B-splines were used ($k = 4$), which implies $l = 1$ in (38). Furthermore, we chose $j = 1$. This yields the rate $C\|N\|^{-1}$ as the upper bound for the convergence rate of MISE.

Figure 1: The full data posterior (black line) is shown with the 3 subset posterior densities (red, blue, green) for one iteration of 110,000 samples.

Figure 2: The full data posterior (black line) is shown with the combined subset posterior density (blue points) for one iteration of 110,000 samples.

Figure 3: The average MISE estimate is depicted for the ten experiments along with standard deviation bars (black) plotted on a log-log scale with a regression line added. The red line is the upper bound of (38) as calculated for the different number of samples.
Notice in Figure 3 how the regression line and the theoretical error line appear parallel. This indicates that the rate obtained from (38) is numerically attained.

Numerical experiment with gamma subset posterior densities

This experiment mimics the previous one with the normally distributed MCMC samples, with the difference that they are now generated by a Gamma distribution with shape parameter 1, with the number of samples again ranging from 20,000 to 110,000 in increments of 10,000 per iteration. Furthermore, $M = 5$ subsets are now created.

Figure 4: The full data posterior (black line) is shown with the 5 subset posterior densities (red, blue, green, purple, gray) for one iteration of 110,000 samples.

Figure 5: The full data posterior (black line) is shown with the combined subset posterior density (blue points) for one iteration of 110,000 samples.

Figure 6: The averaged MISE is depicted for the ten experiments along with standard deviation bars (black) plotted on a log-log scale with a regression line added. The red line is the upper bound of (38) as calculated for the different number of samples.
The result we obtain from Figure 6 is similar to the previous example. The rate from the bound (38) isagain numerically satisfied.
In this series of experiments we employ data on US flights from or to the state of New York during January 2018, obtained from the Bureau of Transportation Statistics [12]. We were specifically interested in delayed arrival times, that is, flights that arrived 15 minutes or more after the scheduled time. There were a total of 12,100 such flights, which were divided into 5 data subsets of 2,420 each. We assumed that the delayed arrival times follow a Gamma distribution with some shape parameter and rate parameter. In what follows, we do inference for the shape parameter, denoted by $\alpha$. Using the JAGS sampling package [11], we generated samples from the marginal full data posterior distribution and the subset posterior distributions for $\alpha$. The data were shuffled beforehand to ensure that the condition of independence between subsets is satisfied. Ten iterations were performed, starting with 20,000 samples and increasing that number by 10,000. In each iteration, the values were then re-sampled 100 times in order to obtain an approximation to MISE and its standard deviation. Similar to the first example in this section, the parameters were chosen to achieve the optimal convergence rate for MISE. Therefore, $\beta = 1/2$ and cubic B-splines were implemented, which implies $l = 1$, and we chose $j = 1$. These yield the rate $C\|N\|^{-1}$ for MISE, as given in (38).

Numerical results

Figure 7: The full data posterior (black line) is shown with the 5 subset posterior densities (red, blue, green, purple, gray) for one iteration of 110,000 samples.

Figure 8: The full data posterior (black line) is shown with the combined subset posterior density (blue points) for one iteration of 110,000 samples.

Figure 9: The averaged MISE is depicted for the ten experiments along with standard deviation bars (black) plotted on a log-log scale with a regression line added. The red line is the upper bound of (38) as calculated for the different number of samples.
From Figure 9, the conclusion is similar to that of the previous examples with simulated data. The regression line shows that the rate given by (38), with the parameters chosen as described above, is again numerically attained.
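The piecewise Lagrange construction (27)-(28) that underlies these experiments is straightforward to implement. The sketch below is our own illustrative code, not the implementation used to produce the figures; it builds a single degree-$l$ interpolant from nodes and values.

```python
def lagrange_interpolant(xs, ys):
    """Return the Lagrange polynomial through the points (xs[tau], ys[tau]),
    i.e. q(theta) = sum_tau ys[tau] * l_tau(theta), as in (27)."""
    def q(theta):
        total = 0.0
        for tau in range(len(xs)):
            basis = 1.0
            for j in range(len(xs)):
                if j != tau:
                    basis *= (theta - xs[j]) / (xs[tau] - xs[j])
            total += ys[tau] * basis
        return total
    return q

# degree l = 1 (the choice in the experiments above) is linear interpolation
q1 = lagrange_interpolant([0.0, 1.0], [2.0, 4.0])
assert abs(q1(0.5) - 3.0) < 1e-12

# a degree-2 interpolant reproduces x^2 exactly at and between its nodes
q2 = lagrange_interpolant([0.0, 1.0, 2.0], [0.0, 1.0, 4.0])
assert abs(q2(1.5) - 2.25) < 1e-12
```

In the estimator $\tilde p^*$ of (28), one such polynomial is built per block of $l$ sub-intervals of the partition $X$, with the values `ys` given by evaluations of $\hat p^*$ at the partition points.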
Here we provide all the relevant results related to B-splines and logspline density estimators based onthe works of [1, 2, 3, 4].
In this section we define the logspline family of densities and give an overview of how the logspline density estimator is chosen for the density $p$. The idea behind logspline density estimation of an unknown density $p$ is that the logarithm of $p$ is estimated by a spline function, a piecewise polynomial that interpolates the function to be estimated. The family of estimators constructed for the unknown density therefore consists of exponentials of splines, suitably normalized so that they are densities. To build up the estimation method, we start with the building blocks of splines themselves, the basis splines, or B-splines for short, whose linear combinations generate the set of splines of a given order.

The first question we answer is how to construct B-splines. There are several ways to do this, some less intuitive than others. The approach we take is through divided differences: a recursive division process used to calculate the coefficients of interpolating polynomials written in a specific form called the Newton form.

Definition 19.
The $k$th divided difference of a function $g$ at the knots $t_0,\dots,t_k$ is the leading coefficient (meaning the coefficient of $x^k$) of the interpolating polynomial $q$ of order $k+1$ that agrees with $g$ at those knots. We denote this number by
$$[t_0,\dots,t_k]\,g. \tag{40}$$

Here we use the terminology of De Boor [1], where a polynomial of order $k+1$ is a polynomial of degree less than or equal to $k$. It is preferable to work with the "order" of a polynomial, since all polynomials of a given order form a vector space, whereas polynomials of a given degree do not. The term "agree" in the definition means that, for the sequence of knots $(t_i)_{i=0}^k$, if $\zeta$ appears in the sequence $m$ times, then the interpolating polynomial satisfies
$$q^{(i-1)}(\zeta) = g^{(i-1)}(\zeta), \quad i = 1,\dots,m. \tag{41}$$
Since the interpolating polynomial depends only on the data points, the order in which the values $t_0,\dots,t_k$ appear in the notation (40) does not matter. Also, if all the knots are distinct, then the interpolating polynomial is unique.

At this point let us write down some examples to see how the recursion arises. If we interpolate a function $g$ using only one knot, say $t_0$, then we of course have the constant polynomial $q(x) = g(t_0)$. Thus, since $g(t_0)$ is the only coefficient, we have
$$[t_0]\,g = g(t_0). \tag{42}$$
Now suppose we have two knots, $t_0, t_1$. If $t_0 \ne t_1$, then $q$ is the secant line defined by the two points $(t_0, g(t_0))$ and $(t_1, g(t_1))$. Thus, the interpolating polynomial is
$$q(x) = g(t_0) + (x - t_0)\,\frac{g(t_1) - g(t_0)}{t_1 - t_0}. \tag{43}$$
Therefore,
$$[t_0, t_1]\,g = \frac{g(t_1) - g(t_0)}{t_1 - t_0} = \frac{[t_1]\,g - [t_0]\,g}{t_1 - t_0}. \tag{44}$$
To see what happens when $t_0 = t_1$, we take the limit $t_1 \to t_0$ above, which gives $[t_0, t_0]\,g = g'(t_0)$. Continuing these calculations for more knots yields the following result:

Lemma 20.
Given a function $g$ and a sequence of knots $(t_i)_{i=0}^k$, the $k$th divided difference of $g$ satisfies:

(a) $[t_0,\dots,t_k]\,g = \dfrac{g^{(k)}(t_0)}{k!}$ when $t_0 = \cdots = t_k$ and $g \in C^k$, thus yielding the leading coefficient of the Taylor approximation of order $k+1$ to $g$.

(b) $[t_0,\dots,t_k]\,g = \dfrac{[t_0,\dots,t_{r-1},t_{r+1},\dots,t_k]\,g - [t_0,\dots,t_{s-1},t_{s+1},\dots,t_k]\,g}{t_s - t_r}$, where $t_r$ and $t_s$ are any two distinct knots in the sequence $(t_i)_{i=0}^k$.

Now that we have defined the $k$th divided difference of a function, we can easily state what B-splines are. B-splines arise as appropriately scaled divided differences of the positive part of a certain power function, and it can be shown that they form a basis of the linear space of splines of a given order. Let us start with the definition.
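First, though, the divided-difference recursion of Lemma 20(b) is easy to check numerically for distinct knots; a minimal sketch (our own code, for illustration only):

```python
def divided_difference(ts, g):
    """Compute [t_0, ..., t_k] g by the recursion of Lemma 20(b),
    assuming the knots ts are pairwise distinct."""
    if len(ts) == 1:
        return g(ts[0])  # [t_0] g = g(t_0), as in (42)
    left = divided_difference(ts[:-1], g)   # drop the last knot
    right = divided_difference(ts[1:], g)   # drop the first knot
    return (right - left) / (ts[-1] - ts[0])

g = lambda x: x * x
# [t_0, t_1] g is the slope of the secant line, as in (44)
assert divided_difference([1.0, 3.0], g) == (g(3.0) - g(1.0)) / (3.0 - 1.0)
# the leading coefficient of the interpolant of x^2 at three knots is 1
assert abs(divided_difference([0.0, 1.0, 2.0], g) - 1.0) < 1e-12
```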
Definition 21.
Let $t = (t_i)_{i=0}^N$ be a nondecreasing sequence of knots and let $1 \le k \le N$. The $j$-th B-spline of order $k$, with $j \in \{0, 1, \dots, N-k\}$, for the knot sequence $(t_i)_{i=0}^N$ is denoted by $B_{j,k,t}$ and is defined by the rule
$$B_{j,k,t}(x) = (t_{j+k} - t_j)\,[t_j,\dots,t_{j+k}]\,(\cdot - x)_+^{k-1}, \tag{45}$$
where $(\,\cdot\,)_+$ denotes the positive part of a function, i.e. $(f(x))_+ = \max\{f(x), 0\}$.

The "placeholder" notation in the above definition says that the $k$th divided difference of $(\cdot - x)_+^{k-1}$ is to be taken for the function $t \mapsto (t - x)_+^{k-1}$ with $x$ held fixed. Of course, the resulting number varies as $x$ varies, giving rise to the function $B_{j,k,t}$. If either $k$ or $t$ can be inferred from context, we will usually drop them from the notation and write $B_j$ instead of $B_{j,k,t}$. A direct consequence of the above definition concerns the support of $B_{j,k,t}$.

Lemma 22.
Let $B_{j,k,t}$ be defined as in Definition 21. Then the support of the function is contained in the interval $[t_j, t_{j+k})$.

Proof. All we need to show is that if $x \notin [t_j, t_{j+k})$, then $B_{j,k,t}(x) = 0$. Suppose first that $x \ge t_{j+k}$. Then $t_i - x \le 0$ for $i = j,\dots,j+k$, which implies $(t_i - x)_+ = 0$ and finally $[t_j,\dots,t_{j+k}]\,(\cdot - x)_+^{k-1} = 0$. On the other hand, if $x < t_j$, then $(t - x)_+^{k-1}$, as a function of $t$ on the knots, coincides with the polynomial $(t - x)^{k-1}$ of order $k$, and since there are $k+1$ sites where this polynomial agrees with its interpolating polynomial, the two are identical. This implies $[t_j,\dots,t_{j+k}]\,(\cdot - x)_+^{k-1} = 0$, since the coefficient of $t^k$ in a polynomial of order $k$ is zero.

Since we defined B-splines via divided differences, we can use this to derive the recurrence relation for B-splines, which will be useful when we later prove various properties of these functions. We start by stating and proving the Leibniz formula, which is needed in the proof of the recurrence relation.
Lemma 23.
Suppose $f, g, h$ are functions such that $f = g \cdot h$, meaning $f(x) = g(x)h(x)$ for all $x$, and let $(t_i)$ be a sequence of knots. Then, for $j, k \in \mathbb{N}$, we have the formula
$$[t_j,\dots,t_{j+k}]\,f = \sum_{r=j}^{j+k} \big([t_j,\dots,t_r]\,g\big)\big([t_r,\dots,t_{j+k}]\,h\big). \tag{46}$$

Proof.
First, observe that the function
$$\bigg(g(t_j) + \sum_{r=j+1}^{j+k}(x - t_j)\cdots(x - t_{r-1})\,[t_j,\dots,t_r]\,g\bigg)\cdot\bigg(h(t_{j+k}) + \sum_{s=j}^{j+k-1}(x - t_{s+1})\cdots(x - t_{j+k})\,[t_s,\dots,t_{j+k}]\,h\bigg)$$
agrees with $f$ at the knots $t_j,\dots,t_{j+k}$, since the first and second factors agree with $g$ and $h$, respectively, at those values. Expanding the product, every term with $r > s$ vanishes at all the knots, since the factor $(x - t_i)$, for each $i = j,\dots,j+k$, appears in at least one of the two terms of such a product. Thus the sum of the terms with $r \le s$ agrees with $f$ at $t_j,\dots,t_{j+k}$, and it is a polynomial of order $k+1$ whose leading coefficient is
$$\sum_{r=s} \big([t_j,\dots,t_r]\,g\big)\big([t_s,\dots,t_{j+k}]\,h\big),$$
which must therefore equal $[t_j,\dots,t_{j+k}]\,f$.

Now we can state and prove the recurrence relation for B-splines.
Lemma 24.
Let $t = (t_i)_{i=0}^N$ be a sequence of knots and let $1 \le k \le N$. For $j \in \{0,1,\dots,N-k\}$ the $j$-th B-spline $B_{j,k}$ of order $k$ associated with the knots $t = (t_i)_{i=0}^N$ can be constructed as follows:

(1) $B_{j,1}$ is the characteristic function of the interval $[t_j, t_{j+1})$:
$$B_{j,1}(x) = \begin{cases} 1, & x \in [t_j, t_{j+1}), \\ 0, & x \notin [t_j, t_{j+1}). \end{cases} \tag{47}$$

(2) The B-splines of order $k$, for $k > 1$, on $[t_j, t_{j+k})$ are given by
$$B_{j,k}(x) = \frac{x - t_j}{t_{j+k-1} - t_j}\,B_{j,k-1}(x) + \frac{t_{j+k} - x}{t_{j+k} - t_{j+1}}\,B_{j+1,k-1}(x). \tag{48}$$

Proof. (1) follows easily from the divided-difference definition of B-splines in Definition 21. (2) can be proven using Lemma 23. Since B-splines were defined using the function $(t - x)_+^{k-1}$ for fixed $x$, we apply the Leibniz formula for the $k$th divided difference to the product
$$(t - x)_+^{k-1} = (t - x)\,(t - x)_+^{k-2}.$$
This yields
$$[t_j,\dots,t_{j+k}]\,(\cdot - x)_+^{k-1} = (t_j - x)\,[t_j,\dots,t_{j+k}]\,(\cdot - x)_+^{k-2} + 1\cdot[t_{j+1},\dots,t_{j+k}]\,(\cdot - x)_+^{k-2}, \tag{49}$$
since $[t_j](\cdot - x) = (t_j - x)$, $[t_j, t_{j+1}](\cdot - x) = 1$ and $[t_j,\dots,t_r](\cdot - x) = 0$ for $r > j+1$. Now, from Lemma 20 (b), the term $(t_j - x)\,[t_j,\dots,t_{j+k}]\,(\cdot - x)_+^{k-2}$ can be written as
$$(t_j - x)\,[t_j,\dots,t_{j+k}]\,(\cdot - x)_+^{k-2} = \frac{t_j - x}{t_{j+k} - t_j}\,\Big([t_{j+1},\dots,t_{j+k}]\,(\cdot - x)_+^{k-2} - [t_j,\dots,t_{j+k-1}]\,(\cdot - x)_+^{k-2}\Big). \tag{50}$$
Substituting this into (49), we get
$$[t_j,\dots,t_{j+k}]\,(\cdot - x)_+^{k-1} = \frac{x - t_j}{t_{j+k} - t_j}\,[t_j,\dots,t_{j+k-1}]\,(\cdot - x)_+^{k-2} + \frac{t_{j+k} - x}{t_{j+k} - t_j}\,[t_{j+1},\dots,t_{j+k}]\,(\cdot - x)_+^{k-2}. \tag{51}$$
The result in (2) follows immediately once we multiply both sides by $(t_{j+k} - t_j)$, then multiply and divide the first term on the right-hand side by $(t_{j+k-1} - t_j)$ and the second term by $(t_{j+k} - t_{j+1})$.

From the recurrence relation we obtain information about B-splines that was not evident from the divided-difference definition. $B_{j,1}$ is a characteristic function, hence piecewise constant. By Lemma 24 (2), since the coefficients of $B_{j,k-1}$ are linear functions of $x$, $B_{j,2}$ is a piecewise linear function on $[t_j, t_{j+2})$. Inductively, $B_{j,3}$ is a piecewise parabolic function on $[t_j, t_{j+3})$, $B_{j,4}$ is a piecewise polynomial of degree 3 on $[t_j, t_{j+4})$, and so on. Below there is a visual representation of B-splines showing how the graph changes as the order increases.

Since we now know what a B-spline is as a function, the next step is to ask what set is generated by linear combinations of these functions. Since B-splines are themselves piecewise polynomials, this set is a subset of the set of piecewise polynomials with breaks at the knots $(t_i)$. What can be proven is that it is exactly the set of piecewise polynomials with certain break and continuity conditions at the knots, and that this equality holds on a smaller interval, which we call the basic interval, denoted by $I_{k,t}$.

Definition 25.
Suppose $t = (t_0,\dots,t_N)$ is a nondecreasing sequence of knots. For the B-splines of order $k$, with $2k < N + 2$, that arise from these knots, we define
$$I_{k,t} = [t_{k-1},\, t_{N-k+1}]$$
and call it the basic interval.

Remark 26.
In order for this definition to be correct, we need to extend the B-splines to be left-continuous at the right endpoint of the basic interval, since we define it as a closed interval.
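The recurrence of Lemma 24 translates directly into an evaluator. The following sketch is our own illustrative code; terms whose denominator vanishes (coincident knots) are taken to be zero, the usual convention.

```python
def bspline(j, k, t, x):
    """Evaluate B_{j,k,t}(x) via the recurrence (48); order k = degree + 1.
    t is a nondecreasing knot sequence; when coincident knots make a
    denominator vanish, that term is taken to be zero."""
    if k == 1:
        return 1.0 if t[j] <= x < t[j + 1] else 0.0
    left = right = 0.0
    if t[j + k - 1] > t[j]:
        left = (x - t[j]) / (t[j + k - 1] - t[j]) * bspline(j, k - 1, t, x)
    if t[j + k] > t[j + 1]:
        right = (t[j + k] - x) / (t[j + k] - t[j + 1]) * bspline(j + 1, k - 1, t, x)
    return left + right

t = [0.0, 1.0, 2.0, 3.0, 4.0]
# B_{0,2} is the hat function on [t_0, t_2], peaking at t_1
assert bspline(0, 2, t, 1.0) == 1.0
assert bspline(0, 2, t, 0.5) == 0.5
assert bspline(0, 2, t, 3.5) == 0.0  # outside the support [t_0, t_2)
```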
Remark 27.
The basic interval for the $N - k + 1$ B-splines of order $k > 1$ is a proper subinterval of $[t_0, t_N]$, and later we will see that the B-splines form a partition of unity on the basic interval. For $k = 1$, by construction the B-splines already form a partition of unity on $I_{1,t} = [t_0, t_N]$.

For example, let $t = (t_i)_{i=0}^6$ consist of distinct knots and let $k = 3$. Then there are 4 B-splines $B_{j,3}$, $j = 0, 1, 2, 3$, of order 3 that arise in this framework. Their supports are $[t_0, t_3)$, $[t_1, t_4)$, $[t_2, t_5)$, $[t_3, t_6)$ respectively. Clearly, on $[t_0, t_1)$ only $B_{0,3}$ is supported, and since it is non-constant there, the sum
$$\sum_{j=0}^{3} B_{j,3} = B_{0,3} \quad\text{on } [t_0, t_1)$$
cannot equal 1 there. The partition of unity is stated and proved in the next lemma, together with other properties of the B-splines. The recurrence relation makes the proofs fairly easy compared to using the divided-difference definition of the B-splines.

Lemma 28.
Let $B_{j,k,t}$ be as given in Definition 21 for the knot sequence $t = (t_i)_{i=0}^N$. Then the following hold:

(a) $B_{j,k,t}(x) > 0$ for $x \in (t_j, t_{j+k})$.

(b) (Marsden's Identity) For any $\alpha \in \mathbb{R}$, we have $(x - \alpha)^{k-1} = \sum_j \psi_{j,k}(\alpha)\,B_{j,k,t}(x)$, where $\psi_{j,k}(\alpha) = (t_{j+1} - \alpha)\cdots(t_{j+k-1} - \alpha)$ and $\psi_{j,1}(\alpha) = 1$.

(c) $\sum_j B_{j,k,t} = 1$ on the basic interval $I_{k,t}$.

Proof. (a) This is a simple induction. For $k = 1$ the claim holds, since the B-splines are characteristic functions on $[t_j, t_{j+1})$ and thus strictly positive in the interior. For $k = 2$, by the recurrence relation $B_{j,2,t}$ is a combination of $B_{j,1}, B_{j+1,1}$ with coefficients the linear functions
$$\frac{x - t_j}{t_{j+1} - t_j}, \qquad \frac{t_{j+2} - x}{t_{j+2} - t_{j+1}},$$
which are positive where the corresponding characteristic functions are supported, so $B_{j,2,t} > 0$ on $(t_j, t_{j+2})$. Assuming the claim holds for $k = r$, the same argument shows it holds for $k = r + 1$.

(b) Let $\omega_{j,k}(x) = \frac{x - t_j}{t_{j+k-1} - t_j}$, so that $\frac{t_{j+k} - x}{t_{j+k} - t_{j+1}} = 1 - \omega_{j+1,k}(x)$. This way we can write the recurrence relation as
$$B_{j,k}(x) = \omega_{j,k}(x)\,B_{j,k-1}(x) + \big(1 - \omega_{j+1,k}(x)\big)\,B_{j+1,k-1}(x). \tag{52}$$
Using this we can write $\sum_j \psi_{j,k}(\alpha)\,B_{j,k,t}(x)$ as
$$\sum_j \psi_{j,k}(\alpha)\,B_{j,k,t}(x) = \sum_j \big[\omega_{j,k}(x)\,\psi_{j,k}(\alpha) + (1 - \omega_{j,k}(x))\,\psi_{j-1,k}(\alpha)\big]\,B_{j,k-1,t}(x) = \sum_j \psi_{j,k-1}(\alpha)\,\big[\omega_{j,k}(x)(t_{j+k-1} - \alpha) + (1 - \omega_{j,k}(x))(t_j - \alpha)\big]\,B_{j,k-1,t}(x) = \sum_j \psi_{j,k-1}(\alpha)\,(x - \alpha)\,B_{j,k-1,t}(x), \tag{53}$$
since $\omega_{j,k}(x)\,f(t_{j+k-1}) + (1 - \omega_{j,k}(x))\,f(t_j)$ is the unique straight line that agrees with $f$ at $x = t_j$ and $x = t_{j+k-1}$; applied to $f(t) = t - \alpha$, this gives
$$\omega_{j,k}(x)(t_{j+k-1} - \alpha) + (1 - \omega_{j,k}(x))(t_j - \alpha) = x - \alpha.$$
Therefore, by induction,
$$\sum_j \psi_{j,k}(\alpha)\,B_{j,k,t}(x) = \sum_j \psi_{j,1}(\alpha)\,(x - \alpha)^{k-1}\,B_{j,1,t}(x) = (x - \alpha)^{k-1}\sum_j \psi_{j,1}(\alpha)\,B_{j,1,t}(x) = (x - \alpha)^{k-1},$$
since $\psi_{j,1}(\alpha) = 1$ and the $B_{j,1,t}$ are characteristic functions.

(c) To prove the partition of unity, we start with Marsden's Identity, differentiate both sides $\nu - 1$ times in $\alpha$ for a positive integer $\nu \le k$, and divide by the resulting constant. We then have
$$\frac{(x - \alpha)^{k-\nu}}{(k - \nu)!} = \sum_j \frac{(-1)^{\nu-1}}{(k-1)!}\,\frac{d^{\nu-1}\psi_{j,k}(\alpha)}{d\alpha^{\nu-1}}\,B_{j,k,t}(x). \tag{54}$$
Now, for a polynomial $q$ of order $k$, we can use the Taylor expansion of $q$:
$$q = \sum_{\nu=1}^{k} \frac{(x - \alpha)^{k-\nu}}{(k - \nu)!}\,\frac{d^{k-\nu}q(\alpha)}{d\alpha^{k-\nu}}. \tag{55}$$
Using this we see that
$$q = \sum_j \lambda_{j,k}[q]\,B_{j,k,t}, \quad\text{where}\quad \lambda_{j,k}[q] = \sum_{\nu=1}^{k} \frac{(-1)^{\nu-1}}{(k-1)!}\,\frac{d^{\nu-1}\psi_{j,k}(\alpha)}{d\alpha^{\nu-1}}\,\frac{d^{k-\nu}q(\alpha)}{d\alpha^{k-\nu}}, \tag{56}$$
which holds on the basic interval. To show that the B-splines form a partition of unity, we simply apply this identity to $q = 1$.

Remark 29.
Marsden's Identity says something very important: all polynomials of order $k$ are contained in the span of the B-splines $B_{j,k}$, which is also what makes the corresponding step in the proof of (c) viable. Furthermore, replacing $(x - \alpha)$ in the identity by $(x - \alpha)_+$ shows that the truncated-power piecewise polynomials are also contained in the same span.
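For $k = 2$ the B-splines are explicit hat functions, so both Marsden's Identity and the partition of unity can be verified directly; a small numerical sketch (our own construction, not from the paper):

```python
def hat(j, t, x):
    """Order-2 B-spline (hat function) for knots t, supported on [t_j, t_{j+2})."""
    if t[j] <= x < t[j + 1]:
        return (x - t[j]) / (t[j + 1] - t[j])
    if t[j + 1] <= x < t[j + 2]:
        return (t[j + 2] - x) / (t[j + 2] - t[j + 1])
    return 0.0

t = [0.0, 1.0, 2.0, 3.0, 4.0]
x, alpha = 1.9, 0.7  # x lies in the basic interval I_{2,t} = [t_1, t_3]

# Marsden's Identity for k = 2: x - alpha = sum_j (t_{j+1} - alpha) B_{j,2}(x)
lhs = x - alpha
rhs = sum((t[j + 1] - alpha) * hat(j, t, x) for j in range(3))
assert abs(lhs - rhs) < 1e-12

# partition of unity on the basic interval
assert abs(sum(hat(j, t, x) for j in range(3)) - 1.0) < 1e-12
```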
Another consequence of Marsden's Identity is the Curry-Schoenberg theorem. We do not state the theorem explicitly, as we do not require it; rather, we record a simple consequence of it for B-splines of order $k$ given a sequence of knots $(t_i)_{i=0}^N$, which can be summarized as
$$\text{number of continuity conditions at } t_i + \text{multiplicity of } t_i = k.$$
Therefore, at a simple knot $t_i$, any B-spline of order $k$ is continuous and has $k - 2$ continuous derivatives. On the other hand, if $t_i$ has multiplicity $k$, a $k$-th order B-spline will have a discontinuity there. Below there is a figure which shows the importance of the basic interval as the interval on which we have the partition of unity.
Remark 31.
When the knots $t_i$ are distinct, the sum of the B-splines belongs to $C^{k-2}\big((t_0, t_N)\big)$; however, it equals 1 only on the basic interval $I_{k,t}$. To make the sum equal to 1 on the whole interval $(t_0, t_N)$, the assumption of distinct knots has to be dropped: it is clear that we then have to take $t_0 = \cdots = t_{k-1}$ and $t_{N-k+1} = \cdots = t_N$.

Definition 32.
Let $(t_i)_{i=0}^N$ be a sequence of knots such that $t_0 = \cdots = t_{k-1}$ and $t_{N-k+1} = \cdots = t_N$, where $1 \le k \le N$. Let $B_{j,k,t}$ be the B-splines defined in Definition 21 with knot sequence $t = (t_i)_{i=0}^N$. The set generated by the sequence $\{B_{j,k,t} : \text{all } j\}$, denoted by $S_{k,t}$, is the set of splines of order $k$ with knot sequence $t$. In symbols,
$$S_{k,t} = \bigg\{\sum_j a_j B_{j,k,t} : a_j \in \mathbb{R},\ \text{all } j\bigg\}. \tag{57}$$

Remark 33. Fix an interval $[a,b]$. Let $T_N = (t_i)_{i=0}^N$ be a sequence as in Definition 32 with $t_0 = a$ and $t_N = b$, where $N \in \mathbb{N}$. The choice in Definition 32 implies that
$$\bigcup_{N\in\mathbb{N}} S_{k,T_N} \ \text{is dense in}\ C([a,b]). \tag{58}$$

Later in this paper, when we conduct our analysis of MISE, derivatives of spline functions will factor in. Since splines are linear combinations of B-splines, we only need to investigate the result of differentiating a B-spline on the interior of its support. The derivative of a $k$-th order B-spline is directly associated with B-splines of order $k - 1$. To see this we use the recurrence relation, which leads to the following theorem:
Theorem 34.
Let $B_{j,k,t}$ be as defined in Definition 21, with support contained in the interval $[t_j, t_{j+k})$. Then the following holds on the open interval $(t_j, t_{j+k})$:
$$\frac{d}{d\theta}B_{j,k,t}(\theta) = \begin{cases} 0, & k = 1, \\[1mm] (k-1)\bigg(\dfrac{B_{j,k-1,t}(\theta)}{t_{j+k-1} - t_j} - \dfrac{B_{j+1,k-1,t}(\theta)}{t_{j+k} - t_{j+1}}\bigg), & k > 1. \end{cases}$$

Proof. The proof is by induction on $k$. For $k = 1$ it is straightforward, since $B_{j,1,t}$ is constant on $(t_j, t_{j+1})$; for $k > 1$ one differentiates the recurrence relation (48).

Note that in $S_{k,t}$ the B-splines we work with form a partition of unity on $[t_0, t_N]$, and since they are strictly positive on the interiors of their supports, each B-spline is bounded by 1 for all $\theta$:
$$B_{j,k,t}(\theta) \le 1, \quad \forall\,\theta \in \mathbb{R}.$$
Furthermore, by induction we can prove the following lemma:
Lemma 35.
Let $t = (t_i)_{i=0}^N$ be a sequence of knots as in Definition 32 and let $B_{j,k,t}$ be as defined in Definition 21. Let
$$h_N = \min_{k \le i \le N-k+1}(t_i - t_{i-1})$$
and let $\alpha$ be a positive integer with $\alpha < k - 1$. Then, on the open interval $(t_j, t_{j+k})$, we have
$$\sup_{\theta\in(t_j,t_{j+k})}\bigg|\frac{d^\alpha}{d\theta^\alpha}B_{j,k,t}(\theta)\bigg| \le \frac{2^\alpha}{h_N^\alpha}\,\frac{(k-1)!}{(k-\alpha-1)!}, \quad\text{for any } j.$$

Proof. We fix $k$ and proceed by induction on $\alpha$. For $\alpha = 1$,
$$\bigg|\frac{d}{d\theta}B_{j,k,t}(\theta)\bigg| = \bigg|(k-1)\bigg(\frac{B_{j,k-1,t}(\theta)}{t_{j+k-1} - t_j} - \frac{B_{j+1,k-1,t}(\theta)}{t_{j+k} - t_{j+1}}\bigg)\bigg| \le \frac{2(k-1)}{h_N},$$
where we used that each B-spline is bounded by 1 and that each denominator is at least $h_N$. Thus the inequality holds for $\alpha = 1$. Assume it holds for $\alpha = n$; then for $\alpha = n + 1$,
$$\bigg|\frac{d^{n+1}}{d\theta^{n+1}}B_{j,k,t}(\theta)\bigg| = \bigg|\frac{d^n}{d\theta^n}(k-1)\bigg(\frac{B_{j,k-1,t}(\theta)}{t_{j+k-1} - t_j} - \frac{B_{j+1,k-1,t}(\theta)}{t_{j+k} - t_{j+1}}\bigg)\bigg| \le \frac{2^{n+1}}{h_N^{n+1}}\,\frac{(k-1)!}{(k-(n+1)-1)!}.$$
This concludes the proof.
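The derivative formula of Theorem 34 can be sanity-checked against a finite difference in the simplest non-trivial case $k = 2$, where the B-splines are explicit; a small sketch (our own code):

```python
def box(j, t, x):
    """Order-1 B-spline: the characteristic function of [t_j, t_{j+1})."""
    return 1.0 if t[j] <= x < t[j + 1] else 0.0

def hat(j, t, x):
    """Order-2 B-spline (hat function) on [t_j, t_{j+2})."""
    if t[j] <= x < t[j + 1]:
        return (x - t[j]) / (t[j + 1] - t[j])
    if t[j + 1] <= x < t[j + 2]:
        return (t[j + 2] - x) / (t[j + 2] - t[j + 1])
    return 0.0

def hat_deriv(j, t, x):
    """Right-hand side of Theorem 34 for k = 2 (the factor k - 1 equals 1)."""
    return box(j, t, x) / (t[j + 1] - t[j]) - box(j + 1, t, x) / (t[j + 2] - t[j + 1])

t = [0.0, 1.0, 2.0]
h = 1e-6
for x in (0.5, 1.5):  # away from the knots, where the hat is differentiable
    fd = (hat(0, t, x + h) - hat(0, t, x - h)) / (2 * h)
    assert abs(hat_deriv(0, t, x) - fd) < 1e-6
```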
Remark 36.
Considering Remark 30, the bound in Lemma 35 can be extended to hold on the closed interval $[t_j, t_{j+k}]$ assuming the knots $t_j, \ldots, t_{j+k}$ are simple. Also, it is clear that we need to use at least parabolic B-splines in order to have a bound on a continuous derivative.

Logspline Density Estimation

In this part we present the method for constructing logspline density estimators using B-splines. Let $p$ be a continuous probability density function supported on an interval $[a,b]$. Suppose $p$ is unknown and we would like to construct density estimators for this function. The methodology is as follows.

Definition 37.
Let $T_N = (t_i)_{i=0}^{N}$, $N \in \mathbb{N}$, be a sequence of knots such that $t_0 = \cdots = t_{k-1} = a$ and $t_{N-k+1} = \cdots = t_N = b$, where $1 \le k \le N$, with $k$ fixed. Thus the set of splines $S_{k,T_N}$ of order $k$ generated by the B-splines $B_{j,k,T_N}$ can be obtained. We suppress the parameters $k, T_N$ and just write $B_j$ instead of $B_{j,k,T_N}$. Define the spline function
$$B(\theta; y) = \sum_{j=0}^{L} y_j B_j(\theta), \qquad y = (y_0, \ldots, y_L) \in \mathbb{R}^{L+1}, \quad \text{with } L := N - k, \tag{60}$$
and for each $y$ we set the probability density function
$$f(\theta; y) = \exp\Big( \sum_{j=0}^{L} y_j B_j(\theta) - c(y) \Big) = \exp\big( B(\theta; y) - c(y) \big), \quad \text{where } c(y) = \log\left( \int_a^b \exp\Big( \sum_{j=0}^{L} y_j B_j(\theta) \Big)\, d\theta \right) < \infty. \tag{61}$$
The family of exponential densities $\{ f(\theta; y) : y \in \mathbb{R}^{L+1} \}$ is not identifiable, since if $\beta$ is any constant, then $c((y_0+\beta, \ldots, y_L+\beta)) = c(y) + \beta$ and thus $f(\theta; (y_0+\beta, \ldots, y_L+\beta)) = f(\theta; y)$. To make the family identifiable we restrict the vectors $y$ to the set
$$Y = \Big\{ y \in \mathbb{R}^{L+1} : \sum_{i=0}^{L} y_i = 0 \Big\}. \tag{62}$$

Remark 38. $Y$ depends only on the number of knots and the order of the B-splines, not on the number of samples.

Definition 39.
We define the logspline model as the family of estimators
$$\mathcal{L} = \big\{ f(\theta; y) \text{ given by (61)} : y \in Y \big\}. \tag{63}$$
For any $f \in \mathcal{L}$,
$$\log(f) = \sum_{j=0}^{L} y_j B_j(\theta) - c(y) \in S_{k,T_N}. \tag{64}$$
Next, let us pick a set of independent, identically distributed random variables
$$\Theta_n = (\theta_1, \theta_2, \ldots, \theta_n) \in \mathbb{R}^n, \qquad n \in \mathbb{N},$$
where each $\theta_i$ is drawn from a distribution that has density $p(\theta)$. We next define the log-likelihood function $l_n : \mathbb{R}^{L+1+n} \to \mathbb{R}$ corresponding to the logspline model by
$$l_n(y) = l_n(y; \theta_1, \theta_2, \ldots, \theta_n) = l_n(y; \Theta_n) = \sum_{i=1}^{n} \log(f(\theta_i; y)) = \sum_{i=1}^{n} \Big( \sum_{j=0}^{L} y_j B_j(\theta_i) \Big) - n\, c(y), \qquad y \in Y, \tag{65}$$
and the maximizer of the log-likelihood $l_n(y)$ by
$$\hat{y}_n = \hat{y}_n(\theta_1, \ldots, \theta_n) = \arg\max_{y \in Y} l_n(y) \tag{66}$$
whenever this random variable exists, which will be shown to happen on a subset of the sample space whose probability tends to 1. The density $f(\cdot\,; \hat{y}_n)$ is called the logspline density estimate of $p$. We define the expected log-likelihood function $\lambda_n(y)$ by
$$\lambda_n(y) = \mathbb{E}[\, l_n(y; \theta_1, \ldots, \theta_n) \,] = n \left( -c(y) + \int_a^b \Big( \sum_{j=0}^{L} y_j B_j(\theta) \Big) p(\theta)\, d\theta \right) < \infty, \qquad y \in Y. \tag{67}$$
It follows by a convexity argument that the expected log-likelihood function has a unique maximizing value
$$\bar{y} = \arg\max_{y \in Y} \lambda_n(y) = \arg\max_{y \in Y} \frac{\lambda_n(y)}{n}, \tag{68}$$
which is independent of $n$ but depends on the knots. Note that the function $\lambda_n(y)$ is bounded above and goes to $-\infty$ as $|y| \to \infty$ within $Y$; therefore, due to Jensen's inequality, the constant $\bar{y}$ is finite; see Stone [4]. The estimator $\hat{y}(\theta_1, \ldots, \theta_n)$ in general does not exist. This motivates us to define the set
$$\Omega_n = \big\{ \omega \in \Omega : \hat{y} = \hat{y}(\theta_1, \ldots, \theta_n) \in \mathbb{R}^{L+1} \text{ exists} \big\}. \tag{69}$$
In what follows we will show that $P(\Omega_n) \to 1$ as $n \to \infty$.
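The construction (60)-(66) can be sketched numerically. In the snippet below (a minimal illustration assuming SciPy; the interval, knot layout, Beta-distributed sample, and the generic quasi-Newton optimizer are all assumptions for demonstration, not choices made in the paper), the normalizer $c(y)$ is computed by quadrature and the maximizer of $l_n$ is found after projecting onto the constraint set $Y$:

```python
import numpy as np
from scipy.interpolate import BSpline
from scipy.integrate import quad
from scipy.optimize import minimize

rng = np.random.default_rng(0)
a, b, k = 0.0, 1.0, 4                       # interval [a, b] and order k (illustrative)
t = np.concatenate([[a] * k, [0.5], [b] * k])   # t_0=...=t_{k-1}=a, t_{N-k+1}=...=t_N=b
L = len(t) - k - 1                          # L = N - k, so L + 1 basis coefficients

def c_of(y):
    # c(y) = log ∫_a^b exp(B(θ; y)) dθ, cf. (61); B(·; y) is the spline with coeffs y
    spl = BSpline(t, y, k - 1)
    return np.log(quad(lambda th: np.exp(spl(th)), a, b)[0])

def density(y):
    spl, cy = BSpline(t, y, k - 1), c_of(y)
    return lambda th: np.exp(spl(th) - cy)  # f(θ; y) = exp(B(θ; y) - c(y))

samples = rng.beta(2.0, 2.0, size=200)      # i.i.d. draws from an "unknown" p on [0, 1]

def neg_loglik(z):
    y = z - z.mean()                        # project onto Y = {Σ y_i = 0}, cf. (62)
    spl = BSpline(t, y, k - 1)
    return -(np.sum(spl(samples)) - len(samples) * c_of(y))   # -l_n(y), cf. (65)

res = minimize(neg_loglik, np.zeros(L + 1))
y_hat = res.x - res.x.mean()                # numerical stand-in for ŷ_n in (66)
p_hat = density(y_hat)                      # logspline density estimate f(·; ŷ_n)
mass = quad(p_hat, a, b)[0]                 # ≈ 1 by construction of c(y)
```

Since $l_n$ is concave on $Y$, a generic unconstrained minimizer applied to the centered parametrization suffices for this toy problem; the paper itself does not prescribe an optimization routine here.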
We also note that, due to the concavity of $l_n(y)$ and $\lambda_n(y)$, the maximizers $\hat{y}$ and $\bar{y}$ are unique whenever they exist. We define the logspline estimator $\hat{p}$ of $p$ on the space $\Omega_n$ by
$$\hat{p} : \mathbb{R} \times \Omega_n \to \mathbb{R}, \qquad \hat{p}(\theta, \omega) = f(\theta, \hat{y}(\theta_1, \ldots, \theta_n)), \quad \omega \in \Omega_n, \tag{70}$$
and define the function
$$\bar{p}(\theta) := f(\theta, \bar{y}). \tag{71}$$

Remark 40.
In order for the maximum likelihood estimates to be reliable, we require that the modeling error tend to 0 as $n \to \infty$, as described in hypothesis (H4).

Approximation properties of $S_{k,t}$

It is a well known fact that continuous functions can be approximated by polynomials. Now that we have defined the set of splines $S_{k,t}$ in Definition 32, and given that, as stated in Remark 33, $\bigcup_{N \in \mathbb{N}} S_{k,T_N}$ is dense in the space of continuous functions, a question arises at this point: given an arbitrary continuous function $g$ on $[a,b]$, an integer $k \ge 1$, and a knot sequence $T_N = (t_i)_{i=0}^{N}$ as in Remark 33, how close is $g$ to the set $S_{k,T_N}$ of splines of order $k$? Let us state this question in a slightly different way. What we would like to do is find a bound for the sup-norm distance between $g \in C[a,b]$ and $S_{k,T_N}$, where this distance is denoted by $\mathrm{dist}(g, S_{k,T_N})$ and is defined as
$$\mathrm{dist}(g, S_{k,T_N}) = \inf_{s \in S_{k,T_N}} \| g - s \|_{\infty}, \qquad g \in C[a,b]. \tag{72}$$
The answer to our question is given by Jackson's Theorem, found in de Boor [1]. To state it we first need the following definition.

Definition 41.
The modulus of continuity $\omega(g; h)$ of a function $g \in C[a,b]$, for a positive number $h$, is defined as
$$\omega(g; h) = \max\big\{ |g(\theta_1) - g(\theta_2)| : \theta_1, \theta_2 \in [a,b],\ |\theta_1 - \theta_2| \le h \big\}. \tag{73}$$
The bound given by Jackson's Theorem contains the modulus of continuity of the function whose sup-norm distance from the set of splines we want to estimate. The theorem is stated below.
Let $T_N = (t_i)_{i=0}^{N}$, $N \in \mathbb{N}$, be a sequence of knots such that $t_0 = \cdots = t_{k-1} = a$ and $b = t_{N-k+1} = \cdots = t_N$, where $1 \le k \le N$. Let $S_{k,T_N}$ be the set of splines as in Definition 32 for the knot sequence $T_N$. For each $j \in \{0, \ldots, k-1\}$, there exists $C = C(k,j)$ such that for $g \in C^{j}[a,b]$
$$\mathrm{dist}(g, S_{k,T_N}) \le C\, h^{j}\, \omega\Big( \frac{d^{j} g}{d\theta^{j}};\, h \Big), \qquad \text{where } h = \max_i |t_{i+1} - t_i|. \tag{74}$$
In particular, from the Mean Value Theorem it follows that
$$\mathrm{dist}(g, S_{k,T_N}) \le C\, h^{j+1} \left\| \frac{d^{j+1} g}{d\theta^{j+1}} \right\|_{\infty} \tag{75}$$
in the case that $g \in C^{j+1}[a,b]$.

Remark 43. Please note that the mesh size enters into the approximation bound (75), which dictates the placement of the knots.
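The $h^{j+1}$ rate in (75) is easy to observe numerically. The sketch below (assuming SciPy; the test function $g$ and the knot counts are arbitrary illustrative choices) uses a dense least-squares spline fit as a stand-in for the best sup-norm approximation and refines a uniform knot sequence:

```python
import numpy as np
from scipy.interpolate import make_lsq_spline

# Approximate g(θ) = sin(2πθ) on [0, 1] by cubic splines (order k = 4)
# over successively refined uniform knots and track the sup-norm error.
g = lambda th: np.sin(2 * np.pi * th)
x = np.linspace(0.0, 1.0, 4001)
deg = 3                                     # degree = k - 1

errs = []
for n_int in (4, 8, 16):                    # number of uniform subintervals
    interior = np.linspace(0, 1, n_int + 1)[1:-1]
    t = np.r_[[0.0] * (deg + 1), interior, [1.0] * (deg + 1)]   # clamped knot vector
    s = make_lsq_spline(x, g(x), t, k=deg)
    errs.append(np.max(np.abs(g(x) - s(x))))

# For a smooth g and j = k - 1 = 3, halving the mesh should shrink the error
# by roughly 2^{j+1} = 16; at minimum it must decrease monotonically.
assert errs[0] > errs[1] > errs[2]
```

The least-squares fit is only near-optimal in the sup norm, so the observed ratios approximate, rather than exactly match, the constant in (75).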
Jackson's Theorem supplies us with an estimate of how good an approximation of a continuous function is contained in the space of splines. However, in this paper we are interested in estimates for probability densities, especially since the focus is on logspline density estimates. At this point let us state results specifically for densities; the following can be found in Stone [3]. Suppose that $p$ is a continuous probability density supported on some interval $[a,b]$, similar to the set-up when we defined the logspline density estimation method. Define the family $F_p$ of densities such that
$$F_p = \left\{ p_\alpha : p_\alpha(x) = \frac{(p(x))^{\alpha}}{\int (p(y))^{\alpha}\, dy},\ 0 \le \alpha \le 1 \right\}. \tag{76}$$
It is easy to see that for $\alpha \in [0,1]$, $p_\alpha$ is a probability density on $[a,b]$. An interesting consequence of this family is the following.

Lemma 44.
We define the family of functions
$$F_p^{\log} = \{ \log(u) : u \in F_p \}. \tag{77}$$
Then $F_p^{\log}$ defines a family of functions that is equicontinuous on the set $\{\theta : p(\theta) > 0\}$.

Proof. The proof is simple enough. Pick $\epsilon > 0$.
There exists $\delta > 0$ such that $|\log(p(x)) - \log(p(y))| < \epsilon$ whenever $|x - y| < \delta$, since $\log(p)$ is uniformly continuous on $[a,b]$. Pick any $\alpha \in [0,1]$. If $\alpha = 0$ then $p_\alpha$ is just a constant and thus $|\log(p_\alpha(x)) - \log(p_\alpha(y))| = 0 < \epsilon$. If $0 < \alpha < 1$, then
$$|\log(p_\alpha(x)) - \log(p_\alpha(y))| = |\alpha \log(p(x)) - \alpha \log(p(y))| < \alpha \epsilon < \epsilon,$$
while the case $\alpha = 1$ is the uniform continuity of $\log(p)$ itself.

Remark 45.
It is practical to work with $p(x) > 0$ on the set $[a,b]$, and this is what we assume until the end of the manuscript. In this case, $\log(p) \in C[a,b]$.

Remark 46.
We will be using the notation $\bar{h} = \max_i |t_{i+1} - t_i|$ and $h = \min_i |t_{i+1} - t_i|$, and $\gamma(T_N) = \bar{h}/h$. We can apply the logspline estimation method to $p$. Let $\bar{p}$ be defined as in (71), the density estimate given by maximizing the expected log-likelihood. We then have the following lemma:

Lemma 47.
Suppose $p$ is an unknown continuous density function supported on $[a,b]$ and $\bar{p}$ is as in (71). Then there exists a constant $M' = M'(F_p, k, \gamma(T_N))$, depending on the family $F_p$, the order $k$, and the global mesh ratio $\gamma(T_N)$ of $S_{k,T_N}$, such that
$$\| \log(p) - \log(\bar{p}) \|_{\infty} \le M'\, \mathrm{dist}(\log(p), S_{k,T_N}) \tag{78}$$
and therefore
$$\| p - \bar{p} \|_{\infty} \le \big( \exp\{ M'\, \mathrm{dist}(\log(p), S_{k,T_N}) \} - 1 \big)\, \| p \|_{\infty}. \tag{79}$$
Moreover, if $\log(p) \in C^{j+1}([a,b])$ for some $j \in \{0, \ldots, k-1\}$, then by Jackson's Theorem we obtain
$$\| \log(p) - \log(\bar{p}) \|_{\infty} \le M' C(k,j)\, \bar{h}^{j+1} \left\| \frac{d^{j+1} \log(p)}{d\theta^{j+1}} \right\|_{\infty}, \qquad \| p - \bar{p} \|_{\infty} \le \left( \exp\left\{ M' C(k,j)\, \bar{h}^{j+1} \left\| \frac{d^{j+1} \log(p)}{d\theta^{j+1}} \right\|_{\infty} \right\} - 1 \right) \| p \|_{\infty}. \tag{80}$$

Remark 48.
Please note that the constant $M'$ does not depend on the dimension of $S_{k,T_N}$. For all practical purposes we will be using uniformly placed knots, thus suppressing the dependence on $\gamma(T_N)$, which will be equal to the constant 1.

Now we will present certain error bounds required to calculate a bound for the MISE. Assume $p$, $\hat{p}$, and $\bar{p}$ are as in the previous section. Also, assume that $n$ is the number of random samples drawn from $p$. We will state a series of definitions and theorems that encompass the results from Lemma 5, Lemma 6, Lemma 7, and Lemma 8 in the work of Stone [4, pp. 728-729].

Definition 49. Let $n \ge 1$ and $b > 0$. Let $y \in Y$. Let $l_n$ and $\lambda_n$ be defined by (65) and (67), respectively. We define
$$A_{n,b}(y) = \left\{ \omega \in \Omega : |l_n(y; \Theta_n(\omega)) - l_n(\bar{y}; \Theta_n(\omega)) - (\lambda_n(y) - \lambda_n(\bar{y}))| < n b \left( \int |\log(f(\theta; y)) - \log(f(\theta; \bar{y}))|^2\, d\theta \right)^{1/2} \right\}, \tag{81}$$
where $f$ is defined in (61) as a function in the logspline family.

Definition 50.
Given $n \ge 1$ and $\epsilon > 0$, we define $E_{\epsilon,n}$ to be the subset of $F = \{ f(\cdot\,; y) : y \in Y \}$ such that
$$E_{\epsilon,n} = \left\{ f(\cdot\,; y) : y \in Y \text{ and } \left( \int |\log(f(\theta; y)) - \log(f(\theta; \bar{y}))|^2\, d\theta \right)^{1/2} \le n^{\epsilon} \sqrt{\frac{L+1}{n}} \right\}. \tag{82}$$

Lemma 51 (Stone [4, p. 728]). For each $y_1, y_2 \in Y$ and $\omega \in \Omega$ we have
$$|l_n(y_1; \Theta_n(\omega)) - l_n(y_2; \Theta_n(\omega)) - (\lambda_n(y_1) - \lambda_n(y_2))| \le 2n\, \| \log f(\cdot\,; y_1) - \log f(\cdot\,; y_2) \|_{\infty}. \tag{83}$$

Lemma 52 (Stone [4, p. 729]). Let $n \ge 1$. Given $\epsilon > 0$ and $\delta > 0$, there exist an integer $N = N(n) > 0$ and sets $E_j \subset F$, $j = 1, \ldots, N$, satisfying
$$\sup_{f_1, f_2 \in E_j} \| \log(f_1) - \log(f_2) \|_{\infty} \le \delta\, n^{\epsilon - 1} (L+1)$$
such that $E_{\epsilon,n} \subset \bigcup_{i=1}^{N} E_i$. Combining the above lemmas leads to the following theorem, which is a result outlined in Lemmas 5 and 8 found in Stone [4].
Theorem 53.
Given $D > 0$ and $\epsilon > 0$, let
$$b_n = n^{\epsilon} \sqrt{\frac{L(n)+1}{n}}, \qquad n \ge 1,$$
with $0 < \epsilon < 1/2$ and $\beta = \epsilon$ in (??). There exists $N = N(D)$ such that for all $n > N$
$$A_{n,b_n}(y) \subset \Omega_n \quad \text{for each } y \in Y \tag{84}$$
and thus
$$P(\Omega_n^c) \le P\big( A_{n,b_n}^c(y) \big) \le e^{-n^{2\epsilon}(L+1)\delta(D)}. \tag{85}$$
From (85) we can see that, as the number of samples goes to infinity, $P(\Omega_n) \to 1$.

Remark 55.
The bound (85) presented in Theorem 53 is a consequence of Hoeffding's inequality, which states that for any $t > 0$
$$P\left( \Big| \frac{1}{n} \sum_{i=1}^{n} X_i - \mathbb{E}X \Big| \ge t \right) \le 2 \exp\left( - \frac{2 n^2 t^2}{\sum_{i=1}^{n} (b_i - a_i)^2} \right),$$
where $X_1, \ldots, X_n$ are identically distributed independent random variables with $P(X_i \in [a_i, b_i]) = 1$. To get the bound (85) one needs to choose
$$t = b \left( \int |\log(f(\theta; y)) - \log(f(\theta; \bar{y}))|^2\, d\theta \right)^{1/2}.$$
Now that we have defined the set where $\hat{y}$ exists and shown that the probability of its complement vanishes as $n \to \infty$ with a specific exponential rate, we state certain rates of convergence that apply only on $\Omega_n$. The following theorem contains results of Theorem 2 and Lemma 12 of Stone [4].

Theorem 56. There exist constants $M_1$, $M_2$, and $M_3$ such that for all $\omega \in \Omega_n$
$$|\hat{y}(\theta_1(\omega), \ldots, \theta_n(\omega)) - \bar{y}| \le M_1 \frac{L+1}{\sqrt{n}}, \qquad \| \hat{p}(\cdot, \omega) - \bar{p}(\cdot) \| \le M_2 \sqrt{\frac{L+1}{n}}, \qquad \| \log(\hat{p}(\cdot, \omega)) - \log(\bar{p}(\cdot)) \|_{\infty} \le M_3 \frac{L+1}{\sqrt{n}}. \tag{86}$$
The following two theorems are well-known facts, which we cite from [7, pp. 132, 134].
Theorem 57.
Let $f : [a,b] \to \mathbb{R}$. Given distinct points $a = x_0 < x_1 < \cdots < x_l = b$ and $l+1$ ordinates $y_i = f(x_i)$, $i = 0, \ldots, l$, there exists an interpolating polynomial $q(x)$ of degree at most $l$ such that $f(x_i) = q(x_i)$, $i = 0, \ldots, l$. This polynomial $q(x)$ is unique among the set of all polynomials of degree at most $l$. Moreover, $q(x)$ is called the Lagrange interpolating polynomial of $f$ and can be written in the explicit form
$$q(x) = \sum_{i=0}^{l} y_i\, l_i(x) \quad \text{with} \quad l_i(x) = \prod_{j \ne i} \left( \frac{x - x_j}{x_i - x_j} \right), \qquad i = 0, 1, \ldots, l. \tag{87}$$

Theorem 58.
Suppose that $f : [a,b] \to \mathbb{R}$ has $l+1$ continuous derivatives on $(a,b)$. Let $a = x_0 < x_1 < \cdots < x_l = b$ and $y_i = f(x_i)$, $i = 0, \ldots, l$. Let $q(x)$ be the Lagrange interpolating polynomial of $f$ given by formula (87). Then for every $x \in [a,b]$ there exists $\xi \in (a,b)$ such that
$$f(x) - q(x) = \frac{\prod_{i=0}^{l} (x - x_i)}{(l+1)!}\, f^{(l+1)}(\xi). \tag{88}$$
We next prove an elementary lemma that provides an estimate of the interpolation error when information on the derivatives of $f$ is available. This lemma is used later in Theorem 18 to compute the mean integrated squared error.

Lemma 59.
Let $f(x)$, $q(x)$, and $(x_i, y_i)$, $i = 0, \ldots, l$, with $l \ge 1$, be as in Theorem 58. Suppose that
$$\sup_{x \in [a,b]} |f^{(l+1)}(x)| \le C$$
for some constant $C \ge 0$, and that $x_{i+1} - x_i = \frac{b-a}{l} =: \Delta x$ for each $i = 0, \ldots, l-1$. Then
$$\max_{x \in [a,b]} |f(x) - q(x)| \le \frac{C\, (\Delta x)^{l+1}}{4(l+1)}. \tag{89}$$
Proof.
Let $x \in [a,b]$. Then $x \in [x_j, x_{j+1}]$ for some $j \in \{0, \ldots, l-1\}$. Observe that
$$|(x - x_j)(x - x_{j+1})| \le \tfrac{1}{4} (\Delta x)^2,$$
while for $m \in \{-j, -j+1, \ldots, -1\}$ we have $|x - x_{j+m}| \le (|m|+1)\, \Delta x$ and for $m \in \{2, \ldots, l-j\}$ we have $|x - x_{j+m}| \le m\, \Delta x$. From this it follows that
$$\prod_{i=0}^{l} |x - x_i| \le \frac{(\Delta x)^{l+1}\, (j+1)!\, (l-j)!}{4} \le \frac{(\Delta x)^{l+1}\, l!}{4}.$$
Then Theorem 58 together with the above estimate implies (89).
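The bound (89) can be verified numerically. In the sketch below (assuming SciPy; the choice $f = \sin$ on $[0,1]$ is an illustrative assumption for which $C = 1$ bounds every derivative), the interpolation error on a fine grid stays below $C(\Delta x)^{l+1}/(4(l+1))$:

```python
import numpy as np
from scipy.interpolate import lagrange

# Check (89) for f(x) = sin(x) on [a, b] = [0, 1]: |f^{(l+1)}| <= C = 1.
l = 4
a, b = 0.0, 1.0
xi = np.linspace(a, b, l + 1)               # equispaced nodes, ∆x = (b - a)/l
q = lagrange(xi, np.sin(xi))                # Lagrange interpolating polynomial (87)

dx = (b - a) / l
x = np.linspace(a, b, 100001)
err = np.max(np.abs(np.sin(x) - q(x)))      # sup-norm interpolation error on a grid
bound = 1.0 * dx**(l + 1) / (4 * (l + 1))   # C (∆x)^{l+1} / (4(l+1)) from (89)
assert err <= bound
```

For this configuration the observed error sits comfortably below the bound, consistent with the extra slack introduced when $(j+1)!(l-j)!$ is replaced by $l!$ in the proof.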
References

[1] C. de Boor, A Practical Guide to Splines (Revised Edition), Springer, New York, 2001.

[2] C. J. Stone and C.-Y. Koo, Logspline Density Estimation, Contemporary Mathematics, Vol. 59, 1986.

[3] C. J. Stone, Uniform Error Bounds Involving Logspline Models, in Probability, Statistics and Mathematics: Papers in Honor of Samuel Karlin, 335-355, 1989.

[4] C. J. Stone, Large-sample Inference for Log-spline Models, Annals of Statistics, Vol. 18, No. 2, 717-741, 1990.

[5] W. Neiswanger, C. Wang and E. P. Xing, Asymptotically Exact, Embarrassingly Parallel MCMC, Proceedings of the Thirtieth Conference on Uncertainty in Artificial Intelligence, 2014, pp. 623-632.

[6] A. Miroshnikov, Z. Wei and E. Conlon, Parallel Markov Chain Monte Carlo for Non-Gaussian Posterior Distributions, Stat, accepted (2015).

[7] K. E. Atkinson, An Introduction to Numerical Analysis, 2nd Edition, John Wiley & Sons, 1989.

[8] J. Langford, A. J. Smola and M. Zinkevich, Slow Learners Are Fast, in Y. Bengio, D. Schuurmans, J. D. Lafferty, C. K. I. Williams, A. Culotta (eds.), Advances in Neural Information Processing Systems 22 (NIPS), New York: Curran Associates, Inc., 2009.

[9] D. Newman, A. Asuncion, P. Smyth and M. Welling, Distributed Algorithms for Topic Models, J. Machine Learning Research (2009), 10, 1801-1828.

[10] A. Smola and S. Narayanamurthy, An Architecture for Parallel Topic Models, Proceedings of the VLDB Endowment (2010), 3, 1-2, 703-710.

[11] M. Plummer, JAGS: A Program for Analysis of Bayesian Graphical Models Using Gibbs Sampling, Proceedings of the 3rd International Workshop on Distributed Statistical Computing, March 20-22, Vienna, Austria (2003).

[12] United States Department of Transportation, Bureau of Transportation Statistics, URL