Bernoulli 20(4), 2014, 1802–1818
DOI: 10.3150/13-BEJ542

Minimax bounds for estimation of normal mixtures
ARLENE K.H. KIM
Statistical Laboratory, Center for Mathematical Sciences, University of Cambridge, Wilberforce Road, Cambridge CB3 0WB, UK. E-mail: [email protected]
This paper deals with minimax rates of convergence for estimation of density functions on the real line. The densities are assumed to be location mixtures of normals, a global regularity requirement that creates subtle difficulties for the application of standard minimax lower bound methods. Using novel Fourier and Hermite polynomial techniques, we determine the minimax optimal rate – slightly larger than the parametric rate – under squared error loss. For Hellinger loss, we provide a minimax lower bound using ideas modified from the squared error loss case.
Keywords: Assouad's lemma; Hermite polynomials; minimax lower bound; normal location mixture
1. Introduction
This paper establishes the optimal minimax rate of convergence under squared error loss for densities that are normal mixtures. The analysis reveals a subtle difficulty in the application of Assouad's lemma to parameter spaces defined by indirect regularity conditions, which complicate the usual construction of subsets of the parameter space indexed by "hyper-rectangles." More precisely, we consider independent observations from probability distributions $P_f$ on the real line whose densities $f$ (with respect to Lebesgue measure on $\mathbb R$) belong to the set of convolutions

$$\mathcal F = \Big\{ f: f(x) = \varphi\star\Pi(x) = \int \varphi(x-u)\,\mathrm d\Pi(u),\ \Pi\in\mathcal P(\mathbb R)\Big\},$$

where $\varphi$ denotes the standard normal $N(0,1)$ density and $\mathcal P(\mathbb R)$ denotes the set of all probability measures on the (Borel sigma-field of the) real line. Our main result gives an asymptotic minimax lower bound for the $L_2$ risk of estimators of $f\in\mathcal F$.
Theorem 1.1.
Let $X_1,\ldots,X_n$ be independent and identically distributed with density $f\in\mathcal F$. Then there exists a positive constant $c$ such that

$$\sup_{f\in\mathcal F} E_{n,f}\int_{-\infty}^\infty \big(\hat f_n(x)-f(x)\big)^2\,\mathrm dx \ \ge\ c\,\frac{\sqrt{\log n}}{n} =: c\,\ell_n$$

for every estimator $\hat f_n = \hat f_n(X_1,\ldots,X_n)$.

Let $\mathcal F_0$ denote the subset of $\mathcal F$ consisting of those normal mixture densities whose mixing measure is absolutely continuous with respect to Lebesgue measure. The proof of Theorem 1.1, which is given in Section 2, involves the construction of a finite subset of $\mathcal F_0$, so the lower bound also holds when the supremum is taken over $f\in\mathcal F_0$. Perhaps the most interesting feature of this result is that the same rate has been obtained as an upper bound for the minimax risk with respect to squared error loss over much larger classes of functions. For instance, [4] defined the class $\mathcal F^*$ consisting of those densities that can be extended to an entire function $f^*$ on $\mathbb C$ satisfying $\sup_{y\in\mathbb R} \mathrm e^{-y^2/2}\sup_{x\in\mathbb R}|f^*(x+\mathrm iy)| < \infty$. He proved the following theorem.

Theorem 1.2 ([4], Theorem 4.1).
Let $X_1,\ldots,X_n$ be independent and identically distributed with density $f\in\mathcal F^*$. Then there exists an estimator $\hat f_n = \hat f_n(X_1,\ldots,X_n)$ of $f$ such that

$$\sup_{f\in\mathcal F^*} E_{n,f}\int_{-\infty}^\infty \big(\hat f_n(x)-f(x)\big)^2\,\mathrm dx = \mathrm O(\ell_n).$$
For the reader's convenience, in Section 3, we show that $\mathcal F\subseteq\mathcal F^*$ and summarize Ibragimov's proof. Theorems 1.1 and 1.2 together establish that the minimax optimal rate of estimation for squared $L_2$ loss is $\ell_n$ for any class of functions containing $\mathcal F_0$ and contained in $\mathcal F^*$. In particular, this is the case for $\mathcal F$.

While the minimax result under the $L_2$ loss presents the most successful case, this loss function is often criticized for giving too little weight to errors from the tails. As an alternative, we also consider the Hellinger loss. Define a class of probability measures with sub-Gaussian tails,

$$\mathcal P_s(\mathbb R) := \big\{\Pi\in\mathcal P(\mathbb R):\ \exists\,C > 0 \text{ such that } \Pi(|u| > t) \le C\exp(-t^2/C) \text{ for all real } t\big\}.$$

For the following class of normal location mixtures,

$$\mathcal F_s := \Big\{ f: f(x) = \varphi\star\Pi(x) = \int \varphi(x-u)\,\mathrm d\Pi(u),\ \Pi\in\mathcal P_s(\mathbb R)\Big\},$$

[2] provide a sieved maximum likelihood estimator whose squared Hellinger convergence rate is $\mathrm O((\log n)^2/n)$. However, as they pointed out, the optimal rate for $\mathcal F_s$ is still unknown. Our technique gives a lower bound that lies within a logarithmic factor of Ghosal and van der Vaart's upper bound.
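To make the logarithmic gap concrete, the following short Python sketch (an added illustration, not part of the original argument) evaluates the three rates for a few sample sizes: the squared-$L_2$ minimax rate $\ell_n = \sqrt{\log n}/n$, the Hellinger lower bound $\log n/n$ of Theorem 1.3 below, and the Ghosal–van der Vaart upper bound $(\log n)^2/n$.

```python
import math

# Compare the three rates discussed above:
#   sqrt(log n)/n : squared-L2 minimax rate (Theorems 1.1 and 1.2)
#   log n / n     : Hellinger lower bound (Theorem 1.3)
#   (log n)^2 / n : sieve MLE upper bound of [2]
for n in [10**3, 10**6, 10**9]:
    ln = math.log(n)
    print(f"n = {n:>10}: "
          f"sqrt(log n)/n = {math.sqrt(ln)/n:.3e}, "
          f"log n/n = {ln/n:.3e}, "
          f"(log n)^2/n = {ln**2/n:.3e}")
```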
Theorem 1.3.

Let $X_1,\ldots,X_n$ be independent and identically distributed with density $f\in\mathcal F_s$. Then there exists a positive constant $c$ such that

$$\sup_{f\in\mathcal F_s} E_{n,f}\int_{-\infty}^\infty \Big(\sqrt{\hat f_n(x)}-\sqrt{f(x)}\Big)^2\,\mathrm dx \ \ge\ c\,\frac{\log n}{n}$$

for every estimator $\hat f_n = \hat f_n(X_1,\ldots,X_n)$.

To prove Theorems 1.1 and 1.3, we use a variation on Assouad's lemma (cf. [11], page 347). When specialized to density estimation, the lemma can be cast into the following form. (Henceforth, we omit the $\pm\infty$ limits on the integrals when there is no ambiguity.) For completeness, we provide the proof in the Appendix.

Lemma 1.4.
Let $\{f_\alpha,\ \alpha\in\{0,1\}^K\}\subseteq\mathcal F$, where $K$ is a finite index set of cardinality $m$. Suppose $W$ is a nonnegative loss function for which there exists $\zeta > 0$ such that, for all $g_1,g_2\in\mathcal F$,

$$\inf_{f\in\mathcal F}\,\big\{W(f,g_1)+W(f,g_2)\big\} \ \ge\ \zeta\,W(g_1,g_2). \tag{1}$$

Suppose also that for some constants $c_1 > 0$ and $1 > c_2 > 0$,

$$W(f_\alpha,f_\beta) \ \ge\ c_1\varepsilon^2\,\|\alpha-\beta\|\qquad \text{for all } \alpha,\beta\in\{0,1\}^K \tag{2}$$

and

$$\int\frac{(f_\alpha-f_\beta)^2}{f_\alpha} \ \le\ \frac{c_2}{n}\qquad \text{if } \|\alpha-\beta\| = 1, \tag{3}$$

where $\|\alpha-\beta\| = \sum_{k\in K}\mathbb 1\{\alpha_k\ne\beta_k\}$ is the Hamming distance. Then, for every estimator $\hat f_n$ based on $n$ independent observations,

$$\sup_{f\in\mathcal F}E_{n,f}\,W(\hat f_n,f) \ \ge\ \frac{c_1\zeta}{4}\,(1-\sqrt{c_2})\,m\varepsilon^2. \tag{4}$$

Remark 1.1.
Assumption (3) regarding the $\chi^2$ distance is merely a convenient way to show that the testing affinity, $\|P^n_{f_\alpha}\wedge P^n_{f_\beta}\|$, is at least $1-\sqrt{c_2}$, where $P^n_f$ is the $n$-fold product probability measure under $f$ and $\|P\wedge Q\|$ is defined as $\int\min(\mathrm dP,\mathrm dQ)$.
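The inequality behind this remark, $(1-\|P^n\wedge Q^n\|)^2 \le n\chi^2(p,q)$ for product measures (see the Appendix), can be checked numerically. The Python sketch below (an added illustration, not from the paper) uses two unit-variance normal marginals with a small mean shift $\delta$, for which both the affinity of the products and the $\chi^2$ distance are available in closed form.

```python
import math
from scipy.stats import norm

# Marginals N(0,1) and N(delta,1). For the n-fold products, the testing
# affinity has the closed form
#   ||P^n ∧ Q^n|| = 2 * (1 - Phi(sqrt(n) * delta / 2)),
# and the chi-squared distance between the marginals is
#   chi2 = exp(delta^2) - 1.
delta, n = 0.05, 100
affinity = 2.0 * (1.0 - norm.cdf(math.sqrt(n) * delta / 2.0))
chi2 = math.exp(delta**2) - 1.0
print(f"affinity         = {affinity:.4f}")
print(f"(1 - affinity)^2 = {(1.0 - affinity)**2:.4f}")
print(f"n * chi2         = {n * chi2:.4f}")
assert (1.0 - affinity)**2 <= n * chi2   # the bound used in Remark 1.1
```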
Remark 1.2.

To apply Lemma 1.4, we try to maximize $m\varepsilon^2$ for the best possible lower bound. While we construct the finite density class satisfying the loss separation condition (2), we need to restrict the sizes of $\varepsilon$ and $m$ so that two nearest densities are reasonably close, as in (3), and so that the constructed densities truly belong to the parameter space $\mathcal F$.

For the proof in Section 2, we construct $f_\alpha$'s of the form

$$f_\alpha(x) = f_0(x) + \varepsilon\sum_{k\in K}\alpha_k\Delta_k(x),\qquad \alpha\in\{0,1\}^K,$$

where $f_0$ is the normal density with zero mean (and variance specified later), where $K = \{1,3,\ldots,2m-1\}$, and where $m$, $\varepsilon > 0$ and $\Delta_k$ could depend on $n$. The main difficulty lies in choosing the (signed) perturbations $\Delta_k$ so that each $f_\alpha$ is a normal location mixture. The natural way around this problem is to construct the Assouad hyper-rectangle in the space of mixing distributions, $f_\alpha = \varphi\star\Pi_\alpha$, where

$$\Pi_\alpha(u) = \Pi_0(u) + \varepsilon\sum_{k\in K}\alpha_kV_k(u),\qquad \alpha\in\{0,1\}^K,$$

where the signed measures $V_k$ must be chosen so that each $\Pi_\alpha$ is a probability measure. In contrast to the standard construction, the indirect form of $f_\alpha = \varphi\star\Pi_\alpha$ leads to an embedding condition of the form

$$W(\varphi\star\Pi_\alpha,\ \varphi\star\Pi_\beta) \ \ge\ \tau_n\sum_{k\in K}(\alpha_k-\beta_k)^2 \tag{5}$$

for some $\tau_n$. The right side of (5) is expressed in terms of $\sum_{k\in K}(\alpha_k-\beta_k)^2$ instead of the Hamming distance, in order to emphasize the orthogonality relation. If the convolution with the normal density were not present, such a property could be obtained by choosing the perturbations to be exactly orthogonal to each other, subject to the various other regularity properties that define the parameter space. The smoothing effect of the convolution operation, however, makes it difficult to choose the $V_k$ to achieve such near-orthogonality. Nevertheless, we can achieve (5) by choosing the perturbations so that their Fourier transforms are orthogonal as elements of $L_2(\varphi)$, the space of complex-valued functions $g$ such that $\int\varphi(x)|g(x)|^2\,\mathrm dx < \infty$, for the $L_2$ loss. Similarly, we achieve (5) under the Hellinger loss using similar ideas, except that $\varphi$ is replaced by a different weight function.
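As a sanity check on this orthogonality device, the following Python sketch (an added illustration, not part of the paper) verifies numerically that the weighted Hermite functions used for this purpose in Section 2 (display (7) below) are orthonormal; `hermeval` evaluates the probabilists' Hermite polynomials.

```python
import math
import numpy as np
from numpy.polynomial.hermite_e import hermeval
from scipy.integrate import quad

def phi(t):
    return math.exp(-t * t / 2.0) / math.sqrt(2.0 * math.pi)

def psi(k, t):
    # psi_k(t) = sqrt(2*phi(2t)) * H_k(2t) / sqrt(k!), with H_k the
    # probabilists' Hermite polynomial He_k (see display (7) below).
    Hk = hermeval(2.0 * t, [0.0] * k + [1.0])
    return math.sqrt(2.0 * phi(2.0 * t)) * Hk / math.sqrt(math.factorial(k))

ks = [1, 3, 5, 7]   # odd indices, as in the construction of Section 2
gram = np.array([[quad(lambda t: psi(j, t) * psi(k, t), -np.inf, np.inf)[0]
                  for k in ks] for j in ks])
print(np.round(gram, 6))   # identity matrix: the psi_k are orthonormal
```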
2. Proofs of the lower bounds
First, we introduce some notation used in this section. We let $\varphi_\sigma$ be the normal density with mean zero and variance $\sigma^2$. Following, for example, [9], Chapter 9, we define the Fourier transform $T$ by

$$Tf(t) := \breve f(t) = \frac{1}{\sqrt{2\pi}}\int_{-\infty}^\infty \exp(-\mathrm ixt)f(x)\,\mathrm dx$$

for $f\in L_1(\lambda)$, where $\lambda$ is Lebesgue measure, and then extend from $L_1\cap L_2$ to $L_2$ by extending an isometry of $L_1\cap L_2$ into $L_2$ to an isometry of $L_2$ onto $L_2$.

For both theorems, we construct the signed measures $V_k$ to have (signed) densities $v_k$ with respect to $\lambda$:

$$\pi_\alpha(u) = \frac{\mathrm d\Pi_\alpha}{\mathrm d\lambda}(u) = \pi_0(u) + \varepsilon\sum_{k\in K}\alpha_kv_k(u),\qquad \alpha\in\{0,1\}^K, \tag{6}$$

where $\pi_0$ is a normal density with zero mean and each $v_k$ is a function for which $\int v_k = 0$ and

$$\pi_0(u) + \varepsilon\sum_{k\in K}\alpha_kv_k(u) \ \ge\ 0\qquad \text{for all } u.$$

We then need to check the assumptions for Lemma 1.4.
Here we let $W(f,g) := \|f-g\|_2^2 = \int(f-g)^2$, so (1) is satisfied with $\zeta = 1/2$. The choice of the $v_k$'s is suggested by Fourier methods. By the Plancherel formula (and the fact that $\breve\varphi = \varphi$), recalling that $f_\alpha = \varphi\star\Pi_\alpha$,

$$\frac{1}{2\pi}\|f_\alpha-f_\beta\|_2^2 = \frac{1}{2\pi}\|\breve f_\alpha-\breve f_\beta\|_2^2 = \varepsilon^2\int_{-\infty}^\infty\Big|\sum_{k\in K}(\alpha_k-\beta_k)\,\varphi(t)\breve v_k(t)\Big|^2\,\mathrm dt,$$

which lets us write the desired property (2) of Lemma 1.4 as

$$\int_{-\infty}^\infty\Big|\sum_{k\in K}(\alpha_k-\beta_k)\,\varphi(t)\breve v_k(t)\Big|^2\,\mathrm dt \ \ge\ \frac{c_1}{2\pi}\sum_{k\in K}(\alpha_k-\beta_k)^2\qquad \forall\,\alpha,\beta\in\{0,1\}^K.$$

We might achieve such an inequality by choosing the $v_k$'s to make the functions $\psi_k(t) := \varphi(t)\breve v_k(t)$ orthogonal. Ignoring other requirements for the moment, we could even start from an orthonormal set $\{\psi_k\}$ and then try to define $v_k$ as the (inverse) Fourier transform of $\psi_k(t)/\varphi(t)$, provided that the ratio is square integrable. This heuristic succeeds if we start from the normalized orthogonal functions (see [5], Chapter 9)

$$\psi_k(t) = C\,\mathrm i^{-k}\varphi^2(t)\,\frac{H_k(2t)}{\sqrt{k!}} = \mathrm i^{-k}\sqrt{2\varphi(2t)}\,\frac{H_k(2t)}{\sqrt{k!}} \tag{7}$$

for $k\in K := \{1,3,\ldots,2m-1\}$, where $C = \sqrt2\,(2\pi)^{3/4}$ is chosen so that $C\varphi^2(t) = \sqrt{2\varphi(2t)}$, and $H_k(t)$ is the Hermite polynomial of order $k$, that is, the polynomial for which the $k$th derivative of $\varphi(t)$ equals $(-1)^kH_k(t)\varphi(t)$.

Remark 2.1.

$\{H_k,\ k = 1,2,\ldots\}$ is sometimes called the family of "probabilists' Hermite polynomials" (denoted "$He$" in [3]), as opposed to the "physicists' Hermite polynomials" $\overline H$. There is a one-to-one relation between $H$ and $\overline H$, given by

$$H_k(t) = 2^{-k/2}\,\overline H_k\!\left(\frac{t}{\sqrt2}\right).$$

To calculate the inverse Fourier transform of $\psi_k(t)/\varphi(t)$, we provide the following lemma.
Lemma 2.1.
For $b > a > 0$,

$$T^{-1}[\varphi(at)H_k(bt)](u) = Q_k\,\varphi\!\left(\frac ua\right)H_k(b'u), \tag{8}$$

where $Q_k = (\mathrm i\,c_{a,b})^k/a$ with $c_{a,b} = \sqrt{b^2/a^2-1}$ and $b' = b/(a^2c_{a,b})$.

Remark 2.2.
Lemma 2.1 illustrates a general form of the eigenvalue–eigenfunction relation for the Fourier transform of Hermite functions,

$$T[\varphi(t)H_k(\sqrt2\,t)](u) = (-\mathrm i)^k\,\varphi(u)H_k(\sqrt2\,u).$$

(See (7.376) in [3], or, for more details, Section 4.11 in [6].)
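This relation, and Lemma 2.1 more generally, can be checked numerically. The Python sketch below (an added illustration, not part of the paper) verifies (8) for a = 1, b = 2 and k = 3 at a single point by computing the inverse-Fourier integral directly.

```python
import math
from numpy.polynomial.hermite_e import hermeval
from scipy.integrate import quad

def phi(t):
    return math.exp(-t * t / 2.0) / math.sqrt(2.0 * math.pi)

def He(k, x):
    return hermeval(x, [0.0] * k + [1.0])   # probabilists' Hermite H_k

a, b, k = 1.0, 2.0, 3
c = math.sqrt(b**2 / a**2 - 1.0)             # c_{a,b} = sqrt(3)
bp = b / (a**2 * c)                          # b'     = 2/sqrt(3)

u = 0.7
# T^{-1}[phi(at) H_k(bt)](u) = (2*pi)^{-1/2} \int e^{itu} phi(at) H_k(bt) dt.
# For odd k the integral is purely imaginary, so compute its imaginary part.
imag_part = quad(lambda t: math.sin(t * u) * phi(a * t) * He(k, b * t),
                 -50.0, 50.0)[0] / math.sqrt(2.0 * math.pi)
# Closed form: (i c)^k / a * phi(u/a) * H_k(b'u); for k = 3, (i c)^3 = -i c^3.
closed_form = -(c**3) / a * phi(u / a) * He(k, bp * u)
print(imag_part, closed_form)                # the two sides of (8) agree
```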
We now formulate these arguments into a proof.

Proof of Theorem 1.1.

By Lemma 2.1, defining $\{\psi_k,\ k\in K\}$ as in (7) leads to

$$v_k(u) = C\sqrt{\frac{3^k}{k!}}\,\varphi(u)\,H_k\!\left(\frac{2u}{\sqrt3}\right)\qquad \text{for } k\in K, \tag{9}$$

because $T^{-1}[\varphi(t)H_k(2t)](u) = \mathrm i^k3^{k/2}\varphi(u)H_k(2u/\sqrt3)$. By restricting to odd values of $k$, we make the $v_k$'s real-valued and odd, thereby ensuring that $\int v_k\,\mathrm d\lambda = 0$ and $\int\pi_\alpha\,\mathrm d\lambda = 1$ for each $\alpha$ in $\{0,1\}^K$.

In summary, the choice of $v_k$ as in (9) gives

$$\frac{1}{2\pi}\|f_\alpha-f_\beta\|_2^2 = \varepsilon^2\int_{-\infty}^\infty\Big(\sum_{k\in K}(\alpha_k-\beta_k)\psi_k(t)\Big)^2\,\mathrm dt = \varepsilon^2\sum_{k\in K}(\alpha_k-\beta_k)^2. \tag{10}$$

That is, the condition (2) of Lemma 1.4 is satisfied with $c_1 = 2\pi$.

We still need to check the condition (3), and also show that $\varepsilon$ can be chosen small enough to make all the $\pi_\alpha$'s nonnegative. Actually, we first show that $\pi_\alpha \ge \pi_0/2 > 0$ if

$$\varepsilon \ \le\ \frac{1}{16}\,3^{-m}\,m^{-3/2}, \tag{11}$$

by choosing $\pi_0 = \varphi_{\sqrt{3m}}$. Secondly, we determine the largest size $m$ for which the two densities $f_\alpha$ and $f_\beta$ remain close in $\chi^2$ distance, at the level $\mathrm O(1/n)$, when there is only one different coordinate between $\alpha$ and $\beta$.

To control the denominator in (3), we first show that $|v_k(u)| \le C_k\sqrt{3m}\,\pi_0(u)$, where $C_k = 8\cdot3^{k/2}$. By Cramér's inequality ([3], equation (8.954)),

$$|H_k(u)| \ \le\ \kappa\sqrt{k!}\,\exp(u^2/4)\qquad \text{with } \kappa\approx1.086. \tag{12}$$

Hence

$$|v_k(u)| \ \le\ \frac{\kappa C\,3^{k/2}}{\sqrt{2\pi}}\exp\!\Big(-\frac{u^2}{6}\Big) \ \le\ C_k\,\varphi\!\Big(\frac{u}{\sqrt3}\Big) \tag{13}$$

$$\le\ C_k\,\varphi\!\Big(\frac{u}{\sqrt{3m}}\Big) = C_k\sqrt{3m}\,\pi_0(u). \tag{14}$$

Using (14), we have

$$\pi_\alpha(u) = \pi_0(u) + \varepsilon\sum_{k\in K}\alpha_kv_k(u) \ \ge\ \pi_0(u) - \varepsilon\sum_{k\in K}C_k\sqrt{3m}\,\pi_0(u) \ \ge\ \pi_0(u)\big[1-\sqrt3\,C_{2m-1}m^{3/2}\varepsilon\big] = \pi_0(u)\big[1-8\cdot3^m m^{3/2}\varepsilon\big] \ \ge\ \frac{\pi_0(u)}{2}$$

by the choice of $\varepsilon$ in (11).

Hence, under the condition (11), $f_\alpha := \varphi\star\Pi_\alpha \ge \varphi\star\Pi_0/2 = f_0/2$, which implies that the second condition in Lemma 1.4 can be rewritten as $\int(f_\alpha-f_\beta)^2/f_0 \le c_2/(2n)$ for $\alpha$ and $\beta$ having only one different coordinate. The denominator $f_0 = \varphi\star\Pi_0$ is again a normal density, with mean zero and variance $1+3m$, by the choice of the density $\mathrm d\Pi_0/\mathrm d\lambda := \pi_0 = \varphi_{\sqrt{3m}}$.

For convenience, we assume that $\alpha$ and $\beta$ differ only in the coordinate $k = 1$ (all the other cases work the same way). By splitting the integral into the two regions $|x|\le M\sqrt m$ and $|x| > M\sqrt m$, with a constant $M$ defined by $M^2 = 8\log9$,

$$\int\frac{(f_\alpha-f_\beta)^2}{f_0} = \int_{|x|\le M\sqrt m}\frac{(f_\alpha-f_\beta)^2}{f_0} + \varepsilon^2\int_{|x|>M\sqrt m}\frac{\big(\int\varphi(x-u)v_1(u)\,\mathrm d\lambda\big)^2}{f_0}.$$

For the first integral, the denominator is bounded below on the interval $\{|x|\le M\sqrt m\}$, since

$$f_0(x)\,\mathbb 1\{|x|\le M\sqrt m\} \ >\ \frac{\exp(-M^2/6)}{2\sqrt{2\pi}\sqrt m} \ =:\ \frac{1}{C^*\sqrt m},\qquad \text{where } C^* := 2\sqrt{2\pi}\exp(M^2/6).$$

Hence, by the $L_2$ loss calculation from (10),

$$\int_{|x|\le M\sqrt m}\frac{(f_\alpha(x)-f_\beta(x))^2}{f_0(x)}\,\mathrm dx \ \le\ C^*\sqrt m\,\|f_\alpha-f_\beta\|_2^2 = 2\pi C^*\sqrt m\,\varepsilon^2.$$

For the second integral, recall that for any $k = 1,3,\ldots,2m-1$ we have $|v_k(u)| \le C_{2m-1}\varphi(u/\sqrt3) = \sqrt3\,C_{2m-1}\varphi_{\sigma_1}(u)$ with $\sigma_1 = \sqrt3$ and $C_{2m-1} := 8\cdot3^{m-1/2}$, so that $|\varphi\star v_1| \le \sqrt3\,C_{2m-1}\,\varphi\star\varphi_{\sigma_1} = \sqrt3\,C_{2m-1}\varphi_{\sigma_2}$ with $\sigma_2 = 2$; note that $3C_{2m-1}^2 = 64\cdot9^m$. Using $\varphi_{\sigma_2}(x) \le \sqrt m\,f_0(x)$ and the notation $R(x) := \{|x| > M\sqrt m\}$, we bound the second integral:

$$\varepsilon^2\int_{R(x)}\frac{\big(\int\varphi(x-u)v_1(u)\,\mathrm d\lambda\big)^2}{f_0(x)}\,\mathrm dx \ \le\ 3\varepsilon^2C_{2m-1}^2\int_{R(x)}\frac{\varphi_{\sigma_2}^2(x)}{f_0(x)}\,\mathrm dx \ \le\ 3\sqrt m\,\varepsilon^2C_{2m-1}^2\int_{R(x)}\varphi_{\sigma_2}(x)\,\mathrm dx = \big(64\sqrt m\,\varepsilon^2\big)\Big(9^m\int_{R(x)}\varphi_{\sigma_2}(x)\,\mathrm dx\Big) \ \le\ 64\sqrt m\,\varepsilon^2.$$

Here the last inequality is obtained from a Gaussian tail property, with $\sqrt m\gg\sigma_2 := 2$:

$$\int_{|x|>M\sqrt m}\varphi_{\sigma_2}(x)\,\mathrm dx \ \le\ \exp\!\Big(-\frac{M^2m}{8}\Big) = 9^{-m},$$

since $M^2 = 8\log9$.

Combining these two upper bounds for the integral, we obtain

$$\int\frac{(f_\alpha-f_\beta)^2}{f_0} \ \le\ \sqrt m\,\varepsilon^2\,(2\pi C^* + 64) \ =\ \frac{c_2}{2n}$$

as long as

$$\sqrt m\,\varepsilon^2 \ \le\ \frac{c_2}{2n\,(2\pi C^* + 64)}. \tag{15}$$

As a consequence, the constructed mixing densities fulfil the two requirements in Assouad's lemma under conditions (11) and (15), with

$$\varepsilon^2 \ \le\ \min\!\Big(\frac{3^{-2m}m^{-3}}{256},\ \frac{c_2}{2n\sqrt m\,(2\pi C^* + 64)}\Big).$$

From Lemma 1.4, the lower bound is obtained as $c\varepsilon^2m$, which is at most $\min(3^{-2m}m^{-2},\ \sqrt m/n)$ up to a constant. To find the largest $m\varepsilon^2$, by equating $3^{-2m}m^{-2} = \sqrt m/n$, we obtain $m$ and $\varepsilon^2$ of orders $\log n$ and $1/(n\sqrt{\log n})$, respectively, up to constants, and hence the lower bound is obtained as $\sqrt{\log n}/n$ up to a constant. □

Here we let $W(f,g) := \|\sqrt f-\sqrt g\|_2^2 = \int(\sqrt f-\sqrt g)^2$, so (1) is satisfied with $\zeta = 1$. First, we relate the Hellinger distance to the $\chi^2$ distance. That is, suppose we can show $(1/2)\pi_0(u) \le \pi_\varsigma(u) \le (3/2)\pi_0(u)$, so that $(1/2)f_0(x) \le f_\varsigma(x) \le (3/2)f_0(x)$ for both $\varsigma = \alpha$ and $\varsigma = \beta$, by convolving with the standard normal density. Then, using the upper bound for $f_\alpha$ and $f_\beta$,

$$\int\big(\sqrt{f_\alpha}-\sqrt{f_\beta}\big)^2 = \int\frac{(f_\alpha-f_\beta)^2}{(\sqrt{f_\alpha}+\sqrt{f_\beta})^2} \ \ge\ \frac16\int\frac{(f_\alpha-f_\beta)^2}{f_0}. \tag{16}$$

Similarly, the lower bound for $f_\alpha$ gives an upper bound for the testing condition:

$$\int\frac{(f_\alpha-f_\beta)^2}{f_\alpha} \ \le\ 2\int\frac{(f_\alpha-f_\beta)^2}{f_0}. \tag{17}$$

Thus it would be enough to work with the following quantity:

$$\int\frac{(f_\alpha-f_\beta)^2}{f_0} = \int\Big(\frac{f_\alpha}{\sqrt{f_0}}-\frac{f_\beta}{\sqrt{f_0}}\Big)^2 = \varepsilon^2\int\Big(\sum_{k\in K}(\alpha_k-\beta_k)\,\frac{\varphi\star v_k}{\sqrt{f_0}}\Big)^2,$$

where the second equality is given by (6). At first glance, $\int(f_\alpha-f_\beta)^2/f_0$ does not look amenable to Fourier techniques. However, as Lemma 2.2 below shows, $\varphi\star v_k/\sqrt{f_0}$ can be expressed as the convolution of a normal density (with a variance larger than 1) with a suitable perturbation, for a certain choice of the perturbation function $v_k$ and base density $\pi_0 = \varphi_\sigma$.
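The sandwich (16)–(17) is easy to test numerically. The Python sketch below (an added illustration, not from the paper) builds a toy pair of densities squeezed between $f_0/2$ and $3f_0/2$ and checks both inequalities by grid integration.

```python
import numpy as np

x = np.linspace(-12.0, 12.0, 20001)
dx = x[1] - x[0]
f0 = np.exp(-x**2 / 4.0) / np.sqrt(4.0 * np.pi)     # N(0,2) density, playing f_0

# Toy perturbed densities squeezed between f0/2 and 3*f0/2
# (sin is odd, so both still integrate to 1).
g = 0.3 * np.sin(x)
fa, fb = f0 * (1.0 + g), f0 * (1.0 - g)

hell2 = np.sum((np.sqrt(fa) - np.sqrt(fb))**2) * dx   # squared Hellinger distance
chi2a = np.sum((fa - fb)**2 / fa) * dx                # chi-squared-type quantity
ref = np.sum((fa - fb)**2 / f0) * dx                  # \int (fa - fb)^2 / f0

print(hell2 >= ref / 6.0)    # inequality (16): True
print(chi2a <= 2.0 * ref)    # inequality (17): True
```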
Lemma 2.2.

Consider the perturbation functions

$$v_k(u) = \frac{C_k}{\sqrt{k!}}\,\varphi(\rho u)\,H_k(\gamma u),\qquad \rho^2 \ \ge\ \frac{1}{\sigma^2}+\frac{\gamma^2}{2},$$

where $C_k$ is a constant depending on $k$ and $\gamma > 0$. Then

$$\frac{[\varphi\star v_k](x)}{\sqrt{\varphi\star\varphi_\sigma(x)}} = [\varphi_{\tilde\sigma}\star\tilde v_k](x), \tag{18}$$

where

$$\tilde v_k(u) := \frac{\tilde C_k}{\sqrt{k!}}\,\varphi(\tilde\rho u)\,H_k(\tilde\gamma u), \tag{19}$$

with

$$\tilde\sigma^2 = 1+\frac{1}{2\sigma^2+1},\qquad \tilde C_k = \frac{C_k\,[2\pi(1+\sigma^2)]^{1/4}}{\tilde\sigma},\qquad \tilde\rho = \frac{\sqrt{\rho^2+1-\tilde\sigma^2}}{\tilde\sigma^2},\qquad \tilde\gamma = \frac{\gamma}{\tilde\sigma^2}. \tag{20}$$

By Lemma 2.2, the denominator effect can be incorporated into the normal convolution. We then follow ideas similar to those used in the proof of Theorem 1.1.
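The identity (18) with the parameter map (20) can be verified numerically. The Python sketch below (an added illustration; the values σ = 1, ρ = √3, γ = 4/√5 follow the case used in the proof that follows, while k = 1 and C₁ = 1 are arbitrary) computes both sides of (18) by direct numerical integration.

```python
import math
from scipy.integrate import quad

def phi(t):
    return math.exp(-t * t / 2.0) / math.sqrt(2.0 * math.pi)

def phi_s(t, s):                     # normal density with standard deviation s
    return math.exp(-t * t / (2.0 * s * s)) / (s * math.sqrt(2.0 * math.pi))

# sigma = 1 case from the proof of Theorem 1.3; k = 1, so H_1(x) = x.
sigma, rho, gamma, Ck = 1.0, math.sqrt(3.0), 4.0 / math.sqrt(5.0), 1.0
v = lambda u: Ck * phi(rho * u) * (gamma * u)          # v_1(u) = C_1 phi(rho u) H_1(gamma u)

st2 = 1.0 + 1.0 / (2.0 * sigma**2 + 1.0)               # \tilde{sigma}^2, see (20)
st = math.sqrt(st2)
Ckt = Ck * (2.0 * math.pi * (1.0 + sigma**2))**0.25 / st
rhot = math.sqrt(rho**2 + 1.0 - st2) / st2
gammat = gamma / st2
vt = lambda u: Ckt * phi(rhot * u) * (gammat * u)      # \tilde{v}_1(u), see (19)

x = 0.9
lhs = quad(lambda u: phi(x - u) * v(u), -30.0, 30.0)[0] / \
      math.sqrt(phi_s(x, math.sqrt(1.0 + sigma**2)))
rhs = quad(lambda u: phi_s(x - u, st) * vt(u), -30.0, 30.0)[0]
print(lhs, rhs)                                        # the two sides of (18) coincide
```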
Proof of Theorem 1.3.

Again, the choice of the $v_k$'s is suggested by Fourier methods. For convenience, we let $\pi_0 = \varphi$, so $f_0 = \varphi_{\sqrt2}$ and $\sqrt{f_0} = 2^{3/4}(2\pi)^{1/4}\varphi_2$. Assuming the $\tilde v_k$ in (19) are in $L_1$,

$$T\Big[\frac{f_\alpha}{\sqrt{f_0}}\Big](t) = T[\sqrt{f_0}](t)+\varepsilon\sum_{k\in K}\alpha_kT\Big[\frac{\varphi\star v_k}{\sqrt{f_0}}\Big](t) = T[\sqrt{f_0}](t)+\varepsilon\sum_{k\in K}\alpha_k\sqrt{2\pi}\,T[\varphi_{\tilde\sigma}](t)\,T[\tilde v_k](t)$$

by Lemma 2.2. By the Plancherel formula,

$$\frac{1}{2\pi}\Big\|\frac{f_\alpha}{\sqrt{f_0}}-\frac{f_\beta}{\sqrt{f_0}}\Big\|_2^2 = \varepsilon^2\int\Big|\sum_{k\in K}(\alpha_k-\beta_k)\,T[\varphi_{\tilde\sigma}](t)\,T[\tilde v_k](t)\Big|^2\,\mathrm dt,$$

which lets us write the condition (2) in Assouad's Lemma 1.4 as

$$\int\Big|\sum_{k\in K}(\alpha_k-\beta_k)\,T[\varphi_{\tilde\sigma}](t)\,T[\tilde v_k](t)\Big|^2\,\mathrm dt \ \ge\ \frac{c_1}{2\pi}\sum_{k\in K}(\alpha_k-\beta_k)^2\qquad \forall\,\alpha,\beta\in\{0,1\}^K.$$

As before, we achieve this by choosing the $v_k$'s to make the functions $\psi_k(t) := T[\varphi_{\tilde\sigma}](t)\,T[\tilde v_k](t)$ orthonormal. Ignoring other requirements, we also start from the same orthonormal set (7), and then try to define $\tilde v_k$ as the inverse Fourier transform.

From the fact that (with $\sigma = 1$, so $\tilde\sigma^2 = 4/3$)

$$T[\varphi_{\tilde\sigma}](t) = \frac{1}{\sqrt{2\pi}}\exp\Big(-\frac{2t^2}{3}\Big),$$

and by the definition of $\tilde v_k$ in (19), the requirement is that

$$\psi_k(t) := \frac{\tilde C_k}{\sqrt{2\pi}\,\sqrt{k!}}\exp\Big(-\frac{2t^2}{3}\Big)\,T[\varphi(\tilde\rho u)H_k(\tilde\gamma u)](t) = \mathrm i^{-k}\sqrt{2\varphi(2t)}\,\frac{H_k(2t)}{\sqrt{k!}}. \tag{21}$$

If we determine all the parameters to make (21) true, we have the desired property for the loss separation condition (2); that is, we have

$$\int\frac{(f_\alpha-f_\beta)^2}{f_0} = 2\pi\varepsilon^2\sum_{k\in K}(\alpha_k-\beta_k)^2, \tag{22}$$

and hence, in view of (16), condition (2) holds with $c_1 = \pi/3$.

We have to find $\tilde\rho$, $\tilde\gamma$ and $\tilde C_k$ so that (21) is satisfied. The solutions are derived below and given in (23). After some calculations,

$$T[\varphi(\tilde\rho u)H_k(\tilde\gamma u)](t) = \mathrm i^{-k}\,5^{-k/2}\sqrt{\tfrac23}\,\varphi\Big(t\sqrt{\tfrac23}\Big)H_k(2t),$$

which is equivalent to

$$T^{-1}\Big[\varphi\Big(t\sqrt{\tfrac23}\Big)H_k(2t)\Big](u) = \mathrm i^k\,5^{k/2}\sqrt{\tfrac32}\,\varphi(\tilde\rho u)H_k(\tilde\gamma u).$$

Substituting $a = \sqrt{2/3}$ and $b = 2$ into Lemma 2.1, we have $c_{a,b} = \sqrt5$ and $b' = 3/\sqrt5$, and hence the solutions

$$\tilde C_k = \sqrt3\,(2\pi)^{3/4}\,\sqrt{5^k},\qquad \tilde\rho = \sqrt{\tfrac32},\qquad \tilde\gamma = \frac{3}{\sqrt5}. \tag{23}$$

We need to ensure that the choice of $\sigma = 1$ satisfies the inequality $\rho^2 \ge 1/\sigma^2+\gamma^2/2$ needed for Lemma 2.2. Comparing (20) and (23), we obtain $\rho = \sqrt3$ and $\gamma = 4/\sqrt5$, which satisfy the condition. Also, $C_k$ is obtained from (20) as $C_k = \tilde C_k\tilde\sigma/(4\pi)^{1/4} = 2^{5/4}\sqrt\pi\,\sqrt{5^k}$.

Therefore, this choice for the $\psi_k$'s leads to

$$v_k(u) = 2^{5/4}\sqrt\pi\,\sqrt{\frac{5^k}{k!}}\,\varphi(\sqrt3\,u)\,H_k\!\Big(\frac{4}{\sqrt5}\,u\Big)\qquad \text{for } k\in K. \tag{24}$$

By restricting to odd values of $k$, we make the $v_k$'s real-valued and odd, thereby ensuring that $\int v_k\,\mathrm d\lambda = 0$.

Using exactly the same idea as in the previous section, if

$$\varepsilon \ \le\ \frac{1}{2\kappa mC_{2m-1}}\qquad \text{with } \kappa\simeq1.086, \tag{25}$$

then

$$\tfrac12\pi_0(u) \ \le\ \pi_\alpha(u) \ \le\ \tfrac32\pi_0(u)\qquad \text{for all } u\in\mathbb R,\ \alpha\in\{0,1\}^K.$$

Now the second, testing condition can be treated straightforwardly. Indeed, once we choose the orthonormal functions $\{\psi_k,\ k\in K\}$, we obtain

$$\int\frac{(f_\alpha-f_\beta)^2}{f_\alpha} \ \le\ 2\int\frac{(f_\alpha-f_\beta)^2}{f_0} = 4\pi\varepsilon^2\qquad \text{for } \|\alpha-\beta\| = 1$$

by (17) and (22). Thus it is enough to choose $\varepsilon^2 \le c_2/(4\pi n)$; with our choice $\varepsilon = 1/(4\sqrt n)$, the testing condition is satisfied.

From the lower bound $m\varepsilon^2$, we want to choose $m$ as large as possible. The condition in (25) restricts the size of $m$:

$$2\kappa mC_{2m-1} \ \le\ 4\sqrt n,\qquad \text{which holds provided } m\,5^m \ \le\ \sqrt n.$$

Thus, we have the upper bound for $m$,

$$m \ \lesssim\ \frac{1}{2\log5}\,\log n \ \simeq\ (0.31)\log n.$$

Finally, we check that these constructed $\pi_\alpha$'s lie inside the parameter space $\mathcal P_s(\mathbb R)$. From the fact that $\pi_\alpha(u) \le (3/2)\pi_0(u)$ for all $u\in\mathbb R$ and $\alpha\in\{0,1\}^K$, it is clear that each $\Pi_\alpha$ is in $\mathcal P_s(\mathbb R)$ by the tail property of the normal density. Consequently, the lower bound is obtained as $\log n/n$ up to a constant. □
3. Proof of the upper bound
For the reader's convenience, we summarize the arguments for Theorem 1.2, following pages 365–369 of [4]. Before turning to that result, we first show that if $f = \varphi\star\Pi\in\mathcal F$, then $f$ can be extended to an entire function $f^*$. To see this, we let

$$f^*(x+\mathrm iy) = \frac{1}{\sqrt{2\pi}}\int\exp\big(-(x+\mathrm iy-u)^2/2\big)\,\mathrm d\Pi(u).$$

Defining

$$a(x,y,u) := y(u-x),\qquad b(x,y,u) := -\tfrac12\big\{(x-u)^2-y^2\big\},$$

we write

$$\sqrt{2\pi}\,f^*(x+\mathrm iy) = \int\big\{\cos a(x,y,u)+\mathrm i\sin a(x,y,u)\big\}\mathrm e^{b(x,y,u)}\,\mathrm d\Pi(u) = \int\cos a(x,y,u)\,\mathrm e^{b(x,y,u)}\,\mathrm d\Pi(u) + \mathrm i\int\sin a(x,y,u)\,\mathrm e^{b(x,y,u)}\,\mathrm d\Pi(u) =: v(x,y)+\mathrm iw(x,y).$$

By differentiating under the integral sign (see Theorem 16.8 in [1]),

$$\frac{\partial v}{\partial x} = \int\big\{y\sin a(x,y,u)+(u-x)\cos a(x,y,u)\big\}\mathrm e^{b(x,y,u)}\,\mathrm d\Pi(u) = \frac{\partial w}{\partial y},$$

$$\frac{\partial v}{\partial y} = \int\big\{y\cos a(x,y,u)-(u-x)\sin a(x,y,u)\big\}\mathrm e^{b(x,y,u)}\,\mathrm d\Pi(u) = -\frac{\partial w}{\partial x}.$$

Also note that $\partial v/\partial x$, $\partial v/\partial y$, $\partial w/\partial x$ and $\partial w/\partial y$ are continuous. Then by the Cauchy–Riemann theorem (see Theorem 1.5.8 in [8]), $f^*$ is analytic. Now it suffices to show that $f^*$ satisfies the growth condition. Indeed,

$$\sup_x|f^*(x+\mathrm iy)| = \sup_x\Big|\frac{1}{\sqrt{2\pi}}\int\exp\Big(-\frac{(x+\mathrm iy-u)^2}{2}\Big)\mathrm d\Pi(u)\Big| \ \le\ \frac{1}{\sqrt{2\pi}}\sup_x\int\Big|\exp\Big(-\frac{(x+\mathrm iy-u)^2}{2}\Big)\Big|\,\mathrm d\Pi(u) \tag{26}$$

$$\le\ \frac{1}{\sqrt{2\pi}}\exp\Big(\frac{y^2}{2}\Big)\sup_x\int\mathrm d\Pi(u) = \frac{1}{\sqrt{2\pi}}\exp\Big(\frac{y^2}{2}\Big).$$
Thus, $\mathcal F\subseteq\mathcal F^*$, which ensures that Ibragimov's estimator also gives the upper bound to match Theorem 1.1.

Proof of Theorem 1.2.
Ibragimov used a sinc kernel estimator,

$$\hat f_n(x) = \frac{1}{nh}\sum_{j=1}^nK\Big(\frac{X_j-x}{h}\Big),\qquad K(u) = \frac{\sin u}{\pi u},$$

with bandwidth $h = 1/\sqrt{2\log n}$. It is important for his method that the Fourier transform of $K$ is $\breve K(t) = \frac{1}{\sqrt{2\pi}}\mathbb 1\{|t|\le1\}$, and also that $\frac{1}{\sqrt{2\pi}}(1-|t|)_+$, where $x_+ := \max(x,0)$, is the Fourier transform of the nonnegative Fejér-type function $\frac\pi2K^2(x/2)$. The variance term satisfies $E_{n,f}\int(\hat f_n-E_{n,f}\hat f_n)^2 \le (nh)^{-1}\int K^2(u)\,\mathrm du = (\pi nh)^{-1} = \mathrm O(\ell_n)$.

For the bias term, note that $E_{n,f}\hat f_n$ has the Fourier transform $\sqrt{2\pi}\,\breve f(t)\breve K(ht)$, so that

$$\text{bias} := \int\big(E_{n,f}\hat f_n-f\big)^2 = \int\big|T[E_{n,f}\hat f_n](t)-T[f](t)\big|^2\,\mathrm dt \qquad\text{by Plancherel}$$

$$= \int|\breve f(t)|^2\,\big|\sqrt{2\pi}\,\breve K(ht)-1\big|^2\,\mathrm dt = \int_{|t|\ge1/h}|\breve f(t)|^2\,\mathrm dt \qquad\text{by the form of } \breve K$$

$$\le\ 2\mathrm e^{-y/h}\int\mathrm e^{-yt}|\breve f(t)|^2\,\mathrm dt\quad\text{for } y > 0$$

$$= 2\mathrm e^{-y/h}\lim_{M\to\infty}\int\mathrm e^{-yt}|\breve f(t)|^2\Big(1-\frac{|t|}{M}\Big)_+\mathrm dt.$$

Write the last integral as

$$\int\mathrm e^{-yt}|\breve f(t)|^2\Big(1-\frac{|t|}{M}\Big)_+\mathrm dt = \int\breve f(t)\,\mathrm e^{-yt}\,\breve\vartheta(-t)\,\mathrm dt,$$

where $\breve\vartheta(t) = \breve f(t)(1-|t|/M)_+$ is the Fourier transform of the nonnegative function $\vartheta(x) = \frac{\pi M}{2}\int f(u)K^2\big(\frac{M(x-u)}{2}\big)\,\mathrm du$. Using $\vartheta(x+\mathrm iy) = \frac{1}{\sqrt{2\pi}}\int\mathrm e^{\mathrm itx}\mathrm e^{-yt}\breve\vartheta(t)\,\mathrm dt$, we have $\int f(x)\vartheta(x+\mathrm iy)\,\mathrm dx = \int\breve f(t)\mathrm e^{yt}\breve\vartheta(-t)\,\mathrm dt$ by Parseval's theorem. By changing the contour of the integration, $\int f(x+\mathrm iy)\vartheta(x)\,\mathrm dx = \int\breve f(t)\mathrm e^{-yt}\breve\vartheta(-t)\,\mathrm dt$. Combining these ideas,

$$\int\breve f(t)\mathrm e^{-yt}\breve\vartheta(-t)\,\mathrm dt = \int f(x+\mathrm iy)\vartheta(x)\,\mathrm dx \ \le\ \int\sup_x|f(x+\mathrm iy)|\,\vartheta(x)\,\mathrm dx \ \le\ \frac{\exp(y^2/2)}{\sqrt{2\pi}},$$

where the last inequality follows from (26) together with $\int\vartheta(x)\,\mathrm dx = \sqrt{2\pi}\,\breve\vartheta(0) = 1$. By taking $y = 1/h$, the bias is $\mathrm O(\mathrm e^{-1/(2h^2)}) = \mathrm O(1/n)$, and combining this with the variance bound we obtain the upper bound as $\sqrt{\log n}/n =: \ell_n$ up to a constant. □
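A minimal Python sketch of this estimator follows (added for illustration; the sampling step assumes a toy mixture, namely the mixing measure N(0,1), so that the target density is N(0,2)).

```python
import numpy as np

def sinc_kernel_estimator(sample, x, h=None):
    """Ibragimov's sinc kernel density estimator evaluated at the points x."""
    n = len(sample)
    if h is None:
        h = 1.0 / np.sqrt(2.0 * np.log(n))   # bandwidth from the proof above
    u = (sample[None, :] - np.asarray(x)[:, None]) / h
    # K(u) = sin(u)/(pi*u); note np.sinc(t) = sin(pi*t)/(pi*t).
    K = np.sinc(u / np.pi) / np.pi
    return K.mean(axis=1) / h

rng = np.random.default_rng(0)
# A normal location mixture from F: mixing measure N(0,1), so f = N(0,2).
sample = rng.normal(0.0, 1.0, size=5000) + rng.normal(0.0, 1.0, size=5000)
x = np.linspace(-4.0, 4.0, 9)
true_f = np.exp(-x**2 / 4.0) / np.sqrt(4.0 * np.pi)
print(np.round(sinc_kernel_estimator(sample, x), 4))
print(np.round(true_f, 4))
```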
4. Discussion
It has been claimed that Fano's method is more general in a sense (see [13], page 428). Indeed, using the Varshamov–Gilbert lemma (e.g., Lemma 2.9 in [10]), it is not very difficult to prove the same rate result for $\mathcal F$ with a similar type of sub-parameter space using Fano's method.

However, Assouad's method seems more convenient in some cases. For instance, before knowing how to construct the subspace, it would be extremely difficult to determine the right family of densities when there are only indirect regularity conditions, as in this example. Assouad's hyper-rectangle method indicates that the problem can be solved if we can show the orthogonality relations between the constructed densities. These added regularity conditions can cause different difficulties, but we at least have some clues for handling these problems.

On the other hand, if we know metric entropy results (good packing and covering number bounds) beforehand, the optimal minimax rates can be obtained almost automatically with the predictive Bayes density estimator, using the main theorems in [12]. It would be interesting to see if one can calculate a sharper metric entropy bound for $\mathcal F$ or $\mathcal F_s$ than the one that appeared in [2].

Appendix
Proof of Lemma 1.4.
Most of the proof is based on ideas borrowed from [7, 10], and some unpublished notes by David Pollard. Denote $A = \{0,1\}^K$ and, for convenience, write $E_\alpha$ for $E_{f_\alpha}$ and $P_\alpha$ for $P_{f_\alpha}$, where $P_{f_\alpha} = P^n_{f_\alpha}$. For any density estimator $\hat f_n$ based on the observations $X_1,\ldots,X_n$, define an estimator

$$\hat\alpha = \mathop{\arg\min}_{\alpha\in A}\,W(\hat f_n, f_\alpha).$$

By restricting the parameter space and by the definition of $\hat\alpha$,

$$\sup_{f\in\mathcal F}E_f\,W(\hat f_n,f) \ \ge\ \max_{\alpha\in A}E_\alpha W(\hat f_n,f_\alpha) \ \ge\ \frac12\max_{\alpha\in A}E_\alpha\big(W(\hat f_n,f_\alpha)+W(\hat f_n,f_{\hat\alpha})\big) \ \ge\ \frac\zeta2\max_{\alpha\in A}E_\alpha W(f_\alpha,f_{\hat\alpha})$$

using the pseudo-distance property (1). Now, using the condition (2) in the lemma followed by the simple fact that the maximum is bounded below by the average, the last expression can be lower bounded by

$$\frac{c_1\varepsilon^2\zeta}{2}\,\max_{\alpha\in A}\sum_{k=1}^mE_\alpha\mathbb 1\{\alpha_k\ne\hat\alpha_k\} \ \ge\ \frac{c_1\varepsilon^2\zeta}{2}\,\frac{1}{2^m}\sum_{\alpha\in A}\sum_{k=1}^mE_\alpha\mathbb 1\{\alpha_k\ne\hat\alpha_k\}.$$

Define

$$\bar P_{0,k} = \frac{1}{2^{m-1}}\sum_{\alpha\in A_{0,k}}P_\alpha,\qquad \bar P_{1,k} = \frac{1}{2^{m-1}}\sum_{\alpha\in A_{1,k}}P_\alpha,\qquad k = 1,\ldots,m,$$

where $A_{i,k} = \{\alpha\in A:\ \alpha_k = i\}$ for $i = 0, 1$. Since $\alpha_k,\hat\alpha_k\in\{0,1\}$, we have

$$\frac{1}{2^m}\sum_{\alpha\in A}\sum_{k=1}^mE_\alpha\mathbb 1\{\alpha_k\ne\hat\alpha_k\} = \frac{1}{2^m}\sum_{k=1}^m\Big(\sum_{\alpha\in A_{0,k}}P_\alpha\{\hat\alpha_k=1\}+\sum_{\alpha\in A_{1,k}}P_\alpha\{\hat\alpha_k=0\}\Big) = \frac12\sum_{k=1}^m\big(\bar P_{0,k}\{\hat\alpha_k=1\}+\bar P_{1,k}\{\hat\alpha_k=0\}\big),$$

which gives us the following lower bound:

$$\sup_{f\in\mathcal F}E_f\,W(\hat f_n,f) \ \ge\ \frac{c_1\varepsilon^2\zeta}{4}\sum_{k=1}^m\|\bar P_{0,k}\wedge\bar P_{1,k}\|$$

by $Ph+Q(1-h)\ge\|P\wedge Q\|$ for $0\le h\le1$, applied with $h = \mathbb 1\{\hat\alpha_k=1\}$.

For $k = m$, each $\alpha$ in $A_{0,m}$ is of the form $(\gamma,0)$ with $\gamma\in D := \{0,1\}^{m-1}$. Similarly, each $\alpha$ in $A_{1,m}$ is of the form $(\gamma,1)$ with $\gamma\in D$. Now

$$\|\bar P_{0,m}\wedge\bar P_{1,m}\| = \int\Big(\frac{1}{2^{m-1}}\sum_{\gamma\in D}p_{\gamma,0}\Big)\wedge\Big(\frac{1}{2^{m-1}}\sum_{\gamma\in D}p_{\gamma,1}\Big) \ \ge\ \int\frac{1}{2^{m-1}}\sum_{\gamma\in D}(p_{\gamma,0}\wedge p_{\gamma,1}).$$

Note that $(\gamma,0)$ and $(\gamma,1)$ have only one different coordinate. By similar calculations for the other $k$'s, we obtain

$$\sup_{f\in\mathcal F}E_f\,W(\hat f_n,f) \ \ge\ \frac{c_1\varepsilon^2\zeta}{4}\,m\min_{d(\alpha,\beta)=1}\|P_\alpha\wedge P_\beta\|.$$

In general, it is difficult to calculate the testing affinity exactly. Fortunately, a convenient lower bound in terms of distances between marginals is available when $P_\alpha$ and $P_\beta$ are both product measures. For instance, when $P_\alpha = P^n_{\theta_\alpha}$ in the i.i.d. case, we can bound this using the chi-squared distance $\chi^2$ by the following relation:

$$\big(1-\|P_\alpha\wedge P_\beta\|\big)^2 \ \le\ n\,\chi^2(\theta_\alpha,\theta_\beta) := n\int\frac{(\theta_\alpha-\theta_\beta)^2}{\theta_\alpha}.$$

Thus, the condition (3) in the lemma yields a lower bound for the maximum risk:

$$\sup_{f\in\mathcal F}E_f\,W(\hat f_n,f) \ \ge\ \frac{c_1\varepsilon^2\zeta}{4}\,m\,(1-\sqrt{c_2}). \qquad\square$$

See [10], Lemma 2.7 on page 90, or [7], Lemma 1 on page 40, for the derivation of these facts about relations between distances.
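The key averaging step above, namely that the affinity of the two mixture barycenters dominates the average of the pairwise affinities, can be checked in a toy discrete example (a Python sketch added for illustration; the distributions are arbitrary random probability vectors):

```python
import numpy as np

rng = np.random.default_rng(1)
m = 4                                  # so D = {0,1}^(m-1) has 8 elements
support = 10                           # toy discrete sample space

# Random probability vectors p_{gamma,0} and p_{gamma,1}, gamma in D.
p0 = rng.dirichlet(np.ones(support), size=2**(m - 1))
p1 = rng.dirichlet(np.ones(support), size=2**(m - 1))

# Affinity ||P ∧ Q|| = sum_x min(P(x), Q(x)).
mix_affinity = np.minimum(p0.mean(axis=0), p1.mean(axis=0)).sum()
avg_pairwise = np.minimum(p0, p1).sum(axis=1).mean()
print(mix_affinity, avg_pairwise)
assert mix_affinity >= avg_pairwise    # the inequality used above
```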
Proof of Lemma 2.1.
For $b > a > 0$, we have

$$\varphi(at)\exp\Big(btx-\frac{x^2}{2}\Big) = \varphi(at)\sum_{k=0}^\infty\frac{H_k(bt)}{k!}\,x^k.$$

Thus,

$$T^{-1}\Big[\varphi(at)\exp\Big(btx-\frac{x^2}{2}\Big)\Big](u) = \int_{-\infty}^\infty\frac{\exp(\mathrm itu)}{2\pi}\exp\Big(-\frac{a^2t^2}{2}+bxt-\frac{x^2}{2}\Big)\,\mathrm dt = \frac{1}{a\sqrt{2\pi}}\exp\Big(\frac{(bx+\mathrm iu)^2}{2a^2}-\frac{x^2}{2}\Big)$$

$$= \frac1a\,\varphi\Big(\frac ua\Big)\exp\Big(\frac{bxu\,\mathrm i}{a^2}-\frac12(\mathrm ixc_{a,b})^2\Big) = \frac1a\,\varphi\Big(\frac ua\Big)\sum_{k=0}^\infty\frac{H_k\big(b/(a^2c_{a,b})\,u\big)}{k!}\,(\mathrm ic_{a,b})^kx^k.$$

The inverse Fourier transform of the right side of the first display is

$$\sum_{k=0}^\infty T^{-1}\Big[\varphi(at)\frac{H_k(bt)}{k!}\Big](u)\,x^k.$$

By matching the coefficients of the $k$th power of $x$,

$$T^{-1}[\varphi(at)H_k(bt)](u) = \frac{(\mathrm ic_{a,b})^k}{a}\,\varphi\Big(\frac ua\Big)H_k\Big(\frac{b}{a^2c_{a,b}}\,u\Big),$$

which proves the claim. □

Proof of Lemma 2.2.
First, note that $\varphi\star\varphi_\sigma = \varphi_{\sqrt{1+\sigma^2}}$. We define $[\varphi\star v_k](x) = \int\varphi(x-u)v_k(u)\,\mathrm du$ and, similarly, $[\varphi\star\varphi(\rho\,\cdot)H_k(\gamma\,\cdot)](x) = \int\varphi(x-u)\varphi(\rho u)H_k(\gamma u)\,\mathrm du$. By the definition of $v_k$, we have

$$\frac{[\varphi\star v_k](x)}{\sqrt{\varphi_{\sqrt{1+\sigma^2}}(x)}} = \frac{C_k}{\sqrt{k!}}\,\frac{[\varphi\star\varphi(\rho\,\cdot)H_k(\gamma\,\cdot)](x)}{\sqrt{\varphi_{\sqrt{1+\sigma^2}}(x)}} = \frac{C_k}{\sqrt{k!}}\int\frac{(1/\sqrt{2\pi})\exp(-\tfrac12(x-u)^2)\,(1/\sqrt{2\pi})\exp(-\tfrac12\rho^2u^2)\,H_k(\gamma u)}{(2\pi(1+\sigma^2))^{-1/4}\exp\big(-x^2/(4(1+\sigma^2))\big)}\,\mathrm du.$$

Now, by completing the square,

$$\exp\Big(-\frac12(x-u)^2\Big)\exp\Big(-\frac12\rho^2u^2\Big)\exp\Big(\frac{x^2}{4(1+\sigma^2)}\Big) = \exp\Big(\Big(-\frac12+\frac{1}{4(1+\sigma^2)}\Big)x^2+xu-\Big(\frac12+\frac{\rho^2}{2}\Big)u^2\Big)$$

$$= \exp\Big(-\frac{x^2}{2\tilde\sigma^2}+xu-\Big(\frac12+\frac{\rho^2}{2}\Big)u^2\Big)\qquad \text{by the definition of } \tilde\sigma \text{ in (20)}$$

$$= \exp\Big(-\frac{(x-\tilde u)^2}{2\tilde\sigma^2}\Big)\exp\Big(-\frac12\big(1+\rho^2-\tilde\sigma^2\big)\frac{\tilde u^2}{\tilde\sigma^4}\Big)\qquad \text{with } \tilde u := \tilde\sigma^2u$$

$$= 2\pi\tilde\sigma\,\varphi_{\tilde\sigma}(x-\tilde u)\,\varphi\Big(\frac{\sqrt{\rho^2+1-\tilde\sigma^2}}{\tilde\sigma^2}\,\tilde u\Big),$$

where the positivity of $(1+\rho^2-\tilde\sigma^2)$ is guaranteed by the condition $\rho^2 \ge 1/\sigma^2+\gamma^2/2 > 1/(1+2\sigma^2) = \tilde\sigma^2-1$. By the change of variables $u\mapsto\tilde u$,

$$\frac{[\varphi\star v_k](x)}{\sqrt{\varphi_{\sqrt{1+\sigma^2}}(x)}} = \Big(\frac{C_k}{\sqrt{k!}}\,\frac{[2\pi(1+\sigma^2)]^{1/4}}{\tilde\sigma}\Big)\Big[\varphi_{\tilde\sigma}\star\varphi(\tilde\rho\,\cdot)H_k(\tilde\gamma\,\cdot)\Big](x).$$

Using the definitions of the transformed parameters in (20), the proof is complete. □
Acknowledgements
This work is part of the author's Ph.D. dissertation, written at Yale University. The author is grateful to David Pollard, Harrison Zhou and Richard J. Samworth for their comments and advice. The research was supported in part by NSF Career Award DMS-06-45676 and NSF FRG Grant DMS-08-54975.
References

[1] Billingsley, P. (1995). Probability and Measure, 3rd ed. Wiley Series in Probability and Mathematical Statistics. New York: Wiley. MR1324786
[2] Ghosal, S. and van der Vaart, A.W. (2001). Entropies and rates of convergence for maximum likelihood and Bayes estimation for mixtures of normal densities. Ann. Statist. 29 1233–1263.
[3] Gradshteyn, I.S. and Ryzhik, I.M. (2007). Table of Integrals, Series, and Products, 7th ed. Amsterdam: Elsevier/Academic Press. MR2360010
[4] Ibragimov, I. (2001). Estimation of analytic functions. In State of the Art in Probability and Statistics (Leiden, 1999) (C. Klaassen, M. de Gunst and A.W. van der Vaart, eds.). Institute of Mathematical Statistics Lecture Notes—Monograph Series 36 359–383. Beachwood, OH: IMS.
[5] Jackson, D. (2004). Fourier Series and Orthogonal Polynomials. Mineola, NY: Dover. MR2098657
[6] Kawata, T. (1972). Fourier Analysis in Probability Theory. New York: Academic Press. MR0464353
[7] Le Cam, L. (1973). Convergence of estimates under dimensionality restrictions. Ann. Statist. 1 38–53.
[8] Marsden, J.E. and Hoffman, M.J. (1987). Basic Complex Analysis, 2nd ed. New York: Freeman. MR0913736
[9] Rudin, W. (1987). Real and Complex Analysis, 3rd ed. New York: McGraw-Hill. MR0924157
[10] Tsybakov, A.B. (2009). Introduction to Nonparametric Estimation. Springer Series in Statistics. New York: Springer. MR2724359
[11] van der Vaart, A.W. (1998). Asymptotic Statistics. Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge: Cambridge Univ. Press. MR1652247
[12] Yang, Y. and Barron, A. (1999). Information-theoretic determination of minimax rates of convergence. Ann. Statist. 27 1564–1599.
[13] Yu, B. (1997). Assouad, Fano, and Le Cam. In Festschrift for Lucien Le Cam (D. Pollard, E. Torgersen and G.L. Yang, eds.) 423–435. New York: Springer. MR1462963