On the consistency of the Kozachenko-Leonenko entropy estimate
Luc Devroye∗        László Györfi†

February 26, 2021
Abstract
We revisit the problem of the estimation of the differential entropy $H(f)$ of a random vector $X$ in $\mathbb{R}^d$ with density $f$, assuming that $H(f)$ exists and is finite. In this note, we study the consistency of the popular nearest neighbor estimate $H_n$ of Kozachenko and Leonenko. Without any smoothness condition we show that the estimate is consistent ($\mathbf{E}\{|H_n - H(f)|\} \to 0$ as $n \to \infty$) if and only if $\mathbf{E}\{\log(\|X\|+1)\} < \infty$. Furthermore, if $X$ has compact support, then $H_n \to H(f)$ almost surely.

Index terms: differential entropy estimate, consistency conditions
The differential entropy of a random $\mathbb{R}^d$-valued vector $X$ with probability density function $f$ is
$$ H(f) = -\int f(x) \ln f(x)\, dx = -\mathbf{E}\{\ln f(X)\} \qquad (1) $$
when this integral exists. The objective of this paper is to study an estimate of (1) based on independent and identically distributed samples $X_1, \dots, X_n$ with density $f$.

∗ Luc Devroye, School of Computer Science, McGill University, Montreal, Canada. [email protected]. Devroye's research was supported by a Discovery Grant from NSERC.
† László Györfi, Department of Computer Science and Information Theory, Budapest University of Technology and Economics, Budapest, Hungary. [email protected].
$\int f_n \log(1/f_n)$ when $f_n$ is a general density estimate (see, e.g., Godavarti [14]). There has been particular interest in minimax rates of convergence, culminating in the paper by Han, Jiao, Weissman and Wu [19], who obtained the minimax rates for classes of densities on the unit cube of $\mathbb{R}^d$ that are Besov balls or Hölder (Lipschitz) balls of smoothness parameter $s$. Their rates are of exact order $\max\{(n \log n)^{-s/(s+d)}, 1/\sqrt{n}\}$. For standard Lipschitz densities, the minimax rate is of exact order $1/\sqrt{n}$ (achieving the parametric rate) for dimension one, but it is $(n \log n)^{-1/(d+1)}$ when $d > 1$. They also construct a kernel-based estimate that is minimax optimal for these classes.

Most of the previous work assumes compact support, and thus sidesteps the thorny issue of infinite tails. The objective of this note is merely to give a complete characterization of the consistency of the popular Kozachenko-Leonenko estimate [22] for estimating the differential entropy. Without any smoothness condition on the density $f$ we study three types of consistency: strong, $L_1$ and weak.

It is an open research problem whether there exists an entropy estimate that is consistent under the sole condition that $H(f)$ is finite. The most obvious way of estimating the differential entropy is the partitioning-based estimate. If the corresponding partition is deterministic, then we conjecture that there is no deterministic partition such that the corresponding differential entropy estimate is a.s. consistent for every density with finite differential entropy. The significant breakthrough in this respect is due to Wang, Kulkarni, Verdú [35] and Silva, Narayanan [28]. They suggested data-dependent partitioning, for which the partitioning-based estimates of Kullback-Leibler (KL) divergence and of mutual information are strongly consistent under the sole conditions that the KL-divergence and the mutual information, respectively, are finite. We conjecture that a universally consistent entropy estimate can be derived from data-dependent partitioning.

For the cross-validation or leave-one-out entropy estimate, $f_{n,i}$ denotes a density estimate based on $X_1, \dots, X_n$ leaving $X_i$ out, and the corresponding entropy estimate is of the form
$$ H_n = -\frac{1}{n} \sum_{i=1}^n \ln f_{n,i}(X_i). \qquad (2) $$
Kozachenko and Leonenko [22] introduced the nearest neighbor entropy estimate as follows. Let $R_{n,i}(x)$, $x \in \mathbb{R}^d$, be defined by
$$ R_{n,i}(x) = \min_{j \ne i,\, j \le n} \|x - X_j\|, $$
where $\|\cdot\|$ denotes the Euclidean norm.
Then the nearest neighbor entropy estimate is
$$ H_n = \frac{1}{n} \sum_{i=1}^n \ln\big( (n-1)\, R_{n,i}(X_i)^d\, v_d \big) + C_E, \qquad (3) $$
where $C_E = -\int_0^\infty e^{-t} \ln t\, dt = 0.5772\ldots$ is the Euler-Mascheroni constant and $v_d$ denotes the volume of the unit ball in $\mathbb{R}^d$. The estimate in (3) has the form of (2) if $f_{n,i}$ is a constant multiple of the first-nearest-neighbor (1-NN) density estimate:
$$ f_{n,i}(x) = \frac{1}{(n-1)\, \min_{j \ne i,\, j \le n} \|x - X_j\|^d\, v_d\, e^{C_E}}. $$
Notice that this particular $f_{n,i}$ is not consistent in $L_1$, because
$$ \int f_{n,i}(x)\, dx = \infty. $$
Furthermore, $f_{n,i}(x)$ is unbounded at the $X_j$, $j \ne i$. Also, $f_{n,i}(x)$ does not in general tend to $f(x)$ in probability, i.e., the density estimates $f_{n,i}$ are not weakly consistent.

Under some mild conditions on the density $f$, Kozachenko and Leonenko [22] proved the mean square consistency. Biau and Devroye [4] showed that for bounded $X$, if $\int f(x) \ln(f(x)+1)\, dx < \infty$, then $H_n \to H(f)$ in probability. For smooth densities, Berrett, Samworth and Yuan [3], Delattre and Fournier [7] and Tsybakov and van der Meulen [34] studied the rate of convergence of $H_n$ and of its extensions to many nearest neighbors.

In this paper we show that for bounded $X$ the Kozachenko-Leonenko estimate is strongly consistent. Furthermore, we give a necessary and sufficient tail condition for $L_1$ consistency. In addition, we construct a counterexample to weak consistency for a uniform density on an unbounded set. The proofs are presented in the last section.

Consistency results
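The estimate (3) is straightforward to compute from pairwise distances. A minimal sketch in plain Python (the function name `kl_entropy` is ours, not from the paper):

```python
import math
import random

def kl_entropy(points):
    """Kozachenko-Leonenko estimate (3): (1/n) sum_i ln((n-1) R_i^d v_d) + C_E."""
    n, d = len(points), len(points[0])
    c_e = 0.5772156649015329  # Euler-Mascheroni constant C_E
    v_d = math.pi ** (d / 2) / math.gamma(d / 2 + 1)  # volume of the unit ball in R^d
    total = 0.0
    for i, x in enumerate(points):
        # R_{n,i}(X_i): distance from X_i to its nearest neighbor among the others
        r = min(math.dist(x, y) for j, y in enumerate(points) if j != i)
        total += math.log((n - 1) * r ** d * v_d)
    return total / n + c_e

# For the uniform density on [0,1], H(f) = 0, so the estimate should be close to 0.
random.seed(1)
sample = [(random.random(),) for _ in range(1000)]
print(kl_entropy(sample))
```

The brute-force nearest neighbor search is $O(n^2)$; for large samples one would use a spatial index instead.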
For bounded $X$, without any smoothness condition on the density $f$, the Kozachenko-Leonenko estimate is strongly consistent:

Theorem 1.
If the support of $f$ is bounded and
$$ \int f(x) \ln(f(x)+1)\, dx < \infty, \qquad (4) $$
then
$$ \lim_{n \to \infty} H_n = H(f) \qquad (5) $$
a.s.

Next, we present a necessary and sufficient tail condition for consistency in $L_1$:

Theorem 2.
Assume that $H(f)$ is finite. For any density $f$,
$$ \lim_{n \to \infty} \mathbf{E}\{|H_n - H(f)|\} = 0 \qquad (6) $$
if and only if
$$ \mathbf{E}\{(\ln \|X\|)_+\} < \infty. \qquad (7) $$

If $\mathbf{E}\{(\ln \|X\|)_+\} = \infty$, then the expectations $\mathbf{E}\{\ln \|X\|\}$ and $\mathbf{E}\{\ln R_{n,1}(X_1)\}$ may not exist. However, for finite $H(f)$, we show that the expectation $\mathbf{E}\{H_n\}$ is well defined and larger than $-\infty$.

As a sufficient condition, (7) appeared in the studies of distribution and density estimates consistent in KL-divergence. For discrete distributions concentrated on the set of positive integers, Györfi, Páli and van der Meulen [16] proved that a distribution with finite Shannon entropy cannot be estimated consistently in KL-divergence. This means that for any distribution estimate $p_n = (p_{n,1}, p_{n,2}, \dots)$ there exists a distribution $p = (p_1, p_2, \dots)$ with finite Shannon entropy such that for the KL-divergence
$$ \limsup_n KL(p, p_n) = \limsup_n \sum_{i=1}^\infty p_i \ln \frac{p_i}{p_{n,i}} = \infty \quad \text{a.s.} $$
However, under (7) one can construct a distribution estimate $p_n$ such that $KL(p, p_n) \to 0$ a.s. Similarly, for densities satisfying (7), one can choose a density $g$ of power tail such that the mixture of the ordinary histogram and of this $g$ is consistent in KL-divergence, see Barron, Györfi and van der Meulen [1].

Theorem 2 is a complete characterization of consistency in $L_1$. In fact, we show that there are two cases:

• either $\mathbf{E}\{(\ln \|X\|)_+\} = \infty$, and then $\mathbf{E}\{H_n\} = \infty$ for all $n$,

• or $\mathbf{E}\{(\ln \|X\|)_+\} < \infty$, and then $\lim_{n \to \infty} \mathbf{E}\{|H_n - H(f)|\} = 0$.

In the sequel, we show an example where the density is uniform on an unbounded set and $H_n \to \infty$ in probability. Denote by $\lambda$ the Lebesgue measure and by $\mu$ the distribution of $X$. For $d = 1$, let $f = I_A$, where $A = \cup_j A_j$ with disjoint intervals $A_j$ such that $\lambda(A) = 1$. Then $X$ is uniformly distributed on $A$ and therefore $H(f) = 0$. For $j \ge 1$, let $\Delta_j = \frac{1}{j(j+1)}$ and $a_j = 2^{2^j}$. Set $A = \cup_{j \ge 1} [a_j, a_j + \Delta_j]$. Then $\lambda(A) = 1$. Note that in this example $\mathbf{E}\{(\ln \|X\|)_+\} = \infty$, and therefore $\mathbf{E}\{H_n\} = \infty$ for all $n$.

Theorem 3.
In this setup, $\lim_{n \to \infty} H_n = \infty$ in probability.

Proofs

Proof of Theorem 1. Put
$$ \bar f_h(x) = \frac{\mu(B(x,h))}{\lambda(B(x,h))} = \frac{\int_{B(x,h)} f(z)\, dz}{\lambda(B(x,h))}, $$
where $B(x,h)$ stands for the ball centered at $x$ and having radius $h$. Then
$$ H_n - H(f) = \frac{1}{n} \sum_{i=1}^n \ln\big( (n-1)\, \lambda(B(X_i, R_{n,i}(X_i))) \big) + C_E - H(f) $$
$$ = -\frac{1}{n} \sum_{i=1}^n \ln \frac{\mu(B(X_i, R_{n,i}(X_i)))}{\lambda(B(X_i, R_{n,i}(X_i)))} - H(f) + \frac{1}{n} \sum_{i=1}^n \ln\big( (n-1)\, \mu(B(X_i, R_{n,i}(X_i))) \big) + C_E $$
$$ = \tilde H_n - H(f) + M_n + C_E, $$
where
$$ \tilde H_n = -\frac{1}{n} \sum_{i=1}^n \ln \bar f_{R_{n,i}(X_i)}(X_i) $$
and
$$ M_n = \frac{1}{n} \sum_{i=1}^n \ln\big( (n-1)\, \mu(B(X_i, R_{n,i}(X_i))) \big). $$
Biau and Devroye [4] showed that the distribution of $M_n$ does not depend on the density $f$, and that $\mathbf{E}\{M_n\} = -C_E + O(1/n)$ and $Var(M_n) = O(1/n)$. The problem left is to show
$$ M_n - \mathbf{E}\{M_n\} \to 0 \qquad (8) $$
a.s. and
$$ \tilde H_n \to H(f) \qquad (9) $$
a.s.

The proof of (8) relies on the following extension of the Efron-Stein inequality to centered higher moments:

Lemma 1. (Devroye et al. [9]) Let $Z = (Z_1, \dots, Z_n)$ be a collection of independent random variables taking values in some measurable set $A$, and denote by $Z^{(i)} = (Z_1, \dots, Z_{i-1}, Z_{i+1}, \dots, Z_n)$ the collection with the $i$-th random variable dropped. Let $f : A^n \to \mathbb{R}$ be a measurable real-valued function, and let $g_i : A^{n-1} \to \mathbb{R}$ be obtained from $f$ by dropping the $i$-th argument, $i = 1, \dots, n$. Then for any integer $q \ge 1$,
$$ \mathbf{E}\big[ (f(Z) - \mathbf{E} f(Z))^{2q} \big] \le \kappa_q\, \mathbf{E}\Big[ \Big( \sum_{i=1}^n \big(f(Z) - g_i(Z^{(i)})\big)^2 \Big)^{q} \Big] + \kappa_q\, \mathbf{E}\Big[ \Big( \sum_{i=1}^n \mathbf{E}\big[ (f(Z) - g_i(Z^{(i)}))^2 \mid Z_1, \dots, Z_{i-1}, Z_{i+1}, \dots, Z_n \big] \Big)^{q} \Big] \qquad (10) $$
with a universal constant $\kappa_q$ depending only on $q$.
For the term $M_n - \mathbf{E}\{M_n\}$, define $M_n^{(i)}$ as $M_n$ without the $i$-th sample point. We apply Lemma 1 with $q = 2$; by symmetry, choosing $g_n(X_1, \dots, X_{n-1}) = \frac{n-1}{n} M_{n-1}$,
$$ \mathbf{E}\big[ (M_n - \mathbf{E}\{M_n\})^4 \big] \le c\, n^2\, \mathbf{E}\Big[ \Big( M_n - \frac{n-1}{n} M_{n-1} \Big)^4 \Big]. $$
Noting that
$$ M_n - \frac{n-1}{n} M_{n-1} = \frac{1}{n} \Bigg( \ln\big( (n-1)\, \mu(B(X_n, R_{n,n}(X_n))) \big) + (n-1) \ln \frac{n-1}{n-2} - \sum_{i=1}^{n-1} \ln \frac{\mu(B(X_i, R_{n-1,i}(X_i)))}{\mu(B(X_i, R_{n,i}(X_i)))}\, I_{\{R_{n,i}(X_i) < R_{n-1,i}(X_i)\}} \Bigg), $$
we obtain
$$ \mathbf{E}\Big[ \Big( M_n - \frac{n-1}{n} M_{n-1} \Big)^4 \Big] \le \frac{c'}{n^4} \Bigg( \mathbf{E}\Big[ \Big( \ln\big( (n-1)\, \mu(B(X_n, R_{n,n}(X_n))) \big) \Big)^4 \Big] + 1 + \mathbf{E}\Bigg[ \Bigg( \sum_{i=1}^{n-1} I_{\{R_{n,i}(X_i) < R_{n-1,i}(X_i)\}} \Bigg)^4 \Bigg] \Bigg). $$
For any $x$, $\mu(B(x, \|x - X_i\|))$ is uniformly distributed on $[0,1]$ (see Section 1.2 in [4]), and therefore
$$ (n-1)\, \mu(B(X_n, R_{n,n}(X_n))) = (n-1)\, \mu\Big( B\Big( X_n, \min_{1 \le i \le n-1} \|X_i - X_n\| \Big) \Big) = (n-1) \min_{1 \le i \le n-1} \mu(B(X_n, \|X_i - X_n\|)). $$
It implies that, for given $X_n$,
$$ (n-1)\, \mu(B(X_n, R_{n,n}(X_n))) \stackrel{\mathcal{L}}{=} (n-1) \min_{1 \le i \le n-1} U_i, $$
where $\stackrel{\mathcal{L}}{=}$ denotes equality in distribution, and $U_1, \dots, U_{n-1}$ are i.i.d. uniform on $[0,1]$. Hence
$$ \mathbf{E}\Big[ \Big( \ln\big( (n-1)\, \mu(B(X_n, R_{n,n}(X_n))) \big) \Big)^4 \Big] = \mathbf{E}\Big[ \Big( \ln\Big( (n-1) \min_{1 \le i \le n-1} U_i \Big) \Big)^4 \Big] = O(1). $$
Lemma 20.6 in [4] yields
$$ \sum_{i=1}^{n-1} I_{\{R_{n,i}(X_i) < R_{n-1,i}(X_i)\}} \le \gamma_d $$
for a constant $\gamma_d$ depending only on the dimension $d$. Summarizing,
$$ \mathbf{E}\big[ (M_n - \mathbf{E}\{M_n\})^4 \big] = O(1/n^2), $$
which together with the Borel-Cantelli lemma yields (8).

In the proof of (9) we apply Breiman's generalized ergodic theorem (see Lemma 27.2 in [15]):

Lemma 2. Let $X_1, X_2, \dots$ be a stationary and ergodic sequence. If $F_n = F_n(\{X_i\})$, $n = 1, 2, \dots$, are random functions such that
$$ F_n(X_1) \to F(X_1) \qquad (11) $$
a.s. and
$$ \mathbf{E}\{ \sup_n |F_n(X_1)| \} < \infty, \qquad (12) $$
then
$$ \frac{1}{n} \sum_{i=1}^n F_n(X_i) \to \mathbf{E}\{F(X_1)\} \qquad (13) $$
a.s.

If $X'_n(x)$ stands for the second nearest neighbor of $x$ among $X_1, \dots, X_n$, then $R_{n,i}(X_i) = \|X'_n(X_i) - X_i\|$, and so
$$ \bar f_{R_{n,i}(X_i)}(X_i) = \bar f_{\|X'_n(X_i) - X_i\|}(X_i) $$
and
$$ \tilde H_n = -\frac{1}{n} \sum_{i=1}^n \ln \bar f_{\|X'_n(X_i) - X_i\|}(X_i). $$
Therefore, (9) means that
$$ -\frac{1}{n} \sum_{i=1}^n \ln \bar f_{\|X'_n(X_i) - X_i\|}(X_i) \to H(f) \qquad (14) $$
a.s. Defining
$$ F_n(x) = -\ln \bar f_{\|X'_n(x) - x\|}(x) \quad \text{and} \quad F(x) = -\ln f(x), $$
we verify the conditions of Lemma 2.
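As an aside, the distribution-free property of $M_n$ quoted from [4] is easy to check numerically. A sketch for $f$ uniform on $[0,1]$ (the choice of density and the helper name are ours), where $\mu(B(x,r)) = \min(x+r,1) - \max(x-r,0)$:

```python
import math
import random

def m_n(xs):
    """M_n = (1/n) sum_i ln((n-1) mu(B(X_i, R_{n,i}(X_i)))) for f uniform on [0,1]."""
    n = len(xs)
    total = 0.0
    for i, x in enumerate(xs):
        r = min(abs(x - y) for j, y in enumerate(xs) if j != i)
        mu = min(x + r, 1.0) - max(x - r, 0.0)  # mu(B(x,r)) for the uniform density
        total += math.log((n - 1) * mu)
    return total / n

# E{M_n} = -C_E + O(1/n), whatever the density; average over repeated samples.
random.seed(2)
n, trials = 100, 300
avg = sum(m_n([random.random() for _ in range(n)]) for _ in range(trials)) / trials
print(avg)  # should be close to -C_E = -0.5772...
```

Replacing the uniform density by any other density with a computable $\mu(B(x,r))$ leaves the distribution of $M_n$ unchanged.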
The Lebesgue differentiation theorem (cf. Theorem 20.18 in [4]) yields that
$$ \lim_{r \downarrow 0} \bar f_r(x) = f(x) $$
for $\lambda$-almost all $x$. The Cover-Hart theorem (cf. Lemma 2.2 in [4]) implies $\|X'_n(x) - x\| \to 0$ a.s. for $\mu$-almost all $x$. As $\mu$ is absolutely continuous with respect to $\lambda$, these limit relations result in
$$ \bar f_{\|X'_n(x) - x\|}(x) \to f(x) $$
a.s. for $\mu$-almost all $x$, from which (11) follows.

Let $L$ denote an upper bound on $\|X\|$. Introduce the Hardy-Littlewood maximal functions
$$ f^*(x) = \sup_{h > 0} \bar f_h(x) \quad \text{and} \quad g^*(x) = \sup_{2L > h > 0} \frac{1}{\bar f_h(x)}. $$
(4) implies that
$$ \int f(x) \ln(f^*(x)+1)\, dx < \infty, \qquad (15) $$
while if, in addition, $X$ is bounded, then
$$ \int f(x) \ln(g^*(x)+1)\, dx < \infty, \qquad (16) $$
see page 82 in [4]. Note that (16) is the only step in the proof where the boundedness of $X$ is used. Thus,
$$ \big| \ln \bar f_{\|X'_n(x)-x\|}(x) \big| = \big( \ln \bar f_{\|X'_n(x)-x\|}(x) \big)_+ + \Big( \ln \frac{1}{\bar f_{\|X'_n(x)-x\|}(x)} \Big)_+ \le \ln(f^*(x)+1) + \ln(g^*(x)+1), $$
and so (15) and (16) result in
$$ \mathbf{E}\Big\{ \sup_n \big| \ln \bar f_{\|X'_n(X_1)-X_1\|}(X_1) \big| \Big\} \le \int f(x) \ln(f^*(x)+1)\, dx + \int f(x) \ln(g^*(x)+1)\, dx < \infty, $$
which yields (12), and the conditions of Lemma 2 are verified. Note that, similarly to the proof of (8), we can show the universal strong law of the sum of nearest neighbor balls: for any density $f$,
$$ \sum_{i=1}^n \mu\Big( B\Big( X_i, \min_{j \ne i,\, j \le n} \|X_i - X_j\| \Big) \Big) \to 1 $$
a.s. ✷

Proof of Theorem 2. First we show that $\mathbf{E}\{H_n\}$ and, equivalently, $\mathbf{E}\{\tilde H_n\}$ exist. Jensen's inequality implies that
$$ \mathbf{E}\{(\tilde H_n)_-\} = \mathbf{E}\Big\{ \Big( -\frac{1}{n} \sum_{i=1}^n \ln \bar f_{R_{n,i}(X_i)}(X_i) \Big)_- \Big\} \ge \frac{1}{n} \sum_{i=1}^n \mathbf{E}\Big\{ \big( -\ln \bar f_{R_{n,i}(X_i)}(X_i) \big)_- \Big\} = \mathbf{E}\Big\{ \big( -\ln \bar f_{R_{n,1}(X_1)}(X_1) \big)_- \Big\} \ge -\int f(x) \big( \ln f^*(x) \big)_+ dx > -\infty, $$
when
$$ \int f(x) \ln \frac{f^*(x)}{f(x)}\, dx < \infty. \qquad (17) $$
Choose $0 < L < \operatorname{ess\,sup}_x f(x)$.
Jensen's inequality implies
$$ \int f(x) \ln \frac{f^*(x)}{f(x)}\, I_{\{f(x) > L\}}\, dx \le \Big( \int f(x) I_{\{f(x) > L\}}\, dx \Big) \ln \frac{\int f^*(x) I_{\{f(x) > L\}}\, dx}{\int f(x) I_{\{f(x) > L\}}\, dx} < \infty, $$
because (4) together with $\lambda(\{x : f(x) > L\}) < \infty$ yields
$$ \int f^*(x) I_{\{f(x) > L\}}\, dx < \infty, $$
see Fefferman and Stein [11]. Furthermore,
$$ \int f(x) \ln \frac{f^*(x)}{f(x)}\, I_{\{f(x) \le L\}}\, dx = \int f(x) \ln f^*(x)\, I_{\{f(x) \le L\}}\, dx - \int f(x) \ln f(x)\, I_{\{f(x) \le L\}}\, dx \le L \int \ln \max\{f^*(x), 1\}\, dx - \int f(x) \ln f(x)\, I_{\{f(x) \le L\}}\, dx. $$
Fefferman and Stein [11] proved that
$$ \lambda(\{x : f^*(x) > t\}) \le \frac{c}{t} \qquad (18) $$
for $t > 0$, where the constant $c$ depends only on the dimension $d$; see also (a) of Lemma 10.47 in [36]. By (18),
$$ \int \ln \max\{f^*(x), 1\}\, dx = \int_0^\infty \lambda(\{x : \ln \max\{f^*(x), 1\} > t\})\, dt = \int_0^\infty \lambda(\{x : \max\{f^*(x), 1\} > e^t\})\, dt = \int_0^\infty \lambda(\{x : f^*(x) > e^t\})\, dt \le \int_0^\infty \frac{c}{e^t}\, dt = c. $$
Thus, (17) is verified and so we proved that $\mathbf{E}\{\tilde H_n\}$ exists.

Assume that $\mathbf{E}\{(\ln \|X\|)_+\} = \infty$. Then, for any $x \in \mathbb{R}^d$, $\mathbf{E}\{(\ln \|X - x\|)_+\} = \infty$ and therefore $\mathbf{E}\{(\ln R_{2,1}(X_1))_+\} = \mathbf{E}\{(\ln \|X_1 - X_2\|)_+\} = \infty$. Next we show that $\mathbf{E}\{(\ln R_{n,1}(X_1))_+\} = \mathbf{E}\{(\ln \min_{2 \le i \le n} \|X_1 - X_i\|)_+\} = \infty$, too. Find $r$ such that $\mu(B(0,r)) = 1/2$. Then
$$ \mathbf{E}\big\{ (\ln R_{n,1}(X_1))_+ \big\} \ge \mathbf{E}\Big\{ \Big( \ln \frac{\|X_1\|}{2} \Big)_+ I_{\{\|X_1\| \ge 2r\}}\, I_{\{X_2, \dots, X_n \in B(0,r)\}} \Big\} = \frac{1}{2^{n-1}}\, \mathbf{E}\Big\{ \Big( \ln \frac{\|X_1\|}{2} \Big)_+ I_{\{\|X_1\| \ge 2r\}} \Big\} = \infty. $$
Thus,
$$ \mathbf{E}\{\tilde H_n\} = -\int \mathbf{E}\big\{ \ln \bar f_{R_{n,1}(x)}(x) \big\} f(x)\, dx \ge \int \mathbf{E}\Big\{ \Big( \ln \frac{\lambda(B(x, R_{n,1}(x)))}{\mu(B(x, R_{n,1}(x)))} \Big)_+ \Big\} f(x)\, dx \ge \int \mathbf{E}\big\{ (\ln \lambda(B(x, R_{n,1}(x))))_+ \big\} f(x)\, dx \ge d\, \mathbf{E}\big\{ (\ln R_{n,1}(X_1))_+ \big\} = \infty, \qquad (19) $$
which yields the necessity part of the theorem: (6) implies $\mathbf{E}\{(\ln \|X\|)_+\} < \infty$. Notice that
$$ \mathbf{E}\{H_n\} = \mathbf{E}\big\{ \ln\big( (n-1) R_{n,1}(X_1)^d v_d \big) \big\} + C_E \le \ln(n-1) + \mathbf{E}\big\{ \ln\big( R_{2,1}(X_1)^d v_d \big) \big\} + C_E \le \ln(n-1) + \mathbf{E}\big\{ \big( \ln( \|X_1 - X_2\|^d v_d ) \big)_+ \big\} + C_E < \infty. \qquad (20) $$
(19) and (20) show that $\mathbf{E}\{H_n\} < \infty$ if and only if (7) holds. By the proof of Theorem 1, (6) is equivalent to
$$ \mathbf{E}\{|\tilde H_n - H(f)|\} \to 0. \qquad (21) $$
Note that for $a \in \mathbb{R}$, $|a| = 2a_+ - a$, and thus one has
$$ \mathbf{E}\{|\tilde H_n - H(f)|\} \le \int \mathbf{E}\Big\{ \Big| \ln \frac{\bar f_{R_{n,1}(x)}(x)}{f(x)} \Big| \Big\} f(x)\, dx = 2 \int \mathbf{E}\Big\{ \Big( \ln \frac{\bar f_{R_{n,1}(x)}(x)}{f(x)} \Big)_+ \Big\} f(x)\, dx + \mathbf{E}\{\tilde H_n\} - H(f). $$
We show that
$$ \int \mathbf{E}\Big\{ \Big( \ln \frac{\bar f_{R_{n,1}(x)}(x)}{f(x)} \Big)_+ \Big\} f(x)\, dx \to 0 \qquad (22) $$
and
$$ \mathbf{E}\{\tilde H_n\} = -\int \mathbf{E}\big\{ \ln \bar f_{R_{n,1}(x)}(x) \big\} f(x)\, dx \to H(f). \qquad (23) $$
Concerning (22), we have a domination:
$$ \Big( \ln \frac{\bar f_{R_{n,1}(x)}(x)}{f(x)} \Big)_+ \le \Big( \ln \frac{f^*(x)}{f(x)} \Big)_+ = \ln \frac{f^*(x)}{f(x)}, $$
and therefore (17), the pointwise convergence and dominated convergence yield (22). With respect to (23), note that (22) implies that
$$ \big( H(f) - \mathbf{E}\{\tilde H_n\} \big)_+ \le \int \mathbf{E}\Big\{ \Big( \ln \frac{\bar f_{R_{n,1}(x)}(x)}{f(x)} \Big)_+ \Big\} f(x)\, dx \to 0, $$
hence
$$ \liminf_n \mathbf{E}\{\tilde H_n\} \ge H(f). $$
It remains to show that
$$ \limsup_n \mathbf{E}\{\tilde H_n\} \le H(f). \qquad (24) $$
For
$$ g^*_r(x) = \inf_{0 < h < r} \bar f_h(x), $$
the a.s. convergence $R_{n,1}(x) \to 0$ implies that
$$ \limsup_n \mathbf{E}\{\tilde H_n\} \le -\int f(x) \ln g^*_r(x)\, dx $$
for every $r > 0$, and so
$$ \limsup_n \mathbf{E}\{\tilde H_n\} \le -\sup_{r > 0} \int f(x) \ln g^*_r(x)\, dx = -\int f(x) \ln f(x)\, dx = H(f), $$
since $g^*_r(x) \uparrow f(x)$ for $\lambda$-almost all $x$ as $r \downarrow 0$ by the Lebesgue differentiation theorem. This proves (24) and hence (23); together with (22) this yields (21), i.e., the sufficiency part of the theorem. ✷

Proof of Theorem 3. We will recall some facts from the theory of order statistics.

(i) If $U_1, \dots, U_n$ are i.i.d. uniform on $[0,1]$, then the smallest 1-spacing $Z^*_n$ satisfies $n^2 Z^*_n \to E$ in distribution, where $E$ is a standard exponential random variable.

(ii) Let $Y_i$ be the index of the interval containing $X_i$, i.e., $X_i \in [a_{Y_i}, a_{Y_i} + \Delta_{Y_i}]$. With probability one, $\max_{1 \le i \le n} Y_i \ge n^{1/2}$ except finitely often.

Proof. Since $\mathbf{P}\{Y_1 > m\} = \sum_{j > m} \Delta_j = 1/(m+1)$,
$$ \mathbf{P}\Big\{ \max_{1 \le i \le n} Y_i \le n^{1/2} \Big\} = \big( 1 - \mathbf{P}\{Y_1 > n^{1/2}\} \big)^n \le e^{-n^{1/2}/2}. $$
Apply the Borel-Cantelli lemma.

(iii) If $Y^{**}, Y^*$ are the largest and second largest of the $Y_i$'s, then $\mathbf{P}\{Y^{**} = Y^*\} \to 0$.

Proof.
$$ \mathbf{P}\{Y^{**} = Y^*\} \le n^2\, \mathbf{P}\Big\{ Y_1 = Y_2 \ge \max_{3 \le i \le n} Y_i \Big\} \le n^2 \sum_{i=1}^\infty \Delta_i^2 \big( 1 - 1/(i+1) \big)^{n-2} \le n^2 \sum_{i \le n^{5/6}} \Delta_i^2\, e^{-(n-2)/(i+1)} + n^2 \sum_{i > n^{5/6}} \Delta_i^2. $$
Here
$$ n^2 \sum_{i \le n^{5/6}} \Delta_i^2\, e^{-(n-2)/(i+1)} \le n^2 \sum_{i \le n^{5/6}} e^{-(n-2)/(n^{5/6}+1)} = e^{-n^{1/6}(1+o(1))} \to 0, $$
while the second term on the right-hand side is $O(n^2/n^{5/2}) = O(1/\sqrt{n}) \to 0$.

For the sake of simplicity, consider the negative of the estimate in (3) without the bias correction:
$$ \ell_n = \frac{1}{n} \sum_{i=1}^n \log \frac{1}{n Z_i}, \qquad (25) $$
where $Z_i = \min_{j \ne i,\, j \le n} |X_i - X_j|$. We show that $\ell_n \to -\infty$ in probability, which implies the theorem. Let $Y^{**}$ and $Y^*$ be the $Y$-values of the largest and second largest $X_i$'s.
Let $Z^{**}$ be the nearest neighbor distance of the largest of the $X_i$'s; this is also the distance between the two largest $X_i$'s. If $Y^{**} \ne Y^*$, then
$$ Z^{**} \ge a_{Y^{**}} - a_{Y^*} - \Delta_{Y^*} \ge \frac{a_{Y^{**}}}{2} = \frac{1}{2}\, 2^{2^{Y^{**}}}. $$
Furthermore, all other nearest neighbor distances are $\ge$ (in distribution) the minimal distance between $n$ uniform order statistics (by pushing the intervals of $A$ together), and by (i) this is $1/n^2$ times something that tends in law to a standard exponential $E$. So
$$ \ell_n \le \frac{1}{n} \log \frac{1}{n Z^{**}} + \log \frac{n}{E} + o_P(1) = -\frac{2^{Y^{**}}}{n} \log 2 + O_P(\log n) \le -\frac{2^{\sqrt{n}}}{n} \log 2\; I_{\{Y^{**} > \sqrt{n}\}} + O_P(\log n). $$
Thus, by (ii) and (iii), outside an event of probability
$$ \mathbf{P}\{Y^{**} = Y^*\} + \mathbf{P}\{Y^{**} \le \sqrt{n}\} \to 0, $$
we have $\ell_n \le -\frac{2^{\sqrt{n}}}{n} \log 2 + O_P(\log n) \to -\infty$, and therefore $\ell_n \to -\infty$ in probability. ✷

References
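The construction behind Theorem 3 is easy to simulate. A sketch in plain Python (the helper names and the cap `J_MAX` are ours; the cap keeps the doubly exponential endpoints and the interval widths resolvable in double precision):

```python
import math
import random

# Interval index is capped at J_MAX so that widths Delta_j stay resolvable
# next to the huge endpoints a_j in float64.
J_MAX = 5

def delta(j):
    # Delta_j = 1/(j(j+1)); the widths telescope: sum_{j=1}^J Delta_j = 1 - 1/(J+1)
    return 1.0 / (j * (j + 1))

def a(j):
    # Left endpoints a_j = 2^(2^j); they grow so fast that E{(ln X)_+} = infinity
    return 2.0 ** (2 ** j)

def sample_from_A(rng):
    """Draw X uniform on A = union_j [a_j, a_j + Delta_j] (index capped at J_MAX)."""
    u, j, acc = rng.random(), 1, delta(1)
    while u > acc and j < J_MAX:
        j += 1
        acc += delta(j)
    return a(j) + rng.random() * delta(j)

rng = random.Random(3)
n = 200
xs = sorted(sample_from_A(rng) for _ in range(n))
# On the real line the nearest neighbor is an adjacent order statistic.
zs = [xs[1] - xs[0]]
zs += [min(xs[i] - xs[i - 1], xs[i + 1] - xs[i]) for i in range(1, n - 1)]
zs += [xs[-1] - xs[-2]]
# l_n of (25); the max(...) guards against floating-point ties.
l_n = sum(math.log(1.0 / (n * max(z, 1e-12))) for z in zs) / n
print(l_n)
```

Because the gaps between consecutive intervals are of order $a_{j+1}$, isolated sample points contribute large negative terms to $\ell_n$, which is what drags the entropy estimate of the density with $H(f) = 0$ upward as $n$ grows.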