Correlations with tailored extremal properties
SKY CAO AND PETER J. BICKEL
Abstract.
Recently, Chatterjee has introduced a coefficient of correlation which has several natural properties. In particular, the population version of the coefficient, which generalizes an earlier one of Dette et al., attains its maximal value if and only if one variable is a measurable function of the other variable. In this paper, we seek to define correlations which have a similar property, except now the measurable function must belong to a pre-specified class, which amounts to a shape restriction on the function. We will then look specifically at the correlation corresponding to the class of monotone nondecreasing functions, in which case we can prove various asymptotic results, as well as perform local power calculations. We will also perform local power calculations for Chatterjee's correlation, and for an older one of Dette et al.

1. Introduction
Mathematics Subject Classification.
Key words and phrases. Correlation, shape-restricted regression, isotonic regression.
S.C. was supported by NSF grant DMS RTG 1501767.

In a remarkable paper [4], Sourav Chatterjee proposed a new coefficient of correlation based on an i.i.d. sample \((X_i, Y_i)\), \(i = 1, \ldots, n\). Assuming there are no ties among the \(X_i\)'s and \(Y_i\)'s (see [4] for the definition in the general case), the correlation is defined as
\[
\hat{C}_n(X, Y) := 1 - \frac{3 \sum_{i=1}^{n-1} |r_{i+1} - r_i|}{n^2 - 1}, \tag{1.1}
\]
where the \(r_i\) are defined as follows. First, sort \(X_{(1)} \le \cdots \le X_{(n)}\), and for each \(i\) let \(Y_{(i)}\) be the \(Y\) sample corresponding to \(X_{(i)}\). Then \(r_i\) is defined as the rank of \(Y_{(i)}\), i.e. the number of \(j\) such that \(Y_j \le Y_{(i)}\). Chatterjee showed that as \(n \to \infty\), \(\hat{C}_n\) converges a.s. to the population measure
\[
C(X, Y) := \frac{\int \operatorname{Var}(\mathbb{E}[\mathbf{1}(Y \ge t) \mid X])\, d\mu(t)}{\int \operatorname{Var}(\mathbf{1}(Y \ge t))\, d\mu(t)}
= 1 - \frac{\int \mathbb{E}[\operatorname{Var}(\mathbf{1}(Y \ge t) \mid X)]\, d\mu(t)}{\int \operatorname{Var}(\mathbf{1}(Y \ge t))\, d\mu(t)}, \tag{1.2}
\]
where \(\mu\) is the law of \(Y\). Here \(Y\) is assumed to not be constant. In the case where \(X, Y\) are continuously distributed, this measure was introduced by Dette et al. [5] – see Remark 2.1. The measure \(C\) has a number of interesting properties:

A) \(0 \le C \le 1\).
B) \(C = 0\) if and only if \(X\) and \(Y\) are independent.
C) \(C = 1\) if and only if \(Y = h(X)\) a.s. for some measurable function \(h \colon \mathbb{R} \to \mathbb{R}\).
D) \(C\) is asymmetric, but can be easily symmetrized to \(C^*(X, Y) := \max(C(X, Y), C(Y, X))\), which clearly satisfies \(C^* = 1\) if and only if \(X\) is a function of \(Y\) or \(Y\) is a function of \(X\) (or both).
E) \(C\) is invariant under strictly increasing transformations of \(X\) and \(Y\) separately.

This measure is akin to the Rényi correlation (also commonly called the maximal correlation), which we shall denote \(R\) or \(R(X, Y)\), and which is defined as the maximum Pearson correlation between all pairs of \(L^2\) functions of \(X\) and \(Y\) respectively. \(R\) may be computed as the square root of the maximal eigenvalue of a compact self-adjoint operator \(T \colon L^2_0(X) \to L^2_0(X)\) (or \(L^2_0(Y) \to L^2_0(Y)\) with appropriate changes), where \(L^2_0(X)\) is the subspace of \(L^2(X)\) consisting of mean zero random variables. The operator \(T\) is given by \(T(f(X)) := \mathbb{E}[\mathbb{E}[f(X) \mid Y] \mid X]\). The Rényi correlation is well known to have properties
A, B, and D, but is symmetric, and

C*) \(R = 1\) if and only if \(g(Y) = h(X)\) for some functions \(g\) and \(h\), with \(g(Y) \in L^2_0(Y)\) and \(h(X) \in L^2_0(X)\).

An extensive account of the history, computation, and other properties of \(R\) may be found for instance in [3, 13]. An advantage of \(C\) is that it gives a clear indication of the functional relationship between \(X\) and \(Y\) when \(C = 1\). More significantly, an empirical estimate is explicit and simply computable for finite \(n\), while unless \(X\) and \(Y\) are discrete, the empirical version of \(R\) may only be approximated. On the other hand, \(R\) is defined if \(X\) and \(Y\) take values in \(\mathbb{R}^p\) and \(\mathbb{R}^q\) or more general spaces. Our purposes in this paper are:

(1) To relate \(C\) more closely to \(R\).
(2) To extend \(C\) to situations where the \(h\) appearing in C) is specified to be monotone or more generally shape restricted.
(3) To study the asymptotic behavior of the sample versions of such measures under independence, when they can be used for testing, as was done by Chatterjee for \(\hat{C}_n\).

2. Relation between C and R

Note that \(C\) and \(R\) are unsigned and may be viewed as absolute rather than signed measures of dependence. In fact, we shall argue that \(C\) is closely related not to \(R\) but to \(R^2\), the largest eigenvalue of \(T\). We begin with the solution to a simpler problem of Rényi's or Chatterjee's. For \(f(Y) \in L^2(Y)\), let
\[
M(X, f(Y)) := \sup_{g(X) \in L^2(X)} \rho^2(f(Y), g(X)), \tag{2.1}
\]
where \(\rho\) denotes the Pearson correlation. By Cauchy-Schwarz, or more insightfully the identity, valid if \(\operatorname{Var}(f(Y)) = \operatorname{Var}(g(X)) = 1\),
\[
\rho(f(Y), g(X)) = 1 - \tfrac{1}{2} \operatorname{Var}(f(Y) - g(X)),
\]
a maximizer in (2.1) is \(g(X) = \mathbb{E}[f(Y) \mid X]\). Hence,
\[
M(X, f(Y)) = \frac{\operatorname{Var}(\mathbb{E}[f(Y) \mid X])}{\operatorname{Var}(f(Y))} = 1 - \frac{\mathbb{E}\operatorname{Var}(f(Y) \mid X)}{\operatorname{Var}(f(Y))}. \tag{2.2}
\]
If we put \(f(Y) = Y\) we obtain a measure which satisfies A) and C) but not B) and D). To obtain the last two properties we need only observe that \(M = 0\) implies \(\mathbb{E}[f(Y) \mid X] = \mathbb{E} f(Y)\). Then to obtain B) and D) note that independence is equivalent to \(M(X, \mathbf{1}(Y > y)) = 0\) for all \(y\) in a set \(S\) such that \(P(Y \in S) = 1\). This leads fairly naturally to defining \(C\) as in (1.2). We note here that this is related to the observation by Azadkia and Chatterjee [2, Section 3] that \(C\) is a mixture of partial \(R^2\) statistics. There are evidently many ways of constructing such coefficients, but Chatterjee's is particularly elegant since it is easy to see that if \(F, G\) are the cdfs of
X, Y respectively, then
\(X, Y\) can be replaced by \(F(X), G(Y)\), leading naturally to an estimate like \(\hat{C}_n\) which depends on the paired ranks of \((X_i, Y_i)\) (with ties broken at random). Evidently, the empirical estimate of \(C\) requires an estimate of \(\mathbb{E}[\mathbf{1}(Y \ge t) \mid X]\) or \(\operatorname{Var}(\mathbf{1}(Y \ge t) \mid X)\). Chatterjee does the latter in a clever way. See also Remark 3.3.

Remark 2.1. (1) In 2013, Dette et al. [5] proposed (1.2) in the case of continuously distributed \(X, Y\) as a parameter having properties A)-C), with a different empirical estimate of \(C\) than \(\hat{C}_n\), which, itself, may be viewed as an estimate of the second expression in (1.2). Consistency of the estimate, which they establish, requires smoothness conditions on the conditional distribution and choice of a bandwidth since it requires a growing neighborhood of \(X_{(i)}\). Thus, it is not as general.

(2) Independently of our work but at the same time, Shi et al. [17] made an extensive comparison between \(\hat{C}_n\), Dette et al.'s statistic, and other classical tests of the hypothesis of independence which are consistent against all alternatives, as is \(\hat{C}_n\). They also studied the local power of \(\hat{C}_n\), Dette et al.'s statistic, and these other tests against a wide class of contiguous alternatives, and showed that \(\hat{C}_n\) has no power locally, while Dette et al.'s statistic may have no power in some cases and yet be rate optimal in others. We had made a more limited comparison and came to similar conclusions in our original posting. However, partly inspired by their results, we show in Section 4.4 that \(\hat{C}_n\) has no power against a wide class of contiguous alternatives. We discuss this and the local power of Dette et al.'s statistic in Sections 4.4 and 4.5.

(3) The overlap between our work and Shi et al.'s is entirely in Sections 4.4 and 4.5. The results in these sections complement the results of Shi et al. [17] – our main focus is giving general conditions under which Chatterjee's statistic and Dette et al.'s statistic have no local power, while the focus of Shi et al. is to exhibit explicit families of alternatives for which the performance of \(\hat{C}_n\) and Dette et al.'s statistic is demonstrably worse than other classical tests of independence.

The problem we mainly intend to address in this note is how to construct and give appropriate measures for which the value 1 corresponds to a shape restriction on the form of \(h(\cdot)\) such that \(Y = h(X)\).
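Since the rank statistic (1.1) recurs throughout, it may help to see it computed. A minimal sketch (our own illustration, not code from the paper), assuming no ties among the samples:

```python
import numpy as np

def chatterjee_cn(x, y):
    """Chatterjee's correlation (1.1), assuming no ties among the x's or y's."""
    x, y = np.asarray(x), np.asarray(y)
    n = len(x)
    y_by_x = y[np.argsort(x)]                # Y_(i) paired with the sorted X_(i)
    r = np.argsort(np.argsort(y_by_x)) + 1   # r_i = rank of Y_(i) among all Y's
    return 1.0 - 3.0 * np.abs(np.diff(r)).sum() / (n**2 - 1)
```

If \(Y\) is a strictly increasing function of \(X\), the ranks come out already sorted, so \(\sum_i |r_{i+1} - r_i| = n - 1\) and \(\hat{C}_n = 1 - 3/(n+1)\); under independence \(\hat{C}_n\) fluctuates around 0 at scale \(n^{-1/2}\).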
Such restrictions are discussed by Guntuboyina and Sen in [8]. They include monotonicity, a special case as we shall see, convexity or concavity, log concavity, and a number of others. They are characterized by the requirement that \(h\) belongs to a collection of functions \(\mathcal{H}\), where \(\mathcal{H}\) is such that \(\mathcal{H}_X := \{h(X) : h \in \mathcal{H},\ h(X) \in L^2(X)\}\) is a closed, convex subset of \(L^2(X)\). The method we prescribe follows from formula (2.1). It is well known that for \(\mathcal{H}\) as above, there exists a nonlinear operator \(\Pi_{\mathcal{H}_X} \colon L^2(X, Y) \to \mathcal{H}_X\) such that for \(f(X, Y) \in L^2(X, Y)\),
\[
\Pi_{\mathcal{H}_X}(f(X, Y)) = \arg\inf_{h(X) \in \mathcal{H}_X} \mathbb{E}(f(X, Y) - h(X))^2.
\]
If we substitute \(\Pi_{\mathcal{H}_X}(Y)\) for \(f(Y)\) or \(\mathbb{E}[f(Y) \mid X]\) in (2.1), we evidently get a measure \(C_{\mathcal{H}}(X, Y)\) such that \(C_{\mathcal{H}} = 1\) if and only if \(Y = h(X)\) a.s. with \(h(X) \in \mathcal{H}_X\). If \(\mathcal{H}\) and therefore \(\mathcal{H}_X\) is a convex cone, then it is well known that \(\operatorname{Cov}(Y, \Pi_{\mathcal{H}_X}(Y)) = \operatorname{Var}(\Pi_{\mathcal{H}_X}(Y))\). So \(C_{\mathcal{H}}\) is also given by (2.2) if \(\Pi_{\mathcal{H}_X}(Y)\) replaces \(\mathbb{E}[Y \mid X]\). That is,
\[
C_{\mathcal{H}} := \frac{\operatorname{Var}(\Pi_{\mathcal{H}_X}(Y))}{\operatorname{Var}(Y)}. \tag{2.3}
\]
Unfortunately, while this measure satisfies properties A), C), and D) above, it doesn't satisfy B). However, this can be easily remedied by defining
\[
\tilde{C}_{\mathcal{H}}(X, Y) := \tfrac{1}{2}\big( C(X, Y) + C_{\mathcal{H}}(X, Y) \big), \tag{2.4}
\]
which is easily seen to satisfy all of A)-D).

3. Empirical estimates of C̃_H

Let \(\mathcal{H}\) be a convex cone of \(\bar{\mathbb{R}}\)-valued functions on \(\mathbb{R}^d\) (here \(\bar{\mathbb{R}} = \mathbb{R} \cup \{\pm\infty\}\) is the extended real line), containing the constant functions. We now turn to the construction of empirical estimates for \(\tilde{C}_{\mathcal{H}}\). We first consider the general case. By (2.4) it is enough to consider \(C_{\mathcal{H}}\). Given a probability measure \(\mu\) on \(\mathbb{R}^d\), define \(\mathcal{H}_\mu := \mathcal{H} \cap L^2(\mu) = \{h \in \mathcal{H} : h \in L^2(\mu)\}\). Note \(\mathcal{H}_\mu\) is itself a convex cone. We will assume moreover that for all \(\mu\), \(\mathcal{H}_\mu\) is closed in \(L^2(\mu)\). (This assumption of closedness is why we need to work with \(\bar{\mathbb{R}}\)-valued functions as opposed to \(\mathbb{R}\)-valued functions.) For \(B > 0\), define
\[
\mathcal{H}_B := \Big\{ h \in \mathcal{H} : \sup_{x \in \mathbb{R}^d} |h(x)| \le B \Big\}.
\]
Let \((X, Y)\) have some joint distribution \(P\), with \(X \in \mathbb{R}^d\) and \(Y \in \mathbb{R}\). Assume that \(Y\) has finite second moment. Let \((X_1, Y_1), \ldots, (X_n, Y_n)\) be i.i.d. samples from \(P\). Let \(\mu_X\) be the law of \(X\). Let \(g \in \mathcal{H}_{\mu_X}\) be such that
\[
\mathbb{E}(Y - g(X))^2 = \inf_{h \in \mathcal{H}_{\mu_X}} \mathbb{E}(Y - h(X))^2.
\]
By our assumption that \(\mathcal{H}_{\mu_X}\) is a closed convex cone, it follows that \(g\) exists and is unique in \(L^2(\mu_X)\). Note that \(g(X)\) can be interpreted as the projection of \(Y\) onto the convex cone \(\mathcal{H}_X = \{h(X) : h \in \mathcal{H}_{\mu_X}\} \subseteq L^2(X)\), so that in the notation of the previous section, we have \(\Pi_{\mathcal{H}_X}(Y) = g(X)\) a.s. Let \(\mu_n\) be the empirical distribution of the \(X\) samples, and let \(\hat{g}_n\) be such that
\[
\frac{1}{n}\sum_{i=1}^n (Y_i - \hat{g}_n(X_i))^2 = \inf_{h \in \mathcal{H}_{\mu_n}} \frac{1}{n}\sum_{i=1}^n (Y_i - h(X_i))^2.
\]
Again, \(\hat{g}_n\) exists and is unique in \(L^2(\mu_n)\). Given a function \(f(X, Y)\), let
\[
\operatorname{Var}_n(f(X, Y)) := \frac{1}{n}\sum_{i=1}^n f(X_i, Y_i)^2 - \Big( \frac{1}{n}\sum_{i=1}^n f(X_i, Y_i) \Big)^2,
\]
i.e. \(\operatorname{Var}_n(f(X, Y))\) is the empirical variance of \(f(X, Y)\). Following (2.3), define now the empirical correlation
\[
\hat{C}_{\mathcal{H}}(X, Y) = \hat{C}_{\mathcal{H}, n}(X, Y) := \frac{\operatorname{Var}_n(\hat{g}_n(X))}{\operatorname{Var}_n(Y)}.
\]
Note that if \(\mathcal{H}\) is the cone of monotone nondecreasing functions, then \(\hat{C}_{\mathcal{H}}\) is the ratio of the empirical variance of the empirical isotonic regression of \(Y\) on \(X\) to the empirical variance of \(Y\).

We now make assumptions on \(\mathcal{H}\) which guarantee convergence of \(\hat{C}_{\mathcal{H}}(X, Y)\) to \(C_{\mathcal{H}}(X, Y)\). The cone \(\mathcal{H}\) is said to have Property P if, in addition to containing the constant functions and \(\mathcal{H}_\mu\) being closed in \(L^2(\mu)\) for all probability measures \(\mu\), the following two conditions are satisfied for any joint distribution \((X, Y)\) such that \(|Y| \le B\) a.s. for some \(B \ge 0\):

(1) Boundedness.
We have
\[
\inf_{h \in \mathcal{H}_B} \mathbb{E}(Y - h(X))^2 = \inf_{h \in \mathcal{H}_{\mu_X}} \mathbb{E}(Y - h(X))^2.
\]
Consequently, we may assume that \(\sup_{x} |g(x)| \le B\).

(2) Glivenko-Cantelli. We have
\[
\sup_{h \in \mathcal{H}_B} \Big| \frac{1}{n}\sum_{i=1}^n (Y_i - h(X_i))^2 - \mathbb{E}(Y - h(X))^2 \Big| \xrightarrow{a.s.} 0.
\]

Proposition 3.1. If \(\mathcal{H}\) satisfies Property P, then for any \((X, Y)\) where \(Y\) is a.s. not a constant and has finite second moment, we have that \(\hat{C}_{\mathcal{H}}(X, Y) \xrightarrow{a.s.} C_{\mathcal{H}}(X, Y)\).

We first prove a preliminary result in the case of bounded \(Y\).

Lemma 3.2.
Suppose further that \(|Y| \le B\) a.s. Then
\[
\frac{1}{n}\sum_{i=1}^n \hat{g}_n(X_i)^2 \xrightarrow{a.s.} \mathbb{E}\, g(X)^2.
\]

Proof.
By well known properties of projections onto closed convex cones (see e.g. Lemma 3 of [9]), we have that
\[
\frac{1}{n}\sum_{i=1}^n (Y_i - g(X_i))^2 \ge \frac{1}{n}\sum_{i=1}^n (Y_i - \hat{g}_n(X_i))^2 + \frac{1}{n}\sum_{i=1}^n (\hat{g}_n(X_i) - g(X_i))^2.
\]
By the law of large numbers, we have
\[
\frac{1}{n}\sum_{i=1}^n (Y_i - g(X_i))^2 \xrightarrow{a.s.} \mathbb{E}(Y - g(X))^2.
\]
By the boundedness assumption, we may assume that
\[
\sup_{x} |g(x)|, \ \sup_{x} |\hat{g}_n(x)| \le B.
\]
Thus by the Glivenko-Cantelli assumption and the definition of \(g\) as a minimizer, we have
\[
\limsup_{n \to \infty} \Big( \frac{1}{n}\sum_{i=1}^n (Y_i - g(X_i))^2 - \frac{1}{n}\sum_{i=1}^n (Y_i - \hat{g}_n(X_i))^2 \Big) \overset{a.s.}{\le} 0,
\]
from which it follows that
\[
\frac{1}{n}\sum_{i=1}^n (\hat{g}_n(X_i) - g(X_i))^2 \xrightarrow{a.s.} 0,
\]
and thus
\[
\frac{1}{n}\sum_{i=1}^n \hat{g}_n(X_i)^2 \xrightarrow{a.s.} \mathbb{E}\, g(X)^2,
\]
as desired. □
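The convex-cone projection facts invoked here (mean preservation when the cone contains the constants, and the orthogonality identity \(\mathbb{E}\, Y g(X) = \mathbb{E}\, g(X)^2\) used below) can be checked numerically for the monotone cone. A self-contained sketch of ours, with a minimal pool-adjacent-violators routine standing in for the empirical projection \(\hat{g}_n\):

```python
import numpy as np

def pava(v):
    """Least-squares projection of a vector onto the nondecreasing cone."""
    sums, counts = [], []
    for val in v:
        s, c = float(val), 1
        while sums and sums[-1] / counts[-1] > s / c:  # pool adjacent violators
            s += sums.pop()
            c += counts.pop()
        sums.append(s)
        counts.append(c)
    return np.concatenate([np.full(c, s / c) for s, c in zip(sums, counts)])

rng = np.random.default_rng(0)
x = np.sort(rng.normal(size=200))       # x sorted; no ties a.s.
y = x + rng.normal(size=200)
g_hat = pava(y)                         # empirical projection of y onto the cone

# mean preservation: (1/n) sum g_hat_i = (1/n) sum y_i
assert abs(g_hat.mean() - y.mean()) < 1e-10
# cone identity: (1/n) sum y_i g_hat_i = (1/n) sum g_hat_i^2
assert abs(np.mean(y * g_hat) - np.mean(g_hat**2)) < 1e-10
```

Both identities hold exactly (up to rounding) because the isotonic fit is the Euclidean projection onto a closed convex cone containing the constant vectors.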
Proof of Proposition 3.1.
First, observe that by part (iv) of Lemma 3 of [9], and the fact that \(\mathcal{H}\) contains the constant functions, we have
\[
\mathbb{E}(Y - g(X)) \cdot 1 \le 0, \qquad \mathbb{E}(Y - g(X)) \cdot (-1) \le 0,
\]
which implies \(\mathbb{E}\, g(X) = \mathbb{E}\, Y\). Similarly, in the empirical case, we have
\[
\frac{1}{n}\sum_{i=1}^n \hat{g}_n(X_i) = \frac{1}{n}\sum_{i=1}^n Y_i.
\]
Thus by the law of large numbers, we obtain convergence of the sample mean of \(\hat{g}_n\) to \(\mathbb{E}\, g(X)\).

Now onto the second moments. For \(B > 0\), define the function
\[
\varphi_B(x) := \min(\max(x, -B), B).
\]
Using the boundedness assumption of Property P, let \(g^B, \hat{g}_n^B \in \mathcal{H}_B\) be such that
\[
\mathbb{E}(\varphi_B(Y) - g^B(X))^2 = \inf_{h \in \mathcal{H}_{\mu_X}} \mathbb{E}(\varphi_B(Y) - h(X))^2,
\qquad
\frac{1}{n}\sum_{i=1}^n (\varphi_B(Y_i) - \hat{g}_n^B(X_i))^2 = \inf_{h \in \mathcal{H}_{\mu_n}} \frac{1}{n}\sum_{i=1}^n (\varphi_B(Y_i) - h(X_i))^2.
\]
By well known properties of projection onto closed convex cones (see e.g. Lemma 3 of [9]), we have
\[
\mathbb{E}\, Y g(X) = \mathbb{E}\, g(X)^2, \qquad \frac{1}{n}\sum_{i=1}^n Y_i \hat{g}_n(X_i) = \frac{1}{n}\sum_{i=1}^n \hat{g}_n(X_i)^2. \tag{3.1}
\]
Thus it suffices to show
\[
\frac{1}{n}\sum_{i=1}^n Y_i \hat{g}_n(X_i) \xrightarrow{a.s.} \mathbb{E}\, Y g(X).
\]
To start, observe for any \(B > 0\),
\[
\Big| \mathbb{E}\, Y g(X) - \frac{1}{n}\sum_{i=1}^n Y_i \hat{g}_n(X_i) \Big|
\le |\mathbb{E}\, Y g(X) - \mathbb{E}\, \varphi_B(Y) g^B(X)|
+ \Big| \mathbb{E}\, \varphi_B(Y) g^B(X) - \frac{1}{n}\sum_{i=1}^n \varphi_B(Y_i) \hat{g}_n^B(X_i) \Big|
+ \Big| \frac{1}{n}\sum_{i=1}^n \varphi_B(Y_i) \hat{g}_n^B(X_i) - \frac{1}{n}\sum_{i=1}^n Y_i \hat{g}_n(X_i) \Big|.
\]
By Lemma 3.2, and an observation analogous to (3.1), the middle term on the right hand side above converges a.s. to 0. Thus it suffices to show that there is some function \(\delta \colon \mathbb{R}_{\ge 0} \to \mathbb{R}_{\ge 0}\), such that \(\lim_{B \to \infty} \delta(B) = 0\), and for all \(B > 0\),
\[
|\mathbb{E}\, Y g(X) - \mathbb{E}\, \varphi_B(Y) g^B(X)| \le \delta(B),
\qquad
\limsup_{n \to \infty} \Big| \frac{1}{n}\sum_{i=1}^n \varphi_B(Y_i) \hat{g}_n^B(X_i) - \frac{1}{n}\sum_{i=1}^n Y_i \hat{g}_n(X_i) \Big| \overset{a.s.}{\le} \delta(B).
\]
To see this, observe
\[
|\mathbb{E}\, Y g(X) - \mathbb{E}\, \varphi_B(Y) g^B(X)| \le |\mathbb{E}(Y - \varphi_B(Y)) g(X)| + |\mathbb{E}\, \varphi_B(Y)(g(X) - g^B(X))|.
\]
By Cauchy-Schwarz and the fact that projections are contractions, the right hand side above may be bounded by
\[
(\mathbb{E}\, Y^2 \mathbf{1}(|Y| > B))^{1/2} (\mathbb{E}\, Y^2)^{1/2} + (\mathbb{E}\, Y^2)^{1/2} (\mathbb{E}(g(X) - g^B(X))^2)^{1/2}.
\]
Now since the projection map is 1-Lipschitz (see e.g. part (vi) of Lemma 3 of [9]), we have
\[
(\mathbb{E}(g(X) - g^B(X))^2)^{1/2} \le (\mathbb{E}(Y - \varphi_B(Y))^2)^{1/2} \le (\mathbb{E}\, Y^2 \mathbf{1}(|Y| > B))^{1/2}.
\]
Combining the previous few displays, we thus obtain
\[
|\mathbb{E}\, Y g(X) - \mathbb{E}\, \varphi_B(Y) g^B(X)| \le 2 (\mathbb{E}\, Y^2 \mathbf{1}(|Y| > B))^{1/2} (\mathbb{E}\, Y^2)^{1/2}.
\]
Applying the same argument to the sample quantities, we may similarly obtain
\[
\Big| \frac{1}{n}\sum_{i=1}^n \varphi_B(Y_i) \hat{g}_n^B(X_i) - \frac{1}{n}\sum_{i=1}^n Y_i \hat{g}_n(X_i) \Big|
\le 2 \Big( \frac{1}{n}\sum_{i=1}^n Y_i^2 \mathbf{1}(|Y_i| > B) \Big)^{1/2} \Big( \frac{1}{n}\sum_{i=1}^n Y_i^2 \Big)^{1/2}.
\]
We thus see that the function
\[
\delta(B) := 2 (\mathbb{E}\, Y^2 \mathbf{1}(|Y| > B))^{1/2} (\mathbb{E}\, Y^2)^{1/2}
\]
has the desired properties, and thus the desired result now follows. □

Remark 3.3.
This simple approach does not work when \(\mathcal{H}\) is the set of all measurable functions (so that \(\mathcal{H}_{\mu_X} = L^2(\mu_X)\)), unless \(X\) is discrete, since then the empirical projection \(\hat{g}_n\) will perfectly match the \(Y\) samples, so that \(\hat{C}_{\mathcal{H}}\) will always be 1. Chatterjee's approach is essentially to use \(\operatorname{Var}(Y \mid X) = \mathbb{E}[(Y - Y')^2 \mid X]/2\), where given \(X\), the pair \((Y, Y')\) is i.i.d. Empirically, no such pair of \(Y\)'s is available, but if \(X = X_{(i)}\), the \(i\)-th order statistic, he essentially approximates by using \((Y_{(i)}, Y_{(i+1)})\). However, since the second identity in (1.2) no longer holds, this seems fruitless for \(\mathcal{H} \ne L^2\).

4. The isotonic case
The one dimensional isotonic case, where \(\mathcal{M}\) is the set of monotone nondecreasing functions \(\bar{\mathbb{R}} \to \bar{\mathbb{R}}\), is special in a number of ways, as noted by many authors. For one, by a small trick, we can define the correlation not just for \(Y\) with finite second moment, but actually for general \(Y\), and the empirical version of this correlation will also satisfy property E). Secondly, it is the only case so far where we are able to verify Property P and thus show convergence of \(\hat{C}_{\mathcal{M}}\), and also establish the behavior of the statistic under independence. However, we mention here as a side note that it could be possible that if we choose \(\mathcal{H}\) to be a small convex subset of a cone, not necessarily closed (for instance, \(\mathcal{H} = \{g \in L^2(\mu) : g \text{ is convex and 1-Lipschitz}\}\)), and assume \(\mathbb{E}[Y \mid X] \in \mathcal{H}\), then we may obtain Property P from known results [8]. In addition, we shall provide an alternative based on Spearman's correlation which is simpler to analyze and also has properties A)-E) (where property C) is suitably modified). We first verify that \(\mathcal{M}\) indeed satisfies Property P.

Lemma 4.1.
The cone \(\mathcal{M}\) satisfies Property P.

Proof. Clearly \(\mathcal{M}\) contains the constant functions. Let \(\mu\) be a probability measure on \(\bar{\mathbb{R}}\). To see why \(\mathcal{M}_\mu = \{h \in \mathcal{M} : h \in L^2(\mu)\}\) is closed in \(L^2(\mu)\), suppose we have a sequence \(\{g_n\}_{n \ge 1} \subseteq \mathcal{M}_\mu\) such that \(g_n \to g \in L^2(\mu)\). We may then extract a subsequence \(g_{n_k}\) which converges to \(g\) \(\mu\)-a.s. Thus if we define \(\tilde{g} := \limsup_k g_{n_k}\), we have that \(\tilde{g}\) is nondecreasing, and also \(\tilde{g} = g\) \(\mu\)-a.e. (Note here is why we need to work with \(\bar{\mathbb{R}}\)-valued functions, since even if \(g_{n_k}\) is \(\mathbb{R}\)-valued for all \(k\), it could be that \(\tilde{g}\) is \(\bar{\mathbb{R}}\)-valued.)

The boundedness property is clearly satisfied by \(\mathcal{M}\). Finally, to show the Glivenko-Cantelli property, fix \(B > 0\). The suprema
\[
\sup_{h \in \mathcal{M}_B} \Big| \frac{1}{n}\sum_{i=1}^n h(X_i) - \mathbb{E}\, h(X) \Big|, \qquad \sup_{h \in \mathcal{M}_B} \Big| \frac{1}{n}\sum_{i=1}^n h(X_i)^2 - \mathbb{E}\, h(X)^2 \Big|
\]
both converge a.s. to 0. Given \(|Y| \le B\), we will thus be done if we can show
\[
\sup_{h \in \mathcal{M}_B} \Big| \frac{1}{n}\sum_{i=1}^n Y_i h(X_i) - \mathbb{E}\, Y h(X) \Big| \xrightarrow{a.s.} 0. \tag{4.1}
\]
Towards this end, note
\[
\frac{1}{n}\sum_{i=1}^n Y_i h(X_i) = \frac{1}{n}\sum_{i=1}^n (Y_i)_+ h(X_i) - \frac{1}{n}\sum_{i=1}^n (Y_i)_- h(X_i),
\]
where \((Y_i)_+ = \max(Y_i, 0)\) and \((Y_i)_- = -\min(Y_i, 0)\). Thus we may assume \(0 \le Y \le B\). But now that \(Y\) is non-negative, we may apply a standard bracketing argument (see e.g. the proof of [21, Theorem 2.4.1]), using the fact that the bracketing numbers for \(\mathcal{M}_B\) are finite (see e.g. [21, Theorem 2.7.5]). □

4.1. The case of general Y. To define a correlation for general \(Y\), the key observation is the following. Let \(G\) be the cdf of \(Y\). Define \(G^- \colon \mathbb{R} \to \bar{\mathbb{R}}\) by
\[
G^-(x) := \inf\{t : G(t) \ge x\}.
\]

Lemma 4.2.
We have \(Y = G^-(G(Y))\) a.s.

Remark 4.3. Observe that both \(G, G^-\) are nondecreasing. Thus this lemma shows that \(Y\) is a nondecreasing function of \(X\) if and only if \(G(Y)\) is a nondecreasing function of \(X\). This will mean that property C) holds for the correlation we will soon define.

Proof.
Observe by definition that for all \(x \in \mathbb{R}\), we have \(x \ge G^-(G(x))\). Let \(\mathcal{C} := \{x \in \mathbb{R} : x > G^-(G(x))\}\). It suffices to show that \(P(Y \in \mathcal{C}) = 0\). Given \(x \in \mathbb{R}\), define \(a_x := G^-(G(x))\), and \(b_x := \sup\{x' : G(x') = G(x)\}\). If \(x \in \mathcal{C}\), then \(a_x < x \le b_x\). We now claim that for any \(x, x' \in \mathcal{C}\), the intervals \((a_x, b_x)\), \((a_{x'}, b_{x'})\) are either disjoint or the same. To see this, first note if \(a_x = a_{x'}\), then \(G^-(G(x)) = G^-(G(x'))\). As \(G^-(G(x)) \le x\), we have \(G(G^-(G(x))) \le G(x)\). Moreover, we have
\[
G(G^-(G(x))) = G(G^-(G(x'))) \ge G(x').
\]
The same argument with \(x, x'\) switched then gives \(G(x) = G(x')\), and thus we see that \(b_x = b_{x'}\).

Now suppose \(a_x < a_{x'}\). We now show that \(b_x \le a_{x'}\). Suppose \(\tilde{x}\) is such that \(G(\tilde{x}) = G(x)\). Then by assumption, \(G^-(G(\tilde{x})) < G^-(G(x'))\). This implies \(G(\tilde{x}) < G(x')\), which implies \(\tilde{x} < G^-(G(x'))\). Taking the supremum over all such \(\tilde{x}\), we obtain the desired inequality.

We thus have that \(\{(a_x, b_x) : x \in \mathcal{C}\}\) is a countable collection of intervals, and thus also \(\mathcal{D} := \{x \in \mathcal{C} : x = b_x\}\) is countable. Note also that for any \(x \in \mathcal{C}\), we have \(P(Y \in (a_x, x]) = 0\), which implies that \(P(Y \in (a_x, b_x)) = 0\), and if additionally \(x = b_x\), then \(P(Y = x) = 0\). As
\[
\mathcal{C} \subseteq \mathcal{D} \cup \bigcup_{x \in \mathcal{C}} (a_x, b_x),
\]
we may thus conclude by a union bound that \(P(Y \in \mathcal{C}) = 0\), as desired. □

Let us now define a population correlation \(C_{mon}\) as follows:
\[
C_{mon}(X, Y) := C_{\mathcal{M}}(X, G(Y)).
\]
When defining the empirical correlation, there is the problem that we may not know \(G\). Instead, we can plug in the estimate \(\hat{G}_n\), the empirical cdf of the \(Y\) samples. Thus, define the empirical correlation as follows:
\[
\hat{C}_{mon}(X, Y) = \hat{C}_{mon, n}(X, Y) := \hat{C}_{\mathcal{M}, n}(X, \hat{G}_n(Y)).
\]
In other words, we are simply replacing each \(Y_i\) by its normalized rank \(\hat{G}_n(Y_i)\), for \(1 \le i \le n\). In practice, the empirical correlation may be computed as follows.
First, sort \(X_{(1)} \le \cdots \le X_{(n)}\). For each \(1 \le i \le n\), let \(Y_{(i)}\) be the \(Y\) sample corresponding to \(X_{(i)}\). Let \(\hat{z}_1 \le \cdots \le \hat{z}_n\) be the solution to the isotonic regression
\[
\inf_{z_1 \le \cdots \le z_n} \frac{1}{n}\sum_{i=1}^n (\hat{G}_n(Y_{(i)}) - z_i)^2,
\]
where there is the restriction that if \(X_{(i)} = X_{(i+1)}\), then \(z_i = z_{i+1}\). Then
\[
\hat{C}_{mon}(X, Y) = \frac{\frac{1}{n}\sum_{i=1}^n \hat{z}_i^2 - \big( \frac{1}{n}\sum_{i=1}^n \hat{G}_n(Y_i) \big)^2}{\operatorname{Var}_n(\hat{G}_n(Y))}.
\]
Here we have also used the fact that
\[
\frac{1}{n}\sum_{i=1}^n \hat{z}_i = \frac{1}{n}\sum_{i=1}^n \hat{G}_n(Y_i).
\]
Sorting the \(X\) sample takes time \(O(n \log n)\), the isotonic regression can be done in time \(O(n)\) by the pool adjacent violators algorithm, and all other computations take time \(O(n)\). Thus the empirical correlation may be calculated in time \(O(n \log n)\).

Proposition 4.4. For any \((X, Y)\) where \(Y\) is a.s. not a constant, we have that \(\hat{C}_{mon}(X, Y) \xrightarrow{a.s.} C_{mon}(X, Y)\).

Proof.
Note that since \(\mathcal{M}\) satisfies Property P and \(0 \le \hat{G}_n, G \le 1\), by the boundedness assumption it suffices to just optimize over \(\mathcal{M}_1\). We have by the Glivenko-Cantelli assumption that
\[
\sup_{h \in \mathcal{M}_1} \Big| \frac{1}{n}\sum_{i=1}^n (G(Y_i) - h(X_i))^2 - \mathbb{E}(G(Y) - h(X))^2 \Big| \xrightarrow{a.s.} 0.
\]
Now for any \(h \in \mathcal{M}_1\), we have
\[
\Big| \frac{1}{n}\sum_{i=1}^n (G(Y_i) - h(X_i))^2 - \frac{1}{n}\sum_{i=1}^n (\hat{G}_n(Y_i) - h(X_i))^2 \Big| \le 4 \sup_{x \in \mathbb{R}} |\hat{G}_n(x) - G(x)|.
\]
By the Glivenko-Cantelli theorem, we have
\[
\sup_{x \in \mathbb{R}} |\hat{G}_n(x) - G(x)| \xrightarrow{a.s.} 0,
\]
and thus we obtain
\[
\sup_{h \in \mathcal{M}_1} \Big| \frac{1}{n}\sum_{i=1}^n (\hat{G}_n(Y_i) - h(X_i))^2 - \mathbb{E}(G(Y) - h(X))^2 \Big| \xrightarrow{a.s.} 0.
\]
The rest of the proof now proceeds as in the proofs of Lemma 3.2 and Proposition 3.1. □
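The recipe above can be sketched in a few lines (our own illustration, not code from the paper, assuming no ties in either sample so that the constraint \(z_i = z_{i+1}\) for tied \(X\)'s never binds):

```python
import numpy as np

def pava(v):
    """O(n) least-squares nondecreasing fit (pool adjacent violators)."""
    sums, counts = [], []
    for val in v:
        s, c = float(val), 1
        while sums and sums[-1] / counts[-1] > s / c:
            s += sums.pop()
            c += counts.pop()
        sums.append(s)
        counts.append(c)
    return np.concatenate([np.full(c, s / c) for s, c in zip(sums, counts)])

def c_hat_mon(x, y):
    """Empirical isotonic correlation: Var_n of the isotonic fit of the
    normalized ranks G_n(Y) on X, divided by Var_n(G_n(Y))."""
    x, y = np.asarray(x), np.asarray(y)
    n = len(x)
    g = (np.argsort(np.argsort(y)) + 1) / n   # G_n(Y_i) = rank(Y_i)/n
    z = pava(g[np.argsort(x)])                # isotonic fit in increasing-x order
    return z.var() / g.var()
```

Because only the ranks of the \(Y\)'s and the ordering of the \(X\)'s enter, the value is unchanged under strictly increasing transformations of either coordinate, which is the empirical side of property E).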
Similar to (2.4), we now define the averaged (population) correlation
\[
\tilde{C}_{mon}(X, Y) := \tfrac{1}{2}\big( C(X, Y) + C_{mon}^{1/2}(X, Y) \big),
\]
as well as the empirical version
\[
\hat{\tilde{C}}_{mon, n}(X, Y) := \tfrac{1}{2}\big( \hat{C}_n(X, Y) + \hat{C}_{mon}^{1/2}(X, Y) \big).
\]
Here we use \(C_{mon}^{1/2}\) rather than \(C_{mon}\), because as we shall see, the asymptotic theory under independence is nicer. The population version satisfies properties A)-E) (with property C) suitably adjusted), and is defined for general (non-constant) \(Y\). By Proposition 4.4, we have that the empirical correlation converges a.s. to the population correlation.

4.2. Asymptotic behavior under independence and continuity.
We next investigate the asymptotic distribution of \(\hat{\tilde{C}}_{mon}(X, Y)\) under the assumptions that
X, Y are independent and have continuous distributions.
Theorem 4.5.
Assume
\(X, Y\) are independent and continuously distributed. Then
\[
\sqrt{n}\Big( \hat{\tilde{C}}_{mon}(X, Y) - \frac{1}{2}\sqrt{\frac{\log n}{n}} \Big) \xrightarrow{d} N(0, 23/80).
\]
This theorem follows from the following proposition about the joint asymptotics of the two empirical correlations.
Proposition 4.6.
Assume
\(X, Y\) are independent and continuously distributed. Then
\[
\Big( \sqrt{n}\, \hat{C}_n(X, Y),\ \frac{n \hat{C}_{mon}(X, Y) - \log n}{\sqrt{\log n}} \Big)
\xrightarrow{d} N\Big( \begin{pmatrix} 0 \\ 0 \end{pmatrix}, \begin{pmatrix} 2/5 & 0 \\ 0 & 3 \end{pmatrix} \Big).
\]

Proof of Theorem 4.5.
Let \(Z_n := n \hat{C}_{mon}(X, Y)\). By Taylor's remainder theorem, we have
\[
\sqrt{Z_n} = \sqrt{\log n} + \frac{Z_n - \log n}{2\sqrt{\log n}} - \frac{(Z_n - \log n)^2}{8\, \xi_n^{3/2}},
\]
where \(\xi_n\) is between \(Z_n\) and \(\log n\). By Proposition 4.6, we obtain
\[
\sqrt{Z_n} - \sqrt{\log n} = \frac{Z_n - \log n}{2\sqrt{\log n}} + o_P(1).
\]
We thus have
\[
\sqrt{n}\Big( \hat{\tilde{C}}_{mon}(X, Y) - \frac{1}{2}\sqrt{\frac{\log n}{n}} \Big)
= \frac{1}{2}\sqrt{n}\, \hat{C}_n(X, Y) + \frac{1}{4} \cdot \frac{n \hat{C}_{mon}(X, Y) - \log n}{\sqrt{\log n}} + o_P(1).
\]
We may now finish by Proposition 4.6 and Slutsky's lemma. □
Remark 4.7.
As we will see, the distribution of \(n \hat{C}_{mon}(X, Y)\) may be approximately described as follows. First, sample \(N_n\), which is distributed as the number of cycles in a uniform random permutation on \([n]\) (and which has an explicit distributional representation as a sum of independent Bernoullis; see e.g. [19, Section 2]). Then generate a \(\chi^2_{N_n}\) random variable. This explains the scaling in the central limit theorem, since it is known that \((N_n - \log n)/\sqrt{\log n} \xrightarrow{d} N(0, 1)\) (see e.g. [19, Section 2]), and \((\chi^2_n - n)/\sqrt{2n} \xrightarrow{d} N(0, 1)\).

4.3. Proof of Proposition 4.6. Sort \(X_{(1)} < \cdots < X_{(n)}\) (the inequalities are strict since \(X\) has a continuous distribution), and let \(Y_{(i)}\) be the \(Y\) sample corresponding to \(X_{(i)}\). Let \(\hat{z}_1 \le \cdots \le \hat{z}_n\) be the isotonic regression of \(\hat{G}_n(Y)\) on \(X\), i.e. the solution to the minimization problem
\[
\inf_{z_1 \le \cdots \le z_n} \frac{1}{n}\sum_{i=1}^n (\hat{G}_n(Y_{(i)}) - z_i)^2.
\]
The solution satisfies
\[
\frac{1}{n}\sum_{i=1}^n \hat{z}_i = \frac{1}{n}\sum_{i=1}^n \hat{G}_n(Y_i).
\]
Since \(Y\) is continuous, the right hand side above will deterministically be
\[
\mu_n := \frac{1}{2}\Big( 1 + \frac{1}{n} \Big),
\]
and moreover
\[
\sigma_n^2 := \frac{1}{12}\Big( 1 - \frac{1}{n^2} \Big) = \operatorname{Var}_n(\hat{G}_n(Y)).
\]
We then have that
\[
\hat{C}_{mon}(X, Y) = \frac{1}{\sigma_n^2} \cdot \frac{1}{n}\sum_{i=1}^n (\hat{z}_i - \mu_n)^2 = \frac{1}{n}\sum_{i=1}^n \Big( \frac{\hat{z}_i - \mu_n}{\sigma_n} \Big)^2.
\]
Note also that if \(\tilde{z}_1, \ldots, \tilde{z}_n\) is the isotonic regression of \((\hat{G}_n(Y) - \mu_n)/\sigma_n\) on \(X\), i.e. the solution to the minimization problem
\[
\inf_{z_1 \le \cdots \le z_n} \frac{1}{n}\sum_{i=1}^n \Big( \frac{\hat{G}_n(Y_{(i)}) - \mu_n}{\sigma_n} - z_i \Big)^2,
\]
then we have that \((\hat{z}_i - \mu_n)/\sigma_n = \tilde{z}_i\) for all \(1 \le i \le n\), and thus
\[
\hat{C}_{mon}(X, Y) = \frac{1}{n}\sum_{i=1}^n \tilde{z}_i^2.
\]
Since
\(X, Y\) are independent and \(Y\) is continuous, we have that the random vector \((n \hat{G}_n(Y_1), \ldots, n \hat{G}_n(Y_n))\) has the same distribution as \(\pi\), a uniform random permutation on \([n] := \{1, \ldots, n\}\). We thus have that \((\tilde{z}_1, \ldots, \tilde{z}_n)\) has the same distribution as \((\hat{w}_1, \ldots, \hat{w}_n)\), where the latter is the isotonic regression of \((\pi/n - \mu_n)/\sigma_n\), i.e. the solution to the minimization problem
\[
\inf_{w_1 \le \cdots \le w_n} \frac{1}{n}\sum_{i=1}^n \Big( \frac{\pi(i)/n - \mu_n}{\sigma_n} - w_i \Big)^2.
\]
Recalling the definition (1.1) of \(\hat{C}_n(X, Y)\), we further obtain that the random vector \((\hat{C}_n(X, Y), \hat{C}_{mon}(X, Y))\) has the same distribution as
\[
\Big( 1 - \frac{3 \sum_{i=1}^{n-1} |\pi(i) - \pi(i+1)|}{n^2 - 1},\ \frac{1}{n}\sum_{i=1}^n \hat{w}_i^2 \Big). \tag{4.2}
\]
We thus have reduced the problem to analyzing statistics of a uniform random permutation.

The key tool in our analysis will be the following bijection on permutations, which we now begin to describe, following [19, Section 3]. We start by fixing some real numbers \(y_1, \ldots, y_n\) which are linearly independent over \(\mathbb{Z}\); i.e., if \(a_1, \ldots, a_n \in \mathbb{Z}\) are such that \(a_1 y_1 + \cdots + a_n y_n = 0\), then \(a_1 = \cdots = a_n = 0\). Given a permutation \(\tau = (\tau(1), \ldots, \tau(n))\), define the cumulative sum process \(S_\tau \colon [0, n] \to \mathbb{R}\) by linearly interpolating between the points \(S_\tau(0) := 0\), \(S_\tau(i) := y_{\tau(1)} + \cdots + y_{\tau(i)}\) for \(1 \le i \le n\). Let \(M_\tau\) be the greatest convex minorant of \(S_\tau\) (technically [19, Section 3] considers the least concave majorant, but by a sign change we see that everything in the section also applies to the greatest convex minorant). Note \(M_\tau\) will be a piecewise linear function, and so let \(0 = i_0 < i_1 < \cdots < i_m = n\) denote the knots of \(M_\tau\). Now define the permutation \(\tilde{\tau}\) as the product of cycles
\[
\tilde{\tau} := (\tau_{i_0 + 1}, \ldots, \tau_{i_1})(\tau_{i_1 + 1}, \ldots, \tau_{i_2}) \cdots (\tau_{i_{m-1} + 1}, \ldots, \tau_{i_m}). \tag{4.3}
\]
It is proven in [19, Section 4] that this map \(\tau \mapsto \tilde{\tau}\) is a bijection. Call this map \(\mathcal{B}\). To be clear, \(\mathcal{B}\) is determined by the real numbers \(y_1, \ldots, y_n\), which were assumed to be linearly independent. This bijection on permutations is called the Bohnenblust-Spitzer algorithm.

To see how the Bohnenblust-Spitzer algorithm relates to our current situation, we now describe a well known explicit representation for the \(\hat{w}_i\). For \(1 \le i \le n\), define
\[
x_i := \frac{1}{\sigma_n}\Big( \frac{i}{n} - \mu_n \Big).
\]
Let \(S_\pi\) be the cumulative sum process defined as in the previous paragraph, but now using \(x_1, \ldots, x_n\). Let \(M_\pi\) be the greatest convex minorant of \(S_\pi\). Then for each \(1 \le i \le n\), \(\hat{w}_i\) is equal to the left hand slope of \(M_\pi\) at \(i\) (see e.g. [14, Theorem 1.2.1]). So given two consecutive knots \(i_{k-1} < i_k\) of \(M_\pi\), and a point \(i_{k-1} < i \le i_k\), we have
\[
\hat{w}_i = \frac{S_\pi(i_k) - S_\pi(i_{k-1})}{i_k - i_{k-1}} = \frac{x_{\pi(i_{k-1}+1)} + \cdots + x_{\pi(i_k)}}{i_k - i_{k-1}}. \tag{4.4}
\]
Now define the functions on permutations of \([n]\)
\[
f_1(\tau) := 1 - \frac{3n}{n^2 - 1} \sum_{C \in \tau} \sum_{(i, j) \in C} |i/n - j/n|,
\qquad
f_2(\tau) := \sum_{C \in \tau} \Big( \frac{1}{\sqrt{|C|}} \sum_{i \in C} x_i \Big)^2,
\]
where \(\sum_{C \in \tau}\) denotes summation over the cycles of \(\tau\), \(|C|\) is the length of \(C\), and if \(C = (i_1, \ldots, i_k)\), then \(\sum_{(i, j) \in C}\) means we are summing over consecutive pairs \((i_1, i_2), (i_2, i_3), \ldots, (i_k, i_1)\), and \(\sum_{i \in C}\) means we are summing over \(i_1, \ldots, i_k\). The next lemma allows us to prove Proposition 4.6 by studying \((f_1(\rho), f_2(\rho))\), for \(\rho\) a uniform random permutation on \([n]\).

Lemma 4.8.
For all \(n \ge 2\), there is a coupling \((\pi, \rho)\), such that both \(\pi, \rho\) are uniform random permutations on \([n]\), and
\[
\sqrt{n}\Big( 1 - \frac{3 \sum_{i=1}^{n-1} |\pi(i) - \pi(i+1)|}{n^2 - 1} \Big) = \sqrt{n}\, f_1(\rho) + o_P(1),
\qquad
\sum_{i=1}^n \hat{w}_i^2 = f_2(\rho) + o_P(1).
\]

Proof.
Fix \(n \ge 2\). We use the Bohnenblust-Spitzer algorithm. There is the slight problem that \(x_1, \ldots, x_n\) are not linearly independent over \(\mathbb{Z}\). This can be remedied by introducing \(\delta := 2^{-n}\) (say), and taking a perturbation \(x_1^\delta, \ldots, x_n^\delta\) which is linearly independent over \(\mathbb{Z}\) and such that \(|x_i^\delta - x_i| \le \delta\) for all \(1 \le i \le n\). Let \(\mathcal{B}_\delta\) be the bijection given by the Bohnenblust-Spitzer algorithm applied with \(x_1^\delta, \ldots, x_n^\delta\). Since \(\mathcal{B}_\delta\) is a bijection, \(\mathcal{B}_\delta(\pi)\) is also a uniform random permutation. Set \(\rho := \mathcal{B}_\delta(\pi)\).

We show the second statement first. Let \(\hat{w}_1^\delta, \ldots, \hat{w}_n^\delta\) be the isotonic regression of \(x_{\pi(1)}^\delta, \ldots, x_{\pi(n)}^\delta\). First, since isotonic regression is a projection onto a convex cone and thus is 1-Lipschitz, we have
\[
\sum_{i=1}^n (\hat{w}_i - \hat{w}_i^\delta)^2 \le \sum_{i=1}^n (x_i^\delta - x_i)^2 \le n \delta^2,
\]
and thus applying Cauchy-Schwarz, we obtain
\[
\Big| \sum_{i=1}^n \hat{w}_i^2 - \sum_{i=1}^n (\hat{w}_i^\delta)^2 \Big| = O(n \delta).
\]
Now observe that by the definition (4.3) of \(\mathcal{B}_\delta(\pi)\) and the characterization (4.4) of the isotonic regression, we have
\[
\sum_{i=1}^n (\hat{w}_i^\delta)^2 = \sum_{C \in \rho} |C| \Big( \frac{1}{|C|} \sum_{i \in C} x_i^\delta \Big)^2 = \sum_{C \in \rho} \Big( \frac{1}{\sqrt{|C|}} \sum_{i \in C} x_i^\delta \Big)^2 =: f_2^\delta(\rho),
\]
and moreover
\[
|f_2^\delta(\rho) - f_2(\rho)| = \sum_{C \in \rho} O(\sqrt{|C|}) \cdot \sqrt{|C|}\, \delta = O(n \delta).
\]
Putting everything together, we obtain the second statement (actually we have proven something slightly stronger – the \(o_P(1)\) can be replaced by \(o(1)\)).

For the first statement, observe that the differences only arise when we wrap around a cycle of \(\rho\), or at the boundary between two cycles of \(\rho\). Let \(N_n\) be the number of cycles of \(\rho\). This then gives
\[
\Big| \sqrt{n}\Big( 1 - \frac{3 \sum_{i=1}^{n-1} |\pi(i) - \pi(i+1)|}{n^2 - 1} \Big) - \sqrt{n}\, f_1(\rho) \Big| = O\Big( \frac{N_n}{\sqrt{n}} \Big).
\]
The desired statement now follows since \(\mathbb{E}\, N_n = O(\log n)\) (see e.g. [19, Section 2]).
(cid:3) We have thus reduced to studying the joint asymptotics of ( f ( ρ ) , f ( ρ )),for a uniform random permutation ρ . We will now suppose ρ is sampledas follows. First, sample L ≥ · · · ≥ L N n , which are distributed as theranked cycle lengths of a uniform random permutation on [ n ] (and so N n isdistributed as the number of cycles). For 1 ≤ i ≤ N n , let a i := L + · · · + L i , and let a := 0. Independently of L , . . . , L N n , sample V , . . . , V n i.i.d. ∼ Unif(0 , F n be the empirical cdf of the V sample. Let η be thepermutation defined by η ( i ) := ˆ F n ( V i ), 1 ≤ i ≤ n . Note η itself is a uniformrandom permutation on [ n ]. Finally, set ρ to be the product of cycles ρ := ( η ( a + 1) , . . . , η ( a )) · · · ( η ( a N n − + 1) , . . . , η ( a N n )) . For 1 ≤ i ≤ n , let ˜ V i := 1 √ (cid:18) V i − (cid:19) . Let 1 ≤ A n ≤ B n ≤ N n be such that L A n +1 ≥ · · · ≥ L B n are exactly thecycle lengths which are in the interval [(log n ) , n/ (log n ) ]. If there are nosuch cycle lengths, trivially set A n := 0 , B n := 0. The need for introducing A n , B n is detailed at two points later on, just before Lemma 4.11 and justbefore the proof of Proposition 4.6. Let F n := σ ( N n , L , . . . , L N n ). Beforecontinuing, we collect in the following lemma some basic facts about thecycles of random permutations. Lemma 4.9.
For $1 \leq i \leq n$, the expected number of cycles of length $i$ in a uniform random permutation on $[n]$ is exactly $i^{-1}$. Consequently, $\mathbb{E} N_n = O(\log n)$, and also $\mathbb{E}(N_n - (B_n - A_n)) = O(\log \log n)$. We also have that
$$\frac{N_n - \log n}{\sqrt{\log n}} \xrightarrow{d} N(0, 1).$$
Consequently, $N_n / \log n \xrightarrow{p} 1$.

Proof. For the first claim, see e.g. [11, Theorem 2]. Using this claim, we have
$$\mathbb{E}(N_n - (B_n - A_n)) \leq \sum_{i=1}^{(\log n)^3} \frac{1}{i} + \sum_{i = n/(\log n)^3}^{n} \frac{1}{i} = O(\log \log n).$$
For a proof of the central limit theorem, see e.g. [19, Section 2]. ∎
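The first claim of Lemma 4.9 is easy to check by simulation: the expected number of cycles is exactly the harmonic number $H_n = \sum_{i=1}^n i^{-1}$. The following sketch is our own illustration (not code from the paper); the function name `num_cycles` is ours.

```python
import random

def num_cycles(perm):
    """Count the cycles of a permutation given as a list mapping i -> perm[i]."""
    seen = [False] * len(perm)
    count = 0
    for i in range(len(perm)):
        if not seen[i]:
            count += 1
            j = i
            while not seen[j]:
                seen[j] = True
                j = perm[j]
    return count

random.seed(1)
n, reps = 100, 20000
# random.sample(range(n), n) is a uniform random permutation of [n]
avg = sum(num_cycles(random.sample(range(n), n)) for _ in range(reps)) / reps
H_n = sum(1 / i for i in range(1, n + 1))  # exact value of E N_n
print(abs(avg - H_n) < 0.1)
```

For $n = 100$, $H_n \approx 5.19$, so the $O(\log n)$ growth of $\mathbb{E} N_n$ is already visible at modest sample sizes.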
The following lemma is an intermediate step in simplifying $f_1(\rho), f_2(\rho)$.

Lemma 4.10. We have
$$\sqrt{n}\, f_1(\rho) = \frac{1}{\sqrt{n}} \sum_{i=1}^{n-1} \bigl(1 - 3\,|\eta(i)/n - \eta(i+1)/n|\bigr) + o_P(1),$$
$$f_2(\rho) = \sum_{i=A_n+1}^{B_n} \left(\frac{1}{\sqrt{L_i}} \sum_{j=a_{i-1}+1}^{a_i} \tilde{V}_j\right)^2 + o_P\bigl(\sqrt{\log n}\bigr).$$

Proof.
First, observe that √ n (cid:18) f ( ρ ) − n X C ∈ ρ X ( i,j ) ∈ C (1 − | ρ ( i ) /n − ρ ( j ) /n | ) (cid:19) = O (cid:18) n / (cid:19) . Next, observe √ n (cid:18) n X C ∈ ρ X ( i,j ) ∈ C | ρ ( i ) /n − ρ ( j ) /n |− n n − X i =1 | η ( i ) /n − η ( i +1) /n | (cid:19) = O (cid:18) N n √ n (cid:19) . The first claim now follows, since E N n = O (log n ) (by Lemma 4.9).For the second claim, observe (cid:12)(cid:12)(cid:12)(cid:12) f ( ρ ) − B n X i = A n +1 (cid:18) √ L i a i X j = a i − +1 ˜ V j (cid:19) (cid:12)(cid:12)(cid:12)(cid:12) ≤ R + R , with R := X C ∈ ρ | C | / ∈ [(log n ) ,n/ (log n ) ] (cid:18) p | C | X i ∈ C x i (cid:19) ,R := B n X i = A n +1 (cid:12)(cid:12)(cid:12)(cid:12)(cid:18) √ L i a i X j = a i − +1 x η ( j ) (cid:19) − (cid:18) √ L i a i X j = a i − +1 ˜ V j (cid:19) (cid:12)(cid:12)(cid:12)(cid:12) . To bound R , observe that if we condition on F n , then for any 1 ≤ i ≤ N n ,we have E (cid:20)(cid:18) √ L i a i X j = a i − x η ( j ) (cid:19) (cid:12)(cid:12)(cid:12)(cid:12) F n (cid:21) = O (1) . We can thus obtain (recalling Lemma 4.9) E R = O ( E ( N n − ( B n − A n ))) = O (log log n ) , We thus have R = o P ( √ log n ). Next, we will show that E R = O (1), which implies R = o P ( √ log n ).Let ∆ n := sup x ∈ [0 , | ˆ F n ( x ) − x | . Observe for any 1 ≤ i ≤ N n , we have byCauchy-Schwarz E (cid:20)(cid:12)(cid:12)(cid:12)(cid:12)(cid:18) √ L i a i X j = a i − +1 x η ( j ) (cid:19) − (cid:18) √ L i a i X j = a i − +1 ˜ V j (cid:19) (cid:12)(cid:12)(cid:12)(cid:12) (cid:12)(cid:12)(cid:12)(cid:12) F n (cid:21) ≤ S S , where S := (cid:18) E (cid:20)(cid:18) √ L i a i X j = a i − +1 x η ( j ) + ˜ V j (cid:19) (cid:12)(cid:12)(cid:12)(cid:12) F n (cid:21)(cid:19) / ,S := (cid:18) E (cid:20)(cid:18) √ L i a i X j = a i − +1 x η ( j ) − ˜ V j (cid:19) (cid:12)(cid:12)(cid:12)(cid:12) F n (cid:21)(cid:19) / . The inequality ( x + y ) ≤ x + 2 y and a moment computation gives S = O (1) . 
To bound S , we have by the definition of ∆ n , the independence of ∆ n and F n , and the facts | µ n − / | ≤ /n , | σ n − / √ | = O (1 /n ), E ∆ n = O (1 /n )(which follows by e.g. the Dvoretzky-Kiefer-Wolfowitz inequality), S ≤ O ( n − / ) p L i . We thus obtain E R = O ( n − / ) E (cid:20) B n X i = A n +1 p L i (cid:21) . By Lemma 4.9, we have E (cid:20) B n X i = A n +1 p L i (cid:21) ≤ n X i =1 √ i i = O ( n / ) . The desired result now follows. (cid:3)
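The bound $\mathbb{E} \Delta_n^2 = O(1/n)$ used above (a consequence of the Dvoretzky-Kiefer-Wolfowitz inequality) is easy to see numerically. This check is our own sketch, not from the paper: it estimates $\mathbb{E} \Delta_n^2$ at two sample sizes and confirms the roughly $1/n$ scaling.

```python
import random

def sup_ecdf_dev(n, rng):
    """Delta_n = sup_x |F_hat_n(x) - x| for an i.i.d. Unif(0,1) sample of size n."""
    v = sorted(rng.random() for _ in range(n))
    # the supremum is attained at the jump points of the empirical cdf
    return max(max(abs((i + 1) / n - v[i]), abs(i / n - v[i])) for i in range(n))

rng = random.Random(4)
reps = 2000
m1 = sum(sup_ecdf_dev(100, rng) ** 2 for _ in range(reps)) / reps
m2 = sum(sup_ecdf_dev(400, rng) ** 2 for _ in range(reps)) / reps
print(2.5 < m1 / m2 < 6)  # quadrupling n divides E Delta_n^2 by about 4
```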
We now show that the simplified version of $f_2(\rho)$ given by Lemma 4.10 is asymptotically normal. Here is where we crucially use our lower bound on the cycle lengths $L_j$ for $A_n + 1 \leq j \leq B_n$, because this ensures that every term in the quantity $T_n$ defined below is approximately a $\chi^2_1$, meaning $T_n$ is approximately a $\chi^2_{B_n - A_n}$.

Lemma 4.11. We have
$$\frac{1}{\sqrt{\log n}} \left( \sum_{i=A_n+1}^{B_n} \left( \frac{1}{\sqrt{L_i}} \sum_{j=a_{i-1}+1}^{a_i} \tilde{V}_j \right)^2 - \log n \right) \xrightarrow{d} N(0, 3).$$

Proof.
Let
$$T_n := \sum_{i=A_n+1}^{B_n} \left( \frac{1}{\sqrt{L_i}} \sum_{j=a_{i-1}+1}^{a_i} \tilde{V}_j \right)^2.$$
Let $M_n := B_n - A_n$. Since $\mathbb{E}(N_n - M_n) = O(\log \log n)$ (by Lemma 4.9), we have
$$\frac{1}{\sqrt{\log n}} (T_n - \log n) = \frac{1}{\sqrt{\log n}} (T_n - M_n) + \frac{1}{\sqrt{\log n}} (N_n - \log n) + o_P(1).$$
We know that $(N_n - \log n)/\sqrt{\log n} \xrightarrow{d} N(0, 1)$ (by Lemma 4.9). Thus it suffices to show that for all $\theta \in \mathbb{R}$, we have
$$\mathbb{E}\left[\exp\bigl(i\theta (T_n - M_n)/\sqrt{\log n}\bigr) \,\Big|\, \mathcal{F}_n\right] \xrightarrow{p} \exp(-\theta^2).$$
To start, for $k \geq$
1, let φ k be the characteristic function of (cid:18) √ k k X i =1 ˜ V i (cid:19) . By the central limit theorem and the continuous mapping theorem, we havethat φ k → φ pointwise, where φ is the characteristic function of χ . More-over, we claim that for all M >
0, we havesup | θ |≤ M | φ k ( θ ) − φ ( θ ) | = O (cid:18) M k (cid:19) , (4.5)For now let us take the claim as given. Observe E (cid:20) exp (cid:16) iθ ( T n − M n ) / p log n (cid:17) (cid:12)(cid:12)(cid:12)(cid:12) F n (cid:21) = B n Y j = A n +1 φ L j ( θ/ p log n ) e − iθ/ √ log n . Observe moreover that (cid:12)(cid:12)(cid:12)(cid:12) B n Y j = A n +1 φ L j ( θ/ p log n ) e − iθ/ √ log n − B n Y j = A n +1 φ ( θ/ p log n ) e − iθ/ √ log n (cid:12)(cid:12)(cid:12)(cid:12) ≤ B n X j = A n +1 | φ L j ( θ/ p log n ) − φ ( θ/ p log n ) | . By the claim, and the fact that L j ≥ (log n ) for all A n + 1 ≤ j ≤ B n , theright hand side above may be bounded by O (cid:18) (1 + | θ | ) B n X j = A n L j (cid:19) = O (cid:18) (1 + | θ | ) N n (log n ) (cid:19) = o P (1) , where the second equality follows since E N n = O (log n ) (by Lemma 4.9).For k ≥
1, let ψ k be the characteristic function of1 √ k k X i =1 ( A i − , A i i.i.d. ∼ χ . Then we have shown that for all θ ∈ R , E (cid:20) exp (cid:16) iθ ( T n − M n ) / p log n (cid:17) (cid:12)(cid:12)(cid:12)(cid:12) F n (cid:21) − ψ M n ( θ p M n / log n ) p −→ . Now observe that ψ M n is 1-Lipschitz for all n , and M n / log n p → ψ M n ( θ p M n / log n ) p −→ exp (cid:0) − θ (cid:1) . The desired result now follows, modulo the claim (4.5).To show the claim, first let W k := ( k − / P ki =1 ˜ V i ) , and let A ∼ χ . For θ ∈ R , let h θ : R → R be defined by h θ ( x ) := cos( θx ). Note for 0 ≤ i ≤ x ∈ R | h ( i ) ( x ) | ≤ | θ | i . Then by [7, Theorem 3.1], we have that forall θ ∈ R , k ≥ | E h θ ( W k ) − E h θ ( A ) | = O (cid:18) | θ | k (cid:19) . Applying this theorem also to the functions g θ ( x ) := sin( θx ), we obtain thedesired claim. (cid:3) The following lemma is the key result needed to show asymptotic inde-pendence of ( f ( ρ ) , f ( ρ )). Lemma 4.12.
We have
$$\frac{1}{\sqrt{n}} \sum_{i=1}^{n-1} \bigl(1 - 3\,|\eta(i)/n - \eta(i+1)/n|\bigr) = \frac{1}{\sqrt{n}} \sum_{i=1}^{n - n/(\log n)^3} \bigl(2 - 3|V_i - V_{i+1}| - 6V_i(1 - V_i)\bigr) + o_P(1).$$
Moreover, either side above converges in distribution to $N(0, 2/5)$.

Proof. From the proof of Theorem 2 in [1] (see in particular equations (5) and (9)), it is shown that
$$\frac{1}{n^{1/2}(n-1)} \left( \sum_{i=1}^{n-1} |\eta(i) - \eta(i+1)| - \frac{n(n-1)}{3} \right) = \frac{1}{\sqrt{n}} \sum_{i=1}^{n-1} \bigl(|V_i - V_{i+1}| + 2V_i(1 - V_i) - 2/3\bigr) + o_P(1).$$
From this it follows that
$$\frac{1}{\sqrt{n}} \sum_{i=1}^{n-1} \bigl(|\eta(i)/n - \eta(i+1)/n| - 1/3\bigr) = \frac{1}{\sqrt{n}} \sum_{i=1}^{n-1} \bigl(|V_i - V_{i+1}| + 2V_i(1 - V_i) - 2/3\bigr) + o_P(1).$$
By a variance calculation, we have that the right hand side above is equal to
$$\frac{1}{\sqrt{n}} \sum_{i=1}^{n - n/(\log n)^3} \bigl(|V_i - V_{i+1}| + 2V_i(1 - V_i) - 2/3\bigr) + o_P(1).$$
The first desired result now follows by combining the previous few observations. The convergence in distribution follows also by the previous few observations and [1, Theorem 2]. ∎
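The limiting law here is the well-known $N(0, 2/5)$ limit for $\sqrt{n}\, \hat{C}_n$ under independence with continuous marginals, established by Chatterjee [4]. It can be seen in simulation; the helper below is our own sketch of the tie-free statistic (1.1), not code from the paper.

```python
import random

def chatterjee(xs, ys):
    """Chatterjee's correlation (1.1) for samples without ties."""
    n = len(xs)
    order = sorted(range(n), key=lambda i: xs[i])       # sort by the X sample
    rank = {v: r + 1 for r, v in enumerate(sorted(ys))}  # rank of each Y value
    r = [rank[ys[i]] for i in order]
    return 1 - 3 * sum(abs(r[i + 1] - r[i]) for i in range(n - 1)) / (n * n - 1)

random.seed(2)
n, reps = 500, 2000
vals = []
for _ in range(reps):
    xs = [random.random() for _ in range(n)]
    ys = [random.random() for _ in range(n)]  # independent of xs
    vals.append(n ** 0.5 * chatterjee(xs, ys))
mean = sum(vals) / reps
var = sum((v - mean) ** 2 for v in vals) / reps
print(abs(mean) < 0.1, abs(var - 0.4) < 0.1)  # mean near 0, variance near 2/5
```

Under independence the statistic has exact mean zero for every $n$, so only the variance estimate carries Monte Carlo error.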
We are now ready to prove Proposition 4.6. Since we already have the marginal asymptotics by Lemmas 4.11 and 4.12, the major thing left is to show asymptotic independence of the two marginals. This is where we crucially use our upper bound on the cycle lengths $L_j$ for $A_n + 1 \leq j \leq B_n$, to ensure that the two marginals are essentially functions of disjoint sets of the $V_j$ variables.

Proof of Proposition 4.6.
Let W n := 1 √ n n − n/ (log n ) X i =1 (2 − | V i − V i +1 | − V i (1 − V i )) ,Z n := 1 √ log n (cid:18) B n X i = A n (cid:18) √ L i a i X j = a i − +1 ˜ V j (cid:19) − log n (cid:19) . By combining (4.2), and Lemmas 4.8, 4.10, 4.11, and 4.12, we have that itsuffices to show the following. Let f, g : R → R be bounded and continuous.Then lim n →∞ | E f ( W n ) g ( Z n ) − E f ( W n ) E g ( Z n ) | = 0 . For each n , define the event E n := { L + · · · + L A n > n − n/ (log n ) } = { L A n +1 + · · · + L N n < n/ (log n ) } where we have used the fact L + · · · + L N n = n . Observe that on this event,the only ˜ V j variables which appear in Z n must have j > n − n/ (log n ) .From this, it follows that E n E [ f ( W n ) g ( Z n ) | F n ] = E n E [ f ( W n ) | F n ] E [ g ( Z n ) | F n ]= E n E [ f ( W n )] E [ g ( Z n ) | F n ] , where the second identity follows since V , . . . , V n is independent of F n .Letting C be such that sup x ∈ R | f ( x ) | , sup x ∈ R | g ( x ) | ≤ C , we thus havelim sup n →∞ | E f ( W n ) g ( Z n ) − E f ( W n ) E g ( Z n ) | ≤ C lim sup n →∞ P ( E cn ) . By Lemma 4.9, we have E [ L A n +1 + · · · + L N n ] ≤ X i ≤ n/ (log n ) i i = n (log n ) . Combining this with Markov’s inequality, we obtain P ( E cn ) = P ( L A n +1 + · · · + L N n ≥ n/ (log n ) ) ≤ n . The desired result now follows. (cid:3)
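Throughout this section the isotonic fit $\hat{w}$ can be computed with the pool-adjacent-violators algorithm. The sketch below is our own illustration (not code from the paper); it verifies numerically the projection property invoked in the proof of Lemma 4.8: isotonic regression, being a projection onto a convex cone, is 1-Lipschitz as a map of the data.

```python
import random

def isotonic_regression(x):
    """Pool Adjacent Violators: least-squares projection of x onto
    the cone of nondecreasing sequences."""
    blocks = []  # each block is [sum of entries, count]
    for v in x:
        blocks.append([v, 1])
        # merge while block means violate monotonicity
        while len(blocks) > 1 and \
                blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]:
            s, c = blocks.pop()
            blocks[-1][0] += s
            blocks[-1][1] += c
    out = []
    for s, c in blocks:
        out.extend([s / c] * c)
    return out

random.seed(0)
n = 200
x = [random.gauss(0, 1) for _ in range(n)]
y = [xi + random.gauss(0, 0.01) for xi in x]  # small perturbation of x
wx, wy = isotonic_regression(x), isotonic_regression(y)
dist_fit = sum((a - b) ** 2 for a, b in zip(wx, wy))
dist_data = sum((a - b) ** 2 for a, b in zip(x, y))
print(dist_fit <= dist_data + 1e-12)  # projection onto a convex cone is 1-Lipschitz
```

This is exactly the mechanism that lets the proof pass from the perturbed data $x^\delta$ back to $x$ at a cost of $O(n\delta)$.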
Other features of the isotonic case.
In the isotonic case there is a simple alternative to $\hat{C}_{mon}$. Recall the Spearman correlation [16] (also known as Spearman's $\rho$), given by, if both $X$ and $Y$ have continuous distributions,
$$\hat{C}_S = \frac{\sum_{i=1}^n \bigl(i - \frac{n+1}{2}\bigr) S_i}{\sum_{i=1}^n \bigl(i - \frac{n+1}{2}\bigr)^2} = \frac{1}{n \sigma_n^2} \sum_{i=1}^n \frac{i}{n} \cdot \frac{S_i}{n} - \frac{\mu_n^2}{\sigma_n^2}, \qquad (4.6)$$
where $S_i = n \hat{G}_n(Y_{(i)})$ is the rank of $Y_{(i)}$ (where as usual $Y_{(i)}$ is the $Y$ sample corresponding to the $i$th order statistic $X_{(i)}$), and as in Section 4.2, we have
$$\mu_n = \frac{1}{2}\left(1 + \frac{1}{n}\right), \qquad \sigma_n^2 = \frac{1}{12}\left(1 - \frac{1}{n^2}\right).$$
For general $(X, Y)$ the population version is
$$C_S(X, Y) := \mathrm{corr}(F(X), G(Y)) = \frac{\mathrm{Cov}(F(X), G(Y))}{\bigl(\mathrm{Var}(F(X))\, \mathrm{Var}(G(Y))\bigr)^{1/2}},$$
where $F, G$ are the marginal cdfs of $X, Y$, respectively. It is well known [16] that $C_S$ satisfies $C_S = 1$ if and only if $Y = g(X)$, where $g$ is strictly increasing, and property B) holds partially, in that $C_S = 0$ if $X, Y$ are independent.

The general estimate of $C_S(X, Y)$ is
$$\hat{C}_S := \frac{\sum_{i=1}^n \hat{F}_n(X_i) \hat{G}_n(Y_i) - n\, \bar{\hat{F}}_n \cdot \bar{\hat{G}}_n}{\bigl(\sum_{i=1}^n (\hat{F}_n(X_i) - \bar{\hat{F}}_n)^2 \sum_{i=1}^n (\hat{G}_n(Y_i) - \bar{\hat{G}}_n)^2\bigr)^{1/2}}, \qquad (4.7)$$
where $\hat{F}_n, \hat{G}_n$ are the empirical cdfs of $X, Y$, respectively, and
$$\bar{\hat{F}}_n := \frac{1}{n} \sum_{i=1}^n \hat{F}_n(X_i),$$
and $\bar{\hat{G}}_n$ is similarly defined. In the case $F, G$ are continuous, (4.7) reduces to (4.6), and further the distribution of $\hat{C}_S$ if $X$ and $Y$ are independent doesn't depend on $F, G$, since $F(X), G(Y) \overset{\text{i.i.d.}}{\sim} \mathrm{Unif}(0, 1)$. In the case $F, G$ are continuous (but $X, Y$ are not necessarily independent), the distribution of $\hat{C}_S$ depends only on $g(v \mid u)$, the conditional density of $V := G(Y)$ given $U := F(X)$, as is the case for Chatterjee's correlation $\hat{C}_n$.
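When $F, G$ are continuous (so there are no ties), the general estimate (4.7) built from empirical cdfs coincides with the usual rank-based Spearman coefficient, since $\hat{F}_n(X_i)$ is just the rank of $X_i$ divided by $n$. A quick numerical check (our own sketch; the function names are ours):

```python
def spearman_from_ecdf(xs, ys):
    """Sample correlation of (F_hat_n(X_i), G_hat_n(Y_i)), in the style of (4.7)."""
    n = len(xs)
    def ecdf(vals):
        return [sum(v <= x for v in vals) / n for x in vals]  # F_hat_n at the data
    fx, gy = ecdf(xs), ecdf(ys)
    mf, mg = sum(fx) / n, sum(gy) / n
    num = sum((a - mf) * (b - mg) for a, b in zip(fx, gy))
    den = (sum((a - mf) ** 2 for a in fx) * sum((b - mg) ** 2 for b in gy)) ** 0.5
    return num / den

def spearman_ranks(xs, ys):
    """Pearson correlation of the rank vectors."""
    n = len(xs)
    rx = [sorted(xs).index(x) + 1 for x in xs]
    ry = [sorted(ys).index(y) + 1 for y in ys]
    m = (n + 1) / 2
    num = sum((a - m) * (b - m) for a, b in zip(rx, ry))
    den = (sum((a - m) ** 2 for a in rx) * sum((b - m) ** 2 for b in ry)) ** 0.5
    return num / den

xs = [0.3, 1.7, 0.2, 2.5, 1.1]
ys = [1.0, 0.4, 2.2, 0.9, 1.5]
print(abs(spearman_from_ecdf(xs, ys) - spearman_ranks(xs, ys)) < 1e-12)
```

The agreement is exact for tie-free data because correlation is invariant under the affine map from ranks to normalized ranks.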
A linear expansion.
In the case of general (
X, Y ) we writeˆ C S = Q ( ˆ H n ) , where ˆ H n is the empirical joint distribution of the ( X i , Y i ) pairs, 1 ≤ i ≤ n ,and for any joint distribution H , Q ( H ) := R F ( x ) G ( y ) dH ( x, y ) − µ ( F ) µ ( G ) σ ( F ) σ ( G ) , where F, G are the marginals of H , and µ ( F ) := Z F ( x ) dF ( x ) , σ ( F ) := Z F ( x ) dF ( x ) − µ ( F ) . A standard delta method argument yields an expansion for Q ( ˆ H ) − Q ( H ),using Z ˆ F n ( x ) ˆ G n ( y ) d ˆ H n ( x, y ) = Z F ( x ) G ( y ) dH ( x, y ) + Z ( ˆ F n − F )( x ) G ( y ) dH ( x, y )+ Z F ( x )( ˆ G n − G )( y ) dH ( x, y )+ Z F ( x ) G ( y ) d ( ˆ H n − H )( x, y ) + o P ( n − / ) , and µ ( ˆ F n ) = µ ( F ) + Z ( ˆ F n − F )( x ) dF ( x ) + Z F ( x ) d ( ˆ F n − F )( x ) + o P ( n − / ) , and similar expansions for µ ( ˆ G n ) as well as σ ( ˆ F n ) , σ ( ˆ G n ). We do not developthe general details here but give the calculation for F, G continuous and
$X, Y$ independent. In that case all terms but the last are treated as known, and we can obtain (using integration by parts in the appropriate places)
$$\hat{C}_S = \frac{12}{n} \sum_{i=1}^n (U_i - 1/2)(V_i - 1/2) + o_P(n^{-1/2}), \qquad (4.8)$$
where $U_i = F(X_i)$, $V_i = G(Y_i)$, so that $U_i, V_i$, $1 \leq i \leq n$, are i.i.d. $\mathrm{Unif}(0, 1)$. Recall that
$$\hat{C}_n(X, Y) = 1 - \frac{3n}{n^2 - 1} \sum_{i=1}^{n-1} |\hat{G}_n(Y_{(i)}) - \hat{G}_n(Y_{(i+1)})|,$$
where the dependence on the $X$ sample is only through the indices $(i)$. Observe that as in the proof of Lemma 4.12, we have from the proof of Theorem 2 in [1] (see in particular equations (5) and (9)) that
$$\hat{C}_n(X, Y) = \frac{3}{n} \sum_{i=1}^{n-1} \bigl(2/3 - |V_{(i)} - V_{(i+1)}| - 2V_{(i)}(1 - V_{(i)})\bigr) + o_P(n^{-1/2}). \qquad (4.9)$$
Now by a CLT for triangular arrays of 1-dependent random variables (see e.g. the Theorem of [12]), and the Cramér-Wold device, we may obtain
$$(\sqrt{n}\, \hat{C}_n, \sqrt{n}\, \hat{C}_S) \xrightarrow{d} N\left(\begin{pmatrix} 0 \\ 0 \end{pmatrix}, \begin{pmatrix} 2/5 & 0 \\ 0 & 1 \end{pmatrix}\right),$$
so that in particular, for any $\lambda \in (0, 1)$,
$$\sqrt{n}\, \bigl(\lambda \hat{C}_n + (1 - \lambda) \hat{C}_S\bigr) \xrightarrow{d} N\bigl(0,\ 2\lambda^2/5 + (1 - \lambda)^2\bigr).$$

Local power calculations.
We will make local power calculations for $\lambda \hat{C}_n + (1 - \lambda) \hat{C}^{1/2}_{mon}$, $\lambda \hat{C}_n + (1 - \lambda) \hat{C}_S$, and other rank statistics using contiguity theory (see [10] or [20, Chapters 6-8]).

I. We begin with a model $\{h_\theta(x, y)\}_{|\theta| < 1}$, where for each $|\theta| < 1$, $h_\theta(x, y)$ is a joint density with respect to Lebesgue measure, and $h_0(x, y) = f(x) g(y)$ (independence).

II. Suppose
$$\dot{\ell}(x, y) := \frac{\partial}{\partial \theta} \log h_\theta(x, y) \Big|_{\theta = 0}$$
exists, and suppose the family $\{h_\theta(x, y)\}_{|\theta| < 1}$ is quadratic mean differentiable at $\theta = 0$ with score function $\dot{\ell}$. That is, we have
$$\int\!\!\int \Bigl(\sqrt{h_\theta(x, y)} - \sqrt{h_0(x, y)} - (1/2)\, \theta\, \dot{\ell}(x, y) \sqrt{h_0(x, y)}\Bigr)^2 \, dx\, dy = o(\theta^2).$$
Moreover, assume $\mathbb{E}\, \dot{\ell}(X, Y)^2 > 0$ (we automatically have $\mathbb{E}[\dot{\ell}(X, Y)^2] < \infty$ and $\mathbb{E}\, \dot{\ell}(X, Y) = 0$).

For $\theta \in \mathbb{R}$, let $P_{n,\theta}$ be the product measure on $([0,1] \times [0,1])^n$ with density $\prod_{i=1}^n h_\theta(x_i, y_i)$. Take $t \in \mathbb{R}$ and let $\theta_n := t/\sqrt{n}$. By Le Cam's Theorem (see e.g. [20, Theorem 7.2 and Example 6.5]), under our assumptions $\{P_{n,\theta_n}\}_{n \geq 1}$ is contiguous with respect to $\{P_{n,0}\}_{n \geq 1}$. That is, if $T_n$ is a function of $((X_i, Y_i), 1 \leq i \leq n)$, then
$$T_n \xrightarrow{P_{n,0}} 0 \implies T_n \xrightarrow{P_{n,\theta_n}} 0.$$
For our purposes, more importantly, Le Cam's third lemma holds (see e.g. [20, Theorem 7.2 and Example 6.7]), stating that if
$$\left(\sqrt{n}\, T_n,\ \frac{1}{\sqrt{n}} \sum_{i=1}^n \dot{\ell}(X_i, Y_i)\right) \xrightarrow{d} N\left(\begin{pmatrix} 0 \\ 0 \end{pmatrix}, \begin{pmatrix} \sigma^2 & \rho \sigma \tau \\ \rho \sigma \tau & \tau^2 \end{pmatrix}\right) \qquad (4.10)$$
under $\{P_{n,0}\}_{n \geq 1}$, where $\tau^2 := \mathbb{E}[\dot{\ell}(X, Y)^2]$ is the Fisher information, then under $\{P_{n,\theta_n}\}_{n \geq 1}$, we have
$$\sqrt{n}\, T_n \xrightarrow{d} N(tc, \sigma^2),$$
where $c := \rho \sigma \tau$ is the asymptotic covariance (under $\{P_{n,0}\}_{n \geq 1}$) between the statistic $T_n$ and the score statistic.
Assuming (4.10) holds under $\{P_{n,0}\}_{n \geq 1}$, $\sqrt{n}\, T_n / \sigma$ can be viewed as a test statistic for the hypothesis of independence, while
$$L_n := \frac{1}{\tau \sqrt{n}} \sum_{i=1}^n \dot{\ell}(X_i, Y_i)$$
can be viewed as the asymptotically optimal test statistic for the family $\{h_\theta(x, y)\}_{|\theta| < 1}$. The Pitman efficiency (see e.g. [20, Chapter 8]) of the first test to the second is, by (4.10), $e(T_n) = \rho^2$. Note that $\rho^2(L_n) = 1$.

Proposition 4.13.
Suppose the listed assumptions I and II hold. Then we have $e(\hat{C}_n) = 0$.

Remark 4.14.
In our initial draft, we had proven this result for some specific models. After we became aware of the related work of Shi et al. [17], we realized that the result held in the present general setting. As such, the following proof is a small adaptation of an argument in [17]. In particular, we utilize their very nice and crucial observation that some fortuitous cancellation occurs.
Proof.
Throughout, all expectations and covariances will be under θ = 0, sowe omit writing E and Cov . Recalling (4.9), we have that under { P n, } n ≥ ,ˆ C n = 1 n n X i =1 (2 / − | V ( i ) − V ( i +1) | − V ( i ) (1 − V ( i ) )) + o P ( n − / ) , where V i := G ( Y i ), G is the cdf of Y under θ = 0, and ( n + 1) := (1). Itthus suffices to show that for all n ≥ (cid:18) n X i =1 | V ( i ) − V ( i +1) | , n X i =1 ˙ ℓ ( X i , Y i ) (cid:19) = − (cid:18) n X i =1 V ( i ) (1 − V ( i ) ) , n X i =1 ˙ ℓ ( X i , Y i ) (cid:19) . We haveCov (cid:18) n X i =1 V ( i ) (1 − V ( i ) ) , n X i =1 ˙ ℓ ( X i , Y i ) (cid:19) = Cov (cid:18) n X i =1 V i (1 − V i ) , n X i =1 ˙ ℓ ( X i , Y i ) (cid:19) = n X i =1 Cov( V i (1 − V i ) , ˙ ℓ ( X i , Y i ))= n X i =1 E [ V i (1 − V i ) ˙ ℓ ( X i , Y i )] . where the second to last equality follows since E ˙ ℓ ( X, Y ) = 0. For theother covariance term, let π be the permutation defined by if i = ( j ), then π ( i ) := ( j + 1). Then n X i =1 | V ( i ) − V ( i +1) | = n X i =1 | V i − V π ( i ) | . ThusCov (cid:18) n X i =1 | V ( i ) − V ( i +1) | , n X i =1 ˙ ℓ ( X i , Y i ) (cid:19) = n X i,j =1 Cov( | V i − V π ( i ) | , ˙ ℓ ( X j , Y j ))= n X i,j =1 E [ | V i − V π ( i ) | ˙ ℓ ( X j , Y j )] . In the case i = j , we have E [ | V i − V π ( i ) | ˙ ℓ ( X j , Y j )] = E (cid:2) ( π ( i ) = j ) E [ | V i − V π ( i ) | ˙ ℓ ( X j , Y j ) | X , . . . , X n ] (cid:3) + E [ ( π ( i ) = j ) | V i − V j | ˙ ℓ ( X j , Y j )] . On the event π ( i ) = j , we have that Y i , Y π ( i ) , Y j are all conditionally inde-pendent given X , . . . , X n , and thus ( π ( i ) = j ) E [ | V i − V π ( i ) | ˙ ℓ ( X j , Y j ) | X , . . . , X n ] = ( π ( i ) = j ) E [ | V i − V π ( i ) | ] E [ ˙ ℓ ( X j , Y j ) | X , . . . , X n ] . Note V i , V π ( i ) i.i.d. ∼ Unif(0 , E [ | V i − V π ( i ) | ] is constant in i . Note also n X i,j =1 i = j E [ ( π ( i ) = j ) ˙ ℓ ( X j , Y j )] = ( n − n X j =1 E ˙ ℓ ( X j , Y j ) = 0 . 
Thus upon combining the previous few displays, we obtain n X i,j =1 E [ | V i − V π ( i ) | ˙ ℓ ( X j , Y j )] = n X i =1 E [ | V i − V π ( i ) | ˙ ℓ ( X i , Y i )] + n X i,j =1 i = j E [ ( π ( i ) = j ) | V i − V j | ˙ ℓ ( X j , Y j )] . The right hand side above may be more simply written n X i =1 E [ | V i − V π ( i ) | ˙ ℓ ( X i , Y i )] + n X i =1 E [ | V π − ( i ) − V i | ˙ ℓ ( X i , Y i )] . To finish, we want to show that the above is equal to − E n X i =1 E [ V i (1 − V i ) ˙ ℓ ( X i , Y i )] = − n E [ V (1 − V ) ˙ ℓ ( X, Y )] . ORRELATIONS WITH TAILORED EXTREMAL PROPERTIES 27
Towards this end, observe for any i , we have Y i , Y π ( i ) , X i are independent,with Y i , Y π ( i ) i.i.d. ∼ G , and thus we obtain E [ | V i − V π ( i ) | ˙ ℓ ( X i , Y i )] = E [ | V − V ′ | ˙ ℓ ( X, Y )] , where V = G ( Y ), and V ′ d = V is independent of X, Y . We similarly have E [ | V π − ( i ) − V i | ˙ ℓ ( X i , Y i )] = E [ | V − V ′ | ˙ ℓ ( X, Y )] . Thus n X i =1 E [ | V i − V π ( i ) | ˙ ℓ ( X i , Y i )]+ n X i =1 E [ | V π − ( i ) − V i | ˙ ℓ ( X i , Y i )] = 2 n E [ | V − V ′ | ˙ ℓ ( X, Y )] . To finish, observe that E [ | V − V ′ | | X, Y ] = E [ | V − V ′ | | V ] = 12 ( V + (1 − V ) ) , so that E [ | V − V ′ | ˙ ℓ ( X, Y )] = 12 E [(2 V − V + 1) ˙ ℓ ( X, Y )] = − E [ V (1 − V ) ˙ ℓ ( X, Y )] , where the second equality follows since E ˙ ℓ ( X, Y ) = 0. The desired resultnow follows. (cid:3)
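The conditional expectation identity at the end of the proof, $\mathbb{E}[|V - V'| \mid V = v] = \frac{1}{2}(v^2 + (1-v)^2)$ for $V' \sim \mathrm{Unif}(0,1)$, is the elementary integral $\int_0^1 |v - u|\, du$. A numerical check (our own sketch):

```python
def expected_abs_diff(v, m=200000):
    """E|v - U| for U ~ Unif(0,1), via a midpoint Riemann sum."""
    return sum(abs(v - (k + 0.5) / m) for k in range(m)) / m

for v in (0.1, 0.25, 0.5, 0.9):
    exact = (v ** 2 + (1 - v) ** 2) / 2
    assert abs(expected_abs_diff(v) - exact) < 1e-4
print("identity verified")
```

Since $v^2 + (1-v)^2 = 1 - 2v(1-v)$, centering by $\mathbb{E}\,\dot{\ell}(X,Y) = 0$ turns this into the $-\mathbb{E}[V(1-V)\dot{\ell}(X,Y)]$ term appearing in the display above.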
Remark 4.15.
In addition to the models considered by Shi et al. [17], wenote another class of models to which Proposition 4.13 applies, similar tothe trend model considered by Chatterjee [4].Here Y = θa ( X ) + ε, where X and ε are independent with densities f , g , respectively. The jointdensity is h θ ( x, y ) = f ( x ) g ( y − θa ( x )) . If g is differentiable, and I ( g ) := Z ∞−∞ g ′ ( y ) g ( y ) dy < ∞ , then it is easy to see that the family { h θ ( x, y ) : | θ | < } is quadratic meandifferentiable at θ = 0, with˙ ℓ ( X, Y ) = − a ( X ) g ′ ( Y ) g ( Y ) . Hence e ( ˆ C n ) = 0 here too. In fact, this is true even if we make the modelsemiparametric with g unknown. Comparison of ˆ C n to ˆ˜ C mon . Recall that we definedˆ˜ C mon := (1 / C n + ˆ C / mon ) . Let us generalize this slightly toˆ˜ C mon,λ := λ ˆ C n + (1 − λ ) ˆ C / mon . As in the proof of Theorem 4.5, we can obtain that for
$X, Y$ independent and continuously distributed,
$$\sqrt{n}\, \Bigl(\hat{\tilde{C}}_{mon,\lambda} - (1 - \lambda) \sqrt{\log n / n}\Bigr) \xrightarrow{d} N\Bigl(0,\ \frac{2\lambda^2}{5} + \frac{3}{4}(1 - \lambda)^2\Bigr).$$
Now note that $\hat{C}^{1/2}_{mon}$ and $L_n$ are asymptotically independent. This follows by essentially the same argument given in Section 4.2 for the independence of $\hat{C}_n$ and $\hat{C}_{mon}$. Therefore, $\hat{\tilde{C}}_{mon,\lambda}$ always has Pitman efficiency less than that of $\hat{C}_n$ unless $\lambda = 1$. Specifically, the asymptotic correlation between $L_n$ and $\hat{\tilde{C}}_{mon,\lambda}$ is
$$\frac{\lambda}{\bigl(\frac{2\lambda^2}{5} + \frac{3}{4}(1 - \lambda)^2\bigr)^{1/2}} \lim_{n \to \infty} \mathrm{Cov}\bigl(\sqrt{n}\, \hat{C}_n, L_n\bigr).$$
This is equal to
$$\frac{1}{\bigl(1 + \frac{15}{8}\bigl(\frac{1-\lambda}{\lambda}\bigr)^2\bigr)^{1/2}} \cdot \frac{\lim_{n \to \infty} \mathrm{Cov}\bigl(\sqrt{n}\, \hat{C}_n, L_n\bigr)}{\sqrt{2/5}} = \frac{e^{1/2}(\hat{C}_n)}{\bigl(1 + \frac{15}{8}\bigl(\frac{1-\lambda}{\lambda}\bigr)^2\bigr)^{1/2}}.$$
We thus obtain
$$e(\hat{\tilde{C}}_{mon,\lambda}) = \frac{e(\hat{C}_n)}{1 + \frac{15}{8}\bigl(\frac{1-\lambda}{\lambda}\bigr)^2}.$$
When $\lambda = 1/2$, we have
$$e(\hat{\tilde{C}}_{mon}) = \frac{8}{23}\, e(\hat{C}_n).$$

A copula generated model.
We believe the appropriate setting for rank statistic power calculations is a copula generated model as defined next. We start with a parametric model of densities $\{h_\theta : |\theta| < 1\}$, where $h_\theta : (0,1)^2 \to [0, \infty)$ for all $|\theta| < 1$, and $h_0(x, y) \equiv 1$. Let
$$\mathcal{F} := \{a : [0,1] \to [0,1] \text{ absolutely continuous},\ a' > 0,\ a(0) = 0,\ a(1) = 1\}.$$
I.e., $\mathcal{F}$ is the set of all absolutely continuous, strictly increasing transformations which map 0 to 0 and 1 to 1. Given $q, r \in \mathcal{F}$, we may define
$$h_\theta(x, y, q, r) := h_\theta(q(x), r(y))\, q'(x)\, r'(y), \qquad (x, y) \in (0,1)^2.$$
Note that if $h_\theta$ is the density of $(X, Y)$, then $h_\theta(\cdot, \cdot, q, r)$ is the density of the pair $(q^{-1}(X), r^{-1}(Y))$. Define a semiparametric model by
$$\mathcal{P} := \{h_\theta(\cdot, \cdot, q, r) : |\theta| < 1,\ q, r \in \mathcal{F}\}.$$
Note that $h_0(x, y, q, r) = q'(x) r'(y)$ corresponds to independence for all $(q, r)$ pairs.
Let Q := { Q θ : ( F θ ( X ) , G θ ( Y )) ∼ Q θ , ( X, Y ) ∼ h θ , | θ | < } . Then Q is a copula which generates the same semiparametric model as { h θ : | θ | < } .Rank test statistics have the property that their distribution depends onlyon θ and not on q or r . Therefore, we believe calculations should be carriedout for parametric submodels where q and r also depend on θ .Consider a parametric submodel, with ¯ θ = ( θ , θ , θ ), and φ ¯ θ ( x, y ) := h θ ( x, y, q θ , r θ ) , | ¯ θ | ∞ < . Let ℓ ¯ θ := log φ ¯ θ , and let ∇ ℓ ¯0 ( x, y ) := ( D ( x, y ) , D ( x, y ) , D ( x, y ))be the gradient of ℓ (in the ¯ θ variable) at ¯ θ = 0. We suppose that the family { φ ¯ θ : | ¯ θ | ∞ < } is quadratic mean differentiable at ¯ θ = ¯0. That is, it satisfies E ¯0 (cid:0) φ ¯ θ ( X, Y ) / − φ ¯0 ( X, Y ) / − (1 / φ ¯0 ( X, Y ) / ( ∇ ℓ ¯0 ( X, Y ) , ¯ θ ) (cid:1) = o ( | ¯ θ | ) , (QMD)where E ¯0 signifies that the random variables ( X, Y ) are distributed accordingto φ ¯0 ( x, y ) = q ′ ( x ) r ′ ( y ). Note that D ( X, Y ) = ∂∂θ log h θ ( q θ ( X ) , r θ ( Y )) (cid:12)(cid:12)(cid:12)(cid:12) ¯ θ =0 = ˙ ℓ ( q ( X ) , r ( Y )) , (here ˙ ℓ := ∂ log h θ /∂θ | θ =0 ) and since h ¯0 ≡
1, we have D ( X, Y ) = ∂∂θ log q ′ θ ( X ) (cid:12)(cid:12)(cid:12)(cid:12) θ =0 , D ( X, Y ) = ∂∂θ log r ′ θ ( Y ) (cid:12)(cid:12)(cid:12)(cid:12) θ =0 . Remark 4.16.
Let q ′ θ ( x ) := q ′ ( x ) (cid:0) θa ( q ( x )) (cid:1) ,r ′ θ ( y ) := r ′ ( y ) (cid:0) θb ( r ( y )) (cid:1) , (4.11)where | a | < , | b | <
1, and E a ( U ) = E b ( U ) = 0, for U ∼ Unif(0 , { h θ : | θ | < } satisfies assumptions I and II, then the parametric copulamodel defined by (4.11) above satisfies (QMD).The tangent spaces (using the notation of [3]) of the model defined by(4.11) at θ = θ = θ = 0 are˙ P θ = [ ˙ ℓ ( q ( X ) , r ( Y ))] , ˙ P θ = [ a ( q ( X ))] , ˙ P θ = [ b ( r ( Y ))] . Since the set of a, b as above are dense in L ([0 , { c ( U ) : E [ c ( U ) ] < ∞ , E c ( U ) = 0 } , the tangent spaces of the semiparametric copula model are˙ P θ = [ ˙ ℓ ( q ( X ) , r ( Y ))] , ˙ P q = { f ( q ( X )) : f ∈ L ([0 , } , ˙ P r = { f ( r ( Y )) : f ∈ L ([0 , } . While ˙ ℓ ( q ( X ) , r ( Y )) is the score function for the base model { h θ ( · , · , q , r ) : | θ | < } , we now obtain that˙ ℓ ∗ ( X, Y ) := ˙ ℓ ( q ( X ) , r ( Y )) − E [ ˙ ℓ ( q ( X ) , r ( Y )) | X ] − E [ ˙ ℓ ( q ( X ) , r ( Y )) | Y ]is the efficient score function for the model P , since E [ ˙ ℓ ( q ( X ) , r ( Y )) | X ] isthe projection of ˙ ℓ ( q ( X ) , r ( Y )) on ˙ P q and similarly for ˙ P r (see [3, Chapter3]).Consider the submodel of P , given by (4.11), with θ = θ = θ = θ , and a ( q ( X )) = − E [ ˙ ℓ ( q ( X ) , r ( Y )) | X ] , b ( r ( Y )) = − E [ ˙ ℓ ( q ( X ) , r ( Y )) | Y ] . Given assumptions I and II, we have that (QMD) holds, and hence thismodel is also quadratic mean differentiable at θ = 0, with score function ˙ ℓ ∗ .For estimation, this means that if an estimate is regular and linear on P ,then its asymptotic variance σ /n ≥ ( I ∗ ) − /n , where I ∗ := E [ ˙ ℓ ∗ ( q ( X ) , r ( Y )) ] . For a rank test statistic T n such that under θ = 0, we have that (cid:18) √ nT n , √ n n X i =1 ˙ ℓ ∗ ( X i , Y i ) (cid:19) d −→ N (cid:18) (cid:18) (cid:19) , (cid:18) σ ρ ∗ σ √ I ∗ ρ ∗ σ √ I ∗ I ∗ (cid:19) (cid:19) , (4.12)then as in Section 4.4, the Pitman efficiency of T n with respect to the optimalasymptotic test is e ∗ ( T n ) = ( ρ ∗ ) . Remark 4.17.
We can exhibit explicit families { q θ : | θ | < } , { r θ : | θ | < } such that the parametric model { h θ ( · , · , q θ , r θ ) : | θ | < } has score function˙ ℓ ∗ . Set q θ := F − θ , r θ := G − θ , where F θ , G θ are the respective marginal cdfsof X, Y under h θ . (Note then the distribution h θ ( · , · , q θ , r θ ) is given by thelaw of ( F θ ( X ) , G θ ( Y )), if ( X, Y ) is distributed according to h θ . So we areessentially going from the semiparametric copula generated model back tothe parametric copula model.) Let ˙ s denote the score function at θ = 0 forthis model. A change of variables calculation then gives that˙ s ( X, Y ) = ˙ ℓ ( q ( X ) , r ( Y )) − f ( X ) ∂f θ ∂θ ( q ( X )) (cid:12)(cid:12)(cid:12)(cid:12) θ =0 − g ( Y ) ∂g θ ∂θ ( r ( Y )) (cid:12)(cid:12)(cid:12)(cid:12) θ =0 . It remains to show that under θ = 0, we have E [ ˙ ℓ ( q ( X ) , r ( Y )) | X ] = 1 f ( X ) ∂f θ ∂θ ( q ( X )) (cid:12)(cid:12)(cid:12)(cid:12) θ =0 , and similarly for the other term. Since h ≡
1, and thus f ≡
1, this reducesto showing E (cid:20) ∂h θ ∂θ ( q ( X ) , r ( Y )) (cid:12)(cid:12)(cid:12)(cid:12) θ =0 (cid:12)(cid:12)(cid:12)(cid:12) X (cid:21) = ∂f θ ∂θ ( q ( X )) (cid:12)(cid:12)(cid:12)(cid:12) θ =0 . Assuming h θ has sufficient smoothness (e.g., there is some ε > H ( q ( X ) , r ( Y )) such that sup | θ | <ε | ( ∂h θ /∂θ )( q ( X ) , r ( Y )) | ≤ H ( q ( X ) , r ( Y )), ORRELATIONS WITH TAILORED EXTREMAL PROPERTIES 31 and E H ( q ( X ) , r ( Y )) < ∞ ) we can interchange the conditional expectationand the differentiation, to obtain that the left hand side is ∂∂θ E [ h θ ( q ( X ) , r ( Y )) | X ] (cid:12)(cid:12)(cid:12)(cid:12) θ =0 . To finish, we have (note at θ = 0, we have that Y is independent of X , andhas density r ′ ) E [ h θ ( q ( X ) , r ( Y )) | X ] = Z h θ ( q ( X ) , r ( y )) r ′ ( y ) dy = f θ ( q ( X )) . Note for any rank test statistic T n satisfying (4.12), the model with scorefunction ˙ ℓ ∗ at θ = 0 is least favorable in terms of power, and ( ρ ∗ ) ≤ ρ ,where ρ is as in Section 4.4. For ˆ C n , we have established that ρ = 0.For the semiparametric copula generated model, we can get a result whichis weaker for ˆ C n , but more general in a way we point out. Proposition 4.18.
Suppose we have a rank statistic T n which satisfies (4.12) , and such that for every n , there exists functions q x,n , q y,n , such that √ nT n = q x,n ( X , . . . , X n ) + q y,n ( Y , . . . , Y n ) + o P (1) Then under the submodel of P which has score function ˙ ℓ ∗ at θ = 0 , we havethat e ∗ ( T n ) = 0 .Proof. NoteCov (cid:18) q x,n ( X , . . . , X n ) , n X i =1 ℓ ∗ ( X i , Y i ) (cid:19) = n X i =1 E [ q x,n ( X , . . . , X n ) ℓ ∗ ( X i , Y i )]= n X i =1 E [ q x,n ( X , . . . , X n ) E [ ℓ ∗ ( X i , Y i ) | X i ]]= 0 , where the final equality follows since E [ ℓ ∗ ( X i , Y i ) | X i ] = 0. The other termis handled similarly. (cid:3) Remark 4.19.
Dette et al. [5] defined a statistic ˆ r n (see [5, equation (13)]),which under the right assumptions, is consistent for the correlation C definedby (1.2), and moreover satisfies a central limit theorem. If additionally,( X, Y ) are independent, then (see in particular equation (24) in the proofof [5, Theorem 3])16 ˆ r n + 13 = T ( Y , . . . , Y n ) + T ( X , . . . , X n ) + T ( Y , . . . , Y n ) + o P ( n − / ) , for some functions T , T , T . Thus the previous proposition implies that e (ˆ r n ) = 0 for quadratic mean differentiable families which have score func-tion ˙ ℓ ∗ . On the other hand, see Shi et al. [17] for a family for which ˆ r n israte optimal. We conjecture that the poor local behavior of ˆ C n and ˆ r n compared withother tests of independence which are consistent against all alternatives, asshown by Shi et al. [17], is due to their property of being exactly 1 atthe population level if and only if Y = g ( X ) a.s. for some g , which is notshared by any of the competitors. We believe that if one adds smoothnessassumptions to H : Y = g ( X ) a.s., then that gives these statistics uniquelygood local power against alternatives to this hypothesis. Acknowledgements
We thank Sourav Chatterjee for facilitating this collaboration, as well as for helpful conversations. We thank Holger Dette for pointing out an important reference. We thank Hongjian Shi, Fang Han, and Mathias Drton for valuable comments on the local power calculations, which led to an improvement of our results.
References

[1] Angus, J.E. (1995). A coupling proof of the asymptotic normality of the permutation oscillation. Probab. Eng. Inf. Sci., 9, 615-621.
[2] Azadkia, M. and Chatterjee, S. (2019). A simple measure of conditional dependence. Preprint. Available at arXiv:1910.12327.
[3] Bickel, P.J., Klaassen, C.A.J., Ritov, Y. and Wellner, J. (1993). Efficient and adaptive estimation for semiparametric models. Johns Hopkins Univ. Press.
[4] Chatterjee, S. (2019). A new coefficient of correlation. To appear in J. Amer. Stat. Assoc.
[5] Dette, H., Siburg, K.F. and Stoimenov, P. (2013). A copula-based non-parametric measure of regression dependence. Scand. J. Stat., 40, no. 1, 21-41.
[6] Federer, H. (1969). Geometric measure theory. Springer-Verlag.
[7] Gaunt, R.E., Pickett, A.M. and Reinert, G. (2017). Chi-square approximation by Stein's method with application to Pearson's statistic. Ann. Appl. Probab., 27, no. 2, 720-756.
[8] Guntuboyina, A. and Sen, B. (2018). Nonparametric shape-restricted regression. Statistical Science, 33, 568-594.
[9] Ingram, J.M. and Marsh, M.M. (1991). Projections onto convex cones in Hilbert space. J. Approx. Theory, 64, no. 3, 343-350.
[10] Le Cam, L.M. and Yang, G.L. (1990). Asymptotics in statistics. Springer, New York.
[11] Lengyel, T. (1997). Cycles of a random permutation by simple enumeration. Sankhya A, 59, no. 1, 133-137.
[12] Orey, S. (1958). A central limit theorem for m-dependent random variables. Duke Math. J., 25, no. 4, 543-546.
[13] Rényi, A. (1959). On measures of dependence. Acta Math. Acad. Sci. Hung., 10, no. 3-4, 441-451.
[14] Robertson, T., Wright, F.T. and Dykstra, R.L. (1988). Order restricted statistical inference. John Wiley & Sons.
[15] Ross, N. (2011). Fundamentals of Stein's method. Probab. Surveys, 8, 210-293.
[16] Scarsini, M. (1984). On measures of concordance. Stochastica, 8, no. 3, 201-218.
[17] Shi, H., Drton, M. and Han, F. (2020). On the power of Chatterjee's rank correlation. Preprint. Available at arXiv:2008.11619.
[18] Soloff, J.A., Guntuboyina, A. and Pitman, J. (2019). Distribution-free properties of isotonic regression. Electron. J. Statist., 13, no. 2, 3243-3253.
[19] Steele, J.M. (2002). The Bohnenblust-Spitzer algorithm and its applications. J. Comput. Appl. Math., 142, 235-249.
[20] van der Vaart, A.W. (1998). Asymptotic statistics. Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge: Cambridge University Press. doi:10.1017/CBO9780511802256.
[21] van der Vaart, A.W. and Wellner, J. (1996). Weak convergence and empirical processes: with applications to statistics. Springer, New York.
Department of Statistics, Stanford University, Sequoia Hall, 390 Jane Stanford Way, Stanford, CA 94305. Email: [email protected]