Spatial Correlation Robust Inference∗

Ulrich K. Müller and Mark W. Watson
Department of Economics, Princeton University
Princeton, NJ 08544

First Draft: December 2020
This Draft: February 2021
Abstract
We propose a method for constructing confidence intervals that account for many forms of spatial correlation. The interval has the familiar 'estimator plus and minus a standard error times a critical value' form, but we propose new methods for constructing the standard error and the critical value. The standard error is constructed using population principal components from a given 'worst-case' spatial covariance model. The critical value is chosen to ensure coverage in a benchmark parametric model for the spatial correlations. The method is shown to control coverage in large samples whenever the spatial correlation is weak, i.e., with average pairwise correlations that vanish as the sample size gets large. We also provide results on correct coverage in a restricted but nonparametric class of strong spatial correlations, as well as on the efficiency of the method. In a design calibrated to match economic activity in U.S. states the method outperforms previous suggestions for spatially robust inference about the population mean.

Key Words: Confidence interval, HAR, HAC, Random field
JEL: C12, C20

∗ Müller acknowledges financial support from National Science Foundation grant SES-191336.

1 Introduction
Prompted by advances in both data availability and theory in economic geography, international trade, urban economics, development and other fields, empirical work using spatial data has become commonplace in economics. These applications highlight the importance of econometric methods that appropriately account for spatial correlation in real-world settings. While important advances have been made, researchers arguably lack practical methods that allow for reliable inference about parameters estimated from spatial data for the wide range of spatial designs and correlation patterns encountered in applied work. This paper takes a step forward in this regard.

Specifically, we consider the problem of constructing a confidence interval (or test of a hypothesized value) for the mean of a spatially-sampled random variable. We propose a confidence interval constructed in the usual way, i.e., as the sample mean plus and minus an estimate of its standard error multiplied by a critical value. The novelty is that the standard error and critical value are constructed so the resulting confidence interval has the desired large-sample coverage probability (say, 95%) for a relatively wide range of correlation patterns and spatial designs. The analysis is described for the mean, but the required modifications for regression coefficients or parameters in GMM settings follow from standard arguments.

To be more precise, suppose that a random variable y is associated with a location s ∈ S, where S ⊂ R^d. Figure 1 shows three one-dimensional (d = 1) spatial designs. Panel (a) shows the familiar case of regularly spaced locations, corresponding to the standard time series setting; panels (b) and (c) show randomly selected locations drawn from a density g, where g is uniform in panel (b) and triangular in panel (c). Figure 2 shows two geographic examples, so d = 2, for the U.S. state of Texas.
In panel (a), the locations are randomly selected from a uniform distribution, while in panel (b) they are more likely to be sampled from areas with high economic activity, here measured by light intensity as seen from space. In much of our analysis, we will assume that locations are i.i.d. draws from a distribution with density g, and so will encompass the irregularly spaced time series and Texas examples.

[Footnote: Ibragimov and Müller (2010), Sun and Kim (2012) and Bester, Conley, Hansen, and Vogelsang (2016), for instance, find nontrivial size distortions of modern methods even in arguably fairly benign designs, and Kelly (2019) reports very large distortions under spatial correlations calibrated to real-world data.]
[Footnote: The light data are from Henderson, Squires, Storeygard, and Weil (2018).]

Adding some notation, suppose

y_l = µ + u_l for l = 1, ..., n    (1)

where y_l is associated with the spatial location s_l, µ is the mean of y_l, and u_l is an unobserved error, assumed to be covariance stationary with mean zero and covariance function E[u(r)u(s)] = σ_u(r − s). Let ȳ denote the sample mean, and consider the usual t-statistic

τ = √n(ȳ − µ₀)/σ̂

where σ̂² is an estimator for the variance of √n(ȳ − µ). Tests of the null hypothesis H₀ : µ = µ₀ reject when |τ| > cv, where cv is the critical value, and the corresponding confidence interval for µ has endpoints ȳ ± cv σ̂/√n. Inference methods in this class differ in their choice of σ̂ and critical value cv.

The case of regularly-spaced time series observations (panel (a) of Figure 1) is the most well-studied version of this problem. Here Var(√n(ȳ − µ)) is the long-run variance of y. Classic choices for σ̂² are kernel-based consistent estimators such as those proposed in Newey and West (1987) and Andrews (1991), and associated standard normal critical values.
A more recent literature initiated by Kiefer, Vogelsang, and Bunzel (2000) and Kiefer and Vogelsang (2005) accounts for the sampling uncertainty of kernel-based σ̂² by considering "fixed-b" asymptotics where the bandwidth is a fixed fraction of the sample size, which leads to a corresponding upward adjustment of the critical value. Closely related are projection estimators of σ̂² where the number of projections is treated as fixed in the asymptotics, as in Müller (2004, 2007), Phillips (2005), Sun (2013), and others, leading to Student-t critical values. These newer methods are found to markedly improve size control under moderate serial correlation compared to inference based on standard normal critical values.

In the general spatial case, the variance of ȳ depends on the correlation between all of the observations, and this in turn depends on two distinct features of the problem. The first is the correlation between observations at arbitrary locations (say r and s); this is given by the covariance function σ_u(r − s). The second feature is which locations in S are likely to be sampled; this is given by the spatial density g. Only the first of these features is important in the regularly-spaced time series example because the locations do not vary from one application to the next.

Most existing suggestions for spatial inference are derived under the assumption that the locations are (asymptotically) uniformly distributed, corresponding to a constant density g: This includes the consistent kernel-based estimator in Conley (1999), the spatial analogue of the fixed-b kernel approach analyzed in Bester, Conley, Hansen, and Vogelsang (2016), as well as the spatial projection-based estimator put forward in Sun and Kim (2012).
Exceptions include Kelejian and Prucha (2007), who derive a consistent kernel estimator for σ̂² under assumptions that can accommodate arbitrary locations s_l, and the cluster approach suggested by Ibragimov and Müller (2010, 2015) and Bester, Conley, and Hansen (2011) (also see Cao, Hansen, Kozbur, and Villacorta (2020)).

This paper makes progress over this literature by developing a method that (i) accounts for sampling uncertainty in σ̂² in a spatial context while allowing for nonuniform spatial densities g; (ii) is valid under generic weakly correlated u_l; (iii) also controls size under a restricted but nonparametric form of strongly correlated u_l. The last property sets it apart from all previously mentioned methods; in a time series setting, Robinson (2005) and Müller (2014) derive inference under parametric forms of strong dependence, and Dou (2019) derives optimal inference under a non-parametric form of strong dependence under a simplifying Whittle-type approximation to the implied covariance matrices.

Our method works as follows: First, a benchmark parametric model is specified for the covariance function, say σ_u(·) = σ_u(·|c), where c is a persistence parameter with larger values indicating less dependence. For a given lower bound on the persistence parameter, say c̄, a hypothetical covariance matrix for (y₁, ..., y_n)′ is constructed using σ_u(·|c̄) evaluated at the actual sample locations (s₁, ..., s_n). The eigenvectors of the demeaned version of this covariance matrix are the (population) principal components of the residuals û_l = y_l − ȳ under σ_u(·|c̄), and the sample variance of q of these principal components is the estimator σ̂². The critical value is chosen to ensure coverage for all c ≥ c̄. The number of principal components q is chosen to minimize the expected length of the confidence interval in the model where u_l is i.i.d.
For shorthand, we refer to the method as spatial correlation principal components, abbreviated SCPC.

Intuitively, variance estimators σ̂² that are quadratic forms in û are sums of squares of weighted averages of û. Under spatial correlation, most weighted averages are less variable than ȳ, leading to a downward biased σ̂². SCPC selects the linear combinations of û that are most variable, so that the bias is as small as possible in the benchmark model with parameter c̄.

The remainder of the paper studies this method. Section 2 provides the specifics for SCPC. These specifics raise a variety of issues that are the focus of the remaining sections of the paper. In particular, Section 3 lays out the analytic framework used to study the large-sample and finite-sample Gaussian properties of spatial t-statistics. We use the framework to analyze SCPC, but several of the results in Section 3 encompass other methods, notably "fixed-b" kernel-based methods, and general projection estimators with a fixed number of basis functions. We find that in contrast to the regularly spaced time series case, such t-statistics with analogously adjusted critical values are not generically valid under weak correlation as soon as the spatial density function is not uniform. We develop an alternative approach to the construction of critical values that restores validity, and this is used for SCPC inference. Section 4 thus shows that SCPC has the desired large-sample coverage probability under generic weak correlation. Moreover, Section 4 provides a set of (easily verifiable) sufficient conditions that guarantee coverage under arbitrary mixtures of a set of strong correlation patterns in a finite-sample Gaussian setting. Section 4 also investigates the finite-sample coverage probability of SCPC confidence sets when there is heteroskedasticity across locations or measurement errors in locations, two problems faced in some applications.
Section 5 addresses the question of efficiency of SCPC by computing a lower bound on the expected length of confidence intervals for any inference method that controls coverage in a particular class of spatial correlations. Comparing the expected length of SCPC to this lower bound provides a measure of the efficiency of the method. Section 6 compares the properties of SCPC to other methods that have been proposed in the literature, and the results suggest that SCPC dominates these methods over the range of covariance functions and spatial designs considered. Section 7 discusses extensions and implementation issues. First, it discusses how the results developed in the body of the paper for inference about the population mean can be applied to inference problems about regression coefficients or parameters in GMM models. It then discusses two important computational issues involved in computing the critical value and computing the required eigenvectors for the construction of SCPC in very large-n applications. Finally, Section 7 provides a sketch of the generalization of the SCPC method to multivariate (F-test) settings. Proofs are collected in the appendix.

2 The SCPC Method

This section provides details for computing the SCPC t-statistic, critical value and associated confidence interval. The construction of SCPC raises a variety of questions about its properties, many of which are posed here and discussed in detail in the remaining sections of the paper.

The construction of the SCPC t-test and confidence interval involves, among other things, various covariance matrices and probability calculations. We stress at the outset that these are used to describe the required calculations, and they are not assumptions about the probability distribution of the data under study.
Those assumptions will be listed in Section 3 and, it will turn out, are significantly more general than what would follow from the description in this section.

Let y = (y₁, y₂, ..., y_n)′ and similarly for s = (s₁, s₂, ..., s_n)′, u = (u₁, u₂, ..., u_n)′ and the vector of residuals û = (û₁, û₂, ..., û_n)′. Let l denote an n × 1 vector of ones and M = I − l(l′l)^{-1}l′.

Consider a benchmark model for u_l with a parametric covariance function Cov(u(r), u(s)) = σ_u(r − s|c), where smaller values of the scalar parameter c indicate stronger correlations. In the following, we focus on the simple Gaussian exponential ('AR(1)') model where σ_u(r − s|c) = exp(−c||r − s||) for c > 0. Let Σ(c) denote the n × n covariance matrix with Σ(c)_ij = exp(−c||s_i − s_j||), so that Σ(c) is the covariance matrix of u(s) evaluated at the sample locations s. Let c̄ denote a pre-determined value of c that is meant to capture an upper bound on the spatial persistence in the data. (The choice of c̄ is discussed below.) Let r₁, r₂, ..., r_n denote the eigenvectors of MΣ(c̄)M corresponding to the eigenvalues ordered from largest to smallest, and normalized so that n^{-1}r_j′r_j = 1 for all j. The scalar variable n^{-1/2}r_j′û has the interpretation as the jth population principal component of û|s ∼ N(0, MΣ(c̄)M). The SCPC estimator of σ² based on the first q of these principal components is

σ̂²(q) = q^{-1} Σ_{j=1}^q (n^{-1/2} r_j′û)²,    (2)

and the corresponding SCPC t-statistic is

τ_SCPC(q) = √n(ȳ − µ₀)/σ̂_SCPC(q).    (3)

The critical value cv_SCPC(q) of the level-α SCPC test is chosen so that size is equal to α under the Gaussian benchmark model with c ≥ c̄. That is, cv_SCPC(q) satisfies

sup_{c ≥ c̄} P_{Σ(c)}(|τ_SCPC(q)| > cv_SCPC(q) | s) = α,    (4)

where P_{Σ(c)} means that the probability is computed in the benchmark model y|s ∼ N(µ₀l, Σ(c)).

The final ingredient in the method is the choice of q. Let E¹[2σ̂_SCPC(q)cv_SCPC(q)/√n | s] denote the expected length of the confidence interval constructed using τ_SCPC(q) under the Gaussian i.i.d. model y|s ∼ N(lµ, I). (The superscript "1" on E¹ differentiates this from the benchmark model with covariance matrix Σ(c).) SCPC chooses q_SCPC to make this length as small as possible, that is, q_SCPC solves

min_{q≥1} E¹[2σ̂_SCPC(q)cv_SCPC(q)/√n | s] = min_{q≥1} 2√2 n^{-1/2} q^{-1/2} cv_SCPC(q) Γ((q + 1)/2)/Γ(q/2)    (5)

with the equality exploiting that q σ̂²(q)|s ∼ χ²_q in the Gaussian i.i.d. model.

Remark 2.1.
The primary concern in the construction of σ̂² is downward bias. Recall that the eigenvector r₁ maximizes h′MΣ(c̄)Mh among all vectors h of the same length, the second eigenvector r₂ maximizes h′MΣ(c̄)Mh subject to h′r₁ = 0, and so forth, and for any q ≥ 1, the n × q matrix (r₁, ..., r_q) maximizes tr H′MΣ(c̄)MH among all n × q matrices H with n^{-1}H′H = I_q. Thus, the SCPC method selects the linear combinations of û in the estimator of σ² that induce the smallest bias in the benchmark model with c = c̄, under the constraint of being unbiased in the i.i.d. model.

The choice of q trades off the downward bias in σ̂²(q) that occurs when q is large and its large variance when q is small. Both bias and variance lead to a large critical value, and (5) leads to a choice of q that optimally trades off these two effects to obtain the shortest possible expected confidence interval length in the i.i.d. model.

Remark 2.2. By construction, SCPC confidence intervals have correct coverage in Gaussian models with a spatial exponential covariance function ('AR(1)' models) with spatial persistence level less than or equal to the model with c = c̄. Lemma 1 in Section 3 provides a central limit result that rationalizes the normality assumption. Theorem 7 provides conditions on the choice of c̄ so that the SCPC t-test controls size in large samples not just in the exponential model, but under generic 'weak correlation', as defined in Section 3. Theorem 8 provides easily verifiable sufficient conditions for size control under mixtures of parametric small sample Gaussian models.

Remark 2.3.
SCPC requires that the researcher chooses a value for c̄, which represents the highest degree of spatial correlation allowed by the method. One way to calibrate c̄ is via the average pairwise correlation of the spatial observations

ρ̄ = (1/(n(n − 1))) Σ_{l=1}^n Σ_{ℓ≠l} Cor(y_l, y_ℓ | s_n),

that is, we set c̄ so that it implies a given value of ρ̄. For example, ρ̄ = 0.001 implies very weak correlation, ρ̄ = 0.02 stronger correlation, and ρ̄ = 0.10 very strong correlation. In our examples, we calibrate c̄ to these three values of ρ̄.

Remark 2.4.
The SCPC method with c̄ calibrated in this way is invariant to the scale of the locations, {s_l}_{l=1}^n ↦ {as_l}_{l=1}^n for a > 0, and (in contrast to Sun and Kim's (2012) suggestion) also to arbitrary distance preserving transformations, such as rotations.
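As a concrete illustration of (2) and (3), the SCPC variance estimator and t-statistic can be sketched in a few lines. This is our own minimal sketch under the exponential benchmark model; the function and argument names (e.g. scpc_sigma2_and_tau) are hypothetical, not part of the paper.

```python
import numpy as np

def scpc_sigma2_and_tau(y, s, c_bar, q, mu0=0.0):
    """Minimal sketch of the SCPC variance estimator (2) and t-statistic (3).

    y : (n,) outcomes; s : (n, d) locations; c_bar : benchmark persistence
    lower bound; q : number of principal components. Names are hypothetical.
    """
    n = len(y)
    # Benchmark exponential ('AR(1)') covariance at the sample locations:
    # Sigma(c_bar)_ij = exp(-c_bar * ||s_i - s_j||)
    dist = np.linalg.norm(s[:, None, :] - s[None, :, :], axis=2)
    Sigma = np.exp(-c_bar * dist)
    # Demeaning matrix M = I - l (l'l)^{-1} l'
    M = np.eye(n) - np.full((n, n), 1.0 / n)
    # Eigenvectors of M Sigma M, eigenvalues from largest to smallest,
    # normalized so that n^{-1} r_j' r_j = 1
    eigval, eigvec = np.linalg.eigh(M @ Sigma @ M)
    R = eigvec[:, np.argsort(eigval)[::-1][:q]] * np.sqrt(n)
    u_hat = y - y.mean()
    # sigma_hat^2(q) = q^{-1} sum_j (n^{-1/2} r_j' u_hat)^2, as in (2)
    sigma2 = np.mean((R.T @ u_hat / np.sqrt(n)) ** 2)
    # t-statistic (3) for the hypothesized mean mu0
    tau = np.sqrt(n) * (y.mean() - mu0) / np.sqrt(sigma2)
    return sigma2, tau
```

Consistent with Remark 2.4, rescaling the locations by a > 0 while replacing c̄ by c̄/a leaves σ̂²(q), and hence τ_SCPC(q), unchanged, since the benchmark covariance matrix and its eigenvectors are unaffected.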
Remark 2.5.
The weights r_j used to construct the principal components and σ̂²(q) depend on s, the sample values of the spatial locations. Because the spatial locations are randomly drawn, the r_j weights are random. But as shown in Section 3, the weights have well-defined limits in terms of appropriately defined nonrandom eigenfunctions. Figure 3 plots selected eigenfunctions for two one-dimensional spatial designs and Figure 4 shows the associated plots for the Texas example, where in both cases ρ̄ = 0.02. With uniform spatial densities (panel (a) in both figures), the eigenfunctions are much like the weighting functions used for low-frequency projection methods for regularly spaced time series (e.g., Müller (2004), Phillips (2005), Sun (2013)) or its spatial analogue (e.g., Sun and Kim (2012)). In contrast, the non-uniform densities (panel (b) in the figures) produce weights that are distorted versions of their uniform counterparts, with most of the variation concentrated in high-density areas.

Figure 3: Eigenfunctions for Two One-Dimensional Spatial Designs

The figures also show the associated normalized eigenvalues, that is the variance of the principal components under the assumed exponential model, relative to the variance of √n(ȳ − µ). When the density is uniform, these relative variances are slightly below 1, and decline monotonically with j. This leads to the familiar downward bias of σ̂² in projection methods. When the spatial density is not uniform, the relative variance of the principal components can be larger than unity, mitigating this downward bias.

Remark 2.6.
In the regularly spaced time series case, the eigenvectors of SCPC for ρ̄ ∈ {0.02, 0.10} are numerically close to the type-II cosine transforms considered in Müller (2004, 2007), Lazarus, Lewis, Stock, and Watson (2018) and Dou (2019). What is more, the SCPC choice of q is also numerically close to the corresponding optimal choice of q in Dou (2019). So when applied to time series, SCPC comes close to replicating Dou's (2019) suggestion for optimal inference, with c̄ representing the upper bound for the degree of persistence. The same is true in a spatial design with locations that happen to fall on a line with approximately uniform empirical distribution.

Figure 4: Eigenfunctions for Two Geographic Spatial Designs

U.S. states spatial correlation designs. Before making two additional remarks about the SCPC method, we introduce a set of spatial correlation designs that will be used throughout the paper. The idea is to consider a set of real world designs to learn about the usefulness of the SCPC and other methods in practice. In particular, we randomly draw n = 500 locations within the boundaries of the 48 contiguous states of the U.S. (we also considered n = 1000 draws, and found nearly identical results in all exercises). The density of locations g within each state is either uniform (g_uniform), or it is proportional to light measured from space (g_light) as a proxy for economic activity. We draw five sets of 500 independent locations under each density g ∈ {g_uniform, g_light} and consider ρ̄ ∈ {0.02, 0.10} for each state, for a total of 240 (= 48 states × 5) sets of locations {s_l}_{l=1}^{500} and associated covariances under each of the four (g, ρ̄) pairs.

Figure 5: CDFs of Expected Length of SCPC Confidence Interval Relative to Known Variance Interval

Remark 2.7.
The critical value of the SCPC t-statistic reflects randomness in both ȳ and σ̂. This is analogous to inference in small-sample Gaussian models using critical values from the Student-t distribution. Figure 5 shows the effect of the uncertainty in σ on the expected length of 95% confidence intervals in the U.S. states spatial correlation designs, by comparing the expected length of the SCPC confidence interval in the i.i.d. model to the length with σ known: this relative length is E¹[(cv_SCPC/1.96)(σ̂_SCPC/σ) | s], where 1.96 is the standard normal critical value. The figure plots the CDF of these relative lengths over the 240 draws under each (g, ρ̄) pair. For example, the left-most CDF (dashed blue, for g = g_light and ρ̄ = 0.02) shows that the relative expected length ranges from roughly 1.08 to 1.18 across the 240 draws. The figure indicates that the expected lengths are higher under g_uniform than under the g_light design and are higher under ρ̄ = 0.10 than ρ̄ = 0.02. For comparison the figure also shows the relative expected lengths of Student-t confidence intervals with 8 and 3 degrees of freedom, in multiples of the length of the known variance z-interval. Evidently, when ρ̄ = 0.02, the increase in expected length of the SCPC confidence interval relative to an oracle endowed with the value of σ is roughly like learning about the value of σ from 8 i.i.d. N(0, σ²) observations. When ρ̄ = 0.10, relative lengths increase to approximately what would obtain from Student-t inference with 3 degrees of freedom.

Remark 2.8.
Consider the related question about the efficiency of SCPC relative to other methods that do not assume that the value of σ is known. This question can be answered in two ways. The first is to compare SCPC to methods that have previously been proposed. This is done in Section 6. A more ambitious approach compares SCPC to the most efficient method constructed for any particular spatial density that, like SCPC, produces confidence intervals with the desired coverage over a wide range of covariance functions. This is done in Section 5, which computes a lower bound on the expected length of confidence intervals for any such method.

3 Large-Sample Analysis

This section outlines a large-sample framework used to study SCPC and other spatial t-statistics. The first two subsections introduce notation and the asymptotic sampling framework. With these in hand, the remainder of the section summarizes the large-sample distribution of various statistics including the SCPC and kernel-based t-statistics. Proofs are provided in the appendix.
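Before turning to the formal framework, note that the calibration of c̄ from a target average pairwise correlation ρ̄ (Remark 2.3) reduces to one-dimensional root-finding, since under the exponential benchmark model ρ̄ is decreasing in c. A minimal sketch follows; the function names and bracket endpoints are our own illustrative choices, not the paper's.

```python
import numpy as np

def avg_pairwise_corr(s, c):
    """Average pairwise correlation rho_bar under the benchmark model,
    Cor(y_l, y_m | s) = exp(-c * ||s_l - s_m||)."""
    n = len(s)
    dist = np.linalg.norm(s[:, None, :] - s[None, :, :], axis=2)
    # sum over all pairs l != m, divided by n(n-1)
    return (np.exp(-c * dist).sum() - n) / (n * (n - 1))

def calibrate_c_bar(s, rho_target, c_lo=1e-6, c_hi=1e4, tol=1e-8):
    """Bisection for the c whose implied rho_bar matches rho_target;
    rho_bar is decreasing in c, so the bracket shrinks monotonically."""
    while c_hi - c_lo > tol * (1.0 + c_hi):
        c_mid = 0.5 * (c_lo + c_hi)
        if avg_pairwise_corr(s, c_mid) > rho_target:
            c_lo = c_mid  # correlation still too strong: increase c
        else:
            c_hi = c_mid
    return 0.5 * (c_lo + c_hi)
```

Smaller target correlations map into larger calibrated values of c̄, matching the paper's convention that larger c indicates less dependence.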
3.1 Notation

Some of this notation has been introduced earlier, but is repeated here for easy reference. The sample mean is denoted by ȳ_n, where here and elsewhere we append the subscript n for clarity in the asymptotic analysis. The residual is û_l = y_l − ȳ_n. Let y_n = (y₁, ..., y_n)′, and similarly for u_n, û_n and s_n. The vector l_n is an n × 1 vector of ones and M_n = I_n − l_n(l_n′l_n)^{-1}l_n′, so that û_n = M_n u_n.

Generically, we consider estimators σ̂²_n that are quadratic forms in û_n. Let Q_n be a positive semidefinite matrix with Q_n l_n = 0. We consider estimators of the form

σ̂²_n(Q_n) = n^{-1} û_n′ Q_n û_n = n^{-1} u_n′ Q_n u_n    (6)

where the final equality follows from Q_n l_n = 0.

Two leading examples of estimators in this class are kernel-based estimators and orthogonal-projection estimators. For kernel-based estimators, let k(r, s) denote a positive semi-definite kernel, k : S × S ↦ R. Let K_n denote an n × n matrix with (l, ℓ) element equal to k(s_l, s_ℓ) and let Q_n = M_n K_n M_n. Then σ̂²_n = n^{-1} Σ_l Σ_ℓ k(s_l, s_ℓ)û_l û_ℓ = n^{-1} û_n′ Q_n û_n. For orthogonal-projection estimators, let Ŵ_n be an n × q matrix with jth column given by ŵ_j satisfying n^{-1}Ŵ_n′Ŵ_n = q^{-1}I_q and Ŵ_n′l_n = 0 (the 'hat' notation is a reminder that Ŵ_n depends on the locations s_n, which are random). With Q_n = Ŵ_n Ŵ_n′, the orthogonal projection estimator is σ̂²_n = Σ_{j=1}^q (n^{-1/2}ŵ_j′û_n)² = n^{-1}û_n′Q_n û_n. The SCPC estimator is an orthogonal-projection estimator using the first q eigenvectors of M_n Σ(c̄)M_n, scaled by 1/√q, as the columns of Ŵ_n.

For quadratic form estimators σ̂²_n(Q_n), under the null hypothesis the squared t-statistic is a ratio of quadratic forms in u_n,

τ²_n(Q_n) = (√n(ȳ_n − µ))²/σ̂²_n(Q_n) = u_n′ l_n l_n′ u_n / (u_n′ Q_n u_n).    (7)

3.2 Sampling framework

The spatial locations s are chosen from S, a compact subset of R^d. Sample locations are selected as i.i.d. draws from a distribution G with density g, where g(s) is continuous and positive for all s ∈ S.

The average pairwise correlation of y, conditional on the sample locations, is ρ̄_n = (1/(n(n − 1))) Σ_{l=1}^n Σ_{ℓ≠l} Cor(y_l, y_ℓ | s_n). When ρ̄_n = 0, y_n|s_n is white noise. When ρ̄_n = O_p(1) (and not o_p(1)), we will say the process exhibits strong correlation. When ρ̄_n = O_p(1/c_n^d) where c_n is a sequence of constants with c_n → ∞, we follow Lahiri (2003) and say the process exhibits weak correlation.

The following asymptotic framework, adapted from Lahiri (2003), is useful for representing weak and strong correlation. Let B be a zero-mean stationary random field on R^d with continuous covariance function E[B(s)B(r)] = σ_B(s − r), and B and {s_l}_{l=1}^n are independent. To avoid pathological cases, we further assume ∫σ_B(s)ds > 0 and that B is nonsingular in the sense that inf_{||f||=1} ∫∫ f(r)f(s)σ_B(s − r) dG(r)dG(s) > 0, where ||f||² = ∫f(s)²dG(s).

Let c_n denote a sequence of constants with either c_n → ∞ or c_n = c >
0. We consider a triangular-array framework with u_l = B(c_n s_l) for s_l ∈ S, so that σ_u(s) = σ_B(c_n s). The sequence c_n determines the 'infill' and 'outfill' nature of the asymptotics. To see this, note that the volume of the relevant domain for the random field B is c_n^d vol(S), where vol(S) is the volume of S. The average number of sample points per unit of volume is then n/(c_n^d vol(S)). If c_n^d ∝ n, the volume of the domain is increasing, while the number of points per unit of volume is not; this is the usual outfill asymptotic sampling scheme. On the other hand, when c_n = c, a constant, the volume of the domain is fixed, and the number of points per unit of volume is proportional to n; this is the usual infill sampling. Finally, when c_n → ∞ with c_n^d = o(n) the sampling scheme features both infill and outfill asymptotics. A calculation shows that ρ̄_n = O_p(1/c_n^d), so the sequence c_n characterizes weak and strong correlation as described above. With this background, let a_n = c_n^d/n; we will assume that a_n → a ∈ [0, ∞).

Finally, we specify a set of weighting functions. To simplify the problem, we initially consider weights that are nonrandom. For j = 1, ..., q, let w_j : S ↦ R denote a set of continuous functions that satisfy ∫w_j(s)dG(s) = 0 and ∫w_j(s)²dG(s) > 0. We introduce the following notation involving these functions: w(s) is a q × 1 vector, w(s) = (w₁(s), ..., w_q(s))′; w̄(s) = (1, w(s)′)′; W_n is an n × q matrix with lth row given by w(s_l)′, and W̄_n is an n × (q + 1) matrix with lth row given by w̄(s_l)′ so that W̄_n = [l_n, W_n].

Remark 3.1.
In our framework, locations s_l are sampled within S for a fixed and given S. But nothing changes in our derivations if instead we treated the observations y_l as being indexed by c_n s_l ∈ c_n S, as in Lahiri (2003), or any other one-to-one transformation of s_l. The essential characteristic is the dependence pattern over the spatial domain of the observations, governed by c_n and B.

With this background, we now present the large-sample analysis.

3.3 Large-sample distribution of weighted averages

As is evident from equation (7), the squared t-statistic is a ratio of squares of weighted averages of the elements of u_n. This subsection discusses the large-sample distribution of such weighted averages. These results involve weak convergence (i.e., convergence in distribution), where our interest lies in these limits conditional on the locations s_n. With this in mind, for X_n and X p-dimensional random vectors, we use the notation X_n|s_n ⇒_p X to denote E[h(X_n)|s_n] →_p E[h(X)] for any bounded continuous function h : R^p ↦ R. This notion of weak convergence in probability is slightly weaker than almost sure weak convergence of conditional distributions, but still ensures that the limiting distribution is not induced by the randomness in the locations s_n.

Lemma 1 characterizes the large-sample behavior of sums of the form Σ_{l=1}^n w̄(s_l)u(s_l). For the weak correlation result, we invoke the mixing and moment assumptions of Lahiri (2003) on B that underlie his Theorem 3.2.

Lemma 1. (i) (strong correlation) Suppose c_n = c > 0 and B is a Gaussian process. Then

n^{-1} W̄_n′ u_n | s_n ⇒_p X ∼ N(0, Ω_sc)

with

Ω_sc = ∫∫ w̄(r)w̄(s)′ σ_B(c(r − s)) dG(r)dG(s).

(ii) (weak correlation) Suppose c_n → ∞, and the assumptions of Lahiri's (2003) Theorem 3.2 hold.
Then

a_n^{1/2} n^{-1/2} W̄_n′ u_n | s_n ⇒_p X ∼ N(0, Ω_wc)

with

Ω_wc = aσ_B(0)V₁ + (∫σ_B(s)ds) V₂

where

V₁ = ∫ w̄(s)w̄(s)′ g(s) ds and V₂ = ∫ w̄(s)w̄(s)′ g(s)² ds.

Remark 3.2.
Note that the variance of Σ_{l=1}^n w̄(s_l)u(s_l) conditional on s_n is

Var[Σ_{l=1}^n w̄(s_l)u(s_l) | s_n] = Σ_l Σ_ℓ w̄(s_l)w̄(s_ℓ)′ σ_u(s_l − s_ℓ) = Σ_l Σ_ℓ w̄(s_l)w̄(s_ℓ)′ σ_B(c_n(s_l − s_ℓ)).    (8)

The strong-correlation covariance matrix, Ω_sc, is recognized as the large-n analogue of this expression after appropriate normalization and averaging over the locations. The weak-correlation covariance matrix, Ω_wc, differs from Ω_sc in two ways. First, because c_n → ∞ in the weak-correlation case, and σ_B(r) vanishes for large |r|, the second term in Ω_wc is recognized as the limit of Ω_sc as the double integral concentrates entirely on 'the diagonal' where r ≈ s. Second, as outfill becomes more important (that is, a_n = c_n^d/n gets larger), variances become more important relative to covariances; this explains the first term in Ω_wc.

Remark 3.3.
The form of V₂ is further recognized as the limit covariance matrix in a model where the observations are independent, with variance proportional to g(s_l). Thus, V₂ is what one would obtain for the limit covariance matrix under a specific form of non-stationarity. Intuitively, a high density area does not only yield many observations, but under spatial correlation, the variance contribution is further amplified by the resulting high average correlation.

Remark 3.4.
In the strong-correlation case, normality is assumed. That said, CLTs have been established also for strongly correlated models when d = 1 (i.e., the time series case), such as Taqqu (1975), Phillips (1987) or Chan and Wei (1987), and to a lesser extent also for d > 1, as in Wang (2014) or Lahiri and Robinson (2016). For the weak correlation case, large-sample normality follows from Theorem 3.2 in Lahiri (2003).
Remark 3.5.
When $g(s)$ is constant, so the spatial distribution is uniform, $V_1 \propto V_2$ and $\Omega_{wc} \propto \int w(s)w(s)'\,ds$. Thus, in a leading case with orthogonal $w_j$ of length $1/\sqrt{q}$, $\int w_j(s)w_i(s)\,dG(s) = q^{-1}\mathbf{1}[i=j]$, we obtain $\Omega_{wc} \propto \mathrm{diag}(1, q^{-1}I_q)$, a familiar result from the literature on HAR inference in time series with regularly spaced observations. Importantly, while this result holds under constant $g(s)$, it does not hold for other spatial distributions, so that the typical HAR results about inconsistent variance estimators for regularly spaced time series under weak dependence do not carry over to the spatial case.

This section presents a useful representation for the limiting distribution of $\tau_n(W_nW_n')$ under the assumptions of Lemma 1.

Theorem 2.
For $\mathrm{cv} > 0$, let $D(\mathrm{cv}) = \mathrm{diag}(1, -\mathrm{cv}^2 I_q)$, $A = D(\mathrm{cv})\Omega$ with $\Omega \in \{\Omega_{sc}, \Omega_{wc}\}$, and let $(\omega_0, \omega_1, \ldots, \omega_q)$ denote the eigenvalues of $A$ ordered from largest to smallest. Then under the assumptions of Lemma 1, under the null hypothesis and with $(Z_0, Z_1, \ldots, Z_q)' \sim \mathcal{N}(0, I_{q+1})$,
(i) $\omega_0 > 0$, and $\omega_i \leq 0$ for $i \geq 1$;
(ii) $P(\tau_n(W_nW_n') > \mathrm{cv}\,|\,s_n) \xrightarrow{p} P\big(Z_0^2 > \sum_{i=1}^q(-\tfrac{\omega_i}{\omega_0})Z_i^2\big)$.

Remark 3.6.
In the weak-correlation case with constant spatial density $g(s)$ and orthogonal $w_j$ of length $1/\sqrt{q}$, $\Omega = \Omega_{wc} \propto \mathrm{diag}(1, q^{-1}I_q)$. Thus $-\omega_i/\omega_0 = \mathrm{cv}^2/q$, and the asymptotic rejection probability becomes the corresponding tail probability of the $F_{1,q}$ distribution, a result familiar from the limiting distribution of projection based squared t-statistics in the regularly spaced time series case.

Remark 3.7.
In the general weak correlation case with arbitrary spatial density $g$, $\Omega_{wc} = a\sigma_B(0)V_1 + (\int\sigma_B(s)\,ds)V_2$. Because $\tau_n$ is a scale-invariant function of $u_n$, it is without loss of generality to normalize the scale of $\sigma_B(\cdot)$ so that $a\sigma_B(0) + \int\sigma_B(s)\,ds = 1$. Under this normalization
$$\Omega_{wc} = \kappa V_1 + (1-\kappa)V_2 \quad (9)$$
where $\kappa = a\sigma_B(0)$ is a scalar with $0 \leq \kappa < 1$. Thus, the limiting CDF of $\tau_n$ is seen to depend on $\sigma_B$ only through the scalar $\kappa$; the matrices $V_1$ and $V_2$ are functions of the weights $w$ and the spatial density $g$. The scalar $\kappa$ thus completely summarizes the large sample effect of alternative underlying random fields $B$ and weak correlation sequences $c_n \to \infty$.

For SCPC and other estimators, the weights in $w(s)$ are estimated using the sample locations $s_n$. The conditions under which Lemma 1 continues to hold for such estimated weights are given in the following theorem.

Theorem 3.
Suppose the mapping $\hat{w}: \mathcal{S} \mapsto \mathbb{R}^q$ is a function of $s_n$ (but not of $B$), and
$$\sup_{s\in\mathcal{S}}\|\hat{w}(s) - w(s)\| \xrightarrow{p} 0. \quad (10)$$
Then Lemma 1 and Theorem 2 continue to hold with $\hat{W}_n$ in place of $W_n$, where the $l$th row of $\hat{W}_n$ is equal to $(1, \hat{w}(s_l)')$.

Remark 3.8.
The theorem also accommodates location dependent convergent critical values $\mathrm{cv}_n \xrightarrow{p} \mathrm{cv}$ by setting $\hat{w}(s) = (\mathrm{cv}_n/\mathrm{cv})w(s)$.

This subsection discusses how these results can be generalized so they apply to kernel-based variance estimators, $\hat{\sigma}^2_n(M_nK_nM_n)$, and associated t-statistics $\tau_n(M_nK_nM_n)$, where the $n\times n$ matrix $K_n$ has $(l,\ell)$ element equal to $k(s_l, s_\ell)$ for a positive semidefinite continuous kernel $k: \mathcal{S}\times\mathcal{S} \mapsto \mathbb{R}$. Since in our framework $s_l \in \mathcal{S}$ for a fixed sampling region $\mathcal{S}$, and $k$ does not depend on $n$, these kernel estimators are spatial analogues of the fixed-$b$ time series long-run variance estimators considered by Kiefer and Vogelsang (2005), as also investigated by Bester, Conley, Hansen, and Vogelsang (2016).

Let $\hat{K}_n = M_nK_nM_n$, and note that the $(l,\ell)$ element of $\hat{K}_n$ is $\hat{k}_n(s_l, s_\ell)$ with
$$\hat{k}_n(r,s) = k(r,s) - n^{-1}\sum_{l=1}^n k(s_l,s) - n^{-1}\sum_{l=1}^n k(r,s_l) + n^{-2}\sum_{l=1}^n\sum_{\ell=1}^n k(s_l,s_\ell). \quad (11)$$
To begin, consider a simpler problem using a kernel that replaces the sample means in (11) with population means
$$\bar{k}(r,s) = k(r,s) - \int k(u,s)\,dG(u) - \int k(r,u)\,dG(u) + \int\int k(u,t)\,dG(u)\,dG(t). \quad (12)$$
By Mercer's Theorem, $\bar{k}(r,s)$ has the representation
$$\bar{k}(s,r) = \sum_{i=1}^\infty \lambda_i\varphi_i(s)\varphi_i(r) \quad (13)$$
where $\{\lambda_i, \varphi_i\}$ are the eigenvalues and eigenfunctions of $\bar{k}$, with eigenvalues ordered from largest to smallest, $\int\varphi_i(s)\,dG(s) = 0$ and $\int\varphi_i(s)\varphi_j(s)\,dG(s) = \mathbf{1}[i=j]$.

Consider the problem with a truncated version of $\bar{k}$,
$$\bar{k}_q(s,r) = \sum_{i=1}^q \lambda_i\varphi_i(s)\varphi_i(r).$$
We can directly apply Theorem 2 using $w_j(s) = \lambda_j^{1/2}\varphi_j(s)$. Specifically, let $\bar{K}_{n,q}$ be an $n\times n$ matrix with $(l,\ell)$ element equal to $\bar{k}_q(s_l, s_\ell)$. Then $u_n'\bar{K}_{n,q}u_n = u_n'W_nW_n'u_n$, so that $\tau_n(\bar{K}_{n,q}) = \tau_n(W_nW_n')$, and $P(\tau_n(\bar{K}_{n,q}) > \mathrm{cv}\,|\,s_n) \xrightarrow{p} P\big(Z_0^2 > \sum_{i=1}^q(-\tfrac{\omega_i}{\omega_0})Z_i^2\big)$ by Theorem 2.

To extend this result to the original problem, it is useful to reformulate it in terms of eigenvalues of linear operators. Specifically, denote by $L^2_G$ the Hilbert space of functions $\mathcal{S}\mapsto\mathbb{R}$ with inner product $\langle f_1, f_2\rangle = \int f_1(s)f_2(s)\,dG(s)$. Normalize $\Omega_{wc} = \kappa V_1 + (1-\kappa)V_2$, as in (9). A tedious but straightforward calculation (see (27) in the appendix) shows that the eigenvalues $\omega_i$ of $A = D(\mathrm{cv})\Omega$ with $\Omega \in \{\Omega_{sc}, \Omega_{wc}\}$ are also the eigenvalues of finite rank self-adjoint linear operators $L^2_G \mapsto L^2_G$, namely $R_{sc}T_qR_{sc}$ and $R_{wc}T_qR_{wc}$ in the strong and weak correlation case, respectively, where
$$R_{sc}(f)(s) = \int\sigma_B(c(s-r))f(r)\,dG(r)$$
$$R_{wc}(f)(s) = \sqrt{\kappa + (1-\kappa)g(s)}\,f(s)$$
$$T_q(f)(s) = \int\big(1 - \mathrm{cv}^2\,\bar{k}_q(s,r)\big)f(r)\,dG(r).$$
This suggests that the limiting rejection probability for the original non-truncated $\bar{k}$ might be characterized by the (potentially infinite) number of eigenvalues of the operators $RTR: L^2_G\mapsto L^2_G$ with $R\in\{R_{wc}, R_{sc}\}$, where
$$T(f)(s) = \int\big(1 - \mathrm{cv}^2\,\bar{k}(s,r)\big)f(r)\,dG(r).$$
The following theorem shows this to be the case, and it also includes the generalization to sample demeaned kernels (11) instead of (12).
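The double demeaning in (11) is just the matrix identity $\hat{K}_n = M_nK_nM_n$ written elementwise. A minimal sketch, using an illustrative Gaussian kernel (the specific kernel is an assumption), verifies the two forms agree:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 50
s = rng.uniform(0.0, 1.0, n)

def k(r, t):
    # an illustrative positive semidefinite kernel on S x S
    return np.exp(-(r - t)**2)

K = k(s[:, None], s[None, :])
M = np.eye(n) - np.ones((n, n)) / n     # demeaning matrix M_n
K_hat = M @ K @ M

# equation (11) elementwise: k(r,t) minus the two one-way sample means
# of the kernel plus the grand sample mean
K_hat_direct = (K - K.mean(axis=0, keepdims=True)
                  - K.mean(axis=1, keepdims=True) + K.mean())
print(np.max(np.abs(K_hat - K_hat_direct)))
```

The maximal discrepancy is at machine-precision level, and every row and column of $\hat{K}_n$ sums to zero, reflecting $M_n l_n = 0$.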
Theorem 4.
Let $\omega_0$ denote the largest eigenvalue, and $\omega_i$, $i\geq 1$, the remaining eigenvalues of $RTR$ for $R\in\{R_{wc}, R_{sc}\}$. Then under the assumptions of Lemma 1, $\omega_0 > 0$ and $\omega_i \leq 0$ for $i\geq 1$, and $P(\tau_n(\hat{K}_n) > \mathrm{cv}\,|\,s_n) \xrightarrow{p} P\big(Z_0^2 > \sum_{i=1}^\infty(-\omega_i/\omega_0)Z_i^2\big)$.

Remark 3.9.
Under weak correlation the limit distribution of kernel-based spatial t-statistics depends on the spatial density $g$, since the eigenvalues of $R_{wc}TR_{wc}$ are a function of $g$. This is analogous to the results for projection estimators discussed above. Thus, in both cases, using a critical value that is appropriate for i.i.d. data (that is, setting $\kappa = 1$) does not, in general, lead to valid inference under weak correlation.

Remark 3.10.
The theorem is also applicable to projection estimators using basis functions $\phi_i$ that are orthogonalized using the sample locations (such as those suggested in Sun and Kim (2012)) by setting $k(r,s) = q^{-1}\sum_{i=1}^q\phi_i(r)\phi_i(s)$.

Remark 3.11.
The framework of Theorem 4 also sheds light on the asymptotic bias of kernel-based and orthogonal projection estimators under weak correlation. The estimand $\sigma^2$ is the limiting variance of $a_n^{1/2}n^{-1/2}\sum_{l=1}^n u_l$, which under the normalization (9) is equal to the (single) eigenvalue of the operator $R_{wc}T_\sigma R_{wc}$ with $T_\sigma(f)(s) = \int f(r)\,dG(r)$, that is, $\int(\kappa + (1-\kappa)g(s))\,dG(s)$. The expectation of $a_n\hat{\sigma}^2_n(\hat{K}_n)$ converges to the trace of the operator $R_{wc}T_{\bar{k}}R_{wc}$ with $T_{\bar{k}}(f)(s) = \int\bar{k}(s,r)f(r)\,dG(r)$, that is, $\int(\kappa + (1-\kappa)g(s))\bar{k}(s,s)\,dG(s)$. Thus, the estimator is asymptotically unbiased for all $g$ if and only if $\bar{k}(s,s) = 1$. For standard choices of $k$, $k(s,s) = 1$, so the only source of asymptotic bias is the demeaning (and if the estimator $\hat{\sigma}^2_n$ uses the null value $y_n - \mu l_n$ instead of the residuals $\hat{u}_n$, the asymptotic bias is zero under the null hypothesis). Moreover, if $k(r,s)$ concentrates around the 'diagonal' where $r\approx s$, corresponding to a fixed-$b$ kernel estimator with small $b$, the demeaning effect is small, as is the asymptotic variability of $a_n\hat{\sigma}^2_n(\hat{K}_n)$. Thus, fixed-$b$ kernel estimators with standard kernel choices and small $b$ yield nearly valid and efficient inference under weak correlation. In contrast, orthogonal projection estimators where $\bar{k}(r,s) = q^{-1}\sum_{i=1}^q\phi_i(r)\phi_i(s)$ do not share this approximate unbiasedness property, even for $q$ large, since $\int\phi_i(s)^2\,dG(s) = 1$ does not, in general, imply that $\bar{k}(s,s) = q^{-1}\sum_{i=1}^q\phi_i(s)^2 \approx 1$.

Lemma 5.
Let $(\hat{v}_i, \hat{\lambda}_i)$ with $\hat{v}_i = (\hat{v}_{i,1},\ldots,\hat{v}_{i,n})'$ be the eigenvector-eigenvalue pairs of $n^{-1}\hat{K}_n$ with $\hat{\lambda}_1 \geq \hat{\lambda}_2 \geq \ldots \geq \hat{\lambda}_n$ and $n^{-1}\hat{v}_i'\hat{v}_i = 1$. For all $i$ with $\hat{\lambda}_i > 0$, define the $\mathcal{S}\mapsto\mathbb{R}$ functions
$$\hat{\varphi}_i(\cdot) = n^{-1}\hat{\lambda}_i^{-1}\sum_{l=1}^n\hat{v}_{i,l}\,\hat{k}_n(\cdot, s_l). \quad (14)$$
Let $\lambda_{(j)}$, $j = 1,\ldots$ be the unique positive values of $\lambda_i$, ordered descendingly, and suppose $\lambda_{(j)}$ has multiplicity $m_j \geq 1$. Then for any $p$ such that $\lambda_{(p)} > 0$,
(a) there exist rotation matrices $\hat{O}_{(j)}$ of dimension $m_j\times m_j$, $j = 1,\ldots,p$, such that with $q = \sum_{j=1}^p m_j$, $\varphi = (\varphi_1,\ldots,\varphi_q)'$ and $\hat{\varphi} = (\hat{\varphi}_1,\ldots,\hat{\varphi}_q)'$,
$$\sup_{s\in\mathcal{S}}\|\varphi(s) - \mathrm{diag}(\hat{O}_{(1)},\ldots,\hat{O}_{(p)})\hat{\varphi}(s)\| = O_p(n^{-1/2});$$
(b) $\sum_{i=1}^q(\hat{\lambda}_i - \lambda_i)^2 = O_p(n^{-1})$.

Part (a) shows convergence of the eigenspace corresponding to unique eigenvalues, and part (b) shows convergence of the eigenvalues.
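Equation (14) is a Nyström-type extension: it turns the discrete eigenvectors of $n^{-1}\hat{K}_n$ into functions on all of $\mathcal{S}$. A minimal sketch, with an illustrative exponential kernel (an assumption, not the paper's benchmark calibration), shows that the extension reproduces the eigenvector entries at the sampled locations, which is the algebraic fact behind the construction:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 80
s = rng.uniform(0.0, 1.0, n)

def k(r, t):
    return np.exp(-np.abs(r - t))  # illustrative kernel; any continuous psd k works

K = k(s[:, None], s[None, :])
M = np.eye(n) - np.ones((n, n)) / n
K_hat = M @ K @ M                              # demeaned kernel matrix

lam, v = np.linalg.eigh(K_hat / n)             # eigenpairs of n^{-1} K_hat
order = np.argsort(lam)[::-1]
lam, v = lam[order], v[:, order] * np.sqrt(n)  # normalize n^{-1} v'v = 1

def phi_hat(i, t):
    # equation (14): extend the i-th eigenvector to a point t in S,
    # using the sample-demeaned kernel hat{k}_n(t, s_l)
    k_col = (k(t, s) - K.mean(axis=0) - k(t, s).mean() + K.mean())
    return (v[:, i] @ k_col) / (n * lam[i])

# evaluated at a sampled location, the extension reproduces the eigenvector entry
print(phi_hat(0, s[5]), v[5, 0])
```

Since $n^{-1}\hat{K}_n\hat{v}_i = \hat{\lambda}_i\hat{v}_i$, plugging $t = s_l$ into (14) returns $\hat{v}_{i,l}$ exactly, so the extension is consistent with the discrete eigenproblem.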
Beyond its use in the proof of Theorem 4, Lemma 5 can be used to establish the large sample distribution of the SCPC t-statistic for nonrandom $q$ and critical value cv. Note that in this application of Lemma 5, we are interested in the eigenfunctions of the covariance kernel $k(r,s) = \sigma_u(r-s|c_0)$ of the benchmark model, rather than the eigenfunctions of a kernel that defines a kernel-based variance estimator.

Recall from Section 2 that $r_i$ is the eigenvector of $M_n\Sigma_n(c_0)M_n$ corresponding to the $i$th largest eigenvalue, normalized to satisfy $n^{-1}r_i'r_i = 1$. Let $\varphi_i$ be the eigenfunction of the kernel $\bar{k}(r,s)$ corresponding to the $i$th largest eigenvalue $\lambda_i$, where $k(r,s) = \sigma_u(r-s|c_0)$ and $\bar{k}$ is the demeaned version of $k$ in analogy to (12). Lemma 5 and a slightly extended version of Theorem 3 (see Lemma 10 in the appendix) then yield the following corollary.

Corollary 6.
Suppose $\lambda_q > \lambda_{q+1}$. Then Theorem 2 holds for $\tau_{SCPC}(q) = \tau_n(q^{-1}\sum_{i=1}^q r_ir_i')$ with $w(s) = (\varphi_1(s),\ldots,\varphi_q(s))'/\sqrt{q}$.

This section presents two results on size control of spatial t-statistics, the first asymptotic and the second a finite-sample result, and applies these to SCPC.
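Before turning to size control, the eigenvalue representation of Theorem 2 can be made concrete numerically. The sketch below uses the uniform-density leading case of Remark 3.6, where $\Omega \propto \mathrm{diag}(1, q^{-1}I_q)$ (this diagonal $\Omega$ is an assumption for illustration); the Monte Carlo limiting rejection probability should then match the $F_{1,q}$ tail probability at $\mathrm{cv}^2$:

```python
import numpy as np

rng = np.random.default_rng(1)
q, cv = 4, 2.776  # cv close to the two-sided 5% critical value of student-t_4

# uniform-density leading case: Omega proportional to diag(1, q^{-1} I_q)
Omega = np.diag(np.r_[1.0, np.full(q, 1.0 / q)])
D = np.diag(np.r_[1.0, np.full(q, -cv**2)])     # D(cv) as in Theorem 2
A = D @ Omega

omega = np.sort(np.linalg.eigvals(A).real)[::-1]  # eigenvalues, largest first

# Theorem 2 (ii): evaluate P(Z_0^2 > sum_i (-omega_i/omega_0) Z_i^2)
Z = rng.standard_normal((500_000, q + 1))
rej = np.mean(Z[:, 0]**2 > (Z[:, 1:]**2 @ (-omega[1:] / omega[0])))
print(rej)  # close to 0.05 in this leading case
```

Here $-\omega_i/\omega_0 = \mathrm{cv}^2/q$, so the simulated probability reproduces the familiar $F_{1,q}$ result; for non-uniform $g$ the eigenvalues, and hence the rejection probability, change.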
As discussed above (see equation (9)), under weak correlation, the asymptotic rejection probability of $\tau_n$ for finite $q$ can be studied via $\Omega_{wc}(\kappa) = \kappa V_1 + (1-\kappa)V_2$, where the covariance function of $u$ and the sequence $c_n$ affect the large-sample distribution of $\tau_n$ only through the scalar $\kappa \in [0,1)$. Thus, if cv satisfies $\sup_{0\leq\kappa<1} P\big(\sum_{i=0}^q\omega_i(\kappa,\mathrm{cv})Z_i^2 > 0\big) = \alpha$, where $\{\omega_i(\kappa,\mathrm{cv})\}_{i=0}^q$ are the eigenvalues of $A(\kappa,\mathrm{cv}) = D(\mathrm{cv})\Omega_{wc}(\kappa)$, then setting $\mathrm{cv}_n \geq \mathrm{cv}$ for all $n$ yields inference that is asymptotically robust under all forms of weak correlation covered by Theorem 1 (ii). In the case of a kernel-based variance estimator, the same holds as long as cv satisfies $\sup_{0\leq\kappa<1}P\big(\sum_{i=0}^\infty\omega_i(\kappa,\mathrm{cv})Z_i^2 > 0\big) = \alpha$, where $\{\omega_i(\kappa,\mathrm{cv})\}_{i=0}^\infty$ are the eigenvalues of the linear operator
$$L(f)(s) = \int\sqrt{\kappa + (1-\kappa)g(s)}\,\big(1 - \mathrm{cv}^2\bar{k}(s,r)\big)\sqrt{\kappa + (1-\kappa)g(r)}\,f(r)\,dG(r).$$
The value cv depends on the spatial density $g$, which can be seen directly by inspecting the form of $\Omega_{wc}$ and the operator $L$. In principle, one could use these expressions to estimate cv directly. But this would involve estimates of the spatial density $g$, which leads to difficult bandwidth and other choices. We now discuss a simpler approach.

Consider a benchmark model $B$ that satisfies the assumptions of Theorem 1 (ii), such as the Gaussian exponential model introduced in Section 2. Let $\sigma_B$ denote the covariance kernel of $B$, and suppose $c_{n,0}$ is chosen so that $a_{n,0} = c_{n,0}^d/n \to a_0 = 0$. For instance, $c_{n,0} = c_0 > 0$, or $c_{n,0} = n^{1/d}/\log(n)$. Note that for this model $\kappa = 0$. Suppose $\mathrm{cv}_n = \mathrm{cv}_n(s_n)$ satisfies
$$\sup_{c\geq c_{n,0}} P_{\Sigma(c)}(\tau_n \geq \mathrm{cv}_n\,|\,s_n) \leq \alpha \quad (15)$$
where $P_{\Sigma(c)}$ is computed under the benchmark model, that is, under $u_n|s_n \sim \mathcal{N}(0, \Sigma(c))$ with $\Sigma(c)$ the covariance matrix of $(B(cs_1),\ldots,B(cs_n))'$.

Theorem 7.
Let $\mathrm{cv}_n$ satisfy (15). Under weak correlation in the sense of Lemma 1 (ii), for t-statistics covered by Theorems 2, 3, 4 and Corollary 6, $\max(\mathrm{cv} - \mathrm{cv}_n, 0) \xrightarrow{p} 0$. Consequently, for any $\epsilon > 0$, $\limsup_n P\big(P(\tau_n > \mathrm{cv}_n|s_n) > \alpha + \epsilon\big) = 0$, so that $\limsup_n P(\tau_n \geq \mathrm{cv}_n) \leq \alpha$.

The intuition for Theorem 7 is as follows. The critical value $\mathrm{cv}_n$ in (15) is valid in the benchmark model for all $c \geq c_{n,0}$ and $n$. Thus, it is also valid along arbitrary sequences $c_n \geq c_{n,0}$. Since the $c_{n,0}$ model has $\kappa = 0$, there exist sequences $c_n \geq c_{n,0}$ that induce any $\kappa \in [0,1)$ in the benchmark model; thus different sequences $c_n$ in the benchmark model recreate any possible limit distribution under generic weak correlation, so that size control in the benchmark model for all $c \geq c_{n,0}$ translates into size control under generic weak correlation.

For SCPC, the benchmark covariance kernel for $B$ is exponential, $\sigma_B(r,s) = \exp(-\|r-s\|)$, and (from equation (4)) the critical value is chosen to satisfy (15) with equality. Thus, with a fixed value of $c_0$, the SCPC t-test $\tau_{SCPC}(q)$ controls size in large samples under generic weak correlation. In addition and by construction, the SCPC critical value is chosen to satisfy the size constraint for all values of $c \geq c_0$ in the benchmark model. Thus, size is controlled by construction also in strong-correlation models with exponential covariance kernels for all $c \geq c_0$.

The asymptotic results of the last subsection are comforting, but in finite samples, the robustness of a spatial t-statistic with critical value chosen according to (15) still depends on the choice of $c_{n,0}$ and the benchmark model. This motivates investigating size control in finite samples, which potentially includes 'strong' correlation cases.

We restrict attention to Gaussian models where $y \sim \mathcal{N}(l\mu, \Sigma)$ for some $\Sigma$ and implicitly condition on $s$, and we also omit the dependence on $n$ to ease notation. In this finite sample conditional framework, the distinction between $W$ and $\hat{W}$ is immaterial, so for simplicity, we write $\tau(WW')$ for the t-statistic.

Let $\mathcal{V}$ denote a set of covariance matrices. A test using the t-statistic $\tau(WW')$ with critical value cv is robust for $\mathcal{V}$ if $\sup_{\Sigma\in\mathcal{V}}P_\Sigma(\tau(WW') > \mathrm{cv}) \leq \alpha$. For a finite or parametric set $\mathcal{V}$, $\sup_{\Sigma\in\mathcal{V}}P_\Sigma(\tau(WW') > \mathrm{cv})$ can be established numerically.
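Such a numerical check can be sketched in a few lines: simulate the null rejection frequency of a projection t-statistic under each member of a parametric set of exponential covariance models $\Sigma(c)$ and take the sup over a grid of $c$. The cosine weights, grid, and critical value below are illustrative assumptions (SCPC would use principal-component weights and its own calibrated critical value):

```python
import numpy as np

rng = np.random.default_rng(4)
n, q, cv, alpha = 100, 4, 2.776, 0.05
s = np.sort(rng.uniform(0.0, 1.0, n))

# hypothetical projection weights; SCPC instead uses principal components
# of the demeaned benchmark covariance matrix
W = np.sqrt(2.0) * np.cos(np.pi * np.outer(s, np.arange(1, q + 1)))

def rejection_prob(Sigma, reps=20_000):
    # Monte Carlo null rejection frequency of the t-statistic under u ~ N(0, Sigma)
    u = np.linalg.cholesky(Sigma + 1e-10 * np.eye(n)) @ rng.standard_normal((n, reps))
    tau = np.abs(u.mean(axis=0)) * np.sqrt(n) / np.sqrt(((W.T @ u)**2 / n).mean(axis=0))
    return np.mean(tau > cv)

# sup over the parametric set V = {Sigma(c) : c in grid}
grid = [1.0, 3.0, 10.0, 30.0, 100.0]
worst = max(rejection_prob(np.exp(-c * np.abs(s[:, None] - s[None, :])))
            for c in grid)
print(worst)  # the test is robust for this V at level alpha if worst <= alpha
```

In practice one would refine the grid of $c$ and increase the number of replications; the point is that robustness over a parametric family reduces to a direct simulation.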
We therefore focus on an analytical robustness result for a non-parametric class $\mathcal{V}$. Specifically, we establish a set of readily verifiable sufficient conditions to check robustness for sets $\mathcal{V}$ that are composed of mixtures of parametric covariance matrices $\Sigma_p(\theta)$ for $\theta\in\Theta$. We then apply this result to a set of Matérn covariance matrices with parameter $\theta$ and investigate the robustness of SCPC over arbitrary mixtures of these Matérn models. In addition, we use the result to study the robustness of a popular projection based t-test in a regularly spaced time series setting.

Consider a benchmark model with $\Sigma = \Sigma_0$, and suppose that cv has been chosen so that $P_{\Sigma_0}(\tau(WW') > \mathrm{cv}) = \alpha$. [Footnote: Technically, the SCPC choice of $q$ in (5) is also a function of the locations $s_n$, so $q_{SCPC}$ is random. However, the argument that establishes Theorem 7 can be extended under this complication as long as $q_{SCPC} \leq q_{\max}$ almost surely for some finite and fixed $q_{\max}$. See Theorem 11 in the appendix for a formal statement. This also covers kernel variance estimators by setting $q = n - 1$ and $MKM = WW'$.] We are interested in conditions under which $P_\Sigma(\tau(WW') > \mathrm{cv}) \leq \alpha$ for
$$\Sigma = \int_\Theta\Sigma_p(\theta)\,dF(\theta) \quad (16)$$
for a probability distribution $F$. Let $\lambda_j(\cdot)$ denote the $j$th largest eigenvalue of some matrix.

Theorem 8.
Let $\Omega_0 = W'\Sigma_0W$, $\Omega_1(\theta) = W'\Sigma_p(\theta)W$, and assume $\Omega_0$ and $\Omega_1(\theta)$, $\theta\in\Theta$, are full rank. Suppose $A_0 = D(\mathrm{cv})\Omega_0$ is diagonalizable, and let $P$ be the matrix of its eigenvectors. Let $A_1(\theta) = P^{-1}D(\mathrm{cv})\Omega_1(\theta)P$ and $\bar{A}_1(\theta) = \frac{1}{2}(A_1(\theta) + A_1(\theta)')$. Suppose $A_0$ and $\bar{A}_1(\theta)$, $\theta\in\Theta$, are scale normalized such that $\lambda_1(A_0) = \lambda_1(\bar{A}_1(\theta)) = 1$. Let
$$\nu_1(\theta) = \lambda_q(-\bar{A}_1(\theta)) - \lambda_1(\bar{A}_1(\theta))\lambda_q(-A_0) - (\lambda_1(\bar{A}_1(\theta)) - 1)$$
$$\nu_i(\theta) = \lambda_{q+1-i}(-\bar{A}_1(\theta)) - \lambda_1(\bar{A}_1(\theta))\lambda_{q+1-i}(-A_0) \quad\text{for } i = 2,\ldots,q.$$
If for some probability distribution $F$ on $\Theta$, $\sum_{i=1}^j\int\nu_i(\theta)\,dF(\theta) \geq 0$ for all $1\leq j\leq q$, then (16) holds.

Remark 4.1. If $\sum_{i=1}^j\nu_i(\theta) \geq 0$ for all $\theta\in\Theta$ and $1\leq j\leq q$, then the theorem implies that $P_\Sigma(\tau(WW') > \mathrm{cv}) \leq \alpha$ for $\Sigma$ an arbitrary mixture of $\Sigma_p(\theta)$.

Remark 4.2.
Note that for $\Sigma_p(\theta) = \Sigma_0$, $\nu_i(\theta) = 0$ for $1\leq i\leq q$, so the inequalities of the theorem have no 'minimal slack' and potentially apply also to parametric models with a covariance matrix $\Sigma_p(\theta)$ that takes on values arbitrarily close to $\Sigma_0$.

Remark 4.3.
As shown in Theorem 2, the eigenvalues of $A_0$ and $A_1(\theta)$ (or, equivalently, of $D(\mathrm{cv})\Omega_1(\theta)$) govern the rejection probability of $\tau(WW')$ under $\Sigma_0$ and $\Sigma_p(\theta)$. Given the scale normalization $\lambda_1(A_0) = \lambda_1(A_1(\theta)) = 1$, if $-\lambda_j(A_1(\theta)) \geq -\lambda_j(A_0)$ for all $j\geq 2$, then the result there implies that $P_{\Sigma_p(\theta)}(\tau(WW') > \mathrm{cv}) \leq P_{\Sigma_0}(\tau(WW') > \mathrm{cv})$. It follows from an integral representation (cf. equation (20) below) that the null rejection probability of the t-statistic is Schur convex in these negative eigenvalues, so that the inequality holds whenever the negative eigenvalues of $A_1(\theta)$ weakly majorize those of $A_0$. Majorization inequalities about eigenvalues of sums of matrices from Marshall, Olkin, and Arnold (2011) and additional calculations then extend this further to the result in Theorem 8.

Remark 4.4.
The conditions of Theorem 8 implicitly depend on the locations $s$, so the implications are specific to the application. In the spatial case, the practical importance of the theorem is that the conditions are straightforward to check numerically for a given parametric family $\Sigma_p(\theta)$. This can establish a range of robustness of a spatial t-test in a given application, and is illustrated in the next subsection with the SCPC t-test and the Matérn class of spatial correlations. The theorem also provides insights for inference in the regularly-spaced time series case, where the spatial design is fixed across applications. This is illustrated in the subsequent subsection for a projection-based t-statistic for mixtures of AR(1) processes and processes that are 'less persistent' than a benchmark AR(1) model.

The critical value for the SCPC t-test is chosen to control size in exponential models with $c \geq c_0$, where $c_0$ is calibrated to a value $\bar{\rho}_0$. Because $\bar{\rho}$ is monotone in $c$, the resulting SCPC t-test controls size for all $\bar{\rho} \leq \bar{\rho}_0$ in the exponential model by construction.

Let $\Sigma_p(\theta)$ denote the covariance matrix associated with a parameter $\theta$, with average pairwise correlation $\bar{\rho}(\theta)$. Let $\Theta_{\rho_L,\rho_U} = \{\theta\,|\,\rho_L \leq \bar{\rho}(\theta) \leq \rho_U\}$ denote the set of values of $\theta$ that induce correlations between $\rho_L$ and $\rho_U$. If the inequalities in Theorem 8 are satisfied for all values of $\theta\in\Theta_{\rho_L,\rho_U}$, then the SCPC t-test controls size for all mixtures of $\Sigma_p(\theta)$ in this set.

In this section we consider $\Sigma_p(\theta)$ computed from Matérn processes with parameter $\theta = (\nu, c)$, where $\nu$ and $c$ are positive constants. If $u$ follows a Matérn process, its covariance function $\sigma_u(r-s)$ depends on the locations only through $d = \|r-s\|$. For $\nu\in\{1/2, 3/2, 5/2, \infty\}$, the Matérn covariance functions are
• $\nu = 1/2$: $\sigma_u(d) \propto \exp[-cd]$
• $\nu = 3/2$: $\sigma_u(d) \propto (1 + \sqrt{3}\,cd)\exp[-\sqrt{3}\,cd]$
• $\nu = 5/2$: $\sigma_u(d) \propto (1 + \sqrt{5}\,cd + (5/3)c^2d^2)\exp[-\sqrt{5}\,cd]$
• $\nu = \infty$: $\sigma_u(d) \propto \exp[-c^2d^2/2]$.
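The four Matérn covariance functions listed above, together with the average pairwise correlation $\bar{\rho}(\theta)$ that indexes the sets $\Theta_{\rho_L,\rho_U}$, can be sketched as follows (the sampled locations are illustrative, not the U.S. states designs):

```python
import numpy as np

def matern(d, nu, c):
    # Matern covariance functions for nu in {1/2, 3/2, 5/2, inf};
    # d is the distance ||r - s||, c > 0 the scale parameter
    if nu == 0.5:
        return np.exp(-c * d)
    if nu == 1.5:
        x = np.sqrt(3.0) * c * d
        return (1.0 + x) * np.exp(-x)
    if nu == 2.5:
        x = np.sqrt(5.0) * c * d
        return (1.0 + x + x**2 / 3.0) * np.exp(-x)
    if nu == np.inf:
        return np.exp(-(c * d)**2 / 2.0)
    raise ValueError("nu not in {1/2, 3/2, 5/2, inf}")

def rho_bar(S, nu, c):
    # average pairwise correlation rho(theta) over the sampled locations
    D = np.linalg.norm(S[:, None, :] - S[None, :, :], axis=2)
    R = matern(D, nu, c)
    n = len(S)
    return (R.sum() - n) / (n * (n - 1))

rng = np.random.default_rng(7)
S = rng.uniform(0.0, 1.0, (50, 2))   # n = 50 illustrative locations, d = 2
for nu in (0.5, 1.5, 2.5, np.inf):
    print(nu, rho_bar(S, nu, c=10.0))
```

Note that $x^2/3 = (5/3)c^2d^2$ for $x = \sqrt{5}\,cd$, so the $\nu = 5/2$ branch matches the displayed formula; smaller $c$ yields larger $\bar{\rho}(\theta)$, which is the monotonicity used to calibrate $c_0$ from $\bar{\rho}_0$.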
For any $\Sigma(c_0)$ it is straightforward to compute the bounds $\rho_L$ and $\rho_U$ such that the inequalities in Theorem 8 are satisfied for all values of $\theta\in\Theta_{\rho_L,\rho_U}$ with $\nu\in\{1/2, 3/2, 5/2, \infty\}$ and $c > 0$. We carried out this exercise for the U.S. states spatial correlation designs of Section 2 (the calculations for one set of locations take less than a second). We find $\rho_L \leq 0.001$ and $\rho_U = \bar{\rho}_0$ for $\bar{\rho}_0 \in \{0.02, 0.10\}$, with very few minor exceptions. We conclude that SCPC controls size in finite Gaussian samples for a wide range of Matérn process mixtures that imply $\bar{\rho} \leq \bar{\rho}_0$, at least for this set of spatial designs.

4.2.2 Implications for regularly-spaced time series

The spatial design is fixed for regularly-spaced time series, so the theorem can provide general robustness results. Consider, for instance, the equal weighted cosine (EWC) projection estimator of Müller (2004, 2007), Lazarus, Lewis, Stock, and Watson (2018) and Dou (2019), where $w(s) = \sqrt{2/q}\,(\cos\pi s, \cos(2\pi s), \ldots, \cos(q\pi s))$. Suppose the critical value $\mathrm{cv}_n$ is chosen so that size is controlled in a Gaussian AR(1) with coefficient $\exp(-c_0/n)$, and $q$ is chosen to minimize expected length in the i.i.d. model. For $c_0 = 10$, $c_0 = 25$ and $c_0 = 50$, we obtain $q = 5$, …, for $n \in \{\ldots\}$. Call this test the EWC($c_0$) t-test.

Calculations based on Theorem 8 for these values of $c_0$ and $n$ show that the EWC($c_0$) t-test controls size for arbitrary mixtures of AR(1) processes with coefficients $\exp(-c/n)$, $c \geq c_0$. By taking the limit in $n$ and using standard local-to-unity weak convergence results (as in Müller (2014)), one can further apply Theorem 1 to the limiting covariance matrices $\Omega_0$ and $\Omega_1(\theta)$ to study asymptotic robustness of the EWC($c_0$) t-test with an asymptotically justified critical value (which are equal to $\mathrm{cv} = 3.53$, $2.71$, $2.40$ for $c_0 = 10$, $25$, $50$, respectively). Another numerical calculation based on Theorem 8 then shows that these EWC($c_0$) t-tests control asymptotic size for underlying processes that are arbitrary mixtures of local-to-unity models with parameters $c \geq c_0$.

Moreover, let $f_{n,0}: [-\pi,\pi]\mapsto[0,\infty)$ be the spectral density of an AR(1) process with coefficient $\exp(-c_0/n)$, so $f_{n,0}(\omega) \propto (1 - 2e^{-c_0/n}\cos\omega + e^{-2c_0/n})^{-1}$. A spectral density $f_{n,1}$ would naturally be considered less persistent than $f_{n,0}$ if $f_{n,1}(\omega)/f_{n,0}(\omega)$ is (weakly) monotonically increasing in $|\omega|$. Denote all such functions by $\mathcal{F}_n$. Define
$$M = \frac{f_{n,1}(\pi)/f_{n,0}(\pi)}{f_{n,1}(0)/f_{n,0}(0)},$$
so $M$ measures by how much $f_{n,1}(\omega)/f_{n,0}(\omega)$ increases over $[0,\pi]$, and denote by $\mathcal{F}_n^{\bar{M}}$ all functions in $\mathcal{F}_n$ with $M\leq\bar{M}$ for some $\bar{M} > 1$. Then for any $f_{n,1}\in\mathcal{F}_n^{\bar{M}}$, there exists a CDF $H$ on $[0,\pi]$ such that
$$f_{n,1}(\omega) \propto f_{n,0}(\omega) + (M-1)H(|\omega|)f_{n,0}(\omega) = \frac{\bar{M}-M}{\bar{M}-1}f_{n,0}(\omega) + \frac{M-1}{\bar{M}-1}\int\big[f_{n,0}(\omega) + (\bar{M}-1)\mathbf{1}[|\omega|\geq\theta]f_{n,0}(\omega)\big]\,dH(\theta),$$
so $f_{n,1}$ has a representation as a scale mixture of $f_{n,0}(\omega) + (\bar{M}-1)\mathbf{1}[|\omega|\geq\theta]f_{n,0}(\omega)$, $0\leq\theta\leq\pi$. After translating this back into a corresponding mixture of covariance matrices $\Sigma_p(\theta)$, an application of Theorem 8 shows that the EWC($c_0$) t-test also controls size in this class, for $(c_0, \bar{M})\in\{(10, \ldots), (25, \ldots), (50, \ldots)\}$ and all $n \in \{\ldots\}$. These results refine corresponding results in Dou (2019) that are based on a Whittle-type diagonal approximation to $\Sigma$.

Taking limits as $n\to\infty$ yields a corresponding asymptotic robustness statement: The function $f_{c_0}:\mathbb{R}\mapsto[0,\infty)$ with $f_{c_0}(\omega) = (\omega^2 + c_0^2)^{-1}$ is proportional to the 'local-to-zero' spectral density (cf. Müller and Watson (2016, 2017)) of a local-to-unity process with parameter $c_0$. Consider any process whose local-to-zero spectral density $f$ is such that $f(\omega)/f_{c_0}(\omega)$ is monotonically increasing in $|\omega|$ with $\lim_{\omega\to\infty}f(\omega)/f_{c_0}(\omega) \leq \bar{M}f(0)/f_{c_0}(0)$ and that satisfies the CLT in Müller and Watson (2016, 2017). A numerical calculation based on Theorem 8 then shows that the EWC($c_0$) t-tests for $(c_0, \bar{M})\in\{(10, \ldots), (25, \ldots), (50, \ldots)\}$ control size in large samples under all such processes.

By construction, the SCPC t-test is not robust to heteroskedasticity or measurement error in locations. For example, suppose that $u(s) = h(s)\tilde{u}(s)$, where $\tilde{u}$ is homoskedastic and satisfies the assumptions outlined above for $u$, and $h: \mathcal{S}\mapsto\mathbb{R}$ is a non-random function that induces heteroskedasticity in the $u$ process. The linear combinations of $u$ studied in Lemma 1 are now $\sum_{l=1}^n w(s_l)u(s_l) = \sum_{l=1}^n w_h(s_l)\tilde{u}(s_l)$, where $w_h(s) = w(s)h(s)$. The results of the lemma and subsequent theorems then follow with $w_h$ replacing $w$.
But the test statistic and critical value are computed using $w$, not $w_h$, so that size control is not guaranteed, even in large samples. An analogous problem arises when the locations $s_i$ are measured with error. In both cases, the particulars of the size distortion depend on the distribution of spatial locations $g$, the weights $w$ (which in turn depend on the value of $\bar{\rho}_0$ used to calibrate $c_0$), the function $h$ in the heteroskedastic model, and the distribution of the measurement error for the locations.

We summarize two experiments that illustrate and quantify the size distortions in the U.S. states spatial correlation designs. The first experiment is a heteroskedastic model with $\log h$ increasing or decreasing linearly from $\log h(s) = 0$ to $\log h(s) = \log 3$ moving from the most westward to the most eastward location; the experiment is repeated with $h$ increasing or decreasing moving north to south, and we record the largest of the four rejection frequencies. Panel (a) of Figure 6 plots the CDF of rejection frequencies for nominal 5% SCPC tests for each $(\bar{\rho}_0, g)$ pair. For these designs, the resulting size distortions are not large, except for a few states with $\bar{\rho}_0 = 0.02$ and the light spatial density $g$, where rejection frequencies approach 10%.

[Figure 6: CDFs of Size under Heteroskedasticity and Location Measurement Error]

The second experiment investigates location measurement error of a form studied in Conley and Molinari (2007). Specifically, for each location, $s_i^* = s_i + e_i$, where $s_i^*$ is the measured location, $s_i$ is the true location and $e_i$ is the measurement error. The error term is $e_i = (e_{1,i}, e_{2,i})$ with $e_{1,i}$ the north-south and $e_{2,i}$ the east-west coordinate, and $e_{j,i}$ i.i.d. $U(-\delta,\delta)$ over $j$ and $i$, with $\delta$ a fixed fraction of $H$, where $H$ is the length of the smallest square that encompasses all locations, corresponding to "level 4" errors in Conley and Molinari's (2007) classification. The CDFs for the rejection frequencies are shown in panel (b) of Figure 6. Evidently, measurement error of this sort has little effect on the size of SCPC under uniformly distributed locations, but can have a substantial effect for highly concentrated spatial distributions, especially when $\bar{\rho}_0 = 0.10$.

Figure 5 showed the expected length of the SCPC confidence interval relative to the length of an oracle confidence interval that uses the true value of $\mathrm{Var}(\sqrt{n}(\bar{y}-\mu))$ conditional on the observed locations $s$. (As before, in this subsection we keep the conditioning on $s$ and the dependence on $n$ implicit.) For studying efficiency, a more relevant comparison involves the expected length of the SCPC confidence interval relative to a confidence interval that, like SCPC, does not depend on the true (unknown) value of $\mathrm{Var}(\sqrt{n}(\bar{y}-\mu))$. Ideally, such a comparison would involve SCPC and the most efficient method for constructing a confidence interval. We undertake such a comparison here.

To be specific, let $\mathrm{CS}(y)\subset\mathbb{R}$ denote a confidence set for $\mu$ constructed from $y$. We restrict attention to location and scale equivariant confidence sets, that is, CS satisfies $\mathrm{CS}(a_\mu + a_\sigma y) = \{\mu: (\mu - a_\mu)/a_\sigma \in \mathrm{CS}(y)\}$ for all $y$, $a_\mu\in\mathbb{R}$ and $a_\sigma > 0$. As in Section 4.2, we focus on the Gaussian model $y\sim\mathcal{N}(l\mu,\Sigma)$. We want to compare the SCPC interval with a confidence interval that, like SCPC, has good coverage $P_\Sigma(\mu\in\mathrm{CS}(y))$ over a range of potential spatial correlation patterns $\Sigma\in\mathcal{V}$. The metric for measuring efficiency is the expected length $E[\int\mathbf{1}[x\in\mathrm{CS}(y)]\,dx]$ in the i.i.d. model $y\sim\mathcal{N}(l\mu, I_n)$.

Our choice of $\mathcal{V}$ is motivated by the structure of the SCPC benchmark covariance matrix $\Sigma(c_0)$. The idea is to include in $\mathcal{V}$ covariance matrices that are weakly less persistent than $\Sigma(c_0)$, and that cannot be easily distinguished from the i.i.d. model. To characterize these covariance matrices, note that $\Sigma(c_0)$ is generated from $u$, an isotropic random field with covariance function $\sigma_u(s,r) = \exp(-c_0\|s-r\|)$. Isotropy implies that the spectrum of this random field $F:\mathbb{R}^d\mapsto[0,\infty)$ at frequency $\boldsymbol{\omega}\in\mathbb{R}^d$ can be written as a function of the scalar $\omega = \|\boldsymbol{\omega}\|$, that is, $F(\boldsymbol{\omega}) = f(\omega)$ for some $f:\mathbb{R}\mapsto[0,\infty)$. As is well known, the exponential covariance model for $d = 2$ corresponds to a spectral density function $f$ proportional to $(c_0^2 + \omega^2)^{-3/2}$. By scale invariance of both CS and the SCPC interval, it is without loss of generality to set $f$ equal to
$$f_0(\omega) = \frac{1}{(c_0^2 + \omega^2)^{3/2}}.$$
For some $\bar{\omega} > 0$, define $f_\Delta(\omega) = \mathbf{1}[|\omega|\leq\bar{\omega}](f_0(\omega) - f_0(\bar{\omega}))$, and let $f_R(\omega) = f_0(\omega) - f_\Delta(\omega)$, so that
$$f_0(\omega) = f_\Delta(\omega) + f_R(\omega).$$
For $0\leq|\omega|\leq\bar{\omega}$, the density $f_\Delta$ is equal to $f_0(\omega) - f_0(\bar{\omega})$, so that the remainder $f_R(\omega)$ is a continuous density that is flat for $|\omega|\leq\bar{\omega}$, and that follows the same decline as $f_0$ for $|\omega| > \bar{\omega}$. Since both $f_\Delta(\omega)$ and $f_R(\omega)$ are non-negative, we have the corresponding identity in covariance matrices
$$\Sigma(c_0) = \Sigma_\Delta(\bar{\omega}) + \Sigma_R(\bar{\omega}) \quad (17)$$
where $\Sigma_\Delta(\bar{\omega})$ and $\Sigma_R(\bar{\omega})$ are induced by the isotropic random fields with spectral densities $F_\Delta(\boldsymbol{\omega}) = f_\Delta(\|\boldsymbol{\omega}\|)$ and $F_R(\boldsymbol{\omega}) = f_R(\|\boldsymbol{\omega}\|)$, respectively.

Now consider the covariance matrix
$$\bar{\Sigma}(\bar{\omega}) = \Sigma_\Delta(\bar{\omega}) + \lambda_1(\Sigma_R(\bar{\omega}))I_n$$
where $\lambda_1(\Sigma_R(\bar{\omega}))$ is the largest eigenvalue of $\Sigma_R(\bar{\omega})$. Since $f_R(\omega)$ is monotonically decreasing in $|\omega|$, $\Sigma_R(\bar{\omega})$ also contributes to the persistence of $\Sigma(c_0)$ in (17), so replacing it with white noise of weakly larger variance should make inference about $\mu$ under $\bar{\Sigma}(\bar{\omega})$ no harder than under $\Sigma(c_0)$. Said differently, a method that is robust under correlation patterns weakly less persistent than $\Sigma(c_0)$ should continue to have good coverage after replacing medium and high frequency variation in $y$ by white noise, that is, under $\bar{\Sigma}(\bar{\omega})$. This motivates the set $\mathcal{V} = \{\bar{\Sigma}(\bar{\omega})\,|\,\bar{\omega} > 0\}$.

A calculation shows that in the U.S. states spatial correlation designs, the SCPC interval has good coverage properties under this $\mathcal{V}$. With $\alpha_{SCPC}(\bar{\omega}) = P_{\bar{\Sigma}(\bar{\omega})}(\tau > \mathrm{cv})$ for the nominal 5% level SCPC test, for most designs $\sup_{\bar{\omega}\geq 0}\alpha_{SCPC}(\bar{\omega})$ is equal or very close to 5%, and it never exceeds 8%. To keep things on an equal footing, we allow CS the same degree of undercoverage; that is, we consider the problem
$$\inf_{\mathrm{CS}} E\Big[\int\mathbf{1}[x\in\mathrm{CS}(y)]\,dx\Big] \quad\text{s.t.}\quad P_{\bar{\Sigma}(\bar{\omega})}(\mu\notin\mathrm{CS}(y)) \leq \max(\alpha_{SCPC}(\bar{\omega}), \alpha) \text{ for all } \bar{\omega} > 0. \quad (18)$$
In words, we seek the invariant confidence set with the shortest expected length in the i.i.d.
location model among all confidence sets that are as robust as the SCPC interval under $\bar{\Sigma}(\bar{\omega})$, $\bar{\omega} > 0$. Since $\bar{\omega}$ is one-dimensional, one can apply the numerical techniques of Elliott, Müller, and Watson (2015) and Müller and Norets (2016) (also see Müller and Watson (in preparation)) to obtain an informative lower bound on the objective $\inf_{\mathrm{CS}}E[\int\mathbf{1}[x\in\mathrm{CS}(y)]\,dx]$ that holds for any equivariant $\mathrm{CS}(y)$ that satisfies the constraint in (18).

We compute such lower bounds in the U.S. states spatial correlation designs. Panel (a) of Figure 7 shows the CDFs of the length of SCPC confidence intervals relative to the lower bounds for the 240 designs in each $(\bar{\rho}_0, g)$ pair. The expected lengths of SCPC are within 7% of the efficiency bound for all designs when $\bar{\rho}_0 = 0.02$. When $\bar{\rho}_0 = 0.10$, so that spatial correlation is high, and the spatial locations are highly concentrated as under the light design, the expected length of the SCPC confidence interval can be more than 15% longer than the efficiency bound. In part, this is because the implied efficient confidence sets are complicated and rather uninterpretable functions of $y$ in this case. We thus repeat the exercise for confidence sets constrained to be symmetric around $\bar{y}$ by imposing $\mathrm{CS}(a_\mu + a_\sigma y) = \{\mu: (\mu - a_\mu)/a_\sigma\in\mathrm{CS}(y)\}$ for all $y$, $a_\mu\in\mathbb{R}$ and $a_\sigma \neq 0$. [Footnote: In the regularly-spaced time series setting, white noise amounts to a flat spectrum, so $\bar{\Sigma}(\bar{\omega})$ corresponds to an underlying spectral density equal to $f_\Delta(\omega) + f_0(\bar{\omega})$, which is the "kinked" spectral density considered by Dou (2019). For arbitrary locations, however, the domain of the spectrum doesn't fold onto the interval $[-\pi,\pi]$, so that white noise cannot mathematically be represented by a flat spectrum.] The results are summarized in panel (b), and we can see that SCPC comes closer to the resulting larger lower bound on confidence interval length.

Remark 5.1.
These efficiency results also provide a limit on the possibility of using data-dependent methods to learn about the value of the worst-case correlation c₀: Since the i.i.d. model corresponds to c → ∞, if it were possible to learn the value of c from the data, one would be able to conduct much more efficient inference than what is reported in Figure 7. The results here thus provide a rationalization for treating c₀ as given.

This section compares SCPC with other methods that have been proposed, focusing on size and expected length of confidence intervals in the benchmark Gaussian model with exponential covariance kernel and parameter c (calibrated by ρ̄). We consider two kernel-based methods, two versions of a cluster method, and one projection method. All these methods are t-statistic based tests of the form considered in Section 3.

The kernel-based methods use a Bartlett kernel, k(s, r) = k_Bartlett(||s − r||/b). The methods differ in their choice of bandwidth b and critical value. The first method uses a standard normal critical value with b chosen so the resulting test has size as close as possible to 5%. This is a version of the method proposed by Conley (1999), but with an oracle choice for the bandwidth. (Also see Dou (2019) for a related discussion and associated impossibility results.) The second kernel method sets b = max_{l,ℓ} ||s_l − s_ℓ|| and chooses the critical value to obtain exact coverage under Σ = I_n. This is the spatial analogue of the method suggested by Kiefer, Vogelsang, and Bunzel (2000) (KVB) for regularly spaced time series. The cluster methods follow the approach of Ibragimov and Müller (2010) (IM) with Student-t critical values and are implemented with q = 4 and q = 9 equal-sized clusters. The projection method follows Sun and Kim (2012). It uses a Student-t critical value and q low-frequency Fourier weights orthogonalized using the sample locations, where q is chosen as a function of the exponential model parameter c using the formula in their equation (8).
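To fix ideas, the Bartlett-kernel variance estimator behind the Conley (1999)-type method can be sketched as follows. This is a minimal illustration, not the authors' replication code; the function names and the simple demeaned-data setup are hypothetical:

```python
import numpy as np

def bartlett_variance(y, s, b):
    """Kernel estimate of Var(sqrt(n) * ybar): n^{-1} * sum over all pairs of
    k_Bartlett(||s_l - s_m|| / b) * u_l * u_m, with u the demeaned data."""
    y = np.asarray(y, dtype=float)
    s = np.asarray(s, dtype=float)          # locations, shape (n, d)
    n = y.size
    u = y - y.mean()                        # demeaned observations
    d = np.linalg.norm(s[:, None, :] - s[None, :, :], axis=-1)  # pairwise distances
    k = np.clip(1.0 - d / b, 0.0, None)     # Bartlett kernel weights
    return u @ k @ u / n

def tstat(y, s, b, mu0=0.0):
    """t-statistic for H0: mu = mu0 based on the kernel variance estimate."""
    n = len(y)
    return np.sqrt(n) * (np.mean(y) - mu0) / np.sqrt(bartlett_variance(y, s, b))
```

With b below the smallest pairwise distance, the estimator collapses to the i.i.d. sample variance of u; the first method's oracle bandwidth and the KVB-type choice b = max_{l,ℓ}||s_l − s_ℓ|| differ only in how b and the critical value are set.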
The first and last methods are thus tailored to the true value of c, just like SCPC.

We analyze these methods in the U.S. states spatial correlation designs, augmented to also include the value ρ̄ = 0.001 for the average pairwise correlation to investigate performance under 'weak' spatial correlations. Figure 8 summarizes the results for size control and expected lengths by plotting the CDFs for each (ρ̄, g) pair. The first column shows the null rejection frequency for each method; by construction, the rejection frequency for SCPC is at most 5% in all designs. The expected lengths in the second and third column use size-corrected critical values to ensure 95% coverage under Σ(c), and are given in multiples of the expected length of the (non-adjusted) SCPC method. The second column reports these relative expected lengths under Σ = I_n, and the third column under Σ(c).

Looking at the first column, the kernel and cluster methods have null rejection probabilities close to 5% when ρ̄ = 0.001, but display size distortions when ρ̄ = 0.02 or 0.10, reflecting the stronger spatial correlation in y for the latter two values of ρ̄. In contrast, the Fourier projection method has relatively small size distortions under g = g_uniform but can have substantial size distortions under g = g_light, even when ρ̄ = 0.001. Its validity requires Ω ∝ I, which it is under weak correlation with g uniform, but not otherwise, even for large q (cf. Remark 3.11).

The relative lengths shown in the second column are above unity, sometimes by a wide margin, indicating that SCPC is closer to the efficiency bound computed in Section 5 than these alternative methods, at least for the designs considered here. The third column shows a similar pattern under Σ(c), with a few exceptions. Notably, the expected length of the size-adjusted 9-cluster method is smaller than SCPC when ρ̄ is large. (The assignment of locations to clusters is performed sequentially, where at each step, we minimize (across yet unassigned locations) the maximal distance over clusters (among those that have not yet been assigned n/q locations). Cluster distances are computed from the northwest, northeast, southeast and southwest corners of the rectangle circumscribing the locations, and in the q = 9 case, also from the mid-points of the four sides of this rectangle, and its center.)

This section discusses extensions of the method to regression and GMM models, some computational issues, and the multivariate extension of SCPC.
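Stepping back to the cluster methods compared above: given a cluster assignment, the IM approach reduces to an ordinary t-test on the q cluster means. A minimal sketch (assuming SciPy; the sequential equal-sized assignment rule described in the footnote is taken as given, and the function name is hypothetical):

```python
import numpy as np
from scipy import stats

def im_test(y, labels, mu0=0.0, alpha=0.05):
    """Ibragimov-Muller cluster test: compute the mean of y within each of the
    q clusters, then run a one-sample Student-t test on those q cluster means."""
    y = np.asarray(y, dtype=float)
    labels = np.asarray(labels)
    means = np.array([y[labels == g].mean() for g in np.unique(labels)])
    q = means.size
    t = np.sqrt(q) * (means.mean() - mu0) / means.std(ddof=1)
    cv = stats.t.ppf(1 - alpha / 2, df=q - 1)   # two-sided critical value
    return t, cv, abs(t) > cv
```

Because the test statistic only uses the q cluster means, its validity rests on the cluster-level means being approximately independent and Gaussian, which is what the spatial grouping of nearby locations is designed to deliver.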
The extension of these results to regression and GMM problems follows from standard arguments. For example, consider the linear regression problem

w_l = x_l β + z_l′δ + ε_l, l = 1, . . . , n   (19)

where β is the (scalar) parameter of interest, z_l are additional controls in the regression, and (w_l, x_l, z_l) are associated with location s_l. Let x̃_l = x_l − S_xz S_zz^{−1} z_l denote the residual from regressing x_l on z_l, where we use the notation S_ab = n^{−1} ∑_{l=1}^n a_l b_l′ for any vectors a_l and b_l. Suppose S_x̃x̃ →p σ_x̃x̃ > 0 and

n^{−1/2} ∑_{l=1}^n x̃_l ε_l | s ⇒p N(0, σ²_xε).

Then √n(β̂ − β) | s ⇒p N(0, σ²) where σ² = σ²_xε / σ²_x̃x̃. Spatial correlation affects inference in this model through σ²_xε, which incorporates potential correlation between x̃_l ε_l and x̃_ℓ ε_ℓ at spatial locations s_l and s_ℓ. Thus, suppose that x̃_l ε_l satisfies the assumptions previously made for u_l. Then a straightforward calculation shows that setting

y_l = β̂ + x̃_l ε̂_l / (n^{−1} ∑_{l=1}^n x̃_l²)

in the analysis of the previous sections leads to analogous results with β replacing μ as the parameter of interest. The extension to GMM inference is analogous; see, for instance, Section 4.4 of Müller (2020).

7.2 Computational issues

We highlight two computational issues. The first involves the calculation of the SCPC critical value, and the second involves the problem of computing the eigenvectors r_j of MΣ(c₀)M when n is very large.

The critical value cv = cv_SCPC(q) solves sup_{c≥c₀} P_Σ(c)(τ(q^{−1} ∑_{j=1}^q r_j r_j′) > cv) = α, or equivalently (from Theorem 2) sup_{c≥c₀} P(Z₀² > ∑_{i=1}^q η_i Z_i²) = α, where η_i = −ω_i/ω₀, ω_i are the eigenvalues of Ŵ′Σ(c)Ŵ D(cv) with Ŵ = [l_n, r₁/√q, . . . , r_q/√q] and Z_j ~ i.i.d.
N(0, 1). Furthermore,

P(Z₀² ≥ ∑_{i=1}^q η_i Z_i²) = (1/π) ∫₀¹ x^{(q−1)/2} / √((1 − x) ∏_{i=1}^q (x + η_i)) dx,   (20)

which is readily evaluated by numerical quadrature. Thus cv_SCPC(q) can be obtained by combining a root-finder with a grid search over c ≥ c₀.

The second problem involves computing the eigenvectors r_j = (r_{j,1}, . . . , r_{j,n})′ of the n × n matrix MΣ(c₀)M when n is very large (say, larger than n = 2000). Here we can leverage the eigenfunction convergence result in Lemma 5 as discussed in Section 3.4.3: In the notation defined there, we seek to approximate r_j = (φ̂_j(s₁), . . . , φ̂_j(s_n))′. Consider a random subset of size ñ < n of the observed locations {s̃_l}_{l=1}^ñ ⊂ {s_l}_{l=1}^n, and let Σ̃(c₀) be the implied ñ × ñ covariance matrix of (u(s̃₁), . . . , u(s̃_ñ))′ using the benchmark covariance function σ_u(r − s | c₀) = exp[−c₀||r − s||]. Let the eigenvector corresponding to the j-th largest eigenvalue λ̃_j of Σ̃(c₀) be r̃_j = (r̃_{1,j}, . . . , r̃_{ñ,j})′ with ñ^{−1} r̃_j′r̃_j = 1. As long as ñ → ∞ and λ_q > λ_{q+1}, Lemma 5 implies that the span of the S ↦ R functions

φ̃_j(s) = ñ^{−1} λ̃_j^{−1} ∑_{l=1}^ñ r̃_{j,l} (exp[−c₀||s − s̃_l||] − ñ^{−1} ∑_{ℓ=1}^ñ exp[−c₀||s̃_l − s̃_ℓ||]), j = 1, . . . , q

converges to the eigenspace spanned by φ_j, j = 1, . . . , q, just like the full sample estimators φ̂_j. Thus, it is formally justified to approximate the value of φ̂_j at locations s_ℓ ∈ {s_l}_{l=1}^n that are not among the subset {s̃_l}_{l=1}^ñ via r_{j,ℓ} = φ̂_j(s_ℓ) ≈ φ̃_j(s_ℓ)—this is a version of the so-called Nyström method (see, for instance, Rasmussen and Williams (2005) for discussion and references).

In practice, such approximations can be carried out for several random subsets of ñ locations, followed by a (sample) principal component analysis to extract the best approximation to the space spanned by the first q eigenvectors. The resulting algorithm has O(n) running time (in contrast to the O(n²) running time of a basic implementation of Conley (1999)-type kernel estimators). We provide corresponding STATA and Matlab code in the replication files.

Consider the case where y_l = μ + u_l with y_l, μ and u_l m × 1 vectors, and the hypothesis of interest is H₀: μ = μ₀. Suppose the observations conditional on s are generated by the model u(s_l) = B(c_n s_l), l = 1, . . . , n, where B(s) is an R^m-valued mean-zero stationary random field on R^d with covariance function E[B(s)B(r)′] = σ_B(r − s). Let Y and U be the n × m matrices of observations and innovations, respectively, and ȳ = n^{−1} ∑_{l=1}^n y_l the sample mean. The natural analogue to the t-statistic τ(ŴŴ′) is Hotelling's T statistic

T(ŴŴ′) = n(ȳ − μ₀)′(Y′ŴŴ′Y)^{−1}(ȳ − μ₀).   (21)

One would expect that, under mixing and moment conditions similar to those of Lemma 1 (ii),

vec(W′U) | s ⇒p N(0, a σ_B(0) ⊗ V₀ + [∫ σ_B(s) ds] ⊗ V₁).   (22)

Note that T(ŴŴ′) is invariant to the transformation Y → YH for nonsingular H.
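Returning to the computational points above, the one-dimensional integral in (20) is straightforward to evaluate numerically. A minimal sketch using SciPy quadrature (the function name is hypothetical); for q = 1 and η₁ = 1 the probability is exactly 1/2, and for q = 2 with η = (1, 1) it equals 1 − 1/√2, which provides a sanity check:

```python
import numpy as np
from scipy.integrate import quad

def rejection_prob(eta):
    """P(Z_0^2 >= sum_i eta_i * Z_i^2) for Z_0, ..., Z_q i.i.d. N(0,1),
    evaluated via the one-dimensional integral in equation (20)."""
    eta = np.asarray(eta, dtype=float)
    q = eta.size
    # integrand has an integrable 1/sqrt(1-x) singularity at x = 1
    integrand = lambda x: x ** ((q - 1) / 2) / np.sqrt((1 - x) * np.prod(x + eta))
    val, _ = quad(integrand, 0.0, 1.0)
    return val / np.pi
```

A scalar root-finder on cv (which enters through η_i = −ω_i/ω₀) combined with a grid over c ≥ c₀ then delivers cv_SCPC(q), as described in the text.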
For the purposes of studying the limit distribution of T(q) under weak correlation, it is thus without loss of generality to normalize σ_B(·) such that the limit covariance matrix in (22) becomes

diag(κ) ⊗ V₀ + (I_m − diag(κ)) ⊗ V₁   (23)

where κ is an m × 1 vector with elements in [0, 1]. Consider the benchmark model with c = (c₁, . . . , c_m)′, where vec(Y) | s ~ N(μ ⊗ l_n, Σ(c)) with Σ(c) = diag(Σ(c₁), . . . , Σ(c_m)), and Σ(c_j) is as in Section 2. Let c₀ = c₀ l_m, an m × 1 vector with the scalar worst-case value c₀ in each element. The SCPC test statistic T(q) is a special case of (21) with the columns of Ŵ equal to the first q eigenvectors of Σ(c₀), scaled to have length 1/√q, and with critical value cv_SCPC^T(q) chosen to satisfy

sup_{c≥c₀} P_Σ(c)(T(q) > cv_SCPC^T(q) | s) = α,

where c ≥ c₀ is understood as an elementwise inequality. The value of q that minimizes the expected volume of the confidence ellipsoid under vec(Y) | s ~ N(μ ⊗ l_n, I_m ⊗ I_n) is

min_{q≥m} E[vol{m : m′(q^{−1}S_q)^{−1}m ≤ n^{−1} cv_SCPC^T(q)}] = min_{q≥m} (2π cv_SCPC^T(q)/n)^{m/2} Γ((q+1)/2) / (q^{m/2} Γ((q−m+1)/2) Γ(m/2+1))

where S_q is distributed Wishart with q degrees of freedom, and the equality follows from Bartlett's decomposition of a Wishart random matrix, and the formulas for the expectation of a χ-distributed random variable and the volume of an m-dimensional ellipsoid.

Since appropriate choices of c_{j,n} → ∞, j = 1, . . . , m in the benchmark model can replicate the normalized limit distributions (23) for all κ, by the same arguments that lead to Theorem 7, T(q) controls size under all weak correlation patterns that induce (22). And as in Section 7.1, it is straightforward to adapt T(q) to test m restrictions in linear regression and GMM problems. We omit details for brevity. Generalizing the results about the small sample robustness of τ_SCPC under potentially strong correlations in Theorem 8 to T is interesting but challenging, and beyond the scope of this paper.

Appendix
Lemma 9. If X_n | s_n ⇒p X and Y_n →p 0, then (X_n + Y_n) | s_n ⇒p X.

Proof. Let BL be the space of Lipschitz continuous functions R^p ↦ R bounded by one with unit Lipschitz constant. By Berti, Pratelli, and Rigo (2006), page 93, X_n | s_n ⇒p X is equivalent to sup_{h∈BL} |E[h(X_n) − h(X) | s_n]| →p 0, so it suffices to show that sup_{h∈BL} |E[h(X_n + Y_n) − h(X) | s_n]| →p 0. Let Y*_n = Y_n 1[||Y_n|| ≤ 1]. Then

sup_{h∈BL} |E[h(X_n + Y_n) − h(X) | s_n]| ≤ sup_{h∈BL} |E[h(X_n + Y*_n) − h(X) | s_n]| + 2P(||Y_n|| > 1 | s_n).

Note that with Δ_n(h) = h(X_n + Y*_n) − h(X_n), |Δ_n(h)| ≤ ||Y*_n|| a.s. for all h ∈ BL, so that

sup_{h∈BL} |E[h(X_n + Y*_n) − h(X) | s_n]| = sup_{h∈BL} |E[Δ_n(h) + h(X_n) − h(X) | s_n]| ≤ sup_{h∈BL} (|E[Δ_n(h) | s_n]| + |E[h(X_n) − h(X) | s_n]|) ≤ E[||Y*_n|| | s_n] + sup_{h∈BL} |E[h(X_n) − h(X) | s_n]|.

We are left to show that Y_n →p 0 implies P(||Y_n|| > 1 | s_n) →p 0 and E[||Y*_n|| | s_n] →p 0. Consider the latter claim. Suppose otherwise. Then for some ε > 0 and some subsequence n′ of n, lim_{n′→∞} P(E[||Y*_{n′}|| | s_{n′}] > ε) > ε, so that lim inf_{n′→∞} E[||Y*_{n′}||] > ε². But since Y*_n is bounded, Y_n →p 0 implies lim_{n→∞} E[||Y*_n||] = 0, a contradiction. A similar argument yields P(||Y_n|| > 1 | s_n) →p 0, concluding the proof.
Proof of Lemma 1: (i) Since B is Gaussian, n^{−1}W_n′u_n | s_n ~ N(0, Ω_n) with Ω_n = n^{−2} ∑_{l,ℓ} w(s_l)w(s_ℓ)′ σ_B(c(s_l − s_ℓ)). It thus suffices to show that Ω_n →p Ω_sc.

We have Ω_n = σ_B(0) n^{−2} ∑_l w(s_l)w(s_l)′ + n^{−2} ∑_{l≠ℓ} w(s_l)w(s_ℓ)′ σ_B(c(s_l − s_ℓ)), and ||n^{−2} ∑_l w(s_l)w(s_l)′|| ≤ n^{−1} sup_{s∈S} ||w(s)||² → 0. Furthermore,

E[(n(n−1))^{−1} ∑_{l≠ℓ} w(s_l)w(s_ℓ)′ σ_B(c(s_l − s_ℓ))] = E[w(s₁)w(s₂)′ σ_B(c(s₁ − s₂))] = Ω_sc

and, with w_i(s) the i-th element of w(s),

E[((n(n−1))^{−1} ∑_{l≠ℓ} w_i(s_l)w_j(s_ℓ) σ_B(c(s_l − s_ℓ)))²]
= ((n−2)(n−3)/(n(n−1))) E[w_i(s₁)w_j(s₂) σ_B(c(s₁ − s₂))] E[w_i(s₃)w_j(s₄) σ_B(c(s₃ − s₄))]
+ (4(n−2)/(n(n−1))) E[w_i(s₁)w_j(s₂) σ_B(c(s₁ − s₂)) w_i(s₁)w_j(s₃) σ_B(c(s₁ − s₃))]
+ (2/(n(n−1))) E[w_i(s₁)w_j(s₂) σ_B(c(s₁ − s₂)) w_i(s₂)w_j(s₁) σ_B(c(s₁ − s₂))]

so that Var[(n(n−1))^{−1} ∑_{l≠ℓ} w_i(s_l)w_j(s_ℓ) σ_B(c(s_l − s_ℓ))] = O(n^{−1}), and therefore Ω_n →p Ω_sc.

(ii) Follows from Theorem 3.2 in Lahiri (2003) and the Cramér–Wold device. □

Proof of Theorem 2:
In the notation of Lemma 1, with X = (X₀, X_q′)′ and Z = (Z₀, . . . , Z_q)′ we have

P(τ_n(W_nW_n′) > cv | s_n) →p P(X₀²/(X_q′X_q) > cv²) = P(X₀² − cv² X_q′X_q > 0) = P(X′D(cv)X > 0) = P(Z′Ω^{1/2}D(cv)Ω^{1/2}Z > 0) = P(∑_{i=0}^q ω_i Z_i² > 0)

where the convergence follows from Lemma 1 and the continuous mapping theorem, and the last equality follows by similarity of the matrices Ω^{1/2}D(cv)Ω^{1/2} and D(cv)Ω. The claim about the sign of the eigenvalues follows from Lemma 14 below. □

Proof of Theorem 3:
We show that Lemma 1 (i) and (ii) continue to hold with w replaced by ŵ. We have

E[(∑_{l=1}^n (ŵ_i(s_l) − w_i(s_l)) u(s_l))² | s_n] ≤ sup_{s∈S} |ŵ_i(s) − w_i(s)|² ∑_{l,ℓ} |σ_B(c_n(s_l − s_ℓ))|

almost surely. Proceeding as in the proof of Lemma 1 (i) now shows that E[n^{−2} ∑_{l,ℓ} |σ_B(c(s_l − s_ℓ))|] → ∫∫ |σ_B(c(r − s))| g(r)g(s) dr ds, so n^{−2} ∑_{l,ℓ} |σ_B(c(s_l − s_ℓ))| = O_p(1). Similarly, under the assumptions of part (ii) of Lemma 1, proceeding as in the proof of Lemma 5.2 of Lahiri (2003) yields E[a_n n^{−1} ∑_{l,ℓ} |σ_B(c_n(s_l − s_ℓ))|] → a σ̄_u² + ∫_{R^d} |σ_B(s)| ds ∫ g(s)² ds. The result thus follows from (10) and Lemma 9.

The proof of Theorem 4 requires a slightly more general version of Theorem 3.

Lemma 10. In the notation of Lemma 5, suppose ŴŴ′ = Φ̂L̂Φ̂′, where the i-th column of the n × q matrix Φ̂ is v̂_i = (φ̂_i(s₁), . . . , φ̂_i(s_n))′ and L̂ = diag(λ̂₁, . . . , λ̂_q). Under the assumptions of Lemma 1, c_n^d n^{−2}(u′ŴŴ′u − u′WW′u) | s_n →p 0, where WW′ = ΦLΦ′, L = diag(λ₁I_{m₁}, . . . , λ_pI_{m_p}), and the i-th column of Φ is equal to (φ_i(s₁), . . . , φ_i(s_n))′.

Proof. With Ô = diag(Ô⁽¹⁾, . . . , Ô⁽ᵖ⁾),

c_n^d n^{−2} u′Φ̂L̂Φ̂′u = c_n^d n^{−2} u′Φ̂Ô′ÔL̂Ô′ÔΦ̂′u = c_n^d n^{−2} u′ΦÔL̂Ô′Φ′u + o_p(1) = c_n^d n^{−2} u′ΦÔLÔ′Φ′u + o_p(1) = c_n^d n^{−2} u′ΦLΦ′u + o_p(1)

where the first equality follows from Ô′Ô = I_q, the second from Lemma 5 (a) and (b) and the reasoning in the proof of Theorem 3, the third from Lemma 5 (b) and ||c_n^{d/2} n^{−1} Ô′Φ′u|| ≤ ||Ô|| · ||c_n^{d/2} n^{−1} Φ′u|| = O_p(1) using Lemma 1, and the fourth from ÔLÔ′ = L a.s. The result now follows from Lemma 9.

Proof of Theorem 4:
For the first claim, by Theorem 4.4.6 of Harkrishan (2017), ω₀ = sup_{||f||=1} ⟨f, RTRf⟩, so it suffices to show that for some f ∈ L²_G, ⟨f, RTRf⟩ > 0. In the weak correlation case, this holds for f(s) = (κ + (1 − κ)g(s))^{−1/2}, since ⟨f, R_wc T R_wc f⟩ = ⟨1, T1⟩ = ∫∫(1 − k̄(r, s)) dG(r) dG(s) = 1. In the strong correlation case, the same conclusion holds by setting f such that R_sc f = 1. Such an f exists, because the kernel of R_sc is equal to {0} by the assumption about σ_B, so the range of R_sc is L²_G by Theorem 3.5.8 of Harkrishan (2017).

Under the null hypothesis, P(τ_n(K_n) > cv | s_n) = P(ξ̂_n > 0 | s_n), where ξ̂_n = c_n^d n^{−2} ∑_{l,ℓ} u_l u_ℓ (1 − cv² k̂_n(s_l, s_ℓ)). By construction of λ̂_i and φ̂_i(·) in Lemma 5, for all 1 ≤ l, ℓ ≤ n,

k̂_n(s_l, s_ℓ) = ∑_{i=1}^n λ̂_i φ̂_i(s_l) φ̂_i(s_ℓ).

For a given q satisfying the assumption of Lemma 5, and all n > q, let

k̂_{n,q}(r, s) = ∑_{i=1}^q λ̂_i φ̂_i(r) φ̂_i(s)

and ξ̂_n^q = c_n^d n^{−2} ∑_{l,ℓ} u_l u_ℓ (1 − cv² k̂_{n,q}(s_l, s_ℓ)). We now show the last claim, that is, P(ξ̂_n > 0 | s_n) →p P(∑_{i=0}^∞ ω_i Z_i² > 0), by establishing

(i) for any ε > 0, lim_{q→∞} lim sup_{n→∞} P(|ξ̂_n − ξ̂_n^q| > ε) = 0   (24)

(ii) for any fixed q, P(ξ̂_n^q > 0 | s_n) →p P(∑_{i=0}^q ω_{q,i} Z_i² > 0)   (25)

(iii) lim_{q→∞} P(∑_{i=0}^q ω_{q,i} Z_i² > 0) = P(∑_{i=0}^∞ ω_i Z_i² > 0)   (26)

for some double array of real numbers ω_{q,i}, by invoking Lemma 9.

For claim (i), note that for all n > q, ξ̂_n ≤ ξ̂_n^q a.s., and

E[ξ̂_n^q − ξ̂_n | s_n] = cv² c_n^d n^{−2} ∑_{l,ℓ} σ_B(c_n(s_l − s_ℓ)) (∑_{i=q+1}^n λ̂_i φ̂_i(s_l) φ̂_i(s_ℓ)) ≤ cv² λ̂_{q+1} c_n^d n^{−2} ∑_{l,ℓ} σ_B(c_n(s_l − s_ℓ))

where the inequality follows from tr(AB) ≤ λ₁(A) tr B for positive semidefinite matrices A, B, with λ₁(A) the largest eigenvalue of A. By the same reasoning as employed in the proof of Theorem 3, c_n^d n^{−2} ∑_{l,ℓ} σ_B(c_n(s_l − s_ℓ)) = O_p(1). Furthermore, by Lemma 5 (b), |λ̂_{q+1} − λ_{q+1}| = O_p(n^{−1/2}), and lim_{q→∞} λ_q = 0. Thus (24) follows.

For claim (ii), let φ₀(s) = 1 and λ₀ = 1. By Lemma 5 (a), Lemma 10 and Theorem 2, claim (25) holds, where ω_{q,i} are the eigenvalues of D(cv)Ω for Ω ∈ {Ω_sc, Ω_wc}, and the (i+1), (j+1) element of Ω is equal to √(λ_i λ_j) ∫∫ φ_i(s) σ_B(c(r − s)) φ_j(r) dG(s) dG(r) and √(λ_i λ_j) ∫ φ_i(s) φ_j(s)(κ + (1 − κ)g(s)) ds under strong and weak correlation, respectively.

For claim (iii), we first show that these ω_{q,i} are also the eigenvalues of the finite rank self-adjoint linear operators RT_qR, R ∈ {R_sc, R_wc}. To this end, let φ*_i(s) = √λ_i Rφ_i(s). With d₀ = 1 and d_i = −cv² for i ≥ 1, we have

RT_qR(f)(s) = ∫ (∑_{i=0}^q d_i φ*_i(s) φ*_i(r)) f(r) dG(r)

and the (i+1), (j+1) element of Ω stated above is equal to √(λ_i λ_j)⟨φ_i, R²φ_j⟩ = √(λ_i λ_j)⟨Rφ_i, Rφ_j⟩ = ∫ φ*_i(s) φ*_j(s) dG(s). Let v = (v₀, . . . , v_q)′ be an eigenvector of D(cv)Ω corresponding to eigenvalue ω, D(cv)Ωv = ωv, that is,

∫ (d_i φ*_i(r) φ*_j(r))_{i,j=0,...,q} dG(r) v = ωv.

Multiplying both sides by (φ*₀(s), . . . , φ*_q(s)) yields

∑_{j=0}^q ∑_{i=0}^q v_j φ*_i(s) ∫ d_i φ*_j(r) φ*_i(r) dG(r) = ω ∑_{j=0}^q v_j φ*_j(s)

∫ (∑_{i=0}^q d_i φ*_i(s) φ*_i(r)) (∑_{j=0}^q v_j φ*_j(r)) dG(r) = ω ∑_{j=0}^q v_j φ*_j(s)   (27)

so ∑_{j=0}^q v_j φ*_j(·) is an eigenfunction of RT_qR with eigenvalue ω, and since the kernel of RT_qR contains all functions that are orthogonal to {φ*_i}_{i=0}^q, these are the only nonzero eigenvalues.

Now let ω^Δ_{q,i} be the eigenvalues of the self-adjoint linear operator R(T − T_q)R. By Kato (1987) (also see the development on page 911 of Rosasco, Belkin, and Vito (2010)), there is an enumeration of the eigenvalues ω_{q,i} such that

∑_{i=0}^∞ (ω_{q,i} − ω_i)² ≤ ∑_{i=0}^∞ (ω^Δ_{q,i})² = ||R(T − T_q)R||²_HS   (28)

where ||R(T − T_q)R||_HS is the Hilbert–Schmidt norm of the operator R(T − T_q)R : L²_G ↦ L²_G induced by the norm √⟨f, f⟩. Now ||R(T − T_q)R||_HS ≤ ||R||² · ||T − T_q||_HS (cf. (32) below), and since T − T_q is an integral operator, ||T − T_q||²_HS = ∫∫ (∑_{i=q+1}^∞ λ_i φ_i(s) φ_i(r))² dG(s) dG(r). By Mercer's Theorem, this converges to zero as q → ∞, so that

lim_{q→∞} ∑_{i=0}^∞ (ω_{q,i} − ω_i)² = 0.   (29)

Thus, using the same order of eigenvalues as in (28), we also have Var[∑_{i=0}^q ω_{q,i} Z_i² − ∑_{i=0}^∞ ω_i Z_i²] ≤ 2∑_{i=0}^∞ (ω_{q,i} − ω_i)², with the right-hand side converging to zero as q → ∞ by (29). But mean-square convergence implies convergence in distribution, and (26) follows.

For the second claim of the theorem, by Lemma 1, ω_{q,i} ≤ 0 for all i ≥ 1, which in conjunction with (29) implies ω_i ≤ 0 for all i ≥ 1. □

Proof of Lemma 5:
We initially show a weaker claim than part (a), namely that there exists a sequence of q × q rotation matrices Ô_n = Ô_n(s_n) with elements Ô_{n,ij} such that

max_{i≤q} sup_{s∈S} |φ_i(s) − ∑_{j=1}^q Ô_{n,ij} φ̂_j(s)| = O_p(n^{−1/2}).   (30)

The proof follows closely the development in Rosasco, Belkin, and Vito (2010), denoted RBV in the following. Let k(r, s) = k̄(r, s) + 1. Conditional on s_n, define the linear operators L²_G ↦ L²_G

M(f)(s) = f(s) − ∫ f(r) dG(r)
M_n(f)(s) = f(s) − ∫ f(r) dG_n(r)
L(f)(s) = ∫ k(r, s) f(r) dG(r)
L_n(f)(s) = ∫ k(r, s) f(r) dG_n(r)

and the derived operators L̄ = MLM, L̄_n = ML_nM and L̂_n = M_nL_nM_n, so that L̄(f)(s) = ∫ f(r)k̄(r, s) dG(r), L̄_n(f)(s) = ∫ k̄(r, s)f(r) dG_n(r) and L̂_n(f)(s) = ∫ k̂_n(r, s)f(r) dG_n(r), where G_n is the empirical distribution of {s_l}_{l=1}^n.

Let H ⊂ L²_G be the Reproducing Kernel Hilbert Space (RKHS) of functions f : S ↦ R with kernel k and inner product ⟨·, ·⟩_H satisfying ⟨f, k(·, r)⟩_H = f(r), and associated norm ||f||_H. Let K = sup_{s∈S} k(s, s). Define H̄ as the RKHS of functions f : S ↦ R with kernel k̄, and H₁ as the RKHS of functions f : S ↦ R with kernel equal to 1, which only consists of the constant function. Since k = k̄ + 1, H contains all functions that can be written as linear combinations of H̄ and H₁ (see, for instance, Theorem 2.16 in Saitoh and Sawano (2016)). Thus H contains the constant function 1, and ||1||_H < ∞. Furthermore, since for any f ∈ H, |f(r)| = |⟨f(·), k(·, r)⟩_H| ≤ ||f||_H · ||k(·, r)||_H ≤ √K ||f||_H, we have

sup_{r∈S} |f(r)| ≤ √K · ||f||_H.   (31)

As in RBV, view the operators above as operators on H ↦ H. The operator norm ||A|| of the operator A : H ↦ H is defined as sup_{||f||_H=1} ||Af||_H, and A is called bounded if ||A|| < ∞. A bounded operator A is Hilbert–Schmidt if ∑_{j=1}^∞ ||Ae_j||² < ∞ for some (any) orthonormal basis e_j. The space of Hilbert–Schmidt operators is a Hilbert space endowed with the norm ||A||_HS = √(∑_{j=1}^∞ ⟨Ae_j, Ae_j⟩_H), and for any Hilbert–Schmidt operator A and bounded operator B,

||AB||_HS ≤ ||A||_HS ||B||,  ||BA||_HS ≤ ||B|| · ||A||_HS.   (32)

By Theorem 7 of RBV, L and L_n are Hilbert–Schmidt. Furthermore, for any f ∈ H,

||Mf||_H = ||f − ∫ f(r) dG(r)||_H ≤ ||f||_H + ||1||_H |∫ f(r) dG(r)| ≤ ||f||_H + ||1||_H sup_{r∈S} |f(r)|

so that (31) implies that M is a bounded operator. By the same argument, so is M_n (almost surely). Thus, from (32), also L̄, L̄_n and L̂_n are Hilbert–Schmidt for almost all s_n. Conditioning on s_n throughout, we have the almost sure inequalities

||L̂_n − L̄||_HS ≤ ||L̂_n − L̄_n||_HS + ||L̄_n − L̄||_HS

and, using (32),

||L̂_n − L̄_n||_HS ≤ ||(M_n − M)L_nM_n||_HS + ||ML_n(M_n − M)||_HS ≤ ||M_n − M|| · ||M_n|| · ||L_n||_HS + ||M_n − M|| · ||M|| · ||L_n||_HS

and

||(M_n − M)f||_H = ||∫ f(r) dG_n(r) − ∫ f(r) dG(r)||_H = ||1||_H |∫ f(r) dG_n(r) − ∫ f(r) dG(r)|.

Now consider the sequence of real independent random variables f(s_l), which have mean E[f(s_l)] = ∫ f(r) dG(r), and, by (31), are almost surely bounded. Since ∫ f(r)(dG_n(r) − dG(r)) = n^{−1} ∑_{l=1}^n f(s_l) − E[f(s₁)], Hoeffding's inequality implies that, with probability of at least 1 − 2e^{−δ},

|∫ f(r)(dG_n(r) − dG(r))| ≤ √(2δ) n^{−1/2} sup_{r∈S} |f(r)|

for all δ ≥ 0. This holds for all f ∈ H, so we conclude that ||M_n − M|| = O_p(n^{−1/2}). Furthermore, applying the same reasoning as in the proof of Theorem 7 of RBV, ||L̄_n − L̄||_HS = O_p(n^{−1/2}). Thus, ||L̂_n − L̄||_HS = O_p(n^{−1/2}).

The conclusion now follows from similar arguments as employed in Propositions 10 and 12 of RBV. In particular, note that φ_i ∈ H for all i. Furthermore, ∫ φ_i(s) dG(s) = λ_i^{−1} ∫∫ φ_i(r) k̄(r, s) dG(r) dG(s) = 0. Thus, with e_i = √λ_i φ_i ∈ H, Me_i = e_i, and ⟨e_i, e_i⟩_H = ⟨e_i(·), λ_i^{−1} ∫ k̄(r, ·)e_i(r) dG(r)⟩_H = λ_i^{−1}⟨e_i, L̄e_i⟩_H = λ_i^{−1}⟨e_i, Le_i⟩_H = λ_i^{−1} ∫ ⟨e_i(·), k(r, ·)⟩_H e_i(r) dG(r) = λ_i^{−1} ∫ e_i(r)² dG(r) = 1, so that the e_i are normalized eigenvectors of L̄ : H ↦ H. Since
H ⊂ L²_G, these are the only eigenfunctions of L̄ : H ↦ H with positive eigenvalue, so that the spectrum of L̄ is equal to {λ_i}_{i=1}^∞ (cf. Proposition 8 of RBV). Also, φ̂_i ∈ H, and since v̂_i is the eigenvector of n^{−1}K̂_n with eigenvalue λ̂_i, n^{−1}K̂_n v̂_i = λ̂_i v̂_i, we obtain for λ̂_i > 0

L̂_n(φ̂_i)(·) = ∫ k̂_n(r, ·)φ̂_i(r) dG_n(r)
= n^{−1} ∑_{j=1}^n k̂_n(·, s_j)φ̂_i(s_j)
= n^{−2} λ̂_i^{−1} ∑_{j=1}^n k̂_n(·, s_j) ∑_{l=1}^n v̂_{i,l} k̂_n(s_j, s_l)
= n^{−1} ∑_{j=1}^n k̂_n(·, s_j)v̂_{i,j} = λ̂_i φ̂_i(·)

and

∫ φ̂_i(r)² dG_n(r) = n^{−3} λ̂_i^{−2} ∑_{j=1}^n ∑_{ℓ=1}^n ∑_{t=1}^n v̂_{i,j} k̂_n(s_j, s_ℓ)k̂_n(s_ℓ, s_t)v̂_{i,t} = 1.

Furthermore, from ∑_{l=1}^n v̂_{i,l} = 0, also ∫ φ̂_i(s) dG_n(s) = 0, so that M_n φ̂_i = φ̂_i. Thus, with ê_i = √λ̂_i φ̂_i ∈ H, ⟨ê_i, ê_i⟩_H = ⟨ê_i(·), λ̂_i^{−1} ∫ k̂_n(r, ·)ê_i(r) dG_n(r)⟩_H = λ̂_i^{−1}⟨ê_i, L̂_n ê_i⟩_H = λ̂_i^{−1}⟨ê_i, L_n ê_i⟩_H = λ̂_i^{−1} ∫ ⟨ê_i(·), k(r, ·)⟩_H ê_i(r) dG_n(r) = λ̂_i^{−1} ∫ ê_i(r)² dG_n(r) = 1. Therefore the ê_i are normalized eigenfunctions of L̂_n : H ↦ H, and since all f ∈ H that are orthogonal to ê_i, i = 1, . . . , n are in the kernel of L̂_n, these are the only eigenfunctions of L̂_n : H ↦ H with positive eigenvalue, so the spectrum of L̂_n : H ↦ H is equal to {λ̂_i}_{i=1}^n (cf. Proposition 9 of RBV).

Part (b) of the lemma now follows from ||L̂_n − L̄||_HS = O_p(n^{−1/2}) and the development on page 911 of RBV.

To establish (30), note that with the projection operators P_q : H ↦ H and P̂_q : H ↦ H defined via P_q(f)(·) = ∑_{i=1}^q ⟨f, e_i⟩_H e_i(·) and P̂_q(f)(·) = ∑_{i=1}^q ⟨f, ê_i⟩_H ê_i(·), by Proposition 6 of RBV, ||P̂_q − P_q||_HS ≤ (λ_q − λ_{q+1})^{−1}||L̂_n − L̄||_HS + o_p(n^{−1/2}) = O_p(n^{−1/2}). Define the q × q matrix Õ_n with i, j-th element Õ_{n,ij} = ⟨ê_i, e_j⟩_H. Then the j, t-th element of Õ_n′Õ_n is given by ∑_{i=1}^q Õ_{n,ij}Õ_{n,it} = ∑_{i=1}^q ⟨ê_i, e_j⟩_H⟨ê_i, e_t⟩_H = ⟨e_j, P̂_q(e_t)⟩_H, and 1[j = t] = ⟨e_j, P_q(e_t)⟩_H, so that by the Cauchy–Schwarz inequality

|∑_{i=1}^q Õ_{n,ij}Õ_{n,it} − 1[j = t]| = |⟨e_j, (P̂_q − P_q)e_t⟩_H| ≤ ||P̂_q − P_q||_HS = O_p(n^{−1/2}).

Thus ||Õ_n′Õ_n − I_q|| = O_p(n^{−1/2}), and with Ô_n = (Õ_n′Õ_n)^{−1/2}Õ_n, also ||Ô_n − Õ_n|| = O_p(n^{−1/2}). Furthermore, with r̂_i = √(λ_i)/√(λ̂_i) →p 1,

||∑_{j=1}^q Ô_{n,ij}φ̂_j − φ_i||_H = ||r̂_i ∑_{j=1}^q Ô_{n,ij}ê_j − e_i||_H ≤ ||∑_{j=1}^q Õ_{n,ij}ê_j − e_i||_H + ||∑_{j=1}^q (r̂_i Ô_{n,ij} − Õ_{n,ij})ê_j||_H ≤ ||(P̂_q − P_q)e_i||_H + ∑_{j=1}^q |r̂_i Ô_{n,ij} − Õ_{n,ij}| ≤ ||P̂_q − P_q||_HS + ∑_{j=1}^q |r̂_i Ô_{n,ij} − Õ_{n,ij}| = O_p(n^{−1/2})

so (30) follows from (31).

The claim in part (a) of the lemma now follows by induction from (30): For p = 1, this follows directly. Suppose the result holds for p − 1, and let Ô_B = diag(Ô⁽¹⁾, . . . , Ô⁽ᵖ⁻¹⁾), so that

sup_{s∈S} ||Ô_B φ̂_B(s) − φ_B(s)|| = O_p(n^{−1/2}),   (33)

with φ_B and φ̂_B the vectors of the first ∑_{j=1}^{p−1} m_j eigenfunctions. Now let

Ô_I = ( Ô₁₁ Ô₁₂ ; Ô₂₁ Ô₂₂ )

be the (∑_{j=1}^p m_j) × (∑_{j=1}^p m_j) matrix Ô_n of (30) applied with q = ∑_{j=1}^p m_j, with Ô₁₁ of the same dimensions as Ô_B. Let φ_{I−B} and φ̂_{I−B} be the m_p × 1 vectors of eigenfunctions with indices ∑_{j=1}^{p−1} m_j + 1, . . . , ∑_{j=1}^p m_j, so that by the conclusion of (30), sup_{s∈S} ||Ô₁₁φ̂_B(s) + Ô₁₂φ̂_{I−B}(s) − φ_B(s)|| = O_p(n^{−1/2}) and sup_{s∈S} ||Ô₂₁φ̂_B(s) + Ô₂₂φ̂_{I−B}(s) − φ_{I−B}(s)|| = O_p(n^{−1/2}). In conjunction with (33), the former yields sup_{s∈S} ||(Ô₁₁ − Ô_B)φ̂_B(s) + Ô₁₂φ̂_{I−B}(s)|| = O_p(n^{−1/2}), which implies, in light of (30) and the linear independence of eigenvectors, that both ||Ô₁₁ − Ô_B|| = O_p(n^{−1/2}) and ||Ô₁₂|| = O_p(n^{−1/2}). Since Ô_I and Ô_B are rotation matrices, Ô_B′Ô_B = Ô₁₁′Ô₁₁ + Ô₂₁′Ô₂₁ = I, so that ||Ô₁₁ − Ô_B|| = O_p(n^{−1/2}) further implies ||Ô₂₁|| = O_p(n^{−1/2}). We conclude that also sup_{s∈S} ||Ô₂₂φ̂_{I−B}(s) − φ_{I−B}(s)|| = O_p(n^{−1/2}), so that the result for p holds with Ô⁽ᵖ⁾ = Ô₂₂, which concludes the proof. □

Proof of Theorem 7:
Suppose max(cv − cv n , p → δ > n →∞ P (cv − cv n > δ ) > δ . Define κ ( κ, cv ) = P ( (cid:80) ∞ i =0 ω i ( κ, cv) Z i > ≤ κ< κ ( κ, cv ) = α by definition of cv. By continuityof κ , there exists 0 ≤ κ < − δ/ ≤ cv ≤ cv such that κ ( κ , cv ) = α . If κ = 0,set c n, = c n, . Otherwise, let c n, → ∞ be such that the corresponding a n, = c dn, /n → a satisfies a σ B (0) / ( a σ B (0) + (cid:82) σ B ( s ) ds ) = κ . Now let cv n, solve P Σ ( c n, ) ( τ n ≥ cv n, | s n ) = α a.s. , so that clearly, cv n, ≤ cv n a.s. for all large enough n . Thus, with A n the event that s n takeson a value such that cv − cv n (cid:48) , > δ , we also have lim sup n →∞ P ( A n ) > δ , and there exists asubsequence n (cid:48) → ∞ of n such that P ( A n (cid:48) ) > δ for all n (cid:48) .For all such n (cid:48) , α = P Σ ( c n (cid:48) , ) ( τ n (cid:48) ≥ cv n (cid:48) , |A n (cid:48) ) ≥ P Σ ( c n (cid:48) , ) ( τ n (cid:48) ≥ cv − δ |A n (cid:48) ) a.s. (34)and by Theorem 4, P Σ ( c n (cid:48) , ) ( τ n (cid:48) ≥ cv − δ |A n (cid:48) ) → κ ( κ , cv − δ ) > α . This contradicts (34),and the result follows. (cid:4) Theorem 11.
Let ˆ q n be an arbitrary function of s n taking values in Q = { , , . . . , q max } forsome sample size independent finite and nonrandom q max . Then for a t-statistic τ n ( q ) thatsatisfies the conditions of Theorem 7 for all q ∈ Q with critical value cv n ( q ) as in (15), forany (cid:15) > , lim sup n →∞ P ( P ( τ n (ˆ q n ) > cv n (ˆ q n ) | s n ) > α + (cid:15) ) = 0 .Proof. Suppose otherwise. Then there exists (cid:15) > n (cid:48) → ∞ such that with B n = { s n : P ( τ n (ˆ q ) > cv n (ˆ q ) | s n ) > α + (cid:15) } ⊂ S ,lim n (cid:48) →∞ P ( s n (cid:48) ∈ B n (cid:48) ) > (cid:15) .Let A n,i = { s n : ˆ q n = i } , so that lim n (cid:48) →∞ (cid:80) q max i =1 P ( s n (cid:48) ∈ B n (cid:48) ∩ A n (cid:48) ,i ) > (cid:15) . There henceexists some 1 ≤ q ≤ q max and a further subsequence n (cid:48)(cid:48) of n (cid:48) such that lim n (cid:48)(cid:48) P ( s n (cid:48)(cid:48) ∈ B n (cid:48)(cid:48) ∩A n (cid:48)(cid:48) ,q ) > (cid:15)/q max . But along this subsequence, q is fixed, so Theorem 7 applies and yieldslim n (cid:48)(cid:48) →∞ P ( s n (cid:48)(cid:48) ∈ B n (cid:48)(cid:48) ∩ A n (cid:48)(cid:48) ,q ) →
0, yielding the desired contradiction.45he proof of Theorem 8 relies on some preliminary results.
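As an aside, the empirical eigenfunction identities used above, namely $\hat L_n\hat\varphi_i = \hat\lambda_i\hat\varphi_i$ and the normalization $\int \hat\varphi_i(r)^2\,dG_n(r) = 1$, can be checked numerically. The following sketch uses a hypothetical Gaussian kernel and uniform random locations on $[0,1]$; both are our illustrative choices, not the paper's worst-case covariance model or its demeaned kernel $\hat k_n$.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
s = rng.uniform(size=(n, 1))           # random locations on [0, 1]

def k(a, b):
    # hypothetical Gaussian kernel, chosen only for illustration
    return np.exp(-10.0 * (a - b) ** 2)

K = k(s, s.T)                          # n x n kernel matrix
M = np.eye(n) - np.ones((n, n)) / n    # demeaning projection M_n
Khat = M @ K @ M                       # demeaned kernel matrix

lam, V = np.linalg.eigh(Khat / n)      # eigenpairs of n^{-1} K̂_n
order = np.argsort(lam)[::-1]          # sort eigenvalues in descending order
lam, V = lam[order], V[:, order]
V = V * np.sqrt(n)                     # normalize so that |v̂_i|^2 = n

i = 0                                  # check the leading eigenpair
vi, li = V[:, i], lam[i]

# φ̂_i(s_j) = (n λ̂_i)^{-1} Σ_l v̂_{i,l} k̂_n(s_j, s_l) recovers v̂_{i,j}
phi_at_s = (Khat @ vi) / (n * li)
assert np.allclose(phi_at_s, vi)

# the eigen-relation L̂_n φ̂_i = λ̂_i φ̂_i, evaluated at the sample points
Ln_phi = (Khat @ phi_at_s) / n
assert np.allclose(Ln_phi, li * phi_at_s)

# the normalization ∫ φ̂_i(r)^2 dG_n(r) = 1
assert np.isclose(np.mean(phi_at_s ** 2), 1.0)
print("all identities hold")
```

With the eigenvectors of $n^{-1}\hat K_n$ scaled to squared length $n$, the Nyström-style extension $\hat\varphi_i$ interpolates the eigenvector entries exactly at the sample points, which is what the chain of equalities in the proof exploits.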
Lemma 12. The $\mathbb R^q \mapsto \mathbb R$ function
\[
J(\eta) = \frac{1}{\pi}\int_0^1 \frac{x^{(q-1)/2}}{\sqrt{(1-x)\prod_{i=1}^q (x + \eta_i)}}\,dx
\]
with $\eta = (\eta_1, \dots, \eta_q)$ is Schur convex.

Proof. By the Schur–Ostrowski criterion (Theorem 3.A.4 in Marshall, Olkin, and Arnold (2011)), $J$ is Schur convex if (and only if)
\[
(\eta_i - \eta_j)\left(\frac{\partial J}{\partial \eta_i} - \frac{\partial J}{\partial \eta_j}\right) \ge 0 \ \text{for all} \ 1 \le i, j \le q.
\]
With $\tilde J = (x + \eta_i)^{-1/2}(x + \eta_j)^{-1/2}$, by a direct calculation,
\[
(\eta_i - \eta_j)\left(\frac{\partial \tilde J}{\partial \eta_i} - \frac{\partial \tilde J}{\partial \eta_j}\right) = \frac{(\eta_i - \eta_j)^2}{2(x + \eta_i)^{3/2}(x + \eta_j)^{3/2}} \ge 0. \ \blacksquare
\]

Lemma 13. For any two $q \times q$ positive semi-definite matrices $B_0$ and $B_1$, vectors $v_0, v_1 \in \mathbb R^q$, and all $p \in [0,1]$,
\[
\varsigma(p) = (pv_1 + (1-p)v_0)'(I_q + pB_1 + (1-p)B_0)^{-1}(pv_1 + (1-p)v_0) - pv_1'(I_q + B_1)^{-1}v_1 - (1-p)v_0'(I_q + B_0)^{-1}v_0 \le 0.
\]

Proof. We first show that $\varsigma(p)$ is convex. Write $G(p) = I_q + pB_1 + (1-p)B_0$. The first derivative of the nonlinear part of $\varsigma(p)$ is given by
\[
2(v_1 - v_0)'G(p)^{-1}(pv_1 + (1-p)v_0) - (pv_1 + (1-p)v_0)'G(p)^{-1}(B_1 - B_0)G(p)^{-1}(pv_1 + (1-p)v_0),
\]
so that the second derivative of $\varsigma(p)$ equals
\begin{align*}
&2(v_1 - v_0)'G(p)^{-1}(v_1 - v_0) - 4(v_1 - v_0)'G(p)^{-1}(B_1 - B_0)G(p)^{-1}(pv_1 + (1-p)v_0) \\
&\quad + 2(pv_1 + (1-p)v_0)'G(p)^{-1}(B_1 - B_0)G(p)^{-1}(B_1 - B_0)G(p)^{-1}(pv_1 + (1-p)v_0).
\end{align*}
With $\Delta(p) = G(p)^{-1/2}(v_1 - v_0)$ and $r(p) = -G(p)^{-1/2}(B_1 - B_0)G(p)^{-1}(pv_1 + (1-p)v_0)$, the second derivative may be rewritten as
\[
2\begin{pmatrix}\Delta(p)\\ r(p)\end{pmatrix}'\begin{pmatrix}I_q & I_q\\ I_q & I_q\end{pmatrix}\begin{pmatrix}\Delta(p)\\ r(p)\end{pmatrix} \ge 0
\]
for all $p \in [0,1]$, so that $\varsigma$ is convex and $\varsigma(p) \le \max(\varsigma(1), \varsigma(0)) = 0$. $\blacksquare$

Lemma 14. Let $A = \int P^{-1}D(\mathrm{cv})\Omega(\theta)P\,dF(\theta)$. The $q+1$ eigenvalues of $A$ are real, and only one is positive, and the same holds for $A(\theta)$, $\theta \in \Theta$. Furthermore, $\lambda_1(A) \ge 1$.

Proof. By similarity, the eigenvalues of $A$ are equal to those of $PAP^{-1}$, which in turn is similar to the symmetric matrix
\[
\begin{pmatrix} l'\Sigma l & l'\Sigma\tilde W\\ \tilde W'\Sigma l & \tilde W'\Sigma\tilde W\end{pmatrix}^{1/2}\begin{pmatrix}1 & 0\\ 0 & -I_q\end{pmatrix}\begin{pmatrix} l'\Sigma l & l'\Sigma\tilde W\\ \tilde W'\Sigma l & \tilde W'\Sigma\tilde W\end{pmatrix}^{1/2}
\]
with $\tilde W = W/\mathrm{cv}$, and the first claim follows for $A$. The claim for $A(\theta)$ follows from the same argument.

For the last claim, let $\bar h: \mathbb R \mapsto \mathbb R$,
\[
\bar h(t) = 1 - t\,l'\Sigma l + t^2\,l'\Sigma\tilde W(I_q + t\tilde W'\Sigma\tilde W)^{-1}\tilde W'\Sigma l.
\]
Note that $\bar h(t)$ is weakly decreasing in $t > 0$, since with $\tilde H = -t\tilde W(I_q + t\tilde W'\Sigma\tilde W)^{-1}\tilde W'\Sigma l$,
\[
\bar h'(t) = -\begin{pmatrix} l\\ \tilde H\end{pmatrix}'\begin{pmatrix}\Sigma & \Sigma\\ \Sigma & \Sigma\end{pmatrix}\begin{pmatrix} l\\ \tilde H\end{pmatrix} \le 0.
\]
The characteristic polynomial of $A$ is given by
\begin{align*}
\det\begin{pmatrix} s - l'\Sigma l & l'\Sigma\tilde W\\ -\tilde W'\Sigma l & sI_q + \tilde W'\Sigma\tilde W\end{pmatrix} &= \big(s - l'\Sigma l + l'\Sigma\tilde W(sI_q + \tilde W'\Sigma\tilde W)^{-1}\tilde W'\Sigma l\big)\det(sI_q + \tilde W'\Sigma\tilde W) \\
&= s\,\bar h(s^{-1})\det(sI_q + \tilde W'\Sigma\tilde W),
\end{align*}
so that $\lambda_1(A)$ satisfies $\bar h(1/\lambda_1(A)) = 0$. Similarly, $1/\lambda_1(A(\theta)) = 1$ is a root of
\[
h_\theta(t) = 1 - t\,l'\Sigma(\theta)l + t^2\,l'\Sigma(\theta)\tilde W(I_q + t\tilde W'\Sigma(\theta)\tilde W)^{-1}\tilde W'\Sigma(\theta)l.
\]
By Lemma 13, for any $t > 0$,
\begin{align*}
t^2\,l'\Sigma\tilde W(I_q + t\tilde W'\Sigma\tilde W)^{-1}\tilde W'\Sigma l &= \left(\int t\tilde W'\Sigma(\theta)l\,dF(\theta)\right)'\left(I_q + t\int \tilde W'\Sigma(\theta)\tilde W\,dF(\theta)\right)^{-1}\left(\int t\tilde W'\Sigma(\theta)l\,dF(\theta)\right) \\
&\le \int t^2\,l'\Sigma(\theta)\tilde W(I_q + t\tilde W'\Sigma(\theta)\tilde W)^{-1}\tilde W'\Sigma(\theta)l\,dF(\theta).
\end{align*}
Thus, $\bar h(t) \le \int h_\theta(t)\,dF(\theta)$, and from $h_\theta(1) = 0$ for all $\theta$, $\bar h(1) \le 0$. Since $\bar h$ is decreasing, its root $1/\lambda_1(A)$ must thus be smaller than unity, and the conclusion follows. $\blacksquare$

Proof of Theorem 8: Proceeding as in the proof of Theorem 2, $P_\Sigma(\tau(WW') > \mathrm{cv}) = P(Z_0^2 \ge \sum_{i=1}^q \bar\eta_i Z_i^2)$ with $\bar\eta_i = \lambda_i(-A)/\lambda_1(A)$. By Lemma 14, $\bar\eta_i \ge 0$ for $i = 1, \dots, q$. For future reference, note that $P_\Sigma(\tau(WW') > \mathrm{cv}) = \alpha$ yields
\[
P\Big(Z_0^2 \ge \sum_{i=1}^q \eta_i Z_i^2\Big) \le \alpha \tag{35}
\]
for $\eta_i = \lambda_i(-A)$.

In the following, we write $a \prec b$ for two vectors $a, b \in \mathbb R^q$ to indicate that $b$ majorizes $a$, that is, with the elements of $a$ and $b$ sorted in descending order, $\sum_{i=1}^j a_i \le \sum_{i=1}^j b_i$ for all $j = 1, \dots, q$ and $\sum_{i=1}^q a_i = \sum_{i=1}^q b_i$. Let $\bar A = \frac{1}{2}(A + A')$. From Theorems 9.F.1 and 9.G.1 in Marshall, Olkin, and Arnold (2011),
\begin{align}
(\lambda_1(-A), \dots, \lambda_{q+1}(-A)) &\prec (\lambda_1(-\bar A), \dots, \lambda_{q+1}(-\bar A)) \tag{36} \\
&\prec \left(\int \lambda_1(-\bar A(\theta))\,dF(\theta), \dots, \int \lambda_q(-\bar A(\theta))\,dF(\theta), \int \lambda_{q+1}(-\bar A(\theta))\,dF(\theta)\right). \nonumber
\end{align}
Since $\int \lambda_{q+1}(-\bar A(\theta))\,dF(\theta) = -\int \lambda_1(\bar A(\theta))\,dF(\theta)$ and $\lambda_{q+1}(-A) = -\lambda_1(A)$, we have
\[
-\lambda_1(A) + \sum_{j=1}^q \lambda_j(-A) = -\int \lambda_1(\bar A(\theta))\,dF(\theta) + \sum_{j=1}^q \int \lambda_j(-\bar A(\theta))\,dF(\theta).
\]
The majorization result (36) further implies
\[
\lambda_1(A) \le \lambda_1(\bar A) \le \int \lambda_1(\bar A(\theta))\,dF(\theta), \tag{37}
\]
so that also
\[
(\lambda_1(-A), \dots, \lambda_q(-A)) \prec \left(\int \lambda_1(-\bar A(\theta))\,dF(\theta), \dots, \int \lambda_{q-1}(-\bar A(\theta))\,dF(\theta), \int \lambda_q(-\bar A(\theta))\,dF(\theta) - \left(\int \lambda_1(\bar A(\theta))\,dF(\theta) - \lambda_1(A)\right)\right).
\]
Thus, with $\tilde\eta_i = \int \lambda_i(-\bar A(\theta))\,dF(\theta)/\lambda_1(A)$ for $i = 1, \dots, q-1$ and
\[
\tilde\eta_q = \frac{\int \lambda_q(-\bar A(\theta))\,dF(\theta) - \big(\int \lambda_1(\bar A(\theta))\,dF(\theta) - \lambda_1(A)\big)}{\lambda_1(A)},
\]
we have $(\bar\eta_1, \dots, \bar\eta_q) \prec (\tilde\eta_1, \dots, \tilde\eta_q)$, so that by (20) and Lemma 12, $P(Z_0^2 \ge \sum_{i=1}^q \bar\eta_i Z_i^2) \le P(Z_0^2 \ge \sum_{i=1}^q \tilde\eta_i Z_i^2)$.

Now applying (37),
\[
\tilde\eta_i^* = \int \lambda_i(-\bar A(\theta))\,dF(\theta)\Big/\int \lambda_1(\bar A(\theta))\,dF(\theta) \le \tilde\eta_i
\]
for $i = 1, \dots, q-1$, and since from Lemma 14, $\lambda_1(A) \ge 1$, also
\[
\tilde\eta_q^* = \frac{\int \lambda_q(-\bar A(\theta))\,dF(\theta) - \big(\int \lambda_1(\bar A(\theta))\,dF(\theta) - 1\big)}{\int \lambda_1(\bar A(\theta))\,dF(\theta)} \le \tilde\eta_q,
\]
provided
\[
\int \lambda_q(-\bar A(\theta))\,dF(\theta) - \left(\int \lambda_1(\bar A(\theta))\,dF(\theta) - 1\right) \ge 0. \tag{38}
\]
Since $P(Z_0^2 \ge \sum_{i=1}^q \tilde\eta_i Z_i^2)$ is a decreasing function of the $\tilde\eta_i$, $P(Z_0^2 \ge \sum_{i=1}^q \tilde\eta_i Z_i^2) \le P(Z_0^2 \ge \sum_{i=1}^q \tilde\eta_i^* Z_i^2)$. By Theorem 3.A.8 of Marshall, Olkin, and Arnold (2011), Lemma 12, and (35), it now suffices to show that
\[
\sum_{i=1}^j \tilde\eta_{q+1-i}^* \ge \sum_{i=1}^j \eta_{q+1-i} \tag{39}
\]
for all $1 \le j \le q$, and since $\eta_q \ge 0$, this also ensures that (38) holds. Condition (39) may be rewritten as $\sum_{i=1}^j \int \nu_i(\theta)\,dF(\theta) \ge 0$, and the result follows. $\blacksquare$

References

Andrews, D. W. K. (1991): "Heteroskedasticity and Autocorrelation Consistent Covariance Matrix Estimation," Econometrica, 59, 817–858.
Bakirov, N. K., and G. J. Székely (2005): "Student's T-Test for Gaussian Scale Mixtures," Zapiski Nauchnyh Seminarov POMI, 328, 5–19.

Berti, P., L. Pratelli, and P. Rigo (2006): "Almost sure weak convergence of random probability measures," Stochastics: An International Journal of Probability and Stochastic Processes, 78(2), 91–97.

Bester, C., T. Conley, C. Hansen, and T. Vogelsang (2016): "Fixed-b Asymptotics for Spatially Dependent Robust Nonparametric Covariance Matrix Estimators," Econometric Theory, 32, 154–186.

Bester, C. A., T. G. Conley, and C. B. Hansen (2011): "Inference with Dependent Data Using Cluster Covariance Estimators," Journal of Econometrics, 165, 137–151.

Cao, J., C. Hansen, D. Kozbur, and L. Villacorta (2020): "Inference for Dependent Data with Learned Clusters," Working paper.

Chan, N. H., and C. Z. Wei (1987): "Asymptotic Inference for Nearly Nonstationary AR(1) Processes," The Annals of Statistics, 15, 1050–1063.

Conley, T., and F. Molinari (2007): "Spatial Correlation Robust Inference with Errors in Location or Distance," Journal of Econometrics, 140, 76–96.

Conley, T. G. (1999): "GMM Estimation with Cross Sectional Dependence," Journal of Econometrics, 92, 1–45.

Dou, L. (2019): "Optimal HAR Inference," Working Paper, Princeton University.

Elliott, G., U. K. Müller, and M. W. Watson (2015): "Nearly Optimal Tests When a Nuisance Parameter is Present Under the Null Hypothesis," Econometrica, 83, 771–811.

Harkrishan, L. V. (2017): Elements of Hilbert Spaces and Operator Theory. Springer.

Henderson, J., T. Squires, A. Storeygard, and D. Weil (2018): "The Global Distribution of Economic Activity: Nature, History, and the Role of Trade," Quarterly Journal of Economics, 133(1), 357–406.

Ibragimov, R., and U. K. Müller (2010): "T-Statistic Based Correlation and Heterogeneity Robust Inference," Journal of Business and Economic Statistics, 28, 453–468.

Ibragimov, R., and U. K. Müller (2016): "Inference with Few Heterogeneous Clusters," Review of Economics and Statistics, 98(1), 83–96.

Kato, T. (1987): "Variation of discrete spectra," Communications in Mathematical Physics, 111(3), 501–504.

Kelejian, H. H., and I. R. Prucha (2007): "HAC estimation in a spatial framework," Journal of Econometrics, 140(1), 131–154.

Kelly, M. (2019): "The standard errors of persistence," University College Dublin WP19/13.

Kiefer, N., and T. J. Vogelsang (2005): "A New Asymptotic Theory for Heteroskedasticity-Autocorrelation Robust Tests," Econometric Theory, 21, 1130–1164.

Kiefer, N. M., T. J. Vogelsang, and H. Bunzel (2000): "Simple Robust Testing of Regression Hypotheses," Econometrica, 68, 695–714.

Lahiri, S. (2003): "Central Limit Theorems for Weighted Sums of a Spatial Process under a Class of Stochastic and Fixed Designs," Sankhya, 65(2), 356–388.

Lahiri, S., and P. M. Robinson (2016): "Central limit theorems for long range dependent spatial linear processes," Bernoulli, 22(1), 345–375.

Lazarus, E., D. J. Lewis, J. H. Stock, and M. W. Watson (2018): "HAR Inference: Recommendations for Practice," Journal of Business and Economic Statistics, 36(4), 541–559.

Marshall, A. W., I. Olkin, and B. C. Arnold (2011): Inequalities: Theory of Majorization and Its Applications. Springer Series in Statistics, New York.

Müller, U. K. (2004): "A Theory of Robust Long-Run Variance Estimation," Working paper, Princeton University.

Müller, U. K. (2007): "A Theory of Robust Long-Run Variance Estimation," Journal of Econometrics, 141, 1331–1352.

Müller, U. K. (2014): "HAC Corrections for Strongly Autocorrelated Time Series," Journal of Business and Economic Statistics, 32, 311–322.

Müller, U. K. (2020): "A More Robust t-Test," arXiv:2007.07065.

Müller, U. K., and A. Norets (2016): "Credibility of Confidence Sets in Nonstandard Econometric Problems," Econometrica, 84, 2183–2213.

Müller, U. K., and M. W. Watson (2016): "Measuring Uncertainty about Long-Run Predictions," Review of Economic Studies, 83.

Müller, U. K., and M. W. Watson (2017): "Low-Frequency Econometrics," in Advances in Economics: Eleventh World Congress of the Econometric Society, ed. by B. Honoré and L. Samuelson, vol. II, pp. 63–94. Cambridge University Press.

Müller, U. K., and M. W. Watson (in preparation): "Low-Frequency Analysis of Economic Time Series," in Handbook of Econometrics. Elsevier.

Newey, W. K., and K. West (1987): "A Simple, Positive Semi-Definite, Heteroskedasticity and Autocorrelation Consistent Covariance Matrix," Econometrica, 55, 703–708.

Phillips, P. C. B. (1987): "Towards a Unified Asymptotic Theory for Autoregression," Biometrika, 74, 535–547.

Phillips, P. C. B. (2005): "HAC Estimation by Automated Regression," Econometric Theory, 21, 116–142.

Rasmussen, C. E., and C. K. I. Williams (2005): Gaussian Processes for Machine Learning. The MIT Press.

Robinson, P. M. (2005): "Robust Covariance Matrix Estimation: HAC Estimates with Long Memory/Antipersistence Correction," Econometric Theory, 21, 171–180.

Rosasco, L., M. Belkin, and E. D. Vito (2010): "On Learning with Integral Operators," Journal of Machine Learning Research, 11(30), 905–934.

Saitoh, S., and Y. Sawano (2016): Theory of Reproducing Kernels and Applications. Springer, New York.

Sun, Y. (2013): "Heteroscedasticity and Autocorrelation Robust F Test Using Orthonormal Series Variance Estimator," The Econometrics Journal, 16, 1–26.

Sun, Y., and M. Kim (2012): "Asymptotic F-Test in a GMM Framework with Cross-Sectional Dependence," Review of Economics and Statistics, 91(1), 210–233.

Taqqu, M. (1975): "Weak convergence to fractional Brownian motion and to the Rosenblatt process," Advances in Applied Probability, 7(2), 249.

Wang, Y. (2014): "An Invariance Principle for Fractional Brownian Sheets,"