Fixed-k Inference for Conditional Extremal Quantiles
Yuya Sasaki and Yulong Wang

First arXiv version: August 2019. This version: July 2020
Abstract
We develop a new extreme value theory for repeated cross-sectional and panel data to construct asymptotically valid confidence intervals (CIs) for conditional extremal quantiles from a fixed number k of nearest-neighbor tail observations. As a by-product, we also construct CIs for extremal quantiles of coefficients in linear random coefficient models. For any fixed k, the CIs are uniformly valid without parametric assumptions over a set of nonparametric data generating processes associated with various tail indices. Simulation studies show that our CIs exhibit superior small-sample coverage and length properties relative to alternative nonparametric methods based on asymptotic normality. Applying the proposed method to Natality Vital Statistics, we study factors of extremely low birth weights. We find that the signs of major effects are the same as those found in preceding studies based on parametric models, but with different magnitudes.

Keywords: conditional extremal quantile, confidence interval, extreme value theory, fixed k, random coefficient

∗ We thank Federico Bugni, Xiaohong Chen, Tim Christensen, Yanqin Fan, Yoonseok Lee, Zhijie Xiao, Yichong Zhang, and participants at seminars/conferences at Boston College, PSU, SMU, EC Conference 2019, the Econometric Society North American Summer Meeting 2019, and the Greater New York Metropolitan Area Econometrics Colloquium 2019, for very helpful comments and advice. Wang gratefully acknowledges financial support from the Appleby-Mosher fund.
† Associate professor of economics, Vanderbilt University. Email: [email protected]
‡ Assistant professor of economics, Syracuse University. Email: [email protected]

Introduction
Tail risks and extreme events are important research topics in economics. In many applications with multivariate analysis, the features of interest are conditional tail properties such as conditional extremal quantiles. This article provides a new method to construct confidence intervals for conditional extremal quantiles from a fixed number k of nearest-neighbor tail observations. The advantages of the proposed method are three-fold: first, it is robust against flexible distributional assumptions, unlike parametric methods; second, the procedure yields asymptotically valid confidence intervals for any fixed tuning parameter k, unlike existing kernel methods that rely on sequences of moving tuning parameters for asymptotically valid inference; and third, our confidence intervals enjoy a uniform coverage property over a set of data generating processes involving a set of values of the tail index. In the existing literature, methods of inference about conditional quantiles concern middle quantiles, e.g., Qu and Yoon (2015) (see also Qu and Yoon (2019)), based on the local quantile estimators of Fan, Hu, and Truong (1994) and Yu and Jones (1998). We aim to complement this existing literature by proposing a method of inference about conditional extremal quantiles.

Compared with unconditional tail features, the conditional tail counterparts are more difficult to study. This is because conditional tails depend on both the marginal distributions and their joint behavior. Although marginal distributions can generally be assumed to be approximately Pareto near the tails, joint distributions cannot generally be assumed to be approximated by a fully parametric joint distribution and are thus harder to study given very limited tail observations. To model covariate-dependent yet tractable tails, the seminal paper by Chernozhukov (2005) extends the quantile regression (QR) estimator of Koenker and Bassett (1978) to tails, and proposes a method called extremal quantile regression (EQR).
Chernozhukov and Fernández-Val (2011) further investigate the EQR to construct confidence intervals (CIs) based on subsampling.

The EQR approach is based on the assumption that the conditional extremal quantile can be well approximated by a parametric location-scale shift model:

Q_{Y|X=x}(τ) ∼ μ(x) + σ(x)(1 − τ)^{−ξ}    (1)

for τ → 1, where μ(x) and σ(x) are parametric functions that capture the location and scale, respectively. The element (1 − τ)^{−ξ} can be treated as the quantile function of a standard Pareto distribution, which satisfies P(Y > y) ∼ y^{−1/ξ}, where 1/ξ is the Pareto exponent and ξ is the tail index. (This statement follows from the Pickands-Balkema-de Haan Theorem; see Balkema and de Haan (1974) and Pickands (1975), and see de Haan and Ferreira (2007) for an overview.) This single parameter captures the tail shape in the sense that a larger ξ implies a heavier tail. The assumption of model (1) simplifies the conditional tail distribution so that the covariate X affects only the location and scale, but not the shape. This is satisfied if X and Y are jointly normal but violated by many other joint distributions. Unlike mid-sample features, misspecification bias could be substantial in studying tail ones. In this paper, we consider a wider class of flexible joint distribution models using a repeated cross-sectional or panel data structure.

There are a number of reasons for which we want to study conditional tail features, such as conditional extremal quantiles, under flexible joint distribution models. First, conditional value-at-risk (VaR) is a risk measure commonly used in financial management, insurance, and actuarial science. Estimation and inference are studied by Chernozhukov and Umantsev (2001) and Engle and Manganelli (2004), among others. Adrian and Brunnermeier (2016) propose a new measure for systemic risk, ∆-CoVaR, defined as the difference between two conditional VaRs. The tail shape governs the third- and higher-order moments of the portfolio return, which typically depend on other economic factors, e.g., business cycles. As this is excluded by the location-scale model (1), it is preferable to accommodate a larger class of joint distributions. Second, Kelly and Jiang (2014) find that extreme event risk affects asset pricing in the U.S. stock market.
The shape parameter measures tail risk and varies with other stock characteristics such as stock size. Third, macroeconomists are interested in analyzing lower tails of the conditional distributions of the GDP growth rate given financial conditions in the recent growth-at-risk literature; see Adrian, Boyarchenko, and Giannone (2019) for example. Fourth, top wealth inequality is an active research question in the macro-finance literature (see, for example, Piketty and Saez (2003), Gabaix, Lasry, Lions, and Moll (2016), and Jones and Kim (2018)). The tail of the wealth distribution is well documented to follow Pareto, and the exponent is in general a function of fundamentals in general equilibrium models. For example, Beare and Toda (2017) derive a formula for the Pareto exponent and comparative statics results, and Toda (2019) applies that formula in a general equilibrium setting. (Wang and Li (2013) formally establish that the location-shift model assumption is equivalent to assuming ξ remains constant across x. With this said, we remark that the existing literature suggests a couple of ways in which one can rationalize a possibly misspecified quantile regression. Angrist, Chernozhukov, and Fernández-Val (2006) show that the parametric linear quantile regression function minimizes a weighted distance to the true nonparametric quantile regression function. Kato and Sasaki (2017) show that the linear quantile regression parameter is a weighted average of the slopes of the true nonparametric quantile regression function.) One existing class of approaches is semiparametric; it assumes, for example, that ξ(x) equals exp(x⊤θ) for some unknown parameter θ. Wang and Li (2013) assume that the Box-Cox transformed Y has linear conditional quantiles in X.
The second class is fully nonparametric and constructs local smooth estimators, including, for example, Beirlant, Joossens, and Segers (2004), Gardes, Girard, and Lekina (2010), Gardes, Guillou, and Schorgen (2012), Daouia, Gardes, and Girard (2013), and Martins-Filho, Yao, and Torero (2018).

In this article, we focus on statistical inference rather than estimation, and provide confidence intervals (CIs) for a conditional extremal quantile that have preferred coverage and length properties. Our proposed method applies to both repeated cross-sectional data and panel data. The main idea is very intuitive. Consider the case of using panel data of (Y, X) to fix ideas, and suppose that one is interested in the conditional extremal quantile of Y given X = x, denoted by Q_{Y|X=x}(τ). If, for every individual, there exists some time period in which X takes the value x, then we can simply collect the associated Y's and form a cross-sectional sample from F_{Y|X=x}. Since this is infeasible, especially when X is continuous, we instead collect from each individual's time series the induced Y associated with the X that is the nearest neighbor (NN) of x. These induced Y's approximately stem from F_{Y|X=x}, and the large (respectively, small) order statistics among them can be used for inference about the upper (respectively, lower) conditional extremal quantile Q_{Y|X=x}(τ). For multi-dimensional covariates, this is done by defining the NN in terms of a chosen metric, such as the one induced by the Euclidean norm. If a linear regression model is appropriate, then the NN can also be defined using the linear index.

The above approximation approach is formalized by establishing a new extreme value (EV) theory. The theory is based on large-n and large-T asymptotics, where n and T denote the cross-sectional and time-series sample sizes, respectively. A large T guarantees that the NN is close enough to the query point x, while a large n provides enough observations from a more accurate tail sample.
Given the new EV theory, we apply it to construct new confidence intervals for the conditional extremal quantiles.

Our proposed approach requires only some smoothness condition on the joint distribution and hence enjoys more robustness against functional form specification than existing methods. A natural question is how much efficiency we lose by using only one out of T observations in each time series. It turns out that if the tail shape depends on the covariate highly nonlinearly, then our proposed NN method dominates existing methods in both coverage and length when T is only moderately large, say 50. When T is very large, say 500, the new CIs also deliver comparable lengths to the kernel regression method with the optimal bandwidth; see the Monte Carlo results in Section 3 for more details.

As a by-product of our main result, we also develop CIs for extremal quantiles of the coefficients in a random coefficient regression model. In particular, suppose that Y_it and X_it are generated from the model Y_it = α_i + X_it⊤ β_i + u_it, where (α_i, β_i⊤)⊤ is a random vector drawn from some unknown distribution. We first construct the least squares estimators of α_i and β_i using the i-th time series for each i and collect the largest (smallest) order statistics from these estimates. We then show that the estimation error is negligible under the large-n and large-T framework, and hence the largest (smallest) order statistics among these estimates again satisfy the desired EV theory, which further supports the application of the fixed-k CIs to extremal quantiles of α_i and β_i. This complements the existing literature focusing on the mid-sample properties of heterogeneous effects (e.g., Hsiao and Pesaran (2004) and Wooldridge (2005)).

Applying the proposed methods, we study the tail risk of extremely low birth weight conditional on mothers' behavioral and demographic characteristics.
We find that the signs of major effects are the same as those found in preceding studies based on parametric models. On the other hand, we find that some effects exhibit different magnitudes from those reported in the previous studies based on parametric models.

The rest of the paper is organized as follows. Section 2 presents the main results of this paper. Section 3 presents Monte Carlo simulation studies. Section 4 presents an empirical application. Section 5 concludes the paper. All mathematical proofs and additional details are found in the appendix.

Notation
Let →p denote convergence in probability and →d denote convergence in distribution as n, T → ∞. (See Section 3 for concrete numerical settings corresponding to such qualitative phrases.) Let 1[A] denote the indicator function of a generic event A. Let ||B|| denote the Euclidean norm of a vector or matrix B, and let C denote a generic constant whose value may change across lines. Let B_δ(x) denote a generic open ball centered at x with radius δ. When X denotes a column vector and c a scalar, the notation X − c is understood as the vector X − (c, ..., c)⊤.

We present the main result of this paper in this section. Let X denote a dim(X) × 1 covariate vector. The main object of interest is the conditional extremal quantile Q_{Y|X=x}(τ) of Y given X = x for a pre-specified x ∈ R^{dim(X)} and τ → 1. For ease of exposition, we consider a balanced repeated cross-sectional or panel data set {Y_it, X_it}_{i=1:n, t=1:T} that is i.i.d. across i and strictly stationary and weakly dependent across t. (While we focus on continuous random variables in our presentation, the method can also accommodate discrete random variables. Suppose that the covariate vector is written as (X′, W′)′, where the subvector X consists of continuous random variables and the subvector W consists of discrete random variables, and that one is interested in the conditional extremal quantiles given (X′, W′)′ = (x′, w′)′. We can then extract the subsample with W = w and apply our proposed method to that subsample. The balanced structure is only for notational ease.) Section 2.1 presents an informal overview of our proposed method, and Section 2.2 gives a formal theoretical justification. Finally, Section 2.3 presents an extension of the main theoretical results to linear random coefficient models.

Our method consists of the following three steps. In the first step, we make use of the repeated cross-sectional or panel data structure by selecting a subsample induced by the distances of the covariates {X_it} to the query point x. The subsample will be a k × 1 vector denoted by Y. In the second step, by appealing to extreme value theory, we show that, after some normalization, Y converges in distribution to a well-defined limiting random variable V, whose distribution f_V is parametric and uniquely determined by the tail features of F_{Y|X=x}. In particular, f_V is uniquely characterized by a scalar parameter ξ that fully captures the tail heaviness of F_{Y|X=x}. Note that ξ depends on the query point x, which is suppressed in our notation for simplicity when there is no confusion. Since Q_{Y|X=x}(τ) can also be uniquely expressed as a function of ξ after suitably normalizing τ,
the problem reduces to finite-sample inference about ξ(x) given a random draw V. (The approach remains valid as long as T is large for all i.) This type of problem is studied by Elliott, Müller, and Watson (2015), who provide a generic argument to construct optimal inference when there exists a nuisance parameter under a null hypothesis. In the third and final step, we tailor their arguments to inference about Q_{Y|X=x}(τ) with ξ being the nuisance parameter. The next three subsubsections introduce the details of these three steps in order. Section 2.2 then follows up by presenting regularity conditions and the main theoretical result of the paper, which guarantees that the confidence interval constructed in the three-step procedure controls coverage asymptotically and uniformly over a set of data generating processes.

First, we select our subsample Y as follows.

• Collect, for each i, the induced Y associated with the NN of {X_it}_{t=1}^T to x, where the NN is measured by the Euclidean distance ||X_it − x||. Denote these by {Y_{i,[x]}}_{i=1}^n.

• Take the largest k order statistics from {Y_{i,[x]}}_{i=1}^n and denote the vector of them by

Y = (Y_{(1),[x]}, Y_{(2),[x]}, ..., Y_{(k),[x]})⊤,    (2)

where Y_{(1),[x]} ≥ Y_{(2),[x]} ≥ ... ≥ Y_{(n),[x]} are the order statistics of {Y_{i,[x]}}_{i=1}^n.

The key idea behind this selection is heuristically illustrated by the following derivation; a formal argument is presented in the proof of Theorem 1 in Appendix A.1. For each i, denote the NN among {X_it}_{t=1}^T to x by X_{i,(x)}. Then for any y ∈ R,

P(Y_{i,[x]} ≤ y) = E_{X_{i,(x)}}[ P(Y_{i,[x]} ≤ y | X_{i,(x)}) ]
                = E_{X_{i,(x)}}[ F_{Y|X=X_{i,(x)}}(y) ]    (by strict stationarity)
                = F_{Y|X=x}(y) + E_{X_{i,(x)}}[ (∂F_{Y|X=x}(y)/∂x⊤)|_{x=ẋ_i} (X_{i,(x)} − x) ]    (by mean value expansion)
                → F_{Y|X=x}(y)    as T → ∞.

Details of this step with more notation are as follows.
(For each i ∈ {1, ..., n} and t ∈ {1, ..., T}, compute d_it = ||X_it − x||, where ||·|| denotes the Euclidean distance. Then, for each i ∈ {1, ..., n}, let t*_i denote the argument t that minimizes d_it. We denote Y_{i,[x]} = Y_{i t*_i}.)

In the above derivation, ẋ_i lies between X_{i,(x)} and x. The first equality is by the definition of conditional expectation. The second one follows from strict stationarity. The third equality is valid if the conditional CDF is smooth. The last convergence holds if the NN converges to its query point x and if the CDF is smooth with bounded derivatives.

The above derivation states that the collection of induced order statistics Y associated with the NN of x can be treated as approximately stemming from the true conditional CDF F_{Y|X=x} asymptotically. Thus the largest (cross-sectional) order statistics Y can be treated as draws from the tail of F_{Y|X=x}.

To proceed with the second step, we need some regularity conditions on F_{Y|X=x}. For readability, we introduce only one of the conditions here, with the rest discussed in Section 2.2. Specifically, we assume that F_{Y|X=x} is within the domain of attraction (DOA) of the extreme value distribution (denoted by F_{Y|X=x} ∈ D(G_ξ)), in the sense that there exist sequences of constants a_n and b_n such that for every v,

lim_{n→∞} F_{Y|X=x}^n(a_n v + b_n) = G_ξ(v),

where

G_ξ(v) = { exp(−(1 + ξv)^{−1/ξ}), 1 + ξv > 0, for ξ ≠ 0
        { exp(−e^{−v}), v ∈ R, for ξ = 0.    (3)

This DOA condition is extensively studied in the statistics literature and is satisfied by many commonly used distributions, including, for example, the Pareto, Student-t, F, Gaussian, and even uniform distributions. See Chapter 1 in de Haan and Ferreira (2007) for a complete review.

Under the DOA assumption and the cross-sectional i.i.d. assumption, we show that for any fixed k,

(Y − b_n)/a_n →d V = (V_1, ..., V_k)⊤, as n, T → ∞,    (4)

where the joint probability density function (PDF) of V is given by

f_V(v_1, ..., v_k; ξ) = G_ξ(v_k) ∏_{i=1}^k g_ξ(v_i)/G_ξ(v_i)    (5)

for v_k ≤ v_{k−1} ≤ ... ≤ v_1, with g_ξ(v) = ∂G_ξ(v)/∂v, and zero otherwise.

Note that the constants a_n and b_n depend on ξ, and their estimates thus tend to exhibit large sensitivity to tail observations. For example, a_n is n^ξ if F_Y is standard Pareto. Since a small estimation error in ξ is amplified by the n-power, inference relying on a good estimate of ξ and the scale usually requires a large k and an even larger sample size n. Besides, the G_ξ(v_k) term in (5) implies that the largest k order statistics are not asymptotically independent, for any fixed k.

We aim for a (1 − α) CI for the conditional extremal quantile Q_{Y|X=x}(τ) for τ close to 1. Specifically, we rewrite τ as 1 − h/n for some fixed h > 0, viewing the target quantile as an extremal one among n random draws from the true conditional CDF F_{Y|X=x}. Our objective is to construct a confidence set S(Y) ⊂ R such that P(Q_{Y|X=x}(τ) ∈ S(Y)) ≥ 1 − α + o(1) as n → ∞ and T → ∞. Under the DOA assumption, calculations show that

(Q_{Y|X=x}(1 − h/n) − b_n)/a_n → q(ξ, h) ≡ { (h^{−ξ} − 1)/ξ if ξ ≠ 0
                                           { −log(h) if ξ = 0.

Note that q(ξ, h) is the exp(−h) quantile of V_1. Since the normalization (a_n, b_n) is shared by both Y and Q_{Y|X=x}(1 − h/n), we can impose location and scale equivariance on the CI to cancel it out.
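For concreteness, the first-step subsample selection of Section 2.1.1 can be sketched in code. This is a minimal sketch under our own variable naming (`Y` is n × T, `X` is n × T × dim(X), `x0` is the query point); it is not the authors' implementation:

```python
import numpy as np

def top_k_nn_subsample(Y, X, x0, k):
    """First-step selection: for each unit i, keep the Y associated with the
    time period whose covariate is nearest to the query point x0 (Euclidean
    distance), then return the largest k order statistics in descending order."""
    n, T, d = X.shape
    # Nearest-neighbor time index t*_i for each i: argmin_t ||X_it - x0||.
    dist = np.linalg.norm(X - x0.reshape(1, 1, d), axis=2)   # shape (n, T)
    t_star = dist.argmin(axis=1)                             # shape (n,)
    Y_induced = Y[np.arange(n), t_star]                      # Y_{i,[x0]}
    # Largest k order statistics Y_(1) >= ... >= Y_(k).
    return np.sort(Y_induced)[::-1][:k]
```

The returned vector plays the role of Y in (2); the weak dependence across t is immaterial to the selection itself, which uses exactly one observation per unit.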
Specifically, we impose that for any constants a > 0 and b, S(aY + b) = aS(Y) + b, where aS(Y) + b = {y : (y − b)/a ∈ S(Y)}. Under this equivariance constraint, we can write

P(Q_{Y|X=x}(1 − h/n) ∈ S(Y))
= P( (Q_{Y|X=x}(1 − h/n) − Y_{(k),[x]})/(Y_{(1),[x]} − Y_{(k),[x]}) ∈ S( (Y − Y_{(k),[x]})/(Y_{(1),[x]} − Y_{(k),[x]}) ) )
→ P_ξ(V_q ∈ S(V*)),

where we introduce the self-normalized statistics

V_q = (q(ξ, h) − V_k)/(V_1 − V_k)

(We also derived the estimation and inference method based on increasing-k asymptotics in a previous version of this article. Its performance is dominated by the fixed-k approach, especially in cases with only moderate sample sizes. Therefore, we present the fixed-k result exclusively in this version for illustrational simplicity.)
Among all the solutions to this problem,we choose the optimal one that minimizes the weighted average expected length criterion (cid:90) E ξ [lgth( S ( V ))] dW ( ξ ), (7)where W is a positive measure with support on Ξ, and lgth( A ) = (cid:82) [ y ∈ A ] dy for any Borelset A ⊂ R . The equivariance of S further implies E ξ [lgth( S ( V ))] = E ξ [( V − V k )lgth( S ( V ∗ ))].Thus the program of minimizing (7) subject to (6) among all equivariant sets S asymptoti-cally becomes min S ( · ) (cid:82) Ξ E ξ [( V − V k )lgth( S ( V ∗ ))] dW ( ξ )s.t. P ξ ( V q ∈ S ( V ∗ )) ≥ − α for all ξ ∈ Ξ , (8)where we abuse the notation of E ξ and P ξ to emphasize that the distributions of V q and V ∗ depend on ξ (and further on x ). Note that any solution to (8) also provides the form of S , that is, S ( V ) = ( V − V k ) S ( V ∗ ) + V k . Once S ( · ) is determined, therefore, the confidenceinterval can be constructed in practice by plugging in( Y (1) , [ x ] − Y ( k ) , [ x ] ) S (cid:18) Y − Y ( k ) , [ x ] Y (1) , [ x ] − Y ( k ) , [ x ] (cid:19) + Y ( k ) , [ x ] . In solving (8), we write the problem in the following Lagrangian form:min S ( · ) (cid:90) Ξ E ξ [( V − V k )lgth( S ( V ∗ ))] dW ( ξ ) + (cid:90) Ξ P ξ ( V q ∈ S ( V ∗ )) d Λ( ξ ) , We use Ξ = [ − / , /
2] for inference about conditional extremal quantiles in later applications, whichcovers all the distributions with finite variance. This range can be easily extended. We use the uniform weight in later sections. κ ( V ∗ ; ξ ) = E ξ [ V − V k | V ∗ ] and writing the expectationsabove as integrals over the densities f V ∗ and f V q , V ∗ of V ∗ and ( V q , V ∗ ), respectively, thesolution of the above problem is given by S ( v ∗ ) = (cid:26) y : (cid:90) Ξ κ ( v ∗ ; ξ ) f V ∗ ( v ∗ ; ξ ) dW ( ξ ) < (cid:90) Ξ f V q , V ∗ ( y, v ∗ ; ξ ) d Λ( ξ ) (cid:27) . (9)The integrals can be numerically computed by Gaussian quadrature. To find suitable La-grangian weights Λ, we appeal to the generic algorithm in Elliott, M¨uller, and Watson (2015),who provide a numerical method to construct Λ. We tailor their arguments to our condi-tional extreme tail inference problem and provide the corresponding MATLAB program onthe author’s website. The computation cost is only several seconds using a modern PC. Notethat Λ only needs to constructed once by the author but not the empirical users. See SectionA.2 for more details.Other tail-related quantities, such as the conditional tail expectations, are also coveredby our proposed method as long as they can be expressed as functions of the conditional tailindex. We discuss such an extension in Section 2.3. In the following subsection, we formallyintroduce all the regularity conditions and formalize the uniform coverage property of theconfidence interval (9). Our asymptotic theory requires the following four conditions.
Condition 1.1 (Y_{i1}, X_{i1}⊤)⊤, ..., (Y_{iT}, X_{iT}⊤)⊤ are i.i.d. across i. (Y_it, X_it⊤)⊤ for each t = 1, ..., T is strictly stationary and β-mixing with the mixing coefficient satisfying β(t) = O(t^{−1−ε}) for some ε > 0. In addition, f_X(x) is uniformly continuously differentiable and bounded away from 0 in an open ball centered at x.

Condition 1.1 requires the data to be independent across i and weakly dependent across t, which is plausibly satisfied by the Natality Vital Statistics that we use for our empirical application. In addition, this condition requires the density of X to be positive in an open neighborhood around the query point x. This condition is sufficient to establish that the NN converges to the query point x almost surely at some power rate. To the best of our knowledge, this is the first result on the (almost sure and L_1) convergence rate of the NN under weak dependence, whose proof is non-trivial. We formalize this result as Lemma 1 in Appendix A.1, which might be of independent research interest. Note that we intentionally choose only one NN to allow for weak dependence across t. If data are independent across both i and t, then more than one NN can be chosen to enlarge the effective sample. We leave this for future research.

Condition 1.2 F_{Y|X=x} ∈ D(G_{ξ(x)}) with ξ(x) ∈ Ξ, a compact subset of R.

This condition requires that the underlying conditional distribution is in the domain of attraction of the generalized EV distribution. This is a mild condition, as it is satisfied by many commonly used joint distributions. In particular, it generalizes the conditional location-scale shift model (1) by allowing μ(x), σ(x), and ξ(x) to be all unknown (but smooth) functions of x. The case of negative ξ(x) is included only for comprehensiveness, since the Y in most applications involving tail features has an unbounded support that entails a non-negative ξ(x). To illustrate the mildness of this condition, we discuss the following three examples. Our Condition 1.2 is satisfied in all three of them, but the location-scale model assumption (1) is not satisfied in all of them.

Example 1 (Joint Normal)
Suppose that (Y, X) is jointly normal with zero means, unit variances, and correlation ρ. Then the conditional distribution of Y given X = x is normal with mean ρx and variance 1 − ρ². The conditional tail index is ξ(x) = 0 for all x ∈ R. The conditional quantile is Q_{Y|X=x}(τ) = ρx + √(1 − ρ²) Φ^{−1}(τ), where Φ^{−1}(·) is the quantile function of the standard normal distribution. Thus, the location-scale model assumption (1) is satisfied.

Example 2 (Joint Student-t)
Suppose that (Y, X) is jointly Student-t distributed with d.f. v, zero means, unit variances, and correlation ρ ≠ 0. (See Ding (2016) for the exact expression for the PDF.) Then the conditional distribution of Y given X = x is Student-t distributed with d.f. v + 1, mean ρx, and variance (1 − ρ²)(v + x²)/(v + 1). The conditional tail index is ξ(x) = 1/(v + 1) for all x ∈ R. The conditional quantile is Q_{Y|X=x}(τ) = ρx + √((1 − ρ²)(v + x²)/(v + 1)) Q_{t(v+1)}(τ), where Q_{t(v)}(·) is the quantile function of the standard Student-t distribution with d.f. v. This specification satisfies the location-scale shift model (1), but the scale function is highly nonlinear in x.

(The β-mixing requirement in Condition 1.1 allows the application of Berbee's lemma (Berbee (1987)) in establishing Lemma 1. It is assumed to avoid technical complexity and can be relaxed to other forms of weak dependence.)

Example 3 (Conditional Pareto) Suppose that X is half-normal with positive support and Y given X = x follows the Pareto distribution with P(Y ≤ y | X = x) = 1 − (y + 1)^{−1/x} for y ≥ 0 and x > 0. Then the conditional tail index is ξ(x) = x, and the conditional quantile is Q_{Y|X=x}(τ) = (1 − τ)^{−x} − 1, which violates the location-scale shift model (1).

Let ȳ denote the end-point of the conditional CDF, that is, ȳ = Q_{Y|X=x}(1) ≤ ∞. The next condition is a high-level regularity assumption on the smoothness of the conditional tail.

Condition 1.3 f_{Y|X=x}(y) is uniformly bounded and continuously differentiable in x and y. In addition, for any fixed y > 0, any sequence u_n = a_n y + b_n → ȳ, and any open ball B_{η_T}(x) centered at x with radius η_T ≡ O(T^{−η}) for some η > 0,

lim_{u_n→ȳ} sup_{x̃ ∈ B_{η_T}(x)} T^{−η} | (∂F_{Y|X=x̃}(u_n)/∂x̃) / (1 − F_{Y|X=x̃}(u_n)) | = 0

and

lim_{u_n→ȳ} sup_{x̃ ∈ B_{η_T}(x)} T^{−η} | (∂f_{Y|X=x̃}(u_n)/∂x̃) / f_{Y|X=x̃}(u_n) | = 0

as n → ∞ and T → ∞.

Condition 1.3 requires that the derivatives of the conditional CDF and PDF are smooth and decay quickly. This again is a mild condition, which is satisfied by the above examples by straightforward calculation. For readability, we provide low-level primitive assumptions as sufficient conditions for Condition 1.3 and discuss them in Appendix A.3.
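Example 3 can be verified numerically. The sketch below (our own code, not part of the paper's procedure) draws from the conditional Pareto model at a fixed x by inverse transform sampling and compares an empirical upper quantile with the closed form Q_{Y|X=x}(τ) = (1 − τ)^{−x} − 1:

```python
import numpy as np

def conditional_pareto_quantile(tau, x):
    # Closed-form conditional quantile from Example 3:
    # P(Y <= y | X = x) = 1 - (y + 1)^(-1/x)  =>  Q(tau) = (1 - tau)^(-x) - 1.
    return (1.0 - tau) ** (-x) - 1.0

rng = np.random.default_rng(0)
x, tau = 0.4, 0.99
# Inverse-transform draws from the conditional CDF given X = x.
u = rng.uniform(size=2_000_000)
y = (1.0 - u) ** (-x) - 1.0
emp = np.quantile(y, tau)
exact = conditional_pareto_quantile(tau, x)
```

With two million draws, the empirical 0.99 quantile agrees with the closed form to well within one percent, consistent with ξ(x) = x governing the conditional tail.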
Condition 1.4 n → ∞, T → ∞, and T/n → λ for some λ ∈ (0, ∞).

Condition 1.4 requires both n and T to be large. A large n guarantees that the error due to the EV approximation is negligible, and a large T controls the distance between the NN and the query point. The parameter λ can be any constant in (0, ∞), and hence T can be much smaller than n.

Under the above conditions, we establish the asymptotically correct uniform coverage of the confidence interval (9) in the following theorem, which is the main result of this paper.

Theorem 1
Suppose that Conditions 1.1-1.4 hold. Then, for any fixed k and any F_{Y|X=x} that satisfies these conditions,

lim inf_{n,T→∞} P( Q_{Y|X=x}(1 − h/n) ∈ S(Y) ) ≥ 1 − α,

where S(·) is determined in (9).
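To illustrate how a solved set S is deployed in practice, the sketch below (our own code) implements the self-normalization and the plug-in map S(Y) = (Y_(1) − Y_(k)) S(V*) + Y_(k). Here `S` is a stand-in stub returning an interval in normalized units, not the actual numerically solved set from (9); the point of the sketch is the location-scale equivariance of the construction:

```python
import numpy as np

def self_normalize(Y):
    # Analogue of V*: (Y - Y_(k)) / (Y_(1) - Y_(k)) for a descending-sorted Y;
    # the first entry is 1 and the last is 0, so the vector is location-scale free.
    return (Y - Y[-1]) / (Y[0] - Y[-1])

def plug_in_ci(Y, S):
    """Map a set S, defined on the self-normalized scale, back to the data
    scale: (Y_(1) - Y_(k)) * S(V*) + Y_(k). S returns (lo, hi) in normalized
    units; here it is a user-supplied stub standing in for the solved (9)."""
    lo, hi = S(self_normalize(Y))
    scale, shift = Y[0] - Y[-1], Y[-1]
    return scale * lo + shift, scale * hi + shift
```

Because self_normalize(aY + b) equals self_normalize(Y) for any a > 0 and b, the resulting interval satisfies the equivariance S(aY + b) = aS(Y) + b imposed in Section 2.1.3.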
We conclude this subsection with a discussion of the main properties, advantages, and disadvantages of our proposed fixed-k confidence interval. First, since the confidence interval is based on a fixed number k of tail observations, its length does not decrease in n. On the other hand, the length decreases in k. Second, unlike kernel regression approaches that require a sequence of moving tuning parameters (i.e., the bandwidth parameter tending to zero), our method relies only on a 'fixed' tuning parameter, namely k. While common data-driven choice rules for bandwidths are not theoretically compatible with inference because of their failure to undersmooth estimates, our method based on any fixed k of a researcher's choice guarantees asymptotically valid inference. Third, our fixed-k approach allows the confidence interval to have a uniform size control property over a set of data generating processes involving a set of values of the tail index, while the existing methods have not been shown to share this uniformity property. This property of our fixed-k approach is useful because ξ is practically unknown to researchers, and thus the size control should be uniform over all values of ξ that are empirically relevant.

The proof strategy for our main result, namely Theorem 1, and thus our proposed method of constructing confidence intervals, apply to other contexts. Among others, inference for extremal quantiles of random coefficients in linear regression models is also possible with our proposed strategy. In this section, we study this class of models, which have been widely used in empirical studies in economics.

Consider the model

Y_it = α_i + X_it⊤ β_i + u_it,    (10)

where (α_i, β_i⊤)⊤ denotes random coefficients and u_it denotes an error term. This setup has been studied by numerous papers in the literature, and covers the classic panel linear regression model with fixed effects in which β_i = β for all i.
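The estimation step for model (10) can be sketched as follows (our own minimal sketch with a scalar regressor; `Y` and `X` are n × T arrays): run OLS of Y_it on (1, X_it) separately for each unit and collect the largest k order statistics of the slope estimates.

```python
import numpy as np

def top_k_slopes(Y, X, k):
    """Per-unit OLS of Y_it on (1, X_it), then the k largest slope estimates
    in descending order (the vector B of the text, for the first coordinate
    of beta_i). With large T these behave like order statistics of beta_i."""
    n, T = Y.shape
    slopes = np.empty(n)
    for i in range(n):
        Z = np.column_stack([np.ones(T), X[i]])          # regressors (1, X_it)
        coef, *_ = np.linalg.lstsq(Z, Y[i], rcond=None)  # (alpha_i_hat, beta_i_hat)
        slopes[i] = coef[1]
    return np.sort(slopes)[::-1][:k]
```

The analogous vector A for the intercepts replaces `coef[1]` with `coef[0]`.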
As long as Conditions 1.1-1.4 are satisfied, the previously introduced methods naturally apply here for inference on the conditional extremal quantiles of Y. In addition, model (10) allows us to conduct inference on the unconditional tail features of the random coefficients, α_i and β_i. The remainder of this subsection illustrates a procedure to this end.

Let (α̂_i, β̂_i')' be the OLS estimator obtained by regressing Y_it on (1, X_it')' using the time series associated with the i-th individual. Collect {(α̂_i, β̂_i')'}_{i=1}^n and sort each series of estimates in descending order. We then define A = (α̂_(1), ..., α̂_(k))', that is, the largest k order statistics of {α̂_i}, and B_j = (β̂_{j,(1)}, ..., β̂_{j,(k)})', that is, the largest k order statistics of the j-th coordinate of {β̂_i}_{i=1}^n, for each j. Without loss of generality, we focus on the first coordinate of β_i and suppress the subscript j from our notation for simplicity.

Now, we substitute A or B in S(·) as in (9) to construct the confidence interval for extremal quantiles of α_i or β_i. The following conditions are imposed for a theoretical guarantee of correct asymptotic coverage.

Condition 2.1 (α_i, β_i', u_it, X_it')' are i.i.d. across i and strictly stationary and weakly dependent across t.

Condition 2.2 F_α ∈ D(G_{ξ_α}) and F_β ∈ D(G_{ξ_β}) with ξ_α ∈ Ξ and ξ_β ∈ Ξ.

Condition 2.3 sup_i ||(α̂_i, β̂_i')' − (α_i, β_i')'|| = o_p(1), sup_i |ū_i| = o_p(1), and sup_i ||X̄_i|| = O_p(1), where X̄_i = T^{−1} Σ_{t=1}^T X_it and ū_i = T^{−1} Σ_{t=1}^T u_it.
In addition, if ξ_ω = 0, then sup_i ||(α̂_i, β̂_i')' − (α_i, β_i')'|| / f_ω(Q_ω(1 − 1/n)) = o_p(1) and sup_i |ū_i| / f_ω(Q_ω(1 − 1/n)) = o_p(1) for ω = α or β, where Q_ω(·) and f_ω(·) denote the quantile function and the PDF of ω, respectively. Alternatively, if ξ_ω < 0 and n^{−ξ_ω} T^{−1/2} →
0, then sup_i ||X̄_i|| = O_p(1), sup_i ||(α̂_i, β̂_i')' − (α_i, β_i')'|| = O_p(T^{−1/2}), and sup_i |ū_i| = O_p(T^{−1/2}).

Condition 2.1 is similar to Condition 1.1. Since the objects of interest are the unconditional extremal quantiles of α_i and β_i, we do not need the NN condition on covariates, and the dependence structure is left unspecified as long as it is sufficient for Condition 2.3. Condition 2.2 assumes that the distributions of α_i and β_i are in the domains of attraction of G_{ξ_α} and G_{ξ_β}, respectively. Condition 2.3 requires that the estimator (α̂_i, β̂_i')' be consistent for all i and that the moments of the sample averages of u_it and X_it across t be bounded. If the tail index is non-positive, then these bounds need to be stronger to accommodate the fact that a_n → 0. A straightforward calculation yields that the normal distribution satisfies Condition 2.3 if sup_i ||(α̂_i, β̂_i')' − (α_i, β_i')'|| = O_p(T^{−ε}) and sup_i |ū_i| = O_p(T^{−ε}) for some ε > 0 and n/T → λ for some λ ∈ (0, ∞). This can be seen from 1/f_α(Q_α(1 − 1/n)) ≤ O(log(n)) when f_α and Q_α are the standard normal density and quantile functions, respectively (cf. Example 1.1.7 in de Haan and Ferreira (2007)).

Corollary 1
Suppose that Conditions 1.4 and 2.1-2.3 hold. For any fixed k and any F_α and F_β that satisfy these conditions,

lim inf_{n,T→∞} P( Q_α(1 − h/n) ∈ S(A) ) ≥ 1 − α  and  lim inf_{n,T→∞} P( Q_β(1 − h/n) ∈ S(B) ) ≥ 1 − α,

where S(·) is defined in (9).

We conduct Monte Carlo experiments to examine the small-sample performance of the new approach. In Section 3.1, we first consider the simple panel data {Y_it, X_it} without any fixed effect. In Section 3.2, we compare the efficiency of the new approach with the kernel estimator, which essentially uses more than one NN. In Section 3.3, we consider the linear random coefficient regression setup (10).

We continue to consider the three examples in Section 2.1 as the data generating processes (DGPs). In all experiments, generated data are i.i.d. across i but dependent across t. The dependence structure across t is specified as follows.
1. Joint Normal: X_it = ρX_{it−1} + u_it with u_it ~ iid N(0, 1 − ρ²) and X_i1 ~ N(0, 1); Y_it = r_xy X_it + (1 − r_xy²)^{1/2} v_it, where v_it ~ iid N(0, 1) and is independent of u_it. Set ρ = 0. and r_xy = 0. .
2. Joint Student-t: (X_it, Y_it) is i.i.d. across t and distributed as t_v(µ, Σ) with v = 3, µ = [0, 0]', and Σ = [1, 0.5; 0.5, 1].
3. Conditional Pareto: X_it = ρX_{it−1} + u_it with u_it ~ iid N(0, 1 − ρ²) and X_i1 ~ N(0, 1); Y_it | X_it = x ~ Pa(ξ(x)), that is, P(Y_it ≤ y | X_it = x) = 1 − y^{−1/ξ(x)} for y ≥ 1, with ξ(x) = x² + 0.5.

We construct CIs for Q_{Y|X=x}(1 − h/n) with x = 0 and 1.65 (the 50% and 95% quantiles of X, respectively) and h = 1 and 5. The sample sizes n and T are either 200 or 500, with smaller combinations exercised in later experiments.

We compare results across three approaches: (i) the fixed-k approach (fixed-k) introduced in this paper, (ii) quantile regression (QR), and (iii) bootstrapping the empirical quantile (Boot). We produce the fixed-k CI using k = 20 in most cases if not otherwise noted. The space of ξ is restricted to [−1/2, 1/2]. For QR, we regress Y_it on X_it and a constant at the τ-th quantile for each i; the conditional quantile is estimated by β̂_{0i} + x β̂_{1i}, where β̂_{0i} and β̂_{1i} are the coefficient estimates using the i-th individual's observations, and the CI is defined by the 2.5% and 97.5% quantiles of these n estimates. The bootstrap CI is based on bootstrapping the empirical τ-th quantile in {Y_{i,[x]}}_{i=1}^n, with bootstrap size 200.

Tables 1-3 report the coverage probabilities (Cov) and the average lengths (Lgth) of the above three methods based on 500 simulation draws. The fixed-k approach performs well in terms of both coverage and length across all the specifications. Regarding the QR method, recall that the conditional quantile is a linear function of X in the first DGP but not in the other two. Not surprisingly, then, the CIs based on QR perform well in the first DGP but deliver substantial undercoverage and longer length in the other two due to misspecification. The bootstrap approach is robust to misspecification but requires the asymptotic normal approximation, which performs well only in the mid-sample and not near the tails. As such, the bootstrap intervals exhibit more undercoverage for h = 1 than for h = 5.

We conclude this subsection with a remark about the choice of k. A larger k yields more tail observations and hence shorter confidence intervals, but is subject to a larger approximation bias due to the inclusion of too many mid-sample observations.
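The bootstrap benchmark (iii) described above can be sketched as follows; the data vector, quantile level, and seed are illustrative stand-ins, while the bootstrap size of 200 matches the text.

```python
import numpy as np

rng = np.random.default_rng(1)
y = rng.standard_normal(200)    # stand-in for the induced sample {Y_i,[x]}, n = 200
tau = 0.975                     # target quantile level (illustrative)
B = 200                         # bootstrap size, as in the text

# Bootstrap the empirical tau-quantile and form a percentile interval
boot_q = np.empty(B)
for b in range(B):
    resample = rng.choice(y, size=y.size, replace=True)
    boot_q[b] = np.quantile(resample, tau)
ci_lo, ci_hi = np.quantile(boot_q, [0.025, 0.975])
```

Because the empirical τ-th quantile at levels this extreme is effectively an extremal order statistic, the normal approximation behind this percentile interval deteriorates in the tail, consistent with the undercoverage of Boot reported in Tables 1-3.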
This bias-variance trade-off makes the choice of k difficult, especially when n is only moderate. It is in fact impossible to choose a uniformly best k when the underlying CDF is allowed to be flexible (see Theorem 1 of Müller and Wang (2017)). The CDFs in our Monte Carlo designs are all well behaved, so that a value of k as large as 40% of the sample size performs well. This is seen in Table 3, which reports the numbers for k = 20 and 50.

Our new approach takes only one NN in each time series, which raises the question of efficiency loss. We answer this by comparing our fixed-k approach with the kernel smoothing

Table 1: Finite sample performance of inference about conditional extremal quantile, no model specification

                n = 200 (97.5% quantile)        n = 500 (99% quantile)
                T = 200         T = 500         T = 200         T = 500
                Cov     Lgth    Cov     Lgth    Cov     Lgth    Cov     Lgth
Joint Normal
Boot            0.93    8.30    0.93    7.80    0.94    15.3    0.90    12.7

Note: Entries are coverages and lengths of the CIs for Q_{Y|X=0}(1 − 5/n). See the main text for the description of the three approaches and the data generating processes. The nominal confidence level is 95%. Based on 500 simulation draws.

Table 2: Finite sample performance of inference about conditional extremal quantile, no model specification

                n = 200 (97.5% quantile)        n = 500 (99% quantile)
                T = 200         T = 500         T = 200         T = 500
                Cov     Lgth    Cov     Lgth    Cov     Lgth    Cov     Lgth
Joint Normal
Boot            0.95    12.5    0.96    8.91    0.80    26.8    0.93    15.2

Note: Entries are coverages and lengths of the CIs for Q_{Y|X=1.65}(1 − 5/n). See the main text for the description of the three approaches and the data generating processes. The nominal confidence level is 95%. Based on 500 simulation draws.

Table 3: Finite sample performance of inference about conditional extremal quantile

                        n = 200 (99.5% quantile)        n = 500 (99.8% quantile)
                        T = 200         T = 500         T = 200         T = 500
                        Cov     Lgth    Cov     Lgth    Cov     Lgth    Cov     Lgth
Joint Normal
fixed-k (k=20)          0.95    1.82    0.96    1.83    0.97    1.69    0.96    1.70
QR                      1.00    1.19    1.00    0.75    1.00    1.18    1.00    1.07
Boot                    0.63    0.62    0.64    0.59    0.64    0.57    0.65    0.59
Joint Student-t
fixed-k (k=20)          0.96    4.71    0.96    4.69    0.96    5.62    0.97    5.61
fixed-k (k=50)          0.94    3.91    0.92    3.90    0.95    4.85    0.92    4.73
QR                      1.00    8.51    0.68    5.51    1.00    8.47    1.00    11.5
Boot                    0.62    2.01    0.60    2.02    0.63    2.57    0.61    2.56
Conditional Pareto
fixed-k (k=20)          0.98    27.6    0.98    26.1    0.94    48.1    0.97    40.5
QR                      0.00    >       >       >       >
Boot                    0.71    25.9    0.63    30.4    0.78    76.2    0.77    43.9

Note: Entries are coverages and lengths of the CIs for Q_{Y|X=0}(1 − 1/n). See the main text for the description of the three approaches and the data generating processes. The nominal confidence level is 95%. Based on 500 simulation draws.

method proposed by Gardes, Girard, and Lekina (2010). In particular, we first pool the panel data into a cross-sectional sample. Suppose the object of interest is still Q_{Y|X=x}(τ). We follow Gardes, Girard, and Lekina (2010) and pick the bin B_{b_nT}(x) centered at x with a bandwidth b_nT. Since there is no theoretical justification for the optimal choice of b_nT, we take the rule-of-thumb choice c(nT)^{−1/5} with different values of the constant c. A given choice of b_nT then leads to a certain collection of Y's whose paired X's are in the bin B_{b_nT}(x). Sort these induced Y's in descending order into {Y_(1) ≥ Y_(2) ≥ ... ≥ Y_(m)}, where m denotes the local sample size determined by the bandwidth. This local sample size is approximately nT b_nT under the kernel smoothing (as opposed to n in our new approach). Given the induced Y's, the conditional quantile is estimated as Q̂_{Y|X=x}(τ) = Y_(⌊(1−τ)m⌋), that is, the ⌊(1−τ)m⌋-th largest order statistic of the induced Y's, where ⌊(1−τ)m⌋ denotes the integer part of (1 − τ)m.
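A sketch of the pooled binning estimator just described, under the conditional Pareto design with ξ(x) = x² + 0.5; the pooled sample size, query point, bandwidth constant, and the (nT)^{−1/5} rule-of-thumb form are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
nT = 100_000                    # pooled cross-sectional sample size (illustrative)
x0, c = 0.0, 0.5                # query point and bandwidth constant (illustrative)
b = c * nT ** (-1 / 5)          # assumed rule-of-thumb bandwidth c (nT)^(-1/5)

# Conditional Pareto DGP: P(Y <= y | X = x) = 1 - y^(-1/xi(x)), xi(x) = x^2 + 0.5
X = rng.standard_normal(nT)
Y = rng.random(nT) ** (-(X ** 2 + 0.5))

# Keep the Y's whose X falls within b of x0, sorted in descending order
Ys = np.sort(Y[np.abs(X - x0) <= b])[::-1]
m = Ys.size                     # local sample size determined by the bandwidth

# Estimate the conditional tau-quantile by the floor((1-tau)m)-th largest value
tau = 0.99
q_hat = Ys[max(int(np.floor((1 - tau) * m)) - 1, 0)]
# True value at x0 = 0 under this design: Q(0.99) = 0.01^(-0.5) = 10
```

Shrinking b trades smoothing bias against the local sample size m, which is exactly the bandwidth sensitivity documented in Table 4 below.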
Gardes, Girard, and Lekina (2010) show that, under m(1 − τ) → ∞ and some other regularity conditions,

(m(1 − τ))^{1/2} ( Q̂_{Y|X=x}(τ)/Q_{Y|X=x}(τ) − 1 ) →_d N(0, ξ²(x)).

The CI of Q_{Y|X=x}(τ) is then constructed by the delta method, plugging in some consistent estimator of ξ. One choice that they propose is the Hill-type estimator

1/ξ̂ = (k − 1)^{−1} Σ_{i=1}^{k−1} i log( Y_(i)/Y_(i+1) )    (11)

for some choice of k < m.

For comparison, we implement our fixed-k approach using the panel data and the above kernel approach by pooling the data. In particular, we implement the conditional Pareto DGP of the previous experiment with n = 200 and T ranging from 50 to 500. For the fixed-k CI, we set k = 50. For the kernel method, we implement c ∈ {0.1, 0.25, 0.5, 1, 2} and set k (in the Hill-type index estimator (11)) as the largest integer less than or equal to m/2.

Table 4 reports the coverage probabilities and lengths of the fixed-k and the kernel CIs. Several interesting observations can be made. First, the kernel approach is sensitive to the choice of the bandwidth. In particular, correct coverage relies on a narrow window of bandwidth choices; a larger choice can lead to substantial undercoverage since the smoothing bias dominates quickly in the tail. Second, when T is only moderately large (say 25 and 50), the fixed-k CIs are much shorter than the kernel ones, and both have good coverage properties. This is because the fully nonparametric method ignores the domain-of-attraction information, which is utilized by our fixed-k method. Third, when T is very large, say 500, choosing only one NN does incur an efficiency loss, as a comparison of the lengths of our fixed-k and the kernel CIs shows. But such a loss is approximately a factor of two or three instead of T^{1/2}.
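The weighted log-spacings statistic in (11) can be sketched as follows; in this standard Hill form it consistently estimates the tail index ξ of a Pareto-type tail (the sample size, the choice of k, and the exact Pareto design are illustrative assumptions).

```python
import numpy as np

def hill_tail_index(y_desc, k):
    """Hill-type estimate built from the weighted log-spacings in (11),
    using the k largest order statistics y_desc[0] >= y_desc[1] >= ..."""
    i = np.arange(1, k)
    return float(np.sum(i * np.log(y_desc[: k - 1] / y_desc[1:k])) / (k - 1))

rng = np.random.default_rng(3)
y = np.sort(rng.random(100_000) ** -1.0)[::-1]   # exact Pareto tail with xi = 1
xi_hat = hill_tail_index(y, k=2000)
```

As the surrounding text notes, the choice k < m trades variance against the bias from including too many mid-sample observations.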
These comparisons indicate that a general covariate-dependent tail is very difficult to estimate in a fully nonparametric way.

In Table 5, we consider a two-dimensional standard normal X and generate Y_it by Y_it | X_it = x ~ ±Pa(ξ(x)) with ξ(x_1, x_2) = x_1² + x_2² + 0.5. The kernel method is illustrated with c ∈ {0.5, 1, 2, 4}. All the other parameter choices for both methods remain unchanged from those used for Table 4. The results clearly suggest that our fixed-k method together with the NN choice dominates the kernel method in terms of both coverage probabilities and lengths. In particular, the kernel method suffers from the curse of dimensionality as the dimension of X increases.

As a final remark of this subsection, we also implement the standard kernel-weighted quantile regression method designed for mid-sample quantiles (cf. Chapter 10 of Li and Racine (2007)). Given a large T, the target 1 − 1/n conditional quantile is relatively in the mid-sample after pooling the panel data into a cross-sectional sample, and hence the confidence

Table 4: Finite sample performance of inference about conditional extremal quantile, comparison with kernel method
                T = 50          T = 100         T = 200         T = 500
                Cov     Lgth    Cov     Lgth    Cov     Lgth    Cov     Lgth
fixed-k         0.97    21.1    0.97    19.8    0.97    16.8    0.98    15.3
NP(c=0.1)       0.91    50.1    0.89    30.0    0.94    24.4    0.93    14.6
NP(c=0.25)      0.94    33.1    0.96    19.0    0.93    13.3    0.96    9.15
NP(c=0.5)       0.93    17.1    0.94    12.7    0.93    9.28    0.95    6.36
NP(c=1)         0.93    13.9    0.90    9.84    0.89    7.14    0.88    4.77
NP(c=2)         0.37    15.6    0.24    10.1    0.14    6.86    0.11    4.18

Note: Entries are coverages and lengths of the CIs for Q_{Y|X=0}(1 − 1/n) under the conditional Pareto DGP. See the main text for the description of the two approaches and details of the DGP. The nominal confidence level is 95%. Based on 500 simulation draws.

Table 5: Finite sample performance of inference about conditional extremal quantile, comparison with kernel method, two-dimensional X
                T = 50          T = 100         T = 200         T = 500
                Cov     Lgth    Cov     Lgth    Cov     Lgth    Cov     Lgth
fixed-k         0.96    21.0    0.97    17.2    0.96    16.5    0.96    16.3
NP(c=0.5)       0.56    65.8    0.73    60.6    0.73    54.3    0.81    53.8
NP(c=1)         0.83    75.0    0.80    60.8    0.93    60.9    0.91    32.0
NP(c=2)         0.96    66.1    1.00    36.0    0.97    25.2    0.97    16.5
NP(c=4)         0.74    54.5    0.47    35.4    0.23    22.8    0.16    13.2

Note: Entries are coverages and lengths of the CIs for Q_{Y|X=0}(1 − 1/n) under the conditional Pareto DGP. See the main text for the description of the two approaches and details of the DGP. The nominal confidence level is 95%. Based on 500 simulation draws.

interval based on the asymptotic normal approximation can deliver adequate coverage only when T is substantially larger than (e.g., five times as much as) n. In our experiments, it is strictly dominated by the method proposed by Gardes, Girard, and Lekina (2010).

In this section, we consider the linear random coefficient model Y_it = α_i + X_it β + u_it, where the generated observations, including the random coefficients, are i.i.d. across i. For the time-series dependence, we set α_i = T^{−1} Σ_{t=1}^T X_it and X_it = ρX_{it−1} + e_it with e_it ~ iid N(0, 1 − ρ²) and X_i1 ~ N(0, 1). The conditional distributions of u_it given X_it = x are specified as follows.
1. Conditional Normal: u_it | X_it = x ~ N(0, x²).
2. Conditional Student-t: u_it | X_it = x ~ t(2 + |x|).
3. Conditional Pareto: u_it | X_it = x ~ ±Pa(ξ(x)), that is, P(u_it ≤ y | X_it = x) = 1 − (1 + y)^{−1/ξ(x)}/2 for y ≥ 0 and P(u_it ≤ y | X_it = x) = (−y + 1)^{−1/ξ(x)}/2 for y ≤ 0, with ξ(x) = x² + 0.5.

Note that Q_{Y_it|X_it=x}(τ) = Q_{ε_it|X_it=x}(τ) + xβ, where ε_it denotes α_i + u_it. Specifically, our fixed-k approach is conducted in two ways: with or without using the standard within least squares estimator of β. For the former (fixed-k w. LS), we first estimate β using the standard within estimator β̂ and back out ε̂_it = Y_it − X_it β̂. We then implement the steps in Section 2.1 to construct the CIs for the conditional quantiles of ε_it; the CIs for Q_{Y_it|X_it=x}(τ) are obtained by adding back x β̂. For the approach ignoring the linear regression structure (fixed-k w/o LS), we directly use (Y_it, X_it)' and apply Steps 1-3 in Section 2.1.

Table 6 presents the results for n ∈ {200, 100} and T ∈ {25, 50, 200, 500}. Several interesting observations can be made. First, the errors in the conditional t and conditional Pareto models do not have finite variances when x is 0, and hence the LS estimator of β behaves poorly. This leads to poor performance of the fixed-k approach when the linear regression model is utilized. This problem can be solved by using the least absolute deviation (LAD) estimator, as shown in unreported results. In comparison, the fixed-k CIs without

Table 6: Finite sample performance of inference about conditional extremal quantile, non-dynamic model with random effects
                        n = 200 (99.5% quantile)        n = 100 (99% quantile)
                        T = 200         T = 500         T = 25          T = 50
                        Cov     Lgth    Cov     Lgth    Cov     Lgth    Cov     Lgth
Conditional Normal
fixed-k w. LS           0.94    2.23    0.93    2.13    0.80    2.85    0.88    2.34
fixed-k w/o LS          0.93    2.22    0.92    2.14    0.53    3.17    0.76    2.62
QR                      0.00    3.04    0.00    2.19    1.00    3.10    1.00    2.96
Boot                    0.73    0.77    0.67    0.71    0.73    1.11    0.81    0.92
Conditional Student-t
fixed-k w. LS           0.95    15.3    0.94    15.5    0.93    10.0    0.96    10.4
fixed-k w/o LS          0.95    15.3    0.94    15.5    0.91    10.0    0.93    10.3
QR                      1.00    26.5    1.00    7.98    0.99    11.0    1.00    15.0
Boot                    0.56    11.5    0.60    11.0    0.47    6.32    0.51    6.01
Conditional Pareto
fixed-k w. LS           0.00    16.2    0.00    5.90    0.02    59.6    0.01    58.1
fixed-k w/o LS          0.97    18.8    0.97    16.9    0.95    16.0    0.96    15.4
QR                      0.00    >       >       >       >
Boot                    0.71    16.2    0.67    16.8    0.76    313     0.78    31.4

Note: Entries are coverages and lengths of the CIs for Q_{Y|X=0}(1 − 1/n). See the main text for the description of the different approaches and the data generating processes. The nominal confidence level is 95%. Based on 500 simulation draws.

using the linear regression model always perform well given a large enough sample size. Second, the QR approach still suffers from undercoverage in all three specifications, since the normal and the Student-t DGPs have nonlinear heteroskedasticity and the conditional Pareto DGP violates the constant tail-shape condition. Finally, the bootstrap method performs poorly if the extremal quantiles under investigation are too far in the tail.

In Table 7, we study the CIs for high quantiles of α_i and β_i with data generated from Y_it = α_i + X_it β_i + u_it, where (α_i, β_i, X_it, u_it)' ~ iid N(0, I). The i.i.d. condition holds across both i and t in this setting. We first estimate α_i and β_i by regressing Y_it on (1, X_it)' with the T observations from individual i.
We then collect the estimates α̂_i and β̂_i for all i and sort them in descending order to apply each of the fixed-k, QR, and bootstrap methods. The QR estimator is simply the empirical quantile among the estimators for all i, whose asymptotic variance is estimated by the standard kernel density estimator with the rule-of-thumb bandwidth. The results suggest that the fixed-k approach with NN dominates the other two in both coverage and length, especially when the sample size is only moderate.

Table 7: Finite sample performance of inference about large quantiles of the random coefficients
                n = 200                         n = 500
                T = 10          T = 20          T = 10          T = 20
                Cov     Lgth    Cov     Lgth    Cov     Lgth    Cov     Lgth
CIs for Q_α(1 − h/n): fixed-k
CIs for Q_β(1 − h/n): fixed-k
CIs for Q_α(1 − h/n): fixed-k
CIs for Q_β(1 − h/n): fixed-k

Note: Entries are coverages and lengths of the CIs based on (i) the fixed-k approach using the largest k = 20 estimated coefficients, (ii) the empirical quantile of the estimated coefficients with the asymptotic normal approximation, and (iii) the empirical quantile function of the estimated coefficients and the bootstrap. Data are generated from Y_it = α_i + X_it β_i + u_it, where (α_i, β_i, X_it, u_it)' ~ iid N(0, I). The target is the 1 − h/n quantile of α_i and β_i with h = 1 and 5, corresponding to the 97.5%, 99%, 99.5%, and 99.8% quantiles given n = 200 and 500, respectively. The nominal confidence level is 95%. Based on 500 simulation draws.

Empirical application to extremal birth weights
In this section, we reconsider extremely low birth weights and their relationship with mothers' demographic characteristics and maternal behaviors, an important question in health economics. We use the detailed natality data published by the National Center for Health Statistics, which have been used by Abrevaya (2001), Koenker and Hallock (2001), and Chernozhukov and Fernández-Val (2011), among many others. We follow these preceding studies, but our analysis differs from theirs in two respects. First, the preceding studies use cross-sectional data from a single time period, while we collect repeated cross-sectional samples from January 1989 to December 2002. Second, the previous studies all make some parametric model assumptions, including either the linear projection model or the (extremal) quantile regression model. In contrast, our fixed-k method is nonparametric, allowing for nonparametric joint distributions. Accordingly, some of our findings differ from those in the previous studies.

Details of our implementation are as follows. First, we follow the previously mentioned literature – Abrevaya (2001) in particular – in choosing the included covariates. Our dependent variable is the infant birth weight measured in kilograms, and the continuous covariates include mother's age and net weight gain (wtgain) during pregnancy. All the remaining covariates are discrete, and hence we consider subsamples constructed from various combinations of the categorical variables. For comparison, we set a benchmark subsample in which the infant is a boy and the mother is white and married, has less than a high school education, had her first prenatal visit in the first trimester (natal1), and did not smoke during pregnancy. Second, since the samples are repeated cross sections, it is more natural to switch the labeling of the indices i and t, and we first take the NN within each month.
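The within-month nearest-neighbor step just described can be sketched as follows; the per-month sample size and the simulated draws of (age, wtgain) and birth weight are purely illustrative stand-ins for the natality data.

```python
import numpy as np

rng = np.random.default_rng(4)
n_months, m = 168, 500               # months and per-month size (m is illustrative)
query = np.array([27.0, 30.0])       # query point: age = 27, wtgain = 30

y_nn = np.empty(n_months)
for t in range(n_months):
    # Illustrative stand-in draws of (age, wtgain) and birth weight in grams
    Z = np.column_stack([rng.normal(27, 6, m), rng.normal(30, 12, m)])
    bw = rng.normal(3300, 600, m)
    # Standardize each covariate (equivalently, scale the coordinate-wise
    # differences by the sample standard deviations), then take the
    # Euclidean nearest neighbor of the query point
    sd = Z.std(axis=0)
    dist = np.linalg.norm((Z - query) / sd, axis=1)
    y_nn[t] = bw[np.argmin(dist)]

# y_nn collects one induced observation per month (n = 168 in the application)
```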
The query point is set at age equal to 27 and wtgain equal to 30, corresponding to their respective median values. The NN is then measured by the Euclidean norm after standardizing each of the two variables to mean zero and unit variance. Using the same notation as in Section 2, we have n = 168, and T is at least 100 in every subsample. Thus, our fixed-k asymptotic framework with a large n and a large T is suitable for these data. Third, we set k = 30 based on our simulation results in the previous section and construct the 95% fixed-k confidence intervals for the conditional p-quantiles with p ranging from 1% to 10%. Figure 1 depicts these confidence intervals in the benchmark subsample and six alternative subsamples. (We chose this specific period for two reasons. First, this period contains the time period of the cross-sectional data used by Abrevaya (2001), Koenker and Hallock (2001), and Chernozhukov and Fernández-Val (2011). Second, these periods maintain identical variable definitions.)

We can make the following observations in Figure 1. First, the effects of changing the covariates are found to have a similar pattern as in the previous studies. In particular, compared with the benchmark subsample, the conditional quantile of infant birth weight decreases substantially if the mother is black, did not have a prenatal visit, and/or smoked during pregnancy. These results reconfirm the signs of the effects reported by the previous studies. Second, on the other hand, the magnitudes of these effects are larger than those documented in the previous studies. Specifically, Abrevaya (2001) finds that smoking ten cigarettes leads to approximately 200 fewer grams at the 10th percentile of the infant birth weight, compared with smoking no cigarettes. Chernozhukov and Fernández-Val (2011) find that the quantile regression coefficient associated with the number of cigarettes is nearly zero at the 1st percentile (their Figure 8).
On the other hand, the last sub-figure in Figure 1 suggests that the difference can be over 1,000 grams at the 1st percentile if we compare the mid-values between the upper and lower bounds of the confidence intervals for these two subsamples. Finally, the effects on extremal birth weight quantiles induced by the demographic characteristics vary across quantile levels, instead of remaining fixed.
This paper develops a new nonparametric method of inference for conditional extremal quantiles using a fixed number k of nearest-neighbor tail observations in repeated cross-sectional or panel data. Our proposed method has three advantages. First, it is robust against flexible distributional assumptions, unlike parametric methods. Second, the procedure yields asymptotically valid confidence intervals for any fixed tuning parameter k, unlike existing kernel methods that rely on a sequence of moving tuning parameters for asymptotically valid inference. Third, our confidence intervals enjoy the uniform coverage property over a set of data generating processes involving a set of values of the tail index. The key insight is that the induced order statistics in each time series can be treated as approximately stemming from the true conditional distribution, and the large order statistics

(For the majority of the subsamples, the number of cigarettes as recorded in the data takes only a few discrete values, including 0, 5, 10, and 20. Therefore, we treat it as a discrete random variable in our study.)
Note: This figure plots the 95% fixed-k confidence intervals for the conditional p-quantile of infant birth weight with p ranging over [0.01, 0.10].

A Appendix
This appendix provides the proof of Theorem 1, some computational details, and discussions about some primitive conditions.
A.1 Proofs
To establish Theorem 1, we first prove the following intermediate result, which establishes the rate of convergence of the NN to the query point.
Lemma 1
Under Condition 1.1, for each i and for some η > 0,

||X_{i,(1)}(x) − x|| = o_a.s.(T^{−η})    (12)

and

E[ ||X_{i,(1)}(x) − x|| ] = O(T^{−1/2}).    (13)
We first prove (12). The subscript i is suppressed for notational ease. Define D_t = ||X_t − x|| for t ∈ {1, ..., T}, which is still strictly stationary and β-mixing. By Berbee's lemma (enlarging the probability space as necessary), the process {D_t} can be coupled with a process {D*_t} that satisfies the following three properties: (i) Z_i ≡ {D_{(i−1)q_T+1}, ..., D_{i q_T}} and Z*_i ≡ {D*_{(i−1)q_T+1}, ..., D*_{i q_T}} are identically distributed for all i ∈ {1, ..., k_T}, where Z*_i is the same decomposition of {D*_t} as Z_i and k_T q_T = T; (ii) P(Z*_i ≠ Z_i) ≤ β(q_T) for all i ∈ {1, ..., k_T}; and (iii) {Z*_1, Z*_3, ...} are independent and {Z*_2, Z*_4, ...} are independent (cf. Lemma 2.1 in Berbee (1987) and Proposition 2 in Doukhan, Massart, and Rio (1995)). Suppose k_T is an even integer for simplicity, and let the U*_i be i.i.d. standard uniform random variables. Then these properties yield that

P( min_{t∈{1,...,T}} D_t > εT^{−η} )
  = P( min_{t∈{1,...,T}} D_t > εT^{−η}, {D_t}_{t=1}^T = {D*_t}_{t=1}^T ) + P( min_{t∈{1,...,T}} D_t > εT^{−η}, {D_t}_{t=1}^T ≠ {D*_t}_{t=1}^T )
  ≤_(1) P( min_{t∈{2q_T, 4q_T, ..., k_T q_T}} D*_t > εT^{−η} ) + P( {D_t}_{t=1}^T ≠ {D*_t}_{t=1}^T )
  ≤_(2) P( min_{i∈{1,2,...,k_T/2}} U*_i > F_D(εT^{−η}) ) + P( {D_t}_{t=1}^T ≠ {D*_t}_{t=1}^T )
  ≤_(3) (1 − CT^{−η})^{k_T/2} + k_T β(q_T),

where inequality (1) follows by considering one element in each even block, which are independent by property (iii) above; inequality (2) follows from the CDF transformation; and inequality (3) follows from the CDF of the standard uniform distribution and properties (ii) and (iii) above. Choosing k_T as the largest even integer no larger than 2T^{1/2} and using Condition 1.1 again yield that

Σ_{T=1}^∞ P( min_{t∈{1,...,T}} D_t > εT^{−η} ) ≤ Σ_{T=1}^∞ (1 − cT^{−η})^{T^{1/2}} + Σ_{T=1}^∞ T^{1/2} O(T^{−3/2−ε}) < ∞

for any η ∈ (0, 1/2). Then T^η ||X_(1)(x) − x|| = o_a.s.(1) is implied by the Borel-Cantelli lemma. The convergence of Σ_{T=1}^∞ (1 − cT^{−η})^{T^{1/2}} is checked by the ratio test: lim_{T→∞} (1 − c(T+1)^{−η})^{(T+1)^{1/2}} / (1 − cT^{−η})^{T^{1/2}} < 1 for η ∈ (0, 1/2).

Next we prove (13). For each block Z_i (and Z*_i), denote its minimum by min{Z_i} (and min{Z*_i}). Let E_T denote the event that {D_t}_{t=1}^T = {D*_t}_{t=1}^T. The above three properties and (12) yield that, for some constant C > 0,

E[ ||X_(1)(x) − x|| ]
  = E[ min_{t∈{1,...,T}} D_t 1{E_T} ] + E[ min_{t∈{1,...,T}} D_t 1{E_T^c} ]
  ≤_(1) E[ min_{i∈{2,4,...,k_T}} min{Z_i} 1{E_T} ] + CT^{−η} E[ 1{E_T^c} ]
  ≤_(2) E[ min_{i∈{2,4,...,k_T}} min{Z*_i} ] + CT^{−η} k_T β(q_T)
  ≤_(3) E[ min_{i∈{2,4,...,k_T}} D*_{i q_T} ] + CT^{−η} k_T β(q_T),

where inequality (1) follows from considering even blocks only and (12); inequality (2) follows from property (ii) above; and inequality (3) follows from the fact that min{Z*_i} ≤ D*_{i q_T} (the minimum value within the block Z*_i is less than or equal to the last element in that block).

The second term in the last step above is o(T^{−1/2}) by setting q_T = k_T equal to the largest even integer no larger than T^{1/2}. Regarding the first term, notice that min_{i∈{2,4,...,k_T}} D*_{i q_T} is the sample minimum of k_T/2 i.i.d. draws from F_D(·), which has the bounded lower end-point 0. Condition 1.1 implies that F_D(·) is continuously differentiable and monotonically increasing in a neighborhood of zero.
Then we have

E[ min_{i∈{2,4,...,k_T}} D*_{i q_T} ] = E[ F_D^{−1}( min_{i∈{2,4,...,k_T}} U*_i ) ]
  =_(1) E[ (1/f_D(F_D^{−1}(u̇))) min_{i∈{2,4,...,k_T}} U*_i ]
  ≤_(2) C E[ min_{i∈{2,4,...,k_T}} U*_i ]
  =_(3) O(k_T^{−1}),

where equality (1) follows from a mean value expansion with some u̇ between 0 and min_{i∈{2,4,...,k_T}} U*_i; inequality (2) follows from the fact that f_D(·) is uniformly bounded away from 0 in a neighborhood of zero, which is implied by Condition 1.1 again; and equality (3) follows from Theorem 5.3.1 in de Haan and Ferreira (2007), since the U*_i are i.i.d. standard uniform with tail index −1. So (13) is established by setting k_T equal to the largest even integer no larger than T^{1/2} again. □

Now using Lemma 1, we prove Theorem 1.
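Before the formal proof, a quick numerical illustration of Lemma 1 (not part of the argument): the mean nearest-neighbor distance to the query point shrinks as T grows. The AR(1) design with ρ = 0.5, the query point, and the simulation sizes are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(5)

def mean_min_dist(T, rho=0.5, x=0.0, reps=300):
    """Monte Carlo estimate of E min_t |X_t - x|, the distance from the
    query point x to its nearest neighbor in a stationary AR(1) series."""
    sig = np.sqrt(1.0 - rho ** 2)
    out = np.empty(reps)
    for r in range(reps):
        e = sig * rng.standard_normal(T)
        X = np.empty(T)
        X[0] = rng.standard_normal()      # start from the stationary N(0, 1) law
        for t in range(1, T):
            X[t] = rho * X[t - 1] + e[t]
        out[r] = np.abs(X - x).min()
    return out.mean()

d_small, d_large = mean_min_dist(100), mean_min_dist(1600)
# The mean nearest-neighbor distance shrinks as T grows, as Lemma 1 predicts
```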
Proof of Theorem 1
Since the proof is long, we decompose it into three steps: (i) we first establish the convergence in distribution of Y_{(1),[x]} to V_1; (ii) we next generalize it to the whole vector Y; and (iii) we finally construct the test in the limiting problem as a function of V, so that the uniform coverage is established by construction.

Step 1.
We claim that, under Conditions 1.1-1.4, there exist sequences of constants a_n > 0 and b_n depending on x such that

(Y_{(1),[x]} − b_n)/a_n →_d V_1,    (14)

where V_1 is EV distributed with (3) and ξ = ξ(x).

By Corollary 1.2.4 and Remark 1.2.7 in de Haan and Ferreira (2007), the constants a_n and b_n can be chosen as follows. If ξ(x) > 0, we choose a_n(ξ(x)) = Q_{Y|X=x}(1 − 1/n) and b_n(ξ(x)) = 0. If ξ(x) = 0, we choose a_n(ξ(x)) = 1/(n f_{Y|X=x}(b_n(x))) and b_n(ξ(x)) = Q_{Y|X=x}(1 − 1/n). If ξ(x) < 0, we choose a_n(ξ(x)) = −ξ(x)(y* − Q_{Y|X=x}(1 − 1/n)) > 0 and b_n(ξ(x)) = Q_{Y|X=x}(1 − 1/n) (Lemma 1.2.9 in de Haan and Ferreira (2007)), where recall that y* denotes the right end-point of F_{Y|X=x}. By construction, these constants satisfy 1 − F_{Y|X=x}(a_n(ξ(x)) y + b_n(ξ(x))) = O(n^{−1}) for any fixed y; hereafter we suppress ξ(x) in the notations a_n(·) and b_n(·). By strict stationarity across t (Condition 1.1),

P(Y_{i,[x]} ≤ v) = E_{X_{i,(1)}(x)}[ P(Y_{i,[x]} ≤ v | X_{i,(1)}(x)) ] = E_{X_{i,(1)}(x)}[ F_{Y|X=X_{i,(1)}(x)}(v) ]    (15)

holds for any generic argument v. Thus, we have

P(Y_{(1),[x]} ≤ a_n y + b_n)
  = F_{Y_{i,[x]}}(a_n y + b_n)^n    (by i.i.d. across i)
  = F_{Y|X=x}(a_n y + b_n)^n ( P(Y_{i,[x]} ≤ a_n y + b_n) / F_{Y|X=x}(a_n y + b_n) )^n
  = F_{Y|X=x}(a_n y + b_n)^n ( E_{X_{i,(1)}(x)}[F_{Y|X=X_{i,(1)}(x)}(a_n y + b_n)] / F_{Y|X=x}(a_n y + b_n) )^n    (by (15))
  = F_{Y|X=x}(a_n y + b_n)^n ( 1 + ( E_{X_{i,(1)}(x)}[F_{Y|X=X_{i,(1)}(x)}(a_n y + b_n)] − F_{Y|X=x}(a_n y + b_n) ) / F_{Y|X=x}(a_n y + b_n) )^n
  ≡ A_n(y) ( 1 + B_{n,T}(y)/F_{Y|X=x}(a_n y + b_n) )^n.

By the EV theory and Condition 1.2, A_n(y) → G_ξ(y) as n → ∞.
Regarding $B_{n,T}(y)$, we derive that, for some $\dot x_i$ between $X_{i,(1)}(x)$ and $x$ for each $i$, some open ball $B_{\eta_T}(x)$ centered at $x$ with radius $\eta_T=O(T^{-\eta})$, and some constant $0<C<\infty$,
$$|B_{n,T}(y)|\overset{(1)}{=}\left|E\left[\frac{\partial}{\partial\tilde x}F_{Y|X=\tilde x}(a_ny+b_n)\Big|_{\tilde x=\dot x_i}\big(X_{i,(1)}(x)-x\big)\right]\right|\overset{(2)}{\le}CT^{-\eta}\sup_{\tilde x\in B_{T^{-\eta}}(x)}\left|\frac{\partial}{\partial\tilde x}F_{Y|X=\tilde x}(a_ny+b_n)\right|$$
$$\overset{(3)}{\le}CT^{-\eta}n^{-1}\sup_{\tilde x\in B_{T^{-\eta}}(x)}\left|\frac{\partial F_{Y|X=\tilde x}(a_ny+b_n)/\partial\tilde x}{1-F_{Y|X=\tilde x}(a_ny+b_n)}\right|\overset{(4)}{=}o(n^{-1}),$$
where equality (1) is by the mean value expansion; inequality (2) follows from the fact that $X_{i,(1)}(x)\in B_{\eta_T}(x)$ holds almost surely (Lemma 1); inequality (3) is due to $1-F_{Y|X=x}(a_ny+b_n)=O(1/n)$; and equality (4) is given by Conditions 1.3–1.4. Hence, given $a_ny+b_n\to\bar y$ and using Lemma 8.4.1 in Arnold, Balakrishnan, and Nagaraja (1992), we have
$$\left(1+\frac{B_{n,T}(y)}{F_{Y|X=x}(a_ny+b_n)}\right)^n\le\left(1+\frac{o(n^{-1})}{F_{Y|X=x}(a_ny+b_n)}\right)^n\to1.$$
The proof for the case of $k=1$ is then complete by the continuous mapping theorem.

Step 2.
We next claim that (14) can be generalized to the cases of $k>1$: with the same $a_n$ and $b_n$ as in (14), the convergence (4) holds, that is,
$$\frac{\mathbf{Y}-b_n}{a_n}\overset{d}{\to}\mathbf{V}.$$
To this end, consider $y_1>y_2>\dots>y_k$. Theorem 8.4.2 in Arnold, Balakrishnan, and Nagaraja (1992) gives that the joint density of the $k$ largest order statistics satisfies
$$f_{Y_{(1),[x]},\dots,Y_{(k),[x]}}(a_ny_1+b_n,\dots,a_ny_k+b_n)\,a_n^k=F^{\,n-k}_{Y_{i,[x]}}(a_ny_k+b_n)\prod_{r=1}^k(n-r+1)\,a_nf_{Y_{i,[x]}}(a_ny_r+b_n)\quad\text{(by i.i.d. across $i$)}$$
$$=\left[F^{\,n-k}_{Y|X=x}(a_ny_k+b_n)\prod_{r=1}^k(n-r+1)\,a_nf_{Y|X=x}(a_ny_r+b_n)\right]\times\left(\frac{P(Y_{i,[x]}\le a_ny_k+b_n)}{F_{Y|X=x}(a_ny_k+b_n)}\right)^{n-k}\prod_{r=1}^k\frac{f_{Y_{i,[x]}}(a_ny_r+b_n)}{f_{Y|X=x}(a_ny_r+b_n)}\equiv\tilde A_n\times\tilde B_{nT}.$$
The convergence $\tilde A_n\to G_\xi(y_k)\prod_{r=1}^k\{g_\xi(y_r)/G_\xi(y_r)\}$ is established by Theorem 8.4.2 in Arnold, Balakrishnan, and Nagaraja (1992). It now remains to show $\tilde B_{nT}\to1$.

First, $\big(P(Y_{i,[x]}\le a_ny_k+b_n)/F_{Y|X=x}(a_ny_k+b_n)\big)^{n-k}\to1$ follows by the same argument as in the $k=1$ case. Second, for any $v$, we have
$$\frac{f_{Y_{i,[x]}}(v)}{f_{Y|X=x}(v)}=\frac{\partial P(Y_{i,[x]}\le v)/\partial v}{f_{Y|X=x}(v)}=\frac{\frac{\partial}{\partial v}E_{X_{i,(1)}(x)}\big[F_{Y|X=X_{i,(1)}(x)}(v)\big]}{f_{Y|X=x}(v)}\quad\text{(by (15))}$$
$$=\frac{\frac{\partial}{\partial v}\int F_{Y|X=\tilde x}(v)f_{X_{i,(1)}(x)}(\tilde x)\,d\tilde x}{f_{Y|X=x}(v)}=\frac{\int\frac{\partial}{\partial v}F_{Y|X=\tilde x}(v)f_{X_{i,(1)}(x)}(\tilde x)\,d\tilde x}{f_{Y|X=x}(v)}\quad\text{(by Leibniz's rule)}=\frac{E_{X_{i,(1)}(x)}\big[f_{Y|X=X_{i,(1)}(x)}(v)\big]}{f_{Y|X=x}(v)},$$
where the application of Leibniz's rule is permitted under the assumption (Condition 1.3) that $f_{Y|X=x}(v)$ is uniformly continuous in $x$ and $v$. Then, similarly to the argument bounding $B_{n,T}$ above, we use the mean value expansion under Condition 1.3, Lemma 1, and Conditions 1.3–1.4 to derive that, for any $r\in\{1,\dots,k\}$,
$$\left|\frac{f_{Y_{i,[x]}}(a_ny_r+b_n)}{f_{Y|X=x}(a_ny_r+b_n)}-1\right|=\left|\frac{E_{X_{i,(1)}(x)}\big[f_{Y|X=X_{i,(1)}(x)}(a_ny_r+b_n)-f_{Y|X=x}(a_ny_r+b_n)\big]}{f_{Y|X=x}(a_ny_r+b_n)}\right|$$
$$\le\sup_{\tilde x\in B_{\eta_T}(x)}\left|\frac{\partial f_{Y|X=\tilde x}(a_ny_r+b_n)/\partial\tilde x}{f_{Y|X=\tilde x}(a_ny_r+b_n)}\right|E\big[\big|X_{i,(1)}(x)-x\big|\big]\le o(T^{\eta})\times O(T^{-\eta})=o(1).$$

Step 3.
Note that $\lim_{n\to\infty}F_{Y|X=x}(a_nv+b_n)=G_\xi(v)$, (4), and the continuous mapping theorem yield that
$$\begin{pmatrix}\dfrac{Q_{Y|X=x}(1-h/n)-Y_{(k),[x]}}{Y_{(1),[x]}-Y_{(k),[x]}}\\[2ex]\dfrac{\mathbf{Y}-Y_{(k),[x]}}{Y_{(1),[x]}-Y_{(k),[x]}}\end{pmatrix}\overset{d}{\to}\begin{pmatrix}V_q\\\mathbf{V}^*\end{pmatrix}.$$
By another application of the continuous mapping theorem, any equivariant $S$ that satisfies the asymptotic size constraint $P(V_q\in S(\mathbf{V}^*))\ge1-\alpha$ for every $\xi\in\Xi$ also satisfies $\liminf_{n\to\infty,T\to\infty}P(Q_{Y|X=x}(1-h/n)\in S(\mathbf{Y}))\ge1-\alpha$. Thus, it suffices to determine $S(\cdot)$ in the limiting problem where the observation is $\mathbf{V}^*$. Given the solution (9), it further suffices to determine a suitable Lagrangian weight $\Lambda$. We accomplish this by construction.

Consider $\Lambda=c\tilde\Lambda$, where $\tilde\Lambda$ is some probability distribution function with support on $\Xi$ and $c$ is some positive constant to be determined. Note that the density $f_{V_q,\mathbf{V}^*}(y,v^*;\xi)$ is continuously differentiable in all three arguments, and hence $P_\xi\big(V_q\in S_{c\tilde\Lambda}(\mathbf{V}^*)\big)$, as a function of $\xi$ and $c$, is continuous in both arguments. Denote $S$ in (9) by $S_\Lambda$ to indicate that the confidence interval depends on the choice of $\Lambda$. Then, given any $\tilde\Lambda$ and $c$, $\inf_{\xi\in\Xi}P_\xi\big(V_q\in S_{c\tilde\Lambda}(\mathbf{V}^*)\big)$ is attained, say at $\xi^*$, since $\Xi$ is compact. Furthermore, for every $\xi\in\Xi$, $P_\xi\big(V_q\in S_{c\tilde\Lambda}(\mathbf{V}^*)\big)$ is increasing as a function of $c$. We can therefore choose $c=c^*$ such that $P_{\xi^*}\big(V_q\in S_{c^*\tilde\Lambda}(\mathbf{V}^*)\big)=1-\alpha$. This is always feasible since $\inf_{\xi\in\Xi}P_\xi\big(V_q\in S_{c^*\tilde\Lambda}(\mathbf{V}^*)\big)\to1$ as $c^*\to\infty$ and $\sup_{\xi\in\Xi}P_\xi\big(V_q\in S_{c^*\tilde\Lambda}(\mathbf{V}^*)\big)\to0$ as $c^*\to0$. $\blacksquare$

Remark 2
Since $\tilde\Lambda$ in the last part of the above proof can be arbitrary in theory, we provide an empirical guide for determining a nearly optimal $\tilde\Lambda$ in Section A.2.

Proof of Corollary 1
By Corollary 1.2.4 and Remark 1.2.7 in de Haan and Ferreira (2007), the constants $a_n$ and $b_n$ can be chosen as follows. We present the case for $\alpha$ only; the choice for $\beta$ follows identically. If $\xi_\alpha>0$, we choose $a_n(\xi_\alpha)=Q_\alpha(1-1/n)$ and $b_n(\xi_\alpha)=0$, where recall that $Q_\alpha(\cdot)$ denotes the quantile function of $\alpha_i$. If $\xi_\alpha=0$, we choose $a_n(\xi_\alpha)=1/(nf_\alpha(b_n(\xi_\alpha)))$ and $b_n(\xi_\alpha)=Q_\alpha(1-1/n)$, where recall that $f_\alpha(\cdot)$ denotes the PDF of $\alpha_i$. If $\xi_\alpha<0$, we choose $a_n(\xi_\alpha)=-\xi_\alpha(Q_\alpha(1)-Q_\alpha(1-1/n))$ and $b_n(\xi_\alpha)=Q_\alpha(1-1/n)$. By construction, these constants satisfy $1-F_\alpha(a_n(\xi_\alpha)y+b_n(\xi_\alpha))=O(n^{-1})$ for any fixed $y$.

We first establish the convergence of $A$. By the EV theory, Condition 2.1 ($\alpha_i$ is i.i.d.) and Condition 2.2 ($F_\alpha\in\mathcal{D}(G_{\xi_\alpha})$) imply
$$\left(\frac{\alpha_{(1)}-b_n(\xi_\alpha)}{a_n(\xi_\alpha)},\dots,\frac{\alpha_{(k)}-b_n(\xi_\alpha)}{a_n(\xi_\alpha)}\right)^\intercal\overset{d}{\to}\mathbf{V}(\xi_\alpha),\qquad(16)$$
where $\mathbf{V}(\xi_\alpha)$ is jointly EV distributed with tail index $\xi_\alpha$.

Let $I=(I_1,\dots,I_k)\in\{1,\dots,n\}^k$ be the $k$ random indices such that $\alpha_{(j)}=\alpha_{I_j}$, $j=1,\dots,k$, and let $\hat I$ be the corresponding indices such that $\hat\alpha_{(j)}=\hat\alpha_{\hat I_j}$. Then, the convergence of $A$ follows from (16) once we establish $|\hat\alpha_{\hat I_j}-\alpha_{I_j}|=o_p(a_n(\xi_\alpha))$ for $j=1,\dots,k$. We present the case of $k=1$, but the argument for a general $k$ is similar. Denote $\varepsilon_i\equiv\hat\alpha_i-\alpha_i$.

First, consider the case with $\xi_\alpha>0$. The part of Condition 2.3 for $\xi_\alpha>0$ implies that
$$\sup_i|\varepsilon_i|=\sup_i\big|\bar X_i^\intercal(\beta_i-\hat\beta_i)+\bar u_i\big|\le\sup_i\|\bar X_i\|\sup_i\|\beta_i-\hat\beta_i\|+\sup_i|\bar u_i|=o_p(1).$$
Given this result, we have that, on one hand, $\hat\alpha_{\hat I_1}=\max_i\{\alpha_i+\varepsilon_i\}\le\alpha_{I_1}+\sup_i|\varepsilon_i|=\alpha_{I_1}+o_p(1)$; and, on the other hand, $\hat\alpha_{\hat I_1}=\max_i\{\alpha_i+\varepsilon_i\}\ge\max_i\{\alpha_i+\min_i\{\varepsilon_i\}\}\ge\alpha_{I_1}+\min_i\{\varepsilon_i\}\ge\alpha_{I_1}-\sup_i|\varepsilon_i|=\alpha_{I_1}-o_p(1)$. Therefore, $|\hat\alpha_{\hat I_1}-\alpha_{I_1}|\le o_p(1)=o_p(a_n(\xi_\alpha))$ since $a_n(\xi_\alpha)\to\infty$.

Second, consider the case with $\xi_\alpha=0$. Corollary 1.2.4 in de Haan and Ferreira (2007) implies that $a_n(\xi_\alpha)=1/(nf_\alpha(Q_\alpha(1-1/n)))$. Thus, the part of Condition 2.3 for $\xi_\alpha=0$ implies that
$$\frac{1}{a_n(\xi_\alpha)}\sup_i|\varepsilon_i|\le nf_\alpha(Q_\alpha(1-1/n))\left(\sup_i\|\bar X_i\|\sup_i\|\beta_i-\hat\beta_i\|+\sup_i|\bar u_i|\right)=o_p(1).$$
Now, the same argument as above yields that $|\hat\alpha_{\hat I_1}-\alpha_{I_1}|\le O_p(\sup_i|\varepsilon_i|)=o_p(a_n(\xi_\alpha))$.

Third, consider the case with $\xi_\alpha<0$. The fact that $a_n(\xi_\alpha)=O(n^{\xi_\alpha})$ implies that
$$\frac{1}{a_n(\xi_\alpha)}\sup_i|\varepsilon_i|\le n^{-\xi_\alpha}\left(\sup_i\|\bar X_i\|\sup_i\|\beta_i-\hat\beta_i\|+\sup_i|\bar u_i|\right)=O_p\big(n^{-\xi_\alpha}T^{-1/2}\big)=o_p(1),$$
where the two equalities follow from Condition 2.3. The rest of the proof is identical to Step 3 in the proof of Theorem 1.

Finally, we establish the convergence of $B$.
Recall that we focus on, without loss of generality, the first component of $\beta_i$, so that $(\beta_{(1)},\dots,\beta_{(k)})^\intercal$ denotes the largest $k$ elements among the first components of $\{\beta_i\}_{i=1}^n$. Conditions 2.1 and 2.2 imply that
$$\left(\frac{\beta_{(1)}-b_n(\xi_\beta)}{a_n(\xi_\beta)},\dots,\frac{\beta_{(k)}-b_n(\xi_\beta)}{a_n(\xi_\beta)}\right)^\intercal\overset{d}{\to}\mathbf{V}(\xi_\beta).$$
Condition 2.3 and a similar argument to that for $A$ complete the proof. $\blacksquare$

A.2 Computational details
We discuss the choice of $\Lambda$ following Elliott, Müller, and Watson (2015) and Müller and Wang (2017). Using the same notation as in the proof of Theorem 1, consider $\Lambda=c\tilde\Lambda$, where $\tilde\Lambda$ is some probability distribution function with support on $\Xi$. Suppose $\xi$ is randomly drawn from $\tilde\Lambda$ and $c$ satisfies $\int P_\xi(V_q\in S_{c\tilde\Lambda}(V^*))\,d\tilde\Lambda(\xi)=1-\alpha$. Denote the $W$-weighted average length as $V_{\tilde\Lambda}=\int_\Xi E_\xi[\kappa(V^*;\xi)\,\mathrm{lgth}(S_{c\tilde\Lambda}(V^*))]\,dW(\xi)$. Since the uniform coverage for all $\xi\in\Xi$ implies the $\tilde\Lambda$-weighted average coverage for any probability distribution $\tilde\Lambda$, and $S_{c\tilde\Lambda}$ minimizes the $W$-weighted average length by construction, $V_{\tilde\Lambda}$ essentially provides a lower bound for the $W$-weighted average length among all sets $S$ that satisfy the uniform coverage.

Now suppose we obtain some $\tilde\Lambda^*$ on $\Xi$ and a constant $c^*$ such that
$$P_\xi\big(V_q\in S_{c^*\tilde\Lambda^*}(V^*)\big)\ge1-\alpha\ \text{ for all }\xi\in\Xi,\qquad(17)$$
and
$$\int_\Xi E_\xi\big[\kappa(V^*;\xi)\,\mathrm{lgth}(S_{c^*\tilde\Lambda^*}(V^*))\big]\,dW(\xi)\le(1+\varepsilon)V_{\tilde\Lambda^*};\qquad(18)$$
then the confidence interval $S_{c^*\tilde\Lambda^*}$ has a $W$-weighted average expected length no more than $100\varepsilon\%$ longer than that of any other confidence interval of the same level. We set $\varepsilon$ to a small constant.

To find such a $\tilde\Lambda^*$, we can discretize $\Xi$ into a grid $\Xi_a$ and determine $\tilde\Lambda$ accordingly as the point masses. Then we can simulate $N$ random draws of $(V_q,V^*)$ from each $\xi\in\Xi_a$ and estimate $P_\xi(V_q\in S_{c^*\tilde\Lambda}(V^*))$ by sample fractions. By iteratively increasing or decreasing the point masses according to whether the estimated $P_\xi(V_q\in S_{c^*\tilde\Lambda}(V^*))$ is larger or smaller than the nominal level, we can always find a candidate $\tilde\Lambda^*$. Note that such a $\tilde\Lambda^*$ always exists since we allow $P_\xi(V_q\in S_{c^*\tilde\Lambda}(V^*))>1-\alpha$ for some $\xi$. We determine $c^*$ so that (18) is satisfied. The continuity of $f_{V_q,V^*}(y,v^*;\xi)$ entails that $P_\xi(V_q\in S_{c^*\tilde\Lambda}(V^*))$, as a function of $\xi$, is also continuous.
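Because the coverage is increasing in $c$, the constant $c^*$ can be found by a monotone search on the worst-case coverage. The sketch below illustrates this with a hypothetical, analytically specified coverage function standing in for the simulated coverage; the function `coverage`, the grid endpoints, and all constants are our own assumptions, not the paper's:

```python
import numpy as np

# Bisection sketch for choosing c* so that inf_xi coverage(c, xi) = 1 - alpha.
# `coverage` is a hypothetical stand-in, increasing in c for every xi,
# mimicking the monotonicity of P_xi(V_q in S_{c Lambda~}(V*)) used in the text.
alpha = 0.05
grid = np.linspace(-0.5, 0.5, 21)   # placeholder grid on Xi

def coverage(c, xi):
    return 1.0 - np.exp(-c * (1.2 + np.abs(xi)))  # worst case at xi = 0

lo, hi = 0.0, 100.0
for _ in range(60):                 # solve min_xi coverage = 1 - alpha
    mid = 0.5 * (lo + hi)
    if coverage(mid, grid).min() >= 1 - alpha:
        hi = mid
    else:
        lo = mid
c_star = hi
print(c_star, coverage(c_star, grid).min())  # worst-case coverage = 0.95
```

In the actual procedure the coverage at each $\xi_j$ is estimated by simulation, but the monotone search for $c^*$ is the same.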
Therefore, (17) is guaranteed as we consider $|\Xi_a|\to\infty$ and $N/|\Xi_a|\to\infty$, where $|\Xi_a|$ denotes the cardinality of $\Xi_a$.

In our simulations, we take $\Xi$ to be a bounded interval and $\Xi_a$ an evenly spaced grid on it, so that $\tilde\Lambda$ consists of 60 point masses on $\Xi_a$. We consider $\alpha=0.05$. Following Elliott, Müller, and Watson (2015), we determine these point masses by the following steps. Simulate $N=100{,}000$ i.i.d. random draws from some proposal density, with $\xi$ drawn uniformly from a grid $\Xi_p$ on $\Xi$. Start with the uniform weights $\tilde\Lambda^{(0)}=\{1/60,\dots,1/60\}$ and $c^*=1$, and calculate the (estimated) coverage probabilities $P_{\xi_j}(V_q\in S_{c^*\tilde\Lambda^{(0)}}(V^*))$ for every $\xi_j\in\Xi_a$ using importance sampling; denote them as $P=(P_1,\dots,P_{60})'$. Update $\Lambda$ by setting $\Lambda^{(s+1)}=\Lambda^{(s)}+\eta_\Lambda(P-0.95)$ with some step-length constant $\eta_\Lambda>0$, so that the $j$-th point mass in $\Lambda$ is increased/decreased if the coverage probability for $\xi_j$ is larger/smaller than the nominal level. Repeat the iteration 500 times. Then the resulting $\Lambda^{(500)}$ is a valid candidate.
Numerically check whether $S_{\Lambda^{(500)}}$ indeed controls the coverage uniformly by simulating the coverage probabilities over a sufficiently fine grid $\Xi_f$. If not, go back to the second step with a finer $\Xi_a$.

The expressions of $f_{V_q,V^*}$ and $\kappa(v^*;\xi)f_{V^*}(v^*;\xi)$ are as follows:
$$f_{V_q,V^*}(y,v^*;\xi)=\frac{1}{|y|^k}\int_{a(\xi)}^{b(\xi)}|q(\xi,h)-s|^{k-1}\exp\left[-(1+\xi s)^{-1/\xi}-(1+1/\xi)\sum_{i=1}^k\log\left(1+\xi s+v_i^*\,\xi\,\frac{q(\xi,h)-s}{y}\right)\right]ds,$$
where $a(\xi)$ and $b(\xi)$ are such that for all $s\in(a(\xi),b(\xi))$, $1+\xi s>0$ and $1+\xi s+\xi(q(\xi,h)-s)/y>0$, and
$$\kappa(v^*;\xi)f_{V^*}(v^*;\xi)=\Gamma(k-\xi)\int_0^{b(\xi)}s^{k-2}\exp\left[-(1+1/\xi)\sum_{i=1}^k\log(1+\xi v_i^*s)\right]ds,$$
where $\Gamma$ is the Gamma function, and $b(\xi)=-1/\xi$ for $\xi<0$ and $b(\xi)=\infty$ otherwise.

A.3 Primitive conditions for Condition 1.3
In this appendix, we provide primitive conditions for Condition 1.3. The following conditions are sufficient. Recall that $\bar y$ denotes the right end-point $\sup\{y:F_{Y|X=x}(y)<1\}$. The notation is simpler if we write $\gamma(\cdot)=1/\xi(\cdot)$ and let $g_i$ denote the partial derivative of a generic function $g(\cdot,\cdot)$ w.r.t. the $i$-th element and $g_{ij}$ the $(i,j)$-th cross derivative.

Condition B. $X_{it}$ has a compact support, and $F_{Y|X=x}(y)$ satisfies one of the following three cases:

(i) $\xi(x)>0$ and
$$1-F_{Y|X=x}(y)=c(x)y^{-\gamma(x)}\big(1+d(x)y^{-\tilde\gamma(x)}+r(x,y)\big),$$
where $c(\cdot)>0$ and $d(\cdot)$ are uniformly bounded between 0 and $\infty$ and continuously differentiable with uniformly bounded derivatives, $\gamma(\cdot)>0$ and $\tilde\gamma(\cdot)>0$, and $r(x,y)$ is continuously differentiable with bounded derivatives w.r.t. both $x$ and $y$ and satisfies, for some $\delta>0$,
$$\limsup_{y\to\bar y}\sup_{x\in B_\delta(x)\cap\{x:\xi(x)>0\}}\big|r(x,y)/y^{-\tilde\gamma(x)}\big|=0,\qquad\limsup_{y\to\bar y}\sup_{x\in B_\delta(x)\cap\{x:\xi(x)>0\}}\big|r_2(x,y)/y^{-\tilde\gamma(x)-1}\big|=0,$$
$$\limsup_{y\to\bar y}\sup_{x\in B_\delta(x)\cap\{x:\xi(x)>0\}}\big|r_1(x,y)/y^{-\tilde\gamma(x)}\big|=0,\qquad\limsup_{y\to\bar y}\sup_{x\in B_\delta(x)\cap\{x:\xi(x)>0\}}\big|r_{12}(x,y)/y^{-\tilde\gamma(x)-1}\big|=0.$$

(ii) $\xi(x)=0$ and
$$f_{Y|X=x}(y)=c(x)y^{\tilde c(x)}\exp\big(-d(x)\tilde d(y)\big)\big(1+r(x,y)\big),$$
where $c(\cdot)>0$ and $d(\cdot)>0$ are uniformly bounded between 0 and $\infty$, $\tilde c(\cdot)$ is continuously differentiable and uniformly bounded, and $\tilde d(y)$ is continuously differentiable and satisfies $C_1(\log y)^2\le\tilde d(y)\le C_2y^{C_3}$ for some constants $0\le C_1,C_2,C_3<\infty$. The remainder $r(x,y)$ is uniformly bounded and continuously differentiable w.r.t. both arguments with bounded derivatives, and satisfies, for some $\delta>0$,
$$\limsup_{y\to\bar y}\sup_{x\in B_\delta(x)\cap\{x:\xi(x)=0\}}\big|\max\{r(x,y),r_1(x,y),r_2(x,y)\}\big|=0.$$
(iii) $\xi(x)<0$ and
$$1-F_{Y|X=x}(y)=c(x)(\bar y-y)^{-\gamma(x)}\big(1+d(x)(\bar y-y)^{-\tilde\gamma(x)}+r(x,y)\big),$$
where $c(\cdot)>0$ and $d(\cdot)$ are uniformly bounded and continuously differentiable with uniformly bounded derivatives, $\gamma(\cdot)<0$ and $\tilde\gamma(\cdot)<0$ are continuously differentiable functions, and $r(x,y)$ is continuously differentiable with bounded derivatives w.r.t. both $x$ and $y$ and satisfies, for some $\delta>0$,
$$\limsup_{y\to\bar y}\sup_{x\in B_\delta(x)\cap\{x:\xi(x)<0\}}\big|r(x,y)/(\bar y-y)^{-\tilde\gamma(x)}\big|=0,\qquad\limsup_{y\to\bar y}\sup_{x\in B_\delta(x)\cap\{x:\xi(x)<0\}}\big|r_2(x,y)/(\bar y-y)^{-\tilde\gamma(x)-1}\big|=0,$$
$$\limsup_{y\to\bar y}\sup_{x\in B_\delta(x)\cap\{x:\xi(x)<0\}}\big|r_1(x,y)/(\bar y-y)^{-\tilde\gamma(x)}\big|=0,\qquad\limsup_{y\to\bar y}\sup_{x\in B_\delta(x)\cap\{x:\xi(x)<0\}}\big|r_{12}(x,y)/(\bar y-y)^{-\tilde\gamma(x)-1}\big|=0.$$

Condition B assumes that the approximation of the true CDF by a generalized Pareto distribution has the leading terms $1+d(x)y^{-\tilde\gamma(x)}$ and $c(x)y^{\tilde c(x)}\exp(-d(x)\tilde d(y))$, respectively, in the two cases with $\xi(x)>0$ and $\xi(x)=0$, and the remainder $r(x,y)$. Most of it is essentially a conditional version of the unconditional second-order assumptions that are common in the EV literature. In particular, Case (i) covers regularly varying tails and is imposed by Smith (1982) to study unconditional problems; see also Hall (1982) and Smith (1987). Case (ii) covers slowly varying tails, including the Gaussian ($\tilde c(x)=0$ and $\tilde d(y)=y^2/2$), the lognormal ($\tilde c(x)=-1$ and $\tilde d(y)=(\log y)^2/2$), and the exponential family ($\tilde c(x)=0$ and $\tilde d(y)=y$). See, for example, Appendix B in de Haan and Ferreira (2007). Case (iii) covers the thin-tailed case where the conditional distribution has a bounded right end-point. For example, the standard uniform distribution on $[0,1]$ is covered with $c(x)=1$, $\bar y=1$, $\gamma(x)=-1$, and $d(x)=0$.
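The second-order structure in Case (i) can also be checked numerically. Using the conditional Pareto example of Section 2.1, whose tail is $1-F_{Y|X=x}(y)=y^{-1/x}(1+1/y)^{-1/x}$, the remainder after removing the leading terms should vanish at the rate $y^{-2}$:

```python
# Check that r(x, y) = (1 + 1/y)^(-1/x) - (1 - 1/(x*y)) is O(y^-2) for the
# conditional Pareto tail 1 - F(y) = y^(-1/x) (1 + 1/y)^(-1/x), matching
# Case (i) with c(x) = 1, gamma(x) = 1/x, d(x) = -1/x, gamma~(x) = 1.
x = 2.0
for y in (1e2, 1e3, 1e4):
    r = (1 + 1 / y) ** (-1 / x) - (1 - 1 / (x * y))
    print(y, r * y ** 2)  # roughly constant in y, so r = O(y^-2)
```

The scaled remainder stabilizes (around $3/8$ for $x=2$, the next Taylor coefficient of $(1+1/y)^{-1/2}$), consistent with $r(x,y)=O(y^{-2})$.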
Compared with the unconditional EV literature, we require a stronger version in which the derivatives of $r(x,y)$ are uniformly bounded. This guarantees that the tail of $f_{Y|X=x}$ is also uniformly bounded. The compact support of $X$ is imposed to simplify the proof (cf. Wang and Li (2013)). The following lemma establishes Condition 1.3 using Conditions 1.4 and B. Its proof is collected at the very end of this article.

Lemma 2
If Condition 1.4 and Condition B hold, then Condition 1.3 holds; that is, for $u_n=a_ny+b_n$ with any fixed $y>0$, as $n\to\infty$ and $T\to\infty$,
$$\text{(a)}\ \lim_{u_n\to\bar y}\sup_{x\in B_{\eta_T}(x)}T^{-\eta}\left|\frac{\partial F_{Y|X=x}(u_n)/\partial x}{1-F_{Y|X=x}(u_n)}\right|=0,\qquad\text{(b)}\ \lim_{u_n\to\bar y}\sup_{x\in B_{\eta_T}(x)}T^{-\eta}\left|\frac{\partial f_{Y|X=x}(u_n)/\partial x}{f_{Y|X=x}(u_n)}\right|=0.$$

To give a better sense of Condition B, we now show that it is satisfied by the three examples introduced in Section 2.1.

First, consider the joint normal distribution. Condition B.(ii) is satisfied by setting $c(x)=1/\sqrt{2\pi(1-\rho^2)}$, $d(x)=1$, $\tilde d(y)=y^2/(2(1-\rho^2))$, and $r(x,y)=\exp\big((2\rho xy-\rho^2x^2)/(2(1-\rho^2))\big)-1$. Second, for the conditional Student-t distribution, Ding (2016) derives that the conditional PDF of $Y$ given $X=x$ is
$$f_{Y|X=x}(y)=\frac{C}{\sigma(x)}\left(1+\frac{(y-\rho x)^2}{(v+1)\sigma^2(x)}\right)^{-\frac{v+2}{2}}$$
for some constant $C$ depending on $v$ only and $\sigma(x)=\sqrt{(1-\rho^2)(v+x^2)/(v+1)}$. Then Condition B.(i) holds with $\gamma(x)=v+1$, $c(x)\propto\sigma(x)^{v+1}$, $d(x)\propto\rho x$, $\tilde\gamma(x)=1$, and $r(x,y)=O(y^{-2})$ for any $x\in\mathbb{R}$. Finally, for the conditional Pareto distribution, a Taylor expansion yields
$$1-F_{Y|X=x}(y)=y^{-1/x}(1+1/y)^{-1/x}=y^{-1/x}\left(1-\frac{1}{xy}+O\Big(\frac{1}{y^2}\Big)\right).$$
Thus Condition B.(i) holds with $c(x)=1$, $\gamma(x)=1/x$, $d(x)=-1/x$, $\tilde\gamma(x)=1$, and $r(x,y)=O(y^{-2})$ for $x$ bounded away from 0.

Proof of Lemma 2. The proof is different for the cases $\xi(x)>0$, $\xi(x)=0$, and $\xi(x)<0$; we first consider the $\xi(x)>0$ case. Recall that $B_{\eta_T}(x)$ denotes an open ball centered at $x$ with radius $\eta_T=T^{-\eta}$, where $\eta$ is determined in Lemma 1. For (a), note that $\inf_{x\in B_{\eta_T}(x)}\xi(x)>0$ once $T$ is large enough; this is feasible given the continuity of $\xi(\cdot)$. Then by the chain rule and the condition (Condition B.(i)) that
$$1-F_{Y|X=x}(y)=c(x)y^{-\gamma(x)}\big(1+d(x)y^{-\tilde\gamma(x)}+r(x,y)\big),\qquad(19)$$
we have
$$\frac{\partial F_{Y|X=x}(y)/\partial x}{1-F_{Y|X=x}(y)}=-\left[\frac{c_1(x)}{c(x)}-\gamma_1(x)\log y+\frac{d_1(x)y^{-\tilde\gamma(x)}}{1+d(x)y^{-\tilde\gamma(x)}+r(x,y)}-\frac{d(x)y^{-\tilde\gamma(x)}\tilde\gamma_1(x)\log y}{1+d(x)y^{-\tilde\gamma(x)}+r(x,y)}+\frac{r_1(x,y)}{1+d(x)y^{-\tilde\gamma(x)}+r(x,y)}\right].$$
Recall that
$$u_n=a_ny+b_n=O\big(Q_{Y|X=x}(1-1/n)\big)=O\big(n^{\xi(x)}\big)\qquad(20)$$
(cf. Corollary 1.2.4 and Remark 1.2.11 in de Haan and Ferreira (2007)).
Then, after applying the triangle inequality and the smoothness and boundedness of $c(\cdot)$, $d(\cdot)$, and $\gamma(\cdot)$ (Condition B.(i)), we have that for some constant $C>0$,
$$\sup_{x\in B_{\eta_T}(x)}\left|\frac{\partial F_{Y|X=x}(u_n)/\partial x}{1-F_{Y|X=x}(u_n)}\right|\le\sup_{x\in B_{\eta_T}(x)}\left\|\frac{c_1(x)}{c(x)}-\gamma_1(x)\log(u_n)+Cd_1(x)u_n^{-\tilde\gamma(x)}-Cd(x)u_n^{-\tilde\gamma(x)}\tilde\gamma_1(x)\log(u_n)+Cr_1(x,u_n)\right\|\qquad(21)$$
$$=O(\log(u_n))\ \text{(by Condition B.(i))}=O(\log n)\ \text{(by (20))}.$$
By (19) again, we have
$$\sup_{x'\in B_{\eta_T}(x)}\left|\frac{1-F_{Y|X=x'}(u_n)}{1-F_{Y|X=x}(u_n)}\right|\le\sup_{x'\in B_{\eta_T}(x)}\left|u_n^{-\gamma(x')+\gamma(x)}\right|\sup_{x'\in B_{\eta_T}(x)}\left|\frac{c(x')}{c(x)}\right|\sup_{x'\in B_{\eta_T}(x)}\left|\frac{1+d(x')u_n^{-\tilde\gamma(x')}+r(x',u_n)}{1+d(x)u_n^{-\tilde\gamma(x)}+r(x,u_n)}\right|\qquad(22)$$
$$\le C\exp\left(\sup_{x'\in B_{\eta_T}(x)}\log\big(u_n^{-\gamma(x')+\gamma(x)}\big)\right)\ \text{(by Condition B.(i))}=C\exp\big(O(T^{-\eta}\log(u_n))\big)=C\exp\big(O(T^{-\eta}\log n)\big)\ \text{(by (20))}=O(1)\ \text{(by Condition 1.4)}.$$
Then part (a) follows by combining (21) and (22) and using $O(T^{-\eta})\times O(\log n)=o(1)$ by Condition 1.4 again.

For (b), Condition B.(i) implies that
$$f_{Y|X=x}(y)=c(x)\gamma(x)y^{-\gamma(x)-1}\big(1+d(x)y^{-\tilde\gamma(x)}+r(x,y)\big)+c(x)y^{-\gamma(x)}\big(d(x)\tilde\gamma(x)y^{-\tilde\gamma(x)-1}-r_2(x,y)\big).\qquad(23)$$
A similar argument as above with Conditions B.(i) and 1.4 yields
$$\sup_{x\in B_{\eta_T}(x)}\left|\frac{\partial f_{Y|X=x}(u_n)/\partial x}{f_{Y|X=x}(u_n)}\right|\le O(\log(u_n))=O(\log n)\quad\text{and}\quad\sup_{x'\in B_{\eta_T}(x)}\left|\frac{f_{Y|X=x'}(u_n)}{f_{Y|X=x}(u_n)}\right|\le C\exp\left(\sup_{x'\in B_{\eta_T}(x)}\log\big(u_n^{-\gamma(x')+\gamma(x)}\big)\right)=O(1),$$
which yield part (b) by using Condition 1.4 again.

The proof for $\xi(x)<0$ is very similar once we replace $y$ with $\bar y-y$. In particular, by the chain rule and the condition (Condition B.(iii)) that
$$1-F_{Y|X=x}(y)=c(x)(\bar y-y)^{-\gamma(x)}\big(1+d(x)(\bar y-y)^{-\tilde\gamma(x)}+r(x,y)\big),\qquad(24)$$
we have
$$\frac{\partial F_{Y|X=x}(y)/\partial x}{1-F_{Y|X=x}(y)}=-\left[\frac{c_1(x)}{c(x)}-\gamma_1(x)\log(\bar y-y)+\frac{d_1(x)(\bar y-y)^{-\tilde\gamma(x)}}{1+d(x)(\bar y-y)^{-\tilde\gamma(x)}+r(x,y)}-\frac{d(x)(\bar y-y)^{-\tilde\gamma(x)}\tilde\gamma_1(x)\log(\bar y-y)}{1+d(x)(\bar y-y)^{-\tilde\gamma(x)}+r(x,y)}+\frac{r_1(x,y)}{1+d(x)(\bar y-y)^{-\tilde\gamma(x)}+r(x,y)}\right].$$
Similarly to (20), we write $u_n=a_ny+b_n$ and define $\bar u_n=\bar y-u_n$. Then
$$\bar u_n=\bar y-u_n=\big(\bar y-Q_{Y|X=x}(1-1/n)\big)(1+\xi(x)y)=O\big(n^{\xi(x)}\big),\qquad(25)$$
and then
$$\sup_{x\in B_{\eta_T}(x)}\left|\frac{\partial F_{Y|X=x}(u_n)/\partial x}{1-F_{Y|X=x}(u_n)}\right|\le\sup_{x\in B_{\eta_T}(x)}\left\|\frac{c_1(x)}{c(x)}-\gamma_1(x)\log(\bar u_n)+Cd_1(x)\bar u_n^{-\tilde\gamma(x)}-Cd(x)\bar u_n^{-\tilde\gamma(x)}\tilde\gamma_1(x)\log(\bar u_n)+Cr_1(x,u_n)\right\|\qquad(26)$$
$$=O(-\log(\bar u_n))\ \text{(by Condition B.(iii))}=O(\log n)\ \text{(by (25))}.$$
By (24) again, we have
$$\sup_{x'\in B_{\eta_T}(x)}\left|\frac{1-F_{Y|X=x'}(u_n)}{1-F_{Y|X=x}(u_n)}\right|\le\sup_{x'\in B_{\eta_T}(x)}\left|\bar u_n^{-\gamma(x')+\gamma(x)}\right|\sup_{x'\in B_{\eta_T}(x)}\left|\frac{c(x')}{c(x)}\right|\sup_{x'\in B_{\eta_T}(x)}\left|\frac{1+d(x')\bar u_n^{-\tilde\gamma(x')}+r(x',u_n)}{1+d(x)\bar u_n^{-\tilde\gamma(x)}+r(x,u_n)}\right|\qquad(27)$$
$$\le C\exp\left(\sup_{x'\in B_{\eta_T}(x)}\log\big(\bar u_n^{-\gamma(x')+\gamma(x)}\big)\right)\ \text{(by Condition B.(iii))}=C\exp\big(O(-T^{-\eta}\log(\bar u_n))\big)=C\exp\big(O(T^{-\eta}\log n)\big)\ \text{(by (25))}=O(1)\ \text{(by Condition 1.4)}.$$
Then part (a) follows by combining (26) and (27) and using $O(T^{-\eta})\times O(\log n)=o(1)$ by Condition 1.4 again.

For (b), Condition B.(iii) implies that
$$f_{Y|X=x}(y)=-c(x)\gamma(x)(\bar y-y)^{-\gamma(x)-1}\big(1+d(x)(\bar y-y)^{-\tilde\gamma(x)}+r(x,y)\big)-c(x)(\bar y-y)^{-\gamma(x)}\big(d(x)\tilde\gamma(x)(\bar y-y)^{-\tilde\gamma(x)-1}+r_2(x,y)\big).$$
A similar argument as above with Conditions B.(iii) and 1.4 yields
$$\sup_{x\in B_{\eta_T}(x)}\left|\frac{\partial f_{Y|X=x}(u_n)/\partial x}{f_{Y|X=x}(u_n)}\right|\le O(-\log(\bar u_n))=O(\log n)\quad\text{and}\quad\sup_{x'\in B_{\eta_T}(x)}\left|\frac{f_{Y|X=x'}(u_n)}{f_{Y|X=x}(u_n)}\right|\le C\exp\left(\sup_{x'\in B_{\eta_T}(x)}\Big(-\log\big(\bar u_n^{-\gamma(x')+\gamma(x)}\big)\Big)\right)=O(1),$$
which yield part (b) by using Condition 1.4 again.

It now remains to prove (a) and (b) for $\xi(x)=0$.
Note that $u_n=O\big(Q_{Y|X=x}(1-1/n)\big)$, which is at most of the order $\exp(\Phi^{-1}(1-1/n))=\exp(O(\sqrt{\log n}))$ by the condition $C_1(\log y)^2\le\tilde d(y)\le C_2y^{C_3}$. For (a), we decompose $B_{\eta_T}(x)$ into $B_{\eta_T}(x)\cap\{x:\xi(x)>0\}$, $B_{\eta_T}(x)\cap\{x:\xi(x)=0\}$, and $B_{\eta_T}(x)\cap\{x:\xi(x)<0\}$, and then
$$\sup_{x\in B_{\eta_T}(x)}\left|\frac{\partial F_{Y|X=x}(u_n)/\partial x}{1-F_{Y|X=x}(u_n)}\right|\le\max\left\{\sup_{x\in B_{\eta_T}(x)\cap\{x:\xi(x)>0\}}\left|\frac{\partial F_{Y|X=x}(u_n)/\partial x}{1-F_{Y|X=x}(u_n)}\right|,\ \sup_{x\in B_{\eta_T}(x)\cap\{x:\xi(x)=0\}}\left|\frac{\partial F_{Y|X=x}(u_n)/\partial x}{1-F_{Y|X=x}(u_n)}\right|,\ \sup_{x\in B_{\eta_T}(x)\cap\{x:\xi(x)<0\}}\left|\frac{\partial F_{Y|X=x}(u_n)/\partial x}{1-F_{Y|X=x}(u_n)}\right|\right\}.\qquad(28)$$
For the first term in (28), Conditions 1.1 and B.(i) imply that $\partial F_{Y|X=x}(u_n)/\partial x=O(u_n^{-\gamma(x)}\log u_n)$ and, by the mean value theorem, $\gamma(x)=1/\xi(x)=1/\big(\xi'(\dot x)O(T^{-\eta})\big)\ge O(T^{\eta})$, where $\dot x$ is within $B_{\eta_T}(x)$ such that $1/\xi'(\dot x)>0$. Thus, Condition 1.4 and the fact that $1-F_{Y|X=x}(u_n)=O(n^{-1})$ yield that for any $x\in B_{\eta_T}(x)\cap\{x:\xi(x)>0\}$,
$$\left|\frac{\partial F_{Y|X=x}(u_n)/\partial x}{1-F_{Y|X=x}(u_n)}\right|=O\big(n\times u_n^{-\gamma(x)}\log u_n\big)=O\big(\exp(\log n-\gamma(x)\log u_n+\log(\log u_n))\big)\le O\big(\exp(\log n-T^{\eta}\log u_n+\log(\log u_n))\big)=o(1).$$
For the second term in (28), apply Leibniz's rule and Condition B.(ii) to obtain
$$\sup_{x\in B_{\eta_T}(x)\cap\{x:\xi(x)=0\}}\left|\frac{\partial F_{Y|X=x}(u_n)/\partial x}{1-F_{Y|X=x}(u_n)}\right|\le\sup_{x\in B_{\eta_T}(x)\cap\{x:\xi(x)=0\}}Cn\int_{u_n}^{\bar y}y^{C_3}f_{Y|X=x}(y)\,dy\le Cn\int_{u_n}^{\bar y}y^{C_3+\bar C_T}\exp\big(-D_T(\log y)^2\big)\,dy\qquad(29)$$
$$=Cn\int_{\log u_n}^{\log\bar y}\exp\big(-D_Ts^2+(C_3+\bar C_T+1)s\big)\,ds\ \text{(by change of variables)}=O(1),$$
where we denote $\bar C_T=\sup_{x\in B_{\eta_T}(x)}\tilde c(x)<\infty$ and $D_T=\inf_{x\in B_{\eta_T}(x)}d(x)>0$, and the last equality follows from the facts that $u_n$ is at most of the order $\exp(O(\sqrt{\log n}))$ and that the $1-1/n$ quantile of a normal distribution is $O(\sqrt{\log n})$.

For the third term in (28), similar derivations yield that $\partial F_{Y|X=x}(u_n)/\partial x=O(-\bar u_n^{-\gamma(x)}\log\bar u_n)$ and $-\gamma(x)=-1/\xi(x)\ge O(T^{\eta})$, where $\dot x$ is within $B_{\eta_T}(x)$ such that $-1/\xi'(\dot x)>0$. Thus, Condition 1.4 and the fact that $1-F_{Y|X=x}(u_n)=O(n^{-1})$ yield that for any $x\in B_{\eta_T}(x)\cap\{x:\xi(x)<0\}$,
$$\left|\frac{\partial F_{Y|X=x}(u_n)/\partial x}{1-F_{Y|X=x}(u_n)}\right|=O\big(n\times\bar u_n^{-\gamma(x)}(-\log\bar u_n)\big)=O\big(\exp(\log n-\gamma(x)\log\bar u_n+\log(-\log\bar u_n))\big)\le O\big(\exp(\log n+T^{\eta}\log\bar u_n+\log(-\log\bar u_n))\big)\ \text{(by }\log\bar u_n<0\text{)}=o(1).$$

For (b), we similarly derive
$$\sup_{x\in B_{\eta_T}(x)}\left|\frac{\partial f_{Y|X=x}(u_n)/\partial x}{f_{Y|X=x}(u_n)}\right|\le\max\left\{\sup_{x\in B_{\eta_T}(x)\cap\{x:\xi(x)>0\}}\left|\frac{\partial f_{Y|X=x}(u_n)/\partial x}{f_{Y|X=x}(u_n)}\right|,\ \sup_{x\in B_{\eta_T}(x)\cap\{x:\xi(x)=0\}}\left|\frac{\partial f_{Y|X=x}(u_n)/\partial x}{f_{Y|X=x}(u_n)}\right|,\ \sup_{x\in B_{\eta_T}(x)\cap\{x:\xi(x)<0\}}\left|\frac{\partial f_{Y|X=x}(u_n)/\partial x}{f_{Y|X=x}(u_n)}\right|\right\}.\qquad(30)$$
Using (23) and Condition B.(i), we have $|\partial f_{Y|X=x}(u_n)/\partial x|=O\big(u_n^{-\gamma(x)-1}\gamma(x)\big)+O\big(u_n^{-\gamma(x)}\big)$ when $\xi(x)>0$. By Condition B.(ii) and under $\xi(x)=0$, we have $1/f_{Y|X=x}(u_n)\le Cu_n\exp\big(\bar D_TC_2u_n^{C_3}\big)$, where we denote $\bar D_T=\sup_{x\in B_{\eta_T}(x)}d(x)>0$. Thus, for any $x\in B_{\eta_T}(x)\cap\{x:\xi(x)>0\}$,
$$\left|\frac{\partial f_{Y|X=x}(u_n)/\partial x}{f_{Y|X=x}(u_n)}\right|\le Cu_n\exp\big(\bar D_TC_2u_n^{C_3}\big)\big(u_n^{-\gamma(x)-1}\gamma(x)+u_n^{-\gamma(x)}\big)$$
$$=Cu_n\exp\big(\bar D_TC_2u_n^{C_3}-(\gamma(x)+1)\log(u_n)+\log\gamma(x)\big)+Cu_n\exp\big(\bar D_TC_2u_n^{C_3}-\gamma(x)\log(u_n)\big)\le Cu_n\exp\big(\bar D_TC_2u_n^{C_3}-T^{\eta}\log(u_n)+\log\gamma(x)\big)=o(1),$$
where the last line follows from Condition 1.4 and the fact that $u_n$ is at most of the order $\exp(O(\sqrt{\log n}))$. The second term in (30) is bounded by
$$\sup_{x\in B_{\eta_T}(x)}\left|\frac{c_1(x)}{c(x)}+\tilde c_1(x)\log u_n+d_1(x)\tilde d(u_n)+\frac{r_1(x,u_n)}{1+r(x,u_n)}\right|=O\big(u_n^{C_3}\big)\le O\big((\log n)^{C_3/2}\big).$$
To bound the third term in (30), we have $|\partial f_{Y|X=x}(u_n)/\partial x|=O\big(-\bar u_n^{-\gamma(x)-1}\gamma(x)\big)+O\big(\bar u_n^{-\gamma(x)}\big)$ when $\xi(x)<0$. Then, similarly to bounding the first term, we have
$$\left|\frac{\partial f_{Y|X=x}(u_n)/\partial x}{f_{Y|X=x}(u_n)}\right|\le Cu_n\exp\big(\bar D_TC_2u_n^{C_3}\big)\big(-\bar u_n^{-\gamma(x)-1}\gamma(x)+\bar u_n^{-\gamma(x)}\big)$$
$$=Cu_n\exp\big(\bar D_TC_2u_n^{C_3}-(\gamma(x)+1)\log(\bar u_n)+\log(-\gamma(x))\big)+Cu_n\exp\big(\bar D_TC_2u_n^{C_3}-\gamma(x)\log(\bar u_n)\big)\le Cu_n\exp\big(\bar D_TC_2u_n^{C_3}+T^{\eta}\log(\bar u_n)+\log(-\gamma(x))\big)=o(1).$$
Thus (b) for $\xi(x)=0$ is established. $\blacksquare$

References
Abrevaya, J. (2001): "The Effects of Demographics and Maternal Behavior on the Distribution of Birth Outcomes," Empirical Economics, 26, 247–257.

Adrian, T., N. Boyarchenko, and D. Giannone (2019): "Vulnerable Growth," American Economic Review, 109(4), 1263–89.

Adrian, T., and M. K. Brunnermeier (2016): "CoVaR," American Economic Review, 106(7), 1705.

Angrist, J., V. Chernozhukov, and I. Fernández-Val (2006): "Quantile Regression under Misspecification, with an Application to the US Wage Structure," Econometrica, 74(2), 539–563.

Arnold, B. C., N. Balakrishnan, and H. N. Nagaraja (1992): A First Course in Order Statistics. SIAM.

Balkema, A. A., and L. de Haan (1974): "Residual Life Time at Great Age," The Annals of Probability, 2, 792–804.

Beare, B., and A. A. Toda (2017): "Geometrically Stopped Markovian Random Growth Processes and Pareto Tails," arXiv:1712.01431.

Beirlant, J., E. Joossens, and J. Segers (2004): "Discussion of 'Generalized Pareto Fit to the Society of Actuaries' Large Claims Database' by A. Cebrián, M. Denuit and P. Lambert," North American Actuarial Journal, 8, 108–111.

Berbee, H. (1987): "Convergence Rates in the Strong Law for Bounded Mixing Sequences," Probability Theory and Related Fields, 74, 255–270.

Chernozhukov, V. (2005): "Extremal Quantile Autoregression," The Annals of Statistics, 33(2), 806–839.

Chernozhukov, V., and I. Fernández-Val (2011): "Inference for Extremal Conditional Quantile Models, with an Application to Market and Birthweight Risks," The Review of Economic Studies, 78, 559–589.

Chernozhukov, V., I. Fernández-Val, and T. Kaji (2017): "Extremal Quantile Regression: An Overview," arXiv:1612.06850.

Chernozhukov, V., and L. Umantsev (2001): "Conditional Value-at-Risk: Aspects of Modeling and Estimation," Empirical Economics, 26(1), 271–293.

Daouia, A., L. Gardes, and S. Girard (2013): "On Kernel Smoothing for Extremal Quantile Regression," Bernoulli, 19, 2557–2589.

de Haan, L., and A. Ferreira (2007): Extreme Value Theory: An Introduction. Springer Science and Business Media, New York.

Ding, P. (2016): "On the Conditional Distribution of the Multivariate t Distribution," The American Statistician, 70, 293–295.

Doukhan, P., P. Massart, and E. Rio (1995): "Invariance Principles for Absolutely Regular Empirical Processes," Annales de l'I.H.P. Probabilités et Statistiques, 31(2), 393–427.

Elliott, G., U. K. Müller, and M. W. Watson (2015): "Nearly Optimal Tests When a Nuisance Parameter Is Present under the Null Hypothesis," Econometrica, 83, 771–811.

Engle, R. F., and S. Manganelli (2004): "CAViaR: Conditional Autoregressive Value at Risk by Regression Quantiles," Journal of Business & Economic Statistics, 22(4), 367–381.

Fan, J., T.-C. Hu, and Y. K. Truong (1994): "Robust Non-Parametric Function Estimation," Scandinavian Journal of Statistics, pp. 433–446.

Gabaix, X., J. Lasry, P. Lions, and B. Moll (2016): "The Dynamics of Inequality," Econometrica, 84(6), 2071–2111.

Gardes, L., S. Girard, and A. Lekina (2010): "Functional Nonparametric Estimation of Conditional Extreme Quantiles," Journal of Multivariate Analysis, 101, 419–433.

Gardes, L., A. Guillou, and A. Schorgen (2012): "Estimating the Conditional Tail Index by Integrating a Kernel Conditional Quantile Estimator," Journal of Statistical Planning and Inference, 142, 1586–1598.

Hall, P. (1982): "On Some Simple Estimates of an Exponent of Regular Variation," Journal of the Royal Statistical Society, Series B, 44(1), 37–42.

Hsiao, C., and M. Pesaran (2004): "Random Coefficient and Panel Data Models," in L. Mátyás and P. Sevestre (eds.), The Econometrics of Panel Data, Springer, 2008 (3rd ed.).

Jones, C. I., and J. Kim (2018): "A Schumpeterian Model of Top Income Inequality," Journal of Political Economy, 126(5), 1785–1826.

Kato, R., and Y. Sasaki (2017): "On Using Linear Quantile Regressions for Causal Inference," Econometric Theory, 33(3), 664–690.

Kelly, B., and H. Jiang (2014): "Tail Risk and Asset Prices," The Review of Financial Studies, 27(10), 2841–2871.

Koenker, R., and G. S. Bassett (1978): "Regression Quantiles," Econometrica, 46, 33–50.

Koenker, R., and K. Hallock (2001): "Quantile Regression: An Introduction," Journal of Economic Perspectives, 15, 143–156.

Li, Q., and J. S. Racine (2007): Nonparametric Econometrics: Theory and Practice. Princeton University Press.

Martins-Filho, C., F. Yao, and M. Torero (2018): "Nonparametric Estimation of Conditional Value-at-Risk and Expected Shortfall Based on Extreme Value Theory," Econometric Theory, 34, 23–67.

Müller, U. K., and Y. Wang (2017): "Fixed-k Asymptotic Inference about Tail Properties," Journal of the American Statistical Association, 112, 1134–1143.

Pickands, III, J. (1975): "Statistical Inference Using Extreme Order Statistics," Annals of Statistics, 3(1), 119–131.

Piketty, T., and E. Saez (2003): "Income Inequality in the United States, 1913–1998," The Quarterly Journal of Economics, 118(1), 1–41.

Qu, Z., and J. Yoon (2015): "Nonparametric Estimation and Inference on Conditional Quantile Processes," Journal of Econometrics, 185(1), 1–19.

Qu, Z., and J. Yoon (2019): "Uniform Inference on Quantile Effects under Sharp Regression Discontinuity Designs," Journal of Business & Economic Statistics, 37(4), 625–647.

Smith, R. L. (1982): "Uniform Rates of Convergence in Extreme-Value Theory," Advances in Applied Probability, 13(3), 600–622.

Smith, R. L. (1987): "Estimating Tails of Probability Distributions," Annals of Statistics, 15, 1174–1207.

Toda, A. A. (2019): "Wealth Distribution with Random Discount Factors," Journal of Monetary Economics, 104, 101–113.

Wang, H., and D. Li (2013): "Estimation of Extreme Conditional Quantiles Through Power Transformation," Journal of the American Statistical Association, 108(503), 1062–1074.

Wang, H., and C. L. Tsai (2009): "Tail Index Regression," Journal of the American Statistical Association, 104, 1233–1240.

Wooldridge, J. M. (2005): "Fixed-Effects and Related Estimators for Correlated Random-Coefficient and Treatment-Effect Panel Data Models," Review of Economics and Statistics, 87, 385–390.

Yu, K., and M. Jones (1998): "Local Linear Quantile Regression,"