Nonparametric C- and D-vine based quantile regression
Marija Tepegjozova, Jing Zhou, Gerda Claeskens, Claudia Czado
February 10, 2021
Abstract
Quantile regression is a field with steadily growing importance in statistical modeling. It is a complementary method to linear regression, since computing a range of conditional quantile functions provides a more accurate modelling of the stochastic relationship among variables, especially in the tails. We introduce a novel non-restrictive and highly flexible nonparametric quantile regression approach based on C- and D-vine copulas. Vine copulas allow for separate modeling of marginal distributions and the dependence structure in the data, and can be expressed through a graph theoretical model given by a sequence of trees. This way we obtain a quantile regression model that overcomes typical issues of quantile regression such as quantile crossings or collinearity, and the need for transformations and interactions of variables. Our approach incorporates a two-step ahead ordering of variables, by maximizing the conditional log-likelihood of the tree sequence, while taking into account the next two tree levels. Further, we show that the nonparametric conditional quantile estimator is consistent. The performance of the proposed methods is evaluated in both low- and high-dimensional settings using simulated and real world data. The results support the superior prediction ability of the proposed models.
Keywords: quantile regression, vine copula, conditional quantile function, nonparametric pair-copulas
∗ Department of Mathematics, Technische Universität München, Boltzmannstraße 3, 85748 Garching, Germany (email: [email protected] (corresponding author), [email protected])
† ORStat and Leuven Statistics Research Center, KU Leuven, Naamsestraat 69 - box 3555, Leuven, Belgium (email: [email protected], [email protected])

1 Introduction

As a robust alternative to ordinary least squares regression, which estimates the conditional mean, quantile regression (Koenker and Bassett 1978) focuses on the conditional quantiles. This method has been studied extensively in statistics, economics and finance. The pioneering literature Koenker (2005a) investigated linear quantile regression systematically. It presented properties of the estimators, including asymptotic normality and consistency, under various assumptions such as independence of the observations, independent and identically distributed (i.i.d.) errors with continuous distribution, and predictors having bounded second moment. Later, subsequent extensions of linear quantile regression have been intensively studied, see for example adapting quantile regression to the Bayesian framework (Yu and Moyeed 2001), to longitudinal data (Koenker 2004), time-series models (Xiao and Koenker 2009), high-dimensional models with an l1-regularizer (Belloni and Chernozhukov 2011), nonparametric estimation by kernel weighted local linear fitting (Yu and Jones 1998) and by additive models (Koenker 2011; Fenske et al. 2011), etc. The theoretical analysis of the above-mentioned extensions is based on imposing additional assumptions, such as samples that are i.i.d. (see for example Yu and Jones (1998); Belloni and Chernozhukov (2011)), or that are generated by a known additive function (see for example Koenker (2011); Koenker (2004)).
However, such assumptions, which guarantee the performance of the proposed methods for certain data structures, cause concerns in applications due to the uncertainty about real-world data structures. In addition to the above-mentioned issue, Bernard and Czado (2015) addressed other potential concerns such as quantile crossings and model misspecification, when the dependence structure of the response variables and the predictive variables does not follow a Gaussian copula. Hence, flexible models that do not assume homoscedasticity, or a linear relationship between the response and the predictive variables, are of interest. Recent research on dealing with this issue includes quantile forests (Meinshausen 2006; Li and Martin 2017; Athey et al. 2019), inspired by the earlier work on random forests (Breiman 2001). Further, flexible quantile modeling motivates research on modeling conditional quantiles using copulas (see also Noh et al. (2013); Noh et al. (2015); Chen et al. (2009)). Vine copulas in the context of conditional quantile prediction have already been investigated by Kraus and Czado (2017) using drawable vine copulas (D-vines) and by Chang and Joe (2019) using regular vines. The approach of Chang and Joe (2019) is based on first finding the locally optimal regular vine structure among all predictors and then adding the response to each selected tree in the vine structure as a leaf, as also followed by Bauer and Czado (2016) in the context of non-Gaussian conditional independence testing. The procedure in Chang and Joe (2019) allows for a recursive determination of the response quantiles, which however is restricted through the prespecified dependence structure among predictor variables. The latter might not be the one maximizing the conditional response likelihood, which is the main focus in a regression setup.
The approach of Kraus and Czado (2017) is based on optimizing the conditional log-likelihood and selecting predictors sequentially until no improvement of the conditional log-likelihood is achieved. This approach based on the conditional response likelihood is also more suited to determine the associated response quantiles. Therefore, we extend the method of Kraus and Czado (2017) by considering the two-step ahead tree sequence and its conditional log-likelihood for the class of both C- and D-vines. By construction, quantile crossings are avoided. Additionally, all marginal densities and copulas are estimated nonparametrically, allowing more flexibility than parametric specifications. The benefit of the nonparametric estimation of bivariate copulas in the quantile regression framework is addressed in Kraus and Czado (2017). This construction permits a large variety of dependence structures, resulting in a well-performing conditional quantile estimator. Moreover, extending to the C-vine copula class, in addition to the D-vine copulas, provides greater flexibility.

The paper is organized as follows. Section 2 introduces the general setup and the concept of C-vine and D-vine copulas, while also introducing the nonparametric approach for estimating copula densities. Section 3 describes in detail the vine based approach for quantile regression. The new two-step ahead forward selection algorithms are described in detail in Section 4. Since all densities involved in the model construction are estimated nonparametrically, we investigate in Proposition 4.1 the consistency of the conditional quantile estimator for given variable orders. The finite sample performance of the vine based conditional quantile estimator is evaluated in Section 5 by several quantile related measurements in various simulation settings. Further, we also apply the newly introduced algorithms to low- and high-dimensional real data in Section 6.
In Section 7 we conclude and discuss possible directions of future research.

Let X = (X_1, ..., X_d)^T be a random vector with observed values denoted as (x_1, ..., x_d)^T. The marginal distribution function of X_j is denoted by F_{X_j} for j = 1, ..., d and the joint distribution function of the random vector X is denoted by F. Sklar's theorem (Sklar 1959) offers a way to represent any given multivariate distribution in terms of its marginals and the corresponding copula encoding the dependence structure. If we assume that F_{X_j} and F are continuous, then there exists a copula C associated with the joint distribution of X such that

F(x_1, ..., x_d) = C(F_{X_1}(x_1), ..., F_{X_d}(x_d)),

with the joint density function

f(x_1, ..., x_d) = c(F_{X_1}(x_1), ..., F_{X_d}(x_d)) · f_{X_1}(x_1) · ... · f_{X_d}(x_d).

To inspect the dependence structure of X, we transform each variable by applying the probability integral transform (PIT). The corresponding transformed variables are defined as U_j := F_{X_j}(X_j), j = 1, ..., d, which are uniformly distributed on [0, 1]. The observed values of U = (U_1, ..., U_d)^T are denoted as (u_1, ..., u_d)^T. The joint distribution of U is a copula denoted by C_{U_1,...,U_d} with copula density function c_{U_1,...,U_d}. We utilize notation similar to Czado (2019, Section 5.4). Let D ⊂ {1, ..., d} be a subset of indices such that X_D is a sub-vector of the d-dimensional vector X. Let i, j ∈ {1, ..., d} \ D be two indices not belonging to the set D. C_{U_i,U_j;U_D} denotes the copula associated with the conditional distribution of (X_i, X_j) conditioned on X_D = x_D, with the corresponding copula density c_{U_i,U_j;U_D}. The set U_D = {F_{X_s}(X_s), s ∈ D} is called the conditioning set and (U_i, U_j) form the conditioned set. F_{X_i|X_D} denotes the conditional distribution of X_i conditioned on X_D = x_D.
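As a quick numerical illustration of the PIT mentioned above, the following sketch (with an exponential marginal chosen purely for illustration) checks that U = F_X(X) is uniform on [0, 1]:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# X has an exponential marginal (an assumption for illustration only)
x = rng.exponential(scale=2.0, size=100_000)

# PIT: U = F_X(X) with the true marginal CDF should be uniform on [0, 1]
u = stats.expon(scale=2.0).cdf(x)

# a Uniform(0, 1) variable has mean 1/2 and variance 1/12
print(round(u.mean(), 3), round(u.var(), 3))
```

The same transformation applied with estimated marginals produces the pseudo copula data used throughout the paper.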
C_{U_i|U_D} denotes the conditional distribution of the PIT transformed variable U_i conditioned on U_D, with corresponding density function c_{U_i|U_D}. The so called h-functions are defined as

h_{U_i|U_j}(u_i|u_j) := \partial/\partial u_j C_{U_i,U_j}(u_i, u_j).

In this case it holds that h_{U_i|U_j}(u_i|u_j) = C_{U_i|U_j}(u_i|u_j). For the conditioned case, we define the h-function as

h_{U_i|U_j;U_D}(u_i|u_j; u_D) := \partial/\partial u_j C_{U_i,U_j;U_D}(u_i, u_j; u_D).

When the simplifying assumption holds (Czado 2019, Section 5.4), that is, the copula function does not depend on the specific conditioning value, then h_{U_i|U_j;U_k}(u_i|u_j; u_k) = \partial/\partial u_j C_{U_i,U_j;U_k}(u_i, u_j). This assumption is often made for tractability reasons in higher dimensions, see Haff et al. (2010) and Stoeber et al. (2013).

The concept of vine copulas, first introduced in Joe (1996), is that any d-dimensional copula can be expressed in terms of d(d−1)/2 bivariate (conditional) pair-copulas. We express a d-dimensional vine copula in terms of a sequence of trees. The set of trees V = (T_1, ..., T_{d−1}) is a regular vine tree sequence on d elements if each tree T_j is connected, i.e. T_j contains a path between every pair of nodes; T_1 is a tree with node set N_1 = {1, ..., d} and edge set E_1; for j ≥ 2, T_j is a tree with node set N_j = E_{j−1} and edge set E_j, and if two nodes in T_j are joined by an edge, the corresponding edges in T_{j−1} must share a common node. This condition is also called the proximity condition. In addition to the vine tree sequence, a regular vine copula identifies each edge of the trees with a bivariate pair-copula defined through the set B(V) := {c_e | e ∈ E_j, 1 ≤ j ≤ d − 1}. Here each c_e corresponds to the density of the pair-copula associated with the corresponding edge e of V. We consider two different classes of regular vine tree sequences: C-vine tree sequences and D-vine tree sequences. A regular vine tree sequence V = (T_1, . . .
, T_{d−1}) is called a D-vine tree sequence if all nodes in the trees T_1, ..., T_{d−1} have degree less than or equal to 2, i.e. the trees are paths, where the nodes with degree 1 are called leaf nodes. Once the first tree T_1 in the tree sequence is defined, all the other trees T_2, ..., T_{d−1} are determined, since the proximity condition has to be satisfied. The right panel of Figure 1 shows an example of a D-vine tree sequence in four dimensions. The construction of the joint density f of a d-dimensional random vector X from a product of (conditional) bivariate copula densities associated with a D-vine tree sequence, and marginal densities, is given as

f(x_1, ..., x_d) = \prod_{j=1}^{d−1} \prod_{i=1}^{d−j} c_{U_{s_i},U_{s_{i+j}};U_{s_{i+1}},...,U_{s_{i+j−1}}}( F_{X_{s_i}|X_{s_{i+1}},...,X_{s_{i+j−1}}}(x_{s_i}|x_{s_{i+1}}, ..., x_{s_{i+j−1}}), F_{X_{s_{i+j}}|X_{s_{i+1}},...,X_{s_{i+j−1}}}(x_{s_{i+j}}|x_{s_{i+1}}, ..., x_{s_{i+j−1}}) ) · \prod_{k=1}^{d} f_{X_{s_k}}(x_{s_k}),

where s_1, ..., s_d corresponds to an arbitrary permutation of 1, ..., d. The distribution associated with this density decomposition, if the marginals are uniformly distributed, is called a D-vine copula.

We call a regular vine tree sequence V = (T_1, ..., T_{d−1}) a C-vine tree sequence if in each tree there is one node connected to all the other nodes, called a root node. This implies that all trees in the C-vine tree sequence are stars. The left panel of Figure 1 shows an example of a C-vine tree sequence in four dimensions. The construction of the joint density f of a d-dimensional random vector X based on a product of (conditional) bivariate copula densities associated with a C-vine tree structure, and marginal densities, is given as

f(x_1, ..., x_d) = \prod_{j=1}^{d−1} \prod_{i=1}^{d−j} c_{U_{s_j},U_{s_{j+i}};U_{s_1},...,U_{s_{j−1}}}( F_{X_{s_j}|X_{s_1},...,X_{s_{j−1}}}(x_{s_j}|x_{s_1}, ..., x_{s_{j−1}}), F_{X_{s_{j+i}}|X_{s_1},...,X_{s_{j−1}}}(x_{s_{j+i}}|x_{s_1}, . . .
, x_{s_{j−1}}) ) · \prod_{k=1}^{d} f_{X_{s_k}}(x_{s_k}),

where s_1, ..., s_d corresponds to an arbitrary permutation of 1, ..., d. The distribution associated with this density construction, when the marginals are uniformly distributed, is called a C-vine copula. For more details on the pair-copula construction based on C- and D-vine tree sequences refer to Czado (2019, Section 4.2). When non-simplified pair copulas are used as building blocks, these constructions are also possible decompositions of an arbitrary multivariate density.

Figure 1:
C-vine tree sequence (left panel) and a D-vine tree sequence (right panel) in 4 dimensions.
We discuss now how the h-functions can be estimated in a nonparametric way, due to their later importance in the estimation of conditional distribution functions. According to the simplifying assumption it holds that h_{U_i|U_j;U_D}(u_i|u_j; u_D) = \partial/\partial u_j C_{U_i,U_j;U_D}(u_i, u_j) can be estimated through the bivariate copula C_{U_i,U_j;U_D}, which does not depend on specific values of u_D. Thus, it is sufficient to show the estimation procedure for the h-function h_{U_i|U_j} associated with an arbitrary pair copula. Consider a pair (U_i, U_j) for which the h-function h_{U_i|U_j} = C_{U_i|U_j} can be estimated by the following rescaled estimator

\hat{C}_{U_i|U_j}(u_i|u_j) = \int_0^{u_i} \hat{c}_{U_i,U_j}(\tilde{u}_i, u_j) d\tilde{u}_i \Big/ \int_0^1 \hat{c}_{U_i,U_j}(\tilde{u}_i, u_j) d\tilde{u}_i,

where \hat{c}_{U_i,U_j} is a nonparametric estimator of the bivariate copula density of (U_i, U_j). Example estimators of the copula density c_{U_i,U_j} are the transformation estimator (Charpentier et al. 2007), the transformation local likelihood estimator (Geenens et al. 2017), the tapered transformation estimator (Wen and Wu 2015), the beta kernel estimator (Charpentier et al. 2007), and the mirror-reflection estimator (Gijbels and Mielniczuk 1990). Among the above-mentioned kernel estimators, the transformation local likelihood estimator (Geenens et al. 2017) was found by Nagler et al. (2017) to have an overall best performance. The estimator is implemented in the R packages kdecopula (Nagler 2018) and rvinecopulib (Nagler and Vatter 2019b) using Gaussian kernels. This is the estimator that we use in the following. We shortly review its construction in Appendix A.

In a regression framework some variables among the set of potential covariates X = (X_1, ..., X_p)^T have predictive ability for the response Y ∈ R. The main interest of vine based quantile regression is to predict the α ∈ (0,
1) quantile of the response variable Y given X, which can be achieved by a joint modeling of (Y, X)^T and by using the conditional quantile function

q_α(x_1, ..., x_p) = F^{-1}_{Y|X_1,...,X_p}(α|x_1, ..., x_p).

The conditional distribution of Y given X on the u-scale, for V = F_Y(Y) and U_j = F_{X_j}(X_j), j = 1, ..., p, as derived in Kraus and Czado (2017), is given as

F_{Y|X_1,...,X_p}(y|x_1, ..., x_p) = C_{V|U_1,...,U_p}(v|u_1, ..., u_p),

and its corresponding inverse, the quantile function, as

F^{-1}_{Y|X_1,...,X_p}(α|x_1, ..., x_p) = F^{-1}_Y( C^{-1}_{V|U_1,...,U_p}(α|u_1, ..., u_p) ).   (1)

Here C_{V|U_1,...,U_p} is the conditional distribution function of V given U_j = u_j for j = 1, ..., p, and C_{V,U_1,...,U_p} is defined as the (p + 1)-dimensional copula associated with the joint distribution of (Y, X)^T. Compared to Section 2, here it holds that d = p + 1. The above equations indicate that the conditional quantile function of Y given X can be expressed as a composition of the inverses of the marginal distribution F_Y and the conditional distribution C_{V|U_1,...,U_p}. Thus, an estimate of the conditional quantile function can be obtained from the estimated inverses of the marginal distributions, denoted by \hat{F}^{-1}_Y, \hat{F}^{-1}_{X_j}, j = 1, ..., p, and of the conditional distribution function, denoted by \hat{C}^{-1}_{V|U_1,...,U_p}, such that

\hat{q}_α(x_1, ..., x_p) = \hat{F}^{-1}_Y( \hat{C}^{-1}_{V|U_1,...,U_p}(α|\hat{u}_1, ..., \hat{u}_p) ),

where \hat{u}_j = \hat{F}_{X_j}(x_j) for j = 1, ..., p. In fact, C_{V,U_1,...,U_p} can be any (p + 1)-dimensional multivariate copula. Using C- and D-vine pair-copula constructions this high-dimensional and complex estimation problem can be solved in a sequential and tractable way. Additional flexibility is achieved by allowing for nonparametric pair-copulas as building blocks.
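To make Equation (1) concrete, here is a minimal sketch for p = 1 with a Gaussian pair copula and standard normal marginals (an illustrative parametric assumption; the paper estimates all components nonparametrically). For jointly Gaussian (Y, X) the conditional α-quantile has the closed form ρx + sqrt(1−ρ²)Φ^{-1}(α), which the copula-based computation reproduces:

```python
import numpy as np
from scipy.stats import norm

rho = 0.7  # illustrative copula parameter

def inv_h_gauss(alpha, u, rho):
    # C^{-1}_{V|U}(alpha | u) for a Gaussian pair copula with parameter rho
    return norm.cdf(np.sqrt(1 - rho**2) * norm.ppf(alpha) + rho * norm.ppf(u))

def cond_quantile(alpha, x):
    # q_alpha(x) = F_Y^{-1}( C^{-1}_{V|U}(alpha | F_X(x)) ), here with N(0,1) marginals
    return norm.ppf(inv_h_gauss(alpha, norm.cdf(x), rho))

# closed form for jointly Gaussian (Y, X): rho*x + sqrt(1 - rho^2) * z_alpha
x0, alpha = 1.3, 0.95
closed_form = rho * x0 + np.sqrt(1 - rho**2) * norm.ppf(alpha)
print(np.isclose(cond_quantile(alpha, x0), closed_form))  # True
```

With nonparametric pair copulas the inverse h-function is no longer available in closed form and is obtained numerically, but the composition F_Y^{-1} ∘ C^{-1}_{V|U} is the same.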
In particular, the order of the predictors within the tree sequence itself is a free parameter with direct impact on the target function C_{V|U_1,...,U_p} and thus on the corresponding prediction performance. However, we want to make sure that evaluating the conditional distribution function C_{V|U_1,...,U_p} from the joint copula C_{V,U_1,...,U_p} remains feasible. The pair-copula construction on C- and D-vines provides a strategy for deriving the conditional distribution C_{V|U_1,...,U_p} using specific pair-copulas and the following recursion for conditional distribution functions (Joe 1997). Let W = (W_1, ..., W_p)^T be uniformly distributed random variables with observed values w = (w_1, ..., w_p)^T. Further, let D ⊂ {1, ..., p} and define the set D_{−j} := D \ {j} for some j ∈ D. Then, for any i ∉ D and j ∈ D, it holds that

C_{W_i|W_D}(w_i|w_D) = h_{W_i|W_j;W_{D_{−j}}}( C_{W_i|W_{D_{−j}}}(w_i|w_{D_{−j}}) | C_{W_j|W_{D_{−j}}}(w_j|w_{D_{−j}}) ),   (2)

where h_{W_i|W_j;W_{D_{−j}}} is the h-function associated with the pair copula C_{W_i,W_j;W_{D_{−j}}}. As shown, for example, in Tepegjozova (2019), this implies that C_{V|U_1,...,U_p} can be derived using only pair copulas defined by the tree sequence of the vine copula C_{V,U_1,...,U_p} if and only if, in a D-vine, the response V is a leaf node in the first tree of the tree sequence. For a C-vine we need to require that the node containing the response variable V in the conditioned set is not a root node in any tree. More details are given in Example 3.1.

Example 3.1.
Consider the C-vine copula associated to the tree sequence given in the left panel of Figure 1. In the tree sequence V = (T_1, T_2, T_3) we can see that all trees are stars, where in each tree there is a root node connected to all other nodes in that tree level, i.e. 1 is the root node of T_1, while 1,2 and 2,3;1 are the root nodes of trees T_2 and T_3, respectively. Using (2) the conditional distribution C_{4|1,2,3} can be expressed as

C_{4|1,2,3}(u_4|u_1, u_2, u_3) = h_{4|3;1,2}( C_{4|1,2}(u_4|u_1, u_2) | C_{3|1,2}(u_3|u_1, u_2) )
= h_{4|3;1,2}( h_{4|2;1}( C_{4|1}(u_4|u_1) | C_{2|1}(u_2|u_1) ) | h_{3|2;1}( C_{3|1}(u_3|u_1) | C_{2|1}(u_2|u_1) ) )   (3)
= h_{4|3;1,2}( h_{4|2;1}( h_{4|1}(u_4|u_1) | h_{2|1}(u_2|u_1) ) | h_{3|2;1}( h_{3|1}(u_3|u_1) | h_{2|1}(u_2|u_1) ) ).

The corresponding inverse C^{-1}_{4|1,2,3} can be obtained as

C^{-1}_{4|1,2,3}(α|u_1, u_2, u_3) = h^{-1}_{4|1}( h^{-1}_{4|2;1}( h^{-1}_{4|3;1,2}( α | h_{3|2;1}( h_{3|1}(u_3|u_1) | h_{2|1}(u_2|u_1) ) ) | h_{2|1}(u_2|u_1) ) | u_1 ).

C_{4|1,2,3} is expressed using only the h-functions h_{4|3;1,2}, h_{4|2;1}, h_{3|2;1}, h_{4|1}, h_{3|1} and h_{2|1}, which are derived by taking the appropriate derivatives of the pair copulas C_{3,4;1,2}, C_{2,4;1}, C_{2,3;1}, C_{1,4}, C_{1,3} and C_{1,2}, respectively. With E(T_i) the edge set of tree T_i it holds that {(3,4;1,2)} ⊆ E(T_3), {(2,3;1), (2,4;1)} ⊆ E(T_2), and {(1,2), (1,3), (1,4)} ⊆ E(T_1). Thus, the pair copulas C_{3,4;1,2}, C_{2,4;1}, C_{2,3;1}, C_{1,4}, C_{1,3} and C_{1,2} are included in B(V), the set of pair copula families defining the vine. On the other hand, the conditional distribution function C_{2|1,3,4} can be expressed as

C_{2|1,3,4}(u_2|u_1, u_3, u_4) = h_{2|1;3,4}( C_{2|3,4}(u_2|u_3, u_4) | C_{1|3,4}(u_1|u_3, u_4) )
= h_{2|3;1,4}( C_{2|1,4}(u_2|u_1, u_4) | C_{3|1,4}(u_3|u_1, u_4) )
= h_{2|4;1,3}( C_{2|1,3}(u_2|u_1, u_3) | C_{4|1,3}(u_4|u_1, u_3) ).

The h-functions h_{2|1;3,4}, h_{2|3;1,4} and h_{2|4;1,3} can only be derived from the pair copulas C_{1,2;3,4}, C_{2,3;1,4} and C_{2,4;1,3}. However, the edges {(1,2;3,4)} ⊄ E(T_3), {(2,3;1,4)} ⊄ E(T_3), {(2,4;1,3)} ⊄ E(T_3), thus the needed pair copulas are not included in B(V).

Using Equation (2) we can derive any of the conditional distributions C_{4|1,2,3}, C_{3|1,2,4}, C_{2|1,3,4} and C_{1|2,3,4}. However, for the conditional distribution C_{4|1,2,3}, as seen from Equation (3), this can be done using only pair copulas which are included in B(V). The same holds for the conditional distribution C_{3|1,2,4}. But for the conditional distributions C_{2|1,3,4} and C_{1|2,3,4} additional pair copulas, which are not included in B(V), are needed. These are determined by the vine specification, however they require integration over higher dimensional marginal distributions of the vine, which is computationally intractable.

As motivated by Example 3.1, we assume that if C_{V,U_1,...,U_p} is a D-vine, then V is fixed as a leaf node in its first tree, and if C_{V,U_1,...,U_p} is a C-vine, then the node containing the response V in the conditioned set is excluded from being part of the root node specification in any tree of its tree sequence. We introduce the concept of an order of the nodes in a C- and D-vine, as defined in Tepegjozova (2019). A D-vine copula, denoted by C_D, has order O_D(C_D) = (V, U_{i_1}, ..., U_{i_p}), if the response V is the first node of the first tree T_1 and U_{i_k} is the (k + 1)-th node of T_1, for k = 1, ..., p. A C-vine copula, denoted by C_C, has order O_C(C_C) = (V, U_{i_1}, ..., U_{i_p}), if U_{i_1} is the root node in the first tree T_1, U_{i_2}U_{i_1} is the root node in the second tree T_2, and U_{i_k}U_{i_{k−1}}; U_{i_1}, ..., U_{i_{k−2}} is the root node in the k-th tree T_k for k = 3, ..., p −
1. The goal of C- and D-vine based quantile regression is finding the model with the optimal order with respect to a fit measure, as the order has a direct impact on the prediction performance. To be able to compare and quantify the explanatory power of models with different orders of the predictors, we use the conditional log-likelihood function as a fit measure. For N independent identically distributed observations from the random vector (V, U_1, ..., U_p)^T, denoted by v := (v^{(1)}, ..., v^{(N)}) and u_j := (u_j^{(1)}, ..., u_j^{(N)}), for j = 1, ..., p, with a fitted nonparametric C- or D-vine copula, denoted by \hat{C}, with order O(\hat{C}) = (V, U_1, ..., U_p), the conditional log-likelihood can be expressed as

cll(\hat{C}, v, (u_1, ..., u_p)) = \sum_{n=1}^{N} log \hat{c}_{V|U_1,...,U_p}(v^{(n)}|u_1^{(n)}, ..., u_p^{(n)}).

This is for example shown in Tepegjozova (2019), where it is also shown that this expression can be rewritten as

cll(\hat{C}, v, (u_1, ..., u_p)) = \sum_{n=1}^{N} [ log \hat{c}_{V,U_1}(v^{(n)}, u_1^{(n)}) + \sum_{j=2}^{p} log \hat{c}_{V,U_j;U_1,...,U_{j−1}}( \hat{C}_{V|U_1,...,U_{j−1}}(v^{(n)}|u_1^{(n)}, ..., u_{j−1}^{(n)}), \hat{C}_{U_j|U_1,...,U_{j−1}}(u_j^{(n)}|u_1^{(n)}, ..., u_{j−1}^{(n)}) ) ].

Penalizations for model complexity can be added as well.
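The rewritten cll decomposition can be checked numerically in a small parametric special case. The sketch below uses a 3-dimensional D-vine with order (V, U_1, U_2) and Gaussian pair copulas (an illustrative assumption; the paper uses nonparametric pair copula estimates) and verifies that the decomposition matches the conditional log-likelihood computed directly from the trivariate Gaussian copula:

```python
import numpy as np
from scipy.stats import norm

def log_c_gauss(u, v, rho):
    # log density of a bivariate Gaussian copula
    x, y = norm.ppf(u), norm.ppf(v)
    return (-0.5 * np.log(1 - rho**2)
            + (2 * rho * x * y - rho**2 * (x**2 + y**2)) / (2 * (1 - rho**2)))

def h_gauss(u, v, rho):
    # h-function C_{U|V}(u | v) of a Gaussian pair copula
    return norm.cdf((norm.ppf(u) - rho * norm.ppf(v)) / np.sqrt(1 - rho**2))

# correlations of the trivariate Gaussian copula (illustrative values)
rho_v1, rho_12, rho_v2 = 0.6, 0.4, 0.5
# the simplified copula C_{V,U2;U1} is parameterized by the partial correlation
rho_v2_g1 = (rho_v2 - rho_v1 * rho_12) / np.sqrt((1 - rho_v1**2) * (1 - rho_12**2))

rng = np.random.default_rng(1)
v, u1, u2 = rng.uniform(size=(3, 5))

# cll via the decomposition in the text: log c_{V,U1} + log c_{V,U2;U1}(...)
cll = np.sum(log_c_gauss(v, u1, rho_v1)
             + log_c_gauss(h_gauss(v, u1, rho_v1), h_gauss(u2, u1, rho_12), rho_v2_g1))

# direct computation of sum log c_{V|U1,U2} from the trivariate Gaussian copula
z = norm.ppf(np.vstack([v, u1, u2]))
S = np.array([[1, rho_v1, rho_v2], [rho_v1, 1, rho_12], [rho_v2, rho_12, 1]])
w = np.linalg.solve(S[1:, 1:], S[0, 1:])   # regression coefficients of Z_V on (Z_1, Z_2)
mu = w @ z[1:]                             # conditional mean of Z_V
s2 = 1 - S[0, 1:] @ w                      # conditional variance of Z_V
cll_direct = np.sum(norm.logpdf(z[0], mu, np.sqrt(s2)) - norm.logpdf(z[0]))

print(np.isclose(cll, cll_direct))  # True
```

For Gaussian copulas the simplifying assumption holds exactly, which is why the two computations agree to numerical precision.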
Given a response variable V and a set of p predictors U_1, ..., U_p, the goal of the proposed vine based quantile regression is to find the cll-optimal C- or D-vine copula, i.e. the C- or D-vine with the highest conditional log-likelihood. The definition of the order of a C- or D-vine implies that the order uniquely determines the underlying tree sequence and vice versa. Further, since the response variable is fixed as the first element of the order, while the other elements can be any permutation of the predictors, there are p! different orders. This implies that there are p! different C-vines and p! different D-vines that can be fitted. However, in practice, fitting all possible C- or D-vines and comparing the conditional log-likelihood values would be computationally inefficient. Thus, the idea is to have an algorithm that sequentially chooses the elements of the order one by one, so that at every step the resulting model for the prediction of the conditional quantiles has the highest conditional log-likelihood. The first algorithm of this type is the one-step ahead D-vine quantile regression introduced by Kraus and Czado (2017). The idea of the one-step ahead algorithm is to sequentially update the order of the D-vine by adding the predictor that yields the highest improvement in the conditional log-likelihood at each step. Building upon the idea of Kraus and Czado (2017), we propose an algorithm which allows for more flexibility and which in particular is less greedy, given the intention to obtain a globally optimal C- or D-vine fit. The algorithm builds the C- or D-vine step by step, starting with an order consisting of only the response variable V. In each step it adds one of the predictors to the order based on the improvement of the conditional log-likelihood, while taking into account the possibility of future improvement, i.e. extending our view two steps ahead in the order.
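The selection logic just described can be sketched generically, with the vine fitting abstracted into a placeholder scoring function (the `cll` argument below is hypothetical; in the actual algorithm it is the conditional log-likelihood of a fitted C- or D-vine):

```python
import numpy as np

def two_step_ahead_order(p, cll):
    # Build the order one predictor at a time; each candidate c is scored by
    # the best score achievable by extending (order, c) with one more j.
    order = []
    remaining = set(range(p))
    while remaining:
        if len(remaining) == 1:
            order.append(remaining.pop())
            break
        best_c, best_score = None, -np.inf
        for c in remaining:
            # look two steps ahead in the order
            score = max(cll(order + [c, j]) for j in remaining - {c})
            if score > best_score:
                best_c, best_score = c, score
        order.append(best_c)
        remaining.remove(best_c)
    return order

# toy score (purely illustrative): small individual gains for predictors 0 and 1,
# a large gain for 2, plus a synergy bonus when 0 and 1 are both included
gains = {0: 0.2, 1: 0.1, 2: 1.0}

def toy_cll(idx):
    return sum(gains[i] for i in idx) + (5.0 if {0, 1} <= set(idx) else 0.0)

order = two_step_ahead_order(3, toy_cll)
# a one-step greedy search would start with predictor 2; looking two steps
# ahead discovers the 0-1 synergy first
print(order[0] != 2)
```

This is only the search skeleton; the randomized candidate reduction described below further restricts the sets iterated over in the inner loop.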
Further, as already discussed in Section 2.1, the pair copulas at each step are estimated nonparametrically, in contrast to the parametric approach of Kraus and Czado (2017). We present the implementation for both C-vine and D-vine based quantile regression as a single algorithm, in which we can choose whether a C-vine or a D-vine model is fitted, based on the given data set and background knowledge of the dependency structures in the data. Furthermore, since implementing the algorithm on a large data set is computationally expensive, we introduce some randomization in the algorithm, so that it remains computationally tractable in high dimensions.

Input and data preprocessing:
Let there be given a data set of N independent and identically distributed observations, denoted by y := (y^{(1)}, ..., y^{(N)}) and x_j := (x_j^{(1)}, ..., x_j^{(N)}), for j = 1, ..., p, from the random vector (Y, X_1, ..., X_p)^T. The raw input data are on the x-scale, but in order to fit bivariate copulas to the data we need to transform them to the u-scale. To achieve the desired transformation the probability integral transform is employed. However, since the marginal distributions are usually unknown, estimates need to be acquired. F_Y and F_{X_j}, for j = 1, ..., p, are estimated using a univariate nonparametric kernel density estimator, as implemented in the R package kde1d (Nagler and Vatter 2019a). With the marginals estimated, the pseudo copula data are obtained as \hat{v}^{(n)} := \hat{F}_Y(y^{(n)}) and \hat{u}_j^{(n)} := \hat{F}_{X_j}(x_j^{(n)}), for n = 1, ..., N, j = 1, ..., p. Also, assume that the normalized marginals are defined as Z_j := Φ^{-1}(U_j) for j = 1, ..., p, and Z_V := Φ^{-1}(V), where Φ denotes the cumulative distribution function of a standard Gaussian distribution.

Step 1:
In the first step of the algorithm, to reduce computational complexity, we perform a preselection of the predictors based on the Kendall's τ correlation values. The motivation for this choice is based on the fact that Kendall's τ is rank-based, therefore invariant with respect to monotone transformations of the marginals, and can be expressed in terms of pair copulas. Using the pseudo copula data

(\hat{v}, \hat{u}_j) = { \hat{v}^{(n)}, \hat{u}_j^{(n)} | n = 1, ..., N },

estimates \hat{τ}_{VU_j} of the Kendall's τ values between the response V and all possible predictors U_j for j = 1, ..., p are obtained. Further, for a given k ≤ p, the k largest estimates of |\hat{τ}_{VU_j}| are selected and the corresponding indices q_1, ..., q_k are identified such that

|\hat{τ}_{VU_{q_1}}| ≥ |\hat{τ}_{VU_{q_2}}| ≥ ... ≥ |\hat{τ}_{VU_{q_k}}| ≥ |\hat{τ}_{VU_{q_{k+1}}}| ≥ ... ≥ |\hat{τ}_{VU_{q_p}}|.

The k candidate predictors and the corresponding candidate index set of step 1 are defined as U_{q_1}, ..., U_{q_k} and the set K_1 = {q_1, ..., q_k}, respectively. Next, for all c ∈ K_1 and j ∈ {1, ..., p} \ {c}, the candidate two-step ahead C- or D-vine copulas are defined as the 3-dimensional copulas C^1_{c,j} with order O(C^1_{c,j}) = (V, U_c, U_j). The first predictor is added to the order based on the conditional log-likelihood of the candidate two-step ahead C- or D-vine copulas C^1_{c,j}, given as

cll(C^1_{c,j}, \hat{v}, (\hat{u}_c, \hat{u}_j)) = \sum_{n=1}^{N} [ log \hat{c}_{V,U_c}(\hat{v}^{(n)}, \hat{u}_c^{(n)}) + log \hat{c}_{V,U_j;U_c}( \hat{h}_{V|U_c}(\hat{v}^{(n)}|\hat{u}_c^{(n)}), \hat{h}_{U_j|U_c}(\hat{u}_j^{(n)}|\hat{u}_c^{(n)}) ) ].

For each candidate predictor U_c the maximal two-step ahead conditional log-likelihood at step 1, cll^1_c, is defined as

cll^1_c := max_{j ∈ {1,...,p} \ {c}} cll(C^1_{c,j}, \hat{v}, (\hat{u}_c, \hat{u}_j)), for all c ∈ K_1.
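The Kendall's τ preselection of this step can be sketched as follows; the data-generating mechanism and all numerical choices are purely illustrative:

```python
import numpy as np
from scipy.stats import kendalltau

rng = np.random.default_rng(7)
N, p, k = 500, 6, 3

# hypothetical pseudo copula data for the predictors
u = rng.uniform(size=(N, p))
# build a response that depends mainly on predictors 0 and 3 (by construction)
v_raw = 2.0 * u[:, 0] + 1.0 * u[:, 3] + 0.1 * rng.normal(size=N)
v = v_raw.argsort().argsort() / (N + 1)  # rank-transform back to the u-scale

# estimate tau(V, U_j) for every predictor and keep the k largest |tau|
tau = np.array([kendalltau(v, u[:, j])[0] for j in range(p)])
K1 = np.argsort(-np.abs(tau))[:k]  # candidate index set K_1

print(sorted(K1.tolist()))
```

By construction predictors 0 and 3 carry the signal, so they should appear in the preselected set K_1.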
Finally, based on the maximal two-step ahead conditional log-likelihood at step 1, cll^1_c, the index t_1 is chosen as

t_1 := argmax_{c ∈ K_1} cll^1_c,

and the corresponding candidate predictor U_{t_1} is selected as the first predictor added to the order. An illustration of the vine tree structure of the candidate two-step ahead copulas C^1_{c,j}, in the case of fitting a D-vine model, with order O_D(C^1_{c,j}) = (V, U_c, U_j), is given in Figure 2. Finally, the current optimal fit after the first step is the C-vine or D-vine copula C^1 with order O(C^1) = (V, U_{t_1}).

Figure 2: V is fixed as the first node of T_1 and the first candidate predictor to be included in the model, U_c (gray), is chosen based on the conditional log-likelihood of the two-step ahead copula including the predictor U_j (gray filled).

Step r:

After r − 1 steps, the current optimal fit is the C- or D-vine copula C^{r−1} with order O(C^{r−1}) = (V, U_{t_1}, ..., U_{t_{r−1}}). At each previous step i, the order of the current optimal fit was sequentially updated with the predictor U_{t_i} for i = 1, ..., r −
1. At the r-th step the next predictor candidate is to be included. In order to do so, the set of potential candidates is narrowed based on a partial correlation measure. Thus, estimates of the empirical Pearson's partial correlation, \hat{ρ}_{Z_V,Z_j;Z_{t_1},...,Z_{t_{r−1}}}, between the normalized response variable V and the available predictors U_j for j ∈ {1, 2, ..., p} \ {t_1, ..., t_{r−1}}, are obtained. Similar to the first step, a set of candidate predictors of size k is selected based on the largest values of |\hat{ρ}_{Z_V,Z_j;Z_{t_1},...,Z_{t_{r−1}}}|, and the corresponding indices q_1, ..., q_k are identified such that

|\hat{ρ}_{Z_V,Z_{q_1};Z_{t_1},...,Z_{t_{r−1}}}| ≥ ... ≥ |\hat{ρ}_{Z_V,Z_{q_k};Z_{t_1},...,Z_{t_{r−1}}}| ≥ ... ≥ |\hat{ρ}_{Z_V,Z_{q_{p−r+1}};Z_{t_1},...,Z_{t_{r−1}}}|.

The k candidate predictors and the corresponding candidate index set of step r are defined as U_{q_1}, ..., U_{q_k} and the set K_r = {q_1, ..., q_k}, respectively. Next, for all c ∈ K_r and j ∈ {1, 2, ..., p} \ {t_1, ..., t_{r−1}, c}, the candidate two-step ahead C- or D-vine copulas are defined as the copulas C^r_{c,j} with order O(C^r_{c,j}) = (V, U_{t_1}, ..., U_{t_{r−1}}, U_c, U_j). At this point there are k × (p − r) different candidate two-step ahead C- or D-vine copulas C^r_{c,j}, and their corresponding conditional log-likelihood is given as

cll(C^r_{c,j}, \hat{v}, (\hat{u}_{t_1}, ..., \hat{u}_{t_{r−1}}, \hat{u}_c, \hat{u}_j)) = cll(C^{r−1}, \hat{v}, (\hat{u}_{t_1}, ..., \hat{u}_{t_{r−1}}))
+ \sum_{n=1}^{N} log \hat{c}_{V,U_c;U_{t_1},...,U_{t_{r−1}}}( \hat{C}_{V|U_{t_1},...,U_{t_{r−1}}}(\hat{v}^{(n)}|\hat{u}_{t_1}^{(n)}, ..., \hat{u}_{t_{r−1}}^{(n)}), \hat{C}_{U_c|U_{t_1},...,U_{t_{r−1}}}(\hat{u}_c^{(n)}|\hat{u}_{t_1}^{(n)}, ..., \hat{u}_{t_{r−1}}^{(n)}) )
+ \sum_{n=1}^{N} log \hat{c}_{V,U_j;U_{t_1},...,U_{t_{r−1}},U_c}( \hat{C}_{V|U_{t_1},...,U_{t_{r−1}},U_c}(\hat{v}^{(n)}|\hat{u}_{t_1}^{(n)}, . . .
, ˆ u ( n ) t r − , ˆ u ( n ) c (cid:17) , ˆ C U j | U t ,...,U tr − ,U c (cid:16) ˆ u ( n ) j | ˆ u ( n ) t , . . . , ˆ u ( n ) t r − , ˆ u ( n ) c (cid:17)(cid:17) . The r -th predictor is then added to the order based on the maximal two-step ahead con-ditional log-likelihood at Step r , cll rc , defined as cll rc := max j ∈{ , ,...,p }\{ t ,...,t r − ,c } cll (cid:0) C rc,j , ˆ v , (cid:0) ˆ u t . . . ˆ u t r − , ˆ u c , ˆ u j (cid:1)(cid:1) , ∀ c ∈ K r . (4)Namely, the index t r is chosen as t r := arg max c ∈ K r cll rc , and the corresponding candidate predictor U t r is selected as the r − th predictor of theorder. An illustration of the vine tree structure of the candidate two-step ahead copulas12 rc,j , in the case of fitting a D-vine model, with order O D (cid:0) C rc,j (cid:1) = (cid:0) V, U t , . . . , U t r − , U c , U j (cid:1) is given in Figure 3. At this step, the current optimal fit is the C-vine or D-vine copula C r ,with order O ( C r ) = ( V, U t , . . . U t r ) . The iterative procedure is repeated until all predictors are included in the order of the C-or D-vine copula model. T V U t ... U t r − U c U j T V U t U t U t ... U t r − U c U c U j ... T r V U t r − ; U t r − U t U c ; U t r − U t U j ; U t r − ,c T r +1 V U c ; U t r − U U j ; U t r ,c V U j ; U t r − ,c Figure 3:
In step r , the current optimal fit, C r − (black), is extended by one more predictor, U c (gray), to obtain the new current optimal fit C r (black and gray), based on the conditionallog-likelihood of the two-step ahead copula C rc,j which also includes the predictor U j (gray filled).(In the figure, we use the shortened notation U t r − instead of writing U t , . . . , U t r − and we use U t r − ,c instead of U t , . . . , U t r − , U c .) The search procedure described above requires calculating p − ( r −
1) conditional log-likelihoods for each candidate predictor at a given step r . This leads to calculating a totalof ( p − r + 1) × k conditional log-likelihoods, where k denotes the number of candidates. Incases where p is large, this search procedure would cause a heavy computational burden.Hence, the idea is to reduce the number of conditional log-likelihoods calculated for eachcandidate predictor. This is done by reducing the size of the set over which the maximaltwo-step ahead conditional log-likelihood, cll rc in Equation 4, is computed. Instead of takingthe maximum over the set { , , . . . , p }\{ t , . . . , t r − , c } , it can be taken over an appropriatesubset. This subset can be then chosen either based on the largest Pearson’s partialcorrelations in absolute value, | ˆ ρ Z V ,Z j ; Z t ,...,Z tr − ,Z c | , by random selection or a combinationof the two, where the selection method and the size of reduction are user-decided.13 .2 Consistency of the conditional quantile estimator Example 3.1 gives an intuition on expressing the conditional quantile function C − V | U ,...,U p recursively by the inverses of h-functions for fixed variable orders. The conditional quantilefunction on the original scale in Equation (1) requires the inverse of the marginal distri-bution function of Y . Following Kraus and Czado (2017); Noh et al. (2013), the marginalcumulative distribution functions F Y and F X j , for j = 1 , . . . p , are estimated nonparamet-rically to reduce the bias caused by model misspecification. Examples of nonparametricestimators for the marginal distributions F Y and F X j , for j = 1 , . . . p , are the continuouskernel smoothing estimator (Parzen 1962) and the transformed local likelihood estimator inthe univariate case (Geenens 2014). Using a Gaussian kernel, the above two estimators ofthe marginal distribution are strong uniformly consistent. 
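As a concrete illustration of the kind of kernel-smoothed marginal distribution estimator referenced above, the following is a minimal sketch. The function name, the Silverman-type bandwidth rule, and the toy data are illustrative assumptions, not the implementation used in the paper.

```python
import numpy as np
from scipy.stats import norm

def smoothed_cdf(sample, h=None):
    """Gaussian-kernel smoothed estimate of a marginal cdf F.

    F_hat(x) = (1/N) * sum_n Phi((x - X_n) / h), i.e. the Gaussian kernel
    density estimate integrated analytically.  The bandwidth rule below
    (a Silverman-type rule of thumb) is an illustrative choice.
    """
    sample = np.asarray(sample, dtype=float)
    n = sample.size
    if h is None:
        h = 1.06 * sample.std(ddof=1) * n ** (-1 / 5)
    # returns a callable, smooth and monotone estimator of F
    return lambda x: norm.cdf((np.asarray(x)[..., None] - sample) / h).mean(axis=-1)

rng = np.random.default_rng(1)
x = rng.normal(size=2000)          # toy sample from a standard normal
F = smoothed_cdf(x)
# F(0.0) should be close to the true value 0.5 for this sample
```

Unlike the empirical distribution function, this estimator is continuous, which matters here because its inverse (a quantile function) must be evaluated when transforming between the original and the copula scale.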
When, in addition, all inverses of the h-functions are estimated nonparametrically, we establish the consistency of the conditional quantile estimator $\hat{F}^{-1}_{Y \mid X_1, \ldots, X_p}$ in Proposition 4.1 for fixed variable orders. Appendix B contains the proof.

Proposition 4.1.
Let the inverses of the marginal distribution functions $F_Y$ and $F_{X_j}$, $j = 1, \ldots, p$, be uniformly continuous and estimated nonparametrically, and let the inverses of the h-functions expressing the conditional quantile estimator $C^{-1}_{V \mid U_1, \ldots, U_p}$ be uniformly continuous and estimated nonparametrically in the interior of the support of the bivariate copulas, i.e., on $[\delta, 1-\delta]$, $\delta \to 0^+$.

1. If the estimators of the inverse marginal distribution functions $\hat{F}^{-1}_Y$, $\hat{F}^{-1}_{X_j}$, $j = 1, \ldots, p$, are strongly uniformly consistent on the support $[\delta, 1-\delta]$, $\delta \to 0^+$, and the estimators of the inverses of the h-functions composing the conditional quantile estimator $C^{-1}_{V \mid U_1, \ldots, U_p}$ are strongly uniformly consistent, then the estimator $\hat{F}^{-1}_{Y \mid X_1, \ldots, X_p}(\alpha \mid x_1, \ldots, x_p)$ is also strongly uniformly consistent.

2. If the estimators of the inverse marginal distribution functions $\hat{F}^{-1}_Y$, $\hat{F}^{-1}_{X_j}$, $j = 1, \ldots, p$, are at least weakly consistent, and the estimators of the inverses of the h-functions are also at least weakly consistent, then the estimator $\hat{F}^{-1}_{Y \mid X_1, \ldots, X_p}(\alpha \mid x_1, \ldots, x_p)$ is weakly consistent.

For more details about uniformly continuous functions see Bartle and Sherbert (2000, Section 5.4) and Kolmogorov and Fomin (1970, p. 109, Def. 1). For a definition of strong uniform consistency, or convergence with probability one, see Ryzin (1969), Silverman (1978) and Durrett (2010, p. 16); for a definition of weak consistency, or convergence in probability, see Durrett (2010, p. 53). The strong uniform consistency result in Proposition 4.1 additionally requires that all estimators $\hat{F}^{-1}_Y$, $\hat{F}^{-1}_{X_j}$, for $j = 1, \ldots, p$, are strongly uniformly consistent on a truncated compact interval $[\delta, 1-\delta]$, $\delta \to 0^+$. Although not directly used in the proof of Proposition 4.1 in Appendix B, the truncation is an essential condition for guaranteeing the strong uniform consistency of all estimators of the inverses of the marginal distributions (i.e. estimators of quantile functions); see Cheng (1995), Van Keilegom and Veraverbeke (1998) and Cheng (1984). The inverses of the h-functions can be obtained through a nested sequence of bivariate copula densities. Here, we consider the transformed local likelihood estimator of Geenens et al. (2017), which inherits uniform (weak or strong) consistency from the estimator $\tilde{f}_{S,T}$ in (8). Using a product of univariate Gaussian kernels for the kernel function $K$ in (9) for estimating $\tilde{f}_{S,T}$ is discussed in Geenens et al. (2017) and implemented in the R packages kdecopula and rvinecopulib. Proposition 4.1 shows the uniform consistency and gives an indication of the performance of the conditional quantile estimator $\hat{F}^{-1}_{Y \mid X_1, \ldots, X_p}$ for fixed variable orders, while combining the consistent estimators of $F_Y$, the $F_{X_j}$'s and the bivariate copula densities. Extensive studies of the numerical performance of $\hat{F}^{-1}_{Y \mid X_1, \ldots, X_p}$ are presented in Section 5.

The proposed two-step ahead forward selection algorithms for C- and D-vine based quantile regression from Section 4.1 are implemented in the statistical language R (R Core Team 2020). The D-vine one-step ahead algorithm is implemented in the R package vinereg (Nagler 2019). The simulation study in Kraus and Czado (2017) shows that the D-vine one-step ahead forward selection algorithm performs better than, or similarly to, other state-of-the-art quantile regression methods: boosting additive quantile regression (Koenker 2005b; Fenske et al. 2011), nonparametric quantile regression (Li et al. 2013), semi-parametric quantile regression (Noh et al. 2015) and linear quantile regression (Koenker and Bassett 1978). Thus we use the one-step ahead algorithm as the benchmark method in the simulation study. The following simulation settings are used. Each setting is replicated R = 100 times. In each simulation replication, we randomly generate $N_{train}$ samples used for fitting the appropriate nonparametric vine based quantile regression models.
Additionally, another $N_{eval} = N_{train}$ samples for settings (a)–(f), and $N_{eval} = N_{train}$ samples for settings (g) and (h), are generated for predicting conditional quantiles from the models. Settings (a)–(f) are designed to test the quantile prediction accuracy of nonparametric C- or D-vine quantile regression in cases where $p \leq N$; hence, we set $N_{train} = 1000$ or $300$. Settings (g) and (h) test quantile prediction accuracy in cases where $p > N$; hence, we set $N_{train} = 100$.

(a) Simulation setting M5 from Kraus and Czado (2017): $Y = \sqrt{|2X_1 - X_2 + 0.5|} + (-0.5X_3 + 1)(0.1X_4^3) + \sigma\varepsilon$, with $\varepsilon \sim N(0,1)$, $\sigma \in \{0.1, 1\}$, $(X_1, X_2, X_3, X_4)^T \sim N_4(0, \Sigma)$, and the $(i,j)$-th component of the covariance matrix given as $(\Sigma)_{i,j} = 0.5^{|i-j|}$.

(b) $(Y, X_1, \ldots, X_5)^T$ follows a mixture of two 6-dimensional t copulas with degrees of freedom equal to 3 and mixture probabilities 0.3 and 0.7. The association matrices $R_1$, $R_2$ and the marginal distributions are recorded in Table 1.

Table 1: Association matrices of the multivariate t-copulas and marginal distributions for simulation setting (b).

(c) Linear and heteroscedastic (Chang and Joe 2019): $Y = 5(X_1 + X_2 + X_3 + X_4) + 10(U_1 + U_2 + U_3 + U_4)\varepsilon$, where $(X_1, X_2, X_3, X_4)^T \sim N_4(0, \Sigma)$ with $\Sigma_{i,j} = 0.5^{I\{i \neq j\}}$, $\varepsilon$ is independent Gaussian noise, and the $U_j$, $j = 1, \ldots, 4$, are obtained from the $X_j$'s by the probability integral transform.

(d) Nonlinear and heteroscedastic (Chang and Joe 2019): $Y = U_1 U_2\, e^{0.5 U_3 U_4} + 0.5(U_1 + U_2 + U_3 + U_4)\varepsilon$, where the $U_j$, $j = 1, \ldots, 4$, are probability integral transforms of the components of $N_4(0, \Sigma)$ with $\Sigma_{i,j} = 0.5^{I\{i \neq j\}}$, and $\varepsilon$ is independent Gaussian noise.

(e) R-vine copula: $(V, U_1, \ldots, U_4)^T$ follows an R-vine distribution with pair copulas given in Table 2.

(f) D-vine copula (Tepegjozova 2019): $(V, U_1, \ldots, U_5)^T$ follows a D-vine distribution with pair copulas given in Table 3.

(g) Similar to setting (a), $Y = \sqrt{|2X_1 - X_2 + 0.5|} + (-0.5X_3 + 1)(0.1X_4^3) + (X_5, \ldots, X_p)(0, \ldots, 0)^T + \sigma\varepsilon$, where $(X_1, \ldots, X_p)^T \sim N_p(0, \Sigma)$ with the $(i,j)$-th component of the covariance matrix $(\Sigma)_{i,j} = 0.5^{|i-j|}$, $\varepsilon \sim N(0,1)$ and $\sigma \in \{0.1, 1\}$.

(h) Similar to (g), $Y = (X_1, \ldots, X_p)\beta + \varepsilon$, where the first group of predictors follows $N(0, \Sigma_A)$ and the second group follows $N(0, \Sigma_B)$, with both covariance matrices of the autoregressive form $(\Sigma)_{i,j} = \rho^{|i-j|}$.

Tree  Edge  Conditioned ; Conditioning   Family       Parameter  Kendall's τ
1     1     U·, U·                       Gumbel       3.9         0.74
1     2     U·, U·                       Gauss        0.9         0.71
1     3     V, U·                        Gauss        0.5         0.33
1     4     V, U·                        Clayton      4.8         0.71
2     1     V, U· ; U·                   Gumbel(90)   6.5        -0.85
2     2     V, U· ; U·                   Gumbel(90)   2.6        -0.62
2     3     U·, U· ; V                   Gumbel       1.9         0.48
3     1     U·, U· ; V, U·               Clayton      0.9         0.31
3     2     U·, U· ; V, U·               Clayton(90)  5.1        -0.72
4     1     U·, U· ; V, U·, U·           Gauss        0.2         0.13
Table 2:
Pair copulas of the R-vine copula $C_{V, U_1, U_2, U_3, U_4}$, with their family, parameter and Kendall's $\tau$, for simulation setting (e).

The first 10 entries of $\beta$ are a descending sequence between $(2, 1)$ with increment $0.1$, $\varepsilon \sim N(0, \sigma^2)$ and $\sigma \in \{0.1, 1\}$.

Since the true regression quantiles are difficult to obtain in most settings, we consider the averaged check loss (Kraus and Czado 2017; Komunjer 2013) and the interval score (Chang and Joe 2019; Gneiting and Raftery 2007), instead of the out-of-sample mean averaged square error in Kraus and Czado (2017), to evaluate the performance of the estimation methods. For a chosen $\alpha \in (0,1)$, the averaged check loss is defined as
$$\widehat{CL}_\alpha = \frac{1}{R} \sum_{r=1}^{R} \left\{ \frac{1}{N_{eval}} \sum_{n=1}^{N_{eval}} \gamma_\alpha\big( Y^{eval}_{r,n} - \hat{q}_\alpha(X^{eval}_{r,n}) \big) \right\}, \quad (5)$$
where the check loss function $\gamma_\alpha$ is given as
$$\gamma_\alpha\big( Y^{eval}_{r,n} - \hat{q}_\alpha(X^{eval}_{r,n}) \big) = \big( Y^{eval}_{r,n} - \hat{q}_\alpha(X^{eval}_{r,n}) \big)\big( \alpha - I\big( Y^{eval}_{r,n} - \hat{q}_\alpha(X^{eval}_{r,n}) < 0 \big) \big) = \begin{cases} \big( Y^{eval}_{r,n} - \hat{q}_\alpha(X^{eval}_{r,n}) \big)\,\alpha, & Y^{eval}_{r,n} - \hat{q}_\alpha(X^{eval}_{r,n}) \geq 0, \\ \big( Y^{eval}_{r,n} - \hat{q}_\alpha(X^{eval}_{r,n}) \big)(\alpha - 1), & Y^{eval}_{r,n} - \hat{q}_\alpha(X^{eval}_{r,n}) < 0. \end{cases}$$
The interval score, for the $(1-\alpha) \times 100\%$ prediction interval, is defined as
$$\widehat{IS}_\alpha = \frac{1}{R} \sum_{r=1}^{R} \Big\{ \frac{1}{N_{eval}} \sum_{n=1}^{N_{eval}} \Big[ \big( \hat{q}_{1-\alpha/2}(X^{eval}_{r,n}) - \hat{q}_{\alpha/2}(X^{eval}_{r,n}) \big) \quad (6)$$
$$+ \frac{2}{\alpha}\big( \hat{q}_{\alpha/2}(X^{eval}_{r,n}) - Y^{eval}_{r,n} \big) I\{ Y^{eval}_{r,n} \leq \hat{q}_{\alpha/2}(X^{eval}_{r,n}) \} + \frac{2}{\alpha}\big( Y^{eval}_{r,n} - \hat{q}_{1-\alpha/2}(X^{eval}_{r,n}) \big) I\{ Y^{eval}_{r,n} > \hat{q}_{1-\alpha/2}(X^{eval}_{r,n}) \} \Big] \Big\}.$$

Tree  Edge  Conditioned ; Conditioning   Family   Parameter  Kendall's τ
1     1     V, U1                        Clayton  3.00        0.60
1     2     U1, U2                       Joe      8.77        0.80
1     3     U2, U3                       Gumbel   2.00        0.50
1     4     U3, U4                       Gauss    0.20        0.13
1     5     U4, U5                       Indep.   0.00        0.00
2     1     V, U2 ; U1                   Gumbel   5.00        0.80
2     2     U1, U3 ; U2                  Frank    9.44        0.65
2     3     U2, U4 ; U3                  Joe      2.78        0.49
2     4     U3, U5 ; U4                  Gauss    0.20        0.13
3     1     V, U3 ; U1, U2               Joe      3.83        0.60
3     2     U1, U4 ; U2, U3              Frank    6.73        0.55
3     3     U2, U5 ; U3, U4              Gauss    0.29        0.19
4     1     V, U4 ; U1, U2, U3           Clayton  2.00        0.50
4     2     U1, U5 ; U2, U3, U4          Gauss    0.09        0.06
5     1     V, U5 ; U1, U2, U3, U4       Indep.   0.00        0.00
Table 3:
Pair copulas of the D-vine copula $C_{V, U_1, U_2, U_3, U_4, U_5}$, with their family, parameter and Kendall's $\tau$, for simulation setting (f).

Smaller interval scores are better. For settings (a)–(f), the estimation procedure for the two-step ahead C- or D-vine quantile regression follows exactly Section 4.1, where the candidate sets at each step include all possible remaining predictors. The additional variable reduction described in Section 4.1.1 is not applied; thus we calculate all possible conditional log-likelihoods in each step. In contrast, due to the computational burden in settings (g) and (h), we set the number of candidates to $k = 5$ and apply the additional variable reduction from Section 4.1.1. The chosen subset contains 20% of all possible choices, where 10% are predictors having the highest Pearson partial correlation with the response and the remaining 10% are chosen randomly from the remaining predictors. The performance of the C- and D-vine two-step ahead quantile regression is compared with that of the C- and D-vine one-step ahead quantile regression. The performance of the competing methods, evaluated by the averaged check loss at the 5%, 50% and 95% quantile levels and by the interval score for the 95% prediction interval, is recorded in Tables 4, 5 and 6. All densities are estimated nonparametrically for a fair comparison. Tables 4 and 5 show that the C- and D-vine two-step ahead regression models outperform the C- and D-vine one-step ahead regression models in five out of seven settings, with the exception of settings (b) and (e), in which all models perform quite similarly to each other. Further, when comparing regression models within the same vine copula class, again in five out of seven settings the C-vine two-step ahead regression models outperform the C-vine one-step ahead models. Similarly, the D-vine two-step ahead models outperform the D-vine one-step ahead models in six out of seven scenarios, with the exception of setting (b) only.
Thus, in the scenarios where there is no significant improvement through the second step, the one-step and two-step ahead approaches perform very similarly. Altogether, this implies that the two-step ahead vine based quantile regression greatly improves on the performance of the one-step ahead quantile regression. Table 6 indicates that in the high-dimensional settings, where the two-step ahead quantile regression was used in combination with the additional variable selection from Section 4.1.1, the two-step ahead models outperform the one-step ahead models in three out of four simulation settings. In simulation setting (g) all models show similar performance. However, in setting (g) with noise level $\sigma = 0.1$ the D-vine one-step ahead model outperforms the other models, while in setting (g) with noise level $\sigma = 1$ the D-vine two-step ahead model shows a better performance. In simulation setting (h) there is a significant improvement in the two-step ahead models compared to the one-step ahead models; for both $\sigma = 0.1$ and $\sigma = 1$ the best performing model is the C-vine two-step ahead model. These results indicate that the newly proposed method improves the accuracy of the one-step ahead quantile regression in high dimensions, even with an attempt to ease the computational complexity of the two-step ahead model by using a low number of candidates compared to the number of predictors.

We test the proposed methods on two real data sets: the concrete data set from Yeh (1998), corresponding to $p \leq N$, and the riboflavin data set from Bühlmann and van de Geer (2011), corresponding to $p > N$. For both data sets, the performance of the four competing algorithms is evaluated by the averaged check loss defined in Equation (5) at the 5%, 50% and 95% quantile levels, as well as by the 95% prediction interval score defined in Equation (6), by randomly splitting the data set into training and evaluation sets 100 times.

The concrete data set was originally used in Yeh (1998) and is available at the UCI Machine Learning Repository (Dua and Graff 2017). The data set has in total 1030 samples. Our objective is quantile prediction of the concrete compressive strength, which is a highly nonlinear function of age and ingredients. The predictors are age (AgeDay, counted in days) and 7 physical measurements of the concrete ingredients (given in kg per m³ of mixture): cement (CementComp), blast furnace slag (BlastFur), fly ash (FlyAsh), water (WaterComp), superplasticizer (Superplastizer), coarse aggregate (CoarseAggre) and fine aggregate (FineAggre). We randomly split the data set into a training set with 830 samples and an evaluation set with 200 samples; the random splitting is repeated 100 times. The performance of the proposed C- and D-vine two-step ahead quantile regression, compared with the C- and D-vine one-step ahead quantile regression, is evaluated by several measurements reported in Table 7 after 100 repetitions of fitting the models. The results are quite close to each other but, given the small number of predictors, that is what

Table 4:
Out-of-sample prediction results $\widehat{IS}_{0.95}$, $\widehat{CL}_{0.05}$, $\widehat{CL}_{0.5}$, $\widehat{CL}_{0.95}$ for settings (a)–(f) with $N_{train} = 300$. Lower values, indicating better performance, are highlighted in gray. With ** we denote the scenarios in which there is an improvement through the second step, and with * we denote scenarios in which the models perform similarly.

Table 5:
Out-of-sample prediction results $\widehat{IS}_{0.95}$, $\widehat{CL}_{0.05}$, $\widehat{CL}_{0.5}$, $\widehat{CL}_{0.95}$ for settings (a)–(f) with $N_{train} = 1000$. Lower values, indicating better performance, are highlighted in gray. With ** we denote the scenarios in which there is an improvement through the second step, and with * we denote scenarios in which the models perform similarly.

Table 6:
Out-of-sample prediction results $\widehat{IS}_{0.95}$, $\widehat{CL}_{0.05}$, $\widehat{CL}_{0.5}$, $\widehat{CL}_{0.95}$ for settings (g)–(h) with $N_{train} = 100$. Lower values, indicating better performance, are highlighted in gray. With ** we denote the scenarios in which there is an improvement through the second step, and with * we denote scenarios in which the models perform similarly.

one would expect of the forward sequential algorithm. However, there is an improvement in the performance of the two-step ahead approach compared to the one-step ahead approach for both the C- and D-vine based models. Also, the C-vine model seems more appropriate for modelling the dependence structure in this data set. Finally, out of all models, the C-vine two-step ahead algorithm is the best performing algorithm in terms of the out-of-sample predictions $\widehat{IS}_{0.95}$, $\widehat{CL}_{0.05}$, $\widehat{CL}_{0.5}$, $\widehat{CL}_{0.95}$ on the concrete data set, as seen in Table 7.

In order to explain the different approaches of the one-step ahead and the two-step ahead algorithms, we consider the order of the predictors which the algorithms provide. The order in which the predictors enter the model is based on maximizing the conditional log-likelihood; thus it provides a descending order of influence of the predictors on the conditional quantile function of the response. Figure 4 shows the individual distribution of positions for each predictor in the four different models, i.e. each individual bar plot has all the possible positions in the order on the x-axis and, on the y-axis, the counts of how many times the given predictor appeared at a specific position of the order out of the 100 repeated model fits.

Model            IS_0.95    CL_0.05   CL_0.5   CL_0.95
D-vine One-step  1032.319   10.751    2.759    10.520
D-vine Two-step   987.101   10.543    2.778     9.815
C-vine One-step   976.750   10.649    2.696     9.454
C-vine Two-step   967.002   10.519    2.639     9.453

Table 7: Concrete data set: out-of-sample prediction results $\widehat{IS}_{0.95}$, $\widehat{CL}_{0.05}$, $\widehat{CL}_{0.5}$, $\widehat{CL}_{0.95}$. The best performing model is highlighted in gray.

Figure 4: Individual distribution plots. Column 1: D-vine one-step, Column 2: D-vine two-step, Column 3: C-vine one-step, Column 4: C-vine two-step.

From Figure 4 we can indeed see the greedy approach of the one-step algorithms. Both the one-step ahead C- and D-vine models always choose the same predictor as the first predictor to enter the model. That is because the pair copula between the response and the predictor AgeDay has the biggest likelihood out of the possible pair copulas between the response and each of the predictors. Next, as second predictor both one-step ahead algorithms always choose the predictor CementComp because, similarly as before, the bivariate copula between the response and CementComp conditioned on the already chosen AgeDay has the biggest likelihood out of the possible pair copulas between the response and the other possible predictors conditioned on AgeDay. Note that in only 3 dimensions the models for both C- and D-vines are equivalent (in 3 dimensions a path is also a star, and vice versa). On the other side, the two-step ahead approaches do not make such a uniform decision about the first predictor to be included. Instead of choosing AgeDay as first predictor, the algorithms consider the possible future improvement and, based on that, they choose the predictor CementComp as the first to enter the model. The most influential predictor from the one-step ahead models, AgeDay, is chosen as the second or third predictor by the two-step ahead models, which turns out to be a better ordering. Next, the influence of the predictors is ranked sequentially by choosing the predictor with the highest frequency at each position in the order, excluding the predictors chosen in the previous steps. As already stated, the predictors that take the first positions in the order have the biggest contribution to the conditional log-likelihood; thus they influence the conditional quantile function the most, and we are interested in finding them. For instance, the most influential predictor is chosen as the one most frequently ranked first; the second most important is chosen as the one most frequently ranked second, excluding the most important one selected in the previous step. Following this approach, the three most influential predictors on the concrete compressive strength, based on the most accurate C-vine two-step model, are CementComp, WaterComp and AgeDay. In Figure 5 the marginal effect plots based on the fitted quantiles from the C-vine two-step model are given for the three most influential predictors. The marginal effect of a predictor variable is its expected impact on the quantile estimator, where the expectation is taken over all other predictor variables. This is estimated using all fitted conditional quantiles and smoothed over the predictor variables considered.
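The marginal effect computation just described can be sketched as follows. The function `marginal_effect`, the Gaussian smoothing weights and the toy data are hypothetical stand-ins for illustration, not the authors' implementation: averaging the fitted quantiles over the sample takes the expectation over the other predictors implicitly, and a local smoother over the predictor of interest gives the curve that is plotted.

```python
import numpy as np

def marginal_effect(x_focus, q_fitted, grid_size=50, window=0.1):
    """Estimate the marginal effect of one predictor on a fitted quantile.

    x_focus  : observed values of the predictor of interest, shape (N,)
    q_fitted : fitted conditional quantiles q_hat(x_1, ..., x_p) at the same
               N observations; averaging over samples takes the expectation
               over the remaining predictors implicitly
    Returns a grid over x_focus and a locally smoothed effect curve.
    """
    x_focus, q_fitted = map(np.asarray, (x_focus, q_fitted))
    grid = np.linspace(x_focus.min(), x_focus.max(), grid_size)
    bw = window * (x_focus.max() - x_focus.min())
    # Nadaraya-Watson style smoothing of the fitted quantiles over x_focus
    w = np.exp(-0.5 * ((grid[:, None] - x_focus[None, :]) / bw) ** 2)
    effect = (w * q_fitted).sum(axis=1) / w.sum(axis=1)
    return grid, effect

# toy illustration: fitted quantiles that grow linearly in the focus predictor
rng = np.random.default_rng(0)
x1 = rng.uniform(0, 1, 500)
q_hat = 2.0 * x1 + rng.normal(0, 0.1, 500)   # stand-in for fitted quantiles
grid, eff = marginal_effect(x1, q_hat)
```

In this toy case the smoothed effect curve recovers the underlying linear trend; for the concrete data the same averaging-and-smoothing would be applied per predictor and per quantile level.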
The riboflavin data set, available in the R package hdi, aims at quantile prediction of the log-transformed production rate of Bacillus subtilis using log-transformed expression levels of 4088 genes. To reduce the computational burden, we perform a pre-selection of the top 100 genes with the highest variance (Bühlmann and van de Geer 2011), resulting in a subset with $p = 100$ log-transformed gene expressions and $N = 71$ samples.

Figure 5: Marginal effect plots for the 3 most influential predictors on the concrete compressive strength for $\alpha$ values of 0.05 (red colour), 0.5, and 0.95 (blue colour).

Random splitting of the subset into a training set with 61 samples and an evaluation set with 10 samples is repeated 100 times. For the C- and D-vine two-step ahead quantile regression the number of candidates is set to $k = 10$. Additionally, to further reduce the computational burden, the additional variable selection from Section 4.1.1 is applied, with the chosen subset containing 25% of all possible choices, where 15% are predictors having the highest partial correlation with the log-transformed Bacillus subtilis production rate and the remaining 10% are chosen randomly from the remaining predictors. The performance of the competing quantile regression models is reported in Table 8, where we see that the proposed C-vine two-step ahead quantile regression is the best performing model and outperforms both the D-vine one-step ahead quantile regression from Kraus and Czado (2017) and the C-vine one-step ahead quantile regression to a large extent. Further, the second best performing method is the D-vine two-step ahead model which, while performing slightly worse than the C-vine two-step ahead model, also significantly outperforms both the C-vine and D-vine one-step ahead models.

Model            IS_0.95   CL_0.05   CL_0.5   CL_0.95
D-vine One-step  33.826    0.438     0.423    0.408
D-vine Two-step  30.574    0.436     0.382    0.328
C-vine One-step  34.516    0.485     0.431    0.378
C-vine Two-step  28.590    0.413     0.357    0.302

Table 8: Out-of-sample prediction results $\widehat{IS}_{0.95}$, $\widehat{CL}_{0.05}$, $\widehat{CL}_{0.5}$, $\widehat{CL}_{0.95}$ on the riboflavin data set. The best performing model is highlighted in gray.

Further, since the predictors entering the C- and D-vine models yield a descending order of the predictors contributing to maximizing the conditional log-likelihood, the order indicates the influence of the predictors on the response variable. It is often of practical interest to know which gene expressions are of the highest importance for prediction. Similarly as before, since we repeat the random splitting of the subset R = 100 times, the importance of the gene expressions is ranked sequentially by choosing the one with the highest frequency at each position in the order, excluding the gene expressions chosen in the previous steps. For instance, the most important gene expression is chosen as the one most frequently ranked first; the second most important gene is chosen as the one most frequently chosen as the second element in the order, excluding the most important gene selected in the previous step. The top ten most influential gene expressions using the C- and D-vine one- or two-step ahead models are recorded in Table 9.

Model/Position   1     2     3     4     5      6       7     8      9     10
D-vine One-step  GGT   YCIC  MTA   RPSE  YVAK   THIK    ANSB  SPOVB  YVZB  YQJB
D-vine Two-step  MTA   RPSE  THIK  YMFE  YCIC   sigM    PGM   YACC   YVQF  YKPB
C-vine One-step  GGT   YCIC  MTA   RPSE  HIT    BFMBAB  PHRC  YBAE   PGM   YHEF
C-vine Two-step  MTA   RPSE  THIK  YCIC  YURU   PGM     sigM  YACC   YKRM  ASNB

Table 9: The 10 most influential gene expressions on the conditional quantile function, ranked based on their position in the order.

Figure 6 shows the marginal effect plots, based on the fitted quantiles from the C-vine two-step model, for the 10 most influential predictors on the log-transformed Bacillus subtilis production rate.
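The sequential ranking-by-position procedure described above can be sketched as follows; the function name and the fallback rule for exhausted positions are illustrative assumptions, and the three toy orderings reuse gene labels from Table 9.

```python
from collections import Counter

def rank_by_position(orders):
    """Sequentially rank variables by how often they occupy each position.

    orders : list of orderings (one per repetition), e.g. the predictor
             orders returned by the R = 100 fitted models.
    Position 1 is filled by the variable most frequently ranked first;
    position 2 by the most frequent second-place variable among the
    remaining ones, and so on.
    """
    n = len(orders[0])
    ranking, chosen = [], set()
    for pos in range(n):
        counts = Counter(o[pos] for o in orders if o[pos] not in chosen)
        # fallback: if every variable seen at this position is already taken,
        # count all remaining occurrences of unchosen variables
        if not counts:
            counts = Counter(v for o in orders for v in o if v not in chosen)
        winner, _ = counts.most_common(1)[0]
        ranking.append(winner)
        chosen.add(winner)
    return ranking

orders = [["MTA", "RPSE", "THIK"],
          ["MTA", "THIK", "RPSE"],
          ["RPSE", "MTA", "THIK"]]
print(rank_by_position(orders))  # -> ['MTA', 'RPSE', 'THIK']
```

Note that, as in the paper's description, the exclusion of previously chosen variables makes the ranking sequential rather than a simple per-position vote.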
Figure 6:
Marginal effect plots for the 10 most influential predictors on the log-transformed Bacillus subtilis production rate.

Summary and discussion
In this paper, we introduce a novel two-step ahead forward selection algorithm for nonparametric C- and D-vine copula based quantile regression. The inclusion of future information, obtained by considering the next tree in the two-step ahead algorithm, yields a significantly less greedy sequential selection procedure than the existing one-step ahead algorithm for D-vine based quantile regression in Kraus and Czado (2017). We extend the vine-based quantile regression framework to include C-vine copulas, giving an additional choice for the dependence structure. Further, for the first time nonparametric bivariate copulas are used in the construction of vine copula based quantile regression models. This overcomes the problem of possible family misspecification in the parametric estimation of bivariate copulas and allows for even more flexibility in dependence estimation. As an additional result, under mild regularity conditions the nonparametric conditional quantile estimator is shown to be consistent.

The extensive simulation study, including several different settings and data sets with different dimensions, strengths of dependence and tail dependencies, shows that the two-step ahead algorithm outperforms the one-step ahead algorithm in the majority of scenarios. Especially interesting are the results for the concrete and riboflavin data sets, as the C-vine two-step ahead algorithm shows a significant improvement in comparison to the other algorithms. These findings provide strong evidence for the need to model the dependence structure following a C-vine copula. In addition, the two-step ahead algorithm allows the computational intensity to be controlled independently of the data dimensions, through the number of candidate predictors and the additional variable selection discussed in Section 5. Thus, fitting vine based quantile regression models in high dimensions becomes feasible.
As seen in a number of simulation settings, there is a significant gain by introducingadditional dependence structures in the vine based quantile regression. An area of furtherresearch is developing similar forward selection algorithms for R-vine tree structures.At each step of the vine building stage we compare equal-sized models with the samenumber of variables. The conditional log-likelihood is suited for such a comparison. Forother questions such as choosing between a C-vine, D-vine or R-vine information criteriamight come in handy. When maximum likelihood estimation is employed at all stages, theselection criteria by Akaike (AIC) (Akaike 1973), the Bayesian information criterion (BIC)(Schwarz 1978) and the focussed information criterion (FIC) (Claeskens and Hjort 2003)might be used immediately. Ko et al. (2019) studied FIC and AIC specifically for the se-lection of parametric copulas. The copula information criterion in the spirit of the Akaikeinformation criterion by Grønneberg and Hjort (2014) can be used for selection among cop-ula models with empirically estimated margins, while Ko and Hjort (2019) studied sucha criterion for parametric copula models. We plan a deeper investigation of the use ofinformation criteria for nonparametrically estimated copulas and for vines in particular.Such a study is beyond the scope of this paper, but could be interesting to study stoppingcriteria too for building vines. 27onparametrically estimated vines are offering a large flexibility. Their parametric counter-parts, on the other hand, are enjoying simplicity. An interesting route for further researchis to combine parametric and nonparametric components in the construction of the vinesin an efficient way to bring most benefit, which should be made tangible by means ofsome criterion such that guidance can be provided about which components should bemodelled nonparametrically and which others are best modeled parametrically. 
For some types of models, such a choice between a parametric and a nonparametric model has been investigated by Jullum and Hjort (2017) via the focussed information criterion. This and alternative methods taking the effective degrees of freedom into account are worthwhile to investigate further for vine copula models.
Acknowledgments
This work was supported by the Deutsche Forschungsgemeinschaft [DFG CZ 86/6-1], the Research Foundation Flanders and KU Leuven internal fund C16/20/002. The resources and services used in this work were provided by the VSC (Flemish Supercomputer Center), funded by the Research Foundation - Flanders (FWO) and the Flemish Government.

Appendix

A Construction of the transformation local likelihood estimator of the copula density
Let the $N \times 2$ data matrix be
$$D = (S, T), \qquad (7)$$
where the transformed samples are $D_n = \big(S_n = \Phi^{-1}(u_i^{(n)}),\, T_n = \Phi^{-1}(u_j^{(n)})\big)$, $n = 1, \dots, N$, and $\Phi$ denotes the cumulative distribution function of the standard Gaussian distribution. The logarithm of the density $f_{S,T}$ of the transformed samples $(S_n, T_n)$, $n = 1, \dots, N$, is approximated locally by a bivariate polynomial expansion $P_{a_m}$ of order $m$ with intercept $\tilde a_{m,0}$, such that the approximation is denoted by
$$\tilde f_{S,T}\big(\Phi^{-1}(u_i^{(n)}), \Phi^{-1}(u_j^{(n)})\big) = \exp\big\{ \tilde a_{m,0}\big(\Phi^{-1}(u_i^{(n)}), \Phi^{-1}(u_j^{(n)})\big) \big\}.$$
The transformation local likelihood estimator is then defined as
$$\tilde c\big(u_i^{(n)}, u_j^{(n)}\big) = \frac{\tilde f_{S,T}\big(\Phi^{-1}(u_i^{(n)}), \Phi^{-1}(u_j^{(n)})\big)}{\phi\big(\Phi^{-1}(u_i^{(n)})\big)\, \phi\big(\Phi^{-1}(u_j^{(n)})\big)}. \qquad (8)$$
To obtain the local polynomial approximation, we need a kernel function $K$ with a $2 \times 2$ bandwidth matrix $B_N$. For some pair $(\check s, \check t)$ close to $(s, t)$, $\log f_{S,T}(\check s, \check t)$ is assumed to be well approximated, locally, by for instance a polynomial with $m = 1$ (log-linear),
$$P_{a_1}(\check s - s, \check t - t) = a_{1,0}(s,t) + a_{1,1}(s,t)(\check s - s) + a_{1,2}(s,t)(\check t - t),$$
or $m = 2$ (log-quadratic),
$$P_{a_2}(\check s - s, \check t - t) = a_{2,0}(s,t) + a_{2,1}(s,t)(\check s - s) + a_{2,2}(s,t)(\check t - t) + a_{2,3}(s,t)(\check s - s)^2 + a_{2,4}(s,t)(\check t - t)^2 + a_{2,5}(s,t)(\check s - s)(\check t - t).$$
The coefficient vector of the polynomial expansion $P_{a_m}$ is denoted by $a_m(s,t)$, where $a_1(s,t) = \big(a_{1,0}(s,t), a_{1,1}(s,t), a_{1,2}(s,t)\big)$ for the log-linear approximation and $a_2(s,t) = \big(a_{2,0}(s,t), \dots, a_{2,5}(s,t)\big)$ for the log-quadratic one.
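The transformation device of Equation (8) can be sketched numerically. The following Python sketch is a simplification made for illustration: it replaces the local likelihood estimate of $f_{S,T}$ by an ordinary bivariate Gaussian kernel density estimate with SciPy's default bandwidth, so it shows only the probit transformation and the back-transformation step, not the local polynomial machinery or the rotated bandwidth matrix.

```python
import numpy as np
from scipy.stats import norm, gaussian_kde

def transformation_copula_density(u, v):
    """Plain probit-transformation copula density estimator: a KDE on the
    probit-transformed samples, divided by the standard normal densities
    as in Equation (8). Returns a callable estimate c_hat(u0, v0)."""
    s, t = norm.ppf(u), norm.ppf(v)        # transform pseudo-observations
    kde = gaussian_kde(np.vstack([s, t]))  # estimate f_{S,T} on the Gaussian scale

    def c_hat(u0, v0):
        s0, t0 = norm.ppf(u0), norm.ppf(v0)
        f_st = kde(np.vstack([np.atleast_1d(s0), np.atleast_1d(t0)]))
        return f_st / (norm.pdf(s0) * norm.pdf(t0))

    return c_hat
```

For pseudo-observations from the independence copula, the estimate should be close to 1 everywhere in the interior of the unit square.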
The estimated coefficient vector $\tilde a_m(s,t)$ is obtained from the maximization problem in Equation (9),
$$\tilde a_m(s,t) = \arg\max_{a_m} \bigg\{ \sum_{n=1}^{N} K\Big(B_N^{-1/2}\begin{pmatrix} s - S_n \\ t - T_n \end{pmatrix}\Big)\, P_{a_m}(S_n - s, T_n - t) - N \int\!\!\!\int_{\mathbb{R}^2} K\Big(B_N^{-1/2}\begin{pmatrix} s - \check s \\ t - \check t \end{pmatrix}\Big) \exp\big(P_{a_m}(\check s - s, \check t - t)\big)\, d\check s\, d\check t \bigg\}. \qquad (9)$$
While it is well known that kernel estimators suffer from the curse of dimensionality, in the vine construction only two-dimensional functions need to be estimated, which avoids problems with high dimensionality.
We next explain, as in Geenens et al. (2017), how the bandwidth is selected. Consider the principal component decomposition of the $N \times 2$ data matrix $D = (S, T)$ in Equation (7), such that the $N \times 2$ matrix $(Q, R)$ follows
$$(Q, R)^\top = W D^\top, \qquad (10)$$
where each row of $W$ is an eigenvector of $D^\top D$. We obtain an estimator of $f_{S,T}$ through the density estimator of $f_{Q,R}$, which can be estimated based on a diagonal bandwidth matrix $\operatorname{diag}(h_Q^2, h_R^2)$. Selecting the bandwidth $h_Q$ uses the samples $Q_n$, $n = 1, \dots, N$, as
$$h_Q = \arg\min_{h > 0} \bigg\{ \int_{-\infty}^{\infty} \big\{ \tilde f^{(p)}_Q \big\}^2\, dq - \frac{2}{N} \sum_{n=1}^{N} \tilde f^{(p)}_{Q,(-n)}(\hat Q_n) \bigg\}, \qquad (11)$$
where $\tilde f^{(p)}_Q$ ($p = 1, 2$) are the local polynomial estimators of $f_Q$, and $\tilde f^{(p)}_{Q,(-n)}$ is the "leave-one-out" version of $\tilde f^{(p)}_Q$ computed by leaving out $Q_n$. The procedure for selecting $h_R$ is similar. The bandwidth matrix for the bivariate copula density is then given by $B_N = K^{(p)}_N\, W^{-1} \operatorname{diag}(h_Q^2, h_R^2)\, (W^{-1})^\top$, where $K^{(p)}_N$ is a power of $N$ chosen to ensure the asymptotically optimal bandwidth order for the local log-quadratic case ($p = 2$); see Geenens et al. (2017, Section 4) for details. Selection of the $k$-nearest-neighbour type bandwidth is similar. The $k$-nearest-neighbour bandwidths, denoted $h'_Q$ and $h'_R$, are obtained by restricting the minimization in Equation (11) to the interval $(0, 1)$:
$$h'_Q = \arg\min_{h' \in (0, 1)} \bigg\{ \int_{-\infty}^{\infty} \big\{ \tilde f^{(p)}_Q \big\}^2\, dq - \frac{2}{N} \sum_{n=1}^{N} \tilde f^{(p)}_{Q,(-n)}(\hat Q_n) \bigg\}.$$
The estimate of $f_{Q,R}$ at any $(q, r)$ is then obtained using its $k = K^{(p)}_N \cdot h'_Q \cdot N$ nearest neighbours, where $K^{(p)}_N$ is again a (negative) power of $N$ for $p = 2$; see Geenens et al. (2017, Section 4). The R package rvinecopulib implements only the bandwidth in Equation (11) for the quadratic case $p = 2$.

B Proof of Proposition 4.1
Proof.
We first show statement 1. By Equation (1), the estimator is
$$\hat F^{-1}_{Y|X_1,\dots,X_p}(\alpha \mid x_1, \dots, x_p) = \hat F^{-1}_Y\big(\hat C^{-1}_{V|U_1,\dots,U_p}(\alpha \mid \hat u_1, \dots, \hat u_p)\big),$$
where $\hat u_j = \hat F_j(x_j)$, $j = 1, \dots, p$, denote variables on the u-scale. To avoid heavy notation, the index $N$ referring to the sample size is omitted here, and we abbreviate $\hat G(\alpha) = \hat C^{-1}_{V|U_1,\dots,U_p}(\alpha \mid \hat u_1, \dots, \hat u_p)$ and $G(\alpha) = C^{-1}_{V|U_1,\dots,U_p}(\alpha \mid u_1, \dots, u_p)$. Following Wied and Weißbach (2012) and Silverman (1978), to show the uniform strong consistency of $\hat F^{-1}_Y(\hat G(\alpha))$, we show
$$\sup_\alpha \big| \hat F^{-1}_Y(\hat G(\alpha)) - F^{-1}_Y(G(\alpha)) \big| \to 0 \quad \text{a.s.}$$
For all $\varepsilon > 0$, it holds that
$$
\begin{aligned}
&P\Big(\sup_\alpha \big|\hat F^{-1}_Y(\hat G(\alpha)) - F^{-1}_Y(G(\alpha))\big| \le \varepsilon\Big) \\
&\quad = P\Big(\sup_\alpha \big|\hat F^{-1}_Y(\hat G(\alpha)) - \hat F^{-1}_Y(G(\alpha)) + \hat F^{-1}_Y(G(\alpha)) - F^{-1}_Y(G(\alpha))\big| \le \varepsilon\Big) \\
&\quad \ge P\Big(\sup_\alpha \big|\hat F^{-1}_Y(\hat G(\alpha)) - \hat F^{-1}_Y(G(\alpha))\big| + \sup_\alpha \big|\hat F^{-1}_Y(G(\alpha)) - F^{-1}_Y(G(\alpha))\big| \le \varepsilon\Big) \\
&\quad \ge P\Big(\Big\{\sup_\alpha \big|\hat F^{-1}_Y(\hat G(\alpha)) - \hat F^{-1}_Y(G(\alpha))\big| \le \varepsilon/2\Big\} \cap A\Big) \\
&\quad = P\Big(\sup_\alpha \big|\hat F^{-1}_Y(\hat G(\alpha)) - \hat F^{-1}_Y(G(\alpha))\big| \le \varepsilon/2 \,\Big|\, A\Big) \cdot P(A), \qquad (12)
\end{aligned}
$$
where $A$ denotes the event $\big\{\sup_\alpha |\hat F^{-1}_Y(G(\alpha)) - F^{-1}_Y(G(\alpha))| \le \varepsilon/2\big\}$. Then $P(A) = 1$ holds by the uniform strong consistency of the estimator of $F^{-1}_Y$. We now show that the conditional probability in (12) is equal to 1. We start by rewriting the conditional probability as
$$
\begin{aligned}
&P\Big(\sup_\alpha \big|\hat F^{-1}_Y(\hat G(\alpha)) - \hat F^{-1}_Y(G(\alpha))\big| \le \varepsilon/2 \,\Big|\, A\Big) \\
&\quad = P\Big(\sup_\alpha \big|\hat F^{-1}_Y(\hat G(\alpha)) - F^{-1}_Y(\hat G(\alpha)) + F^{-1}_Y(\hat G(\alpha)) - F^{-1}_Y(G(\alpha)) + F^{-1}_Y(G(\alpha)) - \hat F^{-1}_Y(G(\alpha))\big| \le \varepsilon/2 \,\Big|\, A\Big) \\
&\quad \ge P\Big(\sup_\alpha \big|\hat F^{-1}_Y(\hat G(\alpha)) - F^{-1}_Y(\hat G(\alpha))\big| + \sup_\alpha \big|F^{-1}_Y(G(\alpha)) - \hat F^{-1}_Y(G(\alpha))\big| \\
&\qquad\qquad + \sup_\alpha \big|F^{-1}_Y(\hat G(\alpha)) - F^{-1}_Y(G(\alpha))\big| \le \varepsilon/2 \,\Big|\, A\Big).
\end{aligned}
$$
This conditional probability is equal to 1, since each of the three suprema is, with probability 1, at most $\varepsilon/6$. For the first and second supremum this follows from the uniform strong consistency of $\hat F^{-1}_Y$ (these events have unconditional probability 1, and $P(A) = 1$, so conditioning on $A$ does not change this). The last supremum is, with probability 1, at most $\varepsilon/6$ by Bartle and Joichi (1961, Thm. 2) on almost uniform convergence, applied to the continuous inverse distribution function $F^{-1}_Y$, taking the measurable space to be the probability space. Here we use that $P\big(\sup_\alpha |\hat G(\alpha) - G(\alpha)| \le \delta\big) = 1$ for every $\delta > 0$, which can be argued similarly to (12) using the uniform consistency and continuity of the inverses of the h-functions. Hence,
$$P\Big(\sup_\alpha \big|\hat F^{-1}_Y(\hat G(\alpha)) - \hat F^{-1}_Y(G(\alpha))\big| \le \varepsilon/2 \,\Big|\, A\Big) = 1.$$
We conclude that $\hat F^{-1}_Y\big(\hat C^{-1}_{V|U_1,\dots,U_p}(\alpha \mid \hat u_1, \dots, \hat u_p)\big)$ is uniformly strongly consistent.
To prove the weak consistency in statement 2, by Wied and Weißbach (2012) and Silverman (1978), we only need to show
$$P\Big(\big|\hat F^{-1}_Y(\hat G(\alpha)) - F^{-1}_Y(G(\alpha))\big| \le \varepsilon\Big) \to 1.$$
Using the same technique as in Equation (12), with Theorem 2 of Bartle and Joichi (1961) on convergence in measure in place of the almost uniform convergence result, the weak consistency is obtained.

References
Akaike, H. (1973). Information theory and an extension of the maximum likelihood principle. In Petrov, B. and Csáki, F., editors, Second International Symposium on Information Theory, pages 267–281. Akadémiai Kiadó, Budapest.
Athey, S., Tibshirani, J., Wager, S., et al. (2019). Generalized random forests. The Annals of Statistics, 47(2):1148–1178.
Bartle, R. G. and Joichi, J. T. (1961). The preservation of convergence of measurable functions under composition. Proceedings of the American Mathematical Society, 12(1):122–126.
Bartle, R. G. and Sherbert, D. R. (2000). Introduction to Real Analysis. Wiley, New York.
Bauer, A. and Czado, C. (2016). Pair-copula Bayesian networks. Journal of Computational and Graphical Statistics, 25(4):1248–1271.
Bedford, T. and Cooke, R. M. (2002). Vines – a new graphical model for dependent random variables. The Annals of Statistics, 30(4):1031–1068.
Belloni, A. and Chernozhukov, V. (2011). ℓ1-penalized quantile regression in high-dimensional sparse models. The Annals of Statistics, 39(1):82–130.
Bernard, C. and Czado, C. (2015). Conditional quantiles and tail dependence. Journal of Multivariate Analysis, 138:104–126.
Breiman, L. (2001). Random forests. Machine Learning, 45(1):5–32.
Bühlmann, P. and van de Geer, S. (2011). Statistics for High-Dimensional Data: Methods, Theory and Applications. Springer Science & Business Media.
Chang, B. and Joe, H. (2019). Prediction based on conditional distributions of vine copulas. Computational Statistics & Data Analysis, 139:45–63.
Charpentier, A., Fermanian, J.-D., and Scaillet, O. (2007). The estimation of copulas: Theory and practice. In Rank, J., editor, Copulas: From Theory to Application in Finance, pages 35–64. Risk Books, London.
Chen, X., Koenker, R., and Xiao, Z. (2009). Copula-based nonlinear quantile autoregression. The Econometrics Journal, 12:S50–S67.
Cheng, C. (1995). Uniform consistency of generalized kernel estimators of quantile density. The Annals of Statistics, 23(6):2285–2291.
Cheng, K.-F. (1984). On almost sure representation for quantiles of the product limit estimator with applications. Sankhyā: The Indian Journal of Statistics, Series A.
Claeskens, G. and Hjort, N. (2003). The focused information criterion. Journal of the American Statistical Association, 98:900–916. With discussion and a rejoinder by the authors.
Czado, C. (2019). Analyzing Dependent Data with Vine Copulas: A Practical Guide With R. Lecture Notes in Statistics. Springer International Publishing.
Dua, D. and Graff, C. (2017). UCI machine learning repository.
Durrett, R. (2010). Probability: Theory and Examples. Cambridge University Press.
Fenske, N., Kneib, T., and Hothorn, T. (2011). Identifying risk factors for severe childhood malnutrition by boosting additive quantile regression. Journal of the American Statistical Association, 106(494):494–510.
Geenens, G. (2014). Probit transformation for kernel density estimation on the unit interval. Journal of the American Statistical Association, 109(505):346–358.
Geenens, G., Charpentier, A., and Paindaveine, D. (2017). Probit transformation for nonparametric kernel estimation of the copula density. Bernoulli, 23(3):1848–1873.
Gijbels, I. and Mielniczuk, J. (1990). Estimating the density of a copula function. Communications in Statistics - Theory and Methods, 19(2):445–464.
Gneiting, T. and Raftery, A. E. (2007). Strictly proper scoring rules, prediction, and estimation. Journal of the American Statistical Association, 102(477):359–378.
Grønneberg, S. and Hjort, N. L. (2014). The copula information criteria. Scandinavian Journal of Statistics, 41(2):436–459.
Haff, I. H., Aas, K., and Frigessi, A. (2010). On the simplified pair-copula construction – simply useful or too simplistic? Journal of Multivariate Analysis, 101(5):1296–1310.
Joe, H. (1996). Families of m-variate distributions with given margins and m(m-1)/2 bivariate dependence parameters. Lecture Notes - Monograph Series, pages 120–141.
Joe, H. (1997). Multivariate Models and Multivariate Dependence Concepts. CRC Press.
Jullum, M. and Hjort, N. L. (2017). Parametric or nonparametric: The FIC approach. Statistica Sinica, 27(3):951–981.
Ko, V. and Hjort, N. L. (2019). Copula information criterion for model selection with two-stage maximum likelihood estimation. Econometrics and Statistics, 12:167–180.
Ko, V., Hjort, N. L., and Hobæk Haff, I. (2019). Focused information criteria for copulas. Scandinavian Journal of Statistics, 46(4):1117–1140.
Koenker, R. (2004). Quantile regression for longitudinal data. Journal of Multivariate Analysis, 91(1):74–89.
Koenker, R. (2005a). Quantile Regression. Cambridge University Press.
Koenker, R. (2005b). Quantile Regression. Econometric Society Monographs. Cambridge University Press.
Koenker, R. (2011). Additive models for quantile regression: Model selection and confidence bandaids. Brazilian Journal of Probability and Statistics, 25(3):239–262.
Koenker, R. and Bassett, G. (1978). Regression quantiles. Econometrica: Journal of the Econometric Society.
Kolmogorov, A. N. and Fomin, S. (1970). Introductory Real Analysis. Prentice-Hall.
Komunjer, I. (2013). Quantile prediction. Chapter 17 in Handbook of Financial Econometrics, edited by Yacine Aït-Sahalia and Lars Peter Hansen. Elsevier.
Kraus, D. and Czado, C. (2017). D-vine copula based quantile regression. Computational Statistics & Data Analysis, 110:1–18.
Li, A. H. and Martin, A. (2017). Forest-type regression with general losses and robust forest. In International Conference on Machine Learning, pages 2091–2100.
Li, Q., Lin, J., and Racine, J. S. (2013). Optimal bandwidth selection for nonparametric conditional distribution and quantile functions. Journal of Business & Economic Statistics, 31(1):57–65.
Meinshausen, N. (2006). Quantile regression forests. Journal of Machine Learning Research, 7(Jun):983–999.
Nagler, T. (2018). kdecopula: An R package for the kernel estimation of bivariate copula densities. Journal of Statistical Software, 84(7):1–22.
Nagler, T. (2019). vinereg: D-Vine Quantile Regression. R package version 0.7.0.
Nagler, T., Schellhase, C., and Czado, C. (2017). Nonparametric estimation of simplified vine copula models: comparison of methods. Dependence Modeling, 5(1):99–120.
Nagler, T. and Vatter, T. (2019a). kde1d: Univariate Kernel Density Estimation. R package version 1.0.2.
Nagler, T. and Vatter, T. (2019b). rvinecopulib: High Performance Algorithms for Vine Copula Modeling. R package version 0.5.1.1.0.
Noh, H., Ghouch, A. E., and Bouezmarni, T. (2013). Copula-based regression estimation and inference. Journal of the American Statistical Association, 108(502):676–688.
Noh, H., Ghouch, A. E., and Van Keilegom, I. (2015). Semiparametric conditional quantile estimation through copula-based multivariate models. Journal of Business & Economic Statistics, 33(2):167–178.
Parzen, E. (1962). On estimation of a probability density function and mode. The Annals of Mathematical Statistics, 33(3):1065–1076.
R Core Team (2020). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria.
Ryzin, J. V. (1969). On strong consistency of density estimates. The Annals of Mathematical Statistics, 40(5):1765–1772.
Schwarz, G. (1978). Estimating the dimension of a model. The Annals of Statistics, 6(2):461–464.
Silverman, B. W. (1978). Weak and strong uniform consistency of the kernel estimate of a density and its derivatives. The Annals of Statistics, pages 177–184.
Sklar, M. (1959). Fonctions de répartition à n dimensions et leurs marges. Publ. Inst. Statist. Univ. Paris, 8:229–231.
Stoeber, J., Joe, H., and Czado, C. (2013). Simplified pair copula constructions – limitations and extensions. Journal of Multivariate Analysis, 119:101–118.
Tepegjozova, M. (2019). D- and C-vine quantile regression for large data sets. Masterarbeit, Technische Universität München, Garching b. München.
Van Keilegom, I. and Veraverbeke, N. (1998). Bootstrapping quantiles in a fixed design regression model with censored data. Journal of Statistical Planning and Inference, 69(1):115–131.
Wen, K. and Wu, X. (2015). An improved transformation-based kernel estimator of densities on the unit interval. Journal of the American Statistical Association, 110(510):773–783.
Wied, D. and Weißbach, R. (2012). Consistency of the kernel density estimator: a survey. Statistical Papers, 53(1):1–21.
Xiao, Z. and Koenker, R. (2009). Conditional quantile estimation for generalized autoregressive conditional heteroscedasticity models. Journal of the American Statistical Association, 104(488):1696–1712.
Yeh, I.-C. (1998). Modeling of strength of high-performance concrete using artificial neural networks. Cement and Concrete Research, 28(12):1797–1808.
Yu, K. and Jones, M. (1998). Local linear quantile regression. Journal of the American Statistical Association, 93(441):228–237.
Yu, K. and Moyeed, R. A. (2001). Bayesian quantile regression.