Random Forest based Quantile Oriented Sensitivity Analysis indices estimation
Kévin Elie-Dit-Cosaque∗1,2 and Véronique Maume-Deschamps†1
1 Institut Camille Jordan, Université Claude Bernard Lyon 1, Lyon, France
2 Actuarial Modelling, SCOR, Paris, France
∗[email protected]  †[email protected]
February 26, 2021
Abstract
We propose a random forest based estimation procedure for Quantile Oriented Sensitivity Analysis (QOSA). In order to be efficient, a cross-validation step on the leaf size of the trees is required. Our full estimation procedure is tested on both simulated data and a real dataset.
Keywords: Quantile Oriented Sensitivity Analysis, Random forest, Cross-validation, Out-of-Bag samples.
1 Introduction

Numerical models are ubiquitous in various fields such as aerospace, economics, environment or insurance: they allow to approximate the behavior of physical phenomena. Their main advantage is that they replace expensive, or even unachievable, real-life experiments and thus provide knowledge about the natural system. The extremely faithful representation of reality, made possible by the increase in computing power, also explains this widespread use. However, this accuracy is often synonymous with complexity, ultimately making the models difficult to interpret. Besides, model inputs are usually uncertain, due to a lack of information or to the random nature of some factors, so that the resulting output can be regarded as random. It is then important to assess the impact of this uncertainty on the model output. Global Sensitivity Analysis (GSA) methods address these issues by studying how the uncertainty in the output of a model can be apportioned to the different sources of uncertainty in the model inputs (Saltelli et al., 2004). Hence, GSA allows to investigate input-output relationships by identifying the inputs that strongly influence the model response. Conversely, it may be of interest to see that although some inputs may not be very well established, they do not contribute significantly to the output uncertainty.

Variance-based approaches are well established and widely used for GSA. Among them, the sensitivity indices developed by Sobol (1993) are very popular. This method rests on the assumption that the inputs are independent. Under this hypothesis, the overall variance of a scalar output can be split into partial variances using the so-called Hoeffding (1948) decomposition. Then, the first-order Sobol' index quantifies the individual contribution of an input to the output variance, while the total Sobol' index (Jansen et al., 1994; Homma and Saltelli, 1996) measures both marginal and interaction effects. However, even if they are extremely popular and informative measures, variance-based approaches suffer from some limitations. Indeed, by definition, they only study the impact of the inputs on the expectation of the output, since they use the variance as distance measure.

A new class of sensitivity indices, generalizing the first-order Sobol' index to quantities of interest other than the expectation, has been introduced in Fort et al. (2016). These indices, called Goal Oriented Sensitivity Analysis (GOSA) indices, compare the minimum of a specific contrast function to its conditional counterpart when one of the inputs is fixed, the unconditional minimum being reached by the quantity of interest (for example a quantile).

In this paper, we focus on Quantile Oriented Sensitivity Analysis (QOSA), measuring the impact of the inputs on the α-quantile of the output distribution. Browne et al. (2017); Maume-Deschamps and Niang (2018) introduced statistical estimators of the first-order QOSA index based on a kernel approach. Kala (2019) defined second and higher order QOSA indices as well as a variance-like decomposition for quantiles in the case of independent inputs. Elie-Dit-Cosaque and Maume-Deschamps (2021) studied QOSA indices on various toy models in independent and dependent contexts.

Despite these recent works, the question of the effective estimation of the first-order QOSA index remains open.
Indeed, it turns out to be difficult to compute in practice because it requires an accurate estimate of either the conditional quantile of the output given an input, or the minimum of a conditional expectation of the output given an input. Kala (2019) handles this feature with a brute-force Monte Carlo approach: for each value of an input, realizations of the other inputs are generated conditionally on the fixed value. In this approach, the dependence structure of the inputs therefore has to be known, which is not always the case. Besides, the computational cost is too high to consider its use in an industrial context when dealing with costly models. Browne et al. (2017); Maume-Deschamps and Niang (2018) developed kernel-based estimators to avoid this double-loop issue. But, when using a small dataset, their performance is highly dependent on the bandwidth parameter. Browne et al. (2017) proposed a cumbersome algorithm for setting an efficient bandwidth which is not straightforward to implement in practice. As for the estimator of Maume-Deschamps and Niang (2018), a large dataset is needed in order to have a low estimation error, as no bandwidth selection algorithm is established.

To overcome these issues, we explore the random forest algorithm introduced by Breiman (2001) in order to estimate the conditional distribution of the output given an input. The main contribution of this paper is to provide different estimation strategies of the first-order QOSA index based on this method.

The paper is organized as follows. We recall in Section 2 the definition of the first-order QOSA index and initiate the estimation process. Section 3 presents the random forest algorithm, and several estimators of the first-order QOSA index based on this method are described in Section 4. The entire process is summarized in Section 5. Then, the performance of the estimators is investigated in Section 6 on simulated data, and the relevance of this index is highlighted on a real dataset in Section 7. Finally, a conclusion is given in Section 8.

2 Estimation of the QOSA index
Let us consider the input-output system $Y = f(X)$, where $X = (X_1, \dots, X_d) \in \mathbb{R}^d$ is a random vector of $d$ independent inputs and $Y$ is the output random variable of a measurable deterministic function $f : \mathbb{R}^d \to \mathbb{R}$, which can be a mathematical function or a computational code. Then, given a level $\alpha \in \,]0,1[$, the first-order QOSA index associated with $X_i$ is defined as

$$S_i^\alpha = \frac{\min_{\theta \in \mathbb{R}} \mathbb{E}[\psi_\alpha(Y,\theta)] - \mathbb{E}\Big[\min_{\theta \in \mathbb{R}} \mathbb{E}[\psi_\alpha(Y,\theta) \mid X_i]\Big]}{\min_{\theta \in \mathbb{R}} \mathbb{E}[\psi_\alpha(Y,\theta)]},$$

with the contrast function $\psi_\alpha : (y,\theta) \mapsto (y-\theta)\big(\alpha - \mathbb{1}_{\{y \leqslant \theta\}}\big)$. This function, also called pinball loss or check function in the literature, is the cornerstone of quantile regression (Koenker and Hallock, 2001). Quantile and conditional quantile are related to this loss function as follows:

$$q^\alpha(Y) = \arg\min_{\theta \in \mathbb{R}} \mathbb{E}[\psi_\alpha(Y,\theta)] \quad\text{and}\quad q^\alpha(Y \mid X_i) = \arg\min_{\theta \in \mathbb{R}} \mathbb{E}[\psi_\alpha(Y,\theta) \mid X_i],$$

where $q^\alpha(Y)$ is the $\alpha$-quantile of $Y$ and $q^\alpha(Y \mid X_i)$ the $\alpha$-quantile of $Y$ given $X_i$. Thus, the index $S_i^\alpha$ can be rewritten in the following way:

$$S_i^\alpha = 1 - \frac{\mathbb{E}\Big[\min_{\theta \in \mathbb{R}} \mathbb{E}[\psi_\alpha(Y,\theta) \mid X_i]\Big]}{\min_{\theta \in \mathbb{R}} \mathbb{E}[\psi_\alpha(Y,\theta)]} = 1 - \frac{\mathbb{E}[\psi_\alpha(Y, q^\alpha(Y \mid X_i))]}{\mathbb{E}[\psi_\alpha(Y, q^\alpha(Y))]} = 1 - \frac{O}{P},$$

where $O$ refers to $\mathbb{E}\big[\min_{\theta \in \mathbb{R}} \mathbb{E}[\psi_\alpha(Y,\theta) \mid X_i]\big] = \mathbb{E}[\psi_\alpha(Y, q^\alpha(Y \mid X_i))]$ and $P$ to $\min_{\theta \in \mathbb{R}} \mathbb{E}[\psi_\alpha(Y,\theta)] = \mathbb{E}[\psi_\alpha(Y, q^\alpha(Y))]$.

Hence, as stated in Browne et al. (2017), the index $S_i^\alpha$ compares the mean distance between $Y$ and its conditional quantile to the mean distance between $Y$ and its quantile, where the pinball loss function $\psi_\alpha$ is the considered distance. This index has the basic properties requested for a reasonable sensitivity index: $0 \leqslant S_i^\alpha \leqslant 1$, $S_i^\alpha = 0$ if $Y$ is independent of $X_i$, and $S_i^\alpha = 1$ if $Y$ is $X_i$-measurable.

It should be mentioned that Kucherenko et al. (2019) proposed other indices, $K^\alpha$, to assess the impact of the inputs on the $\alpha$-quantile of the output distribution. They directly quantify the mean distance between the quantiles $q^\alpha(Y)$ and $q^\alpha(Y \mid X_i)$ rather than the mean distance between average contrast functions as in the first-order QOSA index. Different estimation strategies are investigated in their paper (brute-force Monte Carlo and a double-loop reordering approach), but a major limitation is that a large sample size is required to get an accurate computation of the index. Also, as mentioned in Elie-Dit-Cosaque and Maume-Deschamps (2021), the practical interpretation of the $K^\alpha$ indices is questionable.

Let us now initiate the estimation procedure for the first-order QOSA index $S_i^\alpha$ associated with a specific input $X_i$ and a level $\alpha$. We consider an i.i.d. $n$-sample $D_n^{\diamond} = \big(X_j^{\diamond}, Y_j^{\diamond}\big)_{j=1,\dots,n}$ such that $Y_j^{\diamond} = f\big(X_j^{\diamond}\big)$, $j = 1, \dots, n$. A first natural estimator of the $P$ term of the QOSA index, based on the quantity $\mathbb{E}[\psi_\alpha(Y, q^\alpha(Y))]$, is

$$\widehat{P} = \frac{1}{n} \sum_{j=1}^{n} \psi_\alpha\Big(Y_j^{\diamond}, \widehat{q}^{\,\alpha}(Y)\Big), \qquad (2.1)$$

with $\widehat{q}^{\,\alpha}(Y)$ the classical empirical estimator of $q^\alpha(Y)$ obtained from $D_n^{\diamond}$.
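For concreteness, the pinball loss and the plug-in estimator (2.1) can be written in a few lines of Python. This is a minimal sketch (the function names and the illustrative sample are ours, not part of the paper's package):

```python
import numpy as np

def pinball_loss(y, theta, alpha):
    """Contrast psi_alpha(y, theta) = (y - theta) * (alpha - 1{y <= theta})."""
    return (y - theta) * (alpha - (y <= theta))

def P_hat(y_sample, alpha):
    """Estimator (2.1): mean pinball loss around the empirical alpha-quantile."""
    return pinball_loss(y_sample, np.quantile(y_sample, alpha), alpha).mean()

rng = np.random.default_rng(0)
y = rng.exponential(size=10_000)   # illustrative output sample, Y ~ E(1)
print(P_hat(y, alpha=0.7))         # approximates alpha*E[Y] - E[Y 1{Y <= q^alpha(Y)}]
```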
The $P$ term can alternatively be estimated through the quantity $\min_{\theta \in \mathbb{R}} \mathbb{E}[\psi_\alpha(Y,\theta)]$:

$$\widehat{P} = \min_{\theta \in \mathbb{R}} \frac{1}{n} \sum_{j=1}^{n} \psi_\alpha\Big(Y_j^{\diamond}, \theta\Big),$$

where the minimum is reached at one of the elements of $\big(Y_j^{\diamond}\big)_{j=1,\dots,n}$. As the function to minimize is first decreasing and then increasing, this estimator requires to compute $\frac{1}{n} \sum_{j=1}^{n} \psi_\alpha\big(Y_j^{\diamond}, Y_{(k)}^{\diamond}\big)$, $k = 1, \dots, n$, until it increases, with $Y_{(k)}^{\diamond}$ the order statistics of $\big(Y_1^{\diamond}, \dots, Y_n^{\diamond}\big)$. This process is much more time-consuming than the first estimator, where we just need to compute the quantile and then plug it in. Thus, in the sequel, we use the estimator $\widehat{P}$ of Equation (2.1).

The $O$ term of the QOSA index is trickier to estimate because a good approximation of the conditional distribution of $Y$ given $X_i$ is necessary. Both existing estimators of the QOSA index currently provided in Browne et al. (2017); Maume-Deschamps and Niang (2018) handle this feature thanks to kernel-based methods. But in practice, with these methods, we are faced with determining the optimal bandwidth parameter, or with using large sample sizes in order to have a sufficiently low estimation error when employing a non-optimal bandwidth. Thus, when dealing with costly computational models, a precise enough estimation of these indices can be difficult to achieve or even unfeasible.

We propose in this paper to address these issues by using the random forest method for estimating the conditional distribution. Several statistical estimators of the $O$ term of the first-order QOSA index will accordingly be defined in Section 4. Let us first recall the random forest algorithm.

3 Random forests

Random forests are ensemble learning methods, first introduced by Breiman (2001), which can be used for classification or regression problems. We only focus on their use for regression tasks and assume to be given a training sample $D_n = \big(X_j, Y_j\big)_{j=1,\dots,n}$ of i.i.d. random variables distributed as the prototype pair $(X, Y)$.

Breiman's forest grows a collection of $k$ regression trees based on the CART procedure described in Breiman et al. (1984). Building several different trees from a single dataset requires to randomize the tree building process. The randomness injected in the $\ell$-th tree is denoted by $\Theta_\ell$, where $(\Theta_\ell)_{\ell=1,\dots,k}$ are independent random variables distributed as $\Theta$ (independent of $D_n$). $\Theta = (\Theta^1, \Theta^2)$ contains the indices of the observations selected to build the tree and the indices of the splitting candidate directions in each cell. In more detail, the $\ell$-th tree is built using a bootstrap sample $D_n^{*}(\Theta_\ell)$ from the original dataset. Only these observations are used to construct the tree and to make the tree prediction. Once the observations have been selected, the algorithm forms a recursive partitioning of the input space. In each cell, a number max_features of variables is selected uniformly at random among all inputs. Then, the best split is chosen as the one optimizing the CART splitting criterion along the max_features preselected directions only. This process is repeated in each cell. A stopping criterion, often implemented, is that a split point at any depth is only considered if it leaves at least min_samples_leaf observations in each of the left and right child nodes. After the tree partition has been completed, the prediction of the $\ell$-th tree, denoted by $m_n^b(x; \Theta_\ell, D_n)$, at a new point $x$ is computed by averaging the $N_n^b(x; \Theta_\ell, D_n)$ observations falling into the cell $A_n(x; \Theta_\ell, D_n)$ of the new point. Hence, the random forest prediction is the average of the $k$ predicted values:

$$m_{k,n}^b(x; \Theta_1, \dots, \Theta_k, D_n) = \frac{1}{k} \sum_{\ell=1}^{k} m_n^b(x; \Theta_\ell, D_n) = \frac{1}{k} \sum_{\ell=1}^{k} \sum_{j \in D_n^{*}(\Theta_\ell)} \frac{\mathbb{1}_{\{X_j \in A_n(x; \Theta_\ell, D_n)\}}}{N_n^b(x; \Theta_\ell, D_n)} \, Y_j. \qquad (3.1)$$
By defining the random variable $B_j(\Theta_\ell, D_n)$ as the number of times the observation $(X_j, Y_j)$ of the original dataset has been selected in the $\ell$-th tree construction, the conditional mean estimator of Equation (3.1) can be rewritten as

$$m_{k,n}^b(x; \Theta_1, \dots, \Theta_k, D_n) = \sum_{j=1}^{n} w_{n,j}^b(x; \Theta_1, \dots, \Theta_k, D_n) \, Y_j, \qquad (3.2)$$

where the weights $w_{n,j}^b$ are defined by

$$w_{n,j}^b(x; \Theta_1, \dots, \Theta_k, D_n) = \frac{1}{k} \sum_{\ell=1}^{k} \frac{B_j(\Theta_\ell, D_n) \, \mathbb{1}_{\{X_j \in A_n(x; \Theta_\ell, D_n)\}}}{N_n^b(x; \Theta_\ell, D_n)}. \qquad (3.3)$$

A variant of Equation (3.2) provides another estimator of the conditional mean. Trees are still grown as in the standard random forest algorithm, i.e. on the bootstrap samples, but for the tree prediction the original dataset $D_n$ is used instead of the bootstrap sample $D_n^{*}(\Theta_\ell)$ associated with the $\ell$-th tree, and we get

$$m_{k,n}^o(x; \Theta_1, \dots, \Theta_k, D_n) = \sum_{j=1}^{n} w_{n,j}^o(x; \Theta_1, \dots, \Theta_k, D_n) \, Y_j, \qquad (3.4)$$

where the weights $w_{n,j}^o$ are defined by

$$w_{n,j}^o(x; \Theta_1, \dots, \Theta_k, D_n) = \frac{1}{k} \sum_{\ell=1}^{k} \frac{\mathbb{1}_{\{X_j \in A_n(x; \Theta_\ell, D_n)\}}}{N_n^o(x; \Theta_\ell, D_n)}. \qquad (3.5)$$

It has to be noted that, contrary to Equation (3.3) where $N_n^b(x; \Theta_\ell, D_n)$ refers to the number of elements of $D_n^{*}(\Theta_\ell)$ falling into $A_n(x; \Theta_\ell, D_n)$, in Equation (3.5) $N_n^o(x; \Theta_\ell, D_n)$ is the number of elements of $D_n$ that fall into $A_n(x; \Theta_\ell, D_n)$.

Thus, both weighted approaches, using either the bootstrap samples (Equation (3.2)) or the original dataset (Equation (3.4)), allow to see the random forest method as a local averaging estimate (Lin and Jeon, 2006; Scornet, 2016), and they will be at the heart of the strategies proposed for estimating the $O$ term of the QOSA index. In the following, to lighten notation, we omit the dependence on $\Theta$ and $D_n$ in the weights.
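This local averaging view is easy to reproduce in practice: scikit-learn's apply method returns the leaf reached by each observation in each tree, which is all that is needed for the original-sample weights of Equation (3.5). The following minimal sketch (our own illustration, not the implementation of the qosa-indices package) computes $w_{n,j}^o(x)$ and the conditional-mean estimate (3.4) at a single point:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def original_sample_weights(forest, X_train, x_new):
    """Weights w^o_{n,j}(x) of Equation (3.5): trees are grown on bootstrap
    samples, but each leaf is summarized with the original observations."""
    leaves_train = forest.apply(X_train)              # (n, k) leaf ids
    leaf_row = forest.apply(x_new.reshape(1, -1))[0]  # (k,) leaves of the new point
    n, k = leaves_train.shape
    w = np.zeros(n)
    for ell in range(k):
        in_leaf = leaves_train[:, ell] == leaf_row[ell]
        w[in_leaf] += 1.0 / (k * in_leaf.sum())       # 1 / (k * N^o_n)
    return w

rng = np.random.default_rng(0)
X = rng.exponential(size=(2000, 1))
y = X[:, 0] + rng.normal(size=2000)
forest = RandomForestRegressor(n_estimators=100, min_samples_leaf=20,
                               random_state=0).fit(X, y)
w = original_sample_weights(forest, X, np.array([1.0]))
print(w.sum(), w @ y)   # weights sum to 1; w @ y is the estimate (3.4) of E[Y | X = 1]
```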
4 Estimation of the O term of the QOSA index

Using the random forest method described above, ten estimators of the $O$ term may be defined. The first four rely on the expression $\mathbb{E}[\psi_\alpha(Y, q^\alpha(Y \mid X_i))]$ and the others on $\mathbb{E}\big[\min_{\theta \in \mathbb{R}} \mathbb{E}[\psi_\alpha(Y,\theta) \mid X_i]\big]$. Since our aim is to estimate conditional expressions with respect to one input variable, say $X_i$, we shall consider forests driven by $X_i$, i.e. the random forest is built with the observations $D_n^i = \big(X_{j,i}, Y_j\big)_{j=1,\dots,n}$ from $D_n$, which means that $Y$ is explained with $X_i$ only. When needed, we shall denote by $D_n^{i*}$ a bootstrap sample from $D_n^i$ and by $D_n^{\diamond i} = \big(X_{j,i}^{\diamond}, Y_j^{\diamond}\big)_{j=1,\dots,n}$ an independent copy of $D_n^i$.

4.1 O term estimators based on the conditional quantile

In this subsection, the estimations of the $O$ term of the QOSA index are based on the quantity $\mathbb{E}[\psi_\alpha(Y, q^\alpha(Y \mid X_i))]$. Using the two samples $D_n^i$ and $D_n^{\diamond i}$, we define

$$\widehat{R}_i = \frac{1}{n} \sum_{j=1}^{n} \psi_\alpha\Big(Y_j^{\diamond}, \widehat{q}^{\,\alpha}\big(Y \mid X_i = X_{j,i}^{\diamond}\big)\Big),$$

where the sample $D_n^i$ is used to get $\widehat{q}^{\,\alpha}(Y \mid X_i = x_i)$, an estimator of the conditional quantile $q^\alpha(Y \mid X_i = x_i)$. It is obtained thanks to two random forest based approaches, described in the sequel.

4.1.1 Quantile estimation through the C_CDF

We consider the estimator of the Conditional Cumulative Distribution Function (C_CDF) introduced in Elie-Dit-Cosaque and Maume-Deschamps (2020), using $D_n^i$ to construct the forest. The C_CDF estimator used to estimate the conditional quantile is

$$F_{k,n}^b(y \mid X_i = x_i) = \sum_{j=1}^{n} w_{n,j}^b(x_i) \, \mathbb{1}_{\{Y_j \leqslant y\}},$$

where the weights $w_{n,j}^b(x_i)$ are defined in Equation (3.3). Hence, given a level $\alpha \in \,]0,1[$, the estimator $\widehat{q}^{\,\alpha}(Y \mid X_i = x_i)$ is defined as follows:

$$\widehat{q}^{\,\alpha}(Y \mid X_i = x_i) = \inf_{p=1,\dots,n} \Big\{ Y_p : F_{k,n}^b(Y_p \mid X_i = x_i) \geqslant \alpha \Big\}.$$

The resulting estimator of $\mathbb{E}[\psi_\alpha(Y, q^\alpha(Y \mid X_i))]$ is denoted $\widehat{R}_i^{1,b}$.

Another estimator of the C_CDF is obtained by replacing the weights $w_{n,j}^b(x_i)$, based on the bootstrap samples of the forest, by those using the original dataset, $w_{n,j}^o(x_i)$, provided in Equation (3.5). That gives the following estimator, which has been proposed in Meinshausen (2006):

$$F_{k,n}^o(y \mid X_i = x_i) = \sum_{j=1}^{n} w_{n,j}^o(x_i) \, \mathbb{1}_{\{Y_j \leqslant y\}}.$$

The conditional quantiles are then estimated by plugging in $F_{k,n}^o(\cdot \mid X_i = x_i)$ instead of $F(\cdot \mid X_i = x_i)$. Accordingly, the associated estimator of $\mathbb{E}[\psi_\alpha(Y, q^\alpha(Y \mid X_i))]$ is denoted $\widehat{R}_i^{1,o}$.

4.1.2 Quantile estimation within a leaf

Let us consider a set of $k$ trees indexed by $\ell = 1, \dots, k$, constructed with the sample $D_n^i$. For the $\ell$-th tree, the estimator $\widehat{q}_\ell^{\,b,\alpha}(Y \mid X_i = x_i)$ of $q^\alpha(Y \mid X_i = x_i)$ is obtained from the bootstrap observations falling into $A_n(x_i; \Theta_\ell, D_n^i)$ as follows:

$$\widehat{q}_\ell^{\,b,\alpha}(Y \mid X_i = x_i) = \inf_{p=1,\dots,n} \bigg\{ Y_p,\ (X_{p,i}, Y_p) \in D_n^{i*}(\Theta_\ell) \text{ and } X_{p,i} \in A_n(x_i; \Theta_\ell, D_n^i) : \sum_{j=1}^{n} \frac{B_j(\Theta_\ell, D_n^i) \cdot \mathbb{1}_{\{X_{j,i} \in A_n(x_i; \Theta_\ell, D_n^i)\}} \cdot \mathbb{1}_{\{Y_j \leqslant Y_p\}}}{N_n^b(x_i; \Theta_\ell, D_n^i)} \geqslant \alpha \bigg\}.$$

The values of the $k$ randomized trees are then aggregated to obtain the random forest estimate

$$\widehat{q}^{\,b,\alpha}(Y \mid X_i = x_i) = \frac{1}{k} \sum_{\ell=1}^{k} \widehat{q}_\ell^{\,b,\alpha}(Y \mid X_i = x_i).$$

As for the conditional mean estimate defined in Section 3, or for the C_CDF approximation introduced in Subsection 4.1.1, we can provide a variant using the original sample. Thus, once the forest is constructed with the bootstrap samples, we may estimate the conditional quantiles in the leaves of the $\ell$-th tree using the original sample as follows:

$$\widehat{q}_\ell^{\,o,\alpha}(Y \mid X_i = x_i) = \inf_{p=1,\dots,n} \bigg\{ Y_p,\ (X_{p,i}, Y_p) \in D_n^i \text{ and } X_{p,i} \in A_n(x_i; \Theta_\ell, D_n^i) : \sum_{j=1}^{n} \frac{\mathbb{1}_{\{X_{j,i} \in A_n(x_i; \Theta_\ell, D_n^i)\}} \cdot \mathbb{1}_{\{Y_j \leqslant Y_p\}}}{N_n^o(x_i; \Theta_\ell, D_n^i)} \geqslant \alpha \bigg\},$$

which gives the random forest estimate of the conditional quantile

$$\widehat{q}^{\,o,\alpha}(Y \mid X_i = x_i) = \frac{1}{k} \sum_{\ell=1}^{k} \widehat{q}_\ell^{\,o,\alpha}(Y \mid X_i = x_i).$$

These two methods allow us to propose the estimator $\widehat{R}_i^{2,b}$ (resp. $\widehat{R}_i^{2,o}$) of $\mathbb{E}[\psi_\alpha(Y, q^\alpha(Y \mid X_i))]$, using the bootstrap samples (resp. the original sample).
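As an illustration of the C_CDF approach, here is a hedged sketch of $\widehat{R}_i^{1,o}$ on the toy model (5.1) used later: a forest driven by $X_i$ alone is fitted, the conditional quantile is read from the weighted CDF $F_{k,n}^o$, and the result is plugged into the pinball mean over an independent sample. The leaf size is fixed arbitrarily here, whereas Section 5 explains how to tune it; all names, sizes and seeds are ours.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def pinball_loss(y, theta, alpha):
    return (y - theta) * (alpha - (y <= theta))

def forest_cond_quantile(forest, x_train, y_train, x_eval, alpha):
    """Conditional quantile from the weighted CDF F^o_{k,n}: the smallest Y_p
    such that F^o_{k,n}(Y_p | X_i = x) >= alpha."""
    order = np.argsort(y_train)
    y_sorted = y_train[order]
    leaves_train = forest.apply(x_train)[order]   # leaf ids, rows sorted like y
    quantiles = np.empty(len(x_eval))
    for m, leaf_row in enumerate(forest.apply(x_eval)):
        match = leaves_train == leaf_row          # (n, k) leaf co-membership
        w = (match / match.sum(axis=0)).mean(axis=1)
        cdf = np.cumsum(w)                        # F^o evaluated at the sorted Y's
        quantiles[m] = y_sorted[min(np.searchsorted(cdf, alpha), len(y_sorted) - 1)]
    return quantiles

rng = np.random.default_rng(1)
n, alpha = 1000, 0.7
x1, x2 = rng.exponential(size=(2, n)); y = x1 - x2        # model (5.1)
x1s, x2s = rng.exponential(size=(2, n)); ys = x1s - x2s   # independent diamond sample
forest = RandomForestRegressor(n_estimators=50, min_samples_leaf=40,
                               random_state=0).fit(x1.reshape(-1, 1), y)
q_cond = forest_cond_quantile(forest, x1.reshape(-1, 1), y,
                              x1s.reshape(-1, 1), alpha)
R_1o = pinball_loss(ys, q_cond, alpha).mean()
print(R_1o)   # analytical value for this model: -alpha*ln(alpha), about 0.25
```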
4.2 O term estimators based on the minimum

The estimators developed in Subsection 4.1, based on $\mathbb{E}[\psi_\alpha(Y, q^\alpha(Y \mid X_i))]$, require to approximate the conditional quantile and then plug it in to estimate the $O$ term. As mentioned before, the model $f$ could be time-consuming, so these estimators may be inappropriate, as two training samples are necessary. Hence, we propose in this part to develop estimators of the $O$ term taking advantage of the expression $\mathbb{E}\big[\min_{\theta \in \mathbb{R}} \mathbb{E}[\psi_\alpha(Y,\theta) \mid X_i]\big]$, for which we only need to find the minimum instead of plugging in the quantile.

4.2.1 Minimum estimation with a weighted approach

First of all, a random forest is built with the observations $D_n^i$. Then, by considering an additional sample $\big(X_j^{\diamond}\big)_{j=1,\dots,n}$ independent of $D_n$, the $O$ term may be estimated as follows:

$$\widehat{Q}_i^{1,b} = \frac{1}{n} \sum_{m=1}^{n} \min_{p=1,\dots,n} \sum_{j=1}^{n} w_{n,j}^b(X_{m,i}^{\diamond}) \, \psi_\alpha\big(Y_j, Y_p\big).$$

Let us notice that the conditional expectation $\mathbb{E}[\psi_\alpha(Y,\theta) \mid X_i = x_i]$ is estimated with $\sum_{j=1}^{n} w_{n,j}^b(x_i) \, \psi_\alpha(Y_j, \theta)$, whose minimum is reached when $\theta$ equals one of the elements of $\big(Y_j\big)_{j=1,\dots,n}$. Another estimator is obtained by replacing the weights $w_{n,j}^b(x_i)$ with the version $w_{n,j}^o(x_i)$ presented in Equation (3.5), using the original dataset. The resulting estimator of the $O$ term is denoted by $\widehat{Q}_i^{1,o}$.

4.2.2 Minimum estimation within the leaves

In this subsection, we take advantage of the tree structure in order to propose a new estimator. To begin with, let us consider that a random forest is built with the observations $D_n^i$. The key point is that an additional sample is no longer required to process the outer expectation of the $O$ term. Indeed, for the $\ell$-th tree, the observations falling into its $m$-th leaf node, denoted by $A_n(m; \Theta_\ell, D_n^i)$, approximate the conditional distribution of $Y$ given a certain point $X_i = x_i$, which allows to estimate the minimum of the conditional expectation $\min_{\theta \in \mathbb{R}} \mathbb{E}[\psi_\alpha(Y,\theta) \mid X_i = x_i]$. Then, we average over all the leaves of the $\ell$-th tree to deal with the outer expectation. Hence, let $N_n^b(m; \Theta_\ell, D_n^i)$ be the number of observations of the bootstrap sample $D_n^{i*}(\Theta_\ell)$ falling into the $m$-th leaf node, and $N_{\text{leaves}}^\ell$ be the number of leaves of the $\ell$-th tree. We define the following tree estimator for the $O$ term:

$$\frac{1}{N_{\text{leaves}}^\ell} \sum_{m=1}^{N_{\text{leaves}}^\ell} \min_{\substack{p = 1, \dots, n,\ (X_{p,i}, Y_p) \in D_n^{i*}(\Theta_\ell) \\ X_{p,i} \in A_n(m; \Theta_\ell, D_n^i)}} \sum_{j=1}^{n} \frac{B_j(\Theta_\ell, D_n^i) \cdot \psi_\alpha(Y_j, Y_p) \cdot \mathbb{1}_{\{(X_{j,i}, Y_j) \in D_n^{i*}(\Theta_\ell),\ X_{j,i} \in A_n(m; \Theta_\ell, D_n^i)\}}}{N_n^b(m; \Theta_\ell, D_n^i)}.$$

The approximations of the $k$ randomized trees are then averaged to obtain the random forest estimate

$$\widehat{Q}_i^{2,b} = \frac{1}{k} \sum_{\ell=1}^{k} \frac{1}{N_{\text{leaves}}^\ell} \sum_{m=1}^{N_{\text{leaves}}^\ell} \min_{\substack{p = 1, \dots, n,\ (X_{p,i}, Y_p) \in D_n^{i*}(\Theta_\ell) \\ X_{p,i} \in A_n(m; \Theta_\ell, D_n^i)}} \sum_{j=1}^{n} \frac{B_j(\Theta_\ell, D_n^i) \cdot \psi_\alpha(Y_j, Y_p) \cdot \mathbb{1}_{\{(X_{j,i}, Y_j) \in D_n^{i*}(\Theta_\ell),\ X_{j,i} \in A_n(m; \Theta_\ell, D_n^i)\}}}{N_n^b(m; \Theta_\ell, D_n^i)}.$$
As before, the entire procedure described above remains valid when using the observations of the original sample falling into the leaves of the $\ell$-th tree instead of the bootstrap ones. With this change, we get the following estimator of the $O$ term:

$$\widehat{Q}_i^{2,o} = \frac{1}{k} \sum_{\ell=1}^{k} \frac{1}{N_{\text{leaves}}^\ell} \sum_{m=1}^{N_{\text{leaves}}^\ell} \min_{\substack{p = 1, \dots, n,\ (X_{p,i}, Y_p) \in D_n^i \\ X_{p,i} \in A_n(m; \Theta_\ell, D_n^i)}} \sum_{j=1}^{n} \frac{\psi_\alpha(Y_j, Y_p) \cdot \mathbb{1}_{\{(X_{j,i}, Y_j) \in D_n^i,\ X_{j,i} \in A_n(m; \Theta_\ell, D_n^i)\}}}{N_n^o(m; \Theta_\ell, D_n^i)}.$$

It should be noted that looking for the minimum directly in the leaves requires that they are sufficiently sampled for the method to be valid.
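Since the leaf memberships of the original observations are again available through scikit-learn's apply, the estimator $\widehat{Q}_i^{2,o}$ can be sketched directly; the inner loop over the candidates $\theta \in \{Y_p\}$ of each leaf is kept naive for clarity. This is a minimal sketch under our own naming, not the paper's implementation:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def pinball_loss(y, theta, alpha):
    return (y - theta) * (alpha - (y <= theta))

def Q_2o(forest, x_train, y_train, alpha):
    """Q^{2,o}_i: in each leaf, minimise the mean pinball loss over the original
    observations it contains, then average over leaves, then over trees."""
    leaves = forest.apply(x_train)                  # (n, k) leaf ids
    k = leaves.shape[1]
    tree_means = np.empty(k)
    for ell in range(k):
        leaf_ids = np.unique(leaves[:, ell])
        leaf_mins = np.empty(len(leaf_ids))
        for m, leaf in enumerate(leaf_ids):
            y_leaf = y_train[leaves[:, ell] == leaf]
            # candidate theta's are the Y's of the leaf itself
            losses = pinball_loss(y_leaf[:, None], y_leaf[None, :], alpha)
            leaf_mins[m] = losses.mean(axis=0).min()
        tree_means[ell] = leaf_mins.mean()
    return tree_means.mean()

rng = np.random.default_rng(2)
n, alpha = 2000, 0.7
x1, x2 = rng.exponential(size=(2, n)); y = x1 - x2   # model (5.1)
forest = RandomForestRegressor(n_estimators=50, min_samples_leaf=40,
                               random_state=0).fit(x1.reshape(-1, 1), y)
print(Q_2o(forest, x1.reshape(-1, 1), y, alpha))     # close to -alpha*ln(alpha) here
```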
4.2.3 Minimum estimation with complete trees

In Subsections 4.2.1 and 4.2.2, the conditional distribution of $Y$ given $X_i$ is obtained from trees grown with $D_n^i$. Instead of using this approach, we propose in this part to build a forest with complete trees, i.e. trees grown with all the model's inputs, and then to adjust the weights in order to recover the conditional expectation $\mathbb{E}[\psi_\alpha(Y,\theta) \mid X_i]$.

Thus, a full random forest is constructed with the whole dataset $D_n$. Then, by using an additional sample $\big(X_j^{\diamond}\big)_{j=1,\dots,n}$ independent of $D_n$, the conditional expectation $\mathbb{E}[\psi_\alpha(Y,\theta) \mid X_i = x_i]$ is estimated as follows:

$$\mathbb{E}[\psi_\alpha(Y,\theta) \mid X_i = x_i] = \mathbb{E}\big[\mathbb{E}[\psi_\alpha(Y,\theta) \mid X_1, \dots, X_{i-1}, x_i, X_{i+1}, \dots, X_d]\big] \approx \frac{1}{n} \sum_{\ell=1}^{n} \sum_{j=1}^{n} w_{n,j}^b\Big(\big(X_{\ell,1}^{\diamond}, \dots, X_{\ell,i-1}^{\diamond}, x_i, X_{\ell,i+1}^{\diamond}, \dots, X_{\ell,d}^{\diamond}\big)\Big) \, \psi_\alpha\big(Y_j, \theta\big) = \sum_{j=1}^{n} w_{n,j}^{b,i}(x_i) \, \psi_\alpha\big(Y_j, \theta\big),$$

where the suitable weights $w_{n,j}^{b,i}(x_i)$ are defined by

$$w_{n,j}^{b,i}(x_i) = \frac{1}{n} \sum_{\ell=1}^{n} w_{n,j}^b\Big(\big(X_{\ell,-i}^{\diamond}, x_i\big)\Big). \qquad (4.1)$$

The notation $X_{-i}$ indicates the set of all variables except $X_i$, and we note that the conditional expectation given $X_i = x_i$ is recovered by averaging over the components $X_{-i}$; the first equality above holds because the inputs are independent, which is very convenient. Otherwise, it would be necessary to know the dependence structure in order to generate the observations $\big(X_{\ell,-i}^{\diamond}\big)_{\ell=1,\dots,n}$ for each new point $X_i = x_i$, which would make this estimator very cumbersome.

In addition to being used to recover the conditional expectation given $X_i = x_i$, the sample $\big(X_j^{\diamond}\big)_{j=1,\dots,n}$ is also used to estimate the outer expectation, and we finally obtain the following estimator of the $O$ term:

$$\widehat{Q}_i^{3,b} = \frac{1}{n} \sum_{m=1}^{n} \min_{p=1,\dots,n} \sum_{j=1}^{n} w_{n,j}^{b,i}(X_{m,i}^{\diamond}) \, \psi_\alpha\big(Y_j, Y_p\big).$$

By using the weights $w_{n,j}^o(x)$ instead of $w_{n,j}^b(x)$, we may define the estimator $\widehat{Q}_i^{3,o}$.
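Below is a sketch of the projected weights of Equation (4.1), in an original-sample version (the bootstrap multiplicities $B_j$ are not exposed by scikit-learn): the forest is grown on all the inputs, and the conditional law given $X_i = x_i$ is recovered by averaging the weights over the $\diamond$-sample of the remaining coordinates. Names and sizes are ours:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def projected_weights(forest, X_train, X_diamond, i, x_i):
    """w^{o,i}_{n,j}(x_i): average the full-forest weights over hybrid points
    whose i-th coordinate is forced to x_i (original-sample analogue of (4.1))."""
    leaves_train = forest.apply(X_train)              # (n, k)
    hybrid = X_diamond.copy()
    hybrid[:, i] = x_i                                # (X^diamond_{-i}, x_i)
    leaves_hyb = forest.apply(hybrid)                 # (n, k)
    n, k = leaves_train.shape
    w = np.zeros(n)
    for row in leaves_hyb:                            # average over the diamond sample
        for ell in range(k):
            in_leaf = leaves_train[:, ell] == row[ell]
            w[in_leaf] += 1.0 / (k * in_leaf.sum())
    return w / len(leaves_hyb)

rng = np.random.default_rng(3)
n, d = 500, 2
X = rng.exponential(size=(n, d)); y = X[:, 0] - X[:, 1]
Xs = rng.exponential(size=(n, d))                     # independent diamond sample
forest = RandomForestRegressor(n_estimators=20, min_samples_leaf=2,
                               random_state=0).fit(X, y)  # almost fully grown trees
w = projected_weights(forest, X, Xs, i=0, x_i=1.0)
print(w.sum())  # sums to 1; sum_j w_j*psi_alpha(y_j, theta) estimates E[psi | X_1 = 1]
```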
5 Complete estimation procedure of the first-order QOSA index

After defining the respective estimators of each term of the first-order QOSA index in Sections 2 and 4, the overall estimators are set in the following. In order to improve their accuracy, different strategies are also presented to tune the hyperparameters of the random forests.
5.1 Influence of the min_samples_leaf hyperparameter

When using a random forest method for a regression task, a prediction is generally obtained with the default values proposed in the packages for the max_features and min_samples_leaf hyperparameters. There are some empirical studies on the impact of these hyperparameters, such as Díaz-Uriarte and De Andres (2006); Scornet (2017); Duroux and Scornet (2018), but no theoretical guarantee to support the default values.

Concerning the estimation methods of the $O$ term of the QOSA index proposed in Section 4, except for $\widehat{Q}_i^{3,b}$ and $\widehat{Q}_i^{3,o}$, it turns out that the values of the hyperparameters must be chosen carefully. First of all, as a forest explaining $Y$ by $X_i$ is built for each model input, the max_features hyperparameter has no impact in our procedures because it equals 1. Regarding the min_samples_leaf hyperparameter, its impact on the quality of the estimators is investigated through the following toy example:

$$Y = X_1 - X_2, \qquad (5.1)$$

with $X_1, X_2 \sim \mathcal{E}(1)$. This standard example is commonly used in the sensitivity analysis literature to assess the quality of QOSA index estimators, such as in Fort et al. (2016); Browne et al. (2017); Maume-Deschamps and Niang (2018).

To illustrate the influence of this hyperparameter, we present in Figure 1 the boxplots of $\widehat{R}_1^{1,o}$ made with 100 values for different leaf sizes. For each value of min_samples_leaf, an estimation $\widehat{R}_1^{1,o}$ is computed using two samples of size $n = 10^4$ and a forest grown with $n_{\text{trees}} = 500$. Then, the boxplots are compared with the analytical value given below and represented by the dotted orange line on each graph of Figure 1:

$$\mathbb{E}[\psi_\alpha(Y, q^\alpha(Y \mid X_1))] = e^{-q^{1-\alpha}(X_2)}\big(q^{1-\alpha}(X_2) + 1\big) - \alpha = -\alpha \ln(\alpha).$$

Figure 1: For several levels α (0.10, 0.30, 0.70, 0.90): distribution of $\widehat{R}_1^{1,o}$, the estimation of the O term associated with the variable $X_1$, for different leaf sizes. The dotted orange line represents the true value on each plot.

Based on the results in Figure 1, we see that, for each level $\alpha$, the performance of $\widehat{R}_1^{1,o}$ depends highly on the choice of the min_samples_leaf hyperparameter. Indeed, within the proposed grid of min_samples_leaf values (5, 8, 15, 27, 47, 83, 147, 258, 455, 800), the optimal value seems to be 258 for $\alpha = 0.1$, 83 for $\alpha = 0.3$ and 47 for $\alpha = 0.7$. This concern about the leaf size is illustrated here with $\widehat{R}_1^{1,o}$ only, but it is also encountered for the other methods of Subsection 4.1 computing the conditional quantile with either the bootstrap samples or the original sample.

By using the same setting as in Figure 1, the distribution of $\widehat{Q}_1^{1,o}$ is presented in Figure 2 in order to assess the impact of the min_samples_leaf hyperparameter for a method where the minimum is estimated instead of plugging in the quantile. The quality of $\widehat{Q}_1^{1,o}$ also seems to depend on the leaf size, and the optimal value, allowing to estimate $\mathbb{E}\big[\min_{\theta \in \mathbb{R}} \mathbb{E}[\psi_\alpha(Y,\theta) \mid X_1]\big]$ well for each level $\alpha$, is the same as in Figure 1. As before, this concern about the leaf size is illustrated with $\widehat{Q}_1^{1,o}$ only, but it is also encountered for the other methods, detailed in Subsections 4.2.1 and 4.2.2, approximating the minimum with either the bootstrap samples or the original sample.

Figure 2: For several levels α: distribution of $\widehat{Q}_1^{1,o}$, the estimation of the O term associated with the variable $X_1$, for different leaf sizes. The dotted orange line represents the true value on each plot.

For the methods $\widehat{Q}_i^{3,b}$ and $\widehat{Q}_i^{3,o}$, based on complete trees, the tuning of the leaf size seems less important, as observed in Figure 3. Indeed, whatever the level $\alpha$, the best results are observed for almost fully developed trees (leaf-size grid 2, 3, 6, 10, 18, 32, 56, 98, 171, 300).

Figure 3: For several levels α: distribution of $\widehat{Q}_1^{3,o}$, the estimation of the O term associated with the variable $X_1$, for different leaf sizes. The dotted orange line represents the true value on each plot.

Thus, for all the other estimators of the $O$ term proposed in Section 4, a method giving the optimal value of the leaf size for each level $\alpha$ is required to properly estimate the first-order QOSA index.

5.2 Tuning the leaf size

In order to tune the leaf size of our estimators, two methods are presented in this part; they lead to significantly improved estimations. The first one rests on a classical cross-validation procedure, and the second one uses the Out-Of-Bag samples.
5.2.1 K-fold cross-validation

The estimators of the $O$ term developed in Subsection 4.1 are part of the conditional quantile estimation problem. Indeed, in a regression scheme, the conditional mean minimizes the expected squared error loss, while the conditional quantile $q^\alpha(Y \mid X_i)$ minimizes the expected pinball loss:

$$q^\alpha(Y \mid X_i) = \arg\min_{h : \mathbb{R} \to \mathbb{R}} \mathbb{E}[\psi_\alpha(Y, h(X_i))].$$

Hence, the estimators of $\mathbb{E}[\psi_\alpha(Y, q^\alpha(Y \mid X_i))]$ established in Subsection 4.1 allow to assess the quality of the approximation of the true conditional quantile function: the smaller they are, the better the estimate of the conditional quantile function is. This is verified in Figure 1 and explains the convex shape as a function of the leaf size. As a matter of fact, when the value of the min_samples_leaf hyperparameter is incorrectly chosen, the approximation of the true conditional quantile function is wrong, and so is that of $\mathbb{E}[\psi_\alpha(Y, q^\alpha(Y \mid X_i))]$.

Hence, in order to estimate well the conditional quantile function $q^\alpha(Y \mid X_i)$, and therefore $\mathbb{E}[\psi_\alpha(Y, q^\alpha(Y \mid X_i))]$ (which is our goal), the optimal value of the leaf size is chosen, within a predefined grid of potential values, as the one minimizing the empirical generalization error computed with a $K$-fold cross-validation procedure. A detailed description of this process is given in Algorithm 1 with $\widehat{R}_i^{1,o}$ as an example; the principle is the same for all the estimators defined in Subsection 4.1.
Algorithm 1: K-fold cross-validation procedure, illustrated with $\widehat{R}_i^{1,o}$

Input:
• Datasets: $D_n^{\diamond i} = \big(X_{j,i}^{\diamond}, Y_j^{\diamond}\big)_{j=1,\dots,n}$ from $D_n^{\diamond}$ and $D_n^i = \big(X_{j,i}, Y_j\big)_{j=1,\dots,n}$ from $D_n$
• Number of trees: $k \in \mathbb{N}^\star$
• Level at which to estimate $\mathbb{E}[\psi_\alpha(Y, q^\alpha(Y \mid X_i))]$: $\alpha \in \,]0,1[$
• Grid of potential leaf sizes: grid_min_samples_leaf
• Number of folds: $K \in \{2, \dots, n\}$

Output:
Estimated value of $\mathbb{E}[\psi_\alpha(Y, q^\alpha(Y \mid X_i))]$ at the level $\alpha$ with $\widehat{R}_i^{1,o}$

begin cross-validation procedure
  Randomly split the dataset $D_n^i$ into $K$ folds.
  foreach $\ell \in$ grid_min_samples_leaf do
    foreach fold do
      Take the current fold as a test set and the remaining folds as a training set.
      Fit a random forest model on the training set with $\ell$ as min_samples_leaf hyperparameter.
      Evaluate the conditional quantiles at the observations $X_i$ of the test set and compute $\widehat{R}_i^{1,o}$ on the test set. Retain the estimation obtained.
    end
    Summarize the quality related to the current $\ell$ by averaging the $K$ estimated values and save the mean.
  end
end
Select as optimal value $\ell_{opt}$ of the min_samples_leaf hyperparameter the one with the smallest mean.
Fit a random forest model on the complete dataset $D_n^i$ with the min_samples_leaf hyperparameter fixed to $\ell_{opt}$.
Compute $\widehat{R}_i^{1,o}$ with $D_n^{\diamond i}$.

It has to be noted that the number of folds $K$ should be chosen carefully. Indeed, a low value of $K$ results in a more biased estimation of the generalization error, which is undesirable. In contrast, a larger value of $K$ is less biased, but can suffer from large variability. The usual choice of $K$ is 5 or 10, but there is no formal rule.

5.2.2 Out-Of-Bag based selection

The estimators detailed in Subsection 4.1 deserve special attention: a less cumbersome approach than cross-validation can be used to tune their leaf size. It is based on an adaptation to our context of the widespread "Out-Of-Bag" (OOB) error (Breiman, 1996), used in regression and classification to estimate the generalization error.

We first adapt the calculation of the OOB error to conditional quantiles estimated with the local averaging estimate of the C_CDF proposed in Subsection 4.1.1. For this purpose, we start by defining the OOB quantile error for $\widehat{R}_i^{1,b}$. Let us fix an observation $(X_{m,i}, Y_m)$ of $D_n^i$ and consider $I_m$, the set of trees built with bootstrap samples not containing this observation, i.e. for which this observation is "Out-Of-Bag". The conditional quantile given $X_i = X_{m,i}$ is estimated through $F_{k,n}^b(y \mid X_i = x_i) = \sum_{j=1}^{n} w_{n,j}^b(x_i) \, \mathbb{1}_{\{Y_j \leqslant y\}}$, where the weights are tailored to our context as follows:

$$w_{n,j}^b\big(x_i; \Theta_1, \dots, \Theta_{|I_m|}, D_n^i\big) = \frac{1}{|I_m|} \sum_{\ell \in I_m} \frac{B_j(\Theta_\ell, D_n^i) \, \mathbb{1}_{\{X_{j,i} \in A_n(x_i; \Theta_\ell, D_n^i)\}}}{N_n^b(x_i; \Theta_\ell, D_n^i)}, \qquad j = 1, \dots, n.$$

Then, $q^\alpha(Y \mid X_i = X_{m,i})$ is estimated by plugging in $F_{k,n}^b(\cdot \mid X_i = X_{m,i})$ instead of $F(\cdot \mid X_i = X_{m,i})$:

$$\widehat{q}_{oob}^{\,b,\alpha}(Y \mid X_i = X_{m,i}) = \inf_{p=1,\dots,n} \Big\{ Y_p : F_{k,n}^b(Y_p \mid X_i = X_{m,i}) \geqslant \alpha \Big\}.$$

After this operation is carried out for all the data in $D_n^i$, we calculate the error related to the approximation of the true conditional quantile function, i.e. the empirical generalization error:

$$\widehat{OOB}_i^b = \frac{1}{n} \sum_{m=1}^{n} \psi_\alpha\Big(Y_m, \widehat{q}_{oob}^{\,b,\alpha}(Y \mid X_i = X_{m,i})\Big).$$

We may use the original sample (rather than the bootstrap ones) in the definition of the weights:

$$w_{n,j}^o\big(x_i; \Theta_1, \dots, \Theta_{|I_m|}, D_n^i\big) = \frac{1}{|I_m|} \sum_{\ell \in I_m} \frac{\mathbb{1}_{\{X_{j,i} \in A_n(x_i; \Theta_\ell, D_n^i)\}}}{N_n^o(x_i; \Theta_\ell, D_n^i)}, \qquad j = 1, \dots, n,\ j \neq m.$$

This leads to define $F_{k,n}^o$ as follows:

$$F_{k,n}^o(y \mid X_i = x_i) = \sum_{\substack{j=1 \\ j \neq m}}^{n} w_{n,j}^o(x_i) \, \mathbb{1}_{\{Y_j \leqslant y\}}.$$

The estimations of the conditional quantile $\widehat{q}_{oob}^{\,o,\alpha}$ and of the OOB quantile error $\widehat{OOB}_i^o$ follow.
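Below is a sketch of the OOB selection of min_samples_leaf for the C_CDF quantile of Subsection 4.1.1: the bootstrap samples are drawn by hand so that each observation knows the trees for which it is Out-Of-Bag, and the original-sample weights with $j \neq m$ are used, as above. The grid, sizes and seeds are ours:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def pinball_loss(y, theta, alpha):
    return (y - theta) * (alpha - (y <= theta))

def oob_quantile_error(x, y, alpha, k, min_samples_leaf, rng):
    """OOB quantile error: each observation m is predicted only by the trees
    whose bootstrap sample does not contain it (original-sample weights, j != m)."""
    n = len(y)
    order = np.argsort(y)
    x, y = x[order], y[order]                       # sort once so cumsum gives the CDF
    leaves = np.empty((n, k), dtype=np.int64)
    in_boot = np.zeros((n, k), dtype=bool)
    for ell in range(k):
        idx = rng.integers(0, n, size=n)            # bootstrap sample of tree ell
        in_boot[idx, ell] = True
        tree = DecisionTreeRegressor(min_samples_leaf=min_samples_leaf,
                                     random_state=0).fit(x[idx], y[idx])
        leaves[:, ell] = tree.apply(x)              # leaf of every original point
    err = 0.0
    for m in range(n):
        w = np.zeros(n)
        oob_trees = np.flatnonzero(~in_boot[m])
        for ell in oob_trees:
            in_leaf = leaves[:, ell] == leaves[m, ell]
            in_leaf[m] = False                      # exclude the observation itself
            if in_leaf.any():
                w[in_leaf] += 1.0 / (len(oob_trees) * in_leaf.sum())
        q = y[min(np.searchsorted(np.cumsum(w), alpha), n - 1)]
        err += pinball_loss(y[m], q, alpha)
    return err / n

rng = np.random.default_rng(4)
n, alpha = 500, 0.7
x1, x2 = rng.exponential(size=(2, n)); y = x1 - x2
grid = [5, 10, 20, 40, 80]                          # our illustrative grid
errors = [oob_quantile_error(x1.reshape(-1, 1), y, alpha, k=50,
                             min_samples_leaf=m, rng=np.random.default_rng(0))
          for m in grid]
print(grid[int(np.argmin(errors))])                 # selected leaf size
```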
Secondly, we adapt the calculation of the OOB error to conditional quantiles estimated directly in the tree leaves, as introduced in Subsection 4.1.2. In that sense, let us define the OOB quantile error for $\widehat{R}_i^{2,b}$. Fix an observation $(X_{m,i}, Y_m)$ of $D_n^i$ and consider the set of trees built with bootstrap samples not containing this observation. We aggregate only the predictions of these trees to make our prediction $\widehat{q}_{oob}^{\,b,\alpha}(Y \mid X_i = X_{m,i})$ of $q^\alpha(Y \mid X_i = X_{m,i})$. After this operation is carried out for all the data in $D_n^i$, we calculate the error related to the approximation of the true conditional quantile function, i.e. the empirical generalization error:

$$\widehat{OOB}_i^b = \frac{1}{n} \sum_{m=1}^{n} \psi_\alpha\Big(Y_m, \widehat{q}_{oob}^{\,b,\alpha}(Y \mid X_i = X_{m,i})\Big).$$

Again, using the original sample instead of the bootstrap ones leads to define $\widehat{OOB}_i^o$.
The advantage of these methods, compared to cross-validation techniques, is that they do not require cutting out the training sample $D_n^i$ and they take place during the forest construction process. Thus, given the dataset $D_n^i$ and a grid of potential values of the min_samples_leaf hyperparameter, a random forest is built for each value and the associated OOB quantile error is computed. Then, the optimal hyperparameter is chosen as the one with the smallest OOB error.

5.3 QOSA index estimators

We now have all the components needed to set the estimators of the first-order QOSA index $S_i^\alpha$. They are separated into two classes according to the estimation method adopted for the $O$ term. First of all, with the methods plugging in the quantile, we define

$$\widehat{S}_i^\alpha = 1 - \frac{\widehat{R}_i}{\widehat{P}} \quad\text{with}\quad \widehat{R}_i \in \Big\{\widehat{R}_i^{1,b}, \widehat{R}_i^{1,o}, \widehat{R}_i^{2,b}, \widehat{R}_i^{2,o}\Big\}.$$

The whole procedure integrating the cross-validation process for these methods is detailed in Algorithm 2 (see Appendix A.1). On the other hand, regarding the methods based on the minimum to compute the $O$ term, we set

$$\widehat{S}_i^\alpha = 1 - \frac{\widehat{Q}_i}{\widehat{P}} \quad\text{with}\quad \widehat{Q}_i \in \Big\{\widehat{Q}_i^{1,b}, \widehat{Q}_i^{1,o}, \widehat{Q}_i^{2,b}, \widehat{Q}_i^{2,o}, \widehat{Q}_i^{3,b}, \widehat{Q}_i^{3,o}\Big\}.$$

The estimation process based on the minimum is formalized in Algorithms 3, 4 and 5; for the sake of clarity, they are all gathered in Appendix A.1. Algorithm 3 (resp. 5), estimating the QOSA index with $\widehat{Q}_i^{1,b}$ or $\widehat{Q}_i^{1,o}$ (resp. $\widehat{Q}_i^{3,b}$ or $\widehat{Q}_i^{3,o}$), needs a full training sample $D_n$ as well as a partial one, $\big(X_j^{\diamond}\big)_{j=1,\dots,n}$, while estimating the QOSA index with $\widehat{Q}_i^{2,b}$ or $\widehat{Q}_i^{2,o}$ only requires one training sample $D_n$. This is a major advantage over the methods plugging in the quantile, which need two full training samples.

So far, no consistency result has been proved for $\widehat{S}_i^\alpha$. These various estimators are reviewed in the next section in order to establish their efficiency in practice. Moreover, all these algorithms are implemented within a Python package named qosa-indices, available at Elie-Dit-Cosaque (2020); it can also be freely downloaded from the PyPI website.

6 Numerical illustrations
Let us now carry out some simulations in order to investigate the influence of the hyperparameter optimization algorithm, study the impact of the number of trees on our estimators, and compare the decrease of the estimation error of each one as a function of the training sample size. From these results, the performance of the two best estimators, as well as of those based on kernel methods defined in Browne et al. (2017); Maume-Deschamps and Niang (2018), is assessed. Then, their scalability is tested on a toy example.
6.1 Influence of the hyperparameter optimization algorithm

We start by studying the influence of the hyperparameter optimization algorithm on the performance of our estimators plugging in the conditional quantile (i.e. using $\widehat{R}_i^{1,b}$, $\widehat{R}_i^{1,o}$, $\widehat{R}_i^{2,b}$ or $\widehat{R}_i^{2,o}$). This survey is carried out with the model introduced in Equation (5.1) and the following setting. The estimators of the QOSA index are computed with samples of size $n = 10^4$. The leaf size is tuned for each estimator over a grid of 20 numbers evenly spaced from 5 to 300, by using either the strategy based on the OOB quantile error developed in Subsection 5.2.2 or a 3-fold cross-validation procedure. Then, to assess the efficiency of each method (CV vs. OOB), the experiment is repeated $s = 100$ times and the following metrics are computed:

$$RMSE_i^\alpha = \sqrt{\frac{1}{s} \sum_{j=1}^{s} \Big(\widehat{S}_i^{\alpha,j} - S_i^\alpha\Big)^2}, \quad Bias_i^\alpha = \bigg|\frac{1}{s} \sum_{j=1}^{s} \widehat{S}_i^{\alpha,j} - S_i^\alpha\bigg|, \quad Variance_i^\alpha = \frac{1}{s} \sum_{j=1}^{s} \bigg(\widehat{S}_i^{\alpha,j} - \frac{1}{s} \sum_{j'=1}^{s} \widehat{S}_i^{\alpha,j'}\bigg)^2, \qquad (6.1)$$

with $S_i^\alpha$ the analytical values provided in Fort et al. (2016).
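For completeness, the three metrics of Equation (6.1) take a couple of lines of numpy; note that $RMSE^2 = Bias^2 + Variance$, which is useful when reading the figures below (a minimal sketch, with names and toy values of ours):

```python
import numpy as np

def replication_metrics(estimates, true_value):
    """RMSE, bias and variance of Equation (6.1) over s repeated estimations."""
    e = np.asarray(estimates, dtype=float)
    rmse = np.sqrt(np.mean((e - true_value) ** 2))
    bias = abs(e.mean() - true_value)
    return rmse, bias, e.var()

print(replication_metrics([0.28, 0.31, 0.30], true_value=0.30))  # toy replications
```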
In Figure 4, for three levels $\alpha$, we present the evolution of the different metrics related to the variable $X_1$ of our toy example as a function of the number of trees, ranging from 1 to 200 (in log scale). More precisely, the sub-figures at the top of Figure 4 show the Root Mean Square Error (RMSE), those in the middle the bias, and those at the bottom the variance. We observe that, regardless of the level $\alpha$ and the number of trees, our estimators plugging in the quantile have globally the same performance when calculated with either the OOB strategy or the cross-validation procedure. But the run time is shorter when using the OOB strategy rather than the cross-validation procedure.

Figure 4: Evolution of the RMSE, bias and variance of the estimators associated with $X_1$, calculated with either the OOB strategy or the cross-validation procedure, as a function of the number of trees for three levels α (0.10, 0.50, 0.99).

6.2 Impact of the number of trees

We analyze in this part the impact of the number of trees on the performance of all our estimators, except for those using $\widehat{Q}_i^{3,b}$ and $\widehat{Q}_i^{3,o}$ because of their computational cost. This survey is also carried out with the model introduced in Equation (5.1) and the following setting. The estimators of the QOSA index are computed with samples of size $n = 10^4$. The leaf size is tuned over a grid of 20 numbers evenly spaced from 5 to 300, by using a 3-fold cross-validation procedure for two of the quantile-plugging estimators, while the strategy based on the OOB samples, developed in Subsection 5.2.2, is used for the two others. Regarding the minimum-based estimators, the optimal leaf size is obtained via $\widehat{R}_i^{1,o}$ during the 3-fold cross-validation process. Then, the efficiency of our estimators is assessed with the metrics introduced in Equation (6.1), the experiment being repeated $s = 200$ times.

In Figure 5, for three levels $\alpha$, we present the evolution of the different metrics related to the variable $X_1$ of our toy example as a function of the number of trees, ranging from 1 to 200 (in log scale). More precisely, the sub-figures at the top of Figure 5 show the RMSE, those in the middle the bias, and those at the bottom the variance.

Figure 5: Evolution of the RMSE, bias and variance of the estimators associated with $X_1$ as a function of the number of trees for three levels α.

We observe that, regardless of the level $\alpha$, the RMSE of our estimators is small. The number of trees seems to have no impact for those using $\widehat{Q}_i^{1,o}$ and $\widehat{Q}_i^{2,o}$, as their RMSE is almost constant. The RMSE of the others decreases with the number of trees until it reaches a threshold, starting at about 50 trees. Indeed, it is well known that, beyond a certain number, increasing the number of trees becomes useless and only results in higher computational costs. However, we did not expect to reach a stable estimation error with so few trees.

Besides, still from the RMSE curves, it first appears that the estimators using the original sample (plain lines) have a lower error than those using the bootstrap samples (dotted lines). On the other hand, the performance of the minimum-based estimators (green and red lines) seems better than that of the quantile-based ones (blue and orange lines). This might be explained by the additional error due to the estimation of the conditional quantile. The variance of all the estimators is close to 0, and the bias curves have the same behavior as the RMSE curves, which means that the bias is the main source of error in the RMSE. This bias could be reduced by taking a larger grid in which to look for the optimal leaf size during the cross-validation, or by using another, more efficient method to find the optimum.

Let us now compare the decrease of the estimation error as a function of the training sample size. As observed in Figure 5, a very large number of trees is not required to reach a stable estimation error. Thus, we take $n_{\text{trees}} = 100$, keep the same setting as before for the other parameters, and observe the evolution of the metrics introduced in Equation (6.1) as a function of the sample size. Figure 6 presents the RMSE, bias and variance of our estimators for different sample sizes.
We observe that all the metrics associated with the various estimators converge to 0, at different rates. Indeed, the convergence rates of the metrics of the quantile-based estimators are slower than those of the minimum-based ones.

Figure 6: Evolution of the RMSE, bias and variance of the estimators associated with $X_1$ as a function of the training sample size for three levels α.

Hence, from our experiments, it turns out that the minimum-based estimators give the best results. This is an interesting feature because they need less data than those plugging in the quantile. Furthermore, few trees are necessary in order to reach a low estimation error. It therefore allows to get a good estimation of the indices at a reasonable computational cost.

6.3 Comparison with the kernel-based estimators

In this subsection, we compare on the toy example introduced in Equation (5.1):
• the kernel-based estimators proposed in Browne et al. (2017); Maume-Deschamps and Niang (2018), denoted by $\check{S}_i^\alpha$ and $\widetilde{S}_i^\alpha$;
• the minimum-based QOSA index estimators building one forest for each input and using the original sample;
• the minimum-based QOSA index estimators using a forest grown with fully developed trees.

The estimators of the QOSA indices are computed with samples of size $n = 10^4$. The forests are grown with $n_{\text{trees}} = 100$. The optimal leaf size for the minimum-based estimators building one forest for each input is obtained with $\widehat{R}_i^{1,o}$ during the 3-fold cross-validation process, over a grid of 20 numbers evenly spaced from 5 to 300. Regarding the minimum-based estimators using a forest grown with fully developed trees, the min_samples_leaf hyperparameter equals 2. In order to have comparable methods, a cross-validation procedure is also implemented for the kernel-based estimators to choose the optimal bandwidth parameter: it is selected within a grid of 20 potential values ranging from 0.001 to 1. Then, we assess the performance of the different estimators by computing their empirical root mean squared error over 100 experiments.
           | $\widehat{S}_i^\alpha$ with $\widehat{Q}_i^{1,o}$ | $\widehat{S}_i^\alpha$ with $\widehat{Q}_i^{2,o}$ | $\widehat{S}_i^\alpha$ with $\widehat{Q}_i^{3,b}$ | $\widehat{S}_i^\alpha$ with $\widehat{Q}_i^{3,o}$ | $\check{S}_i^\alpha$ | $\widetilde{S}_i^\alpha$
           | X1     X2       | X1     X2       | X1     X2       | X1     X2       | X1     X2     | X1     X2
α = 0.25   | 0.008  0.006    | 0.009  0.006    | 0.013  0.007    | 0.013  0.007    | 0.013  0.036  | 0.042  0.012
α = 0.75   | 0.008  0.007    | 0.008  0.008    | 0.008  0.014    | 0.008  0.014    | 0.035  0.012  | 0.014  0.042
α = 0.99   | 0.006  0.016    | 0.006  0.018    | 0.006  0.032    | 0.006  0.032    | 0.084  0.071  | 0.013  0.11
run time   | 1 hr 18 min     | 24 sec          | 10 hr 41 min    | 8 hr 18 min     | 1 hr 55 min   | 1 min 51 sec

Table 1: RMSE and run time, on the toy example, of the random forest based estimators $\widehat{S}_i^\alpha$ computed with $\widehat{Q}_i^{1,o}$, $\widehat{Q}_i^{2,o}$, $\widehat{Q}_i^{3,b}$ and $\widehat{Q}_i^{3,o}$, as well as of the kernel-based ones $\check{S}_i^\alpha$ and $\widetilde{S}_i^\alpha$.

Table 1 contains the empirical root mean square error of the different estimators associated with each input, as well as the overall run time needed to obtain them. Regarding their performance, the random forest based estimators appear better than the kernel methods. Nevertheless, as regards the methods using $\widehat{Q}_i^{3,b}$ and $\widehat{Q}_i^{3,o}$: while they have a low error and do not need any tuning of the leaf size, their run time with the current implementation is too long for practical use. Accordingly, we recommend computing the indices with $\widehat{Q}_i^{1,o}$ and $\widehat{Q}_i^{2,o}$ in order to get good estimations of the first-order QOSA indices in a reasonable time.

6.4 Influence of the model dimension

The influence of the model dimension $d$ on the performance of the estimators using $\widehat{Q}_i^{1,o}$ and $\widehat{Q}_i^{2,o}$ is investigated in this subsection with the following additive exponential framework:

$$Y = \sum_{i=1}^{d} X_i. \qquad (6.2)$$

The independent inputs $X_i$, $i = 1, \dots, d$, follow Exponential distributions $\mathcal{E}(\lambda_i)$ with distinct rates $\lambda_i$. The resulting output $Y$ follows a generalized Erlang distribution, also called Hypoexponential distribution. By taking advantage of the other expression of the first-order QOSA index given in Maume-Deschamps and Niang (2018), we obtain the following semi closed-form analytical formula:

$$S_i^\alpha = 1 - \frac{\alpha \, \mathbb{E}\big[X_{s(-i)}\big] - \mathbb{E}\Big[X_{s(-i)} \, \mathbb{1}_{\{X_{s(-i)} \leqslant q^\alpha(X_{s(-i)})\}}\Big]}{\alpha \, \mathbb{E}[Y] - \mathbb{E}\Big[Y \, \mathbb{1}_{\{Y \leqslant q^\alpha(Y)\}}\Big]}, \qquad (6.3)$$

with $X_{s(-i)} = \sum_{j \neq i} X_j$, which also follows a Hypoexponential distribution. Knowing the cumulative distribution function of the Hypoexponential distribution, the quantiles $q^\alpha(Y)$ and $q^\alpha\big(X_{s(-i)}\big)$ are computed by numeric inversion, and the analytical expression of the truncated expectations is derived from Marceau (2013).
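A hedged numerical sketch of Equation (6.3) follows, writing the Hypoexponential density as the signed mixture $f(y) = \sum_i w_i \lambda_i e^{-\lambda_i y}$ with $w_i = \prod_{j \neq i} \lambda_j / (\lambda_j - \lambda_i)$ (valid for distinct, well separated rates) and inverting the CDF numerically; the rates used below are illustrative, not those of the experiment:

```python
import numpy as np
from scipy.optimize import brentq

def hypoexp_weights(lams):
    """Signed-mixture weights: f(y) = sum_i w_i lam_i exp(-lam_i y) with
    w_i = prod_{j != i} lam_j / (lam_j - lam_i), for distinct rates."""
    lams = np.asarray(lams, dtype=float)
    w = np.empty_like(lams)
    for i in range(len(lams)):
        others = np.delete(lams, i)
        w[i] = np.prod(others / (others - lams[i]))
    return w

def hypoexp_cdf(y, lams):
    lams = np.asarray(lams, dtype=float)
    return float(np.sum(hypoexp_weights(lams) * (1.0 - np.exp(-lams * y))))

def truncated_mean(q, lams):
    """E[Z 1{Z <= q}], integrating each exponential term of the mixture."""
    lams = np.asarray(lams, dtype=float)
    w = hypoexp_weights(lams)
    return float(np.sum(w * (1.0 - np.exp(-lams * q) * (1.0 + lams * q)) / lams))

def min_expected_pinball(lams, alpha):
    """min_theta E[psi_alpha(Z, theta)] = alpha*E[Z] - E[Z 1{Z <= q^alpha(Z)}]."""
    q = brentq(lambda y: hypoexp_cdf(y, lams) - alpha, 1e-12, 1e6)
    return alpha * np.sum(1.0 / np.asarray(lams, dtype=float)) - truncated_mean(q, lams)

def qosa_additive_exponential(lams, i, alpha):
    """S_i^alpha of Equation (6.3) for Y = sum_j X_j with X_j ~ E(lam_j)."""
    lams = np.asarray(lams, dtype=float)
    return 1.0 - (min_expected_pinball(np.delete(lams, i), alpha)
                  / min_expected_pinball(lams, alpha))

lams = [0.4, 0.7, 1.0]   # illustrative distinct rates, not those of the paper
print([round(qosa_additive_exponential(lams, i, alpha=0.7), 3) for i in range(3)])
```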
25] andthen each one represents the λ i parameter of an input X i , i = 1 , . . . , d . QOSA index estimationsare then computed with samples of size n = 10 , a forest grown with n trees = 100 and the settingdefined hereafter. The leaf size is tuned with b R ,oi over a grid with 20 numbers evenly spaced rangingfrom 5 to 300 by using a 3-fold cross-validation. Each experiment is done 100 times in order tocompute the RMSE defined in Equation (6.1) for each input, and then we take the weighted meanby the analytical values of the QOSA indices over all dimensions in order to get a global measure. a v e r a g e d R M S E = 0.10 = 0.30 a v e r a g e d R M S E = 0.70 = 0.90 S i with Q oi S i with Q oi S i with Q oi S i with Q oi Figure 7: Evolution of the averaged RMSE over all dimensions of the estimators calculated with b Q ,oi and b Q ,oi in function of the model dimension for four levels α .Figure 7 presents the weighted RMSE as a function of the increasing dimension of our modelfor several levels α . For each one, we observe that the error increases slowly at the beginninguntil the dimension 6 for both methods then decreases. This phenomenon is due to the chosenparametrization. Indeed, when increasing the dimension of the model, the respective impact ofeach input is reduced. Thus, from a certain dimension, all the analytical values of the first-orderQOSA indices become small and even close to 0 for some inputs. Our estimators properly capturethis trend as they decrease by increasing the dimension. However, the estimator using b Q ,oi seemsbetter than this one employing b Q ,o as its error is lower.21 Practical case study
We propose to apply our methodology to a practical case study. It concerns the bias between the predictions of MOCAGE (Modèle de Chimie Atmosphérique à Grande Échelle, a large-scale atmospheric chemistry model) and the observed ozone concentration. This dataset has been proposed and studied in Besse et al. (2007), where the full data description may be found. It contains 10 variables with 1041 observations. O3obs, the observed ozone concentration, is explained by the 9 other variables described below:

JOUR: type of day (0 for holiday vs. 1 for non-holiday)
STATION: site of observation (5 different sites)
RMH2O: humidity ratio
NO2: nitrogen dioxide concentration
VentMOD: wind force
VentANG: wind direction
NO: nitric oxide concentration
TEMPE: officially predicted temperature
MOCAGE: ozone concentration predicted by a fluid mechanics model

Figure 8 below gives the QOSA estimations obtained with the $\widehat{Q}_i^{2,o}$ estimator since, from our numerical study, it is the quickest, requires only one sample, and is efficient. The inputs are ranked for different values of $\alpha$. On the right plot, the QOSA indices are given in percentage (normalised by the sum of the QOSA indices of all variables).
Figure 8: QOSA (resp. normalised QOSA) indices at different levels α on the left-hand (resp. right-hand) plot, for the inputs JOUR, MOCAGE, NO, NO2, RMH2O, STATION, TEMPE, VentANG and VentMOD.

In Besse et al. (2007); Broto et al. (2020), the central effects have been studied and lead to consider MOCAGE and TEMPE as the most influential variables, followed by STATION and NO2. Our study considers various quantile-level effects and shows that, for quantile levels greater than 0.6, the wind variables and RMH2O are also important, more than STATION.

8 Conclusion
In this paper, we introduced several estimators of the first-order QOSA index based on the random forest method. Some of them use the original sample while the others use the bootstrap samples generated during the forest construction. Both classes of estimators seem to be efficient, even if we observe in our experiments that the methods using the original sample have a lower estimation error than those based on the bootstrap ones; supplementary studies should be conducted to inquire into this difference. Furthermore, the performance of these methods is highly dependent on the leaf size. This parameter can be compared to the bandwidth parameter of kernel estimators, as it controls the bias of the method, but it turns out to be easier to calibrate, and we propose two methods to do so: $K$-fold cross-validation and an Out-Of-Bag samples based selection method.

It is also well known for random forest methods that the number of trees $k$ should be chosen large enough to reach the desired statistical precision and small enough to keep the calculations feasible, as the computational cost increases linearly with $k$, as mentioned in Scornet (2017). But we have seen on our toy example that the estimators proposed herein require few trees in order to reach a low estimation error. This makes it possible to estimate the indices correctly while maintaining a reasonable computation time.

Besides, in our experiments we obtain better results with our estimators than with the kernel methods. A major advantage is that we have developed an estimator that requires only one training sample, whereas kernel methods require two training samples, or a full one plus a partial one. This feature is interesting when dealing with costly models. Another significant asset of our estimators is that their efficiency seems maintained when increasing the model dimension.

Despite these benefits, the proof of the estimators' consistency, as well as the asymptotic analysis needed to establish convergence rates and confidence intervals, remains a major wish for the future. It is also important to remember that these indices do not have an analogue of the variance decomposition offered by the Sobol' indices through the theorem of Hoeffding (1948). Thus, using the values of Shapley (1953) could be interesting to get condensed and easy-to-interpret indices with a good apportionment of the interaction and dependence contributions of the inputs involved.

References

Besse, P., Milhem, H., Mestre, O., Dufour, A., and Peuch, V.-H. (2007). Comparaison de techniques de «data mining» pour l'adaptation statistique des prévisions d'ozone du modèle de chimie-transport MOCAGE.

Breiman, L. (1996). Out-of-bag estimation.

Breiman, L. (2001). Random forests. Machine Learning, 45(1):5–32.
Breiman, L., Friedman, J. H., Olshen, R. A., and Stone, C. J. (1984). Classification and Regression Trees.

Broto, B., Bachoc, F., and Depecker, M. (2020). Variance reduction for estimation of Shapley effects and adaptation to unknown input distribution. SIAM/ASA Journal on Uncertainty Quantification, 8(2):693–716.

Browne, T., Fort, J.-C., Iooss, B., and Le Gratiet, L. (2017). Estimate of quantile-oriented sensitivity indices. Technical Report, hal-01450891.

Díaz-Uriarte, R. and De Andres, S. A. (2006). Gene selection and classification of microarray data using random forest. BMC Bioinformatics, 7(1):3.

Duroux, R. and Scornet, E. (2018). Impact of subsampling and tree depth on random forests. ESAIM: Probability and Statistics, 22:96–128.

Elie-Dit-Cosaque, K. (2020). qosa-indices, a Python package available at: https://gitlab.com/qosa_index/qosa.

Elie-Dit-Cosaque, K. and Maume-Deschamps, V. (2020). Random forest estimation of conditional distribution functions and conditional quantiles. Preprint on HAL.

Elie-Dit-Cosaque, K. and Maume-Deschamps, V. (2021). On quantile oriented sensitivity analysis.

Fort, J.-C., Klein, T., and Rachdi, N. (2016). New sensitivity analysis subordinated to a contrast. Communications in Statistics - Theory and Methods, 45(15):4349–4364.

Hoeffding, W. (1948). A class of statistics with asymptotically normal distribution. Annals of Mathematical Statistics, 19(3):293–325.

Homma, T. and Saltelli, A. (1996). Importance measures in global sensitivity analysis of nonlinear models. Reliability Engineering & System Safety, 52(1):1–17.

Jansen, M. J., Rossing, W. A., and Daamen, R. A. (1994). Monte Carlo estimation of uncertainty contributions from several independent multivariate sources. In Predictability and Nonlinear Modelling in Natural Sciences and Economics, pages 334–343. Springer.

Kala, Z. (2019). Quantile-oriented global sensitivity analysis of design resistance. Journal of Civil Engineering and Management, 25(4):297–305.

Koenker, R. and Hallock, K. F. (2001). Quantile regression. Journal of Economic Perspectives, 15(4):143–156.

Kucherenko, S., Song, S., and Wang, L. (2019). Quantile based global sensitivity measures. Reliability Engineering & System Safety, 185:35–48.

Lin, Y. and Jeon, Y. (2006). Random forests and adaptive nearest neighbors. Journal of the American Statistical Association, 101(474):578–590.

Marceau, E. (2013). Modélisation et évaluation quantitative des risques en actuariat. Springer.

Maume-Deschamps, V. and Niang, I. (2018). Estimation of quantile oriented sensitivity indices. Statistics & Probability Letters, 134:122–127.

Meinshausen, N. (2006). Quantile regression forests. Journal of Machine Learning Research, 7(Jun):983–999.

Saltelli, A., Tarantola, S., Campolongo, F., and Ratto, M. (2004). Sensitivity Analysis in Practice: A Guide to Assessing Scientific Models. John Wiley & Sons.

Scornet, E. (2016). Random forests and kernel methods. IEEE Transactions on Information Theory, 62(3):1485–1500.

Scornet, E. (2017). Tuning parameters in random forests. ESAIM: Proceedings and Surveys, 60:144–162.

Shapley, L. S. (1953). A value for n-person games. Contributions to the Theory of Games, 2(28):307–317.

Sobol, I. M. (1993). Sensitivity estimates for nonlinear mathematical models. Mathematical Modelling and Computational Experiments, 1(4):407–414.
Appendix
A.1 Algorithms for estimating the first-order QOSA index
Algorithm 2: QOSA index estimators plugging in the quantile
Input:
• Datasets: $D_n^{\diamond} = \big(X_j^{\diamond}, Y_j^{\diamond}\big)_{j=1,\dots,n}$ and $D_n = \big(X_j, Y_j\big)_{j=1,\dots,n}$
• Number of trees: $k \in \mathbb{N}^\star$
• Level at which to estimate the QOSA index: $\alpha \in \,]0,1[$
• Grid of potential leaf sizes: grid_min_samples_leaf
• Number of folds: $K \in \{2, \dots, n\}$

Output:
Estimated value $\widehat{S}_i^\alpha$ of the QOSA index at the level $\alpha$ for all inputs.

Compute $\widehat{P}$ thanks to Equation (2.1).
foreach $i = 1, \dots, d$ do
  Extract $D_n^{\diamond i} = \big(X_{j,i}^{\diamond}, Y_j^{\diamond}\big)_{j=1,\dots,n}$ from $D_n^{\diamond}$ and $D_n^i = \big(X_{j,i}, Y_j\big)_{j=1,\dots,n}$ from $D_n$.
  Cross-validate as in Algorithm 1 with $D_n^i$ to get the optimal leaf size $\ell_{opt}$.
  Fit a random forest model with $D_n^i$, fixing the min_samples_leaf hyperparameter to $\ell_{opt}$.
  Compute the estimator $\widehat{R}_i$ with $D_n^{\diamond i}$.
  Compute $\widehat{S}_i^\alpha = 1 - \widehat{R}_i / \widehat{P}$.
end

Algorithm 3: QOSA index estimators with the weighted minimum approach
Input:
• Datasets: $D_n = \big(X_j, Y_j\big)_{j=1,\dots,n}$ and $\big(X_j^{\diamond}\big)_{j=1,\dots,n}$
• Number of trees: $k \in \mathbb{N}^\star$
• Level at which to estimate the QOSA index: $\alpha \in \,]0,1[$
• Grid of potential leaf sizes: grid_min_samples_leaf
• Number of folds: $K \in \{2, \dots, n\}$

Output:
Estimated value $\widehat{S}_i^\alpha$ of the QOSA index at the level $\alpha$ for all inputs.

Compute $\widehat{P}$ thanks to Equation (2.1).
foreach $i = 1, \dots, d$ do
  Extract $D_n^i = \big(X_{j,i}, Y_j\big)_{j=1,\dots,n}$ from $D_n$ and $\big(X_{j,i}^{\diamond}\big)_{j=1,\dots,n}$ from $\big(X_j^{\diamond}\big)_{j=1,\dots,n}$.
  Cross-validate as in Algorithm 1 with $D_n^i$ to get the optimal leaf size $\ell_{opt}$.
  Fit a random forest model with $D_n^i$, fixing the min_samples_leaf hyperparameter to $\ell_{opt}$.
  Compute the estimator $\widehat{Q}_i \in \big\{\widehat{Q}_i^{1,b}, \widehat{Q}_i^{1,o}\big\}$ with $D_n^i$ and $\big(X_{j,i}^{\diamond}\big)_{j=1,\dots,n}$.
  Compute $\widehat{S}_i^\alpha = 1 - \widehat{Q}_i / \widehat{P}$.
end

Algorithm 4: QOSA index estimators computing the minimum in the leaves
Input:
• Dataset: $D_n = \big(X_j, Y_j\big)_{j=1,\dots,n}$
• Number of trees: $k \in \mathbb{N}^\star$
• Level at which to estimate the QOSA index: $\alpha \in \,]0,1[$
• Grid of potential leaf sizes: grid_min_samples_leaf
• Number of folds: $K \in \{2, \dots, n\}$

Output:
Estimated value $\widehat{S}_i^\alpha$ of the QOSA index at the level $\alpha$ for all inputs.

Compute $\widehat{P}$ thanks to Equation (2.1).
foreach $i = 1, \dots, d$ do
  Extract $D_n^i = \big(X_{j,i}, Y_j\big)_{j=1,\dots,n}$ from $D_n$.
  Cross-validate as in Algorithm 1 with $D_n^i$ to get the optimal leaf size $\ell_{opt}$.
  Fit a random forest model with $D_n^i$, fixing the min_samples_leaf hyperparameter to $\ell_{opt}$.
  Compute the estimator $\widehat{Q}_i \in \big\{\widehat{Q}_i^{2,b}, \widehat{Q}_i^{2,o}\big\}$.
  Compute $\widehat{S}_i^\alpha = 1 - \widehat{Q}_i / \widehat{P}$.
end

Algorithm 5: QOSA index estimators with the weighted minimum and fully grown trees
Input:
• Datasets: $D_n = \big(X_j, Y_j\big)_{j=1,\dots,n}$ and $\big(X_j^{\diamond}\big)_{j=1,\dots,n}$
• Number of trees: $k \in \mathbb{N}^\star$
• Level at which to estimate the QOSA index: $\alpha \in \,]0,1[$
• Leaf size: min_samples_leaf $\in \{2, \dots, n\}$

Output:
Estimated value $\widehat{S}_i^\alpha$ of the QOSA index at the level $\alpha$ for all inputs.

Compute $\widehat{P}$ thanks to Equation (2.1).
Fit a random forest model with $D_n$ and the min_samples_leaf hyperparameter.
foreach $i = 1, \dots, d$ do
  Compute the estimator $\widehat{Q}_i \in \big\{\widehat{Q}_i^{3,b}, \widehat{Q}_i^{3,o}\big\}$ with $\big(X_j^{\diamond}\big)_{j=1,\dots,n}$.
  Compute $\widehat{S}_i^\alpha = 1 - \widehat{Q}_i / \widehat{P}$.
end