Knockoff Boosted Tree for Model-Free Variable Selection
A Preprint
Tao Jiang
North Carolina State University [email protected]
Yuanyuan Li
National Institute of Environmental Health Sciences
Alison A. Motsinger-Reif∗ National Institute of Environmental Health Sciences [email protected]
∗ To whom correspondence should be addressed.
February 24, 2020

Abstract
In this article, we propose a novel strategy for conducting variable selection without prior model topology knowledge using the knockoff method with boosted tree models. Our method is inspired by the original knockoff method, where the differences between original and knockoff variables are used for variable selection with false discovery rate control. The original method uses the Lasso for regression models and assumes there are more samples than variables. We extend this method to both model-free and high-dimensional variable selection. We propose two new sampling methods for generating knockoffs, namely the sparse covariance and principal component knockoff methods. We test these methods and compare them with the original knockoff method in terms of their ability to control type I errors and power. The boosted tree model is a complex system and has more hyperparameters than models with simpler assumptions. In our framework, these hyperparameters are either tuned through Bayesian optimization or fixed at multiple levels for trend detection. In simulation tests, we also compare the properties and performance of importance test statistics of tree models. The results include combinations of different knockoffs and importance test statistics. We also consider scenarios that include main-effect, interaction, exponential, and second-order models while assuming the true model structures are unknown. We apply our algorithm to tumor purity estimation and tumor classification using the Cancer Genome Atlas (TCGA) gene expression data. The proposed algorithm is included in the KOBT package, available at https://cran.r-project.org/web/packages/KOBT/index.html.

Keywords: Boosted tree model · Knockoff · Shapley additive explanation · Model-free variable selection
In nearly all existing variable selection methods, a set of predictor candidates must be specified before selection (Li et al., 2005). Predictors in the final selected model are a subset of the candidate set. However, it can be challenging to determine an appropriate set of predictor candidates, especially considering the order (e.g., quadratic, cubic) of main effects and interactions with a large number of variables. Li et al. (2005) stated that the difference between model selection and variable selection is whether a model is assumed for comparison. One of the advantages of model-free variable selection is that its performance is not limited by the choice of model formulation. Model-free methods such as Random Forest (Breiman, 2001), AdaBoost (Rätsch et al., 2001), and convolutional neural networks (LeCun et al., 1995) are widely used in research and industry for regression and classification purposes. These methods recognize data patterns without the need to impose specific model structures on regression functions.
This independence allows model-free methods more flexibility, thus making them highly appropriate in many cases. For example, Gao et al. (2018) reported that, compared with other methods, model-free methods provide more reliable clinical outcomes when forecasting falls of Parkinson's patients. Among the widely used model-free methods, tree-based models have demonstrated greater effectiveness and interpretability than most other learning algorithms. In a standard regression model, an interaction term is not estimated if it is not specified in the model. In contrast, tree-based methods automatically include interactions during tree growing, without requiring an a priori formula (Su et al., 2011). Further, tree-based models are more naturally associated with nonlinear interactions (Schiltz et al., 2018), and the branches in a tree model are original interaction variables. Friedman (2001) stated that the number of terminal nodes of a boosted tree model should depend on the highest order of the dominant interactions among the variables. This implies that (1) terminal (leaf) nodes in tree models cannot be interpreted simply as main effects of original variables and (2) growth of a tree model is selective. Variable selection in tree models is affected by the number of categories of features (Kim and Loh, 2001), with variables with more categories preferred. To eliminate bias in variable selection, Loh (2002) proposed the GUIDE algorithm, which conducts chi-square analysis of residuals and bootstrap calibration of significance probabilities. Gini importance in random forest methods (Breiman, 2001) is used to measure improvement in the splitting criterion produced by each corresponding variable. To generate an unbiased importance index, Breiman and Cutler (2008) proposed permutation importance based on statistical permutation. In addition to regression tree and random forest models, attention has also been paid to variable selection in boosted tree models, given their superior performance in data pattern recognition. Miller et al. (2016) extended the gradient boosted tree to multivariate tree boosting and performed nonparametric regression to identify important variables. Gain, weight, and cover are three common importance ranking measures in the highly successful XGBoost (Chen and Guestrin, 2016). SHapley Additive exPlanation (SHAP) (Lundberg and Lee, 2017b) has also been proposed to interpret model predictions, and its performance has been highly consistent when applied to tree models (Lundberg et al., 2018). While most data scientists are interested in model prediction accuracy, inference from analysis is also important. Originally described for Random Forest (Breiman, 2001), the permutation method has been widely used to generate a null distribution of test statistics to detect significance in tree models. Unfortunately, some statistical inference, such as accurate false discovery rate (FDR) control, is not possible with permutation-based methods. For example, in a linear regression problem, a permutation-based construction may underestimate the false discovery proportion (FDP) in cases where the design matrix displays nonvanishing correlations (Barber et al., 2015). Under a global null, p-values from permutation tests are exact. However, under an alternative hypothesis, rejecting the null based on permutation tests does not necessarily imply a valid rejection (Chung et al., 2013).
To control the FDR in variable selection, Barber et al. (2015) proposed a novel statistical framework, the knockoff framework, for low-dimensional (n ≥ p) fixed designs. The knockoff framework was later extended to high-dimensional (n < p) and random design scenarios (Candes et al., 2018). The method generates knockoff variables without the need to collect new data or consider response variables while retaining the correlation structure of the original variables. These knockoff variables are used as a negative control group, and a variable selection procedure is applied to both the original and knockoff variables. Test statistics measuring the importance of both kinds of variables are compared to compute symmetrized knockoff statistics. Finally, a threshold for these statistics is determined according to the desired target FDR level. The goal of using the knockoff method is to discover relevant variables while controlling the FDR. In this article, we propose the knockoff boosted tree (KOBT) algorithm, a model-free variable selection algorithm that combines boosted trees (Friedman, 2001; Chen and Guestrin, 2016) and model-X knockoffs (Candes et al., 2018). Model-X knockoff variables are generated with the same correlation structure as the original variables. Test statistics, such as tree SHAP (Lundberg et al., 2018), Cover, Frequency, Gain, and Saabas, are generated for each pair of original and knockoff variables, and the consistency and accuracy of the produced test statistics are compared. The FDR is controlled using the difference between original and knockoff test statistics. Further, regularization parameters in boosted tree models are tuned through Bayesian optimization (Snoek et al., 2012). In the following sections, we describe our analytical framework, including (1) the generation of knockoff variables; (2) fitting and Bayesian optimization in boosted tree regression; and (3) consistency, accuracy, and FDR control in model-free variable selection. We validate our approach with simulation experiments. Finally, we demonstrate our approach on tumor purity estimation and tumor classification using the Cancer Genome Atlas (TCGA) gene expression data.
Denote by y_{n×1} the n responses and by X_{n×p} the p independent variables, each with n values. In regression trees, y_{n×1} can be continuous or ordered discrete. The goal is to minimize an objective function, which may contain a loss function component and a regularization component, over all leaves (terminal nodes) of a tree. Given a tree with m terminal nodes (i.e., the whole sample contains m partitions), if we model the response as a constant in each region, the estimated regression tree function is defined as

\hat{y}_i = \hat{f}(x_i) = \sum_{k=1}^{m} I\{x_i \in R_k^p\} \beta_k,   (1)

where I(·) is an indicator function, I\{x_i \in R_k^p\} = 1 if the i-th row vector of X, x_i, falls in the k-th disjoint partition region R_k^p, and I\{x_i \in R_k^p\} = 0 otherwise. R_k^p is a split region, or a so-called leaf, k = 1, ..., m. With the squared L_2-norm loss function, we have the estimator

\hat{\beta} = \arg\min_{\beta} \sum_{i=1}^{n} \Big\{ y_i - \sum_{k=1}^{m} I\{x_i \in R_k^p\} \beta_k \Big\}^2,   (2)

and

\partial \sum_{i=1}^{n} \Big\{ y_i - \sum_{k=1}^{m} I\{x_i \in R_k^p\} \beta_k \Big\}^2 / \partial \beta_k = 0.   (3)

Solving Equation 3, the k-th estimated parameter of β is

\hat{\beta}_k = \frac{\sum_{i=1}^{n} y_i I_k\{x_i \in R_k^p\}}{\sum_{i=1}^{n} I_k\{x_i \in R_k^p\}},   (4)

where I_k\{\cdot\} denotes the indicator function for the fixed region R_k^p. This is the sample mean of y on that leaf of the regression tree. Meanwhile, to grow a tree model, the data need to be partitioned, or split. For the j-th variable x_j, we define a split on x_j at some value c as

R_{k,<}^p = \{i : x_{ij} < c \mid x_i \in R_k^p\}, \quad R_{k,>}^p = \{i : x_{ij} \ge c \mid x_i \in R_k^p\}.   (5)

Based on Equation 4, we have the two estimates

\hat{\beta}_{k,<} = \frac{\sum_{i=1}^{n} y_i I_k\{x_i \in R_{k,<}^p\}}{\sum_{i=1}^{n} I_k\{x_i \in R_{k,<}^p\}}, \quad \hat{\beta}_{k,>} = \frac{\sum_{i=1}^{n} y_i I_k\{x_i \in R_{k,>}^p\}}{\sum_{i=1}^{n} I_k\{x_i \in R_{k,>}^p\}}.   (6)

Thus, an optimal split is defined as

\hat{c} = \arg\min_{c} \Big\{ \sum_{i : x_i \in R_{k,<}^p} (y_i - \hat{\beta}_{k,<})^2 + \sum_{i : x_i \in R_{k,>}^p} (y_i - \hat{\beta}_{k,>})^2 \Big\}.   (7)

To avoid overfitting, similar to penalized regression, a penalty term is added to the cost-complexity criterion of a regression tree model,

\sum_{k=1}^{m} \| y_{i \in I_k} - \hat{\beta}_k \|^2 + \gamma |T|,   (8)

where T(X_{n×p}; Θ) = f(X_{n×p}) is a regression tree model, Θ = {R, β} contains the tree structure parameters, |T| is the cardinality of terminal nodes in T, and γ ≥ 0 is a tuning parameter.
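As a concrete illustration of the split search in Equations 4–7, the minimal R sketch below scans candidate cut points for a single variable, estimates the two leaf means on either side of each cut, and keeps the cut with the smallest residual sum of squares. The function name and toy data are illustrative assumptions, not part of the KOBT implementation.

```r
# Best single-variable split (Equations 4-7): leaf estimates are sample means
# of y on each side of the cut; the cut minimizes the within-leaf SSE.
best_split <- function(x, y) {
  cuts <- sort(unique(x))[-1]             # candidate cut points (x < c vs x >= c)
  sse <- sapply(cuts, function(c) {
    left <- x < c
    beta_left  <- mean(y[left])           # Equation 6, left leaf
    beta_right <- mean(y[!left])          # Equation 6, right leaf
    sum((y[left] - beta_left)^2) + sum((y[!left] - beta_right)^2)  # Equation 7
  })
  list(cut = cuts[which.min(sse)], sse = min(sse))
}

set.seed(1)
x <- rnorm(200)
y <- 1 * (x > 0.5) + rnorm(200, sd = 0.3)  # a single true split near 0.5
best_split(x, y)
```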
To overcome the bias and high-variance problems of a single regression tree, ensemble tree models, such as bagging or boosting structures, have been proposed. Using the notation of Section 2.1, a boosted tree model is defined as a sum of B trees,

f_B(X) = \sum_{b=1}^{B} T(X_{n×p}; Θ_b) = \sum_{b=1}^{B} f_b(X_{n×p}).   (9)

In a forward stagewise algorithm (Hastie et al., 2005), the B-th tree structure is found by solving

\hat{Θ}_B = \arg\min_{Θ_B} L(y, f_{B-1}(X) + T(X, Θ_B)),   (10)

where L(Θ_B; y, f_{B-1}) is a loss function based on the B-th tree structure, given the response y and the previously fitted B−1 tree models f_{B-1}. In boosted tree model fitting, the objective function usually contains a loss function and a penalty function for tree complexity (i.e., regularization for tree models). The complexity depends on the number of trees in a sequence, the depth of each tree, and the number of terminal nodes in each tree. One way to control the complexity of tree model fitting is to set a minimum number of instances in a single terminal node. For example, the min_child_weight parameter in XGBoost (Chen and Guestrin, 2016) controls the minimum sum of instance weights (Hessian) in a terminal node, which is equivalent to a minimum number of instances if all instances have a weight of 1. For simplicity, here we assume all weights are 1. Combining Equation 8 and Equation 10, the B-th objective function is defined as

Obj(f_B; y, f_{B-1}) = L(y, f_{B-1}(X) + T(X, Θ_B)) + \sum_{b=1}^{B} Ω(f_b) ∝ L(y, f_{B-1}(X) + T(X, Θ_B)) + Ω(f_B) = L(y, f_{B-1}(X) + T(X, Θ_B)) + γ |T(X, Θ_B)|,   (11)

where |T(X, Θ_B)| is the cardinality of terminal nodes in the B-th tree, and γ ≥ 0 is a tuning parameter. Similar to the elastic net (Zou and Hastie, 2005), XGBoost introduces L_1- and L_2-norm penalty terms into the objective function. Therefore, the objective function for the B-th tree can be updated as

Obj(f_B; y, f_{B-1}) ∝ L(y, f_{B-1}(X) + T(X, Θ_B)) + γ |T| + λ \|β_B\|_2^2 + α \|β_B\|_1,   (12)

where β_B is the vector of leaf weights in the B-th tree, and γ ≥ 0, λ ≥ 0, and α ≥ 0 are tuning parameters. Besides setting the minimum number of instances in terminal nodes and applying penalization on node weights, regularization can be achieved by limiting the maximum depth of a tree, capping the maximum number of trees in a boosting sequence, or defining an early stopping criterion for boosting (Zhang et al., 2005). Given the numerous parameters and hyperparameters for regularization and tree structure, it is tedious to find the optimal group of parameters by grid search. The parameters and their optimization are summarized in the following section. As reviewed by Nielsen (2016), there are three kinds of regularization parameters: boosting parameters, randomization parameters, and tree parameters. Boosting parameters include the number of trees B (i.e., the number of boosting iterations) and the learning rate η. A small step size for the learning rate has been found to play an important role in the convergence of boosting procedures (Zhang et al., 2005). In simulation studies, decreasing the learning rate increased the performance of boosted models (Friedman et al., 2000). Using a smaller learning rate, and thus a relatively larger number of trees, has been suggested, given the relationship between the learning rate and the number of boosted models. A drawback of this procedure is increased running time for model fitting.
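The sketch below, in R, shows how the three penalties of Equation 12 map onto the gamma, lambda, and alpha arguments of the xgboost implementation, together with the structural controls (max_depth, min_child_weight) discussed above, and how a cross-validation error in the spirit of the CVTE score used later for tuning would be computed. The data, parameter values, and number of rounds are illustrative assumptions, not the settings used in the paper.

```r
library(xgboost)

# Toy regression data: 200 samples, 50 features, 5 true main effects.
set.seed(1)
X <- matrix(rnorm(200 * 50), nrow = 200)
y <- drop(X[, 1:5] %*% rep(1, 5) + rnorm(200))

# gamma  : penalty on the number of leaves, gamma * |T| in Equation 12
# lambda : L2 penalty on leaf weights, lambda * ||beta_B||_2^2
# alpha  : L1 penalty on leaf weights, alpha * ||beta_B||_1
params <- list(objective = "reg:squarederror", eta = 0.05, max_depth = 2,
               min_child_weight = 10, gamma = 1, lambda = 1, alpha = 0.5)

# 10-fold cross-validation error with early stopping; Bayesian optimization
# would propose new (gamma, lambda, alpha) triples and re-evaluate this score.
cv <- xgb.cv(params = params, data = X, label = y, nrounds = 500, nfold = 10,
             early_stopping_rounds = 5, verbose = 0)
min(cv$evaluation_log$test_rmse_mean)
```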
The randomization parameters are the row subsampling parameter and the column subsampling parameter. In the random forest algorithm (Breiman, 2001), only a random subsample of the training data set is used for building each tree. Friedman (2002) built on this idea and introduced stochastic gradient boosting for the boosted tree algorithm. Basically, the row subsampling parameter controls the proportion of the data, and the column subsampling parameter determines the proportion of features used in each tree fitting. Both boosting parameters and randomization parameters provide general regularization, since they are used in the optimization of all kinds of boosting models. However, tree parameters apply to tree models only. Therefore, in this article, the discussion focuses on tree parameters and their optimization. Note that all the tree parameters involve a tradeoff between bias and variance. In each individual tree, besides the penalization parameters γ, λ, and α in Equation 12, the tree parameters also include two structure parameters: the maximum depth of the tree and the minimum sum of observation weights required in each leaf.
Given the maximum depth of a tree, D, and using the decomposition of the target function introduced in Friedman et al. (1991), the b-th individual tree function in Equation 9 can be written as

f_b(X_{n×p}) = \sum_{j} g^{(1)}(x_j) + \sum_{j,k} g^{(2)}(x_j, x_k) + \sum_{j,k,l} g^{(3)}(x_j, x_k, x_l) + \dots + \sum_{j,k,l,\dots} g^{(D)}(x_j, x_k, x_l, \dots),   (13)

where g^{(1)}(x_j) is the first-order interaction (main effect) of the j-th variable x_j, g^{(2)}(x_j, x_k) is the second-order interaction between x_j and x_k, and so on. From Equation 13, it is clear that the highest order of interaction equals the maximum depth of the tree, D. The minimum sum of observation weights in a leaf determines the variance of the \hat{β}'s, the estimated leaf weights of a tree. If the minimum sum is large, the growth of a tree will be conservative, which means fewer leaves will be grown and thus a smaller variance of the \hat{β}'s. Finally, we discuss how Bayesian optimization (Snoek et al., 2012) is applied for tuning γ, λ, and α in Equation 12. We choose Bayesian optimization because a brute-force search, such as grid search or random search, is time-consuming in a scenario with three hyperparameters. Here we treat hyperparameter optimization as a black-box process. We define a configuration of tuning hyperparameters, θ = (γ, λ, α), for the function we want to minimize,

CVTE(γ, λ, α | y, X) = \frac{1}{n} \sum_{i=1}^{n} (y_i - f_B(x_i))^2,   (14)

where CVTE(·) is a K-fold cross-validation error score, x_i is a row vector containing the p features of an observation, and f_B(·) is the fitted boosted tree model with B individual trees. An early stopping criterion states that if adding new trees does not decrease the cross-validation error within five trees, the boosting iteration is stopped before the maximum number of trees is reached. The value of B equals the number of trees in the sequence at which the combination of trees yields the lowest cross-validation error score. This error score is the output value of CVTE(·) in Equation 14. The support sets of γ, λ, and α are located within a bounded tuning region with lower limit 0. The optimal combination is used in the final model for comparison. Details are provided in Section 3.1. Barber et al. (2015) proposed the knockoff filter, a new variable selection method that controls the FDR. The knockoff filter generates knockoff variables that mimic the correlation structure of the original variables but are not associated with the response. These knockoff variables are used as controls for the original variables, so that only original variables that are highly associated with the response are selected. The knockoff filter has been shown to provide accurate FDR control, which cannot be realistically achieved using permutation methods. Barber et al. (2015)'s filter works under the following conditions: the design X_{n×p} is fixed, n > p, and y follows a linear Gaussian model. Candes et al. (2018) introduced model-X knockoffs, which extended the approach to high-dimensional settings and to covariate selection from known random distributions. Following the definition of model-X knockoffs in Candes et al. (2018), we restate the definition for boosted tree models here.
Definition 2.1. For a row vector of p random variables x_{1×p} = (x_1, x_2, ..., x_p), where each x_j is a random variable that represents a feature, the corresponding model-X knockoff variables z_{1×p} = (z_1, z_2, ..., z_p) are constructed such that:

1. for the combined random vector (x, z)_{1×2p}, if its j-th random variable is swapped with its (j + p)-th random variable (j = 1, ..., p), i.e., an original variable is swapped with its knockoff counterpart, the distribution of the new random vector is the same as that of (x, z)_{1×2p}; and
2. z_{1×p} ⊥⊥ y | x_{1×p}, where y is the response.

Based on this definition, to achieve high-dimensional FDR-controlled variable selection, qualified knockoff variables should satisfy two properties: (1) distributional exchangeability with the original variables and (2) conditional independence of the response. Two sampling methods, namely exact construction and approximate construction, have been introduced to generate knockoffs (Candes et al., 2018). In exact construction, a new knockoff variable z_j is sampled from the conditional distribution L(x_j | x_{-j}, z_{1:(j-1)}), where j = 1, ..., p, and x_{-j} = (x_{1:(j-1)}, x_{(j+1):p}). Approximate construction focuses on whether (x, z)_{1×2p} retains its first two moments after swapping. Given this summary of the two methods, it is apparent that approximate construction requires less complex computation than exact construction. In this article, we compare three algorithms for knockoff generation. The first algorithm, Approximate Construction (AC), is available in the knockoff
R package (Candes et al., 2018), where the covariance matrix of the original variables,
cov(X_{n×p}), is estimated directly. The estimated covariance matrix is shrunk toward the identity matrix if it is not positive definite (Opgen-Rhein and Strimmer, 2007). In addition, we propose two other algorithms: one that is similar to the AC algorithm described above and one that does not require a Gaussian assumption. The second algorithm (i.e., the one similar to the AC algorithm) is the Sparse Construction (SC) algorithm. Instead of estimating the covariance matrix directly, we conduct sparse estimation of the covariance matrix (Bien and Tibshirani, 2011), which yields a sparse matrix with a simpler structure. The rest of the algorithm is the same as the AC algorithm. The third algorithm (i.e., the one without a Gaussian assumption) is the Principal Component Construction (PCC) algorithm, which is inspired by Algorithm A.1 in Shen et al. (2019). In accordance with Definition 2.1, we describe our proposed PCC algorithm in Algorithm 2.1 below. Unlike the AC and SC algorithms, the PCC algorithm does not require the data to satisfy a Gaussian assumption.
Algorithm 2.1.
Principal Component Construction (PCC). For each column vector of the original variable x_j, where j = 1, ..., p:

1. Conduct principal component analysis on the matrix (X_{-j}, Z_{1:(j-1)})_{n×(p+j-2)}, where X_{-j} is the matrix X without the j-th column and Z_{1:(j-1)} contains the first (j − 1) columns of the matrix Z. When j = 1, Z_{1:(j-1)} is empty.
2. Denote by K the number of principal components chosen for a regression model, K = 1, ..., n − 1. For a fixed K, regress x_j on the K PCs. There is a tradeoff: the larger K is, the more akin the knockoff will be to the original variable, which results in a smaller type I error but weaker power.
3. Compute the residual vector ε_n = (x_j − \hat{x}_j).
4. Permute ε_n randomly, and denote the permuted vector by ε*_n.
5. Set z_j = \hat{x}_j + ε*_n, and append it to the current knockoff matrix Z_{1:(j-1)}.

This algorithm was designed in accordance with our definition of knockoffs. Using linear regression models, the empirical conditional distribution of x_j, L(x_j | X_{-j}, Z_{1:(j-1)}), can be estimated. Using permutation, we avoid the Gaussian assumption. Meanwhile, the generated z_j is independent of the response y, since y is ignored in our algorithm. Note that this is a sequential algorithm, so computation can be expensive. In Shen et al. (2019), the residuals are assumed to be approximately independently and identically distributed; we keep this assumption for our algorithm.
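A minimal R sketch of Algorithm 2.1 is given below. It follows the five steps literally, using prcomp and lm for the principal component regression; the helper name, the choice K = 10, and the toy data are assumptions made for illustration.

```r
# Principal Component Construction (Algorithm 2.1), sketched for a numeric matrix X.
pcc_knockoff <- function(X, K = 10) {
  n <- nrow(X); p <- ncol(X)
  Z <- matrix(NA_real_, n, p)
  for (j in seq_len(p)) {
    # Step 1: PCA on (X without column j, knockoffs built so far)
    A   <- cbind(X[, -j, drop = FALSE], Z[, seq_len(j - 1), drop = FALSE])
    pcx <- prcomp(A, center = TRUE, scale. = FALSE)$x
    pcs <- pcx[, seq_len(min(K, ncol(pcx))), drop = FALSE]
    # Step 2: regress x_j on the first K principal components
    fit <- lm(X[, j] ~ pcs)
    # Steps 3-5: permute the residuals and add them back to the fitted values
    Z[, j] <- fitted(fit) + sample(residuals(fit))
  }
  Z
}

set.seed(1)
X <- matrix(rnorm(100 * 20), nrow = 100)
Z <- pcc_knockoff(X, K = 10)
```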
To evaluate the knockoffs generated by these methods, we propose the Mean Absolute Angle of Columns (MAAC) metric for checking vector independence, and we use the Kernel Maximum Mean Discrepancy (KMMD) metric to test distribution similarity (Gretton et al., 2007).
Definition 2.2.
Mean Absolute Angle of Columns (MAAC). For two column vectors of the same length, x and z, we define

MAAC(x, z) = \arccos \left| \frac{x^\top z}{\|x\| \cdot \|z\|} \right|.   (15)

For two matrices with the same dimensions, X and Z, each with p columns, we define

MAAC(X, Z) = \frac{1}{p} \sum_{j=1}^{p} \arccos \left| \frac{x_j^\top z_j}{\|x_j\| \cdot \|z_j\|} \right|.   (16)

MAAC indicates the correlation between corresponding columns of two matrices. For knockoff variable selection, the weaker this correlation, the more powerful the test. The KMMD metric is used to perform a nonparametric distribution test; the null hypothesis H_0 is that the row vectors x_i in X and z_i in Z come from the same distribution. In summary, MAAC is for type II error control, and KMMD is for type I error control. In Section 3.2, simulation tests are described, and permutation tests are used for comparison.
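The MAAC metric of Definition 2.2 is straightforward to compute; a small R helper is sketched below (its name is ours). A KMMD test between the rows of X and Z is available, for instance, through the kmmd function of the kernlab package.

```r
# Mean Absolute Angle of Columns (Equation 16): the average, over matched
# columns, of the absolute angle between an original column and its knockoff.
# Larger values indicate weaker column-wise correlation, which favors power.
maac <- function(X, Z) {
  stopifnot(all(dim(X) == dim(Z)))
  mean(sapply(seq_len(ncol(X)), function(j) {
    acos(abs(sum(X[, j] * Z[, j])) /
           (sqrt(sum(X[, j]^2)) * sqrt(sum(Z[, j]^2))))
  }))
}
```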
For each original and knockoff variable pair, Barber et al. (2015) suggested the Lasso signed max (LSM) statistic as the test statistic for variable selection. Candes et al. (2018) later proposed the Lasso coefficient difference (LCD) statistic for the same purpose.
While both of these test statistics were designed specifically for the Lasso, there are also test statistics designed for tree models. Table 1 provides a list of common variable importance test statistics for boosted tree models. Given that different test statistics produce different importance rankings, it is difficult to determine which test statistic is the best choice in a specific scenario.
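Several of the statistics in Table 1 can be read directly from a fitted XGBoost model, as the hedged R sketch below shows: Gain, Cover, and Frequency come from xgb.importance(), and per-observation Tree SHAP contributions come from predict() with predcontrib = TRUE. The toy data and settings are illustrative assumptions.

```r
library(xgboost)

set.seed(1)
X <- matrix(rnorm(200 * 20), nrow = 200,
            dimnames = list(NULL, paste0("x", 1:20)))
y <- drop(X[, 1:3] %*% rep(1, 3) + rnorm(200))

bst <- xgboost(data = X, label = y, nrounds = 100, max_depth = 2, eta = 0.05,
               objective = "reg:squarederror", verbose = 0)

imp  <- xgb.importance(model = bst)          # columns: Feature, Gain, Cover, Frequency
shap <- predict(bst, X, predcontrib = TRUE)  # n x (p + 1) matrix; last column is the bias
phi  <- colMeans(abs(shap[, colnames(X)]))   # mean |SHAP| per feature
head(imp)
head(sort(phi, decreasing = TRUE))
```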
Table 1: Available variable importance test statistics in boosted tree models.

Test Statistic | Definition
Cover | The number of times a variable is used, weighted by the amount of training data in the corresponding nodes
Gain (Breiman et al., 1984) | The total loss reduction gained when using a variable
Saabas (Saabas, 2014) | The change in the model's expected output
Tree SHAP (Lundberg et al., 2018) | The sum of weighted differences of conditional expectations with and without a variable
Weight | The number of times a variable is used

Among these statistics, SHAP, which is based on game theory and local explanations, has the demonstrated advantage of retaining both consistency and local accuracy (Lundberg and Lee, 2017b). Lundberg and Lee (2017a) introduced feature attribution for tree ensembles using the additive feature attribution methods defined by Lundberg and Lee (2017b). They also proposed the Tree SHAP algorithm, which reduces the complexity of computing exact SHAP values from O(B L 2^p) to O(B L D^2), where B is the number of trees, L = max(|T_1|, ..., |T_B|) is the maximum number of leaves in any tree, p is the number of variables, and D is the maximum depth of any tree. Further, Lundberg et al. (2018) extended SHAP values to SHAP interaction values. In additive feature attribution methods, f(·) in Equation 9 is the original model, and an explanation model, h(·), is defined for f(·) such that

h(u) = φ_0 + φ^\top u,   (17)

where u ∈ {0, 1}^p is a binary p-dimensional column vector, u_j = 1 if the j-th variable is observed and u_j = 0 if it is missing, and φ contains the feature attribution values. The exact value of each φ_j can be calculated as

φ_j = \sum_{S ⊆ P \setminus \{j\}} \frac{|S|! (p − |S| − 1)!}{p!} [f_x(S ∪ \{j\}) − f_x(S)],   (18)

where P is the set of all variables, S is a subset of P with non-zero indices in u, and f_x(S) = E[f(X) | X_S] is a conditional expectation. This φ_j is the SHAP value of the j-th variable. Following the concepts outlined by Barber et al. (2015) and Candes et al. (2018), we now define a test statistic for knockoff variables in tree models that can control the FDR. For simplicity, we use the SHAP value as an example, but other test statistics can be applied to the knockoff variables using a similar procedure. We denote a pair of original and knockoff variables by x_j and z_j, the same notation used in Section 2.4. Accordingly, their respective SHAP values are φ_j^x and φ_j^z. The hypothesis test in which we are interested is:

H_0: the original variable x_j and the knockoff variable z_j are equally associated with the response;
H_a: the original variable x_j is more closely associated with the response than the knockoff variable z_j.

Here, the Shapley explanation absolute difference (SEAD) statistic is defined as

T_j = |φ_j^x(γ, λ, α)| − |φ_j^z(γ, λ, α)|,   (19)

where γ, λ, and α are the hyperparameters used in Equation 14. A larger positive T_j indicates that the response is more likely to be associated with the original variable than with its knockoff. Under the null hypothesis, T_j is equally likely to be positive or negative. Given that the null T_j's are symmetric, for any fixed t > 0, the false discovery proportion is

FDP(t) = \frac{\#\{\text{null } j : T_j ≥ t\}}{\#\{j : T_j ≥ t\}}.   (20)

This can be estimated by

\widehat{FDP}(t) = \min \left\{ \frac{\#\{j : T_j ≤ −t\}}{\#\{j : T_j ≥ t\}}, 1 \right\},   (21)

where \widehat{FDP}(t) = 1 when there are more T_j's below −t than above t.
Candes et al. (2018) provided formulas for both the modified FDR and the typical FDR; here we adopt the latter, as it is more conservative. A threshold τ is defined such that

τ = \min \left\{ t > 0 : \frac{\#\{j : T_j ≤ −t\} + 1}{\#\{j : T_j ≥ t\}} ≤ δ \right\},   (22)
where δ is the level at which we would like to keep the FDR. The set of selected variables is

\hat{S}(τ) = \{j : T_j ≥ τ\}.   (23)

Then, the FDR is controlled as

E\left( \frac{|\hat{S} ∩ S_0|}{|\hat{S}| ∨ 1} \right) ≤ δ,   (24)

where S_0 ⊆ {1, 2, ..., p} is the set of noise variables. The performance of the SEAD statistic in various models is evaluated in the following sections, and test statistics constructed from the other importance measures are tested for comparison. To perform model-free variable selection with FDR control, we propose a robust multiple-stage KnockOff Boosted Tree (KOBT) algorithm. Figure 1 is a flowchart that outlines the procedure. The algorithm contains four main steps and one optional step (circled in red): (1) sample a knockoff matrix Z according to the original matrix X; (2) grow boosted trees using the combined predictors (X, Z); (3) calculate the test statistic, φ, for each variable from the fitted boosted tree model, f_B(X, Z); and (4) repeat steps (1) to (3) q times and compute T_j = \frac{1}{q} \sum_{m=1}^{q} |φ_{j,m}^x| − \frac{1}{q} \sum_{m=1}^{q} |φ_{j,m}^z| for the j-th original variable. A larger positive T_j indicates a variable that is more closely associated with the response. The optional step is taken when some variables (shown as W in Figure 1) need to be retained in the final model.

Figure 1: Knockoff boosted tree flowchart. An optional step is circled in red.

The optional first part of the algorithm adds a regression model before the boosted tree model. Given that we want to retain some covariate variables in the final model, we conduct the regression

y = W α + y^*,   (25)

where y is the original response vector, W is the covariate matrix, and y^* denotes the residuals from this optional regression. This y^* is treated as the new response for the subsequent boosted tree fitting.
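The selection step of Equations 21–23 reduces to a short search over candidate thresholds. The R sketch below is a minimal illustration of that step, assuming a vector of knockoff statistics (for example, the averaged SHAP differences of Equation 26) and a target FDR level delta; the function name and the toy statistics are ours.

```r
# Given knockoff statistics T_j and a target FDR level delta, find the smallest
# threshold whose estimated FDP (Equation 22, with the "+1" correction) is at
# most delta, and return the indices selected by Equation 23.
knockoff_select <- function(Tstat, delta = 0.1) {
  thr     <- sort(unique(abs(Tstat[Tstat != 0])))
  fdp_hat <- sapply(thr, function(t) (sum(Tstat <= -t) + 1) / max(sum(Tstat >= t), 1))
  ok      <- thr[fdp_hat <= delta]
  if (length(ok) == 0) return(integer(0))  # nothing survives at this FDR level
  which(Tstat >= min(ok))
}

set.seed(1)
T_example <- c(rnorm(450, 0, 0.05), rnorm(50, 0.3, 0.05))  # 50 signals among 500
selected  <- knockoff_select(T_example, delta = 0.1)
```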
For the first required step of KOBT, we generate knockoff variables Z conditional on the original variables X. Under the assumption that the data in X follow a Gaussian distribution, the column means and the covariance matrix of X are estimated. As discussed in Section 2.4, there are two choices for covariance matrix estimation: shrinking the estimate toward the identity matrix (Opgen-Rhein and Strimmer, 2007) and sparse estimation (Bien and Tibshirani, 2011). Without a Gaussian assumption, we propose a strategy based on principal components and permutation (see Section 2.4 for details). The properties of the knockoff variables are compared in Section 3.2. During model fitting, the hyperparameters γ, λ, and α are tuned through Bayesian optimization. Once a boosted model is fitted, importance test statistics are calculated. Multiple statistics are considered, such as gain, cover, weight, SHAP values, and Saabas values; their behavior is discussed in Section 3.3. The above steps, from generating knockoff variables to calculating test statistics, are repeated q times to obtain the mean absolute value for each variable, where q should be sufficiently large. Following the strategy for applying knockoff variables (Barber et al., 2015; Candes et al., 2018), for each pair of original and knockoff variables, a test statistic is calculated,

T_j = \frac{1}{q} \sum_{m=1}^{q} |φ_{j,m}^x| − \frac{1}{q} \sum_{m=1}^{q} |φ_{j,m}^z|,   (26)

which is the difference between the mean absolute test statistic values of each variable pair. A larger test statistic indicates that the original variable is more important than its corresponding knockoff. Finally, the FDR of variable selection is controlled through the values of all generated T_j's, as shown in Equation 22. We use simulation tests to address the following topics: (a) control of type I and type II errors for knockoffs generated by different methods; (b) the properties of different variable importance statistics in boosted tree models; and (c) power and false discovery control using various combinations of knockoffs and statistics. Further, the performance of boosted tree fitting for different model structures is compared. First, we describe the models and the boosted tree framework used in the following simulation tests. For the design matrix X_{n×p}, we simulate n = 200 samples and p = 1000 variables. Each row vector of the design matrix X_{n×p} is generated as x_i ∼ N_p(0, Σ) with block dependence structure matrix Σ = diag(Σ_1, Σ_2, Σ_3, ...), where each Σ_i is a πp × πp matrix with elements σ_{j,k} = ρ^{|j−k|}. We set π = 0. and ρ = 0. . Because we want to test model-free variable selection, we generate the response as

y = \tilde{X} β + ε,   (27)

where β = (β_{πp}^\top, 0_{p−πp}^\top)^\top, β_{πp} is a vector with equal elements, ε ∼ N_n(0, I_n), and \tilde{X} is transformed from the initial X with four different structures, namely (see the sketch after this list):

1. Main-effect model: \tilde{X} = X.
2. Interaction effect model: each \tilde{x}_{i,j} = x_{i,j−1} x_{i,j}, for j = 1, ..., πp/2; this keeps πp/2 of the signal variables the same as in the main-effect model.
3. Exponential effect model: each \tilde{x}_{i,j} = \exp(x_{i,j}), for j = 1, ..., πp.
4. Quadratic effect model: each \tilde{x}_{i,j} = (x_{i,j})^2, for j = 1, ..., πp.
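A hedged R sketch of this simulation design is shown below. The block AR(1)-style covariance, the sparse coefficient vector, and the four response transformations follow the description above, but the specific values of π and ρ and the exact pairing used in the interaction model are assumptions made for illustration (π = 0.01, ρ = 0.5, adjacent signal columns multiplied).

```r
library(MASS)     # mvrnorm
library(Matrix)   # bdiag

set.seed(1)
n <- 200; p <- 1000
pi0 <- 0.01; rho <- 0.5                       # assumed signal proportion and correlation
k <- round(pi0 * p)                           # size of each dependence block
Sigma_block <- rho^abs(outer(1:k, 1:k, "-"))  # sigma_{j,k} = rho^{|j-k|}
Sigma <- as.matrix(bdiag(replicate(p / k, Sigma_block, simplify = FALSE)))
X <- mvrnorm(n, mu = rep(0, p), Sigma = Sigma)

beta <- c(rep(1, k), rep(0, p - k))           # equal signal coefficients, then zeros

make_y <- function(X, type = c("main", "interaction", "exponential", "quadratic")) {
  type <- match.arg(type)
  Xt <- X
  if (type == "interaction")                  # assumed pairing of adjacent signal columns
    for (j in seq(2, k, by = 2)) Xt[, j] <- X[, j - 1] * X[, j]
  if (type == "exponential") Xt[, 1:k] <- exp(X[, 1:k])
  if (type == "quadratic")   Xt[, 1:k] <- X[, 1:k]^2
  drop(Xt %*% beta + rnorm(nrow(X)))          # Equation 27
}

y_main <- make_y(X, "main")                   # fitted models always use the raw X
```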
While this is challenging, there is no guarantee that the correct variables will be available in real data modeling, so it is useful to examine boosted tree performance with incorrect but related variables. After defining the simulated data, we move on to our boosted tree models. The boosters used for growing our trees are the gradient boosted tree (GBRT in Table 2) and dropouts meet multiple additive regression trees (DART in Table 2) (Rashmi and Gilad-Bachrach, 2015). As suggested in Friedman et al. (2000) and Zhang et al. (2005), a small learning rate is preferred in boosted model fitting. For our simulation, we fix the learning rate η at a small value (Table 2). In this scenario, the trees added to the boosted model earlier are more important, which can overfit the model toward those early trees. In iterations of DART model training, some trees are dropped randomly to avoid overfitting, while in traditional GBRT, all trees are kept once they join the model. We compare these two algorithms in the following simulations.
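In the xgboost implementation, the two boosters differ only in the parameter list, as the hedged sketch below illustrates; the drop rates shown are illustrative assumptions, not the paper's settings.

```r
# GBRT: every fitted tree stays in the ensemble.
params_gbrt <- list(booster = "gbtree", objective = "reg:squarederror",
                    eta = 0.01, max_depth = 2, min_child_weight = 10)

# DART: during each boosting iteration a random subset of earlier trees is
# dropped (rate_drop), which counteracts over-reliance on the first trees.
params_dart <- list(booster = "dart", objective = "reg:squarederror",
                    eta = 0.01, max_depth = 2, min_child_weight = 10,
                    rate_drop = 0.1, skip_drop = 0.5)

# Either parameter list can be passed to xgb.cv() as in the earlier sketch,
# e.g. xgb.cv(params = params_dart, data = X, label = y, nrounds = 500,
#             nfold = 10, early_stopping_rounds = 5, verbose = 0)
```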
Table 2: Means and standard errors of 100 10-fold cross-validation error scores, mean (SD), where sample size n = 200, feature size p = 1000, signal proportion π = 0., signal strength β = 1, learning rate η = 0., minimum child weight w_min = 10, and all samples and features are considered in each iteration step. Columns: Models, Boosters, Max Depth = 2, ...

We choose several initial hyperparameter values of (γ, λ, α) and conduct a fixed number of Bayesian optimization iterations for each fitting. We use the average 10-fold cross-validation error score for model evaluation. An early stopping criterion is set: if the average 10-fold cross-validation error score has not improved after five training iterations, no new trees are added. The maximum depth of each tree is fixed at five levels, and the minimum weight of each single leaf is 10. Note that:

1. We can increase the number of initial (γ, λ, α) hyperparameter values and Bayesian optimization iterations to find new (γ, λ, α) that improve the performance of the final model. However, this is not within the scope of our simulation tests.
2. For real data, we can choose a larger maximum depth for a boosted tree model. Here, we use only small depths because this is a simple simulation test.

For each combination of the above four models, two boosters, and five maximum tree depths, 100 boosted tree models are fitted. Table 2 lists the means and standard errors of the 100 10-fold cross-validation error scores. We use the same set of design matrices X for generating the transformed matrices; for each model structure, the transformed matrices are created according to our definitions. Among all four model structures, the main-effect model, which uses the correct variables, has the lowest cross-validation prediction error. The incorrect but related models, ranked from lowest to highest cross-validation prediction error, are interaction, exponential, and then quadratic. This is reasonable, since the structure of tree models is formed naturally by interactions among variables. From the standard errors of the means, we conclude that predictions for the main-effect and interaction models are more stable than predictions for the exponential and quadratic models. The results indicate that, although prediction ability differs across the true models, it is realistic to use incorrect but related variables in boosted tree modeling. As for maximum tree depth, since the correlation between two variables is ρ^d, where d is the index difference of the pair, the correlation decreases rapidly as we choose two variables that are farther apart. This is why, in Table 2, the cross-validation error score is lowest when depth = 2. This implies that prior knowledge of the strength of variable correlation is important, as we need to choose an appropriate value for the tree depth. Finally, under the conditions of this simulation, there is no obvious difference between the GBRT and DART boosters. Figure 2 shows examples of boosted tree iteration steps for all models and boosters. As proposed in Section 2.4, there are three strategies to generate knockoff variables: a Gaussian assumption with a shrunk covariance matrix, a Gaussian assumption with a sparse covariance matrix, and permuted residuals from principal component regression. We are interested in evaluating the power and type I error performance when the different knockoff variables are used for testing. We apply the MAAC metric, defined in Section 2.4, for power evaluation.
A larger positive MAAC value shows that the two tested matrices have less related column spaces, so the difference between a real signal and its knockoff is more pronounced. As for type I errors, we use KMMD to test whether the row vectors in the original design matrix X and the knockoff matrix Z are drawn from the same distribution. A larger positive test statistic implies that we are more likely to reject the null hypothesis that these two row vectors come from the same distribution. If knockoff variables are highly different from the original variables, this leads to large type I errors. In other words, we want knockoff variables to be drawn from the same distribution as the original variables, but with sufficiently different values. As always, there is a tradeoff between power and type I error control.
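The comparison described here can be scripted directly; the R sketch below scores two knockoff constructions on a single simulated design matrix with the maac helper defined earlier, using create.second_order from the knockoff package as the Gaussian second-order construction and the pcc_knockoff sketch for the principal component construction. The data dimensions are illustrative assumptions.

```r
library(knockoff)

set.seed(1)
X <- matrix(rnorm(100 * 50), nrow = 100)

Z_gauss <- create.second_order(X)   # Gaussian model-X knockoffs (knockoff package)
Z_pcc   <- pcc_knockoff(X, K = 10)  # principal component construction (Algorithm 2.1)

# Larger MAAC = less column-wise correlation with the originals = more power;
# a KMMD test (e.g., kernlab::kmmd) would check distributional similarity.
c(gaussian = maac(X, Z_gauss), pcc = maac(X, Z_pcc))
```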
Figure 2: Examples of boosted tree iteration steps for all models and boosters. (a) Main-effect model with GBRT; (b) main-effect model with DART; (c) interaction model with GBRT; (d) interaction model with DART; (e) exponential model with GBRT; (f) exponential model with DART; (g) quadratic model with GBRT; and (h) quadratic model with DART.
Figure 3: MAAC and KMMD of knockoffs with normal and Poisson original variables. (a) MAAC with normally distributed X; (b) MAAC with Poisson-distributed X; (c) KMMD with normally distributed X; and (d) KMMD with Poisson-distributed X.

We first assume the row vectors of the design matrix X come from a normal distribution. We simulate n = 100 samples and p = 500 variables. Each row vector of the design matrix X_{n×p} is generated as x_i ∼ N_p(0, Σ) with block dependence structure matrix Σ = diag(Σ_1, Σ_2, Σ_3, ...), where each Σ_i is a πp × πp matrix with elements σ_{j,k} = ρ^{|j−k|}. We set π = 0. and ρ = 0. . In Figure 3, we simulate a set of original design matrices and generate many corresponding knockoff matrices for each original design matrix. Four kinds of knockoffs are created for comparison: shrunk Gaussian, sparse Gaussian, and permuted principal component knockoffs with two different numbers of components. The y-axis is the mean value over the knockoff samples from one original matrix, and the x-axis is the seed used to sample the original matrix; therefore, y values of different methods at the same x location are comparable. It is clear from Figures 3 (a) and (c) that all methods are consistent under the normal assumption. The MAAC values in Figure 3 (a) show that the sparse method provides the most similar knockoffs, which also means they are the least powerful. On the other hand, the shrunk method produces the most different knockoffs. The principal component methods provide knockoffs in between. This makes sense, since the shrunk Gaussian method directly estimates the covariance matrix of X and shrinks it toward the identity matrix if it is not positive definite, while the sparse Gaussian method has zeros in the estimated matrix given its sparsity assumption. In Figure 3 (c), the KMMD plot has the same order as MAAC; however, the sparse method has the lowest type I error while the shrunk method has the highest. Since both the shrunk Gaussian and sparse Gaussian methods assume X follows a normal distribution, it is interesting to see what happens when the design matrix does not follow a normal distribution. Figures 3 (b) and (d) show that the sparse Gaussian method is not as consistent for a Poisson distribution as it is for a normal distribution. The principal component method remains consistent, however, because it does not depend on the normal assumption. Despite some exceptions, for most design matrices the y-axis order of the four curves is the same as for the normal distribution. In summary, the sparse Gaussian method is the most conservative, the shrunk Gaussian method is the most powerful, and the principal component method lies in between. When we increase the number of principal components, the performance of the generated knockoffs becomes similar to that of the shrunk method.
Table 3: Mean and standard error of signal percentage and ranking ratio (RR) from repeated simulation tests for ranking test statistics in main-effect models, where n = 100, p = 500.

Strength  Booster  Gain   Cover  Frequency  SHAP   Saabas
0.5       GBRT     0.705  0.685  0.716      0.699  0.700
0.5       DART     0.714  0.700  0.729      0.708  0.711
–         GBRT     0.699  0.697  0.723      0.700  0.700
–         DART     0.694  0.693  0.713      0.700  0.703

Table 4: Mean and standard error of signal percentage and ranking ratio (RR) from repeated simulation tests for ranking test statistics in interaction models, where n = 100, p = 500.

Strength  Booster  Gain   Cover  Frequency  SHAP   Saabas
0.5       GBRT     0.564  0.561  0.586      0.566  0.565
0.5       DART     0.562  0.550  0.581      0.561  0.562
–         GBRT     0.626  0.634  0.654      0.633  0.635
–         DART     0.625  0.627  0.649      0.630  0.632

For each variable in a tree model, feature importance can be used to show its importance in the model. As discussed in Section 2.5, several test statistics can represent this importance. In this section, we conduct simulation tests comparing gain, cover, frequency, SHAP, and Saabas.
Since we are only interested in the ranking ability of these test statistics for the selected variables, not in the performance of variable selection itself, which is carried out by the tree growing, we define a score called the ranking ratio (RR) as

RR = \frac{\#\{\text{noise variables ranked above the last signal in the ranking}\}}{\#\{\text{noise variables}\}}.   (28)
The RR contains two parts: the number of noise variables that are selected by the boosted tree and ranked above the lowest-ranked signal among all selected signals, and the total number of mis-selected noise variables. This score, which lies in (0, 1], indicates how many false positive decisions must be accepted if, for a given set of variables chosen during tree growing, we want all signals to be included in the selection. The score is undefined if no signal or no noise variables are selected by a tree model. It evaluates the ranking ability of each variable importance statistic; a smaller RR score indicates better ranking ability. We keep the model described in Section 3.2 and simulate n = 100 samples and p = 500 variables. Each row vector of the design matrix X_{n×p} is generated as x_i ∼ N_p(0, Σ), with block dependence structure matrix Σ = diag(Σ_1, Σ_2, Σ_3, ...), where each Σ_i is a πp × πp matrix with elements σ_{j,k} = ρ^{|j−k|}. We set π = 0. and ρ = 0. . Following the tree model hyperparameters in Section 3.1, the maximum depth of each tree is fixed. We use two boosters, GBRT and DART, four model structures, and two signal strengths. In Table 3, the correct variables are used for modeling. Each mean and standard error is calculated from repeated simulations. In each signal strength-booster combination, the means of the test statistics are close and the standard errors are almost the same. This indicates that these variable importance statistics have similar consistency, even under different signal strengths and boosters. There is no obvious difference between the two boosters. Among the five variable importance statistics, frequency has the weakest ranking ability, given its highest RR score, regardless of strength or booster. In Tables 4, 5, and 6, incorrect but related variables are used for modeling. For these models, cover has the best performance, as it returns the lowest RR scores in most cases. Note that we are only comparing the five variable importance statistics within each combination of model structure, strength, and booster (i.e., within each row in Tables 3 to 6); different rows correspond to different boosted tree models and are therefore not comparable. In this final subsection of the simulation, we combine all of the previous steps and run the whole framework of the knockoff boosted tree (KOBT) algorithm. We are interested in the power and FDR of different combinations of ranking statistics and knockoff types. We simulate n = 100 samples and p = 500 variables.
Table 5: Mean and standard error of signal percentage and ranking ratio (RR) from repeated simulation tests for ranking test statistics in exponential models, where n = 100, p = 500.

Strength  Booster  Gain   Cover  Frequency  SHAP   Saabas
0.5       GBRT     0.802  0.783  0.800      0.799  0.799
0.5       DART     0.810  0.784  0.810      0.808  0.808
–         GBRT     0.817  0.809  0.818      0.821  0.821
–         DART     0.817  0.808  0.818      0.820  0.821

Table 6: Mean and standard error of signal percentage and ranking ratio (RR) from repeated simulation tests for ranking test statistics in second-order models, where n = 100, p = 500.

Strength  Booster  Gain   Cover  Frequency  SHAP   Saabas
0.5       GBRT     0.784  0.759  0.762      0.786  0.786
0.5       DART     0.772  0.747  0.749      0.768  0.768
–         GBRT     0.832  0.834  0.823      0.846  0.846
–         DART     0.834  0.828  0.821      0.839  0.840
Each row vector of the design matrix X_{n×p} is generated either as x_i ∼ N_p(0, Σ) or from a correlated Poisson distribution with the same block dependence structure matrix Σ = diag(Σ_1, Σ_2, Σ_3, ...), where each Σ_i is a πp × πp matrix with elements σ_{j,k} = ρ^{|j−k|}. We set π = 0. and ρ = 0. . The maximum depth of each tree is fixed. For simplicity, we choose GBRT as the booster and the main-effect model as the true model. In Tables 7 to 10, power and FDR for different combinations of knockoff types and ranking statistics are listed for comparison. We choose a condition in which signals are sparse and weak so that the performance of each method is clearly distinguishable.

In Table 7, normally distributed design matrices are simulated, each with 1,000 shrunk Gaussian knockoffs, 1,000 sparse Gaussian knockoffs, 1,000 10-principal component knockoffs, and 1,000 30-principal component knockoffs. We set the targeted FDR at 0. . The means and corresponding standard errors of power for each combination are listed. Among these four types of knockoffs, the shrunk Gaussian knockoff has the highest power, the same conclusion reached in Figure 3 (a). It is followed by the 10- and 30-principal component knockoffs, and the sparse Gaussian knockoff has the lowest power. Among the five importance statistics we test, frequency is always the most powerful statistic for all knockoffs, while gain is the least powerful; the other three are moderately powerful and fall between frequency and gain. Table 8 shows the means and corresponding standard errors of the FDR for each combination. Since the targeted FDR is 0. , it is clear that the sparse Gaussian knockoff is the most conservative method and is the only one that keeps the FDR under the targeted level. At the same order of power, the shrunk Gaussian knockoff has the highest FDR, followed by the two principal component knockoffs.

Table 9 shows the power of the statistics when the normality assumption is invalid. Fifty Poisson-distributed design matrices are simulated, each with 1,000 shrunk Gaussian knockoffs, 1,000 sparse Gaussian knockoffs, 1,000 10-principal component knockoffs, and 1,000 30-principal component knockoffs. Again, the targeted FDR is set at 0. . The power ordering of the knockoffs is the same as for the normal distribution. For the importance statistics, the order for Poisson-distributed matrices differs from that for normally distributed matrices: cover has the highest power, followed by frequency and then the other three. Both the sparse Gaussian and 30-principal component knockoffs keep the FDR close to the targeted level, whereas the shrunk Gaussian and 10-principal component knockoffs have higher FDRs. The order of the importance statistics for FDR is the same as their power ranking. In summary, the ranges below show the trends of the knockoffs and importance statistics.
Knockoff types, from more conservative to more powerful:
Sparse Gaussian, 30-Principal Component, 10-Principal Component, Shrunk Gaussian

Importance statistics, from more conservative to more powerful:
Normal: Gain; Cover, Saabas, and SHAP; Frequency
Non-normal: Cover; Frequency; Gain, Saabas, and SHAP
Table 7: Mean and standard error of power from normally distributed simulation tests for ranking test statistics and knockoff types, where n = 100, p = 500, π = 0., signal strength = 1., design matrix variance σ² = 1, and q = 1000.

           Sparse Gaussian  Shrunk Gaussian  10-Principal Component  30-Principal Component
Cover      0.042            0.460            0.448                   0.297
Frequency  0.054            0.480            0.463                   0.355
Gain       0.007            0.423            0.398                   0.211
Saabas     0.037            0.471            0.448                   0.283
SHAP       0.037            0.467            0.448                   0.284

Table 8: Mean and standard error of FDR from normally distributed simulation tests for ranking test statistics and knockoff types, where n = 100, p = 500, π = 0., signal strength = 1., design matrix variance σ² = 1, and q = 1000.

           Sparse Gaussian  Shrunk Gaussian  10-Principal Component  30-Principal Component
Cover      0.091            0.721            0.693                   0.481
Frequency  0.139            0.751            0.739                   0.640
Gain       0.007            0.684            0.660                   0.407
Saabas     0.070            0.720            0.694                   0.450
SHAP       0.070            0.719            0.694                   0.448

Table 9: Mean and standard error of power from Poisson-distributed simulation tests for ranking test statistics and knockoff types, where n = 100, p = 500, π = 0., signal strength = 1., design matrix variance σ² = 1, and q = 1000.

           Sparse Gaussian  Shrunk Gaussian  10-Principal Component  30-Principal Component
Cover      0.010            0.380            0.359                   0.105
Frequency  0.008            0.377            0.338                   0.086
Gain       0.008            0.332            0.262                   0.053
Saabas     0.008            0.349            0.276                   0.040
SHAP       0.008            0.351            0.275                   0.041

Table 10: Mean and standard error of FDR from Poisson-distributed simulation tests for ranking test statistics and knockoff types, where n = 100, p = 500, π = 0., signal strength = 1., design matrix variance σ² = 1, and q = 1000.

           Sparse Gaussian  Shrunk Gaussian  10-Principal Component  30-Principal Component
Cover      0.007            0.551            0.488                   0.142
Frequency  0.007            0.545            0.474                   0.116
Gain       0.005            0.496            0.396                   0.084
Saabas     0.005            0.514            0.388                   0.066
SHAP       0.005            0.514            0.392                   0.065
Table 11: Genes whose expression is related to BRCA tumor purity, where targeted FDR = 0. . Genes are selected using the shrunk Gaussian knockoff. Genes overlapping with the ESTIMATE models are shown in bold (CCDC69, FGR, IL7R).

Detected genes: C1S, C2orf48, CCDC69, CSF2RB, CXorf65, DNAJC12, EDN3, FAM163A, FAM65B, FGR, GBP3, HGFAC, HTR2A, IL7R, KIAA0087, NCRNA00175, NRADDP, NRG2, PM20D1, PNOC, SLAIN1, UNQ6494
Tumor sample purity refers to the percentage of cancer cells in a tumor tissue sample. It is used to discover the roles of cancerous and non-cancerous cells in the tumor microenvironment, which mainly comprises immune cells (Aran et al., 2015; Turley et al., 2015). To estimate tumor purity, Carter et al. (2012) proposed ABSOLUTE, a method that analyzes somatic DNA alterations. In addition, it has been reported that DNA methylation data and expression data from selected stromal genes (Houseman et al., 2012; Yoshihara et al., 2013) have been successfully used for estimation. Recently, Aran et al. (2015) and Li et al. (2019) used RNA-seq gene expression data to determine tumor purity and found several gene signatures of individual cancer types. As it has been shown reasonable to estimate tumor sample purity using gene expression data, we apply our KOBT algorithm so that we can compare our results with those of previous reports. Here, our interest is not the accuracy of estimation but the selection of genes that are important in estimation without topology assumptions. If a gene's expression level is positively correlated with tumor purity, then it is highly likely that this gene is expressed primarily by cancer cells in tumor samples. Processed TCGA RNA-seq gene expression data were downloaded from the Pan-Cancer Atlas Publication website (Hoadley et al., 2018). Among all available tumor types, breast invasive carcinoma (BRCA) and skin cutaneous melanoma (SKCM) were chosen for tumor purity estimation.
Both the BRCA and SKCM cohorts provide genome-wide expression measurements for each sample. Genes with missing values or zero variance were filtered out. We used the tumor purity estimates in Hoadley et al. (2018), which were obtained using ABSOLUTE, as the responses; the responses are thus bounded between 0 and 1. To reduce the dimensionality before running KOBT, we apply the Lasso (Tibshirani, 1996) to the original data sets. After tuning the penalty terms in the Lasso objective functions, subsets of BRCA and SKCM genes are selected for further analysis. We generated q = 300 knockoffs for each gene. Here, we compare two knockoff types: shrunk Gaussian and principal component knockoffs. We conducted Bayesian optimization and cross-validation to choose the optimal parameters for the boosted tree algorithm (see Section 6.1 for details of the parameters). We used SHAP for importance evaluation; therefore, our test statistic is the average difference of SHAP values between each gene and its own knockoff, as shown in Equation 26. Finally, FDR control is applied to these test statistics with a targeted FDR level of 0. . Table 11 reports the selected genes whose expression is related to BRCA tumor purity, with genes selected using the shrunk Gaussian knockoff; genes selected by the principal component knockoff are listed in Table 16. Among all detected genes, CSF2RB (colony stimulating factor 2 receptor beta common subunit), C1S (complement C1s), CCDC69 (coiled-coil domain containing 69), and FGR are also reported among the top 10-ranked important genes for pan-cancer tumor purity prediction in Li et al. (2019). CSF2RB is an immune-related gene, and it has been reported that in BRCA, high expression of CSF2RB is positively correlated with patient survival (Liu et al., 2019). ESTIMATE (Yoshihara et al., 2013) uses stromal and immune gene expression to predict tumor purity; our detected immune genes CCDC69, FGR, and IL7R are included in Yoshihara et al. (2013). Box plots of expression levels for selected genes are presented in Figure 4; these genes are shown because they are detected by both types of knockoffs, and the remaining genes are plotted in Figure 8. All samples are grouped according to their purity: samples with the highest purity are labeled as high (blue), and samples with the lowest purity are labeled as low (yellow). The non-parametric Wilcoxon signed-rank test is conducted for each low-high pair, and the corresponding p-values are shown at the top of the plot. The expression levels of almost all selected genes are highly correlated with tumor purity. Samples are also grouped according to their gene expression levels, with the highest-expression samples forming the high-level group and the lowest-expression samples forming the low-level group, and survival analysis has been conducted for these genes. Among them, CCDC69, CXORF65, and FGR have significant p-values for their coefficients in Cox regression. Plots of the Kaplan–Meier estimators and p-values are in Figure 10.
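The Lasso pre-screening step described here can be sketched with the glmnet package; the code below is a minimal illustration on stand-in data, since the actual TCGA expression matrix and ABSOLUTE purity values are not reproduced here, and the object names are ours.

```r
library(glmnet)

# Stand-in data shaped like the TCGA input: rows are tumor samples, columns are
# genes, and the response is a purity estimate bounded between 0 and 1.
set.seed(1)
expr   <- matrix(rnorm(300 * 2000), nrow = 300,
                 dimnames = list(NULL, paste0("gene", 1:2000)))
purity <- plogis(drop(expr[, 1:5] %*% rep(0.5, 5)) + rnorm(300, sd = 0.5))

cvfit <- cv.glmnet(x = expr, y = purity, alpha = 1)          # Lasso with CV-tuned penalty
cf    <- as.matrix(coef(cvfit, s = "lambda.min"))
keep  <- setdiff(rownames(cf)[cf[, 1] != 0], "(Intercept)")  # genes passing the screen
expr_screened <- expr[, keep, drop = FALSE]

# The KOBT steps (knockoff generation, boosted tree fitting, SHAP-difference
# statistics, and FDR thresholding) are then applied to expr_screened,
# as outlined in Section 2.6.
```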
Table 12: Genes whose expression is related to SKCM tumor purity at the targeted FDR level. Genes are selected using the shrunk Gaussian knockoff. Genes overlapping with the ESTIMATE models are in bold.

Detected genes: CSF2RB, RHOH, C1S, CCDC69, CCL22, FGR, ALPK2, ANKRD40, CAPN13, CASP5, CCDC30, CTCF, DEFA1B, FAM78A, GAL3ST2, GSTM5, HERPUD1, IGF2, KCNRG, KIR2DL3, LOC100129066, LOC91450, MYBPC3, NAP1L2, OLAH, OSTalpha, PADI4, PHC1, PLXNB3, PROCA1, PRRG4, PTK7, RAP1GAP, RGL4, SOAT1, SRPRB, TXNDC3, VNN3
Table 12 reports more genes associated with SKCM tumor purity. Li et al. (2019) reports CSF2RB, RHOH, C1S, CCDC69, and CCL22 among the top 10-ranked important genes for pan-cancer tumor purity prediction. Most of these detected genes are expressed by stromal cells rather than cancer cells. For example, the expression level of CCDC69 is negatively correlated with tumor purity, which could not be the case if it came mainly from cancer cells. The immune genes CCDC69, FGR, and RHOH and the stromal gene TXNDC3 are included in the ESTIMATE models. The box plots of gene expression levels for selected genes are presented in Figure 5, and the remaining genes are plotted in Figure 9. As for BRCA, samples in the top purity fraction are labeled as high (blue), and samples in the bottom purity fraction are labeled as low (yellow); the p-values from the corresponding Wilcoxon signed-rank tests are attached to the plots. The expression levels of almost all selected genes are highly correlated with tumor purity. All samples are also grouped according to their gene expression levels, and plots of the Kaplan-Meier estimators and p-values are in Figures 11 and 12. In summary, for tumor purity estimation, our KOBT algorithm detects a set of genes that are largely expressed in stromal cells, and our results reproduce previously reported work obtained with different algorithms.

Figure 4: Relative gene expression level comparison for selected genes in low (yellow) and high (blue) purity samples of BRCA. The non-parametric Wilcoxon signed-rank test is conducted for each low-high pair, and the corresponding p-values are included at the top of the plot. The relative gene expression level is the log-transformed normalized expression read counts mapped (unit: million reads).

For the application of KOBT to classification, we focus on its performance in assigning tumors to known classes and in identifying the genes whose expression levels are related to the classification. In 1999, Golub et al. (1999) used gene expression monitoring by DNA microarrays to conduct cancer classification; their test on classifying acute myeloid leukemia (AML) and acute lymphoblastic leukemia (ALL) showed that it is realistic to distinguish cancer subtypes based on gene expression data alone. Li et al. (2017) performed a pan-cancer classification with RNA-seq data from TCGA using boosted tree and k-nearest neighbour methods. Since the KOBT algorithm detects signals with false discovery rate control, we investigate how it performs on gene selection for tumor type classification.
Figure 5: Relative gene expression level comparison for selected genes in low (yellow) and high (blue) purity samples of SKCM. The non-parametric Wilcoxon signed-rank test is conducted for each low-high pair, and the corresponding p-values are included at the top of the plot. The relative gene expression level is the log-transformed normalized expression read counts mapped (unit: million reads).

Table 13: Genes whose expression is related to ESCA vs STAD tumor classification at the targeted FDR level.
Detected genes: BARX1, HAND2, KRT14, NVL

The TCGA RNA-seq gene expression data are the same as in Subsection 4.1. Instead of a pan-cancer classification, we focus on two challenging binary classifications: (1) esophageal carcinoma (ESCA) vs stomach adenocarcinoma (STAD) and (2) rectum adenocarcinoma (READ) vs colon adenocarcinoma (COAD). We choose these two pairs of cancers because of their similarities. ESCA and STAD are both malignant tumors of the digestive tract, and Li et al. (2017) reported that almost all READ samples were mis-assigned to COAD. We have samples for the ESCA vs STAD classification test and for the READ vs COAD test. The steps are the same as in Subsection 4.1, except that the responses are now binary: 1 for one cancer and 0 for the other (a minimal sketch of the modified settings is given after the results below).

In Table 13, the selected genes that can distinguish ESCA and STAD are presented. Among them, BARX1 is a stomach mesenchymal transcription factor. Kim et al. (2005) showed that loss of BARX1 in the mesenchyme prevents stomach epithelial differentiation of the overlying endoderm and induces intestine-specific genes instead; their results defined a transcriptional and signaling pathway of inductive cell interactions in vertebrate organogenesis. Kim et al. (2011) showed that BARX1 controls mouse stomach morphogenesis and is required to specify stomach-specific epithelium in the adjacent endoderm. HAND2 is an RNA gene affiliated with the long non-coding RNA class. Tsubokawa et al. (2002) stated that KRT14 is often expressed by tumor cells in the trabecular nests of the primary carcinoma. The box plots of gene expression levels for all detected genes are presented in Figure 6. All samples are grouped according to their cancer types: ESCA samples are in blue, and STAD samples are in yellow. The non-parametric Wilcoxon signed-rank test is conducted for each gene, and the corresponding p-values are included at the top of the plot. The expression levels of all selected genes are highly correlated with the cancer type.

In Table 14, we show the detected genes that can be used for READ vs COAD classification. HOXC4 and HOXC8 are members of the Homeobox genes, a large family of transcription factors that direct the formation of many body structures during early embryonic development. HOXC family gene expression has been observed to be upregulated in most solid tumor types (Bhatlekar et al., 2014). The box plots of gene expression levels for all detected genes are presented in Figure 7. All samples are grouped according to their cancer types: COAD samples are in blue, and READ samples are in yellow. The non-parametric Wilcoxon signed-rank test is conducted for each gene, and the corresponding p-values are included at the top of the plot. The expression levels of all selected genes are highly correlated with the cancer type.
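A minimal sketch of the pieces that change for the classification runs, assuming the binary labels y_class are coded 0/1; the parameter values are placeholders.

```r
library(xgboost)

## y_class: 1 for one cancer type (e.g., ESCA) and 0 for the other (e.g., STAD)
bst <- xgboost(data = cbind(X, Xk), label = y_class, nrounds = 100,
               objective   = "binary:logistic",  # replaces reg:squarederror
               eval_metric = "error",            # classification error instead of RMSE
               max_depth = 3, eta = 0.1, verbose = 0)

## SHAP contributions and the knockoff contrast statistic are computed as before
shap <- predict(bst, cbind(X, Xk), predcontrib = TRUE)
```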
Figure 6: Relative gene expression level comparison for selected genes in ESCA (blue) and STAD (yellow) samples. The non-parametric Wilcoxon signed-rank test is conducted, and the corresponding p-values are included at the top of the plot. The relative gene expression level is the log-transformed normalized expression read counts mapped (unit: million reads).

Table 14: Genes whose expression is related to READ vs COAD tumor classification at the targeted FDR level.
Detected genes: EMB, HOXC4, HOXC8, ZNF595

Figure 7: Relative gene expression level comparison for selected genes in COAD (blue) and READ (yellow) samples. The non-parametric Wilcoxon signed-rank test is conducted, and the corresponding p-values are included at the top of the plot. The relative gene expression level is the log-transformed normalized expression read counts mapped (unit: million reads).
Unlike in tumor purity estimation, cancer genes play a more important role in tumor type classification. We detect several protein-coding genes and find previously published literature supporting our findings in both classification tests. Given that the mechanisms of cancer are complicated and that the cancer types we chose have been reported to be hard to distinguish, these results show that our proposed KOBT algorithm is robust in real data with unknown models.
In this article, we introduce a new model-free variable selection method, the knockoff boosted tree (KOBT). Given the nature of boosted tree models, no prior knowledge of the model topology is required. We extend the application of knockoff methods to tree models and propose two new sampling methods for generating knockoffs, the principal component construction knockoff and the sparse Gaussian knockoff. Unlike currently available methods, the principal component construction knockoff does not depend on a Gaussian assumption for the design matrix. To evaluate the power of our generated knockoffs in simulation tests, we define a new test statistic, the mean absolute angle of columns, which measures the average distance between column vectors in two matrices. We also apply the kernel maximum mean discrepancy to test whether our proposed knockoffs are drawn from the same distributions as the original vectors. In our boosted tree framework, we fix most hyperparameters at multiple reasonable levels and leave the regularization parameters to be tuned by Bayesian optimization. We consider different importance scores in tree models, and combinations of different knockoffs and importance test statistics are listed in our results. We test our strategy on different unknown true models, including main-effect, interaction, exponential, and second-order models. Finally, we apply our algorithm to tumor purity estimation and tumor type classification using TCGA gene expression data; KOBT performs well, and our results match previously published literature. We provide an R package implementing the proposed approaches, available at https://cran.r-project.org/web/packages/KOBT/index.html.
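A brief sketch of the two knockoff diagnostics mentioned above. The mean_abs_angle() function is one plausible formalization of the mean absolute angle of columns (the angle between each original column and its matched knockoff column), and kmmd() from the kernlab package carries out the kernel maximum mean discrepancy two-sample test.

```r
library(kernlab)   # kmmd(): kernel maximum mean discrepancy two-sample test

## One plausible reading of the "mean absolute angle of columns": the average
## angle between each original column of X and its matched knockoff column in Xk.
mean_abs_angle <- function(X, Xk) {
  angles <- vapply(seq_len(ncol(X)), function(j) {
    cosine <- sum(X[, j] * Xk[, j]) /
      (sqrt(sum(X[, j]^2)) * sqrt(sum(Xk[, j]^2)))
    acos(min(max(cosine, -1), 1))   # clamp for numerical safety
  }, numeric(1))
  mean(abs(angles))
}

mean_abs_angle(X, Xk)   # smaller values indicate knockoffs closer to the originals

## Kernel MMD test: are the knockoff rows drawn from the same distribution as X?
kmmd(X, Xk)
```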
Acknowledgements

This research was supported by the Intramural Research Program of the NIH, National Institute of Environmental Health Sciences.
Table 15: XGBoost parameters in real data applications.

Booster: gbtree
Maximum number of trees:
Learning rate:
Maximum tree depth:
Minimum leaf weight:
Sub-sample:
Sub-feature:
γ: tuning region [0, ] for Bayesian optimization
λ: tuning region [0, ] for Bayesian optimization
α: tuning region [0, ] for Bayesian optimization
Objective function: reg:squarederror & binary:logistic
Evaluation metric: root mean square error & classification error

The early stopping rule is to stop adding new trees when the average test loss in cross-validation does not improve within five additional trees.

Table 16: Genes whose expression is related to tumor purity using the principal component knockoff, at the targeted FDR level.

BRCA: CSF2RB, CCDC69, FGR, CXorf65, INSL3, UNQ6494
SKCM: RHOH, CCDC69, CCL22, ALPK2, ANKRD40, BBS1, BEST4, C11orf70, C2orf27A, CAPN13, CASP5, CCDC30, CHDH, CHRM4, CTAGE9, CTCF, CXCL6, DEFA1B, EIF1B, FAHD1, FAM78A, FBXW8, FERMT1, GAL3ST2, GRM4, GSTM5, IGF2, KCNRG, KDM4D, KHDC1, KIR2DL3, KLF11, LAMA1, LOC100129034, LOC100129066, LOC91450, MIA, MYBPC3, NAP1L2, OLAH, OSTalpha, PADI4, PCCB, PER1, PER2, PHC1, PLXNB3, PROCA1, PRRG4, PTK7, PTPN5, PTRH2, PYROXD2, RAP1GAP2, RGL4, RIC3, RNF126P1, RYBP, SEC61A1, SGSM2, SMAD2, SOAT1, SRPRB, TXNDC3, USP30, VNN3, ZIC1
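As a rough guide, the settings in Table 15 map onto an xgboost cross-validation call along the following lines. The numeric values shown are placeholders rather than the values used in the paper, and gamma, lambda, and alpha would in practice be proposed by the Bayesian optimization loop rather than fixed here.

```r
library(xgboost)

dtrain <- xgb.DMatrix(data = as.matrix(X), label = y)

params <- list(
  booster          = "gbtree",
  objective        = "reg:squarederror",  # "binary:logistic" for the classification runs
  eval_metric      = "rmse",              # "error" for the classification runs
  eta              = 0.1,                 # learning rate (placeholder)
  max_depth        = 3,                   # placeholder
  min_child_weight = 1,                   # placeholder
  subsample        = 0.8,                 # placeholder
  colsample_bytree = 0.8,                 # placeholder
  gamma            = 0.5,                 # candidate from the Bayesian-optimization region
  lambda           = 0.5,                 # candidate from the Bayesian-optimization region
  alpha            = 0.5                  # candidate from the Bayesian-optimization region
)

cv <- xgb.cv(params = params, data = dtrain, nrounds = 1000, nfold = 5,
             early_stopping_rounds = 5,   # stop when CV loss fails to improve for 5 rounds
             verbose = 0)
cv$best_iteration   # number of trees retained under the early stopping rule
```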
Figure 8: Relative gene expression level comparison for other genes in low (yellow) and high (blue) purity samples of BRCA. The non-parametric Wilcoxon signed-rank test is conducted for each low-high pair, and the corresponding p-values are included at the top of the plot. The relative gene expression level is the log-transformed normalized expression read counts mapped (unit: million reads).

Figure 9: Relative gene expression level comparison for other genes in low (yellow) and high (blue) purity samples of SKCM. The non-parametric Wilcoxon signed-rank test is conducted for each low-high pair, and the corresponding p-values are included at the top of the plot. The relative gene expression level is the log-transformed normalized expression read counts mapped (unit: million reads).
Figure 10: Plots of the Kaplan-Meier estimators and p-values for coefficients in Cox regression of BRCA. Samples with high gene expression levels are in red, and samples with low gene expression levels are in blue.
Figure 11: Plots of the Kaplan-Meier estimators and p-values for coefficients in Cox regression of SKCM.
Figure 12: (Continued) Plots of the Kaplan-Meier estimators and p-values for coefficients in Cox regression of SKCM.
References
Aran, D., Sirota, M., and Butte, A. J. (2015). Systematic pan-cancer analysis of tumour purity. Nature Communications, 6:8971.

Barber, R. F., Candès, E. J., et al. (2015). Controlling the false discovery rate via knockoffs. The Annals of Statistics, 43(5):2055-2085.

Bhatlekar, S., Fields, J. Z., and Boman, B. M. (2014). Hox genes and their role in the development of human cancers. Journal of Molecular Medicine, 92(8):811-823.

Bien, J. and Tibshirani, R. J. (2011). Sparse estimation of a covariance matrix. Biometrika, 98(4):807-820.

Breiman, L. (2001). Random forests. Machine Learning, 45(1):5-32.

Breiman, L. and Cutler, A. (2008). Random forests: classification manual.

Breiman, L., Friedman, J., Olshen, R., and Stone, C. (1984). Classification and Regression Trees. Wadsworth Int. Group, 37(15):237-251.

Candes, E., Fan, Y., Janson, L., and Lv, J. (2018). Panning for gold: 'model-X' knockoffs for high dimensional controlled variable selection. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 80(3):551-577.

Carter, S. L., Cibulskis, K., Helman, E., McKenna, A., Shen, H., Zack, T., Laird, P. W., Onofrio, R. C., Winckler, W., Weir, B. A., et al. (2012). Absolute quantification of somatic DNA alterations in human cancer. Nature Biotechnology, 30(5):413.

Chen, T. and Guestrin, C. (2016). XGBoost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 785-794. ACM.

Chung, E., Romano, J. P., et al. (2013). Exact and asymptotically robust permutation tests. The Annals of Statistics, 41(2):484-507.

Friedman, J., Hastie, T., Tibshirani, R., et al. (2000). Additive logistic regression: a statistical view of boosting (with discussion and a rejoinder by the authors). The Annals of Statistics, 28(2):337-407.

Friedman, J. H. (2001). Greedy function approximation: a gradient boosting machine. The Annals of Statistics, pages 1189-1232.

Friedman, J. H. (2002). Stochastic gradient boosting. Computational Statistics & Data Analysis, 38(4):367-378.

Friedman, J. H. et al. (1991). Multivariate adaptive regression splines. The Annals of Statistics, 19(1):1-67.

Gao, C., Sun, H., Wang, T., Tang, M., Bohnen, N. I., Müller, M. L., Herman, T., Giladi, N., Kalinin, A., Spino, C., et al. (2018). Model-based and model-free machine learning techniques for diagnostic prediction and classification of clinical outcomes in Parkinson's disease. Scientific Reports, 8(1):7129.

Golub, T. R., Slonim, D. K., Tamayo, P., Huard, C., Gaasenbeek, M., Mesirov, J. P., Coller, H., Loh, M. L., Downing, J. R., Caligiuri, M. A., et al. (1999). Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science, 286(5439):531-537.

Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., and Smola, A. J. (2007). A kernel method for the two-sample-problem. In Advances in Neural Information Processing Systems, pages 513-520.

Hastie, T., Tibshirani, R., Friedman, J., and Franklin, J. (2005). The elements of statistical learning: data mining, inference and prediction. The Mathematical Intelligencer, 27(2):83-85.

Hoadley, K. A., Yau, C., Hinoue, T., Wolf, D. M., Lazar, A. J., Drill, E., Shen, R., Taylor, A. M., Cherniack, A. D., Thorsson, V., et al. (2018). Cell-of-origin patterns dominate the molecular classification of 10,000 tumors from 33 types of cancer. Cell, 173(2):291-304.

Houseman, E. A., Accomando, W. P., Koestler, D. C., Christensen, B. C., Marsit, C. J., Nelson, H. H., Wiencke, J. K., and Kelsey, K. T. (2012). DNA methylation arrays as surrogate measures of cell mixture distribution. BMC Bioinformatics, 13(1):86.

Kim, B.-M., Buchner, G., Miletich, I., Sharpe, P. T., and Shivdasani, R. A. (2005). The stomach mesenchymal transcription factor Barx1 specifies gastric epithelial identity through inhibition of transient Wnt signaling. Developmental Cell, 8(4):611-622.

Kim, B.-M., Woo, J., Kanellopoulou, C., and Shivdasani, R. A. (2011). Regulation of mouse stomach development and Barx1 expression by specific microRNAs. Development, 138(6):1081-1086.

Kim, H. and Loh, W.-Y. (2001). Classification trees with unbiased multiway splits. Journal of the American Statistical Association, 96(454):589-604.

LeCun, Y., Bengio, Y., et al. (1995). Convolutional networks for images, speech, and time series. The Handbook of Brain Theory and Neural Networks, 3361(10):1995.

Li, L., Dennis Cook, R., and Nachtsheim, C. J. (2005). Model-free variable selection. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 67(2):285-299.

Li, Y., Kang, K., Krahn, J. M., Croutwater, N., Lee, K., Umbach, D. M., and Li, L. (2017). A comprehensive genomic pan-cancer classification using the Cancer Genome Atlas gene expression data. BMC Genomics, 18(1):508.

Li, Y., Umbach, D. M., Bingham, A., Li, Q.-J., Zhuang, Y., and Li, L. (2019). Putative biomarkers for predicting tumor sample purity based on gene expression data. BMC Genomics, 20(1):1-12.

Liu, C., Min, L., Kuang, J., Zhu, C.-s., Qiu, X.-y., and Zhu, L. (2019). Bioinformatic identification of miR-622 key target genes and experimental validation of the miR-622-RNF8 axis in breast cancer. Frontiers in Oncology, 9:1114.

Loh, W.-Y. (2002). Regression trees with unbiased variable selection and interaction detection. Statistica Sinica, pages 361-386.

Lundberg, S. M., Erion, G. G., and Lee, S.-I. (2018). Consistent individualized feature attribution for tree ensembles. arXiv preprint arXiv:1802.03888.

Lundberg, S. M. and Lee, S.-I. (2017a). Consistent feature attribution for tree ensembles. arXiv preprint arXiv:1706.06060.

Lundberg, S. M. and Lee, S.-I. (2017b). A unified approach to interpreting model predictions. In Advances in Neural Information Processing Systems, pages 4765-4774.

Miller, P. J., Lubke, G. H., McArtor, D. B., and Bergeman, C. (2016). Finding structure in data using multivariate tree boosting. Psychological Methods, 21(4):583.

Nielsen, D. (2016). Tree boosting with XGBoost. NTNU Norwegian University of Science and Technology.

Opgen-Rhein, R. and Strimmer, K. (2007). Accurate ranking of differentially expressed genes by a distribution-free shrinkage approach. Statistical Applications in Genetics and Molecular Biology, 6(1).

Rashmi, K. V. and Gilad-Bachrach, R. (2015). DART: Dropouts meet multiple additive regression trees. In AISTATS, pages 489-497.

Rätsch, G., Onoda, T., and Müller, K.-R. (2001). Soft margins for AdaBoost. Machine Learning, 42(3):287-320.

Saabas, A. (2014). Interpreting random forests. Diving into Data, 24.

Schiltz, F., Masci, C., Agasisti, T., and Horn, D. (2018). Using regression tree ensembles to model interaction effects: a graphical approach. Applied Economics, 50(58):6341-6354.

Shen, A., Fu, H., He, K., and Jiang, H. (2019). False discovery rate control in cancer biomarker selection using knockoffs. Cancers, 11(6):744.

Snoek, J., Larochelle, H., and Adams, R. P. (2012). Practical Bayesian optimization of machine learning algorithms. In Advances in Neural Information Processing Systems, pages 2951-2959.

Su, X., Meneses, K., McNees, P., and Johnson, W. O. (2011). Interaction trees: exploring the differential effects of an intervention programme for breast cancer survivors. Journal of the Royal Statistical Society: Series C (Applied Statistics), 60(3):457-474.

Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B (Methodological), 58(1):267-288.

Tsubokawa, F., Nishisaka, T., Takeshima, Y., and Inai, K. (2002). Heterogeneity of expression of cytokeratin subtypes in squamous cell carcinoma of the lung: with special reference to CK14 overexpression in cancer of high-proliferative and lymphogenous metastatic potential. Pathology International, 52(4):286-293.

Turley, S. J., Cremasco, V., and Astarita, J. L. (2015). Immunological hallmarks of stromal cells in the tumour microenvironment. Nature Reviews Immunology, 15(11):669.

Yoshihara, K., Shahmoradgoli, M., Martínez, E., Vegesna, R., Kim, H., Torres-Garcia, W., Treviño, V., Shen, H., Laird, P. W., Levine, D. A., et al. (2013). Inferring tumour purity and stromal and immune cell admixture from expression data. Nature Communications, 4:2612.

Zhang, T., Yu, B., et al. (2005). Boosting with early stopping: Convergence and consistency. The Annals of Statistics, 33(4):1538-1579.

Zou, H. and Hastie, T. (2005). Regularization and variable selection via the elastic net.