Massively-Parallel Feature Selection for Big Data
Ioannis Tsamardinos⋆ · Giorgos Borboudakis⋆ · Pavlos Katsogridakis · Polyvios Pratikakis · Vassilis Christophides
Keywords: feature selection · variable selection · forward selection · Big Data · data analytics

Abstract We present the Parallel, Forward-Backward with Pruning (PFBP) algorithm for feature selection (FS) in Big Data settings (high dimensionality and/or sample size). To tackle the challenges of Big Data FS, PFBP partitions the data matrix both in terms of rows (samples, training examples) and columns (features). By employing the concepts of p-values of conditional independence tests and meta-analysis techniques, PFBP manages to rely only on computations local to a partition while minimizing communication costs. It then employs powerful and safe (asymptotically sound) heuristics to make early, approximate decisions, namely Early Dropping of features from consideration in subsequent iterations, Early Stopping of consideration of features within the same iteration, and Early Return of the winner in each iteration. PFBP provides asymptotic guarantees of optimality for data distributions faithfully representable by a causal network (Bayesian network or maximal ancestral graph). Our empirical analysis confirms a super-linear speedup of the algorithm with increasing sample size and linear scalability with respect to the number of features and processing cores, while dominating other competitive algorithms in its class.

⋆ Equal contribution.
1 Introduction

Creating predictive models from data requires sophisticated machine learning, pattern recognition, and statistical modeling techniques. When applied in Big Data settings, these algorithms need to scale not only to millions of training instances (samples) but also to millions of predictive quantities (interchangeably called features, variables, or attributes); see [8, 75, 77] for the evolution of Big Data dimensionality in various ML datasets. A common way to reduce data dimensionality consists of selecting only a subset of the original features that retains all of the predictive information regarding an outcome of interest T. Specifically, the objective of the Feature Selection (FS) problem can be defined as identifying a feature subset that is of minimal size and collectively (multivariately) optimally predictive w.r.t. T, where optimal predictiveness is meant with respect to an ideal predictor (see [62] for a discussion). This definition covers what we call single FS; the problem of multiple FS can be defined as identifying all minimal and optimally-predictive subsets, but it has received much less study in the literature [33, 60]. By removing irrelevant as well as redundant (related to the concept of weakly relevant) features, FS essentially facilitates the learning task. It results in predictive models with fewer features that are easier to inspect, visualize, and understand, and faster to apply. Thus, FS provides valuable intuition on the data-generating mechanism and is a primary tool for knowledge discovery; deep connections have been found between the solutions of the FS problem and the causal mechanisms that generate the data [62]. Indeed, FS is often the primary task of an analysis, while predictive modeling is only a by-product.

Designing a FS algorithm is challenging because FS is by definition a combinatorial problem; FS is NP-hard even for linear regression problems [70]. An exhaustive search of all feature subsets is impractical except for quite small feature spaces. Heuristic search strategies and approximating assumptions are required to scale up FS, ranging from convex relaxations and parametric assumptions such as linearity (e.g., the Lasso algorithm [61]) to causally-inspired, non-parametric assumptions, such as faithfulness of the data distribution to a causal model [49, 59].

Specifically, in the context of Big Data featuring high dimensionality and/or high sample volume, computations become so CPU- and data-intensive that they cannot be handled by a single machine. The main challenges arising in this context are (a) how data can be partitioned both horizontally (over samples) and vertically (over features), called hybrid partitioning, so that computations can be performed locally in each block and combined globally with minimal communication overhead; and (b) what heuristics can quickly (e.g., without the need to go through all samples) and safely (providing theoretical guarantees of correctness) eliminate irrelevant and redundant features. Hybrid partitioning over both data samples and learned models [36, 73] is an open research issue in Big ML algorithms, while safe FS heuristics have been proposed only for sparse Big Data [53, 58], i.e., for data where a large percentage of values are the same (typically zeros).

To address these challenges we introduce the Parallel, Forward-Backward with Pruning (PFBP) algorithm for Big Volume Data. PFBP does not rely on data sparsity and is generally applicable to both dense and sparse datasets; in the future, it could be extended to include optimizations specifically designed for sparse datasets. PFBP is based on statistical tests of conditional independence
and is inspired by statistical causal modeling, which represents the joint probability distribution as a causal model, and specifically by the theory of Bayesian networks and maximal ancestral graphs [48, 54, 59].

To tackle parallelization with hybrid partitioning (challenge (a) above), PFBP bases its decisions on p-values and log-likelihoods returned by independence tests computed locally on each data partition; these values are then combined using statistical meta-analysis techniques to produce global approximate p-values and log-likelihoods. This technique essentially minimizes PFBP's communication cost, as only local p-values and log-likelihoods need to be communicated from workers to the master node in a cluster of machines at each iteration of the algorithm. (Alternatively, one can combine the test statistics that produce the p-values; this is conceptually equivalent, although there may be differences in practice.)

To reduce the number and workload of iterations required to compute a FS solution (challenge (b) above), PFBP relies on several heuristics. First, it adapts for Big Data a heuristic called Early Dropping, recently introduced in [9]. Early Dropping removes features from subsequent iterations, thus significantly speeding up the algorithm. Then, PFBP is equipped with two new heuristics, for Early Stopping of consideration of features within the same iteration, and for Early Returning of the current best feature for addition or removal. The three heuristics are implemented using bootstrap-based statistical tests. They are applied on the set of currently available local p-values and log-likelihoods to determine whether the algorithm has seen enough samples to safely (i.e., with high probability) make early decisions.

PFBP is proven to compute the optimal feature set for distributions faithful [59] (also called stable distributions [49]) to a causal network represented as a Bayesian network or a maximal ancestral graph [54, 59]. These are data distributions whose set of conditional independencies coincides with the set of independencies entailed by a causal graph and the Markov condition [49, 59]. Assuming faithfulness of the data distribution has led to algorithms that have proven competitive in practice [3, 4, 9, 33, 34, 35, 41, 51, 63, 64]. We also note that PFBP computations are not bound to specific data types; by supplying different conditional independence tests, PFBP becomes applicable to a wide variety of data types and target variables [33] (continuous, ordinal, nominal, time-to-event).

The paper is organized as follows. In Section 2 we provide a brief introduction to the basic concepts required to introduce our FS algorithm. The PFBP algorithm is introduced in Section 3. In Section 4 we explain the heuristics used by PFBP in detail, and show how to implement them using bootstrap-based tests. Guidelines for setting the hyper-parameter values for the data partitioning used by PFBP are presented in Section 5. In Section 6 we list some implementation details of PFBP required for a fast and robust implementation. The theoretical properties of PFBP are presented in Section 7. A high-level theoretical comparison of PFBP to alternative feature selection algorithms, as well as an overview of feature selection methods for Big Data, is given in Section 8. Finally, in Section 9 we evaluate PFBP on synthetic data and compare it to alternative forward-selection algorithms on 11 binary classification datasets.

Table 1: Common acronyms, terms, and mathematical notation (left) used throughout the paper, with a short description (right).
FBS: Forward-Backward Selection
PFBP: Parallel Forward-Backward with Pruning
UFS: Univariate Feature Selection
SFO: Single Feature Optimization
ED: Early Dropping
ES: Early Stopping
ER: Early Return
Iteration: forward (backward) iteration of PFBP
Phase: forward (backward) loop of PFBP
Run: execution of a forward and a backward Phase by PFBP
Feature Subset: subset of features
Sample Subset: subset of samples
Data Block: contains the samples of one Sample Subset and one Feature Subset
Group Sample: set of Sample Subsets
Group: set of Data Blocks corresponding to the Sample Subsets in a Group Sample
X: random variable
X (bold): set of random variables
|X|: number of elements in X
T: outcome (or target) variable
Test(X_k, T | S): conditional independence test of X_k with T given S
p_k: p-value of Test(X_k, T | S) (for some S)
df: degrees of freedom
α: significance level threshold
D: dataset (2-D matrix)
F: features in D
F_j: j-th Feature Subset
nf: number of Feature Subsets
f: number of features in each Feature Subset
S_i: i-th Sample Subset
ns: number of Sample Subsets
s: number of samples in each Sample Subset
G_q: q-th Group Sample
Q: number of Group Samples
C: number of Sample Subsets per Group Sample
D_{i,j}: Data Block with rows S_i and columns F_j
Π: matrix of local p-values
π_{i,j}: local p-value of the j-th alive variable in Π, computed on the rows in S_i
π: vector of combined log p-values
π_i: combined log p-value for the i-th alive variable
S: set of Selected features
R: set of Remaining features
A: set of Alive features
B: number of bootstrap iterations used by the bootstrap tests
b: index corresponding to the b-th bootstrap sample
P_drop: threshold used by the bootstrap test for Early Dropping
P_stop: threshold used by the bootstrap test for Early Stopping
P_return: threshold used by the bootstrap test for Early Return
lt: tolerance level used by the bootstrap test for Early Return

2 Background

In this section, we provide the basic notation used throughout the paper and present the core algorithmic and statistical reasoning techniques exploited by the proposed FS algorithm.
Algorithm 1 Forward-Backward Selection (FBS)
Input: Dataset D, Target T, Significance Level α
Output: Selected Features S
  S ← ∅ // Selected features, initially empty
  // Forward Phase: iterate until no more features can be selected
  while S changes do
      // Identify V* with minimum p-value conditional on S
      V* ← argmin_{V ∈ V_D \ S} Pvalue(T, V | S)
      // Select V* if conditionally dependent with T given S
      if Pvalue(T, V* | S) ≤ α then
          S ← S ∪ {V*}
      end if
  end while
  // Backward Phase: iterate until no more features can be removed from S
  while S changes do
      // Identify V* with maximum p-value conditional on S \ V*
      V* ← argmax_{V ∈ S} Pvalue(T, V | S \ V)
      // Remove V* if conditionally independent with T given S \ V*
      if Pvalue(T, V* | S \ V*) > α then
          S ← S \ {V*}
      end if
  end while
  return S

Random variables are denoted using upper-case letters (e.g., X), while sets of random variables are denoted using bold upper-case letters (e.g., Z). We use |Z| to refer to the number of variables in Z. The terms variable and feature are used interchangeably, and the outcome (or target) variable is denoted as T. A summary of acronyms, terms, and notation is given in Table 1.

2.1 Forward-Backward Feature Selection

The Forward-Backward Selection algorithm (FBS) is an instance of the stepwise feature selection algorithm family [32, 69]. It is also one of the first and most popular algorithms for causal feature selection [41, 64]. In each forward Iteration, FBS selects the feature that provides the largest increase in predictive performance for T and adds it to the set of selected variables, denoted with S from here on, starting from the empty set. The forward Phase ends when no feature further improves performance or a maximum number of selected features has been reached. In each Iteration of the backward Phase, the feature that most confidently does not reduce performance is removed from S. The backward Phase stops when no feature can be removed without reducing performance. We use the term Phase to refer to the forward and backward loops of the algorithm, and Iteration to refer to the part that decides which feature to add or remove next.
To determine whether predictive performance is increased or decreased when a single feature is added or removed in a greedy fashion, FBS uses conditional independence tests. (Alternatively, one can use information criteria such as AIC [2] and BIC [56], or out-of-sample methods such as cross-validation, to evaluate the performance of the current set of selected features; see [32, 69] for more details.) An important advantage of methods relying on conditional independence tests is that they allow one to adapt and apply the algorithm to any type of outcome (e.g., nominal, ordinal, continuous, time-to-event, time-course, time series) for which an appropriate statistical test of conditional independence exists; for example, the R package MXM [33] includes asymptotic, permutation-based, and robust tests for nominal, ordinal, continuous, time-course, and censored time-to-event targets. This way, the same feature selection algorithm can deal with different data types.

Conditional independence of X with T given S implies that P(T | S, X) = P(T | S) whenever P(S) > 0 (S is allowed to be the empty set). Thus, when conditional independence holds, X is not predictive of T when S (and only S) is known. A conditional independence test assumes the null hypothesis that feature X is probabilistically independent of T (i.e., redundant) given a set of variables S, and is denoted by Test(X, T | S). The test returns a p-value, which corresponds to the probability of obtaining deviations from what is expected under the null hypothesis as extreme as or more extreme than the deviation actually observed in the given data. When p_k ≡ Test(X_k, T | S) is low, the null hypothesis can be safely rejected: the value of X_k does provide predictive information for T when the values of S are known. In practice, decisions are made using a threshold α (significance level) on the p-values; the null hypothesis is rejected if the p-value is below α.

In the context of feature selection, the p-values p_k returned by statistical hypothesis tests of conditional independence are employed not only to reject or accept hypotheses, but also to rank the features according to the predictive information they provide for T given S. Intuitively, this can be justified by the fact that, everything else being equal (i.e., sample size, type of test), the p-values of such tests in case of dependence have (on average) the reverse ordering of the conditional associations of the variables with T given S. So, the basic variant of the algorithm adds (removes) the feature with the lowest (highest) p-value in each forward (backward) Iteration. The Forward-Backward Selection algorithm using conditional independence tests is summarized in Algorithm 1. We use V_D to denote the set of variables contained in dataset D (excluding T). The Pvalue(T, X | S) function performs a conditional independence test of T and X given S and returns a p-value.
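To make the control flow of Algorithm 1 concrete, here is a minimal Python sketch of FBS. The pvalue callable stands in for any conditional independence test; all names and signatures are illustrative assumptions, not the paper's implementation.

```python
def forward_backward_selection(data, features, target, pvalue, alpha=0.05):
    """pvalue(target, X, S, data) -> p-value of Test(X, T | S)."""
    selected = set()
    changed = True
    while changed:                                    # forward Phase
        changed = False
        candidates = set(features) - selected
        if not candidates:
            break
        pvals = {f: pvalue(target, f, selected, data) for f in candidates}
        best = min(pvals, key=pvals.get)
        if pvals[best] <= alpha:                      # dependent given S: select it
            selected.add(best)
            changed = True
    changed = True
    while changed:                                    # backward Phase
        changed = False
        if not selected:
            break
        pvals = {f: pvalue(target, f, selected - {f}, data) for f in selected}
        worst = max(pvals, key=pvals.get)
        if pvals[worst] > alpha:                      # independent given S \ {f}: remove it
            selected.remove(worst)
            changed = True
    return selected
```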
2.2 Implementing Independence Tests Using the Likelihood-Ratio Technique

There are several methods for assessing conditional independence, such as likelihood-ratio based tests (or asymptotically equivalent approximations thereof, like score tests and Wald tests [18]) and kernel-based tests [76]. We focus on likelihood-ratio based tests hereafter, mostly because they are general and can be applied to different data types (e.g., continuous, ordinal, nominal, and time-to-event, to name a few), although the main algorithm is not limited to such tests and can be applied with any type of test.

To construct a likelihood-ratio test for conditional independence of T with X given S, one needs a statistical model that maximizes the log-likelihood of the data LL(D; θ) ≡ log P(D | θ) over a set of parameters θ. Without loss of generality, we assume hereafter that T is binary and consider the binary logistic regression model. For logistic regression, the parameters θ are weight coefficients for each feature in the model and an intercept term. Subsequently, two statistical models have to be created for T: (i) model M_0 using only variables S, and (ii) model M_1 using S and X, resulting in corresponding log-likelihoods LL_0 and LL_1. The null hypothesis of independence now becomes equivalent to the hypothesis that both log-likelihoods are asymptotically equal. The test statistic of the test is called the deviance and is defined as

D ≡ 2 × (LL_1 − LL_0)

Notice that the difference of the log-likelihoods corresponds to the ratio of the likelihoods, hence the name likelihood-ratio test. The test statistic is known to asymptotically follow a χ² distribution with P_1 − P_0 degrees of freedom [72], where P_1 and P_0 are the degrees of freedom of models M_1 and M_0 respectively. (An implicit assumption made here is that the models are correctly specified. If this does not hold, the statistic follows a different distribution [21]. There exist methods that handle the more general case [68, 71], but this is clearly out of this paper's scope.) When X is a continuous feature, only one more parameter is added to θ, so the difference in degrees of freedom is 1 in this case. Categorical predictors can be used by simply encoding them as K − 1 binary indicator variables, where K is the number of possible values of the original feature. In this case, the difference in degrees of freedom is K − 1. Knowing the theoretical distribution of the statistic allows one to compute the p-value of the test: p = 1 − cdf(D, df), where cdf is the cumulative distribution function of the χ² distribution with df degrees of freedom and D is the observed deviance.
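As an illustration of this construction, the following sketch computes the p-value of Test(X, T | S) with binary logistic regression. It assumes statsmodels and scipy are available; the helper name lr_test and its signature are ours.

```python
import numpy as np
import statsmodels.api as sm
from scipy.stats import chi2

def lr_test(y, S_cols, x_new):
    """y: binary target (n,); S_cols: (n, |S|) matrix for S; x_new: (n,) candidate X."""
    n = len(y)
    ones = np.ones((n, 1))                       # intercept term
    X0 = np.hstack([ones, S_cols]) if S_cols.size else ones
    X1 = np.hstack([X0, x_new.reshape(-1, 1)])
    LL0 = sm.Logit(y, X0).fit(disp=0).llf        # model M0 on S
    LL1 = sm.Logit(y, X1).fit(disp=0).llf        # model M1 on S and X
    D = 2.0 * (LL1 - LL0)                        # deviance
    df = X1.shape[1] - X0.shape[1]               # = 1 for a continuous X
    return chi2.sf(D, df)                        # p = 1 - cdf(D, df)
```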
Likelihood-ratio tests can be constructed for any type of data for which an algorithm for maximizing the data likelihood exists, such as binary, multinomial or ordinal logistic regression, linear regression, and Cox regression, to name a few.

Likelihood-ratio tests are approximate in the sense that the test statistic has a χ² distribution only asymptotically. When the sample size is low, the asymptotic approximation may return inaccurate p-values. Thus, to apply approximate tests it is important to ensure that a sufficient number of samples is available. This issue is treated in detail in the context of PFBP and the logistic test in Section A. Note that the aforementioned models and the corresponding independence tests are only suited for identifying linear dependencies; certain types of non-linear dependencies may also be identifiable if one also includes interaction terms and feature transformations in the models.

2.3 Combining p-values Using Meta-Analysis Techniques

A set of p-values stemming from testing the same null hypothesis (e.g., testing the conditional independence of X and Y given Z) can be combined into a single p-value using statistical meta-analysis techniques. Multiple such methods exist in the literature [39]. Fisher's combined probability test [20] is one such method that has been shown to work well across many cases [39]. It assumes that the p-values are independent and combines them into a single statistic using the formula

Statistic ≡ −2 Σ_{i=1}^{K} log(p_i)

where K is the number of p-values, p_i is the i-th p-value, and log is the natural logarithm. The statistic is then distributed as a χ² random variable with 2·K degrees of freedom, from which a combined p-value is computed.
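Fisher's method translates directly into code; the function name below is ours, and scipy.stats also ships an equivalent combine_pvalues routine.

```python
import numpy as np
from scipy.stats import chi2

def fisher_combine(pvalues):
    """Combine independent p-values testing the same null hypothesis."""
    pvalues = np.asarray(pvalues, dtype=float)
    statistic = -2.0 * np.sum(np.log(pvalues))       # -2 sum_i log(p_i)
    return chi2.sf(statistic, df=2 * len(pvalues))   # chi^2 with 2K degrees of freedom

# e.g. fisher_combine([0.04, 0.10, 0.07]) yields a single combined p-value
```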
2.4 Bootstrap-Based Hypothesis Testing

The bootstrap procedure [17] can be used to compute the distribution of a statistic of interest. Bootstrapping is employed in the PFBP algorithm for making early, probabilistic decisions. Bootstrapping is a general-purpose, non-parametric, resampling-based procedure which works as follows: (a) resample with replacement from the input values a sample of equal size, (b) compute the statistic of interest on the bootstrap sample, and (c) repeat steps (a) and (b) many times to get an estimate of the bootstrap distribution of the statistic. The bootstrap distribution can then be used to compute properties of the distribution such as confidence intervals, or to evaluate some condition on the statistic; a simple example application of the latter follows.

Let µ_X denote the mean of random variable X and let µ̂_X denote the estimate of the mean of X given a sample of X. Assume we are given a sample of size n of random variable X and we want to compute the probability that the mean of X is larger than 10, P(µ_X > 10). This probability can be estimated as follows: (a) sample with replacement n values of X to create the b-th bootstrap sample X^b, (b) estimate the mean of X^b, denoted as µ̂^b_X, and compute I(µ̂^b_X > 10), where I is the indicator function returning 1 if the inequality holds and 0 otherwise, and (c) repeat (a) and (b) B times (e.g., B = 1000). P(µ_X > 10) is then estimated as

P(µ_X > 10) = [ I(µ̂_X > 10) + Σ_{b=1}^{B} I(µ̂^b_X > 10) ] / (B + 1)

Note that we also compute the statistic on the original sample, and thus divide by B + 1.
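The mean example above can be coded as follows; the function name and defaults are ours.

```python
import numpy as np

def prob_mean_greater(x, threshold=10.0, B=1000, seed=None):
    """Bootstrap estimate of P(mu_X > threshold) from a sample x."""
    rng = np.random.default_rng(seed)
    count = int(np.mean(x) > threshold)                 # original sample counts too
    for _ in range(B):
        xb = rng.choice(x, size=len(x), replace=True)   # bootstrap sample X^b
        count += int(np.mean(xb) > threshold)
    return count / (B + 1)                              # hence the division by B + 1
```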
2.5 Probabilistic Graphical Models and Markov Blankets

In this section, we give a brief overview of Bayesian networks and maximal ancestral graphs, which will be used later to present the theoretical properties of the proposed algorithm. A more extensive and rigorous treatment can be found in [3, 54, 59].

A Bayesian network B = ⟨G, P⟩ consists of a directed acyclic graph G over a set of vertices V and a joint distribution P over random variables that correspond one-to-one to vertices in V (thus, no distinction is made between variables and vertices). The Markov condition has to hold between G and P: every variable X is conditionally independent of its non-descendants in G given its parents, denoted by Pa(X). The Markov condition leads to a factorization of the joint probability, P(V) = Π_i P(X_i | Pa(X_i)). Thus, the graph G determines a factorization of the probability distribution, directly implying that some independencies have to hold, and further entailing (along with the other probability axioms) some additional conditional independencies. A Bayesian network is called faithful if all and only the conditional independencies in P are entailed by the Markov condition. Conceptually, the faithfulness condition means that all independencies in the distribution of the data are determined by the structure of the graph G and not by the actual parameterization of the distribution. A distribution P is called faithful (to a Bayesian network) if there is a graph G such that B = ⟨G, P⟩ is faithful.

Under the Markov and faithfulness assumptions, a graphical criterion called d-separation [47, 66] can be used to read off the dependencies and independencies encoded in a Bayesian network. To define d-separation, the notion of colliders is used: colliders are triplets of variables ⟨X, Y, Z⟩ with X and Z having directed edges into Y. Two variables X and Y are d-connected by a set of variables Z if and only if there exists a (not necessarily directed) path p between X and Y such that (i) for each collider V on p, V is either in Z or some descendant of V is in Z, and (ii) no non-collider on p is in Z. In case no such path exists, X and Y are d-separated given Z. Thus, the Markov and faithfulness conditions imply that if two variables X and Y are d-separated (d-connected) given Z, then they are conditionally independent (dependent) given Z.

A distribution class strictly larger than the set of faithful distributions to Bayesian networks is the set of distributions that are marginals of faithful distributions. Unfortunately, marginals of faithful distributions are not always faithful to some Bayesian network! Thus, marginalization over some variables loses the faithfulness property: the marginal distribution cannot always be faithfully represented by a Bayesian network. However, faithful marginal distributions can be represented by another type of graph, called a directed maximal ancestral graph [54] or DMAG. DMAGs include not only directional edges, but also bi-directional edges. DMAGs are extensions of Bayesian networks for marginal distributions and are closed under marginalization. The representation of a marginal of a faithful (to a Bayesian network) distribution by a DMAG is again faithful, in the sense that all and only the conditional independencies in the distribution are implied by the Markov condition.
The set of conditional independencies entailed by a DMAG is provided by a criterion similar to d-separation, now called m-separation.

A Markov blanket of T with respect to a set of variables V is defined as a minimal set S such that (V \ S ⊥ T | S), where (X ⊥ T | S) denotes the conditional independence of X with T given S. Thus, a Markov blanket of T is any minimal set that renders all other variables conditionally independent. An important theorem connects the Markov blanket of T with the feature selection problem for T: under broad conditions [41, 62], a Markov blanket of T is a solution to the feature selection problem for T. When the distribution is faithful to a Bayesian network or a DMAG, the Markov blanket of T is unique. (Some recent algorithms [33, 60] deal with the problem of solution multiplicity in feature selection.) In other words, for faithful distributions, the Markov blanket of T has a direct graphical interpretation: it consists of all vertices adjacent to T, and all vertices that are reachable from T through a collider path, which is a path where all vertices except the start and end vertices are colliders [9]. For Bayesian networks, this corresponds to the set of parents (vertices with an edge to T), children (vertices with an edge from T), and spouses (parents of children) of T in G.
3 Overview of the PFBP Algorithm

We provide an overview of our algorithm, called Parallel, Forward-Backward with Pruning (PFBP), an extension of the basic Forward-Backward Selection (FBS) algorithm (see Section 2.1 for a description). We use the terminology introduced for FBS: a forward (backward) Phase refers to the forward (backward) loop of FBS, and an Iteration refers to each loop iteration that decides which variable to select (remove) next. PFBP is presented in "evolutionary" steps, where successive enhancements are introduced in order to make computations local or to reduce computation and communication costs; the complete algorithm is presented in Section 3.4. To evaluate the predictive performance of candidate features we use p-values of conditional independence tests, as described in Section 2.1. We assume the data are provided in the form of a 2-dimensional matrix D, where rows correspond to training instances (samples) and columns to features (variables), and one of the variables is the target variable T. Physically, the data matrix is partitioned into sub-matrices D_{i,j} and stored in a distributed fashion on workers in a cluster running Spark [74] or a similar platform. Workers perform local computations on each D_{i,j} in parallel, and a master node performs the centralized, global computations.

3.1 Data Partitions in Blocks and Groups and Parallelization Strategy

We now describe the way D is partitioned into sub-matrices to enable parallel computations. First, the set of available features (columns) F is partitioned into about equal-sized Feature Subsets {F_1, ..., F_nf}. Similarly, the samples (rows) are randomly partitioned into about equal-sized Sample Subsets {S_1, ..., S_ns}. The row and column partitioning defines sub-matrices called Data Blocks D_{i,j} with rows S_i and features F_j. The Sample Subsets are assigned to Q Group Samples {G_q} of size C each, where each Group Sample G_q is a set {S_{q_1}, ..., S_{q_C}} (i.e., the set of Sample Subsets is partitioned). The Data Blocks D_{i,j} with samples S_i ∈ G_q belong to the same Group. This second, higher level of grouping is required by the bootstrap tests explained in Section 4. Data Blocks in the same Group are processed in parallel by different workers (provided enough are available). However, Groups are processed sequentially, i.e., computation in all Blocks within a Group has to complete before computation in the Blocks of the next Group begins. Obviously, if there are more workers than Data Blocks, there is no need for defining Groups. The data partitioning scheme is shown in Figure 1 (left), and a sketch of it is given below.
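The following sketch illustrates the hybrid partitioning on an in-memory matrix; in PFBP the blocks would live on different workers (e.g., as Spark partitions), so the function and its names are purely illustrative.

```python
import numpy as np

def partition(D, ns, nf, C, seed=None):
    """Split matrix D into Data Blocks D[i, j] and group the Sample Subsets."""
    rng = np.random.default_rng(seed)
    rows = rng.permutation(D.shape[0])                           # random row assignment
    sample_subsets = np.array_split(rows, ns)                    # S_1, ..., S_ns
    feature_subsets = np.array_split(np.arange(D.shape[1]), nf)  # F_1, ..., F_nf
    groups = [list(range(q, min(q + C, ns)))                     # C Sample Subsets per Group
              for q in range(0, ns, C)]
    blocks = {(i, j): D[np.ix_(si, fj)]                          # Data Block D_{i,j}
              for i, si in enumerate(sample_subsets)
              for j, fj in enumerate(feature_subsets)}
    return blocks, groups
```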
Fig. 1: Left: Data partitioning of the algorithm. At the top, the initial data matrix D is shown with 6 features and instances I_1, ..., I_m. At the bottom, the 6 features are partitioned into Feature Subsets F_1 and F_2, the rows are randomly partitioned into Sample Subsets S_1, ..., S_ns, and the Sample Subsets are assigned to Group Samples. Each Block D_{i,j} is physically stored as a unit. Right: Example trace of a forward Iteration of PFBP. (a) The Remaining features, Alive features, and Selected features are initialized. (b) All Data Blocks in the first Group are processed in parallel (by workers). (c) The resulting local p-values are collected (reduced) on a master node for each Alive feature and Sample Subset in the first Group (as well as the likelihoods, not shown in the figure). (d) Bootstrap-based tests determine which features to Early Drop or Stop based on Π, or whether to Early Return (based on Λ, not shown in the figure). The sets R and A are updated accordingly; in this example, three features are Dropped and two remain Alive. Notice that always A ⊆ R. (e) The second Group is processed in parallel (by workers). (f) New local p-values for all features still Alive are appended to Π. If this was the last Group, global p-values for the Alive features would be computed and the feature with the minimum value would be selected for inclusion in S. (g) In case two features are deemed almost equally predictive (based on their log-likelihoods), the current best is Early Returned.

Details of how the number of Sample Subsets ns, the number of Feature Subsets nf, and the number C of Sample Subsets per Group Sample are determined are provided in Section 5.

3.2 Computing Global p-values by Combining Local p-values Using Meta-Analysis

Recall that Forward-Backward Selection uses p-values stemming from conditional independence tests to rank the variables and to select the best one for inclusion (forward Phase) or exclusion (backward Phase). Extending the conditional independence tests to be computed over multiple Data Blocks is not straightforward and may be computationally inefficient. For conditional independence tests based on regression models (e.g., logistic or Cox regression), a maximum-likelihood estimation over all samples has to be performed, which typically does not have a closed-form solution and thus requires an iterative procedure (e.g., Newton descent). Due to its iterative nature, this incurs a high communication cost, rendering it computationally inefficient, especially for feature selection on Big Data, where many models have to be fit at each Iteration.

Instead of fitting full (global) regression models, we propose to perform the conditional independence tests locally on each Data Block, and to combine the resulting p-values using statistical meta-analysis techniques. Specifically, the algorithm computes local p-values, denoted by π_{i,k}, for candidate feature X_k from only the rows in S_i of a Data Block D_{i,j}, where F_j contains the feature X_k. This enables massive parallelization of the algorithm, as each Data Block can be processed independently and in parallel by a different worker. The local p-values π_{i,k} are then communicated to the master node of the cluster and stored in a matrix Π; we use π_{i,k} to refer to the element of matrix Π corresponding to the local p-value of X_k computed on a Data Block containing the samples in Sample Subset S_i.
Using the p-values in matrix Π, the master node combines the p-values into global p-values for each feature X_k using Fisher's combined probability test [20] (Fig. 1, Right (b)). (Naturally, any method for combining p-values can be used instead of Fisher's method, but we did not investigate this further in this work.) Finally, we note that this approach is not limited to regression-based tests; it can be used with any type of conditional independence test, and is most appropriate for tests that are hard to parallelize or computationally expensive (e.g., kernel-based tests [76]).

Using Fisher's combined probability test to combine local p-values does not necessarily lead to the same p-value as the one computed over all samples. There are no guarantees on how close those p-values will be in case the null hypothesis of conditional independence holds, except that they are uniformly distributed between 0 and 1. In case the null hypothesis does not hold, however, one expects to reject the null hypothesis using either method in the sample limit. What matters in practice is that PFBP often makes the same decision at each Iteration, that is, the top-ranked variable is often the same. Note that even if the top-ranked variable is not the same, PFBP may still perform well, as long as some other informative variable is ranked first. The accuracy of Fisher's combined probability test is further investigated in experiments on synthetic data, presented in Section A, where we show that, if the sample size per Data Block is sufficiently high, combining p-values leads to the same decision.

For the computation of the local p-values on D_{i,j}, the samples S_i of the selected features S are required, and thus these data need to be broadcast to every worker processing D_{i,j} whenever S is augmented, i.e., at the end of each forward Iteration. In total, the communication cost of the algorithm is due to the assembly of all local p-values π_{i,k} to determine the next feature to include (exclude), as well as the broadcast of the data of the newly added feature in S at the end of each forward Iteration. We would like to emphasize that the bulk of the computation of the algorithm, the calculation of local p-values requiring expensive statistical tests, takes place on the workers in parallel. The central computations on the master are minimal.
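On the master, the combination step amounts to a column-wise Fisher test over the local p-value matrix Π; a vectorized sketch (names ours) follows.

```python
import numpy as np
from scipy.stats import chi2

def global_pvalues(Pi):
    """Pi: (k, |A|) local p-values, one row per Sample Subset processed so far."""
    statistic = -2.0 * np.sum(np.log(Pi), axis=0)   # Fisher statistic per feature
    return chi2.sf(statistic, df=2 * Pi.shape[0])   # combined p-value per feature

# the winning candidate of the Iteration is alive[np.argmin(global_pvalues(Pi))]
```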
3.3 Speeding Up PFBP Using Pruning Heuristics

In this section, we present the three pruning heuristics used by PFBP to speed up computation. Implementation details of the heuristics using locally computed p-values are presented in Section 4.

3.3.1 Early Dropping of Features from Subsequent Iterations

The first addition to PFBP is the Early Dropping (ED) heuristic, first introduced in [9] for a non-parallel version of Forward-Backward Selection. Let R denote the set of remaining features, that is, the set of features still under consideration for selection. Initially, R = F \ S, where F is the set of all available features and S is the set of selected features, which is initially empty. At each forward Iteration, ED removes from R all features that are conditionally independent of the target T given the set of currently selected features S. Typically, just after the first few Iterations of PFBP, only a very small proportion of the features still remains in R, leading to orders-of-magnitude efficiency improvements even in the non-parallel version of the algorithm [9]. When the set of variables R becomes empty, we say that PFBP has finished one Run. Unfortunately, the Early Dropping heuristic without further adjustments may miss important features which seem uninformative at first, but provide information for T when considered together with features selected in subsequent Iterations. Variables should be given additional opportunities to be selected by performing more Runs. Each additional Run calls the forward Phase again, starting with the previously selected variables S and re-initializing the remaining variables to R = F \ S. By default, PFBP uses 2 Runs, although a different number of Runs may be used. Typically a value of 1 or 2 is sufficient in practice, with larger values requiring more computational time while also giving stronger theoretical guarantees; the theoretical properties of PFBP with ED are described in detail in Section 7. In short, PFBP with 2 Runs, assuming no statistical errors in the conditional independence tests, returns the Markov blanket of T in distributions faithful to a Bayesian network. Overall, by discarding variables at each Iteration, the Early Dropping heuristic allows the algorithm to scale with feature size.

3.3.2 Early Stopping of Features within the Same Iteration
The next addition to the algorithm regards Early Stopping (ES) of consideration of features within the same Iteration, i.e., while selecting the next best feature in a forward Iteration or the feature to remove in a backward Iteration. To implement ES we introduce the set A of features still Alive (i.e., under consideration) in the current Iteration, initialized to A = R at the beginning of each Iteration (see Figure 1, Right (b)). As the master node gathers local p-values for a feature X_k from several Data Blocks, it may be able to determine that no more local p-values need to be computed for X_k. This is the case if these p-values are enough to safely decide that, with high probability, X_k is not going to be selected for inclusion (forward Phase) or exclusion (backward Phase) in this Iteration. In this case, X_k is removed from the set of alive features A and is not further considered in the current Iteration. This allows PFBP to quickly filter out variables which will not be selected at the current Iteration. Thus, ES leads to a super-linear speed-up of the feature selection algorithm with respect to the sample size: even if the samples are doubled, the same features will be Early Stopped, and p-values will not be computed for these features on the extra samples.
3.3.3 Early Return of the Winning Feature

The final heuristic of the algorithm is called Early Return (ER). Recall that Early Dropping removes features conditionally independent of T given S from this and subsequent Iterations, while Early Stopping removes non-winners from the current Iteration. However, even using both heuristics, the algorithm will keep producing local p-values for features X_j and X_k that are candidates for selection and at the same time are informationally indistinguishable (equally predictive given S) with regard to T (this is the case when the residuals of X_j and X_k given S are almost collinear). When two or more features are both candidates for selection and almost indistinguishable, it does not make sense to go through the remaining data: all choices are almost equally good. Hence, Early Return terminates the computation in the current Iteration and returns the current best feature X_j if, with high probability, it is not going to be much worse than the best feature at the end of the Iteration (see Fig. 1, Right (c)). Again, the result is that computation in the current Iteration may not process all Groups. The motivation behind Early Return is similar to that of Early Stopping, in that it tries to quickly determine the next variable to select. The difference is that Early Return tries to quickly determine whether a variable is "good enough" to be selected, in contrast to Early Stopping, which discards unpromising variables.

A technical detail is that judging whether two features X_i and X_j are "equally predictive" is implemented using the log-likelihoods λ_i and λ_j of the models with predictors S ∪ {X_i} and S ∪ {X_j}, instead of the corresponding p-values. The likelihoods are part of the computation of the p-values, and thus incur no additional computational overhead.

3.4 The Parallel Forward-Backward with Pruning Algorithm

We now present the proposed Parallel Forward-Backward with Pruning (PFBP) algorithm, shown in Algorithm 2. To improve readability, several arguments are omitted from function calls. PFBP takes as input a dataset D and the target variable of interest T. Initially, the number of Sample Subsets ns and the number of Feature Subsets nf are determined as described in Section 5. Then, (a) the samples are randomly assigned to Sample Subsets S_1, ..., S_ns, to avoid any systematic biases (see also Section 6.1), (b) the Sample Subsets S_1, ..., S_ns are assigned to Q approximately equal-sized Groups G_1, ..., G_Q, (c) the features are assigned to Feature Subsets F_1, ..., F_nf, in order of occurrence in the dataset, and (d) the dataset D is partitioned into Data Blocks D_{i,j}, with each such block containing the samples and features corresponding to Sample Subset S_i and Feature Subset F_j respectively. The selected variables S are initialized to the empty set. The main loop of the algorithm performs up to maxRuns Runs, as long as the selected variables S change. Each such Run executes a forward and a backward Phase.
Algorithm 2 Parallel Forward-Backward With Pruning (PFBP)
Input: Dataset D, Target T, Maximum Number of Runs maxRuns
Output: Selected Variables S
  // Data Partitioning
  Randomly assign samples to Sample Subsets S_1, ..., S_ns
  Assign Sample Subsets S_1, ..., S_ns to equally-sized Groups G_1, ..., G_Q
  Assign features to Feature Subsets F_1, ..., F_nf
  Partition D into Data Blocks D_{i,j} containing samples from S_i and features from F_j, ∀ i, j
  S ← ∅ // No selected variables
  run ← 1 // First run
  // Iterate until (a) the maximum number of runs is reached, or (b) the selected features S did not change
  while run ≤ maxRuns ∧ S changes do
      S ← OneRun(D, T, S)
      run ← run + 1
  end while
  return S

function OneRun(Data Blocks D, Target T, Selected Variables S, Maximum Number of Variables To Select maxVars)
  R ← F \ S // All variables remaining
  // Forward Phase: iterate until (a) the maximum number of variables has been selected, or (b) no new variable has been selected
  while |S| < maxVars ∧ S changes do
      ⟨S, R⟩ ← ForwardIteration(D, T, S, R)
  end while
  // Backward Phase: iterate until no variable can be removed
  while S changes do
      S ← BackwardIteration(D, T, S)
  end while
  return S

The OneRun function takes as input a set of Data Blocks D, the target variable T, a set of selected variables S, and a limit maxVars on the number of variables to select. It initializes the set of remaining variables R to all non-selected variables F \ S. Then, it executes the forward and backward Phases. The forward (backward) Phase executes forward (backward) Iterations until some stopping criteria are met. Specifically, the forward Phase terminates when the maximum number of variables maxVars has been selected or no more variables can be selected, while the backward Phase terminates when no more variables can be removed from S.

The forward and backward Iteration procedures are shown in Algorithms 3 and 4. ForwardIteration takes as input the Data Blocks D, the target variable T, and the current sets of remaining and selected variables; it performs a forward Iteration and outputs the updated sets of selected and remaining variables.
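The Run/Phase skeleton of Algorithm 2 maps onto Python as follows; forward_iteration and backward_iteration stand in for Algorithms 3 and 4, and all signatures are assumptions made for illustration.

```python
def pfbp(blocks, features, target, forward_iteration, backward_iteration,
         max_runs=2, max_vars=50):
    selected = set()
    for _ in range(max_runs):
        previous = set(selected)
        selected = one_run(blocks, features, target, selected,
                           forward_iteration, backward_iteration, max_vars)
        if selected == previous:                      # stop if S did not change
            break
    return selected

def one_run(blocks, features, target, selected,
            forward_iteration, backward_iteration, max_vars):
    remaining = set(features) - selected              # R = F \ S
    changed = True
    while changed and len(selected) < max_vars:       # forward Phase
        before = len(selected)
        selected, remaining = forward_iteration(blocks, target, selected, remaining)
        changed = len(selected) != before
    changed = True
    while changed:                                    # backward Phase
        before = len(selected)
        selected = backward_iteration(blocks, target, selected)
        changed = len(selected) != before
    return selected
```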
Algorithm 3 ForwardIteration
Input: Data Blocks D, Target T, Selected Variables S, Remaining Variables R
Output: Selected Variables S, Remaining Variables R
  A ← R // Initialize Alive Variables
  Π ← ∅ // Array of log p-values, initially empty
  Λ ← ∅ // Array of log-likelihoods, initially empty
  q ← 1 // Initialize current Group counter
  Q ← #Groups // Set Q to the total number of Groups
  while q ≤ Q do
      // Process the alive features A on all Data Blocks containing Sample Subsets in G_q (denoted as D^q), in parallel on workers, for the given T, S and A; compute sub-matrices Π^q and Λ^q from each block, and append the results to Π and Λ
      ⟨Π^q, Λ^q⟩ ← TestParallel(D^q, T, S, A)
      ⟨R, A⟩ ← EarlyDropping(Π, R, A)
      A ← EarlyStopping(Π, A)
      A ← EarlyReturn(Λ, A)
      Update Π and Λ (retain only columns of alive variables)
      // Stop if at most a single variable is alive
      if |A| ≤ 1 then
          break
      end if
      q ← q + 1
  end while
  if |A| > 0 then
      π ← Combine(Π) // Compute final combined p-values for all alive variables
      X_best ← argmin_{X_i ∈ A} π_i // Identify the best variable X_best
      // Select X_best if dependent with T given S
      if π_best ≤ α then
          S ← S ∪ {X_best}
          R ← R \ {X_best}
      end if
  end if
  return ⟨S, R⟩

ForwardIteration uses the variable set A to keep track of all alive variables, i.e., variables that are candidates for selection. The arrays Π and Λ contain the local log p-values and log-likelihoods, with ns rows (one for each Sample Subset) and |A| columns (one for each alive variable). The values of Π and Λ are initially empty, and are filled gradually as Groups are processed. We use D^q to denote all Data Blocks that correspond to Sample Subsets contained in Group G_q. Similarly, the values of Π and Λ corresponding to Group q and variables X are denoted as Π^q and Λ^q. In the main loop, the algorithm iteratively processes Groups in a synchronous fashion, until all Groups have been processed or no alive variable remains. The TestParallel function takes as input the Data Blocks D^q corresponding to the current Group G_q and performs all necessary independence tests in parallel on workers. The results, denoted as Π^q and Λ^q, are then appended to the Π and Λ matrices respectively. After processing a Group, the tests for Early Dropping, Early Stopping and Early Return are performed, using all local p-values computed up to Group q; details of the implementation of the EarlyDropping, EarlyStopping and EarlyReturn procedures when the data have only been partially processed are given in Section 4. The values of non-alive features are then removed from Π and Λ (see also Figure 1(f) for an example). If at most a single alive variable remains, processing stops. Note that this is not checked in the while-loop condition, in order to ensure that at least one Group has been processed even if the input set of remaining variables contains a single variable. Finally, the best alive variable (if such a variable exists) is selected if it is conditionally dependent with T given the selected variables S. Conditional dependence is determined using the p-value resulting from combining all local p-values available in Π.
Algorithm 4 BackwardIteration
Input: Data Blocks D, Target T, Selected Variables S
Output: Selected Variables S
  A ← S // Initialize Alive Variables
  Π ← ∅ // Array of log p-values, initially empty
  q ← 1 // Initialize current Group counter
  Q ← #Groups // Set Q to the total number of Groups
  while q ≤ Q do
      // Process the alive features A on all Data Blocks containing Sample Subsets in G_q (denoted as D^q), in parallel on workers, for the given T, S and A; compute the sub-matrix Π^q from each block and append it to Π
      Π^q ← TestParallel(D^q, T, S, A)
      A ← EarlyStoppingBackward(Π, A)
      Update Π (retain only columns of alive variables)
      // Stop if at most a single variable is alive
      if |A| ≤ 1 then
          break
      end if
      q ← q + 1
  end while
  if |A| > 0 then
      π ← Combine(Π) // Compute final combined p-values for all alive variables
      X_worst ← argmax_{X_i ∈ A} π_i // Identify the worst variable X_worst
      // Remove X_worst if independent with T given S \ X_worst
      if π_worst > α then
          S ← S \ {X_worst}
      end if
  end if
  return S

BackwardIteration is similar to ForwardIteration, with the exceptions that (a) the remaining variables are not needed, and thus no dropping is performed, (b) no early return is performed, and (c) the tests are reversed, i.e., the worst variable is removed.
3.5 Massively-Parallel Predictive Modeling

The technique of combining locally computed p-values into global ones to massively parallelize computations can be applied not only to feature selection, but also to predictive modeling. This way, at the end of the feature selection process one can obtain an approximate predictive model with no additional overhead! We exploit this opportunity in the context of independence tests implemented by logistic regression. During the computation of the local p-values π_{i,k}, a (logistic) model for T using all selected features S is produced from the samples in S_i. Such a model computes a set of coefficients β_i that weighs each feature in the model to produce the probability that T = 1. Methods for combining multiple models, such as the ones considered here, are described in [6]. We used the weighted univariate least squares (WLS) approach [26], with equal weights for each model; multivariate approaches may be more accurate and could also be applied in our case without any significant computational overhead, but were not considered further in this work. The WLS method with equal weights combines the N local models into a global one, β̂, by simply taking the average of the coefficient vectors of the models, i.e., β̂ = (1/N) Σ_{i=1}^{N} β_i. Thus, the only change to the algorithm is to cache each β_i and average them on the master node. By default, PFBP uses the WLS method to construct a predictive model at each forward Iteration.

Using the previous technique, one can obtain a model at the end of each Iteration and assess its predictive performance (e.g., accuracy) on a hold-out validation set. Constructing, for instance, the graph of the number of selected features versus the achieved predictive performance on the hold-out set can visually assist data analysts [31] in determining how many features to include in the final selection; an example application on SNP data is given in the experimental section. An automated criterion for selecting the best trade-off between the number of selected features and the achieved predictive performance could also be devised, although this is out of the scope of this paper, as multiple testing has to be taken into consideration.
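A sketch of the equal-weight WLS combination: the coefficient vectors β_i cached from the local logistic models are simply averaged, and the averaged model can be applied directly. Names are ours, and an intercept-first coefficient layout is assumed.

```python
import numpy as np

def combine_models(betas):
    """betas: list of N local coefficient vectors beta_i (intercept first)."""
    return np.mean(np.asarray(betas), axis=0)   # beta_hat = (1/N) sum_i beta_i

def predict_proba(beta_hat, X):
    """Probability that T = 1 under the combined logistic model."""
    z = beta_hat[0] + X @ beta_hat[1:]
    return 1.0 / (1.0 + np.exp(-z))
```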
4 Implementing the Pruning Heuristics Using Bootstrap Tests

Recall that the algorithm processes Group Samples sequentially. After processing each Group and collecting the results, PFBP applies the Early Dropping, Early Stopping and Early Return heuristics, computed on the master node, to filter out variables and reduce subsequent computation. Thus, all three heuristics involve making early probabilistic decisions based on the subset of the samples examined so far. Naturally, if all samples have been processed, Early Dropping can be applied on the combined p-values without making probabilistic decisions.

Before proceeding with the details, we provide the notation used hereafter. Let Π and Λ be 2-dimensional arrays containing K local log p-values and log-likelihoods for all alive variables in A and for all Groups already processed. The matrices reside on the master node and are updated each time a Group is processed. Let π_{i,j} and λ_{i,j} denote the i-th value of the j-th alive variable, denoted as X_j; recall that those values have been computed locally on the Data Block containing samples from Sample Subset S_i. For the sake of simplicity, we will use π_j and λ_j (l_j) to denote the combined p-value and the sum of log-likelihoods (likelihood) respectively of variable X_j. The vectors π and λ refer to the combined p-values and the sums of log-likelihoods of all alive variables respectively. Also, let X_best be the variable that would have been selected if no more Data Blocks were evaluated, that is, the one with the currently lowest combined p-value, denoted as π_best.

4.1 Bootstrap Tests for Early Dropping, Early Stopping and Early Return

The conditions tested by the heuristics are: (a) P(π_j ≥ α) > P_drop for Early Dropping of X_j (α is the significance level), (b) P(π_j > π_best) > P_stop for Early Stopping of X_j, and (c) ∀ X_j, P(l_best / l_j ≥ t) > P_return for Early Return of X_best (i.e., the probability must be larger than the threshold for all variables), where t is a tolerance parameter that determines how close the model with X_best is to the rest in terms of how well it fits the data, and takes values between 0 and 1; the closer it is to 1, the closer the model is in performance to all other models. By taking logarithms, (c) can be rewritten as ∀ X_j, P(λ_best − λ_j ≥ lt) > P_return, where lt = log(t).

We employ bootstrapping to test the above. A bootstrap sample b of Π (Λ), denoted as Π^b (Λ^b), is created by sampling with replacement K rows from Π (Λ). Then, for each such sample, the Fisher's combined p-values (sums of log-likelihoods) are computed by summing over all respective values for each alive variable; we refer to the vector of combined p-values (log-likelihoods) on bootstrap sample b as π^b (λ^b), and to its i-th element as π^b_i (λ^b_i). By performing the above B times, the probabilities (a), (b) and (c) can be estimated as:

P(π_j ≥ α) = [ I(π_j ≥ α) + Σ_{b=1}^{B} I(π^b_j ≥ α) ] / (B + 1)   (Early Dropping)

P(π_j > π_best) = [ I(π_j > π_best) + Σ_{b=1}^{B} I(π^b_j > π^b_best) ] / (B + 1)   (Early Stopping)

P(λ_best − λ_j ≥ lt) = [ I(λ_best − λ_j ≥ lt) + Σ_{b=1}^{B} I(λ^b_best − λ^b_j ≥ lt) ] / (B + 1)   (Early Return)

where I is the indicator function, which evaluates to 1 if the inequality holds and to 0 otherwise. For all of the above, the condition is also computed on the original sample, and the result is divided by the number of bootstrap iterations B plus 1. Note that for Early Return the above value is computed for all variables X_j.

Algorithms 5, 6 and 7 show the procedures in more detail. For all heuristics, a vector named cnts is used to keep track of how often the inequality is satisfied for each variable. To avoid clutter, the indicator function I performs the check for multiple variables and returns a vector of values, containing one value for each variable.
The function BootstrapSample creates a bootstrap sample as described above, the function Combine uses Fisher's combined probability test to compute a combined p-value, and SumRows sums over all rows of the log-likelihoods contained in Λ, returning a single value for each alive variable.

4.2 Implementation Details of Bootstrap Testing

We recommend using the same sequence of bootstrap indices for each variable and for each bootstrap test. The main reasons are to (a) simplify the implementation, (b) avoid mistakes, and (c) ensure that results do not change across different executions of the algorithms.
Algorithm 5 EarlyDropping
Input: Log p-values Π, Remaining Variables R, Alive Variables A, Number of Bootstrap Samples B, Significance Level Threshold α, ED Threshold P_drop
Output: Remaining Variables R, Alive Variables A
  π ← Combine(Π) // Combine log p-values Π using Fisher's c.p.t.
  cnts ← 0_{|A|} // Count vector of size equal to the number of alive variables
  cnts ← cnts + I(π ≥ α)
  for b = 1 to B do
      Π^b ← BootstrapSample(Π)
      π^b ← Combine(Π^b) // Combine log p-values Π^b using Fisher's c.p.t.
      cnts ← cnts + I(π^b ≥ α)
  end for
  // Drop variables whose p-value is larger than α with probability at least P_drop
  R ← R \ {X_i ∈ A : cnts_i / (B + 1) ≥ P_drop}
  A ← A \ {X_i ∈ A : cnts_i / (B + 1) ≥ P_drop}
  return ⟨R, A⟩
Algorithm 6 EarlyStopping
Input: Log p-values Π, Alive Variables A, Number of Bootstrap Samples B, ES Threshold P_stop
Output: Alive Variables A
  π ← Combine(Π) // Combine log p-values Π using Fisher's c.p.t.
  X_best ← argmin_{X_i ∈ A} π_i // Identify the variable with minimum Fisher's combined p-value
  cnts ← 0_{|A|} // Count vector of size equal to the number of alive variables
  cnts ← cnts + I(π_best < π)
  for b = 1 to B do
      Π^b ← BootstrapSample(Π)
      π^b ← Combine(Π^b) // Combine log p-values Π^b using Fisher's c.p.t.
      cnts ← cnts + I(π^b_best < π^b)
  end for
  // Exclude variables from A that are worse than X_best with probability at least P_stop
  A ← A \ {X_i ∈ A : cnts_i / (B + 1) ≥ P_stop}
  return A
Algorithm 7 EarlyReturn
Input: Log p-values Π, Log-likelihoods Λ, Alive Variables A, Number of Bootstrap Samples B, ER Threshold P_return, ER Tolerance lt
Output: Alive Variables A
1: π ← Combine(Π) // Combine log p-values Π using Fisher's c.p.t.
2: λ ← SumRows(Λ) // Sum rows of log-likelihoods Λ
3: X_best ← argmin_{X_i ∈ A} π_i // Identify variable with minimum Fisher's combined p-value
4: cnts ← 0_|A| // Count vector of size equal to the number of alive variables
5: cnts ← cnts + I(λ_best − λ > lt)
6: for b = 1 to B do
7: Λ^b ← BootstrapSample(Λ)
8: λ^b ← SumRows(Λ^b) // Sum rows of log-likelihoods Λ^b
9: cnts ← cnts + I(λ^b_best − λ^b > lt)
10: end for
11: // Select X_best early if better than all other variables with probability at least P_return
12: if ∀i, cnts_i/(B+1) ≥ P_return then
13: A ← {X_best}
14: end if
15: return A
4.2 Implementation Details of Bootstrap Testing

We recommend using the same sequence of bootstrap indices for each variable and for each bootstrap test. The main reasons are to (a) simplify the implementation, (b) avoid mistakes and (c) ensure that results do not change across different executions of the algorithms. This can be done by initializing the random number generator with the same seed. Next, note that ED, ES and ER do not necessarily have to be performed separately, but can be performed simultaneously (i.e., using the same bootstrap samplings). This allows the sampled indices to be reused for all tests and variables, saving some computational time. Another important observation for ED and ES is that the actual combined p-values are not required. It suffices to compare statistics instead, which are inversely related to p-values: larger statistics correspond to lower p-values. For the ED test, the statistic has to be compared to the statistic corresponding to the significance level α, which can be computed using the inverse χ² cumulative distribution. This is crucial for speeding up the procedure, as computing log p-values is computationally expensive. Finally, note that the exact probabilities for the tests are not required, and one can often decide early whether a probability can still exceed the threshold.
For example, let P_drop = 0.99 and B = 999. Then, in order to drop a variable X_i, the number of iterations cnts_i in which the p-value of X_i exceeds α has to be at least 990. If after K iterations cnts_i + (B − K) is less than 990, one can determine that X_i will not be dropped: even if the p-value of X_i is larger than α in all remaining bootstrap iterations, cnts_i + B − K will always be less than 990, and thus the probability P(π_i ≥ α) will be less than the threshold P_drop = 0.99.
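The shortcut reduces to a one-line bound; the helper below is a hypothetical illustration of the arithmetic in the example above, not PFBP code.

def can_still_drop(cnts_i, k, B=999, p_drop=0.99):
    # After k of the B bootstrap iterations, the most optimistic final count
    # for X_i is cnts_i + (B - k); if even that stays below p_drop * (B + 1)
    # (990 for the values above), X_i can be skipped for the rest of the test.
    return cnts_i + (B - k) >= p_drop * (B + 1)

# E.g., cnts_i = 150 after k = 200 iterations: 150 + 799 = 949 < 990,
# so X_i can never reach the dropping threshold in this test.
assert not can_still_drop(150, 200)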
We recommend values of 0.99 for P_drop and P_stop, and values of P_return = 0.95 and tol = 0.
B should be as large as possible, with a minimum recommended value of 500. By default, PFBP uses the above values.

5 Determining the Data Partitioning Parameters

The main parameters of the data partitioning to determine are (a) the sample size s of each Data Block, (b) the number of features f in each Data Block, and (c) the number of Sample Subsets C in each Group; the latter determines how many new p-values per feature are computed in each Group. Notice that s determines the horizontal partitioning of the data matrix and f its vertical partitioning. In general, the parameters are set so that Blocks are as small as possible to achieve high parallelization, without sacrificing feature selection accuracy: if the number of samples is too low, the local tests will have low power. In this section, we provide detailed guidelines for determining those parameters, and show how those values were set for the special case of PFBP using conditional independence tests based on binary logistic regression.

5.1 Determining the Required Sample Size for Conditional Independence Tests

For optimal computational performance, the number of Sample Sets should be as large as possible to increase parallelism, and each Sample Set should contain as few samples as possible to reduce the computational cost of performing the local conditional independence tests. Of course, this should be done without sacrificing the accuracy of feature selection: if the number of samples is too low, the local tests will have low power.

Various rules of thumb have appeared in the literature to choose a sufficient number of samples for linear, logistic and Cox regression [25, 50, 67]. We focus on the case of binary logistic regression hereafter. For binary logistic regression, it is recommended to use at least s = df · c / min(p₀, p₁) samples, where p₀ and p₁ are the proportions of the negative and positive classes in T respectively, df is the number of degrees of freedom in the model (that is, the total number of parameters, including the intercept term) and c is usually recommended to be between 5 and 20, with larger values leading to more accurate results. This rule is based on the events per variable (EPV) [50], and will be referred to as the EPV rule hereafter.

Rules like the above can be used to determine the number of samples s in each Sample Set, by setting the minimum number of samples in each Data Block so that the locally computed p-values are valid for the type of test employed in the worst case. The worst case occurs if the maximum number of features maxVars have been selected. If all features are continuous, df equals maxVars + 1. This can easily be adapted to the case of categorical features, by considering the maxVars variables with the most categories and setting df appropriately. By considering the worst-case scenario, the required number of samples can be computed by plugging the values of df, c, p₀ and p₁ into the EPV rule. We found that, although the EPV rule works reasonably well, it tends to overestimate the number of samples required for skewed class distributions. As a result, it may unnecessarily slow down PFBP in such cases. Ideally, for a given value of c the results should be equally accurate irrespective of the class distribution and the number of model parameters.

To overcome the drawbacks of the EPV rule, we propose another rule, called the STD rule, which is computed as s = df · c / √(p₀ · p₁). For balanced class distributions the result is identical to the EPV rule, while for skewed distributions the value is always smaller.
We found that a value of c = 10 works sufficiently well, and recommend always setting c to a minimum of 10; higher values could lead to more accurate results, but will also increase computation time. Again, the number of samples per Sample Set is determined as described above. A comparison of both rules is given in Appendix A, where we show that the STD rule behaves better than the EPV rule across different values of df and class distributions of the outcome.

5.2 Determining the Number of Features per Data Block

Given the sample size s of each data block D_{i,j}, the next hyper-parameter to decide is the number of features f in each block. The physical partitioning to Feature Sets is performed so that each Block fits within the memory of a cluster node. Such physical partitioning is required only for ultra-high dimensional datasets; it was never triggered in our implementation for the Sample Set sizes of the datasets employed in our experiments. In practice, features need to be partitioned only virtually, which is computationally cheaper. Specifically, enough (virtual) feature sets nf are created so that the number of Data Blocks in a Group (i.e., C × nf) is as close as possible to a desired oversubscription-of-machines parameter o. The o parameter dictates the average number of Blocks (tasks) assigned to a machine per Group. By default, the value of o is set to 1. In other words, we set nf = ⌊o · M/C⌋, where M is the number of available machines, so that each machine processes at least o blocks per Group.
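The two rules and the virtual partitioning reduce to a few lines of arithmetic. The sketch below assumes binary logistic regression with df = maxVars + 1 (all features continuous); the function names are ours, for illustration only.

import math

def epv_samples(df, p1, c=10):
    # EPV rule: s = df * c / min(p0, p1), with p1 the positive-class frequency.
    return df * c / min(1.0 - p1, p1)

def std_samples(df, p1, c=10):
    # STD rule: s = df * c / sqrt(p0 * p1); equal to EPV at 50/50,
    # strictly smaller for skewed class distributions.
    return df * c / math.sqrt((1.0 - p1) * p1)

def virtual_feature_sets(M, C, o=1):
    # nf = floor(o * M / C): enough virtual Feature Sets so that each of the
    # M machines receives about o Data Blocks per Group.
    return max(1, (o * M) // C)

# maxVars = 50 continuous features -> df = 51. With a 10% positive class,
# EPV asks for about 5100 samples per Sample Set, STD only for about 1700;
# with M = 64 machines and C = 15 Sample Sets per Group, nf = 4.
print(epv_samples(51, 0.1), std_samples(51, 0.1), virtual_feature_sets(64, 15))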
5.3 Setting the Number of Groups C

We now discuss the determination of the value of C, the number of Sample Sets in each Group. Recall that the value of C determines how many Sample Sets are processed in parallel (and thus, how many additional local p-values are added to the p-value matrix Π) before invoking the bootstrap tests that decide on Early Dropping, Stopping or Return. We set C = 15, as this provides enough local p-values for a bootstrap test to be performed already in the first Group. Smaller values would invoke the bootstrap test more often and present more opportunities for Early Dropping, Stopping and Return, but would also reduce the parallelization of the algorithm (since computations await the results of the bootstrap). The value 15 was chosen (without extensive experimentation) as a good trade-off between the two resources.

6 Implementation Details

In this section, we discuss several important details for an efficient and accurate implementation of PFBP. The main focus is on PFBP using conditional independence tests based on binary logistic regression, which is the test used in the experiments, although most details concern the general case or can be adapted to other conditional independence tests.

6.1 Accurate Combination of Local p-values Using Fisher's Method

In order to apply Fisher's combined probability test, the data distribution of each data block should be the same for the test to be valid. There should be no systematic bias in the data, or the combining process may exacerbate this bias (see [65]). Such bias may occur if blocks contain data from the same departments, stores, or branches, or from consecutive time moments when there is time-drift in the data distribution. This problem is easily avoided if, before the analysis, the partitioning of samples to blocks is done randomly, as done by PFBP.

Another important detail to observe in practice is to directly compute the logarithm of the p-value for each conditional independence test, instead of first computing the p-value and then taking the logarithm. As p-values tend to get smaller with larger sample sizes (in case the null hypothesis does not hold), they quickly reach the machine epsilon and are rounded to zero. If this happens, sorting and selecting features according to p-values breaks down, and PFBP will select an arbitrary feature. This behavior is further magnified in the case of combined p-values, as a single zero local p-value leads to a zero combined p-value no matter the values of the remaining p-values.

6.2 Increasing the Group Size

The ForwardIteration algorithm may be slow in cases where no variables are dropped or stopped for multiple consecutive iterations. In such cases, the algorithm will spend a large amount of time performing the bootstrap tests, even though in most cases no variables are removed from R or A. This can become especially problematic if the number of sample sets ns is very large. For this reason, we allow the algorithm to increase the Group size if, after processing two consecutive Groups, the alive and remaining variables stay the same. Specifically, we found that doubling the Group size works well in practice. This is identical to doubling the number of Groups processed, and thus minimal changes are required in the algorithm: one needs to keep track of the number of Groups to process in each iteration, and double that value if the alive and remaining variables do not change after the bootstrap tests. Finally, we note that the value of C is reset to its default value (15 in our case, see Section 5) after each forward Iteration.

6.3 Implementation of the Conditional Independence Test Using Logistic Regression for Binary Targets

The conditional independence test is the basic building block of PFBP, and thus using a fast and robust implementation is essential. Next, we briefly review optimization algorithms used for maximum likelihood estimation, mainly focusing on binary logistic regression in the context of feature selection using likelihood-ratio tests. A comprehensive introduction to and comparison of algorithms for fitting binary logistic regression models (i.e., finding the β that maximizes the likelihood) is provided in [46]. Three important classes of optimization algorithms are Newton's method, conjugate gradient descent and quasi-Newton methods. Of those, Newton's method is the most accurate and typically converges to the optimal solution in a few tens of iterations. Its main drawback is that each such iteration is slow, requiring O(n · d² + d³) computations, where n is the sample size and d the number of features. Conjugate gradient descent and quasi-Newton methods, on the other hand, require O(n · d) and O(n · d + d²) time per iteration respectively, but may take much longer to converge. Unfortunately, there are cases where those methods fail to converge to an optimal solution even after hundreds of iterations. This not only affects the accuracy of feature selection, but also leads to unpredictable running times. Most statistical packages include one or multiple implementations of logistic regression. Such implementations typically use algorithms that can handle thousands of predictors, with quasi-Newton methods being a popular choice.
For feature selection, however, one is typically interested in selecting a few tens or hundreds of variables. In anecdotal experiments, we found that for this case Newton's method is usually faster and more accurate, especially with fewer than 100-200 variables. Because of that, and because of the issues mentioned above, we used a fine-tuned, custom implementation of Newton's method.

There are some additional important details. First, there are cases where the Hessian is not invertible. One case where this happens is if the covariance matrix of the input data is singular, or close to singular; note that, due to the nature of the feature selection method, which considers one variable at a time, this can happen only if the newly added variable is (almost) collinear with some of the previously selected variables, in which case the variable would not be selected anyway. If the Hessian is not invertible, we switch to conjugate gradient descent using the fixed Hessian as a search direction for that iteration, as described in [46]. As a last resort, in case the fixed Hessian is not invertible either, we switch to simple gradient descent. Next, for all optimization methods there are cases in which the computed step-size has to be adjusted to avoid divergence, whether due to violations of assumptions or to numerical issues. One way to do this is to use inexact line-search methods, such as backtracking-Armijo line search [5], which we used in our implementation.

6.4 Score Tests for the Univariate Case

In the first step of forward selection, where no variable has been selected yet, one can use a score test (also known as the Lagrange multiplier test) instead of a likelihood-ratio test to quickly compute the p-value without having to actually fit logistic regression models. The statistic of the score test equals [27]

\[ \text{Statistic} \equiv \frac{\sum_{j=1}^{n} X_j (T_j - \bar{T})}{\sqrt{\bar{T}(1-\bar{T}) \sum_{j=1}^{n} (X_j - \bar{X})^2}} \]

where n is the number of samples, T is the binary outcome variable (using a 0/1 encoding), and X is the variable tested for independence. Note that such tests can also be derived for models other than binary logistic regression, but this is out of the scope of the paper. The score test is asymptotically equivalent to the likelihood-ratio test, and in anecdotal experiments we found that a few hundred samples are sufficient to get essentially identical results, justifying its use in Big Data settings. Using it in place of the likelihood-ratio test reduces the time of the univariate step significantly and is important for an efficient implementation, as the first step is usually the most computationally demanding one in the PFBP algorithm, given that a large portion of the variables will be dropped by the Early Dropping heuristic.
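As an illustration, the score test reduces to a few lines of NumPy. The sketch below squares the statistic and compares it against a χ² distribution with one degree of freedom, which is asymptotically equivalent to the two-sided normal test; the function name is ours.

import numpy as np
from scipy.stats import chi2

def univariate_score_logp(X, T):
    # Score test of independence between a feature X and a binary 0/1
    # outcome T; no logistic regression model is fitted.
    Tbar, Xbar = T.mean(), X.mean()
    num = np.sum(X * (T - Tbar))
    den2 = Tbar * (1.0 - Tbar) * np.sum((X - Xbar) ** 2)
    # The squared statistic is asymptotically chi^2 with 1 dof; return the
    # log p-value directly (see Section 6.1) to avoid underflow.
    return chi2.logsf(num * num / den2, df=1)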
7 Theoretical Properties

Assuming an oracle of conditional independence, it can be shown that the standard Forward-Backward Selection algorithm is able to identify the optimal set of features for distributions faithful to Bayesian networks or maximal ancestral graphs [9, 41, 64]. Unfortunately, the Early Dropping (ED) heuristic may compromise the optimality of the method. ED may remove features that are necessary for optimal prediction of T. Intuitively, these features provide no predictive information for T given S (are conditionally independent) but become conditionally dependent given a superset of S, i.e., after more features are selected. This problem can be overcome by using multiple Runs of the Forward-Backward Phases.

Recall that each Run reinitializes the remaining variables with R = F \ S. Thus, each subsequent Run provides each feature with another opportunity to be selected, even if it was Dropped in a previous one. The heuristic has a graphical interpretation in the context of probabilistic graphical models such as Bayesian networks and maximal ancestral graphs [48, 54, 59], inspired by modeling causal relations. A rigorous treatment of the Early Dropping heuristic and theorems regarding its optimality for distributions faithful to Bayesian networks and maximal ancestral graphs is provided in [9]; for the paper to be self-contained, we provide the main theorems along with proofs next.

We assume that PFBP has access to an independence oracle that determines whether a given conditional dependence or independence holds. Furthermore, we assume that the Markov and faithfulness conditions hold, which allow us to use the terms d-separated/m-separated and independent (dependent) interchangeably. We will use the weak union axiom, one of the semi-graphoid axioms [48] about conditional independence statements, which are general axioms holding in all probability distributions. The weak union axiom states that (X ⊥ Y ∪ W | Z) ⇒ (X ⊥ Y | Z ∪ W) holds for any such sets of variables.
Theorem 1 If the distribution can be faithfully represented by a Bayesian network, then PFBP with two runs identifies the Markov blanket of the target T.

Proof In the first run of PFBP, all variables adjacent to T (that is, its parents and children) will be selected, as none of them can be d-separated from T by any set of variables. In the next run, all variables connected to T through a collider path of length 2 (that is, the spouses of T) will become d-connected with T, since the algorithm conditions on all selected variables (including the children of T); thus the spouses will be selected, as at least one d-connecting path is open: the path through the selected child. The resulting set of variables includes the Markov blanket of T, but may also include additional variables. Next we show that all additional variables will be removed by the backward selection phase. Let MB(T) be the Markov blanket of T and S_ind = S \ MB(T) be all selected variables not in the Markov blanket of T. By definition, (T ⊥ X | MB(T)) holds for any set of variables X not in MB(T), and thus also for the variables in S_ind. By applying the weak union graphoid axiom, one can infer that ∀S_i ∈ S_ind, (T ⊥ S_i | MB(T) ∪ S_ind \ S_i) holds, and thus some variable S_j will be removed in the first iteration. Using the same reasoning and the definition of a Markov blanket, it can be shown that all variables in S_ind will be removed from S at some iteration. To conclude, it suffices to use the fact that variables in MB(T) will not be removed by the backward selection, as they are not conditionally independent of T given the remaining variables in MB(T). ⊓⊔
Theorem 2 If the distribution can be faithfully represented by a directed maximal ancestral graph, then PFBP with no limit on the number of runs identifies the Markov blanket of the target T.

Proof In the first run of PFBP, all variables adjacent to T (that is, its parents, children and variables connected with T by a bi-directed edge) will be selected, as none of them can be m-separated from T by any set of variables. After each run, additional variables may become admissible for selection. Specifically, after k runs all variables that are connected with T by a collider path of length k will become m-connected with T, and thus will be selected; we prove this next. Assume that after k runs all variables connected with T by a collider path of length at most k have been selected. Then, in the next run, any variable connected with T by a collider path through the selected variables will become m-connected with T. This is true because conditioning on a variable Y in a collider ⟨X, Y, Z⟩ m-connects X and Z; by applying this to each variable on some collider path, it is easy to see that its end-points become m-connected. Finally, after applying the backward selection phase, all variables that are not in the Markov blanket of T will be removed; the proof is identical to the one used in the proof of Theorem 1. ⊓⊔

8 Related Work

In this section we provide an overview of alternative parallel feature selection methods, focusing on methods for MapReduce-based systems (such as Spark) as well as causal-based methods, and compare them to PFBP. An overview of feature selection methods can be found in [24] and [37].

8.1 Parallel Univariate Feature Selection and Parallel Forward-Backward Selection

Univariate feature selection (UFS) applies only the first step of forward selection: it ranks all features based on some ranking criterion and selects either the top maxVars variables or all features that satisfy some selection criterion. An implementation for discrete data based on the chi-squared test of independence is provided in the Spark machine learning library MLlib [44]. In this case, all features are ranked based on the p-value of the test of unconditional independence with the outcome T, and features are selected either by choosing the top maxVars ones or by taking all features with a p-value below a fixed significance level α. Although not explicitly mentioned as feature selection methods, MLlib also contains implementations of the Pearson and Spearman correlation coefficients, which can be used similarly to perform univariate feature selection with continuous features and outcome variables. Furthermore, MLlib contains implementations of binomial, multinomial and linear regression, which can be used both for univariate feature selection and for forward-backward selection (FBS), by performing likelihood-ratio tests. The main advantages of PFBP over UFS and FBS are that (a) PFBP does not require specialized distributed implementations of independence tests, as it relies only on local computations and can thus use existing implementations, which is also much faster than fitting full models over all samples, and (b) it employs the Early Dropping, Early Stopping and Early Return heuristics, allowing it to scale both with the number of features and with the number of samples. Perhaps most importantly, (c) UFS will not necessarily identify the Markov blanket of T even in faithful distributions; the solution of UFS will contain false positives (e.g., redundant features) as well as false negatives (missed Markov blanket features).
8.2 Single Feature Optimization

Single Feature Optimization (SFO) [58] is a forward-selection algorithm for logistic regression designed for MapReduce architectures. Let S be the selected features up to some point and R = F \ S be all candidate variables for selection, and assume that a full logistic regression model M for T using S is available. SFO creates an approximate model for each variable R_i ∈ R by fixing the coefficients of S to their values in M and optimizing only the coefficient of R_i. This problem is much simpler than fitting full models for each remaining variable, significantly reducing running time. Then, the best variable R* is chosen based on those approximate models (using some performance measure such as the log-likelihood), and a full logistic regression model M* with S ∪ R* is created. Thus, at each iteration only a single full logistic regression model needs to be created. By default, SFO uses a maximum number of variables to select as a stopping criterion. Alternatively, to decide whether R* should be selected, a likelihood-ratio test can be used, in which case the test is performed on M and M*, and R* is selected if the p-value is below a threshold α; we used this in our implementation of SFO in the experiments. The parallelization over samples is performed in the map phase, in which one value p_j is computed for each sample j, equal to

\[ p_j = \frac{e^{\beta \cdot S_j}}{1 + e^{\beta \cdot S_j}} \]

where β are the coefficients of S in M and S_j are the values of S in the j-th sample. The values of p_j, the values of the outcome T and the values of all candidate variables R are then sent to workers to be processed in the reduce phase. Note that this incurs a high communication cost, as essentially the whole dataset has to be sent over the network. Finally, in the reduce phase, the workers fit the approximate logistic regression models for all variables in R in parallel.

Although SFO significantly improves over the standard forward selection algorithm in running time, it has four main drawbacks compared to PFBP: (a) it has a high communication cost, in contrast to PFBP, which requires only minimal information to be communicated; (b) to select a variable, all non-selected variables have to be considered, while PFBP employs the Early Dropping heuristic, which significantly reduces the number of remaining variables; (c) SFO always uses all samples, while PFBP uses Early Stopping and Early Return, allowing it to scale sub-linearly with the number of samples; and (d) SFO does not provide any theoretical guarantees.
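A sketch of SFO's key step may clarify why it is cheap: with the linear term z_j = β · S_j precomputed in the map phase, fitting the approximate model for a candidate is a one-dimensional Newton optimization. This is our own illustration of the idea in [58], with hypothetical names, not the reference implementation.

import numpy as np

def sfo_candidate_fit(z, x, T, iters=25, tol=1e-8):
    # Fit only the coefficient w of candidate x, holding the linear term
    # z = beta . S_j of the current model M fixed (SFO's approximate model).
    w = 0.0
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-(z + w * x)))   # P(T=1) per sample
        grad = np.sum(x * (T - p))               # dL/dw of the log-likelihood
        hess = -np.sum(x * x * p * (1.0 - p))    # d2L/dw2, always <= 0
        step = grad / hess
        w -= step                                # 1-D Newton update
        if abs(step) < tol:
            break
    p = np.clip(1.0 / (1.0 + np.exp(-(z + w * x))), 1e-12, 1.0 - 1e-12)
    loglik = np.sum(T * np.log(p) + (1 - T) * np.log1p(-p))
    return w, loglik   # candidates are ranked by their approximate log-likelihood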
8.3 Information-Theoretic Feature Selection

Information-theoretic feature selection (ITFS) methods are applicable only to discrete variables, although discretization methods can be used to also handle continuous variables. ITFS relies on computations of the mutual information and the conditional mutual information, and many variations have appeared in the literature [11]; an implementation for Spark is available at https://spark-packages.org/package/sramirez/spark-infotheoretic-feature-selection. We provide a brief description next. The criterion J used by many ITFS methods to evaluate a feature X_k can be expressed as

\[ J(X_k) = I(T; X_k) - \beta \sum_{X_j \in S} I(X_j; X_k) + \gamma \sum_{X_j \in S} I(X_j; X_k \mid T) \]

where β and γ are parameters taking values in [0, 1], and the feature maximizing J with respect to the current set of selected variables S is chosen at each step. (There are methods that do not fall into this framework, but we will not go into more detail; see [11].) All of those criteria approximate the conditional mutual information (CMI) I(T; X_k | S) using different assumptions on the data distributions. Observe that both ITFS and forward selection are essentially identical, but use different criteria for selecting the next best feature. Forward selection using regression models (e.g., logistic regression) assumes a specific probabilistic model to approximate the CMI. Thus, both approaches use different types of approximations to the CMI. We note that forward selection can also be used for discrete data with the CMI criterion, using the G-test of conditional independence [1].

The main advantage of ITFS over PFBP is that computing (conditional) mutual information does not require fitting any model, and thus can be performed efficiently even in a Big Data setting. ITFS methods can additionally take trivial advantage of sparse data, further speeding up computation. However, ITFS methods do not have the theoretical properties of PFBP, which can be shown to be optimal for distributions that can be faithfully represented by Bayesian networks and maximal ancestral graphs. This stems from the fact that PFBP solves an inherently harder problem, as it considers all selected variables at each iteration in order to select the next feature, while ITFS only conditions on one variable at a time. Furthermore, ITFS methods are not as general as PFBP, which can be applied to various data types as long as an appropriate conditional independence test is available. For example, it is not clear if and how ITFS can be applied to time-to-event outcome variables, whereas PFBP can be directly applied if a likelihood-ratio test based on Cox regression is used. Last but not least, ITFS methods are only applicable to discrete data; in case of continuous variables, a discretization method has to be applied before feature selection, possibly losing information [16, 30]. This also increases computational time and may require extra tuning to find a good discretization of the features.
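For concreteness, the following sketch evaluates the criterion J with plug-in (conditional) mutual information estimates for discrete NumPy arrays; β = γ = 1 is just an illustrative choice, and specific ITFS members correspond to specific settings or scalings of β and γ [11].

import numpy as np

def mi(x, y):
    # Plug-in estimate of the mutual information I(x; y) for discrete
    # arrays, in nats: sum over the joint of p * log(p / (px * py)).
    xs, xi = np.unique(x, return_inverse=True)
    ys, yi = np.unique(y, return_inverse=True)
    joint = np.zeros((xs.size, ys.size))
    np.add.at(joint, (xi, yi), 1.0)
    joint /= x.size
    px, py = joint.sum(axis=1), joint.sum(axis=0)
    nz = joint > 0
    return float(np.sum(joint[nz] * np.log(joint[nz] / np.outer(px, py)[nz])))

def cmi(x, y, z):
    # I(x; y | z) = sum_v P(z = v) * I(x; y | z = v), for discrete z.
    return sum((z == v).mean() * mi(x[z == v], y[z == v]) for v in np.unique(z))

def itfs_criterion(xk, T, selected, beta=1.0, gamma=1.0):
    # J(X_k) = I(T;X_k) - beta * sum_j I(X_j;X_k) + gamma * sum_j I(X_j;X_k|T)
    return (mi(T, xk)
            - beta * sum(mi(xj, xk) for xj in selected)
            + gamma * sum(cmi(xj, xk, T) for xj in selected))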
8.4 Feature Selection with Lasso

Lasso [61] implicitly performs feature selection while fitting a model. The feature selection problem is expressed as a global optimization problem using an L1 penalty on the feature coefficients. We describe it in more detail next, focusing on likelihood-based models such as logistic regression. Let D(θ) be the deviance of a model using n parameters θ. The optimization problem Lasso solves can be expressed as

\[ \min_{\theta \in \mathbb{R}^n} D(\theta) + \lambda \lVert \theta \rVert_1 \]

where ‖θ‖₁ is the L1 norm and λ ≥ 0; λ controls the number of non-zero coefficients in the solution, with larger values leading to sparser solutions. This problem formulation is a convex approximation of the more general best subset selection (BSS) problem [45], defined as follows to match the Lasso optimization formulation:

\[ \min_{\theta \in \mathbb{R}^n} D(\theta) + \lambda \lVert \theta \rVert_0 \]

where ‖θ‖₀ equals the total number of variables with non-zero coefficients. The BSS problem has been shown to be NP-hard [70], and thus most approaches, such as Lasso and forward selection, rely on some type of approximation to solve it. (Recently, there have been efforts towards exact algorithms for the BSS problem using mixed-integer optimization formulations for linear regression [7] and logistic regression [55].) In contrast to Lasso, which uses a different constraint on the values of the coefficients (an L1 instead of an L0 penalty), forward-selection-type algorithms perform a greedy optimization of the BSS problem, by including the next best variable at each step; see [22, 45] for a more thorough treatment of the topic. A sufficient condition for optimality of PFBP and FBS is that the distribution can be faithfully represented by a Bayesian network or maximal ancestral graph (see Section 7). Conditions for optimal feature selection with Lasso are given in [43].

In extensive simulations it has been shown that causal-based feature selection methods are competitive with Lasso on classification and survival analysis tasks on many real datasets [3, 33, 34, 35]. Furthermore, the non-parallel version of PFBP (called Forward-Backward Selection with Early Dropping) as well as the standard Forward-Backward Selection algorithm have been shown to perform as well as Lasso when restricted to select the same number of variables [9].

Lasso has been parallelized for single machines and shared-memory clusters [10, 38, 78]. These approaches only parallelize over features and not samples (i.e., they consider only vertical partitioning). Naturally, ideas and techniques presented in those works could be adapted or extended for Spark or related systems. An implementation of Lasso linear regression is provided in the Spark MLlib library [44]. A disadvantage of that implementation is that it requires extensive tuning of its hyper-parameters (like the regularization parameter λ and several parameters of the optimization procedure), rendering it impractical, as typically many different hyper-parameter combinations have to be tried. Unfortunately, we were not able to find any Spark implementation of Lasso for logistic regression, or any work dealing with the problem of efficient parallelization of Lasso on Spark. Finally, we note that in contrast to forward selection using conditional independence tests, Lasso is not easily extensible to different problems, and requires specialized algorithms for different data types [23, 28, 42], whose objective functions may be non-convex [23] or computationally demanding [19].

8.5 Causal-Based Feature Selection Algorithms

One class of such algorithms conditions on the full set S of selected features and can guarantee to identify the Markov blanket for faithful distributions asymptotically. However, the larger the conditioning set, the more samples are required to obtain valid results. Thus, these algorithms are not well-suited for problems with large Markov blankets relative to the available sample size. Another class of such algorithms includes HITON-PC [4], MMPC [63] and, more recently, SES [33] for multiple solutions. The main difference in this class of algorithms is that they condition on subsets of the selected features S, not the full set. They do not guarantee to identify the full Markov blanket, but only a superset of the parents and children of T. Recent extensive experiments have concluded that they perform well in practice [3]. These algorithms remove from consideration any features that become independent of T conditioned on some subset of the selected features S. This is similar to the Early Dropping heuristic and renders the algorithms quite computationally efficient and scalable to high-dimensional settings. PFBP combines the advantages of these two classes of algorithms: those that condition on subsets, drop features from consideration and achieve scalability, and those that condition on the full set of selected features and guarantee identification of the full Markov blanket.

9 Experimental Evaluation

We performed three sets of experiments to evaluate PFBP. First, we investigate the scalability of PFBP in terms of variable size, sample size and number of workers on synthetic datasets simulated from Bayesian networks.
Then, we compare PFBP to four competing forward-selection-based feature selection algorithms. We made every reasonable effort to include all candidate competitors. These alternatives comprise an algorithm specifically designed for MapReduce architectures (i.e., SFO), standard FS algorithms using parallel implementations of the conditional independence tests (i.e., UFS and FBS) and ITFS; the latter comparison is indirect, using the published results in [53]. The only Lasso implementation for Spark, available in the Spark MLlib library [44], (a) is for continuous targets, and thus is not suitable for binary classification tasks, and (b) requires tuning of 5 hyper-parameters; as no procedure has been suggested by the authors for their tuning, it was excluded from the comparative evaluation. Finally, we performed a proof-of-concept experiment of PFBP on dense synthetic SNP data.

9.1 Experimental Setup

The significance level α was set to 0.01 for all algorithms, and PFBP was executed with 2 Runs. For the bootstrap tests used by PFBP, we used the default parameter values as described in Section 4.2. To produce a predictive model for PFBP we followed the approach described in Section 3.5. Parameter values related to the number of Group Samples, Sample Sets and Feature Sets were determined using the STD rule and by setting the maximum number of variables to select to maxVars (the exact value is given for each specific experiment later); see Section 5 for more details. We note that none of the experiments required a physical partitioning to Feature Sets, and thus Feature Sets were only partitioned virtually.

9.2 Scalability of PFBP with Sample Size, Feature Size and Number of Workers

We investigated the scalability of PFBP on dense synthetic datasets in terms of sample size, variable size and number of workers used. The data were sampled from randomly generated Bayesian networks, which are probabilistic models that can encode complicated dependency structures among features. Such simulated data contain not only features necessary for optimal prediction (strongly relevant in the terminology of [29]) and irrelevant features, but also redundant features (weakly relevant [29]). This is a novelty in the Big Data FS literature as far as we know, making the synthetic tasks more realistic. A detailed description of the data and network generating procedures is given in Appendix B.

For each experimental setting, we generated 5 Bayesian networks and sampled one dataset from each. The connectivity parameter C was set to 10 (i.e., the average degree of each node), the class distribution of T was 50/50, and the variance of the error term was set to 1. To investigate scalability in terms of sample size, we fixed the number of features to 1000 and varied the sample size from 2M to 10M. Scalability in terms of feature size was evaluated on datasets with 100K samples and a feature size varying from 20K to 100K, all of which also included the optimal feature set (i.e., the Markov blanket of T). Finally, scalability in terms of the number of workers was investigated on datasets with 10K variables and 1M samples. The maximum number of variables maxVars to select was set to 50.

The results are summarized in Figure 2. The top row shows the relative runtime of PFBP with varying sample size (left) and number of variables (right), respectively. The bottom figure shows the speed-up achieved with a varying number of workers. Relative time and speed-up are computed with respect to the lowest point on the x-axis.
We can clearly see the following. (Top Left) PFBP improves super-linearly with sample size; in other words, feeding twice the number of rows to the algorithm requires less than double the time. This characteristic can be attributed to the Early Stopping and Early Return heuristics. (Top Right) PFBP scales linearly with the number of features, due to the Early Dropping heuristic. (Bottom) PFBP is able to utilize the allocated machines, although the speed-up factor does not reach the theoretical optimum. The reason for this is that the Early Stopping heuristic quickly prunes many features from consideration after processing the first Group sample, reducing parallelization in subsequent Groups as only few features remain Alive.

Fig. 2: Scalability of PFBP with increasing sample size (top left), feature size (top right) and number of machines (bottom). Time and speed-up were computed relative to the first point on the x-axis, for the same number of Runs. PFBP improves super-linearly with sample size and linearly with feature size, and running time is reduced linearly with an increasing number of machines. The results are similar for PFBP with 1 run and 2 runs.

9.3 Comparative Evaluation of PFBP on Real Datasets

We evaluated the PFBP algorithm on 11 binary classification datasets, collected from the LIBSVM dataset repository, with the constraint that each dataset contains at least 500K samples or variables. A summary of the datasets, shown in order of increasing variable size, can be found in Table 2. The first two columns show the total number of samples and variables of each dataset, while the last column shows the average number of non-zero elements per sample (the maximum number of non-zero elements equals the total number of variables). Except for the first four datasets, all other datasets are extremely sparse.

Table 2: Binary classification datasets used in the comparative evaluation
We compared PFBP to three forward-selection-based algorithms: (i) Single Feature Optimization (SFO) [58], (ii) Forward-Backward Selection (FBS), and (iii) Univariate Feature Selection (UFS). UFS and FBS were implemented using the parallelized implementation of standard binary logistic regression for Big Data provided in the Spark machine learning library [44]; this was also the implementation used to fit full logistic regression models for SFO at each iteration.

The algorithms were compared in terms of classification accuracy and running time. To estimate the classification accuracy, 5% of the training instances were randomly selected and held out. The remaining 95% were used by each algorithm to select a set of features and to train a logistic regression model using those features. For SFO, FBS and UFS, the default recommended Spark parameters for data partitioning and for fitting the logistic regression models were used. The maximum number of features to select was set to 50. A timeout limit of 2 hours was used for each algorithm. In case an algorithm did not terminate within the time limit, the number of features selected up to that point is reported. If no feature was selected, the accuracy was set to N/A (not available).
Table 3 shows the results of the evaluation: the running time in minutes, the classification accuracy and the number of selected variables of each algorithm on each dataset. We include the results of the trivial classification method, which classifies each sample to the most frequent class, and thus attains an accuracy equal to the frequency of the most common class. It can clearly be seen that PFBP outperforms all competing methods in terms of running time. All competing methods reach the timeout limit or crash and do not select a single feature even for the moderately sized rcv1 dataset, which contains only 47.2K variables and 697.6K samples. PFBP reaches the timeout limit on 4 datasets, but is still able to select features (ranging from 25 to 33) and to produce a predictive model. For the few datasets on which the competing methods terminated, all methods perform similarly in terms of accuracy. For the avazu, kdd2010a and kdd2010b datasets, PFBP reaches an accuracy only marginally better than the trivial classifier, which may be because of the hardness of the problem, or because logistic regression is not appropriate.
Table 3: The table shows the total running time in minutes, the classification accuracy (%) and the number of variables selected. PFBP significantly outperforms all competitors in terms of running time, and is the only algorithm able to produce results on all datasets within the given time limit of 2 hours. SFO produces results only for datasets with up to 2000 variables (epsilon), whereas UFS and FBS produce results only for datasets with a few tens of variables. Furthermore, for SFO, UFS and FBS, Spark either crashed or ran out of memory on the largest datasets. In terms of classification accuracy, all algorithms perform similarly, and all perform better than trivial classification to the most frequent class.
            Running Time (Minutes)         Classification Accuracy (%)              Selected Variables
Dataset     PFBP     SFO    UFS    FBS     PFBP   SFO    UFS    FBS    Trivial      PFBP  SFO  UFS  FBS
SUSY        0.1      4.5    1.0    26.9    78.87  78.55  78.59  78.59  54.24        10    14   18   15
HIGGS       0.2      9.9    3.0    89.7    64.04  64.07  64.06  64.06  52.99        11    18   28   18
covtype     3.0      55.1   1.6    N/A N/A N/A N/A 97.60 N/A N/A N/A 60.62 33 0 0 0
kdd2010a    Timeout  N/A    N/A    N/A     86.13  N/A    N/A    N/A    85.30        27    0    0    0
kdd2010b    Timeout  N/A    N/A    N/A     86.13  N/A    N/A    N/A    86.06        25    0    0    0
Legend: Timeout (returned model) | Timeout (no model returned) | Out of memory | Spark crashed
We compare PFBP to the Minimum-Redundancy Maximum-Relevance (mRMR) algorithm [52], an instance of the ITFS methods [11]. The mRMR algorithm has been extended for Spark and evaluated on binary classification datasets [53], including the epsilon, url and kdd2010b datasets. This allows us to perform an indirect comparison of mRMR and PFBP. The authors of [53] used a cluster with 432 cores, in contrast to the 64 used by us (6.75 times more cores), and used 80% of the samples as training data, in contrast to the 95% we used. mRMR was able to select 50 features in 4.9 and 5.6 minutes for epsilon and url respectively, and needed 12.9 minutes to select 25 features for kdd2010b. For epsilon, which is dense, PFBP took 1.3 minutes to select 50 features, while for url and kdd2010b, which are sparse, PFBP needed 91.9 minutes and 120 minutes to select 50 and 25 features respectively. Thus, for dense data PFBP may be faster, while possibly being somewhat slower for sparse data, given that mRMR had 6.75 times more cores available. Recall that no specialized implementation for sparse data was used for PFBP, which could potentially further reduce its running time. In any case, there does not seem to be a significant gap in running time between the two approaches. Unfortunately, no comparison in terms of classification accuracy is possible, as a different performance measure and a different classifier were used for mRMR. Finally, we point out that no predictive model was presented for the kdd2010b dataset, as the implementations of the naive Bayes and SVM classifiers used could not handle it, while PFBP is always able to produce a predictive model without any computational overhead, as long as it is able to complete a single iteration.

9.4 Proof-of-Concept Application on SNP Data

Single Nucleotide Polymorphisms (SNPs), the most common type of genetic variation, are variations of a single nucleotide at specific loci in the genome of a species (see https://ghr.nlm.nih.gov/primer/genomicresearch/snp). The Single Nucleotide Polymorphism Database (dbSNP) (build 150) [57] now lists 324 million different variants found in sequenced human genomes. In several human studies, SNPs have been associated with genetic diseases or with predisposition to disease or other phenotypic traits. As of 2017-08-13, the GWAS Catalog contains 3057 publications and 39366 unique SNP-trait associations. Large-scale studies under way (e.g., the Precision Medicine Initiative [14]) intend to collect SNP data in large population cohorts, as well as other medical, clinical and lifestyle data. The resulting matrices may end up with millions of rows, one for each individual, and millions of variables (SNPs or other measured quantities). A proof-of-concept application of PFBP is presented next.

We simulated genotype data containing 500000 individuals (samples) and 592555 SNP genotypes (variables), following the procedure described in [12]. As SNP data are dense, they require approximately 2.16 TB of memory, and thus are more challenging to analyze than sparse data, such as the ones used in the previous experiment. The data were simulated with the HAPGEN 2 software [13] from the Hapmap 2 (release 22) CEU population [15]. A more detailed description of the data generation procedure is given in Appendix C. We used M = 100 randomly selected SNPs to generate a binary phenotype (outcome), as described in [12] (see also Appendix C). The optimal accuracy using all 100 SNPs is 81.42%. Ideally, and given enough samples, any feature selection method should select those 100 SNPs and achieve an accuracy around 81.42%.
Due to linkage disequilibrium, however, many neighboring SNPs are highly correlated (collinear) and, as a consequence, offer similar predictive information about the outcome; they are informationally equivalent. Therefore, a high accuracy can be achieved even with SNPs other than the 100 used to simulate the outcome. We used 95% of the samples as a training set and 5% as a test set for performance estimation. We set a timeout limit of 15 hours, and used the same setup as in the previous experiments, with the exception that the maximum number of variables to select was set to 100.

For big, dense data such as the SNP data considered in this experiment, which require over 2 TB of memory, a direct application of PFBP as used in the other experiments is possible, but may be unnecessarily slow. We found that for such problems it makes sense to repartition the data at some point, if enough variables have been removed by the Early Dropping heuristic. Repartitioning and discarding dropped variables reduces storage requirements, and may offer a significant speed boost. It is an expensive operation, however, and should only be used in special situations. For the SNP data, after the first iteration only about a third of the variables remained, reducing the memory requirements to less than 1 TB, so that most (if not all) of the data blocks were able to fit in memory. In this case repartitioning makes sense, as its benefits far outweigh the computational overhead.

Fig. 3: The effect of the early pruning heuristics, shown for the first 10 forward iterations on the SNP data. The y-axis shows the number of variables on a logarithmic scale. The width of each iteration is proportional to the number of Groups processed. The Early Dropping heuristic is able to quickly discard many features, reducing them by about an order of magnitude. Early Stopping filters out most variables after processing the first Group, and Early Return is applied two times.
PFBP was able to select 84 features in 15 hours, using a total of 960 core hours. It achieved an accuracy of 77.62%, which is over 95% of the theoretical optimal accuracy. The results are very encouraging; in comparison, the DISSECT software [12] took 4 hours on 8400 cores (that is, 33600 core hours) and used 16 TB of memory to fit a mixed linear model on similar data, achieving an accuracy around 86% of the theoretical maximum. The two experiments are not directly comparable because (a) the outcome in our case is binary instead of continuous, requiring logistic regression models, which favors DISSECT, and (b) the scenarios simulated in [12] had larger Markov blankets (1000 and 10000 instead of 100), which favors PFBP (although their results are invariant to the size of the Markov blanket). Nevertheless, the reported results are still indicative of the efficiency of PFBP on SNP Big Data.

Figure 3 shows the effects of the heuristics used by PFBP for the first 10 iterations. The y-axis shows the number of Remaining and Alive features on a logarithmic scale. The x-axis shows the current iteration, and the width is proportional to the total number of Groups processed in that iteration. We observe that (a) Early Dropping discards many features in the first iteration, reducing the number of Remaining features by about an order of magnitude, (b) in most
iterations, Early Stopping is able to reduce the number of Alive features to around 10 after processing the first Group, and (c) Early Return is applied 2 times, ending the Iteration and selecting the top feature after processing a single Group.

Finally, we also computed the accuracy at each iteration of PFBP, to investigate its behavior with an increasing number of selected features. As before, the accuracy is computed on the 5% of the data that were kept out as a test set. Such information could be used to decide early whether a sufficient number of features has been selected, and to stop computation if the accuracy reaches a plateau. This is often the case, as the most important features are typically selected during the first few iterations. This task can be performed by PFBP with minimal computational overhead, as the local models required to approximate a full global model (see Section 3.5) are already available. The results are shown in Figure 4. As expected, the largest increase in accuracy is obtained after selecting the first few features, with the accuracy reaching 75% after only 30 features have been selected. In addition, after about 70 features have been selected, the accuracy increases only marginally, going from 77.59% with 70 features to 77.62% with 84. Thus, computation could have been stopped after 70 features had been selected, while still attaining almost the same accuracy.

Fig. 4: The figure shows how the accuracy of PFBP on the SNP data increases as it selects more features. The models are produced by PFBP at each iteration with minimal computational overhead. In the first few iterations, accuracy increases sharply, while in the later iterations a plateau is reached: the accuracy is 77.59% with 70 features, with the maximum being 77.62% with 84 features. This could be used as a criterion to stop feature selection early.
10 Discussion and Conclusions
We have presented a novel algorithm for feature selection (FS) in Big Data settings that can scale to millions of predictive quantities (i.e., features, variables) and millions of training instances (samples). The Parallel, Forward-Backward with Pruning (PFBP) algorithm enables computations that can be performed in a massively parallel way by partitioning the data both horizontally (over samples) and vertically (over features) and using meta-analysis techniques to combine the results of local computations. It is equipped with heuristics that can quickly and safely drop from consideration some of the redundant and irrelevant features to significantly speed up computations. The heuristics are inspired by causal models and provide theoretical guarantees of correctness in distributions faithful to causal models (Bayesian networks or maximal ancestral graphs). Bootstrap testing allows PFBP to determine whether enough samples have been seen to safely apply the heuristics and forgo computations on the remaining samples. Our empirical analysis confirms that PFBP exhibits a super-linear speedup with increasing sample size and linear scalability with respect to the number of features and processing cores. A limitation to address in the future is to equip the algorithm with a principled criterion for determining the number of selected features. Other directions for improvement include exploiting the sparsity of the data, implementing run-time re-partitioning when deemed beneficial, and implementing the tests on GPUs, to name a few.
Acknowledgements
IT and GB have received funding from the European Research Council under the European Union's Seventh Framework Programme (FP/2007-2013) / ERC Grant Agreement n. 617393. We would like to thank Kleanthi Lakiotaki for help with the motivating example text and data simulation, and Vincenzo Lagani for proofreading and constructive feedback.
Appendices

A Accuracy of p-value Combination using Meta-Analysis and Evaluation of the STD Rule

We evaluated the ability of the proposed p-value computation method (combination of local p-values using Fisher's combined probability test) to identify the same variable to select as when global p-values are used. We performed a computational experiment on simulated data to investigate the effect of the total sample size and of the number of Data Blocks on the accuracy of the proposed approach. Furthermore, we compare the STD and EPV rules for setting the minimum number of samples in each Data Block. The EPV rule computes the sample size per Sample Group as s = df · c / min(p₀, p₁), while STD uses s = df · c / √(p₀ · p₁), where df is set to the maximum number of degrees of freedom (see Section 5.1 for more details), c is a positive constant (which may take different values for each rule), and p₀ and p₁ are the frequencies of the negative and positive classes respectively.

A.1 Data Generation
To generate data with complex correlation structures, we chose to generate data from simulated Bayesian networks. All variables are continuous Gaussian and are linear functions of their parents. The target variable is binary, and the log-odds ratio is a linear function of its parents. The procedure is described in detail in Appendix B. We used the following parameters to generate Bayesian networks and data from those networks: (a) the number of variables was set to 101 (100 variables plus the outcome T), (b) the connectivity parameter was set to 10 (i.e., the average degree of each node), (c) the frequency of the most frequent class of T was set to one of {0.5, 0.6, 0.7, 0.8, 0.9}, and (d) the standard deviation of the error terms was set to one of three values. In total this results in 15 possible Bayesian network configurations. Note that the connectivity is relatively high and the standard deviation of the error terms relatively low, so that all variables are highly correlated, increasing the difficulty of the problem of selecting the best variable. For each such parameter combination we generated 250 Bayesian networks, resulting in a total of
250 × 15 = 3750 networks. Next, we generated datasets with different sample sizes, varying the sample size over 16 values on a base-10 logarithmic grid.

A.2 Simulation Results: Combined p-values vs Global p-values

We performed conditional independence tests on all generated datasets to simulate a forward Iteration using p-values from global tests (i.e., using all data) and combined p-values using Fisher's method. We varied the following parameters: (a) the number of conditioning variables, which was set to 0, 1, 2 or 3, and (b) the number of Sample Sets each dataset was split into, which ranged from 1 (no split) to 25 in increments of 1 (a total of 25 cases). This allows us to investigate the effect of the total number of combined local p-values. The simulation of a forward Iteration was performed for each dataset and conditioning size k as follows: (a) k variables were randomly selected from the Markov blanket of T (simulating that k variables have already been selected), (b) the global conditional independence test was performed between T and the remaining variables over all samples (i.e., the number of Sample Sets equals 1), and (c) the same test was performed on all Sample Sets resulting from splitting the data randomly into m equally-sized Sample Sets (m ranging from 2 to 25), combining the p-values using Fisher's combined probability test.

We compute the percentage of agreement between the two methods, that is, how often both methods select the same variable. This is computed as the proportion of times both methods agreed over the 250 repetitions, leading to one value for each of the 15 Bayesian network configurations, each sample size, conditioning set size and number of Sample Sets. Thus, in total we have
15 × 16 × 4 × 24 = 23040 such values. The results are summarized in Figure 5. There are 4 panels, one for each conditioning size, and each panel contains 5 curves, one for each class distribution of T.
Fig. 5: The percentage of agreement is shown, which corresponds to how often combining local p-values and computing the p-value on all samples lead to the same decision. The panels show how the sample size per Sample Set affects the agreement percentage. Both methods tend to agree asymptotically for various class distributions and conditioning set sizes.

Each such curve summarizes the results over all error variances, sample sizes and numbers of Sample Sets (that is, 3 × 16 × 24 = 1152 points). Note that the number of parameters of the largest model is always the conditioning size plus 2, as the model also includes the variable that is tested for (conditional) independence with T and the intercept. The x-axis shows the sample size per Sample Set, which is computed as the total sample size divided by the number of Sample Sets. We only show the results up to 500 samples per Sample Set; the agreement percentage approaches 100% in all cases with increasing sample size, reaching at least 99% with 5000 samples per Sample Set. The y-axis shows how often both methods lead to the same decision. To avoid clutter, we computed the curves by fitting a power regression model y = α · x^β + c (a sketch of such a fit is given after this paragraph). We found this model to be appropriate, as it has R² values between 0.75 and 0.95. We conclude the following: (a) both approaches tend to make the same decision with increasing sample size, and (b) the sample size per Sample Set required depends on the number of parameters and the class distribution, and increases with increasing number of parameters and class imbalance.
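Such a power-model fit can be reproduced with ordinary nonlinear least squares. The sketch below (Python with scipy; the data points are made-up placeholders, not the measurements behind Figure 5) fits y = α · x^β + c and reports R²:

import numpy as np
from scipy.optimize import curve_fit

def power_model(x, alpha, beta, c):
    # y = alpha * x**beta + c; alpha, beta < 0 gives a curve rising toward c.
    return alpha * np.power(x, beta) + c

# Placeholder data: sample size per Sample Set vs. agreement percentage.
x = np.array([25.0, 50.0, 100.0, 200.0, 400.0, 500.0])
y = np.array([60.1, 71.5, 80.3, 85.7, 90.2, 91.0])

params, _ = curve_fit(power_model, x, y, p0=(-100.0, -0.5, 100.0))
y_hat = power_model(x, *params)
r2 = 1.0 - np.sum((y - y_hat) ** 2) / np.sum((y - np.mean(y)) ** 2)
print("alpha=%.1f beta=%.2f c=%.1f R^2=%.3f" % (*params, r2))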
A.3 Simulation Results: STD vs EPV for Determining the Required Sample Size

We propose an alternative rule to EPV, computed as df · c / √(p_0 · p_1). The denominator is the standard deviation of the class distribution, which follows a Bernoulli distribution. We call this the STD rule hereafter. For balanced class distributions the result is identical to the EPV rule, while for skewed distributions the value is always smaller.

To validate the STD rule, we used the results of the previous simulation experiment and computed the value of c by solving each rule's equation for c and substituting in the values of the class distributions, degrees of freedom and sample size per Sample Set (see the sketch below). We kept the values of c that correspond to an agreement percentage between 85% and 95% (focusing on an interesting range of high agreement between p-value computation methods), and computed their median value for each class distribution, conditioning size k and for both rules. Ideally, one would expect c to be constant across rows (class distribution) and columns (conditioning size); a constant value of c for a rule means that the rule can exactly compute the sample size required to get an agreement percentage around 90%. Furthermore, we note that the values of c are not comparable between rules, and thus their exact values are not important; what matters is the relative difference between values of c for the same rule.

Table 4: Median value of c to obtain an agreement percentage between 85% and 95%. p_max corresponds to the proportion of the most frequent class, while df is the degrees of freedom of the largest model. The relative differences for the STD rule are smaller (less than 2, against over 2.5 for the EPV rule), suggesting it is more appropriate. A minimum value of c = 10 with the proposed rule is recommended and used in the experiments.

                  EPV Rule                 STD Rule
p_max | df     2     3     4     5      2     3     4     5
0.5          11.2   7.8   9.0   9.7   11.2   7.8   9.0   9.7
0.6           7.7   7.6   7.4  11.2    9.4   9.3   9.1  13.7
0.7           6.6   5.7   5.9   8.6   10.1   8.8   9.1  13.2
0.8           5.7   5.2   5.7   6.7   11.5  10.5  11.4  13.3
0.9           5.0   4.4   4.2   4.4   14.9  13.2  12.5  13.3

The results are shown in Table 4. Although the value of c varies across class distributions and degrees of freedom, the relative differences are smaller for the STD rule. Specifically, for EPV c ranges from 4.2 to 11.2, the latter being over 2.5 times larger, while for STD it ranges from 7.8 to 14.9, which is less than 2 times larger. This suggests that the STD rule performs better than EPV across various conditioning set sizes and class distributions, at least on the experiments considered here. Furthermore, the results suggest that a value of at least c = 10 should be used for STD to get reasonably accurate results. We note that, in practice, this value is much higher in most cases for PFBP, as it partitions the samples initially by considering the worst-case scenario (i.e., selecting maxVars variables). Thus, especially in early Iterations, which are the most crucial ones, PFBP will typically have a sufficient number of samples even with c = 10 to select the best variables.
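The inversion used to back out c is a one-liner per rule; a minimal sketch (Python; the helper name and the example inputs are ours):

import math

def solve_c(s, p1, df, rule="STD"):
    """Invert the sample-size rules for c:
    EPV: s = df*c/min(p0, p1)   =>  c = s*min(p0, p1)/df
    STD: s = df*c/sqrt(p0*p1)   =>  c = s*sqrt(p0*p1)/df
    """
    p0 = 1.0 - p1
    denom = min(p0, p1) if rule == "EPV" else math.sqrt(p0 * p1)
    return s * denom / df

# E.g., if ~90% agreement is reached at 450 samples per Sample Set with a
# 90/10 class distribution and df = 3:
print(solve_c(450, 0.1, 3, "EPV"))  # 15.0
print(solve_c(450, 0.1, 3, "STD"))  # 45.0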
B Simulating Data from Bayesian Networks

To generate data with complex correlation structures, we chose to generate data from Bayesian networks. This is done in three steps: (a) generate a Bayesian network structure G with N nodes (variables) and M edges, (b) sample the parameters of G, and (c) sample instances from G. We next describe the procedure used for each step.

B.1 Generating the Bayesian network structure
First, we need to specify the number of nodes N and the connectivity C between those nodes, which implicitly corresponds to some number of edges M. The connectivity parameter C corresponds to the (average) degree of each variable. Using the connectivity instead of setting the number of edges directly allows one to easily control the complexity of the network, as C directly corresponds to the average number of parents and children of each node. We proceed by showing how the edges were sampled. Let V_1, ..., V_N be all nodes of G, listed in topological order. To sample the edges of the network, we iterate over all pairs of variables V_i and V_j (i < j), and add an edge from V_i to V_j with probability C/(N − 1).
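A minimal sketch of this edge-sampling scheme (Python; the function name and the parent-list representation are ours):

import random

def sample_dag_edges(N, C, seed=0):
    """Sample a DAG over nodes V_1..V_N given in topological order.

    Each pair (V_i, V_j) with i < j receives an edge with probability
    C/(N - 1), so the expected (average) degree of a node is C.
    """
    rng = random.Random(seed)
    p = C / (N - 1)
    parents = [[] for _ in range(N)]
    for i in range(N):
        for j in range(i + 1, N):
            if rng.random() < p:
                parents[j].append(i)
    return parents  # parents[j] lists the parent indices of node j

# Example: 101 nodes with average degree 10, as in Appendix A.
parents = sample_dag_edges(N=101, C=10)
print(2 * sum(len(ps) for ps in parents) / 101)  # close to 10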
B.2 Generating the Bayesian network parameters

The first step is to pick the variable that corresponds to the outcome variable T. We chose to use the node at position ⌈N/2⌉, as this node has the same number of parents and children on average. For our experiments, we chose T to be binary and all remaining variables to be continuous, but in principle everything stated can be easily adapted to other variable types. In Bayesian networks, each variable V is a function of its parents Pa(V). The functional form for a continuous variable V_i is

V_i = β_0 + Σ_{V_j ∈ Pa(V_i)} β_j · V_j + ε_i

where β_0 is the intercept, β_j is the coefficient of the j-th parent of V_i, and ε_i is its error term. In our case, we set the intercept to 0, as it does not affect the correlation structure. The coefficients β_j are sampled uniformly at random from [−1, −0.1] ∪ [0.1, 1] to avoid coefficients which are close to 0. The error term ε_i follows a normal distribution with mean 0 and variance σ_i², which was set to the default value of 1 in our experiments, unless stated otherwise. Note that all variables are normally distributed, as they are sums of normally distributed variables. For each variable V_i the mean equals zero and the variance equals σ_i² + Σ_{V_j ∈ Pa(V_i)} β_j². The fact that the variance increases may lead to numerical instabilities in practice, especially when generating large networks. Because of that, we standardize each variable to have unit variance by dividing it by its standard deviation, which is the square root of the variance as described above. For the target T, the log-odds ratio is again a linear function of its parents, defined as

log(P(T = 1) / (1 − P(T = 1))) = β_0 + Σ_{V_j ∈ Pa(T)} β_j · V_j + ε_T

As before, the log-odds ratio is standardized to have unit variance. The value of T is set to 1 whenever the log-odds ratio is larger than some threshold t, and to 0 otherwise. Setting t to 0 results in a 50/50 class distribution of T. Other class distributions p_0/p_1 can be obtained by simply setting t to N^{−1}_{0,1}(1 − p_1), where N^{−1}_{0,1} is the inverse cumulative distribution function of the standard normal. As a final note, the standardization method used above only guarantees that variables that come before T in the topological ordering are standard normal variables. As T is not normally distributed (nor does it have unit variance), all variables that are direct or indirect functions of T are not exactly normally distributed. However, as this neither alters the correctness of the data generation method nor leads to any other issues, we leave it as is.
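The coefficient sampling and the threshold computation can be sketched as follows (Python with scipy; the names are ours, and the coefficient interval follows the reconstruction above):

import random
from scipy.stats import norm

def sample_coefficient(rng):
    """Draw a coefficient uniformly from [-1, -0.1] U [0.1, 1],
    avoiding values close to 0."""
    magnitude = rng.uniform(0.1, 1.0)
    return magnitude if rng.random() < 0.5 else -magnitude

def class_threshold(p1):
    """Threshold t on the standardized log-odds so that P(T = 1) = p1.

    Since the standardized log-odds is treated as standard normal,
    t = N^{-1}_{0,1}(1 - p1); t = 0 yields a 50/50 distribution.
    """
    return norm.ppf(1.0 - p1)

rng = random.Random(0)
print(sample_coefficient(rng))
print(class_threshold(0.5))  # 0.0
print(class_threshold(0.1))  # ~1.28, i.e., a 90/10 class distribution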
B.3 Sampling data from the generated Bayesian network

To generate a sample, one traverses the network in topological order and computes the value of each variable, using the formulas described previously. By construction the network is already in topological order, which is simply given by the index of each variable. To compute the value of a variable, one computes the weighted sum of its parents' values (if it has any parents) and adds the error term, which is drawn from a normal distribution.
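Putting the pieces together, a minimal ancestral-sampling sketch (Python; it assumes the parents/coefficients representation of the earlier sketches and mirrors the standardization described in B.2):

import math
import random

def sample_instance(parents, coeffs, sigma, rng):
    """Draw one sample by traversing the nodes in topological order.

    parents[j] : parent indices of node j (all smaller than j)
    coeffs[j]  : coefficients aligned with parents[j]
    sigma      : standard deviation of the error terms
    """
    N = len(parents)
    values = [0.0] * N
    for j in range(N):
        v = sum(b * values[i] for b, i in zip(coeffs[j], parents[j]))
        v += rng.gauss(0.0, sigma)
        # Standardize using the variance formula of B.2:
        # sigma^2 plus the sum of squared coefficients.
        var = sigma ** 2 + sum(b * b for b in coeffs[j])
        values[j] = v / math.sqrt(var)
    return values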
C SNP Data Generation
To generate the SNP dataset we followed the procedure described in [12]. We used the HAPGEN2 software [13] with the HapMap 2 (release 22) CEU population [15] to simulate 500000 individuals (samples). This population contains 2543887 SNPs, but only 592555 were kept, by filtering out the ones not available in the Illumina Human OmniExpress-12 v1.1 BeadChip (https://support.illumina.com/array/array_kits/humanomniexpress-12-beadchip-kit.html). The final dataset contains 500000 samples and 592555 SNPs. Each variable takes values in {0, 1, 2}, which correspond to the number of reference alleles; thus, the dataset is dense.

C.1 Phenotype Simulation
Let s_ij be the i-th value of the j-th SNP s_j, and p_j be the reference allele frequency of SNP j, that is, p_j is the average value of s_ij divided by 2. The standardized value of s_ij, denoted z_ij, is defined as z_ij = (s_ij − µ_j)/σ_j, where µ_j = 2 · p_j and σ_j = √(p_j · (1 − p_j)). The phenotype (outcome) y follows an additive genetic model

y_i = g_i + e_i = Σ_{j=1}^{M} z_ij · u_j + e_i

where y_i is the i-th value of y, g_i is the genetic effect, e_i is the noise term, M is the number of variables influencing y, and u_j is the effect (coefficient) of z_j. The coefficients u_j were sampled from a normal distribution with zero mean and unit variance. The error terms e_i follow a normal distribution with zero mean and variance σ_g² · (1 − h²)/h², where σ_g² is the variance of the genetic effect g and h² corresponds to the trait heritability. Naturally, the larger h², the more y depends on the SNPs. In our case, we chose M = 100 and set h² = 0.7, one of the values used in [12]. Finally, to obtain a binary outcome, we set the value of y_i to 1 if it is positive, and to 0 otherwise, resulting in an approximately balanced outcome.
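A minimal sketch of the phenotype simulation (Python with NumPy; the function name is ours, and S stands for the n × M matrix of the causal SNP columns, with 0 < p_j < 1 for every SNP):

import numpy as np

def simulate_phenotype(S, h2=0.7, seed=0):
    """Additive genetic model over M causal SNPs (columns of S, values 0/1/2).

    Standardizes each SNP, combines them with standard normal effects u_j,
    adds noise with variance var(g)*(1 - h2)/h2 so the heritability is h2,
    and binarizes the result at 0.
    """
    rng = np.random.default_rng(seed)
    p = S.mean(axis=0) / 2.0                    # reference allele frequencies
    Z = (S - 2.0 * p) / np.sqrt(p * (1.0 - p))  # standardized genotypes
    u = rng.standard_normal(S.shape[1])         # SNP effects
    g = Z @ u                                   # genetic component
    e = rng.normal(0.0, np.sqrt(g.var() * (1.0 - h2) / h2), size=len(g))
    return ((g + e) > 0).astype(int)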
References

1. A. Agresti. Categorical Data Analysis. Wiley Series in Probability and Statistics. Wiley-Interscience, 2nd edition, 2002.
2. H. Akaike. Information theory and an extension of the maximum likelihood principle. In Second International Symposium on Information Theory, pages 267–281, Budapest, 1973. Akadémiai Kiadó.
3. C. F. Aliferis, A. Statnikov, I. Tsamardinos, S. Mani, and X. D. Koutsoukos. Local causal and Markov blanket induction for causal discovery and feature selection for classification part I: Algorithms and empirical evaluation. Journal of Machine Learning Research, 11(Jan):171–234, 2010.
4. C. F. Aliferis, I. Tsamardinos, and A. Statnikov. HITON: a novel Markov blanket algorithm for optimal variable selection. In AMIA Annual Symposium Proceedings, volume 2003, page 21. American Medical Informatics Association, 2003.
5. L. Armijo. Minimization of functions having Lipschitz continuous first partial derivatives. Pacific Journal of Mathematics, 16(1):1–3, 1966.
6. B. J. Becker and M.-J. Wu. The synthesis of regression slopes in meta-analysis. Statistical Science, pages 414–429, 2007.
7. D. Bertsimas, A. King, R. Mazumder, et al. Best subset selection via a modern optimization lens. The Annals of Statistics, 44(2):813–852, 2016.
8. V. Bolón-Canedo, N. Sánchez-Maroño, and A. Alonso-Betanzos. Feature Selection for High-Dimensional Data. Springer Publishing Company, Incorporated, 1st edition, 2015.
9. G. Borboudakis and I. Tsamardinos. Forward-backward selection with early dropping. arXiv:1705.10770 [cs.LG], 2017.
10. J. K. Bradley, A. Kyrola, D. Bickson, and C. Guestrin. Parallel coordinate descent for L1-regularized loss minimization. In Proceedings of the 28th International Conference on Machine Learning, ICML 2011, Bellevue, Washington, USA, June 28 – July 2, 2011, pages 321–328, 2011.
11. G. Brown, A. Pocock, M.-J. Zhao, and M. Luján. Conditional likelihood maximisation: A unifying framework for information theoretic feature selection. J. Mach. Learn. Res., 13:27–66, Jan. 2012.
12. O. Canela-Xandri, A. Law, A. Gray, J. A. Woolliams, and A. Tenesa. A new tool called DISSECT for analysing large genomic data sets using a big data approach. Nature Communications, 6, 2015.
13. C. C. Chang, C. C. Chow, L. C. Tellier, S. Vattikuti, S. M. Purcell, and J. J. Lee. Second-generation PLINK: rising to the challenge of larger and richer datasets. GigaScience, 4(1):7, 2015.
14. F. S. Collins and H. Varmus. A new initiative on precision medicine. New England Journal of Medicine, 372(9):793–795, 2015.
15. I. H. Consortium et al. A haplotype map of the human genome. Nature, 437(7063):1299–1320, 2005.
16. J. Dougherty, R. Kohavi, and M. Sahami. Supervised and unsupervised discretization of continuous features. In Machine Learning: Proceedings of the Twelfth International Conference, pages 194–202. Morgan Kaufmann, 1995.
17. B. Efron and R. J. Tibshirani. An Introduction to the Bootstrap. CRC Press, 1994.
18. R. F. Engle. Wald, likelihood ratio, and Lagrange multiplier tests in econometrics. Handbook of Econometrics, 2:775–826, 1984.
19. J. Fan, Y. Feng, Y. Wu, et al. High-dimensional variable selection for Cox's proportional hazards model. In Borrowing Strength: Theory Powering Applications – A Festschrift for Lawrence D. Brown, pages 70–86. Institute of Mathematical Statistics, 2010.
20. R. Fisher. Statistical Methods for Research Workers. Oliver & Boyd, Edinburgh, 1932.
21. R. V. Foutz and R. C. Srivastava. The performance of the likelihood ratio test when the model is incorrect. The Annals of Statistics, 5(6):1183–1194, 1977.
22. J. Friedman, T. Hastie, and R. Tibshirani. The Elements of Statistical Learning, volume 1. Springer Series in Statistics, New York, 2001.
23. S. van de Geer, P. Bühlmann, and J. Schelldorfer. Estimation for high-dimensional linear mixed-effects models using L1-penalization. Scandinavian Journal of Statistics, 38(2):197–214, 2011.
24. I. Guyon and A. Elisseeff. An introduction to variable and feature selection. Journal of Machine Learning Research, 3(Mar):1157–1182, 2003.
25. F. Harrell. Regression Modeling Strategies. Springer, corrected edition, Jan. 2001.
26. L. V. Hedges and J. L. Vevea. Fixed- and random-effects models in meta-analysis. Psychological Methods, 3(4):486, 1998.
27. D. W. Hosmer, Jr., S. Lemeshow, and R. X. Sturdivant. Introduction to the Logistic Regression Model. John Wiley & Sons, Inc., 2013.
28. S. Ivanoff, F. Picard, and V. Rivoirard. Adaptive lasso and group-lasso for functional Poisson regression. J. Mach. Learn. Res., 17(1):1903–1948, Jan. 2016.
29. G. H. John, R. Kohavi, K. Pfleger, et al. Irrelevant features and the subset selection problem. In Machine Learning: Proceedings of the Eleventh International Conference, pages 121–129, 1994.
30. R. Kerber. ChiMerge: Discretization of numeric attributes. In Proceedings of the Tenth National Conference on Artificial Intelligence, pages 123–128. AAAI Press, 1992.
31. P. Konda, A. Kumar, C. Ré, and V. Sashikanth. Feature selection in enterprise analytics: A demonstration using an R-based data analytics system. Proc. VLDB Endow., 6(12):1306–1309, Aug. 2013.
32. M. H. Kutner, C. J. Nachtsheim, J. Neter, and W. Li. Applied Linear Statistical Models. McGraw-Hill/Irwin, 5th edition, Aug. 2004.
33. V. Lagani, G. Athineou, A. Farcomeni, M. Tsagris, and I. Tsamardinos. Feature selection with the R package MXM: Discovering statistically-equivalent feature subsets. arXiv:1611.03227 [stat.ML], 2016.
34. V. Lagani, G. Kortas, and I. Tsamardinos. Biomarker signature identification in omics data with multi-class outcome. Computational and Structural Biotechnology Journal, 6(7):1–7, 2013.
35. V. Lagani and I. Tsamardinos. Structure-based variable selection for survival data. Bioinformatics, 26(15):1887–1894, 2010.
36. S. Lee, J. K. Kim, X. Zheng, Q. Ho, G. A. Gibson, and E. P. Xing. On model parallelization and scheduling strategies for distributed machine learning. In Advances in Neural Information Processing Systems 27: Annual Conference on Neural Information Processing Systems 2014, December 8–13 2014, Montreal, Quebec, Canada, pages 2834–2842, 2014.
37. J. Li, K. Cheng, S. Wang, F. Morstatter, R. P. Trevino, J. Tang, and H. Liu. Feature selection: A data perspective. arXiv preprint arXiv:1601.07996, 2016.
38. Q. Li, S. Qiu, S. Ji, P. M. Thompson, J. Ye, and J. Wang. Parallel lasso screening for big data optimization. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '16, pages 1705–1714, New York, NY, USA, 2016. ACM.
39. T. M. Loughin. A systematic comparison of methods for combining p-values from independent tests. Computational Statistics & Data Analysis, 47(3):467–485, 2004.
40. D. Margaritis. Toward provably correct feature selection in arbitrary domains. In Advances in Neural Information Processing Systems, pages 1240–1248, 2009.
41. D. Margaritis and S. Thrun. Bayesian network induction via local neighborhoods. In S. A. Solla, T. K. Leen, and K. Müller, editors, Advances in Neural Information Processing Systems 12, pages 505–511. MIT Press, 2000.
42. L. Meier, S. van de Geer, and P. Bühlmann. The group lasso for logistic regression. Journal of the Royal Statistical Society, Series B, 2008.
43. N. Meinshausen and P. Bühlmann. High-dimensional graphs and variable selection with the lasso. The Annals of Statistics, pages 1436–1462, 2006.
44. X. Meng, J. Bradley, B. Yavuz, E. Sparks, S. Venkataraman, D. Liu, J. Freeman, D. Tsai, M. Amde, S. Owen, D. Xin, R. Xin, M. J. Franklin, R. Zadeh, M. Zaharia, and A. Talwalkar. MLlib: Machine learning in Apache Spark. J. Mach. Learn. Res., 17(1):1235–1241, Jan. 2016.
45. A. Miller. Subset Selection in Regression. CRC Press, 2002.
46. T. P. Minka. A comparison of numerical optimizers for logistic regression. Unpublished draft, 2003.
47. J. Pearl. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 1988.
48. J. Pearl. Causality, Models, Reasoning, and Inference. Cambridge University Press, Cambridge, U.K., 2000.
49. J. Pearl and T. S. Verma. A Theory of Inferred Causation. Morgan Kaufmann, 1991.
50. P. Peduzzi, J. Concato, E. Kemper, T. R. Holford, and A. R. Feinstein. A simulation study of the number of events per variable in logistic regression analysis. Journal of Clinical Epidemiology, 49(12):1373–1379, 1996.
51. J. M. Peña, R. Nilsson, J. Björkegren, and J. Tegnér. Towards scalable and data efficient learning of Markov boundaries. International Journal of Approximate Reasoning, 45(2):211–232, 2007.
52. H. Peng, F. Long, and C. Ding. Feature selection based on mutual information criteria of max-dependency, max-relevance, and min-redundancy. IEEE Transactions on Pattern Analysis and Machine Intelligence, 27(8):1226–1238, 2005.
53. S. Ramírez-Gallego, H. Mouriño-Talín, D. Martínez-Rego, V. Bolón-Canedo, J. M. Benítez, A. Alonso-Betanzos, and F. Herrera. An information theory-based feature selection framework for big data under Apache Spark. IEEE Transactions on Systems, Man, and Cybernetics: Systems, PP(99):1–13, 2017.
54. T. Richardson and P. Spirtes. Ancestral graph Markov models. Annals of Statistics, pages 962–1030, 2002.
55. T. Sato, Y. Takano, R. Miyashiro, and A. Yoshise. Feature subset selection for logistic regression via mixed integer optimization. Computational Optimization and Applications, 64(3):865–880, 2016.
56. G. Schwarz et al. Estimating the dimension of a model. The Annals of Statistics, 6(2):461–464, 1978.
57. S. T. Sherry, M.-H. Ward, M. Kholodov, J. Baker, L. Phan, E. M. Smigielski, and K. Sirotkin. dbSNP: the NCBI database of genetic variation. Nucleic Acids Research, 29(1):308–311, 2001.
58. S. Singh, J. Kubica, S. Larsen, and D. Sorokina. Parallel large scale feature selection for logistic regression. In Proceedings of the 2009 SIAM International Conference on Data Mining, pages 1172–1183. SIAM, 2009.
59. P. Spirtes, C. N. Glymour, and R. Scheines. Causation, Prediction, and Search. MIT Press, 2nd edition, 2000.
60. A. Statnikov, N. I. Lytkin, J. Lemeire, and C. F. Aliferis. Algorithms for discovery of multiple Markov boundaries. Journal of Machine Learning Research, 14(Feb):499–566, 2013.
61. R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B (Methodological), pages 267–288, 1996.
62. I. Tsamardinos and C. F. Aliferis. Towards principled feature selection: relevancy, filters and wrappers. In Proceedings of the Ninth International Workshop on Artificial Intelligence and Statistics, 2003.
63. I. Tsamardinos, C. F. Aliferis, and A. Statnikov. Time and sample efficient discovery of Markov blankets and direct causal relations. In Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 673–678. ACM, 2003.
64. I. Tsamardinos, C. F. Aliferis, and A. R. Statnikov. Algorithms for large scale Markov blanket discovery. In FLAIRS Conference, volume 2, 2003.
65. I. Tsamardinos and A. P. Mariglis. Multi-source causal analysis: Learning Bayesian networks from multiple datasets. In IFIP International Conference on Artificial Intelligence Applications and Innovations, pages 479–490. Springer, 2009.
66. T. Verma and J. Pearl. Causal networks: Semantics and expressiveness. In Proceedings of the 4th Workshop on Uncertainty in Artificial Intelligence, pages 352–359, Aug. 1988.
67. E. Vittinghoff and C. E. McCulloch. Relaxing the rule of ten events per variable in logistic and Cox regression. American Journal of Epidemiology, 165(6):710–718, 2007.
68. Q. H. Vuong. Likelihood ratio tests for model selection and non-nested hypotheses. Econometrica: Journal of the Econometric Society, pages 307–333, 1989.
69. S. Weisberg. Applied Linear Regression, volume 528. John Wiley & Sons, 2005.
70. W. J. Welch. Algorithmic complexity: three NP-hard problems in computational statistics. Journal of Statistical Computation and Simulation, 15(1):17–25, 1982.
71. H. White. Maximum likelihood estimation of misspecified models. Econometrica, 50(1):1–25, 1982.
72. S. S. Wilks. The large-sample distribution of the likelihood ratio for testing composite hypotheses. The Annals of Mathematical Statistics, 9(1):60–62, 1938.
73. E. P. Xing, Q. Ho, P. Xie, and D. Wei. Strategies and principles of distributed machine learning on big data. Engineering, 2(2):179–195, 2016.
74. M. Zaharia, M. Chowdhury, M. J. Franklin, S. Shenker, and I. Stoica. Spark: Cluster computing with working sets. In HotCloud, 2010.
75. Y. Zhai, Y. Ong, and I. W. Tsang. The emerging big dimensionality. IEEE Comp. Int. Mag., 9(3):14–26, 2014.
76. K. Zhang, J. Peters, D. Janzing, and B. Schölkopf. Kernel-based conditional independence test and application in causal discovery. In Proceedings of the Twenty-Seventh Conference on Uncertainty in Artificial Intelligence, pages 804–813, 2011.
77. Z. Zhao, R. Zhang, J. Cox, D. Duling, and W. Sarle. Massively parallel feature selection: an approach based on variance preservation. Machine Learning, 92(1):195–220, 2013.
78. P. Zhimin, Y. Ming, and Y. Wotao. Parallel and distributed sparse optimization. In