A Pipeline for Integrated Theory and Data-Driven Modeling of Genomic and Clinical Data
Vineet K Raghu, Xiaoyu Ge, Arun Balajee, Daniel J. Shirer, Isha Das, Panayiotis V. Benos, Panos K. Chrysanthis
A PREPRINT
Vineet K. Raghu
Department of Computer Science, University of Pittsburgh
[email protected]

Xiaoyu Ge
Department of Computer Science, University of Pittsburgh
[email protected]

Arun Balajee
Department of Computer Science, University of Pittsburgh
[email protected]

Daniel J. Shirer
Department of Computer Science, University of Pittsburgh
[email protected]

Isha Das
Department of Computational and Systems Biology, University of Pittsburgh

Panayiotis V. Benos
Department of Computational and Systems Biology, University of Pittsburgh
[email protected]

Panos K. Chrysanthis
Department of Computer Science, University of Pittsburgh
[email protected]
May 7, 2020
Abstract
High-throughput genome sequencing technologies such as RNA-Seq and microarray have the potential to transform clinical decision making and biomedical research by enabling granular, genome-wide measurements. However, to truly understand the causes of disease and the effects of medical interventions, these data must be integrated with phenotypic, environmental, and behavioral data from individuals. Further, effective knowledge discovery methods that can infer relationships between these data types are required. In this work, we propose a pipeline for knowledge discovery from integrated genomic and clinical data. The pipeline begins with a novel variable selection method, and uses a probabilistic graphical model to understand the relationships between features in the data. We demonstrate how this pipeline can improve breast cancer outcome prediction models, and can provide a biologically interpretable view of sequencing data.

Keywords: Genomics, Graphical Models, Feature Selection, Phenotype Prediction, Knowledge Discovery
Since the advent of high-throughput sequencing assays, a number of modeling approaches have been developed to predict patient outcome from genomic data. To understand the complex relationships between genomics and outcomes, the genomic data are often merged with clinical and demographic information into an "integrated" dataset. Standard machine learning models have been insufficient for these data because of their unique complexities. Specifically, these data contain highly correlated sets of variables (genes), and often consist of several orders of magnitude more variables than samples (high-dimensional). Lastly, for biomedical research purposes, the predictive power of a model is important, but its interpretability is equally crucial. Biomedical researchers often aim to learn from their models to identify promising new hypotheses or to prioritize future experiments, rather than solely aiming to predict outcomes accurately.

Probabilistic graphical models (PGMs) are an effective tool for building interpretable models. These models represent a dataset as a graph where nodes correspond to features and edges correspond to dependence relationships.

Figure 1: Pipeline proposed in this work to learn graphical model structure from mixed clinical and omics datasets.

Learning the structure of these models from data is a well-studied problem for continuous data and for categorical data, but not for mixtures of both. Recently, several approaches have been proposed to model mixed data. However, utilizing these approaches requires addressing some of the aforementioned difficulties inherent to integrated datasets. Graphical model learning would be computationally intractable on genome-scale data. Further, interpretation of a large graphical model is difficult unless single variables of interest are queried. Lastly, highly correlated data can result in the formation of disconnected cliques, impeding model accuracy.

One way to address these issues is to select a subset of variables to model. The machine learning community refers to this problem as feature selection. There, the aim is to find the subset of features that best predicts a target variable of interest. Though some of these approaches have been applied to integrated biomedical data, they still fail to address the aforementioned challenges. High correlation among features results in unstable prediction models and harms the interpretability of learned models.

Using prior knowledge has been proposed as a way to address these difficulties. These sources allow a researcher to choose the most biologically plausible model among statistically equivalent models. However, many proposed methods have shown no significant benefit from using prior knowledge. Our hypothesis for this is twofold. First, external data sources need to be evaluated and weighted according to data provenance and context-specific information. For example, a particular biological pathway may not be active in the context from which a genomic dataset was measured. Thus, this source of information should be downweighted in the final model.
Second, multiple sources of prior information should be integrated to achieve consistent results.

This motivates our pipeline for modeling an integrated genomic and clinical dataset (Figure 1). The first step is to measure the concordance between the data and each of the prior information sources, and to weight the sources accordingly. This is based upon a prior knowledge evaluation method we recently developed for graphical structure learning. Then, the information in the data and the prior knowledge are fused to select parameters for a feature selection method (Pref-Div). Finally, the clusters selected by Pref-Div are modeled using a graphical model structure learning algorithm to represent the dependencies between the clusters and outcome variables of interest.

Our specific contributions are as follows:

• A novel method for variable and cluster selection that combines a feature selection approach with an approach to evaluate and integrate prior information
• An extensive evaluation of the variable selection approach on synthetic datasets
• A pipeline that combines variable selection with graphical modeling to represent inter-relationships between variables from mixed discrete and continuous datasets
• An evaluation of the pipeline against state-of-the-art variable selection approaches in predicting breast cancer outcome and subtype from transcriptomic data
In this section, we give background information on feature selection methods for genomic data. Then, we discuss how prior knowledge has been incorporated. Finally, we discuss graphical model structure learning approaches for mixed datasets.
Feature selection aims to identify a subset of features in a dataset that together best predict a target variable. The main purpose of feature selection is to improve model training efficiency and to prevent overfitting. In machine learning, feature selection approaches fall into three broad classes: filter methods, wrapper methods, and embedded methods. Filter methods select features using univariate ranking scores, such as a Wilcoxon test or a t-test between a covariate and a target variable. Wrapper methods use a predictive model such as the support vector machine to select a set of features that results in an accurate prediction model. Two popular wrapper methods are recursive feature elimination and greedy forward search, which select the best feature to eliminate (or include, respectively) in a step-wise fashion. Finally, embedded methods are predictive models that select features automatically as part of the learning procedure. The most popular example is the LASSO regression method, which uses an L1-norm penalty to shrink coefficients to zero in a linear regression.

Recently, a study investigated the performance of these techniques in predicting breast cancer relapse from genomic data. Overall, no method had significantly better accuracy than randomly choosing features and learning a classification model, and only filter-based methods were more stable (insensitive to variations in the data). Since this was true even within a single dataset, it could only be attributed to the methods themselves. This suggests that tailored approaches are necessary to improve feature selection from omics data.

One way to improve these approaches is to use domain knowledge about the relationships between genes and between protein products. Three main sources of prior knowledge have been explored: gene ontology (GO) terms, protein-protein interaction networks (PPIs), and biological pathways.

GO groups genes based on known biological functions (e.g., cell cycle or angiogenesis). Several approaches have leveraged GO terms as prior information to construct gene clusters. The main drawback of these methods lies in the nature of GO terms: not all genes belong to a functional group in the GO, and these methods chose to discard those genes. In addition, GO terms tend to define broad functional classes that are difficult to interpret.

PPIs are networks that encode protein interactions known to occur in normal cellular activity. Methods for gene selection have been built on top of these networks, and they have been reviewed and evaluated previously. Many of these approaches aim either to 1) group genes based on the edges in the network and penalize them together, or 2) use the network information to determine gene importance. Pathway-based approaches are similar in principle, but use biological pathways, which represent a module of the network that carries out a specific function. These are normally taken from a pathway database such as KEGG or I2D. One biological pathway-based feature selection method is a step-wise procedure that uses mutual information with the target variable as a scoring criterion, but does not choose multiple genes from the same area of the graph formed by the union of all the pathways. Another approach attempts to construct a single feature for each pathway by aggregating information across multiple genes. A similar method has been proposed in which the pathways are instead constructed from the data. Multiple studies have found no significant benefit in prediction accuracy over baseline methods; however, these methods do appear to give more biologically interpretable signatures.

For data exploration applications, graphical models enable a user to identify all direct associations for any variable of interest. Since genomic data is often integrated with clinical, demographic, and epidemiological data, in this work we focus on approaches to learn undirected graphical models from mixed datasets: mixed graphical models (MGMs).
An MGM parametrizes the joint distribution of a mixed dataset as a graph G = (V, E), where V is the set of variables and E is the set of edges. In this type of model, an edge exists between two variables X and Y if X and Y are conditionally dependent given the rest of the variables in the data.

Recently, several methods have been proposed to learn MGMs. One proposal is qp-graphs, which can be estimated from high-dimensional data. However, this type of model assumes that there are no edges between categorical variables, which is a limiting assumption for clinical data. Several authors have proposed regression-based methods to estimate the conditional dependencies among pairs of variables and thereby infer the edges in the graph. In these approaches, each variable in turn is considered as the target variable, a regression is performed using all other variables as predictors, and edges are added to the model for all significant regressors. One line of work uses a random forest regression approach to rank edges for inclusion in a graphical model among mixed variables. Another assumes that the conditional distributions of each type of variable come from the exponential family and uses node-wise regression to estimate the parameters of the model. Other authors have proposed similar techniques.

Another way to estimate an MGM is the pseudolikelihood approach. This approach uses the product of conditional distributions of the variables as a consistent estimator of the true likelihood without computing the partition function. A gradient-based optimization method can then be used to find maximum pseudolikelihood estimates of the parameters. Lee and Hastie propose an MGM that generalizes a popular continuous graphical model (the Gaussian graphical model) and a popular discrete model (the Markov random field). They demonstrate that the pseudolikelihood approach shows better empirical performance than separate regressions, so we focus on this approach. The authors parameterize the joint distribution of the variables according to Equation 1.
$$p(x, y; \theta) \propto \exp\left( \sum_{s=1}^{p} \sum_{t=1}^{p} -\beta_{st} x_s x_t + \sum_{s=1}^{p} \alpha_s x_s + \sum_{s=1}^{p} \sum_{j=1}^{q} \rho_{sj}(y_j) x_s + \sum_{j=1}^{q} \sum_{r=1}^{q} \phi_{rj}(y_r, y_j) \right) \quad (1)$$

where $\theta$ represents the full set of parameters, $x_s$ represents the $s$-th of $p$ continuous variables, and $y_j$ represents the $j$-th of $q$ discrete variables. $\beta_{st}$ represents the edge potential between continuous variables $s$ and $t$, $\alpha_s$ represents the continuous node potential for $s$, $\rho_{sj}$ represents the edge potential between continuous variable $s$ and discrete variable $j$, and finally $\phi_{rj}$ represents the edge potential between discrete variables $r$ and $j$. This model has the favorable property that its conditional distributions are given by Gaussian linear regression and multiclass logistic regression for continuous and discrete variables, respectively.

Learning this model directly over high-dimensional datasets is computationally infeasible due to the computation of the partition function. To avoid this, a proximal gradient method is used to learn a penalized negative log pseudolikelihood form of the model (Equation 2, the product of conditional distributions). To prevent overfitting, nonzero parameters are penalized (Equation 3). Here, $\lambda_{CC}$ is a penalty parameter only for edges between continuous variables (CC = continuous-continuous), while $\lambda_{CD}$ and $\lambda_{DD}$ are for mixed edges and edges between discrete variables, respectively. $\|\cdot\|_F$ refers to the Frobenius norm of a matrix. To optimize this objective function, the proximal gradient optimization method is used.
$$\tilde{l}(\Theta \mid x, y) = -\sum_{s=1}^{p} \log p(x_s \mid x_{/s}, y; \Theta) - \sum_{r=1}^{q} \log p(y_r \mid x, y_{/r}; \Theta) \quad (2)$$

$$\underset{\Theta}{\text{minimize}}\; l_\lambda(\Theta) = \tilde{l}(\Theta) + \lambda_{CC} \sum_{s=1}^{p} \sum_{t=1}^{s-1} |\beta_{st}| + \lambda_{CD} \sum_{s=1}^{p} \sum_{j=1}^{q} \|\rho_{sj}\| + \lambda_{DD} \sum_{j=1}^{q} \sum_{r=1}^{j-1} \|\phi_{rj}\|_F \quad (3)$$

In this section, we describe the computational methods used to realize our pipeline. First, we discuss our variable selection method, and then we discuss how we incorporate prior knowledge to select parameters for this method automatically. Next, we describe the synthetic and real data used to evaluate our approach and the metrics we apply in our evaluation. Lastly, we describe the prior knowledge sources used.
The first step in our procedure is to choose variables to model. To this end, we propose to address the following feature selection problem (Definition 1). We first proposed this problem to identify query results that are relevant to the user but also diverse, to give a broad snapshot of the underlying data. The problem is referred to as the Top-K relevant and diverse set problem, and is as follows.

Definition 1. Top-K Relevant and Diverse Set. Given a radius of similarity $0 \leq r \leq 1$, a set of variables $V$, an output size $k$, a similarity function $Sim(V_i, V_j)$, and a relevance function $Rel(V_i)$:

$$\begin{aligned} \text{maximize} \quad & \sum_{V_i \in S} Rel(V_i) \\ \text{subject to} \quad & S \subset V, \quad |S| = k, \\ & \forall i, j:\; V_i \in S \text{ and } V_j \in S \rightarrow Sim(V_i, V_j) < r \end{aligned} \quad (4)$$

Intuitively, we aim to find a set of variables that are relevant to the user, with the constraint that no pair of chosen variables is similar to one another. This is an appropriate choice prior to graphical modeling because graphical models can lose accuracy if redundant variables are included in the model. The method we propose to solve this problem is similar in principle to two filter methods: correlation-based feature selection and maximum relevance minimum redundancy (mRMR) feature selection. Both are greedy approaches that select the feature optimizing an objective function balancing relevance and diversity. The main differences in our approach are that we require zero redundancy and that we quantify redundancy using prior knowledge. Lastly, to ensure stability of the downstream model, we report the selected features as clusters with redundant variables included in each cluster (instead of discarding them). This allows the user to understand the redundancy in the data.

Another popular approach that follows this principle is Weighted Gene Correlation Network Analysis (WGCNA). Briefly, this method learns a weighted undirected correlation network by converting correlation to edge weight. With this network, it infers the dissimilarity between nodes, and uses network characteristics (e.g., hub nodes) to select important genes.
WGCNA differs from our approach in that it uses network characteristics instead of summary statistics to infer importance, and the network it infers is a correlation network, whereas graphical models aim to identify conditional dependence relationships.

Here, we solve this problem using the Preferential Diversity (Pref-Div) algorithm. Pref-Div is an iterative procedure that first selects the top-K most relevant variables and adds them to the result set R. Then, it determines whether any pair of variables in R is redundant (as defined by the radius of similarity r and the similarity function Sim(V_i, V_j)), and removes the lower-relevance variable of each such pair from the result set. Next, the most relevant K − |R| variables that have not yet been explored are added to the result set. This procedure repeats until a set of K relevant and diverse features is selected. We refer the reader to the original publication for the full procedure. In this work, we make one substantial modification to the original Pref-Div algorithm: we compute all variables considered redundant to the selected set of variables and return each selected variable together with its redundant variables as a cluster.

We instantiate the Pref-Div algorithm with the following parameter choices. The output size k is user-determined, since the appropriate choice depends on the computational resources available for graphical modeling. Similarity scores between pairs of features are given by Pearson correlation, and the relevance of each feature is given by its Pearson correlation to a pre-defined target variable. We note that having a target variable of interest is not necessary; unsupervised statistics such as variance, or domain knowledge, can be used to determine relevance scores. In the next section, we discuss how we select the radius of similarity, λ, using prior knowledge.
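The selection loop described above can be sketched as follows. This is a simplified, single-pass greedy variant for illustration, not the authors' implementation: the function name, inputs, and the tie-handling are our own, and relevance/similarity are assumed to be precomputed (e.g., absolute Pearson correlations).

```python
import numpy as np

def pref_div(relevance, sim, k, r):
    """Greedy sketch of Pref-Div: pick variables in order of relevance,
    fold any candidate within the similarity radius r of an already
    selected variable into that variable's cluster instead of the result set.

    relevance : length-p array of relevance scores (e.g. |corr| to target)
    sim       : p x p similarity matrix (e.g. |Pearson correlation|)
    k         : number of clusters to return
    r         : radius of similarity; pairs with sim >= r are redundant

    Returns a list of (selected_index, cluster_members) pairs.
    """
    order = list(np.argsort(relevance)[::-1])  # most relevant first
    selected, clusters = [], {}
    while order and len(selected) < k:
        cand = order.pop(0)
        # is the candidate redundant with an already selected variable?
        owner = next((s for s in selected if sim[s, cand] >= r), None)
        if owner is None:
            selected.append(cand)
            clusters[cand] = [cand]
        else:
            clusters[owner].append(cand)  # keep redundant variables as cluster members
    return [(s, clusters[s]) for s in selected]
```

With two highly correlated relevant variables, the more relevant one is selected and the other is reported inside its cluster rather than discarded, mirroring the modification described above.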
To choose λ*, we utilize a method we originally developed to select hyperparameters for graphical model structure learning. Here, we briefly summarize the approach for variable selection, called piPref-Div (Prior Information Pref-Div). The main idea is to compute a correlation graph across many different correlation thresholds λ*. A correlation graph contains an edge between $V_i$ and $V_j$ if the correlation between $V_i$ and $V_j$ is greater than the correlation threshold. The method proceeds in four major steps.

First, an appropriate parameter range is determined. This is done by identifying a range where the fewest edges are selected (the correlation graph is sparse) yet changing λ slightly results in a large change in the number of edges in the graph. Figure 2a shows a plot of the number of edges in the correlation graph vs. λ. Initially, a knee point is identified that best splits the curve into two straight lines (Panel a). Then, this procedure is repeated on each partition of the curve to compute two additional knee points (Panels b and c). The final parameter range is the space between these two knee points (Panel d).

Figure 2: (a) Illustration of the procedure to limit the tested parameter range. (b) Subsampling procedure to determine empirical probabilities for every edge in the correlation graph. B(λ, S) returns a correlation graph computed upon dataset S with threshold λ.

Then, a subsampling approach is used to compute empirical probabilities of appearance for each edge, by computing correlation graphs across the chosen range of thresholds on random subsamples drawn without replacement. The empirical probability of each edge is its frequency of appearance.

Next, the information contained in the prior knowledge sources is evaluated against these empirical probabilities across all edges (Equation 5). Each prior information source $t_r$ gives knowledge in the form of a probability of appearance for some fixed set of edges $wp_{t_r}$. $\tau_{t_r}$ quantifies the "unreliability" of source $t_r$. $\phi_k^{t_r}$ is the expected number of times edge $k$ should appear during the subsampling procedure according to source $t_r$, and $\mu_k$ is the actual number of appearances for edge $k$.

$$\tau_{t_r} = \frac{\sum_{k=1}^{|wp_{t_r}|} |\phi_k^{t_r} - \mu_k|}{|wp_{t_r}|} \quad (5)$$

Finally, posterior distributions are computed for each edge (Figure 2b). For each edge $k$, a normal distribution is used to approximate the probability of appearance for each prior source (red, green, and blue curves in Panel a). Using a normalized reciprocal of the scores computed in the previous step, these normal distributions are combined into a weighted mixture (black curve, Panel a). This mixture distribution is approximated by the normal distribution with minimal KL-divergence to the mixture (Panel b).
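Equation 5 can be computed directly once the subsampling counts are in hand. In the sketch below, deriving each $\phi_k^{t_r}$ as the source's stated edge probability times the number of subsampling runs is our reading of the text; the function name and inputs are illustrative.

```python
import numpy as np

def prior_unreliability(prior_probs, observed_counts, n_subsamples):
    """tau for one prior source (Equation 5).

    prior_probs     : appearance probability the source assigns to each of
                      its edges (the set wp_tr)
    observed_counts : mu_k, the number of times each such edge actually
                      appeared during subsampling
    n_subsamples    : number of subsampling runs, so phi_k = prob_k * runs
                      is the source's expected appearance count

    Returns the mean absolute gap between expected and observed counts;
    larger values mean a less reliable source.
    """
    phi = np.asarray(prior_probs, dtype=float) * n_subsamples
    mu = np.asarray(observed_counts, dtype=float)
    return float(np.abs(phi - mu).mean())
```

A source whose stated probabilities match the subsampling frequencies gets tau near zero, and piPref-Div weights sources by a normalized reciprocal of these scores.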
Finally, this normal distribution is combined with a normal distribution from the empirical probabilities to obtain a posterior distribution (Panel c, blue curve). Since some edges may not have prior information from any of the sources, λ* is split into two parameters: one for edges where prior information is available, $\lambda^*_{wp}$, and one for edges where it is not, $\lambda^*_{np}$. $\lambda^*_{wp}$ is chosen based upon the stability of the correlation graph across subsamples along with concordance to the posterior distribution for each feature. $\lambda^*_{np}$ is chosen the same way, except that the posterior distribution is the one computed from the data alone (pink curve, Panel c).

Simulated datasets were used to ensure algorithmic correctness and to understand the impact of prior information sources. Data was generated from a linear Gaussian graphical model. Edge coefficients were drawn uniformly at random from a symmetric range excluding values near zero, and error terms for each variable were zero-mean with variance drawn at random from a fixed range. Graphical structure was simulated using a "clustered simulation" (Figure 3). Here, each variable belonged to one of C clusters. Within each cluster, every pair of variables was connected by an edge. In c < C clusters, one randomly chosen variable was connected to the target variable (relevant clusters), while the remaining C − c clusters were disconnected from the rest of the network. Each cluster consisted of an equal number of variables. To represent a master regulator and to force correlated structure, each cluster had a single latent (unmeasured) variable that influenced the value of all variables in the cluster.

Figure 3: Cluster simulation used to generate simulated datasets. Purple nodes are master regulators of a cluster, blue nodes are causal parents of the target variable, and the beige node is the target variable.

Prior knowledge was simulated for two types of prior information sources: reliable priors and unreliable priors. All prior sources give information based on a beta distribution; however, the parameters of this distribution differ based on the type of prior and on whether the variables in question belong to the same cluster. An unreliable prior gives information drawn from Beta(4, ·) for both true and false edges (cluster memberships), whereas a reliable prior draws from Beta(10, ·) for true edges and Beta(2, ·) for false edges. The amount of prior information varies based on the experiment. To determine whether prior information is available for each edge, each edge receives a value b ∼ U(0, 1), and each prior information source has a value c ∈ [0, 1]. The prior gives information about the relationship if b < c. In this way, the simulated data reflects the fact that some relationships are more well-studied than others.

We evaluate piPref-Div on its ability to incorporate unreliable prior information in order to select relevant clusters more accurately. The metric we use for evaluation of selected clusters on simulated data is called cluster accuracy. The goal of this metric is to compare the relevant clusters output by piPref-Div to the truly relevant clusters in the data-generating graph, where a relevant cluster is a cluster with at least one variable that is a parent of the target variable. First, an optimal matching is computed between the predicted and actual clusters using the Hungarian algorithm. The cost of assigning a predicted cluster to an actual cluster is given by 1 minus the Jaccard similarity between the clusters. If multiple predicted clusters are best assigned to the same actual cluster, these clusters are combined. Finally, the average Jaccard similarity between the combined predicted clusters and their matched actual clusters is computed as the score.

To evaluate the performance of piPref-Div, we apply it to six publicly available breast cancer Affymetrix HGU133A microarray datasets. These datasets have been used in several previous analyses and represent a baseline for evaluating prediction methods.
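The cluster-accuracy metric can be sketched as below. We use SciPy's `linear_sum_assignment` for the Hungarian matching; since that matching is one-to-one, our handling of the "combine multiple predicted clusters" rule (falling back to each leftover predicted cluster's lowest-cost actual cluster when there are more predicted than actual clusters) is an assumption about the paper's intent, not a confirmed detail.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def jaccard(a, b):
    """Jaccard similarity between two sets: |intersection| / |union|."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

def cluster_accuracy(predicted, actual):
    """Match predicted to actual clusters (cost = 1 - Jaccard), merge
    predicted clusters that land on the same actual cluster, then average
    the Jaccard similarity of the matched pairs."""
    cost = np.array([[1 - jaccard(p, a) for a in actual] for p in predicted])
    rows, cols = linear_sum_assignment(cost)      # optimal one-to-one matching
    assign = dict(zip(rows, cols))
    for p in range(len(predicted)):               # leftover predicted clusters
        if p not in assign:                       # (more predicted than actual)
            assign[p] = int(np.argmin(cost[p]))   # go to their best actual cluster
    merged = {}
    for p, a in assign.items():
        merged.setdefault(a, set()).update(predicted[p])
    return float(np.mean([jaccard(m, actual[a]) for a, m in merged.items()]))
```

For example, two predicted clusters {1, 2} and {3} that together cover a true cluster {1, 2, 3} merge into a perfect match and score 1.0.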
Each dataset consists of microarray expression data for between 159 and 286 patients. For four of the datasets, the target variable of interest was whether or not the patient had relapse-free survival (RFS) for 5 years. For the other two (Schmidt et al. and Ivshina et al.), this information was unavailable, and metastasis-free survival (MFS) was used instead. Our evaluation consists of a five-fold cross-validation within each dataset to determine how well the selected features give predictive models that are generalizable (accurate) and stable. To measure accuracy, we use the area under the ROC curve (AUC), comparing the probability predictions from each method to the true binary outcome of RFS or MFS at five years. To measure stability, we use the average Tanimoto set similarity (intersection divided by union) of the sets of features selected in each fold.

To evaluate the potential of our full pipeline to discover knowledge from data, a graphical model was learned from the TCGA-BRCA RNA-Seq expression dataset. This data included gene expression measurements from 784 breast tumor samples and 13,994 genes. Breast cancer sub-type information for each tumor sample was obtained from a prior study. Breast cancer diagnosis and prognosis are commonly divided into four main subtypes: Luminal A, Luminal B, HER2+, and Triple-Negative breast cancer. The main driving distinction between these subtypes is the presence or absence of hormone receptors on the tumor cell surface, which can lead to varying prognoses. In these experiments, we aim to identify clusters distinguishing the four sub-types from expression data. To determine the stability of each of these clusters, a 10-fold cross-validation was done, and the stability of each cluster was the number of times a similar cluster (similarity > 0.85) was selected in each fold.
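The Tanimoto stability metric described above is straightforward to compute; this sketch (function names are ours) averages the pairwise set similarity over all pairs of cross-validation folds.

```python
from itertools import combinations

def tanimoto(a, b):
    """Tanimoto set similarity: |intersection| / |union|."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 1.0

def selection_stability(fold_selections):
    """Average pairwise Tanimoto similarity of the feature sets selected in
    each cross-validation fold; 1.0 means identical selections every fold."""
    pairs = list(combinations(fold_selections, 2))
    return sum(tanimoto(a, b) for a, b in pairs) / len(pairs)
```

A method that selects the same genes in every fold scores 1.0, while one whose selections barely overlap scores near 0, which is how stability is compared across methods in the experiments below.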
Prior knowledge consisted of five distinct sources of information. Physical gene distance used the base-pair distance between two genes on the chromosome. If two genes are on separate chromosomes, this value is set to zero. Otherwise, given gene $G_i$ spanning base pairs $B_i^1$ to $B_i^2$, gene $G_j$ spanning base pairs $B_j^1$ to $B_j^2$, and full chromosome length $C$, the physical distance prior is given by Equation 6. The prior is high when the space between the two genes covers only a small proportion of the chromosome.

$$Phys(G_i, G_j) = 1 - \frac{\max\left(B_i^1, B_j^1\right) - \min\left(B_i^2, B_j^2\right)}{C} \quad (6)$$

Gene family information was curated from the Human Genome Organization (HUGO). Gene families are groups of genes related by sequence and/or function, and a single gene can belong to multiple families. Thus, we represent each gene as a one-hot-encoded vector of family memberships. To compute the similarity between these vectors, we use the Jaccard similarity metric: the number of families in common divided by the total number of unique families either gene belongs to. A similar approach is used for the gene-disease mapping from DisGeNet. This database gives scores quantifying the level of knowledge that a change in a gene is related to a disease. We use the guilt-by-association principle to compute whether two genes are related: we represent each gene by a vector of scores over the diseases in the database and compute the cosine similarity between two gene vectors. Since all scores are positive, this metric is positive, and it is used directly as a probability.

Finally, we use gene-gene similarity data from two sources: Harmonizome and STRING. Harmonizome similarity data was curated from the Molecular Signatures Database and consisted of correlations between gene expression profiles across several microarray experiments. STRINGdb curates gene-gene relationship scores based on several factors, such as co-expression, literature co-occurrence, experimental evidence, and other databases.
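The first three prior scores described above reduce to small similarity functions. In this sketch the function names and argument layout are our own, and the physical-distance formula follows our reconstruction of Equation 6 (it assumes non-overlapping gene spans).

```python
import numpy as np

def physical_distance_prior(start_i, end_i, start_j, end_j, chrom_len,
                            same_chrom=True):
    """Equation 6 sketch: 1 minus the fraction of the chromosome spanned by
    the gap between two genes; 0 if the genes are on different chromosomes."""
    if not same_chrom:
        return 0.0
    gap = max(start_i, start_j) - min(end_i, end_j)
    return 1.0 - gap / chrom_len

def family_jaccard(families_i, families_j):
    """Jaccard similarity between two genes' HUGO family memberships:
    shared families divided by all unique families of either gene."""
    a, b = set(families_i), set(families_j)
    return len(a & b) / len(a | b)

def disease_cosine(scores_i, scores_j):
    """Cosine similarity between two genes' DisGeNet disease-score vectors;
    nonnegative (usable as a probability) because the scores are nonnegative."""
    x = np.asarray(scores_i, dtype=float)
    y = np.asarray(scores_j, dtype=float)
    return float(x @ y / (np.linalg.norm(x) * np.linalg.norm(y)))
```

Each function returns a value in [0, 1] for typical inputs, so all three sources can feed the prior-evaluation step on a common probability scale.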
STRINGdb scores were scaled from their (0, 1000) range to (0, 1).

Next, we demonstrate the performance of piPref-Div on simulated and real datasets. First, we evaluate its ability to determine reliable prior information sources and to incorporate those sources to select better clusters. Next, we evaluate the method in terms of its ability to accurately predict outcomes for breast cancer patients. Lastly, we use the full pipeline to learn a graphical model of breast cancer subtype discrimination.
First, we tested the ability of piPref-Div to accurately evaluate prior knowledge sources on simulated datasets of 500variables with 50 clusters (25 relevant), 200 and 50 samples, and 5 prior knowledge sources (3 reliable) with a randomamount of prior information. The results are presented in Figure 4. Here the "Actual Reliability" on the y-axis refers tothe sum of the probabilities given to true edges minus the sum of the probabilities given to false edges for each prior.The predicted weight for each prior knowledge source given by piPref-Div shows a clear association to the reliabilityscore. A benefit of this approach is that this weight does not appear to be dependent on the amount of prior information.Even with little prior information (blue circles), piPref-Div assigns an accurate weight to the knowledge sources.The next experiment investigated the impact of the amount and quality of prior knowledge on the ability for piPref-Div toidentify relevant clusters of variables in the data. In these experiments, we test the method using the same experimentalparameters as the prior knowledge evaluation section, and using a larger dataset with 3000 variables, 300 clusters (75relevant). For each experimental setting, 15 graphs were generated and the results are presented cumulatively over thesegraphs.The results for the small datasets are given by Figure 5. Sample size is clearly the most significant factor in determiningaccuracy of the selected clusters. Prior information gives a modest improvement in accuracy, but this benefit only occurswith at least 50% of prior information and at least 3 reliable sources out of 5. However, when all sources are unreliable,there is no decrease in accuracy unless there is a large amount of information present. Lastly, we note that the benefit of8 pipeline for integrated theory and data-driven modeling of genomic and clinical data
Figure 4: Predicted weight vs. true reliability for each prior knowledge source in simulated experiments for piPref-Div with (left) 50 samples and (right) 200 samples.

Figure 5: Accuracy of predicted clusters for varying amounts and reliability of prior knowledge. Sample size was set to 50 (left) and 200 (right).

prior information is drastically reduced in cases with sufficient sample size (the 200-sample case). This is intuitive, as with more data, correlation becomes a very stable measure, and prior information can be ignored.
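The reliability score plotted on the y-axis of Figure 4 (probability mass assigned to true edges minus mass assigned to false edges) can be sketched as follows (the function name and example values are illustrative, not from the piPref-Div implementation):

```python
def actual_reliability(prior_probs, true_edges):
    """Sum of prior probabilities assigned to true edges minus the
    sum assigned to false edges. `prior_probs` maps an edge (a, b)
    to the probability this prior source assigns it."""
    score = 0.0
    for edge, p in prior_probs.items():
        score += p if edge in true_edges else -p
    return score

# Hypothetical prior over four candidate edges, two of which are real
prior = {("g1", "g2"): 0.9, ("g1", "g3"): 0.2,
         ("g2", "g4"): 0.8, ("g3", "g4"): 0.1}
truth = {("g1", "g2"), ("g2", "g4")}
```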
Figure 6: Accuracy of predicted clusters for varying amounts and reliability of prior knowledge on large datasets. Sample size was set to 50 (left) and 200 (right).

Lastly, we examined the ability of piPref-Div to detect clusters from a larger graph (Figure 6). Here, the pattern is largely similar, except that the impact of prior knowledge is more significant. In particular, even 25% prior information gives a substantial increase in accuracy over having no prior information at all. Again, this impact is larger when the sample size is small, though it is present in both cases. Further, the impact of additional samples is more pronounced in the larger dataset: an increase from 50 to 200 samples raises accuracy from 0.65 to over 0.8 for all amounts of prior information.
To determine the performance of piPref-Div on real datasets, we applied the algorithm to the aforementioned breast cancer microarray datasets. Three variations of piPref-Div were tested: piPref-Div alone (PD), and piPref-Div with and without prior information (No Prior = NP) with clusters aggregated into summarized features using principal component analysis (PD-PCA and PD-NP-PCA, respectively). For the Pref-Div approaches, an inner 3-fold cross-validation loop was used to determine the number of selected features (1, 3, 5, and 10 features were tested). Genes with less than 0.5 standard deviation across samples in the training set were removed from the dataset prior to feature selection. For comparison purposes, two methods that performed well in a previous study were included in the analysis: Hybrid-Huberized SVM (HH-SVM) and Recursive-Reweighted Feature Elimination (RRFE).

Figure 7 presents the accuracy results. Across the datasets, the consistent best performing methods are PD-NP-PCA and PD-PCA (sea-green and light blue, respectively). This suggests that selecting clusters and representing them as summarized features offers a benefit over selecting single genes alone. However, this is dataset dependent, as Pref-Div alone (PD, yellow box) matches these methods on 2 of the datasets (Sotiriou and Desmedt) and performs better on 1 dataset (Wang). Overall, these results show no significant difference between using prior information (PD, PD-PCA) and not using prior information (PD-NP-PCA). Using prior information with PCA clustering shows a slight improvement on the Ivshina dataset, but not on the others. We find that our approach performs about the same in terms of accuracy as SVM-RRFE, but our method tends to select significantly fewer features.

Figure 9 presents the stability of the learned models. The results confirm previous work that identifying a stable model for breast cancer outcome prediction is a difficult problem.
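The pre-selection variance filter described above (dropping genes whose training-set standard deviation falls below 0.5) can be sketched as follows (a minimal illustration with hypothetical gene names; population standard deviation is used here for simplicity):

```python
import statistics

def filter_low_variance_genes(expression, min_sd=0.5):
    """Keep only genes whose standard deviation across training
    samples is at least `min_sd`. `expression` maps a gene name to
    its list of per-sample expression values."""
    return {
        gene: values
        for gene, values in expression.items()
        if statistics.pstdev(values) >= min_sd
    }

# Hypothetical expression values for two genes across four samples
data = {
    "BRCA1": [2.0, 3.5, 1.0, 4.0],   # variable across samples -> kept
    "ACTB":  [5.0, 5.1, 5.0, 4.9],   # nearly flat -> dropped
}
```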
In general, only the RRFE algorithm shows somewhat consistent stability; however, we note that a major contributing factor is that this algorithm uses on average 119 selected features, whereas HH-SVM averages around 6 and the PD approaches average around 1 feature (or cluster). Among the Pref-Div based approaches, PD-PCA with and without prior information shows the most consistent stability. On nearly all datasets, they are near to or on par with RRFE despite choosing significantly fewer features.
Figure 7: AUC of predicting RFS using several feature selection methods on six independent breast cancer microarray datasets.
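For reference, the AUC reported in Figure 7 is the probability that a randomly chosen positive case (e.g., a relapse) is scored above a randomly chosen negative case. A minimal sketch of its rank-based (Mann-Whitney) computation, with illustrative names:

```python
def auc(scores, labels):
    """Mann-Whitney formulation of AUC: the fraction of
    (positive, negative) pairs in which the positive example is
    scored higher, counting ties as half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```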
Figure 8: Graphical model of breast cancer subtype. The size of each edge represents the number of times a similar cluster was selected as related to Subtype across the cross-validation folds.
Finally, we evaluate our full pipeline of variable selection followed by graphical modeling based on its ability to mine interesting clusters related to breast cancer subtype. Based on the previous section, we chose PD-PCA for variable selection due to its consistently high accuracy and relatively high stability. An MGM model was learned on a dataset consisting of only the selected clusters and the Subtype variable. To summarize clusters into single names, the Ingenuity Pathway Analysis regulator analysis was used, and the KEGG Pathway database was queried (corrected p-values < 0.05 were chosen as candidates). Following this step, only specific pathways and regulators were included as names of the clusters.

The learned graphical model is presented in Figure 8. We found that two clusters could not be mapped coherently to any biological function (their single-gene representatives were TMEM41A and TSPAN15); however, these clusters were relatively unstable. The two most stable clusters were the Fanconi Anemia/Hereditary Breast Cancer pathway and a set of genes regulated by MYCN. The Fanconi Anemia and Hereditary Breast Cancer pathways are known to share common genes, and developing breast cancer through a genetic basis tends to be associated with ER+ breast cancer. MYC family pathways and the transcription factors themselves are known to be differentially expressed across subtypes, and the MYCN factor in particular has shown differences between triple-negative and other subtypes. FOXA1, along with GATA3 and ESR1, is necessary for maintaining a luminal phenotype of breast cancer, and AGR2 is upregulated by FOXA1 but only in an estrogen receptor-dependent manner. This implies that the FOXA1-AGR2 loop will only be upregulated in ER+ breast cancer. Though it is unclear how KRT14-regulated genes distinguish subtypes of breast cancer, it is known that upregulation of KRT14 reduces the ability of breast tumors to metastasize and invade the extracellular matrix.
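The cluster-naming step above rests on standard over-representation analysis. The actual analysis queried Ingenuity Pathway Analysis and KEGG with multiple-testing correction; the standalone hypergeometric test below is only an illustrative sketch of the underlying statistic, with hypothetical counts:

```python
from math import comb

def hypergeom_enrichment_p(cluster_size, hits, pathway_size, universe):
    """Upper-tail hypergeometric test: probability of observing at
    least `hits` pathway genes in a cluster of `cluster_size` genes
    drawn from `universe` genes, `pathway_size` of which belong to
    the pathway."""
    total = comb(universe, cluster_size)
    tail = 0
    for k in range(hits, min(cluster_size, pathway_size) + 1):
        tail += comb(pathway_size, k) * comb(universe - pathway_size,
                                             cluster_size - k)
    return tail / total

# Hypothetical: 8 of 10 cluster genes fall in a 50-gene pathway
# drawn from a 20,000-gene background
p = hypergeom_enrichment_p(10, 8, 50, 20000)
```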
Overall, we find that the pipeline constructs and selects reasonable clusters that are discriminative of breast cancer subtypes. In addition, the pipeline generates novel candidate clusters for scientists to experimentally probe.

In this work, we have presented a pipeline to learn graphical model structure from large omics datasets. The pipeline builds upon previous work in integrating and evaluating prior information to select hyperparameters, and a variable selection method to identify relevant yet non-redundant sets of features. We have extended this approach to return clusters of variables instead of individual features, and to use these features as input to a graphical model structure learning algorithm to find interesting relationships within clusters and between clusters and phenotypes.

We evaluated our work on both synthetic data and real breast cancer data. On the synthetic data, we find that our method is able to accurately evaluate prior information and utilize this information to improve the selection of relevant clusters. In addition, the method avoids performing more poorly than having no prior information at all when most of the available prior information is unreliable. We find that the largest improvement over having no prior information occurs when
there are few samples and a large number of features. Overall, we find the improvement from using prior information to be modest, but it appears necessary to avoid poor performance with unreliable priors.

In classification experiments with microarray data, we find that our method performs on par with or better than other state-of-the-art approaches. We find that using PCA to summarize clusters into single variables gives a slight improvement over selecting single variables alone. Compared to the state-of-the-art approaches, our method selects far fewer features to achieve similar or better levels of accuracy. When using the full pipeline with graphical modeling to analyze RNA-Seq data to discriminate breast cancer subtypes, we find that the method identifies reasonable clusters. Two of the seven clusters did not map to any known biological regulator or pathway, and these are candidates for further investigation.

For future work, we aim to improve upon the accuracy results with reliable prior information. It could be that using the prior information solely to select hyperparameters is too conservative an approach, and that using the posterior distributions directly could give better results. Since the priors are already being appropriately weighted, the posterior distributions should be accurate. In addition, we aim to combine our pipeline with experimental approaches to validate selected clusters in terms of activity and biological significance. Finally, the prior information sources used for the biological experiments were relatively sparse. It remains future work to utilize the vast array of available gene expression experiments to construct priors that cover a wider representation of the genome.
This work was supported by NIH Grants U01HL137159, R01LM012087 (to PVB), and T32CA082084 (to VKR).
References

[1] Alan D. D'Andrea. The Fanconi anemia and breast cancer susceptibility pathways. The New England Journal of Medicine, 362(20):1909, 2010.
[2] Nicolas Alcaraz, Markus List, Richa Batra, Fabio Vandin, Henrik J Ditzel, and Jan Baumbach. De novo pathway-based biomarker identification. Nucleic Acids Research, 45(16):e151–e151, 2017.
[3] Amin Allahyar and Jeroen De Ridder. FERAL: network-based classifier with application to breast cancer outcome prediction. Bioinformatics, 31(12):i311–i319, 2015.
[4] Nirmalya Bandyopadhyay, Tamer Kahveci, Steve Goodison, Yijun Sun, and Sanjay Ranka. Pathway-based feature selection algorithm for cancer microarray data. Advances in Bioinformatics, 2009, 2010.
[5] Julian Besag. Statistical analysis of non-lattice data. Journal of the Royal Statistical Society, Series D, pages 179–195, 1975.
[6] Kevin R Brown and Igor Jurisica. Online predicted human interaction database. Bioinformatics, 21(9):2076–2082, 2005.
[7] Sanjib Chaudhary, B Madhu Krishna, and Sandip K Mishra. A novel FOXA1/ESR1 interacting pathway: a study of Oncomine breast cancer microarrays. Oncology Letters, 14(2):1247–1264, 2017.
[8] Shizhe Chen, Daniela M Witten, and Ali Shojaie. Selection and estimation for mixed graphical models. Biometrika, 102(1):47–64, 2014.
[9] Xi Chen and Lily Wang. Integrating biological knowledge with gene expression profiles for survival prediction of cancer. Journal of Computational Biology, 16(2):265–278, 2009.
[10] Jie Cheng, Tianxi Li, Elizaveta Levina, and Ji Zhu. High-dimensional mixed graphical models. Journal of Computational and Graphical Statistics, 2016.
[11] Jill Cheng, Melissa Cline, John Martin, David Finkelstein, Tarif Awad, David Kulp, and Michael A Siani-Rose. A knowledge-based clustering algorithm driven by gene ontology. Journal of Biopharmaceutical Statistics, 14(3):687–700, 2004.
[12] Yupeng Cun and Holger Fröhlich. Prognostic gene signatures for patient stratification in breast cancer: accuracy, stability and interpretability of gene selection approaches using prior knowledge on protein-protein interactions. BMC Bioinformatics, 13(1):69, 2012.
[13] Chris Ding and Hanchuan Peng. Minimum redundancy feature selection from microarray gene expression data. Journal of Bioinformatics and Computational Biology, 3(02):185–205, 2005.
[14] Bernd Fellinghauer et al. Stable graphical model estimation with random forests for discrete, continuous, and mixed variables. Computational Statistics & Data Analysis, 64:132–152, 2013.
[15] Jerome Friedman, Trevor Hastie, and Robert Tibshirani. Sparse inverse covariance estimation with the graphical lasso. Biostatistics, 9(3):432–441, 2008.
[16] Xiaoyu Ge, Panos K Chrysanthis, and Alexandros Labrinidis. Preferential diversity. In Proceedings of the Second International Workshop on Exploratory Search in Databases and the Web, pages 9–14. ACM, 2015.
[17] Zheng Guo, Tianwen Zhang, Xia Li, Qi Wang, Jianzhen Xu, Hui Yu, Jing Zhu, Haiyun Wang, Chenguang Wang, Eric J Topol, et al. Towards precise classification of cancers based on robust gene functional expression profiles. BMC Bioinformatics, 6(1):58, 2005.
[18] Isabelle Guyon and André Elisseeff. An introduction to variable and feature selection. Journal of Machine Learning Research, 3(Mar):1157–1182, 2003.
[19] Mark A Hall. Correlation-based feature selection of discrete and numeric class machine learning. 2000.
[20] Anne-Claire Haury, Pierre Gestraud, and Jean-Philippe Vert. The influence of feature selection methods on accuracy, stability and interpretability of molecular signatures. PLoS ONE, 6(12):e28210, 2011.
[21] Zena M Hira and Duncan F Gillies. A review of feature selection and feature extraction methods applied on microarray data. Advances in Bioinformatics, 2015, 2015.
[22] Dai Horiuchi et al. MYC pathway activation in triple-negative breast cancer is synthetic lethal with CDK inhibition. Journal of Experimental Medicine, 209(4):679–696, 2012.
[23] Grace T Huang, Ioannis Tsamardinos, Vineet Raghu, Naftali Kaminski, and Panayiotis V Benos. T-ReCS: stable selection of dynamically formed groups of features with application to prediction of clinical outcomes. Pacific Symposium on Biocomputing, 20:431–442, 2015.
[24] Guanglong Jiang, Shijun Zhang, Aida Yazdanparast, Meng Li, Aniruddha Vikram Pawar, Yunlong Liu, Sai Mounika Inavolu, and Lijun Cheng. Comprehensive comparison of molecular portraits between cell lines and tumors in breast cancer. BMC Genomics, 17(7):525, 2016.
[25] Marc Johannes, Jan C Brase, Holger Fröhlich, Stephan Gade, Mathias Gehrmann, Maria Fälth, Holger Sültmann, and Tim Beißbarth. Integration of pathway knowledge into a reweighted recursive feature elimination approach for risk stratification of cancer patients. Bioinformatics, 26(17):2136–2144, 2010.
[26] Minoru Kanehisa and Susumu Goto. KEGG: Kyoto Encyclopedia of Genes and Genomes. Nucleic Acids Research, 28(1):27–30, 2000.
[27] Daphne Koller and Nir Friedman. Probabilistic Graphical Models: Principles and Techniques. MIT Press, 2009.
[28] Jason D Lee and Trevor J Hastie. Learning the structure of mixed graphical models. Journal of Computational and Graphical Statistics, 24(1):230–253, 2015.
[29] Jan Lemeire, Stijn Meganck, Francesco Cartella, and Tingting Liu. Conservative independence-based causal structure learning in absence of adjacency faithfulness. International Journal of Approximate Reasoning, 53(9):1305–1325, 2012.
[30] Dimitris V Manatakis, Vineet K Raghu, and Panayiotis V Benos. piMGM: incorporating multi-source priors in mixed graphical models for learning disease networks. Bioinformatics, 34(17):i848–i856, 2018.
[31] Anna Marie Mulligan et al. Common breast cancer susceptibility alleles are associated with tumour subtypes in BRCA1 and BRCA2 mutation carriers: results from the Consortium of Investigators of Modifiers of BRCA1/2. Breast Cancer Research, 13(6):R110, 2011.
[32] Wei Pan. Incorporating gene functions as priors in model-based clustering of microarray gene expression data. Bioinformatics, 22(7):795–801, 2006.
[33] Janet Piñero, Àlex Bravo, Núria Queralt-Rosinach, Alba Gutiérrez-Sacristán, Jordi Deu-Pons, Emilio Centeno, Javier García-García, Ferran Sanz, and Laura I Furlong. DisGeNET: a comprehensive platform integrating information on human disease-associated genes and variants. Nucleic Acids Research, 2016.
[34] Vineet K Raghu et al. Comparison of strategies for scalable causal discovery of latent variable models from mixed data. International Journal of Data Science and Analytics, pages 1–13, 2018.
[35] Andrew D Rouillard, Gregory W Gundersen, Nicolas F Fernandez, Zichen Wang, Caroline D Monteiro, Michael G McDermott, and Avi Ma'ayan. The harmonizome: a collection of processed datasets gathered to serve and mine knowledge about genes and proteins. Database, 2016, 2016.
[36] Andrew J Sedgewick, Ivy Shi, Rory M Donovan, and Panayiotis V Benos. Learning mixed graphical models with separate sparsity parameters and stability-based model selection. BMC Bioinformatics, 17(S5):175, 2016.
[37] Noah Simon, Jerome Friedman, Trevor Hastie, and Robert Tibshirani. A sparse-group lasso. Journal of Computational and Graphical Statistics, 22(2):231–245, 2013.
[38] Christine Staiger, Sidney Cadot, Balázs Györffy, Lodewyk FA Wessels, and Gunnar W Klau. Current composite-feature classification methods do not outperform simple single-genes classifiers in breast cancer prognosis. Frontiers in Genetics, 4:289, 2013.
[39] Aravind Subramanian, Pablo Tamayo, Vamsi K Mootha, Sayan Mukherjee, Benjamin L Ebert, Michael A Gillette, Amanda Paulovich, Scott L Pomeroy, Todd R Golub, Eric S Lander, et al. Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. Proceedings of the National Academy of Sciences, 102(43):15545–15550, 2005.
[40] Damian Szklarczyk, Andrea Franceschini, Stefan Wyder, Kristoffer Forslund, Davide Heller, Jaime Huerta-Cepas, Milan Simonovic, Alexander Roth, Alberto Santos, Kalliopi P Tsafou, et al. STRING v10: protein–protein interaction networks, integrated over the tree of life. Nucleic Acids Research, 43(D1):D447–D452, 2014.
[41] Ian W Taylor, Rune Linding, David Warde-Farley, Yongmei Liu, Catia Pesquita, Daniel Faria, Shelley Bull, Tony Pawson, Quaid Morris, and Jeffrey L Wrana. Dynamic modularity in protein interaction networks predicts breast cancer outcome. Nature Biotechnology, 27(2):199, 2009.
[42] Robert Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B (Methodological), 58(1):267–288, 1996.
[43] Inma Tur and Robert Castelo. Learning mixed graphical models from data with p larger than n. In Proceedings of the Twenty-Seventh Conference on Uncertainty in Artificial Intelligence, pages 689–697. AUAI Press, 2011.
[44] David Venet, Jacques E Dumont, and Vincent Detours. Most random gene expression signatures are significantly associated with breast cancer outcome. PLoS Computational Biology, 7(10):e1002240, 2011.
[45] Li Wang, Ji Zhu, and Hui Zou. Hybrid huberized support vector machines for microarray classification and gene selection. Bioinformatics, 24(3):412–419, 2008.
[46] Jill M Westcott et al. An epigenetically distinct breast cancer cell subpopulation promotes collective invasion. The Journal of Clinical Investigation, 125(5):1927–1943, 2015.
[47] Tricia M Wright, Suzanne E Wardell, Jeff S Jasper, James P Stice, Rachid Safi, Erik R Nelson, and Donald P McDonnell. Delineation of a FOXA1/ERα/AGR2 regulatory loop that is dysregulated in endocrine therapy–resistant breast cancer. Molecular Cancer Research, 12(12):1829–1839, 2014.
[48] Eunho Yang, Yulia Baker, Pradeep Ravikumar, Genevera Allen, and Zhandong Liu. Mixed graphical models via exponential families. In Artificial Intelligence and Statistics, pages 1042–1050, 2014.
[49] Bin Zhang and Steve Horvath. A general framework for weighted gene co-expression network analysis. Statistical Applications in Genetics and Molecular Biology, 4(1), 2005.
[50] Yanni Zhu, Xiaotong Shen, and Wei Pan. Network-based support vector machine for classification of microarray samples. BMC Bioinformatics, 10(1):S21, 2009.