[PDF] ACCORDION: Clustering and Selecting Relevant Data for Guided Network Extension and Query Answering

Abstract

Querying new information from knowledge sources, in general, and published literature, in particular, aims to provide precise and quick answers to questions raised about a system under study. In this paper, we present ACCORDION (Automated Clustering Conditional On Relating Data of Interactions tO a Network), a novel tool and a methodology to enable efficient answering of biological questions by automatically assembling new, or expanding existing models using published literature. Our approach integrates information extraction and clustering with simulation and formal analysis to allow for an automated iterative process that includes assembling, testing and selecting the most relevant models, given a set of desired system properties. We applied our methodology to a model of the circuitry that con-trols T cell differentiation. To evaluate our approach, we compare the model that we obtained, using our automated model extension approach, with the previously published manually extended T cell differentiation model. Besides demonstrating automated and rapid reconstruction of a model that was previously built manually, ACCORDION can assemble multiple models that satisfy desired properties. As such, it replaces large number of tedious or even imprac-tical manual experiments and guides alternative hypotheses and interventions in biological systems.

Full PDF

BBioinformatics , YYYY, 0–0 doi: 10.1093/bioinformatics/xxxxx Advance Access Publication Date: DD Month YYYY Manuscript Category

Subject Section

ACCORDION: Clustering and Selecting Relevant Data for Guided Network Extension and Query Answering

Yasmine Ahmed * , Cheryl Telmer and Natasa Miskov-Zivanov * Electrical and Computer Engineering Department, Bioengineering, Computational and Systems Biology, University of Pittsburgh, Department of Biological Sciences, Carnegie Mellon University, Pittsburgh, PA *To whom correspondence should be addressed. Associate Editor: XXXXXXX

Received on XXXXX; revised on XXXXX; accepted on XXXXX

Abstract

Querying new information from knowledge sources, in general, and published literature, in particular, aims to provide precise and quick answers to questions raised about a system under study. In this paper, we present ACCORDION ( A utomated C lustering C onditional O n R elating D ata of I nteractions t O a N etwork), a novel tool and a methodology to enable efficient answering of biological questions by automatically assembling new, or expanding existing models using published literature. Our approach integrates information extraction and clustering with simulation and formal analysis to allow for an automated iterative process that includes assembling, testing and selecting the most relevant models, given a set of desired system properties. We applied our methodology to a model of the circuitry that controls T cell differentiation. To evaluate our approach, we compare the model that we obtained, using our automated model extension approach, with the previously published manually extended T cell differentiation model. Besides demonstrat-ing automated and rapid reconstruction of a model that was previously built manually, ACCORDION can assemble multiple models that satisfy desired properties. As such, it replaces large number of tedious or even impractical manual experiments and guides alternative hypotheses and interventions in biological systems. Contact: {yaa38, nmzivanov}@pitt.edu

Supplementary information:

Supplementary data are available at

Bioinformatics online. Introduction

While modeling helps explain complex systems, guides data collection and generates new challenges and questions [1], it is still largely depend-ent on human contributions. For example, in biology, model creation re-quires reading hundreds of papers, extracting useful information manu-ally, incorporating background and common-sense knowledge of domain experts, and conducting wet-lab experiments. These time-consuming steps make the creation and the development of models a slow, laborious and error-prone process. In addition, as the amount of biological data in the public domain grows rapidly, problems of data inconsistency and frag-mentation are arising [2]. Therefore, the automation of model building, and even more, of model extension, when new information becomes avail-able, or when the domain knowledge advances, is a critical next step for computational modeling. Such automation will not only lead to more effi-cient modeling due to reducing the amount of slow human interventions, but will also allow for more consistent, comprehensive and robust model-ing process. In the last few decades, computer models have been used to explain how biomolecular signaling pathways regulate cell functions. Usually, modelers start with a few seed components and their interactions, or with a baseline model that can be found in curated public model databases such as Reactome [3], STRING [4], KEGG [5], or in published literature. De-pending on the questions to be answered with modeling, the baseline model is usually further extended with the information extracted from published literature or obtained from domain experts [6]. In order to auto-mate the collection of articles and information extraction, one begins with a formal search query, which is defined according to a question posed about the modeled system. The search query guides automated selection of articles that contain relevant information from published literature da-tabases. As the biomedical literature mining tools are becoming essential for the high throughput extraction of knowledge from scientific papers, we use in our work existing machine reading engines. We then use the extracted information to extend or assemble models in order to answer questions about the system under investigation [7]. In [8], the authors proposed a method that starts with a baseline model and selects interactions automatically extracted from published work. The goal of [8] was to build a model that satisfies pre-defined requirements or to identify new therapeutic targets, formally expressed as existing or de-sired system properties. As results in [8] demonstrate, automatic model extension is a promising approach for accelerating modeling, and conse-quently, disease treatment design. The authors in [8] organize the infor-mation extracted from literature into layers, based on their proximity to the baseline model. Recently, another extension method that uses a Ge-netic Algorithm (GA) was proposed in [9]. The GA-based approach was able to extract a set of extensions that led to the desired behavior of the final extended model. The disadvantages of the GA-based approach in-clude non-determinism, as the solution may vary across multiple algo-rithm executions on the same inputs, as well as issues with scalability. In this work, we propose ACCORDION ( A utomated C lustering C on-ditional O n R elating D ata of I nteractions t O a N etwork), a tool that auto-matically and efficiently assembles the information extracted from avail-able literature into models, tests the newly assembled models, and selects the most suitable model to address user questions. In contrast to [8], our approach focuses on identifying clusters of strongly connected elements in the newly extracted information, that have a measurable impact when added to the model. Once the interactions extracted from the literature are clustered, we score their performance on a selected set of system proper-ties, using stochastic simulation methods [11] and statistical model check-ing [10]. The scoring helps determine which clusters to add to the baseline model. Therefore, ACCORDION takes at most a few hours to execute thousands of experiments in silico , which would take days, or months, or would be impractical to conduct in vivo or in vitro . ACCORDION is versatile and can be used to extend many different models. We have selected a computational model of T cell differentiation [12] to demonstrate the accuracy, efficiency and utility of the tool. Our main goal with this case study is to show that ACCORDION is able to expand automatically, without human intervention, an existing published model into another published and manually built model, using new ele-ments and new interactions automatically extracted from published litera-ture. As the final golden model, we used the T cell model published in [13] and the set of desired system properties discussed in [12][13]. The golden model and the properties are used to evaluate the ACCORDION output. To this end, the contributions of this work include: (i) a new method to extend models by combining clustering and path finding with model test-ing on a set of formally written desired system properties; (ii) an evalua-tion of the effect of published literature and machine reading when auto-matically reproducing a manually built model; (iii) several new candidate models of the circuitry controlling naïve T cell differentiation, assembled automatically, satisfying the same set of desired properties as existing manually built models, and thus, enabling exploration of redundancies or discovering alternative pathways of regulation. Background

We provide in this section an overview of several tools and background concepts that are used by ACCORDION. We first describe the tools that we have used to automatically find and read published papers relevant for user queries (Section 2.1). In order to use the extracted information in models, while retaining all the useful information, a suitable representa-tion format for model components is critical (Section 2.2). We detail the components of the representation format and provide a brief overview of the model analysis techniques. ACCORDION uses these techniques and tools to evaluate the newly expanded models, and to select the best model to address user questions (Section 2.3).

Information extraction from literature

Extraction from literature usually starts with a question, for example, “How is PTEN regulation involved in T-cell fate?” We can write these questions as logical expressions (

Figure 1 (a)(left)). These formally writ-ten queries are used to search public literature databases (e.g., PubMed [15]) as illustrated in

Figure 1 (a)(middle). Once the relevant papers are selected, they are sent to machine reading engines for automated extrac-tion of information (

Figure 1 (a)(middle)). The state-of-the-art automated reading engines [18][19] are capable of finding hundreds of thousands of events in cellular signaling pathways from thousands of papers, in a few hours. Events in the machine reading output represent interactions between biochemical entities, such as post-translational modifications (e.g., binding, phosphorylation, ubiquitination, etc.), transcription, translation, translocation, and increase or decrease of amount or activity. In the context of biomedical literature, entities are usu-ally proteins, chemicals, genes, and RNAs, although sometimes they also represent biological processes. For each extracted entity, reading engines provide its name, the database where it is characterized, and the database identifier (ID) for the entity. Machine reading also collects the evidence, usually a sentence from which the event was extracted. For our case study, we used an open-source reading engine, REACH [19], to quickly obtain information from biomedical literature. In

Figure 1 (a)(right), we show two example sentences. The REACH reading engine extracts events into an interaction-based format shown in

Figure 1 (b). We will refer to the list of interactions retrieved from literature in this format as reading output . Model representation and executable models

The three rows in the table in

Figure 1 (b) can be automatically translated into the element-based BioRECIPES format [20], which is then used as input to the executable model generation (see Section 2.3). The BioREC-IPES tabular model representation format is illustrated in

Figure 1 (c) with several examples of molecules and interactions in T cells [12]. In the ex-amples, PTEN is positively regulated by Foxp3, and negatively regulated by TCR. Ras has one positive regulator, TCR, and no negative regulators. IL-2 has positive and negative regulators, Ras and Foxp3, respectively. The BioRECIPES representation format includes, for each model ele-ment: (i) name, (ii) type (protein, gene, RNA, or a chemical), (iii) identi-fier from a database (e.g., UniProt [21]), (iv) variable that represents state, and (v) set of regulators. While the BioRECIPES format is a sufficient representation for all the relevant element and interaction information, all interactions in a model can also be represented as a directed graph G ( V , E ), with a set of nodes V and a set of directed edges E . Each node v ∈ V corresponds to one model element, and each edge e ( v i , v j ) ∈ E represents a directed interaction in which element v i regulates element v j . The graph-ical representation of all model interactions is often referred to as an in-fluence map, and it is especially useful for the methods that are used in ACCORDION, as will be discussed in Section 3.2. Next to the table in Figure 1 (c), we show a graph of element interactions that are listed in the table. As can be seen in the graph, we include the information about the sign of the interaction in the form of arrow type, a pointed arrow represents positive regulation, while a blunt arrow represents negative regulation. We will refer to the set of regulators of an element as its influence set, distinguishing between positive and negative regulators. Additionally, we can define a vector of all variables representing states of model elements as x = ( x ,., x N ), where N=|V| is the total number of model elements. If we use Boolean variables, then x i ∈ {0, 1}, where i =1.. N . Next, we can assign a state transition function to any model element, which defines a state change of the element, given the states of its regulators. We will refer to these functions as element update rules and to the model with update rules rticle short title as an executable model . In the case of Boolean variables representing ele-ment states, the basic operations are AND (*), OR (+) and NOT (!). For example, one version of update rules for the small graph in Figure 1 (c) can be: PTEN = Foxp3 * !TCR, Ras = TCR, and IL-2 = Ras * !Foxp3. The choice between AND and OR operation depends on the available infor-mation about interactions and element regulations. For example, for an element to be “activated”, all necessary regulators are combined with an AND operation, and all sufficient regulators with an OR operation.

Model analysis

We describe here two methods that we use to analyze the models extended with the newly obtained information and data.

We use the DiSH simulator [11] to observe dynamic behavior of the baseline model and the extended models. DiSH can simulate networks with multi-valued elements in both deterministic and stochastic manner, and we utilize both of these features in our analy-sis, as shown later in Section 5. Each simulation run starts with a specified initial model state, where initial values are assigned to all model elements to represent a particular system state (e.g., naïve T cell, regulatory T cell, etc.). Next, we use element update rules to determine element state transi-tions. We track element changes for a pre-defined number of simulation steps, or until a steady state is reached [11]. Here, we provide a brief description of the simulation approach, more details can be found in [11]. Furthermore, this approach has been used for simulation and analysis of discrete models, such as those described in [22][12], and was previously incorporated into several other tools [23][24][25]. If we assume that a simulation run has M steps, we define a trajectory of element x i in the k th run as a time course of its state values T k ( x i )= ( x i , x i ,…, x iM ) k in time steps t =0,.., M . When the simulator is in the stochastic mode, in each simulation step, only one element is randomly chosen, and its new value is computed according to its update rule. De-pending on the information available, the rates at which elements are up-dated can be different across model elements; when there is limited infor- mation about elements, we choose to use the same update rate for all ele-ments. In either case, due to the randomness in element update order, mul-tiple runs that start with the same initial state may result in different (non-deterministic) state transitions, and thus, in different trajectories of state changes in time. DiSH simulations output a file that includes all the sim-ulated trajectories for all model elements, in other words, for K runs, for each model element x i , we obtain its simulated trajectories T ( x i )={ T ( x i ), T ( x i )…, T K ( x i )}. These trajectories can be used to plot and visualize be-havior over time for any given element. Typically, averaged trajectories are plotted [12][22], where an average element state value is computed across all trajectories, in each simulation step. In this work, we use statistical model checking [10][14] to test all generated models against formally defined properties. Model checking is often used to verify whether a model of a system, or a system design, satisfies a set of properties describing expected behavior of the system. Each property is encoded into Bounded Linear Temporal Logic (BLTL) [26][14]. Here, we use statistical model checking since the state transitions are not necessarily deterministic, and we follow the simulation approach described in Section 2.3.1. To avoid a full state space search, statistical model checking conducts randomized sampling to generate simulation trajectories of the model and performs statistical anal-ysis on those trajectories. The input to the statistical model checker is a system property expressed as a BLTL formula, and the output is a proba-bility estimate ( P ) that the model satisfies a given property, under particu-lar error interval for the estimate. For instance, let us assume that we would like to test a property that, at any point within the first s time steps, ele-ment v i becomes 1 and element v j becomes 0, and that they both keep those values for at least s time steps. We would then write the formula: 𝐹 ! ! 𝐺 ! " %𝑣 " = 1 ∧ 𝑣 = 0+ , where 𝐹 ! ! stands for “any time in the future s steps”, and 𝐺 ! " stands for “globally for s steps”. Proposed methodology

The steps and components within ACCORDION are outlined in

Figure 1 (d). The first step of our proposed methodology is creating an input for

Figure 1.

ACCORDION inputs and methodology overview: (a)

Left : example query used to select relevant papers.

Middle : main components of infor-mation extraction from relevant papers.

Right : two example sentences with highlighted entities and events that are extracted by machine readers. (b)

Top : tabular outputs from REACH engine when reading example sentences from (a).

Bottom : graphical representation of REACH outputs. (c)

Left : tabular representation of several elements and their influence sets (positive and negative regulators) in BioRECIPES format.

Right : graphical representation of elements and influence sets. (d) The flow diagram of the ACCORDION processing steps, inputs, and outputs. (e) Toy example: a baseline model and connected clusters: blue edges highlight a return path within one cluster, and red edges show a return path connecting two clusters.

ACCORDION, which includes extracting new event information from lit-erature by machine reading engines, followed by filtering, scoring and classifying these events. Once the new input is created, the three main steps within ACCORDION are performed, and they include (1) clustering of new events, (2) assembly of the clustered event data into models, and (3) selection of the most suitable and useful events. In the following sub-sections, we discuss each of these steps in detail.

Extraction and classification of new event information

To query for new information from literature databases, we write ques-tions as search terms in the form of logical expressions, which can be used by literature search engines, either Non-Medline (e.g., Google), or Med-line search tools (e.g., PubMed [15], Ovid [29]). The search engines return a list of papers most relevant to these search terms. The selected papers are then used as an input for machine reading engines, which extract enti-ties and events from the papers. Once the event information is extracted, it is forwarded to the event classification tool, to identify potential exten-sions for the existing model. However, the output of machine reading en-gines can contain inconsistencies and errors. Therefore, extracted event information needs to be filtered before it can be used in models. First, we select from the reading output the protein-protein interactions, not including the more general biological processes information. The ra-tionale behind this is that there is often not enough context for a mentioned biological process, and the lack of context affects interpretation of the ex-tracted interaction. The machine reading output is further filtered using public protein interaction databases [21][4][30], which increases the con-fidence in the interactions that will be used as potential extensions for models. To classify the remaining interactions, we use an interaction clas-sification tool [31]. As described in Section 1, we assume that, in order to answer a query, we would most often start from an existing baseline model, and thus, the extracted interactions are classified according to their relationship with the baseline model. We classify interactions from the reading output into three groups: (a) corroborations , when the interaction matches an interaction that already exists in the model; (b) contradictions , when the interaction represents a contradicting regulatory mechanism from the one that exists between the same elements in the model; (c) ex-tensions , when the interaction is not found in the model. As corroborations confirm what is already in the baseline model, we do not use them in extending the baseline model. In our future work, we plan to include a confidence measure for the interactions in the model, and the corroborations found in literature would contribute to computing the con-fidence. Additionally, ACCORDION currently does not examine and uti-lize the information within the extracted contradictions, although they may hold useful information about the modeled system. In some cases, contra-dictions could be even considered as model extensions. For example, in the reading output, we often come across interactions stated as “A posi-tively regulates B” or just “A regulates B”, while the model includes in-teraction “A inhibits B” or “B inhibits A” or “B regulates A”. Given that extracted contradictions can be further explored and the information within contradictions can sometimes lead to model improvements, we will explore them more carefully in our future work. To extend the baseline model, only the interactions that are classified as extensions form an input for ACCORDION, and in the rest of the paper, we will refer to these in-teractions as

Candidate Extension Interactions (CEIs) . Clustering of candidate extension interactions

The method used to identify clusters of extracted, filtered and classified CEIs is formally outlined in Algorithm 1 (see Figure S1 in the supplemen-tary material) and described in detail here. The set of CEIs can be represented as a set of candidate extension edges E ext , and the source and target nodes of these edges that are not already in the baseline model graph, G BM ( V BM , E BM ), will be members of the set of candidate extension nodes V ext . We then create a new graph G new ( V new , E new ), where V new = V BM ∪ V ext , and E new = E BM ∪ E ext . Figure 1 (e) shows a toy example graph G new , where grey nodes belong to the baseline model, while yellow and green nodes belong to the CEIs obtained from machine reading. We further classify the edges e ( v s ,v t ) from the set E ext , where v s is the source node and v t is the target node, into the following categories: (i) edges in which both the source node v s and the target node v t belong to the baseline model: { v s , v t } ∈ V BM ; (ii) edges in which either a source node or a target node belongs to the baseline model: ( v s ∈ V BM and v t ∉ V BM ) or ( v s ∉ V BM and v t ∈ V BM ); (iii) edges in which neither the source node nor the target node belongs to the baseline model: { v s ,v t } ∉ V BM . Adding the CEIs to the baseline model all at once usually does not result in a useful and accurate model. Alternatively, we can add one interaction at a time and test each model version, which is time consuming, or even impractical, given that the number of models increases exponentially with the number of CEIs. Moreover, adding individual interactions does not have an effect on the model when an interaction belongs to category (iii), and most often when it belongs to category (ii). It proves much more use-ful to add paths of connected interactions, which are at the same time con-nected to the baseline model in at least two elements. Therefore, our ap-proach for finding the most useful subset of the CEIs includes finding con-nected interactions, that is, a set of edges in the graph G new that form a return path. If we define a path of p connected edges as e path ( v s ,v tp ) = ( e i ( v s , v t ), e i ( v s = v t , v t ), e i3 ( v s = v t , v t ), …, e ip ( v sp = v tp ,v tp )), we will call e path ( v s ,v tp ) a return path, when { v s ,v tp } ∈ V BM and v s ≠ v tp . In Figure 1 (e), we highlight one such return path in blue. To find these return paths formed by CEIs, we conduct clustering of graph G new that includes both the baseline model and the CEIs. To cluster CEIs, we use Markov Clustering algorithm (MCL) [32], an unsupervised graph clustering algorithm, commonly used in bioinformat-ics (e.g., clustering of protein-protein interaction networks [33][34]). In [35], the authors showed that the MCL algorithm is tolerant to noise, while identifying meaningful clusters. MCL is compared with, Affinity Propa-gation (AP) clustering algorithm, proposed in [36], and it is demonstrated that the MCL algorithm outperforms the AP algorithm [35]. Moreover, the analysis in [33] supported the superiority of MCL over other clustering techniques [37][38][39] in identifying protein complexes from interaction networks. MCL simulates random walks on an underlying interaction net-work, by alternating two operations, expansion and inflation. First, self-loops are added to the input graph representing biological interactions which is in our case graph G new , and this graph is then translated into a stochastic Markov matrix [40]. This matrix consists of transition probabil-ities between all pairs of the graph nodes, and the probability of a random walk of length p between any two nodes can be calculated by raising this matrix to the exponent p , a process called expansion. As longer paths are more common between nodes within the same cluster than between nodes across different clusters, the transition probabilities between nodes in the same cluster will typically be higher in these newly obtained expanded matrices. MCL further amplifies this effect by computing entry-wise ex-ponents of the expanded matrix, a process called inflation [32], which raises each element of the matrix to the power r . Clusters are determined by alternating expansion and inflation, until the graph is partitioned into subsets such that there are no paths between these subsets. Assembly of new interaction data into models

After generating clusters, the next step is to add them to the model. Similar to the discussion about individual extensions in Section 3.2, we can add rticle short title clusters one at a time, or in groups. The more cluster or cluster groups we generate, the more models we need to assemble and test. Moreover, the number of possible cluster combinations grows with the total number of generated clusters, and the number of clusters depends on the inflation parameter r , as it directly influences cluster granularity [32]. To alleviate the problem of the large number of cluster combinations, we propose a method for combining the clusters found by the MCL algorithm. Formally, if the clusters we generated in the previous step are C ,…,C n , and we find a subset of clusters C i ,.., C ij , where 1 < j £ n , for which at least one return path exists that goes through all the clusters, then we merge these clusters into a single cluster that will be added to the model. An example of a multi-cluster path is highlighted in red in Figure 1 (e), starting at BM (baseline model), connecting to C (cluster 1), then connecting to C (cluster 2), and from C connecting back to BM . Therefore, we extend the baseline model with multiple clusters simultaneously, based on how clusters are con-nected to the model. The cluster merging procedure is outlined in Algo-rithm 2 (see Figure S2 in the supplementary material). Finally, we rank and score the final list of candidate clusters, based on the existence of re-turn path, to choose the ones that will be used in model extension. Next, we can select one or more clusters from the set of ranked and scored clusters to generate multiple Candidate Executable Models (CEMs) . Each CEM contains elements from both the baseline model and the selected cluster(s). Both procedures, the assembly of the CEI set, and the generation of CEMs are fully automated. However, element update rules are not necessarily unique, as previously discussed in [8]. For exam-ple, in the case of a Boolean model, if the original rule is “A = B + C”, and the candidate extension states that “D positively regulates A”, then the new update rule for A can be either “A= (B + C) * D”, or ”A= B + C + D”. We will investigate the effect of adding a new regulatory element in both cases, using either AND (*) or OR (+) operation.

Selection of final extended model

We use both simulation and formal analysis to evaluate the CEMs. In or-der to simulate a model, all model elements need to be assigned a starting state (i.e., initial value). The initial values for the baseline model elements (nodes in the set V BM ) are typically already known, however, the newly added elements (nodes in the set V ext ) need to be assigned initial values as well. Unfortunately, machine reading does not usually provide this infor-mation. In this work, we assume that all elements within the same cluster have the same initial value. In Section 6, we will compare models with different initializations of the newly added elements to evaluate the effect of initialization on the behavior of the CEMs. To obtain dynamic traces of the baseline model and the CEMs, we use the DiSH simulator (Section 2.3.1). We test each CEM using the statistical model checking approach (Section 2.3.2), by computing a probability es-timate P for each property in a given set of system properties. As discussed in Section 2.3.2, the statistical model checker calls the simulator in order to obtain element trajectories for a defined number of steps. Assuming independence across system properties, for a given CEM, we compute the global CEM probability by multiplying the P values for all properties for the given CEM. Finally, we select the CEM that has the highest probability of satisfying the selected properties as our final extended executable model. The procedure for selecting this final CEM is summarized in Al-gorithm 3 (see Figure S3 in the supplementary material). Case study: T cell differentiation

Naïve peripheral T cells are stimulated via antigen presentation to T cell receptor (TCR) and with co-stimulation at CD28 receptor. This stimula-tion results in the activation of several downstream pathways, feedback and feedforward loops between pathway elements, which then lead to the differentiation of naïve T cells into helper (Th) or regulatory (Treg) phe-notypes. The distribution between Th and Treg cells within the T cell pop-ulation depends on antigen dose; for instance, high antigen dose results in prevalence of Th cells, while low antigen dose leads to a mixed population of Th and Treg cells. The key markers that are commonly used to measure the outcomes of the naïve T cell differentiation into Th and Treg cells are IL2 and Foxp3, respectively. In other words, Th cells are characterized by high expression of IL-2 and low expression of Foxp3, and Treg cells are characterized by high expression of Foxp3 and low expression of IL-2. To demonstrate our model extension procedure, we use two existing, manu-ally built models of T cell differentiation, from [12] and [13].

Baseline model and golden model

In [12], the authors proposed a model where most of the elements are as-sumed to have two main levels of activity, and are therefore represented with Boolean variables, and their update rules are logic functions. Addi-tionally, the stimulation through TCR is assumed to have three different levels, no stimulation (TCR=0), low dose (TCR=1), and high dose (TCR=2), and therefore, it is implemented using two Boolean variables. We used the model from [12] to create the baseline model for our case study. The interaction map of this model is provided in [12] (also included in the supplementary material). In [13], the authors have proposed an extension of the original T cell model from [12], a new model that improved the behavior of the original model. Specifically, in the new model in [13], the Foxp3 response to low dose is closer to experimental observations, that is, it is present in almost 70% of the differentiated population, while in [12] Foxp3 was present in 100% of the differentiated population. In both models, there is a brief tran-sient induction of Foxp3 after the stimulation with high antigen dose. We will refer to the model from [13] as the golden model. For the baseline, we used the original model from [12], without several interactions overlap-ping with the golden model from [13] (TCR activates PIP3, PIP3 activates Akt, Akt activates mTORC2 and mTORC2 inhibits Akt). While the model from [12] satisfied a large number of system properties, except for a few that are satisfied by the model in [13] only, the baseline model in its re-duced shape does not satisfy a larger set of system properties. Our aim is to use ACCORDION to automatically expand this baseline model in order to recapitulate the behavior of the golden model.

Set of properties

From the golden model in [13] and the results of its studies, we define a set of properties that our final automatically extended model needs to sat-isfy. Specifically, the properties capture observed responses of key path-way components in T cells, Foxp3, IL-2, PTEN, CD25, STAT5, AKT, mTOR, mTORC2 and FoxO1, to three scenarios: (1) no stimulation (TCR=0), (2) stimulation with low antigen dose (TCR=1), and (3) stimu-lations with high antigen dose (TCR=2). The complete list of 27 properties is shown in Figure S4 in the supplementary material. Results

Here, we discuss several experiments that we conducted, and the results and our observations from these experiments.

Experimental setup

All models that we use or create are written in the BioRECIPES represen-tation format, which was presented in [20] and briefly described in Section 2.2. From this format, executable models are generated automatically as part of the DiSH simulator [11], which is publicly available at [41].

In the experiments described below, we used the PubMed database [15]. The PubMed search was conducted using Entrez [42], an integrated data-base retrieval system that allows access to a diverse set of databases at the National Center for Biotechnology Information (NCBI) [43] website. The published articles that were obtained through search of PubMed are read using the REACH engine [19], which extracted a list of events and the corresponding information (see Section 2.1). The REACH reading engine is available online and can be run through the Integrated Network and Dy-namical Reasoning Assembler (INDRA) [44]. We conducted our analysis on three different CEI sets, which were obtained using varying levels of automation and manual intervention. For each set, the list of events, with associated entities, is automatically translated from the reading output into BioRECIPES tabular format. In the fully automated ( FA ) approach, both the PubMed database search for relevant articles and the extraction of event data from the selected ar-ticles were done by machines. Specifically, in the FA experiment, we used search query “T-cell and (PTEN or AKT or FOXO)” and selected top 11 from the best matched papers, by the PubMed search engine. In the semi-automated ( SA ) approach, we selected papers that are cited by [13] and used the event information that REACH extracted from those papers. Fi-nally, in the semi-manual approach ( SM ), we rely the most on human in-tervention, we manually excluded from the SA reading output those inter-actions that violate any assumptions made by the authors originally in [12]. For instance, the authors in [12] consider element TCR to be an input to the network, and therefore, TCR should not have any regulators in the T cell model. Therefore, if REACH retrieves an interaction in which TCR is a regulated element, we manually remove these interactions and keep only the interactions having TCR as a regulator. Model extension algorithms are written in Python. The statistical model checker is written in C++ and it was used to test all CEMs on a set of properties listed in the supplementary material (Figure S4). The properties are written as BLTL formulas. The overall iterative model extension tool is written in Python, and it was run on a 3.3 GHz Intel Core i5 processor. For the clustering algorithm, we used the MCL package from [32], and for the visualization of the graphs we used Cytoscape [35]. Role of static network characteristics In Figure 2 (a), we show three networks (undirected interaction maps, for easier visualization) that were formed by combining each of the CEI sets (FA, SM, and SA) with the baseline model. In other words, as described in Section 3.2, we created graphs G new ,FA ( V new ,FA , E new ,FA ), G new ,SA ( V new ,SA , E new ,SA ), and G new ,SM ( V new ,SM , E new ,SM ). We explored static characteristics of these three graphs and their correlation with the selection of the best ex-tended model. Given a directed graph G new,* ( V new,* , E new,* ), where * Î {FA,SA,SM} and the definition of a path in Section 3.2, we computed average path length ( APL ), clustering coefficient (

Coeff ), and graph den-sity ( D ) [45]. Assuming that a distance 𝑑(𝑣 " , 𝑣 ) is the number of edges on a shortest path between nodes v i and v j , APL is computed as an average distance across all possible pairs of nodes in the graph:

𝐴𝑃𝐿 = 1|𝑉 $%&,∗ | ∙ (|𝑉 $%&,∗ | − 1) ∙ : 𝑑(𝑣 " , 𝑣 ) ) ,) $ ∈+ %&’,∗ ,) ,) $ where | 𝑉 $%&,∗ | is the number of nodes in the graph. If there is no path be-tween v i and v j , then 𝑑%𝑣 " , 𝑣 + = 0 . The clustering coefficient ( Coeff ) [45] is computed for each node v i in a directed graph as: 𝐶𝑜𝑒𝑓𝑓(𝑣 " ) = -() )0%12%% *+* () )∙(0%12%% *+* () )45)46∙0%12%% ↔ () ) where T ( v i ) is the number of triangles (three connected nodes) in the graph that contain node v i , degree tot ( v i ) is the sum of the in-degree (the number of incoming edges) and the out-degree (the number of outgoing edges) of v i , and degree ↔ ( v i ) is the reciprocal of degree tot ( v i ). Coeff is a number be-tween 0 and 1, and therefore, if an average

Coeff value, computed across all graph nodes, approaches 0, the graph is more likely to contain stars,

Figure 2.

T cell model extension results. (a) Networks obtained when combining baseline model with the CEI set for each of the three cases, FA, SA, and SM ( G new ,FA , G new ,SA , and G new ,SM ). Gray nodes are the baseline model nodes and white nodes are the new nodes that belong to the CEIs. Table : Common graph features of G new ,FA , G new ,SA , and G new ,SM . (b) G new ,FA , G new ,SA , and G new ,SM drawn together, highlighting the common nodes. (c) Top : histogram of degree tot values and corresponding number of nodes;

Bottom : histogram of node distance ( d ) values and the corresponding number of paths. (d) Two clusters that form a return path with the baseline T cell model, shown as directed graphs (yellow node is a common node for both clusters). (e) Global (overall) probability of satisfying desired system properties for each candidate extension model (CEM) that ACCORDION assembled from each of the three CEI sets (FA-27, SA-22, and SM-16 CEMs). rticle short title while when this value approaches 1, the graph is a clique. The graph den-sity ( D ) [45] is defined for a directed graph as: 𝐷 = |8 %&’,∗ ||+ %&’,∗ |(|+ %&’,∗ |45) where | E new,* | is the number of edges and | V new,* | is the number of nodes in the graph. A graph is considered to be dense if the number of edges is close to the maximum number of possible edges, therefore, the graph den-sity is close to 1 for a dense graph and close to 0 for a sparse graph. We list in the table in Figure 2 (a) the main parameters for graphs G new ,FA , G new ,SA , and G new ,SM . As can be seen from the table, the G new ,FA has the largest number of nodes and edges, and it results in the largest number of clusters. On the other hand, G new ,SA and G new ,SM have smaller number of edges and nodes. In Figure 2 (b), we highlight the difference between the three graphs: G new ,FA is shown in green, G new ,SA in blue, and G new ,SM , which is a subgraph of G new ,SA , in orange. In addition, we show the overlapping nodes between the three networks in cyan. Interestingly, it was observed that despite network diversity, G new ,FA , G new ,SA , and G new ,SM share prominent structural features: they have small APL , small average

Coeff , and small D , and thus, large average degree tot values are unlikely. This similarity is even better illustrated in Figure 2 (c), showing the degree tot histogram for the nodes in each network that follows a power law, and the distribution of node distance ( d ) centered approxi-mately around value 4. As can be noticed, both graph parameters, degree tot and d , have similar patterns but with different count numbers for each G new,* in proportion to the size of its network. Moreover, the values of D in the table in Figure 2(a) suggest that the graphs assembled from the in-formation extracted by machine readers are less dense, even with varying network size. These results also suggest that the difference in literature sources and the size of the CEI sets did not affect the characteristics of G new,* graphs in our case study. Following our Algorithm 1, shown in Figure S1 in the supplementary material, we clustered the three graphs G new,* . We obtained 22 clusters from G new ,FA ( C , C ,…, C ), 11 clusters from G new ,SA ( C , C ,…, C ), and 9 clusters from G new ,SM ( C , C ,…, C ). The inspection of obtained clusters shows that they are less dense and star-like networks (two examples shown in Figure 2 (d)), which agrees with the conclusions of the above studies of graph characteristics. Thus, computing the graph parameters can guide our proposed extension method by providing an early estimate of whether the CEI sets can lead to desired models. For instance, if the

APL is large, we will expect to extract a fewer number of return paths (defined in Section 3.2) from the G new,* graphs, and therefore, in our analysis we will lack the connectivity of the CEIs to the baseline model. Additionally, the graphs with smaller D will reduce the computa-tion time, and computing this parameter helps determine in advance the expected execution time of our algorithm. Assembly and evaluation of T cell CEMs

To extend the baseline model, we first test the connectivity of each clus-ter to the model by searching for a return path (starts and ends in the base-line model) between an individual cluster and the model. In

Figure 2 (d), we highlight in blue a return path that exists between cluster C and the baseline model (TCR → AKT → MTORC2), and a return path that exists between cluster C and the baseline model (TCR → NEDD4_ext → PTEN), where clusters C and C are two of the nine clusters gener-ated from G new ,SM . Furthermore, we explored multiple cluster connectivity with respect to return paths and created 5, 11, and 7 additional CEMs by merging two clusters together from the clusters obtained from G new ,FA , G new ,SA , and G new ,SM graphs respectively. Therefore, the total number of CEMs resulting from the FM, SA, and SM CEI sets are 27, 22, and 16, respectively. We also highlight in red in Figure 2 (d) a return path that exists between the baseline model, and clusters C and C (PI3K → PIP3 → AKT → Foxo1_ext → PTEN). The final list of CEMs includes the baseline model extended by one or two of the clusters generated by MCL for each G new,* graph. For several candidate models, and for all the 27 properties, we show in Figure 2 (e) the global CEM probability (defined in Section 3.4) for all CEMs, in the FA, SA, and SM cases. The highest peak per graph represents the best candidate model when compared to the golden model. From the morphol-ogy of the generated clusters, we expect a cluster to affect the behavior of the model if it contains the key elements included in system properties. As a consequence, the estimated probability ( P ) values will vary for those clusters. However, clusters lacking those key elements will most probably not affect the behavior of the model, and thus, larger number of properties will not be satisfied. In our case study, we found that the CEMs that in-clude two clusters with key elements satisfy a larger number of properties. Merging clusters helped increase the probabilities, however, the 70% steady-state level of Foxp3 in the low-dose scenario observed in [13] is not achieved. The best performance is obtained for the model that com-bines two clusters, C and C (Figure 2 ( d)), which satisfies almost all properties (24 out of 27), (Figure 2(e)). Additionally, these two clusters together restored all the missing interactions removed from the golden model (Section 4.1). The network of the best CEMs for the SM CEI set is shown in Figure S5 in the supplementary material. Guided extension of executable models

A model created automatically with ACCORDION, using the information from the papers in public databases, which satisfies most of the desired properties, may not be the same as the golden model. The differences can be found in both network structure and element update rules. With our extension methodology, we sometimes obtain multiple models that satisfy the same set of properties. This diversity helps us examine redundancies or discover alternative pathways regulating the same target element. Adding new elements to the baseline model requires the consideration of three factors, cluster granularity, regulatory functions, and initial val-ues. First, when using MCL to cluster our directed networks ( G new,* ), the principal handle for changing cluster granularity is the inflation parameter, r , described in Section 3.2. An increase in r causes an increase in the clus-ter granularity. Therefore, we explored the effect of r on finding the best set of clusters for each CEI set. In [32], the authors determined a good interval to choose from (e.g., from 1.1 to 10.0), however, the range of suit-able values will certainly depend on the input graph. For our case study and the different reading output sets, we found that r = 1.1 is too low, and r ≥ r = 4 for our studies and conducted experiments based on this value ( Figure 3 (a)). Additionally, MCL is usually applied to undirected graphs, and thus, if there is an edge between nodes v i and v j , the corresponding entries in the adjacency matrix will be s ( i , j ) = s ( j , i ) ¹ Since we use directed graphs, an edge from node v i to node v j would correspond to the s ( i , j ) entry, but not to the s ( j , i ) entry. To solve this issue, we make s ( j , i ) equal to s ( i , j ), which seems to ignore an important information (directionality). How-ever, the directionality does not affect the static structure of clusters [32]. Finally, the runtime of the extension algorithm is proportional to the num-ber of properties that we need to test against. In other words, if we have N C clusters and N P properties, the time required for the extension algorithm is at the order of O( N C * N P ). However, the time complexity can be reduced to O( N C ) if testing for all the clusters is carried out in parallel. The second challenge is in deciding which operations to use (e.g., AND or OR for Boolean functions), when adding new element regulators found in CEIs. We investigated element update rules, and the difference between our final model and the golden model. For instance, the PTEN update rule from the model in [13] is PTEN = (FOXO1* MEK1) * (!CK2 + !NEDD4). As our baseline model starts with PTEN = (FOXO1 * MEK1), we need to “recover” the rest of the update rule from the CEI set, that is, negative regulation of PTEN by CK2 and negative regulation of PTEN by NEDD4. In the process of adding new element regulators, if the information about the regulation type or the importance of the regulator (e.g., necessary vs. sufficient) is available, it will guide the operation choice (AND vs. OR). The third challenge is in deciding elements’ initial values (e.g., 0 or 1 in Boolean models) when simulating CEMs and testing their behavior against the desired system properties. We found that assigning different initial values to those source and target elements of CEIs that were not in the original baseline model have quite similar results. This emphasizes the robustness of the baseline model and that the final extended model is in-fluenced by the values of elements in the baseline model. These findings are mostly in agreement with what has been shown in [8]. It is likely that these results are influenced by our choice of the case study, and the fact that we defined system properties (Figure S4 in the supplementary mate-rial) in terms of steady states of the key model elements. As our next steps, we will further explore effects of initialization, as well as add system prop-erties with more complicated temporal relationships between elements. Comparison with previously proposed methods

We tested the effectiveness of the previously proposed model extension method from [8] when applied to our case study. This is achieved by re-placing our model extension method with the method introduced in [8], using the same baseline T cell model (described in Section 4.1) and the three CEI sets (FA, SA, and SM). In [8], the authors described an auto-mated extension method that considers only the extensions that are con-nected to the baseline model. They first identify a set of baseline model elements of interest, and then add the extensions based on the proximity to these elements. The proximity is measured as a number of edges on a path connecting baseline model elements and new elements in extensions, and the extension is conducted in layers, starting from the baseline model. Several extension configurations are proposed in [8], depending on the extension approach that the user could be interested in. For example, the focus of model extension can be including the regulation of a certain ele-ment or a set of elements, regardless of the number of extension layers this would require. Another approach discussed in [8] focuses on reducing the number of layers while tracking the effect of adding new extensions to the baseline model. In this work, we focus on studying the effect of adding new extensions to the baseline model, therefore, we used the latter ap-proach from [8] in the comparisons. Figure 3(b) highlights the differences between the results of our method and the method from [8], when tested using statistical model checking. We compared the P values for each property and each CEM for the two meth-ods. As can be observed, ACCORDION outperforms the method from [8] in the case of the FA and SA CEI sets. However, in the SM case, the method from [8] shows slightly better results. These results indicate that the layer-based approach is less effective when used on a large set of CEIs and without any human intervention. The visualization of the topology of the sets of extensions extracted by each method is shown in Figure 3(c). Our method provides concise groups of connected CEIs, that are at the same time connected to the baseline model through return paths. On the other hand, the networks generated by the method from [8], show several nodes that are extensions and are downstream from the baseline model, which means they are not affecting the baseline model ( Figure 3 (c)). Thus, the comparisons we conducted suggest that the Liang et al. method [8] has two major limitations that our method overcomes: it is subjective and prone to human judgment variation in selecting the number of ele-ments of interest and the number of layers, and it becomes impractical with the large number of layers. Conclusion and future work

In this paper, we have described a novel methodology and a tool, ACCORDION, that can be used to automatically assemble the information extracted from literature into models. Our proposed approach combines machine reading with clustering, simulation, and model checking, into an automated framework for rapid model assembly and testing to address bi-ological questions. Furthermore, by automatically extending models with the information published in literature, our methodology allows for effi-cient collection of the existing information in a consistent and comprehen-sive way, while also facilitating information reuse and data reproducibil-ity, and replacing hundreds or thousands of manual experiments, thereby reducing the time needed for the advancement of knowledge. As our future work, we will explore opportunities to improve ACCORDION. For exam-ple, we plan to utilize the information about corroborations and contradic-tions to inform clustering and CEM selection, and we will work on paral-lelization of our algorithms to further increase their execution efficiency.

Figure 3. (a) Several cluster characteristics measured as functions of inflation parameter (IP), for the FA, SA, and SM cases (IP1=0.5, IP2=2, IP3=4, IP4=6). (b) Comparison of the probability estimate P for all 27 properties, for the golden model, and for the best model obtained from each of the three CEI sets (FA, SA, SM) using ACCORDION and the method from [8]. (c) Clusters that were included in the final CEMs for the FA, SA, and SM cases, for ACCORDION and [8] (white nodes are new nodes from the CEIs and gray nodes are the nodes in the CEIs that also exist in the baseline model). rticle short title Funding

This work is supported by DARPA grants W911NF-17-1-0135 and W911NF-18-1-0017 awarded to N. Miskov-Zivanov.

References [1] J. Epstein, “Why Model?,”

Cybern. Syst. , vol. 35, no. 2–3, pp. 117–128, 2008. [2] M. A. Valenzuela-Escárcega, G. Hahn-Powell, T. Hicks, and M. Surdeanu, “A Domain-independent Rule-based Framework for Event Extraction,”

Proc. ACL-IJCNLP 2015 Syst. Demonstr. , pp. 127–132, 2015. [3] A. Fabregat et al. , “The Reactome Pathway Knowledgebase,”

Nucleic Acids Res. , vol. 46, no. D1, pp. D649–D655, 2018. [4] C. Von Mering et al. , “STRING : known and predicted protein – protein associations , integrated and transferred across organisms,” vol. 33, pp. 433–437, 2005. [5] K. Encyclopedia, “Using the KEGG Database Resource,” pp. 1–54, 2005. [6] N. Miskov-Zivanov, “Automation of Biological Model Learning, Design and Analysis,”

Proc. 25th Ed. Gt. Lakes Symp. VLSI - GLSVLSI ’15 , pp. 327–329, 2015. [7] O. Etzioni, M. Banko, and M. J. Cafarella, “Machine Reading,”

Ameri , pp. 1517–1519, 2006. [8] K.-W. Liang, Q. Wang, C. Telmer, D. Ravichandran, P. Spirtes, and N. Miskov-Zivanov, “Methods to Expand Cell signaling Models using Automated Reading and Model Checking.” . [9] K. Sayed, K. N. Bocan, and N. Miskov-zivanov, “Automated Extension of Cell Signaling Models with Genetic Algorithm,” no. 1, pp. 5030–5033, 2018. [10] Q. Wang, N. Miskov-Zivanov, B. Liu, J. R. Faeder, M. Lotze, and E. M. Clarke, “Formal Modeling and Analysis of Pancreatic Cancer Microenvironment,” pp. 289–305, 2016. [11] K. Sayed, Y. H. Kuo, A. Kulkarni, and N. Miskov-Zivanov, “DiSH simulator: Capturing dynamics of cellular signaling with heterogeneous knowledge,”

Proc. - Winter Simul. Conf. , pp. 896–907, 2018. [12] N. Miskov-Zivanov, M. S. Turner, L. P. Kane, P. A. Morel, and J. R. Faeder, “The duration of T cell stimulation is a critical determinant of cell fate and plasticity,”

Sci. Signal. , vol. 6, no. 300, pp. 1–16, 2013. [13] W. F. Hawse et al. , “ Cutting Edge: Differential Regulation of PTEN by TCR, Akt, and FoxO1 Controls CD4 + T Cell Fate Decisions ,”

J. Immunol. , vol. 194, no. 10, pp. 4615–4619, 2015. [14] S. Kumar-Jha, E. M. Clarke, C. J. Langmead, A. Legay, A. Platzer, and P. Zuliani, “A Bayesian Approach to Model Checking Biological Systems,”

Cmsb , no. 2005, pp. 218–234, 2009. [15] R. J. Roberts, “PubMed Central : The GenBank of the published literature,” vol. 98, no. 2, pp. 381–382, 2001. [16] S. Ananiadou, S. Pyysalo, J. Tsujii, and D. B. Kell, “Event extraction for systems biology by text mining the literature,”

Trends Biotechnol. , vol. 28, no. 7, pp. 381–390, 2010. [17] R. B. Altman et al. , “Text mining for biology - The way forward: Opinions from leading scientists,”

Genome Biol. , vol. 9, no. SUPPL. 2, 2008. [18] G. A. P. C. Burns, P. Dasigi, A. de Waard, and E. H. Hovy, “Automated detection of discourse segment and experimental types from the text of cancer pathway results sections,”

Database (Oxford). , vol. 2016, pp. 1–12, 2016. [19] M. A. Valenzuela-escárcega et al. , “Large-scale Automated Reading of Scientific Cancer Literature Discovers New Cancer Driving Mechanisms,” pp. 3–5, 2017. [20] K. Sayed, C. A. Telmer, A. A. Butchy, and N. Miskov-Zivanov, “Recipes for translating big data machine reading to executable cellular signaling models,”

Lect. Notes Comput. Sci. (including Subser. Lect. Notes Artif. Intell. Lect. Notes Bioinformatics) , vol. 10710 LNCS, pp. 1–15, 2018. [21] A. Bateman et al. , “UniProt: The universal protein knowledgebase,”

Nucleic Acids Res. , vol. 45, no. D1, pp. D158–D169, 2017. [22] R. Zhang et al. , “Network model of survival signaling in T-cell large granular lymphocyte leukemia,”

Proc. Natl. Acad. Sci. , vol. 105, no. 42, pp. 16308–16313, 2008. [23] I. Albert, J. Thakar, S. Li, R. Zhang, and R. Albert, “Boolean network simulations for life scientists,”

Source Code Biol. Med. , vol. 3, pp. 1–8, 2008. [24] T. Helikar et al. , “The Cell Collective: Toward an open and collaborative approach to systems biology,”

BMC Syst. Biol. , vol. 6, 2012. [25] A. Naldi et al. , “The CoLoMoTo interactive notebook: Accessible and reproducible computational analyses for qualitative biological networks,”

Front. Physiol. , vol. 9, no. JUN, pp. 1–13, 2018. [26] I. Tkachev and A. Abate, “Formula-free Finite Abstractions for Linear Temporal,”

Proc. 16th Int. Conf. Hybrid Syst. Comput. Control , pp. 283–292, 2013. [27] M. Y. Vardi, “Automatic Verification of Probabilistic Concurrent Finite-State Systems,”

Gene Expr. , vol. 25, no. may, pp. 25–29, 2000. [31] C. A. Telmer, K. N. Bocan, E. Holtzapple, Y. Ahmed, A. Butchy, and C. E. Hansen, “Dynamic system explanation: DySE, a framework that evolves to reason about complex systems-lessons learned,”

Proc. Conf. Artif. Intell. Data Discov. Reuse , no. Aidr, pp. 1–10, 2019. [32] S. E. Schaeffer, “Graph clustering by flow simulation,”

Comput. Sci. Rev. , vol. 1, no. september 1969, pp. 27–64, 2007. [33] S. Brohée and J. van Helden, “Evaluation of clustering algorithms for protein-protein interaction networks,”

BMC Bioinformatics , vol. 7, 2006. [34] X. Lei, F. Wang, F. Wu, A. Zhang, and W. Pedrycz, “Protein complex identification through Markov clustering with firefly algorithm on dynamic protein – protein interaction networks,”

Inf. Sci. (Ny). , vol. 329, pp. 303–316, 2016. [35] J. Vlasblom and S. J. Wodak, “Markov clustering versus affinity propagation for the partitioning of protein interaction graphs,”

BMC Bioinformatics , vol. 10, pp. 1–14, 2009. [36] F. B. J and D. D, “Clustering by passing messages between data points,”

Science (80-. ). , vol. 315, no. February, pp. 972–976, 2007. [37] A. D. King, N. Pržulj, and I. Jurisica, “Protein complex prediction via cost-based clustering,”

Bioinformatics , vol. 20, no. 17, pp. 3013–3020, 2004. [38] M. Blatt, S. Wiseman, and E. Domany, “Superparamagnetic clustering of data,”

Phys. Rev. Lett. , vol. 76, no. 18, pp. 3251–3254, 1996. [39] G. D. Bader and C. W. V Hogue, “An automated method for finding molecular complexes in large protein interaction networks,”

BMC Bioinformatics , vol. 27, pp. 1–27, 2003. [40] P. A. Gagniuc,

Markov Chains: From Theory to Implementation and Experimentation . . [41] “https://bitbucket.org/biodesignlab/dyse-framework/src/master/.” [42] K. J. Schuler GD, Epstein JA, Ohkawa H, “Entrez,”

Methods Enzym. , vol. 266, pp. 141–162, 1996. [43] L. Y. Geer et al. , “The NCBI BioSystems database,” vol. 38, no. October 2009, pp. 492–496, 2010. [44] B. M. Gyori, J. A. Bachman, K. Subramanian, J. L. Muhlich, L. Galescu, and P. K. Sorger, “From word models to executable models of signaling networks using automated assembly,”

Mol. Syst. Biol. , vol. 13, no. 11, p. 954, 2017. [45] M. E. J. Newman, “The Structure and Function of Complex Networks ∗ ,” vol. 45, no. 2, pp. 167–256, 2003.,” vol. 45, no. 2, pp. 167–256, 2003.