A meta-analysis of Boolean network models reveals design principles of gene regulatory networks
Claus Kadelka, Taras-Michael Butrie, Evan Hilton, Jack Kinseth, Haris Serdarevic
AA meta-analysis of Boolean network models reveals design principles of generegulatory networks
Claus Kadelka a, ∗ , Taras-Michael Butrie b,1 , Evan Hilton c,1 , Jack Kinseth a,1 , Haris Serdarevic a,1 a Department of Mathematics, Iowa State University, Ames, Iowa 50011 b Department of Aerospace Engineering, Iowa State University, Ames, Iowa 50011 c Department of Computer Science, Iowa State University, Ames, Iowa 50011
Abstract
Gene regulatory networks (GRNs) describe how a collection of genes governs the processes within a cell.Understanding how GRNs manage to consistently perform a particular function constitutes a key questionin cell biology. GRNs are frequently modeled as Boolean networks, which are intuitive, simple to describe,and can yield qualitative results even when data is sparse.We generate an expandable database of published, expert-curated Boolean GRN models, and extractedthe rules governing these networks. A meta-analysis of this diverse set of models enables us to identifyfundamental design principles of GRNs.The biological term canalization reflects a cell’s ability to maintain a stable phenotype despite ongoingenvironmental perturbations. Accordingly, Boolean canalizing functions are functions where the outputis already determined if a specific variable takes on its canalizing input, regardless of all other inputs.We provide a detailed analysis of the prevalence of canalization and show that most rules describing theregulatory logic are highly canalizing. Independent from this, we also find that most rules exhibit a highlevel of redundancy. An analysis of the prevalence of small network motifs, e.g. feed-forward loops or feedbackloops, in the wiring diagram of the identified models reveals several highly abundant types of motifs, as wellas a surprisingly high overabundance of negative regulations in complex feedback loops. Lastly, we providethe strongest evidence thus far in favor of the hypothesis that GRNs operate at the critical edge betweenorder and chaos.
Keywords: robustness, stability, canalization, gene regulatory networks, Boolean networks
Introduction
Gene regulatory networks (GRNs) describe how a collection of genes governs the processes within a cell. Un-derstanding how GRNs perform particular functions and do so consistently despite ubiquitous perturbationsconstitutes a fundamental biological question [1]. Over the last two decades, a variety of design principles ofGRNs have been proposed and studied, with a focus on discovering causal links between network form andfunction.GRNs have been shown to be enriched for certain sub-graphs with a specific structure, so-called networkmotifs, like feed-forward loops, feedback loops but also larger subcircuits [2, 3, 4, 5]. Theoretical studiesof the dynamic properties of these motifs revealed specific functionalities [6, 7]. For example, coherentfeed-forward loops can delay the activation or inhibition of a target gene, while incoherent ones can act asaccelerators [8]. Other hypothesized design principles include e.g. redundancy in the regulatory logic [9, 10],and a high prevalence of canalization [11]. Canalization, a concept originating from the study of embryonaldevelopment [12], refers to the ability of a GRN to maintain a stable phenotype despite ample genotypic aswell as environmental variation.Over the past decades, Boolean networks (reviewed in [13]) have become an increasingly popular modelingframework for the study of biological systems, as they are intuitive and simple to describe. When data is ∗ Corresponding author, email: [email protected] These authors contributed equally to this project as part of a one-semester research experience for first-year Honor students. a r X i v : . [ q - b i o . M N ] S e p parse, as is still often the case for less-studied organisms and processes, complicated models (e.g., continuousdifferential equation models), which harbor the potential for quantitative predictions, cannot be appropriatelyfitted to the data due to their high number of parameters [14]. In this case, Boolean network models canoften still yield qualitative results.Static network models are comprised of (i) a set of considered nodes (genes, external parameters, etc.),and (ii) a wiring diagram, which describes which node regulates which and often also contains informationabout the respective type of regulation (positive due to e.g. transcriptional activation vs negative due tof.e. inhibition). A dynamic Boolean network model possesses these same features but obtains its dynamicsfrom an additional set of update rules (i.e., Boolean functions) that describe the regulatory logic governingthe expression of each gene. Each gene is either on (i.e., high concentration, expressed) or off (i.e., lowconcentration, unexpressed) and time is discretized as well. While large, genome-wide static transcriptionalnetwork models can be readily assembled from existing databases, the formulation of dynamic Booleannetwork models requires a careful calibration of the update rules by a subject expert. Therefore, all dynamicBoolean GRN models published thus far focus on specific biological processes of interest and contain onlythose genes involved in these processes.Here, we describe a comprehensive meta-analysis of all published, expert-curated Boolean GRN models,aiming to obtain a better understanding of the design principles of GRNs, which can explain how theyoperate smoothly and perform particular functions.
Results
Using the biomedical literature search engine Pubmed, we created a database of 151 Boolean GRN mod-els (see Methods for details). The identified networks describe the regulatory logic underlying a variety ofprocesses in numerous species, ranged in size from 3 to 302 genes (mean = 43.17, median = 23.5), and encom-passed a total of 5698 genes as well as 835 external parameters, which remain constant over time(Fig. 1A).We only included expert-curated models where the nodes and the update rules were selected by hand andnot by a prediction algorithm or where default choices like threshold rules were used throughout. We furtherincluded only one version of highly similar models, which led to the exclusion of 19 models, resulting in atotal of 132 models used in the meta-analysis.A majority of the models (104, 78 . levels, andas expected, network models with more genes contained on average more external parameters ( ρ Spearman =0 .
5, Fig. 1B). On the other hand, the size of a network was not significantly correlated with the averageconnectivity, which describes the average number of regulators per gene ( ρ Spearman = − .
04, Fig. 1C). Theaverage connectivity differed widely across the 132 models; we observed a range of [1 . , .
49] and a meanaverage connectivity of 2 .
37. The degree distribution of a random graph, in which the edges are distributedrandomly, is a Poisson distribution [15]. When considering all update rules separately, we identified thatthe in-degree distribution resembled a Poisson distribution, while the out-degree distribution possessed apower-law tail (Fig. 1D), as has been observed for many diverse types of networks [16, 15] including theyeast transcriptional regulatory network [17]. The tails of the two degree distributions differed significantly;we found many more instances of high out-degree versus high in-degree, highlighting the presence of keytranscription factors that act as network hubs [18].Next, we investigated the prevalence of different types of regulations. If gene A regulates gene B, there arethree possibilities: (i) gene A may activate gene B, meaning that an increased expression of gene A (i.e., achange from 0 to 1 in the Boolean world) leads to an increased expression of gene B for some states of theother regulators, and possibly to no change in B for other states of the other regulators (ii) gene A mayinhibit gene B, meaning that an increase in A leads to a decrease in B for some states of the other regulators,and possibly to no change in B for other states of the other regulators, and (iii) gene A’s effect on gene Bmay be conditional (i.e., not monotonic), meaning that for some states of the other regulators, A activates B,while for other states of the other regulators, A inhibits B. Except for ten rules with more than 16 inputs, weinvestigated all update rules, resulting in a total of 15379 analyzed regulators. The majority of regulationswere activations (11549, 75 . . . ene regulatory networks0100200300 N u m b e r o f n o d e s
132 networks5698 genes835 external parametersgenesexternalparameters Number of genes0110100 N u m b e r o f e x t e r n a l p a r a m e t e r s r = 0.5p = 8 e
10 10 Number of genes12345 A v e r a g e c o nn e c t i v i t y r = 0.04p = 0.630 1 2 5 10 20 50degree10 p r o p o r t i o n in-degreeout-degree 1 2 3 4 5 6 7number of essential variables0.00.20.40.60.81.0 p r o p o r t i o n activationinhibitionconditional Figure 1:
Summary statistics of the analyzed GRN models . (A) Plot of the number of genes and constants (externalparameters) for each model sorted by number of genes. (B-C) For each model, the number of genes is plotted against (B) thenumber of constants and (C) the average in-degree of the genes. The Spearman correlation coefficient and associated p-valueare shown in red. (D) In-degree (red circles) and out-degree (black stars) distribution derived from all 5698 update rules.(E) Prevalence of type of regulation (activation: red, inhibition: blue, conditional: black) stratified by number of regulators(in-degree; x-axis). N u m b e r o f e ss e n t i a l r e g u l a t o r s Figure 2:
Discrepancies in some published update rules . For 5688 update rules, the number of regulators in the identifiedpublished rule (x-axis) is plotted against the number of essential inputs after simplification of the rule (y-axis).
31 2 0 770 0 0 0 0 0 0 0
62 6 0 0 448 0 0 0 0 0 0
48 18 5 0 0 211 0 0 0 0 0
47 14 11 0 3 0 118 0 0 0 0
28 10 9 4 1 0 0 44 0 0 0
21 2 1 5 1 0 0 0 42 0 0
11 7 9 1 2 6 0 0 0 13 0 canalizing depth nu m b e r o f esse n t i a l i npu t s
190 0 810 0 0 0 0 0 0 0 0
595 109 0 296 0 0 0 0 0 0 0
957 31 3 0 9 0 0 0 0 0 0 canalizing depth nu m b e r o f esse n t i a l i npu t s C a n a li z i n g s t r e n g t h ObservedRandom
Figure 3:
Prevalence of canalization . (A) Stratification of all identified update rules based on the number of essential inputs(rows) and the canalizing depth (columns). Update rules with more than ten inputs were omitted. (B) Expected distribution ofthe canalizing depth for random Boolean functions for different numbers of essential inputs (0-10). For each row, 1000 randomfunctions were generated. (C) The distribution of the canalizing strength of all observed 0-canalizing functions with 3-6 essentialinputs is shown (blue), as well as the expected distribution for random Boolean functions (orange), derived from 1000 sampleseach. Horizontal lines depict the respective mean values. prevalence of activation. We further found that activation seems particularly prevalent in situations wherea gene’s state is determined by one or only a few regulators (Fig. 1E).Interestingly, we found that 232 of the 15379 regulators (1 . . . Canalization
The concept of canalization, already introduced in the 1940s in the context of embryonal development [12],has been proposed as a possible explanation for the remarkable stability of GRNs in the face of ubiquitousperturbations, and accordingly, Boolean canalizing functions have been proposed as ideally suited updatefunctions in Boolean GRN models [11]. Recently, the class of canalizing functions has been further stratifiedand studied [21, 22]. Some smaller studies support the general hypothesis by revealing an overabundance ofcanalizing functions in GRN models [23, 24] but a rigorous, comprehensive analysis that considers varioustypes of canalization is still missing.A canalizing function possesses at least one input variable such that, if this variable takes on a certain“canalizing” value, then the output value is already determined, regardless of the values of the remaininginput variables. If this variable takes on another value, and there is a second variable with this sameproperty, the function is 2-canalizing. If k variables follow this pattern, the function is k -canalizing [21], andthe number of variables that follow this pattern is the canalizing depth of the function [25]. If the canalizing4
968 402 0 0 0 0 0 0 0 0 0
289 449 65 0 0 0 0 0 0 0 0
84 264 153 15 0 0 0 0 0 0 0
23 132 85 32 10 0 0 0 0 0 0
16 62 46 49 14 6 0 0 0 0 0 number of symmetry groups nu m b e r o f esse n t i a l i npu t s
587 413 0 0 0 0 0 0 0 0 0
63 569 368 0 0 0 0 0 0 0 0 number of symmetry groups nu m b e r o f esse n t i a l i npu t s
502 498 0 0 0 0 0 0 0 0 0
117 592 291 0 0 0 0 0 0 0 0
31 289 421 259 0 0 0 0 0 0 0 number of symmetry groups nu m b e r o f esse n t i a l i npu t s Figure 4:
Prevalence of redundancy . (A) Stratification of all identified update rules based on the number of essentialinputs (rows) and the redundancy, measured by the number of symmetry groups (columns). Update rules with more thaneleven inputs were omitted. (B-C) Expected distribution of the number of symmetry groups for random Boolean functions fordifferent numbers of essential inputs (1-11). For each row, 1000 random functions were generated. In (C), the distribution ofthe canalizing depth of the random Boolean functions was matched to the one observed for the GRN models, shown in Fig. 3A. depth equals the number of inputs (i.e. if all variables follow the described pattern), the function is alsocalled nested canalizing.To test the level of canalization in published GRN models, we stratified all 5698 update rules based ontheir number of essential inputs and their canalizing depth. The number of Boolean functions with a certaincanalizing depth is known [21], and the fraction of random Boolean functions which are canalizing (i.e., thosewith canalizing depth ≥
1) decreases quickly as the number of inputs increases (Fig. 3A). Most identifiedupdate rules, however, possessed a high canalizing depth, even rules with many inputs (Fig. 3B). Most updaterules were even nested canalizing, meaning that all their variables become “eventually” canalizing. Thesefindings agree with earlier, smaller studies [23, 24], which focused solely on the abundance of canalizing andnested canalizing functions but lacked the finer level of detail added by the canalizing depth.Our findings raised an important question: Are GRN models enriched for canalizing functions solely becauseof the strong overabundance of nested canalizing functions, or is there broader evidence for canalizationin general. To answer this, we relied on a broader mathematical definition of the biology-inspired conceptof canalization, called collective canalization [26]. Rather than focusing on single inputs that determinethe output of a function regardless of the values of the remaining inputs, we studied the proportion of sets of inputs that have this canalizing ability. The recently introduced canalizing strength of a Booleanfunction summarizes this information in a single measure [27]. By comparing the canalizing strength of allidentified non-canalizing update rules with 3 to 6 inputs (i.e., those with canalizing depth = 0) with randomnon-canalizing Boolean functions, we found that even those update rules, non-canalizing according to thestringent definition, exhibited a higher level of collective canalization than expected (Fig. 3C).
Redundancy
Genetic redundancy constitutes another important feature of gene regulation, as the presence of duplicategenes provides robustness against null mutations [9, 10]. We tested the level of redundancy contained in theGRN models by quantifying the number of symmetry groups for each update rule. Two regulators are in thesame symmetry group if they have exactly the same effect on the targeted gene, for all possible states of allother regulators. Redundant genes perform the same function and would thus be part of the same symmetrygroup. We found a much higher level of redundancy in the GRN models (i.e., much fewer symmetry groups;Fig. 4A) than expected by chance (Fig. 4B). This comparison is somewhat skewed by the fact that canalizingfunctions on average possess fewer symmetry groups. We corrected for this confounding effect and consideredrandom functions with a distribution of the canalizing depth matched to the one observed for update rulesin published GRN models (Fig. 4C). Although much smaller, this matched comparison still revealed anincreased level of redundancy in the GRN models.
Network motifs
Network motifs are sub-graphs with a specific structure that recur throughout a network and often carryout a certain function [3, 4]. Several network motifs are commonly found in large, static GRN models suchas the transcriptional network of
E. coli [2]. One such motif is the feed-forward loop (FFL), which consistsof three genes: one gene that regulates both others, one target gene that is jointly regulated by both others,and one intermediate gene. In a coherent
FFL, the direct effect of the master regulator on the target has the5 n k n o w n t o t a l c o un t coherent FFLs incoherent FFLs1 2 3 4 5 6 7 8 Gene regulatory networks0.00.20.40.60.81.0 P r o p o r t i o n o f s p e c i f i c FF L s N u m b e r o f FF L s Figure 5:
Abundance of feed-forward loops in the GRN models . (A) Total number of the different types of FFLs inthe 132 GRN models. FFLs, which contained conditional regulations preventing the determination of their exact type, wereclassified as unknown (gray). (B) Number (black line) and proportion (color-coded stacked bar) of the different types of FFLsfor each model. Models without any FFLs are omitted. same sign as the net indirect effect through the intermediate gene. Otherwise, the FFL is incoherent (Fig. 5displays all eight FFL types). Incoherent FFLs may act as sign-sensitive accelerators of the expression of thetarget gene, while coherent FFLs act as sign-sensitive delays [6]. Here, sign-sensitive means that the motifperforms a function only in one direction, either when the target is up- or downregulated.We identified a total of 3847 FFLs in the GRN models and stratified the number of occurrences bytype (Fig. 5A) and additionally by model (Fig. 5B). 142 FFLs (3 . E. Coli and
S. cerevisiae [6], the FFL with three activating edges (type 4 in Fig. 5A) proved by far the most preva-lent. Interestingly, the type 3 FFL far outnumbered the remaining FFL types, including the two othercoherent ones. This is surprising as all three FFL types contain one activating and two inhibiting edges.Given that we identified in total about three times more activating than inhibitory regulations (Fig. 1E),we would expect that FFLs involving more inhibitory regulations should be less frequent. This did notprove to be the case. First, all coherent FFL types, most involving two inhibitions, were at least as frequentas any incoherent FFL type, most of which contain only one inhibition. Second, the incoherent FFL withthree inhibitory regulations (type 5) was much more prevalent than two of the three incoherent FFL typeswith only one inhibitory regulation. In static GRN models of
E. Coli and
S. cerevisiae , the type 8 FFLoutnumbered all other incoherent FFLs (types 5-7) [6]. On the contrary, we found FFL type 8 to be the leastabundant. This may be due to low sample sizes in the earlier publication, or due to genuine differences ingenome-wide transcriptional networks versus dynamic GRN models, which focus on a relatively small subsetof genes involved in a certain biological process of interest. To explain the observed differences, theoreticalstudies similar to [6] may be needed, which focus on the functions of the different types of FFLs in dynamicGRN models.We further investigated the occurrence of clusters of FFLs, which consist of two FFLs that share at leastone node. As with single FFLs, we can distinguish different types of FFL clusters based on the distributionof activating and inhibiting edges in the motif (Fig. 5 enumerates all 15 types of FFL clusters that weconsidered). A recent analysis of a diverse set of natural and engineered networks revealed wide differencesin the distribution of the different types of FFL clusters [28].We identified a total of 85284 FFL clusters in the 132 GRN models. As with the single FFL motifs, westratified the number of occurrences by type (Fig. 6A) and additionally by model (Fig. 6B). As expected,we found most such clusters to involve 5 genes (64472, 75 . . . E. coli and
S. cerevisiae [28], type 6 was the mostabundant. This type of FFL cluster features a master regulator involved in both FFLs and its abundance islikely due to the known presence of transcription factor hubs, which was also observed in this meta-analysis(Fig. 1D). Interestingly, type 11 was slightly more abundant than type 12, and much more abundant thanany other FFL cluster involving 4 genes. On the other hand, transcriptional networks of
E. coli and
S.cerevisiae contained almost exclusively type 12 and hardly any of the other 4-gene FFL clusters [28]. Anexplanation for these discrepancies likely requires novel theoretical or computational studies that relate motif6 t o t a l c o un t P r o p o r t i o n o f s p e c i f i c FF L c l u s t e r N u m b e r o f FF L c l u s t e r s Figure 6:
Abundance of clusters of feed-forward loops in the GRN models . (A) Total number of the different typesof clusters of FFLs in the 132 GRN models. Nodes in the motif graphs are color-coded based on their role in the two clusteredFFLs: blue nodes only occur as regulators, yellow nodes only occur as intermediate genes, red nodes only occur as target genes,and gray depicts nodes with mixed roles. (B) Number (black line) and proportion (color-coded stacked bar) of the differenttypes of clusters of FFLs for each network. Models without any clusters of FFLs are omitted. structure to motif function.Feedback loops (FBLs) constitute another classical network motif. The parity of the number of negativeregulations determines if a FBL is positive (even number) or negative (uneven number). Each gene in apositive (negative) FBL exerts a positive (negative) effect on its own downstream expression. In general,negative FBLs buffer a perturbation and ensure homeostasis, while positive FBLs amplify perturbations andare necessary for bi- or multistationarity [29, 30, 31]. We identified all FBLs involving up to six genes presentin the GRN models. For each FBL, we counted the number of positive and negative regulations involved(Fig. 7). Just like FFLs, some FBLs contained conditional regulations, which prevented the determinationof their exact type. As expected by chance, we found more complex loops than short 2-loops or even auto-regulatory loops (i.e., 1-loops). Also, loops with a balanced number of positive and negative regulations arecombinatorially more likely and were accordingly found more frequently. A general comparison of the num-ber of positive versus negative FBLs is difficult as the expected distribution depends on the particular sizeof the loop considered. Nevertheless, we observed many more self-reinforcing than self-inhibitory regulations(1-loops). On the other hand, when comparing complex loops of the same type, with the same number ofgenes and the same number of combinatorially expected occurrences (e.g., 4-loops with 3 versus 1 positiveregulations, or 6-loops with 5 versus 1 or 4 versus 2 positive regulations), we discovered a surprising over-abundance of negative regulations in FBLs. The fact that activating (i.e., positive) regulations outnumberedinhibitory (i.e., negative) regulations 3:1(Fig. 1E) makes this finding even more surprising.
Dynamical Robustness
Gene regulation is a highly stochastic process due to e.g. low copy numbers of expressed molecules, randomtransitions between chromatin states, or extrinsic environmental perturbations [32, 31]. Despite these varioussources of stochasticity, GRNs must maintain a stable phenotype in order to ensure consistent operation of7 +0 0+1 N.A.0100200300 t o t a l nu m b e r o f - l oo p s t o t a l nu m b e r o f - l oo p s t o t a l nu m b e r o f - l oo p s t o t a l nu m b e r o f - l oo p s t o t a l nu m b e r o f - l oo p s t o t a l nu m b e r o f - l oo p s Figure 7:
Total number of different types of feedback loops in the GRN models . Feedback loops are stratified basedon the number of involved genes (sub panels) and based on the number of positive (+) and negative ( − ) regulations (x-axis).Loops, which contained conditional regulations preventing the determination of their type, were classified as N.A. Number of genes0.81.01.2 D e rr i d a v a l u e f o r a s i n g l e p e r t u r b a t i o n r = 0.15p = 0.1 Figure 8:
Dynamical robustness of the GRN models . For each GRN model, the average size of a single perturbation aftera round of synchronous updates (Derrida value) is plotted against the size of the model. The Spearman correlation coefficientand associated p-value are shown in red. critical dynamical regime,on the edge of order and chaos [33]. In Boolean networks, dynamical robustness can be measured using theconcept of average sensitivity or more general Derrida plots [34, 35], which describe how a small perturbationaffects the network over time. If, on average, the perturbation reduces in size, the system operates in the ordered regime; if it amplifies on average, the system is in the chaotic regime, and if it remains, on average,of the same size, the system exhibits criticality . Many biological systems, modeled using Boolean networks,have already been found to operate in the critical regime [36, 24]. For each GRN model, we computed theDerrida value for a single perturbation (i.e., the average size of a single perturbation after each gene wasonce synchronous updated). We found that most models exhibited an average sensitivity of close to 1, whichconstitutes the critical threshold between order and chaos (Fig. 8). Moreover, model size was not associatedwith the dynamical robustness ( ρ Spearman = − . Conclusion
Gene expression constitutes the most fundamental process in which the genotype determines the phenotype.A detailed understanding of the design principles that regulate this process is therefore of great importance.We utilized combined knowledge from numerous experts in their respective fields to perform a meta-analysisof gene regulatory networks. Boolean networks constituted the perfect modeling framework for this kind ofanalysis due to their simplicity, easy comparability, and widespread use. A large literature search providedus with the most extensive database of expert-curated Boolean gene regulatory network models thus far,which may be queried to generate and test various types of hypotheses.We highlighted the usefulness of this resource by focusing on several proposed design principles of GRNs.We confirmed that the regulatory logic is not random but highly canalized. Using a broader definition ofcanalization, we showed that even regulatory interactions that were not considered canalizing in previousanalyses, exhibited a high level of canalization. Canalization and genetic redundancy are two correlatedconcepts; GRNs proved to be independently enriched for both. We further studied the presence of smallnetwork motifs and discovered various types of motifs that were vastly more or less abundant than expectedby chance. Finally, we provided strong evidence for the hypothesis that most GRNs dynamically operate atthe edge of order and chaos due to a tradeoff between stability and adaptability.The described analysis suffers from several obvious limitations. First, not all biological phenomena canbe accurately described in simple Boolean logic. There are a variety of published models that allow formore than two states. A similar analysis of more general models might provide more detailed insights intogene regulation but will itself suffer from the increased complexity of describing the studied concepts in thenon-Boolean case. Second, there exists no feasible way to test the representativeness or completeness ofour generated database of Boolean models. Even if a complete database of all published Boolean networkmodels existed, the results would still be biased as some processes and species receive more attention and aremodeled more frequently than others. Third, design principles of GRNs will likely differ among kingdomsof life or even among lower taxonomic levels. A stratification of the results based on the organism underconsideration would likely yield valuable information.
Methods
Database creation
Aiming to identify all published Boolean network models of GRNs, we developed an algorithm that parsesall of the more than 30 million abstracts indexed in the literature search engine Pubmed and used keywordsto rank the abstracts based on how likely they were to contain a Boolean network model. To identify thekeywords, we relied on the Cell Collective, a pre-existing repository of Boolean network models, which, atthe time of access, contained 78 Boolean network models published in 65 distinct papers [37]. The abstractsof these 65 papers served as a training set for the identification of keywords indicative of the presence ofa Boolean network model. We considered as possible indicators (i) any word that occurred in at least twoCell Collective abstracts and was not among the most common 3000 words found in an English dictionary,9ii) all fixed combinations of two and three words like “logical modeling” or “Boolean network model”, and(iii) all co-occurrences of two or three single words in the same abstract, e.g. the co-occurrence of the words“logical”, “regulatory” and “modelling” in an abstract, not necessarily in the same fixed order. For anypossible indicator, we calculated a quality score as the ratio of the number of Cell Collective abstracts inwhich it occurred over the total number of Pubmed abstracts containing this indicator. This procedureresulted in 1297 publications with at least one indicator with a quality score of 5% or greater. We thenmanually investigated these 1297 publications to decide if they indeed contained a GRN model. During themanual review, an additional 369 referenced publications were investigated, as they were manually deemedto be of potential interest despite lacking an indicator with quality score ≥ Model exclusion
To avoid the introduction of various of kinds of bias into the analysis, we used strict criteria for the inclusionof models. We excluded models where the update rules were solely generated using an inference methodor where default updates like threshold rules were consistently used. Our goal was to include only modelswhere the update rules were built based on biological expertise and knowledge gained from appropriateexperiments. In addition, identical models that were presented in multiple publications were only includedonce, and we aimed to include the earliest publication that initially presented the model.
Model extraction and standardization
Boolean network models are presented in various formats in the literature. Using customized Python scripts,we extracted all published Boolean network models that were not excluded (see Model exclusion) and trans-formed them into a standardized format. In this format, each line describes the regulation of one gene;the name of the regulated gene is on the left, followed by “=”, followed by the Boolean update rule withoperators AND, OR, and NOT. External parameters do not have an update rule and only occur in theupdate rules of the genes they regulate. For example,
A = B OR CB = A OR (C AND D)C = NOT A represents a model with three genes, A, B and C, and one external parameter D.
Meta-analysis
All analyses were performed in Python 3.7 using the libraries numpy, scipy.stats, and itertools. In particular,we wrote a Python script that takes as input a Boolean model, described in standardized format, and returns,among other things, an adjacency matrix of the wiring diagram of the model, as well as completely evaluatedupdate rules. That is, each update rule of k inputs is represented as a vector of length 2 k , which togetherwith the wiring diagram enables all presented analyses.An additional quality check ensured that highly similar models were only analyzed once. For any combinationof two models, we used the Jaccard index to determine the similarity of the set of model variables. We thenmanually reviewed all pairs with greater than 80% similarity. For each pair/multiple of highly similar models,we removed all but one model from the analysis. This additional quality control step led to the exclusion of19 out of the 151 identified GRN models.For computational reasons, we restricted most analyses to update rules with 16 or fewer inputs. The fourGRN models that contained a total of ten rules with more inputs were excluded from the analyses reportedin Figs. 5, 6, 7, 8, 2, as the specific types of regulation (activation, inhibition, conditional) and number ofessential inputs could not be determined for rules with so many inputs. Acknowledgements
Haris Serdarevic, Jack Kinseth, Taras-Michael Butrie and Evan Hilton all contributed equally to this projectas part of a one-semester research experience for first-year Honor students. Throughout the course of thesemester, they each considered about 200 papers and extracted any Boolean network models they encoun-tered. The authors thank Audrey McCombs for helpful comments on the manuscript.10 ata availability
The database of Boolean GRN models is available upon email request, and will be made publicly availableonce the paper is accepted for journal publication.