A Scalable Algorithm for Structure Identification of Complex Gene Regulatory Network from Temporal Expression Data
AA Scalable Algorithm for Structure Identification of Complex GeneRegulatory Network from Temporal Expression Data
Shupeng Gui , Rui Chen , Liang Wu , Ji Liu , and Hongyu Miao Department of Computer Science, University of Rochester, Rochester, 14620, USA Molecular and Human Genetics, Baylor College of Medicine, Houston, 77030, USA Department of Biostatistics, School of Public Health, UTHealth, Houston, 77030, USA Goergen Institute for Data Science, University of Rochester, Rochester, 14620, USA
July 16, 2018
AbstractMotivation:
Gene regulatory interactions are of fundamental importance to various biological functionsand processes. However, only a few previous computational studies have claimed success in revealinggenome-wide regulatory landscapes from temporal gene expression data, especially for complex eukaryoteslike human. Moreover, recent work suggests that these methods still suffer from the curse of dimension-ality if network size increases to 100 or higher.
Result:
We present a novel scalable algorithm for identifying genome-wide regulatory network struc-tures. The highlight of our method is that its superior performance does not degenerate even for anetwork size on the order of 10 , and is thus readily applicable to large-scale complex networks. Such abreakthrough is achieved by considering both prior biological knowledge and multiple topological prop-erties (i.e., sparsity and hub gene structure) of complex networks in the regularized formulation. Wealso illustrate the application of our algorithm in practice using the time-course expression data from aninfluenza infection study in respiratory epithelial cells. Availability and Implementation:
The algorithm described in this article is implemented in MATLAB (cid:114) .The source code is freely available from https://github.com/Hongyu-Miao/DMI.git . Contact: [email protected]; [email protected]
Supplementary information:
Supplementary data are available online.
Gene regulatory network (GRN), consisting of multiple regulators and their target molecules, plays criticalroles in numerous biological processes by modulating the expression levels of RNAs and proteins [26]. Whileremarkable successes in dissecting single genes that are responsible for certain biological functions, behavioror diseases have been achieved over the past few decades, it has been increasingly recognized that elucidatinggene functions and interactions in the context of networks becomes more and more important to gain novelinsight into mechanisms, effects and interventions of molecular, cellular or organ-level biological processes[4, 5, 27]. Clearly, one of the prerequisites for investigators to harvest the benefits of such systematicnetwork approaches is whether the structures of gene regulatory networks can be accurately revealed fromexperimental data.Modern high-throughput experimental technologies such as next generation sequencing [40] can gener-ate time-course data at a much more affordable cost [10], thus provide unprecedented opportunities for1 a r X i v : . [ q - b i o . M N ] F e b esearchers to systematically investigate the temporal patterns of gene expressions and infer gene regula-tory relationships. However, two well-known major obstacles have significantly hampered our ability tointerrogate such data for novel scientific findings. First, limited by resources or technical and ethic issues,the sampling frequency of time-course gene expression profiling data is low (e.g., most of the time-courseGEO datasets [14] have less than 6 time points), which renders the sample size far less than the numberof unknown parameters in the context of GRN structure identification. Targeting at such scenarios, it is ofsignificant importance to borrow information from additional sources (e.g., previous biological knowledge).Second, considering the fact that for complex eukaryotes like human, the number of protein-coding genesis approximately 19,000 [15] so the problem dimension is ultra-high (i.e., tens of thousands or even millionsunknown parameters are involved). The development of novel and more efficient algorithms that can scaleto such high-dimensional networks is still necessary.A number of modeling and computational approaches have been developed for gene network structureidentification [35], including information theory method [e.g. 45], clustering method [e.g. 30], Boolean network[49], Bayesian network [e.g. 19], state-space model [21], regression model [52], and differential equation model[e.g. 13]. The well-known methods for static expression data include the Pearson correlation coefficient (PCC)and the mutual information (MI) methods [2, 36], the ARACNE algorithm [38] based on the data processinginequality concept [12], the context likelihood of relatedness (CLR) approach [16] that extends the relevancenetworks method [8], the GENIE3 algorithm [23] that uses tree-based gene ranking and ensemble, and theTIGRESS algorithm [20] that combines the least angle regression with stability selection. However, only afew algorithms were specifically developed for time-course data. [53] proposed to calculate the time-delayeddependencies between gene expressions using mutual information, and developed and tested the TimeDelay-ARACNE algoithm on networks with less than 30 genes. Our group introduced the concept of backgroundnetwork to confine the parameter space using prior biological knowledge and built the SITPR pipeline basedon a dynamic Bayesian network (DBN) model regularized by L norm [33]. More recently, [24] developedthe Jump3 method by combining the on/off stochastic differential equation model with decision trees, wherethe marginal likelihood of each node was used as the decision function.The DREAM Challenge [48] makes it feasible to generate authoritative benchmark data for algorithmperformance evaluation. To the best knowledge of our authors, none of the algorithms above have been eval-uated on DREAM networks of a size greater than 100; actually, their performances all become unsatisfactoryat a network size of 100, as verified in several previous studies [2, 33, 36]. Therefore, a breakthrough in algo-rithm development is necessarily needed to obtain accurate and reliable results for large-scale GRN inference(e.g., a network size O(10 )). The work of [33] is one of the few studies that systematically borrow informa-tion from previous biological knowledge to deal with the low sampling frequency and the high dimensionalityissues in GRN inference; however, even after the incorporation of prior knowledge, the number of unknownparameters is still on the order of 10 so the high dimensionality issue remains. Since GRNs are typicalcomplex networks with structural sparsity [31], it is common to impose sparsity on the inference problemformulation in previous studies [20, 33, 52]; however, it turned out that sparsity alone cannot satisfyinglyaddress the high dimensionality problem [33]. This observation leads us to the hypothesis that consideringadditional structural or topological properties of large-scale complex networks may result in novel scalablealgorithms for GRN inference that are not subject to the curse of dimensionality.In this study, we adopt the background network approach developed in our previous study [33] forparameter space confinement, and we modify other selected state-of-the-art algorithms to take the advantageof the same background network for fairness of algorithm performance comparison. More importantly, abreakthrough in algorithm performance is achieved by considering both structural sparsity and the existenceof hub genes, which is a prominent topological feature of GRNs due to preferential gene attachment [5].We describe the mathematical formulation and computational steps of the novel algorithm in Section 2.We evaluate and compare the performance of our algorithm with other state-of-the-art approaches usingDREAM4 benchmark data that are generated from networks of a size up to 2 × , and we apply theproposed method to real temporal gene expression data from an influenza H1N1 virus infection study toillustrate its usage in practice. The related computational settings and results are presented in Section 3.Finally, we discuss the advantages and limitations of this work in Section 4.2 hole0 20 4001020304050 Group Sparsity0 20 4001020304050
Sparsity0 20 4001020304050
C A B = +
G44G48G47 G43G46G40 G49 G45G50 G3G1G17G39 G22G16 G28G21 G27G8 G20 G31G11G42G26G25G13G23G30 G5G24 G6 G7G19G4G29 G36 G32G18 G34G38 G35 G33G12 G37G41G14 G10 G2G9G15
G28 G44 G43 G38G41 G36G35G27 G37G46G8G47 G48G21 G20G40 G49G13G11G30 G23G25G42G26 G33G31 G12 G32G18 G34G9G16G14 G10G45G50 G17G39 G22G15 G19 G7G6G3 G4G2 G24G1 G5G29
G28G21G8 G27G20G22 G45G42 G17G30 G50G39G16G25G13G23G11G26 G18G12G29G24G19G31G14 G9 G10 G2G15 G34G36G35 G32G33G37G1 G3 G4 G5 G6 G7G44G48G47 G41G43G40 G46G49 G38 = +
Figure 1: Illustration of the hub gene structure separation and the corresponding coefficient matrices.
For ultra-high dimensional problems considered in this study, it is important to constrain the parameterspace based on prior biological information to improve computing efficiency and avoid over-fitting, especiallyfor datasets with limited sample sizes. For this purpose, the concept of background regulatory network wasintroduced in our previous work [33], and we recently also developed a curated database called RegNetworkfor public use [32]. Here we present the key features of the background network and describe how it isemployed in this study for GRN inference.First, the background regulatory network is comprehensive so it contains not only experimentally-observed regulatory interactions but also physically, chemically or biologically feasible interactions. Forthis purpose, information from over 20 selected databases or knowledgebases (e.g., FANTOM [46], TRANS-FAC [39], JASPAR [7], KEGG [25], HPRD [42], and ENCODE [18]) have been collected and integrated. Inaddition, potential interactions between transcription factors and target genes are predicted based on se-quence motifs obtained from Ensembl [17]. In this way, the background network also allows the discovery ofnovel regulatory relationships that have not been officially reported or documented. Second, the backgroundnetwork is not cell type- or condition-specific but it allows the detection of cell type- or condition-specificregulatory relationships. The reason is that under different conditions (e.g., different diseases), only certainregulatory interactions in certain types of cells will be activated in the background network, and an appro-3riately designed algorithm should be able to detect a reasonable number of activated interactions with thepresence of the inactive ones. Third, the background network is different from random networks [3] in termsof certain properties like characteristic path length and node degree distribution [1]; also, its node degreesare found to satisfy the power-law distributions so the background network is scale-free.Here we use the background regulatory network concept in both the simulation studies and the realdata example. That is, our primary task is to identify the activated interactions in the confined parameterspace specified by the background network. We recognize that some existing algorithms under comparisonstart with the full network structure; therefore, for fairness, we make necessary efforts to modify theircomputer codes to take the background network as the parameter space. Also, it should be mentioned thatthe same weight is assigned to all the edges in the background network in this study for simplicity. Thatmeans, our algorithm will automatically determine the sparse structure and the hub gene structures withoutinforming itself the potential location of such structures beforehand. The algorithm performance can befurther improved if edge weights can be assigned based on evidence strength in, e.g., literature; however,even with a simple equal-weight scheme, our algorithm can already achieve a satisfying performance (see theResults Section). Finally, for simulation studies, the background network is introduced by adding additionalrandom edges to the DREAM4 network structures; for real data example, the background network for humanfetched from the RegNetwork database is used, which contains 23,079 nodes and 372,774 edges.
While sparsity of GRNs has been extensively explored in previous studies, the hub gene structure has rarelybeen addressed in existing formulations of GRN inference. This study suggests that in addition to sparsity,the hub gene structure is also of significant importance to the development of scalable algorithms that arenot subject to the curse of dimensionality.Throughout this paper, the following notations are used: • n denotes the total number of nodes in a network; • x t ∈ R n denotes the expression level of n genes (and miRNAs) at time t ; • (cid:12) denotes the Hadamard product (i.e., the product of the entries of two matrices at the same position).For example, [1 , (cid:12) [0 ,
2] = [0 , • (cid:107) · (cid:107) F is the Frobenius norm of a matrix M : (cid:107) M (cid:107) F = (cid:113)(cid:80) i,j M ij ; • (cid:107) . (cid:107) is the Spectral norm (i.e., the maximal singular value of a matrix); • C · j : j th column of C ; • C i · : i th row of C .The dynamic Bayesian network (DBN) model [54] is considered in this study for GRN inference. It hasbeen shown by previous studies [28, 54] that under the normality and Markov assumptions, the DBN modelcan be reduced to a vector autoregression (VAR) model as follows: x t +1 = P x t + w t , t = 1 , · · · , T − , (1)where T is the total number of time points, and P ∈ R n × n is the coefficient matrix, characterizing the changefrom time t to the next time point t + 1. P ij (cid:54) = 0 indicates the existence of regulatory relationship betweenregulator j and target i . Also, for simplicity and robustness, uneven time intervals are treated as of equallength. We consider an equivalent form of Eq. (1): y t +1 ,t := x t +1 − x t = C x t + w t , (2)4here C = P − I . Our goal is to estimate the coefficient matrix C (or equivalently P ) given the time-courseobservations { x t } Tt =1 . The objective function can be formulated as followsmin C (cid:107) Y − CX (cid:107) F = T − (cid:88) t =1 (cid:107) ( x t +1 − x t ) − C x t (cid:107) , (3)where Y := [ x − x , x − x , · · · , x T − x T − ] ∈ R n × ( T − and X := [ x , x , · · · , x T − ] ∈ R n × ( T − .Since the total number of distinct time points are usually small, the solution obtained by directly solving(3) suffers from the overfitting issue. To overcome this problem, the commonly used strategy is to confinethe model space using regularizations or constraints; for example, the (cid:96) norm regularization for imposingmodel sparsity. However, in addition to sparsity, there exist other prominent topological characteristics forcomplex networks like GRNs [11]. To illustrate this, we use the Yeast gene network as an example. Thetop left graph in Fig. 1 shows a typical substructure of the Yeast gene network; the bottom left figure isthe corresponding coefficient matrix C , where the blue dots indicate nonzeros (3.6%) and the white areasindicate zeros (96.4%) in C . Sparsity is thus a clear feature to tell from Fig. 1, and we also consider thefollowing network characteristics in our model formulation. Hub gene structure separation : As suggested in Fig. 1, there exist a small number of gene nodes witha high degree, called the “hub” genes. If we define the hub gene structure as the set of outgoing edges from ahub gene node to its immediate neighbors, we can separate this structure from the original GRN such thatmatrix C can be decomposed into C = A + B, where matrix A corresponds to the hub gene structure (the upper middle graph in Fig. 1) and matrix B corresponds to the remaining edges (the upper right graph in Fig. 1).It can be told from the lower middle plot of Fig. 1 that matrix A for hub gene structure has a columngroup sparse structure (that is, it only contains a few nonzero columns), while the matrix B for the remainingedges shows a more uniform sparsity pattern. Therefore, we consider the (cid:96) , norm convex regularization on A to enforce the group sparsity structure [22], (cid:107) A (cid:107) , := (cid:88) j ( (cid:88) i | A ij | ) , and consider the commonly used (cid:96) regularization [9] on B to enforce the uniform sparsity, (cid:107) B (cid:107) := (cid:88) i,j | B ij | . Small number of parent nodes : Based on the real GRN structures previously reported (e.g., Yeast),we find that while one gene may directly modulate the expressions of many other genes, a single gene isusually co-regulated by only a few regulators. That is, in a GRN represented by a directed acyclic graph,the number of parent nodes of a node is often small. This observation suggests the non-zero elements in anyrow of matrix C should be small (see the bottom left of Fig. 1). We thus can impose the following constrainton the rows of matrix C (cid:107) C i · (cid:107) ≤ ξ i , i = 1 , · · · , n, where ξ i is a predefined upper bound of the number of incoming edges to gene i (e.g., 0.7 times the indegreeof a node in the background network). Background network : The mask of the background network can be denoted as Ω ∈ { , } n × n , whereΩ ij = 1 corresponds to the non-existence of any regulatory interaction between genes i and j and Ω ij = 0corresponds to the existence of a possible interaction between genes i and j that needs to be determined5rom temporal expression data. Therefore, the background network constraints imposed on matrix C can bedenoted as follows: P Ω ( C ) := C (cid:12) Ω = 0 . Combining all the constraints above, the overall GRN structure identification formulation is given below:min
A,B,C (cid:107) ( A + B ) X − Y (cid:107) F + α (cid:107) A (cid:107) , + β (cid:107) B (cid:107) s.t. (cid:107) C i · (cid:107) ≤ ξ i P Ω ( A ) = 0 , P Ω ( B ) = 0 A + B = C. (4)where α and β are two penalty coefficients that balance the weights on group sparsity and uniform sparsity. Existing solvers cannot deal with the proposed formulation above, so we introduce an efficient optimiza-tion algorithm, called Decomposable Multi-structure Identification (DMI) (Algorithm 1), that can handle anetwork size greater than 20 k on an average computer [6].We first define the Augmented Lagrangian function: L ( A, B, C, U ) = 12 (cid:107) ( A + B ) X − Y (cid:107) F + α (cid:107) A (cid:107) , + β (cid:107) B (cid:107) + ρ (cid:107) A + B − C (cid:107) F + (cid:104) U, A + B − C (cid:105) + I (cid:107) C i · (cid:107) ≤ ξ i ( C ) + I P Ω ( A )=0 ( A ) + I P Ω ( B )=0 ( B ) , (5)where ρ is a positive real number, I condition ( · ) denotes the function that gives 0 if the argument variablesatisfies the condition and gives positive infinity otherwise, and U is a dual matrix of the same size as C introduced to solve the optimization problem. Algorithm 1
Decomposable Multi-structure Identification (DMI)
Require: X , Y , ξ i , ρ > Ensure: C repeat [ A, B ] = arg min
A,B L ( A, B, C, U ) using Algorithm 2; C = arg min C L ( A, B, C, U ) to update C using (8);Update U by U ← U + ρ ( A + B − C ); until convergence; return C .More specifically, we can break the optimization problem into minimization w.r.t. A and B , minimizationw.r.t. C , and minization w.r.t. U as follows. Minimization w.r.t. A and B . Since some terms in Eq. (5) such as I (cid:107) C i · (cid:107) ≤ ξ i ( C ) contain no A or B ,the optimization problem can be simplified:min A,B (cid:107) ( A + B ) X − Y (cid:107) F + α (cid:107) A (cid:107) , + β (cid:107) B (cid:107) + ρ (cid:107) A + B − C (cid:107) F + (cid:104) U, A + B − C (cid:105) + I P Ω ( A )=0 ( A ) + I P Ω ( B )=0 ( B ) , (6)6hich can be solved using the coordinate block descent method [50]. By taking partial derivatives of theobjective function, we can derive g = [( A + B ) X − Y ] X (cid:62) + ρ ( A + B − C ) + U ;and by applying the proximal gradient descent method [44], the following update rules for A and B at eachiteration can be obtained A · i = prox γα (cid:107)·(cid:107) (( A − γg A ) · i )= max (cid:16) , − γα (cid:107) ( A − γg A ) · i (cid:107) (cid:17) ( A − γg A ) · i , and B = prox γβ (cid:107)·(cid:107) ( B − γg B )= sign( B − γg B ) (cid:12) max(0 , | B − γg B | − γβ ) , where sign( a ) = 1 if a >
0, sign( a ) = 0 if a = 0, and otherwise −
1. Usually we choose γ = (cid:107) X (cid:107) + ρ as thesafe step length for each update to guarantee the monotonic decreasing of the objective function (6). Moredetails about the accelerated proximal gradient descent method are given in Algorithm 2. Minimization w.r.t. C . Similar to the objective function (6), some terms in Eq. (5) like I P Ω ( A )=0 ( A ) donot contain matrix C . Hence, Eq. (5) can be simplified as follows:min C ρ (cid:107) A + B − C (cid:107) F + (cid:104) U, A + B − C (cid:105) + I (cid:107) C i · (cid:107) ≤ ξ i ( C ) , ∝ (cid:107) A + B − C + 1 ρ U (cid:107) F + I (cid:107) C i · (cid:107) ≤ ξ i ( C ) . (7)We thus update C by the following rule C = P ξ ( A + B + 1 ρ U ) , (8)where P ξ is a projection of (cid:107) C i · (cid:107) ≤ ξ i by retaining the largest ξ i elements in the i -th row of C . Minimization w.r.t. U . Similar to the derivation of the objective function (7) above, we havemin U (cid:104) U, A + B − C (cid:105) . (9)We update U using U = U + ρ ( A + B − C ) for each iteration. Let | ¯Ω | denote the total number of zeros in the background network matrix Ω ∈ R n × n used in Algorithms1 and 2. Algorithm 1 is assumed to run J iterations to converge, and for each iteration of Algorithm 1,we run Algorithm 2 for K iterations. Therefore, the complexity of Algorithm 2 is O ( | ¯Ω | × T × K ), where X, Y ∈ R n × ( T − , T (cid:28) n and | ¯Ω | (cid:28) n . Therefore, the complexity of DMI is O ( | ¯Ω | × T × K × J ).7 lgorithm 2 Accelerated Proximal Gradient Descent Method [43]
Require: n (number of genes in network), γ (step length), α , β , C , U Ensure: A , B Initialize t = 1, V = A = , W = B = , ρ = 10; for k = 1: K do g k = P Ω ([( V k + W k ) X − Y ] X (cid:62) + ρ ( V k + W k − C ) + U ); for i=1 to n do A k +1 · i = prox γα (cid:107)·(cid:107) (( V k − γg k ) · i ); B k +1 = prox γβ (cid:107)·(cid:107) ( W k − γg k ); t k +1 = + (cid:112) t k ; V k +1 = A k +1 + ( t k − t k +1 )( A k +1 − A k ); W k +1 = B k +1 + ( t k − t k +1 )( B k +1 − B k ); return A and B . To evaluate and compare the proposed method with other state-of-the-art algorithms, we employ GeneNetWeaver[37, 48], the official DREAM Challenge tool for time-course expression data generation. More specifically,GeneNetWeaver uses the GRNs from Yeast (4,441 nodes, 12,873 edges) or Ecoli (1,565 nodes, 3,785 edges)as templates to generate network structures with typical complex network properties, of a network size upto ∼ , n = 10 , , , m denote the numberof edges in a ground truth network, we then randomly add m additional edges to the ground truth networkto generate the background network with 2 m edges. Consequently, the goal of GRN inference algorithms inthe simulation studies is to identify as many ground truth edges as possible from all the background networkedges, with a controlled false positive rate. It should be stressed that networks of a size 20 ,
000 cannotbe generated from the Yeast or Ecoli templates, so we supply the large-scale GRNs from the RegNetworkdatabase [32] to GeneNetWeaver as templates (e.g., the human GRN in RegNetwork has ∼ ,
000 nodesand ∼ ,
000 edges). For each simulated dataset, one data point is generated at each of 6 distinct timepoints, respectively, for a pre-specified noise level to match the observation scheme in the real data examplein Section 3.2.
Since the state-of-the-art algorithms under comparison such as ARACNE [38] start with a full matrix C ,for fairness of comparison, we make efforts for these existing algorithms to also take the advantage of thebackground network by confining the calculation of the performance evaluation criteria to the backgroundnetwork. The computing parameters used by existing algorithms are tuned as suggested in their originalmanuscripts (see Table S1 in Supplementary Data I). For our DMI algorithm, we set α = 0 . β = 0 .
16 and ρ = 10 for all the simulated datasets,where α and β are determined using cross-validation.Five commonly-used criteria are considered to measure algorithm performance, including sensitivity( SN ), specificity ( SP ), accuracy ( ACC ), Matthews correlation coefficient (
M CC ), and the Area UnderROC Curve (
AU C ): 8 N = T PT P + F N ,SP = T NT N + F P ,ACC = T P + T NT P + F P + T N + F N ,M CC = T P × T N − F P × F N (cid:112) ( T P + F P )( T P + F N )( T N + F P )( T N + F N ) , where T P and
T N denote the true positive and true negative, and
F P and
F N denote the false positiveand false negative, respectively.Table 1: Performance evaluation of DMI and other competing algorithms on a network size 10, 100, or 1,000 at a 10% noiselevel.
Methods Size SN SP ACC MCC AUCDMI 10 ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± CLR 10 0.5900 ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± Table 2: Evaluation of DMI at a noise level from 10% to 30% for a network size 10, 100, 1,000 or 20,000.
Noise Level Size SN SP ACC MCC AUC10% noise 10 0.9333 ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± BIRC5XBP1RBCK1TRIB3RAB17RNASE4C5orf32C20orf196CLCN6FGL1 −1.5−1−0.5 (a) (b) (c) CDKN1A SNRNP40ANKRA2SEZ6L2RAB37SECISBP2KIF18AQPCTTM9SF2DNAJC27CYSTM1PRELID1KCNK6MAOAC17orf28PJA2GCC1HCFC1POLR2AZDHHC9WDR36ABLIM2 ATF2 ZWILCHPWP1RCC2TBPL2CHSY1LMBRD1CLTBHHEX MKLN1PPP2R2ACLCN3 RCE1 DDX6DNAJC9TMEM66CCNA2TSPAN7 TRAIPUBE2IRBBP8KDELR1CREB3L2SH3PXD2AANKRD11 COX4NBJAG2 COPAZHX2 STX5ARMCX6NFIL3UCHL5TMED5UCK2HARSRBM12WDR77EFR3AISL2FAM171A1 FKBP2CEP55SNAP91LGALS4MRPL4GOLGB1 PELI2 LONRF1FAM71E1 AHCYL2 CDKL3PTPRUELL2RAD51CCBX4 ARF4DTLCHPFINHBE CENPEGNG4NUMBTSPAN31DNAJB11 NT5ECAMKVMAP1LC3A CEBPBARF1 IGIPPLEKHH2HSP90AB1GPR108PKD1PPIGGRAMD1AMCM10 PFAS MBD6TOM1 C14orf80CCDC130ITGA1 MAPK6HLA-F ST13 CCND1 ERP29OXSR1IKBKB CDK2PPP2R5BKIF1A DNAJB2NCOA6PAQR4 RUVBL2REEP4EBF1CIRBP-AS1 TMEM39ATTC9C USP25SAFB2 TMEM59LNUP98LIN54EOGTZNF367
Figure 2: Application of the DMI algorithm to the expression data from human A549 cells in response toinfluenza H1N1 virus infection. (a) Example of differentially expressed genes,where the color bar values arethe normalized gene expression levels; (b) Overall GRN structure (the full details can be found on GitHub);(c) ATF2 hub gene structure.
The first set of experiments are conducted to compare our DMI algorithm with nine representative algorithms,including PCC, ARACNE [38], CLR [16], MINET (the maximum relevance minimum redundancy method)[41], GENIE3 [23], TimeDelay-ARACNE [53], TIGRESS [20], SITPR [33], and Jump3 [24]. Specifically,SITPR [33] is a multi-step pipeline with the (cid:96) norm penalty incorporated; to avoid manual interventionneeded by certain SITPR steps, we simply compare the constrained LASSO step in SITPR with DMI. Inthe simulated data, the noise level is fixed at 10%, and the networks size ranges from 10 to 1,000 becauseother competing algorithms cannot handle a network size of 20 , . ∼ .
50 for n = 1 , The real data example for illustrating the use of DMI in practice is from the recent study of [34], where humanA549 cells were infected with influenza H1N1 virus (A/Mexico/InDRE4487/2009). Illumina HumanHT-12v3 BeadChips and Febit miRBase 14 Geniom miRNA Biochips were employed to measure the mRNA andmiRNA samples, respectively. Six replicates of the expression levels were collected at each of six time points(0, 4, 8, 24, 48, and 72 hours post infection), but one sample at hour 4 was excluded due to degradation.The within-sample normalized dataset is available from NCBI GEO (GSE36553 and GSE36461). We furtherconducted between-sample normalization on the data across the six time points.The original analyses in [34] did not use all the time points; instead, a small number of regulatoryrelationships between functional miRNAs and gene targets were inferred based on data on hr 0 and hr 8only. The genome-wide regulatory landscape has not been revealed based on the entire dataset although [33]made an attempt to investigate the subnetwork structures of a size from 2 to 300. In this study, 1,572 genes10nd 14 non-coding RNAs were identified to be differentially expressed (Figure 2(a)) using the FPCA method[51] for a false discovery rate controlled at 0.05. From the human background network in RegNetwork, theDMI algorithm was applied to identify 1,926 regulatory interactions between the differentially expressedgenes or miRNAs, as shown in Figure 2(b).We conducted comprehensive network analyses on the topological structure and found that the A549GRN in response to influenza H1N1 infection is a typical complex network, evidenced by, e.g., its powerlaw degree distribution and a large clustering coefficient (see Supplementary Data I for details). A closeexamination of the inferred GRN by the DMI algorithm suggests consistency between the findings in thisstudy and those in literature. For instance, ’CEBPB’ (CCAAT/Enhancer-Binding Protein Beta) is a tran-scription factor (TF) that plays an important role in immune and inflammatory responses [47], and thesubnetwork structure centered at ’CEBPB’ has been previously reported in [33]. Our study also revealeda similar subnetwork structure around ’CEBPB’ (e.g., interactions between ’CEBPB’ and ’NCS1’, ’ISG20’,or ’ABCG1’. See Figure S2 in Supplementary Data I); however, some experimentally-verified interactions(e.g., between ’CEBPB’ and ’NOLC1’ in HPRD) were successfully identified in this study, but were missingin the work of [33]. More interestingly, we found a number of hub gene structures, e.g., centered at theActivating Transcription Factor genes (’ATF2’ and ’ATF4’) or the E2F Transcription Factor genes (’E2F6’and ’E2F7’). More specifically, ’ATF2’ encodes a DNA binding protein of the leucine zipper family that canperform distinct functions via different mechanisms (e.g., formation of a homo/heterodimer with ’c-Jun’).Our results showed that ’ATF2’ interacts with more than 130 other genes, including ’CEBPB’ (Figure 2(c)).Although it has been previously reported that influenza A viral RNA molecules can indirectly activate tran-scription factors like ATF2 [29], the importance of ATF2 in the context of influenza infection has not beenfully appreciated by previous experimental studies, considering the large number of target genes associatedwith ATF2 in the hub gene structure. The proposed DMI algorithm thus provides us not only an opportu-nity to investigate the genome-wide regulatory landscapes but also a way to identify interesting subnetworkslike hub gene structures. Further experiments like ChiP-PCR for measuring gene-TF interactions can beconducted in the future to validate the results produced by DMI.
We proposed a novel scalable algorithm called DMI for large-scale GRN inference from temporal gene ex-pression data in this study. Extensive simulation studies and real data application suggest the superiorityof the DMI algorithm over other state-of-the-art approaches, and the success mainly relies on the incorpo-ration of multiple topological characteristics of GRNs like sparsity and hub gene structures into our modelformulation.We also recognize several limitations of the current study: 1) DMI’s performance can be further improvedif the background network edges can be appropriately weighted; 2) Prediction of previously unseen interac-tions heavily depends on the background network preparation so further efforts need to be invested to thedevelopment of public knowledgebases like RegNetwork; 3) This work mainly focuses on solving the ultra-high dimensional inference problems, and data heterogeneity and integration issues have not been tackled.We believe that it is necessary to address the aforementioned limitations in separate articles, and this studyprovides a solid basis for such future investigations.
Acknowledgements
We thank Dr. Zhiping Liu (Shandong University, China) and Dr. Hulin Wu (UTSPH) for developing theRegNetwork database. We also thank Dr. Andrew P. Rice (BCM) for constructive feedbacks.
Funding
This work was partially supported by NIH/NEI grants R01EY022356, R01EY018571, and R01EY020540(RC), NSF grant CNS-1548078 (JL), and NSF grant DMS-1620957 (HM).11 eferences [1] R. Albert. Scale-free networks in cell biology.
Journal of Cell Science , 118(Pt 21):4947–57, 2005.[2] M. Bansal, V. Belcastro, A. Ambesi-Impiombato, and D. di Bernardo. How to infer gene networks fromexpression profiles.
Mol Syst Biol , 3:78, 2007.[3] A. L. Barabasi and R. Albert. Emergence of scaling in random networks.
Science , 286(5439):509–12,1999.[4] A.-L. Barabasi, N. Culbahce, and J. Loscalzo. Network medicine: a network-based approach to humandisease.
Nature Reviews Genetics , 12:56–68, 2011.[5] A.-L. Barabasi and Z. N. Oltvai. Network biology: Understanding the cell’s functional organization.
Nature Reviews Genetics , 5:101–113, 2004.[6] S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein. Distributed optimization and statisticallearning via the alternating direction method of multipliers.
Foundations and Trends R (cid:13) in MachineLearning , 3(1):1–122, 2011.[7] J. C. Bryne, E. Valen, M. H. Tang, T. Marstrand, O. Winther, I. da Piedade, A. Krogh, B. Lenhard,and A. Sandelin. Jaspar, the open access database of transcription factor-binding profiles: new contentand tools in the 2008 update. Nucleic Acids Res , 36(Database issue):D102–6, 2008.[8] A. J. Butte and I. S. Kohane. Mutual information relevance networks: functional genomic clusteringusing pairwise entropy measurements.
Pac Symp Biocomput , pages 418–29, 2000.[9] E. Candes and J. Romberg. l1-magic: Recovery of sparse signals via convex programming. , 4:46, 2005.[10] R. Chen and M. Snyder. Promise of personalized omics to precision medicine.
Wiley InterdisciplinaryReviews: Systems Biology and Medicine , 5(1):73–82, 2013.[11] L. Costa, F. Rodrigues, G. Travieso, and P. V. Boas. Characterization of complex networks: A surveyof measurements.
Advances in Physics , 56(1):167–242, 2007.[12] T. Cover and J. Thomas.
Elements of Information Theory . John Wiley & Sons, New York, 1991.[13] H. De Jong. Modeling and simulation of genetic regulatory systems: A literature review.
Journal ofComputational Biology , 9(1):67–103, 2002.[14] R. Edgar, M. Domrachev, and A. E. Lash. Gene expression omnibus: Ncbi gene expression and hy-bridization array data repository.
Nucleic Acids Res , 30(1):207–10, 2002.[15] I. Ezkurdia, D. Juan, J. M. Rodriguez, A. Frankish, M. Diekhans, J. Harrow, J. Vazquez, A. Valencia,and M. L. Tress. Multiple evidence strands suggest that there may be as few as 19 000 human protein-coding genes.
Human Molecular Genetics , 2014.[16] J. J. Faith, B. Hayete, J. T. Thaden, I. Mogno, J. Wierzbowski, G. Cottarel, S. Kasif, J. J. Collins, andT. S. Gardner. Large-scale mapping and validation of escherichia coli transcriptional regulation from acompendium of expression profiles.
PLoS biol , 5(1):e8, 2007.[17] P. Flicek, M. R. Amode, D. Barrell, et al. Ensembl 2012.
Nucleic Acids Res , 40(Database issue):D84–90,2012.[18] M. B. Gerstein, A. Kundaje, M. Hariharan, et al. Architecture of the human regulatory network derivedfrom encode data.
Nature , 489(7414):91–100, 2012.1219] A. Hartemink.
Bayesian networks and informative priors: Transcriptional regulatory network models ,pages 401–424. Cambridge University Press, Cambridge, UK, 2006.[20] A. C. Haury, F. Mordelet, P. Vera-Licona, and J. P. Vert. Tigress: Trustful inference of gene regulationusing stability selection.
BMC Syst Biol , 6:145, 2012.[21] O. Hirose, R. Yoshida, S. Imoto, R. Yamaguchi, T. Higuchi, D. S. Charnock-Jones, C. Print, andS. Miyano. Statistical inference of transcriptional module-based gene networks from time course geneexpression profiles by using state space models.
Bioinformatics , 24(7):932–942, 2008.[22] J. Huang, T. Zhang, et al. The benefit of group sparsity.
The Annals of Statistics , 38(4):1978–2004,2010.[23] V. A. Huynh-Thu, A. Irrthum, L. Wehenkel, and P. Geurts. Inferring regulatory networks from expres-sion data using tree-based methods.
PLoS One , 5(9), 2010.[24] V. A. Huynh-Thu and G. Sanguinetti. Combining tree-based and dynamical systems for the inferenceof gene regulatory networks.
Bioinformatics , 31(10):1614–1622, 2015.[25] M. Kanehisa and S. Goto. Kegg: kyoto encyclopedia of genes and genomes.
Nucleic Acids Res , 28(1):27–30, 2000.[26] G. Karlebach and R. Shamir. Modelling and analysis of gene regulatory networks.
Nat Rev Mol CellBiol , 9(10):770–780, 2008.[27] B. A. Kidd, L. A. Peters, E. E. Schadt, and J. T. Dudley. Unifying immunology with informatics andmultiscale biology.
Nature Immunology , 15(2):118–127, 2014.[28] S. Y. Kim, S. Imoto, and S. Miyano. Inferring gene networks from time series microarray data usingdynamic bayesian networks.
Briefings in Bioinformatics , 4(3):228–235, 2003.[29] G. Kochs, A. Garca-Sastre, and L. Martnez-Sobrido. Multiple anti-interferon actions of the influenza avirus ns1 protein.
Journal of Virology , 81(13):7011–7021, 2007.[30] P. Langfelder and S. Horvath. Wgcna: an r package for weighted correlation network analysis.
BMCBioinformatics , 9(1):559, 2008.[31] R. D. Leclerc. Survival of the sparsest: robust gene networks are parsimonious.
Molecular SystemsBiology , 4(1):213, 2008.[32] Z.-P. Liu, C. Wu, H. Miao, and H. Wu. Regnetwork: an integrated database of transcriptional andpost-transcriptional regulatory networks in human and mouse.
Database , 2015:bav095, 2015.[33] Z.-P. Liu, H. Wu, J. Zhu, and H. Miao. Systematic identification of transcriptional and post-transcriptional regulations in human respiratory epithelial cells during influenza a virus infection.
BMCbioinformatics , 15(1):1, 2014.[34] E.-K. Loveday, V. Svinti, S. Diederich, J. Pasick, and F. Jean. Temporal- and strain-specific hostmicrorna molecular signatures associated with swine-origin h1n1 and avian-origin h7n7 influenza avirus infection.
Journal of Virology , 86(11):6109–6122, 2012.[35] D. Marbach, J. C. Costello, R. Kuffner, N. M. Vega, R. J. Prill, D. M. Camacho, K. R. Allison,D. Consortium, M. Kellis, J. J. Collins, and G. Stolovitzky. Wisdom of crowds for robust gene networkinference.
Nat Methods , 9(8):796–804, 2012.[36] D. Marbach, R. J. Prill, T. Schaffter, C. Mattiussi, D. Floreano, and G. Stolovitzky. Revealing strengthsand weaknesses of methods for gene network inference.
Proc Natl Acad Sci U S A , 107(14):6286–91,2010. 1337] D. Marbach, T. Schaffter, C. Mattiussi, and D. Floreano. Generating realistic in silico gene networks forperformance assessment of reverse engineering methods.
Journal of Computational Biology , 16(2):229–239, 2009. WingX.[38] A. A. Margolin, I. Nemenman, K. Basso, C. Wiggins, G. Stolovitzky, R. D. Favera, and A. Califano.Aracne: an algorithm for the reconstruction of gene regulatory networks in a mammalian cellularcontext.
BMC bioinformatics , 7(Suppl 1):S7, 2006.[39] V. Matys, O. V. Kel-Margoulis, E. Fricke, I. Liebich, S. Land, A. Barre-Dirrie, I. Reuter, D. Chekmenev,M. Krull, K. Hornischer, N. Voss, P. Stegmaier, B. Lewicki-Potapov, H. Saxel, A. E. Kel, and E. Win-gender. Transfac and its module transcompel: transcriptional gene regulation in eukaryotes.
NucleicAcids Res , 34(Database issue):D108–10, 2006.[40] M. L. Metzker. Sequencing technologies - the next generation.
Nat Rev Genet , 11(1):31–46, 2010.[41] P. E. Meyer, F. Lafitte, and G. Bontempi. minet: A r/bioconductor package for inferring large tran-scriptional networks using mutual information.
BMC Bioinformatics , 9(1):1–10, 2008.[42] G. R. Mishra, M. Suresh, K. Kumaran, et al. Human protein reference database–2006 update.
NucleicAcids Res , 34(Database issue):D411–4, 2006.[43] Y. Nesterov. A method of solving a convex programming problem with convergence rate o (1/ k ). SovietMathematics Doklady , 27(2):372–376, 1983.[44] Y. Nesterov. Smooth minimization of non-smooth functions.
Mathematical programming , 103(1):127–152, 2005.[45] R. Opgen-Rhein and K. Strimmer. From correlation to causation networks: a simple approximatelearning algorithm and its application to high-dimensional plant gene expression data.
BMC Syst Biol ,1:37, 2007.[46] T. Ravasi, H. Suzuki, C. V. Cannistraci, et al. An atlas of combinatorial transcriptional regulation inmouse and man.
Cell , 140(5):744–52, 2010.[47] M. Rebhan, V. Chalifa-Caspi, J. Prilusky, and D. Lancet. Genecards: a novel functional genomicscompendium with automated data mining and query reformulation support.
Bioinformatics , 14(8):656–664, 1998.[48] T. Schaffter, D. Marbach, and D. Floreano. GeneNetWeaver: In silico benchmark generation andperformance profiling of network inference methods.
Bioinformatics , 27(16):2263–2270, 2011. wingx.[49] I. Shmulevich, E. R. Dougherty, S. Kim, and W. Zhang. Probabilistic boolean networks: a rule-baseduncertainty model for gene regulatory networks.
Bioinformatics , 18(2):261–74, 2002.[50] P. Tseng. Convergence of a block coordinate descent method for nondifferentiable minimization.
Journalof optimization theory and applications , 109(3):475–494, 2001.[51] S. Wu and H. Wu. More powerful significant testing for time course gene expression data using functionalprincipal component analysis approaches.
BMC Bioinformatics , 14(1):1–13, 2013.[52] M. K. Yeung, J. Tegner, and J. J. Collins. Reverse engineering gene networks using singular valuedecomposition and robust regression.
Proc Natl Acad Sci U S A , 99(9):6163–8, 2002.[53] P. Zoppoli, S. Morganella, and M. Ceccarelli. Timedelay-aracne: Reverse engineering of gene networksfrom time-course data by an information theoretic approach.
BMC Bioinformatics , 11(1):1–15, 2010.[54] M. Zou and S. D. Conzen. A new dynamic bayesian network (dbn) approach for identifying generegulatory networks from time course microarray data.