Characterizing the structure of protein-protein interaction networks
arXiv [q-bio.MN]
Allan A. Zea
Escuela de Matemática, Facultad de Ciencias, UCV, Caracas, Venezuela
Antonio Rueda-Toicen
Centro de Visualización de Imágenes, Instituto Nacional de Bioingeniería, UCV, Caracas, Venezuela
Physics and Mathematics in Biomedicine Consortium, UCV, Caracas, Venezuela
Algorithmic Nature Group, LABORES for the Natural and Digital Sciences, Paris, France
Abstract.
Network theorists have developed methods to characterize the complex interactions in natural phenomena. The structure of the network of interactions between proteins is important in the field of proteomics, and has been subject to intensive research in recent years, as scientists have become increasingly capable of, and interested in, describing the underlying structure of interactions in both normal and pathological biological processes. In this paper, we survey the graph-theoretic characterization of protein-protein interaction networks (PINs) in terms of structural features, and discuss its possible applications in biomedical research. We also briefly review the classical literature of network theory and discuss modern statistical and computational techniques to describe the structure of PINs.
Key words:
Complex networks, biological network, protein-protein interaction, interactome, network science.
Proteins are responsible for carrying out essential cellular processes such as metabolism, vesicle transport and DNA transcription, among others. However, they rarely act alone: they must normally interact with other proteins in order to perform their functions. These physical interactions are important in several disciplines, since they provide useful information regarding the function and development of certain diseases and health abnormalities [1, 2].

Many databases and public repositories have recently been launched in an attempt to assemble the whole set of interactions between proteins, which has made it necessary to develop ad hoc methodologies both to predict and to model these relationships.

In the following sections we survey modern techniques for characterizing the topology of protein-protein interaction networks and discuss the possible impact of such characterization on current biomedical research.
Protein-protein interactions (PPIs) are physical interactions between any two proteins that regulate the vast majority of biological processes within the cell. Predicting the whole set of PPIs for a given organism is therefore one of the major challenges in the field of proteomics, but the existing procedures for detecting and analyzing these interactions are still quite limited.
Systems biology has developed different approaches for identifying protein interactions [3]. There are two main high-throughput methods to detect PPIs: mass spectrometry (MS) and methods based on yeast two-hybrid screening. The yeast two-hybrid (Y2H) approach allows in vivo detection of protein interactions in yeast cells, and it is known to be more efficient and considerably less expensive than other techniques. This approach relies on the use of a transcription factor (originally Gal4 [4]) that binds a specific upstream activating sequence (UAS) to activate a downstream reporter gene, which then leads to a specific phenotype, such as growth on a selective medium or a color reaction [5]. Nevertheless, although these biological assays have been implemented for various organisms in large-scale experiments [6], their interactomes remain incomplete.
Computational prediction of protein interactions.
The detection of PPIs via high-throughput experimental techniques has slowed down in recent years in spite of the sharp increase in availability of genomic and proteomic data [7], partly because of limitations of the screening methods. For instance, Y2H experiments can produce false positives due to non-specific or “promiscuous” interactions, and can have difficulty detecting very transient ones [5, 7]. A reasonable amount of effort has thus been invested in predicting, curating and validating these uncertain interactions using computational techniques [8], some of which incorporate advanced insights from structural biology and machine learning (see, for example, [9]). A very popular tool for predicting PPIs is the
Struct2Net web server [7], which is currently maintained by the computation and biology group at MIT’s computer science laboratory and is freely available online.

A graph is the pair $g := (\mathcal{V}, \mathcal{E})$ consisting of a vertex set $\mathcal{V}(g)$ and, correspondingly, an edge set $\mathcal{E}(g) \subseteq \{\{v_i, v_j\} \mid v_i, v_j \in \mathcal{V}(g),\ i \neq j\}$. Two vertices $v, w \in \mathcal{V}(g)$ are said to be adjacent if there exists an edge $\{v, w\} \in \mathcal{E}(g)$ connecting them. In the context of PINs, two proteins are related if they establish a specific physical or biochemical interaction. Therefore, using this abstraction, we can represent the proteins in the interactome as vertices of a graph and the protein interactions (detected through Y2H or MS) as edges between them. These graphs or networks can be easily constructed from the existing PPI annotations in most comprehensive repositories for interaction datasets, such as BioGRID [10]. Figure 1 illustrates an example of a protein interaction network. This network was built in Wolfram Mathematica using the high-quality Y2H dataset of the CCSB Interactome Database [6].

Figure 1: Network model for the PPI data of Saccharomyces cerevisiae.

Some basic features regarding the structure of these PINs are shown in Table 1. The second and third columns respectively show the number of proteins in the network ($|\mathcal{V}(g)|$) and the total number of detected interactions between them ($|\mathcal{E}(g)|$). The remaining columns show the global and average clustering coefficients (the last two columns, respectively), which relate to the number of interactions between the neighbors of each protein, and the naive scaling exponent ($\gamma_{\text{naive}}$) that we will briefly describe in the next section.

Table 1: Some global features of the PINs in [6].
Protein interaction dataset | $|\mathcal{V}(g)|$ | $|\mathcal{E}(g)|$ | $\gamma_{\text{naive}}$ | $C$ (global) | $C$ (average)
Saccharomyces cerevisiae
Caenorhabditis elegans
Arabidopsis thaliana
Homo sapiens
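The graph abstraction described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the protein names and the edge list are hypothetical, and the clustering coefficients use the standard definitions (transitivity for the global coefficient, the mean of the local coefficients for the average one), which are assumed here to correspond to the quantities reported in Table 1.

```python
from itertools import combinations


def build_graph(edges):
    """Return an adjacency dict {vertex: set(neighbors)} for an undirected simple graph."""
    adj = {}
    for u, v in edges:
        if u == v:
            continue  # self-interactions are excluded (i != j in the edge-set definition)
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def num_edges(adj):
    """Each undirected edge appears in two neighbor sets."""
    return sum(len(nb) for nb in adj.values()) // 2


def linked_neighbor_pairs(adj, v):
    """Number of pairs of neighbors of v that are themselves adjacent."""
    return sum(1 for a, b in combinations(adj[v], 2) if b in adj[a])


def average_clustering(adj):
    """Mean over all vertices of the local clustering coefficient."""
    total = 0.0
    for v, nb in adj.items():
        k = len(nb)
        if k >= 2:
            total += 2.0 * linked_neighbor_pairs(adj, v) / (k * (k - 1))
    return total / len(adj)


def global_clustering(adj):
    """Transitivity: closed triples divided by all connected triples."""
    closed = sum(linked_neighbor_pairs(adj, v) for v in adj)
    triples = sum(len(nb) * (len(nb) - 1) // 2 for nb in adj.values())
    return closed / triples if triples else 0.0


# Hypothetical interaction list: a triangle A-B-C plus a pendant protein D.
toy_edges = [("A", "B"), ("B", "C"), ("A", "C"), ("C", "D")]
toy_graph = build_graph(toy_edges)
print(len(toy_graph), num_edges(toy_graph))
print(global_clustering(toy_graph), average_clustering(toy_graph))
```

A real dataset would replace `toy_edges` with the pairs of interacting proteins read from a repository export; the parsing step is omitted because the file layout differs between sources.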
THE STRUCTURE OF PROTEIN INTERACTION NETWORKS
3.1 Degree Distribution
An intriguing feature of a network’s structure is its degree distribution. Many classical works have claimed that PINs are scale-free, which is to say that their degree distribution approximately follows a power law [11, 14]. Thus, given some exponent $\gamma$, the probability $P(k)$ that a randomly chosen vertex in the PIN will have $k$ edges satisfies the asymptotic relation

$$P(k) \sim k^{-\gamma}, \tag{1}$$

which causes the degrees in the network to be distributed as shown in Fig. 2a. It is clear from this figure that proteins with few interactions are significantly more frequent than those related to a large number of other proteins. This characterization, to some extent, reflects the disproportionately important role that a small number of proteins play in the cell’s functionality, which is essential, for example, to understand the network’s tolerance to failure and its vulnerability to targeted attack [15].

Figure 2: Protein interaction network of Saccharomyces cerevisiae [6]. (a) Corresponding degree distribution. (b) Fitted distribution in logarithmic scales.
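The empirical distribution $P(k)$ discussed above is simply the fraction of vertices having each degree. A minimal sketch follows, assuming the network is stored as a dictionary mapping each vertex to its set of neighbors; the toy network and its vertex names are hypothetical.

```python
from collections import Counter


def degree_distribution(adj):
    """Return {k: P(k)}, the fraction of vertices whose degree equals k."""
    degrees = [len(neighbors) for neighbors in adj.values()]
    counts = Counter(degrees)
    n = len(degrees)
    return {k: counts[k] / n for k in sorted(counts)}


# Hypothetical network: a hub H linked to four proteins, two of which also interact.
toy_adj = {
    "H": {"p1", "p2", "p3", "p4"},
    "p1": {"H", "p2"},
    "p2": {"H", "p1"},
    "p3": {"H"},
    "p4": {"H"},
}
toy_distribution = degree_distribution(toy_adj)
for k, p in toy_distribution.items():
    print(k, p)
```

Plotting these pairs on logarithmic axes gives the kind of view shown in Fig. 2b.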
Estimating the scaling parameter.
The scaling exponent in Eq. (1) is often used as a measure for quantifying complexity in many real-world networks, and is normally calculated by fitting a line over the degree distribution plotted in logarithmic scales (Fig. 2b), whose slope is, presumably, $-\gamma$. However, this procedure is known to be misleading, since a straight line in a log-log plot is not a sufficient condition for power-law behavior [16]. A more reliable way to estimate this parameter is through the maximum-likelihood estimator (MLE) described in [16, 17]. In general, for discrete data distributions, this MLE is approximately given by

$$\hat{\gamma} \simeq 1 + n \left[ \sum_{i=1}^{n} \ln \frac{k_i}{k_{\min} - \frac{1}{2}} \right]^{-1}, \tag{2}$$

where $k_i$ ($i = 1, \dots, n$) are the observed degrees with $k_i \geq k_{\min}$, $k_{\min}$ is the lower bound for the power-law behavior, and $\hat{\gamma} \to \gamma$ in the limit of large $n$.

Testing the power-law hypothesis.
The work of Stumpf and Ingram [18], in contrast to the claims of previous studies, argued that power-law distributions could not be fitted sufficiently well to much of the available PIN data. Furthermore, they found that the stretched exponential and log-normal distributions are often better fits for this empirical data. In this spirit, Clauset et al. [16] also accompanied their approach with statistics to test whether the power-law hypothesis for the empirical distribution under study is a plausible one.
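The estimator in Eq. (2) is straightforward to evaluate once a lower bound $k_{\min}$ has been chosen. The sketch below only illustrates the formula: the function name and the sample degree sequence are hypothetical, and $k_{\min}$ is taken as given rather than selected from the data. The goodness-of-fit tests mentioned above are not implemented here.

```python
import math


def powerlaw_mle(degrees, k_min=1):
    """Approximate discrete MLE of the scaling exponent, following Eq. (2).

    Only degrees k >= k_min enter the sum; n is the number of such degrees.
    """
    tail = [k for k in degrees if k >= k_min]
    if not tail:
        raise ValueError("no degrees at or above k_min")
    log_sum = sum(math.log(k / (k_min - 0.5)) for k in tail)
    return 1.0 + len(tail) / log_sum


# Hypothetical degree sequence, used only to exercise the function.
sample_degrees = [1, 1, 1, 1, 1, 2, 2, 2, 3, 3, 4, 6, 9, 15]
gamma_hat = powerlaw_mle(sample_degrees, k_min=1)
print(gamma_hat)
```

Because every retained degree satisfies $k_i \geq k_{\min} > k_{\min} - \frac{1}{2}$, each logarithm is strictly positive and the estimate is always greater than 1.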
Let $g := (\mathcal{V}, \mathcal{E})$ be an undirected graph with vertex set $\mathcal{V}(g)$ and edge set $\mathcal{E}(g)$, and let $n = |\mathcal{V}(g)|$. We can represent graph $g$ by an $n \times n$ symmetric matrix $A = (a_{ij})$, called the adjacency matrix, where

$$a_{ij} = \begin{cases} 1 & \text{if } i \text{ and } j \text{ are adjacent}, \\ 0 & \text{otherwise}. \end{cases} \tag{3}$$

Given two vertices $v, w \in \mathcal{V}(g) = \{1, 2, \dots, n\}$, we define the following centrality measures:

Degree centrality.
The degree centrality $C_{\text{deg}}$ of a vertex $v$ is the total number of vertices to which $v$ is connected. Formally, we have a mapping $C_{\text{deg}} : \mathcal{V}(g) \to \mathbb{N}$ such that

$$C_{\text{deg}}(v) = \sum_{j=1}^{n} a_{vj}, \tag{4}$$

where $\mathbb{N}$ is the set of natural numbers and $1 \leq v \leq n$. Broadly speaking, a vertex $v$ is ranked as important by $C_{\text{deg}}$ if it is connected to many other vertices. This is one of the most intuitive, yet still useful, notions of centrality; e.g., the most central vertex in a friendship network would be the person with the greatest number of friends.

Eigenvector centrality.
Eigenvector centrality, $C_{\text{eig}}$, is often referred to as a more sophisticated formulation of degree centrality. Unlike degree centrality, a vertex $v$ is ranked as important by $C_{\text{eig}}$ if it is connected to other vertices which are themselves important. Thus, we have:

$$C_{\text{eig}}(v) = \frac{1}{\lambda} \sum_{w \in N(v)} C_{\text{eig}}(w) = \frac{1}{\lambda} \sum_{w \in \mathcal{V}(g)} a_{vw} \, C_{\text{eig}}(w), \tag{5}$$

where $\lambda$ is a constant and $N(v) = \{w \in \mathcal{V}(g) \mid a_{vw} = 1\}$ is the set of neighbors of vertex $v$. Letting $\mathbf{C} = (C_{\text{eig}}(1), C_{\text{eig}}(2), \dots, C_{\text{eig}}(n))$ denote the vector of centralities, we can write the above equation in matrix form as

$$\lambda \mathbf{C} = A \mathbf{C}, \tag{6}$$

and therefore we see that $\mathbf{C}$ is an eigenvector of the adjacency matrix with eigenvalue $\lambda$. Variants of eigenvector centrality, such as PageRank, are used by search engines like Google to rank websites by relevance, since a website is more likely to be visited if other important websites link to it.

Betweenness centrality.
Another important measurement of centrality in complex networks is betweenness, which is based upon the graph-theoretic notion of path. A path is a sequence of distinct vertices in which each pair of consecutive vertices is joined by an edge, leading from a vertex $v$ to some vertex $w$. Given $u \in \mathcal{V}(g)$ we define its betweenness centrality, $C_{\text{bet}}(u)$, as follows:

$$C_{\text{bet}}(u) = \sum_{v \neq u \neq w \in \mathcal{V}(g)} \frac{\sigma(v, u, w)}{\sigma(v, w)}, \tag{7}$$

where $\sigma(v, w)$ is the total number of shortest paths between vertices $v$ and $w$, and $\sigma(v, u, w)$ is the total number of these paths passing through vertex $u$. Vertices with high betweenness centrality are often called bottlenecks. Betweenness is, in broad terms, a measure of the control these vertices exert over the flow of information within the network.

Identifying and evaluating core proteins for an organism’s interaction network has been one of the major goals of systems biology for several years. As a consequence, plenty of works have focused on the systematic study of centrality in PINs. Yu et al. [19] found that there was a strong correlation between the “bottleneckedness” of vertices in protein interaction networks and protein essentiality, i.e. bottlenecks are more likely to be essential: a fundamental observation for understanding lethality [20] and disease associations in PINs [1].

Robustness is a fundamental issue in the study of complex networks. It concerns the changes that the structure of a networked system undergoes after a portion of its vertices is removed. These changes largely depend on the way degrees are distributed in the system as well as on the vertex removal method, and can provide insight into the network’s capacity to withstand failure after some of its most essential components have been compromised or damaged.
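The three centrality measures defined in Eqs. (4), (5) and (7) can be computed directly from an adjacency structure. The sketch below is a plain-Python illustration under several assumptions: the graph is connected and stored as a dictionary of neighbor sets, eigenvector centrality is obtained by power iteration on $A + I$ (a shift that leaves the eigenvectors of $A$ unchanged and avoids oscillation on bipartite graphs), and betweenness uses Brandes' accumulation scheme with each unordered pair $\{v, w\}$ counted once. None of these choices is taken from the works cited above.

```python
import math
from collections import deque


def degree_centrality(adj):
    """Eq. (4): number of neighbors of each vertex."""
    return {v: len(nb) for v, nb in adj.items()}


def eigenvector_centrality(adj, max_iter=1000, tol=1e-12):
    """Leading eigenvector of A, found by power iteration on A + I."""
    x = {v: 1.0 for v in adj}
    for _ in range(max_iter):
        new = {v: x[v] + sum(x[w] for w in adj[v]) for v in adj}
        norm = math.sqrt(sum(val * val for val in new.values()))
        new = {v: val / norm for v, val in new.items()}
        if max(abs(new[v] - x[v]) for v in adj) < tol:
            return new
        x = new
    return x


def betweenness_centrality(adj):
    """Eq. (7) for unweighted graphs, via breadth-first search from every vertex."""
    bet = {v: 0.0 for v in adj}
    for s in adj:
        order = []                      # vertices in order of non-decreasing distance
        preds = {v: [] for v in adj}    # predecessors on shortest paths from s
        sigma = {v: 0 for v in adj}     # number of shortest paths from s
        sigma[s] = 1
        dist = {s: 0}
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = {v: 0.0 for v in adj}
        for w in reversed(order):       # accumulate dependencies, farthest first
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1.0 + delta[w])
            if w != s:
                bet[w] += delta[w]
    # Every unordered pair was visited from both endpoints, so halve the totals.
    return {v: b / 2.0 for v, b in bet.items()}


# Hypothetical four-protein chain a - b - c - d.
chain = {"a": {"b"}, "b": {"a", "c"}, "c": {"b", "d"}, "d": {"c"}}
deg = degree_centrality(chain)
eig = eigenvector_centrality(chain)
bet = betweenness_centrality(chain)
print(deg, eig, bet)
```

In this chain the two inner vertices lie on every shortest path that crosses the middle of the graph, so they act as the bottlenecks in the sense described above.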
Resilience to centrality attack.
A recent paper by Iyer et al. [15] explored quantitative properties of robustness in complex network topologies and discussed their relationship to the notion of centrality. In particular, they focused on both random and centrality-based vertex removal strategies for degrading various empirical networks with different degree distributions. Their results suggested that, although networks with scale-free topology are tolerant to error (random elimination of vertices, most of which are poorly connected), the effect of removing vertices with high centrality (degree, betweenness and eigenvector) is detrimental for the overall network’s dynamics. However, they also pointed out that the degree distribution as well as the assortative mixing, $r$, are determining parameters in a network’s ability to resist these targeted attacks.

Random networks: a critical comparison.
Random networks have received a great deal of attention in network science. They are a special kind of network whose vertex degrees are more uniformly distributed than in scale-free networks (the degree distribution resembles a Poisson distribution). Figure 3 shows sequential targeted attacks performed on a random network made of 756 vertices and 1685 edges. For this, we calculated the degree centrality of each vertex and, subsequently, removed a percentage of those with the highest $C_{\text{deg}}$.

Figure 3: Sequential targeted attack on a random network. The left-hand side panels display changes in the network’s structure after successful elimination of the vertices with the highest degree centralities (hubs), while the panels on the right show the modifications in their degree distributions. (a) No vertices removed. (b) Corresponding degree distribution. (c) 15% of vertices removed. (d) Modified degree distribution. (e) 25% of vertices removed. (f) Modified degree distribution.

The number of vertices in the largest connected component of this random network decreased from 744 to 611 and 472 after 15% and 25% of the most central vertices were removed, respectively. Furthermore, the degree distribution of this network varied only slightly. This same experiment was repeated for the protein interaction network of Saccharomyces cerevisiae and, after removing 15% of the most central vertices, the size of the largest connected component dropped from 1647 to 17 vertices, which indicates that this PIN is far less robust to targeted attack than the random network.
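An experiment of this kind can be sketched as follows. This is not the procedure used to produce Figure 3, only a minimal illustration under stated assumptions: the random network is drawn uniformly among graphs with 756 vertices and 1685 edges using a fixed, arbitrary seed; the top-ranked vertices are removed in a single step according to their degree in the intact network; and the function names are hypothetical. The sizes it prints therefore need not coincide with the values reported above.

```python
import random
from collections import deque


def random_graph(n, m, seed=0):
    """Undirected simple graph with n vertices and m edges chosen at random."""
    rng = random.Random(seed)
    adj = {v: set() for v in range(n)}
    placed = 0
    while placed < m:
        u, v = rng.sample(range(n), 2)  # two distinct vertices
        if v not in adj[u]:
            adj[u].add(v)
            adj[v].add(u)
            placed += 1
    return adj


def largest_component_size(adj):
    """Number of vertices in the largest connected component (breadth-first search)."""
    seen = set()
    best = 0
    for start in adj:
        if start in seen:
            continue
        seen.add(start)
        queue = deque([start])
        size = 0
        while queue:
            v = queue.popleft()
            size += 1
            for w in adj[v]:
                if w not in seen:
                    seen.add(w)
                    queue.append(w)
        best = max(best, size)
    return best


def remove_top_degree(adj, fraction):
    """Delete the given fraction of vertices with the highest degree centrality."""
    ranked = sorted(adj, key=lambda v: len(adj[v]), reverse=True)
    removed = set(ranked[: int(fraction * len(adj))])
    return {v: adj[v] - removed for v in adj if v not in removed}


network = random_graph(756, 1685)
attacked_15 = remove_top_degree(network, 0.15)
attacked_25 = remove_top_degree(network, 0.25)
lcc_0 = largest_component_size(network)
lcc_15 = largest_component_size(attacked_15)
lcc_25 = largest_component_size(attacked_25)
print(lcc_0, lcc_15, lcc_25)
```

Running the same three calls on a PIN loaded as a dictionary of neighbor sets allows a side-by-side comparison of how quickly the largest connected component shrinks in each topology.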
Interactome networks have been of interest in various disciplines because they carry substantial information about the development of human disease [1]. Predicting protein functionality and choosing the best targets for drug therapy are some common applications of the topological characterization of PINs. For instance, in a recent publication (2015), Azevedo et al. [21] built and analyzed protein interaction networks related to chemoresistance to temozolomide (TMZ), a commonly used alkylating agent for brain cancer treatment. Their in silico topological experiments proved to be an efficient framework for targeting chemoresistance in cancer therapy.
We have presented an overview of the topological characterization of complex protein-protein interaction networks. Studying the degree distribution and its influence on percolation processes is crucial for understanding and predicting functional modules in many real-world networks. Additionally, we provided critical comparisons between the robustness of random and scale-free networks based upon state-of-the-art algorithms from network theory that have spanned a wide array of emerging subfields in biology and medicine.

In recent years, computational techniques for analyzing the structure of networked systems have become relevant in many branches of biology and have spurred the creation of new tools that will hopefully serve as a starting point for future developments in the biomedical sciences.
Acknowledgements
We wish to thank all the people at Instituto Nacional de Bioingeniería for great discussions and the organizing committee members for their support at all stages of the research project. We would also like to thank the reviewers for helpful comments regarding the preparation of this manuscript.
REFERENCES
[1] Vidal, M., Cusick, M. E., & Barabási, A.-L., Interactome Networks and Human Disease. Cell, vol. 144, n. 6, pp. 986-998, 2011.
[2] Barabási, A.-L., Gulbahce, N., & Loscalzo, J., Network medicine: a network-based approach to human disease. Nature Reviews Genetics, vol. 12, n. 1, pp. 56-68, 2011.
[3] Rao, V. S., Srinivas, K., Sujini, G. N., & Kumar, G. N., Protein-Protein Interaction Detection: Methods and Analysis. International Journal of Proteomics, vol. 2014, Article ID 147648, 2014.
[4] Fields, S., & Song, O., A novel genetic system to detect protein-protein interactions. Nature, vol. 340, n. 6230, pp. 245-246, 1989.
[5] Brückner, A., Polge, C., Lentze, N., Auerbach, D., & Schlattner, U., Yeast Two-Hybrid, a Powerful Tool for Systems Biology. International Journal of Molecular Sciences, vol. 10, n. 6, pp. 2763-2788, 2009.
[6] Center for Cancer and Systems Biology (CCSB), Dana Farber Cancer Institute, Boston MA. CCSB Interactome Database. Available online at: http://interactome.dfci.harvard.edu/.
[7] Singh, R., Park, D., Xu, J., Hosur, R., & Berger, B., Struct2Net: a web service to predict protein-protein interactions using a structure-based approach. Nucleic Acids Research, vol. 38, Web Server, pp. W508-W515, 2010.
[8] Zahiri, J., Bozorgmehr, J. H., & Masoudi-Nejad, A., Computational Prediction of Protein-Protein Interaction Networks: Algorithms and Resources. Current Genomics, vol. 14, n. 6, pp. 397-414, 2013.
[9] Singh, R., Algorithms for the Analysis of Protein Interaction Networks. Ph.D. dissertation, Massachusetts Institute of Technology, 2012.
[10] Stark, C., Breitkreutz, B.-J., Reguly, T., Boucher, L., Breitkreutz, A., & Tyers, M., BioGRID: a general repository for interaction datasets. Nucleic Acids Research, vol. 34, pp. D535-D539, 2006.
[11] Albert, R., & Barabási, A.-L., Statistical Mechanics of Complex Networks. Reviews of Modern Physics, vol. 74, n. 1, pp. 47-97, 2002.
[12] Newman, M. E. J., The structure and function of complex networks. SIAM Review, vol. 45, n. 2, pp. 167-256, 2003.
[13] Costa, L. da F., Rodrigues, F. A., Travieso, G., & Villas Boas, P. R., Characterization of Complex Networks: A Survey of measurements. Advances in Physics, vol. 56, n. 1, pp. 167-242, 2007.
[14] Barabási, A.-L., & Bonabeau, E., Scale-free Networks. Scientific American, vol. 288, n. 5, pp. 60-69, 2003.
[15] Iyer, S., Killingback, T., Sundaram, B., & Wang, Z., Attack Robustness and Centrality of Complex Networks. PloS one, vol. 8, n. 4, pp. e59613, 2013.
[16] Clauset, A., Shalizi, C. R., & Newman, M. E. J., Power-law distributions in empirical data. SIAM Review, vol. 51, n. 4, pp. 661-703, 2009.
[17] Newman, M. E. J., Power laws, Pareto distributions and Zipf’s law. Contemporary Physics, vol. 46, n. 5, pp. 323-351, 2005.
[18] Stumpf, M. P. H., & Ingram, P. J., Probability Models for Degree Distributions of Protein Interaction Networks. Europhysics Letters, vol. 71, n. 1, pp. 152-158, 2005.
[19] Yu, H., Kim, P. M., Sprecher, E., Trifonov, V., & Gerstein, M., The Importance of Bottlenecks in Protein Networks: Correlation with Gene Essentiality and Expression Dynamics. PLoS Computational Biology, vol. 3, n. 4, pp. e59, 2007.
[20] Jeong, H., Mason, S. P., Barabási, A.-L., & Oltvai, Z. N., Lethality and centrality in protein networks. Nature, vol. 411, n. 6833, pp. 41-42, 2001.
[21] Azevedo, H., & Moreira-Filho, C. A., Topological robustness analysis of protein interaction networks reveals key targets for overcoming chemotherapy resistance in glioma.