Two new methods for identifying proteins based on the domain protein complexes and topological properties

Abstract

The recognition of essential proteins not only helps us understand the mechanism of cell operation, but also helps in studying the mechanism of biological evolution. At present, many scholars discover essential proteins according to the topological structure of protein networks and complexes, yet some essential proteins still cannot be recognized. In this paper, we propose two new methods, complex degree centrality (CDC) and complex in-degree and betweenness definition (CIBD), which integrate the local character of protein complexes and topological properties to determine the essentiality of proteins. First, we give the definitions of complex average centrality (CAC) and complex hybrid centrality (CHC), which both describe the properties of protein complexes. Then we propose the new methods CDC and CIBD based on the CAC and CHC definitions. In order to assess these two methods, three Protein-Protein Interaction (PPI) networks of Saccharomyces cerevisiae, DIP, MIPS and YMBD, are used as experimental materials. Experimental results on these networks show that CDC and CIBD can help to improve the precision of predicting essential proteins.


Pengli Lu† and JingJuan Yu
School of Computer and Communication, Lanzhou University of Technology, Lanzhou, 730050, Gansu, P.R. China


Keywords: Protein interaction network; Essential protein; Topology; Protein complex

∗ Supported by the National Natural Science Foundation of China (No. 11361033) and the Natural Science Foundation of Gansu Province (No. 1212RJZA029).
† Corresponding author. E-mail addresses: [email protected] (P. Lu), [email protected] (J. Yu).
arXiv [q-bio.MN]

Protein is one of the main components of human life. An essential protein is defined as a protein whose removal by a knockout mutation leaves the organism unable to survive. Essential proteins are more conserved in biological evolution than non-essential proteins [1]. Essential proteins not only help us understand the growth control system of cells, and hence the mechanism of life, but also help the study of the mechanism of biological evolution [2]. Removing essential proteins can lead to lethality or infertility [3]. Determining the essentiality of proteins is of great significance to research in systems biology, which provides valuable theories and methods for the diagnosis of diseases, drug design, etc. [4]. Therefore, identifying essential proteins is meaningful in biomedicine.

Previous methods for identifying essential proteins mainly relied on biological experiments, including conditional knockouts [5], RNA interference [6], and single gene knockouts [7], coupled with testing the survival ability of the organisms involved. However, these experimental procedures not only consume considerable time and cost, but also require a lot of biological resources. Nowadays, predicting essential proteins from large amounts of biological experimental data by computational theory and methods has become a crucial research direction in bioinformatics.

Jeong H M et al. proposed that the essentiality of proteins is associated with the topological structure of protein interaction networks [8]. Studies in several species, including S. cerevisiae, E. coli, C. elegans and D. melanogaster, have demonstrated that hubs in PPI networks are more likely to be essential proteins [9]. Thus, we investigate how the topological importance of proteins relates to their essentiality. On the basis of the topological characteristics of nodes, many centrality measures have been used to discover essential proteins. Some of them are global network characteristics, like betweenness centrality (BC) [11,38], eigenvector centrality (EC) [19], information centrality (IC) [20] and closeness centrality (CC) [13]. Others are local network features, such as degree centrality (DC) [10,14,15], subgraph centrality (SC) [16], local average connectivity (LAC) [17] and the topology potential-based method (TP) [34]. On the basis of the topological characteristics of edges, there are also some measures, including the edge clustering coefficient (ECC) [35] and the improved node and edge clustering coefficient (INEC) [36]. In recent years, many scholars have worked on identifying essential proteins by combining topology with other protein information. Examples are PeC, which combines edge clustering coefficients with gene expression data correlation coefficients [24]; esPOS, which uses gene expression information and subcellular localization information [21]; SPP, which is based on sub-network partition and prioritization by integrating subcellular localization [12]; and the extended Pareto optimality consensus model (EPOC), which fuses neighborhood closeness centrality and orthology information [39]. GO term information can also be used to predict essential proteins, as in the RSG method [25].

Apart from analyzing the essentiality of proteins from the topological point of view and from protein information, analyzing it from the perspective of protein complexes has become another direction of study. Hart G T et al. found that essentiality is often determined by the protein complexes in which a protein is involved, rather than by the single protein [22]. Li et al. also showed that essential proteins appear more frequently in complexes than in the whole network [21,41]. For example, Luo J W et al. proposed the local interaction density combined with protein complexes (LIDC) for predicting essential proteins [37]. Qin C et al. put forward LBCC, a measure based on both network topology features and protein complexes [18]. Li et al. proposed united complex centrality (UC), which combines the edge clustering coefficient with the frequencies with which proteins appear in complexes [23]. Their experimental results show that these methods perform better than purely topological methods.

Therefore, by combining protein complex information with topological properties, we propose two new methods, complex degree centrality (CDC) and complex in-degree and betweenness definition (CIBD). In order to describe the structural properties of protein complexes, we define CAC and CHC of a node v. Of the two measures we put forward, CDC combines the properties of a node and its neighbors to describe the features of protein complexes, while CIBD is based on the features of protein complexes together with local and global properties of the network.

To assess the quality of the CDC and CIBD methods, we apply them to three datasets of Saccharomyces cerevisiae: DIP, MIPS and YMBD. To evaluate the performance of our proposed methods, we compare them with some existing measures, including DC, BC, LAC, SC, LBCC, EC, SoECC and UC, which are described in the original papers [10], [11], [17], [16], [18], [19], [28] and [23], respectively. In terms of sensitivity, specificity, positive predictive value, negative predictive value, F-measure and accuracy, as well as the "sorting-screening" evaluation, the precision-recall curves and the jackknife methodology, the results show that our two methods are more effective in determining the essentiality of proteins than existing measures.

An undirected simple graph G(V, E) can be used to represent a protein interaction network. Proteins are regarded as the node set V of the network and the interactions between pairs of proteins as the edge set E. The numbers of nodes and edges in a graph G are denoted by |V(G)| and |E(G)|, respectively. The neighbor set of a node v is denoted by N_v, and its size by |N_v|. G[S] denotes the subgraph of G induced by the node set S. The following centralities are needed.

• Betweenness centrality (BC) [11]

$$BC(v) = \sum_{s \neq v \neq t \in V} \frac{\sigma_{st}(v)}{\sigma_{st}} \tag{2.1}$$

where $\sigma_{st}$ denotes the number of shortest paths between s and t, and $\sigma_{st}(v)$ denotes the number of shortest paths from s to t that pass through the node v.

• In-degree centrality of complex (IDC) [21]

$$IDC(v) = \sum_{i \in ComplexSet(v)} IN\text{-}Degree(v)_i \tag{2.2}$$

where ComplexSet(v) denotes the set of protein complexes that contain protein v, and $IN\text{-}Degree(v)_i$ denotes the degree of node v within the i-th protein complex of ComplexSet(v).

• LBCC method [18]

$$LBCC(v) = a \log Den_1(v) + b \log Den_2(v) + c \log IDC(v) + d \log BC(v) \tag{2.3}$$

Specifically,

$$Den_1(v) = \frac{2|E(H)|}{|V(H)|\,(|V(H)| - 1)} \tag{2.4}$$

where H is the induced subgraph $G[N_v \cup \{v\}]$, and

$$Den_2(v) = \frac{2|E(H)|}{|V(H)|\,(|V(H)| - 1)} \tag{2.5}$$

where $M_u = \bigcup_{u \in N_v} N_u$ and H is the induced subgraph $G[M_u \cup N_v \cup \{v\}]$.

2.3 New Centrality: CDC and CIBD

The basic considerations behind CDC and CIBD are as follows: (1) essential proteins appear in complexes more frequently; (2) both a node itself and its neighbors affect its essentiality; (3) the global topology is also a factor in locating essential proteins. Consequently, we present two new definitions to judge the essentiality of proteins by combining the domain features of protein complexes and the topological properties.

First, we present a new complex average centrality (CAC) for the neighbors of a node v,

$$CAC(v) = \frac{\sum_{u \in N_v} IDC(u)}{|N_v|} \tag{2.6}$$

where $\sum_{u \in N_v} IDC(u)$ is the total IDC value over all the neighbors of v. The IDC centrality is defined in Eq. (2.2).

Then, we propose the complex hybrid centrality (CHC) by combining the number of complexes that contain a node v with CAC,

$$CHC(v) = N_{complex}(v) \cdot CAC(v) \cdot IDC(v) \tag{2.7}$$

where $N_{complex}(v)$ denotes the total number of complexes that contain node v.

Now, based on the two definitions above, we propose two new methods for estimating the essentiality of a node v. One is complex degree centrality (CDC), which combines the node with its neighbors to describe the properties of protein complexes,

$$CDC(v) = a \cdot CAC(v) + b \cdot IDC(v) \tag{2.8}$$

where a and b are parameters ranging from 1 to 10. After extensive experiments, CDC gives its best results when a = 1 and b = 4.

The other is complex in-degree and betweenness definition (CIBD), which combines CHC, Den and BC: the structural property of the protein complexes is described by CHC, the local feature by Den, and the global property by BC. Since the values of these measures differ greatly in magnitude, they are rescaled by a logarithmic transformation,

$$CIBD(v) = a \log(CHC(v)) + b \log(Den) + c \log(BC(v)) \tag{2.9}$$

where a, b and c are parameters ranging from 1 to 10. After extensive experiments, CIBD gives its best results when a = 1, b = 3 and c = 1. Here Den is computed by Eq. (2.5), as stated in Table 1.

The CDC and CIBD algorithms are described in Table 1.
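The definitions in Eqs. (2.2) and (2.6)-(2.8) can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names, the toy network and the toy complexes are all hypothetical, and IN-Degree(v)_i is assumed here to mean the number of PPI neighbors of v that also belong to complex i. CIBD is left out because the text does not say how zero values are handled before taking logarithms.

```python
from collections import defaultdict


def build_adjacency(edges):
    """Undirected simple graph as a dict of neighbor sets (self-loops dropped)."""
    adj = defaultdict(set)
    for u, v in edges:
        if u != v:
            adj[u].add(v)
            adj[v].add(u)
    return adj


def idc(adj, complexes):
    """Eq. (2.2): sum, over the complexes containing v, of v's degree inside each.

    Assumption: the degree inside a complex counts the PPI neighbors of v
    that are also members of that complex.
    """
    scores = {v: 0 for v in adj}
    for members in complexes:
        members = set(members)
        for v in members:
            if v in scores:
                scores[v] += len(adj[v] & members)
    return scores


def cac(adj, idc_scores):
    """Eq. (2.6): mean IDC over the neighbors of v (0.0 for an isolated node)."""
    return {
        v: sum(idc_scores[u] for u in nbrs) / len(nbrs) if nbrs else 0.0
        for v, nbrs in adj.items()
    }


def chc(adj, complexes, idc_scores, cac_scores):
    """Eq. (2.7): N_complex(v) * CAC(v) * IDC(v)."""
    n_complex = {v: 0 for v in adj}
    for members in complexes:
        for v in set(members):
            if v in n_complex:
                n_complex[v] += 1
    return {v: n_complex[v] * cac_scores[v] * idc_scores[v] for v in adj}


def cdc(idc_scores, cac_scores, a=1, b=4):
    """Eq. (2.8) with the parameter values a = 1, b = 4 reported in the text."""
    return {v: a * cac_scores[v] + b * idc_scores[v] for v in idc_scores}


# Hypothetical toy network: a triangle A-B-C plus a pendant node D attached to C.
adj = build_adjacency([("A", "B"), ("A", "C"), ("B", "C"), ("C", "D")])
complexes = [{"A", "B", "C"}, {"C", "D"}]

idc_scores = idc(adj, complexes)
cac_scores = cac(adj, idc_scores)
chc_scores = chc(adj, complexes, idc_scores, cac_scores)
cdc_scores = cdc(idc_scores, cac_scores)

# Proteins sorted by CDC in descending order; C ranks first in this toy example.
ranking = sorted(cdc_scores, key=cdc_scores.get, reverse=True)
print(ranking)
```

Keeping IDC, CAC and CHC as separate dictionaries mirrors the order of the definitions, so each intermediate quantity can be inspected on its own before the final ranking.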

Table 1: Description of CDC and CIBD algorithms

Input: Undirected graph G = (V(G), E(G)) stands for a PPI network; C = {C_i = (V(C_i), E(C_i)) | C_i ⊂ G} represents the complexes
Output: The protein lists sorted by CDC and CIBD in descending order

For each vertex v ∈ V(G) do IDC(v) = 0
    For each C_i ∈ C do
        calculate IDC(v) = IDC(v) + IN-Degree(v)_i    // IN-Degree(v)_i is the value of DC(v) in the i-th complex
For each vertex v ∈ V(G) do
    Find the neighbor nodes N_v of node v    // N_v stands for the neighbor node set of node v
    calculate CAC(v) by Eq. (2.6)
    For each vertex u ∈ N_v do
        Find the neighbor nodes N_u of node u    // N_u stands for the neighbor node set of node u, where u ∈ N_v
    calculate Den by Eq. (2.5)
For each vertex v ∈ V(G) do
    calculate CHC(v) by Eq. (2.7)
    calculate and sort CDC(v) by Eq. (2.8)
    calculate and sort CIBD(v) by Eq. (2.9)

In order to analyze the performance of the CDC and CIBD algorithms, experiments are conducted using the protein interaction data of Saccharomyces cerevisiae, because its protein data are more complete.

Three sets of PPI network data, YDIP, YMIPS and YMBD, are used. The DIP dataset is marked as the YDIP network [26]; the MIPS dataset is marked as the YMIPS network [25]; the YMBD network comes from the Mark Gerstein Lab website. As data preprocessing, all self-interactions and repeated interactions are removed from these PPI networks. Specific properties of the three networks are presented in Table 2. The YDIP network contains 5093 proteins and 24743 interactions, and its clustering coefficient is about 0.0973. The YMIPS network includes 4546 proteins and 12319 interactions, and its clustering coefficient is about 0.0879. The YMBD network includes 2559 proteins and 11835 interactions, and its clustering coefficient is about 0.4445.

The known essential proteins are derived from four databases: MIPS [40], SGD (Saccharomyces Genome Database) [33], SGDP (Saccharomyces Genome Deletion Project) [4], and DEG (Database of Essential Genes) [27]. The protein complex set is from the CM270 [40], CM425 [29], CYC408 and CYC428 datasets [30,31], which can be obtained from [21], and contains 745 protein complexes (including 2167 proteins).

Table 2: Data details of the three protein networks: YDIP, YMIPS, YMBD

Dataset  Proteins  Interactions  Average degree  Essential proteins  Clustering coefficient
YDIP     5093      24743         9.72            1167                0.0973
YMIPS    4546      12319         5.42            1016                0.0879
YMBD     2559      11835         9.25            763                 0.4445

Proteins are sorted in descending order according to their values of CDC, CIBD and the other eight prediction measures, DC, BC, EC, SC, LAC, LBCC, SoECC and UC. We first select a number of top-ranked proteins as predicted essential proteins and then compare them with the known essential proteins, which gives the number of true essential proteins among them. From this, the sensitivity (SN), specificity (SP), F-measure (F), accuracy (ACC), positive predictive value (PPV) and negative predictive value (NPV) can be calculated [28,29]. The formulas for these six statistical indicators are as follows.

Sensitivity: $SN = \frac{TP}{TP + FN}$

Specificity: $SP = \frac{TN}{TN + FP}$

Positive predictive value: $PPV = \frac{TP}{TP + FP}$

Negative predictive value: $NPV = \frac{TN}{TN + FN}$

F-measure: $F = \frac{2 \cdot SN \cdot PPV}{SN + PPV}$

Accuracy: $ACC = \frac{TP + TN}{P + N}$

where TP is the number of essential proteins correctly selected as essential, FP is the number of nonessential proteins incorrectly selected as essential, TN is the number of nonessential proteins correctly selected as nonessential, and FN is the number of essential proteins incorrectly selected as nonessential. P and N stand for the total numbers of essential and nonessential proteins, respectively.

In this paper, to evaluate the efficiency and accuracy of different indicators in identifying essential proteins, we follow the "sorting-screening" principle, which is described as a flow chart in Fig. 1. We then compare the CDC and CIBD methods with eight previous measures, DC, BC, EC, SC, LAC, LBCC, SoECC and UC, on the three datasets. The LBCC algorithm was implemented according to [18], which used the same datasets as ours. The algorithms DC, BC, EC, SC, LAC, SoECC and UC were implemented according to references [10], [11], [19], [16], [17], [28] and [23], respectively. These algorithms are also available in CytoNCA [42], a Cytoscape app for network centrality. BC and LBCC were described in the Section Previously Proposed Centrality Measures. We now give a brief description of the other six indicators.

Fig. 1 "Sorting-screening" method

• Degree centrality (DC) [10]

$$DC(v) = \deg(v) \tag{4.1}$$

where deg(v) denotes the degree of a node v.

• Local average connectivity centrality (LAC) [17]

$$LAC(v) = \frac{\sum_{u \in N_v} \deg^{C_v}(u)}{|N_v|} \tag{4.2}$$

where $C_v$ is the subgraph of G induced by the node set $N_v$, and $\deg^{C_v}(u)$ is the number of neighbors of u in $C_v$ for a node $u \in N_v$.

• Subgraph centrality (SC) [16]

$$SC(v) = \sum_{k=0}^{\infty} \frac{\mu_k(v)}{k!} \tag{4.3}$$

where $\mu_k(v)$ denotes the number of closed walks of length k that start and end at node v.

• Eigenvector centrality (EC) [19]

$$EC(v) = \alpha_{\max}(v) \tag{4.4}$$

where $\alpha_{\max}$ is the principal eigenvector corresponding to the largest eigenvalue of the network adjacency matrix A, and $\alpha_{\max}(v)$ is the v-th component of $\alpha_{\max}$.

• The sum of edge clustering coefficients (SoECC) [28]

$$ECC_{v,u} = \frac{z_{v,u}}{\min(k_v - 1, k_u - 1)} \tag{4.5}$$

where $z_{v,u}$ is the number of triangles in the network that include the edge e(v, u), and $k_v$ and $k_u$ are the degrees of node v and node u, respectively.

$$SoECC(v) = \sum_{u \in N_v} ECC_{v,u} \tag{4.6}$$

where $N_v$ denotes the set of all neighbors of node v.

• United complex centrality (UC) [23]

$$UC(v) = \sum_{u \in N_v} \left( \frac{f_u + 1}{f_M + 1} \times ECC_{v,u} \right)$$

where $f_u$ denotes the frequency with which protein u appears in the known protein complexes, and $f_M$ is the maximum frequency with which any protein appears in the known protein complexes.

Specifically, we compare CDC with the other eight measures on the YDIP and YMIPS networks, and compare CIBD with the other eight measures on the YMIPS and YMBD networks. In step one, we sort proteins in descending order by their values of CDC, CIBD and the other eight measures. In step two, we select the top 100, 200, 300, 400, 500 and 600 proteins as predicted essential proteins and compare them with the known essential proteins. Finally, we obtain the number of true essential proteins among these predicted essential proteins. The experimental results of these measures are shown in Figs. 2-4.

Fig. 2 The number of true essential proteins determined by CDC and the other eight previous methods on the YDIP network.

Fig. 3 The number of true essential proteins determined by CDC, CIBD and the other eight previous methods on the YMIPS network.

Fig. 4 The number of true essential proteins determined by CIBD and the other eight previous methods on the YMBD network.

From Fig. 2, the numbers of true essential proteins identified by CDC are 79, 152, 221, 272, 316 and 364 for the top 100 to the top 600, respectively, the best among the compared methods on the YDIP network. Besides CDC, LBCC also performs well, with 74, 135, 204, 261, 307 and 360 essential proteins correctly identified at the same levels. By comparison, CDC identifies 5, 17, 17, 11, 9 and 4 more true essential proteins, respectively. Compared with the other recent methods SoECC and UC, CDC also shows a clear improvement. Moreover, it identifies many more essential proteins than earlier methods including BC, SC and EC. Although LAC performs well, CDC still gives better results.

From Fig. 3, we can see that CIBD and CDC both perform better than DC, BC, SC, LAC, EC, SoECC and UC on the YMIPS network; the exception is LBCC. LBCC produces the best results at the top 200, 500 and 600. CIBD performs the same as LBCC at the top 100 and 300. At the top 400, both CDC and CIBD perform better than LBCC.

From Fig. 4, CIBD performs close to LBCC, which attains the best performance at the top 100, 200, 400 and 600, while CIBD attains the best performance at the top 300 and 500. We can also see that the classical methods (DC, BC, SC, EC) do not perform well on the YMBD network. Hence, our new methods CDC and CIBD can identify more true essential proteins in most cases.
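The two "sorting-screening" steps above amount to ranking proteins by score and counting known essential proteins among the top k. The sketch below is only an illustration under stated assumptions: `top_k_hits`, the toy scores and the toy essential set are hypothetical names and data, while the two count lists are the YDIP values quoted in the text for CDC and LBCC.

```python
def top_k_hits(scores, essential, ks=(100, 200, 300, 400, 500, 600)):
    """Count known essential proteins among the k top-ranked proteins.

    scores: dict mapping protein -> centrality value (higher is better).
    essential: set of proteins known to be essential.
    """
    ranked = sorted(scores, key=scores.get, reverse=True)
    return {k: sum(1 for p in ranked[:k] if p in essential) for k in ks}


# Hypothetical toy data, just to exercise the function.
toy_scores = {"P1": 0.9, "P2": 0.8, "P3": 0.7, "P4": 0.4, "P5": 0.3, "P6": 0.1}
toy_essential = {"P1", "P3", "P5"}
toy_hits = top_k_hits(toy_scores, toy_essential, ks=(2, 4))

# Counts reported in the text for the YDIP network (top 100 to top 600).
cdc_ydip = [79, 152, 221, 272, 316, 364]
lbcc_ydip = [74, 135, 204, 261, 307, 360]

# Element-wise gain of CDC over LBCC.
# gains == [5, 17, 17, 11, 9, 4], matching the increments stated in the text.
gains = [c - l for c, l in zip(cdc_ydip, lbcc_ydip)]
print(toy_hits, gains)
```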

To further evaluate CDC, CIBD and the other eight identification measures, the six statistical indicators mentioned in the Section Assessment methods are used. The sensitivity (SN) measures the ability of a classifier to correctly identify essential proteins; the larger the value, the better the classifier. The specificity (SP) measures the ability of a classifier to correctly identify non-essential proteins. The F-measure (F) is the harmonic mean of precision and sensitivity. The higher the accuracy (ACC), the better the classifier. In short, the values of these six statistics reflect the quality of a measure.

Hence, we sort proteins in descending order by their values under each method, take the top 20 percent of proteins as predicted essential proteins, and regard the remaining 80 percent as candidate nonessential proteins. By comparison with the known essential protein dataset, we obtain the values of TP, TN, FP and FN, from which the six statistics are calculated according to the formulas. The comparisons among CDC, CIBD and the other eight measures on the three networks are shown in Table 3.

For the YDIP network, the six statistics for CDC are higher than those of the other measures, which shows that CDC has better prediction accuracy, while the values of BC are the lowest, indicating poor performance. For the YMIPS and YMBD networks, the six statistics of CIBD are similar to those of LBCC, which also predicts essential proteins accurately.

Table 3: Comparison of the sensitivity (SN), specificity (SP), positive predictive value (PPV), negative predictive value (NPV), F-measure (F) and accuracy (ACC) of CDC, CIBD and the other eight previous algorithms.

Dataset  Method  SN     SP     PPV    NPV    F      ACC
YDIP     DC      0.363  0.825  0.416  0.789  0.388  0.706
         BC      0.281  0.798  0.354  0.738  0.313  0.652
         LAC     0.408  0.839  0.467  0.804  0.435  0.729
         SC      0.335  0.811  0.36   0.794  0.347  0.697
         LBCC    0.436  0.853  0.512  0.817  0.477  0.749
         EC      0.344  0.814  0.370  0.796  0.356  0.701
         SoECC   0.40   0.850  0.463  0.813  0.428  0.739
         UC      0.391  0.850  0.458  0.811  0.422  0.737
         CDC     0.448  0.868  0.515  0.835  0.487  …
YMIPS    DC      0.274  0.821  0.305  0.797  0.289  0.699
         BC      0.197  0.796  0.278  0.716  0.231  0.629
         LAC     0.287  0.825  0.321  0.801  0.303  0.705
         SC      0.139  0.782  0.155  0.759  0.146  0.638
         LBCC    0.430  0.866  0.480  0.841  0.454  …
         EC      0.123  0.774  0.155  0.723  0.137  0.610
         SoECC   0.281  0.814  0.325  0.781  0.302  0.686
         UC      0.271  0.812  0.314  0.778  0.291  0.682
         CDC     0.376  0.868  …      …      …      …
         …
YMBD     …
         LBCC    0.373  0.910  0.617  0.789  0.465  …
         EC      0.219  0.851  0.366  0.734  0.274  0.672
         SoECC   0.266  0.835  0.422  0.715  0.326  0.657
         UC      0.274  0.838  0.434  0.718  0.336  0.662
         CIBD    0.347  0.910  0.581  0.777  0.434  …

In addition, the precision-recall curve, a statistical method for evaluating stability, can be used to evaluate the CDC and CIBD methods and the other eight measures. It is defined as follows:

$$Precision(n) = \frac{TP(n)}{TP(n) + FP(n)}, \qquad Recall(n) = \frac{TP(n)}{TP(n) + FN(n)}$$

where TP, FP and FN are defined in the Section Assessment methods. The results are shown in Figs. 5-7. On the YDIP network, CDC performs better than the other methods. On the YMIPS and YMBD networks, the performance of CDC and CIBD is similar to that of LBCC.

Fig. 5 Precision-recall curves of CDC and the other eight methods for the YDIP network.

Fig. 6 Precision-recall curves of CDC, CIBD and the other eight methods for the YMIPS network.

Fig. 7 Precision-recall curves of CIBD and the other eight methods for the YMBD network.

Holman et al. developed the jackknife methodology, which is an effective general prediction method [32]. In a jackknife curve, the X-axis represents the number of top-ranked proteins selected as predicted essential proteins, and the Y-axis represents the number of true essential proteins among the selected proteins. The area under the curve reflects the performance of each method: the larger the area, the better the centrality measure.

First, proteins are sorted in descending order of their predicted values. Then the top 600 proteins are selected as predicted essential proteins for each dataset. Finally, the jackknife curve is drawn from the cumulative number of true essential proteins.
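The three evaluations described above (six statistics at a top-20% cut, precision and recall at n, and the jackknife curve) can all be computed from one ranked list. The sketch below is a minimal illustration, not the authors' code: the function names and the ten-protein toy data are hypothetical, every known essential protein is assumed to occur in the ranked list, and the toy data are chosen so that no denominator is zero.

```python
def six_statistics(ranked, essential, top_fraction=0.2):
    """SN, SP, PPV, NPV, F and ACC when the top fraction is predicted essential."""
    k = int(len(ranked) * top_fraction)
    predicted, rest = set(ranked[:k]), set(ranked[k:])
    tp = len(predicted & essential)   # essential, predicted essential
    fp = len(predicted) - tp          # nonessential, predicted essential
    fn = len(rest & essential)        # essential, predicted nonessential
    tn = len(rest) - fn               # nonessential, predicted nonessential
    sn = tp / (tp + fn)
    ppv = tp / (tp + fp)
    return {
        "SN": sn,
        "SP": tn / (tn + fp),
        "PPV": ppv,
        "NPV": tn / (tn + fn),
        "F": 2 * sn * ppv / (sn + ppv),
        "ACC": (tp + tn) / len(ranked),
    }


def precision_recall(ranked, essential, n):
    """Precision(n) and Recall(n) for the n top-ranked proteins."""
    tp = sum(1 for p in ranked[:n] if p in essential)
    total_essential = sum(1 for p in ranked if p in essential)
    return tp / n, tp / total_essential


def jackknife_curve(ranked, essential, top_n=600):
    """Cumulative count of true essential proteins along the ranking."""
    curve, count = [], 0
    for protein in ranked[:top_n]:
        if protein in essential:
            count += 1
        curve.append(count)
    return curve


# Hypothetical toy ranking: p0 has the highest score, p9 the lowest.
toy_ranked = [f"p{i}" for i in range(10)]
toy_essential = {"p0", "p1", "p4", "p7"}

stats = six_statistics(toy_ranked, toy_essential)
pr_at_5 = precision_recall(toy_ranked, toy_essential, n=5)
curve = jackknife_curve(toy_ranked, toy_essential, top_n=5)
print(stats, pr_at_5, curve)
```

Because the jackknife curve is a running count, the area under it can be compared across methods by summing the returned list.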

Fig. 8 The performances of CDC and the other eight centrality measures on the YDIP network, evaluated by the jackknife methodology.

From Fig. 8, it can be seen that the prediction efficiency of CDC is higher than that of the other centrality measures on the YDIP network. Fig. 9 shows that CDC and CIBD perform similarly to LBCC and better than all the other methods, including DC, BC, LAC, SC, EC, SoECC and UC, on the YMIPS network. Overall, CDC and CIBD are both effective approaches for predicting essential proteins.

Fig. 9 The performances of CDC, CIBD and the other eight centrality measures on the YMIPS network, evaluated by the jackknife methodology.

Fig. 10 The performances of CIBD and the other eight centrality measures on the YMBD network, evaluated by the jackknife methodology.

Identifying essential proteins in protein networks is an indispensable task in the post-genomic era, and improving the recognition rate of essential proteins is challenging. At present, plenty of centrality algorithms have been proposed to determine the essentiality of proteins; most of them focus on analyzing and mining the topological characteristics of nodes. In this paper, by combining the local features of protein complexes with topological properties, we propose two new methods, named CDC and CIBD. We apply them to three datasets, YDIP, YMIPS and YMBD, and compare the number of true essential proteins predicted by CDC, CIBD and eight previously proposed methods: DC, BC, LAC, SC, LBCC, EC, SoECC and UC. The results show that CDC and CIBD perform well in most cases. Using the six statistical indicators, the precision-recall curve and the jackknife methodology, we find that CDC and CIBD can improve the accuracy of predicting essential proteins. In future work, deeper mining of protein biological function and biological significance can be another direction for finding essential proteins.

References

[1] Fraser H B, Hirsh A E, et al., Evolutionary Rate in the Protein Interaction Network, Science, 296(5568):750-752, 2002.
[2] Xu B, Guan J, Wang Y, et al., Essential protein detection by random walk on weighted protein-protein interaction networks, IEEE/ACM Trans Comput Biol Bioinform, PP(99):1-1, 2017.
[3] Winzeler E A, Shoemaker D D, Astromoff A, Liang H, Anderson K, Andre B, et al., Functional characterization of the s. cerevisiae genome by gene deletion and parallel analysis, Science, 285(5429):901-906, 1999.
[4] Wang Y, Sun H, Du W, Blanzieri E, Viero G, Xu Y, et al., Identification of essential proteins based on ranking edge-weights in protein-protein interaction networks, PloS One, 9(9):e108716, 2014.
[5] Roemer T, Jiang B, Davison J, Ketela T, Veillette K, et al., Large-scale essential gene identification in Candida albicans and applications to antifungal drug discovery, Mol Microbiol, 50:167-181, 2003.
[6] Cullen L M, Arndt G M, Genome-wide screening for gene function using RNAi in mammalian cells, Immunol Cell Biol, 83:217-223, 2005.
[7] Giaever G, Chu A M, Ni L, et al., SGD: Functional profiling of the saccharomyces cerevisiae genome, Nature, 418(6896):387-391, 2002.
[8] Jeong H M, Mason S P, Albert B, et al., Lethality and centrality in protein networks, Nature, 411:41-42, 2001.
[9] Zhao B H, Wang J X, Li M, et al., Prediction of Essential Proteins Based on Overlapping Essential Modules, IEEE Transactions on Nanobioscience, 13(4):415-424, 2014.
[10] Hahn M W, Kern A D, Comparative genomics of centrality and essentiality in three eukaryotic protein-interaction networks, Molecular Biology and Evolution, 22(4):803-806, 2005.
[11] Freeman L C, A set of measures of centrality based on betweenness, Sociometry, 40(1):35-41, 1977.
[12] Li M, Li W, Wu F X, et al., Identifying essential proteins based on sub-network partition and prioritization by integrating subcellular localization information, Journal of Theoretical Biology, 2018.
[13] Wuchty S, Stadler P F, Centers of complex networks, Journal of Theoretical Biology, 223(1):45-53, 2003.
[14] Lin C C, Juan H F, Hsiang J T, Hwang Y C, Mori H, Huang H C, Essential core of protein-protein interaction network in escherichia coli, Journal of Proteome Research, 8(4):1925-1931, 2009.
[15] Liang H, Li W H, Gene essentiality, gene duplicability and protein connectivity in human and mouse, Trends in Genetics, 23(8):375-378, 2007.
[16] Estrada E, Juan A, Subgraph centrality in complex networks, Physical Review E, 71(5):1-9, 2005.
[17] Li M, Wang J, Chen X, et al., A local average connectivity-based method for identifying essential proteins from the network level, Computational Biology and Chemistry, 35(3):143-150, 2011.
[18] Qin C, Sun Y, Dong Y, A new method for identifying essential proteins based on network topology properties and protein complexes, PLOS ONE, 11(8):e0161042, 2016.
[19] Bonacich P, Power and centrality: a family of measures, American Journal of Sociology, 92(5):1170-1182, 1987.
[20] Stephenson K, Zelen M, Rethinking centrality: methods and examples, Soc Networks, 11:1-37, 1989.
[21] Zhang Z P, Ruan J S, Gao J Z, et al., Predicting essential proteins from protein-protein interactions using order statistics, Journal of Theoretical Biology, 480:274-283, 2019.
[22] Hart G T, Lee I, Marcotte E M, A high-accuracy consensus map of yeast protein complexes reveals modular nature of gene essentiality, BMC Bioinformatics, 8(1):236-0, 2007.
[23] Li M, Lu Y, Niu Z, et al., United complex centrality for identification of essential proteins from PPI networks, IEEE/ACM Transactions on Computational Biology and Bioinformatics, 14(2):370-380, 2017.
[24] Li M, Zhang H H, Fei Y P, Essential protein discovery method based on integration of PPI and gene expression data, Journal of Central South University, 44(3):1024-1029, 2013.
[25] Lei X, Zhao J, et al., Predicting essential proteins based on RNA-Seq, subcellular localization and GO annotation datasets, Knowledge-Based Systems, 2018.
[26] Xenarios I, Lukasz S, et al., DIP, the database of interacting proteins: a research tool for studying cellular networks of protein interactions, Nucleic Acids Research, 30(1):303-305, 2002.
[27] Zhang R, Lin Y, DEG 5.0, a database of essential genes in both prokaryotes and eukaryotes, Nucleic Acids Res, 37(suppl 1):D455-D458, 2009.
[28] Wang J, Li M, Wang H, Pan Y, Identification of essential proteins based on edge clustering coefficient, Transactions on Computational Biology and Bioinformatics, 9(4):1070-1080, 2012.
[29] Friedel C C, Krumsiek J, Zimmer R, International Conference on Research in Computational Molecular Biology, Springer-Verlag, 2008.
[30] Pu S, Wong J, Turner B, Cho E, Wodak S J, Up-to-date catalogues of yeast protein complexes, Nucleic Acids Research, 37(3):825-831, 2009.
[31] Pu S, Vlasblom J, Emili A, et al., Identifying functional modules in the physical interactome of saccharomyces cerevisiae, Proteomics, 7(6):944-960, 2010.
[32] Holman A G, Davis P J, Foster J M, et al., Computational prediction of essential genes in an unculturable endosymbiotic bacterium, wolbachia of brugia malayi, BMC Microbiology, 9(1):1-14, 2009.
[33] Cherry J M, Adler C, Ball C A, et al., SGD: saccharomyces genome database, Nucleic Acids Research, 26(1):73-79, 1998.

[34] Li M, Lu Y, Wang J, et al., A Topology Potential-Based Method for Identifying Essential Proteins from PPI Networks, IEEE/ACM Transactions on Computational Biology and Bioinformatics, 12(2):372-383, 2015.
[35] Radicchi F, Castellano C, Cecconi F, et al., Defining and identifying communities in networks, Proceedings of the National Academy of Sciences of the United States of America, 101(9):2658-2663, 2003.
[36] Zhu Y, Wu C, Identification of essential proteins using improved node and edge clustering coefficient, Proceedings of the 37th Chinese Control Conference, 2018.
[37] Luo J W, Qi Y, Identification of essential proteins based on a new combination of local interaction density and protein complexes, PLOS ONE, 10(6):e0131418, 2015.
[38] Joy M P, Brock A, Ingber D E, et al., High-betweenness proteins in the yeast protein interaction network, Journal of Biomedicine and Biotechnology, 2005(2):96, 2014.
[39] Li G, Li M, Wang J, et al., United neighborhood closeness centrality and orthology for predicting essential proteins, IEEE/ACM Transactions on Computational Biology and Bioinformatics, 2018.
[40] Mewes H W, Amid C, Arnold R, et al., MIPS: analysis and annotation of proteins from whole genomes, Nucleic Acids Research, 34(Database issue):169-72, 2004.
[41] Pereira-Leal J B, Benjamin A, Peregrin-Alvarez J M, et al., An Exponential Core in the Heart of the Yeast Protein Interaction Network, Molecular Biology and Evolution, 2015.
[42] Tang Y, Li M, Wang J, et al., CytoNCA: A cytoscape plugin for centrality analysis and evaluation of protein interaction networks, Biosystems, 127:67-72, 2015.