Positive and Unlabeled Learning through Negative Selection and Imbalance-aware Classification
Marco Frasca · Nicolò Cesa-Bianchi
Received: 24/01/2019 / Accepted: date
Abstract
Motivated by applications in protein function prediction, we consider a challenging supervised classification setting in which positive labels are scarce and there are no explicit negative labels. The learning algorithm must thus select which unlabeled examples to use as negative training points, possibly ending up with an unbalanced learning problem. We address these issues by proposing an algorithm that combines active learning (for selecting negative examples) with imbalance-aware learning (for mitigating the label imbalance). In our experiments we observe that these two techniques operate synergistically, outperforming state-of-the-art methods on standard protein function prediction benchmarks.
Marco Frasca
Dipartimento di Informatica, Università degli Studi di Milano, Milan, 20135, Italy
Tel.: +39-02-50316295
E-mail: [email protected]

Nicolò Cesa-Bianchi
Dipartimento di Informatica, Università degli Studi di Milano, Milan, 20135, Italy
Tel.: +39-02-50316280
E-mail: [email protected]

Learning from positive and unlabeled data is a classification setting in which classes have no explicit negative labels (see, e.g., Elkan and Noto, 2008). An important real-world instance of this problem is the automated functional prediction (AFP) of proteins (Zhao et al., 2008; Radivojac et al., 2013). Indeed, public repositories for protein functions, such as the Gene Ontology (The Gene Ontology Consortium, 2000), rarely store "negative" annotations of proteins to functions. Another example is disease-gene prioritization (Moreau and Tranchevent, 2012), as only genes involved in the disease etiology are usually recorded; see for instance the Online Mendelian Inheritance in Man (OMIM) database (Amberger et al., 2011). In these domains the lack of a positive annotation for a given class does not necessarily imply that the data point is a true negative for that class. On the contrary, further investigations may result in new annotations being subsequently added to previously unannotated data points.

The absence of explicit negative labels is typically handled by means of a strategy for selecting unlabeled examples that are then used as negative training examples for the class at hand. Since, as noted earlier, not all unlabeled instances are true negatives, the task of learning from positive and unlabeled data is generally harder than standard supervised learning. The scarcity of positive examples, which is widespread in most datasets for this setting, just makes things worse, as the only way to increase the size of the training set is by adding negative examples, which eventually leads to an imbalance between positives and negatives.

In this work we propose a novel approach to learning from positive and unlabeled data based on combining active learning (for the negative selection) with imbalance-aware classification (for mitigating the label imbalance). Unlike traditional active learning, which is typically used to select training examples from a set of unlabeled data, our active learning approach focuses on the selection of negative examples from a set of unlabeled data that are mostly negative for the class under consideration. The intuition is that the ability of active learning to focus on the most informative examples can be exploited to filter out unlabeled data which are not good negative training points, or possibly not even true negative points.
In practice, however, the benefit of selecting negative examples through active learning can be neutralized by the degradation of performance caused by the training imbalance. This can be fixed through the use of an appropriate imbalance-aware learning technique. In our experiments we observe that, when combined, active learning and imbalance-aware training operate synergistically, delivering a significant increase of performance on standard AFP benchmarks. Overall, the proposed framework is composed of two phases:

1. negative training examples are selected through active learning;
2. a supervised classifier, handling the data imbalance, is then learned on the training set containing the available positives and the negatives selected in the previous phase.

We deal with the rarity of positive examples using imbalance-aware learning algorithms, focusing in particular on learners that can be trained both in active and passive learning settings. In the first phase, we incrementally refine an initial model trained on a small seed by running the learner in active mode. This process goes on until a fixed budget of negative instances is reached. In the second phase, a new model is trained by running the learner in passive mode on the training set built in the first phase. A relevant feature of our approach is that, unlike existing approaches (Mostafavi and Morris, 2009; Youngs et al., 2013, 2014), our method can be applied even in settings where no hierarchy of labels is available.

With regard to the AFP application, our method outperforms state-of-the-art baselines in predicting thousands of GO functions for the proteins of
S. cerevisiae and
Homo sapiens organisms in a genome-wide fashion.
Computational methods play a central role in annotating the functions of the large amounts of proteins delivered by high-throughput technologies. Despite the encouraging results achieved by these methods, many functions still have a very low number of verified protein annotations, leading to a pronounced imbalance between annotated and unannotated proteins. Several solutions have been proposed in the last decades for the AFP problem, including those presented in the recent international challenges CAFA1 (Radivojac et al., 2013) and CAFA2 (Jiang et al., 2016), which collected dozens of algorithms proposed by numerous research groups, evaluating them on a common set of target proteins. Such methods can be roughly categorized into three groups: (1) graph-based approaches (Marcotte et al., 1999; Oliver, 2000; Schwikowski et al., 2000; Vazquez et al., 2003; Sharan et al., 2007; Mostafavi et al., 2008; Frasca et al., 2016), attempting to transfer functional evidence in a semi-supervised manner from annotated nodes in the protein graph; (2) multitask and structured output algorithms (Mostafavi and Morris, 2009; Sokolov and Ben-Hur, 2010; Sokolov et al., 2013; Frasca and Cesa-Bianchi, 2017; Feng et al., 2017), predicting multiple GO functions at the same time by taking into account their hierarchical relationships; (3) hierarchical ensemble methods (Obozinski et al., 2008; Guan et al., 2008; Lan et al., 2013; Valentini, 2014; Robinson et al., 2015), recombining post-prediction inferences in order to respect the hierarchical constraints.

Despite the diversity of these approaches, to the best of our knowledge no approach has been proposed yet to simultaneously handle the problems of data imbalance and negative selection. Imbalance-aware techniques for AFP, including cost-sensitive learning, were studied in a few papers, with generally promising results (Bertoni et al., 2011; Cesa-Bianchi and Valentini, 2009; García-López et al., 2013).
A few more works in the same context investigated the problem of negative selection (Mostafavi and Morris, 2009; Youngs et al., 2013, 2014; Frasca et al., 2017), mostly exploiting the GO structure to select negatives.
Preliminaries.
Following the standard encoding of biological data, we use a matrix-based data representation in which instances are represented by a connection matrix W, where the row W_{i,·} is viewed as a feature vector for instance i. More specifically, our data are represented as follows:

1. a network of instances, represented as an undirected weighted graph G = ⟨V, W⟩, where V ≡ {1, . . . , n} is the set of nodes and W is an n × n matrix whose entries W_{ij} ∈ [0, 1] encode some notion of similarity between nodes (with W_{ij} = 0 when nodes i, j are not connected);
2. a labeling for a given class, described by the binary vector y = (y_1, y_2, . . . , y_n) ∈ {0, 1}^n, where y_i = 1 if and only if node i is positive for that class.

Let V+ ≡ {i ∈ V | y_i = 1} and V− ≡ V \ V+ = {i ∈ V | y_i = 0} be the subsets of positive and negative nodes, respectively. Note that we allow the labeling to be highly unbalanced, that is |V+| ≪ |V−|. In addition, the labeling is known only for a subset S ⊂ V of nodes and unknown for the remaining U ≡ V \ S. The problem is to infer the labeling of the nodes in U using the training labels S and the connection matrix W.
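To make the notation concrete, here is a minimal sketch of this representation in Python (toy sizes and values, not the paper's data):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy network with n = 6 nodes; entries of W lie in [0, 1], W is symmetric,
# and W[i, j] = 0 means nodes i and j are not connected (values are made up).
n = 6
A = rng.random((n, n))
W = np.triu(A, k=1)
W = W + W.T                       # symmetric, zero diagonal

# Binary labeling for one class: y[i] = 1 iff node i is a known positive.
y = np.array([1, 0, 0, 1, 0, 0])
V_pos = np.flatnonzero(y == 1)    # V+
V_neg = np.flatnonzero(y == 0)    # V- (nonpositives)

# Labels are known only on S; the task is to label the remaining nodes U.
S = np.array([0, 1, 2, 3])
U = np.setdiff1d(np.arange(n), S)

x0 = W[0, :]                      # the feature vector of node 0 is row 0 of W
```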
Negative Selection.
As mentioned in the introduction, the problem of learning from positive and unlabeled data is hard because the negative examples V− are not well defined, and a nonpositive example might be either a negative or a positive subsequently annotated in light of future studies. This makes the selection of informative negatives among the nonpositive examples a central issue for learning accurate models.

In our setting, considering nonpositive instances as negative implies that some negative labels are noisy, in the sense that they might turn out to be positive in the future. In order to take this issue into account, and in line with previous works (Youngs et al., 2014), the noisy labels are characterized by considering two different temporal releases of the labels for a given class. We denote these labelings with the n-dimensional vectors y and y′, assuming y is the older one. The two releases form a temporal holdout: the labeling y′ can be used to verify the quality of predictions made by models learned using labeling y. This is addressed by the following definitions: for a given class, let V−− = {i ∈ V | y_i = 0 ∧ y′_i = 0} be the set of negative instances whose label did not change, and V−+ = {i ∈ V | y_i = 0 ∧ y′_i = 1} be the set of instances with noisy labels. We go back to this issue in Section 4, where we investigate the behaviour of our models on noisy instances.

3.1 Active learning for negative selection

Let S+ = S ∩ V+ and S− = S ∩ V− be, respectively, the training sets of positive and negative (nonpositive) proteins for a given class. Given a budget 0 < B < |S−|, our goal is to select a subset of B negative examples Ŝ− ⊂ S− in order to maximize the performance of the classifier trained using the examples S+ ∪ Ŝ−.

We address this problem using an active learning (AL) strategy. AL is typically used to save labeling cost by allowing the learner to choose its own training data (Settles, 2012).
In pool-based AL, the learner receives a label budget B and then selects B instances to be labeled from a large set of unlabeled data. A standard pool-based AL strategy is to obtain the label of those points which the current model classifies with least confidence (see, e.g., Tong and Koller, 2001). Our setting is slightly different because we want the AL algorithm to select points to which we assign negative labels. Nevertheless, we may exploit AL to pick out the "most informative" negative points for training our model.

The pseudocode of our negative selection procedure is supplied in Algorithm 1. An initial seed training set I(0) = S+ ∪ S−(0) is defined, where S−(0) ⊂ S− is selected at random and balanced (i.e., |S−(0)| = |S+|). I(0) contains all available positives (since we do not have many of them anyway). The procedure LearnClassifier learns an initial classifier C using the points in I(0). In each iteration t = 0, 1, . . . , a new training set I(t + 1) is built by adding to I(t) the s instances in S− \ I(t) that are predicted with least confidence by the current classifier (procedure LCP). Then, the classifier C is updated (or retrained) using I(t + 1) by the procedure UpdateClassifier. These steps are iterated until |I(t)| = |S+| + B (budget is exhausted). A balanced seed training set is empirically the best performing choice.
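The selection loop just described can be sketched as follows; this is a simplified illustration using scikit-learn's LinearSVC as the learner (the function name and parameters are ours, and `class_weight="balanced"` stands in for the paper's cost-sensitive scheme):

```python
import numpy as np
from sklearn.svm import LinearSVC

def select_negatives(X, y, S_pos, S_neg, B, s, seed=0):
    """Sketch of the active negative-selection template with a linear SVM.

    S_pos / S_neg: indices of positive and nonpositive training nodes;
    B: budget of negatives to select; s: negatives added per iteration.
    """
    rng = np.random.default_rng(seed)
    # Balanced initial seed: as many random nonpositives as positives.
    sel = list(rng.choice(S_neg, size=min(len(S_pos), B), replace=False))
    while len(sel) < B:
        train = np.concatenate([S_pos, sel]).astype(int)
        # Class weighting mitigates the positive/negative imbalance.
        clf = LinearSVC(class_weight="balanced", max_iter=10000)
        clf.fit(X[train], y[train])
        pool = np.setdiff1d(S_neg, sel)
        # Least-confidence selection: smallest absolute margin first.
        margins = np.abs(clf.decision_function(X[pool]))
        k = min(s, B - len(sel))
        sel.extend(pool[np.argsort(margins)[:k]])
    return np.asarray(sel)
```

A passive classifier would then be retrained on the positives plus the returned negatives, as in the second phase of the framework.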
Algorithm 1
Negative examples selection - procedure template
Input:
  G = ⟨V, W⟩: input graph
  S+: set of positive nodes
  S−: set of nonpositive nodes
  ˜S− ⊂ S−: initial negative seed
  s: size of the active learning selection
  B: budget of negatives to be selected
Output: Ŝ− ⊂ S−, with |Ŝ−| = B.

procedure
  t ← 0
  S−(t) ← ˜S−
  I(t) ← S+ ∪ S−(t)
  C ← LearnClassifier(G, I(t))
  while |S−(t)| < B do
    s′ ← min{s, B − |S−(t)|}
    T ← LCP(C, G, S− \ I(t), s′)
    S−(t + 1) ← S−(t) ∪ T
    I(t + 1) ← I(t) ∪ T
    C ← UpdateClassifier(C, G, I(t + 1))
    t ← t + 1
  end while
  Ŝ− ← S−(t)
  return Ŝ−
end procedure

We validate this approach using two popular learning algorithms which have natural AL variants. In the following, L denotes the training set and we define L− = {i ∈ L | y_i = 0} and L+ = {i ∈ L | y_i = 1}.

Given the training instances (x_i, y_i) ∈ ℝ^n × {−1, +1}, the Support Vector Machine (SVM) (Vapnik, 1995) learns the hyperplane ω* ∈ ℝ^n, unique solution of the following optimization problem:

  min_{ω ∈ ℝ^n}  (1/2)‖ω‖² + C Σ_i ξ_i
  s.t.  y_i ω⊤x_i ≥ 1 − ξ_i,  i ∈ L
        ξ_i ≥ 0,  i ∈ L.                    (1)

In our setting x_i = W_{i,·} (the i-th row of W). The margin of instance i is |ω*⊤x_i|. The most uncertain instance is the one with the lowest margin, located closest to the decision hyperplane (Tong and Koller, 2001). Thus, when we implement the AL procedure template with SVM, the instances selected by the procedure LCP are those with the smallest margin.

In order to deal with the scarcity of positives, we use the cost-sensitive SVM of Morik et al. (1999), in which the misclassification cost has been differentiated
between the positive and the negative classes. The corresponding objective function is

  min_{ω ∈ ℝ^n}  (1/2)‖ω‖² + C+ Σ_{i: y_i = 1} ξ_i + C− Σ_{i: y_i = 0} ξ_i
  s.t.  y_i ω⊤x_i ≥ 1 − ξ_i,  i ∈ L
        ξ_i ≥ 0,  i ∈ L.                    (2)

The sum over slack variables in (1) is split into separate sums over positive and negative training instances, with two different misclassification costs C+ and C−. As suggested by the authors, we set C− = 1 and C+ = |L−| / |L+|. In our experiments, we denote by SVM AL the cost-sensitive SVM using active learning for negative selection.

The Random Forests algorithm (RF) (Breiman, 2001) builds an ensemble of classification trees, where each tree is trained on a different bootstrap sample of
N < |L| random instances, with splitting functions at the tree nodes chosen from a random subset of M < n attributes. RF then aggregates tree-level classifications uniformly across trees, computing for each instance i the fraction p_i of trees that output a positive classification.

When we implement the AL procedure template with RF, the instances selected by the procedure LCP are those with the highest entropy H_i, where

  H_i = −p_i log p_i − (1 − p_i) log(1 − p_i)     (3)

and p_i is computed according to the RF model C.

Similarly to (Van Hulse et al., 2007; Khalilia et al., 2011), in this study we use a variant of RF designed to cope with the data imbalance. When RF selects the examples for training a given tree, an instance i ∈ L is usually selected with uniform probability p_i = 1/|L|. Here, instead, we draw positive and negative examples with different probabilities,

  p_i = 1/(2|L+|)  if y_i = 1,
  p_i = 1/(2|L−|)  if y_i = 0.

In this way, the probabilities of extracting a positive or a negative example are both 1/2, and the trees are trained on balanced datasets. In our experiments, we denote by RF AL the balanced RF using active learning for negative selection.

In this section we analyze the empirical performance of our algorithm on predicting protein bio-molecular functions. We start by describing the datasets and the evaluation metrics, and then we move on to assess the effectiveness of the proposed approach through three different experiments: the study of the impact of the parameter s of Algorithm 1 on the final performance; the evaluation of the performance in the temporal holdout setting; and the comparison of our algorithm against an extensive collection of state-of-the-art baselines for predicting protein functions.

Datasets. We consider two organisms: Homo sapiens (human) and
Saccharomyces cerevisiae (yeast). Each dataset consists of a protein network and the corresponding GO annotations. Both networks were retrieved from the STRING database, version 10.5 (Szklarczyk et al., 2015), which already merges many sources of information about proteins. These sources include several databases collecting experimental data, such as BIND, DIP, GRID, HPRD, IntAct, and MINT, or databases collecting curated data, such as Biocarta, BioCyc, KEGG, and Reactome. The connection matrix W is obtained from the STRING connections Ŵ after the symmetry-preserving normalization W = D^{−1/2} Ŵ D^{−1/2}, where D is a diagonal matrix with non-null elements d_ii = Σ_j Ŵ_ij. As suggested by STRING curators, we set the threshold for connection weights to 700. The two networks contain 6391 yeast and 19576 human proteins.

We considered all three GO branches, namely Biological Process (BP), Molecular Function (MF), and Cellular Component (CC). The temporal holdout was formed by considering two different annotation releases: the UniProt GOA releases 69 (9 May 2017) and 52 (December 2015) for yeast, and GOA releases 168 (May 2017) and 151 (December 2015) for human. In both releases, we retained only experimentally validated annotations.

Evaluation framework.
To evaluate the generalization capabilities of our methods, we used 3-fold cross validation (CV) and a temporal holdout evaluation (i.e., old release annotations are used for training while new release annotations are used for testing).
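The temporal holdout can be illustrated with toy label vectors; the arrays `y_old` and `y_new` below are hypothetical stand-ins for the two GOA releases:

```python
import numpy as np

# Toy annotation vectors for one GO term: y_old from the older GOA release
# (used for training), y_new from the newer release (used only for evaluation).
y_old = np.array([1, 0, 0, 0, 1, 0, 0, 0])
y_new = np.array([1, 0, 1, 0, 1, 0, 0, 1])

# V--: nonpositives in both releases; V-+: noisy labels, i.e. nonpositives
# that became positive in the newer release.
V_mm = np.flatnonzero((y_old == 0) & (y_new == 0))
V_mp = np.flatnonzero((y_old == 0) & (y_new == 1))

# Temporal holdout: a model is trained on y_old, and its predictions on the
# instances in V-+ are scored against y_new rather than y_old.
```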
Performance measures.
The performance of our classifiers is measured in terms of precision (P), recall (R), and F-measure (F). Following the recent CAFA2 international challenge (Jiang et al., 2016), we adopted two measures suitable for evaluating instance rankings: the Area Under the Precision-Recall curve (AUPR) and the multiple-label F-measure (Fmax). AUPR is a "per task" measure, more informative in unbalanced settings than the classical area under the ROC curve (Saito and Rehmsmeier, 2015). Fmax provides an "instance-centric" evaluation, assessing performance accuracy across all classes/functions associated with a given instance/protein. More precisely, if we denote by TP_j(t), FP_j(t), and FN_j(t), respectively, the number of true positives, false positives, and false negatives for instance j at threshold t, we can define the "per-instance" multiple-label precision Prec(t) and recall Rec(t) at a given threshold t as:

  Prec(t) = (1/n) Σ_{j=1}^n TP_j(t) / (TP_j(t) + FP_j(t))
  Rec(t)  = (1/n) Σ_{j=1}^n TP_j(t) / (TP_j(t) + FN_j(t))
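A minimal implementation of these protein-centric measures might look as follows; this is our own sketch, and the zero-denominator convention (instances with empty denominators contribute zero) is an assumption, since official CAFA evaluations differ in such details:

```python
import numpy as np

def protein_centric_fmax(scores, Y, thresholds):
    """Per-instance multiple-label F-measure, maximized over thresholds.

    scores, Y: arrays of shape (n_instances, n_classes); Y is binary.
    Precision and recall are averaged over all instances, guarding empty
    denominators with zero (one possible convention).
    """
    best = 0.0
    for t in thresholds:
        pred = scores >= t
        tp = (pred & (Y == 1)).sum(axis=1).astype(float)
        fp = (pred & (Y == 0)).sum(axis=1).astype(float)
        fn = (~pred & (Y == 1)).sum(axis=1).astype(float)
        prec = np.mean(np.divide(tp, tp + fp,
                                 out=np.zeros_like(tp), where=(tp + fp) > 0))
        rec = np.mean(np.divide(tp, tp + fn,
                                out=np.zeros_like(tp), where=(tp + fn) > 0))
        if prec + rec > 0:
            best = max(best, 2 * prec * rec / (prec + rec))
    return best
```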
Fig. 1
F scores averaged across GO terms (a), and average time in seconds to perform a cycle of 3-fold CV (b), while varying the active learning parameter s.

where n is the number of instances. Prec(t) (resp., Rec(t)) is therefore the average multilabel precision (resp., recall) across instances. According to the CAFA2 experimental setting, Fmax is defined as

  Fmax = max_t  2 · Prec(t) · Rec(t) / (Prec(t) + Rec(t)).

4.2 Evaluating the impact of parameter s

We study the trade-off between running time and F values by varying the parameter s (Algorithm 1) while keeping fixed the budget B of negatives to be selected. Due to the large number of settings under consideration, this experiment only involved yeast data and a subset of GO terms. Specifically, we chose the CC terms with exactly 10 annotated proteins in the last release (GOA release 69), for a total of 81 terms. This choice ensures a minimum of information for learning the model, significantly reduces the number of negatives randomly selected to form the initial seed (which is less than 10 in the 3-fold CV), and requires the same number of negatives to be selected through AL (around B −
10) for each term. We chose the CC branch because, compared to the MF and BP branches, it produces the lowest number of terms in the same setting. We set the budget B to 500, as a reasonable proportion of the total number of proteins (different values of B showed a similar trend), while s varies over a fixed grid of values. Values of s lower than 25 increased the computational burden with a negligible impact on the classification performance, whereas s > 150 made the contribution of AL less significant (since B = 500). In RF, after tuning on a small subset of labeled data, we decided to use
200 trees (see Fig. 1).

Although in typical AL applications s is set to 1, these results show that larger values of s do not deteriorate the performance in this setting. Instead, the average F values show almost negligible differences when varying s. This behaviour is in agreement with the results in (Mohamed et al., 2010). On the other hand, lower values of s strongly increase the average running time. Due to the above considerations, in the rest of the paper we set s = 150, as a good compromise between predictive performance and running time.

SVMs largely outperform RFs in terms of F in this setting, and are also much faster. This is due to the sparse implementation exploited by the SVMLight library (Joachims, 1999), as opposed to the RF implementation provided by the Ranger library (Wright and Ziegler, 2017). Both libraries are written in the C language and used within an R wrapper. As a remark, we point out that SVM can be made even faster using other approaches proposed in the literature (see, e.g., Bordes et al., 2005; Tsai et al., 2014); nevertheless, this is not the main focus in our setting, where data size is relatively small and the quality of solutions does not depend on the specific implementation of the model.

Table 1  Average results of SVM and RF methods for predicting GO terms using the old release of annotations. In bold the best results (underlined when statistically significant). For each organism and method, columns report precision (P), recall (R), and F-measure (F).

              Yeast                                  Human
              SVM                 RF                 SVM                 RF
GO  B         P     R     F       P     R     F      P     R     F       P     R     F

PnoSub
CC  all       0.457 0.539 0.480   0.52  0.3   0.332  0.174 0.168 0.158   0.199 0.121 0.115
MF  all       0.353 0.299 0.312   0.435 0.113 0.170  0.33  0.225 0.254   0.298 0.089 0.125
BP  all       0.48  0.447 0.445   0.553 0.235 0.311  0.281 0.168 0.195   0.218 0.046 0.069

PrndSub
CC  450       0.197 0.790 0.310   0.219 0.773 0.329  0.087 0.387 0.133   0.102 0.372 0.141
CC  600       0.21  0.762 0.326   0.254 0.726 0.363  0.093 0.355 0.139   0.095 0.297 0.133
CC  750       0.223 0.754 0.334   0.289 0.685 0.386  0.101 0.361 0.148   0.133 0.31  0.162
MF  450       0.176 0.586 0.263   0.133 0.531 0.216  0.109 0.469 0.169   0.094 0.444 0.148
MF  600       0.195 0.547 0.282   0.172 0.475 0.248  0.143 0.461 0.203   0.115 0.402 0.172
MF  750       0.207 0.521 0.297   0.197 0.421 0.254  0.15  0.442 0.212   0.151 0.378 0.204
BP  450       0.233 0.661 0.335   0.201 0.636 0.295  0.079 0.385 0.121   0.082 0.346 0.114
BP  600       0.261 0.629 0.353   0.259 0.602 0.338  0.088 0.357 0.13    0.111 0.303 0.136
BP  750       0.276 0.603 0.372   0.261 0.598 0.339  0.099 0.356 0.142   0.123 0.27  0.145

AL
CC  450       0.601 0.449 0.619 0.495 0.606 0.505 0.453 0.413 0.283 0.218 0.21

4.3 Assessing the effectiveness of negative selection

The temporal holdout setting is used here both to validate our negative selection procedure and to verify how our algorithm behaves on proteins in V−+. Models are trained using the older release of annotations y, and their predictions are compared with the corresponding labeling y′ in the later release. To this end, the set V−+ should contain enough proteins for the 3-fold CV validation, and, accordingly, we only considered GO terms with |V−+| ≥
5. The final sets of GO terms contain 9, 24, and 99 terms (CC, MF, and BP, respectively) for yeast, and 139, 256, and 707 for human. To equally partition the instances V−+ among folds, and to maintain the stratified setting, each fold is built by uniformly extracting a proportion of instances from V+, V−−, and V−+.

To better assess the benefit of our active learning selection, we also tested two baselines: "passive, no subsampling" (PnoSub), in which we train the passive models (passive variants of the SVM and RF models described in Section 3) on the entire set S, and "passive, random subsampling" (PrndSub), using S+ ∪ S̄− as the training set, where S̄− is uniformly extracted from S−. Furthermore, we also compute the fraction ρ of instances in V−+ ∩ S selected during active learning.

Table 1 reports the results according to the older release. B ∈ {450, 600, 750} was used for both organisms, since we experimentally verified that larger budgets did not yield statistically significant improvements (p-value < 0.05, Wilcoxon signed-rank test; Wilcoxon, 1945).

Interestingly, proteins in V−+ tend to be chosen by our AL selection: with a budget representing around 15% (resp., 5%) of the negatives in the pool, a substantial fraction (resp., ρ > 0.
3) of noisy-label proteins was selected on yeast (resp., human) data. This reveals that the classifier is typically less certain about the classification of noisy-label proteins, which is the desirable behaviour for a negative selection procedure in this setting, since our goal is to select the most informative negatives. Conversely, if we had to build an oracle answering queries for "reliable" negatives, that is, negative proteins which will likely remain unannotated for that function in the future, our strategy would supply the proteins as far as possible from the decision boundary (highest margin) on the negative side. These are precisely the instances whose negative labels the model is most confident about. Consequently, the larger ρ, the more effective and coherent our negative selection procedure.

As a further analysis, we also tested SVMs and RFs using the same negative selection procedure but without any imbalance-aware technique: in this case negative selection alone did not significantly help either of the two methods (results not shown), confirming the benefit of the combined action of negative selection and imbalance-aware learning in this context.

Finally, the evaluation across the temporal holdout of the proteins in V−+ is shown in Table 2, where we report the performance according to the annotations y′ after predicting using y as the functional labeling. Our approach achieves slightly better results than the corresponding passive model (PnoSub) in all branches (except for SVMs on MF yeast data), and largely the top results when considering all the proteins. Indeed, the results on the remaining proteins are the same as in Table 1, since y and y′ differ solely on the instances in V−+. Nevertheless, a clear pattern does not emerge. The F results on noisy-label proteins are clearly limited by the number of positives predicted by the model, since all proteins in V−+ are positive according to the newer release y′. F values thereby tend to decrease when the budget B increases. In this context, the method with the highest recall in
In this context, the method with the highest recall inTable 1 —i.e., PrndSub— wins, although it is not the method which classifies itle Suppressed Due to Excessive Length 11 Branch B P R F P R F P R F P R F
Yeast HumanSVM PnoSub RF PnoSub SVM PnoSub RF PnoSub
CC all 0.154 0.048 0.072 0.083 0.042 0.056 0.139 0.044 0.06 0.056 0.035 0.039MF all 0.746 0.283 0.387 0.077 0.017 0.027 0.271 0.083 0.123 0.104 0.025 0.04BP all 0.278 0.126 0.162 0.117 0.04 0.058 0.038 0.19 0.038 0.052 0.008 0.014
SVM PrndSub RF PrndSub SVM PrndSub RF PrndSub
CC 450 0.917 0.625 0.686 0.803 0.387 0.457 0.556 0.236 0.306 0.583 0.221 0.296CC 600 0.917 0.617 0.673 0.75 0.361 0.445 0.556 0.219 0.294 0.583 0.222 0.297CC 750 0.917 0.577 0.648 0.75 0.284 0.371 0.583 0.253 0.324 0.472 0.19 0.252MF 450 0.464 0.302 0.367 0.462 0.26 0.316 0.792 0.351 0.452 0.729 0.357 0.454MF 600 0.41 0.222 0.28 0.487 0.234 0.302 0.667 0.294 0.381 0.688 0.317 0.414MF 750 0.372 0.162 0.207 0.333 0.183 0.224 0.708 0.298 0.394 0.667 0.298 0.392BP 450 0.527 0.299 0.364 0.574 0.331 0.401 0.556 0.203 0.282 0.569 0.223 0.305BP 600 0.481 0.266 0.323 0.494 0.281 0.342 0.497 0.185 0.254 0.523 0.193 0.268BP 750 0.443 0.231 0.297 0.451 0.24 0.297 0.458 0.158 0.223 0.451 0.148 0.211
SVM AL RF AL SVM AL RF AL
CC 450 0.463 0.173 0.235 0.25 0.088 0.12 0.222 0.066 0.095 0.222 0.056 0.087CC 600 0.5 0.255 0.315 0.167 0.083 0.111 0.111 0.035 0.047 0.056 0.021 0.03CC 750 0.417 0.158 0.218 0.25 0.167 0.194 0.111 0.027 0.041 0.083 0.025 0.037MF 450 0.201 0.077 0.103 0.256 0.095 0.13 0.333 0.109 0.161 0.229 0.065 0.1MF 600 0.154 0.047 0.07 0.154 0.042 0.066 0.312 0.094 0.142 0.146 0.042 0.065MF 750 0.129 0.035 0.063 0.128 0.034 0.053 0.229 0.072 0.107 0.146 0.038 0.059BP 450 0.257 0.129 0.163 0.278 0.132 0.17 0.261 0.062 0.096 0.144 0.033 0.052BP 600 0.247 0.111 0.144 0.228 0.098 0.129 0.157 0.033 0.053 0.092 0.018 0.029BP 750 0.227 0.099 0.121 0.21 0.084 0.113 0.163 0.032 0.052 0.085 0.015 0.025
Table 2
Average results of SVM and RF trained using labels y for predicting proteins in V − + . The predictions are evaluated using the newer release y . Method B SVM RF
PnoSub All 23.40 65.11PrndSub 450 9.76 52.49PrndSub 600 10.22 53.90PrndSub 750 11.57 54.74AL 450 51.54 239.23AL 600 81.40 329.22AL 750 109.54 428.16
Table 3
Average running time in seconds to perform a cross validation cycle on yeast data. best. It is worth noting that the models were not specifically trained to predictthe set V − + ; accordingly, the performance of classifiers just over the noisy labelproteins relies on a side-effect of our learning algorithms. Instead, when evaluatingover all proteins, the improvements of our strategies over PnoSub and PrndSub areconfirmed, or even improved, in the holdout setting with respect to the CV setting. Note that having good performance in the holdout setting is very important, sinceit is a more realistic scenario for AFP.Finally, to have an idea of the scalability of our methods, the average runningtimes in seconds to perform a CV cycle on yeast data using a Linux machine withIntel Xeon(R) CPU 3.60GHz and 32 Gb RAM are reported in Table 3. As expected,the running time of active SVMs increases almost linearly with the number ofselected negatives with respect to the SVM PnoSub. On the other hand, RFs scalepoorly, and are much slower than SVMs while exhibiting a worse performance.This motivates our choice of employing RFs just as a mean to validate the ALnegative selection.4.4 Predicting GO protein functionsWe compared our approach to eight state-of-the-art algorithms for AFP operat-ing in the same flat setting (that is, not including further information from otherfunctions in the hierarchy when a given function is predicted):
Random Walk (RW) and the Random Walk with Restart (RWR) algorithms (Tong et al., 2008); the guilt-by-association (GBA) method (Schwikowski et al., 2000); the Label Propagation algorithm (LP) (Zhu et al., 2003); a method based on Hopfield networks, the Cost-Sensitive Neural Network (COSNet) (Bertoni et al., 2011), designed for unbalanced data and showing competitive results on the MOUSEFUNC benchmark (Frasca et al., 2015); the Multi-Source k-Nearest Neighbors (MS-kNN) (Lan et al., 2013), one of the top-ranked methods in the recent CAFA2 international challenge for AFP; and the RAnking of Nodes with Kernelized Score Functions (RANKS) (Re et al., 2012), recently proposed as a fast and effective ranking algorithm for AFP. Finally, to assess the benefit of the negative selection, we also tested the baseline passive SVM (PnoSub). For COSNet and RANKS we used publicly available R libraries (Frasca and Valentini, 2017; Valentini et al., 2016), whereas for the other methods the code provided by the authors or our own software implementations was utilized. Since COSNet is a binary classifier, to provide a ranking of proteins we followed the approach presented in (Frasca and Pavesi, 2013), which uses the neuron energy at equilibrium as the ranking score. For SVM we used the margin to rank instances. Finally, the parameters required by the benchmarked methods were learned through internal tuning on a subset of the training data.

We selected the GO terms with 10 to 100 annotated proteins in the most recent release, in order to discard terms that are too generic (close to the root nodes), and to ensure the availability of a few positives to train the models in a 3-fold CV setting. The resulting datasets contain 162, 227, and 660 terms for yeast, and 272, 476, and 1689 terms for human, in the GO branches CC, MF, and BP, respectively.

The comparison with the passive SVM (PnoSub) is shown in Fig. 2, where our algorithm shows a considerable increase in both precision and F-measure on yeast data (Fig. 2(a)) when the budget of negatives is large enough. Similar but less marked improvements with respect to passive SVMs are obtained on human data, except for the CC branch, where the improvement in terms of F-measure is not statistically significant (Fig. 2(b)). For the yeast setting, already with B ∈ {450, 600} SVM AL achieves results similar to SVM, but using less than 15% of the available negatives. However, when B = 750 its precision is significantly increased while nearly preserving the same recall.
This results in a statistically significant improvement in the F-measure (p-value < . ). A budget B = 600 suffices to achieve its top performance, which corresponds to around 5% of the available negatives, showing that a large proportion of negatives likely carries redundant information. When B = 750 the results are slightly worse, likely due to overfitting phenomena. On both yeast and human data we also tested budgets larger than 750, which increased the execution time without providing significant performance improvements (results not shown).

Fig. 2 Precision (P), Recall (R) and F values of binary classifiers averaged across GO terms on (a) yeast and (b) human data.

Fig. 3 Performance comparison in terms of AUPR averaged across GO terms on (a) yeast and (b) human data.

The comparison with the state-of-the-art methods for AFP is depicted in Fig. 3 (AUPR) and Fig. 4 (Fmax), where B is set to 750. In terms of AUPR, SVM AL significantly outperforms all the other methods in all the branches (p-value < . ), whereas SVM places between second and fifth, disclosing the relevant contribution supplied by the negative selection procedure based on active learning. On the other side, the GBA algorithm behaves nicely, being the second-best method in all experiments (except for human–MF), whereas RWR is often the third best-performing method. The good performance of GBA and RWR confirms the results shown in (Vascon et al., 2018).

Fig. 4 Fmax results on (a) yeast and (b) human data.

MS-kNN performs worst in all the settings, followed by LP, while COSNet performs halfway between the best and worst performers. To gain further insight into the behaviour of our algorithm, we followed the MOUSEFUNC challenge experimental set-up (Peña-Castillo et al., 2008) and computed the performance averaged across categories containing terms with different degrees of imbalance. Namely, we distinguished two groups: functions with 10–20 positives and functions with 21–100 positives. Functions in the first group are more specific (i.e., closer to leaves in the GO hierarchy) and harder to predict, since less information is available. Hence, on the functions in this group AFP methods typically perform worse (Peña-Castillo et al., 2008; Mostafavi and Morris, 2010; Frasca et al., 2015). Results shown in Fig. 5 in Appendix A confirm this trend, except for the CC experiments, where all methods show lower mean AUPR on the 21–100 group (see Fig. 5(b)–(d), Appendix A). SVM AL instead achieves similar results on both groups. Indeed, the gap between the two top methods is more pronounced in the 10–20 group. This is likely due to the cost-sensitive strategy adopted by our method, which better handles the higher imbalance of the 10–20 group. Such behaviour is preferable and more useful in ontologies like GO, where more specific nodes provide less generic definitions, ensuring a better characterization of protein functions. Finally, on yeast data our algorithm tends to suffer from fewer outliers compared to other methods (Fig. 6, Appendix A), having almost the same median and mean AUPR (red horizontal segment). On human data, possibly due to the higher complexity of this dataset, such behaviour becomes less evident.
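The evaluation quantities used throughout this section (per-term AUPR, Fmax, group-wise averages over the 10–20 and 21–100 groups, and paired significance tests) can be sketched as follows. This is an illustrative reconstruction with toy scores, not the authors' evaluation code: Fmax is computed here per term as the maximum F over all score thresholds, and the Wilcoxon signed-rank test (which appears in the bibliography) is assumed to be the paired test behind the reported p-values.

```python
import numpy as np
from scipy.stats import wilcoxon
from sklearn.metrics import average_precision_score, precision_recall_curve

def term_aupr(y_true, scores):
    """Area under the precision-recall curve for one GO term."""
    return average_precision_score(y_true, scores)

def term_fmax(y_true, scores):
    """Maximum F-measure over all score thresholds for one GO term."""
    prec, rec, _ = precision_recall_curve(y_true, scores)
    with np.errstate(divide="ignore", invalid="ignore"):
        f = 2 * prec * rec / (prec + rec)
    return float(np.nanmax(f))

def group_means(metric_by_term, n_pos_by_term):
    """Average a per-term metric over the 10-20 and 21-100 positive groups."""
    groups = {"10-20": [], "21-100": []}
    for term, value in metric_by_term.items():
        n = n_pos_by_term[term]
        if 10 <= n <= 20:
            groups["10-20"].append(value)
        elif 21 <= n <= 100:
            groups["21-100"].append(value)
    return {g: float(np.mean(v)) for g, v in groups.items() if v}

# Toy per-term labels and scores for a single method on two GO terms.
y = {"GO:a": np.array([1, 1, 0, 0]), "GO:b": np.array([1, 0, 1, 0])}
s = {"GO:a": np.array([.9, .8, .3, .1]), "GO:b": np.array([.9, .8, .3, .1])}
auprs = {t: term_aupr(y[t], s[t]) for t in y}
mean_fmax = np.mean([term_fmax(y[t], s[t]) for t in y])

# Paired Wilcoxon signed-rank test between two methods' per-term AUPRs,
# as assumed for the significance claims in the text (toy numbers).
a = np.array([.80, .75, .90, .60, .70, .85, .65, .78, .82, .88, .73, .69])
b = a - 0.05  # hypothetical second method, consistently 0.05 lower
p_value = wilcoxon(a, b).pvalue
```

Macro-averaging `auprs` (and the per-term Fmax values) over all selected GO terms yields the numbers plotted in Figs. 3–5.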
SVM AL has the best performance also in terms of Fmax in all the considered settings, and especially on the MF terms for human data. SVM is the second top method. LP performs much better than it does in terms of AUPR, placing third in all experiments and confirming the analysis proposed in Vascon et al. (2018). COSNet and RANKS occupy mid-range positions, with a slightly better ranking on yeast data.

Conclusions

We addressed the problem of positive and unlabeled learning using a novel approach that exploits the synergy of imbalance-aware and negative-selection strategies. Our algorithm, based on Support Vector Machines, counterbalances the predominance of unannotated instances via an imbalance-aware technique combined with active learning for choosing negative examples. We showed that active learning helps in both choosing reliable negatives and detecting "noisy" examples (i.e., initially unannotated instances that get annotated in later releases of the dataset). The effectiveness of our approach was experimentally tested on the problem of automatically annotating the proteomes of sequenced organisms, whose proteins have sparse annotations from the GO functional taxonomy and lack a proper definition of negative examples. This experimental validation showed that our tool compares favourably against existing approaches to AFP when predicting the GO functions of yeast and human proteins.
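The interplay of the two components summarized above can be illustrated with a minimal sketch: a class-weighted SVM (standing in for the imbalance-aware step) whose negative training set is grown by an uncertainty-driven active-learning loop. This is a simplified sketch under stated assumptions (linear kernel, margin-based uncertainty, random initial seed of negatives), not the authors' implementation.

```python
import numpy as np
from sklearn.svm import SVC

def active_negative_selection(X, pos_idx, unl_idx, budget, batch=5, seed=0):
    """Grow a negative training set of size `budget` from the unlabeled pool.

    Starting from a small random seed of unlabeled points treated as
    negatives, repeatedly retrain a class-weighted SVM and move into the
    negative set the unlabeled points closest to the decision boundary
    (smallest |margin|), i.e., the most informative candidates."""
    rng = np.random.default_rng(seed)
    pool = list(unl_idx)
    rng.shuffle(pool)
    negatives, pool = pool[:batch], pool[batch:]
    while len(negatives) < budget and pool:
        train = list(pos_idx) + negatives
        labels = np.array([1] * len(list(pos_idx)) + [0] * len(negatives))
        # class_weight='balanced' plays the role of the imbalance-aware step
        svm = SVC(kernel="linear", class_weight="balanced").fit(X[train], labels)
        margins = np.abs(svm.decision_function(X[pool]))
        picked = [pool[i] for i in np.argsort(margins)[:batch]]
        negatives += picked
        pool = [j for j in pool if j not in picked]
    return negatives[:budget]

# Toy usage: 10 positives, 40 unlabeled points, budget of 15 negatives.
rng = np.random.default_rng(1)
X = rng.normal(size=(50, 5))
negs = active_negative_selection(X, range(10), range(10, 50), budget=15)
```

A passive baseline in the spirit of PnoSub would instead take its negatives from the unlabeled pool without the uncertainty ranking; the loop above differs only in how candidates are selected.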
Funding
This work was supported by the grant titled Machine learning algorithms to handle label imbalance in biomedical taxonomies, code PSR2017 DIP 010 MFRAS, Università degli Studi di Milano.
References
J. Amberger, C. Bocchini, and A. Hamosh. A new face and new challenges for Online Mendelian Inheritance in Man (OMIM). Hum. Mutat., 32:564–7, 2011.
A. Bertoni, M. Frasca, and G. Valentini. COSNet: a cost sensitive neural network for semi-supervised learning in graphs. In European Conference on Machine Learning, ECML PKDD 2011, volume 6911 of Lecture Notes in Artificial Intelligence, pages 219–234. Springer, 2011. doi:10.1007/978-3-642-23780-5_24.
Antoine Bordes, Seyda Ertekin, Jason Weston, and Léon Bottou. Fast kernel classifiers with online and active learning. J. Mach. Learn. Res., 6:1579–1619, December 2005. ISSN 1532-4435.
L. Breiman. Random forests. Machine Learning, 45(1):5–32, 2001.
Nicolò Cesa-Bianchi and Giorgio Valentini. Hierarchical cost-sensitive algorithms for genome-wide gene function prediction. In Sašo Džeroski, Pierre Geurts, and Juho Rousu, editors, Proceedings of the Third International Workshop on Machine Learning in Systems Biology, volume 8 of Proceedings of Machine Learning Research, pages 14–29, Ljubljana, Slovenia, 05–06 Sep 2009. PMLR.
Charles Elkan and Keith Noto. Learning classifiers from only positive and unlabeled data. In Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 213–220. ACM, 2008.
Shou Feng, Ping Fu, and Wenbin Zheng. A hierarchical multi-label classification algorithm for gene function prediction. Algorithms, 10:138, 2017.
M. Frasca and N. Cesa-Bianchi. Multitask protein function prediction through task dissimilarity. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 2017. ISSN 1545-5963. In press. doi:10.1109/TCBB.2017.2684127.
M. Frasca, F. Lipreri, and D. Malchiodi. Analysis of informative features for negative selection in protein function prediction. Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), 10209 LNCS:267–276, 2017. doi:10.1007/978-3-319-56154-7_25.
Marco Frasca and Giulio Pavesi. A neural network based algorithm for gene expression prediction from chromatin structure. In IJCNN, pages 1–8. IEEE, 2013. doi:10.1109/IJCNN.2013.6706954.
Marco Frasca and Giorgio Valentini. COSNet: An R package for label prediction in unbalanced biological networks. Neurocomput., 237(C):397–400, May 2017. ISSN 0925-2312. doi:10.1016/j.neucom.2015.11.096.
Marco Frasca, Alberto Bertoni, et al. UNIPred: unbalance-aware Network Integration and Prediction of protein functions. Journal of Computational Biology, 22(12):1057–1074, 2015. doi:10.1089/cmb.2014.0110.
Marco Frasca, Simone Bassis, and Giorgio Valentini. Learning node labels with multi-category Hopfield networks. Neural Computing and Applications, 27(6):1677–1692, 2016. ISSN 1433-3058. doi:10.1007/s00521-015-1965-1.
S. García-López, J. A. Jaramillo-Garzón, L. Duque-Muñoz, and C. G. Castellanos-Domínguez. A methodology for optimizing the cost matrix in cost sensitive learning models applied to prediction of molecular functions in embryophyta plants. In Proceedings of the International Conference on Bioinformatics Models, Methods and Algorithms (BIOSTEC 2013), pages 71–80, 2013. ISBN 978-989-8565-35-8. doi:10.5220/0004250900710080.
Y. Guan, C.L. Myers, D.C. Hess, Z. Barutcuoglu, A. Caudy, and O.G. Troyanskaya. Predicting gene function in a hierarchical context with an ensemble of classifiers. Genome Biology, 9(S2), 2008.
Yuxiang Jiang, Tal Ronnen Oron, et al. An expanded evaluation of protein function prediction methods shows an improvement in accuracy. Genome Biology, 17(1):184, 2016. doi:10.1186/s13059-016-1037-6.
T. Joachims. Making large-scale SVM learning practical. In B. Schölkopf, C. Burges, and A. Smola, editors, Advances in Kernel Methods – Support Vector Learning, chapter 11, pages 169–184. MIT Press, Cambridge, MA, 1999.
Mohammed Khalilia, Sounak Chakraborty, and Mihail Popescu. Predicting disease risks from highly imbalanced data using random forest. BMC Medical Informatics and Decision Making, 11(1):51, Jul 2011. ISSN 1472-6947. doi:10.1186/1472-6947-11-51.
L. Lan, N. Djuric, Y. Guo, and S. Vucetic. MS-kNN: protein function prediction by integrating multiple data sources. BMC Bioinformatics, 14(Suppl 3):S8, 2013.
E.M. Marcotte, M. Pellegrini, M.J. Thompson, T.O. Yeates, and D. Eisenberg. A combined algorithm for genome-wide prediction of protein function. Nature, 402:83–86, 1999.
Thahir P. Mohamed, Jaime G. Carbonell, and Madhavi K. Ganapathiraju. Active learning for human protein-protein interaction prediction. BMC Bioinformatics, 11(1):S57, Jan 2010. ISSN 1471-2105. doi:10.1186/1471-2105-11-S1-S57.
Y. Moreau and L.C. Tranchevent. Computational tools for prioritizing candidate genes: boosting disease gene discovery. Nature Rev. Genet., 13(8):523–536, 2012.
K. Morik, P. Brockhausen, and T. Joachims. Combining statistical learning with a knowledge-based approach – a case study in intensive care monitoring. In International Conference on Machine Learning (ICML), pages 268–277, Bled, Slovenia, 1999.
S. Mostafavi and Q. Morris. Using the Gene Ontology hierarchy when predicting gene function. In Proceedings of the Twenty-Fifth Annual Conference on Uncertainty in Artificial Intelligence (UAI-09), pages 419–427, Corvallis, Oregon, 2009. AUAI Press.
S. Mostafavi and Q. Morris. Fast integration of heterogeneous data sources for predicting gene function with limited annotation. Bioinformatics, 26(14):1759–1765, 2010.
S. Mostafavi, D. Ray, D. Warde-Farley, C. Grouios, and Q. Morris. GeneMANIA: a real-time multiple association network integration algorithm for predicting gene function. Genome Biology, 9(S4), 2008.
G. Obozinski, G. Lanckriet, C. Grant, M. Jordan, and W.S. Noble. Consistent probabilistic output for protein function prediction. Genome Biology, 9(S6), 2008.
S. Oliver. Guilt-by-association goes global. Nature, 403:601–603, 2000.
Lourdes Peña-Castillo, Murat Tasan, Chad L. Myers, Hyunju Lee, Trupti Joshi, Chao Zhang, Yuanfang Guan, Michele Leone, Andrea Pagnani, Wan Kyu Kim, Chase Krumpelman, Weidong Tian, Guillaume Obozinski, Yanjun Qi, Sara Mostafavi, Guan Ning Lin, Gabriel F. Berriz, Francis D. Gibbons, Gert Lanckriet, Jian Qiu, Charles Grant, Zafer Barutcuoglu, David P. Hill, David Warde-Farley, Chris Grouios, Debajyoti Ray, Judith A. Blake, Minghua Deng, Michael I. Jordan, William S. Noble, Quaid Morris, Judith Klein-Seetharaman, Ziv Bar-Joseph, Ting Chen, Fengzhu Sun, Olga G. Troyanskaya, Edward M. Marcotte, Dong Xu, Timothy R. Hughes, and Frederick P. Roth. A critical assessment of Mus musculus gene function prediction using integrated genomic evidence. Genome Biology, 9(1):S2, Jun 2008. doi:10.1186/gb-2008-9-s1-s2.
P. Radivojac et al. A large-scale evaluation of computational protein function prediction. Nature Methods, 10(3):221–227, 2013.
M. Re, M. Mesiti, and G. Valentini. A fast ranking algorithm for predicting gene functions in biomolecular networks. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 9(6):1812–1818, 2012.
Peter N. Robinson, Marco Frasca, Sebastian Köhler, Marco Notaro, Matteo Ré, and Giorgio Valentini. A hierarchical ensemble method for DAG-structured taxonomies. In International Workshop on Multiple Classifier Systems, MCS, pages 15–26, 2015. doi:10.1007/978-3-319-20248-8_2.
Takaya Saito and Marc Rehmsmeier. The precision-recall plot is more informative than the ROC plot when evaluating binary classifiers on imbalanced datasets. PLoS ONE, 10(3):1–21, 03 2015. doi:10.1371/journal.pone.0118432.
B. Schwikowski, P. Uetz, and S. Fields. A network of protein-protein interactions in yeast. Nature Biotechnology, 18(12):1257–1261, December 2000.
Burr Settles. Active learning. Synthesis Lectures on Artificial Intelligence and Machine Learning, 6(1):1–114, 2012.
R. Sharan, I. Ulitsky, and R. Shamir. Network-based prediction of protein function. Mol. Sys. Biol., 8(88), 2007.
A. Sokolov and A. Ben-Hur. Hierarchical classification of Gene Ontology terms using the GOstruct method. Journal of Bioinformatics and Computational Biology, 8(2):357–376, 2010.
A. Sokolov, C. Funk, K. Graim, K. Verspoor, and A. Ben-Hur. Combining heterogeneous data sources for accurate functional annotation of proteins. BMC Bioinformatics, 14(Suppl 3):S10, 2013.
Damian Szklarczyk et al. STRING v10: protein–protein interaction networks, integrated over the tree of life. Nucleic Acids Research, 43(D1):D447–D452, 2015. doi:10.1093/nar/gku1003. URL http://nar.oxfordjournals.org/content/43/D1/D447.abstract.
The Gene Ontology Consortium. Gene Ontology: tool for the unification of biology. Nature Genet., 25:25–29, 2000.
Hanghang Tong, Christos Faloutsos, and Jia-Yu Pan. Random walk with restart: fast solutions and applications. Knowl. Inf. Syst., 14(3):327–346, March 2008. ISSN 0219-1377. doi:10.1007/s10115-007-0094-2. URL http://dx.doi.org/10.1007/s10115-007-0094-2.
Simon Tong and Daphne Koller. Support vector machine active learning with applications to text classification. Journal of Machine Learning Research, 2(Nov):45–66, 2001.
Cheng-Hao Tsai, Chieh-Yen Lin, and Chih-Jen Lin. Incremental and decremental training for linear classification. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '14, pages 343–352, New York, NY, USA, 2014. ACM. ISBN 978-1-4503-2956-9. doi:10.1145/2623330.2623661.
G. Valentini. Hierarchical ensemble methods for protein function prediction. ISRN Bioinformatics, 2014(Article ID 901419):34 pages, 2014.
Giorgio Valentini, Giuliano Armano, Marco Frasca, Jianyi Lin, Marco Mesiti, and Matteo Re. RANKS: a flexible tool for node label ranking and classification in biological networks. Bioinformatics, 32(18):2872–2874, 2016. doi:10.1093/bioinformatics/btw235.
Jason Van Hulse, Taghi M. Khoshgoftaar, and Amri Napolitano. Experimental perspectives on learning from imbalanced data. In Proceedings of the 24th International Conference on Machine Learning, ICML '07, pages 935–942, New York, NY, USA, 2007. ACM. ISBN 978-1-59593-793-3. doi:10.1145/1273496.1273614.
Vladimir N. Vapnik. The Nature of Statistical Learning Theory. Springer-Verlag New York, Inc., New York, NY, USA, 1995. ISBN 0-387-94559-8.
Sebastiano Vascon, Marco Frasca, Rocco Tripodi, Giorgio Valentini, and Marcello Pelillo. Protein function prediction as a graph-transduction game. Pattern Recognition Letters, 2018. ISSN 0167-8655. doi:10.1016/j.patrec.2018.04.002. In press.
A. Vazquez, A. Flammini, A. Maritan, and A. Vespignani. Global protein function prediction from protein-protein interaction networks. Nature Biotechnology, 21:697–700, 2003.
F. Wilcoxon. Individual comparisons by ranking methods. Biometrics, 1:80–83, 1945.
Marvin Wright and Andreas Ziegler. ranger: a fast implementation of random forests for high dimensional data in C++ and R. Journal of Statistical Software, Articles, 77(1):1–17, 2017. doi:10.18637/jss.v077.i01.
N. Youngs, D. Penfold-Brown, K. Drew, D. Shasha, and R. Bonneau. Parametric Bayesian priors and better choice of negative examples improve protein function prediction. Bioinformatics, 29(9):1190–1198, 2013.
Noah Youngs, Duncan Penfold-Brown, Richard Bonneau, and Dennis Shasha. Negative example selection for protein function prediction: the NoGO database. PLOS Computational Biology, 10(6):1–12, 06 2014. doi:10.1371/journal.pcbi.1003644. URL https://doi.org/10.1371/journal.pcbi.1003644.
Xing-Ming Zhao, Yong Wang, Luonan Chen, and Kazuyuki Aihara. Gene function prediction using labeled and unlabeled data. BMC Bioinformatics, 9(1):57, 2008.
X. Zhu et al. Semi-supervised learning with Gaussian fields and harmonic functions. In Proc. of the 20th Int. Conf. on Machine Learning, Washington DC, USA, 2003.
Appendix A

Fig. 5 State-of-the-art comparison in terms of AUPR values averaged across GO terms with 10–20 and 21–100 positives on (a–b) yeast and (c–d) human data. The first column corresponds to the 10–20 terms. The total number of 10–20 (resp. 21–100) terms is 76 (resp. 86), 122 (115), and 334 (326) for the CC, MF, and BP branches, respectively.

Fig. 6 (panels (a)–(f))