Finding New Order in Biological Functions from the Network Structure of Gene Annotations
Kimberly Glass* and Michelle Girvan
Affiliations: Biostatistics and Computational Biology, Dana-Farber Cancer Institute, Boston, MA, USA; Department of Biostatistics, Harvard School of Public Health, Boston, MA, USA; Physics Department, University of Maryland, College Park, MD, USA; Institute for Physical Science and Technology, University of Maryland, College Park, MD, USA; Santa Fe Institute, Santa Fe, NM, USA
* Corresponding author
1. INTRODUCTION
The Gene Ontology (GO) [4, 34] has been around for over a decade, during which time it has been widely utilized both to validate and to predict the results of biological experiments (see, for example, [11, 17, 19, 20, 25, 38, 39]). The structure of the ontology, where different “categories” or terms are related to each other in a hierarchical fashion, provides a well-established format with which to classify and subclassify all biological functions and processes. This classification approach is well-structured and well-characterized; however, we seek to determine if it is the only natural way in which to classify this type of biological information. We address two main questions. First, does there exist another natural way to organize the functional terms that is distinct from the ontological organization? Second, if such an alternate classification exists, can it be used to interpret biological data?

In recent years, complex networks tools have been used alongside traditional bioinformatics techniques to study many different kinds of biological networks [27], including, but not limited to, gene regulatory networks [24, 33], protein-protein interaction networks [18, 36], and metabolic networks [16, 40]. Developments in network theory provide the computational tools needed to calculate global properties of such networks, lending insights into the behavior of the systems represented by these networks. For example, many networks exhibit community structure, meaning that there are clusters of nodes in the network within which edges are relatively dense [13]. Within the field of complex networks, many recent research papers [21, 26, 28, 31] have focused on the development of methods to detect such modules in various types of networks [21, 26] in a computationally efficient and accurate manner [6].
In this study, we leverage the community structure in gene annotation networks to develop an endogenous organization of biological functions.

Ontologies are utilized across many disciplines including economics, artificial intelligence, engineering, library science, and biomedical informatics (for example, [5, 22, 32]). The Gene Ontology, specifically, describes the relationships between different biological concepts or functions [4]. It breaks these concepts into three main domains, or distinct ontologies: “Biological Process” (BP), describing sets of molecular events; “Molecular Function” (MF), describing the activities of gene products; and “Cellular Component” (CC), describing parts of a cell or its external environment. Each of the three primary domains in GO takes the form of a directed acyclic graph (DAG), in which “child” functional categories, or “terms”, are subclassified under one or more “parent” terms. Each parent and all its subsequent progeny therefore define multiple, overlapping sets of terms, or “branches”, in GO. Using GO, genes are annotated to individual terms representing their particular role in a cell, and these annotations are transitive up the relationships in the DAG such that each “parent” term takes on all the gene annotations associated with any of its progeny [35].
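The transitivity of annotations up the DAG can be illustrated with a minimal Python sketch. The function name `propagate_annotations`, the dictionary layout, and the toy term and gene labels are assumptions made for illustration only; they do not reflect the format of the GO annotation files.

```python
from collections import defaultdict


def propagate_annotations(parents, direct):
    """Return term -> set of genes after pushing annotations up to all ancestors.

    parents: dict mapping each term to a list of its parent terms (a DAG).
    direct:  dict mapping each term to the set of genes annotated directly to it.
    """
    full = defaultdict(set)
    for term, genes in direct.items():
        # Walk upward from the annotated term, visiting each ancestor once.
        stack, seen = [term], set()
        while stack:
            current = stack.pop()
            if current in seen:
                continue
            seen.add(current)
            full[current] |= genes
            stack.extend(parents.get(current, []))
    return dict(full)


# Hypothetical toy DAG: "root" has children "A" and "B"; "C" is a child of both.
toy_parents = {"A": ["root"], "B": ["root"], "C": ["A", "B"]}
toy_direct = {"C": {"g1"}, "B": {"g2"}}
toy_full = propagate_annotations(toy_parents, toy_direct)
```

In this toy example the gene annotated to the child term "C" is inherited by both of its parents and by the root, which is why a parent and its progeny define overlapping branches.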
2. METHODS
2.1. Characterizing Gene Ontology Annotations in a Bipartite Graph
In the following analysis we explore whether there exists an alternate, natural way to classify terms that is distinct from the ontology structure. To begin, we use term-term ontology relationships and gene-term annotation information for human genes downloaded from the GO website (geneontology.org) to construct a gene-term bipartite network. We choose to represent this network in the form of an $n_G \times n_T$ adjacency matrix, where $n_G$ is the total number of genes and $n_T$ is the total number of terms listed in the annotation file. In this matrix a value of one indicates a known connection between the corresponding gene and term, and a value of zero indicates that the gene is not associated with that term. Thus,

$$B_{pi} = \begin{cases} 1 & \text{if gene } p \text{ is annotated to term } i \\ 0 & \text{if gene } p \text{ is not annotated to term } i. \end{cases} \qquad (1)$$

This bipartite network represents a summary of the relationships between 18930 genes and 15033 functional terms, derived from many different types of biological evidence and contributed to by multiple laboratories [10]. We note that although GO is broken into three primary domains and gene annotations are made to the ontology for many species, for simplicity in the following analysis we will combine information from all three domains and use only annotation information that pertains to human genes. Domain-specific and comparative species analysis is provided in the Supplemental Material.

FIG. 1: Visual representation of our approach. First, we summarize gene annotations made to functional terms in the Gene Ontology hierarchy as a gene-term bipartite graph. From these gene-term relationships, we project a term-term network. We partition this network into communities and compare those term communities to branches of terms in the DAG. Finally, we perform functional enrichment analysis on experimentally-defined gene sets using both the term communities and GO branches.

Next, we used gene-term annotations to construct a network representing term-term relationships. Using the bipartite network (Equation 1) one could create a term network by simply joining together any pair of terms that share common genes; however, the number of genes annotated to each term has a heavy-tailed distribution (see Supplemental Figure S1(a)), so this approach would lose a large amount of information, as connections between pairs of highly-annotated terms would be given the same weight as connections between pairs of terms with only a few annotations.
We correct for the skewed term degree distribution by constructing a diagonal weighting matrix, $w$, and then projecting a term network $T$, whose edges are modified by this weighting matrix:

$$w_{ij} = \frac{\delta_{ij}}{\sum_{q=1}^{n_G} B_{qi}}, \qquad T = w' B' B w. \qquad (2)$$

The values of $T_{ij}$ take a maximum value of one when terms $i$ and $j$ each have only the same single gene annotation and a minimum value of zero when none of the genes annotated to term $i$ are annotated to term $j$. The use of the weighting matrix emphasizes the weights of network edges between low degree terms. Since these terms represent biological functions performed by only a handful of genes, we believe this weighting is more likely to capture highly-specific shared biological information.

Finally, we seek to identify the community structure in annotation-driven term-term relationships, or clusters of terms within which there are many or high-weight relationships in our projected network (Equation 2), but between which there are only few or low-weight relationships. In order to quantify the strength of community structure we use a quantity known as modularity [28]. Modularity ($Q$) can be defined as:

$$Q = \frac{1}{2m} \sum_{ij} \left[ A_{ij} - \left( 1 + \frac{r}{\langle k \rangle} \right) \frac{k_i k_j}{2m} \right] \delta(x_i, x_j) \qquad (3)$$

where $\delta$ is the Kronecker delta function, $x_i$ is the community of node $i$, $k_i$ is the degree of node $i$, $A$ is the adjacency matrix, a matrix with values representing the weight between nodes $i$ and $j$, and $m$ is the total weight of the edges in the network [3, 26]. Traditionally, in order to partition a network into communities, the resolution parameter, $r$ in Equation 3, is set equal to zero. Varying this value allows one to look for alternate divisions of a network into communities at different scales, or resolutions, with $r >$ [...]
3. RESULTS: AN ALTERNATE “NATURAL” GROUPING OF GO TERMS
3.1. Term Communities and GO Branches Represent Distinct Collections of Biological Functions
To better understand the relationships between the communities found at different resolutions, we visualized the term communities with ten or more members for the six lowest values of resolution used (Figure 2). In this visualization each community is represented by a single circle, whose radius scales as the log of the number of terms belonging to that community and whose color corresponds to the percentage of members from each primary domain that belong to that community. Between the communities found at adjacent resolutions, we draw a line from a community at a higher resolution to a community at a lower resolution if at least 10% of the members of the community from the higher resolution also belong to the community at the lower resolution. The thickness of the line is indicative of the overlap between the two communities. For more details on the visualization process see the Supplemental Material.

The structure of annotation-driven term relationships is distinct from the structure of those relationships as defined by GO branches. This is evidenced clearly by the fact that, although each GO branch can only belong to one primary ontology, and thus would be pure yellow, cyan or magenta in this type of visualization, communities, even smaller ones and those found at higher resolutions, generally contain members from multiple ontologies, resulting in a rainbow of colors. We also observe that communities at higher resolutions do not merely represent the “splitting apart” of communities at lower resolutions (represented by a child community only connecting to a single parent), but instead each resolution often brings about a new way of partitioning the network. An analogous visualization of GO branches reveals a similarly complex partitioning of terms in the GO DAG, albeit segregated by primary domain (see Supplemental Figure S2).

Next we directly compared the membership of the term communities with that of branches in the GO DAG.
In order to quantify the similarity between each community and branch, we calculated the Jaccard similarity, which takes the value $J(x, y) = |x \cap y| / |x \cup y|$. Then, for each community ($x$), we determined the corresponding branch ($y$) that has the highest overlap in membership by this measure: $J_m(x) = \max\{J(x, y) : y \in Y\}$, and vice versa. Because the exact value of the Jaccard similarity is highly sensitive to incremental changes in set membership when comparing sets with only a few members, we will limit all the following analysis to communities and branches that contain ten or more terms in order to focus on the most robust results. Figure 3(a) shows the distribution of $J_m$ comparing these 2370 communities and 2151 branches. Although a handful of communities and branches are quite similar to each other, the majority of communities are dissimilar to the GO branches and vice versa. We have repeated this analysis, constructing the term network and corresponding partitions three more times, using annotations specific to each of the three primary ontologies, and observe similar results (see Supplemental Figure S1(b)-1(d) and Supplemental Table S2).

FIG. 2: Visualization of communities (circles) of GO terms found at the six lowest levels of resolution (rows), in increasing order (top to bottom). The width of the line connecting two communities is proportional to the percentage of terms in the child community that are also in the parent community. The size of communities is proportional to the log of the number of terms in the community. Color represents the normalized percentage of terms in the community which belong to the BP (yellow), MF (cyan) and CC (magenta) primary domains.

To better interpret these values, we selected several communities to inspect more closely. First we selected a community with a very high $J_m$ value to inspect (Figure 3(b)). TC:0007391 is most similar to GO:0070570 (“regulation of neuron projection regeneration”) with $J_m = 0.$[...]
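The maximum Jaccard similarity used to match communities to branches can be sketched in a few lines of Python. The function names and the toy memberships below are hypothetical, and the sketch omits the restriction to sets with ten or more terms.

```python
def jaccard(x, y):
    """Jaccard similarity |x ∩ y| / |x ∪ y| of two collections of terms."""
    x, y = set(x), set(y)
    union = x | y
    return len(x & y) / len(union) if union else 0.0


def max_jaccard(target, candidates):
    """Return (name, J_m) of the candidate set most similar to `target`.

    candidates: dict mapping a name (e.g. a branch ID) to its set of member terms.
    """
    best = max(candidates, key=lambda name: jaccard(target, candidates[name]))
    return best, jaccard(target, candidates[best])


# Hypothetical toy community and branches.
community = {"t1", "t2", "t3", "t4"}
branches = {"b1": {"t1", "t2"}, "b2": {"t2", "t3", "t4", "t5"}}
best_branch, j_m = max_jaccard(community, branches)
```

Swapping the roles of the arguments (one branch as the target and all communities as candidates) gives the branch-side value of $J_m$.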
We have illustrated that our term communities represent a natural partitioning of functional terms that is distinct from the GO DAG; however, the biological meaning of these communities is, at this point, unclear. On a mathematical level they represent sets of biological functions that are generally performed by the same collection of genes. Labeling and understanding the biological meaning behind these communities is vital if they are to have the same wide-ranging applications as the GO branches. Therefore, in order to easily interpret the contents of an individual community we choose to summarize the descriptions of its member terms in the form of a word cloud, coloring each word in the cloud based on the normalized percentage of times the member terms from which it is derived belong to each primary domain, and scaling the size of each word by the statistical enrichment of its frequency in that community (for details see Supplemental Material). We illustrate the biological content of two communities in Figure 4(a)-4(b).

The word clouds illustrate a richness of biological information in term communities. Although Community TC:0000400 (Figure 4(a)) contains 335 members hailing from all three primary domains, the word cloud presentation easily summarizes this information. The individual words are often contained in terms associated with multiple domains, resulting in a complex coloration, but reveal that this community includes biological concepts related to various types of RNA, including “rRNA”, “tRNA”, “mRNA”, “LSU-rRNA”, “SSU-rRNA”, “ncRNA”, “RNA-polymerase” and more. In contrast, TC:0000061 contains many words related to the heart such as “cardiac”, “muscle”, “ventricle”, “ventricular” and “heart” (Figure 4(b)). Neither community is very similar to any particular branch in GO, although they represent similar biological information. TC:0000400 is most similar ($J_m = 0.$[...]) [...]

FIG. 3: A comparison of branches in the GO DAG and term communities found by partitioning the term network. (A) Distribution of $J_m$, the maximum similarity a community or branch with ten or more members has compared to all other branches or communities with ten or more members, respectively. Although a small number of communities and branches have similar memberships, most are highly dissimilar. (B)-(C) Two example comparisons between communities and branches: (B) TC:0007391 compared to GO:0070570, and (C) TC:0000936 compared to GO:0060538. In each panel, the left hand side shows a community and its inter-community connections in the annotation-driven term network, and the right hand side illustrates the branch with which that community has the highest Jaccard similarity. In the right panel edges represent the ontological associations defined by the Gene Ontology term hierarchy. Each term member of the community or branch is colored both by its associated primary domain (inner color - BP:yellow, MF:cyan, CC:magenta) and its community membership (outer color), determined at the same resolution value as the illustrated community. Terms common between each community and branch pair are circled.

Finally, we wanted to test how our communities might be used in one common application of the Gene Ontology: functional enrichment analysis.
To begin, we downloaded a collection of experimentally derived gene sets from the Gene Signatures Database (GeneSigDB) [8]. This database is a manual curation of previously published gene expression signatures, focusing primarily on cancer and stem cell signatures [7], and includes 509 human signatures that contain at least 100 and fewer than 1000 genes annotated in the Gene Ontology. To perform the functional enrichment analysis we use Annotation Enrichment Analysis [14], since it has been shown to better estimate the biological functions of experimentally derived sets of genes and has a conceptual framework conducive to estimating functional enrichment between sets of terms and genes, rather than simply between two sets of genes (for more details see Supplemental Material).

In general, we observe that term communities contain slightly more statistically enriched associations with these experimental signatures than GO branches (which are widely used in functional enrichment analysis), and we verified that this level of enrichment is absent for randomly constructed communities (see Supplemental Figure S3). Knowing that our communities are statistically associated with experimental gene signatures, we next sought to know if there was a context in which our term communities captured biological information from these signatures that is missed by the branches, or vice versa. Thus we selected gene signatures that were significantly enriched ($p <$ [...]) in at least one community/branch but not significantly enriched ($p >$ [...]) in any GO branch/community, respectively. Figure 4(e) shows a heat map of the enrichment values for the nineteen signatures that met these criteria across any community or branch statistically enriched in at least one of those signatures.

It is immediately striking that, of these signatures, the majority are enriched in communities and not GO branches. Two signatures, in particular, are enriched in a collection of communities.
The first, an embryonic stem cell signature [37], represents genes that are up-regulated in cardiomyocytes compared to non-selected embryoid bodies and hESC. The communities represented in this signature contain several different themes, all consistent with the expected properties of genes selected from stem cells and related to the heart. The corresponding clouds emphasize words such as “cardiac” and “muscle” (TC:0000012), “actin”, “myosin”, and “filament” (TC:0000249), “morphogenesis” and “development” (TC:0000365), “blood”, “pressure” and “contraction” (TC:0000582), with the other clouds generally containing these words in different combinations (see for example TC:0000061, illustrated in Figure 4(b)). The second signature is a list of bladder cancer specific genes [29]. Most of the words emphasized by the community clouds are related to cell proliferation. For example, it is enriched in TC:0000400 (illustrated in Figure 4(a)) and the other clouds emphasize words such as “cell-cycle”, “mitotic”, “meiotic”, “checkpoint”, “repair”, “replication”, “recombination”, “telomere”, “spindle”, “complex”, “DNA”, “chromosome”, “histone”, and “methylation”. Although the connection to the bladder is not obvious, the connection to cancer and the high rate of cell proliferation in tumor cells [2] is apparent.
4. CONCLUSION
The network structure of gene annotations made to the Gene Ontology has not previously been exploited in a manner that reveals an organization of biological function unique from the published hierarchical classification of the GO DAG. By analyzing functional annotation data we were able to construct an alternate, natural, and biologically-relevant way in which to categorize cellular functions. This categorization is structurally and conceptually distinct from the GO DAG and allows us to uncover multiple, strong connections between terms that do not share a parent/child relationship. It takes advantage of a large amount of data from a variety of sources and creates a classification scheme that is motivated primarily by the data reported rather than the organization of human conceptions.

The term communities defined in this work represent an integration of information across all three primary domains in GO that, to the authors’ knowledge, has not previously been investigated systematically. Using the simple principle of co-annotation, we suggest that in the future biological concepts from other classification databases could also be analyzed or even combined with these results. We concede that the communities defined here likely do not represent the only way to group functional terms outside of the ontology structure. Even so, we believe that our functional enrichment analysis demonstrates that these term communities, in particular, are more than a mathematical phenomenon and have a high potential to be used to better interpret biological data.

FIG. 4: (A-D) Term Communities (TC:0000400, TC:0000061) and branches (GO:0009607, GO:0050896) summarized as word clouds. In each case the color of a word represents how often the term description containing that word belongs to each of the primary domains (BP:yellow, MF:cyan, CC:magenta, also see Figure 2 for mixed-domain coloration) and size represents that word’s statistical enrichment in that community/branch. (E) A heat map showing the statistical enrichment of selected cancer signatures (see text) in GO branches and term communities.
Panel titles: (a) TC:0000400 WordCloud; (b) TC:0000061 WordCloud; (c) GO:0000003 WordCloud; (d) GO:0002376 WordCloud; (e) Annotation Enrichment Analysis. Heat map color scale: -log p-value. Heat map rows: Ovarian (Ouellet ’05), Bone (Heller ’08), Sarcoma (Lee ’04), Leukemia (Walker ’06), StemCell (Osada ’05), HeadandNeck (Bredel ’06), Sarcoma (Missiaglia ’10), Viral (Urosevic ’07), EmbryonicStemCell (Xu ’09), Bladder (Osman ’06), Sarcoma (Filion ’09), Breast (Creighton ’08), Intestine (Vecchi ’07), Breast (Mazurek ’05), HeadandNeck (Roepman ’06), Lung (Hou ’10), Breast (Mira ’09), Colon (Grade ’06), Leukemia (Calln ’08).
Acknowledgements
We wish to thank Geet Duggal for supplying an implementation of the Fast Greedy Community Structure algorithm that included the resolution parameter.

[2] M Andreeff, Goodrich DW, and Pardee AB. Holland-Frei Cancer Medicine. BC Decker, Hamilton, ON, 5th edition, 2000.
[3] Alex Arenas, Alberto Fernandez, and Sergio Gomez. Analysis of the structure of complex networks at different resolution levels. e-pub: http://arxiv.org/abs/physics/0703218, January 2008. doi: 10.1088/1367-2630/10/5/053039. URL http://dx.doi.org/10.1088/1367-2630/10/5/053039.
[4] M. Ashburner, C. A. Ball, J. A. Blake, D. Botstein, H. Butler, J. M. Cherry, A. P. Davis, K. Dolinski, S. S. Dwight, J. T. Eppig, M. A. Harris, D. P. Hill, L. Issel-Tarver, A. Kasarskis, S. Lewis, J. C. Matese, J. E. Richardson, M. Ringwald, G. M. Rubin, and G. Sherlock. Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nature genetics, 25(1):25–29, May 2000. ISSN 1061-4036. doi: 10.1038/75556. URL http://dx.doi.org/10.1038/75556.
[5] B. Chandrasekaran, John R. Josephson, and V. Richard Benjamins. What are ontologies, and why do we need them? IEEE Intelligent Systems, 14(1):20–26, January 1999. ISSN 1541-1672. doi: 10.1109/5254.747902. URL http://dx.doi.org/10.1109/5254.747902.
[6] Aaron Clauset, M. E. J. Newman, and Cristopher Moore. Finding community structure in very large networks. Physical Review E, 70(6):066111+, December 2004. doi: 10.1103/PhysRevE.70.066111. URL http://dx.doi.org/10.1103/PhysRevE.70.066111.
[7] Aedín C. Culhane, Thomas Schwarzl, Razvan Sultana, Kermshlise C. Picard, Shaita C. Picard, Tim H. Lu, Katherine R. Franklin, Simon J. French, Gerald Papenhausen, Mick Correll, and John Quackenbush. GeneSigDB–a curated database of gene expression signatures. Nucleic acids research, 38(Database issue):D716–D725, January 2010. ISSN 1362-4962. doi: 10.1093/nar/gkp1015. URL http://dx.doi.org/10.1093/nar/gkp1015.
[8] Aedín C. Culhane, Markus S. Schröder, Razvan Sultana, Shaita C. Picard, Enzo N. Martinelli, Caroline Kelly, Benjamin Haibe-Kains, Misha Kapushesky, Anne-Alyssa St Pierre, William Flahive, Kermshlise C. Picard, Daniel Gusenleitner, Gerald Papenhausen, Niall O’Connor, Mick Correll, and John Quackenbush. GeneSigDB: a manually curated database and resource for analysis of gene expression signatures. Nucleic Acids Research, 40(D1):D1060–D1066, January 2012. ISSN 1362-4962. doi: 10.1093/nar/gkr901. URL http://dx.doi.org/10.1093/nar/gkr901.
[9] Leon Danon, Albert Díaz-Guilera, Jordi Duch, and Alex Arenas. Comparing community structure identification. Journal of Statistical Mechanics: Theory and Experiment, 2005(09):P09008, September 2005. ISSN 1742-5468. doi: 10.1088/1742-5468/2005/09/P09008. URL http://dx.doi.org/10.1088/1742-5468/2005/09/P09008.
[10] Emily C. Dimmer, Rachael P. Huntley, Yasmin Alam-Faruque, Tony Sawford, Claire O’Donovan, Maria J. Martin, Benoit Bely, Paul Browne, Wei Mun Chan, Ruth Eberhardt, Michael Gardner, Kati Laiho, Duncan Legge, Michele Magrane, Klemens Pichler, Diego Poggioli, Harminder Sehra, Andrea Auchincloss, Kristian Axelsen, Marie-Claude C. Blatter, Emmanuel Boutet, Silvia Braconi-Quintaje, Lionel Breuza, Alan Bridge, Elizabeth Coudert, Anne Estreicher, Livia Famiglietti, Serenella Ferro-Rojas, Marc Feuermann, Arnaud Gos, Nadine Gruaz-Gumowski, Ursula Hinz, Chantal Hulo, Janet James, Silvia Jimenez, Florence Jungo, Guillaume Keller, Phillippe Lemercier, Damien Lieberherr, Patrick Masson, Madelaine Moinat, Ivo Pedruzzi, Sylvain Poux, Catherine Rivoire, Bernd Roechert, Michael Schneider, Andre Stutz, Shyamala Sundaram, Michael Tognolli, Lydie Bougueleret, Ghislaine Argoud-Puy, Isabelle Cusin, Paula Duek-Roggli, Ioannis Xenarios, and Rolf Apweiler. The UniProt-GO annotation database in 2011. Nucleic acids research, 40(Database issue):D565–D570, January 2012. ISSN 1362-4962. doi: 10.1093/nar/gkr1048. URL http://dx.doi.org/10.1093/nar/gkr1048.
[11] Lude Franke, Harm van Bakel, Like Fokkens, Edwin D. de Jong, Michael Egmont-Petersen, and Cisca Wijmenga. Reconstruction of a functional human gene network, with an application for prioritizing positional candidate genes. American journal of human genetics, 78(6):1011–1025, June 2006. ISSN 0002-9297. doi: 10.1086/504300. URL http://dx.doi.org/10.1086/504300.
[12] H. Funakoshi, J. Frisén, G. Barbany, T. Timmusk, O. Zachrisson, V. M. Verge, and H. Persson. Differential expression of mRNAs for neurotrophins and their receptors after axotomy of the sciatic nerve. The Journal of cell biology, 123(2):455–465, October 1993. ISSN 0021-9525. URL http://view.ncbi.nlm.nih.gov/pubmed/8408225.
[13] M. Girvan and M. E. J. Newman. Community structure in social and biological networks. Proceedings of the National Academy of Sciences, 99(12):7821–7826, June 2002. ISSN 0027-8424. doi: 10.1073/pnas.122653799. URL http://dx.doi.org/10.1073/pnas.122653799.
[14] Kimberly Glass and Michelle Girvan. Annotation enrichment analysis: An alternative method for evaluating the functional properties of gene sets. e-pub: http://arxiv.org/abs/1208.4127, August 2012. URL http://arxiv.org/abs/1208.4127.
[15] Kimberly Glass, Edward Ott, Wolfgang Losert, and Michelle Girvan. Implications of functional similarity for gene regulatory interactions. Journal of the Royal Society, Interface / the Royal Society, February 2012. ISSN 1742-5662. doi: 10.1098/rsif.2011.0585. URL http://dx.doi.org/10.1098/rsif.2011.0585.
[16] Roger Guimera and Luis A. Nunes Amaral. Functional cartography of complex metabolic networks. Nature, 433(7028):895–900, February 2005. ISSN 0028-0836. doi: 10.1038/nature03288. URL http://dx.doi.org/10.1038/nature03288.
[17] Da W. Huang, Brad T. Sherman, Qina Tan, Joseph Kir, David Liu, David Bryant, Yongjian Guo, Robert Stephens, Michael W. Baseler, H. Clifford Lane, and Richard A. Lempicki. DAVID bioinformatics resources: expanded annotation database and novel algorithms to better extract biology from large gene lists. Nucl. Acids Res., 35(Web Server issue):gkm415+, June 2007. ISSN 1362-4962. doi: 10.1093/nar/gkm415. URL http://dx.doi.org/10.1093/nar/gkm415.
[18] H. Jeong, S. P. Mason, A. L. Barabási, and Z. N. Oltvai. Lethality and centrality in protein networks. Nature, 411(6833):41–42, May 2001. ISSN 0028-0836. doi: 10.1038/35075138. URL http://dx.doi.org/10.1038/35075138.
[19] O. D. King, R. E. Foulger, S. S. Dwight, J. V. White, and F. P. Roth. Predicting gene function from patterns of annotation. Genome research, 13(5):896–904, May 2003. ISSN 1088-9051. doi: 10.1101/gr.440803. URL http://dx.doi.org/10.1101/gr.440803.
[20] Insuk Lee, Shailesh V. Date, Alex T. Adai, and Edward M. Marcotte. A probabilistic functional network of yeast genes. Science, 306(5701):1555–1558, November 2004. ISSN 1095-9203. doi: 10.1126/science.1099511. URL http://dx.doi.org/10.1126/science.1099511.
[21] E. A. Leicht and M. E. J. Newman. Community structure in directed networks. Physical Review Letters, 100:118703+, March 2008. doi: 10.1103/PhysRevLett.100.118703. URL http://dx.doi.org/10.1103/PhysRevLett.100.118703.
[22] Uskali Mäki, editor. The Economic World View: Studies in the Ontology of Economics. The Press Syndicate of the University of Cambridge, Cambridge, UK, 2001.
[23] Marina Meila. Comparing clusterings by the variation of information. Learning Theory and Kernel Machines, pages 173–187, 2003.
[24] R. Milo, S. Shen-Orr, S. Itzkovitz, N. Kashtan, D. Chklovskii, and U. Alon. Network motifs: Simple building blocks of complex networks. Science, 298(5594):824–827, October 2002. ISSN 1095-9203. doi: 10.1126/science.298.5594.824. URL http://dx.doi.org/10.1126/science.298.5594.824.
[25] Sara Mostafavi and Quaid Morris. Fast integration of heterogeneous data sources for predicting gene function with limited annotation. Bioinformatics, 26(14):1759–1765, July 2010. ISSN 1367-4811. doi: 10.1093/bioinformatics/btq262. URL http://dx.doi.org/10.1093/bioinformatics/btq262.
[26] M. E. Newman. Analysis of weighted networks. Phys Rev E Stat Nonlin Soft Matter Phys, 70(5 Pt 2), November 2004. ISSN 1539-3755. URL http://view.ncbi.nlm.nih.gov/pubmed/15600716.
[27] M. E. J. Newman. The structure and function of complex networks. SIAM Review, 45(2):167–256, 2003. URL http://scitation.aip.org/getabs/servlet/GetabsServlet?prog=normal&id=SIREAD000045000002000167000001&idtype=cvips&gifs=yes.
[28] M. E. J. Newman and M. Girvan. Finding and evaluating community structure in networks. Physical Review E, 69(2):026113+, February 2004. doi: 10.1103/physreve.69.026113. URL http://dx.doi.org/10.1103/physreve.69.026113.
[29] Iman Osman, Dean F. Bajorin, Tung-Tien T. Sun, Hong Zhong, Diah Douglas, Joseph Scattergood, Run Zheng, Mark Han, K. Wayne Marshall, and Choong-Chin C. Liew. Novel blood biomarkers of human urinary bladder cancer. Clinical cancer research: an official journal of the American Association for Cancer Research, 12(11 Pt 1):3374–3380, June 2006. ISSN 1078-0432. doi: 10.1158/1078-0432.CCR-05-2081. URL http://dx.doi.org/10.1158/1078-0432.CCR-05-2081.
[30] Erika Pastrana, Maria Teresa T. Moreno-Flores, Esteban N. Gurzov, Jesus Avila, Francisco Wandosell, and Javier Diaz-Nido. Genes associated with adult axon regeneration promoted by olfactory ensheathing cells: a new role for matrix metalloproteinase 2. The Journal of neuroscience: the official journal of the Society for Neuroscience, 26(20):5347–5359, May 2006. ISSN 1529-2401. doi: 10.1523/JNEUROSCI.1111-06.2006. URL http://dx.doi.org/10.1523/JNEUROSCI.1111-06.2006.
[31] Mason A. Porter, Jukka-Pekka Onnela, and Peter J. Mucha. Communities in networks. e-pub: http://arxiv.org/abs/0902.3788, September 2009. URL http://arxiv.org/abs/0902.3788.
[32] Simon B. Shum, Enrico Motta, and John Domingue. ScholOnto: an Ontology-Based digital library server for research documents and discourse. International Journal on Digital Libraries, 3:237–248, 2000. URL http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.36.3835.
[33] Ricard V. Solé, Ramon F. Cancho, Jose M. Montoya, and Sergi Valverde. Selection, tinkering, and emergence in complex networks. Complex., 8(1):20–33, September 2002. ISSN 1076-2787. URL http://portal.acm.org/citation.cfm?id=770715.770720.
[34] Robert Stevens, Carole A. Goble, and Sean Bechhofer. Ontology-based knowledge representation for bioinformatics. Brief Bioinform, 1(4):398–414, January 2000. doi: 10.1093/bib/1.4.398. URL http://dx.doi.org/10.1093/bib/1.4.398.
[35] The Gene Ontology Consortium. Creating the gene ontology resource: design and implementation. Genome Res., 11(8):1425–1433, August 2001. ISSN 1088-9051. doi: 10.1101/gr.180801. URL http://dx.doi.org/10.1101/gr.180801.
[36] Andreas Wagner. The yeast protein interaction network evolves rapidly and contains few redundant duplicate genes. Mol Biol Evol, 18(7):1283–1292, July 2001. ISSN 0737-4038. URL http://mbe.oxfordjournals.org/cgi/content/abstract/18/7/1283.
[37] Xiu Qin Q. Xu, Set Yen Y. Soo, William Sun, and Robert Zweigerdt. Global expression profile of highly enriched cardiomyocytes derived from human embryonic stem cells. Stem cells (Dayton, Ohio), 27(9):2163–2174, September 2009. ISSN 1549-4918. doi: 10.1002/stem.166. URL http://dx.doi.org/10.1002/stem.166.
[38] Xuerui Yang, Yang Zhou, Rong Jin, and Christina Chan. Reconstruct modular phenotype-specific gene networks by knowledge-driven matrix factorization. Bioinformatics, 25(17):2236–2243, September 2009. ISSN 1367-4811. doi: 10.1093/bioinformatics/btp376. URL http://dx.doi.org/10.1093/bioinformatics/btp376.
[39] Ahrim Youn, David J. Reiss, and Werner Stuetzle. Learning transcriptional networks from the integration of ChIP-chip and expression data in a non-parametric model. Bioinformatics, 26(15):1879–1886, August 2010. doi: 10.1093/bioinformatics/btq289. URL http://dx.doi.org/10.1093/bioinformatics/btq289.
[40] Jing Zhao, Hong Yu, Jianhua Luo, Z. Cao, and Yixue Li. Complex networks theory for analyzing metabolic networks. Chinese Science Bulletin, 51(13):1529–1537, July 2006. ISSN 1001-6538. doi: 10.1007/s11434-006-2015-2. URL http://dx.doi.org/10.1007/s11434-006-2015-2.

Supplemental Material
In this section we provide additional materials meant to complement and expand upon the analysis in “Finding New Order in Biological Functions from the Network Structure of Gene Annotations.” In the first section, “Communities Found Across Multiple Resolutions”, we provide more detail about how we used the resolution parameter to generate multiple, overlapping, yet unique communities of terms. In the second section, “Domain-specific Analysis”, we compare communities of terms to branches within each primary domain of GO. Next, in “Visualizing GO Branches”, we illustrate relationships between GO branches at different “levels” of the GO DAG. In the section entitled “Functional Enrichment Analysis”, we briefly explain the methodology used to evaluate the statistical enrichment of gene sets in the term communities and GO branches, and provide analysis showing that this enrichment is absent for randomly generated term communities or randomly generated sets of genes. Finally, in “Comparative Species Analysis”, we construct and analyze annotation-driven term networks using gene annotations for sixteen additional organisms and compare those networks with each other as well as with the human results presented in the main text.

For information regarding the methodology used to illustrate the term communities or generate the word clouds in the main text, see “Supplemental Methods” below.
Communities Found Across Multiple Resolutions

In order to find additional viable partitions of the annotation-driven term network, representing different resolutions, we varied the weighting parameter (see $r$ in Equation 3, and “Methods” in the main text). In addition to the fundamental partition ($r = 0$) we chose values of $r$ in geometrically increasing steps ranging from $r = 2^{-2} = 0.25$ to $r = 2^{10} = 1024$ such that the final distribution of community sizes would resemble that of the branches. This procedure resulted in fourteen different partitions of the terms in the annotation-driven network such that, initially, each term is assigned to exactly fourteen communities, one at each resolution. We emphasize that while communities at any given resolution do not overlap, communities at different resolutions can be highly overlapping.

The membership of a community found at one resolution can be identical to the membership of a community found at another resolution. In order to eliminate completely redundant information from our found communities, at each resolution we determined whether any of the communities found at that resolution were identical in membership to a community found at a lower resolution. If so, we “collapsed” the two community assignments into that of the community from the lower resolution. For example, of the 80 communities found at a resolution value of $r = 0.25$, 29 had a membership identical to one of the 56 communities found at $r = 0$. We removed those 29 “redundant” communities from the $r = 0.25$ partitioning, retaining only their $r = 0$ assignment, and recorded that at $r = 0.25$ only 51 additional communities are found. Similarly, of the 89 communities found at a resolution of $r = 0.5$, 45 are identical in membership to one of the 107 unique communities found at $r = 0$ or $r = 0.25$. These “redundant” communities were removed and we recorded only 44 additional communities found at $r = 0.5$. This procedure was repeated for the remaining resolutions, resulting in 11491 “unique” communities (see Table S1).

The cumulative distribution of the number of members in these communities is shown in Figure S1(a). For comparison, on the same graph we also plot the cumulative distribution of the number of annotations made to GO branches. These distributions are heavy-tailed and demonstrate that the number and sizes of branches and communities are similar. The heavy-tailed distribution for branches is a result of the hierarchical DAG structure, as the members of each branch are also members of their parent branch(es) (see [14, 15]).

TABLE S1: The number of communities found at each value of resolution used. As the resolution is increased, the number of communities found at that resolution increases as well.

Value of r      Number of      Number of New
(Resolution)    Communities    Communities
0               56             56
0.25            80             51
0.5             89             44
1               116            67
2               146            95
4               193            148
8               320            257
16              576            422
32              1007           703
64              1557           965
128             2409           1439
256             3585           1965
512             4983           2375
1024            6730           2904
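The collapsing of redundant communities across resolutions can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the code used for the analysis: the names `RESOLUTIONS` and `count_new_communities` are hypothetical, and each partition is assumed to be available as a list of term-ID collections keyed by its resolution value.

```python
# Hypothetical sketch: count the communities that are new at each resolution.
# Assumption: partitions[r] is a list of communities (iterables of term IDs).

# The fundamental partition (r = 0) plus geometric steps from 2^-2 to 2^10.
RESOLUTIONS = [0.0] + [2.0 ** k for k in range(-2, 11)]


def count_new_communities(partitions):
    """Return (new_counts, unique), where new_counts[r] is the number of
    communities at resolution r whose membership was not already found at a
    lower resolution, and unique is the set of all distinct memberships."""
    unique = set()
    new_counts = {}
    for r in sorted(partitions):  # process from lowest to highest resolution
        new = 0
        for community in partitions[r]:
            members = frozenset(community)
            if members not in unique:  # identical membership => redundant
                unique.add(members)
                new += 1
        new_counts[r] = new
    return new_counts, unique
```

Applied to the fourteen partitions, the per-resolution counts would correspond to the "Number of New Communities" column of Table S1, and the size of the returned set to the total number of unique communities.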
Domain-specific Analysis

The Gene Ontology is broken into three fully independent primary domains, each of which takes the form of a directed acyclic graph [35]. In the main text we combined information from all three primary domains to construct the annotation-driven term network. Because of this, the dissimilarity between the GO branches and the term communities we find by partitioning this network may be at least partially attributable to the fact that members of a particular community can belong to any of the primary domains, whereas members of a particular GO branch must all belong to the same primary domain based on the construction of the hierarchy. To assess the extent of this issue, we constructed three “domain-specific” term networks by using only terms specific to a particular primary domain and the gene annotations made to those specific terms. We partitioned each of these networks using the same resolution values as were used previously and, as with the partitions using all annotations, we retained only the “unique” set of communities for the following analysis (see Section 4.1). Table S2 shows the number of communities found and the number of branches defined in GO for each of the primary domains.

Next we compared the membership of these communities and branches using the Jaccard similarity (see “Term Communities and GO Branches Represent Distinct Collections of Biological Functions” in the main text). We did the comparison three times, each time confining the comparison to those branches defined by a particular primary domain and the term communities derived from the network constructed using those terms and their corresponding gene annotations. The distribution of $J_m$ (Figure S1), the maximum similarity a community has to any branch, or vice versa, for each of the domains, is not strikingly different from that found when using information collected from all three domains (see Figure 3(a) in the main text). There might be slightly more similarity when directly comparing communities and branches derived from either the “Molecular Function” or “Cellular Component” primary domains, but we remind the reader that the majority (approximately two-thirds) of the terms in GO belong to the “Biological Process” primary domain, and this distribution (Figure S1(b)) is practically indistinguishable from the one presented in Figure 3(a) of the main text.

[Figure S1 panels: (a) Distribution of Branch and Community Sizes (x-axis: Number of Members (N); y-axis: Number of Branches/Communities (Members ≥ N)); (b) Biological Process Terms; (c) Molecular Function Terms; (d) Cellular Component Terms (y-axis in (b)-(d): Number of Branches/Communities). Legend: Branches, Communities.]

FIG. S1: (a) The cumulative distribution of the sizes of all branches in the Gene Ontology and all unique term communities found at the various resolutions. (b-d) Distribution of $J_m$, the maximum similarity a domain-specific community or branch with ten or more members has compared to all other domain-specific branches or communities with ten or more members, respectively. Although a small number of communities and branches have similar memberships, most are very dissimilar.

TABLE S2: The number of branches defined within each primary domain as well as the number of communities found by varying the resolution parameter and partitioning a term network derived from gene annotations made to terms in this primary domain. The large size of the “Biological Process” primary domain compared to the others is evident.

              Number of    Number of
              Branches     Communities
All Terms     15033        11491
BP Terms      10192        9043
MF Terms      3634         5423
CC Terms      1207         2107
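As an illustration of the comparison above, the following Python sketch computes the Jaccard similarity between two sets of terms and, for each community, its maximum similarity to any branch. The function names and the representation of communities and branches as plain collections of term IDs are assumptions made for this example; the ten-member cutoff follows the Figure S1 caption.

```python
# Hypothetical sketch of the maximum Jaccard similarity J_m.
# Assumption: each community or branch is a collection of GO term IDs.

def jaccard(a, b):
    """Jaccard similarity |A ∩ B| / |A ∪ B| of two collections of terms."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def max_similarity(groups, others, min_size=10):
    """For every group with at least min_size members, return its largest
    Jaccard similarity to any member of `others` of at least min_size."""
    kept = [set(o) for o in others if len(o) >= min_size]
    return [
        max((jaccard(g, o) for o in kept), default=0.0)
        for g in groups
        if len(g) >= min_size
    ]
```

Calling `max_similarity(communities, branches)` gives the values for communities, and swapping the arguments gives the values for branches, so both distributions in panels (b-d) could be produced by the same routine.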
Visualizing GO Branches

To better understand the relationships between the branches in the Gene Ontology, we visualized branches in a manner similar to the way we visualized the relationships between term communities found at different resolutions (see Figure 2 in the main text). First, we determined a “level” for each GO branch in order to segregate the branches in a manner similar to the resolution parameter. To begin, the head-nodes of the three primary domains (“Biological Process”, “Molecular Function”, and “Cellular Component”) were assigned to level one. Next, we determined branches for which the head-node is a term that has a parent-child relationship with only one of these three level-one terms, and assigned those terms a level of two. Continuing, head-nodes that have parent-child relationships with only level-one or level-two terms were given a level assignment of three, and so on. This assignment was repeated until all head-nodes were assigned a “level”. Each branch was then assigned the level of its head-node.

We visualized branches with one hundred or more term-members at the six lowest levels (Figure S2), lining up branches with the same level assignment horizontally. Each branch is represented by a single circle, whose radius scales as the log of the number of terms belonging to that branch and whose color represents whether the terms in the branch belong to the BP (yellow), MF (cyan) or CC (magenta) primary domain. Between the branches found at adjacent levels, we draw a line from a branch at a higher level to a branch at a lower level if at least 10% of the members of the branch from the higher level also belong to the branch at the lower level. The thickness of the line is indicative of the overlap in membership between the two branches.

Unsurprisingly, we observe three distinct sets of interconnected branches corresponding to branches belonging to each of the three primary domains. As in the visual representation of the term communities across different resolutions (see Figure 2 in the main text), we observe a lot of “cross-talk” between branches at adjacent levels, whereby a branch at a given level is very likely to contain members from multiple branches at a lower level. We note that in this representation, an individual term can be a member of multiple branches at the same “level”. This is in contrast to segregating communities by resolution, in which case each term only appears once on a given resolution-row. As a consequence, the inter-level connections between branches are somewhat structurally different from connections between communities found at adjacent resolutions. Namely, branches that share a term will necessarily also share a set of parent branches. Redundancy of the same term member(s) across multiple branches at a given level is visually evident among “Biological Process” (yellow) branches, where there exist groups of branches at each level that connect primarily to the same set of branches at a lower level.

FIG. S2: Visualization of branches of GO terms. Each branch is represented as a circle whose radius scales as the log of the number of terms in the branch. The width of the line connecting two branches is proportional to the percentage of terms in the child branch that are also in the parent branch. Color represents whether the terms in the branch belong to the BP (yellow), MF (cyan) or CC (magenta) primary domain.
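The level assignment described above amounts to giving each head-node a level one greater than the highest level among its parents. A minimal Python sketch follows, assuming the DAG is available as a hypothetical dictionary `parents` that maps each term to the list of its direct parent terms (the three domain head-nodes map to an empty list). The function name and data layout are illustrative only.

```python
# Hypothetical sketch of the branch "level" assignment in the GO DAG.
# Assumption: parents[term] lists the direct parents of `term`; the three
# primary-domain head-nodes have no parents and therefore get level one.

def assign_levels(parents):
    """Return a dict mapping each term to its level, where a term whose
    parents all have level <= k (and at least one has level k) gets k + 1."""
    levels = {}

    def level(term):
        if term not in levels:
            direct = parents.get(term, [])
            # Roots are level one; otherwise one more than the deepest parent.
            levels[term] = 1 if not direct else 1 + max(level(p) for p in direct)
        return levels[term]

    for term in parents:
        level(term)
    return levels
```

Because the ontology is acyclic, the recursion terminates, and the memoized `levels` dictionary ensures each term is evaluated once. A branch then inherits the level of its head-node.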
Functional Enrichment Analysis

We also wanted to test how our communities might be used in functional enrichment analysis, a very common application of the Gene Ontology. Traditionally, each branch of GO is collapsed to its head node and all the genes annotated to that head node are grouped into one “set” (note that this set is the same as the genes annotated to the entire branch because of the propagation of gene annotation assignment; see the “Introduction” in the main text). Recently there has been evidence that this approach over-simplifies the complex structure of the Gene Ontology and has the potential to misrepresent the enrichment of gene sets in branches [14]. Therefore, we chose to use Annotation Enrichment Analysis (AEA) to evaluate the functional enrichment of experimentally-derived gene sets in both the GO branches and our term communities. AEA allows the user to specify a collection of terms and a collection of genes and uses a randomization protocol to evaluate the probability that these two sets are more connected than expected by chance.

We ran AEA using one million randomizations, defining collections of genes using a public database of cancer signatures [8] and collections of terms by (1) membership in the term communities; or (2) membership in GO branches. We plot the number of pairs of term and gene “collections” that are enriched beneath various significance thresholds (Figure S3(a)) and observe clear statistical enrichment of experimentally-derived gene signatures in both term communities and GO branches, with several thousand community-signature and branch-signature pairs enriched at a p-value significance less than 10 − .

Next, we wished to verify that this enrichment was not an artifact of the community construction; namely, we wanted to verify that experimental gene signatures are not enriched in random collections of terms and that term communities are not enriched in random collections of genes. Therefore we constructed “random” term communities by taking the community assignments of terms and swapping term labels. Similarly, we constructed “random” gene sets by randomly swapping gene labels. This gives us random term communities and random gene sets with both the same size and relative overlap as the real term communities and the experimentally-defined signatures. We then ran AEA an additional four times, using: (1) random communities and the experimentally-defined signatures; (2) term communities and random gene sets; (3) branches and random gene sets; and (4) random communities and random gene sets. The result of AEA indicated no enrichment using either random term communities or random sets of genes (Figure S3(a)), showing that the term communities generated by partitioning the annotation-driven term network contain useful biological information.

[Figure S3 panels: (a) Annotation Enrichment Analysis; (b) Fisher’s Exact Test. y-axis: Number of Comparisons Meeting Significance Level; x-axis: p-value significance thresholds. Legend: GO Branches (Cancer Signatures), GO Branches (Random Signatures), Term Communities (Cancer Signatures), Term Communities (Random Signatures), Random Communities (Cancer Signatures), Random Communities (Random Signatures).]

FIG. S3: Level of statistical enrichment found when comparing branches, communities, and randomly generated communities with either cancer signatures or randomly generated gene sets. Only branches and our identified term communities show statistical enrichment in experimentally-derived gene signatures, with slightly more enrichment for the communities compared to the branches.

AEA evaluates the overlap in annotations made by a set of genes or to a set of terms. Other, traditional functional enrichment analysis procedures, however, often use Fisher’s Exact Test (FET) to evaluate the overlap in two sets of genes: one defined by a gene set or signature of interest, and the other defined by taking the set of genes annotated to all the terms in a GO branch (the same as the genes annotated to the parent node). Although there is evidence that such analysis is highly sensitive to the annotation degree of genes and terms [14], we wanted to see if our communities were enriched in cancer signatures using this more traditional approach. Therefore, for each GO branch, term community and random community, we took the collection of genes annotated to all terms in that branch/community/random community, and assigned this set of genes to represent that branch/community/random community. We then evaluated the significance of overlap between these sets of genes and sets of genes as defined by cancer signatures, or random sets of genes.

As with AEA, both term communities and GO branches show enrichment in experimentally-defined gene signatures using FET, with term communities perhaps having a slightly greater level of enrichment (Figure S3(b)). At first it is surprising to observe that random communities also show a large amount of enrichment in the experimental gene signatures, much more than either the branches or the real communities. In retrospect, however, this serves to highlight a known tendency of FET to overestimate the significance of overlap when genes in a set carry a higher than expected number of annotations. Whereas our random gene sets represent a random sampling from all genes annotated to GO, the genes collected in the signatures published by GeneSigDB are biased in that they are generally annotated much more frequently to GO than one would expect by chance (see [14]). Furthermore, by taking all genes annotated to a collection of terms, highly annotated genes are also more likely to be represented in the gene sets representing the branches, term communities, and the random communities. The enrichment of the experimental gene signatures in the random communities, therefore, is attributable to the fact that FET finds significant overlap between sets containing an abundance of highly annotated genes, independent of the biological content of those sets. For more discussion, see [14].
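To make the traditional overlap test concrete, the sketch below computes a one-sided Fisher's Exact Test p-value as the upper tail of the hypergeometric distribution. It is a minimal example under stated assumptions rather than the pipeline used here: the function names are hypothetical, the background is assumed to be the set of all genes annotated to GO, and AEA itself (a randomization-based procedure) is not reproduced.

```python
from math import comb

# Hypothetical sketch of a one-sided Fisher's Exact Test for gene-set overlap.
# Assumption: `background` is the universe of genes (e.g. all annotated genes).

def hypergeom_tail(k, K, n, N):
    """P(X >= k) when drawing n items without replacement from a population
    of N items, K of which are 'successes'."""
    upper = min(K, n)
    total = comb(N, n)
    favourable = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return favourable / total


def fet_overlap_pvalue(signature, term_genes, background):
    """Probability of an overlap at least as large as the observed one between
    a gene signature and the genes annotated to a collection of terms."""
    background = set(background)
    signature = set(signature) & background      # restrict to the universe
    term_genes = set(term_genes) & background
    overlap = len(signature & term_genes)
    return hypergeom_tail(overlap, len(term_genes), len(signature), len(background))
```

Note that this test treats every gene in the background as equally likely to be drawn, which is the assumption that breaks down when signatures and term-derived gene sets are both dominated by highly annotated genes.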
Comparative Species Analysis

Even though the Gene Ontology hierarchy establishes a species-independent terminology, one could imagine constructing networks of terms using species-specific gene annotations, and thereby constructing species-specific term communities. These communities would reflect the biological functions that are performed by the same sets of genes in a particular species and would not necessarily be the same across different species. In this section we evaluate the similarity between partitions of GO terms derived from the annotations of seventeen model organisms, including thale cress (Arabidopsis thaliana), Escherichia coli, slime mold (Dictyostelium discoideum), Aspergillus nidulans, three types of yeast including Candida albicans, budding yeast (Saccharomyces cerevisiae), and fission yeast (Schizosaccharomyces pombe), worm (Caenorhabditis elegans), fruit fly (Drosophila melanogaster), zebrafish (Danio rerio), chicken (Gallus gallus), pig (Sus scrofa), cow (Bos taurus), dog (Canis lupus familiaris), mouse (Mus musculus), rat (Rattus norvegicus) and human (Homo sapiens). We downloaded gene annotation files for each of these species and projected term-term networks (see Section “Constructing a Term Network from Gene Ontology Annotations” in the main text). We then partitioned each of these networks into communities (see Section “Identifying Communities of GO terms” in the main text). For simplicity we chose to focus only on the fundamental partition (resolution parameter $r = 0$, see Equation 3 in the main text). This results in exactly one discrete partitioning of GO terms associated with each species.

Comparing community structure identification is an ongoing area of research in the complex systems field and there are multiple proposed methods for comparing two discrete partitions of a set of nodes (e.g. [9, 23]). One we will employ here is the variation of information ($VI$) [23]:

$$VI(X, Y) = H(X) + H(Y) - 2\,MI(X, Y) \tag{4}$$

where $H(X)$ is the Shannon entropy associated with a partition, $X$, and $MI(X, Y)$ is the mutual information between two partitions, $X$ and $Y$. $VI(X, Y)$ represents an information-theoretic distance between two partitions, $X$ and $Y$, with $VI(X, Y) = 0$ indicating identical partitions.

We calculated the VI between the partitions for every pair of species, using only terms common to both partitions when different sets of terms are annotated in the different species. Figure S4 shows these values. We also calculated a VI value for a random shuffling of community assignments in each pair of species and observe that “random” VI takes a value of approximately 2 . ± . VI = 0), indicating that these term partitions are far from identical.

[Figure S4: heat map of the variation of information between species-specific term partitions. x-axis: scientific names (Arabidopsis thaliana, Escherichia coli, Dictyostelium discoideum, Aspergillus nidulans, Candida albicans, Saccharomyces cerevisiae, Schizosaccharomyces pombe, Caenorhabditis elegans, Drosophila melanogaster, Danio rerio, Gallus gallus, Sus scrofa, Bos taurus, Canis lupus familiaris, Mus musculus, Rattus norvegicus, Homo sapiens); y-axis: common names (Thale cress, E. coli, Slime mold, A. nidulans, Yeast, Budding yeast, Fission yeast, Worm, Fruit fly, Zebrafish, Chicken, Pig, Cow, Dog, Mouse, Rat, Human); color scale: Variation of Information, 0 to 2.]

FIG. S4: The variation of information between term partitions generated from species-specific projected term networks. Labels along the x-axis are the scientific names of the species and labels along the y-axis are the common names of the species. Although the partitions are more similar than random (VI ≈ . VI = 0), indicating that these species-specific communities of functional terms carry distinct information.

Interestingly, there is a relatively higher level of similarity between the species belonging to the Fungi Kingdom (A. nidulans, yeast, budding yeast and fission yeast). On the other hand, there is higher dissimilarity between animals belonging to the Chordata phylum (zebrafish, chicken, pig, cow, dog, mouse, rat, human), both among themselves and compared to the other species. This could be a consequence of evolutionary diversity playing a more dominant role in the organization of biological function in these organisms, perhaps through more complex regulatory mechanisms such as epigenetics. The exception is when comparing mouse, rat and human, which is not entirely surprising given the extent to which mouse is used to mimic the human system in laboratory experiments.

Although some of the differences between the species-specific term communities may be due to variations in the annotation practices among groups that supply annotations to GO, it is also likely that they reflect real, biological differences in the cellular organization of these systems. We suggest that using these partitions of terms in a species-specific context may enhance the results of functional analysis for these model organisms. Furthermore, identifying the exact differences between these communities may uncover important cellular properties of various species, an investigation we leave to future work.
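The pairwise comparison of species-specific partitions can be sketched as follows. This example uses the standard form of the variation of information, $VI(X, Y) = H(X) + H(Y) - 2\,MI(X, Y)$, which is zero for identical partitions. The function names, the use of natural logarithms, and the representation of each partition as a dictionary from term ID to community label are all assumptions made for illustration.

```python
from collections import Counter
from math import log

# Hypothetical sketch of the variation of information between two partitions.
# Assumptions: each partition maps term ID -> community label; natural logs.

def entropy(labels):
    """Shannon entropy of the community-size distribution of a partition."""
    n = len(labels)
    return -sum((c / n) * log(c / n) for c in Counter(labels).values())


def mutual_information(x, y):
    """Mutual information between two aligned lists of community labels."""
    n = len(x)
    px, py, pxy = Counter(x), Counter(y), Counter(zip(x, y))
    return sum((c / n) * log(c * n / (px[a] * py[b])) for (a, b), c in pxy.items())


def variation_of_information(part_x, part_y):
    """VI between two partitions, restricted to the terms common to both."""
    common = sorted(set(part_x) & set(part_y))
    x = [part_x[t] for t in common]
    y = [part_y[t] for t in common]
    return entropy(x) + entropy(y) - 2 * mutual_information(x, y)
```

A "random" reference value could be obtained in the same way by shuffling the community labels of one partition (for example with `random.shuffle` on the list of labels) before calling `variation_of_information`.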
Supplemental Methods
In this section we provide additional information on the methodology used to illustrate the term communities across different resolutions (Figure 2 in the main text) and generate the word clouds representing the biological content of those communities (Figure 4 in the main text).
To better understand the relationships between the communities found at different resolutions, we visualized the uniquely-found term communities with ten or more members for the six lowest values of resolution used ($r = \{0, 0.25, 0.5, 1, 2, 4\}$) (Figure 2 in the main text). We line up the communities found at each resolution and visualize each as a circle whose radius scales as the log of the number of term members found in that community and whose color corresponds to the percentage of “Biological Process”, “Molecular Function” and “Cellular Component” terms that belong to that community. In other words, for each community we count the number of members in that community from the “Biological Process” domain and divide by the number of members in the entire “Biological Process” domain. We then do the same for the other two domains. After these percentages are calculated, within each community they are “normalized” by dividing by the maximum found percentage such that the vector representing that community’s domain content has at least one component with a value of one. We can think of these values in terms of a three-part cmy color vector. The normalization process causes communities with an equal percentage from all three primary domains to be colored black (cmy color vector equal to [1,1,1]), and those with members only from one primary domain to be exactly yellow ([0,0,1]) for “Biological Process”, cyan ([1,0,0]) for “Molecular Function”, or magenta ([0,1,0]) for “Cellular Component”.

Between the communities found at adjacent resolutions, we draw a line from a community at a higher resolution to a community at a lower resolution if at least 10% of the members of the community from the higher resolution also belong to the community at the lower resolution. Line thickness scales linearly with the percentage of members of the community from the higher resolution that belong to the community at the lower resolution. Note that although connections are only made between communities in adjacent resolutions, sometimes the parent community is identical to another community found at an even lower resolution, in which case the connection is made from the child community to the copy of the parent at its lowest found resolution.

In order to easily interpret the contents of our communities we summarize the information contained in each in the form of word clouds using a free word-cloud making program [1]. This program automatically configures the orientation of the words in the clouds, but we manually assign each word a relative size and color to represent that word’s statistical enrichment in the community and the primary domain that word is representing in the community, respectively.

To begin, for each community, we determine all the descriptions corresponding to the member terms of that community and count the number of times an individual word appears across all these descriptions. Then, for each of these words, we calculate the statistical enrichment (p-value) of the frequency of that word in the community compared to its frequency across the descriptions of all GO terms using the hypergeometric probability:

$$p = P(N \geq N_{wc} \mid N_w, N_c, N_{tot}) = \sum_{i=N_{wc}}^{\min[N_w, N_c]} \frac{\binom{N_c}{i}\binom{N_{tot} - N_c}{N_w - i}}{\binom{N_{tot}}{N_w}}, \tag{5}$$

where $N_{wc}$ is the number of times that word appears in the term descriptions specific to a community, $N_w$ is the number of times that word appears across all term descriptions, $N_c$ is the number of individual words in the term descriptions specific to a community, and $N_{tot}$ is the number of individual words across all term descriptions. We scale the sizes of the words in the word cloud as − log ( pp