Applied Category Theory for Genomics – An Initiative
Yanying Wu
Centre for Neural Circuits and Behaviour, University of Oxford, UK; Department of Physiology, Anatomy and Genetics, University of Oxford, UK
06 Sept, 2020
Abstract
The ultimate secret of all life on earth is hidden in genomes, the totality of an organism’s DNA sequences. We currently know the whole genome sequence of many organisms, yet our understanding of genome architecture at a systematic level remains rudimentary. Applied category theory opens a promising way to integrate the humongous amount of heterogeneous information in genomics, to advance our knowledge of genome organization, and to provide us with a deep and holistic view of our own genomes. In this work we explain why applied category theory carries such a hope, and we go on to show how it could actually do so, albeit in baby steps. The manuscript is intended to be readable by both mathematicians and biologists, so no prior knowledge is required from either side.

arXiv [q-bio.GN]

Introduction
DNA, the genetic material of all living beings on this planet, holds the secret of life. The complete set of DNA sequences in an organism constitutes its genome, the blueprint and instruction manual of that organism, be it a human or a fly [1]. Genomics, which studies the contents and meaning of genomes, has therefore stood at centre stage of scientific research since its birth.

The twentieth century witnessed three milestones of genomics research [1]. It began with the rediscovery of Mendel’s laws of inheritance [2], reached a climax in the middle with the discovery of the DNA double helix structure [3], and ended with the completion of a first draft of the human genome sequence [4].

In the new era, with advances in high-throughput sequencing technology and a flourishing of bioinformatics, numerous details of various genomes have been accumulated. Consequently, a major challenge for next-generation genomics is to integrate this large amount of information and to obtain a unified, global view of genomes [5]. To address this challenge, various computational methods have been employed, with deep learning as the latest and most popular [6, 7]. Missing from the list, however, is applied category theory.

Category theory is a rising star in pure mathematics. It was invented to communicate ideas between different fields within mathematics [8]. Later, people found that it is a language and mathematical tool that captures the essential features of a subject, and that it can be applied quite generally [9]. Category theory has already been successfully applied in computer science, linguistics and physics [10]. Indeed, applied category theory has now grown into a discipline of its own, expanding into social science, cognition, neuroscience, cybernetics and many more areas [11].

In this manuscript, we propose that applied category theory is a powerful and well-suited tool for the study of genomics. The main goal of this article is to explain why. To do so, we first introduce the history, current status and next big question of genomics. Then we briefly describe what category theory is and the current development of applied category theory, especially in biology. Finally, we try to bridge the two fields. Using the fly genome as a sample system, we demonstrate how applied category theory could help us better understand a genome as a whole.

Clearly, our final aim, to uncover the organizational principles of genomes in general, is far from being reached in this initial work. However, with the right tools at hand and a clear direction ahead, all we need to do is proceed. How about naming this course “Categorical Genomics”: a study of Applied Category Theory for Genomics?

Genomics by definition is the study of all the genes of an organism, as well as their interactions with each other and with the environment. Genetics, on the other hand, focuses on the function and heredity of single genes [12, 13]. However, it is hard to draw a clear boundary between genomics and genetics, and genomics is also closely intermingled with molecular biology and evolution. For the history of genomics, we therefore make a subjective choice of what to cover, considering aspects that are relatively influential and relevant rather than restricting ourselves to genomics in the literal sense.

A brief history of genomics
The origin of genomics can be traced back to the mid-19th century, when Mendel’s laws of inheritance were first formulated in 1865. Mendel’s laws went unnoticed and were later rediscovered independently by three scientists in 1900 [14]. Before Mendel, it was commonly believed that an organism’s traits were passed on to its offspring as a blend of the characteristics contributed by each parent. Mendel’s laws stated that genetic material (the concept of the gene had not yet been formed) is inherited as distinct units, one from each parent. For this revolutionary discovery, Mendel was named the “Father of Genetics”. A decade later, Morgan spotted a white-eyed male fruit fly, which led him to the discovery of sex linkage. Morgan became the first person to link the inheritance of a specific trait (white eye) to a particular chromosome (X) [15, 16]. These early works marked the dawn of our understanding of the fundamental nature of inheritance. But the road ahead was far from straightforward.

People nowadays take it for granted that DNA is the genetic material passed on from one’s parents and determining one’s characteristics. Yet for many decades scientists believed that proteins, not DNA, were the molecules that carry genetic information. Beyond the lack of clear experimental proof, the main obstacle to scientific progress was in fact intuitive bias. DNA contains only 4 different nucleotides (A, C, T and G), while proteins have 20 different amino acids to draw on, so it is no wonder that people were inclined to believe that proteins were better qualified to play such a vital role as storing the vast amount of genetic information. It was not until the 1940s that Avery and his colleagues found convincing experimental evidence that DNA is the genetic material [17]. In retrospect, we learn that the number of basic components matters less than the ways in which they are combined.

In the early 1950s, scientists working on DNA started to use “gene”, which is essentially a fragment of DNA, to denote the smallest unit of genetic information. However, no one knew what genes look like physically, or how DNA is duplicated during reproduction [18]. Then came Watson and Crick. Based on X-ray crystallography data obtained by Rosalind Franklin, they proposed the double-helix model of DNA structure, which turned out to be astonishingly accurate [3]. The single-page paper titled “A Structure for Deoxyribose Nucleic Acid” not only illustrated beautifully what DNA looks like, but also answered the question of how DNA replication could be carried out. The discovery of the double helix structure of DNA is one of the most remarkable scientific achievements of humankind, and it gave rise to modern molecular biology [18]. Figure 1 shows a schematic drawing of the DNA double helix structure.

Several years later, Nirenberg, Crick and collaborators deciphered the genetic code. They first figured out that each amino acid is encoded by a triplet of DNA bases (named a codon). By 1966, the codons for all twenty amino acids had been identified. It thus became clear how genetic information flows from DNA to messenger RNA, and then to protein [19, 20].

The invention of a DNA sequencing method by Sanger in 1977 opened a key chapter of genomics [21]. Since then, sequencing technology has advanced rapidly [22]. In addition, as bioinformatics thrived, large databases and genome browsers were developed [23–25]. Together, they led to the completion of full genome sequencing of the model organism Drosophila melanogaster
Rather than putting an end to genomics, the sequencing of more and more whole genomes has not only revolutionized existing disciplines, but also opened up exciting new areas. Personalized medicine is one salient example [27]. Without individual sequence profiles of their patients, doctors previously could only diagnose a disease after the patient had already developed it. Nowadays, a doctor is able to predict whether a person is at risk of developing certain diseases based on their unique genomic constitution. As a result, the chances of preventing such diseases from actually occurring are greatly increased [28–30].

Moreover, genome editing (also called gene editing) is a technique that has drawn a great deal of attention in recent years [31]. The tools in this field have expanded rapidly from zinc finger nucleases (ZFNs) and transcription activator-like effector nucleases (TALENs) to clustered regularly interspaced short palindromic repeats (CRISPR) and CRISPR-associated (Cas) systems. These new methods have enabled highly efficient, precise and cost-effective studies of human and animal models of disease [32].

Another field worth mentioning is synthetic biology, which aims to redesign and synthesize artificial organisms [33]. Driven also by technical advances, the current ambitious project to design and synthesize a complete yeast genome is making remarkable progress [34].

Besides DNA sequencing, RNA sequencing (RNA-seq), which directly reflects how genes are expressed in cells, is also an important branch of genomics. Next-generation sequencing technology has given a strong impetus to high-throughput RNA-seq as well [35]. In particular, single-cell RNA-seq has revolutionized RNA-seq by providing unprecedented resolution, and it has grown rapidly in recent years [36, 37].

The tremendous amount of data generated by these sequencing efforts has called for strong support from bioinformatics and computational biology. Large databases such as GenBank (the NIH genetic sequence database) and RefSeq (the NCBI reference sequence database), and genome browsers such as Ensembl and the UCSC Genome Browser, have been established and are continuously being improved [38]. A growing list of major bioinformatics institutions has been set up worldwide: the National Center for Biotechnology Information (NCBI), the European Bioinformatics Institute (EMBL-EBI), the Wellcome Trust Sanger Institute (WTSI) and the Broad Institute, to name a few. Currently about 25 journals, 13 conferences and 8 workshops are dedicated exclusively to bioinformatics, and more than 22k articles have been published [39]. Furthermore, all sorts of computational tools, including deep learning, are used for genomics [7, 40–48].

Above all, the most exciting ongoing research in genomics is perhaps the “3D genome” [49]. It has long been recognized that although DNA strands appear linear, the chromosomes in a nucleus actually form complex and dynamic three-dimensional configurations, which play critical roles in gene regulation and function [50]. In particular, high-resolution studies of chromosome conformation have revealed that the 3D genome is hierarchically organized into large compartments composed of smaller domains called topologically associating domains (TADs). The molecular details of how these domains are formed and how they affect genome function are under intensive investigation [50–54].
Clearly, the study of the 3D genome is far from complete at this moment. On the other hand, given the current pace of science, it should not be too long before we see a relatively clear picture of the mechanisms of 3D genome dynamics. By then we will have the chance to tackle the next big question in genomics: the organizational principles of the genome, both structural and functional. Answering it would mean a deep and thorough understanding of genomes, to the point that it could guide genome editing, genome synthesis and disease prediction with precision. Looking further into the future, we might be able to create artificial organisms and to cure many human diseases through genome editing.

To reach that level of understanding, however, we need to combine our insights into the 3D genome with all the DNA/RNA sequencing information, as well as the functional and relational genomics knowledge that we have accumulated and will continue to accumulate. Here we propose that applied category theory is the right language and tool for this combination, and we explain the reason in the following section.
Category theory originates from pure mathematics and is known to be highly abstract. As if residing in a lofty attic, category theory appears strange even to many professional mathematicians. Unexpectedly, this once esoteric subject has proved quite useful in several areas outside mathematics, alongside its own development in recent years. Indeed, the field of applied category theory is expanding at an accelerating pace, covering quantum physics, programming languages, databases and informatics, natural language processing and other areas [11]. Furthermore, given that category theory has already been applied in many sub-fields of biology, it is surprising that an explicit connection between applied category theory and genomics is still missing today. We intend to set up such a connection, and we lay out our rationale in more detail below.
Category theory is a relatively new branch of mathematics invented by Eilenberg and Mac Lane in 1945 [8]. They were studying the unification of different mathematical fields such as geometry and algebra, and they managed to devise a set of abstract concepts that capture the similarities between those two fields at a fundamental level [10]. Soon their work was found to be applicable to many other fields of mathematics. The basic idea of category theory is to formalize a given area of study within mathematics as a category, which can then be connected with categories from other fields as long as their structures can be aligned in a “functorial” way. In a nutshell, a category has objects (representing things) and morphisms (representing ways to go between things) as its basic components, and two consecutive morphisms can be composed. In addition, a functor maps one category to another, and a natural transformation goes between two such functors. These ideas, and the whole body of theory built upon them, are so powerful that category theory has been proposed as an alternative foundation for mathematics (replacing set theory) [10, 55].

Not long after, category theorists realized that the practical value of these theories could go well beyond mathematics. In fact, category theory has been successfully applied in physics, computer science and linguistics [10]. Take quantum physics as an example: a basic physical system and the measurements performed on it can be naturally modelled as a category. Concretely, different types of physical systems, be they qubits, electrons or classical measurement data, are the objects of the category. The operations are the morphisms, consecutive application of two operations corresponds to the composition of two morphisms, and so on. Once transformed into categorical models, the physical systems can be studied formally, taking advantage of the rich repertoire of constructions and theorems in category theory. In addition, string diagrams in category theory provide a way to visualize complex quantum computations much more intuitively than traditional equations do [56, 57]. Most remarkably, category theory allows us to zoom out and see an even bigger picture. In this picture, extensive analogies appear between physics, topology, logic and computer science, and these analogies can be described precisely using the concept of a “closed symmetric monoidal category” [58].

Further along the way, an application of category theory to science in general was introduced by David Spivak in his book “Category Theory for the Sciences” [59]. In this book, a tool called ologs (ontology logs), based on category theory, is introduced to give structure to ideas and concepts, so that they can be expressed rigorously and explicitly and communicated efficiently. Ologs also encompass a database schema. In short, ologs provide a framework in which scientists can formalize their ideas and record data about their experiments, all in a mathematically sound way [59].

The growing breadth and depth of applications of category theory appear to demand a field of their own. Indeed, the first international conference on applied category theory in 2018 [60], along with a dedicated open access journal, “Compositionality” [61], has already given birth to this new field: applied category theory (ACT). Since then, it has attracted increasing attention, its community has expanded rapidly, and it has reached into wider territories. A more inclusive but not exhaustive list of ongoing applications outside mathematics includes quantum computing, natural language processing, programming languages, network theory, databases and information theory, logic and proof theory, resource theory, process theory, game theory, statistics and probability theory, biology (detailed right after) and cybernetics [11, 60, 62–64].
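The basic vocabulary introduced at the start of this section (morphisms, composition, identities and functors) can be made concrete with a small sketch. The following Python snippet is only an illustration under stated assumptions: it treats Python functions as morphisms, and the names `compose`, `identity`, `fmap`, `f`, `g` and `h` are hypothetical choices made here, not notation from the cited literature. It checks the associativity and unit laws, and the functor laws for lists, on a few sample values; this is a spot check rather than a proof.

```python
from typing import Any, Callable, List

Morphism = Callable[[Any], Any]


def compose(g: Morphism, f: Morphism) -> Morphism:
    """Return the composite g ∘ f: apply f first, then g."""
    return lambda x: g(f(x))


def identity(x: Any) -> Any:
    """Identity morphism: returns its argument unchanged."""
    return x


def fmap(func: Morphism) -> Callable[[List[Any]], List[Any]]:
    """Action of the list functor on a morphism: lift A -> B to List[A] -> List[B]."""
    return lambda xs: [func(x) for x in xs]


# Three composable example morphisms (hypothetical, chosen only for illustration).
def f(n: int) -> int:
    return n + 1


def g(n: int) -> int:
    return n * 2


def h(n: int) -> str:
    return str(n)


samples = [0, 1, 5]

# Associativity: h ∘ (g ∘ f) agrees with (h ∘ g) ∘ f on the samples.
assoc_ok = all(
    compose(h, compose(g, f))(x) == compose(compose(h, g), f)(x) for x in samples
)
# Unit law: f ∘ id = f = id ∘ f on the samples.
unit_ok = all(
    compose(f, identity)(x) == f(x) == compose(identity, f)(x) for x in samples
)
# Functor laws: fmap(g ∘ f) = fmap(g) ∘ fmap(f), and fmap(id) acts as the identity.
functor_ok = fmap(compose(g, f))(samples) == compose(fmap(g), fmap(f))(samples)
functor_id_ok = fmap(identity)(samples) == samples

print(assoc_ok, unit_ok, functor_ok, functor_id_ok)
```

The point of the sketch is that the laws are statements about composition alone, so the same checks apply unchanged to any other choice of objects and morphisms.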
The theoretical biologist Robert Rosen was among the first to use category theory in systems biology. He asked the most fundamental questions, such as “What is life?”, and he argued that a reductionist approach was inadequate for studying the functional organization of living organisms [65, 66]. Instead, Rosen proposed an alternative paradigm in which he modelled an organism as a “complex system” that cannot be fully understood by reducing it to its parts. In this framework, complexity refers to the causal impact of organization on the system as a whole, and the focus is on relationships among organized matter rather than on particular matter alone. The model, known as (M,R) systems, and the discipline, named “Relational biology”, adopted a category-theoretic method [67–70].

In contrast to Rosen’s modelling of the whole organism, several important studies work at the molecular level. An early example is Carbone and Gromov’s “Mathematical slices of Molecular Biology”, in which they modelled the spatial structure of DNA, RNA and protein using topological surfaces and spaces [71]. Later on, Sawamura et al. made an effort to systematize molecular and genetic biology using category theory. In their work, the authors constructed a wallpaper pattern to describe the algebraic features of DNA base sequences [72]. More recently, Remy Tuyeras built a much stronger connection between genetics and category theory through a series of papers. There he defined a class of theories and related models, and recovered various aspects of genetics categorically. Under his framework, DNA sequencing and alignment, homologous recombination, haplotypes, CRISPR editing and genetic linkage can all be formalized in mathematical language, and the mechanisms of genetics can be illustrated clearly [73–75].

In addition, category-theoretic approaches to neuroscience have been pursued from time to time. Most prominently, the book “Memory Evolutive Systems; Hierarchy, Emergence, Cognition”, authored by Ehresmann and Vanbremeersch, provides a comprehensive mathematical model for autonomous evolutionary systems in neuroscience. The main idea of this book is to use the notion of colimit from category theory to describe how complex hierarchical systems are organised [76, 77]. It also lays out the potential of higher-dimensional algebra and higher categories for modelling the structure of the brain, as was proposed in an earlier work [78].

Beyond neuroscience, and perhaps even beyond biology, category theory has been adopted to demystify consciousness as well. We mention two recent pieces of work here, both of which take advantage of the graphical language of process theory. Process theory is an abstract framework describing how processes can be composed, and it is essentially the theory of symmetric monoidal categories [79, 80]. One of the two studies takes its inspiration from the Yogacara school [81] and characterises the key feature of consciousness as “other-dependent nature”, i.e. the nature of existence arises from causes and conditions [82]. The other work is based on Integrated Information Theory (IIT), developed by Tononi and collaborators, which proposes that consciousness originates from integrated internal dynamics in the brain [83, 84].
So far, category theory has not been explicitly applied to genomics, although it clearly should be. The reason is that applied category theory is well placed to address the needs of genomics, whose data have the main features explained next.

First of all, genomics data is highly diverse and heterogeneous. There are an estimated 8.7 million different species on the planet, each with a unique genome of its own [85]. As genomics concerns the genomes of all sorts of organisms, the diversity of genomics data comes naturally from the multiplicity of species. In addition, the heterogeneity of high-throughput biotechnologies has created miscellaneous data types from various perspectives, e.g., chromatin accessibility, DNA methylation, and mRNA/microRNA expression [86, 87].

Second, genomics data is hierarchical and multi-scale. The genome itself is organized in a hierarchical manner. Concretely, the spatial organization of chromatin in the nucleus has an important impact on the regulation of gene expression. Experiments that map high-frequency contacts between chromatin segments have revealed the existence of topologically associating domains (TADs), which contain most of the regulatory interactions. In particular, TADs are found not to be homogeneous structural units but to be organized into hierarchies, i.e., large TADs often contain subTADs nested inside them [88, 89].

Third, genomics information contains intricately interconnected relationships. As mentioned above, inside the nucleus chromatin forms complex and dynamic 3D shapes, so that local chromatin segments contact each other in a coordinated yet dynamic way [90]. Also, transcription factors control the expression of their target genes, and together they constitute sophisticated gene regulatory networks. In addition, outside the nucleus but inside the cell, proteins interact with each other to form signalling pathways, with cross-talk among those pathways [91].

Together, these characteristics of genomics data pose a formidable challenge for any attempt to integrate them. Reassuringly, applied category theory offers promising solutions for the following reasons. (1) The abstractness of category theory offers an opportunity to extract general features or common underlying structures (which are not obvious on the surface) from diverse data, and to present them in a unified framework. (2) Compositionality, the essence of category theory, together with scalability (to higher categories), makes applied category theory an ideal tool for modelling large-scale, multi-layer systems. (3) An extremely rich set of constructions, such as functors, natural transformations, adjoints, limits and so on, provides a powerful language for describing all sorts of relationships between objects, as well as relationships between relationships.
Given what has been explained in the previous sections, a practical effort to connect genomics with category theory is worth the time and effort. In other words, we have answered the question of why, and we now need to address the question of how. There are two main routes to take: one is to make use of currently available models or theories, and the other is to build a categorical framework for genomics from scratch. In this work we first review some initial attempts along the former route, and then give a preliminary draft for the latter.
Figure 2 summarizes three topics that serve as bridges or tunnels connecting the two fields. We give a brief description of each of them in this section. Readers are referred to the individual manuscripts for further details.

Figure 2: Three bridges connecting applied category theory with genomics

One bridge is the language perspective. In its physical form, the genome consists of long strings of DNA sequence made up of A, C, T and G, so it is natural to think of the genome as a text written in a language whose alphabet is those four letters (nucleotides). Consequently, the tools and theories that have been developed for natural language processing can be borrowed for the study of genomes. In practice, this enterprise has been pursued for several decades [92–99].

Category theory, for its part, has been used for natural language processing for almost a decade as well. In particular, a mathematical model called DisCoCat (Categorical Compositional Distributional Model) was created to compute the meaning of a sentence from the meanings of its constituent words. Concretely, the DisCoCat model unifies the distributional representations of word meanings in vector spaces with the compositional grammar types of words in a pregroup, and it takes advantage of the pregroup algebra to transform the meanings of individual words into a meaning of the whole sentence. The key idea behind the DisCoCat model is that vector spaces and pregroups share the same high-level mathematical structure, that of a compact closed monoidal category [100].

As a trial to harness the DisCoCat model for the genome language, a protein linguistics was considered [101]. Proteins are the products of genes that actually carry out biological functions. Each protein is in turn composed of one or more distinct domains, which are the functional units of the protein. Interestingly, the domains seem to assemble into proteins in a modular and grammatical fashion [102, 103]. This feature of proteins enables us to draw a language analogy, in which a protein is viewed as a sentence, its domains as words, and biological functions correspond to meanings. Just as in natural language words are stable while their combinations into sentences are diverse, domains are evolutionarily conserved while their combinations into proteins are quite flexible. Therefore, although we now know the function of most domains, our ability to predict the functions of novel proteins is limited [104, 105]. Since the DisCoCat model can calculate the meaning of a sentence from its words, it provides a novel way to predict the function of a protein from its domains [101]. Figure 3 shows how the basic schema of the DisCoCat model transfers to the analogous protein linguistics.

Figure 3: Adopting the DisCoCat model for protein linguistics

A second bridge concerns network connections. In genomics, the gene regulatory network (GRN) is the most common description of the intrinsic interactions between genes that govern their expression levels. Several computational approaches have been used to model GRNs, including Boolean networks, Bayesian networks, differential equation models, Petri nets, etc. [106–109]. In particular, Petri nets, with strong theoretical support and a broad application community, excel at modelling concurrent dynamic systems in general. A Petri net, by definition, is a bipartite directed graph containing places and transitions connected by directed arcs. A place can hold resources called tokens, and the flow of tokens from one place to another through the transition between them captures the state updates of a dynamic system. Furnished with rules for these updates, Petri nets are able to specify clearly the structure and behavior of certain processes.
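The token-flow rule described above can be sketched in a few lines of Python. This is a minimal illustration, not an implementation taken from the cited works: the names `Transition`, `enabled` and `fire` are hypothetical, and the three-place net for gene expression (places `gene`, `mRNA`, `protein`) is a made-up toy example rather than a model of any real regulatory network.

```python
from dataclasses import dataclass
from typing import Dict

Marking = Dict[str, int]  # number of tokens currently held by each place


@dataclass(frozen=True)
class Transition:
    name: str
    inputs: Dict[str, int]   # tokens consumed from each input place
    outputs: Dict[str, int]  # tokens produced in each output place


def enabled(marking: Marking, t: Transition) -> bool:
    """A transition is enabled when every input place holds enough tokens."""
    return all(marking.get(place, 0) >= n for place, n in t.inputs.items())


def fire(marking: Marking, t: Transition) -> Marking:
    """Return the new marking after firing t; the old marking is left untouched."""
    if not enabled(marking, t):
        raise ValueError(f"transition {t.name!r} is not enabled")
    new_marking = dict(marking)
    for place, n in t.inputs.items():
        new_marking[place] = new_marking.get(place, 0) - n
    for place, n in t.outputs.items():
        new_marking[place] = new_marking.get(place, 0) + n
    return new_marking


# Hypothetical toy net: the gene and the mRNA act as catalysts, so they are
# both consumed and produced again by the transition that uses them.
transcription = Transition("transcription", {"gene": 1}, {"gene": 1, "mRNA": 1})
translation = Transition("translation", {"mRNA": 1}, {"mRNA": 1, "protein": 1})
degradation = Transition("mRNA degradation", {"mRNA": 1}, {})

m0 = {"gene": 1, "mRNA": 0, "protein": 0}
m1 = fire(m0, transcription)  # one mRNA token appears
m2 = fire(m1, translation)    # one protein token appears
m3 = fire(m2, degradation)    # the mRNA token is removed
print(m3, enabled(m3, translation))
```

Returning a new marking from `fire` instead of mutating the old one keeps each state of the system available for inspection, which is convenient when tracing a sequence of updates.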
Although they have not yet been extensively explored for GRNs, Petri nets are therefore a very promising tool for advancing GRN studies in the future [107, 108, 110–112].

However, standard Petri nets are not composable, which limits their application in modelling large-scale networks. This issue was resolved when Petri nets were put into the framework of category theory and the concept of “open Petri nets” was established. The basic idea of an open Petri net is to designate certain places in a Petri net as inputs or outputs, through which tokens are allowed to flow into or out of the net, thus making the net open. The input and output sets are objects and the open Petri net is a morphism between them, and together they form a symmetric monoidal category in which both sequential and parallel composition are available [113, 114]. If open Petri nets could be used for GRNs, they would help to model large or multi-scale GRNs. A rudimentary step has been taken in that direction, while more in-depth work is expected [115].

A third bridge takes the ontology viewpoint. In order to enable knowledge transfer among different species, the Gene Ontology (GO) was created to provide a controlled vocabulary for the annotation of gene functions [116]. Since its birth, GO has grown into a key tool for unifying biology. As of Aug 2020, the Gene Ontology database (http://geneontology.org/) has collected 44,262 GO terms and 8,047,076 annotations, covering 1,556,208 gene products from 4,643 species [117].

Currently GO is represented in the Web Ontology Language (OWL), a semantic web standard established by the World Wide Web Consortium (W3C) [118, 119]. Although OWL is expressive, flexible and efficient, it is relatively limited in representing knowledge that is not binary, and its scalability is limited [119]. As the GO database hosts a huge amount of information, and as the GO terms have intricate connections, a more powerful ontology language is needed.

Fortunately, as mentioned above, the ontology log (olog), an ontology language based upon category theory, has been devised [120]. Basically, each olog is a category in which objects and morphisms are called types and aspects. A type in an olog represents an abstract concept such as “a gene”, and an aspect from type X to type Y denotes a way of viewing X as Y. The aspects are functional relationships, so they can be composed. Built on these simple blocks, a rich set of structures and relationships can be expressed rigorously, making the olog an ideal tool for knowledge representation [59, 120]. The merits of ologs justify an effort to try them for gene ontology, as has already been done [121]. Figure 4 shows a sample gene olog, which gives a very brief description of the gene FoxP, a gene associated with language in humans.

Figure 4: A simple gene olog for FoxP

At this preliminary stage, all three topics are only touched on superficially, but there are avenues for going deeper. For example, in the DisCoCat for protein project, a pregroup grammar description for protein domains is still lacking, and it could be filled in by systematically analyzing the domain combinations of proteins. Also, string diagrams could be introduced to formalize and visualize the calculation. In addition, once we know how to calculate the function of a protein (equivalent to the meaning of a sentence), we could then lift from DisCoCat to DisCoCirc, that is, from sentence to text [122]. In the language of life, that corresponds to going from gene to genome, which is exactly what we set out to achieve.

The real point, however, is that the genome can be modelled or viewed through three such different lenses, and applied category theory is able to provide all of these frameworks. Moreover, this is just part of the story: there are many more intersections between genomics and applied category theory yet to be discovered. The opportunities for categorical genomics are abundant.
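To make the first bridge more tangible, the following Python sketch shows the kind of calculation a DisCoCat-style model performs: the meaning of a "sentence" is obtained by contracting the tensor of a relational word with the vectors of its arguments. Everything specific here is an assumption made for illustration. The text notes that a pregroup grammar for protein domains is still lacking, so the sketch simply supposes a hypothetical central domain that behaves like a transitive verb (a tensor with one index for each flanking domain and one for the resulting function), and all vectors and tensor entries are made-up numbers.

```python
from typing import List

Vector = List[float]
Tensor3 = List[List[List[float]]]  # indexed as verb[i][s][j]


def sentence_meaning(subj: Vector, verb: Tensor3, obj: Vector) -> Vector:
    """Contract subj[i] * verb[i][s][j] * obj[j] over i and j, keeping index s.

    This mirrors the reduction of the transitive-sentence type to the
    sentence type s: the two outer indices are contracted and s remains.
    """
    s_dim = len(verb[0])
    return [
        sum(
            subj[i] * verb[i][s][j] * obj[j]
            for i in range(len(subj))
            for j in range(len(obj))
        )
        for s in range(s_dim)
    ]


# Hypothetical 2-dimensional "domain" space and 2-dimensional "function" space.
left_domain = [1.0, 0.0]
right_domain = [0.0, 1.0]
other_left_domain = [0.0, 1.0]

# Hypothetical tensor for a central domain; central_domain[i][s][j] weights how
# flanking features i and j contribute to function feature s.
central_domain = [
    [[0.0, 1.0], [1.0, 0.0]],
    [[0.5, 0.0], [0.0, 0.5]],
]

function_a = sentence_meaning(left_domain, central_domain, right_domain)
function_b = sentence_meaning(other_left_domain, central_domain, right_domain)
print(function_a, function_b)
```

Swapping only the left-hand vector changes the resulting function vector, which is the compositional behaviour the language analogy relies on: the same domain "vocabulary" yields different protein "meanings" depending on how the parts are combined.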
In this section, we will build a category for genomics from scratch. First, we review the definition of a category [123, 124]. A category consists of the following data:
• a collection of objects: $X, Y, Z, \ldots$
• a collection of morphisms: $f, g, h, \ldots$
• for each morphism $f$, there are specified objects called the domain and codomain of $f$; the notation $f : X \to Y$ indicates that $X$ is the domain of $f$ and $Y$ is the codomain.
• given morphisms $f : X \to Y$ and $g : Y \to Z$, that is, the domain of $g$ is the same as the codomain of $f$, there is a morphism $g \circ f : X \to Z$ called the composite of $f$ and $g$.
• for each object $X$, there is a given morphism $1_X : X \to X$ called the identity morphism of $X$.

These data are required to satisfy the following laws:
• Associativity: $h \circ (g \circ f) = (h \circ g) \circ f$ for all $f : X \to Y$, $g : Y \to Z$, $h : Z \to W$.
• Unit: $f \circ 1_X = f = 1_Y \circ f$ for all $f : X \to Y$.

According to this definition, the most straightforward categorical construction in genomics seems to be a category of genes, where the objects are genes and the morphisms are relationships between them. However, the actual definition of such morphisms can be tricky because the relationships between genes are intrinsically complex. As a starting point, we define a pre-order relation between genes based on their positions on the chromosome. Concretely, for a given genome, we first order all its constituent chromosomes according to their standard nomenclature. Then, for all the genes on a given chromosome, we order them according to their start positions if the genes are on the sense strand, and according to their end positions if they are on the anti-sense strand. If two genes have the same start position, the one that finishes earlier (the shorter one) is ordered ahead of the other. If gene A is ordered before gene B, we write $A \le B$.
The order of chromosomes is denoted in the same way, using ≤.

To illustrate the ordering scheme, we take the fruit fly genome as an example. Figure 5 depicts the components of a fly genome.

Figure 5: Drosophila melanogaster chromosomes (diagram created by Steven J. Baskauf)

We can see that fruit flies have five distinct chromosomes, which can be ordered as Chr.X ≤ Chr.Y ≤ Chr.2 ≤ Chr.3 ≤ Chr.4. According to our rules, all the genes on Chr.X are ordered before those on Chr.Y, which in turn are placed before those on Chr.2, and so on. Figure 6 shows the ordering of some hypothetical genes on Chr.X and Chr.2. In this example, gene A ranks the "lowest" as it is on chromosome X. Gene B and gene C start at the same point while gene C finishes earlier, so gene C sits before gene B in the ordered list. As for gene E, its start point lies after that of gene D; however, gene E is on the anti-sense strand, so its end position is used instead, and it therefore ranks before gene D.

Figure 6: Ordering of hypothetical sample genes

Finally, we give a sketch of the construction of a category of genes based on the pre-order relation explained above. It is expected that more variants of categories of genes will be devised in the future.

The category of genes G for a specific species consists of the following components:
• All the genes of that species as objects.
• The pre-order relation between genes as morphisms, denoted as "≤". The pre-order is defined by the rules:
(1) the chromosomes are ordered according to the normal convention
(2) if chromosome I is ordered before chromosome II, then all genes on I are ordered before all genes on II
(3) on the same chromosome, genes are ordered according to their start positions
(4) if a gene is on the anti-sense strand, the stop position is taken for ordering instead of the start position
(5) if two genes start at the same position, the shorter gene is ordered before the longer one.
• Each gene A is related to itself (A ≤ A), denoted as id_A; this is the identity morphism on the object A.
• If three genes A, B, and C satisfy A ≤ B and B ≤ C, then A ≤ C; this is the composition of morphisms.

The associativity and unit laws hold automatically for the pre-order relation on G, because there is at most one morphism between any two genes.

Interestingly, there is an important "dual" notion in category theory, where we get a dual/opposite category with the same objects as the original one but all morphisms reversed [123,124]. The dual category of G is named G^op, and it takes care of the anti-sense strand in reality.

Applying category theory to genomics is an exciting new field to explore. We previously made some efforts from three different perspectives. However, it seems necessary to explicitly propose categorical genomics as a field of its own, so that those individual puzzle pieces can be incorporated into a big picture and the gaps revealed for future adventures. More importantly, with a dedicated name and its significance explained, we anticipate that this new field could draw attention from both category theorists and genomicists, and allow them to work together to uncover the unknown organizational principles of the genome, the marvel of life (itself being the marvel of earth).
The author would like to thank Quanlong Wang for introducing category theory to her. Sincere appreciation also goes to Richard Southwell, Eugenia Cheng, John Baez, Bob Coecke, David Spivak, Brendan Fong, Bartosz Milewski, Emily Reilly, Steve Awodey, Tai-Danae Bradley and many others for their great efforts in making category theory accessible to a more general audience than mathematicians alone.
References
[1] Simon Mawer. DNA and the meaning of life, apr 2003.
[2] Gregor Mendel. Experiments in plant hybridization (1865). Technical report, 1996.
[3] James D Watson and Francis H C Crick. Molecular structure of nucleic acids.
Nature , 171(4356):737–738, 1953.[4] J. Craig Venter, M. D. Adams, E. W. Myers, P. W. Li, R. J. Mural,G. G. Sutton, H. O. Smith, M. Yandell, C. A. Evans, R. A. Holt, J. D.Gocayne, P. Amanatides, R. M. Ballew, D. H. Huson, J. R. Wortman,Q. Zhang, C. D. Kodira, X. H. Zheng, L. Chen, M. Skupski, G. Sub-ramanian, P. D. Thomas, J. Zhang, G. L. Gabor Miklos, C. Nelson,S. Broder, A. G. Clark, J. Nadeau, V. A. McKusick, N. Zinder, A. J.Levine, R. J. Roberts, M. Simon, C. Slayman, M. Hunkapiller, R. Bolanos,A. Delcher, I. Dew, D. Fasulo, M. Flanigan, L. Florea, A. Halpern,S. Hannenhalli, S. Kravitz, S. Levy, C. Mobarry, K. Reinert, K. Rem-ington, J. Abu-Threideh, E. Beasley, K. Biddick, V. Bonazzi, R. Bran-don, M. Cargill, I. Chandramouliswaran, R. Charlab, K. Chaturvedi,Z. Deng, V. di Francesco, P. Dunn, K. Eilbeck, C. Evangelista, A. E.Gabrielian, W. Gan, W. Ge, F. Gong, Z. Gu, P. Guan, T. J. Heiman,M. E. Higgins, R. R. Ji, Z. Ke, K. A. Ketchum, Z. Lai, Y. Lei, Z. Li, J. Li,Y. Liang, X. Lin, F. Lu, G. V. Merkulov, N. Milshina, H. M. Moore, A. K.Naik, V. A. Narayan, B. Neelam, D. Nusskern, D. B. Rusch, S. Salzberg,W. Shao, B. Shue, J. Sun, Z. Yuan Wang, A. Wang, X. Wang, J. Wang,M. H. Wei, R. Wides, C. Xiao, C. Yan, A. Yao, J. Ye, M. Zhan, W. Zhang,H. Zhang, Q. Zhao, L. Zheng, F. Zhong, W. Zhong, S. C. Zhu, S. Zhao,D. Gilbert, S. Baumhueter, G. Spier, C. Carter, A. Cravchik, T. Woodage,F. Ali, H. An, A. Awe, D. Baldwin, H. Baden, M. Barnstead, I. Barrow,K. Beeson, D. Busam, A. Carver, A. Center, M. Lai Cheng, L. Curry,S. Danaher, L. Davenport, R. Desilets, S. Dietz, K. Dodson, L. Doup,S. Ferriera, N. Garg, A. Gluecksmann, B. Hart, J. Haynes, C. Haynes,C. Heiner, S. Hladun, D. Hostin, J. Houck, T. Howland, C. Ibegwam,J. Johnson, F. Kalush, L. Kline, S. Koduru, A. Love, F. Mann, D. May,14. McCawley, T. McIntosh, I. McMullen, M. Moy, L. Moy, B. Murphy,K. Nelson, C. Pfannkoch, E. Pratts, V. Puri, H. Qureshi, M. Reardon,R. Rodriguez, Yu H. Rogers, D. Romblad, B. Ruhfel, R. 
Scott, C. Sit-ter, M. Smallwood, E. Stewart, R. Strong, E. Suh, R. Thomas, N. NiTint, S. Tse, C. Vech, G. Wang, J. Wetter, S. Williams, M. Williams,S. Windsor, E. Winn-Deen, K. Wolfe, J. Zaveri, K. Zaveri, J. F. Abril,R. Guigo, M. J. Campbell, K. V. Sjolander, B. Karlak, A. Kejariwal,H. Mi, B. Lazareva, T. Hatton, A. Narechania, K. Diemer, A. Muru-ganujan, N. Guo, S. Sato, V. Bafna, S. Istrail, R. Lippert, R. Schwartz,B. Walenz, S. Yooseph, D. Allen, A. Basu, J. Baxendale, L. Blick, M. Cam-inha, J. Carnes-Stine, P. Caulk, Y. H. Chiang, M. Coyne, C. Dahlke,A. Deslattes Mays, M. Dombroski, M. Donnelly, D. Ely, S. Esparham,C. Fosler, H. Gire, S. Glanowski, K. Glasser, A. Glodek, M. Gorokhov,K. Graham, B. Gropman, M. Harris, J. Heil, S. Henderson, J. Hoover,D. Jennings, C. Jordan, J. Jordan, J. Kasha, L. Kagan, C. Kraft, A. Levit-sky, M. Lewis, X. Liu, J. Lopez, D. Ma, W. Majoros, J. McDaniel, S. Mur-phy, M. Newman, T. Nguyen, N. Nguyen, M. Nodell, S. Pan, J. Peck,M. Peterson, W. Rowe, R. Sanders, J. Scott, M. Simpson, T. Smith,A. Sprague, T. Stockwell, R. Turner, E. Venter, M. Wang, M. Wen, D. Wu,M. Wu, A. Xia, A. Zandieh, and X. Zhu. The sequence of the humangenome.
Science , 291(5507):1304–1351, feb 2001.[5] R. David Hawkins, Gary C. Hon, and Bing Ren. Next-generation ge-nomics: An integrative approach, jun 2010.[6] Computational Tools, 2018.[7] Deep learning for genomics.
Nature Genetics , 51(1):1–1, jan 2019.[8] Samuel Eilenberg and Saunders MacLane. General theory of natu-ral equivalences.
Transactions of the American Mathematical Society ,58(2):231–294, 1945.[9] David I. Spivak, Tristan Giesa, Elizabeth Wood, and Markus J. Buehler.Category theoretic analysis of hierarchical protein materials and socialnetworks.
PLoS ONE , 6(9), 2011.[10] Giandomenico Sica.
What is category theory?
WHO , 2016.[14] Ernst Posner and J Skutil. The great neglect: the fate of Mendel’s classicpaper between 1865 and 1900.
Medical History , 12(2):122–136, 1968.[15] Thomas Hunt Morgan.
Sex-linked inheritance in Drosophila. Number 237. Carnegie Institution of Washington, 1916.
[16] Ilona Miko. Thomas Hunt Morgan and sex linkage.
Nature Education ,1(1):143, 2008.[17] Oswald T Avery, Colin M MacLeod, and Maclyn McCarty. Studies onthe chemical nature of the substance inducing transformation of pneu-mococcal types: induction of transformation by a desoxyribonucleic acidfraction isolated from pneumococcus type III.
The Journal of experimentalmedicine , 79(2):137–158, 1944.[18] The Discovery of the Double Helix, 1951-1953 — Francis Crick - Profilesin Science, https://profiles.nlm.nih.gov/spotlight/sc/feature/doublehelix.[19] Marshall W Nirenberg. The genetic code.
Scientific American , 208(3):80–95, 1963.[20] MARSHALL Nirenberg. Deciphering the genetic code.
Office of NIHHistory , 2010.[21] F Sanger, S Nicklen, and A R Coulson. DNA sequencing with chain-terminating inhibitors.
Proceedings of the National Academy of Sciencesof the United States of America , 74(12):5463–7, dec 1977.[22] Sam Behjati and Patrick S. Tarpey. What is next generation sequenc-ing?
Archives of Disease in Childhood: Education and Practice Edition ,98(6):236–238, dec 2013.[23] T Hubbard, D Barker, E Birney, G Cameron, Y Chen, L Clark, T Cox,J Cuff, V Curwen, T Down, R Durbin, E Eyras, J Gilbert, M Ham-mond, L Huminiecki, A Kasprzyk, H Lehvaslaiho, P Lijnzaad, C Melsopp,E Mongin, R Pettett, M Pocock, S Potter, A Rust, E Schmidt, S Searle,G Slater, J Smith, W Spooner, A Stabenau, J Stalker, E Stupka, A Ureta-Vidal, I Vastrik, and M Clamp. The Ensembl genome database project.Technical Report 1, 2002.[24] W. J. Kent, C. W. Sugnet, T. S. Furey, K. M. Roskin, T. H. Pringle,A. M. Zahler, and a. D. Haussler. The Human Genome Browser at UCSC.
Genome Research , 12(6):996–1006, may 2002.[25] Andrew D Yates, Premanand Achuthan, Wasiu Akanni, James Allen,Jamie Allen, Jorge Alvarez-Jarreta, M Ridwan Amode, Irina M Armean,Andrey G Azov, Ruth Bennett, Jyothish Bhai, Konstantinos Billis, San-jay Boddu, Jos´e Jos, Jos´e Carlos, Marug´an Marug Marug´an, Carla Cum-mins, Claire Davidson, Kamalkumar Dodiya, Reham Fatima, Astrid Gall,Carlos Garcia Giron, Laurent Gil, Tiago Grego, Leanne Haggerty, ErinHaskell, Thibaut Hourlier, Osagie G Izuogu, Sophie H Janacek, ThomasJuettemann, Mike Kay, Ilias Lavidas, Tuan Le, Diana Lemos, Jose Gonza-lez Martinez, Thomas Maurel, Mark Mcdowall, Aoife Mcmahon, ShamikaMohanan, Benjamin Moore, Michael Nuhn, Denye N Oheh, Anne Parker,Andrew Parton, Mateus Patricio, Pandian Sakthivel, Ahamed Imran, Ab-dul Salam, Bianca M Schmitt, Helen Schuilenburg, Dan Sheppard, MiraSycheva, Marek Szuba, Kieron Taylor, Anja Thormann, Glen Thread-gold, Alessandro Vullo, Brandon Walts, Andrea Winterbottom, AmonidaZadissa, Marc Chakiachvili, Bethany Flint, Adam Frankish, Sarah E16unt, Garth Iisley, Myrto Kostadima, Nick Langridge, Jane E Love-land, Fergal J Martin, Joannella Morales, Jonathan M Mudge, MatthieuMuffato, Emily Perry, Magali Ruffier, Stephen J Trevanion, Fiona Cun-ningham, Kevin L Howe, Daniel R Zerbino, and Paul Flicek. Ensembl2020.
Nucleic Acids Research , 48, 2020.[26] M D Adams, S E Celniker, R A Holt, C A Evans, J D Gocayne, P G Ama-natides, S E Scherer, P W Li, R A Hoskins, R F Galle, R A George, S ELewis, S Richards, M Ashburner, S N Henderson, G G Sutton, J R Wort-man, M D Yandell, Q Zhang, L X Chen, R C Brandon, Y H Rogers, R GBlazej, M Champe, B D Pfeiffer, K H Wan, C Doyle, E G Baxter, G Helt,C R Nelson, G L Gabor, J F Abril, A Agbayani, H J An, C Andrews-Pfannkoch, D Baldwin, R M Ballew, A Basu, J Baxendale, L Bayrak-taroglu, E M Beasley, K Y Beeson, P V Benos, B P Berman, D Bhandari,S Bolshakov, D Borkova, M R Botchan, J Bouck, P Brokstein, P Brot-tier, K C Burtis, D A Busam, H Butler, E Cadieu, A Center, I Chandra,J M Cherry, S Cawley, C Dahlke, L B Davenport, P Davies, B de Pablos,A Delcher, Z Deng, A D Mays, I Dew, S M Dietz, K Dodson, L E Doup,M Downes, S Dugan-Rocha, B C Dunkov, P Dunn, K J Durbin, C C Evan-gelista, C Ferraz, S Ferriera, W Fleischmann, C Fosler, A E Gabrielian,N S Garg, W M Gelbart, K Glasser, A Glodek, F Gong, J H Gorrell, Z Gu,P Guan, M Harris, N L Harris, D Harvey, T J Heiman, J R Hernandez,J Houck, D Hostin, K A Houston, T J Howland, M H Wei, C Ibegwam,M Jalali, F Kalush, G H Karpen, Z Ke, J A Kennison, K A Ketchum, B EKimmel, C D Kodira, C Kraft, S Kravitz, D Kulp, Z Lai, P Lasko, Y Lei,A A Levitsky, J Li, Z Li, Y Liang, X Lin, X Liu, B Mattei, T C McIntosh,M P McLeod, D McPherson, G Merkulov, N V Milshina, C Mobarry,J Morris, A Moshrefi, S M Mount, M Moy, B Murphy, L Murphy, D MMuzny, D L Nelson, D R Nelson, K A Nelson, K Nixon, D R Nusskern,J M Pacleb, M Palazzolo, G S Pittman, S Pan, J Pollard, V Puri, M GReese, K Reinert, K Remington, R D Saunders, F Scheeler, H Shen, B CShue, I Sid´en-Kiamos, M Simpson, M P Skupski, T Smith, E Spier, A CSpradling, M Stapleton, R Strong, E Sun, R Svirskas, C Tector, R Turner,E Venter, A H Wang, X Wang, Z Y Wang, D A Wassarman, G M Wein-stock, J Weissenbach, S M Williams, Sherita M. 
WoodageT, K C Worley,D Wu, S Yang, Q A Yao, J Ye, R F Yeh, J S Zaveri, M Zhan, G Zhang,Q Zhao, L Zheng, X H Zheng, F N Zhong, W Zhong, X Zhou, S Zhu,X Zhu, H O Smith, R A Gibbs, E W Myers, G M Rubin, J C Venter,and J. Craig Venter. The genome sequence of Drosophila melanogaster.
Science (New York, N.Y.)
Cell , 161(5):1202–1214, 2015.[37] Grace X.Y. Zheng, Jessica M. Terry, Phillip Belgrader, Paul Ryvkin,Zachary W. Bent, Ryan Wilson, Solongo B. Ziraldo, Tobias D. Wheeler,Geoff P. McDermott, Junjie Zhu, Mark T. Gregory, Joe Shuga, LuzMontesclaros, Jason G. Underwood, Donald A. Masquelier, Stefanie Y.Nishimura, Michael Schnall-Levin, Paul W. Wyatt, Christopher M. Hind-son, Rajiv Bharadwaj, Alexander Wong, Kevin D. Ness, Lan W. Beppu,H. Joachim Deeg, Christopher McFarland, Keith R. Loeb, William J. Va-lente, Nolan G. Ericson, Emily A. Stevens, Jerald P. Radich, Tarjei S.Mikkelsen, Benjamin J. Hindson, and Jason H. Bielas. Massively paralleldigital transcriptional profiling of single cells.
Nature Communications
NatureGenetics
Nature Reviews Genetics , 19(12):789–800, dec 2018.[54] Rieke Kempfer and Ana Pombo. Methods for mapping 3Dchromosomearchitecture, apr 2020.[55] F. William Lawvere. The Category of Categories as a Foundation forMathematics. In
Proceedings of the Conference on Categorical Algebra ,pages 1–20. Springer Berlin Heidelberg, 1966.[56] Bob Coecke. Introducing categories to the practicing physicist. aug 2008.[57] Bob Coecke and Aleks Kissinger.
Picturing quantum processes: A firstcourse in quantum theory and diagrammatic reasoning . Cambridge Uni-versity Press, mar 2017.[58] John Baez, Mike Stay, and Google. Physics Topology Logic andComputation-A Rosetta Stone. pages 1–73, 2009.[59] David I Spivak.
Category theory for the sciences arXiv preprint arXiv:1803.05316
Bulletin of Mathematical Biophysics , 20:317–341, 1958.[66] R. Rosen. A Relational Theory of Biological Systems’.
Bulletin of Math-ematical Biophysics , 20:245–260, 1958.[67] R. Rosen. Some Realizations of (M,R)-Systems and Their Interpretation’.
Bulletin of Mathematical Biophysics , 33:303–319, 1971.[68] Robert Rosen.
Life itself : a comprehensive inquiry into the nature, origin,and fabrication of life . Columbia University Press, 1991.[69] R. Rosen. Essays on Life Itself, 1999.[70] I. C. Baianu. Robert Rosen’s work and complex systems biology.
Ax-iomathes , 16(1-2):25–34, mar 2006.[71] A Carbone and M Gromov. A mathematical slices of molecular biology,Supplement to volume 88 of Gazette des Math´ematiciens, French Math.
Soc.(SMF), Paris , 2001.[72] Jitsuki Sawamura, Shigeru Morishita, and Jun Ishigooka. A symmetrymodel for genetic coding via a wallpaper group composed of the traditionalfour bases and an imaginary base E: Towards category theory-like system-atization of molecular/genetic biology.
Theoretical Biology and MedicalModelling , 11(1):18, may 2014.[73] R´emy Tuy´eras. Category Theory for Genetics. pages 1–41, 2017.[74] R´emy Tuy´eras. Category theory for genetics I: mutations and sequencealignments. may 2018.[75] R´emy Tuy´eras. Category theory for genetics II: genotype, phenotype andhaplotype. may 2018.[76] Andr´ee Charles Ehresmann and Jean-Paul Vanbremeersch.
Memory evolutive systems; hierarchy, emergence, cognition. Elsevier, 2007.
[77] Ronald Brown. Review: Memory Evolutive Systems, sep 2009.
[78] Ronald Brown and Timothy Porter. Category theory and higher dimensional algebra: potential descriptive tools in neuroscience. arXiv preprint math/0306223, 2003.
[79] Chris Heunen, Mehrnoosh Sadrzadeh, and Edward Grefenstette.
Order,Composition, Processes. In Quantum physics and linguistics: a composi-tional, diagrammatic discourse . Oxford University Press, 2013.[80] Bob Coecke, Ross Duncan, Aleks Kissinger, and Quanlong Wang. Gen-eralised compositional theories and diagrammatic reasoning. In
QuantumTheory: Informational Foundations and Foils , pages 309–366. Springer,2016.[81] John Makeham.
Transforming consciousness: Yogācāra thought in modern China. Oxford University Press, USA, 2014.
[82] Camilo Miguel Signorelli, Quanlong Wang, and Ilyas Khan. A Compositional Model of Consciousness based on Consciousness-Only. jul 2020.
[83] Giulio Tononi. An information integration theory of consciousness. BMC Neuroscience, 5(1):1–22, 2004.
[84] Sean Tull and Johannes Kleiner. Integrated Information in Process Theories. feb 2020.
[85] Daniel Strain. 8.7 million: A new estimate for all the complex species on Earth, 2011.
[86] Stefano Ceri, Abdulrahman Kaitoua, Marco Masseroli, Pietro Pinoli, and Francesco Venco. IEEE/ACM Transactions on Computational Biology and Bioinformatics 1 Data Management for Heterogeneous Genomic Datasets. Technical report.
[87] Marco Masseroli, Arif Canakoglu, Pietro Pinoli, Abdulrahman Kaitoua, Andrea Gulino, Olha Horlova, Luca Nanni, Anna Bernasconi, Stefano Perna, Eirini Stamoulakatou, and Stefano Ceri. Processing of big heterogeneous genomic datasets for tertiary analysis of Next Generation Sequencing data.
[88] Johan H. Gibcus and Job Dekker. The Hierarchy of the 3D Genome, mar 2013.
[89] Lin An, Tao Yang, Jiahao Yang, Johannes Nuebler, Guanjue Xiang, Ross C. Hardison, Qunhua Li, and Yu Zhang. OnTAD: Hierarchical domain structure reveals the divergence of activity among TADs and boundaries.
Genome Biology , 20(1), dec 2019.[90] Jesse R. Dixon, Siddarth Selvaraj, Feng Yue, Audrey Kim, Yan Li, YinShen, Ming Hu, Jun S. Liu, and Bing Ren. Topological domains in mam-malian genomes identified by analysis of chromatin interactions.
Nature ,485(7398):376–380, may 2012.[91] Rok Grah and Tamar Friedlander. The relation between crosstalk and generegulation form revisited.
PLoS Computational Biology, 16(2):e1007642, 2020.
[92] W Doerfler. In search of more complex genetic codes - can linguistics be a guide? Technical report, 1982.
[93] V. Brendel and H. G. Busse. Genome structure described by formal languages.
Nucleic Acids Research , 12(5):2561–2568, 1984.[94] B Searls. The Linguistics of DNA.
American Scientist , 80(6):579–591,1992.[95] David B Searls. The Computational Linguistics of Biological Sequences.In
Artificial Intelligence and Molecular Biology , chapter 2, pages 47–121.The MIT Press Classics Series and AAAI press, Cambridge, USA, 1993).,1993.[96] D B Searls. Reading the book of life.
Bioinformatics (Oxford, England) ,17(7):579–80, jul 2001.[97] David B. Searls. The language of genes, nov 2002.[98] David B. Searls. Linguistics: Trees of life and of language.
Nature ,426(6965):391–392, nov 2003.[99] David B. Searls. Review: A primer in macromolecular linguistics.
Biopoly-mers , 99(3):203–217, 2013.[100] Bob Coecke, Mehrnoosh Sadrzadeh, and Stephen Clark. MathematicalFoundations for a Compositional Distributional Model of Meaning. 2010.[101] Yanying Wu and Quanlong Wang. A Categorical Compositional Distribu-tional Modelling for the Language of Life, arXiv:1902.09303 [q-bio.QM].2019.[102] Mario Gimona. Protein linguistics - A grammar for modular protein as-sembly?
Nature Reviews Molecular Cell Biology , 7(1):68–73, 2006.[103] Yaakov Levy. Protein Assembly and Building Blocks: Beyond the Limitsof the LEGO Brick Metaphor. 2017.[104] Predrag Radivojac, Wyatt T. Clark, Tal Ronnen Oron, Alexandra M.Schnoes, Tobias Wittkop, Artem Sokolov, Kiley Graim, ChristopherFunk, Karin Verspoor, Asa Ben-Hur, Gaurav Pandey, Jeffrey M. Yunes,Ameet S. Talwalkar, Susanna Repo, Michael L. Souza, Damiano Pi-ovesan, Rita Casadio, Zheng Wang, Jianlin Cheng, Hai Fang, JulianGough, Patrik Koskinen, Petri T¨or¨onen, Jussi Nokso-Koivisto, LiisaHolm, Domenico Cozzetto, Daniel W.A. Buchan, Kevin Bryson, David T.Jones, Bhakti Limaye, Harshal Inamdar, Avik Datta, Sunitha K. Manjari,Rajendra Joshi, Meghana Chitale, Daisuke Kihara, Andreas M. Lisewski,Serkan Erdin, Eric Venner, Olivier Lichtarge, Robert Rentzsch, Haix-uan Yang, Alfonso E. Romero, Prajwal Bhat, Alberto Paccanaro, TobiasHamp, Rebecca Kaßner, Stefan Seemayer, Esmeralda Vicedo, ChristianSchaefer, Dominik Achten, Florian Auer, Ariane Boehm, Tatjana Braun,Maximilian Hecht, Mark Heron, Peter H¨onigschmid, Thomas A. Hopf,Stefanie Kaufmann, Michael Kiening, Denis Krompass, Cedric Landerer,Yannick Mahlich, Manfred Roos, Jari Bj¨orne, Tapio Salakoski, AndrewWong, Hagit Shatkay, Fanny Gatzmann, Ingolf Sommer, Mark N. Wass,Michael J.E. Sternberg, Nives ˇSkunca, Fran Supek, Matko Boˇsnjak, PanˇcePanov, Saˇso Dˇzeroski, Tomislav ˇSmuc, Yiannis A.I. Kourmpetis, Aalt D.J.22an Dijk, Cajo J.F. Ter Braak, Yuanpeng Zhou, Qingtian Gong, Xin-ran Dong, Weidong Tian, Marco Falda, Paolo Fontana, Enrico Lavezzo,Barbara Di Camillo, Stefano Toppo, Liang Lan, Nemanja Djuric, YuhongGuo, Slobodan Vucetic, Amos Bairoch, Michal Linial, Patricia C. Babbitt,Steven E. Brenner, Christine Orengo, Burkhard Rost, Sean D. Mooney,and Iddo Friedberg. A large-scale evaluation of computational proteinfunction prediction.
Nature Methods , 10(3):221–227, mar 2013.[105] Naihui Zhou, Yuxiang Jiang, Timothy R. Bergquist, Alexandra J.Lee, Balint Z. Kacsoh, Alex W. Crocker, Kimberley A. Lewis, GeorgeGeorghiou, Huy N. Nguyen, Md Nafiz Hamid, Larry Davis, Tunca Dogan,Volkan Atalay, Ahmet S. Rifaioglu, Alperen Dalklran, Rengul Cetin Ata-lay, Chengxin Zhang, Rebecca L. Hurto, Peter L. Freddolino, Yang Zhang,Prajwal Bhat, Fran Supek, Jos´e M. Fern´andez, Branislava Gemovic,Vladimir R. Perovic, Radoslav S. Davidovi´c, Neven Sumonja, NevenaVeljkovic, Ehsaneddin Asgari, Mohammad R.K. Mofrad, Giuseppe Profiti,Castrense Savojardo, Pier Luigi Martelli, Rita Casadio, Florian Boecker,Heiko Schoof, Indika Kahanda, Natalie Thurlby, Alice C. McHardy,Alexandre Renaux, Rabie Saidi, Julian Gough, Alex A. Freitas, Mag-dalena Antczak, Fabio Fabris, Mark N. Wass, Jie Hou, Jianlin Cheng,Zheng Wang, Alfonso E. Romero, Alberto Paccanaro, Haixuan Yang,Tatyana Goldberg, Chenguang Zhao, Liisa Holm, Petri T¨or¨onen, Alan J.Medlar, Elaine Zosa, Itamar Borukhov, Ilya Novikov, Angela Wilkins,Olivier Lichtarge, Po Han Chi, Wei Cheng Tseng, Michal Linial, Pe-ter W. Rose, Christophe Dessimoz, Vedrana Vidulin, Saso Dzeroski, IanSillitoe, Sayoni Das, Jonathan Gill Lees, David T. Jones, Cen Wan,Domenico Cozzetto, Rui Fa, Mateo Torres, Alex Warwick Vesztrocy,Jose Manuel Rodriguez, Michael L. Tress, Marco Frasca, Marco Notaro,Giuliano Grossi, Alessandro Petrini, Matteo Re, Giorgio Valentini, MarcoMesiti, Daniel B. Roche, Jonas Reeb, David W. Ritchie, Sabeur Aridhi,Seyed Ziaeddin Alborzi, Marie Dominique Devignes, Da Chen Emily Koo,Richard Bonneau, Vladimir Gligorijevi´c, Meet Barot, Hai Fang, Ste-fano Toppo, Enrico Lavezzo, Marco Falda, Michele Berselli, Silvio C.E.Tosatto, Marco Carraro, Damiano Piovesan, Hafeez Ur Rehman, QizhongMao, Shanshan Zhang, Slobodan Vucetic, Gage S. Black, Dane Jo, EricaSuh, Jonathan B. Dayton, Dallas J. Larsen, Ashton R. Omdahl, Liam J.McGuffin, Danielle A. Brackenridge, Patricia C. 
Babbitt, Jeffrey M.Yunes, Paolo Fontana, Feng Zhang, Shanfeng Zhu, Ronghui You, ZihanZhang, Suyang Dai, Shuwei Yao, Weidong Tian, Renzhi Cao, Caleb Chan-dler, Miguel Amezola, Devon Johnson, Jia Ming Chang, Wen Hung Liao,Yi Wei Liu, Stefano Pascarelli, Yotam Frank, Robert Hoehndorf, MaxatKulmanov, Imane Boudellioua, Gianfranco Politano, Stefano Di Carlo, Al-fredo Benso, Kai Hakala, Filip Ginter, Farrokh Mehryary, Suwisa Kaew-phan, Jari Bj¨orne, Hans Moen, Martti E.E. Tolvanen, Tapio Salakoski,Daisuke Kihara, Aashish Jain, Tomislav ˇSmuc, Adrian Altenhoff, AsaBen-Hur, Burkhard Rost, Steven E. Brenner, Christine A. Orengo, Con-stance J. Jeffery, Giovanni Bosco, Deborah A. Hogan, Maria J. Martin,Claire O’Donovan, Sean D. Mooney, Casey S. Greene, Predrag Radivo-jac, and Iddo Friedberg. The CAFA challenge reports improved proteinfunction prediction and new functional annotations for hundreds of genes23hrough experimental screens.
Genome Biology , 20(1):244, nov 2019.[106] Hidde de Jong. Modeling and Simulation of Genetic Regulatory Systems:A Literature Review.
Journal of Computational Biology , 9(1):67–103, jan2002.[107] L. J. Steggles, R. Banks, O. Shaw, and A. Wipat. Qualitatively mod-elling and analysing genetic regulatory networks: a Petri net approach.
Bioinformatics , 23(3):336–343, feb 2007.[108] Jure Bordon and Miha Mraz. Modeling gene regulatory networks usingPetri Nets, 2012.[109] Fernando M. Delgado and Francisco G´omez-Vela. Computational methodsfor Gene Regulatory Networks reconstruction and analysis: A review.
Artificial Intelligence in Medicine , 95:133–145, apr 2018.[110] Wolfgang. Reisig.
Petri Nets : an Introduction . Springer Berlin Heidel-berg, 1985.[111] T. Murata. Petri nets: Properties, analysis and applications.
Proceedingsof the IEEE , 77(4):541–580, apr 1989.[112] Claudine Chaouiya. Petri net modelling of biological networks.
Briefingsin bioinformatics , 8(4):210–219, 2007.[113] John C. Baez and Blake S. Pollard. A Compositional Framework forReaction Networks. apr 2017.[114] John C. Baez and Jade Master. Open Petri Nets. aug 2018.[115] Yanying Wu. An Open Petri Net Implementation of Gene RegulatoryNetworks, arXiv:1907.11316 [q-bio.MN]. 2019.[116] Michael Ashburner, Catherine A Ball, Judith A Blake, David Botstein,Heather Butler, J Michael Cherry, Allan P Davis, Kara Dolinski, Selina SDwight, Janan T Eppig, and Others. Gene ontology: tool for the unifica-tion of biology.
Nature genetics , 25(1):25, 2000.[117] TheGeneOntologyConsortium. The Gene Ontology Resource: 20 yearsand still GOing strong.
Nucleic Acids Research , 47(D1):D330–D338, jan2019.[118] Pascal. Hitzler, Markus. Krotzsch, and Sebastian (Computer scientist)Rudolph.
Foundations of Semantic Web technologies . CRC Press, 2010.[119] Janna Hastings. Primer on Ontologies. pages 3–13. 2017.[120] David I. Spivak and Robert E. Kent. Ologs: A categorical framework forknowledge representation.
PLoS ONE , 7(1):1–52, 2012.[121] Yanying Wu. Gene ologs: a categorical framework for Gene Ontology,arXiv:1909.11210 [q-bio.GN]. 2019.[122] Bob Coecke. The Mathematics of Text Structure.
The Theory and Practice of Discourse Parsing and Summarization, apr 2019.
[123] Steve Awodey. Category theory. Oxford University Press, 2010.
[124] Emily Riehl.