A Comparison of Microbial Genome Web Portals
Peter D. Karp, Natalia Ivanova, Markus Krummenacker, Nikos Kyrpides, Mario Latendresse, Peter Midford, Wai Kit Ong, Suzanne Paley, Rekha Seshadri
arXiv [q-bio.GN]
Bioinformatics Research Group, SRI International, Menlo Park, USA
DOE Joint Genome Institute, Walnut Creek
[email protected]
Abstract
Microbial genome web portals have a broad range of capabilities that address a number of information-finding and analysis needs for scientists. This article compares the capabilities of the major microbial genome web portals to aid researchers in determining which portal(s) are best suited to solving their information-finding and analytical needs. We assessed both the bioinformatics tools and the data content of BioCyc, KEGG, Ensembl Bacteria, KBase, IMG, and PATRIC. For each portal, our assessment compared and tallied the available capabilities. The strengths of BioCyc include its genomic and metabolic tools, multi-search capabilities, table-based analysis tools, regulatory network tools and data, omics data analysis tools, breadth of data content, and large amount of curated data. The strengths of KEGG include its genomic and metabolic tools. The strengths of Ensembl Bacteria include its genomic tools and large number of genomes. The strengths of KBase include its genomic tools and metabolic models. The strengths of IMG include its genomic tools, multi-search capabilities, large number of genomes, table-based analysis tools, and breadth of data content. The strengths of PATRIC include its large number of genomes, table-based analysis tools, metabolic models, and breadth of data content.

Summary of the Portals
Here we introduce each portal. Note that some portals have some capabilities that are not covered in this comparison. For each portal we provide a hyperlink to a sample gene page.
BioCyc
BioCyc [2,8] is a microbial genome web portal that integrates sequenced genomes with curated information from the biological literature, with information imported from other biological DBs, and with computational inferences. BioCyc data include metabolic pathways, regulatory networks, and gene essentiality data. BioCyc provides extensive query and visualization tools, as well as tools for omics data analysis, metabolic path searching, and for running metabolic models. We omit discussion of many BioCyc comparative genomics and metabolic operations under its Analysis → Comparative Analysis menu. Scientists can use the Pathway Tools software associated with BioCyc to perform metabolic reconstructions and create BioCyc-like DBs for in-house genome data.

BioCyc contains information curated from 89,500 publications. The curated information includes experimentally determined gene functions and Gene Ontology terms, experimentally studied metabolic pathways, and experimentally determined parameters such as enzyme kinetics data and enzyme activators and inhibitors. Curated information also includes textual mini-reviews that summarize information about genes, pathways, and regulation, with citations to the primary literature. The large amount of curated information within BioCyc is unique with respect to other genome portals.

Home page: https://biocyc.org/
Sample gene page: https://biocyc.org/gene?orgid=ECOLI&id=EG10823
KEGG
Ensembl Bacteria
Ensembl Bacteria is a portal for bacterial and archaeal genomes. It does not have any data or tools for metabolism, pathways, or compounds, focusing instead on genes and proteins. Its strengths seem to be in its large collection of gene and protein family data. Its capabilities are somewhat different from other Ensembl sites. In addition to BLAST, it includes a hidden Markov model (HMM) search tool for protein motifs. Pan-taxonomic comparative tools are available for key species. It also includes Ensembl’s variant effect predictor, which can predict functional consequences of sequence variants.

Home page: https://bacteria.ensembl.org/
Sample gene page: https://bacteria.ensembl.org/Escherichia_coli_str_k_12_substr_mg1655/Gene/Summary?g=b2699;r=Chromosome:2822708-2823769;t=AAC75741;db=core
KBase
KBase is an environment for systems biology research that provides more than 160 applications to support user-driven analysis of a variety of data ranging from raw reads to fully assembled and annotated genomes, and metabolic models. In addition to its genome-portal capabilities, KBase [12] enables users to assemble and annotate genomes, to analyze transcriptomics data, and to create metabolic models for organisms with sequenced genomes. Once a model is created, it can be analyzed using phylogenetic, expression analysis, and comparative tools. KBase also allows users to integrate custom code into their analysis pipeline and enables addition of external applications by their developers using a software development kit (SDK). Its other major aim is to support reproducible computational experiments on models that can be published and shared with other users.

Home page: https://kbase.us/
Sample gene page: https://narrative.kbase.us/
IMG
The Integrated Microbial Genomes (IMG) system is a resource for annotation and analysis of sequence data, integrated with environmental and other metadata to support genome and microbiome comparisons. In addition to being the vehicle for release of the data generated by the DOE Joint Genome Institute, it provides a suite of analytical and visualization tools to explore and mine the data for biological inference. Custom data marts dedicated to specific research topics, such as synthesis of secondary metabolites (IMG-ABC) or viral eco-genomics (IMG/VR), are also included. Users can submit their own data and metadata for integration in the system.

Home page: https://img.jgi.doe.gov/
Sample gene page: https://img.jgi.doe.gov/cgi-bin/m/main.cgi?section=GeneDetail&page=geneDetail&gene_oid=646314661
PATRIC
We assessed the software and data content capabilities of each portal according to a number of topic areas, such as genomics-related tools and metabolism-related tools. We chose topic areas that we considered to be core elements of a microbial genome information portal, that is, a web site that counts among its primary missions providing users with data and knowledge regarding sequenced microbial genomes. A number of the portals contain functionality outside of that mission; for example, some portals contain software tools for annotating microbial genomes (e.g., performing assembly and gene-function prediction). We did not include such functionality because we considered it outside the scope of a microbial genome information portal. In many cases, we added new criteria within a topic area (meaning rows within our comparison tables) as we learned about each portal, such as adding the ability of Ensembl Bacteria to predict the effects of sequence variants. Our choice of criteria is validated by the fact that many of the criteria are shared among some or many of the portals.

For several of the topic areas, we provide multiple tables to assess software capabilities, with one or two tables focusing on DB search capabilities and another table focusing on other capabilities in that area. For example, Tables 2 and 3 describe genomics multi-search tools, and Table 1 describes other genomics software tools.

We attempted to be as diligent as possible when evaluating each portal’s capabilities; however, being non-expert navigators of KEGG, Ensembl Bacteria, KBase, and PATRIC, we may have overlooked or misjudged some element of those portals.

| Tool | BioCyc | KEGG | Ensembl Bacteria | KBase | IMG | PATRIC |
|---|---|---|---|---|---|---|
| Genome Browser | YES | YES | YES | YES | YES | YES |
| – Operons, Promoters, TF binding sites | YES | no | no | no | partial | YES |
| – Depicts Nucleotide Sequence | YES | YES | YES | YES | YES | YES |
| – Customizable Tracks | YES | no | YES | no | partial | YES |
| – Comparative, by Orthologs | YES | no | no | no | YES | YES |
| – Genome Poster | YES | no | no | no | no | no |
| Retrieve Gene Sequence | YES | YES | YES | YES | YES | YES |
| Retrieve Replicon Sequence | YES | YES | YES | no | YES | YES |
| Retrieve Protein Sequence | YES | YES | YES | YES | YES | YES |
| Nucleotide Sequence Alignment Viewer | YES | YES | no | no | YES | YES |
| Protein Sequence Alignment Viewer | YES | YES | no | no | YES | YES |
| Protein Phylogenetic Tree Analysis | no | YES | no | YES | YES | YES |
| Sequence Searching by BLAST | YES | YES | YES | YES | YES | YES |
| Sequence Pattern Search | YES | YES | no | YES | YES | no |
| Sequence Cassette Search | no | YES | YES | YES | YES | no |
| Orthologs | YES | YES | no | YES | YES | YES |
| Gene/Protein Page | YES | YES | YES | YES | YES | YES |
| Enrichment Analysis (GO Terms) | YES | no | no | YES | no | no |
| Enrichment Analysis (Regulation) | YES | no | no | no | no | no |
| Omics Dashboard | YES | no | no | no | no | no |
| Multi-Organism Comparative Analysis | YES | YES | YES | YES | YES | YES |
| Horizontal Gene Transfer Prediction | no | no | no | no | YES | no |
| Fused Protein Prediction | no | no | no | no | YES | no |
| Alternative ORF View | no | no | no | no | YES | YES |
| Genome Multi-Search | YES | no | no | no | YES | YES |
| gANI Computations | no | no | no | YES | YES | YES |
| Kmer Frequency Analysis | no | no | no | no | YES | no |
| Synteny Comparison | no | no | no | YES | YES | no |
| Proteome Comparisons | YES | no | no | YES | YES | YES |
| Statistical Analysis, Genome | YES | no | no | no | YES | no |
| Statistical Analysis, Expression | no | no | no | YES | YES | YES |
| Genome Function Comparison | no | no | no | YES | YES | YES |
| Insert Genomes into Reference Trees | no | no | no | YES | no | YES |
| Predict Effects of Sequence Variants | no | no | YES | no | no | YES |
Table 1: Genomics Tools Comparison. “Partial” means that the tool provides some but not all of the indicated functionality. KEGG does have a rudimentary tool for this purpose, but it is not based on a zoomable genome browser. PATRIC supports construction of trees from an arbitrary set of in-group and out-group genomes.

| Tool | BioCyc | KEGG | Ensembl Bacteria | KBase | IMG | PATRIC |
|---|---|---|---|---|---|---|
| Gene Name | YES | YES | YES | YES | YES | YES |
| Product Name | YES | YES | YES | YES | YES | YES |
| Database Identifier | YES | YES | YES | YES | YES | YES |
| EC Number | YES | YES | YES | no | YES | YES |
| Sequence Length | YES | no | no | YES | YES | YES |
| Replicon | YES | no | no | YES | YES | YES |
| Map Position | YES | YES | no | YES | YES | no |
| Product Mol Wt | YES | no | no | no | YES | no |
| Product Subunits | YES | no | no | no | YES | no |
| Product pI | YES | no | no | no | YES | no |
| Product Ligands | YES | no | no | no | YES | no |
| Evidence Code | YES | no | no | no | no | no |
| Cell Component | YES | no | no | no | no | no |
| GO Terms | YES | no | YES | YES | YES | YES |
| Protein Features | YES | no | YES | no | YES | no |
| Publication | YES | no | no | YES | no | no |
| Scaffold Length | no | YES | no | YES | YES | no |
| Scaffold GC Content | no | no | no | no | YES | YES |
| Protein Family Assignment | no | YES | YES | no | YES | YES |
| Is Partial | no | no | no | no | YES | no |
| Is Pseudogene | YES | no | no | no | YES | YES |

Table 2: Gene/protein multi-search capabilities.
Does the portal support multi-searches for genes and gene products based on the data fields or criteria listed? “Publication” means the ability to search for a gene based on a publication cited in the pathway entry. “Scaffold Length” means the ability to search for a gene based on the length of the scaffold it resides on. “Protein Family Assignment” means the ability to search for a gene based on what protein families it is assigned to (e.g., Pfam or TIGRFAM family). “Is Partial” means search for partial (truncated) proteins.

| Tool | BioCyc | KEGG | Ensembl Bacteria | KBase | IMG | PATRIC |
|---|---|---|---|---|---|---|
| Site Type | YES | no | no | no | no | no |
| – Attenuators | YES | no | no | no | no | no |
| – Origin of Replication | YES | no | no | no | no | no |
| – Phage Attachment Sites | YES | no | no | no | no | no |
| – REP Elements | YES | no | no | no | no | no |
| – Promoters | YES | no | no | no | no | no |
| – Terminators | YES | no | no | no | no | no |
| – mRNA Binding Sites | YES | no | no | no | YES | no |
| – Riboswitches | YES | no | no | no | YES | no |
| – TF Binding Sites | YES | no | no | no | no | no |
| – Transcription Units | YES | no | no | no | no | no |
| – Transposons | YES | no | no | no | no | no |
| Replicon | YES | no | no | no | YES | no |
| Map Position | YES | no | no | no | YES | no |
| Site Regulator | YES | no | no | no | no | no |
| Site Ligands | YES | no | no | no | no | no |
| Evidence Code | YES | no | no | no | no | no |
| CRISPR Arrays | no | no | no | no | YES | no |

Table 3: DNA/RNA Site Multi-Search Capabilities.
Does the portal support multi-searches for DNA and RNA sites based on the data fields or criteria listed? For example, does the portal support searches for sites by the type of site (e.g., for attenuators versus transcription-factor binding sites), and by numeric constraints on the genome position of the site?

| Tool | BioCyc | KEGG | Ensembl Bacteria | KBase | IMG | PATRIC |
|---|---|---|---|---|---|---|
| Metabolite Page | YES | YES | no | no | no | no |
| Chemical Similarity Search | no | YES | no | no | no | no |
| Glycan Similarity Search | no | YES | no | no | no | no |
| Reaction Page | YES | YES | no | no | YES | no |
| – Reaction Atom Mappings | YES | YES | no | no | no | no |
| Individual Pathway Diagram | YES | YES | no | YES | YES | YES |
| – Automatic Pathway Layout | YES | no | no | no | no | no |
| – Paint Omics Data onto Pathway | YES | YES | no | no | YES | no |
| – Depict Enzyme Regulation | YES | no | no | no | no | no |
| – Depict Genetic Regulation | YES | no | no | no | no | no |
| – Depict Metabolite Structures | YES | YES (Tooltip) | no | no | no | no |
| Multi-Pathway Diagram | YES | no | no | no | no | no |
| Full Metabolic Network Diagram | YES | YES | no | no | no | no |
| – Zoomable Metabolic Network | YES | YES | no | no | no | no |
| – Paint Omics Data onto Diagram | YES | no | no | no | no | no |
| – Animated Omics Data Painting | YES | no | no | no | no | no |
| – Metabolic Poster | YES | no | no | no | no | no |
| – Organism Comparison | YES | no | no | no | no | no |
| Automated Metabolic Reconstruction | YES (Desktop) | YES | no | YES | YES | YES |
| Enrichment Analysis (Pathways) | YES | no | no | no | YES | no |
| Execute Metabolic Model | YES | no | no | YES | no | YES |
| – Gene Knock-out Analysis | YES | no | no | YES | no | YES |
| Chokepoint Analysis | YES | no | no | no | no | no |
| Dead-End Metabolite Analysis | YES | no | no | no | no | no |
| Blocked-Reaction Analysis | YES | no | no | YES | no | no |
| Route Search Tool | YES | YES | no | no | no | no |
| Path Prediction Tool | no | YES | no | no | no | no |
| Assign EC Number | no | YES | no | no | no | no |

Table 4: Metabolic Tools Comparison. The desktop version of the Pathway Tools software performs automated metabolic reconstruction.

| Tool | BioCyc | KEGG | Ensembl Bacteria | KBase | IMG | PATRIC |
|---|---|---|---|---|---|---|
| Name | YES | YES | no | no | YES | YES |
| Database Identifier | YES | YES | no | no | YES | YES |
| Ontology | YES | no | no | no | YES | YES |
| Monoisotopic Mass | YES | no | no | no | partial | no |
| Molecular Weight | YES | no | no | no | partial | no |
| Chemical Formula | YES | no | no | no | partial | no |
| Chemical Substructure | YES | YES | no | no | partial | no |
| InChI String | YES | no | no | no | partial | no |
| InChI Key | YES | no | no | no | partial | no |

Table 5: Compound multi-search capabilities.
Does the portal support multi-searches for chemical compounds based on the data fields or criteria listed? “Ontology” means the ability to search for compounds based on a chemical ontology (classification). This search will find pages of antimicrobial compounds.
| Tool | BioCyc | KEGG | Ensembl Bacteria | KBase | IMG | PATRIC |
|---|---|---|---|---|---|---|
| Name | YES | YES | no | no | YES | YES |
| Ontology | YES | YES | no | no | YES | YES |
| Size in Reactions | YES | no | no | no | no | no |
| Substrates | YES | YES | no | no | YES | no |
| Evidence Code | YES | no | no | no | no | no |
| Publication | YES | no | no | no | no | no |

Table 6: Pathway multi-search capabilities. Does the portal support multi-searches for pathways based on the data fields or criteria listed? “Ontology” means the ability to search for pathways based on a pathway ontology (classification).
| Table Capability | BioCyc | KEGG | Ensembl Bacteria | KBase | IMG | PATRIC |
|---|---|---|---|---|---|---|
| Table Datatypes: | | | | | | |
| – Genomes | no | no | no | no | no | YES |
| – Genes | YES | no | no | no | YES | YES |
| – Proteins | YES | no | no | no | YES | YES |
| – RNAs | YES | no | no | no | YES | YES |
| – Metabolites | YES | no | no | no | partial | no |
| – Pathways | YES | no | no | no | partial | YES |
| – Reactions | YES | no | no | no | partial | no |
| – Promoters | YES | no | no | no | no | no |
| – Terminators | YES | no | no | no | no | no |
| – Transcription Factor Binding Sites | YES | no | no | no | no | no |
| – Transcription Units | YES | no | no | no | partial | no |
| – Publications | YES | no | no | no | no | no |
| – Transcriptomics Experiments | no | no | no | no | partial | YES |
| – Biosynthetic Clusters | no | no | no | no | YES | no |
| – Protein Families | no | no | no | no | no | YES |
| Create Table from Uploaded File | YES | no | no | no | YES | YES |
| Create Table from Database Query Result | YES | no | no | no | YES | YES |
| Include Database Properties as Table Columns | YES | no | no | no | YES | YES |
| Create Columns as Computational Transformations | YES | no | no | no | no | no |
| Set Operations Among Tables | YES | no | no | no | YES | YES |
| Filter Table Rows | YES | no | no | no | YES | YES |
| Export Table to File | YES | no | no | no | YES | YES |
| Share Table with Selected Users | YES | no | no | no | YES | YES |
| Share Table to the Public | YES | no | no | no | no | YES |

Table 7: Table-Based Analysis Capabilities. PATRIC provides tables of genomes and tables of features (defined sections of a genome, e.g., genes, CDS, mRNAs).

| Feature | BioCyc | KEGG | Ensembl Bacteria | KBase | IMG | PATRIC |
|---|---|---|---|---|---|---|
| Webinars | YES | no | YES | YES | YES | YES |
| Workshops | YES | ? | YES | YES | YES | YES |

Gene Page Load Time (sec) YES YES YES

Table 8: User Experience Features. The extent of gene details and visualization displayed is vastly different among sites and can lead to longer page load times. User guide and webinars cover multiple Ensembl portals, not specifically bacteria.

Genomics tools enable researchers to query, analyze, and compare genome-related information within an organism DB. Table 1 assesses most genomics tools; Tables 2 and 3 describe genomics multi-search tools.

An explanation of the rows within Table 1 is as follows.
• Genome Browser: Can a user browse a chromosome at different zoom levels to see the genomic features present?
– Are operons, promoters, and transcription-factor binding sites depicted in the genome browser?
– Is the nucleotide sequence depicted in the genome browser?
– Customizable Tracks: Can a user add additional tracks to the genome browser, which show user-supplied data?
– Comparative, by Orthologs: Can a user compare chromosome regions from several genomes side-by-side, with orthologous genes indicated?
– Genome Poster: Can the portal generate a printable, detailed, wall-sized poster of the entire genome, e.g., one that depicts every gene in the genome?
• Retrieve Gene Sequence: Can a user retrieve the nucleotide sequence of a gene?
• Retrieve Replicon Sequence: Can a user retrieve the nucleotide sequence of a specified region of a replicon?
• Retrieve Protein Sequence: Can a user retrieve the amino-acid sequence of a protein?
• Nucleotide Sequence Alignment Viewer: Can a user compare the nucleotide sequence of a gene with orthologs from other organisms?
• Protein Sequence Alignment Viewer: Can a user compare the amino-acid sequence of a protein with orthologs from other organisms?
• Protein Phylogenetic Tree Analysis: Can a user construct a phylogenetic tree from a set of protein sequences?
• Sequence Searching by BLAST: Is searching for a sequence in a genome by BLAST supported?
• Sequence Pattern Search: Is sequence searching by short sequence patterns supported?
• Sequence Cassette Search: Is sequence searching by protein family recognition patterns supported?
• Orthologs: Can a user query for the orthologs of a given gene in other organisms?
• Gene/Protein Page: Does the portal provide gene pages, showing relevant information such as the gene products and links to other DBs?
• Enrichment Analysis (GO Terms): Can a user find which GO terms are statistically enriched, given a set of genes?
• Enrichment Analysis (Regulation): Given a set of genes, can a user compute which regulators of those genes are statistically over-represented in the gene set?
• Omics Dashboard: Can a user submit a transcriptomics dataset for analysis using a visual dashboard tool that enables interactive summarization and exploration of the dataset in a manner similar to the BioCyc Omics Dashboard [10]?
• Multi-Organism Comparative Analysis: Can a user globally compare a variety of different data types between several organisms?
• Horizontal Gene Transfer Prediction: Can the site show which genes may have been acquired by horizontal gene transfer?
• Fused Protein Prediction: Can the portal show which genes result from fusions of genes that can be found separately in other organisms?
• Alternative ORF View (6-frame translation): Can a user assess alternative ORFs to the ones predicted on a given genomic region?
• Genome Multi-Search: Does the portal support search and retrieval across all genomes using sequencing, environmental, or other metadata attributes?
• gANI (Whole-genome Average Nucleotide Identity) Computations: Whole-genome based average nucleotide identity (gANI) has been proposed as a measure of genetic relatedness of a pair of genomes. gANI for a pair of genomes is calculated by averaging the nucleotide identities of orthologous genes. The fraction of orthologous genes (alignment fraction or AF) is also reported as a complementary measure of similarity of the two genomes.
• Kmer Frequency Analysis: Can the portal display principal component analysis plots of oligonucleotide frequencies along genome length; allow comparison of genomes by the similarity of oligonucleotide composition; and identify sequences with abnormal oligonucleotide composition, such as horizontally transferred sequences and contaminating contigs/scaffolds?
• Synteny Comparison: Does the portal provide a tool for evaluating conservation of gene order by plotting pairwise genome alignment? Potential translocations, inversions, or gaps relative to reference can be visualized. Such a tool gives a quick snapshot of how closely related two strains might be.
• Proteome Comparisons: Find proteins that are shared between two or more genomes or unique to a given genome.
• Statistical Analysis, Genome: Example statistical analyses include counts of genes assigned to a “feature” (such as presence of COG/Pfam/TIGRFAM/KEGG domains), and counts of genes in different Gene Ontology categories.
• Statistical Analysis, Expression: Does the portal provide tools for calculating statistical significance of gene expression data?
• Genome Function Comparison: Genomes can be clustered based on a function profile (e.g., COG/Pfam/TIGRFAM/KEGG features) and viewed as a hierarchical cluster tree, principal component analysis, principal coordinate analysis plot, or other options, to assess relatedness of selected genomes.
• Insert Genomes into Reference Trees: Enables a user to determine evolutionary relationships between a genome of interest and nearby reference genomes by building a tree of 49 concatenated universal sequences.
• Predict Effects of Sequence Variants: Enables users to predict effects of variation, including SNPs and indels, on transcripts in the region of the variant.
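The gANI and alignment fraction (AF) definitions in the list above can be made concrete with a small calculation. The following Python sketch is illustrative only: the `OrthologPair` record, the function name, and the toy identities are hypothetical, and it takes the unweighted mean exactly as worded above, whereas a production implementation may weight identities by alignment length.

```python
from dataclasses import dataclass


@dataclass
class OrthologPair:
    """One orthologous gene pair between genome A and genome B (hypothetical record)."""
    gene_a: str
    gene_b: str
    identity: float  # nucleotide identity of the aligned pair, in percent


def gani_and_af(pairs, total_genes_a):
    """Return (gANI, AF) for genome A relative to genome B.

    gANI: unweighted mean nucleotide identity over orthologous pairs.
    AF:   fraction of genome A's genes that have an ortholog in genome B.
    Both simplifications are assumptions made for this sketch.
    """
    if not pairs or total_genes_a <= 0:
        return 0.0, 0.0
    gani = sum(p.identity for p in pairs) / len(pairs)
    af = len(pairs) / total_genes_a
    return gani, af


# Toy data: genome A has 4 genes, 3 of which have orthologs in genome B.
pairs = [
    OrthologPair("a1", "b1", 98.0),
    OrthologPair("a2", "b7", 96.0),
    OrthologPair("a3", "b2", 97.0),
]
gani, af = gani_and_af(pairs, total_genes_a=4)
print(f"gANI = {gani:.1f}%, AF = {af:.2f}")
```

Reporting AF alongside gANI matters because a high gANI computed over only a small fraction of shared genes says little about overall relatedness.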
Metabolic tools enable researchers to query, analyze, and compare information about metabolic pathways and reactions within an organism DB, to run metabolic models, and to analyze high-throughput data in the context of metabolic networks. Table 4 assesses most metabolic tools; Table 5 describes metabolite multi-search capabilities and Table 6 describes pathway multi-search capabilities.

An explanation of the rows within Table 4 is as follows.
• Metabolite Page: Does the site provide a metabolite page, showing relevant information such as synonyms, chemical structure, and reactions in which the metabolite occurs?
• Chemical Similarity Search: Can the user search for chemicals that have similar structures to a provided chemical?
• Glycan Similarity Search: Can the user search for glycans that have similar structures to a provided glycan?
• Reaction Page: Does the site provide a reaction page, showing relevant information such as EC numbers, reaction equation, and enzymes catalyzing the reaction?
• Reaction Atom Mappings: Can the reaction equation be shown with metabolite structures that depict the trajectories of atoms from reactants to products?
• Pathway Diagrams: Can pathway diagrams be depicted?
• Automatic Pathway Layout: Are pathway diagrams generated automatically by the software, thereby avoiding manual drawing?
• Paint Omics Data onto Pathway: Can a user visualize omics data on pathway diagrams?
• Depict Enzyme Regulation: Can pathway diagrams show regulation of enzymes by metabolites, to depict information such as feedback inhibition?
• Depict Genetic Regulation: Can pathway diagrams show genetic regulation of enzymes, such as by transcription factors and attenuation?
• Depict Metabolite Structures: Can pathway diagrams show the chemical structures of metabolites?
• Multi-Pathway Diagram: Can users interactively create diagrams consisting of multiple interacting metabolic pathways?
• Full Metabolic Network Diagram: Can the entire metabolic reaction network of a genome be depicted and explored by an interactive graphical interface?
• Zoomable Metabolic Network: Does the metabolic network browser enable zooming in and out?
• Paint Omics Data onto Network: Can a user visualize an omics dataset (e.g., gene expression, metabolomics) on the metabolic network diagram?
• Animated Omics Data Painting: Can several omics data points be visualized as an animation on the metabolic network diagram?
• Metabolic Poster: Can the portal generate a printable wall-sized poster of the organism’s metabolic network?
• Organism Comparison: Can a user compare the metabolic networks of two organisms via the full metabolic network diagram?
• Automated Metabolic Reconstruction: Starting from a functionally annotated genome, can the metabolic reaction network (and pathways) be inferred in an automated fashion?
• Enrichment Analysis (Pathways): Can the site compute statistical enrichment of pathways within a large-scale dataset?
• Execute Metabolic Model: Can a user execute a steady-state metabolic flux model via the portal?
• Gene Knock-out Analysis: Can a user run flux-balance analysis (FBA) on the metabolic network by systematically disabling (knocking out) various genes, to investigate how knock-outs perturb the network, and to predict gene essentiality?
• Chokepoint Analysis: Can the site compute chokepoint reactions (possible drug targets) in the full metabolic reaction network? A chokepoint reaction is a reaction that either uniquely consumes a specific reactant or uniquely produces a specific product in the metabolic network.
• Dead-End Metabolite Analysis: Can the portal compute dead-end metabolites in the full metabolic reaction network? Dead-end metabolites are those that are either only consumed, or only produced, by the reactions within a given cellular compartment, including transport reactions.
• Blocked-Reaction Analysis: Can the portal compute blocked reactions in the full metabolic reaction network? Blocked reactions cannot carry flux because of dead-end metabolites upstream or downstream of the reactions.
• Route Search Tool: Given a starting and an ending metabolite, can the site compute an optimal series of known reactions (routes) that converts the starting metabolite to the ending metabolite?
• Path Prediction Tool: Given a starting chemical compound, can the site predict a series of previously unknown enzyme-catalyzed reactions that will act upon the input compound and the products of previous reactions?
• Assign EC Number: Can the portal compute an appropriate Enzyme Commission number for a user-provided reaction?
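The chokepoint and dead-end definitions given in the list above translate directly into set computations over a reaction network. The Python sketch below is a minimal illustration under stated assumptions: the three-reaction network and all function names are hypothetical, reactions are treated as irreversible, and cellular compartments and transport reactions are ignored. It is not the algorithm used by any of the portals.

```python
from collections import defaultdict

# Hypothetical toy network: reaction id -> (reactants, products).
# R1 and R2 are alternative routes from A to B; R3 converts B to C.
reactions = {
    "R1": ({"A"}, {"B"}),
    "R2": ({"A"}, {"B"}),
    "R3": ({"B"}, {"C"}),
}


def consumers_and_producers(rxns):
    """Map each metabolite to the reactions that consume it and that produce it."""
    consumers, producers = defaultdict(set), defaultdict(set)
    for rid, (reactants, products) in rxns.items():
        for m in reactants:
            consumers[m].add(rid)
        for m in products:
            producers[m].add(rid)
    return consumers, producers


def chokepoints(rxns):
    """Reactions that uniquely consume some reactant or uniquely produce some product."""
    consumers, producers = consumers_and_producers(rxns)
    found = set()
    for rid, (reactants, products) in rxns.items():
        sole_consumer = any(len(consumers[m]) == 1 for m in reactants)
        sole_producer = any(len(producers[m]) == 1 for m in products)
        if sole_consumer or sole_producer:
            found.add(rid)
    return found


def dead_end_metabolites(rxns):
    """Metabolites that are only consumed or only produced in the network."""
    consumers, producers = consumers_and_producers(rxns)
    metabolites = set(consumers) | set(producers)
    return {m for m in metabolites if not consumers[m] or not producers[m]}


print(sorted(chokepoints(reactions)))           # ['R3']
print(sorted(dead_end_metabolites(reactions)))  # ['A', 'C']
```

In this toy network R3 is the only chokepoint because it alone consumes B, while A and C are dead ends because nothing produces A and nothing consumes C. A real analysis would add transport reactions and per-compartment bookkeeping before drawing conclusions.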
BioCyc has a number of regulatory informatics tools that are not provided by any of the other portals. We list those tools here rather than providing a table.
• BioCyc includes a network browser that depicts the full transcriptional regulatory network of the organism. The network diagram can be queried interactively and painted with transcriptomics data.
• The BioCyc transcription-unit page depicts operon structure including promoters, transcription factor binding sites, and terminators, the evidence for each, and describes regulatory interactions between these sites and associated transcription factors and small RNA regulators.
• BioCyc generates diagrams that summarize all regulatory influences on a gene, including regulation of transcription, translation, and of the gene product.
• BioCyc depicts transcription-factor regulons as diagrams of all operons regulated by a transcription factor.
• BioCyc can depict regulatory influences on metabolism by highlighting the regulon of a transcription factor on the BioCyc metabolic map diagram.
• BioCyc SmartTables can list the regulators or regulatees of each gene within a SmartTable.
• BioCyc can generate a report comparing the regulatory networks of two or more organisms.
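Regulon data of the kind listed above underlie the Enrichment Analysis (Regulation) criterion, which asks whether a regulator's targets are over-represented in a gene set. The Python sketch below shows one common way to pose that question, a one-sided hypergeometric test. The choice of test, the function names, and the toy gene identifiers are all assumptions made for illustration; the text does not state which statistic any portal uses.

```python
from math import comb


def hypergeom_tail(k, n, K, N):
    """P(X >= k) when n genes are drawn without replacement from N genes,
    of which K belong to the regulon."""
    upper = min(n, K)
    favorable = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return favorable / comb(N, n)


def regulon_enrichment(gene_set, regulon, genome_genes):
    """One-sided p-value that `regulon` is over-represented in `gene_set`."""
    genome = set(genome_genes)
    selected = set(gene_set) & genome
    targets = set(regulon) & genome
    overlap = len(selected & targets)
    return hypergeom_tail(overlap, len(selected), len(targets), len(genome))


# Hypothetical toy genome of 10 genes; the regulon covers g0, g1, g2.
genome_genes = [f"g{i}" for i in range(10)]
regulon = {"g0", "g1", "g2"}
p_value = regulon_enrichment({"g0", "g1", "g2"}, regulon, genome_genes)
print(f"p = {p_value:.5f}")  # 1/120, since only one of C(10,3) draws hits all three targets
```

When many regulators are tested against the same gene set, the resulting p-values would normally be corrected for multiple testing; that step is omitted here.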
These tools (see Table 9) enable researchers to perform complex searches and analyses, to retrieve data via web services and bulk downloads, and to create and manipulate user accounts.
| Tool | BioCyc | KEGG | Ensembl Bacteria | KBase | IMG | PATRIC |
|---|---|---|---|---|---|---|
| Advanced Search | YES | no | no | no | YES | no |
| Cross-Organism Search | YES | YES | YES | partial | YES | YES |
| Web Services | YES | YES | YES | YES | no | no |
| Other Query Options | * | * | * | * | * | * |
| User Account | opt/req | no | optional | required | opt/req | opt/req |
| Custom Notifications | YES | no | no | no | no | no |

Download Formats: biopax,gff json,sbml fasta,gff,gff3 genbank,gff,tsv fasta,txt csv,fasta,gff genbank json,mysql,rdf fasta,json,sbml embl,json sbml genbank

Table 9: Comparison of Advanced Search and Analysis, Web Services, and User Accounts. “Opt/Req” means that user accounts are optional for some operations and required for other operations. IMG also provides for downloading of reads, assemblies, QC reports, annotations, and more.

An explanation of the rows within Table 9 is as follows.
• Advanced Search: Does the site enable the user to construct multi-criteria queries that search arbitrary DB fields using combinations of AND, OR, and NOT?
• Cross-Organism Search: Can a user search all organisms, specified organism sets, or taxonomic groups of organisms, for genes, metabolites, or pathways?
• Web Services: Can DBs within the portal be queried programmatically by means of web services, using for example XML protocols?
• Other Query Options: What other query options are provided by the portal?
– BioCyc supports queries via its BioVelo query language [13]. Users can download BioCyc data files for text searches, and can load those data files into SRI’s BioWarehouse system for SQL query access. Users can download bundled versions of subsets of BioCyc plus Pathway Tools, and query the DBs via APIs for Python, Lisp, Java, Perl, and R.
– Users can download KEGG data files for text searches.
– Ensembl Bacteria provides a Perl API and public MySQL servers.
– KBase includes code cells for adding Python code blocks to enable custom analyses for which applications do not exist, or for programmatically calling KBase native apps to automate large-scale analyses.
– PATRIC provides a downloadable command line interpreter application that allows interactive submission of DB queries using a query language.
• User Account: Are user accounts available for logging in, and for storing data and preferences? “Opt/Req” means accounts are optional for some operations and required for other operations.
• Custom Notifications: Does the portal enable the user to register to be notified of curation updates in biological areas of interest to the user?
• Bulk Download Formats: What formats are supported by the portal for large-scale data downloads?

Table-Based Analysis Tools
Table-based analysis tools enable users to define lists of genes, proteins, metabolites, or pathways that are stored within the portal, and can be displayed, analyzed, manipulated, and shared with other users. These tools are called SmartTables by BioCyc and are called Carts by IMG. A typical series of SmartTable operations is to define a SmartTable containing a list of genes (such as from a transcriptomics experiment); to configure which DB properties are displayed for each gene within the SmartTable (such as the gene name, accession number, product name, and genome map position); to perform a set operation on the SmartTable, such as taking the intersection with another gene SmartTable; and to transform the gene SmartTable into, say, a SmartTable of the metabolic pathways containing those genes, or the set of transcriptional regulators for those genes.

KBase does not have a tables mechanism, but it does have a data sharing mechanism called narratives, which is not table-based.

An explanation of the rows within Table 7 is as follows.
• Datatypes Tables can Contain: What types of entities may be stored in tables within each portal? The more types of entities that can be manipulated within tables, the more versatile the table mechanism is.
• Create Table from Uploaded File: Can tables be defined by uploading a data file that lists the entities within the table?
• Create Table from DB Query Result: Can tables be defined from the result of a query within the portal?
• Include DB Properties as Table Columns: Can a user add columns to the table containing information from the DB about a given entity, such as the accession number of a gene, the nucleotide coordinate of a gene, or a diagram of the chemical structure of a metabolite?
• Create Table Columns as Computational Transformations: Can table columns contain information computed from another column, such as adding a column that computes the pathways in which a gene participates?
• Set Operations Among Tables : Can the portal create a new table by computing set oper-ations between two other tables, such as taking the union of the list of genes in two othertables? • Filter Table Rows : Can the portal remove rows from a table according to a search, suchas removing all entries from a table of metabolites where the metabolite name contains“arginine”? • Export Table to File : Can the portal export the contents of a table to a data file?19
Share Table with Selected Users : Can a user share a table with a specific set of users? • Share Table with the Public : Can a user share a table with the general public?20 .6 Data Content among the Portals
Data Type                                 BioCyc   KEGG    Ensembl Bacteria   KBase     IMG      PATRIC
Genome Count                              14,560   5,130   44,046             122,688   97,179   184,000
Bacterial Genomes                         14,134   4,854   43,552             121,994   66,362   181,260
Archaeal Genomes                          394      276     494                694       1,724    2,881
Uncultivated Organisms 0 11,466 0
Genome Metadata                           YES      YES     no                 no        YES      YES
Regulatory Networks                       11       no      no                 no        no       no
Protein Localization                      YES      no      no                 no        no       no
Protein Features                          YES      no      YES                no        partial  YES
Protein 3-D Structures                    no       YES     no                 no        no       no
GO Terms                                  YES      no      YES                YES       YES      YES
Evidence Codes                            YES      no      no                 no        YES      partial
Operons                                   YES      no      no                 no        no       YES
Prophages                                 YES      no      no                 no        YES      YES
Growth Media                              YES      no      no                 YES       no       no
Gene Essentiality                         YES      no      no                 no        no       YES
Gene Clusters for Secondary Metabolites   no       no      no                 no        YES      no
Gene Pairs with Correlated Expression     no       no      no                 no        no       YES
Protein-Protein Interactions              no       no      no                 no        no       YES
AMR Phenotypes                            no       no      no                 no        no       YES

Table 10: Data Types Comparison. PATRIC includes evidence codes in only two DB tables.

Table 10 describes the types and quantities of data present in each web portal. An explanation of the rows within Table 10 is as follows.
• Genome Count (Bact./Arch.): How many bacterial genomes (organisms) does the portal provide access to? Only bacteria and archaea are counted here, although some resources provide eukaryotic and viral genomes.
• Genome Metadata: Does the portal contain genome metadata, such as the lifestyle of the organism, and the location of where the organism sample was obtained?
• Regulatory Networks: Is (gene) regulatory information provided by the site? Eleven BioCyc DBs provide regulatory networks larger than 100 transcriptional regulatory interactions.
• Protein Localization: Does the portal contain protein cellular locations?
• Protein Features: Are annotations of features of protein sequences provided by the portal? Such features include which residues bind to cofactors or to metal ions, and where signaling peptide sequences reside. IMG provides transmembrane and signal peptide features.
• GO Terms: Are GO term annotations provided by the site? IMG provides evidence codes for GO terms. BioCyc provides evidence terms for gene functions, pathway presence, and operon presence.
• Evidence Codes: Are evidence codes for the annotations provided by the resource, so the level of validity of the data can be assessed?
• Operons: Are genes grouped into operons, where applicable?
• Prophages: Are potential prophages indicated on the genomes?
• Growth Media: Are growth media for known growth conditions of the organisms provided by the site? (BioCyc provides growth-media data for two organisms.)
• Gene Essentiality: Are gene essentiality data under various growth conditions provided by the site? (BioCyc provides gene-essentiality data for 36 organisms.)
• Gene Clusters for Secondary Metabolites: Does the site identify putative operons of genes encoding enzymes for the production of secondary metabolites?
• Gene Pairs with Correlated Expression: Pairs of genes with correlated expression based on experimental evidence.
• Protein-Protein Interactions: Pairs of proteins with either experimental or computational evidence of interacting.
• AMR Phenotypes: Can the site display phenotypes for antimicrobial resistance (e.g., is a strain resistant or susceptible to a particular antimicrobial compound)?
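As a small illustration of how the Genome Count row of Table 10 can be read, the following Python sketch ranks the portals by total genome count. The numbers are copied from Table 10; the variable and function names are illustrative only and are not part of any portal's API.

```python
# Genome counts copied from the "Genome Count" row of Table 10.
GENOME_COUNTS = {
    "BioCyc": 14_560,
    "KEGG": 5_130,
    "Ensembl Bacteria": 44_046,
    "KBase": 122_688,
    "IMG": 97_179,
    "PATRIC": 184_000,
}


def rank_by_genome_count(counts):
    """Return portal names ordered from most to fewest genomes."""
    return sorted(counts, key=counts.get, reverse=True)


ranking = rank_by_genome_count(GENOME_COUNTS)
for position, portal in enumerate(ranking, start=1):
    print(f"{position}. {portal}: {GENOME_COUNTS[portal]:,}")
```

The same pattern could be applied to the Bacterial Genomes or Archaeal Genomes rows by swapping in the corresponding values.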
Table 8 contains several features that reflect the usability of the various portals. These include average loading times for typical gene pages for each portal, and other features and resources that assist the user in learning to use each portal.
• Mean Load Time for Gene Pages: Since gene pages are among the most commonly visited information pages within a genome web portal, the time required for the page to load in a web browser is central to the user experience. The values in this row are the average number of seconds required for each portal to load a gene page. The values are averaged across six sessions, conducted from Menlo Park, California and Richmond, Virginia to average out geographic distances to each portal. Each session tested five genes on each of the six portals. Testing was conducted using the Chrome browser version 68.0, running on MacOS 10.13.6. Testing consisted of clearing the browser cache and pasting the URL of the gene page into the browser. The load was monitored using the 'Network' panel of Chrome's Developer Tools (More Tools → Developer Tools). The page was allowed to completely load (including loading large files and waiting for Ajax calls to complete). The number used is the “Finish” time in the bottom line of the panel. While some portals were disadvantaged by starting from an empty cache, forcing large files to be loaded, others were slowed by long Ajax calls. We have removed the single worst time recorded of the 30 times (5 genes x 6 sessions) for each portal.
• Portal Information: Lists the availability of a user guide, extensive explanatory tooltips throughout the site, recorded webinars (either downloadable files or on YouTube or a similar site), and user workshops.
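The averaging procedure described for the Mean Load Time row (30 measurements per portal, with the single worst time discarded before averaging) can be sketched in Python as follows. The load times in the example are made-up placeholder values, not measurements from this study, and the function name is hypothetical.

```python
from statistics import mean


def mean_load_time(times_s):
    """Average load time (seconds) after discarding the single slowest value."""
    if len(times_s) < 2:
        raise ValueError("need at least two measurements")
    trimmed = sorted(times_s)[:-1]  # drop the single worst (largest) time
    return mean(trimmed)


# Hypothetical "Finish" times in seconds: 5 genes x 6 sessions = 30 values,
# one of which is an outlier caused by, say, a slow Ajax call.
example_times = [2.0] * 29 + [60.0]
print(mean_load_time(example_times))
```

Dropping only the largest value keeps the statistic close to a plain mean while making it less sensitive to one anomalous page load.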
Table 11 summarizes the number of capabilities present in each portal. In each row of Table 11 we have summed the counts in the column for each portal from the specified tables, with each “YES” counted as 1, each “partial” counted as 1/2, and each “no” counted as 0.

Tool           BioCyc   KEGG   Ensembl Bacteria   KBase   IMG    PATRIC
Major          51       30     14                 27.5    35     29
SmartTables    20       0      0                  0       13.5   15
Multi-Search   49       12     7                  10      32     15
Data Types     10       2      2                  2       5.5    9.5

Table 11: Tallies of Portal Capabilities from Previous Tables. Row “Major” summarizes the major capabilities for genomics tools, metabolic tools, and advanced tools present in Tables 1, 4, and 9. Row “SmartTables” summarizes the number of SmartTables capabilities for each portal present in Table 7. Row “Multi-Search” summarizes the number of multi-search capabilities for each portal present in Tables 2, 3, 5, and 6. Row “Data Types” summarizes the number of datatypes provided by each portal present in Table 10, from row “Genome Metadata” downward.

BioCyc received the highest count (51) of major capabilities (which does not count its unique regulatory capabilities listed in Section 2.3). IMG ranked second with a count of 35. KEGG, PATRIC, and KBase ranked third, fourth, and fifth with counts of 30, 29, and 27.5, respectively. Ensembl Bacteria ranked sixth with a count of 14.

BioCyc has the most extensive multi-search capabilities, with IMG in second place; these portals provide users with the most extensive capabilities for finding desired information.

IMG has the most genomics capabilities, with PATRIC and BioCyc second and third. Ensembl Bacteria has the fewest genomics capabilities. BioCyc and IMG have the most powerful gene/protein multi-search capabilities. BioCyc has the most extensive capabilities for DNA/RNA site multi-searches.

BioCyc has the most extensive metabolic capabilities. KEGG ranks second; it lacks metabolic modeling capabilities, and it lacks network analysis tools such as dead-end metabolite analysis and chokepoint analysis. BioCyc has the most extensive metabolic multi-search capabilities, with IMG second.

SmartTables make extensive data analysis capabilities available to users that in many cases would otherwise require assistance from a programmer. BioCyc has the most extensive SmartTable capabilities, with PATRIC ranking second and IMG ranking third. KEGG, Ensembl Bacteria, and KBase completely lack SmartTables capabilities.

PATRIC has the largest number of genomes, with KBase and IMG ranked second and third, respectively; KEGG has the smallest number of genomes. Most of the PATRIC genomes were assembled from whole-genome shotgun data and thus are expected to be of lower quality; only 11,803 PATRIC bacterial genomes are complete genomes.

KEGG provides the fastest-loading gene pages; BioCyc pages are the second fastest. Pages for KBase, Ensembl Bacteria, and IMG are significantly slower. PATRIC gene pages are the slowest, loading 13.96 times slower than KEGG gene pages.

BioCyc contains the most extensive analysis capabilities for metabolomics and transcriptomics data, including painting omics data onto individual pathways, multi-pathway diagrams, and zoomable metabolic maps; enrichment analysis for GO terms, regulation, and pathways; and an Omics Dashboard.

BioCyc contains extensive unique content not included in any of the other portals, including regulatory network data, data on growth under different nutrient conditions, experimental gene essentiality data, reaction atom mappings (also present in KEGG), and thousands of textbook page equivalents of mini-review summaries. KEGG notably lacks a diverse range of datatypes; for example, it lacks protein features, localization information, GO terms, and evidence codes.
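The scoring rule behind Table 11 (each “YES” counted as 1, each “partial” as 1/2, and each “no” as 0) can be illustrated with a short Python sketch that recomputes the “Data Types” row from the rows of Table 10 from “Genome Metadata” downward. The cell values are transcribed from Table 10. Treating the numeric entry “11” in the Regulatory Networks row as a “YES” is an assumption made here for illustration, and all names in the code are hypothetical.

```python
PORTALS = ["BioCyc", "KEGG", "Ensembl Bacteria", "KBase", "IMG", "PATRIC"]

# Rows of Table 10 from "Genome Metadata" downward, one cell per portal.
TABLE_10 = {
    "Genome Metadata": ["YES", "YES", "no", "no", "YES", "YES"],
    "Regulatory Networks": ["11", "no", "no", "no", "no", "no"],
    "Protein Localization": ["YES", "no", "no", "no", "no", "no"],
    "Protein Features": ["YES", "no", "YES", "no", "partial", "YES"],
    "Protein 3-D Structures": ["no", "YES", "no", "no", "no", "no"],
    "GO Terms": ["YES", "no", "YES", "YES", "YES", "YES"],
    "Evidence Codes": ["YES", "no", "no", "no", "YES", "partial"],
    "Operons": ["YES", "no", "no", "no", "no", "YES"],
    "Prophages": ["YES", "no", "no", "no", "YES", "YES"],
    "Growth Media": ["YES", "no", "no", "YES", "no", "no"],
    "Gene Essentiality": ["YES", "no", "no", "no", "no", "YES"],
    "Gene Clusters for Secondary Metabolites": ["no", "no", "no", "no", "YES", "no"],
    "Gene Pairs with Correlated Expression": ["no", "no", "no", "no", "no", "YES"],
    "Protein-Protein Interactions": ["no", "no", "no", "no", "no", "YES"],
    "AMR Phenotypes": ["no", "no", "no", "no", "no", "YES"],
}

SCORES = {"yes": 1.0, "partial": 0.5, "no": 0.0}


def score(cell):
    """Map a table cell to its tally value."""
    key = cell.strip().lower()
    if key in SCORES:
        return SCORES[key]
    # Assumption: a positive numeric cell (e.g. "11") counts as a YES.
    return 1.0 if float(key) > 0 else 0.0


def tally(table):
    """Sum the per-cell scores column by column."""
    totals = dict.fromkeys(PORTALS, 0.0)
    for row in table.values():
        for portal, cell in zip(PORTALS, row):
            totals[portal] += score(cell)
    return totals


totals = tally(TABLE_10)
for portal in PORTALS:
    print(f"{portal}: {totals[portal]}")
```

Under that assumption the computed totals agree with the “Data Types” row of Table 11 (10, 2, 2, 2, 5.5, and 9.5).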
Microbial genome web portals have a broad range of capabilities, and are quite variable in terms of what capabilities they provide. We assessed the capabilities of BioCyc, KEGG, Ensembl Bacteria, KBase, IMG, and PATRIC. BioCyc provided the most capabilities overall in terms of bioinformatics tools and breadth of data content; it also provides a level of curated data content (curated from 89,000 publications) that far exceeds that within the other sites. IMG ranked second overall, second in bioinformatics tools, and third in number of genomes. KEGG ranked third overall, PATRIC ranked fourth, KBase ranked fifth, and Ensembl Bacteria ranked sixth. IMG provided the most extensive genome-related tools, with BioCyc a close second. BioCyc provided the most extensive metabolic tools, with KEGG ranked second. Ensembl Bacteria provided no metabolic tools. PATRIC provided the largest number of genomes. BioCyc provided extensive regulatory network tools (and data) that are not present in any of the other portals. BioCyc provided the most extensive SmartTable tools and the most extensive omics data analysis tools.
Acknowledgments
We thank Dr. Nishadi De Silva of the European Bioinformatics Institute for comments and corrections regarding Ensembl Bacteria. We thank the KBase team for comments and corrections regarding KBase. We thank Maulik Shukla of Argonne National Laboratory and the University of Chicago and Rebecca Wattam of Virginia Tech for comments and corrections regarding PATRIC. Research reported in this publication was supported by SRI International and by the National Institute of General Medical Sciences of the National Institutes of Health under Award Number 5R01GM080745. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health. For JGI contributors, the work presented in this paper was supported by the Director, Office of Science, Office of Biological and Environmental Research, Life Sciences Division, U.S. Department of Energy under Contract No. DE-AC02-05CH11231.