BpForms and BcForms: Tools for concretely describing non-canonical polymers and complexes to facilitate comprehensive biochemical networks
Paul F. Lang, Yassmine Chebaro, Xiaoyue Zheng, John A. P. Sekar, Bilal Shaikh, Darren A. Natale, Jonathan R. Karr
Icahn Institute for Data Science and Genomic Technology, Icahn School of Medicine at Mount Sinai, New York, NY 10029, USA
Department of Genetics and Genomic Sciences, Icahn School of Medicine at Mount Sinai, New York, NY 10029, USA
Department of Biochemistry, Oxford University, South Parks Road, Oxford OX1 3QU, UK
Institut de Génétique et de Biologie Moléculaire et Cellulaire, Institut National de la Santé et de la Recherche Médicale, Centre National de la Recherche Scientifique, Université de Strasbourg, 67404, Illkirch, France
Protein Information Resource, Georgetown University Medical Center, Washington, DC 20007, USA
* These authors contributed equally to this work
** Correspondence: [email protected]
26, 2019
Abstract
Although non-canonical residues, caps, crosslinks, and nicks play an important role in the function of many DNA, RNA, proteins, and complexes, we do not fully understand how networks of non-canonical macromolecules generate behavior. One barrier is our limited formats, such as IUPAC, for abstractly describing macromolecules. To overcome this barrier, we developed BpForms and BcForms, a toolkit of ontologies, grammars, and software for abstracting the primary structure of polymers and complexes as combinations of residues, caps, crosslinks, and nicks. The toolkit can help quality control, exchange, and integrate information about the primary structure of macromolecules into fine-grained global networks of intracellular biochemistry.
Keywords format; software; polymer; proteoform; complex; residue; modification; crosslink; fine-grained network; genome-scale network
1. Background
A central goal in biology is to understand how networks of metabolites, DNA, RNA, proteins, and complexes generate behavior. Non-canonical residues, caps, crosslinks, and nicks are essential to these networks. For example, prokaryotic restriction/modification systems use methylation to selectively degrade foreign DNA, tRNA use pseudouridine to translate multiple codons, and signaling networks use phosphorylation to encode information into the states of proteins.
Recent technical advances have yielded detailed information about individual DNA, RNA, and protein modifications. For example, SMRT-seq can identify the locations of DNA methylations with single-nucleotide resolution and mass-spectrometry can identify hundreds of protein modifications. Furthermore, several repositories have compiled extensive data about non-canonical residues and crosslinks in DNA, RNA, and proteins, as well as data about the subunit composition and crosslinks of complexes.
Despite this progress, it remains difficult to integrate this information into fine-grained global networks of intracellular biochemistry, in part, because these resources use chemically-ambiguous and incompatible formats. Consequently, we still do not have a holistic understanding of how non-canonical macromolecules help generate behavior.
Whole-cell (WC) models, which aim to predict phenotype from genotype by representing all of the biochemical activity in cells, are a promising tool for integrating diverse information about macromolecules into a holistic understanding of cellular behavior. However, it remains challenging to build fine-grained, global biochemical networks, such as WC models, because we have few tools for capturing the structures of non-canonical macromolecules and linking them together into networks. For example, formats such as BioNetGen and the Systems Biology Markup Language (SBML) are cumbersome for modeling post-transcriptional modification because they have limited capabilities to represent the primary structure of RNA. Abstractions of the primary structures of macromolecules that can be combined with modeling frameworks such as SBML would provide a significant step toward fine-grained global biochemical networks. Combined with software tools, such abstractions could also facilitate the curation, exchange, and quality control of structural information about macromolecules for a wide range of omics and systems and synthetic biology research.
Currently, several formats have limited abilities to abstract the primary structures of non-canonical polymers and complexes. Molecular formats which represent each atom and bond, such as the International Chemical Identifier (InChI), the PDB format, and the Simplified Molecular-Input Line-Entry System (SMILES), can represent non-canonical residues, caps, crosslinks, and nicks. However, their fine granularity is cumbersome for network-scale research.
Omics and systems biology formats, such as BioPAX, the Biological Expression Language (BEL), the MODOMICS nomenclature, the PRO notation, ProForma, and the Synthetic Biology Open Language (SBOL), use abstractions that are conducive to network-scale research. However, these formats have limited abilities to represent non-canonical residues, caps, crosslinks, and nicks, and they do not concretely represent the primary structures of macromolecules.
Toward fine-grained global networks of intracellular biochemistry, we developed BpForms-BcForms, an open-source toolkit for abstractly representing the primary structure of polymers and complexes. BpForms includes extensible alphabets of hundreds of DNA, RNA, and protein residues; an ontology of common crosslinks; and a human and machine-readable grammar for combining residues, residue modifications, intra-chain crosslinks, and nicks into polymers.
BcForms includes a human and machine-readable grammar for combining polymers, small molecules, and inter-chain crosslinks into complexes. Both tools include software for validating descriptions of macromolecules, calculating properties of macromolecules such as their formula, visualizing macromolecules, and exporting macromolecules to molecular formats such as SMILES. Both tools are available as a web application, REST API, command-line program, and Python library.
Here, we describe the toolkit and demonstrate how it can facilitate omics, systems modeling, and synthetic biology. First, we describe the toolkit, including the alphabets of residues, the ontology of crosslinks, the grammars, the software tools, and the user interfaces. Second, we describe how BpForms and BcForms can be integrated with knowledge about pathways, kinetic models, and genetic designs through formats such as BioPAX, CellML, SBML, and SBOL. Next, we describe the advantages of the toolkit over existing formats for representing polymers and complexes and existing alphabets of residues. Lastly, we present multiple case studies that illustrate how the toolkit can help researchers describe, quality control, exchange, and integrate diverse information about macromolecules into networks. We anticipate that BpForms and BcForms will help facilitate fine-grained, global networks of cellular biochemistry.
2. Results
The BpForms-BcForms toolkit includes several interrelated tools for describing, validating, visualizing, and calculating properties of the primary structure of DNA, RNA, proteins, and complexes (Figure 1). Here, we describe the components of the toolkit including the abstractions and grammars for polymers and complexes; the alphabets of residues; the ontology of crosslinks; the software tools for quality controlling, analyzing, and visualizing macromolecules; the protocols for integrating BpForms and BcForms with formats for network research; and the user interfaces.
Abstract representation of the primary structure of polymers and complexes.
BpForms represents polymers as a sequence of residues, a set of crosslinks, a set of nicks, and a Boolean indicator of circularity (Figure 2B, D). BcForms represents complexes as a set of subunits and a set of crosslinks (Figure 2A, C). Each subunit is represented by its molecular structure and stoichiometry. The structure of each subunit can be described using BpForms or SMILES.
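The abstraction described above can be illustrated with a small data model. The following is a minimal sketch, not the toolkit's implementation: the class names `Polymer`, `Subunit`, and `Complex`, their fields, and the two helper methods are hypothetical and exist only to mirror the parts listed in the text (residue sequence, crosslinks, nicks, circularity, subunits, and stoichiometry).

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set, Tuple, Union


@dataclass
class Polymer:
    """Hypothetical container for the four parts of a polymer description."""
    residues: List[str]                                             # ordered residue codes
    crosslinks: Set[Tuple[int, int]] = field(default_factory=set)   # bonded residue positions (1-based)
    nicks: Set[Tuple[int, int]] = field(default_factory=set)        # adjacent positions that are NOT bonded
    circular: bool = False

    def backbone_bond_count(self) -> int:
        """Count inter-residue backbone bonds, accounting for circularity and nicks."""
        n = len(self.residues)
        bonds = max(n - 1, 0)
        if self.circular and n > 1:
            bonds += 1  # the bond that closes the ring
        return bonds - len(self.nicks)


@dataclass
class Subunit:
    """A subunit is a structure (a polymer, or a SMILES string) plus a stoichiometry."""
    structure: Union[Polymer, str]
    stoichiometry: int = 1


# One end of an inter-chain crosslink: (subunit name, copy number, residue position).
CrosslinkEnd = Tuple[str, int, int]


@dataclass
class Complex:
    """Hypothetical container for a complex: named subunits plus inter-chain crosslinks."""
    subunits: Dict[str, Subunit]
    crosslinks: List[Tuple[CrosslinkEnd, CrosslinkEnd]] = field(default_factory=list)

    def residue_count(self) -> int:
        """Total residues across all copies of all polymer subunits."""
        return sum(
            s.stoichiometry * len(s.structure.residues)
            for s in self.subunits.values()
            if isinstance(s.structure, Polymer)
        )


# A circular four-residue chain with one nick, assembled into a crosslinked homodimer.
chain = Polymer(residues=list("ACGT"), nicks={(2, 3)}, circular=True)
homodimer = Complex(
    subunits={"chain": Subunit(chain, stoichiometry=2)},
    crosslinks=[(("chain", 1, 1), ("chain", 2, 1))],
)
```

Keeping nicks and circularity as separate fields, rather than folding them into the sequence, lets the same residue sequence describe linear, circular, and nicked molecules.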
Residues.
Each residue is represented by its molecular structure, a list of the atoms which can form bonds with preceding and following residues, and a list of the atoms which are displaced by the formation of these bonds (Figure 2E). These lists of atoms are optional to enable the toolkit to represent internal nucleic and amino acids, as well as 3’ and 5’ caps. The toolkit can also capture metadata and missing information about residues.
Crosslinks.
Each crosslink is represented as lists of the atoms which can form a bond between residues and the atoms which are displaced by the formation of these bonds (Figure 2F). The toolkit represents each nick as a tuple of adjacent residues which are not bonded.
Alphabets of residues and ontology of crosslinks.
The toolkit uses a hybrid approach to abstract the molecular details of residues and crosslinks from the descriptions of macromolecules. The chemical details of common residues and crosslinks are abstracted into alphabets of residues and an ontology of crosslinks. Users can define additional residues and crosslinks within descriptions of macromolecules or create custom alphabets and ontologies. This hybrid approach standardizes the representation of common residues and crosslinks while enabling the toolkit to represent any residue or crosslink.
Coordinate system.
The toolkit uses a structured coordinate system to describe the atoms involved in each inter-residue bond and crosslink. The coordinate of each repeated subunit ranges from one to the stoichiometry of the subunit. The coordinate of each residue is its position within the residue sequence of its parent polymer. The coordinate of each atom is its position within the canonical SMILES ordering of the atoms in its parent residue. Additional File 1.4 contains more information about the coordinate system.
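The three nested coordinates can be checked with a few range comparisons. The sketch below is illustrative only: the `AtomCoordinate` tuple, the function name, and the per-residue atom counts are assumptions made for this example, and the real atom positions depend on the canonical SMILES ordering computed by the toolkit.

```python
from typing import Dict, NamedTuple


class AtomCoordinate(NamedTuple):
    """Hypothetical address of one atom: subunit, copy, residue, and atom (all 1-based)."""
    subunit: str
    copy: int
    residue: int
    atom: int


def is_valid_coordinate(
    coord: AtomCoordinate,
    stoichiometry: Dict[str, int],   # subunit name -> number of copies
    sequences: Dict[str, str],       # subunit name -> residue sequence
    atom_counts: Dict[str, int],     # residue code -> number of addressable atoms (assumed)
) -> bool:
    """Check each level of the coordinate against its allowed range."""
    if coord.subunit not in stoichiometry:
        return False
    # Copy coordinate ranges from one to the stoichiometry of the subunit.
    if not 1 <= coord.copy <= stoichiometry[coord.subunit]:
        return False
    # Residue coordinate is a position within the parent polymer's sequence.
    sequence = sequences[coord.subunit]
    if not 1 <= coord.residue <= len(sequence):
        return False
    # Atom coordinate is a position within the atom ordering of the parent residue.
    residue_code = sequence[coord.residue - 1]
    return 1 <= coord.atom <= atom_counts[residue_code]


# Illustrative inputs: a homodimer of the peptide "ACG" with assumed atom counts.
stoichiometry = {"pep": 2}
sequences = {"pep": "ACG"}
atom_counts = {"A": 6, "C": 7, "G": 5}
```

For example, `AtomCoordinate("pep", 2, 2, 7)` addresses the seventh atom of the cysteine in the second copy of the peptide, whereas a copy number of 3 falls outside the stoichiometry and is rejected.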
Examples.
Boxes 1 and 2 illustrate the toolkit’s grammars for describing polymers and complexes, and Figure 2 illustrates the chemical semantics of a homodimer encoded in the grammars. Additional File 1.2 and the BpForms and BcForms websites provide detailed descriptions of the grammars and additional examples. Additional File 1.3 contains formal descriptions of the grammars.
Alphabets of DNA, RNA, and protein residues.
To support a broad range of research, BpForms includes the most extensive alphabets of DNA, RNA, and protein residues to date. The DNA alphabet includes 422 deoxyribose nucleotide monophosphates and 3’ and 5’ caps derived from data about DNA damage and repair from REPAIRtoire, structural data from the Protein Data Bank Chemical Component Dictionary (PDB CCD), and chemoinformatics data from DNAmod. The RNA alphabet includes 378 ribose nucleotide monophosphates and 3’ and 5’ caps derived from biochemical data from MODOMICS and the RNA Modification Database and structural data from the PDB CCD. The protein alphabet has 1,435 amino acids and carboxy and amino termini derived from biochemical data from RESID and structural data from the PDB CCD. The BpForms website contains pages which display the residues in each alphabet. Additional File 1.5 describes how we constructed the alphabets.
Ontology of crosslinks.
To abstract the molecular structures of polymers and complexes, the toolkit includes the first ontology of crosslinks. Currently, the ontology contains 36 common crosslinks. We plan to continue to curate additional crosslinks as needed to represent WC models. The BpForms website contains a page which displays the crosslinks in the ontology. Additional File 1.6 describes how we constructed the ontology.
Syntactic and semantic validation of descriptions of macromolecules.
To help quality control information about macromolecules, the toolkit can verify the syntactic and semantic correctness of macromolecules encoded in BpForms and BcForms. First, the toolkit can verify that textual descriptions of macromolecules are syntactically consistent with the BpForms and BcForms grammars and identify any errors. Second, the toolkit can verify that macromolecules represented by BpForms and BcForms are semantically consistent and identify any errors. For example, the toolkit can identify pairs of adjacent amino acids that cannot form peptide bonds because the first amino acid does not have a carboxy terminus or because the second amino acid does not have an amino terminus. Additional File 1.7 details the semantic validations implemented by the toolkit.
We anticipate that these quality controls will help researchers exchange reliable information and assemble this information into high-quality networks.
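The peptide-bond example above amounts to checking that each adjacent pair of residues exposes the required bonding sites. The following sketch shows one way such a check could look; the `Residue` class, its two fields, and the function name are hypothetical, and the toolkit's own validations (Additional File 1.7) cover considerably more cases.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Residue:
    """Hypothetical residue with optional sites for bonding to its neighbors."""
    code: str
    left_bond_atom: Optional[str]    # atom that bonds to the preceding residue (e.g. amino N)
    right_bond_atom: Optional[str]   # atom that bonds to the following residue (e.g. carboxyl C)


def find_unbondable_pairs(sequence: List[Residue]) -> List[Tuple[int, str]]:
    """Return (1-based position, message) for every residue that blocks a backbone bond."""
    errors = []
    for i in range(len(sequence) - 1):
        first, second = sequence[i], sequence[i + 1]
        if first.right_bond_atom is None:
            errors.append((i + 1, f"{first.code} has no site to bond to the following residue"))
        if second.left_bond_atom is None:
            errors.append((i + 2, f"{second.code} has no site to bond to the preceding residue"))
    return errors


# Illustrative sequence: the middle residue lacks a carboxy terminus,
# so it cannot form a peptide bond with the residue that follows it.
ala = Residue("A", left_bond_atom="N", right_bond_atom="C")
blocked = Residue("X", left_bond_atom="N", right_bond_atom=None)
gly = Residue("G", left_bond_atom="N", right_bond_atom="C")
errors = find_unbondable_pairs([ala, blocked, gly])
```

Because the bonding sites are optional, the same check accepts a cap at either end of a chain and only reports a missing site when another residue actually needs it.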
Analyses of polymers and complexes.
The toolkit can calculate several properties of macromolecules such as their primary structure, major protonation and tautomerization states, chemical formula, molecular weight, and charge. We have begun to use these properties to quality control WC models. For example, we are using the chemical formulae to verify that each reaction is element and charge balanced, including reactions that represent transformations of macromolecules such as the post-transcriptional modification of tRNA.
The toolkit can also compare macromolecules to determine their equality or identify differences. We plan to use this feature to implement automated procedures for merging models that share species and reactions.
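Once each species has a chemical formula and charge, the element and charge balance check mentioned above reduces to simple bookkeeping. The sketch below is a generic illustration rather than the toolkit's code: the function names are hypothetical, the formula parser handles only flat formulas without parentheses, and ATP hydrolysis is used as the example because its formulas are well established.

```python
import re
from collections import Counter
from typing import Dict, List, Tuple

# One reaction participant: (chemical formula, charge, stoichiometric coefficient).
Participant = Tuple[str, int, int]


def parse_formula(formula: str) -> Counter:
    """Parse a flat formula such as 'C10H12N5O13P3' into element counts."""
    counts = Counter()
    for element, number in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[element] += int(number) if number else 1
    return counts


def imbalance(reactants: List[Participant], products: List[Participant]) -> Tuple[Dict[str, int], int]:
    """Return (element imbalance, charge imbalance), each computed as products minus reactants."""
    elements = Counter()
    charge = 0
    for sign, side in ((-1, reactants), (1, products)):
        for formula, species_charge, coefficient in side:
            for element, n in parse_formula(formula).items():
                elements[element] += sign * coefficient * n
            charge += sign * coefficient * species_charge
    return {el: n for el, n in elements.items() if n != 0}, charge


# ATP(4-) + H2O -> ADP(3-) + HPO4(2-) + H(+)
reactants = [("C10H12N5O13P3", -4, 1), ("H2O", 0, 1)]
products = [("C10H12N5O10P2", -3, 1), ("HO4P", -2, 1), ("H", 1, 1)]
balanced = imbalance(reactants, products)

# The same reaction with the proton omitted is neither element nor charge balanced.
unbalanced = imbalance(reactants, products[:2])
```

A non-empty element dictionary or a non-zero charge pinpoints what is missing, which is how such a check can suggest species to add to a model.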
Molecular and sequence visualizations.
To help analyze macromolecules, the toolkit can generate molecular and sequence visualizations of residues, caps, crosslinks, polymers, and complexes. The molecular visualizations display each atom and bond and use colors to highlight features such as individual residues, inter-residue and crosslink bonds, and the atoms that are displaced by the formation of the inter-residue bonds (Figure S1A–C). The molecular visualizations can also display the coordinate of each residue and atom. The sequence visualizations include interactive tooltips that describe each non-canonical residue, crosslink, and nick (Figure S1D).
Export to other molecular and sequence formats.
For compatibility with structural and biochemical research, the toolkit can export BpForms and BcForms-encoded macromolecules to molecular formats such as InChI, the PDB format, and SMILES. For compatibility with genomics research, the toolkit can also export the canonical sequences of BpForms-encoded polymers to the IUPAC/IUBMB format and FASTA documents.
Integration with frameworks for network-scale research.
BpForms and BcForms can facilitate network-scale research through integration with omics and systems and synthetic biology frameworks such as BioPAX, CellML, SBML, and SBOL. Additional File 1.9 illustrates how BpForms and BcForms can be incorporated into these frameworks.
User interfaces.
BpForms and BcForms each include four user-friendly interfaces: a web application, a REST API, a command-line program, and a Python library.
BpForms and BcForms are the first abstractions that can represent the primary structure of any DNA, RNA, protein, and complex, including non-canonical residues, caps, crosslinks, nicks, and circularity. The toolkit also contains the most extensive alphabets of DNA, RNA, and protein residues and the first ontology of concrete crosslinks. Furthermore, the toolkit has several innovative features to facilitate research about non-canonical macromolecules: the toolkit includes a novel coordinate system that makes it easy to address specific atoms in macromolecules, the toolkit uses a novel combination of ontologies and inline definitions of residues and crosslinks to standardize the representation of common residues and crosslinks while accommodating any residue or crosslink, and the toolkit includes novel quality controls for abstractions of the primary structures of macromolecules. Taken together, BpForms and BcForms are well-suited for network research. Here, we summarize how BpForms and BcForms improve upon several existing resources for abstracting polymers and complexes.
Comparison of BpForms with existing formats for polymers.
BpForms is the first format that can abstract the primary structure of DNA, RNA, and proteins, including non-canonical residues, caps, crosslinks, nicks, and circularity. In contrast, molecular formats such as SMILES do not abstract the structures of polymers, and abstract formats such as ProForma and network formats such as BioPAX do not represent concrete molecular structures. BpForms also provides a unique blend of the features of previous molecular and abstract formats: BpForms can capture missing information similar to ProForma, BpForms is human-readable like other abstract formats, BpForms is machine-readable like molecular formats, BpForms is composable with network formats such as SBML like molecular formats, and BpForms is backward compatible with the IUPAC/IUBMB format like other abstract formats. Additional File 1.11.1 and Table S1 provide a detailed comparison of BpForms with several other formats.
Comparison of BpForms alphabets with existing databases.
The BpForms alphabets are the most extensive alphabets of DNA, RNA, and protein residues because they are based on structural, biochemical, and physiological data from several sources. In addition, the BpForms alphabets and the PDB CCD are the only alphabets which consistently represent DNA, RNA, and protein residues and which represent the inter-residue bonding sites of each residue, enabling residues to be combined into concrete molecular structures. In contrast, DNAmod, REPAIRtoire, MODOMICS, RESID, and the RNA Modification Database each only represent DNA, RNA, or protein residues; the residues in DNAmod, REPAIRtoire, MODOMICS, and the RNA Modification Database are hard to compose into polymers because they represent nucleobases and nucleosides rather than nucleotides; and DNAmod, REPAIRtoire, MODOMICS, RESID, and the RNA Modification Database do not capture bonding sites. Additional File 1.11.2 and Table S2 provide a detailed comparison of the BpForms alphabets with several other resources.
Comparison of the BpForms crosslinks ontology with existing resources.
Several resources contain information about crosslinks. In particular, the UniProt controlled vocabulary of post-translational modifications includes textual descriptions of over 100 types of crosslinks. In addition, MOD, REPAIRtoire, and RESID indirectly represent crosslinks by representing crosslinked dimers and trimers.
The BpForms ontology is the first resource which directly represents the chemical structures of crosslinks, enabling crosslinks to be composed into concrete structures. In contrast, MOD, REPAIRtoire, and RESID represent crosslinks indirectly and the crosslinks in UniProt do not have concrete chemical semantics. Consequently, the crosslinks in MOD, REPAIRtoire, RESID, and UniProt cannot be composed into concrete structures. Additional File 1.11.3 and Table S3 provide a detailed comparison of the BpForms crosslinks ontology with these resources.
Comparison of BcForms with existing formats for complexes.
Despite the importance of complexes, only a few formats can represent complexes. The PDB format is well-suited to capturing the 3-dimensional structures of complexes. BioPAX and SBOL can also capture the subunit composition of complexes.
BcForms is the first format which abstracts the primary structures of complexes including crosslinks. In contrast, the PDB format has limited capabilities to abstract crosslinks, and BioPAX and SBOL have limited abilities to represent stoichiometric information and crosslinks. BcForms is also the first format which can be composed with formats for networks such as SBML. Additional File 1.11.4 and Table S4 provide a detailed comparison of BcForms with several other formats.
We believe that the BpForms-BcForms toolkit can support a wide range of omics and systems and synthetic biology research. Here, we illustrate how we have used the toolkit to improve the quality of the PRO database of proteoforms; analyze the metabolic cost of tRNA modification in Escherichia coli; refine, expand, and compose a model of MAPK signaling with models of other pathways; and identify constraints on designing new strains of E. coli.
Proteomics: Quality control of the Protein Ontology.
One of the goals of proteomics is to characterize the proteoforms in cells. Toward a comprehensive catalog of proteoforms, the PRO consortium has manually integrated several different types of data into PRO, a database of 8,095 proteoforms. Because the consortium constructs PRO, in part, by hand, automated quality controls could help the consortium identify and correct errors in PRO.
We have used BpForms to quality control PRO. First, we encoded each entry in PRO into the BpForms grammar and used the BpForms software to validate each entry. This identified several types of syntactical and semantic errors. For example, we identified annotated processing sites that have invalid coordinates that are greater than the length of the translated sequence of their parent protein. We also identified modified residues whose structures are inconsistent with the translated sequences of their parent proteins, such as a phosphorylated serine which is annotated at the position of a tyrosine in the translated sequence of its parent. Second, the consortium corrected these errors. These improvements will be published with the next release later this year.
To enable the consortium to continue to use BpForms to quality control PRO, we developed a script which automates this analysis. Going forward, the consortium also plans to use BpForms and BcForms to visualize and export proteoforms to molecular formats such as SMILES.
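The two kinds of errors reported above, out-of-range processing sites and modified residues that disagree with the translated sequence, can be expressed as short consistency checks. The following is an illustrative sketch only and is not the script developed for PRO: the function name, the input layout, and the `PARENT_RESIDUE` table are assumptions made for this example.

```python
from typing import Dict, List, Tuple

# Hypothetical mapping from a modified residue to the canonical amino acid it derives from.
PARENT_RESIDUE: Dict[str, str] = {
    "phosphoserine": "S",
    "phosphothreonine": "T",
    "phosphotyrosine": "Y",
}


def check_proteoform(
    sequence: str,                          # translated sequence of the parent protein
    processing_sites: List[Tuple[int, int]],  # (start, end) positions, 1-based and inclusive
    modifications: List[Tuple[int, str]],     # (position, modified residue name)
) -> List[str]:
    """Return a list of human-readable inconsistencies; an empty list means no errors found."""
    errors = []
    length = len(sequence)
    for start, end in processing_sites:
        if not 1 <= start <= end <= length:
            errors.append(f"processing site {start}-{end} lies outside the sequence (length {length})")
    for position, name in modifications:
        if not 1 <= position <= length:
            errors.append(f"{name} at position {position} lies outside the sequence (length {length})")
            continue
        expected = PARENT_RESIDUE.get(name)
        actual = sequence[position - 1]
        if expected is not None and expected != actual:
            errors.append(f"{name} annotated at position {position}, but the translated residue is {actual}")
    return errors


# Illustrative entry: the site extends past the sequence, and a phosphoserine sits on a tyrosine.
problems = check_proteoform("MSTY", [(2, 9)], [(4, "phosphoserine"), (2, "phosphoserine")])
```

Returning messages instead of raising on the first problem lets a curator see every inconsistency in an entry at once.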
Systems biology: Analysis of the metabolic cost of prokaryotic tRNA modification.
To achieve WC models, we must integrate information about all of the processes in cells and their interactions. Here, we illustrate how BpForms can help integrate information about the interaction between the RNA modification and metabolism of E. coli and identify gaps in models.
First, we estimated the abundance of each tRNA from the total observed abundance of tRNA and the observed relative abundance of each tRNA. Second, we estimated the synthesis rate of each tRNA from the estimated abundance of each tRNA, the observed half-life of tRNA(Asn), and the observed doubling time of E. coli in glucose media. Third, we used BpForms to analyze the curated modifications of each tRNA. Fourth, we estimated the total synthesis rate of each modification from the synthesis rate and modification of each tRNA (Figure 3).
This analysis revealed that E. coli tRNA contain 26 modified residues, and that the five most abundant residues account for 73.8% of all modifications. Next, we tried to use the iML1515 metabolic model, one of the most comprehensive models of cellular metabolism, to analyze the impact of these modifications on metabolism and understand how E. coli allocates its limited metabolic resources among these modifications. This analysis revealed that the model only represents one of the modified residues (9U, pseudouridine). Therefore, the model must be expanded to capture the metabolic cost of tRNA modification.
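The second and fourth steps of this estimate can be sketched numerically. The text does not state the formula that was used, so the code below rests on an explicit assumption: at steady state, synthesis must replace both degradation (rate ln 2 / half-life) and dilution by growth (rate ln 2 / doubling time). The function names and all input values are hypothetical.

```python
import math
from collections import defaultdict
from typing import Dict


def synthesis_rate(abundance: float, half_life: float, doubling_time: float) -> float:
    """Steady-state synthesis rate (molecules per unit time), ASSUMING first-order
    degradation and dilution by exponential growth. Times must share one unit."""
    return abundance * math.log(2) * (1.0 / half_life + 1.0 / doubling_time)


def modification_synthesis_rates(
    trna_rates: Dict[str, float],                  # tRNA name -> synthesis rate
    trna_modifications: Dict[str, Dict[str, int]],  # tRNA name -> {modified residue: count}
) -> Dict[str, float]:
    """Sum, over all tRNAs, the synthesis rate times the number of copies of each modification."""
    totals = defaultdict(float)
    for trna, rate in trna_rates.items():
        for residue, count in trna_modifications.get(trna, {}).items():
            totals[residue] += rate * count
    return dict(totals)


# Made-up inputs for two tRNAs; "9U" is the pseudouridine code used in the text,
# and "X" stands for an arbitrary second modification.
rates = {"tRNA_a": 10.0, "tRNA_b": 20.0}
modifications = {"tRNA_a": {"9U": 2, "X": 1}, "tRNA_b": {"9U": 1}}
totals = modification_synthesis_rates(rates, modifications)
```

Ranking the resulting totals is what identifies the most abundant modifications, and each total is a demand that a metabolic model would need to supply.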
Systems biology: Systematic identification of gaps in the Kholodenko model of MAPK signaling.
The Kholodenko model of the eukaryotic MAPK signaling cascade describes how the cascade transduces extracellular signals for growth, differentiation, and survival into the phosphorylation state of MAPK. However, the model does not account for factors such as the cell’s nutritional status.
Toward a more holistic model of the cascade, we used BpForms to systematically identify gaps in the Kholodenko model and opportunities to merge the model with models of other pathways. First, we obtained an SBML-encoded version of the model. Second, we determined the specific proteins represented by the model. We had to do this manually because Kholodenko did not report this information. Third, we curated the sequences and post-translational modifications of the species represented by the model from UniProt and encoded them into BpForms (Figure 4A). Fourth, we embedded these BpForms representations into the SBML representation of the model. We believe that the BpForms annotations make the model more understandable.
Fifth, we used the BpForms annotations to systematically identify missing proteoforms that could help the model better explain how the MAPK pathway transduces signals. Specifically, we used BpForms to identify two missing combinations of the individual protein modifications represented by the model and four missing reactions that involve these species (Figure 4B). These additional species and reactions could help the model better capture the kinetics of MAPKK and MAPKKK activation and deactivation and, in turn, better capture how the pathway transduces signals.
Next, we used the BpForms annotations to identify opportunities to merge the Kholodenko model with models of other signaling cascades. Specifically, we searched BioModels for other models that represent similar proteoforms. This analysis identified several models that represent EGFR, PI3K, S6K, and the transcriptional outputs of the MAPK pathway that could be composed with the Kholodenko model. Furthermore, this combination of models enabled us to identify emergent combinations of proteoforms that are missing from the individual models (Figure 4C).
Lastly, to identify opportunities to merge the Kholodenko model with a model of metabolism, we used the BpForms annotations to systematically identify unbalanced reactions with missing metabolites. This analysis identified four missing species that, if added to the Kholodenko model, would make the model composable with models of metabolism (Figure 4D).
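The search for missing combinations of modifications can be pictured as enumerating every subset of a protein's modification sites and subtracting the states that the model already represents. The sketch below illustrates that idea under stated assumptions: the function name is hypothetical, the site labels are placeholders rather than the sites curated for the Kholodenko model, and every site is treated as independently modifiable.

```python
from itertools import combinations
from typing import FrozenSet, Iterable, List, Set


def missing_proteoforms(
    sites: List[str],
    represented: Iterable[Iterable[str]],
) -> Set[FrozenSet[str]]:
    """Return every combination of modified sites that no represented species covers."""
    all_states = {
        frozenset(combo)
        for size in range(len(sites) + 1)
        for combo in combinations(sites, size)
    }
    return all_states - {frozenset(state) for state in represented}


# Placeholder protein with two phosphorylation sites. The model represents the
# unmodified form, one singly modified form, and the doubly modified form.
sites = ["site_1", "site_2"]
represented = [[], ["site_1"], ["site_1", "site_2"]]
missing = missing_proteoforms(sites, represented)
```

The number of states grows as two to the power of the number of sites, so for heavily modified proteins a practical analysis would likely restrict the enumeration to combinations supported by data.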
Synthetic biology: Systematic identification of design constraints.
A promising way to engineer cells is to combine naturally-occurring parts, such as genes that encode metabolic enzymes, in an accommodating host, such as E. coli. However, there are numerous potential barriers to transforming parts into other cells. For example, parts that require post-translational modifications cannot be transformed into cells which cannot synthesize the modifications. Currently, it is difficult to identify such design constraints because we have limited tools to describe the dependencies of parts. Here, we illustrate how BpForms can systematically identify potential flaws in the design of a novel strain of E. coli due to missing post-translational modification machinery.
First, we used the PDB and BpForms to identify all of the modifications that have been observed in E. coli. Second, we used the PDB and BpForms to identify modifications which have never been observed in E. coli and the proteins which contain these modifications. For example, we found that proteins that contain 4-hydroxyproline (PDB CCD: HYP), such as collagen (UniProt: P02452), potentially cannot be transformed into E. coli. Third, we used the literature to confirm the absence of these modifications from E. coli. Table S5 lists the most common modifications which could constrain the transformation of proteins into E. coli.
Bioengineers could use this information to more reliably modify strains by limiting designs to post-translationally compatible proteins or by co-transforming parts with their requisite post-translational modification machinery. Furthermore, the synthetic biology community could make such information more accessible for learning design rules by incorporating this information into parts repositories such as SynBioHub. This information would enable these repositories to function as dependency management systems for synthetic organisms, analogous to the Advanced Package Tool (APT) for Ubuntu packages.
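The dependency analogy can be made concrete with a set difference between the modifications a part requires and those the host is known to make. The following sketch is illustrative only: the function name is hypothetical and the host set is a toy assumption, not a result of the analysis; only the collagen and HYP pairing comes from the text.

```python
from typing import Dict, Set


def incompatible_parts(
    host_modifications: Set[str],
    part_modifications: Dict[str, Set[str]],
) -> Dict[str, Set[str]]:
    """Map each part to the modifications it requires that the host has not been observed to make."""
    report = {}
    for part, required in part_modifications.items():
        unmet = required - host_modifications
        if unmet:
            report[part] = unmet
    return report


# Toy host set (an assumption for illustration) and three candidate parts.
host = {"SEP"}
parts = {
    "collagen (UniProt: P02452)": {"HYP"},   # requires 4-hydroxyproline, as in the text
    "hypothetical_kinase": {"SEP"},
    "hypothetical_enzyme": set(),
}
report = incompatible_parts(host, parts)
```

A parts repository could run such a check at design time and report the unmet modifications as machinery that must be co-transformed, much as a package manager reports missing dependencies.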
3. Discussion
Realizing the full potential of BpForms and BcForms as formats for the primary structures of macromolecules will require acceptance by the omics, systems biology, and synthetic biology communities. We have begun to solicit users by submitting the BpForms and BcForms grammars to the FAIRsharing registry of standards and the EDAM ontology of formats, contributing the alphabets of residues and the ontology of crosslinks to BioPortal, proposing a protocol for using BpForms with SBOL, and helping the PRO consortium use BpForms to represent proteoforms. To further promote community adoption, we plan to encourage the developers of central repositories of DNA, RNA, and protein modifications such as MethSMRT, the PDB, and RMBase to export their data in BpForms format. We also plan to stimulate discussion among the BioPAX, CellML, and SBML communities about formalizing our integrations of BpForms and BcForms with their formats. Additionally, we plan to use the grammars to generate parsers for other languages, such as C++, to help developers incorporate BpForms and BcForms into software tools.
Because BpForms and BcForms aim to help researchers exchange information, we believe that the alphabets of residues, the ontology of crosslinks, and the grammars should ultimately become community standards. To start, we encourage the community to contribute to BpForms and BcForms via Git pull requests. Going forward, we would like these resources to be governed by the community through an organization such as the Computational Modeling in Biology Network (COMBINE).
BpForms and BcForms achieve abstract descriptions of macromolecules by combining a closed, defined grammar with open, extensible ontologies of residues and crosslinks. This hybrid approach enables BpForms and BcForms to integrate diverse data into chemically-concrete descriptions of a wide range of macromolecules. Achieving WC models similarly requires integrating heterogeneous data about a wide range of processes from a wide range of methods and sources into physically-concrete kinetic simulations. Consequently, we believe that hybrid open-closed approaches such as BpForms and BcForms will be essential for WC modeling. For example, we are developing a hybrid methodology that enables chemically-concrete coarse-grained simulations by using fine-grained reactions to describe the chemical semantics of coarse-grained reactions.
We have begun to use BpForms and BcForms to describe the chemical semantics of the species represented by network models. Going forward, we also plan to use BpForms and BcForms to help network models capture finer-grained mechanisms that involve combinatorial interactions, such as how methylation impacts transcription factor-DNA binding. To do this, we are developing a generalized rule-based modeling framework which encapsulates properties such as primary structures into species and links these properties to reactions and rate laws. We anticipate that this framework, together with BpForms and BcForms, will make it easier to build fine-grained kinetic models of complex processes such as transcriptional backtracking, ribosomal queuing, and tmRNA ribosomal rescuing and combine them into WC models.
4. Conclusions
The BpForms-BcForms toolkit abstracts the primary structure of polymers and complexes, including non-canonical residues, caps, crosslinks, nicks, and several types of missing information. Furthermore, the toolkit standardizes the representation of common residues and crosslinks while extensibly accommodating any residue and crosslink by supporting both centrally and user-defined abstractions of residues and crosslinks. The toolkit includes the most extensive alphabets of hundreds of DNA, RNA, and protein residues; the first ontology of common crosslinks; an intuitive coordinate system for the subunits, residues, and atoms in macromolecules; the first human and machine-readable grammar for composing residues, caps, crosslinks, and nicks into polymers and complexes; and user-friendly web, REST, command-line and Python interfaces. The toolkit is backward compatible with the IUPAC/IUBMB format to maximize compatibility with existing bioinformatics tools and knowledge. The toolkit can also be integrated with frameworks for network research such as BioPAX, CellML, SBML, and SBOL.
We anticipate that BpForms and BcForms will be valuable tools for omics, systems biology, and synthetic biology. First, the tools can help researchers precisely communicate information about macromolecules. For example, the tools can help experimentalists communicate observations of proteoforms and help bioinformaticians exchange information among databases of polymers and complexes. Similarly, the tools can make models and genetic designs more understandable by capturing the semantic meaning of the species represented by models and capturing the structures of the parts of synthetic organisms. For example, BpForms could describe proteins produced by expanded genetic codes.
The tools can also help quality control information about macromolecules. For example, the tools could help researchers find errors in reconstructed proteoforms such as inconsistencies between the modified and translated sequences, merge duplicate entries in databases of proteoforms, and identify gaps and element imbalances in models.
In addition, BpForms and BcForms can help researchers integrate structural, epigenomic, transcriptomic, and proteomic information about macromolecules. For example, the tools can help researchers integrate observations of individual protein modifications into descriptions of entire proteoforms. The tools can also help researchers integrate databases of modified proteins into a model of post-translational processing, combine the model with models of other processes to create WC models, and refine the model by identifying missing combinations of protein states. Similarly, the tools can help bioengineers design biochemical networks by identifying parts that must be co-transformed with post-transcriptional and post-translational modification machinery.
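As a rough illustration of the element-imbalance check mentioned above, the following sketch compares the elemental composition of the two sides of a reaction. It is not part of BpForms or BcForms: the function name `element_imbalance`, the data layout, and the example species (neutral formulas for a serine residue, a phosphoserine residue, ATP, and ADP) are assumptions chosen for illustration. In practice, the formulas would presumably be calculated from the BpForms and BcForms descriptions of each species.

```python
from collections import Counter


def element_imbalance(reactants, products):
    """Return net element counts (products minus reactants); {} if balanced.

    Each side is a list of (coefficient, formula) pairs, where a formula is
    a dict that maps element symbols to atom counts.
    """
    net = Counter()
    for sign, side in ((1, products), (-1, reactants)):
        for coefficient, formula in side:
            for element, count in formula.items():
                net[element] += sign * coefficient * count
    # Keep only the elements that do not cancel out.
    return {element: n for element, n in net.items() if n != 0}


# Illustrative neutral formulas (assumptions for this sketch).
SER = {"C": 3, "H": 5, "N": 1, "O": 2}            # serine residue
PSER = {"C": 3, "H": 6, "N": 1, "O": 5, "P": 1}   # phosphoserine residue
ATP = {"C": 10, "H": 16, "N": 5, "O": 13, "P": 3}
ADP = {"C": 10, "H": 15, "N": 5, "O": 10, "P": 2}

# A phosphorylation written without its phosphate donor is imbalanced.
print(element_imbalance([(1, SER)], [(1, PSER)]))
# Adding the donor and its product balances the reaction.
print(element_imbalance([(1, SER), (1, ATP)], [(1, PSER), (1, ADP)]))
```

A non-empty result, such as the net H, O, and P reported for the first reaction, is the kind of gap that could point to a metabolite missing from a model.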
5. Methods
We designed BpForms and BcForms as separate, but interrelated tools, to provide users lightweight tools for the distinct use cases of describing polymers and complexes. We implemented the toolkit using Python, ChemAxon Marvin, Flask-RESTPlus, Lark, Open Babel, YAML Ain't Markup Language, and Zurb Foundation. Additional File 1.10 provides more information about the implementation.
Declarations
Availability of data and materials
The web applications are located at https://bpforms.org and https://bcforms.org, the REST APIs are located at https://bpforms.org/api and https://bcforms.org/api, the command-line programs and Python libraries are available from PyPI, and the code and ontologies are available at https://github.com/KarrLab.
BpForms and BcForms are available open-source under the MIT license. Optionally, a license for ChemAxon Marvin is needed to calculate protonation and tautomerization states and generate molecular visualizations. Free licenses are available for academic researchers.
BpForms and BcForms are platform independent. The installation of BpForms and BcForms requires Python 3.6 or higher, Open Babel, and, optionally, ChemAxon Marvin. A Docker image with these dependencies is available at http://dockerhub.com/u/karrlab.
Documentation, including installation instructions, is available at https://docs.karrlab.org. Interactive Jupyter notebook tutorials are available at https://sandbox.karrlab.org.
This article refers to versions 0.0.9 of BpForms and 0.0.2 of BcForms.
Competing interests
The authors declare that they have no competing interests.
Funding
This work was supported by the National Institutes of Health [grant numbers R35 GM119771, P41 EB023912]; the National Science Foundation [grant number 1649014]; and the Engineering and Physical Sciences Research Council [grant number EP/L016494/1].
Authors’ contributions
PFL, YC, XZ, DAN, and JRK built the alphabets of residues and the ontology of crosslinks. XZ, BS, and JRK developed the software. XZ, DAN, and JRK developed the case studies. PFL, YC, JAPS, and JRK wrote the manuscript. All authors read and approved the final manuscript.
Acknowledgements
We thank Chris Myers and Jacob Beal for helpful discussion about integrating BpForms with SBOL and Nicola Hawes for help designing Figure 1.
Figure legends and boxes
[Figure 1 panels. A. Alphabets of residues: DNA, RNA, protein, user-defined. B. Ontology of crosslinks: disulfide bond, isopeptide bond, thioester bond, user-defined. C. Grammar for polymers (example: modified tRNA, … A{cnmA}GU{25U}CU …). D. Grammar for complexes (example: disulfide-linked homodimer). E. Calculated properties: molecular structure, major microspecies, formula, weight, charge. F. Exported formats: structure (SMILES), canonical sequence (IUPAC), image (PNG, SVG, ...). G. Integrations with network formats: pathways (BioPAX), models (CellML, SBML), designs (SBOL). H. User interfaces: web app, CLI, REST, Python.]
Figure 1. The BpForms-BcForms toolkit can abstract, validate, and analyze the primary structures of non-canonical polymers and complexes and help integrate structural information about macromolecules into networks. The toolkit includes (A) extensible alphabets that represent individual DNA, RNA and protein residues; (B) an ontology of crosslinks; (C) a grammar for composing polymers from residues, caps, crosslinks and nicks; (D) a grammar for composing complexes from polymers and crosslinks; software tools for validating descriptions of macromolecules, (E) calculating molecular properties of macromolecules, (F) exporting macromolecules to other formats, and visualizing macromolecules; (G) protocols for integrating structural information about macromolecules into omics, systems biology, and synthetic biology formats for networks, models, and genetic designs; and (H) multiple user interfaces.
[Figure 2 panels. The drawing shows a dimer (A) of two copies of a tripeptide (B; subunits 1: Pept and 2: Pept), each with residues 1: A (alanine), 2: C (cysteine), and 3: U (selenocysteine). It annotates inter-residue bonds (left bond atom, left displaced atom, right bond atom, right displaced atom, bond), crosslinks (bond atom, displaced atom, bond), the coordinate system (atom in residue, residue in sequence, subunit in repeated subunit of complex), and the types of compounds (residue, polymer, complex).
C. Complex (BcForms): Dimer: 2 * Pept | x-link: [ id: "disulfide" | l: PeptA(1)-2 | r: PeptA(2)-2]
D. Subunit (BpForms): Pept: ACU
E. Residues (in alphabet): A: C[C@H]([NH3+])C(=O)O; C: OC(=O)[C@@H]([NH3+])CS; U: N[C@H](C(=O)O)C[SeH]
F. Crosslink (in crosslinks ontology): disulfide: C-S11|C-S11]
Figure 2. BpForms and BcForms abstract the primary structures of polymers and complexes as combinations of residues, crosslinks, and nicks. For example, BcForms abstracts a disulfide-linked homodimer (A, green box) of a selenocysteine-modified tripeptide (B, blue boxes) as two copies of the tripeptide and a single crosslink (C, green text) and BpForms abstracts the peptide as a sequence of three residues, including selenocysteine (U) (D, blue text). These abstractions are enabled by alphabets of residues (E, black text) and an ontology of crosslinks (F, black text).
[Figure 3 panels. A and B: plots with y-axes labeled Freq (nt cell cycle^-1) and x-axes labeled canonical residue (A) and modified residue (B).]
Figure 3. BpForms and BcForms can facilitate integrative analyses of fine-grained global intracellular networks. For example, we used BpForms to estimate the metabolic cost of tRNA modification in E. coli by canonical residue (A) and modified residue (B) from information about the modification, abundance, and turnover of each tRNA.
[Figure 4 panels A to E. Network diagrams of the MAPK cascade (MAPKKKK, MAPKKK, MAPKK, and MAPK, with their -P and -PP phosphorylated forms), including GTP, GDP, H2O, H+, and Pi and connections to EGFR, PI3K, S6K, and transcriptional regulation. Legend: original model; BpForms annotations; proteoforms and reactions found by enumerating each combination of modifications; connections to other pathways found with BioModels; proteoforms and reactions found through composition with models of other pathways; connections to metabolism identified by mass balance.]
Figure 4. BpForms and BcForms can facilitate the construction, expansion, composition, and refinement of fine-grained global intracellular networks. For example, we used BpForms to systematically identify ways to improve and expand the Kholodenko model of MAPK signaling (A, grey) by using BpForms to capture the semantic meaning of each species (A, red), identify missing protein states (B, blue), identify other models that represent similar proteins which could be composed with the Kholodenko model (C, yellow) which could reveal additional missing combinations of species (C, green), and identify mass imbalances which indicate missing metabolites which could facilitate composition with metabolic models (D). Together, this could enable a substantially expanded model (E).
Residue sequence
This example illustrates how to use BpForms to describe a DNA which begins with deoxyinosine.
    {dI}ACGC
User-defined residues
Residues which are not captured by our public alphabets can be captured within descriptions of polymers. This example illustrates how to describe a protein which ends with N5-methyl-L-arginine.
    CRGN[id: "AA0305"
        | structure: "OC(=O)[C@H](CCCN(C(=[NH2])N)C)[NH3+]"
        | l-bond-atom: N16-1
        | r-bond-atom: C2
        | l-displaced-atom: H16+1
        | l-displaced-atom: H16
        | r-displaced-atom: O1
        | r-displaced-atom: H1
        | name: "N5-methyl-L-arginine"
        | synonym: "delta-N-methylarginine"
        | synonym: "N5-carbamimidoyl-N5-methyl-L-ornithine"
        | identifier: "MOD:00310" @ "mod"
        | identifier: "CHEBI:21848" @ "chebi"
        | base-monomer: "R"
        | comments: "Generated by protein-arginine N5-methyltransferase (EC 2.1.1.-)."]
Crosslinks and nicks
This example illustrates how to describe a peptide that contains a disulfide bond between the cysteines at the first and third positions and a nick between the cysteine and alanine at the first and second positions.
    C:AC | x-link: [id: "disulfide" | l: 1 | r: 3]
User-defined crosslinks
Crosslinks which are not captured by our public ontology can be described inline. This example illustrates how to describe a peptide that contains a disulfide bond between the cysteines at the first and third positions.
    CAC | x-link: [l-bond-atom: 1S11
        | r-bond-atom: 3S11
        | l-displaced-atom: 1H11
        | r-displaced-atom: 3H11
        | comments: "disulfide bond between 1C and 3C"]
Circularity
This example illustrates how to describe a circular di-deoxyribonucleic acid.
    AC | circular
Missing knowledge
User-defined residues can also capture missing information about the mass, charge, location, and biosynthesis of residues. This example illustrates how to describe a protein which contains a methylated cysteine or asparagine at an unknown position between the fifth and tenth residues.
    CRGN[base-monomer: "C"
        | delta-mass: 12
        | delta-charge: 0
        | position: 5-10 [C, N]]EGYNNYCRAKYRGH
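To make the sequence syntax in these examples more concrete, the sketch below splits the residue portion of a description into residue codes and nick positions. It is a simplified, hypothetical tokenizer, not the Lark grammar that BpForms itself uses. The names `TOKEN` and `tokenize` are assumptions. It handles only single-character codes, multi-character codes in braces such as {dI}, and the nick marker ":". It deliberately rejects inline residue definitions in square brackets and ignores any attributes after the first "|".

```python
import re

# One token: a braced multi-character code, a single-letter code, or a nick.
TOKEN = re.compile(r"\{([^{}]+)\}|([A-Za-z])|(:)")


def tokenize(description):
    """Split the residue portion of a description into codes and nicks.

    Returns (residues, nicks), where nicks lists the 1-based position of
    the residue that precedes each nick.
    """
    sequence = description.split("|")[0].strip()  # drop trailing attributes
    residues, nicks = [], []
    pos = 0
    while pos < len(sequence):
        match = TOKEN.match(sequence, pos)
        if match is None:
            raise ValueError(
                f"unsupported syntax at offset {pos}: {sequence[pos:]!r}")
        braced, single, nick = match.groups()
        if nick:
            nicks.append(len(residues))
        else:
            residues.append(braced or single)
        pos = match.end()
    return residues, nicks


print(tokenize("{dI}ACGC"))
print(tokenize('C:AC | x-link: [id: "disulfide" | l: 1 | r: 3]'))
```

Keeping nicks as positions rather than as tokens in the residue list means that residue numbering, which the crosslink attributes rely on, is unaffected by nicks.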
Box 1. Examples of the BpForms grammar for describing polymers.
Subunit composition
This example illustrates how to use BcForms to describe MalEFGK (Complex Portal: CPX-1932), a heteropentameric maltose ABC transporter.
    MalE + MalF + MalG + 2 * MalK
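The subunit composition above can be read as a sum of terms, each an optional copy number followed by a subunit identifier. The sketch below turns such a string into a dictionary of copy numbers. It is a hypothetical helper written for illustration (`parse_subunits` is not a BcForms function), and it ignores any crosslink attributes that follow the first "|".

```python
import re

# One term: an optional "<count> *" prefix followed by a subunit identifier.
TERM = re.compile(r"(?:(\d+)\s*\*\s*)?(\w+)")


def parse_subunits(description):
    """Return a dict that maps each subunit id to its copy number."""
    composition = description.split("|")[0]  # drop crosslink attributes
    counts = {}
    for term in composition.split("+"):
        match = TERM.fullmatch(term.strip())
        if match is None:
            raise ValueError(f"cannot parse term: {term!r}")
        count = int(match.group(1) or 1)  # a missing count means one copy
        subunit = match.group(2)
        counts[subunit] = counts.get(subunit, 0) + count
    return counts


malefgk = parse_subunits("MalE + MalF + MalG + 2 * MalK")
print(malefgk, sum(malefgk.values()))
```

Summing the copy numbers gives five subunits for MalEFGK, consistent with its description as a heteropentamer.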
Crosslinks
This example illustrates how to use the crosslinks ontology to describe a disulfide-linked antiparallel homodimer of disintegrin schistatin of Echis carinatus (UniProt: P83658).
User-defined crosslinks
Crosslinks which are not captured by our public ontology can be defined within descriptions of complexes. This example illustrates how to describe the crosslinking of 10 kDa chaperonin (UniProt: P9WPE5) of Mycobacterium tuberculosis with prokaryotic ubiquitin-like protein Pup (UniProt: P9WHN5) via an isoglutamyl lysine isopeptide bond (RESID: AA0124). Cells use this crosslink to mark 10 kDa chaperonin for proteasomal degradation.
    P9WPE5 + P9WHN5 | x-link: [
        l-bond-atom: P9WHN5(1)-100N1-1
        | r-bond-atom: P9WPE5(1)-63C2
        | l-displaced-atom: P9WHN5(1)-100H1+1
        | l-displaced-atom: P9WHN5(1)-100H1
        | r-displaced-atom: P9WPE5(1)-63N1
        | r-displaced-atom: P9WPE5(1)-63H1
        | r-displaced-atom: P9WPE5(1)-63H1
        | comments: "isoglutamyl lysine isopeptide bond"]
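The atom coordinates in the crosslink examples appear to combine a subunit and copy index, a residue position, an element, an atom position, and an optional charge. The sketch below decodes them under that reading, which is an assumption inferred from the examples and from the coordinate system in Figure 2 rather than taken from the grammar itself. The names `AtomCoordinate` and `parse_atom` are hypothetical.

```python
import re
from typing import NamedTuple, Optional


class AtomCoordinate(NamedTuple):
    subunit: Optional[str]     # subunit id; None inside a single polymer
    copy_index: Optional[int]  # which copy of a repeated subunit
    residue: int               # position of the residue in the sequence
    element: str               # element symbol of the atom
    atom: int                  # position of the atom within the residue
    charge: int                # formal charge; 0 when not given


# Assumed layout: [subunit(copy)-]<residue><element><atom>[<charge>]
PATTERN = re.compile(
    r"(?:(?P<subunit>\w+)\((?P<copy>\d+)\)-)?"
    r"(?P<residue>\d+)(?P<element>[A-Z][a-z]?)(?P<atom>\d+)"
    r"(?P<charge>[+-]\d+)?"
)


def parse_atom(text):
    """Decode a coordinate such as '1S11' or 'P9WHN5(1)-100N1-1'."""
    match = PATTERN.fullmatch(text.strip())
    if match is None:
        raise ValueError(f"cannot parse atom coordinate: {text!r}")
    groups = match.groupdict()
    return AtomCoordinate(
        subunit=groups["subunit"],
        copy_index=int(groups["copy"]) if groups["copy"] else None,
        residue=int(groups["residue"]),
        element=groups["element"],
        atom=int(groups["atom"]),
        charge=int(groups["charge"]) if groups["charge"] else 0,
    )


print(parse_atom("P9WHN5(1)-100N1-1"))
print(parse_atom("1S11"))
```

Under this reading, the left bond atom in the example is nitrogen 1 of residue 100 in the first copy of P9WHN5, with a charge of -1.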
Box 2.
Examples of the