BpForms and BcForms: Tools for concretely describing non-canonical polymers and complexes to facilitate comprehensive biochemical networks

Abstract

Although non-canonical residues, caps, crosslinks, and nicks play an important role in the function of many DNA, RNA, proteins, and complexes, we do not fully understand how networks of non-canonical macromolecules generate behavior. One barrier is our limited formats, such as IUPAC, for abstractly describing macromolecules. To overcome this barrier, we developed BpForms and BcForms, a toolkit of ontologies, grammars, and software for abstracting the primary structure of polymers and complexes as combinations of residues, caps, crosslinks, and nicks. The toolkit can help quality control, exchange, and integrate information about the primary structure of macromolecules into fine-grained global networks of intracellular biochemistry.


Paul F. Lang, Yassmine Chebaro, Xiaoyue Zheng, John A. P. Sekar, Bilal Shaikh, Darren A. Natale, and Jonathan R. Karr

Icahn Institute for Data Science and Genomic Technology, Icahn School of Medicine at Mount Sinai, New York, NY 10029, USA
Department of Genetics and Genomic Sciences, Icahn School of Medicine at Mount Sinai, New York, NY 10029, USA
Department of Biochemistry, Oxford University, South Parks Road, Oxford OX1 3QU, UK
Institut de Génétique et de Biologie Moléculaire et Cellulaire, Institut National de la Santé et de la Recherche Médicale, Centre National de la Recherche Scientifique, Université de Strasbourg, 67404, Illkirch, France
Protein Information Resource, Georgetown University Medical Center, Washington, DC 20007, USA

* These authors contributed equally to this work
** Correspondence: [email protected]

26, 2019


Keywords: format; software; polymer; proteoform; complex; residue; modification; crosslink; fine-grained network; genome-scale network

1. Background

A central goal in biology is to understand how networks of metabolites, DNA, RNA, proteins, and complexes generate behavior. Non-canonical residues, caps, crosslinks, and nicks are essential to these networks. For example, prokaryotic restriction/modification systems use methylation to selectively degrade foreign DNA, tRNA use pseudouridine to translate multiple codons, and signaling networks use phosphorylation to encode information into the states of proteins.

Recent technical advances have enabled detailed information about individual DNA, RNA, and protein modifications. For example, SMRT-seq can identify the locations of DNA methylations with single-nucleotide resolution and mass-spectrometry can identify hundreds of protein modifications. Furthermore, several repositories have compiled extensive data about non-canonical residues and crosslinks in DNA, RNA, and proteins, as well as data about the subunit composition and crosslinks of complexes.

Despite this progress, it remains difficult to integrate this information into fine-grained global networks of intracellular biochemistry, in part, because these resources use chemically-ambiguous and incompatible formats. Consequently, we still do not have a holistic understanding of how non-canonical macromolecules help generate behavior.

Whole-cell (WC) models, which aim to predict phenotype from genotype by representing all of the biochemical activity in cells, are a promising tool for integrating diverse information about macromolecules into a holistic understanding of cellular behavior. However, it remains challenging to build fine-grained, global biochemical networks, such as WC models, because we have few tools for capturing the structures of non-canonical macromolecules and linking them together into networks. For example, formats such as BioNetGen and the Systems Biology Markup Language (SBML) are cumbersome for modeling post-transcriptional modification because they have limited capabilities to represent the primary structure of RNA. Abstractions of the primary structures of macromolecules that can be combined with modeling frameworks such as SBML would provide a significant step toward fine-grained global biochemical networks. Combined with software tools, such abstractions could also facilitate the curation, exchange, and quality control of structural information about macromolecules for a wide range of omics and systems and synthetic biology research.

Currently, several formats have limited abilities to abstract the primary structures of non-canonical polymers and complexes. Molecular formats which represent each atom and bond, such as the International Chemical Identifier (InChI), the PDB format, and the Simplified Molecular-Input Line-Entry System (SMILES), can represent non-canonical residues, caps, crosslinks, and nicks. However, their fine granularity is cumbersome for network-scale research. Omics and systems biology formats, such as BioPAX, the Biological Expression Language (BEL), the MODOMICS nomenclature, the PRO notation, ProForma, and the Synthetic Biology Open Language (SBOL), use abstractions that are conducive to network-scale research. However, these formats have limited abilities to represent non-canonical residues, caps, crosslinks and nicks, and they do not concretely represent the primary structures of macromolecules.

Toward fine-grained global networks of intracellular biochemistry, we developed BpForms-BcForms, an open-source toolkit for abstractly representing the primary structure of polymers and complexes.

BpForms includes extensible alphabets of hundreds of DNA, RNA and protein residues; an ontology of common crosslinks; and a human and machine-readable grammar for combining residues, residue modifications, intra-chain crosslinks, and nicks into polymers. BcForms includes a human and machine-readable grammar for combining polymers, small molecules, and inter-chain crosslinks into complexes. Both tools include software for validating descriptions of macromolecules, calculating properties of macromolecules such as their formula, visualizing macromolecules, and exporting macromolecules to molecular formats such as SMILES. Both tools are available as a web application, REST API, command-line program, and Python library.

Here, we describe the toolkit and demonstrate how it can facilitate omics, systems modeling, and synthetic biology. First, we describe the toolkit, including the alphabets of residues, the ontology of crosslinks, the grammars, the software tools, and the user interfaces. Second, we describe how BpForms and BcForms can be integrated with knowledge about pathways, kinetic models, and genetic designs through formats such as BioPAX, CellML, SBML, and SBOL. Next, we describe the advantages of the toolkit over existing formats for representing polymers and complexes and existing alphabets of residues. Lastly, we present multiple case studies that illustrate how the toolkit can help researchers describe, quality control, exchange, and integrate diverse information about macromolecules into networks. We anticipate that BpForms and BcForms will help facilitate fine-grained, global networks of cellular biochemistry.

2. Results

The BpForms-BcForms toolkit includes several interrelated tools for describing, validating, visualizing, and calculating properties of the primary structure of DNA, RNA, proteins, and complexes (Figure 1). Here, we describe the components of the toolkit including the abstractions and grammars for polymers and complexes; the alphabets of residues; the ontology of crosslinks; the software tools for quality controlling, analyzing, and visualizing macromolecules; the protocols for integrating BpForms and BcForms with formats for network research; and the user interfaces.

Abstract representation of the primary structure of polymers and complexes.

BpForms represents polymers as a sequence of residues, a set of crosslinks, a set of nicks, and a Boolean indicator of circularity (Figure 2B, D). BcForms represents complexes as a set of subunits and a set of crosslinks (Figure 2A, C). Each subunit is represented by its molecular structure and stoichiometry. The structure of each subunit can be described using BpForms or SMILES.
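The abstraction described above can be pictured as a small data model. The following is a minimal sketch in plain Python, not the toolkit's own classes: the names `Polymer`, `Subunit`, `Complex`, and `check` are hypothetical and chosen only to mirror the sequence, crosslinks, nicks, circularity, and stoichiometry mentioned in the text.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple, Union

# Hypothetical address of an atom: (residue position, atom position), both 1-based.
AtomRef = Tuple[int, int]


@dataclass
class Polymer:
    residues: List[str]                                    # residue codes, in sequence order
    crosslinks: List[Tuple[AtomRef, AtomRef]] = field(default_factory=list)
    nicks: List[Tuple[int, int]] = field(default_factory=list)  # adjacent residues that are not bonded
    circular: bool = False

    def check(self) -> List[str]:
        """Return structural problems that can be found without any chemistry."""
        n = len(self.residues)
        errors = []
        for i, j in self.nicks:
            linear_ok = 1 <= i < n and j == i + 1
            wrap_ok = self.circular and i == n and j == 1   # last-to-first bond of a circle
            if not (linear_ok or wrap_ok):
                errors.append(f"nick ({i}, {j}) does not separate adjacent residues")
        for left, right in self.crosslinks:
            for position, _atom in (left, right):
                if not 1 <= position <= n:
                    errors.append(f"crosslink refers to residue {position}, outside 1..{n}")
        return errors


@dataclass
class Subunit:
    structure: Union[Polymer, str]   # a Polymer, or a SMILES string for a small molecule
    stoichiometry: int = 1


@dataclass
class Complex:
    subunits: Dict[str, Subunit]
    crosslinks: list = field(default_factory=list)   # inter-chain crosslinks, left untyped in this sketch

    def subunit_count(self) -> int:
        # Total number of subunit copies in the complex.
        return sum(s.stoichiometry for s in self.subunits.values())
```

A homodimer would then be a `Complex` with one `Subunit` whose `stoichiometry` is 2, which is the situation Figure 2 is described as illustrating.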

Residues.

Each residue is represented by its molecular structure, a list of the atoms which can form bonds with preceding and following residues, and a list of the atoms which are displaced by the formation of these bonds (Figure 2E). These lists of atoms are optional to enable the toolkit to represent internal nucleic and amino acids, as well as 3' and 5' caps. The toolkit can also capture metadata and missing information about residues.

Crosslinks.

Each crosslink is represented as lists of the atoms which can form a bond between residues and the atoms which are displaced by the formation of these bonds (Figure 2F). The toolkit represents each nick as a tuple of adjacent residues which are not bonded.

Alphabets of residues and ontology of crosslinks.

The toolkit uses a hybrid approach to abstract the molecular details of residues and crosslinks from the descriptions of macromolecules. The chemical details of common residues and crosslinks are abstracted into alphabets of residues and an ontology of crosslinks. Users can define additional residues and crosslinks within descriptions of macromolecules or create custom alphabets and ontologies. This hybrid approach standardizes the representation of common residues and crosslinks while enabling the toolkit to represent any residue or crosslink.

Coordinate system.

The toolkit uses a structured coordinate system to describe the atoms involved in each inter-residue bond and crosslink. The coordinate of each repeated subunit ranges from one to the stoichiometry of the subunit. The coordinate of each residue is its position within the residue sequence of its parent polymer. The coordinate of each atom is its position within the canonical SMILES ordering of the atoms in its parent residue. Additional File 1.4 contains more information about the coordinate system.
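The three levels of the coordinate system (subunit copy, residue, atom) can be sketched as a nested, 1-based address with a range check at each level. This is an illustration only: `AtomAddress` and `address_errors` are hypothetical names, and the per-residue atom counts are supplied by the caller rather than derived from canonical SMILES.

```python
from typing import Dict, List, NamedTuple


class AtomAddress(NamedTuple):
    subunit: str   # name of the subunit
    copy: int      # 1..stoichiometry of that subunit
    residue: int   # 1..length of the subunit's residue sequence
    atom: int      # 1..number of atoms in the residue (canonical SMILES order)


def address_errors(
    addr: AtomAddress,
    stoichiometry: Dict[str, int],
    sequences: Dict[str, List[str]],
    atom_counts: Dict[str, int],
) -> List[str]:
    """Check each level of the address against its allowed range."""
    if addr.subunit not in stoichiometry:
        return [f"unknown subunit {addr.subunit!r}"]
    errors = []
    if not 1 <= addr.copy <= stoichiometry[addr.subunit]:
        errors.append(f"copy {addr.copy} exceeds the stoichiometry of {addr.subunit}")
    sequence = sequences[addr.subunit]
    if not 1 <= addr.residue <= len(sequence):
        errors.append(f"residue {addr.residue} is outside 1..{len(sequence)}")
        return errors   # the atom range cannot be checked without a valid residue
    residue_code = sequence[addr.residue - 1]
    if not 1 <= addr.atom <= atom_counts[residue_code]:
        errors.append(f"atom {addr.atom} is outside 1..{atom_counts[residue_code]}")
    return errors
```

For a dimer of a two-residue chain, `AtomAddress("monomer", 2, 2, 7)` would name the seventh atom of the second residue in the second copy of the subunit.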

Examples.

Boxes 1 and 2 illustrate the toolkit's grammars for describing polymers and complexes, and Figure 2 illustrates the chemical semantics of a homodimer encoded in the grammars. Additional File 1.2 and the BpForms and BcForms websites provide detailed descriptions of the grammars and additional examples. Additional File 1.3 contains formal descriptions of the grammars.

Alphabets of DNA, RNA, and protein residues.

To support a broad range of research, BpForms includes the most extensive alphabets of DNA, RNA, and protein residues to date. The DNA alphabet includes 422 deoxyribose nucleotide monophosphates and 3' and 5' caps derived from data about DNA damage and repair from REPAIRtoire, structural data from the Protein Data Bank Chemical Component Dictionary (PDB CCD), and chemoinformatics data from DNAmod. The RNA alphabet includes 378 ribose nucleotide monophosphates and 3' and 5' caps derived from biochemical data from MODOMICS and the RNA Modification Database and structural data from the PDB CCD. The protein alphabet has 1,435 amino acids and carboxy and amino termini derived from biochemical data from RESID and structural data from the PDB CCD. The BpForms website contains pages which display the residues in each alphabet. Additional File 1.5 describes how we constructed the alphabets.

Ontology of crosslinks.

To abstract the molecular structures of polymers and complexes, the toolkit includes the first ontology of crosslinks. Currently, the ontology contains 36 common crosslinks. We plan to continue to curate additional crosslinks as needed to represent WC models. The BpForms website contains a page which displays the crosslinks in the ontology. Additional File 1.6 describes how we constructed the ontology.

Syntactic and semantic validation of descriptions of macromolecules.

To help quality control information about macromolecules, the toolkit can verify the syntactic and semantic correctness of macromolecules encoded in BpForms and BcForms. First, the toolkit can verify that textual descriptions of macromolecules are syntactically consistent with the BpForms and BcForms grammars and identify any errors. Second, the toolkit can verify that macromolecules represented by BpForms and BcForms are semantically consistent and identify any errors. For example, the toolkit can identify pairs of adjacent amino acids that cannot form peptide bonds because the first amino acid does not have a carboxy terminus or because the second amino acid does not have an amino terminus. Additional File 1.7 details the semantic validations implemented by the toolkit. We anticipate that these quality controls will help researchers exchange reliable information and assemble this information into high-quality networks.
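The peptide-bond example can be sketched as a check over adjacent residues. This is a simplified illustration rather than the toolkit's implementation: `Residue` and `unbondable_pairs` are hypothetical names, and each residue's bonding sites are reduced to plain lists of atom labels.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Residue:
    code: str
    # Atoms that can bond to the preceding residue (e.g. the amino nitrogen).
    backward_bond_atoms: List[str] = field(default_factory=list)
    # Atoms that can bond to the following residue (e.g. the carboxy carbon).
    forward_bond_atoms: List[str] = field(default_factory=list)


def unbondable_pairs(sequence: List[Residue]) -> List[Tuple[int, int, str]]:
    """Report adjacent pairs (1-based positions) that cannot form an inter-residue bond."""
    problems = []
    for position, (first, second) in enumerate(zip(sequence, sequence[1:]), start=1):
        if not first.forward_bond_atoms:
            problems.append(
                (position, position + 1, f"{first.code} cannot bond to the following residue")
            )
        if not second.backward_bond_atoms:
            problems.append(
                (position, position + 1, f"{second.code} cannot bond to the preceding residue")
            )
    return problems
```

A residue without forward bonding atoms is acceptable at the end of a chain, which is why the check looks at pairs rather than at individual residues.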

Analyses of polymers and complexes.

The toolkit can calculate several properties of macromolecules such as their primary structure, major protonation and tautomerization states, chemical formula, molecular weight, and charge. We have begun to use these properties to quality control WC models. For example, we are using the chemical formulae to verify that each reaction is element and charge balanced, including reactions that represent transformations of macromolecules such as the post-transcriptional modification of tRNA.

The toolkit can also compare macromolecules to determine their equality or identify differences. We plan to use this feature to implement automated procedures for merging models that share species and reactions.
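Once each species has a chemical formula and a charge, the element and charge balance check amounts to a signed sum over the reaction's participants. The sketch below assumes formulas are given as element-count dictionaries and uses a small-molecule reaction for brevity; the function name `imbalance` is hypothetical.

```python
from collections import Counter
from typing import Dict, Tuple


def imbalance(
    reaction: Dict[str, int],              # species -> coefficient (negative for reactants)
    formulas: Dict[str, Dict[str, int]],   # species -> element counts
    charges: Dict[str, int],               # species -> net charge
) -> Tuple[Dict[str, int], int]:
    """Return the net element counts and net charge; ({}, 0) means balanced."""
    elements = Counter()
    net_charge = 0
    for species, coefficient in reaction.items():
        for element, count in formulas[species].items():
            elements[element] += coefficient * count
        net_charge += coefficient * charges[species]
    unbalanced = {element: n for element, n in elements.items() if n != 0}
    return unbalanced, net_charge


formulas = {
    "H2CO3": {"H": 2, "C": 1, "O": 3},
    "HCO3-": {"H": 1, "C": 1, "O": 3},
    "H+": {"H": 1},
}
charges = {"H2CO3": 0, "HCO3-": -1, "H+": 1}

# Carbonic acid dissociation, written with and without the released proton.
balanced = imbalance({"H2CO3": -1, "HCO3-": 1, "H+": 1}, formulas, charges)
missing_proton = imbalance({"H2CO3": -1, "HCO3-": 1}, formulas, charges)
```

The same bookkeeping applies to macromolecules once their formulas and charges have been calculated from their primary structures.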

Molecular and sequence visualizations.

To help analyze macromolecules, the toolkit can generate molecular and sequence visualizations of residues, caps, crosslinks, polymers, and complexes. The molecular visualizations display each atom and bond and use colors to highlight features such as individual residues, inter-residue and crosslink bonds, and the atoms that are displaced by the formation of the inter-residue bonds (Figure S1A–C). The molecular visualizations can also display the coordinate of each residue and atom. The sequence visualizations include interactive tooltips that describe each non-canonical residue, crosslink, and nick (Figure S1D).

Export to other molecular and sequence formats.

For compatibility with structural and biochemical research, the toolkit can export BpForms and BcForms-encoded macromolecules to molecular formats such as InChI, the PDB format, and SMILES. For compatibility with genomics research, the toolkit can also export the canonical sequences of BpForms-encoded polymers to the IUPAC/IUBMB format and FASTA documents.

Integration with frameworks for network-scale research.

BpForms and BcForms can facilitate network-scale research through integration with omics and systems and synthetic biology frameworks such as BioPAX, CellML, SBML, and SBOL. Additional File 1.9 illustrates how BpForms and BcForms can be incorporated into these frameworks.

User interfaces.

BpForms and BcForms each include four user-friendly interfaces: a web application, a REST API, a command-line program, and a Python library.

BpForms and BcForms are the first abstractions that can represent the primary structure of any DNA, RNA, protein, and complex, including non-canonical residues, caps, crosslinks, nicks, and circularity. The toolkit also contains the most extensive alphabets of DNA, RNA, and protein residues and the first ontology of concrete crosslinks. Furthermore, the toolkit has several innovative features to facilitate research about non-canonical macromolecules: the toolkit includes a novel coordinate system that makes it easy to address specific atoms in macromolecules, the toolkit uses a novel combination of ontologies and inline definitions of residues and crosslinks to standardize the representation of common residues and crosslinks while accommodating any residue or crosslink, and the toolkit includes novel quality controls for abstractions of the primary structures of macromolecules. Taken together, BpForms and BcForms are well-suited for network research. Here, we summarize how BpForms and BcForms improve upon several existing resources for abstracting polymers and complexes.

Comparison of BpForms with existing formats for polymers.

BpForms is the first format that can abstract the primary structure of DNA, RNA, and proteins, including non-canonical residues, caps, crosslinks, nicks, and circularity. In contrast, molecular formats such as SMILES do not abstract the structures of polymers, and abstract formats such as ProForma and network formats such as BioPAX do not represent concrete molecular structures. BpForms also provides a unique blend of the features of previous molecular and abstract formats: BpForms can capture missing information similar to ProForma, BpForms is human-readable like other abstract formats, BpForms is machine-readable like molecular formats, BpForms is composable with network formats such as SBML like molecular formats, and BpForms is backward compatible with the IUPAC/IUBMB format like other abstract formats. Additional File 1.11.1 and Table S1 provide a detailed comparison of BpForms with several other formats.

Comparison of BpForms alphabets with existing databases.

The BpForms alphabets are the most extensive alphabets of DNA, RNA, and protein residues because they are based on structural, biochemical, and physiological data from several sources. In addition, the BpForms alphabets and the PDB CCD are the only alphabets which consistently represent DNA, RNA, and protein residues and which represent the inter-residue bonding sites of each residue, enabling residues to be combined into concrete molecular structures. In contrast, DNAmod, REPAIRtoire, MODOMICS, RESID, and the RNA Modification Database each only represent DNA, RNA, or protein residues; the residues in DNAmod, REPAIRtoire, MODOMICS, and the RNA Modification Database are hard to compose into polymers because they represent nucleobases and nucleosides rather than nucleotides; and DNAmod, REPAIRtoire, MODOMICS, RESID, and the RNA Modification Database do not capture bonding sites. Additional File 1.11.2 and Table S2 provide a detailed comparison of the BpForms alphabets with several other resources.

Comparison of the BpForms crosslinks ontology with existing resources.

Several resources contain information about crosslinks. In particular, the UniProt controlled vocabulary of post-translational modifications includes textual descriptions of over 100 types of crosslinks. In addition, MOD, REPAIRtoire, and RESID indirectly represent crosslinks by representing crosslinked dimers and trimers.

The BpForms ontology is the first resource which directly represents the chemical structures of crosslinks, enabling crosslinks to be composed into concrete structures. In contrast, MOD, REPAIRtoire, and RESID represent crosslinks indirectly and the crosslinks in UniProt do not have concrete chemical semantics. Consequently, the crosslinks in MOD, REPAIRtoire, RESID, and UniProt cannot be composed into concrete structures. Additional File 1.11.3 and Table S3 provide a detailed comparison of the BpForms crosslinks ontology with these resources.

Comparison of BcForms with existing formats for complexes.

Despite the importance of complexes, only a few formats can represent complexes. The PDB format is well-suited to capturing the 3-dimensional structures of complexes. BioPAX and SBOL can also capture the subunit composition of complexes.

BcForms is the first format which abstracts the primary structures of complexes including crosslinks. In contrast, the PDB format has limited capabilities to abstract crosslinks, and BioPAX and SBOL have limited abilities to represent stoichiometric information and crosslinks. BcForms is also the first format which can be composed with formats for networks such as SBML. Additional File 1.11.4 and Table S4 provide a detailed comparison of BcForms with several other formats.

We believe that the BpForms-BcForms toolkit can support a wide range of omics and systems and synthetic biology research. Here, we illustrate how we have used the toolkit to improve the quality of the PRO database of proteoforms; analyze the metabolic cost of tRNA modification in Escherichia coli; refine, expand, and compose a model of MAPK signaling with models of other pathways; and identify constraints on designing new strains of E. coli.

Proteomics: Quality control of the Protein Ontology.

One of the goals of proteomics is to characterize the proteoforms in cells. Toward a comprehensive catalog of proteoforms, the PRO consortium has manually integrated several different types of data into PRO, a database of 8,095 proteoforms. Because the consortium constructs PRO, in part, by hand, automated quality controls could help the consortium identify and correct errors in PRO.

We have used BpForms to quality control PRO. First, we encoded each entry in PRO into the BpForms grammar and used the BpForms software to validate each entry. This identified several types of syntactical and semantic errors. For example, we identified annotated processing sites that have invalid coordinates that are greater than the length of the translated sequence of their parent protein. We also identified modified residues whose structures are inconsistent with the translated sequences of their parent proteins, such as a phosphorylated serine which is annotated at the position of a tyrosine in the translated sequence of its parent. Second, the consortium corrected these errors. These improvements will be published with the next release later this year.

To enable the consortium to continue to use BpForms to quality control PRO, we developed a script which automates this analysis. Going forward, the consortium also plans to use BpForms and BcForms to visualize and export proteoforms to molecular formats such as SMILES.
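The two kinds of errors described above (coordinates beyond the end of the sequence, and modified residues that disagree with the translated sequence) can be illustrated with a short check. This is not the script developed for PRO; `check_proteoform`, the modification codes, and the example sequence are hypothetical.

```python
from typing import Dict, List, Tuple


def check_proteoform(
    sequence: str,                          # translated sequence, one letter per residue
    modifications: List[Tuple[int, str]],   # (1-based position, modification code)
    parent_residue: Dict[str, str],         # modification code -> canonical residue it derives from
) -> List[str]:
    """Flag modifications with invalid positions or inconsistent parent residues."""
    errors = []
    for position, code in modifications:
        if not 1 <= position <= len(sequence):
            errors.append(f"{code} at {position} is beyond the sequence length {len(sequence)}")
            continue
        expected = parent_residue[code]
        found = sequence[position - 1]
        if found != expected:
            errors.append(f"{code} at {position} expects {expected} but the sequence has {found}")
    return errors


# Hypothetical modification codes and a made-up four-residue sequence.
parents = {"pSer": "S", "pTyr": "Y"}
report = check_proteoform("MSAY", [(2, "pSer"), (4, "pSer"), (9, "pTyr")], parents)
```

Here the first annotation is consistent, the second places a phosphoserine on a tyrosine, and the third lies past the end of the sequence, so `report` contains two messages.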

Systems biology: Analysis of the metabolic cost of prokaryotic tRNA modiﬁcation.

To achieve WC models, we must integrate information about all of the processes in cells and their interactions. Here, we illustrate how BpForms can help integrate information about the interaction between the RNA modification and metabolism of E. coli and identify gaps in models.

First, we estimated the abundance of each tRNA from the total observed abundance of tRNA and the observed relative abundance of each tRNA. Second, we estimated the synthesis rate of each tRNA from the estimated abundance of each tRNA, the observed half-life of tRNA(Asn), and the observed doubling time of E. coli in glucose media. Third, we used BpForms to analyze the curated modifications of each tRNA. Fourth, we estimated the total synthesis rate of each modification from the synthesis rate and modification of each tRNA (Figure 3).

This analysis revealed that E. coli tRNA contain 26 modified residues, and that the five most abundant residues account for 73.8% of all modifications. Next, we tried to use the iML1515 metabolic model, one of the most comprehensive models of cellular metabolism, to analyze the impact of these modifications on metabolism and understand how E. coli allocates its limited metabolic resources among these modifications. This analysis revealed that the model only represents one of the modified residues (9U, pseudouridine). Therefore, the model must be expanded to capture the metabolic cost of tRNA modification.
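The four estimation steps can be sketched numerically. The text does not state the rate equation, so this sketch assumes steady-state exponential growth, in which synthesis must replace both degradation and dilution: rate = abundance × ln 2 × (1/half-life + 1/doubling time). The function names and all input values are hypothetical placeholders, not measurements.

```python
import math
from collections import Counter
from typing import Dict, List


def synthesis_rate(abundance: float, half_life: float, doubling_time: float) -> float:
    """Assumed steady-state rate: replace degraded copies and keep up with dilution by growth."""
    return abundance * math.log(2) * (1.0 / half_life + 1.0 / doubling_time)


def modification_rates(
    trnas: List[Dict], half_life: float, doubling_time: float
) -> Counter:
    """Sum, over all tRNAs, the synthesis rate times the copies of each modification."""
    totals = Counter()
    for trna in trnas:
        rate = synthesis_rate(trna["abundance"], half_life, doubling_time)
        for modification, copies in trna["modifications"].items():
            totals[modification] += rate * copies
    return totals


# Placeholder inputs (arbitrary units), chosen only to exercise the arithmetic.
example_trnas = [
    {"abundance": 100.0, "modifications": {"mod_a": 1, "mod_b": 2}},
    {"abundance": 50.0, "modifications": {"mod_a": 1}},
]
example_totals = modification_rates(example_trnas, half_life=1.0, doubling_time=1.0)
```

With a single half-life applied to every tRNA, as in the text, the relative ranking of modifications depends only on the abundances and the modification counts.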

Systems biology: Systematic identification of gaps in the Kholodenko model of MAPK signaling.

The Kholodenko model of the eukaryotic MAPK signaling cascade describes how the cascade transduces extracellular signals for growth, differentiation, and survival into the phosphorylation state of MAPK. However, the model does not account for factors such as the cell's nutritional status.

Toward a more holistic model of the cascade, we used BpForms to systematically identify gaps in the Kholodenko model and opportunities to merge the model with models of other pathways. First, we obtained an SBML-encoded version of the model. Second, we determined the specific proteins represented by the model. We had to do this manually because Kholodenko did not report this information. Third, we curated the sequences and post-translational modifications of the species represented by the model from UniProt and encoded them into BpForms (Figure 4A). Fourth, we embedded these BpForms representations into the SBML representation of the model. We believe that the BpForms annotations make the model more understandable.

Fifth, we used the BpForms annotations to systematically identify missing proteoforms that could help the model better explain how the MAPK pathway transduces signals. Specifically, we used BpForms to identify two missing combinations of the individual protein modifications represented by the model and four missing reactions that involve these species (Figure 4B). These additional species and reactions could help the model better capture the kinetics of MAPKK and MAPKKK activation and deactivation and, in turn, better capture how the pathway transduces signals.

Next, we used the BpForms annotations to identify opportunities to merge the Kholodenko model with models of other signaling cascades. Specifically, we searched BioModels for other models that represent similar proteoforms. This analysis identified several models that represent EGFR, PI3K, S6K, and the transcriptional outputs of the MAPK pathway that could be composed with the Kholodenko model. Furthermore, this combination of models enabled us to identify emergent combinations of proteoforms that are missing from the individual models (Figure 4C).

Lastly, to identify opportunities to merge the Kholodenko model with a model of metabolism, we used the BpForms annotations to systematically identify unbalanced reactions with missing metabolites. This analysis identified four missing species that, if added to the Kholodenko model, would make the model composable with models of metabolism (Figure 4D).
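Finding missing combinations of modifications can be framed as enumerating every subset of a protein's modification sites and subtracting the proteoforms that the model already represents. The sketch below assumes each proteoform is identified by its set of modified sites; `missing_proteoforms` and the site labels are hypothetical.

```python
from itertools import combinations
from typing import FrozenSet, Iterable, List


def missing_proteoforms(
    sites: List[str], modeled: Iterable[FrozenSet[str]]
) -> List[FrozenSet[str]]:
    """Return the combinations of modified sites that the model does not represent."""
    all_states = {
        frozenset(subset)
        for size in range(len(sites) + 1)
        for subset in combinations(sites, size)
    }
    gaps = all_states - set(modeled)
    # Sort by number of modifications, then by site names, for a stable report.
    return sorted(gaps, key=lambda state: (len(state), sorted(state)))


# A protein with two hypothetical sites, modeled only as unmodified and doubly modified.
modeled_states = [frozenset(), frozenset({"site1", "site2"})]
gaps = missing_proteoforms(["site1", "site2"], modeled_states)
```

The number of candidate states grows as 2^n with the number of sites, so for heavily modified proteins the enumeration would need to be restricted to combinations supported by data.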

Synthetic biology: Systematic identiﬁcation of design constraints.

A promising way to engineer cells is to combine naturally-occurring parts, such as genes that encode metabolic enzymes, in an accommodating host, such as E. coli. However, there are numerous potential barriers to transforming parts into other cells. For example, parts that require post-translational modifications cannot be transformed into cells which cannot synthesize the modifications. Currently, it is difficult to identify such design constraints because we have limited tools to describe the dependencies of parts. Here, we illustrate how BpForms can systematically identify potential flaws in the design of a novel strain of E. coli due to missing post-translational modification machinery.

First, we used the PDB and BpForms to identify all of the modifications that have been observed in E. coli. Second, we used the PDB and BpForms to identify modifications which have never been observed in E. coli and the proteins which contain these modifications. For example, we found that proteins that contain 4-hydroxyproline (PDB CCD: HYP), such as collagen (UniProt: P02452), potentially cannot be transformed into E. coli. Third, we used the literature to confirm the absence of these modifications from E. coli. Table S5 lists the most common modifications which could constrain the transformation of proteins into E. coli.

Bioengineers could use this information to more reliably modify strains by limiting designs to post-translationally compatible proteins or by co-transforming parts with their requisite post-translational modification machinery. Furthermore, the synthetic biology community could make such information more accessible for learning design rules by incorporating this information into parts repositories such as SynBioHub. This information would enable these repositories to function as dependency management systems for synthetic organisms, analogous to the Advanced Package Tool (APT) for Ubuntu packages.
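The dependency check described above reduces to a set difference between the modifications a part requires and the modifications the host is known to make. In this sketch, `unmet_dependencies` is a hypothetical helper; the collagen and HYP entry follows the example in the text, while the host set and the second part are invented placeholders.

```python
from typing import Dict, List, Set


def unmet_dependencies(
    part_modifications: Dict[str, Set[str]], host_modifications: Set[str]
) -> Dict[str, List[str]]:
    """Map each part to the modifications it needs that the host has not been observed to make."""
    report = {}
    for part, required in part_modifications.items():
        missing = required - host_modifications
        if missing:
            report[part] = sorted(missing)
    return report


# Placeholder host repertoire; real data would come from a survey such as the PDB analysis.
host = {"MOD_A", "MOD_B"}
parts = {
    "collagen (P02452)": {"HYP"},       # 4-hydroxyproline, per the example in the text
    "hypothetical_enzyme": {"MOD_A"},   # invented part whose requirement the host can meet
}
constraints = unmet_dependencies(parts, host)
```

A parts repository could run a check of this kind at design time, much as a package manager refuses to install a package whose dependencies are unavailable.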

3. Discussion

Realizing the full potential of BpForms and BcForms as formats for the primary structures of macromolecules will require acceptance by the omics, systems biology, and synthetic biology communities. We have begun to solicit users by submitting the BpForms and BcForms grammars to the FAIRsharing registry of standards and the EDAM ontology of formats, contributing the alphabets of residues and the ontology of crosslinks to BioPortal, proposing a protocol for using BpForms with SBOL, and helping the PRO consortium use BpForms to represent proteoforms. To further encourage community adoption, we plan to encourage the developers of central repositories of DNA, RNA, and protein modifications such as MethSMRT, the PDB, and RMBase to export their data in BpForms format. We also plan to stimulate discussion among the BioPAX, CellML, and SBML communities about formalizing our integrations of BpForms and BcForms with their formats. Additionally, we plan to use the grammars to generate parsers for other languages, such as C++, to help developers incorporate BpForms and BcForms into software tools.

Because BpForms and BcForms aim to help researchers exchange information, we believe that the alphabets of residues, the ontology of crosslinks, and the grammars should ultimately become community standards. To start, we encourage the community to contribute to BpForms and BcForms via Git pull requests. Going forward, we would like these resources to be governed by the community through an organization such as the Computational Modeling in Biology Network (COMBINE).

BpForms and BcForms achieve abstract descriptions of macromolecules by combining a closed, defined grammar with open, extensible ontologies of residues and crosslinks. This hybrid approach enables BpForms and BcForms to integrate diverse data into chemically-concrete descriptions of a wide range of macromolecules. Achieving WC models similarly requires integrating heterogeneous data about a wide range of processes from a wide range of methods and sources into physically-concrete kinetic simulations. Consequently, we believe that hybrid open-closed approaches such as BpForms and BcForms will be essential for WC modeling. For example, we are developing a hybrid methodology that enables chemically-concrete coarse-grained simulations by using fine-grained reactions to describe the chemical semantics of coarse-grained reactions.

We have begun to use BpForms and BcForms to describe the chemical semantics of the species represented by network models. Going forward, we also plan to use BpForms and BcForms to help network models capture finer-grained mechanisms that involve combinatorial interactions, such as how methylation impacts transcription factor-DNA binding. To do this, we are developing a generalized rule-based modeling framework which encapsulates properties such as primary structures into species and links these properties to reactions and rate laws. We anticipate that this framework, together with BpForms and BcForms, will make it easier to build fine-grained kinetic models of complex processes such as transcriptional backtracking, ribosomal queuing, and tmRNA ribosomal rescuing and combine them into WC models.

4. Conclusions

The BpForms-BcForms toolkit abstracts the primary structure of polymers and complexes, including non-canonical residues, caps, crosslinks, nicks, and several types of missing information. Furthermore, the toolkit standardizes the representation of common residues and crosslinks while extensibly accommodating any residue and crosslink by supporting both centrally and user-defined abstractions of residues and crosslinks. The toolkit includes the most extensive alphabets of hundreds of DNA, RNA, and protein residues; the first ontology of common crosslinks; an intuitive coordinate system for the subunits, residues, and atoms in macromolecules; the first human and machine-readable grammar for composing residues, caps, crosslinks, and nicks into polymers and complexes; and user-friendly web, REST, command-line and Python interfaces. The toolkit is backward compatible with the IUPAC/IUBMB format to maximize compatibility with existing bioinformatics tools and knowledge. The toolkit can also be integrated with frameworks for network research such as BioPAX, CellML, SBML, and SBOL.

We anticipate that BpForms and BcForms will be valuable tools for omics, systems biology, and synthetic biology. First, the tools can help researchers precisely communicate information about macromolecules. For example, the tools can help experimentalists communicate observations of proteoforms and help bioinformaticians exchange information among databases of polymers and complexes. Similarly, the tools can make models and genetic designs more understandable by capturing the semantic meaning of the species represented by models and capturing the structures of the parts of synthetic organisms. For example, BpForms could describe proteins produced by expanded genetic codes.

The tools can also help quality control information about macromolecules. For example, the tools could help researchers find errors in reconstructed proteoforms such as inconsistencies between the modified and translated sequences, merge duplicate entries in databases of proteoforms, and identify gaps and element imbalances in models.

In addition, BpForms and BcForms can help researchers integrate structural, epigenomic, transcriptomic, and proteomic information about macromolecules. For example, the tools can help researchers integrate observations of individual protein modifications into descriptions of entire proteoforms. The tools can also help researchers integrate databases of modified proteins into a model of post-translational processing, combine the model with models of other processes to create WC models, and refine the model by identifying missing combinations of protein states. Similarly, the tools can help bioengineers design biochemical networks by identifying parts that must be co-transformed with post-transcriptional and post-translational modification machinery.

5. Methods

We designed BpForms and BcForms as separate but interrelated tools to provide users lightweight tools for the distinct use cases of describing polymers and complexes. We implemented the toolkit using Python, ChemAxon Marvin, Flask-RESTPlus, Lark, Open Babel, YAML Ain't Markup Language, and Zurb Foundation. Additional File 1.10 provides more information about the implementation.

Declarations

Availability of data and materials

The web applications are located at https://bpforms.org and https://bcforms.org, the REST APIs are located at https://bpforms.org/api and https://bcforms.org/api, the command-line programs and Python libraries are available from PyPI, and the code and ontologies are available at https://github.com/KarrLab.

BpForms and BcForms are available open-source under the MIT license. Optionally, a license for ChemAxon Marvin is needed to calculate protonation and tautomerization states and generate molecular visualizations. Free licenses are available for academic researchers.

BpForms and BcForms are platform independent. The installation of BpForms and BcForms requires Python 3.6 or higher, Open Babel, and, optionally, ChemAxon Marvin. A Docker image with these dependencies is available at http://dockerhub.com/u/karrlab.

Documentation, including installation instructions, is available at https://docs.karrlab.org. Interactive Jupyter notebook tutorials are available at https://sandbox.karrlab.org.

This article refers to version 0.0.9 of BpForms and version 0.0.2 of BcForms.

Competing interests

The authors declare that they have no competing interests.

Funding

This work was supported by the National Institutes of Health [grant numbers R35 GM119771, P41 EB023912]; the National Science Foundation [grant number 1649014]; and the Engineering and Physical Sciences Research Council [grant number EP/L016494/1].

Authors’ contributions

PFL, YC, XZ, DAN, and JRK built the alphabets of residues and the ontology of crosslinks. XZ, BS, and JRK developed the software. XZ, DAN, and JRK developed the case studies. PFL, YC, JAPC, and JRK wrote the manuscript. All authors read and approved the final manuscript.

Acknowledgements

We thank Chris Myers and Jacob Beal for helpful discussion about integrating BpForms with SBOL and Nicola Hawes for help designing Figure 1.

Figure legends and boxes

[Figure 1 panels: A. Alphabets of residues (DNA, RNA, protein, user-defined). B. Ontology of crosslinks (disulfide bond, isopeptide bond, thioester bond, user-defined). C. Grammar for polymers (modified tRNA: … A{cnmA}GU{25U}CU …). D. Grammar for complexes (disulfide-linked homodimer). E. Calculated properties (molecular structure, major microspecies, formula, weight, charge). F. Exported formats (structure: SMILES; canonical seq: IUPAC; image: PNG, SVG, ...). G. Integrations with network formats (pathways: BioPAX; models: CellML, SBML; designs: SBOL). H. User interfaces (web app, CLI, REST, Python).]

Figure 1. The BpForms-BcForms toolkit can abstract, validate, and analyze the primary structures of non-canonical polymers and complexes and help integrate structural information about macromolecules into networks. The toolkit includes (A) extensible alphabets that represent individual DNA, RNA and protein residues; (B) an ontology of crosslinks; (C) a grammar for composing polymers from residues, caps, crosslinks and nicks; (D) a grammar for composing complexes from polymers and crosslinks; software tools for validating descriptions of macromolecules, (E) calculating molecular properties of macromolecules, (F) exporting macromolecules to other formats, and visualizing macromolecules; (G) protocols for integrating structural information about macromolecules into omics, systems biology, and synthetic biology formats for networks, models, and genetic designs; and (H) multiple user interfaces.

[Figure 2 panels: A. Dimer. B. Subunits 1: Pept and 2: Pept, each with residues 1: A (alanine), 2: C (cysteine), 3: U (selenocysteine). Legend: coordinate system (subunit in repeated subunit of complex, residue in sequence, atom in residue); inter-residue bonds (left bond atom, left displaced atom, right bond atom, right displaced atom, bond); crosslink (bond atom, displaced atom, bond); compound (residue, polymer, complex).
C. Complex (BcForms): Dimer: 2 * Pept | x-link: [ id: "disulfide" | l: PeptA(1)-2 | r: PeptA(2)-2]
D. Subunit (BpForms): Pept: ACU
E. Residues (in alphabet): A: C[C@H]([NH3+])C(=O)O; C: OC(=O)[C@@H]([NH3+])CS; U: N[C@H](C(=O)O)C[SeH]
F. Crosslink (in crosslinks ontology): disulfide: C-S11|C-S11]

Figure 2. BpForms and BcForms abstract the primary structures of polymers and complexes as combinations of residues, crosslinks, and nicks. For example, BcForms abstracts a disulfide-linked homodimer (A, green box) of a selenocysteine-modified tripeptide (B, blue boxes) as two copies of the tripeptide and a single crosslink (C, green text) and BpForms abstracts the peptide as a sequence of three residues, including selenocysteine (U) (D, blue text). These abstractions are enabled by alphabets of residues (E, black text) and an ontology of crosslinks (F, black text).

[Figure 3 plots: Freq (nt cell cycle^-1) versus canonical residue and versus modified residue.]

Figure 3. BpForms and BcForms can facilitate integrative analyses of fine-grained global intracellular networks. For example, we used BpForms to estimate the metabolic cost of tRNA modification in E. coli by canonical residue (A) and modified residue (B) from information about the modification, abundance, and turnover of each tRNA.

[Figure 4 panels A to E: diagrams of the MAPK signaling cascade (MAPKKKK; MAPKKK, MAPKKK-P; MAPKK, MAPKK-P, MAPKK-PP; MAPK, MAPK-P, MAPK-PP) with connections to EGFR, PI3K, S6K, and transcriptional regulation, and with phosphorylation reactions annotated with GTP + H2O, GDP + H+, and Pi. Legend: original model; BpForms annotations; proteoforms and reactions found by enumerating each combination of modifications; connections to other pathways found with BioModels; connections to metabolism identified by mass balance; proteoforms and reactions found through composition with models of other pathways.]

Figure 4. BpForms and BcForms can facilitate the construction, expansion, composition, and refinement of fine-grained global intracellular networks. For example, we used BpForms to systematically identify ways to improve and expand the Kholodenko model of MAPK signaling (A, grey) by using BpForms to capture the semantic meaning of each species (A, red), identify missing protein states (B, blue), identify other models that represent similar proteins which could be composed with the Kholodenko model (C, yellow) which could reveal additional missing combinations of species (C, green), and identify mass imbalances which indicate missing metabolites which could facilitate composition with metabolic models (D). Together, this could enable a substantially expanded model (E).

Residue sequence

This example illustrates how to use BpForms to describe a DNA which begins with deoxyinosine.

{dI}ACGC
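The notation in this example can be read mechanically: each single character is one residue, and a code wrapped in braces is one multi-character residue. The following is a minimal sketch of that reading, assuming only these two token types. The `tokenize` helper and its regular expression are hypothetical illustrations, not the toolkit's grammar or API, and they do not handle inline user-defined residues, crosslinks, or nicks.

```python
import re

# Hypothetical helper, not part of the BpForms library.
# A token is either a braced multi-character code such as {dI}
# or a single-letter code such as A.
TOKEN = re.compile(r"\{([^{}]+)\}|([A-Za-z])")


def tokenize(seq: str) -> list[str]:
    """Split a residue sequence into a list of residue codes."""
    tokens = []
    pos = 0
    while pos < len(seq):
        match = TOKEN.match(seq, pos)
        if match is None:
            raise ValueError(f"unexpected character {seq[pos]!r} at offset {pos}")
        # group(1) is the braced code, group(2) the single-letter code
        tokens.append(match.group(1) or match.group(2))
        pos = match.end()
    return tokens


print(tokenize("{dI}ACGC"))  # ['dI', 'A', 'C', 'G', 'C']
```

Resolving each code to a structure would additionally require a lookup in the relevant alphabet of residues.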

User-defined residues

Residues which are not captured by our public alphabets can be captured within descriptions of polymers. This example illustrates how to describe a protein which ends with N5-methyl-L-arginine.

CRGN[id: "AA0305"
    | structure: "OC(=O)[C@H](CCCN(C(=[NH2])N)C)[NH3+]"
    | l-bond-atom: N16-1
    | r-bond-atom: C2
    | l-displaced-atom: H16+1
    | l-displaced-atom: H16
    | r-displaced-atom: O1
    | r-displaced-atom: H1
    | name: "N5-methyl-L-arginine"
    | synonym: "delta-N-methylarginine"
    | synonym: "N5-carbamimidoyl-N5-methyl-L-ornithine"
    | identifier: "MOD:00310" @ "mod"
    | identifier: "CHEBI:21848" @ "chebi"
    | base-monomer: "R"
    | comments: "Generated by protein-arginine N5-methyltransferase (EC 2.1.1.-)."]

Crosslinks and nicks

This example illustrates how to describe a peptide that contains a disulfide bond between the cysteines at the first and third positions and a nick between the cysteine and alanine at the first and second positions.

C:AC | x-link: [id: "disulfide"| l: 1 | r: 3]
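To make the structure of such a description concrete, the sketch below splits it into a residue list, nick positions, and crosslink attributes. It assumes single-character residues only and is a simplified illustration, not the toolkit's parser: `split_top_level` and `parse_polymer` are hypothetical names, and bracketed user-defined residues and braced codes are out of scope.

```python
def split_top_level(text: str, sep: str = "|") -> list[str]:
    """Split on sep, ignoring separators nested inside square brackets."""
    parts, depth, current = [], 0, []
    for char in text:
        if char == "[":
            depth += 1
        elif char == "]":
            depth -= 1
        if char == sep and depth == 0:
            parts.append("".join(current).strip())
            current = []
        else:
            current.append(char)
    parts.append("".join(current).strip())
    return parts


def parse_polymer(description: str) -> dict:
    """Hypothetical helper: parse a sequence plus its global attributes."""
    seq_part, *attributes = split_top_level(description)
    residues, nicks = [], []
    for char in seq_part:
        if char == ":":
            # A nick sits between the previous residue and the next (1-based).
            nicks.append((len(residues), len(residues) + 1))
        elif not char.isspace():
            residues.append(char)
    result = {"residues": residues, "nicks": nicks,
              "crosslinks": [], "circular": False}
    for attribute in attributes:
        if attribute == "circular":
            result["circular"] = True
        elif attribute.startswith("x-link:"):
            # Drop the "x-link:" label and the enclosing square brackets.
            body = attribute[len("x-link:"):].strip()[1:-1]
            fields = {}
            for field in body.split("|"):
                key, _, value = field.partition(":")
                fields[key.strip()] = value.strip().strip('"')
            result["crosslinks"].append(fields)
    return result


parsed = parse_polymer('C:AC | x-link: [id: "disulfide"| l: 1 | r: 3]')
print(parsed["residues"])    # ['C', 'A', 'C']
print(parsed["nicks"])       # [(1, 2)]
print(parsed["crosslinks"])  # [{'id': 'disulfide', 'l': '1', 'r': '3'}]
```

With the parsed form, a simple consistency check is possible, for example confirming that both crosslinked positions hold cysteine before accepting a disulfide bond.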

User-defined crosslinks

Crosslinks which are not captured by our public ontology can be described inline. This example illustrates how to describe a peptide that contains a disulfide bond between the cysteines at the first and third positions.

Circularity

This example illustrates how to describe a circular di-deoxyribonucleic acid.

AC | circular

Missing knowledge

User-defined residues can also capture missing information about the mass, charge, location, and biosynthesis of residues. This example illustrates how to describe a protein which contains a methylated cysteine or asparagine at an unknown position between the fifth and tenth residues.

CRGN[base-monomer: "C"| delta-mass: 12 | delta-charge: 0| position: 5-10 [C, N]]EGYNNYCRAKYRGH

Box 1. Examples of the BpForms grammar for describing polymers.

Subunit composition

This example illustrates how to use BcForms to describe MalEFGK (Complex Portal: CPX-1932), a heteropentameric maltose ABC transporter.

MalE + MalF + MalG + 2 * MalK
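The subunit composition reads as a sum of terms, each an optional integer coefficient followed by a subunit name. As a minimal sketch under that assumption, the hypothetical `parse_composition` helper below (not the toolkit's API) counts the subunits. It covers only the composition part of a description, not crosslinks.

```python
from collections import Counter


def parse_composition(description: str) -> Counter:
    """Hypothetical helper: count subunits in an 'A + 2 * B' style description."""
    counts = Counter()
    for term in description.split("+"):
        term = term.strip()
        if "*" in term:
            # A term such as "2 * MalK" carries an explicit coefficient.
            coefficient, name = term.split("*", 1)
            counts[name.strip()] += int(coefficient)
        else:
            # A bare name such as "MalE" counts once.
            counts[term] += 1
    return counts


complex_counts = parse_composition("MalE + MalF + MalG + 2 * MalK")
print(dict(complex_counts))          # {'MalE': 1, 'MalF': 1, 'MalG': 1, 'MalK': 2}
print(sum(complex_counts.values()))  # 5
```

The total of five subunits across four distinct names matches the description of MalEFGK as a heteropentamer.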

Crosslinks

This example illustrates how to use the crosslinks ontology to describe a disulfide-linked antiparallel homodimer of disintegrin schistatin of Echis carinatus (UniProt: P83658).

User-defined crosslinks

Crosslinks which are not captured by our public ontology can be defined within descriptions of complexes. This example illustrates how to describe the crosslinking of 10 kDa chaperonin (UniProt: P9WPE5) of Mycobacterium tuberculosis with prokaryotic ubiquitin-like protein Pup (UniProt: P9WHN5) via an isoglutamyl lysine isopeptide bond (RESID: AA0124). Cells use this crosslink to mark 10 kDa chaperonin for proteasomal degradation.

Box 2.

Examples of the