ROMEO: A Plug-and-play Software Platform of Robotics-inspired Algorithms for Modeling Biomolecular Structures and Motions
Kevin Molloy
Dept of Computer Science, George Mason University
Erion Plaku
Department of Computer Science, The Catholic University of America
Amarda Shehu ∗
Dept of Computer Science, George Mason University
[email protected]
ABSTRACT
Motivation:
Due to the central role of protein structure in molecular recognition, great computational efforts are devoted to modeling protein structures and the motions that mediate structural rearrangements. The size, dimensionality, and non-linearity of the protein structure space present outstanding challenges. Such challenges also arise in robot motion planning, and robotics-inspired treatments of protein structure and motion are increasingly showing high exploration capability. Encouraged by such findings, we debut here ROMEO, which stands for Robotics prOtein Motion ExplOration framework. ROMEO is an open-source, object-oriented platform that gives researchers access to, and reproducibility of, published robotics-inspired algorithms for modeling protein structures and motions, and that facilitates novel algorithmic design via its plug-and-play architecture.
Availability and implementation:
ROMEO is written in C++ and is available in GitLab (https://github.com/). This software is freely available under the Creative Commons license (Attribution and Non-Commercial).
Contact: [email protected]
ACM Reference Format:
Kevin Molloy, Erion Plaku, and Amarda Shehu. 2019. ROMEO: A Plug-and-play Software Platform of Robotics-inspired Algorithms for Modeling Biomolecular Structures and Motions. In ,. ACM, New York, NY, USA, 6 pages. https://doi.org/10.1145/nnnnnnn.nnnnnnn
Protein molecules regulate virtually all processes that maintain and replicate a living cell, and their tertiary structure largely determines their interactions with molecular partners [4]. A detailed understanding of the structure(s) at the disposal of a protein for biological activity and of the motions that mediate rearrangements between different structures for activity modulation is central to understanding molecular mechanisms [3]. Decades of research have shown that such understanding cannot be obtained via experimentation or computation alone; in particular, while great progress has been made on modeling biomolecular structures and motions, challenges related to the size, dimensionality, and non-linearity of structure spaces associated with macromolecules, such as proteins, remain [11]. Over the past decade, great advancements in the ability to explore complex protein structure spaces have come from robotics-inspired algorithms that leverage analogies between molecular and robot configuration spaces [17].

Encouraged by such progress, we debut here ROMEO, which stands for Robotics prOtein Motion ExplOration framework. ROMEO is an open-source, object-oriented platform that gives researchers access to, and reproducibility of, published robotics-inspired algorithms for modeling protein structures and motions. For instance, ROMEO provides templates for popular robotics motion planning algorithms, such as the Rapidly-exploring Random Tree (RRT) and Probabilistic RoadMap (PRM), and offers published adaptations of these canonical algorithms that address diverse application settings, from template-free protein structure prediction to modeling of motions that mediate rearrangements between stable and meta-stable structures [1, 6, 14, 15].

∗ Corresponding Author. arXiv, 2019. © 2019 Copyright held by the owner/author(s).
ROMEO follows a plug-and-play architecture to facilitate novel algorithmic design and so allow researchers to further advance algorithmic research in molecular biology. ROMEO is written in C++ and its object-oriented design allows easy adaptation and expansion of its classes. These classes have been designed around a core set of components shared by many motion planning algorithms, which we summarize below (see Figure S1 for a visualization of the architecture).
Representation, Energy, and Forward Kinematics:
The cfg class represents a protein configuration. Included with this initial release is an extension of this class that interfaces with the Rosetta package [5] and utilizes the coarse-grained representation known as the centroid mode, which tracks heavy backbone atoms and a side-chain centroid pseudo-atom via dihedral angles. Developers can easily extend this class to support Rosetta's all-atom representation or others. A forward kinematics class allows projecting a configuration (under the selected representation) into Cartesian space (retrieving Cartesian coordinates for each of the represented atoms). This class also associates an energy/fitness score with a configuration. In this initial release, ROMEO utilizes Rosetta scoring functions.
Planners:
Sampling-based algorithms are popular in robot motion planning due to their ability to handle high-dimensional configuration spaces. ROMEO provides base classes for tree- and roadmap-based planners. This initial release includes support for RRT, PRM, and variants (the reader is referred to the review in [17] for a description of these algorithms). These planners share a common set of operations, some of which we highlight below. (i) ROMEO provides samplers to create new configurations and offspring generators to extend the tree or roadmap in a target direction. Producing new samples or offspring using traditional robotic techniques results in high-energy, unrealistic configurations. ROMEO offers offspring generators that employ molecular fragment replacement, which remedies this issue [5]. (ii) Acceptors are responsible for determining whether configurations are to be added to the roadmap/tree. ROMEO provides examples of extending this class via an energy threshold or via dynamic techniques that utilize the Metropolis criterion. (iii) ROMEO provides distance classes to measure the distance between two configurations, supporting both angle-based and Cartesian coordinate-based representations. (iv) Edge cost evaluators are responsible for weighting each of the transitions encoded into the graph/tree. ROMEO provides many extensions to a baseline class to support encoding basic distances, energetic differences, and costs based on mechanical work [6]. (v) A goal acceptor class determines whether a given configuration is similar enough to a given goal configuration and has connectivity to a given start configuration.
We showcase two selected examples of how ROMEO can be utilized for modeling structures and motions.
Structure Prediction:
We showcase the plug-and-play nature of ROMEO via the FeLTR algorithm proposed in [16] for decoy generation in template-free protein structure prediction. FeLTR grows a search tree in the protein's configuration space to search for low-energy configurations. To implement FeLTR, the planner class is extended to utilize a low-dimensional projection to guide the exploration. This projection has been shown to strike a good balance between breadth (the unexplored frontier) and depth (low-energy). Only 300 lines of code were required to extend ROMEO to implement FeLTR (source code and examples of running this method are included in the software distribution). Section S1.1 in the Supplementary Information showcases results.
Motion Computation:
Tree-based planners have been used to compute energetically-feasible paths connecting known structures of a protein. ROMEO includes an example of such an algorithm, known as SPRINT [14]. SPRINT grows a tree in the configuration space from a start configuration and biases its selection via a low-dimensional projection based on a progress coordinate to the goal configuration. The software distribution includes an example of utilizing SPRINT on the cyanovirin-n protein, where the goal is to compute motions that connect two given structures more than 16 Å apart from each other. Details and results are available in Section S1.2 of the Supplementary Information.
The examples above illustrate the power of ROMEO's plug-and-play architecture. This software can be useful in advancing algorithmic research in structure and motion computation. It can also be employed in a classroom setting, where instructors can introduce students to computational molecular biology and students can easily extend components of the framework.
This work has been partially supported by the National Science Foundation Grant No. 1440581.
APPENDIX
ROMEO DESIGN/CLASSES
ROMEO is written in C++ and consists of a core set of components that are shared among all sampling-based robot motion planning algorithms. The object-oriented design of ROMEO allows easy adaptation and expansion of its core components, making it possible for developers to customize ROMEO for additional applications.

Figure 1 illustrates the class inheritance hierarchy of the software. The following sections summarize the purpose of each class and describe in greater detail selected member functions.
Choice of Representation
The cfg class is the base class to represent a configuration. The class stores a vector of the backbone dihedral angles of a protein structure and in doing so assumes an idealized geometry (where bond lengths and valence angles do not deviate from equilibrium values). This choice of representation is popular in template-free protein structure prediction and motion computation. We point out that other applications, such as folding, necessitate extending the base class by allowing bond lengths and valence angles to deviate (thus including them in the representation) or by introducing other configuration parameters.

Since the potential energy associated with a given protein structure is used in many planning operations (for instance, when determining whether to accept a new configuration or when calculating the cost associated with moving from a current to a new configuration), energy is associated with a configuration and is also stored in the cfg class.

The class MolecularStructureRosetta provides the interface between ROMEO and the Rosetta software suite. This class, derived from the CfgForwardKinematics class, offers the ability to extract a configuration from a Protein Data Bank (PDB) [2] file (which is the conventional way of storing information about a tertiary structure) and to score a configuration using a selected Rosetta scoring/energy function. Member functions are also provided to perform forward kinematics on a cfg object, that is, placing the protein in a particular configuration and obtaining the Cartesian coordinates for each atom in a given representation. ROMEO provides support for Rosetta's centroid representation, which tracks heavy backbone atoms and a side-chain centroid pseudo-atom.
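The separation between a configuration (angles plus a stored energy) and a forward kinematics class can be illustrated with a toy analog. The sketch below is not ROMEO's or Rosetta's code: the names `Cfg`, `ForwardKinematics`, `PlanarChainKinematics`, and `Place` are hypothetical, and a planar chain with a fixed bond length stands in for a protein backbone under idealized geometry.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Hypothetical stand-in for a configuration: angles plus a stored energy.
struct Cfg {
    std::vector<double> angles;  // one angle per joint, in radians
    double energy = 0.0;         // score attached by an energy function
};

struct Point2D {
    double x;
    double y;
};

// Hypothetical interface: maps a configuration to Cartesian coordinates.
class ForwardKinematics {
public:
    virtual ~ForwardKinematics() = default;
    virtual std::vector<Point2D> Place(const Cfg& cfg) const = 0;
};

// Toy planar chain with an idealized (fixed) bond length. Each angle is a
// turn relative to the direction of the previous bond.
class PlanarChainKinematics : public ForwardKinematics {
public:
    explicit PlanarChainKinematics(double bondLength) : bond_(bondLength) {
        assert(bondLength > 0.0);
    }

    std::vector<Point2D> Place(const Cfg& cfg) const override {
        std::vector<Point2D> atoms;
        atoms.push_back({0.0, 0.0});  // first atom sits at the origin
        double heading = 0.0;
        for (double angle : cfg.angles) {
            heading += angle;
            const Point2D prev = atoms.back();  // copy before growing the vector
            atoms.push_back({prev.x + bond_ * std::cos(heading),
                             prev.y + bond_ * std::sin(heading)});
        }
        return atoms;
    }

private:
    double bond_;
};
```

Because only the angles vary, the configuration stays low-dimensional while any distance or energy term that needs coordinates can obtain them through the kinematics interface.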
Planners
ROMEO utilizes sampling-based motion planning algorithms to explore the configuration space of a protein. Direct support is provided for two popular sampling-based planners: the Rapidly Exploring Random Tree (RRT) [10] and the Probabilistic RoadMap (PRM) [9].
As a proof of concept, to illustrate ROMEO's plug and play architecture, the FeLTR [15] and SPRINT [14] methods have also been implemented in this release of ROMEO. By expanding the sampler classes, many variants of these planners can be implemented by other developers.

These planners all share a common set of operations or core components, such as: generating new configurations, determining the validity of a new configuration, measuring the distance between two configurations, and projecting configurations into the workspace. ROMEO abstracts these operations using a set of base classes that allow for easy "plug and play" replacements of these components.
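To make the idea of swappable core components concrete, the following is a minimal sketch of an RRT-style loop written against abstract interfaces. It is not ROMEO's API: every class and member name here (`Sampler`, `Distance`, `OffspringGenerator`, `Acceptor`, `TreePlanner`, `Iterate`, and the toy implementations) is an assumption made for illustration, and configurations are plain vectors of numbers.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <limits>
#include <random>
#include <vector>

using Cfg = std::vector<double>;  // hypothetical stand-in for a configuration

// Hypothetical component interfaces; each can be replaced independently.
struct Sampler {
    virtual ~Sampler() = default;
    virtual Cfg Sample(std::mt19937& rng) = 0;
};
struct Distance {
    virtual ~Distance() = default;
    virtual double operator()(const Cfg& a, const Cfg& b) const = 0;
};
struct OffspringGenerator {
    virtual ~OffspringGenerator() = default;
    virtual Cfg Extend(const Cfg& from, const Cfg& toward) const = 0;
};
struct Acceptor {
    virtual ~Acceptor() = default;
    virtual bool Accept(const Cfg& parent, const Cfg& child) const = 0;
};

// Toy implementations used only to exercise the loop.
struct UniformSampler : Sampler {
    std::size_t dim;
    double lo, hi;
    UniformSampler(std::size_t d, double l, double h) : dim(d), lo(l), hi(h) {}
    Cfg Sample(std::mt19937& rng) override {
        std::uniform_real_distribution<double> u(lo, hi);
        Cfg q(dim);
        for (double& x : q) x = u(rng);
        return q;
    }
};
struct EuclideanDistance : Distance {
    double operator()(const Cfg& a, const Cfg& b) const override {
        assert(a.size() == b.size());
        double sum = 0.0;
        for (std::size_t i = 0; i < a.size(); ++i) sum += (a[i] - b[i]) * (a[i] - b[i]);
        return std::sqrt(sum);
    }
};
struct StepToward : OffspringGenerator {
    double step;
    explicit StepToward(double s) : step(s) {}
    Cfg Extend(const Cfg& from, const Cfg& toward) const override {
        const double d = EuclideanDistance()(from, toward);
        if (d <= step) return toward;  // target is within one step
        Cfg q(from);
        for (std::size_t i = 0; i < q.size(); ++i) q[i] += step * (toward[i] - from[i]) / d;
        return q;
    }
};
struct AcceptAll : Acceptor {
    bool Accept(const Cfg&, const Cfg&) const override { return true; }
};

class TreePlanner {
public:
    struct Vertex { Cfg cfg; int parent; };

    TreePlanner(const Cfg& start, Sampler& s, const Distance& d,
                const OffspringGenerator& g, const Acceptor& a)
        : sampler_(s), dist_(d), gen_(g), acc_(a) {
        tree.push_back({start, -1});  // root has no parent
    }

    // One RRT iteration: sample, find nearest vertex, extend, test, add.
    bool Iterate(std::mt19937& rng) {
        const Cfg qRand = sampler_.Sample(rng);
        std::size_t nearest = 0;
        double best = std::numeric_limits<double>::infinity();
        for (std::size_t i = 0; i < tree.size(); ++i) {
            const double d = dist_(tree[i].cfg, qRand);
            if (d < best) { best = d; nearest = i; }
        }
        const Cfg qNew = gen_.Extend(tree[nearest].cfg, qRand);
        if (!acc_.Accept(tree[nearest].cfg, qNew)) return false;
        tree.push_back({qNew, static_cast<int>(nearest)});
        return true;
    }

    std::vector<Vertex> tree;

private:
    Sampler& sampler_;
    const Distance& dist_;
    const OffspringGenerator& gen_;
    const Acceptor& acc_;
};
```

In this arrangement, replacing `StepToward` with a fragment-based generator or `AcceptAll` with an energy test changes the planner's behavior without touching `Iterate`, which is the kind of substitution the base classes are meant to allow.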
Samplers and Offspring Generators
ROMEO offers two classes, one for sampling and another for offspring generation. The sampling class is employed during the generation of at-random samples/configurations. Examples include obtaining q_rand in the RRT planner and generating landmark configurations to populate the roadmap/graph in PRM. The offspring generator class is used to modify an existing configuration, potentially by perturbing it in a given direction. This is employed, for instance, when extending q_near in the direction of q_rand in the RRT extend step, or when connecting landmark configurations during the local planning step within the PRM planner. ROMEO extends these baseline classes to support utilizing Rosetta's molecular fragment libraries for structure and motion computations.
Acceptors.
The acceptor class is used when deciding whether a new configuration should be added to the graph or tree maintained by the planner. A simple acceptor test would be to set a maximum energy value for all configurations (thus, the acceptor verifies that new configurations are below this threshold). When studying molecular transitions, the Metropolis criterion (which utilizes differences in energy between two configurations) is commonly employed. For this reason, ROMEO also provides an acceptor class based on Metropolis Monte Carlo (MMC); in particular, ROMEO implements the transition test utilized in the SPRINT [14] and Transition-based RRT (T-RRT) planners [7, 8].
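The two acceptor styles described above can be sketched as follows. This is an illustrative sketch only: the class names and the `Accept` signature are hypothetical, energies are passed directly as numbers, and the Metropolis test uses the textbook form with a fixed temperature, which does not reproduce the adaptive temperature scheme of T-RRT.

```cpp
#include <cassert>
#include <cmath>
#include <random>

// Hypothetical interface: decides whether a move from energy eFrom to
// energy eTo is kept by the planner.
class Acceptor {
public:
    virtual ~Acceptor() = default;
    virtual bool Accept(double eFrom, double eTo, std::mt19937& rng) const = 0;
};

// Accepts any configuration whose energy does not exceed a fixed ceiling.
class MaxEnergyAcceptor : public Acceptor {
public:
    explicit MaxEnergyAcceptor(double maxEnergy) : max_(maxEnergy) {}
    bool Accept(double /*eFrom*/, double eTo, std::mt19937& /*rng*/) const override {
        return eTo <= max_;
    }

private:
    double max_;
};

// Textbook Metropolis test: downhill moves are always accepted, uphill
// moves are accepted with probability exp(-dE / T). The temperature is in
// energy units (any Boltzmann-like constant is assumed folded into T).
class MetropolisAcceptor : public Acceptor {
public:
    explicit MetropolisAcceptor(double temperature) : t_(temperature) {
        assert(temperature > 0.0);
    }

    double AcceptProbability(double eFrom, double eTo) const {
        const double dE = eTo - eFrom;
        return dE <= 0.0 ? 1.0 : std::exp(-dE / t_);
    }

    bool Accept(double eFrom, double eTo, std::mt19937& rng) const override {
        std::uniform_real_distribution<double> u(0.0, 1.0);  // draws from [0, 1)
        return u(rng) < AcceptProbability(eFrom, eTo);
    }

private:
    double t_;
};
```

The threshold test looks only at the new configuration, whereas the Metropolis test depends on the energy difference along the edge, so it can tolerate a high-energy region as long as it is approached gradually.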
Distances.
Planners utilize a distance function to determine nearby configurations. ROMEO comes with basic distance classes (Euclidean distance), as well as distances commonly used in the study of proteins (such as the least root-mean-square-distance, or lRMSD [12]).
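For angle-based representations, a plain Euclidean distance ignores that angles are periodic. The sketch below shows one common way to handle this, a root-mean-square distance over wrapped angle differences. The function names are hypothetical and this is not ROMEO's implementation; lRMSD is not shown, since it additionally requires an optimal superposition of Cartesian coordinates [12].

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Smallest signed difference between two angles (radians), in [-pi, pi].
inline double WrapAngleDiff(double a, double b) {
    const double kPi = std::acos(-1.0);
    double d = std::fmod(a - b, 2.0 * kPi);  // now in (-2*pi, 2*pi)
    if (d > kPi) d -= 2.0 * kPi;
    if (d < -kPi) d += 2.0 * kPi;
    return d;
}

// Root-mean-square of the wrapped per-angle differences between two
// configurations that store the same number of angles.
inline double AngularRmsd(const std::vector<double>& a, const std::vector<double>& b) {
    assert(a.size() == b.size() && !a.empty());
    double sum = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        const double d = WrapAngleDiff(a[i], b[i]);
        sum += d * d;
    }
    return std::sqrt(sum / static_cast<double>(a.size()));
}
```

With wrapping, two dihedral angles just on either side of the ±π boundary are treated as close, which matters when the distance is used for nearest-neighbor queries in a planner.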
SELECTED EXAMPLES OF APPLICABILITY
This section outlines two applications that utilize the ROMEO framework. The first example utilizes ROMEO to perform template-free protein structure prediction by implementing a robotics-inspired method known as FeLTR [15]. The second example employs RRT to compute energetically-feasible paths that connect two functionally-relevant configurations of the cyanovirin-n protein. All source code and scripts to run these examples are included with the ROMEO distribution.
Structure Prediction
This example showcases the FeLTR method [15] for performing template-free protein structure prediction. We summarize how ROMEO's "plug and play" architecture is utilized to easily implement FeLTR.

As described in greater detail in [15], FeLTR employs an EST-like planner that expands the search tree iteratively via selection and expansion operations and utilizes projection layers for configuration selection. The SelectionVertex function of ROMEO's TreeSamplingBasedPlanner class is overloaded to support FeLTR's novel node selection technique that utilizes a low-dimensional projection of a configuration. The AddVertex function is overloaded to place new vertices into FeLTR's projection layers. Changes to the planner are limited to the addition of 100 lines of code (in addition to supporting code for computing projection coordinates from configurations). The ease with which this complex method is implemented in ROMEO highlights the advantages of its object-oriented design.

ROMEO's distribution provides the required scripts and configuration files to run FeLTR to predict the structure of the ibeta subdomain of the mu end DNA binding domain of phage mu transposase. Figure 2 showcases the closest (in terms of lRMSD) sampled structure when compared to the native structure cataloged in the PDB. Different weighting schemes can be explored when selecting nodes from FeLTR's low-dimensional projection, as described in [13]. Figure 3 highlights some of the behavior of each of these weighting schemes. The quad scheme utilizes a greedy strategy, selecting nodes with the lowest energies with high probability. The linear and norm (Gaussian based) schemes approach the lower lRMSD structure more gradually, and have been shown for longer executions to provide closer samples to the native structure [13].
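The pattern of overriding vertex insertion and selection in a derived planner can be sketched as below. This is a simplified illustration under stated assumptions, not FeLTR or ROMEO code: the function names `SelectionVertex` and `AddVertex` come from the text, but their signatures, the `TreePlanner` base class, and the `EnergyLayerPlanner` class are hypothetical. The sketch bins vertices by energy only (FeLTR's projection also has a geometric layer), and the quadratic weight on level rank is an assumed stand-in for a greedy scheme, not the published weighting formula.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <iterator>
#include <map>
#include <random>
#include <vector>

struct Vertex {
    double energy;
};

// Hypothetical base planner: stores vertices and selects uniformly at random.
class TreePlanner {
public:
    virtual ~TreePlanner() = default;

    virtual void AddVertex(const Vertex& v) { vertices_.push_back(v); }

    virtual std::size_t SelectionVertex(std::mt19937& rng) {
        assert(!vertices_.empty());
        std::uniform_int_distribution<std::size_t> pick(0, vertices_.size() - 1);
        return pick(rng);
    }

    const Vertex& At(std::size_t i) const { return vertices_[i]; }

protected:
    std::vector<Vertex> vertices_;
};

// Derived planner: bins vertices into energy levels on insertion and biases
// selection toward low-energy levels.
class EnergyLayerPlanner : public TreePlanner {
public:
    explicit EnergyLayerPlanner(double levelWidth) : width_(levelWidth) {
        assert(levelWidth > 0.0);
    }

    void AddVertex(const Vertex& v) override {
        TreePlanner::AddVertex(v);
        const long level = static_cast<long>(std::floor(v.energy / width_));
        levels_[level].push_back(vertices_.size() - 1);
    }

    std::size_t SelectionVertex(std::mt19937& rng) override {
        assert(!levels_.empty());
        // std::map iterates from the lowest to the highest energy level, so
        // rank 0 is the lowest level and receives the largest weight.
        const double n = static_cast<double>(levels_.size());
        std::vector<double> weights;
        for (std::size_t rank = 0; rank < levels_.size(); ++rank) {
            const double w = n - static_cast<double>(rank);
            weights.push_back(w * w);  // assumed quadratic bias
        }
        std::discrete_distribution<std::size_t> pickLevel(weights.begin(), weights.end());
        auto it = levels_.begin();
        std::advance(it, static_cast<std::ptrdiff_t>(pickLevel(rng)));
        const std::vector<std::size_t>& members = it->second;
        std::uniform_int_distribution<std::size_t> pick(0, members.size() - 1);
        return members[pick(rng)];  // uniform choice within the chosen level
    }

private:
    double width_;
    std::map<long, std::vector<std::size_t>> levels_;
};
```

Only the two virtual functions change in the derived class; the rest of a planner loop that calls `AddVertex` and `SelectionVertex` through the base interface would remain untouched, which is the point the paragraph above makes about the small amount of code needed.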
Motion Computation
ROMEO can also be used to compute the motions that mediate rearrangements between two distinct functionally-relevant structures. We highlight this capability here on the cyanovirin-n protein, where we treat two distinct structures (found under PDB IDs 2EZM and 1L5E) as start and goal structures located almost 16 Å lRMSD apart from one another. We highlight here an implementation of the SPRINT method [14] with ROMEO.

We executed ROMEO for 12 hours and tested two different energy acceptors. Rosetta's score3 energy scheme was utilized with the radius of gyration terms disabled (since this rewards more compact structures). The first acceptor limited the acceptable energy of a configuration to no higher than 60 Rosetta Energy Units (REUs). The second scheme utilized the Metropolis criterion, setting the temperature such that an increase of 10 REUs had a 0.1 probability of being accepted. Each scheme resulted in pathways that ended with configurations (structures) within 3 Å lRMSD of the goal configuration. The energy profiles for each are shown in Figure 4. As expected, the Metropolis acceptor (MMC) shows a more gradually-increasing and overall lower-energy path compared to the max energy acceptor. A few sample structures along the pathway computed from the MMC execution are showcased in Figure 5.
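The temperature implied by the second acceptor can be worked out directly, assuming the standard Metropolis acceptance probability with the temperature $T$ expressed in REUs (that is, with any Boltzmann-like constant folded into $T$; the text does not state the exact form used):

```latex
\begin{align}
P_{\text{accept}}(\Delta E) &= \exp\!\left(-\frac{\Delta E}{T}\right), \qquad \Delta E > 0, \\
0.1 &= \exp\!\left(-\frac{10}{T}\right)
\;\Longrightarrow\;
T = \frac{10}{\ln 10} \approx 4.34 \ \text{REU}.
\end{align}
```

Under this assumption, an uphill step of 20 REUs would be accepted with probability $0.1^2 = 0.01$, which is consistent with the more gradual energy rise reported for the MMC acceptor.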
[Figure 1 diagram; class labels recovered from the figure: SamplingBasedPlanner, Tree-based Planner, RRT, EST, PRM, FELTR, Acceptors, MaxEnergy, MMC, Distances, Distance, LRMSD, LRMSDDistance, AngularRMSD, ForwardKinematics, Rosetta POSE, Configuration, Samplers, RandomSampler, RosettaSampler, OffspringGenerator, RosettaOffspringGenerator, LinearInterpolate, Edge Cost Evaluators, MechanicalWork.]
Figure 1: In ROMEO's plug-and-play design, all classes are derived from a common Configuration object (which can also be overloaded). All of the planner functions can also be replaced with new, derived classes.

Figure 2: The known native tertiary structure of the ibeta subdomain of the mu end DNA binding domain of phage mu transposase (found under PDB entry 2EZK) is drawn in transparent blue. The structure with the lowest lRMSD (of 3.3 Å) to this native structure (among all structures computed by FeLTR during a 2-hour execution) is also shown here, drawn in red.

[Figure 3 plot: lRMSD versus time (minutes) for the quad, linear, and norm weighting schemes.]
Figure 3: The known native tertiary structure of the ibeta subdomain of the mu end DNA binding domain of phage mu transposase (found under PDB entry 2EZK) is drawn in transparent blue. The structure with the lowest lRMSD (of 3.3 Å) to this native structure (among all structures computed by FeLTR during a 2-hour execution) is also shown here, drawn in red.

[Figure 4 plot: energy (Rosetta) versus path progress for the MaxEnergy and MMC acceptors.]
Figure 4: Two energy acceptance strategies are utilized, MMC (Metropolis Monte Carlo) and a scheme that sets the maximum energy of a structure during ROMEO's exploration of paths connecting two distinct, known structures of the cyanovirin-n protein. The MMC scheme yields a more energetically-feasible pathway that gradually rises in energy compared to the maximum energy threshold strategy.

Figure 5: A few ROMEO-computed structures are shown along a path that illustrates the rearrangement of the cyanovirin-n protein between two distinct, known structures (with PDB IDs 2EZM and 1L5E). The computed path consists of 89

REFERENCES
[1] N. M. Amato, K. A. Dill, and G. Song. 2002. Using motion planning to map protein folding landscapes and analyze folding kinetics of known native structures. J. Comp. Biol. 10, 3-4 (2002), 239–255.
[2] H. M. Berman, K. Henrick, and H. Nakamura. 2003. Announcing the worldwide Protein Data Bank. Nature Struct Biol 10, 12 (2003), 980–980.
[3] D. D. Boehr, R. Nussinov, and P. E. Wright. 2009. The role of dynamic conformational ensembles in biomolecular recognition. Nature Chem Biol 5, 11 (2009), 789–96.
[4] D. D. Boehr and P. E. Wright. 2008. How do proteins interact? Science [...]
[5] [...] Science [...]
[6] [...] IEEE Transactions on NanoBioscience 14, 5 (July 2015), 545–552. https://doi.org/10.1109/TNB.2015.2424597
[7] L. Jaillet, F. J. Corcho, J.-J. Perez, and J. Cortés. 2011. Randomized tree construction algorithm to explore energy landscapes. J Comput Chem 32, 16 (2011), 3464–3474.
[8] L. Jaillet, J. Cortés, and T. Siméon. 2008. Transition-based RRT for path planning in continuous cost spaces. In IROS. IEEE/RSJ, Stanford, CA, 22–26.
[9] L. E. Kavraki, P. Svestka, J.-C. Latombe, and M. Overmars. 1996. Probabilistic roadmaps for path planning in high-dimensional configuration spaces. IEEE Transactions on Robotics and Automation 12 (1996), 566–580.
[10] S. M. LaValle and J. J. Kuffner. 2001. Randomized kinodynamic planning. IJRR 20, 5 (2001), 378–400.
[11] T. Maximova, R. Moffatt, B. Ma, R. Nussinov, and A. Shehu. 2016. Principles and Overview of Sampling Methods for Modeling Macromolecular Structure and Dynamics. PLoS Comp. Biol. 12, 4 (2016), e1004619.
[12] A. D. McLachlan. 1972. A mathematical procedure for superimposing atomic coordinates of proteins. Acta Crystallographica 26, 6 (1972), 656–657.
[13] K. Molloy, S. Saleh, and A. Shehu. 2013. Probabilistic Search and Energy Guidance for Biased Decoy Sampling in Ab-initio Protein Structure Prediction. IEEE/ACM Trans. Bioinf. and Comp. Biol. 10, 5 (2013), 1162–1175.
[14] K. Molloy and A. Shehu. 2013. Elucidating the Ensemble of Functionally-relevant Transitions in Protein Systems with a Robotics-inspired Method. BMC Struct. Biol. 13, Suppl 1 (2013), S8.
[15] A. Shehu and B. Olson. 2010. Guiding the Search for Native-like Protein Conformations with an Ab-initio Tree-based Exploration. Intl. J. Robot. Res. 29, 8 (2010), 1106–1127.
[16] Amarda Shehu and Brian S. Olson. 2010. Guiding the Search for Native-like Protein Conformations with an Ab-initio Tree-based Exploration. I. J. Robotics Res. 29, 8 (2010), 1106–1127.
[17] A. Shehu and E. Plaku. 2016. A Survey of Computational Treatments of Biomolecules by Robotics-inspired Methods Modeling Equilibrium Structure and Dynamics. J Artif Intel Res 57 (2016), 509–572.