Topology and geometry of molecular conformational spaces and energy landscapes
Ingrid Membrillo-Solis, Mariam Pirashvili, Lee Steinberg, Jacek Brodzki, Jeremy G. Frey
TTopology and geometry of molecular conformational spacesand energy landscapes
Ingrid Membrillo-Solis , Mariam Pirashvili , Lee Steinberg , Jacek Brodzki , and JeremyG. Frey School of Mathematical Sciences, University of Southampton School of Chemistry, University of SouthamptonJuly 2019
Abstract
Understanding the geometry and topology of configuration or conformational spaces of molecules hasrelevant applications in chemistry and biology such as the proteins folding problem, drug design and thestructure activity relationship problem. Despite their relevance, configuration spaces of molecules areonly partially understood. In this paper we discuss both theoretical and computational approaches to theconfiguration spaces of molecules and their associated energy landscapes. Our mathematical approachshows that when symmetries of the molecules are taken into account, configuration spaces of moleculesgive rise to certain principal bundles and orbifolds. We also make use of a variety of geometric andtopological tools for data analysis to study the topology and geometry of these spaces.
A molecule is an example of a n -body system formed by the nuclei and the electrons of its constituentatoms. The first step of the Born-Oppenheimer approximation allows to represent the nuclei as points inEuclidean space. The space defined by the degrees of freedom of the molecular system after eliminationof the symmetry group actions is called the internal configuration or conformational space . The potentialenergy surface (PES) describes the energy of a molecule. It is a function on the internal conformationalspace of the molecule.The study of the conformational spaces of molecules and their associated PES is of particular interest inChemistry. It can help to understand the relationship between structure and properties of molecules. Forinstance, PES can give insight into the structure and dynamics of macromolecules such as proteins. It hasbeen shown that protein native-state structures can arise from considerations of symmetry and geometryassociated with the polypeptide chain [18]. Furthermore, understanding of the binding pose of a drugwith its potential target requires knowledge of both their underlying PES, and conformational spaces [34].Conformer generation is regarded as a fundamental factor in drug design [23]. Therefore, several methodsexist that produce conformer sets, sampling the conformational space. One of the challenges is obtaining analgorithm that reduces the number of duplicate conformers. This could be achieved if the symmetries of agiven molecule are taken into account. Also, it has been claimed that prediction of melting points can beimproved by taking molecular symmetry into account [37, 28].In this paper we explore the geometry and topology of the conformational space of molecules and theirquotients by symmetry groups. Despite the importance of the group of symmetries and its action on the spaceof conformers, to our knowledge there are no works on spaces of conformers that include it. Furthermore,conformational spaces are often discussed only in terms of their torsional degrees of freedom, operating underthe rigid geometry hypothesis [14]. The impact of other degrees of freedom on the conformational spacesthemselves is often ignored. 1 a r X i v : . [ q - b i o . Q M ] J u l .1 Geometric and topological methods in data analysis Topology-based data analysis methods have seen continued interest in recent years. Persistent homologyand discrete Morse theory are two topological data analysis tools, which are closely interlinked, and whichwe have applied to the exploration of conformation spaces of molecules.Persistent homology, which may be more familiar in the chemistry community, is a method of assigningnumerical descriptors to data, based on topological notions of shape, which emerges through a process ofcreating a combinatorial structure, called a simplicial complex, from the data, together with a filter function.These descriptors satisfy robust stability results with respect to the so-called bottleneck distance. The useof this method allowed us to explore the topology of the conformation spaces.Discrete Morse theory is mathematically very closely linked to persistent homology, but it has differentapplications. It is used for topological simplification, for distilling the information down to the most relevant.We used it to explore the energy landscapes via their extrema.In the paper
PHoS: Persistent Homology for Virtual Screening [20], Keller et al. use multi-dimensionalpersistence to investigate molecules in the context of drug discovery. Their idea is using two filter functions,one of which is a scalar field defined around the molecule. This is similar to our ideas, however we focus onthe conformational space, rather than individual molecules, and instead of merely calculating the persistenceof the scalar function (the energy landscape, in our case), we explore it using discrete Morse theory.Discrete Morse theory has recently been used to reconstruct hidden graph-like structures from potentiallynoisy data. This has found application in vastly diverse areas. For example, Sousbie et al. used the simulateddensity field of dark matter to reconstruct the network of filaments in the large scale distribution of matterin the universe, the so-called cosmic web [32]. Given a collection of GPS trajectories, Dey et al. recoveredthe hidden road network by modelling it as reconstructing a geometric graph embedded in the plane [8].Another paper by Delgado-Friedrichs et al. defines skeletons and partitions of greyscale digital images bymodelling a greyscale image as a cubical complex with a real-valued function defined on its vertices andusing discrete Morse theory [7].A more fundamental application of discrete Morse theory in topological data analysis is topologicalsimplification. Here, the link with persistent homology allows a topology-based denoising of data, as exploredin [11] and [2].This methodology has already been introduced to the chemical setting as well. Gyulassi et al. [16] usedMorse-theoretic approaches to investigate the properties of a simulated porous solid as it is hit by a projectileby generating distance fields containing a minimal number of topological features, and using them to identifyfeatures of the material. In [3], the authors construct the Morse-Smale complex ofOur approach relies on these results, however our focus is somewhat different. We use the connectionbetween Morse theory and persistent homology to construct a combinatorial summary of the conformationspace of a given molecule, which takes into account both the topological properties of the conformationspace, as well as the energy landscape defined on it.
The general outline of the paper is shown in Figure 1. Starting with the computational details in the secondsection, we explain the conformer generation procedure used, followed by the calculations of the potentialand free energy surfaces. Next, we move on to outline the geometric and topological methods used for theanalysis of the conformational space.Section 3 discusses our mathematical framework with regards to the conformational spaces arising fromthe molecular graphs and group actions on these spaces, as well as the metrics defined on them. This sectioncontains our original theoretical results.The fourth section discusses the results of our analyses performed on several benchmark molecules. Theseare based on the mathematical framework of the previous section, and use the aforementioned mathematicaldata analysis methods.Finally, we end with our conclusions in Section 5.In the Appendix, we explain in more detail the mathematical methods applied in the paper.2igure 1: The workflow of the paper.
The task of creating sets of molecular conformations is inherently complex due to the large number of degreesof freedom in a molecule. Furthermore, it is often the case that in reality what is actually desired is a set oflow-energy structures, and often the ability of an algorithm or program to create these conformers is used asits quality metric [10]. In general conformer generation procedures can be separated into knowledge-based,grid search, or distance geometry based approaches. Knowledge-based approaches use known low-energyconformers, such as crystal structures, to define rules which can then be used to generate conformers fora new molecule. Grid search approaches simply enumerate combinations of different degrees of freedom.Finally, distance geometry uses upper and lower bounds to create sets of conformations that satisfy thesebounds. The reader is directed to [10] for more information regarding these methods.For this work, we are in general unrestricted by energy, instead using a more general physicality criteria.This is because we would like to sample the conformational space as best as we can, rather than simplyobtain representative low energy conformers. However, if energy were totally unrestricted, this would lead toconformers that we would consider unphysical, in particular caused by clashes between atoms. This can berectified by recognising that any commonly used forcefield would give such a configuration an energy ordersof magnitude higher than any other, due to the near exponential scaling of any reasonable Pauli repulsionapproximation.RDKit [22] has been used in this work to create conformation sets, using the ETKDG method [29].This creates a set of conformers with reasonable distributions in bond lengths and angles, but fairly fixedtorsions. To ensure that the conformational space was covered, each conformer had its torsions determinedfrom independent uniform distributions on ( − π, π ] . Lastly, conformers with energies in excess of kcal/molwere removed, ensuring there were no atomic clashes.As well as studying conformer sets with variation in all molecular degrees of freedom, we have alsogenerated sets with only torsional variation. Firstly, conformer sets have been created through a grid searchin the torsion degrees of freedom. This set also requires energy pruning to remove unphysical molecules.Secondly, a metadynamics [21] approach was used. This also allowed the creation of free energy landscapesand do not require energy pruning. The drawback from this approach is that we no longer have informationas to all degrees of freedom, instead reducing our space to the reaction coordinates.Lastly, we have studied the conformational space of cyclooctane, using data obtained from [25]. Thisis a set of 6040 conformations of cyclooctane, varying in torsion angles. Hydrogen coordinates were found3hrough a constrained geometry optimisation. The reader is referred to [25, 4] for more information. Table3 contains information as to the size of the generated conformer sets. Operating within the Born-Oppenheimer approximation, we can consider the molecular energy to be afunction of the atomic coordinates. There are many different methods for calculating this molecular energy,broadly split into those that are classical and quantum mechanical. Here, we use a classical forcefieldto calculate the potential energy of a single conformer. These forcefields contain parameters describing therelative strength of bond bending, bond stretching, and torsional rotations within a molecule, broadly writtenas: E molecule = (cid:88) bonds k ( x − x (cid:48) ) + (cid:88) angles t ( θ − θ (cid:48) ) + (cid:88) torsions (cid:32) (cid:88) n V n cos( nω ) (cid:33) (2.1)Where k , t and V modulate the relative strength of interactions, and non-bonded terms have been droppedas we are not studying systems with more than one molecule in this work. We use the MMFF94 forcefieldas implemented in RDKit [17] to calculate the potential energy of a single conformer.Often, a more useful quantity to study is the free energy. The free energy can be thought of as a’smoothed out’ potential energy, where various degrees of freedom integrated out through an appropriatelyweighted Boltzmann average. In our work, we use metadynamics [21] to calculate free energy surfaces overthe torsional degrees of freedom for a molecule.Both the potential and free energy functions can be considered as maps from a conformational space toa subset of the real line. The potential energy would be a map from the full conformational space, whereasthe free energy is a map from the torsion-only subspace of the conformational space. Computations of the distance matrix associated to the internal configuration spaces ( C int M , d P ) and ( C int M , d O ) were carried out in Matlab using the in-built function ‘Procrustes’. We estimated the local dimension of C int M using local PCA. The algorithm was implemented in Matlab, and it is shown in Figure 1. A more detailedexposition on local PCA is presented in Appendix A. Let S = { C i } Ni =1 be a data set of N conformers ofa molecule M . Given a conformer C i ∈ S we computed its k -nearest neighbours, where k (cid:28) N using thethe Matlab function ‘knn’ together with the results of Procrustes. We gave a matrix representation to eachelement of the permutation invertion group (GPI) which is defined in Section 3. We used this representationto compute the distance matrix associated to the conformational spaces ( C int M , d P ) and ( C int M , d O ) along withits local dimension. Algorithm 1
Local PCA with group actions
Require:
Data set S of N conformers, C , . . . , C N , of a molecule with n atoms in R n , N > n Require:
The automorphism group P of the molecular graph, as a subgroup of the symmetric group S n Require:
A constant γ ∈ (0 , ; higher values of γ result in higher predicted dimension for i ∈ { , . . . , N } do for j ∈ { , . . . , N } do Let ˜ p j ∈ P and ˜ A j ∈ SO (3) be elements minimising d F ( C i , A ( C j · p )) , p ∈ P , A ∈ SO (3) Set ˜ C ij = d O ( C i , C j ) = d F ( C i , ˜ A j ( C j · ˜ p j )) = d P ( C i , C j · ˜ p j ) Let
N N ⊆ { , . . . , n } be such that { ˜ C ij | j ∈ N N } are the k lowest values of ˜ C ij for ≤ j ≤ N Compute PCA for { ˜ A j ( C j · ˜ p j ) | j ∈ N N } and let λ , . . . , λ n be the resulting eigenvalues Let d i ∈ { , . . . , n } be the smallest d such that ( (cid:80) dk =1 λ k ) / ( (cid:80) nk =1 λ k ) > γ ; d i is the predicted dimension of OC int M at C i The predicted dimension ld of OC int M is the median of the d i , ≤ i ≤ N
4e tested orientability of the conformational spaces. In [31] Singer et al. developed an algorithm todetect orientability on large data set which are sample from manifolds. In Algorithm 2 we present a versionof this algorithm that includes the group action of a discrete group.
Algorithm 2
Orientability of conformational spaces
Require:
Data set S with N conformers C i of a molecule with n atoms Perform Algorithm 1 to obtain the predicted dimension ld , and a N × ld matrix O i , i ∈ { , . . . , N } ,with column vectors form an orthonormal basis that approximates the tangent space T C i C int M . For neighbour conformers C i and C j obtain O ij = argmin O ∈ O ( ld ) (cid:107) O − O Ti O j (cid:107) F . Let Z be the N × N matrix with entries given by z ij = detO ij for nearby points and 0 otherwise. Define the matrix A = D − Z , where D is diagonal and D ii = (cid:80) Ni =1 | z ij | . Compute the top eigenvector v top of A. Decide the orientability analysing the histogram of the coordinates of v top . Persistent homology [11, 38] is a method of topological data analysis, which has been used to analyse differenttypes of data sets from different areas in recent years, including chemistry.It is a method of calculating topological, or more accurately, homological, features at different spatialresolutions. Features that persist for a wider range of the spatial parameter are deemed to be more likelyto represent true features of the underlying space the data was sampled from, rather than noise, samplingerrors or particular choices of some parameters.In order to calculate persistent homology of a data set, we need to represent the data as a space with atriangulation, called a simplicial complex. For a set of points S , with | S | = k , the most common way to dothis is to define the k -simplex (or k -dimensional polytope) ∆ S with the points as its vertices.This work involves the use of persistent homology in two different contexts. First of all, we use it toinvestigate the homology of the points sampled from the conformation space. With some sensible assumptionson the sampling quality of these points, we can assume that we can deduce from this the homology of theconformation space, and therefore say something about its topological features.Secondly, we use the connection between persistent homology and discrete Morse theory in our analysisof the energy landscapes defined on the conformation spaces. In order to investigate the energy landscapes on the conformation spaces via persistent homology, we needto filter the conformation space (or its combinatorial approximation) by the real-valued energy function. Inessence, we are filtering our simplex ∆ S by both the pairwise distance function f defined above, and theenergy function E : ∆ S → R . However, we are not actually interested in the two-parameter persistencemodule we get this way.Instead, we wish to construct a combinatorial structure, called the Morse-Smale complex of the energyfunction, which represents the associated gradient flow and summarises it according to its critical points.This construction is described in more detail in the appendix.In order to construct the Morse-Smale complex with the correct number of critical points, we need tofilter a simplicial complex which has the Euler characteristic of the conformational space. Given a valueof f for which the preimage f − is a triangulation with the ’correct’ topology of our conformational space,we can use the lower-star filtration on this complex using the filter function E , and compute one-parameterpersistence.In practice, the construction we use to get the triangulation, are D alpha complexes, which can substan-tially reduce the computational complexity from having to construct Rips complexes with a threshold value.5n fact, alpha complexes are thresholded Rips complexes, but there are easier implementations available, e.g.in Matlab.To describe this construction, first let us say a word about Voronoi diagrams. Given a set S of points inEuclidean space R n , one defines convex polytopes V s , s ∈ S called Voronoi cells, which consist of all points x ∈ R n such that the distance between x and s is less than the distance between x and s (cid:48) for any other s (cid:48) ∈ S . The subsets V s give a tessellation of R n .Given a finite set of points S ⊂ R n and a real number r ∈ R , one defines the region R s ( r ) = ¯ B s ( r ) ∩ V s ,where ¯ B s ( r ) is the closure of the ball of radius r centred at s . Now we can form the α -complex (or alphacomplex) K r as follows: a subset σ ⊂ S is called an α -simplex if (cid:92) s ∈ σ R s ( σ ) (cid:54) = ∅ . After we have discovered the topology of the conformational space from which we sampled our point set S , we can safely choose a triangulation T of the point cloud that reflects this topology (i.e. choose a radiusfor the Rips complex) and linearly extend the energy function E : S → R to the entire simplicial complex,which gives us a piecewise-linear function ˜ E : T → R . This allows us to explore the energy landscape of thegiven molecule.Let M be a compact manifold and let f be a Morse function defined on M . Then the alternating sum ofthe number of critical points of index k of f equals the Euler characteristic of the manifold. The same goesfor a simplicial complex. Therefore, we have a linear relation between the minima of our energy landscapeand the other critical points, which is unique to the conformation space (unique in the sense that it dependson the topology of the conformation space).The Morse-Smale complex is commonly applied for surface segmentation [32, 7, 8]. Each segment of thesurface has uniform integral lines. This leads to a reduction of information. Instead of the entire energylandscape, we are left with a cell complex which accounts for the unique features of the energy function.Each cell has uniform flow, meaning that we can read off the Morse-Smale complex directly which energyminimum a given conformer will flow to as it loses energy. It also shows the unstable equilibria, the maximaand saddle points where there is no flow, but the slightest perturbation can lead to a more drastic change ofthe conformation.Moreover, this cell complex still has the same homology as the conformation space, combining the char-acteristics of both the conformation space and the energy landscape in a compact, combinatorial structure,which can be regarded as a unique descriptor of the molecule. Our model of conformation spaces of molecules will be given by embeddings of graphs in R . In such graphs,we combinatorially encode atoms and bonds in a molecule as vertices and edges of a graph, respectively. Tointroduce this, we need to define molecular graphs. Definition . A molecular graph is a tuple M = ( V, E, c V , L, Θ) consisting of the following data.1. Γ = (
V, E ) is a finite undirected graph : that is, V is a finite set, and E ⊂ V × V is a subset such that,for any v, w ∈ V , we have ( v, v ) / ∈ E , and ( v, w ) ∈ E if and only if ( w, v ) ∈ E . We refer to V as theset of vertices of M , and to E as the set of edges of M .2. c V : V → N is a vertex colouring ; for v ∈ V , we will think of c V ( v ) as the chemical element the atomcorresponding to v represents, and will set c V ( v ) to be the atomic number of this element.3. L : E → (0 , ∞ ) is a set of length constraints ; L ( e ) will be called the length of an edge e ∈ E .6. Θ : E → (0 , π ] is a set of angle constraints , where E = { ( v, w , w ) ∈ V × V × V | ( v, w ) , ( v, w ) ∈ E, w (cid:54) = w } is viewed as the set of adjacent edges in Γ ; we will refer to Θ( v, w , w ) as the angle between edges ( v, w ) ∈ E and ( v, w ) ∈ E .For consistency, we also require that L ( v, w ) = L ( w, v ) for every ( v, w ) ∈ E and Θ( v, w , w ) = Θ( v, w , w ) for every ( v, w , w ) ∈ E . Remark . In Definition 3.1 the length and angle constraints have the following chemical meanings. Thebond constraint is associated to the average of a bond length between two atoms A and B, whose centres ofmass are modelled as points in R n . The angle constraint is associated to the angle given by the hybridationtype of the atom around which the bond angle is defined. Therefore in this definition we assume that thebond lengths and bond angles are rigid. That is, they are constant for every conformer.To define an embedding of a molecular graph M , note that the length constraints can be used to make M into a geodesic metric space. Indeed, we may define the geometric realisation of M as a metric spacedefined by gluing an interval [0 , L ( e )] for each edge e ∈ E along the vertices V in the obvious way. Fromnow on, slightly abusing notation, we will identify a molecular graph M with its geometric realisation. Definition . An configuration of a molecular graph M = ( V, E, c V , L, Θ) is an embedding (injectivecontinuous map) ϕ : M → R from the geometric realisation of M to R such that1. for any edge e ⊆ M , the restriction ϕ | e : e → R is an isometric embedding; and2. for any adjacent edges ( v, w ) , ( v, w ) ∈ E (that is, for any ( v, w , w ) ∈ E ), we have ∠ ( ϕ ( w ) , ϕ ( v ) , ϕ ( w )) =Θ( v, w , w ) , where ∠ ( x, y, z ) is the angle between the line yx and the line yz in R for x, y, z ∈ R .The conformational space of M , denoted C M , is the set of all embeddings ϕ of M .From the condition (1) in Definition 3.3 and from the fact that geodesics in R are unique, it followsthat an embedding ϕ : M → R can be recovered uniquely from its values on the finite set V . In par-ticular, if V = { v , . . . , v n } contains n points, then any embedding ϕ can be recovered from the tuple C ϕ = ( ϕ ( v ) , . . . , ϕ ( v n )) ∈ R n , called the vector representation of ϕ . The map C M → R n , ϕ (cid:55)→ C ϕ is there-fore injective. We use this map to realise C M as a subspace of R n ; in particular, as R n is a topologicalspace, this induces a subspace topology on C M . Thus, from now on, we will regard C M as a topologicalspace.In order to analyse the connectivity properties of C M , we will use the following. Definition . The orientation of an embedding ϕ of the molecular graph M is the map O ϕ : E →{− , , } , where E = { ( v, w , w , w ) ∈ V × V × V × V | ( v, w i ) ∈ E, w i (cid:54) = w j whenever i (cid:54) = j } , defined by O ϕ ( v, w , w , w ) = sign ( ϕ ( w ) − ϕ ( v )) · [( ϕ ( w ) − ϕ ( v )) × ( ϕ ( w ) − ϕ ( v ))] , where sign c = − if c < , if c = 0 , if c > , for any c ∈ .Given a molecular graph M = ( V, E, c V , L, Θ) and a quadruple ( v, w , w , w ) ∈ E , one can see thatthe number | O ϕ ( v, w , w , w ) | ∈ { , } is independent of the embedding ϕ of M . Indeed, it follows fromthe fact that any two points x, y ∈ S on the sphere S ⊆ R are joined by a unique geodesic on S (unless y = − x ) that we have O ϕ ( v, w , w , w ) = 0 ⇔ either θ + θ + θ = 2 π or θ ij + θ ik = θ jk for some { i, j, k } = { , , } , (3.1)7here θ ij = Θ( v, w i , w j ) . We will call a tuple ( v, w , w , w ) ∈ E planar if O ϕ ( v, w , w , w ) = 0 for some,or any, ϕ ∈ C M (that is, if both sides of (3.1) are true). We will say that the molecular graph M is planar ifall tuples in E are planar. Note that, by convention, a disconnected molecular graph is planar if and onlyif all its connected components are planar. Remark . A planar graph Γ is a graph for which there exists an embbeding from the geometric realisationof Γ into R . Thus our definition of a planar conformation is more restrictive and does not coincide withthat of a planar graph.On the other hand, the map O ϕ itself is not independent of the embedding ϕ , unless all tuples in E areplanar. Indeed, given an embedding ϕ of M and a quadruple ( v, w , w , w ) ∈ E , it is easy to check that O ι ◦ ϕ ( v, w , w , w ) = − O ϕ ( v, w , w , w ) , where ι : R → R , x (cid:55)→ − x is the antipodal map (it is clear fromthe definition that ι ◦ ϕ : M → R is also an embedding of M ). This allows us to show the following result. Proposition 3.6.
If a molecular graph M is not planar, then the conformational space C M is not pathconnected. More precisely, if ϕ, ψ ∈ C M and ( v, w , w , w ) ∈ E are such that O ϕ ( v, w , w , w ) = − O ψ ( v, w , w , w ) (cid:54) = 0 , then ϕ and ψ are in different path components of C M .Proof. As M is not planar, there exists a quadruple ( v, w , w , w ) ∈ E that is not planar. Let ϕ ∈ C M beany embedding: then clearly ι ◦ ϕ ∈ C M , where ι : R → R is the antipodal map. Thus the first part of theProposition follows from the second part, by taking ψ = ι ◦ ϕ .To prove the second part of the Proposition, suppose for contradiction that ϕ, ψ ∈ C M are in the samepath component of C M . Then there exists a path α : [0 , → C M such that α (0) = ϕ and α (1) = ψ . Considerthe map D : [0 , → R ,t (cid:55)→ ( α t ( w ) − α t ( v )) · [( α t ( w ) − α t ( v )) × ( α t ( w ) − α t ( v ))] where α t = α ( t ) : M → R . The map D is clearly continuous, and it follows from the assumption that D (0) = − D (1) (cid:54) = 0 . Therefore, by the Intermediate Value Theorem, there exists t ∈ (0 , such that D ( t ) = 0 .But this implies that O α ( t ) ( v, w , w , w ) = 0 , contradicting the fact that ( v, w , w , w ) ∈ E is not planar.Thus C M is not path connected, as required. It is known that the quantum mechanical description of a molecular systems must be invariant to the severaltypes of transformations [5], in which the following are included:1) Rotation of the positions of all particles about any axis through the centre of mass.2) Translation in space.3) Permutation of the positions of any set of identical nuclei.4) Simultaneous inversion of the positions of all particles in the centre of mass.Therefore it is crucial to study the symmetries present in our model of the conformational space C M bystudying actions of groups on C M . In what follows, for any n ∈ N , let S n be the symmetric group on n elements (the group of all permutations of an n -element set), let D n be the dihedral group of order n (thegroup of all symmetries of a regular n -gon), and let Z n be the group of integers modulo n . Let E (3) thegroup of isometries of R and let SE + (3) be the subgroup of E (3) of orientation preserving isometries. It isknown that there are group isomorphisms E (3) ∼ = O (3) (cid:110) R and SE (3) + ∼ = SO (3) (cid:110) R , where O (3) and SO (3) are the group of real invertible matrices with determinant ± and +1 , respectively.In [15], Guichardet approches to configuration spaces of molecules defining the following spaces of n -points, n ≥ : 8 M = { ( x , . . . , x n ) ∈ R n | x i (cid:54) = x j , if i (cid:54) = j } (3.2) X M = { ( x , . . . , x n ) ∈ R n | x i (cid:54) = x j , if i (cid:54) = j, n (cid:88) i =1 m i x i = 0 } (3.3)The group R of E (3) acts on X M by translation and it is easy to see that X M = X M × R . Adding thecondition span ( x j , . . . , x n ) = R in (3.3) we obtain another, configuration space, X P M . The group SO (3) acts freely on X P M and there is a principal SO (3) -bundle: SO (3) (cid:47) (cid:47) X P M π (cid:47) (cid:47) X int M (3.4)where X int M is homeomorphic to the orbit space X P M /SO (3) . In our model there are restrictions on the nucleipositions which are imposed by the underlying molecular graph. Therefore it is expected that our C M is asubspace of X P M .Let M = ( V.E, c V , L, Θ) be a molecular graph such that | V | = n . Given a conformation C ϕ of a M , C M = ( ϕ ( v ) , . . . , ϕ ( v n )) ∈ R n , the groups E (3) and O (3) ∼ = E (3) / R act on it by isometries. We definethe following space C P M = { C ϕ = ( ϕ ( v ) , . . . , ϕ ( v n )) ∈ R | n (cid:88) i =1 c V ( v i ) ϕ ( v i ) = 0 } (3.5)We can also see that C M ∼ = C P M × R . We might assume additionally that | V | ≥ . The group SO (3) actson M by rotation which induces an action C P M . Let C int M be the orbit space of the action of SO (3) on C P M .We give the space C int M the quotient topology. Then we have the following result. Theorem 3.7.
Let M be a molecular graph. If Θ( E ) (cid:42) { π } then the quotient map q : C P M → C int M definesa principal SO (3) -bundle over C int M . Moreover this principal SO (3) -bundle is trivial.Proof. Let C ϕ = ( ϕ ( v ) , . . . , ϕ ( v n )) ∈ C P M ⊂ R n − . Let us write x i = ϕ ( v i ) for all ≤ i ≤ n . The elementsof SO (3) act on C ϕ = ( x , . . . , x n ) ∈ C P M with the canonical action. Let π : C P M → C int M be the quotient map.By assumption there exists a triple ( v, W , W ) ∈ E such that Θ( v, w , w ) (cid:54) = π then this action has nofixed points. It follows that q − ( C ϕ ) ∼ = SO (3) for all C ϕ ∈ C P M . Therefore, the quotient map q : C P M → C int M defines a principal SO (3) -bundle over C int M . It is a routine calculation to check that s is a continuous map.Since the map π that defines the principal SO (3) -bundles has a section, then this bundle is trivial.Now we show that the bundle π : C P M → C int M is trivial, that is C P M ∼ = C int M × SO (3) . It suffices to showthat there is a map s : C int M → C P M such that the composition π ◦ s is the identity map on C int M . This isequivalent to choosing one point from each SO (3) orbit in C P M in a continuous way. Now since the action of SO (3) there exists a unique A ∈ SO (3) and D ∈ R such that Aϕ ( v ) + D = (0 , , Aϕ ( w ) + D = ( L ( v, w ) , , Aϕ ( w ) + D = ( L ( v, w ) cos θ, L ( v, w ) sin θ, where θ = Θ( v, w , w ) . We define a map s : C int M → C P M by sending π ( C ϕ ) to C Aϕ . By construction themap π ◦ s is the identity on C int M .We will start analysing the action of discrete groups. In [24], Longuet-Higgins introduces a symmetrygroup of non-rigid molecules, called the complete nuclear permutation inversion group , or CNPI group forshort. Suppose we are given a molecule M consisting of n = n + · · · + n k atoms where n i ≥ is the numberof atoms of element c i , for some distinct chemical elements c , . . . , c k . We let the conformational space C M of M be the set of all embeddings V → R , where V is a finite set of cardinality n (corresponding to the n atoms of M ); we then have an obvious injective map C M → R n , which makes C M into a topological spaceas before. The CN P I group of M is then CN P I M = S n × · · · × S n k × C . This group acts on C M asfollows: 9. for i = 1 , . . . , k , the group S n i permutes the images of the atoms of element c i ; and2. the non-trivial element of C sends ϕ ∈ C M to ι ◦ ϕ , where ι : R → R is the antipodal map.This construction fits in our setting as follows. Let Γ = ( V, ∅ ) be the discrete graph on | V | = n = n + · · · + n k vertices – that is, Γ has no edges. Label the vertices of Γ as V = { v , , . . . , v ,n , v , , . . . , v ,n , . . . , v k, , . . . , v k,n k } , and define c V : V → N by c V ( v i,j ) = c i ∈ N . Note that since Γ has no edges we also have E = ∅ , and sothe functions L : ∅ → (0 , ∞ ) and Θ : ∅ → (0 , π ] in Definition 3.1 are defined uniquely. Thus, we have amolecular graph M = ( V, ∅ , c V , L, Θ) . As a topological space, M is a discrete space of cardinality n , andan embedding M → R is just a choice of n distinct points in R . Thus, our construction generalises theconstruction in [15]. The graph permutation inversion group, introduced below, generalises the concept ofthe CNPI group. Loosely speaking, this is the group of automorphisms that preserve the structure of themolecular graph M . Definition . Let M = ( V, E, c V , L, Θ) be a molecular graph. A graph permutation of M is a bijection g : V → V such that1. for any v, w ∈ V , we have ( v, w ) ∈ E if and only if ( g ( v ) , g ( w )) ∈ E ;2. c V ( v ) = c V ( g ( v )) for every v ∈ V ;3. L ( v, w ) = L ( g ( v ) , g ( w )) for every ( v, w ) ∈ E ; and4. Θ( v, w , w ) = Θ( g ( v ) , g ( w ) , g ( w )) for every ( v, w , w ) ∈ E .Moreover, let ψ ∈ C int M , and let C ψ ⊆ C M be the set of embeddings ϕ ∈ C M such that O ϕ ≡ O ψ ; byProposition 3.6, C ψ is a union of path components of C M . Then a graph permutation g is said to be orientation preserving with respect to C ψ if5. O ψ ( v, w , w , w ) = O ψ ( g ( v ) , g ( w ) , g ( w ) , g ( w )) for every ( v, w , w , w ) ∈ E ,and orientation reversing with respect to C ψ if(5’) O ψ ( v, w , w , w ) = − O ψ ( g ( v ) , g ( w ) , g ( w ) , g ( w )) for every ( v, w , w , w ) ∈ E .We denote by Sym + C ( M ) and Sym − C ( M ) the sets of graph permutations that are orientation preserving andorientation reversing with respect to C = C ψ , respectively, and we let Sym ± C ( M ) = Sym + C ( M ) ∪ Sym − C ( M ) .Clearly, the sets Sym + C ( M ) and Sym ± C ( M ) are groups under composition.It is clear that a graph permutation g defines a map g : C M → C M by permuting the points { ϕ ( v ) | v ∈ V } for an embedding ϕ : M → R . Moreover, if ψ ∈ C ψ ⊆ C M are as in Definition 3.8, thenfor a graph permutation g that is orientation preserving with respect to C ψ , this map restricts to a map g : C ψ → C ψ . Similarly, if a graph permutation g is orientation reversing with respect to C ψ , then we havea map ˆ ι ◦ g : C ψ → C ψ , where ˆ ι : C M → C M is defined by ˆ ι ( ϕ ) = ι ◦ ϕ and ι : R → R is the antipodal map.It is also easy to check given two such maps, each of which is either g for g orientation preserving or ˆ ι ◦ g for g orientation reversing with respect to C ψ , the composite of these two maps will also have this form. Thus,we define the graph permutation inversion group , or GPI group , of M with respect to C = C ψ to be GP I M ,C = (cid:8) g | g ∈ Sym + C ( M ) (cid:9) (cid:116) (cid:8) ˆ ι ◦ g | g ∈ Sym − C ( M ) (cid:9) . Note that
GP I M ,C ∼ = (cid:40) Sym ± C ( M ) × Z if M is planar , Sym ± C ( M ) otherwise . We will usually omit C from the notation and will simply talk about the GPI group GP I M of M . Thefollowing theorem is clear because GP I is a finite group. An introduction to orbifolds is given in AppendixB. 10raph CNPI group GPI group n -cycle S n × C D n × C n -path S n +1 × C C × C pentane S × S × C ( C (cid:111) C ) (cid:111) C tetraethylmethane(heavy atoms) S × C S Table 1: Graphs and their associated symmetry groups
Theorem 3.9.
Let M be a molecular graph and G be its GP I group. Suppose that C int M is a manifold. Thenthere is a properly discontinuous action of the group G on C int M . In particular, if GP I is non trivial, thequotient space C int M /G has the structure of an orbifold. Example 3.10.
Let M = (Γ , c V , Θ) be the molecular graph associated to pentane. If we only considerthe carbon atoms with a rigid conformation, then the orbifold conformational space is homeomorphic to thequotient of a torus S × S by the action of the group Aut (Γ) = C . The groups C acts on the conformationalspace by permuting the ends of the molecular graph M . This action is equivalent to permute the parameters t and t associated to two torsion angles. As it is shown in Figure 2. C = Figure 2: Action of C on S × S . The orbifold conformational space OC int M = C int M /G is homeomorphic toa Moebius strip. Let M = ( V, E, c V , Θ) be a molecular graph such that | V | = n . We can endow a conformational space C int M with the following metrics:1. Given two matrices X, Y ∈ Mat × n ( R ) representing two conformers in C int M the Frobenius distance d F ( X, Y ) between them, is defined to be d F ( X, Y ) = (cid:107) X − Y (cid:107) F , where (cid:107) − (cid:107) F is a matrix norm of an × n matrix A defined as the squared root of the sum of theabsolute squares of its entries (cid:107) A (cid:107) F = (cid:118)(cid:117)(cid:117)(cid:116) (cid:88) i =1 n (cid:88) i =1 | a ij | . (3.6)11bserve that given a matrix A ∈ Mat p × n the Frobenius norm C int M coincides with the Euclidean metric (cid:107) − (cid:107) in the vector representation of C int M .2. Given two conformers X, Y ∈ C int M and a matrix R ∈ SO (3) , the Procrustes distance function is definedas, d P ( X, Y ) = inf R ∈ SO (3) {(cid:107) X − RY (cid:107) F }
3. A common distance metric used in molecular sciences is the root-mean-square deviation (RMSD): d RMSD ( X, Y ) = 1 √ N d P ( X, Y ) which is commonly used when aligning chemical structures such as proteins [9], and can be shown tobe a metric [33, ? , 30].Let M be a molecular graph and let GP I be the graph permutation inversion group of M . We can givethe orbifold configuration space O C int M the following metric d O ( X, Y ) = min g ∈ GP I { d P ( X, g · Y ) } (3.7) Following our discussion in Section 3, each conformer C ϕ in the configuration space C M is uniquely determinedby the positions of the atoms in the molecules. A conformation of a molecule M with n atoms is representedby an n -by- matrix with real coefficients. The entries of the i -th row vector in this matrix are the spatialcoordinates of the i -th atom in C ϕ . We eliminate the action of the subgroup of translations T by fixing thecentre of mass of each conformer at (0 , , ∈ R . From the set of molecular configurations S = { C i } Ni =1 sampled from the configuration space C P M we generated a data set of the following metric spaces: • The metric space ( X, d F ) . Assuming that the bond angles are almost constant then by Theorem 3.7,we have that C M ∼ = C int M × SO (3) . We generate a set of n -by-3 matrix in R n × , where n is the numberof atoms in the molecule. Each matrix is associated to an aligned conformer. The euclidean metrichas been used in other studies of conformational spaces, such as in [25, 26]. • The metric space ( X, d P ) defined by the Procrustes metric. We obtained a distance matrix. • The metric space ( X, d
RMSD ) defined by the RMSD metric. We obtained a distance matrix. • The metric space ( X, d O ) defined by the orbifold metric. We obtained a distance matrix. We used geometric and topological data analysis to study the conformational spaces of small molecules:butane, pentane, alanine dipeptide, and cyclooctane. We used principal component analysis, or PCA – onethe most common tools in data analysis.The data set of conformers associated to a molecule M , consists of a set S of 3-by- n real matrices S = { A i } Ni =1 , with N the number of conformers. We compute the distance matrix to study the geometricand topological properties of the conformational space C M . In Table 2 we show the result of local dimensionand orientability. In our analysis, fluoromethane is the only molecule with no torsional angles. That is, it isthe only molecule that has a contractible conformational space.Local dimension shows that data variability depends, to a great extent, on torsional angles. Indeed, allthe non-rigid molecules have at most 2 torsional angles, and this allows to reduce the local dimension of theconformational space from n − to 2 or 3 dimensions. In contrast, although fluoromethane is the smallest12olecule, the local dimension observed in this case is the highest. For this molecule, the reduction of thelocal dimension depends on other parameters such as lengths of bonds and angles between them.One interesting outcome of our local dimensionality study is the detection of singularities. This algorithmapproximates the minimal dimension required to span a neighbourhood of a point. If the analysed pointis a singular point, its neighbourhood will typically have a higher dimension than the one observed for apoint whose neighbourhood can locally be modelled as a subset of Euclidean space. Thus we could detectthe singularities in the conformational spaces of cyclooctane and pentane after quotienting out the action of C . In the former case, the local dimension detected at a singular point is higher that the local dimension atnon-singular points. In [26], it was shown that the conformational space of cyclooctane is a non-manifold.More specifically, the authors showed that the conformational space can be embedded in 5-dimensionalEuclidean space and that it corresponds to the space formed by the Klein bottle and a -sphere intersectingin circles. The local dimension and the orientability of the clusters was determined. Our results show thatthe local dimension of the set of singular points is one, whereas the local dimension of the other clusters is2. Moreover, one of these clusters is orientable and the other is not. Persistent homology of these spaces isshown in Table 2. Molecule Loc. Dim. dim. singularities Orientablebutane 2 - Yespentane 2 - Yesalanine dipeptide 2 - Yescyclooctane 2,3 1 Nopentane( C ) 2 1 NoCyclooctane (singular set) 1 - YesCyclooctane (sphere) 2 - YesCyclooctane (Klein bottle) 2 - NoTable 2: Local dimension and orientability Local principal component E i gen v a l ue Local dimension CS(pentane) (a)
Local principal component E i gen v a l ue s CS(Ala-Ala) (b)
Figure 3: Local dimension of C int M was determined using PCA at each point. Plots (a) and (b) show the localPCA at one point of C int M of pentane and alanine dipeptide, respectively, with the euclidean metric. In bothcases local PCA suggests that there are two principal components. This implies that the local dimension ata chosen point C ϕ ∈ C int M of both pentane and alanine dipeptide is 2.13 Orientability CS(pentane) (a) -0.015 -0.01 -0.005 0 0.005 0.01 0.0150500100015002000250030003500400045005000
Orientability CS(Ala-Ala) (b)
Figure 4: The results of the detection of orientability of pentane and dipeptide alanine are shown in figures(a) and (b), respectively. There is noise in the data set of pentane, however it is possible to distinguish twowell separated region which corresponds to a choice of orientation at each point in both pentane and alaninedipeptide. (a) (b)
Figure 5: Figure (a) shows the orientability test result for C int M of cyclooctane with the euclidean metricwhereas figure (b) shows the corresponding result for C int M of pentane with the orbifold metric. In this section we analyse the topology of the internal configuration spaces C int M . We investigate whether thechoice of metric for the conformational space leads to a significantly different persistent homology. The alanine dipeptide molecule can be seen in Figure 8. We note that there are two free torsions in the alaninedipeptide molecule, as the peptide bonds themselves are considered to be inflexible due to its resonancestabilisation. Therefore, ignoring bond stretching and bending, alanine dipeptide would be predicted to havea conformational space of S × S = T .The vector representation of alanine dipeptide was defined by aligning the set of conformers to a minimumenergy conformer calculated by density functional theory. Furthermore, we aligned each conformer to a coreset of atoms. This flexibility is inherent within the vector representation of the conformational space.Persistent homology was calculated by using the Rips complex persistence on the Euclidean distance onthe vector representation. Persistence was calculated on the vector space using all atoms, and also for heavy14 a) (b) Figure 6: Orientability of clusters of cyclooctane: (a) orientable cluster (b) non-orientable cluster. Theclustering method identified two clusters of local dimension 2 which present different orientability properties.Molecule
N β β β Alanine dipeptide 9112 1 2 1Pentane 9108 1 2 1Pentane/ C β k for the conformational spaces of the molecules studied in this work, calculatedusing the RMSD for all molecules and orbifold metric for pentane. The Betti numbers of four subspaces ofthe conformational space of cyclooctane are shown in the table.(non-hydrogen) atoms. These can be seen in Figure 9.Firstly, it is clear that both sub-representations have similar persistent Betti numbers of (1 , , . Thismatches those of a torus, as earlier predicted. However, these features appear at different times depending onrepresentation. This can be explained when considering the behavior of the Euclidean metric as the numberof dimensions increases. The all atom system is 66-dimensional, whereas the heavy atom system has only30 dimensions. This leads to a shorter average distance between conformers in the conformational space inthe heavy atom system.To create the RMSD representation for alanine dipeptide, the optimal alignment between every pairof conformers was found by optimising the RMSD between their atoms. This was calculated for bothall atom and heavy atom sets. Persistent homology was then calculated using the Rips filtration on theoptimum RMSD metric, with the resulting persistence diagram in Figure 10. Again, we find similar persistentBetti numbers of (1,2,1). This suggests that the topology of our conformational space is independent ofrepresentation - however it will be shown that this is not always the case. Further, there is a much smallerdifference in choice of atom subsets in the case of the RMSD representation. This is due to the difference inthe behaviour of the RMSD metric itself. For example, features appear slightly earlier in the all atom system.This is because the average displacement of hydrogen atoms tends to be quite small, but the increase in thedenominator of the RMSD metric causes a slightly lower metric. In the case of the vector representation,15 a) (b) Figure 7: 3d-embbeding of C int M spanned by heavy atoms of (a) butane and (b) pentane. The scatter plotsare coloured by the energy function.Figure 8: The structure of alanine dipeptide. The alignment core refers to the heavy atoms inside the squarebox.each hydrogen adds an extra 3 dimensions to the Euclidean distance - the RMSD does not suffer from thiscurse of dimensionality in the same way. The structure for pentane can be seen in Figure 11. Similarly to alanine dipeptide, there are two free torsionsin pentane. For this section we are ignoring the symmetry of the pentane molecule, and therefore we expectthe conformational space to have the topology of the torus T .Similarly to alanine dipeptide, we define our vector representation by aligning each conformer to somereference, in this case a DFT optimised conformation. However, in contrast to alanine dipeptide, we donot align to a core, but instead align to minimise the RMSD between all carbon atoms. Persistence is thencalculated analogously to alanine dipeptide. The RMSD representation for pentane was defined by calculatingthe optimum RMSD distance between all carbon atoms for each pair of pentane conformers. Persistence wasthen calculated using this metric. The vector and RMSD representation persistent homology can be foundin Figure 12.There is a clear difference in the persistent homology for these two representations. Whereas the RMSDrepresentation has persistent Betti numbers of (1,2,1), those of the vector representation are far more unclear.This makes clear one of the main drawbacks of the vector representation of conformational spaces, in thatthey are dependent on the alignment to a reference. If we had aligned to some core of the three centralcarbons, we would have seen persistent Betti numbers of (1,2,1), as we did for alanine dipeptide. Furthermore,although aligning each conformer to some reference may make it easier to visualise the entire set at once (as16 a) All atoms (b) Heavy atoms only Figure 9: Persistence of the vector space representation of alanine dipeptide (a) All atoms (b) Heavy atoms only
Figure 10: Persistence of the RMSD representation of alanine dipeptide17igure 11: The structure of pentane (a) Vector representation (b) RMSD representation
Figure 12: Persistence of the different representations of the pentane conformational space. Symmetry isignored for these representationsis often done in the study of proteins etc.), it makes pairwise comparisons of different conformers hard toperform. In contrast, the RMSD representation does not suffer from this, and therefore leads to the correctpersistent Betti numbers. Furthermore, as we will now see, the RMSD representation allows us to directlytake into account molecular symmetry - which would be impossible for the vector representation.The molecular graph of pentane has an inherent symmetry, as discussed earlier. In particular, the twotorsions are equivalent, rather than being distinguishable. It is a standard result that leads to a Möbiusband topology.We can take this symmetry into account when calculating the RMSD metric between two conformers.This is done by performing two separate RMSD alignments. In the first, we align the two conformers suchthat each carbon in the first conformer is matched to the same carbon in the second conformer. In thesecond, we match each carbon in the first conformer to the opposite carbon in the second. We then choosethe optimum RMSD to be our metric. The resulting persistent homology can be seen in Figure 13. There isa significant difference in the persistent homology when compared to the original RMSD representation. Inparticular, the persistent Betti numbers are now (1,1,0) matching those of the Möbius band.
Once singular points were identified, they were removed from the data set. In principle, this separated thedata into its manifold components. These can then be found using a clustering algorithm, in our work,HDBScan [6]. Subsequently, the manifolds were matched, in an attempt to recreate the spherical and Kleinbottle components found in the original work. This could then be verified using persistence.We used the software
Ripser [36] to compute the persistent homology of our point data sampled fromthe conformational spaces.To verify that we had correctly found the spherical component, we calculated the persistent homology ofthe Rips complex constructed on the RMSD metric between conformers. The resulting persistence diagramcan be seen in Figure 14. We can see that the persistent Betti numbers are (1,0,1), as expected for a sphere.18igure 13: Persistence of the RMSD representation of pentane conformational space, with symmmetry takeninto accountFigure 14: Persistence of the RMSD representation of the spherical component of the cyclooctane conform-ational spaceVerifying the presence of the Klein bottle component is slightly more involved. Here, we perform persistenthomology calculations in the same manner as before. However, we calculate homology over two differentfields of coefficients, namely Z and Z . The Klein bottle has different (persistent) Betti numbers over thesefields, (1,2,1) and (1,1,0) respectively - a torus would not. This allows us to verify with more confidencethe presence of a Klein bottle component. The persistence diagrams can be seen in Figure 15. The correctpersistent Betti numbers are found. In order to compute the Morse-Smale complexes of the energy landscapes, we made use of the TopologyToolKit (ttk) [35], a software platform designed for the topological analysis of scalar data. We also usedMatlab’s alphaShape function for initial processing, which produces an alpha-shape triangulation from a pointcloud, of a specified radius. Extracting the boundary of this triangulation gave us the surface triangulationwe required. This, together with the energy values at each point was input into the Topology ToolKit19 a) Z (b) Z Figure 15: Persistence diagrams to verify the presence of a Klein bottle component to the cyclooctaneconformational space.software.We analysed the potential energy landscapes of cyclooctane, alanine dipeptide, pentane and fluoropentane.The conformational spaces of alanine dipeptide, pentane and fluoropentane are all tori, which makes for aside-by-side comparison of different energy landscapes on what is topologically the same conformationalspace. There is also an analysis of the free energy landscape of alanine dipeptide.The results for cyclooctane are compared with the results from [26]. After finding the singular points, andseparating out the sphere and Klein bottle components, we did separate analyses of the energy landscapes onthese two components. The points sampled from a sphere, an orientable, low-dimensional manifold, can betriangulated so that the resulting simplicial complex has the topology of a sphere. This simplicial complexcan then be input into the ttk software and filtered by the energy function.We use the connection between persistent homology and discrete Morse theory, to smooth this energysurface by removing topological features below a certain persistence threshold. In order to do this, wecompute the persistence diagram, as well as a statistical summary of it, called the ttkPersistenceCurve inthe ttk software, which plots the distance from the diagonal against the number of persistent points in thepersistence diagram. Depicting this curve in the log scale allows us to estimate a sensible level of noise, byobserving a change in the gradient of this curve. In Figure 16, this computation is shown for the sphericalpart of cyclooctane, while Figure 17 shows the computed Morse-Smale complex.For alanine dipeptide, we consider the model of the conformational space where only the two torsionalangles are allowed to rotate, giving a torus. The scalar values of the potential energy are then given on thistwo-dimensional surface, displayed in Figure 18.We then did the same calculation for the free energy landscape. The results are displayed in Figure 19.The obvious conclusion is that the free energy landscape is significantly less noisy. Another difference is thechange in the number of minima. There are far more local maxima and saddle points in the potential energylandscape than in the free energy landscape.For pentane, the conformational space is a torus as well. The potential energy surface was analysed in thesame fashion as for alanine dipeptide. In Figure 21, the results of this analysis are depicted. The symmetryof the molecule can be observed in the energy landscape as well, in contrast to the energy landscape ofalanine dipeptide. After thresholding, only one minimum remains.20 a) (b) (c)
Figure 16: The noise detection for the spherical part of cyclooctane, computed using the ttk software. Thefirst figure (a) shows the zero-dimensional persistence diagram, the second (b) shows the persistence curvewith the likely persistence threshold for noise circled in red, and the final (c) image shows the denoisedpersistence diagram, using the threshold discovered through the persistence curve.Fluoropentane is a molecule that is very similar to pentane. It has the exact same underlying graph witha different vertex colouring. Naturally, this implies that the conformation spaces of the two molecules areidentical. This means we can have a direct comparison of the differences in the energy landscapes.Finally, we are able to analyse the energy landscape for the Klein bottle component of cyclooctane. Asthe Klein bottle is not an orientable manifold, and also cannot be embedded into R , we are unable to use ttk . However, due to the link between Morse theory and persistent homology, we can use persistent homologyto find the values of the extrema.Firstly, we must create a simplicial complex. It can be seen from the persistence diagram of the Ripsfiltration of the Klein bottle component that a filtration value of . leads to the correct topology. TheMMFF94 energy of each simplex is then found, and the persistent homology of the energy function iscalculated. This can be seen in Figure 23.The infinite bars correspond to the topology of the space itself (i.e. the Klein bottle). Of interest,however, are the features that have a death value. Those correspond to the local critical values of the energyfunction. The zero-dimensional points are born at local minima and die at saddle points, while the one-dimensional points are born at saddle points and die at local maxima. We speculate that this methodology,of using persistence to find simplicial complexes and then critical values of complex energy landscapes couldprove very fruitful. As an example, we propose a similarity measure in the chemical space making use of thetopological properties of the energy landscapes.The polynomial associated to a molecular graph M is defined as P M ( t ) := (cid:88) p ∈ CP t λ ( p ) (4.1)where CP is the set of critical points of the PES and λ ( p ) is the index of the critical point p (and c ( p ) isthe critical value at p . Let P and P be two Morse polynomials associated to the molecules M and M ,respectively. We define the potential similarity measure S : Chem × Chem → R in the space of moleculargraphs as follows S ( M , M ) := (cid:90) |P M ( t ) − P M ( t ) | dt (4.2)21 a) (b) (c) Figure 17: The Morse-Smale complex for the spherical component of cyclooctane, computed using the ttk software. The first figure (a) shows the potential energy function. The second figure (b) shows the Morse-Smale complex without topological simplification and after simplification (c). The red points are the maxima,the white points are saddle points and the blue points are the energy minima.
We have developed a data-driven approach to understanding molecular conformational spaces and energylandscapes. We have used this method to demonstrate that conformational spaces of linear molecules matchchemical intuition, and are namely products of circles caused by torsional flexibility. Further, bond stretchingand bending do not change the homology groups of the conformational space as they lead to spaces relatedvia a retraction. By using this method on different commonly used representations of conformational spaces(namely vector and metric spaces), we demonstrated that it is only the metric space representation thatcan consistently recreate the expected conformational space. Furthermore, we have demonstrated that theconformational space analysis still holds when molecular symmetry is taken into account.Spaces of conformers of molecules play a fundamental role in Chemistry. Understanding these spacesmight lead us to understand the relationship between the structure and the activity of biomolecules, drugs andother important classes of molecules. For many years chemists have modelled molecules using combinatorialobjects, such as graphs. Associating both algebraic and geometric objects to molecules seems to providenew ways to study and to understand their chemical properties. In this paper we have shown that thereexists a rich variety of algebraic, geometric and topological tools that can be used to model moleculesand their conformational spaces. Symmetry groups of molecules are closely related to the topology of theconformational spaces. Methods developed in geometrical and topological data analysis seems to provide agood source of tools to analyse conformational spaces and functions defined on them such as (free) energylandscapes.
A Local PCA
Principal component analysis (PCA) is a mathematical method to transform a set of possible correlatedvariables into a set of linearly uncorrelated variables by means of orthogonal transformations. This methodis used for dimensionality reduction of large data sets. The local version of PCA can be used to determinelocal dimensions of spaces using samples of points. In local PCA, a neighborhood of a point in a data set istransformed to a new coordinate system using orthogonal transformations.22 a) (b) (c) (d)
Figure 18: The Morse-Smale complex for Alanine Dipeptide, computed using the ttk software. The figures(a) and (b) show the potential energy function from both sides of the torus. The image (c) shows theMorse-Smale complex, together with the Morse function after very minor topological simplification. Thered points are the maxima, the white points are saddle points and the blue points are the energy minima.Finally, image (d) shows the Morse-Smale complex after severe topological simplification, leaving only oneminimum.Let S = { A i } Ni =1 be a data set of points sampled from a topological space X ⊂ R d , with ≤ i ≤ n , d ≥ . Given a data point A i , we can reorder the data set A i , A i , . . . , A i N with the order given by (cid:107) A i − A i (cid:107) ≤ (cid:107) A i − A i (cid:107) ≤ · · · ≤ (cid:107) A i − A i N (cid:107) . The k nearest neighbours of A i are the first k -th datapoints of ordered set. Here we assume that k (cid:28) N . The result of local P CA on the set set K ( A i ) of k nearest neighbours of A i . We try to estimate the local dimension at the point A i for the singular valuesof the mean shifted local data matrix M ( A i ) = [ A i − µA i − µ · · · A i N − µ ] , where µ = k (cid:80) kj =1 A i j . Let s i ≥ s i ≥ · · · ≥ s ki be the singular values of M ( A i ) . If X is a manifold of dimension p , the local dimensionat any point A i ∈ X is well defined and indeed it is equal to p . In this case the local dimension of X at thepoint A i is estimated as the number of non-vanishing singular values. When working with large data setsembbeded in high dimensional euclidean spaces it is common to estimate the local dimension as the numberof singular values with the highest percentage of the variability of the data. For that it is needed to choosea threshold, that is, a number α between 0 and 1, and the dimension is estimated as the smallest integer ld such that (cid:80) ldj =1 s j (cid:80) mj =1 s j > α where m is the minimum m = min { k, d } . B Principal bundles and orbifolds
We give a brief introduction to fibre bundles and orbifolds. Standard references for the theory of fibre bundlesand orbifolds are [19] and [1]. A (left) action of a topological group G on a topological space X is a map G × X → X , ( g, x ) (cid:55)→ gx such that x = x and g ( hx ) = ( gh ) x for all x ∈ X and for all g, h ∈ G . Given x ∈ X the stabilizer of x is the group of elements g ∈ G such that gx = x . The action of G on X is calledfree if for all x ∈ X , G x = { } ; the action is almost free if the groups G x are finite. The orbit of x is theset O x = { gx ∈ X | g ∈ G } . An action is effective if gx = x for all x then g = 1 . The action of G inducesa partition of X into disjoint subsets called orbits. The orbit space , denoted X/G , is the set of all orbits of X . The orbit space is endowed with the quotient topology induced by the quotient map X → X/G .Let B , E be topological spaces. A bundle is a triple ( B, E, π ) , where π : E → B is a map. The space B is the base space , the space E is the total space , and the map π is the projection of the bundle. For each23 a) (b) (c) (d) Figure 19: The analysis of the free energy surface for alanine dipeptide, computed using the ttk software.The first two figures (a) and (b) are showing the energy function from both sides of the surface of the torus,while (c) and (d) are showing the Morse-Smale complex, together with the energy function, before and aftertopological simplification. (a) (b) (c)
Figure 20: The analysis of the free energy surface for pentane, computed using the ttk software. The firsttwo figures (a) and (b) are showing the energy function from both sides of the surface of the torus, while (c)is showing the simplified Morse-Smale complex, together with the energy function.24 a) (b) (c)
Figure 21: The analysis of the energy landscape for fluoropentane, computed using the ttk software. Thefirst two figures (a) and (b) are showing the energy function from both sides of the surface of the torus, while(c) is showing the simplified Morse-Smale complex, together with the energy function. b ∈ B , the space π − ( b ) is called the fibre of the bundle over b . A cross section of a bundle ( E, B, π ) is amap s : B → E such that p ◦ s = 1 B .A principal G -bundle , denoted is a bundle ( B, E, π ) and an action E × G → E , ( x, g ) (cid:55)→ xg , such thatthe following hold:(1) the map E × G → E × E is given by ( x, g ) (cid:55)→ ( x, xg ) x ∈ E, g ∈ G is a homeomorphism onto its image;(2) B = E/G and the projection p is the quotient map;(3) for all b ∈ B there exists an open neighborhood V together with a homeomorphism φ : V × G → π − ( V ) such that the diagram V × G φ (cid:47) (cid:47) p (cid:33) (cid:33) π − ( V ) π (cid:124) (cid:124) V commutes, and for all x ∈ π − ( V ) and g ∈ G , φ − ( xg ) = φ − ( x ) g , where the action on V × G is givenby ( x, g ) g (cid:48) = ( x, gg (cid:48) ) .We say that the sequence G → E π −→ B is a principal G -bundle over B .Let X be a topological space. An n -dimensional orbifold chart on X is a tuple ( ˜ U , G, ϕ ) , where ˜ U ⊂ R n Z .Figure 23: Persistence of the superlevel sets of the MMFF94 energy function defined on the Klein bottlecomponent. Calculated with coefficients in Z .is open, G is a finite group of smooth automorphisms of ˜ U and ϕ : ˜ U → X is a map such that the diagram ˜ U ϕ (cid:47) (cid:47) q (cid:15) (cid:15) X ˜ U /G, ˜ ϕ (cid:61) (cid:61) (B.1)commutes, where q is the quotient map and ˜ ϕ is a homeomorphism onto its image. Two orbifold charts ( ˜ U , G , ϕ ) and ( ˜ U , G , ϕ ) on X are compatible if for every x ∈ U ∩ U ⊂ X there exists a neighbourhood W of x and an orbifold chart ( ˜ W , H, ψ ) , with ψ ( ˜ W ) = W , such that there are smooth embeddings λ i : ˜ W → U i and ψ = ϕ i ◦ λ i , for i = 1 , . An orbifold atlas on X is a collection U = { U α , G α , p α } α ∈ I of compatible n -dimensional charts that cover X . An n -dimensional orbifold X is a paracompact Hausdorff space X withan orbifold atlas of n -dimensional charts on X . We additionally will require for each chart ( ˜ U , G, ϕ ) , theaction of G on ˜ U to be effective. Since every smooth action is locally smooth, any orbifold has an atlas ofthe form ( R n , G, ϕ ) where G ⊂ O ( n ) acts on R n via orthogonal representation. CITE ADEM Given a point26 ∈ X and a local chart ( ˜ U , G, ϕ ) around x , the local group at x = ϕ ( y ) , denoted G x , is the stabiliser of y .The set of singularities Σ( X ) or singular locus of an orbifold X = ( X, U ) is defined by Σ( X ) = { x ∈ X | G x (cid:54) = { }} . The set Σ( X ) may have several connected components and if Σ X = ∅ then X is a manifold. Orbifoldsarise in a natural way via actions of groups of transformations on manifolds. A quotient orbifold X is theone obtained as the quotient compact Lie group acting smoothly, effectively and almost freely on a smoothmanifold M . C Persistent homology
The persistence modules most commonly used in topological data analysis arise from filtered simplicialcomplexes, whose combinatorial nature is very suitable for computations.A simplicial complex K with vertex set S is a family of nonempty, finite subsets of S . Subsets of S of p + 1 elements are called p -simplices. A p -simplex is represented as a list of its vertices [ v , · · · , v p ] . In asimplicial complex K , one requires that all elements v of S form -simplices [ v ] in K , and if σ ∈ K and ∅ (cid:54) = τ ⊂ σ , then τ ∈ K . We usually consider the case when S is finite. A simplicial complex K has theassociated space | K | , called the geometric realisation, which can be regarded as a triangulated polyhedronin an appropriate Euclidean space.We need a metric space to make use of persistent homology. Let S be the set of points sampled from theconformational space. For any two points z, w ∈ S , the distance d S ( z, w ) is given by the Euclidean distancebetween z and w .In general, to a metric space ( X, d ) equipped with a function f : X → R one can assign a persistencemodule as follows: First, we define the sub-level set of X with respect to α ∈ R by X α := { x ∈ X | f ( x ) ≤ α } . If α < α we have the inclusion X α ⊆ X α .For p = 1 , , · · · , the p th homology gives information about the p -dimensional holes : For p = 0 , this refersto connected components, for p = 1 , it refers to loops, for p = 2 , it is cavities or voids, etc. The number α is called a p -critical value of f if the number of p -dimensional holes of X α − (cid:15) and X α + (cid:15) changes for all small (cid:15) > . For p = 0 , by -dimensional holes we mean connected components, for p = 1 , we mean standardholes, for p = 2 we are talking about voids or cavities, etc.For the set S of points sampled from a conformational space this works as follows. We define ∆ S to bethe simplex with the elements of S as its vertices. Then we define a function f : ∆ S → R , which assigns toeach point s ∈ S the value , and to each higher simplex in ∆ S the maximal pairwise distance between thevertices it contains.We take all p -critical values α < α < · · · < α n . Then the sub-level sets connected by natural inclusionmaps give rise to a filtration: X α ⊆ X α ⊆ · · · ⊆ X α n = X. The zeroth persistence diagram Dg ( f ) captures the connected components that were born or died passingthrough a critical point. It consists of a set of points in the plane { ( a, b ) ∈ R | a < b } . Each point can occurmore than once. The coordinates a and b of a point indicate the birth and death times of the connectedcomponents. The multiplicity of the point indicates the number of connected components that were born attime a and died at time b . The first persistence diagram Dg ( f ) does the same for loops instead of connectedcomponents. Similarly for higher dimensional persistence diagrams.27 Morse theory
Morse theory addresses the relationship between the properties of a manifold and the properties of its real-valued functions. In particular, it aims to relate the topological properties of the n -dimensional manifold M with analytical properties of smooth functions f : M → R .For an in-depth discussion of Morse theory, we refer the reader to, e.g. [27]. Here we only recall theconcepts that we are interested in when generalising to the discrete setting.Morse theory captures the relationship between a function f and the manifold M by relying on thegradient ∇ x f ( x ) = df /dx ( x ) and its flow. The gradient defines a preferential direction at every point (thedirection of steepest ascent) except where it vanishes (i.e. where ∇ x f = 0 ). Those particular points arecalled critical points and can be classified according to the sign of the Hessian matrix, the n × n matrix ofthe second derivatives H f ( x ) = d f /dx i dx j ( x ) .Let M be a (smooth) manifold, and let f : M → R be a smooth function. A point m ∈ M is called acritical point of f if ∇ f ( m ) = 0 . The index of m is defined to be the number of negative eigenvalues of theHessian matrix H f ( m ) .Note that according to this definition, the index of a critical point is defined by the sign of the eigenvaluesof the Hessian, which must therefore be non-null. This condition is essential to Morse theory: a function f which obeys Morse theory must necessarily satisfy this constraint. Conversely, such functions are calledMorse functions:Let m be a critical point of f : M → R . We say that m is a non-degenerate critical point if and only ifthe nullity of H f ( m ) , i.e. the dimension of the -eigenspace of H f ( m ) , is zero. We say that f is a Morsefunction if and only if every critical point of f is non-degenerate.On an n -dimensional manifold M , the index of any local maximum point is n , and the index of any localminimum point is .At the location of any non-critical point, one can define specific lines, the integral lines, by following thepreferred direction of the gradient flow.Integral lines represent the flow along the gradient between critical points. They have the followingproperties: • Two integral lines are either disjoint or the same. • Integral lines cover all of M . • The origin and destination of an integral line are critical points of f (except at boundary). • In a gradient vector field, integral lines are monotonic, i.e. the origin is distinct from the destination.Let m be a critical point of f : M → R . The ascending manifold of m is the set of points belonging tointegral lines whose origin is m . The descending manifold of m is the set of points belonging to integral lineswhose destination is m . Note that ascending and descending manifolds are also referred to as unstable andstable manifolds.For a Morse function f : M → R , the complex of the descending manifold of f is called the Morsecomplex.A Morse function f is Morse-Smale if the ascending and descending manifolds intersect only transversally.Intuitively, an intersection of two manifolds is transversal when they are not ’parallel’ at their intersection.A pair of critical points that are the origin and destination of an integral line in the Morse-Smale functioncannot have the same index. Furthermore, the index of the critical point at the origin is less than the indexof the critical point at the destination.Given a Morse-Smale function f , the Morse-Smale complex of f is the complex formed by the intersectionof the Morse complex of f and the Morse complex of − f .For a -dimensional manifold M , any Morse-Smale function will give rise to a Morse-Smale complex withthe following combinatorial properties: • The nodes of the complex are exactly the critical points of the Morse function.28
The saddle points have exactly four arcs incident to them. • All regions are quadrangles. • The boundary of a region alternates between saddle points and extrema.
D.1 Discrete Morse Theory
Discrete Morse Theory was defined by Robin Forman in 1995 [12, 13]. It is, as the name suggests, acombinatorial analogue of classical Morse Theory, providing combinatorial equivalents of several core conceptsof classical Morse theory, such as Morse functions, gradient vector fields, critical points, and a cancellationtheorem for the elimination of pairs of critical points from a vector field. The discrete theory maintains theintuition of its classical counterpart while enabling simple and explicit constructions that are considerablymore involved in the smooth setting.Let M be any finite simplicial complex, K the set of simplices of M , and K p the simplices of dimension p . A discrete Morse function on M will actually be a function on K . That is, we assign a single real numberto each simplex in M . Write σ ( p ) if σ has dimension p , and τ > σ if σ lies in the boundary of τ . We say afunction f : K → R is a discrete Morse function if it satisfies the following two properties:1. For any τ ∈ K , the number of σ < τ for which f ( τ ) ≤ f ( σ ) is at most one.2. For any σ ∈ K , the number of σ < τ for which f ( τ ) ≤ f ( σ ) is at most one.We say σ ( p ) is critical (with index p ) if1. there is no τ < σ ( p ) with f ( τ ) ≥ f ( σ ) ,2. there is no τ > σ ( p ) with f ( τ ) ≤ f ( σ ) .So in the condition of a Morse function above, the conditions can be broken for one simplex each, butif the properties are satisfied for all face simplices or all coface simplices, we have a critical simplex. Thisdefinition provides a discrete analogue of the smooth notion of a critical point of index p . Note that in thediscrete case, we have a simplex of dimension p rather than a point.A discrete Morse function f over K defines a discrete gradient vector field by coupling simplices ingradient arrows (also called gradient pairs):1. if a simplex σ p has exactly one lower valued coface σ p +1 , then [ σ p , σ p +1 ] form a gradient arrow,2. if a simplex σ p has exactly one higher valued facet σ p − , then [ σ p − , σ p ] form a gradient arrow,3. if a simplex σ p is critical, it does not belong to a gradient arrow.So a discrete vector field V on a simplicial complex K is a set of pairs of simplices ( σ, τ ) , with σ a facetof τ , such that each simplex of K is contained in at most one pair of V . A simplex σ ∈ K is critical withrespect to V if σ is not contained in any pair of V . The dimension of a critical simplex is also called itsindex.A pair ( σ, τ ) in a discrete vector field V can be visualised by an arrow from σ to τ . We consider animportant subclass of vector fields in which the arrows do not form closed paths. This can be made preciseusing the concept of V -paths.Given a discrete vector field V , a V -path is a sequence τ , σ , τ , · · · , σ l , τ l , σ l +1 , where ( σ i , τ i ) ∈ V forevery i = 1 , · · · , l , and each σ i +1 is a facet of τ i for each i = 0 , · · · , l . If l = 0 , the V -path is trivial. This V -path is cyclic if l > and ( σ l +1 , τ ) ∈ V ; otherwise, it is acyclic, in which case we call this V -path agradient path. We say that a gradient path is a vertex-edge gradient path if dimension ( σ i ) = 0 , implyingthat dimension ( τ i ) = 1 . Similarly, it is an edge-triangle gradient path if dimension ( σ i ) = 1 .A discrete vector field V becomes a discrete gradient vector field if there are no cyclic V -paths inducedby V . 29ntuitively, a gradient path τ , σ , τ , · · · , σ l , τ l , σ l +1 is the analogue of an integral line in the smoothsetting. Different from the smooth setting, a maximal gradient path may not start or end at critical simplices.However, those that do (i.e, when τ and σ k +1 are critical simplices) are analogous to maximal integral linesin the smooth setting which start and end at critical points, and for convenience one can think of critical k -simplices in the discrete Morse setting as index- k critical points in the smooth setting.For example, for a function on a two-dimensional simplicial complex, critical -, - and -simplices in thediscrete Morse setting correspond to minima, saddles and maxima in the smooth setting, respectively.For a critical edge e , we define its stable (or descending) manifold to be the union of edge-triangle gradientpaths that ends at e . Its unstable (or ascending) manifold is defined to be the union of vertex-edge gradientpaths that begins with e . D.2 Morse theory and Persistent homology for topological simplification
From the perspective of persistence for a Morse real-valued function, the infinite persistence barcodes de-termine the Betti numbers of the underlying manifold and the finite barcodes give information about themultitude of visible trajectories between the critical points.The ends of any barcode for f are among the critical values of f , the values t for which the homology ofthe level of f at t differs from the homology of levels at values in an arbitrary neighborhood of t .Given a scalar function f , such as the energy function, defined on a topological space, we can talk abouttopology-preserving simplification of the scalar function [2]. By this we mean that topological features withpersistence greater than a specified level are guaranteed to remain.In the discrete case, our topological space gets replaced by a simplicial complex, while the scalar functionis required to be a piecewise linear function defined on the simplices of the simplicial complex, a discreteMorse function as defined above. The topological features are then the pairings of critical cells, whichcorrespond to the persistent bars in the persistence barcode.There exist special configurations where two critical points are linked by two or more different paths.Those particular configurations cannot be cancelled, as applying a discrete gradient reversal would result inthe formation of a V -path that loops onto itself. However, for cancellable pairs, we specify a threshold, andcancel all critical pairs that correspond to a persistence bar below that threshold.This is, in essence, a way of denoising the scalar function, while having a tight control of the size of noise. Acknowledgements
This research was supported by the EPSRC grants EP/N014189/1 Joining the dots: from data to insight,and EP/L015722/1 Centre for Doctoral Training in Theory and Modelling in Chemical Sciences. LS thanksKhaled Abdel-Maksoud for assistance in running metadynamics simulations.
References [1] A. Adem, J. Leida, and Yongbin Y. Ruan,
Orbifolds and stringy topology , vol. 171, Cambridge UniversityPress, 2007.[2] U. Bauer, C. Lange, and M. Wardetzky,
Optimal topological simplification of discrete functions onsurfaces , Discrete and Computational Geometry (2012), no. 2, 347–377.[3] K. Beketayev, G.H. Weber, M. Haranczyk, P.-T. Bremer, M. Hlawitschka, and B. Hamann, Topology-based Visualization of Transformation Pathways in Complex Chemical Systems , Computer GraphicsForum (2011), no. 3, 663–672.[4] W. M. Brown, S. Martin, S. N. Pollock, E. A. Coutsias, and J-P. Watson, Algorithmic dimensionalityreduction for molecular structure analysis , The Journal of Chemical Physics (2008), no. 10, 234115–174109. 305] P. R. Bunker,
Molecular symmetry and spectroscopy , New York: Academic Press, 1979.[6] R. J. G. B. Campello, D. Moulavi, and J. Sander,
Density-Based Clustering Based on HierarchicalDensity Estimates , Lecture Notes in Computer Science, Springer, Berlin, Heidelberg, 2013, pp. 160–172.[7] O. Delgado-Friedrichs, V. Robins, and A. Sheppard,
Skeletonization and partitioning of digital imagesusing discrete morse theory , IEEE Transactions on Pattern Analysis and Machine Intelligence (2015),654–666.[8] T. Dey, J. Wang, and Y. Wang, Graph reconstruction by discrete morse theory , (2018).[9] M. Duan, M. Li, L. Han, and S. Huo,
Euclidean sections of protein conformation space and theirimplications in dimensionality reduction. , Proteins (2014), no. 10, 2585–96.[10] J-P. Ebejer, G. M. Morris, and C. M. Deane, Freely Available Conformer Generation Methods: HowGood Are They? , Journal of Chemical Information and Modeling (2012), 1146–1158.[11] H. Edelsbrunner, D. Letscher, and A. Zomorodian, Topological persistence and simplification , Discreteand Computational Geometry (2000), 511–533.[12] R. Forman, A discrete morse theory for cell complexes , Geometry, Topology and Physics for Raoul Bott(1995).[13] ,
A user’s guide to discrete Morse theory , Séminaire Lotharingien de Combinatoire (2002).[14] K. D. Gibson and H. A. Scheraga,
Energy minimization of rigid-geometry polypeptides with exactlyclosed disulfide loops , Journal of Computational Chemistry (1997), no. 3, 403–415.[15] A. Guichardet, On rotation and vibration motions of molecules , (1984), no. 3, 329–342.[16] A. Gyulassy, M. A. Duchaineau, and V. Pascucci V. Natarajan, E. Bringa, A. Higginbotham, andB. Hamann, Topologically clean distance fields , Visualization and Computer Graphics, IEEE Transac-tions on (2007), 1432 – 1439.[17] T. A. Halgren, Merck molecular force field. I. Basis, form, scope, parameterization, and performance ofMMFF94 , Journal of Computational Chemistry (1996), no. 5-6, 490–519.[18] T. X. Hoang, A. Trovato, F. Seno, J. R. Banavar, and A. Maritan, Geometry and symmetry presculpt thefree-energy landscape of proteins , Proceedings of the National Academy of Sciences (2004), no. 21,7960–7964.[19] D. Husemoller,
Fibre bundles , Graduate Texts in Mathematics, vol. 20, Springer, 1966.[20] B. Keller, M. Lesnick, and T. Willke,
Phos: Persistent homology for virtual screening , (2018).[21] A. Laio and M. Parrinello,
Escaping free-energy minima. , Proceedings of the National Academy ofSciences of the United States of America (2002), no. 20, 12562–6.[22] G. Landrum, RDKit , 2018.[23] J. Li, T. Ehlers, J. Sutter, S. Varma-O’Brien, and J. Kirchmair,
Caesar: a new conformer generationalgorithm based on recursive buildup and local rotational symmetry consideration , Journal of chemicalinformation and modeling (2007), no. 5, 1923–1932.[24] H. C. Longuet-Higgins, The symmetry groups of non-rigid molecules , Molecular Physics (1963), no. 5,445–460. 3125] S. Martin, A. Thompson, E. A. Coutsias, and J-P. Watson, Topology of cyclo-octane energy landscape ,The Journal of Chemical Physics (2010), 234115.[26] S. Martin and J-P. Watson,
Non-manifold surface reconstruction from high-dimensional point clouddata , Computational Geometry (2011), no. 8, 427–441.[27] J. Milnor, Morse Theory , Princeton University Press (1965).[28] R. Pinal,
Effect of molecular symmetry on melting temperature and solubility , Organic & biomolecularchemistry (2004), no. 18, 2692–2699.[29] S. Riniker and G. A. Landrum, Better Informed Distance Geometry: Using What We Know to ImproveConformation Generation , Journal of Chemical Information and Modeling (2015), no. 12, 2562–2574.[30] A. Sadeghi, S. A. Ghasemi, B. Schaefer, S. Mohr, M. A. Lill, and S. Goedecker, Metrics for measuringdistances in configuration spaces , The Journal of Chemical Physics (2013), no. 18, 184118.[31] A. Singer and H. Wu,
Orientability and diffusion maps , Applied and computational harmonic analysis (2011), no. 1, 44–58.[32] T. Sousbie, The persistent cosmic web and its filamentary structure – i. theory and implementation ,Monthly Notices of the Royal Astronomical Society (2011), 350 – 383.[33] B. Steipe,
A revised proof of the metric properties of optimally superimposed vector sets , Acta Crystal-lographica A (2002), 506.[34] P. Tao and L. Lai, Protein ligand docking based on empirical method for binding affinity estimation ,Journal of Computer-Aided Molecular Design (2001), no. 5, 429–446.[35] J. Tierny, G. Favelier, J. A. Levine, C. Gueunet, and M. Michaux, The Topology ToolKit , IEEE Trans-actions on Visualization and Computer Graphics (Proc. of IEEE VIS) (2017).[36] U. Bauer,
Ripser .[37] J. Wei,
Molecular symmetry, rotational entropy, and elevated melting points , Industrial & engineeringchemistry research (1999), no. 12, 5019–5027.[38] A. Zomorodian and G. Carlsson, Computing persistent homology , Discrete and Computational Geometry33