Earthmover-based manifold learning for analyzing molecular conformation spaces
EEARTHMOVER-BASED MANIFOLD LEARNING FOR ANALYZING MOLECULARCONFORMATION SPACES
Nathan Zelesko ♠ Amit Moscovich ♦ Joe Kileel ♦ Amit Singer ♦ , ♣♠ Department of Mathematics, Brown University ♦ Program in Applied and Computational Mathematics, Princeton University ♣ Department of Mathematics, Princeton University
ABSTRACT
In this paper, we propose a novel approach for manifoldlearning that combines the Earthmover’s distance (EMD)with the diffusion maps method for dimensionality reduction.We demonstrate the potential benefits of this approach forlearning shape spaces of proteins and other flexible macro-molecules using a simulated dataset of 3-D density mapsthat mimic the non-uniform rotary motion of ATP synthase.Our results show that EMD-based diffusion maps require farfewer samples to recover the intrinsic geometry than the stan-dard diffusion maps algorithm that is based on the Euclideandistance. To reduce the computational burden of calculatingthe EMD for all volume pairs, we employ a wavelet-basedapproximation to the EMD which reduces the computation ofthe pairwise EMDs to a computation of pairwise weighted- (cid:96) distances between wavelet coefficient vectors. Index Terms — Shape space, dimensionality reduction,Wasserstein metric, diffusion maps, Laplacian eigenmaps,cryo-electron microscopy
1. INTRODUCTION
Proteins and other macromolecules are elastic structures thatmay deform in various ways. Since the spatial conformationof an organic molecule is known to play a key role in itsbiological function, the complete description of a moleculemust include more than just a single static structure (as is tra-ditionally produced by X-ray crystallography). Ideally, wewould like to map the entire space of molecular conforma-tions. However, understanding the topology and geometryof these conformation spaces remains one of the grand chal-lenges in the field of structural biology [1].One promising approach is to employ cryo-electron mi-croscopy (cryo-EM) as a tool for structure determination inthe presence of conformational heterogeneity [2]. In cryo-EM, multiple images of a particular macromolecule are takenby a transmission electron microscope and then processed us-ing specialized algorithms. Traditionally, these algorithmsconstruct an estimate of the mean molecular volume, in the (a) (b) (c)
Fig. 1 . EMD vs. Euclidean distance for translational motion.
Euclidean (or (cid:96) ) distance is only meaningful for measuringsmall displacements. e.g., the (cid:96) distance between half-disks(c) and (a) is the same as between (c) and (b). By contrast, forany translational motion, the EMD is its magnitude.form of a 3-D electrostatic density map. In particular, thisprocess averages out any variability in the spatial conforma-tions of the molecules in the sample. Recent works have ap-plied techniques from the field of manifold learning to cryo-EM data sets, obtaining a low-dimensional representation ofthe molecular conformation space [3, 4]. Specifically, theseworks build affinity graphs based on the Euclidean distancesbetween molecular volumes (or projection images) and thencompute diffusion map embeddings [5, 6].However, the Euclidean distance is suboptimal for captur-ing the distance between geometric conformations. Consider,for example, two conformations of a molecule that has only asingle moving part. If the two conformations are distant, thesupport of the moving part in the two volumes may not inter-sect, rendering the Euclidean distance independent of the con-formational distance. See Fig. 1. In such cases, in order to ap-ply manifold learning based on a Euclidean metric, one needa dense cover of the conformation space by the molecules inthe sample. Since the number of points in such a cover scalesexponentially in the dimension, it may be infeasible to applythese methods, even using the largest existing experimentaldatasets, which consist of about n ≈ samples.In this paper, we propose to use the Earthmover’s distance (EMD), also known as the
Wasserstein metric , instead of thecommonly used Euclidean distance as input to manifold em-bedding algorithms. EMD has an intuitive geometric mean-ing: it measures the minimal amount of “work” needed totransform one pile of mass into another pile of equal mass, a r X i v : . [ q - b i o . B M ] O c t here “work” is defined as the amount of mass moved timesthe distance by which it is moved. In particular, EMD pro-vides a distance metric that is meaningful even between spa-tial conformations that are far from each other. Following thediscussion above, this property should reduce the number ofsamples needed to learn the intrinsic manifold.Methods for computing the EMD, based on off-the-shelflinear programming solvers, are expensive when the numberof voxels is large. Therefore we used a fast approximation tothe EMD, based on a wavelet representation [7].To test our proposal, we compared the standard (cid:96) -baseddiffusion maps to EMD-based diffusion maps on a syntheticdataset mimicking the motion of ATP synthase (Fig. 2). Thisdataset samples the underlying manifold in a non-uniformmanner since ATP synthase has three dominant conforma-tions that are 120 ◦ apart [8]. The approximate EMD-basedapproach yields a marked improvement in the number of sam-ples required for learning the conformational manifold, whilestill offering a computationally feasible algorithm.
2. METHODS
In this section, we review the basic techniques that under-lie Earthmover-based manifold learning. Our current focus ison learning shape spaces of 3-D volumes, but the same tech-niques may also be applied to analyze other types of datasets,such as 2-D image sets, 1-D histograms, etc. To start, let X = { x , . . . , x n } be a set of 3-D voxel arrays in R L . Weassume that X obeys the manifold hypothesis [2, 9, 10], i.e., x , . . . , x n form a (noisy) sample of a low-dimensional man-ifold M ⊂ R L . Our task is to reorganize the data to betterreflect the intrinsic geometry of M .For Riemannian manifolds, eigenfunctions of the Laplace-Beltrami operator provide an intrinsic coordinate system[11, 12]. Accordingly, several popular methods for dimen-sionality reduction and data representation methods are basedon mapping input points using empirical estimates of Lapla-cian eigenfunctions [6, 5]. Under the manifold hypothesis,these estimates converge to eigenfunctions of the Laplace-Beltrami operator, or more generally to eigenfunctions of aweighted Laplacian, depending on the construction [13].We now describe the diffusion maps method [6]. Let w : R L × R L → R denote a symmetric non-negative functionthat gives an affinity score for each pair of volumes. Onecommon way of constructing affinities is to take a distancemetric d : X × X → R and apply a Gaussian kernel with asuitably chosen width σ to form the affinity matrix W ∈ R n × n W ij = w ( x i , x j ) = exp (cid:0) − d ( x i , x j ) / (2 σ ) (cid:1) . (1)The degree matrix D ∈ R n × n is defined to be the diagonalmatrix that satisfies D ii = (cid:80) nj =1 W ij . We use the Coifman-Lafon normalized graph Laplacian [14], which convergesto the Laplace-Beltrami operator, regardless of the sampling
Fig. 2 . ATP synthase. (left) F and axle subunits. They ro-tate together in the presence of hydrogen ions, forming a tinyelectric motor; (middle) the F subunit (in cyan) envelops theaxle. As the axle rotates, this subunit assembles ATP; (right)sample slice of the rotated F and axle subunits with the ad-ditive Gaussian noise.density. To compute this, one first performs a two-sidednormalization of the affinity matrix, (cid:102) W = D − W D − andthen computes the random-walk Laplacian, L = (cid:101) D − (cid:102) W , where (cid:101) D is the degree matrix for (cid:102) W . The random-walkLaplacian is similar to a positive semi-definite symmetricmatrix and hence its eigenvectors are real and its eigenvaluesare non-negative. The all-ones vector is an eigenvector of L with eigenvalue zero [15]. Let φ , φ , . . . , φ n − ∈ R n beeigenvectors of L with corresponding eigenvalues λ ≤ λ ≤ . . . ≤ λ n − . We think of the eigenvectors φ (cid:96) as real-valued functions on X , by identifying φ (cid:96) ( x i ) = ( φ (cid:96) ) i . The k -dimensional diffusion map Ψ ( k ) t : X → R k is defined by: x i (cid:55)→ (cid:0) λ t φ ( x i ) , . . . , λ tk φ k ( x i ) (cid:1) . The mapping Ψ ( k ) t gives a system of k coordinates on X ,which captures the intrinsic geometry of M . In our simu-lations, we used t = 0 , in which case diffusion maps coincidewith Laplacian eigenmaps [5].The diffusion map depends on the choice of affinity. Thetypical choice is a Gaussian kernel as defined in Eq. (1) thatis based on a Euclidean (or (cid:96) ) distance function, d (cid:96) ( x i , x j ) = (cid:107) x i − x j (cid:107) . We propose instead to base the Gaussian kernel of Eq. (1) onthe
Earthmover’s distance (EMD) , also known as the
Wasser-stein metric [16]. EMD is popular in various applications,e.g., image retrieval [17], however, to the best of our knowl-edge, it has never been used to define affinities for manifoldearning algorithms. To define this distance, consider two 3-Ddensity maps x i , x j ∈ R L that are non-negative and normal-ized to unit mass. These densities define probability measureson the set of voxels, [ L ] , where [ L ] = { , . . . , L } . We set: d EMD ( x i , x j ) = min π ∈ Π( x i , x j ) (cid:88) u ∈ [ L ] (cid:88) v ∈ [ L ] π ( u , v ) (cid:107) u − v (cid:107) , where Π( x i , x j ) is the set of joint probability measures on [ L ] × [ L ] with marginals x i and x j , respectively.Mathematically, EMD amounts to a linear program in O ( L ) variables subject to O ( L ) constraints, i.e., a signif-icant computation. However, in the wavelet domain [18],EMD enjoys a fast (weighted- (cid:96) ) wavelet approximation [7],which we refer to as WEMD: d WEMD ( x i , x j ) = (cid:88) λ − s/ |W x i ( λ ) − W x j ( λ ) | . (2)Here, W x denotes the of x , and theindex λ contains the shifts ( m , m , m ) ∈ Z and the scale s ∈ Z ≥ . More explicitly, W decomposes x = x [ u , u , u ] with respect to an orthonormal basis of functions, s/ f (2 s u − m ) g (2 s u − m ) h (2 s u − m ) , for varying s ∈ Z ≥ , varying ( m , m , m ) ∈ Z , and ( f, g, h ) ranging over { ψ, ω } \ { ( ω, ω, ω ) } where ψ, ω arecertain 1-D functions called the mother and father wavelet [18]. Formula (2) approximates EMD in the sense that d EMD and d WEMD are strongly equivalent metrics, i.e., there existconstants C ≥ c > such that for all x , y ∈ R L , we have: c · d WEMD ( x , y ) ≤ d EMD ( x , y ) ≤ C · d WEMD ( x , y ) . Moreover, there are known bounds on the ratio
C/c , depend-ing on the type of wavelet used. We have chosen the Coiflets 3wavelet since it gives a small ratio [7]. Wavelet transformsare computed in linear time, thus the same holds for the EMDapproximation. We implemented the approximation (2), us-ing the PyWavelets package [19]. We computed the wavelettransform up to scale s = 5 for accurate truncation of Eq. (2).This overparameterizes the volumes by a factor of ≈ .
3. RESULTS
To test our methods, we generated two synthetic datasets of3-D density maps that are simplified models of the conforma-tion space of ATP synthase [8]. This enzyme is a molecularstepper motor with a central asymmetric axle that rotates insteps of 120 ◦ relative to the F subunit, with short transientmotion in-between the three dominant conformations. Here,the intrinsic geometry is a circle, with a sampling density con-centrated around three equispaced angles. We simulated thismotion by generating 3-D density maps in which the F sub-unit is held in place while the F and axle subunits are rotated -180 -150 -120 -90 -60 -30 0 30 60 90 120 150 180 WEMDEuclidean
Fig. 3 . Euclidean distance vs. WEMD as functions of the an-gle between two angles of the ATP synthase rotor (see Fig. 2).The (cid:96) distances are scaled to be comparable to the WEMD.The WEMD is monotone in the magnitude of the angular dif-ference for almost the entire range whereas the Euclidean dis-tance exhibits this behavior only up to about ± ◦ . n Wavelet transform WEMD distances (cid:96) distances25 60 0.9 0.850 121 3.4 1.2100 243 13 2.4200 481 50 4.8400 965 221 9.4800 1932 844 20.6 Fig. 4 . Running times [sec] for computing the fast wavelettransform, all pairwise wavelet Earthmover approximationsand all pairwise (cid:96) distances.together by a random angle. The angles were drawn i.i.d. ac-cording to the following mixture model: U [0 , N (0 ,
1) + 15 N (120 ,
1) + 15 N (240 , , where U and N denote uniform and Gaussian distributions,respectively. To form our datasets, we downloaded en-try 1QO1 [20] from the Protein Data Bank [21], produced3-D density maps at a 6 ˚A resolution with array dimensions × × using the molmap command in UCSF Chimera[22], and then took random rotations of the F and axle sub-units. From this, we generated a clean dataset and a noisydataset. For the latter, i.i.d. Gaussian noise was added withmean zero and standard deviation equal to one-tenth of themaximum voxel value.We first tested the plausibility of our proposal by com-paring the EMD approximation to the Euclidean distancefor a range of angular differences using the noiseless dataset(Fig. 3). We then performed 2-dimensional diffusion mapsfor various sample sizes, using both the Euclidean distanceoiselessEuclideanNoiselessWEMDNoisyEuclideanNoisyWEMD n
25 50 100 200 400 800
Fig. 5 . Main results.
Euclidean vs. EMD-based diffusion mappings on the clean and noisy ATP synthase datasets for samplesizes n = 25 , , , , , . The Euclidean diffusion maps need more than samples to capture the intrinsicgeometry whereas WEMD manages to do so with merely n = 25 samples. The colors encode the (ground truth) angle.and the wavelet-based approximation to the EMD, as de-scribed in the previous section. The resulting embeddings areshown in Fig. 5. The value of the width parameter σ in theGaussian kernel (1) was hand-picked to yield the best results.We note that for the Euclidean diffusion maps, careful tuningof σ was required. However, this was not necessary for theEMD approximation, where a wide range of σ values gaveexcellent results. Running times (on an Intel Core i7) for thecomputation of EMD and Euclidean-based diffusion mapsare listed in Fig. 4.
4. CONCLUSION
In this paper, we proposed to use Earthmover-based affinitiesin the diffusion maps framework to analyze molecular con-formation spaces. We showed that this results in a markeddecrease in the number of samples needed to capture the in-trinsic conformation space of ATP synthase. The method iscomputationally tractable, thanks to a fast wavelet approxi-mation, and robust to noise. Our results show promise, par-ticularly for the analysis of cryo-EM datasets with continuousheterogeneity. More broadly, EMD-based manifold learningcould be applied to analyze the variability of other collec- tions of 3-D shapes [23], 2-D images [17], videos and othersignals, e.g., to better model animal motion [24]. Our workalso raises several interesting theoretical questions: in whichcases can one prove that EMD-based manifold learning has alower sample complexity than manifold learning based on theEuclidean distance? More ambitiously, are there reasonablegenerative models for variability where EMD is the optimaldistance metric?
5. REPRODUCIBILITY
Code for reproducing the results in this paper is available athttp://github.com/nathanzelesko/earthmover
6. ACKNOWLEDGMENTS
The authors thank Ariel Goldstein, William Leeb, NicholasMarshall and Stefan Steinerberger for interesting discussions.This work was supported in parts by AFOSR FA9550-17-1-0291, ARO W911NF-17-1-0512, the Simons Collaborationin Algorithms and Geometry, the Simons Investigator Award,the Moore Foundation Data-Driven Discovery InvestigatorAward and NSF BIGDATA Award IIS-1837992. . REFERENCES [1] J. Frank, “New opportunities created by single-particle cryo-EM: the mapping of conformationalspace,”
Biochemistry , vol. 57, no. 6, pp. 888, 2018.doi:10.1021/acs.biochem.8b00064[2] C.O.S. Sorzano et al., “Survey of the analysis of con-tinuous conformational variability of biological macro-molecules by electron microscopy,”
Acta Crystallogr.Sect. F Struct. Biol. Commun. , vol. 75, no. 1, pp. 19–32,2019. doi:10.1107/S2053230X18015108[3] P. Schwander, R. Fung, and A. Ourmazd, “Con-formations of macromolecules and their complexesfrom heterogeneous datasets,”
Philos. Trans. R. Soc.B Biol. Sci. , vol. 369, no. 1647, pp. 1–8, 2014.doi:10.1098/rstb.2013.0567[4] A. Moscovich, A. Halevi, J. And´en, and A. Singer,“Cryo-EM reconstruction of continuous heterogeneityby Laplacian spectral volumes,”
Inverse Probl., submit-ted, 2019.[5] M. Belkin and P. Niyogi, “Laplacian eigenmaps for di-mensionality reduction and data representation,”
Neu-ral Comput. , vol. 15, no. 6, pp. 1373–1396, 2003.doi:10.1162/089976603321780317[6] R.R. Coifman et al., “Geometric diffusions as a toolfor harmonic analysis and structure definition of data:diffusion maps,”
PNAS , vol. 102, no. 21, pp. 7426–7431,2005. doi:10.1073/pnas.0500334102[7] S. Shirdhonkar and D.W. Jacobs, “Approximate Earth-mover’s distance in linear time,” in
CVPR 2008 .doi:10.1109/CVPR.2008.4587662[8] M. Yoshida, E. Muneyuki, and T. Hisabori, “ATP syn-thase – a marvellous rotary engine of the cell,”
Nat.Rev. Mol. Cell Biol. , vol. 2, no. 9, pp. 669–677, 2001.doi:10.1038/35089509[9] A. Moscovich, A. Jaffe, and B. Nadler, “Minimax-optimal semi-supervised regression on unknown man-ifolds,” in
AISTATS 2017 . http://proceedings.mlr.press/v54/moscovich17a.html[10] A.B. Lee, K.S. Pedersen, and D. Mumford, “The nonlin-ear statistics of high-contrast patches in natural images,”
Int. J. Comput. Vis. , vol. 54, no. 1-3, pp. 83–103, 2003.doi:10.1023/A:1023705401078[11] P. B´erard, G. Besson, and S. Gallot, “EmbeddingRiemannian manifolds by their heat kernel,”
Geom.Funct. Anal. , vol. 4, no. 4, pp. 373–398, 1994.doi:10.1007/BF01896401 [12] P.W. Jones, M. Maggioni, and R. Schul, “Manifoldparametrizations by eigenfunctions of the Laplacian andheat kernels,”
PNAS , vol. 105, no. 6, pp. 1803–1808,2008. doi:10.1073/pnas.0710175104[13] D. Ting, L. Huang, and M. Jordan, “An analysis of theconvergence of graph Laplacians,” in
ICML 2010 .https://icml.cc/Conferences/2010/papers/554.pdf[14] R.R. Coifman and S. Lafon, “Diffusion maps,”
Appl.Comput. Harmon. Anal. , vol. 21, no. 1, pp. 5–30, 2006.doi:10.1016/j.acha.2006.04.006[15] U. von Luxburg, “A tutorial on spectral clustering,”
Stat. Comput. , vol. 17, no. 4, pp. 395–416, 2007.doi:10.1007/s11222-007-9033-z[16] C. Villani,
Optimal Transport: Old and New ,Springer Berlin Heidelberg, 2009. doi:10.1007/978-3-540-71050-9[17] Y. Rubner, C. Tomasi, and L.J. Guibas, “The Earth-mover’s distance as a metric for image retrieval,”
Int.J. Comput. Vis. , vol. 40, no. 2, pp. 99–121, 2000.doi:10.1023/A:1026543900054[18] S. Mallat,
A Wavelet Tour of Signal Processing , Else-vier, 3rd edition, 2009. doi:10.1016/B978-0-12-374370-1.X0001-8[19] G. Lee, R. Gommers, F. Waselewski, K. Wohlfahrt, andA. O’Leary, “PyWavelets: a Python package for waveletanalysis,”
J. Open Source Softw. , vol. 4, no. 36, pp.1237, 2019. doi:10.21105/joss.01237[20] D. Stock, A.G. Leslie, and J.E. Walker, “Molecu-lar architecture of the rotary motor in ATP synthase,”
Science , vol. 286, no. 5445, pp. 1700–1705, 1999.doi:10.1126/science.286.5445.1700[21] P.W. Rose et al., “The RCSB protein data bank: inte-grative view of protein, gene and 3D structural informa-tion,”
Nucleic Acids Res. , vol. 45, no. D1, pp. D271–D281, 2017. doi:10.1093/nar/gkw1000[22] E.F. Pettersen at al., “UCSF Chimera – a visualiza-tion system for exploratory research and analysis,”
J.Comput. Chem. , vol. 25, no. 13, pp. 1605–1612, 2004.doi:10.1002/jcc.20084[23] M. Ovsjanikov, W. Li, L.J. Guibas, and N.J. Mitra, “Ex-ploration of continuous variability in collections of 3Dshapes,”
ACM Trans. Graph. , vol. 30, no. 4, pp. 1–10,2011. doi:10.1145/2010324.1964928[24] D.L. Hu, J. Nirody, T. Scott, and M.J. Shel-ley, “The mechanics of slithering locomotion,”