Leveraging binding-site structure for drug discovery with point-cloud methods
Vincent Mallet, Carlos G. Oliver, Nicolas Moitessier, Jérôme Waldispühl
Vincent Mallet
Department of Computer Science, McGill University
[email protected]
Carlos G. Oliver
Department of Computer Science, McGill University
[email protected]
Nicolas Moitessier
Department of Chemistry, McGill University
[email protected]
Jérôme Waldispühl
Department of Computer Science, McGill University
[email protected]
Abstract
Computational drug discovery strategies can be broadly placed in two categories: ligand-based methods, which identify novel molecules by similarity with known ligands, and structure-based methods, which predict molecules with high affinity to a given 3D structure (e.g. a protein). However, ligand-based methods do not leverage information about the binding site, and structure-based approaches rely on the knowledge of a finite set of ligands binding the target. In this work, we introduce TarLig, a novel approach that aims to bridge the gap between ligand-based and structure-based approaches. We use the 3D structure of the binding site as input to a model which predicts the ligand preferences of the binding site. The resulting predictions could then offer promising seeds and constraints in the chemical space search, based on the binding site structure. TarLig outperforms standard models by introducing a data-alignment and augmentation technique. The recent popularity of volumetric 3D-CNN pipelines in structural bioinformatics suggests that this extra step could help a wide range of methods improve their results with minimal modifications.
In silico drug discovery
The majority of drugs are small organic molecules that alter cellular mechanisms upon binding to key biomolecules such as proteins. Identifying molecules with a desired set of properties is known as drug discovery, an iterative, tedious and expensive process. Because most molecules are inactive or lack specificity toward a given function and binding site, Virtual Screening (VS) tools have emerged as an essential way to identify the most promising candidate molecules prior to synthesis and/or experimental evaluation. Given an initial set of generic small molecules (i.e. the library), VS methods output a subset of this library enriched with active compounds. Another use of these computational methods is to predict off-target binding (binding to a non-desired site) in order to assess the binding specificity to the molecular target. This information is particularly helpful to anticipate potentially toxic compounds.
Structure-based approaches
Most VS methods broadly fall under two categories [1]. Structure-based approaches use the 3D structure of the protein binding site to identify promising candidate ligands (e.g., enzyme inhibitors, receptor antagonists, ...). If the binding site is not known, several tools are able to detect the binding sites of a protein [2, 3]. Knowledge of the binding site shape and requirements from ligand-protein co-crystals can be exploited to dock ligands into this binding site. This process is computationally demanding (i.e., from a few seconds to a few minutes per ligand) and consists of two phases: one that explores possible binding orientations of the ligands in the binding site, and another that ranks them by affinity. This approach has been the subject of detailed studies [4, 5]. Some machine learning approaches can be used to perform the ranking step, or even bypass the docking step and directly compute an affinity score [6]. However, increased accuracy comes at the cost of trying a larger number of orientations and therefore cannot be easily accelerated. The shortcoming of such methods is that they take as input both a ligand and the target protein, and are thus restricted to a limited set of ligands. This is a limitation because this finite set has already been extensively studied and patented. Additionally, these methods are often computationally expensive.

Ligand-based approaches
Ligand-based approaches assume that ligands structurally similar to active ligands will also be active [7, 8]. While this concept explicitly ignores the actual biological binding site, it implicitly considers the potential interactions and takes advantage of experimental binding affinities of reference ligands to search for new ones. Interest in generative models for this ligand-centered task is rising [9]. A major asset of this class of methods is the ability to identify potential ligands that lie outside of the patented molecular space. These models often rely on unsupervised methods that use strategies similar to word2vec in the molecular space [10]. The latent space induced by such methods conveniently supports algebraic operations. Notably, we can use optimization processes in the latent space to fine-tune a seed for ligand druggability [11, 12]. These methods are becoming increasingly popular but struggle to use binding site information, relying instead on extensive experimental assays of reference ligands. Although this is not a fundamental bottleneck for target prediction tasks, it is a major limitation for off-target predictions.
Our contribution
In this paper, we propose to integrate binding site information into a ligand-based approach. For this purpose, we introduce a new task: we use protein structure information to quickly identify regions of interest in ligand space, and use this knowledge as a potential guide in the ligand generative process. More specifically, we use the 3D structure of a protein binding site as input and predict a latent space embedding of a ligand. To our knowledge, this kind of approach was only conducted independently in a preprint [13] that used graphs to represent the pocket. Our model learns a rotation-invariant representation of the 3D input using an alignment and augmentation step, showing superior results against 3D steerable and classic volumetric CNN pipelines. We believe this step is relevant for 3D shape comparison in general and could help other volumetric CNN models improve their results.
Ligands can be represented as graphs, strings, sets of subgroups or 3D structures. All of these representations can in turn be embedded into vectors. Molecular fingerprints are a commonly used representation, defined as a bit-string encoding the presence or absence of certain relevant chemical fragments according to domain experts [14, 15]. Such representations have the usual advantages and drawbacks of hand-crafted features: they do not rely on abundant data and are easily interpretable. However, they are also not compact and potentially miss important features of the chemical space. The compounds of the largest chemical database (the REAL database) could be indexed using binary vectors whose dimension is only the logarithm of the database size, instead of the fingerprints of length 1024 used by the ECFP6 [15] representation. Thus, hand-crafted binary embeddings and the bit-to-bit distance (Tanimoto coefficient) span a sparse, discontinuous space with regards to the space usually explored by chemists. Continuous embeddings allow a more compact representation and a better preservation of chemical similarity, which are the basis of all potential machine learning algorithms.
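The bit-string fingerprints and the Tanimoto coefficient mentioned above can be made concrete with a few lines of code. The sketch below is purely illustrative: the bit indices are made up and do not correspond to any real fingerprinting scheme.

```python
# Illustrative sketch: Tanimoto similarity between two binary fingerprints,
# each modelled as a plain Python set of "on" bit indices.

def tanimoto(fp_a: set, fp_b: set) -> float:
    """Tanimoto coefficient: |A intersect B| / |A union B|."""
    if not fp_a and not fp_b:
        return 1.0  # two empty fingerprints are trivially identical
    return len(fp_a & fp_b) / len(fp_a | fp_b)

# Toy example: two fingerprints sharing some substructure bits
# (hypothetical indices into a 1024-bit vector).
mol_a = {3, 17, 256, 512, 800}
mol_b = {3, 17, 256, 640}
print(round(tanimoto(mol_a, mol_b), 3))  # 0.5
```

Because the coefficient only counts shared and distinct bits, two chemically close molecules that differ in a few fragments can still land far apart, which is the discontinuity issue raised above.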
Data-driven representations arose in 2015 and were first applied to graph representations of ligands [16, 17]. Next, unsupervised auto-encoding methods were applied to larger data sets using the SMILES representation of millions of compounds available in large databases such as ZINC [18], with a paper leveraging ideas similar to word2vec to get embeddings of molecules [10]. The authors were the first to introduce the idea that the latent space could serve as a good molecular representation for other learning tasks such as bio-activity prediction. A lot of work was conducted in this domain to adapt it to VAEs or GANs [12], and we chose to base our work on a translation auto-encoder [19] trained to map SMILES to canonical SMILES on over one billion data points. We represent our ligands with the latent space representation of this model.
The functional level of proteins and of most biological objects resides in their 3D structure, which is composed of a finite number of elements. We preferred the atomic resolution over the amino-acid resolution, as the latter cannot properly model the orientation of each amino acid and side-chain details, two major factors in protein-ligand binding. Therefore, binding sites are fundamentally modeled as a point cloud of atoms. Several deterministic embedding tools exist that turn these objects into vectors [20], but for reasons similar to ligands, learned representations are more promising [21].
While we chose, for biological motivations, to model protein binding sites as a point cloud (i.e. a matrix of coordinates associated with atom types), there is no well-established way to process such data. The challenges are choosing a translation- and rotation-invariant coordinate system in which to express our data, and an order in which to consider our points. The point-ordering problem is partially solved by a framework such as PointNet [22], which uses MaxPooling over all points. The most common solution to the ordering challenge is to consider the cloud as a sparse 3D image and leverage CNNs, which constitutes the volumetric CNN framework. Treating objects as images fails to respect their rotational invariance and leads to very sparse inputs. For 2D images there is usually an orientation convention, but there is no such native pose for a 3D object, which, as argued in [23], makes data augmentation much more challenging in 3D than in 2D. For these reasons, recent works try to extend the invariance properties of CNNs. In 2D, some work has shown superior performance by including rotational and symmetry invariances [24, 25]. In 3D, recent models manage to include rotational invariance [26] or equivariance [23, 27]. These models have the ability to leverage the data properties; however, they do not usually have clear convergence properties, and since they have not been extensively used and studied, they are harder to use and train than usual networks. We implemented a volumetric pipeline with modifications, and we also chose to use steerable 3D-CNNs [27] because they required only minimal modification from the volumetric pipeline.
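For concreteness, the volumetric input format discussed above (an atomic point cloud binned into a sparse occupancy grid) can be sketched as follows. This is a minimal illustration, not the pipeline's actual code; the grid size, resolution and centering scheme are arbitrary choices here.

```python
import numpy as np

def voxelize(coords, atom_types, n_types, grid=(40, 30, 30), resolution=1.0):
    """Map an (N, 3) coordinate array to a one-hot (n_types, *grid) volume.

    Illustrative sketch: each atom sets one voxel in the channel of its type.
    """
    volume = np.zeros((n_types,) + grid, dtype=np.float32)
    # Center the cloud on the grid, then bin each atom to a voxel index.
    offset = coords.mean(axis=0) - resolution * np.array(grid) / 2
    idx = np.floor((coords - offset) / resolution).astype(int)
    # Drop atoms falling outside the grid.
    inside = np.all((idx >= 0) & (idx < np.array(grid)), axis=1)
    for (x, y, z), t in zip(idx[inside], np.asarray(atom_types)[inside]):
        volume[t, x, y, z] = 1.0
    return volume

# Two toy atoms of different types.
coords = np.array([[0.0, 0.0, 0.0], [1.5, 0.2, -0.3]])
vol = voxelize(coords, atom_types=[0, 1], n_types=4)
print(vol.shape, int(vol.sum()))  # (4, 40, 30, 30) 2
```

Note how almost every voxel is zero: this is the sparsity problem the paragraph above points out, and why the input pose matters so much.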
The protein binding sites we used were extracted from bound examples in the PDB [28]. The extraction process consisted in taking bound protein structures, extracting all amino acids around the ligand, and extending the radius around each of its atoms until a certain radius or a certain number of neighbors was reached. We removed the metabolites in order to focus on larger, more interesting binding molecules. We then removed the binding sites that we considered duplicates, using a sequence identity cutoff and removing repeated examples of the same ligand bound to the same protein.

Table 1: Data extracted from the PDB

Number of proteins   Size in Angstroms   Number of ligands
30964                40 * 30 * 30        3362

As metrics we will often use the Mean Square Error (MSE) and the Enrichment Factor (EF). The EF at the i-th level is the rate of active ligands in a subset of size i% of the original data set, normalized by the rate of actives in the whole set. The EF was designed to mimic the true use case of virtual screening, where chemists only test a small subset and hope to obtain as many actives as possible. We used the DUDE [29] database, which is the standard for assessing enrichment factors. The DUDE database consists of 102 targets and their associated sets of actives. Fifty decoys are created per active, preserving the physico-chemical features but making the structure different, so as to compromise the binding.

We also offer an alternate strategy to bypass the problem of rotation invariance. For 3D objects embedded in images, we can compute the PCA as a way to find a consensus alignment of these objects, addressing the lack of a native pose. Computing these axes is costly, but it can be done as a preprocessing step and enables us to bypass the registration problem.
Since the eigenvectors have no canonical sign, there remains an ambiguity regarding the pose, but for a 3-dimensional object it is reduced to 2^3 = 8 possible poses.

We first validated this idea of using the principal axes of 3D objects by applying it to a small-molecule comparison method: USRCAT [30]. This method uses the successive first moments of the distribution of distances to reference points as an embedding for a given molecule. As an example, the first feature of a set of points would be its mean distance to its centroid. We used the PCA-based approach to get orientation-independent reference points (at a given distance in Angstroms from the centroid, following the eigenvector axes). We used the benchmark from the latest version of this method and show superior performance in terms of enrichment factor.

We then used this idea to represent our binding sites, all aligned on the same grid. A side advantage of this alignment step is that it lets us fit all objects on a smaller grid, since the dimensions with the highest variance usually correspond to the ones with the largest values, with the exception of outliers. Using this framework solves the translation problem and enables us to drastically reduce the problem of representing rotations of point clouds. We can use data augmentation to put all 8 flipped views in the data set, and hope that the network learns invariance to flips. If we use a permutation-invariant reduction operation over the flips, such as the average, we make the network invariant to rotations.

To explicitly enforce invariance to rotations for each prediction, we need to tie weights that are symmetric with respect to one of the axes of the 3D grid. We can also present batches containing the 8 poses, and the batch-averaged gradients will then be symmetric. This is an engineering fix that makes the implementation straightforward. Further implementations of this model should include the weight tying, which would reduce both the batch size and the number of parameters. However, since the number of possible models is the same, we should get similar results with such constraints actively enforced.

This simple alignment could be an easy yet efficient engineering solution to help build rotation-invariant networks. However, this approach has a caveat: we need to compute the PCA of the point cloud for every evaluation of the network, which could be challenging for a high number of points, though not for molecular applications.

We used the pipeline presented in Figure 1. To assess the impact of the alignment step with a side application, we competed against USRCAT [30] using reference points derived from the PCA. We computed the EF at different thresholds for each of the DUDE [29] binding sites and averaged the results. We obtain similar yet significantly superior results (∼ %). The limited improvement can be explained by the method reaching its limits, as explained in [30], but it is enough to show that using points derived from the PCA is a promising solution for representing 3D objects.
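The alignment and flip-augmentation step above can be sketched with NumPy. This is a simplified reconstruction of the idea, not the exact implementation: align the cloud on its principal axes, then enumerate the 2^3 = 8 sign-flipped poses left ambiguous by PCA.

```python
import numpy as np

def pca_align(coords):
    """Rotate an (N, 3) point cloud into its principal-axis frame."""
    centered = coords - coords.mean(axis=0)
    # Eigenvectors of the covariance matrix are the principal axes.
    _, eigvecs = np.linalg.eigh(np.cov(centered.T))
    # eigh sorts eigenvalues ascending; reverse so the highest-variance
    # axis comes first (it will need the smallest grid extent last).
    return centered @ eigvecs[:, ::-1]

def eight_poses(aligned):
    """All 2^3 = 8 sign combinations of the three principal axes."""
    signs = [(sx, sy, sz) for sx in (1, -1) for sy in (1, -1) for sz in (1, -1)]
    return [aligned * np.array(s) for s in signs]

# Toy anisotropic cloud standing in for a binding site.
cloud = np.random.default_rng(0).normal(size=(100, 3)) * [5.0, 2.0, 1.0]
poses = eight_poses(pca_align(cloud))
print(len(poses))  # 8
```

Each of the 8 poses can then be voxelized and either added to the training set as augmentation, batched together, or averaged at prediction time, matching the three strategies compared above.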
Figure 1: Pipeline used: a binding site (in red) is extracted around a ligand (blue) and turned into a point cloud. This point cloud is aligned to its principal axes and turned into 8 volumetric images. These representations are then used to learn the latent code of the extracted ligand.

Table 2: EF computed at two thresholds, averaged over the DUDE database (ratio vs. our method in parentheses)

Method     EF (1st threshold)   EF (2nd threshold)
UFSR       7.97 (0.45)          4.50 (0.46)
USRCAT     16.50 (0.91)         9.14 (0.94)
USRPCAT    17.51 (1)            9.97 (1)
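For reference, the enrichment factor used throughout can be computed in a few lines. The toy library below is illustrative, not data from the paper.

```python
# Minimal sketch of the enrichment factor (EF) at level i%: the rate of
# actives in the top i% of a ranked library, normalized by the overall
# rate of actives in the whole library.

def enrichment_factor(ranked_labels, level):
    """ranked_labels: 1 for active, 0 for decoy, best-ranked first.
    level: fraction of the library inspected, e.g. 0.01 for EF at 1%."""
    n = len(ranked_labels)
    top = ranked_labels[:max(1, int(n * level))]
    overall_rate = sum(ranked_labels) / n
    return (sum(top) / len(top)) / overall_rate

# Toy library of 100 compounds with 10 actives, 5 of them ranked
# in the top 10 by some hypothetical screening score.
labels = [1] * 5 + [0] * 5 + [1] * 5 + [0] * 85
print(enrichment_factor(labels, 0.10))  # 5.0
```

An EF of 1 means the ranking is no better than random sampling; the maximum achievable EF at level i% is bounded by 100/i when actives are rare.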
We now come back to the task of predicting the latent representation of ligands from the structure of a binding site. We trained a number of models to assess the impact of:

• aligning the data to the PCA axes
• augmenting the data set with flips
• using batching or a siamese model to enforce invariance
• different CNN settings with more or fewer parameters

See Section 5 for more details on the implementation, infrastructure, architecture and parameters. As a control experiment, we trained our model on shuffled labels. We benchmarked against steerable CNN implementations [27]. We also wanted to assess the bagging effect of the models, so we show the average error on each view of a 3D image in addition to the bagged error.

Table 3: Mean square error (MSE) between true and predicted ligands ("–" marks values not reported)

Method                 Best test MSE   Bagged test MSE   Final train MSE   Time to train
TarLig shuffled        0.210           0.206             0.07              6h
TarLig flips           0.095           0.092             0.04              6h
TarLig PCA             0.103           0.103             0.036             –
TarLig PCA flips       0.089           –                 –                 –
TarLig batched flips   0.095           0.088             0.055             5h
TarLig siamese flips   0.099           0.089             0.035             5h
Se3cnn flips           0.18            0.16              0.091             35h
Small se3cnn flips     Diverged        N/A               0.17              19h

We see in Table 3 that our model is able to learn, and performs significantly better than the shuffled control. We obtain the expected result with regards to the PCA alignment: aligning all binding sites results in a better score. Moreover, these results were obtained using early stopping, but looking at the test metric curves, we see that learning is a lot less stable with unaligned data, i.e. the model is harder to train and the best test error actually over-fits the test set (data not shown). Therefore, we conclude that our alignment strategy is beneficial. The data augmentation also has the expected effect of helping the learning.

The best model is the one that does not enforce the rotational invariance. Putting all flips in the same batch converges to the same value but takes longer to get there. This can be interpreted as an 'effective batch size' effect: we enforce rotational invariance at the cost of showing eight times fewer binding sites in each batch (for a constant batch size). The siamese model only pushes the bagged average toward the target vector, while the others push each individual prediction toward the label.

All architectures we trained behaved very similarly. We report the TarLig architecture that achieved the best, yet comparable, results among the ones we trained, including smaller versions. The trends we just described concerning the impact of aligning the pocket, augmenting the data or using a siamese architecture were found to hold across all the models we trained.

The se3cnn network showed poor performance overall, and we could not bring its performance close to our results. We think this is a big limitation of such networks: despite appealing theoretical properties, they are much harder to train and use, which is a barrier to fully leveraging their power. The alignment step enabled learning for this task, and we think that, combined with the volumetric approach and some of the refinements developed for it, it could be an efficient and much easier way to use 3D data.
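The bagged prediction reported in Table 3 can be sketched as follows: average the model's outputs over the 8 flipped poses of one binding site, which makes the final prediction invariant to the residual pose ambiguity. The model below is a stand-in callable, not the actual network.

```python
import numpy as np

def bagged_predict(model, poses):
    """Average predictions over the 8 flipped views of one binding site.

    The mean is a permutation-invariant reduction, so the result does not
    depend on which of the 8 poses the input happened to be in.
    """
    preds = np.stack([model(p) for p in poses])
    return preds.mean(axis=0)

# Toy stand-in: a "model" mapping a volumetric view to an 8-d latent vector.
rng = np.random.default_rng(1)
model = lambda view: view.mean() + rng.normal(size=8) * 0.01
poses = [np.full((4, 4, 4), 0.5) for _ in range(8)]
print(bagged_predict(model, poses).shape)  # (8,)
```

This is the inference-time counterpart of the siamese training variant above: one aggregates predictions, the other aggregates gradients.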
Next, we investigate the behaviour of our model beyond a simple average MSE score. To do so, we broke down our results per ligand and clustered the ligands into a dendrogram to measure the effect of the distribution of ligands in the dataset on performance (Figure 2).

Figure 2: Performance of the model detailed in the output space. (a) Performance by ligand, with clustering of the ligands. (b) Performance by ligand, split by the number of examples in the test set (log scale).

Overall, the model performs well across ligand families. The points on the extreme left and right of the dendrogram are ligands that lie far from the main clusters and are thus sparsely populated.
Figure 2 illustrates, naturally, that performance improves as more binding sites are available for a ligand. Regardless, we still obtain a better prediction than random for most points, including those with few or even one example.
We now use the average MSE for each output dimension to determine which dimensions are better predicted, and as a rough estimate of the uncertainty bounds of our method. One can compare the per-dimension MSE to the variance of the data (the MSE of always returning the average). If the MSE is small compared to the variance, we can say that the model has learned to predict this dimension reasonably well (Figure 3a). For most dimensions, the reduction is near two-fold over random, yet roughly 30 dimensions experience a reduction of up to 8-fold. To contextualize the usefulness of this reduction, we can think of structure-based approaches as taking samples in the chemical space and docking them against the binding site to assess their affinity. The variance of this sampling is reflected by the variety of compounds observed in the PDB. The MSE defines a high-dimensional box in the latent space where the affinity is high. If we sampled the latent space uniformly, the expected number of samples needed for a point to fall in this box (computed as the product of the number of bins in each dimension) would be enormous, while the number of compounds that can be handled by structure-based methods is comparatively tiny. Due to the curse of dimensionality, even though the result only gives on average a 2-fold enhancement in each dimension, our method is theoretically able to suggest a point unreachable with the usual structure-based approaches.

In this part, we use DUDE [29] as an external database and evaluate the performance of our technique w.r.t. the enrichment factor. Although we do not aim to match the performance of well-established methods in this setting, it offers an interesting perspective on the potential of our approach. Ligand-based methods look for similarity between active compounds that exist in the data set, while structure-based methods explicitly evaluate all candidates. By contrast, we make only one prediction. A more complete validation of the quality of our predictions would be to probe their neighborhood. Here, we conduct two experiments. First, we want to know if we can correctly identify the sub-region of space with given properties. Next, we aim to determine if this sub-region is closer to active compounds than to decoys. We make our prediction using the 3D structure of the binding site and compute the distance to our prediction. To get an EF score, we then sort active and decoy compounds by distance to this prediction.

Figure 3: (a) Variance reduction in the latent space, per latent dimension. (b) Enrichment factor at different thresholds for each target of the DUDE database.
Our first result shows that the average distance from a prediction to its associated ligand has an 'MSE' value of 0.13, which is lower than the random value of 0.2. We conclude that our method is able to identify regions of interest for this task. Importantly, we get an enrichment for most targets, which means that our model is able to make a prediction closer to the region of active compounds. However, it seems that inactive compounds do not lie in a well-separated space; thus, the closest points to the predictions necessarily include decoys. A strategy to address this caveat would be to develop other embeddings that separate compounds based on affinity.
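The screening protocol described above (ranking compounds by latent-space distance to the single predicted point) can be sketched as follows. All embeddings below are synthetic placeholders, not real DUDE data.

```python
import numpy as np

def rank_by_distance(prediction, embeddings, labels):
    """Sort compound labels by Euclidean distance to the predicted point."""
    dists = np.linalg.norm(embeddings - prediction, axis=1)
    return [labels[i] for i in np.argsort(dists)]

# Synthetic latent space: actives drawn near the prediction, decoys shifted.
rng = np.random.default_rng(2)
prediction = np.zeros(16)
actives = rng.normal(0.0, 0.3, size=(10, 16))   # close to the prediction
decoys = rng.normal(1.0, 0.3, size=(90, 16))    # shifted away from it
embeddings = np.vstack([actives, decoys])
labels = [1] * 10 + [0] * 90
ranked = rank_by_distance(prediction, embeddings, labels)
print(sum(ranked[:10]))  # most of the 10 actives land in the top 10
```

Feeding the resulting ranked labels into the enrichment-factor computation described earlier yields the per-target EF scores; when decoys overlap the active region, as observed above, the top of the ranking inevitably mixes in decoys.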
In this paper, we show that learning a representation of known binding sites enables us to predict sub-regions of the ligand space with improved binding potential. These results support a more extensive use of binding site information within ligand generative models and suggest novel avenues of improvement for computational drug discovery pipelines. Additional binding information, such as experimental affinity, could further enhance the results and help overcome the bias of learning only on co-crystallized complexes.

The alignment step for binding sites, which helps us amplify the signal, is a simple yet promising solution for representing 3D structures. Eventually, the binding site representations could be improved through a self-supervised pipeline leveraging the increasing amount of structural data available for proteins.

Finally, several lines of work are already being developed to improve generative models for ligands. In this context, our method offers promising perspectives to generate relevant seeds and help further optimization [31].
Code and Setup
We used a standard model with 7 convolution layers and 2 fully connected layers. The detailed architecture can be found on GitHub and is described in the models/SmallC3D.py class. One can also run python main.py --summary to get a summary of the model architecture and the parameters that were used. Training was performed on 4 NVIDIA P100 Pascal GPUs using 20 Broadwell cores.
Acknowledgements
We thank Mathieu Blanchette and William Hamilton for comments on the text and useful discussions. We thank Robin Winter for his help using and implementing the cddd project. We thank Mario Geiger and Tess Schmidt for advice and help implementing the se3cnn pipeline.

References