Mathematical deep learning for pose and binding affinity prediction and ranking in D3R Grand Challenges
Duc Duy Nguyen, Zixuan Cang, Kedi Wu, Menglun Wang, Yin Cao, Guo-Wei Wei
Department of Mathematics, Michigan State University, MI 48824, USA
Department of Electrical and Computer Engineering, Michigan State University, MI 48824, USA
Department of Biochemistry and Molecular Biology, Michigan State University, MI 48824, USA

May 1, 2018
Abstract
Advanced mathematics, such as multiscale weighted colored graphs and element specific persistent homology, and machine learning, including deep neural networks, were integrated to construct mathematical deep learning models for pose and binding affinity prediction and ranking in the last two D3R Grand Challenges in computer-aided drug design and discovery. D3R Grand Challenge 2 (GC2) focused on pose prediction, binding affinity ranking, and free energy prediction for Farnesoid X receptor ligands. Our models obtained the top place in absolute free energy prediction for free energy Set 1 in Stage 2. The latest competition, D3R Grand Challenge 3 (GC3), is considered the most difficult challenge so far. It has 5 subchallenges involving Cathepsin S and five kinase targets, namely VEGFR2, JAK2, p38-α, TIE2, and ABL1. There is a total of 26 official competitive tasks in GC3. Our predictions were ranked 1st in 10 out of the 26 official competitive tasks.

∗ Corresponding author: Guo-Wei Wei. Email: [email protected]

Contents

I Introduction
II Methods
  II.A Ligand preparation
  II.B Protein structures selection and preparation
  II.C Docking protocols
    Protocol 1: Machine learning based docking
    Protocol 2: Align-close
    Protocol 3: Align-target
    Protocol 4: Close-dock
    Protocol 5: Cross-dock
    Protocol 6: Constraint-IFD
    Protocol 7: Free-IFD
  II.D Multiscale weighted colored graph representation
  II.E Algebraic topology based molecular signature
    II.E.1 Persistent homology
    II.E.2 Topological description of molecular systems
  II.F Machine learning algorithms
III Results and discussion
  III.A Grand Challenge 3
  III.B Grand Challenge 2
IV Conclusion

I Introduction
With the availability of increasingly powerful computers and fast-accumulating molecular and biomolecular datasets, one can dream of a scenario in which all the major tasks of drug design and discovery are conducted on computers.
Virtual screening (VS) is one of the most important aspects of computer-aided drug design (CADD). VS involves two stages, namely, the generation of different ligand conformations (i.e., poses) when a compound is docked to a target protein binding site, and the prediction of binding affinities. It is generally believed that the first stage can be well resolved by available techniques, such as molecular dynamics (MD), Monte Carlo (MC), and genetic algorithms (GA).
However, the development of scoring functions (SFs) for binding affinity prediction with high accuracy remains a formidable challenge. In general, current SFs can be classified into four categories, namely force-field-based, knowledge-based, empirical, and machine learning-based ones. Among them, force-field-based SFs, such as COMBINE and MedusaScore, emphasize the physical description of protein-ligand interactions in the solvent environment, including van der Waals (vdW) interactions, electrostatics, hydrogen bonding, solvation effects, etc. Typical knowledge-based SFs represent the binding affinity as a linear sum of pairwise statistical potentials between receptor and ligand atoms; KECSA, PMF, DrugScore, and IT-Score are some well-known examples. Empirical SFs, in fact, make use of multiple linear regression to construct a linear combination of physical descriptor components, such as vdW interactions, hydrophobicity, hydrogen bonding, desolvation, and dipole terms. Renowned empirical SFs include X-Score, PLP, and ChemScore. Recently, machine learning, including deep learning, has emerged as a major technique in CADD. By using advanced machine learning algorithms, such as random forest (RF) and deep convolutional neural networks, machine learning-based SFs can characterize the non-additive contributions of functional groups in protein-ligand binding interactions. Such a characterization helps machine learning-based SFs consistently maintain their accuracy in binding affinity predictions for a variety of protein-ligand complexes.
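As a hypothetical illustration of the empirical approach described above, the weights of a linear scoring function can be fit by least squares. The descriptor matrix, weights, and affinities below are all synthetic; no actual SF's parameters are reproduced:

```python
import numpy as np

# Illustrative sketch of an empirical scoring function: binding affinity
# modeled as a linear combination of physical descriptors (vdW, H-bond,
# desolvation, dipole). All values here are synthetic.
rng = np.random.default_rng(0)

n_complexes, n_descriptors = 200, 4
X = rng.random((n_complexes, n_descriptors))  # hypothetical descriptor matrix
true_w = np.array([-1.5, -0.8, 0.6, 0.2])     # hypothetical weights
y = X @ true_w + 0.01 * rng.standard_normal(n_complexes)  # "measured" affinities

# Multiple linear regression via least squares, as in empirical SFs.
w, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.round(w, 2))
```

With clean synthetic data the fitted weights recover the generating weights; real SF training involves far noisier affinities and curated descriptor sets.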
However, the performance of machine learning-based SFs depends crucially on the quality and statistical distribution of the training data. Additionally, it depends on selected features that might or might not accurately and completely describe the protein-ligand binding interactions. We assume that the physics of interest of complex biomolecules and interactions lies on low-dimensional manifolds or subspaces embedded in a high-dimensional data space. Based on this hypothesis, we have recently proposed several low-dimensional mathematical models that dramatically reduce the structural complexity of protein-ligand complexes and give rise to surprisingly accurate predictions of various biomolecular properties. For example, we proposed a multiscale weighted colored graph (MWCG) model to simplify protein structures and analyze their flexibility. The essential idea of this method is to use graph theory to represent the interactions between atoms in a molecule in an element-level collective manner. The MWCG approach has been shown to be over 40% more accurate than the Gaussian network model on a set of 364 proteins. In addition to graph theory simplification, we have also developed topological abstractions of complex protein structures. In order to describe topological changes, such as the opening or closing of ion channels, the folding or unfolding of proteins, and the subtle change in a binding site after protein-ligand binding, we take advantage of topological methods to study the connectivity of different molecular components in a space, which can represent independent topological entities such as connected components, rings, and higher-dimensional faces. However, since conventional topology and homology are metric or coordinate free, they capture very little biomolecular geometric information and thus are unable to efficiently characterize biomolecular structures. Persistent homology (PH) is a new branch of algebraic topology.
It embeds geometric information into topological invariants. By changing a filtration parameter, such as the radius of atoms, PH creates a family of topological structures for a given set of atoms. As a result, the topological properties of a given biomolecule can be systematically analyzed and recorded in terms of topological invariants, i.e., the so-called Betti numbers, over the filtration process. The resulting barcodes monitor the "birth" and "death" of isolated components, circles, and cavities at different geometric scales. The persistent homology framework, together with practical algorithms, was introduced by Edelsbrunner et al., and formal mathematical theories were developed by Zomorodian and Carlsson. A zeroth-dimensional version was introduced earlier under the name of size function by Frosini and Landi. Early applications of PH to computational biology have been reported in the literature.
Recently, we have developed a variety of advanced PH models to analyze the topology-function relationship in protein folding and protein flexibility, the quantitative prediction of curvature energies of fullerene isomers, protein cavity detection, and the resolution of ill-posed inverse problems in cryo-EM structure determination. In 2015, we introduced some of the first combinations of PH and machine learning for protein structural classification. Topological descriptors were further integrated with a variety of deep learning algorithms to achieve state-of-the-art analysis and prediction of protein folding stability changes upon mutation, drug toxicity, aqueous solubility, partition coefficient, binding affinity, and virtual screening.

In this paper, we report the performance of our mathematical deep learning models on pose and binding affinity prediction and ranking in the last two D3R grand challenges, namely D3R Grand Challenge 2 (GC2) and D3R Grand Challenge 3 (GC3). GC2 was initiated in 2016 and consisted of two stages. The first stage asked participants to predict the crystallographic poses of 36 ligands for the target farnesoid X receptor (FXR). In addition, there were an affinity ranking task for all 102 compounds and absolute free energy predictions for two designated subsets of 18 and 15 small molecules. In the second stage, participants were asked again to submit the affinity rankings and free energies after the release of the 36 crystal structures. In GC2, we employed our mathematical deep learning models to select the best poses from docking-software-generated poses for the binding affinity ranking and prediction tasks. Our models achieved the top place in affinity ranking for free energy Set 1 in Stage 2.

In addition, our results for the latest grand challenge, GC3, are presented in this paper. The third grand challenge, which took place in 2017, is the largest in terms of the number of competitive tasks since 2015. It consisted of 5 subchallenges.
Subchallenge 1 was about Cathepsin S. It comprised two stages with the same tasks as in GC2. There were 24 ligands with crystal structures, and the binding affinities of the 136 compounds spread over three orders of magnitude. Subchallenge 2 focused on kinase selectivity. It had three kinase targets, namely VEGFR2, JAK2, and p38-α, with 85, 89, and 72 compounds, respectively. This subchallenge only asked participants to submit an affinity ranking for each kinase dataset. Subchallenge 3 involved the binding affinity ranking and free energy prediction for the target JAK2. It consisted of a relatively small dataset of 17 ligands with similar chemical structures. In Subchallenge 4, there were 18 congeneric ligands with Kd values for the kinase TIE2. In addition to asking for the affinity ranking of the 18 compounds, Subchallenge 4 asked participants to predict the free energies of two subsets with 4 and 6 compounds, respectively. The last subchallenge in GC3 concerned the binding affinity ranking of different mutants of the protein ABL1. There were two compounds and 5 different mutation sites. Overall, our models performed well in GC3. Specifically, we obtained first place in 10 out of a total of 26 predictive tasks.

II Methods
In this section, we briefly describe the computational methods and algorithms developed for GC2 and GC3.
II.A Ligand preparation
All ligands in the Grand Challenges are provided in the SMILES string format. They were converted to optimal 3D structures and protonated at pH 7.5 using the LigPrep tool in the Schrödinger software. Before employing AutoDock Vina for docking, Gasteiger partial charges were added to these ligands via MGLTools v1.5.6.

II.B Protein structures selection and preparation
Except for Subchallenge 1, all the receptor structures in GC3 were supplied in protein sequence format. We utilized the homology modeling task in Maestro of the Schrödinger software to obtain 3D structure predictions. In addition, we made use of the crystal structures available in the Protein Data Bank (PDB) for each protein family (see the supporting information for a complete list). These collected protein structures were prepared using the protein preparation wizard provided in the Schrödinger package with default parameters, except that the fill side chains option was enabled.

II.C Docking protocols
We used a number of docking protocols in GC2 and GC3. Among them, a machine learning protocol was developed in our own lab. Motivated by earlier work, we carried out four different docking strategies, namely align-close, align-target, close-dock, and cross-dock, to attain the best poses for binding affinity predictions. We also used induced fit docking (IFD) and unrestricted IFD in our pose predictions.

Protocol 1: Machine learning based docking
We developed a machine learning-based scoring function to select among the poses generated by GOLD, GLIDE, and AutoDock Vina. Given a target ligand, we first formed a training set of complexes taken from the PDB. The criteria for this selection were based on the similarity coefficient, measured by fingerprint 2 (FP2) in Open Babel v2.3.1, of the ligand in each complex. Then, we utilized docking software packages such as GOLD, GLIDE, and AutoDock Vina to re-dock the ligands to the proteins in the selected complexes. The docking poses were distributed into 10 RMSD bins as follows: [0,1], (1,2], ..., (9,10] Å. In each bin, we clustered the decoys into 10 clusters based on their internal similarities. The docking pose with the smallest free energy was selected as the candidate for its cluster. As a result, one may end up with a total of 100 poses for each given complex. We employed all these decoy poses to form a training set with labels defined by their RMSDs, on which our topology based deep learning models were trained. Finally, we employed the established scoring function to re-rank the poses of the target ligand produced by the docking software packages.

Protocol 2: Align-close
In the align-close method, we select the ligand available in the PDB that has the highest chemical similarity to the target ligand. Here, the similarity score was measured by fingerprint 2 (FP2) in Open Babel v2.3.1. Note that all structure processing in this procedure was conducted in the Schrödinger suite 2017-4. A ligand was aligned to its similar candidates by the flexible ligand alignment task in Schrödinger's Maestro. Then, the resulting aligned ligand was minimized in the co-crystal structure of the most similar ligand by Prime in the Schrödinger package.
Protocol 3: Align-target
In the align-target protocol, the homology modeling tool in Maestro was used to construct protein 3D structures from the given sequences, and the aligned ligands obtained from the align-close procedure were minimized with respect to their corresponding receptors.
Protocol 4: Close-dock
The fourth docking strategy is called close-dock. Based on the previous docking methods, one can identify the structure in the PDB most similar to that of a given D3R ligand. This procedure also gives the corresponding co-crystal structure, i.e., the so-called closest receptor. In the close-dock approach, AutoDock Vina is used to dock the target ligand to its corresponding closest receptor. The best pose is selected based on AutoDock Vina's energy scoring.
Protocol 5: Cross-dock
The next approach among our docking methods is named cross-dock. This is basically a cross-docking method in which the close receptors are the co-crystal structures of the ligands with chemical characteristics similar to those of the ligand of interest. In the cross-docking procedure, we use AutoDock Vina to dock the D3R ligands to their close receptors. The poses with the smallest binding energies are selected as the best poses.
Protocol 6: Constraint-IFD
Similarly to the align-target protocol, we used the homology modeling module in Maestro to generate a 3D structure from a given sequence. For the docking procedure, we employed the induced fit docking (IFD) in Maestro, restricting docking poses to the closest ligands with a tolerance of 3 Å. The best pose was selected according to the ranking from IFD.
Protocol 7: Free-IFD
This protocol is exactly the same as Constraint-IFD, except that no constraint is imposed during the induced fit docking run.
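As a minimal sketch of the decoy-selection step used in Protocol 1 above (RMSD binning plus lowest-energy selection per cluster), with synthetic poses and precomputed cluster labels standing in for real docking output and similarity clustering:

```python
import random

# Sketch of Protocol 1's decoy selection: poses are grouped into 10 RMSD
# bins ([0,1], (1,2], ..., (9,10] A) and, within each bin, the lowest-energy
# pose per cluster is kept. Poses and cluster labels here are synthetic;
# real poses would come from GOLD, GLIDE, or AutoDock Vina.
random.seed(0)
poses = [{"rmsd": random.uniform(0, 10),
          "energy": random.uniform(-12, -4),
          "cluster": random.randrange(10)} for _ in range(2000)]

def select_decoys(poses, n_bins=10, n_clusters=10):
    """Return up to n_bins * n_clusters representative decoys."""
    best = {}  # (bin, cluster) -> lowest-energy pose seen so far
    for p in poses:
        b = min(int(p["rmsd"]), n_bins - 1)  # bin index; rmsd=10 falls in last bin
        key = (b, p["cluster"])
        if key not in best or p["energy"] < best[key]["energy"]:
            best[key] = p
    return list(best.values())

decoys = select_decoys(poses)
print(len(decoys))  # at most 100 representatives per complex
```

The selected decoys, labeled by RMSD, would then form the training set for the pose-scoring model.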
II.D Multiscale weighted colored graph representation
The weighted colored graph (WCG) method describes intermolecular and intramolecular interactions as pairwise atomic correlations. To apply the WCG to the analysis of protein-ligand interactions, we convert all the atoms and their pairwise interactions at the binding site of a protein-ligand complex within a cutoff distance $d$ into a colored graph $G(V_d, E)$ with vertices $V_d$ and edges $E$. The $i$th atom is labeled by its position $\mathbf{r}_i$, element type $\alpha_i$, and co-crystal type $\beta_i$. Thus, we can express the vertices $V_d$ as
$$V_d = \{(\mathbf{r}_i, \alpha_i, \beta_i) \mid \mathbf{r}_i \in \mathbb{R}^3,\ \alpha_i \in \mathcal{C},\ \beta_i \in \mathcal{S},\ \|\mathbf{r}_i - \mathbf{r}_j\| < d \text{ for some } 1 \le j \le N \text{ such that } \beta_i + \beta_j = 1,\ i = 1, 2, \ldots, N\}, \quad (1)$$
where $\mathcal{C} = \{\mathrm{C, N, O, S, P, F, Cl, Br, I}\}$ contains all the commonly occurring element types in a complex, and $\mathcal{S} = \{0, 1\}$ indicates that $\beta_i = 0$ if the $i$th atom belongs to the protein and $\beta_i = 1$ otherwise. The hydrogen element is omitted since it is not present in the crystal structures of most protein-ligand complexes. To describe pairwise interactions between the protein and the ligand, we define an ordered and colored set $\mathcal{P} = \{(\alpha, \alpha')\}$. Here, $\alpha \in \{\mathrm{C, N, O, S}\}$ is a heavy atom type in the protein, and $\alpha' \in \{\mathrm{C, N, O, S, P, F, Cl, Br, I}\}$ is a heavy atom type in the ligand. With this setting, it is easy to verify that the set $\mathcal{P}$ has a total of 36 partitions. For example, the partition $\mathcal{P}_k = \{(\mathrm{C}, \mathrm{O})\}$ contains all CO pairs in the complex in which the first atom is a carbon in the protein and the second atom is an oxygen in the ligand. For each set of atom pairs $\mathcal{P}_k$, $k = 1, 2, \ldots, 36$, the set of vertices $V_{\mathcal{P}_k}$ is the subset of $V_d$ containing all atoms that belong to a pair in $\mathcal{P}_k$. The edges of the WCG describing potential pairwise atomic interactions are then defined by
$$E^{\sigma,\tau,\zeta}_{\mathcal{P}_k} = \{\Phi^{\sigma}_{\tau,\zeta}(\|\mathbf{r}_i - \mathbf{r}_j\|) \mid ((\alpha_i, \beta_i), (\alpha_j, \beta_j)) \in \mathcal{P}_k;\ i, j = 1, 2, \ldots, N\}, \quad (2)$$
where $\|\mathbf{r}_i - \mathbf{r}_j\|$ is the Euclidean distance between the $i$th and $j$th atoms, $\sigma$ indicates the type of radial basis function (e.g., $\sigma = E$ for the exponential kernel and $\sigma = L$ for the Lorentz kernel), $\tau$ is a scale distance factor between two atoms, and $\zeta$ is the power parameter of the kernel (i.e., $\zeta = \kappa$ when $\sigma = E$ and $\zeta = \nu$ when $\sigma = L$). The kernel $\Phi^{\sigma}_{\tau,\zeta}$ characterizes a pairwise correlation satisfying the following conditions:
$$\Phi^{\sigma}_{\tau,\zeta}(\|\mathbf{r}_i - \mathbf{r}_j\|) \to 1 \quad \text{as } \|\mathbf{r}_i - \mathbf{r}_j\| \to 0, \quad (3)$$
$$\Phi^{\sigma}_{\tau,\zeta}(\|\mathbf{r}_i - \mathbf{r}_j\|) \to 0 \quad \text{as } \|\mathbf{r}_i - \mathbf{r}_j\| \to \infty. \quad (4)$$
Commonly used radial basis functions include generalized exponential functions
$$\Phi^{E}_{\tau,\kappa}(\|\mathbf{r}_i - \mathbf{r}_j\|) = e^{-\left(\|\mathbf{r}_i - \mathbf{r}_j\|/\tau(\bar{r}_i + \bar{r}_j)\right)^{\kappa}}, \quad \kappa > 0, \quad (5)$$
and generalized Lorentz functions
$$\Phi^{L}_{\tau,\nu}(\|\mathbf{r}_i - \mathbf{r}_j\|) = \frac{1}{1 + \left(\|\mathbf{r}_i - \mathbf{r}_j\|/\tau(\bar{r}_i + \bar{r}_j)\right)^{\nu}}, \quad \nu > 0, \quad (6)$$
where $\bar{r}_i$ and $\bar{r}_j$ are the van der Waals radii of the $i$th and $j$th atoms, respectively.

In graph theory and network analysis, centrality is widely used to identify the most important nodes. There are various types of centrality, such as degree centrality, closeness centrality, and harmonic centrality. While degree centrality counts the number of edges incident on a node, closeness and harmonic centralities depend on the lengths of edges and are defined as $1/\sum_j \|\mathbf{r}_i - \mathbf{r}_j\|$ and $\sum_j 1/\|\mathbf{r}_i - \mathbf{r}_j\|$, respectively. The centrality used in the current work is an extension of the harmonic formulation through our correlation functions,
$$\mu^{k,\sigma,\tau,\nu}_i = \sum_{j=1}^{|V_{\mathcal{P}_k}|} w_{ij} \Phi^{\sigma}_{\tau,\nu}(\|\mathbf{r}_i - \mathbf{r}_j\|), \quad ((\alpha_i, \beta_i), (\alpha_j, \beta_j)) \in \mathcal{P}_k,\ \forall i = 1, 2, \ldots, |V_{\mathcal{P}_k}|, \quad (7)$$
where $w_{ij}$ is a weight function assigned to each atomic pair. In the current work, we choose $w_{ij} = 1$ if $\beta_i + \beta_j = 1$ and $w_{ij} = 0$ otherwise, for all calculations, to reduce the dimension of the parameter space. To describe a centrality for the whole graph $G(V_{\mathcal{P}_k}, E^{\sigma,\tau,\zeta}_{\mathcal{P}_k})$, we take the summation of the node centralities,
$$\mu^{k,\sigma,\tau,\nu} = \sum_{i=1}^{|V_{\mathcal{P}_k}|} \mu^{k,\sigma,\tau,\nu}_i. \quad (8)$$
Since we have 36 choices of the set of weighted colored edges $\mathcal{P}_k$, we obtain 36 corresponding graph centralities $\mu^{k,\sigma,\tau,\nu}$. By varying the kernel parameters $(\sigma, \tau, \nu)$, one can obtain multiscale centralities for the multiscale weighted colored graph (MWCG). For a two-scale WCG, we obtain a total of 72 descriptors for a protein-ligand complex.
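A minimal sketch of Eqs. (5)-(8) for a single element pair $\mathcal{P}_k$ (here, protein carbon versus ligand oxygen), assuming synthetic coordinates; only cross-molecule pairs contribute, matching the choice $w_{ij} = 1$ iff $\beta_i + \beta_j = 1$:

```python
import numpy as np

# MWCG descriptor sketch for one element pair P_k: sum of kernel-weighted
# correlations between protein and ligand atoms. Coordinates are synthetic.
rng = np.random.default_rng(1)
protein_xyz = rng.uniform(0, 10, (30, 3))  # hypothetical protein C atoms
ligand_xyz = rng.uniform(4, 6, (8, 3))     # hypothetical ligand O atoms
r_p, r_l = 1.7, 1.52                       # vdW radii of C and O (Angstrom)

def lorentz(d, tau, nu, ri, rj):
    """Generalized Lorentz kernel, Eq. (6)."""
    return 1.0 / (1.0 + (d / (tau * (ri + rj))) ** nu)

def exponential(d, tau, kappa, ri, rj):
    """Generalized exponential kernel, Eq. (5)."""
    return np.exp(-((d / (tau * (ri + rj))) ** kappa))

def mwcg_descriptor(kernel, tau, power):
    # Pairwise protein-ligand distances; since w_ij = 1 only for
    # cross-molecule pairs, only these pairs enter the sums of Eqs. (7)-(8).
    d = np.linalg.norm(protein_xyz[:, None, :] - ligand_xyz[None, :, :], axis=-1)
    return kernel(d, tau, power, r_p, r_l).sum()

# Two kernel scales -> 2 descriptors for this element pair; repeating over
# all 36 pairs yields the 72 MWCG features mentioned above.
features = [mwcg_descriptor(lorentz, 0.5, 3.0), mwcg_descriptor(lorentz, 5.0, 3.0)]
print(np.round(features, 3))
```

The small-$\tau$ descriptor emphasizes short-range contacts while the large-$\tau$ descriptor captures longer-range correlations, which is the multiscale aspect of the method.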
II.E Algebraic topology based molecular signature
The geometry of biomolecular systems, together with their complex interaction patterns, allows us to build topological spaces upon which powerful topological analysis can be performed. The topological analysis provides a description of the molecular system that captures key aspects of the system, including a multiscale description of the geometry, the characterization of the interaction network in arbitrary dimensions, and important physical and chemical information, which ensures the success of the downstream machine learning modeling. In this section, we first briefly describe the background of persistent homology. Then, we demonstrate how to apply it to biomolecular systems to obtain a rich but concise description.
II.E.1 Persistent homology
We describe the theory of persistent homology in the framework of simplicial homology in a geometric sense, where topological spaces are represented by collections of points, edges, triangles, and their higher-dimensional counterparts. A $k$-simplex is the convex hull of $(k+1)$ affinely independent points in $\mathbb{R}^n$ with $n \ge k$. If the vertices of a simplex form a subset of the vertices of another simplex, the former is called a face of the latter. Simplices of various dimensions are the building blocks of a simplicial complex, which is a finite collection of simplices satisfying two conditions: (1) the faces of any simplex in the complex are also in the complex, and (2) the intersection of any two simplices in the complex is either empty or a common face of the two. A simplicial complex can be used to discretely represent or approximate a topological space. Given a simplicial complex $X$, a $k$-chain is a formal sum of the $k$-simplices in $X$, defined as
$$c = \sum_i a_i \sigma_i, \quad (9)$$
where $\sigma_i$ is a $k$-simplex in $X$ and $a_i$ is a coefficient in a coefficient set of choice, such as a finite field $\mathbb{Z}_p$ with a prime $p$. The set of all $k$-chains, with the addition operator inherited from the coefficient group, forms a group called the $k$th chain group, denoted $C_k(X)$. The chain groups of different dimensions are connected by a collection of homomorphisms called boundary operators, forming a chain complex
$$\cdots \xrightarrow{\partial_{i+1}} C_i(X) \xrightarrow{\partial_i} C_{i-1}(X) \xrightarrow{\partial_{i-1}} \cdots \xrightarrow{\partial_2} C_1(X) \xrightarrow{\partial_1} C_0(X) \xrightarrow{\partial_0} 0. \quad (10)$$
It suffices to define the boundary operator on simplices; the definition then extends linearly to general chains:
$$\partial_k(\sigma) = \sum_{i=0}^{k} (-1)^i [v_0, \cdots, \hat{v}_i, \cdots, v_k], \quad (11)$$
where $v_0, \cdots, v_k$ are the vertices of the $k$-simplex $\sigma$ and $[v_0, \cdots, \hat{v}_i, \cdots, v_k]$ denotes the codimension-1 face of $\sigma$ obtained by omitting the vertex $v_i$. The boundary operator has the important property that
$$\partial_k \circ \partial_{k+1} = 0. \quad (12)$$
With the boundary operators, we can define boundary groups and cycle groups, which are subgroups of the chain groups. The $k$th boundary group is defined as the image of $\partial_{k+1}$, denoted $B_k(X) = \mathrm{Im}(\partial_{k+1})$. The $k$th cycle group is defined as the kernel of $\partial_k$, denoted $Z_k(X) = \mathrm{Ker}(\partial_k)$. It follows from the property in Eq. (12) that $B_k(X) \subseteq Z_k(X)$. The $k$th homology group is then defined as the quotient group
$$H_k(X) = Z_k(X)/B_k(X). \quad (13)$$
Intuitively, the $k$th homology group contains elements associated with $k$-dimensional holes that are not boundaries of $(k+1)$-chains, which characterize the topology.

The theory described above computes the homology of a topological space in various dimensions to obtain a multidimensional characterization of the space. However, this is not enough for cases where the objects are also multiscale. Therefore, instead of computing homology only for a fixed topological space, we build a sequence of subspaces of the topological space and track how homology evolves along this sequence. This sequence is called a filtration,
$$\emptyset = X_0 \subseteq X_1 \subseteq \cdots \subseteq X_{m-1} \subseteq X_m = X. \quad (14)$$
The filtration naturally induces inclusion maps connecting the homology groups of a given dimension,
$$H_k(X_0) \to H_k(X_1) \to \cdots \to H_k(X_{m-1}) \to H_k(X_m). \quad (15)$$
A homology generator $\delta \in H_k(X_i)$ is said to be born at $i$ if it is not in the image of the inclusion map from $H_k(X_{i-1})$, and it is said to die at $i+1$ if it is mapped to zero or merged into another homology generator born before $i$ by the inclusion map from $H_k(X_i)$. Persistent homology tracks how these homology generators appear and disappear along the course of the filtration, resulting in a robust multiscale description of the original topological space. The birth and death of each generator can be represented by a half-open interval starting at the birth time and ending at the death time.
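The boundary operator of Eq. (11) and the property of Eq. (12) can be checked directly in a few lines; the sketch below works over $\mathbb{Z}_2$ coefficients, so the alternating signs drop out:

```python
# A k-simplex is a sorted tuple of vertex indices; a Z_2 chain is a set of
# simplices (a simplex is present iff its coefficient is 1).

def boundary_simplex(simplex):
    """Codimension-1 faces of a simplex, Eq. (11) (signs vanish over Z_2)."""
    return [simplex[:i] + simplex[i + 1:] for i in range(len(simplex))]

def boundary_chain(chain):
    """Boundary of a Z_2 chain: mod-2 sum (symmetric difference) of faces."""
    out = set()
    for s in chain:
        for face in boundary_simplex(s):
            out ^= {face}  # mod-2 addition
    return out

# Example: the boundary of the 2-simplex [0,1,2] is its triangle of edges,
# and the boundary of that boundary is empty, verifying Eq. (12).
tri = (0, 1, 2)
edges = boundary_chain({tri})
print(sorted(edges))          # [(0, 1), (0, 2), (1, 2)]
print(boundary_chain(edges))  # set() -- the boundary of a boundary is zero
```

Every face of the triangle's boundary touches each vertex an even number of times, which is exactly why the composite boundary vanishes.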
Collections of such birth-death intervals can be visualized in several ways, most commonly as barcodes or persistence diagrams.

II.E.2 Topological description of molecular systems
To describe molecular systems using persistent homology, the atoms can be regarded as vertices, and different constructions of filtrations reveal different aspects of the system.

To describe complex protein geometry, an efficient filtration using the alpha complex can be employed. To build an alpha filtration, a Voronoi diagram is first generated for the collection of points representing the atoms in the system. The final frame of the filtration is the simplicial complex that includes a $k$-simplex whenever the $(k+1)$ Voronoi cells associated with its $(k+1)$ vertices have a nonempty common intersection. The filtration is then constructed by associating a subcomplex to each value of a filtration parameter $\epsilon$. The subcomplex associated with $\epsilon$ is defined as
$$X_{\mathrm{alpha}}(\epsilon) = \{\sigma \in X \mid \sigma = [v_0, \cdots, v_k],\ \cap_i (V(v_i) \cap B_\epsilon(v_i)) \neq \emptyset\}, \quad (16)$$
where $V(v_i)$ is the Voronoi cell of $v_i$ and $B_\epsilon(v_i)$ is an $\epsilon$-ball centered at $v_i$.

A more abstract construction of filtration, using the Vietoris-Rips complex, can be used to address other properties of the system, such as protein-ligand interactions. Given a set of points with pairwise distances (not necessarily satisfying the triangle inequality), the subcomplex associated with a filtration parameter $\epsilon$ is defined as
$$X_{\mathrm{rips}}(\epsilon) = \{\sigma \in X \mid \sigma = [v_0, \cdots, v_k],\ d(v_i, v_j) \le \epsilon \text{ for } 0 \le i, j \le k\}, \quad (17)$$
where $d$ is the predefined distance function and $X$ is the collection of all possible simplices. Tweaking the distance function can help emphasize different properties of the system. For example, in a protein-ligand complex, setting the distance between an atom from the protein and an atom from the ligand to their Euclidean distance, while setting the distance between atoms from the same molecule to infinity, emphasizes the interaction pattern between the two molecules.
Also, we can assign values between atoms according to a specificdistance of interest by using kernel functions as distances. We have proposed element specific persistenthomology to encode physical interactions into topological invariants.
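A minimal sketch of how this modified distance yields interaction barcodes: dimension-0 persistent homology of the Vietoris-Rips filtration can be computed with a union-find pass over edges sorted by the cross-molecule-only distance. The 2-D coordinates below are synthetic:

```python
import math

# Dimension-0 persistence for a Rips filtration with the "interaction"
# distance: cross-molecule distances are Euclidean, same-molecule distances
# are infinite, so only protein-ligand contacts merge components.
protein = [(0.0, 0.0), (3.0, 0.0), (6.0, 0.0)]
ligand = [(1.0, 1.0), (5.0, 1.0)]
points = [("P", p) for p in protein] + [("L", q) for q in ligand]

def dist(i, j):
    (mi, pi), (mj, pj) = points[i], points[j]
    if mi == mj:
        return math.inf  # same molecule: never connected
    return math.dist(pi, pj)

def h0_barcode(n):
    """Betti-0 bars: all components born at 0; deaths at merge distances."""
    parent = list(range(n))
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    edges = sorted((dist(i, j), i, j) for i in range(n) for j in range(i + 1, n))
    deaths = []
    for d, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj and d < math.inf:
            parent[ri] = rj
            deaths.append(d)  # one component dies at each merge
    return deaths

print([round(d, 3) for d in h0_barcode(len(points))])  # [1.414, 1.414, 2.236, 2.236]
```

The death values record at which filtration scale protein atoms first join ligand atoms, i.e., the contact distances that characterize the binding interface.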
By computing persistent homology on subsets of the atoms, we can extract different chemical information. For example, an element specific persistent homology computation on the collection of all carbon atoms describes the hydrophobic network or the structural basis of the molecule, while a computation on the nitrogen and oxygen atoms characterizes the hydrophilic network. For the characterization of small molecules, we can use a multilevel element specific persistent homology to describe both the covalent bonds and the noncovalent interactions in the molecule. The element specific persistent homology results (barcodes) can be paired with machine learning models in several ways. For example, the Wasserstein metric can be applied to measure similarities among the barcodes of different proteins, which can be used with methods such as nearest neighbors and manifold learning. The element specific persistent homology barcodes can also be turned into fixed-length feature vectors by discretizing the range of the barcodes and counting the persistence, birth, and death events that fall in each subinterval. The statistics of element specific persistent homology barcodes can also be used for featurization. These fixed-length features can be used with powerful machine learning methods such as ensembles of trees and deep learning neural networks.
The barcodes can also be transformed into image-like representations and used with 1-dimensional or 2-dimensional convolutional neural networks.
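The binning featurization just described can be sketched as follows, with a synthetic barcode and an arbitrary filtration range:

```python
# Fixed-length featurization of a barcode: the filtration range is split
# into bins, and for each bin we count bars born in it, dying in it, or
# persisting through it. The barcode values below are synthetic.
bars = [(0.0, 1.2), (0.3, 2.5), (0.4, 0.9), (1.1, 3.8)]  # (birth, death)

def featurize(bars, lo=0.0, hi=4.0, n_bins=8):
    width = (hi - lo) / n_bins
    birth = [0] * n_bins
    death = [0] * n_bins
    persist = [0] * n_bins
    for b, d in bars:
        for k in range(n_bins):
            left, right = lo + k * width, lo + (k + 1) * width
            if left <= b < right:
                birth[k] += 1
            if left <= d < right:
                death[k] += 1
            if b < right and d > left:  # bar overlaps this bin
                persist[k] += 1
    return birth + death + persist      # concatenated feature vector

vec = featurize(bars)
print(vec)  # length 3 * n_bins = 24
```

The same vector layout extends naturally to an "image" by stacking the per-bin counts for each element specific group as rows, which is what a 2-D convolutional network would consume.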
II.F Machine learning algorithms

Figure 1: Illustration of mathematical deep learning prediction: a protein-ligand complex is decomposed into element specific groups, from which topological or graph features are extracted and fed to deep learning models.

The machine learning methods used in our predictions fall into two categories, ensembles of trees and deep learning. A schematic illustration of our mathematical deep learning modeling is given in Fig. 1.

Ensemble of trees
The basic building block of this type of method is the decision tree, which identifies key features to make decisions at the nodes of the tree. Due to its simple structure, a single decision tree is usually considered a weak learner, especially for highly nonlinear problems or applications with high-dimensional features. Ensemble of trees methods build models consisting of collections of decision trees, under the assumption that grouping weak learners can improve learning capability. We mainly used random forest and gradient boosted trees for the predictions. Random forest builds uncorrelated decision trees, with each tree trained on a resampling of the original training set (bootstrap). On the contrary, gradient boosted trees add one tree to the collection at a time, along the direction of the steepest descent of the loss of the current collection. As these two models attempt to reduce error in two different ways, they behave differently in the bias-variance tradeoff: random forest is better at lowering variance, while gradient boosted trees focus more on reducing bias. Therefore, a higher-level bagging of models of different kinds can further improve performance. The ensemble learning methods are also robust, and overfitting can be reduced by learning partial problems. For example, each tree can be trained with a random subset of the training data and a subset of the features, and model complexity can be constrained by setting a maximum tree depth. Both our graph theory based models and algebraic topology based models achieve top-class performance with ensemble of trees methods.
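The contrast between the two strategies can be illustrated with depth-1 regression trees (stumps) on synthetic 1-D data; this is a toy sketch of the bias-variance behavior, not our production models, which used standard random forest and gradient boosting implementations:

```python
import random

# Bagging averages stumps fit on bootstrap resamples (variance reduction);
# boosting fits each stump to the residual of the current ensemble
# (bias reduction). Data and learning rate are arbitrary choices.
random.seed(0)
xs = [i / 50 for i in range(100)]
ys = [x * x for x in xs]  # target function y = x^2

def fit_stump(data):
    """Best single-split stump minimizing squared error."""
    best = None
    for t in [i / 10 for i in range(1, 20)]:
        left = [y for x, y in data if x < t]
        right = [y for x, y in data if x >= t]
        if not left or not right:
            continue
        ml, mr = sum(left) / len(left), sum(right) / len(right)
        err = sum((y - (ml if x < t else mr)) ** 2 for x, y in data)
        if best is None or err < best[0]:
            best = (err, t, ml, mr)
    _, t, ml, mr = best
    return lambda x: ml if x < t else mr

data = list(zip(xs, ys))

# Bagging (random-forest style): average stumps over bootstrap samples.
bag = [fit_stump(random.choices(data, k=len(data))) for _ in range(50)]
bag_pred = lambda x: sum(s(x) for s in bag) / len(bag)

# Boosting: each stump fits the residuals left by the ensemble so far.
boost, lr, resid = [], 0.5, ys[:]
for _ in range(50):
    s = fit_stump(list(zip(xs, resid)))
    boost.append(s)
    resid = [r - lr * s(x) for x, r in zip(xs, resid)]
boost_pred = lambda x: lr * sum(s(x) for s in boost)

mse = lambda f: sum((f(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)
print(round(mse(bag_pred), 4), round(mse(boost_pred), 4))
```

Bagged stumps retain the bias of a single stump, while boosting steadily removes it, which is why the boosted ensemble fits this nonlinear target far better here.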
Deep Learning
When the features are complex or there is some underlying dimension in the feature space, deep learning models can further push the performance of the predictor. For example, a spatial dimension associated with the filtration parameter lies in the persistent homology representation of protein-ligand systems. This enables the use of the powerful convolutional neural networks (CNNs), which have been extremely successful in the field of computer vision. The neural networks we used in the prediction are in the category of feedforward networks, where the signal from the previous layer undergoes a linear transformation to the current layer; the current layer then applies a nonlinear activation function and sends the signal to the next layer. Classical deep neural networks are constructed by stacking fully connected layers, where every pair of neurons in two adjacent layers is connected. Different rules of neuron connection and parameter sharing have resulted in a number of powerful deep learning models that flourish in various application fields. CNNs take advantage of feature structures with spatial dimensions and allow only local connections, with parameters shared along the spatial dimensions, which significantly lowers the dimension of the parameter space. Also, the flexibility of neural networks allows learning different but related tasks together by sharing layers, i.e., a type of multi-task learning. We applied convolutional neural networks and multi-task learning in our prediction, which further advanced the capability of our models.
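The weight-sharing idea behind CNNs can be sketched in a few lines: a single small kernel slides along the filtration axis of a persistent-homology feature vector, so the parameter count is independent of the input length. The signal below (a hypothetical Betti-0 bar count sampled at 64 filtration values) is purely illustrative.

```python
def conv1d(signal, kernel):
    """Valid-mode 1D cross-correlation: the same small kernel slides
    along the input, so its weights are shared across positions."""
    k = len(kernel)
    return [sum(signal[i + j] * kernel[j] for j in range(k))
            for i in range(len(signal) - k + 1)]

# Hypothetical Betti-0 bar count over 64 filtration values: the bars
# persist through the first half of the filtration, then die.
signal = [1.0] * 32 + [0.0] * 32

# A 2-tap difference kernel responds where the count changes.
response = conv1d(signal, [1.0, -1.0])
print(response.index(max(response)))  # prints 31: where the bars die

# Weight sharing keeps the parameter count small:
#   fully connected 64 -> 63 layer: 64 * 63 = 4032 weights
#   convolutional layer, 2-tap kernel: 2 weights, for any input length
```

A trainable CNN stacks such convolutions with nonlinearities and learned kernels; this sketch only isolates the local-connection and parameter-sharing mechanism discussed above.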
III Results and discussion
Here, we provide the results of our mathematical deep learning models in two current grand challenges, GC2and GC3.
III.A Grand Challenge 3
There are 5 subchallenges in GC3, involving a total of 12 affinity prediction submissions and 2 pose prediction challenges, resulting in 26 different competitive tasks. Our submissions were ranked 1st in 10 of these 26 tasks, as shown in Fig. 2. While we employed the align-close, align-target, close-dock, and cross-dock protocols for pose generation in subchallenges 1–4, we applied the constraint-IFD and free-IFD procedures for kinase mutants in subchallenge 5. The combination of MWCG and algebraic topological descriptors was utilized as the features in the random forest and deep learning methods. We were also interested in seeing whether docking features could enhance our mathematical descriptors, so we included the AutoDock Vina scoring terms in some submissions. In fact, these additional docking features did not improve our available models. The following is a discussion of our performance on each subchallenge task.
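The descriptor combination described above amounts to concatenating feature blocks per complex, with the docking terms as an optional block. A minimal sketch follows; the block names, sizes, and values are illustrative assumptions, as the real feature sets are far larger.

```python
# Hypothetical per-complex descriptor blocks (illustrative values only).
features = {
    "mwcg": [0.12, 0.87, 0.05],        # multiscale weighted colored graph
    "topology": [1.4, 0.0, 2.3, 0.7],  # element specific persistent homology
    "vina": [-6.1, 0.35],              # AutoDock Vina scoring terms
}

def assemble(blocks, use_docking=False):
    """Concatenate descriptor blocks into one flat feature vector; the
    docking terms are optional, mirroring the submissions made with
    and without them."""
    keys = ["mwcg", "topology"] + (["vina"] if use_docking else [])
    return [v for k in keys for v in blocks[k]]

print(len(assemble(features)))                    # mathematical features only
print(len(assemble(features, use_docking=True)))  # with Vina terms appended
```

The assembled vector is what a random forest or deep network would consume; toggling `use_docking` reproduces the with/without-docking comparison reported in the text.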
Subchallenge 1
The protein target for this challenge is Cathepsin S. There are 24 ligand-protein co-crystal structures and 136 ligands with binding data (IC50s). This subchallenge has two stages. Stage 1 asks participants to submit pose predictions, affinity rankings, and energy predictions. Stage 2 asks for similar tasks except for pose prediction. Co-crystal structures were released for the second stage.
In order to examine the performance of scoring functions on binding affinity when ligand pose errors do not contribute to the final outcome, the D3R organizers evaluated the accuracy of all submitted methods on 19 ligands having crystallographic poses. With this setting, our models attained first place in the following tasks: the free energy set in Stage 1, and scoring and the free energy set in Stage 2. It is worth mentioning that only Stage 2 has the experimental structures; Stage 1 is still affected by pose prediction errors. That explains why our predictors performed only decently on the scoring task in Stage 1, with a best Kendall's τ = 0.23, but achieved a state-of-the-art result on the same task in Stage 2, with a best Kendall's τ = 0.54 (receipt ID 6jekk).
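The metrics used throughout this section are Kendall's τ for ranking, the offset-removed "centered" RMSE (RMSE_c) for the free energy sets, and the Matthews correlation coefficient (MCC) for the active/inactive tasks. A minimal sketch of all three follows; this is a plain tau-a without tie corrections, so the official D3R evaluation code may differ in detail.

```python
import itertools
import math

def kendall_tau(a, b):
    """Kendall's tau-a: (concordant - discordant) over all pairs.
    (No tie correction; official evaluations may differ in detail.)"""
    pairs = list(itertools.combinations(range(len(a)), 2))
    c = sum(1 for i, j in pairs if (a[i] - a[j]) * (b[i] - b[j]) > 0)
    d = sum(1 for i, j in pairs if (a[i] - a[j]) * (b[i] - b[j]) < 0)
    return (c - d) / len(pairs)

def centered_rmse(pred, expt):
    """RMSE after removing the mean offset between prediction and
    experiment (RMSE_c), so a constant shift is not penalized."""
    shift = sum(e - p for p, e in zip(pred, expt)) / len(pred)
    return math.sqrt(sum((p + shift - e) ** 2
                         for p, e in zip(pred, expt)) / len(pred))

def mcc(actual, predicted):
    """Matthews correlation coefficient for active/inactive labels."""
    tp = sum(1 for a, p in zip(actual, predicted) if a and p)
    tn = sum(1 for a, p in zip(actual, predicted) if not a and not p)
    fp = sum(1 for a, p in zip(actual, predicted) if not a and p)
    fn = sum(1 for a, p in zip(actual, predicted) if a and not p)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

# A perfectly ranked but uniformly shifted free-energy prediction:
pred = [-9.0, -8.0, -6.5, -5.0]
expt = [-7.0, -6.0, -4.5, -3.0]
print(kendall_tau(pred, expt))    # 1.0: the ordering is fully preserved
print(centered_rmse(pred, expt))  # 0.0: only a constant offset differs
```

The example shows why RMSE_c complements τ: a model can reproduce the energy ordering perfectly while being systematically offset, and centering removes that offset before scoring.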
Figure 2: Overview of all 26 predictive tasks in D3R GC3, grouped into pose predictions (Cathepsin Stages 1A and 1B), affinity rankings excluding Kds > 10 µM, active/inactive classifications, and affinity rankings for co-crystallized ligands, with scoring and free energy subtasks over the Cathepsin S, VEGFR2, JAK2 (SC2 and SC3), p38-α, TIE2, and ABL1 targets. Our predictions were ranked 1st in the tasks marked by golden stars.
Figure 3: Performance comparison of different submissions on the affinity ranking of 19 ligands having crystallographic poses in Stage 2 of Subchallenge 1 of D3R GC3. Our best prediction, with receipt ID 6jekk and shown in red, achieved the top performance with Kendall's τ = 0.54.
Figure 3 depicts the ranking of all participants on the affinity ranking of 19 ligands in Stage 2. The best free energy predictions on the ligands with experimental structures were also attained by our submissions. Specifically, in Stage 1, our prediction with receipt ID fomca obtained RMSE_c = 0.33 kcal/mol; in Stage 2, we achieved RMSE_c = 0.29 kcal/mol with receipt ID v4jv4. These results firmly support that our mathematical deep learning models indeed perform better when no pose prediction errors are involved.
Subchallenge 2
In this subchallenge, there are 3 kinase families, namely VEGFR2, JAK2, and p38-α, with 85, 89, and 72 ligands, respectively. The challenge is to rank the affinities of all ligands in each kinase family. Our predictors did not perform well on these datasets. Our best result is second place on the active/inactive classification of the VEGFR2 set, with a best Matthews correlation coefficient (MCC) of 0.48 (receipt IDs qikvs and rtv8m).
Subchallenge 3
The third subchallenge involves the kinase JAK2, which already appeared in the second one. However, this challenge comprises only 17 compounds with small changes in chemical structure. Subchallenge 3 consists of two tasks, namely affinity ranking and relative binding affinity prediction. We obtained first place in the binding energy prediction with a centered RMSE as low as RMSE_c = 1.06 kcal/mol (receipt ID 4u5ey). On the affinity ranking, the performance of our models was not impressive. However, we still managed second place on the active/inactive classification with a Matthews correlation coefficient of 0.23 (receipt ID yqoad).
Subchallenge 4
Similar to the third subchallenge, the fourth one consists of 18 ligands with small changes in chemical structure. However, a new protein family, TIE2, is considered. The tasks are again to give an affinity ranking for the 18 ligands and absolute or relative binding energies for two subsets of 4 and 6 compounds. Interestingly, our models performed extremely well on the TIE2 dataset: we achieved first place for all the evaluation metrics considered in this subchallenge. Specifically, for the affinity ranking excluding Kds > 10 µM, our model (receipt ID uuihe) produced the best Kendall's τ and Spearman correlation coefficient among all participants, with values of 0.57 and 0.76, respectively. For the active/inactive classification including compounds having Kds > 10 µM, our model (receipt ID uuihe) was again ranked first with MCC = 0.78. On the absolute free energy predictions, the top results were also produced by our models. Specifically, on Set 1, our predictor with receipt ID vwbp8 was ranked first with MCC = 1.0; on Set 2, our model with receipt ID 5g8ed attained RMSE_c = 1.02 kcal/mol, the lowest among all submissions.
Subchallenge 5
The last subchallenge in GC3 measures the accuracy of models on predicting the binding affinity change upon mutation. ABL1 is the protein target for this subchallenge, and there are two compounds and five mutants. The challenge is to predict the ranking of all mutants for each of the two ligands. Our models performed quite decently on this task. Our best submission, receipt ID rdn3k, achieved the best Kendall's tau (τ = 0.52) for the affinity ranking excluding Kds > 10 µM.
III.B Grand Challenge 2
The second grand challenge had 36 ligands with crystal structures and binding data for 102 ligands. All these compounds bind to the FXR target. The predictive tasks are the same as those of Subchallenge 1 in GC3. Specifically, GC2 consisted of two stages. The first stage included (i) pose prediction for 36 ligands; (ii) binding affinity ranking for 102 compounds; and (iii) absolute or relative free energy predictions for two subsets of 18 and 15 ligands, respectively. The second stage, with released structures, asked for the same tasks as the previous one except for pose prediction.
We employed the machine learning based scoring function to select the best poses for all prediction tasks. Although our pose ranking power was not impressive, the free energy predictions of our models performed quite well. Specifically, our submission with receipt ID 5bvwx was ranked second in free energy Set 1 of Stage 1 with RMSE_c = 0.68 kcal/mol. In Stage 2, our models improved the accuracy of the energy prediction for compounds in the aforementioned free energy set. In fact, we obtained first place in terms of the Kendall's tau value (τ = 0.41) with receipt ID 4rbjk, which was also the highest Kendall's tau value among all submissions in the two stages for free energy Set 1. Figure 4 plots the performance of all submissions on free energy Set 1 in Stage 2, with our best submission highlighted in red.
IV Conclusion
In this work, we report the performance of our mathematical deep learning strategy on the binding affinity tasks in D3R GC2 and across the 5 subchallenges of D3R GC3. The multiscale weighted colored graph and element specific persistent homology representations are the main descriptors in our models. We also employed a variety of machine learning algorithms, including random forest and deep convolutional neural networks, for the energy predictions. Overall, in GC2, our predictive models achieved the top place in free energy prediction for free energy Set 1 in Stage 2. In GC3, our submissions were ranked 1st in 10 out of 26 official evaluation tasks. These results rigorously confirm the predictive power and practical usefulness of our mathematical deep learning models in drug discovery. It is worth mentioning that docking accuracy is still a bottleneck of our affinity prediction performance. We tried a variety of docking protocols, namely align-close, align-target, close-dock, cross-dock, constraint-IFD, and free-IFD, for pose selection in GC3. However, none of them played a dominant role in binding affinity accuracy. In addition, when one excludes the pose prediction error, the Kendall's tau of our model improves from 0.21 to 0.54 on the affinity ranking of compounds in the Cathepsin S subchallenge. Therefore, the development of a state-of-the-art docking protocol is the major task in our roadmap to improve the accuracy of binding energy prediction when crystallographic structures are not available.
Figure 4: Performance comparison of different submissions on free energy prediction for free energy Set 1 in Stage 2 of D3R GC2. Our best prediction, with receipt ID 4rbjk and shown in red, achieved the top performance with Kendall's τ = 0.41.
Acknowledgments
This work was supported in part by NSF grants IIS-1302285, DMS-1160352, and DMS-1761320, and by the MSU Center for Mathematical Molecular Biosciences Initiative.