GPU-Accelerated Drug Discovery with Docking on the Summit Supercomputer: Porting, Optimization, and Application to COVID-19 Research
Scott LeGrand, Aaron Scheinberg, Andreas F. Tillack, Mathialakan Thavappiragasam, Josh V. Vermaas, Rupesh Agarwal, Jeff Larkin, Duncan Poole, Diogo Santos-Martins, Leonardo Solis-Vasquez, Andreas Koch, Stefano Forli, Oscar Hernandez, Jeremy C. Smith, Ada Sedova
Scott LeGrand
NVIDIA Corporation, Santa Clara, California
Aaron Scheinberg
Jubilee Development, Cambridge, Massachusetts
Andreas F. Tillack
Scripps Research, La Jolla, California
Mathialakan Thavappiragasam
Oak Ridge National Laboratory, Oak Ridge, Tennessee
Josh V. Vermaas
Oak Ridge National Laboratory, Oak Ridge, Tennessee
Rupesh Agarwal
University of Tennessee, Knoxville, Knoxville, Tennessee
Jeff Larkin
NVIDIA Corporation, Santa Clara, California
Duncan Poole
NVIDIA Corporation, Santa Clara, California
Diogo Santos-Martins
Scripps Research, La Jolla, California
Leonardo Solis-Vasquez
TU Darmstadt, Darmstadt, Germany
Andreas Koch
TU Darmstadt, Darmstadt, Germany
Stefano Forli
Scripps Research, La Jolla, California
Oscar Hernandez
Oak Ridge National Laboratory, Oak Ridge, Tennessee
Jeremy C. Smith
UT/ORNL, Oak Ridge, Tennessee; University of Tennessee, Knoxville, Tennessee
Ada Sedova
Oak Ridge National Laboratory, Oak Ridge, Tennessee ([email protected])
ABSTRACT
Protein-ligand docking is an in silico tool used to screen potential drug compounds for their ability to bind to a given protein receptor within a drug-discovery campaign. Experimental drug screening is expensive and time consuming, and it is desirable to carry out large scale docking calculations in a high-throughput manner to narrow the experimental search space. Few of the existing computational docking tools were designed with high performance computing in mind. Therefore, optimizations to maximize use of high-performance computational resources available at leadership-class computing facilities enable these facilities to be leveraged for drug discovery. Here we present the porting, optimization, and validation of the AutoDock-GPU program for the Summit supercomputer, and its application to initial compound screening efforts to target proteins of the SARS-CoV-2 virus responsible for the current COVID-19 pandemic.

This manuscript has been authored by UT-Battelle, LLC under Contract No. DE-AC05-00OR22725 with the U.S. Department of Energy.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].
ACM-BCB 2020, August 30–September 2, Virtual
© 2020 Association for Computing Machinery.
ACM ISBN 978-x-xxxx-xxxx-x/YY/MM...$15.00
https://doi.org/10.1145/nnnnnnn.nnnnnnn
CCS CONCEPTS
• Applied computing → Computational biology.

KEYWORDS
Drug discovery, high-performance computing, GPU acceleration, protein-ligand docking
ACM Reference Format:
Scott LeGrand, Aaron Scheinberg, Andreas F. Tillack, Mathialakan Thavappiragasam, Josh V. Vermaas, Rupesh Agarwal, Jeff Larkin, Duncan Poole, Diogo Santos-Martins, Leonardo Solis-Vasquez, Andreas Koch, Stefano Forli, Oscar Hernandez, Jeremy C. Smith, and Ada Sedova. 2020. GPU-Accelerated Drug Discovery with Docking on the Summit Supercomputer: Porting, Optimization, and Application to COVID-19 Research. In Proceedings of 11th ACM Conference on Bioinformatics, Computational Biology, and Health Informatics (ACM BCB) (ACM-BCB 2020). ACM, New York, NY, USA, 11 pages. https://doi.org/10.1145/nnnnnnn.nnnnnnn
1 INTRODUCTION
The binding of a drug to a protein target in vivo can elicit a molecular response, such as the inhibition of a cellular function, and forms the basis of targeted drug discovery. Computational protein-ligand docking can be utilized for the rapid structure-based screening of small molecule drug candidates for target binding [5, 21]. Three-dimensional structural models of both a protein receptor and a set of small molecule compounds, or ligands, are employed to computationally predict the ability of a ligand to bind to the receptor, using an optimization algorithm over a scoring function that can be either a physics-based empirical energy potential or a statistical potential [10, 16, 31]. Advances in high-throughput experimental screening, both cell-based [18, 26] and molecular [7, 12], have allowed tens of thousands of chemical compounds to be tested simultaneously. However, these assays are expensive, and the ability to computationally filter chemical compounds for their propensity to bind to a target protein can significantly reduce the overall cost and time to solution. The data to be evaluated are plentiful, as the relevant chemical search space consists of billions of compounds. For instance, the Enamine REAL database (https://enamine.net) contains over a billion small molecules, and the ZINC 15 database contains 230 million ready-to-dock compounds [29]. Thus the number of compounds that can potentially be screened with in silico docking to complement experimental efforts is now on the order of billions. Increases in processor speeds and the number of cores on large nodes of computing clusters and cloud resources have made meeting this docking challenge theoretically attainable. Unfortunately, many docking programs are not inherently designed with massive high-throughput screens in mind.
Most open source docking programs commonly applied in academic research are CPU-based, single-node, and utilize file-based input and output [16, 31]. Such codes are often designed to perform a single ligand-docking calculation per run instance of the executable.

A variety of programs exist for ligand docking, with some using empirical but physics-based potential energy descriptions and others using fully empirical statistical potentials trained on a set of experimental structures of ligands bound to proteins [16, 17, 31]. The widely employed AutoDock program saw its inception in 1990 [8] and has undergone several major changes, with the latest version being AutoDock4. This program incorporates a physics-based energy function that includes entropic and solvation terms, in addition to van der Waals, hydrogen-bond, and electrostatic interactions [15, 16]. The energy minimization procedure consists of a Lamarckian genetic algorithm (LGA), whereby a local search accompanies the standard genetic optimization, and the improved ligand pose that is obtained from this local optimization is the input for the crossover portion of the algorithm. The original local search method is based on the Solis-Wets random optimization algorithm [27].
General purpose graphics processing units (GPUs) are used to accelerate dense numerical calculations in a variety of settings, from high-performance computing (HPC) facilities to data centers and cloud resources. There are several commonly used application programming interfaces (APIs) for programming NVIDIA GPUs, including CUDA and OpenCL. A few docking programs have in the past five years made use of GPUs to accelerate their calculations [4, 14, 28]. Recently, Scripps Research, in collaboration with TU Darmstadt, developed an accelerated version of their AutoDock4 program using OpenCL, which provided up to 50× speedup over the single-threaded CPU version [24, 25, 28]. OpenCL was chosen as the programming model for using the accelerator because it provides code portability to various types of architectures, for instance GPUs, CPUs, and FPGAs, from multiple vendors. This program has been named AutoDock-GPU, and while it uses the same physics-based empirical potential energy function, it includes new algorithmic additions and changes that can improve both performance and the quality of the results [24, 25, 28]. In particular, within the AutoDock-GPU program, there are now two possible algorithms for the local search: (1) the Solis-Wets (SW) method of random optimization, and (2) the ADADELTA gradient-based method [33]. The ADADELTA method was added to improve the local refinement of results, especially for ligands with larger numbers of torsions. AutoDock-GPU uses run parameters similar to those of AutoDock4, with some minor changes, including a renaming of some parameters, changes in some defaults, and some different hyperparameters. The run parameters important to the current study include the nrun parameter, which designates the number of independent complete LGA calculations that are performed in one instance of the executable. Each run results in a separate final pose, which is written to the output files.
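Conceptually, each of the nrun independent LGA replicas yields its own final pose and score. A minimal Python sketch of collecting these outputs (the pose records and energy values here are hypothetical stand-ins, not the program's actual data structures):

```python
import random

def best_pose(poses):
    """Pick the pose with the lowest predicted binding energy (kcal/mol; lower is better)."""
    return min(poses, key=lambda p: p["energy"])

# Hypothetical results of nrun=10 independent LGA replicas for one ligand.
random.seed(7)
poses = [{"run": i, "energy": round(random.uniform(-9.0, -4.0), 2)}
         for i in range(10)]
top = best_pose(poses)
```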
A drug screen would take the best scoring pose from these independent outputs. In addition, a recently developed feature is the autostop parameter, which allows the local search algorithm to finish early (with respect to the value set in the nev parameter, which sets the total number of LGA iterations per run) if the energy value has not changed by a sufficient amount, as measured by the standard deviation over the past 10 iterations.

Several previous efforts have enabled the deployment of docking codes such as AutoDock4 and AutoDock Vina on large compute resources, from clusters to supercomputers. Vina MPI used an MPI wrapper to enable the simultaneous launching of thousands of parallel AutoDock Vina executables and was run on the OLCF Jaguar and Titan supercomputers [3]. A similar effort at Lawrence Livermore National Laboratory resulted in a program called VinaLC [34]. Recently, a GNU Bash-based pipeline was created for docking millions of ligands on clusters [9]. GPU acceleration has also been included in docking programs. GeauxDock created a novel energy function, consisting of physics-based and statistical terms, that ran on GPUs [4], but has seen limited application or updates. To our knowledge, the version of AutoDock-GPU described in this paper is the only docking program using the AutoDock4 potential that is optimized for the most recent GPUs and also adds the capability to optimally process thousands of ligand input files sequentially with a single executable.
The ability to rapidly dock millions to billions of ligands to protein receptors to help find inhibitors for viral proteins could be invaluable to experimental drug discovery efforts. Correspondingly, harnessing leadership computing resources for drug discovery efforts against the SARS-CoV-2 virus could provide an invaluable resource with which to battle the current COVID-19 pandemic [22]. Applying this approach is part of the collaborative international research effort recently formed around discovering therapeutics to mitigate morbidity and mortality caused by the disease. The work presented here takes the first step in facilitating the deployment of a GPU-enabled, open source docking program on the Summit supercomputer, and its optimization for high-throughput docking of large numbers of ligands in a single instance. With this capability in place, future work involving the deployment of efficient platforms to enable running the AutoDock-GPU program at scale on all of Summit's 27,648 GPUs can be pursued.
A number of additions and changes were made to the AutoDock-GPU program to create a version that made effective use of the nodes on Summit and supported massive, high-throughput docking screens. In particular, we focused on the case where a very large number of ligands (millions to billions) would be docked to a single receptor. Fig. 1 illustrates the design of this version of the program, showing the addition of OpenMP threading for creating a pipeline that hides latency from file I/O by staging ligand files while the GPU is busy docking. In addition, the receptor data is now explicitly reused, further reducing I/O when docking thousands of compounds to the same protein target. These changes are detailed in the following subsections.
The original AutoDock-GPU program developed by Scripps and TU Darmstadt was written in OpenCL to improve portability across GPU vendors, and enabled threaded execution on CPUs. However, OpenCL is not supported on all platforms, including the combination of NVIDIA GPUs and POWER9 CPUs present on Summit.
Figure 1: Schematic of the new HPC-friendly AutoDock-GPU program, showing the OpenMP threading-based pipeline for hiding ligand input and staging, and the receptor-reuse functionality for docking many ligands to a single receptor. Noted are the locations (CPU, GPU) where steps are executed. Black arrows indicate operations performed once, and blue arrows indicate those performed multiple times.
Therefore, it was necessary to port AutoDock-GPU to CUDA to leverage the V100 GPUs on Summit and run AutoDock-GPU at scale. This provides not only large speedups over CPU-based code but also a reduction in power consumption for the same compute task on Summit [6]. This porting is not effortless, due to philosophical and design differences between OpenCL and CUDA. OpenCL was created in 2009 as an open and portable alternative to CUDA [30], and has maintained the commitment to remain free of hardware-specific programming elements. While the OpenCL API resembles CUDA's API in many ways, it is a vendor-neutral solution with a focus on portability across CPUs, GPUs, FPGAs, and embedded devices, and thus cannot expose architecture-specific hardware features. CUDA, on the other hand, has been developed in tandem with NVIDIA GPU hardware and exposes new architectural features with each release. While OpenCL has the advantage of running on any device that supports it, porting OpenCL code to CUDA also provides significant opportunities to exploit those vendor-specific hardware advances, which would otherwise squander the transistors on NVIDIA GPUs that enable them.

In order to create an initial port that could reproduce the OpenCL results, we transcribed the AutoDock-GPU OpenCL version (forked April 3rd from the "develop" branch) by hand to CUDA. This created an initial executable which performed at roughly half the speed of the OpenCL version. Next, we applied basic optimization techniques to the code, such as replacing calls to pow with intrinsics and dynamically allocating shared memory to allow more threads per processor. These first optimizations enabled the CUDA version to run 2.8× faster than the OpenCL version on an NVIDIA RTX 2080 Ti GPU for the ADADELTA algorithm. Back-porting these optimizations to the OpenCL code showed little to no benefit there, for the following reasons: OpenCL already has intrinsics for pow(float, int), and the shared memory amount is known to OpenCL at compile time.

Although the OpenCL and CUDA implementations of AutoDock-GPU were able to accelerate docking calculations significantly, AutoDock-GPU was designed to take a single receptor and a single ligand as command-line inputs. If one wished to dock multiple ligands with a single receptor (or vice versa), one would have to run the executable separately for each "job." For a large ligand set, this is both tedious for the user and inefficient in terms of resources. Here we describe several steps taken to maximize performance in the context of massive docking calculations.
To improve the code for this use case, we enabled the user to provide a list of protein and ligand files, accessible through a -filelist flag. The program then loops over the provided list and performs the docking calculations for each entry. To further optimize the performance of GPU-based high-throughput docking of thousands of ligands to a single receptor, we enabled the program to reuse the 16 receptor maps, of up to hundreds of megabytes, containing the interaction grid information for atoms in ligands, rather than re-read the map files for every new ligand.
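The batching logic can be sketched as follows. This is a minimal Python illustration of the described -filelist loop with receptor-map caching; the load_maps and dock helpers and the job tuples are illustrative stand-ins, not the program's actual interfaces:

```python
def run_filelist(jobs, load_maps, dock):
    """Dock each (receptor, ligand) pair, parsing receptor grid maps only once.

    jobs: iterable of (receptor_path, ligand_path) pairs, as a file list would supply.
    """
    cache = {}
    results = []
    for receptor, ligand in jobs:
        if receptor not in cache:            # parse the receptor's grid maps once
            cache[receptor] = load_maps(receptor)
        results.append(dock(cache[receptor], ligand))
    return results

# Usage with stand-in callables: three ligands, one receptor -> one map load.
loads = []
res = run_filelist(
    [("recA", "lig1"), ("recA", "lig2"), ("recA", "lig3")],
    load_maps=lambda r: loads.append(r) or {"maps": r},
    dock=lambda maps, lig: (maps["maps"], lig),
)
```

The point of the cache is that the expensive map parsing happens once per receptor rather than once per ligand.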
After enabling multiple jobs to run consecutively in the same executable, we could begin performance optimization of this new pipeline. Depending on the parameters specified, a significant portion of the run time is spent in the setup and post-processing phases. These phases are performed on the host and include I/O, allocations, and some initial and final calculations less suitable for the GPU. If jobs are executed serially, the GPU is idle during these phases and thus underutilized. To reduce this latency, we introduced OpenMP directives so that CPU-based threads could work in parallel to load ligand input files, transfer them onto the GPU, launch the GPU-based docking kernels, and process and write the output of the calculation. This is illustrated in Fig. 1. To ensure that jobs do not conflict, all CPU-GPU communication (besides the initial receptor map setup described above) and the docking algorithm itself (consisting of multiple CUDA kernels) are placed in a function called from an omp critical section. As a result, the OpenMP threads perform the setup and post-processing phases of each job in parallel, while queuing up to use the GPU one by one for the docking simulation itself. Combined with the reduction in setup time from the optimizations detailed below, this threading approach is sufficient to hide almost all setup and post-processing time and ensure the GPU is rarely idle.
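A minimal Python analogue of this pattern, with threading.Lock standing in for the omp critical section (the job payloads, worker count, and helper names are illustrative assumptions, not the program's actual code):

```python
import threading

gpu_lock = threading.Lock()        # plays the role of the omp critical section
results_lock = threading.Lock()

def process_job(job, gpu_log):
    _parsed = f"parsed:{job}"      # host-side setup: file I/O, parsing (runs in parallel)
    with gpu_lock:                 # docking kernels: only one job on the "GPU" at a time
        gpu_log.append(job)
    return f"done:{job}"           # host-side post-processing (parallel again)

gpu_log, results = [], []

def worker(j):
    r = process_job(j, gpu_log)
    with results_lock:
        results.append(r)

# 7 worker threads, mirroring the 7 CPU cores of a Summit resource set.
threads = [threading.Thread(target=worker, args=(j,)) for j in range(7)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

While one thread holds gpu_lock, the other six can overlap their setup and post-processing work, which is the latency-hiding effect described above.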
We also improved handling of GPU memory and host-device communication. Instead of creating and destroying the CUDA context anew for each ligand input, the context is now created once, and GPU memory is preallocated and reused for each ligand. Similarly, rather than transfer the receptor grid maps onto the GPU for every ligand, all of the receptor's grids are sent once during initialization and left on the device. Because each ligand requires a different subset of these grids, a mapping was required to point to the grids necessary for a specific ligand docking. Historically, creation of CUDA contexts in memory, GPU memory allocation, and transfer of data to the GPU have been bottlenecks in GPU programming. On Summit, with the improved NVLINK2 interconnect, data transfer onto the GPU is less expensive, but still considerable. The V100 GPUs on Summit contain 16 GB of global memory, which is enough to store all possible receptor interaction grids for supported atom types.
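The grid-index mapping can be sketched in Python as a small lookup table: all grids reside on the device in a fixed order, and each ligand carries indices into that order for just the atom types it contains. The type list and slot layout below are illustrative assumptions (AutoDock-style type names plus electrostatic and desolvation grids), not the program's actual device layout:

```python
# Device-resident grid order: one grid per supported atom type, plus
# electrostatic ("e") and desolvation ("d") grids (illustrative layout).
ALL_TYPES = ["A", "C", "HD", "N", "NA", "OA", "SA", "e", "d"]
type_to_slot = {t: i for i, t in enumerate(ALL_TYPES)}

def grid_indices(ligand_atom_types):
    """Build a per-ligand lookup table into the preloaded grid array."""
    return [type_to_slot[t] for t in ligand_atom_types]

# A ligand containing aliphatic carbon, acceptor oxygen, and polar hydrogen
# only needs three of the preloaded grids.
idx = grid_indices(["C", "OA", "HD"])
```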
We tested the performance of the new CUDA version of AutoDock-GPU, with and without the OpenMP-based pipeline, and compared it to the performance of the OpenCL version using an NVIDIA DGX-2 appliance hosted at ORNL. The DGX-2 contains 16 NVIDIA V100 GPUs and dual Intel Xeon Platinum 8168 CPUs with 24 cores each. The x86 architecture of the DGX-2 platform permits a direct comparison between the original OpenCL AutoDock-GPU implementation and the CUDA port. Both executables were built within an Ubuntu 18.04 Singularity container with CUDA 10.1, which implements OpenCL 1.2, and version 7.5.0 of the GNU compiler collection. These tests were performed on a set of 42 experimental crystallographic structures of ligands bound to enzyme active sites, obtained from the Research Collaboratory for Structural Bioinformatics (RCSB) Protein Data Bank (PDB) [1], and converted to input files for AutoDock-GPU using the OpenBabel program [20]. The set of inputs and the PDB IDs for these structures can be found at https://github.com/diogomart/AD-GPU_set_of_42. This set is hereafter referred to as S42. The S42 dataset contains a variety of different ligand sizes and numbers of torsions, or rotatable bonds, and a range of different sized search boxes defined by the input grid coordinates.

We then tested the performance of the CUDA/pipeline version of AutoDock-GPU on a large benchmark set of ligands on Summit. For Summit tests we used GNU compiler collection version 6.4 and CUDA version 10.1.243. Summit currently runs Red Hat Enterprise Linux Server version 7.6 (Maipo). The ligand dataset consisted of 9,000 ligands taken from the full SWEETLEAD database [19] (as acquired in March of 2020), with ligands containing atom types unsupported by the AutoDock Utilities tools [16, 28] removed, and supplemented with ligands from the NCI. This data set will be referred to as SN9000.
Performance was measured using both local search algorithms (Solis-Wets and ADADELTA), and compared to the performance of both AutoDock4 and AutoDock Vina on Summit. The SARS-CoV-2 endoribonuclease protein (NendoU) crystal structure recently deposited in the RCSB PDB, PDB ID 6VWW [13], was the target receptor.

Finally, we explored in detail the different optimizations which were added, namely, the application of the OpenMP pipeline, the addition of context reuse, and the reuse of the receptor grids by storing all of them in the GPU's global memory. We performed these tests on Summit using a single resource set on a single compute node, consisting of one GPU and either 1 (for the non-pipelined version) or 7 (for the pipelined version) physical CPU cores with one thread per core, with the pipeline using 1 and 7 OpenMP threads, respectively. For this test we picked a random subset of about 400 ligands from the SN9000 ligand set, and again docked to the NendoU receptor with a search box of size 22.5 Å per side using both local search methods (Solis-Wets and ADADELTA). We chose ligands with small numbers of torsions (≤10) to test the performance of the Solis-Wets method, and ligands with more than 10 torsions for ADADELTA. We performed the test using the number of runs (nrun) parameter set to 10 and 100 for both cases, and with the newly implemented autostop functionality both on and off.

For all receptors, any bound small molecule ligand, ions, and water molecules were removed from the structures using MOE [32]. Hydrogen atoms were then added, and each structure was saved in its apo form. The AutoDockTools script (prepare_receptor4.py) was then applied to convert the PDB files to PDBQT format, and AutoGrid4 was employed to generate the atom-specific affinity, electrostatic potential, and desolvation potential maps for each receptor. Details of experiments on viral Mpro (Section 4) are given in that section. Molecular images were made with VMD [11] and UCSF Chimera [23].
Here we describe the results of performance testing and validation of the CUDA/pipeline version of AutoDock-GPU on several potential use-cases for high-throughput docking. The parameters that are varied in these different situations include the number of ligands docked against a receptor, the size of the allowed search box, the number of torsions in the ligands, the local search algorithm used, and the number of replicas of the calculation that are performed for each docking (the nrun parameter). We break down the performance gains imparted by the different improvements detailed in the Methods section. We also confirmed that the results obtained with the new version are consistent with those obtained using the original OpenCL version.
Results for running the S42 benchmark on the DGX-2 are shown in Fig. 2. Each of the 42 receptor-ligand pairs was tested with three random starting conformations of the corresponding ligand. The tests were performed with nrun set to 10 and using the ADADELTA local search method, with all other parameters set to default. The CUDA version here shows a 1.8× mean speedup over the OpenCL version with this dataset, and the pipeline provides an additional 2.4× speedup over the un-pipelined CUDA version, resulting in an overall 4.4× mean speedup over the original OpenCL version, for this particular set of receptors, ligands, box sizes, and run parameters.

Another metric over which to compare docking algorithms is the generated poses themselves. Leveraging the experimentally determined positions for ligands, we can check how well the predicted bound poses match experiment.
Figure 2: AutoDock-GPU performance, using the ADADELTA algorithm, on a DGX-2 on a set of 42 receptors with 3 ligands each with nrun=10, permitting the OpenCL, CUDA, and pipelining features to be compared directly.
This is known as re-docking. Because the S42 dataset consists of experimental structures of ligands bound to receptors, it can be harnessed not only to validate the ability of the AutoDock-GPU program to reproduce experimental ligand poses, but also to validate the consistency of outputs after major changes to the program. Here we show the cumulative root mean squared (Euclidean) distance (RMSD) distribution between the final best pose from docking and the crystallographic pose from the S42 set (Fig. 3). The RMSD is a standard measure of the three-dimensional similarity between conformations of a molecule. Note that some ligands within this test set are intentionally difficult, with a large number of torsions and a commensurately large search space, and so not all poses found are near the experimental position. Despite these challenges, the median RMSD over the full 42 ligand set ranges from 1.28 to 1.85 Å for the CUDA implementation of AutoDock-GPU, depending on the local search algorithm. This indicates useful agreement with the experimental poses. As expected, the docking quality improves with increased computational effort, although the improvement is not linear, and usable results can be obtained with a comparatively small number of runs, particularly with the ADADELTA local search algorithm. The small differences in the results between OpenCL and CUDA implementations within Fig. 3 are not indicative of qualitatively different results, but are instead consistent with the stochasticity inherent in docking algorithms.
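The RMSD between a docked pose and the crystallographic pose can be computed as a plain Euclidean deviation over matched atoms. A minimal sketch, assuming atom correspondence is already established and no realignment is performed (as is typical for re-docking, since docked coordinates already sit in the receptor frame); the toy coordinates are invented for illustration:

```python
import math

def rmsd(coords_a, coords_b):
    """Root mean squared deviation (in Å) between two matched coordinate sets."""
    assert len(coords_a) == len(coords_b) and coords_a
    sq = sum((ax - bx) ** 2 + (ay - by) ** 2 + (az - bz) ** 2
             for (ax, ay, az), (bx, by, bz) in zip(coords_a, coords_b))
    return math.sqrt(sq / len(coords_a))

# Toy 3-atom ligand: docked pose shifted 1 Å along x from the crystal pose,
# giving an RMSD of exactly 1.0 Å.
crystal = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (0.0, 1.5, 0.0)]
docked = [(1.0, 0.0, 0.0), (2.5, 0.0, 0.0), (1.0, 1.5, 0.0)]
```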
For these tests we used the SARS-CoV-2 endoribonuclease protein (NendoU) crystal structure recently deposited in the RCSB PDB, PDB ID 6VWW [13]. In addition, we tested both the Solis-Wets local search method and the ADADELTA method introduced with AutoDock-GPU. We created two test cases for evaluating two extremes of search-space size. The first is an exhaustive search over the protein, known as "blind" docking, together with a complete set of ligand torsional degrees of freedom. The second uses a focused, small search box, limited to the active site region that must be pre-determined, together with a subset of the ligand dataset that contained only those ligands with ≤10 torsions. Fig. 4 shows the different search spaces on the receptor.

Figure 3: Cumulative distribution of RMSD for the re-docked S42 set against the initial crystal structure, noted here as RMSDc. The cumulative distributions were calculated for both OpenCL (solid) and CUDA (dashed) implementations and both ADADELTA and Solis-Wets local searches.

Figure 4: Blind docking (large box, light grey) and focused docking (small box, red) regions around the SARS-CoV-2 (green cartoon) endoribonuclease protein active site (PDB ID 6VWW). Blind docking uses a larger search space (gray), and the small box (red) focuses on a small region directly around the active site.
In the large search space test, the ligands were allowed to bind anywhere in a large search volume (gray region, Fig. 4), which was a box 40 Å per side centered on the active site. In addition, we used the full set of ligands, which included those with ≥10 torsional degrees of freedom, or rotatable bonds. We set nrun=100 in order to explore this larger search space sufficiently. The dockings were performed with both the SW and the ADADELTA methods.
For the small search space test, we docked to the known binding site. In this case, only a small region (22.5 Å per side) of the protein was explored for ligand binding (red box, Fig. 4), selected based on the final binding location for the majority of ligands in the blind docking trials, and only those ligands with ≤10 torsions were used. The smaller search box, coupled with selecting for simpler ligands in this test, means that fewer runs were needed to saturate the space, and we therefore set nrun=10.

Figure 5: Runtime distribution for docking calculations under varying conditions with alternative docking programs. Left: violin plots of runtime distributions with maximum, median, and minimum runtimes for ligands indicated adjacent to their position within the distribution. Right: runtime distributions illustrating growth in runtime as ligands increase in size.
The two previous test cases were also used to compare the performance of AutoDock-GPU to equivalent calculations using the CPU-only AutoDock4 [16] or AutoDock Vina [31] programs when performed on Summit. For the large search space test, we compared the time to solution using AutoDock Vina against the time to solution for AutoDock-GPU on Summit. AutoDock4 was excluded from this analysis, as the high number of runs required to sample the space (nrun=100) would routinely exceed the allowed walltime on Summit. For the small search space test, a lower nrun setting permitted AutoDock4 to be added to the comparison set. The results of these runtime tests under these two conditions are summarized in Fig. 5. For the large space test, AutoDock Vina runtimes are observed to grow rapidly as a function of ligand complexity. As a consequence, while the median docking times are only a factor of 16 slower in the large space case, the aggregate run times for all AutoDock Vina calculations are a factor of 25 slower due to particularly poor performance for the largest ligands. By contrast, AutoDock-GPU tackles even these challenging ligands in under two minutes (Fig. 5). The general runtime distribution for AutoDock4 and AutoDock-GPU on Summit is similar, with additional ligand complexity only modestly increasing the runtime for an individual ligand (Fig. 5). However, whereas AutoDock4 has significantly longer runtimes even for the simple ligand set selected, AutoDock-GPU leverages the parallelism inherent to the GPUs to bring the runtimes down significantly. The large reduction in runtimes and narrower time distribution may permit large ligand sets to be screened on Summit on a routine basis.
Table 1: Performance Improvements from Sequential Optimizations: Solis-Wets. Shown is mean time in seconds over ligands run sequentially on one GPU using the pipeline with either 1 or 7 threads. Ligands in this set contained 10 or fewer torsions. CR: context reuse; F/RR: receptor file and receptor grid reuse on GPU. Each optimization is added on top of the previous ones going from left to right.
We also performed a systematic test of the incremental contributions of each of the optimizations to the overall speedup of the new HPC-centric version of AutoDock-GPU. 412 randomly selected ligands from the SN9000 dataset were picked for this analysis. We computed the average time in seconds per ligand for values of the nrun parameter set to 10 and 100, for the program with and without threading (by setting the OMP_NUM_THREADS environment variable to 1 and 7, respectively), and tested different stages of optimization of the program, added incrementally: with the addition of CUDA context reuse, and with the reuse of the receptor input files and of this data by employing GPU global memory to store all required grids for all ligands in a batch. Tables 1 and 2 show these effects for the SW and ADADELTA algorithms, respectively. Each new optimization is added on top of the previous one in these tables.
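A benchmarking loop of this kind can be sketched as a small harness; the default binary name and the way flags are passed below are illustrative placeholders rather than the exact AutoDock-GPU invocation used on Summit:

```python
import os
import statistics
import subprocess
import time

def mean_time_per_ligand(ligands, nrun, threads, binary="autodock_gpu",
                         extra_flags=()):
    """Time one docking run per ligand and return the mean wall time.

    `binary` and the flag spelling are placeholders for illustration;
    OMP_NUM_THREADS is set per the threading configuration under test.
    """
    env = dict(os.environ, OMP_NUM_THREADS=str(threads))
    times = []
    for ligand in ligands:
        start = time.perf_counter()
        subprocess.run(
            [binary, "--lfile", ligand, "--nrun", str(nrun), *extra_flags],
            env=env, check=True,
        )
        times.append(time.perf_counter() - start)
    return statistics.mean(times)
```

Running this once per optimization stage (baseline, + context reuse, + file/grid reuse) for nrun in {10, 100} and threads in {1, 7} reproduces the grid of conditions reported in Tables 1 and 2.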
Table 2: Performance Improvements from Sequential Optimizations: ADADELTA. Mean run time (in seconds) calculated as in Table 1. CR: context reuse; F/RR: receptor file and receptor grid reuse on GPU. Each optimization is added on top of the previous ones going from left to right.

The pipeline provided a speedup for nrun = 10, but slightly less than a 2x speedup for nrun = 100. With nrun = 10, for both low-torsion sets, the file and receptor reuse optimization provided a significant further speedup of about 2 to 2.8x, while CUDA context reuse did not provide any additional speedup; conversely, context reuse provided a 2.2x speedup for SW with nrun = 100, where file and receptor reuse did not add performance. Fig. 6 summarizes the total speedup, with all optimizations, for all 6 cases.

Figure 6: Total speedup over all optimizations for variations of local search algorithm, torsion number, and nrun parameter (10 or 100) indicated by the numbers under the bars.

Table 3: Contribution of the autostop feature to time to solution. Right two columns (AS off, AS on) show mean run time per ligand in seconds. AS: autostop; alg: algorithm; SW: Solis-Wets; AD: ADADELTA. Threads used: 1.
The autostop parameter was added recently to the development version of AutoDock-GPU (OpenCL) and ported to the CUDA version. It allows the program to stop the local search if no progress is being made. Table 3 shows the speedup provided by using autostop, shown without the pipeline (threads set to 1). This feature helps most for larger nrun values combined with smaller molecules.

Many groups from around the world have been studying different SARS-CoV-2 proteins and solving their structures. The main protease (Mpro) in particular has attracted much attention as a potential drug target, and as such over 90 crystallographic structures containing a bound ligand have been released on the RCSB PDB, with many bound to the protease's active site region and others bound elsewhere on the protein. We therefore performed some preliminary docking calculations on ligands taken from this set. These calculations allow us to examine calculated binding energies and docking poses. Fig. 7 shows an overlay of some of these ligands bound to Mpro. From this set, 68 ligands are observed to interact with the enzyme active site, of which 22 are non-covalent ligands that have been crystallized. The remaining 46 ligands bound to the active site are covalently bound to the receptor, and this interaction cannot be represented in AutoDock Vina or AutoDock-GPU. Fig. 7 indicates that 25 crystallographically determined ligands bind non-covalently to Mpro, well outside of the active site region. All of these putative
ligands are fragments, and therefore are relatively small, with no more than 4 torsions.

Figure 7: Ligand binding locations (multicolored sticks) observed in crystal structures of the SARS-CoV-2 main protease (white cartoon).

Figure 8: Dendrogram of hierarchical clustering using MDL keysets. The small molecules were divided into 24 clusters (clades) based on structural similarity. Active-site binders are highlighted in green.

A hierarchical chemical clustering over all non-covalent binders to the protease was performed in chemical fingerprint space using MDL keysets [2], the Tanimoto similarity coefficient between compounds, and the Ward clustering linkage method with a clustering threshold of 0.8. The small molecules were divided into 24 clusters (clades) based on structural similarity (Fig. 8). All molecules in this set, both bound to the active site and outside of it, were found in similar clusters. Furthermore, many of the external binders were in the same clades as active-site binders. It is possible, considering their closeness in chemical fingerprint space as shown by the clustering analysis, that the external binders also occupied the Mpro active site for some time before being displaced during crystallization. We docked both the 22 active-site binders and the 25 external binders to the Mpro receptor. For these tests all docking calculations were performed with SW with nruns = 100. Interestingly, scores for both sets (active site and external binders) had a mean between -6.4 and -7.4 kcal/mol for all of the Mpro crystal structures tested, with best scores over each ligand set extending below -8.5 kcal/mol. Docking of the 22 non-covalent ligands to Mpro with AutoDock Vina was also performed (Fig. 9). Vina scores are shifted to higher values and have a more one-sided and less disperse distribution than AutoDock-GPU scores.
These differences reflect the different potentials and algorithms used in the two programs, and can be useful information when choosing a cut-off for score-based virtual screening.
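The fingerprint clustering described earlier can be sketched in plain Python. The paper used MDL keysets with Ward linkage (typically computed with a cheminformatics toolkit such as RDKit plus SciPy); this toy version uses hand-made bit sets and complete linkage, since Ward linkage assumes Euclidean distances:

```python
# Sketch of fingerprint-based hierarchical clustering. Distance between
# compounds is 1 - Tanimoto similarity; fingerprints are sets of "on" bits.

def tanimoto(a, b):
    """Tanimoto similarity between two fingerprints given as sets of on-bits."""
    return len(a & b) / len(a | b)

def cluster(fps, threshold=0.8):
    """Agglomerative clustering: merge while the closest pair is <= threshold."""
    clusters = [[i] for i in range(len(fps))]
    dist = lambda c1, c2: max(  # complete linkage over member pairs
        1 - tanimoto(fps[i], fps[j]) for i in c1 for j in c2)
    while len(clusters) > 1:
        d, i, j = min(
            (dist(clusters[i], clusters[j]), i, j)
            for i in range(len(clusters)) for j in range(i + 1, len(clusters)))
        if d > threshold:
            break
        clusters[i] += clusters.pop(j)  # j > i, so index i stays valid
    return clusters

# Toy fingerprints: compounds 0 and 1 share most bits; 2 is dissimilar.
fps = [{1, 2, 3, 4}, {1, 2, 3, 5}, {10, 11, 12}]
print(cluster(fps))  # [[0, 1], [2]] -- 0 and 1 merge, 2 stays alone
```

The threshold plays the same role as the 0.8 clustering cut used in the analysis above: pairs closer than the cut end up in the same clade.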
Figure 9: Histogram of the best binding free energies (scores) of ligands which were experimentally found to be bound non-covalently to the Mpro active site, docked to Mpro structure PDB ID 5RE9 using AutoDock-GPU and AutoDock Vina.
We docked the 22 active-site binders to several available Mpro crystal structures, PDB IDs: 5R7Y, 5R80, 5R81, 5R84, 5RE9, 5RF3, 5RGI, 6WQF, 6Y2E, using a box that was fit very tightly about the active site, of dimensions 18.75 Å per side.

Figure 10: Scores obtained by docking the 22 crystallized non-covalent active-site bound ligands, using three different crystal structures for Mpro, each with a slightly different active site. Residue names are given on the x-axis and corresponding PDB IDs on the left. All docking calculations were performed with SW using nruns = 100.
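A tight search box of this kind can be derived programmatically from the active-site atom coordinates; in this sketch the coordinates and padding are illustrative values, not the ones used for the runs above:

```python
# Sketch: derive a cubic docking search box (center + edge length) from
# active-site atom coordinates. Padding and coordinates are illustrative.

def search_box(coords, padding=4.0):
    """Return (center, edge) of the smallest cube covering coords plus padding."""
    xs, ys, zs = zip(*coords)
    lo = (min(xs), min(ys), min(zs))
    hi = (max(xs), max(ys), max(zs))
    center = tuple((l + h) / 2 for l, h in zip(lo, hi))
    edge = max(h - l for l, h in zip(lo, hi)) + 2 * padding
    return center, edge

# Toy active-site atom positions in Angstroms.
atoms = [(10.0, 12.0, 8.0), (14.0, 9.0, 11.0), (12.0, 15.0, 10.0)]
center, edge = search_box(atoms)
print(center, edge)  # (12.0, 12.0, 9.5) 14.0
```

Choosing the padding trades off pose accuracy for small fragments (smaller boxes) against correct placement of larger ligands, which is exactly the tension discussed for the T5V and GWS ligands below.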
Figure 11: Overlay of the five experimental crystal structures of Mpro which are bound to non-covalent ligands showing the most variation in active-site region conformation. Left and right panels show views rotated 180 degrees.
Re-docking refers to the docking of a ligand, crystallized in complex with a receptor, to the same receptor. We performed re-docking experiments on four of the 22 non-covalent binders, the ligands from PDB IDs 5R7Y, 5R84, 5RF3, and 5RGI. After docking, the top-ranked (lowest score) docked pose of the ligand was superimposed on the crystal structure ligand pose to compare them. 5RF3 (ligand residue name T5V) is a very small fragment and can dock in many locations in the large active site of Mpro, as shown in Fig. 12; it also receives a high value for the binding free energy, -4.75 kcal/mol. For two of the four ligands tested, PDB IDs 5R84 (ligand residue name GWS) and 5RGI (ligand residue name U0P), re-docking results were remarkably good, as illustrated in Fig. 13. 5R7Y was docked in a less accurate position than these two. Fig. 13 also shows a comparison to the same re-dockings with Vina, showing that AutoDock-GPU can calculate re-docked poses of comparable closeness to the crystallized position as Vina. Using a smaller search box could help to better locate very small ligands such as T5V; however, the smaller search box described above prevented the GWS ligand from being docked correctly, as shown in Fig. 14. This demonstrates some of the difficulties encountered in high-throughput docking, as it is difficult to find a search box that can ensure the best docking result for all ligand sizes, even with a very reduced subset containing only ligands with 4 torsions or fewer.
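Re-docking accuracy is commonly quantified as the RMSD between the docked and crystallographic ligand coordinates, with 2 Å a widely used success threshold. A minimal sketch, assuming both poses list the same atoms in the same order (a production comparison would also handle symmetry-equivalent atoms):

```python
import math

def rmsd(pose_a, pose_b):
    """RMSD between two poses given as equal-length lists of (x, y, z) in Angstroms."""
    assert len(pose_a) == len(pose_b)
    sq = sum((ax - bx) ** 2 + (ay - by) ** 2 + (az - bz) ** 2
             for (ax, ay, az), (bx, by, bz) in zip(pose_a, pose_b))
    return math.sqrt(sq / len(pose_a))

# Toy two-atom poses: the docked pose is shifted 1 A along z.
crystal = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0)]
docked = [(0.0, 0.0, 1.0), (1.5, 0.0, 1.0)]
print(rmsd(crystal, docked))  # 1.0 -> under 2 A, would count as a successful re-docking
```

Note that superposition of the receptor frames, as done for the comparisons above, must precede the RMSD calculation when docked and crystal poses come from different coordinate systems.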
Figure 12: PDB 5RF3 shown in orange cartoon with the crystallized T5V ligand shown in cyan (CPK), along with three key residues (HIS 41, CYS 145, HIS 163) found in the active site shown in orange with atoms visible (CPK). Docked position shown in magenta in CPK representation.

Figure 13: Two re-docking results comparing AutoDock-GPU and AutoDock Vina using PDB IDs 5RGI (top panel) and 5R84 (bottom panel). Vina: green; AutoDock-GPU: red; crystal structure: blue.

Figure 14: Two re-docking results (orange CPK) for the GWS ligand with Mpro structure 5R84 (cyan ribbons), using a search box of 26.25 Å per side (left panel) and 18.75 Å per side (right panel).

We have presented a new version of AutoDock-GPU that has been ported to CUDA and contains a number of optimizations to enable
more efficient processing of thousands of ligands per receptor to facilitate high-throughput structure-based in silico drug discovery. This version has enabled AutoDock-GPU to be deployed on the Summit supercomputer, which opens up this resource for massive computational efforts for therapeutics, especially for combating the current COVID-19 pandemic. The total speedup for flexible docking of small compounds with a moderate number of search iterations, using the new pipeline, is close to 10x relative to the previous single receptor/ligand workflow. This pipeline will be back-ported to the OpenCL version in future work, allowing this method to also be employed on non-NVIDIA systems. For deployment on Summit, next steps will involve integrating this program into workflow management systems to efficiently manage large-scale docking campaigns.

ACKNOWLEDGMENTS
This research was sponsored by the Laboratory Directed Research and Development Program at Oak Ridge National Laboratory (ORNL), which is managed by UT-Battelle, LLC, for the U.S. Department of Energy (DOE) under Contract No. DE-AC05-00OR22725, and used resources of the Oak Ridge Leadership Computing Facility, which is a DOE Office of Science User Facility supported under Contract DE-AC05-00OR22725. The development of AutoDock-GPU was supported by the National Institutes of Health (GM069832). The authors thank Jonathan Lefman and Geetika Gupta (NVIDIA) for essential coordination and communication support and collection of important feedback.
REFERENCES

[1] Helen M Berman, John Westbrook, Zukang Feng, Gary Gilliland, Talapady N Bhat, Helge Weissig, Ilya N Shindyalov, and Philip E Bourne. 2000. The Protein Data Bank. Nucleic Acids Research 28, 1 (2000), 235–242.
[2] Joseph L Durant, Burton A Leland, Douglas R Henry, and James G Nourse. 2002. Reoptimization of MDL keys for use in drug discovery. Journal of Chemical Information and Computer Sciences 42, 6 (2002), 1273–1280.
[3] Sally R. Ellingson, Jeremy C. Smith, and Jerome Baudry. 2013. VinaMPI: Facilitating multiple receptor high-throughput virtual docking on high-performance computers. J. Comput. Chem. 34, 25 (Sep 2013), 2212–2221. https://doi.org/10.1002/jcc.23367
[4] Ye Fang, Yun Ding, Wei P. Feinstein, David M. Koppelman, Juana Moreno, Mark Jarrell, J. Ramanujam, and Michal Brylinski. 2016. GeauxDock: Accelerating Structure-Based Virtual Screening with Heterogeneous Computing. PLoS One 11, 7 (Jul 2016), e0158898. https://doi.org/10.1371/journal.pone.0158898
[5] Stefano Forli, Ruth Huey, Michael E Pique, Michel F Sanner, David S Goodsell, and Arthur J Olson. 2016. Computational protein–ligand docking and virtual drug screening with the AutoDock suite. Nat. Protoc. 11, 5 (May 2016), 905–919. https://doi.org/10.1038/nprot.2016.051
[6] Abdullah Gharaibeh, Elizeu Santos-Neto, Lauro Beltrão Costa, and Matei Ripeanu. 2013. The energy case for graph processing on hybrid CPU and GPU systems. In Proceedings of the 3rd Workshop on Irregular Applications: Architectures and Algorithms. 1–8.
[7] J Goddard and J Reymond. 2004. Enzyme assays for high-throughput screening. Curr. Opin. Biotechnol. 15, 4 (Aug 2004), 314–322. https://doi.org/10.1016/j.copbio.2004.06.008
[8] David S Goodsell and Arthur J Olson. 1990. Automated docking of substrates to proteins by simulated annealing. Proteins: Structure, Function, and Bioinformatics 8, 3 (1990), 195–202.
[9] Christoph Gorgulla, Andras Boeszoermenyi, Zi-Fu Wang, Patrick D Fischer, Paul W Coote, Krishna M Padmanabha Das, Yehor S Malets, Dmytro S Radchenko, Yurii S Moroz, David A Scott, et al. 2020. An open-source drug discovery platform enables ultra-large virtual screens. Nature (2020).
[10] Front. Pharmacol.
[11] William Humphrey, Andrew Dalke, and Klaus Schulten. 1996. VMD: Visual Molecular Dynamics. J. Mol. Graphics 14 (1996), 33–38.
[12] Zhenming Jin, Xiaoyu Du, Yechun Xu, Yongqiang Deng, Meiqin Liu, Yao Zhao, Bing Zhang, Xiaofeng Li, Leike Zhang, Chao Peng, et al. 2020. Structure of Mpro from SARS-CoV-2 and discovery of its inhibitors. Nature (2020), 1–5.
[13] Youngchang Kim, Robert Jedrzejczak, Natalia I. Maltseva, Mateusz Wilamowski, Michael Endres, Adam Godzik, Karolina Michalska, and Andrzej Joachimiak. 2020. Crystal structure of Nsp15 endoribonuclease NendoU from SARS-CoV-2. Protein Sci. (May 2020), pro.3873. https://doi.org/10.1002/pro.3873
[14] Simon McIntosh-Smith, James Price, Richard B Sessions, and Amaurys A Ibarra. 2015. High performance in silico virtual drug screening on many-core processors. Int. J. High Perform. Comput. Appl. 29, 2 (May 2015), 119–134. https://doi.org/10.1177/1094342014528252
[15] Garrett M. Morris, David S. Goodsell, Robert S. Halliday, Ruth Huey, William E. Hart, Richard K. Belew, and Arthur J. Olson. 1998. Automated docking using a Lamarckian genetic algorithm and an empirical binding free energy function. J. Comput. Chem. 19, 14 (Nov 1998), 1639–1662. https://doi.org/10.1002/(SICI)1096-987X(19981115)19:14<1639::AID-JCC10>3.0.CO;2-B
[16] Garrett M. Morris, Ruth Huey, William Lindstrom, Michel F. Sanner, Richard K. Belew, David S. Goodsell, and Arthur J. Olson. 2009. AutoDock4 and AutoDockTools4: Automated docking with selective receptor flexibility. J. Comput. Chem. 30, 16 (Dec 2009), 2785–2791. https://doi.org/10.1002/jcc.21256
[17] Joseph A. Morrone, Jeffrey K. Weber, Tien Huynh, Heng Luo, and Wendy D. Cornell. 2020. Combining Docking Pose Rank and Structure with Deep Learning Improves Protein–Ligand Binding Mode Prediction over a Baseline Docking Approach. J. Chem. Inf. Model. (Mar 2020), acs.jcim.9b00927. https://doi.org/10.1021/acs.jcim.9b00927
[18] Gregory Nierode, Paul S. Kwon, Jonathan S. Dordick, and Seok-Joon Kwon. 2016. Cell-Based Assay Design for High-Content Screening of Drug Candidates. J. Microbiol. Biotechnol. 26, 2 (Feb 2016), 213–225. https://doi.org/10.4014/jmb.1508.08007
[19] Paul A. Novick, Oscar F. Ortiz, Jared Poelman, Amir Y. Abdulhay, and Vijay S. Pande. 2013. SWEETLEAD: an In Silico Database of Approved Drugs, Regulated Chemicals, and Herbal Isolates for Computer-Aided Drug Discovery. PLoS One (2013).
[20] Noel M O'Boyle, Michael Banck, Craig A James, Chris Morley, Tim Vandermeersch, and Geoffrey R Hutchison. 2011. Open Babel: An open chemical toolbox. J. Cheminform. 3, 1 (Dec 2011), 33. https://doi.org/10.1186/1758-2946-3-33
[21] Nataraj S. Pagadala, Khajamohiddin Syed, and Jack Tuszynski. 2017. Software for molecular docking: a review. Biophys. Rev. 9, 2 (Apr 2017), 91–102. https://doi.org/10.1007/s12551-016-0247-1
[22] Jerry M. Parks and Jeremy C. Smith. 2020. How to Discover Antiviral Drugs Quickly. N. Engl. J. Med. (2020).
[23] Eric F Pettersen, Thomas D Goddard, Conrad C Huang, Gregory S Couch, Daniel M Greenblatt, Elaine C Meng, and Thomas E Ferrin. 2004. UCSF Chimera: A visualization system for exploratory research and analysis. Journal of Computational Chemistry 25, 13 (2004), 1605–1612.
[24] Diogo Santos-Martins, Jerome Eberhardt, Giulia Bianco, Leonardo Solis-Vasquez, Francesca Alessandra Ambrosio, Andreas Koch, and Stefano Forli. 2019. D3R Grand Challenge 4: prospective pose prediction of BACE1 ligands with AutoDock-GPU. J. Comput. Aided Mol. Des. 33, 12 (Dec 2019), 1071–1081. https://doi.org/10.1007/s10822-019-00241-9
[25] Diogo Santos-Martins, Leonardo Solis-Vasquez, Andreas Koch, and Stefano Forli. 2019. Accelerating AutoDock4 with GPUs and Gradient-Based Local Search. (2019).
[26] Liang Shen, Junwei Niu, Chunhua Wang, Baoying Huang, Wenling Wang, Na Zhu, Yao Deng, Huijuan Wang, Fei Ye, Shan Cen, et al. 2019. High-throughput screening and identification of potent broad-spectrum inhibitors of coronaviruses. Journal of Virology 93, 12 (2019), e00023–19.
[27] Francisco J. Solis and Roger J.-B. Wets. 1981. Minimization by Random Search Techniques. Math. Oper. Res. 6, 1 (Feb 1981), 19–30. https://doi.org/10.1287/moor.6.1.19
[28] Leonardo Solis-Vasquez and Andreas Koch. 2017. A Performance and Energy Evaluation of OpenCL-accelerated Molecular Docking. In Proc. 5th Int. Workshop on OpenCL (IWOCL 2017). ACM Press, New York, NY, 1–11. https://doi.org/10.1145/3078155.3078167
[29] Teague Sterling and John J. Irwin. 2015. ZINC 15 – Ligand Discovery for Everyone. J. Chem. Inf. Model. 55, 11 (Nov 2015), 2324–2337. https://doi.org/10.1021/acs.jcim.5b00559
[30] John E. Stone, David Gohara, and Guochun Shi. 2010. OpenCL: A Parallel Programming Standard for Heterogeneous Computing Systems. Comput. Sci. Eng. 12, 3 (May 2010), 66–73. https://doi.org/10.1109/MCSE.2010.69
[31] Oleg Trott and Arthur J. Olson. 2010. AutoDock Vina: Improving the speed and accuracy of docking with a new scoring function, efficient optimization, and multithreading. J. Comput. Chem. 31, 2 (2010), 455–461. https://doi.org/10.1002/jcc.21334
[32] Santiago Vilar, Giorgio Cozza, and Stefano Moro. 2008. Medicinal Chemistry and the Molecular Operating Environment (MOE): Application of QSAR and Molecular Docking to Drug Discovery. Curr. Top. Med. Chem. 8, 18 (Dec 2008), 1555–1572. https://doi.org/10.2174/156802608786786624
[33] Matthew D Zeiler. 2012. ADADELTA: an adaptive learning rate method. arXiv preprint arXiv:1212.5701 (2012).
[34] Xiaohua Zhang, Sergio E. Wong, and Felice C. Lightstone. 2013. Message passing interface and multithreading hybrid for parallel molecular docking of large databases on petascale high performance computing machines.