Performance of a GPU-based Direct Summation Algorithm for Computation of Small Angle Scattering Profile
Konstantin Berlin ∗†‡, Nail A. Gumerov †, Ramani Duraiswami †, David Fushman ∗†

∗ Department of Chemistry and Biochemistry, Center for Biomolecular Structure and Organization, University of Maryland, College Park, MD 20742, USA
† Institute for Advanced Computer Studies, University of Maryland, College Park, MD 20742, USA
‡ Contact Email: [email protected]
Abstract — Small Angle Scattering (SAS) of X-rays or neutrons is an experimental technique that provides valuable structural information for biological macromolecules under physiological conditions and with no limitation on the molecular size. In order to refine a molecular structure against experimental SAS data, the ab initio prediction of the scattering profile must be recomputed hundreds of thousands of times, which involves the computation of the sinc kernel over all pairs of atoms in a molecule. The quadratic computational complexity of predicting the SAS profile limits the size of the molecules and has been a major impediment for integration of SAS data into structure refinement protocols. In order to significantly speed up prediction of the SAS profile, we present a general purpose graphical processing unit (GPU) algorithm, written in OpenCL, for the summation of the sinc kernel (Debye summation) over all pairs of atoms. This program is an order of magnitude faster than a parallel CPU algorithm, and faster than an FMM-like approximation method for certain input domains. We show that our algorithm is currently the fastest method for performing SAS computation for small and medium size molecules. This algorithm is critical for quick and accurate SAS profile computation of elongated structures, such as DNA, RNA, and sparsely spaced pseudo-atom molecules.

INTRODUCTION
Accurate characterization of biomolecular structures in solution is required for understanding their biological function and for therapeutic applications. Small-angle scattering (SAS) of X-rays and neutrons indirectly measures the distribution of interatomic distances in a molecule [1]. Unlike high-resolution techniques, such as X-ray crystallography and solution NMR, SAS allows the study of molecules and their interactions under native physiological conditions and with essentially no limitation on the size of the system under investigation. Though providing a powerful and unique set of experimental structural constraints, SAS is limited in resolution, as well as by the ambiguity in deconvolution of interatomic distances from experimental data. It does not provide enough data to derive a high resolution molecular structure, but due to the ease of collecting SAS data at various conditions and the unique scale of atomic distance information, it is an extremely promising complement to the high-resolution techniques [2].

Solution SAS studies have become increasingly popular, with applications covering a broad range, including structure refinement of biological macromolecules and their complexes [3], [4], [5], [6], and analysis of conformational ensembles and flexibility in solution [7], [8], [9], [10]. Finally, SAS is starting to be used in high-throughput biological applications [11], [12].

In order to use SAS for structural refinement, the scattering profile of the SAS experiment needs to be predicted ab initio from the molecular structure, which requires computing all-pairs interactions of the atoms in the molecule (also referred to as an N-body problem). In addition, the profile must be recomputed hundreds of thousands of times in an iterative structure refinement algorithm, in ensemble analysis [3], [13], or for thousands of different structures in a high-throughput method. For such applications serial computation of the scattering profile becomes prohibitive, even for smaller molecules. Several approximation algorithms have been proposed to speed up this computation [14], [15], [13], [16]. However, depending on the diameter of the molecule, the approximations can introduce significant errors (see [17] for theoretical limitations of these algorithms). Recently a hierarchical harmonic expansion method based on the fast multipole method (FMM) has been shown to have superior asymptotic performance to all previously proposed approximation methods [17], while maintaining any prescribed accuracy. However, the algorithm is very complex, which makes it difficult to implement and parallelize.

Here we describe a parallelization of the direct computation of the SAS profile onto readily affordable graphical processing cards (GPUs) that provides a dramatic improvement in the efficiency of the scattering profile prediction. While CUDA [18] GPU parallelization has been proposed for the similar problem of powder diffraction [19], we extend GPU parallelization to SAXS/WAXS/SANS using an open GPU programming standard, OpenCL [20], which makes our software compatible with most modern desktop and laptop systems. We show that our GPU implementation is an order of magnitude faster than the parallelized CPU version, and faster than the parallelized CPU version of the hierarchical harmonic expansion.
I. BACKGROUND
A. Computational Complexity of SAS Scattering Profile
The scattering profile is a function of the momentum transfer q,

I(q) = \left\langle \left| \sum_{j=1}^{N} f_j(q) \exp(i\,\mathbf{q}\cdot\mathbf{r}_j) \right|^{2} \right\rangle_{\Omega},    (1)

where N is the number of atoms in the molecule, f_j is the atomic form factor of the j-th atom, \mathbf{r}_j is the position of the j-th atom relative to some coordinate frame, \langle\cdot\rangle_{\Omega} represents the averaging over all possible orientations of the molecule (this is mathematically equivalent to averaging over all possible orientations of \mathbf{q}), and q = \|\mathbf{q}\|. We can analytically integrate Eq. (1) over \Omega, which yields an all-pairs sinc kernel summation (Debye summation),

I(q) = \sum_{k=1}^{N} f_k(q) \sum_{j=1}^{N} f_j(q) \frac{\sin(q r_{jk})}{q r_{jk}},    (2)

where r_{jk} is the distance between the j-th and k-th atomic centers [21], [1]. Unlike the case of powder diffraction [19], in the general SAS case the form factors f_j cannot be factored out of the equation, because the non-uniform density of the water layer atoms around the molecule creates an intractable number of possible f_k f_j combinations.

The computational cost of directly evaluating Eq. (2) for each q is O(N^2). For larger molecular systems (tens or hundreds of thousands of atoms), the quadratic computational cost is a prohibitive barrier for atomic level computation of Eq. (2). This computational cost is further compounded in structure refinement algorithms, when the Debye sum has to be repeatedly calculated as part of a structural refinement protocol, e.g., [22], [23], [4], or if an ensemble of molecular conformations is considered. For example, on a modern desktop, directly computing a scattering profile a single time can take minutes for a moderately sized molecule.
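For reference, a minimal single-threaded C sketch of the direct evaluation of Eq. (2) at one q value is given below; the function and argument names are illustrative, and the form factor values f_j(q) are assumed to be precomputed. The full profile simply repeats this O(N^2) loop for every measured q value, which is exactly the work that the GPU algorithm described later parallelizes.

    #include <math.h>
    #include <stddef.h>

    /* Minimal sketch of the direct Debye summation, Eq. (2), at a single
       momentum transfer value q.  x, y, z hold the N atomic coordinates
       and f holds the corresponding form factor values f_j(q). */
    double debye_direct(double q, size_t n,
                        const double *x, const double *y, const double *z,
                        const double *f)
    {
        double I = 0.0;
        for (size_t k = 0; k < n; k++) {
            for (size_t j = 0; j < n; j++) {
                double dx = x[j] - x[k];
                double dy = y[j] - y[k];
                double dz = z[j] - z[k];
                double qr = q * sqrt(dx * dx + dy * dy + dz * dz);
                /* sinc(qr) -> 1 as qr -> 0, which covers the j == k term */
                I += f[k] * f[j] * ((qr > 1e-12) ? sin(qr) / qr : 1.0);
            }
        }
        return I;
    }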
To address the quadratic complexity of Eq. (2), several papers have suggested approximately computing Eq. (1) by averaging over a constant number of orientations [15], [13], [16], or by using a harmonic expansion [14]. Assuming one accurately computes the averaged I(q) value for some constant number of orientations or harmonic expansions, the algorithm would have a superior O(N) complexity. However, it was demonstrated in [17] that the number of orientations (or harmonics) needed for accurate orientational averaging is problem dependent, requiring no less than O((qD)^2) orientational samples, where D is the diameter of the molecule (the largest distance between two atoms of the molecule). Thus, the computation of the full profile has an overall complexity of O((qD)^2 N). In addition, the constant inside the O((qD)^2 N) is larger than in the direct O(N^2) computation.

For a significant variety of structures (especially elongated structures like DNA or RNA), the factor (qD)^2 can be larger than N. For example, for the complex [PDB:2R8S], qD reaches 150 at q on the order of 1 Å⁻¹, so (qD)^2 > N and it clearly makes more sense to directly compute Eq. (2) for this complex, instead of using the approximations mentioned above. A similar analysis between direct and approximate methods can be done automatically for any type of input.

Recently we have demonstrated that the complexity of the approximate computation can be significantly improved by applying an FMM-like hierarchical algorithm based on spherical harmonics [17], giving an overall complexity of O(N log N + (qD)^2 log(qD)) for a specific value of q. This method offers significant speedup in evaluation of Eq. (1) for most relevant ranges of qD and N as compared to previous approximation methods. However, due to the (qD)^2 dependence, as we demonstrate below, for medium size molecules and WAXS values of q it is still faster to compute the O(N^2) Eq. (2) directly on the GPU. Our results clearly demonstrate a large computational benefit of a GPU based direct Debye summation algorithm for higher ranges of qD and/or lower density pseudo-molecules [23], [24].
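Ignoring the method-dependent constants, the tradeoff discussed in this section can be summarized by comparing the leading-order costs of computing the full profile at Q values of q (this is only a restatement of the argument above, not a new result):

T_{\mathrm{direct}} \sim Q\,N^{2}, \qquad T_{\mathrm{sampling}} \sim Q\,(qD)^{2}\,N \quad\Longrightarrow\quad \text{direct summation is preferable whenever } N \lesssim (qD)^{2}.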
B. GPU Computing

GPUs with substantial computing power are widely available today in most PCs, and can now provide over 1 Tflop of computational power, as exemplified by the Tesla C2070. While GPUs were originally designed for intense graphics computations, using GPUs as a less expensive alternative to high-performance computing on a CPU cluster is now a well established technique in scientific computing, commonly referred to as General Purpose computing on the GPU (GPGPU). Whereas a general purpose CPU devotes much of the on-chip real estate to flow control and caching, and only a relatively small area to computational cores, GPUs have many cores that simultaneously execute the same small program on different data, referred to as a Single-Instruction Multiple-Data (SIMD) architecture [25].

The GPGPU paradigm is well suited for structural biology because of the embarrassingly parallel nature of molecular force fields and the large number of particles in a typical molecular system. Such computations, running on the GPU, have been shown to give an order of magnitude speedup over their parallel CPU counterparts (see [26] for a review), but require the programmer to explicitly manage different memory levels in order to achieve good performance.

The most popular method to program a GPU is with CUDA (an interface developed by NVIDIA), which allows programming of GPUs in a C-like language. A detailed description of an efficient CUDA implementation for an N-body problem is provided by NVIDIA [27], and more examples can be found in the references of [26]. However, these implementations are available only for CUDA-enabled NVIDIA cards, and cannot be used on GPUs from other vendors.

To standardize and unify GPU programming across multiple platforms, the OpenCL standard was introduced in 2008 [20]. A diagram of the three basic OpenCL memory levels is given in Figure 1. GPU global memory is a large block of memory that is shared between all worker threads; however, access to it is extremely slow relative to the local and private memory. Local memory is orders of magnitude faster than global memory, but is only shared between threads in an individual compute unit. Finally, each thread has a very small block of fast private memory. An OpenCL workgroup is a subset of the compute unit, whose size is determined at compilation time.
Fig. 1. Simplified OpenCL memory model, representing the three main levels of memory: global, local, and private.
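To make the three memory levels of Figure 1 concrete, the toy OpenCL kernel below (illustrative only; the kernel name and arguments are not part of our software) loads a tile of global data into local memory, accumulates in a private register, and writes a single result back to global memory:

    /* Toy kernel illustrating the three OpenCL address spaces of Figure 1. */
    __kernel void memory_levels(__global const float *in,   /* global: large, slow, shared by all threads   */
                                __global float       *out,
                                __local  float       *tile) /* local: fast, shared within one compute unit  */
    {
        float acc = 0.0f;                        /* private: per-thread registers                */
        int lid = get_local_id(0);

        tile[lid] = in[get_global_id(0)];        /* block copy from global to local              */
        barrier(CLK_LOCAL_MEM_FENCE);            /* wait until the whole tile is loaded          */

        for (int i = 0; i < (int)get_local_size(0); i++)
            acc += tile[i];                      /* compute entirely out of local/private memory */

        out[get_global_id(0)] = acc;             /* single aggregate write back to global        */
    }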
In order to design an efficient GPU algorithm, the majority of the computation in each thread should be done in private and local memory, with only an aggregate result periodically copied back to global memory. A corollary to this is that the amount of work done by each compute unit should be maximized relative to the amount of memory transfer between global and local/private memory. With this in mind, we now move on to describing our algorithm for the computation of the sinc kernel.
II. IMPLEMENTATION
We first observe that, for each value of q, direct evaluation of Eq. (1) is embarrassingly parallel, and can be significantly sped up by using parallel GPU computing. Given O(N^2) work threads, the summation can theoretically be computed in O(log N) time and O(N^2) work, with an O(1) time parallel kernel evaluation followed by an O(log N) time parallel reduction. In practice such an algorithm is not scalable because of the O(N^2) memory requirement. Let W < N be the number of workers available in the hardware of the GPU. We set out to design an O(N^2/W + log W) time algorithm with O(N^2) work complexity and an O(N + W) memory requirement. This algorithm scales linearly with Q, the number of q values that need to be evaluated.

Since we expect users to run structure refinement computations on a heterogeneous set of systems, based on what is available to them at any given time (including personal laptops), we designed the algorithm for the general OpenCL 1.0 memory model. We focused on optimizing memory transfers between the RAM and global GPU memory, and between global GPU memory and local GPU memory, as well as on avoiding quadratic scaling in memory usage.

Let B be the number of workers in each compute unit, and assume that the global GPU memory is large enough to store all N atom positions and the N form factors associated with at least one specific q value. For local GPU memory we assume that it is large enough to store B atom positions and B form factors associated with the specific q value. If all QN form factors are too large to fit into global memory, we automatically split the GPU computation into multiple computations on subsets of the q values. Just like any implementation of the Debye summation, the algorithm consists of three loops: the loop over the q values, the loop over all the atoms, and another loop over all the atoms.

The overriding design goal of the algorithm is to minimize the number of memory transfers from global memory to local memory, and to avoid access of global memory outside of block transfers. In a normal CPU implementation the iteration over q would be inside the double loop over the atoms, since this avoids recomputing the distance between two atoms at different values of q (which requires an expensive square root operation). However, in our implementation we put the iteration over q as the outermost loop due to the size limitation of local GPU memory, and only compute two q values at a time inside the innermost iterations. Further increasing the number of q values does not improve the performance, since the storage of the f(q) values overwhelms the private and then the local memory, while also decreasing the maximum workgroup size.

The algorithm starts by spawning N workers. For every second q value, each worker j copies its atomic position and the associated form factor from global memory to private memory, and each local workgroup of size B copies a block of B sequential form factors and atom positions into local memory. At this point each worker in the local workgroup proceeds to compute the kernel sum between its private atom and all the atoms stored in the local memory of the group. Once the computation is completed, the next block of B form factors (for the two q values) and atom positions is copied into local memory, and the computation is repeated. At the end of the all-pair summation for the two current q values, a local parallel reduction is done to aggregate the private totals of each worker into one local memory value, and that value is copied into global memory. At the end of the computation another reduction is done to aggregate all the workgroup totals, stored in global memory, into a final I(q) array. The final reduction is a quick computation, and can be done on the CPU. In order to avoid unnecessary computation, only values for j ≤ k are computed.
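The kernel below is a simplified sketch of this scheme for a single q value, without the two-q unrolling or the j ≤ k optimization described above. All identifiers (kernel name, buffer names) are illustrative rather than taken from our code, and the workgroup size B is assumed to be a power of two.

    /* Sketch of the tiled Debye summation for one q value.
       Each work-item owns one atom, the atoms are streamed through local
       memory in tiles of the workgroup size B, and each workgroup writes
       one partial sum to global memory. */
    __kernel void debye_tile(const float q, const int n,
                             __global const float4 *pos,   /* xyz of each atom        */
                             __global const float  *ff,    /* f_j(q) of each atom     */
                             __global float *partial,      /* one value per workgroup */
                             __local  float4 *tpos,        /* B tiled positions       */
                             __local  float  *tff,         /* B tiled form factors    */
                             __local  float  *scratch)     /* B reduction slots       */
    {
        const int gid = get_global_id(0);
        const int lid = get_local_id(0);
        const int B   = get_local_size(0);

        /* private copy of this work-item's atom (padded threads contribute 0) */
        float4 p   = pos[min(gid, n - 1)];
        float  fp  = (gid < n) ? ff[gid] : 0.0f;
        float  acc = 0.0f;

        for (int base = 0; base < n; base += B) {
            /* cooperative block copy of the next tile into local memory */
            int j = base + lid;
            tpos[lid] = pos[min(j, n - 1)];
            tff[lid]  = (j < n) ? ff[j] : 0.0f;
            barrier(CLK_LOCAL_MEM_FENCE);

            for (int t = 0; t < B; t++) {
                float4 d  = p - tpos[t];
                float  qr = q * sqrt(d.x * d.x + d.y * d.y + d.z * d.z);
                float  s  = (qr > 1e-6f) ? native_sin(qr) / qr : 1.0f;
                acc += fp * tff[t] * s;
            }
            barrier(CLK_LOCAL_MEM_FENCE);
        }

        /* local parallel reduction of the per-thread partial sums */
        scratch[lid] = acc;
        barrier(CLK_LOCAL_MEM_FENCE);
        for (int stride = B / 2; stride > 0; stride /= 2) {
            if (lid < stride)
                scratch[lid] += scratch[lid + stride];
            barrier(CLK_LOCAL_MEM_FENCE);
        }
        if (lid == 0)
            partial[get_group_id(0)] = scratch[0];
    }

On the host side, one such launch is enqueued per q value (per pair of q values in the actual implementation), and the per-workgroup partial sums are reduced into the final I(q) on the CPU.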
In the case when the number of atoms is smaller than the number of workers in the GPU, N < W, we create pN ≥ W workers (choosing the smallest such p), and the internal loop over all atoms that was previously done by one worker is now split between the p workers. The basic diagram of our algorithm is shown in Figure 2.

Due to the limitations of some GPU cards, the OpenCL code was implemented using single instead of double precision. This does not affect the applicability of the computation, since the errors (even for a million atom system) are orders of magnitude below the experimental errors and the inherent accuracy of the theoretical model. We expect the computation using doubles to be about twice as slow; however, it is unnecessary in this case.

Fig. 2. Diagram of our GPU algorithm. Each atom thread (T) computes the sinc kernel between itself and all the atoms in block B (all the atoms are split into blocks, which are stored in local memory). After all compute units are done with their computation, the blocks are shifted by one, and the computation is repeated.
III. RESULTS
The GPU algorithm ("GPU") was compared to three other methods: two of the methods are different implementations of the direct summation of Eq. (2), the IMP toolbox implementation in C++ ("IMP") [28] and our CPU parallelized Java implementation ("Parallel"); the third method is a parallelized CPU FMM-like hierarchical method [17] ("FMM", with a fixed prescribed relative accuracy). We did not compare our method to other approximation methods, since the FMM-like method has been demonstrated to have orders of magnitude faster performance than the O((qD)^2 N) approximation methods.

We performed benchmarks on a laptop system, as well as on a significantly more expensive specialized computer. The laptop is a dual core 2.66 GHz Intel Core i7 Macbook Pro with an NVIDIA GeForce GT 330M GPU, running Mac OS X Lion 10.7.2, the C++ Apple LLVM compiler 4.2, CUDA 4.0, and Java 1.6. The second system is a single node of the Chimera GPU cluster, which has dual Intel Xeon quad-core 2.5 GHz processors with an Nvidia GTX 295 GPU, running Linux 2.6.18, the Intel 11.1 compiler, and Java 1.6. All C++ and Fortran programs were compiled with the "-O3" flag. The harmonic expansion method was only run on the Chimera node due to compiler incompatibility on Mac OS. Timing results were obtained over the critical region of momentum transfer values suggested in [13].

The results for the Macbook Pro are shown in Figure 3, and the detailed CPU and GPU timing results are also given in Table I. The GPU version is dramatically faster than the standard single-CPU computation in the IMP library, decreasing the computation time for a moderately large protein from minutes to seconds. For smaller proteins the speedup decreases, because the overhead of CPU-GPU memory transfers becomes a more prominent fraction of the overall time. For a protein of the size commonly used in a SAS experiment, the computation time decreased from seconds for IMP to a fraction of a second for the GPU version.

Our parallel Java implementation was several times faster than the single threaded IMP implementation on our dual core system, with over 100% parallel efficiency (most likely due to the hyper-threading available on each core).
Fig. 3. Timing results for the computation of a SAS profile on differently sized, randomly generated proteins, on the Macbook Pro system.

The speedup of the GPU Debye summation over the similarly parallelized CPU version is over 20 times on the Macbook Pro (Table I). A noticeable fraction of the computation is taken up by memory transfers, with the remainder split between the atomic distance computation and the evaluation of the sinc function and other unavoidable operations.

With such a dramatic speedup, the GPU algorithm makes it possible to perform previously infeasible SAS profile computations of moderate size proteins, or even to compute the SAS profiles for a large ensemble of structures, on a basic laptop.

Similar results for a single Chimera node are shown in Figure 4, and the detailed CPU and GPU timing results are also given in Table I. The Chimera node is a significantly more expensive and powerful computer than the Macbook Pro; however, for single threaded code, like IMP, the performance is only modestly faster than on the Macbook Pro.
Fig. 4. Timing results for the computation of a uniformly spaced SAS profile over the benchmarked q range on differently sized, randomly generated proteins, on a single Chimera node.

On the Chimera node, our parallel Java implementation is several times faster than IMP (below the perfect speedup of 8), and our GPU implementation is around 36 times faster than our parallel Java version for the largest protein tested, dropping to about 6 times for a 1,000-atom protein (Table I). If we assume perfect parallelization of the C++ IMP code, the estimated speedup drops further. This is similar to the speedup observed on the Macbook Pro.
TABLE I
TIMING RESULTS OF THE DESCRIBED ALGORITHMS

              Macbook Pro                         Chimera
  N           CPU (2 core)   GPU     Speedup     CPU (8 core)   GPU     Speedup
  1000        0.28           0.03    9.5         0.10           0.02    5.9
  2000        1.16           0.07    15.6        0.35           0.03    13.0
  4000        4.25           0.25    17.0        1.35           0.08    17.3
  10000       26.8           1.34    20.1        8.28           0.34    24.3
  20000       108            5.02    21.5        33.8           1.14    29.6
  40000       439            20.4    21.6        133            3.94    33.7
  100000      (2695)         121     22.3        806            22.5    35.9

All timing results are in seconds. Estimated timing is shown in parentheses.

Complete timing results for the parallel CPU version and the GPU version on both systems are shown in Table I. The parallelized CPU FMM-like algorithm becomes faster than the direct GPU computation only for sufficiently large molecules. However, due to the dependence of the FMM-like algorithm on q, the FMM-like algorithm is actually slower for the higher q values in the profile. In Figure 5 we demonstrate the switch in the optimal computation algorithm, from the FMM-like algorithm to the GPU algorithm, for a randomly generated protein. Beyond the crossover q value, our GPU implementation is the fastest available method for computing a SAS profile. This result does not take into account the additional computational time of setting up the FMM, such as generating an octree and precomputing translation operators.
Fig. 5. Timing results at the individual q values of a SAS profile of a randomly generated protein, on a single Chimera node.

We demonstrate a practical application of our GPU method by comparing our ab initio prediction to the popular and extremely fast CRYSOL [14] software package using experimental data. The experimental data are for low q, where approximation methods like CRYSOL are most applicable, yet we still demonstrate a significant boost in computation speed. We note that our algorithm does not have to be used as a standalone component, as is demonstrated here, but can be quickly integrated into a more advanced SAS pipeline that would yield an improved fit to the experimental data. Here we demonstrate the immediate computational advantage of our implementation, even for lower q values, over the current state-of-the-art spherical harmonic approximation method.

All hydrogens were explicitly treated, and all ab initio prediction was done without refinement of any parameters against the experimental data. CRYSOL was run with default options, and no water layer was added in our method. The profiles for both CRYSOL and the GPU method were computed for the same number of uniformly distributed q points. We provide the ab initio fits to experimental data solely to demonstrate that we are timing the necessary amount of experimentally relevant computation relative to CRYSOL. The final fits given in most publications include some amount of pre/post processing that is not relevant to our analysis.

The first benchmark was done on the lysozyme data provided in the CRYSOL package. The computation was timed on the Macbook Pro machine, with the GPU computation finishing in a fraction of the time taken by CRYSOL. For visualization purposes, the profile fits are given in Figure 6.
Fig. 6. Comparison of the SAXS profile fit between CRYSOL and our GPU implementation for hen egg-white lysozyme [PDB:193L]. The predicted profiles were rescaled to overlap the experimental profile.
The second benchmark was run on the dataset [BID:1PYR1P], downloaded from the BIOISIS database. The GPU method again ran faster than CRYSOL on the Macbook Pro. The fit is shown in Figure 7.

The timing results demonstrate the advantage of our direct GPU algorithm, even at low q values, over one of the fastest and most popular current implementations of an approximation method.
Fig. 7. Comparison of the SAXS profile fit between CRYSOL and our GPU implementation for [PDB:3K3K]. The predicted profiles were rescaled to overlap the experimental profile.
IV. DISCUSSION
We have demonstrated that, for important input domains (high qD inputs or small molecules), our GPU implementation is currently the fastest available algorithm that can effectively run on a personal computer. This direct GPU implementation should be considered a complementary algorithm to the CPU FMM-like implementation at lower q values, where the algorithm choice should be made automatically, depending on the values of qD and N. Within the typical experimental q range, we suggest that for low q values an FMM-like algorithm should be used, while for higher q values the computation should be switched to the direct GPU algorithm.

Since the CPU FMM-like and the GPU algorithms run on separate processors, both should run at the same time, on different q ranges, in order to achieve maximum efficiency. The GPU algorithm performance should further improve in the future, as the size of the local memory increases and more q values can fit inside the innermost loop.
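As a purely illustrative sketch of such an automatic choice (not part of our software), one can compare rough cost models of the two methods at each q value; the calibration constants c_gpu and c_fmm below are hypothetical placeholders that would have to be measured on the target hardware.

    #include <math.h>
    #include <stddef.h>

    /* Rough cost-model based selector between the direct GPU Debye summation
       and the FMM-like method at a given q.  The constants are illustrative
       calibration factors, not values from our benchmarks. */
    typedef enum { USE_GPU_DIRECT, USE_FMM } sas_method;

    sas_method choose_method(double q, double D, size_t n,
                             double c_gpu, double c_fmm)
    {
        double qD       = q * D;
        double N        = (double)n;
        double cost_gpu = c_gpu * N * N;                      /* O(N^2)                     */
        double cost_fmm = c_fmm * (N * log(N) +
                                   qD * qD * log(qD + 1.0));  /* O(N log N + (qD)^2 log qD) */
        return (cost_gpu <= cost_fmm) ? USE_GPU_DIRECT : USE_FMM;
    }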
V. CONCLUSION

We provided the first comparison of a direct GPU computation vs. an FMM-like method for the SAS problem, and demonstrated the immense utility of a parallel GPU direct summation method for a large set of potential inputs. At the same time, we showed the inherent computational tradeoff between approximation methods and direct all-to-all summation. Our implementation was written in OpenCL, allowing it to be run on most modern laptop and desktop GPUs, while still having an order of magnitude faster runtime than a similar parallel CPU implementation.

We also demonstrated an application of our new method on two experimental datasets, and showed that our GPU software can compute the scattering profile faster than CRYSOL, with a guaranteed machine level computational precision of the underlying mathematical model. The performance gains of this algorithm can be quickly realized by any of the current ab initio SAS prediction packages.

The above software is part of a new high performance software package, ARMOR, which includes additional NMR restraints [29], [30].
ACKNOWLEDGEMENTS
We thank the UMIACS New Research Frontiers Award for supporting this research.