Particle Mesh Ewald for Molecular Dynamics in OpenCL on an FPGA Cluster
Lawrence C. Stewart, Carlo Pascoe, Brian W. Sherman, Martin Herbordt, Vipin Sachdeva
An OpenCL 3D FFT for Molecular Dynamics Distributed Across Multiple FPGAs

Lawrence C. Stewart
Silicon Therapeutics
451 D St, Suite 205, Boston, MA
[email protected]

Carlo Pascoe
Silicon Therapeutics
451 D St, Suite 205, Boston, MA
[email protected]

Brian W. Sherman
Silicon Therapeutics
451 D St, Suite 205, Boston, MA
[email protected]

Martin Herbordt
College of Engineering, Boston University
[email protected]

Vipin Sachdeva
Silicon Therapeutics
451 D St, Suite 205, Boston, MA
[email protected]
Abstract—3D FFTs are used to accelerate MD electrostatic force computations but are difficult to parallelize due to communications requirements. We present a distributed OpenCL 3D FFT implementation on Intel Stratix 10 FPGAs for grids up to 128³. We use FPGA hardware features such as HBM2 memory and multiple 100 Gbps links to provide scalable memory accesses and communications. Our implementation outperforms GPUs for smaller FFTs, even without distribution. For 32³ we achieve 4.4 microseconds on a single FPGA, similar to Anton 1 on 512 nodes. For 8 parallel pipelines (hardware limited), we reach the same performance both locally and distributed, showing that communications are not limiting performance. Our FFT implementation is designed to be part of the electrostatic force pipeline of a scalable MD engine.

Index Terms—FPGA, Molecular Dynamics, HPC, Reconfigurable Computing, FFT
I. INTRODUCTION
Molecular dynamics (MD) simulations play an important role in drug discovery. MD simulation engines such as AMBER [1] and OpenMM [2] provide high performance implementations for CPU and GPU, and provide a flexible framework in which new computational technologies can be assessed. One of the key challenges in molecular dynamics simulations is the long-range (LR) electrostatic force computation, which often uses Ewald summation to split the work into short-range and long-range terms. Methods for the long-range term include k-space summation [3], the µ-series [4], and the use of Fourier transforms to solve Poisson's equation [5]. The FFT methods reduce the computation from O(N²) to O(N log N), but require global communication and strong scaling of 3D FFTs in the 32³ to 128³ size range.

In this paper we describe an FPGA-based architecture, implementation, and evaluation of a distributed 3D FFT applicable to molecular dynamics simulations. Our implementation outperforms GPUs on small transforms, and is scalable to multiple pipelines on a single board or across multiple boards. It is implemented entirely in OpenCL and works for a variety of sizes applicable to drug discovery. Furthermore, we provide an overall design for a scalable long-range MD pipeline.

The outline of this paper is as follows: Section II discusses background and related work on MD and FFT targeting contemporary architectures including CPUs, GPUs, and ASICs. Section III details the system architecture of our FFT as well as the long-range pipeline, and Section IV follows up with the implementation details in OpenCL. Section V summarizes the results of our work, along with a performance comparison to other hardware. Finally, Section VI concludes the paper along with our plans for future work.

II. BACKGROUND AND RELATED WORK
Molecular dynamics simulations have proven to be a valuable tool in drug discovery for understanding protein motion. Open-source GPU accelerated molecular dynamics applications such as GROMACS [6], NAMD [7], OpenMM [2], and CP2K [8] allow many practitioners to use MD simulations as a regular tool. AMBER [1], while not open-source, is another MD engine that is widely used in industry. To our knowledge, the only study showing strong scaling on GPUs for a 100,000 atom system is with the recently redeveloped GROMACS package [9], [10], which does not distribute the FFT across multiple GPUs.

Several efforts to develop custom ASICs for small molecule simulations have been undertaken. ASICs require long development times and lack reconfigurability, but sheer performance makes them attractive. The earliest initiative is the MDGRAPE [11] series of supercomputers; MDGRAPE-4a is capable of 1.1 microseconds per day for a 100,000 atom system. Another very well known initiative for strong scaling molecular dynamics ASICs is the Anton series of supercomputers [12] developed by D. E. Shaw Research, capable of simulating tens of microseconds in a single day. Anton 1 was released in 2007, with performance for a 23,000 atom system close to 17 microseconds/day. Anton 2, released in 2014, increased this performance five-fold to 85 microseconds/day [13].

The most challenging part of scaling molecular dynamics simulations is the electrostatic force computation, of which FFT is often a major component. Anton 1 could solve FFT problems of size 32³ in 3.7 microseconds, and 64³ in 13.3 microseconds, on 512 nodes. Anton 2 did not use FFT in its simulations, instead relying on a different decomposition called the µ-series [4], [13]. By using a separable kernel, the long-range Ewald term can be directly calculated by X, Y, and Z 1D convolutions without the use of FFT. This improved performance by 15% but increased communications requirements.

Efforts to get parts of molecular dynamics simulations running on FPGAs have been explored over the past few years [14], [15]. More recently, the increase in FPGA resources such as logic elements, DSPs, and BRAM has allowed full MD simulations to run on a single FPGA [16]. One of the greatest strengths of FPGAs is the I/O transceivers, which are capable of providing hundreds of gigabits of bandwidth per FPGA with very low latency [17]. Some clusters with highly interconnected FPGAs include the Novo-G# [18].

III. SYSTEM ARCHITECTURE
In this section, we briefly show how the FFT fits into a full molecular dynamics application. We then detail our FFT pipeline and communications architecture, and also describe our full long-range pipeline, which incorporates the charge spreading, FFT, and force interpolation steps that compute long-range electrostatic forces. We describe the overall architectural and scaling issues that constrain the system design. The overall architectural problem is to build an efficient long-range pipeline which can stream data with balanced performance in each stage.
A. Molecular Dynamics
Molecular dynamics models the behavior of atoms and molecules by individually calculating the various forces that act on them. Forces that apply to bonded atoms include bond torsions and tensions. Forces that apply to non-bonded atoms include short-range forces, which include both van der Waals and electrostatics, and long-range forces, which are mainly electrostatic. Short-range forces are managed by pairwise computations. For long-range forces, applications instead use multipole approximations [30] or Ewald summations. Our focus is on an Ewald variation known as Smooth Particle Mesh Ewald (SPME) [31]. SPME calculates a charge distribution on a grid, then uses Fourier transforms and a Green's function to calculate a potential field. Potential gradients are then used to calculate forces. Figure 1 shows the OpenCL portions of our modified OpenMM application and the role played by the 3D FFT.
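For reference, the reciprocal-space term that the grid charges, FFT, and Green's function multiply evaluate has the standard smooth PME form given by Essmann et al. [31]; the notation below is the usual one and is not specific to our implementation:

$$E_{\mathrm{rec}} \;=\; \frac{1}{2\pi V} \sum_{\mathbf{m}\neq\mathbf{0}} \frac{\exp(-\pi^{2} m^{2}/\beta^{2})}{m^{2}}\, B(m_{1},m_{2},m_{3})\, \bigl|\hat{Q}(m_{1},m_{2},m_{3})\bigr|^{2}$$

where Q is the B-spline-spread charge grid, Q̂ is its 3D DFT, B is the B-spline correction factor, β is the Ewald splitting parameter, V is the unit cell volume, and m ranges over the reciprocal lattice vectors. The factor multiplying |Q̂|² is the Green's function applied in the frequency domain before the inverse FFT.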
B. 3D FFT
The three dimensional FFT of an XYZ volume can be computed as a sequence of one dimensional transforms [32], [33].
Fig. 1. Molecular dynamics application overview. The host program (OpenMM) runs the force accumulator, the bonded, range-limited, and long-range force pipelines, and motion update through the OpenCL runtime; on the FPGAs, the long-range pipeline comprises charge spreading, 3D FFT, Green's function multiply, 3D inverse FFT, and force interpolation.
Fig. 2. 3D FFT steps and parallelization.

First, each vector of samples in the X direction (fixed Y and Z) is transformed. There are N² such 1D transforms, over vectors known as pencils. Then 1D transforms in the Y direction are made, followed by 1D transforms in the Z direction, as shown in Figure 2(a). Each 1D transform takes O(N log N) operations; with 3N² pencils in total, the whole 3D transform takes O(N³ log N) operations, or roughly 5N³ log₂(N³) floating point operations for complex data.

Because the 1D transforms are independent, they can be done in parallel. Figure 2(b) shows a parallelization in which the input volume is broken into slabs that contain subsets of the Z dimension. Within each slab, all data is locally available for the X and Y transforms, but only half the data for each Z dimension transform. After the Y FFT phase, the pipelines must exchange data to prepare for the Z direction FFTs. There are alternative parallel 3D FFT formulations, such as the 2D decomposition used in [34] and the generalized vector radix decomposition [35], but for systems of up to 64 nodes our initial focus is on the slab decomposition.

Garrido, Grajal, Sánchez, and Gustafsson showed how to build pipelined parallel FFT hardware using a feedforward architecture that is well suited to FPGA implementations [36]. As an example, an 8-wide parallel FFT unit will compute a 32-point 1D FFT in 4 cycles plus a 3 cycle pipeline latency. Multiple transforms can be pipelined with a 4 cycle pitch. This sort of computational unit uses 5-9% of a Stratix 10 device, depending on transform size, and consumes and delivers about 19 GB/sec of data.

The challenge of any FFT is that every output depends on every input. The FFT is a great advance over the DFT in that it reduces an O(N²) problem to an O(N log N) problem, but it doesn't help with memory requirements or communications. It is straightforward to do the FFT calculations, but not so straightforward to get the correct operands to the functional units in the correct order.
A fully pipelined FFT unit can complete N² 1D FFTs in the number of clock cycles it takes to read the data. Using a nominal 300 MHz design speed, Table I shows the time in microseconds to complete the N² 1D FFTs of one direction for various transform sizes, given different numbers of 8-wide vector compute units. Completing a full forward and inverse 3D FFT will take six times as much work, but can be both pipelined and parallelized.

In order to distribute such a system over a network of FPGAs, it is necessary to balance communications and computation performance; it is also necessary to choose points in the solution spaces for the FFT and for the all to all in which the bandwidths match. The cells in Table I with parenthesized references represent particular solution choices that match well with potential communications designs, which are discussed in Section III-C.
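To make the pencil decomposition concrete, the following host-side C sketch computes a 3D transform by applying independent 1D transforms along X, then Y, then Z. It is only an illustration of the data access pattern: a naive O(N²) DFT stands in for the optimized 8-wide FFT cores described above, and the layout and names are assumptions for this example, not part of our OpenCL implementation.

```c
#include <complex.h>
#include <math.h>

#define N 32                       /* grid side; the volume is N*N*N complex samples */

/* Naive 1D DFT of a strided pencil: a stand-in for a real FFT kernel. */
static void dft_pencil(float complex *v, int n, int stride) {
    float complex out[N];
    for (int k = 0; k < n; k++) {
        float complex acc = 0;
        for (int j = 0; j < n; j++)
            acc += v[j * stride] *
                   cexpf(-2.0f * (float)M_PI * I * (float)(j * k) / (float)n);
        out[k] = acc;
    }
    for (int k = 0; k < n; k++)
        v[k * stride] = out[k];
}

/* 3D FFT as three passes of independent 1D transforms (pencils).        */
/* Layout: X fastest, index = x + N*(y + N*z), matching the text above.  */
void fft3d(float complex *g) {
    for (int z = 0; z < N; z++)            /* X-direction pencils */
        for (int y = 0; y < N; y++)
            dft_pencil(&g[N * (y + N * z)], N, 1);
    for (int z = 0; z < N; z++)            /* Y-direction pencils */
        for (int x = 0; x < N; x++)
            dft_pencil(&g[x + N * N * z], N, N);
    for (int y = 0; y < N; y++)            /* Z-direction pencils */
        for (int x = 0; x < N; x++)
            dft_pencil(&g[x + N * y], N, N * N);
}
```

Each of the three passes consists of N² independent pencils, which is exactly the parallelism that the slab decomposition and the all to all exchange exploit.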
C. All to All
The all to all network is responsible for the interchange of data among multiple processing pipelines, both when colocated on a single board and when distributed across multiple FPGAs. The role of the all to all network is shown, for two pipelines, in Figure 2. It must reorganize the data of the FFT volume from a layout distributed along the Z dimension to one distributed along the X or Y direction.

For two pipelines, the overall volume is split between two pipes. Initially, the Z dimension is distributed across the pipelines. After the exchange, the Y dimension is distributed. Each unit must send half its data to the other to accomplish this exchange. In any parallelization into N units, (N−1)/N of the data must move. Since the FFT and the all to all are pipelined, the overall performance will be set by the slower function.
1) Network Topologies:
In order to implement an efficient all to all, we require a network with very high bandwidth and very high bisection bandwidth, as constrained by available hardware. Our current testbed uses four BittWare 520N-MX modules [37], each with a single Intel Stratix 10 MX2100 FPGA [17] and 4 QSFP28 communications ports operating at 100 Gbps. The vendor board support package makes the communications links available as 256 bit wide FIFO channels which can operate at up to 390 MHz (100 Gbps). We are operating with a preliminary version which runs at up to 305 MHz (78 Gbps).

For two-board designs, we can operate with 4 parallel full duplex links, offering 312 Gbps in each direction. We can operate four boards with point to point links connecting each pair of boards. For larger numbers of boards, the available four ports permit hypercubes of up to 16 nodes, or arbitrarily large two dimensional torus networks. We are evaluating the use of short copper cables to create two additional links for in-chassis connections. Six links would permit hypercube networks of up to 64 boards, or 3D torus networks of any size. We use optical external cables, so we are not constrained by cable length for off-chassis connections. With four FPGA boards per server, we can build single rack systems of up to 64 nodes.

Direct networks such as hypercube, 2D, and 3D torus networks require multiple hops for full connectivity using an on-chip router (see Section IV-C1). Another possibility would be to use on-board hard Ethernet MACs and 100 Gbps Ethernet switches to build single hop networks.
2) All to all scaling:
We can now analyze potential network topologies to evaluate points in the solution space that are compatible with distributed FFT designs. Table II relates the number of FPGA modules, the network topology, and the time to complete the all to all. The environment for this analysis consists of a number of FPGA nodes, each equipped with either four or six links running at 100 Gbps. Configurations marked "Switched" use Ethernet packet framing to route messages via 100 Gbps Ethernet switches to achieve single hop connections. PtoP configurations use point to point cables; the other topologies require on-FPGA switches and higher hop counts. As an example, a 4-dimensional hypercube of 16 nodes has an average hopcount of 2, because on average a destination node ID differs in only two bits from the source node ID.

The time to complete figures are given by

    t = D * ((N − 1)/N) * H / (B * N * L)

where D is the FFT data volume in bits, (N − 1)/N is the fraction of data that must be sent to a different board, H is the average hop count, B is the link bandwidth in bits per second, L is the number of links per board, and N is the number of boards.

The scaling behavior of this equation is such that, for a fixed topology, the time to completion goes as H/N, assuming equal link loading, uniform traffic, and perfect link scheduling. All to all certainly has uniform traffic, and there is a symmetry argument for uniform link loading. For a switched network, perfect scheduling of an all to all is straightforward: in round i, node n transmits to node (n + i) mod N. Point to point networks are also easy to schedule. As for multihop scheduling, the question is still open, but we are experimenting with static scheduling as described in Section IV-C1.
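The completion-time model is easy to tabulate. The C sketch below evaluates it for a few of the configurations in Table II; it is illustrative only, and uses the 78 Gbps effective link rate quoted in Section III-C1, which appears to reproduce the table entries.

```c
#include <stdio.h>

/* Time (seconds) for an all-to-all of d bits over n boards, each with    */
/* l links of b bits/s and an average hop count h:                        */
/*     t = d * ((n-1)/n) * h / (b * n * l)                                 */
/* Matches the scaling argument in the text: t goes as h/n.               */
static double all_to_all_time(double d, int n, double h, double b, int l) {
    return d * ((double)(n - 1) / n) * h / (b * n * l);
}

int main(void) {
    double d = 128.0 * 128 * 128 * 64;  /* 128^3 complex single precision = 134.2 Mbit */
    double b = 78e9;                    /* effective per-link rate (preliminary BSP)    */
    printf("2 boards, PtoP      : %.1f us\n", 1e6 * all_to_all_time(d, 2, 1.0, b, 4));  /* ~107.5 */
    printf("8 boards, Hypercube : %.1f us\n", 1e6 * all_to_all_time(d, 8, 1.5, b, 3));  /* ~94.1  */
    printf("64 boards, Switched : %.1f us\n", 1e6 * all_to_all_time(d, 64, 1.0, b, 4)); /* ~6.6   */
    return 0;
}
```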
D. Computation vs Communications

Table I and Table II identify consistent points in the solution space, for 8 through 128 pipelines, where computation performance is a good match for communications performance.
TABLE I
TIME FOR FFT SIZES VS NUMBER OF UNITS, AT 300 MHZ (µSEC)

XYZ    Data Points    Data Bits    1    2    4    8    16    32    64    128

TABLE II
NETWORK TIMING FOR ALL TO ALL (µSEC)

Nodes  Topology     Hopcount  Links  32x32x32  64x64x64  64x64x128  96x96x96  128x128x128
2      PtoP         1         4      1.7       13.4      26.9       45.4      107.5 (1)
4      PtoP         1         3      1.7       13.4      26.9       45.4      107.5 (1)
8      2D Torus     1.5       4      1.1       8.8       17.6       29.8      70.6
8      Hypercube    1.5       3      1.5       11.8      23.5       39.7      94.1
8      Hypercube++  1.25      4      0.9       7.4       14.7       24.8      58.8 (2)
8      3D Torus     1.5       3      1.5       11.8      23.5       39.7      47.1
8      Switched     1         4      0.7       5.9       11.8       19.8      47.1
16     2D Torus     2         4      0.8       6.3       12.6       21.3      50.4
16     3D Torus     2         6      0.5       4.2       8.4        14.2      33.6 (3)
16     Hypercube    2         4      0.8       6.3       12.6       21.3      50.4
16     Switched     1         4      0.4       3.2       6.3        10.6      25.2 (3)
32     2D Torus     3         4      0.6       4.9       9.8        16.5      39.1
32     3D Torus     2.5       6      0.3       2.7       5.4        9.2       21.7
32     Hypercube    2.5       5      0.4       3.3       6.5        11.0      26.0
32     Switched     1         4      0.2       1.6       3.3        5.5       13.0 (4)
64     2D Torus     4         4      0.4       3.3       6.6        11.2      26.5
64     3D Torus     3         6      0.2       1.7       3.3        5.6       13.2 (4)
64     Hypercube    3         6      0.2       1.7       3.3        5.6       13.2
64     Switched     1         4      0.1       0.8       1.7        2.8       6.6 (5)

Data within () reference similar entries in Table I.
1) Eight pipelines–split among either two or four FPGAs using point to point links.
2) 16 pipelines–split among eight nodes using a Hypercube++ topology that has additional links across the major diagonals.
3) 32 pipelines–split among 16 FPGAs, using a switched topology or a 3D torus.
4) 64 pipelines–split among 32 FPGAs with a switched topology, or among 64 FPGAs with a 3D torus.
5) 128 pipelines–128 FFT pipelines split across 64 boards with a switched topology.

This analysis is done for 128x128x128 transforms, but similar choices exist for other FFT sizes. Since the completion times are entirely bandwidth limited, they scale only with the total data volume, and the same points in the solution space apply to all sizes. However, for the smaller transforms, hardware unit latencies and communications latencies start to become important. These solution points permit balanced designs, in which no unit runs faster than necessary, thus minimizing hardware.
E. Long-Range Pipeline
In the previous subsections, we detailed the FFT pipeline and discussed scaling the FFT to multiple pipelines on one or more boards. We now describe the entire long-range force pipeline for molecular dynamics. The overall architecture of the long-range force pipeline is shown in Figure 3. The LR pipeline accepts atom positions and charges as input, and delivers long-range electrostatic force updates per atom back to the overall MD application. The LR pipeline is composed of three separate pieces of hardware, each of which is capable of doing several tasks in a time-interleaved fashion:
Fig. 3. Logical view of the long-range force pipeline.

• Charge spreading–transforms atom position and charge data into the FFT input dataset. The same hardware is used for force interpolation.
• Dual FFT–two 1D FFT pipelines, with transpose units for data reordering. The hardware is used three times: for the X and Y forward transforms, for the Z and Z⁻¹ transforms, and for the Y⁻¹ and X⁻¹ transforms. In the Z, Z⁻¹ pass, an additional multiplication by the Green's function applies the long-range kernel in the frequency domain.
• All to All network–this hardware is responsible for exchanging data between FFT pipeline instances and among multiple FPGA boards. It is used twice: to transpose data from the output of the forward Y transform to the input of the forward Z transform, and again to transpose data from the output of the Z⁻¹ transform to the input of the Y⁻¹ transform. Data is buffered in high bandwidth memory because the Z transform cannot start until the Y transform is finished.

The overall idea of this LR pipeline is to be able to subdivide the long-range problem into 1 to 128 parallel instances of the LR pipeline that can each process 8 input samples per cycle. With one processing pipeline, a 128-cube problem would take 786,432 cycles (the XY, Z Z⁻¹, and Y⁻¹ X⁻¹ passes at 262,144 cycles each), while with 128 pipelines and perfect efficiency the process would finish in 6,144 cycles. At 350 MHz, that would be about 18 microseconds.
The fastest 3D FFT for a GPU that we are aware of is 340 microseconds for forward and inverse 128³, measured on a V100 with cuFFT 9.1.
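The cycle estimate above can be captured in a small model; this is an illustrative back-of-the-envelope calculation with names of our own choosing, not code from the pipeline. It reproduces the 786,432 and 6,144 cycle figures quoted above and, with the per-row clock rates, the "Ideal" column of Table VI.

```c
/* Estimated LR-pipeline time: three passes (XY, Z Z^-1, Y^-1 X^-1) over an  */
/* n^3 grid at 8 samples per cycle per pipeline, ignoring latency and HBM    */
/* stalls.                                                                   */
double lr_pipeline_seconds(int n, int pipelines, double clock_hz) {
    double cycles_per_pass = (double)n * n * n / (8.0 * pipelines);
    return 3.0 * cycles_per_pass / clock_hz;
}
/* lr_pipeline_seconds(128, 1, 350e6)   ~ 2.25 ms  (786,432 cycles)          */
/* lr_pipeline_seconds(128, 128, 350e6) ~ 17.6 us  (6,144 cycles)            */
```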
1) Charge Spreading:
Any particular atom has a fractional position in X, Y, and Z somewhere between grid points in the 3D FFT input volume. The charge attached to the atom is spread to grid points in the 4x4x4 cell surrounding the atom's position. The performance of this step is critical. FFT grid sizes from 32³ up to about 128³ have been used successfully in MD; for problems of 50,000 to 100,000 atoms, the larger grids in this range seem the most effective. If we consider the 128³ case, there are 2,097,152 grid points. With an 8-wide vector FFT, that takes 262,144 cycles, divided by the number of available processing pipelines. For 128K atoms, that requires completing the charge spreading calculations for one atom in, on average, two cycles in order not to limit performance. To that end, we have built a 64-way parallel pipelined charge spreading unit that can process one atom per cycle, with additional cycles necessary for atoms whose range of influence crosses boundaries of the chunk size of the FFT pipeline input.
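As an illustration of the arithmetic involved, the following C sketch spreads one atom's charge onto its 4x4x4 neighborhood using order-4 (cubic) cardinal B-spline weights. It is a minimal host-side sketch, not the FPGA datapath; the grid layout, the NG constant, and the function names are assumptions for the example.

```c
#include <math.h>

#define NG 128                              /* grid side, e.g. a 128^3 charge grid */

/* Order-4 (cubic) B-spline weights for fractional offset f in [0,1).             */
static void bspline4(float f, float w[4]) {
    w[0] = (1.0f - f) * (1.0f - f) * (1.0f - f) / 6.0f;
    w[1] = (3.0f * f * f * f - 6.0f * f * f + 4.0f) / 6.0f;
    w[2] = (-3.0f * f * f * f + 3.0f * f * f + 3.0f * f + 1.0f) / 6.0f;
    w[3] = f * f * f / 6.0f;                /* the four weights sum to 1 */
}

/* Spread charge q located at grid coordinates (gx,gy,gz), given in units of      */
/* grid spacing, onto grid[] with periodic wrap-around.                           */
void spread_charge(float *grid, float q, float gx, float gy, float gz) {
    int ix = (int)floorf(gx), iy = (int)floorf(gy), iz = (int)floorf(gz);
    float wx[4], wy[4], wz[4];
    bspline4(gx - ix, wx);
    bspline4(gy - iy, wy);
    bspline4(gz - iz, wz);
    for (int k = 0; k < 4; k++)
        for (int j = 0; j < 4; j++)
            for (int i = 0; i < 4; i++) {
                int x = (ix - 1 + i + NG) % NG;   /* 4-point support around the atom */
                int y = (iy - 1 + j + NG) % NG;
                int z = (iz - 1 + k + NG) % NG;
                grid[x + NG * (y + NG * z)] += q * wx[i] * wy[j] * wz[k];
            }
}
```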
2) FFT Unit:
Described in section IV.
3) Force Interpolation:
The force interpolation unit shares the 64-way parallel arithmetic and BRAM units of the charge spreading hardware. It accepts streaming data from the output of the inverse FFT, along with atom position and charge data. 3D cardinal B-splines and their derivatives are used to calculate X, Y, and Z forces for each atom based on gradients of the electrostatic potential field calculated by the FFT. These updates are fed back to the main MD application. The output data from the FFT is temporarily buffered in 8x8x128 chunks in BRAM.

For A atoms and 3D FFT size N, the charge spreading step operates in O(A) time, the 3D FFT is O(N log N), and the force interpolation is again O(A). By choosing an FFT volume proportional to the number of atoms, and by using hardware to remove the log N term, the entire LR pipeline becomes O(A), a dramatic improvement over O(A²) pairwise methods of computing electrostatic forces.
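A matching sketch of the gradient side, continuing the same illustrative file as the charge spreading example above (it reuses bspline4 and NG from that sketch): the derivative weights of the cubic B-spline turn the inverse-FFT potential grid into per-atom force components.

```c
/* Derivatives of the order-4 B-spline weights with respect to f (they sum to 0). */
static void bspline4_deriv(float f, float d[4]) {
    d[0] = -(1.0f - f) * (1.0f - f) / 2.0f;
    d[1] = (3.0f * f * f - 4.0f * f) / 2.0f;
    d[2] = (-3.0f * f * f + 2.0f * f + 1.0f) / 2.0f;
    d[3] = f * f / 2.0f;
}

/* X component of the force on charge q at (gx,gy,gz): F_x = -q * dphi/dx,        */
/* where phi is the inverse-FFT potential grid and h is the grid spacing; the     */
/* Y and Z components permute which factor is differentiated.                     */
float force_x(const float *phi, float q, float h, float gx, float gy, float gz) {
    int ix = (int)floorf(gx), iy = (int)floorf(gy), iz = (int)floorf(gz);
    float dwx[4], wy[4], wz[4];
    bspline4_deriv(gx - ix, dwx);
    bspline4(gy - iy, wy);
    bspline4(gz - iz, wz);
    float grad = 0.0f;
    for (int k = 0; k < 4; k++)
        for (int j = 0; j < 4; j++)
            for (int i = 0; i < 4; i++)
                grad += dwx[i] * wy[j] * wz[k] *
                        phi[((ix - 1 + i + NG) % NG) +
                            NG * (((iy - 1 + j + NG) % NG) + NG * ((iz - 1 + k + NG) % NG))];
    return -q * grad / h;
}
```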
Fig. 4. Pipelined FFT input and output.

IV. SYSTEM IMPLEMENTATION
A. FFT
Intel provides an OpenCL computational kernel example using the feedforward parallel FFT of Garrido et al. [36], [38], and we took it as our starting point (this design is also used in [25]). It accepts vectors bit-reversed by lane and in order by vector. For a 32 point transform, there are four input vectors and four output vectors, offset by a 3 cycle latency, as shown in Figure 4.

The internal structure of the parallel FFT is shown in Figure 5. This example is 8-wide, compiled for a 64 point FFT. The design compiles a variable number of stages, corresponding to the base 2 logarithm of the transform size. As a result, the log N term in the FFT's O(N log N) is subsumed by hardware and the unit runs in O(N) time.

B. Bit Dimension Permutations
In order to assemble 1D FFT units into a pipelined 3D FFT, the sequencing of the data must be modified in order to deliver data to, and accept data from, the FFT units in the correct order. This problem is referred to as bit dimension permutations [39]. It is convenient to consider a generalized sample address such as X0 X1 X2 X3 X4 Y0 Y1 Y2 Y3 Y4 Z0 Z1 Z2 Z3 Z4, which represents a notional 32³ transform stored in memory with the X indices varying fastest. The low three bits, X0 X1 X2, determine the 8-wide vector lane. The high two bits, Z3 Z4, determine the pipeline unit of a four-pipe design.
Fig. 5. 8-wide, single-precision complex FFT compute unit (64 point), built from radix-2 butterflies, trivial rotators, complex rotators, and delay elements.
Fig. 6. Transpose unit.

The middle bits, X3...Z2, represent sequential storage locations in memory or sequential communications through a channel. The FFT processing pipeline is shown in Figure 3. For this processing pipeline, the generalized sample address must progress through a series of permutations mandated by ordering requirements. For memory operations, the sequencing is constrained by the data layout required by other parts of the system. To some extent, the memory address generator can change the order, but it cannot change address bits associated with a parallel datapath, and it should not change the low few bits of the memory address, in order to preserve maximally sequential addressing. The input transpose unit changes the ordering of the low 5 bits as required by the FFT hardware input ordering. The FFT output is in bit reversed order, which must also be modified to set up conditions for the following unit.

To accomplish these transformations, we have adopted a flexible transpose unit, shown in Figure 6. The 512 bit vector input proceeds through three ranks of 2-1 multiplexors forming a butterfly network. The first rank exchanges the high and low 256 bits. The second exchanges the 128 bit halves of each 256 bit group. The third rank exchanges 64 bit values in each 128 bit group. At the output of the third exchange, individual 64 bit complex values are written into 8 independent block RAMs. The input sides of the BRAMs share a common write address; the output side read addresses are independent. The 512 bit aggregate output then passes through another three ranks of multiplexors controlled by an output block select signal. This structure is able to perform nearly arbitrary bit dimension permutations of the input sequence. A standalone Python program uses a control file to generate the OpenCL source code necessary to control the hardware. The "nearly arbitrary" limitation is that this structure cannot transform, say, X0 X1 X2 ... into X2 X0 X1, in which the same bits are present in the parallel part of the datapath but in different locations. The limitation can be removed by using 5-stage Benes networks [40] at the input and output, but we have found that the three stage networks suffice for the permutations encountered in the 3D FFT. The main constraint is that each BRAM bank has one read port and one write port, so the scheduler must ensure that data read during the same cycle was written into different banks.
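The effect of a bit dimension permutation can be expressed purely on index bits. The C sketch below applies a permutation table to a sample address, which is how one can reason about the orderings the transpose unit must realize; the permutation shown is only an example, not one of our actual control files.

```c
#include <stdint.h>

/* Apply a bit-dimension permutation: output bit i takes the value of       */
/* input bit perm[i].  For a 32x32x32 volume the 15-bit address is          */
/* X0..X4 Y0..Y4 Z0..Z4 with X0 the least significant bit.                  */
uint32_t permute_bits(uint32_t addr, const int *perm, int nbits) {
    uint32_t out = 0;
    for (int i = 0; i < nbits; i++)
        out |= ((addr >> perm[i]) & 1u) << i;
    return out;
}

/* Example: exchange the roles of the X and Z dimensions (an XZ transpose), */
/* so that Z0..Z4 become the fast-varying lane bits.                        */
static const int xz_transpose[15] = {
    10, 11, 12, 13, 14,   /* new bits 0..4   <- Z0..Z4 */
     5,  6,  7,  8,  9,   /* new bits 5..9   <- Y0..Y4 */
     0,  1,  2,  3,  4    /* new bits 10..14 <- X0..X4 */
};
/* Usage: permute_bits(addr, xz_transpose, 15)                               */
```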
1) BRAM Only FFT:
Molecular modelling, at the scale of interest to us, does not use very much memory, but demands extraordinarily high memory bandwidth.
Fig. 7. BRAM FFT pipeline.
Fig. 8. All to all network: 4 pipelines on two FPGAs, with cables.

This suggests that it might be advantageous to keep the entire dataset for the FFT in FPGA block RAMs, rather than in HBM memory. We have designed an FFT processing pipeline in this style as well, as shown in Figure 7. This design uses a single FFT processing core and a single transpose unit, with the BRAM expanded to hold a complete slice of the FFT volume. Multiple pipelines are interconnected via the same All to All unit used elsewhere. The all-BRAM design requires three passes to complete a 3D FFT, but because it uses much less hardware than the HBM design, we can fit as many as 16 copies of the pipeline on a single FPGA for FFT sizes at or below 64³. For designs with 8 FPGAs and up, it becomes possible to fit an entire 128³ FFT into BRAM, and an all-BRAM FFT will permit substantial hardware savings. We report some performance results for all-BRAM designs in Section V.
C. All to All
The all to all network is responsible for the interchange of data among multiple processing pipelines, both when colocated on a single board and when distributed across multiple FPGAs. Figure 8 shows such a design for four processing pipelines on two FPGAs. In this case, two full-duplex interboard cables are used for point to point service, with each cable carrying serialized 256 bit words in each direction.

Figure 9 is a design for 8 pipelines distributed across four FPGAs. Each board has a direct connection to each other board. Similar to Figure 8, some connections remain on board, but in the 4 board example they are modelled as a loopback cable that connects a board to itself.

Fig. 9. All to all network: 8 pipelines on four FPGAs, with cables.
1) Router:
There is a large literature on interconnect networks and routing strategies, but we are taking a somewhat different tack. For the distributed molecular modelling application, we coordinate the activities of multiple FPGAs both through host communications and direct communications. For direct communications, we need nearest neighbor communications for the range-limited and bonded force pipelines, and for atom migration due to particle movement, but the bulk of communications is for the FFT all to all. For four boards, direct point to point cables suffice, but we must use an on-FPGA router for multihop cases. The prototype router uses a crossbar switch with buffering at each crosspoint, together with a statically compiled switch schedule. This permits the network to operate in a streaming mode without packet headers or frame boundaries. Links from application logic to the router use Intel's OpenCL channel extension, as do the off-board links themselves.

Because the all to all communication pattern is symmetric, switch scheduling reduces to a bin packing problem of packing message fragments into open channel time slots while managing the maximum buffer occupancy [41], [42]. There is no danger of livelock or deadlock, and no need for traditional techniques such as virtual channels, because all messages are known at compile time. Flow control and error recovery are provided by the vendor Board Support Package (BSP). We expect to report results in future work.
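For the single-hop (point to point or switched) cases, the static schedule reduces to the rotation already mentioned in Section III-C2. A C sketch, with node and round numbering that is illustrative only:

```c
#include <stdio.h>

/* Round-robin all-to-all schedule for n nodes: in round r, node s sends   */
/* its fragment for destination (s + r) % n.  Every link carries exactly   */
/* one fragment per round, so n-1 rounds move the (n-1)/n of the data that */
/* must leave each board, with no contention.                              */
static int partner(int src, int round, int n) {
    return (src + round) % n;
}

int main(void) {
    const int n = 4;                             /* e.g., four FPGA boards */
    for (int r = 1; r < n; r++)                  /* round 0 would be "send to self" */
        for (int s = 0; s < n; s++)
            printf("round %d: node %d -> node %d\n", r, s, partner(s, r, n));
    return 0;
}
```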
D. Long-Range Pipeline
An individual long-range processing pipeline is shown in Figure 10. This unit is replicated in order to achieve the desired performance. The FFT portion of the unit consists of two FFT cores and three transpose units. The FFT can accept input from the charge spreading unit or from HBM memory. The output of the second FFT can be streamed to the force interpolation unit, or sent to the All to All hardware and then buffered in HBM memory.
Fig. 10. Long-range pipeline hardware unit.
Within each processing pipeline, as soon as a single XY plane is complete, the second FFT core can begin processing Y direction transforms. As soon as a Y direction transform completes, the output data can stream to a pipelined all to all function, which redistributes data in preparation for the Z direction transforms. The output of the all to all cannot be directly streamed to a third FFT core, because the Z direction transforms cannot begin until all the Y direction transforms are complete. Instead, the all to all output is buffered in HBM memory. A second pass reuses one of the FFT cores to compute the Z direction transforms. The output of the Z direction transform is streamed to the Green's function multiplier, then to the Z⁻¹ transform, and then to a second pass through the All to All. A third pass implements the Y⁻¹ and X⁻¹ transforms and feeds the force interpolation unit. The full forward and inverse 3D FFT can be accomplished in three times the time shown in Table I.

E. Using OpenCL for FPGA programming
We chose OpenCL as the programming language for our implementation for several reasons: OpenCL allows more productive software development, and it allows us to be vendor-agnostic. Using OpenCL on the host side has also allowed us to reuse OpenMM's framework for launching kernels on the FPGA. OpenCL compilation for FPGAs transforms high-level source code into a dataflow graph and instantiates the necessary hardware. We approach FPGA coding in OpenCL with a hardware engineer's perspective: it is possible to visualize the dataflow hardware you want, as in Figure 5 or Figure 6, and then write fairly straightforward code to realize it. The tricky parts are cajoling the compiler into doing what you want. In the future, we may move some parts of our implementation into VHDL or Verilog for optimal resource utilization. OpenCL does permit linking to HDL (hardware description language, typically VHDL or Verilog) provided the HDL modules provide certain prescribed interfaces.
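As a small example of this dataflow style, the fragment below shows two kernels connected by an Intel OpenCL channel, the same mechanism we use between pipeline stages and for the off-board links. The kernel and channel names are illustrative and not taken from our source.

```c
// OpenCL C for Intel FPGAs: two kernels connected by a FIFO channel.
#pragma OPENCL EXTENSION cl_intel_channels : enable

channel float8 stage_ch __attribute__((depth(64)));    // FIFO between stages

__kernel void producer(__global const float8 *restrict in, int n) {
    for (int i = 0; i < n; i++)
        write_channel_intel(stage_ch, in[i]);           // stream one 8-wide vector per iteration
}

__kernel void consumer(__global float8 *restrict out, int n) {
    for (int i = 0; i < n; i++) {
        float8 v = read_channel_intel(stage_ch);        // blocking read from the FIFO
        out[i] = v * 2.0f;                              // placeholder computation
    }
}
```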
V. EVALUATION

We have developed three versions of the 3D FFT. All are written in OpenCL, with no Verilog or VHDL components. The first is a BRAM only design, with up to 16 processing pipelines per FPGA. The second is an HBM-based design with up to 8 processing pipelines per FPGA, where each pipeline includes two FFT units chained together. The third design is a full implementation of the FFT and inverse FFT for use in the long-range force pipeline; because the forward-backward FFT requires six 1D FFT passes, it can share hardware and runs in about 3/4 the time of two single transforms.

TABLE III
BRAM-BASED 3D FFT

Size       Pipes  Clock MHz  Runtime (µs)  Ideal (µs)  BRAM %  DSP %
32x32x32   1      290        59            42          2       3
32x32x32   8      243        8.5           6.3         21      18
32x32x32   16     260        4.4           3.27        28      52
64x64x64   16     275        24.5          22.3        49      58

A. Experimental Setup
Our hardware setup comprises 4 BittWare 520N-MX boards on a single hardware node. The hardware node has 2 8-core Intel Xeon Silver CPUs as well as 768 GB of memory, used mostly for OpenCL compilation jobs. The CPUs also serve as host processors for the OpenCL FFT programs. Each 520N-MX has 4 QSFP28 channels, each capable of communicating at a peak bandwidth of 100 Gb/s. Each board is connected to each other board using a QSFP28 channel. The fourth channels are connected in pairs to provide double bandwidth for two-board configurations. Each FPGA is configured with the p520_max_m210h BSP to allow OpenCL as the programming model for FPGA computation as well as communication between the different boards. Our application source code is compiled using Quartus release 19.3. The server runs CentOS 7.6. We use
SLURM to manage both compilation and hardwareresources.
B. BRAM-based FFT
We first present results for our BRAM-based FFT. In this version, the entire dataset fits into the BRAM of the FPGA. Results from this design are shown in Table III. We present single FPGA results for this design, with up to 16 processing pipelines, for FFT sizes 32³ and 64³. A 128³ BRAM-only design will not fit on our currently available hardware due to BRAM limitations. The 32³ version occupies 28% of the BRAMs and 52% of the DSP blocks and runs in 4.4 microseconds at 260 MHz. The 64³ version occupies 49% of the BRAMs and 58% of the DSP blocks and runs in 24.5 microseconds.

Because interboard communications are limited to 400 Gbps, one cannot use more than 8 processing pipelines in a two-board or four-board design. For larger systems, BRAM-based 128³ transforms become feasible because the amount of BRAM storage scales with the system size. The 128³ transform would require 134 megabits of BRAM, spread out over the entire system. As shown in Table III, the 64³ design uses 49% of the BRAMs on one FPGA, so the 128³, which has eight times the storage requirement, would fit on 8 or more FPGAs.

In Table III, the column labelled "Ideal" is the predicted runtime if the design were able to deliver results at exactly the compiled speed. There are two reasons for measured runtimes that are slower than ideal. First, loop dependencies may prevent the OpenCL compiler from generating a full dataflow design that can accept new operands every cycle; in OpenCL this is known as the initiation interval, and the ideal value is 1. All of our designs achieve this goal. Second, unit pipeline latency and data dependency latency in the transpose unit impose delays that occur once per pass through the hardware. These effects are identifiable because they affect small transforms such as 32³ much more than larger ones.
TABLE IV
HBM-BASED 3D FFT, SINGLE FPGA

Size         Pipes  Clock MHz  Runtime (µs)  Ideal (µs)  BRAM %  DSP %
32x32x32     1      375        38.6          21.8        2       6
32x32x32     2      373        17.8          11.0        4       13
32x32x32     4      322        11.1          6.4         8       26
32x32x32     8      306        8.23          3.35        15      52
64x64x64     1      370        376           177         3       7
64x64x64     2      386        133           85          5       15
64x64x64     4      328        61            50          11      29
64x64x64     8      295        38            28          21      58
64x64x128    1      395        528           332         4       8
64x64x128    2      363        277           180         8       17
64x64x128    4      325        124           101         15      33
64x64x128    8      284        92            58          30      66
128x128x128  1      376        2263          1394        5       9
128x128x128  2      370        1175          708         10      19
128x128x128  4      300        597           437         19      37
128x128x128  8      327        272           200         43      74

TABLE V
HBM-BASED 3D FFT, MULTIPLE FPGAS

Size         Boards  Pipes per board  Clock MHz  Runtime (µs)  Ideal (µs)
32x32x32     1       8                306        8.23          3.35
64x64x128    1       8                284        92            58
64x64x128    2       4                287        139           58
128x128x128  1       8                327        272           200
128x128x128  2       4                312        450           210

C. HBM-based FFT
The second design uses BRAM only for transpositions and stages data in HBM memory. We have run versions with up to 8 pipelines on a single FPGA, split between two FPGAs, or split among four FPGAs. Due to communications limits and available hardware, we cannot run more than 8 parallel pipelines. The eight pipeline version for 64³ runs in 34 microseconds, and the 128³ implementation in 272 microseconds. These performance figures are somewhat worse than those of the BRAM version, due to HBM memory latencies and some non-sequential accesses.
The forward and inverse pipeline for the full long-range force calculation uses the same structure, but adds a multiplier between the two FFT units and runs the hardware in three passes as shown in Figure 3. It is thus possible to run the complete FFT and inverse FFT for the long-range force pipeline in 50% more time than a single direction 3D FFT. Results for 1 and 2 board configurations with up to 4 pipelines total are shown in Table VI.
TABLE VI
LR PIPELINE FFT + IFFT

Size         Boards  Pipes  Clock MHz  Runtime (µs)  Ideal (µs)  Notes
32x32x32     1       1      363        64.5          33.8
32x32x32     1       2      360        27.5          11.4
32x32x32     1       4      328        17            6.3
32x32x32     2       1      363        44.5          11.3
32x32x32     2       2      321        28            6.4
128x128x128  1       1      350        3623          2247
128x128x128  1       2      357        1831          1101
128x128x128  1       4      316        981           622
128x128x128  2       1      322        1862          1221
128x128x128  2       2      312        972           630
128x128x128  1       4      316        686           622         HBM disabled
E. Discussion
The results shown in Table III show close agreement between the ideal results and the actual results, with the gap becoming smaller for larger problems. This is consistent with the effects of pipeline latency. We hope that, as OpenCL compilers improve, the pipeline delays will shrink to the minimums required by the datapath. These delays are about 150-200 cycles for memory fetch, 134 for the transpose units, and 11 for the FFT. Such improvements would be helpful for small transforms like 32³ but become much less important for 128³, since there is 64 times as much data.

There is a much larger performance gap between ideal and measured in our HBM results, shown in Tables IV–VI. We include a measurement in Table VI taken with HBM stores disabled, illustrating that HBM stores are responsible for the bulk of the gap between measured and ideal timing. This gap would be eliminated for an all-BRAM solution. The data in Table VI for 1 board and 4 pipelines (981 µs) versus 2 boards and 2 pipelines (972 µs) illustrate that distribution across multiple boards is scalable.

F. Performance comparison with other architectures
The 3D FFT, due to its wide application in many areas, has been benchmarked extensively on many architectures, including CPUs, GPUs, and ASICs. Many of the benchmarks focus on larger FFTs, but there is some public information on smaller FFTs applicable to MD. Table VII compares the performance of our FPGA FFT with CPUs, GPUs, and Anton. For converting timings to flops we use 5N³ lg(N³) flops for a complex FFT on an N³ grid.

TABLE VII
GFLOPS/S FOR MULTIPLE ARCHITECTURES

System              Citation        32x32x32  64x64x64  64x64x128  128x128x128
BRAM 16 pipe        Table III       559       963       -          -
HBM 8 pipe          Table IV        -         -         969        810
Anton-1, 512 nodes  [12]            664       1774      -          -
Novo-G 8 FPGA       [27]            109       139       -          180
V100 cuFFT          In-house tests  218       1358      1561       1247
56C Xeon 8280L MKL  [46]            180       400       500        610
BG/P 512 nodes      [43]            -         -         -          9

For GPU measurements, we have depended both on in-house experiments and on performance benchmarks from [10], [43]. Our GPU code uses the CUDA cuFFT library [44] for computing the FFT. We test this code on a single V100 GPU [45] with CUDA 11.1 compilers and libraries. Our in-house experiments on V100s with NVLINK2, as well as [10], show that using multiple GPUs does not improve performance for sizes up to 128³. Anton 1 [12] reports timings for both 32³ and 64³ on 512 nodes, which we have also included in the table. We also include CPU-based benchmarks in this table: [46] shows timings using Intel MKL and FFTW on a 56-core Intel Xeon Platinum processor; for 32³, 64³, and 128³, performance on that processor is approximately 200, 400, and 600 GFlops respectively. We have not found many public benchmarks on the performance of smaller distributed 3D FFTs. We have included timings on JUGENE, a BlueGene/P architecture [47]; [43] shows 12 milliseconds for a small 3D FFT on 512 BlueGene/P nodes. We have also included Novo-G's timings of distributed 3D FFT on 8 Stratix V FPGAs [27] for comparison. Compared to the other architectures in Table VII, we outperform all architectures for 32³ and 64³ except the 512 nodes of Anton 1, and V100 cuFFT for larger sizes.

There is, as far as we know, no magic to achieving excellent 3D FFT performance. It is a game of balancing computation, memory bandwidth, and communication. It should not be a surprise that custom ASICs can do well, nor that modern GPUs like the V100 can achieve more than a teraflop once the problem size grows large enough to sustain efficient memory access (the V100 has about 50% more flops than a Stratix 10 FPGA and almost double the memory bandwidth [17], [45]). The attractive features of FPGA designs are that they can run efficiently across a range of sizes, and that one can connect compute pipelines directly to communications resources, which means that one can relatively easily distribute a parallel implementation across multiple FPGAs.
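As a worked example of this conversion (an illustrative check using the flop-count convention above): a 32³ complex transform counts as 5 · 32768 · 15 ≈ 2.46 Mflop, so the 4.4 µsec BRAM result in Table III corresponds to about 559 GFlops/s; a 128³ transform counts as 5 · 2,097,152 · 21 ≈ 220 Mflop, so the 272 µsec HBM result in Table IV corresponds to about 810 GFlops/s.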
VI. CONCLUSION AND FUTURE WORK

In this paper, we demonstrated that FPGAs can compute 3D FFTs in a scalable way, even for the small transforms applicable to molecular dynamics, which are difficult to distribute. The results show that our architecture and implementation balance computation, memory bandwidth, and communications bandwidth to produce 3D FFT implementations that run efficiently across multiple FPGAs. Our implementation works for a variety of sizes, and is completely written in OpenCL for portability. Our results show that we outperform or are competitive with a wide variety of architectures including CPUs, GPUs, and ASICs.

The goal of our work is to achieve strong scaling for the FFT and the long-range pipeline for molecular dynamics on multiple FPGAs. We plan to grow our FPGA cluster to 8 and subsequently 16 FPGAs, and to present results on the scalability of both the FFT and the long-range pipeline. A larger cluster will also allow us to use BRAM only on the FPGAs for the 128³ transform as well. We also plan to explore avenues such as reducing precision for communications, exploring different layouts of the FFT dataset, and linking VHDL/Verilog code with OpenCL.

REFERENCES
[1] R. Salomon-Ferrer, D. A. Case, and R. C. Walker, "An overview of the AMBER biomolecular simulation package," WIREs Computational Molecular Science, vol. 3, no. 2, pp. 198–210, 2013. [Online]. Available: https://onlinelibrary.wiley.com/doi/abs/10.1002/wcms.1121
[2] P. Eastman et al., "OpenMM 7: Rapid development of high performance algorithms for molecular dynamics," PLoS Computational Biology, vol. 13, no. 7, p. e1005659, 2017.
[3] R. Halver, J. H. Meinke, and G. Sutmann, "Kokkos implementation of an Ewald Coulomb solver and analysis of performance portability," Journal of Parallel and Distributed Computing, vol. 138, pp. 48–54, 2020.
[4] C. Predescu, A. K. Lerer, R. A. Lippert, B. Towles, J. Grossman, R. M. Dirks, and D. E. Shaw, "The µ-series: A separable decomposition for electrostatics computation with improved accuracy," arXiv preprint arXiv:1911.01377, 2019.
[5] U. Essmann, L. Perera, M. L. Berkowitz, T. Darden, H. Lee, and L. G. Pedersen, "A smooth particle mesh Ewald method," The Journal of Chemical Physics, vol. 103, no. 19, pp. 8577–8593, 1995.
[6] H. J. Berendsen, D. van der Spoel, and R. van Drunen, "GROMACS: a message-passing parallel molecular dynamics implementation," Computer Physics Communications, vol. 91, no. 1-3, pp. 43–56, 1995.
[7] J. C. Phillips et al., "Scalable molecular dynamics with NAMD," Journal of Computational Chemistry, vol. 26, no. 16, pp. 1781–1802, 2005.
[8] T. D. Kühne et al., "CP2K: An electronic structure and molecular dynamics software package - Quickstep: Efficient and accurate electronic structure calculations," The Journal of Chemical Physics, vol. 152, no. 19, p. 194103, 2020.
[9] A. Gray, "Creating faster molecular dynamics simulations with GROMACS 2020," devblogs.nvidia.com, 2020. [Online]. Available: https://devblogs.nvidia.com/creating-faster-molecular-dynamics-simulations-with-gromacs-2020
[10] Multi-GPU FFT Performance on Different Hardware Configurations, GTC Silicon Valley 2019, May 2019.
[11] I. Ohmura, G. Morimoto, Y. Ohno, A. Hasegawa, and M. Taiji, "MDGRAPE-4: a special-purpose computer system for molecular dynamics simulations," Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences, vol. 372, no. 2021, p. 20130387, 2014.
[12] D. E. Shaw, M. M. Deneroff et al., "Anton, a special-purpose machine for molecular dynamics simulation," Commun. ACM, vol. 51, no. 7, pp. 91–97, Jul. 2008. [Online]. Available: https://doi.org/10.1145/1364782.1364802
[13] D. E. Shaw, J. Grossman et al., "Anton 2: raising the bar for performance and programmability in a special-purpose molecular dynamics supercomputer," in SC '14: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis. IEEE, 2014, pp. 41–53.
[14] M. A. Khan, M. Chiu, and M. C. Herbordt, "FPGA-Accelerated Molecular Dynamics," Springer, 2013.
[15] M. Chiu and M. C. Herbordt, "Molecular Dynamics Simulations on High-Performance Reconfigurable Computing Systems," ACM Trans. Reconfigurable Technol. Syst., vol. 3, no. 4, Nov. 2010. [Online]. Available: https://doi.org/10.1145/1862648.1862653
[16] C. Yang et al., "Fully integrated FPGA molecular dynamics simulations," in Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, 2019, pp. 1–31.
[17] Intel® Stratix® Device Datasheet, Intel Corporation, March 2020.
[18] A. George et al., "Novo-G#," HPEC, 2016.
[19] A. Putnam, "A Reconfigurable Fabric for Accelerating Large-Scale Datacenter Services," in Proc. International Symposium on Computer Architecture, 2014, pp. 13–24.
[20] C. Plessl, "Bringing FPGAs to HPC production systems and codes," in Fourth International Workshop on Heterogeneous High-performance Reconfigurable Computing, workshop at Supercomputing, 2018.
[21] R. Kobayashi et al., "OpenCL-ready high speed FPGA network for reconfigurable high performance computing," in Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region, 2018, pp. 192–201.
[22] T. Sasaki, K. Betsuyaku, T. Higuchi, and U. Nagashima, "Reconfigurable 3D-FFT Processor for the Car-Parrinello Method," Journal of Computer Chemistry, Japan, vol. 4, no. 4, pp. 147–154, 2005.
[23] C.-L. Yu, K. Irick, C. Charkrabarti, and V. Narayanan, "Multidimensional DFT IP Generator for FPGA Platforms," IEEE Trans. Circuits and Systems I, vol. 58, no. 4, 2011.
[24] B. Humphries, H. Zhang, J. Sheng, R. Landaverde, and M. Herbordt, "3D FFT on a Single FPGA," in Proc. Field Programmable Custom Computing Machines, 2014.
[25] A. Ramaswami, FFT3D for FPGA (CP2K), 2019, http://github.com/pc2/fft3d-fpga.
[26] J. Sheng, B. Humphries, H. Zhang, and M. C. Herbordt, "Design of 3D FFTs with FPGA clusters," 2014, pp. 1–6.
[27] A. Lawande, "A Reconfigurable Interconnect for Large-Scale FPGA Applications and Systems," Ph.D. dissertation, University of Florida, 2016.
[28] J. Sheng, C. Yang, A. Caulfield, M. Papamichael, and M. Herbordt, "HPC on FPGA Clouds: 3D FFTs and Implications for Molecular Dynamics," in Proc. Field Programmable Logic and Applications, 2017.
[29] J. Lee, L. Shannon, M. J. Yedlin, and G. F. Margrave, "A multi-FPGA application-specific architecture for accelerating a floating point Fourier integral operator," IEEE, 2008, pp. 197–202.
[30] L. Greengard and V. Rokhlin, "A fast algorithm for particle simulations," Journal of Computational Physics, vol. 135, no. 2, pp. 280–292, 1997.
[31] U. Essmann et al., "A smooth particle mesh Ewald method," The Journal of Chemical Physics, vol. 103, no. 19, pp. 8577–8593, 1995. [Online]. Available: https://doi.org/10.1063/1.470117
[32] R. Tolimieri, M. An, and C. Lu, Algorithms for Discrete Fourier Transform and Convolution. New York: Springer-Verlag, 1989.
[33] M. Frigo and S. G. Johnson, "The design and implementation of FFTW3," Proceedings of the IEEE, vol. 93, no. 2, pp. 216–231, 2005.
[34] D. Pekurovsky, "P3DFFT: A framework for parallel computations of Fourier transforms in three dimensions," SIAM Journal on Scientific Computing, vol. 34, no. 4, pp. C192–C209, 2012.
[35] D. Harris, J. McClellan, D. Chan, and H. Schuessler, "Vector radix fast Fourier transform," in ICASSP '77, IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 2. IEEE, 1977, pp. 548–551.
[36] M. Garrido, J. Grajal, M. A. Sanchez, and O. Gustafsson, "Pipelined radix-2^k feedforward FFT architectures," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 21, no. 1, pp. 23–32, 2013.
[37] BittWare 520N-MX Datasheet, BittWare Corporation, February 2020.
[38] OpenCL 2D Fast Fourier Transform Design Example, Intel Corporation, April 2019.
[39] M. Garrido, J. Grajal, and O. Gustafsson, "Optimum circuits for bit-dimension permutations," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 27, no. 5, pp. 1148–1160, 2019.
[40] J. Lenfant, "Parallel Permutations of Data: A Benes Network Control Algorithm for Frequently Used Permutations," IEEE Transactions on Computers, vol. C-27, no. 7, pp. 637–647, 1978.
[41] F. Annexstein and M. Baumslag, "A unified approach to off-line permutation routing on parallel networks," in Proceedings of the Second Annual ACM Symposium on Parallel Algorithms and Architectures, 1990, pp. 398–406.
[42] H. Subramoni et al., "Designing topology-aware communication schedules for alltoall operations in large InfiniBand clusters," 2014, pp. 231–240.
[43] A. Sunderland et al., "An analysis of FFT performance in PRACE application codes," PRACE Whitepaper, 2012.
[44] NVIDIA® CUDA FFT, November 2019.
[45] NVIDIA® V100, NVIDIA Corporation, January 2020.
[46] Intel Math Kernel Library Performance Benchmarks, Intel Corporation, 2019.
[47] IBM Journal of Research and Development staff, "Overview of the IBM Blue Gene/P project."