High-Level FPGA Accelerator Design for Structured-Mesh-Based Explicit Numerical Solvers
Kamalavasan Kamalakkannan, Gihan R. Mudalige, Istvan Z. Reguly, Suhaib A. Fahmy
Kamalavasan Kamalakkannan, Gihan R. Mudalige
Dept. of Computer Science, University of Warwick, UK
{kamalavasan.kamalakkannan, g.mudalige}@warwick.ac.uk

Istvan Z. Reguly
Faculty of Information Technology & Bionics, Pazmany Peter Catholic University, Hungary
[email protected]

Suhaib A. Fahmy
King Abdullah University of Science and Technology (KAUST), Thuwal, Saudi Arabia
[email protected]
Abstract—This paper presents a workflow for synthesizing near-optimal FPGA implementations for structured-mesh based stencil applications for explicit solvers. It leverages key characteristics of the application class, its computation-communication pattern, and the architectural capabilities of the FPGA to accelerate solvers from the high-performance computing domain. Key new features of the workflow are (1) the unification of standard state-of-the-art techniques with a number of high-gain optimizations such as batching and spatial blocking/tiling, motivated by increasing throughput for real-world workloads, and (2) the development and use of a predictive analytic model for exploring the design space, resource estimates and performance. Three representative applications are implemented using the design workflow on a Xilinx Alveo U280 FPGA, demonstrating near-optimal performance and over 85% predictive model accuracy. These are compared with equivalent highly-optimized implementations of the same applications on modern HPC-grade GPUs (Nvidia V100), analyzing time to solution, bandwidth and energy consumption. Performance results indicate equivalent runtime performance of the FPGA implementations to the V100 GPU, with over 2× energy savings, for the largest non-trivial application synthesized on the FPGA compared to the best performing GPU-based solution. Our investigation shows the considerable challenges in gaining high performance on current generation FPGAs compared to traditional architectures. We discuss determinants for a given stencil code to be amenable to FPGA implementation, providing insights into the feasibility and profitability of a design and its resulting performance.

Index Terms—FPGAs, Stencil Applications, Explicit solvers
I. INTRODUCTION
Field Programmable Gate Arrays (FPGAs) have become highly attractive as accelerator architectures by virtue of their high performance, low power consumption, and re-programmability. As a result, FPGAs have gained a foothold in a wider range of application domains such as cyber security [1], databases [2], and deep learning [3]. In recent years, the integration of FPGAs as first-class accelerator platforms has also attracted significant interest in the high-performance (HPC) and scientific computing community, particularly in the financial computing domain [4]. They have also emerged as potential accelerator platforms for cloud computing [5]. However, a key limitation has been the design effort needed to produce performant accelerators for FPGAs, requiring hardware expertise and an alternative approach to programming that is more data-flow oriented. Commercial FPGA vendors have attempted to address this problem with high level synthesis (HLS) tools that can translate programs written in standard high-level languages such as C/C++ or OpenCL. However, these tools still require low level modification of code to produce accelerators with optimum performance.

One solution to this problem is to leverage key characteristics of an application, its computation-communication patterns or motifs, to explore the design space on hardware. Once the best optimization strategy for a given motif is identified for the target hardware, it can be used as a design template for similar applications, even going so far as to create higher-level frameworks such as DSLs that can automatically generate the accelerator implementations. Such a strategy has become an important technique in developing performance portable massively parallel HPC applications given the increasing diversity of processor architectures [6]–[9].

In this paper we apply such an analysis to the domain of structured-mesh-based explicit numerical solvers, characterized by stencil computations, targeting FPGAs. These codes frequently appear as the core motif in solvers for partial differential equations (PDEs). As such, they are used in applications from a wide range of fields, including computational fluid dynamics (CFD), hydro-dynamics, financial computing, and oil/gas exploration simulations. The key characteristic is the loop over a "rectangular" multi-dimensional set of mesh points using one or more "stencils" to access data. This is in contrast to unstructured meshes that require explicit connectivity information [7] between neighboring mesh elements via mappings.

Considerable previous research has developed a range of strategies to synthesize optimized FPGA implementations for stencil codes [10]–[20]. Most recent works utilize HLS tools, usually compiling OpenCL, and target both 2D and 3D stencil applications. They develop a number of standard techniques, ranging from basic methods such as cell-parallel/vectorization and unrolling the iterative loop, to more complex transformations such as spatial/temporal blocking (tiling), in order to best utilize FPGA resources for maximum performance. However, many of these previous works target optimizations specific to an application in isolation without developing a design strategy that can be applied to other stencil codes. While some [12], [16] do attempt to generalize accelerator implementations for stencil codes, they only target simpler stencil applications without exploiting higher-gain optimizations.
A key gap is the lack of a unifying design strategy, particularly one focusing on realistic applications. As such, an open question remains regarding how to optimize implementation of stencil applications on FPGAs, and how characteristics of the application and various optimizations determine performance compared to traditional CPU and GPU architectures. To this end, in this paper we present an initial proposal and unifying workflow for designing near-optimal FPGA implementations for structured-mesh based stencil applications, together with analytical models that enable exploration of the design space for stencil accelerators on modern FPGAs. Specifically, we make the following contributions:

• We propose an implementation template, and an accompanying step-wise optimization strategy, for conversion of structured-mesh, explicit, iterative stencil applications to FPGA accelerators. Given the hardware resource constraints, we focus on features of the application that are amenable for FPGA implementation and optimizations for gaining near-optimal performance. A key optimization, novel in this area, is the batched execution of multiple independent stencil problems on an FPGA.

• Targeting current generation Xilinx FPGAs, we present the design and optimization of three contrasting, representative explicit stencil solvers, comparing a range of alternatives based on resource and performance trade-offs. The applications include both 2D and 3D stencil solvers and multiple stencil loops.

• We develop a predictive analytic model that provides estimates for determining the feasibility of implementing a given stencil application on an FPGA using the proposed design strategy. The model calculates the resource requirements considering the optimizations implemented, together with memory requirements and limitations on operating frequency. It predicts the runtime of the resulting FPGA synthesis of the application accurate to within ±15% of the achieved runtime.

• Finally, the runtime, bandwidth and power/energy performance of the FPGA implementations developed with the proposed strategy are compared with highly optimized implementations on a traditional accelerator architecture, a modern Nvidia GPU.

Initial results on current generation Xilinx hardware demonstrate competitive performance compared to the best performance achieved for the same applications on traditional (in this case GPU) architectures using single precision floating-point (SP) representations. To our knowledge, such an extended workflow for stencil application development accompanied by a predictive model has not been previously presented. We believe that the proposed approach provides a promising strategy for use in industrial workloads from areas such as financial computing, standardizing the development cycle for these platforms.

The rest of the paper is organized as follows: Section II begins with a background on structured-mesh stencil applications together with previous work that explored FPGA implementations for this class of applications. Section III presents our proposed design strategy for implementing iterative stencil codes for FPGAs, as a step-by-step methodology, starting from the basic stencil loops, down to target FPGA code for the Xilinx Alveo U280 accelerator board. Further optimizations for the target synthesis are discussed in Section IV. The design including the advanced optimizations is then explored through the development of an analytic performance model. A performance analysis and benchmarking of the FPGA implementations compared against the performance of state-of-the-art optimized CPU and GPU implementations is presented in Section V. Finally, conclusions are presented in Section VI.

II. BACKGROUND
The key characteristic of structured-mesh stencil computations is loops over a "rectangular" multi-dimensional set of mesh points using one or more fixed data access patterns, called stencils, to access data. The main motivating numerical method here is the solution of Partial Differential Equations (PDEs), specifically based on the finite difference method. These techniques are used extensively in computational fluid dynamics (CFD) and computational electromagnetics (CEM) in the form of iterative solvers. For example, the finite difference scheme for the solution of a generic PDE can be given by the 2D explicit equation (1):

U^{t+1}_{x,y} = a U^t_{x-1,y} + b U^t_{x+1,y} + c U^t_{x,y-1} + d U^t_{x,y+1} + e U^t_{x,y}    (1)

Here, U is a 2D mesh and a, b, c, d, and e are coefficients. In this example U is accessed at spatial mesh points (x-1,y), (x+1,y), (x,y-1), (x,y+1), and (x,y), which forms a five point stencil. In an explicit scheme the computation iterates over the full rectangular mesh, updating the solution at each mesh point, for the current time step, t+1, using the solution from the previous time step, t. The time step iteration usually continues until a steady state solution is achieved. There is a data dependency for the computations among multiple time step iterations, but no dependency within the spatial iterations. As such, each mesh point calculation within a time iteration can be computed in parallel. In contrast, an implicit scheme would update the solution at the current time step using values from the same time step, further introducing a data dependency within the spatial iterations. This leads to a much faster convergence to the final solution, but enforces an order in which a computation iterates over the mesh, leading to limited parallelism. While both explicit and implicit schemes are equally used in production settings in scientific computing applications, we focus on explicit iterative solvers in this paper given their simplicity and higher parallelism.
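As a concrete illustration, one time step of (1) is simply a pair of nested loops over the interior of the mesh. The following minimal C++ sketch (array and function names are ours, not from the original solvers) captures the access pattern that the FPGA designs discussed later must reproduce:

    // One time step of the 5-point explicit update in (1) on an m x n mesh.
    // u holds time step t, u_new receives time step t+1; a..e are the coefficients.
    void explicit_step_2d(const float* u, float* u_new, int m, int n,
                          float a, float b, float c, float d, float e) {
      for (int y = 1; y < n - 1; y++) {       // interior rows
        for (int x = 1; x < m - 1; x++) {     // interior columns
          u_new[y * m + x] = a * u[y * m + (x - 1)]
                           + b * u[y * m + (x + 1)]
                           + c * u[(y - 1) * m + x]
                           + d * u[(y + 1) * m + x]
                           + e * u[y * m + x];
        }
      }
      // The time loop swaps u and u_new and repeats until a steady state is reached.
    }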
A. Related Work

Many previous works have targeted FPGAs for stencil computations. Early works [21]–[23] used Hardware Description Languages (HDL) for describing the architectures. However, the process required extensive hardware knowledge and a time consuming development cycle. The introduction of High-Level Synthesis (HLS) tools has significantly improved developer productivity and time to design. As such, more recent work [11]–[13], [16], [17] has all utilized HLS tools for implementing FPGA designs for stencil computations. As FPGAs have advanced to incorporate a variety of high bandwidth interfaces and memory types, the system level view of an accelerator architecture has become more important to achieving overall high performance.

The most comprehensive implementation workflow and optimization methodology to date is by Waidyasooriya et al. in [12], [16]. The authors use OpenCL and propose an optimization strategy for stencil applications targeting Intel FPGAs. A number of 2D and 3D stencil applications are developed through the above strategy, demonstrating up to 950 GFLOPS of achieved computational performance on Intel FPGAs. Runtime and bandwidth performance are compared to conventional GPU and multi-core CPU implementations. The work, however, limits the investigation to applications with only a single stencil loop over the mesh. Multiple stencil loops within a single time-step iterative loop are not considered.

A previous implementation of the 3D Reverse Time Migration (RTM) application, which has similarities to the RTM application we develop later in this paper, can be found in [10]. The implementation uses early-generation Xilinx FPGAs, prior to the introduction of HLS tools, but with designs equivalent to the techniques we use in this paper. The work in [14] uses Intel FPGAs with a design goal to enable unrestricted input sizes for stencil computations without sacrificing performance. They combine spatial and temporal blocking to avoid input size restrictions, and employ multiple FPGA-specific optimizations to tackle the added design complexity. The same authors apply these techniques to higher-order stencils in [15]. The use of spatial and temporal blocking is novel, which our work in this paper also addresses, but we extend it to variable sized tiling and multi-port implementations, generalizing the technique and incorporating it into our overall design workflow.

A number of previous works have also utilized high-level frameworks for generating efficient FPGA accelerators. The SDSLc framework [11] presents the use of source-to-source translation for generating parallel executables for a range of hardware platforms. These include CPUs, GPUs and FPGAs. The paper details optimizations such as iterative loop unrolling and full data reuse within FPGAs. Similarly, the SODA framework [13] performs several optimizations including perfect data reuse by minimal reuse buffers and data quantization. Additionally it models the performance and predicts resource consumption, significantly reducing design time. The authors present competitive performance with multi-core CPU implementations and state-of-the-art stencil implementations on FPGAs. The main limitations of the work are fixed tile size and host based tiling. Due to the DSL's support of only declarative programming, it is not clear whether any limitations exist for porting of complex kernels using SODA.

The more recent HeteroCL framework [17] addresses image processing applications. It also supports stencil applications through a SODA back-end.
The HeteroCL DSL separates algorithm from compute, schedules and determines data types, and automatically translates SODA DSL to reflect the iteration factor, unroll factor and other parameters such as data width. A deep single kernel pipeline generated using the above frameworks usually suffers from routing congestion in modern large FPGAs from Xilinx that incorporate multiple Super Logic Regions (SLRs). In [18], Dohi et al. use the proprietary MaxCompiler and MaxGenFD high-level design tools to implement finite-difference equations. The work is limited to Maxeler Technologies FPGA platforms and does not compare results with other FPGAs, GPUs or CPUs. The authors of [19] use the polyhedral model and implement a related framework to automatically accelerate iterative stencil loops on a multi-FPGA system. In contrast, [20] develop a Scalable Streaming Array to implement stencil computations on multiple FPGAs, using a DSL, achieving reduced development time and near-peak performance. Automatic code generation is also used in [24] and has similarities to our design in this paper. However, it mainly focuses on non-iterative applications with multiple kernels, hence spanning designs over multiple FPGAs. Batching and tiling optimizations are not attempted.

In contrast to the above work, this paper presents a unifying strategy for the development of FPGA implementations of both 2D and 3D stencil applications, including multi-dimensional mesh elements and multiple stencil loops. We incorporate many of the optimization techniques in previous works that have usually been applied in isolation or on a single application. Additionally we introduce a number of further optimizations, such as batching, to achieve higher throughput in real-world/production workloads and settings. We also present a predictive analytic model that estimates the feasibility of implementing a given stencil application on a given FPGA platform. We compare the performance of the FPGA accelerators to equivalent highly-optimized implementations of the same applications on modern HPC-grade GPUs, analyzing time to solution, bandwidth, and energy consumption. To our knowledge, the 3D RTM application developed in this work, using HLS tools for the FPGA with our design flow, motivated by real-world stencil codes, is also novel, being one of the few non-trivial applications presented in the literature. Our investigation uncovers the determinants for a given stencil code to benefit from an FPGA implementation, providing insights into the feasibility of a design and its resulting performance.

III. ACCELERATOR DESIGN
FPGAs differ from traditional CPUs and GPUs as they do not present a fixed general purpose architecture to be exploited using software. A program, made up of a sequence of instructions, is executed on a fixed CPU or GPU architecture that does not change. In contrast, an FPGA must be configured with an architecture that implements the data flow computation for a specific task. This leads to significantly reduced energy consumption compared to CPUs and GPUs, primarily due to the locality of data movement within the mapped architecture rather than via multiple reads and writes to various stages of the memory hierarchy. The reconfigurability of FPGAs offers a significant advantage over the design of custom Application Specific Integrated Circuits (ASICs), which is much more time-consuming and costly and leads to a fixed architecture that cannot be modified post-design. FPGAs comprise a variety of basic circuit elements, including ample look-up-tables (LUTs) and registers, large numbers of digital signal processing (DSP) blocks on modern devices,
block memories (BRAM/URAM), clock modules, and a rich routing fabric to connect these elements into a large logical accelerator architecture. While these resources are primarily suited to implementation of fixed point integer data-paths, they can be used to implement floating point data paths too, though these typically consume significantly more resources for the same computation. In modern Xilinx FPGAs, the overall die is split into a number of regions called Super Logic Regions (SLRs) [25]. Bandwidth within an SLR is extremely high (TB/s) due to the wealth of connections and memory elements, while between SLRs it is limited by the number of silicon connections available. The BRAMs and URAMs that reside in SLRs provide high-speed, small blocks of on-chip memory, typically 10s of MB in total; recent devices like the U280 have close coupled High Bandwidth Memory (HBM) (8 GB on the U280) connected to multiple SLRs. HLS tools can typically connect multiple on-chip BRAMs or URAMs to obtain a larger block of memory. An FPGA board will also include much larger, but slower, DDR4 (32 GB on the U280) as external memory. Managing the movement of data between these different types of memory is key to achieving high computational performance. The performance of an FPGA architecture is hard to predict, as it is impacted by various design characteristics beyond the level of parallelization applied. As a design grows and begins to occupy a larger portion of the FPGA, routing (i.e. connecting all the circuit elements together) becomes more challenging, and can reduce the achievable clock frequency and hence overall performance.

To achieve high computational throughput on FPGAs, a custom architecture is designed that is then implemented using the low-level circuit elements described above. A data-flow arrangement seeks to map a complex computation to a series of data-paths that implement the required computational steps, with movement of data managed by direct connections. Compared to fixed CPU and GPU architectures where the steps in an algorithm are computed sequentially with intermediate results stored to registers, FPGA compute pipelines can be much deeper and more irregular parallelism can be exploited. Performing a stencil computation will then involve starting up the pipeline (requiring some clock cycles equal to the pipeline depth) and outputting the result from the computation for each mesh point per clock cycle as a pipelined execution.

For CPU/GPU architectures such a computation is implemented using nested loops, iterating over the mesh and over the neighborhood points. On FPGAs these multiple levels of loops can be unrolled. Retaining an outer loop can be costly due to the need to flush the unrolled inner loop pipeline, which can be long. Hence, multi-dimensional nested loops should be flattened to a 1D loop either manually or by using HLS directives such as loop_flatten. We have observed that manual flattening still provides the best performance and optimized resource utilization, as current Xilinx HLS compilers can make pessimistic scheduling decisions.

A key approach to gaining the best performance from the above computational pipeline is streaming data from/to external and near-chip memories to/from on-chip BRAMs/URAMs to feed the computational pipelines efficiently. A perfect data reuse path can be created by (1) using a First-In-First-Out (FIFO) buffer to fetch data from DDR4/HBM memory without interruption (allowing burst transfers) to on-chip memory, and then (2) by caching mesh points using the multiple levels of memory, from registers to BRAM/URAM. Fig. 1 illustrates such a data path for a 2D, 2nd order stencil. This technique has previously been referred to as window buffers [10]. A 2D, D order stencil requires D rows to be buffered to achieve perfect data reuse. Similarly, D planes should be buffered for a 3D stencil. The total number of mesh elements needed to be buffered is the maximum number of mesh elements between any two stencil points. BRAMs/URAMs can be used to design window buffers by using cyclic buffering. Given their high capacity, URAMs are preferred if the number of elements to be buffered is large.

Multiple pipelines for the same computation (i.e. loop body or kernel) can be created using HLS directives. This technique, called the cell-parallel method in [12], allows computation of the stencil on multiple mesh points simultaneously, provided that there are no data dependencies, which is the case for explicit schemes. The cell-parallel method is similar to SIMD vectorization on CPUs and SIMT on GPUs, but on an FPGA it essentially creates parallel replicas of the computational units as opposed to single vector operations. However, the resource availability in an FPGA limits the number of parallel units that can be synthesized on a given device. Fig. 1 illustrates a factor of 2 implementation, where the vectorization factor represents the number of mesh points updated in parallel.

Fig. 1: Window buffer and factor of 2 vectorization.

Another approach that can increase performance when mapping to FPGAs is to unroll the iterative loop, which encompasses one or more stencil loops over the rectangular mesh. This allows the results from a previous iteration to be fed to the next iteration without writing back to external (DDR4 or HBM) memory. This scheme, called the step-parallel technique in previous work [12], is illustrated in Fig. 2. The technique leads to increased throughput without the need for additional external memory bandwidth. However, the unrolling factor depends once more on available FPGA resources and internal memory capacity. Cutting down on external memory access in this manner also leads to more power-efficient designs. One disadvantage, however, is the increased length of the computational pipeline, which significantly affects performance for small mesh sizes.

Fig. 2: Unrolling the iterative loop.
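To make the window-buffer structure concrete, the following Vivado HLS C++ sketch streams a 2D mesh through a 5-point stencil using two cyclic row buffers and a small register window. This is a simplified, single-point-per-cycle illustration with our own names; vectorization and boundary handling are deliberately omitted:

    #include <hls_stream.h>

    const int MAX_M = 4096;                    // widest supported mesh row

    // 5-point stencil over an m x n mesh, streamed one point per clock cycle.
    // Two full rows plus a 3x3 register window are buffered on chip so every
    // input element is read from external memory exactly once.
    void stencil_stream(hls::stream<float>& in, hls::stream<float>& out,
                        int m, int n, float a, float b, float c, float d, float e) {
      static float row0[MAX_M], row1[MAX_M];   // cyclic row buffers (BRAM/URAM)
      float win[3][3];                         // registers around the centre point

      int total = m * n + m + 1;               // extra cycles flush the pipeline
      for (int idx = 0; idx < total; idx++) {
    #pragma HLS PIPELINE II=1
        int x = idx % m;
        // shift the register window left and pull in a new column
        for (int r = 0; r < 3; r++) {
          win[r][0] = win[r][1];
          win[r][1] = win[r][2];
        }
        float v = (idx < m * n) ? in.read() : 0.0f;
        win[0][2] = row0[x];                   // element two rows above
        win[1][2] = row1[x];                   // element one row above
        win[2][2] = v;                         // newest element
        row0[x] = row1[x];
        row1[x] = v;
        // once the window holds a full neighbourhood, emit one result per cycle;
        // the point being updated is the window centre (one row and column behind)
        if (idx >= m + 1) {
          out.write(a * win[1][0] + b * win[1][2] +
                    c * win[0][1] + d * win[2][1] + e * win[1][1]);
        }
      }
    }

Each step-parallel stage of Fig. 2 would instantiate one such module, with the output stream of one stage feeding the input stream of the next.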
A. Model for the Baseline Design

The performance of a baseline design, as discussed above, therefore depends on (1) the capacity of the computational pipeline and (2) the external memory bandwidth. Computational capacity depends on the number of mesh point updates done in parallel (vectorization factor), the latency of the pipeline and the operating clock frequency of the FPGA. Memory throughput, however, depends on various factors such as the number of mesh elements transferred and the stride between each transferred element. To simplify, we model reading/writing of contiguous data from/to memory with a maximum transfer size of 4K bytes, to reach a near optimal throughput of external/near-chip memory for the Xilinx U280 FPGA, our target hardware in this work.

Assuming that the memory throughput is sufficient to supply V mesh points (i.e. a vectorization factor of V) continuously without interruption, then the total clock cycles taken to process a row from a 2D mesh with m × n elements will be given by ⌈m/V⌉. Here, we have padded each row to be a multiple of V if required. The compute pipeline will process n + D/2 rows, as there are D/2 rows between the current stencil update mesh point and the farthest mesh point required for the stencil computation, where D is the stencil order. If the outer iterative loop unroll factor is given by p then the total number of clock cycles required to process the full m × n mesh for n_iter iterations is given by:

Clks_2D = (n_iter / p) × (⌈m/V⌉ × (n + p × D/2))    (2)

The above extends naturally to 3D meshes as in (3), where the 3D mesh size is given by m × n × l and D is then equivalent to the number of planes to be buffered:

Clks_3D = (n_iter / p) × (⌈m/V⌉ × n × (l + p × D/2))    (3)

As noted before, the models above only hold for cases where the vectorization factor V, which determines the number of parallel mesh points computed, does not demand more memory bandwidth than what can be supplied by the FPGA's external DDR4 bandwidth. The FPGA's HBM memory can be used to support a larger V, which could then be limited by the resources available to implement the parallel compute pipelines. An estimate of the maximum V for an application can be computed using the FPGA operating frequency f, the maximum supported bandwidth of a data channel (or port) on the FPGA, BW_channel, and the size in bytes of a mesh element, sizeof(t), as follows:

BW_channel ≥ V × f × sizeof(t)    (4)

For 2D meshes, if the width of the mesh n is a multiple of the vectorization factor V, then the clock cycles for computing a single mesh point (or cell) per iteration per compute module can be obtained from equation (2) as:

Clks_2D,cell = 1/V + pD/(2nV)    (5)

Setting n to higher values gives a better clock cycles per mesh point ratio, the ideal being 1/V. But higher order stencil applications on meshes with fewer rows will have a larger pD/(2nV) value, indicating idling in the processing pipeline. We explore techniques to reduce this idle time in Section IV-B.

A key parameter in (2) and (3) is the loop unroll factor, p, which directly determines performance, where a large p reduces the total clock cycles required. However, p is limited by the available resources on the FPGA: as in Fig. 2, a larger p requires more DSP blocks and LUTs.
Furthermore, the internal memory required for a compute module, primarily the memory capacity for the cyclic buffers, also determines p. The number of DSP blocks required for a single mesh-point update, G_dsp, depends on the stencil loop kernel's arithmetic operations and number representation. Here we consider single precision floating point. With a vectorization factor of V, the total consumed is V × G_dsp. If the total available DSP blocks on the FPGA is FPGA_dsp then the maximum unroll factor based on DSP resources, p_dsp, is given by:

p_dsp = FPGA_dsp / (V × G_dsp)    (6)

The internal memory requirement for a single compute module which performs a D order stencil operation on an m × n mesh is D × m. If the total available internal memory on the device is FPGA_mem, then the maximum possible iterative unroll factor based on internal memory requirements, p_mem, is:

p_mem = FPGA_mem / (k × D × m)    (7)

Here, k is the size of a mesh element in bytes. The denominator of (7) becomes k × D × m × n for 3D meshes. Thus we see that the internal memory of an FPGA directly limits the solvable mesh size. Usually, the above ideal depth is not achievable, as the FPGA internal memory, BRAMs and URAMs, is quantized (for example BRAMs are 18Kb/36Kb and URAMs are 288Kb on the U280). Additionally, the limited width configurations of the URAMs, plus the need to allow for flexible routing, further reduce the effective internal memory resources. Thus we usually target an 80%–90% internal memory utilization. The maximum iterative loop unroll factor is then given by the minimum of p_dsp and p_mem. It is also worth considering that a larger pipeline depth, and hence more resource consumption, leads to the design spreading over multiple SLRs. Communication between SLRs increases routing congestion between these regions, directly impacting the achievable operating frequency.
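These estimates are simple enough to evaluate on the host before committing to a synthesis run. A small C++ sketch of the baseline model for a 2D application (the helper names and the 2D focus are our own choices, not the paper's code) could be:

    #include <algorithm>

    // Estimated total clock cycles for a 2D mesh, following Eq. (2).
    long long clks_2d(long long n_iter, long long m, long long n,
                      int V, int p, int D) {
      long long rows = n + (long long)p * D / 2;   // extra rows fill/flush the pipeline
      return (n_iter / p) * ((m + V - 1) / V) * rows;
    }

    // Maximum iterative unroll factor allowed by DSP blocks, Eq. (6).
    int p_dsp(int fpga_dsp, int V, int G_dsp) { return fpga_dsp / (V * G_dsp); }

    // Maximum unroll factor allowed by on-chip memory for a 2D mesh, Eq. (7);
    // k is the size of a mesh element in bytes.
    int p_mem(long long fpga_mem_bytes, int k, int D, long long m) {
      return (int)(fpga_mem_bytes / ((long long)k * D * m));
    }

    // Feasible unroll factor, and predicted runtime in seconds at clock frequency f.
    int    p_design(int pd, int pm)               { return std::min(pd, pm); }
    double runtime_s(long long clks, double f_hz) { return clks / f_hz; }

For instance, using the U280 DSP count from TABLE I at 90% utilization, with V = 8 and the Poisson G_dsp of 14 from TABLE II, p_dsp evaluates to 68, in line with the model column reported there.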
IV. OPTIMIZATIONS

Further optimizations and extensions are required to obtain high throughput for more complex applications. These include (1) spatial and temporal blocking, specifically for solvers over larger meshes, and (2) batching for improving performance and throughput of stencil applications on smaller meshes. In this section, we build on the baseline design from Section III and extend the performance models to account for these optimizations.
A. Spatial and Temporal Blocking
The baseline design attempts to obtain perfect data reuse, requiring FPGA internal memory (consisting of BRAMs and URAMs) of size D × m for 2D and D × m × n for 3D meshes. Equation (7) illustrates this, where the requirement becomes highly limiting for applications with higher order (D) stencils and/or on larger meshes (increasing m). Even if the mesh fully fits in the FPGA's DDR4 memory, a sufficiently large mesh could result in a p_mem less than one, meaning that even a single compute module cannot be synthesized. A solution is to implement a form of spatial blocking, similar to cache blocking tiling on CPUs, for the FPGA.

The idea is to use the baseline design to build an accelerator that operates on a smaller block of mesh elements and then transfer one such block at a time to the compute pipeline from FPGA DDR4 memory. The compute pipeline is designed with an appropriate vectorization factor (V) and an outer iterative loop unroll factor (p). Larger p results in better exploitation of temporal locality, where the execution uses the same data several times. One issue with such a blocked execution is when applying the computation over the boundary of a block, where a stencil computation on the boundary will not have the contributions from all the neighboring elements in the mesh. The solution is to overlap blocks such that the correct computation is carried out on the boundary by a subsequent block. The amount of overlap depends on the order of the stencil. Overlapping leads to redundant computation. However, this overhead can be acceptable, due to the savings from further exploiting local data in multiple iterations.

The main challenge of tiling then is to get close to maximum DDR4 memory bandwidth, due to the latency of smaller, non-contiguous data transfer sizes. Such data transfers result from a strided access pattern in one dimension when accessing memory locations within a spatial block. For example, on the Xilinx U280, it takes 16 clock cycles to transfer 1024 bytes via the 512 bit wide AXI interface bus, but the latency of the transfer is about 14 clock cycles. As such, multiple read/write requests should be made to hide the latency of each individual memory transaction. The preference to maintain a 512 bit wide bus interface to obtain better memory bandwidth further increases the amount of redundant computation at block boundaries, as we must maintain a 512 bit alignment in read/write transactions, regardless of the order of the stencil.

A final modification is the need to loop through the spatial blocks to solve over the full mesh. This control loop is best implemented on the FPGA to reduce latency due to the host calling multiple kernels on the FPGA. An important consideration is finding the optimal block size and its offset from the start of the mesh. The block sizes and offsets need only be computed once, which can be done on the host and copied to FPGA memory. Considering a 3D stencil application over a mesh of size m × n × l solved by computing over blocks (or tiles) of size M × N × l, the valid number of mesh points computed per block is given by:

Block_valid = (M − pD) × (N − pD) × l    (8)

Since the number of clock cycles required to process p iterations (or a temporal block) on the M × N × l spatial block is similar to the baseline design, the average time taken to compute one block (assuming block dimensions are a multiple of V) would be:

Clks_block,3D = (M/V × N × (l + pD/2)) / p    (9)

Dividing (8) by (9) leads to the number of valid mesh points (or cells) computed per clock cycle (i.e.
throughput, T):

T = (1 − pD/M) × (1 − pD/N) × pV × l/(l + pD/2)    (10)

Now, substituting N from (7), for a 3D application, assuming full utilization of the FPGA's internal memory by a block, it can be shown that maximum throughput can be achieved for a given p when:

M = sqrt(FPGA_mem / (k × p × D))    (11)

The corresponding N can be shown to be also equal to M, implying a square block gives the best throughput. However, the throughput also varies with p, and this can be analyzed by considering a square tile (i.e. M = N) applied to equation (10) and assuming l to be very large, such that l/(l + pD/2) is close to 1. With these assumptions, we can show that maximum throughput is achieved, for a given M, when setting p to a p_max given by:

p_max = M/(3D)    (12)

Obtaining a value for pV from (6), assuming we use all the computational capacity of the FPGA, we can rewrite (10) as:

T_3D = (1 − pD/M)² × (FPGA_dsp/G_dsp) × l/(l + pD/2)    (13)

The same for a 2D stencil application can also be derived as:

T_2D = (1 − pD/M) × (FPGA_dsp/G_dsp) × n/(n + pD/2)    (14)

Here, we see that reducing the pipeline depth p and increasing V will improve the performance of the spatially blocked design. The effect of p is more significant for 3D applications.
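A short C++ sketch of how (11)–(13) can be evaluated when sizing a 3D tile (helper names are ours; a square M × N tile and the reconstructed forms of the equations above are assumed):

    #include <cmath>

    // Square tile edge that fills the available on-chip memory, Eq. (11).
    // mem_bytes: usable on-chip memory, k: bytes per mesh element,
    // p: iterative unroll factor, D: stencil order.
    double tile_edge_M(double mem_bytes, double k, double p, double D) {
      return std::sqrt(mem_bytes / (k * p * D));
    }

    // Unroll factor that maximizes throughput for a given tile edge, Eq. (12).
    double p_max_for_tile(double M, double D) { return M / (3.0 * D); }

    // Valid cells computed per clock for a 3D tiled design, Eq. (13).
    double throughput_3d(double M, double p, double D,
                         double fpga_dsp, double G_dsp, double l) {
      double edge = 1.0 - p * D / M;
      return edge * edge * (fpga_dsp / G_dsp) * (l / (l + p * D / 2.0));
    }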
B. Batching

A final optimization attempts to improve throughput for smaller mesh problems, which usually perform poorly on accelerator platforms, including FPGAs. On traditional architectures such as GPUs the reason is the under-utilization of the massive parallelism available. Essentially, the time spent calling a kernel on the device and the overheads for data movement between host and device dominate the actual processing time.

On an FPGA, in addition to the above, further overheads are caused by the latency of the processing pipeline, as given in equation (5), compared to the time to process the mesh. The idle time is proportional to the width of the 2D mesh. Thus if a large number of smaller meshes are to be solved, as is the case in financial applications [26], then processing one mesh at a time incurs significant latencies. This motivates the idea of grouping together meshes with the same dimensions in batches, increasing the overall throughput of the solve. In practice, the mesh can be extended in the last dimension by stacking up the small meshes, as sketched below. Now, the inter-compute module latencies only occur once at the start of the batched solve. With B 2D meshes in a batch, the time to process a single mesh within a batched execution is given by:

Clks_2D/batched mesh = ⌈m/V⌉ × (n + p × D/(2B))    (15)

Thus, increasing B significantly reduces the idle time from (5). Similar reasoning can be applied for batched 3D meshes.
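For illustration, the host-side packing of a batch can be as simple as copying the B meshes back-to-back along the last dimension (a sketch with our own naming, not the paper's host code):

    #include <cstring>
    #include <vector>

    // Stack B independent m x n meshes into one m x (n*B) buffer so that the
    // FPGA pipeline is filled once for the whole batch rather than once per mesh.
    std::vector<float> pack_batch(const std::vector<const float*>& meshes,
                                  int m, int n) {
      const int B = (int)meshes.size();
      std::vector<float> packed((size_t)m * n * B);
      for (int b = 0; b < B; b++)
        std::memcpy(packed.data() + (size_t)b * m * n, meshes[b],
                    (size_t)m * n * sizeof(float));
      return packed;
    }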
TABLE I: Experimental systems specifications.

FPGA            Xilinx Alveo U280 [27]
DSP blocks      8490
BRAM / URAM     6.6MB (1487 blocks) / 34.5MB (960 blocks)
HBM             8GB, 460GB/s, 32 channels
DDR4            32GB, 38.4GB/s, in 2 banks (1 channel/bank)
Host            Intel Xeon Silver 4116 @2.10GHz (48 cores), 256GB RAM, Ubuntu 18.04.3 LTS
Design SW       Vivado HLS, Vitis 2019.2
GPU             Nvidia Tesla V100 PCIe [27]
Global Mem.     16GB HBM2, 900GB/s
Host            Intel Xeon Gold 6252 @2.10GHz (48 cores), 256GB RAM, Ubuntu 18.04.3 LTS
Compilers, OS   nvcc CUDA 9.1.85, Debian 9.11

V. PERFORMANCE

In this section we apply the FPGA design strategy, optimizations, and extensions to illustrate their utility in accelerating stencil computations for explicit-iterative numerical solvers. We select three representative applications, consisting of both 2D and 3D, low and high order, and single and multiple stencil loops, to explore the versatility of our design flow. Model-predicted resource utilization estimates are used to determine initial design parameters, and runtime performance is compared to model predictions for each application. The implementations target the Xilinx Alveo U280 accelerator board and demonstrate concrete implementations for each application. We use Vivado C++ due to its ease of use for configurations, arbitrary precision data types, and support of some C++ constructs compared to OpenCL, but note OpenCL can equally be used to implement the same design. Additionally, we compare each application's performance to equivalent implementations on a modern GPU system. TABLE I briefly details the specifications of the FPGA and GPU systems (both hardware and software) used in our experiments.

A. Poisson-5pt-2D
The first application is a 2D Poisson solver which uses a 2nd order stencil, with scalar elements:

U^{t+1}_{i,j} = (U^t_{i-1,j} + U^t_{i+1,j} + U^t_{i,j-1} + U^t_{i,j+1}) + U^t_{i,j}    (16)

A suitable initial vectorization factor V can be identified by using (4) and assuming an operating frequency of 300MHz, given this is the default set by the Vivado HLS tools. (We have omitted CPU performance results here as our previous work [26] shows that GPUs provide significant speedups over CPUs for these applications.)
TABLE II: Baseline and batching, model parameters

Application              Freq. (MHz)   G_dsp   p_dsp (model)   p_dsp (actual)
Poisson-5pt-2D           250           14      68              60
Jacobi-7pt-3D            246           33      28              29
Reverse Time Migration   261           2444    3               3

TABLE III: Spatial blocking model parameters

App.             p    V    M      N     T_2D|3D   Valid ratio
Poisson-5pt-2D   60   8    8192   -     472       98.5%
Jacobi-7pt-3D    3    64   768    768   189       98.4%

For a baseline implementation of Poisson a value of 8 for V is calculated when using a single DDR4 channel or two HBM channels with a frequency of 300MHz. However, this frequency could only be supported when the iterative loop unroll factor p is in the order of 1–20. Higher p leads to routing congestion, which limited the achievable frequency. As such the frequency was reduced to 250MHz to support a p of 60, which we observed to give the best performance for this stencil. We find in some cases such a trial frequency adjustment is unavoidable, but our model significantly narrows the design space, enabling us to reason about and quickly obtain an optimal configuration. The number of DSP blocks required for a single mesh-point's stencil computation for Poisson and the resulting p_dsp from (6) for V = 8, assuming a 90% DSP utilization, is given in the first row of TABLE II. Column 4 gives the predicted p_dsp from our performance model, while column 5 is the actual result after synthesis, indicating good agreement with the predicted design.

Fig. 3 (a) and (b) present the runtime performance of Poisson-5pt-2D with the above design and compare the resultant performance to an equivalent implementation on the Nvidia V100 GPU. The achieved bandwidth and energy consumption from these runs are summarized in TABLE IV. The bandwidth is computed by counting the total number of bytes transferred during the execution of the stencil loop (looking at the mesh data accessed) and dividing it by the total time taken by the loop. Baseline FPGA performance is significantly better than on the V100, since the GPU is not saturated by this application. The batching of 2D meshes as in [26] improves GPU performance significantly and offers a closer comparison. The FPGA achieves a maximum speedup of about 30–34% for different mesh sizes and batching sizes of 100 (100B) and 1000 (1000B). Memory bandwidth results indicate high utilization of the communication channels, in agreement with the observed runtimes. The xbutil utility was used to measure power during FPGA execution, while nvidia-smi was used for the same on the V100. The power consumption of the FPGA during the 1000B runs is indicative of the significant energy efficiency of the device compared to a GPU. The FPGA was operating at an average 70W, while the GPU's power consumption ranged from 40W (for a single batch) to 210W for 1000B runs on the larger meshes.
Fig. 3: Poisson-5pt-2D performance: (a) Baseline, 60000 iterations; (b) Batching, 60000 iterations; (c) Spatial-blocking, 6000 iterations. Runtime (seconds) vs. mesh/tile size for the FPGA (measured and model-predicted) and the GPU.
Fig. 4: Jacobi-7pt-3D performance: (a) Baseline, 29000 iterations; (b) Batching, 2900 iterations; (c) Spatial-blocking, 120 iterations. Runtime (seconds) vs. mesh/tile size for the FPGA (measured and model-predicted) and the GPU.
TABLE IV: Poisson-5pt: Bandwidth (GB/s) and Energy (kJ)

Baseline and Batching (60000 iterations)
Mesh   Baseline (FPGA, GPU)   100B (FPGA, GPU)   1000B (FPGA, GPU)   Energy-1000B (FPGA, GPU)
       384, 18                857, 404           867, 530            0.77, 3.48
       543, 32                886, 465           892, 540            1.50, 6.74
       535, 38                901, 483           907, 560            1.66, 7.60
       681, 69                922, 530
       612, 62                889, 536
       735, 116               904, 560

Spatial-blocking (60000 iterations)
Mesh   Tile Size   BW (FPGA, GPU)   Energy (FPGA, GPU)

To implement Poisson-5pt-2D on larger meshes with spatial blocking, we assume a V and p equivalent to the baseline design and compute the valid mesh points updated per clock cycle using (13). Here we assume the dimensions of the mesh to be very large. TABLE III lists the model parameters for spatial blocking. For Poisson we see that the 2D spatially blocked designs theoretically perform similarly to the baseline design and thus we need not change the compute pipeline. Runtime, bandwidth and energy consumption of this implementation are given in Fig. 3 (c) and TABLE IV, respectively, including a comparison to performance from the V100 GPU. Again we see good speedups and higher energy efficiency achieved with the FPGA, this time on large problem sizes with tiling.

B. Jacobi-7pt-3D
The Jacobi iteration as a 3D, 7-point stencil provides us with an initial 3D, single stencil loop for our evaluation:

U^{t+1}_{i,j,k} = k_1 U^t_{i+1,j,k} + k_2 U^t_{i-1,j,k} + k_3 U^t_{i,j-1,k} + k_4 U^t_{i,j,k} + k_5 U^t_{i,j+1,k} + k_6 U^t_{i,j,k+1} + k_7 U^t_{i,j,k-1}    (18)

This application requires higher internal memory for the baseline design. For the spatially blocked design it involves transfers of less than 4K bytes from memory, which makes it difficult to approach raw external memory bandwidth. This is different to the baseline/batched and 2D spatially blocked designs. We speculate that this could be the reason for the slightly less accurate model predictions in Fig. 4(c). While the stencil is still fairly simple, now we see the GPU outperforming the FPGA conclusively, in both the baseline (Fig. 4(a)) and batched (Fig. 4(b)) tests. The V100 GPU gives nearly 40% faster runtimes on the 50B problem. However, the FPGA remains more energy efficient for the same problem. For the largest mesh with 50B, it is nearly 2× more energy efficient than the faster GPU run (see TABLE V). The FPGA operated at an average 90W while the GPU power ranged from 77–240W. Spatial blocking was significantly more challenging, and the resulting FPGA design was about 40% slower than the GPU runtime (see Fig. 4(c)). However, the FPGA was again more energy efficient, operating at an average 70W and consuming about 40–50% less energy than the GPU (operating at 180–216W), as seen in TABLE V.

C. Reverse Time Migration (RTM) - Forward Pass
The final application we applied our development flow to is the forward pass from a Reverse Time Migration (RTM) solver [28]. The application represents algorithms of interest from industry [29], going beyond simple single stencil loops. It includes an iterative loop consisting of multiple stencil loops, as summarized in Algorithm 1.
Y, T and K_1..K_4 are 3D floating-point (SP) data arrays defined on the mesh, consisting of vector elements of size 6. Y holds current values and T holds intermediate values, both updated with the f_pml function, which uses a 25-point, eighth order 3D stencil. K_1..K_4 are accessed/updated with a self-stencil (or zeroth-order stencil, i.e. i, j, k). ρ and µ are two 3D scalar coefficient meshes, which are also accessed using a self-stencil.
Fig. 5: RTM performance: (a) Baseline, 1800 iterations; (b) Batching, 180 iterations. Runtime (seconds) vs. mesh size for the FPGA (measured and model-predicted) and the GPU.
TABLE V: Jacobi-7pt-3D: Bandwidth (GB/s) and Energy (kJ)

Baseline (29k iters) and Batching (2.9k iters)
Mesh   Baseline (FPGA, GPU)   10B (FPGA, GPU)   50B (FPGA, GPU)   Energy-50B (FPGA, GPU)
       202, 83                307, 284          323, 404          0.04, 0.07
       301, 284               378, 434          387, 469          0.27, 0.51
       374, 496               421, 548          426, 543          1.96, 3.77
       391, 559               431, 585
       403, 553               438, 569

Spatial-blocking (120 iterations)
Mesh   Tile Size   BW (FPGA, GPU)   Energy (FPGA, GPU)
       256         233, 392         0.062, 0.106
       512         281              0.051
       640         292              0.049
       256         247, 363         0.088, 0.143
       512         270              0.080
       640         273              0.079
Algorithm 1 RTM - Forward Pass

for i = 0; i < n_iter; i++ do
    K_1 = f_pml(Y_pt, ρ, µ) × dt;  T = Y + K_1/2
    K_2 = f_pml(T_pt, ρ, µ) × dt;  T = Y + K_2/2
    K_3 = f_pml(T_pt, ρ, µ) × dt;  T = Y + K_3
    K_4 = f_pml(T_pt, ρ, µ) × dt
    Y = Y + K_1/6 + K_2/3 + K_3/3 + K_4/6
end for

This application is significantly more complex than the previous applications and pushes the resource usage on the FPGA to its limits. Nevertheless, our design strategy is able to provide a good implementation, albeit limited to a batched design. The number of stencil loops was reduced by fusing the K_1, K_2, and K_3 updates with the corresponding T loop. The K_4 and final Y update were merged into one further loop, resulting in a total of 4 loops. For an FPGA implementation, all four fused loops needed to be brought into a single pipeline. Intermediate data T and K_1...K_4 were replaced with a FIFO stream connected through window buffers. Similarly, ρ, µ and Y were internally buffered and fed to subsequent compute units. These optimizations reduce the number of memory accesses to a single read and write of Y and a single read each for ρ and µ. These are significant savings compared to the original loop chain.

A limitation of the FPGA implementation is that the mesh plane size (in this 3D application) is limited, as it uses 3D stencils on a 6 dimensional element (i.e. a vector of 6 floats). Furthermore, partitioning four compute-intensive kernels over the U280's three SLR regions was a significant challenge. Our implementation avoids spanning a compute unit over multiple SLRs, to avoid inter-SLR routing congestion, by setting V to 1, allowing us to fit the four fused loops in one SLR. This then allows for an iterative loop unroll factor (p) of 3, given the three SLRs on the U280. We do note that using more HBM channels could provide more bandwidth to obtain a larger V, but we have not explored this in the current work. A solution for the limited mesh size is of course spatial blocking, but it requires p = 4. This leads to a tile size dimension M = 96 from (12) given D is 8, which requires a large amount of FPGA internal memory, making an implementation on the U280 challenging as the four fused loops will span across SLRs. We leave this to future work.
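The effect of fusing loops and replacing intermediate arrays with FIFO streams can be illustrated with a toy Vivado HLS dataflow region; the two stages below stand in for consecutive fused RTM loops and are our own simplified example, not the actual f_pml pipeline:

    #include <hls_stream.h>

    // Two chained stages connected by an on-chip FIFO: the intermediate values a
    // CPU version would write back to external memory never leave the chip.
    static void stage_a(hls::stream<float>& in, hls::stream<float>& mid, int count) {
      for (int i = 0; i < count; i++) {
    #pragma HLS PIPELINE II=1
        mid.write(in.read() * 0.5f);          // stands in for one fused stencil loop
      }
    }

    static void stage_b(hls::stream<float>& mid, hls::stream<float>& out, int count) {
      for (int i = 0; i < count; i++) {
    #pragma HLS PIPELINE II=1
        out.write(mid.read() + 1.0f);         // stands in for the next fused loop
      }
    }

    void fused_pipeline(hls::stream<float>& in, hls::stream<float>& out, int count) {
    #pragma HLS DATAFLOW
      hls::stream<float> mid("mid");
    #pragma HLS STREAM variable=mid depth=64
      stage_a(in, mid, count);
      stage_b(mid, out, count);
    }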
TABLE VI: RTM - Avg. Bandwidth (GB/s) and Energy (kJ)

Baseline (1800 iters) and Batching (180 iters)
Mesh   Baseline (FPGA, GPU)   20B (FPGA, GPU)   40B (FPGA, GPU)   Energy-40B (FPGA, GPU)
       108, 130               225, 251          232, 266          0.043, 0.086
       141, 163               247, 263          253, 274          0.062, 0.133
       77, 124                210, 251          220, 263          0.055, 0.111
       127, 155               262, 266          270, 272          0.091, 0.218
       165, 179               287, 271          293, 275          0.130, 0.338
From the runtime results in Fig. 5 and the bandwidth results in TABLE VI we see that the FPGA implementation is either matching or marginally better performing than the GPU. Note that, given there are four stencil loops fused on the FPGA, the bandwidth reported is for the fused loop. The GPU bandwidth therefore is the average for the full loop chain. The most time consuming kernel, f_pml, on the GPU achieved around 180 GB/s, while the highest bandwidth achieved by a single kernel is over 340GB/s. Again we see that the FPGA operates at a lower average power (70W) than the GPU (51–170W), consuming 2× less energy.

VI. CONCLUSIONS
In this paper we developed a unified workflow and a supporting predictive analytic model for FPGA synthesis of structured-mesh stencil applications that combines standard state-of-the-art techniques with a number of high-gain optimizations targeting features of real-world workloads. The model allows estimation of design parameters, resource usage, and performance for performant FPGA implementation. The workflow was applied to three representative applications, implemented on a Xilinx Alveo U280 FPGA. Performance was compared to highly-optimized HPC-grade Nvidia V100 GPU code. In most cases, the FPGA is able to match or improve on GPU performance. However, even when runtime is inferior to the GPU, significant energy savings, over 2× for the largest application, are observed. Estimations produced by the model were shown to be accurate and a good guide in the design process. Future work will investigate how a similar workflow can be applied to implicit solvers and further automate the development of this class of application on FPGAs, including alternative numerical representations. The FPGA and GPU source code developed in this paper is available at [30].

ACKNOWLEDGMENT
Gihan Mudalige was supported by the Royal Society Industry Fellowship Scheme (INF/R1/180012). István Reguly was supported by the National Research, Development and Innovation Fund of Hungary, project PD 124905, financed under the PD 17 funding scheme. The authors would like to thank Jacques Du Toit and Tim Schmielau at NAG UK Ltd. for the RTM application and for their valuable advice.
REFERENCES

[1] D. B. Cousins, K. Rohloff, and D. Sumorok, "Designing an FPGA-accelerated homomorphic encryption co-processor," IEEE Transactions on Emerging Topics in Computing, vol. 5, no. 2, pp. 193–206, 2016.
[2] M. Owaida, D. Sidler, K. Kara, and G. Alonso, "Centaur: A framework for hybrid CPU-FPGA databases," in Proceedings of the IEEE International Symposium on Field-Programmable Custom Computing Machines (FCCM), 2017, pp. 211–218.
[3] C. Wang, L. Gong, Q. Yu, X. Li, Y. Xie, and X. Zhou, "DLAU: A scalable deep learning accelerator unit on FPGA," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 36, no. 3, pp. 513–517, 2016.
[4] T. Becker, O. Mencer, S. Weston, and G. Gaydadjiev, "Maxeler data-flow in computational finance," in FPGA Based Accelerators for Financial Applications, 2015, pp. 243–266.
[5] S. A. Fahmy, K. Vipin, and S. Shreejith, "Virtualized FPGA accelerators for efficient cloud computing," in Proceedings of the IEEE International Conference on Cloud Computing Technology and Science (CloudCom), 2015, pp. 430–435.
[6] I. Z. Reguly, G. R. Mudalige, and M. B. Giles, "Loop tiling in large-scale stencil codes at run-time with OPS," IEEE Transactions on Parallel and Distributed Systems, vol. 29, no. 4, pp. 873–886, April 2018.
[7] G. R. Mudalige, M. B. Giles, I. Z. Reguly, C. Bertolli, and P. H. J. Kelly, "OP2: An active library framework for solving unstructured mesh-based applications on multi-core and many-core architectures," 2012. [Online]. Available: http://dx.doi.org/10.1109/InPar.2012.6339594
[8] P. Vincent, F. Witherden, B. Vermeire, J. S. Park, and A. Iyer, "Towards green aviation with python at petascale," in SC16: International Conference for High Performance Computing, Networking, Storage and Analysis, Nov 2016, pp. 1–11.
[9] F. Luporini, M. Lange, M. Louboutin, N. Kukreja, J. Hückelheim, C. Yount, P. Witte, P. H. J. Kelly, F. J. Herrmann, and G. J. Gorman, "Architecture and performance of Devito, a system for automated stencil computation," CoRR, vol. abs/1807.03032, Jul 2018. [Online]. Available: http://arxiv.org/abs/1807.03032
[10] H. Fu and R. G. Clapp, "Eliminating the memory bottleneck: An FPGA-based solution for 3D reverse time migration," in Proceedings of the 19th ACM/SIGDA International Symposium on Field Programmable Gate Arrays, ser. FPGA '11. New York, NY, USA: Association for Computing Machinery, 2011, pp. 65–74. [Online]. Available: https://doi.org/10.1145/1950413.1950429
[11] P. Rawat, M. Kong, T. Henretty, J. Holewinski, K. Stock, L.-N. Pouchet, J. Ramanujam, A. Rountev, and P. Sadayappan, "SDSLc: A multi-target domain-specific compiler for stencil computations," in Proceedings of the 5th International Workshop on Domain-Specific Languages and High-Level Frameworks for High Performance Computing, ser. WOLFHPC '15. New York, NY, USA: Association for Computing Machinery, 2015. [Online]. Available: https://doi.org/10.1145/2830018.2830025
[12] H. M. Waidyasooriya, Y. Takei, S. Tatsumi, and M. Hariyama, "OpenCL-based FPGA-platform for stencil computation and its optimization methodology," IEEE Transactions on Parallel and Distributed Systems, vol. 28, no. 5, pp. 1390–1402, 2017.
[13] Y. Chi, J. Cong, P. Wei, and P. Zhou, "SODA: Stencil with optimized dataflow architecture," 2018, pp. 1–8.
[14] H. R. Zohouri, A. Podobas, and S. Matsuoka, "Combined spatial and temporal blocking for high-performance stencil computation on FPGAs using OpenCL," in Proceedings of the 2018 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, ser. FPGA '18. New York, NY, USA: Association for Computing Machinery, 2018, pp. 153–162. [Online]. Available: https://doi.org/10.1145/3174243.3174248
[15] H. R. Zohouri, A. Podobas, and S. Matsuoka, "High-performance high-order stencil computation on FPGAs using OpenCL," 2018, pp. 123–130.
[16] H. M. Waidyasooriya and M. Hariyama, "Multi-FPGA accelerator architecture for stencil computation exploiting spacial and temporal scalability," IEEE Access, vol. 7, pp. 53188–53201, 2019.
[17] Y.-H. Lai, Y. Chi, Y. Hu, J. Wang, C. H. Yu, Y. Zhou, J. Cong, and Z. Zhang, "HeteroCL: A multi-paradigm programming infrastructure for software-defined reconfigurable computing," in Proceedings of the 2019 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, ser. FPGA '19. New York, NY, USA: Association for Computing Machinery, 2019, pp. 242–251. [Online]. Available: https://doi.org/10.1145/3289602.3293910
[18] K. Dohi, K. Fukumoto, Y. Shibata, and K. Oguri, "Performance modeling and optimization of 3D stencil computation on a stream-based FPGA accelerator," 2013, pp. 1–6.
[19] G. Natale, G. Stramondo, P. Bressana, R. Cattaneo, D. Sciuto, and M. D. Santambrogio, "A polyhedral model-based framework for dataflow implementation on FPGA devices of iterative stencil loops," 2016, pp. 1–8.
[20] K. Sano, Y. Hatsuda, and S. Yamamoto, "Multi-FPGA accelerator for scalable stencil computation with constant memory bandwidth," IEEE Transactions on Parallel and Distributed Systems, vol. 25, no. 3, pp. 695–705, 2014.
[21] M. Shafiq, M. Pericàs, R. de la Cruz, M. Araya-Polo, N. Navarro, and E. Ayguadé, "Exploiting memory customization in FPGA for 3D stencil computations," 2009, pp. 38–45.
[22] M. Schmidt, M. Reichenbach, and D. Fey, "A generic VHDL template for 2D stencil code applications on FPGAs," 2012, pp. 180–187.
[23] K. Sano, Y. Hatsuda, and S. Yamamoto, "Scalable streaming-array of simple soft-processors for stencil computations with constant memory-bandwidth," in High Performance Computing, M. Weiland, G. Juckeland, S. Alam, and H. Jagode, Eds. Cham: Springer International Publishing, 2019, pp. 124–141.
[27] Alveo U280 data center accelerator card data sheet, Xilinx Inc., May 2020, v1.3.
[28] R. Clayton and B. Engquist, "Absorbing boundary conditions for acoustic and elastic wave equations,"