NERO: A Near High-Bandwidth Memory Stencil Accelerator for Weather Prediction Modeling

Gagandeep Singh (a,b,c), Dionysios Diamantopoulos (c), Christoph Hagleitner (c), Juan Gómez-Luna (b), Sander Stuijk (a), Onur Mutlu (b), Henk Corporaal (a)
(a) Eindhoven University of Technology   (b) ETH Zürich   (c) IBM Research Europe, Zurich
Ongoing climate change calls for fast and accurate weather and climate modeling. However, when solving large-scale weather prediction simulations, state-of-the-art CPU and GPU implementations suffer from limited performance and high energy consumption. These implementations are dominated by complex irregular memory access patterns and low arithmetic intensity that pose fundamental challenges to acceleration. To overcome these challenges, we propose and evaluate the use of near-memory acceleration using a reconfigurable fabric with high-bandwidth memory (HBM). We focus on compound stencils that are fundamental kernels in weather prediction models. By using high-level synthesis techniques, we develop NERO, an FPGA+HBM-based accelerator connected through IBM CAPI2 (Coherent Accelerator Processor Interface) to an IBM POWER9 host system. Our experimental results show that NERO outperforms a 16-core POWER9 system by 4.2x and 8.3x when running two different compound stencil kernels. NERO reduces the energy consumption by 22x and 29x for the same two kernels over the POWER9 system, with an energy efficiency of 1.5 GFLOPS/Watt and 17.3 GFLOPS/Watt. We conclude that employing near-memory acceleration solutions for weather prediction modeling is promising as a means to achieve both high performance and high energy efficiency.
1. Introduction
Accurate weather prediction using detailed weather models is essential to make weather-dependent decisions in a timely manner. The Consortium for Small-Scale Modeling (COSMO) [38] built one such weather model to meet the high-resolution forecasting requirements of weather services. The COSMO model is a non-hydrostatic atmospheric prediction model currently being used by a dozen nations for meteorological purposes and research applications.

The central part of the COSMO model (called dynamical core or dycore) solves the Euler equations on a curvilinear grid and applies implicit discretization (i.e., parameters are dependent on each other at the same time instance [26]) in the vertical dimension and explicit discretization (i.e., a solution is dependent on the previous system state [26]) in the horizontal dimension. The use of different discretizations leads to three computational patterns [96]: 1) horizontal stencils, 2) tridiagonal solvers in the vertical dimension, and 3) point-wise computation. These computational kernels are compound stencil kernels that operate on a three-dimensional grid [48]. Vertical advection (vadvc) and horizontal diffusion (hdiff) are two such compound kernels found in the dycore of the COSMO weather prediction model. These kernels are representative of the data access patterns and algorithmic complexity of the entire COSMO model. They are similar to the kernels used in other weather and climate models [60, 79, 107]. Their performance is dominated by memory-bound operations with unique irregular memory access patterns and low arithmetic intensity that often result in less than 10% sustained floating-point performance on current CPU-based systems [99].

Figure 1 shows the roofline plot [104] for an IBM 16-core POWER9 CPU (IC922). After optimizing the vadvc and hdiff kernels for the POWER architecture by following the approach in [105], they achieve 29.1 GFLOP/s and 58.5 GFLOP/s, respectively, for 64 threads. Our roofline analysis indicates that these kernels are constrained by the host DRAM bandwidth. Their low arithmetic intensity limits their performance, which is one order of magnitude smaller than the peak performance, and results in a fundamental memory bottleneck that standard CPU-based optimization techniques cannot overcome.
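For reference, the roofline model used in Figure 1 bounds attainable performance by the minimum of the machine's peak compute throughput and the product of arithmetic intensity (AI) and memory bandwidth:

$$ P_{\text{attainable}} = \min\left(P_{\text{peak}},\ \mathrm{AI} \times BW_{\text{mem}}\right) $$

The numbers in the following example are purely illustrative, not measurements from our system: a kernel with an arithmetic intensity of 0.25 flop/byte on a machine sustaining 100 GB/s of memory bandwidth is capped at 25 GFLOP/s, regardless of how high the peak compute throughput is.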
Figure 1: Roofline [104] for POWER9 (1-socket) showing vertical advection (vadvc) and horizontal diffusion (hdiff) kernels for single-thread and 64-thread implementations. The plot also shows the rooflines of the FPGAs used in our work.
In this work, our goal is to overcome the memory bottleneck of weather prediction kernels by exploiting near-memory computation capability on FPGA accelerators with high-bandwidth memory (HBM) [5, 65, 66] that are attached to the host CPU. Figure 1 shows the roofline models of the two FPGA cards (AD9V3 [2] and AD9H7 [1]) used in this work. FPGAs can handle irregular memory access patterns efficiently and offer significantly higher memory bandwidth than the host CPU with their on-chip URAMs (UltraRAM), BRAMs (block RAM), and off-chip HBM (high-bandwidth memory for the AD9H7 card). However, taking full advantage of FPGAs for accelerating a workload is not a trivial task. To compensate for the higher clock frequency of the baseline CPUs, our FPGAs must exploit at least one order of magnitude more parallelism in a target workload. This is challenging, as it requires sufficient FPGA programming skills to map the workload and optimize the design for the FPGA microarchitecture.

Modern FPGA boards deploy new cache-coherent interconnects, such as IBM CAPI [93], Cache Coherent Interconnect for Accelerators (CCIX) [24], and Compute Express Link (CXL) [88], which allow tight integration of FPGAs with CPUs at high bidirectional bandwidth (on the order of tens of GB/s). However, memory-bound applications on FPGAs are limited by the relatively low DDR4 bandwidth (72 GB/s for four independent dual-rank DIMM interfaces [11]). To overcome this limitation, FPGA vendors have started offering devices equipped with HBM [6, 7, 12, 66] with a theoretical peak bandwidth of 410 GB/s. HBM-equipped FPGAs have the potential to reduce the memory bandwidth bottleneck, but a study of their advantages for real-world memory-bound applications is still missing.

We aim to answer the following research question: Can FPGA-based accelerators with HBM mitigate the performance bottleneck of memory-bound compound weather prediction kernels in an energy-efficient way?

(IBM and POWER9 are registered trademarks or common law marks of International Business Machines Corp., registered in many jurisdictions worldwide. Other product and service names might be trademarks of IBM or other companies.)
As an answer to this question, we present NERO, a near-HBM accelerator for weather prediction. We design and implement NERO on an FPGA with HBM to optimize two kernels (vertical advection and horizontal diffusion), which notably represent the spectrum of computational diversity found in the COSMO weather prediction application. We co-design a hardware-software framework and provide an optimized API to interface efficiently with the rest of the COSMO model, which runs on the CPU. Our FPGA-based solution for vadvc and hdiff leads to performance improvements of 4.2x and 8.3x and energy reductions of 22x and 29x, respectively, with respect to optimized CPU implementations [105].

The major contributions of this paper are as follows:
• We perform a detailed roofline analysis to show that representative weather prediction kernels are constrained by memory bandwidth on state-of-the-art CPU systems.
• We propose NERO, the first near-HBM FPGA-based accelerator for representative kernels from a real-world weather prediction application.
• We optimize NERO with a data-centric caching scheme with precision-optimized tiling for a heterogeneous memory hierarchy (consisting of URAM, BRAM, and HBM).
• We evaluate the performance and energy consumption of our accelerator and perform a scalability analysis. We show that an FPGA+HBM-based design outperforms a complete 16-core POWER9 system (running 64 threads) by 4.2x for the vertical advection (vadvc) kernel and 8.3x for the horizontal diffusion (hdiff) kernel, with energy reductions of 22x and 29x, respectively. Our design provides an energy efficiency of 1.5 GFLOPS/Watt and 17.3 GFLOPS/Watt for the vadvc and hdiff kernels, respectively.
2. Background
In this section, we first provide an overview of the vadvc and hdiff compound stencils, which represent a large fraction of the overall computational load of the COSMO weather prediction model. Second, we introduce the CAPI SNAP (Storage, Network, and Analytics Programming) framework that we use to connect our NERO accelerator to an IBM POWER9 system.

A stencil operation updates values in a structured multidimensional grid based on the values of a fixed local neighborhood of grid points. Vertical advection (vadvc) and horizontal diffusion (hdiff) from the COSMO model are two such compound stencil kernels, which represent the typical code patterns found in the dycore of COSMO. Algorithm 1 shows the pseudo-code for the vadvc and hdiff kernels. The horizontal diffusion kernel iterates over a 3D grid, performing Laplacian and flux calculations on different grid points. Vertical advection has a higher degree of complexity since it uses the Thomas algorithm [97] to solve a tridiagonal matrix of the velocity field along the vertical axis. Unlike conventional stencil kernels, vertical advection has dependencies in the vertical direction, which leads to limited available parallelism.

Such compound kernels are dominated by memory-bound operations with complex memory access patterns and low arithmetic intensity. This poses a fundamental challenge to acceleration. CPU implementations of these kernels [105] suffer from limited data locality and inefficient memory usage, as our roofline analysis in Figure 1 exposes.
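To make the vertical dependency concrete, the following minimal sketch shows a textbook Thomas algorithm (tridiagonal solve) along one vertical column. The coefficient arrays (a, b, c, d) are generic textbook names, not COSMO field names. Each forward-sweep step depends on the previous one, which is why vadvc cannot be parallelized along the vertical dimension.

```cpp
#include <vector>

// Minimal sketch of the Thomas algorithm (tridiagonal solve) along one
// vertical column, as used conceptually by vadvc.
void thomas_solve(std::vector<float>& a,   // sub-diagonal   (a[0] unused)
                  std::vector<float>& b,   // main diagonal
                  std::vector<float>& c,   // super-diagonal (c[n-1] unused)
                  std::vector<float>& d,   // right-hand side; overwritten with the solution
                  int n) {
    // Forward sweep: eliminate the sub-diagonal. Each step depends on the
    // previous one, so this loop cannot be parallelized along k.
    for (int k = 1; k < n; ++k) {
        float m = a[k] / b[k - 1];
        b[k] -= m * c[k - 1];
        d[k] -= m * d[k - 1];
    }
    // Backward sweep: back-substitution from the top of the column down.
    d[n - 1] /= b[n - 1];
    for (int k = n - 2; k >= 0; --k) {
        d[k] = (d[k] - c[k] * d[k + 1]) / b[k];
    }
}
```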
The OpenPOWER Foundation Accelerator Workgroup [8] created the CAPI SNAP framework, an open-source environment for FPGA programming productivity. CAPI SNAP provides two key benefits [103]: (i) it enables improved developer productivity for FPGA acceleration and eases the use of CAPI's cache-coherence mechanism, and (ii) it places FPGA-accelerated compute engines, also known as FPGA actions, closer to relevant data to achieve better performance. SNAP provides a simple API to invoke an accelerated action, and also provides programming methods to instantiate customized accelerated actions on the FPGA side. These accelerated actions can be specified in C/C++ code that is then compiled to the FPGA target using the Xilinx Vivado High-Level Synthesis (HLS) tool [10]. (The SNAP framework is available at https://github.com/open-power/snap.)

Algorithm 1: Pseudo-code for the vertical advection and horizontal diffusion kernels used by the COSMO [38] weather prediction model.

Function verticalAdvection(float* ccol, float* dcol, float* wcon, float* ustage, float* upos, float* utens, float* utensstage)
  for c ← … to column − 2 do
    for r ← … to row − 2 do
      Function forwardSweep(float* ccol, float* dcol, float* wcon, float* ustage, float* upos, float* utens, float* utensstage)
        for d ← … to depth do
          /* forward sweep calculation */
        end
      end
      Function backwardSweep(float* ccol, float* dcol, float* wcon, float* ustage, float* upos, float* utens, float* utensstage)
        for d ← depth − 1 to … do
          /* backward sweep calculation */
        end
      end
    end
  end
end

Function horizontalDiffusion(float* src, float* dst)
  for d ← … to depth do
    for c ← … to column − 2 do
      for r ← … to row − 2 do
        /* Laplacian calculation */
        lapCR  = laplaceCalculate(c, r)
        /* row-Laplacians */
        lapCRm = laplaceCalculate(c, r − 1)
        lapCRp = laplaceCalculate(c, r + 1)
        /* column-Laplacians */
        lapCmR = laplaceCalculate(c − 1, r)
        lapCpR = laplaceCalculate(c + 1, r)
        /* column-flux calculation */
        fluxC  = lapCpR − lapCR
        fluxCm = lapCR − lapCmR
        /* row-flux calculation */
        fluxR  = lapCRp − lapCR
        fluxRm = lapCR − lapCRm
        /* output calculation */
        dst[d][c][r] = src[d][c][r] − coeff ∗ ((fluxC − fluxCm) + (fluxR − fluxRm))
      end
    end
  end
end
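As a concrete counterpart to the horizontalDiffusion pseudo-code above, the following C++ sketch computes one vertical level of hdiff. The grid layout, halo width, and single diffusion coefficient coeff are simplifying assumptions for illustration, not the exact COSMO formulation.

```cpp
#include <cstddef>

// Discrete Laplacian at linear index idx of a row-major 2D slice.
static inline float laplace(const float* src, std::size_t idx, std::size_t row_stride) {
    return 4.0f * src[idx]
         - src[idx - 1] - src[idx + 1]                    // column neighbors
         - src[idx - row_stride] - src[idx + row_stride]; // row neighbors
}

// One vertical level of a simplified horizontal diffusion: Laplacian stage
// followed by a flux stage, skipping a 2-point halo at each boundary.
void hdiff_level(const float* src, float* dst,
                 std::size_t rows, std::size_t cols, float coeff) {
    const std::size_t stride = cols;
    for (std::size_t r = 2; r < rows - 2; ++r) {
        for (std::size_t c = 2; c < cols - 2; ++c) {
            const std::size_t i = r * stride + c;
            // Laplacians at the center point and its four neighbors.
            float lap_c  = laplace(src, i, stride);
            float lap_cm = laplace(src, i - 1, stride);
            float lap_cp = laplace(src, i + 1, stride);
            float lap_rm = laplace(src, i - stride, stride);
            float lap_rp = laplace(src, i + stride, stride);
            // Flux differences in both horizontal directions.
            float flux_c = (lap_cp - lap_c) - (lap_c - lap_cm);
            float flux_r = (lap_rp - lap_c) - (lap_c - lap_rm);
            // Output: diffuse the input field by the combined flux.
            dst[i] = src[i] - coeff * (flux_c + flux_r);
        }
    }
}
```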
3. Design Methodology
The low arithmetic intensity of real-world weather prediction kernels limits the attainable performance on current multi-core systems. This sub-optimal performance is due to the kernels' complex memory access patterns and their inefficiency in exploiting a rigid cache hierarchy, as quantified in the roofline plot in Figure 1. These kernels cannot fully utilize the available memory bandwidth, which leads to high data movement overheads in terms of latency and energy consumption. We address these inefficiencies by developing an architecture that combines fewer off-chip data accesses with higher throughput for the loaded data. To this end, our accelerator design takes a data-centric approach [13, 14, 27, 43, 52, 53, 63, 75, 89, 91] that exploits near high-bandwidth memory acceleration.

Figure 2 shows a high-level overview of our integrated system. An HBM-based FPGA is connected to a server system based on an IBM POWER9 processor using the Coherent Accelerator Processor Interface version 2 (CAPI2). The FPGA consists of two HBM stacks, each with 16 pseudo-memory channels [3]. (In this work, we enable only a single stack, based on our resource and power consumption analysis for the vadvc kernel.) A channel is exposed to the FPGA as a 256-bit wide port, and in total, the FPGA has 32 such ports. The HBM IP provides 8 memory controllers (per stack) to handle the data transfer to and from the HBM memory ports. Our design consists of an accelerator functional unit (AFU) that interacts with the host system through the power service layer (PSL), which is the CAPI endpoint on the FPGA. An AFU comprises multiple processing elements (PEs) that perform compound stencil computation.

Figure 2: Heterogeneous platform with an IBM POWER9 system connected to an HBM-based FPGA board via CAPI2.

Figure 3a shows the architecture overview of NERO. As vertical advection is the most complex kernel, we depict our architecture design flow for vertical advection. We use a similar design for the horizontal diffusion kernel. The weather data, based on the atmospheric model resolution grid, is stored in the DRAM of the host system (1 in Figure 3a). We employ the double buffering technique between the CPU and the FPGA to hide the PCIe (Peripheral Component Interconnect Express [72]) transfer latency. By configuring a buffer of 64 cache lines between the AXI4 interface of CAPI2/PSL and the AFU, we can reach the theoretical peak bandwidth of CAPI2/PCIe (i.e., 16 GB/s). We create a specialized memory hierarchy from the heterogeneous FPGA memories (i.e., URAM, BRAM, and HBM). By using a greedy algorithm, we determine the best-suited hierarchy for our kernel. The memory controller (shown in Figure 2) handles the data placement to the appropriate memory type based on the programmer's directives.

On the FPGA, following the initial buffering (2), the transferred grid data is mapped onto the HBM memory (3). As the FPGA has limited resources, we propose a 3D window-based grid transfer from the host DRAM to the FPGA, facilitating a smaller, less power-hungry deployment (see the sketch after this paragraph). The window size represents the portion of the grid a processing element (PE in Figure 2) would process. Most FPGA developers manually optimize for the right window size. However, manual optimization is tedious because of the huge design space, and it requires expert guidance. Further, selecting an inappropriate window size leads to sub-optimal results. Our experiments (in Section 4.2) show that: (1) finding the best window size is critical in terms of the area vs. performance trade-off, and (2) the best window size depends on the datatype precision.
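The sketch below, with hypothetical names unrelated to NERO's actual host code, illustrates the idea of the 3D window-based traversal: the grid is visited window by window, each small enough to fit a PE's on-chip buffers, and the DMA transfer itself is left as a placeholder comment.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdio>

// Visit all 3D windows of a dim_x x dim_y x dim_z grid and return how many
// windows were visited. In a real host program, the innermost loop body would
// enqueue a CAPI2/DMA transfer of the window instead of just counting it.
std::size_t visit_windows(std::size_t dim_x, std::size_t dim_y, std::size_t dim_z,
                          std::size_t win_x, std::size_t win_y, std::size_t win_z) {
    std::size_t n_windows = 0;
    for (std::size_t z = 0; z < dim_z; z += win_z)
        for (std::size_t y = 0; y < dim_y; y += win_y)
            for (std::size_t x = 0; x < dim_x; x += win_x) {
                // Clamp the window at the grid boundary.
                std::size_t wx = std::min(win_x, dim_x - x);
                std::size_t wy = std::min(win_y, dim_y - y);
                std::size_t wz = std::min(win_z, dim_z - z);
                (void)wx; (void)wy; (void)wz;
                // ... enqueue DMA transfer of grid[x..x+wx, y..y+wy, z..z+wz] here ...
                ++n_windows;
            }
    return n_windows;
}

int main() {
    // Example only: a 256x256x64 grid processed in 64x64x64 windows -> 16 windows.
    std::printf("%zu windows\n", visit_windows(256, 256, 64, 64, 64, 64));
    return 0;
}
```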
Figure 3: (a) Architecture overview of NERO with the data flow sequence from the host DRAM to the on-board FPGA memory via POWER9 cachelines. We depict a single processing element (PE) fetching data from a dedicated HBM port. The number of HBM ports scales linearly with the number of PEs. Heterogeneous partitioning of on-chip memory blocks reduces read and write latencies across the FPGA memory hierarchy. (b) Execution timeline with the data flow sequence from the host DRAM to the on-board FPGA memory.

Hence, instead of pruning the design space manually, we formulate the search for the best window size as a multi-objective auto-tuning problem that takes the datatype precision into account. We make use of OpenTuner [20], which uses machine-learning techniques to guide the design-space exploration.

Our design consists of multiple PEs (shown in Figure 2) that exploit data-level parallelism in COSMO weather prediction kernels. A dedicated HBM memory port is assigned to a specific PE; therefore, we enable as many HBM ports as the number of PEs. This allows us to use the high HBM bandwidth effectively because each PE fetches from an independent port. In our design, we use a switch, which provides the capability to bypass the HBM when the grid size is small and map the data directly onto the FPGA's URAM and BRAM. The HBM port provides 256-bit data, which is half the size of the CAPI2 bitwidth (512-bit). Therefore, to match the CAPI2 bandwidth, we introduce a stream converter logic (4) that converts a 256-bit HBM stream to a 512-bit stream (CAPI compatible) or vice versa. From HBM, a PE reads a single stream of data that consists of all the fields needed for a specific COSMO kernel computation (fields represent atmospheric components like wind, pressure, and velocity that are required for weather calculation). The PEs use a fields stream splitter logic (5) that splits a single HBM stream into multiple streams (512-bit each), one for each field.

To optimize a PE, we apply various optimization strategies. First, we exploit the inherent parallelism in a given algorithm through hardware pipelining. Second, we partition the on-chip memory to avoid stalling our pipelined design, since the on-chip BRAM/URAM has only two read/write ports. Third, all the tasks execute in a dataflow manner that enables task-level parallelism. vadvc is more computationally complex than hdiff because it involves forward and backward sweeps with dependencies in the z-dimension, whereas hdiff performs only Laplacian and flux calculations with dependencies in the x- and y-dimensions. Therefore, we demonstrate our design flow by means of the vadvc kernel (Figure 3a). Note that we show only a single port-based PE operation. However, for multiple PEs, we enable multiple HBM ports.

We make use of memory reshaping techniques to configure our memory space with multiple parallel BRAMs or URAMs [37]. We form an intermediate memory hierarchy by decomposing (or slicing) 3D window data into a 2D grid. This allows us to bridge the latency gap between the HBM memory and our accelerator. Moreover, it allows us to exploit the available FPGA resources efficiently. Unlike traditionally-fixed CPU memory hierarchies, which perform poorly with irregular access patterns and suffer from cache pollution effects, application-specific memory hierarchies are shown to improve energy and latency by tailoring the cache levels and cache sizes to an application's memory access patterns [98].

The main computation pipeline (7) consists of a forward and a backward sweep logic. The forward sweep results are stored in an intermediate buffer to allow for the backward sweep calculation. Upon completion of the backward sweep, the results are placed in an output buffer that is followed by a degridding logic (6). The degridding logic converts the calculated results to a 512-bit wide output stream (8). As there is only a single output stream (both in vadvc and hdiff), we do not need extra logic to merge streams. The 512-bit wide stream goes through an HBM stream converter logic (4) that converts the stream bitwidth to the HBM port size (256-bit).

Figure 3b shows the execution timeline from our host system to the FPGA board for a single PE. The host offloads the processing to the FPGA and transfers the required data via DMA (direct memory access) over the CAPI2 interface. The SNAP framework allows for parallel execution of the host and our FPGA PEs while exchanging control signals over the AXI lite interface [4]. On task completion, the AFU notifies the host system via the AXI lite interface and transfers back the results via DMA.
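The following HLS-style sketch illustrates this PE structure under a DATAFLOW region: a 256-to-512-bit stream width converter, a fields stream splitter, and a placeholder compute stage. The number of fields, FIFO depths, and the compute body are illustrative assumptions, not NERO's actual implementation.

```cpp
#include "hls_stream.h"
#include "ap_int.h"

typedef ap_uint<256> hbm_word_t;  // one HBM pseudo-channel beat
typedef ap_uint<512> wide_word_t; // one CAPI2-wide internal beat

// Stage 1: widen the 256-bit HBM stream to 512-bit internal words.
void hbm_to_wide(hls::stream<hbm_word_t>& in, hls::stream<wide_word_t>& out, int n_wide) {
    for (int i = 0; i < n_wide; ++i) {
#pragma HLS PIPELINE II=1
        wide_word_t w;
        w.range(255, 0)   = in.read();
        w.range(511, 256) = in.read();
        out.write(w);
    }
}

// Stage 2: split one packed stream into two field streams (a real design
// would split into as many streams as there are fields).
void field_splitter(hls::stream<wide_word_t>& in,
                    hls::stream<wide_word_t>& field_a,
                    hls::stream<wide_word_t>& field_b, int n_wide) {
    for (int i = 0; i < n_wide; i += 2) {
#pragma HLS PIPELINE II=1
        field_a.write(in.read());
        field_b.write(in.read());
    }
}

// Stage 3: placeholder compute stage consuming both fields.
void compute(hls::stream<wide_word_t>& field_a,
             hls::stream<wide_word_t>& field_b,
             hls::stream<wide_word_t>& out, int n_wide) {
    for (int i = 0; i < n_wide / 2; ++i) {
#pragma HLS PIPELINE II=1
        out.write(field_a.read() ^ field_b.read()); // stand-in for the stencil math
    }
}

// Top-level PE: the DATAFLOW pragma lets the three stages run concurrently,
// communicating through FIFOs, which realizes the task-level parallelism
// described above.
void pe_top(hls::stream<hbm_word_t>& hbm_in, hls::stream<wide_word_t>& result, int n_wide) {
#pragma HLS DATAFLOW
    hls::stream<wide_word_t> wide("wide"), fa("fa"), fb("fb");
#pragma HLS STREAM variable=wide depth=64
#pragma HLS STREAM variable=fa   depth=64
#pragma HLS STREAM variable=fb   depth=64
    hbm_to_wide(hbm_in, wide, n_wide);
    field_splitter(wide, fa, fb, n_wide);
    compute(fa, fb, result, n_wide);
}
```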
3.2. NERO Application Framework

Figure 4 shows the NERO application framework that supports our architecture. A software-defined COSMO API (1) handles offloading jobs to NERO with an interrupt-based queuing mechanism. This allows for minimal CPU usage (and, hence, power usage) during FPGA operation. NERO employs an array of processing elements to compute COSMO kernels, such as vertical advection or horizontal diffusion. Additionally, we pipeline our PEs to exploit the available spatial parallelism. By accessing the host memory through the CAPI2 cache-coherent link, NERO acts as a peer to the CPU. This is enabled through the Power Service Layer (PSL) (2). SNAP (3) allows for seamless integration of the COSMO API with our CAPI-based accelerator. The job manager (4) dispatches jobs to streams, which are managed in the stream scheduler (5). The execution of a job is done by streams that determine which data is to be read from the host memory and sent to the PE array through DMA transfers (6). The pool of heterogeneous on-chip memory is used to store the input data from the main memory and the intermediate data generated by each PE. A sketch of the host-side offload flow is shown after Figure 4.
Figure 4: NERO application framework. We co-design our software and hardware using the SNAP framework. The COSMO API allows the host to offload kernels to our FPGA platform.
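To make the offload path concrete, here is a hypothetical host-side sketch of the COSMO-API-style flow described above. All type and function names (nero_job_t, cosmo_api_*) are placeholders, not the real SNAP/libcxl interface, and the stub bodies stand in for the interrupt-based queuing mechanism.

```cpp
#include <cstddef>

// Hypothetical job descriptor handed to the accelerator.
struct nero_job_t {
    const float* fields_in;   // packed atmospheric fields in host DRAM
    float*       field_out;   // output field in host DRAM
    std::size_t  grid_points; // number of grid points to process
    int          kernel_id;   // e.g., 0 = vadvc, 1 = hdiff
};

static int  cosmo_api_open(int /*card_id*/)            { return 0; } // stub: attach the FPGA action
static int  cosmo_api_enqueue(int, const nero_job_t&)  { return 0; } // stub: queue a job, return immediately
static int  cosmo_api_wait(int /*card*/)               { return 0; } // stub: sleep until the completion interrupt
static void cosmo_api_close(int /*card*/)              {}            // stub: detach the action

// Offload one job and wait for completion. The interrupt-based queue lets the
// CPU sleep instead of polling, keeping host power low while the FPGA works.
int run_kernel_on_fpga(const nero_job_t& job) {
    int card = cosmo_api_open(0);
    if (card < 0) return -1;
    if (cosmo_api_enqueue(card, job) != 0) { cosmo_api_close(card); return -1; }
    int rc = cosmo_api_wait(card);
    cosmo_api_close(card);
    return rc;
}
```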
4. Results
We implemented our design on an Alpha-Data ADM-PCIE-9H7 card [1] featuring the Xilinx Virtex Ultrascale+ XCVU37P-FSVH2892-2-e [9] and 8 GiB HBM2 (i.e., two stacks of 4 GiB each) [5], with an IBM POWER9 as the host system. The POWER9 socket has 16 cores, each of which supports four-thread simultaneous multi-threading. We compare our HBM-based design to a conventional DDR4 DRAM-based [2] design. We perform the experiments for the DDR4-based design on an Alpha-Data ADM-PCIE-9V3 card featuring the Xilinx Virtex Ultrascale+ XCVU3P-FFVC1517-2-i [9]. Table 1 provides our system parameters. We have co-designed our hardware and software interface around the SNAP framework while using the HLS design flow.
Table 1: System parameters and hardware configuration for the CPU and the FPGA boards.

Host CPU                 IBM POWER9 (16 cores, SMT4)
Cache hierarchy          32 KiB L1-I/D, 256 KiB L2, 10 MiB L3
System memory
HBM-based FPGA board     Alpha Data ADM-PCIE-9H7, Xilinx Virtex Ultrascale+ XCVU37P-2, 8 GiB (HBM2), PCIe Gen4 x8
DDR4-based FPGA board    Alpha Data ADM-PCIE-9V3, Xilinx Virtex Ultrascale+ XCVU3P-2, 8 GiB (DDR4), PCIe Gen4 x8
We run our experiments using a 256 × … × … grid for vadvc. For hdiff, we consider sizes in all three dimensions. We define our auto-tuning as a multi-objective optimization with the goal of maximizing performance with minimal resource utilization. Section 3 provides further details on our design.
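For illustration only, the following sketch shows the selection criterion behind this multi-objective view: each candidate tile (window) size is scored by performance and resource utilization, and only Pareto-optimal candidates are retained. The actual design-space exploration in our flow is driven by OpenTuner; the structure and field names below are assumptions made for this example.

```cpp
#include <algorithm>
#include <vector>

// One auto-tuning candidate: a tile size and its measured objectives.
struct Candidate {
    int   tile_x, tile_y, tile_z; // tile (window) size under test
    float gflops;                 // measured or estimated performance
    float resource_pct;           // FPGA resource utilization in percent
};

// c1 dominates c2 if it is no worse in both objectives and better in at least one.
static bool dominates(const Candidate& c1, const Candidate& c2) {
    bool no_worse = c1.gflops >= c2.gflops && c1.resource_pct <= c2.resource_pct;
    bool better   = c1.gflops >  c2.gflops || c1.resource_pct <  c2.resource_pct;
    return no_worse && better;
}

// Keep only the candidates that are not dominated by any other candidate.
std::vector<Candidate> pareto_front(const std::vector<Candidate>& all) {
    std::vector<Candidate> front;
    for (const Candidate& c : all) {
        bool dominated = std::any_of(all.begin(), all.end(),
                                     [&](const Candidate& o) { return dominates(o, c); });
        if (!dominated) front.push_back(c);
    }
    return front;
}
```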
Figure 5 shows hand-tuned and auto-tuned performance and FPGA resource utilization results for vadvc, as a function of the chosen tile size. From the figure, we draw the following observations.

Figure 5: Performance and FPGA resource utilization of a single vadvc PE, as a function of tile size, using hand-tuning and auto-tuning for (a) single-precision (32-bit) and (b) half-precision (16-bit). We highlight the Pareto-optimal solution that we use for our vadvc accelerator (with a red circle). Note that the Pareto-optimal solution changes with precision.
First, by using the auto-tuning approach and our careful FPGA microarchitecture design, we obtain Pareto-optimal results with a tile size of 64 × … × 64 for single-precision vadvc, which gives us a peak performance of 8.49 GFLOP/s. For half-precision, we use a tile size of 32 × … × 64 to achieve a peak performance of 16.5 GFLOP/s. We employ a similar strategy for hdiff to attain a single-precision performance of 30.3 GFLOP/s with a tile size of 16 × … × …. (Note that vadvc has dependencies in the z-dimension; therefore, it cannot be parallelized in the z-dimension.)
Figure 6: Single-precision performance for (a) vadvc and (b) hdiff, as a function of accelerator PE count on the HBM- and DDR4-based FPGA boards. We also show the single-socket (64 threads) performance of an IBM POWER9 host system for both vadvc and hdiff.

First, our full-blown HBM-based vadvc and hdiff implementations provide 120.7 GFLOP/s and 485.4 GFLOP/s of performance, which is 4.2x and 8.3x higher than the performance of a complete POWER9 socket. For half-precision, if we use the same number of PEs as in single precision, our accelerator reaches a performance of 247.9 GFLOP/s for vadvc (2.1x the single-precision performance) and 1.2 TFLOP/s for hdiff (2.5x the single-precision performance). Our DDR4-based design achieves 34.1 GFLOP/s and 145.8 GFLOP/s for vadvc and hdiff, respectively, which is 1.2x and 2.5x the performance on the POWER9 CPU.

Second, for a single PE, which fetches data from a single memory channel, the DDR4-based design provides higher performance than the HBM-based design. This is because the DDR4-based FPGA has a larger bus width (512-bit) than an HBM port (256-bit). This leads to a lower transfer rate for an HBM port (0.8-2.1 GT/s, i.e., gigatransfers per second) than for a DDR4 port (2.1-4.3 GT/s). One way to match the DDR4 bus width would be to have a single PE fetch data from multiple HBM ports in parallel. However, using more ports leads to higher power consumption.
Fourth, in the DDR4-based design, the use of only a single channel to feed multiple PEs leads to a congestion issue that causes a non-linear run-time reduction. As we increase the number of accelerator PEs, we observe that the PEs compete for a single memory channel, which causes frequent stalls. This phenomenon leads to worse performance scaling characteristics for the DDR4-based design as compared to the HBM-based design.
We compare the energy consumption of our accelerator to a 16-core POWER9 host system. For the POWER9 system, we use the AMESTER tool (https://github.com/open-power/amester) to measure the active power consumption, i.e., the difference between the total power of a complete socket (including CPU, memory, fans, I/O, etc.) when an application is running compared to when it is idle. We measure 99.2 Watts for vadvc and 97.9 Watts for hdiff by monitoring the built-in power sensors in the POWER9 system. By executing these kernels on an HBM-based board, we reduce the energy consumption by 22x for vadvc and 29x for hdiff compared to the 16-core POWER9 system. Figure 7 shows the energy efficiency (GFLOPS per Watt) for vadvc and hdiff on the HBM- and DDR4-based designs. We make three major observations from the figure.

Figure 7: Energy efficiency for (a) vadvc and (b) hdiff on HBM- and DDR4-based FPGA boards. We also show the single-socket (64 threads) energy efficiency of an IBM POWER9 host system for both vadvc and hdiff.

First, with our full-blown HBM-based designs (i.e., 14 PEs for vadvc and 16 PEs for hdiff), we achieve energy efficiency values of 1.5 GFLOPS/Watt and 17.3 GFLOPS/Watt for vadvc and hdiff, respectively.

Second, the DDR4-based design is more energy efficient than the HBM-based design when the number of PEs is small. This observation is in line with our discussion about performance with small PE counts in Section 4.2. However, as we increase the number of PEs, the HBM-based design provides better energy efficiency for memory-bound kernels. This is because more data can be fetched and processed in parallel via multiple ports.

Third, kernels like vadvc, with intricate memory access patterns, are not able to reach the peak computational power of FPGAs. The large amount of control flow in vadvc leads to large resource consumption. Therefore, when increasing the PE count, we observe a high increase in power consumption with low energy efficiency.
We conclude that enabling many HBM ports might not always be beneficial in terms of energy consumption, because each HBM port consumes ∼ W of power. However, a kernel like hdiff can achieve much higher performance in an energy-efficient manner with more PEs and HBM ports.

Table 2 shows the resource utilization of vadvc and hdiff on the AD9H7 board. We draw two observations. First, there is a high BRAM consumption compared to other FPGA resources. This is because we implement input, field, and output signals as hls::streams. In high-level synthesis, by default, streams are implemented as FIFOs that make use of BRAM (see the sketch after Table 2). Second, vadvc has a much larger resource consumption than hdiff because vadvc has higher computational complexity and requires a larger number of fields to perform the compound stencil calculation. Note that for hdiff, we can accommodate more PEs, but in this work, we make use of only a single HBM stack. Therefore, we use 16 PEs because a single HBM stack offers 16 memory ports.
Table 2: FPGA resource utilization in our highest-performing HBM-based designs for vadvc and hdiff.

Algorithm   BRAM   DSP   FF    LUT   URAM
vadvc       81%    39%   37%   55%   53%
hdiff       58%    4%    6%    11%   8%
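As an aside, the sketch below (illustrative only, and dependent on the HLS tool version) shows the kind of knobs that control how hls::stream FIFOs are implemented: the STREAM pragma sets the FIFO depth, and the RESOURCE pragma can map a FIFO to LUTRAM instead of BRAM to relieve BRAM pressure.

```cpp
#include "hls_stream.h"
#include "ap_int.h"

// Copy data through an internal FIFO. By default hls::stream channels become
// BRAM-backed FIFOs; a shallow depth and a LUTRAM mapping reduce BRAM usage.
void stream_fifo_example(hls::stream<ap_uint<512> >& in, hls::stream<ap_uint<512> >& out) {
    hls::stream<ap_uint<512> > tmp("tmp");
#pragma HLS STREAM   variable=tmp depth=16          // shallow FIFO: less storage
#pragma HLS RESOURCE variable=tmp core=FIFO_LUTRAM  // map the FIFO to LUTRAM instead of BRAM
    for (int i = 0; i < 16; ++i) {
#pragma HLS PIPELINE II=1
        tmp.write(in.read());
    }
    for (int i = 0; i < 16; ++i) {
#pragma HLS PIPELINE II=1
        out.write(tmp.read());
    }
}
```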
5. Related Work
To our knowledge, this is the first work to evaluate the benefits of using FPGAs equipped with high-bandwidth memory (HBM) to accelerate stencil computation. We exploit the near-memory capabilities of such FPGAs to accelerate important weather prediction kernels.

Modern workloads exhibit limited locality and operate on large amounts of data, which causes frequent data movement between the memory subsystem and the processing units [27, 43, 75, 76]. This frequent data movement has a severe impact on overall system performance and energy efficiency. A way to alleviate this data movement bottleneck [27, 43, 75, 76, 89] is near-memory computing (NMC), which consists of placing processing units closer to memory. NMC is enabled by new memory technologies, such as 3D-stacked memories [5, 62, 65, 66, 81], and also by cache-coherent interconnects [24, 88, 93], which allow close integration of processing units and memory units. Depending on the applications and systems of interest (e.g., [13, 14, 15, 22, 23, 27, 29, 31, 40, 42, 49, 50, 53, 58, 61, 68, 69, 71, 74, 77, 87]), prior works propose different types of near-memory processing units, such as general-purpose CPU cores [13, 16, 27, 28, 29, 39, 64, 69, 78, 83, 86], GPU cores [44, 52, 80, 106], reconfigurable units [41, 55, 57, 90], or fixed-function units [14, 47, 49, 50, 53, 63, 70, 77].

FPGA accelerators are promising for enhancing overall system performance with low power consumption. Past works [17, 18, 19, 30, 36, 45, 54, 56, 57, 59, 67] show that FPGAs can be employed effectively for a wide range of applications. The recent addition of HBM to FPGAs presents an opportunity to exploit high memory bandwidth with the low-power FPGA fabric. The potential of high-bandwidth memory [5, 66] has been explored in many-core processors [44, 82] and GPUs [44, 108]. A recent work [102] shows the potential of HBM for FPGAs with a memory benchmarking tool. NERO is the first work to accelerate a real-world HPC weather prediction application using the FPGA+HBM fabric. Compared to a previous work [90] that optimizes only the horizontal diffusion kernel on an FPGA with DDR4 memory, our analysis reveals that the vertical advection kernel has a much lower compute intensity with little to no regularity. Therefore, our work accelerates both kernels, which together represent the algorithmic diversity of the entire COSMO weather prediction model. Moreover, compared to [90], NERO improves performance by 1.2x on a DDR4-based board and 37x on an HBM-based board for horizontal diffusion by using a dataflow implementation with auto-tuning.

Enabling higher performance for stencil computations has been a subject of optimization across the whole computing stack [21, 32, 33, 34, 35, 46, 48, 51, 73, 85, 92, 95, 101]. Szustak et al. accelerate the MPDATA advection scheme on multi-core CPUs [94] and computational fluid dynamics kernels on FPGAs [84]. Bianco et al. [25] optimize the COSMO weather prediction model for GPUs, while Thaler et al. [96] port COSMO to a many-core system. Wahib et al. [100] develop an analytical performance model for choosing an optimal GPU-based execution strategy for various scientific applications, including COSMO. Gysi et al. [48] provide guidelines for optimizing stencil kernels for CPU-GPU systems.
6. Conclusion
We introduce NERO, the first design and implementation on a reconfigurable fabric with high-bandwidth memory (HBM) to accelerate representative weather prediction kernels, i.e., vertical advection (vadvc) and horizontal diffusion (hdiff), from a real-world weather prediction application. These kernels are compound stencils that are found in various weather prediction applications, including the COSMO model. We show that compound kernels do not perform well on conventional architectures due to their complex data access patterns and low data reusability, which make them memory-bound. Therefore, they greatly benefit from our near-memory computing solution that takes advantage of the high data transfer bandwidth of HBM.

NERO's implementations of vadvc and hdiff outperform the optimized software implementations on a 16-core POWER9 with 4-way multithreading by 4.2x and 8.3x, with 22x and 29x less energy consumption, respectively. We conclude that hardware acceleration on an FPGA+HBM fabric is a promising solution for compound stencils present in weather prediction applications. We hope that our reconfigurable near-memory accelerator inspires developers of different high-performance computing applications that suffer from the memory bottleneck.

Acknowledgments

This work was performed in the framework of the Horizon 2020 program for the project "Near-Memory Computing (NeMeCo)". It is funded by the European Commission under Marie Sklodowska-Curie Innovative Training Networks European Industrial Doctorate (Project ID: 676240). Special thanks to Florian Auernhammer and Raphael Polig for providing access to IBM systems. We appreciate valuable discussions with Kaan Kara and Ronald Luijten. We would also like to thank Bruno Mesnet and Alexandre Castellane from IBM France for help with the SNAP framework. This work was partially supported by the H2020 research and innovation programme under grant agreement No 732631, project OPRECOMP. We also thank Google, Huawei, Intel, Microsoft, SRC, and VMware for their funding support.
References

[13] J. Ahn et al., "A Scalable Processing-in-Memory Accelerator for Parallel Graph Processing," in ISCA, 2015.
[14] J. Ahn et al., "PIM-Enabled Instructions: A Low-Overhead, Locality-Aware Processing-in-Memory Architecture," in ISCA, 2015.
[15] B. Akin et al., "Data Reorganization in Memory Using 3D-stacked DRAM," in ISCA, 2015.
[16] M. Alian et al., "Application-Transparent Near-Memory Processing Architecture with Memory Channel Network," in MICRO, 2018.
[17] M. Alser et al., "Shouji: A Fast and Efficient Pre-Alignment Filter for Sequence Alignment," Bioinformatics, 2019.
[18] M. Alser et al., "GateKeeper: A New Hardware Architecture for Accelerating Pre-Alignment in DNA Short Read Mapping," Bioinformatics, 2017.
[19] M. Alser et al., "SneakySnake: A Fast and Accurate Universal Genome Pre-Alignment Filter for CPUs, GPUs, and FPGAs," arXiv preprint, 2019.
[20] J. Ansel et al., "OpenTuner: An Extensible Framework for Program Autotuning," in PACT, 2014.
[21] A. Armejach et al., "Stencil Codes on a Vector Length Agnostic Architecture," in PACT, 2018.
[22] H. Asghari-Moghaddam et al., "Chameleon: Versatile and Practical Near-DRAM Acceleration Architecture for Large Memory Systems," in MICRO, 2016.
[23] O. O. Babarinsa and S. Idreos, "JAFAR: Near-Data Processing for Databases," in SIGMOD, 2015.
[24] B. Benton, "CCIX, Gen-Z, OpenCAPI: Overview and Comparison," in OFA, 2017.
[25] M. Bianco et al., "A GPU Capable Version of the COSMO Weather Model," ISC, 2013.
[26] L. Bonaventura, "A Semi-implicit Semi-Lagrangian Scheme using the Height Coordinate for a Nonhydrostatic and Fully Elastic Model of Atmospheric Flows," JCP, 2000.
[27] A. Boroumand et al., "Google Workloads for Consumer Devices: Mitigating Data Movement Bottlenecks," in ASPLOS, 2018.
[28] A. Boroumand et al., "CoNDA: Efficient Cache Coherence Support for Near-Data Accelerators," in ISCA, 2019.
[29] A. Boroumand et al., "LazyPIM: An Efficient Cache Coherence Mechanism for Processing-in-Memory," CAL, 2016.
[30] L.-W. Chang et al., "Collaborative Computing for Heterogeneous Integrated Systems," in ICPE, 2017.
[31] P. Chi et al., "PRIME: A Novel Processing-in-memory Architecture for Neural Network Computation in ReRAM-based Main Memory," 2016.
[32] Y. Chi et al., "SODA: Stencil with Optimized Dataflow Architecture," in ICCAD, 2018.
[33] M. Christen et al., "PATUS: A Code Generation and Autotuning Framework for Parallel Iterative Stencil Computations on Modern Microarchitectures," in IPDPS, 2011.
[34] K. Datta et al., "Optimization and Performance Modeling of Stencil Computations on Modern Microprocessors," SIAM Review, 2009.
[35] J. de Fine Licht et al., "Designing Scalable FPGA Architectures Using High-Level Synthesis," in PPoPP, 2018.
[36] D. Diamantopoulos et al., "ecTALK: Energy Efficient Coherent Transprecision Accelerators - The Bidirectional Long Short-Term Memory Neural Network Case," in COOL CHIPS, 2018.
[37] D. Diamantopoulos and C. Hagleitner, "A System-Level Transprecision FPGA Accelerator for BLSTM Using On-chip Memory Reshaping," in FPT, 2018.
[38] G. Doms and U. Schättler, "The Nonhydrostatic Limited-Area Model LM (Lokal-Modell) of the DWD. Part I: Scientific Documentation," DWD, GB Forschung und Entwicklung, 1999.
[39] M. Drumond et al., "The Mondrian Data Engine," in ISCA, 2017.
[40] A. Farmahini-Farahani et al., "NDA: Near-DRAM Acceleration Architecture Leveraging Commodity DRAM Devices and Standard Memory Modules," in HPCA, 2015.
[41] M. Gao and C. Kozyrakis, "HRL: Efficient and Flexible Reconfigurable Logic for Near-Data Processing," in HPCA, 2016.
[42] M. Gao et al., "Practical Near-Data Processing for In-Memory Analytics Frameworks," in PACT, 2015.
[43] S. Ghose et al., "Processing-in-Memory: A Workload-Driven Perspective," IBM JRD, 2019.
[44] S. Ghose et al., "Demystifying Complex Workload-DRAM Interactions: An Experimental Study," POMACS, 2019.
[45] H. Giefers et al., "Accelerating Arithmetic Kernels with Coherent Attached FPGA Coprocessors," in DATE, 2015.
[46] J. González and A. González, "Speculative Execution via Address Prediction and Data Prefetching," in ICS, 1997.
[47] B. Gu et al., "Biscuit: A Framework for Near-Data Processing of Big Data Workloads," in ISCA, 2016.
[48] T. Gysi et al., "MODESTO: Data-centric Analytic Optimization of Complex Stencil Programs on Heterogeneous Architectures," in SC, 2015.
[49] M. Hashemi et al., "Accelerating Dependent Cache Misses with an Enhanced Memory Controller," in ISCA, 2016.
[50] M. Hashemi et al., "Continuous Runahead: Transparent Hardware Acceleration for Memory Intensive Workloads," in MICRO, 2016.
[51] T. Henretty et al., "Data Layout Transformation for Stencil Computations on Short-Vector SIMD Architectures," in CC, 2011.
[52] K. Hsieh et al., "Transparent Offloading and Mapping (TOM): Enabling Programmer-Transparent Near-Data Processing in GPU Systems," in ISCA, 2016.
[53] K. Hsieh et al., "Accelerating Pointer Chasing in 3D-Stacked Memory: Challenges, Mechanisms, Evaluation," in ICCD, 2016.
[54] S. Huang et al., "Analysis and Modeling of Collaborative Execution Strategies for Heterogeneous CPU-FPGA Architectures," in ICPE, 2019.
[55] Z. István et al., "Caribou: Intelligent Distributed Storage," VLDB, 2017.
[56] J. Jiang et al., "Boyi: A Systematic Framework for Automatically Deciding the Right Execution Model of OpenCL Applications on FPGAs," in FPGA, 2020.
[57] S.-W. Jun et al., "BlueDBM: An Appliance for Big Data Analytics," in ISCA, 2015.
[58] Y. Kang et al., "Enabling Cost-effective Data Processing with Smart SSD," in MSST, 2013.
[59] K. Kara et al., "FPGA-accelerated Dense Linear Machine Learning: A Precision-Convergence Trade-off," in FCCM, 2017.
[60] S. Kehler et al., "High Resolution Deterministic Prediction System (HRDPS) Simulations of Manitoba Lake Breezes," Atmosphere-Ocean, 2016.
[61] D. Kim et al., "Neurocube: A Programmable Digital Neuromorphic Architecture with High-Density 3D Memory," 2016.
[62] J. Kim et al., "A 1.2 V 12.8 GB/s 2 Gb Mobile Wide-I/O DRAM With 4 × 128 I/Os Using TSV Based Stacking," JSSC, 2012.
[63] J. S. Kim et al., "GRIM-Filter: Fast Seed Location Filtering in DNA Read Mapping Using Processing-in-Memory Technologies," BMC Genomics, 2018.
[64] G. Koo et al., "Summarizer: Trading Communication with Computing Near Storage," in MICRO, 2017.
[65] D. U. Lee et al., "25.2 A 1.2V 8Gb 8-Channel 128GB/s High-Bandwidth Memory (HBM) Stacked DRAM with Effective Microbump I/O Test Methods Using 29nm Process and TSV," in ISSCC, 2014.
[66] D. Lee et al., "Simultaneous Multi-Layer Access: Improving 3D-Stacked Memory Bandwidth at Low Cost," ACM TACO, 2016.
[67] J. Lee et al., "ExtraV: Boosting Graph Processing Near Storage with a Coherent Accelerator," 2017.
[68] J. H. Lee et al., "BSSync: Processing Near Memory for Machine Learning Workloads with Bounded Staleness Consistency Models," in PACT, 2015.
[69] V. T. Lee et al., "Application Codesign of Near-Data Processing for Similarity Search," in IPDPS, 2018.
[70] J. Liu et al., "Processing-in-Memory for Energy-Efficient Neural Network Training: A Heterogeneous Approach," in MICRO, 2018.
[71] Z. Liu et al., "Concurrent Data Structures for Near-Memory Computing," in SPAA, 2017.
[72] D. Mayhew and V. Krishnan, "PCI Express and Advanced Switching: Evolutionary Path to Building Next Generation Interconnects," in HOTI, 2003.
[73] J. Meng and K. Skadron, "A Performance Study for Iterative Stencil Loops on GPUs with Ghost Zone Optimizations," IJPP, 2011.
[74] A. Morad et al., "GP-SIMD Processing-in-Memory," ACM TACO, 2015.
[75] O. Mutlu et al., "Processing Data Where It Makes Sense: Enabling In-Memory Computation," MicPro, 2019.
[76] O. Mutlu et al., "Enabling Practical Processing in and near Memory for Data-Intensive Computing," in DAC, 2019.
[77] L. Nai et al., "GraphPIM: Enabling Instruction-Level PIM Offloading in Graph Computing Frameworks," in HPCA, 2017.
[78] R. Nair et al., "Active Memory Cube: A Processing-in-Memory Architecture for Exascale Systems," IBM JRD, 2015.
[79] R. B. Neale et al., "Description of the NCAR Community Atmosphere Model (CAM 5.0)," NCAR Tech. Note, 2010.
[80] A. Pattnaik et al., "Scheduling Techniques for GPU Architectures with Processing-in-Memory Capabilities," in PACT, 2016.
[81] J. T. Pawlowski, "Hybrid Memory Cube (HMC)," in HCS, 2011.
[82] C. Pohl et al., "Joins on High-Bandwidth Memory: A New Level in the Memory Hierarchy," VLDB, 2019.
[83] S. H. Pugsley et al., "NDC: Analyzing the Impact of 3D-Stacked Memory+Logic Devices on MapReduce Workloads," in ISPASS, 2014.
[84] K. Rojek et al., "CFD Acceleration with FPGA," in H2RC, 2019.
[85] K. Sano et al., "Multi-FPGA Accelerator for Scalable Stencil Computation with Constant Memory Bandwidth," TPDS, 2014.
[86] P. C. Santos et al., "Operand Size Reconfiguration for Big Data Processing in Memory," in DATE, 2017.
[87] V. Seshadri et al., "Gather-Scatter DRAM: In-DRAM Address Translation to Improve the Spatial Locality of Non-unit Strided Accesses," in MICRO, 2015.
[88] D. Sharma, "Compute Express Link," CXL Consortium White Paper, 2019.
[89] G. Singh et al., "Near-Memory Computing: Past, Present, and Future," MicPro, 2019.
[90] G. Singh et al., "NARMADA: Near-Memory Horizontal Diffusion Accelerator for Scalable Stencil Computations," in FPL, 2019.
[91] G. Singh et al., "NAPEL: Near-Memory Computing Application Performance Prediction via Ensemble Learning," in DAC, 2019.
[92] R. Strzodka et al., "Cache Oblivious Parallelograms in Iterative Stencil Computations," in ICS, 2010.
[93] J. Stuecheli et al., "IBM POWER9 Opens Up a New Era of Acceleration Enablement: OpenCAPI," IBM JRD, 2018.
[94] L. Szustak et al., "Using Intel Xeon Phi Coprocessor to Accelerate Computations in MPDATA Algorithm," in PPAM, 2013.
[95] Y. Tang et al., "The Pochoir Stencil Compiler," in SPAA, 2011.
[96] F. Thaler et al., "Porting the COSMO Weather Model to Manycore CPUs," in PASC, 2019.
[97] L. Thomas, "Elliptic Problems in Linear Differential Equations over a Network," Watson Sci. Comput. Lab. Rept., Columbia University, 1949.
[98] P.-A. Tsai et al., "Jenga: Software-Defined Cache Hierarchies," in ISCA, 2017.
[99] J. van Lunteren et al., "Coherently Attached Programmable Near-Memory Acceleration Platform and its Application to Stencil Processing," in DATE, 2019.
[100] M. Wahib and N. Maruyama, "Scalable Kernel Fusion for Memory-Bound GPU Applications," in SC, 2014.
[101] H. M. Waidyasooriya et al., "OpenCL-Based FPGA-Platform for Stencil Computation and Its Optimization Methodology," TPDS, 2017.
[102] Z. Wang et al., "Shuhai: Benchmarking High Bandwidth Memory on FPGAs," in FCCM, 2020.
[103] L. Wenzel et al., "Getting Started with CAPI SNAP: Hardware Development for Software Engineers," in Euro-Par, 2018.
[104] S. Williams et al., "Roofline: An Insightful Visual Performance Model for Multicore Architectures," CACM, 2009.
[105] J. Xu et al., "Performance Tuning and Analysis for Stencil-Based Applications on POWER8 Processor," ACM TACO, 2018.
[106] D. Zhang et al., "TOP-PIM: Throughput-Oriented Programmable Processing in Memory," in HPDC, 2014.
[107] J. A. Zhang et al., "Evaluating the Impact of Improvement in the Horizontal Diffusion Parameterization on Hurricane Prediction in the Operational Hurricane Weather Research and Forecast (HWRF) Model," Weather and Forecasting, 2018.
[108] M. Zhu et al., "Performance Evaluation and Optimization of HBM-Enabled GPU for Data-intensive Applications," VLSI, 2018.