Benchmarking High Bandwidth Memory on FPGAs
Zeke Wang, Hongjing Huang, Jie Zhang
Collaborative Innovation Center of Artificial Intelligence, Zhejiang University, China
Email: {wangzeke, 21515069, carlzhang4}@zju.edu.cn
Gustavo Alonso
Systems Group, ETH Zurich, Switzerland
Email: [email protected]
Abstract—FPGAs are starting to be enhanced with High Bandwidth Memory (HBM) as a way to reduce the memory bandwidth bottleneck encountered in some applications and to give the FPGA more capacity to deal with application state. However, the performance characteristics of HBM are still not well specified, especially in the context of FPGAs. In this paper, we bridge the gap between nominal specifications and actual performance by benchmarking HBM on a state-of-the-art FPGA, i.e., a Xilinx Alveo U280 featuring a two-stack HBM subsystem. To this end, we propose Shuhai, a benchmarking tool that allows us to demystify all the underlying details of HBM on an FPGA. FPGA-based benchmarking should also provide a more accurate picture of HBM than doing so on CPUs/GPUs, since CPUs/GPUs are noisier systems due to their complex control logic and cache hierarchy. Since the memory itself is complex, leveraging custom hardware logic to benchmark inside an FPGA provides more details as well as accurate and deterministic measurements. We observe that 1) HBM is able to provide up to 425 GB/s memory bandwidth, and 2) how HBM is used has a significant impact on performance, which in turn demonstrates the importance of unveiling the performance characteristics of HBM so as to select the best approach. Shuhai can be easily generalized to other FPGA boards or other generations of memory, e.g., HBM3 and DDR3. We will make Shuhai open-source, benefiting the community.
I. INTRODUCTION
The computational capacity of modern computing systems continues to increase due to constant improvements in CMOS technology, typically by instantiating more cores within the same area and/or by adding extra functionality to the cores (AVX, SGX, etc.). In contrast, the bandwidth capability of DRAM memory has only slowly improved over many generations. As a result, the gap between memory and processor speed keeps growing and is being exacerbated by multicore designs due to the concurrent access. To bridge the memory bandwidth gap, semiconductor memory companies such as Samsung have released a few memory variants, e.g., Hybrid Memory Cube (HMC) and High Bandwidth Memory (HBM), as a way to provide significantly higher memory bandwidth. For example, the state-of-the-art Nvidia GPU V100 features 32 GB of HBM2 (the second generation of HBM) to provide up to 900 GB/s memory bandwidth for its thousands of computing cores.

Compared with a GPU of the same generation, FPGAs used to have an order of magnitude lower memory bandwidth, since FPGAs typically feature up to 2 DRAM memory channels, each of which has up to 19.2 GB/s memory bandwidth on our tested FPGA board Alveo U280 [1]. As a result, an FPGA-based solution using DRAM could not compete with a GPU for bandwidth-critical applications. Consequently, FPGA vendors like Xilinx [1] have started to introduce HBM in their FPGA boards as a way to remain competitive on those same applications. HBM has the potential to be a game-changing feature by allowing FPGAs to provide significantly higher performance for memory- and compute-bound applications like database engines [2] or deep learning inference [3]. It can also support applications in keeping more state within the FPGA without the significant performance penalties seen today as soon as DRAM is involved.

Despite the potential of HBM to bridge the bandwidth gap, there are still obstacles to leveraging HBM on the FPGA. First, the performance characteristics of HBM are often unknown to developers, especially to FPGA programmers. Even though an HBM stack consists of a few traditional DRAM dies and a logic die, the performance characteristics of HBM are significantly different than those of, e.g., DDR4. Second, Xilinx's HBM subsystem [4] introduces new features like a switch inside its HBM memory controller. The performance characteristics of the switch are also unclear to the FPGA programmer due to the limited details exposed by Xilinx. These two issues can hamper the ability of FPGA developers to fully exploit the advantages of HBM on FPGAs.

To this end, we present Shuhai, a benchmarking tool that allows us to demystify all the underlying details of HBM. In the following, we use HBM to refer to HBM2 in the context of Xilinx FPGAs, as Xilinx FPGAs feature two HBM2 stacks. Shuhai is a pioneer of Chinese measurement standards, with which he measured the territory of China in the Xia dynasty. Shuhai adopts a software/hardware co-design approach to provide high-level insights and ease of use to developers or researchers interested in leveraging HBM. The high-level insights come from the first end-to-end analysis of the performance characteristics of typical memory access patterns. The ease of use arises from the fact that Shuhai performs the majority of the benchmarking task without having to reconfigure the FPGA between parts of the benchmark. To our knowledge, Shuhai is the first platform to systematically benchmark HBM on an FPGA. We demonstrate the usefulness
of Shuhai by identifying four important aspects of the usage of HBM-enhanced FPGAs:

• HBMs Provide Massive Memory Bandwidth. On the tested FPGA board Alveo U280, HBM provides up to 425 GB/s memory bandwidth, an order of magnitude more than using two traditional DDR4 channels on the same board. This is still half of what state-of-the-art GPUs obtain, but it represents a significant leap forward for FPGAs.

• The Address Mapping Policy is Critical to High Bandwidth. Different address mapping policies lead to an order of magnitude throughput difference when running a typical memory access pattern (i.e., sequential traversal) on HBM, indicating the importance of matching the address mapping policy to a particular application.

• Latency of HBM is Much Higher than DDR4. The connection between HBM chips and the associated FPGA is done via a serial I/O connection, introducing extra processing for parallel-to-serial-to-parallel conversion. For example, Shuhai identifies that the latency of HBM is 106.7 ns while the latency of DDR4 is 73.3 ns when the memory transaction hits an open page (or row), indicating that we need more on-the-fly memory transactions, which are allowed on modern FPGAs/GPUs, to saturate HBM.

• FPGA Enables Accurate Benchmarking Numbers. We have implemented Shuhai on an FPGA with the benchmarking engine directly attached to HBM modules, making it easier to reason about the performance numbers from HBM. In contrast, benchmarking memory performance on CPUs/GPUs makes it difficult to distinguish effects as, e.g., the cache introduces significant interference in the measurements. Therefore, we argue that our FPGA-based benchmarking approach is a better option when benchmarking memory, whether HBM or DDR.

II. BACKGROUND
An HBM chip employs the latest developments in IC packaging technologies, such as Through Silicon Via (TSV), stacked DRAM, and 2.5D packaging [5], [6], [7], [8]. The basic structure of HBM consists of a base logic die at the bottom and 4 or 8 core DRAM dies stacked on top. All the dies are interconnected by TSVs.

Xilinx integrates two HBM stacks and an HBM controller inside the FPGA. Each HBM stack is divided into eight independent memory channels, where each memory channel is further divided into two 64-bit pseudo channels. A pseudo channel is only allowed to access its associated HBM channel that has its own address region of memory, as shown in Figure 1. The Xilinx HBM subsystem has 16 memory channels, 32 pseudo channels, and 32 HBM channels.

On top of the 16 memory channels, there are 32 AXI channels that interact with the user logic. Each AXI channel adheres to the standard AXI3 protocol [4] to provide a proven standardized interface to the FPGA programmer. Each AXI channel is associated with an HBM channel (or pseudo channel), so each AXI channel is only allowed to access its own memory region. To make each AXI channel able to access the full HBM space, Xilinx introduces a switch between the 32 AXI channels and the 32 pseudo channels [9], [4]. However, the switch is not fully implemented due to its huge resource consumption. Instead, Xilinx presents eight mini-switches, where each mini-switch serves four AXI channels and their associated pseudo channels. Each mini-switch is fully implemented in the sense that each AXI channel accesses any pseudo channel in the same mini-switch with the same latency and throughput. Besides, there are two bidirectional connections between two adjacent mini-switches for global addressing.

Fig. 1. Architecture of the Xilinx HBM subsystem (FPGA, 32 AXI channels, eight mini-switches, 32 pseudo channels, memory channels, and HBM channels).

III. GENERAL BENCHMARKING FRAMEWORK SHUHAI
A. Design Methodology
We summarize two concrete challenges, C1 and C2, and then present Shuhai to tackle the two challenges.

First, high-level insight (C1). It is critical to make our benchmarking framework meaningful to FPGA programmers in the sense that we should provide high-level insights to FPGA programmers for ease of understanding. In particular, we should give the programmer an end-to-end explanation, rather than just incomprehensible memory timing parameters like the row precharge time T_RP, so that the insights can be used to improve the use of HBM memory on FPGAs.

Second, easy to use (C2). It is difficult to achieve ease of use when benchmarking on FPGAs, since even a small modification might require reconfiguring the FPGA. Therefore, we intend to minimize the reconfiguration effort so that the FPGA does not need to be reconfigured between benchmarking tasks. In other words, our benchmarking framework should allow us to use a single FPGA image for a large number of benchmarking tasks, not just for one benchmarking task.
1) Our Approach:
We propose Shuhai to tackle the above two challenges. In order to tackle the first challenge C1, Shuhai allows us to directly analyze the performance characteristics of typical memory access patterns used by FPGA programmers, providing an end-to-end explanation for the overall performance. To tackle the second challenge C2, Shuhai uses runtime parameterization of the benchmarking circuit so as to cover a wide range of benchmarking tasks without reconfiguring the FPGA. By default, we disable the switch in the HBM memory controller when we measure latency numbers of HBM, since the switch that enables global addressing among HBM channels is not necessary; the switch is on when we measure throughput numbers. Through the access patterns implemented in the benchmark, we are able to unveil the underlying characteristics of HBM and DDR4 on FPGAs.

Shuhai adopts a software-hardware co-design approach based on two components: a software component (Subsection III-B) and a hardware component (Subsection III-C). The main role of the software component is to provide flexibility to the FPGA programmer in terms of runtime parameters. With these runtime parameters, we do not need to frequently reconfigure the FPGA when benchmarking HBM and DDR4. The main role of the hardware component is to guarantee performance. More precisely, Shuhai should be able to expose the performance potential, in terms of maximum achievable memory bandwidth and minimum achievable latency, of HBM memory on the FPGA. To do so, the benchmarking circuit itself cannot be the bottleneck at any time.
B. Software Component
Shuhai's software component aims to provide a user-friendly interface such that an FPGA developer can easily use Shuhai to benchmark HBM memory and obtain the relevant performance characteristics. To this end, we introduce a memory access pattern widely used in FPGA programming: Repetitive Sequential Traversal (RST), as shown in Figure 2.

The RST pattern traverses a memory region, a data array storing data elements in a sequence. The RST repetitively sweeps over the memory region of size W with the starting address A, and each time reads B bytes with a stride of S bytes, where B and S are a power of 2. On our tested FPGA, the burst size B should be not smaller than 32 (or 64) for HBM (or DDR4) due to the constraint of the HBM/DDR4 memory application data width. The stride S should be not larger than the working set size W. The parameters are summarized in Table I. We calculate the address T[i] of the i-th memory read/write transaction issued by the RST as illustrated in Equation 1. The calculation can be implemented with simple arithmetic, which in turn leads to fewer FPGA resources and potentially higher frequency. Even though the supported memory access pattern is quite simple, it can still unveil the performance characteristics of the memory, e.g., HBM and DDR4, on FPGAs.

T[i] = A + (i × S) % W    (1)
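As an illustration, the following host-side C sketch enumerates the transaction addresses produced by Equation 1 for the parameters of Table I; the function name and the use of a power-of-two mask are illustrative choices and are not taken from Shuhai's hardware implementation.

```c
#include <stdint.h>
#include <stdio.h>

/* Address of the i-th RST transaction (Equation 1): T[i] = A + (i * S) % W.
 * A: initial address, S: stride, W: working set size. Since W is a power
 * of 2, the modulo reduces to a cheap bit-mask, which is also why the
 * calculation is inexpensive in hardware. */
static uint64_t rst_addr(uint64_t A, uint64_t i, uint64_t S, uint64_t W)
{
    return A + ((i * S) & (W - 1));
}

int main(void)
{
    /* Example parameters mirroring the latency experiments in Section V. */
    uint64_t A = 0x0, B = 32, S = 64, W = 0x1000000, N = 8;
    for (uint64_t i = 0; i < N; i++)
        printf("transaction %llu: addr 0x%llx, %llu bytes\n",
               (unsigned long long)i,
               (unsigned long long)rst_addr(A, i, S, W),
               (unsigned long long)B);
    return 0;
}
```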
C. Hardware Component

The hardware component of Shuhai consists of a PCIe module, M latency modules, a parameter module, and M engine modules, as illustrated in Figure 3. In the following, we discuss the implementation details of each module.

Fig. 2. Memory access pattern used in Shuhai.

TABLE I
SUMMARY OF RUNTIME PARAMETERS

Parameter   Definition
N           Number of memory read/write transactions
B           Burst size (in bytes) of a memory read/write transaction
W           Working set size (in bytes). W (> 16) is a power of 2.
S           Stride (in bytes)
A           Initial address (in bytes)
1) Engine Module:
We directly attach an instantiated engine module to an AXI channel such that the engine module directly serves the AXI interface, e.g., AXI3 and AXI4 [10], [11], provided by the underlying memory IP core, e.g., HBM and DDR4. The AXI interface consists of five different channels: read address (RA), read data (RD), write address (WA), write data (WD), and write response (WR) [10]. Besides, the input clock of the engine module is exactly the clock from the associated AXI channel. For example, the engine module is clocked at 450 MHz when benchmarking HBM, as HBM allows at most 450 MHz for its AXI channels. There are two benefits to using the same clock. First, no extra noise, such as longer latency, is introduced by FIFOs needed to cross different clock regions. Second, the engine module is able to saturate its associated AXI channel, not leading to underestimates of the memory bandwidth capacity.

The engine module, written in Verilog, consists of two independent modules: a write module and a read module. The write module serves the three write-related channels WA, WD, and WR, while the read module serves the two read-related channels RA and RD.

The write module contains a state machine to serve one memory-writing task at a time from the CPU. The task has the initial address A, number of write transactions N, burst size B, stride S, and working set size W. Once the writing task is received, this module always tries to saturate the memory write channels WA and WD by asserting the associated valid signals before the writing task completes, aiming to maximize the achievable throughput. The address of each memory write transaction is specified in Equation 1. This module also probes the WR channel to validate that the on-the-fly memory write transactions have successfully finished.

The read module contains a state machine to serve one memory-reading task at a time from the CPU. The task has the initial address A, number of read transactions N, burst size B, stride S, and working set size W. Unlike the write module, which only measures the achievable throughput, the read module also measures the latency of each of the serial memory read transactions: we immediately issue the second memory read transaction only after the read data of the first read transaction is returned. We are able to unveil many performance characteristics of HBM and DDR4 by analyzing the latency differences among serial memory read transactions. The fundamental reason for the immediate issue is that a refresh command that occurs periodically will close all the banks in our HBM/DDR4 memory, and then there would be no latency difference if the time interval between two serial read transactions were larger than the time (e.g., 7.8 µs) between two refresh commands. When measuring throughput, this module always
Fig. 3. Overall hardware architecture of our benchmarking framework. It cansupport M hardware engines running simultaneously, with each engine forone AXI channel. In our experiment, M is 32 for HBM, while M is 2 forDDR4. tries to saturate the memory read channels RA and RD byalways asserting the RA valid signal before the reading taskcompletes.
2) PCIe Module:
We directly deploy the Xilinx DMA/Bridge Subsystem for PCI Express (PCIe) IP core in our PCIe module, which is clocked at 250 MHz. Our PCIe kernel driver exposes a PCIe BAR that maps the runtime parameters on the FPGA to the user, such that the user is able to directly interact with the FPGA using software code. These runtime parameters determine the control and status registers stored in the parameter module.
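A minimal host-side sketch of this interaction is shown below, assuming the kernel driver exposes the BAR through a device file (here /dev/shuhai0) that can be mmap'ed; the device node name and the register offsets PARAM_BASE and START_OFFSET are placeholders, since the actual register map of Shuhai is not reproduced here.

```c
#include <fcntl.h>
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical offsets inside the PCIe BAR; the real register map is
 * defined by the parameter module and is not reproduced here. */
#define PARAM_BASE   0x0000u
#define START_OFFSET 0x1000u

int main(void)
{
    int fd = open("/dev/shuhai0", O_RDWR);          /* assumed device node */
    if (fd < 0) return 1;

    volatile uint32_t *bar =
        mmap(NULL, 0x2000, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (bar == MAP_FAILED) { close(fd); return 1; }

    /* Runtime parameters of one engine (see Table I): W, S, B as 32-bit
     * words, N and A as 64-bit values split into two words each. */
    bar[PARAM_BASE / 4 + 0] = 0x1000000;   /* W */
    bar[PARAM_BASE / 4 + 1] = 64;          /* S */
    bar[PARAM_BASE / 4 + 2] = 32;          /* B */
    bar[PARAM_BASE / 4 + 3] = 1024;        /* N (low word)  */
    bar[PARAM_BASE / 4 + 4] = 0;           /* N (high word) */
    bar[PARAM_BASE / 4 + 5] = 0;           /* A (low word)  */
    bar[PARAM_BASE / 4 + 6] = 0;           /* A (high word) */

    bar[START_OFFSET / 4] = 1;             /* trigger the start signal */

    munmap((void *)bar, 0x2000);
    close(fd);
    return 0;
}
```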
3) Parameter Module:
The parameter module maintains the runtime parameters and communicates with the host CPU via the PCIe module, receiving the runtime parameters, e.g., S, from the CPU and returning the throughput numbers to the CPU.

Upon receiving the runtime parameters, we use them to configure the M engine modules, each of which needs two 256-bit control registers to store its runtime parameters: one register for the read module and the other register for the write module in each engine module. Inside a 256-bit register, W takes 32 bits, S takes 32 bits, N takes 64 bits, B takes 32 bits, and A takes 64 bits. The remaining 32 bits are reserved for future use. After setting all the engines, the user can trigger the start signal to begin the throughput/latency testing.

The parameter module is also responsible for returning the throughput numbers (64-bit status registers) to the CPU. One status register is dedicated to each engine module.
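A possible packing of such a 256-bit control register is sketched below; the field widths follow the text, while the field order and the serialization into PCIe words are assumptions made only for illustration.

```c
#include <stdint.h>
#include <string.h>

/* One 256-bit control register per read (or write) module.
 * Field widths follow the text: W 32 b, S 32 b, N 64 b, B 32 b, A 64 b,
 * plus 32 reserved bits. The ordering below is illustrative only. */
struct engine_ctrl_reg {
    uint32_t W;         /* working set size (bytes)          */
    uint32_t S;         /* stride (bytes)                    */
    uint64_t N;         /* number of read/write transactions */
    uint32_t B;         /* burst size (bytes)                */
    uint32_t reserved;  /* reserved for future use           */
    uint64_t A;         /* initial address (bytes)           */
};
_Static_assert(sizeof(struct engine_ctrl_reg) == 32, "256-bit register");

/* Serialize the register into the eight 32-bit words written over PCIe. */
static void pack_ctrl(const struct engine_ctrl_reg *r, uint32_t out[8])
{
    memcpy(out, r, sizeof(*r));
}
```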
4) Latency Module:
We instantiate a latency module for each engine module dedicated to an AXI channel. The latency module stores a latency list of size 1024, where the latency list is written by the associated engine module and read by the CPU. Its size is a synthesis parameter. Each latency number, stored in an 8-bit register, refers to the latency of a memory read operation, from the issue of the read operation until the data has arrived from the memory controller.
TABLE II
ADDRESS MAPPING POLICIES FOR HBM AND DDR4. THE DEFAULT POLICIES OF HBM AND DDR4 ARE MARKED BLUE.

Policies   HBM (app addr[27:5])   DDR4 (app addr[33:6])
RBC
RCB
BRC
RGBCG
BRGCG
RCBI
TABLE III
RESOURCE CONSUMPTION BREAKDOWN OF THE HARDWARE DESIGN FOR BENCHMARKING HBM

Hardware modules       LUTs   Registers   BRAMs    Freq.
Engine
PCIe
Parameter
Latency                672    1760        1.17Mb   250MHz
Total resources used   104K   122K        5.53Mb
Total utilization      8%

IV. EXPERIMENT SETUP
A. Hardware Platform
We run our experiments on a Xilinx Alveo U280 [1] featuring two HBM stacks with a total size of 8 GB and two DDR4 memory channels with a total size of 32 GB. The theoretical HBM memory bandwidth can reach 450 GB/s (450 MHz * 32 * 32 B/s), while the theoretical DDR4 memory bandwidth can reach 38.4 GB/s (300 MHz * 2 * 64 B/s).
B. Address Mapping Policies
The application address can be mapped to a memory address using multiple policies, where different address bits map to bank, row, or column addresses. Choosing the right mapping policy is critical to maximizing the overall memory throughput. The policies enabled for HBM and DDR4 are summarized in Table II, where "xR" means that x bits are used for the row address, "xBG" means that x bits are used for the bank group address, "xB" means that x bits are used for the bank address, and "xC" means that x bits are used for the column address. The default policies of HBM and DDR4 are "RGBCG" and "RCB", respectively. "-" stands for address concatenation. We always use the default memory address mapping policy for both HBM and DDR4 if not otherwise specified. For example, the default policy for HBM is RGBCG.
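To illustrate the naming scheme, the sketch below decodes app addr[27:5] under a hypothetical RBC-style layout (14R-4B-5C); the bit widths and positions are placeholders chosen for the example and do not reproduce the exact layouts of Table II.

```c
#include <stdint.h>

/* Extract "len" bits starting at bit position "pos". */
static inline uint32_t bits(uint64_t addr, int pos, int len)
{
    return (uint32_t)((addr >> pos) & ((1u << len) - 1));
}

/* Hypothetical decode of app_addr[27:5] under an "RBC"-style policy
 * (row bits above bank bits above column bits). The 14R-4B-5C split
 * below is illustrative only. */
static void decode_rbc(uint64_t app_addr,
                       uint32_t *row, uint32_t *bank, uint32_t *col)
{
    *col  = bits(app_addr, 5,  5);   /* 5C:  bits  9:5  */
    *bank = bits(app_addr, 10, 4);   /* 4B:  bits 13:10 */
    *row  = bits(app_addr, 14, 14);  /* 14R: bits 27:14 */
}
```

Under such a layout, a stride that only changes the column bits keeps hitting an open row, while a stride that changes the row bits within the same bank forces a page miss, which matches the behavior measured in Section V.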
C. Resource Consumption Breakdown

In this subsection, we break down the resource consumption of the hardware design of Shuhai when benchmarking HBM. Due to space constraints, we omit the resource consumption for benchmarking DDR4 memory on the FPGA. Table III shows the exact FPGA resource consumption of each instantiated module. We observe that Shuhai requires a reasonably small amount of resources to instantiate 32 engine modules, as well as additional components such as the PCIe module, with the total resource utilization being less than 8%.

D. Benchmarking Methodology
We aim to unveil the underlying details of HBM stacks on Xilinx FPGAs with Shuhai. As a yardstick, we also analyze the performance characteristics of DDR4 on the same FPGA board, the U280 [1], when necessary. When we benchmark an HBM channel, we compare the performance characteristics of HBM with those of DDR4 (in Section V). We believe that the numbers obtained for an HBM channel can be generalized to other computing devices such as CPUs or GPUs featuring HBM. When benchmarking the switch inside the HBM memory controller, we do not do the comparison with DDR4, since the DDR4 memory controller does not contain such a switch (Section VI).

V. BENCHMARKING AN HBM CHANNEL
A. Effect of Refresh Interval
When a memory channel is operating, memory cells should be refreshed repetitively such that the information in each memory cell is not lost. During a refresh cycle, normal memory read and write transactions are not allowed to access the memory. We observe that a memory transaction that experiences a memory refresh cycle exhibits a significantly longer latency than a normal memory read/write transaction that is allowed to directly access the memory chips. Thus, we are able to roughly determine the refresh interval by leveraging the memory latency differences between normal and in-a-refresh memory transactions. In particular, we leverage Shuhai to measure the latency of serial memory read operations. Figure 4 illustrates the case with B = 32, S = 64, W = 0x1000000, and N = 1024. We have two observations. First, for both HBM and DDR4, a memory read transaction that coincides with an active refresh command has a significantly longer latency, indicating the need to issue enough on-the-fly memory transactions to amortize the negative effect of refresh commands. Second, for both HBM and DDR4, refresh commands are scheduled periodically, the interval between any two consecutive refresh commands being roughly the same.
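This observation suggests a simple way to estimate the refresh interval on the host from the latency list returned by the latency module; in the sketch below, the spike threshold and the average time per serial transaction are illustrative assumptions rather than values taken from the paper.

```c
#include <stdint.h>

/* Estimate the refresh interval from consecutive read latencies (in cycles).
 * A read that hits a refresh cycle shows a clear latency spike; the average
 * distance between spikes (in transactions) multiplied by the average time
 * per serial transaction gives the refresh period. */
static double refresh_interval_ns(const uint8_t *lat, int n,
                                  uint8_t spike_threshold,  /* assumed */
                                  double ns_per_txn)        /* assumed */
{
    int prev = -1, gaps = 0;
    long total_gap = 0;
    for (int i = 0; i < n; i++) {
        if (lat[i] >= spike_threshold) {
            if (prev >= 0) { total_gap += i - prev; gaps++; }
            prev = i;
        }
    }
    return gaps ? (double)total_gap / gaps * ns_per_txn : 0.0;
}
```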
Fig. 4. Higher access latency of memory refresh commands that occur periodically on (a) HBM and (b) DDR4.

B. Memory Access Latency

We leverage Shuhai to accurately measure the latency of consecutive memory read transactions when the memory controller is in an "idle" state, i.e., where no other pending memory transactions exist in the memory controller, such that the memory controller is able to return the requested data to the read transaction with minimum latency. We aim to identify the latency cycles of three categories: page hit, page closed, and page miss. The latency numbers are identified when the switch is disabled; they will be seven cycles higher when the switch is enabled, as the AXI channel accesses its associated HBM channel through the switch. The switching of bank groups does not affect memory access latency, since at most one memory read transaction is active at any time in this experiment.

The "page hit" state occurs when a memory transaction accesses a row that is open in its bank, so no Precharge and Activate commands are required before the column access, resulting in minimum latency.
The "page closed" state occurs when a memory transaction accesses a row whose corresponding bank is closed, so the row Activate command is required before the column access. The "page miss" state occurs when a memory transaction accesses a row that does not match the active row in the bank, so one Precharge command and one Activate command are issued before the column access, resulting in maximum latency.

We employ the read module to accurately measure the latency numbers for the cases B = 32, W = 0x1000000, N = 1024, and varying S. Intuitively, a small S leads to a high probability of hitting the same page, while a large S potentially leads to a page miss. Besides, a refresh command closes all the active banks. In this experiment, we use two values of S: 128 and 128K.

We use the case S = 128 to determine the latency of page hit and page closed transactions. S = 128 is smaller than the page size, so the majority of read transactions will hit an open page, as illustrated in Figure 5. The remaining points illustrate the latency of page closed transactions, since the small S leads to a large number of read transactions in a certain memory region, and then a refresh will close the bank before the access to another page in the same bank.

We use the case S = 128K to determine the latency of a page miss transaction. S = 128K leads to a page miss for each memory transaction for both HBM and DDR4, since two consecutive memory transactions will access the same bank but different pages.

We summarize the latency on HBM and DDR4 in Table IV. We observe that the memory access latency on HBM is higher than that on DDR4 by about 30 nanoseconds for the same category, e.g., page hit. This means that HBM could have disadvantages when running latency-sensitive applications on FPGAs. The latency trend of HBM is different from that of DDR4 due to the different default address mapping policies. The default address mapping policy of HBM is RGBCG, indicating that only one bank needs to be active at a time, while the default policy of DDR4 is RCB, indicating that four banks are active at a time.

Fig. 5. Snapshots of page miss, page closed, and page hit, in terms of latency cycles, on (a) HBM and (b) DDR4 (latency in cycles vs. index of memory read transaction, for S = 128 and S = 128K).

TABLE IV
IDLE MEMORY ACCESS LATENCY ON HBM AND DDR4. INTUITIVELY, THE HBM LATENCY IS MUCH HIGHER THAN DDR4.

Idle Latency   HBM Cycles   HBM Time    DDR4 Cycles   DDR4 Time
Page hit       48           106.7 ns    22            73.3 ns
Page closed    55           122.2 ns    27            89.9 ns
Page miss      62           137.8 ns    32            106.6 ns
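Given the idle latencies of Table IV, individual samples from the latency module can be bucketed as sketched below; the tolerance of two cycles around each nominal value is an assumption used only for illustration.

```c
#include <stdint.h>

enum page_state { PAGE_HIT, PAGE_CLOSED, PAGE_MISS, PAGE_OTHER };

/* Classify one HBM read latency (in cycles) against the idle latencies of
 * Table IV (48 / 55 / 62 cycles with the switch disabled). The +/-2 cycle
 * tolerance is an illustrative assumption, not a value from the paper. */
static enum page_state classify_hbm_latency(uint8_t cycles)
{
    if (cycles >= 46 && cycles <= 50) return PAGE_HIT;     /* ~48 cycles */
    if (cycles >= 53 && cycles <= 57) return PAGE_CLOSED;  /* ~55 cycles */
    if (cycles >= 60 && cycles <= 64) return PAGE_MISS;    /* ~62 cycles */
    return PAGE_OTHER;  /* e.g., a transaction hit by a refresh cycle */
}
```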
C. Effect of Address Mapping Policy
In this subsection, we examine the effect of different memory address mapping policies on the achievable throughput. In particular, under different mapping policies, we measure the memory throughput with varying stride S and burst size B, while keeping the working set size W (= 0x10000000) large enough. Figure 6 illustrates the throughput trend for different address mapping policies for both HBM and DDR4. We have five observations.

First, different address mapping policies lead to a significant performance difference. For example, Figure 6a illustrates that the default policy (RGBCG) of HBM is almost 10X faster than the policy BRC when S is 1024 and B is 32, demonstrating the importance of choosing the right address mapping policy for a memory-bound application running on the FPGA. Second, the throughput trends of HBM and DDR4 are quite different even when they employ the same address mapping policy, demonstrating the importance of a benchmarking platform such as Shuhai to evaluate different FPGA boards or different memory generations. Third, the default policy always leads to the best performance for any combination of S and B on HBM and DDR4, demonstrating that the default setting is reasonable. Fourth, small burst sizes lead to low memory throughput, as shown in Figures 6a and 6e, meaning that FPGA programmers should increase spatial locality to achieve higher memory throughput out of HBM or DDR4. Fifth, large S (>

D. Effect of Bank Group
In this subsection, we examine the effect of bank groups, a new feature of DDR4 compared to DDR3. Accessing multiple bank groups simultaneously helps relieve the negative effect of DRAM timing restrictions that have not improved over generations of DRAM. A higher memory throughput can potentially be obtained by accessing multiple bank groups. Therefore, we use the engine module to validate the effect of bank groups (Figure 6). We have two observations.

First, with the default address mapping policy, HBM allows the use of a large stride while still keeping high throughput, as shown in Figures 6a, 6b, 6c, and 6d. The underlying reason is that even though each row buffer is not fully utilized due to the large S, bank-group-level parallelism is able to allow us to saturate the available memory bandwidth. Second, a purely sequential read does not always lead to the highest throughput under a certain mapping policy. Figures 6b and 6c illustrate that when S increases from 128 to 2048, a bigger S can achieve higher memory throughput under the policy "RBC", since a bigger S allows more active bank groups to be accessed concurrently, while a smaller S potentially leads to only one active bank group serving the user's memory requests. We conclude that it is critical to leverage bank-group-level parallelism to achieve high memory throughput under HBM.

E. Effect of Memory Access Locality
In this subsection, we examine the effect of memory access locality on memory throughput. We vary the burst size B and the stride S, and we set the working set size W to two values: 256M and 8K. The case W = 256M refers to the baseline that does not benefit from any memory access locality, while the case W = 8K refers to the case that benefits from locality. Figure 7 illustrates the throughput for varying parameter settings on both HBM and DDR4. We have two observations.

First, memory access locality indeed increases the memory throughput for each case with a high stride S. For example, the memory bandwidth of the case (B = 32, W = 8K, S = 4K) is 6.7 GB/s on HBM, versus 2.4 GB/s for the case (B = 32, W = 256M, S = 4K), indicating that memory access locality is able to eliminate the negative effect of a large stride. Second, memory access locality cannot increase the memory throughput when S is small. In contrast, memory access locality can significantly increase the total throughput on modern CPUs/GPUs due to the on-chip caches, which have dramatically higher bandwidth than off-chip memory [12].

Fig. 6. Memory throughput comparison between an HBM channel and a DDR4 channel, with different burst sizes and strides under all the address mapping policies: (a) B=32 (HBM), (b) B=64 (HBM), (c) B=128 (HBM), (d) B=256 (HBM), (e) B=64 (DDR4), (f) B=128 (DDR4), (g) B=256 (DDR4), (h) B=512 (DDR4). In this experiment, we use the AXI channel 0 to access its associated HBM channel 0 for the best performance from a single HBM channel. We use the DDR4 channel 0 to obtain the DDR4 throughput numbers.

Fig. 7. Effect of memory access locality on (a) HBM and (b) DDR4.

F. Total Memory Throughput
In this subsection, we explore the total achievable memory throughput of HBM and DDR4 (Table V). The HBM system on the tested FPGA card, the U280, is able to provide up to 425 GB/s (13.27 GB/s * 32) memory throughput when we use all the 32 AXI channels to simultaneously access their associated HBM channels. Since each AXI channel accesses its local HBM channel, there is no interference among the 32 AXI channels; as each AXI channel has approximately the same throughput, we estimate the total throughput by simply scaling up the throughput of channel 0 by 32. The DDR4 memory is able to provide up to 36 GB/s (18 GB/s * 2) memory throughput when we simultaneously access both DDR4 channels on our tested FPGA card. We observe that the HBM system has 10 times more memory throughput than the DDR4 memory, indicating that the HBM-enhanced FPGA enables us to accelerate memory-intensive applications that are typically accelerated on GPUs.

TABLE V
TOTAL MEMORY THROUGHPUT COMPARISON BETWEEN HBM AND DDR4.

                          HBM          DDR4
Throughput of a channel   13.27 GB/s   18 GB/s
Number of channels        32           2
Total memory throughput   425 GB/s     36 GB/s

VI. BENCHMARKING THE SWITCH IN THE HBM CONTROLLER
Our goal in this section is to unveil the performance characteristics of the switch. In a fully implemented switch, the performance characteristics of an access from any AXI channel to any HBM channel should be roughly the same.

TABLE VI
MEMORY ACCESS LATENCY FROM ANY OF THE 32 AXI CHANNELS TO THE HBM CHANNEL 0. THE SWITCH IS ON. INTUITIVELY, A LONGER DISTANCE YIELDS A LONGER LATENCY. THE LATENCY DIFFERENCE REACHES UP TO 22 CYCLES.

Channels   Page hit            Page closed         Page miss
           Cycles   Time       Cycles   Time       Cycles   Time
0-3        55       122.2 ns   62       137.8 ns   69       153.3 ns
4-7        56       124.4 ns   63       140.0 ns   70       155.6 ns
8-11       58       128.9 ns   65       144.4 ns   72       160.0 ns
12-15      60       133.3 ns   67       148.9 ns   74       164.4 ns
16-19      71       157.8 ns   78       173.3 ns   85       188.9 ns
20-23      73       162.2 ns   80       177.7 ns   87       193.3 ns
24-27      75       166.7 ns   82       182.2 ns   89       197.8 ns
28-31      77       171.1 ns   84       186.7 ns   91       202.2 ns
However, in the current implementation, the relative distance could play an important role. In the following, we examine the performance characteristics between any AXI channel and any HBM channel, in terms of latency and throughput.
1) Memory Latency:
Due to space constraints, we only demonstrate the memory access latency of memory read transactions issued from any AXI channel (from 0 to 31) to the HBM channel 0. Accesses to other HBM channels have similar performance characteristics. The switch is enabled to allow global addressing when comparing the latency difference among AXI channels. Similar to the experimental setup in Subsection V-B, we also employ the engine module to determine the accurate latency for the case B = 32, W = 0x1000000, N = 1024, and varying S. Table VI illustrates the latency differences among the 32 AXI channels. We have two observations.

First, the latency difference can be up to 22 cycles. For example, for a page hit transaction, an access from the AXI channel 31 needs 77 cycles, while an access from the AXI channel 0 only needs 55 cycles. Second, the access latency from any AXI channel in the same mini-switch is identical, demonstrating that the mini-switch is fully implemented. For example, the AXI channels 4-7 in the same mini-switch have the same access latency to the HBM channel 0. We conclude that an AXI channel should access its associated HBM channel or the HBM channels close to it to minimize latency.
2) Memory Throughput:
We employ the engine module to measure the memory throughput from any AXI channel (from 0 to 31) to the HBM channel 0, with the setting B = 64, W = 0x1000000, N = 200000, and varying S. Figure 8 illustrates the memory throughput from an AXI channel in each mini-switch to the HBM channel 0. We observe that the AXI channels are able to achieve roughly the same memory throughput, regardless of their locations.

Fig. 8. Throughput from eight AXI channels to the HBM channel 1, where each AXI channel is from a mini-switch.

VII. RELATED WORK

To our knowledge, Shuhai is the first platform to benchmark HBM on FPGAs in a systematic and comprehensive manner. We contrast closely related work with Shuhai on 1) benchmarking traditional memory on FPGAs, 2) data processing with HBM, and 3) accelerating applications with FPGAs.
First, benchmarking traditional memory on FPGAs. Previous work [13], [14], [15] tries to benchmark traditional memory, e.g., DDR3, on the FPGA by using high-level languages, e.g., OpenCL. In contrast, we benchmark HBM on a state-of-the-art FPGA.

Second, data processing with HBM/HMC. Previous work [16], [17], [18], [19], [20], [21], [22], [23] employs HBM to accelerate applications, e.g., hash tables, deep learning, and streaming, by leveraging the high memory bandwidth provided by Intel Knights Landing (KNL)'s HBM [24]. In contrast, we benchmark the performance of HBM on the Xilinx FPGA.

Third, accelerating applications with FPGAs. Previous work [25], [26], [27], [28], [29], [30], [31], [32], [33], [34], [35], [36], [3], [37], [38], [39], [40], [41], [42], [43], [44], [2], [45], [46], [47], [48], [49], [50] accelerates a broad range of applications, e.g., databases and deep learning inference, using FPGAs. In contrast, we systematically benchmark HBM on a state-of-the-art FPGA regardless of the application.

VIII. CONCLUSION
FPGAs are being enhanced with High Bandwidth Memory (HBM) to tackle the memory bandwidth bottleneck that dominates memory-bound applications. However, the performance characteristics of HBM have not yet been quantitatively and systematically analyzed on FPGAs. We bridge this gap by benchmarking HBM stacks on a state-of-the-art FPGA featuring a two-stack HBM2 subsystem. Accordingly, we propose Shuhai to demystify the underlying details of HBM such that the user is able to obtain a more accurate picture of the behavior of HBM than what can be obtained by doing so on CPUs/GPUs, as they introduce noise from the caches. Shuhai can be easily generalized to other FPGA boards or other generations of memory modules. We will make the related benchmarking code open-source such that new FPGA boards can be explored and the results across boards compared. The code is available at: https://github.com/RC4ML/Shuhai.

ACKNOWLEDGEMENTS
We thank the Xilinx University Program for the valuable feedback to improve the quality of this paper. This work is supported by the National Natural Science Foundation of China (U19B2043, 61976185) and the Fundamental Research Funds for the Central Universities.
REFERENCES

[1] Xilinx, "Alveo U280 Data Center Accelerator Card Data Sheet," 2019.
[2] Z. Wang, B. He, and W. Zhang, "A study of data partitioning on OpenCL-based FPGAs," in FPL, 2015.
[3] Z. Liu, Y. Dou, J. Jiang, Q. Wang, and P. Chow, "An FPGA-based processor for training convolutional neural networks," in FPT, 2017.
[4] Xilinx, "AXI High Bandwidth Memory Controller v1.0," 2019.
[5] H. Jun, J. Cho, K. Lee, H. Son, K. Kim, H. Jin, and K. Kim, "HBM (high bandwidth memory) DRAM technology and architecture," in IMW, 2017.
[6] K. Cho, H. Lee, H. Kim, S. Choi, Y. Kim, J. Lim, J. Kim, H. Kim, Y. Kim, and Y. Kim, "Design optimization of high bandwidth memory (HBM) interposer considering signal integrity," in EDAPS, 2015.
[7] M. O'Connor, N. Chatterjee, D. Lee, J. Wilson, A. Agrawal, S. W. Keckler, and W. J. Dally, "Fine-grained DRAM: Energy-efficient DRAM for extreme bandwidth systems," in MICRO, 2017.
[8] J. Macri, "AMD's next generation GPU and high bandwidth memory architecture: Fury," in Hot Chips, 2015.
[9] M. Wissolik, D. Zacher, A. Torza, and B. Day, "Virtex UltraScale+ HBM FPGA: A Revolutionary Increase in Memory Performance," 2019.
[10] Xilinx, "AXI Reference Guide," 2011.
[11] Arm, "AMBA AXI and ACE Protocol Specification," 2017.
[12] S. Manegold, P. Boncz, and M. L. Kersten, "Generic database cost models for hierarchical memory systems," in PVLDB, 2002.
[13] H. R. Zohouri and S. Matsuoka, "The memory controller wall: Benchmarking the Intel FPGA SDK for OpenCL memory interface," in H2RC, 2019.
[14] S. W. Nabi and W. Vanderbauwhede, "Smart-cache: Optimising memory accesses for arbitrary boundaries and stencils on FPGAs," in IPDPSW, 2019.
[15] K. Manev, A. Vaishnav, and D. Koch, "Unexpected diversity: Quantitative memory analysis for Zynq UltraScale+ systems," in FPT, 2019.
[16] H. Miao, M. Jeon, G. Pekhimenko, K. S. McKinley, and F. X. Lin, "StreamBox-HBM: Stream analytics on high bandwidth hybrid memory," in ASPLOS, 2019.
[17] C. Pohl and K.-U. Sattler, "Joins in a heterogeneous memory hierarchy: Exploiting high-bandwidth memory," in DAMON, 2018.
[18] X. Cheng, B. He, E. Lo, W. Wang, S. Lu, and X. Chen, "Deploying hash tables on die-stacked high bandwidth memory," in CIKM, 2019.
[19] Y. You, A. Buluç, and J. Demmel, "Scaling deep learning on GPU and Knights Landing clusters," in SC, 2017.
[20] I. B. Peng, R. Gioiosa, G. Kestor, P. Cicotti, E. Laure, and S. Markidis, "Exploring the performance benefit of hybrid memory system on HPC environments," in IPDPSW, 2017.
[21] A. Li, W. Liu, M. R. B. Kristensen, B. Vinter, H. Wang, K. Hou, A. Marquez, and S. L. Song, "Exploring and Analyzing the Real Impact of Modern On-Package Memory on HPC Scientific Kernels," in SC, 2017.
[22] S. Khoram, J. Zhang, M. Strange, and J. Li, "Accelerating graph analytics by co-optimizing storage and access on an FPGA-HMC platform," in FPGA, 2018.
[23] B. Bramas, "Fast Sorting Algorithms using AVX-512 on Intel Knights Landing," CoRR, 2017.
[24] J. Jeffers, J. Reinders, and A. Sodani, "Intel Xeon Phi Processor High Performance Programming: Knights Landing Edition," 2016.
[25] Altera, "Guidance for Accurately Benchmarking FPGAs," 2007.
[26] Q. Gautier, A. Althoff, P. Meng, and R. Kastner, "Spector: An OpenCL FPGA benchmark suite," in FPT, 2016.
[27] Y.-K. Choi, J. Cong, Z. Fang, Y. Hao, G. Reinman, and P. Wei, "A Quantitative Analysis on Microarchitectures of Modern CPU-FPGA Platforms," in DAC, 2016.
[28] Y.-K. Choi, J. Cong, Z. Fang, Y. Hao, G. Reinman, and P. Wei, "In-Depth Analysis on Microarchitectures of Modern Heterogeneous CPU-FPGA Platforms," ACM Trans. Reconfigurable Technol. Syst., 2019.
[29] D. B. Thomas, L. Howes, and W. Luk, "A Comparison of CPUs, GPUs, FPGAs, and Massively Parallel Processor Arrays for Random Number Generation," in FPGA, 2009.
[30] Z. István, D. Sidler, and G. Alonso, "Runtime parameterizable regular expression operators for databases," in FCCM, 2016.
[31] M. J. H. Pantho, J. Mandebi Mbongue, C. Bobda, and D. Andrews, "Transparent Acceleration of Image Processing Kernels on FPGA-Attached Hybrid Memory Cube Computers," in FPT, 2018.
[32] J. Weberruss, L. Kleeman, D. Boland, and T. Drummond, "FPGA acceleration of multilevel ORB feature extraction for computer vision," in FPL, 2017.
[33] S. I. Venieris and C. Bouganis, "fpgaConvNet: A framework for mapping convolutional neural networks on FPGAs," in FCCM, 2016.
[34] S. Taheri, P. Behnam, E. Bozorgzadeh, A. Veidenbaum, and A. Nicolau, "AFFIX: Automatic acceleration framework for FPGA implementation of OpenVX vision algorithms," in FPGA, 2019.
[35] H. Parandeh-Afshar, P. Brisk, and P. Ienne, "Efficient synthesis of compressor trees on FPGAs," in ASP-DAC, 2008.
[36] J. Wang, Q. Lou, X. Zhang, C. Zhu, Y. Lin, and D. Chen, "Design flow of accelerating hybrid extremely low bit-width neural network in embedded FPGA," in FPL, 2018.
[37] G. Weisz, J. Melber, Y. Wang, K. Fleming, E. Nurvitadhi, and J. C. Hoe, "A study of pointer-chasing performance on shared-memory processor-FPGA systems," in FPGA, 2016.
[38] Q. Xiong, R. Patel, C. Yang, T. Geng, A. Skjellum, and M. C. Herbordt, "GhostSZ: A transparent FPGA-accelerated lossy compression framework," in FCCM, 2019.
[39] M. Asiatici and P. Ienne, "Stop crying over your cache miss rate: Handling efficiently thousands of outstanding misses in FPGAs," in FPGA, 2019.
[40] S. Jun, M. Liu, S. Xu, and Arvind, "A transport-layer network for distributed FPGA platforms," in FPL, 2015.
[41] N. Ramanathan, J. Wickerson, F. Winterstein, and G. A. Constantinides, "A case for work-stealing on FPGAs with OpenCL atomics," in FPGA, 2016.
[42] J. Fowers, K. Ovtcharov, K. Strauss, E. S. Chung, and G. Stitt, "A high memory bandwidth FPGA accelerator for sparse matrix-vector multiplication," in FCCM, 2014.
[43] E. Brossard, D. Richmond, J. Green, C. Ebeling, L. Ruzzo, C. Olson, and S. Hauck, "A model for programming data-intensive applications on FPGAs: A genomics case study," in SAAHPC, 2012.
[44] Z. Wang, K. Kara, H. Zhang, G. Alonso, O. Mutlu, and C. Zhang, "Accelerating Generalized Linear Models with MLWeaving: A One-size-fits-all System for Any-precision Learning," VLDB, 2019.
[45] Z. Wang, J. Paul, H. Y. Cheah, B. He, and W. Zhang, "Relational query processing on OpenCL-based FPGAs," in FPL, 2016.
[46] Z. Wang, B. He, W. Zhang, and S. Jiang, "A performance analysis framework for optimizing OpenCL applications on FPGAs," in HPCA, 2016.
[47] Z. Wang, S. Zhang, B. He, and W. Zhang, "Melia: A MapReduce framework on OpenCL-based FPGAs," TPDS, 2016.
[48] M. Owaida, G. Alonso, L. Fogliarini, A. Hock-Koon, and P.-E. Melet, "Lowering the latency of data processing pipelines through FPGA based hardware acceleration," 2019.
[49] Z. He, Z. Wang, and G. Alonso, "BiS-KM: Enabling any-precision k-means on FPGAs," in FPGA, 2020.
[50] Z. Wang, J. Paul, B. He, and W. Zhang, "Multikernel data partitioning with channel on OpenCL-based FPGAs,"