Analysis and Optimization of I/O Cache Coherency Strategies for SoC-FPGA Device
Seung Won Min, Sitao Huang, Mohamed El-Hadedy, Jinjun Xiong, Deming Chen, Wen-mei Hwu
Seung Won Min, Electrical and Computer Engineering, University of Illinois, Urbana, IL ([email protected])
Sitao Huang, Electrical and Computer Engineering, University of Illinois, Urbana, IL ([email protected])
Mohamed El-Hadedy, Electrical and Computer Engineering, California State Polytechnic University, Pomona, CA ([email protected])
Jinjun Xiong, IBM T.J. Watson Research Center, Yorktown Heights, NY ([email protected])
Deming Chen, Electrical and Computer Engineering, University of Illinois, Urbana, IL ([email protected])
Wen-mei Hwu, Electrical and Computer Engineering, University of Illinois, Urbana, IL ([email protected])
Abstract—Unlike traditional PCIe-based FPGA accelerators, heterogeneous SoC-FPGA devices provide tighter integration between software running on CPUs and hardware accelerators. Modern heterogeneous SoC-FPGA platforms support multiple I/O cache coherence options between CPUs and FPGAs, but these options can have inadvertent effects on the achieved bandwidths depending on applications and data access patterns. To provide the most efficient communication between CPUs and accelerators, understanding the data transaction behaviors and selecting the right I/O cache coherence method is essential. In this paper, we use Xilinx Zynq UltraScale+ as the SoC platform to show how certain I/O cache coherence methods can perform better or worse in different situations, ultimately affecting the overall accelerator performance as well. Based on our analysis, we further explore possible software and hardware modifications to improve the I/O performance with different I/O cache coherence options. With our proposed modifications, the overall performance of an SoC design can be improved by 20% on average.
Index Terms—FPGA, heterogeneous computing, cache, cache coherence
I. INTRODUCTION
Heterogeneous SoC-FPGA platforms such as the Xilinx Zynq UltraScale+ MPSoC provide a flexible development environment with tightly-coupled interfaces between the different processing units inside. Depending on the needs of users, these processing units can be combined and programmed to provide the most suitable configuration. For the different components to operate seamlessly together, it is important to understand how data coherency between them is managed. For traditional server or desktop class machines, there is little meaning in configuring the host system's I/O cache coherence for general FPGA designers because often: 1) manufacturers do not provide any documentation at that level of detail, or 2) I/O cache coherence is enabled by default in systems of that scale. On the other hand, in SoC-FPGA design, all available I/O cache coherence options are fully disclosed to the FPGA designers, and the designers are responsible for choosing the most suitable methods for the target applications.

However, choosing the right I/O cache coherence method for different applications is a challenging task because of its versatility. Different methods introduce different types of overheads, and depending on data access patterns, those overheads can be amplified or diminished. In our experiments, we find that using different I/O cache coherence methods can vary overall application execution times by up to 3.39×. This versatility not only makes it hard for designers to decide which method to use, but can also mislead them into wrong decisions if performance evaluations are incomprehensive. SoC IP providers such as Xilinx and ARM provide high-level guides [1], [2] for using different I/O cache coherence methods and interfaces, but these are often vague and do not include any quantitative analysis.

In this work, we analyze the effects of using different I/O cache coherence methods in SoC-FPGA in as much detail as possible and provide a general guide for using each method. Our I/O cache coherence performance analysis consists of two parts: software costs and hardware costs. The software cost denotes how much the software portion of applications can be affected to maintain certain types of I/O cache coherence methods. The hardware cost denotes how much the hardware complexities added to maintain I/O cache coherence can affect I/O bandwidths. Later in this paper, both costs are combined to evaluate the total cost of I/O cache coherence. Throughout the experiments, we use Xilinx's Zynq UltraScale+ platform, which supports a variety of interface options including a hardware coherent I/O bus and direct accesses to the L2 cache. The contributions of this paper can be summarized as follows:

• Evaluate software and hardware costs of using different I/O cache coherence methods.
• Introduce several optimization techniques which can eliminate some I/O cache coherence costs.
• Provide a complete guide to achieving efficient I/O cache coherence based on real hardware evaluation results.

The rest of the paper is organized as follows. In Section II, we explain the background of different I/O cache coherence strategies in detail. In Section III, we elaborate on our experiment environment. In Section IV, we show our software and hardware I/O cache coherence cost evaluation results. In Section V, we provide a general guide for I/O cache coherence optimizations. Section VI discusses related works. Finally, in Section VII, we summarize our work and conclude this paper.

II. I/O CACHE COHERENCE
In a modern system design, it is common to use memory as a shared buffer to transfer data between CPUs and I/O devices [3]. However, with CPU caches, it is possible that the data inside the shared buffer is physically scattered over the caches and DRAM. In such a case, depending on the perspective, the buffer may contain different values. To avoid this situation, I/O cache coherence is required to maintain data coherency and consistency between CPUs and I/O devices. I/O cache coherence can be achieved in several ways. First, certain regions of memory can be disabled from caching. Second, CPUs can manually flush or invalidate cache lines before any I/O data transactions. Third, hardware implementations can be added so I/O devices can snoop CPU caches. In this section, we describe these methods and briefly discuss their benefits and costs.
A. Allocating Non-cacheable Memory
The simplest way of achieving I/O cache coherency is making memory accesses non-cacheable. This does not need to be enforced globally; it can be narrowed down to the specific memory regions which are shared between CPUs and I/O devices by setting the appropriate ISA-dependent virtual page attributes. However, in this case, CPU memory accesses to these regions lose the benefits of data locality.
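As a concrete illustration, in a Xilinx SDSoC flow (the flow used for the case studies later in this paper) a shared buffer can be allocated as cacheable or non-cacheable from user space. The snippet below is a minimal sketch and assumes the sds_lib allocators sds_alloc(), sds_alloc_non_cacheable(), and sds_free() are available; the buffer size and usage are placeholders.

```cpp
#include <cstdint>
#include "sds_lib.h"  // SDSoC runtime allocators (assumed available in the SDSoC flow)

constexpr size_t BUF_BYTES = 1 << 20;  // example: 1 MB shared buffer

int main() {
    // Physically contiguous, cacheable buffer: CPU accesses hit the caches,
    // so software or hardware I/O coherency is needed before the PL touches it.
    uint32_t* cacheable = static_cast<uint32_t*>(sds_alloc(BUF_BYTES));

    // Physically contiguous, non-cacheable buffer: no coherency maintenance
    // is required, but every CPU access goes to DRAM (no data locality).
    uint32_t* uncached =
        static_cast<uint32_t*>(sds_alloc_non_cacheable(BUF_BYTES));

    // ... hand either pointer to the accelerator ...

    sds_free(cacheable);
    sds_free(uncached);
    return 0;
}
```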
B. Software I/O Coherency
Software I/O coherency requires CPUs to manually flush or invalidate cache lines by executing cache maintenance instructions before any data transactions between CPUs and I/O devices are made. In this method, CPUs can still cache data from the memory regions shared with I/O devices, but the manual cache instructions are in the critical path of I/O data transactions, and they can decrease effective bandwidths [4]. Furthermore, global memory fences should be inserted between the cache instructions and data accesses to guarantee that no data access precedes the cache instructions.
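To make the ordering concrete, the sketch below shows the bare-metal flavor of this pattern using the cache maintenance routines from Xilinx's standalone BSP (xil_cache.h); in the Linux flow used later in this paper the equivalent flush/invalidate calls and barriers are issued inside the Xilinx drivers. The buffer names and the accelerator start are placeholders, and exact routine signatures may differ between BSP versions.

```cpp
#include <cstddef>
#include <cstdint>
#include "xil_cache.h"  // Xil_DCacheFlushRange / Xil_DCacheInvalidateRange (standalone BSP, assumed)

void transfer_with_sw_coherency(uint8_t* in, uint8_t* out, size_t n) {
    // CPU -> device: push dirty cache lines of the input buffer to DRAM so
    // the device reads up-to-date data.
    Xil_DCacheFlushRange(reinterpret_cast<UINTPTR>(in), n);

    // Global memory barrier: no later access may be reordered before the flush.
    __asm__ volatile("dsb sy" ::: "memory");

    // ... start the accelerator here; it reads 'in' and writes 'out' in DRAM ...

    // Device -> CPU: drop any stale cache lines covering the output buffer so
    // subsequent CPU reads fetch the device-written data from DRAM.
    Xil_DCacheInvalidateRange(reinterpret_cast<UINTPTR>(out), n);
    __asm__ volatile("dsb sy" ::: "memory");
}
```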
C. Hardware I/O Coherency
Hardware coherency relies on hardware implementations included in the host system which let I/O devices snoop CPU caches. This I/O coherence method requires the least amount of the software designer's attention, as the shared memory regions can be treated as cacheable and no cache maintenance instructions are required. Cache snooping can be achieved in largely two ways. First, the I/O buses between CPUs and I/O devices can be modified so that every memory access request from I/O devices causes a cache snoop request as well. Depending on the snooping results, I/O devices can directly grab data from the caches for reads, or automatically invalidate stale cache lines in the CPU caches when writing to memory. However, resolving cache snoop requests may require several extra bus cycles between different memory requests, which can reduce I/O bandwidths [5]. The second way is directly connecting I/O devices to the caches. In this case, I/O devices generate cache snooping requests like other CPU cores. The difference compared to the first method is that I/O data requests are treated as regular CPU data requests and each request generates a cache line allocation. This is beneficial if the allocated cache line is reused frequently, but with inappropriate data access patterns it can end up evicting useful cache lines for the CPUs.

Fig. 1. Simplified block diagram of possible I/O configurations in Xilinx Zynq UltraScale+. ❶ The Accelerator Coherency Port (ACP) can access the L2 cache directly. ❷ The High Performance Coherent (HPC) interface goes through the coherent I/O bus where it can issue cache snooping requests to the CPU cache. ❸ The High Performance (HP) interface goes to memory directly and I/O cache coherence must be handled by the CPU.

III. EXPERIMENT ENVIRONMENT
All experiments in this paper are done on a Xilinx Zynq UltraScale+ MPSoC. Zynq UltraScale+ has a Processing System (PS) block and a Programmable Logic (PL) block, as described in Fig. 1. The PS consists of hard IPs such as the CPU, coherent I/O, and memory. The PL consists of programmable logic and can be programmed by users like a regular FPGA. Between the two blocks, several types of I/O are available. ❶ The Accelerator Coherency Port (ACP) interface can access the shared L2 cache (1MB) directly. However, this port does not fully comply with the Advanced eXtensible Interface 4 (AXI4) protocol which is commonly used in Xilinx IPs. Since there is no publicly available ACP adapter IP, we developed an ACP↔AXI4 converter for our experiments. ❷ The High Performance Coherent (HPC) interface goes through the coherent I/O bus where it can issue cache snooping requests to the CPU cache. The ARM Cache Coherent Interconnect 400 (CCI-400) [6] is used for this coherent I/O bus, and it uses the AXI Coherency Extensions (ACE) and ACE-Lite protocols to support cache coherency. The ACE protocol supports bi-directional (Cache↔Cache) cache coherency and ACE-Lite supports one-directional (Device→Cache) cache coherency. CCI-400 can support up to two ACE ports, where one is already occupied by the ARM Cortex-A53 CPU. We do not use the other ACE port in this experiment since our accelerators do not implement any private caches. In the context of Zynq UltraScale+, the HPC interfaces only use the ACE-Lite protocol. ❸ The High Performance (HP) interface goes to memory directly and I/O cache coherence must be handled by the CPU. All interfaces are 128 bits (16 bytes) wide and we fix the interface frequencies to 300 MHz throughout our experiments, providing a maximum theoretical bandwidth of 16 B × 300 MHz = 4.8 GB/s. Table I summarizes the Zynq UltraScale+ interfaces and possible I/O cache coherence methods. In the rest of the paper, we refer to the HP interface with non-cacheable and cacheable memory allocations as HP (NC) and HP (C), respectively.

The software I/O coherency implementation is embedded in the Xilinx drivers, and the drivers are capable of identifying the buffer allocation types. If the buffers are non-cacheable, the drivers do not manually flush or invalidate caches. If the buffers are cacheable, the drivers automatically perform cache flushes and invalidations.

TABLE I
AVAILABLE PL INTERFACES AND DATA COHERENCY METHODS IN ZYNQ ULTRASCALE+

Alias   | Interface | Memory Allocation | Data channel is connected to | Coherency Method
HP (NC) | HP        | Non-cacheable     | Memory                       | Not Required
HP (C)  | HP        | Cacheable         | Memory                       | Cache Inst.
HPC     | HPC       | Cacheable         | Memory & Cache (Read-only)   | H/W Coherent
ACP     | ACP       | Cacheable         | Cache                        | H/W Coherent

Fig. 2. I/O bus TX (CPU→PL) bandwidth comparison. No software overhead is included in this measurement.
IV. I/O CACHE COHERENCE AND SOC-FPGA

In this section, we evaluate the hardware and software costs of different I/O cache coherence methods. For the hardware cost, we are interested in identifying how much the extra steps required to resolve cache snoop requests in hardware can negatively affect I/O bandwidths. For the software cost evaluation, we measure the CPU overheads added when hardware coherent I/O interfaces are not supported.

Fig. 3. I/O bus RX (PL→CPU) bandwidth comparison. No software overhead is included in this measurement.
A. Hardware Cost Evaluation
In this experiment, we measure the raw bandwidths of the non-hardware coherent I/O (HP) and hardware coherent I/O (HPC and ACP) interfaces. The raw bandwidth here means the pure interface bandwidth without any software overheads included. To measure CPU to PL (TX) and PL to CPU (RX) bandwidths, we program the PL to initiate data transfers and count how many bus clock cycles are spent. For the hardware coherent I/Os, we would also like to know if there are any bandwidth differences when the shared buffer data for the TX and RX cases is cached or not. To achieve this, we intentionally read/write or flush the entire range of the shared buffers before the data transfers begin. The summary of the test setups can be found in Table II. We do not differentiate between HP (NC) and HP (C) in this experiment as their differences lie only in software costs.

TABLE II
RAW BANDWIDTH TEST SETUP

Direction | Interface      | Before the data transfer the buffer has been
CPU → PL  | HP             | -
          | HPC (w/ Write) | Written
          | HPC (w/ Flush) | Flushed
          | ACP (w/ Write) | Written
          | ACP (w/ Flush) | Flushed
PL → CPU  | HP             | -
          | HPC (w/ Read)  | Read
          | HPC (w/ Flush) | Flushed
          | ACP (w/ Read)  | Read
          | ACP (w/ Flush) | Flushed

Fig. 2 shows the TX bandwidth measurement results. Starting from the HP results, we observe almost no differences in TX bandwidths while sweeping from 4KB to 32MB data transfers. There is a small bandwidth drop at 4KB due to the initial DRAM access latency, but the overhead of the latency becomes almost invisible as the data transfer size increases.

In the case of HPC, we see huge differences depending on whether the data is cached or not. For HPC (w/ Flush), there is only a small bandwidth drop compared to HP, but for HPC (w/ Write), the TX bandwidth decreases significantly. Based on this analysis, we can assume the data flow path from the CPU cache to the device is sub-optimal in Zynq UltraScale+. Writing a larger amount of data to the buffer attenuates this problem as the maximum amount of cached data is limited by the L2 size. Still, to reach near the peak HPC bandwidth, more than 32MB of data should be transferred.

The ACP bandwidth nearly reaches 4.8 GB/s with small sizes of data, but it starts to drop sharply as the data size approaches the L2 size. The A53 L2 cache does not have a hardware prefetching unit, and therefore all cache accesses without pre-populated cache lines need to pay cache miss penalties. By observing the measurement results, we can assume that writing more than 64KB of data at one time starts to evict its previously allocated cache lines. Currently, the A53 L2 cache uses a random cache replacement policy, but future SoC-FPGA platforms using a least recently used cache replacement policy may push back the self-eviction point. When the buffer is completely flushed before the data transfer, ACP constantly suffers from low bandwidth as all cache accesses cause cache misses.

For the RX bandwidth measurement results, we do not see any significant bandwidth changes besides ACP. In Fig. 3, both HP and HPC reach nearly 4.8 GB/s of bandwidth in all cases. In the case of ACP, we observe a similar trend to the TX case where the ACP bandwidth is higher when most of the data is cached. The bandwidth discrepancies between the RX and TX cases can be due to the cache coherency protocol. For example, Molka et al. [7] describe different cache read and write bandwidths in Intel's Nehalem processors due to the cache coherency protocol used in them.

Fig. 4. (a) Memcpy execution time comparison using different combinations of non-cacheable and cacheable source/destination buffers. (b) Matrix transpose execution time comparison with non-cacheable and cacheable destination buffers.
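As a small worked example of how the raw numbers behind Fig. 2 and Fig. 3 are derived, the helper below converts a bus-cycle count reported by a PL-side counter into a bandwidth. The 300 MHz bus clock and 128-bit data width follow Section III; the example transfer size is ours.

```cpp
#include <cstdint>
#include <cstdio>

// Fixed interface parameters from Section III.
constexpr double BUS_HZ    = 300e6;  // PL interface clock
constexpr double BUS_BYTES = 16.0;   // 128-bit data width

// Convert a transfer of 'bytes' that took 'bus_cycles' bus clocks into GB/s.
double bandwidth_gbps(uint64_t bytes, uint64_t bus_cycles) {
    double seconds = static_cast<double>(bus_cycles) / BUS_HZ;
    return (static_cast<double>(bytes) / seconds) / 1e9;
}

int main() {
    // Example: a 32 MB transfer that keeps the 128-bit bus busy every cycle
    // takes 32 MiB / 16 B = 2,097,152 cycles and reaches ~4.8 GB/s.
    uint64_t bytes  = 32ull << 20;
    uint64_t cycles = bytes / static_cast<uint64_t>(BUS_BYTES);
    std::printf("theoretical peak: %.2f GB/s\n", bandwidth_gbps(bytes, cycles));
    return 0;
}
```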
B. Software Cost Evaluation
In this section, we evaluate non-cacheable memory access bandwidths and manual cache operation costs. The advantage of using caches has been well evaluated in the past [8], [9], but we include the evaluation in this paper for the completeness of the I/O cache coherency evaluation. For the non-cacheable memory access evaluation, we first measure four types of memory copy operations: non-cacheable to non-cacheable, non-cacheable to cacheable, cacheable to non-cacheable, and cacheable to cacheable. All memory copies are done using the memcpy() function from the C library. In Fig. 4 (a), we find the bandwidth penalty is as large as 30× when reading from the non-cacheable region compared to reading from the cacheable region. On the other hand, memory writes to the non-cacheable regions remain almost the same because the Write-Combine (WC) function can combine multiple non-cacheable write requests into a single larger memory write. This feature is further discussed in Section V-A.

Still, WC is only active for regular memory access patterns, and CPUs can suffer from long memory latencies with irregular memory write patterns. In Fig. 4 (b), we measure the execution times of matrix transpositions to different types of memory. In this experiment, the source matrix is stored in a cacheable memory region and the destination for the transposed matrix is located in a non-cacheable memory region. When the entire matrix can fit in the cache, the cacheable memory is about 4× faster than the non-cacheable memory. When the matrix size is much larger than the cache size, the cacheable memory is still about 1.33× faster than the non-cacheable memory.

When manual cache instructions are needed, the added CPU overhead depends heavily on other CPU workloads and the total number of buffers flushed or invalidated. In Linux, after each buffer is flushed or invalidated, a global memory barrier should be inserted to guarantee no memory accesses are reordered. If this global memory barrier needs to be executed multiple times while heavy memory accesses are being made, the overall CPU performance can be severely degraded.

In Fig. 5, we show the data transfer time breakdown with manual cache instructions. With a smaller data size, the manual cache instructions take the majority of the total data transfer time. With a larger data size, the overhead of the memory barrier takes a smaller portion of the total data transfer time and the total overhead of the manual cache instructions becomes smaller. We find the directions of the data transfers do not significantly affect the manual cache instruction overheads.

Fig. 5. Data transfer time breakdown with manual cache maintenance instructions.
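A minimal sketch of the memcpy measurement in Fig. 4 (a) is shown below. It assumes the sds_lib allocators introduced earlier and uses sds_clock_counter() purely as an example timer; the copy size is ours and the exact timer API may differ by SDSoC version.

```cpp
#include <cstring>
#include <cstdio>
#include "sds_lib.h"  // sds_alloc, sds_alloc_non_cacheable, sds_clock_counter (assumed)

constexpr size_t N = 8 << 20;  // example: 8 MB copy

// Time one memcpy between a given destination/source pair.
static unsigned long long time_copy(void* dst, const void* src) {
    unsigned long long start = sds_clock_counter();
    std::memcpy(dst, src, N);
    return sds_clock_counter() - start;
}

int main() {
    char* c_src = static_cast<char*>(sds_alloc(N));                // cacheable
    char* c_dst = static_cast<char*>(sds_alloc(N));
    char* n_src = static_cast<char*>(sds_alloc_non_cacheable(N));  // non-cacheable
    char* n_dst = static_cast<char*>(sds_alloc_non_cacheable(N));

    // The four combinations measured in Fig. 4 (a).
    std::printf("N-cache. to N-cache.: %llu cycles\n", time_copy(n_dst, n_src));
    std::printf("N-cache. to Cache.  : %llu cycles\n", time_copy(c_dst, n_src));
    std::printf("Cache.   to N-cache.: %llu cycles\n", time_copy(n_dst, c_src));
    std::printf("Cache.   to Cache.  : %llu cycles\n", time_copy(c_dst, c_src));

    sds_free(c_src); sds_free(c_dst); sds_free(n_src); sds_free(n_dst);
    return 0;
}
```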
Fig. 6. Decision tree for selecting the optimal I/O cache coherence method.
V. OPTIMIZING DATA TRANSACTIONS
In this section, we suggest several I/O cache coherence optimization techniques to achieve the most effective data transaction behavior. First, we introduce several hardware features which can be exploited to remove some I/O cache coherence overheads. Second, we present a decision tree (Fig. 6) which can be utilized to optimize I/O cache coherence selections. Finally, we apply our decision tree to several applications and compare the overall performance with baseline designs.
A. Exploiting Hardware Features

1) Write Combine (WC):
WC is a cache feature which can combine multiple write accesses to non-cacheable regions into a single larger memory write request [10]. Compared to requesting multiple small memory writes, requesting a single larger memory write can better utilize the memory bandwidth. To activate this feature, consecutive write requests should be contiguous in the memory address space to a certain degree. The minimum requirement for the contiguity may depend on the CPU architecture; the A53 requires at least that the write requests be 128-bit aligned. For example, four integer (4-byte) write requests to addresses 0x00, 0x04, 0x08, and 0x0C can be combined into a single 128-bit write request. When the write requests point to memory addresses that fall into different 128-bit alignments, they need to be split into different memory write requests.
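The contrast below is a sketch of WC-friendly versus WC-hostile access patterns to a non-cacheable buffer (for example one obtained from sds_alloc_non_cacheable); the function names and matrix layout are ours.

```cpp
#include <cstddef>
#include <cstdint>

// Sequential stores fill each 128-bit (16-byte) window back to back, so the
// write-combine logic can merge four consecutive 32-bit stores into one
// larger non-cacheable write request.
void wc_friendly(uint32_t* buf, size_t n) {
    for (size_t i = 0; i < n; ++i)
        buf[i] = static_cast<uint32_t>(i);
}

// Column-order stores into a row-major matrix jump by 'cols' elements per
// store, so consecutive stores land in different 16-byte windows and each
// one becomes its own small non-cacheable write request.
void wc_hostile(uint32_t* buf, size_t rows, size_t cols) {
    for (size_t c = 0; c < cols; ++c)
        for (size_t r = 0; r < rows; ++r)
            buf[r * cols + c] = static_cast<uint32_t>(r);
}
```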
2) Cache Bypass: In Section IV-A, we showed that the CPU→PL bandwidth of the HPC interface can be significantly lower when the data is cached. It is possible to resolve this by manually flushing cache lines, but this costs CPU cycles in exchange. One way to implicitly flush the cache lines is to use a cache bypass function in hardware [11]. Cache bypass can be used in cacheable memory regions where the caches decide not to allocate certain cache lines for certain data access patterns. In the A53, a similar function, called Read Allocate Mode, is implemented to not allocate cache lines when there is a massive amount of writes with regular access patterns. This kind of behavior can often be observed when using memset(). With this feature, data can be written directly into DRAM without explicitly executing cache flush instructions, even if the memory regions are cacheable. However, if the memory writes are done with irregular patterns, the read allocate mode is not activated.
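The sketch below illustrates the kind of access patterns involved; it is only an illustration under the behavior described above, and whether the mode actually engages depends on the core configuration and the run-time access stream.

```cpp
#include <cstring>
#include <cstdint>
#include <cstddef>

// 'buf' is CACHEABLE here (e.g. from sds_alloc). A long run of sequential
// stores such as this memset-style fill can trigger the A53 read allocate
// mode: the written lines are streamed toward DRAM instead of being
// allocated in the cache, so no explicit flush is needed before the PL
// reads the buffer over HPC.
void fill_streaming(uint8_t* buf, size_t n) {
    std::memset(buf, 0x5A, n);
}

// Irregular (here: strided) writes do not activate the mode; these stores
// allocate and dirty cache lines, which must later be flushed (HP (C)) or
// snooped (HPC/ACP) before the accelerator can see the data.
void fill_strided(uint8_t* buf, size_t n, size_t stride) {
    for (size_t i = 0; i < n; i += stride)
        buf[i] = 0x5A;
}
```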
B. I/O Cache Coherence Decision Tree
Gathering all the explorations from the previous sections, we build a decision tree (Fig. 6) to provide a general I/O cache coherence optimization flow. The total cost of I/O cache coherence can be roughly estimated as follows:

(total cost) = α (raw bandwidth) + (software cost)

Here, α represents the bandwidth requirement of an application. We first categorize all data transaction types into CPU to PL, PL to PL, and PL to CPU. To clarify, in this decision tree we only account for the cases where a shared memory (mostly host DRAM) between two instances is used as the data communication medium. Without any shared memory, there are no I/O cache coherency issues. Our decision tree strategy focuses on minimizing unexpected risks rather than maximizing possible gains, so the parameter values set in this decision tree can be rather conservative.

For the communication between PL logics, there is no CPU involvement and therefore using HP (NC) is the best. For the PL to CPU case, we conclude that using the HPC interface is the best in general as it can provide a relatively high memory bandwidth while not introducing additional software costs. The memory bandwidth loss with the HPC interface in this case compared to HP is about 5% (Fig. 3).

The CPU to PL case is more complex than the former two cases as the raw bandwidth differences are large. In this case, we first check if the TX buffer is mostly used for CPU writes. If the CPU is mostly writing to the buffer, then we check if the writing is mostly done in a sequential manner. If the memory write patterns are sequential or can be modified to be sequential, then we can safely use the non-cacheable memory allocation. If the writes cannot be made sequential or the CPU needs to make a substantial amount of read requests from this buffer, the buffer cannot be made non-cacheable. From this point, we need to rely on HP (C), HPC, or ACP. Using HP (C) is discouraged in general since executing extra cache instructions and memory barriers can only have negative effects on performance. To use HPC or ACP, we must check how much of the data to be transferred is cached, as the raw bandwidths of HPC and ACP vary greatly depending on the data location. However, because it is impossible to know the exact location of data before we access the cache, we rely on several educated guesses. First, we check the size of the data and follow the thresholds shown in Fig. 6: large transfers favor HPC, while small transfers whose data is likely to still be cache-resident favor ACP.
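The selection logic above can also be summarized in code. The sketch below is only an illustration of the flow in Fig. 6; the ~16 MB and ~64 KB thresholds are the conservative values suggested by our measurements on Zynq UltraScale+ and are not universal, and the default branch reflects our risk-averse reading of the tree rather than an exact transcription.

```cpp
#include <cstddef>

enum class Iface { HP_NC, HP_C, HPC, ACP };
enum class Dir { CPU_TO_PL, PL_TO_PL, PL_TO_CPU };

// Illustrative encoding of the Fig. 6 decision flow. Thresholds are
// platform-specific and should be re-measured on other devices.
Iface choose_iface(Dir dir, bool cpu_mostly_writes, bool writes_sequential,
                   size_t bytes, bool data_likely_cached) {
    if (dir == Dir::PL_TO_PL)  return Iface::HP_NC;  // no CPU involvement
    if (dir == Dir::PL_TO_CPU) return Iface::HPC;    // ~5% bandwidth loss, no SW cost

    // CPU -> PL: non-cacheable memory is safe only for sequential CPU writes
    // (write combining keeps the CPU-side cost low).
    if (cpu_mostly_writes && writes_sequential) return Iface::HP_NC;

    // Otherwise the buffer must stay cacheable; pick by expected transfer size.
    if (bytes > (16u << 20)) return Iface::HPC;   // large: snoop overhead amortized
    if (bytes < (64u << 10) && data_likely_cached)
        return Iface::ACP;                        // small and cache-resident
    return Iface::HPC;                            // default: avoid ACP self-eviction
}
```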
C. Case-Study Evaluations

To evaluate our decision tree, we use a modified Difference of Gaussian (DoG) filter from xfOpenCV [12], SGEMM, and CHaiDNN [13] with AlexNet as case-study examples. All applications are written in C++ and synthesized with Xilinx SDSoC. DoG takes grayscale images as inputs and generates two outputs. The first output is generated by directly applying a Gaussian filter to the input, and the second output is generated by passing the first output to another Gaussian filter. The two output images are then subtracted from each other to generate the final output. This difference of the Gaussian-filtered images is often used for edge detection. For this application, we use the CPU to convert RGB images to grayscale and to subtract the two Gaussian-filtered images, while the accelerator is used for the Gaussian filters. For SGEMM, we implement a 128×128 matrix multiplication accelerator and perform block matrix multiplication for larger input matrices. The CPU is responsible for cropping the input matrices into 128×128 blocks, feeding them into the SGEMM accelerator, and accumulating the accelerator outputs into the output matrix (a sketch of this host-side blocking loop is given at the end of this subsection). CHaiDNN accelerates the convolution and pooling layers of a DNN, and the CPU is responsible for quantizing input images and de-quantizing accelerator outputs.

For the baselines, we implement designs with pure HP (NC), HP (C), HPC, or ACP options. Due to the design complexity, we only compare HP (NC), HP (C), and the optimized version for CHaiDNN; the baseline CHaiDNN design from Xilinx only uses HP (NC) and HP (C). The optimized designs follow the decision tree we created. The modifications are only made to the memory allocation types and interface connections, and the accelerators are not modified while comparing with other I/O cache coherence methods.

Fig. 7 shows the benchmark results of DoG with different image sizes and of SGEMM. On average, our optimized version achieves at least a 20% execution time reduction compared to any other baseline configuration. In general, HP (NC) has the smallest accelerator execution times due to its high raw bandwidth, but the post-processing times are greatly increased. HP (C) in general has very long accelerator execution times because of the manual cache instructions and memory barriers. HPC performs well when the input sizes are large, but starts to suffer from low raw bandwidth when the inputs are small due to the reason explained in Section IV-A. In contrast, ACP performs well when the input sizes are small, but as the input sizes increase the cache hit rates become lower and the accelerator execution times start to skyrocket.

Fig. 7. Benchmark results using different I/O cache coherence methods. Difference of Gaussian (DoG) is tested with different image sizes.

Fig. 8 shows the AlexNet execution time breakdown with CHaiDNN. HP (NC) greatly suffers from non-cacheable memory accesses during both quantization and de-quantization. HP (C) has slightly better performance than HP (NC), but still needs to spend a non-negligible amount of time executing manual cache instructions. The optimized version removes the penalties of both HP (NC) and HP (C) and reduces the execution time by 37.2% and 30.9% compared to HP (NC) and HP (C), respectively.

Fig. 8. Benchmark results of CHaiDNN with different I/O cache coherence methods. Quantizations and de-quantizations are done on the CPU. The execution order is from bottom to top (Quant. → Conv1 → ... → Pool5 → De-Quant.).
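The host-side blocking loop used for SGEMM can be sketched as follows. The sgemm_accel() stand-in, the assumption that the matrix dimension is a multiple of 128, and the row-major layout are ours; in the actual design the accelerator call is an SDSoC-generated stub.

```cpp
#include <cstddef>

constexpr int T = 128;  // accelerator tile size used in this paper

// Stand-in for the 128x128 SGEMM accelerator: computes c += a * b in
// software so the sketch is self-contained.
static void sgemm_accel(float a[T][T], float b[T][T], float c[T][T]) {
    for (int x = 0; x < T; ++x)
        for (int y = 0; y < T; ++y) {
            float acc = c[x][y];
            for (int k = 0; k < T; ++k) acc += a[x][k] * b[k][y];
            c[x][y] = acc;
        }
}

// Host-side block matrix multiplication, assuming n is a multiple of 128.
// The CPU crops tiles out of the large row-major matrices, feeds them to
// the accelerator, and accumulates the partial results into C.
void block_sgemm(const float* A, const float* B, float* C, size_t n) {
    static float a[T][T], b[T][T], c[T][T];
    for (size_t i = 0; i < n; i += T)
        for (size_t j = 0; j < n; j += T) {
            for (int x = 0; x < T; ++x)            // start a fresh C tile
                for (int y = 0; y < T; ++y) c[x][y] = 0.0f;
            for (size_t k = 0; k < n; k += T) {
                for (int x = 0; x < T; ++x)        // crop the A and B tiles
                    for (int y = 0; y < T; ++y) {
                        a[x][y] = A[(i + x) * n + (k + y)];
                        b[x][y] = B[(k + x) * n + (j + y)];
                    }
                sgemm_accel(a, b, c);              // c += a * b
            }
            for (int x = 0; x < T; ++x)            // write the finished tile back
                for (int y = 0; y < T; ++y) C[(i + x) * n + (j + y)] = c[x][y];
        }
}
```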
VI. RELATED WORKS
There are several I/O cache coherence bandwidth studies on older SoC-FPGA platforms such as Xilinx's Zynq-7000 and Altera's Cyclone V [14]–[20]. For both platforms, the only available hardware coherent I/O port is ACP. [15]–[20] are limited to evaluating the raw I/O bandwidths of using different ports and do not include software cost evaluations. [14] has evaluated the software costs of I/O cache coherence, but only with a fixed data access pattern.

VII. CONCLUSION
The costs of different I/O cache coherence methods vary widely depending on applications. The I/O cache coherence optimization problem should be approached in a bottom-up fashion including both software and hardware profiling. In this paper, we presented multiple I/O cache coherence methods for SoC-FPGA and optimization techniques based on a thorough analysis of the Zynq UltraScale+ platform. By properly combining different I/O cache coherence methods, we showed the overall execution time can be reduced by 20%. In this paper, we mainly discussed I/O cache coherence in the context of CPU-to-accelerator connections, but it can also be applied to other device connections such as high-speed Ethernet, GPUs, and NVMe. Considering that modern SoC-FPGA platforms can support many different kinds of peripherals, a good understanding of I/O cache coherence optimization will be even more important in the future.

VIII. ACKNOWLEDGEMENTS
This work was supported by the Applications Driving Architectures (ADA) Research Center, a JUMP Center co-sponsored by SRC and DARPA, and the IBM-ILLINOIS Center for Cognitive Computing Systems Research (C3SR), a research collaboration as part of the IBM AI Horizon Network.

REFERENCES
IEEE Micro, vol. 29, no. 3, 2009.
[4] M. Loghi, M. Poncino, and L. Benini, "Cache coherence tradeoffs in shared-memory MPSoCs," ACM Transactions on Embedded Computing Systems (TECS), vol. 5, no. 2, pp. 383–407, 2006.
[5] G. Girão, B. C. de Oliveira, R. Soares, and I. S. Silva, "Cache coherency communication cost in a NoC-based MPSoC platform," in Proceedings of the 20th Annual Conference on Integrated Circuits and Systems Design. ACM, 2007, pp. 288–293.
[6] ARM, "ARM CoreLink CCI-400 Cache Coherent Interconnect Technical Reference Manual," https://developer.arm.com/docs/ddi0470/k/preface.
[7] D. Molka, D. Hackenberg, R. Schöne, and M. S. Müller, "Memory performance and cache coherency effects on an Intel Nehalem multiprocessor system," IEEE, 2009, pp. 261–270.
[8] N. P. Jouppi, "Improving direct-mapped cache performance by the addition of a small fully-associative cache and prefetch buffers," in ACM SIGARCH Computer Architecture News, vol. 18, no. 2SI. ACM, 1990, pp. 364–373.
[9] M. D. Lam, E. E. Rothberg, and M. E. Wolf, "The cache performance and optimizations of blocked algorithms," in ACM SIGARCH Computer Architecture News, vol. 19, no. 2. ACM, 1991, pp. 63–74.
[10] J. Benkual, T. Y. Ho, and J. F. Duluk Jr, "System, apparatus, method, and computer program for execution-order preserving uncached write combine operation," U.S. Patent 6,671,747, Dec. 30, 2003.
[11] T. L. Johnson, D. A. Connors, M. C. Merten, and W.-M. Hwu, "Run-time cache bypassing," IEEE Transactions on Computers, vol. 48, no. 12, pp. 1338–1354, 1999.
[12] Xilinx, "xfOpenCV," https://github.com/Xilinx/xfopencv, 2019.
[13] Xilinx, "CHaiDNN," https://github.com/Xilinx/CHaiDNN, 2019.
[14] A. Powell and D. Silage, "Statistical performance of the ARM Cortex A9 accelerator coherency port in the Xilinx Zynq SoC for real-time applications," IEEE, 2015, pp. 1–6.
[15] J. Silva, V. Sklyarov, and I. Skliarova, "Comparison of on-chip communications in Zynq-7000 all programmable systems-on-chip," IEEE Embedded Systems Letters, vol. 7, no. 1, pp. 31–34, 2015.
[16] M. Sadri, C. Weis, N. Wehn, and L. Benini, "Energy and performance exploration of accelerator coherency port using Xilinx ZYNQ," in Proceedings of the 10th FPGAworld Conference. ACM, 2013, p. 5.
[17] P. Vogel, A. Marongiu, and L. Benini, "An evaluation of memory sharing performance for heterogeneous embedded SoCs with many-core accelerators," in Proceedings of the 2015 International Workshop on Code Optimisation for Multi and Many Cores. ACM, 2015, p. 6.
[18] V. Sklyarov, I. Skliarova, J. Silva, and A. Sudnitson, "Analysis and comparison of attainable hardware acceleration in all programmable systems-on-chip," IEEE, 2015, pp. 345–352.
[19] R. F. Molanes, J. J. Rodríguez-Andina, and J. Fariña, "Performance characterization and design guidelines for efficient processor–FPGA communication in Cyclone V FPSoCs," IEEE Transactions on Industrial Electronics, vol. 65, no. 5, pp. 4368–4377, 2018.
[20] R. F. Molanes, F. Salgado, J. Fariña, and J. J. Rodríguez-Andina, "Characterization of FPGA-master ARM communication delays in Cyclone V devices," in