Scalable Light-Weight Integration of FPGA-Based Accelerators with Chip Multi-Processors
Zhe Lin, Student Member, IEEE, Sharad Sinha, Member, IEEE, Hao Liang, Student Member, IEEE, Liang Feng, Student Member, IEEE, and Wei Zhang, Member, IEEE
Abstract—Modern multicore systems are migrating from homogeneous to heterogeneous designs with accelerator-based computing in order to overcome the barriers of the performance and power walls. In this trend, FPGA-based accelerators are becoming increasingly attractive due to their excellent flexibility and low design cost. In this paper, we propose architectural support for efficient interfacing between FPGA-based multi-accelerators and chip multiprocessors (CMPs) connected through a network-on-chip (NoC). Distributed packet receivers and hierarchical packet senders are designed to maintain scalability and reduce the critical path delay under a heavy task load. A dedicated accelerator chaining mechanism is also proposed to facilitate intra-FPGA data reuse among accelerators and circumvent prohibitive communication overhead between the FPGA and processors. In order to evaluate the proposed architecture, a complete system emulation with programmability support is performed using FPGA prototyping. Experimental results demonstrate that the proposed architecture is high-performance, light-weight and scalable.
Index Terms—FPGA, hardware accelerator, heterogeneous system, network-on-chip, chip-multiprocessor.
• The authors Zhe Lin, Hao Liang, Liang Feng and Wei Zhang are with the Department of Electronic and Computer Engineering, Hong Kong University of Science and Technology (e-mail: [email protected]; [email protected]; [email protected]; [email protected]).
• The author Sharad Sinha is with the School of Computer Engineering at Nanyang Technological University, Singapore (e-mail: sharad [email protected]).
Manuscript received December 1, 2016; revised June 15, 2017; accepted September 4, 2017.

1 INTRODUCTION

Nowadays, the desire for low-power and high-performance design has led to the migration of modern computing systems from homogeneous multicore systems to heterogeneous multicore systems, where hardware accelerators (HWAs) are used to speed up computationally intensive applications [1]. Field-programmable gate arrays (FPGAs), which feature great flexibility and high computational capability, are promising candidates to serve as HWAs in heterogeneous systems. Recently, FPGAs have been increasingly used in industry to enhance the computation capability of chip multiprocessors (CMPs). For instance, Altera and Intel provide a research platform, HARP [2], which consists of an Altera Stratix-V FPGA and an Intel Xeon E5 processor. Likewise, Xilinx's Zynq platform [3] combines a dual-core ARM processor with traditional FPGA fabric to form a programmable system-on-chip (SoC).

With the increasing scale of computer systems, the network-on-chip (NoC) has been used as a high-bandwidth and scalable interconnect architecture for large-scale multicore systems [4], [5], [6]. It is also promising to integrate an FPGA as a heterogeneous core in an NoC-based multicore system as the next-generation heterogeneous system. Nevertheless, most prior research work has focused on the interfacing of off-chip FPGAs and processors [7], [8], [9], [10], [11] with a limited number of cores through bus-based communication. Moreover, the rapid increase in the resource capacity and variety of FPGAs over the past few years has made it feasible to implement multiple accelerators on a single FPGA. However, there is no interface design which supports (1) run-time flexibility, when a multitude of processors may request many accelerators, and (2) scalability, since multiple accelerators implemented on a single FPGA cannot otherwise be accessed independently by mutually exclusive processors. In light of the above considerations, in this paper we investigate a high-speed, light-weight and scalable interface architecture that loosely couples FPGA-based HWAs with NoC-based CMPs and allows flexible invocation of HWAs according to the runtime demands of the processors. Besides this, our proposed interfacing architecture differs from industrial solutions like HARP [2] and CAPI [12] in that it provides a single-die prototype for heterogeneous systems, where bus-based interfacing (e.g., PCIe in CAPI) is not requisite for all kinds of acceleration.

The main contributions of our work are threefold:

• We exploit a scalable and light-weight interface for the multiple accelerators in an FPGA which are loosely coupled with CMPs. The key design-specific parameters, including the number of task buffers, distributed packet receivers and hierarchical packet senders, are investigated to maintain scalability and maximize the performance of the interface architecture integrated in the FPGA.

• We propose a hardware accelerator chaining mechanism that allows HWAs to be serially combined to collaboratively operate as a monolithic but more complex accelerator during run time.
This chaining mechanism exploits intra-FPGA communication and thereby obviates the necessity for excessive data transmission between the FPGA and processors.

• A full system including the CMPs, the FPGA and the NoC is prototyped and emulated under various workload conditions on an FPGA. A software interface for processors to invoke HWAs is also designed to tackle the programmability challenges. The evaluation results demonstrate the high throughput of our proposed design, compared with both AXI-based and shared FPGA cache solutions.

The remainder of this paper is organized as follows. Section 2 reviews the existing work and discusses its limitations in integrating FPGAs in the context of multicore systems. Section 3 provides an overview of the whole system. Section 4 describes the proposed architecture in detail, while Section 5 presents the support for programmability. In Section 6, the full-system evaluation results are presented and analyzed. Finally, we conclude the paper and discuss future extensions of our work in Section 7.
2 RELATED WORK
Various communication scenarios between an FPGA and processor cores have been studied in recent years. The work in [7] proposed a system consisting of an ARM microprocessor and a maximum of four accelerators in an FPGA, with AMBA buses as communication channels. The work in [8] presented a system with PCI Express (PCIe) between processors and an off-chip FPGA, which also achieved reconfiguration when necessary. Similarly, the work in [9] and [10] realized data transmission between an FPGA and processors using PCIe and AXI interconnects. These interfacing architectures focused on establishing off-chip communication between the FPGA and processors based on existing bus architectures, which are hard to extend to large-scale on-chip multicore systems. In addition, high platform dependence makes these techniques mostly non-portable across different platforms. Most importantly, they do not investigate support for sharing various accelerators in an FPGA among multiple processors. In contrast, our proposed on-chip interfacing architecture is optimized for a general situation without platform dependence, in which a number of processors can invoke various FPGA-based accelerators. The authors of RIFFA [11] proposed a series of works where processors access HWAs. The idea of multiple HWAs accessed by different processors is similar to ours; however, they mainly emphasized providing support for different operating systems to gain access to HWAs, without going deep into hardware performance improvement. To the best of our knowledge, ours is the first work targeting the optimization of the architectural design for interfacing FPGA-based multi-accelerators with NoC-based multicore systems. Furthermore, our work is complementary to accelerator-rich architectures (i.e., multicore systems with multiple accelerators) where ASIC blocks or CGRAs are distributed individually in an NoC framework as processing elements [13], [14].
3 FULL-SYSTEM OVERVIEW
NoCs are promising on-chip communication architectures for achieving high bandwidth under a limited power budget. The processing elements in an NoC communicate with each other by sending and receiving packets through routers. In the experiments, the employed multiprocessor system-on-chip (MPSoC) architecture is similar to [15]. We adopt a 3-by-3 mesh topology, and Fig. 1 presents the system framework. The processors maintain their software routines and leverage the HWAs in the FPGA to accelerate some computationally intensive work. Note that the difference in size between the processors and the FPGA will impact the layout of the chip but will not influence the topology of the system. In principle, our idea supports any topology with the FPGA placed beside any node. The analysis of NoC routing algorithms and traffic patterns [16], [17] may suggest a specific placement for the FPGA, but that is complementary to our main goal and out of the scope of this work.

Fig. 1. Full-system framework.
Packet-based transmission is required for an NoC. A packet is composed of several flits: a head flit, multiple body flits and a tail flit, which are the smallest units of communication [18]. We design the flit width to be 137 bits. The head flits are always transmitted first in packets and primarily contain routing information together with specific information related to the invoked HWAs. Table 1 summarizes the bit information in a head flit. Following the head flits are the body or tail flits, in which bits 128 to 136 consist of routing and packet information, and all the remaining bits carry payload data. It is trivial to adjust the flit size for different system configurations by reducing or extending the payload bits. Additionally, the number of packets for each HWA invocation is variable, since different HWAs require different data sizes, which are distinguished by the task head and tail bits.
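For concreteness, the flit layout can be sketched as a C structure. The split of the nine control bits and the field names below are illustrative assumptions; the authoritative bit map is the one summarized in Table 1.

```c
#include <stdint.h>

/* Illustrative 137-bit flit layout (field names and the split of the
 * nine control bits are assumptions, not the exact Table 1 map).
 * In body/tail flits, bits 0-127 carry payload data and bits 128-136
 * carry routing and packet information; head flits instead use the
 * low bits for HWA ID, source ID, priority and chaining fields. */
typedef struct {
    uint64_t payload[2];     /* bits 0-127: payload data             */
    unsigned routing   : 5;  /* assumed: destination/route bits      */
    unsigned flit_type : 2;  /* head, body or tail                   */
    unsigned task_head : 1;  /* first packet of an HWA invocation    */
    unsigned task_tail : 1;  /* last packet of an HWA invocation     */
} flit_t;                    /* 128 payload + 9 control = 137 bits   */
```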
4 FPGA-BASED MULTI-ACCELERATOR ARCHITECTURE
The proposed FPGA-based multi-accelerator architecture is shown in Fig. 2 (a) and comprises an interface block and hardware accelerator channels (HWA channels) as the crucial elements. In order to bridge the frequency difference between the FPGA and the local NoC router, the router input and output buffers are implemented using asynchronous FIFOs. As there can be multiple HWAs on an FPGA, the scalability of the interface block design is crucial to prevent it from becoming the performance bottleneck.
Fig. 2. (a) Overview of the FPGA-based multi-accelerator architecture; and (b) The detailed design within an HWA channel.

TABLE 1
Description of the bit index in a head flit.

A. Interface block
The interface block is the bridge between the HWA channels, the NoC and the CMPs. Specifically, it manages the packet transmission and arbitration of both command packets and payload packets. The fundamental components of the interface block are a packet receiver and a packet sender, which control packet dispatch and assembly, respectively, between the HWAs and the router buffers.
A.1 Packet receiver (PR)
The PR reads flits from the router output buffer and dispatches the packets to the corresponding HWA channels. A PR is implemented as a finite state machine which is able to identify different flit types and decode head flit information. It also identifies the packet length in the case of variable-length packets.

Considering that many HWAs could be implemented and the interface could become the critical path, we explore different design strategies to optimize PR performance: a centralized PR strategy and different distributed PR strategies. In the centralized PR strategy, only a single PR is used to dispatch packets to all the HWA channels, while in the distributed PR strategies there are multiple PRs, each of which dispatches packets to a fixed number of HWA channels. Fig. 3 (a) shows the idea of the distributed PR strategies. We investigate various distributed PR strategies and find the PR strategy with the highest performance by varying the number of PRs. It is observed that the distributed packet receiver strategies can effectively reduce the routing overhead and notably improve the operating frequency of the PR, as demonstrated in Section 6.3.
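As a behavioural sketch of the dispatch step (the real PR is an FSM in FPGA logic, and the decode details below are assumptions), a distributed PR simply maps the HWA ID decoded from a head flit to one of the channels in its own group, which bounds each PR's fan-out:

```c
#include <stdio.h>

#define NUM_CHANNELS    32  /* HWA channels in the scalability study   */
#define CHANNELS_PER_PR 4   /* channels managed by each PR (PR4)       */

/* Dispatch: the PR owning the target channel and the channel index
 * within that PR both follow directly from the decoded HWA ID. */
static void pr_dispatch(int hwa_id) {
    int pr      = hwa_id / CHANNELS_PER_PR;  /* which PR handles it     */
    int channel = hwa_id % CHANNELS_PER_PR;  /* local channel in the PR */
    printf("PR%d -> local channel %d (HWA %d)\n", pr, channel, hwa_id);
}

int main(void) {
    pr_dispatch(0);   /* PR0, channel 0 */
    pr_dispatch(5);   /* PR1, channel 1 */
    pr_dispatch(31);  /* PR7, channel 3 */
    return 0;
}
```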
A.2 Packet sender (PS)
The PS arbitrates among the different HWA channels and sends the selected output packets to the router input buffer. There are two types of packets to be sent out by the PS: command packets for the HWA requests, and packets carrying computation results from the HWAs (denoted as result packets). A command packet has only a single flit and enjoys higher priority in being sent out than result packets. A command packet can be a grant packet, which requests input packets, or a notifying packet, which informs the processors of the completion of acceleration. A grant packet is sent to the requesting processor in the direct-access communication scenario, or to the memory management unit in the memory-access communication scenario, as illustrated in Section 5. A round-robin scheme is used to arbitrate command packets from different HWA channels. In contrast, a result packet comprises more than one flit. The PS selects the result packets in a priority-based round-robin manner, with the priority information embedded in the head flits.

By introducing the priority bits in the head flits, the requesting processors can set different priorities for different tasks to be accelerated. This attribute can be removed by setting the priority bits in the head flits to zeros, in which case plain round-robin arbitration is deployed.

Noticing that the complexity of both the arbitration and the multiplexing increases with the number of HWAs, we investigate two types of PS implementation: the global PS strategy and the hierarchical PS strategies. The global PS strategy takes all the command packets and result packets as input, performs arbitration and sends out the packets. In contrast, the hierarchical PS strategies, with the idea shown in Fig. 3 (b), cluster a certain number of HWA channels together in the first-level hierarchy and, accordingly, arbitration is done within this specific group. A second-level hierarchical controller finally arbitrates among the first-level hierarchical controllers and then signals the selected first-level hierarchical controller for packet transmission, after which the packet transmission starts. Experiments are conducted to determine the optimal number of hierarchical PSs to maximize the operating frequency of the FPGA, as reported in Section 6.3. The results validate that the optimal hierarchical strategy can significantly reduce the PS delay and, as a result, demonstrate a more than 2x improvement compared with the global PS method.

Fig. 3. (a) Simplified model of the distributed PR strategy; and (b) Simplified model of the hierarchical PS strategy.
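The priority-based round-robin selection performed by a first-level PS can be sketched as follows (a behavioural model under assumed interfaces; in hardware this is combinational logic plus a grant register). Among the requesting channels, the highest priority wins, and ties are broken in round-robin order starting after the last grant:

```c
#include <stdint.h>

#define GROUP_SIZE 4  /* HWA channels per first-level PS (PS4) */

/* Returns the index of the selected channel, or -1 if none requests. */
static int ps_arbitrate(const uint8_t req[GROUP_SIZE],
                        const uint8_t prio[GROUP_SIZE], int last_grant) {
    int best = -1, best_prio = -1;
    for (int i = 1; i <= GROUP_SIZE; i++) {
        int c = (last_grant + i) % GROUP_SIZE; /* round-robin scan order  */
        if (req[c] && (int)prio[c] > best_prio) {
            best = c;          /* strictly higher priority wins; ties
                                  keep the earlier round-robin candidate */
            best_prio = prio[c];
        }
    }
    return best;
}
```

Setting all priority bits to zero degenerates this into the plain round-robin used for command packets, and the second-level PS applies the same selection among the first-level winners.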
B. HWA channels

B.1 HWA invocation

Fig. 2 (b) shows the major components necessary to guarantee a robust accelerator invocation. Task buffers (TBs) act as temporary storage for packets with input data for the HWAs. Multiple task buffers are desirable to hide the communication delay. The experiment reported in Section 6.2 investigates the optimal number of TBs for different HWA communication patterns and reveals that usually two TBs are enough to hide the communication delay. A task arbiter (TA) identifies the ready tasks in the task buffers and selects a task to be executed based on round-robin arbitration. An HWA controller (HWAC) is responsible for reading packets from either the task buffers or the chaining buffers, and then setting the essential control signals to invoke the HWA when the HWA is idle. When the HWA execution is finished, the HWAC signals the packet generator (PG) to read the execution results. The PG also detects the chaining condition using the header information and controls either the packet output buffer (POB) or the chaining buffer to receive the results. If the results are to be sent out, packets are formed simultaneously. The POB serves as temporary storage for result packets before they are granted the chance to be sent back under the supervision of the packet sender.

Note that each HWA's frequency can be different, and in order to enable each HWA to run at its own frequency, the task buffers, packet output buffers and chaining buffers are designed to interface between different frequencies. Besides this, the HWAC and PG work at the same frequency as the HWA to feed the input and generate the output packets, with synchronization provided by the asynchronous FIFOs. The control signals crossing different frequencies are synchronized by two-stage synchronizers implemented with registers.
B.2 Request and grant mechanism
Considering the case that a myriad of applications are invoking multiple hardware accelerators in the FPGA, a request and grant mechanism is developed to resolve the contention and ensure the robustness of HWA invocation. For each invocation of an HWA, a request packet is first generated and sent to the FPGA by the processor. The request packet is composed of a single flit with Packet type "command", Source ID, HWA ID, Packet direction, Start address and Data size.

As there could be multiple processors requesting the same HWA, a received request packet is first queued in the request buffer (RB). A local grant controller (LGC) keeps track of the status of the request buffers and task buffers with the support of a status table that is updated every cycle. Based on the task buffers' availability, the LGC generates grant packets in a first-come-first-serve manner, writes the granted task buffer identification into the grant packets and signals the PS for packet transmission. To further reduce the latency of writing and reading requests, a request can bypass the request buffer when no other requests exist in the request buffer. Also note that a grant packet is not permitted to be transmitted until a valid task buffer is available.
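One LGC decision step can be sketched as follows (a simplification under assumed data structures; the real LGC consults a per-cycle status table covering the RB and the TBs). A grant is only produced when a task buffer is free, and the granted TB identifier is what gets embedded in the grant packet:

```c
#include <stdbool.h>

#define NUM_TBS 2  /* two task buffers per HWA channel (Section 6.2) */

static bool tb_busy[NUM_TBS];  /* mirrored from the status table */

/* request_valid is true when the head of the request buffer holds a
 * request, or when an incoming request bypasses the empty RB.
 * Returns the granted TB index, or -1 to defer the grant (first-come-
 * first-serve ordering is preserved) until a TB drains. */
static int lgc_try_grant(bool request_valid) {
    if (!request_valid) return -1;
    for (int tb = 0; tb < NUM_TBS; tb++) {
        if (!tb_busy[tb]) {
            tb_busy[tb] = true;  /* reserve the buffer for this task  */
            return tb;           /* TB id carried in the grant packet */
        }
    }
    return -1;
}
```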
B.3 HWA chaining mechanism
HWA chaining is developed for the case where a task attempts to invoke a series of HWAs sequentially. Notice that the chaining HWAs are pre-specified by the task, and hence the required HWAs can be identified and formed into a chaining group. For instance, in JPEG decompression [19], inverse zigzag, inverse quantization, inverse DCT, and shift and bound are invoked in sequence. Therefore, when implemented as HWAs, they can be incorporated in the same chaining group to enable local data reuse among each other, eliminating excessive data transmission through the NoC to the memory or processors. Hence, the HWA chaining mechanism allows a set of HWAs to be invoked collectively in addition to being used individually, making the design more flexible and general.
Chaining depth and Chaining index are dedicated bits that describe the chaining times and sequences. Moreover, the chaining buffer (CB), chaining controller (CC), HWA controller and packet generator are designed to support chaining. When receiving the results from the HWA, the packet generator first checks Chaining depth in the header information. If it is non-zero, chaining is required, and both the header information and the execution results are written to the chaining buffer, with the Chaining depth in the header information decreased by one. The header information in the chaining buffers is transparent to all the chaining controllers in the same group.

The chaining controller is combinational logic that indicates existing matchings for chaining. It deduces the next chaining HWA ID from the Chaining index and Chaining depth and then compares the derived HWA ID to its channel HWA ID. It signals the HWA controller for data fetching when HWA ID matchings exist and selects the next chaining buffer to read by a round-robin scheme. The HWA controller then fetches the data from the corresponding chaining buffer. The HWA controller prioritizes chaining requests over input requests so as to obviate stalling of an ongoing chained operation and avert overflow of the chaining buffers.

Note that other ways to facilitate data reuse on the FPGA usually go through a local cache. Some off-the-shelf commercial designs make use of cache memory to integrate an FPGA and a processor/chip multiprocessor. Examples of such designs are Intel's Heterogeneous Architecture Research Platform (HARP) [2] and IBM's Coherent Accelerator Processor Interface (CAPI) [12]. HARP makes use of a dedicated cache memory implemented on the FPGA. This cache memory is used for shared data communication among the accelerators on the FPGA, and between the FPGA and the applications running on the chip multiprocessor. The CAPI solution is meant for POWER8-processor-based systems and allows such a system to treat an attached FPGA co-processor as a coherent peer: the FPGA accelerator and the POWER8 system share the same memory space. The POWER8 processor reserves dedicated silicon to implement CAPI. Both HARP and CAPI are implemented using separate boards or sockets for processors and FPGAs. These designs serve well when only one accelerator is implemented on the FPGA. However, for multiple accelerators on the FPGA, there will be heavy memory contention [20]. We also note that Intel's HARP and IBM's CAPI are bus-based designs for different chips, whereas our design targets an FPGA-CMP system-on-chip.

In our architecture, we leverage the advantages of block RAMs (BRAMs) in the FPGA to build multiple distributed buffers (i.e., TBs, POBs, etc.) which are used in different stages of HWA invocations. There is no global cache in our design, and these distributed buffers abate the potential penalties due to cache misses by buffering data from different stages instead of accessing the cache over and over again. When a new set of inputs is demanded, these data are pre-stored by the PR in the TBs, resulting in a reduced input read-in latency for an HWA compared to the cache access latency. Furthermore, the use of chaining buffers facilitates the communication between grouped HWAs with minimal delay, while communication through the cache tends to take a longer time and cause contention.
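As a concrete reading of the chaining-controller matching described above, the sketch below uses assumed field names: cseq[] for the pre-specified chaining sequence, clen for the total chaining length and depth for the remaining chained invocations (i.e., the Chaining index and Chaining depth bits in the real header, the latter decremented by the PG whenever results enter a chaining buffer):

```c
/* Header information attached to a chaining-buffer entry (field names
 * are assumptions for illustration). */
typedef struct {
    int cseq[8];  /* chaining sequence of HWA IDs (assumed max 8) */
    int clen;     /* total chaining length                        */
    int depth;    /* chained invocations still pending            */
} chain_hdr_t;

/* A CC signals its HWA controller to fetch the entry when the next
 * HWA in the sequence matches the ID of its own channel. */
static int cc_matches(const chain_hdr_t *hdr, int my_hwa_id) {
    int next = hdr->clen - hdr->depth;  /* position reached in cseq */
    return hdr->depth > 0 && hdr->cseq[next] == my_hwa_id;
}
```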
Generally speaking, our design can support efficient data reuse among chaining HWAs as well as fast input access for separate HWA invocations at the same time. Compared to the AXI-based design shown in Fig. 11 and the shared FPGA cache design illustrated in Fig. 12, our distributed buffer design reaps these benefits because the input size for each HWA is pre-defined and there is usually little data sharing between HWAs in different chaining groups. Results reported in Section 6.6 also demonstrate the overheads in resource and runtime incurred by the chaining mechanism, which are trivial compared to the obtained improvement in performance.

Table 2 summarizes the latency of the different components in the interface architecture, where N represents the number of flits in the payload packets for a single HWA invocation. The latency incorporates the time for transferring a whole packet with N flits. The buffers (i.e., TB, POB, RB, LGB and CB) are instantiated as FIFOs and therefore have the same latency for the first payload to be immediately transferred from the input to the output. The TA and CC are combinational logic with a delay of a single cycle.

TABLE 2
Latency in clock cycles for different components in the interface architecture.

Component      Latency (in cycles)
Per HWA:
  HWAC         4+N
  PG           4+N
  LGC          1
  TA           1
  CC           1
  Buffers      4+N
Overall:
  PR           Command: 1; Payload: 2+N
  PS           Command: 1; Payload: 4+N
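As a rough, illustrative composition of the Table 2 entries (assuming the stages are traversed serially, with input and result packets both of N flits, and ignoring the overlap that the buffering provides in practice), the interface portion of one HWA invocation costs about

\[
T_{\mathrm{interface}} \approx \underbrace{(2+N)}_{\mathrm{PR}} + \underbrace{(4+N)}_{\mathrm{TB}} + \underbrace{(4+N)}_{\mathrm{HWAC}} + \underbrace{(4+N)}_{\mathrm{PG}} + \underbrace{(4+N)}_{\mathrm{PS}} \ \text{cycles},
\]

e.g., roughly 108 cycles for the 18-flit JPEG payload of Section 6.5, on top of the HWA execution time and the NoC traversal itself.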
5 PROGRAMMABILITY SUPPORT FOR HWA INVOCATION
A software interface needs to be developed for processors to invoke HWAs. Specialized C-based functions for HWA invocation are defined and plugged into the user code to specify information such as the HWA ID and the caller thread ID, as shown in Fig. 4.

More importantly, two communication scenarios between the processors and the FPGA are considered. A processor can either directly send the input data to the HWAs, or send requests carrying the physical addresses of the input data to the HWAs after virtual-to-physical translation, as shown in Fig. 5. In Fig. 5 (a), the processor directly sends payload packets with input data to the FPGA, while in Fig. 5 (b), the HWA fetches the payload packets from memory through the memory management unit (MMU) by sending grant packets to the MMU with the specified Start address and Data size information. When receiving the grant packets from the HWAs, the MMU decodes the contained information and initiates data transmission via direct memory access (DMA). In addition, the MMU writes the received result packets to the memory. Notice that the PS is supposed to notify the invoking processor using a packet with the memory address in the header information. The processor can then fetch the data via the MMU either from the memory or from the write buffers.

In our design, input packets are received by the HWAs, and result packets are output and sent over the NoC for the processors to process. Data coherency between the HWAs and the processors is maintained by the processors which invoke the HWAs. Specifically, a processor is responsible for updating the memory and the data coherency state, which are shared among different processors, when acceleration results are obtained from the FPGA's HWAs. This is complementary to our proposed design.

int D_HWA_invoke(int HWA_id, int thread_id, int size, int* data_array, int clen, int* cseq);
Sends data from the processor to the FPGA to invoke the HWA with HWA_id. The length of the data is specified by size, and the data are stored in data_array in 32-bit format. The chaining length is defined by clen, and the chaining sequence by cseq. The return value indicates whether the acceleration has finished successfully.

int M_HWA_invoke(int HWA_id, int thread_id, uint start_address, int clen, int* cseq);
Invokes the HWA through memory access by sending a request to the FPGA with the information of HWA_id, thread_id, start_address, clen and cseq. The grant controller for the HWA decodes this request and reads the input data from memory starting at start_address.

Fig. 4. Functions for processors to invoke HWAs.
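As a usage illustration of the direct-access function, the snippet below invokes the JPEG decompression chain of Section 4.B.3 with Chaining depth 3 (cf. Section 6.6). The HWA IDs, the 64-word payload size and the clen convention (counting the chained invocations after the first HWA) are hypothetical placeholders:

```c
/* Provided by the platform's software interface (Fig. 4). */
int D_HWA_invoke(int HWA_id, int thread_id, int size,
                 int* data_array, int clen, int* cseq);

/* Hypothetical HWA IDs; the real IDs are fixed at system integration. */
enum { IZIGZAG_ID = 0, IQUANT_ID = 1, IDCT_ID = 2, SHIFTBOUND_ID = 3 };

/* Decode one coefficient block through the four chained HWAs. */
int decode_block(int thread_id, int block[64]) {
    int cseq[] = { IZIGZAG_ID, IQUANT_ID, IDCT_ID, SHIFTBOUND_ID };
    return D_HWA_invoke(IZIGZAG_ID, thread_id,
                        64, block,  /* payload: 64 words, 32-bit each  */
                        3, cseq);   /* chain the remaining three HWAs  */
}
```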
Fig. 5. Communication scenarios: (a) direct access; and (b) memory access.
6 EXPERIMENTAL RESULTS
The complete system is prototyped and emulated on a Xilinx Virtex-7 FPGA (xc7vx690tffg1930-3). The NoC is CONNECT [21], with peek flow control, XY routing and virtual output queues. The employed processor is MicroBlaze [22], which is commonly used for MPSoC prototyping on an FPGA [23], [24], [25], [26]. We use the Xilinx SDK to compile C code to execute on MicroBlaze. Correspondingly, we implement the C-based software interface for accelerator invocation and data communication, as shown in Fig. 4. Fast simplex links (FSLs) [27] are leveraged for communication between the processors and the routers. We derive the HWAs reported in Table 3 using Xilinx Vivado HLS, by performing high-level synthesis of C-based benchmarks from CHStone [28] and the SNU Real-Time Benchmarks [29], both of which encompass computationally intensive applications suitable for FPGA implementation. The average lookup table (LUT) utilization is 20424. Three applications use BRAMs and five applications utilize DSPs, showing a variety of resource utilization. We assume that both the CMPs and the NoC operate at 1 GHz, a commonly used assumption [30].

Note that the 32-bit MicroBlaze processor implements a classic RISC Harvard architecture exploiting instruction-level parallelism (ILP) with a 5-stage pipeline [31]. The pipeline stages (i.e., instruction fetch, instruction decode, execution, memory access and write back) conform to the conventional MIPS pipeline structure. Therefore, MicroBlaze can be used to extract execution cycles for instructions of classic RISC processors, which are not supposed to change under different operating frequencies. Our proposed multi-accelerator architecture runs at 300 MHz, which approaches the maximum frequency reported by Xilinx Vivado [32]. The HWAs also operate at their maximum frequencies reported by Vivado. Since the whole system is prototyped on an FPGA and the FPGA cannot operate at 1 GHz, we scale the frequencies of both the microprocessor and the FPGA according to the ratio expected in a real system to ensure fidelity in emulation, without impacting the key parameters of the system-wide evaluation, such as the HWA execution cycles and the communication latency.
TABLE 3
Benchmark complexity in resources for FPGA implementation.

Benchmark    LUT     BRAM   DSP   FF
AES Enc      12259   116    0     7286
AES Dec      15218   116    0     7350
Dfadd        4983    0      0     3768
Dfdiv        9661    0      24    13171
Dfmul        1927    0      16    2089
Gsm          4257    0      12    2643
Prime        161237  0      0     277026
Sha          13147   1      0     9931
Izigzag      100     0      0     98
Iquantize    608     0      76    1413
Idct         14552   0      368   12390
Shiftbound   7133    0      0     7928
Task buffers (TBs) serve as temporary storage between the packet receiver and the HWA controller. As a result, increasing the number of task buffers is expected to hide the communication overhead while the HWAs are in operation. In this experiment, we evaluate the optimal number of task buffers to minimize the overall communication latency. Specifically, we evaluate two types of HWAs: (1) an HWA processing a small amount of data with a long execution time (e.g., Dfdiv); and (2) an HWA with an extremely short execution time but working on a relatively large data set (e.g., Izigzag). These two benchmarks demonstrate two extreme communication patterns, and the other HWAs have communication patterns between these two situations.

We evaluate the total execution time when multiple requests for the same HWA are generated from different processors simultaneously, recording the total execution time when different numbers of task buffers are utilized to process all the requests. According to the results shown in Fig. 6, there is no improvement in execution time for Dfdiv as the number of task buffers increases. This is because the time for packet transmission is shorter than the HWA execution time. In such a situation, new payload packets can be transmitted into the task buffer via the NoC prior to the completion of the last HWA execution, and therefore one task buffer is enough. On the contrary, using two task buffers demonstrates a 28.4% improvement in execution time for Izigzag, and no further improvement is observed when increasing the number of task buffers. In this case, two task buffers are enough to work collaboratively to overlap the packet transmission time with the HWA execution time. These two example HWAs reveal two extremes of communication patterns; hence, two task buffers are sufficient to guarantee high-speed acceleration for various applications. In the following experiments, we incorporate two task buffers for each HWA.

Fig. 6. The execution time using different numbers of task buffers.
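These two observations admit a simple double-buffering reading (an illustrative formalization, with \(T_{\mathrm{fill}}\) the time to stream one task's payload into a TB over the NoC and \(T_{\mathrm{exec}}\) the HWA execution time):

\[
T_{\mathrm{fill}} \le T_{\mathrm{exec}} \ \Rightarrow\ \text{one TB suffices (the Dfdiv case)}, \qquad
T_{\mathrm{fill}} > T_{\mathrm{exec}} \ \Rightarrow\ \text{a second TB overlaps filling with execution (the Izigzag case)}.
\]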
We evaluate the maximum frequencies reported by Vivado 2015.2 after placement and routing with respect to different PR and PS strategies, as shown in Fig. 7. The average maximum frequencies of the different PR strategies for each specific PS strategy are shown above the bars. We set the number of HWA channels to thirty-two, a number large enough for the scalability investigation. The digits following PR or PS define the number of HWA channels a PR or a first-level PS manages. For example, PS4 indicates that a first-level PS takes control of four HWA channels and, correspondingly, the second-level PS arbitrates among eight first-level PSs.

Fig. 7. Maximum frequency: different PR and PS strategies (the per-PS-strategy averages read 142.3, 312.2, 307.8 and 303.4 MHz).

From the PS aspect, the maximum frequencies of all the hierarchical PS strategies are more than twice as high as that of the global PS strategy. Hierarchical strategies remarkably lower the routing effort because they considerably diminish the fan-in for both the first-level and second-level designs. Hence, the routing congestion resulting from the global strategy is alleviated by distributing the heavy centralized routing across multiple paths. Moreover, registers are employed in the hierarchical strategies to split the long wiring in the critical path into two shorter segments. In all, PS4 renders the highest frequency, as indicated on top of the bars in Fig. 7, revealing that PS4 best reduces routing congestion and balances delay. Moreover, scalability is preserved in the case of multiple HWAs using PS4.

From the PR aspect, the PR4 strategy surpasses the other strategies in frequency, since it trims the fan-out of every PR to a desirable value, which similarly lightens the routing burden. PR8 and PR16 provide similar results, while PR32 exhibits the worst performance, since it leads to the heaviest routing burden.
The LUT and BRAM resource breakdown, including the PRs, PSs and the components in the HWA channels with dummy HWAs, is evaluated using the PR4-PS4 strategy with the highest performance, as shown in Table 4. No DSP resources are utilized by any design strategy. Note that the TBs and POBs are implemented in BRAMs, while the other buffers are implemented as distributed memories using LUTs. Furthermore, across the different PR and PS strategies we investigated, the LUT utilization ranges between 10.48% and 10.78%, with an average resource consumption of 10.63% overall and 0.33% per HWA channel. This value is further verified by implementing a design with eight HWA channels, which utilizes 2.6% of the resources in all, and again 0.33% per HWA channel. Therefore, the results validate the light-weight characteristic of our design.
To validate the system and evaluate the throughput, we assume eight HWAs on an FPGA, and each processor randomly sends requests to specific HWAs under a wide range of request frequencies. The injection rate is used to represent the number of incoming flits per unit time from the router to our design when the full system is in a stable state. The throughput is calculated as PS output flits per unit time. In the first case (denoted as Izigzag-HWA), to evaluate the maximum achievable throughput, the eight invoked HWAs are all implemented as Izigzag, which has a negligible execution time (i.e., one cycle). Fig. 8 (a) shows both the injection rate and throughput for Izigzag-HWA. The maximum injection rate is 27.95 flits/us, and the FPGA is busy for 93% of all the execution time, approaching but not reaching 100 percent owing to the communication overhead incurred by the request and grant mechanism. The throughput becomes saturated at 0.2 requests/us, and the maximum throughput reaches 24.81 flits/us, which is 5.7% smaller than the injection rate, due to the latency incurred by packet fetching, packet generation and stalling for PS arbitration. When the request frequency further increases, the throughput decreases slightly, because the intensive and substantial data communication eventually accounts for network congestion, which in turn diminishes the data transmission rate.

In the second case (denoted as Eight-HWA), we use the first eight benchmarks in Table 3, with diversified HWA execution times, to test a common and realistic scenario. A similar trend to Izigzag-HWA can be seen in Fig. 8 (b). However, the throughput saturates at a higher request frequency, and the throughput is notably lower than the injection rate as a consequence of the non-trivial and diverse HWA execution times.

In the third case (denoted as Dfdiv-HWA), Dfdiv is adopted for all eight HWAs to evaluate the throughput under the other extreme, where the HWA execution time is the dominant factor. As shown in Fig. 8 (c), even though the injection rate increases linearly with the rise of request packets, the throughput is chiefly constrained by the lengthy HWA execution time and thereby remains constant.

Fig. 8. (a) Izigzag-HWA injection rate and throughput; (b) Eight-HWA injection rate and throughput; and (c) Dfdiv-HWA injection rate and throughput.

TABLE 4
Resource breakdown for the interface architecture in the prototype.

Component    LUT number   LUT %   BRAM number   BRAM %
Per HWA:
  TB         100          0.02    4             0.27
  TA         2            0       0             0
  HWAC+PG    290          0.07    0             0
  POB        231          0.05    2             0.14
  RB         243          0.06    0             0
  LGC        139          0.03    0             0
  LGB        247          0.06    0             0
Overall:
  PR         870          0.2     0             0
  PS         5039         1.16    0             0
Fig. 9. Latency breakdowns of different partitions regarding a single invocation.
In this experiment, we evaluate the latency breakdown of a single invocation into processor execution, FPGA acceleration and data transmission. We conduct task partitioning for the two computationally intensive benchmarks with multiple functions in Table 3, GSM and the JPEG decoder. The payload packet sizes are 3 flits for GSM and 18 flits for the JPEG decoder. The latency breakdowns of the different partitions are shown in Fig. 9. The FPGA executes all the functions in the cases of GSM.p3 and JPEG.p5, which render the smallest overall latency among their corresponding partitions. As these two applications incorporate many intensive computations suitable for FPGA acceleration, the improvement in execution time using FPGA acceleration is prominent in all of the different partitions, even with the communication overhead considered. Therefore, the results demonstrate the high efficiency of the FPGA as a platform for accelerating computationally intensive applications in multicore systems.
To investigate the efficiency of the HWA chaining mechanism, an experiment is conducted with the Izigzag, Iquantize, Idct and Shiftbound benchmarks from Table 3 for JPEG decompression [19]. These four functions are executed serially to decode compressed images in JPEG format. The chaining schemes are: Chaining depth 0 (no chaining), Chaining depth 1 (Izigzag+Iquantize), Chaining depth 2 (Izigzag+Iquantize+Idct) and Chaining depth 3 (Izigzag+Iquantize+Idct+Shiftbound).
The speedup of each chaining scheme, with Chaining depth 0 as the baseline, is shown in Fig. 10. Noticing that the most time-consuming part is the packet sending and receiving operations of the processors, our chaining mechanism effectively diminishes the communication overhead, and it indicates a growing trend of performance improvement as the chaining depth increases.

The communication latency of the chaining mechanism is N cycles, where N is the number of result flits to be stored in the chaining buffer. This intra-FPGA communication overhead at runtime is trivial compared with the communication overhead between the FPGA and the processors. Moreover, the LUT overhead per HWA channel for incorporating the chaining mechanism is 526 (0.12%) and the BRAM overhead is 2 (0.14%), implying high area efficiency. As a result, the proposed hardware chaining mechanism demonstrates prominent speedup in execution time compared with the non-chaining approach, with negligible overheads in runtime and resources, especially when heavy data communication is involved as the chaining depth increases.
Fig. 10. Speedup: different chaining depths vs. Chaining depth 0.
Fig. 11. System framework based on bus-based integration.
As illustrated in Section 2, the bus-based integration of an FPGA and processors has been studied in prior work. From the industry side, bus-based integration of an FPGA and processors is also extensively deployed at present. A representative instance is the ARM CoreLink NIC network interconnect [33], which utilizes the AMBA AXI4 protocol. In addition, we note that AXI4 can be well integrated with our proposed interfacing architecture, and in order to perform a fair comparison between bus-based communication and the NoC, a prototype is implemented with AMBA AXI4 as a replacement for the NoC in this experiment, as shown in Fig. 11.

The AXI4 frequency is set to be identical to that of the processors so as to obtain the upper limit of throughput. It is set to 1 GHz and scaled to 100 MHz for emulation on the FPGA [34]. The behaviours of the injection rate and throughput are similar to the NoC-based integration. However, as shown in Fig. 13, in comparison with the NoC, the maximum throughput for Izigzag-HWA exhibits a reduction of 27%, while for Eight-HWA a 53% decrease in throughput is observed. For Dfdiv-HWA, the throughput restricted by HWA execution remains an identical constant. Fig. 14 further reveals the communication latency of the AXI-based design and shows a 2.42x improvement for the proposed NoC design compared with the AXI-based design. In other words, the proposed framework with NoC support shows the predicted advancement in throughput compared with bus-based integration, owing to the NoC's good scalability, especially when communication overhead becomes the major concern.

In order to quantitatively characterize the benefits of our design over a shared cache design, we prototype a system using a system cache [35] for the FPGA. This cache memory is used to store the input and output packets received and sent over the NoC interface, as shown in Fig. 12. The prototyped system is identical to our proposed system but without TBs, POBs and CBs. The HWAs implemented on the FPGA have direct access to the cache. Experimental results shown in Fig. 13 indicate a 22.5% throughput reduction for Izigzag-HWA and a 28.2% reduction for Eight-HWA, compared with our proposed architecture. Fig. 14 also demonstrates an improvement in communication latency by a factor of 1.63x for our proposed design in comparison to the shared FPGA cache design. Besides this, the system cache consumes 1% of the LUT resources and 5%-9.5% of the BRAMs in the FPGA, depending on the cache size, which ranges from 32 KByte to 512 KByte, with the set associativity fixed at two by default. When there are more chances for HWA chaining, the cache is beneficial. Nevertheless, intensive access to the cache by all the operating HWAs causes a surge of congestion and, in turn, boosts the average access time, which counteracts its merits, thus showing a reduction in throughput compared with our architecture, which makes full use of distributed buffers.
Fig. 12. System framework with FPGA in-built cache.

Fig. 13. Maximum throughput of the three different prototypes.

Fig. 14. Communication latency for a single invocation of the three different prototypes.
7 CONCLUSION AND FUTURE WORK
This paper proposes and implements a platform-independent architectural design for FPGA-based multi-accelerators to efficiently interface with chip multiprocessors through an NoC. Our target is to optimize the performance of the interface when a large number of HWAs are mapped onto an FPGA. Specifically, we explore the variations of the key design-specific parameters, including: (1) the number of TBs to reduce communication latency; (2) the distributed PR strategies and hierarchical PS strategies to maximize the operating frequency as well as maintain good scalability; and (3) the speedup and tradeoffs derived from our proposed chaining mechanism. Results show that the optimal set of these parameters can guarantee a more than 2x improvement in performance. In order to emulate the system-level functionality and evaluate the performance of the proposed interface architecture, we prototype a full system on an FPGA. This prototype encompasses the NoC, the FPGA with the integrated interface architecture and multiple HWAs, together with soft processor cores with HWA invocation functions to tackle the programmability issues. We compare our design with commonly used bus-based and FPGA shared-cache prototypes and find that our proposed interface architecture demonstrates prominent superiority in performance, area efficiency and scalability. In our future work, we plan to evaluate the effect of different NoC routing protocols on the performance of the interface.

ACKNOWLEDGMENTS
The authors acknowledge the support of the HKUST start-up fund R9336.

REFERENCES

[1] E. S. Chung, P. A. Milder, J. C. Hoe, and K. Mai, "Single-Chip Heterogeneous Computing: Does the Future Include Custom Logic, FPGAs, and GPGPUs?" in Proceedings of the 43rd Annual IEEE/ACM International Symposium on Microarchitecture. IEEE Computer Society, 2010, pp. 225-236.
[2] "IvyTown Xeon + FPGA: The HARP Program," https://cpufpga.files.wordpress.com/2016/04/harp_isca_2016_final.pdf.
[3] L. H. Crockett, R. A. Elliot, M. A. Enderwitz, and R. W. Stewart, The Zynq Book: Embedded Processing with the ARM Cortex-A9 on the Xilinx Zynq-7000 All Programmable SoC. Strathclyde Academic Media, 2014.
[4] Z. L. Qian, D. C. Juan, P. Bogdan, C. Y. Tsui, D. Marculescu, and R. Marculescu, "A Support Vector Regression (SVR)-Based Latency Model for Network-on-Chip (NoC) Architectures," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 35, no. 3, pp. 471-484, 2016.
[5] F. Lan, Y. Pan, and K. T. T. Cheng, "An Efficient Network-on-Chip Yield Estimation Approach Based on Gibbs Sampling," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 35, no. 3, pp. 447-457, 2016.
[6] S. V. Winkle, D. Ditomaso, M. Kennedy, and A. Kodi, "Energy-efficient optical Network-on-Chip architecture for heterogeneous multicores," in IEEE Optical Interconnects Conference (OI), 2016, pp. 62-63.
[7] M. Hubner, P. Figuli, R. Girardey, D. Soudris, K. Siozios, and J. Becker, "A Heterogeneous Multicore System on Chip with Run-Time Reconfigurable Virtual FPGA Architecture," 2011, pp. 143-149.
[8] K. Papadimitriou, C. Vatsolakis, and D. Pnevmatikatos, "Invited paper: Acceleration of computationally-intensive kernels in the reconfigurable era," 2012, pp. 1-5.
[9] O. Sander, S. Baehr, E. Luebbers, T. Sandmann, V. V. Duy, and J. Becker, "A flexible interface architecture for reconfigurable coprocessors in embedded multicore systems using PCIe single-root I/O virtualization," in International Conference on Field-Programmable Technology (FPT), 2014, pp. 223-226.
[10] M. Weinhardt, A. Krieger, and T. Kinder, "A framework for PC applications with portable and scalable FPGA accelerators," in International Conference on Reconfigurable Computing and FPGAs (ReConFig), 2013, pp. 1-6.
[11] M. Jacobsen, D. Richmond, M. Hogains, and R. Kastner, "RIFFA 2.1: A Reusable Integration Framework for FPGA Accelerators," ACM Trans. Reconfigurable Technol. Syst., vol. 8, no. 4, pp. 22:1-22:23, 2015.
[12] J. Stuecheli, B. Blaner, C. R. Johns, and M. S. Siegel, "CAPI: A Coherent Accelerator Processor Interface," IBM Journal of Research and Development, vol. 59, no. 1, pp. 7:1-7:7, 2015.
[13] Y. T. Chen, J. Cong, M. A. Ghodrat, M. Huang, C. Liu, B. Xiao, and Y. Zou, "Accelerator-rich CMPs: From concept to real hardware," in IEEE 31st International Conference on Computer Design (ICCD), 2013, pp. 169-176.
[14] W. Hussain, R. Airoldi, H. Hoffmann, T. Ahonen, and J. Nurmi, "Design of an accelerator-rich architecture by integrating multiple heterogeneous coarse grain reconfigurable arrays over a network-on-chip," in IEEE 25th International Conference on Application-Specific Systems, Architectures and Processors, 2014, pp. 131-138.
[15] G. Girão, D. Barcelos, and F. R. Wagner, "Performance and energy evaluation of memory hierarchies in NoC-based MPSoCs under latency," 2009, pp. 127-132.
[16] W. Fu, M. Yuan, T. Chen, Q. Shi, L. Liu, and M. Wu, "Benefit of Unbalanced Traffic Distribution for Improving Local Optimization Efficiency in Network-on-Chip," in IEEE Intl Conf on High Performance Computing and Communications, 2014 IEEE 6th Intl Symp on Cyberspace Safety and Security, 2014 IEEE 11th Intl Conf on Embedded Software and Syst (HPCC, CSS, ICESS), 2014, pp. 76-83.
[17] Z. Wang, W. Liu, J. Xu, X. Wu, Z. Wang, B. Li, R. Iyer, and R. Illikkal, "A systematic network-on-chip traffic modeling and generation methodology," in IEEE Asia Pacific Conference on Circuits and Systems (APCCAS), 2014, pp. 675-678.
[18] W. J. Dally and B. Towles, "Route packets, not wires: on-chip interconnection networks," in Proceedings of the 38th Design Automation Conference, 2001, pp. 684-689.
[19] M. Mody, V. Paladiya, and K. Ahuja, "Efficient progressive JPEG decoder using JPEG baseline hardware," in IEEE Second International Conference on Image Information Processing (ICIIP), 2013, pp. 369-372.
[20] J. Choi, K. Nam, A. Canis, J. Anderson, S. Brown, and T. Czajkowski, "Impact of Cache Architecture and Interface on Performance and Area of FPGA-Based Processor/Parallel-Accelerator Systems," in IEEE 20th International Symposium on Field-Programmable Custom Computing Machines, 2012, pp. 17-24.
[21] M. K. Papamichael and J. C. Hoe, "CONNECT: Re-examining Conventional Wisdom for Designing NoCs in the Context of FPGAs," in Proceedings of the ACM/SIGDA International Symposium on Field Programmable Gate Arrays. ACM, 2012, pp. 37-46.
[22] Xilinx Inc., "MicroBlaze processor reference guide," reference manual, vol. 23, 2014.
[23] S. Xu and H. Pollitt-Smith, "A Multi-MicroBlaze Based SOC System: From SystemC Modeling to FPGA Prototyping," in The 19th IEEE/IFIP International Symposium on Rapid System Prototyping, 2008, pp. 121-127.
[24] A. K. Singh, A. Kumar, T. Srikanthan, and Y. Ha, "Mapping real-life applications on run-time reconfigurable NoC-based MPSoC on FPGA," in International Conference on Field-Programmable Technology, 2010, pp. 365-368.
[25] E. H. E. Mimouni and M. Karim, "A MicroBlaze-based Multiprocessor System on Chip for real-time cardiac monitoring," in International Conference on Multimedia Computing and Systems (ICMCS), 2014, pp. 331-336.
[26] S. Li, M. Huang, H. Ding, and S. Ma, "A Hierarchical Memory Architecture with NoC Support for MPSoC on FPGAs," in IEEE 22nd Annual International Symposium on Field-Programmable Custom Computing Machines, 2014, pp. 173-173.
[27] H.-P. Rosinger, "Connecting customized IP to the MicroBlaze soft processor using the Fast Simplex Link (FSL) channel," Xilinx Application Note, 2004.
[28] Y. Hara, H. Tomiyama, S. Honda, H. Takada, and K. Ishii, "CHStone: A benchmark program suite for practical C-based high-level synthesis," in IEEE International Symposium on Circuits and Systems.
[29] SNU Real-Time Benchmarks.
[30] in Proceedings of the 49th Annual Design Automation Conference. ACM, 2012, pp. 398-405.
[31] L. Barthe, P. Benoit, and L. Torres, "Investigation of a Masking Countermeasure against Side-Channel Attacks for RISC-based Processor Architectures," in International Conference on Field Programmable Logic and Applications, 2010, pp. 139-144.
[32] T. Feist, "Vivado design suite," White Paper.
Zhe Lin (S'15) received his B.S. degree from the School of Electronic Science and Engineering, Southeast University, Nanjing, China, in 2014. Since 2014, he has been a Ph.D. student in the Department of Electronic and Computer Engineering at the Hong Kong University of Science and Technology (HKUST), Hong Kong. Zhe's current research interests cover FPGA-based heterogeneous multicore systems and power management strategies for modern FPGAs.
Sharad Sinha (S'03, M'15) received his Ph.D. degree in Computer Engineering from NTU, Singapore (2014). He is a Research Scientist in the School of Computer Engineering at NTU. He received the Best Speaker Award from the IEEE CASS Society, Singapore Chapter, in 2013 for his Ph.D. work on high-level synthesis, and serves as a Corresponding Editor for IEEE Potentials and an Associate Editor for ACM Ubiquity. Dr. Sinha earned a Bachelor of Technology (B.Tech) degree in Electronics and Communication Engineering from Cochin University of Science and Technology (CUSAT), India, in 2007. From 2007 to 2009, he was a design engineer with Processor Systems (India) Pvt. Ltd. Dr. Sinha's research and teaching interests are in computer architecture, embedded systems and reconfigurable computing.
Hao Liang received a B.S. degree in software engineering from Shanghai Jiaotong University, Shanghai, China, in 2011. He is currently pursuing a Ph.D. degree in electronic and computer engineering at the Hong Kong University of Science and Technology, Hong Kong. His current research interests include 3-D IC thermal modeling, emerging interconnect technology, embedded systems, and reconfigurable computing.
Liang Feng received a B.S. degree in microelectronics from Nanjing University, China, in 2014. He is currently a Ph.D. student in electronic and computer engineering at the Hong Kong University of Science and Technology, Hong Kong. Liang's research interests include reconfigurable computing, multicore systems and electronic design automation (EDA).
Wei Zhang (M'05) received a Ph.D. degree from Princeton University, Princeton, NJ, USA, in 2009. She was an assistant professor with the School of Computer Engineering, Nanyang Technological University, Singapore, from 2010 to 2013. Dr. Zhang joined the Hong Kong University of Science and Technology, Hong Kong, in 2013, where she is currently an associate professor and where she established the Reconfigurable Computing Systems Laboratory (RCSL). Dr. Zhang was a co-investigator of the Singapore-MIT Alliance for Research and Technology Centre, Singapore, where she was involved in low-power electronics. Dr. Zhang was a collaborator with the A*STAR-UIUC Advanced Digital Sciences Center, Singapore, where she was involved in field-programmable gate array (FPGA) acceleration for multimedia applications. Dr. Zhang has authored or co-authored over 50 book chapters and papers in peer-reviewed journals and international conferences. Dr. Zhang's current research interests include reconfigurable systems, FPGA-based design, low-power high-performance multicore systems, electronic design automation, embedded systems, and emerging technologies. Dr. Zhang serves as the Area Editor of Reconfigurable Computing of the