High-Performance Simultaneous Multiprocessing for Heterogeneous System-on-Chip
Kris Nikov, Mohammad Hosseinabady, Rafael Asenjo, Andrés Rodríguez, Angeles Navarro, Jose Nunez-Yanez
Position Paper
University of Bristol, UK: {kris.nikov, m.hosseinabady, j.l.nunez-yanez}@bristol.ac.uk
Universidad de Málaga, Spain: {asenjo, andres, angeles}@ac.uma.es

Abstract.
This paper presents a methodology for simultaneous heterogeneous computing, named ENEAC, where a quad-core ARM Cortex-A53 CPU works in tandem with a preprogrammed on-board FPGA accelerator. A heterogeneous scheduler distributes the tasks optimally among all the resources, and all compute units run asynchronously, which allows for improved performance for irregular workloads. ENEAC achieves up to 17% performance improvement when using all platform resources compared to just using the FPGA accelerators, and up to 865% performance increase compared to using just the CPU. The workflow uses existing commercial tools and C/C++ as a single programming language for both accelerator design and CPU programming, for improved productivity and ease of verification.
Keywords:
FPGA · Xilinx ZCU102 · Heterogeneous Scheduling · Performance Improvement
With the advent of dark silicon and the end of Dennard scaling [5][7], heterogeneous systems are seen as the way for the semiconductor industry to keep up with performance demands. This is not surprising, since DSPs, GPUs and NPUs are already widely used coprocessors; moreover, emerging fields such as cryptosecurity and artificial neural networks have also raised the demand for dedicated on-chip accelerators. With more of these being integrated in consumer devices, it is inevitable that eventually the trade-off of increased chip area will necessitate the reuse of silicon for task acceleration. FPGAs are set to play an important role in the future of heterogeneous computing for upcoming generations of SoCs, having the ability to be reprogrammed with specific accelerators on demand and within context switches.

The methodology presented in this paper, called ENergy Efficient Adaptive Computing with heterogeneous architectures (ENEAC), aims to build upon existing tools and platforms in order to develop a comprehensive solution to heterogeneous computing using CPUs and FPGAs. This paper is a continuation of the work in Nunez-Yanez et al. [6], and moves from the original Zynq embedded devices to add support for high-performance Zynq Ultrascale devices. The workflow is updated and evaluated on the Xilinx ZCU102 Development Platform with a larger FPGA device and 64-bit ARM processors. The entire methodology and a usage tutorial are open-source and available online at [1].

Key contributions include:
1. A framework for customising accelerator code and programming the FPGA using the Xilinx SDSoC development environment.
2. A scheduling algorithm which distributes the workload between the CPU cores and FPGA accelerators and ensures load balance among devices.
3. A custom platform that adds extensive interrupt management to enable the accelerators to work independently, which improves system throughput for irregular workloads.
Heterogeneous computing is not just a means to improve performance, but can also be highly effective in areas where minimising energy usage is critical, such as embedded systems. The limitations of CPUs are highlighted by Chung et al. [4], who demonstrate that over 90% of the energy in a general-purpose processor is "overhead". There is a clear need to integrate more application-specific accelerators, and current efforts to promote heterogeneous computing include the Heterogeneous System Architecture (HSA) Foundation [2]. They present a new integrated computational platform and associated software tools that allow distributed workload execution over a variety of processors from a single software source.

The majority of heterogeneous computing involves using a host processor, which controls the execution across the other compute units. ENEAC explores a horizontal collaborative approach, where the platform CPU also contributes to the task execution, thus improving performance compared to solely using the FPGA as an accelerator. The developed workflow includes commercially available tools from the Xilinx suite to enable quick adoption of new algorithms/workloads and easy system reconfiguration. A similar approach has been explored by Tsoi et al. [8]. They focus on using multiple devices to showcase how the N-body simulation can be successfully implemented on a heterogeneous system, and use both FPGA and GPU to compute the same kernel on different portions of particles, which achieves a 22.7 times speedup compared to the CPU-only version.

This work is a continuation of the methodology presented in [6]; however, the workflow has been further optimised for irregular workloads and has been validated on a more complex platform. The key updates include the ability to schedule to a set of accelerators programmed on the FPGA individually, and the use of a more sophisticated interrupt mechanism to free the host threads and ensure asynchronous execution.
[Fig. 1: SDSoC multiprocessing platform. Block diagram showing the FPGA with four software-defined hardware accelerators (FC/ACC), the Zynq Ultra Processing System (PS) with its 4× Cortex-A53 cores (CC), the CONCAT interrupt controller feeding the shared pl_ps_irq0 interrupts, the M_AXI_HPM0_LPD platform-defined master interrupt interface, the accelerator DONE signals, and the kernel driver.]
ENEAC is developed and evaluated on the ZCU102 Development Platform. It contains a heterogeneous scheduler to distribute the workload between the on-board CPU and FPGA, as well as custom hardware interrupt controllers and a software interrupt mechanism to improve performance.
The platform features the Xilinx Zynq Ultrascale+ SoC [3], which has a quad-core ARM Cortex-A53 as well as on-chip FPGA logic. Data transfer between the CPU and the FPGA logic is done via AXI interfaces, and the chip supports access into CPU memory via 4 High-Performance (HP) or 2 High-Performance-Cacheable (HPC) ports. The FPGA is programmed via the SDSoC environment with optimised hardware accelerator implementations for the two benchmarks used in the evaluation. Both types of accelerator connection are evaluated using ENEAC, with the HPC ports being the preferred method of connection, since using the HP ports requires intermediate software data buffers to move data from cacheable to non-cacheable memory.
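The cost of the HP path is essentially one extra staging copy per offload. The following is a rough user-space illustration of that idea, not SDSoC code: the function name is a hypothetical stand-in for staging data into a non-cacheable region before an HP-port transfer.

```cpp
#include <cstring>
#include <vector>

// Illustrative model only: with an HPC (cache-coherent) port the
// accelerator can read the application buffer directly, while the HP
// path must first copy the data into a non-cacheable staging buffer.
std::vector<float> stage_for_hp_port(const std::vector<float>& cacheable) {
    // Stand-in for an allocation in non-cacheable memory; in a real
    // SDSoC design this staging adds latency to every HP-port offload.
    std::vector<float> non_cacheable(cacheable.size());
    std::memcpy(non_cacheable.data(), cacheable.data(),
                cacheable.size() * sizeof(float));
    return non_cacheable;
}
```

The HPC path avoids this copy entirely, which is why it is the preferred connection in ENEAC.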
A key component of ENEAC is the custom interrupt generation mechanism, consisting of i) hardware interrupt generators, which connect to the CPU IRQ lines and indicate when each hardware accelerator is finished; and ii) software drivers, which catch the interrupts and wake the host thread (the thread in charge of offloading work to the FPGA). Figure 1 shows the hardware platform configuration, including the data access ports and the interrupt controllers. A key feature is that every FPGA accelerator has its own dedicated interrupt controller, interrupt driver and host thread, so that each FPGA accelerator can operate independently. Moreover, the host thread does not waste CPU cycles waiting for the accelerator.
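The per-accelerator wake-up described above can be sketched with a standard condition variable. This is a simplified user-space model with a simulated interrupt source, not the actual kernel-driver code:

```cpp
#include <condition_variable>
#include <mutex>
#include <thread>

// Illustrative sketch: each FPGA accelerator gets its own host thread
// that sleeps until the interrupt driver reports completion, instead of
// burning CPU cycles polling a status register.
struct DoneEvent {
    std::mutex m;
    std::condition_variable cv;
    bool done = false;

    // Called by the (simulated) interrupt driver when the accelerator
    // raises its DONE signal.
    void signal() {
        { std::lock_guard<std::mutex> lk(m); done = true; }
        cv.notify_one();
    }

    // Called by the host thread after offloading a chunk: blocks,
    // freeing the core for useful work, until the interrupt arrives.
    void wait() {
        std::unique_lock<std::mutex> lk(m);
        cv.wait(lk, [this] { return done; });
        done = false;  // re-arm for the next chunk
    }
};
```

Because every accelerator owns a separate event of this kind, one slow accelerator never stalls the host threads serving the others.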
The custom heterogeneous software scheduler that is part of ENEAC builds on top of the SDSoC and TBB libraries and offers a parallel_for() function template to run on heterogeneous CPU-FPGA systems.

[Fig. 2: The heterogeneous scheduling design. On the left, the software stack: the user application calls parallel_for(begin, end, body), implemented by the HBB library (class Body, class Scheduler) on top of Threading Building Blocks (TBB) and the SDSoC run-time. On the right, the internal engine: the iteration space is split into chunks that travel as tokens through pipeline stages (S1, S2) to the CPU cores (CC) and FPGA accelerators (ACC).]

Fig. 2 shows the ZCU102 system with four FPGA accelerators (ACC) and four CPU cores (CC), as used in the experimental evaluation. The left side shows the software stack that runs the workload, which includes the heterogeneous scheduler. It takes care of splitting the iteration space into chunks and processes each chunk on a CC or an ACC device. The right part illustrates how the internal engine managing the parallel_for() works. The iteration space consists of the chunks already assigned to a processing unit and the remaining iterations waiting to be assigned. In the current implementation of the scheduler, called MultiDynamic, the user specifies the ACC chunk size and the scheduler dynamically adapts the CC chunk size with the goal of maximising the load balance. The scheduler supports offloading to each compute unit as soon as it becomes available, a feature particularly relevant for irregular workloads in which the execution time of a chunk of iterations cannot be predicted at runtime.

[Fig. 3: ENEAC workflow. 1. Hardware Platform Generation: platform customisation of the base platform with interrupt controllers and hardware accelerators, generation of an intermediate and a final hardware platform (with interrupt interface and clock module), testing of functionality and throughput, and output of the loadable hardware binary and software interface library. 2. Software Interrupts Setup: platform kernel interrupt drivers, device tree and kernel module generation, producing updated kernel driver modules. 3. Experiment Setup: load the environment, compile the workload (interrupt driver interface, scheduler interface, benchmark code), run the workload and perform data analysis.]
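A minimal sketch of a MultiDynamic-style partitioner is given below. The adaptation rule shown here (each CC chunk takes an even share of the remaining iterations, so chunks shrink as the work drains) is an illustrative assumption and may differ from the actual ENEAC policy; the struct and method names are ours.

```cpp
#include <algorithm>
#include <cstddef>

// Simplified sketch: the ACC chunk size is fixed by the user, while CC
// chunk sizes are derived dynamically from the remaining iteration
// count so that CPU cores and accelerators finish at similar times.
struct MultiDynamicSketch {
    std::size_t remaining;    // iterations not yet assigned
    std::size_t acc_chunk;    // user-specified accelerator chunk size
    std::size_t num_workers;  // total CC + ACC compute units

    // Next chunk for an accelerator: fixed size, capped by what is left.
    std::size_t next_acc_chunk() {
        std::size_t c = std::min(acc_chunk, remaining);
        remaining -= c;
        return c;
    }

    // Next chunk for a CPU core: an even share of the remaining work,
    // so late chunks shrink and the load stays balanced.
    std::size_t next_cc_chunk() {
        std::size_t share = std::max<std::size_t>(1, remaining / num_workers);
        std::size_t c = std::min(share, remaining);
        remaining -= c;
        return c;
    }
};
```

Any compute unit that becomes idle simply asks for its next chunk, which is what lets irregular workloads stay balanced without predicting chunk execution times.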
The three stages of the methodology are presented in Figure 3. Stage one consists of using the SDSoC tools to update the default platform with the interrupt controllers, introduce the application-specific accelerators and ensure correct functionality. The final design uses a custom clock module, set to 200 MHz for all configurations. The second stage involves updating the platform environment with the interrupt controller drivers. The experimental setup stage includes compiling the workload code with the scheduler software stack, the interrupt driver interface and the software interface library. Then the FPGA is loaded with the accelerator platform and the workload is executed, while collecting runtime data.
ENEAC is evaluated with two distinct benchmarks, HOTSPOT and SPMM. HOTSPOT is a stencil algorithm from the Rodinia benchmark collection, which estimates thermal dissipation on the surface of a chip by solving a series of differential equations. Workload size is altered by specifying the chip size; the evaluation uses a chip size of 2048 × 2048. Seven hardware configurations are evaluated: (1) uses just the four CPU cores to execute the workload; (2) uses 4 FPGA accelerators connected through the HP ports to the PS (CPU); (3) uses HPC-connected accelerators; (4)-(5) distribute the workload between the 4 CPU cores and 4 HP-connected accelerators without and with the hardware interrupts, respectively; and (6)-(7) refer to the last two configurations pairing the CPU with HPC-connected accelerators, again without and with interrupts. For all configurations the performance is measured over a range of FPGA workload chunk sizes to identify the optimal workload distribution using the MultiDynamic scheduler presented earlier.

Figure 4 and Table 1 show the results of the performance evaluation while using the MultiDynamic scheduler for heterogeneous computing. For both the HOTSPOT and SPMM benchmarks, shown in Fig. 4a and 4b respectively, it can be observed that the highest throughput is achieved using the final configuration with all 4 CPU cores and 4 HPC-connected FPGA accelerators. Comparing configurations (6) and (7), and configurations (4) and (5), reveals that in both benchmarks the custom interrupt mechanism improves resource utilisation and increases the platform throughput regardless of the accelerator data ports. For both benchmarks the optimal point for the workload distribution varies between configurations, and a sharp decrease in throughput can be observed when more than 1/4 of the workload is scheduled per accelerator: a 512 chunk size for HOTSPOT and an 8192 chunk size for SPMM.

[Fig. 4: Benchmark performance on the ZCU102 Development Platform. Throughput versus chunk size for configurations (1)-(7), with the peak points marked: (a) HOTSPOT, throughput in temps/ms; (b) SPMM, throughput in rows/ms.]

ID   Configuration   Inter.  Sched.  Peak Throughput
                                     HOTSPOT   SPMM
(1)  4CC             No      MD      51.17     7.78
(2)  4HPACC          No      MD      30.13     53.55
(3)  4HPCACC         No      MD      98.55     65.38
(4)  4CC+4HPACC      No      MD      36.47     51.90
(5)  4CC+4HPACC      Yes     MD      41.49     68.67
(6)  4CC+4HPCACC     No      MD      94.15     64.71
(7)  4CC+4HPCACC     Yes     MD      115.11    75.09

Table 1: Benchmark performance on the ZCU102 Development Platform

The results obtained using ENEAC show that utilising the heterogeneous scheduler to distribute the workload between the CPU and FPGA with the custom interrupt mechanism on the Xilinx ZCU102 development board produces the highest throughput across the hardware configurations for both benchmarks included in the evaluation. Using the workflow results in a 124.96%, 16.80% and 22.26% increase in throughput for the HOTSPOT benchmark when compared to just using 4 CPU cores, only 4 HPC FPGA accelerators, and 4 CPU cores plus 4 HPC FPGA accelerators without the interrupt mechanism, respectively. The irregular SPMM benchmark also shows a significant improvement of 865.17%, 14.85% and 16.04% over the three equivalent hardware configurations.
This paper presents ENEAC, a custom methodology to optimally distribute workloads on a complex heterogeneous computing platform between the multi-core CPU and the on-board FPGA. Multiple FPGA hardware configurations are explored, and ENEAC successfully integrates custom hardware accelerators and an interrupt mechanism to improve workload execution, compared to just using the hardware accelerators without help from the CPU, by 16.80% and 14.85% for the HOTSPOT and SPMM benchmarks respectively.

Future work involves optimising and evaluating a more advanced custom scheduler on the platform, which identifies the optimal workload distribution automatically, without the need to manually set the chunk sizes, and also focusing specifically on optimising energy efficiency in addition to throughput. A larger set of benchmarks will be used to demonstrate the general applicability of the methodology. ENEAC is open-source and can be accessed at [1].
References
1. ENergy Efficient Adaptive Computing with multi-grain heterogeneous architectures (ENEAC). https://github.com/eejlny/ENEAC, accessed: 2018-10-15
2. HSA Foundation. Accessed: 2018-03-05
3. Xilinx Zynq UltraScale+ MPSoC ZCU102 Evaluation Kit. Accessed: 2018-10-15
4. Chung, E.S., Milder, P.A., Hoe, J.C., Mai, K.: Single-chip heterogeneous computing: Does the future include custom logic, FPGAs, and GPGPUs? In: MICRO '10 (Dec 2010)
5. Esmaeilzadeh, H., Blem, E., Amant, R.S., Sankaralingam, K., Burger, D.: Dark silicon and the end of multicore scaling. In: 2011 38th Annual International Symposium on Computer Architecture (ISCA). pp. 365-376. IEEE (2011)
6. Nunez-Yanez, J., Amiri, S., Hosseinabady, M., Rodríguez, A., Asenjo, R., Navarro, A., Suarez, D., Gran, R.: Simultaneous multiprocessing in a software-defined heterogeneous FPGA. The Journal of Supercomputing 75