High-Performance Simultaneous Multiprocessing for Heterogeneous System-on-Chip
Kris Nikov, Mohammad Hosseinabady, Rafael Asenjo, Andrés Rodríguez, Angeles Navarro, Jose Nunez-Yanez
Position Paper
University of Bristol, UK: {kris.nikov, m.hosseinabady, j.l.nunez-yanez}@bristol.ac.uk
Universidad de Málaga, Spain: {asenjo, andres, angeles}@ac.uma.es

Abstract.
This paper presents a methodology for simultaneous heterogeneous computing, named ENEAC, where a quad-core ARM Cortex-A53 CPU works in tandem with a preprogrammed on-board FPGA accelerator. A heterogeneous scheduler distributes the tasks optimally among all the resources, and all compute units run asynchronously, which allows for improved performance for irregular workloads. ENEAC achieves up to 17% performance improvement when using all platform resources compared to just using the FPGA accelerators, and up to 865% performance increase compared to using just the CPU. The workflow uses existing commercial tools and C/C++ as a single programming language for both accelerator design and CPU programming, for improved productivity and ease of verification.
Keywords:
FPGA · Xilinx ZCU102 · Heterogeneous Scheduling · Performance Improvement
With the advent of dark silicon and the end of Dennard scaling [5][7], heterogeneous systems are seen as the way for the semiconductor industry to keep up with performance demands. This is not surprising, since DSPs, GPUs and NPUs are already widely used coprocessors; moreover, emerging fields such as cryptosecurity and artificial neural networks have also raised the demand for dedicated on-chip accelerators. With more of these being integrated in consumer devices, it is inevitable that eventually the trade-off of increased chip area will necessitate the reuse of silicon for task acceleration. FPGAs are set to play an important role in the future of heterogeneous computing for upcoming generations of SoCs, having the ability to be reprogrammed with specific accelerators on demand and within context switches.

The methodology presented in this paper, called ENergy Efficient Adaptive Computing with heterogeneous architectures (ENEAC), aims to build upon existing tools and platforms in order to develop a comprehensive solution to heterogeneous computing using CPUs and FPGAs. This paper is a continuation of the work in Nunez-Yanez et al. [6], and moves from the original Zynq embedded devices to add support for high-performance Zynq Ultrascale devices. The workflow is updated and evaluated on the Xilinx ZCU102 Development Platform with a larger FPGA device and 64-bit ARM processors. The entire methodology and a usage tutorial are open-source and available online at [1].

Key contributions include:
1. A framework for customising accelerator code and programming the FPGA using the Xilinx SDSoC development environment.
2. A scheduling algorithm which distributes the workload between the CPU cores and FPGA accelerators and ensures load balance among devices.
3. A custom platform that adds extensive interrupt management to enable the accelerators to work independently, which improves system throughput for irregular workloads.
Heterogeneous computing is not just a means to improve performance, but can also be highly effective in areas where minimising energy usage is critical, such as embedded systems. The limitations of CPUs are highlighted by Chung et al. [4], who demonstrate that over 90% of the energy in a general-purpose processor is "overhead". There is a clear need to integrate more application-specific accelerators, and current efforts to promote heterogeneous computing include the Heterogeneous System Architecture (HSA) Foundation [2]. They present a new integrated computational platform and associated software tools that allow distributed workload execution over a variety of processors from a single software source.

The majority of heterogeneous computing involves using a host processor, which controls the execution across the other compute units. ENEAC explores a horizontal collaborative approach, where the platform CPU also contributes to the task execution, thus improving performance compared to solely using the FPGA as an accelerator. The developed workflow includes commercially available tools from the Xilinx suite to enable quick adoption of new algorithms/workloads and easy system reconfiguration. A similar approach has been explored by Tsoi et al. [8]. They focus on using multiple devices to showcase how the N-body simulation can be successfully implemented on a heterogeneous system, and use both FPGA and GPU to compute the same kernel on different portions of particles, which achieves a 22.7 times speedup compared to the CPU-only version.

This work is a continuation of the methodology presented in [6]; however, the workflow has been further optimised for irregular workloads and has been validated on a more complex platform. The key updates include the ability to schedule to a set of accelerators programmed on the FPGA individually, and the use of a more sophisticated interrupt mechanism to free the host threads and ensure asynchronous execution.
[Fig. 1: SDSoC multiprocessing platform. Block diagram showing the FPGA with four software-defined hardware accelerators (FC/ACC), the Zynq Ultra Processing System (PS) with its 4× Cortex-A53 cores (CC), the CONCAT interrupt controller feeding the shared pl_ps_irq0 interrupts, the M_AXI_HPM0_LPD platform-defined master interrupt interface, the accelerator DONE signals, and the kernel driver.]
ENEAC is developed and evaluated on the ZCU102 Development Platform. It contains a heterogeneous scheduler to distribute the workload between the on-board CPU and FPGA, as well as custom hardware interrupt controllers and a software interrupt mechanism to improve performance.
The platform features the Xilinx Zynq Ultrascale+ SoC [3], which has a quad-core ARM Cortex-A53 as well as on-chip FPGA logic. Data transfer between the CPU and the FPGA logic is done via AXI interfaces, and the chip supports access into CPU memory via 4 High-Performance (HP) or 2 High-Performance-Cacheable (HPC) ports. The FPGA is programmed via the SDSoC environment with optimised hardware accelerator implementations for the two benchmarks used in the evaluation. Both types of accelerator connection are evaluated using ENEAC, with the HPC ports being the preferred method of connection, since using the HP ports requires intermediate software data buffers to move data from cacheable to non-cacheable memory.
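The cost of the HP path is essentially one extra staging copy per offload. The following is a rough user-space illustration of that idea, not SDSoC code: the function name is a hypothetical stand-in for staging data into a non-cacheable region before an HP-port transfer.

```cpp
#include <cstring>
#include <vector>

// Illustrative model only: with an HPC (cache-coherent) port the
// accelerator can read the application buffer directly, while the HP
// path must first copy the data into a non-cacheable staging buffer.
std::vector<float> stage_for_hp_port(const std::vector<float>& cacheable) {
    // Stand-in for an allocation in non-cacheable memory; in a real
    // SDSoC design this staging adds latency to every HP-port offload.
    std::vector<float> non_cacheable(cacheable.size());
    std::memcpy(non_cacheable.data(), cacheable.data(),
                cacheable.size() * sizeof(float));
    return non_cacheable;
}
```

The HPC path avoids this copy entirely, which is why it is the preferred connection in ENEAC.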
A key component of ENEAC is the custom interrupt generation mechanism, consisting of i) hardware interrupt generators, which connect to the CPU IRQ lines and indicate when each hardware accelerator is finished; and ii) software drivers, which catch the interrupts and wake the host thread (the thread in charge of offloading work to the FPGA). Figure 1 shows the hardware platform configuration, including the data access ports and the interrupt controllers. A key feature is that every FPGA accelerator has its own dedicated interrupt controller, interrupt driver and host thread, so that each FPGA accelerator can operate independently. Moreover, the host thread does not waste CPU cycles waiting for the accelerator.
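The per-accelerator wake-up described above can be sketched with a standard condition variable. This is a simplified user-space model with a simulated interrupt source, not the actual kernel-driver code:

```cpp
#include <condition_variable>
#include <mutex>
#include <thread>

// Illustrative sketch: each FPGA accelerator gets its own host thread
// that sleeps until the interrupt driver reports completion, instead of
// burning CPU cycles polling a status register.
struct DoneEvent {
    std::mutex m;
    std::condition_variable cv;
    bool done = false;

    // Called by the (simulated) interrupt driver when the accelerator
    // raises its DONE signal.
    void signal() {
        { std::lock_guard<std::mutex> lk(m); done = true; }
        cv.notify_one();
    }

    // Called by the host thread after offloading a chunk: blocks,
    // freeing the core for useful work, until the interrupt arrives.
    void wait() {
        std::unique_lock<std::mutex> lk(m);
        cv.wait(lk, [this] { return done; });
        done = false;  // re-arm for the next chunk
    }
};
```

Because every accelerator owns a separate event of this kind, one slow accelerator never stalls the host threads serving the others.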
The custom heterogeneous software scheduler that is part of ENEAC builds on top of the SDSoC and TBB libraries and offers a parallel_for() function template to run on heterogeneous CPU-FPGA systems.

[Fig. 2: The heterogeneous scheduling design. On the left, the software stack: the user application calls parallel_for(begin, end, body), implemented by the HBB library (class Body, class Scheduler) on top of Threading Building Blocks (TBB) and the SDSoC run-time. On the right, the internal engine: the iteration space is split into chunks that travel as tokens through pipeline stages (S1, S2) to the CPU cores (CC) and FPGA accelerators (ACC).]

Fig. 2 shows the ZCU102 system with four FPGA accelerators (ACC) and four CPU cores (CC), as used in the experimental evaluation. The left side shows the software stack that runs the workload, which includes the heterogeneous scheduler. It takes care of splitting the iteration space into chunks and processes each chunk on a CC or an ACC device. The right part illustrates how the internal engine managing the parallel_for() works. The iteration space consists of the chunks already assigned to a processing unit and the remaining iterations waiting to be assigned. In the current implementation of the scheduler, called MultiDynamic, the user specifies the ACC chunk size and the scheduler dynamically adapts the CC chunk size with the goal of maximising the load balance. The scheduler supports offloading to each compute unit as soon as it becomes available, a feature particularly relevant for irregular workloads in which the execution time of a chunk of iterations cannot be predicted at runtime.

[Fig. 3: ENEAC workflow. 1. Hardware Platform Generation: platform customisation of the base platform with interrupt controllers and hardware accelerators, generation of an intermediate and a final hardware platform (with interrupt interface and clock module), testing of functionality and throughput, and output of the loadable hardware binary and software interface library. 2. Software Interrupts Setup: platform kernel interrupt drivers, device tree and kernel module generation, producing updated kernel driver modules. 3. Experiment Setup: load the environment, compile the workload (interrupt driver interface, scheduler interface, benchmark code), run the workload and perform data analysis.]
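A minimal sketch of a MultiDynamic-style partitioner is given below. The adaptation rule shown here (each CC chunk takes an even share of the remaining iterations, so chunks shrink as the work drains) is an illustrative assumption and may differ from the actual ENEAC policy; the struct and method names are ours.

```cpp
#include <algorithm>
#include <cstddef>

// Simplified sketch: the ACC chunk size is fixed by the user, while CC
// chunk sizes are derived dynamically from the remaining iteration
// count so that CPU cores and accelerators finish at similar times.
struct MultiDynamicSketch {
    std::size_t remaining;    // iterations not yet assigned
    std::size_t acc_chunk;    // user-specified accelerator chunk size
    std::size_t num_workers;  // total CC + ACC compute units

    // Next chunk for an accelerator: fixed size, capped by what is left.
    std::size_t next_acc_chunk() {
        std::size_t c = std::min(acc_chunk, remaining);
        remaining -= c;
        return c;
    }

    // Next chunk for a CPU core: an even share of the remaining work,
    // so late chunks shrink and the load stays balanced.
    std::size_t next_cc_chunk() {
        std::size_t share = std::max<std::size_t>(1, remaining / num_workers);
        std::size_t c = std::min(share, remaining);
        remaining -= c;
        return c;
    }
};
```

Any compute unit that becomes idle simply asks for its next chunk, which is what lets irregular workloads stay balanced without predicting chunk execution times.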
The three stages of the methodology are presented in Figure 3. Stage one consists of using the SDSoC tools to update the default platform with the interrupt controllers, introduce the application-specific accelerators and ensure correct functionality. The final design uses a custom clock module, set to 200 MHz for all configurations. The second stage involves updating the platform environment with the interrupt controller drivers. The experimental setup stage includes compiling the workload code with the scheduler software stack, the interrupt driver interface and the software interface library. Then the FPGA is loaded with the accelerator platform and the workload is executed, while collecting runtime data.
ENEAC is evaluated with two distinct benchmarks, HOTSPOT and SPMM. HOTSPOT is a stencil algorithm from the Rodinia benchmark collection, which estimates thermal dissipation on the surface of a chip by solving a series of differential equations. Workload size is altered by specifying the chip size; the evaluation uses a chip size of 2048 × 2048. Seven hardware configurations are evaluated: (1) uses just the four CPU cores to execute the workload; (2) uses 4 FPGA accelerators connected through the HP ports to the PS (CPU); (3) uses HPC-connected accelerators; (4)-(5) distribute the workload between the 4 CPU cores and 4 HP-connected accelerators without and with the hardware interrupts, respectively; and (6)-(7) refer to the last two configurations pairing the CPU with HPC-connected accelerators, again without and with interrupts. For all configurations the performance is measured over a range of FPGA workload chunk sizes to identify the optimal workload distribution using the MultiDynamic scheduler presented earlier.

Figure 4 and Table 1 show the results of the performance evaluation while using the MultiDynamic scheduler for heterogeneous computing. For both the HOTSPOT and SPMM benchmarks, shown in Fig. 4a and 4b respectively, it can be observed that the highest throughput is achieved using the final configuration with all 4 CPU cores and 4 HPC-connected FPGA accelerators. Comparing configurations (6) and (7), and configurations (4) and (5), reveals that in both benchmarks the custom interrupt mechanism improves resource utilisation and increases the platform throughput regardless of the accelerator data ports. For both benchmarks the optimal point for the workload distribution varies between configurations, and a sharp decrease in throughput can be observed when more than 1/4 of the workload is scheduled per accelerator: a 512 chunk size for HOTSPOT and an 8192 chunk size for SPMM.

[Fig. 4: Benchmark performance on the ZCU102 Development Platform. Throughput versus chunk size for configurations (1)-(7), with the peak points marked: (a) HOTSPOT, throughput in temps/ms; (b) SPMM, throughput in rows/ms.]

ID   Configuration   Inter.  Sched.  Peak Throughput
                                     HOTSPOT   SPMM
(1)  4CC             No      MD      51.17     7.78
(2)  4HPACC          No      MD      30.13     53.55
(3)  4HPCACC         No      MD      98.55     65.38
(4)  4CC+4HPACC      No      MD      36.47     51.90
(5)  4CC+4HPACC      Yes     MD      41.49     68.67
(6)  4CC+4HPCACC     No      MD      94.15     64.71
(7)  4CC+4HPCACC     Yes     MD      115.11    75.09

Table 1: Benchmark performance on the ZCU102 Development Platform

The results obtained using ENEAC show that utilising the heterogeneous scheduler to distribute the workload between the CPU and FPGA with the custom interrupt mechanism on the Xilinx ZCU102 development board produces the highest throughput across the hardware configurations for both benchmarks included in the evaluation. Using the workflow results in a 124.96%, 16.80% and 22.26% increase in throughput for the HOTSPOT benchmark when compared to just using 4 CPU cores, only 4 HPC FPGA accelerators, and 4 CPU cores plus 4 HPC FPGA accelerators without the interrupt mechanism, respectively. The irregular SPMM benchmark also shows a significant improvement of 865.17%, 14.85% and 16.04% over the three equivalent hardware configurations.
This paper presents ENEAC, a custom methodology to optimally distribute workloads on a complex heterogeneous computing platform between the multi-core CPU and the on-board FPGA. Multiple FPGA hardware configurations are explored, and ENEAC successfully integrates custom hardware accelerators and an interrupt mechanism to improve workload execution, compared to just using the hardware accelerators without help from the CPU, by 16.80% and 14.85% for the HOTSPOT and SPMM benchmarks respectively.

Future work involves optimising and evaluating a more advanced custom scheduler on the platform, which identifies the optimal workload distribution automatically, without the need to manually set the chunk sizes, and also focusing specifically on optimising energy efficiency in addition to throughput. A larger set of benchmarks will be used to demonstrate the general applicability of the methodology. ENEAC is open-source and can be accessed at [1].
References
1. ENergy Efficient Adaptive Computing with multi-grain heterogeneous architectures (ENEAC). https://github.com/eejlny/ENEAC, accessed: 2018-10-15
2. HSA Foundation. Accessed: 2018-03-05
3. Xilinx Zynq UltraScale+ MPSoC ZCU102 Evaluation Kit. Accessed: 2018-10-15
4. Chung, E.S., Milder, P.A., Hoe, J.C., Mai, K.: Single-chip heterogeneous computing: Does the future include custom logic, FPGAs, and GPGPUs? In: MICRO '10 (Dec 2010)
5. Esmaeilzadeh, H., Blem, E., Amant, R.S., Sankaralingam, K., Burger, D.: Dark silicon and the end of multicore scaling. In: 2011 38th Annual International Symposium on Computer Architecture (ISCA). pp. 365-376. IEEE (2011)
6. Nunez-Yanez, J., Amiri, S., Hosseinabady, M., Rodríguez, A., Asenjo, R., Navarro, A., Suarez, D., Gran, R.: Simultaneous multiprocessing in a software-defined heterogeneous FPGA. The Journal of Supercomputing 75