HTS: A Hardware Task Scheduler for Heterogeneous Systems
Kartik Hegde†, Abhishek Srivastava†, Rohit Agrawal†
University of Illinois at Urbana-Champaign
{kvhegde2, as29, rohita2}@illinois.edu

† Authors contributed equally to this work.

Abstract—As the Moore's scaling era comes to an end, application-specific hardware accelerators appear to be an attractive way to improve the performance and power efficiency of our computing systems. A massively heterogeneous system with a large number of hardware accelerators alongside multiple general-purpose CPUs is a promising direction, but it poses several challenges in terms of the run-time scheduling of tasks on the accelerators and the design granularity of the accelerators. This paper addresses these challenges by developing an example heterogeneous system that enables multiple applications to share the available accelerators. We propose to design accelerators at a lower abstraction so that applications can be broken down into tasks that can be mapped onto several accelerators. We observe that several real-life workloads can be broken down into common primitives that are shared across many workloads. Finally, we propose and design a hardware task scheduler, inspired by the hardware schedulers in out-of-order superscalar processors, that efficiently utilizes the accelerators in the system by scheduling tasks out of order and even speculatively. We evaluate the proposed system on both real-life and synthetic benchmarks based on Digital Signal Processing (DSP) applications. Compared to executing the benchmarks on a system with sequential scheduling, the proposed scheduler achieves up to × improvement in performance.

I. INTRODUCTION
We are at a challenging juncture in computer architecture research, where the imminent death of Moore's law threatens to slow down the performance growth we are used to, while rapidly growing fields such as deep learning demand more compute power than ever. Single-thread performance improvement has saturated, and Amdahl's law limits any further exploitation of parallel computing for performance gains. Furthermore, general-purpose computing chips are significantly disadvantaged in terms of power and performance efficiency when running demanding tasks [1]. Therefore, the recent trend has been to use application-specific hardware accelerators to boost the performance of computing systems. The application-specific nature of these accelerators enables them to exploit prior knowledge of the targeted algorithms' execution and data-access patterns to gain substantial improvements. To give a perspective, [1] claims over 500x higher energy efficiency over general-purpose CPUs in the task of video decoding. However, it is non-trivial to build a massively heterogeneous system with a sea of accelerators, as the entire stack of OSes, compilers, and schedulers will have to be redesigned.
Figure 1: A traditional view of a heterogeneous system
It is also challenging to determine the granularity at which accelerators should be designed in order to support a wide variety of algorithms. Since accelerators might share data that the general-purpose CPU is generating or consuming, coherency and consistency of the memory also pose a challenge, and there are further difficulties in establishing a uniform virtual address translation mechanism. Figure 1 depicts a heterogeneous system containing CPUs and accelerators sharing the memory.

Specifically, we recognize the difficulty of run-time task scheduling in systems with multiple hardware accelerators, where a task is an abstraction for a set of instructions that define a primitive, which tends to repeat across the program. User threads can be made up of multiple tasks, and those tasks might be best performed by some accelerator in the system. The threads can have fine-grained parallelism leading to many dependencies among the tasks, coarse-grained parallelism that requires minimal interaction, or data parallelism that makes them completely independent. Furthermore, control flow changes may completely change the dependency structure at run-time. This clearly indicates the necessity of a run-time task scheduler for such heterogeneous systems, and a rich line of work [2], [3], [4], [5] based on run-time APIs addresses this challenge.

As noted in [3], [6], task scheduling on a CMP system is very analogous to instruction scheduling across the functional units of a general-purpose CPU. We extend this notion to the multiple domain-specific accelerators and general-purpose CPUs in a heterogeneous system. Custom schedulers within accelerators have been explored before [7], but the notion of hardware-based task scheduling for heterogeneous systems is unexplored.

Arguably, one of the most profound developments in CPU micro-architecture that boosted performance is the out-of-order (OoO) pipeline, which exploits instruction level parallelism. Based on this inspiration, we argue that breaking threads down into multiple tasks and effectively exploiting Task Level Parallelism is the next growth area for high-performance computer architecture. As massively heterogeneous systems emerge, we believe this work can contribute towards improving the utilization of the hardware accelerators in such systems.

In this light, we propose to design a hardware task scheduler that interfaces the CPU with all the accelerators in the system. Again drawing from the analogy of an OoO pipeline, the compiler/OS running on the CPU pushes tasks onto the scheduler; the proposed scheduler is akin to the CPU frontend, and it treats the fine-grained accelerators as the Execution units. The hardware scheduler is aware of the status of each of the slave execution units, giving it the unique advantage of assigning and executing tasks in an Out-of-Order fashion with a global view, and even executing tasks speculatively.

Paper Outline.
The paper is organized as follows. Section II gives an overview of the current literature and outlines the requirements for a heterogeneous task scheduler. Section III details the inspiration behind the idea and describes the core insight of this paper. Section IV describes the architecture of the system, the accelerators, and the hardware task scheduler. Section V presents the programming model. Finally, Section VI shows the preliminary results of our work.

II. BACKGROUND
Researchers are always looking for opportunities at various levels of the computing stack to improve application performance. This generally involves exposing parallelism at some level. The field has seen massive developments on this front through the exploitation of Instruction Level Parallelism (ILP), Thread Level Parallelism (TLP), and Data Level Parallelism (DLP), each coinciding with monumental developments in computer architecture. The shift towards out-of-order scheduling in processors led to massive growth in the ability to exploit ILP, while the prominence of multi-cores made exploiting TLP important. The addition of vector units, and the consequent proposal of accelerators like GPGPUs, led to effective utilization of DLP.

Following this trend, we believe the well-documented surge of hardware specialization paves the way for the usage of Tasks as units of computation, with the exploitation of Task Level Parallelism becoming another important aspect of application programming. This should allow users to expose scope for concurrency at the application level, which, when coupled with appropriate scheduling strategies, can be mapped to threads and hence to instructions. Adding another level of abstraction should also enable users to pass intricate details about the application algorithm that might not be apparent to the underlying runtime system and hardware, such as fine-grained dependency tracking and eager memory management. Note that the notion of defining computation through tasks is not a novel concept; in fact, it has seen substantial exploration in the recent past. We summarize some of the prominent efforts below.
A. Runtime-based Task Parallelism
Most of the prominent task-based parallel computing environments consist of two components: (1) a task-parallel API and (2) a task runtime system [8]. The former defines the way an application developer describes parallelism, dependencies, and data distribution options, among other things, while the latter acts as the basis for implementing the APIs. The runtime defines the efficiency and ability of the environment: it determines the target architectures supported, the task scheduling objectives, the scheduling methodologies, the support for fault tolerance, etc. A large number of task-based programming environments have been developed over the past decades, with even established languages like C++ integrating tasks for shared-memory parallelism.

The Cilk [9] language allows task-based parallel programming with work-stealing-based scheduling. OpenMP [10] integrates tasks into its programming interface, and task-parallelism-based libraries such as Intel TBB [11] have emerged. The aforementioned environments were built for shared-memory systems. The past few years have seen task-parallel models being built for heterogeneous hardware, like StarPU [12], and for distributed systems, like Chapel [13], X10 [5], HPX [14], and Charm++ [4]. In a distributed setting, tasks are combined with a Global Address Space (GAS) programming model to form a distributed execution of a task-parallel program [8].

Legion [3] is a data-centric programming language with a task-based runtime. A program instance is defined in terms of its logical regions, which express the locality and independence of program data, and its tasks. The runtime system uses a distributed, parallel scheduling algorithm and a mapping interface to control the movement of data and to place tasks on devices based on locality. It is aimed towards heterogeneous systems.

In the OmpSs programming model [15], the user specifies tasks along with their data dependences. The runtime performs dynamic task dependence analysis, followed by dataflow scheduling and out-of-order execution.

In such systems, as one tries to scale applications onto many processors, many more tasks are required to make full use of the available hardware resources. However, it has been observed that for fine-grained tasks, the overhead of software-based task scheduling and management is too high to maintain scalable performance [16]. This is generally attributed to task launch overheads, especially for fine-grained computations. Such studies have sparked interest in hardware-based task dependence management systems.
B. Hardware-based Task Parallelism
Figure 2: Different abstraction levels for hardware acceleration

Task Superscalar [6] was proposed to accelerate task and dependence management using hardware. It showed promising results, but had issues with unresolved deadlocks due to queue saturation and memory capacity, which led to the proposal of an enhanced design called Picos [17]. Picos resolves these deadlocks and adds support for nested tasks. Carbon [18] implements task queue operations and scheduling in hardware to support fast task dispatch and stealing. A TriMedia-based multicore system [19] contains a hardware task scheduling unit built on Carbon. TMU [20] is a look-ahead task management unit for reducing task retrieval latency to accelerate task creation and synchronisation. Nexus++ [21] is a prominent contribution in this area; similar to Picos, it dynamically schedules tasks with real-time data dependence analysis while maintaining the programmability of the system.

We observe common denominators among all task-parallel programming environments. They require the programmer to present the application code in a new language or, at best, to annotate sections of the code for analysis. The former clearly affects portability, while the latter might not give the runtime enough information to extract fine-grained parallelism. Also, to the best of our knowledge, none of the aforementioned task-parallel models with software/hardware schedulers caters to heterogeneous systems with specialized accelerators. Our survey findings suggest decoupling the programming and task management aspects of a task-parallel programming model to enable portability, and developing task management hardware for modern-day systems.

Runnemede [22] is a notable contribution in the literature, which echoes many of the design principles of our proposal. It is a co-designed hardware/software effort where the hardware, execution model, OS/runtime, and applications are simultaneously developed. Its execution model divides programs into tasks called codelets [22], which are self-contained units of computation with defined inputs and outputs. Notably, it provides a number of programming models with different tradeoffs, which is one way of enabling portability. The user can use higher-level models like Hierarchically-Tiled Arrays [23] and Concurrent Collections [24], or code to the runtime's codelet model to get a lower-level interface to the hardware.

The execution model is based on a dataflow model. Carter et al. [22] elucidate the following characteristics of a dataflow execution model, which make it well suited to extreme-scale systems similar to what we intend to work on:
1) It allows easy exploitation of all parallelism within each phase of an application, without requiring a thread-based division of the application.
2) It incurs lower synchronization costs, as only the producer and consumer(s) of an item need to synchronize.
3) It enables a clear breakdown and efficient scheduling of tasks onto different parts of the system in a non-blocking manner, which makes it easy to schedule code close to its data, marshal input data at the location where the computation is to be performed, etc.

The Runnemede architecture designs two types of cores: general-purpose Control Engines (CEs) and specialized Execution Engines (XEs) intended for the execution of codelets. Note that an XE cannot perform I/O operations; instead, an I/O operation is represented as a dataflow dependence between codelets and is performed by a CE.

III. MOTIVATION
It is well established that application-specific hardware accelerators significantly outperform CPUs and GPUs in a variety of tasks such as deep learning, genome sequencing, computer vision, and digital signal processing. These accelerators are generally coupled with an API, and upon the CPU's command, they complete the task they are built for. Unfortunately, these hardware blocks cannot be used for anything other than the application they were built for.

However, there can be a mid-point between a fully general-purpose system and an inflexible hardware accelerator. [25] proposes several dwarfs of computing algorithms that form the basic categories of computations that generally occur. At first glance, it seems that building a system that does well on each of these dwarfs should provide a performance uplift. But that is simply not the case, largely because workloads cannot be segregated into dwarfs at that abstraction; they must instead be decomposed at a much lower abstraction. For example, image processing applications fall into dense and sparse linear algebra, but that is too high an abstraction to meaningfully accelerate them in hardware.

We provide a core insight here: a massively heterogeneous system should have a large number of accelerators at a lower abstraction, such that they are usable as basic functional blocks across a large range of applications simultaneously.
Figure 3: Proposed Task Scheduler integrated in the system

Figure 2 depicts this pictorially. Every application is made up of several kernels, each kernel requires several functions to be implemented, and at the lowest level of the hierarchy are the OPs (operations) that constitute a function. Figure 2 gives an example of an image processing application for better clarity. While most hardware accelerators today are at the Application granularity (for example, deep learning inference) and CPUs operate at the basic-OP granularity, we argue that building a large number of accelerators at the kernel granularity enables them to be reused across several applications of a domain.

To enable a large number of accelerators, developed to execute various functions, to be shared across a number of applications and kernels, we require a run-time scheduling system.
Interestingly, such a scheduler is very similar to the hardware scheduler in an out-of-order processor, because each of these accelerators executes a task associated with a memory region, akin to an instruction with operands [3]. The complexity of such a scheduler is non-trivial, because it has to deal with many of the difficulties that an OoO processor scheduler deals with, but at a much higher granularity.

Several earlier works [3], [4] proposed software approaches to scheduling on CMPs, and a reasonable extension of these works could yield a task scheduler for a massively heterogeneous system. In such cases, as depicted in Figure 1, CPUs interact with the accelerators as masters on a common programming bus. Accelerators respond to requests made by the masters via an interrupt to the requester, and they act as masters on the data bus to access the memory like CPUs. However, such software task scheduling has inherent disadvantages:
1) The instance of the scheduler running on the host CPU is an overhead [3].
2) Task completion signals have to be conveyed via interrupts, which can have a long latency.
3) A central manager of the hardware accelerators is absent, which makes accelerator management requirements such as Dynamic Voltage-Frequency Scaling (DVFS) difficult.

All of the above disadvantages can be alleviated if the system contains a central hardware task scheduler. Figure 3 modifies Figure 1 to include a Hardware Task Scheduler (HTS) in the system that interfaces the CPUs and the accelerators. CPUs can push new tasks and the associated meta-data to the HTS and continue execution, or poll the scheduler in case of a dependency. The HTS maintains a queue of tasks and the associated metadata, akin to the instructions to be executed in an OoO CPU core. The HTS is aware of the busy status of each slave accelerator in the system, based on which it can schedule the tasks. Accelerators notify the HTS once the assigned task is complete via a physical signal, which is orders of magnitude faster than an interrupt. The HTS can also control the power management of all the accelerators in the system.

Most importantly, the HTS can schedule tasks in an out-of-order fashion based on dependencies and the status of the accelerators, which can bring a great speed-up in a task-parallel system. Furthermore, speculatively executing tasks across both control and data branches can bring a further significant boost in performance. Hardware instruction schedulers have revolutionized modern micro-processor design and have greatly simplified the associated OSes and compilers. The key take-away is that similar hardware-based mechanisms for OoO and speculative task execution are necessary to realize a massively heterogeneous system architecture.

IV. ARCHITECTURE
In this section, we detail the overall architecture of the proposed heterogeneous system and the task scheduler. We first describe the overall system design and then elucidate the accelerator and scheduler interfaces.
A. System Design
There are several crucial design decisions to make when designing a system with multiple accelerators:
1) Are accelerators masters on the data bus? Do they have an internal DMA controller?
2) Are accelerator scratch-pads coherent with the system?
3) How can CPUs program the accelerators?
4) How are interrupts configured?
5) How is contention managed?

We largely agree with the system design proposed in [26], where each accelerator has its own DMA engine. The scratch-pads are not coherent with the system memory and require explicit synchronization via barriers. A centralized manager manages the tasks to be scheduled on the accelerators (referred to as the GAM in [26]) and handles notifying the CPUs when a task is complete. Managing contention is currently beyond the scope of this work and we leave it for future work. An overview of the proposed system is shown in Figure 3.
B. Accelerator Design
When the system contains a large number of accelerators, it is essential to maintain a homogeneous external interface for each accelerator. Figure 4 depicts the interface of every accelerator in the system. Each accelerator contains a DMA engine that can fetch data from the memory via the Master port connected to the NoC. The scheduler delivers tasks to the accelerator via the Task Delivery slave port, which carries the entire description of the task to be performed along with the base address. The accelerator conveys its busy status to the manager via the accelerator status signal, and its power management is again controlled by the manager.

Figure 4: A basic interface for a candidate accelerator in the system
C. Task Scheduler
The core idea of having a large number of accelerators that are shared across applications is realizable only when run-time scheduling is present in the system to effectively share the resources. As proposed in Section III, we design a Hardware Task Scheduler (HTS) that receives tasks from the CPUs and manages their execution on the accelerators. We develop the task scheduler as an OoO core that is capable of scheduling tasks on the accelerators dynamically. The key insight here is that all the optimizations that CPU development has seen can be brought to task scheduling with minimal changes.

Figure 5 gives a top-level view of the proposed HTS. As can be observed, its design is completely based on an OoO processor that executes instructions in an out-of-order fashion and is able to execute instructions speculatively. We exploit the fact that tasks can be executed in an OoO fashion as well [3], where the user's program is compiled to generate a task-flow graph (explained in Section V). Each CPU running a program can therefore push tasks to the scheduler and be notified of the completion of those tasks via a dedicated bus.

Figure 5: Proposed task scheduler that retains the basic design of an OoO processor
1) Overview: The HTS receives tasks from the CPUs into the Task Queue, which is then decoded by the Task Decode logic. The decoded tasks contain information on the type of the task and the associated meta-data (Section V). Every task is associated with a memory region, and there are likely to be dependencies among tasks. This is analyzed by the Task Dispatch logic, which can re-order the tasks within a window whose size is decided at design time (similar to an instruction window). These tasks are dispatched according to an OoO issue logic into buffers named Reservation Stations, since their functionality is similar to the reservation stations proposed in Tomasulo's algorithm for OoO scheduling [27]. Based on whether the accelerator that a task maps to is free (similar to the busy status of a functional unit in a CPU), the Reservation Station dispatches the task to the accelerator. Note that the dispatch width is a design parameter.

The accelerators receive these tasks and proceed to perform them. Note that accelerators are masters on the data bus, so they can fetch the data required for a task. Once an accelerator completes a task and writes the result back to memory, it sends the completion status back to the HTS via a Common Data Bus (CDB), which clears other dependencies waiting on that task. Every cycle, the reservation stations can issue tasks whose dependencies are cleared, based on the availability of the accelerators. This enables the design to issue multiple tasks per cycle, similar to a Superscalar design.
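To make this overview concrete, the following is a minimal sketch of the per-cycle issue logic, in the spirit of the cycle-accurate Python model described in Section VI; the class and field names (Task, HTS, asr, etc.) are illustrative, not the exact model code, and it dispatches one decoded task per cycle for simplicity (the dispatch width is a design parameter).

# Minimal sketch of the HTS per-cycle flow (illustrative names).
from collections import deque

class Task:
    def __init__(self, task_id, accel_type, deps):
        self.task_id = task_id        # unique ID assigned at dispatch
        self.accel_type = accel_type  # e.g. "fft_256"
        self.deps = set(deps)         # IDs of tasks this task waits on

class HTS:
    def __init__(self, accel_counts):
        self.task_queue = deque()         # tasks pushed by the CPUs
        self.reservation_stations = []    # dispatched, waiting tasks
        self.asr = dict(accel_counts)     # free units per accelerator type

    def cycle(self, completions):
        # 1) CDB broadcast: completed tasks clear deps and free their units.
        for task_id, accel_type in completions:
            self.asr[accel_type] += 1
            for t in self.reservation_stations:
                t.deps.discard(task_id)
        # 2) Dispatch one newly decoded task into the reservation stations.
        if self.task_queue:
            self.reservation_stations.append(self.task_queue.popleft())
        # 3) Issue every ready task whose accelerator is idle (superscalar).
        issued = [t for t in self.reservation_stations
                  if not t.deps and self.asr[t.accel_type] > 0]
        for t in issued:
            self.asr[t.accel_type] -= 1
            self.reservation_stations.remove(t)
        return issued  # handed to the accelerator models for execution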
2) Resolving RAW dependencies: One of the most common hazards occurring in an OoO pipeline is a Read-After-Write (RAW) hazard. Our design resolves this in the Task Dispatch stage using an additional structure named the Memory Tracker. Each incoming task is assigned an ID by the dispatch logic, which informs the Memory Tracker of the output memory region that the outgoing task is going to write, along with the corresponding task ID. When a new task is ready for dispatch, the Memory Tracker is scanned for the new task's input memory region to find any dependencies. If an entry is found in the Tracker, the corresponding task ID is returned and the new task is dispatched to a reservation station, where it waits until the dependency is resolved. The completion of each task has to be announced on the CDB, which is controlled by an arbiter that implements a ticket lock for serialization.
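A minimal sketch of the Memory Tracker lookup follows; representing regions as base/size intervals and detecting dependencies by interval overlap is our assumption for illustration, matching the region/size fields of the ISA (Table I).

# Sketch of RAW detection in the Memory Tracker (assumed interval encoding).
class MemoryTracker:
    def __init__(self):
        self.entries = []  # list of (base, size, writer_task_id)

    def record_write(self, base, size, task_id):
        # Dispatch registers the output region of every outgoing task.
        self.entries.append((base, size, task_id))

    def raw_deps(self, in_base, in_size):
        # A new task's input region is scanned against all pending writes;
        # any overlap is a RAW dependency the new task must wait on.
        def overlaps(b, s):
            return b < in_base + in_size and in_base < b + s
        return {tid for (b, s, tid) in self.entries if overlaps(b, s)}

    def retire(self, task_id):
        # A completion announced on the CDB clears the writer's entry.
        self.entries = [e for e in self.entries if e[2] != task_id]

At dispatch, raw_deps would populate the deps set of a task (as in the earlier sketch) before the task enters a reservation station.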
3) Speculative Execution: Interestingly, the whole concept of Speculative Execution can be applied to task scheduling. The dataflow graph is executed speculatively whenever a branch is encountered. The main challenge in realizing speculation in task scheduling is that the results of speculative tasks cannot be reversed, since they operate on memory directly. We get around this by allocating a region of memory for speculated tasks to operate on, which can be discarded in case of mis-speculation. This is similar in spirit to how Transactional Memory operates.

We consider three different types of branches to speculate upon:
1) Register-Read (RR): These branches can be resolved by simply accessing the general-purpose register bank in the scheduler. This causes a single-cycle bubble; compared to the cycles that each task takes (typically thousands of cycles), it is not beneficial to speculate. We simply incur the bubble cost and resolve the branch.
2) Memory-Read (MR): These branches are based on data in some location in memory. Resolving them requires spawning a new task to read memory, which can potentially take a large number of cycles. Hence, we consider these for speculation.
3) Bus-Read (BR): These branches depend on the output of some task that is yet to finish. In this case, the dispatch unit continues execution speculatively while monitoring the CDB to resolve the branch.

When the HTS is running in speculative mode, each task's output is mapped to a new location in the Transactional Memory (TM), a dedicated part of memory reserved for speculative tasks. The dispatch unit queries the Task Lookup Buffer (TLB) to allocate a new region for the output of the current speculative task, and the corresponding mapping is stored in the TLB. Additionally, each subsequent task's input memory region is looked up in the TLB to get the mapping, if present. Note that each speculation is given an ID, and that ID is noted for each mapping present in the TLB. Based on the branch resolution, there can be two cases:
1) Mis-speculation: In this case, all the entries corresponding to the speculation ID are discarded from the TLB, and all tasks evicted from the TLB are aborted immediately. We need not operate on the Transactional Memory, because deleting the entries from the TLB is equivalent to invalidating/erasing the mis-speculated region of the TM.
2) Correct speculation: In this case, execution continues normally and the mappings are retained in the TLB. Any future task that requests data in a region that was speculated on, and that is present in the TLB, is remapped to read the data from the TM. Note that if the TLB/TM becomes full, the HTS stalls and deletes mappings by copying the data from the TM to the mapped locations in memory. This ensures the functional correctness of the HTS.
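The sketch below illustrates the TLB bookkeeping described above; the dictionary-based structure, fixed-size TM regions, and method names are our assumptions for illustration, not the hardware's exact organization.

# Sketch of the Task Lookup Buffer (TLB) remapping speculative outputs
# into Transactional Memory (TM). Structure is illustrative.
class TaskLookupBuffer:
    def __init__(self, tm_base, tm_size, region_size):
        self.mappings = {}  # original base -> (tm base, speculation ID)
        self.free_regions = list(range(tm_base, tm_base + tm_size,
                                       region_size))

    def allocate(self, out_base, spec_id):
        # Redirect a speculative task's output region into the TM.
        if not self.free_regions:
            raise RuntimeError("TLB/TM full: HTS must stall and drain")
        tm_region = self.free_regions.pop()
        self.mappings[out_base] = (tm_region, spec_id)
        return tm_region

    def lookup(self, in_base):
        # Subsequent tasks read speculated data from the TM, if mapped.
        entry = self.mappings.get(in_base)
        return entry[0] if entry else in_base

    def resolve(self, spec_id, correct):
        # Mis-speculation discards all mappings with this speculation ID
        # (equivalent to erasing that TM region); correct speculation
        # simply retains the mappings, as the TM now holds live data.
        if not correct:
            for base, (tm, sid) in list(self.mappings.items()):
                if sid == spec_id:
                    self.free_regions.append(tm)
                    del self.mappings[base]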
4) Accelerator Status: One important requirement of the HTS is the ability to monitor the busy status of each accelerator in the system. We achieve this by maintaining a directory named the Accelerator Status Register (ASR), as depicted in Figure 5. The reservation station checks the ASR before releasing any task, to make sure the required accelerator is idle. The ASR also monitors the CDB to clear the busy status of any accelerator that has completed its task.
5) General Purpose Registers: The HTS provides a General Purpose Register (GPR) bank to support different programming models, which we elaborate on in Section V. The number of registers in the GPR bank is a design-time parameter. Each register is addressed as Rx, where x is the number of the register.

V. PROGRAMMING MODEL
As described in Section IV, the HTS executes a dataflow graph. However, the CPU or the compiler needs to describe the dataflow graph in a form that the HTS can understand. Again drawing from the analogy of CPUs and the high-level languages used to describe programs, we need a unifying set of rules and instructions with which all dataflow graphs can be described: a mechanism to divide the application into tasks, accompanied by their data and control dependencies. Note that portability across such heterogeneous systems is an important design principle of our proposal, so we employ a generic programming model that describes the relationships among the different entities (tasks, data, control) and leaves the jobs of scheduling and execution to the Task Scheduler and the heterogeneous system, respectively. In this section, we describe our Instruction Set Architecture and provide a glimpse into programming the HTS to execute dataflow graphs.
A. Instruction Set Architecture
Figure 6: Supported instruction set architecture for programming the dataflow graph

Figure 6 lists the instructions supported by the HTS. Along with the task instruction, which assigns a task to an accelerator, we support arithmetic instructions like add, mul, and mov that operate on the GPRs. In order to support all types of dataflow graphs, we also add the if instruction for branches, jump to jump to a specific part of the dataflow graph, and lbeg/lend to begin and end a loop. All instructions are 128 bits wide, and the breakdown of each field is given in Table I.

Table I: Instruction Breakdown

Range      Purpose
[7:0]      Accelerator ID
[23:8]     Input Memory Region
[31:24]    Input Memory Size
[47:32]    Output Memory Region
[55:48]    Output Memory Size
[59:56]    Task ID
[63:60]    Process ID
[67:64]    Control
[127:68]   Metadata (for the accelerator)
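As an illustration of this encoding, the sketch below packs and unpacks the Table I fields of a 128-bit instruction word; the function names are ours, the field layout follows the table, and the example values are taken from the first task of the dataflow graph in Section V-B (with accelerator ID 0 assumed for real_fir).

# Sketch: pack/unpack the 128-bit HTS instruction word per Table I.
FIELDS = [  # (name, low bit, width)
    ("accel_id",   0,  8), ("in_region",  8, 16), ("in_size",  24,  8),
    ("out_region", 32, 16), ("out_size", 48,  8), ("task_id",  56,  4),
    ("process_id", 60,  4), ("control",  64,  4), ("metadata", 68, 60),
]

def pack(**values):
    word = 0
    for name, low, width in FIELDS:
        v = values.get(name, 0)
        assert v < (1 << width), f"{name} overflows {width} bits"
        word |= v << low
    return word  # a 128-bit integer

def unpack(word):
    return {name: (word >> low) & ((1 << width) - 1)
            for name, low, width in FIELDS}

# Example: "real_fir 10 2 13 2 0 0 0 0000" (fields are hexadecimal).
insn = pack(accel_id=0x00, in_region=0x10, in_size=0x2,
            out_region=0x13, out_size=0x2, task_id=0x0)
assert unpack(insn)["out_region"] == 0x13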
B. Describing the dataflow graphs
The HTS executes programs written in an assembly language. Each accelerator type is given a keyname (e.g., fft_256 for all the accelerators that can execute a 256-point FFT); the keyname is assigned an accelerator ID when the code is compiled. After the keyname, the instruction fields are written in the order given in Table I. Each field is a hexadecimal number, and a simple dataflow graph consisting of a set of independent nodes is described below.

real_fir     10 2 13 2 0 0 0 0000
complex_fir  16 2 19 2 1 0 0 0000
adaptive_fir 23 3 28 3 2 0 0 0000
vector_dot   40 4 48 4 3 0 0 0000
iir          32 3 36 3 4 0 0 0000
C. Support for Loops
The dataflow graphs need to support looping, which is widely found in real-life applications. The HTS recognizes the start of a loop with the lbeg instruction and begins a counter based on the requested number of iterations. The lend instruction marks the end of the loop body and names the associated loop count register. The HTS iterates over the loop body for the requested number of iterations, as shown in the example program below.

mov  58 0 2 0 1 0 0 0001
mov   3 0 3 0 2 0 0 0001
mov  75 0 6 0 3 0 0 0001
lbeg  4 4 0 0 4 0 0 0001
add   4 2 5 0 5 0 0 0001
add   4 6 7 0 6 0 0 0001
iir   5 3 7 3 7 0 1 0000
lend  0 4 2 0 8 0 0 0001
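To make the lbeg/lend semantics concrete, here is a sketch of how a front end could expand a loop into the dynamic task stream; reading the iteration count from the first field of lbeg is our assumption about the encoding, so treat the details as illustrative.

# Sketch: expansion of lbeg/lend into a repeated task stream.
def expand_loops(program):
    """program: list of (opcode, fields) tuples in program order.
    Yields the dynamic stream with loop bodies repeated."""
    pc = 0
    loop_stack = []  # entries of [start_pc, remaining_iterations]
    while pc < len(program):
        opcode, fields = program[pc]
        if opcode == "lbeg":
            loop_stack.append([pc, fields[0]])  # iteration count (assumed field)
        elif opcode == "lend":
            top = loop_stack[-1]
            top[1] -= 1
            if top[1] > 0:
                pc = top[0]        # jump back to the loop start
            else:
                loop_stack.pop()   # loop done, fall through
        else:
            yield (opcode, fields)  # tasks inside the body repeat
        pc += 1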
D. Support for Branches
One of the most important parts of the ISA is its support for branches, on which speculative execution relies. The if instruction marks the start of a branch and describes the dependency of the branch, which helps the HTS classify the branch type (Section IV-C3). The if instruction also encodes the PC jump that the HTS has to perform if the branch is taken. Below is an example program containing a branch, where the branch condition is evaluated based on a memory region (hence an MR branch) and a PC jump is requested if the branch is taken. Note that branches can co-exist with loops as well.

mov          3 0 a 0 0 0 0 0001
real_fir     10 2 13 2 0 0 0 0000
complex_fir  16 2 19 2 1 0 0 0000
if           93 a 12 0 1 0 d 0000
adaptive_fir 23 3 28 3 2 0 0 0000
iir          32 3 36 3 3 0 0 0000
vector_dot   40 4 48 4 4 0 0 0000
vector_add   55 4 62 4 5 0 0 0000
vector_max   68 5 76 5 6 0 0 0000
fft_256      84 6 93 6 7 0 0 0000
dct_64       102 2 106 2 8 0 0 0000
correlation  110 3 115 3 9 0 0 0000
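For completeness, a sketch of how the dispatch stage could act on the branch classification of Section IV-C3; the dep_kind tag is an illustrative stand-in for whatever the decoded if instruction actually carries.

# Sketch: mapping a decoded `if` to the RR / MR / BR policies (IV-C3).
def branch_policy(dep_kind):
    if dep_kind == "register":
        # RR: resolvable from the scheduler's GPR bank; take the
        # single-cycle bubble instead of speculating.
        return "resolve_now"
    if dep_kind == "memory":
        # MR: needs a memory-read task (potentially many cycles), so
        # speculate past the branch.
        return "speculate"
    # BR: depends on an in-flight task; speculate and watch the CDB.
    return "speculate_and_monitor_cdb"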
VI. EVALUATION
In this section, we describe our evaluation strategy and the current results from our simulation environment.
A. Workload Characterization
Generally, hardware accelerators are built for specific applications. A modern mobile SoC contains accelerators for graphics processing, video decoding, digital signal processing, etc. However, a massively heterogeneous system as described in Figure 2 would contain a large number of Function-level accelerators. Therefore, to demonstrate the effectiveness of such a system, we choose workloads based on several criteria:
1) There should be a large number of applications requiring hardware acceleration.
2) The applications should be decomposable into Kernels and Functions as described in Figure 2.
3) The applications should share the Kernels and Functions with each other.

In this work, we demonstrate the advantages of building a massively heterogeneous system with a task scheduler using Digital Signal Processing (DSP) workloads. The choice is driven by the criteria described above. In addition, DSP workloads contain popular real-life applications that are widely benchmarked by previous works, and it is interesting to compare them with DSP processors, which form a mid-point between general-purpose CPUs and ASICs.

Table II: DSP Functions modeled as accelerators

Kernel         Description                                                 Input dataframe size   Cycles
Real FIR       Real input-valued finite-duration impulse response filter   40                     921
Complex FIR    Complex input-valued finite-duration impulse response filter 40                    3696
Adaptive FIR   Least-mean-square finite-duration impulse response filter   40                     4384
IIR            Infinite-duration impulse response filter                   40                     2450
Vector Dot     Calculates the dot product of two vectors                   40                     53
Vector Add     Adds two vectors                                            40                     131
Vector Max     Computes the largest value in a vector                      40                     55
FFT            Fast Fourier transform
DCT            Discrete cosine transform                                   64                     874
Correlation    Computes a measure of similarity                            40                     753

Figure 7: Performance comparison on synthetic benchmarks without branches. Normalized cycles; configurations: Naive; Runtime, 1 FU/Kernel; Runtime, 2 FUs/Kernel; HTS, 1 FU/Kernel; HTS, 2 FUs/Kernel.
B. Accelerators
We assume the presence of accelerators for the Functions described in Table II. These accelerators were benchmarked by Lennartsson et al. [28], and we use the reported cycle counts in our experiments; these cycle numbers are crucial for our evaluation of the task scheduling systems. Table II enumerates the DSP Functions that we model as accelerators in our system. For the sake of our experiments, we assume that the aforementioned Functions suffice to run any DSP Kernel/Application.
C. Experiments
We modeled the proposed Hardware Task Scheduler (HTS) in Python. The implementation is cycle-accurate and assumes an incoming stream of tasks sent by the CPU, which is accomplished by passing an assembly (.asm) file, containing tasks described as per our ISA, to the model. The model is configurable in the number of accelerators per Function.

Our experimentation can be divided into two parts: custom benchmarks and a real application. We devised different custom-made benchmarks to observe the behavior of various features of our proposed HTS. We then pick audio compression, a real-life application, to show the feasibility of our proposal and its observed performance.

For every experiment, we compare three scheduling algorithms:
1) Naive scheduling - The CPU schedules one task at a time (in-order). For each task, the CPU schedules the task and waits for its completion before proceeding to the next task. We estimate its performance by summing (execution cycle count + interrupt latency) over all tasks. Note that the interrupt latency is independent of the task.
2) Runtime (Software) based scheduling - An out-of-order runtime running on the CPU schedules tasks. We design it as the manifestation of our HTS design in software. We estimate its performance by summing (software scheduling overhead + interrupt latency) over all tasks. We model the software scheduling overhead as the memory access latency that would be incurred if our exact HTS design were implemented in software, in which case the Memory Tracker, Reservation Stations, etc. would actually reside in memory. We assume an L2 cache hit for each memory access.
3) HTS scheduling - Our out-of-order, speculative hardware scheduler.

We use the ARM Cortex-A interrupt latency [29] and the ARM Cortex-A9 L2 cache memory access latency [30] for our experiments, and performance is modeled in clock cycles. Note that we are making crude estimations, especially for Runtime based scheduling, for the purpose of comparison, so these experimental values are not absolute in any sense. Our intention is to provide a ballpark performance of the other scheduling algorithms to shed light on the advantages of our proposed HTS.

Figure 8: Performance comparison on synthetic benchmarks with branches. Normalized cycles; configurations: Naive; Runtime, 1 FU/Kernel; Runtime, 2 FUs/Kernel; HTS, 1 FU/Kernel without and with speculation; HTS, 2 FUs/Kernel without and with speculation.
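The two baseline cost models above reduce to simple per-task sums. A minimal sketch follows; the latency constants and the number of scheduler-structure accesses per task are placeholders, not the cited ARM figures from [29] and [30].

# Sketch of the per-task cost models used for the baselines.
INTERRUPT_LATENCY = 300  # cycles; placeholder for the ARM Cortex-A figure [29]
L2_HIT_LATENCY = 20      # cycles; placeholder for the Cortex-A9 figure [30]
SCHED_ACCESSES = 4       # assumed scheduler-structure accesses per task

def naive_cycles(task_cycles):
    # In-order: each task runs to completion, then the CPU takes an interrupt.
    return sum(c + INTERRUPT_LATENCY for c in task_cycles)

def runtime_overhead_cycles(num_tasks):
    # Software OoO runtime: the Memory Tracker, Reservation Stations, etc.
    # live in memory, so each task pays L2-latency scheduling accesses plus
    # an interrupt; this overhead sits on top of the OoO task schedule.
    return num_tasks * (SCHED_ACCESSES * L2_HIT_LATENCY + INTERRUPT_LATENCY)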
Custom-made Benchmarks:
We use the following custom-made benchmarks:
1) No Dependency - no dependencies among any tasks; no loops or branches.
2) Same Dependency - dependencies only among tasks mapped to the same functional unit (e.g., an instance of FFT and FFT); no loops or branches.
3) Different Dependency - dependencies only among tasks mapped to different functional units (e.g., an instance of FFT and Correlation); no loops or branches.
4) Random Dependency - no definite pattern among dependencies; no loops or branches.
5) Loop No Dependency - one loop with no dependency outside the loop; no branches.
6) Loop Dependency - one loop whose iterations depend on one or more outside tasks; no branches.
7) Branch Taken No Dependency - one branch which is actually taken, with no dependency for branch resolution.
8) Branch Not Taken No Dependency - one branch which is actually not taken, with no dependency for branch resolution.
9) Branch Taken Dependency - one branch which is actually taken, with a dependency for branch resolution.

These benchmarks were made with the purpose of analyzing the performance of our proposed HTS in various basic scenarios. Fig. 7 illustrates these scenarios; note that the plot has been normalized by the maximum value. In Fig. 7, one can clearly observe
Naive Scheduling performing the worst, as expected. This can be attributed to it being constrained to in-order execution and having to incur the interrupt latency overhead for each task.

Algorithm 1: Audio Compression
Result: Compressed audio
fetch audio;
Correlate audio;
if correlated audio > threshold then
    /* time domain */
    for i ← … to Bands do
        FIR; FIR; FIR;
    end
else
    /* frequency domain */
    for i ← … to Bands do
        FFT; Vector Dot; Vector Dot; Vector Dot; iFFT;
    end
end

There is a clear improvement in performance for both Runtime (Software) based scheduling and HTS Scheduling when they have multiple Functional Units (FUs), as both of them can execute tasks out-of-order. This is also true for the loop-based benchmarks, as iterations are implicitly executed out-of-order (as long as they are independent).

Our speculation naively assumes a branch is not taken. So, in Fig. 8, the first and third plots illustrate cases where HTS Scheduling mis-speculates, while in the second case it speculates correctly. Notably, we observe minimal difference in performance between HTS w/o Spec and HTS w/ Spec when HTS Scheduling mis-speculates (the scale of the plot hides the actual difference), which can be attributed to the efficient implementation of speculation. Also, HTS Scheduling achieves much better performance than Runtime (Software) based scheduling when it speculates correctly. We assume that Runtime (Software) based scheduling cannot speculate.
A real-life application: We illustrate the performance of our proposed HTS on a real application: audio compression. Note that we are able to decompose the application into the available set of Functions. The algorithm (Algorithm 1) uses FIRs (time domain) or FFTs and Vector Dots (frequency domain) based on the comparison of the correlated audio to a threshold.

Fig. 9 depicts the comparative performance on audio compression. As expected, Naive Scheduling and Runtime (Software) based scheduling perform poorly compared to HTS Scheduling, owing to interrupt latency overheads and software scheduling plus interrupt latency overheads, respectively.

Figure 9: Performance comparison of scheduling algorithms on audio compression. Normalized cycles; configurations: Naive; Runtime; HTS w/o Spec (BT); HTS w/ Spec (BT); HTS w/ Spec (BNT).

Figure 10: Performance scaling with the number of FUs on audio compression.
A notable difference between this application and our custom-made benchmarks is that the branch resolution result drastically impacts the runtime, as the task blocks differ (FFT has a considerably higher cycle count than the others). So, the branch-taken (BT) and branch-not-taken (BNT) cases have different cycle counts.

Fig. 10 sketches the performance trend under strong scaling. The hyper-parameter for this experiment is the number of iterations (the number of Bands), which we change to alter the number of tasks in the system. We observe a decrease in cycle count (an increase in performance) as the number of Functional Units in the system increases, since HTS Scheduling is able to schedule tasks out-of-order. The improvement in performance is higher for programs containing a larger number of tasks. This is a fairly good indicator of how well our proposed HTS performs.

VII. CONCLUSION
In this paper, we proposed a massively heterogeneous system architecture with a large number of accelerators. We proposed to implement accelerators at the Function abstraction, rather than at the Application or Kernel abstraction. This helps to share the accelerators across several applications, which are broken down into a common set of Functions. In such scenarios, effectively scheduling the tasks at run-time becomes crucial. We proposed to implement a hardware task scheduler along the lines of an out-of-order speculative processor, which helps to overlap and execute tasks more efficiently for higher resource utilization. Our preliminary results show great potential for large performance uplifts in several real-life workloads.
REFERENCES

[1] R. Hameed et al., "Understanding sources of inefficiency in general-purpose chips," in ACM SIGARCH Computer Architecture News, vol. 38, no. 3. ACM, 2010.
[2] M. Houston et al., "A portable runtime interface for multi-level memory hierarchies," in Proceedings of the 13th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming. ACM, 2008.
[3] M. Bauer et al., "Legion: Expressing locality and independence with logical regions," in Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis. IEEE Computer Society Press, 2012.
[4] L. V. Kale et al., "Charm++: A portable concurrent object oriented system based on C++," in ACM SIGPLAN Notices, vol. 28, no. 10. ACM, 1993.
[5] P. Charles et al., "X10: An object-oriented approach to non-uniform cluster computing," in ACM SIGPLAN Notices, vol. 40, no. 10. ACM, 2005.
[6] Y. Etsion et al., "Task superscalar: An out-of-order task pipeline," in Proceedings of the 2010 43rd Annual IEEE/ACM International Symposium on Microarchitecture. IEEE Computer Society, 2010.
[7] N. P. Carter et al., "Runnemede: An architecture for ubiquitous high-performance computing," in High Performance Computer Architecture (HPCA 2013), 2013 IEEE 19th International Symposium on. IEEE, 2013.
[8] P. Thoman et al., "A taxonomy of task-based parallel programming technologies for high-performance computing," The Journal of Supercomputing, Jan 2018. [Online]. Available: https://doi.org/10.1007/s11227-018-2238-4
[9] R. D. Blumofe et al., "Cilk: An efficient multithreaded runtime system," in Proceedings of the Fifth ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, ser. PPOPP '95. New York, NY, USA: ACM, 1995. [Online]. Available: http://doi.acm.org/10.1145/209936.209958
[10] L. Dagum et al., "OpenMP: An industry standard API for shared-memory programming," IEEE Computational Science and Engineering, vol. 5, Jan 1998.
[11] T. Willhalm et al., "Putting Intel Threading Building Blocks to work," in Proceedings of the 1st International Workshop on Multicore Software Engineering, ser. IWMSE '08. New York, NY, USA: ACM, 2008. [Online]. Available: http://doi.acm.org/10.1145/1370082.1370085
[12] C. Augonnet et al., "StarPU: A unified platform for task scheduling on heterogeneous multicore architectures," Concurr. Comput.: Pract. Exper., vol. 23, Feb. 2011. [Online]. Available: http://dx.doi.org/10.1002/cpe.1631
[13] B. Chamberlain et al., "Parallel programmability and the Chapel language," The International Journal of High Performance Computing Applications, vol. 21, 2007. [Online]. Available: https://doi.org/10.1177/1094342007078442
[14] H. Kaiser et al., "HPX: A task based programming model in a global address space," in Proceedings of the 8th International Conference on Partitioned Global Address Space Programming Models, ser. PGAS '14. New York, NY, USA: ACM, 2014. [Online]. Available: http://doi.acm.org/10.1145/2676870.2676883
[15] J. Bueno et al., "Productive cluster programming with OmpSs," in Proceedings of the 17th International Conference on Parallel Processing - Volume Part I, ser. Euro-Par '11. Berlin, Heidelberg: Springer-Verlag, 2011. [Online]. Available: http://dl.acm.org/citation.cfm?id=2033345.2033405
[16] X. Tan et al., "Performance analysis of a hardware accelerator of dependence management for task-based dataflow programming models," April 2016.
[17] F. Yazdanpanah et al., "Picos," Future Gener. Comput. Syst., vol. 53, Dec. 2015. [Online]. Available: http://dx.doi.org/10.1016/j.future.2014.12.010
[18] S. Kumar et al., "Carbon: Architectural support for fine-grained parallelism on chip multiprocessors," in Proceedings of the 34th Annual International Symposium on Computer Architecture, ser. ISCA '07. New York, NY, USA: ACM, 2007. [Online]. Available: http://doi.acm.org/10.1145/1250662.1250683
[19] J. Hoogerbrugge et al., "Transactions on high-performance embedded architectures and compilers III," P. Stenström, Ed. Berlin, Heidelberg: Springer-Verlag, 2011, ch. A Multithreaded Multicore System for Embedded Media Processing, pp. 154–173. [Online]. Available: http://dl.acm.org/citation.cfm?id=1980776.1980787
[20] M. Själander et al., "A look-ahead task management unit for embedded multi-core architectures," Sept 2008.
[21] T. Dallou et al., "FPGA-based prototype of Nexus++ task manager," 2013.
[22] N. P. Carter et al., "Runnemede: An architecture for ubiquitous high-performance computing," Feb 2013.
[23] J. Guo et al., "Hierarchically tiled arrays for parallelism and locality," in Proceedings 20th IEEE International Parallel and Distributed Processing Symposium, April 2006.
[24] Z. Budimlić et al., "Concurrent collections," Sci. Program., vol. 18, Aug. 2010. [Online]. Available: http://dx.doi.org/10.1155/2010/521797
[25] K. Asanovic et al., "The landscape of parallel computing research: A view from Berkeley," Technical Report UCB/EECS-2006-183, EECS Department, University of California, Berkeley, Tech. Rep., 2006.
[26] J. Cong et al., "PARADE: A cycle-accurate full-system simulation platform for accelerator-rich architectural design and exploration," in IEEE/ACM International Conference on Computer-Aided Design (ICCAD), 2015.
[27] R. M. Tomasulo, "An efficient algorithm for exploiting multiple arithmetic units," IBM Journal of Research and Development