SARA: Self-Aware Resource Allocation for Heterogeneous MPSoCs
Yang Song†, Olivier Alavoine‡, Bill Lin†
†Electrical and Computer Engineering Department, University of California at San Diego
‡Qualcomm Inc., San Diego
[email protected]
ABSTRACT
In modern heterogeneous MPSoCs, the management of shared memory resources is crucial in delivering end-to-end QoS. Previous frameworks have either focused on singular QoS targets or the allocation of partitionable resources among CPU applications at relatively slow timescales. However, heterogeneous MPSoCs typically require instant response from the memory system, where most resources cannot be partitioned. Moreover, the health of different cores in a heterogeneous MPSoC is often measured by diverse performance objectives. In this work, we propose a Self-Aware Resource Allocation (SARA) framework for heterogeneous MPSoCs. Priority-based adaptation allows cores to use different performance targets and self-monitor their own intrinsic health. In response, the system allocates non-partitionable resources based on priorities. The proposed framework meets a diverse range of QoS demands from heterogeneous cores.
1. INTRODUCTION
Modern heterogeneous MPSoCs [1, 2] have been widely deployed in mobile devices thanks to their energy efficiency. These MPSoCs typically integrate a diverse collection of cores. Fig. 1 depicts an example of a heterogeneous MPSoC. Besides general-purpose cores like the CPU for running applications, most heterogeneous cores are dedicated to certain functions, such as the GPU, the DSP and the display. These cores have diverse notions of Quality-of-Service (QoS). For example, the GPU measures target real-time performance in terms of frame rate; the DSP demands that memory latency remain below a certain limit; and the display requires sufficient bandwidth to refresh frames at a constant rate. To save cost and energy, heterogeneous cores commonly share resources, among which the sharing of the memory system (including the on-chip network and the memory controller) is the most challenging because memory performance often has a direct and substantial impact on system performance. As data is shared through memory, competing memory requests from different cores interfere with each other, and these memory interferences can cause the
Figure 1: Heterogeneous system architecture example.

memory system to fail in meeting the target performance of some cores. Fig. 2 depicts a camcorder application, which represents a typical use case in that it involves many cores at the same time. With ineffective memory scheduling, a real-time core (e.g., the display) may not achieve the target real-time performance due to inadequate memory bandwidth. Moreover, as latency-sensitive cores such as the DSP share memory with other cores, they can be easily overwhelmed by real-time cores consuming high bandwidth.
Figure 2: Simplified dataflow of a camcorder application. Shared memory is represented by boxes with dashed lines and cores by boxes with solid lines.
QoS-aware management for specific types of memory resources has been well-studied in previous work [3, 4, 5, 6, 7]. In [3], a QoS-aware scheduling policy was proposed for CPU-GPU systems. The concept of frame progress was introduced for monitoring GPU performance. Although the policy can be extended to include more media cores, it cannot be applied to real-time cores whose target QoS cannot be assessed in terms of frame rate. Moreover, holistic memory management frameworks for CPU-centric homogeneous systems have also been explored recently [8, 9, 10]. This line of work typically constructs a management model based on control theory to partition computing and memory resources. These frameworks accept flexible QoS targets, as clients are allowed to define their own target performance. Nonetheless, this type of approach operates at a relatively slow timescale (e.g., on the order of milliseconds) due to its computational complexity. In comparison, real-time cores in heterogeneous MPSoCs often demand much more instant response from the memory system. Besides, communication between heterogeneous cores is mainly conducted through shared memory, as shown in Fig. 2, because multimedia data is generally too large to fit in caches. Therefore, DRAM plays a more crucial role in heterogeneous systems. However, previous frameworks cannot handle DRAM effectively because its bandwidth is not partitionable. Specifically, available DRAM bandwidth depends on the memory access pattern, as higher spatial locality results in fewer redundant precharge operations and better memory efficiency.

So far, there has not been a QoS-aware resource management model for heterogeneous MPSoCs that is capable of allocating non-partitionable resources to fleeting QoS demands. In this work, we propose the Self-Aware Resource Allocation (SARA) framework as a solution. The contributions of our work can be summarized as follows.
• We propose a QoS-aware holistic resource management framework for heterogeneous systems. The SARA model accepts diverse notions of QoS and monitors performance distributively with lightweight meters to guarantee end-to-end QoS.
• We introduce priority-based self-adaptation for the management of non-partitionable resources, such as DRAM and the on-chip network, which constitute most of the shared resources in heterogeneous MPSoCs.
• We evaluate the proposed framework using memory traffic of next-generation MPSoCs and show that the proposed SARA model delivers target performance to all cores. In contrast, the performance of critical cores can fall below 10% of their targets without the SARA framework. Further, memory system optimization is performed without QoS degradation.

The rest of this paper is organized as follows: Section 2 briefly reviews related work. Section 3 describes the proposed SARA framework. Experimental results and conclusions follow in Sections 4 and 5.
2. RELATED WORK
Most previous work on QoS-aware resource management in heterogeneous MPSoCs has focused on a single layer of the memory system. In [3], a novel scheduling policy was introduced to dynamically balance bandwidth between the CPU and the GPU based on the frame progress of real-time workloads. The staged memory scheduler [4] was presented as the first QoS-aware memory scheduler for CPU-GPU systems. Further, the single-tier virtual queuing memory controller [5] was proposed to overcome the limitations of two-tier schedulers in QoS-aware scheduling. Besides memory scheduling, QoS-aware cache management [7] and on-chip network design [6] have also been well-explored in recent years. Nonetheless, these works cannot guarantee end-to-end QoS because they only deal with certain parts of the memory system. For example, the QoS provided by the memory controller can be undermined by the interconnect if the latter does not apply the same QoS policy. In addition, implementing a centralized QoS monitor in the memory system can be prohibitive since it needs to collect runtime information from all cores. More limiting, these works assume specific notions of QoS, which is not applicable to modern heterogeneous MPSoCs where the health of different cores is often evaluated by diverse performance objectives.

METE [8] is a multi-level framework for end-to-end resource management based on control theory. It utilizes runtime information to predict application behaviors. Application controllers calculate the amounts of resources required to achieve target application performance, and a global resource broker determines the final resource partitions for applications. SEEC [9] is a self-aware computing framework designed for a many-core processor. It follows the observe-decide-act control loop for resource allocation: the performance of CPU applications is observed by a decision engine, which decides resource partitions using actions defined by system designers.
ARCC [10] is a self-aware computing framework implemented in the Tessellation many-core OS. It performs two-level scheduling: first, a resource allocation broker distributes global resources; then, scheduling policies are customized separately at the user level.

The aforementioned frameworks were intended for CPU-centric multi-core systems. They aim to allocate partitionable resources, such as CPU cores and cache ways, to applications at the software/OS level. They are not suitable for heterogeneous MPSoCs for the following reasons. First, complicated control models may not be fast enough for heterogeneous cores (e.g., these software/OS-level approaches operate at millisecond timescales). For example, the DSP sets a limit on memory latency at the nanosecond level, but prior frameworks need more time to adapt through control-theory computations in the OS. Second, prior work assumes all memory resources are partitionable. However, DRAM bandwidth cannot simply be partitioned like cache ways. In DRAM, the data storage of a memory bank is organized into rows and columns. To access a column, the row where this column is located must be loaded into the row-buffer (a row activation operation) after the other rows are closed (a precharge operation) [11]. These row activation and precharge operations incur a time penalty without contributing to actual data transfer, which makes DRAM bandwidth inconstant and unpredictable.
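To make concrete why row activation and precharge make DRAM bandwidth depend on the access pattern, consider a back-of-the-envelope model; the timing numbers and function below are illustrative assumptions of ours, not figures from the paper.

```python
# Illustrative model of why DRAM bandwidth is not constant: a row-buffer
# hit pays only the data burst, while a miss additionally pays precharge
# and row activation. All timing numbers are hypothetical examples.

def effective_bandwidth(hit_rate, peak_gbps=14.9, t_burst=4, t_rp=34, t_rcd=34):
    """Average usable bandwidth as a function of the row-buffer hit rate."""
    # Average cycles per access: burst, plus precharge + activate on misses.
    avg_cycles = t_burst + (1.0 - hit_rate) * (t_rp + t_rcd)
    return peak_gbps * t_burst / avg_cycles

# Higher spatial locality (more row-buffer hits) yields higher bandwidth.
for hr in (0.2, 0.5, 0.9):
    print(f"hit rate {hr:.0%}: {effective_bandwidth(hr):.2f} GB/s")
```

The point of the sketch is only the trend: with the same peak rate, the bandwidth actually delivered varies several-fold with spatial locality, which is why DRAM bandwidth cannot be statically partitioned.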
3. SELF-AWARE RESOURCE ALLOCATION FRAMEWORK
Figure 3: The proposed SARA framework for heterogeneous MPSoCs. Each core self-monitors its performance and self-adapts its priority, and each resource performs priority-based allocation in a distributed manner.
Figure 4: Examples of priority-based adaptation in heterogeneous cores, including the DSP, GPU and display.
The proposed architecture of the SARA framework is shown in Fig. 3. The resource management model consists of three stages: distributed monitoring, priority-based runtime adaptation, and system response. In the rest of this section, we go through the SARA framework stage by stage.
In the first stage, each core self-monitors its own performance. Distributed monitoring relieves the memory system from the burden of monitoring heterogeneous cores with various notions of QoS. Self-monitoring also provides more accurate feedback on the end-to-end QoS compared with centralized monitoring in the memory system. In addition, implementing lightweight performance meters is good for scalability, because a new core can be added or modified without updating the rest of the system.

Every core customizes its own internal performance meter to measure its own performance or progression against a given target, and the measurement is normalized into a fractional number called a Normalized Performance Indicator (NPI), which is used as an indicator of the core's intrinsic health. In the DSP, the performance meter monitors the average latency of its transactions, while in the display the meter counts the occupancy level of the read buffer. The deviation from the target performance (e.g., latency, occupancy level, etc.) produces the NPI metric. In our framework, each independent DMA (Direct Memory Access) unit is equipped with a performance meter. Note that there are usually multiple DMAs in a single core; for simplicity, we only show one DMA per core in Fig. 3.
In the second stage, each core adapts the relative priority of its transactions based on its NPI value. The NPI value delivered by the performance meter is translated into a relative priority level, which is attached to memory transactions from the same DMA. The priority level is evaluated within on-chip network arbiters and the memory controller as the transaction travels along the way to DRAM. Priority-based arbitration allows the memory system to provide QoS without specifying the heterogeneous QoS for all cores and DMAs. As with performance meters, the formulation of the NPI metric and the adaptation of priority can be implemented differently from core to core, depending on the local target performance. Fig. 4 shows three examples of priority-based adaptation in different cores.

For the DSP, the target performance is to keep the average memory latency lower than the maximum latency limit. The average latency is measured and compared with a pre-set limit to produce the NPI value (see Eqn. 1), which remains at or above 1 when the target performance is achieved. This NPI value is then translated to a relative priority level (Fig. 4(a)). The priority level increases along with the average latency.
NPI_DSP = (maximum latency limit) / (average latency)    (1)

Similarly, cores requesting bandwidth produce NPI metrics by computing the ratio between the average and the target bandwidth. However, frame rate differs from bandwidth, because frame size can be variable and thus a constant frame rate can lead to variable bandwidth. Hence, frame progress [3, 5] is used instead to produce NPI metrics for frame-rate-based cores. Take the GPU as an example: the target is to let the frame progress reach 100% as the current frame period comes to an end. The GPU's NPI value is produced at any time by comparing the frame progress with reference progresses, which grow proportionally with frame time. The NPI value is then translated to a relative priority level for GPU transactions. Fig. 4(b) shows reference progresses achieving 1, 0.75 and 0.5 times the average data rate of the target performance.
NPI_GPU = (frame progress) / (reference progress)    (2)

In the display, the LCD panel reads data from a read buffer at a constant frame rate, while the display controller DMA tries to refill this buffer from DRAM so that it never gets empty. Its health (see Eqn. 3) relies on maintaining the refill rate (R_refill) no lower than the read data rate (R_read), and can be indicated by the variation of the buffer occupancy level (Δoccupancy). Compared with an initial level (e.g., 50%), the lower the occupancy level of this buffer gets, the worse the NPI value becomes, which is in turn translated to a higher priority level (Fig. 4(c)).

NPI_display = R_refill / R_read = 1 + Δoccupancy / (R_read · time)    (3)

Intuitively, one might be concerned that every core would intentionally raise its priority to the maximum level to obtain as many resources as possible. However, this situation should not happen because the priority level is only maximized when the actual performance is far below the target. The system designer has the responsibility to make sure cores have realistic performance targets and enough resources to satisfy all possible combinations of QoS demands. Once the system is fabricated in hardware, heterogeneous cores cannot change their target performance arbitrarily, especially because most of them are fixed-function IP blocks with invariable QoS targets and little programmability.

In our evaluations, the priority levels are quantized into 2^k levels, which can be encoded using k bits. We found that k = 3 bits provides sufficient granularity in priority levels to produce satisfying results (i.e., the priority levels range from 0 to 7).

As transactions travel through the memory system, the system responds to QoS demands by providing resource management based on their priority levels. The priority-based management is performed correspondingly in different parts of the memory system. In on-chip network routers, transactions with higher priorities are preferentially selected during switch allocation.
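As a concrete illustration, the three NPI formulations in Eqns. 1–3 could be sketched as follows; the function names and example values are ours, not from the paper.

```python
# Sketches of the three NPI formulations; an NPI >= 1 means the core is
# meeting its target, and lower values indicate worse intrinsic health.

def npi_dsp(max_latency_limit, avg_latency):
    # Eqn. 1: stays at or above 1 while average latency is under the limit.
    return max_latency_limit / avg_latency

def npi_gpu(frame_progress, reference_progress):
    # Eqn. 2: compares the rendered fraction of the frame with a reference
    # progress that grows proportionally with frame time.
    return frame_progress / reference_progress

def npi_display(delta_occupancy, read_rate, elapsed_time):
    # Eqn. 3: refill rate relative to the constant LCD read rate, inferred
    # from the change in read-buffer occupancy over elapsed time.
    return 1.0 + delta_occupancy / (read_rate * elapsed_time)

print(npi_dsp(100, 80))                # latency under the limit -> NPI > 1
print(npi_gpu(0.40, 0.50))             # behind reference progress -> NPI < 1
print(npi_display(-0.10, 0.02, 10.0))  # buffer draining -> NPI < 1
```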
In the memory controller, when a priority-based scheduler arbitrates among transactions going to available memory banks, the ones with higher priorities have more chances to be served. An example of such a memory scheduling policy is the priority-based round-robin shown in Policy 1. To avoid starvation of transactions with low priorities, the scheduler also needs to consider the aging factor during arbitration. In our evaluations, the scheduler periodically clears the backlog of transactions that have waited for at least T cycles (e.g., T = 10000 cycles).

• Policy 1: Suppose P_A and P_B are the priorities of transactions A and B. If P_A > P_B, choose A; if P_A < P_B, choose B; otherwise choose between A and B in a round-robin manner.

Priorities notify the system whether cores have urgent QoS demands. That gives the memory system an opportunity to optimize memory performance without undermining QoS. Specifically, when transactions are of low urgency, the system can improve memory performance, such as the row-buffer hit rate, instead of focusing on serving QoS demands. Row-buffer hits refer to memory accesses to the same active row-buffer before a precharge. More row-buffer hits mean less time and power are wasted on row activation and precharge operations. Thus, increasing row-buffer hits helps lower memory latency and improve total DRAM bandwidth.

To increase row-buffer hits, the memory controller reorders transactions to favor the ones hitting open rows. This may degrade QoS when transactions of high urgency are postponed due to row-buffer hit optimization. Yet, with priorities, the memory controller is aware of the urgency levels of transactions and is able to avoid delaying urgent transactions during optimization. Policy 2 shows an extension of Policy 1 that increases row-buffer hits without QoS degradation. The parameter δ is an adjustable threshold to balance row-buffer hit optimization and QoS-aware scheduling.
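A minimal software sketch of Policy 1, the aging rule, and the row-buffer-aware Policy 2 extension might look as follows; the transaction representation and helper names are ours, while T = 10000 and δ = 6 are the settings reported in the text.

```python
# Sketch of Policy 1 (priority-based arbitration with round-robin tiebreak),
# the aging rule, and the Policy 2 row-buffer extension. Transactions are
# modeled as dicts; this layout is an assumption of ours.

T_AGE = 10000   # clear transactions that have waited at least T cycles
DELTA = 6       # priority threshold balancing row-buffer hits vs. QoS

def policy1(a, b):
    """Priority-based round-robin between two ready transactions."""
    if a["priority"] != b["priority"]:
        return a if a["priority"] > b["priority"] else b
    # Equal priorities: round-robin, approximated here by oldest-first.
    return a if a["arrival"] <= b["arrival"] else b

def policy2(a, b):
    """Favor the open-row hit while no urgent transaction loses out."""
    if a["row_hit"] != b["row_hit"]:
        hit, other = (a, b) if a["row_hit"] else (b, a)
        if (hit["priority"] < DELTA and other["priority"] < DELTA) \
                or hit["priority"] == other["priority"]:
            return hit
    return policy1(a, b)

def pick(a, b, now):
    """Arbitrate, first draining any transaction older than T_AGE cycles."""
    aged = [t for t in (a, b) if now - t["arrival"] >= T_AGE]
    if len(aged) == 1:
        return aged[0]
    return policy2(a, b)
```

For example, a low-priority transaction hitting an open row wins over a slightly higher-priority miss as long as both priorities stay below δ, but a transaction at priority 7 is never postponed for a row-buffer hit.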
When the priority level is lower than δ, the scheduler focuses on row-buffer hits; otherwise, QoS comes first. A higher δ value gives more favor to DRAM bandwidth, but also potentially causes more disturbance to QoS. We found δ = 6 a good setting to achieve high DRAM bandwidth without causing QoS degradation.

• Policy 2: Suppose transaction A is going to an active row-buffer and B is not. If P_A, P_B < δ or P_A = P_B, choose A. Otherwise, perform priority-based round-robin.

The priority-based resource allocation is able to handle non-partitionable resources with little computation in comparison with previous management models [8, 9, 10]. This facilitates instant response from the memory system to QoS demands.

The implementation of the proposed SARA framework includes three parts: the computation of the NPI value, the translation of the NPI value to a priority level, and the priority-based arbitration in the memory system. To calculate the NPI, a divider is needed at the performance meter of each DMA. For the translation of the NPI, a mapping function can be stored in a look-up table at each core. Each priority level is assigned a table entry, and this entry stores the lowest NPI value allowed at that priority level. For example, if priority = p when NPI ∈ [u, v), the value u will be stored at the entry for p in the look-up table. Note that v will be the lower bound of the NPI for the priority level p − 1. Comparators are needed to access table entries in parallel. If the current NPI value is not lower than the stored lower bound of the NPI value, the corresponding priority level is asserted. When multiple priority levels are asserted, the lowest level is adopted. Suppose each priority level is encoded into three bits;
Table 1: Simulation settings.
Test Cases
  Case A: all cores active, with DRAM @ 1866MHz
  Case B: inactive cores: GPS, camera, rotator and JPEG, with DRAM @ 1700MHz
Memory Controller
  Total entries: 42
  Transaction queues: 5
DRAM
  Volume: 2GB
  Max I/O bus freq.: 1866MHz
  CL-tRCD-tRP (cycles): 36-34-34
  tWTR-tRTP-tWR (cycles): 19-14-34
  tRRD-tFAW (cycles): 19-75
  Channels-Ranks-Banks: 2-2-8
Table 2: Summary of heterogeneous cores and types oftarget performance.
Name: Performance type
  GPU: frame rate
  DSP: latency
  Image Processor: frame rate
  Video Codec: frame rate
  Rotator: frame rate
  JPEG: frame rate
  Camera: buffer occupancy
  Display: buffer occupancy
  GPS: processing time
  WiFi: bandwidth
  USB: bandwidth
  Modem: processing time
  Audio: latency

then a look-up table requires 2^3 = 8 entries, and each entry is a register for the NPI value. A comparator is paired with each table entry. In total, the implementation only costs eight registers and eight comparators per core. In the memory system, performing the priority-based arbitration requires a 3-bit comparator to arbitrate among transactions with different priority levels. Since most existing QoS-aware schedulers already provide hardware support for priorities, our framework can be integrated into the memory system without raising complexity.
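The look-up-table translation just described could be sketched in software as follows; the threshold values are invented for illustration, and only the table semantics (entry p stores the lowest NPI admitted at priority p, comparators fire in parallel, and the lowest asserted level is adopted) come from the text.

```python
# Sketch of the NPI -> priority look-up table. Entry p holds the lower NPI
# bound for level p; priority 7 is the most urgent, so it admits the lowest
# NPI values. These bound values are hypothetical examples.
LOWER_BOUNDS = [1.00, 0.85, 0.70, 0.55, 0.40, 0.25, 0.10, 0.0]  # p = 0..7

def npi_to_priority(npi):
    # In hardware every comparator fires in parallel; here we model the
    # same thing with a comprehension, then adopt the lowest asserted level.
    asserted = [p for p, bound in enumerate(LOWER_BOUNDS) if npi >= bound]
    return min(asserted)
```

With these example bounds, a healthy core (NPI ≥ 1) maps to priority 0, while a core far below target (e.g., NPI = 0.05) maps to the maximum priority of 7.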
4. EVALUATION
In this section, the proposed SARA framework is tested to demonstrate its effectiveness in providing target performance to heterogeneous cores. Two test cases based on the camcorder dataflow (Fig. 2) are used for demonstration. Further, we show that row-buffer hit optimization can be performed efficiently within the SARA framework without performance degradation.

The proposed framework is modeled as in Fig. 3, where memory traffic from every DMA is generated based on a next-generation MPSoC [1]. DRAMSim2 [12] with an LPDDR4 timing model is used for cycle-accurate simulation of DRAM. Table 1 shows the simulation settings. Table 2 lists the simulated cores and the types of target performance.

The target performance for each core is set according to the camcorder dataflow (Fig. 2), which runs at 30fps. For instance, the frame rotator writes and reads 1080p YUV420 images at 30fps, which requires 89MB/s for each DMA and 178MB/s in total.
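The rotator figure can be sanity-checked with quick arithmetic, assuming 1.5 bytes per pixel for YUV420 and MiB-based rates (our assumption about the units):

```python
# Back-of-the-envelope check of the rotator's bandwidth target:
# one 1080p YUV420 frame is 1.5 bytes/pixel, refreshed at 30 fps.
frame_bytes = 1920 * 1080 * 1.5        # bytes in one 1080p YUV420 frame
per_dma = frame_bytes * 30 / 2**20     # MiB/s for one DMA (write or read)
print(round(per_dma))                  # -> 89 (MB/s per DMA)
print(round(2 * per_dma))              # -> 178 (MB/s, write + read)
```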
To begin with, we test the SARA framework in delivering target performance to heterogeneous cores. For comparison, four arbitration policies are used in the memory controller and on-chip network arbiters: first-come-first-serve (FCFS), round-robin (RR), a frame-rate-based QoS policy [3] and the priority-based QoS policy (Policy 1). The FCFS policy serves all transactions according to their arrival order. The round-robin policy separates transactions into different queues and serves them in a round-robin fashion. In the memory controller, we have five transaction queues respectively designated to the CPU, the GPU, the DSP, media cores and system cores. The round-robin policy also applies to on-chip network arbiters, as input queues are served in turn. The frame-rate-based QoS policy prioritizes media cores when they are missing real-time deadlines, but otherwise provides best-effort service to latency-sensitive cores. Finally, the priority-based QoS policy compares priority levels for arbitration and uses round-robin as the tiebreaker.

Figure 5: NPI value of critical cores during one frame period (33ms) for test case A with different arbitration policies.

Figure 6: NPI value of critical cores during one frame period (33ms) for test case B with different arbitration policies.

The NPI of critical cores during a frame period is shown in Fig. 5 when test case A is applied. As explained in Section 3.2, the NPI metric reflects performance: a higher value indicates better performance. When the NPI value drops below 1, it means the target performance is not achieved.

Without reordering memory requests, the FCFS policy ends up spending most of the time serving cores consuming high bandwidth. That easily leads to the starvation of latency-sensitive cores. As shown in Fig. 5(a), the NPI of the GPS drops below 1 because the GPS is overwhelmed by other system cores sharing the same interconnect, such as the USB. Among media cores, the video codec, the rotator and the image processor have all the frame data available at the beginning of a frame period and thus create bursty traffic, while the camera and the display generate and consume data at constant rates determined by the image sensor and the LCD panel. In Fig. 5(a), media cores with bursty traffic obtain most of the bandwidth in the beginning, resulting in high NPI values. On the other hand, the display fails to achieve the target performance.
The display's NPI drops as low as 0.13, which means only 13% of the target performance is achieved.

When the round-robin policy is applied, the competition among media cores becomes more intense since they share the same transaction queue in the memory controller. In Fig. 5(b), the display and the camera both fail due to the interference from other media cores. Less than 10% of their target performance is achieved in the worst case. In the meantime, all the system cores meet their target performance because they avoid the interference from media cores by using a separate transaction queue.

The frame-rate-based QoS policy helps all media cores achieve NPI values above 1 in Fig. 5(c). However, all system cores fail due to the absence of adaptation for cores with QoS targets other than frame rates.

In Fig. 5(d), all the cores reach their target performance when QoS-aware scheduling is performed, because priority-based adaptation helps arbiters serve the cores in urgent need. Note that the NPI of the other cores, such as the GPU, are not shown because no failures were observed for these cores.

The results of test case B are shown in Fig. 6. Similar to Fig. 5, the latency-sensitive DSP suffers when the FCFS policy is adopted (Fig. 6(a)). When the round-robin policy is applied (Fig. 6(b)), the DSP suffers less since it has its own transaction queue, while the display fails due to the increased interference from other media cores sharing the same transaction queue. Again, the frame-rate-based QoS policy fails to serve non-media cores. Finally, the dynamic priorities help the memory system deliver target performance to all cores (Fig. 6(d)).

Next, we take the image processor from test case A as an example to examine priority-based adaptation in a single core. Fig. 7 shows the distributions of the image processor's priority levels during one frame period, as DRAM frequency decreases from 1700MHz to 1300MHz. Each horizontal bar is designated to a certain DRAM frequency.
In a single bar, each block represents the percentage of time during which a certain priority level is adopted. Different shades of blue represent different priority levels, with higher priority levels in darker shades. As shown in Fig. 7, when DRAM frequency is set to 1700MHz, the image processor adapts to a priority of 0 for 90% of the time. As the frequency decreases, fewer memory requests can be processed by DRAM, and more memory interference and competition occur as a result. To maintain target bandwidth, the self-adaptation leads to a gradual increase in priority levels, which can be observed through the increasing area of blocks in dark shades. When DRAM frequency is lowered to 1300MHz, the image processor has a priority of 7 for 60% of the time. In addition, as frequency decreases, the average bandwidth of the image processor remains above the target bandwidth thanks to the priority-based adaptation.
Figure 7: Distributions of the image processor's priority levels during one frame period (33ms) with respect to different DRAM frequencies.
As explained in Section 3.3, row-buffer hit optimization helps improve available DRAM bandwidth. With the knowledge of heterogeneous cores' urgency levels, the memory controller in the SARA framework is capable of optimizing row-buffer hits without degrading system performance.

We compare with another scheduling policy, first-ready first-come-first-serve (FR-FCFS), which prioritizes transactions going to open rows whenever possible, and otherwise schedules transactions based on FCFS. The FR-FCFS policy is expected to achieve the most row-buffer hits and the highest DRAM bandwidth. Fig. 8 shows the average DRAM bandwidth during one frame period when test case A is applied. Five memory scheduling policies are tested: RR, FCFS, QoS (Policy 1), QoS-RB (Policy 2) and FR-FCFS. Fig. 9 shows the NPI of critical cores when QoS-RB and FR-FCFS are adopted. As expected, the FR-FCFS policy achieves the highest bandwidth, but the GPS and the display suffer performance degradations as the price. The bandwidth of QoS-RB is slightly lower (by 1%) than FR-FCFS, but much higher than the other policies. Specifically, the average DRAM bandwidth obtained by the QoS-RB policy is 24%, 12% and 10% higher than the RR, FCFS and QoS policies, respectively. In the meantime, no performance degradation is caused to the heterogeneous cores.
Figure 8: Summary of average bandwidth when different scheduling policies are applied.
Figure 9: NPI value for test case A with respect to the FR-FCFS and QoS-RB scheduling policies.

5. CONCLUSIONS
In this work, we proposed the self-aware resource allocation (SARA) framework for memory management in heterogeneous systems. Lightweight performance meters are distributed in each core to monitor end-to-end QoS at low cost. The priority-based adaptation allows cores to customize their target performance and adjust their priority levels according to the observed performance. The memory system with non-partitionable resources responds to QoS demands by performing priority-based management, which does not require complicated computations. Experimental results show that with the priority-based adaptation and management, the SARA framework helps all the heterogeneous cores achieve their target performance. By comparison, without using priorities, the performance of critical cores can drop below 10% of the target.
6. REFERENCES
ACM/IEEE DAC, 2012.
[4] R. Ausavarungnirun, K. Chang, L. Subramanian, G. H. Loh, and O. Mutlu. Staged memory scheduling: Achieving high performance and scalability in heterogeneous systems. In ACM ISCA, 2012.
[5] Y. Song, K. Samadi, and B. Lin. Single-tier virtual queuing: An efficacious memory controller architecture for MPSoCs with multiple realtime cores. In ACM/IEEE DAC, 2016.
[6] B. Grot, S. W. Keckler, and O. Mutlu. Preemptive virtual clock: A flexible, efficient, and cost-effective QoS scheme for networks-on-chip. In ACM/IEEE MICRO, 2009.
[7] P.-H. Wang, C.-H. Li, and C.-L. Yang. Latency sensitivity-based cache partitioning for heterogeneous multi-core architecture. In ACM/IEEE DAC, 2016.
[8] A. Sharifi, S. Srikantaiah, A. K. Mishra, M. Kandemir, and C. R. Das. METE: Meeting end-to-end QoS in multicores through system-wide resource management. SIGMETRICS Perform. Eval. Rev., 39(1):13–24, June 2011.
[9] H. Hoffmann, J. Holt, G. Kurian, E. Lau, M. Maggio, J. E. Miller, S. M. Neuman, M. Sinangil, Y. Sinangil, A. Agarwal, A. P. Chandrakasan, and S. Devadas. Self-aware computing in the Angstrom processor. In ACM/IEEE DAC, 2012.
[10] J. A. Colmenares, G. Eads, S. Hofmeyr, S. Bird, M. Moretó, D. Chou, B. Gluzman, E. Roman, D. B. Bartolini, N. Mor, K. Asanović, and J. D. Kubiatowicz. Tessellation: Refactoring the OS around explicit resource containers with continuous adaptation. In ACM/IEEE DAC, 2013.
[11] B. Jacob, S. Ng, and D. Wang. Memory Systems: Cache, DRAM, Disk. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 2007.
[12] D. Wang, B. Ganesh, N. Tuaycharoen, K. Baynes, A. Jaleel, and B. Jacob. DRAMSim: A memory-system simulator. In