SARA: Self-Aware Resource Allocation for Heterogeneous MPSoCs
Yang Song†, Olivier Alavoine‡, Bill Lin†
†Electrical and Computer Engineering Department, University of California at San Diego
‡Qualcomm Inc., San Diego
[email protected]
ABSTRACT
In modern heterogeneous MPSoCs, the management of shared memory resources is crucial in delivering end-to-end QoS. Previous frameworks have either focused on singular QoS targets or the allocation of partitionable resources among CPU applications at relatively slow timescales. However, heterogeneous MPSoCs typically require instant response from the memory system, where most resources cannot be partitioned. Moreover, the health of different cores in a heterogeneous MPSoC is often measured by diverse performance objectives. In this work, we propose a Self-Aware Resource Allocation (SARA) framework for heterogeneous MPSoCs. Priority-based adaptation allows cores to use different performance targets and self-monitor their own intrinsic health. In response, the system allocates non-partitionable resources based on priorities. The proposed framework meets a diverse range of QoS demands from heterogeneous cores.
1. INTRODUCTION
Modern heterogeneous MPSoCs [1, 2] have been widely deployed in mobile devices thanks to their energy efficiency. These MPSoCs typically integrate a diverse collection of cores. Fig. 1 depicts an example of a heterogeneous MPSoC. Besides general-purpose cores like the CPU for running applications, most heterogeneous cores are dedicated to certain functions, such as the GPU, the DSP and the display. These cores have diverse notions of Quality-of-Service (QoS). For example, the GPU measures target real-time performance in terms of frame rate; the DSP demands that memory latency remain below a certain limit; and the display requires sufficient bandwidth to refresh frames at a constant rate. To save cost and energy, heterogeneous cores commonly share resources, among which the sharing of the memory system (including the on-chip network and the memory controller) is the most challenging because memory performance often has a direct and substantial impact on system performance. As data is shared through memory, competing memory requests from different cores interfere with each other, and these memory interferences can cause the
Figure 1: Heterogeneous system architecture example.

memory system to fail in meeting the target performance of some cores. Fig. 2 depicts a camcorder application, which represents a typical use case in that it involves many cores at the same time. With ineffective memory scheduling, a real-time core (e.g., the display) may not achieve the target real-time performance due to inadequate memory bandwidth. Moreover, as latency-sensitive cores such as the DSP share memory with other cores, they can be easily overwhelmed by real-time cores consuming high bandwidth.
Figure 2: Simplified dataflow of a camcorder application. Shared memory is represented by boxes with dashed lines and cores by boxes with solid lines.
QoS-aware management for specific types of memory resources has been well-studied in previous work [3, 4, 5, 6, 7]. In [3], a QoS-aware scheduling policy was proposed for CPU-GPU systems. The concept of frame progress was introduced for monitoring GPU performance. Although the policy can be extended to include more media cores, it cannot be applied to real-time cores whose target QoS cannot be assessed in terms of frame rate. Moreover, holistic memory management frameworks for CPU-centric homogeneous systems have also been explored recently [8, 9, 10]. This line of work typically constructs a management model based on control theory to partition computing and memory resources. These frameworks accept flexible QoS targets, as clients are allowed to define their own target performance. Nonetheless, this type of approach operates at a relatively slow timescale (e.g., on the order of milliseconds) due to its computational complexity. In comparison, real-time cores in heterogeneous MPSoCs often demand much more instant response from the memory system. Besides, communication between heterogeneous cores is mainly conducted through shared memory, as shown in Fig. 2, because multimedia data is generally too large to fit in caches. Therefore, DRAM plays a more crucial role in heterogeneous systems. However, previous frameworks cannot handle DRAM effectively because its bandwidth is not partitionable. Specifically, available DRAM bandwidth depends on the memory access pattern, as higher spatial locality results in fewer redundant precharge operations and better memory efficiency.

So far, there has not been a QoS-aware resource management model for heterogeneous MPSoCs that is capable of allocating non-partitionable resources to fleeting QoS demands. In this work, we propose the Self-Aware Resource Allocation (SARA) framework as a solution. The contributions of our work can be summarized as follows.
• We propose a QoS-aware holistic resource management framework for heterogeneous systems. The SARA model accepts diverse notions of QoS and monitors performance distributively with lightweight meters to guarantee end-to-end QoS.
• We introduce priority-based self-adaptation for the management of non-partitionable resources, such as DRAM and the on-chip network, which constitute most of the shared resources in heterogeneous MPSoCs.
• We evaluate the proposed framework using memory traffic of next-generation MPSoCs and show that the proposed SARA model delivers target performance to all cores. In contrast, the performance of critical cores can fall below 10% of their targets without the SARA framework. Further, memory system optimization is performed without QoS degradation.

The rest of this paper is organized as follows: Section 2 briefly reviews related work. Section 3 describes the proposed SARA framework. Experimental results and conclusions follow in Sections 4 and 5.
2. RELATED WORK
Most previous work on QoS-aware resource management in heterogeneous MPSoCs has focused on a single layer of the memory system. In [3], a novel scheduling policy was introduced to dynamically balance bandwidth between the CPU and the GPU based on the frame progress of real-time workloads. The staged memory scheduler [4] was presented as the first QoS-aware memory scheduler for CPU-GPU systems. Further, the single-tier virtual queuing memory controller [5] was proposed to overcome the limitations of two-tier schedulers in QoS-aware scheduling. Besides memory scheduling, QoS-aware cache management [7] and on-chip network design [6] have also been well-explored in recent years. Nonetheless, these works cannot guarantee end-to-end QoS because they only deal with certain parts of the memory system. For example, the QoS provided by the memory controller can be undermined by the interconnect if the latter does not apply the same QoS policy. In addition, implementing a centralized QoS monitor in the memory system can be prohibitive since it needs to collect runtime information from all cores. More limiting, these works assume specific notions of QoS, which is not applicable to modern heterogeneous MPSoCs where the health of different cores is often evaluated by diverse performance objectives.

METE [8] is a multi-level framework for end-to-end resource management based on control theory. It utilizes runtime information to predict application behaviors. Application controllers calculate the amounts of resources required to achieve target application performance, and a global resource broker determines the final resource partitions for applications. SEEC [9] is a self-aware computing framework designed for a many-core processor. It follows the observe-decide-act control loop for resource allocation: the performance of CPU applications is observed by a decision engine, which decides resource partitions using actions defined by system designers.
ARCC [10] is a self-aware computing framework implemented in the Tessellation many-core OS. It performs two-level scheduling: first, a resource allocation broker distributes global resources; then, scheduling policies are customized separately at the user level.

The aforementioned frameworks were intended for CPU-centric multi-core systems. They aim to allocate partitionable resources, such as CPU cores and cache ways, to applications at the software/OS level. They are not suitable for heterogeneous MPSoCs for the following reasons. First, complicated control models may not be fast enough for heterogeneous cores (e.g., these software/OS-level approaches operate at millisecond timescales). For example, the DSP sets a limit on memory latency at the nanosecond level, but prior frameworks need more time to adapt through control-theory computations in the OS. Second, prior work assumes all memory resources are partitionable. However, DRAM bandwidth cannot simply be partitioned like cache ways. In DRAM, the data storage of a memory bank is organized into rows and columns. To access a column, the row where this column is located must be loaded into the row-buffer (a row activation operation) after the other rows are closed (a precharge operation) [11]. These row activation and precharge operations incur a time penalty without contributing to actual data transfer, which makes DRAM bandwidth inconstant and unpredictable.
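To make concrete why row activation and precharge make DRAM bandwidth depend on the access pattern, consider a back-of-the-envelope model; the timing numbers and function below are illustrative assumptions of ours, not figures from the paper.

```python
# Illustrative model of why DRAM bandwidth is not constant: a row-buffer
# hit pays only the data burst, while a miss additionally pays precharge
# and row activation. All timing numbers are hypothetical examples.

def effective_bandwidth(hit_rate, peak_gbps=14.9, t_burst=4, t_rp=34, t_rcd=34):
    """Average usable bandwidth as a function of the row-buffer hit rate."""
    # Average cycles per access: burst, plus precharge + activate on misses.
    avg_cycles = t_burst + (1.0 - hit_rate) * (t_rp + t_rcd)
    return peak_gbps * t_burst / avg_cycles

# Higher spatial locality (more row-buffer hits) yields higher bandwidth.
for hr in (0.2, 0.5, 0.9):
    print(f"hit rate {hr:.0%}: {effective_bandwidth(hr):.2f} GB/s")
```

The point of the sketch is only the trend: with the same peak rate, the bandwidth actually delivered varies several-fold with spatial locality, which is why DRAM bandwidth cannot be statically partitioned.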
3. SELF-AWARE RESOURCE ALLOCATION FRAMEWORK
Figure 3: The proposed SARA framework for heterogeneous MPSoCs. Each core self-monitors its performance and self-adapts its priority, and each resource performs priority-based allocation in a distributed manner.
Figure 4: Examples of priority-based adaptation in heterogeneous cores, including the DSP, GPU and display.
The proposed architecture of the SARA framework is shown in Fig. 3. The resource management model consists of three stages: distributed monitoring, priority-based runtime adaptation, and system response. In the rest of this section, we go through the SARA framework stage by stage.
In the first stage, each core self-monitors its own performance. Distributed monitoring relieves the memory system from the burden of monitoring heterogeneous cores with various notions of QoS. Self-monitoring also provides more accurate feedback on the end-to-end QoS compared with centralized monitoring in the memory system. In addition, implementing lightweight performance meters is good for scalability, because a new core can be added or modified without updating the rest of the system.

Every core customizes its own internal performance meter to measure its own performance or progression against a given target, and the measurement is normalized into a fractional number called a Normalized Performance Indicator (NPI), which is used as an indicator of the core's intrinsic health. In the DSP, the performance meter monitors the average latency of its transactions, while in the display the meter counts the occupancy level of the read buffer. The deviation from the target performance (e.g., latency, occupancy level, etc.) produces the NPI metric. In our framework, each independent DMA (Direct Memory Access) unit is equipped with a performance meter. Note that there are usually multiple DMAs in a single core; for simplicity, we only show one DMA per core in Fig. 3.
In the second stage, each core adapts the relative priority of its transactions based on its NPI value. The NPI value delivered by the performance meter is translated into a relative priority level, which is attached to memory transactions from the same DMA. The priority level is evaluated within on-chip network arbiters and the memory controller as the transaction travels along the way to DRAM. Priority-based arbitration allows the memory system to provide QoS without specifying the heterogeneous QoS for all cores and DMAs. As with performance meters, the formulation of the NPI metric and the adaptation of priority can be implemented differently from core to core, depending on the local target performance. Fig. 4 shows three examples of priority-based adaptation in different cores.

For the DSP, the target performance is to keep the average memory latency lower than the maximum latency limit. The average latency is measured and compared with a pre-set limit to produce the NPI value (see Eqn. 1), which remains at or above 1 when the target performance is achieved. This NPI value is then translated to a relative priority level (Fig. 4(a)). The priority level increases along with the average latency.
NPI_DSP = (maximum latency limit) / (average latency)    (1)

Similarly, cores requesting bandwidth produce NPI metrics by computing the ratio between the average and the target bandwidth. However, frame rate differs from bandwidth, because frame size can be variable and thus a constant frame rate can lead to variable bandwidth. Hence, frame progress [3, 5] is used instead to produce NPI metrics for frame-rate-based cores. Take the GPU as an example: the target is to let the frame progress reach 100% as the current frame period comes to an end. The GPU's NPI value is produced at any time by comparing the frame progress with reference progresses, which grow proportionally with frame time. The NPI value is then translated to a relative priority level for GPU transactions. Fig. 4(b) shows reference progresses achieving 1, 0.75 and 0.5 times the average data rate of the target performance.
NPI_GPU = (frame progress) / (reference progress)    (2)

In the display, the LCD panel reads data from a read buffer at a constant frame rate, while the display controller DMA tries to refill this buffer from DRAM so that it never gets empty. Its health (see Eqn. 3) relies on maintaining the refill rate (R_refill) no lower than the read data rate (R_read), and can be indicated by the variation of the buffer occupancy level (Δoccupancy). Compared with an initial level (e.g., 50%), the lower the occupancy level of this buffer gets, the worse the NPI value becomes, which is in turn translated to a higher priority level (Fig. 4(c)).

NPI_display = R_refill / R_read = 1 + Δoccupancy / (R_read · time)    (3)

Intuitively, one might be concerned that every core would intentionally raise its priority to the maximum level to obtain as many resources as possible. However, this situation should not happen because the priority level is only maximized when the actual performance is far below the target. The system designer has the responsibility to make sure cores have realistic performance targets and enough resources to satisfy all possible combinations of QoS demands. Once the system is fabricated in hardware, heterogeneous cores cannot change their target performance arbitrarily, especially because most of them are fixed-function IP blocks with invariable QoS targets and little programmability.

In our evaluations, the priority levels are quantized into 2^k levels, which can be encoded using k bits. We found that k = 3 bits provides sufficient granularity in priority levels to produce satisfying results (i.e., the priority levels range from 0 to 7).

As transactions travel through the memory system, the system responds to QoS demands by providing resource management based on their priority levels. The priority-based management is performed correspondingly in different parts of the memory system. In on-chip network routers, transactions with higher priorities are preferentially selected during switch allocation.
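As a concrete illustration, the three NPI formulations in Eqns. 1–3 could be sketched as follows; the function names and example values are ours, not from the paper.

```python
# Sketches of the three NPI formulations; an NPI >= 1 means the core is
# meeting its target, and lower values indicate worse intrinsic health.

def npi_dsp(max_latency_limit, avg_latency):
    # Eqn. 1: stays at or above 1 while average latency is under the limit.
    return max_latency_limit / avg_latency

def npi_gpu(frame_progress, reference_progress):
    # Eqn. 2: compares the rendered fraction of the frame with a reference
    # progress that grows proportionally with frame time.
    return frame_progress / reference_progress

def npi_display(delta_occupancy, read_rate, elapsed_time):
    # Eqn. 3: refill rate relative to the constant LCD read rate, inferred
    # from the change in read-buffer occupancy over elapsed time.
    return 1.0 + delta_occupancy / (read_rate * elapsed_time)

print(npi_dsp(100, 80))                # latency under the limit -> NPI > 1
print(npi_gpu(0.40, 0.50))             # behind reference progress -> NPI < 1
print(npi_display(-0.10, 0.02, 10.0))  # buffer draining -> NPI < 1
```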
In the memory controller, when a priority-based scheduler arbitrates among transactions going to available memory banks, the ones with higher priorities have more chances to be served. An example of such a memory scheduling policy is the priority-based round-robin shown in Policy 1. To avoid starvation of transactions with low priorities, the scheduler also needs to consider the aging factor during arbitration. In our evaluations, the scheduler periodically clears the backlog of transactions that have waited for at least T cycles (e.g., T = 10000 cycles).

• Policy 1: Suppose P_A and P_B are the priorities of transactions A and B. If P_A > P_B, choose A; if P_A < P_B, choose B; otherwise choose between A and B in a round-robin manner.

Priorities notify the system whether cores have urgent QoS demands. That gives the memory system an opportunity to optimize memory performance without undermining QoS. Specifically, when transactions are of low urgency, the system can improve memory performance, such as the row-buffer hit rate, instead of focusing on serving QoS demands. Row-buffer hits refer to memory accesses to the same active row-buffer before a precharge. More row-buffer hits mean less time and power are wasted on row activation and precharge operations. Thus, increasing row-buffer hits helps lower memory latency and improve total DRAM bandwidth.

To increase row-buffer hits, the memory controller reorders transactions to favor the ones hitting open rows. This may degrade QoS when transactions of high urgency are postponed due to row-buffer hit optimization. Yet, with priorities, the memory controller is aware of the urgency levels of transactions and is able to avoid delaying urgent transactions during optimization. Policy 2 shows an extension of Policy 1 that increases row-buffer hits without QoS degradation. The parameter δ is an adjustable threshold to balance row-buffer hit optimization and QoS-aware scheduling.
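A minimal software sketch of Policy 1, the aging rule, and the row-buffer-aware Policy 2 extension might look as follows; the transaction representation and helper names are ours, while T = 10000 and δ = 6 are the settings reported in the text.

```python
# Sketch of Policy 1 (priority-based arbitration with round-robin tiebreak),
# the aging rule, and the Policy 2 row-buffer extension. Transactions are
# modeled as dicts; this layout is an assumption of ours.

T_AGE = 10000   # clear transactions that have waited at least T cycles
DELTA = 6       # priority threshold balancing row-buffer hits vs. QoS

def policy1(a, b):
    """Priority-based round-robin between two ready transactions."""
    if a["priority"] != b["priority"]:
        return a if a["priority"] > b["priority"] else b
    # Equal priorities: round-robin, approximated here by oldest-first.
    return a if a["arrival"] <= b["arrival"] else b

def policy2(a, b):
    """Favor the open-row hit while no urgent transaction loses out."""
    if a["row_hit"] != b["row_hit"]:
        hit, other = (a, b) if a["row_hit"] else (b, a)
        if (hit["priority"] < DELTA and other["priority"] < DELTA) \
                or hit["priority"] == other["priority"]:
            return hit
    return policy1(a, b)

def pick(a, b, now):
    """Arbitrate, first draining any transaction older than T_AGE cycles."""
    aged = [t for t in (a, b) if now - t["arrival"] >= T_AGE]
    if len(aged) == 1:
        return aged[0]
    return policy2(a, b)
```

For example, a low-priority transaction hitting an open row wins over a slightly higher-priority miss as long as both priorities stay below δ, but a transaction at priority 7 is never postponed for a row-buffer hit.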
When the priority level is lower than δ, the scheduler focuses on row-buffer hits; otherwise, QoS comes first. A higher δ value gives more favor to DRAM bandwidth, but also potentially causes more disturbance to QoS. We found δ = 6 a good setting to achieve high DRAM bandwidth without causing QoS degradation.

• Policy 2: Suppose transaction A is going to an active row-buffer and B is not. If P_A, P_B < δ or P_A = P_B, choose A. Otherwise, perform priority-based round-robin.

The priority-based resource allocation is able to handle non-partitionable resources with little computation in comparison with previous management models [8, 9, 10]. This facilitates instant response from the memory system to QoS demands.

The implementation of the proposed SARA framework includes three parts: the computation of the NPI value, the translation of the NPI value to a priority level, and the priority-based arbitration in the memory system. To calculate the NPI, a divider is needed at the performance meter of each DMA. For the translation of the NPI, a mapping function can be stored in a look-up table at each core. Each priority level is assigned a table entry, and this entry stores the lowest NPI value allowed at that priority level. For example, if priority = p when NPI ∈ [u, v), the value u will be stored at the entry for p in the look-up table. Note that v will be the lower bound of the NPI for the priority level p − 1. Comparators are needed to access table entries in parallel. If the current NPI value is not lower than the stored lower bound of the NPI value, the corresponding priority level is asserted. When multiple priority levels are asserted, the lowest level is adopted. Suppose each priority level is encoded into three bits;
Table 1: Simulation settings.
Test Cases
  Case A: all cores active, with DRAM @ 1866MHz
  Case B: inactive cores: GPS, camera, rotator and JPEG, with DRAM @ 1700MHz
Memory Controller
  Total entries: 42
  Transaction queues: 5
DRAM
  Volume: 2GB
  Max I/O bus freq.: 1866MHz
  CL-tRCD-tRP (cycles): 36-34-34
  tWTR-tRTP-tWR (cycles): 19-14-34
  tRRD-tFAW (cycles): 19-75
  Channels-Ranks-Banks: 2-2-8
Table 2: Summary of heterogeneous cores and types oftarget performance.
Name: Performance type
  GPU: frame rate
  DSP: latency
  Image Processor: frame rate
  Video Codec: frame rate
  Rotator: frame rate
  JPEG: frame rate
  Camera: buffer occupancy
  Display: buffer occupancy
  GPS: processing time
  WiFi: bandwidth
  USB: bandwidth
  Modem: processing time
  Audio: latency

then a look-up table requires 2^3 = 8 entries, and each entry is a register for the NPI value. A comparator is paired with each table entry. In total, the implementation only costs eight registers and eight comparators per core. In the memory system, performing the priority-based arbitration requires a 3-bit comparator to arbitrate among transactions with different priority levels. Since most existing QoS-aware schedulers already provide hardware support for priorities, our framework can be integrated into the memory system without raising complexity.
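The look-up-table translation just described could be sketched in software as follows; the threshold values are invented for illustration, and only the table semantics (entry p stores the lowest NPI admitted at priority p, comparators fire in parallel, and the lowest asserted level is adopted) come from the text.

```python
# Sketch of the NPI -> priority look-up table. Entry p holds the lower NPI
# bound for level p; priority 7 is the most urgent, so it admits the lowest
# NPI values. These bound values are hypothetical examples.
LOWER_BOUNDS = [1.00, 0.85, 0.70, 0.55, 0.40, 0.25, 0.10, 0.0]  # p = 0..7

def npi_to_priority(npi):
    # In hardware every comparator fires in parallel; here we model the
    # same thing with a comprehension, then adopt the lowest asserted level.
    asserted = [p for p, bound in enumerate(LOWER_BOUNDS) if npi >= bound]
    return min(asserted)
```

With these example bounds, a healthy core (NPI ≥ 1) maps to priority 0, while a core far below target (e.g., NPI = 0.05) maps to the maximum priority of 7.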
4. EVALUATION
In this section, the proposed SARA framework is tested to demonstrate its effectiveness in providing target performance to heterogeneous cores. Two test cases based on the camcorder dataflow (Fig. 2) are used for demonstration. Further, we show that row-buffer hit optimization can be performed efficiently within the SARA framework without performance degradation.

The proposed framework is modeled as in Fig. 3, where memory traffic from every DMA is generated based on a next-generation MPSoC [1]. DRAMSim2 [12] with an LPDDR4 timing model is used for cycle-accurate simulation of DRAM. Table 1 shows the simulation settings. Table 2 lists the simulated cores and the types of target performance.

The target performance for each core is set according to the camcorder dataflow (Fig. 2), which runs at 30fps. For instance, the frame rotator writes and reads 1080p YUV420 images at 30fps, which requires 89MB/s for each DMA and 178MB/s in total.
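The rotator figure can be sanity-checked with quick arithmetic, assuming 1.5 bytes per pixel for YUV420 and MiB-based rates (our assumption about the units):

```python
# Back-of-the-envelope check of the rotator's bandwidth target:
# one 1080p YUV420 frame is 1.5 bytes/pixel, refreshed at 30 fps.
frame_bytes = 1920 * 1080 * 1.5        # bytes in one 1080p YUV420 frame
per_dma = frame_bytes * 30 / 2**20     # MiB/s for one DMA (write or read)
print(round(per_dma))                  # -> 89 (MB/s per DMA)
print(round(2 * per_dma))              # -> 178 (MB/s, write + read)
```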
To begin with, we test the SARA framework in delivering target performance to heterogeneous cores. For comparison, four arbitration policies are used in the memory controller and on-chip network arbiters: first-come-first-serve (FCFS), round-robin (RR), a frame-rate-based QoS policy [3] and the priority-based QoS policy (Policy 1). The FCFS policy serves all transactions according to their arrival order. The round-robin policy separates transactions into different queues and serves them in a round-robin fashion. In the memory controller, we have five transaction queues respectively designated to the CPU, the GPU, the DSP, media cores and system cores. The round-robin policy also applies to on-chip network arbiters, as input queues are served in turn. The frame-rate-based QoS policy prioritizes media cores when they are missing real-time deadlines, but otherwise provides best-effort service to latency-sensitive cores. Finally, the priority-based QoS policy compares priority levels for arbitration and uses round-robin as the tiebreaker.

Figure 5: NPI value of critical cores during one frame period (33ms) for test case A with different arbitration policies.

Figure 6: NPI value of critical cores during one frame period (33ms) for test case B with different arbitration policies.

The NPI of critical cores during a frame period is shown in Fig. 5 when test case A is applied. As explained in Section 3.2, the NPI metric reflects performance: a higher value indicates better performance. When the NPI value drops below 1, it means the target performance is not achieved.

Without reordering memory requests, the FCFS policy ends up spending most of the time serving cores consuming high bandwidth. That easily leads to the starvation of latency-sensitive cores. As shown in Fig. 5(a), the NPI of the GPS drops below 1 because the GPS is overwhelmed by other system cores sharing the same interconnect, such as the USB. Among media cores, the video codec, the rotator and the image processor have all the frame data available at the beginning of a frame period and thus create bursty traffic, while the camera and the display generate and consume data at constant rates determined by the image sensor and the LCD panel. In Fig. 5(a), media cores with bursty traffic obtain most of the bandwidth in the beginning, resulting in high NPI values. On the other hand, the display fails to achieve the target performance.
The display's NPI drops as low as 0.13, which means only 13% of the target performance is achieved.

When the round-robin policy is applied, the competition among media cores becomes more intense since they share the same transaction queue in the memory controller. In Fig. 5(b), the display and the camera both fail due to the interference from other media cores. Less than 10% of their target performance is achieved in the worst case. In the meantime, all the system cores meet their target performance because they avoid the interference from media cores by using a separate transaction queue.

The frame-rate-based QoS policy helps all media cores achieve NPI values above 1 in Fig. 5(c). However, all system cores fail due to the absence of adaptation for cores with QoS targets other than frame rates.

In Fig. 5(d), all the cores reach their target performance when QoS-aware scheduling is performed, because priority-based adaptation helps arbiters serve the cores in urgent need. Note that the NPI of the other cores, such as the GPU, are not shown because no failures were observed for these cores.

The results of test case B are shown in Fig. 6. Similar to Fig. 5, the latency-sensitive DSP suffers when the FCFS policy is adopted (Fig. 6(a)). When the round-robin policy is applied (Fig. 6(b)), the DSP suffers less since it has its own transaction queue, while the display fails due to the increased interference from other media cores sharing the same transaction queue. Again, the frame-rate-based QoS policy fails to serve non-media cores. Finally, the dynamic priorities help the memory system deliver target performance to all cores (Fig. 6(d)).

Next, we take the image processor from test case A as an example to examine priority-based adaptation in a single core. Fig. 7 shows the distributions of the image processor's priority levels during one frame period, as DRAM frequency decreases from 1700MHz to 1300MHz. Each horizontal bar is designated to a certain DRAM frequency.
In a single bar, each block represents the percentage of time during which a certain priority level is adopted. Different shades of blue represent different priority levels, with higher priority levels in darker shades. As shown in Fig. 7, when DRAM frequency is set to 1700MHz, the image processor adapts to a priority of 0 for 90% of the time. As the frequency decreases, fewer memory requests can be processed by DRAM, and more memory interference and competition occur as a result. To maintain target bandwidth, the self-adaptation leads to a gradual increase in priority levels, which can be observed through the increasing area of blocks in dark shades. When DRAM frequency is lowered to 1300MHz, the image processor has a priority of 7 for 60% of the time. In addition, as frequency decreases, the average bandwidth of the image processor remains above the target bandwidth thanks to the priority-based adaptation.
Figure 7: Distributions of the image processor's priority levels during one frame period (33ms) with respect to different DRAM frequencies.
As explained in Section 3.3, row-buffer hit optimization helps improve available DRAM bandwidth. With the knowledge of heterogeneous cores' urgency levels, the memory controller in the SARA framework is capable of optimizing row-buffer hits without degrading system performance.

We compare with another scheduling policy, first-ready first-come-first-serve (FR-FCFS), which prioritizes transactions going to open rows whenever possible, and otherwise schedules transactions based on FCFS. The FR-FCFS policy is expected to achieve the most row-buffer hits and the highest DRAM bandwidth. Fig. 8 shows the average DRAM bandwidth during one frame period when test case A is applied. Five memory scheduling policies are tested: RR, FCFS, QoS (Policy 1), QoS-RB (Policy 2) and FR-FCFS. Fig. 9 shows the NPI of critical cores when QoS-RB and FR-FCFS are adopted. As expected, the FR-FCFS policy achieves the highest bandwidth, but the GPS and the display suffer performance degradations as the price. The bandwidth of QoS-RB is slightly lower (by 1%) than FR-FCFS, but much higher than the other policies. Specifically, the average DRAM bandwidth obtained by the QoS-RB policy is 24%, 12% and 10% higher than the RR, FCFS and QoS policies, respectively. In the meantime, no performance degradation is caused to the heterogeneous cores.
Figure 8: Summary of average bandwidth when different scheduling policies are applied.
Figure 9: NPI value for test case A with respect to the FR-FCFS and QoS-RB scheduling policies.

5. CONCLUSIONS
In this work, we proposed the self-aware resource allocation (SARA) framework for memory management in heterogeneous systems. Lightweight performance meters are distributed in each core to monitor end-to-end QoS at low cost. The priority-based adaptation allows cores to customize their target performance and adjust their priority levels according to the observed performance. The memory system with non-partitionable resources responds to QoS demands by performing priority-based management, which does not require complicated computations. Experimental results show that with the priority-based adaptation and management, the SARA framework helps all the heterogeneous cores achieve their target performance. By comparison, without using priorities, the performance of critical cores can drop below 10% of the target.
6. REFERENCES
ACM/IEEE DAC, 2012.
[4] R. Ausavarungnirun, K. Chang, L. Subramanian, G. H. Loh, and O. Mutlu. Staged memory scheduling: Achieving high performance and scalability in heterogeneous systems. In ACM ISCA, 2012.
[5] Y. Song, K. Samadi, and B. Lin. Single-tier virtual queuing: An efficacious memory controller architecture for MPSoCs with multiple realtime cores. In ACM/IEEE DAC, 2016.
[6] B. Grot, S. W. Keckler, and O. Mutlu. Preemptive virtual clock: A flexible, efficient, and cost-effective QoS scheme for networks-on-chip. In ACM/IEEE MICRO, 2009.
[7] P.-H. Wang, C.-H. Li, and C.-L. Yang. Latency sensitivity-based cache partitioning for heterogeneous multi-core architecture. In ACM/IEEE DAC, 2016.
[8] A. Sharifi, S. Srikantaiah, A. K. Mishra, M. Kandemir, and C. R. Das. METE: Meeting end-to-end QoS in multicores through system-wide resource management. SIGMETRICS Perform. Eval. Rev., 39(1):13–24, June 2011.
[9] H. Hoffmann, J. Holt, G. Kurian, E. Lau, M. Maggio, J. E. Miller, S. M. Neuman, M. Sinangil, Y. Sinangil, A. Agarwal, A. P. Chandrakasan, and S. Devadas. Self-aware computing in the Angstrom processor. In ACM/IEEE DAC, 2012.
[10] J. A. Colmenares, G. Eads, S. Hofmeyr, S. Bird, M. Moretó, D. Chou, B. Gluzman, E. Roman, D. B. Bartolini, N. Mor, K. Asanović, and J. D. Kubiatowicz. Tessellation: Refactoring the OS around explicit resource containers with continuous adaptation. In ACM/IEEE DAC, 2013.
[11] B. Jacob, S. Ng, and D. Wang. Memory Systems: Cache, DRAM, Disk. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 2007.
[12] D. Wang, B. Ganesh, N. Tuaycharoen, K. Baynes, A. Jaleel, and B. Jacob. DRAMSim: A memory-system simulator. In