Thermal Safety and Real-Time Predictability on Heterogeneous Embedded SoC Platforms
UNIVERSITY OF CALIFORNIA, RIVERSIDE

Thermal Safety and Real-Time Predictability on Heterogeneous Embedded SoC Platforms

A Dissertation submitted in partial satisfaction of the requirements for the degree of Doctor of Philosophy in Computer Science

by

Seyed Mehdi Hosseini Motlagh

December 2020

Dissertation Committee:
Prof. Hyoseung Kim, Chairperson
Prof. Nael Abu-Ghazaleh
Prof. Daniel Wong
Prof. Shaolei Ren

Copyright by Seyed Mehdi Hosseini Motlagh, 2020

The Dissertation of Seyed Mehdi Hosseini Motlagh is approved:

Committee Chairperson

University of California, Riverside

Acknowledgments
I would like to thank my research advisor, Dr. Hyoseung Kim, for his guidance and support during all these years. I am greatly indebted to him for his constant inspiration, encouragement, and patience throughout my doctoral study. His exceptional commitment to research and strong demand for excellence have guided me this far, and his valuable feedback contributed greatly to the improvement and progress of my research and dissertation. I feel extremely lucky that I met him and that he let me be a part of his research team.

I thank the rest of my dissertation committee members, Dr. Nael Abu-Ghazaleh, Dr. Daniel Wong, and Dr. Shaolei Ren, for the discussions and directions for future work. Their insightful comments and constructive criticism helped me improve the dissertation in many ways.

I want to thank the people who helped me most closely with my research. I want to thank Dr. Christian R. Shelton for his help, support, and collaboration during the PhD years. I would like to thank Ali Ghahreman-Nezhad for the successful teamwork throughout the years. I am beholden to Dr. Farshad Khunjush for his trust, help, and guidance; I would not have ended up in the US if it wasn't for him. Thanks to the researchers I worked with at Google, Danijela Mijailovic, Yao Qin, Pablo Ruiz Junco, and particularly my managers, Sreekumar Kodaraka and David Ruiz.

In addition, I thank my friends and fellow students at the University of California, Riverside for numerous discussions related to my research work. I sincerely thank all the current and previous members of the Real-Time Embedded and Networked Systems (RTEN) Lab, including Hyunjong Choi, Yecheng Xiang, Mohsen Karimi, Yidi Wang, Abdulrahman Bukhari, and Daniel Enright.

I want to thank my friends all around the world, who contributed to making my PhD years an awesome journey.
In particular, I owe special thanks to Ali Mokari Amiri, Nima Azizi, Mahdi Aminian, Pouya Haghighat, Ehsan Mohandesi, and Omid Jahromi. I am grateful to my Iranian friends at UC Riverside: Seyed Hossein Mostafavi, Joobin Gharibshah, Abbas Roayaei Ardakany, Amir Feghahati, and Amir-Hossein Nodehi Sabet.

Last but not least, I would like to express my gratitude to my parents and my family members for their continuous support and encouragement. Without their tremendous sacrifices and great love, I would not have had the chance to successfully accomplish my PhD.

To my parents, for all the support.

ABSTRACT OF THE DISSERTATION
Thermal Safety and Real-Time Predictability on Heterogeneous Embedded SoC Platforms

by

Seyed Mehdi Hosseini Motlagh

Doctor of Philosophy, Graduate Program in Computer Science
University of California, Riverside, December 2020
Prof. Hyoseung Kim, Chairperson

Recent embedded systems are designed with high-performance System-on-Chips (SoCs) to satisfy the computational needs of complex applications widely used in real life, such as airplane controllers, autonomous driving automobiles, medical devices, drones, and hand-held devices. Modern SoCs integrate multi-core CPUs and various types of accelerators, including graphics processing units (GPUs), digital signal processors (DSPs), and video encoding and decoding units. The performance gain of such SoCs comes at the cost of high power consumption, which in turn leads to high heat dissipation. Uncontrolled heat dissipation is one of the main sources of interference that can adversely affect the reliability and real-time performance of safety-critical applications. The mechanisms currently available to protect SoCs from overheating, such as frequency throttling or core shutdown, may exacerbate the problem as they cause unpredictable delays and deadline misses. Dynamic changes in ambient temperature further increase the difficulty of solving this problem.

This dissertation addresses the challenges caused by thermal interference in real-time mixed-criticality systems (MCSs) built with heterogeneous embedded SoC platforms. We propose a novel thermal-aware system framework with analytical timing and thermal models to guarantee safe execution of real-time tasks under the thermal constraints of a multi-core CPU/GPU integrated SoC. For mixed-criticality tasks, the proposed framework bounds the heat generation of the system at each criticality level and provides different levels of assurance against ambient temperature changes.
In addition, we propose a data-driven thermal parameter estimation scheme that is directly applicable to MCSs built with commercial off-the-shelf multi-core processors to obtain a precise thermal model without using special measurement instruments or access to proprietary information. The practicality and effectiveness of our solutions have been evaluated using real SoC platforms, and our contributions will help develop systems with thermal safety and real-time predictability.

Contents
List of Figures
List of Tables

1 Introduction
Bibliography

List of Figures

2.2 […] τ finishes at 22. b) Busy waiting: τ finishes at 14.
2.3 Example task scheduling with the GPU server under a) the deferrable server policy and b) the sporadic server policy. For both tasks, E_i = 8 and M_{i,1} = M_{i,2} = 2.
2.4 Data transfer in a GPU request a) without and b) with the MOT reservation mechanism.
2.5 The worst-case scenario for CPU-GPU handover delay with a CPU polling server. When a GPU request with E_i is chosen from the GPU waiting queue for execution, it experiences the delay highlighted in blue boxes.
2.6 a) Server design with the given maximum temperature of 95°C when the CPU fan is off. b) Server utilization w.r.t. the given maximum temperature. c) The maximum observed temperature w.r.t. the server replenishment period when the CPU fan is off and CPU utilization is 30%.
2.7 a) and b) The percentage of schedulable tasksets under the given maximum temperature constraint of 95°C when the CPU fan is off and on, respectively. c) Taskset schedulability results with and without the remote blocking enhancement under the polling server policy while the CPU fan is off. d) The percentage of schedulable tasksets w.r.t. the ratio of GPU-using tasks.
2.8 a) Response time of the real-time workzone recognition task without our framework. b) Temperature of CPU and GPU without our framework. c) Response time of the real-time task with our framework. d) Temperature of CPU and GPU with our framework.
3.1 Criticality mode change diagram.
3.2 A generic periodic power signal.
3.3 Current, steady state, and average temperature profiles.
3.4 Temperature change in one period when the CPU operates a) for t_wk time units b) for t_1 and t_2 time units.
3.5 Workload execution pattern in the steady state.
3.6 Power signal of the cores for (a) case I: in-phase, and (b) case II: out-of-phase.
3.7 Temperature profiles of in-phase and out-of-phase cases for a (a) two-core and (b) three-core system. (Blue lines represent the temperature of the cores in the in-phase state and other colors represent the temperature of the cores in out-of-phase states.)
3.8 Comparison of temperature (a) increase and (b) decrease between in-phase and out-of-phase cases.
3.9 The maximum temperature of big cores in the Exynos 5422 captured by a FLIR A325sc IR camera at the operating frequency of 1.4 GHz without a heat sink.
3.10 Search space for the idle server settings.
3.11 Comparison of the experimental results and the model prediction for CPU temperature at different workloads and ambient temperatures.
3.12 a) Utilization versus period and ambient temperature. b) Utilization versus period at different ambient temperatures.
3.13 Shifting time a) from the initial ambient temperature to different final ambient temperatures at various workloads, b) from different initial workloads to different final workloads at an ambient temperature of 23°C.
3.14 Experimental environment using a furnace.
3.15 Experimental and model results for a case study with a period of 50 ms at two ambient temperatures and workloads.
4.1 Parameter estimation scheme.
4.2 a) Auxiliary test for detecting one error in either Y or Y. b) Auxiliary test for detecting one error in either Y or Y. c) Second-tier testing of auxiliary tests.
4.3 Example of adjacency graph of a quad-core CPU.
4.4 Example of adjacency graph of a quad-core CPU.
4.5 Final adjacency graph of a quad-core CPU.
4.6 (a) Idle steady-state temperature of each core at different frequency levels. (b) Temperature increase of CPU cores due to computation.
4.7 The error of the steady-state temperature of CPU cores using different cases in the construction of Y.
4.8 MSE of CPU cores from all settings in different cases.
4.9 Temperature increase when a) core 1, b) core 2, c) core 3, d) core 4 operates fully utilized.
4.10 Floorplan estimation of the Exynos 5422 based on data at 1.4 GHz. (a) The fully-connected graph from the temperature increase data, (b) the graph reduction stage, (c) the CPU affinity graph, (d) estimation of the GPU location relative to the CPU location, and (e) the actual Exynos 5422 floorplan [36].
4.11 Power data of CPU cores in the Exynos 5422. (a) Total power of the big cluster with built-in sensors, (b) the leakage power data; the relative power estimates with different frequency ranges used, (c) comparison between the estimated relative power consumption and the normalized actual power data from built-in power sensors for CA_1, and (d) comparison for CA_2, CA_3, and CA_4.
4.12 The CPU temperature from sensor data and the model when a-d) two cores are fully utilized, e-h) all cores are fully utilized.

List of Tables

Chapter 1

Introduction
Rising demands for data- and computation-intensive applications such as deep learning and big-data-driven applications entail the request for higher-performance processors, e.g., multi-core CPUs, Graphics Processing Units (GPUs), and Tensor Processing Units (TPUs), as well as more aggressive computing technologies such as 3D chips. With this rapidly increasing power trend, the problem of thermal management in systems is becoming acute. My research focuses on enabling thermal-aware real-time computing in next-generation cyber-physical systems (CPS), such as self-driving cars driving across the country, drones flying over volcanic areas, and medical devices for humans, where the systems have to accomplish their missions in a timely manner even under harsh thermal conditions. This is an important problem because the two requirements, real-time and thermal, are fundamental to the reliable operation of such systems, but hard to holistically analyze and optimize with the current state of the art. Solving this problem will bring significant benefits such as improving system safety and reliability, adding more intelligent features, and reducing costs for high-volume products. This dissertation presents novel analytical methodologies and system software techniques to analyze and safely guarantee both requirements.

One of the major factors that plays a role in the temperature increase of modern systems-on-chips (SoCs) is heat generation due to complex computations. There exist two reasons that directly or indirectly result in temperature increase: 1) the increased number of accelerators and CPU cores, and 2) the high power consumption of each subsystem. Nowadays, different types of accelerators, such as TPUs, FPGAs, DSPs, GPUs, and video processing units, as well as multiple heterogeneous CPU cores, are integrated into SoCs to satisfy the various needs of complex applications. The power consumption of active units performing computation generates heat.
There exists heat conduction between these units, which causes a temperature increase on other units even when they are idle. Aside from this, the need for high computation, which can inevitably necessitate a high clock frequency and a large number of transistors, further raises the power consumption of each computing unit. The power consumption is converted into heat, resulting in a significant temperature increase when computation is performed on active computing units. Moreover, an increase in temperature has a considerable impact on the growth of leakage current, which leads to a rise in static power consumption. An increase in power consumption raises the device temperature, thereby resulting again in an increase in power consumption. This detrimental loop causes "thermal runaway" [6].

Ambient temperature is one of the key factors that affect the cooling process of mixed-criticality CPS. Designing cooling packages such as heat sinks and cooling fans can be costly [90]. There is also an extra challenge in enabling thermal safety under various thermal conditions through packaging solutions alone. Active/passive cooling packages attached to SoCs dissipate the heat generated due to computation. These solutions try to transfer heat from SoCs to the environment. Fans and heat sinks are examples of active and passive cooling packages, respectively, that blow air onto the chip or maximize the hot surface area in contact with the surrounding cooling medium, e.g., transferring heat to ambient air by heat convection. These mechanisms are only effective when SoCs are exposed to low-temperature ambient air. At harsh ambient temperatures, heat flux is reduced substantially, meaning heat convection to the air becomes small, because the temperature difference between the operating chip and the air is negligible. Therefore, harsh ambient temperatures reduce the effectiveness of the cooling process while SoCs still generate heat to perform computation.
This condition can occur frequently in practice and may lead to burnout of the SoC. In addition, for many high-performance embedded systems that require thermal safety, such as medical implants or smartphones, packaging solutions are expensive, bulky, or inapplicable.

To protect the processor chip from thermal damage, a set of policies is defined in the thermal governor of the OS for different scenarios. When the chip temperature crosses a trip point, a predefined cooling action is performed, such as frequency throttling or shutting down CPU cores [18, 37, 97]. Such thermal countermeasures, however, lead to timing unpredictability in real-time mixed-criticality systems (MCS) since the deadlines of tasks could be unexpectedly violated by reduced processing speed or temporarily unavailable CPU cores. A single delayed response in the pedestrian detection of a self-driving car can lead to irreparable harm. A delay in the control system of the Patriot missiles, used to protect Saudi Arabia during the Gulf War, led to injuries and enormous economic damage.

This unwanted performance degradation may affect the timing predictability of integrated accelerators as well. In this dissertation, we focus on integrated GPUs because the unique characteristics of GPU operations, e.g., kernel execution on the GPU and interactions between CPU and GPU cores for data transfer, introduce new challenges to thermal-safety design and timing assurance. For instance, miscellaneous operations for GPU segments require CPU intervention. The intermittent performance degradation of CPU cores adds additional delay to the completion of real-time tasks because GPUs are required to spend extra time on miscellaneous operations while other tasks are waiting for access to shared GPUs.
This delay due to CPU slowdown leads to performance unpredictability of real-time GPU-using tasks on multi-core SoC platforms.

Despite its importance, the thermal parameter estimation of commercial off-the-shelf (COTS) processors still remains a challenging problem. The practical use of thermally-safe mechanisms remains largely limited because it is extremely difficult to obtain a precise thermal model of commercial processors without using special measurement instruments or access to proprietary information, such as the power traces of micro-architectural units and detailed floorplan maps.

In this dissertation, we focus on challenges arising from thermal interference on heterogeneous multi-core SoC platforms. We develop novel frameworks supported by analytical timing and thermal models to bound the maximum operating temperature of CPU cores with the assurance of performance predictability in mixed-criticality domains. Additionally, we propose a data-driven thermal parameter estimation scheme that is directly applicable to MCSs built with COTS multi-core processors.

L'Espresso, Vol. XXXVIII, No. 14, 5 April 1992, p. 167.
Thesis Statement:
The timing predictability of real-time tasks in dynamic thermal conditions is achievable on multi-core GPU-integrated SoC platforms by designing a novel thermal-aware system framework with the support of analytical timing and thermal models.
The aim of my PhD work is to test the hypothesis that thermal-aware real-time computing on heterogeneous multi-core CPU and GPU systems can be achieved with novel analytical and software support. To do so, my work develops an ensemble framework for real-time tasks to support various policies under different circumstances while providing analytical foundations to check both thermal and temporal safety. This dissertation includes the following components:
In Chapter 2, we propose a thermal-aware CPU-GPU framework to handle both real-time CPU-only and GPU-using tasks. Our framework satisfies thermal safety under the given thermal constraint on CPU and GPU cores and offers bounded response times to real-time tasks. The framework also introduces two novel mechanisms for GPU requests to reduce task response time.
Contributions.
The contributions of this work are as follows:

• We propose a thermal-aware CPU-GPU server framework for multi-core GPU-integrated real-time systems. We characterize different timing penalties and present a protocol for CPU and GPU thermal servers.
• For real-time predictability on a GPU, we propose a GPU server design with a variant of the sporadic server policy, where the GPU segments of tasks execute with no thermal violation.
• We propose an enhancement to the waiting queue of the GPU server to mitigate the pessimism of a priority-based queue.
• We introduce a miscellaneous-operation-time reservation mechanism for deferrable and sporadic CPU servers to reduce CPU-GPU handover delay and remote blocking time.
• We extensively analyze the thermal safety and task schedulability of CPU and GPU servers with various budget replenishment policies.
The key contribution of our work in Chapter 3 is showing that the problem of thermal-aware real-time scheduling can be decomposed into thermal schedulability (how much CPU budget is usable under thermal constraints) and timing schedulability (whether tasks are schedulable using the given budget).

Under harsh ambient temperatures, a system may not be able to utilize 100% of CPU time even if the CPU runs at the minimum possible frequency with active/passive cooling packages. In such a condition, the only option left to ensure the timing and thermal guarantees of critical tasks is to secure cooling time by suspending less critical tasks. Our work addresses this problem.
Contributions.
The contributions of this work are as follows:

• We show that the problem of thermal-aware real-time scheduling can be decomposed into thermal schedulability (how much CPU budget is usable under thermal constraints) and timing schedulability (whether tasks are schedulable using the given budget). Our thermal schedulability achieves simplicity in timing analysis by ensuring that the budget is guaranteed to be made available for any execution pattern without violating thermal constraints.
• We extensively analyze the thermal safety of a multi-core system and bound the maximum operating temperature that the system can reach. At a specific ambient temperature level, we characterize the worst-case thermal behavior of a system and also determine the minimum time for the system to transition from one criticality level to a lower level.
• We introduce the notion of idle thermal servers that allow bounding the maximum operating temperature caused by multiple preemptive active servers scheduled dynamically on a multi-core processor for a given mixed-criticality taskset.

1.1.3 Chapter 4: Data-driven Thermal Parameter Estimation for COTS-based Mixed-criticality Systems
In Chapter 4, we propose a fast and accurate scheme to estimate the thermal parameters of COTS multi-core processors for real-time MCS. Our scheme requires only a small number of temperature traces from on-chip thermal sensors, which are widely available in today's processors. Our scheme also improves the accuracy of thermal parameters through the ensemble of measurements from different frequency levels and execution patterns.
Contributions.
The contributions of this work are as follows:

• We present a thermal estimation scheme that has low computational cost by design. Given that steady-state profiles are much more compact than transient-state profiles, our scheme first estimates the thermal parameters of a given system using only steady-state profiles, and then uses transient-state data for calibration purposes.
• We characterize various sources of errors in thermal parameter estimation and reduce their negative effects through the multiple refinement stages of our scheme. Our scheme also enables locating errors in the temperature profiles.
• Our scheme can identify the relative distance between CPU cores and produce an estimated chip floorplan from temperature profiles. It can also estimate the relative power consumption for a given workload on each CPU core.
• We present techniques to further improve the accuracy of thermal parameters by exploiting the ensemble of measurement data obtained at various frequency and workload settings.

Chapter 2
Thermal-Aware Servers for Real-Time Tasks on Multi-Core GPU-Integrated Embedded Systems
The recent trend in real-time applications raises the demand for powerful embedded systems with GPU-CPU integrated systems-on-chips (SoCs). This increased performance, however, comes at the cost of power consumption and resulting heat dissipation. Heat conduction interferes with the execution time of tasks running on adjacent CPU and GPU cores. The violation of thermal constraints causes timing unpredictability for real-time tasks due to transient performance degradation or permanent system failure. In this chapter, we propose a thermal-aware server framework to safely upper-bound the maximum temperature of GPU-CPU integrated systems running real-time sporadic tasks. Our framework supports variants of real-time server policies for CPU and GPU cores to satisfy both thermal and timing requirements. In addition, the framework incorporates two mechanisms, miscellaneous-operation-time reservation and pre-ordered scheduling of GPU requests, which significantly reduce task response time. We present analysis to design the thermal-server budget and to check the schedulability of CPU-only and GPU-using sporadic tasks. The thermal properties of our framework have been evaluated on a commercial embedded platform. Experimental results with randomly-generated tasksets demonstrate the performance characteristics of our framework with different configurations.
High temperature in embedded systems with modern systems-on-chips (SoCs) causes several major issues. An increase in temperature has a considerable impact on the growth of leakage current, which leads to a rise in static power consumption. An increase in power consumption raises the device temperature, thereby resulting again in an increase in power consumption. This detrimental loop causes not only rapid battery drain but also "thermal runaway" [6]. Furthermore, studies show that operating at high temperature reduces system reliability substantially [87]. For instance, a 10°C to 15°C increase in temperature doubles the probability of failure of the underlying electronic devices [93]. Thermal violations not only reduce the reliability of real-time implantable medical devices, but can also cause physical harm [12]. Therefore, bounding the maximum temperature is an important issue, especially when real-time requirements have to be satisfied.

Thermal management on today's multi-core CPU-GPU integrated platforms with real-time requirements is a challenging problem. Dynamic Thermal Management (DTM) is triggered when a thermal violation occurs, forcing frequency throttling or shutdown of the SoC for cooling purposes. This unwanted performance degradation leads to timing unpredictability in task execution, and real-time tasks may miss their deadlines. Thermal violation avoidance in uni-processor systems has been studied extensively in the real-time systems literature [12, 95], but it cannot be directly used for multi-core GPU-integrated devices due to heat conduction between processor units. Dynamic Voltage and Frequency Scaling (DVFS) techniques to mitigate the heat and power dissipation of processors have also been widely studied in the literature [95, 17].
However, aside from a considerable reduction in system reliability over time due to continuous frequency changes [43, 57, 99], not all embedded devices support DVFS, especially for integrated GPUs.

Despite the popularity of integrated GPUs in modern multi-core SoCs, state-of-the-art approaches are incapable of simultaneously addressing thermal management and real-time schedulability issues. On the one hand, the GPU access segment of a real-time task has been modeled as a critical section to ensure schedulability [23, 24, 69]. These approaches, however, are oblivious to thermal constraints, so the system may suffer from heat dissipation and intermittent performance drops by DTM. On the other hand, there are previous studies [3, 2] introducing the concept of thermal servers for real-time uni-processor and multi-core platforms. However, their schemes are not ready to use for a multi-core SoC with an integrated GPU. The unique characteristics of GPU operations, e.g., kernel execution on the GPU and interactions between CPU and GPU cores for data transfer, introduce new challenges to thermal server design and schedulability analysis. To the best of our knowledge, there is no prior work that offers both thermal safety and real-time schedulability in multi-core GPU-integrated embedded systems.

In this chapter, we propose a thermal-aware CPU-GPU framework to handle both real-time CPU-only and GPU-using tasks. Our framework enhances the notion of thermal servers to satisfy the given thermal constraint on CPU and GPU cores and to offer bounded response times to real-time tasks. The framework also introduces two mechanisms, miscellaneous-operation-time reservation and a pre-ordered waiting queue for GPU requests, to reduce task response time. We will show with experimental results that our framework is effective in satisfying both thermal and temporal requirements.
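To make the timing hazard concrete, the following toy calculation (all numbers, including the trip point, frequencies, and task parameters, are invented for illustration and are not from the evaluated platform) shows how reactive DTM throttling can turn a job that meets its deadline into one that misses it:

```python
# Illustrative only: one job of a periodic task under reactive DTM throttling.
# Trip point, frequencies, cycle count, and deadline are made-up values.
TRIP_C = 85.0             # hypothetical DTM trip point (deg C)
F_NOM, F_THR = 2.0, 1.0   # nominal vs. throttled clock frequency (GHz)
WCET_CYCLES = 16.0        # work per job, in giga-cycles
DEADLINE_MS = 10.0        # relative deadline (ms)

def response_time_ms(chip_temp_c):
    """Execution time of one job, assuming execution time scales as cycles/frequency."""
    freq = F_THR if chip_temp_c >= TRIP_C else F_NOM
    return WCET_CYCLES / freq

print(response_time_ms(70.0))  # 8.0  -> meets the 10 ms deadline
print(response_time_ms(90.0))  # 16.0 -> deadline miss once DTM throttles
```

The point of the sketch is that the same workload becomes unschedulable purely as a side effect of the thermal countermeasure, which is exactly the unpredictability the proposed thermal-aware servers are designed to avoid.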
Contributions.
The contributions of this chapter are as follows:

• We propose a thermal-aware CPU-GPU server framework for multi-core GPU-integrated real-time systems. We characterize different timing penalties and present a protocol for CPU and GPU thermal servers.
• For real-time predictability on a GPU, we propose a GPU server design with a variant of the sporadic server policy, where the GPU segments of tasks execute with no thermal violation.
• We propose an enhancement to the waiting queue of the GPU server to mitigate the pessimism of a priority-based queue.
• We introduce a miscellaneous-operation-time reservation mechanism for deferrable and sporadic CPU servers to reduce CPU-GPU handover delay and remote blocking time.
• We extensively analyze the thermal safety and task schedulability of CPU and GPU servers with various budget replenishment policies.
Real-time GPU management has been studied to handle GPU requests with the goal of improving timing predictability [46, 45, 47, 102]. To guarantee the schedulability of GPU-using real-time tasks, synchronization-based approaches [23, 24, 69] that model GPU access segments as critical sections have been developed. The work in [48, 49] introduces a dedicated GPU-serving task as an alternative to the synchronization-based approaches. While these prior studies have established many fundamental aspects of real-time task schedulability with GPUs, thermal violation issues have not been considered.

There exist extensive studies on bounding the maximum temperature in real-time uni-processor systems [12, 59] and non-real-time heterogeneous multi-core systems [84, 71, 78, 70, 36]. In [12], the authors proposed a novel scheme that bounds the maximum temperature by analyzing task execution and cooling phases in uni-processor real-time systems. Real-time thermal-aware resource management for uni-processor systems has been developed with consideration of varying ambient temperature and diverse task-level power dissipation [59]. For non-real-time heterogeneous systems, Singla et al. [84] proposed a dynamic thermal and power management algorithm to adjust the frequency of the GPU and CPU, as well as the number of active cores, by computing the power budget. Prakash et al. [71] proposed a control-theory-based dynamic thermal management technique for mobile games. Gong et al. [36] presented a thermal model on a real-life heterogeneous mobile platform. In [70] and [78], the authors proposed proactive frequency scheduling for heterogeneous mobile devices to maintain their performance. However, these studies cannot be directly applied to real-time multi-core GPU-integrated systems where the GPU is shared among tasks.
The notion of periodic thermal-aware servers was proposed in [2] for uni-processors. In this work, the optimal server utilization was proved, and the budget replenishment period is determined by a heuristic algorithm. Similar to the thermal server, the notion of cool shapers was proposed in [52] to satisfy the maximum temperature constraint by throttling task execution. The notion of hot tasks was introduced in [41] to partition lengthy tasks into several chunks to avoid continuous task execution and thermal violation while maximizing throughput. Although all of the aforementioned studies have made valuable contributions, they were proposed for uni-processor platforms. In contrast, our framework addresses the temperature bounding problem for real-time tasks with GPU segments running on modern CPU-GPU integrated SoCs.

Recently, the authors of [20] introduced a novel technique for periodic tasks executing on multi-core platforms. This technique introduces an Energy Saving (ES) task that runs with the highest priority and captures the sleeping time of CPU cores. The technique can be seen as an alternative to a thermal server because the ES task effectively models the budget-depleted duration of a thermal server. The authors of [3] proposed thermal-isolation servers that avoid thermal interference among tasks in the temporal and spatial domains with thermal composability. These techniques, however, cannot address the challenges of scheduling GPU-using real-time tasks.
In this section, we briefly introduce the thermal model used in this chapter. It depends on power consumption, heat dissipation, and the conductive heat transfer between adjacent power-consuming resource components, which include CPU cores, CPU peripherals, the GPU, caches, and other IPs.
Uni-processor thermal model.
With respect to the power function [60, 13], the thermal model of a uni-processor is modeled as an RC circuit in the literature [13, 80, 85]. According to Fourier's Law [94] and taking the start of execution as time 0, the temperature θ(t) of a core after t time units operating at a fixed clock frequency is given by:

θ(t) = α + (θ(0) − α) e^{βt}   (2.1)

where α > 0 and β < 0. The idle state (aka cooling phase), where the frequency is switched off, can be modeled as:

θ(t) = θ(0) e^{βt}   (2.2)

because α = 0 in the cooling phase. Details are available in [20] and [3].

Heterogeneous multi-core thermal model. With the presence of multiple cores and other power-consuming resources, there is heat dissipation not only to the ambient but also between nodes. In this chapter, we only consider CPU and GPU cores as power-consuming nodes because other IPs consume much less power than them, thereby causing negligible thermal effects.

In such a CPU-GPU integrated system with lateral thermal conductivity between processing cores, the temperature of a core depends not only on its own current temperature and power consumption, but also on those of adjacent cores. Prior work [27] showed that the temperature of each core at time t + ∆ can be modeled with acceptable accuracy as follows:

Θ(t + ∆) = A × P(t + ∆) + Γ × Θ(t)

where A and Γ are m × m matrices, and P and Θ are m × 1 vectors of node power consumption and temperature, respectively. Hence, denoting by θ_i(t_0) the temperature of node i at time t_0, its temperature after t further time units is given by:

θ_i(t_0 + t) = α + (θ_i(t_0) − α) e^{βt} + Σ_{j=1}^{m} γ_{ij} θ_j(t_0 + t).   (2.3)

In this section, we describe the thermal-aware server as well as the task models used in this chapter and explain the procedure of a kernel launch on a GPU. Then, we characterize the scheduling penalties that arise from the use of a GPU with thermal-aware servers.
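As a concrete illustration of the heating and cooling behavior in Eqs. 2.1 and 2.2, consider the following minimal sketch. The parameter values are illustrative placeholders, not measurements from any platform discussed in this chapter.

```python
import math

def heat(theta0: float, t: float, alpha: float, beta: float) -> float:
    """Eq. 2.1: temperature after running t time units at a fixed frequency.
    alpha > 0 is the temperature the core converges to under continuous
    execution; beta < 0 governs how fast it converges."""
    return alpha + (theta0 - alpha) * math.exp(beta * t)

def cool(theta0: float, t: float, beta: float) -> float:
    """Eq. 2.2: cooling phase (alpha = 0, core inactive)."""
    return theta0 * math.exp(beta * t)

# Illustrative parameters: alpha = 80, beta = -0.05 per time unit
alpha, beta = 80.0, -0.05
theta = 40.0
theta = heat(theta, 10.0, alpha, beta)  # run 10 units: rises toward alpha
theta = cool(theta, 10.0, beta)         # sleep 10 units: decays toward 0
```

Alternating `heat` and `cool` phases in this way is exactly the active/sleep pattern the thermal-aware servers of the following sections exploit.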
We consider a temperature-constrained embedded platform equipped with an integrated CPU-GPU SoC. The SoC has multiple CPU cores and one GPU core, each running at a fixed clock frequency. Note that such an SoC design with a single GPU is popular in today's embedded processors, such as the Samsung Exynos 5422 and NVIDIA TX2. The assumption of a fixed operating frequency is particularly suitable for the GPU, as DVFS capabilities are not widely supported on embedded on-chip GPUs. The thermal behavior of CPU and GPU cores follows the model described in the previous section. For simplicity, we assume that the heat generated by other SoC components, such as peripherals and caches, is either negligible or acceptably small.
We consider one thermal-aware server for each CPU and GPU core. Each server is statically associated with one core and does not migrate to another core at runtime. However, unlike prior work [3, 2], we do not limit the server to follow only the polling server policy. We will show in a later section that this flexibility brings significant benefit to the schedulability of tasks accessing the GPU.

To bound the temperature of each core, its corresponding server v_i is modeled as v_i = (C_i^v, T_i^v), where C_i^v is the maximum execution budget and T_i^v is the budget replenishment period of v_i. (Running multiple thermal servers on each CPU/GPU core is left for future work.) For brevity, we will use v^g = (C^g, T^g) to denote the GPU server and v^c = (C^c, T^c) for the CPU server. For budget replenishment policies, we consider polling [82], deferrable [88], and sporadic servers [86]. Under the polling server policy, the server activates periodically and executes ready tasks until its budget is depleted. The budget is fully replenished at the start of the next period. If there is no ready task, the remaining budget is immediately depleted. In contrast, under the deferrable server, any unused budget is preserved until the end of the period. Hence, a task can execute at any time during the server period while budget is available. The sporadic server also preserves remaining budget, but replenishes the budget sporadically; only the amount of budget consumed is replenished, T^v time units after the time when that budget was used. Let J^v denote the task release jitter relative to the server release. The value of J^v is T^v under the polling server policy and T^v − C^v under the deferrable and sporadic server policies [9].

This work considers sporadic tasks with implicit deadlines under partitioned fixed-priority preemptive scheduling, which is widely used in many real-time systems.
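The server abstraction and its release-jitter rule can be captured in a few lines; the `Server` type and policy names below are illustrative, not part of the chapter's notation.

```python
from dataclasses import dataclass

@dataclass
class Server:
    """A thermal-aware server v = (C, T): budget C per replenishment period T."""
    budget: float   # C^v, maximum execution budget
    period: float   # T^v, budget replenishment period
    policy: str     # "polling", "deferrable", or "sporadic"

    def release_jitter(self) -> float:
        """Task release jitter J^v relative to the server release:
        T^v for a polling server, T^v - C^v for deferrable/sporadic [9]."""
        if self.policy == "polling":
            return self.period
        return self.period - self.budget

# Example: a CPU server v^c = (4, 5)
print(Server(4, 5, "polling").release_jitter())   # 5
print(Server(4, 5, "sporadic").release_jitter())  # 1
```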
Each task τ_i is statically allocated to one CPU core (and thus to the server of that core) with a unique priority. Tasks are labeled in increasing order of priority, i.e., i < j implies τ_i has lower priority than τ_j. Without loss of generality, each task contains at most one GPU segment; the model can be easily extended to multiple GPU segments. A task τ_i is modeled as τ_i = ((C_{i,1}, E_i, C_{i,2}), T_i, s_i), where C_{i,1} and C_{i,2} are the worst-case execution times (WCETs) of the normal execution segments of task τ_i, and E_i is the worst-case time of the GPU access segment.

Figure 2.1: Task execution with a GPU segment.

The normal execution segments run entirely on the CPU core and the GPU segment involves GPU operations. Let T_i denote the minimum inter-arrival time of τ_i, and let s_i indicate whether τ_i has a GPU segment, i.e., s_i = 1 means a GPU-using task. In case τ_i executes only on the CPU, s_i, E_i, and C_{i,2} are all zero. Thus, the accumulated sum of the WCETs of τ_i is denoted as C_i = s_i × (C_{i,1} + E_i + C_{i,2}) + (1 − s_i) × C_{i,1}. Furthermore, V(τ_i) represents the CPU server to which τ_i is assigned. Tasks are considered fully compute-intensive and independent from each other during normal segment execution. The only resource shared among tasks is the GPU, and it is modeled as a critical section protected by a suspension-based mutually-exclusive lock (mutex). Note that this approach follows the well-established locking-based real-time GPU access schemes [69, 23, 24]. We will later present how pending GPU requests are queued in our proposed framework.

2.4.4 GPU Execution Model

The GPU has its own memory region, which is assumed to be sufficiently large for the tasks under consideration. We do not consider the concurrent execution of GPU requests from different tasks because of the resulting unpredictability in kernel execution time [69, 65]. Once a task acquires the GPU lock, its GPU segment is handled through the following steps (see Fig. 2.1):
1. Data Transfer to the GPU: The task first copies the data needed for the GPU computation from CPU memory to GPU memory. This can be done by Direct Memory Access (DMA), which requires minimal CPU intervention. If the GPU uses a unified memory model, this step can be omitted.

2. Kernel Launch: The kernel launches on the GPU. Meanwhile, the task on the CPU side self-suspends and waits for the GPU computation to complete.

3. Kernel Notification Signal: The GPU signals the CPU to notify the completion of kernel execution.

4. Data Transfer to the CPU: The task wakes up and transfers the results from GPU memory to CPU memory.

It is worth noting that a GPU kernel cannot self-suspend on the GPU in the middle of execution. (Although GPU kernel preemption is available on some recent GPU architectures, e.g., NVIDIA Pascal [19], to the best of our knowledge, the self-suspension of a kernel is not supported in any of today's GPU architectures.) On the other hand, since there is no CPU intervention during kernel execution, the task on the CPU side self-suspends to save CPU cycles. As a result, other tasks have a chance to execute, or the CPU core sleeps during the kernel execution. For a task τ_i, the total time for the above four steps consists of two major parts:

• Miscellaneous operations that require CPU intervention. Let M_{i,1} denote the time for data transfer before the kernel execution and M_{i,2} denote that after the kernel execution.

• Pure GPU kernel operations that do not require any CPU intervention, denoted as K_i.

In this respect, the GPU segment time of a task τ_i is modeled as E_i = M_{i,1} + K_i + M_{i,2}. As the GPU is non-suspendable during kernel execution, the GPU server should have a budget larger than or equal to E_i. The CPU server only needs to have a budget larger than M_{i,1} or M_{i,2}.

Aside from the blocking delays coming from the locking-based GPU access approach, e.g., local and remote blocking to acquire the GPU lock [69, 23, 24], there are other new challenges faced by thermal-aware servers in a CPU-GPU integrated system:

• Server budget depletion: Task execution in a server is scheduled with respect to the available budget. When the server budget is depleted, a task has to wait until the budget is replenished.

• Mutual budget availability: If a task τ_i issues a GPU request, both the CPU and GPU servers must have enough budget to handle this request. It is worth noting that the budget needs of the CPU and GPU servers are different (M_{i,1} or M_{i,2} vs. E_i).

• CPU-GPU handover delay: Even if both servers are designed to have enough budget, each server may have to wait for the other's budget to be replenished when their interactions are needed, e.g., data transfer between the CPU and the GPU.

• Back-to-back heat generation: In the case of the deferrable server, some tasks can use up all the budget at the end of the budget replenishment period, and at the very beginning of the next period, other tasks can start to consume the replenished budget. This causes the server to run longer than its budget and generate heat in a back-to-back manner.
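The task model above, including the GPU segment decomposition E_i = M_{i,1} + K_i + M_{i,2}, can be sketched as a small data type; the field names are illustrative, not the chapter's notation.

```python
from dataclasses import dataclass

@dataclass
class Task:
    """tau_i = ((C_i1, E_i, C_i2), T_i, s_i) from the task model."""
    c1: float      # C_{i,1}: WCET of the first normal segment
    m1: float      # M_{i,1}: CPU-side data transfer before the kernel
    k: float       # K_i: pure GPU kernel time (no CPU intervention)
    m2: float      # M_{i,2}: CPU-side data transfer after the kernel
    c2: float      # C_{i,2}: WCET of the second normal segment
    period: float  # T_i: minimum inter-arrival time (implicit deadline)
    s: int         # s_i = 1 iff the task has a GPU segment

    @property
    def e(self) -> float:
        """GPU segment time E_i = M_{i,1} + K_i + M_{i,2}."""
        return self.m1 + self.k + self.m2

    @property
    def wcet(self) -> float:
        """C_i = s_i (C_{i,1} + E_i + C_{i,2}) + (1 - s_i) C_{i,1}."""
        return self.s * (self.c1 + self.e + self.c2) + (1 - self.s) * self.c1

# A GPU-using task with C_{i,1} = C_{i,2} = 2, M_{i,1} = M_{i,2} = 2, K_i = 4
t = Task(2, 2, 4, 2, 2, 25, 1)
assert t.e == 8 and t.wcet == 12
```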
In this section, we present our proposed framework. We first give a protocol for CPU and GPU thermal servers, and then explain the thermal server design that addresses the challenges discussed in the previous section. We lastly describe a miscellaneous operation time (MOT) reservation mechanism that trades off server budget against GPU waiting time.
Our framework simultaneously bounds the worst-case blocking time for GPU access and the maximum temperature of the CPU and GPU cores. The thermal servers designed with our framework isolate the thermal conductive effects of each compute node from other nodes. To achieve these properties, we establish the following rules in our framework.
Shared GPU Server
1. Pending GPU requests are inserted into a priority queue that orders the requests based on the priorities of the corresponding tasks. This rule ensures that the GPU request of a high-priority task is blocked by at most one lower-priority request in the queue.

2. To handle the GPU request of a task τ_i, there must be at least E_i budget available on the GPU server. This rule is due to the non-preemptive and non-self-suspending characteristics of the GPU.

3. If there is an insufficient amount of budget to launch a GPU segment, the GPU is locked until there is enough budget for that GPU segment. This rule ensures that starvation does not happen and that the critical section of a higher-priority task waits for just a single critical section of a lower-priority task. With these rules, remote blocking time can be bounded. Later, we will propose an enhancement to these rules.

Figure 2.2: Example task scheduling with an unlimited GPU server budget and a CPU server following the polling server policy, for τ_1 = ((2, 10, 2), 25, 1) on v^c = (4, 5). a) No busy waiting: τ_1 finishes at 22. b) Busy waiting: τ_1 finishes at 14.

CPU Core Server
1. Servicing a GPU request boosts the priority of the corresponding task to the highest priority level. As a result, the normal segments of higher-priority tasks are blocked until the GPU segment completes. If there is not enough budget on the corresponding CPU server (e.g., M_{i,1}) or on the GPU server (e.g., E_i), no other task can execute on the CPU or the GPU.

2. During data transmission to/from the GPU, the CPU server "busy-waits". The reasoning behind this rule is to reduce the total response time of a task. Fig. 2.2 illustrates task scheduling with and without this rule on a CPU server using the polling server policy. As one can see in this example, without this rule, it takes one additional replenishment period for transferring data to/from the CPU: while the GPU is processing the transfer request, the CPU server has no other workload to execute, so it deactivates until the beginning of the next replenishment period.

We now discuss budget replenishment policies for the GPU server and their implications. One may consider both the CPU and GPU servers following the polling server policy. In this case, the CPU-GPU handover delay can be at least three complete replenishment periods of a CPU server. For instance, consider a task holding the GPU lock with its kernel executing on the GPU. Once the kernel execution completes, the GPU has to wait for the task on the CPU to transfer the results and release the lock. It is possible that at this time, the CPU server budget has already been depleted. Thus, the GPU has to wait for the next period of the CPU server, and this kind of extra delay happens for each sub-segment of the GPU segment, i.e., M_{i,1}, K_i, and M_{i,2}.

Figure 2.3: Example task scheduling with the GPU server under a) the deferrable server policy and b) the sporadic server policy, for τ_1 = ((2, 8, 0), 30, 1) on v^c = (4, 5), τ_2 = ((0, 8, 0), 40, 1), and v^g = (16, 32). For both tasks τ_1 and τ_2, E_i = 8 and M_{i,1} = M_{i,2} = 2.

Moreover, since operations on the GPU are non-preemptive and non-suspendable, the GPU server budget has to be large enough that at least one entire GPU sub-segment of M_{i,1}, K_i, or M_{i,2} can complete its execution within the same period. However, having a large replenishment period would exacerbate the response time of GPU segments, especially for small kernels.

One may consider busy waiting between GPU sub-segments so as to fill their execution gaps, because such gaps may deactivate the GPU's polling server. This approach, however, not only requires over-provisioning of the GPU budget, but also produces a considerable amount of heat. In the worst case, the extra heat generation due to busy-waiting on the GPU server may continue over two periods of the CPU server, because a CPU server with the polling server policy can be deactivated between GPU sub-segments.

One may suggest the deferrable server policy for the GPU to mitigate the heat generation issue and to minimize the response time of a GPU segment. This approach, however, leads to the thermal back-to-back execution phenomenon. Fig. 2.3a illustrates an example of possible drawbacks of the deferrable server chosen for the GPU. The first jobs of τ_1 and τ_2 arrive at the latest moments in the first GPU server period, and the second job of τ_1 arrives at the beginning of the second GPU server period. Although the tasks are schedulable with the given server budgets, this causes burst heat generation by back-to-back execution, which can potentially lead to thermal violation. To avoid this, the budget of the deferrable server for the GPU has to be halved, but this can drastically lower task schedulability.

In contrast, the sporadic server policy can take the merits of both the polling server and deferrable server policies.
If the sporadic server policy is used for the GPU, a GPU segment can execute at any time as long as there is enough budget, and back-to-back heat generation does not occur. Fig. 2.3b illustrates the previous example with the sporadic server on the GPU. As can be seen, unlike in the deferrable server case, each budget replenishment of the GPU server is one period apart from its consumption time, thereby preventing potential thermal violation. The sporadic server on the GPU is also practically effective: the GPU server needs a relatively large budget with a long replenishment period due to its non-preemptive nature, whereas the CPU server typically has a short replenishment period to reduce task response time.

In summary, due to the aforementioned reasons, our framework specifically uses the sporadic server policy for the GPU, while all three policies (polling, deferrable, and sporadic) are allowed for CPU cores. We also set the budget of the GPU sporadic server to be at least as large as one complete GPU segment of any task, i.e., max(E_i), because this can reduce the response time of a GPU segment and the remote blocking time. The detailed analysis of these delays will be presented in Section 2.6.

The thermal server for the GPU is a resource abstraction managed on the CPU side, similar to other real-time GPU management schemes [69, 48, 49, 23, 24]. One can implement the GPU thermal server as part of GPU drivers or application-level APIs.

To reduce the CPU-GPU handover delay, we propose a miscellaneous operation time (MOT) reservation mechanism. With this mechanism, a small portion of the CPU server budget is reserved only for miscellaneous operations in a GPU segment, e.g., transferring data to/from the GPU. The MOT reservation is feasible with the deferrable and sporadic server policies, but not with the polling server policy, because the polling server is unable to keep unused budget by design.
Although this MOT reservation reduces the amount of CPU budget available for regular task execution, it guarantees that the GPU does not need to wait for the budget replenishment of the CPU server during the data transmission phases of a GPU segment. It can also reduce the remote blocking time of other tasks. The budget reserved for MOT has to be the largest amount of CPU intervention time among all GPU requests. The trade-off between the MOT reservation and the reduced budget for regular task execution will be extensively investigated in the evaluation.

Fig. 2.4 illustrates an example that highlights the benefit of the MOT reservation mechanism. As one can see in Fig. 2.4a, τ_1 has to wait until 10 to transfer its data to the GPU because τ_2 has already consumed all the budget of v^c. Similarly, although the result of the GPU kernel of τ_1 is ready at 27, its result begins to be transferred only at 30. These delays cause the remote blocking of τ_3 to last for 22 time units although there is an available amount of server budget on v^c. Designating 2 time units as the MOT budget (see Fig. 2.4b) allows τ_1 to transfer its data to the GPU at 7 and finish its kernel launch at 27; hence τ_3 initiates its kernel launch accordingly. Consequently, designating some amount of the server budget as the MOT reservation reduces the remote blocking of a GPU-using task on another CPU core.

Figure 2.4: Data transfer in a GPU request a) without and b) with the MOT reservation mechanism. Core 2 runs τ_1 = ((0, 19, 0), 50, 1) and τ_2 = ((8, 0, 0), 20, 0) on v^c = (7, 10) (reduced to (5, 10) with MOT = 2), Core 1 runs τ_3 = ((0, 6, 0), 25, 1), and v^g = (35, 60).

2.6 Thermal and Schedulability Analysis

In this section, we present the schedulability analysis of our framework. We first design our thermal-aware servers for multi-core GPU-integrated platforms to avoid thermal violation. Then, we analyze task schedulability with and without the MOT reservation mechanism.
In our framework, task execution is performed within thermal-aware servers. As discussed, the purpose of the servers is to isolate each compute node from the others in the thermal sense. Accordingly, thermal violation avoidance has to be guaranteed by the server design under any circumstance of task execution. The specifications of the servers are independent of the tasks running in them (except for the design of the MOT reservation), whereas task schedulability does depend on them. In our proposed framework, introducing thermal-aware servers for both the CPU and the GPU makes the physical characteristics of the underlying platform transparent to the task schedulability test.
Single-Core Platforms
First, we calculate the "maximum" budget that a server can have on a single-core platform while keeping the temperature from exceeding the given thermal constraint, for a given replenishment period. In the worst case of a polling server, the server exhausts all of its budget at the beginning of its period and then sleeps until the beginning of the next replenishment period. Let t_wk and t_slp denote the active time (i.e., the budget-consuming phase) and the sleeping time (i.e., the cooling phase) of a CPU core server, respectively. Hence, the server period T is

T = t_wk + t_slp.   (2.4)

In the steady state of the system, we are interested in bounding the server's maximum temperature. According to Eq. 2.1,

α + (θ_s − α) e^{β t_wk} ≤ θ_M

where θ_s and θ_M are the steady-state temperature and the thermal constraint, respectively. Therefore,

e^{β t_wk} ≥ (θ_M − α)/(θ_s − α)  ⟹  t_wk ≤ (1/β) ln((θ_M − α)/(θ_s − α)).   (2.5)

On the other hand, in the cooling phase, according to Eq. 2.2, respecting the steady state requires θ_M e^{β t_slp} = θ_s. Hence,

t_slp = (1/β) ln(θ_s/θ_M).   (2.6)

By substituting Eqs. 2.5 and 2.6 into Eq. 2.4, we have

(1/β) ln((θ_M − α)/(θ_s − α)) + (1/β) ln(θ_s/θ_M) ≤ T
⟹ ((θ_M − α)/(θ_s − α)) × (θ_s/θ_M) ≤ e^{βT}.

Therefore, the worst-case steady-state temperature at the beginning of each period is

θ_s = α θ_M e^{βT} / (θ_M (e^{βT} − 1) + α).   (2.7)

Accordingly, the maximum budget for the period T is

t_wk = T − (1/β) ln(θ_s/θ_M).   (2.8)

Consequently, for a given replenishment period T, a server v_i = (T − (1/β) ln(θ_s/θ_M), T) bounds the maximum temperature to θ_M.

As one can see from the analysis, the maximum budget converges because of α. This means that after some point, an increase in the replenishment period has no effect on the maximum feasible budget. As discussed earlier in our framework design, the budget has to be taken as t_wk/2 for a deferrable server due to the thermal back-to-back phenomenon. This phenomenon does not occur in a sporadic server, which can use the computed budget as is.

Homogeneous Multi-core Platforms
The worst case for the budget of polling servers on a multi-core CPU happens when all of them exhaust their budgets completely. Therefore, according to Eq. 2.3 and the composability characteristics of the heat transfer, we have

α + (θ_s − α) e^{β t_wk} + Σ_{j=1, j≠i}^{m} γ_{i,j} θ_M^j ≤ θ_M^i

where θ_M^i is the maximum temperature for the i-th node. In the worst case, every core may reach its maximum temperature at the same time. Hence,

(1 + Σ_{j=1, j≠i}^{m} γ_{i,j}) [α + (θ_s − α) e^{β t_wk}] ≤ θ_M^i

where we denote λ_i = 1 + Σ_{j≠i} γ_{i,j}. However, the geographic location of the cores on the chip results in different values of the conduction coefficients, although they remain symmetric (i.e., γ_{i,j} = γ_{j,i}). Denoting λ = max_{1≤i≤m} λ_i, θ_M is given by θ_M = λ [α + (θ_s − α) e^{β t_wk}]. Similar to the single-core analysis given in the previous subsection, the steady-state temperature is

θ_s = α (θ_M/λ) e^{βT} / ((θ_M/λ)(e^{βT} − 1) + α).   (2.9)

Therefore, the server budget for each compute node i is

C^c = t_wk = T − (1/β) ln(θ_s λ / θ_M).   (2.10)

Heterogeneous Multi-Core GPU-Integrated Platforms
Here, we determine the budget of the servers in the presence of an integrated GPU. As on the homogeneous multi-core platform, the worst case happens when all servers exhaust their budgets completely. There is also a GPU segment execution on the GPU, which causes extra heat dissipation. Since the GPU runs at a different frequency and its architecture differs from that of the CPU cores, its heat generation parameters differ from those of the CPU cores. Hence,

θ_M^i = λ_i [α + (θ_s − α) e^{β t_wk}] + γ_{i,g} [α^g + (θ_s^g − α^g) e^{β^g t_wk^g}]
θ_M^g = α^g + (θ_s^g − α^g) e^{β^g t_wk^g} + (Σ_{j=1}^{m} γ_{g,j}) [α + (θ_s − α) e^{β t_wk}]

⟹ λ θ_M^i e^{β t_slp} + γ_{i,g} θ_M^g e^{β^g t_slp^g} = θ_s
  θ_M^g e^{β^g t_slp^g} + γ^g θ_M^i e^{β t_slp} = θ_s^g   (2.11)

where γ^g = Σ_{j=1}^{m} γ_{g,j} and the symbols with a superscript g represent the corresponding parameters of the GPU.

Miscellaneous Operation Time Reservation
To reduce the remote blocking time, some portion of the CPU server budget can be reserved for the data transfers from/to the GPU that need CPU intervention. The MOT budget has to be large enough to handle the longest data transfer time; therefore

C^c = t_wk − max_{∀τ_i} (M_{i,1}, M_{i,2}).   (2.12)

It is noteworthy that a lock on the GPU is acquired only when there is enough budget on the GPU server to execute the whole GPU request; hence, no budget reservation is needed on the GPU side.

The thermal analysis in the previous subsection gives the maximum budget of each CPU/GPU server that satisfies the thermal constraint of the system. Here, we present the schedulability analysis of a task τ_i in our framework.

Before introducing our analysis, we review the existing response-time test for independent tasks with no thermal constraints and no shared GPU under hierarchical scheduling [77]:

W_i^{n+1} = C_i + Σ_{τ_h ∈ V(τ_i), h>i} ⌈(W_i^n + J^c) / T_h⌉ C_h + ⌈(W_i^n + C^c) / T^c⌉ (T^c − C^c)   (2.13)

where J^c is the jitter of a task running in a server (see Section 2.4) and W_i^0 = C_i. The recursion terminates successfully when W_i^{n+1} = W_i^n and fails when W_i^{n+1} > T_i. This equation considers the budget depletion of a CPU server. The first term is the amount of CPU time used by the task τ_i, the second term captures the back-to-back execution of each higher-priority task τ_h, and the third term captures the amount of interference that the server can generate due to the periodic budget replenishment.
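As a minimal sketch (not the chapter's artifact), the fixed-point iteration of Eq. 2.13 can be written directly; the task and server parameters below are illustrative.

```python
import math

def response_time(ci, hp, Tc, Cc, Jc, Ti):
    """Iterate Eq. 2.13:
    W = C_i + sum_h ceil((W + J^c)/T_h) C_h + ceil((W + C^c)/T^c)(T^c - C^c).
    hp is a list of (C_h, T_h) pairs for the higher-priority tasks on the
    same server; returns the response time, or None if W exceeds T_i."""
    w = ci  # W^0 = C_i
    while True:
        nxt = (ci
               + sum(math.ceil((w + Jc) / th) * ch for ch, th in hp)
               + math.ceil((w + Cc) / Tc) * (Tc - Cc))
        if nxt == w:
            return w
        if nxt > Ti:
            return None  # recursion fails: implicit deadline T_i exceeded
        w = nxt

# Illustrative: C_i = 2, one higher-priority task (C_h, T_h) = (1, 10),
# CPU server v^c = (4, 5) under polling (J^c = T^c = 5), T_i = 25.
print(response_time(2, [(1, 10)], 5, 4, 5, 25))  # 5
```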
However, this equation cannot be used directly for the thermal-constraint problem with a shared GPU.

Our analysis extends the existing response-time test by considering the factors discussed in Section 2.5: (i) local blocking time, (ii) remote blocking time, (iii) back-to-back execution due to remote blocking, (iv) mutual budget availability, (v) CPU-GPU handover delay, (vi) multiple priority inversions, and (vii) CPU and GPU server budget co-depletion. We take into account factors (iv) and (v) together in the analysis of the handover delay, factor (vi) as part of the local blocking time analysis, and factor (vii) as part of the remote blocking time analysis. By considering all these factors, the following recurrence equation bounds the worst-case response time of a task τ_i:

W_i^{n+1} = C_i + B_i^l + B_i^r + H_i^{gc}
  + ⌈(W_i^n + C^c − s_i (H_i^{gc} + K_i)) / T^c⌉ (T^c − C^c)
  + Σ_{τ_h ∈ V(τ_i), h>i} ⌈(W_i^n + J^c + (W_h − C_h) − s_i (H_i^{gc} + E_i)) / T_h⌉ C_h   (2.14)

where C_i is the worst-case execution time of τ_i, B_i^l is the local blocking time, B_i^r is the remote blocking time, and H_i^{gc} is the CPU-GPU handover delay; we discuss each of them in detail below. The recurrence terminates when W_i^{n+1} = W_i^n, and the task τ_i is schedulable if its response time does not exceed its implicit deadline (i.e., W_i^n ≤ T_i).

The second line of the equation captures the delay due to server budget depletion on the CPU side (an extension of the last term of Eq. 2.13). It is worth noting that during the pure GPU kernel execution of τ_i (K_i), no CPU budget is consumed by τ_i. Other tasks can execute on the CPU core if there is remaining budget. The task τ_i only consumes CPU server budget when it executes normal segments or the miscellaneous operations (e.g., data copy) of GPU segments. Therefore, H_i^{gc} and K_i, which are already contained in W_i^n for a GPU-using task τ_i, are excluded so that only the CPU-consuming parts of the task are affected by the CPU server budget depletion. Any additional delay due to CPU budget replenishments during GPU segment execution is captured by H_i^{gc}.

The last line captures the preemption time by the normal execution segments of higher-priority tasks (an extension of the second term of Eq. 2.13). The fact that a higher-priority task τ_h can only preempt the normal execution segments of τ_i leads to the deduction of H_i^{gc} and E_i from W_i^n. There can be at most one additional job of τ_h that arrives during τ_i's GPU segment execution and interferes with τ_i's normal segments. This holds if τ_h is schedulable, i.e., W_h ≤ T_h, because other arrivals of τ_h during τ_i's GPU segment will finish their executions while τ_i is self-suspended for its kernel execution on the GPU. The interference from this additional carry-in job of τ_h is taken into account by modeling it as a dynamic self-suspending task and adding W_h − C_h to W_i^n [11].

Now, we present the detailed analysis of the delay factors used in our response-time test.

CPU-GPU Handover Delay
This captures the extra delay a task can experience after acquiring the GPU lock because of factors (iv) and (v). Fig. 2.5 illustrates this type of delay, decomposed into three parts when the polling server policy is used for a CPU server. (1) When there is not enough budget available in the GPU server for the execution of a GPU segment, the task has to wait at most the jitter of the GPU server (T^g − C^g). (2) After the GPU budget is replenished, if the CPU server is inactive, the task has to wait up to T^c additional time units for the next period of the CPU server. (3) After the completion of kernel execution on the GPU, at most T^c time units are needed for the CPU server to be activated in order to transfer the results back from the GPU to the CPU. Hence,

H_i^{gc} = s_i (T^g − C^g + 2 T^c).   (2.15)

Figure 2.5: The worst-case scenario for CPU-GPU handover delay with a CPU polling server. When a GPU request with E_i is chosen from the GPU waiting queue for execution, it experiences the delay highlighted in blue boxes.

For a CPU server with the deferrable or sporadic server policy, (2) and (3) change to the jitter of the CPU server (i.e., T^c − C^c), because a GPU segment only needs to wait for the replenishment of its corresponding CPU server's budget. The handover delay is then given by

H_i^{gc} = s_i [T^g − C^g + 2(T^c − C^c)].   (2.16)

Local Blocking
This occurs when a task τ_i is blocked by a lower-priority task on the same core. As in the case of MPCP [75, 76], a task can be blocked by each lower-priority task τ_l's GPU segment at most once, due to the priority boosting of our framework. To obtain a tight bound, we analyze the local blocking time from two different perspectives.

On the one hand, the worst-case local blocking of a task τ_i happens when each normal execution segment of τ_i is blocked by the GPU segment of each lower-priority task τ_l for the amount of E_l − K_l = M_{l,1} + M_{l,2}, which is the maximum CPU time used by the GPU segment of τ_l. Hence, the total local blocking time of a task τ_i is

(1 + s_i) Σ_{l<i} (E_l − K_l)   (2.17)

where (1 + s_i) indicates the number of normal execution segments of τ_i. It is worth noting that under the RM policy, the total blocking time can be bounded by just

Σ_{l<i} (E_l − K_l)   (2.18)

because each task has only one GPU segment and the period of any lower-priority task τ_l is larger than that of τ_i, which leads to only one blocking from each τ_l during a job execution of τ_i.

On the other hand, the worst-case local blocking time of τ_i can also be bounded by the amount of GPU budget available during one period of τ_i. The reasoning behind this approach is that lower-priority tasks cannot execute GPU segments for more than the available budget on the GPU. Thus, the maximum total blocking time of τ_i is bounded by (1 + s_i)(⌈T_i/T^g⌉ + 1) C^g, where the "+1" is due to the carry-in effect.

Using these two approaches, the total local blocking time of τ_i is bounded by

B_i^l = (s_i + 1) · min((⌈T_i/T^g⌉ + 1) C^g, Σ_{l<i} (E_l − K_l)).   (2.19)

Remote Blocking

This occurs when the GPU segment of a task is blocked in the GPU waiting queue due to other GPU requests. Recall that the GPU segments of tasks are ordered in the priority queue according to their tasks' original priorities.
The response time of a GPU segment of τ_i is given by W'_i = H_i^gc + E_i. The reasoning is that after a task acquires the GPU lock, it has to wait H_i^gc for the handover delay of data transfer and mutual server synchronization (Eq. 2.15). No delay other than H_i^gc is added to the GPU segment length E_i because our framework sets the GPU budget to be large enough to perform any GPU segment in one GPU period and boosts the priority of the task executing a GPU segment. Hence, the remote blocking time of τ_i is bounded by the following recurrence equation:

B_i^{r,n+1} = max_{l<i, τ_l ∈ Γ^g} W'_l + Σ_{h>i, τ_h ∈ Γ^g} ⌈B_i^{r,n} / T_h⌉ · W'_h (2.20)
Recall our MOT reservation mechanism presented in Section 2.5.3. When enabled, it ensures that there is always a sufficient amount of budget for miscellaneous operations (e.g., data copy from/to the GPU); thus, the GPU does not have to wait until the start of the next CPU budget replenishment. Hence, the CPU-GPU handover delay with the MOT mechanism is

H_i^gc = s_i (T_g − C_g). (2.21)

The improvement in the handover delay also has a profound impact on the remote blocking delay by reducing the worst-case response time of a GPU segment. However, it is worth noting that for deferrable CPU servers, their budget needs to be halved, as discussed earlier in Section 2.6.1, to avoid the thermal back-to-back phenomenon.

Remote Blocking Enhancement
To address the problem of enormous remote blocking time due to CPU-GPU handover delay, we propose an alternative approach to the GPU waiting queue. This approach implements the queue based on a variant of the first-come first-served (FCFS) policy with a pre-defined bin-packing order. To be more precise, a bin-packing heuristic is employed to determine the number of bins, where the size of each bin is the GPU budget and the length of a GPU segment is the size of an item to be packed. The total number of items in the bins is |Γ^g|, and the number of bins determines the waiting time for a GPU segment. The reason for employing the FCFS policy is to avoid starvation of jobs with large periods: under this policy, a job with a shorter period gets serviced only once at any time. If a small-period job arrives in the meantime, it waits until the rest of the waiting jobs get serviced, and after all other jobs finish, it gets serviced according to its position in the bins. Since in this approach all tasks have the same amount of waiting time, it leads to a significant reduction in the waiting time of low-priority GPU-using tasks but a moderate increase in that of high-priority ones. Recall that the replenishment period of the GPU server is typically much larger than that of the CPU server. The remote blocking time for a GPU-using task τ_i under the polling server policy for CPU cores is

B_i^r = s_i [(|bins| + 1) T_g + 2(|Γ^g| − 1) T_c]. (2.22)

In this approach, missing activation points of the CPU polling server can still happen during the transfer of data from/to the CPU server, and due to the FCFS characteristics of the queue, the total amount is 2(|Γ^g| − 1) T_c. The remote blocking time of τ_i under the deferrable and sporadic CPU server policies without the MOT mechanism is

B_i^r = s_i [(|bins| + 1) T_g + 2(|Γ^g| − 1)(T_c − C_c)]. (2.23)

This is because in the worst case, the GPU server waits for the amount of the CPU server jitter. The remote blocking time with the MOT mechanism is

B_i^r = s_i (|bins| + 1) T_g. (2.24)

To this end, our framework takes a hybrid scheduling scheme that chooses whichever of the proposed queue implementations passes the schedulability analysis.

This section gives the experimental evaluation of our framework. First, we explain our implementation on a real platform. Then, we explore the impact of the proposed approaches on task schedulability with randomly-generated tasksets based on practical parameters.
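The bin-packing step in the remote blocking enhancement above does not fix a particular heuristic, so the sketch below uses first-fit decreasing as one plausible choice. The bin size is the GPU server budget C_g, the items are the GPU segment lengths, and the number of bins returned corresponds to the |bins| term in Eqs. 2.22-2.24 (the function name and bin representation are our own):

```python
def pack_gpu_segments(segment_lengths, gpu_budget):
    """First-fit-decreasing packing of GPU segment lengths into bins of
    size C_g (the GPU server budget). len(result) gives |bins|."""
    bins = []  # each bin records remaining capacity and the items it holds
    for length in sorted(segment_lengths, reverse=True):
        for b in bins:
            if b["free"] >= length:     # first bin that still fits the segment
                b["free"] -= length
                b["items"].append(length)
                break
        else:                            # no existing bin fits: open a new one
            bins.append({"free": gpu_budget - length, "items": [length]})
    return bins
```

For example, packing segments of lengths [4, 3, 3, 2] into bins of size 6 yields two bins, so the MOT-enabled remote blocking of Eq. 2.24 would be s_i · (2 + 1) · T_g.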
We did our experiments on an ODroid-XU4 development board [64] equipped with a Samsung Exynos 5422 SoC. There are two different CPU clusters of little Cortex-A7 and big Cortex-A15 cores, where each cluster consists of four homogeneous cores. There is also an integrated Mali-T628 GPU on the chip, which supports OpenCL 1.1. Built-in sensors with a sampling rate of 10 Hz and a precision of 1 °C are on each big CPU core and also on the GPU to measure the temperature. (There are no temperature sensors on the little cores since the power consumption and heat generation of the little cluster are considerably low.) The DTM throttles the frequency of the big CPU cluster to 900 MHz when one of its cores reaches the pre-defined maximum temperature. During the experiments, the CPU fan is always either turned off or on at its maximum speed, and the CPU is set to run at its maximum frequency.

We stressed the CPU cores of the big cluster and the GPU with different settings by executing the sgemm program of the Mali SDK benchmark [62] to measure the system parameters used in Section 2.6. We observed that, without launching any kernel on the GPU, the temperature on the GPU rises from 40 °C (the ambient level) because of heat conduction. However, the GPU has less thermal effect on the CPU cores due to its low heat dissipation. Kernel execution on the GPU in the presence of CPU workloads raises the CPU temperature by 5-10 °C.

Fig. 2.6a illustrates the result of the implementation of the CPU polling server with a replenishment period of 1 second and a maximum temperature bound of 95 °C when the CPU fan is off. As one can see, the CPU temperature oscillates between 78 °C and 95 °C after the system has been operating in the steady state for a long time.

Fig. 2.6b depicts the server utilization with respect to the maximum temperature bound. The maximum achievable utilization of the CPU server is only 43% when the fan is off and almost 95% when the fan is on.
The gap in server utilization between these two cases (fan on/off) decreases as the value of the maximum temperature bound reduces. The steady-state temperatures under different maximum temperature bounds remain almost the same regardless of whether the fan is on or off.

With our proposed thermal-aware server design, it is possible to bound the operating temperature to any thermal constraint. It is worth noting that the same server utilization value can give different temperature bounds depending on the value of the replenishment period used. Fig. 2.6d shows the temperature bounds at the invariant server utilization of 30% and the replenishment period in the range of [600, …]. (We conducted the experiment with large values of the replenishment period because of the coarse granularity of the sampling rate of the on-board temperature sensors.)

Figure 2.6: a) Server design with the given maximum temperature of 95 °C when the CPU fan is off. b) Server utilization w.r.t. the given maximum temperature. c) The maximum observed temperature w.r.t. the server replenishment period when the CPU fan is off and CPU utilization is 30%.
With these results, we have shown that it is possible to satisfy the maximum temperature constraint with our proposed server design. Next, we use the measured parameters in our analysis and discuss their effect on taskset schedulability.
Task Generation.
We randomly generate 10,000 tasksets for each experimental setting. The base parameters given in Table 2.1 and the measured parameters from the board are used for the taskset generation and the server design, respectively. It is worth noting that the GPU parameters are in compliance with the case studies of prior work [46, 47, 48, 51]. Server budgets are determined according to the maximum temperature bound by applying the equations of Section 2.6.1 and the measured system parameters. The number of tasks in each taskset is drawn from the uniform distribution in the range of [8, 20]. Then, the utilization of the taskset is partitioned randomly among these tasks in such a way that no task has a utilization larger than the CPU server utilization. The total WCET of each task (i.e., C_i) is calculated from the task's utilization and its randomly-chosen period. If the task τ_i is a CPU-only task, the whole C_i is assigned to C_{i,1}; otherwise, C_i is divided into E_i, C_{i,1}, and C_{i,2}, according to the random ratio of the GPU segment length to the normal WCET. In this phase, if E_i is more than the GPU server budget, another random ratio is generated. Then, E_i is partitioned randomly into the miscellaneous time (M_{i,1} + M_{i,2}) and the pure kernel execution time (K_i), according to the ratio of miscellaneous operations given in Table 2.1. The accumulated miscellaneous-operation time is randomly divided into M_{i,1} and M_{i,2}. Finally, tasks are assigned to CPU cores by using the worst-fit decreasing (WFD) heuristic for load balancing across cores. When the MOT reservation is used, the MOT budget is set to the maximum of M_{i,1} and M_{i,2} over all tasks.

Table 2.1: Base parameters for taskset generation
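The per-task generation step above can be sketched as follows. This is a simplified, hypothetical rendering of the procedure, not the authors' generator: the parameter ranges follow Table 2.1, and the function name, return dictionary, and redraw cap are our own choices.

```python
import random

def generate_task(util, gpu_using, gpu_budget):
    """Sketch of one task's parameter generation (ranges from Table 2.1)."""
    T = random.uniform(30, 500)  # period = implicit deadline, in ms
    C = util * T                 # total WCET from the task's utilization
    if not gpu_using:
        # CPU-only task: the whole WCET goes to the first normal segment
        return {"T": T, "C1": C, "C2": 0.0, "E": 0.0, "M1": 0.0, "M2": 0.0}
    for _ in range(1000):
        r = random.uniform(2, 3)        # GPU segment : normal WCET ratio
        E = C * r / (r + 1)             # GPU segment length
        if E <= gpu_budget:             # redraw if E exceeds the GPU budget
            break
    else:
        raise ValueError("GPU segment cannot fit in the GPU server budget")
    normal = C - E
    split = random.random()             # split normal WCET into C1 and C2
    m_total = E * random.uniform(0.10, 0.20)  # miscellaneous-operation time
    m_split = random.random()           # split misc. time into M1 and M2
    return {"T": T, "C1": normal * split, "C2": normal * (1 - split),
            "E": E, "M1": m_total * m_split, "M2": m_total * (1 - m_split)}
```

A generated task always satisfies C_{i,1} + C_{i,2} + E_i = C_i by construction, which is the invariant the schedulability experiments rely on.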
Number of CPU cores: 4
Number of tasks: [8, 20]
Taskset utilization: [0.4, 1.6]
Task period and deadline: [30, 500] ms
Percentage of GPU-using tasks: [10, 30]%
Ratio of GPU segment length to normal WCET: [2, 3] : 1
Ratio of misc. operations in GPU segment ((M_{i,1} + M_{i,2}) / E_i): [10, 20]%
Server period: 10 ms
GPU period: 20 ms

Results. Figures 2.7a-b depict the percentage of schedulable tasksets when the CPU fan is on or off, respectively, with different taskset utilizations. The CPU sporadic servers outperform the other CPU server policies, as expected, because their replenishment budget is as large as that of the polling server while their remote blocking and CPU-GPU handover delays are as low as those of the deferrable server. Compared to the polling server, the deferrable server with MOT yields a higher percentage of schedulable tasksets, especially when taskset utilization is low. Designating some portion of the CPU server budget under the deferrable replenishment policy results in an improvement in taskset schedulability. However, the percentage of schedulable tasksets under the deferrable server drops sharply as taskset utilization increases because of the insufficient amount of server budget (which is just half of the polling/sporadic server budget). It is worth noting that when the CPU fan is off, there is a large gap between the two sporadic servers with and without MOT at low taskset utilization. This is due to the advantage of the MOT mechanism, which also reduces the enhanced remote blocking delay.

Fig. 2.7c shows the effect of the proposed remote blocking enhancement under the polling policy. As one can see, the remote blocking reduces substantially, by up to 20%, due to our proposed enhancement, especially when there is less workload in the system. Fig. 2.7d depicts the impact of the rate of GPU-using tasks on the schedulable taskset rate under the CPU sporadic server policy with and without MOT.
As the number of GPU-using tasks increases, taskset schedulability decreases as more tasks contend for the shared GPU.

Next, we investigate the effect of server periods on the schedulability rate. In this experiment, the temperature constraint is fixed to the maximum level and the server period is varied from 5 to 30 milliseconds (see Fig. 2.7e) under the polling server policy. As expected, because of the large CPU-GPU handover delay, the percentage of schedulable tasksets drops significantly as the server replenishment period increases.
We have implemented a prototype of our framework and conducted a case study on the ODroid-XU4. We show that without our framework, a real-time task experiences unexpectedly large delays due to thermal violations, but with our framework, the operating temperature is safely bounded within a desired range.

In the case study, we used three types of applications: a real-time GPU-using task, and non-real-time CPU-only and GPU-using tasks. The non-real-time CPU-only application is run on each of the four big CPU cores with the lowest priority. The non-real-time GPU-using matrix multiplication task is run on one big CPU core with medium priority. This task randomly generates two matrices and performs the matrix multiplication repetitively using the GPU-accelerated Mali OpenCL library. The size of each matrix is set to 512 ×
512 in order not to cause unnecessarily long waiting times to the real-time task. On the other big CPU core, the highway workzone recognition application for autonomous driving [58] is run as the real-time GPU-using task with the highest priority. This task is configured to process 8 frames per second and has one GPU segment based on OpenCL. A video consisting of around 800 frames is given as input to this task. To avoid unexpected delays in data fetching, video frames are preloaded into memory as a vector during the initialization of the task, and the loaded frames are repeatedly processed at runtime. After the initialization, all tasks are signaled to start their execution together, and the CPU fan is turned off during the experiment. Other tasks in the system, including system maintenance and monitoring processes, are assigned to the little cluster cores. The thermal-aware servers are implemented based on the polling server policy with a budget replenishment period of 10 milliseconds.

Two scenarios are used to show the effectiveness of our proposed framework. The first one is the baseline system with no thermal-aware server. Fig. 2.8a shows the response time of each frame (job) of the real-time workzone recognition task, and Fig. 2.8b depicts the temperature measurements during the experiment. As one can see, after processing around 1040 frames, the DTM was triggered and it started throttling the CPU frequency from 2.0 GHz to 900 MHz since the CPU temperature had reached the threshold of 95 °C. However, the frequency throttling did not prevent the operating temperature from rising on both the CPU and the GPU, which caused the OS to power off the big CPU cluster temporarily. This experiment took less than three minutes. In the second scenario, all tasks execute within the thermal-aware servers. As illustrated in Fig. 2.8c, the observed response time of each frame was larger compared to the first scenario due to the budget of the servers, but all frames were processed before the deadline. Fig.
2.8d shows that it took around 500 seconds to reach the steady-state temperature. Moreover, the operating temperature was tightly bounded by the thermal threshold in all circumstances. This result shows the thermal safety and accuracy of our framework in practical settings.

2.9 Summary
In this chapter, we proposed a novel thermal-aware framework to bound the maximum temperature of CPU cores as well as an integrated GPU while guaranteeing real-time schedulability. Our framework supports various server policies and provides analytical foundations to check both thermal and temporal safety. Experimental results show that each CPU server policy provided by our framework is effective in bounding the maximum temperature. We proposed the miscellaneous operation time reservation mechanism for the CPU servers in order to improve task schedulability by reducing the CPU-GPU handover delay. We also introduced a remote blocking enhancement technique that employs a bin-packing strategy to reduce the remote blocking caused by other tasks.

Figure 2.7: a) and b) The percentage of schedulable tasksets under the given maximum temperature constraint of 95 °C when the CPU fan is off and on, respectively. c) Taskset schedulability results with and without the remote blocking enhancement under the polling server policy while the CPU fan is off. d) The percentage of schedulable tasksets w.r.t. the ratio of GPU-using tasks.
Figure 2.8: a) Response time of the real-time workzone recognition task without our framework. b) Temperature of CPU and GPU without our framework. c) Response time of the real-time task with our framework. d) Temperature of CPU and GPU with our framework.

Chapter 3

On Dynamic Thermal Conditions in Mixed-Criticality Systems
The rising demand for powerful embedded systems to support modern complex real-time applications intensifies on-chip temperature challenges. Due to heat conduction between CPU cores, task execution on one core interferes with the execution time of tasks running on other cores. The violation of thermal constraints causes timing unpredictability for real-time tasks due to transient performance degradation or permanent system failure. Moreover, dynamic ambient temperature significantly affects the operating temperature of multi-core systems.

In this chapter, we propose a thermal-aware server framework to safely upper-bound the maximum operating temperature of multi-core mixed-criticality systems. With the proposed analysis on the impact of ambient temperature, our framework manages mixed-criticality tasks to satisfy both thermal and timing requirements. We present techniques to find the maximum ambient temperature for each criticality level that guarantees a safe operating temperature bound. We also analyze the minimum time required for a criticality mode change from one level to another. The thermal properties of our framework have been evaluated on a commercial embedded platform. A case study with real-world mixed-critical applications demonstrates the effectiveness of our framework in bounding the operating temperature under dynamic ambient temperature changes.
Ensuring continuous operation with high assurance in the physical environment remains a significant challenge for cyber-physical systems (CPS). This is particularly important for safety-critical applications with real-time mixed-criticality components, e.g., automotive, aerospace, manufacturing, and defense systems, where even occasional timing failures of high-criticality components can lead to catastrophic consequences. Various types of unexpected changes in the physical environment may affect the system behavior and contribute to the difficulty of this problem.

Ambient temperature is one of the key factors that affect many mixed-criticality CPS applications. For instance, in automobiles, a report from the National Renewable Energy Laboratory of the US Department of Energy [7] indicates that cabin air temperature can reach up to 82 °C in Phoenix, Arizona. The heat generated by the engine worsens the ambient temperature level of nearby electronic control units [68]. Another example is a fire-containment drone [91]. Even with a heat protection shield, the drone's computing system starts warning when the ambient temperature reaches 35 °C and becomes nonoperational at 40 °C. This also limits the minimum distance from the drone to a fire hazard.

While dynamic ambient temperature is an important problem, most thermal management schemes for real-time embedded systems [52, 53, 2, 41, 20, 3] assume a fixed, room-level ambient temperature and focus only on the temperature increase caused by the computing system itself. CPS are expected to run in various physical environments; hence, the assumption of room temperature made by prior work limits their practical applicability. There is recent work [59] considering dynamic ambient temperature, but it assumes a uni-processor single-criticality system.

Under harsh ambient temperature, a system may not be able to utilize 100% of CPU time even if the CPU runs at the minimum possible frequency with active/passive cooling packages.
The operating temperature of the CPU may still reach the maximum thermal constraint, resulting in temporary system shutdown or permanent hardware damage. In such a condition, the only option left to ensure the timing and thermal guarantees of more critical tasks is to secure cooling time by suspending less critical tasks. In other words, although DVFS [30, 61, 70, 78, 101] and cooling packages [15, 22, 33, 34] can help tolerate high ambient temperature, considering only partial operation of the system is inevitable.

In this work, we aim to design a system that offers different levels of assurance against ambient temperature changes. This is different from the well-known Vestal model [92], which focuses on varying assurance of execution time, but it shares the spirit of addressing uncertainties in real-time system design. To avoid confusion, we clarify our mixed-criticality model as follows:
Definition 1 Thermally mixed-criticality systems are systems that assure that ambient temperature changes and heat dissipation from lower-criticality task execution do not adversely affect the real-time schedulability of higher-criticality tasks.

In the thermally mixed-criticality model, ambient temperature plays a key role in determining the maximum amount of workload that can be executed on the CPU. The following questions still remain unanswered by the state of the art:

• Up to what ambient temperature is the system fully or partially operational? Specifically, for a given criticality level of a mixed-criticality system, can we find the corresponding critical ambient temperature, under which real-time tasks with that or higher criticality are guaranteed to meet their deadlines at the expense of lower-criticality tasks?

• Can we take into account the effect of dynamic ambient temperature along with heat conduction on a modern multi-core processor?

• If the system moves from a hot to a cold region, how long will it take for the system to cool down and safely resume the operation of low-criticality tasks without violating the processor thermal constraint?

This chapter presents a multi-core mixed-criticality scheduling framework with ambient temperature awareness. In our proposed framework, thermal-aware servers are used to bound heat generation at each criticality level, and the criticality mode change is triggered by ambient temperature changes. This is the first work to address the aforementioned limitations and provide analytical guarantees on the timely execution of critical components in dynamic thermal conditions.
Contributions.
The contributions of this chapter are as follows:
• We show that the problem of thermal-aware real-time scheduling can be decomposed into thermal schedulability (how much CPU budget is usable under thermal constraints) and timing schedulability (whether tasks are schedulable using the given budget). Our thermal schedulability achieves simplicity in timing analysis by ensuring that the budget is guaranteed to be available for any execution pattern without violating thermal constraints.

• We extensively analyze the thermal safety of a multi-core system and bound the maximum operating temperature that the system can reach. At a specific ambient temperature level, we characterize the worst-case thermal behavior of a system and also determine the minimum time for the system to transition from one criticality level to a lower level.

• We introduce the notion of idle thermal servers that allow bounding the maximum operating temperature caused by multiple preemptive active servers scheduled dynamically on a multi-core processor for a given mixed-criticality taskset.

• We perform a case study on mixed-criticality applications running on an ODroid-XU4 embedded platform, and evaluate our framework and analysis at different ambient temperature levels.
There exist extensive studies on bounding the maximum operating temperature in non-real-time multi-core systems [31, 30, 61, 70, 78, 101], most of which propose adjusting the CPU clock speed. The authors of [31, 30] proposed a feedback loop that dynamically controls processor temperature by either adapting CPU utilization or scaling frequency to satisfy thermal constraints in environments with varying ambient temperature. In [70] and [78], the authors proposed proactive frequency scheduling to improve overall system performance under different ambient temperatures. Although average-case performance degradation has been discussed in these papers, they cannot be directly applied to our problem.

The notion of hot and cold tasks has been introduced [15, 41, 59, 53, 44, 101] in both real-time and non-real-time systems. These works propose scheduling algorithms to interleave hot and cold tasks, adjust the CPU frequency, and force idling time for the CPU to cool down after running hot tasks. In most of them, the scheduling problem is either to find a task execution order or to reduce the size of lengthy hot tasks [41] in a fixed schedule. However, DVFS may cause a considerable reduction in system reliability over time [43, 57, 99] and may be unsupported on some embedded devices.

There exists extensive research on real-time uni-processor systems with stringent thermal constraints [17, 96, 16, 52, 5, 4, 2, 41, 59, 44]. In uni-processor systems, the thermal dependency between workloads is only temporal. However, in multi-core systems, because of heat dissipation between cores, there also exists a spatial thermal dependency between the execution of workload on one core and those on other cores.
For this reason, the work on uni-processor systems cannot be safely used in multi-core real-time systems.

The notion of thermal servers (either by injecting idle tasks or by using thermal-aware servers explicitly) has been proposed in the real-time systems literature for uni-processors [52, 53, 2, 41] and multi-core platforms [20, 3]. In particular, the authors of [20] introduced a novel technique for periodic tasks executing on multi-core platforms. This technique introduces an Energy Saving (ES) task that runs with the highest priority and captures the sleeping time of CPU cores. The technique can be seen as an alternative to a thermal-aware server because the ES task effectively models the budget-depleted duration of a thermal server. The authors of [3] proposed thermal-isolation servers that avoid thermal interference among tasks in the temporal and spatial domains with thermal composability. Despite their achievements in isolating the thermal-aware design from real-time schedulability analysis, these techniques assume a fixed schedule for the idle task or periodic servers, and are inapplicable to dynamically-scheduled (e.g., priority-based) servers on multi-core platforms. Since priority-based preemptive servers are widely used in many real-time systems, such as real-time virtualization [50, 98], the restriction imposed by these prior thermal servers is a significant limiting factor.

The matrix exponential is a well-known approach to solve a first-order linear system of differential equations. In [66], the authors proposed a technique based on the numerical Newton-Raphson method to solve the thermal equations in the steady state for each power change. Based on this, the work in [73, 100, 54, 14] finds the worst-case peak temperature by exhaustively searching all possible patterns.
Unlike prior work, the unique contribution of our work is that we solve the temperature equation for oscillating power signals analytically, by representing them as continuous functions, and analyze the worst-case peak temperature directly for multi-core platforms. It is worth noting that our work not only proves the worst case for peak temperature but also reduces the computational complexity considerably.

There exists some research focusing on varying ambient temperature in real-time systems [59, 41, 96]. However, these studies do not discuss mixed-criticality systems or the effect of harsh ambient temperature changes on the schedulability of critical tasks. To the best of our knowledge, there is no prior work on multi-core mixed-criticality systems that considers the effect of ambient temperature change.
The total power consumption of CMOS circuits is modeled as the summation of dynamic and static power [13], i.e., P(t) = P_S(t) + P_D(t). Static power P_S depends on the semiconductor technology and, due to current leakage, on the operating temperature. Hence, it can be modeled as P_S(t) = k_1 θ(t) + k_2, where k_1 and k_2 are technology-dependent system constants and θ(t) is the operating temperature [60]. Dynamic power P_D(t) is the amount of power consumed due to the processor operating at frequency f, modeled as P_D = k_3 f^s, where s and k_3 are system constants that depend on the semiconductor technology.

We consider multiple preemptive thermal-aware servers for each CPU core. Each server is statically associated with one core and does not migrate to another core at run-time. A server v_i is modeled as v_i = (C_i, T_i, L_i), where C_i is the maximum execution budget of the server v_i, T_i is its budget replenishment period, and L_i is its criticality level. Servers are ordered in increasing order of priority, i.e., i < j implies that a server v_i has lower priority than v_j. At the criticality level l, only the servers with a criticality level higher than or equal to l (i.e., L_i ≥ l) are eligible to execute.

For budget replenishment policies, we consider polling [82], deferrable [88], and sporadic servers [86]. Let J_i denote the task release jitter relative to the server release. The value of J_i is T_i under the polling server policy and T_i − C_i under the deferrable and sporadic server policies [9].

This work considers sporadic tasks with implicit deadlines under partitioned fixed-priority preemptive scheduling, which is widely used in many practical systems. Each task τ_i is statically allocated to one thermal server (and thus to the corresponding CPU core of that server) with a unique priority. In each server, tasks are labeled in increasing order of priority, i.e., i < j implies τ_i has lower priority than τ_j.
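The power model above is a simple affine-plus-polynomial function of temperature and frequency; it could be evaluated as follows (a sketch only: the constants k_1, k_2, k_3, and s are illustrative placeholders, not values measured for any particular chip):

```python
def total_power(theta, f, k1=0.1, k2=1.0, k3=1e-9, s=3):
    """Total CMOS power P(t) = P_S(t) + P_D(t) (Sec. 3.3).

    P_S = k1*theta + k2 models temperature-dependent leakage (static) power;
    P_D = k3 * f**s models dynamic power at operating frequency f.
    All constants are hypothetical, technology-dependent parameters.
    """
    return (k1 * theta + k2) + k3 * f ** s
```

The temperature term in P_S is what couples power and temperature: hotter cores leak more, which in turn heats them further, and this feedback is why the thermal analysis cannot treat power as a constant.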
There also exist non-real-time tasks running at the lowest priority level in the server; they execute only if there is no real-time task ready to execute and the server has remaining budget. A task τ_i is modeled as τ_i = (E_i, D_i), where E_i is the worst-case execution time (WCET) of task τ_i and D_i denotes the minimum inter-arrival time of τ_i, which is also its implicit deadline. It is worth mentioning that the simplicity in timing schedulability achieved by our work enables the easy adoption of more complex task models. For instance, the analysis for tasks with critical sections under hierarchical scheduling [51, 39] is directly applicable to our work since we ensure a periodic resource budget.

3.3.4 Criticality Model

In this work, there exists a set of m criticality levels L = {l_1, l_2, ..., l_m}. At criticality mode l, only the servers (and tasks within these servers) with a criticality level higher than or equal to l are eligible to execute. Thus, for each criticality level l, there exist a subset of servers V_l ⊆ V and a subset of the taskset Γ_l ⊆ Γ that execute.
Definition 2 The critical ambient temperature of a criticality level l is the maximum ambient temperature at which the system can execute the eligible servers v ∈ V_l without violating the system's maximum temperature constraint.

Details on how the criticality mode changes are given in Sec. 3.4.
This section presents the overall design-time and run-time aspects of our framework and how the criticality mode changes. With our framework, all tasks in a system run within thermal-aware servers. A criticality mode change is triggered by the ambient temperature, which can be obtained from an off-chip temperature sensor. If the criticality mode changes from a lower criticality level to a higher one, i.e., the critical ambient temperature has been reached, the framework terminates the lower-criticality servers and their tasks immediately. This design guarantees that the lower-criticality tasks have no effect on the thermal and timing schedulability of tasks running in servers with higher criticality levels. The timing schedulability of tasks refers to their ability to complete their execution by the deadline. Thermal schedulability is defined as follows.

Definition 3 Thermal schedulability is to guarantee that under any task execution pattern, the CPU does not exceed the maximum temperature constraint.
When the ambient temperature changes from a higher criticality level to a lower one, lower-criticality servers and their tasks do not resume immediately. The reasoning is that it takes time for the CPU to cool down to a safe temperature level at which the increase in workload (due to resuming the lower-criticality tasks) will not lead to a temperature violation. Therefore, in the transition to a lower criticality level, only the higher-criticality servers (and their tasks) keep running, and only after the safe temperature is reached do the lower-criticality servers (and their tasks) resume. Let shifting time denote this delay. We will later determine this delay as a function of physical characteristics and system settings.
Figure 3.1: Criticality mode change diagram.

The state diagram of criticality mode changes is illustrated in Fig. 3.1. The criticality mode of the system is determined by the servers associated with the current state (e.g., V^l for level l), and a state change is triggered by the monitored ambient temperature. If the ambient temperature goes higher than the critical ambient temperature of the highest criticality mode l_m, the system shuts down. An increase in the ambient temperature causes the system state to transition to a higher criticality mode immediately. A transition to a lower criticality mode, however, involves a shifting state. Let S_i denote the shifting state from criticality mode l_{i+1} to its immediate lower mode l_i, and let M_i represent the state of criticality mode l_i. After staying in S_i for tm_{S_i} time units with the ambient temperature still under the safe level, the system state changes to M_i.

In summary, the design-time analysis and the run-time support of our framework are as follows:

Design-time:
1. Check the timing schedulability of tasks for each criticality level.
2. Find the parameters of thermal-aware servers that ensure thermal schedulability for each criticality level.
3. Compute the corresponding critical ambient temperature for each criticality level.
4. Compute the shifting time from each criticality level to its immediate lower one.

Run-time:
1. Monitor the ambient temperature.
2. If the ambient temperature exceeds the critical ambient temperature of the current criticality level l_i, a) switch to the higher criticality level l_{i+1}, and b) terminate servers with criticality less than l_{i+1} immediately.
3. If the ambient temperature decreases below the critical ambient temperature of the next lower criticality level l_{i-1}, a) switch to the lower criticality level l_{i-1}, b) wait for the shifting time, and c) resume servers with criticality l_{i-1}.
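The run-time rules above can be expressed as a small monitor. The following is a minimal Python sketch: the critical ambient temperatures and shifting times are hypothetical placeholders that would come from the design-time analysis, and the shutdown at the highest level is omitted for brevity.

```python
class CriticalityManager:
    """Sketch of the run-time criticality mode-change logic (hypothetical values)."""

    def __init__(self, crit_ambient, shift_time):
        self.crit_ambient = crit_ambient  # crit_ambient[l]: critical ambient temp of level l
        self.shift_time = shift_time      # shift_time[l]: delay before resuming level-l servers
        self.level = 0                    # current criticality level
        self.shifting_since = None        # time at which a downward shift began

    def on_ambient_sample(self, theta_amb, now):
        # Upward change: switch immediately (lower-criticality servers terminated).
        while self.level < len(self.crit_ambient) and theta_amb > self.crit_ambient[self.level]:
            self.level += 1
            self.shifting_since = None
        # Downward change: enter the shifting state and wait before resuming.
        if self.level > 0 and theta_amb <= self.crit_ambient[self.level - 1]:
            if self.shifting_since is None:
                self.shifting_since = now
            elif now - self.shifting_since >= self.shift_time[self.level - 1]:
                self.level -= 1
                self.shifting_since = None
        else:
            self.shifting_since = None
        return self.level
```

Note that an upward change takes effect on the same sample, while a downward change only completes after the shifting time has elapsed with the ambient still below the lower level's threshold.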
In this section, we first develop a generalized thermal model to represent the CPU temperature as a function of a generic input power signal. We show the relation between the workload and the ambient temperature level under the maximum thermal constraint in the steady state. The shifting time to a lower criticality level is then discussed. Finally, we prove the worst-case scenario of task arrivals under an invariant ambient temperature at a specific criticality level on multi-core platforms.

The thermal behavior of the CPU is modeled when a generic periodic power signal generates heat dissipation, which results in temperature oscillations. The temperature function of the CPU is analytically derived from the ambient temperature, the workload, the physical and geometrical properties, and the thermal resistances between the CPU and its surroundings. A generic power signal is a periodic step function:

P(t) = { P_S during sleeping time; P_S + P_D otherwise }

Fig. 3.2 illustrates this power signal, where t_wk and t_slp denote the execution time and the sleeping time of a periodic workload, u is the CPU utilization (i.e., u = t_wk / T), and φ is the release offset.

Figure 3.2: A generic periodic power signal.

To derive a continuous temperature function, we represent the periodic step function of the power signal as the following Fourier series:

P(t) = P_S + P_D u + P_O   (3.1)

where

P_O = Σ_{k=1}^{∞} [2 P_D sin(u k π) / (k π)] cos( (2kπ/T) (t − uT/2 − φ) )

Note that using this continuous representation of the power signal is advantageous for deriving a closed-form temperature equation, compared to iteratively applying the recursion relation with new initial conditions at every period.
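The series representation can be checked numerically against the exact step signal. The sketch below assumes the standard Fourier expansion of a duty-cycle-u square wave; the factor 2 in the coefficients and the centering ψ = uT/2 + φ are assumptions, since the exact constants of Eq. 3.1 are partially garbled in the source.

```python
import math

def power_step(t, P_S, P_D, u, T, phi):
    """Exact periodic step power: P_S + P_D while executing, P_S while sleeping.
    The execution window is assumed to start at the release offset phi."""
    return P_S + (P_D if ((t - phi) % T) < u * T else 0.0)

def power_fourier(t, P_S, P_D, u, T, phi, K=500):
    """Truncated Fourier series of the same signal (cf. Eq. 3.1),
    assuming psi = uT/2 + phi (center of the execution window)."""
    psi = u * T / 2 + phi
    P_O = sum(2 * P_D * math.sin(u * k * math.pi) / (k * math.pi)
              * math.cos(2 * k * math.pi / T * (t - psi))
              for k in range(1, K + 1))
    return P_S + P_D * u + P_O
```

Away from the switching instants the truncated series converges quickly to the step signal, which is what makes it usable as a continuous stand-in for the power input.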
First, we define θ(t) = Θ_CPU(t) − Θ_amb(t) as the temperature difference between the CPU core and the ambient. Then, the rate of temperature change can be captured by the following differential equation [3, 20]:

dθ(t)/dt = A θ(t) + B P(t)   (3.2)

where A and B are scalar values determined by the system's inner thermal resistance and capacitance. Substituting P(t) from Eq. 3.1, we solve Eq. 3.2 for the CPU temperature. Assuming the temperature difference at the initial time t₀ is θ₀, the temperature of the CPU can be written as:

θ(t) = θ₀ e^{A(t−t₀)} − (B/A)(P_S + P_D u)(1 − e^{A(t−t₀)}) + B [S(A, P_D, u, T, φ, t) − S(A, P_D, u, T, φ, t₀) e^{A(t−t₀)}]   (3.3)

where S is a function defined as:

S(β, P, u, T, φ, t) = − Σ_{k=1}^{∞} [2 P T² sin(ukπ) / (kπ (T²β² + 4k²π²))] × ( −Tβ cos( (2kπ/T)(t − ψ) ) + 2kπ sin( (2kπ/T)(t − ψ) ) )

with ψ = uT/2 + φ.

The parameters A and B in Eq. 3.3 are system characteristics that depend on the thermal resistances, heat capacity, CPU mass, and thermal convection of the ambient. It is worth mentioning that the thermal response of a system with constant power dissipation can be derived as a special case of Eq. 3.3 by setting u = 1:

θ(t) = θ₀ e^{A(t−t₀)} − (B/A)(P_S + P_D)(1 − e^{A(t−t₀)})   (3.4)

If t₀ = 0, the well-known expression for the constant power dissipation case can be obtained from Eq. 3.4:

θ(t) = α + (θ₀ − α) e^{βt}   (3.5)

where β = A and α = −(B/A)(P_S + P_D).

With a constant power signal, the thermal response of the system reaches a steady state. From Eq. 3.4:

θ_s(t) = α = −(B/A)(P_S + P_D).   (3.6)

For the case where the CPU utilization is not 100%, the temperature still oscillates in the steady state, but the oscillation stays within a certain range given by the minimum and the maximum steady-state temperatures (θ_L and θ_M). We can derive the steady-state thermal response from Eq.
3.3:

θ_s(t) = −(B/A)(P_S + P_D u) + B S(A, P_D, u, T, φ, t)   (3.7)

where S(A, P_D, u, T, φ, t) represents the oscillation. Based on these general expressions, we give details on the thermal behavior of multiple CPU cores under transient and steady conditions in Sec. 3.5.2.

Workload, ambient, and maximum temperature relations. In the steady-state condition, the relation between the workload, the ambient temperature Θ_amb, and the maximum operating temperature Θ_M can be determined from the model developed above. The temperature difference θ_M can be written as:

θ_M = Θ_M − Θ_amb = −(B/A) ( P_S + P_D (1 − e^{AuT}) / (1 − e^{AT}) )   (3.8)

Therefore, if Θ_M is used as a thermal constraint, the maximum ambient temperature for a given utilization u is expressed as follows:

Θ_amb = Θ_M + (B/A) ( P_S + P_D (1 − e^{AuT}) / (1 − e^{AT}) )   (3.9)

Another useful expression can be derived for the workload, based on the maximum temperature constraint, the ambient temperature, and the power signal:

u = (1/(AT)) ln( 1 + (A/B) (Θ_M − Θ_amb + (B/A) P_S)(1 − e^{AT}) / P_D )   (3.10)

Time shifting and transient analysis. We derive the average steady-state temperature by removing the oscillations from the above equations. This is especially useful for transient-phase analysis. From Eq. 3.3, the following can be derived:

θ_ave(t) = θ₀ e^{A(t−t₀)} − (B/A)(P_S + P_D u)(1 − e^{A(t−t₀)}) − B S(A, P_D, u, T, φ, t₀) e^{A(t−t₀)}   (3.11)

θ_ave(t) can be approximated by taking the first few terms of the series for S(A, P_D, u, T, φ, t₀). It can be seen from Eq. 3.11 that the plateau of θ_ave in the steady state is −(B/A)(P_S + P_D u). Based on the expression for θ_ave(t), we can calculate the shifting time, defined as the time it takes for the system to reach a new steady-state condition when the system parameters change.
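Eqs. 3.9 and 3.10 are inverses of each other, which gives a quick sanity check. The sketch below uses hypothetical parameter values for A, B, the power levels, and the thermal constraint.

```python
import math

def max_ambient(u, Theta_M, A, B, P_S, P_D, T):
    """Eq. 3.9: highest ambient temperature sustainable at utilization u."""
    return Theta_M + (B / A) * (P_S + P_D * (1 - math.exp(A * u * T)) / (1 - math.exp(A * T)))

def max_utilization(Theta_amb, Theta_M, A, B, P_S, P_D, T):
    """Eq. 3.10: highest utilization sustainable at ambient Theta_amb."""
    x = 1 + (A / B) * (Theta_M - Theta_amb + (B / A) * P_S) * (1 - math.exp(A * T)) / P_D
    return math.log(x) / (A * T)
```

Mapping a utilization to its critical ambient temperature via Eq. 3.9 and back via Eq. 3.10 recovers the original utilization, confirming the two closed forms are consistent.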
We calculate the shifting time that the CPUs take to reach 99% of the new steady-state condition from an initial temperature difference of θ₀. Assuming that S_k is the value of S(t₀) taking the first k terms, we have:

t_shift = (1/A) ln( | 0.01 (B/A)(P_S + P_D u) / (θ₀ + (B/A)(P_S + P_D u) + B S_k) | )   (3.12)

Fig. 3.3 illustrates the temperature profile θ(t), the oscillating steady-state temperature θ_s(t), and the average temperature θ_ave(t).

Figure 3.3: Current, steady state, and average temperature profiles.

3.5.2 General Thermal Model for Multi-Core Platforms

Similar to the single-core case, a general thermal model can be developed for a multi-core platform with a periodic input power signal. In this subsection, we represent the power signal as a continuous function and show its benefit in simplifying the final solution. For a multi-core platform with n cores, there are n eigenvalues that define the thermal response of the system; therefore, we use matrix notation to solve the differential equations. After performing a thermal analysis, the rate of change of the CPUs' temperatures can be written as:

dθ(t)/dt = A θ(t) + B P(t)   (3.13)

where A, an n × n matrix, and B, a diagonal n × n matrix, are determined by the inner thermal resistance and capacitance of the system; θ(t) is the n × 1 vector of temperature differences, and P(t) is the n × 1 power vector. If θ₀ is the initial temperature difference vector at t₀, the solution of Eq. 3.13 can be written as:

θ(t) = e^{(t−t₀)A} θ₀ + ∫_{t₀}^{t} e^{(t−s)A} B P(s) ds   (3.14)

The first part is the homogeneous solution, i.e., the thermal response due to the initial temperature difference from the ambient. The second part is the non-homogeneous solution caused by the input power signal; the temperature rise due to a power signal is independent of the initial condition. e^{(t−t₀)A} is the matrix exponential of (t−t₀)A.
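Before moving on to multiple cores, the single-core shifting time of Eq. 3.12 can be evaluated directly. A minimal sketch with hypothetical parameters, truncating the oscillation term to S_k = 0:

```python
import math

def shifting_time(theta0, A, B, P_S, P_D, u, S_k=0.0):
    """Eq. 3.12: time to come within 1% of the new average steady state
    -(B/A)(P_S + P_D u), starting from temperature difference theta0.
    S_k is the truncated oscillation term (taken as 0 in this sketch)."""
    drive = (B / A) * (P_S + P_D * u)  # equals minus the new average plateau
    return (1 / A) * math.log(abs(0.01 * drive / (theta0 + drive + B * S_k)))
```

With S_k = 0 the result is exactly the time at which the average response of Eq. 3.11 has covered 99% of the gap to its plateau, which is easy to verify against Eq. 3.5.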
For a platform with n CPU cores, we denote the eigenvalues of A as β₁, β₂, ..., β_n with β₁ ≥ β₂ ≥ ... ≥ β_n. To solve Eq. 3.14, we first show that the solution for a sum of power signals can be obtained as the sum of the solutions for each signal.

Theorem 1 (Superposition). The thermal response due to any combination of power signals is equivalent to the sum of the thermal responses caused by each of those power signals.

Proof. Assume that θ₁ and θ₂ are the thermal responses caused by executions P₁(t) and P₂(t), and that θ₃ is the thermal response caused by P₃(t) = P₁(t) + P₂(t). From Eq. 3.13:

dθ₁(t)/dt = A θ₁(t) + B P₁(t), and dθ₂(t)/dt = A θ₂(t) + B P₂(t).

By adding these two differential equations, we have:

d/dt (θ₁(t) + θ₂(t)) = A (θ₁(t) + θ₂(t)) + B (P₁(t) + P₂(t))

This is equivalent to the differential equation of θ₃:

dθ₃(t)/dt = A θ₃(t) + B P₃(t) = A θ₃(t) + B (P₁(t) + P₂(t))

Therefore, θ₃(t) = θ₁(t) + θ₂(t). □

Let V denote the n × n matrix of eigenvectors of A, where column i is the i-th eigenvector, and let D be the diagonal n × n matrix whose diagonal entries are the eigenvalues of A. The solution of the non-homogeneous part in Eq. 3.14 can be written as:

∫_{t₀}^{t} e^{(t−s)A} B P(s) ds = ∫_{t₀}^{t} e^{(t−s)A} B (P_∞ + P_o(s)) ds = ∫_{t₀}^{t} e^{(t−s)A} B P_∞ ds + ∫_{t₀}^{t} e^{(t−s)A} B P_o(s) ds   (3.15)

where P_∞ and P_o(s) are the constant and oscillating parts of the power signal vector, respectively:

P_∞ = [P_{j∞}]_{n×1} = [P_{Sj} + P_{Dj} u_j]_{n×1}

P_o = [P_{oj}]_{n×1} = [ Σ_{k=1}^{∞} (2 P_{Dj} / (kπ)) sin(u_j kπ) cos( (2kπ/T_j)(s − ψ_j) ) ]_{n×1}

with ψ_j = u_j T_j / 2 + φ_j. Since P_∞ is constant,

∫_{t₀}^{t} e^{(t−s)A} B P_∞ ds = −A^{−1} (I − e^{(t−t₀)A}) B P_∞   (3.16)

For the second integral we have:

∫_{t₀}^{t} e^{(t−s)A} B P_o(s) ds = V e^{tD} ∫_{t₀}^{t} e^{−sD} V^{−1} B P_o(s) ds = V e^{tD} ·
[ Σ_{i=1}^{n} ∫_{t₀}^{t} e^{−β_j s} V^{−1}_{ji} B_{ii} P_{oi}(s) ds ]_{n×1}   (3.17)

Similar to the single-core case, we can use the function S from Eq. 3.3, but in matrix form. We define the matrix S as:

S(t) = [S_{ij}(t)]_{n×n} = [ S(β_i, P_{Dj}, u_j, T_j, φ_j, t) ]_{n×n}

We have ∫ e^{−β_j s} P_{oi}(s) ds = e^{−β_j s} S_{ji}(s). Therefore:

[ Σ_{i=1}^{n} ∫_{t₀}^{t} e^{−β_j s} V^{−1}_{ji} B_{ii} P_{oi}(s) ds ]_{n×1} = [ Σ_{i=1}^{n} e^{−β_j t} V^{−1}_{ji} B_{ii} S_{ji}(t) − e^{−β_j t₀} V^{−1}_{ji} B_{ii} S_{ji}(t₀) ]_{n×1}

To write this in matrix notation, we define B′ = diag(B), an n × 1 vector formed from B such that B′_j = B_{jj}. Then,

[ Σ_{i=1}^{n} e^{−β_j t} V^{−1}_{ji} B_{ii} S_{ji}(t) − e^{−β_j t₀} V^{−1}_{ji} B_{ii} S_{ji}(t₀) ]_{n×1} = e^{−tD} (V^{−1} ∘ S(t)) B′ − e^{−t₀D} (V^{−1} ∘ S(t₀)) B′

where V^{−1} ∘ S(t) is the Hadamard product of the matrices V^{−1} and S(t), i.e., the n × n matrix whose element ij is the product of the corresponding elements of V^{−1} and S(t): (V^{−1} ∘ S(t))_{ij} = V^{−1}_{ij} S_{ij}(t). Substituting this in Eq. 3.17 gives:

V e^{tD} · [ Σ_{i=1}^{n} ∫_{t₀}^{t} e^{−β_j s} V^{−1}_{ji} B_{ii} P_{oi}(s) ds ]_{n×1} = V ( V^{−1} ∘ S(t) − e^{(t−t₀)D} (V^{−1} ∘ S(t₀)) ) B′   (3.18)

Substituting Eq. 3.16 and Eq. 3.18 into Eq. 3.15 yields a general solution for the thermal response of an n-core CPU platform with distinct input power signals to each core:

θ(t) = e^{(t−t₀)A} θ₀ − A^{−1} (I − e^{(t−t₀)A}) B P_∞ + V ( V^{−1} ∘ S(t) − e^{(t−t₀)D} (V^{−1} ∘ S(t₀)) ) B′   (3.19)

This expression for θ(t), which contains only simple matrix operations, can be used directly to compute the transient temperature profile of multi-core CPUs. It is also important to know which workload execution patterns cause the peak heat dissipation.
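For a constant power vector (i.e., with the oscillating term S dropped), Eq. 3.19 reduces to the first two terms and can be computed with a few lines of matrix algebra. A sketch using the eigendecomposition A = V D V^{−1} from the derivation above; the coupling matrices and power values are hypothetical.

```python
import numpy as np

def multi_core_theta(A, B, P_const, theta0, t):
    """First two terms of Eq. 3.19 for a constant power vector:
    e^{tA} theta0 - A^{-1}(I - e^{tA}) B P. The matrix exponential is
    built from the eigendecomposition A = V D V^{-1}."""
    evals, V = np.linalg.eig(A)
    expA = (V @ np.diag(np.exp(evals * t)) @ np.linalg.inv(V)).real
    I = np.eye(A.shape[0])
    return expA @ theta0 - np.linalg.inv(A) @ (I - expA) @ B @ P_const
```

At t = 0 the expression returns the initial condition, and for large t it converges to the steady-state vector −A^{−1} B P, mirroring the single-core plateau −(B/A)(P_S + P_D u).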
In this section, based on the model developed in the previous sections, we explore and discuss the worst-case scenarios for workload execution.

Consecutive workload execution. First, we prove that the time to reach the maximum temperature is minimized if all workloads execute consecutively. Suppose that, for an arbitrary execution of workloads in a period T, the CPU executes some workload for t₁, sleeps for t₂, then wakes up and executes for t₃ time units, and these working-sleeping switches occur n times. For the CPU idle time, Eq. 3.5 reduces to θ(t) = θ₀ e^{βt} if the static power is ignored (i.e., α = 0). Hence, the temperature evolves within a period as:

θ_{t₀} = θ_L
θ_{t₁} = α + (θ_{t₀} − α) e^{βt₁}
θ_{t₂} = θ_{t₁} e^{βt₂}
...
θ_{tₙ} = θ_{tₙ₋₁} e^{βtₙ}

where Σ t_i = T and t_i > 0 for all i. In the steady state, θ_{tₙ} = θ_L, which gives:

θ_L = α ( e^{β(t₁+t₂+···+tₙ)} − e^{β(t₂+···+tₙ)} + ··· − e^{βtₙ} ) / ( e^{βT} − 1 )

where t_wk = Σ_{i odd} t_i is the total execution time and t_slp = Σ_{i even} t_i is the total idle time. We prove that the total execution time needed to reach the maximum temperature is minimized when there is only one execution chunk. In other words, the worst-case thermal scenario happens when all workloads execute consecutively.

For clarity, consider the following two scenarios for a given period under the maximum temperature constraint:

• All workloads execute consecutively, reaching the maximum temperature after t_wk time units; the CPU then sleeps until the beginning of the next period, for t_slp time units (Fig. 3.4a).

• A portion of the workloads executes for t₁ time units, the CPU sleeps for t₂ time units, and the rest of the workloads run until the CPU reaches the maximum temperature at t₁ + t₂ + t₃; afterwards the CPU sleeps for t₄ time units (Fig. 3.4b).

Figure 3.4: Temperature change in one period when the CPU operates (a) for t_wk time units, (b) for t₁ and t₃ time units.
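The two scenarios of Fig. 3.4 can be compared numerically, since each heating or cooling segment has the closed form of Eq. 3.5. A sketch with hypothetical α and β: splitting the same amount of execution into two chunks lowers the steady-state peak, which is the content of Theorem 2 below.

```python
import math

def run_pattern(segments, alpha, beta, theta0=0.0, cycles=200):
    """Piecewise-exact temperature over repeated periods: active segments
    follow Eq. 3.5, idle segments follow theta(t) = theta * e^{beta t}
    (static power ignored). Returns the peak of the final cycle, i.e. the
    steady-state maximum. segments: list of (duration, active) tuples."""
    theta = theta0
    peak = -float("inf")
    for c in range(cycles):
        for dur, active in segments:
            if active:
                theta = alpha + (theta - alpha) * math.exp(beta * dur)
            else:
                theta = theta * math.exp(beta * dur)
            if c == cycles - 1:
                peak = max(peak, theta)
    return peak
```

Because heating rises monotonically toward α and cooling decays monotonically, the per-segment endpoints capture the extrema, so no finer time discretization is needed.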
The following shows that the budget of a thermal server should be bounded based on the first scenario.

Theorem 2. The amount of waking time needed to reach the maximum temperature constraint is minimized when all workloads execute consecutively.

Proof. We want to prove that t_wk ≤ t₁ + t₃. To prove the statement, we find the minimum steady-state temperature in each scenario individually. For the first scenario, the minimum steady-state temperature is

θ_L = e^{βt_slp} (α − α e^{βt_wk}) / (1 − e^{βt_wk} e^{βt_slp}).

Accordingly, the maximum temperature in the steady state is

θ_M = α − e^{βt_wk} ( α + e^{βt_slp} (α − α e^{βt_wk}) / (e^{βt_wk} e^{βt_slp} − 1) ) = α (e^{βt_wk} − 1) / (e^{βT} − 1).

For the second scenario, the maximum temperature is reached after t₁ + t₂ + t₃ time units. Therefore,

θ′_M = α ( e^{β(t₁+t₂+t₃)} − e^{β(t₂+t₃)} + e^{βt₃} − 1 ) / ( e^{βT} − 1 ).

Since the same maximum temperature constraint applies to both scenarios (even though θ_L and θ′_L differ), θ_M = θ′_M. Hence,

α (e^{βt_wk} − 1)/(e^{βT} − 1) = α ( e^{β(t₁+t₂+t₃)} − e^{β(t₂+t₃)} + e^{βt₃} − 1 ) / ( e^{βT} − 1 ),

and therefore e^{βt_wk} = e^{β(t₁+t₂+t₃)} − e^{β(t₂+t₃)} + e^{βt₃}. Now we prove that, for any values of t₁, t₂, and t₃, t₁ + t₃ ≥ t_wk.

Contradiction: suppose the hypothesis is not correct. Then t₁ + t₃ < t_wk, so β(t₁ + t₃) > βt_wk (since β < 0), and e^{β(t₁+t₃)} ≥ e^{βt_wk}. Hence,

e^{β(t₁+t₃)} ≥ e^{βt_wk} = e^{β(t₁+t₂+t₃)} − e^{β(t₂+t₃)} + e^{βt₃}
⇔ e^{βt₁} ≥ e^{β(t₁+t₂)} − e^{βt₂} + 1
⇔ e^{βt₁} − 1 ≥ e^{β(t₁+t₂)} − e^{βt₂}
⇔ e^{βt₁} − 1 ≥ e^{βt₂} (e^{βt₁} − 1)
⇔ e^{βt₂} ≥ 1  (dividing by e^{βt₁} − 1 < 0)

which is a contradiction since β < 0 and t₂ > 0. □

Corollary 2.1. The maximum temperature decreases when the period and the waking time are halved, since this is the special case where t₁ = t₃ and t₂ = t₄.

Server budget calculation under the polling server budget replenishment policy. We now calculate the "maximum" budget that a server can have while keeping the operating temperature from exceeding the given thermal constraint on a single-core platform.
The worst case for the polling server happens when it exhausts all of its replenished budget at the beginning of its period and then sleeps until the beginning of the next replenishment period. For a server period T,

T = t_wk + t_slp.   (3.20)

In the steady state of the system, we are interested in bounding the server's maximum temperature. According to Eq. 3.5, α + (θ_L − α) e^{βt_wk} ≤ θ_M. By calculating the minimum temperature at the end of the period and substituting it into this formula, the maximum budget for a period T is given by:

t_wk ≤ (1/β) ln( (θ_M (e^{βT} − 1) + α) / α ).   (3.21)

Suppose that the CPU workload runs at the end of one period and the workload of the next period executes at the beginning of that period. This causes burst heat generation by back-to-back execution, which can potentially lead to a thermal violation; it is noteworthy that this case has been shown in Corollary 2.1. The worst case occurs when this phenomenon happens repetitively in the steady state.

Server budget calculation under the deferrable server budget replenishment policy. For back-to-back execution, we consider doubled waking (2 t_wk) and sleeping (2 t_slp) times, and determine the maximum waking time.

Theorem 3. The maximum waking time under thermal back-to-back execution is

t_wk = (1/(2β)) ln( (θ_M (e^{2βT} − 1) + α) / α ).

Proof. Considering the period of the workload as twice that of the previous example, we have 2 t_wk + 2 t_slp = 2T. The maximum temperature is reached after 2 t_wk time units, so θ_M = α + (θ_L − α) e^{2βt_wk}. Similarly to Eq. 3.21, we have 2 t_wk = (1/β) ln( (θ_M − α)/(θ_L − α) ). Since the minimum temperature in the steady state is reached after 2 t_slp time units, θ_M e^{2βt_slp} = θ_L, which leads to t_slp = (1/(2β)) ln(θ_L/θ_M). Hence,

t_slp + t_wk = T ⟹ (1/(2β)) ln( (θ_M − α)/(θ_L − α) ) + (1/(2β)) ln(θ_L/θ_M) = T.

Therefore, the minimum temperature in the steady state is

θ_L = α θ_M e^{2βT} / ( θ_M (e^{2βT} − 1) + α ).

By substituting the value of θ_L into t_wk, the maximum waking time for the period T is obtained. □
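Eq. 3.21 and Theorem 3 can be evaluated directly. A sketch with hypothetical parameters; note that the back-to-back (deferrable) bound is always the smaller of the two, reflecting its more pessimistic execution pattern.

```python
import math

def polling_budget(T, alpha, beta, theta_M):
    """Eq. 3.21: maximum budget of a polling server with period T."""
    return (1 / beta) * math.log((theta_M * (math.exp(beta * T) - 1) + alpha) / alpha)

def deferrable_budget(T, alpha, beta, theta_M):
    """Theorem 3: maximum budget under back-to-back (deferrable) execution,
    i.e. with the doubled waking/sleeping window."""
    return (1 / (2 * beta)) * math.log((theta_M * (math.exp(2 * beta * T) - 1) + alpha) / alpha)
```

As a consistency check, feeding the steady-state maximum temperature produced by a 40%-utilization pattern back into Eq. 3.21 recovers the original budget.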
Corollary 3.1. The maximum achievable utilization of a server under the maximum temperature constraint θ_M is θ_M/α.

Proof. The maximum utilization is obtained when the period converges to 0. By L'Hôpital's rule,

lim_{T→0} u = lim_{T→0} ( 2β θ_M e^{2βT} ) / ( 2β (θ_M (e^{2βT} − 1) + α) ) = θ_M / α.

The same approach applies to polling servers. □

Unlike the claim in [73, 100], which states that the worst-case peak temperature occurs when the system warms up under a periodic pattern until it reaches the steady state and is then hit by a burst workload, we will show that, by employing thermal-aware servers for mixed-criticality tasks, this can never happen.

Theorem 4. The maximum temperature of a CPU for any workload execution pattern in the steady-state condition is less than that of the periodic back-to-back execution pattern.

Proof.

Figure 3.5: Workload execution pattern in the steady state.

We consider the two scenarios illustrated in Fig. 3.5. Without loss of generality, it is assumed that the system has already warmed up. The green line in Fig. 3.5 shows the back-to-back execution pattern continuing in the steady-state phase. The orange line depicts the worst-case burst workload that can be executed (according to Theorem 2, all of the replenished budget is exhausted at the beginning of the period). Let θ₁ and θ₂ represent the maximum temperatures of the burst execution and of the normal back-to-back execution, respectively. Hence,

θ₁ = α + (θ_M e^{β(T−t_w)} − α) e^{βt_w} = α + θ_M e^{βT} − α e^{βt_w}

and

θ₂ = α + (θ_M e^{2β(T−t_w)} − α) e^{2βt_w} = α + θ_M e^{2βT} − α e^{2βt_w}.

Let ∆θ denote the temperature difference between these scenarios:

∆θ = θ₂ − θ₁ = θ_M e^{2βT} − θ_M e^{βT} + α e^{βt_w} − α e^{2βt_w}.

By considering u₀ = θ_M/α and substituting the maximum waking time determined in Theorem 3, for which e^{2βt_w} = X with

X = (θ_M (e^{2βT} − 1) + α) / α = u₀ e^{2βT} − u₀ + 1,

we must show that

∆θ/α = u₀ e^{2βT} − u₀ e^{βT} + √X − X > 0.
Since u₀ e^{2βT} − X = u₀ − 1, the inequality ∆θ > 0 is equivalent to

√X > 1 − u₀ + u₀ e^{βT}.

The right-hand side is positive, so we may square both sides; substituting X = u₀ e^{2βT} − u₀ + 1 and simplifying yields

u₀ (1 − u₀) (e^{βT} − 1)² > 0,

which holds for any u₀ ∈ (0, 1). Therefore ∆θ > 0. □

Now, we extend our analysis to multi-core platforms where each CPU core hosts only one thermal-aware server. We will show that the minimum replenishment budget of polling servers on a multi-core CPU occurs when all of them exhaust their budgets completely at the same time. Afterwards, the maximum available server budget will be determined.

Theorem 5. The waking time allowed by a given maximum temperature constraint is minimized when the server on each core exhausts all of its budget at once, simultaneously.

Proof. Assume a dual-core CPU whose cores have the same physical characteristics and surrounding conditions. Both cores have the same periodic step power signal, but with a phase difference of φ. For simplicity we assume P_S is zero and P = P_D. The input power of the first core, P₁(t), starts at t = t₀, while the input power of the second core, P₂(t), starts with a phase change at t = t₀ + φ. We consider two cases: in case I, the phase change is zero (φ = 0), and in case II, it is positive (φ > 0).

Figure 3.6: Power signal of the cores for (a) case I: in-phase, and (b) case II: out-of-phase.

Lemma 1. In a multi-core system, if the power signal of each core varies with time in the way described above, the magnitude of the temperature change rate is maximized when the powers are in-phase (φ = 0).

For a dual-core CPU.
According to the developed model, the temperatures of the two cores between t₀ and t₁, while the powers P₁ and P₂ are constant, are:

[θ₁; θ₂] = (1/2) e^{β₁(t−t₀)} [θ₁₀ + θ₂₀; θ₁₀ + θ₂₀] + (1/2) e^{β₂(t−t₀)} [θ₁₀ − θ₂₀; θ₂₀ − θ₁₀] + (B/(2β₁)) (e^{β₁(t−t₀)} − 1) [P₁ + P₂; P₁ + P₂] + (B/(2β₂)) (e^{β₂(t−t₀)} − 1) [P₁ − P₂; P₂ − P₁]

Figure 3.7: Temperature profiles of in-phase and out-of-phase cases for (a) a two-core and (b) a three-core system. (Blue lines represent the temperatures of the cores in the in-phase state and other colors represent the temperatures of the cores in the out-of-phase state.)

Suppose both cases start from temperatures θ₁₀ and θ₂₀ at time t₀ and reach their final temperatures at time t₁, as shown in Fig. 3.8. We want to show that |∆θ_in-phase| ≥ |∆θ_out-of-phase| for any time slot between t₀ and t₁ in which the power remains constant.

Figure 3.8: Comparison of temperature (a) increase and (b) decrease between in-phase and out-of-phase cases.

We calculate the derivative of the temperature for the two cases. When there is power in the in-phase case (P ≠ 0), the temperature change rate is positive. For the in-phase case, P₁ + P₂ = 2P and P₁ − P₂ = 0. For the out-of-phase case, P₁ + P₂ = P and P₁ − P₂ = −P, because P₁ = 0 and P₂ = P (or vice versa). We have (I represents in-phase and II represents out-of-phase):

(dθ/dt)_I − (dθ/dt)_II = (B P / 2) ( e^{β₁(t−t₀)} − e^{β₂(t−t₀)} )

Since β₁ ≥ β₂ for both cores and B and P are positive, (dθ/dt)_I ≥ (dθ/dt)_II. The same conclusion can be drawn if P₁ = P and P₂ = 0.
We can conclude that ∆θ_in-phase ≥ ∆θ_out-of-phase from the fact that if f ≥ g on [a, b], and f and g are integrable on [a, b], then ∫_a^b f dx ≥ ∫_a^b g dx. If the power in the in-phase case is zero (P₁ + P₂ = 0 and P₁ − P₂ = 0), the temperature change rate is negative, and for P₁ = 0 and P₂ = P we have:

(dθ/dt)_I − (dθ/dt)_II = −(B P / 2) ( e^{β₁(t−t₀)} − e^{β₂(t−t₀)} )

Since β₁ ≥ β₂ for both cores, (dθ/dt)_I ≤ (dθ/dt)_II. The same conclusion can be drawn if P₁ = P and P₂ = 0. Therefore, when the temperature increases, the in-phase state has the highest temperature change rate, and when the temperature decreases, the in-phase state has the lowest temperature change rate. In conclusion, |∆θ_in-phase| ≥ |∆θ_out-of-phase|.

For a multi-core CPU. For a multi-core system with increasing temperature, the difference between the temperature derivatives of the in-phase (I) and out-of-phase (II) cases between t₀ and t₁, while the power signal does not change, is:

(dθ/dt)_I − (dθ/dt)_II = (B/n) e^{β₁(t−t₀)} (n − k₁) P [1]_{n×1} − (B/k₂) e^{β₂(t−t₀)} [m_{2j} P]_{n×1} − ··· − (B/k_n) e^{β_n(t−t₀)} [m_{nj} P]_{n×1}

where the k_i, i = 2, ..., n, are integers satisfying 1 ≤ k_i ≤ n, and 1 ≤ k₁ < n, since at least one core is idle when the powers are out-of-phase. The coefficients m_{ij} depend on how many cores are active, with m_{ij} < k_i. Since n > k₁, the first term is positive, and since β₁ ≫ β₂, ..., β_n and all the β's are negative, the other terms decay exponentially to zero and can be neglected compared to the first term. Therefore, (dθ/dt)_I ≥ (dθ/dt)_II. It can be shown in the same way that when there is no power in the in-phase case, the temperature is decreasing and (dθ/dt)_I ≤ (dθ/dt)_II.
Lemma 2. For the described power signals, the maximum temperature of the in-phase case is larger than that of either core in the out-of-phase case (θ_M ≥ θ′_M).

First, note that θ_ave = −A^{−1} B P_∞ is the same for both cases in the steady state. Therefore, θ_M + θ_L = θ′_M + θ′_L. Assume that θ_M < θ′_M; then θ_L > θ′_L. Let t_w be the time it takes the in-phase temperature to go from θ_L to θ_M, and t_s the time it takes to go from θ_M to θ_L; let t′_w and t′_s be the corresponding times for the out-of-phase case (from θ′_L to θ′_M and back). If θ_M < θ′_M and θ_L > θ′_L, then t_w would be smaller than t′_w, because the temperature increase rate is largest in the in-phase case. At the same time, t_s < t′_s, since the temperature decrease rate is also largest in the in-phase case. Therefore, t_w + t_s < t′_w + t′_s. But the periods of the two cases are the same, t_w + t_s = t′_w + t′_s, which is a contradiction. □

Experimental example. To support our claim, we measure the temperature of the big cores on the Exynos 5422 SoC [26] using a FLIR A325sc IR camera [29] with a sampling rate of 60 frames per second. Four periodic, computationally intensive workloads are run on the four big cores of the board. In all cases, the CPU frequency is set to 1.4 GHz, and each workload executes every 10 seconds with a utilization of 40%. Let φ(i) denote the delay of the starting time of the workload execution on core i. Table 3.1 shows the configuration of each case.

Table 3.1: Descriptions of workload executions on big cores.
Name   | Description                                                        | Workload delay settings (seconds)
Case 1 | all workloads execute at the same time                             | φ(1) = φ(2) = 0, φ(3) = φ(4) = 0
Case 2 | workloads on two cores begin after those of the other cores        | φ(1) = φ(2) = 0, φ(3) = φ(4) = 4
Case 3 | workloads on two cores begin 1 second before those on the other cores finish | φ(1) = φ(2) = 0, φ(3) = φ(4) = 3
Case 4 | workloads on each core execute with a 2-second overlap             | φ(1) = 0, φ(2) = 2, φ(3) = 4, φ(4) = 6

Fig. 3.9 shows our observations of the maximum temperature of the SoC for the described cases in the steady state. In Case 1, where all workloads execute synchronously, the maximum temperature of the chip reaches its highest value and its minimum temperature is the lowest among all cases. On the contrary, in Case 4, where the workloads of different cores have the least overlap, the maximum temperature of the chip fluctuates the least of all cases and stays close to the average temperature in the steady state. In Case 2, where there is no overlap between the executions of the two pairs of workloads, the rate of temperature rise is lower than in Case 3, where there is a 1-second overlap between all workload executions.

Figure 3.9: The maximum temperature of the big cores in the Exynos 5422, captured by a FLIR A325sc IR camera at an operating frequency of 1.4 GHz without a heat sink.

In this section, we extend our analysis to multiple servers running on a multi-core CPU at each criticality level. We are interested in whether there are enough sleeping slacks between the active times of the servers to cool down the CPU. The cooling time must be large enough that the CPU does not exceed the maximum temperature constraint under any circumstance. Hence, we should check whether the amount of sleeping time required for cooling is guaranteed within a given period of time. We propose an idle thermal server technique for this purpose.
Unlike regular servers, the idle server does not execute; instead, its budget represents the amount of time that the CPU core needs to be idle in the cooling phase. The rationale behind this technique is to simplify the modeling of the idle time resulting from the execution of multiple regular servers as a single budget parameter. Hence, the idle server's budget can be determined such that the maximum CPU operating temperature caused by heat dissipation during the idle server's inactive time is the same as or higher than that of running the regular servers under any task execution pattern. If such an idle server is schedulable, one can conclude that the given taskset is thermally schedulable. The thermal effect of multiple servers is analyzed through the complement signal of the periodic idle-server execution, whose worst-case behavior has been proved by Theorems 1-5.

After analytically finding the relation between the budget and the period of the idle server, we check its schedulability. Since the idle server does not actually exist on the CPU, it is considered the lowest-priority server in the schedulability test. In our proposed framework, the CPU is not forced to sleep, so the proposed idle server has no effect on the timing schedulability of the regular running servers.

The idle server utilization corresponds to the time during which all regular servers are deactivated. Based on this, we investigate the feasibility of the idle server with its minimum possible period within a valid range. In this work, we focus on designing homogeneous idle servers, i.e., all idle servers on different CPU cores share the same parameters at each criticality level. It is worth mentioning that this does not mean that the sleeping times on different cores must happen at the same time. Finding different idle-server settings for each CPU core is beyond the scope of this chapter.
Idle server design. First, we compute the total utilization of the CPU core c at the criticality level l by u^l_c = Σ_{∀i: P(v_i)=c ∧ v_i ∈ V^l} C_i / T_i, where P(v_i) is the CPU core assigned to v_i. The maximum per-core CPU utilization is then u^l_max = max_{∀c} u^l_c. Due to the homogeneity of idle servers on all cores, u^l_max is considered as the utilization of one core, so that each core is supposed to be idle for at least 1 − u^l_max. As a result, we are looking for a server that is schedulable with a utilization of at least 1 − u^l_max. Let u^l_idle denote this value. According to Theorem 3, the period of the idle server, T^l_idle, can be modeled as:

    T^l_idle ≤ ln(1 + (α e^{−u^l_idle} + 1) / (2β − θ_M)) / (2β).    (3.22)

It is worth noting that the period is determined based on the back-to-back execution in the presence of multiple servers for both polling and deferrable budget replenishment policies, since servers can preempt each other and sleeping time does not happen in a contiguous manner. T^l_idle is an increasing function of utilization. As one can see from Eq. 3.22, an increase in the workload on the CPU core (hence a decrease in u^l_idle) leads to a decrease in T^l_idle. In other words, under a heavier workload, the CPU has to sleep more frequently but for shorter durations to satisfy the maximum temperature constraint.

Thermal schedulability. Here, we present our proposed thermal schedulability test for a specific utilization of idle servers. The worst-case response time of the idle server of each core at each criticality level can be obtained as follows:

    R^{n+1,l}_{idle,c} = u^l_idle × T^l_idle + Σ_{∀i: P(v_i)=c ∧ v_i ∈ V^l} ⌈R^{n,l}_{idle,c} / T_i⌉ C_i    (3.23)

where R^{n+1,l}_{idle,c} denotes the worst-case response time of the idle server on core c at criticality level l. As shown in Eq. 3.23, the idle server of each core can be preempted by all active servers at l.
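The fixed-point computation in Eq. 3.23 can be sketched as follows. This is a minimal illustration, not code from the dissertation: the server budgets and periods are made-up placeholders, and treating the idle server's period as its deadline is an assumption of this sketch.

```python
import math

def idle_server_response_time(u_idle, T_idle, servers, limit=None):
    """Iterate R^{n+1} = u_idle*T_idle + sum(ceil(R^n / T_i) * C_i), as in Eq. 3.23.

    `servers` lists (C_i, T_i) budget/period pairs of the regular servers on the
    same core at the current criticality level. Returns the converged response
    time, or None once it exceeds `limit` (taken here to be the idle server's
    period, meaning the test fails).
    """
    if limit is None:
        limit = T_idle
    R = u_idle * T_idle          # the idle server's own budget
    while True:
        R_next = u_idle * T_idle + sum(math.ceil(R / T_i) * C_i
                                       for C_i, T_i in servers)
        if R_next > limit:
            return None          # idle server cannot complete in time
        if R_next == R:
            return R
        R = R_next

# Two hypothetical regular servers (budget, period) in ms on one core:
print(idle_server_response_time(0.3, 100.0, [(10.0, 50.0), (5.0, 80.0)]))  # → 45.0
```

If the iteration converges below the idle server's period, the required cooling time is guaranteed on that core even under worst-case preemption by the regular servers.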
The only unknown parameter in the above test is the idle server setting. Next, we discuss the valid range of the idle server's settings to reduce the number of tests at each criticality level.

Figure 3.10: Search space for the idle server settings.

Optimal server setting range. As discussed for Eq. 3.22, the period of the idle server is an increasing function of its utilization. Fig. 3.10 plots the period with respect to the utilization of the idle server. The area of the plot is divided into four regions. Only the highlighted area needs to be searched to find the valid settings of idle servers.

• Region 1: In this region, the CPU violates the maximum temperature constraint according to Eq. 3.22.

• Region 2: The utilization of the idle server has to be less than 1 − u^l_max. Choosing a setting in this region causes the system to be unschedulable at this criticality level.

• Region 3: Since idle servers have the lowest priority, they cannot preempt other servers. According to Eq. 3.23, in the worst case all servers preempt the idle servers.

• Region 4: Valid settings can be found in this region, but it is unnecessary to search this entire region because, if there exists a valid setting with period T′, a solution is also valid in the highlighted range with the same utilization value.

Therefore, only the period T_idle = ln(1 + (α e^{−u_idle} + 1)/(2β − θ_M))/(2β) within the optimal range of utilization u_idle needs to be checked. One may try to insert T_idle into Eq. 3.23 and assign T_idle to R^{n+1,l}_{idle,c} to find a single optimum point. However, because of the ceiling operation in Eq. 3.23, finding the optimum u_idle requires an exhaustive search.

After finding the idle server settings, the critical ambient temperature for each criticality level is computed by using Eq. 3.9 with u = 1 − u^l_idle. The shifting time from criticality level l + 1 to l can be determined by applying Eq. 3.12.
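The search over the valid setting range described above can be sketched as follows. This is a hedged illustration: the closed form for the period follows this chapter's Eq. 3.22 as reconstructed here, and the constants α, β, θ_M as well as the server parameters are arbitrary placeholders, not measured values.

```python
import math

ALPHA, BETA, THETA_M = 4.0, 0.05, 0.08   # hypothetical thermal constants

def max_period(u_idle):
    """Largest idle-server period allowed by Eq. 3.22 (placeholder constants)."""
    return math.log(1.0 + (ALPHA * math.exp(-u_idle) + 1.0)
                    / (2.0 * BETA - THETA_M)) / (2.0 * BETA)

def find_idle_server(u_max, servers, step=0.01):
    """Scan u_idle over [1 - u_max, 1) and return the first (u, T) whose idle
    server passes the response-time test of Eq. 3.23 as the lowest-priority
    server on the core; `servers` holds (budget, period) pairs."""
    u = 1.0 - u_max
    while u < 1.0:
        T = max_period(u)
        budget = u * T
        R = budget
        while True:                       # response-time fixed point
            R_next = budget + sum(math.ceil(R / T_i) * C_i
                                  for C_i, T_i in servers)
            if R_next > T:
                break                     # unschedulable at this u_idle
            if R_next == R:
                return u, T               # valid setting found
            R = R_next
        u += step
    return None

u, T = find_idle_server(0.7, [(0.5, 5.0)])   # one regular server (budget, period)
```

Because of the ceiling operator in the inner test, the scan cannot be replaced by a closed-form optimum, matching the exhaustive-search argument above.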
Timing schedulability. Due to our separate thermal schedulability analysis, we are able to use the existing response time test developed for independent tasks with no thermal constraints under hierarchical scheduling [77]:

    W^{n+1,l}_i = E_i + Σ_{τ_h ∈ V^l(τ_i), h>i} ⌈(W^{n,l}_i + J_i + (W^l_h − E_h)) / D_h⌉ E_h + ⌈(W^{n,l}_i + C_j) / T_j⌉ (T_j − C_j)    (3.24)

where W^{n,l}_i is the worst-case response time of τ_i at criticality level l, V^l(τ_i) is the server of τ_i, J_j is the jitter of a task running in a server v_j (see Sec. 3.3), and W^{0,l}_i = E_i.

3.7 Evaluation

This section gives the experimental evaluation of our framework. First, we show the model validation by measuring the physical system parameters on a real platform. After the analysis of server characteristics at different ambient temperatures, we present our discussion with a case study.

The experimental platform is an ODroid-XU4 development board [64] equipped with a Samsung Exynos 5422 SoC. There are two different CPU clusters, of little Cortex-A7 and big Cortex-A15 cores, where each cluster consists of four homogeneous cores. Built-in sensors with a sampling rate of 10 Hz and a precision of 1°C are on each big CPU core to measure the chip temperature. Note that there are no temperature sensors on the little cores since the power consumption and heat generation of the little cluster is considerably low. The DTM throttles the frequency of the big CPU cluster to 900 MHz when one of its cores reaches the predefined maximum temperature constraint of 95°C. During the experiments, the CPU fan is turned off and the CPU is set to run at 1400 MHz.

According to Eq. 3.3 and 3.19, the characteristic matrices A and B can be determined by a utilization test at different CPU frequencies. A zero utilization (idle) at different ambient temperatures can reveal the range of the static and consequently the dynamic dissipation.
After finding the thermal parameters of the system and calibrating the model, the analytical model is validated against the experimental results. For this purpose, the CPU temperature is recorded at 90% utilization with a period of 1 s at a Θ_amb of 23°C. Then the utilization is decreased to 30% and the CPU is cooled down until reaching a steady state. Afterwards, the CPU working at 30% is placed in the furnace with a Θ_amb of 42°C. The CPU temperature is recorded until the steady state. The same conditions are simulated using the developed model. Fig. 3.11 compares the CPU temperature recorded in the experiment at different stages of workloads and ambient temperature with the values predicted by the model. It can be seen that there is a good agreement between the model and the experimental results.

Figure 3.11: Comparison of the experimental results and the model prediction for CPU temperature at different workloads and ambient temperatures.

We investigate the effect of ambient temperature, server period, and utilization using Eq. 3.9 and Eq. 3.10. Assuming the temperature threshold of Θ_M = 95°C, the maximum allowable utilization is plotted in Fig. 3.12a against period and ambient temperature. It can be seen that for all considered periods, when the ambient temperature is increased from 23°C, the maximum workload decreases almost linearly with the ambient temperature. At higher ambient temperatures, lower workloads can be used, until the ambient temperature of 69°C where even an idle CPU results in a working temperature equal to the threshold temperature. To better see the effect of the period, the maximum workload is plotted against the period at different ambient temperatures in Fig. 3.12b. The period is varied from 10 ms to 30 s. It can be seen that the maximum workload decreases with increasing period in an almost linear manner. This has been discussed and confirmed in Theorem 2.
Moreover, it can be concluded that the effect of ambient temperature on the maximum allowable workload is more prominent than the effect of the period.

Figure 3.12: a) Utilization versus period and ambient temperature. b) Utilization versus period at different ambient temperatures.

As mentioned before, any change in the working parameters of the system may cause a transient thermal behavior and change the steady-state conditions. Here, we discuss the effects of changing the workload and the ambient temperature on the thermal response of the CPU cores. In Fig. 3.13a, shifting times are plotted against the final ambient temperature at different workloads, assuming the initial ambient temperature is 23°C or 50°C. It can be seen that at higher workloads it takes less time for the CPU to reach the steady state when the ambient temperature is changed from Θ_amb,i to Θ_amb,f. Also, at all workloads, it takes more time for the CPU to reach the final ambient temperature Θ_amb,f if it starts from a lower initial ambient temperature Θ_amb,i. Furthermore, for all utilizations, the time it takes for the CPU to reach the steady state when the ambient temperature goes up by ∆Θ is almost equal to the time it takes when it goes down by the same amount.

In Fig. 3.13b, shifting times are plotted against the final workload u_f for different initial workloads u_i. It can be seen that it takes more time to reach the steady state of the final workload u_f when starting from a lower workload u_i. Also, opposite to the case of ambient temperature, if u_i < u_f, it takes less time to reach the steady state when shifting from u_i to u_f (heating) than when shifting from u_f to u_i (cooling).

We emulate the mixed-criticality Flight Management System (FMS) application [3, 35], which comprises two criticality levels, H and L. The parameters of the executing real-time tasks are given in [3].
In our experiment, there is one high-criticality and one low-criticality server on each CPU core. The budget replenishment period of each thermal-aware server is 50 ms under the deferrable budget replenishment policy. The budgets of the high-criticality and low-criticality servers are 15 ms and 27 ms, respectively. Tasks are assigned to CPU cores by using the worst-fit decreasing (WFD) heuristic for load balancing across cores and are scheduled by the Rate Monotonic (RM) policy. Since the amount of workload at the low-criticality level is insufficient to reach the maximum temperature, non-real-time tasks are also assigned to the low-criticality servers at the lowest priority level.

Figure 3.13: Shifting time a) from an initial ambient temperature to different final ambient temperatures at various workloads, b) from different initial workloads to different final workloads at an ambient temperature of 23°C.

Critical ambient temperatures have been determined by Eq. 3.9: 24°C for the low-criticality and 40°C for the high-criticality level. As shown in Fig. 3.14, the experiment has been performed in a furnace. A Nordic Semiconductor Thingy:52™ IoT sensor development kit [89] is used to capture the ambient temperature with a sampling rate of 10 Hz.

Figure 3.14: Experimental environment using the furnace.

Fig. 3.15 shows the experimental and model results for a case study with a period of 50 ms. In step I, the CPU idles for 1800 s, and then in step II, a workload with u = 95% is applied at Θ_amb = 24°C. The system is left to work with u = 95% until it reaches steady-state conditions and keeps working for about 10000 s. The CPU temperature reaches a value of around 89°C. Afterwards, in step III, the CPU is placed in the furnace with Θ_amb = 40°C and the workload is changed to 30% at the same time.
The temperature increases to 92°C and remains steady for about 4000 s. The CPU is then taken out of the furnace and left to work with u = 30% for fast cooling. Finally, in step IV, the workload is set to 95% at the ambient temperature Θ_amb = 24°C. It can be seen that the developed model matches the experimental results with good accuracy. The time step of the model is finer than that of the actual temperature sensor, and the temperature variations within each period can be captured by the developed model. The temperature curve from 6000 s to 6002 s is zoomed in for a better comparison of the variations.

Figure 3.15: Experimental and model results for a case study with a period of 50 ms at two ambient temperatures and workloads.

In this chapter, we proposed a novel mixed-criticality thermal-aware server framework to bound the maximum temperature of CPU cores in the presence of dynamic ambient temperature. In this framework, the server schedule is flexible – fully preemptive and priority-based. We investigated thermal feasibility by analyzing the amount of slack between executions of preemptive thermal servers with the notion of idle servers. We presented a mechanism to optimally search for the maximum ambient temperature for every criticality level. We provided analytical foundations to check thermal safety while temporal safety is guaranteed. Experimental results show that our proposed framework is effective in bounding the maximum temperature at every criticality level.

Chapter 4

Data-driven Thermal Parameter Estimation for COTS-based Mixed-criticality Systems

Thermal awareness is increasingly important for mixed-criticality systems deployed in harsh environments. As high chip temperature can cause frequency throttling or shutdown of processor cores at an unexpected time, many real-time scheduling techniques have been developed to ensure continuous, fail-safe operation of safety-critical tasks with stringent timing constraints.
However, their practical use remains largely limited due to the fact that it is extremely difficult to obtain a precise thermal model of commercial processors without using special measurement instruments or access to proprietary information, such as the power traces of micro-architectural units and detailed floorplan maps.

In this chapter, we propose a data-driven thermal parameter estimation scheme that is directly applicable to commercial off-the-shelf multi-core processors used in real-time mixed-criticality systems. By using a small number of thermal profiles obtained from on-chip temperature sensors, our scheme can predict and bound the processor operating temperature under dynamic real-time workloads at various CPU frequencies and ambient conditions. The thermal model derived from our scheme is fast to converge and robust against different sources of errors. Our scheme is non-intrusive, meaning that it does not require changes to the software code or the hardware packaging of the target system. Furthermore, our scheme can estimate the relative power consumption of the processor for a given workload and clock frequency level. We evaluate the effectiveness of our scheme on a multi-core ARM platform.

One of the major concerns in recent embedded systems is the high heat dissipation caused by complex applications running on high-performance multi-core systems-on-chips (SoCs). Ambient temperature in the physical environment is another key factor that increases chip operating temperature. High temperature also increases power consumption [6], reduces chip reliability [87, 93], and can lead to chip burnout. For these reasons, many of today's operating systems (OSs) monitor the chip operating temperature using on-chip temperature sensors to check whether it is within the safe thermal constraint.

To protect the processor chip from thermal damage, a set of policies is defined in the thermal governor of the OS for different scenarios.
When the chip temperature crosses a trip point, a predefined cooling scenario is performed, such as frequency throttling or shutting down CPU cores [18, 37, 97]. Such thermal countermeasures, however, lead to timing unpredictability in real-time mixed-criticality systems (MCS) since the deadlines of tasks could be unexpectedly violated by reduced processing speed or temporarily unavailable CPU cores. Extensive studies have been conducted to prevent the negative impact of such performance disruption, including techniques based on Dynamic Voltage and Frequency Scaling (DVFS) [30, 61, 70, 78, 101] and forced sleeping during the execution of hot applications [15, 41, 59, 53, 44, 101]. Moreover, offline analysis has been proposed to guarantee the thermal safety of real-time tasks by bounding the maximum operating temperature in the steady [38, 39] and transient states [66] of a given system. These approaches all assume a priori knowledge of precise thermal models for temperature prediction, resource management, and task scheduling, and need accurate simulation tools or extra equipment for validation. Therefore, obtaining the thermal parameters of a given system is the fundamental requirement for substantiating these techniques in practice.

Despite its importance, the thermal parameter estimation of commercial off-the-shelf (COTS) processors remains a challenging problem. Existing numerical simulation tools like HotSpot [42] can construct a compact thermal model for modern VLSI devices by using a resistance-capacitance (RC) thermal network to capture the transient temperature and generate the heatmap at each time instant. However, they are not directly applicable to modern COTS multi-core processors because the information required by these tools, such as power traces of micro-architectural units, detailed floorplan maps, and the cooling package, is proprietary and not publicly available.
An exhaustive search approach for approximating this information through reverse engineering is time-consuming and prone to unacceptably high inaccuracies, especially for transient temperature estimation.

The latest work [72, 74] partially addresses these issues by calibrating thermal parameters without power traces and detailed floorplans. However, it is not applicable to MCS where execution patterns are dynamic due to preemptive, priority-driven, work-conserving task schedulers [38, 39, 66, 3, 20]. Similar to HotSpot, it uses a time-driven prediction model where the temperature is calculated for each time instant under static workloads. This makes the model not only very slow but also unable to capture the correct operating temperature when multiple real-time tasks with different periodicity are concurrently scheduled.

In this chapter, we propose a fast and accurate scheme to estimate the thermal parameters of COTS multi-core processors for real-time MCS. Our scheme requires only a small number of temperature traces from on-chip thermal sensors, which are widely available in today's processors. Our scheme also improves the accuracy of thermal parameters through the ensemble of measurements from different frequency levels and execution patterns.

Contributions. The contributions of this chapter are as follows:

• We present a thermal estimation scheme that has low computational cost by design. Given that steady-state profiles are much more compact than transient-state profiles, our scheme first estimates the thermal parameters of a given system using only steady-state profiles, and then uses transient-state data for calibration purposes.

• We characterize various sources of errors in thermal parameter estimation, and reduce their negative effects through the multiple refinement stages of our scheme. Our scheme also enables locating errors in the temperature profiles.
• Our scheme can identify the relative distance between CPU cores and produce an estimated chip floorplan from temperature profiles. It can also estimate the relative power consumption for a given workload on each CPU core.

• We present techniques to further improve the accuracy of thermal parameters by exploiting the ensemble of measurement data obtained at various frequency and workload settings.

• The effectiveness and accuracy of our proposed scheme is demonstrated with extensive experimental results on a real ARM embedded platform.

There exist well-known numerical tools to estimate the chip operating temperature. For instance, HotSpot solves the system of differential equations using the fourth-order Runge-Kutta numerical method through very fine-granularity iterations. The authors of [36, 56] constructed a thermal model with the measured power and temperature trace of each subsystem on a real mobile platform. Power Blurring [103] calculates temperature distributions using a matrix convolution technique, in contrast to the finite-element analysis (FEA) used in other simulation tools like HotSpot. To use these tools in practice, one needs to obtain detailed information about the target device, including power traces and floorplans, or to arrange special measurement equipment such as a high-precision IR camera.

The authors of [72, 74] proposed a calibration-based method to predict thermal behavior by using an impulse response model. They assume that the power consumption of each application task remains unchanged during execution. Similar to [21, 3], they employed the Generalized Pencil-of-Function (GPOF) method [40] to calculate the impulse response of each application from utilization and temperature traces. In their work, the thermal effects due to conduction between CPU cores are represented as impulse responses. However, if an application task is preempted by another task or gets suspended, the impulse function has to switch.
This spatio-temporal thermal dependency cannot be captured in their model for preemptively-scheduled tasks on the same CPU core or concurrently-executing tasks on other CPU cores. For this reason, despite all the other benefits provided, the applicability of their approach to real-time mixed-criticality systems is limited. In this chapter, we present a thermal model based on the matrix exponential, which overcomes the aforementioned limitations of prior work and allows estimating the key thermal parameters that represent the characteristics of the semiconductor technology, independent of the specificity of applications.

We consider a homogeneous multi-core processor where each CPU core uses the same microarchitecture. Each core is assumed to have a dedicated temperature sensor that is accessible by the OS at runtime. This assumption can be easily met in many commercial processors such as Intel Core i7 and Samsung Exynos products. Reflecting reality, it is assumed that the following information is not available: the chip floorplan, the exact locations of the on-chip temperature sensors, and the power traces of the processor.

In the rest of this section, we introduce the power and temperature model used in this chapter.

4.3.1 Power Model

The total power consumption of CMOS circuits at time t is modeled as the summation of dynamic and static power [13], i.e., P(t) = P_S(t) + P_D(t). Static power P_S depends on the semiconductor technology and the operating temperature due to current leakage. Hence, it can be modeled as P_S(t) = k_1 θ(t) + k_2, where k_1 and k_2 are technology-dependent system constants and θ(t) is the operating temperature [60]. Dynamic power P_D(t) is the amount of power consumption due to the processor operating frequency f at time t, modeled as P_D(t) = k_3 f(t)^s, where s and k_3 are system constants that depend on the semiconductor technology.
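As a toy illustration of this power model, the sketch below uses made-up constants; k_1, k_2, k_3 and the exponent s are unknown for a real COTS chip and would have to be estimated.

```python
# Hypothetical constants for illustration only (not measured values).
K1, K2 = 0.002, 0.15   # static power:  P_S(t) = k1 * theta(t) + k2
K3, S = 0.9, 3.0       # dynamic power: P_D(t) = k3 * f(t)^s  (f in GHz here)

def static_power(theta):
    """Leakage-dependent static power at operating temperature theta (°C)."""
    return K1 * theta + K2

def dynamic_power(f_ghz):
    """Switching power at clock frequency f_ghz."""
    return K3 * f_ghz ** S

def total_power(theta, f_ghz):
    """P(t) = P_S(t) + P_D(t)."""
    return static_power(theta) + dynamic_power(f_ghz)

# Power grows with both temperature (leakage) and frequency (switching):
print(total_power(60.0, 1.4))
```

Note the temperature feedback: higher operating temperature raises static power, which in turn raises temperature, which is why the LTI thermal model that follows treats power as temperature-dependent.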
We consider the temperature model widely used in real-time mixed-criticality systems [3, 20], which follows the well-known linear time-invariant (LTI) model. Hence, the temperature model for a multi-core CPU with n cores is given by the following equation:

    [θ′(t)]_{n×1} = A_{n×n} [θ(t)]_{n×1} + B_{n×n} [P(t)]_{n×1}    (4.1)

where θ(t) is the n × 1 vector of core operating temperatures and P(t) is the power consumption of all cores at time t. A is an invariant n × n matrix based on the characteristics of the semiconductor technology. It quantifies the effect of conduction between adjacent cores, convection among all cores, and the difference between ambient and operating temperature.

B is a diagonal n × n matrix that captures the effect of power consumption on the temperature of each core. For homogeneous multi-core CPUs, since the total power of the CPU cores is the same (either P_S or P_S + P_D), the matrix B can be represented as b × I, where I is the n × n identity matrix. Similar to A, the value of b is invariant to changes in static or dynamic power consumption.

Hence, the problem is to estimate the values of the matrices A and B (= b × I) without any prior knowledge or direct measurement of CPU power consumption. We will show that it is impossible to estimate the value of b without any knowledge of CPU power consumption; instead, we estimate B × P, which is what matters in calculating the temperature in the LTI model.

Given a multi-core CPU equipped with on-chip temperature sensors, our goal is to construct an accurate and fast thermal RC model by estimating A and B × P of the CPU, exclusively from a limited number of temperature profiles and without requiring a priori knowledge of the floorplan, cooling package, or power traces.

Before continuing our discussion, we must address some limitations of thermal parameter estimation on real-life platforms. Identifying these potential sources of error is critical to our data analysis and comprehension.
It allows us to address the noise and limitations of our profile data set by preprocessing the raw data and to improve the accuracy of thermal parameters through the ensemble of measurements from different frequency levels and workload settings.

4.4.1 Built-in Sensors

The largest sources of error in the raw data are the built-in temperature sensors.

Sensor Locations

The data sampled from the CPU's built-in temperature sensors is sensitive to the physical location of the sensors within the CPU cores. For example, the thermal sensor for one CPU core may be located near its primary source of heat dissipation (i.e., hotspot) while the thermal sensor for another CPU core may be located further away from the CPU's hotspot. In addition to the proximity of the thermal sensor to the CPU's hotspot, the physical location of the hotspot may vary depending on the type of workload. Thus, distinct thermal footprints of various applications may not be precisely captured by a single built-in sensor.

Sensor Sampling Frequency

In CPUs for embedded systems, the sampling rate of temperature sensors is generally very low. This limitation causes inaccuracies in the raw data. For instance, the ODroid XU4 board maintains a 10 Hz sampling rate for the thermal sensors. If the utilization pattern of a prospective application changes in less than 100 ms, e.g., real-time control tasks activated every 30 ms, the thermal footprint cannot be captured with the built-in sensors.

Sampling Precision

The data from on-chip CMOS temperature sensors is subject to quantization. This generates another source of error because the accuracy of temperature measurements is reduced to the granularity of quantization. Hence, the data needs to be processed before and after the estimation procedure to ensure the correctness of our results.
For instance, if the sampling precision of all CPU core temperature sensors is 1°C, then according to the superposition law in thermal modeling [38], the magnitude of the total temperature error for a quad-core processor can reach up to 4°C.

Sensor Response Function

Due to differences in the construction and architecture of on-chip thermal sensors, their response times may vary. Hence, the sensor data may not represent the actual temperature if the CPU utilization changes in a relatively short time interval. Over time, if there is no fluctuation in CPU utilization, the sensor data will converge to the actual CPU core temperature. Therefore, we can assume that this type of error only affects the transient-state data while the steady-state data is unaffected.

Temperature is a value relative to the ambient temperature in our thermal model. Although it is assumed that the ambient temperature remains invariant during profiling, in reality it may change, even in a room or a thermal furnace.

Varying Ambient Temperature

The ambient temperature in a room can change slightly, or even by several degrees, given circumstances that are difficult to regulate such as poorly insulated walls, drastic changes in the outside temperature, and even heat dissipated from idle electrical outlets. This change can affect the operating temperature of the CPU. Therefore, fluctuation of the ambient temperature while profiling can introduce noise into the raw data. This type of noise impacts the relative operating temperature.

Varying Air Convection

Heat convection may also introduce noise into our data sets (thermal profiles). Forced air convection caused by air conditioning, the active cooling package of the CPU, or even people just moving around the board can all contribute to fluctuations in the CPU cores' heat dissipation. This type of noise directly affects the heat transfer from the CPU cores to the ambient air; hence the values of the diagonal elements of the matrix A change.
Heat dissipation due to miscellaneous tasks can also impact the accuracy of the data set. Even though one expects the operating temperature of CPU cores to approach the ambient temperature when the CPU cores are disabled (e.g., by using the CPU hotplug mechanism in Linux), the operating temperature is much higher than this level.

OS Tasks

There are some essential OS service tasks that cannot be terminated while profiling thermal data. Some of them even have predefined CPU affinity, thereby affecting the observed thermal behavior of the target CPU cores.

Cache Coherent Interconnect (CCI)

Under cache coherency policies, the CCI, which is close to the CPU cores, generates some heat during task execution, which causes an increase in the operating temperature of the CPU cores.

Workload on Intellectual Property (IP) Blocks

Running tasks on IP blocks, such as integrated GPUs, video encoders/decoders, and digital signal processors (DSPs), can significantly affect the overall heat dissipation. Additionally, the static power consumed while IPs are idle can also cause non-trivial heat dissipation. If there is no change in the utilization status of the IPs, we can consider the amount of heat dissipated by those IPs as a constant quantity in thermal modeling. Nevertheless, this type of noise can continuously or temporarily affect the profiling of thermal data.

In this section, we introduce our scheme for estimating the thermal parameters of a given multi-core processor. The entire workflow of the proposed scheme is illustrated in Fig. 4.1. The very first step is profiling steady-state temperature data for a set of designed workloads. Then, the scheme removes noise from the raw data set (2), and performs the floorplan estimation (3). By using the estimated floorplan template and the collected steady-state data, the value of the matrix A is estimated in terms of the power parameters (4). The parameters B × P are then estimated by analyzing a subset of the transient-state data at the final stage (5). One of the reasons that we propose dividing the analysis
One of the reasons that we propose dividing analysis107 rofile Profile n+1 Profile ⋮ 𝑌 = 𝜃 ⋯ 𝜃 𝑛,1∞ ⋮ ⋱ ⋮𝜃 ⋯ 𝜃 𝑛,𝑛∞ 𝑋 𝑍 = 𝑧 𝑖 × 𝑌 𝑖 − 𝑌 + 𝑌 ⋮ ⋮ ⋮ 𝑨 & 𝑩𝑷 Steady-State Profiling ① ② Noise Reduction and Anomaly Detection Pre-processing for noise reduction Thermal profile ( 𝒀 ) Post-processing for anomaly detection Floorplan Estimation ③ Thermal Adjacency Graph(template construction) ④⑤ Thermal Parameter Estimation with Transient-State Data Thermal Parameter Estimation with Steady-State Data Figure 4.1: Parameter Estimation Scheme into the steady-state and the transient-state stages ( 4 and 5 ) is to cope with the varioustypes of errors that may be introduced during temperature profiling. If one tries to tacklethis problem by using both data at the same time, errors in the transient-state data canadversely affect the characterization of the system in steady state because the transient-state data has a much larger number of data points than the steady-state data. Moreover,our proposed scheme has a low computational cost as it requires processing only a few datapoints in the steady-state stage to obtain the thermal characteristics of the system. The proposed scheme primarily uses the state-state data for thermal parameterestimation since it is more robust to measurement errors than the transient-state data butstill contains the required information about semiconductor technology and power consump-tion. Hence, before introducing the detailed stages of our scheme, we analyze the thermalmodel and provide the reasoning behind the steady-state profiling.Our goal is to estimate the thermal parameters related to the steady-state data.After solving the first order equation of Eq. 4.1, we have108 ( t ) = θ chip + e ( t − t ) A θ + (cid:90) tt e ( t − s ) A BP ( s ) ds (4.2)where [ θ chip ] n × is the total heat dissipation caused by IP blocks on the chip and idle powerof the CPU cores. We assume that all IPs generate a constant amount of heat. 
This term captures the heat conduction between all parts of the chip and the CPU; it can be measured when the CPU cores have been idle for a very long time. It is worth noting that, due to the locations of the CPU cores and their sensors on the floorplan, the steady-state temperatures of idle CPU cores differ from one another. The second term is the homogeneous solution, which is the thermal response due to the initial temperature difference from the ambient. The third term is the non-homogeneous solution caused by the input power signal.

The total power is a function of temperature at any time instant t because the other factors, such as frequency, remain invariant during task execution. For homogeneous multi-core CPUs, the power of each core is P_S(t) when the core is idle, and P_S(t) + P_D(t) when the core executes some workload that fully utilizes it. Hence, Eq. 4.2 can be written as:

θ(t) = θ_chip + e^{(t−t_0)A} θ_0 − A^{−1}(I − e^{At}) BP.    (4.3)

In the steady state, the second term disappears: the steady-state operating temperature depends only on the power consumption (third term) and is unaffected by the initial temperature θ_0. Let θ_∞ represent the operating temperature in the steady state; then:

θ_∞ = lim_{t→∞} θ(t) = θ_chip − A^{−1} BP.    (4.4)

We are interested in finding the values of the matrices A and BP. It is worth noting that the only control parameter is the power signal: for each profiling procedure, it is possible to execute a workload on any subset of CPU cores or to hot-unplug them, but the actual value of the power remains unknown. Let [Y_i]_{n×1} denote the operating temperatures of the CPU cores when the i-th core is fully utilized, and Y_0 represent the temperatures of the CPU when all CPU cores are idle. Furthermore, let [Y]_{n×n} = [Y_1 Y_2 ... Y_n]^T be the matrix of temperature profiles of the CPU in the steady state.
For instance, y_{i,j}, the element in row i and column j of Y, is the temperature of the j-th CPU core when tasks are executing only on the i-th CPU core. Hence, according to Eq. 4.4,

Y − [Y_0 Y_0 ... Y_0]^T = P_D (−A^{−1} B).    (4.5)

We solve this equation to find the value of A. Therefore,

A = −P_D b (Y − [Y_0 Y_0 ... Y_0]^T)^{−1}.    (4.6)

By denoting Ã = (Y − [Y_0 Y_0 ... Y_0]^T)^{−1} and γ = b P_D, we have

A = −γ Ã.    (4.7)

Therefore, by estimating Ã and γ, we can determine the value of A. Ã can be determined by profiling without any knowledge of the power consumption. The value of γ, however, cannot be determined from the steady-state profiles alone, even with temperature profiles encompassing various CPU frequencies.

As shown in Eq. 4.7, we only need n + 1 profiles to estimate the value of Ã: one profile measured when all cores are idle, and n profiles obtained when each core is fully utilized, one at a time. However, because of the errors and limitations discussed in Section 4.4, there may be a considerable amount of noise in the estimate of Ã. If the profiles are noiseless, Ã will be a symmetric matrix with positive values on the diagonal and non-positive values on the non-diagonal elements. A zero at ã_{i,j} represents that there is no heat conduction between cores i and j. It is noteworthy that a non-symmetric Ã caused by noisy data sets can lead to imaginary eigenvectors, which is impossible in practice. We will later propose several methods to address different types of noise. Moreover, we will extend our analysis to reach a more precise Ã by infusing more data profiles. By using those techniques, it is not necessary to have the exact single-core profiles Y_i for profiling purposes; it is still possible to benefit from multiple auxiliary data sets obtained from the same core configuration.

One interesting property of A and Ã is that both have the same eigenvectors. Additionally, the eigenvalues of A are −γ times the eigenvalues of Ã.
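As a concrete illustration of Eqs. 4.5–4.7, the sketch below recovers Ã from n single-core steady-state profiles. All matrix values and temperatures are illustrative, not measurements from any platform, and `estimate_A_tilde` is a hypothetical helper name:

```python
import numpy as np

# Illustrative ground truth: symmetric, positive diagonal,
# non-positive off-diagonal, zeros for non-adjacent core pairs.
A_tilde_true = np.array([[ 0.20, -0.05, -0.02,  0.00],
                         [-0.05,  0.20,  0.00, -0.02],
                         [-0.02,  0.00,  0.20, -0.05],
                         [ 0.00, -0.02, -0.05,  0.20]])

def estimate_A_tilde(Y, Y0):
    """A~ = (Y - [Y0 Y0 ... Y0]^T)^-1, where row i of Y is the
    steady-state temperature vector with only core i fully utilized,
    and Y0 is the all-idle steady-state temperature vector."""
    delta = Y - Y0[np.newaxis, :]   # subtract the idle baseline from each row
    return np.linalg.inv(delta)

# Synthesize noiseless profiles from the ground truth and recover it.
Y0 = np.full(4, 40.0)
Y = Y0[np.newaxis, :] + np.linalg.inv(A_tilde_true)
A_tilde = estimate_A_tilde(Y, Y0)
```

With noiseless synthetic profiles the recovered Ã matches the ground truth and exhibits the symmetry and sign structure described above.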
We will use these properties to find the value of γ later, and to justify why it is infeasible to estimate the value of the power consumption from thermal profiling.

4.5.2 Noise Reduction and Anomaly Detection

Compensating for the various sources of errors in the steady-state data is critical to accurately estimating the system's thermal parameters. In this section, we discuss two low-cost noise reduction and anomaly detection procedures for the steady-state data.

Pre-processing for Noise Reduction

The most challenging sources of errors in the steady-state data are the built-in CPU core temperature sensors and the uncontrollable fluctuations in the ambient temperature. Heat dissipation due to all types of task interference can be captured in θ_chip. We will show in the evaluation section that this amount is almost invariant for any given operating frequency. Therefore, we focus our discussion on the other types of errors. The accuracy of the steady-state data is crucial to calibrating the thermal parameters with transient-state data in the later stage of our scheme. It is also important for estimating the relative locations of the CPU cores on the floorplan, since those estimates depend on the accuracy of this stage.

The errors that may be found in transient-state data, such as those due to the sampling frequency and response time of the built-in sensors, do not occur in steady-state data, because the steady-state temperature converges to a certain level and remains invariant. To overcome the limitation of sampling precision, we pre-process the steady-state temperature data by taking a moving average. Although the temperature should remain invariant throughout the steady state, the limitation introduced by sensor data quantization may cause fluctuations. Suppose that the precision of the built-in temperature sensor is 1°C and the actual steady-state core temperature is 55.1°C. The data will usually reflect a sensor reading of 55°C; on less frequent occasions, however, it may reflect a reading of 56°C.
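The quantization effect in this example can be smoothed by the moving-average pre-processing. A minimal sketch, with illustrative readings and an arbitrary window size:

```python
import numpy as np

def moving_average(samples, window):
    """Simple moving-average low-pass filter for quantized
    steady-state sensor readings (window size is arbitrary)."""
    kernel = np.ones(window) / window
    return np.convolve(samples, kernel, mode='valid')

# 1-degree-quantized readings of a true ~55.1 C steady state:
# mostly 55, occasionally 56 (illustrative trace, not measured data).
readings = np.array([55, 55, 55, 56, 55, 55, 55, 55, 55, 56], dtype=float)
smoothed = moving_average(readings, window=10)
```

Averaging over the whole window recovers a sub-degree estimate (here 55.2°C) from 1°C-quantized samples, which is the point of the filter.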
Aside from the precision concern, some transient noises, such as varying air convection, temporary OS tasks, or a varying ambient temperature, can also accumulate in the steady state. To alleviate the effects of the limited sensor precision and of these transient noises, one can apply a low-pass filter, such as a moving average over an arbitrary time window, to the steady-state data before constructing the observation matrix Y.

Post-processing for Anomaly Detection

As discussed in Sec. 4.5.1, our scheme needs steady-state temperature profiles that are obtained when only one of the CPU cores is fully utilized at a time. If more profiled data are available, it is possible to detect whether there is any persistent error in the temperature profiles. For instance, if the ambient temperature in one profile is different from that in the other profiles, it can be detected and rectified. Our post-processing error detection tests whether the observed data Y_i are consistent with auxiliary profiles that are obtained when more than one CPU core is fully utilized. If any error is found in the observed data Y_i, the corresponding column of Y is rectified with the correct value.

We now present the details. Let X_Z denote the steady-state temperatures of the CPU cores in a profile test, where Z = z_1 z_2 ... z_n is the predicate that shows the status of the CPU cores in that test, and z_i ∈ {0, 1} represents whether the i-th CPU core is busy (z_i = 1) or not (z_i = 0). We are interested in testing the primary observed data Y_i against auxiliary data X_Z to detect prospective errors in Y. According to the thermal superposition theorem [38],

X_Z = Σ_{i=1}^{n} z_i × (Y_i − Y_0) + Y_0.    (4.8)

Suppose that the available tests for a hypothetical 3-core CPU are Y_i for i ∈ [1, 3] and the auxiliary profile X_{110} (cores 1 and 2 busy). Hence, X_{110} = Y_1 + Y_2 − Y_0. In the ideal condition where there are no errors in the data, this equation must hold. Now suppose there is erroneous data in one of the profiles; we are interested in detecting or correcting it.
By adding an error term ε to the equation, we have X_{110} = Y_1 + Y_2 − Y_0 + ε_{110}. If there is no error in the data, one can assume that ε_{110} = [0, 0, 0]^T. If there is an error in one of the tests, it is detectable but not correctable. Now suppose that another test, X_{101} = Y_1 + Y_3 − Y_0 + ε_{101}, is available. In this case, it is possible to detect not only the presence of an error but also its location in the profiles. In our example:

I) if ε_{110} = ε_{101} = 0, there is no single error in the steady-state values of any of Y_1, Y_2, or Y_3;

II) if ε_{110} = 0 but ε_{101} ≠ 0, there is an error in either Y_3 or X_{101};

III) if ε_{110} ≠ 0 but ε_{101} = 0, there is an error in either X_{110} or Y_2;

IV) if ε_{110} ≠ 0 and also ε_{101} ≠ 0, there is an error in the steady-state values of Y_1.

In order to detect and correct the error, we design a test in an n-hypercube format. Each corner of the hypercube represents one measurement setting Z, and there is only a one-bit difference between the Z values of two neighboring corners. In this way, it is possible to examine the correctness of each test against its neighbors, and, with a sufficient number of tests, to spot the error.

We explain the procedure with an example. Fig. 4.2a illustrates an auxiliary test X; the highlighted (blue) side verifies the consistency of one pair of primary profiles. If all the data are consistent, one can conclude that there is no single error in those profiles. If there is an error in one data set, an additional test is needed, as we explain later. Fig. 4.2b illustrates the corresponding auxiliary test for the other pair of primary profiles.

[Figure 4.2: (a) Auxiliary test for detecting one error in the first pair of primary profiles. (b) Auxiliary test for detecting one error in the second pair. (c) Second-tier testing of the auxiliary tests against each other.]
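The superposition check of Eq. 4.8 that underlies these cases can be sketched as follows; all profile values are illustrative and `superposition_residual` is a hypothetical helper name:

```python
import numpy as np

def superposition_residual(X_Z, z, Y_rows, Y0):
    """eps = X_Z - (sum_i z_i (Y_i - Y0) + Y0), the residual of the
    thermal superposition relation (Eq. 4.8). A near-zero residual
    means the auxiliary profile X_Z is consistent with the primary
    single-core profiles Y_i."""
    predicted = Y0 + sum(zi * (Yi - Y0) for zi, Yi in zip(z, Y_rows))
    return X_Z - predicted

# Illustrative 3-core profiles.
Y0 = np.array([40.0, 40.0, 40.0])
Y1 = np.array([50.0, 44.0, 42.0])   # only core 1 busy
Y2 = np.array([44.0, 50.0, 44.0])   # only core 2 busy
Y3 = np.array([42.0, 44.0, 50.0])   # only core 3 busy

X_110 = Y1 + Y2 - Y0                        # clean profile, cores 1+2 busy
eps_ok = superposition_residual(X_110, [1, 1, 0], [Y1, Y2, Y3], Y0)

X_bad = X_110 + np.array([2.0, 0.0, 0.0])   # inject a persistent 2-degree error
eps_bad = superposition_residual(X_bad, [1, 1, 0], [Y1, Y2, Y3], Y0)
```

A clean auxiliary profile yields a zero residual, while an injected offset shows up directly in the residual vector, which is exactly what the case analysis above exploits.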
If there is no single error, the data on this second side are also consistent. Now suppose that both sides are inconsistent. Then it is easy to conclude that the data of Y_1 is faulty, because we assume that there is only one error in the observation profiles and the edge (Y_0, Y_1) is the intersection of both sides. Since Y_0 is the zero base and not faulty, Y_1 is the noisy profile.

We now discuss case III, where ε_{110} ≠ 0 and ε_{101} = 0. To locate the error in the data, we use an auxiliary test that forms a side with two of the suspicious tests (the yellow side in Fig. 4.2c). If the data of this side are consistent, the error belongs to the suspected primary profile; otherwise, it belongs to the auxiliary test X. By employing the same technique, it is possible to extend the proposed method to two errors in the tests by adding another tier that checks the auxiliary tests against each other. Additionally, it is possible to conclude whether there exists a single error or two errors in the data. It is important to mention that, in this chapter, it is assumed that errors in two tests cannot conceal each other.

4.5.3 Floorplan Estimation

Here, we introduce a breadth-first-search (BFS) algorithm to estimate the topography of the CPU cores on the chip. Later, we use this information to calibrate the thermal term b P_D in the temperature model by using the transient-state data. One of the steps in refining the thermal parameters of a chip is to estimate the parameters in compliance with the floorplan. We present a mechanism to estimate the relative locations of the CPU cores on a given chip.

The intuition behind our proposed algorithm is that the amount of heat dissipated from a heat source to a closer object is larger than that to a farther object. Therefore, we expect the temperature increase due to heat conduction from adjacent CPU cores to be larger than that from non-adjacent CPU cores.
We introduce a fully-connected weighted graph in which the edge weight represents the temperature increase of a CPU core when a neighboring core is fully utilized. For instance, suppose there is a quad-core CPU where each core is labeled C_i with i ∈ [1, 4]. The edge weight w_{i,j} represents the temperature increase of core C_j when core C_i is fully utilized (i.e., y_{i,j} − y_{0,j}). It is worth noting that w_{i,j} ≠ w_{j,i}, not only because of noise but also due to the location of the built-in temperature sensor relative to the hotspot of each CPU core, although the amount of heat dissipated by each homogeneous core is ideally the same.

[Figure 4.3: Example of the initial adjacency graph of a quad-core CPU.]

We are interested in the relative temperature increase of core C_i when it is fully utilized compared to when another core C_j is fully utilized. Hence, we define w'_{i,j} = y_{i,i} − y_{i,j}. We also remove self-loops from the graph for simplicity. After this, the graph of Fig. 4.3 is reduced to a graph in which all weights are positive. Next, we convert the bi-directed graph to a uni-directed graph by merging the edges in the two directions with a reduction operation (e.g., min{w'_{i,j}, w'_{j,i}}), as shown in Fig. 4.4. It is noteworthy that one can use the average, the minimum, or any other reduction operation at this stage. As shown in Fig. 4.4, the temperature increase of CPU core C_4 when CPU core C_1 is fully utilized is due to the heat conduction of CPU cores C_2 and C_3: core C_4 heats up because of transitive heat transfer from C_1 to both C_2 and C_3, and then from these cores to C_4. In other words, there is only transitive heat conductivity between core C_1 and core C_4.

[Figure 4.4: Example of the reduced adjacency graph of a quad-core CPU.]

By using this graph together with the aforementioned intuition, we estimate the locations of the CPU cores on the floorplan.
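The edge-weight construction and the direction-merging step can be sketched as follows; the temperatures are illustrative and `adjacency_weights` is a hypothetical helper name:

```python
import numpy as np

def adjacency_weights(Y):
    """Reduced undirected edge weights w'_{i,j} = y_{i,i} - y_{i,j},
    with the two directions merged by min(); a smaller weight means
    stronger heat conduction, i.e. physically closer cores."""
    n = Y.shape[0]
    w = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            if i != j:
                w[i, j] = Y[i, i] - Y[i, j]
    return np.minimum(w, w.T)   # merge directions; diagonal (self-loops) stays 0

# Illustrative quad-core profiles: pairs (1,2) and (3,4) are adjacent,
# the (1,4) and (2,3) interactions are only transitive.
Y = np.array([[60.0, 56.0, 52.0, 51.0],
              [56.0, 60.0, 51.0, 52.0],
              [52.0, 51.0, 60.0, 56.0],
              [51.0, 52.0, 56.0, 60.0]])
W = adjacency_weights(Y)
```

In this toy example the adjacent pair (C_1, C_2) gets a small weight, while the transitively coupled pair (C_1, C_4) gets the largest one, matching the intuition above.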
To this end, we begin from one arbitrary core (say, C_1) and find the minimum-weight edges connected to its corresponding node in the graph. In this example, the minimum weight is 2, shared by CPU cores C_2 and C_3. Next, we find the minimum-weight edges of cores C_2 and C_3 from the unvisited node list, one at a time, and continue this step until all nodes are visited. If some of the edges incident to a visited core have the same weight (within some error margin), the corresponding cores are considered adjacent. In this example, the error margin is 1, so the cores are connected as illustrated in Fig. 4.5. The result shows that core C_2 (or at least its built-in temperature sensor) is closer to core C_1 than C_3 is.

[Figure 4.5: Final adjacency graph of a quad-core CPU.]

Furthermore, if the temperature profiles of on-chip IPs such as an integrated GPU are available, the same approach can be applied to locate the corresponding IP by using its temperature increase when each CPU core is fully utilized. In this case, there is only one edge from each CPU core C_i to the IP.

4.5.4 Thermal Parameter Estimation with Steady-State Data

We propose a method to calibrate the thermal parameters based on the floorplan estimation discussed in the previous section. Thermal parameter estimation that does not consider the floorplan may not be applicable in practice, because Y can be inaccurate and the properties of A may be violated if the floorplan is not taken into account. Some important properties of A according to heat transfer are as follows:

• A is a symmetric matrix.
• The elements on the diagonal of A are positive.
• A zero value in a non-diagonal element indicates that the corresponding pair of CPU cores is not adjacent.
• All non-zero non-diagonal elements are negative.

These properties of A guarantee that there exist n real eigenvalues and a set of n mutually orthogonal eigenvectors, one for each eigenvalue.
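These spectral properties can be checked numerically; `numpy.linalg.eigh`, which assumes a symmetric input, returns real eigenvalues and orthonormal eigenvectors. The matrix below is illustrative, chosen to satisfy the listed sign pattern:

```python
import numpy as np

# A symmetric template consistent with the floorplan properties:
# positive diagonal, non-positive off-diagonal, zeros for
# non-adjacent core pairs (illustrative values only).
A_sym = np.array([[ 0.20, -0.05, -0.02,  0.00],
                  [-0.05,  0.20,  0.00, -0.02],
                  [-0.02,  0.00,  0.20, -0.05],
                  [ 0.00, -0.02, -0.05,  0.20]])
eigvals, eigvecs = np.linalg.eigh(A_sym)   # real eigenvalues, orthonormal eigenvectors
```

Symmetry is what rules out the imaginary eigenvectors that a noisy, non-symmetric estimate of Ã would otherwise produce.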
This stage estimates the values of the thermal parameters in such a way that the result not only complies with the floorplan but also reduces the effect of noisy observed data on A.

The matrix Ã is constructed according to the estimated floorplan. If there is no thermal interference between two cores C_i and C_j in the estimated floorplan, the corresponding value in the matrix, i.e., ã_{i,j}, is zero. A single value can be used for all adjacent CPU core pairs in the matrix Ã that share the same heat-conduction increase. A gradient descent algorithm then estimates the non-zero values. For the quad-core CPU exemplified in the previous section, there exist 4 zeros in the matrix Ã because each CPU core has only 2 adjacent CPU cores. The number of distinct non-zero values in Ã is six in total: four different values on the diagonal of Ã, and two distinct values for the other non-zero elements. It is worth noting that, depending on the user's accuracy demand, more distinct non-zero values may be used. For the explained example, the matrix Ã has the template

Ã = [ a_1  a_5  a_6  0  ;  a_5  a_2  0  a_6  ;  a_6  0  a_3  a_5  ;  0  a_6  a_5  a_4 ].

We propose a gradient descent algorithm to find the parameters according to the estimated floorplan. For the calibration of the parameters, Ã is compared with the inverse of Y: Ã follows the estimated floorplan template, and its inverse must be close to the temperature profiles. We propose the cost function

argmin_{a_i, i ∈ [1, r]} || Ã^{−1} − Y ||_F    (4.9)

where r is the number of unknown parameters. The sign of each parameter has to be carefully considered in the implementation, taking into account the properties of A discussed before (e.g., diagonal elements should be positive and the rest non-positive).

4.5.5 Thermal Parameter Estimation with Transient-State Data

In this section, we discuss how to estimate the matrix B for homogeneous multi-core platforms. As we mentioned in Sec.
4.3.2, it is impossible to determine γ from the steady-state observations; hence, we must process the transient-state data.

As discussed in the steady-state section, if the power remains constant, the temperature equation becomes Eq. 4.3. Given the estimate of Ã, we only need γ, which is commensurate with the power consumption, to determine A. Substituting Eq. 4.7 into Eq. 4.3, we have:

θ(t) = θ_chip + e^{−(t−t_0)γÃ} θ_0 + (γÃ)^{−1}(I − e^{At}) bP.    (4.10)

Suppose that V and Λ are the eigenvectors and eigenvalues of Ã, respectively. Then e^{−(t−t_0)γÃ} can be represented as V e^{−(t−t_0)γΛ} V^{−1}. From our proposed steady-state scheme, V and Λ are already determined. We can estimate the value of γ from the transient-state data of a single observation; to better calibrate γ, multiple profiles may also be considered. Hence,

θ(t) = θ_chip + e^{−(t−t_0)γÃ} θ_0 + Ã^{−1}(I − e^{A(t−t_0)}) H    (4.11)

where H is an n × 1 vector whose i-th element is h_i ∈ {0, 1}; h_i = 1 when the i-th core is fully utilized. By substituting the eigenvalues and eigenvectors, the equation can be represented as

θ(t) = θ_chip + V e^{−(t−t_0)γΛ} V^{−1} θ_0 + Ã^{−1}(I − V e^{−(t−t_0)γΛ} V^{−1}) H.    (4.12)

The only missing part of the temperature model is γ, which can be estimated by curve fitting on the transient-state temperature. The values of P_D(t), θ_chip, and Ã change at different frequency levels, but the values of Λ, V, and b remain invariant under frequency change. Although the value of bγ is embedded in the temperature data and can be estimated, it is not possible to estimate the value of the power itself, even with temperature information at different frequency levels.

Since γ is a scalar value, we observed that curve fitting on only a few cores provides good results. The simplest test for estimating the value of γ is when all cores are cooling down.
In this case, the third term is eliminated, and the temperature equation for the i-th core reduces to

θ_i(t) = θ_chip,i + Σ_{λ_j ∈ Λ} v_{i,j} u_j e^{−(t−t_0)γλ_j}    (4.13)

where v_{i,j} is the element in the i-th row and j-th column of V, and u_j is the j-th element of u = V^{−1} θ_0.

4.6 Accuracy Enhancement

In this section, we extend our proposed scheme to improve its accuracy using additional steady-state data from different core settings and extra data observed at different frequencies.

4.6.1 Ensembles of Steady-State Profiles

We first discuss how to use an ensemble of extra observations at one frequency level to obtain a more accurate thermal observation profile. As discussed in the previous section, our method requires n + 1 observed steady-state temperature profiles for an n-core system to construct the matrix Y: one profile when all cores are idle and n profiles in which exactly one CPU core is fully utilized. Now we extend our analysis to answer the following questions:

• Is it possible to collect different CPU core usage settings (other than the aforementioned n + 1 observations) to construct Y?
• Is it possible to obtain a more precise Y if there exist multiple instantiations of an identical setting, by using all of them?
• Is it possible to construct Y with more than n + 1 steady-state profiles?

We are interested in generalizing the construction of Y to include any possible thermal traces of fully-utilized CPU core combinations, as well as ensembles of repeated thermal traces. It is noteworthy that, if there were no noise in the observed data and the data were not quantized, no extra data or multiple instantiations would be needed, due to the superposition law of Eq. 4.8.

Now, suppose we have m profiles such that n ≤ m ≤ 2^n − 1, each associated with a predicate Z_i. Based on these, we can construct the predicate matrix [D]_{m×n} = [Z_1 Z_2 ... Z_m]^T from the auxiliary profiles U = [X_{Z_1} X_{Z_2} ... X_{Z_m}]^T. The matrix Y is then generated in the least-squares sense as

Y = 1Y_0^T + (D^T D)^{−1} D^T (U − 1Y_0^T), where 1 denotes the all-ones column vector of appropriate dimension.
(4.14)

Because of the matrix inversion in the equation, it is applicable when the number of profiles is at least the number of CPU cores and D has full column rank, i.e., when there exist enough linearly independent profiles (for example, at least one profile for each CPU core in which that core is idle). The matrix Ã is then calculated from Y as explained in Sec. 4.5.4.

4.6.2 Ensembles of Multi-Frequency Profiles

We now discuss how to obtain a more accurate A by using additional data collected at multiple frequency levels. As explained in Sec. 4.3.2, the thermal parameter A depends on the semiconductor technology and remains invariant under frequency changes. Therefore, it would be logical to assume that observed temperature profiles from different frequency levels should lead to an identical A. However, due to the term γ in Eq. 4.7, Ã is proportional to the power consumption and hence to the clock frequency of the CPU cores. Based on this, we extend our scheme to answer the following questions:

• Is it possible to obtain an identical A from different Ys collected at different frequency levels?
• Is it possible to estimate the power consumption with respect to different frequency levels?

To answer these questions, we propose a thermal parameter estimation approach based on multi-frequency data ensembles. Let Y_i denote the temperature profile matrix constructed from the data when the CPU frequency is f_i, and let γ_i denote its power effect on temperature at the frequency f_i. Unlike the method presented in Sec. 4.5.4, which estimates the parameters of a single Ã, we calculate a base Ã at f_1. Additionally, we set γ_1 = 1 and estimate γ_i/γ_1 for all i > 1. In this case, tracking the values of γ over different frequency levels gives the power consumption of the CPU cores proportional to the γ of the base frequency level f_1, and all Y_i s share the same thermal parameter A. Using the same procedure as in Sec. 4.5.5 allows estimating the value of γ_1; hence, all γ_i s can be estimated.
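One reading of the least-squares reconstruction of Eq. 4.14 can be sketched as follows; the 3-core profiles are illustrative and noiseless, and `reconstruct_Y` is a hypothetical helper name:

```python
import numpy as np

def reconstruct_Y(D, U, Y0):
    """Least-squares reconstruction of the single-core profile matrix Y
    from m >= n auxiliary profiles (sketch of Eq. 4.14). Row k of U is
    the steady-state profile measured under core-activity predicate
    D[k], and Y0 is the all-idle profile."""
    U_rel = U - Y0[np.newaxis, :]                  # subtract the idle baseline
    Y_rel, *_ = np.linalg.lstsq(D, U_rel, rcond=None)
    return Y0[np.newaxis, :] + Y_rel

# Illustrative 3-core system with 4 auxiliary profiles (m > n).
Y0 = np.array([40.0, 40.0, 40.0])
Y_true = np.array([[50.0, 44.0, 42.0],
                   [44.0, 50.0, 44.0],
                   [42.0, 44.0, 50.0]])
D = np.array([[1, 0, 0],
              [0, 1, 0],
              [0, 0, 1],
              [1, 1, 0]], dtype=float)            # predicates Z_1..Z_4
U = D @ (Y_true - Y0) + Y0                        # noiseless superposition (Eq. 4.8)
Y_hat = reconstruct_Y(D, U, Y0)
```

With noisy measurements, the extra rows of D and U over-determine the system, and the least-squares solution averages the noise out, which is the point of using more than n + 1 profiles.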
It is noteworthy that, even in this case, the actual value of the power consumption cannot be computed, but the relative power consumption at different frequency levels can be obtained. In other words, one can observe the effect of the power consumption embedded in the temperature profiles. We change the cost model of Eq. 4.9 to

argmin_{a_j, j ∈ [1, r];  γ_i/γ_1, i ∈ [1, |f|]}  Σ_{i=1}^{|f|} || Ã^{−1} − (γ_i/γ_1) Y_i ||_F    (4.15)

The problem is thus to estimate the values of Ã together with the ratios γ_i/γ_1. One can expect that f_i > f_j leads to γ_i < γ_j due to the dynamic power increase, and this can be imposed as a constraint in the implementation.

4.7 Evaluation

This section describes the experimental evaluation of our scheme. We explain our implementation on a real embedded platform and show the results of our proposed method for finding the thermal parameters from steady-state temperature data. We also explore the effect of our approaches on the error correction of the estimated thermal parameters. Furthermore, we validate our method for finding the locations of the CPU cores on the floorplan. We also show the results of our parameter-based power estimation model and compare it with data collected from on-chip power sensors.

We performed our experiments on an ODroid-XU4 development board [64] equipped with a Samsung Exynos 5422 SoC based on the ARM big.LITTLE heterogeneous computing architecture. The Exynos CPU package contains two different quad-core CPU clusters of little Cortex-A7 and big Cortex-A15 cores. Built-in temperature sensors with a sampling rate of 10 Hz and a precision of 1°C are available for each big CPU core, as well as for the GPU, to measure the operating temperature. The DTM throttles the frequency of the entire big CPU cluster to 900 MHz when one of its cores exceeds the hardware-defined maximum temperature threshold of 95°C. There is no active or passive cooling mechanism enabled on the CPU. The big CPU cluster frequency can be dynamically adjusted within the range of [0.2, 2.0] GHz.
However, for each experiment, the frequency was pinned at a fixed value in the range of [0.7, 1.4] GHz to avoid the thermal violations that occur when all CPU cores run fully utilized at higher frequencies. (There is no temperature sensor for the little cores, since the power consumption and heat dissipation of the little cluster are substantially lower than those of the big cluster.)

Idle Steady-State Temperature

First, we measure the temperature of the CPU cores at different frequencies and determine the steady-state temperature when all big CPU cores are idle. As shown in Fig. 4.6a, the temperatures of the big CPU cores during the idle state change with the CPU frequency, for two possible reasons: i) some parts of the big CPU package, such as the CCI, the big CPU controlling unit, the cache memory, and its peripherals, still operate at their designated frequencies and dissipate heat, and ii) other IPs on the SoC may operate at the big CPU frequency. This temperature increase can also be caused by heat dissipation induced by chip leakage power.

Dynamic Temperature

Fig. 4.6b depicts the relative temperature increase of each CPU core at different frequency levels when that core is fully utilized. There is a slight difference in the temperature increase between the CPU cores despite identical workloads and micro-architectures. Although there is noise related to the precision of the on-chip thermal sensors, this difference is consistently observable.

[Figure 4.6: (a) Idle steady-state temperature of each core at different frequency levels. (b) Temperature increase of CPU cores due to computation.]

[Table 4.1: Descriptions of steady-state traces on the big cores. Each case CA_s lists the traces (identified by the decimal encoding of their predicates Z) included when constructing Y; the subscript s indicates the numbers of fully-utilized CPU cores in the added traces.]
We evaluate our scheme's improvement in the accuracy of thermal parameter estimation when using ensembles of steady-state profiles. We apply the superposition theorem of Eq. 4.8 to estimate the steady-state temperature of the CPU cores when different subsets of them are fully utilized. We use the data collected from a subset of all possible combinations of CPU cores at the fixed clock frequency of 1.4 GHz. The details of the subsets used to construct the matrix Y are given in Table 4.1. Some cases contain the least number of orthogonal profiles (i.e., 4 profiles), while others contain more profiles. The case CA_{1,2,3,4} uses all available profiles in the construction of Y. The name of each case indicates the numbers of fully-utilized CPU cores in its traces. (Note that the total number of possible core combinations is 2^4 in a quad-core system, and the total number of ways of selecting a subset of the combinations is 2^{16}.)

[Figure 4.7: The error in the steady-state temperature of each CPU core (cores 1-4) when the different cases are used in the construction of Y.]

[Figure 4.8: MSE of the CPU cores over all settings in the different cases.]

As shown in Fig. 4.7, using more profiles helps reduce the effect of noise in constructing Y. The proposed anomaly detection mechanism presented in Sec. 4.5.2 was applied to the profiles and excluded the outliers from being considered for Y. For instance, applying this mechanism detected an anomaly in one profile Z by comparing it with the other profiles.
Hence, while constructing Y, we were able to determine that the anomaly was not caused by an error in the other profiles but rather by the error in that profile.

The mean square errors (MSE) of the temperature model for all CPU cores under all settings are shown in Fig. 4.8. The x-axis is sorted in ascending order of the number of profiles used during the construction of Y. Some spikes in the results, e.g., the yellow line at CA_{1,2}, are due to additional traces containing noisy data. However, the error generally decreases as more profiles are used for the construction of Y. This trend indicates that our proposed approach of considering data ensembles can reduce the negative impact of noisy data and improve the accuracy of thermal parameter estimation.

[Figure 4.9: Temperature increase of all cores when (a) core 1, (b) core 2, (c) core 3, and (d) core 4 operates fully utilized.]

We validate our proposed floorplan estimation method on the Exynos 5422 SoC. The temperature increase of each CPU core when only one CPU core is fully utilized at a time is depicted in Fig. 4.9. We draw the proposed adjacency graph according to the data at 1.4 GHz; the result indicates that some core pairs have greater spatial proximity than others. Although the physical layout of the cores on the CPU package is bi-symmetrical, the primary reason for the asymmetrical layout shown in Fig. 4.10d is that the location of the built-in temperature sensor on each processor may vary between cores.
Additionally, the L2 cache memory and the peripheral controllers may have an effect on the modeled spatial proximity between some core pairs. Using the same approach and the GPU temperature data, we are also able to locate the embedded GPU by profiling the heat conduction between each CPU core and the GPU. Fig. 4.10d depicts the estimated location of the embedded GPU in the Exynos 5422. The floorplan estimation is validated against the data reported in [36]. (The CPU core labeling in [36] differs from the labels in the driver; the mapping was verified with infra-red imaging captured by an FLIR A325sc IR camera [29].)

[Figure 4.10: Floorplan estimation of the Exynos 5422 based on data at 1.4 GHz. (a) The fully-connected graph from the temperature increase data, (b) the graph reduction stage, (c) the CPU affinity graph, (d) the estimated GPU location relative to the CPU cores, and (e) the actual Exynos 5422 floorplan [36].]

We construct the template of the matrix A to be compatible with the estimated floorplan. The matrix Ã at 1.4 GHz is then computed by using the steady-state data of CA_{1,2,3,4}; the resulting symmetric matrix follows the estimated floorplan template.

[Figure 4.11: Power data of the CPU cores in the Exynos 5422. (a) Total power of the big cluster from the built-in sensors, (b) the leakage power data and the relative power estimates with the different frequency ranges used, (c) comparison between the estimated relative power consumption and the normalized actual power data from the built-in power sensors for CA_1, and (d) the same comparison for CA_{1,2,3,4}.]
− . . . The relative power consumption of each CPU core can be estimated while estimat-ing ˜ A , as explained in Sec. 4.6.2. Fig. 4.11 illustrates the estimated power consumption.132he results are obtained using profiles from different frequency ranges. As depicted in thefigure, the estimated relative power closely follows the actual data collected from the built-inpower sensors of the XU3 board that is equipped with the same Exynos5422 SoC. We estimate the parameter γ as the base parameter for the clock frequency 1.4GHz in our experiments. Based on this, the absolute value of other γ s can be determined.As discussed in Sec. 4.5.5, we use the transient-state trace when all CPU cores are coolingdown. After estimating the values of γ s, we apply our model to predict the operatingtemperature of CPU cores in two different scenarios. Figures 4.12a-d show the CPU tem-peratures when only two CPU cores are fully utilized and Figures 4.12e-h show that whenall CPU cores are fully utilized. In each sub-figure, “data” means the CPU operating tem-perature measured from a real platform and “mdl” means the temperature predicted byour model. As shown, there are some differences in transient-state temperature values,especially at the earlier part of the experiments. The difference is relatively small whenthe frequency is low. On the other hand, in steady state, the temperature predicted bythe model is very close to the real one and the difference is less than 1.25 ° C in all cases.Since the steady-state temperature is the metric to test the thermal safety of the system,we expect that our proposed scheme can be effectively used in the thermal-aware design ofCOTS-based mixed-criticality systems. 
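As a rough illustration of the prediction step, a linear thermal model of the form dT/dt = A(T − T_amb) + b can be integrated forward in time to obtain the transient response and, in the limit, the steady-state temperature used for thermal-safety checks. The 2-core matrix A, heat-input vector b, and step sizes below are hypothetical stand-ins for illustration, not the estimated Ã or the measured power values.

```python
# Sketch: transient vs. steady-state temperature prediction with a
# linear thermal model dT/dt = A @ (T - T_amb) + b, via forward Euler.

def mat_vec(A, v):
    """Multiply a small dense matrix by a vector (pure Python)."""
    return [sum(a * x for a, x in zip(row, v)) for row in A]

def simulate(A, b, T_amb, T0, dt=0.01, steps=20000):
    """Integrate dT/dt = A (T - T_amb) + b until (approximately)
    steady state; returns the final temperature vector."""
    T = list(T0)
    for _ in range(steps):
        conduction = mat_vec(A, [t - T_amb for t in T])
        T = [t + dt * (c + heat) for t, c, heat in zip(T, conduction, b)]
    return T

# Stable 2-core conduction matrix (negative, diagonally dominant) and
# per-core heat input -- hypothetical values, not Exynos5422 data.
A = [[-0.5, 0.1],
     [0.1, -0.5]]
b = [10.0, 4.0]
T_amb = 25.0

T_ss = simulate(A, b, T_amb, [T_amb, T_amb])
print([round(t, 2) for t in T_ss])
```

The steady state can also be read off directly by solving A(T − T_amb) = −b; for these numbers both routes agree, which mirrors why the steady-state stage of the estimation needs far fewer profiles than the transient stage.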
4.8 Summary
In this chapter, we proposed a novel, rapid, and accurate scheme to estimate the thermal parameters of COTS multi-core processors for real-time mixed-criticality systems. By decomposing our estimation scheme into steady-state and transient-state stages, we substantially reduce the number of transient-state profiles needed to estimate the system's thermal parameters. We presented methods to improve the accuracy of our scheme by utilizing additional temperature profiles in the parameter estimation. Our proposed scheme is fast to converge and has a low computational cost for the prediction of chip operating temperature. Hence, it can even be used in an event-driven manner, e.g., at the arrival and the departure of each job of periodic real-time tasks, with negligible memory and computational overhead. We derived the thermal characteristics of a multi-core processor that remain unchanged across different frequency levels. We also showed the effectiveness of our scheme in extracting relative power consumption information from temperature profiles.
Figure 4.12: The CPU temperature from sensor data and the model when (a-d) two cores are fully utilized and (e-h) all cores are fully utilized.
Chapter 5
Conclusions
This dissertation shows that the timing predictability of real-time tasks under dynamic thermal conditions is achievable on multi-core GPU-integrated SoC platforms by designing a novel thermal-aware system framework supported by analytical timing and thermal models. We discussed challenges arising from thermal interference on heterogeneous multi-core SoC platforms. In Chapter 2, we focused on bounding the temperature increase due to heat conduction among CPU cores as well as integrated computational accelerators such as GPUs.
We proposed a thermal-aware CPU-GPU framework to handle both real-time CPU-only and GPU-using tasks by designing thermal-aware servers that bound the maximum operating temperature of SoCs while assuring performance predictability. In Chapter 3, we introduced a novel multi-core mixed-criticality scheduling framework with ambient temperature awareness to bound heat generation at each criticality level and to trigger the criticality mode change upon ambient temperature changes. We tackled the thermal-aware real-time scheduling challenges under dynamic environmental conditions by decomposing the analysis into thermal schedulability and timing schedulability. We introduced the notion of idle thermal servers, which bound the maximum operating temperature in MCS in such a way that the temperature increase due to heat dissipation during an inactive time of idle servers is not less than that of running preemptive servers under any task execution pattern. In Chapter 4, we presented our work on the estimation of processor thermal parameters to obtain a precise thermal model without using special measurement instruments or access to proprietary information. We showed that even a small number of temperature traces from on-chip thermal sensors is sufficient to achieve accurate thermal parameter estimation. Not only can our framework detect anomalies in thermal profiles, but it also increases accuracy using multi-frequency data ensembles and takes into account extra observations made at one or a subset of frequency levels.
There are possible extensions and improvements to the work presented in this dissertation. The following suggests several future research directions that can build upon our work.
By considering task allocation to CPU and GPU cores, the schedulability of real-time tasks with a given temperature constraint can be further increased.
We believe that it is possible to develop an offline mapping algorithm to increase the schedulability of a taskset by determining whether a task executes on the CPU or the GPU according to the system parameters and its execution timing model.
In our task model, tasks may or may not include a concurrent segment that executes on the GPU. When there is a GPU segment request, the corresponding task issues a GPU access request. However, it is possible to design an online algorithm to determine whether a concurrent task segment executes on the GPU or the CPU. In this way, some jobs of a task can execute only on the CPU while the rest execute on the integrated GPU. We believe this technique can reduce the maximum peak temperature of embedded systems and also increase the timing schedulability of tasksets.
The miscellaneous operations contain a single data transfer to/from GPUs. However, in practice, to improve performance or overcome the resource limits of GPUs, data can be divided into several chunks and transmitted to GPUs so that data copy and kernel execution can be overlapped. We believe that updating the GPU execution model and revisiting some of our proposed protocols will enable this feature, and the practical applicability of our proposed framework will also increase.
The thermal behavior of GPU-using tasks may depend significantly on the type of resources used by their kernels. For instance, a GPU kernel frequently accessing local memory may generate much less heat than those using global memory or being computationally intensive. A potential direction for this issue would be to develop an analytical model based on the thermal footprint of real-time applications to bound the maximum peak temperature with acceptable margins.
5.1.2 Dynamic Ambient Temperature-Aware Framework
In our proposed thermal-schedulability analysis for multiple real-time tasks on multi-core SoCs, identical idle servers are assumed for all CPU cores at each criticality level.
This assumption can affect the thermal schedulability of a real-time taskset. We believe that extending our analysis to allow different idle server settings per CPU core can improve the range of ambient temperatures supported at each mixed-criticality level. In addition, extending our analysis to capture the effect of cooling packages and forced heat convection is a challenging open issue.
In our proposed framework, a criticality mode change is triggered by ambient temperature, which differs from the well-known Vestal model [92] that focuses on varying assurance of execution timing. It is intriguing to extend our framework such that changes in both ambient temperature and task execution times trigger the criticality mode of MCS. We believe that our proposed scheme can be extended to capture the thermal footprint of different tasks on heterogeneous SoCs by using a minimum number of profiles. We also believe that further developing the analysis to capture the effect of cooling packages and forced heat convection would lead to runtime estimation of the operating temperature of SoCs under various CPU cooling conditions. Additionally, statistical approaches to constructing the adjacency matrix of the locations of CPU cores and IPs could be an intriguing extension.
The robustness of our proposed estimation scheme against noise in the thermal profiles has not been discussed through mathematical analysis. To be more precise, although our proposed method estimates the thermal parameters of SoCs with acceptable margins, we did not quantify the degree of noise or errors in the estimation with respect to the amount of data used. Given that it is essential to assess the trustworthiness of each thermal profile, we consider this interesting future work.
Bibliography
[1] Technical Report: On Dynamic Thermal Conditions in Mixed-Criticality Systems (available upon request from the track chair).
Technical report, 2019.
[2] Masud Ahmed, Nathan Fisher, Shengquan Wang, and Pradeep Hettiarachchi. Minimizing peak temperature in embedded real-time systems via thermal-aware periodic resources. Sustainable Computing: Informatics and Systems, 1(3):226–240, 2011.
[3] Rehan Ahmed, Pengcheng Huang, Max Millen, and Lothar Thiele. On the design and application of thermal isolation servers. ACM Trans. Embed. Comput. Syst., 16(5s):165:1–165:19, September 2017.
[4] Rehan Ahmed, Parameswaran Ramanathan, and Kewal K Saluja. On thermal utilization of periodic task sets in uni-processor systems. Pages 267–276. IEEE, 2013.
[5] Rehan Ahmed, Parameswaran Ramanathan, and Kewal K Saluja. Temperature minimization using power redistribution in embedded systems. Pages 264–269. IEEE, 2014.
[6] Rehan Ahmed, Parameswaran Ramanathan, and Kewal K Saluja. Necessary and sufficient conditions for thermal schedulability of periodic real-time tasks under fluid scheduling model. ACM Transactions on Embedded Computing Systems (TECS), 15(3):49, 2016.
[7] Rob Farrington and John Rugh. Impact of vehicle air-conditioning on fuel economy, tailpipe emissions, and electric vehicle range. In Earth Technologies Forum, pages 1–6. NREL, Washington, DC, 2000.
[8] Peter Bailis, Vijay Janapa Reddi, Sanjay Gandhi, David Brooks, and Margo Seltzer. Dimetrodon: processor-level preventive thermal management via idle cycle injection. Pages 89–94. IEEE, 2011.
[9] Guillem Bernat and Alan Burns. New results on fixed priority aperiodic servers. In IEEE Real-Time Systems Symposium (RTSS), 1999.
[10] Oreste M Bevilacqua. Effect of Air Conditioning on Regulated Emissions for In-use Vehicles: Phase I. Coordinating Research Council, Incorporated, 1999.
[11] Konstantinos Bletsas, Neil Audsley, Wen-Hung Huang, Jian-Jia Chen, and Geoffrey Nelissen. Errata for three papers (2004–05) on fixed-priority scheduling with self-suspensions.
Leibniz Transactions on Embedded Systems, 5(1):02–1, 2018.
[12] Younès Chandarli, Nathan Fisher, and Damien Masson. Response time analysis for thermal-aware real-time systems under fixed-priority scheduling. In IEEE International Symposium on Real-Time Distributed Computing (ISORC), 2015.
[13] T. Chantem, R. P. Dick, and X. S. Hu. Temperature-aware scheduling and assignment for hard real-time applications on MPSoCs. Pages 288–293, March 2008.
[14] Thidapat Chantem, X Sharon Hu, and Robert P Dick. Online work maximization under a peak temperature constraint. In Proceedings of the 2009 ACM/IEEE International Symposium on Low Power Electronics and Design, pages 105–110. ACM, 2009.
[15] Thidapat Chantem, X Sharon Hu, and Robert P Dick. Temperature-aware scheduling and assignment for hard real-time applications on MPSoCs. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 19(10):1884–1897, 2010.
[16] Jian-Jia Chen, Chia-Mei Hung, and Tei-Wei Kuo. On the minimization for the instantaneous temperature for periodic real-time tasks. Pages 236–248. IEEE, 2007.
[17] Jian-Jia Chen, Shengquan Wang, and Lothar Thiele. Proactive speed scheduling for real-time tasks under thermal constraints. Pages 141–150. IEEE, 2009.
[18] Kihwan Choi, Ramakrishna Soma, and Massoud Pedram. Fine-grained dynamic voltage and frequency scaling for precise energy and performance tradeoff based on the ratio of off-chip access to on-chip computation times. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 24(1):18–28, 2004.
[19] Pascal Tuning Guide. https://docs.nvidia.com/cuda/pascal-tuning-guide/index.html, 2018.
[20] Sandeep M D'souza and Ragunathan Raj Rajkumar. Thermal implications of energy-saving schedulers. Schloss Dagstuhl-Leibniz-Zentrum fuer Informatik, 2017.
[21] T. J. A. Eguia, S. X. Tan, R. Shen, E. H. Pacheco, and M. Tirumala.
General behavioral thermal modeling and characterization for multi-core microprocessor design. Pages 1136–1141, 2010.
[22] Ali Elghool, Firdaus Basrawi, Thamir Khalil Ibrahim, Khairul Habib, Hassan Ibrahim, and Daing Mohamad Nafiz Daing Idris. A review on heat sink for thermo-electric power generation: Classifications and parameters affecting performance. Energy Conversion and Management, 134:260–277, 2017.
[23] G. A. Elliott, B. C. Ward, and J. H. Anderson. GPUSync: A framework for real-time GPU management. In IEEE Real-Time Systems Symposium (RTSS), 2014.
[24] Glenn A Elliott and James H Anderson. Globally scheduled real-time multiprocessor systems with GPUs. Real-Time Systems, 48(1):34–74, 2012.
[25] Glenn A Elliott and James H Anderson. An optimal k-exclusion real-time locking protocol motivated by multi-GPU systems. Real-Time Systems; International Symposium on Low Power Electronics and Design (ISLPED), pages 353–358. IEEE, 2013.
[28] Nathan Fisher, Jian-Jia Chen, Shengquan Wang, and Lothar Thiele. Thermal-aware global real-time scheduling on multicore systems. Pages 111–120. IEEE, 2010.
[31] Yong Fu, Nicholas Kottenstette, Chenyang Lu, and Xenofon D Koutsoukos. Feedback thermal control of real-time systems on multicore processors. In Proceedings of the Tenth ACM International Conference on Embedded Software; International Journal of Heat and Mass Transfer, 122:1313–1326, 2018.
[34] Ali Ghahremannezhad, Huijin Xu, Mohammad Alhuyi Nazari, Mohammad Hossein Ahmadi, and Kambiz Vafai. Effect of porous substrates on thermohydraulic performance enhancement of double layer microchannel heat sinks. International Journal of Heat and Mass Transfer, 131:52–63, 2019.
[35] Georgia Giannopoulou, Nikolay Stoimenov, Pengcheng Huang, and Lothar Thiele. Scheduling of mixed-criticality applications on resource-sharing multicore systems. In Proceedings of the Eleventh ACM International Conference on Embedded Software, page 17.
IEEE Press, 2013.
[36] Young-Ho Gong, Jae Jeong Yoo, and Sung Woo Chung. Thermal modeling and validation of a real-world mobile AP. IEEE Design & Test, 35(1):55–62, February 2018.
[37] Sebastian Herbert and Diana Marculescu. Analysis of dynamic voltage/frequency scaling in chip-multiprocessors. In Proceedings of the 2007 International Symposium on Low Power Electronics and Design (ISLPED'07), pages 38–43. IEEE, 2007.
[38] Seyedmehdi Hosseinimotlagh, Ali Ghahremannezhad, and Hyoseung Kim. On dynamic thermal conditions in mixed-criticality systems. Pages 336–349, 2020.
[39] Seyedmehdi Hosseinimotlagh and Hyoseung Kim. Thermal-aware servers for real-time tasks on multi-core GPU-integrated embedded systems. Pages 254–266. IEEE, 2019.
[40] Yingbo Hua and Tapan K Sarkar. Generalized pencil-of-function method for extracting poles of an EM system from its transient response. IEEE Transactions on Antennas and Propagation, 37(2):229–234, 1989.
[41] Huang Huang, Vivek Chaturvedi, Gang Quan, Jeffrey Fan, and Meikang Qiu. Throughput maximization for periodic real-time systems under the maximal temperature constraint. ACM Trans. Embed. Comput. Syst., 13(2s):70:1–70:22, January 2014.
[42] Wei Huang, Shougata Ghosh, Sivakumar Velusamy, Karthik Sankaranarayanan, Kevin Skadron, and Mircea R Stan. HotSpot: A compact thermal modeling methodology for early-stage VLSI design. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 14(5):501–513, 2006.
[43] Arman Iranfar, Mehdi Kamal, Ali Afzali-Kusha, Massoud Pedram, and David Atienza. TheSPoT: Thermal stress-aware power and temperature management for multiprocessor systems-on-chip. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 2017.
[44] Ramkumar Jayaseelan and Tulika Mitra. Temperature aware task sequencing and voltage scaling. In Proceedings of the 2008 IEEE/ACM International Conference on Computer-Aided Design, pages 618–623.
IEEE Press, 2008.
[45] Shinpei Kato, Karthik Lakshmanan, Aman Kumar, Mihir Kelkar, Yutaka Ishikawa, and Ragunathan Rajkumar. RGEM: A responsive GPGPU execution model for runtime engines. In Real-Time Systems Symposium (RTSS), 2011 IEEE 32nd, pages 57–66. IEEE, 2011.
[46] Shinpei Kato, Karthik Lakshmanan, Raj Rajkumar, and Yutaka Ishikawa. TimeGraph: GPU scheduling for real-time multi-tasking environments. In Proc. USENIX ATC, pages 17–30, 2011.
[47] Shinpei Kato, Michael McThrow, Carlos Maltzahn, and Scott A Brandt. Gdev: First-class GPU resource management in the operating system. In USENIX Annual Technical Conference, pages 401–412. Boston, MA, 2012.
[48] Hyoseung Kim, Pratyush Patel, Shige Wang, and Ragunathan Raj Rajkumar. A server-based approach for predictable GPU access control. Pages 1–10. IEEE, 2017.
[49] Hyoseung Kim, Pratyush Patel, Shige Wang, and Ragunathan Raj Rajkumar. A server-based approach for predictable GPU access with improved analysis. Journal of Systems Architecture, 88:97–109, 2018.
[50] Hyoseung Kim and Ragunathan Rajkumar. Predictable shared cache management for multi-core real-time virtualization. ACM Transactions on Embedded Computing Systems (TECS), 17(1):1–27, 2017.
[51] Hyoseung Kim, Shige Wang, and Ragunathan Rajkumar. vMPCP: A synchronization framework for multi-core virtual machines. Pages 86–95, December 2014.
[52] Pratyush Kumar and Lothar Thiele. Cool shapers: shaping real-time tasks for improved thermal guarantees. In Proceedings of the 48th Design Automation Conference (DAC), pages 468–473. ACM, 2011.
[53] Pratyush Kumar and Lothar Thiele. System-level power and timing variability characterization to compute thermal guarantees. In Proceedings of the Seventh IEEE/ACM/IFIP International Conference on Hardware/Software Codesign and System Synthesis, pages 179–188. ACM, 2011.
[54] Pratyush Kumar and Lothar Thiele. Thermally optimal stop-go scheduling of task graphs with real-time constraints. Pages 123–128.
IEEE, 2011.
[55] Tadahiro Kuroda. CMOS design challenges to power wall. In Digest of Papers, Microprocesses and Nanotechnology 2001, 2001 International Microprocesses and Nanotechnology Conference, pages 6–7. IEEE, 2001.
[56] Ohchul Kwon, Wonjae Jang, Giyeon Kim, and Chang-Gun Lee. Accurate thermal prediction for NANS (N-App N-Screen) services on a smart phone. Pages 1–10. IEEE, 2018.
[57] Clemens JM Lasance. Thermally driven reliability issues in microelectronic systems: status-quo and challenges. Microelectronics Reliability, 43(12):1969–1974, 2003.
[58] JongHo Lee, Young-Woo Seo, Wende Zhang, and David Wettergreen. Kernel-based traffic sign tracking to improve highway workzone recognition for reliable autonomous driving. In Intelligent Transportation Systems (ITSC), 2013 16th International IEEE Conference on, pages 1131–1136, 2013.
[59] Youngmoon Lee, Hoonsung Chwa, Kang G. Shin, and Shige Wang. Thermal-aware resource management for embedded real-time systems. In Embedded Software (EMSOFT), 2018 International Conference on. IEEE, 2018.
[60] Yongpan Liu, Robert P Dick, Li Shang, and Huazhong Yang. Accurate temperature-dependent integrated circuit leakage power estimation is easy. Pages 1–6. IEEE, 2007.
[61] Yue Ma, Thidapat Chantem, X Sharon Hu, and Robert P Dick. Improving lifetime of multicore soft real-time systems through global utilization control. In Proceedings of the 25th Edition of the Great Lakes Symposium on VLSI, pages 79–82. ACM, 2015.
[62] Mali OpenCL SDK. https://developer.arm.com/products/software/mali-sdks, 2016.
[63] S. Murali, A. Mutapcic, D. Atienza, R. Gupta, S. Boyd, and G. De Micheli. Temperature-aware processor frequency assignment for MPSoCs using convex optimization. In Real-Time and Embedded Technology and Applications Symposium (RTAS), 2017 IEEE, pages 353–364. IEEE, 2017.
[66] Santiago Pagani, Jian-Jia Chen, Muhammad Shafique, and Jörg Henkel.
MatEx: Efficient transient and peak temperature computation for compact thermal models. Pages 1515–1520. IEEE, 2015.
[67] Santiago Pagani, Muhammad Shafique, Heba Khdr, Jian-Jia Chen, and Jörg Henkel. seBoost: Selective boosting for heterogeneous manycores. In Proceedings of the 10th International Conference on Hardware/Software Codesign and System Synthesis, pages 104–113. IEEE Press, 2015.
[68] Sangyoung Park, Jian-Jia Chen, Donghwa Shin, Younghyun Kim, Chia-Lin Yang, and Naehyuck Chang. Dynamic thermal management for networked embedded systems under harsh ambient temperature variation. Pages 289–294. IEEE, 2010.
[69] Pratyush Patel, Iljoo Baek, Hyoseung Kim, and Ragunathan Rajkumar. Analytical enhancements and practical insights for MPCP with self-suspensions. Pages 177–189. IEEE, 2018.
[70] Francesco Paterna and Tajana Šimunić Rosing. Modeling and mitigation of extra-SoC thermal coupling effects and heat transfer variations in mobile devices. Pages 831–838. IEEE, 2015.
[71] Alok Prakash, Hussam Amrouch, Muhammad Shafique, Tulika Mitra, and Jörg Henkel. Improving mobile gaming performance through cooperative CPU-GPU thermal management. In Proceedings of the 53rd Annual Design Automation Conference, DAC '16, pages 47:1–47:6, New York, NY, USA, 2016. ACM.
[72] Devendra Rai and Lothar Thiele. A calibration based thermal modeling technique for complex multicore systems. Pages 1138–1143. IEEE, 2015.
[73] Devendra Rai, Hoeseok Yang, Iuliana Bacivarov, Jian-Jia Chen, and Lothar Thiele. Worst-case temperature analysis for real-time systems. Pages 1–6. IEEE, 2011.
[74] Devendra Rai, Hoeseok Yang, Iuliana Bacivarov, and Lothar Thiele. Power agnostic technique for efficient temperature estimation of multicore embedded systems. In Proceedings of the 2012 International Conference on Compilers, Architectures and Synthesis for Embedded Systems, pages 61–70, 2012.
[75] Ragunathan Rajkumar. Real-time synchronization protocols for shared memory multiprocessors.
In Distributed Computing Systems, 1990, Proceedings, 10th International Conference on, pages 116–123. IEEE, 1990.
[76] Ragunathan Rajkumar, Lui Sha, and John P Lehoczky. Real-time synchronization protocols for multiprocessors. In Real-Time Systems Symposium (RTSS), 1988, Proceedings, pages 259–269. IEEE, 1988.
[77] Saowanee Saewong, Ragunathan Raj Rajkumar, John P Lehoczky, and Mark H Klein. Analysis of hierarchical fixed-priority scheduling. In Proceedings of the 14th Euromicro Conference on Real-Time Systems (ECRTS), 2002.
[78] Onur Sahin and Ayse K Coskun. Providing sustainable performance in thermally constrained mobile devices. Pages 1–6, October 2016.
[79] Lars Schor, Iuliana Bacivarov, Hoeseok Yang, and Lothar Thiele. Fast worst-case peak temperature evaluation for real-time applications on multi-core systems. Pages 1–6. IEEE, 2012.
[80] Lars Schor, Iuliana Bacivarov, Hoeseok Yang, and Lothar Thiele. Worst-case temperature guarantees for real-time applications on multi-core systems. Pages 87–96. IEEE, 2012.
[81] Lars Schor, Hoeseok Yang, Iuliana Bacivarov, and Lothar Thiele. Thermal-aware task assignment for real-time applications on multi-core systems. In International Symposium on Formal Methods for Components and Objects, pages 294–313. Springer, 2011.
[82] Lui Sha, John Lehoczky, and Ragunathan Rajkumar. Solutions for some practical problems in prioritized preemptive scheduling. In IEEE Real-Time Systems Symposium (RTSS). IEEE Computer Society Press, 1986.
[83] Insik Shin and Insup Lee. Compositional real-time scheduling framework with periodic model. ACM Transactions on Embedded Computing Systems (TECS), 7(3):30, 2008.
[84] Gaurav Singla, Gurinderjit Kaur, Ali K Unver, and Umit Y Ogras. Predictive dynamic thermal and power management for heterogeneous mobile platforms. Pages 960–965, March 2015.
[85] Kevin Skadron, Mircea R. Stan, Karthik Sankaranarayanan, Wei Huang, Sivakumar Velusamy, and David Tarjan.
Temperature-aware microarchitecture: Modeling and implementation. ACM Trans. Archit. Code Optim., 1(1):94–125, March 2004.
[86] Brinkley Sprunt, Lui Sha, and John Lehoczky. Aperiodic task scheduling for hard-real-time systems. Real-Time Systems, 1(1):27–60, 1989.
[87] Jayanth Srinivasan, Sarita V Adve, Pradip Bose, and Jude A Rivers. The impact of technology scaling on lifetime reliability. In International Conference on Dependable Systems and Networks, 2004, pages 177–186. IEEE, 2004.
[88] Jay K. Strosnider, John P. Lehoczky, and Lui Sha. The deferrable server algorithm for enhanced aperiodic responsiveness in hard real-time environments. IEEE Transactions on Computers; Proceedings of the 35th Annual Design Automation Conference, pages 732–737, 1998.
[91] Bryan Anthony Toribio, Cameron Duross Peterson, David Parker Rubenstein, and Timothy Philip Neilan. Fire containment drone. Technical report, Worcester Polytechnic Institute, 2016.
[92] Steve Vestal. Preemptive scheduling of multi-criticality systems with varying degrees of execution time assurance. Pages 239–243. IEEE, 2007.
[93] Ram Viswanath, Vijay Wakharkar, Abhay Watwe, Vassou Lebonheur, et al. Thermal performance challenges from silicon to systems. Intel Technology Journal, Q3, 2000.
[94] Shengquan Wang, Youngwoo Ahn, and Riccardo Bettati. Schedulability analysis in hard real-time systems under thermal constraints. Real-Time Systems, 46(2):160–188, 2010.
[95] Shengquan Wang and Riccardo Bettati. Delay analysis in temperature-constrained hard real-time systems with general task arrivals. Pages 323–334. IEEE, 2006.
[96] Shengquan Wang and Riccardo Bettati. Delay analysis in temperature-constrained hard real-time systems with general task arrivals. Pages 323–334. IEEE, 2006.
[97] Yefu Wang, Kai Ma, and Xiaorui Wang. Temperature-constrained power control for chip multiprocessors with online model estimation.
ACM SIGARCH Computer Architecture News, 37(3):314–324, 2009.
[98] Sisu Xi, Justin Wilson, Chenyang Lu, and Christopher Gill. RT-Xen: towards real-time hypervisor scheduling in Xen. In International Conference on Embedded Software (EMSOFT), 2011.
[99] Yun Xiang, Thidapat Chantem, Robert P Dick, X Sharon Hu, and Li Shang. System-level reliability modeling for MPSoCs. In Proceedings of the Eighth IEEE/ACM/IFIP International Conference on Hardware/Software Codesign and System Synthesis, pages 297–306. ACM, 2010.
[100] Hoeseok Yang, Iuliana Bacivarov, Devendra Rai, Jian-Jia Chen, and Lothar Thiele. Real-time worst-case temperature analysis with temperature-dependent parameters. Real-Time Systems, 49(6):730–762, 2013.
[101] Sushu Zhang and Karam S Chatha. Thermal aware task sequencing on embedded processors. In Proceedings of the 47th Design Automation Conference, pages 585–590. ACM, 2010.
[102] Husheng Zhou, Guangmo Tong, and Cong Liu. GPES: A preemptive execution system for GPGPU computing. In Real-Time and Embedded Technology and Applications Symposium (RTAS), 2015 IEEE, pages 87–97. IEEE, 2015.
[103] Amirkoushyar Ziabari, Je-Hyoung Park, Ehsan K Ardestani, Jose Renau, Sung-Mo Kang, and Ali Shakouri. Power blurring: Fast static and transient thermal analysis method for packaged integrated circuits and power devices.