Learning-based Dynamic Pinning of Parallelized Applications in Many-Core Systems
Georgios C. Chasparis, Vladimir Janjic, Michael Rossbory
Abstract—Motivated by the need for adaptive, secure and responsive scheduling in a great range of computing applications, including human-centered and time-critical applications, this paper proposes a scheduling framework that seamlessly adds resource-awareness to any parallel application. In particular, we introduce a learning-based framework for dynamic placement of parallel threads to Non-Uniform Memory Access (NUMA) architectures. Decisions are taken independently by each thread in a decentralized fashion that significantly reduces computational complexity. The advantage of the proposed learning scheme is the ability to easily incorporate any multi-objective criterion and easily adapt to performance variations during runtime. Under the multi-objective criterion of maximizing total completed instructions per second (i.e., both computational and memory-access instructions), we provide analytical guarantees with respect to the expected performance of the parallel application. We also compare the performance of the proposed scheme with the Linux operating system scheduler in an extensive set of applications, including both computationally and memory intensive ones. We have observed that the performance improvement can be significant, especially under limited availability of resources and under irregular memory-access patterns.
I. INTRODUCTION
Efficient resource allocation for multi-threaded applications in NUMA architectures has attracted significant scientific attention due to a) the involved complexity of the decision-making process, and b) the need to incorporate alternative optimization criteria that go beyond the standard maximization of execution speed. This is further reinforced by the recent advancement of tools for parallelizing complex applications, which gave birth to non-trivial and highly advanced parallel and data patterns [2], [3], [4], [5]. In addition, the nature of an application (e.g., machine learning, image processing, control and optimization) may add criteria that cannot easily be integrated into an OS scheduler. As expected, the problem of efficiently utilizing resources, while concurrently optimizing a multi-objective criterion, cannot be treated by standard heuristic-based techniques.
To this end, this paper proposes and investigates the potential of a learning- or measurement-based scheduling scheme that is part of a running application and regularly corrects/improves allocation decisions given the observed application's performance. In particular, this paper proposes a distributed learning scheme specifically tailored for addressing the problem of dynamically assigning/pinning threads of a parallelized application to the available processing units. The proposed scheme is flexible enough to incorporate any multi-objective optimization criterion and provides convergence guarantees to at least suboptimal assignments. Given that it is measurement-based, it is computationally efficient, with complexity that grows linearly with the number of threads. Since it is iterative in nature, it also exhibits minimal memory requirements.

It is worth noting that we target an online learning framework where allocation decisions are taken during runtime and without requiring any prior application knowledge. Such a feature can make parallel applications more responsive by reducing their execution time, especially in situations where computing resources are shared between different applications. This is also very important for human-centered computing, where strict timing requirements can be of high importance, given that such applications are often computationally intensive, as in machine-learning or image-processing applications. In addition, the proposed scheduling framework can seamlessly be attached to any parallel application. These features provide an easy-to-use and user-friendly supervisory scheduling scheme that reduces the need for expert and application knowledge.

[Footnote: This paper is an extension of an earlier version that appeared in the conference paper [1]. It has been supported by the European Union grant EU H2020-ICT-2014-1, project RePhrase (No. 644235). G. C. Chasparis and M. Rossbory are with the Software Competence Center Hagenberg GmbH, Softwarepark 21, A-4232 Hagenberg, Austria. V. Janjic is with the School of Computer Science, University of St Andrews, Scotland, UK.]

In our previous work [6], [7], we have proposed a reinforcement-learning-based distributed scheduling framework (
PaRLSched), adapted to Uniform Memory Access (UMA) architectures. In this paper, our goal is to provide a generalized methodology that also extends to Non-Uniform Memory Access (NUMA) architectures. Such a framework should be considered as a supervisory scheme that acts on top of any OS scheduling and performs either low- or high-frequency allocation corrections, possibly subject to alternative multi-objective criteria. For example, when optimizing with respect to both computational and memory-access instructions completed per second, the learning scheme should find the right balance between computing bandwidth and memory affinities. In this paper, though, we are not concerned with memory migrations.

This paper is an extension of an earlier version that appeared in [1]. In this updated version, we provide analytical guarantees of the performance of the learning-based scheduling framework, and we have extended our experimental evaluation to applications with memory irregularities.

The paper is organized as follows. Section II discusses related work and contributions. Section III describes the problem formulation and objective of the paper. Section IV presents the main features of the proposed dynamic scheduler (
PaRLSched) and Section V provides analytical convergence guarantees with respect to the application's performance. Section VI presents a performance comparison with the standard Linux scheduler in benchmark applications. Finally, Section VII presents concluding remarks and future work.

II. RELATED WORK AND CONTRIBUTIONS
Prior work has demonstrated the importance of thread-to-core bindings in the overall performance of a parallelized application [8]. The task of discovering such optimal bindings is rather complex, given the structure of NUMA architectures [9]. This task becomes even harder given the need for developing tools that can easily generalize to any architecture and are application-independent.

For example, reference [10] describes a tool that checks the performance of each of the available thread-to-core bindings and searches for an optimal placement. Unfortunately, the exhaustive-search type of optimization that is implemented may prohibit runtime implementation. Reference [11] combines the problem of thread scheduling with scheduling hints related to thread-memory affinity issues. A similar scheduling policy is also implemented by [12].

At the same time, given that no prior knowledge of the application's details is available, a centralized optimization formulation is prohibitive. Such design restrictions give rise to learning-based techniques, where scheduling decisions are taken based only on performance measurements. This need for learning from data has been recognized in [13], where a machine-learning-based mechanism is designed for transactional applications. In this case, each instance of the application has to be run and profiled before any learning process is implemented.

Even such learning processes can be computationally complex given the quite large search space. For this reason, distributed or game-theoretic optimizations have been attempted in the past for related problems, including a cooperative game formulation for allocating bandwidth in grid computing [14], and non-cooperative game formulations for medium access protocols in communications [15] or for allocating resources in cloud computing [16].
These approaches can significantly reduce the involved computational complexity and also allow for the development of online selection rules based on performance measurements. However, such modeling techniques have not yet been implemented in the context of pinning of parallelized applications.

Recognizing this need for both learning- and distributed-based optimization, and contrary to the aforementioned references on pinning of parallelized applications, our earlier work [6], [7] proposed a scheduling scheme for optimally allocating threads of a parallelized application that combines both learning-based and distributed optimization. It requires minimum information exchange, where only measurements collected from each running thread are needed. Furthermore, it is flexible enough to accommodate alternative optimization criteria depending on the available performance counters. However, one potential drawback was the fact that no special consideration was given to possible non-uniform memory access (NUMA) architectures, as it did not distinguish between moving a thread to a "local" (within the same NUMA node) and a "remote" (from a different NUMA node) core.

This paper extends the scheduling framework of our previous work [6], [7] with respect to the following contributions:

(C1) We propose a novel two-level scheduling process that is appropriate for NUMA architectures.
At the higher level, the scheduler decides on which NUMA node each thread should be assigned, while at the lower level it decides on which CPU core (within that NUMA node) to execute the thread.

(C2) We provide analytical convergence guarantees with respect to the resulting performance of the application in comparison to the optimal performance.

(C3) We demonstrate the efficiency of the proposed approach on several benchmark applications with different characteristics, including computationally and memory-intensive applications.

This paper is also an extension of an earlier version that appeared in [1] with respect to contributions (C2) and (C3).

III. PROBLEM FORMULATION AND OBJECTIVE
Let a parallel application comprise n threads, I = {1, 2, ..., n}. We denote the assignment of a thread i to one of the available NUMA nodes J_NUMA by α_i ∈ J_NUMA. Within the selected NUMA node α_i, thread i should be assigned to one of the available CPU cores J_CPU(α_i), denoted by β_i ∈ J_CPU(α_i). Let also α = {(α_i, β_i), i ∈ I} denote the overall assignment profile, and let A be the set of all profiles.

The Resource Manager (RM) periodically checks the performance of a thread and makes decisions about its assignment for the next scheduling iteration. For the remainder of the paper, we will assume that: a) the internal properties and details of the threads are not known to the RM; instead, the RM may only have access to measurements related to their performances; b) threads may not be idled or postponed by the RM; instead, the goal of the RM is to assign the currently available resources to the currently running threads (work-conserving).
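To make the notation concrete, an assignment profile can be represented as a map from each thread to its pair (α_i, β_i). The topology and helper below are a hypothetical sketch for illustration; they are not part of the paper's implementation.

```python
# Illustrative NUMA topology: node -> CPU cores it contains
# (a hypothetical 2-node, 4-core machine, not from the paper).
TOPOLOGY = {0: [0, 1], 1: [2, 3]}

def is_valid_profile(profile):
    """A profile {thread i: (alpha_i, beta_i)} is valid iff every thread
    is pinned to a CPU core that belongs to its selected NUMA node."""
    return all(
        alpha in TOPOLOGY and beta in TOPOLOGY[alpha]
        for alpha, beta in profile.values()
    )

# Three threads: thread 0 on node 0 / core 1, threads 1 and 2 share node 1.
profile = {0: (0, 1), 1: (1, 2), 2: (1, 3)}
assert is_valid_profile(profile)
assert not is_valid_profile({0: (0, 3)})  # core 3 belongs to node 1, not node 0
```

The constraint β_i ∈ J_CPU(α_i) is exactly what the validity check encodes: the lower-level (core) choice is restricted by the higher-level (node) choice.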
1) Static optimization and issues:
A possible centralized objective that we may consider could be to maximize the average processing speed over all threads, i.e.,

  max_{α∈A} f(α, w) := (1/n) Σ_{i=1}^{n} u_i(α, w),   (1)

where, for example, u_i may represent the processing speed of thread i under assignment α ∈ A. In general, u_i will depend on the assignment profile α and on exogenous disturbances (e.g., other applications) summarized within the parameter w. Any solution to the optimization problem (1) will correspond to an efficient/optimal assignment. However, there are two practical issues when posing an optimization problem in this form, namely a) the details of the function u_i(α, w) are unknown and it may only be evaluated through measurements, denoted by ũ_i; and b) w is also unknown and may vary with time.

2) Measurement- or learning-based optimization: We wish to address a static optimization objective of the form (1) through a measurement- or learning-based methodology. That is, the RM reacts to measurements of f(α, w), periodically collected at time instances k = 1, 2, ... and denoted by f̃(k). The measured objective may take on the form f̃(k) := (1/n) Σ_{i=1}^{n} ũ_i(k). Given these measurements and the current assignment α(k) of resources, the RM will select the next assignment of resources α(k+1), so that the measured objective approaches the true optimum of the unknown performance function f(α, w).
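One measurement step of such an RM loop can be sketched as below. The measurement function is a random stand-in for the per-thread readings ũ_i(k); in the real scheduler these would come from hardware performance counters, and the function names are illustrative.

```python
import random

def measure_utilities(profile, k):
    """Stand-in for the per-thread measurements u~_i(k).
    Seeded by k so each scheduling instant is reproducible here."""
    random.seed(k)
    return [random.uniform(0.5, 1.0) for _ in profile]

def rm_step(profile, k):
    """One RM iteration: observe u~_i(k) for every thread and form the
    measured objective f~(k) = (1/n) * sum_i u~_i(k) of eq. (1)."""
    u = measure_utilities(profile, k)
    return sum(u) / len(u)

profile = [0, 1, 1]            # current core assignment of 3 threads
f_k = rm_step(profile, 1)      # measured objective at instant k = 1
assert 0.5 <= f_k <= 1.0       # mean of values drawn from [0.5, 1.0]
```

A full scheduler would feed f̃(k) (or the individual ũ_i(k)) into an update rule that produces α(k+1); the learning rules of Section IV play exactly that role.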
3) Multi-agent formulation:
We further distribute the decision-making process into a thread-based optimization, where the RM makes decisions independently for each thread. Equivalently, we may assume that each thread makes its own independent decisions, as in multi-agent formulations. Such a distribution reduces the complexity of the decision-making process, since each thread has a reduced number of choices as compared to the number of choices of the group of threads. Furthermore, it increases robustness, since any performance degradation noticed in a group of threads can immediately be treated by the affected threads, thus avoiding the complexity of centrally designed assignment corrections.
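A back-of-the-envelope comparison makes the complexity reduction concrete: a centralized RM choosing a joint profile searches over |J|^n alternatives, while each of the n threads deciding independently faces only |J| choices per step. The numbers below are illustrative.

```python
# Search-space sizes for a hypothetical 16-thread application on 8 cores.
n_threads, n_cores = 16, 8

centralized = n_cores ** n_threads   # joint assignment profiles |J|^n
per_thread = n_cores                 # choices faced by a single thread |J|

assert centralized == 8 ** 16        # ~2.8e14 joint profiles
assert per_thread == 8               # versus 8 choices per thread per step
```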
4) Multi-level decision-making and actuation:
Recent work by the authors [6], [7] has demonstrated the potential of learning-based optimization in UMA architectures. However, when an application runs on a NUMA architecture, additional information can be exploited to enhance the scheduling of a parallelized application. To this end, a multi-level decision-making and actuation process is considered. We extend the
PaRLSched dynamic scheduler of [6], [7] by introducing two nested decision processes depicted in Figure 1. At the higher level (Level 1), the performance of a thread is evaluated with respect to its own prior history of performances, and decisions are taken with respect to its NUMA placement. At the lower level (Level 2), the performance of a thread is evaluated with respect to its own prior history of performances, and decisions are taken with respect to its CPU placement (within the selected NUMA node).

IV. DYNAMIC SCHEDULER
Each one of the two levels of the decision process will take place at a different frequency and based on different reasoning. In particular, NUMA-node switching may be costly, especially when performed with high frequency, due primarily to memory affinities, while CPU-core switching within the same NUMA node may be costless (with respect to its impact on the processing speed). For this reason, we have introduced two measurement-based learning algorithms specifically tailored to accommodate these different needs (Figure 1):

− (Level 1) Aspiration learning for NUMA-node switching, which responds only to significant performance variations and does not require frequent migrations.

− (Level 2) Perturbed learning automata for CPU-core pinning within a given NUMA node, which allows frequent CPU-core switches.

Fig. 1. Two-level scheduling where the RM decides firstly the NUMA node and secondly the CPU core at which each thread should be pinned.

We introduce periodic time instances with period T_CPU > 0, indexed by k = 1, 2, ..., at which decisions at Level 2 (CPU-core pinning) are revised. Decisions at Level 1 (NUMA-node switching) are performed less frequently, at periodic time instances of period T_NUMA ≫ T_CPU, which will be indexed by τ = 1, 2, ....

A. Utility Function
A cornerstone in the design of any such multi-agent formulation is the preference criterion or utility function u_i for each thread i ∈ I. The utility function captures the benefit of a decision maker (thread) resulting from the assignment profile α, i.e., it represents a function of the form u_i : A → R_+ (where we restrict it to be a positive number). The action profile (i.e., the selections of all threads) constitutes a "state" of the environment that directly determines the performances of all threads. We are interested in building learning-based reflex agents that respond only to current measurements in an effort to "eventually" learn to play efficient assignments.

It is important to note that the utility function u_i of each agent/thread i is subject to design and is introduced in order to guide the preferences of each agent. Thus, u_i may not necessarily correspond to a measured quantity, but it could be a function of available performance counters. For example, a natural choice for the utility of each thread is its own execution speed, which can be measured by the number of executed instructions per unit of time. This may also be combined with other counters, e.g., the number of memory-access instructions, the number of cache misses, etc., to give a better representation of the performance of a thread.

B. Aspiration learning for NUMA-node switching
We developed a novel learning scheme for NUMA-node switching that is based upon the notions of benchmark actions/performances and bears similarities with so-called aspiration learning [17]. The novelty here lies in the introduction of two benchmark levels in order to handle the possibility of noisy measurements. This type of learning dynamics tries to gradually reach assignment profiles where all threads perform well. It has the advantage that exploration (of new assignments) can be performed selectively (e.g., when a significant reduction in performance is observed). In this way, low-frequency NUMA-node switching can be attained. The specific steps are depicted in Table I.

It is important to note that this learning scheme will react immediately to a rapid drop in performance. In particular, when the performance drops below the lower benchmark, then with high probability the action will change, while in any other case, the action will change only with a small probability ζ > 0. The reason for maintaining both an upper and a lower benchmark is to minimize the effect of noise on the decision-making process.

When a thread needs to select a new NUMA node, it will select among the set of better replies, i.e., nodes at which other threads perform better so far. Note that a thread may not have a-priori knowledge of the exact impact an action switch has on its own utility (until this action switch is performed). However, we may use prior data on the performances of other threads, as defined in BR_NUMA,i(α). Thus, at step (4a), we may direct threads that currently do not perform well to the NUMA nodes where threads perform better.

C. Perturbed Learning Automata for CPU-core pinning
Let us assume that, at Level 1, and for each one of the running threads i ∈ I, the RM has already selected a NUMA node α_i ∈ J_NUMA. Then, at Level 2, the RM needs to decide which CPU core each thread should be pinned to. Given that CPU-core switching within the same NUMA node is usually costless, we have designed a learning algorithm that allows frequent switching and therefore a faster convergence rate. To this end, we employ perturbed learning automata [18], developed by the authors. Such dynamics perform well in the presence of noise, contrary to alternative schemes, as discussed in [18], and can guarantee convergence to at least locally optimal assignments.

The basic idea behind learning automata is rather simple. Each agent i keeps track of a strategy vector that holds its estimates over the best choice. We denote this strategy by σ_i = [σ_ij]_j, where j ∈ J_CPU(α_i), σ_ij ≥ 0 and Σ_j σ_ij = 1. To provide an example, consider the case of 3 available CPU cores, i.e., J_CPU(α_i) = {1, 2, 3}. In this case, a vector of the form σ_i = (0.2, 0.5, 0.3) is a strategy vector, such that 0.2 corresponds to the probability of the thread assigning itself to CPU core 1, 0.5 to CPU core 2 and 0.3 to CPU core 3. Briefly, the CPU-core selection will be denoted by β_i ∈ J_CPU(α_i). Note that if σ_i is a unit vector, say e_j, then agent i selects its j-th action with probability one.

In particular, the steps executed in each iteration of the perturbed learning automata are depicted in Table II. According to this recursion, if thread i currently selected CPU core β_i(k) and measured performance ũ_i(k), then its strategy is going to increase in the direction of the selected action and proportionally to the observed performance. Informally, the dynamics reinforce repeated selection, and reinforcement is always proportional to the received reward.

TABLE I: ASPIRATION LEARNING FOR NUMA-NODE SWITCHING

At fixed periodic time instances denoted by τ = 1, 2, ..., with period T_NUMA sec, the following steps are executed recursively for each thread i in parallel.

(1) Performance measurement. For the currently selected NUMA node α_i(τ), thread i retrieves its current performance measurement, ũ_i(τ).

(2) Aspiration-level update. Given the current performance measurement ũ_i(τ), update the discounted running average performance of the thread, as follows:

  ρ_i(τ+1) = ρ_i(τ) + ν · [ũ_i(τ) − ρ_i(τ)],   (2)

where ũ_i(τ) is the current measurement of the utility of thread i.

(3) Benchmarks update. Define the upper benchmark performance, b̄_i(τ), as a performance threshold over which a performance is considered satisfactory, and the lower benchmark performance, b_i(τ), as a performance threshold under which a performance is considered unsatisfactory, with b_i(τ) < b̄_i(τ). They are updated as follows:

  − if ρ_i(τ+1) ≥ b̄_i(τ), then b̄_i(τ+1) = ρ_i(τ+1) and b_i(τ+1) = ρ_i(τ+1)/η;
  − if b_i(τ) ≤ ρ_i(τ+1) < b̄_i(τ), then b̄_i(τ+1) = b̄_i(τ) and b_i(τ+1) = b_i(τ);
  − if ρ_i(τ+1) < b_i(τ), then b̄_i(τ+1) = η · ρ_i(τ+1) and b_i(τ+1) = ρ_i(τ+1);

for some constant η > 1.

(4) Action update. A thread i selects actions according to the following rule:

a) if ρ_i(τ+1) < b_i(τ), i.e., if the updated discounted running average performance is unsatisfactory, then thread i will perform a random switch to a better reply, i.e.,

  α_i(τ+1) ∈ rand_unif[BR_NUMA,i(α)],

where BR_NUMA,i(α) denotes the better-reply set of thread i to the assignment α, defined as

  BR_NUMA,i(α) := { α'_i ∈ J_NUMA : ρ_i(τ) < γ · ( Σ_{j∈I: α_j(τ)=α'_i} ρ_j(τ) ) / |{ j ∈ I : α_j(τ) = α'_i }| }   (3)

for some γ ∈ (0, 1]. The set {j ∈ I : α_j(τ) = α'_i} includes all those threads that selected action α'_i at the most recent time instance. In other words, an action α'_i ∈ BR_NUMA,i(α) if the threads selecting α'_i did better on average than thread i. If more than one thread has chosen to migrate, then only one thread (selected at random) is allowed to execute this migration.

b) if ρ_i(τ+1) ≥ b_i(τ), then thread i will keep playing the same action with high probability and experiment with any other action with a small probability ζ > 0, i.e.,

  α_i(τ+1) = α_i(τ) with probability 1 − ζ, and α_i(τ+1) = rand_unif[BR_NUMA,i(α)] with probability ζ.   (4)

If more than one thread has chosen to migrate, then only one thread (selected at random) is allowed to execute this migration.
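The aspiration-level and benchmark updates of Table I (steps 2-3) can be sketched as below; the values of ν and η are illustrative, not the ones used in the paper's experiments.

```python
def aspiration_update(rho, u_meas, b_hi, b_lo, nu=0.1, eta=1.2):
    """One Level-1 update (steps 2-3 of Table I): smooth the measured
    utility u_meas into the running average rho (eq. (2)), then move the
    satisfaction benchmarks b_lo < b_hi. eta > 1 sets the benchmark spread."""
    rho_next = rho + nu * (u_meas - rho)      # eq. (2): discounted average
    if rho_next >= b_hi:                      # satisfactory performance
        b_hi, b_lo = rho_next, rho_next / eta
    elif rho_next < b_lo:                     # unsatisfactory: triggers a switch
        b_hi, b_lo = eta * rho_next, rho_next
    # otherwise (b_lo <= rho_next < b_hi) benchmarks are left unchanged
    return rho_next, b_hi, b_lo

# A sustained performance drop (utility falls from ~1.0 to 0.2): the running
# average tracks the drop and the benchmarks follow it downward.
rho, b_hi, b_lo = 1.0, 1.0, 1.0 / 1.2
for _ in range(50):
    rho, b_hi, b_lo = aspiration_update(rho, 0.2, b_hi, b_lo)
assert abs(rho - 0.2) < 0.05   # running average has converged near 0.2
assert b_lo <= rho <= b_hi     # benchmarks bracket the running average
```

Note how the two thresholds implement the noise tolerance described above: a single noisy dip that stays above b_lo leaves both the benchmarks and the NUMA assignment untouched.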
TABLE II: PERTURBED LEARNING AUTOMATA FOR CPU-CORE PINNING

At fixed time instances denoted by k = 1, 2, ..., the following steps are executed recursively for each thread i in parallel.

(1) Performance measurement. For the currently selected CPU core β_i(k), thread i retrieves its current performance measurement, ũ_i(k).

(2) Strategy update. Given that α_i is the current NUMA-node assignment of thread i, and |J_CPU(α_i)| is the number of available CPU cores, the strategy of thread i with respect to its CPU-core pinning is defined as:

  σ_i(k) = (1 − λ) x_i(k) + (λ / |J_CPU(α_i)|) · 1,   (5)

where λ > 0 corresponds to a perturbation term (or mutation) and x_i(k) corresponds to the nominal strategy of agent i. The nominal strategy is updated according to the following recursion:

  x_i(k+1) = x_i(k) + ε · ũ_i(k) · [e_{β_i(k)} − x_i(k)]   (6)

for some constant step size ε > 0.

(3) Action update. The action of each thread i is updated as follows:

  β_i(k+1) = rand_{σ_i}[J_CPU(α_i)].

V. CONVERGENCE ANALYSIS
The problem of optimally allocating threads to CPU cores can be formulated as a load-balancing game. Such a formulation can help us provide an immediate answer with respect to whether optimal allocations exist, as well as the characteristics of these allocations. The notion of weak acyclicity [19] in strategic-form games can help us provide an answer to these questions.

In the context of load-balancing games, we are given a set of tasks (or computing threads) that need to be executed in a multi-core computing system (comprising multiple CPU cores). An objective may correspond to the minimization of the makespan, that is, the maximum load over all the available CPU cores. In this case, the computing load of a CPU core corresponds to the total computing bandwidth requested by all threads assigned to this core, that is, the frequency with which the CPU core is reserved by all threads.

More formally, there exist m CPU cores with speeds s_1, s_2, ..., s_m and n threads with weights w_1, w_2, ..., w_n, where the weight of a thread i characterizes its operation/service level (e.g., the computing bandwidth requested). The speed s_j of CPU core j will be defined as the maximum number of instructions per second (IPS) that can be executed by the CPU core. Moreover, the weight w_i of a thread i will be measured by the number of instructions per second that this thread will require within a unit of available bandwidth.

The speed s_j of core j may not necessarily be known in advance (it is usually an average over many different types of threads). Also, the weight w_i may not be available, while it may change throughout the execution time of a thread. For now, let us assume that these quantities are constant, but not necessarily known. As we will see, explicit knowledge of these quantities will not be necessary.

We can analyze the problem of allocating threads to CPU cores within the context of strategic-form games. In strategic-form games, there exists a set of players/agents I := {1, ..., n}, which in this case is taken to be the set of threads requesting resources, and a set J_CPU := {1, ..., m}, which is the set of machines or CPU cores available. In this setting, each thread may be thought of as an independent player that can decide independently with respect to which one of the available cores to run on. In this context, β_i ∈ J_CPU corresponds to the action of thread i, which may be any one of the available cores J_CPU, and β := (β_1, ..., β_n) corresponds to the action profile over all threads (or assignment).

Fig. 2. A sketch of a load-balancing allocation problem in the context of a multi-core computing system. Each running thread independently pins itself to a single CPU core. Multiple threads may run on the same CPU core.

This definition of actions naturally fits the setup of Perturbed Learning Automata for CPU-core pinning of Section IV-C, where each thread i regularly updates its selection β_i so that threads gradually learn the optimal allocation. Can threads, however, learn to play an optimal allocation? In order to answer this question, we need to have a closer look at the structure and properties of their interaction. Such an investigation can be performed in the context of strategic-form games and is described in the following section.

A. Weak acyclicity and optimal CPU-core pinning
As is the case in standard operating systems, each thread may run on any one of the available CPU cores under no constraints; e.g., all threads may run on the same core. However, the number of threads running on the same CPU core influences the speed with which these threads will be executed (a high number of threads on the same CPU core will lead to a low processing speed for these threads, and vice versa). In particular, the load of a CPU core j ∈ J_CPU under assignment β will be defined as

  ℓ_j(β) := ( Σ_{k∈I: β_k=j} w_k ) / s_j > 0.   (7)

We will also denote the maximum load under profile β as L(β) := max_{j∈J_CPU} ℓ_j(β). In other words, L(β) corresponds to the makespan, cf. [20, Chapter 20].

Although the speed s_j of CPU core j and the weight w_i of thread i may not be known in advance, the actual running speed of a thread on a given core can be measured in real time quite accurately (that is, the total number of completed instructions per second, which may include computational or memory-related instructions).

We define the utility of thread i as the number of instructions completed per second on core j, which can be expressed as follows:

  u_i(β_i = j, β_−i) := ( w_i / Σ_{k∈I: β_k=j} w_k ) · s_j = w_i / ℓ_j(β),   (8)

where we have assumed that the operating system allocates the available bandwidth of CPU core j fairly over all threads and proportionally to their weights. It is important to note that w_i and ℓ_j(β) may not be known or easily measured. However, the utility u_i can directly be measured at regular time intervals and per thread. Thus, it can directly be integrated into the implementation of the algorithms in Tables I-II. This design is motivated by the measurement-based optimization approach for resource allocation introduced in [21]. It also introduces a slightly different design than the classical treatment of load-balancing games (see, e.g., [20]), where the cost function of a thread is defined as the load of the core.

The strategic-form game, characterized by the tuple ⟨I, A, {u_i}_i⟩, will be referred to as a load-balancing game. We are specifically interested in allocations that correspond to (pure) Nash equilibria, that is, allocations β* at which no thread would have the incentive to switch to a different CPU core. In particular, an allocation β* is a Nash equilibrium if u_i(β'_i, β*_−i) ≤ u_i(β*_i, β*_−i) for all β'_i ≠ β*_i.

Let us denote the set of Nash-equilibrium allocations by B_NE. Moreover, let us define the set B* of optimal allocations as

  B* := { β* : L(β*) ≤ L(β) for all β }.   (9)

In other words, the set of optimal assignments minimizes the makespan. Let L* denote the minimum makespan, achieved at the optimal assignments.

Proposition 5.1 (Existence of Nash equilibria):
Consider the load-balancing game characterized by the tuple ⟨I, A, {u_i}_i⟩ with the utility function defined by (8). Then, the set of pure Nash equilibria is non-empty, i.e., B_NE ≠ ∅.

Proof.
Let us consider any allocation profile β which is not a pure Nash equilibrium. In other words, there exists a thread i and two available CPU cores j and l such that switching from core j to core l strictly increases the utility of thread i (i.e., its processing speed). In particular, given that

  u_i(β_i = j, β_−i) − u_i(β'_i = l, β_−i) = w_i · ( ℓ_l(β') − ℓ_j(β) ) / ( ℓ_l(β') ℓ_j(β) ),   (10)

we conclude that if u_i(β') > u_i(β) (i.e., β' is a better reply to β), then ℓ_j(β) > ℓ_l(β'). In other words, if thread i strictly improves its speed by switching from core j to core l, this implies that the load of core j (when i runs on core j) is strictly larger than the load of core l (when i runs on core l). Thus, we conclude that L(β') ≤ L(β), i.e., under any better reply, the makespan decreases or remains the same. Furthermore, the number of cores whose load is equal to or higher than ℓ_j(β) has strictly decreased. We conclude that this process may only terminate at a state where no thread can improve its speed any further, i.e., at a Nash equilibrium. □

The importance of this proposition lies in the fact that there exists a set of Nash equilibria at which all threads perform well, at least locally. Note that the set of Nash equilibria may not necessarily coincide with the set of optimal allocations B*. In fact, the set of optimal allocations may or may not be part of the set of Nash equilibria. However, certain guarantees can be established with respect to the utility achieved at the worst Nash equilibrium as compared to the utility received at an optimal allocation. The following proposition provides a lower bound on the performance of any Nash equilibrium as compared to the performance of an optimal assignment. We only investigate the case of identical CPU cores, since this condition is satisfied by our experimental setup.
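The better-reply process used in this proof can be simulated directly. The sketch below (with illustrative weights and speeds, not taken from the paper's experiments) iterates strict better replies under the utility (8) until no thread can improve, i.e., until a pure Nash equilibrium is reached.

```python
import random

def loads(beta, w, s):
    """Core loads l_j(beta) = sum of weights on core j divided by s_j (eq. (7))."""
    return [sum(w[i] for i in range(len(beta)) if beta[i] == j) / s[j]
            for j in range(len(s))]

def utility(i, beta, w, s):
    """u_i = w_i / l_{beta_i}(beta) (eq. (8)): thread i's share of its core's speed."""
    return w[i] / loads(beta, w, s)[beta[i]]

def better_reply_dynamics(w, s, beta, seed=0):
    """Repeatedly let one thread move to a core that strictly raises its
    utility. By the argument of Proposition 5.1, this terminates at a
    pure Nash equilibrium of the load-balancing game."""
    rng = random.Random(seed)
    improved = True
    while improved:
        improved = False
        for i in rng.sample(range(len(beta)), len(beta)):
            best = max(range(len(s)),
                       key=lambda j: utility(i, beta[:i] + [j] + beta[i + 1:], w, s))
            if utility(i, beta[:i] + [best] + beta[i + 1:], w, s) > utility(i, beta, w, s) + 1e-12:
                beta[i] = best        # strict better reply: execute the move
                improved = True
    return beta

w = [1.0, 1.0, 2.0, 2.0]   # illustrative thread weights
s = [1.0, 1.0]             # two identical cores
beta = better_reply_dynamics(w, s, [0, 0, 0, 0])   # start with everyone on core 0

# At the fixed point, no unilateral core switch raises any thread's utility.
for i in range(len(w)):
    for j in range(len(s)):
        alt = beta[:i] + [j] + beta[i + 1:]
        assert utility(i, alt, w, s) <= utility(i, beta, w, s) + 1e-12
```

Each strict improvement moves a thread to a core whose resulting load is below the load of the core it left, which is exactly the monotonicity that makes the improvement path finite.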
Proposition 5.2 (Performance of Nash equilibria):
For the case of identical CPU cores and for any pure Nash equilibrium assignment β ∈ B^NE, the makespan satisfies

L(β) ≤ (2|J_CPU| / (|J_CPU| + 1)) · L∗,   (11)

where |J_CPU| denotes the number of available CPU cores. Furthermore, the utility of any thread i ∈ I at any pure Nash equilibrium assignment β ∈ B^NE satisfies

u_i(β) ≥ ((|J_CPU| + 1) / (2|J_CPU|)) · (w_i / L∗).   (12)

Proof.
The proof of the first statement (11) follows the exact same reasoning as Theorem 20.5 in [20]. The proof of the second statement (12) follows directly from the definition of the utility (8) and the first statement (11). In particular, consider any thread i with weight w_i. Its speed satisfies

u_i ≥ w_i / L(β) ≥ ((|J_CPU| + 1) / (2|J_CPU|)) · (w_i / L∗),

which concludes the proof. □

The above proposition provides a lower bound on the utility that can be achieved at a Nash equilibrium assignment. In particular, note that the ratio u∗_i ≐ w_i / L∗ corresponds to the least maximum speed that a thread can achieve under an optimal assignment. Thus, in a 10-CPU-core architecture, condition (12) implies that u_i(β) ≥ (11/20) · u∗_i. Such a lower bound is rather conservative; however, it provides a significant guarantee.

From Equation (12), we may also conclude that

Σ_{i∈I} u_i ≥ ((|J_CPU| + 1) / (2|J_CPU|)) · Σ_{i∈I} (w_i / L∗),

which also establishes a similar lower bound with respect to our original (desirable) objective of maximizing the average speed over all threads. We conclude that if threads settle on a Nash equilibrium assignment, then there is a certain guarantee with respect to their average running speed.

B. Convergence analysis of CPU-core pinning

The previous section discussed existence and properties of assignments that are Nash equilibria of the load-balancing game of the CPU-core assignment problem. Given the properties of Proposition 5.2, Nash-equilibrium assignments should be desirable, since they provide certain guarantees with respect to the overall performance. However, can the dynamics presented in Section IV of Tables I–II guarantee convergence to the set of Nash-equilibrium assignments?
This is the question we try to answer in this section. First, we investigate the convergence properties of the dynamics of Table II under the condition of a single available NUMA node. In other words, threads do not have the opportunity to migrate, and they can only increase their utility by improving their pinning assignment to the available CPU cores. The following proposition provides strong guarantees with respect to the convergence of the dynamics for CPU-core pinning of Table II.
Proposition 5.3 (Convergence of CPU-pinning):
Consider the update recursion of Table II. The fraction of time that the discrete-time dynamics spends in an arbitrarily small neighborhood of the set of pure Nash equilibria goes to one as the perturbation factor λ ↓ 0, the step-size ε ↓ 0 and the time index k → ∞.

Proof.
Theorem 3.1 in [18] has shown that, as the perturbation factor λ ↓ 0, the induced Markov chain of the dynamics of Table II has an invariant probability measure whose support lies on the pure-strategy states (i.e., states at which, for all i, x_i assigns probability one to some action). By Birkhoff's individual ergodic theorem [22, Theorem 2.3.4], this implies that the process will spend an arbitrarily large portion of time on pure-strategy states as λ ↓ 0 and k → ∞. Furthermore, according to [23, Proposition 3.6], λ-perturbations of pure Nash equilibria are the unique limit points of the continuous-time approximation of the dynamics (6). Thus, according to a straightforward implementation of [24, Theorem 8.2.1], the fraction of time that the discrete-time dynamics (6) spends in a small neighborhood of the set of pure Nash equilibria goes to one as ε ↓ 0 and k → ∞. □

C. Discussion on combined NUMA and CPU placements
The main motivation for decomposing the decision-making process into NUMA placement and CPU pinning in Tables I–II, respectively, lies in the principle of two time-scale dynamics. In particular, the NUMA-placement algorithm of Table I operates at a slow time scale with a period of T_NUMA, while the CPU pinning of Table II operates at a faster time scale with a period T_CPU ≪ T_NUMA. The goal is to allow the dynamics of CPU pinning to first approach a Nash-equilibrium assignment (given the convergence guarantees of Proposition 5.3) before any thread considers migrating to a different NUMA node. This design principle also restricts frequent NUMA-node migrations, since they may be rather costly (taking into account possible implications for memory access).

When we select T_NUMA/T_CPU to be sufficiently large, the CPU-core pinning dynamics have already settled in the set of pure Nash equilibria (according to Proposition 5.3) before the migration of threads to different nodes is revised. There are two conditions under which a thread decides to migrate. Under the first condition (4a) of Table II, thread i is unsatisfied under the current assignment, and randomly selects among alternative NUMA nodes where threads currently perform better on average. By appropriately selecting a sufficiently small γ ∈ (0, 1) in the implementation of the better-reply condition (3), a migration to a new NUMA node will only result in an increased processing speed for a thread. This is also guaranteed by the fact that only one thread is allowed to migrate at a given time. Under the second condition (4b) of Table II, there always exists a small probability ζ > 0 that a (neither satisfied nor unsatisfied) thread is selected to migrate at random, given that there are alternative nodes that can offer a better performance. Thus, under either condition, and for sufficiently large T_NUMA/T_CPU, we should expect that threads may only increase their performance by migrating.

VI. EXPERIMENTS
In this section, we present an experimental study of the proposed framework. Experiments were conducted on 2× Intel Xeon CPU E5-2650 v3 @ 2.30 GHz running a 64-bit Linux kernel (3.13.0-43-generic). The cores are divided into two NUMA nodes (Node 1: CPU cores 0–9, Node 2: CPU cores 10–19).

In all experiments, the utility of each thread is defined as the total instructions completed per second, which incorporates both the computational and memory-access instructions. This is a multi-objective criterion, and it is expected that the larger the number of instructions completed, the larger the processing speed of a thread. We compared the overall performance of the application (in terms of processing speed of threads and completion time of an application) with that of the Linux OS scheduler. We considered a number of parallel applications under different levels of resource availability (i.e., number of CPU cores available for the applications) and background-load settings (i.e., number of threads of other applications running on the available cores at the same time).

A. Benchmark applications
In particular, we have considered the following benchmark applications:

− Swaptions (SWA), which uses the Heath-Jarrow-Morton (HJM) framework to price a portfolio of swaptions. The HJM framework describes how interest rates evolve for risk management and asset-liability management [25]. The application employs Monte-Carlo simulation to compute the prices. It is regular in terms of task sizes, with a low degree of communication between different threads. It was taken from the Parsec benchmark suite.

− Blackscholes (BLA), which calculates, using differential equations, how the value of an option changes as the price of the underlying asset changes; the parallel implementation calculates values for a number of options at the same time, assigning a thread to each option (or a group of options). If the options are equally divided between threads, this results in a regular (in terms of task sizes) parallel application. In practice, similar calculations are used by financial houses to price 10–100 thousands of options. This is a computationally intensive application, as depicted in Table III. It was taken from the Parsec benchmark suite.

− Ant Colony Optimization (ACO) [26], a metaheuristic used for solving NP-hard combinatorial optimization problems. In this paper, we apply ACO to the Single Machine Total Weighted Tardiness Problem (SMTWTP). Briefly, this is a scheduling problem of jobs that are characterized by varying processing times, deadlines and weights; the objective is to find the schedule that minimizes the total tardiness. A detailed description of this use case is provided in [6]. This is a computationally intensive application, as depicted in Table III.

− Stochastic-Local-Search for Cutting-Stock Industrial Optimization (CSO), which optimizes classical bin-packing and cutting-stock optimization problems using an evolutionary stochastic-local-search (SLS) algorithm. The use case and the type of parallelization (which is based on the FastFlow parallelization library [27]) have been described in detail in [28]. In particular, we used the Scholl 1–3 datasets for classical bin-packing problems provided in [29]. According to the implemented SLS algorithm, an initial number of candidate solutions (pool) of a bin-packing/cutting-stock problem are processed continuously through a series of heuristic-based operations/modifications (optimization cycle). In each such cycle, multiple threads are assigned a portion of the candidate solutions. Since the application usually runs for a fixed time, the total number of candidate solutions processed in all completed optimization cycles constitutes an indication of the average processing speed. This is a memory-intensive application, as depicted in Table III, while the requested computation bandwidth varies significantly with time.

TABLE III
COMPUTATIONAL/MEMORY INTENSITY OF CASE STUDIES
(TOT INS = total instructions, LST INS = load/store instructions, TLB DM = data translations)

Index             | BLA     | SWA     | ACO     | CSO
TOT INS / LST INS | O(10^7) | O(10^6) | O(10^5) | O(10^2)
TLB DM / LST INS  | O(10^−) | O(10^−) | O(10^−) | O(10^−)
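The pool-based optimization cycle of the CSO use case can be sketched as follows. This is an illustrative reconstruction (hypothetical candidate representation, heuristic operator, and pool size), not the FastFlow-based implementation of [28]: in each cycle, the candidate pool is split among worker threads, each worker applies a heuristic modification to its share, and the pool is reassembled.

```python
import random
from concurrent.futures import ThreadPoolExecutor

random.seed(2)

def heuristic_step(candidate):
    """Stand-in for one heuristic operation/modification of a candidate solution."""
    return candidate + random.choice([-1, 0, 1])

def optimization_cycle(pool, n_threads):
    """One cycle: split the candidate pool among threads and process each share."""
    chunk = (len(pool) + n_threads - 1) // n_threads
    shares = [pool[k:k + chunk] for k in range(0, len(pool), chunk)]
    with ThreadPoolExecutor(max_workers=n_threads) as ex:
        results = list(ex.map(lambda share: [heuristic_step(c) for c in share],
                              shares))
    return [c for share in results for c in share]

pool = list(range(100))      # hypothetical pool of candidate solutions
processed = 0
for _ in range(10):          # a fixed number of optimization cycles
    pool = optimization_cycle(pool, n_threads=4)
    processed += len(pool)   # CSP: candidate solutions processed so far

assert len(pool) == 100 and processed == 1000
```

The counter `processed` corresponds to the CSP metric reported for this use case: candidate solutions processed across all completed cycles within the fixed running time.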
B. Experimental setup
The period of the CPU pinning is fixed to T_CPU (see Table IV), which is also the interval at which the RM collects measurements of the total instructions completed per second (using the PAPI library [30]) for each one of the threads separately. In other words, the utility u_i of thread i corresponds to the total instructions completed per second for thread i. Pinning of threads to CPU cores is achieved through the sched.h library. In all experiments, the RM is executed by the master thread of an application, which is always running on a fixed CPU core (usually the first available CPU core of the first NUMA node).

TABLE IV
ALGORITHM SETTINGS

Parameter | Value
ε         |
λ         |
T_CPU     | sec
ν         |
ζ         |
γ         |
η         |
T_NUMA    | sec

In Table V, we provide an overview of the conducted experiments. We classify the experiments with respect to the resource availability and the CPU availability. We classify the resource availability as small (around 4 application threads per CPU core), medium (2 threads per CPU core) and large (1 thread per CPU core). We classify the CPU availability as uniform, when no background applications are running and therefore all CPU cores are fully available to the tested application; non-uniform, where 8 threads of a background application are running on the first 4 CPU cores of the machine for the whole duration of the run of the tested application; and time-varying, where the availability varies continuously with time on the first 4 CPU cores of the machine. Our goal is to investigate the performance of the scheduler under different sets of available resources, and how the dynamic scheduler adapts to background load.

TABLE V
CLASSIFICATION OF THE EXPERIMENTS

Exp. | Resource availability | CPU availability
A.1  | Small                 | Uniform
A.2  | Small                 | Non-uniform
A.3  | Small                 | Time-varying
B.1  | Medium                | Uniform
B.2  | Medium                | Non-uniform
B.3  | Medium                | Time-varying
C.1  | Large                 | Uniform
C.2  | Large                 | Non-uniform
C.3  | Large                 | Time-varying
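Thread-to-core pinning of the kind performed by the RM through sched.h can also be sketched from user space. The snippet below is a simplified, Linux-specific illustration using Python's `os.sched_setaffinity`, and it replaces the PAPI instructions-per-second counter with a crude work-loop rate (an assumption for portability; the actual RM reads hardware counters):

```python
import os
import time

# Restrict the calling process to one of the cores it is currently allowed on.
allowed = sorted(os.sched_getaffinity(0))   # pid 0 = the calling process
target = {allowed[0]}
os.sched_setaffinity(0, target)             # analogous to pinning via sched.h
assert os.sched_getaffinity(0) == target

def measure_rate(duration=0.05):
    """Crude stand-in for a PAPI total-instructions-per-second measurement."""
    ops = 0
    t0 = time.perf_counter()
    while time.perf_counter() - t0 < duration:
        ops += 1                            # one "unit of work" per iteration
    return ops / duration                   # utility u_i: work completed per sec

u = measure_rate()
assert u > 0

os.sched_setaffinity(0, set(allowed))       # restore the original affinity mask
```

In the actual framework, each thread's measured rate feeds the learning recursion of Tables I–II at every T_CPU interval.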
C. Experimental Results
Tables VI–IX show the execution times of the four chosen applications under the OS and PaRLSched schedulers, and under the experimental scenarios of Table V. Below, we analyze each application separately.
TABLE VI
COMPLETION TIMES OF OS AND PaRLSched SCHEDULING FOR SWAPTIONS APPLICATION. WE SHOW THE MEAN EXECUTION TIME OF THE APPLICATION, THE DEVIATION, AND THE IMPROVEMENT IN EXECUTION TIME OF PaRLSched OVER OS SCHEDULING.

Exp/Resources | OS Mean | OS Dev | PaRLSched Mean | PaRLSched Dev | Diff. (%)
SWA (A.1)     |         |        |                |               | +
SWA (A.2)     |         |        |                |               | +
SWA (A.3)     |         |        |                |               | +
SWA (B.1)     |         |        |                |               | +
SWA (B.2)     |         |        |                |               | +
SWA (B.3)     |         |        |                |               | +
SWA (C.1)     |         |        |                |               | −
SWA (C.2)     |         |        |                |               | +
SWA (C.3)     |         |        |                |               | +

a) SWA: We observe that the PaRLSched scheduler exhibits better behavior than the OS under small and medium availability of resources (i.e., categories A and B), with or without background interference. The improvement varies between 0.13% and 10.69%. In the case of large availability of resources (i.e., category C), the OS outperforms PaRLSched, but only in the case where there is no background interference. Note also that the percentages of the deviations are significantly smaller than the corresponding performance differences (except for the A.1 case); thus we may not attribute these improvements to noise.

b) ACO:
In this set of experiments, we see a similar behavior to the SWA experiments. The PaRLSched outperforms the OS in the case of small and medium availability of resources and in the presence of background interference (i.e., categories A.2–A.3 and B.2–B.3). The improvement may reach up to 16.92%. In the absence of any background interference, the behavior under small availability of resources (i.e., category A.1) is about equivalent, while in the remaining categories the OS outperforms the PaRLSched scheduler. As a side note, we should mention that even under scenarios where the OS outperformed PaRLSched, such as scenario C.3, the average speed over all threads is not necessarily smaller, as Figure 3 demonstrates. In other words, the PaRLSched does indeed achieve a good level of average processing speed, which agrees with its design criterion, but apparently completion time is not only a matter of average speed. For example, a large average speed over all threads does not necessarily guarantee that all threads are running with identical speeds. Instead, there might be significant differences in the speeds between threads, which may have an impact on the overall completion time.

c) BLA:
The performance under the Blackscholes application does not deviate significantly from the conclusions of the ACO and SWA applications. In fact, we observe a consistently better performance of PaRLSched under conditions of small resource availability, which may reach up to 4.05% improvement. On the other hand, the performance under large resource availability has been up to 8.89% worse than the OS performance.

Fig. 3. Sample responses for Experiments of category (i.e., under time-varying CPU availability). The running average speed is measured in (10 · instructions/sec/thread).

d) CSO: The CSO application is a bit different from the ones previously considered. It is characterized by scattered memory pages, as Table III reflects. In general, we observe a significant advantage of the PaRLSched scheduler under categories A and C of resource availability, and a reduced performance in the case of category B (medium availability). This rather inconclusive behavior should be attributed to the irregular memory accesses of the application and the long idle times of the threads. The large variation in the requested bandwidth is also demonstrated in Figures 4, 5, and 6, which show the response of the PaRLSched scheduler under all scenarios.
D. Discussion
In general, we observed that the PaRLSched scheduler was able to achieve better performance than the OS scheduler in cases of limited availability of resources (category A) and external disturbances. Under such scenarios, we expect the performance of individual threads to vary due to external influences and, therefore, it is important to make the correct remapping decisions. Also, under such scenarios, it is not possible to predict this variation in performance solely based on the characteristics of the application itself. Finally, in the memory-intensive application (CSO), the scheduler was able to better adapt to the irregularity in the memory-
TABLE VII
COMPLETION TIMES (CT) AND AVERAGE PROCESSING SPEED (AVG. SPD) OF OS AND PaRLSched SCHEDULING FOR ACO APPLICATION. WE SHOW THE MEAN EXECUTION TIME OF THE APPLICATION, THE MEAN DEVIATION (IN SECONDS), AND THE AVERAGE PROCESSING SPEED PER THREAD (IN INSTRUCTIONS PER SECOND).

Exp/Time(s) | OS Mean CT | OS Dev CT | OS Avg. Spd | PaRLSched Mean CT | PaRLSched Dev CT | PaRLSched Avg. Spd | Diff. CT (%) | Diff. Avg. Spd (%)
ACO (A.1)   |            |           |             |                   |                  |                    | −            | +
ACO (A.2)   |            |           |             |                   |                  |                    | +            | +
ACO (A.3)   |            |           |             |                   |                  |                    | +            | +
ACO (B.1)   |            |           |             |                   |                  |                    | −            | +
ACO (B.2)   |            |           |             |                   |                  |                    | +            | +
ACO (B.3)   |            |           |             |                   |                  |                    | +            | +
ACO (C.1)   |            |           |             |                   |                  |                    | −            | +
ACO (C.2)   |            |           |             |                   |                  |                    | −            | −
ACO (C.3)   |            |           |             |                   |                  |                    | −            | +
Average     |            |           |             |                   |                  |                    | +            | +

TABLE VIII
COMPLETION TIMES OF OS AND PaRLSched SCHEDULING FOR BLACKSCHOLES (BLA) APPLICATION.

Exp/Resources | OS Mean | OS Dev | PaRLSched Mean | PaRLSched Dev | Diff. (%)
BLA (A.1)     |         |        |                |               | +
BLA (A.2)     |         |        |                |               | +
BLA (A.3)     |         |        |                |               | +
BLA (B.1)     |         |        |                |               | +
BLA (B.2)     |         |        |                |               | −
BLA (B.3)     |         |        |                |               | −
BLA (C.1)     |         |        |                |               | −
BLA (C.2)     |         |        |                |               | −
BLA (C.3)     |         |        |                |               | −

access speeds between the two NUMA nodes, also under large availability of resources. On the other hand, the OS outperformed the PaRLSched scheduler in most cases of large availability of resources (e.g., category C.1). This should be attributed to the fact that the Linux scheduler utilizes internal load balancing of threads between cores, which has a notable effect on the execution time when there is no significant background interference (in terms of additional running applications). In this case, the performance of the individual threads depends exclusively on the distribution of the threads of the application to cores, so there is no additional benefit in measuring external interference in the
PaRLSched scheduler. The PaRLSched scheduler applies rigid pinning of threads to cores, which means that it cannot utilize any internal load balancing by the Linux scheduler.

Given the rather diverse nature of the considered applications, the observed improvements constitute a promising indication. Note that the intention and goal of this work is not to replace the OS scheduler, but instead to act on a supervisory level, possibly under alternative multi-objective criteria. The notion of the utility function that drives the thread placement can be designed to accommodate any such multi-objective criterion, since the only assumption considered is the positivity constraint.

Fig. 4. Sample responses for Experiments of category (i.e., under time-varying CPU availability). The running average speed is measured in (10 · instructions/sec/thread).

VII. CONCLUSIONS AND FUTURE WORK

We proposed a measurement- (or performance-) based learning scheme for addressing the problem of efficient dynamic pinning of parallelized applications in many-core systems under a NUMA architecture. According to this scheme, a centralized objective is decomposed into thread-based objectives, where each thread is assigned its own utility function. Allocation decisions are organized into a hierarchical decision structure: at the first level, decisions are taken with respect to
TABLE IX
CANDIDATE SOLUTIONS PROCESSED (CSP) AND AVERAGE PROCESSING SPEED (AVG. SPD) UNDER OS AND PaRLSched SCHEDULING FOR CSO APPLICATION WITHIN MIN SIMULATION TIME. WE SHOW THE MEAN SOLUTIONS PROCESSED, THE DEVIATION, AND THE AVERAGE PROCESSING SPEED PER THREAD (IN INSTRUCTIONS PER SECOND).

Exp/Resources | OS Mean CSP | OS Dev CSP | OS Avg. Spd | PaRLSched Mean CSP | PaRLSched Dev CSP | PaRLSched Avg. Spd | Diff. CSP (%) | Diff. Avg. Spd (%)
CSO (A.1)     |             |            |             |                    |                   |                    | −             | −
CSO (A.2)     |             |            |             |                    |                   |                    | +             | +
CSO (A.3)     |             |            |             |                    |                   |                    | +             | +
CSO (B.1)     |             |            |             |                    |                   |                    | +             | +
CSO (B.2)     |             |            |             |                    |                   |                    | −             | −
CSO (B.3)     |             |            |             |                    |                   |                    | −             | −
CSO (C.1)     |             |            |             |                    |                   |                    | +             | +
CSO (C.2)     |             |            |             |                    |                   |                    | +             | +
CSO (C.3)     |             |            |             |                    |                   |                    | +             | +
Average       |             |            |             |                    |                   |                    | +             | +

Fig. 5. Sample responses for Experiments of category (i.e., under time-varying CPU availability). The running average speed is measured in (10 · instructions/sec/thread).

the assigned NUMA node, while at the second level, decisions are taken with respect to the assigned CPU core (within the selected NUMA node). The proposed framework is flexible enough to accommodate any multi-objective criterion, while it is appropriately designed to handle noisy observations. We demonstrated the utility of the proposed framework in the maximization of the running average processing speed of the threads, and we evaluated its performance on four benchmark parallel applications. We have concluded that the
Fig. 6. Sample responses for Experiments of category (i.e., under time-varying CPU availability). The running average speed is measured in (10 · instructions/sec/thread).

PaRLSched scheduler can achieve better running speed in certain cases, especially under small availability of resources or large background load. These observations should be further reinforced with additional benchmark tests. In addition, we plan to identify and generalize the indicators that trigger these advantageous responses of the
PaRLSched scheduler and alsoto consider additional utility functions, such as register countof each thread.
REFERENCES

[1] G. C. Chasparis, M. Rossbory, V. Janjic, and K. Hammond, "Learning-based dynamic pinning of parallelized applications in many-core systems," in . Pavia, Italy: IEEE, Feb. 2019, pp. 1–8.
[2] M. Danelutto, "On skeletons and design patterns," in Proc. of Intl. ParCo 2001, ser. Parallel Computing: Advances and Current Issues, G. Joubert, A. Murli, F. Peters, and M. Vanneschi, Eds. Imperial College Press, 2001, pp. 425–432.
[3] M. Aldinucci, G. P. Pezzi, M. Drocco, C. Spampinato, and M. Torquati, "Parallel visual data restoration on multi-GPGPUs using stencil-reduce pattern," The International Journal of High Performance Computing Applications, vol. 29, no. 4, pp. 461–472, 2015.
[4] D. del Rio Astorga, M. F. Dolz, J. Fernández, and J. D. García, "A generic parallel pattern interface for stream and data processing," Concurrency and Computation: Practice and Experience, vol. 29, no. 24, Dec. 2017.
[5] V. Janjic, C. Brown, K. Mackenzie, K. Hammond, M. Danelutto, M. Aldinucci, and J. D. Garcia, "RPL: A domain-specific language for designing and implementing parallel C++ applications," in , Feb. 2016, pp. 288–295.
[6] G. C. Chasparis, M. Rossbory, and V. Janjic, Efficient Dynamic Pinning of Parallelized Applications by Reinforcement Learning with Applications, ser. Lecture Notes in Computer Science, F. F. Rivera, T. F. Pena, and J. C. Cabaleiro, Eds. Springer International Publishing, 2017, vol. 10417.
[7] G. C. Chasparis and M. Rossbory, "Efficient dynamic pinning of parallelized applications by distributed reinforcement learning," Int. J. Parallel Program., pp. 1–15, 2017.
[8] A. Podzimek, L. Bulej, L. Y. Chen, W. Binder, and P. Tuma, "Analyzing the impact of CPU pinning and partial CPU loads on performance and energy efficiency," in , May 2015, pp. 1–10.
[9] B. Goglin, "Managing the topology of heterogeneous cluster nodes with hardware locality (hwloc)," in International Conference on High Performance Computing and Simulation (HPCS), 2014, pp. 74–81.
[10] T. Klug, M. Ott, J. Weidendorfer, and C. Trinitis, "autopin – automated optimization of thread-to-core pinning on multicore systems," in Transactions on High-Performance Embedded Architectures and Compilers III, ser. Lecture Notes in Computer Science, P. Stenstrom, Ed. Springer Berlin Heidelberg, 2011, vol. 6590, pp. 219–235.
[11] F. Broquedis, N. Furmento, B. Goglin, P.-A. Wacrenier, and R. Namyst, "ForestGOMP: An efficient OpenMP environment for NUMA architectures," International Journal of Parallel Programming, vol. 38, pp. 418–439, 2010.
[12] S. Olivier, A. Porterfield, and K. Wheeler, "Scheduling task parallelism on multi-socket multicore systems," in ROSS '11, Tucson, Arizona, USA, 2011, pp. 49–56.
[13] M. Castro, L. F. W. Goes, C. P. Ribeiro, M. Cole, M. Cintra, and J.-F. Mehaut, "A machine learning-based approach for thread mapping on transactional memory applications," in , 2011, pp. 1–10.
[14] R. Subrata, A. Y. Zomaya, and B. Landfeldt, "A cooperative game framework for QoS guided job allocation schemes in grids," IEEE Transactions on Computers, vol. 57, no. 10, pp. 1413–1422, Oct. 2008.
[15] H. Tembine, E. Altman, R. El-Azouzi, and Y. Hayel, "Correlated evolutionary stable strategies in random medium access control," in Int. Conf. Game Theory for Networks, 2009, pp. 212–221.
[16] G. Wei, A. V. Vasilakos, Y. Zheng, and N. Xiong, "A game-theoretic method of fair resource allocation for cloud computing services," The Journal of Supercomputing, vol. 54, no. 2, pp. 252–269, Nov. 2010.
[17] G. C. Chasparis, A. Arapostathis, and J. S. Shamma, "Aspiration learning in coordination games," SIAM J. Control and Optim., vol. 51, no. 1, 2013.
[18] G. C. Chasparis, "Stochastic stability of perturbed learning automata in positive-utility games," IEEE Transactions on Automatic Control, vol. 64, no. 11, pp. 4454–4469, Nov. 2019.
[19] A. Fabrikant, A. D. Jaggard, and M. Schapira, "On the structure of weakly acyclic games," Theory Comput. Syst., vol. 53, no. 1, pp. 107–122, Apr. 2013.
[20] B. Vöcking, "Selfish load balancing," in Algorithmic Game Theory, N. Nisan, T. Roughgarden, E. Tardos, and V. V. Vazirani, Eds. Cambridge: Cambridge University Press, 2007, pp. 517–542.
[21] G. C. Chasparis, "Measurement-based efficient resource allocation with demand-side adjustments," Automatica, vol. 106, pp. 274–283, Aug. 2019.
[22] O. Hernández-Lerma and J. B. Lasserre, Markov Chains and Invariant Probabilities. Birkhäuser Verlag, 2003.
[23] G. Chasparis and J. Shamma, "Distributed dynamic reinforcement of efficient outcomes in multiagent coordination and network formation," Dynamic Games and Applications, vol. 2, no. 1, pp. 18–50, 2012.
[24] H. J. Kushner and G. G. Yin, Stochastic Approximation and Recursive Algorithms and Applications, 2nd ed. Springer-Verlag New York, Inc., 2003.
[25] D. Heath, R. Jarrow, and A. Morton, "Bond pricing and the term structure of interest rates: A new methodology for contingent claims valuation," Econometrica, vol. 60, no. 1, pp. 77–105, Jan. 1992.
[26] M. Dorigo and T. Stützle, Ant Colony Optimization. Scituate, MA, USA: Bradford Company, 2004.
[27] M. Aldinucci, S. Campa, M. Danelutto, P. Kilpatrick, and M. Torquati, "Pool evolution: A parallel pattern for evolutionary and symbolic computing," International Journal of Parallel Programming, vol. 44, no. 3, pp. 531–551, June 2016.
[28] G. C. Chasparis, M. Rossbory, and V. Haunschmid, "An evolutionary stochastic-local-search framework for one-dimensional cutting-stock problems," arXiv:1707.08776, 2017.
[29] M. Delorme, M. Iori, and S. Martello. (2018) A bin packing problem library. [Online]. Available: http://or.dei.unibo.it/library/bpplib
[30] P. J. Mucci, S. Browne, C. Deane, and G. Ho, "PAPI: A portable interface to hardware performance counters," in