Analytical Performance Modeling of NoCs under Priority Arbitration and Bursty Traffic

Abstract

Networks-on-Chip (NoCs) used in commercial many-core processors typically incorporate priority arbitration. Moreover, they experience bursty traffic due to application workloads. However, most state-of-the-art NoC analytical performance analysis techniques assume fair arbitration and simple traffic models. To address these limitations, we propose an analytical modeling technique for priority-aware NoCs under bursty traffic. Experimental evaluations with synthetic and bursty traffic show that the proposed approach has less than 10% modeling error with respect to a cycle-accurate NoC simulator.


Sumit K. Mandal, Raid Ayoub, Michael Kishinevsky, Mohammad M. Islam, Umit Y. Ogras
School of ECEE, Arizona State University; Intel Corporation, Hillsboro, OR


I. INTRODUCTION

Industrial many-core processors incorporate priority arbitration for the routers in NoC [1]. Moreover, these designs execute bursty traffic since real applications exhibit burstiness [2]. Accurate NoC performance models are required to perform design space exploration and accelerate full-system simulations [3, 4]. Most existing analysis techniques assume fair arbitration in routers, which does not hold for NoCs with priority arbitration used in manycore processors, such as high-end servers [5] and high performance computing (HPC) [1]. A recent technique targets priority-aware NoCs [6], but it assumes that the input traffic follows a geometric distribution. While this assumption simplifies analytical models, it fails to capture the bursty behavior of real applications [2]. Indeed, our evaluations show that the geometric distribution assumption leads to up to 60% error in latency estimation unless the bursty nature of applications is explicitly modeled. Therefore, there is a strong need for NoC performance analysis techniques that consider both priority arbitration and bursty traffic.

This work proposes a novel performance modeling technique for priority-aware NoCs that takes bursty traffic into account. It first models the input traffic as a generalized geometric (GGeo) discrete-time distribution that includes a parameter for burstiness. We achieve high scalability by employing the principle of maximum entropy (ME) to transform the given queuing network into a near equivalent set of individual queue nodes of multiple classes with revised characteristics (e.g., modifying the service process). Furthermore, our solution involves transformations to handle priority arbitration of the routers across a network of queues. Finally, we construct analytical models of the transformed queue nodes to obtain end-to-end latency. The proposed performance analysis technique is evaluated with SYSmark® 2014 SE [7], SPEC CPU® 2006 [8], and SPEC CPU® 2017 [9] benchmark suites, as well as synthetic traffic. The proposed technique has less than 10% modeling error with respect to an industrial cycle-accurate NoC simulator. The major contributions of this work are as follows:

• Accurate and scalable high-level performance modeling of priority-based NoCs considering burstiness,
• Dynamic approximation of realistic bursty traffic via GGeo distribution,
• Thorough evaluations on industrial priority-based NoCs with synthetic traffic and real applications.

This work was supported partially by Strategic CAD Labs, Intel Corporation, USA.
S. K. Mandal and U. Y. Ogras, School of Electrical, Computer and Energy Engineering, Arizona State University, Tempe, AZ, 85287; emails: {skmandal, umit}@asu.edu.
Raid Ayoub, Michael Kishinevsky and Mohammad M. Islam, Intel Corporation, 2111 NE 25th Ave., Hillsboro, OR 97124; emails: {raid.ayoub, michael.kishinevsky, [email protected]}@intel.com.

II. RELATED WORK

NoC analytical performance analysis techniques primarily target fast design space exploration and accelerating full-system simulations. Most of the existing techniques consider NoC routers with fair arbitration [4, 10], but this assumption does not hold for NoCs that employ priority arbitration [1, 5]. Several performance analysis techniques target priority-aware NoCs [3, 6]. The technique presented in [3] assumes that each class of traffic in the NoC occupies a different queue. This assumption is not practical since most industrial NoCs share queues between multiple traffic classes. An analytical model for industrial NoCs that estimates average end-to-end latency is proposed in [6]. However, this model assumes that the input traffic follows a geometric distribution, which is not applicable to workloads with bursty traffic.

Analytical modeling of priority-based queuing networks has also been studied outside of the realm of the on-chip interconnect [11, 12]. The analytical models constructed in [11] consider a queuing network in the continuous-time domain. This assumption is not valid for NoCs, as events happen in discrete clock cycles. In [12], performance analysis models are constructed in the discrete-time domain. Since the number of random variables required in this technique is equal to the number of classes (exponential in the number of routers) present in the NoC, this approach does not scale. In contrast, the analytical models presented in this paper use the discrete-time domain and scale to thousands of traffic classes.

III. BACKGROUND AND OVERVIEW

The goal of this work is to construct accurate performance models for industrial NoCs under priority arbitration and bursty traffic. We mainly target manycore processors used in servers, HPC, and high-end client CPUs [1, 5]. The proposed technique takes burstiness and injection rate of the traffic as input and then provides end-to-end latency of each traffic class.

Input traffic model assumptions:

Applications usually produce bursty NoC traffic with varying inter-arrival times [2, 4]. We approximate the input traffic using the GGeo discrete-time distribution model, which takes both burstiness and the discrete-time feature of NoCs into account [4, 13]. The GGeo model includes Geometric and null (no delay) branches, as shown in Figure 1. Selection between branches conforms to the Bernoulli trial, where the null (upper) and Geo (lower) branches are selected with probability $p_b$ and $1 - p_b$, respectively. The Geo branch leads to geometrically distributed inter-arrival time, while the null branch issues an additional flit in the current time slot, leading to a burst. Both the number of flits in a time slot and the inter-arrival rate depend on $p_b$ [13]. Hence, we use $p_b$ as a parameter of burstiness. The GGeo distribution has two important properties [13]. First, it is pseudo-memoryless, i.e., the remaining inter-arrival time is geometrically distributed. Second, it can be described by its first two moments ($\lambda$, $C_a$), where $C_a^2 = 2/(1 - p_b) - \lambda - 1$. We exploit these properties to construct analytical models.

Fig. 1. GGeo traffic model

IV. SYSTEMATIC GENERATION OF ANALYTICAL MODELS

In industrial NoCs, flits already in the network have higher priority than new injections to achieve predictable latency [1]. This leads to nontrivial timing dependencies between the multi-class flits in the network. Hence, we propose a systematic approach for accurate and scalable performance analysis. We note that the proposed technique can be extended to NoCs with fair arbitration if we assume that all classes have the same priority. However, we do not focus on non-priority NoCs since this domain has been studied in the past [10].
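Before turning to the decomposition, the GGeo input model of Section III can be made concrete with a short simulation. The sketch below is illustrative only and is not the authors' implementation: the function names are hypothetical, and it assumes that the null branch contributes an inter-arrival time of zero slots while the Geo branch draws from a geometric distribution on {1, 2, ...} whose parameter is chosen so that the mean rate equals $\lambda$.

```python
import random


def ggeo_scv(p_b, lam):
    """Squared coefficient of variation C_a^2 = 2/(1 - p_b) - lam - 1."""
    return 2.0 / (1.0 - p_b) - lam - 1.0


def sample_ggeo_interarrivals(p_b, lam, count, seed=0):
    """Draw `count` GGeo inter-arrival times (in slots).

    Assumption of this sketch: the null branch yields a gap of 0 slots and
    the Geo branch a geometric gap with success probability
    sigma = lam * (1 - p_b), so that the mean gap is 1 / lam.
    """
    sigma = lam * (1.0 - p_b)
    if not (0.0 <= p_b < 1.0 and 0.0 < sigma <= 1.0):
        raise ValueError("need 0 <= p_b < 1 and 0 < lam * (1 - p_b) <= 1")
    rng = random.Random(seed)
    gaps = []
    for _ in range(count):
        if rng.random() < p_b:
            gaps.append(0)          # null branch: extra flit in the same slot
        else:
            gap = 1                 # Geo branch: count slots until success
            while rng.random() >= sigma:
                gap += 1
            gaps.append(gap)
    return gaps


def empirical_rate_and_scv(gaps):
    """Return (arrival rate, squared coefficient of variation) of the gaps."""
    mean = sum(gaps) / len(gaps)
    var = sum((g - mean) ** 2 for g in gaps) / len(gaps)
    return 1.0 / mean, var / mean ** 2
```

With $p_b = 0$ the formula reduces to $C_a^2 = 1 - \lambda$, the value for a plain geometric stream, which is a quick sanity check on the two-moment description.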

A. Maximum entropy for queuing networks

We apply the principle of ME to queuing systems to find the probability distribution of desired metrics (e.g., queue occupancy) [13]. According to this principle, the selected distribution should be the least biased among all feasible distributions satisfying the prior information in the form of mean values. The optimal distribution is found by maximizing the corresponding entropy function: we formulate a nonlinear programming problem and solve it analytically via the Lagrange method of undetermined multipliers as discussed next.
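As a toy illustration of the principle (not the multi-class formulation used in this paper), consider a single occupancy variable on {0, ..., n_max} whose only prior information is its mean. The Lagrange conditions then give a solution of the form p(n) proportional to x^n, and the multiplier can be found numerically. The function names and the bisection approach below are assumptions of this sketch.

```python
import math


def entropy(p):
    """Shannon entropy (natural log) of a discrete distribution."""
    return -sum(q * math.log(q) for q in p if q > 0.0)


def max_entropy_given_mean(mean, n_max, iterations=100):
    """ME distribution on {0, ..., n_max} under a single mean constraint.

    The Lagrange conditions give p(n) proportional to x**n. We find x by
    bisection so that the mean matches. The sketch is restricted to
    mean < n_max / 2, which keeps x inside (0, 1).
    """
    if not 0.0 < mean < n_max / 2.0:
        raise ValueError("this sketch requires 0 < mean < n_max / 2")

    def distribution(x):
        weights = [x ** n for n in range(n_max + 1)]
        total = sum(weights)
        return [w / total for w in weights]

    def mean_of(p):
        return sum(n * q for n, q in enumerate(p))

    lo, hi = 0.0, 1.0
    for _ in range(iterations):
        mid = 0.5 * (lo + hi)
        if mean_of(distribution(mid)) < mean:
            lo = mid                # mean grows with x, so move right
        else:
            hi = mid
    return distribution(0.5 * (lo + hi))
```

Any other distribution with the same mean, for example a uniform distribution on {0, ..., 4} when the mean is 2, has lower entropy than the result, which is what "least biased" means here.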

B. Decomposition of basic priority queuing

In a non-preemptive priority queuing system, the router does not preempt a lower priority flit that is already in service when a higher priority flit arrives. An example system with two queues and a shared server is shown in Figure 2(a). There are two flows arriving at a priority-based arbiter and a shared server. The shaded circle corresponds to high priority input (class 1) to the arbiter. We denote this structure as basic priority queuing. Our goal is to decompose this system into individual queue-nodes with modified servers, as shown in Figure 2(b). The combination of a queue and its corresponding server is referred to as a queue-node. The effective expected service time of class 2 flits, $\widehat{T}_2$, is larger than the original mean service time $T_2$, since class 2 flits wait for the higher priority (class 1) flits in the original system. We calculate the effective service time in the transformed network using Little's Law as:

$$\widehat{T}_m = \frac{1 - p_m(0)}{\lambda_m} \tag{1}$$

where $p_m(0)$ is the marginal probability of having no flits of class $m$ in the queue-node, as listed in Table I.

TABLE I
SUMMARY OF THE NOTATIONS USED IN THIS PAPER

$\lambda$, $\lambda_m$    Mean arrival rate of total traffic and class $m$
$p_b$    Probability of burstiness
$T_m$, $\widehat{T}_m$    Original and modified mean service time of class $m$ flits
$R$, $R_{mk}$    Total residual time and residual time of class $m$ while class $k$ is served
$\rho_m$    Mean server utilization of class $m$ flits ($= \lambda_m T_m$)
$C_a$, $C_{a_m}$    Coeff. of variation of interarrival time of total traffic and class $m$ flits
$C_{s_m}$, $\widehat{C}_{s_m}$    Coeff. of variation of original and modified service time of class $m$ flits
$C_d$, $C_{d_m}$    Coeff. of variation of interdeparture time of total traffic and class $m$ flits
$W_m$    Mean waiting time of class $m$ flits
$\bar{n}_m$, $n_m$    Mean and current occupancy of class $m$ flits in a queue-node
$\beta_m$    Mean number of bursty arrivals of class $m$
$\bar{n}_{mk}$    Mean queue-node occupancy of class $m$ with serving class $k$
$\mathbf{n}$    State vector, $\mathbf{n} = (n_1, n_2, ..., n_M)$ of priority queue-nodes
$p(\mathbf{n})$    Probability that a queue-node is in state $\mathbf{n}$
$p_m(0)$    Marginal probability of zero flits of class $m$ in a queue-node
$\alpha_m(\mathbf{n})$    $\alpha_m(\mathbf{n}) = 1$ if class $m$ in service and 0 otherwise
$M$    Number of classes that share same server

Fig. 2. Decomposition of a basic priority queuing
Computing $p_m(0)$ using ME: We find $p_m(0)$ using the ME principle by maximizing the entropy function $H(p(\mathbf{n}))$ given in (2) subject to the constraints listed in (3):

$$\underset{p}{\text{maximize}} \quad H(p(\mathbf{n})) = -\sum_{\mathbf{n}} p(\mathbf{n}) \log(p(\mathbf{n})) \tag{2}$$

subject to

$$\begin{aligned}
&\sum_{\mathbf{n}=\mathbf{0}}^{\boldsymbol{\infty}} p(\mathbf{n}) = 1, \\
&\sum_{\mathbf{n}=\mathbf{0} \text{ except } n_m = 1}^{\boldsymbol{\infty}} \alpha_m(\mathbf{n})\, p(\mathbf{n}) = \rho_m, \quad m = 1, \dots, M \\
&\sum_{\mathbf{n}=\mathbf{0} \text{ except } n_m = n_k = 1}^{\boldsymbol{\infty}} n_m\, \alpha_k(\mathbf{n})\, p(\mathbf{n}) = \bar{n}_{mk}, \quad m, k = 1, \dots, M
\end{aligned} \tag{3}$$

The notation $\boldsymbol{\infty}$ means a state vector $\mathbf{n}$ with all elements set to $\infty$, and ($\mathbf{n} = \mathbf{0}$ except $n_m = 1$) refers to a vector $\mathbf{n}$ with the $m$th element set to 1 and other elements set to 0. The constraints in (3) comprise three types: normalization, mean server utilization and mean occupancy. We introduced an extended set of mean occupancy constraints compared to [13] to provide further information about the underlying system. When a flit of a certain class arrives at the system, it may find the server busy with its own class or other classes since the server is a shared resource, as shown in Figure 2(a). Therefore, the mean occupancy of each class can be partitioned according to the contribution of each class occupying the server. We exploit this inherent partitioning to generate $M$ additional occupancy constraints. The occupancy related constraints depend on three components, $\beta_m$, $R_{mk}$ and $W_m$ (defined in Table I) derived in [6, 13].

We solve the nonlinear programming problem in (2, 3) to find $p(\mathbf{n})$, which we use to determine the probability of having zero flits of class $m$, $p_m(0)$. The convergence of this solution is guaranteed when the queuing system is in a stable region. We derived the general expression for $M$ queues in a priority structure with a single class per queue as:

$$p_m(0) = 1 - \rho_m - \sum_{k=1,\, k \neq m}^{M} \frac{\rho_k\, \bar{n}_{mk}}{\rho_k + \bar{n}_{mk}} \tag{4}$$

Fig. 3. Decomposition of flow contention at low priority
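Equations (1) and (4) translate into a few lines of code. The following is a minimal sketch under stated assumptions: the function names are hypothetical, classes are indexed from 0, and the partial occupancies $\bar{n}_{mk}$ are taken as given inputs (in the paper they follow from $\beta_m$, $R_{mk}$ and $W_m$ as derived in [6, 13], which is not reproduced here). The numbers in the usage note are made up for illustration.

```python
def prob_zero_flits(m, rho, n_bar):
    """Eq. (4): marginal probability of zero class-m flits in a queue-node.

    rho[k]      -- mean server utilization of class k
    n_bar[m][k] -- mean occupancy of class m while class k is being served
                   (assumed to be supplied by the caller)
    """
    contention = 0.0
    for k in range(len(rho)):
        if k == m:
            continue
        denom = rho[k] + n_bar[m][k]
        if denom > 0.0:             # skip classes that carry no load
            contention += rho[k] * n_bar[m][k] / denom
    return 1.0 - rho[m] - contention


def effective_service_time(m, lam, rho, n_bar):
    """Eq. (1): modified mean service time of class m via Little's Law."""
    return (1.0 - prob_zero_flits(m, rho, n_bar)) / lam[m]
```

For example, with two classes, `lam = [0.2, 0.1]`, unit service times (so `rho = [0.2, 0.1]`) and hypothetical occupancies `n_bar = [[0.0, 0.0], [0.05, 0.0]]`, the high priority class keeps its original service time of 1.0, while the low priority class sees an inflated effective service time of about 1.4.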

Plugging the expression of $p_m(0)$ from (4) into (1), we obtain the first moment of the service process.

Computing second moment of the service time: Since we also need the second moment to characterize the GGeo traffic, we calculate the modified squared coefficient of variation of the service time for class $m$ ($\widehat{C}_{s_m}^2$). We utilize the queuing occupancy formulation of GGeo/G/1 [13] and the modified server utilization $\widehat{\rho}_m = \lambda_m \widehat{T}_m$ to obtain the following expression for $\widehat{C}_{s_m}^2$:

$$\widehat{C}_{s_m}^2 = \frac{(1 - \widehat{\rho}_m)(2\bar{n}_m - \widehat{\rho}_m) - \widehat{\rho}_m C_{a_m}^2}{\widehat{\rho}_m^2} \tag{5}$$

C. Decomposition of priority queuing with partial contention

Priority-aware NoCs involve complex queuing structures that cannot be modeled accurately using only the models for basic priority queuing. The complexity is primarily attributed to the partial priority contention across queues. We identified two basic structures with partial priority dependency that constitute the building blocks of practical priority-aware NoCs.

The first basic structure is shown in Figure 3(a), where high priority class 1 is in contention with a portion of the traffic in $q_2$ (class 2) through server $S_A$. Class 2 and 3 flits have the same priority and share $q_2$ before entering the traffic splitter that assigns class 2 and 3 flits to server $S_A$ and $S_B$ respectively, following a notation similar to the one adopted in [14]. We denote this structure as contention at low priority. To decompose $q_1$ and $q_2$, we need to calculate the first two moments of the modified service process of class 1 and 2. The decomposed structure is shown in Figure 3(b). First, we set $\lambda_3$ to zero, which leads to a basic priority structure. Then, we apply the decomposition method discussed in Section IV-B to obtain ($\widehat{T}_1$, $\widehat{C}_{s_1}$) and ($\widehat{T}_2$, $\widehat{C}_{s_2}$). We derived the mean queuing time ($W_m$) of individual classes of $q_2$ in the decomposed form as:

$$W_m = \frac{R + \sum_{k=1}^{M} \widehat{\rho}_k \widehat{T}_k \beta_k}{1 - \sum_{k=1}^{M} \widehat{\rho}_k} + \widehat{T}_m(\beta_m + 1) - T_m \tag{6}$$

where $R = \sum_{k=1}^{M} \frac{1}{2} \widehat{\rho}_k \left(\widehat{T}_k - 1 + \widehat{T}_k \widehat{C}_{s_k}^2\right)$ and $\beta_m = \frac{1}{2}\left(C_{a_m}^2 + \lambda_m - 1\right)$.

The other basic structure, contention at high priority, is shown in Figure 4(a). In this scenario, only a fraction of the classes in $q_1$ (class 2) has higher priority than class 3, since class 1 in $q_1$ is served by $S_A$. Determining $\widehat{T}_3$ is challenging due to class 1, which influences the inter-departure time of class 2. To incorporate this effect, we calculate the squared coefficient of variation of inter-departure time, $C_{d_2}^2$, of class 2 using the split process formulation of GGeo streams given in [13]. We introduce a virtual queue, $q_v$, and feed it with the flits of class 2. Therefore, $q_v$ and $q_2$ form a basic priority structure, as shown in Figure 4(b). Subsequently, we apply the decomposition method described in Section IV-B to calculate ($\widehat{T}_2$, $\widehat{C}_{s_2}$) as well as ($\widehat{T}_3$, $\widehat{C}_{s_3}$). The decomposed structure is shown in Figure 4(c).

D. Iterative decomposition algorithm
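The per-class quantities that the algorithm relies on, (5) and (6), can be sketched as follows. This is a hedged illustration and not the authors' C++ code: the function names are hypothetical, and the sketch assumes that the denominator of (6) is $1 - \sum_k \widehat{\rho}_k$, that the residual time takes the standard discrete-time form $\frac{1}{2}\widehat{\rho}_k(\widehat{T}_k - 1 + \widehat{T}_k \widehat{C}_{s_k}^2)$, and that $\beta_m = \frac{1}{2}(C_{a_m}^2 + \lambda_m - 1)$.

```python
def modified_service_scv(rho_hat, n_mean, ca2):
    """Eq. (5): modified squared coeff. of variation of the service time."""
    return ((1.0 - rho_hat) * (2.0 * n_mean - rho_hat)
            - rho_hat * ca2) / rho_hat ** 2


def mean_bursty_arrivals(ca2, lam):
    """beta_m = (C_a^2 + lambda - 1) / 2 (assumed form, see lead-in)."""
    return 0.5 * (ca2 + lam - 1.0)


def waiting_times(lam, T, T_hat, cs2_hat, ca2):
    """Eq. (6): mean waiting time of every class sharing one queue.

    All arguments are per-class lists of equal length:
    lam (arrival rates), T (original service times), T_hat and cs2_hat
    (modified service moments), ca2 (squared CoV of inter-arrival times).
    """
    rho_hat = [l * t for l, t in zip(lam, T_hat)]
    load = sum(rho_hat)
    if load >= 1.0:
        raise ValueError("queue is unstable: total modified utilization >= 1")
    beta = [mean_bursty_arrivals(c, l) for c, l in zip(ca2, lam)]
    # Residual service time seen by an arriving flit (discrete-time form).
    residual = sum(0.5 * r * (t - 1.0 + t * c)
                   for r, t, c in zip(rho_hat, T_hat, cs2_hat))
    backlog = sum(r * t * b for r, t, b in zip(rho_hat, T_hat, beta))
    common = (residual + backlog) / (1.0 - load)
    return [common + T_hat[m] * (beta[m] + 1.0) - T[m]
            for m in range(len(lam))]
```

Two sanity checks follow from these forms. For a single class with geometric arrivals ($p_b = 0$) and a deterministic one-cycle service time, the waiting time is zero, as expected when at most one flit arrives per cycle. For $p_b = 0.5$ and $\lambda = 0.2$, $\beta$ equals $p_b/(1 - p_b) = 1$ and the waiting time becomes 1.25 cycles.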

Algorithm 1 shows a step-by-step procedure to obtain the analytical model using our approach described in Section IV-C. The inputs to the algorithm are NoC topology, routing algorithm and server process. The analytical models presented for the canonical queuing system are independent of the NoC topology. Therefore, the analytical models are valid for any NoC, including irregular topologies. First, we identify priority dependencies between different classes in the network. Next, we apply decomposition for contention at high and low priority, as shown in lines 7–8 of Algorithm 1. Subsequently, we calculate the modified service process ($\widehat{T}$, $\widehat{C}_s$) using (1, 4) and (5). Then, we compute the waiting time per class following (6). Finally, we obtain the average waiting time in each queue ($W_q$), as shown in line 12.

Fig. 4. Decomposition of flow contention at high priority

V. EXPERIMENTAL EVALUATION

The proposed technique is implemented in C++ to facilitate integration with system-level simulators. Analysis takes 2.7 ms for a 6 × $O(n)$, where $n$ is the number of nodes. In all experiments, a warm-up period of 200K cycles is considered. The accuracy of the models is evaluated against an industrial cycle-accurate simulator [15] under both real applications and synthetic traffic that models uniformly distributed core to last-level cache traffic with 100% hit rate.

A. Evaluation on Architectures with Ring NoCs

This section analyzes the accuracy of the proposed analytical models using uniform traffic on a priority-based 6 × × which do not consider burstiness [6] significantly underestimate the latency by 33% on average (highlighted with the shaded row). In contrast, the work without the proposed decomposition technique [3] leads to over 100% overestimation even at low traffic loads (highlighted with text in italics). In this case, GGeo models cannot handle partial contention, since they assume all packets in the high-priority queue have higher priority than each packet in the low priority queue. These results demonstrate that the proposed priority-aware NoC performance models have significantly higher accuracy than the existing alternatives.

TABLE II
COMPARISONS AGAINST EXISTING ALTERNATIVES (REFERENCE [3] AND REFERENCE [6]). H DENOTES ERRORS OVER

Topology   6 × × Ring 4 × ×
$p_b$
$\lambda$
Err (%)
Prop.      0.2 5.6 12 | 0.8 0.6 12 | 0.2 4.3 14 | 0.5 3.7 7.3 | 0.9 5.1 12 | 0.5 3.1 12 | 2.3 5.0 11 | 2.9 7.5 13 | 2.0 9.1 12 | 4.7 0.6 11 | 4.3 8.2 10 | 6.1 7.9 12
Ref [3]    17 H H | 30 H H | 54 H H | 66 H H | H H H | H H H | 30 H H | 10 H H | 12 H H | 28 H H | 54 H H | 78 H H
Ref [6]    8.5 12 18 | 20 30 55 | 36 47 79 | 7.5 8.8 11 | 18 24 39 | 33 42 85 | 10 21 40 | 21 38 82 | 37 56 88 | 7.2 13 45 | 14 34 64 | 28 48 76

Algorithm 1: Iterative Decomposition Algorithm
 1  Input: NoC topology, routing algorithm, server process, ($\lambda$) and ($p_b$) for each class as parameters
 2  Output: Average waiting time for each queue ($W_q$)
 3  $N$ = number of queues in the network
 4  $S_q$ = set of classes in queue $q$
 5  for $q = 1 : N$ do
 6      for $m = 1 : |S_q|$ do
 7          Apply decomp. for contention at high priority (if found)
 8          Apply decomp. for contention at low priority (if found)
 9          Compute $\widehat{T}$, $\widehat{C}_s$ using (1, 4) and (5)
10          Compute queuing time ($W_{q,m}$) using (6)
11      end
12      $W_q = \frac{\sum_{m=1}^{|S_q|} \lambda_{q,m} W_{q,m}}{\sum_{m=1}^{|S_q|} \lambda_{q,m}}$
13  end
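For reference, the last step of Algorithm 1 (line 12) is a rate-weighted average of the per-class waiting times in each queue. The sketch below shows only that aggregation step; the dictionary layout, the function name, and the numbers in the usage note are hypothetical, and the per-class waiting times are assumed to have been computed beforehand with (6).

```python
def average_queue_waiting_times(per_queue):
    """Line 12 of Algorithm 1: W_q = sum(lam * W) / sum(lam) for each queue.

    per_queue maps a queue name to a list of (lam_qm, W_qm) pairs, one pair
    per traffic class stored in that queue (hypothetical data layout).
    """
    averages = {}
    for queue, classes in per_queue.items():
        total_rate = sum(lam for lam, _ in classes)
        if total_rate <= 0.0:
            raise ValueError(f"queue {queue!r} carries no traffic")
        weighted = sum(lam * w for lam, w in classes)
        averages[queue] = weighted / total_rate
    return averages
```

For instance, a queue holding one class with rate 0.2 and waiting time 1.0 and another with rate 0.1 and waiting time 4.0 has an average waiting time of (0.2 + 0.4) / 0.3 = 2.0 cycles.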

B. Evaluation on Architectures with Mesh NoCs

Table II compares the analytical model and simulation results for a priority-based 4 × × × ×

C. Evaluation with Real Applications

This section validates the proposed analytical models using SYSmark ® ® ® $\lambda$) and $p_b$.

Computing $p_b$: For each source, we feed traffic arrivals with timestamps over a 200K clock cycle window into a virtual queue with the same service rate as the NoC to determine the queue occupancy. At the end of the window, we compute the average occupancy. Then, we employ the model described in [13] to find the occupancy and then $p_b$ of each class.

The proposed analytical models are used to estimate the latency using the injection rate and burst parameters, as well as the NoC architecture and routing algorithm. The applications show burstiness in the range of 0.2 – 0.5. As shown in Table III, the proposed technique has on average 2% and 4% error compared to cycle-accurate simulations for 6 × ×

Fig. 5. Comparison of a proposed analytical model with cycle-accurate simulation for 8 × × $p_b$ = 0. and (b) $p_b$ = 0. .

TABLE III
MODELING ERROR (%) WITH REAL APPLICATIONS

           xalan-cbmk  mcf    gcc   bwaves  GemsFDTD  omnet-pp  perl-bench  SYSmark14se
Prop       2.17        4.97   0.92  0.15    0.38      5.10      3.63        0.73
Ref [3]    14.62       11.99  7.69  12.29   5.18      13.64     11.46       7.25
6 × ×

VI. CONCLUSION

We presented analytical models for priority-aware NoCs under bursty traffic. We modeled bursty traffic as a generalized geometric distribution and applied the maximum entropy method to construct analytical models. Experimental evaluations show that the proposed technique has less than 10% modeling error with respect to cycle-accurate NoC simulation for real applications.

APPENDIX A

Usage of the proposed analytical models:

In this work, we aim to replace the cycle-accurate NoC simulators with analytical performance models. The full-system simulation environment keeps track of the traffic injected from processing cores (e.g., CPU, GPU, caches, memory, etc.) to the NoC, as shown in Figure 6. The proposed technique obtains the traffic information of processing cores over a time window, which is in the order of 100-200K cycles in our experiments. The duration can be decreased if the workload characteristics change considerably within a window or increased if the workload is steady. Our simulator estimates the burstiness of the input and calculates the injection rate of each traffic class using this information (first two steps in Figure 6). Then, it applies the proposed analytical models to obtain end-to-end latency of each traffic class. Whenever a processing core issues a new transaction, the communication latency is computed using these models instead of cycle-accurate simulations. These steps are repeated for each time window.

APPENDIX B

Generalization of the proposed analytical models:

We incorporate the Y-X routing algorithm in the experimental evaluations presented, following the actual reference commercial hardware design [1]. However, we note that the proposed approach is independent of the routing algorithm. In fact, the routing algorithm is one of the inputs to the proposed Iterative Decomposition Algorithm (Algorithm 1).

The analytical models are valid for any type of NoC including irregular topologies. The analytical models presented for the canonical queuing system are independent of the NoC topology. The canonical model constitutes the end-to-end latency model for a given NoC topology. In fact, NoC topology is an input to the algorithm which computes the end-to-end latency (Algorithm 1). Since this work targets general purpose NoCs used in manycore processors, we evaluate our proposed model only with Mesh and Ring NoC used in Intel Xeon server [1], Xeon Phi [17], and quad-core i7 (with integrated graphics) [18] processors.

Fig. 6. An overview of the proposed approach. The full-system simulation environment passes NoC transactions from the processing cores (CPU cores, GPU cores, display, memory, and I/O controllers) to the proposed technique, which performs four steps: 1. Estimate burstiness in the input; 2. Find the injection rate; 3. Apply proposed analytical models; 4. Return communication latencies.

APPENDIX C

Results with real application executed on 6 ×

Table IV shows the probability of burstiness of different applications used in our experimental evaluations. The levels of burstiness exhibited by these applications are between 0.18 and 0.53, re-emphasizing that the chosen levels of burstiness for evaluation with synthetic traffic in Section V-C are representative of real applications.

Table V shows the modeling error with respect to simulation for 6 ×

TABLE IV
PROBABILITY OF BURSTINESS ($p_b$) FOR DIFFERENT APPLICATIONS.

Apps     xalan-cbmk  mcf  gcc  bwaves  Gems-FDTD  omnet-pp  perl-bench  SYSmark-14se
$p_b$

TABLE V
MODELING ERROR (%) WITH REAL APPLICATIONS

           xalan-cbmk  mcf    gcc    bwaves  GemsFDTD  omnet-pp  perl-bench  SYSmark14se
Prop       5.38        5.48   4.57   6.94    0.31      6.58      9.91        1.20
Ref [3]    24.06       13.20  11.80  19.89   9.79      10.94     13.53       7.74
6 ×

REFERENCES

[1] James Jeffers et al. Intel Xeon Phi Processor High Performance Programming: Knights Landing Edition. Morgan Kaufmann, 2016.
[2] Paul Bogdan et al. Workload Characterization and Its Impact on Multicore Platform Design. In Proc. of the Intl. Conf. on Hardware/Software Codesign and System Synthesis, pages 231–240, 2010.
[3] Abbas Eslami Kiasari et al. An Analytical Latency Model for Networks-on-Chip. IEEE TVLSI, 21(1):113–123, 2013.
[4] Zhi-Liang Qian et al. A Support Vector Regression (SVR)-based Latency Model for Network-on-Chip (NoC) Architectures. IEEE Trans. on Computer-Aided Design of Integrated Circuits and Systems, 35(3):471–484, 2015.
[5] Simon M Tam et al. SkyLake-SP: A 14nm 28-Core Xeon® Processor. In , pages 34–36, 2018.
[6] Sumit K Mandal et al. Analytical Performance Models for NoCs with Multiple Priority Traffic Classes. ACM TECS, 18(5s), 2019.
[7] Business Applications Performance Corporation (BAPCo). Benchmark, sysmark2014. http://bapco.com/products/sysmark-2014, accessed 27 May 2020.
[8] John L Henning. SPEC CPU2006 Benchmark Descriptions. ACM SIGARCH Computer Architecture News, 34(4):1–17, 2006.
[9] James Bucek, Klaus-Dieter Lange, and Jóakim v. Kistowski. SPEC CPU2017: Next-Generation Compute Benchmark. In Companion of the 2018 ACM/SPEC International Conference on Performance Engineering, pages 41–42, 2018.
[10] Umit Y Ogras and Radu Marculescu. Modeling, Analysis and Optimization of Network-on-Chip Communication Architectures, volume 184. Springer Science & Business Media, 2013.
[11] Gunter Bolch et al. Queueing Networks and Markov Chains: Modeling and Performance Evaluation with Computer Science Applications. John Wiley & Sons, 2006.
[12] Joris Walraevens. Discrete-time Queueing Models with Priorities. PhD thesis, Ghent University, 2004.
[13] Demetres D Kouvatsos. Entropy Maximisation and Queuing Network Models. Annals of Operations Research, 48(1):63–126, 1994.
[14] Alexander Gotmanov et al. Verifying Deadlock-Freedom of Communication Fabrics. In Intl. Workshop on Verification, Model Checking, and Abstract Interpretation, pages 214–231. Springer, 2011.
[15] Umit Y Ogras et al. Energy-Guided Exploration of On-Chip Network Design for Exa-Scale Computing. In Proc. of Intl. Workshop on System Level Interconnect Prediction, pages 24–31, 2012.
[16] Nathan Binkert et al. The Gem5 Simulator. SIGARCH Comp. Arch. News, May. 2011.
[17] Avinash Sodani, Roger Gramunt, Jesus Corbal, Ho-Seop Kim, Krishna Vinod, Sundaram Chinthamani, Steven Hutsell, Rajat Agarwal, and Yen-Chen Liu. Knights Landing: Second-generation Intel Xeon Phi product. IEEE Micro, 36(2):34–46, 2016.
[18] James Charles, Preet Jassi, Narayan S Ananth, Abbas Sadat, and Alexandra Fedorova. Evaluation of the Intel® Core i7 Turbo Boost feature. In 2009 IEEE International Symposium on Workload Characterization (IISWC)