Analytical Performance Modeling of NoCs under Priority Arbitration and Bursty Traffic
Sumit K. Mandal, Raid Ayoub, Michael Kishinevsky, Mohammad M. Islam, Umit Y. Ogras
School of ECEE, Arizona State University; Intel Corporation, Hillsboro, OR
Abstract: Networks-on-Chip (NoCs) used in commercial many-core processors typically incorporate priority arbitration. Moreover, they experience bursty traffic due to application workloads. However, most state-of-the-art NoC analytical performance analysis techniques assume fair arbitration and simple traffic models. To address these limitations, we propose an analytical modeling technique for priority-aware NoCs under bursty traffic. Experimental evaluations with synthetic and bursty traffic show that the proposed approach has less than 10% modeling error with respect to a cycle-accurate NoC simulator.
I. INTRODUCTION
Industrial many-core processors incorporate priority arbitration for the routers in the NoC [1]. Moreover, these designs execute bursty traffic since real applications exhibit burstiness [2]. Accurate NoC performance models are required to perform design space exploration and accelerate full-system simulations [3, 4]. Most existing analysis techniques assume fair arbitration in routers, which does not hold for NoCs with priority arbitration used in manycore processors, such as high-end servers [5] and high performance computing (HPC) [1]. A recent technique targets priority-aware NoCs [6], but it assumes that the input traffic follows a geometric distribution. While this assumption simplifies analytical models, it fails to capture the bursty behavior of real applications [2]. Indeed, our evaluations show that the geometric distribution assumption leads to up to 60% error in latency estimation unless the bursty nature of applications is explicitly modeled. Therefore, there is a strong need for NoC performance analysis techniques that consider both priority arbitration and bursty traffic.

This work proposes a novel performance modeling technique for priority-aware NoCs that takes bursty traffic into account. It first models the input traffic as a generalized geometric (GGeo) discrete-time distribution that includes a parameter for burstiness. We achieve high scalability by employing the principle of maximum entropy (ME) to transform the given queuing network into a near-equivalent set of individual multi-class queue-nodes with revised characteristics (e.g., a modified service process). Furthermore, our solution involves transformations to handle priority arbitration of the routers across a network of queues. Finally, we construct analytical models of the transformed queue-nodes to obtain end-to-end latency. The proposed performance analysis technique is evaluated with the SYSmark® 2014 SE [7], SPEC CPU® 2006 [8], and SPEC CPU® 2017 [9] benchmark suites, as well as synthetic traffic. The proposed technique has less than 10% modeling error with respect to an industrial cycle-accurate NoC simulator. The major contributions of this work are as follows:
• Accurate and scalable high-level performance modeling of priority-based NoCs considering burstiness,
• Dynamic approximation of realistic bursty traffic via the GGeo distribution,
• Thorough evaluations on industrial priority-based NoCs with synthetic traffic and real applications.

This work was supported partially by Strategic CAD Labs, Intel Corporation, USA. S. K. Mandal and U. Y. Ogras, School of Electrical, Computer and Energy Engineering, Arizona State University, Tempe, AZ, 85287; emails: {skmandal, umit}@asu.edu. Raid Ayoub, Michael Kishinevsky and Mohammad M. Islam, Intel Corporation, 2111 NE 25th Ave., Hillsboro, OR 97124; emails: {raid.ayoub, michael.kishinevsky, [email protected]}@intel.com.

II. RELATED WORK
NoC analytical performance analysis techniques primarily target fast design space exploration and accelerating full-system simulations. Most of the existing techniques consider NoC routers with fair arbitration [4, 10], but this assumption does not hold for NoCs that employ priority arbitration [1, 5]. Several performance analysis techniques target priority-aware NoCs [3, 6]. The technique presented in [3] assumes that each class of traffic in the NoC occupies a different queue. This assumption is not practical since most industrial NoCs share queues between multiple traffic classes. An analytical model for industrial NoCs that estimates the average end-to-end latency is proposed in [6]. However, this model assumes that the input traffic follows a geometric distribution, which is not applicable to workloads with bursty traffic.

Analytical modeling of priority-based queuing networks has also been studied outside the realm of on-chip interconnects [11, 12]. The analytical models constructed in [11] consider a queuing network in the continuous-time domain. This assumption is not valid for NoCs, as events happen in discrete clock cycles. In [12], performance analysis models are constructed in the discrete-time domain. Since the number of random variables required in this technique is equal to the number of classes (exponential in the number of routers) present in the NoC, this approach does not scale. In contrast, the analytical models presented in this paper use the discrete-time domain and scale to thousands of traffic classes.

III. BACKGROUND AND OVERVIEW
The goal of this work is to construct accurate performance models for industrial NoCs under priority arbitration and bursty traffic. We mainly target manycore processors used in servers, HPC, and high-end client CPUs [1, 5]. The proposed technique takes the burstiness and injection rate of the traffic as input and provides the end-to-end latency of each traffic class.
Input traffic model assumptions: Applications usually produce bursty NoC traffic with varying inter-arrival times [2, 4]. We approximate the input traffic using the GGeo discrete-time distribution model, which takes both burstiness and the discrete-time feature of NoCs into account [4, 13]. The GGeo model includes Geometric (Geo) and null (no delay) branches, as shown in Figure 1. Selection between the branches is a Bernoulli trial, where the null (upper) and Geo (lower) branches are selected with probability $p_b$ and $1 - p_b$, respectively. The Geo branch leads to a geometrically distributed inter-arrival time, while the null branch issues an additional flit in the current time slot, leading to a burst. Both the number of flits in a time slot and the inter-arrival rate depend on $p_b$ [13]. Hence, we use $p_b$ as the burstiness parameter. The GGeo distribution has two important properties [13]. First, it is pseudo-memoryless, i.e., the remaining inter-arrival time is geometrically distributed. Second, it can be described by its first two moments $(\lambda, C_a^2)$, where $C_a^2 = 2/(1 - p_b) - \lambda - 1$. We exploit these properties to construct analytical models.

Fig. 1. GGeo traffic model

IV. SYSTEMATIC GENERATION OF ANALYTICAL MODELS
In industrial NoCs, flits already in the network have higher priority than new injections to achieve predictable latency [1]. This leads to nontrivial timing dependencies between the multi-class flits in the network. Hence, we propose a systematic approach for accurate and scalable performance analysis. We note that the proposed technique can be extended to NoCs with fair arbitration if we assume that all classes have the same priority. However, we do not focus on non-priority NoCs since this domain has been studied in the past [10].
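Every model in this section takes the GGeo pair $(\lambda, C_a^2)$ of Section III as input, so it can help to see that traffic model in executable form. The following Python sketch samples GGeo inter-arrival times and converts between $p_b$ and $C_a^2$. The function names are hypothetical, and the Geo-branch success probability $\sigma = \lambda(1 - p_b)$ is our assumption, chosen so that the mean inter-arrival time equals $1/\lambda$; this is a minimal illustration, not the C++ implementation evaluated in Section V.

```python
import random


def ca2_from_pb(p_b, lam):
    """Squared coefficient of variation of the GGeo inter-arrival time."""
    return 2.0 / (1.0 - p_b) - lam - 1.0


def pb_from_ca2(ca2, lam):
    """Inverse of ca2_from_pb: recover the burst probability p_b."""
    return 1.0 - 2.0 / (ca2 + lam + 1.0)


def sample_interarrival(p_b, lam, rng):
    """Draw one GGeo inter-arrival time, in clock cycles."""
    # Null branch: the next flit arrives in the same time slot (a burst).
    if rng.random() < p_b:
        return 0
    # Geo branch: geometric inter-arrival time on {1, 2, ...}.
    # Assumption: sigma is chosen so that the overall mean is 1 / lam.
    sigma = lam * (1.0 - p_b)
    t = 1
    while rng.random() >= sigma:
        t += 1
    return t


def empirical_moments(p_b, lam, samples=50_000, seed=0):
    """Return the sample mean and squared coefficient of variation."""
    rng = random.Random(seed)
    xs = [sample_interarrival(p_b, lam, rng) for _ in range(samples)]
    mean = sum(xs) / len(xs)
    var = sum((x - mean) ** 2 for x in xs) / len(xs)
    return mean, var / mean ** 2
```

Calling `empirical_moments(0.4, 0.2)` should return a mean close to $1/\lambda = 5$ cycles and a squared coefficient of variation close to `ca2_from_pb(0.4, 0.2)`, which is one way to check that the closed-form moment relation and the two-branch construction agree.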
A. Maximum entropy for queuing networks
We apply the principle of ME to queuing systems to find the probability distribution of desired metrics (e.g., queue occupancy) [13]. According to this principle, the selected distribution should be the least biased among all feasible distributions satisfying the prior information in the form of mean values. The optimal distribution is found by maximizing the corresponding entropy function: we formulate a nonlinear programming problem and solve it analytically via the Lagrange method of undetermined multipliers, as discussed next.
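To make the ME principle concrete, consider the simplest case, which is deliberately smaller than the multi-class problem solved in this paper: a single queue whose only known statistic is its mean occupancy. The Lagrange method then gives $p(n) = (1 - x)x^n$ with $x = \bar{n}/(1 + \bar{n})$. The Python sketch below, with hypothetical function names, builds this distribution and compares its entropy with that of a Poisson distribution of the same mean, which satisfies the same constraint but is more biased.

```python
import math


def entropy(p):
    """Shannon entropy (natural log) of a probability vector."""
    return -sum(x * math.log(x) for x in p if x > 0.0)


def max_entropy_occupancy(mean, n_max=400):
    """ME distribution on {0, 1, ...} given only the mean occupancy.

    Setting the derivative of the Lagrangian to zero gives
    p(n) = (1 - x) * x**n with x = mean / (1 + mean).
    The support is truncated at n_max for numerical evaluation.
    """
    x = mean / (1.0 + mean)
    return [(1.0 - x) * x ** n for n in range(n_max)]


def poisson(mean, n_max=400):
    """Poisson distribution with the same mean, used as a feasible competitor."""
    p = [math.exp(-mean)]
    for n in range(1, n_max):
        p.append(p[-1] * mean / n)
    return p


def mean_of(p):
    """Mean of a distribution given as a probability vector over 0, 1, ..."""
    return sum(n * x for n, x in enumerate(p))
```

For a mean occupancy of 2, both distributions satisfy the normalization and mean constraints, and `entropy(max_entropy_occupancy(2.0))` exceeds `entropy(poisson(2.0))`, as the principle requires.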
B. Decomposition of basic priority queuing
In a non-preemptive priority queuing system, the router does not interrupt the service of a lower priority flit when a higher priority flit arrives. An example system with two queues and a shared server is shown in Figure 2(a). There are two flows arriving at a priority-based arbiter and a shared server. The shaded circle corresponds to the high priority input (class 1) to the arbiter. We denote this structure as basic priority queuing. Our goal is to decompose this system into individual queue-nodes with modified servers, as shown in Figure 2(b). The combination of a queue and its corresponding server is referred to as a queue-node.

Fig. 2. Decomposition of a basic priority queuing

TABLE I
SUMMARY OF THE NOTATIONS USED IN THIS PAPER

| Symbol | Description |
|---|---|
| $\lambda$, $\lambda_m$ | Mean arrival rate of total traffic and class $m$ |
| $p_b$ | Probability of burstiness |
| $T_m$, $\hat{T}_m$ | Original and modified mean service time of class $m$ flits |
| $R$, $R_{mk}$ | Total residual time and residual time of class $m$ while class $k$ is served |
| $\rho_m$ | Mean server utilization of class $m$ flits ($= \lambda_m T_m$) |
| $C_a^2$, $C_{a_m}^2$ | Squared coeff. of variation of interarrival time of total traffic and class $m$ flits |
| $C_{s_m}^2$, $\hat{C}_{s_m}^2$ | Squared coeff. of variation of original and modified service time of class $m$ flits |
| $C_d^2$, $C_{d_m}^2$ | Squared coeff. of variation of interdeparture time of total traffic and class $m$ flits |
| $W_m$ | Mean waiting time of class $m$ flits |
| $\bar{n}_m$, $n_m$ | Mean and current occupancy of class $m$ flits in a queue-node |
| $\beta_m$ | Mean number of bursty arrivals of class $m$ |
| $\bar{n}_{mk}$ | Mean queue-node occupancy of class $m$ with serving class $k$ |
| $\mathbf{n}$ | State vector, $\mathbf{n} = (n_1, n_2, \ldots, n_M)$, of priority queue-nodes |
| $p(\mathbf{n})$ | Probability that a queue-node is in state $\mathbf{n}$ |
| $p_m(0)$ | Marginal probability of zero flits of class $m$ in a queue-node |
| $\alpha_m(\mathbf{n})$ | $\alpha_m(\mathbf{n}) = 1$ if class $m$ is in service and 0 otherwise |
| $M$ | Number of classes that share the same server |

The effective expected service time of class 2 flits, $\hat{T}_2$, is larger than the original mean service time $T_2$, since class 2 flits wait for the higher priority (class 1) flits in the original system. We calculate the effective service time in the transformed network using Little's Law as:

$$\hat{T}_m = \frac{1 - p_m(0)}{\lambda_m} \tag{1}$$

where $p_m(0)$ is the marginal probability of having no flits of class $m$ in the queue-node, as listed in Table I.
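One way to see why (1) follows from Little's Law, offered here as our reading rather than a derivation quoted from [13], is to apply the law to the server of the decomposed class-$m$ queue-node. The mean number of flits in service is $\lambda_m \hat{T}_m$, and in a single-server queue-node the server is busy exactly when the node is non-empty:

```latex
\hat{\rho}_m \;=\; \lambda_m \hat{T}_m \;=\; \Pr\{n_m > 0\} \;=\; 1 - p_m(0)
\quad\Longrightarrow\quad
\hat{T}_m \;=\; \frac{1 - p_m(0)}{\lambda_m}.
```

The whole effect of higher priority traffic is therefore captured through $p_m(0)$: the more often class $m$ flits are held back, the smaller $p_m(0)$ becomes and the larger the effective service time.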
Computing $p_m(0)$ using ME: We find $p_m(0)$ using the ME principle by maximizing the entropy function $H(p(\mathbf{n}))$ given in (2) subject to the constraints listed in (3):

$$\underset{p}{\text{maximize}} \quad H(p(\mathbf{n})) = -\sum_{\mathbf{n}} p(\mathbf{n}) \log\big(p(\mathbf{n})\big) \tag{2}$$

subject to

$$\sum_{\mathbf{n}=\mathbf{0}}^{\boldsymbol{\infty}} p(\mathbf{n}) = 1,$$

$$\sum_{\mathbf{n}=\mathbf{0}\ \text{except}\ n_m=1}^{\boldsymbol{\infty}} \alpha_m(\mathbf{n})\, p(\mathbf{n}) = \rho_m, \quad m = 1, \ldots, M \tag{3}$$

$$\sum_{\mathbf{n}=\mathbf{0}\ \text{except}\ n_m=n_k=1}^{\boldsymbol{\infty}} n_m\, \alpha_k(\mathbf{n})\, p(\mathbf{n}) = \bar{n}_{mk}, \quad m, k = 1, \ldots, M$$

The notation $\boldsymbol{\infty}$ means a state vector $\mathbf{n}$ with all elements set to $\infty$, and ($\mathbf{n} = \mathbf{0}$ except $n_m = 1$) refers to a vector $\mathbf{n}$ with the $m$-th element set to 1 and the other elements set to 0. The constraints in (3) comprise three types: normalization, mean server utilization, and mean occupancy. We introduced an extended set of mean occupancy constraints compared to [13] to provide further information about the underlying system. When a flit of a certain class arrives at the system, it may find the server busy with its own class or other classes since the server is a shared resource, as shown in Figure 2(a). Therefore, the mean occupancy of each class can be partitioned according to the contribution of each class occupying the server. We exploit this inherent partitioning to generate $M$ additional occupancy constraints. The occupancy related constraints depend on three components, $\beta_m$, $R_{mk}$ and $W_m$ (defined in Table I), derived in [6, 13].

We solve the nonlinear programming problem in (2, 3) to find $p(\mathbf{n})$, which we use to determine the probability of having zero flits of class $m$, $p_m(0)$. The convergence of this solution is guaranteed when the queuing system is in a stable region. We derived the general expression for $M$ queues in a priority structure with a single class per queue as:

$$p_m(0) = 1 - \rho_m - \sum_{k=1,\, k \neq m}^{M} \frac{\rho_k\, \bar{n}_{mk}}{\rho_k + \bar{n}_{mk}} \tag{4}$$

Fig. 3. Decomposition of flow contention at low priority
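As we read them, Equations (4) and (1) can be evaluated directly once the utilizations and the partitioned mean occupancies are known. The Python sketch below assumes those inputs are already available (in the paper they come from $\beta_m$, $R_{mk}$ and $W_m$); the function names and the numbers in the usage note are hypothetical and serve only as illustration.

```python
def prob_zero_flits(m, rho, n_bar):
    """Eq. (4): marginal probability that class m has no flits in its queue-node.

    rho[k]      : mean server utilization of class k
    n_bar[m][k] : mean occupancy of class m while class k is in service
    """
    contention = 0.0
    for k in range(len(rho)):
        if k == m:
            continue
        denom = rho[k] + n_bar[m][k]
        if denom > 0.0:  # a class with no load contributes nothing
            contention += rho[k] * n_bar[m][k] / denom
    return 1.0 - rho[m] - contention


def effective_service_time(m, lam, rho, n_bar):
    """Eq. (1): modified mean service time of class m."""
    return (1.0 - prob_zero_flits(m, rho, n_bar)) / lam[m]
```

For example, with two classes where index 0 is the high priority class, `lam = [0.1, 0.15]`, `rho = [0.2, 0.3]` (so both original service times are 2 cycles) and `n_bar = [[0.0, 0.0], [0.1, 0.0]]`, the call `effective_service_time(1, lam, rho, n_bar)` gives roughly 2.44 cycles. This is larger than the original 2 cycles, matching the observation that the low priority class sees an inflated service time.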
Plugging the expression of $p_m(0)$ from (4) into (1), we obtain the first moment of the service process.

Computing second moment of the service time: Since we also need the second moment to characterize the GGeo traffic, we calculate the modified squared coefficient of variation of the service time for class $m$ ($\hat{C}_{s_m}^2$). We utilize the queuing occupancy formulation of GGeo/G/1 [13] and the modified server utilization $\hat{\rho}_m = \lambda_m \hat{T}_m$ to obtain the following expression for $\hat{C}_{s_m}^2$:

$$\hat{C}_{s_m}^2 = \frac{(1 - \hat{\rho}_m)(2\bar{n}_m - \hat{\rho}_m) - \hat{\rho}_m C_{a_m}^2}{\hat{\rho}_m^2} \tag{5}$$

C. Decomposition of priority queuing with partial contention
Priority-aware NoCs involve complex queuing structures that cannot be modeled accurately using only the models for basic priority queuing. The complexity is primarily attributed to the partial priority contention across queues. We identified two basic structures with partial priority dependency that constitute the building blocks of practical priority-aware NoCs.

The first basic structure is shown in Figure 3(a), where high priority class 1 is in contention with a portion of the traffic in $q_2$ (class 2) through server $S_A$. Class 2 and 3 flits have the same priority and share $q_2$ before entering the traffic splitter that assigns class 2 and 3 flits to servers $S_A$ and $S_B$, respectively, following a notation similar to the one adopted in [14]. We denote this structure as contention at low priority. To decompose $q_1$ and $q_2$, we need to calculate the first two moments of the modified service process of class 1 and 2. The decomposed structure is shown in Figure 3(b). First, we set $\lambda_3$ to zero, which leads to a basic priority structure. Then, we apply the decomposition method discussed in Section IV-B to obtain $(\hat{T}_1, \hat{C}_{s_1}^2)$ and $(\hat{T}_2, \hat{C}_{s_2}^2)$. We derived the mean queuing time ($W_m$) of individual classes of $q_2$ in the decomposed form as:

$$W_m = \frac{R + \sum_{k=1}^{M} \hat{\rho}_k \hat{T}_k \beta_k}{1 - \sum_{k=1}^{M} \hat{\rho}_k} + \hat{T}_m(\beta_m + 1) - T_m \tag{6}$$

where $R = \sum_{k=1}^{M} \frac{1}{2}\hat{\rho}_k\big(\hat{T}_k - 1 + \hat{T}_k \hat{C}_{s_k}^2\big)$ and $\beta_m = \frac{1}{2}\big(C_{a_m}^2 + \lambda_m - 1\big)$.

The other basic structure, contention at high priority, is shown in Figure 4(a). In this scenario, only a fraction of the classes in $q_1$ (class 2) has higher priority than class 3, since class 1 in $q_1$ is served by $S_A$. Determining $\hat{T}_3$ is challenging because class 1 influences the inter-departure time of class 2. To incorporate this effect, we calculate the squared coefficient of variation of the inter-departure time, $C_{d_2}^2$, of class 2 using the split process formulation of GGeo streams given in [13]. We introduce a virtual queue, $q_v$, and feed it with the flits of class 2. Therefore, $q_v$ and $q_2$ form a basic priority structure, as shown in Figure 4(b). Subsequently, we apply the decomposition method described in Section IV-B to calculate $(\hat{T}_2, \hat{C}_{s_2}^2)$ as well as $(\hat{T}_3, \hat{C}_{s_3}^2)$. The decomposed structure is shown in Figure 4(c).

D. Iterative decomposition algorithm
Algorithm 1 shows a step-by-step procedure to obtain the analytical model using our approach described in Section IV-C. The inputs to the algorithm are the NoC topology, routing algorithm and server process. The analytical models presented for the canonical queuing system are independent of the NoC topology. Therefore, the analytical models are valid for any NoC, including irregular topologies. First, we identify priority dependencies between different classes in the network. Next, we apply decomposition for contention at high and low priority, as shown in lines 7 and 8 of Algorithm 1. Subsequently, we calculate the modified service process $(\hat{T}, \hat{C}_s^2)$ using (1, 4) and (5). Then, we compute the waiting time per class following (6). Finally, we obtain the average waiting time in each queue ($W_q$), as shown in line 12.

Fig. 4. Decomposition of flow contention at high priority

V. EXPERIMENTAL EVALUATION
The proposed technique is implemented in C++ to facilitate integration with system-level simulators. Analysis takes 2.7 ms for a 6 × … $O(n)$, where $n$ is the number of nodes. In all experiments, a warm-up period of 200K cycles is considered. The accuracy of the models is evaluated against an industrial cycle-accurate simulator [15] under both real applications and synthetic traffic that models uniformly distributed core to last-level cache traffic with 100% hit rate.

A. Evaluation on Architectures with Ring NoCs
This section analyzes the accuracy of the proposed analytical models using uniform traffic on a priority-based 6 × … which do not consider burstiness [6] significantly underestimate the latency by 33% on average (highlighted with the shaded row). In contrast, the work without the proposed decomposition technique [3] leads to over 100% overestimation even at low traffic loads (highlighted with text in italics). In this case, GGeo models cannot handle partial contention, since they assume all packets in the high-priority queue have higher priority than each packet in the low-priority queue. These results demonstrate that the proposed priority-aware NoC performance models have significantly higher accuracy than the existing alternatives.

TABLE II
COMPARISONS AGAINST EXISTING ALTERNATIVES (REFERENCE [3] AND REFERENCE [6]). H DENOTES ERRORS OVER …

Topology: 6 × … Ring, 4 × … (columns indexed by $p_b$ and $\lambda$; entries are Err (%), shown in four groups of nine)

Prop.: 0.2 5.6 12 0.8 0.6 12 0.2 4.3 14 | 0.5 3.7 7.3 0.9 5.1 12 0.5 3.1 12 | 2.3 5.0 11 2.9 7.5 13 2.0 9.1 12 | 4.7 0.6 11 4.3 8.2 10 6.1 7.9 12
Ref [3]: 17 H H 30 H H 54 H H | 66 H H H H H H H H | 30 H H 10 H H 12 H H | 28 H H 54 H H 78 H H
Ref [6]: 8.5 12 18 20 30 55 36 47 79 | 7.5 8.8 11 18 24 39 33 42 85 | 10 21 40 21 38 82 37 56 88 | 7.2 13 45 14 34 64 28 48 76

Algorithm 1: Iterative Decomposition Algorithm
1 Input: NoC topology, routing algorithm, server process, ($\lambda$) and ($p_b$) for each class as parameters
2 Output: Average waiting time for each queue ($W_q$)
3 $N$ = number of queues in the network
4 $S_q$ = set of classes in queue $q$
5 for $q = 1 : N$ do
6     for $m = 1 : |S_q|$ do
7         Apply decomp. for contention at high priority (if found)
8         Apply decomp. for contention at low priority (if found)
9         Compute $\hat{T}$, $\hat{C}_s^2$ using (1, 4) and (5)
10        Compute queuing time ($W_{q,m}$) using (6)
11    end
12    $W_q = \dfrac{\sum_{m=1}^{|S_q|} \lambda_{q,m} W_{q,m}}{\sum_{m=1}^{|S_q|} \lambda_{q,m}}$
13 end
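The per-class computations in Algorithm 1 (the service-time second moment, the waiting time, and the per-queue average in line 12) can be sketched in Python as follows. The sketch encodes Equations (5) and (6) as we read them, assumes the decomposition steps have already produced the modified service times, and uses hypothetical function names; it is an illustration under those assumptions, not the C++ implementation used in the evaluation.

```python
def modified_cs2(rho_hat, n_bar, ca2):
    """Eq. (5): modified squared coefficient of variation of service time."""
    return ((1.0 - rho_hat) * (2.0 * n_bar - rho_hat) - rho_hat * ca2) / rho_hat ** 2


def class_waiting_time(m, t_orig, t_hat, lam, ca2, cs2_hat):
    """Eq. (6): mean waiting time of class m in a decomposed queue.

    All arguments except m are per-class lists for the classes sharing the queue.
    """
    classes = range(len(lam))
    rho_hat = [lam[k] * t_hat[k] for k in classes]
    # Mean number of bursty arrivals per class.
    beta = [0.5 * (ca2[k] + lam[k] - 1.0) for k in classes]
    # Total residual time R.
    residual = sum(
        0.5 * rho_hat[k] * (t_hat[k] - 1.0 + t_hat[k] * cs2_hat[k]) for k in classes
    )
    burst = sum(rho_hat[k] * t_hat[k] * beta[k] for k in classes)
    return (residual + burst) / (1.0 - sum(rho_hat)) + t_hat[m] * (beta[m] + 1.0) - t_orig[m]


def queue_waiting_time(lam, waits):
    """Algorithm 1, line 12: injection-rate-weighted average over a queue's classes."""
    return sum(l * w for l, w in zip(lam, waits)) / sum(lam)
```

As a sanity check on the sketch, a single non-bursty class with a one-cycle service time ($p_b = 0$, so $C_a^2 = 1 - \lambda$ and $\beta = 0$) yields zero waiting time, whereas the same class with $\lambda = 0.2$ and $p_b = 0.5$ ($C_a^2 = 2.8$) yields 1.25 cycles. The difference comes entirely from flits that arrive in the same slot and queue behind one another.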
B. Evaluation on Architectures with Mesh NoCs
Table II compares the analytical model and simulation results for a priority-based 4 × …

C. Evaluation with Real Applications
This section validates the proposed analytical models using SYSmark® … ($\lambda$) and $p_b$.

Computing $p_b$: For each source, we feed traffic arrivals with timestamps over a 200K clock cycle window into a virtual queue with the same service rate as the NoC to determine the queue occupancy. At the end of the window, we compute the average occupancy. Then, we employ the model described in [13] to find the occupancy and then $p_b$ of each class.

The proposed analytical models are used to estimate the latency using the injection rate and burst parameters, as well as the NoC architecture and routing algorithm. The applications show burstiness in the range of 0.2 to 0.5. As shown in Table III, the proposed technique has on average 2% and 4% error compared to cycle-accurate simulations for 6 × …

Fig. 5. Comparison of the proposed analytical model with cycle-accurate simulation for 8 × …: (a) $p_b$ = 0.… and (b) $p_b$ = 0.…

TABLE III
MODELING ERROR (%) WITH REAL APPLICATIONS

| | xalancbmk | mcf | gcc | bwaves | GemsFDTD | omnetpp | perlbench | SYSmark14se |
|---|---|---|---|---|---|---|---|---|
| Prop | 2.17 | 4.97 | 0.92 | 0.15 | 0.38 | 5.10 | 3.63 | 0.73 |
| Ref [3] | 14.62 | 11.99 | 7.69 | 12.29 | 5.18 | 13.64 | 11.46 | 7.25 |

6 × …

VI. CONCLUSION
We presented analytical models for priority-aware NoCs under bursty traffic. We model bursty traffic as a generalized geometric distribution and apply the maximum entropy method to construct analytical models. Experimental evaluations show that the proposed technique has less than 10% modeling error with respect to cycle-accurate NoC simulation for real applications.

APPENDIX A
Usage of the proposed analytical models:
In this work, we aim to replace cycle-accurate NoC simulators with analytical performance models. The full-system simulation environment keeps track of the traffic injected from processing cores (e.g., CPU, GPU, caches, memory, etc.) to the NoC, as shown in Figure 6. The proposed technique obtains the traffic information of the processing cores over a time window, which is in the order of 100-200K cycles in our experiments. The duration can be decreased if the workload characteristics change considerably within a window, or increased if the workload is steady. Our simulator estimates the burstiness of the input and calculates the injection rate of each traffic class using this information (first two steps in Figure 6). Then, it applies the proposed analytical models to obtain the end-to-end latency of each traffic class. Whenever a processing core issues a new transaction, the communication latency is computed using these models instead of cycle-accurate simulations. These steps are repeated for each time window.

APPENDIX B
Generalization of the proposed analytical models:
We incorporate the Y-X routing algorithm in the experimental evaluations presented, following the actual reference commercial hardware design [1]. However, we note that the proposed approach is independent of the routing algorithm. In fact, the routing algorithm is one of the inputs to the proposed Iterative Decomposition Algorithm (Algorithm 1).

The analytical models are valid for any type of NoC, including irregular topologies. The analytical models presented for the canonical queuing system are independent of the NoC topology. The canonical model constitutes the end-to-end latency model for a given NoC topology. In fact, the NoC topology is an input to the algorithm which computes the end-to-end latency (Algorithm 1). Since this work targets general purpose NoCs used in manycore processors, we evaluate our proposed model only with the Mesh and Ring NoCs used in Intel Xeon server [1], Xeon Phi [17], and quad-core i7 (with integrated graphics) [18] processors.

Fig. 6. An overview of the proposed approach. The full-system simulation environment passes NoC transactions from the processing cores (CPU cores, GPU cores, and display, memory, and I/O controllers) to the proposed technique, which (1) estimates burstiness in the input, (2) finds the injection rate, (3) applies the proposed analytical models, and (4) returns communication latencies.

APPENDIX C
Results with real applications executed on 6 × …: Table IV shows the probability of burstiness of the different applications used in our experimental evaluations. The levels of burstiness exhibited by these applications are between 0.18 and 0.53, re-emphasizing that the chosen levels of burstiness for evaluation with synthetic traffic in Section V-C are representative of real applications.

Table V shows the modeling error with respect to simulation for 6 × …

TABLE IV
PROBABILITY OF BURSTINESS ($p_b$) FOR DIFFERENT APPLICATIONS.

| Apps | xalancbmk | mcf | gcc | bwaves | GemsFDTD | omnetpp | perlbench | SYSmark14se |
|---|---|---|---|---|---|---|---|---|
| $p_b$ | … | … | … | … | … | … | … | … |

TABLE V
MODELING ERROR (%) WITH REAL APPLICATIONS

| | xalancbmk | mcf | gcc | bwaves | GemsFDTD | omnetpp | perlbench | SYSmark14se |
|---|---|---|---|---|---|---|---|---|
| Prop | 5.38 | 5.48 | 4.57 | 6.94 | 0.31 | 6.58 | 9.91 | 1.20 |
| Ref [3] | 24.06 | 13.20 | 11.80 | 19.89 | 9.79 | 10.94 | 13.53 | 7.74 |

6 × …

REFERENCES
[1] James Jeffers et al. Intel Xeon Phi Processor High Performance Programming: Knights Landing Edition. Morgan Kaufmann, 2016.
[2] Paul Bogdan et al. Workload Characterization and Its Impact on Multicore Platform Design. In Proc. of the Intl. Conf. on Hardware/Software Codesign and System Synthesis, pages 231–240, 2010.
[3] Abbas Eslami Kiasari et al. An Analytical Latency Model for Networks-on-Chip. IEEE TVLSI, 21(1):113–123, 2013.
[4] Zhi-Liang Qian et al. A Support Vector Regression (SVR)-based Latency Model for Network-on-Chip (NoC) Architectures. IEEE Trans. on Computer-Aided Design of Integrated Circuits and Systems, 35(3):471–484, 2015.
[5] Simon M Tam et al. SkyLake-SP: A 14nm 28-Core Xeon® Processor. In …, pages 34–36, 2018.
[6] Sumit K Mandal et al. Analytical Performance Models for NoCs with Multiple Priority Traffic Classes. ACM TECS, 18(5s), 2019.
[7] Business Applications Performance Corporation (BAPCo). Benchmark, sysmark2014. http://bapco.com/products/sysmark-2014, accessed 27 May 2020.
[8] John L Henning. SPEC CPU2006 Benchmark Descriptions. ACM SIGARCH Computer Architecture News, 34(4):1–17, 2006.
[9] James Bucek, Klaus-Dieter Lange, and Jóakim v. Kistowski. SPEC CPU2017: Next-Generation Compute Benchmark. In Companion of the 2018 ACM/SPEC International Conference on Performance Engineering, pages 41–42, 2018.
[10] Umit Y Ogras and Radu Marculescu. Modeling, Analysis and Optimization of Network-on-Chip Communication Architectures, volume 184. Springer Science & Business Media, 2013.
[11] Gunter Bolch et al. Queueing Networks and Markov Chains: Modeling and Performance Evaluation with Computer Science Applications. John Wiley & Sons, 2006.
[12] Joris Walraevens. Discrete-time Queueing Models with Priorities. PhD thesis, Ghent University, 2004.
[13] Demetres D Kouvatsos. Entropy Maximisation and Queuing Network Models. Annals of Operations Research, 48(1):63–126, 1994.
[14] Alexander Gotmanov et al. Verifying Deadlock-Freedom of Communication Fabrics. In Intl. Workshop on Verification, Model Checking, and Abstract Interpretation, pages 214–231. Springer, 2011.
[15] Umit Y Ogras et al. Energy-Guided Exploration of On-Chip Network Design for Exa-Scale Computing. In Proc. of Intl. Workshop on System Level Interconnect Prediction, pages 24–31, 2012.
[16] Nathan Binkert et al. The Gem5 Simulator. SIGARCH Comp. Arch. News, May 2011.
[17] Avinash Sodani, Roger Gramunt, Jesus Corbal, Ho-Seop Kim, Krishna Vinod, Sundaram Chinthamani, Steven Hutsell, Rajat Agarwal, and Yen-Chen Liu. Knights Landing: Second-Generation Intel Xeon Phi Product. IEEE Micro, 36(2):34–46, 2016.
[18] James Charles, Preet Jassi, Narayan S Ananth, Abbas Sadat, and Alexandra Fedorova. Evaluation of the Intel® Core i7 Turbo Boost Feature. In 2009 IEEE International Symposium on Workload Characterization (IISWC)