Performance Analysis of Priority-Aware NoCs with Deflection Routing under Traffic Congestion
Sumit K. Mandal, Anish Krishnakumar, Raid Ayoub, Michael Kishinevsky, Umit Y. Ogras
Dept. of ECE, University of Wisconsin-Madison; Intel Corporation, Hillsboro, OR
ABSTRACT
Priority-aware networks-on-chip (NoCs) are used in industry to achieve predictable latency under different workload conditions. These NoCs incorporate deflection routing to minimize queuing resources within routers and achieve low latency during low traffic load. However, deflected packets can exacerbate congestion during high traffic load since they consume NoC bandwidth. State-of-the-art analytical models for priority-aware NoCs ignore deflected traffic despite its significant latency impact during congestion. This paper proposes a novel analytical approach to estimate the end-to-end latency of priority-aware NoCs with deflection routing under bursty and heavy traffic scenarios. Experimental evaluations show that the proposed technique outperforms alternative approaches and estimates the average latency for real applications with less than 8% error compared to cycle-accurate simulations.
CCS CONCEPTS
• Networks → Network performance modeling; • Computer systems organization → System on a chip

1 INTRODUCTION
Pre-silicon design-space exploration and system-level simulations constitute a crucial component of the industrial design cycle [12, 27]. They are used to confirm that new-generation designs meet power-performance targets before labor- and time-intensive RTL implementation starts [4]. Furthermore, virtual platforms combine power-performance simulators and functional models to enable firmware and software development while hardware design is in progress [21]. These pre-silicon evaluation environments incorporate cycle-accurate NoC simulators due to the criticality of shared communication and memory resources in overall performance [1, 16]. However, slow cycle-accurate simulators have become the major bottleneck of pre-silicon evaluation. Similarly, exhaustive design-space exploration is not feasible due to the long simulation times. Therefore, there is a strong need for fast, yet accurate, analytical models to replace cycle-accurate simulations and increase the speed and scope of pre-silicon evaluations [36]. Analytical NoC performance models are used primarily for fast design-space exploration since they provide significant speed-up compared to detailed simulators [17, 18, 26, 30]. However, most existing analytical models fail to capture two important aspects of industrial NoCs [15].
First, they do not model routers that employ priority arbitration. Second, existing analytical models assume
This paper will be published in the proceedings of ICCAD 2020. This work was supported partially by Strategic CAD Labs, Intel Corporation, USA. Author's addresses: S. K. Mandal, A. Krishnakumar and U. Y. Ogras, Department of Electrical and Computer Engineering, University of Wisconsin-Madison, WI, 53706. Emails: {skmandal, anish.n.krishnakumar, uogras}@wisc.edu; R. Ayoub and M. Kishinevsky, Intel Corporation, 2111 NE 25th Ave., Hillsboro, OR 97124; emails: {raid.ayoub, michael.kishinevsky}@intel.com.
that the destination nodes always sink the incoming packets. In reality, network interfaces between the routers and cores have finite (and typically limited) ingress buffers. Hence, packets bounce (i.e., they are deflected) at the destination nodes when the ingress queue is full. Recently proposed performance models target priority-aware NoCs [23, 24]. However, these ignore deflection due to finite buffers and use the packet injection rate as the primary input. This is a significant limitation since the deflection probability (p_d) increases both the hop count and traffic congestion. Indeed, Figure 1 shows that the average NoC latency increases significantly with the probability of deflection. For example, the average latency varies from 6–70 cycles for an injection rate of 0.25 packets/cycle/source when p_d varies from 0.1–0.5. Therefore, performance models for priority-aware NoCs have to account for deflection probability at the destinations.

Figure 1: Cycle-accurate simulations on a 6×6 NoC: average latency (cycles) versus injection rate (packets/cycle/source) for different deflection probabilities (p_d) at the sink.

This work proposes an accurate analytical model for priority-aware NoCs with deflection routing under bursty traffic. In addition to increasing the hop count, deflection routing also aggravates traffic congestion due to extra packets traveling in the network. Since the deflected packets also have a complex effect on the egress queues of the traffic sources, analytical modeling of priority-aware NoCs with deflection routing is challenging. To address this problem, we first need to approximate the probability distribution of the inter-arrival time of deflected packets. Specifically, we compute the first two moments of the inter-arrival time of deflected packets since we consider bursty traffic. To this end, the proposed approach starts with a canonical queuing system with deflection routing. We first model the distribution of deflected traffic and the average queuing delay for this system.
However, this methodology is not scalable when the network has multiple queues with complex interactions between them. Therefore, we also propose a superposition-based technique to obtain the waiting time of the packets in arbitrarily sized industrial NoCs. This technique decomposes the queuing system into multiple subsystems. The structure of these subsystems is similar to the canonical queuing system. After deriving the analytical expressions for the parameters of the distribution model of deflected packets of individual subsystems, we superimpose the results to solve the original system with multiple queues. Thorough experimental evaluations with industrial NoCs and their cycle-accurate simulation models show that the proposed technique significantly outperforms prior approaches [18, 24]. In particular, the proposed technique achieves less than 8% modeling error when tested with real applications from different benchmark suites. The major contributions of this work are as follows:
• An accurate performance model for priority-aware NoCs with deflection routing under bursty traffic,
• An algorithm to obtain end-to-end latency using the proposed performance model,
• Detailed experimental evaluation with industrial priority-aware NoCs under varying degrees of deflection.
The rest of this paper is organized as follows. Section 2 summarizes the related research. Section 3 presents the background and overview of the proposed work. Section 4 describes the proposed methodology to construct the analytical model for priority-aware NoCs, which considers deflection routing. Section 5 details experimental evaluations, and Section 6 concludes the paper.
2 RELATED WORK
Deflection routing was first introduced in the domain of optical NoCs as hot-potato routing [7]. Later, it was adapted for the NoCs used in high-performance SoCs to minimize buffer requirements and increase energy efficiency [10, 11, 25]. This routing mechanism always assigns the packets to a free output port of a router, even if the assignment does not result in minimum latency. This way, the buffer size requirement in the routers is minimized. The authors in [22] perform a thorough study on the effectiveness of deflection routing for different NoC topologies and routing algorithms. Deflection routing is also used in industrial priority-aware NoCs [15]. Since arbitrary deflections can cause livelocks and unpredictable latency, industrial priority-aware NoCs deflect the packets only at the destination nodes when the ingress buffer is full. Furthermore, the deflected packets always remain within the same row or column, and they are guaranteed to be sunk after a fixed number of deflections.
NoC performance analysis techniques have been used for design-space exploration and architectural studies such as buffer sizing [26, 28, 35]. However, most of these techniques do not consider NoCs with priority arbitration and deflection routing, which are the key features of industrial NoCs [15]. Performance analysis of priority-aware queuing networks has also been studied for off-chip networks [3, 6, 33]. These analytical models consider the queuing networks in continuous time. However, each transaction in a NoC happens at a clock cycle. Therefore, the underlying queuing system needs to be considered in the discrete time domain. A performance analysis technique for a priority-aware queuing network in the discrete time domain is presented in [33]. However, this technique suffers from high complexity for a complex queuing network, hence it is not applicable to industrial priority-aware NoCs.
A recent technique targets priority-aware NoCs [18], but it considers only a single class of packets in each queue of the network. In contrast, industrial priority-aware NoCs allow multiple classes of packets to exist in the same queue. NoCs with multiple priority traffic classes have recently been analyzed in [24]. However, this analysis assumes that the input traffic follows a geometric distribution. This technique has limited applications since industrial NoCs can experience bursty traffic. Furthermore, it does not consider deflection routing. Since deflection routing increases traffic congestion, it is crucial to incorporate this aspect while constructing performance models. An analytical bound on the maximum delay in networks with deflection routing is presented in [8]. However, evaluating the maximum delay is not useful since it leads to significant overestimation. Another analytical model for NoCs with deflection routing is proposed in [13]. The authors first compute the blocking probability at each port of a router using an M/G/1 queuing model. Then, they compute the contention matrix at each router port. The average waiting time of packets at each port is computed using the contention matrix. However, this analysis ignores different priority classes and applies only to continuous-time queuing systems.
In contrast to prior work, we propose a performance analysis technique for priority-aware NoCs with deflection routing under both bursty and high traffic load. The proposed technique applies the superposition principle to obtain the statistical distribution of the deflected packets. Using this distribution, it computes the average waiting time for each queue. To the best of our knowledge, this is the first analytical model for priority-aware industrial NoCs with deflection routing under high traffic load.
3 BACKGROUND AND OVERVIEW
Architecture:
This work considers priority-aware NoCs used in high-end servers and many-core architectures [15]. Each column of the NoC architecture, shown in Figure 2, is also used in client systems such as Intel i7 processors [31]. Hence, the proposed analysis technique is broadly applicable to a wide range of industrial NoCs. In priority-aware NoCs, the packets already in the network have higher priority than the packets waiting in the egress queues of the sources. Assume that Node 2 in Figure 2 sends a packet to Node 12 following Y-X routing (highlighted by red arrows). Suppose that the packet in the egress queue of Node 6 collides with this packet. The packet from Node 2 to Node 12 will take precedence since the packets already in the NoC have higher priority. Hence, packets experience a queuing delay at the egress queues but have predictable latency until they reach the destination or turning point (Node 10 in Figure 2). Then, the packet competes with the packets already in the corresponding row. That is, the path from the source (Node 2) to the destination (Node 12) can be considered as two segments, each consisting of a queuing delay followed by a predictable latency.

Figure 2: A representative 4×4 priority-aware NoC with deflection routing (legend: router links; path of deflected packets; path of packet without deflection; sink/junction routers).

Deflection in priority-aware NoCs happens when the ingress queue at the turning point (Node 10) or the final destination (Node 12) becomes full. This can happen if the receiving node, such as a cache controller, cannot process the packets fast enough. The probability of observing a full queue increases with smaller queues (needed to save area) and heavy traffic load from the cores. If the packet is deflected at the destination node, it circulates within the same row, as shown in Figure 2. Consequently, a combination of regular and deflected traffic can load the corresponding row and pressure the ingress queue at the turning point (Node 10). This, in turn, can lead to deflection on the column and propagate the congestion towards the source. Finally, if a packet is deflected more than a specific number of times, it reserves a slot in the ingress queue. This bounds the maximum number of deflections and avoids livelock.
Traffic:
Industrial priority-aware NoCs can experience bursty traffic, which is characteristic of real applications [5, 30]. This work considers a generalized geometric (GGeo) distribution for the input traffic, which takes burstiness into account [20]. GGeo traffic is characterized by an average injection rate (λ) and the coefficient of variation of inter-arrival time (C_A). We define a traffic class as the traffic of each source-destination pair. The average injection rate and coefficient of variation of inter-arrival time of class-i are denoted by λ_i and C_Ai, respectively, as shown in Table 1. Finally, the mean service time and coefficient of variation of inter-departure time of class-i are denoted as T_i and C_Si.
Overview: Our goal is to construct an accurate analytical model to compute the end-to-end latency for priority-aware NoCs with deflection routing. The proposed approach can be used to accelerate full-system simulations and also to perform design-space exploration. We assume that the parameters of the GGeo distribution of the input traffic to the NoC (λ, C_A) are known from the knowledge of the application. The proposed model uses the deflection probability (p_d) as the second major input, in contrast to existing techniques that ignore deflection. Its range is found from architecture simulations as a function of the NoC architecture (topology, number of processors, and buffer sizes). Its analytical modeling is left for future work.

Table 1: Summary of the notations used in this paper.
λ_i — Arrival rate of class-i
p_dj — Deflection probability at sink-j
T_i, T̂_i — Original and modified mean service time of class-i
ρ_i — Mean server utilization of class-i (= λ_i T_i)
C_Ai — Coefficient of variation of inter-arrival time of class-i
C_Si, Ĉ_Si — Coefficient of variation of original and modified service time of class-i
C_Di — Coefficient of variation of inter-departure time of class-i
C_Mi — Coefficient of variation of inter-departure time of merged traffic of class-i
W_i — Mean waiting time of class-i

The proposed analytical model utilizes the distribution of the input traffic to the NoC (λ, C_A) and the deflection probability (p_d) to compute the average end-to-end latency as a sum of four components: (1) the queuing delay at the source, (2) the latency from the source router to the junction router, (3) the queuing delay at the junction router, and (4) the latency from the junction router to the destination. Note that all these components account for deflection, and it is challenging to compute them, especially under high traffic load.

4 PROPOSED ANALYTICAL MODEL
This section presents the proposed performance analysis technique for estimating the end-to-end latency of priority-aware NoCs with deflection routing. We first construct a model for a canonical system with a single traffic class, where the deflected traffic distribution is approximated using a GGeo distribution (Section 4.1). Subsequently, we introduce a scalable approach for a network with multiple traffic classes. In this approach, we first develop a solution for the canonical system. Then, we employ the principle of superposition to extend the analytical model to larger and realistic NoCs with multiple traffic classes (Section 4.2). Finally, we propose an algorithm that uses our analytical models to compute the average end-to-end latency for a priority-aware NoC with deflection routing (Section 4.3).
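To make the GGeo input-traffic model concrete, the sketch below draws slot-by-slot inter-arrival times from one common GGeo construction: with probability p_br the next packet arrives in the same burst (zero gap), otherwise the gap is geometrically distributed. This helper and its mapping from (p_br, λ) to the geometric parameter are illustrative assumptions, not taken from the paper.

```python
import random

def ggeo_interarrivals(lam, p_br, n, seed=0):
    """Draw n inter-arrival times (in cycles) from a simple GGeo-like process.

    With probability p_br the next packet belongs to the same burst
    (inter-arrival time 0); otherwise the gap is geometric with a success
    probability chosen so the overall mean inter-arrival time is 1/lam.
    """
    rng = random.Random(seed)
    # Mean inter-arrival must equal 1/lam:
    # (1 - p_br) * mean_gap = 1/lam  =>  mean_gap = 1/(lam * (1 - p_br))
    q = lam * (1.0 - p_br)          # geometric success probability (assumed <= 1)
    gaps = []
    for _ in range(n):
        if rng.random() < p_br:
            gaps.append(0)          # packet arrives within the same burst
        else:
            g = 1                   # geometric number of cycles, mean 1/q
            while rng.random() >= q:
                g += 1
            gaps.append(g)
    return gaps
```

The empirical mean rate converges to λ, while increasing p_br inflates the coefficient of variation of the inter-arrival times above 1, which is exactly the burstiness knob the model needs.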
4.1 Canonical Queuing System with a Single Traffic Class
Figure 3(a) shows an example of a single-class input traffic and egress queue that injects traffic into a network with deflection routing. The input packets are buffered in the egress queue Q_i (analogous to the packets stored in the egress queue of Node 2 in Figure 2). We denote the traffic of Q_i as class-i, which is modeled using a GGeo distribution with two parameters (λ_i, C_Ai). The packets in Q_i are dispatched to a priority arbiter and assigned a low priority, marked with 2. In contrast, the packets already in the network have a high priority and are routed to the port marked with 1. A packet traverses a certain number of hops (similar to the latency from the source router to the junction router in Figure 2) and reaches the destination. Since the number of hops is constant for a particular traffic class, we omit these details in Figure 3(a) for simplicity. If the ingress queue at the destination is full (with probability p_di), the packet is deflected back into the network. Otherwise, it is consumed at the destination (with probability 1 − p_di). Deflected packets travel through the NoC (within the column or row, as illustrated in Figure 2) and pass through the source router, but this time with higher priority. The profile of the deflected packets in the network is modeled by a buffer (Q_d) in Figure 3(a), since they remain in order and have a fixed latency from the destination to the original source.

Figure 3: (a) Queuing system of a single class with deflection routing. (b) Approximate queuing system to compute C_Adi.
This process continues until the destination can consume the deflected packets. Our goal is to compute the average waiting time W_i in the source queue, i.e., components 1 and 3 of the end-to-end latency described in Section 3.2. To obtain W_i, we first need to derive analytical expressions for the rate of deflected packets of class-i (λ_di) and the coefficient of variation of the inter-arrival time of the deflected packets (C_Adi), as follows.
Rate of deflected packets (λ_di): λ_di is obtained by calculating the average number of times a packet is deflected (N_di) until it is consumed at the destination:

N_di = Σ_{n=1}^{∞} n · p_di^n · (1 − p_di) = p_di / (1 − p_di)   (1)

Therefore, λ_di can be expressed as:

λ_di = λ_i N_di = λ_i p_di / (1 − p_di)   (2)

Coefficient of variation of inter-arrival time of deflected packets (C_Adi): To compute C_Adi, the priority-related interaction between the deflected traffic of Q_d and new injections in Q_i must be captured. This computation is more involved due to the priority arbitration between the packets in Q_d and Q_i, which involves a circular dependency. We tackle this problem by transforming the system in Figure 3(a) into the approximate representation shown in Figure 3(b) to simplify the computations. The idea is to transform the priority queuing with a shared resource into separate queue nodes (queue + server) with a modified server process. This transformation enables the decomposition of Q_d and Q_i and their shared server into individual queue nodes with servers Ŝ_d and Ŝ_i, respectively. The departure traffic from these two nodes merges at the destination and is consumed with probability 1 − p_di and deflected otherwise. The input traffic to the egress queue, as well as the deflected traffic, may exhibit bursty behavior.
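Equations 1 and 2 depend only on the deflection probability, so they are easy to sanity-check numerically. The sketch below (illustrative helper names, not from the paper) computes N_di and λ_di and verifies the closed form of Equation 1 against a Monte Carlo simulation of repeated deflections.

```python
import random

def mean_deflections(p_d):
    """Equation 1: expected number of deflections until the sink accepts."""
    return p_d / (1.0 - p_d)

def deflected_rate(lam, p_d):
    """Equation 2: average rate of deflected packets of a class."""
    return lam * mean_deflections(p_d)

def simulate_mean_deflections(p_d, packets=100000, seed=0):
    """Monte Carlo: each delivery attempt is deflected with probability p_d."""
    rng = random.Random(seed)
    total = 0
    for _ in range(packets):
        while rng.random() < p_d:   # deflected; retry after one more round trip
            total += 1
    return total / packets
```

For p_d = 0.3 the closed form gives 0.3/0.7 ≈ 0.429 deflections per packet, so λ_di ≈ 0.429 λ_i; the simulated estimate converges to the same value, and both diverge as p_d approaches 1, matching the congestion blow-up seen in Figure 1.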
Indeed, the deflected traffic distribution can be bursty because of the server-process effect and the priority interactions between the input traffic and the deflected traffic, even when the input traffic is not bursty. Therefore, we approximate the distribution of the deflected traffic via a GGeo distribution. To compute the parameters of the GGeo traffic, we apply the principle of maximum entropy (ME), as shown in [20]. To obtain the modified service process of class-i, we first calculate the probability of no packets in Q_i and in its corresponding server (i.e., p_Qi(0)) using ME as:

p_Qi(0) = 1 − ρ_i − ρ_di · n_i / (n_i + ρ_i + ρ_di)   (3)

where ρ_i and ρ_di denote the utilization of the respective servers, and n_i is the occupancy of class-i in Q_i. Next, we apply Little's law to compute the first-order moment of the modified service time (T̂_i) as:

T̂_i = (1 − p_Qi(0)) / λ_i   (4)

Subsequently, we obtain the effective coefficient of variation Ĉ_Si as:

(Ĉ_Si)² = [(1 − ρ̂_i)(n_i + ρ̂_i) − ρ̂_i (C_Ai)²] / ρ̂_i   (5)

where ρ̂_i = λ_i T̂_i. We follow similar steps (Equation 3 – Equation 5) for the deflected traffic to obtain T̂_di and Ĉ_Sdi. With the modified service process, the coefficients of variation of the inter-departure time of the packets in Q_d (C_Ddi) and Q_i (C_Di) are computed using the process merging method [29]. Then, we find the coefficient of variation (C_Mi) of the merged traffic from queues Q_d and Q_i as:

(C_Mi)² = (λ_di (C_Ddi)² + λ_i (C_Di)²) / (λ_di + λ_i)   (6)

We note that C_Mi is a function of the coefficient of variation of the inter-arrival time of the deflected traffic, C_Adi.
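The maximum-entropy step (Equations 3–5) and the merge (Equation 6) translate directly into code. The sketch below is a hypothetical transcription of the equations as reconstructed above; the inputs (utilizations, occupancy, and squared coefficients of variation) are assumed to be already known from the rest of the model.

```python
def modified_service(lam_i, rho_i, rho_di, n_i, ca2_i):
    """Equations 3-5: modified service process of class-i.

    Returns (T_hat, rho_hat, cs2_hat): modified mean service time,
    modified utilization, and squared CV of the modified service time.
    """
    # Eq. 3: probability of an empty queue node via maximum entropy
    p0 = 1.0 - rho_i - rho_di * n_i / (n_i + rho_i + rho_di)
    # Eq. 4: Little's law gives the modified mean service time
    t_hat = (1.0 - p0) / lam_i
    rho_hat = lam_i * t_hat
    # Eq. 5: effective squared coefficient of variation
    cs2_hat = ((1.0 - rho_hat) * (n_i + rho_hat) - rho_hat * ca2_i) / rho_hat
    return t_hat, rho_hat, cs2_hat

def merged_cv2(lam_di, cd2_di, lam_i, cd2_i):
    """Equation 6: squared CV of the merged departure streams of Q_d and Q_i."""
    return (lam_di * cd2_di + lam_i * cd2_i) / (lam_di + lam_i)
```

For example, with lam_i = 0.3, rho_i = 0.3, rho_di = 0.1, n_i = 0.5, and ca2_i = 1, the empty-node probability stays in (0, 1) and the modified utilization exceeds rho_i, reflecting the service slots stolen by the higher-priority deflected packets.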
Since part of this merged traffic is consumed at the sink, we apply the traffic splitting method from [29] to approximate C_Adi as:

(C_Adi)² = 1 + p_di((C_Mi)² − 1)   (7)

Finally, we extend the priority-aware formulations in the continuous time domain [6] to the discrete time domain to obtain the average waiting time of the packets in Q_d and Q_i:

W_di = [ρ_di(T_di − 1) + ρ_i(T_i − 1) + T_di((C_Adi)² + λ_di − 1)] / (1 − ρ_di)   (8)

W_i = [ρ_di(T_di + 1) + ρ_di W_di + ρ_i(T_i − 1) + T_i((C_Ai)² + λ_i − 1)] / (1 − ρ_i − ρ_di)   (9)

4.2 Scalable Approach for Multiple Traffic Classes
The analytical model for the system with a single class presented in Section 4.1 becomes intractable with a higher number of traffic classes. This section introduces a scalable approach based on the superposition principle that builds upon the canonical system used in Section 4.1. Figure 4(a) shows an example with priority arbitration and N egress queues, one for each traffic class. We note that this queuing system is a simplified representation of a real system. The packets routed to port i have higher priority than those routed to port j for i < j. The deflected traffic in the network is buffered in Q_d, which has the highest priority in the queuing system. The primary goal is to model the queuing time of the packets of each traffic class.
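Equations 7–9 can likewise be evaluated directly. The helpers below are an illustrative transcription of the formulas as reconstructed above, with all rates, utilizations, and squared CVs passed in explicitly; the function names are our own.

```python
def split_cv2(cm2, p_d):
    """Equation 7: squared CV of the deflected (split) stream."""
    return 1.0 + p_d * (cm2 - 1.0)

def waiting_time_deflected(rho_di, t_di, rho_i, t_i, ca2_di, lam_di):
    """Equation 8: mean waiting time of deflected packets in Q_d."""
    num = (rho_di * (t_di - 1.0) + rho_i * (t_i - 1.0)
           + t_di * (ca2_di + lam_di - 1.0))
    return num / (1.0 - rho_di)

def waiting_time_class(rho_di, t_di, w_di, rho_i, t_i, ca2_i, lam_i):
    """Equation 9: mean waiting time of class-i packets in Q_i."""
    num = (rho_di * (t_di + 1.0) + rho_di * w_di
           + rho_i * (t_i - 1.0) + t_i * (ca2_i + lam_i - 1.0))
    return num / (1.0 - rho_i - rho_di)
```

With a lightly loaded example (rho_di = 0.1, rho_i = 0.3, T_di = T_i = 2), the low-priority wait W_i comes out larger than W_di, as expected from the priority order between the deflected and freshly injected packets.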
Figure 4: (a) Queuing system with N classes with deflection routing; (b) decomposition into N subsystems to calculate the GGeo parameters of deflected traffic per class; (c) applying superposition to obtain the GGeo parameters of the overall deflected traffic (M denotes the merging process).

Modeling the coefficient of variation of the deflected traffic becomes harder since deflected packets interact with all traffic classes rather than a single class. These interactions complicate the analytical expressions significantly. Priority arbitration enables us to sort the queues in the order in which the packets are served. The queue of the deflected packets has the highest priority, while the rest are ordered with respect to their indices. Due to this inherent order between the priority classes, their impact on the deflected traffic distribution can be approximated as being independent of each other. This property enables us to decompose the queuing system into multiple subsystems and model each subsystem separately, as illustrated in Figure 4(b). Then, we apply the principle of superposition to obtain the parameters of the GGeo distribution of the deflected traffic. Note that each of these subsystems is identical to the canonical system analyzed in Section 4.1. Hence, we first compute λ_di and C_Adi of each subsystem-i following the procedure described in Section 4.1. Subsequently, we apply the superposition principle to λ_di and C_Adi for i = 1 . . .
N to obtain the GGeo distribution parameters of the deflected traffic (λ_d, C_Ad). In general, we obtain the GGeo distribution parameters of the deflected traffic corresponding to class-i by setting all traffic classes except class-i to zero (λ_j = 0 for j = 1 . . . N, j ≠ i). The values of λ_di and C_Adi can be expressed as:

λ_di = λ_d |_{λ_j = 0, j ≠ i; λ_i > 0}  and  C_Adi = C_Ad |_{λ_j = 0, j ≠ i; λ_i > 0}   (10)

Subsequently, we apply the principle of superposition to obtain the distribution parameters of Q_d, as shown in Figure 4(c). First, we compute λ_d by adding all λ_di:

λ_d = Σ_{i=1}^{N} λ_di   (11)

The value of C_Ad is approximated by applying the superposition-based traffic merging process [29] to each C_Adi, as shown below:

(C_Ad)² = Σ_{i=1}^{N} (λ_di / λ_d) (C_Adi)²   (12)

Next, we use these distribution parameters (λ_d, C_Ad) of the deflected packets to calculate the waiting time of the traffic classes in the system. The formulation of the priority-aware queuing system is applied to obtain the waiting time of each traffic class-i (W_i) [3]:

W_i = [ρ_d(T_d + 1) + ρ_i W_d] / (1 − ρ_d − Σ_{n=1}^{i} ρ_n)
    + [Σ_{n=1}^{i−1} (ρ_n(T_n + 1) + ρ_i W_n)] / (1 − ρ_d − Σ_{n=1}^{i} ρ_n)
    + [ρ_i(T_i − 1) + T_i((C_Ai)² + λ_i − 1)] / (1 − ρ_d − Σ_{n=1}^{i} ρ_n)   (13)

The first term in Equation 13 denotes the effect of the deflected traffic on class-i; the second term denotes the effect of higher-priority classes (class-j, j < i) on class-i; and the last term denotes the effect of class-i itself. For more complex scenarios that include traffic splits, we apply an iterative decomposition algorithm [23] to obtain the queuing time of different classes.
Figure 5 shows the average latency comparison between the proposed analytical model and simulation for the system in Figure 4. In this setup, we assume the number of classes is 5 (N = 5).

Figure 5: Comparison of average latency between simulation and the analytical model for the canonical example shown in Figure 4 with N = 5.

The analytical model of [18] highly overestimates the latency as it does not consider multiple traffic classes. The performance model of the priority-aware NoC in [24] accounts for multiple traffic classes, but it does not model deflection. Hence, it severely underestimates the average latency.
Summary of the analytical modeling:
We presented a scalable approach for the analytical model generation of the end-to-end latency that handles multiple traffic classes of priority-aware NoCs with deflection routing. It applies the principle of superposition on subsystems, where each subsystem is a canonical queuing system of a single traffic class, to significantly simplify the approximation of the GGeo parameters of the deflected traffic and, in turn, the latency calculations.
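As a concrete illustration of the superposition step (Equations 11 and 12) and the per-class waiting time (Equation 13), the sketch below aggregates per-subsystem deflected-traffic parameters and evaluates the reconstructed priority formula; the function names and argument layout are our own assumptions.

```python
def superpose(lam_list, ca2_list):
    """Equations 11-12: rate and squared CV of the overall deflected traffic."""
    lam_d = sum(lam_list)
    ca2_d = sum(l * c for l, c in zip(lam_list, ca2_list)) / lam_d
    return lam_d, ca2_d

def waiting_time(i, rho_d, t_d, w_d, rho, t, w, ca2_i, lam_i):
    """Equation 13: mean waiting time of class-i (0-indexed).

    rho and t are lists over classes 0..i; w holds the already-computed
    waiting times of the higher-priority classes 0..i-1.
    """
    denom = 1.0 - rho_d - sum(rho[: i + 1])
    term_deflected = (rho_d * (t_d + 1.0) + rho[i] * w_d) / denom
    term_higher = sum(rho[n] * (t[n] + 1.0) + rho[i] * w[n]
                      for n in range(i)) / denom
    term_self = (rho[i] * (t[i] - 1.0)
                 + t[i] * (ca2_i + lam_i - 1.0)) / denom
    return term_deflected + term_higher + term_self
```

For instance, superpose([0.1, 0.2], [1.0, 2.0]) returns a merged rate of 0.3 and a squared CV equal to the rate-weighted average 5/3, mirroring how heavier deflected streams dominate the burstiness of Q_d.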
End-to-End latency computation:
Algorithm 1 describes the end-to-end latency computation with our proposed analytical model. The input parameters of the algorithm are the NoC topology, routing algorithm, service process of each server, input traffic distribution for each class, and deflection probability per sink. It outputs the average end-to-end latency (L_avg). First, the queuing system is decomposed into multiple subsystems, as shown in Figure 4(b), and λ_di and C_Adi for each subsystem-i are computed. Subsequently, the proposed superposition methodology is applied to compute λ_d and C_Ad, as shown in lines 6–9 of the algorithm. Then, λ_d and C_Ad are used to compute the average waiting time of the deflected packets (W_d). Next, the average waiting time for class-s in Q_n (W_ns) is computed as shown in lines 13–14. The service time combined with the static latency from source to destination (L_ns) is added to W_ns to obtain the end-to-end latency. Finally, the average end-to-end latency (L_avg) is computed by taking a weighted average of the latency of each class, as shown in line 16 of the algorithm.

Algorithm 1: End-to-end latency computation
Input: NoC topology, routing algorithm, service process, input distribution for each class (λ, C_A), deflection probability (p_d) for each sink
Output: Average end-to-end latency (L_avg)
1:  S = set of all classes in the network
2:  N = number of queues in the network
3:  S_n = set of classes in queue n
4:  /* Distribution of deflected traffic */
5:  for i = 1 : |S| do
6:      Compute λ_di and C_Adi using Equation 10
7:  end
8:  Compute λ_d using Equation 11
9:  Compute C_Ad using Equation 12
10: Compute W_d using λ_d and C_Ad
11: /* Average waiting time of each class */
12: for n = 1 : N do, for s = 1 : |S_n| do
13:     Compute W_ns using Equation 13 (if |S_n| = 1)
14:     Compute W_ns following the decomposition method in [23] (if |S_n| > 1)
15: end, end
16: L_avg = [Σ_{n=1}^{N} Σ_{s=1}^{|S_n|} (W_ns + L_ns) λ_ns] / [Σ_{n=1}^{N} Σ_{s=1}^{|S_n|} λ_ns] (for a mesh, this term includes the latency on both the rows and the columns)

5 EXPERIMENTAL EVALUATIONS
This section validates the proposed analytical model against an industrial cycle-accurate NoC simulator under a wide range of traffic scenarios. The experiment scenarios include real applications and synthetic traffic that allow evaluations with varying injection rates and deflection probabilities. The evaluations include a 6×6 NoC and a 16×16 NoC. All cycle-accurate simulations run for 200K cycles, with a warm-up period of 20K cycles, to allow the NoC to reach the steady state.
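The final aggregation in Algorithm 1 (line 16) is a traffic-weighted average. A minimal sketch, with hypothetical per-(queue, class) inputs flattened into parallel lists:

```python
def average_end_to_end_latency(w, l, lam):
    """Line 16 of Algorithm 1: traffic-weighted average latency.

    w[k], l[k], lam[k] are the waiting time, static source-to-destination
    latency, and injection rate of the k-th (queue, class) pair.
    """
    total_rate = sum(lam)
    return sum((wk + lk) * rate
               for wk, lk, rate in zip(w, l, lam)) / total_rate
```

For example, average_end_to_end_latency([2.0, 4.0], [10.0, 20.0], [0.1, 0.3]) weights the slower class three times as heavily and returns (12·0.1 + 24·0.3)/0.4 = 21.0 cycles.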
One of the key components of the proposed analytical model is estimating the average number of deflected packets. This section evaluates the accuracy of this estimation compared to simulation on a 6×6 NoC with p_d = 0.3.

Figure 6: Estimation accuracy of the average number of packets deflected for each row and column in a 6×6 NoC with p_d = 0.3.

This section evaluates the accuracy of our latency estimation technique when the sources inject packets following a geometric traffic
I n j e c t i o n R a t e ( p a c k e t s / c y c l e / s o u r c e )p d = 0 . 3 Figure 7: Comparison of average latency between simulation, the analytical model proposed in this work, and analyticalmodels proposed in [18, 24] for a 6 × distribution. We note that our technique can also handle burstytraffic, which is significantly harder. However, we start with thisassumption to make a fair comparison to two state-of-the-art tech-niques from the literature [18, 24]. The model presented in [18]does not incorporate multiple traffic classes and deflection routing.On the other hand, the model presented in [24] considers multipletraffic classes but does not consider bursty traffic and deflectionrouting.The evaluations are performed first on the server-like 6 × p d = p d = × × Since real applications exhibit burstiness, it is crucial to performaccurate analytical modeling under bursty traffic. Therefore, thissection presents the comparison of our proposed analytical modelwith respect to simulation under bursty traffic. For an extensiveand thorough validation, we sweep the packet injection rate ( λ ),probability of burstiness ( p br ), and deflection probability ( p d ). Theinjection rates cover a wide range to capture various traffic con-gestion scenarios in the network. Likewise, we report evaluationresults for two different burstiness ( p br = {0.2,0.6}), and three dif-ferent deflection probabilities ( p d = {0.1,0.2,0.3}). The coefficientof variation for the input traffic ( C A ), the final input to the model,is then computed as a function of p br and λ [19]. We simulate the6 × × Table 2: Validation of the proposed analytical model for 6 × × × × p d p br λ E rr o r ( % ) Prop. 7.3 9.6 8.1 14 13 14 8.9 8.0 7.7 13 12 12 9.6 9.2 6.5 11 12 13 1.0 4.1 5.8 4.6 5.2 5.5 0.7 2.3 4.2 6.3 7.3 8.6 0.7 0.9 3.3 6.3 8.5 8.6
Ref[18] 2.6 E E 26 E E 22 E E 39 E E 35 18 E 57 E E 7.0 E E 34 E E 23 E E 45 E E 42 E E 54 E ERef[24] 12 15 23 3.1 18 23 28 41 65 19 33 49 42 45 55 39 35 31 15.3 18 22 18 24 33 30 38 67 31 44 54 41 50 73 42 50 58
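The mapping from (λ, pbr) to CA follows the maximum-entropy analysis of [19], whose closed form we do not reproduce here. As an illustrative sketch only, the snippet below builds a hypothetical two-state bursty Bernoulli arrival process (the process definition and the names `lam` and `p_br` are our own assumptions, not the model of [19]) and estimates the coefficient of variation of the inter-arrival times empirically:

```python
import random
import statistics

def bursty_arrivals(lam, p_br, cycles, seed=0):
    """Generate arrival cycle indices from a hypothetical two-state bursty
    process: with probability p_br the outcome of the previous cycle is
    repeated (a burst); otherwise a fresh Bernoulli(lam) draw is made."""
    rng = random.Random(seed)
    arrivals, prev = [], False
    for t in range(cycles):
        arrived = prev if rng.random() < p_br else rng.random() < lam
        if arrived:
            arrivals.append(t)
        prev = arrived
    return arrivals

def coeff_of_variation(arrivals):
    """C_A: standard deviation over mean of the inter-arrival times."""
    gaps = [b - a for a, b in zip(arrivals, arrivals[1:])]
    return statistics.pstdev(gaps) / statistics.mean(gaps)

# Same mean rate; increasing burstiness clusters arrivals and raises C_A.
c_low = coeff_of_variation(bursty_arrivals(0.3, 0.2, 200_000))
c_high = coeff_of_variation(bursty_arrivals(0.3, 0.6, 200_000))
print(c_low, c_high)
```

At a fixed mean injection rate, raising the burst probability drives CA up, which is exactly the regime where models that ignore burstiness lose accuracy.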
Figure 8: Average latency comparison between simulation, the analytical model proposed in this work, and the analytical models proposed in [18, 24] for a 6×6 NoC running real applications with (a) pd = 0.1 and (b) pd = 0.3.

The right-hand side of Table 2 summarizes the estimation errors obtained on the 6×6 NoC under bursty traffic. Even in the most congested configuration of the sweep, the error is 14%, which is acceptable considering that the network is severely congested. In contrast, the analytical models proposed in [18] overestimate the latency, whereas the models in [24] underestimate it; this is consistent with the results obtained with geometric traffic on the 6×6 NoC.

In addition to the synthetic traffic, the proposed analytical model is evaluated with applications from the SPEC CPU®2006 [14] and SPEC CPU®2017 [9] benchmark suites and the SYSmark®2014 application [2]. Specifically, the evaluation includes the SYSmark14, gcc, bwaves, mcf, GemsFDTD, OMNeT++, Xalan, and perlbench applications. The chosen applications represent a variety of injection rates for each source in the NoC and different levels of burstiness. Each application is profiled offline to find the input traffic parameters; the probability of burstiness spans a range of values across these applications. Figure 8 compares the normalized average latency for pd = 0.1 and pd = 0.3. The proposed model follows the simulation results closely, whereas the models in [18, 24] remain less accurate since they ignore deflected packets, although their errors shrink here because the average traffic loads are small. In conclusion, the proposed technique outperforms the state-of-the-art for real applications and a wide range of synthetic traffic inputs.
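Offline profiling reduces each application trace to the model's input traffic parameters. A minimal sketch of such profiling, assuming the trace is simply a 0/1 per-cycle injection record from one source (the trace format and the lag-1 autocorrelation burstiness indicator are our illustration, not the paper's profiler):

```python
from statistics import mean

def profile_trace(arrived):
    """Estimate injection rate and a burstiness indicator from a 0/1
    per-cycle arrival trace (illustrative estimator, not the paper's)."""
    lam = mean(arrived)
    # Lag-1 autocorrelation of the arrival indicator: ~0 for independent
    # (geometric) traffic, positive when arrivals cluster into bursts.
    pairs = list(zip(arrived, arrived[1:]))
    joint = mean(a * b for a, b in pairs)
    var = lam * (1 - lam)
    rho1 = (joint - lam * lam) / var if var else 0.0
    return lam, rho1

trace = [1, 1, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0]
lam, rho1 = profile_trace(trace)
print(f"rate={lam:.2f}, lag-1 autocorr={rho1:.2f}")
```

A bursty source yields a clearly positive lag-1 autocorrelation even when its mean rate matches that of a smooth source, so the two parameters separate the traffic classes the model needs.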
Figure 9: Execution time of the proposed analytical model (in seconds) for different mesh sizes.
Finally, we evaluate the scalability of the proposed technique to larger NoCs. We note that accuracy results for larger NoCs are not available since no detailed cycle-accurate simulation models exist for them. We implemented the analytical model in C. Figure 9 shows that the analysis completes on the order of seconds for meshes up to 16×16; even at this aggressively scaled size, the analysis finishes in about 5 seconds, which is orders of magnitude faster than cycle-accurate simulation, which would take hours even assuming linear scaling.
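The seconds-scale runtime follows from the structure of the analysis: it iterates a small fixed point over the N² routers instead of simulating cycle by cycle. The toy model below is our own stand-in (not the paper's C implementation); it feeds a deflection term pd from neighboring routers back into each router's load, reports an M/M/1-style mean waiting time, and times the computation for several mesh sizes:

```python
import time

def analyze_mesh(n, lam=0.05, p_d=0.1, iters=50):
    """Toy fixed-point 'analytical model' for an n x n mesh: each router's
    effective load includes a deflection term fed back from its neighbors.
    Returns the mean per-router waiting time (M/M/1-style, unit service)."""
    load = [[lam] * n for _ in range(n)]
    for _ in range(iters):
        new = [[0.0] * n for _ in range(n)]
        for i in range(n):
            for j in range(n):
                nbrs = [load[x][y] for x, y in
                        ((i - 1, j), (i + 1, j), (i, j - 1), (i, j + 1))
                        if 0 <= x < n and 0 <= y < n]
                # Deflected packets re-enter the neighboring routers.
                new[i][j] = lam + p_d * sum(nbrs) / len(nbrs)
        load = new
    waits = [r / (1 - r) for row in load for r in row]  # rho/(1-rho)
    return sum(waits) / len(waits)

for n in (6, 8, 16):
    t0 = time.perf_counter()
    w = analyze_mesh(n)
    dt = time.perf_counter() - t0
    print(f"{n}x{n}: mean wait {w:.3f} cycles, analysis took {dt * 1e3:.1f} ms")
```

Each iteration touches every router once, so the cost of this sketch grows only as O(iters · n²), which is why even the 16×16 case completes almost instantly, mirroring the scaling trend reported in Figure 9.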
CONCLUSION
Industrial NoCs incorporate priority arbitration and deflection routing to minimize buffer requirements and achieve predictable latency. Analytical performance models of these NoCs are needed for design space exploration, fast system-level simulation, and tuning of architectural parameters. However, state-of-the-art performance analysis models for NoCs do not incorporate priority arbitration and deflection routing together. This paper presented a performance analysis technique for industrial priority-aware NoCs with deflection routing under heavy traffic. Experimental evaluations with industrial NoCs show that the proposed technique significantly outperforms existing analytical models under both real applications and a wide range of synthetic traffic workloads.
REFERENCES
[1] Niket Agarwal, Tushar Krishna, Li-Shiuan Peh, and Niraj K. Jha. 2009. GARNET: A Detailed On-chip Network Model inside a Full-System Simulator. 33–42.
[2] Business Applications Performance Corporation (BAPCo). [n.d.]. Benchmark, SYSmark 2014. http://bapco.com/products/sysmark-2014, accessed 27 May 2020.
[3] Dimitri P. Bertsekas, Robert G. Gallager, and Pierre Humblet. 1992. Data Networks. Vol. 2. Prentice-Hall International, New Jersey.
[4] Nathan Binkert et al. 2011. The Gem5 Simulator. SIGARCH Computer Architecture News (May 2011).
[5] Paul Bogdan and Radu Marculescu. 2010. Workload Characterization and Its Impact on Multicore Platform Design. 231–240.
[6] Gunter Bolch, Stefan Greiner, Hermann De Meer, and Kishor S. Trivedi. 2006. Queueing Networks and Markov Chains: Modeling and Performance Evaluation with Computer Science Applications. John Wiley & Sons.
[7] Allan Borodin, Yuval Rabani, and Baruch Schieber. 1997. Deterministic Many-to-Many Hot Potato Routing. IEEE Transactions on Parallel and Distributed Systems 8, 6 (1997), 587–596.
[8] Jack T. Brassil and Rene L. Cruz. 1995. Bounds on Maximum Delay in Networks with Deflection Routing. IEEE Transactions on Parallel and Distributed Systems.
[9] James Bucek, Klaus-Dieter Lange, and Jóakim von Kistowski. 2018. SPEC CPU2017: Next-Generation Compute Benchmark. In Companion of the 2018 ACM/SPEC International Conference on Performance Engineering. 41–42.
[10] Chris Fallin, Chris Craik, and Onur Mutlu. 2011. CHIPPER: A Low-Complexity Bufferless Deflection Router. 144–155.
[11] Chris Fallin, Greg Nazario, Xiangyao Yu, Kevin Chang, Rachata Ausavarungnirun, and Onur Mutlu. 2012. MinBD: Minimally-Buffered Deflection Routing for Energy-Efficient Interconnect. 1–10.
[12] Arijit Ghosh and Tony Givargis. 2003. Analytical Design Space Exploration of Caches for Embedded Systems. 650–655.
[13] Pavel Ghosh, Arvind Ravi, and Arunabha Sen. 2010. An Analytical Framework with Bounded Deflection Adaptive Routing for Networks-on-Chip. 363–368.
[14] John L. Henning. 2006. SPEC CPU2006 Benchmark Descriptions. ACM SIGARCH Computer Architecture News 34, 4 (2006), 1–17.
[15] James Jeffers, James Reinders, and Avinash Sodani. 2016. Intel Xeon Phi Processor High Performance Programming: Knights Landing Edition. Morgan Kaufmann.
[16] Nan Jiang et al. [n.d.]. A Detailed and Flexible Cycle-Accurate Network-on-Chip Simulator. 86–96.
[17] Hany Kashif and Hiren Patel. 2014. Bounding Buffer Space Requirements for Real-Time Priority-Aware Networks. 113–118.
[18] Abbas Eslami Kiasari, Zhonghai Lu, and Axel Jantsch. 2013. An Analytical Latency Model for Networks-on-Chip. IEEE Transactions on Very Large Scale Integration (VLSI) Systems 21, 1 (2013), 113–123.
[19] D. D. Kouvatsos and P. A. Luker. 1984. On the Analysis of Queueing Network Models: Maximum Entropy and Simulation. In UKSC 84. 488–496.
[20] Demetres D. Kouvatsos. 1994. Entropy Maximisation and Queueing Network Models. Annals of Operations Research 48, 1 (1994), 63–126.
[21] Rainer Leupers, Lieven Eeckhout, Grant Martin, Frank Schirrmeister, Nigel Topham, and Xiaotao Chen. 2011. Virtual Manycore Platforms: Moving Towards 100+ Processor Cores. IEEE, 1–6.
[22] Zhonghai Lu, Mingchen Zhong, and Axel Jantsch. 2006. Evaluation of On-chip Networks using Deflection Routing. In Proceedings of the 16th ACM Great Lakes Symposium on VLSI. 296–301.
[23] Sumit K. Mandal, Raid Ayoub, Michael Kishinevsky, Mohammad M. Islam, and Umit Y. Ogras. 2020. Analytical Performance Modeling of NoCs under Priority Arbitration and Bursty Traffic. IEEE Embedded Systems Letters (2020).
[24] Sumit K. Mandal, Raid Ayoub, Michael Kishinevsky, and Umit Y. Ogras. 2019. Analytical Performance Models for NoCs with Multiple Priority Traffic Classes. ACM Transactions on Embedded Computing Systems (TECS) 18, 5s (2019).
[25] Thomas Moscibroda and Onur Mutlu. 2009. A Case for Bufferless Routing in On-Chip Networks. In Proceedings of the 36th Annual International Symposium on Computer Architecture. 196–207.
[26] Umit Y. Ogras, Paul Bogdan, and Radu Marculescu. 2010. An Analytical Approach for Network-on-Chip Performance Analysis. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 29, 12 (2010), 2001–2013.
[27] Maurizio Palesi and Tony Givargis. 2002. Multi-Objective Design Space Exploration using Genetic Algorithms. In Proceedings of the Tenth International Symposium on Hardware/Software Codesign. 67–72.
[28] Michele Petracca, Benjamin G. Lee, Keren Bergman, and Luca P. Carloni. 2009. Photonic NoCs: System-Level Design Exploration. IEEE Micro 29, 4 (2009), 74–85.
[29] Guy Pujolle and Wu Ai. 1986. A Solution for Multiserver and Multiclass Open Queueing Networks. INFOR: Information Systems and Operational Research 24, 3 (1986), 221–230.
[30] Zhi-Liang Qian, Da-Cheng Juan, Paul Bogdan, Chi-Ying Tsui, Diana Marculescu, and Radu Marculescu. 2015. A Support Vector Regression (SVR)-Based Latency Model for Network-on-Chip (NoC) Architectures. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 35, 3 (2015), 471–484.
[31] Efraim Rotem. 2015. Intel Architecture, Code Name Skylake Deep Dive: A New Architecture to Manage Power Performance and Energy Efficiency. In Intel Developer Forum.
[32] Sriram R. Vangal et al. 2008. An 80-Tile Sub-100-W TeraFLOPS Processor in 65-nm CMOS. IEEE Journal of Solid-State Circuits 43, 1 (2008), 29–41.
[33] Joris Walraevens. 2004. Discrete-Time Queueing Models with Priorities. Ph.D. Dissertation. Ghent University.
[34] David Wentzlaff et al. 2007. On-chip Interconnection Architecture of the Tile Processor. IEEE Micro 27, 5 (2007), 15–31.
[35] Yulei Wu, Geyong Min, Mohamed Ould-Khaoua, Hao Yin, and Lan Wang. 2010. Analytical Modelling of Networks in Multicomputer Systems under Bursty and Batch Arrival Traffic. The Journal of Supercomputing 51, 2 (2010), 115–130.
[36] Sungjoo Yoo, Gabriela Nicolescu, Lovic Gauthier, and Ahmed Amine Jerraya. 2002. Automatic Generation of Fast Timed Simulation Models for Operating Systems in SoC Design.