AdEle: An Adaptive Congestion-and-Energy-Aware Elevator Selection for Partially Connected 3D NoCs

Abstract

By lowering the number of vertical connections in fully connected 3D networks-on-chip (NoCs), partially connected 3D NoCs (PC-3DNoCs) help alleviate reliability and fabrication issues. This paper proposes a novel, adaptive congestion- and energy-aware elevator-selection scheme called AdEle to improve the traffic distribution in PC-3DNoCs. AdEle employs an offline multi-objective simulated-annealing-based algorithm to find good elevator subsets and an online elevator selection policy to enhance elevator selection during routing. Compared to the state-of-the-art techniques under different real-application traffics and configuration scenarios, AdEle improves the network latency by 14.9% on average (up to 21.4%) with less than 10.5% energy consumption overhead.


Ebadollah Taheri, Ryan G. Kim, and Mahdi Nikdast

Department of Electrical and Computer Engineering, Colorado State University, Fort Collins, CO 80523, USA


Index Terms: Partially connected 3D networks-on-chip, through-silicon via, simulated annealing, elevator selection.

I. INTRODUCTION

Network-on-chip (NoC) has become the prevailing solution to enable scalable on-chip communication in manycore systems. Moreover, with the advances in three-dimensional (3D) integration technologies, 3D NoCs are emerging to further improve the heterogeneity and integration density by vertically stacking multiple dies connected with an efficient die-to-die interconnect [1]. Among different vertical interconnect technologies, through-silicon vias (TSVs) promise high bandwidth and low power [2]–[4].

TSV-based 3D NoCs have been proposed for various applications (e.g., [5], [6]). However, vertical links in TSV-based 3D NoCs use multiple TSVs in a bundle, resulting in high area overhead due to the large TSV interconnect-pitch and keep-out-zone requirements [7]. Also, TSVs are particularly susceptible to electromigration and capacitive crosstalk-induced issues [8], [9]. Therefore, 3D NoC architectures with TSVs at every router (i.e., fully connected) impose higher design complexity, fabrication costs, and performance degradation [1], [4]. Addressing such challenges has motivated the development of 3D NoCs with fewer TSV-based vertical links, also known as partially connected 3D NoCs (PC-3DNoCs) [1], [10], [11].

Nevertheless, PC-3DNoCs introduce some new design challenges because of their partial vertical connectivity [11], [12]. In particular, the vertical links (a.k.a. the elevators) must be shared among multiple routers, potentially creating traffic hotspots at the elevators and increasing the network latency [1]. To balance the traffic at these hotspot elevators, an adaptive routing technique is needed to select less utilized elevators without detouring far from the minimal path (i.e., the elevator-selection problem). Yet, initial routing solutions in PC-3DNoCs (e.g., Elevator-First routing [10]) naively select the nearest elevator without considering the traffic, resulting in unbalanced elevator utilization. To reduce congestion, advanced methods (e.g., CDA [12]) use global traffic information to improve the traffic distribution during runtime. However, retrieving global traffic information increases both the hardware overhead and network traffic.

This paper addresses the elevator-selection problem in PC-3DNoC routing techniques by developing, for the first time, a novel, congestion- and energy-aware adaptive elevator-selection scheme called AdEle. AdEle works in two stages to balance the traffic with minimal overhead: an offline elevator-set optimization and an online elevator-selection policy. In the offline elevator-set optimization, AdEle uses a multi-objective simulated-annealing-based optimization algorithm (AMOSA [13]) to collectively choose an optimized subset of elevators for each source router that minimizes the average latency and energy under an assumed traffic scenario. During runtime, each router monitors its local traffic and selects one elevator from its subset to improve the latency of the network. AdEle employs a low-overhead local traffic monitoring technique that examines the blocking as a proxy for path congestion, balancing the elevator traffic while eliminating the overhead of global traffic monitoring used in other approaches. Our results, simulated using different real-application traffics and configuration scenarios, show the promise of AdEle compared to the state-of-the-art techniques: on average, AdEle improves the network latency by 14.9% (up to 21.4%) with only 10.3% (up to 10.5%) energy consumption overhead.

The rest of the paper is organized as follows. We review the recent related work on PC-3DNoCs in Section II. Section III discusses the elevator-selection problem and its complexity in PC-3DNoCs and details our proposed technique and its implementation. We present our simulation results in Section IV. Finally, Section V concludes the paper.

II. BACKGROUND AND RELATED WORK

Employing conventional dimension-order routing algorithms in PC-3DNoCs will result in deadlock because of the irregular topology in such networks. To prevent deadlock, the Elevator-First routing algorithm [10] employs two virtual networks to break cyclic dependencies. Moreover, as the elevator-less routers cannot directly send packets to other layers, an elevator is selected for each packet to facilitate the inter-layer communication. Leveraging such a principle, several routing algorithms have been proposed for PC-3DNoCs [3], [14]. However, they follow an elevator-selection policy that ignores elevators' load distribution and the minimal path. This can be especially harmful for PC-3DNoCs with non-uniform elevator placements, a small number of elevators, or non-uniform traffic distributions. Adaptive elevator-selection techniques have been proposed [4], [11], [15] but mainly focus on elevator failure concerns. These strategies select the closest non-faulty elevator to the source without considering the elevator's congestion, causing them to suffer from high energy and latency costs.

Fig. 1. (a) An example PC-3DNoC with three elevators ($e_1$, $e_2$, and $e_3$). The routing path from S to D based on the Elevator-First algorithm [10] (dotted red line) and the minimal path (solid blue line) are shown. The middle-layer routers are colored based on their Elevator-First selected elevator. (b) Traffic load on each router in the middle layer: one elevator is highly congested because of the inefficient elevator selection in the Elevator-First algorithm.

To improve the traffic distribution in PC-3DNoCs, [16] proposed an optimized elevator-selection scheme using the Tabu search algorithm. However, the offline Tabu optimization cannot capture the dynamics of the runtime network traffic. Also, the search algorithm ignores the network energy efficiency during the elevator selection. In [12], an online elevator-selection scheme called CDA selects the elevator based on the buffer utilization of the routers between a source and the elevator. However, CDA requires online global information of the network buffer utilization, which imposes high latency and hardware overheads to share this information.

Considering the aforementioned works, an efficient elevator-selection solution is essential but has yet to be addressed for PC-3DNoCs. We take on this challenge by developing a novel, adaptive congestion- and energy-aware elevator-selection scheme (AdEle). Offline elevator-selection approaches enjoy low overhead while online approaches achieve better network latency and energy consumption. Accordingly, AdEle combines the benefits of both approaches while also considering energy consumption. On top of being energy-aware, AdEle includes elevator redundancy and online policies to accommodate dynamic traffic behavior. We will show that using a set of elevators instead of one elevator for each router can greatly improve network performance. Also, our proposed approach only utilizes local information of routers to effectively manage elevator congestion with low overheads.

III. PROPOSED ELEVATOR-SELECTION SCHEME: ADELE

This section details our proposed adaptive congestion- and energy-aware elevator-selection scheme. As shown in Fig. 2, AdEle uses an offline multi-objective simulated-annealing-based algorithm (AMOSA) to find an optimal subset of elevators for each router, and an online elevator-selection algorithm to improve elevator selection in the presence of runtime traffic. The following discusses the novel contributions of AdEle.

Fig. 2. An overview of our proposed elevator-selection scheme: AdEle. Offline optimization (Section III.B): an elevator-set search (AMOSA) takes the elevator configuration, the traffic pattern, and the objectives (utilization and distance) and produces one elevator subset per router. Online elevator selection (Section III.C): in each router, a traffic monitor, an enhanced round robin, and a low-traffic override select the elevator.

A. Motivation: Routing in PC-3DNoCs

In PC-3DNoCs, because of the irregular topology, the routing process requires three main steps: 1) selecting an elevator for each packet in the source router and then routing the packet to that elevator; 2) vertically routing the packet to the destination layer; and 3) routing the packet from the elevator to the destination node. In this routing process, the elevator selection (the first step) is critical as the number of vertical paths (elevators) is much smaller than the number of horizontal paths, putting significantly more traffic pressure on the elevators.

Fig. 1(a) shows an example of a PC-3DNoC with three elevators ($e_1$–$e_3$) using Elevator-First-based elevator selection [4], [10], [11] (i.e., the closest elevator to the source router is selected). Routers are colored with the color of the elevator they would use under the Elevator-First policy: four routers will use the green elevator, seven will use the blue elevator, and five will use the red elevator. Unfortunately, such an uneven elevator utilization can put severe traffic pressure on certain elevators. Ideally, some of the load on the congested elevator could be assigned to the other two elevators, making it less congested. Fig. 1(b) demonstrates the utilization of the middle-layer routers with the Elevator-First selection policy under uniform traffic. As can be seen, one elevator is highly congested due to the uneven elevator selection. In terms of energy efficiency, the best elevator selection is on the minimal path between the source and destination. However, as can be seen in Fig. 1(a) for the path between S and D, policies like Elevator-First (dotted red line) do not necessarily choose the minimal path (solid blue line).

AdEle will consider both traffic distribution and energy efficiency to select optimal elevators and evenly distribute traffic loads among the elevators. To the best of our knowledge, AdEle is the first congestion- and energy-aware elevator-selection scheme in PC-3DNoCs that includes elevator redundancy and online policies to accommodate dynamic traffic behavior while relying only on local router information.

B. Optimal Elevator-Subset for Each Router

To find the optimal subset of elevators for each router, AdEle performs an offline optimization to distribute the expected traffic load across all elevators and minimize the average inter-node (source to destination) distance. To do this, we first define two optimization objectives: 1) elevator-utilization variance to improve the traffic load distribution, and 2) average inter-node distance to minimize the energy consumption. Leveraging these objective functions, we will use a multi-objective simulated-annealing-based algorithm (AMOSA [13]) to find the optimal elevator subsets.

1) Objective 1 - Elevator Utilization:

To balance the traffic on the elevators, AdEle attempts to minimize the elevator-utilization variance. As discussed above, it is important to evenly distribute the traffic over elevators to avoid highly congested elevators. To calculate the utilization variance, let us consider an $N$-node/router network with a set of elevators $\mathcal{E} = \{e_1, e_2, \ldots, e_E\}$, where $E$ is the total number of elevators. Moreover, assume that during runtime, each router $i$ can select its elevator from a subset $A_i \subseteq \mathcal{E}$. For simplicity, for now we assume that each router selects each elevator from its elevator subset ($A_i$) uniformly (e.g., using a round-robin policy). Therefore, the utilization of elevator $e$ ($U_e$) is:

$$U_e = \sum_{i=1}^{N} \frac{1}{|A_i|} \sum_{j=1}^{N} f_{ij} \cdot P_{ije}, \tag{1}$$

where $f_{ij}$ is the frequency of traffic between routers $i$ and $j$, and $P_{ije}$ denotes whether the routing between routers $i$ and $j$ uses the elevator $e$ ($P = 1$) or not ($P = 0$). Leveraging (1), the average traffic over all the elevators ($\mu$) is:

$$\mu = \frac{1}{E} \sum_{i=1}^{E} U_i. \tag{2}$$

Using (1) and (2), the elevator-utilization variance is:

$$\sigma^2 = \frac{1}{E} \sum_{i=1}^{E} (U_i - \mu)^2. \tag{3}$$

Minimizing the elevator-utilization variance will result in a better distribution of traffic load on the elevators and lower network latency.
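As an illustration of Objective 1, the following minimal Python sketch evaluates the elevator utilization, its mean, and its variance for a given assignment of elevator subsets. It is not the authors' implementation: the names `elevator_utilization`, `utilization_variance`, `subsets`, `freq`, and `layer` are hypothetical, and the sketch assumes that $P_{ije} = 1$ exactly when elevator $e$ is in $A_i$ and routers $i$ and $j$ sit on different layers.

```python
def elevator_utilization(subsets, freq, layer, num_elevators):
    """Utilization U_e of every elevator, following the form of Eq. (1).

    subsets[i] : list of elevator indices router i may use (A_i)
    freq[i][j] : traffic frequency f_ij between routers i and j
    layer[i]   : layer index of router i (assumption: only inter-layer
                 traffic needs an elevator, so P_ije = 0 on the same layer)
    """
    util = [0.0] * num_elevators
    n = len(subsets)
    for i in range(n):
        share = 1.0 / len(subsets[i])  # uniform choice within A_i
        for j in range(n):
            if layer[i] == layer[j]:
                continue  # intra-layer traffic does not load any elevator
            for e in subsets[i]:
                util[e] += share * freq[i][j]
    return util


def utilization_variance(util):
    """Mean-centered variance of the elevator utilizations (Eqs. (2)-(3))."""
    mu = sum(util) / len(util)
    return sum((u - mu) ** 2 for u in util) / len(util)


# Toy example: four routers on two layers, two elevators, uniform traffic.
layer = [0, 0, 1, 1]
freq = [[1.0] * 4 for _ in range(4)]
balanced = elevator_utilization([[0], [0, 1], [1], [0, 1]], freq, layer, 2)
skewed = elevator_utilization([[0], [0], [0], [0]], freq, layer, 2)
print(balanced, utilization_variance(balanced))
print(skewed, utilization_variance(skewed))
```

In the toy example, the balanced assignment spreads the inter-layer traffic evenly over both elevators, while the skewed assignment sends everything through one elevator and therefore yields a much larger variance.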

2) Objective 2 - Average Distance:

To improve network energy efficiency, AdEle attempts to minimize the average distance. As elevator selection is under consideration, we only consider inter-layer traffic here. Therefore, the distance between inter-layer nodes $i$ and $j$ over an elevator $e$ can be defined as:

$$D_{ij}^{e} = \begin{cases} 0, & i \text{ and } j \text{ are on the same layer} \\ d_{se} + d_{e} + d_{ed}, & \text{otherwise,} \end{cases} \tag{4}$$

where $d_{se}$, $d_{e}$, and $d_{ed}$ are the Manhattan distances between the source and elevator, on the elevator (inter-die), and from the elevator to the destination, respectively. Based on (4), the average inter-layer-node distance in an $L$-layer network is:

$$AD = \frac{1}{N \times \left( \frac{L-1}{L} \times N \right)} \sum_{i=1}^{N} \frac{1}{|A_i|} \sum_{e=1}^{|A_i|} \sum_{j=1}^{N} D_{ij}^{e}. \tag{5}$$
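Objective 2 can be sketched in the same way. The Python snippet below is an illustrative sketch only, not the authors' code: it assumes each router is identified by hypothetical `(x, y, z)` coordinates, that an elevator at planar position `(x, y)` spans all layers, and that the normalization counts the $N \times \frac{L-1}{L} \times N$ ordered inter-layer pairs.

```python
def planar_distance(a, b):
    """Manhattan distance in the x-y plane between two positions."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def distance_via(src, dst, elev):
    """D_ij^e in the form of Eq. (4): zero for same-layer pairs,
    otherwise d_se + d_e + d_ed."""
    if src[2] == dst[2]:
        return 0
    return planar_distance(src, elev) + abs(src[2] - dst[2]) + planar_distance(elev, dst)


def average_distance(nodes, subsets, elevators, num_layers):
    """Average inter-layer distance AD in the form of Eq. (5).

    nodes[i]     : (x, y, z) coordinates of router i (hypothetical layout)
    subsets[i]   : indices into `elevators` that router i may use (A_i)
    elevators[e] : (x, y) position of elevator e, assumed to span all layers
    """
    n = len(nodes)
    inter_layer_pairs = n * (num_layers - 1) / num_layers * n
    total = 0.0
    for i, src in enumerate(nodes):
        per_router = sum(
            distance_via(src, dst, elevators[e])
            for e in subsets[i]
            for dst in nodes
        )
        total += per_router / len(subsets[i])  # uniform choice within A_i
    return total / inter_layer_pairs


# Toy example: a 2x1x2 network with a single elevator at (0, 0).
nodes = [(0, 0, 0), (1, 0, 0), (0, 0, 1), (1, 0, 1)]
ad = average_distance(nodes, [[0]] * 4, [(0, 0)], num_layers=2)
print(ad)
```

Evaluating both objective sketches on a candidate assignment gives the pair of values that a multi-objective search would try to minimize jointly.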

3) Multi-Objective Optimization:

We use a multi-objective simulated-annealing-based optimization algorithm (AMOSA [13]) to find a set of optimal elevator subsets for all the routers in the network ($A = \{A_1, \ldots, A_N\}$) while minimizing the objective functions in (3) and (5). As AMOSA is a multi-objective optimization search, it offers a set of solutions that lie on the Pareto front of the optimization objectives (see [13] for more details). AMOSA-based optimization in AdEle provides different optimal solutions in terms of latency and energy efficiency. From these solutions, a designer can make trade-offs when choosing between more latency-aware or energy-aware solutions (see Fig. 3). The selection of solutions is discussed in detail in Section IV.

C. Adaptive Elevator Selection

Here, we discuss how a router $i$ can efficiently select an elevator during runtime from its elevator subset ($A_i$) identified in the previous subsection. As we are interested in an even distribution of traffic load over all the elevators to improve traffic congestion during runtime, we apply an enhanced round-robin (RR) algorithm to select an elevator. In a conventional RR approach, elevators would be selected in a sequential order without considering the runtime traffic. To account for runtime traffic, we include the probability of skipping ($P^{S}_{ik}$) a congested elevator ($k$) for router $i$ in the RR approach. $P^{S}_{ik}$ is adjusted based on the average latency imposed by the elevator $k$, i.e., higher latencies seen using elevator $k$ increase the probability of skipping it in the future. Accordingly, AdEle can adaptively manage dynamic traffic loads and congestion.

To find $P^{S}_{ik}$, let us first define a cost function associated with making a selection from an elevator subset. After selecting an elevator, AdEle estimates the cost of this selection by considering the time between when the first flit (the header flit) and when the last flit (the tail flit) leave the source router. The latency imposed by selecting an elevator $e_k$ from a subset $A_i$ is:

$$T_{e_k} = \frac{t_{tail} - t_{head} - l_p}{l_p}, \tag{6}$$

where $t_{tail}$ and $t_{head}$ denote the time when the tail flit and the header flit leave the source router, respectively. Also, $l_p$ is the length of the packet. The elevator-selection cost ($C_k$) can be updated using the latency of the last selection defined in (6) and based on:

$$C_k \leftarrow (a \times T_{e_k}) + ((1 - a) \times C_k), \quad 0 \le a \le 1, \tag{7}$$

where $a$ is a coefficient to increase or decrease the impact of the new cost versus the old cost. We have experimentally found that $a$ = […] only local information. With wormhole switching, any blocking in an elevator can be propagated along the path from the elevator to the source router. Therefore, blocking at a source router can be interpreted as blocking in the elevator.
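The cost bookkeeping in (6) and (7) amounts to an exponentially weighted moving average of the normalized blocking time. The short Python sketch below illustrates this under stated assumptions: the function names are hypothetical, times are in cycles, the packet length is in flits, and the coefficient value `a = 0.5` is an arbitrary placeholder rather than the value found experimentally in the paper.

```python
def blocking_latency(t_head, t_tail, packet_len):
    """Normalized blocking time of one packet, following Eq. (6).

    With no blocking, the tail flit is assumed to leave packet_len cycles
    after the header flit, which gives a latency of zero.
    """
    return (t_tail - t_head - packet_len) / packet_len


def update_cost(cost, latency, a):
    """Blend the newest latency sample into the running cost, per Eq. (7)."""
    if not 0.0 <= a <= 1.0:
        raise ValueError("a must lie in [0, 1]")
    return a * latency + (1.0 - a) * cost


# Example: an 8-flit packet whose tail leaves 16 cycles after its header
# was blocked for 8 extra cycles, i.e., one packet length.
sample = blocking_latency(t_head=100, t_tail=116, packet_len=8)
cost = update_cost(cost=0.0, latency=sample, a=0.5)  # a = 0.5 is a placeholder
print(sample, cost)
```

A larger `a` makes the cost react faster to the most recent packet, while a smaller `a` smooths out short bursts of blocking; both inputs are observable at the source router, so no global information is needed.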
Note that incorporating global-network information into AdEle would improve the selection policy but be less practical as it will impose high hardware area, energy consumption, and latency costs.

TABLE I. Simulation setup: network sizes (4×4×4 and 8×8×4) and elevator-placement patterns ($P_{S1}$, $P_{S2}$, $P_{S3}$, and $P_M$).

Considering (7), we can define router $i$'s relative cost of selecting elevator $k$ from $A_i$ versus other possible elevators:

$$C^{rel}_{ik} = \frac{C_k}{\sum_{p=0}^{|A_i|} C_p}. \tag{8}$$

Based on the relative cost of a particular elevator selection, the possibility of skipping that elevator in the RR approach is:

$$P^{S}_{ik} = \begin{cases} 1 - \xi, & \text{if } C^{rel}_{ik} \ge \frac{2}{N}; \\ N \times \left( C^{rel}_{ik} - \frac{1}{N} \right) \times (1 - \xi), & \text{if } \frac{2}{N} > C^{rel}_{ik} \ge \frac{1}{N}; \\ 0, & \text{otherwise.} \end{cases} \tag{9}$$

Here, $\xi$ is considered to allow for exploring new solutions even under high relative costs ($\xi$ = […]). To see the role of $\xi$, suppose that the $P^{S}_{ik}$ of a selection is 1 because of high congestion. In this case, the elevator $k$ will not be selected in the RR sequence at all and will have no chance to update its elevator-selection cost ($C_k$). This would keep $P^{S}_{ik}$ high and prevent the elevator from observing any changes in its cost. To address such an update failure, $\xi$ allows every elevator to be selected with a low probability regardless of $P^{S}_{ik}$ so the cost function has a chance for updating.

To improve energy efficiency, when $C_k$ is below a threshold for all $k$ (low-latency applications) and congestion is not a concern, AdEle will instead choose the elevator along the minimal path (discussed in Section III.A). Here, we experimentally find the threshold that minimizes the latency for each traffic and elevator configuration. Our future work will investigate dynamic threshold management.

IV. SIMULATION AND EVALUATION RESULTS

Here, we compare AdEle against two well-known elevator-selection approaches: Elevator-First [10] and CDA [12]. The simulation setup is shown in Table I. In PC-3DNoCs, the number and location of elevators are limited by hardware constraints [9]. Therefore, AdEle is evaluated using different elevator-placement patterns to show that its efficacy is independent of any such patterns. Also, because of the performance-area trade-off in PC-3DNoCs [1], various elevator concentrations might be employed. Therefore, here we simulate different concentrations of elevators to show that AdEle's performance is not limited by elevator concentration. Three elevator patterns are considered for a 4×4×4 network ($P_{S1}$–$P_{S3}$) with different levels of elevator concentration. Two of these patterns are extracted to have an optimized average distance and the third is based on [4]. A large network (8×8×4) is simulated to show the scalability of AdEle. The pattern for this network ($P_M$) is also extracted based on the average distance optimization.

Fig. 3. Elevator-selection solutions found by AMOSA optimization in AdEle (average distance plotted for the optimized solutions, 0.1% of the explored solutions, the selected solutions $S_1$–$S_6$, and Elevator-First).

TABLE II. Performance of selected solutions from Fig. 3

                           Elev.-First   S1    S2    S3    S4     S5     S6
Average latency (cycles)   131           463   159   132   90.5   88.1   78
Energy/flit (nJ)           […]

AdEle's offline optimization (see Section III.B) is implemented in Python to extract the elevator subsets for routers. These subsets are added to the AdEle router implemented in the Access Noxim simulator [17]. We considered uniform traffic for the offline optimization, the most pessimistic assumption (i.e., traffic is not known a priori), while the network simulations are done using different synthetic and real-application traffics. Our analysis will demonstrate that AdEle does not require runtime traffic in its offline optimization as its online selection policy will adjust to runtime traffic. However, AdEle can use the runtime traffic during elevator-subset selection to offer further latency and energy improvement.

A. AMOSA Elevator-Subset Exploration

As discussed in Section III, AMOSA finds various solutions with different latency and energy efficiency. To show the solution selection process, the optimization for $P_M$ is detailed here. A small sample of AMOSA's explored solutions is shown in Fig. 3. As AMOSA explores the solution space, it makes its way towards the Pareto front (blue curve) to find the optimal trade-offs between utilization variance and average distance. Given the set of solutions, the final solution can be selected depending on the importance of energy efficiency (average distance) and latency (utilization variance). For brevity, several of these points spread along the Pareto front are selected for network simulation ($S_1$ to $S_6$), and the results are summarized in Table II. Considering Table II and Fig. 3, lower utilization variance and lower average distance improve the latency and energy consumption, respectively. As we are able to significantly reduce the latency with fairly minimal increases in energy, we select S[…] for further analysis. Similarly, we select the solution for $P_{S1}$–$P_{S3}$.

B. AdEle Performance Under Synthetic Traffic

To compare AdEle with Elevator-First and CDA, we first evaluate the average latency under uniform and shuffle traffic patterns and with different elevator-placement patterns in Fig. 4. Across all the traffic and elevator-placement patterns, AdEle achieves the lowest latency and highest saturation threshold. Note that CDA is able to approach AdEle's performance because it considers global intra-layer traffic. In this work, we do not consider the high cost of CDA's global information sharing and optimistically assume that the information is instantaneously received at every router. In reality, CDA will likely perform much worse with stale information or include significant implementation overhead. With a higher elevator density (e.g., $P_S$[…]), the elevator congestion issue is less critical and intra-layer traffic will be more critical. Similarly, in a network with larger horizontal dimensions like $P_M$, intra-layer traffic is more important. Yet, AdEle shows better performance even with a high density of elevators and for $P_M$. Recall that AdEle's offline optimization step used uniform traffic. Yet, as Figs. 4(e)–(h) show, while the traffic is new for AdEle, it still achieves the lowest latency because its online selection policy can monitor runtime congestion and select better elevators. If the traffic is known a priori, AdEle can use this traffic information during offline optimization to improve elevator-subset selection even further. For $P_M$ in Figs. 4(d) and 4(h), we also include the average latency of AdEle with standard RR selection (AdEle-RR). This demonstrates that AdEle's proposed online skipping policy achieves higher improvements in latency compared to RR in both uniform and shuffle traffic patterns.

Fig. 4. Average latency (cycles) versus packet injection rate for Elevator-First, CDA, and AdEle under uniform (a–d) and shuffle (e–h) traffic and with different elevator-placement patterns: the $P_S$ patterns in (a)–(c) and (e)–(g), and $P_M$ in (d) and (h).

To show the main reason for latency improvement when using AdEle, the load distribution over routers with elevators for $P_S$[…] is shown in Fig. 5. The white bar shows the average load over elevator-less routers. The other colored bars show the load over different elevators. As can be seen, AdEle reduces the load on the highest utilized elevator (blue elevator).

Fig. 5. Traffic load over routers with elevators (blue, green, and red) normalized to the average load over routers without an elevator (white bar), for ElevFirst, CDA, and AdEle.

The energy consumption for each approach and elevator placement is shown in Fig. 6 for low (1E− […] 4) and high (1E− […] 3) injection rates based on the saturation point (the injection rate at which latency is 10× the zero-load latency) for each configuration. For low injection rates, AdEle has the lowest energy consumption because it switches to minimal routing and uses the minimal paths. On the other hand, AdEle incurs a small energy overhead (less than 8.2% compared to CDA) under high injection rates to take non-minimal paths and improve traffic congestion. If less energy overhead is desired, AdEle can use configurations with lower energy (see Table II).

Fig. 6. Energy per flit for Elevator-First (ElevFirst), CDA, and AdEle normalized to ElevFirst, and under different injection rates: (a) low injection rate and (b) high injection rate, each for $P_{S1}$, $P_{S2}$, $P_{S3}$, and $P_M$.

C. AdEle Performance under Real-Application Traffic

We extracted the traffic of several SPLASH-2 [18] and PARSEC [19] benchmarks using Gem5 [20] for real-application simulations. Because Gem5 is limited to 64 cores, we demonstrate our results for $P_{S1}$–$P_{S3}$. As shown in Fig. 7, AdEle improves the network latency in nearly all cases. In particular, AdEle has more improvements in applications with higher traffic loads (canneal, fft, radix, and water) as there is more opportunity to reduce the resulting elevator congestion. In applications with lower traffic loads (fluidanimate and lu), AdEle maintains similar performance to the other approaches as there is little contention on the elevators and the latency is close to zero-load latency. Although $P_S$[…] still shows some improvements for AdEle, the lower number of elevators (three) results in minimal opportunity for AdEle to redirect traffic and improve latency. On average, AdEle improves the network latency by 14.9% (up to 21.4%) compared to CDA and by 29.9% (up to 52.1%) compared to Elevator-First under $P_{S1}$–$P_{S3}$. Fig. 7(d) shows, for each elevator-placement pattern ($P_{S1}$–$P_{S3}$), the average energy over all the applications normalized to Elevator-First. AdEle imposes a small overhead because it may route packets over non-minimal paths in case of congestion to improve latency. Compared to CDA, AdEle has on average 10.5%, 10.3%, and 10% energy overhead under $P_{S1}$, $P_{S2}$, and $P_{S3}$, respectively.

Fig. 7. Latency ((a)–(c), per application: canneal, fft, fluidanimate, lu, radix, water, and Avg.*) and energy ((d), averaged across all applications) for Elevator-First (ElevFirst), CDA, and AdEle normalized to ElevFirst under real-application traffic with different elevator-placement patterns. *Avg. in (a)–(c) is the average of all six applications.

TABLE III. Area analysis

                    Cycles   Router area (µm²)
Base (ElevFirst)    1        35550
Overhead: CDA*      […]

*Global information sharing is not included.

D. Hardware-Area Analysis and Comparison

The router hardware for Elevator-First, AdEle, and CDA is implemented and analyzed using Cadence Genus in 45 nm technology. Here, we consider a 1 GHz clock. The results are shown in Table III. Compared to CDA*, AdEle has a smaller area overhead. This is because AdEle only requires local traffic information while CDA requires a table to save global traffic information and find the best path in each router. However, the area overhead of CDA* is an optimistic assumption here as it does not include any overhead related to the actual sharing of information. Therefore, a real CDA implementation will likely impose higher area and latency overheads. Also, AdEle does not affect the router stages and will scale well with the network size, while CDA requires an additional cycle (or more for larger networks) to update its tables.

V. CONCLUSION

This paper proposes AdEle, an adaptive congestion- and energy-aware elevator-selection scheme to address elevator overutilization in partially connected 3D NoCs. Employing a set of elevators instead of one elevator for each source router, AdEle monitors the network traffic and provides an online policy to select the proper elevator while considering runtime traffic loads. AdEle only requires local router information and is able to improve average latency in various scenarios under both synthetic and real traffic at the cost of less than 10.5% in energy consumption. Moreover, AdEle can be easily adjusted to consider faults, which is of great interest in PC-3DNoCs, while considering elevator congestion.

REFERENCES

[1] A. I. Arka et al., “Making a case for partially connected 3D NoC: NFIC versus TSV,” ACM JETC, vol. 16, no. 4, pp. 1–17, 2020.
[2] T. Lu et al., “TSV-based 3-D ICs: Design methods and tools,” IEEE TCAD, vol. 36, no. 10, pp. 1593–1619, 2017.
[3] R. Salamat et al., “LEAD: An adaptive 3D-NoC routing algorithm with queuing-theory based analytical verification,” IEEE TC, vol. 67, no. 8, pp. 1153–1166, 2018.
[4] A. Coelho et al., “FL-RuNS: A high-performance and runtime reconfigurable fault-tolerant routing scheme for partially connected three-dimensional networks on chip,” IEEE TNANO, vol. 18, pp. 806–818, 2019.
[5] X. Wang et al., “HRC: A 3D NoC architecture with genuine support for runtime thermal-aware task management,” IEEE TC, vol. 66, no. 10, pp. 1676–1688, 2017.
[6] T. H. Vu et al., “Fault-tolerant spike routing algorithm and architecture for three dimensional NoC-based neuromorphic systems,” IEEE Access, vol. 7, pp. 90436–90452, 2019.
[7] F. Wang et al., “An effective approach of reducing the keep-out-zone induced by coaxial through-silicon-via,” IEEE T-ED, vol. 61, no. 8, pp. 2928–2934, 2014.
[8] T. Frank et al., “Resistance increase due to electromigration induced depletion under TSV,” in IEEE IRPS, 2011.
[9] A. Eghbal et al., “Analytical fault tolerance assessment and metrics for TSV-based 3D network-on-chip,” IEEE TC, vol. 64, no. 12, pp. 3591–3604, 2015.
[10] F. Dubois et al., “Elevator-first: A deadlock-free distributed routing algorithm for vertically partially connected 3D NoCs,” IEEE TC, vol. 62, no. 3, pp. 609–615, 2011.
[11] E. Taheri et al., “Addressing a new class of reliability threats in 3-D network-on-chips,” IEEE TCAD, vol. 39, no. 7, pp. 1358–1371, 2020.
[12] Y. Fu et al., “Congestion-aware dynamic elevator assignment for partially connected 3D-NoCs,” in IEEE ISCAS, 2019.
[13] S. Bandyopadhyay et al., “A simulated annealing-based multiobjective optimization algorithm: AMOSA,” IEEE Transactions on Evolutionary Computation, vol. 12, no. 3, pp. 269–283, 2008.
[14] J. Lee et al., “Redelf: An energy-efficient deadlock-free routing for 3D NoCs with partial vertical connections,” ACM JETC, vol. 12, no. 3, pp. 1–22, 2015.
[15] R. Salamat et al., “A resilient routing algorithm with formal reliability analysis for partially connected 3D-NoCs,” IEEE TC, vol. 65, no. 11, pp. 3265–3279, 2016.
[16] S. Foroutan et al., “Assignment of vertical-links to routers in vertically-partially-connected 3-D NoCs,” IEEE TCAD, vol. 33, no. 8, pp. 1208–1218, 2014.
[17] K.-Y. Jheng et al., “Traffic-thermal mutual-coupling co-simulation platform for three-dimensional network-on-chip,” in IEEE VLSI-DAT, 2010.
[18] S. C. Woo et al., “The SPLASH-2 programs: Characterization and methodological considerations,” ACM SIGARCH Computer Architecture News, vol. 23, no. 2, pp. 24–36, 1995.
[19] C. Bienia and K. Li, Benchmarking modern multiprocessors. Princeton University Princeton, NJ, 2011.
[20] N. Binkert et al., “The Gem5 simulator,”