The Stellar Transformation: From Interconnection Networks to Datacenter Networks
Alejandro Erickson, Iain A. Stewart, Javier Navaridas, and Abbas E. Kiasari
Abstract.
The first dual-port server-centric datacenter network, FiConn, was introduced in 2009 and there are several others now in existence; however, the pool of topologies to choose from remains small. We propose a new generic construction, the stellar transformation, that dramatically increases the size of this pool by facilitating the transformation of well-studied topologies from interconnection networks, along with their networking properties and routing algorithms, into viable dual-port server-centric datacenter network topologies. We demonstrate that under our transformation, numerous interconnection networks yield datacenter network topologies with potentially good, and easily computable, baseline properties. We instantiate our construction so as to apply it to generalized hypercubes and obtain the datacenter networks GQ⋆. Our construction automatically yields routing algorithms for GQ⋆ and we empirically compare GQ⋆ (and its routing algorithms) with the established datacenter networks FiConn and DPillar (and their routing algorithms); this comparison is with respect to network throughput, latency, load balancing, fault-tolerance, and cost to build, and is with regard to all-to-all, many all-to-all, butterfly, and random traffic patterns. We find that GQ⋆ outperforms both FiConn and DPillar (sometimes significantly so) and that there is substantial scope for our stellar transformation to yield new dual-port server-centric datacenter networks that are a considerable improvement on existing ones.

1. Introduction
The digital economy has taken the world by storm and completely changed the way we interact, communicate, collaborate, and search for information. The main driver of this change has been the rapid penetration of cloud computing which has enabled a wide variety of digital services, such as web search and online gaming, by offering elastic, on-demand computing resources to digital service providers. Indeed, the value of the global cloud computing market is estimated to be in excess of $100 billion [41]. Vital to this ecosystem of digital services is an underlying computing infrastructure based primarily in datacenters [3]. With this sudden move to the cloud, the demand for increasingly large datacenters is growing rapidly [18].

This demand has prompted a move away from traditional datacenter designs, based on expensive high-density enterprise-level switches, towards using commodity-off-the-shelf (COTS) hardware. In their production datacenters, major operators have primarily adopted (and invented) ideas similar to Fat-Tree [2], Portland [33], and VL2 [16]; on the other hand, the research community (several major operators included) maintains a diverse economy of datacenter architectures and designs in order to meet future demand [14, 18, 20, 30, 36, 37]. Indeed, the "switch-centric" designs currently used in production datacenters have inherent scalability limitations and are by no means a low-cost solution (see, e.g., [6, 18, 19, 29]).

One approach intended to help overcome these limitations is the "server-centric" architecture, the first examples of which are DCell [18] and BCube [17]. Whereas in a switch-centric datacenter network (DCN) there are no links joining pairs of servers, in a server-centric DCN there are no links joining pairs of switches. This server-centric restriction arises from the circumstance that the switches in a server-centric DCN act only as non-blocking "dumb" crossbars. By offloading the task of routing packets to the servers, the server-centric architecture leverages the typically low utilisation of CPUs in datacenters to manage network communication. This can reduce both the number of switches used in a DCN and the capabilities required of them. In particular, the switches route only locally, to their neighbouring servers, and therefore have no need for large or fast routing tables. Thus, a server-centric DCN can potentially incorporate more servers and be both cheaper to operate and to build (see [34] for a more detailed discussion). Furthermore, using servers (which are highly programmable) rather than switches (which have proprietary software and limited programmability) to route packets will potentially accelerate research innovation [27].

Since the advent of DCell and BCube, a range of server-centric DCNs have been proposed, some of which further restrict themselves to requiring at most two ports per server, with FiConn [25] and DPillar [28] being the most established of this genre. This dual-port restriction is motivated by the fact that many COTS servers presently available for purchase, as well as servers in existing datacenters, have two NIC ports (a primary and a backup port).
Dual-port server-centric DCNs are able to utilise such servers without modification, thus making it possible to use some of the more basic equipment (available for purchase or from existing datacenters) in a server-centric DCN and thereby reduce the building costs.

The server-centric DCN architecture provides a versatile design space, as regards the network topology, evidenced perhaps by the sheer number of fairly natural constructions proposed from 2008 to the present. On the other hand, this pool is small relative to the number of interconnection networks found in the literature, i.e., highly structured graphs with good networking properties (here we disregard the terminal nodes of indirect networks, which are not intrinsic to the topology). One of the challenges of identifying an interconnection network suitable for conversion to a DCN topology, however, lies in the fact that the literature on interconnection networks is focused primarily on graphs whose nodes are homogeneous, whereas in both a switch-centric and a server-centric DCN we have server-nodes and switch-nodes which have entirely different operational roles. Some server-centric DCN topologies arise largely from outside the interconnection network literature, e.g., DCell and FiConn, whilst others arise from transformations of well-known interconnection networks, e.g., BCube and DPillar.

The transformations used to obtain BCube and DPillar take advantage of certain sub-structures in the underlying base graphs of the interconnection networks in question (generalized hypercubes and wrapped butterfly networks, respectively) in order to create a server-centric DCN that inherits beneficial networking properties such as having a low diameter and fault-tolerant routing algorithms. The limitation, of course, is that not every prospective base graph has the required sub-structures (cliques and bicliques, respectively, in the cases of BCube and DPillar). New methods of transforming interconnection networks into server-centric DCNs may therefore greatly enlarge the server-centric DCN design space by lowering the structural requirements on potential base graphs.

It is with the construction of new dual-port server-centric DCNs that we are concerned in this paper. In particular, we show how to systematically transform interconnection networks, as base graphs, into dual-port server-centric DCNs, which we refer to as stellar DCNs. The stellar transformation is very simple and widely applicable: the edges of the base graph are replaced with 3-paths, so that the nodes of the base graph become the switch-nodes of the stellar DCN and the added nodes, interior to the 3-paths, become the server-nodes (see Fig. 3). By requiring very little of the base graph in the way of structure, the stellar construction greatly increases the pool of interconnection networks that can potentially serve as blueprints to design dual-port server-centric DCN topologies.

We validate our construction in three ways: first, we prove that various good networking properties of the base graph are preserved under the stellar transformation; second, we build a library of interconnection networks that suit the stellar transformation; and third, we empirically evaluate GQ⋆, an instantiation of a stellar DCN whose base graph is a generalized hypercube, against both FiConn and DPillar, and we also compare GQ⋆ and its routing algorithm (inherited from generalized hypercubes) against what might be optimally possible in GQ⋆. Our empirical results are extremely encouraging.
We employ a comprehensive set of performance metrics so as to evaluate network throughput, latency, load balancing capability, fault-tolerance, and cost to build, within the context of all-to-all, many all-to-all, butterfly, and random traffic patterns, and we show that GQ⋆ broadly outperforms both FiConn and DPillar as regards these metrics, sometimes significantly so. Highlights of these improvements are as follows. In terms of aggregate bottleneck throughput (a primary metric as regards the evaluation of throughput in an all-to-all context), our DCN GQ⋆ improves upon both FiConn and DPillar (upon the former markedly so). As regards fault-tolerance, our DCN GQ⋆, with its fault-tolerant routing algorithm GQ⋆-routing (inherited from generalized hypercubes), outperforms DPillar (and its fault-tolerant routing algorithm DPillarMP from [28]) and competes with FiConn even when we simulate optimal fault-tolerant routing in FiConn (even though such a fault-tolerant routing algorithm has yet to be exhibited). Not only does GQ⋆-routing (in GQ⋆) tolerate faults better than the respective routing algorithms in FiConn and DPillar, but when we make around 10% of the links faulty and compare it with the optimal scenario in GQ⋆, GQ⋆-routing provides around 95% connectivity and generates paths that are, on average, only around 10% longer than the shortest available paths. When we consider load balancing in GQ⋆, FiConn, and DPillar, with their respective routing algorithms GQ⋆-routing, TOR, and DPillarSP, and under a variety of traffic patterns, we find that the situation in GQ⋆ is demonstrably improved over that in FiConn and DPillar (with DPillar performing particularly poorly), and that the improved load balancing in GQ⋆ in tandem with the generation of relatively short paths translates to potential latency savings.

However, we have only scratched the surface in terms of what might be possible as regards the translation of high-performance interconnection networks into dual-port server-centric DCNs, in that we have applied our generic, stellar construction to only one family of interconnection networks so as to achieve encouraging results. In addition to our experiments, we demonstrate that there are numerous families of interconnection networks to which our construction might be applied. Whilst our results with generalized hypercubes are extremely positive, we feel that the generic nature of our construction has significant potential and scope for further application.

The rest of the paper is organized as follows. In the next section, we give an overview of the design space for dual-port server-centric DCNs, along with related work, before defining our new generic construction in Section 3 and proving that good networking properties of the underlying interconnection network translate to good networking properties of the stellar DCN. We also instantiate our stellar construction in Section 3 so as to generate the DCNs GQ⋆, and in Sections 4 and 5 we describe the methodology of our empirical evaluation and the results of this investigation, respectively. We close with some concluding remarks and suggestions for future work in Section 6. We refer the reader: to [10] for all standard graph-theoretic concepts; to [21, 42] for the interplay between graph theory and interconnection network design; and to [7] for an overview of interconnection networks and their implementation for distributed-memory multiprocessors. We implicitly refer to these references throughout.

2. The dual-port server-centric DCN design space
A dual-port server-centric DCN can be built from: COTS servers, each with (at most) two network interface card (NIC) ports; dumb "crossbar" switches; and the cables that connect these hardware components together. We define the capability of a dumb crossbar-switch (henceforth referred to as a switch) as being able to forward an incoming packet to a single port requested in the packet header and to handle all such traffic in a non-blocking manner. Such a switch only ever receives packets destined for servers directly attached to it and handles these requests by retrieving addresses from a very small forwarding table. Consequently, it is never the case that two switches in the network are directly connected by a cable.

We take a (primarily) mathematical view of datacenters in order to systematically identify potential DCN topologies, and we abstract a DCN as an undirected graph so as to model only the major hardware components; namely, the servers and switches are abstracted as server-nodes and switch-nodes, respectively, and the interconnecting cables as edges or links. As our server-centric DCNs are dual-port, our graphs are such that every server-node has degree at most 2 and the switch-nodes form an independent set in the graph.

2.1. Designing DCNs with good networking properties.
There are well-established performance metrics for DCNs and their routing algorithms so that we might evaluate properties such as network throughput, latency, load balancing capability, fault-tolerance, and cost to build (we'll return to these metrics later when we outline our methodology, in Section 4, and undertake our empirical analysis, in Section 5). Networks that perform well with respect to these or related metrics are said to have good networking properties. Maintaining a diverse pool of potential DCN topologies with good networking properties gives DCN designers greater flexibility. There is already such a pool of interconnection networks, developed over the past 50 years or so, and it is precisely from here that the switch-centric DCN fabrics of layer-2 switch-nodes in fat-trees and related topologies have been adapted (see, e.g., [2, 24]).

Adapting interconnection networks to build server-centric DCNs, which necessarily have a more sophisticated arrangement of server-nodes and switch-nodes, however, is more complicated. For example, BCube [17] is built from a generalized hypercube (see Definition 3.1) by replacing the edges of certain cliques, each with a switch-node connected to the nodes of the clique. In doing so, BCube inherits well-known routing algorithms for generalized hypercubes, as well as mean-distance, fault-tolerance, and other good networking properties. DPillar [28], which we discuss in detail in Section 2.4, is built in a similar manner from a wrapped butterfly network (see, e.g., [23]) by replacing bicliques with switch-nodes (a biclique is a graph formed from two independent sets so that every node in one independent set is joined to every node in the other independent set). The presence of these cliques and bicliques is inherent in the definitions of generalized hypercubes and wrapped butterfly networks, respectively, but they are not properties of interconnection networks in general. Furthermore, the dual-port property of DPillar is not by design of the construction, but is a result of the fact that each node in a wrapped butterfly is in exactly two maximal bicliques.

In order to effectively capitalise on a wide range of interconnection networks, for the purpose of server-centric DCN design, we must devise new generic construction methods, similar to those used to construct BCube and DPillar but that do not impose such severe structural requirements on the interconnection network used as the starting point.

2.2. Related work.
We briefly survey the origins of the dual-port server-centric DCNs proposed thus far within the literature [19, 25–28], referring the reader to the original publications for definitions of topologies not given below. FiConn [25] is an adaptation of DCell and is unrelated to any particular interconnection network. DPillar's origins [28] were discussed above. The topologies HCN and BCN [19] are built by combining a 2-level DCell with another network, later discovered to be related to WK-recursive interconnection networks [9, 38]. BCCC [27] is a tailored construction related to BCube and based on cube-connected-cycles and generalized hypercubes. Finally, SWKautz, SWCube, and SWdBruijn [26] employ a subdivision rule similar to ours, but the focus in [26] is not on the (generic) benefits of subdividing interconnection networks as much as it is on the evaluation of those particular network topologies.

In Section 5 we compare an instantiation of our construction, namely the dual-port server-centric DCN GQ⋆, to FiConn and DPillar. The rationale for using these DCNs in our evaluation is that they are good representatives of the spectrum of dual-port server-centric DCNs mentioned above: FiConn is a good example of a DCN that includes both server-node-to-server-node and server-node-to-switch-node connections and is somewhat unstructured, whereas DPillar is server-node symmetric [11] (meaning that for every pair (u, v) of server-nodes, there is an automorphism of the network topology that maps u to v) and features only server-node-to-switch-node connections. In addition, FiConn is arguably unrelated to any previously known interconnection network topology, whilst DPillar is built from, and inherits some of the properties of, the wrapped butterfly network. Various other dual-port server-centric DCNs lie somewhere between these two extremes. Notice that neither FiConn nor DPillar can be described as an instance of our generalised construction: FiConn has some server-nodes whose only connection is to a solitary switch-node, and in DPillar each server-node is connected only to 2 switch-nodes. We now describe the constructions of the DCNs FiConn and DPillar.

2.3. The construction of FiConn.
We start with FiConn, the first dual-port server-centric DCN to be proposed and, consequently, typically considered the baseline such DCN. For any even n ≥ 2 and any k ≥ 0, FiConn_{k,n} [25] is a recursively-defined DCN where k denotes the level of the recursive construction and n the number of server-nodes that are directly connected to a switch-node (so, all switch-nodes have degree n). FiConn_{0,n} consists of n server-nodes and one switch-node, to which all the server-nodes are connected.

Suppose that FiConn_{k,n} has b server-nodes of degree 1 (b = n when k = 0; moreover, no matter what the value of k, b can always be shown to be even). In order to build FiConn_{k+1,n}, we take b/2 + 1 copies of FiConn_{k,n} and for every copy we connect one server-node of degree 1 to each of the other b/2 copies (these additional links are called level-(k + 1) links). The actual construction of which server-node is connected to which is detailed precisely in [25] (FiConn_{2,4}, as constructed in [25], can be visualised in Fig. 1); in particular, there is a well-defined naming scheme where server-nodes of FiConn_{k,n} are named as specific (k + 1)-tuples of integers. In fact, although it is not made clear in [25], there is a multitude of connection schemes realising different versions of FiConn. Note that all of the DCNs we consider in this paper come in parameterized families; so, when we say "the DCN FiConn", what we really mean is "the family of DCNs {FiConn_{k,n} : k ≥ 0 and even n ≥ 2}".
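To make the recursion concrete, the following sketch (our own Python illustration, not code from [25] or from the simulation framework used later) computes the numbers of server-nodes, degree-1 server-nodes, and switch-nodes of FiConn_{k,n} directly from the description above; for FiConn_{2,24}, the instance evaluated in Section 5, it yields the 24,648 server-nodes and 1,027 switch-nodes listed in Table 2.

```python
def ficonn_size(k, n):
    """Counts for FiConn_{k,n}: (server-nodes, degree-1 server-nodes, switch-nodes)."""
    servers, avail, switches = n, n, 1            # FiConn_{0,n}: n servers on one switch
    for _ in range(k):
        copies = avail // 2 + 1                   # b/2 + 1 copies of the previous level
        servers, switches = copies * servers, copies * switches
        avail = (avail // 2) * copies             # each copy keeps b/2 degree-1 server-nodes
    return servers, avail, switches

print(ficonn_size(2, 24))   # (24648, 6162, 1027)
```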
In [25], two routing algorithms are supplied: TOR (traffic-oblivious routing) and TAR (traffic-aware routing).
TAR is intended as a routing algorithm that dynamically adapts routes given changing traffic conditions (it was remarked in [25] that it could be adapted to deal with link or port faults).

Figure 1. A visualisation of FiConn_{2,4}, distinguishing level 0, level 1, and level 2 edges, switch-nodes, and degree-1 and degree-2 server-nodes.

2.4. The construction of DPillar.
The DCN DPillar_{k,n} [28], where n ≥ 2 is even and k ≥ 2, is such that n denotes the number of ports of a switch-node and k denotes the level of the recursive construction; it can be imagined as k columns of server-nodes and k columns of switch-nodes, arranged alternately on the surface of a cylindrical pillar (see the example in Fig. 2). Each server-node in some server-column is adjacent to 2 switch-nodes, in different adjacent switch-columns. Each server-column has (n/2)^k server-nodes, named as {0, 1, . . . , n/2 − 1}^k, whereas each switch-column has (n/2)^{k−1} switch-nodes, named as {0, 1, . . . , n/2 − 1}^{k−1}. We remark that in the literature, our DPillar_{k,n} is usually referred to as DPillar_{n,k}. However, we have adopted our notation so as to be consistent with other descriptions of DCNs.

Fix c ∈ {0, 1, . . . , k − 1}. The server-nodes in server-columns c and c + 1 (columns taken from {0, 1, . . . , k − 1}, with addition modulo k) are arranged into (n/2)^{k−1} groups of n/2 server-nodes so that in server-columns c and c + 1, the server-nodes in group (u_{k−1}, . . . , u_{c+1}, u_{c−1}, . . . , u_0) ∈ {0, 1, . . . , n/2 − 1}^{k−1} are the server-nodes named {(u_{k−1}, . . . , u_{c+1}, i, u_{c−1}, . . . , u_0) : i ∈ {0, 1, . . . , n/2 − 1}}. The adjacencies between switch-nodes and server-nodes are such that any server-node in group (u_{k−1}, . . . , u_{c+1}, u_{c−1}, . . . , u_0) in server-columns c and c + 1 is adjacent to the switch-node of name (u_{k−1}, . . . , u_{c+1}, u_{c−1}, . . . , u_0) in switch-column c.
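The column structure makes the basic counts and adjacencies easy to compute. The following sketch is our own illustration, not code from [28]; names are written as Python tuples (u_{k−1}, . . . , u_0), so component u_c sits at tuple position k − 1 − c.

```python
def dpillar_sizes(k, n):
    """Server-node and switch-node counts of DPillar_{k,n}: k server-columns of
    (n/2)^k server-nodes and k switch-columns of (n/2)^(k-1) switch-nodes."""
    half = n // 2
    return k * half**k, k * half**(k - 1)

def switch_neighbours(col, name, k):
    """The two switch-nodes adjacent to server-node (col, name), following the rule
    above: switch-column col (shared with server-column col + 1) and switch-column
    col - 1 (shared with server-column col - 1), the switch name being obtained by
    deleting component u_col or u_{col-1}, respectively."""
    def drop(c):                       # delete component u_c from the name
        i = k - 1 - c
        return name[:i] + name[i + 1:]
    return ((col % k, drop(col % k)), ((col - 1) % k, drop((col - 1) % k)))

print(dpillar_sizes(4, 18))                 # (26244, 2916): the instance in Table 2
print(switch_neighbours(0, (2, 1, 0), 3))   # ((0, (2, 1)), (2, (1, 0)))
```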
In [28], two routing algorithms are supplied: DPillarSP and DPillarMP. The former is a single-path routing algorithm and the latter is a multi-path routing algorithm.

While all of the dual-port server-centric DCNs from the literature have merit, it is clear that a generic method of transforming interconnection networks into dual-port server-centric DCNs has not previously been proposed and analysed. Having justified the value in studying the dual-port restriction, and having discussed the benefits of tapping into a large pool of potentially useful topologies, we proceed by presenting our generic construction in detail in the next section.
Figure 2. A visualization of a small DPillar network. Squares represent switch-nodes, whereas dots represent server-nodes. For the sake of simplicity, the left-most and the right-most server-columns are the same (server-column 0).

3. Stellar DCNs: a new generic construction
In this section we present our generic method of transforming interconnection networks into potential dual-port server-centric DCNs. We then describe how networking properties of the DCN, including routing algorithms, are inherited from the interconnection network, and go on to identify a preliminary pool of interconnection networks that particularly suit the stellar transformation. Next, we apply our stellar transformation in detail to generalized hypercubes as a prelude to an extensive empirical evaluation in Sections 4 and 5. The key aspects of our stellar construction are its topological simplicity, its universal applicability, and the tight relationship between the interconnection network and the resulting stellar DCN (in a practical networking sense). While we present our stellar construction within a graph-theoretic framework, we end this section by briefly discussing concrete networking aspects of our construction in relation to implementation. We remind the reader that we use [7, 21, 42] as our sources of information for the definitions and the networking properties of the families of interconnection networks mentioned below; we use these sources implicitly and only cite other sources when pertinent.

3.1. The stellar construction.

An interconnection network is an undirected graph together with associated routing algorithms, packet-forwarding methodologies, fault-tolerance processes, and so on. However, it suffices for us to abstract an interconnection network as simply a graph G = (V, E) that is undirected and without self-loops. Let G = (V, E) be any non-trivial connected graph, which we call the base graph of our construction. The stellar DCN G⋆ is obtained from G by placing 2 server-nodes on each link of G and identifying the original nodes of G as switch-nodes (see Fig. 3). We use the term "stellar" as we essentially replace every node of G and its incident links with a "star" subnetwork consisting of a hub switch-node and adjacent server-nodes. Clearly, G⋆ has 2|E| server-nodes and |V| switch-nodes, with the degree of every server-node being 2 and the degree of every switch-node being identical to the degree of the corresponding node in G.

We propose placing 2 server-nodes on every link of G so as to ensure: uniformity, in that every server-node is adjacent to exactly 1 server-node and exactly 1 switch-node (uniformity, and its stronger counterpart symmetry, are widely accepted as beneficial properties in general interconnection networks); that there are no links incident only with switch-nodes (as this would violate the server-centric restriction, discussed in Section 2); and that we can incorporate as many server-nodes as needed within the construction (subject to the other conditions). In fact, any DCN in which every server-node is adjacent to exactly 1 server-node and 1 switch-node and where every switch-node is only adjacent to server-nodes can be realised as a stellar DCN G⋆, for some base graph G. In addition, the stellar transformation can be applied to any (non-trivial connected) base graph; that is, the transformation does not rely on any non-trivial structural properties of the base graph.
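The construction is straightforward to realise in code. The following is a minimal sketch of our own; the server-node naming is an arbitrary choice for illustration rather than part of the definition.

```python
def stellar(base_edges):
    """Stellar transformation of a base graph given as a list of edges {u, v}:
    every base node becomes a switch-node and every edge is replaced by the
    3-path u -- ("server", u, v) -- ("server", v, u) -- v."""
    switch_nodes = sorted({x for e in base_edges for x in e})
    server_nodes, links = [], []
    for u, v in base_edges:
        su, sv = ("server", u, v), ("server", v, u)
        server_nodes += [su, sv]
        links += [(u, su), (su, sv), (sv, v)]
    return switch_nodes, server_nodes, links

# A 4-cycle as base graph gives 4 switch-nodes, 2|E| = 8 server-nodes and 3|E| = 12 links.
switches, servers, links = stellar([(0, 1), (1, 2), (2, 3), (3, 0)])
print(len(switches), len(servers), len(links))   # 4 8 12
```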
3.2. Topological properties of stellar DCNs.

The principal decision that must be taken when constructing a stellar DCN is in choosing an appropriate base graph G. The good networking properties discussed in Section 2.1 are underpinned by several graph-theoretic properties that are preserved under the stellar transformation: for example, low diameter, high connectivity, and efficient routing algorithms in the base graph G translate more-or-less directly into good (theoretical) networking properties of the stellar graph G⋆, as we now discuss. The DCN designer, having specific performance targets in mind, can use this information to facilitate the selection of a base graph G that meets the requirements of the desired stellar DCN.
3.2.1. Paths.

A useful aspect of our construction is as regards the transformation of paths in G to paths in G⋆. As is usual in the analysis of server-centric DCNs (see, e.g., [17–19, 25]), we measure a server-node-to-server-node path P by its hop-length, defined as one less than the number of server-nodes in P. Accordingly, we prefix other path-length-related measures with hop-; for example, the hop-length of a shortest path joining two given server-nodes in G⋆ is the hop-distance between the two server-nodes, and the hop-diameter of a server-centric DCN is the maximum over the hop-distances for every possible pair of server-nodes. Let G = (V, E) be a connected graph and let u, v ∈ V. Let u⋆ and v⋆ be the switch-nodes of G⋆ corresponding to u and v, respectively. Let u′ and v′ be server-node neighbours of u⋆ and v⋆, respectively, in G⋆. Each (u, v)-path P in G, of length m, corresponds uniquely to a (u′, v′)-path in G⋆ of hop-length 2m − 1, 2m, or 2m + 1. The details are straightforward.
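For concreteness, the correspondence can be traced in code. The sketch below is ours (it reuses the illustrative server-node naming of the sketch in Section 3.1); it lifts a base-graph path to the corresponding path of G⋆ between two prescribed server-nodes and reports its hop-length.

```python
def lift_path(base_path, u_server, v_server):
    """Lift a base-graph path w_0, ..., w_m to the G* path joining the server-nodes
    u_server (adjacent to switch-node w_0) and v_server (adjacent to w_m).  The
    hop-length is 2m - 1, 2m, or 2m + 1, depending on whether the endpoints already
    lie on the subdivided first and last edges of the path."""
    nodes = [u_server]
    for a, b in zip(base_path, base_path[1:]):
        nodes += [a, ("server", a, b), ("server", b, a)]
    nodes += [base_path[-1], v_server]
    if u_server == nodes[2]:      # endpoint already on the first edge: skip w_0*
        nodes = nodes[2:]
    if v_server == nodes[-3]:     # endpoint already on the last edge: skip w_m*
        nodes = nodes[:-2]
    hop_length = sum(1 for x in nodes if isinstance(x, tuple) and x[0] == "server") - 1
    return nodes, hop_length

# A path 0-1-2 (m = 2) in the 4-cycle of the previous sketch, lifted between two
# of its server-nodes; the destination lies on the last edge, so the hop-length is 2m.
path, hops = lift_path([0, 1, 2], ("server", 0, 3), ("server", 2, 1))
print(hops)   # 4
```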
3.2.2. Path-based sub-structures.

The transformation of paths in G to paths in G⋆ is the basis for the transfer of potentially useful sub-structures in G to G⋆ so as to yield good DCN properties. Any useful (path-based) sub-structure in G, such as a spanning tree, a set of node-disjoint paths, or a Hamiltonian cycle, corresponds uniquely to a closely related sub-structure in G⋆. Swathes of research papers have uncovered these sub-structures in interconnection networks, and the stellar construction facilitates their usage in dual-port server-centric DCNs. It is impossible to cover this entire topic here, but we describe how a few of the more commonly sought-after sub-structures behave under the stellar transformation.

Foremost are internally node-disjoint paths, associated with fault-tolerance and load balancing. As the degree of any server-node in G⋆ is 2, one cannot hope to obtain more than 2 internally node-disjoint paths joining any 2 distinct server-nodes of G⋆. However, a set of c internally node-disjoint (u, v)-paths in G corresponds uniquely to a set of c internally (server- and switch-) node-disjoint (u⋆, v⋆)-paths in G⋆, where u, v, u⋆, v⋆, u′, and v′ are as defined above. This provides a set of c (u′, v′)-paths in G⋆, called parallel paths, that are internally node-disjoint apart from possibly u⋆ and v⋆ (see Fig. 3). It is trivial to show that the minimum number of parallel paths between any pair of server-nodes, not connected to the same switch-node, in G⋆ is equal to the connectivity of G.

Figure 3. Transforming 4 paths from u to v in G (left) into 4 paths from u′ to v′ in G⋆ (right).

By reasoning as above, it is easy to see that a set of c edge-disjoint (u, v)-paths in G becomes a set of c internally server-node-disjoint (u′, v′)-paths in G⋆, with u, v, u⋆, v⋆, u′, and v′ defined as above; we shall call these server-parallel paths. The implication is that as any two of these paths share only the links (u′, u⋆) and (v⋆, v′), a high value of c may be leveraged to alleviate network traffic congestion as well as fortify the network against server-node failures.

On a more abstract level, consider any connected spanning sub-structure H of G, such as a Hamiltonian cycle or a spanning tree. Let H⋆ be the corresponding sub-structure in G⋆ (under the path-to-path mapping described above) and observe that each edge of G not contained in H corresponds to two adjacent server-nodes in G⋆ not contained in H⋆. On the other hand, every server-node not in H⋆ is exactly one hop away from a server-node that is in H⋆; so within an additive factor of one hop, H⋆ is just as "useful" in G⋆ as H is in G. In fact, if H is a spanning tree in G then we can extend H⋆ in G⋆ by augmenting it with pendant edges from switch-nodes so that what results is a spanning tree in G⋆ containing all server-nodes of G⋆ (and not just those in the original H⋆). By the same principle, non-spanning sub-structures of G, such as those used in one-to-many, many-to-many, and many-to-one communication patterns, also translate to useful sub-structures in G⋆.

We summarise the relationship between properties of G and G⋆ that we have discussed so far in Table 1, where corresponding properties for G and G⋆ are detailed. It should now be apparent that the simplicity of our stellar transformation enables us to import good networking properties from our base graphs to our stellar DCNs, where these properties are crucial to the efficacy of a DCN.
Table 1. Transformation of networking properties of a connected graph G.

  property of G / of G⋆                        G = (V, E)    G⋆
  nodes / nodes                                |V|           |V| switch-nodes, 2|E| server-nodes
  node degree / switch-node degree             d             d
  edges / links                                |E|           3|E| (bidirectional)
  path-length / hop-length                     x             2x − 1 ≤ · ≤ 2x + 1
  diameter / hop-diameter                      D             2D − 1, 2D, or 2D + 1
  internally-disjoint paths / parallel paths   κ             κ
  edge-disjoint paths / server-parallel paths  γ             γ

We close this sub-section with a brief discussion of the transferral of routing algorithms under the stellar transformation. A routing algorithm for an interconnection network G is effectively concerned with an efficient computation over some communication sub-structures. For example, in the case of unicast routing from u to v, we may compute one or more (u, v)-paths (and route packets over them), or for a broadcast we may compute one or more spanning trees. Routing algorithms can be executed at the source node or in a distributed fashion, and they can be deterministic or non-deterministic; whatsoever the process, the resulting output is a communication sub-structure over which packets are sent from node to node. We discussed above the correspondence between communication sub-structures in G and those in G⋆; we now observe that, in addition, any routing algorithm on G can be simulated on G⋆ with the same time complexity. We leave the details to the reader (but we will instantiate this later when we build the stellar DCNs GQ⋆).

3.3. A pool of suitable base graphs.
So far, we have referred to an interconnection network as a solitary object. However, interconnection networks (almost always) come in families where there are parameters whose values precisely delineate the family members. For example, the hypercube Q_n is parameterized by the degree n of the nodes, and so really by "the hypercube Q_n" we mean "the family of hypercubes {Q_n : n = 1, 2, . . .}". For the rest of this sub-section we will be precise and speak of families of interconnection networks as we need to focus on the parameters involved. To ease understanding, when there is more than 1 parameter involved in some definition of a family of interconnection networks and these parameters appear as subscripts or in tuples in the denotation, we list parameters relating to the dimension of tuples or the depth of recursion first, with parameters referring to the size of some component-set coming afterwards (we have done this so far with FiConn_{k,n} and DPillar_{k,n}). We remark that this is sometimes at odds with standard practice in the literature.

We validate our claim that many families of interconnection networks suit the stellar construction by highlighting several that, first, have parameters flexible enough to yield interconnection networks of varying and appropriate size and degree, and, second, are known to possess good networking properties. The first goal is to identify families of interconnection networks that have suitable combinations of degree and size, bearing in mind that today's DCN COTS switches have up to tens of ports, with 48 being typical, while conceivable (but not necessarily in production) sizes of DCNs range from tens of server-nodes up to, perhaps, 5 million in the near future. An illustration of a family of interconnection networks lacking this flexibility is the family of hypercubes, where the hypercube Q_n necessarily has 2^n nodes when the degree is fixed at n; this translates to a stellar DCN with n-port switch-nodes and, necessarily, n·2^n server-nodes. As such, there is a lack of flexibility, in terms of the possible numbers of server-nodes, and if we were to build our stellar DCNs using commodity switches with 48 ports then we would have to have 48 × 2^48 servers, which is clearly impossible. Another illustration of a family of interconnection networks lacking flexibility is the family of cube-connected cycles {CCC(n) : n ≥ 3}, where CCC(n) is obtained from a hypercube Q_n via a transformation similar to our stellar transformation: 2 new nodes are placed on each edge; the new nodes adjacent to some old node are joined (systematically) in a cycle of length n; and the old nodes, and any adjacent edges, are removed. So, CCC(n) is regular of degree 3 and consequently unsuitable for our stellar transformation.

We now look at some families of interconnection networks that are suitable for our stellar transformation. It is too much to list all of the good networking properties of the interconnection networks discussed below. However, it should be remembered that, from above, any path, path-based sub-structure, and routing algorithm is immediately inherited by the stellar DCN; consequently, we focus on the flexibility of the parameterized definition in what follows and refer the reader to other sources (including [7, 21, 42]) for more details as regards good networking properties.
Besides, the fact that these families of interconnection networks have featured so strongly within the research literature is testament to their good networking properties. Also, the families of interconnection networks mentioned below are simply illustrations of interconnection networks for which our stellar transformation has potential; there are many others not mentioned.

Tori (also known as toroidal meshes) have been widely studied as interconnection networks; indeed, tori form the interconnection networks of a range of distributed-memory multiprocessor computers (see, e.g., [7]). The uniform version of a torus is the n-ary k-cube Q_{k,n}, where k ≥ 1 and n ≥ 3, whose node-set is {0, 1, . . . , n − 1}^k and where there is an edge joining two nodes if, and only if, the nodes differ in exactly one component and the values in this component differ by 1 modulo n; hence, Q_{k,n} has n^k nodes and kn^k edges, and every node has degree 2k. There is some, though limited, scope for using n-ary k-cubes in our stellar construction. For example, if we use switch-nodes with 16 ports to build our DCN then this means that k = 8; choosing n = 3, 4, or 5 results in our stellar DCN having 104,976 server-nodes, 1,048,576 server-nodes, or 6,250,000 server-nodes, respectively. We get more variation if we allow the sets of values in different components to differ; that is, we use mixed-radix tori. However, it is not really feasible to use switch-nodes with more than 16 ports in our stellar construction.

Circulant graphs have been studied extensively in a networking context, where they are often called multi-loop networks. Let S be a set of integers, called jumps, with 1 ≤ s ≤ ⌊N/2⌋, for each s ∈ S, and where N ≥ 2. The circulant G(N; S) has node-set {0, 1, . . . , N − 1}, where node i is connected to nodes i ± s (mod N), for each s ∈ S. A circulant has N nodes and at most N|S| edges, and the degree of every node is approximately 2|S| (depending upon the relative values of N and the integers in S); consequently, the parameters provide significant general flexibility. Illustrations of good networking properties of circulants can be found in, for example, [5, 22, 31].

The wrapped butterfly network BF(k, n) can be obtained from DPillar_{k,n} by replacing all switch-nodes with bicliques (joining server-nodes in adjacent columns); consequently, BF(k, n) has k(n/2)^k nodes and k(n/2)^{k+1} edges, and each node has degree n. Thus, by varying k and n, there is reasonable scope for flexibility in terms of the sizes of stellar DCNs. Illustrations of the good networking properties of wrapped butterfly networks can be found in, for example, [15, 39]. Note that transforming a wrapped butterfly network to obtain DPillar is different to transforming it according to the stellar transformation; the two resulting DCNs are combinatorially very distinct.

The de Bruijn digraph dB(k, n), where k ≥ 1 and n ≥ 2, has node-set {0, 1, . . . , n − 1}^k. There is a directed edge from (s_{k−1}, s_{k−2}, . . . , s_0) to (s_{k−2}, s_{k−3}, . . . , s_0, α), for each α ∈ {0, 1, . . . , n − 1}.
2. Consequently, by varying the values of k and n , thereis good flexibility in terms of the sizes of stellar DCNs. Illustrations of the good networking properties ofde Bruijn graphs can be found in, for example, [13, 35]. Note that de Bruijn graphs have been studied asserver-centric DCNs in [34] but these DCNs are not dual-port.The arrangement graph A k,n , where n ≥ ≤ k ≤ n −
1, has node-set { ( s k − , s k − , . . . , s , s ) : s i ∈{ , , . . . , n − } , s i = s j , i, j = 0 , , . . . , k − } . There is an edge joining two nodes if, and only if, the nodesare identical in all but one component. Hence, the arrangement graph A k,n has n !( n − k )! nodes and k ( n − k ) n !2( n − k )! edges, and is regular of degree k ( n − k ). The family of arrangement graphs includes the well-known stargraphs as a sub-family, and there is clearly considerable flexibility in their degree and size.3.4. The stellar DCNs GQ ⋆ . Having hinted that there are various families of interconnection networksto which our stellar transformation might sensibly be applied, we now apply the stellar transformation toone specific family in detail: the family of generalized hypercubes [4]. We provide below more details asregards the topological properties of and routing algorithms for generalized hypercubes as we will use theseproperties and algorithms in our experiments in Sections 4 and 5. We choose generalized hypercubes becauseof their flexibility as regards the stellar construction, their good networking properties, and the fact thatthey have already featured in DCN design as templates for BCube.
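To see how such families translate into stellar DCN sizes, the following sketch (ours; the example parameter choices are purely illustrative) applies the |V|-switch-node and 2|E|-server-node rule of Section 3.1 to two of the families above.

```python
from math import factorial

def stellar_size(num_nodes, num_edges):
    # A stellar DCN G* has |V| switch-nodes and 2|E| server-nodes (Section 3.1).
    return num_nodes, 2 * num_edges

def torus(k, n):
    # n-ary k-cube: n^k nodes, k*n^k edges, degree 2k.
    return stellar_size(n**k, k * n**k)

def arrangement(k, n):
    # Arrangement graph A_{k,n}: n!/(n-k)! nodes, k(n-k)n!/(2(n-k)!) edges, degree k(n-k).
    nodes = factorial(n) // factorial(n - k)
    return stellar_size(nodes, k * (n - k) * nodes // 2)

print(torus(8, 3))        # (6561, 104976): the 16-port example discussed above
print(arrangement(3, 9))  # (504, 9072): one of many degree-18 possibilities
```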
3.4. The stellar DCNs GQ⋆.

Having hinted that there are various families of interconnection networks to which our stellar transformation might sensibly be applied, we now apply the stellar transformation to one specific family in detail: the family of generalized hypercubes [4]. We provide below more details as regards the topological properties of and routing algorithms for generalized hypercubes as we will use these properties and algorithms in our experiments in Sections 4 and 5. We choose generalized hypercubes because of their flexibility as regards the stellar construction, their good networking properties, and the fact that they have already featured in DCN design as templates for BCube.

Definition 3.1.
The generalized hypercube GQ_{k,n}, where k ≥ 1 and n ≥ 2, has node-set {0, 1, . . . , n − 1}^k and there is an edge joining two nodes if, and only if, the names of the two nodes differ in exactly one component. Consequently, GQ_{k,n} has n^k nodes and k(n − 1)n^k/2 edges, and is regular of degree k(n − 1).

GQ_{k,n} has diameter k and connectivity k(n − 1); consequently, GQ⋆_{k,n} has hop-diameter 2k + 1 and there are k(n − 1) parallel paths joining any pair of server-nodes not connected to the same switch-node, these being inherited from the k(n − 1) internally node-disjoint paths joining any two distinct nodes of GQ_{k,n}. We might choose (k, n) as (2, 25), (3, 17), or (4, 13), so that the switch-nodes have 48 ports and the number of server-nodes is 30,000 for GQ⋆_{2,25}, 235,824 for GQ⋆_{3,17}, or 1,370,928 for GQ⋆_{4,13}, respectively (of course, we can vary this number of server-nodes if we do not use all switch-ports or if we use switch-nodes with fewer than 48 ports).
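As a quick check of these figures, a few lines of Python (our own illustration) compute the switch radix and the numbers of server-nodes and switch-nodes of GQ⋆_{k,n} directly from Definition 3.1:

```python
def gq_star_size(k, n):
    """Switch radix, server-node count, and switch-node count of GQ*_{k,n}."""
    radix = k * (n - 1)                  # degree of GQ_{k,n}, hence switch-node radix
    servers = k * (n - 1) * n**k         # 2|E| with |E| = k(n-1)n^k / 2
    return radix, servers, n**k

for k, n in [(2, 25), (3, 17), (4, 13)]:   # the 48-port choices discussed above
    print((k, n), gq_star_size(k, n))
# (2, 25) (48, 30000, 625)
# (3, 17) (48, 235824, 4913)
# (4, 13) (48, 1370928, 28561)
```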
The stellar construction allows us to transform existing routing algorithms for the base graph GQ_{k,n} into routing algorithms for GQ⋆_{k,n}. We describe this process using the routing algorithms for GQ_{k,n} surveyed in [43]. Let u = (u_{k−1}, u_{k−2}, . . . , u_0) and v = (v_{k−1}, v_{k−2}, . . . , v_0) be two distinct nodes of GQ_{k,n}. The basic routing algorithm for GQ_{k,n} is dimension-order (or e-cube) routing, where the path from u to v is constructed by sequentially replacing each u_i by v_i, for some predetermined ordering of the coordinates, say i = 0, 1, . . . , k − 1. As we mentioned above, dimension-order routing translates into a shortest-path routing algorithm for GQ⋆_{k,n} with unchanged time complexity, namely O(k).

We introduce a fault-tolerant mechanism called intra-dimensional routing by allowing the path to replace u_i by v_i in two steps, using a local proxy, rather than in one step, as described in dimension-order routing. Suppose, for example, that one of the edges in the dimension-order route from u to v is faulty; say, the one from u = (u_{k−1}, u_{k−2}, . . . , u_1, u_0) to x = (u_{k−1}, u_{k−2}, . . . , u_1, v_0) (assuming that u_0 and v_0 are distinct). In this case we can try to hop from u to (u_{k−1}, u_{k−2}, . . . , u_1, x_0), where u_0 ≠ x_0 ≠ v_0, and then to x. Inter-dimensional routing is a routing algorithm that extends intra-dimensional routing so that if intra-dimensional routing fails, because a local proxy within a specific dimension cannot be used to re-route round a faulty link, an alternative dimension is chosen. For example, suppose that in GQ_{k,n} intra-dimensional routing has successfully built a route over dimensions 1 and 2 but has failed to re-route via a local proxy in dimension 3. We might try and build the route instead over dimension 4 and then return and try again with dimension 3. Note that if a non-trivial path extension was made in dimension 4 then this yields an entirely different locality within GQ_{k,n} when trying again over dimension 3.

However, in our upcoming experiments we implement the most extensive fault-tolerant, inter-dimensional routing algorithm possible, called GQ⋆-routing, for the stellar DCN GQ⋆_{k,n}, whereby we perform a depth-first search of the dimensions and we use intra-dimensional routing to cross each dimension wherever necessary (and possible). In addition, if GQ⋆-routing fails to route directly in this fashion then it attempts four more times to route (as above) from the source to a randomly chosen server-node, and from there to the destination. We have chosen to make this extensive search of possible routes in order to test the maximum capability of GQ⋆-routing; however, we expect that in practice the best performance will be obtained by limiting the search in order to avoid certain worst-case scenarios. The precise implementation details of GQ⋆-routing can be found in the software release of INRFlow [12] (see Section 4.6). Finally, it is easy to see that GQ⋆-routing can be implemented as a distributed algorithm if a small amount of extra header information is attached to a path-probing packet, similarly to the suggestion in [25] for implementing TAR (Traffic Aware Routing) in FiConn.
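A minimal sketch (ours) of the underlying dimension-order routing is given below; the proxy-based intra- and inter-dimensional extensions and GQ⋆-routing itself add fault-avoidance on top of this skeleton and are implemented in full in INRFlow [12].

```python
def dimension_order_path(u, v):
    """Dimension-order (e-cube) routing in GQ_{k,n}: correct the coordinates of u
    one dimension at a time, in a fixed order, until v is reached.  O(k) time."""
    path, cur = [u], list(u)
    for i in range(len(u)):
        if cur[i] != v[i]:
            cur[i] = v[i]
            path.append(tuple(cur))
    return path

# In GQ_{3,10}, route from (0, 0, 0) to (7, 0, 4): two coordinates differ, so two hops.
print(dimension_order_path((0, 0, 0), (7, 0, 4)))
# Lifting this switch-node path as in Section 3.2.1 gives the corresponding GQ* route
# between server-nodes attached to the two endpoint switch-nodes.
```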
3.5. Implementing stellar DCNs.
Implementing a stellar DCN from scratch would require a software infrastructure that supports through-server end-to-end communications. This could be implemented on top of the transport layer (TCP) so as to simplify development, since most network-level mechanisms (congestion control, fault-tolerance, quality of service) would be provided by the lower layers. Alternatively, it could be implemented on top of the data-link layer to improve performance, since a shallower protocol stack results in faster processing of packets. The latter would require a much higher implementation effort in order to deal with congestion and reliability issues. At any rate, the design and development of a software suite for server-centric DCNs is outside the scope of this paper, but may be considered in the future.

4. Methodology
The good networking properties discussed in Section 2.1 guide our evaluation methodology; they are network throughput, latency, load balancing capability, fault-tolerance, and cost to build. These properties are reflected through performance metrics, and in this section we explain how we use aggregate bottleneck throughput, distance metrics, connectivity, and paths and their congestion, combined with a selection of traffic patterns, in order to evaluate the performance of our DCNs and routing algorithms. In particular, we describe and justify the use of our simulation tool in Section 4.6.

Our methodological framework is as follows. First, we take the position, similar to Popa et al. [34], that the cost of a network is of fundamental importance. No matter what purpose a network is intended for, the primary objective is to maximise the return on the cost of a DCN. While there are several elements that factor into the cost of a DCN, including operational costs, our concern is with the capital costs of purchasing and installing the components we are modelling: servers, switches, and cables. Having calculated these costs (in Section 4.1 below), where appropriate (in our evaluation in Section 5.1) we normalise with respect to cost and proceed by both quantitatively and qualitatively interpreting the resulting multi-dimensional metric-based comparison. Subsequently (from Section 5.2 onwards), we focus on 4 carefully chosen DCNs, namely GQ⋆_{3,10}, GQ⋆_{4,6}, FiConn_{2,24}, and DPillar_{4,18}, and evaluate these DCNs in some detail. We have selected these DCNs as their properties are relevant to the construction of large-scale DCNs: they each have around 25,000 server-nodes, and their basic properties are collated in Table 2.
Table 2. Basic properties of the selected DCNs.

  topology        GQ⋆_{3,10}   GQ⋆_{4,6}   FiConn_{2,24}   DPillar_{4,18}
  server-nodes    27,000       25,920      24,648          26,244
  switch-nodes    1,000        1,296       1,027           2,916
  links           81,000       77,760      67,782          104,976

4.1. Network cost.
We follow Popa et al. [34] and assume that the cost of a switch is proportional to its radix (this is justified in [34] for switches of radix up to around 100-150 ports). Let c_s be the cost of a server, let c_p be the cost of a switch-port, and let c_c be the average cost of a cable. We make the simplifying assumption that the average cost of a cable c_c is uniform across DCNs with N servers within the families GQ⋆, FiConn, and DPillar, and, furthermore, that the average cost of a cable connected only to servers is similar to that of a cable connected to a switch. Thus, the cost of a DCN GQ⋆ with N server-nodes is N(c_p + c_c + c_s + c_c/2); the cost of a DCN FiConn_{k,n} with N server-nodes is N(c_p + c_c + c_s + c_c/2 − c_c/2^{k+1}), since it contains N/2^k server-nodes of degree 1 [25]; and the cost of a DCN DPillar with N server-nodes is N(2(c_p + c_c) + c_s). Next, we express c_p = ρc_s and c_c = γc_s so that the costs per server-node become Nc_s(ρ + γ + 1 + γ/2), Nc_s(ρ + γ + 1 + γ/2 − γ/2^{k+1}), and Nc_s(2(ρ + γ) + 1), respectively. A rough estimate is that realistic values for ρ and for γ each lie within a fairly narrow range, with the variation depending on, e.g., the choice between copper and optical cables, as well as how we account for the labour involved in installing them.

Consequently, we normalise with respect to the aggregated component cost per server-node in GQ⋆, letting c_s(ρ + γ + 1 + γ/2) = 1, and plot component costs per server-node against γ in Fig. 4, for a representative selection of five values of ρ (in Fig. 4, there is one graph for each DCN and for each of the 5 values for ρ, with the 5 graphs corresponding to FiConn being almost indistinguishable from one another). The upshot is that the higher the value for ρ, the higher the cost of DPillar, and for realistic choices of ρ and γ, DPillar could be up to 20% more expensive and FiConn around 4% less expensive than GQ⋆ when all DCNs have the same number of server-nodes. Perhaps the most realistic values of ρ and γ, however, yield a DPillar that is only about 10% more expensive and a FiConn that is only about 2% less expensive.
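The normalised cost expressions are easy to evaluate; the sketch below is our own illustration, and the values ρ = 0.1 and γ = 0.2 are arbitrary example inputs rather than the estimates used for Fig. 4.

```python
def cost_per_server(rho, gamma, topology, k=None):
    """Component cost per server-node, in units of the server cost c_s,
    using the expressions derived above; rho = c_p/c_s and gamma = c_c/c_s."""
    if topology == "GQ*":
        return rho + gamma + 1 + gamma / 2
    if topology == "FiConn":                          # k is the recursion level
        return rho + gamma + 1 + gamma / 2 - gamma / 2**(k + 1)
    if topology == "DPillar":
        return 2 * (rho + gamma) + 1
    raise ValueError(topology)

rho, gamma = 0.1, 0.2                                  # illustrative values only
base = cost_per_server(rho, gamma, "GQ*")
for t in ["GQ*", "FiConn", "DPillar"]:
    print(t, round(cost_per_server(rho, gamma, t, k=2) / base, 3))
# GQ* 1.0, FiConn 0.982, DPillar 1.143 for this particular choice of rho and gamma
```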
Figure 4. The component costs per server-node of FiConn and DPillar, relative to that of GQ⋆, plotted against γ for the five representative values of ρ.

4.2. Hop-distance metrics.
The number of servers a packet flow needs to travel through significantly affects the flow's latency. In addition, for each server on the path, compute and memory overheads are incurred: in a server-centric DCN (with currently available COTS hardware), the whole of the protocol stack, up to the application level, needs to be processed at each server, which can make message transmission noticeably slower than in a switch-centric network where lower layers of the protocol stack are employed and use optimised implementations.

The paths over which flows travel are computed by routing algorithms, and it may not be the case that shortest-paths are achievable by available routing algorithms and without global fault-knowledge; large-scale networks like DCNs are typically restricted to routing algorithms that use only local knowledge of fault locations. As such, the performance of the routing algorithm is perhaps more important than the hop-diameter or mean hop-distance of the topology itself. Therefore, we use distance-related metrics that reveal the performance of the topology and the routing algorithm combined, namely routed hop-diameter and routed mean hop-distance, as well as for the topology alone (where appropriate), namely hop-diameter and mean hop-distance (see Section 2.1). This allows us to (more realistically) assess both the potential of the topologies and the actual performance that can be extracted from them when implemented with currently available routing algorithms.

4.3. Aggregate bottleneck throughput.
The aggregate bottleneck throughput (ABT) is a metric introduced in [17] and is of primary interest to DCN designers due to its suitability for evaluating the worst-case throughput in the all-to-all traffic pattern, which is extremely significant in the context of DCNs (see Section 4.5). The reasoning behind ABT is that the performance of an all-to-all operation is limited by its slowest flow, i.e., the flow with the lowest throughput. The ABT is defined as the total number of flows times the throughput of the bottleneck flow, i.e., a flow routed over the link sustaining the most flows. Formally, the ABT of a network of size N is equal to N(N − 1)b/F, where F is the number of flows in the bottleneck link and b is the bandwidth of a link.

In our experiments, the bottleneck flow is determined experimentally using the implementations of actual routing algorithms; this is atypical of ABT calculations (e.g., see [29]), where ordinarily shortest-paths are used, but our approach facilitates a more realistic evaluation. We measure ABT using GQ⋆-routing for GQ⋆, TOR for FiConn, and DPillarSP for DPillar, assuming N(N − 1) flows and a bandwidth of 1 unit per directional link, where N is the number of server-nodes. Since datacenters are most commonly used as a stream processing platform, and are therefore bandwidth limited, this is an extremely important performance metric in the context of DCNs. Given that ABT is only defined in the context of all-to-all communications, for other traffic patterns we focus on the number of flows in the bottleneck link as an indicator of congestion propensity.

We should explain our choice of routing algorithm in FiConn and DPillar as regards our ABT analysis. In [25], it was shown that TOR yields better performance for all-to-all traffic patterns than TAR. In [28], the all-to-all analysis (actually, it is a many all-to-all analysis) showed that DPillarSP performs better than DPillarMP. We have chosen TOR and DPillarSP so as not to disadvantage FiConn and DPillar when we compare against GQ⋆ and GQ⋆-routing.
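The measurement itself is simple once the routes are known; the following sketch (ours, with a hand-made toy example rather than routes produced by any of the actual routing algorithms) computes ABT from a list of routes, each given as a list of directional links.

```python
from collections import Counter

def abt(routes, num_servers, bandwidth=1.0):
    """Aggregate bottleneck throughput: N(N-1) * b / F, where F is the number of
    flows carried by the most loaded directional link, measured from the routes
    actually produced by the routing algorithm."""
    load = Counter(link for route in routes for link in route)
    bottleneck = max(load.values())
    return num_servers * (num_servers - 1) * bandwidth / bottleneck

# Toy example: 3 server-nodes a, b, c on one switch s, all-to-all via s.
routes = [[("a", "s"), ("s", "b")], [("a", "s"), ("s", "c")],
          [("b", "s"), ("s", "a")], [("b", "s"), ("s", "c")],
          [("c", "s"), ("s", "a")], [("c", "s"), ("s", "b")]]
print(abt(routes, 3))   # bottleneck link carries 2 flows, so ABT = 6/2 = 3.0
```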
4.4. Fault-tolerance.

High reliability is of the utmost importance in datacenters, as it impacts upon the business volume that can be attracted and sustained. When scaling out to tens of thousands of servers or more, failures are common, with the mean-time between failures (MTBF) being as short as hours or even minutes. As an example, consider a datacenter with 25,000 servers, 1,000 switches, and 75,000 links, each with an optimistic average lifespan of 5 years. Based upon the very rough estimate that the number of elements divided by the average lifespan (in days) gives the number of failures per day, the system will have an average of about 13 server faults per day, 40 link faults per day, and 1 switch fault every 2 days. In other words, failures are ubiquitous and so the DCN should be able to deal with them in order to remain competitively operational. Any network whose performance degrades rapidly with the number of failures is unacceptable, even if it does provide the best performance in a fault-free environment.

We investigate how network-level failures affect routed connectivity, defined as the proportion of server-node-pairs that remain connected by a path computable by a given routing algorithm, as well as how they affect routed mean hop-distance. Our study focuses on uniform random link failures in particular, because both server-node and switch-node failures induce link failures, and also because the sheer number of links (and NICs) in a DCN implies that link-failures will be the most common event. A more detailed study of failure events will be conducted in follow-up research, in which we will consider correlated link, server-node, and switch-node failures. We consider failure configurations with up to a 15% network degradation, where we randomly select, with uniform probability, 15% of the links to have a fault. Furthermore, we consider only bidirectional failures, i.e., where links will either work in both directions or in neither. The rationale for this is that the bidirectional link-failure model is more realistic than the unidirectional one: failures affecting the whole of a link (e.g., NIC failure, unplugged or cut link, or switch-port failure) are more frequent than the fine-grained failures that would affect a single direction. In addition, once unidirectional faults have been detected they will typically be dealt with by disabling the other direction of the failed link (according to the IEEE 802.3ah EFM-OAM standard).
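The measurement procedure can be summarised by the following sketch (ours, not INRFlow code); here route is a stand-in for whichever routing algorithm is under test (GQ⋆-routing, BFS, or DPillarMP) and is assumed to return None when it cannot find a fault-free path.

```python
import random

def routed_connectivity(servers, links, route, failure_rate=0.10, pairs=10000):
    """Estimate routed connectivity under uniform random bidirectional link failures:
    the fraction of sampled server-node pairs for which `route` still returns a path
    avoiding every failed link.  servers and links are lists; links are ordered pairs."""
    failed = set(random.sample(links, int(failure_rate * len(links))))
    failed |= {(b, a) for (a, b) in failed}       # failures are bidirectional
    ok = 0
    for _ in range(pairs):
        src, dst = random.sample(servers, 2)
        if route(src, dst, failed) is not None:
            ok += 1
    return ok / pairs
```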
TAR is a distributed “heuristic” algorithm devised so as to improve networkload balancing with bursty and irregular traffic patterns, and was neither optimised for nor tested on outrightfaulty links. In addition,
TAR computes paths that are 15–30% longer in these scenarios than
TOR does.However,
TOR is not fault-tolerant and so we simply use BFS. In short, we have given FiConn preferential reatment (this makes the performance of GQ ⋆ against FiConn, described in Section 5.2, all the moreimpressive). As regards DPillar, DPillarMP is fault-tolerant whereas
DPillarSP is not.4.5.
Traffic patterns.
We now describe the traffic patterns used in our evaluation, the primary one being the all-to-all traffic pattern. All-to-all communications are extremely relevant as they are intrinsic to MapReduce, the preferred paradigm for data-oriented application development; see, for example, [8, 29, 40]. In addition, all-to-all can be considered a worst-case traffic pattern for two reasons: (a) the lack of spatial locality; and (b) the high levels of contention for the use of resources.

Our second set of experiments focuses on specific networks hosting around 25,000 server-nodes and evaluates them with a wider collection of traffic patterns; we use the routing algorithms GQ⋆-routing, TOR, and DPillarSP. Apart from all-to-all, we also consider three other traffic patterns: many all-to-all, butterfly, and random. In many all-to-all, the network is split into disjoint groups of a fixed number of server-nodes, with the server-nodes within a group performing an all-to-all operation. Our evaluation shows results for groups of 1,000 server-nodes but these results are consistent with those for groups of sizes 500 and 5,000. This workload is less demanding than the system-wide all-to-all, but can still generate a great deal of congestion. It aims to emulate a typical tenanted cloud datacenter in which many independent applications run concurrently. We assume a typical topology-agnostic scheduler and randomly assign server-nodes to groups. The butterfly traffic pattern is a "logarithmic implementation" of a pattern such as all-to-all, as each server-node only communicates with the server-nodes at distance 2^k from it, for each k ∈ {0, . . . , ⌈log₂(N)⌉ − 1} (see [32] for more details). This workload significantly reduces the overall utilization of the network when compared with the all-to-all traffic pattern and aims to evaluate the behaviour of the networks when the traffic pattern is well-structured. Finally, we consider a random traffic pattern in which we generate one million flows (we also studied other numbers of flows but the results turn out to be very similar to those with one million flows); for each flow, the source and destination are selected uniformly at random. These additional experiments provide further insights into the performance achievable with each of the networks and allow a more detailed evaluation of propensity to congestion, load balancing, and latency. The sketch below illustrates how such flow lists can be generated.
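As an illustration only (this is not the workload-generation code of our framework), the following Python sketch generates flow lists for the many all-to-all, butterfly, and random patterns over server-nodes labelled 0, …, N−1. For the butterfly pattern, the distance 2^k is interpreted in server-node label space (partner = label XOR 2^k), which is one common way of synthesizing this workload; the group size of 1,000 matches the configuration reported above.

```python
import math
import random
from itertools import permutations

def many_all_to_all(N, group_size=1000):
    """Randomly partition server-nodes into groups; all-to-all within each group."""
    nodes = list(range(N))
    random.shuffle(nodes)                         # topology-agnostic assignment to groups
    flows = []
    for start in range(0, N, group_size):
        group = nodes[start:start + group_size]
        flows += [(s, d) for s, d in permutations(group, 2)]
    return flows

def butterfly(N):
    """Each node exchanges with its partner at label-distance 2^k, k = 0..ceil(log2 N)-1."""
    flows = []
    for k in range(math.ceil(math.log2(N))):
        for src in range(N):
            dst = src ^ (1 << k)                  # partner at distance 2^k in label space
            if dst < N:
                flows.append((src, dst))
    return flows

def random_pattern(N, num_flows=1_000_000):
    """Source and destination of each flow chosen uniformly at random."""
    return [(random.randrange(N), random.randrange(N)) for _ in range(num_flows)]
```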
4.6. Software tools.

Our software tool, the Interconnection Networks Research Flow Evaluation Framework (INRFlow) [12], is specifically designed for testing large-scale, flow-based systems such as DCNs with tens or hundreds of thousands of nodes, a scale that would prohibit the use of packet-level simulation. The results obtained from INRFlow inform a detailed evaluation within the intended scope of our paper.

INRFlow is capable of evaluating network topologies in two ways. First, within INRFlow we can undertake a BFS for each server-node; this allows us to compute the hop-length of the shortest path between any two server-nodes and also to examine whether two server-nodes become disconnected in the presence of link failures. As we noted in Section 4.2, results on shortest paths are of limited use when not studied in conjunction with a routing algorithm. Consequently, INRFlow also provides path and connectivity information for a given routing algorithm. We use the different routing algorithms within our DCNs as described so far in this section. The operation of the tool is as follows: for each flow in the workload, it computes the route using the required routing algorithm and updates link utilization accordingly; it then reports a large number of statistics of interest, including the metrics discussed above (a simplified sketch of this evaluation loop is given at the end of this subsection).

Simulation. Simulation is the accepted methodology as regards the empirical investigation of DCNs. For example, for the DCNs FiConn, HCN, BCN, SWKautz, SWCube, and SWdBruijn, all empirical analysis has been undertaken by simulation; on the other hand, DCell uses a test-bed of only 20 servers, BCube uses a test-bed of only 16 servers, and CamCube [1] uses a test-bed of only 27 servers. We argue that for the scenarios for which server-centric DCNs are intended, where the DCN will be expected to have thousands (if not millions) of servers in the future, experiments with a small test-bed cluster are of limited use (except to establish proof-of-concept) and that simulation is the best way to proceed. Moreover, the uniformity and structured design of server-centric DCNs mitigates the performance discrepancies that might arise in "more random" networks.

Error bars. The absence of error bars in our evaluation is by design. In our paper, random sampling occurs in two different ways: the first is where a random set of faulty links is chosen and properties of the faulty topology are plotted, as in Figs. 8 to 11; the second is with regard to randomised traffic patterns, as in Figs. 12, 13 and 15. For each set of randomised link failures we plot statistics, either on connectivity or path length, for the all-to-all traffic pattern (i.e., the whole population of server-node-pairs). In Figs. 8 to 11 we sample the mean of two statistics over the set of all possible sets of m randomised link failures based on only one trial for each network and statistic, and therefore it does not make sense to compute an estimated standard error for these plots. The true error nevertheless remains very small because of the high level of uniformity of the DCNs we are studying, including the non-homogeneous DCN FiConn. This uniformity effectively simulates a large number of trials since, for each choice of faulty link, there are hundreds or thousands of other links in the DCN whose failure would have almost exactly the same effect on the overall experiment. Quantifying this error is outside the scope of our paper; however, it is evident from the low amount of noise in our plots that the true error is negligible in the context of the conclusions we draw. Figs. 12, 13 and 15 sample flows to find the mean number of links with a certain proportion of utilisation, and to find the mean hop-lengths of the flows. Our sample sizes, given in Section 4.5, are exceedingly large for this purpose, and thus error bars would be all but invisible in these plots. We leave the explicit calculation to the reader.
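For concreteness, here is a minimal Python sketch of the kind of flow-level evaluation loop described above; it illustrates the methodology and is not INRFlow itself. The routing function and link identifiers are placeholders, and the ABT-style figure at the end assumes the usual flow-based convention in which every link has unit capacity and the bottleneck link is the one carrying the most flows.

```python
from collections import Counter

def evaluate_workload(flows, route):
    """Route every flow, accumulate per-link utilisation, and report statistics.

    `route(src, dst)` is a placeholder returning the sequence of links used by
    the flow, or None if the pair cannot be connected.
    """
    link_load = Counter()
    hop_lengths = []
    unconnected = 0
    for src, dst in flows:
        path = route(src, dst)
        if path is None:
            unconnected += 1
            continue
        hop_lengths.append(len(path))
        for link in path:
            link_load[link] += 1                # one more flow crosses this link

    bottleneck = max(link_load.values())        # flows in the bottleneck link
    stats = {
        "mean_hops": sum(hop_lengths) / len(hop_lengths),
        "bottleneck_flows": bottleneck,
        "unconnected_pairs": unconnected,
        # With unit-capacity links, each flow through the bottleneck receives
        # 1/bottleneck of a link's bandwidth, giving an ABT-style estimate of:
        "abt_estimate": len(hop_lengths) / bottleneck,
    }
    return stats, link_load
```

Per-link loads of this kind also underlie the flows-per-link distributions discussed in Section 5.4.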
5. Evaluation

In this section we perform an empirical evaluation of the DCN GQ⋆ and compare its performance with that of the DCNs FiConn and DPillar, using the methodology and framework detailed in Section 4. We begin by comparing various parameterized versions of the three DCNs as regards ABT and latency (though the latter is a coarse-grained analysis). Next, we focus on four comparable large-scale DCNs, namely the two instances of GQ⋆ and the instances of FiConn and DPillar chosen in Section 4.4, and we examine them in more detail with regard to fault-tolerance, latency, and load balancing, under different traffic patterns. Interspersed is an examination of the fault-tolerance capabilities of GQ⋆-routing in comparison with what might happen in the optimal scenario.

5.1. Aggregate bottleneck throughput.
We begin by comparing GQ⋆, FiConn, and DPillar as regards aggregate bottleneck throughput, following our framework as outlined in Section 4.3; in particular, we use the routing algorithms GQ⋆-routing, TOR, and DPillarSP. We work with 3 different parameterized versions of GQ⋆, 2 of FiConn, and 3 of DPillar. Not only do we look at the relative ABT of the different DCNs but we also look at the scalability of each DCN in terms of ABT as the number of servers or the component cost grows.

We first consider ABT vs. the number of servers in each network. Fig. 5 shows that ABT scales much better in GQ⋆ than in FiConn. For the largest systems considered, GQ⋆ supports up to around three times the ABT of one of the FiConn versions; the difference between the 3 versions of GQ⋆ and the other FiConn version is not as large but is still substantial. We can see that although the DCNs GQ⋆ are constructed using far fewer switch-nodes and links than DPillar (when the two DCNs have the same number of server-nodes), their maximum sustainable ABT is broadly better; indeed, the DCNs GQ⋆_{k,n} with k = 2 and k = 3 consistently outperform all DPillar DCNs.

Figure 5. ABT using GQ⋆-routing, TOR, and DPillarSP.

Fig. 6 shows a plot of ABT vs. network cost under the most plausible assumptions discussed in Section 4.1, namely that the aggregated cost of components for DPillar is around 10% more, and that of FiConn is around 2% less, than that of GQ⋆. When we normalize by network cost, we see a similar shape to Fig. 5, except that FiConn has slightly improved scaling whereas DPillar has slightly degraded scaling.

Figure 6. ABT in terms of network cost for GQ⋆-routing, TOR, and DPillarSP, where a DCN DPillar is 110% of the cost of a DCN GQ⋆ with the same number of server-nodes, whilst a DCN FiConn is 98% of the cost of a DCN GQ⋆. Network cost is normalised by the aggregated component cost per server in GQ⋆.

Let us focus on the increase in ABT for GQ⋆_{k,n} as k decreases, which can be explained as follows. First, for a fixed number of server-nodes, reducing k results in an increased switch-node radix, which translates into higher locality. Second, reducing k results in a lower routed mean hop-distance (see Fig. 7), which lowers the total utilization of the DCN and, when combined with good load-balancing properties, yields a bottleneck link with fewer flows. As regards the routed mean hop-distances for each of the DCNs, we can see that for each topology these increase very slowly with network size (apart from, perhaps, one of the FiConn versions) and are, of course, bounded by the routed hop-diameter, which depends on k for all 3 topologies: 2k + 1 for GQ⋆-routing; 2^{k+1} − 1 for TOR; and 2k − 1 for DPillarSP. The "exponential nature" of FiConn discourages building this topology for any k larger than 2. However, note that in terms of routed mean hop-distance, DPillar is, broadly speaking, slightly better than GQ⋆. Such a metric cannot be taken in isolation, however, and we take a closer look at it in relation to load balancing in the more detailed evaluation of our three DCNs in Section 5.4 (things are not what they might appear here).

Figure 7. Routed mean hop-distances for GQ⋆-routing, TOR, and DPillarSP.

Although we forgo a simulation of packet injections, our experiments do allow for a coarse-grained latency analysis. Network latency is brought on by routing packets over long paths and by spending additional time processing (e.g., buffering) the packets at intermediate nodes, due to network congestion. These scenarios have various causes, but they are generally affected by a DCN's ability to simultaneously balance network traffic and route it over short paths efficiently. Figs. 5 and 7 show that GQ⋆-routing scales well with respect to load balancing (high ABT) and routed mean hop-distance, from which we infer that in many situations a GQ⋆ with smaller k has lower latency than one with larger k and than all FiConn DCNs, and likely performs at least similarly to DPillar.

In summary, GQ⋆ has better ABT properties than FiConn and also broadly outperforms the denser DPillar; as discussed in Section 4.3, ABT is a performance metric of primary interest in the context of datacenters. We can also infer from our experiments a coarse-grained latency analysis, namely that GQ⋆-routing is likely to be at least as good as DPillar and better than FiConn.
5.2. Fault-tolerance.

We now turn our attention to four concrete instances of the topologies and their routing algorithms: the two instances of GQ⋆ with GQ⋆-routing; FiConn with BFS; and DPillar with DPillarMP (though we shall also consider DPillar with DPillarSP in the non-fault-tolerant environment of Section 5.4). As stated in Section 4.4, these DCNs were chosen as each has around 25,000 server-nodes and uses switch-nodes with around 24 ports.

A priori, GQ⋆ has a provably high number of parallel paths and server-parallel paths compared with FiConn and DPillar of similar size (see Table 2). Thus, if GQ⋆-routing utilises these paths, we expect strong performance in degraded networks. Fig. 8 shows the routed connectivity under failures of GQ⋆-routing and DPillarMP. The plot indicates that DPillarMP underutilises the network, since the unrouted connectivity of DPillar (not plotted) is slightly stronger than that of GQ⋆. This highlights the fact that there is a close and complex relationship between topology, path-lengths, routing, fault-tolerance, and so on; ensuring that all aspects dovetail together is of primary importance. These observations also motivate a more detailed evaluation of GQ⋆-routing (and indeed of fault-tolerant routing for DPillar). Note that the evaluation of DPillarMP in Liao et al. [28] is with respect to server-node faults, for which the performance of DPillarMP looks stronger than it does in our experiments with link failures. This is because the failed server-nodes do not send messages and therefore do not factor into the connectivity of the faulty DPillar.

Figure 8. Routed connectivity of GQ⋆-routing and DPillarMP.
5.3. Assessment of GQ⋆-routing.

With FiConn not having a fault-tolerant routing algorithm comparable to GQ⋆-routing (see Section 4.4), in Fig. 9 we plot the unrouted connectivity of GQ⋆ alongside that of FiConn using BFS. As we can see, GQ⋆-routing performs similarly to FiConn in an optimal scenario. To our knowledge there is no fault-tolerant routing algorithm for FiConn that achieves anything close to the optimal performance of BFS (however, Fig. 11 shows that GQ⋆-routing very nearly achieves the optimum unrouted connectivity of GQ⋆). (Routed and unrouted data computed for the other GQ⋆ DCNs were very similar and are not plotted, for the sake of clarity.) In summary, we have shown that GQ⋆ and GQ⋆-routing are very competitive when compared with both FiConn and DPillar in terms of fault-tolerance.

We assess the performance of GQ⋆-routing by comparing it with the optimum performance, obtained by computing a BFS which finds a shortest path (if one exists). Notice that since dimensional routing yields a shortest-path algorithm on GQ_{k,n}, it is straightforward to modify GQ⋆-routing so as to be a shortest-path algorithm on GQ⋆_{k,n}; however, due to simplifications in our implementation there is a discrepancy of about 2% between shortest paths and GQ⋆-routing in a fault-free GQ⋆.

Of interest to us here is the relative performance of GQ⋆-routing and BFS in faulty networks. Fig. 10 plots the routed and unrouted mean hop-distances in networks with a 10% link-failure rate; as can be seen, the difference between GQ⋆-routing and BFS in mean hop-distance is close to 10%. This is a reasonable overhead for a fault-tolerant routing algorithm, especially given the algorithm's high success rate at connecting pairs of servers in faulty networks: Fig. 11 plots the unrouted connectivity, which is optimum and achieved by BFS, and the routed connectivity, achieved by GQ⋆-routing, for the same (10%) failure rate. (In these plots GQ⋆-routing appears to be better than BFS for certain numbers of servers, but this is because the faults were generated randomly for each test.) As it is currently implemented, GQ⋆-routing is optimised for maintaining connectivity at the cost of routing over longer paths where necessary. A different mix of features might reduce the 10% gap in Fig. 10 but increase the gap in Fig. 11. In any case, the performance of GQ⋆-routing is very close to the optimum.

Figure 9. Unrouted connectivity of GQ⋆ and FiConn.

Figure 10. Routed (GQ⋆-routing) and unrouted mean hop-distance in GQ⋆ with 10% link failures.

Figure 11. Routed (GQ⋆-routing) and unrouted connectivity of GQ⋆ with 10% link failures.
5.4. Detailed evaluation of large-scale DCNs.

We now return to our four concrete instances of the topologies and their basic routing algorithms: the two instances of GQ⋆ with GQ⋆-routing; FiConn with TOR; and DPillar with DPillarSP. Our intention is to look at throughput, at how loads are balanced, and at the impact on latency.

Fig. 12 shows the number of flows in the bottleneck for the different traffic patterns considered in our study. We can see that these results follow those described above: not only can GQ⋆ broadly outperform FiConn and DPillar in terms of ABT, cost, latency, and fault-tolerance, but it does likewise in terms of throughput, in that it can significantly reduce the number of flows in the bottleneck. The only exception is DPillar with the butterfly traffic pattern. The rationale for this result is that the butterfly pattern matches the DPillar topology perfectly and thus allows a very good balancing of the network, reducing the flows in the bottleneck. For the rest of the patterns, DPillar is clearly the worst performing in terms of bottleneck flows. Fig. 13 shows the routed mean hop-distance for the different patterns and topologies, and shows that DPillar, due to its higher number of switches, can generally reach its destinations using shorter paths. Note that even with the clear advantage of having a higher availability of shorter paths, DPillar still has the highest number of flows in the bottleneck and, therefore, is the most prone to congestion. On the other hand, the instance of GQ⋆ that uses the longest paths has the second lowest number of flows in the bottleneck, after the other GQ⋆ instance.

Figure 12. Relative number of flows in the bottleneck for the different traffic patterns, normalised to FiConn and TOR.

Figure 13. Routed hop-distance for the different traffic patterns.

The results we have obtained as regards bottleneck flows and routed hop-distances might appear surprising. However, a closer analysis helps to better appreciate the situation. Fig. 14 shows the distribution of flows across links in the all-to-all traffic pattern: for a given number of flows, we show the proportion of links carrying that number of flows. We can see that both GQ⋆s are much better balanced than both FiConn and DPillar. For example, in one GQ⋆ all of the links carry between 60,000 and 100,000 flows, and in the other GQ⋆ all of the links carry between 80,000 and 120,000 flows. However, nearly 25% of the links in FiConn carry fewer than 40,000 flows, whereas the other 75% of the links carry between 80,000 and 140,000 flows. Even worse, in DPillar half of the links carry more than 100,000 flows while the other half are barely used. The imbalances present in FiConn and DPillar result in parts of the networks being significantly underutilised and other parts being overly congested.

A more detailed distribution, obtained using the random traffic pattern, is shown in Fig. 15. Here, we can see that both GQ⋆s are clearly better balanced than FiConn, as the latter has two pinnacles: one of low load, with about 30% of the links, and another of high load, with the rest of the links. We can also see that choosing the bottleneck link as the figure of merit is reasonable, as it would yield similar results to choosing the peaks in the plot.

Figure 14. Histogram of the proportion of flows per link under the all-to-all traffic pattern; connecting lines are drawn for clarity.

Figure 15. Distribution of the number of flows per link for the random traffic pattern. Not plotted are 52,488 unused links in DPillar and 6,162 unused links in FiConn.

Just as we did in Section 5.1, we can infer that the GQ⋆ instance with fewer flows in the bottleneck link and shorter paths will provide better latency figures than the other GQ⋆ instance and than FiConn. The shorter paths in DPillar do suggest that with low-intensity communication workloads it should have lower latency than GQ⋆, but since DPillar is much poorer at balancing loads than GQ⋆, we can infer that it may have higher latency under higher-intensity communication workloads such as those typically used in datacenters. The load-balance comparison of Figs. 14 and 15 can be derived directly from per-link flow counts, as sketched below.
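The following Python fragment is a minimal sketch, not our framework's code, of how per-link flow counts gathered during a flow-level evaluation (such as the `link_load` counter in the earlier sketch) can be turned into a flows-per-link distribution and a bottleneck figure of merit of the kind just discussed; the bin width of 10,000 flows is an arbitrary illustrative choice.

```python
from collections import Counter

def load_distribution(link_load, total_links, bin_width=10_000):
    """Bin per-link flow counts into a flows-per-link histogram.

    `link_load` maps each utilised link to the number of flows crossing it;
    links absent from the mapping carry no flows at all.
    """
    histogram = Counter()
    for flows in link_load.values():
        histogram[(flows // bin_width) * bin_width] += 1
    unused = total_links - len(link_load)        # links carrying no flows
    if unused:
        histogram[0] += unused
    bottleneck = max(link_load.values())         # the figure of merit used above
    return dict(sorted(histogram.items())), bottleneck
```

A strongly bimodal histogram of this kind signals that part of the network is underutilised while the remainder is congested, which is precisely the imbalance described above for FiConn and DPillar.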
6. Conclusion

This paper proposes a new, generic construction that can be used to automatically convert existing interconnection networks, and their properties in relation to routing, path length, node-disjoint paths, and so on, into dual-port server-centric DCNs that inherit the properties of the interconnection network. A range of interconnection networks has been identified to which our construction might be applied. A particular instantiation of our construction, the DCN GQ⋆, where the base interconnection network is the generalized hypercube, has been empirically validated as regards network throughput, latency, load-balancing capability, fault-tolerance, and cost to build. In particular, we have shown how GQ⋆, with its routing algorithm GQ⋆-routing, inherited from an existing routing algorithm for the generalized hypercube, consistently outperforms the established DCNs FiConn, with its routing algorithm TOR, and DPillar, with its routing algorithms DPillarSP and DPillarMP. As regards FiConn, the improved performance of GQ⋆ was across all of the metrics we studied, apart from aggregated component cost, where the two DCNs were approximately equal. As regards DPillar, the improved performance was across all metrics apart from mean routed hop-distance. However, in mitigation of DPillar's better mean routed hop-distance, our load-balancing experiments enable us to infer that although DPillar will exhibit lower latency in the case of low traffic congestion, when there is average to high traffic congestion DPillar's propensity to unbalanced loads on its links will mean that GQ⋆ has the better latency. Particularly marked improvements of GQ⋆ over DPillar are as regards the fault-tolerant performance of the respective routing algorithms in link-degraded DCNs and also the aggregated component cost, which in DPillar is around 10% higher than in GQ⋆. When we compare the performance of GQ⋆-routing within GQ⋆ against what is optimally possible in terms of path length, we find that GQ⋆-routing finds paths that are within 2% of the optimal length (0% is realistically possible) and within around 10% for degraded networks with 10% faulty links. This is a relatively small overhead for a routing algorithm that achieves very high connectivity, typically 95% connectivity when 10% of the links are chosen (uniformly at random) to be faulty.

There are a number of open questions arising from this paper that we will investigate in the future. A non-comprehensive list is as follows: analyse the practicalities (floor planning, wiring, availability of local routing, and so on) of packaging the DCNs GQ⋆; perform a broader evaluation using a higher number of DCN architectures and traffic models; refine GQ⋆-routing to produce minimal paths for fault-free networks and compare its performance with the near-optimal algorithm used in this paper; apply the stellar transformation to other well-understood interconnection networks (some of which we have already highlighted); and, finally, explore the effect of the stellar construction on formal notions of symmetry in the base graph and in relation to metrics such as bisection width.
Acknowledgements
This work has been funded by the Engineering and Physical Sciences Research Council (EPSRC) through grants EP/K015680/1 and EP/K015699/1. Dr. Javier Navaridas is also supported by the European Union's Horizon 2020 programme under grant agreement No. 671553 'ExaNeSt'. The authors gratefully acknowledge their support.
References

[1] H. Abu-Libdeh, P. Costa, A. Rowstron, G. O'Shea, and A. Donnelly. Symbiotic routing in future data centers. SIGCOMM Computer Communication Review, 40(4):51–62, August 2010.
[2] M. Al-Fares, A. Loukissas, and A. Vahdat. A scalable, commodity data center network architecture. SIGCOMM Computer Communication Review, 38(4):63–74, October 2008.
[3] M. Armbrust, A. Fox, R. Griffith, A. D. Joseph, R. Katz, A. Konwinski, G. Lee, D. Patterson, A. Rabkin, I. Stoica, and M. Zaharia. A view of cloud computing. Communications of the ACM, 53(4):50–58, April 2010.
[4] L. N. Bhuyan and D. P. Agrawal. Generalized hypercube and hyperbus structures for a computer network. IEEE Transactions on Computers, C-33(4):323–333, April 1984.
[5] J.-Y. Cai, G. Havas, B. Mans, A. Nerurkar, J.-P. Seifert, and I. Shparlinski. On routing in circulant graphs. In Proc. of 5th Int. Conf. on Computing and Combinatorics, volume 1627 of Lecture Notes in Computer Science, pages 360–369. Springer, 1999.
[6] T. Chen, X. Gao, and G. Chen. The features, hardware, and architectures of data center networks: A survey. Journal of Parallel and Distributed Computing, 96:45–74, October 2016.
[7] W. J. Dally and B. Towles. Principles and Practices of Interconnection Networks. Morgan Kaufmann, 2003.
[8] J. Dean and S. Ghemawat. MapReduce: Simplified data processing on large clusters. Communications of the ACM, 51(1):107–113, January 2008.
[9] G. Della Vecchia and C. Sanges. A recursively scalable network VLSI implementation. Future Generation Computer Systems, 4(3):235–243, October 1988.
[10] R. Diestel. Graph Theory, volume 173 of Graduate Texts in Mathematics. Springer, 2012.
[11] A. Erickson, A. Kiasari, J. Navaridas, and I. A. Stewart. An efficient shortest-path routing algorithm in the data centre network DPillar. In Proc. of 9th Int. Conf. on Combinatorial Optimization and Applications, volume 9486 of Lecture Notes in Computer Science, pages 209–220. Springer, 2015.
[12] A. Erickson, A. Kiasari, J. Pascual Saiz, J. Navaridas, and I. A. Stewart. Interconnection Networks Research Flow Evaluation Framework (INRFlow), 2016. [Software] https://bitbucket.org/alejandroerickson/inrflow.
[13] A.-H. Esfahanian and S. L. Hakimi. Fault-tolerant routing in de Bruijn communication networks. IEEE Transactions on Computers, 34(9):777–788, September 1985.
[14] N. Farrington, G. Porter, S. Radhakrishnan, H. H. Bazzaz, V. Subramanya, Y. Fainman, G. Papen, and A. Vahdat. Helios: A hybrid electrical/optical switch architecture for modular data centers. SIGCOMM Computer Communication Review, 40(4):339–350, October 2010.
[15] A. W.-C. Fu and S.-C. Chau. Cyclic-cubes: A new family of interconnection networks of even fixed-degrees. IEEE Transactions on Parallel and Distributed Systems, 9(12):1253–1268, December 1998.
[16] A. Greenberg, J. R. Hamilton, N. Jain, S. Kandula, C. Kim, P. Lahiri, D. A. Maltz, P. Patel, and S. Sengupta. VL2: A scalable and flexible data center network. SIGCOMM Computer Communication Review, 39(4):51–62, October 2009.
[17] C. Guo, G. Lu, D. Li, H. Wu, X. Zhang, Y. Shi, C. Tian, Y. Zhang, and S. Lu. BCube: A high performance, server-centric network architecture for modular data centers. SIGCOMM Computer Communication Review, 39(4):63–74, October 2009.
[18] C. Guo, H. Wu, K. Tan, L. Shi, Y. Zhang, and S. Lu. DCell: A scalable and fault-tolerant network structure for data centers. SIGCOMM Computer Communication Review, 38(4):75–86, October 2008.
[19] D. Guo, T. Chen, D. Li, M. Li, Y. Liu, and G. Chen. Expandable and cost-effective network structures for data centers using dual-port servers. IEEE Transactions on Computers, 62(7):1303–1317, July 2013.
[20] N. Hamedazimi, Z. Qazi, H. Gupta, V. Sekar, S. R. Das, J. P. Longtin, H. Shah, and A. Tanwer. Firefly: A reconfigurable wireless data center fabric using free-space optics. SIGCOMM Computer Communication Review, 44(4):319–330, October 2014.
[21] L.-H. Hsu and C.-K. Lin. Graph Theory and Interconnection Networks. CRC Press, 2009.
[22] F. K. Hwang. A survey on multi-loop networks. Theoretical Computer Science, 299(1–3):107–121, April 2003.
[23] F. T. Leighton. Introduction to Parallel Algorithms and Architectures: Arrays, Trees, Hypercubes. Morgan Kaufmann, 1992.
[24] C. E. Leiserson. Fat-trees: Universal networks for hardware-efficient supercomputing. IEEE Transactions on Computers, 34(10):892–901, October 1985.
[25] D. Li, C. Guo, H. Wu, K. Tan, Y. Zhang, S. Lu, and J. Wu. Scalable and cost-effective interconnection of data-center servers using dual server ports. IEEE/ACM Transactions on Networking, 19(1):102–114, February 2011.
[26] D. Li and J. Wu. On data center network architectures for interconnecting dual-port servers. IEEE Transactions on Computers, 64(11):3210–3222, November 2015.
[27] Z. Li, Z. Guo, and Y. Yang. BCCC: An expandable network for data centers. In Proc. of 10th ACM/IEEE Symp. on Architectures for Networking and Communications Systems, pages 77–88. ACM, 2014.
[28] Y. Liao, J. Yin, D. Yin, and L. Gao. DPillar: Dual-port server interconnection network for large scale data centers. Computer Networks, 56(8):2132–2147, May 2012.
[29] Y. Liu, J. K. Muppala, M. Veeraraghavan, D. Lin, and M. Hamdi. Data Center Networks: Topologies, Architectures and Fault-Tolerance Characteristics. Springer, 2013.
[30] Y. J. Liu, P. X. Gao, B. Wong, and S. Keshav. Quartz: A new design element for low-latency DCNs. SIGCOMM Computer Communication Review, 44(4):283–294, October 2014.
[31] E. A. Monakhova. A survey on undirected circulant graphs. Discrete Mathematics, Algorithms and Applications, 4(1):1250002 (30 pages), March 2012.
[32] J. Navaridas, J. Miguel-Alonso, and F. J. Ridruejo. On synthesizing workloads emulating MPI applications. In Proc. of IEEE Int. Symp. on Parallel and Distributed Processing, pages 1–8. IEEE, April 2008.
[33] R. Niranjan Mysore, A. Pamboris, N. Farrington, N. Huang, P. Miri, S. Radhakrishnan, V. Subramanya, and A. Vahdat. Portland: A scalable fault-tolerant layer-2 data center network fabric. SIGCOMM Computer Communication Review, 39(4):39–50, October 2009.
[34] L. Popa, S. Ratnasamy, G. Iannaccone, A. Krishnamurthy, and I. Stoica. A cost comparison of datacenter network architectures. In Proc. of 6th Int. Conf. on Emerging Networking Experiments and Technologies, article no. 16. ACM, 2010.
[35] D. K. Pradhan and S. M. Reddy. A fault-tolerant communication architecture for distributed systems. IEEE Transactions on Computers, C-31(9):863–870, September 1982.
[36] G. Qu, Z. Fang, J. Zhang, and S.-Q. Zheng. Switch-centric data center network structures based on hypergraphs and combinatorial block designs. IEEE Transactions on Parallel and Distributed Systems, 26(4):1154–1164, April 2015.
[37] A. Singla, C.-Y. Hong, L. Popa, and P. B. Godfrey. Jellyfish: Networking data centers randomly. In Proc. of 9th USENIX Symp. on Networked Systems Design and Implementation. USENIX Association, 2012.
[38] I. A. Stewart. Improved routing in the data centre networks HCN and BCN. In Proc. of 2nd Int. Symp. on Computing and Networking, pages 212–218. IEEE, December 2014.
[39] A. Touzene, K. Day, and B. Monien. Edge-disjoint spanning trees for the generalized butterfly networks and their applications. Journal of Parallel and Distributed Computing, 65(11):1384–1396, November 2005.
[40] T. White. Hadoop: The Definitive Guide. O'Reilly Media, 2009.
[41] C. Wu and R. Buyya. Cloud Datacenters and Cost Modeling. Elsevier, 2015.
[42] J. Xu. Topological Structure and Analysis of Interconnection Networks. Springer, 2010.
[43] S. Young and S. Yalamanchili. Adaptive routing in generalized hypercube architectures. In Proc. of 3rd IEEE Symp. on Parallel and Distributed Processing, pages 564–571. IEEE, December 1991.