An In-Depth Analysis of the Slingshot Interconnect

Daniele De Sensi, Department of Computer Science, ETH Zurich ([email protected])
Salvatore Di Girolamo, Department of Computer Science, ETH Zurich ([email protected])
Kim H. McMahon, Hewlett Packard Enterprise ([email protected])
Duncan Roweth, Hewlett Packard Enterprise ([email protected])
Torsten Hoefler, Department of Computer Science, ETH Zurich ([email protected])
Abstract—The interconnect is one of the most critical components in large-scale computing systems, and its impact on the performance of applications is going to increase with the system size. In this paper, we describe Slingshot, an interconnection network for large-scale computing systems. Slingshot is based on high-radix switches, which allow building exascale and hyperscale datacenter networks with at most three switch-to-switch hops. Moreover, Slingshot provides efficient adaptive routing and congestion control algorithms, and highly tunable traffic classes. Slingshot uses an optimized Ethernet protocol, which allows it to be interoperable with standard Ethernet devices while providing high performance to HPC applications. We analyze the extent to which Slingshot provides these features, evaluating it on microbenchmarks and on several applications from the datacenter and AI worlds, as well as on HPC applications. We find that applications running on Slingshot are less affected by congestion compared to previous generation networks.
Index Terms—interconnection network, dragonfly, exascale, datacenters, congestion
I. INTRODUCTION
The first US exascale supercomputer will be built within two years, marking an important milestone for computing systems. Exascale computing has been a long-awaited goal, which required significant contributions from both academic and industrial research. One of the most critical components having a direct impact on the performance of such large systems is the interconnection network (interconnect). Indeed, by analyzing the performance of the Top500 supercomputers [1] when executing HPL [2] and HPCG [3], two benchmarks commonly used to assess supercomputing systems, we can observe that HPCG is typically characterized by ∼50x lower performance compared to HPL. Part of this performance drop is caused by the higher communication intensity of HPCG, clearly showing that, among others, an efficient interconnection network can help in exploiting the full computational power of the system. The impact of the interconnect on the performance of supercomputing systems increases with the scale of the system, highlighting the need for novel and efficient solutions.

Both the HPC and datacenter communities are following a path towards convergence of HPC, data center, and AI/ML workloads, which poses new challenges and requires new solutions. Workloads are becoming much more data-centric, and large amounts of data need to be exchanged with the outside world. Due to the wide adoption of Ethernet in datacenters, interconnection networks should be compatible with standard Ethernet, so that they can be efficiently integrated with standard devices and storage systems. Moreover, many data center workloads are latency-sensitive. For such applications, tail latency is much more relevant than the best-case or average latency. For example, web search nodes must provide 99th-percentile latencies of a few milliseconds [4]. This is also a relevant problem for HPC applications, whose performance may strongly depend on message latency, especially when using many global synchronizations or small messages. Despite the efforts in improving the performance of interconnection networks, tail latency still severely affects large HPC and data center systems [4]–[7].

To address these issues, Cray (a Hewlett Packard Enterprise (HPE) company since 2019) recently designed the Slingshot interconnection network. Slingshot will power all three announced US exascale systems [8]–[10] and numerous supercomputers that will be deployed soon. It provides some key features, like adaptive routing and congestion control, that make it a good solution for HPC systems but also for cloud data centers. Slingshot switches have 64 ports with 200 Gb/s each and support arbitrary network topologies. To reduce tail latencies, Slingshot offers advanced adaptive routing, congestion control, and quality of service (QoS) features. Those also protect applications from interference, sometimes referred to as network noise [5], [11], caused by other applications sharing the interconnect. Lastly, Slingshot brings HPC features to Ethernet, such as low latency, low packet overhead, and optimized congestion control, while maintaining industry standards. In Slingshot, each port of the switch can negotiate the available Ethernet features with the attached devices, and can communicate with existing Ethernet devices using standard Ethernet protocols, or with other Slingshot switches and NICs by using Slingshot-specific additions. This allows the network to be fully interoperable with existing Ethernet equipment while at the same time providing good performance for HPC systems.

In this study, we experimentally analyze Slingshot's performance features to guide researchers, developers, and system administrators. We use Mellanox ConnectX-5 100 Gb/s Ethernet NICs (200 Gb/s Ethernet NICs were not available at the time of writing) to test the ability of Slingshot to deal with standard RDMA over Converged Ethernet (RoCE) traffic. Moreover, by doing so we can analyze the impact of the switch on the end-to-end performance by factoring out some of the improvements to the Ethernet protocol introduced by Slingshot. We first analyze the latencies of a quiet system. Then, we analyze the impact of congestion on both microbenchmarks and real applications for different configurations, showing that Slingshot is only marginally affected by network noise. To further show the benefits of the congestion control algorithm, we compare Slingshot to Cray's previous Aries network, which has a similar topology and uses a similar routing algorithm.

II. SLINGSHOT ARCHITECTURE
We now describe the Slingshot interconnection network. We first introduce the Rosetta switch and show how switches can be connected to form a Dragonfly [12] topology. We then dive into specific features of Slingshot such as adaptive routing, congestion control, and quality of service management. Lastly, we describe the main characteristics of the Slingshot additions to Ethernet and the software stack.
A. Switch Technology (Rosetta)

The core of the Slingshot interconnect is the Rosetta switch, providing 64 ports at 200 Gb/s per direction. Each port uses four lanes of 56 Gb/s Serializer/Deserializer (SerDes) blocks using Pulse-Amplitude Modulation (PAM-4). Due to Forward Error Correction (FEC) overhead, 50 Gb/s can be pushed through each lane. The Rosetta ASIC consumes up to 250 Watts and is implemented in TSMC's 16 nm process. Rosetta is composed of 32 peripheral function blocks and 32 tile blocks. The peripheral blocks implement the SerDes, Medium Access Control (MAC), Physical Coding Sublayer (PCS), Link Layer Reliability (LLR), and Ethernet lookup functions.

The 32 tile blocks implement the crossbar switching between the 64 ports, as well as adaptive routing and congestion management functionalities. The tiles are arranged in four rows of eight tiles, with two switch ports handled per tile, as shown in Figure 1. The tiles on the same row are connected through 16 per-row buses, whereas the tiles on the same column are connected through dedicated channels with per-tile crossbars. Each row bus is used to send data from the corresponding port to the other 16 ports on the row. The per-tile crossbar has 16 inputs (i.e., from the 16 ports on the row) and 8 outputs (i.e., to the 8 ports on the column). For each port, a multiplexer is used to select one of the four inputs (this is not explicitly shown in the figure for the sake of clarity). Packets are routed to the destination tile through at most two hops. Figure 1 shows an example: if a packet is received on Port 19 and must be routed to Port 56, the packet is first routed on the row bus, then it goes through the 16-to-8 crossbar highlighted in the picture, and then down a column channel to Port 56. Thanks to the hierarchical structure of the tiles, there is no need for a 64-port arbiter, and packets only incur a 16-to-8 arbitration.
Fig. 1: Rosetta switch tiled structure.

The 32 tiles in Rosetta implement a crossbar between the 64 ports. For performance and implementation reasons, the crossbar is physically composed of different function-specific crossbars, each handling a different aspect of the switching traffic:

• Requests to Transmit: To avoid head-of-line (HOL) blocking [13], Rosetta relies on a virtual output-queued architecture [14], [15], where the routing path is determined before sending the data. The data is buffered in the input buffers until the resources are available, guaranteeing no further blocking. Before forwarding the data, a request-to-transmit is sent to the tile corresponding to the switch output port. When a grant to transmit is received from the output port, the data is forwarded.

• Grants to Transmit: Grants to transmit are sent by the tile handling the output port to the tile from which the switch received the packet. In the previous example, the grants would be transmitted from the tile handling Port 56 to the tile handling Port 19. Grants are used to notify the permission to forward the data to the next hop. The use of requests and grants to transmit is a central piece of the QoS management.

• Data: Data is sent on a wider crossbar (48B). To speed up the processing, Rosetta parses and processes the packet header as soon as it arrives, even if the data might still be arriving.

• Request Queue Credits: Credits provide an estimation of queue occupancy. This information is then used by the adaptive routing algorithm (see Section II-C) to estimate the congestion of different paths and to select the least congested one.

• End-to-End Acks: End-to-end acknowledgments are used to track the outstanding packets between every pair of network endpoints. This information is used by the congestion control protocol (see Section II-D).

By using physically separated crossbars, Slingshot guarantees that different types of messages do not interfere with each other and that, for example, large data transfers do not slow down requests and grants to transmit.
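The request/grant handshake lends itself to a compact illustration. The following C fragment is a minimal sketch of the virtual output-queuing idea described above, under our own simplifying assumptions (a single busy flag per output port and a polling loop); it is not Rosetta's actual arbitration logic, which is implemented in hardware and spread across the tiles:

    #include <stdbool.h>
    #include <stdio.h>

    #define PORTS 64

    typedef struct { int dst_port; int bytes; } packet_t;

    static bool output_busy[PORTS];   /* simplified per-output-port state */

    /* Sent on the request crossbar: ask the output tile for permission. */
    static bool request_to_transmit(int out_port) {
        if (output_busy[out_port])
            return false;             /* no grant: packet stays in the input buffer */
        output_busy[out_port] = true; /* grant issued on the grant crossbar */
        return true;
    }

    /* Data moves on the (wider) data crossbar only after the grant,
     * so it never blocks other traffic inside the fabric. */
    static void forward(const packet_t *p) {
        printf("forwarding %d bytes to port %d\n", p->bytes, p->dst_port);
        output_busy[p->dst_port] = false;  /* output port is free again */
    }

    int main(void) {
        packet_t p = { .dst_port = 56, .bytes = 4096 };
        while (!request_to_transmit(p.dst_port))
            ;  /* data waits in the input buffer, not inside the crossbar */
        forward(&p);
        return 0;
    }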
Fig. 2: Distribution of switch latency for RoCE traffic.

To analyze the impact of the switch architecture on the latency, we report in Figure 2 the latency of the switch when dealing with RoCE traffic. It is worth remarking that, because we are using standard RoCE NICs, the NIC sends plain Ethernet frames, and we cannot exploit all the features of Slingshot's specialized Ethernet protocol (Section II-F). Some features, like link-level reliability and propagation of congestion information, are however still used in the switch-to-switch communications. To compute the latency of the switch, we consider the latency difference between 2-hop and 1-hop latencies (we provide details on the topology in Section II-B). We observe that Rosetta has a mean and median latency of 350 nanoseconds, with the entire distribution lying between 300 and 400 nanoseconds, except for a few outliers.
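In other words, the 1-hop path traverses one switch and the 2-hop path traverses two, so subtracting the one-way latencies of the two paths isolates the contribution of exactly one additional Rosetta traversal; all other components (NICs, cables, software stack) cancel out.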
B. Topology

Rosetta switches can be arranged into any arbitrary topology. Dragonfly [12] is the default topology for Slingshot-based systems, and it is the topology we refer to in the rest of the paper. Dragonfly is a hierarchical direct topology, where all the switches are connected to both computing nodes and other switches. Sets of switches are connected to each other, forming so-called groups. The switches inside each group may be connected by using an arbitrary topology, and groups are connected in a fully connected graph. In the Slingshot implementation of Dragonfly (shown in Figure 3), each Rosetta switch is connected to 16 endpoints through copper cables (up to 2.6 meters), using the remaining 48 ports for inter-switch connectivity. The partitioning of these 48 ports between inter- and intra-group connectivity, as well as the number of switches per group, depends on the size of the system. In Slingshot, the switches inside a group are always fully connected through copper cables. Switches in different groups are connected through long optical cables (up to 100 meters). Due to the full connectivity both within each group and between groups, this topology has a diameter of 3 switch-to-switch hops.

Fig. 3: Slingshot topology. In this specific example we show the topology of the largest 1-dimensional Dragonfly network that can be built with the 64-port Rosetta switches (all-to-all within each group, all-to-all amongst groups, 544 global links per group).

Thanks to the low diameter, application performance only marginally depends on the specific node allocation. We report in Figure 4 the latency and the bandwidth between nodes at different distances, and for different message sizes, on an isolated system. We consider nodes connected to ports on the same switch (1 inter-switch hop), connected to two different switches in the same group (2 inter-switch hops), and connected to two different switches in two different groups (3 inter-switch hops). For the same-switch case, we observed no significant difference when using two ports on the same switch tile or on two different tiles.

Fig. 4: Latency and bandwidth for different node distances, on an isolated system. Q1 is the first quartile, Q3 is the third quartile, IQR = Q3 − Q1, S is the smallest sample greater than Q1 − 1.5 · IQR, and L is the largest sample smaller than Q3 + 1.5 · IQR.

First, we observe that, in the worst case, the node allocation has only a small impact on the latency for the smallest messages and that, starting from KiB-sized messages, we observe only marginal differences in latency between the different node distances. The same holds for bandwidth, with only marginal differences between the different distances across all the message sizes. In some cases, we observe a slightly higher bandwidth when the nodes are in two different groups, because more paths connect the two nodes, increasing the available bandwidth.

In the largest system (shown in Figure 3), each group has 32 switches (for a total of 32 × 16 = 512 nodes), and the switches inside each group are fully connected by using 31 switch ports. The remaining 17 ports of each switch are used to globally connect all the groups in a fully connected network. In this specific case, because each group contains 32 switches and each switch uses 17 ports to connect to other groups, each group has 32 × 17 = 544 connections towards other groups. This leads to a system with 545 groups, each of which is connected to 512 nodes, for a total of 279 040 endpoints at full global bandwidth (in practice, the addressing scheme limits the number of groups to 511, for a total of 261 632 nodes). This number of endpoints satisfies the demands of both exascale supercomputers and hyperscale data centers. Indeed, this is larger than the number of servers used in data centers [16], and much larger than the number of nodes used by Summit [17], the fastest supercomputer at the time of writing, which relies on ∼4 600 nodes and delivers ∼150 PFlop/s. Thanks to this large number of endpoints, each computing node can have multiple connections to the same network, increasing the injection bandwidth and improving network resiliency in case of NIC failures.
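The sizing above can be reproduced with a few lines of arithmetic. The following C sketch (our back-of-the-envelope reconstruction of the numbers in the text) derives the group and endpoint counts of the largest configuration:

    #include <stdio.h>

    int main(void) {
        const int ports            = 64;   /* Rosetta radix          */
        const int nodes_per_switch = 16;   /* endpoints per switch   */
        const int switches_per_grp = 32;   /* fully connected group  */

        int intra  = switches_per_grp - 1;              /* 31 intra-group ports */
        int global = ports - nodes_per_switch - intra;  /* 17 global ports      */
        int global_per_grp = switches_per_grp * global; /* 544                  */
        int groups = global_per_grp + 1;                /* 545                  */
        int nodes  = groups * switches_per_grp * nodes_per_switch;

        printf("%d groups, %d endpoints\n", groups, nodes);  /* 545, 279040 */
        return 0;
    }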
C. Routing

In Dragonfly networks (including Slingshot), any pair of nodes is connected by multiple minimal and non-minimal paths [12], [18]. For example, considering the topology in Figure 3, the minimal path connecting N0 to N496 includes the switches S0 and S31. In smaller networks, due to link redundancy, multiple minimal paths connect any pair of nodes [18]. On the other hand, a possible non-minimal path involves an intermediate switch that is directly connected to both S0 and S31. The same holds for nodes located in different groups. In this case, a non-minimal path crosses an intermediate group.

Sending data on minimal paths is clearly the best choice on a quiet network. However, in a congested network with multiple active jobs, those paths may be slower than longer but less congested ones. To provide the highest throughput and lowest latency, Slingshot implements adaptive routing: before sending a packet, the source switch estimates the load of up to four minimal and non-minimal paths and sends the packet on the best path, selected by considering both the paths' congestion and length. The congestion is estimated by considering the total depth of the request queues of each output port. This congestion information is distributed on the chip by using a ring to all the forwarding blocks of each input port. It is also communicated between neighboring switches by carrying it in the acknowledgement packets. The total overhead for congestion and load information is an average of four bytes in the reverse direction for every packet in the forward direction. As more packets take non-minimal paths, and the average hop count per packet therefore increases, both the latency and the link utilization increase. Therefore, Slingshot adaptive routing biases packets to take minimal paths more frequently, to compensate for the higher cost of non-minimal paths.
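One way to think about this decision is a per-packet scoring of the candidate paths. The following C sketch is only our interpretation of such a biased selection: the exact scoring function and the NON_MINIMAL_BIAS constant are illustrative assumptions, since the text only states that queue depth (congestion) and path length are combined, with a preference for minimal paths:

    #include <limits.h>

    #define CANDIDATES 4        /* up to four minimal and non-minimal paths */
    #define NON_MINIMAL_BIAS 2  /* hypothetical penalty for the extra hops  */

    typedef struct {
        int queue_depth;  /* request-queue occupancy towards this path */
        int non_minimal;  /* 1 if the path crosses an intermediate hop */
    } path_t;

    int select_path(const path_t paths[CANDIDATES]) {
        int best = 0, best_score = INT_MAX;
        for (int i = 0; i < CANDIDATES; i++) {
            int score = paths[i].queue_depth
                      * (paths[i].non_minimal ? NON_MINIMAL_BIAS : 1);
            if (score < best_score) { best_score = score; best = i; }
        }
        return best;  /* least congested path, after biasing towards minimal */
    }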
D. Congestion Control

Two types of congestion might affect an interconnection network: endpoint congestion and intermediate congestion [6]. Endpoint congestion mostly occurs on the last-hop switches, whereas intermediate congestion is spread across the network. Adaptive routing improves network utilization and application performance by changing the path of the packets to avoid intermediate congestion. However, even if adaptive routing can bypass congested intermediate switches, all the paths between two nodes are affected in the same way by endpoint congestion. As we show in Section III-A, this was a relevant issue on other networks, particularly for many-to-one traffic. In this case, due to the highly congested links on the receiver side, adaptive routing would spread the packets over the different paths without being able to avoid congestion, because it is occurring in the last hop.

Congestion control helps in mitigating this problem by decreasing the injection bandwidth of the nodes generating the congestion. However, existing congestion control mechanisms (like ECN [19] and QCN [20], [21]) are not suited for HPC scenarios. They work by marking packets that experience congestion. When a node receives a packet that has been marked, it asks the sender to slow down its injection rate. These congestion control algorithms work relatively well in the presence of large-volume and stable communications (known as elephant flows), but tend to be fragile, hard to tune [22], [23], and generally unsuitable for bursty HPC workloads. Indeed, in standard congestion control algorithms, the control loop is too long to adapt fast enough, and while converging to the correct transmission rate, the offending traffic can still interfere with other applications.

To mitigate this problem, Slingshot introduces a sophisticated congestion control algorithm, entirely implemented in hardware, that tracks every in-flight packet between every pair of endpoints in the system. Slingshot can distinguish between jobs that are victims of congestion and those that are contributing to congestion, applying stiff and fast back-pressure to the sources that are contributing to congestion. By tracking all the endpoint pairs individually, Slingshot only throttles those streams of packets that are contributing to the endpoint congestion, without negatively affecting other jobs or other streams of packets within the same job that are not contributing to congestion. This frees up buffer space for the other jobs, avoiding HOL blocking across the entire network and reducing tail latencies, which are particularly relevant for applications characterized by global synchronizations.

The approach to congestion control adopted by Slingshot is fundamentally different from more traditional approaches such as ECN-based congestion control [19], [20], and leads to good performance isolation between different applications, as we show in Section III-A.
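Conceptually, the mechanism amounts to a table of outstanding packets per (source, destination) pair, updated by the end-to-end acknowledgments of Section II-A. The C sketch below illustrates this idea only; the flat table and the fixed THRESHOLD are our assumptions, whereas the real mechanism is implemented in switch hardware and applies its back-pressure dynamically:

    #include <stdbool.h>
    #include <stdint.h>

    #define ENDPOINTS 1024   /* illustrative system size               */
    #define THRESHOLD 64     /* hypothetical outstanding-packet budget */

    static uint16_t outstanding[ENDPOINTS][ENDPOINTS];

    void on_packet_sent(int src, int dst)  { outstanding[src][dst]++; }
    void on_ack_received(int src, int dst) { outstanding[src][dst]--; }

    /* Checked at injection time: only the pairs that keep piling up
     * packets towards a congested endpoint are throttled; victims of
     * the congestion keep their full injection rate. */
    bool may_inject(int src, int dst) {
        return outstanding[src][dst] < THRESHOLD;
    }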
E. Quality of Service (QoS)

Whereas congestion control partially protects jobs from mutual interference, jobs can still interfere with each other. To provide complete isolation, in Slingshot jobs can be assigned to different traffic classes, with guaranteed quality of service. QoS and congestion control are orthogonal concepts. Indeed, because traffic classes are expensive resources requiring large amounts of switch buffer space, each traffic class is typically shared among several applications, and congestion control still needs to be applied within a traffic class.

Each traffic class is highly tunable and can be customized by the system administrator in terms of priority, required packet ordering, minimum bandwidth guarantees, maximum bandwidth constraint, lossiness, and routing bias [5]. The system administrator guarantees that the sum of the minimum bandwidth requirements of the different traffic classes does not exceed the available bandwidth. Network traffic can be assigned to traffic classes on a per-packet basis. The job scheduler will assign to each job a small number of traffic classes, and the user can then select on which class to send its application traffic. In the case of MPI, this is done by specifying the traffic class identifier in an environment variable. Moreover, communication libraries could even change traffic classes at a per-message (or per-packet) granularity. For example, MPI could assign latency-sensitive collective operations such as MPI_Barrier and MPI_Allreduce to high-priority, low-bandwidth traffic classes, and bulk point-to-point operations to higher-bandwidth, lower-priority classes.

Traffic classes are completely implemented in the switch hardware. A switch determines the traffic class required for a specific packet by using the Differentiated Services Code Point (DSCP) tag in the packet header [24]. Based on the value of the tag, the switch assigns the packet to one of multiple virtual queues. Each switch will allocate enough buffers to each traffic class to achieve the desired bandwidth, whereas the remaining buffers will be dynamically allocated to the traffic which is not assigned to any specific traffic class.
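For standard RoCE/IP traffic, the DSCP value can be set from user space through the IP type-of-service byte, which is how an application or middleware could steer packets towards a given class. The snippet below uses the standard setsockopt interface; the value 46 (Expedited Forwarding) is only an example, since the mapping from DSCP values to Slingshot traffic classes is defined by the system administrator:

    #include <netinet/in.h>
    #include <sys/socket.h>

    /* Tag all packets of this socket with the given DSCP value.
     * DSCP occupies the 6 high bits of the 8-bit TOS field. */
    int set_dscp(int sockfd, int dscp) {
        int tos = dscp << 2;
        return setsockopt(sockfd, IPPROTO_IP, IP_TOS, &tos, sizeof(tos));
    }

    /* e.g., set_dscp(fd, 46) to request an Expedited-Forwarding-like class */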
F. Ethernet Enhancements

To improve interoperability, and to better suit datacenter scenarios, Slingshot is fully Ethernet compatible and can seamlessly be connected to third-party Ethernet-based devices and networks. Slingshot provides additional features on top of standard Ethernet, improving its performance and making it more suitable for HPC workloads. Slingshot uses this enhanced protocol for internal traffic, but it can mix it with standard Ethernet traffic on all ports at packet-level granularity. This allows Slingshot to achieve high performance while at the same time being able to communicate with standard Ethernet devices, allowing it to be used efficiently in both the supercomputing and datacenter worlds.

To improve performance, Slingshot reduces the 64-byte minimum frame size to 32 bytes, allows IP packets to be sent without an Ethernet header, and removes the inter-packet gap. Lastly, Slingshot provides resiliency at different levels by implementing low-latency Forward Error Correction (FEC) [25], Link-Level Reliability (LLR) to tolerate transient errors, and lane degrade [26] to tolerate hard failures. Moreover, the Slingshot NIC provides end-to-end retry to protect against packet loss. These are relevant features in high-performance networks. For example, FEC is required for all Ethernet systems at 100Gb/s or higher, independently of the system size, and LLR is useful in large systems (such as hyperscale data centers) to localize the error handling and reduce end-to-end retransmissions.
G. Software Stack

Communication libraries can either use the standard TCP/IP stack or, in the case of high-performance communication libraries such as MPI [27], [28], Chapel [29], PGAS [30], and SHMEM [31], the libfabric interface [32]. Cray contributed new features to the libfabric open-source verbs provider and RxM utility provider to support the Slingshot hardware. All HPC traffic is layered over RDMA over Converged Ethernet (RoCEv2), and data is sent over the network through packets containing up to 4KiB of data plus headers and trailers. Headers and trailers include Ethernet (26 bytes including the preamble), IPv4 (20 bytes), UDP (8 bytes), InfiniBand (14 bytes), and an additional RoCEv2 CRC (4 bytes), for a total of 72 bytes.
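For a full 4KiB payload, this overhead translates into a wire efficiency of 4096 / (4096 + 72) ≈ 98.3%; smaller packets pay a proportionally larger header cost, which is one of the reasons why the reduced 32-byte minimum frame size of Section II-F matters for small-message traffic.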
Cray MPI is derived from MPICH [33] and implements the MPI-3.1 standard. Proprietary optimizations and other enhancements have been added to Cray MPI, targeted specifically at the Slingshot hardware. Any MPI implementation supporting libfabric can be used out of the box on Slingshot. Moreover, standard APIs for some features, like traffic classes, have been recently added to libfabric and could be exploited as well. We report in Figure 5 the latencies for different message sizes and for different network protocols. We observe that for small message sizes, MPI adds only a marginal overhead to libfabric.
Fig. 5: Half round trip time (RTT/2) for different message sizes (x-axis) and software layers.

Moreover, we show in Figure 6 the bisection bandwidth (i.e., the bandwidth when half of the nodes send data to the other half of the nodes and vice versa) and the MPI_Alltoall bandwidth on Shandy, a Slingshot-based system with 1024 nodes (see Section III for details). We report the results for different processes per node (PPN) and different message sizes. This system is composed of eight groups, and all the bisection cuts cross the same number of links. In this system, each group has 56 global links out of 112 (8 towards each other group), to match the injection bandwidth. Each of the 4 groups in one partition is connected to each of the 4 groups in the other partition, and the total number of links crossing a bisection cut is 4 · 4 · 8 = 128. Because each link has a 200Gb/s bandwidth, and we are sending traffic in both directions, the peak bisection bandwidth is 2 · 128 · 200 Gb/s = 51.2 Tb/s. In an all-to-all communication, each node sends 7/8 of the traffic to nodes in the other 7 groups and 1/8 of the traffic to nodes in the same group. Because this system has 8 · 56 = 448 global link endpoints (each link is counted by both of its groups, accounting for traffic in both directions), the all-to-all maximum bandwidth is 8/7 · 448 · 200 Gb/s = 102.4 Tb/s. Note that MPI_Alltoall can achieve twice the bisection bandwidth because half of the connections terminate in the same partition [34]. The plot shows that the MPI_Alltoall reaches more than 90% of the theoretical peak bandwidth, without any packet loss. We observe a performance drop for 256-byte messages because, to reduce memory usage, the MPI implementation switches to a different algorithm [35] for messages larger than 256 bytes.
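These two bounds can be verified with a few lines of code. The following C sketch simply reproduces the arithmetic above; the constants come from the system description, and everything else is bookkeeping:

    #include <stdio.h>

    int main(void) {
        const double link_gbs    = 200.0;  /* per link, per direction     */
        const int groups         = 8;
        const int links_per_pair = 8;      /* global links per group pair */

        int cut = 4 * 4 * links_per_pair;                /* 128 links cross the cut */
        double bisection = 2 * cut * link_gbs / 1000.0;  /* 51.2 Tb/s               */

        int global = groups * 7 * links_per_pair;        /* 448 link endpoints      */
        double alltoall = 8.0 / 7.0 * global * link_gbs / 1000.0;  /* 102.4 Tb/s    */

        printf("bisection %.1f Tb/s, alltoall %.1f Tb/s\n", bisection, alltoall);
        return 0;
    }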
Fig. 6: Bisection and MPI_Alltoall bandwidth on all the nodes of Shandy, for different processes per node (PPN) and message sizes. The x-axis is in logarithmic scale.

III. PERFORMANCE STUDY
We now study the performance of the Slingshot interconnect on real applications and microbenchmarks, focusing on two key features of Slingshot, namely congestion control and quality of service management. For our analysis, we consider the following systems:

• Crystal: A system based on the Cray Aries interconnect [48]. This system has 698 nodes. The CPUs on the nodes are Intel Xeon E5-269x. The system is composed of two groups, each containing at most 384 nodes.

• Malbec: A Slingshot system with 484 nodes. CPUs on the nodes are either Intel Xeon Gold 61xx or Intel Xeon Platinum 81xx CPUs. The system is composed of four groups, each containing at most 128 nodes. Each group is connected to the other groups through 48 global links operating at 200Gb/s each. Each node has a Mellanox ConnectX-5 EN NIC.

• Shandy: A Slingshot system with 1024 nodes. Compute nodes are equipped with 64-core AMD EPYC Rome CPUs. The system is composed of eight groups, each containing 128 nodes. Each group is connected to the other groups through 56 global links operating at 200Gb/s each. Each node has two Mellanox ConnectX-5 EN NICs, each connected to a different switch of the same network, allowing a better load distribution and resilience in the event of NIC failures.

We consider two Slingshot systems of different sizes to analyze the performance at different system scales. For all the experiments, we booked these systems for exclusive use, to have a controlled environment and avoid interference caused by other users.
A. Congestion Control

To evaluate the ability of Slingshot to react to congestion, we divide the nodes of the system in two partitions: victim nodes and aggressor nodes. The aggressor nodes generate congestion that impacts the performance of victim nodes. We consider two types of congestion patterns, endpoint congestion and intermediate congestion, and we use the GPCNet code [6] to generate them. We generate endpoint congestion through a many-to-one (incast) communication pattern, where a number of nodes send data to the same endpoint by using MPI_Put, and intermediate congestion by using an all-to-all pattern implemented through MPI_Sendrecv (both patterns are sketched in the code fragment after Figure 7). Both aggressors exchange 128KiB messages. This decision is based on characterization studies on production systems, which show average message sizes of this magnitude both in collective and point-to-point communications [49].

We consider the victim applications described in Table I. Moreover, we also analyze the impact of congestion on microbenchmarks, including standard MPI operations and the ember microbenchmarks [50], which reproduce some common communication patterns in HPC applications (halo3d, sweep3d, and incast). We first consider the results on 512 nodes. Then, we show the results for different node counts. We consider the following victim/aggressor splits: 460/52 (∼90%/10%), 256/256 (∼50%/50%), and 53/459 (∼10%/90%). Because the implementation of some MPI collectives changes according to the number of nodes used, we have chosen these splits so that we run the victim with power-of-two (256), even (460), and odd (53) numbers of nodes. To further increase the generated congestion, in some experiments we increase the number of processes per node (PPN) used by the aggressor. Each node used by the aggressor spawns PPN processes, each of them performing the same communications; that is, the congestion pattern is concurrently executed PPN times.

Moreover, the allocation of the nodes to victims and aggressors determines how many switches and groups are shared between the two jobs, and has a direct impact on the performance of the victim. In our experiments, we consider the three well-known allocation placement strategies [51] depicted in Figure 7: linear, where we allocate the first n nodes to the victim and the remaining nodes to the aggressor; interleaved, where we interleave the nodes allocated to the victim and the aggressor; and random, where we randomly allocate the nodes to the victim and the aggressor.

We make sure that the data we report is statistically sound [52]: for each microbenchmark, we execute the victim at least 200 times and for at least 4 seconds. We stop the benchmark when both of the previous conditions are satisfied and the 95% confidence interval is within 5% of the median. We then consider for each iteration the maximum time among the ranks. For the applications, we consider the time reported by the application, which we execute multiple times until the 95% confidence interval is within 5% of the median.
Fig. 7: Different victim/aggressor allocations.
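As an illustration of the two aggressor patterns, the following MPI sketch is our reconstruction of their communication structure; the actual benchmark is GPCNet [6], and details such as the rotating partner schedule, the window setup, and the iteration count are our simplifications:

    #include <mpi.h>

    #define MSG 131072  /* 128KiB messages, as in the experiments */

    /* Endpoint congestion: every aggressor rank targets the same endpoint. */
    static void incast_step(MPI_Win win, int target, char *buf) {
        MPI_Win_lock(MPI_LOCK_SHARED, target, 0, win);
        MPI_Put(buf, MSG, MPI_CHAR, target, 0, MSG, MPI_CHAR, win);
        MPI_Win_unlock(target, win);
    }

    /* Intermediate congestion: a shift-based all-to-all over MPI_Sendrecv. */
    static void alltoall_step(int rank, int size, int shift,
                              char *sbuf, char *rbuf) {
        int to   = (rank + shift) % size;
        int from = (rank - shift + size) % size;
        MPI_Sendrecv(sbuf, MSG, MPI_CHAR, to,   0,
                     rbuf, MSG, MPI_CHAR, from, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        static char winbuf[MSG], sbuf[MSG], rbuf[MSG];
        MPI_Win win;
        MPI_Win_create(winbuf, MSG, 1, MPI_INFO_NULL, MPI_COMM_WORLD, &win);

        for (int iter = 1; iter <= 1000; iter++) {
            if (rank != 0) incast_step(win, 0, sbuf);      /* many-to-one */
            /* or: alltoall_step(rank, size, 1 + iter % (size - 1),
                                 sbuf, rbuf);                 all-to-all  */
        }

        MPI_Win_free(&win);
        MPI_Finalize();
        return 0;
    }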
TYPE | APPL. | DESCRIPTION

HPC:
• MILC: A set of numerical simulation codes working on quantum chromodynamics (QCD) [36]. We use the su3_rmd kernel, which decomposes a four-dimensional grid and mostly performs point-to-point neighbour communications and global reductions [37].
• HPCG: A set of communication and computational patterns matching a wide set of applications. It relies on sparse triangular solvers and preconditioned conjugate gradient algorithms [3]. It mostly uses stencil communications and global reductions.
• LAMMPS: A molecular dynamics code that models an ensemble of particles in a liquid, solid, or gaseous state [38]. This kernel performs reductions and point-to-point blocking and non-blocking communications between nodes at different distances.
• FFT: Fast Fourier Transform on a 3D domain [39]. It employs broadcasts, scatters, and point-to-point communications [40].
• Resnet-proxy: An ML/AI proxy application [41] reproducing the communication phases of a Deep500 benchmark [42] Residual Neural Network (resnet). This application uses non-blocking reduction operations.

DC:
• Silo: A fast in-memory transactional database [43], widely used in online transaction processing (OLTP) systems.
• Sphinx: A speech recognition system [44], involving probabilistic pruning of a large search tree.
• Xapian: A search engine [45] using a search index built from a snapshot of the English version of Wikipedia. Multiple queries are executed, with a distribution similar to that of online search queries.
• Img-dnn: An application using a deep neural network-based autoencoder to identify handwritten characters [46].

TABLE I: Applications used as victims in the congestion tests. We consider both HPC and datacenter (DC) applications. Img-dnn, Xapian, Sphinx, and Silo are all single-client, single-server applications coming from the Tailbench benchmark [47] for latency-sensitive datacenter applications. We selected this subset because it covers a wide range of latencies, from microseconds (Silo) to seconds (Sphinx).
We report in Figure 8 the time distribution for the Tailbench applications, both when executed in isolation and when executed with an incast aggressor, on both Aries and Slingshot. We also annotate the 95th and 99th percentiles, to show the impact of tail latency. We executed these experiments using the linear allocation and a fixed victim/aggressor split. For Silo, Xapian, and Img-dnn we observe severe performance degradation due to congestion on Aries, whereas we do not observe any relevant effect on Slingshot. For Sphinx, we observe a smaller degradation because its communication-to-computation ratio is lower than that of the other applications. Moreover, we observe a higher tail latency on Aries, which further increases in the presence of congestion. It is worth remarking that the congestion impact itself is enough to characterize how much Slingshot is affected by congestion. In addition, we are also comparing Slingshot with an Aries interconnect, to also show the improvements compared to an existing interconnection network. Moreover, a performance degradation similar to the one we observed on Aries has also been observed on other interconnects [6], [11], [53].

Fig. 8: Time distribution of Tailbench applications, with and without endpoint congestion. The labels on the top of each plot denote the 95th and 99th percentiles.

Due to the large number of combinations of victims, aggressors, and allocations, we provide a data summary of the linear allocation results as a heatmap in Figure 9. Each element of the heatmap represents the mean congestion impact C [6], i.e.,

    C = T_c / T_i,    (1)

where T_i is the mean execution time of the victim when executed in isolation, and T_c is the mean execution time of the victim when co-executed with the aggressor. For example, the element in the top left corner represents the scenario where MILC is executed together with an all-to-all aggressor. 10% of the nodes are allocated to the aggressor, whereas the remaining nodes are allocated to the victim. For this specific case, no significant congestion impact is observed. On the other hand, MILC experiences a 1.6x slowdown on Aries due to endpoint congestion (incast) when 10% of the nodes are allocated to the aggressor. For the same scenario, we do not observe any slowdown on Slingshot.

Fig. 9: Congestion effects on different victim and aggressor combinations. Each element of the heatmap represents the congestion impact of the aggressor on the victim.

We report the applications and microbenchmarks results using two different (logarithmic) color scales, to better appreciate the differences. Indeed, applications are usually less affected
LINGSHOT is always lessaffected by congestion compared to A
RIES . In the worst case,we observed a 1.3x slowdown on S
LINGSHOT , compared toa maximum 93x slowdown on A
RIES . Moreover, the conges-tion impact increases when increasing the fraction of nodesallocated to the aggressor application, and has a larger impacton small message communications, due to the larger impactof end-to-end latency on the overall performance. The effectsof congestion can be seen not only on microbenchmarksbut also on full applications. For example, LAMMPS is 17xslower when executed together with an incast aggressor witha 50/50 split on A
RIES . Intermediate congestion (generatedthrough all-to-all communication), does not significantly affectthe systems we are analyzing, because the adaptive routingalgorithm successfully routes the packets around the congestedlinks. This means that, the additional load generated by the all-to-all does not manifest as congestion.Similar trends can be observed also for different node count,higher PPNs, and other allocations. For space reasons, we donot report all the heatmaps for each of these cases. Instead,we summarize each heatmap by showing the distribution ofthe heatmap elements (congestion impacts, Equation 1) acrossall the victim/aggressor combinations. We show the result ofthis comparison in Figure 10.First, we show in Figure 10 ( A ) the congestion impact for different allocations. For example, for the linear allocation, weare showing the same data of Figure 9. However, instead ofshowing all the individual congestion impacts, we now reporttheir distribution. For readability purposes, we cut the longtails of the distributions, and we annotate on top of each violinthe maximum value. We observe that whereas on A RIES thecongestion impact for the linear allocation is never higher than100, for the interleaved and random allocations we observevalues up to 150. We observed a similar effect on S
LINGSHOT but on a different scale. In this case, in all but one cases weobserve congestion impact values lower than two. Moreover,differently from A
RIES , the distribution on S
LINGSHOT is lessspread, which indicates that the congestion control algorithmis performing well across a wide set of victims and allocations.In Figure 10 ( B ), we report a similar analysis, but now theaggressors are using 24 processes per node (PPN) instead of1, thus generating a higher load on the network. In this case,the impact of congestion increases for A RIES , especially forrandom allocations. On the other hand, S
LINGSHOT is onlyminimally impacted, showing a maximum congestion impact ∼ times lower than on A RIES .Lastly, in Figure 10 ( C ) we report the congestion impactwhen using fewer nodes (128). Until now, we compared aS LINGSHOT system (S
HANDY , 1024 nodes) against a smallerA
RIES one (C
RYSTAL , 698 nodes). To factor out possibleperformance variations coming from different system sizes,we now compare C
RYSTAL with a smaller S
LINGSHOT system(M
ALBEC , 484 nodes). We also fix the number of nodes perdragonfly group to 64, in order to allocate the same number ofgroups (two) in both cases. On A
Fig. 10: Congestion impact distribution across different victim/aggressor combinations, for different allocations, node counts, and processes per node (PPN).

Fig. 11: Congestion impact on all the nodes of Shandy.

On Aries, the maximum congestion impact goes from 154 (Figure 10 (A)) to 40 (Figure 10 (C)) when using 128 nodes instead of 512. This can be explained by the lower generated traffic (the aggressors now have fewer nodes), but also by the higher fraction of available global bandwidth. On Slingshot, the same experiment makes the maximum congestion impact go from 2.3 to 1.5. We conclude that Slingshot is less affected by congestion, even when varying the system size and the number of allocated nodes.

The results of Figure 11 show the congestion impact on the applications when using all the nodes of Shandy. We report the data when using a random allocation because that is the one generating the most congestion (see Figure 10). We can observe that even at full system scale the congestion control effectively protects applications from congestion, with a maximum 3.55x slowdown on LAMMPS when 75% of the nodes are allocated to the incast congestor. Data on MILC and HPCG with a 25%/75% aggressor/victim ratio is missing: those runs would require 768 victim nodes, but these applications can only run on a power-of-two number of nodes.

We complete our analysis of the effects of congestion by analyzing the impact of bursty congestion on Slingshot.
Indeed, in the previous experiments we always considered persistent congestion, generated by sending messages with a fixed size of 128KiB during the entire victim execution. To analyze the impact of bursty congestion, we execute a 128-byte MPI_Alltoall microbenchmark (victim) with an incast aggressor. This is one of the cases where we observed the highest congestion impact on Slingshot (see Figure 9). We run this test on all the Malbec nodes, splitting them equally between aggressor and victim, with an interleaved allocation strategy.

We report the results of this analysis in Figure 12. Each heatmap corresponds to a different message size for the incast aggressor. On each heatmap we report the congestion impact when varying the number of messages in a burst (Burst Size, on the y-axis) and the time between two subsequent congestion bursts (Bursts Gap, on the x-axis). For example, the bottom-left element in the first heatmap represents the case where the aggressor sends a burst of consecutive messages, each containing 8 bytes, and waits 1 microsecond before sending the next burst.

Fig. 12: Impact of incast congestion on a 128-byte MPI_Alltoall. We show the impact for different message sizes, congestion duration, and time between subsequent congestion bursts.

We observe that the incast aggressor does not affect the victim when sending too small or too large messages. Indeed, small messages do not generate enough congestion, whereas for large messages the congestion control algorithm fully kicks in and throttles the aggressor. On the other hand, for medium-size messages, some congestion builds up before the congestion control algorithm detects and reacts to it, and we observe an increase in the congestion impact, up to 1.22. However, as we showed in Figure 9, this is negligible when compared to what happens on other types of systems. Moreover, we observe the highest congestion impact for large bursts and for small gaps between subsequent bursts. We also observe no difference between the largest bursts and persistent congestion. This shows that Slingshot tolerates both persistent congestion and bursty, short-lived congestion.
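The bursty aggressor is a small variation of the incast pattern sketched after Figure 7. In this fragment (again our reconstruction, with hypothetical parameter names), burst_size and gap_us correspond to the y- and x-axes of Figure 12:

    #include <mpi.h>
    #include <unistd.h>

    /* One aggressor rank: bursts of `burst_size` RDMA puts of `msg_bytes`
     * each, separated by a pause of `gap_us` microseconds. */
    static void bursty_incast(MPI_Win win, int target, char *buf,
                              int msg_bytes, int burst_size, int gap_us) {
        for (;;) {
            MPI_Win_lock(MPI_LOCK_SHARED, target, 0, win);
            for (int i = 0; i < burst_size; i++)
                MPI_Put(buf, msg_bytes, MPI_CHAR, target, 0,
                        msg_bytes, MPI_CHAR, win);
            MPI_Win_unlock(target, win);  /* completes the whole burst    */
            usleep(gap_us);               /* the bursts gap of Figure 12  */
        }
    }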
B. Traffic Classes
We now evaluate the ability of Slingshot to provide performance guarantees to running jobs by using traffic classes. It is worth remarking that traffic classes and congestion control are orthogonal concepts. Traffic classes can be used to protect a job (or parts of it) from other traffic, and they can allocate resources fairly or unfairly between users and jobs. However, even if resources are assigned fairly, congestion can still occur due to jobs filling up the buffers. Congestion control is used to avoid such situations within and across traffic classes.

Fig. 13: Congestion impact for an 8B MPI_Allreduce, co-executed with a 256KiB MPI_Alltoall on Malbec (with a 25% tapering), with and without traffic classes.

All the experiments presented in the following have been executed on Malbec. We taper the bandwidth to 25% of the available bandwidth, to force co-running jobs to interfere with each other. We execute a job performing an 8B MPI_Allreduce together with a job performing a 256KiB MPI_Alltoall. Each job uses 64 nodes and 16 processes per node, and they are placed using the interleaved allocation. We report in Figure 13 the congestion impact of the MPI_Allreduce when using the same traffic class as the MPI_Alltoall and when using a separate traffic class. Each point represents the mean over 100 000 runs. The MPI_Alltoall is started around 0.4 milliseconds after the beginning of the test. We observe that when the MPI_Allreduce runs in the same traffic class as the MPI_Alltoall, it experiences a congestion impact of 2.85 (i.e., it is 2.85 times slower compared to when executed in isolation). On the other hand, when executed in a separate traffic class, it only experiences a 1.15x slowdown compared to the isolated case.
We now further investigate the capacity of Slingshot to enforce specific limits on traffic classes. We execute two jobs, each running a bisection bandwidth test, with the second one starting 0.9 milliseconds after the beginning of the test. Each job uses 16 processes per node and runs on 64 nodes. Jobs are placed by using the interleaved allocation. We configure two traffic classes: TC1, with a minimum bandwidth requirement of 80% of the available bandwidth, and TC2, with a minimum bandwidth requirement of 10%.

We report the results of this experiment in Figure 14. In the upper part, we report the results we obtain when both jobs run on the same traffic class (TC1). At the beginning of the execution, the first job runs on an empty system and gets 100% of the available bandwidth. When the second job starts, the available bandwidth is fairly shared between the two jobs. Eventually, when the first job terminates, the second job ramps up and uses all the available bandwidth.

Fig. 14: Performance of two bisection bandwidth tests on Malbec (with a 25% tapering) when running in the same traffic class (top) and when running in two separate traffic classes (bottom).

In the lower part of Figure 14, we report the results when the first job runs in TC1 and the second job runs in TC2. In this case, when the second job starts, the bandwidth of the first job drops to 80% of the available bandwidth, matching the minimum bandwidth required for TC1. The second job required a minimum bandwidth of 10%, and it gets 20% of the available bandwidth. Indeed, there is an extra 10% of bandwidth which was not allocated to either TC1 or TC2. Slingshot decides to dynamically allocate this extra bandwidth to TC2 because it is the traffic class with the lowest bandwidth share. Eventually, when the first job terminates, the second job uses all the available bandwidth.
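The shares observed in Figure 14 follow directly from the configured minima: TC1 reserves 80% and TC2 reserves 10%, leaving 100% − 80% − 10% = 10% unreserved; handing this slack to the class with the lowest share yields 10% + 10% = 20% for TC2, which is exactly the plateau measured for the second job.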
IV. STATE OF THE ART

A. Interconnection Networks
Existing large-scale computing systems are characterized by different types of interconnection networks, either based on open standards or on proprietary technology. These networks have different topologies and provide different features. In this section, we highlight the main characteristics of the most common and actively developed interconnection networks, to better understand the similarities and differences with Slingshot.

InfiniBand is an open standard for high-performance network communications. Different vendors manufacture InfiniBand switches and interfaces, and the InfiniBand standard is not tied to any specific network topology. The most commonly used InfiniBand implementations rely on Mellanox hardware, with switches arranged in a fat tree topology [54]. For example, both Sierra [55] and Summit [17], the two fastest supercomputers at the time of writing, use such a configuration. Mellanox networks also provide other features to improve application performance, such as switch offloading of MPI collective operations, adaptive routing, congestion control, and traffic classes. However, congestion control is usually not used in large production systems due to difficulties in the tuning of the algorithm [6]. Regarding interoperability with Ethernet, Mellanox adopts a different approach than Slingshot, requiring traffic to be converted between InfiniBand and Ethernet by using dedicated gateways.
Cray Aries [48] is the 7th generation of Cray interconnection networks. It is based on a Dragonfly topology and supports different system configurations up to 92 544 nodes (Trinity [56], the largest Aries system currently deployed, has 19 420 nodes). It provides a peak injection bandwidth of 81.6 Gb/s per node, and a rich set of features including adaptive routing, collective operations offload, and remote atomic operations. It uses fewer optical links than fat tree networks, reducing the cost of the network.
Tofu Interconnect D (TofuD) [57] is the third generation of Tofu interconnection networks, and it will be used by the Fugaku supercomputer [58] (formerly known as Post-K). TofuD provides a peak injection rate of 300Gb/s per node and, like its predecessors, it is based on a 6D mesh/torus. Around 25% of the links used by the interconnect are optical. To reduce latency and improve fault resiliency, TofuD uses a technique called dynamic packet slicing to split the packets in the data-link layer. This can either be used to split a packet and improve the transmission performance, or to duplicate a packet to provide fault tolerance in case the link quality degrades. Moreover, this interconnect provides an offload engine, called Tofu Barrier, to execute collective operations without involving the CPU.
The Dragonfly+ [59] is currently used by the Niagara supercomputer [60]. It is a variation of the Dragonfly interconnect [12], where the switches inside a group are connected through a fat-tree network. Similarly to the Dragonfly network, this interconnect is characterized by different minimal and non-minimal paths between each pair of nodes. The implementation used in the Niagara supercomputer relies on Mellanox InfiniBand hardware. To select the optimal path, Dragonfly+ uses a variation of the OFAR adaptive routing [61], which at each hop re-evaluates the optimal path to use. Explicit control messages are sent among the switches to notify congestion and avoid creating hotspots in the network.

Several other low-diameter networks [62] have been proposed by the research community, including but not limited to the SlimFly [63], Megafly [64], HyperX [65], [66], Jellyfish [67], and Xpander [68] topologies. On the data center side, Clos [69] is the most prevalent deployed topology. Whereas the above-mentioned low-diameter topologies claim substantial cost-performance improvements, they have been scarcely employed because of hard-to-deploy routing schemes. Also, classical multi-path load balancing mechanisms (e.g., ECMP [70]) are not effective in such low-diameter networks due to the scarcity of minimal paths [18]. Slingshot addresses these issues by providing a low-diameter network with an effective congestion control algorithm, setting a stepping stone towards HPC data centers.

Overall, Slingshot introduces a set of key features that can be taken as a reference for next-generation large-scale computing systems. First, the end-to-end congestion control algorithm can quickly react to congestion and is stable across a wide set of applications and microbenchmarks. Moreover, traffic classes provide additional flexibility and open new software optimization opportunities. Lastly, it is natively interoperable with existing Ethernet devices and, thanks to novel adaptive routing strategies, it provides high network utilization also for in-order RoCE traffic (see Figure 6).
B. Interconnection Networks Benchmarking

In this work we described the Slingshot interconnection network and, for the first time, we extensively evaluated it across a wide set of microbenchmarks and real applications. We reported both the isolated performance and the performance in the presence of congestion.

Regarding the evaluation of the system under load, different works analyzed the impact of congestion (also known as network noise) on application performance [5], [6], [11], [71]–[74] on different types of networks. The GPCNet benchmark [6] has been recently proposed as a portable benchmark for estimating network congestion. We used in this work the same definitions of endpoint/intermediate congestion and of congestion impact used by GPCNet. Whereas the authors of GPCNet also report some preliminary results on a Slingshot system, they do not provide a detailed view of the system performance. Indeed, the main goal of GPCNet was to design a portable congestion benchmarking infrastructure by using a small set of victim microbenchmarks (random ring and MPI_Allreduce) to easily compare different systems. However, this does not represent a wide spectrum of real scenarios. On the other hand, we focus on the impact of congestion on Slingshot by using different microbenchmarks, and both HPC and datacenter applications. Moreover, the GPCNet paper only analyzes the impact of congestion for a fixed victim message size, allocation, and aggressor/victim ratio. However, as we show in Section III-A, all these factors play a role in the observed congestion, and they can be helpful to understand the system performance.

V. CONCLUSIONS
Interconnection networks have a significant impact on the performance of large computing systems, both in supercomputers and in hyperscale datacenters. In this paper, we describe and evaluate SLINGSHOT, the latest interconnection network designed by Cray. We describe SLINGSHOT's main features: high-radix Ethernet switches, adaptive routing, congestion control, and QoS management. We then evaluate SLINGSHOT's performance, both in isolation and when executing different concurrent workloads.

Our results demonstrate that applications running on SLINGSHOT are much less affected by congestion compared to previous-generation networks, and that the congestion control algorithm works on a wide set of different microbenchmarks and HPC and datacenter applications. We also show that allocation policies have a much lower impact on performance on SLINGSHOT compared to previous-generation networks. Lastly, we demonstrate how SLINGSHOT can provide bandwidth guarantees to jobs running in separate traffic classes.

The information we provide can be used by HPC and datacenter system operators, administrators, users, and programmers to optimize, deploy, and manage parallel applications. A deep understanding of the interconnect's features is a prerequisite for ensuring optimized operations and utilization of computing resources in clouds and datacenters.

ACKNOWLEDGMENT
We thank the anonymous reviewers for their insightful comments, and the Slingshot team at HPE for providing access to and support in using the systems. We thank Steve Scott (HPE) for invaluable input. We would also like to thank Shigang Li for providing the code for the Resnet-proxy application. Daniele De Sensi is supported by an ETH Postdoctoral Fellowship (19-2 FEL-50). This project has received funding from the European Research Council (ERC) under the European Union's Horizon 2020 programme (grant agreement DAPP, No. 678880).
REFERENCES
[1] The Top500 List. http://top500.org. Accessed: 12-03-2020.
[2] Jack J. Dongarra, James R. Bunch, Cleve B. Moler, and G. W. Stewart. LINPACK Users' Guide. Society for Industrial and Applied Mathematics, Philadelphia, PA, 1979.
[3] Jack Dongarra, Michael A. Heroux, and Piotr Luszczek. High-performance conjugate-gradient benchmark: A new metric for ranking high-performance computing systems. The International Journal of High Performance Computing Applications, 30(1):3–10, 2016.
[4] Jeffrey Dean and Luiz André Barroso. The tail at scale. Communications of the ACM, 56:74–80, 2013.
[5] Daniele De Sensi, Salvatore Di Girolamo, and Torsten Hoefler. Mitigating network noise on dragonfly networks through application-aware routing. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, SC '19, New York, NY, USA, 2019. Association for Computing Machinery.
[6] Sudheer Chunduri, Taylor Groves, Peter Mendygral, Brian Austin, Jacob Balma, Krishna Kandalla, Kalyan Kumaran, Glenn Lockwood, Scott Parker, Steven Warren, et al. GPCNeT: Designing a benchmark suite for inducing and measuring contention in HPC networks. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, SC '19, New York, NY, USA, 2019. Association for Computing Machinery.
[7] Pulkit A. Misra, María F. Borge, Íñigo Goiri, Alvin R. Lebeck, Willy Zwaenepoel, and Ricardo Bianchini. Managing tail latency in datacenter-scale file systems under production constraints. In Proceedings of the Fourteenth EuroSys Conference 2019, pages 1–8, May 2019.
[12] J. Kim, W. J. Dally, S. Scott, and D. Abts. Technology-driven, highly-scalable dragonfly topology. In 2008 International Symposium on Computer Architecture (ISCA), pages 77–88, June 2008.
[13] W. H. Tranter, D. P. Taylor, R. E. Ziemer, N. F. Maxemchuk, and J. W. Mark. Input Versus Output Queueing on a Space-Division Packet Switch, pages 561–570. 2007.
[14] Y. Tamir and G. L. Frazier. High-performance multiqueue buffers for VLSI communication switches. In [1988] The 15th Annual International Symposium on Computer Architecture. Conference Proceedings, pages 343–354, 1988.
[15] Thomas E. Anderson, Susan S. Owicki, James B. Saxe, and Charles P. Thacker. High-speed switch scheduling for local-area networks. ACM Trans. Comput. Syst., 11(4):319–352, November 1993.
[18] Maciej Besta et al. FatPaths: Routing in supercomputers and data centers when shortest paths fall short. CoRR, abs/1906.10885, May 2019.
[19] Sally Floyd. TCP and explicit congestion notification. SIGCOMM Comput. Commun. Rev., 24(5):8–23, October 1994.
[20] IEEE 802.1Qau – Congestion Notification. https://1.ieee802.org/dcb/802-1qau/. Accessed: 12-03-2020.
[21] Yibo Zhu, Haggai Eran, Daniel Firestone, Chuanxiong Guo, Marina Lipshteyn, Yehonatan Liron, Jitendra Padhye, Shachar Raindel, Mohamad Haj Yahia, and Ming Zhang. Congestion control for large-scale RDMA deployments. In SIGCOMM. ACM, August 2015.
[22] Yibo Zhu, Monia Ghobadi, Vishal Misra, and Jitendra Padhye. ECN or delay: Lessons learnt from analysis of DCQCN and TIMELY. In Proceedings of the 12th International Conference on Emerging Networking EXperiments and Technologies, CoNEXT '16, pages 313–327, New York, NY, USA, 2016. Association for Computing Machinery.
[23] Y. Gao, Y. Yang, T. Chen, J. Zheng, B. Mao, and G. Chen. DCQCN+: Taming large-scale incast congestion in RDMA over Ethernet networks. In 2018 IEEE 26th International Conference on Network Protocols (ICNP), pages 110–120, Sep. 2018.
[24] Differentiated Services Codepoint (DSCP) - RFC 3260. https://tools.ietf.org/html/rfc3260. Accessed: 12-03-2020.
[25] 25G Ethernet Consortium. Low-Latency FEC Specification. https://25gethernet.org/ll-fec-specification. Accessed: 01-03-2020.
[26] Lane error detection and lane removal mechanism to reduce the probability of data corruption. https://patents.google.com/patent/US9325449B2/en. Accessed: 12-03-2020.
[27] Message Passing Interface Forum. MPI: A message-passing interface standard, version 3.0. Specification, September 2012.
[28] Rajeev Thakur, P. Balaji, D. Buntinas, D. Goodell, William Gropp, Torsten Hoefler, S. Kumar, E. Lusk, and Jesper Larsson Träff. MPI at Exascale. In Proceedings of SciDAC 2010, Jun. 2010.
[29] Pavan Balaji. Programming Models for Parallel Computing. The MIT Press, 2015.
[30] George Almasi. PGAS (Partitioned Global Address Space) Languages. In Encyclopedia of Parallel Computing. Springer, 2011.
[34] Proceedings of the 27th International ACM Conference on International Conference on Supercomputing, pages 139–148. ACM, Jun. 2013.
[35] Rajeev Thakur, Rolf Rabenseifner, and William Gropp. Optimization of collective communication operations in MPICH. Int. J. High Perform. Comput. Appl., 19(1):49–66, February 2005.
[36] Steven Gottlieb, W. Liu, William D. Toussaint, R. L. Renken, and R. L. Sugar. Hybrid-molecular-dynamics algorithms for the numerical simulation of quantum chromodynamics. Physical Review D: Particles and Fields, 35(8):2531–2542, 1987.
[37] G. Bauer, S. Gottlieb, and T. Hoefler. Performance modeling and comparative analysis of the MILC lattice QCD application su3_rmd. In Proceedings of the 2012 12th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid 2012), pages 652–659. IEEE Computer Society, May 2012.
[38] Steve Plimpton. Fast parallel algorithms for short-range molecular dynamics. J. Comput. Phys., 117(1):1–19, March 1995.
[39] M. Frigo and S. G. Johnson. The design and implementation of FFTW3. Proceedings of the IEEE, 93(2):216–231, Feb 2005.
[40] Teng Ma, Aurelien Bouteiller, George Bosilca, and Jack J. Dongarra. Impact of kernel-assisted MPI communication over scientific applications: CPMD and FFTW. In Yiannis Cotronis, Anthony Danalis, Dimitrios S. Nikolopoulos, and Jack Dongarra, editors, Recent Advances in the Message Passing Interface, pages 247–254, Berlin, Heidelberg, 2011. Springer Berlin Heidelberg.
[41] Shigang Li, Tal Ben-Nun, Salvatore Di Girolamo, Dan Alistarh, and Torsten Hoefler. Taming unbalanced training workloads in deep learning with partial collective operations. Proceedings of the 25th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, Feb 2020.
[42] Tal Ben-Nun, Maciej Besta, Simon Huber, Alexandros Nikolaos Ziogas, Daniel Peter, and Torsten Hoefler. A modular benchmarking infrastructure for high-performance and reproducible deep learning. CoRR, abs/1901.10183, 2019.
[43] Stephen Tu, Wenting Zheng, Eddie Kohler, Barbara Liskov, and Samuel Madden. Speedy transactions in multicore in-memory databases. In Proceedings of the Twenty-Fourth ACM Symposium on Operating Systems Principles, SOSP '13, pages 18–32, New York, NY, USA, 2013. Association for Computing Machinery.
[44] Willie Walker, Paul Lamere, Philip Kwok, Bhiksha Raj, Rita Singh, Evandro Gouvea, Peter Wolf, and Joe Woelfel. Sphinx-4: A flexible open source framework for speech recognition. Technical report, 2004.
[45] Xapian Project. https://github.com/xapian/xapian. Accessed: 12-03-2020.
[46] A deep network handwriting classifier. https://github.com/xingdi-ericyuan/multi-layer-convnet. Accessed: 12-03-2020.
[47] H. Kasture and D. Sanchez. TailBench: A benchmark suite and evaluation methodology for latency-critical applications. In 2016 IEEE International Symposium on Workload Characterization (IISWC), pages 1–10, 2016.
[48] Bob Alverson, Edwin Froese, Larry Kaplan, and Duncan Roweth. Cray XC series network. Cray Inc., White Paper WP-Aries01-1112, 2012.
[49] S. Chunduri, S. Parker, P. Balaji, K. Harms, and K. Kumaran. Characterization of MPI usage on a production supercomputer. In SC18: International Conference for High Performance Computing, Networking, Storage and Analysis, pages 386–400, 2018.
[50] Ember Communication Pattern Library. https://github.com/sstsimulator/ember. Accessed: 10-04-2019.
[51] B. Prisacari, G. Rodriguez, P. Heidelberger, D. Chen, C. Minkenberg, and Torsten Hoefler. Efficient task placement and routing in dragonfly networks. In Proceedings of the 23rd ACM International Symposium on High-Performance Parallel and Distributed Computing (HPDC '14). ACM, Jun. 2014.
[52] Torsten Hoefler and Roberto Belli. Scientific benchmarking of parallel computing systems: Twelve ways to tell the masses when reporting performance results. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, SC '15, pages 73:1–73:12, New York, NY, USA, 2015. ACM.
[53] Samuel D. Pollard, Nikhil Jain, Stephen Herbein, and Abhinav Bhatele. Evaluation of an interference-free node allocation policy on fat-tree clusters. In Proceedings of the International Conference for High Performance Computing, Networking, Storage, and Analysis, SC '18, pages 26:1–26:13, Piscataway, NJ, USA, 2018. IEEE Press.
[54] Mohammad Al-Fares, Alexander Loukissas, and Amin Vahdat. A scalable, commodity data center network architecture. SIGCOMM Comput. Commun. Rev., 38(4):63–74, 2008.
[60] Marcelo Ponce, Bruno C. Mundim, Mike Nolta, Jaime Pinto, Marco Saldarriaga, Vladimir Slavnic, Erik Spence, Ching-Hsing Yu, W. Richard Peltier, Ramses van Zon, et al. Deploying a top-100 supercomputer for large parallel workloads. Proceedings of the Practice and Experience in Advanced Research Computing on Rise of the Machines (learning) - PEARC '19, 2019.
[61] M. García, E. Vallejo, R. Beivide, M. Odriozola, C. Camarero, M. Valero, G. Rodríguez, J. Labarta, and C. Minkenberg. On-the-fly adaptive routing in high-radix hierarchical networks. In 2012 41st International Conference on Parallel Processing, pages 279–288, Sep. 2012.
[62] G. Kathareios, C. Minkenberg, B. Prisacari, G. Rodriguez, and T. Hoefler. Cost-effective diameter-two topologies: Analysis and evaluation. In SC '15: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, pages 1–11, Nov 2015.
[63] M. Besta and T. Hoefler. Slim Fly: A cost effective low-diameter network topology. In SC '14: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, pages 348–359, Nov 2014.
[64] Mario Flajslik, Eric Borch, and Mike A. Parker. Megafly: A topology for exascale systems. In Rio Yokota, Michèle Weiland, David Keyes, and Carsten Trinitis, editors, High Performance Computing, pages 289–310, Cham, 2018. Springer International Publishing.
[65] J. H. Ahn, N. Binkert, A. Davis, M. McLaren, and R. S. Schreiber. HyperX: Topology, routing, and packaging of efficient large-scale networks. In Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis, pages 1–11, 2009.
[66] Jens Domke, Satoshi Matsuoka, Ivan R. Ivanov, Yuki Tsushima, Tomoya Yuki, Akihiro Nomura, Shinichi Miura, Nic McDonald, Dennis L. Floyd, and Nicolas Dubé. HyperX topology: First at-scale implementation and comparison to the fat-tree. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, SC '19, New York, NY, USA, 2019. Association for Computing Machinery.
[67] Ankit Singla, Chi-Yao Hong, Lucian Popa, and P. Brighten Godfrey. Jellyfish: Networking data centers randomly. In Presented as part of the 9th USENIX Symposium on Networked Systems Design and Implementation (NSDI 12), pages 225–238, San Jose, CA, 2012. USENIX.
[68] Asaf Valadarsky, Michael Dinitz, and Michael Schapira. Xpander: Unveiling the secrets of high-performance datacenters. In Proceedings of the 14th ACM Workshop on Hot Topics in Networks, HotNets-XIV, New York, NY, USA, 2015. Association for Computing Machinery.
[69] Charles Clos. A study of non-blocking switching networks. Bell System Technical Journal, 32(2):406–424, 1953.
[70] Christian Hopps et al. Analysis of an equal-cost multi-path algorithm. Technical report, RFC 2992, November 2000.
[71] Philip Taffet and John Mellor-Crummey. Understanding congestion in high performance interconnection networks using sampling. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, SC '19, New York, NY, USA, 2019. Association for Computing Machinery.
[72] Staci A. Smith, Clara E. Cromey, David K. Lowenthal, Jens Domke, Nikhil Jain, Jayaraman J. Thiagarajan, and Abhinav Bhatele. Mitigating inter-job interference using adaptive flow-aware routing. In Proceedings of the International Conference for High Performance Computing, Networking, Storage, and Analysis, SC '18. IEEE Press, 2018.
[73] Xu Yang, John Jenkins, Misbah Mubarak, Robert B. Ross, and Zhiling Lan. Watch out for the bully! Job interference study on dragonfly network. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, SC '16. IEEE Press, 2016.
[74] A. Bhatele, K. Mohror, S. H. Langer, and K. E. Isaacs. There goes the neighborhood: Performance degradation due to nearby jobs. In Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis, SC '13, New York, NY, USA, 2013. ACM.