An In-Depth Analysis of the Slingshot Interconnect

Daniele De Sensi, Department of Computer Science, ETH Zurich ([email protected])
Salvatore Di Girolamo, Department of Computer Science, ETH Zurich ([email protected])
Kim H. McMahon, Hewlett Packard Enterprise ([email protected])
Duncan Roweth, Hewlett Packard Enterprise ([email protected])
Torsten Hoefler, Department of Computer Science, ETH Zurich ([email protected])
Abstract—The interconnect is one of the most critical components in large-scale computing systems, and its impact on the performance of applications is going to increase with the system size. In this paper, we describe Slingshot, an interconnection network for large-scale computing systems. Slingshot is based on high-radix switches, which allow building exascale and hyperscale datacenter networks with at most three switch-to-switch hops. Moreover, Slingshot provides efficient adaptive routing and congestion control algorithms, and highly tunable traffic classes. Slingshot uses an optimized Ethernet protocol, which allows it to be interoperable with standard Ethernet devices while providing high performance to HPC applications. We analyze the extent to which Slingshot provides these features, evaluating it on microbenchmarks and on several applications from the datacenter and AI worlds, as well as on HPC applications. We find that applications running on Slingshot are less affected by congestion compared to previous generation networks.
Index Terms—interconnection network, dragonfly, exascale, datacenters, congestion
I. INTRODUCTION
The first US exascale supercomputer will be built within two years, marking an important milestone for computing systems. Exascale computing has been a long-awaited goal, which required significant contributions from both academic and industrial research. One of the most critical components having a direct impact on the performance of such large systems is the interconnection network (interconnect). Indeed, by analyzing the performance of the Top500 supercomputers [1] when executing HPL [2] and HPCG [3], two benchmarks commonly used to assess supercomputing systems, we can observe that HPCG is typically characterized by ∼50x lower performance compared to HPL. Part of this performance drop is caused by the higher communication intensity of HPCG, clearly showing that, among others, an efficient interconnection network can help in exploiting the full computational power of the system. The impact of the interconnect on the performance of supercomputing systems increases with the scale of the system, highlighting the need for novel and efficient solutions.

Both the HPC and datacenter communities are following a path towards convergence of HPC, data center, and AI/ML workloads, which poses new challenges and requires new solutions. Workloads are becoming much more data-centric, and large amounts of data need to be exchanged with the outside world. Due to the wide adoption of Ethernet in datacenters, interconnection networks should be compatible with standard Ethernet, so that they can be efficiently integrated with standard devices and storage systems. Moreover, many data center workloads are latency-sensitive. For such applications, tail latency is much more relevant than the best-case or average latency. For example, web search nodes must provide 99th-percentile latencies of a few milliseconds [4]. This is also a relevant problem for HPC applications, whose performance may strongly depend on message latency, especially when using many global synchronizations or small messages. Despite the efforts in improving the performance of interconnection networks, tail latency still severely affects large HPC and data center systems [4]–[7].

To address these issues, Cray (a Hewlett Packard Enterprise (HPE) company since 2019) recently designed the Slingshot interconnection network. Slingshot will power all three announced US exascale systems [8]–[10] and numerous supercomputers that will be deployed soon. It provides some key features, like adaptive routing and congestion control, that make it a good solution for HPC systems but also for cloud data centers. Slingshot switches have 64 ports with 200 Gb/s each and support arbitrary network topologies. To reduce tail latencies, Slingshot offers advanced adaptive routing, congestion control, and quality of service (QoS) features. Those also protect applications from interference, sometimes referred to as network noise [5], [11], caused by other applications sharing the interconnect. Lastly, Slingshot brings HPC features to Ethernet, such as low latency, low packet overhead, and optimized congestion control, while maintaining industry standards. In Slingshot, each port of the switch can negotiate the available Ethernet features with the attached devices, and can communicate with existing Ethernet devices using standard Ethernet protocols, or with other Slingshot switches and NICs by using Slingshot-specific additions. This allows the network to be fully interoperable with existing Ethernet equipment while at the same time providing good performance for HPC systems.

In this study, we experimentally analyze Slingshot's performance features to guide researchers, developers, and system administrators. We use Mellanox ConnectX-5 100 Gb/s Ethernet NICs (200 Gb/s Ethernet NICs were not available at the time of writing) to test the ability of Slingshot to deal with standard RDMA over Converged Ethernet (RoCE) traffic. Moreover, by doing so we can analyze the impact of the switch on the end-to-end performance by factoring out some of the improvements to the Ethernet protocol introduced by Slingshot. We first analyze the latencies of a quiet system. Then, we analyze the impact of congestion on both microbenchmarks and real applications for different configurations, showing that Slingshot is only marginally affected by network noise. To further show the benefits of the congestion control algorithm, we compare Slingshot to Cray's previous Aries network, which has a similar topology and uses a similar routing algorithm.

II. SLINGSHOT ARCHITECTURE
We now describe the Slingshot interconnection network. We first introduce the Rosetta switch and show how switches can be connected to form a Dragonfly [12] topology. We then dive into specific features of Slingshot such as adaptive routing, congestion control, and quality of service management. Lastly, we describe the main characteristics of the Slingshot additions to Ethernet and the software stack.
A. Switch Technology (Rosetta)

The core of the Slingshot interconnect is the Rosetta switch, providing 64 ports at 200 Gb/s per direction. Each port uses four lanes of 56 Gb/s Serializer/Deserializer (SerDes) blocks using Pulse-Amplitude Modulation (PAM-4). Due to Forward Error Correction (FEC) overhead, 50 Gb/s can be pushed through each lane. The Rosetta ASIC consumes up to 250 Watts and is implemented in TSMC's 16 nm process. Rosetta is composed of 32 peripheral function blocks and 32 tile blocks. The peripheral blocks implement the SerDes, Medium Access Control (MAC), Physical Coding Sublayer (PCS), Link Layer Reliability (LLR), and Ethernet lookup functions.

The 32 tile blocks implement the crossbar switching between the 64 ports, as well as adaptive routing and congestion management functionalities. The tiles are arranged in four rows of eight tiles, with two switch ports handled per tile, as shown in Figure 1. The tiles on the same row are connected through 16 per-row buses, whereas the tiles on the same column are connected through dedicated channels with per-tile crossbars. Each row bus is used to send data from the corresponding port to the other 16 ports on the row. The per-tile crossbar has 16 inputs (i.e., from the 16 ports on the row) and 8 outputs (i.e., to the 8 ports on the column). For each port, a multiplexer is used to select one of the four inputs (this is not explicitly shown in the figure for the sake of clarity). Packets are routed to the destination tile through at most two hops. Figure 1 shows an example: if a packet is received on Port 19 and must be routed to Port 56, the packet is first routed on the row bus, then it goes through the 16-to-8 crossbar highlighted in the picture, and then down a column channel to Port 56. Thanks to the hierarchical structure of the tiles, there is no need for a 64-port arbiter, and packets only incur a 16-to-8 arbitration.
Fig. 1: Rosetta switch tiled structure.

The 32 tiles in Rosetta implement a crossbar between the 64 ports. For performance and implementation reasons, the crossbar is physically composed of different function-specific crossbars, each handling a different aspect of the switching traffic:

• Requests to Transmit: To avoid head-of-line (HOL) blocking [13], Rosetta relies on a virtual output-queued architecture [14], [15], where the routing path is determined before sending the data. The data is buffered in the input buffers until the resources are available, guaranteeing no further blocking. Before forwarding the data, a request-to-transmit is sent to the tile corresponding to the switch output port. When a grant to transmit is received from the output port, the data is forwarded.

• Grants to Transmit: Grants to transmit are sent by the tile handling the output port to the tile from which the switch received the packet. In the previous example, the grants would be transmitted from the tile handling Port 56 to the tile handling Port 19. Grants are used to notify the permission to forward the data to the next hop. The use of requests and grants to transmit is a central piece of the QoS management.

• Data: Data is sent on a wider crossbar (48B). To speed up the processing, Rosetta parses and processes the packet header as soon as it arrives, even if the data might still be arriving.

• Request Queue Credits: Credits provide an estimation of queue occupancy. This information is then used by the adaptive routing algorithm (see Section II-C) to estimate the congestion of different paths and to select the least congested one.

• End-to-End Acks: End-to-end acknowledgments are used to track the outstanding packets between every pair of network endpoints. This information is used by the congestion control protocol (see Section II-D).

By using physically separated crossbars, Slingshot guarantees that different types of messages do not interfere with each other and that, for example, large data transfers do not slow down requests and grants to transmit.
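The request/grant handshake lends itself to a compact illustration. The following C fragment is a minimal sketch of the virtual output-queuing idea described above, under our own simplifying assumptions (a single busy flag per output port and a polling loop); it is not Rosetta's actual arbitration logic, which is implemented in hardware and spread across the tiles:

    #include <stdbool.h>
    #include <stdio.h>

    #define PORTS 64

    typedef struct { int dst_port; int bytes; } packet_t;

    static bool output_busy[PORTS];   /* simplified per-output-port state */

    /* Sent on the request crossbar: ask the output tile for permission. */
    static bool request_to_transmit(int out_port) {
        if (output_busy[out_port])
            return false;             /* no grant: packet stays in the input buffer */
        output_busy[out_port] = true; /* grant issued on the grant crossbar */
        return true;
    }

    /* Data moves on the (wider) data crossbar only after the grant,
     * so it never blocks other traffic inside the fabric. */
    static void forward(const packet_t *p) {
        printf("forwarding %d bytes to port %d\n", p->bytes, p->dst_port);
        output_busy[p->dst_port] = false;  /* output port is free again */
    }

    int main(void) {
        packet_t p = { .dst_port = 56, .bytes = 4096 };
        while (!request_to_transmit(p.dst_port))
            ;  /* data waits in the input buffer, not inside the crossbar */
        forward(&p);
        return 0;
    }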
Fig. 2: Distribution of switch latency for RoCE traffic.

To analyze the impact of the switch architecture on the latency, we report in Figure 2 the latency of the switch when dealing with RoCE traffic. It is worth remarking that, because we are using standard RoCE NICs, the NIC sends plain Ethernet frames, and we cannot exploit all the features of Slingshot's specialized Ethernet protocol (Section II-F). Some features, like link-level reliability and propagation of congestion information, are however still used in the switch-to-switch communications. To compute the latency of the switch, we consider the latency difference between 2-hop and 1-hop latencies (we provide details on the topology in Section II-B). We observe that Rosetta has a mean and median latency of 350 nanoseconds, with the entire distribution lying between 300 and 400 nanoseconds, except for a few outliers.
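In other words, the 1-hop path traverses one switch and the 2-hop path traverses two, so subtracting the one-way latencies of the two paths isolates the contribution of exactly one additional Rosetta traversal; all other components (NICs, cables, software stack) cancel out.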
B. Topology

Rosetta switches can be arranged into any arbitrary topology. Dragonfly [12] is the default topology for Slingshot-based systems, and it is the topology we refer to in the rest of the paper. Dragonfly is a hierarchical direct topology, where all the switches are connected to both computing nodes and other switches. Sets of switches are connected to each other, forming so-called groups. The switches inside each group may be connected by using an arbitrary topology, and groups are connected in a fully connected graph. In the Slingshot implementation of Dragonfly (shown in Figure 3), each Rosetta switch is connected to 16 endpoints through copper cables (up to 2.6 meters), using the remaining 48 ports for inter-switch connectivity. The partitioning of these 48 ports between inter- and intra-group connectivity, as well as the number of switches per group, depends on the size of the system. In Slingshot, the switches inside a group are always fully connected through copper cables. Switches in different groups are connected through long optical cables (up to 100 meters). Due to the full connectivity both within each group and between groups, this topology has a diameter of 3 switch-to-switch hops.

Fig. 3: Slingshot topology. In this specific example we show the topology of the largest 1-dimensional Dragonfly network that can be built with the 64-port Rosetta switches (all-to-all within each group, all-to-all amongst groups, 544 global links per group).

Thanks to the low diameter, application performance only marginally depends on the specific node allocation. We report in Figure 4 the latency and the bandwidth between nodes at different distances, and for different message sizes, on an isolated system. We consider nodes connected to ports on the same switch (1 inter-switch hop), connected to two different switches in the same group (2 inter-switch hops), and connected to two different switches in two different groups (3 inter-switch hops). For the same-switch case, we observed no significant difference when using two ports on the same switch tile or on two different tiles.

Fig. 4: Latency and bandwidth for different node distances, on an isolated system. Q1 is the first quartile, Q3 is the third quartile, IQR = Q3 − Q1, S is the smallest sample greater than Q1 − 1.5 · IQR, and L is the largest sample smaller than Q3 + 1.5 · IQR.

First, we observe that, in the worst case, the node allocation has only a small impact on the latency for the smallest messages and that, starting from KiB-sized messages, we observe only marginal differences in latency between the different node distances. The same holds for bandwidth, with only marginal differences between the different distances across all the message sizes. In some cases, we observe a slightly higher bandwidth when the nodes are in two different groups, because more paths connect the two nodes, increasing the available bandwidth.

In the largest system (shown in Figure 3), each group has 32 switches (for a total of 32 × 16 = 512 nodes), and the switches inside each group are fully connected by using 31 switch ports. The remaining 17 ports of each switch are used to globally connect all the groups in a fully connected network. In this specific case, because each group contains 32 switches and each switch uses 17 ports to connect to other groups, each group has 32 × 17 = 544 connections towards other groups. This leads to a system with 545 groups, each of which is connected to 512 nodes, for a total of 279 040 endpoints at full global bandwidth (in practice, the addressing scheme limits the number of groups to 511, for a total of 261 632 nodes). This number of endpoints satisfies the demands of both exascale supercomputers and hyperscale data centers. Indeed, this is larger than the number of servers used in data centers [16], and much larger than the number of nodes used by Summit [17], the fastest supercomputer at the time of writing, which relies on ∼4 600 nodes and delivers ∼150 PFlop/s. Thanks to this large number of endpoints, each computing node can have multiple connections to the same network, increasing the injection bandwidth and improving network resiliency in case of NIC failures.
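The sizing above can be reproduced with a few lines of arithmetic. The following C sketch (our back-of-the-envelope reconstruction of the numbers in the text) derives the group and endpoint counts of the largest configuration:

    #include <stdio.h>

    int main(void) {
        const int ports            = 64;   /* Rosetta radix          */
        const int nodes_per_switch = 16;   /* endpoints per switch   */
        const int switches_per_grp = 32;   /* fully connected group  */

        int intra  = switches_per_grp - 1;              /* 31 intra-group ports */
        int global = ports - nodes_per_switch - intra;  /* 17 global ports      */
        int global_per_grp = switches_per_grp * global; /* 544                  */
        int groups = global_per_grp + 1;                /* 545                  */
        int nodes  = groups * switches_per_grp * nodes_per_switch;

        printf("%d groups, %d endpoints\n", groups, nodes);  /* 545, 279040 */
        return 0;
    }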
C. Routing

In Dragonfly networks (including Slingshot), any pair of nodes is connected by multiple minimal and non-minimal paths [12], [18]. For example, considering the topology in Figure 3, the minimal path connecting N0 to N496 includes the switches S0 and S31. In smaller networks, due to link redundancy, multiple minimal paths connect any pair of nodes [18]. On the other hand, a possible non-minimal path involves an intermediate switch that is directly connected to both S0 and S31. The same holds for nodes located in different groups. In this case, a non-minimal path crosses an intermediate group.

Sending data on minimal paths is clearly the best choice on a quiet network. However, in a congested network with multiple active jobs, those paths may be slower than longer but less congested ones. To provide the highest throughput and lowest latency, Slingshot implements adaptive routing: before sending a packet, the source switch estimates the load of up to four minimal and non-minimal paths and sends the packet on the best path, selected by considering both the paths' congestion and length. The congestion is estimated by considering the total depth of the request queues of each output port. This congestion information is distributed on the chip by using a ring to all the forwarding blocks of each input port. It is also communicated between neighboring switches by carrying it in the acknowledgement packets. The total overhead for congestion and load information is an average of four bytes in the reverse direction for every packet in the forward direction. As more packets take non-minimal paths, and the average hop count per packet therefore increases, both the latency and the link utilization increase. Therefore, Slingshot adaptive routing biases packets to take minimal paths more frequently, to compensate for the higher cost of non-minimal paths.
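One way to think about this decision is a per-packet scoring of the candidate paths. The following C sketch is only our interpretation of such a biased selection: the exact scoring function and the NON_MINIMAL_BIAS constant are illustrative assumptions, since the text only states that queue depth (congestion) and path length are combined, with a preference for minimal paths:

    #include <limits.h>

    #define CANDIDATES 4        /* up to four minimal and non-minimal paths */
    #define NON_MINIMAL_BIAS 2  /* hypothetical penalty for the extra hops  */

    typedef struct {
        int queue_depth;  /* request-queue occupancy towards this path */
        int non_minimal;  /* 1 if the path crosses an intermediate hop */
    } path_t;

    int select_path(const path_t paths[CANDIDATES]) {
        int best = 0, best_score = INT_MAX;
        for (int i = 0; i < CANDIDATES; i++) {
            int score = paths[i].queue_depth
                      * (paths[i].non_minimal ? NON_MINIMAL_BIAS : 1);
            if (score < best_score) { best_score = score; best = i; }
        }
        return best;  /* least congested path, after biasing towards minimal */
    }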
D. Congestion Control

Two types of congestion might affect an interconnection network: endpoint congestion and intermediate congestion [6]. Endpoint congestion mostly occurs on the last-hop switches, whereas intermediate congestion is spread across the network. Adaptive routing improves network utilization and application performance by changing the path of the packets to avoid intermediate congestion. However, even if adaptive routing can bypass congested intermediate switches, all the paths between two nodes are affected in the same way by endpoint congestion. As we show in Section III-A, this was a relevant issue on other networks, particularly for many-to-one traffic. In this case, due to the highly congested links on the receiver side, adaptive routing would spread the packets over the different paths without being able to avoid congestion, because it is occurring in the last hop.

Congestion control helps in mitigating this problem by decreasing the injection bandwidth of the nodes generating the congestion. However, existing congestion control mechanisms (like ECN [19] and QCN [20], [21]) are not suited for HPC scenarios. They work by marking packets that experience congestion. When a node receives a packet that has been marked, it asks the sender to slow down its injection rate. These congestion control algorithms work relatively well in the presence of large-volume and stable communications (known as elephant flows), but tend to be fragile, hard to tune [22], [23], and generally unsuitable for bursty HPC workloads. Indeed, in standard congestion control algorithms, the control loop is too long to adapt fast enough, and while converging to the correct transmission rate, the offending traffic can still interfere with other applications.

To mitigate this problem, Slingshot introduces a sophisticated congestion control algorithm, entirely implemented in hardware, that tracks every in-flight packet between every pair of endpoints in the system. Slingshot can distinguish between jobs that are victims of congestion and those that are contributing to congestion, applying stiff and fast back-pressure to the sources that are contributing to congestion. By tracking all the endpoint pairs individually, Slingshot only throttles those streams of packets that are contributing to the endpoint congestion, without negatively affecting other jobs or other streams of packets within the same job that are not contributing to congestion. This frees up buffer space for the other jobs, avoiding HOL blocking across the entire network and reducing tail latencies, which are particularly relevant for applications characterized by global synchronizations.

The approach to congestion control adopted by Slingshot is fundamentally different from more traditional approaches such as ECN-based congestion control [19], [20], and leads to good performance isolation between different applications, as we show in Section III-A.
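Conceptually, the mechanism amounts to a table of outstanding packets per (source, destination) pair, updated by the end-to-end acknowledgments of Section II-A. The C sketch below illustrates this idea only; the flat table and the fixed THRESHOLD are our assumptions, whereas the real mechanism is implemented in switch hardware and applies its back-pressure dynamically:

    #include <stdbool.h>
    #include <stdint.h>

    #define ENDPOINTS 1024   /* illustrative system size               */
    #define THRESHOLD 64     /* hypothetical outstanding-packet budget */

    static uint16_t outstanding[ENDPOINTS][ENDPOINTS];

    void on_packet_sent(int src, int dst)  { outstanding[src][dst]++; }
    void on_ack_received(int src, int dst) { outstanding[src][dst]--; }

    /* Checked at injection time: only the pairs that keep piling up
     * packets towards a congested endpoint are throttled; victims of
     * the congestion keep their full injection rate. */
    bool may_inject(int src, int dst) {
        return outstanding[src][dst] < THRESHOLD;
    }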
E. Quality of Service (QoS)

Whereas congestion control partially protects jobs from mutual interference, jobs can still interfere with each other. To provide complete isolation, in Slingshot jobs can be assigned to different traffic classes, with guaranteed quality of service. QoS and congestion control are orthogonal concepts. Indeed, because traffic classes are expensive resources requiring large amounts of switch buffer space, each traffic class is typically shared among several applications, and congestion control still needs to be applied within a traffic class.

Each traffic class is highly tunable and can be customized by the system administrator in terms of priority, required packet ordering, minimum bandwidth guarantees, maximum bandwidth constraint, lossiness, and routing bias [5]. The system administrator guarantees that the sum of the minimum bandwidth requirements of the different traffic classes does not exceed the available bandwidth. Network traffic can be assigned to traffic classes on a per-packet basis. The job scheduler will assign to each job a small number of traffic classes, and the user can then select on which class to send its application traffic. In the case of MPI, this is done by specifying the traffic class identifier in an environment variable. Moreover, communication libraries could even change traffic classes at a per-message (or per-packet) granularity. For example, MPI could assign latency-sensitive collective operations such as MPI_Barrier and MPI_Allreduce to high-priority, low-bandwidth traffic classes, and bulk point-to-point operations to higher-bandwidth, lower-priority classes.

Traffic classes are completely implemented in the switch hardware. A switch determines the traffic class required for a specific packet by using the Differentiated Services Code Point (DSCP) tag in the packet header [24]. Based on the value of the tag, the switch assigns the packet to one of multiple virtual queues. Each switch will allocate enough buffers to each traffic class to achieve the desired bandwidth, whereas the remaining buffers will be dynamically allocated to the traffic which is not assigned to any specific traffic class.
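For standard RoCE/IP traffic, the DSCP value can be set from user space through the IP type-of-service byte, which is how an application or middleware could steer packets towards a given class. The snippet below uses the standard setsockopt interface; the value 46 (Expedited Forwarding) is only an example, since the mapping from DSCP values to Slingshot traffic classes is defined by the system administrator:

    #include <netinet/in.h>
    #include <sys/socket.h>

    /* Tag all packets of this socket with the given DSCP value.
     * DSCP occupies the 6 high bits of the 8-bit TOS field. */
    int set_dscp(int sockfd, int dscp) {
        int tos = dscp << 2;
        return setsockopt(sockfd, IPPROTO_IP, IP_TOS, &tos, sizeof(tos));
    }

    /* e.g., set_dscp(fd, 46) to request an Expedited-Forwarding-like class */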
F. Ethernet Enhancements

To improve interoperability, and to better suit datacenter scenarios, Slingshot is fully Ethernet compatible and can seamlessly be connected to third-party Ethernet-based devices and networks. Slingshot provides additional features on top of standard Ethernet, improving its performance and making it more suitable for HPC workloads. Slingshot uses this enhanced protocol for internal traffic, but it can mix it with standard Ethernet traffic on all ports at packet-level granularity. This allows Slingshot to achieve high performance while at the same time being able to communicate with standard Ethernet devices, allowing it to be used efficiently in both the supercomputing and datacenter worlds.

To improve performance, Slingshot reduces the 64-byte minimum frame size to 32 bytes, allows IP packets to be sent without an Ethernet header, and removes the inter-packet gap. Lastly, Slingshot provides resiliency at different levels by implementing low-latency Forward Error Correction (FEC) [25], Link-Level Reliability (LLR) to tolerate transient errors, and lane degrade [26] to tolerate hard failures. Moreover, the Slingshot NIC provides end-to-end retry to protect against packet loss. These are relevant features in high-performance networks. For example, FEC is required for all Ethernet systems at 100Gb/s or higher, independently of the system size, and LLR is useful in large systems (such as hyperscale data centers) to localize the error handling and reduce end-to-end retransmissions.
G. Software Stack

Communication libraries can either use the standard TCP/IP stack or, in the case of high-performance communication libraries such as MPI [27], [28], Chapel [29], PGAS [30], and SHMEM [31], the libfabric interface [32]. Cray contributed new features to the libfabric open-source verbs provider and RxM utility provider to support the Slingshot hardware. All HPC traffic is layered over RDMA over Converged Ethernet (RoCEv2), and data is sent over the network through packets containing up to 4KiB of data plus headers and trailers. Headers and trailers include Ethernet (26 bytes including the preamble), IPv4 (20 bytes), UDP (8 bytes), InfiniBand (14 bytes), and an additional RoCEv2 CRC (4 bytes), for a total of 72 bytes.
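For a full 4KiB payload, this overhead translates into a wire efficiency of 4096 / (4096 + 72) ≈ 98.3%; smaller packets pay a proportionally larger header cost, which is one of the reasons why the reduced 32-byte minimum frame size of Section II-F matters for small-message traffic.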
Cray MPI is derived from MPICH [33] and implements the MPI-3.1 standard. Proprietary optimizations and other enhancements have been added to Cray MPI, targeted specifically at the Slingshot hardware. Any MPI implementation supporting libfabric can be used out of the box on Slingshot. Moreover, standard APIs for some features, like traffic classes, have been recently added to libfabric and could be exploited as well. We report in Figure 5 the latencies for different message sizes and for different network protocols. We observe that for small message sizes, MPI adds only a marginal overhead to libfabric.
Fig. 5: Half round trip time (RTT/2) for different message sizes (x-axis) and software layers.

Moreover, we show in Figure 6 the bisection bandwidth (i.e., the bandwidth when half of the nodes send data to the other half of the nodes and vice versa) and the MPI_Alltoall bandwidth on Shandy, a Slingshot-based system with 1024 nodes (see Section III for details). We report the results for different processes per node (PPN) and different message sizes. This system is composed of eight groups, and all the bisection cuts cross the same number of links. In this system, each group has 56 global links out of 112 (8 towards each other group), to match the injection bandwidth. Each of the 4 groups in one partition is connected to each of the 4 groups in the other partition, and the total number of links crossing a bisection cut is 4 · 4 · 8 = 128. Because each link has a 200Gb/s bandwidth, and we are sending traffic in both directions, the peak bisection bandwidth is 2 · 128 · 200 Gb/s = 51.2 Tb/s. In an all-to-all communication, each node sends 7/8 of the traffic to nodes in the other 7 groups and 1/8 of the traffic to nodes in the same group. Because this system has 8 · 56 = 448 global link endpoints (each link is counted by both of its groups, accounting for traffic in both directions), the all-to-all maximum bandwidth is 8/7 · 448 · 200 Gb/s = 102.4 Tb/s. Note that MPI_Alltoall can achieve twice the bisection bandwidth because half of the connections terminate in the same partition [34]. The plot shows that the MPI_Alltoall reaches more than 90% of the theoretical peak bandwidth, without any packet loss. We observe a performance drop for 256-byte messages because, to reduce memory usage, the MPI implementation switches to a different algorithm [35] for messages larger than 256 bytes.
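These two bounds can be verified with a few lines of code. The following C sketch simply reproduces the arithmetic above; the constants come from the system description, and everything else is bookkeeping:

    #include <stdio.h>

    int main(void) {
        const double link_gbs    = 200.0;  /* per link, per direction     */
        const int groups         = 8;
        const int links_per_pair = 8;      /* global links per group pair */

        int cut = 4 * 4 * links_per_pair;                /* 128 links cross the cut */
        double bisection = 2 * cut * link_gbs / 1000.0;  /* 51.2 Tb/s               */

        int global = groups * 7 * links_per_pair;        /* 448 link endpoints      */
        double alltoall = 8.0 / 7.0 * global * link_gbs / 1000.0;  /* 102.4 Tb/s    */

        printf("bisection %.1f Tb/s, alltoall %.1f Tb/s\n", bisection, alltoall);
        return 0;
    }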
Fig. 6: Bisection and MPI_Alltoall bandwidth on all the nodes of Shandy, for different processes per node (PPN) and message sizes. The x-axis is in logarithmic scale.

III. PERFORMANCE STUDY
We now study the performance of the Slingshot interconnect on real applications and microbenchmarks, focusing on two key features of Slingshot, namely congestion control and quality of service management. For our analysis, we consider the following systems:

• Crystal: A system based on the Cray Aries interconnect [48]. This system has 698 nodes. The CPUs on the nodes are Intel Xeon E5-269x. The system is composed of two groups, each containing at most 384 nodes.

• Malbec: A Slingshot system with 484 nodes. CPUs on the nodes are either Intel Xeon Gold 61xx or Intel Xeon Platinum 81xx CPUs. The system is composed of four groups, each containing at most 128 nodes. Each group is connected to the other groups through 48 global links operating at 200Gb/s each. Each node has a Mellanox ConnectX-5 EN NIC.

• Shandy: A Slingshot system with 1024 nodes. Compute nodes are equipped with 64-core AMD EPYC Rome CPUs. The system is composed of eight groups, each containing 128 nodes. Each group is connected to the other groups through 56 global links operating at 200Gb/s each. Each node has two Mellanox ConnectX-5 EN NICs, each connected to a different switch of the same network, allowing a better load distribution and resilience in the event of NIC failures.

We consider two Slingshot systems of different sizes to analyze the performance at different system scales. For all the experiments, we booked these systems for exclusive use, to have a controlled environment and avoid interference caused by other users.
A. Congestion Control

To evaluate the ability of Slingshot to react to congestion, we divide the nodes of the system in two partitions: victim nodes and aggressor nodes. The aggressor nodes generate congestion that impacts the performance of victim nodes. We consider two types of congestion patterns, endpoint congestion and intermediate congestion, and we use the GPCNet code [6] to generate them. We generate endpoint congestion through a many-to-one (incast) communication pattern, where a number of nodes send data to the same endpoint by using MPI_Put, and intermediate congestion by using an all-to-all pattern implemented through MPI_Sendrecv (both patterns are sketched in the code fragment after Figure 7). Both aggressors exchange 128KiB messages. This decision is based on characterization studies on production systems, which show average message sizes of this magnitude both in collective and point-to-point communications [49].

We consider the victim applications described in Table I. Moreover, we also analyze the impact of congestion on microbenchmarks, including standard MPI operations and the ember microbenchmarks [50], which reproduce some common communication patterns in HPC applications (halo3d, sweep3d, and incast). We first consider the results on 512 nodes. Then, we show the results for different node counts. We consider the following victim/aggressor splits: 460/52 (∼90%/10%), 256/256 (∼50%/50%), and 53/459 (∼10%/90%). Because the implementation of some MPI collectives changes according to the number of nodes used, we have chosen these splits so that we run the victim with power-of-two (256), even (460), and odd (53) numbers of nodes. To further increase the generated congestion, in some experiments we increase the number of processes per node (PPN) used by the aggressor. Each node used by the aggressor spawns PPN processes, each of them performing the same communications; that is, the congestion pattern is concurrently executed PPN times.

Moreover, the allocation of the nodes to victims and aggressors determines how many switches and groups are shared between the two jobs, and has a direct impact on the performance of the victim. In our experiments, we consider the three well-known allocation placement strategies [51] depicted in Figure 7: linear, where we allocate the first n nodes to the victim and the remaining nodes to the aggressor; interleaved, where we interleave the nodes allocated to the victim and the aggressor; and random, where we randomly allocate the nodes to the victim and the aggressor.

We make sure that the data we report is statistically sound [52]: for each microbenchmark, we execute the victim at least 200 times and for at least 4 seconds. We stop the benchmark when both of the previous conditions are satisfied and the 95% confidence interval is within 5% of the median. We then consider for each iteration the maximum time among the ranks. For the applications, we consider the time reported by the application, which we execute multiple times until the 95% confidence interval is within 5% of the median.
Fig. 7: Different victim/aggressor allocations.
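As an illustration of the two aggressor patterns, the following MPI sketch is our reconstruction of their communication structure; the actual benchmark is GPCNet [6], and details such as the rotating partner schedule, the window setup, and the iteration count are our simplifications:

    #include <mpi.h>

    #define MSG 131072  /* 128KiB messages, as in the experiments */

    /* Endpoint congestion: every aggressor rank targets the same endpoint. */
    static void incast_step(MPI_Win win, int target, char *buf) {
        MPI_Win_lock(MPI_LOCK_SHARED, target, 0, win);
        MPI_Put(buf, MSG, MPI_CHAR, target, 0, MSG, MPI_CHAR, win);
        MPI_Win_unlock(target, win);
    }

    /* Intermediate congestion: a shift-based all-to-all over MPI_Sendrecv. */
    static void alltoall_step(int rank, int size, int shift,
                              char *sbuf, char *rbuf) {
        int to   = (rank + shift) % size;
        int from = (rank - shift + size) % size;
        MPI_Sendrecv(sbuf, MSG, MPI_CHAR, to,   0,
                     rbuf, MSG, MPI_CHAR, from, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        static char winbuf[MSG], sbuf[MSG], rbuf[MSG];
        MPI_Win win;
        MPI_Win_create(winbuf, MSG, 1, MPI_INFO_NULL, MPI_COMM_WORLD, &win);

        for (int iter = 1; iter <= 1000; iter++) {
            if (rank != 0) incast_step(win, 0, sbuf);      /* many-to-one */
            /* or: alltoall_step(rank, size, 1 + iter % (size - 1),
                                 sbuf, rbuf);                 all-to-all  */
        }

        MPI_Win_free(&win);
        MPI_Finalize();
        return 0;
    }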
TYPE | APPL. | DESCRIPTION

HPC:
• MILC: A set of numerical simulation codes working on quantum chromodynamics (QCD) [36]. We use the su3_rmd kernel, which decomposes a four-dimensional grid and mostly performs point-to-point neighbour communications and global reductions [37].
• HPCG: A set of communication and computational patterns matching a wide set of applications. It relies on sparse triangular solvers and preconditioned conjugate gradient algorithms [3]. It mostly uses stencil communications and global reductions.
• LAMMPS: A molecular dynamics code that models an ensemble of particles in a liquid, solid, or gaseous state [38]. This kernel performs reductions and point-to-point blocking and non-blocking communications between nodes at different distances.
• FFT: Fast Fourier Transform on a 3D domain [39]. It employs broadcasts, scatters, and point-to-point communications [40].
• Resnet-proxy: An ML/AI proxy application [41] reproducing the communication phases of a Deep500 benchmark [42] Residual Neural Network (resnet). This application uses non-blocking reduction operations.

DC:
• Silo: A fast in-memory transactional database [43], widely used in online transaction processing (OLTP) systems.
• Sphinx: A speech recognition system [44], involving probabilistic pruning of a large search tree.
• Xapian: A search engine [45] using a search index built from a snapshot of the English version of Wikipedia. Multiple queries are executed, with a distribution similar to that of online search queries.
• Img-dnn: An application using a deep neural network-based autoencoder to identify handwritten characters [46].

TABLE I: Applications used as victims in the congestion tests. We consider both HPC and datacenter (DC) applications. Img-dnn, Xapian, Sphinx, and Silo are all single-client, single-server applications coming from the Tailbench benchmark [47] for latency-sensitive datacenter applications. We selected this subset because it covers a wide range of latencies, from microseconds (Silo) to seconds (Sphinx).
We report in Figure 8 the time distribution for the Tailbench applications, both when executed in isolation and when executed with an incast aggressor, on both Aries and Slingshot. We also annotate the 95th and 99th percentiles, to show the impact of tail latency. We executed these experiments using the linear allocation and a fixed victim/aggressor split. For Silo, Xapian, and Img-dnn we observe severe performance degradation due to congestion on Aries, whereas we do not observe any relevant effect on Slingshot. For Sphinx, we observe a smaller degradation because its communication-to-computation ratio is lower than that of the other applications. Moreover, we observe a higher tail latency on Aries, which further increases in the presence of congestion. It is worth remarking that the congestion impact itself is enough to characterize how much Slingshot is affected by congestion. In addition, we are also comparing Slingshot with an Aries interconnect, to also show the improvements compared to an existing interconnection network. Moreover, a performance degradation similar to the one we observed on Aries has also been observed on other interconnects [6], [11], [53].

Fig. 8: Time distribution of Tailbench applications, with and without endpoint congestion. The labels on the top of each plot denote the 95th and 99th percentiles.

Due to the large number of combinations of victims, aggressors, and allocations, we provide a data summary of the linear allocation results as a heatmap in Figure 9. Each element of the heatmap represents the mean congestion impact C [6], i.e.,

    C = T_c / T_i,    (1)

where T_i is the mean execution time of the victim when executed in isolation, and T_c is the mean execution time of the victim when co-executed with the aggressor. For example, the element in the top left corner represents the scenario where MILC is executed together with an all-to-all aggressor. 10% of the nodes are allocated to the aggressor, whereas the remaining nodes are allocated to the victim. For this specific case, no significant congestion impact is observed. On the other hand, MILC experiences a 1.6x slowdown on Aries due to endpoint congestion (incast) when 10% of the nodes are allocated to the aggressor. For the same scenario, we do not observe any slowdown on Slingshot.

Fig. 9: Congestion effects on different victim and aggressor combinations. Each element of the heatmap represents the congestion impact of the aggressor on the victim.

We report the applications and microbenchmarks results using two different (logarithmic) color scales, to better appreciate the differences. Indeed, applications are usually less affected
LINGSHOT is always lessaffected by congestion compared to A
RIES . In the worst case,we observed a 1.3x slowdown on S
LINGSHOT , compared toa maximum 93x slowdown on A
RIES . Moreover, the conges-tion impact increases when increasing the fraction of nodesallocated to the aggressor application, and has a larger impacton small message communications, due to the larger impactof end-to-end latency on the overall performance. The effectsof congestion can be seen not only on microbenchmarksbut also on full applications. For example, LAMMPS is 17xslower when executed together with an incast aggressor witha 50/50 split on A
RIES . Intermediate congestion (generatedthrough all-to-all communication), does not significantly affectthe systems we are analyzing, because the adaptive routingalgorithm successfully routes the packets around the congestedlinks. This means that, the additional load generated by the all-to-all does not manifest as congestion.Similar trends can be observed also for different node count,higher PPNs, and other allocations. For space reasons, we donot report all the heatmaps for each of these cases. Instead,we summarize each heatmap by showing the distribution ofthe heatmap elements (congestion impacts, Equation 1) acrossall the victim/aggressor combinations. We show the result ofthis comparison in Figure 10.First, we show in Figure 10 ( A ) the congestion impact for different allocations. For example, for the linear allocation, weare showing the same data of Figure 9. However, instead ofshowing all the individual congestion impacts, we now reporttheir distribution. For readability purposes, we cut the longtails of the distributions, and we annotate on top of each violinthe maximum value. We observe that whereas on A RIES thecongestion impact for the linear allocation is never higher than100, for the interleaved and random allocations we observevalues up to 150. We observed a similar effect on S
LINGSHOT but on a different scale. In this case, in all but one cases weobserve congestion impact values lower than two. Moreover,differently from A
RIES , the distribution on S
LINGSHOT is lessspread, which indicates that the congestion control algorithmis performing well across a wide set of victims and allocations.In Figure 10 ( B ), we report a similar analysis, but now theaggressors are using 24 processes per node (PPN) instead of1, thus generating a higher load on the network. In this case,the impact of congestion increases for A RIES , especially forrandom allocations. On the other hand, S
LINGSHOT is onlyminimally impacted, showing a maximum congestion impact ∼ times lower than on A RIES .Lastly, in Figure 10 ( C ) we report the congestion impactwhen using fewer nodes (128). Until now, we compared aS LINGSHOT system (S
HANDY , 1024 nodes) against a smallerA
RIES one (C
RYSTAL , 698 nodes). To factor out possibleperformance variations coming from different system sizes,we now compare C
RYSTAL with a smaller S
LINGSHOT system(M
ALBEC , 484 nodes). We also fix the number of nodes perdragonfly group to 64, in order to allocate the same number ofgroups (two) in both cases. On A
Fig. 10: Congestion impact distribution across different victim/aggressor combinations, for different allocations, node counts, and processes per node (PPN).

Fig. 11: Congestion impact on all the nodes of Shandy.

On Aries, the maximum congestion impact goes from 154 (Figure 10 (A)) to 40 (Figure 10 (C)) when using 128 nodes instead of 512. This can be explained by the lower generated traffic (the aggressors now have fewer nodes), but also by the higher fraction of available global bandwidth. On Slingshot, the same experiment makes the maximum congestion impact go from 2.3 to 1.5. We conclude that Slingshot is less affected by congestion, even when varying the system size and the number of allocated nodes.

The results of Figure 11 show the congestion impact on the applications when using all the nodes of Shandy. We report the data when using a random allocation because that is the one generating the most congestion (see Figure 10). We can observe that even at full system scale the congestion control effectively protects applications from congestion, with a maximum 3.55x slowdown on LAMMPS when 75% of the nodes are allocated to the incast congestor. Data on MILC and HPCG with a 25%/75% aggressor/victim ratio is missing: those runs would require 768 victim nodes, but these applications can only run on a power-of-two number of nodes.

We complete our analysis of the effects of congestion by analyzing the impact of bursty congestion on Slingshot.
Indeed, in the previous experiments we always considered persistent congestion, generated by sending messages with a fixed size of 128KiB during the entire victim execution. To analyze the impact of bursty congestion, we execute a 128-byte MPI_Alltoall microbenchmark (victim) with an incast aggressor. This is one of the cases where we observed the highest congestion impact on Slingshot (see Figure 9). We run this test on all the Malbec nodes, splitting them equally between aggressor and victim, with an interleaved allocation strategy.

We report the results of this analysis in Figure 12. Each heatmap corresponds to a different message size for the incast aggressor. On each heatmap we report the congestion impact when varying the number of messages in a burst (Burst Size, on the y-axis) and the time between two subsequent congestion bursts (Bursts Gap, on the x-axis). For example, the bottom-left element in the first heatmap represents the case where the aggressor sends a burst of consecutive messages, each containing 8 bytes, and waits 1 microsecond before sending the next burst.

Fig. 12: Impact of incast congestion on a 128-byte MPI_Alltoall. We show the impact for different message sizes, congestion duration, and time between subsequent congestion bursts.

We observe that the incast aggressor does not affect the victim when sending too small or too large messages. Indeed, small messages do not generate enough congestion, whereas for large messages the congestion control algorithm fully kicks in and throttles the aggressor. On the other hand, for medium-size messages, some congestion builds up before the congestion control algorithm detects and reacts to it, and we observe an increase in the congestion impact, up to 1.22. However, as we showed in Figure 9, this is negligible when compared to what happens on other types of systems. Moreover, we observe the highest congestion impact for large bursts and for small gaps between subsequent bursts. We also observe no difference between the largest bursts and persistent congestion. This shows that Slingshot tolerates both persistent congestion and bursty, short-lived congestion.
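The bursty aggressor is a small variation of the incast pattern sketched after Figure 7. In this fragment (again our reconstruction, with hypothetical parameter names), burst_size and gap_us correspond to the y- and x-axes of Figure 12:

    #include <mpi.h>
    #include <unistd.h>

    /* One aggressor rank: bursts of `burst_size` RDMA puts of `msg_bytes`
     * each, separated by a pause of `gap_us` microseconds. */
    static void bursty_incast(MPI_Win win, int target, char *buf,
                              int msg_bytes, int burst_size, int gap_us) {
        for (;;) {
            MPI_Win_lock(MPI_LOCK_SHARED, target, 0, win);
            for (int i = 0; i < burst_size; i++)
                MPI_Put(buf, msg_bytes, MPI_CHAR, target, 0,
                        msg_bytes, MPI_CHAR, win);
            MPI_Win_unlock(target, win);  /* completes the whole burst    */
            usleep(gap_us);               /* the bursts gap of Figure 12  */
        }
    }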
B. Traffic Classes
We now evaluate the ability of Slingshot to provide performance guarantees to running jobs by using traffic classes. It is worth remarking that traffic classes and congestion control are orthogonal concepts. Traffic classes can be used to protect a job (or parts of it) from other traffic, and they can allocate resources fairly or unfairly between users and jobs. However, even if resources are assigned fairly, congestion can still occur due to jobs filling up the buffers. Congestion control is used to avoid such situations within and across traffic classes.

Fig. 13: Congestion impact for an 8B MPI_Allreduce, co-executed with a 256KiB MPI_Alltoall on Malbec (with a 25% tapering), with and without traffic classes.

All the experiments presented in the following have been executed on Malbec. We taper the bandwidth to 25% of the available bandwidth, to force co-running jobs to interfere with each other. We execute a job performing an 8B MPI_Allreduce together with a job performing a 256KiB MPI_Alltoall. Each job uses 64 nodes and 16 processes per node, and they are placed using the interleaved allocation. We report in Figure 13 the congestion impact of the MPI_Allreduce when using the same traffic class as the MPI_Alltoall and when using a separate traffic class. Each point represents the mean over 100 000 runs. The MPI_Alltoall is started around 0.4 milliseconds after the beginning of the test. We observe that when the MPI_Allreduce runs in the same traffic class as the MPI_Alltoall, it experiences a congestion impact of 2.85 (i.e., it is 2.85 times slower compared to when executed in isolation). On the other hand, when executed in a separate traffic class, it only experiences a 1.15x slowdown compared to the isolated case.
We now further investigate the capacity of Slingshot to enforce specific limits on traffic classes. We execute two jobs, each running a bisection bandwidth test, with the second one starting 0.9 milliseconds after the beginning of the test. Each job uses 16 processes per node and runs on 64 nodes. Jobs are placed by using the interleaved allocation. We configure two traffic classes: TC1, with a minimum bandwidth requirement of 80% of the available bandwidth, and TC2, with a minimum bandwidth requirement of 10%.

We report the results of this experiment in Figure 14. In the upper part, we report the results we obtain when both jobs run on the same traffic class (TC1). At the beginning of the execution, the first job runs on an empty system and gets 100% of the available bandwidth. When the second job starts, the available bandwidth is fairly shared between the two jobs. Eventually, when the first job terminates, the second job ramps up and uses all the available bandwidth.

Fig. 14: Performance of two bisection bandwidth tests on Malbec (with a 25% tapering) when running in the same traffic class (top) and when running in two separate traffic classes (bottom).

In the lower part of Figure 14, we report the results when the first job runs in TC1 and the second job runs in TC2. In this case, when the second job starts, the bandwidth of the first job drops to 80% of the available bandwidth, matching the minimum bandwidth required for TC1. The second job required a minimum bandwidth of 10%, and it gets 20% of the available bandwidth. Indeed, there is an extra 10% of bandwidth which was not allocated to either TC1 or TC2. Slingshot decides to dynamically allocate this extra bandwidth to TC2 because it is the traffic class with the lowest bandwidth share. Eventually, when the first job terminates, the second job uses all the available bandwidth.
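The shares observed in Figure 14 follow directly from the configured minima: TC1 reserves 80% and TC2 reserves 10%, leaving 100% − 80% − 10% = 10% unreserved; handing this slack to the class with the lowest share yields 10% + 10% = 20% for TC2, which is exactly the plateau measured for the second job.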
IV. STATE OF THE ART

A. Interconnection Networks
Existing large-scale computing systems are characterized by different types of interconnection networks, either based on open standards or on proprietary technology. These networks have different topologies and provide different features. In this section, we highlight the main characteristics of the most common and actively developed interconnection networks, to better understand the similarities and differences with Slingshot.

InfiniBand is an open standard for high-performance network communications. Different vendors manufacture InfiniBand switches and interfaces, and the InfiniBand standard is not tied to any specific network topology. The most commonly used InfiniBand implementations rely on Mellanox hardware, with switches arranged in a fat tree topology [54]. For example, both Sierra [55] and Summit [17], the two fastest supercomputers at the time of writing, use such a configuration. Mellanox networks also provide other features to improve application performance, such as switch offloading of MPI collective operations, adaptive routing, congestion control, and traffic classes. However, congestion control is usually not used in large production systems due to difficulties in the tuning of the algorithm [6]. Regarding interoperability with Ethernet, Mellanox adopts a different approach than Slingshot, requiring traffic to be converted between InfiniBand and Ethernet by using dedicated gateways.
Cray Aries [48] is the 7th generation of Cray interconnection networks. It is based on a Dragonfly topology and supports different system configurations up to 92 544 nodes (Trinity [56], the largest Aries system currently deployed, has 19 420 nodes). It provides a peak injection bandwidth of 81.6 Gb/s per node, and a rich set of features including adaptive routing, collective operations offload, and remote atomic operations. It uses fewer optical links than fat tree networks, reducing the cost of the network.
Tofu Interconnect D (TofuD) [57] is the third generation of Tofu interconnection networks, and it will be used by the Fugaku supercomputer [58] (formerly known as Post-K). TofuD provides a peak injection rate of 300Gb/s per node and, like its predecessors, it is based on a 6D mesh/torus. Around 25% of the links used by the interconnect are optical. To reduce latency and improve fault resiliency, TofuD uses a technique called dynamic packet slicing to split the packets in the data-link layer. This can either be used to split a packet and improve the transmission performance, or to duplicate a packet to provide fault tolerance in case the link quality degrades. Moreover, this interconnect provides an offload engine, called Tofu Barrier, to execute collective operations without involving the CPU.
The Dragonfly+ [59] is currently used by the Niagara supercomputer [60]. It is a variation of the Dragonfly interconnect [12], where the switches inside a group are connected through a fat-tree network. Similarly to the Dragonfly network, this interconnect is characterized by different minimal and non-minimal paths between each pair of nodes. The implementation used in the Niagara supercomputer relies on Mellanox InfiniBand hardware. To select the optimal path, Dragonfly+ uses a variation of the OFAR adaptive routing [61], which at each hop re-evaluates the optimal path to use. Explicit control messages are sent among the switches to notify congestion and avoid creating hotspots in the network.

Several other low-diameter networks [62] have been proposed by the research community, including but not limited to the SlimFly [63], Megafly [64], HyperX [65], [66], Jellyfish [67], and Xpander [68] topologies. On the data center side, Clos [69] is the most prevalent deployed topology. Whereas the above-mentioned low-diameter topologies claim substantial cost-performance improvements, they have been scarcely employed because of hard-to-deploy routing schemes. Also, classical multi-path load balancing mechanisms (e.g., ECMP [70]) are not effective in such low-diameter networks due to the scarcity of minimal paths [18]. Slingshot addresses these issues by providing a low-diameter network with an effective congestion control algorithm, setting a stepping stone towards HPC data centers.

Overall, Slingshot introduces a set of key features that can be taken as a reference for next-generation large-scale computing systems. First, the end-to-end congestion control algorithm can quickly react to congestion and is stable across a wide set of applications and microbenchmarks. Moreover, traffic classes provide additional flexibility and open new software optimization opportunities. Lastly, it is natively interoperable with existing Ethernet devices and, thanks to novel adaptive routing strategies, it provides high network utilization also for in-order RoCE traffic (see Figure 6).
B. Interconnection Networks Benchmarking

In this work we described the Slingshot interconnection network and, for the first time, we extensively evaluated it across a wide set of microbenchmarks and real applications. We reported both the isolated performance and the performance in the presence of congestion.

Regarding the evaluation of the system under load, different works analyzed the impact of congestion (also known as network noise) on application performance [5], [6], [11], [71]–[74] on different types of networks. The GPCNet benchmark [6] has been recently proposed as a portable benchmark for estimating network congestion. We used in this work the same definitions of endpoint/intermediate congestion and of congestion impact used by GPCNet. Whereas the authors of GPCNet also report some preliminary results on a Slingshot system, they do not provide a detailed view of the system performance. Indeed, the main goal of GPCNet was to design a portable congestion benchmarking infrastructure by using a small set of victim microbenchmarks (random ring and MPI_Allreduce) to easily compare different systems. However, this does not represent a wide spectrum of real scenarios. On the other hand, we focus on the impact of congestion on Slingshot by using different microbenchmarks, and both HPC and datacenter applications. Moreover, the GPCNet paper only analyzes the impact of congestion for a fixed victim message size, allocation, and aggressor/victim ratio. However, as we show in Section III-A, all these factors play a role in the observed congestion, and they can be helpful to understand the system performance.

V. CONCLUSIONS
Interconnection networks have a significant impact on the performance of large computing systems, both in supercomputers and in hyperscale datacenters. In this paper, we describe and evaluate SLINGSHOT, the latest interconnection network designed by Cray. We describe SLINGSHOT's main features: high-radix Ethernet switches, adaptive routing, congestion control, and QoS management. We then evaluate SLINGSHOT's performance, both in isolation and when executing different concurrent workloads.

Our results demonstrate that applications running on SLINGSHOT are much less affected by congestion compared to previous-generation networks, and that the congestion control algorithm works on a wide set of different microbenchmarks and HPC and datacenter applications. We also show that allocation policies have a much lower impact on performance on SLINGSHOT compared to previous-generation networks. Lastly, we demonstrate how SLINGSHOT can provide bandwidth guarantees to jobs running in separate traffic classes.

The information we provide can be used by HPC and datacenter system operators, administrators, users, and programmers to optimize, deploy, and manage parallel applications. A deep understanding of the interconnect's features is a prerequisite for ensuring optimized operations and utilization of computing resources in clouds and datacenters.

ACKNOWLEDGMENT
We thank the anonymous reviewers for their insightful comments, and the Slingshot team at HPE for providing access to and support in using the systems. We thank Steve Scott (HPE) for invaluable input. We would also like to thank Shigang Li for providing the code for the Resnet-proxy application. Daniele De Sensi is supported by an ETH Postdoctoral Fellowship (19-2 FEL-50). This project has received funding from the European Research Council (ERC) under the European Union's Horizon 2020 programme (grant agreement DAPP, No. 678880).
REFERENCES
[1] The Top500 List. http://top500.org. Accessed: 12-03-2020.
[2] Jack J. Dongarra, James R. Bunch, Cleve B. Moler, and G. W. Stewart. LINPACK Users' Guide. Society for Industrial and Applied Mathematics, Philadelphia, PA, 1979.
[3] Jack Dongarra, Michael A. Heroux, and Piotr Luszczek. High-performance conjugate-gradient benchmark: A new metric for ranking high-performance computing systems. The International Journal of High Performance Computing Applications, 30(1):3–10, 2016.
[4] Jeffrey Dean and Luiz André Barroso. The tail at scale. Communications of the ACM, 56:74–80, 2013.
[5] Daniele De Sensi, Salvatore Di Girolamo, and Torsten Hoefler. Mitigating network noise on dragonfly networks through application-aware routing. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, SC '19, New York, NY, USA, 2019. Association for Computing Machinery.
[6] Sudheer Chunduri, Taylor Groves, Peter Mendygral, Brian Austin, Jacob Balma, Krishna Kandalla, Kalyan Kumaran, Glenn Lockwood, Scott Parker, Steven Warren, et al. GPCNeT: Designing a benchmark suite for inducing and measuring contention in HPC networks. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, SC '19, New York, NY, USA, 2019. Association for Computing Machinery.
[7] Pulkit A. Misra, María F. Borge, Íñigo Goiri, Alvin R. Lebeck, Willy Zwaenepoel, and Ricardo Bianchini. Managing tail latency in datacenter-scale file systems under production constraints. In Proceedings of the Fourteenth EuroSys Conference 2019, pages 1–8, May 2019.
[12] J. Kim, W. J. Dally, S. Scott, and D. Abts. Technology-driven, highly-scalable dragonfly topology. In 2008 International Symposium on Computer Architecture (ISCA), pages 77–88, June 2008.
[13] W. H. Tranter, D. P. Taylor, R. E. Ziemer, N. F. Maxemchuk, and J. W. Mark. Input Versus Output Queueing on a Space-Division Packet Switch, pages 561–570. 2007.
[14] Y. Tamir and G. L. Frazier. High-performance multiqueue buffers for VLSI communication switches. In [1988] The 15th Annual International Symposium on Computer Architecture. Conference Proceedings, pages 343–354, 1988.
[15] Thomas E. Anderson, Susan S. Owicki, James B. Saxe, and Charles P. Thacker. High-speed switch scheduling for local-area networks. ACM Trans. Comput. Syst., 11(4):319–352, November 1993.
[18] Maciej Besta et al. FatPaths: Routing in supercomputers and data centers when shortest paths fall short. CoRR, abs/1906.10885, May 2019.
[19] Sally Floyd. TCP and explicit congestion notification. SIGCOMM Comput. Commun. Rev., 24(5):8–23, October 1994.
[20] IEEE 802.1Qau – Congestion Notification. https://1.ieee802.org/dcb/802-1qau/. Accessed: 12-03-2020.
[21] Yibo Zhu, Haggai Eran, Daniel Firestone, Chuanxiong Guo, Marina Lipshteyn, Yehonatan Liron, Jitendra Padhye, Shachar Raindel, Mohamad Haj Yahia, and Ming Zhang. Congestion control for large-scale RDMA deployments. In SIGCOMM. ACM, August 2015.
[22] Yibo Zhu, Monia Ghobadi, Vishal Misra, and Jitendra Padhye. ECN or delay: Lessons learnt from analysis of DCQCN and TIMELY. In Proceedings of the 12th International Conference on Emerging Networking EXperiments and Technologies, CoNEXT '16, pages 313–327, New York, NY, USA, 2016. Association for Computing Machinery.
[23] Y. Gao, Y. Yang, T. Chen, J. Zheng, B. Mao, and G. Chen. DCQCN+: Taming large-scale incast congestion in RDMA over Ethernet networks. In 2018 IEEE 26th International Conference on Network Protocols (ICNP), pages 110–120, Sep. 2018.
[24] Differentiated Services Codepoint (DSCP) - RFC 3260. https://tools.ietf.org/html/rfc3260. Accessed: 12-03-2020.
[25] 25G Ethernet Consortium. Low-Latency FEC Specification. https://25gethernet.org/ll-fec-specification. Accessed: 01-03-2020.
[26] Lane error detection and lane removal mechanism to reduce the probability of data corruption. https://patents.google.com/patent/US9325449B2/en. Accessed: 12-03-2020.
[27] Message Passing Interface Forum. MPI: A message-passing interface standard, version 3.0. Specification, September 2012.
[28] Rajeev Thakur, P. Balaji, D. Buntinas, D. Goodell, William Gropp, Torsten Hoefler, S. Kumar, E. Lusk, and Jesper Larsson Träff. MPI at Exascale. In Proceedings of SciDAC 2010, Jun. 2010.
[29] Pavan Balaji. Programming Models for Parallel Computing. The MIT Press, 2015.
[30] George Almasi. PGAS (Partitioned Global Address Space) Languages. In Encyclopedia of Parallel Computing. Springer, 2011.
[34] Proceedings of the 27th International ACM Conference on International Conference on Supercomputing, pages 139–148. ACM, Jun. 2013.
[35] Rajeev Thakur, Rolf Rabenseifner, and William Gropp. Optimization of collective communication operations in MPICH. Int. J. High Perform. Comput. Appl., 19(1):49–66, February 2005.
[36] Steven Gottlieb, W. Liu, William D. Toussaint, R. L. Renken, and R. L. Sugar. Hybrid-molecular-dynamics algorithms for the numerical simulation of quantum chromodynamics. Physical Review D: Particles and Fields, 35(8):2531–2542, 1987.
[37] G. Bauer, S. Gottlieb, and T. Hoefler. Performance modeling and comparative analysis of the MILC lattice QCD application su3_rmd. In Proceedings of the 2012 12th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid 2012), pages 652–659. IEEE Computer Society, May 2012.
[38] Steve Plimpton. Fast parallel algorithms for short-range molecular dynamics. J. Comput. Phys., 117(1):1–19, March 1995.
[39] M. Frigo and S. G. Johnson. The design and implementation of FFTW3. Proceedings of the IEEE, 93(2):216–231, Feb 2005.
[40] Teng Ma, Aurelien Bouteiller, George Bosilca, and Jack J. Dongarra. Impact of kernel-assisted MPI communication over scientific applications: CPMD and FFTW. In Yiannis Cotronis, Anthony Danalis, Dimitrios S. Nikolopoulos, and Jack Dongarra, editors, Recent Advances in the Message Passing Interface, pages 247–254, Berlin, Heidelberg, 2011. Springer Berlin Heidelberg.
[41] Shigang Li, Tal Ben-Nun, Salvatore Di Girolamo, Dan Alistarh, and Torsten Hoefler. Taming unbalanced training workloads in deep learning with partial collective operations. Proceedings of the 25th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, Feb 2020.
[42] Tal Ben-Nun, Maciej Besta, Simon Huber, Alexandros Nikolaos Ziogas, Daniel Peter, and Torsten Hoefler. A modular benchmarking infrastructure for high-performance and reproducible deep learning. CoRR, abs/1901.10183, 2019.
[43] Stephen Tu, Wenting Zheng, Eddie Kohler, Barbara Liskov, and Samuel Madden. Speedy transactions in multicore in-memory databases. In Proceedings of the Twenty-Fourth ACM Symposium on Operating Systems Principles, SOSP '13, pages 18–32, New York, NY, USA, 2013. Association for Computing Machinery.
[44] Willie Walker, Paul Lamere, Philip Kwok, Bhiksha Raj, Rita Singh, Evandro Gouvea, Peter Wolf, and Joe Woelfel. Sphinx-4: A flexible open source framework for speech recognition. Technical report, 2004.
[45] Xapian Project. https://github.com/xapian/xapian. Accessed: 12-03-2020.
[46] A deep network handwriting classifier. https://github.com/xingdi-ericyuan/multi-layer-convnet. Accessed: 12-03-2020.
[47] H. Kasture and D. Sanchez. TailBench: A benchmark suite and evaluation methodology for latency-critical applications. In 2016 IEEE International Symposium on Workload Characterization (IISWC), pages 1–10, 2016.
[48] Bob Alverson, Edwin Froese, Larry Kaplan, and Duncan Roweth. Cray XC series network. Cray Inc., White Paper WP-Aries01-1112, 2012.
[49] S. Chunduri, S. Parker, P. Balaji, K. Harms, and K. Kumaran. Characterization of MPI usage on a production supercomputer. In SC18: International Conference for High Performance Computing, Networking, Storage and Analysis, pages 386–400, 2018.
[50] Ember Communication Pattern Library. https://github.com/sstsimulator/ember. Accessed: 10-04-2019.
[51] B. Prisacari, G. Rodriguez, P. Heidelberger, D. Chen, C. Minkenberg, and Torsten Hoefler. Efficient task placement and routing in dragonfly networks. In Proceedings of the 23rd ACM International Symposium on High-Performance Parallel and Distributed Computing (HPDC '14). ACM, Jun. 2014.
[52] Torsten Hoefler and Roberto Belli. Scientific benchmarking of parallel computing systems: Twelve ways to tell the masses when reporting performance results. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, SC '15, pages 73:1–73:12, New York, NY, USA, 2015. ACM.
[53] Samuel D. Pollard, Nikhil Jain, Stephen Herbein, and Abhinav Bhatele. Evaluation of an interference-free node allocation policy on fat-tree clusters. In Proceedings of the International Conference for High Performance Computing, Networking, Storage, and Analysis, SC '18, pages 26:1–26:13, Piscataway, NJ, USA, 2018. IEEE Press.
[54] Mohammad Al-Fares, Alexander Loukissas, and Amin Vahdat. A scalable, commodity data center network architecture. SIGCOMM Comput. Commun. Rev., 38(4):63–74, 2008.
[60] Marcelo Ponce, Bruno C. Mundim, Mike Nolta, Jaime Pinto, Marco Saldarriaga, Vladimir Slavnic, Erik Spence, Ching-Hsing Yu, W. Richard Peltier, Ramses van Zon, et al. Deploying a top-100 supercomputer for large parallel workloads. Proceedings of the Practice and Experience in Advanced Research Computing on Rise of the Machines (learning) - PEARC '19, 2019.
[61] M. García, E. Vallejo, R. Beivide, M. Odriozola, C. Camarero, M. Valero, G. Rodríguez, J. Labarta, and C. Minkenberg. On-the-fly adaptive routing in high-radix hierarchical networks. In 2012 41st International Conference on Parallel Processing, pages 279–288, Sep. 2012.
[62] G. Kathareios, C. Minkenberg, B. Prisacari, G. Rodriguez, and T. Hoefler. Cost-effective diameter-two topologies: Analysis and evaluation. In SC '15: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, pages 1–11, Nov 2015.
[63] M. Besta and T. Hoefler. Slim Fly: A cost effective low-diameter network topology. In SC '14: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, pages 348–359, Nov 2014.
[64] Mario Flajslik, Eric Borch, and Mike A. Parker. Megafly: A topology for exascale systems. In Rio Yokota, Michèle Weiland, David Keyes, and Carsten Trinitis, editors, High Performance Computing, pages 289–310, Cham, 2018. Springer International Publishing.
[65] J. H. Ahn, N. Binkert, A. Davis, M. McLaren, and R. S. Schreiber. HyperX: Topology, routing, and packaging of efficient large-scale networks. In Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis, pages 1–11, 2009.
[66] Jens Domke, Satoshi Matsuoka, Ivan R. Ivanov, Yuki Tsushima, Tomoya Yuki, Akihiro Nomura, Shinichi Miura, Nic McDonald, Dennis L. Floyd, and Nicolas Dubé. HyperX topology: First at-scale implementation and comparison to the fat-tree. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, SC '19, New York, NY, USA, 2019. Association for Computing Machinery.
[67] Ankit Singla, Chi-Yao Hong, Lucian Popa, and P. Brighten Godfrey. Jellyfish: Networking data centers randomly. In Presented as part of the 9th USENIX Symposium on Networked Systems Design and Implementation (NSDI 12), pages 225–238, San Jose, CA, 2012. USENIX.
[68] Asaf Valadarsky, Michael Dinitz, and Michael Schapira. Xpander: Unveiling the secrets of high-performance datacenters. In Proceedings of the 14th ACM Workshop on Hot Topics in Networks, HotNets-XIV, New York, NY, USA, 2015. Association for Computing Machinery.
[69] Charles Clos. A study of non-blocking switching networks. Bell System Technical Journal, 32(2):406–424, 1953.
[70] Christian Hopps et al. Analysis of an equal-cost multi-path algorithm. Technical report, RFC 2992, November 2000.
[71] Philip Taffet and John Mellor-Crummey. Understanding congestion in high performance interconnection networks using sampling. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, SC '19, New York, NY, USA, 2019. Association for Computing Machinery.
[72] Staci A. Smith, Clara E. Cromey, David K. Lowenthal, Jens Domke, Nikhil Jain, Jayaraman J. Thiagarajan, and Abhinav Bhatele. Mitigating inter-job interference using adaptive flow-aware routing. In Proceedings of the International Conference for High Performance Computing, Networking, Storage, and Analysis, SC '18. IEEE Press, 2018.
[73] Xu Yang, John Jenkins, Misbah Mubarak, Robert B. Ross, and Zhiling Lan. Watch out for the bully! Job interference study on dragonfly network. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, SC '16. IEEE Press, 2016.
[74] A. Bhatele, K. Mohror, S. H. Langer, and K. E. Isaacs. There goes the neighborhood: Performance degradation due to nearby jobs. In Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis, SC '13, New York, NY, USA, 2013. ACM.