
Publications

Featured research published by Nikolaos Chrysos.


International Conference on Communications | 2004

Variable packet size buffered crossbar (CICQ) switches

Manolis Katevenis; Giorgos Passas; Dimitrios Simos; Ioannis Papaefstathiou; Nikolaos Chrysos

One of the most widely used architectures for packet switches is the crossbar. A special version of it is the buffered crossbar, where small buffers are associated with the crosspoints; this simplifies scheduling and improves its efficiency and QoS capabilities to the point where the switch needs no internal speedup. Furthermore, by supporting variable-length packets throughout a buffered crossbar: (a) there is no need for segmentation and reassembly (SAR) circuits; (b) no speedup is necessary to support SAR; and (c) synchronization between the input and output clock domains is simplified. In turn, the lack of SAR and speedup means that no output queues are needed either. In this paper we present an architecture, a chip layout and cost analysis, and a performance evaluation of such a 300 Gbps buffered crossbar operating on variable-size packets. The proposed organization is simple yet powerful; it can be implemented using modern technology, and, as the performance results demonstrate, it clearly outperforms unbuffered crossbars.


High Performance Switching and Routing | 2003

Weighted fairness in buffered crossbar scheduling

Nikolaos Chrysos; Manolis Katevenis

The crossbar is the most popular packet switch architecture. By adding small buffers at the crosspoints, important advantages can be obtained: (1) crossbar scheduling is simplified; (2) high throughput is achievable; (3) weighted scheduling becomes feasible. We study the fairness properties of a buffered crossbar with weighted fair schedulers. We show by means of simulation that, under heavy demand, the system allocates throughput in a weighted max-min fair manner. We study the impact of the size of the crosspoint buffers in approximating the weighted max-min fair rates and we find that a small amount of buffering per crosspoint (3-8 cells) suffices for the maximum percentage discrepancy to fall below 5% for 32×32 switches.
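To illustrate what a weighted max-min fair allocation looks like, here is a minimal water-filling sketch. This is our own toy model of the fairness objective the paper measures, not the paper's crosspoint scheduler: it splits a single link's capacity across flows in proportion to their weights, capping each flow at its demand.

```python
def weighted_max_min(capacity, demands, weights):
    """Water-filling sketch: allocate one link's capacity so that
    unsatisfied flows get rates proportional to their weights, and
    flows with small demands are capped at their demand."""
    alloc = [0.0] * len(demands)
    active = set(range(len(demands)))
    remaining = capacity
    while active and remaining > 1e-12:
        total_w = sum(weights[i] for i in active)
        share = remaining / total_w  # capacity per unit of weight
        # flows whose leftover demand fits under their weighted share
        # become bottlenecked at their demand and leave the active set
        bottlenecked = [i for i in active
                        if demands[i] - alloc[i] <= share * weights[i]]
        if not bottlenecked:
            for i in active:
                alloc[i] += share * weights[i]
            remaining = 0.0
        else:
            for i in bottlenecked:
                remaining -= demands[i] - alloc[i]
                alloc[i] = demands[i]
                active.remove(i)
    return alloc
```

For example, three persistently backlogged flows with weights 1:2:1 on a unit-capacity link receive rates 0.25, 0.5, and 0.25, which is the weighted max-min outcome the simulations in the paper converge to under heavy demand.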


IEEE International Conference on Computer Communications | 2006

Scheduling in Non-Blocking Buffered Three-Stage Switching Fabrics

Nikolaos Chrysos; Manolis Katevenis

Three-stage non-blocking switching fabrics are the next step in scaling current crossbar switches to many hundreds or a few thousand ports. Congestion (output contention) management is the central open problem: without it, performance suffers heavily under real-world traffic patterns. Centralized schedulers for bufferless crossbars manage output contention but do not scale to high valencies or to multi-stage fabrics. Distributed scheduling, as in buffered crossbars, is scalable but has never been scaled beyond crossbars. We combine ideas from centralized and from distributed schedulers, from request-grant protocols, and from credit-based flow control, to propose a novel, practical architecture for scheduling in non-blocking buffered switching fabrics. The new architecture relies on multiple, independent, single-resource schedulers, operating in a pipeline. It: (i) does not need internal speedup; (ii) directly operates on variable-size packets or multi-packet segments; (iii) isolates well-behaved from congested flows; (iv) provides delays that successfully compete against output queueing; (v) provides 95% or better throughput under unbalanced traffic; (vi) provides weighted max-min fairness; (vii) resequences cells or segments using very small buffers; (viii) can be realistically implemented for a 1024×1024 reference fabric made out of 32×32 buffered crossbar switch elements at 10 Gbps line rate. This paper carefully studies the many intricacies of the problem and the solution, discusses implementation, and provides performance simulation results.
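The central idea of multiple independent single-resource schedulers driven by a request-grant protocol can be sketched in a few lines. This is a hypothetical toy model (one FIFO grant queue per output, one grant per time slot), not the paper's pipelined, credit-based implementation:

```python
from collections import deque

class OutputScheduler:
    """Toy single-resource scheduler: each output independently grants
    at most one queued request per time slot, in FIFO order."""
    def __init__(self):
        self.requests = deque()

    def request(self, inp):
        self.requests.append(inp)

    def grant(self):
        return self.requests.popleft() if self.requests else None

def slot(voq_heads, schedulers):
    """One time slot: every input requests its head-of-line output,
    then each output issues one grant. At most one packet per output
    crosses the fabric, with no centralized scheduler involved."""
    for inp, out in voq_heads.items():
        schedulers[out].request(inp)
    return {out: s.grant() for out, s in schedulers.items()}
```

With inputs 0 and 1 both targeting output 0 and input 2 targeting output 1, a single slot grants output 0 to input 0 and output 1 to input 2; input 1's request stays queued for the next slot, which is how output contention is resolved without a central arbiter.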


Architectures for Networking and Communications Systems | 2007

Congestion management for non-blocking Clos networks

Nikolaos Chrysos

We propose a distributed congestion management scheme for non-blocking, 3-stage Clos networks, comprising plain buffered crossbar switches. VOQ requests are routed using multipath routing to the switching elements of the 3rd stage, and grants travel back to the linecards the other way around. The fabric elements contain independent single-resource schedulers that serve requests and grants in a pipeline. As any other network with limited capacity, this scheduling network may suffer from oversubscribed links, hotspot contention, etc., which we identify and tackle. We also reduce the cost of internal buffers, by reducing the data RTT, and by allowing sub-RTT crosspoint buffers. Performance simulations demonstrate that, with almost all outputs congested, packets destined to non-congested outputs experience very low delays (flow isolation). For applications requiring very low communication delays, we propose a second, parallel operation mode, wherein each linecard can eagerly forward a few packets, bypassing the request-grant latency overhead.


IEEE Journal on Selected Areas in Communications | 2007

Performance evaluation of the Data Vortex photonic switch

Ilias Iliadis; Nikolaos Chrysos; Cyriel Minkenberg

The Data Vortex photonic packet-switching architecture features an all-optical transparent data path, highly distributed control, low latency, and a high degree of scalability. These characteristics make it attractive as a routing fabric in future photonic packet switches. We analyze the performance of the Data Vortex architecture as a function of its height and angle dimensions, H and A. The investigation is based on two performance measures: the average delay and the maximum throughput of the switch. We present an analytical model assuming uniform traffic and derive closed-form expressions for these measures. Our results demonstrate that as H increases, the saturation throughput decreases and approaches 2/9 ≈ 0.22 when A is small and H is large. Furthermore, for fixed switch size, the saturation throughput is maximized when A is minimal. We also present simulation results for the maximum throughput under uniform and nonuniform traffic, as well as for the mean number of hops and the mean input-queue packet delay as a function of input load, and address the issue of resequencing delay. The results indicate that to support more ports, it is preferable to increase the height dimension and to keep the angle dimension as small as possible.


High-Performance Computer Architecture | 2015

SCOC: High-radix switches made of bufferless Clos networks

Nikolaos Chrysos; Cyriel Minkenberg; Mark Rudquist; Claude Basso; Brian T. Vanderpool

In today's datacenters handling big data, and for the exascale computers of tomorrow, there is a pressing need for high-radix switches to economically and efficiently unify the computing and storage resources that are dispersed across multiple racks. In this paper, we present SCOC, a switch architecture suitable for economical IC implementation that can efficiently replace crossbars for high-radix switch nodes. SCOC is a multi-stage bufferless network with O(N²/m) cost, where m is a design parameter, practically ranging between 4 and 16. We identify and resolve more than five fairness violations that are pertinent to hierarchical scheduling. Effectively, from a performance perspective, SCOC is indistinguishable from efficient flat crossbars. Computer simulations show that it competes well with or even outperforms flat crossbars and hierarchical switches. We report data from our 32 nm ASIC implementation of a 136×136 SCOC switch, with shallow buffers, connecting 25 Gb/s links. In this first incarnation, SCOC is used at the spines of a server-rack, fat-tree network. Internally, it runs at 9.9 Tb/s, thus offering a speedup of 1.45×, and provides a fall-through latency of just 61 ns.


High Performance Interconnects | 2012

Occupancy Sampling for Terabit CEE Switches

Fredy D. Neeser; Nikolaos Chrysos; Rolf Clauberg; Daniel Crisan; Mitchell Gusat; Cyriel Minkenberg; Kenneth M. Valk; Claude Basso

One consequential feature of Converged Enhanced Ethernet (CEE) is losslessness, achieved through L2 Priority Flow Control (PFC) and Quantized Congestion Notification (QCN). We focus on QCN and its effectiveness in identifying congestive flows in input-buffered CEE switches. QCN assumes an idealized, output-queued switch. However, as future switches scale to higher port counts and link speeds, purely output-queued or shared-memory architectures lead to excessive memory bandwidth requirements; moreover, PFC typically requires dedicated buffers per input. Our objective is to complement PFC's coarse per-port/priority granularity with QCN's per-flow control. By detecting buffer overload early, QCN can drastically reduce PFC's side effects. We install QCN congestion points (CPs) at input buffers with virtual output queues and demonstrate that arrival-based marking cannot correctly discriminate between culprits and victims. Our main contribution is occupancy sampling (QCN-OS), a novel, QCN-compatible marking scheme. We focus on random occupancy sampling, a practical method not requiring any per-flow state. For CPs with arbitrarily scheduled buffers, QCN-OS is shown to correctly identify congestive flows, improving buffer utilization, switch efficiency, and fairness.
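The intuition behind occupancy sampling can be shown with a toy model (our sketch, not the paper's hardware scheme): sampling a uniformly random packet from the buffer marks a flow with probability proportional to its share of the buffer occupancy, so a culprit that fills the buffer dominates the samples without any per-flow state.

```python
import random
from collections import Counter

def occupancy_sample(buffer):
    """Occupancy-sampling sketch: pick a uniformly random packet in
    the buffer and mark the flow that owns it. A flow is marked in
    proportion to its buffer occupancy -- no per-flow counters needed."""
    return random.choice(buffer)  # returns the flow id of the sampled packet

# toy buffer: culprit flow 'A' holds 90 of 100 packets, victim 'B' only 10
random.seed(0)
buffer = ['A'] * 90 + ['B'] * 10
hits = Counter(occupancy_sample(buffer) for _ in range(10_000))
# 'A' is sampled roughly 9x more often than 'B'
```

Arrival-based marking, by contrast, would mark whichever flow happens to arrive next, so a low-rate victim queued behind the culprit is marked almost as often, which is the discrimination failure the paper demonstrates.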


Proceedings of the 8th International Workshop on Interconnection Network Architecture | 2014

All routes to efficient datacenter fabrics

Nikolaos Chrysos; Fredy D. Neeser; Mitch Gusat; Cyriel Minkenberg; Wolfgang E. Denzel; Claude Basso

Performance-optimized datacenters (PoDs) require efficient PoD interconnects to deal with the increasing volumes of inter-server (east-west) traffic. To cope with these stringent traffic patterns, datacenter networks are abandoning the oversubscribed topologies of the past and moving towards full-bisection fat-tree fabrics. However, these fabrics typically employ either single-path or coarse-grained (flow-level) multi-path routing. In this paper, we use computer simulations and analysis to characterize the waste of bandwidth that is due to routing inefficiencies. Our analysis suggests that, under a randomly selected permutation, the expected throughputs of d-mod-k routing and of flow-level multi-path routing are close to 63% and 47%, respectively. Furthermore, nearly 30% of the flows are expected to undergo an unnecessary 3-fold slowdown. By contrast, packet-level multi-path routing consistently delivers full throughput to all flows, and proactively avoids internal hotspots, thus serving better the growing demands of inter-server (east-west) traffic.
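Expected-throughput figures of this kind arise from collision models. As a purely illustrative sketch (a simplified single-stage balls-into-bins model of our own, not the paper's fat-tree analysis), hashing N flows independently onto N equal-capacity links leaves an expected fraction 1 − (1 − 1/N)^N → 1 − 1/e ≈ 63% of the links busy, which caps the stage's aggregate throughput:

```python
import random

def busy_fraction(n, trials=2000, seed=1):
    """Balls-into-bins sketch: n unit-rate flows each hash
    independently to one of n equal-capacity links; the stage's
    aggregate throughput is limited by the fraction of links that
    carry at least one flow."""
    rng = random.Random(seed)
    total_busy = 0
    for _ in range(trials):
        links = set(rng.randrange(n) for _ in range(n))  # hit links
        total_busy += len(links)
    return total_busy / (n * trials)

# analytically: 1 - (1 - 1/n)**n, which tends to 1 - 1/e ≈ 0.632
```

Packet-level multi-path routing avoids this loss because every flow is spread over all links, so no link stays idle while another is overloaded.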


Architectures for Networking and Communications Systems | 2010

End-to-end congestion management for non-blocking multi-stage switching fabrics

Nikolaos Chrysos; Lydia Y. Chen; Cyriel Minkenberg; Christoforos Kachris; Manolis Katevenis

In Fig. 3, we depict the average delay of packets targeting non-hotspots, for various numbers of hotspots, each overloaded by 2.5×, under bursty arrivals with an average burst size of 36. Each request or grant may refer to up to 128 segments in a virtual-output queue (VOQ), thus reducing control overhead. The switching fabric assumed here is a 64×64, three-stage Clos, made of CIOQ switching chips.


Architectures for Networking and Communications Systems | 2014

Integration and QoS of multicast traffic in a server-rack fabric with 640 100G ports

Nikolaos Chrysos; Fredy D. Neeser; Brian T. Vanderpool; Mark Rudquist; Kenneth M. Valk; Todd Greenfield; Claude Basso

Flexible datacenters rely on high-bandwidth server-rack fabrics to allocate their distributed computing and storage resources anywhere, anyhow, and anytime demanded. We describe the multicast architecture of a distributed server-rack fabric, which is arranged around a spine-leaf topology and connects 640 Ethernet ports running at 100G. To cope with the immense fabric speed, we resort to hierarchical, tree-based replication, facilitated by specially commissioned fabric-end ports. At each (port-to-port) leg of the tree, a frame copy is forwarded after a request-grant admission phase and is ACKed by the receiver. To save on bandwidth, we use a packet cache in our input-queued switching nodes, which replicates asynchronously forwarded frames, thus tolerating the variable delay in the admission phase. Because the cache has limited size, we loosely synchronize the multicast subflows to protect the cache from thrashing. We describe our policies for lossy classes, which segregate and provide fair treatment to multicast subflows. Finally, we show that industry-standard Layer-2 congestion control does not adapt well to one-to-many flows, and demonstrate that the methods that we implement achieve the best performance.
