Eitan Zahavi
Mellanox Technologies
Network
Latest external collaboration on country level. Dive into details by clicking on the dots.
Publication
Featured researches published by Eitan Zahavi.
international conference on embedded computer systems architectures modeling and simulation | 2012
Yaniv Ben-Itzhak; Eitan Zahavi; Israel Cidon; Avinoam Kolodny
We present HNOCS (Heterogeneous Network-on-Chip Simulator), an open-source NoC simulator based on OMNeT++. To the best of our knowledge, HNOCS is the first simulator to support modeling of heterogeneous NoCs with variable link capacities and number of VCs per unidirectional port. The HNOCS simulation platform provides an open-source, modular, scalable, extendible and fully parameterizable framework for modeling NoCs. It includes three types of NoC routers: synchronous, synchronous virtual output queue (VoQ) and asynchronous. HNOCS provides a rich set of statistical measurements at the flit and packet levels: end-to-end latencies, throughput, VC acquisition latencies, transfer latencies, etc. We describe the architecture, structure, available models and the features that make HNOCS suitable for advanced NoC exploration. We also evaluate several case studies which cannot be evaluated with any other exiting NoC simulator.
Journal of Parallel and Distributed Computing | 2012
Eitan Zahavi
As the size of High Performance Computing clusters grows, so does the probability of interconnect hot spots that degrade the latency and effective bandwidth the network provides. This paper presents a solution to this scalability problem for real life constant bisectional-bandwidth fat-tree topologies. It is shown that maximal bandwidth and cut-through latency can be achieved for MPI global collective traffic. To form such a congestion-free configuration, MPI programs should utilize collective communication, MPI-node-order should be topology aware, and the packet routing should match the MPI communication patterns. First, we show that MPI collectives can be classified into unidirectional and bidirectional shifts. Using this property, we propose a scheme for congestion-free routing of the global collectives in fully and partially populated fat trees running a single job. The no-contention result is then obtained for multiple jobs running on the same fat-tree by applying some job size and placement restrictions. Simulation results of the proposed routing, MPI-node-order and communication patterns show no contention which provides a 40% throughput improvement over previously published results for all-to-all collectives.
networks on chips | 2011
Yaniv Ben-Itzhak; Eitan Zahavi; Israel Cidon; Avinoam Kolodny
As chip density keeps doubling every process generation, the use of Network-on-Chip becomes the prevalent architecture of SoC, MPSoC and large scale CMP designs. To that end, diverse NoC solutions are developed by the industry and the research community in order to meet heterogeneous on-chip communication requirements. Consequently, there is a growing need to rely on a simulation tools in order to explore, evaluate and optimize these new NoC architectures and topologies. The simulation platform is based on OMNeT++. It provides an open-source, modular, scalable, extendible and fully parameterizable framework for modeling NoC. In this demo we describe the structure of this framework.
ieee/acm international symposium cluster, cloud and grid computing | 2011
Ernst Gunnar Gran; Eitan Zahavi; Sven-Arne Reinemo; Tor Skeie; Gilad Shainer; Olav Lysne
In loss less interconnection networks such as InfiniBand, congestion control (CC) can be an effective mechanism to achieve high performance and good utilization of network resources. The InfiniBand standard describes CC functionality for detecting and resolving congestion, but the design decisions on how to implement this functionallity is left to the hardware designer. One must be cautious when making these design decisions not to introduce fairness problems, as our study shows. In this paper we study the relationship between congestion control, switch arbitration, and fairness. Specifically, we look at fairness among different traffic flows arriving at a hot spot switch on different input ports, as CC is turned on. In addition we study the fairness among traffic flows at a switch where some flows are exclusive users of their input ports while other flows are sharing an input port (the parking lot problem). Our results show that the implementation of congestion control in a switch is vulnerable to unfairness if care is not taken. In detail, we found that a threshold hysteresis of more than one MTU is needed to resolve arbitration unfairness. Furthermore, to fully solve the parking lot problem, proper configuration of the CC parameters are required.
architectures for networking and communications systems | 2012
Eitan Zahavi; Isaac Keslassy; Avinoam Kolodny
With the growing popularity of big-data applications, Data Center Networks increasingly carry larger and longer traffic flows. As a result of this increased flow granularity, static routing cannot efficiently load-balance traffic, resulting in an increased network contention and a reduced throughput. Unfortunately, while adaptive routing can solve this load-balancing problem, network designers refrain from using it, because it also creates out-of-order packet delivery that can significantly degrade the reliable transport performance of the longer flows. In this paper, we show that by throttling each flow bandwidth to half of the network link capacity, a distributed-adaptive-routing algorithm is able to converge to a non-blocking routing assignment within a few iterations, causing minimal out-of-order packet delivery. We present a Markov chain model for distributed-adaptive-routing in the context of Clos networks that provides an approximation for the expected convergence time. This model predicts that for full-link-bandwidth traffic, the convergence time is exponential with the network size, so out-of-order packet delivery is unavoidable for long messages. However, with half-rate traffic, the algorithm converges within a few iterations and exhibits weak dependency on the network size. Therefore, we show that distributed-adaptive-routing may be used to provide a scalable and non-blocking routing even for long flows on a rearrangeably-non-blocking Clos network under half-rate conditions. The proposed model is evaluated and approximately fits the abstract system simulation model. Hardware implementation guidelines are provided and evaluated using a detailed flit-level InfiniBand simulation model. These results directly apply to adaptive-routing systems designed and deployed in various fields.
IEEE Journal on Selected Areas in Communications | 2014
Eitan Zahavi; Isaac Keslassy; Avinoam Kolodny
With the growing popularity of big-data applications, Data Center Networks (DCN) increasingly carry larger and longer traffic flows. As a result of this increased flow granularity, static routing cannot efficiently load-balance traffic, resulting in an increased network contention and a reduced throughput. Unfortunately, while adaptive routing can solve this load-balancing problem, DCN designers refrain from using it, because it also creates out-of-order packet delivery that can significantly degrade the reliable transport performance of the longer flows. In this paper, we show that by throttling each flow bandwidth to half of the network link capacity, a distributed adaptive routing algorithm is able to converge to a non-blocking routing assignment within a few iterations, causing minimal out-of-order packet delivery. We present a Markov chain model for distributed adaptive routing in the context of Clos networks that provides an approximation for the expected convergence time. This model predicts that for full-link-bandwidth traffic, the convergence time is exponential with the network size, so out-of-order packet delivery is unavoidable for long messages. However, with half-rate traffic, the algorithm converges within a few iterations and exhibits weak dependency on the network size. Therefore, we show that distributed adaptive routing may be used to provide scalable and non-blocking routing even for long flows in a rearrangeably-non-blocking Clos network under half-rate conditions. The proposed model is evaluated and approximately fits the abstract system simulation model. Hardware implementation guidelines are provided and evaluated using a detailed flit-level InfiniBand simulation model. These results, providing fast convergence to non-blocking routing assignment, directly apply to adaptive-routing systems designed and deployed in various DCNs.
high performance interconnects | 2014
Eitan Zahavi; Isaac Keslassy; Avinoam Kolodny
High-Performance Computing (HPC) Clusters and Data Center Networks often rely on fat-tree topologies. However, fat trees and their known variants are not designed for concurrent small jobs. As a result, in recent years, HPC designers have introduced ad-hoc topologies to offer better performance for these concurrent small jobs. In this paper, we present and formally define these topologies, which we call Quasi Fat Trees (QFTs). Specifically, we formulate the graph structure of these new topologies, and show that they perform better for concurrent small jobs. Furthermore, we derive a closed-form and fault-resilient contention-free routing algorithm for all global shift permutations. This routing optimizes the run-time of large computing jobs that utilize MPI collectives. Finally, we verify the algorithm by running its implementation as an OpenSM routing engine on various sizes of QFT topologies, and show that it exhibits good performance.
international parallel and distributed processing symposium | 2012
Ernst Gunnar Gran; Sven-Arne Reinemo; Olav Lysne; Tor Skeie; Eitan Zahavi; Gilad Shainer
In a loss less interconnection network, network congestion needs to be detected and resolved to ensure high performance and good utilization of network resources at high network load. If no countermeasure is taken, congestion at a node in the network will stimulate the growth of a congestion tree that not only affects contributors to congestion, but also other traffic flows in the network. Left untouched, the congestion tree will block traffic flows, lead to underutilization of network resources and result in a severe drop in network performance. The InfiniBand standard specifies a congestion control (CC) mechanism to detect and resolve congestion before a congestion tree is able to grow and, by that, hamper the network performance. The InfiniBand CC mechanism includes a rich set of parameters that can be tuned in order to achieve effective CC. Even though it has been shown that the CC mechanism, properly tuned, is able to improve both throughput and fairness in an interconnection network, it has been questioned whether the mechanism is fast enough to keep up with dynamic network traffic, and if a given set of parameter values for a topology is robust when it comes to different traffic patterns, or if the parameters need to be tuned depending on the applications in use. In this paper we address both these questions. Using the three-stage fat-tree topology from the Sun Data center InfiniBand Switch 648 as a basis, and a simulator tuned against CC capable InfiniBand hardware, we conduct a systematic study of the efficiency of the InfiniBand CC mechanism as the network traffic becomes increasingly more dynamic. Our studies show that the InfiniBand CC, even when using a single set of parameter values, performs very well as the traffic patterns becomes increasingly more dynamic, outperforming a network without CC in all cases. Our results show throughput increases varying from a few percent, to a seventeen-fold increase.
2016 First International Workshop on Communication Optimizations in HPC (COMHPC) | 2016
Richard L. Graham; Devendar Bureddy; Pak Lui; Hal Rosenstock; Gilad Shainer; Gil Bloch; Dror Goldenerg; Mike Dubman; Sasha Kotchubievsky; Vladimir Koushnir; Lion Levi; Alex Margolin; Tamir Ronen; Alexander Shpiner; Oded Wertheim; Eitan Zahavi
Increased system size and a greater reliance on utilizing system parallelism to achieve computational needs, requires innovative system architectures to meet the simulation challenges. As a step towards a new network class of co-processors — intelligent network devices, which manipulate data traversing the data-center network, this paper describes the SHArP technology designed to offload collective operation processing to the network. This is implemented in Mellanoxs SwitchIB-2 ASIC, using innetwork trees to reduce data from a group of sources, and to distribute the result. Multiple parallel jobs with several partially overlapping groups are supported each with several reduction operations in-flight. Large performance enhancements are obtained, with an improvement of a factor of 2.1 for an eight byte MPI_Allreduce() operation on 128 hosts, going from 6.01 to 2.83 microseconds. Pipelining is used for an improvement of a factor of 3.24 in the latency of a 4096 byte MPI_Allreduce() operations, declining from 46.93 to 14.48 microseconds.
architectures for networking and communications systems | 2016
Eitan Zahavi; Alexander Shpiner; Ori Rottenstreich; Avinoam Kolodny; Isaac Keslassy
The most demanding tenants of shared clouds require complete isolation from their neighbors, in order to guarantee that their application performance is not affected by other tenants. Unfortunately, while shared clouds can offer an option whereby tenants obtain dedicated servers, they do not offer any network provisioning service, which would shield these tenants from network interference. In this paper, we introduce Links as a Service (LaaS), a new abstraction for cloud service that provides isolation of network links. Each tenant gets an exclusive set of links forming a virtual fat-tree, and is guaranteed to receive the exact same bandwidth and delay as if it were alone in the shared cloud. Consequently, each tenant can use the forwarding method that best fits its application. Under simple assumptions, we derive theoretical conditions for enabling LaaS without capacity over-provisioning in fat-trees. New tenants are only admitted in the network when they can be allocated hosts and links that maintain these conditions. LaaS is implementable with common network gear, tested to scale to large networks and provides full tenant isolation at the worst cost of a 10% reduction in the cloud utilization.