Gil Bloch
Mellanox Technologies
Publication
Featured research published by Gil Bloch.
grid computing | 2010
Richard L. Graham; Stephen W. Poole; Pavel Shamis; Gil Bloch; Noam Bloch; Hillel Chapman; Michael Kagan; Ariel Shahar; Ishai Rabinovitz; Gilad Shainer
This paper introduces the newly developed InfiniBand (IB) Management Queue capability, used by the Host Channel Adapter (HCA) to manage network task data flow dependencies and progress the communications associated with such flows. These tasks include sends, receives, and the newly supported wait task, and are scheduled by the HCA based on a data dependency description provided by the user. This functionality is supported by the ConnectX-2 HCA, and provides the means for delegating collective communication management and progress to the HCA, also known as collective communication offload. This provides a means for overlapping collective communications managed by the HCA with computation on the Central Processing Unit (CPU), thus making it possible to reduce the impact of system noise on parallel applications using collective operations. This paper further describes how this new capability can be used to implement scalable Message Passing Interface (MPI) collective operations, describing at a high level how it is used to implement the MPI Barrier collective operation and focusing on the latency-sensitive performance aspects of this new capability. This paper concludes with small-scale benchmark experiments comparing implementations of the barrier collective operation that use the new network offload capabilities with established point-to-point based implementations of the same algorithms, which manage the data flow using the central processing unit. These early results demonstrate the promise this new capability provides for improving the scalability of high-performance applications using collective communications. The latency of the HCA-based implementation of the barrier is similar to that of the best performing point-to-point based implementation managed by the central processing unit, and starts to outperform it as the number of processes involved in the collective operation increases.
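For illustration, a minimal sketch (not the authors' implementation) of the kind of CPU-managed, point-to-point barrier used as the baseline in this comparison. It assumes a power-of-two number of processes and exchanges zero-byte tokens in a recursive-doubling pattern, with all progress driven by the CPU:

    /* Hypothetical CPU-managed recursive-doubling barrier built on MPI
     * point-to-point calls; assumes a power-of-two communicator size. */
    #include <mpi.h>

    static void p2p_barrier(MPI_Comm comm)
    {
        int rank, size, send_token = 0, recv_token = 0;
        MPI_Comm_rank(comm, &rank);
        MPI_Comm_size(comm, &size);

        /* In round k, exchange a zero-byte token with the partner at
         * distance 2^k; after log2(size) rounds every process has
         * transitively heard from all others. */
        for (int mask = 1; mask < size; mask <<= 1) {
            int partner = rank ^ mask;
            MPI_Sendrecv(&send_token, 0, MPI_BYTE, partner, 0,
                         &recv_token, 0, MPI_BYTE, partner, 0,
                         comm, MPI_STATUS_IGNORE);
        }
    }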
ieee international symposium on parallel distributed processing workshops and phd forum | 2010
Richard L. Graham; Stephen W. Poole; Pavel Shamis; Gil Bloch; Noam Bloch; Hillel Chapman; Michael Kagan; Ariel Shahar; Ishai Rabinovitz; Gilad Shainer
This paper explores the computation and communication overlap capabilities enabled by the new CORE-Direct hardware capabilities introduced in the InfiniBand Network Interface Card (NIC) ConnectX-2. We use the latency-dominated nonblocking barrier algorithm in this study, and find that with 64 processes, a contiguous time slot of about 80% of the nonblocking barrier time is available for computation. This time slot increases as the number of participating processes increases. In contrast, Central Processing Unit (CPU) based implementations provide a time slot of up to 30% of the nonblocking barrier time. This bodes well for the scalability of simulations employing offloaded collective operations. These capabilities can be used to reduce the effects of system noise, and when using nonblocking collective operations may also be used to hide the effects of application load imbalance.
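A minimal usage sketch of the overlap pattern measured in this study, assuming the MPI_Ibarrier interface that was later standardized in MPI-3 (the paper itself evaluates an offloaded nonblocking barrier); the barrier progresses while the CPU runs application work, and the available overlap slot corresponds to the compute phase between the start and completion calls:

    #include <mpi.h>

    /* Placeholder for application computation filling the overlap slot. */
    static void do_local_work(void) { /* ... */ }

    static void overlapped_barrier(MPI_Comm comm)
    {
        MPI_Request req;
        MPI_Ibarrier(comm, &req);            /* start the nonblocking barrier */
        do_local_work();                     /* compute while it progresses   */
        MPI_Wait(&req, MPI_STATUS_IGNORE);   /* complete the barrier          */
    }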
2016 First International Workshop on Communication Optimizations in HPC (COMHPC) | 2016
Richard L. Graham; Devendar Bureddy; Pak Lui; Hal Rosenstock; Gilad Shainer; Gil Bloch; Dror Goldenberg; Mike Dubman; Sasha Kotchubievsky; Vladimir Koushnir; Lion Levi; Alex Margolin; Tamir Ronen; Alexander Shpiner; Oded Wertheim; Eitan Zahavi
Increased system size and greater reliance on system parallelism to meet computational needs require innovative system architectures to meet the simulation challenges. As a step towards a new class of network co-processors, intelligent network devices that manipulate data traversing the data-center network, this paper describes the SHArP technology designed to offload collective operation processing to the network. This is implemented in Mellanox's SwitchIB-2 ASIC, using in-network trees to reduce data from a group of sources and to distribute the result. Multiple parallel jobs with several partially overlapping groups are supported, each with several reduction operations in flight. Large performance enhancements are obtained, with an improvement by a factor of 2.1 for an eight-byte MPI_Allreduce() operation on 128 hosts, going from 6.01 to 2.83 microseconds. Pipelining yields an improvement by a factor of 3.24 in the latency of a 4096-byte MPI_Allreduce() operation, declining from 46.93 to 14.48 microseconds.
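For illustration, a minimal timing sketch of the eight-byte MPI_Allreduce() case benchmarked above; the SHArP offload is transparent at the application level, so this is an ordinary MPI program (the latency figures quoted come from the paper's 128-host runs, not from this sketch):

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        double in = 1.0, out = 0.0;          /* one double = 8 bytes */
        double t0 = MPI_Wtime();
        MPI_Allreduce(&in, &out, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
        double t1 = MPI_Wtime();
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        if (rank == 0)
            printf("8-byte allreduce: sum=%g, latency=%.2f us\n",
                   out, (t1 - t0) * 1e6);
        MPI_Finalize();
        return 0;
    }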
Archive | 2002
Dror Goldenberg; Gil Bloch; Gil Stoler; Diego Crupnicoff; Michael Kagan
Archive | 2002
Michael Kagan; Diego Crupnicoff; Ariel Shachar; Gil Bloch; Dafna Levenvirth
Archive | 2001
Michael Kagan; Diego Crupnicoff; Margarita Shnitman; Ariel Shachar; Ram Izhaki; Gilad Shainer; Aviram Gutman; Benny Koren; Dafna Levenvirth; Gil Bloch; Yael Shenhav
Archive | 2010
Noam Bloch; Gil Bloch; Ariel Shachar; Hillel Chapman; Ishai Rabinovitz; Pavel Shamis; Gilad Shainer
Archive | 2002
Michael Kagan; Diego Crupnicoff; Margarita Shnitman; Ariel Shachar; Dafna Levenvirth; Gil Bloch
Archive | 2004
Michael Kagan; Benny Koren; Dror Goldenberg; Gilad Shainer; Gil Bloch; Ariel Shachar; Ophir Turbovich; Dror Borer; Diego Crupnicoff
Archive | 2009
Michael Kagan; Gil Bloch; Diego Crupnicoff; Margarita Shnitman; Dafna Levenvirth