Gabor Dozsa
IBM
Publications
Featured research published by Gabor Dozsa.
International Conference on Supercomputing | 2008
Sameer Kumar; Gabor Dozsa; Gheorghe Almasi; Philip Heidelberger; Dong Chen; Mark E. Giampapa; Michael Blocksome; Ahmad Faraj; Jeffrey J. Parker; Joseph D. Ratterman; Brian E. Smith; Charles J. Archer
We present the architecture of the Deep Computing Messaging Framework (DCMF), a message passing runtime designed for the Blue Gene/P machine and other HPC architectures. DCMF has been designed to easily support several programming paradigms, such as the Message Passing Interface (MPI), Aggregate Remote Memory Copy Interface (ARMCI), Charm++, and others. This support is made possible because DCMF provides an application programming interface (API) with active messages and non-blocking collectives. DCMF is being open sourced and has a layered, component-based architecture with multiple levels of abstraction, allowing members of the community to contribute new components at the various layers. The DCMF runtime can be extended to other architectures through the development of architecture-specific implementations of its interface classes. The production DCMF runtime on Blue Gene/P takes advantage of the direct memory access (DMA) hardware to offload message passing work and achieve good overlap of computation and communication. We take advantage of the fact that the Blue Gene/P node is a symmetric multi-processor with four cache-coherent cores and use multi-threading to optimize performance on the collective network. We also present a performance evaluation of the DCMF runtime on Blue Gene/P and show that it delivers performance close to hardware limits.
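To make the active-message idea concrete, here is a minimal C++ sketch of the pattern DCMF exposes: an incoming message invokes a handler registered under a dispatch id instead of matching a pre-posted receive. The names (AmRegistry, register_handler, deliver) are hypothetical and do not reflect the actual DCMF C API.

```cpp
// Illustrative sketch of an active-message dispatch table; AmRegistry,
// register_handler, and deliver are hypothetical names, not the DCMF C API.
#include <cstddef>
#include <cstdio>
#include <functional>
#include <utility>
#include <vector>

// An active-message handler receives the payload plus a user-supplied cookie.
using AmHandler = std::function<void(const void* payload, std::size_t bytes, void* cookie)>;

struct AmRegistry {
    std::vector<std::pair<AmHandler, void*>> handlers;

    // Register a handler and return the dispatch id carried in each packet header.
    int register_handler(AmHandler h, void* cookie) {
        handlers.emplace_back(std::move(h), cookie);
        return static_cast<int>(handlers.size()) - 1;
    }

    // When a packet arrives, the runtime looks up the id and invokes the handler,
    // so the application never pre-posts a receive. Here arrival is simulated.
    void deliver(int id, const void* payload, std::size_t bytes) const {
        const auto& entry = handlers[static_cast<std::size_t>(id)];
        entry.first(payload, bytes, entry.second);
    }
};

int main() {
    AmRegistry reg;
    int counter = 0;
    int id = reg.register_handler(
        [](const void* p, std::size_t n, void* c) {
            std::printf("received %zu bytes: %.*s\n",
                        n, static_cast<int>(n), static_cast<const char*>(p));
            ++*static_cast<int*>(c);
        },
        &counter);

    const char msg[] = "hello";
    reg.deliver(id, msg, sizeof msg - 1);  // simulate one incoming active message
    std::printf("handler ran %d time(s)\n", counter);
    return 0;
}
```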
EuroMPI'10: Proceedings of the 17th European MPI Users' Group Meeting on Recent Advances in the Message Passing Interface | 2010
Gabor Dozsa; Sameer Kumar; Pavan Balaji; Darius Buntinas; David Goodell; William Gropp; Joe Ratterman; Rajeev Thakur
With the ever-increasing numbers of cores per node on HPC systems, applications are increasingly using threads to exploit the shared memory within a node, combined with MPI across nodes. Achieving high performance when a large number of concurrent threads make MPI calls is a challenging task for an MPI implementation. We describe the design and implementation of our solution in MPICH2 to achieve high-performance multithreaded communication on the IBM Blue Gene/P. We use a combination of a multichannel-enabled network interface, fine-grained locks, lock-free atomic operations, and specially designed queues to provide a high degree of concurrent access while still maintaining MPI's message-ordering semantics. We present performance results that demonstrate that our new design improves the multithreaded message rate by a factor of 3.6 compared with the existing implementation on the BG/P. Our solutions are also applicable to other high-end systems that have parallel network access capabilities.
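For context, a minimal sketch of the hybrid usage pattern this work targets: several threads of one MPI process issuing MPI calls concurrently under MPI_THREAD_MULTIPLE. This is plain MPI application code, not the MPICH2-internal changes the paper describes.

```cpp
// Hybrid MPI + threads sketch: each thread sends and receives concurrently.
// Requires an MPI library built with MPI_THREAD_MULTIPLE support.
#include <mpi.h>
#include <cstdio>
#include <thread>
#include <vector>

int main(int argc, char** argv) {
    int provided = 0;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
    if (provided < MPI_THREAD_MULTIPLE) {
        std::fprintf(stderr, "MPI_THREAD_MULTIPLE not available\n");
        MPI_Abort(MPI_COMM_WORLD, 1);
    }

    int rank = 0, size = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const int nthreads = 4;            // e.g. one thread per BG/P core
    std::vector<std::thread> threads;
    for (int t = 0; t < nthreads; ++t) {
        threads.emplace_back([=] {
            // Each thread uses its own tag so messages from different
            // threads of the same rank do not match each other.
            int peer = (rank + 1) % size;
            int sendbuf = rank * 100 + t, recvbuf = -1;
            MPI_Sendrecv(&sendbuf, 1, MPI_INT, peer, /*sendtag=*/t,
                         &recvbuf, 1, MPI_INT, MPI_ANY_SOURCE, /*recvtag=*/t,
                         MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            std::printf("rank %d thread %d received %d\n", rank, t, recvbuf);
        });
    }
    for (auto& th : threads) th.join();

    MPI_Finalize();
    return 0;
}
```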
International Conference on Supercomputing | 2008
Edi Shmueli; George S. Almasi; José R. Brunheroto; José G. Castaños; Gabor Dozsa; Sameer Kumar; Derek Lieber
The Blue Gene machines in production today run a small, single-user, single-process kernel (CNK) with limited functionality. Motivated by the desire to provide applications with a much richer operating environment, we evaluate the effect of replacing CNK with a standard Linux kernel on the compute nodes of Blue Gene/L. We show that with a relatively small amount of effort we were able to improve benchmark performance under Linux to a level comparable to CNK.
European PVM/MPI Users' Group Meeting on Recent Advances in Parallel Virtual Machine and Message Passing Interface | 2008
Sameer Kumar; Gabor Dozsa; Jeremy Berg; Bob Cernohous; Douglas Miller; Joseph D. Ratterman; Brian E. Smith; Philip Heidelberger
Different programming paradigms utilize a variety of collective communication operations, often with different semantics. We present the component collective messaging interface (CCMI), which supports asynchronous non-blocking collectives and is extensible to different programming paradigms and architectures. CCMI is designed with components written in the C++ programming language, allowing for reuse and extensibility. Collective algorithms are embodied in topological schedules and the executors that execute them. Portability across architectures is enabled by the multisend data movement component. CCMI includes a programming language adaptor used to implement different APIs with different semantics for different paradigms. We study the effectiveness of CCMI on Blue Gene/P and evaluate its performance for the barrier, broadcast, and allreduce collective operations. We also present the performance of the barrier collective on the Abe InfiniBand cluster.
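A minimal sketch of the schedule/executor split, with hypothetical class names rather than the actual CCMI component interfaces: the schedule encodes which ranks to contact in each phase of a binomial-tree broadcast, and the executor walks the phases and issues the sends (here only printed).

```cpp
// Sketch of a schedule/executor split for a binomial-tree broadcast.
// Class names and interfaces are illustrative, not the actual CCMI components.
#include <cstdio>
#include <vector>

// A schedule only encodes topology: which ranks to send to in each phase.
class BinomialBroadcastSchedule {
public:
    BinomialBroadcastSchedule(int rank, int nranks, int root)
        : rel_((rank - root + nranks) % nranks), nranks_(nranks), root_(root) {}

    // Destination ranks (in the real numbering) for a given phase.
    std::vector<int> destinations(int phase) const {
        std::vector<int> dests;
        int dist = 1 << phase;
        if (rel_ < dist) {                  // only ranks that already hold the data send
            int dst = rel_ + dist;
            if (dst < nranks_) dests.push_back((dst + root_) % nranks_);
        }
        return dests;
    }

    int num_phases() const {
        int p = 0;
        while ((1 << p) < nranks_) ++p;
        return p;
    }

private:
    int rel_, nranks_, root_;
};

// An executor walks the schedule and performs the data movement
// (printed here; a real runtime would hand this to a multisend component).
template <typename Schedule>
void execute_broadcast(const Schedule& sched, int myrank) {
    for (int phase = 0; phase < sched.num_phases(); ++phase)
        for (int dst : sched.destinations(phase))
            std::printf("phase %d: rank %d sends to rank %d\n", phase, myrank, dst);
}

int main() {
    const int nranks = 8, root = 0;
    for (int r = 0; r < nranks; ++r)
        execute_broadcast(BinomialBroadcastSchedule(r, nranks, root), r);
    return 0;
}
```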
IEEE International Symposium on Parallel & Distributed Processing, Workshops and PhD Forum | 2011
Amith R. Mamidala; Daniel Faraj; Sameer Kumar; Douglas Miller; Michael Blocksome; Thomas Gooding; Philip Heidelberger; Gabor Dozsa
The Blue Gene/P (BG/P) supercomputer consists of thousands of compute nodes interconnected by multiple networks. Among these, a 3D torus equipped with a direct memory access (DMA) engine is the primary network. BG/P also features a collective network that supports hardware-accelerated collective operations such as broadcast and allreduce. One of the operating modes on BG/P is virtual node mode, in which all four cores run as active MPI tasks performing inter-node and intra-node communication. This paper proposes software techniques to enhance the MPI collective communication primitives MPI_Bcast and MPI_Allreduce in virtual node mode by using the cache-coherent memory subsystem as the communication method within the node. The paper describes techniques leveraging atomic operations to design concurrent data structures, such as broadcast FIFOs, that enable efficient collectives. Such mechanisms are important as core counts are expected to rise, and these data structures make programming easier and more efficient. We also demonstrate the utility of shared address space techniques for MPI collectives, wherein a process can access a peer's memory via specialized system calls. Apart from cutting down copy costs, such techniques allow for seamless integration of network protocols with intra-node communication methods. We propose intra-node extensions to multi-color network algorithms for collectives using lightweight synchronizing structures and atomic operations. Further, we demonstrate that shared address techniques allow for good load balancing and are critical for efficiently using the hardware collective network on BG/P. Compared to current approaches on the 3D torus, our optimizations improve performance by up to almost 3x for MPI_Bcast and by 33% for MPI_Allreduce in virtual node mode. We also see improvements of up to 44% for MPI_Bcast using the collective tree network.
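A minimal sketch of a single-producer broadcast FIFO built on C++ atomics, illustrating the kind of concurrent data structure described above; the actual BG/P implementation relies on hardware-specific atomic primitives and a bounded ring buffer, so treat this only as an illustration of the publish/consume idea.

```cpp
// Sketch of an intra-node "broadcast FIFO": one producer publishes items,
// and every consumer thread observes each item exactly once. Illustrative only.
#include <atomic>
#include <cstdio>
#include <thread>
#include <vector>

constexpr int kItems = 16;

struct BroadcastFifo {
    int slots[kItems];                    // payload written by the producer
    std::atomic<int> published{0};        // number of slots made visible

    void publish(int value) {             // called by the single producer
        int idx = published.load(std::memory_order_relaxed);
        slots[idx] = value;               // write the payload first...
        published.store(idx + 1, std::memory_order_release);  // ...then publish it
    }

    // Each consumer keeps its own cursor, so every consumer sees every item.
    int wait_and_read(int cursor) const {
        while (published.load(std::memory_order_acquire) <= cursor)
            std::this_thread::yield();    // spin until the slot becomes visible
        return slots[cursor];
    }
};

int main() {
    BroadcastFifo fifo;
    const int nconsumers = 3;

    std::vector<std::thread> consumers;
    for (int c = 0; c < nconsumers; ++c) {
        consumers.emplace_back([&fifo, c] {
            long sum = 0;
            for (int i = 0; i < kItems; ++i) sum += fifo.wait_and_read(i);
            std::printf("consumer %d: sum of broadcast items = %ld\n", c, sum);
        });
    }

    for (int i = 0; i < kItems; ++i) fifo.publish(i);  // the root "broadcasts" items
    for (auto& t : consumers) t.join();
    return 0;
}
```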
International Conference on Cluster Computing | 2010
David Goodell; Pavan Balaji; Darius Buntinas; Gabor Dozsa; William Gropp; Sameer Kumar; Bronis R. de Supinski; Rajeev Thakur
With the ever-increasing numbers of cores per node in high-performance computing systems, a growing number of applications are using threads to exploit shared memory within a node and MPI across nodes. This hybrid programming model needs efficient support for multithreaded MPI communication. In this paper, we describe the optimization of one aspect of a multithreaded MPI implementation: concurrent accesses from multiple threads to various MPI objects, such as communicators, datatypes, and requests. The semantics of the creation, usage, and destruction of these objects implies, but does not strictly require, the use of reference counting to prevent memory leaks and premature object destruction. We demonstrate how a naive multithreaded implementation of MPI object management via reference counting incurs a significant performance penalty. We then detail two solutions that we have implemented in MPICH2 to mitigate this problem almost entirely, including one based on a novel garbage collection scheme. In our performance experiments, this new scheme improved the multithreaded messaging rate by up to 31% over the naive reference counting method.
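A minimal sketch of the naive baseline the paper measures: an object shared by several threads whose every use atomically increments and decrements a reference count, putting all threads in contention on one counter. The class and workload below are illustrative, not MPICH2 code.

```cpp
// Sketch of naive atomic reference counting for a shared object
// (the baseline whose overhead is measured). Illustrative only.
#include <atomic>
#include <cstdio>
#include <thread>
#include <vector>

struct RefCounted {
    std::atomic<int> refs{1};

    void add_ref() { refs.fetch_add(1, std::memory_order_relaxed); }

    // Returns true when the last reference is dropped and the object may be
    // destroyed; the release/acquire pair orders all prior accesses to it.
    bool release() {
        if (refs.fetch_sub(1, std::memory_order_release) == 1) {
            std::atomic_thread_fence(std::memory_order_acquire);
            return true;
        }
        return false;
    }
};

int main() {
    RefCounted* obj = new RefCounted();   // stands in for a communicator or datatype
    const int nthreads = 4, ops = 100000;

    std::vector<std::thread> threads;
    for (int t = 0; t < nthreads; ++t) {
        threads.emplace_back([=] {
            // Every "communication operation" takes and drops a reference,
            // so all threads contend on the same counter's cache line.
            for (int i = 0; i < ops; ++i) {
                obj->add_ref();
                obj->release();
            }
        });
    }
    for (auto& t : threads) t.join();

    if (obj->release()) delete obj;       // drop the initial reference held by main
    std::printf("final reference dropped, object destroyed\n");
    return 0;
}
```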
Lecture Notes in Computer Science | 2005
George S. Almasi; Gabor Dozsa; C. Christopher Erway; Burkhard Steinmacher-Burow
BlueGene/L currently holds the top position on the Top500 list [4]. In its full configuration the system will comprise 65,536 compute nodes. Application scalability is a crucial issue for a system of such size. On BlueGene/L, scalability is made possible through the efficient exploitation of its special communication networks. The BlueGene/L system software provides its own optimized versions of collective communication routines in addition to the general-purpose MPICH2 implementation. The collective network is a natural platform for reduction operations due to its built-in arithmetic units. Unfortunately, the ALUs of the collective network can handle only fixed-point operands. Therefore, exploiting that network efficiently for floating-point reductions is a challenging task. In this paper we present our experiences with implementing an efficient collective network algorithm for Allreduce sums of floating-point numbers.
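A minimal sketch of one way to build a floating-point sum from integer-only reductions, under the assumption (not necessarily the paper's exact algorithm) that the nodes first agree on a common exponent via an integer MAX reduction and then integer-sum mantissas shifted to that exponent; both reductions are simulated locally here.

```cpp
// Sketch of a two-pass floating-point sum built from fixed-point reductions,
// illustrating the idea behind Allreduce on an integer-only collective network.
#include <cmath>
#include <cstdint>
#include <cstdio>
#include <vector>

int main() {
    // One contribution per "node"; the two reductions below are simulated locally.
    std::vector<double> contrib = {3.25, -0.75, 1e-3, 42.0, -17.5};

    // Pass 1: an integer MAX reduction over each value's binary exponent.
    int max_exp = -10000;
    for (double v : contrib) {
        int e = 0;
        std::frexp(v, &e);                 // v = m * 2^e with 0.5 <= |m| < 1 (or v == 0)
        if (v != 0.0 && e > max_exp) max_exp = e;
    }

    // Pass 2: align every value to the common exponent, truncate to a 64-bit
    // integer, and do an integer SUM reduction.
    const int frac_bits = 52;              // mantissa bits kept after alignment
    int64_t fixed_sum = 0;
    for (double v : contrib) {
        double scaled = std::ldexp(v, frac_bits - max_exp);  // v * 2^(frac_bits - max_exp)
        fixed_sum += static_cast<int64_t>(std::llround(scaled));
    }

    // Convert the fixed-point result back to floating point.
    double result = std::ldexp(static_cast<double>(fixed_sum), max_exp - frac_bits);

    double reference = 0.0;
    for (double v : contrib) reference += v;
    std::printf("fixed-point allreduce sum = %.10f (reference %.10f)\n", result, reference);
    return 0;
}
```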
High Performance Computational Finance | 2008
Gabor Dozsa; Maria Eleftheriou; Todd Inglett; Alan J. King; Thomas E. Musta; James C. Sexton; Robert W. Wisniewski
Stream processing systems are designed to support applications that use real-time data. Examples of streaming applications include security agencies processing data from communications media, battlefield management systems for military operations, consumer fraud detection based on online transactions, and automated trading based on financial market data. Many stream processing applications face the challenge of increasingly large volumes of data and the requirement to deliver low-latency responses predicated on analysis of that data. In this paper, we assess the applicability of the Blue Gene architecture for stream computing applications. This work is part of a larger effort to demonstrate the efficacy of using Blue Gene for streaming applications. Blue Gene supercomputers provide a high-bandwidth, low-latency network connecting a set of I/O and compute nodes. We examine Blue Gene's suitability for stream computing applications by assessing its messaging capability for typical stream computing messaging workloads. In particular, this paper presents results from micro-benchmarks we used to evaluate the raw performance of the Blue Gene/P supercomputer under loads produced by high volumes of streaming data. We measure the performance of data streams that originate outside the supercomputer, are directed through the I/O nodes to the compute nodes, and then terminate outside. Our performance experiments demonstrate that the Blue Gene/P hardware delivers low-latency and high-throughput capability in a manner usable by streaming applications.
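A minimal sketch of a streaming throughput micro-benchmark of the general kind described above, reduced to a local pipe between two threads; the paper's measurements instead exercise the path from the external network through the BG/P I/O nodes to the compute nodes.

```cpp
// Sketch of a streaming throughput micro-benchmark: one thread pumps fixed-size
// messages through a pipe, another drains them, and the sustained bandwidth is
// reported. POSIX-only; a stand-in for measuring a real messaging path.
#include <chrono>
#include <cstdio>
#include <thread>
#include <unistd.h>
#include <vector>

int main() {
    constexpr size_t msg_size = 64 * 1024;   // 64 KiB "stream" messages
    constexpr int num_msgs = 4096;

    int fds[2];
    if (pipe(fds) != 0) { std::perror("pipe"); return 1; }

    auto start = std::chrono::steady_clock::now();

    std::thread producer([&] {
        std::vector<char> buf(msg_size, 'x');
        for (int i = 0; i < num_msgs; ++i) {
            size_t sent = 0;
            while (sent < msg_size) {        // write() may be partial
                ssize_t n = write(fds[1], buf.data() + sent, msg_size - sent);
                if (n < 0) { std::perror("write"); return; }
                sent += static_cast<size_t>(n);
            }
        }
        close(fds[1]);                       // signal end of stream
    });

    std::vector<char> buf(msg_size);
    size_t total = 0;
    ssize_t n = 0;
    while ((n = read(fds[0], buf.data(), buf.size())) > 0) total += static_cast<size_t>(n);
    producer.join();

    auto secs = std::chrono::duration<double>(std::chrono::steady_clock::now() - start).count();
    std::printf("moved %.1f MiB in %.3f s -> %.1f MiB/s\n",
                total / 1048576.0, secs, total / 1048576.0 / secs);
    return 0;
}
```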
Archive | 2007
Gheorghe Almasi; Gabor Dozsa; Sameer Kumar
Archive | 2012
Charles J. Archer; Gabor Dozsa; Joseph D. Ratterman; Brian E. Smith