Gheorghe Almasi
IBM
Publication
Featured research published by Gheorghe Almasi.
International Conference on Supercomputing | 2008
Sameer Kumar; Gabor Dozsa; Gheorghe Almasi; Philip Heidelberger; Dong Chen; Mark E. Giampapa; Michael Blocksome; Ahmad Faraj; Jeffrey J. Parker; Joseph D. Ratterman; Brian E. Smith; Charles J. Archer
We present the architecture of the Deep Computing Messaging Framework (DCMF), a message-passing runtime designed for the Blue Gene/P machine and other HPC architectures. DCMF has been designed to easily support several programming paradigms, such as the Message Passing Interface (MPI), Aggregate Remote Memory Copy Interface (ARMCI), Charm++, and others. This support is made possible because DCMF provides an application programming interface (API) with active messages and non-blocking collectives. DCMF is being open-sourced and has a layered, component-based architecture with multiple levels of abstraction, allowing members of the community to contribute new components to its design at the various layers. The DCMF runtime can be extended to other architectures through the development of architecture-specific implementations of interface classes. The production DCMF runtime on Blue Gene/P takes advantage of the direct memory access (DMA) hardware to offload message-passing work and achieve good overlap of computation and communication. We take advantage of the fact that the Blue Gene/P node is a symmetric multi-processor with four cache-coherent cores and use multi-threading to optimize performance on the collective network. We also present a performance evaluation of the DCMF runtime on Blue Gene/P and show that it delivers performance close to hardware limits.
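To make the active-message idea concrete, here is a minimal C++ sketch of the pattern the abstract describes: the receiver registers a handler, and the sender fires a message that invokes it. All names (Endpoint, am_register, am_send) are hypothetical illustrations, not the actual DCMF API, and delivery is simulated in-process.

// Hypothetical sketch of an active-message interface in the style DCMF
// describes (names are illustrative, not the DCMF API). A registered
// handler runs on the receiver when the message arrives.
#include <cstddef>
#include <cstdio>

using Handler = void (*)(const void* payload, std::size_t bytes);

struct Endpoint {
    Handler handlers[16] = {};
    void am_register(int id, Handler h) { handlers[id] = h; }
    // A real runtime would enqueue a DMA descriptor and return
    // immediately; here we "deliver" locally to show the semantics.
    void am_send(Endpoint& dest, int id, const void* buf, std::size_t bytes) {
        dest.handlers[id](buf, bytes);
    }
};

void on_hello(const void* payload, std::size_t bytes) {
    std::printf("handler ran on %zu bytes: %s\n", bytes,
                static_cast<const char*>(payload));
}

int main() {
    Endpoint a, b;
    b.am_register(0, on_hello);        // receiver installs a handler
    const char msg[] = "hi from rank 0";
    a.am_send(b, 0, msg, sizeof msg);  // sender fires an active message
}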
ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming | 2006
Ganesh Bikshandi; Jia Guo; Daniel Hoeflinger; Gheorghe Almasi; Basilio B. Fraguela; María Jesús Garzarán; David A. Padua; Christoph von Praun
Tiling has proven to be an effective mechanism for developing high-performance implementations of algorithms. Tiling can be used to organize computations so that communication costs in parallel programs are reduced and locality in sequential codes, or in the sequential components of parallel programs, is enhanced. In this paper, we introduce a data type, Hierarchically Tiled Arrays (HTAs), that facilitates the direct manipulation of tiles. HTA operations are overloaded array operations. We argue that the implementation of HTAs in sequential OO languages transforms these languages into powerful tools for the development of high-performance parallel codes and codes with a high degree of locality. To support this claim, we discuss our experiences with the implementation of HTAs for MATLAB and C++ and the rewriting of the NAS benchmarks and a few other programs into HTA-based parallel form.
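A minimal C++ sketch of the overloaded-operator idea (not the actual HTA library interface): the array is stored tile by tile, and an overloaded + applies element-wise addition per tile, which is where a real HTA implementation would insert parallelism and locality-aware scheduling.

// Minimal sketch of the HTA idea: tiled storage plus overloaded
// operators, so parallel code reads like ordinary array code.
#include <cstdio>
#include <vector>

struct TiledArray {
    std::vector<std::vector<double>> tiles;  // one vector per tile

    TiledArray operator+(const TiledArray& rhs) const {
        TiledArray out = *this;
        for (size_t t = 0; t < tiles.size(); ++t)         // per-tile loop:
            for (size_t i = 0; i < tiles[t].size(); ++i)  // a real HTA would
                out.tiles[t][i] += rhs.tiles[t][i];       // distribute tiles
        return out;                                       // to their owners
    }
};

int main() {
    TiledArray a{{{1, 2}, {3, 4}}}, b{{{10, 20}, {30, 40}}};
    TiledArray c = a + b;  // reads like ordinary array code
    std::printf("%g %g\n", c.tiles[0][0], c.tiles[1][1]);  // 11 44
}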
International Parallel and Distributed Processing Symposium | 2006
Sameer Kumar; Chao Huang; Gheorghe Almasi; Laxmikant V. Kalé
NAMD is a scalable molecular dynamics application that has demonstrated its performance on several parallel computer architectures. Strong scaling is necessary for molecular dynamics, as the problem size is fixed and a large number of iterations must be executed to understand interesting biological phenomena. The Blue Gene/L machine is a massive source of compute power, consisting of tens of thousands of embedded PowerPC 440 processors. In this paper, we present several techniques to scale NAMD to 8192 processors of Blue Gene/L. These include topology-specific optimizations, new messaging protocols, load balancing, and overlap of computation and communication. We were able to achieve 1.2 TF of peak performance for cutoff simulations and 0.99 TF with PME.
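As a rough arithmetic check not stated in the abstract (assuming the commonly cited Blue Gene/L peak of 2.8 GFLOPS per PowerPC 440 core):

\[
8192 \times 2.8\,\mathrm{GFLOPS} \approx 22.9\,\mathrm{TFLOPS}\ \text{aggregate peak}, \qquad \frac{1.2\,\mathrm{TF}}{22.9\,\mathrm{TF}} \approx 5\%\ \text{of peak},
\]

a typical fraction for communication-bound molecular dynamics at this scale.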
Proceedings of the 7th Workshop on Languages, Compilers, and Run-time Support for Scalable Systems | 2004
Basilio B. Fraguela; Jia Guo; Ganesh Bikshandi; María Jesús Garzarán; Gheorghe Almasi; José E. Moreira; David A. Padua
In this paper, we show our initial experience with a class of objects, called Hierarchically Tiled Arrays (HTAs), that encapsulate parallelism. HTAs allow the construction of single-threaded parallel programs where a master process distributes tasks to be executed by a collection of servers holding the components (tiles) of the HTAs. The tiled and recursive nature of HTAs facilitates the adaptation of the programs that use them to varying machine configurations, and eases the mapping of data and tasks to parallel computers with a hierarchical organization. We have implemented HTAs as a MATLAB™ toolbox, overloading conventional operators and array functions such that HTA operations appear to the programmer as extensions of MATLAB™. Our experiments show that the resulting environment is ideal for the prototyping of parallel algorithms and greatly improves the ease of development of parallel programs while providing reasonable performance.
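The master/server execution model can be illustrated with a short C++ sketch (illustrative only; the actual toolbox dispatches overloaded MATLAB™ operations to server processes holding the tiles, not to C++ threads): the master issues one logical operation, and the runtime farms out one task per tile and gathers the results.

// Sketch of the single-threaded-looking master/server model: the
// master writes one logical statement; the runtime maps it over tiles.
#include <cstdio>
#include <future>
#include <numeric>
#include <vector>

int main() {
    std::vector<std::vector<int>> tiles = {{1, 2, 3}, {4, 5, 6}, {7, 8, 9}};
    std::vector<std::future<int>> pending;

    for (auto& tile : tiles)  // master farms out one task per tile "server"
        pending.push_back(std::async(std::launch::async, [&tile] {
            return std::accumulate(tile.begin(), tile.end(), 0);
        }));

    int total = 0;
    for (auto& f : pending) total += f.get();  // gather partial results
    std::printf("sum over all tiles = %d\n", total);  // 45
}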
High Performance Distributed Computing | 2011
Enric Tejedor; Montse Farreras; David Grove; Rosa M. Badia; Gheorghe Almasi; Jesús Labarta
Programming for large-scale, multicore-based architectures requires adequate tools that offer ease of programming while not hindering application performance. StarSs is a family of parallel programming models based on automatic function-level parallelism that targets productivity. StarSs deploys a data-flow model: it analyzes dependencies between tasks and manages their execution, exploiting their concurrency as much as possible. We introduce Cluster Superscalar (ClusterSs), a new StarSs member designed to execute on clusters of SMPs. ClusterSs tasks are created asynchronously and assigned to the available resources with the support of the IBM APGAS runtime, which provides an efficient and portable communication layer based on one-sided communication. This short paper gives an overview of the ClusterSs design on top of APGAS, as well as the conclusions of a productivity study in which ClusterSs was compared to the IBM X10 language in terms of both programmability and performance; a technical report with the details is available.
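A toy C++ illustration of the StarSs-style data-flow idea (not the ClusterSs API): each task declares what it reads and writes, and a scheduler releases a task only once its inputs have been produced; a real runtime would execute the independent tasks concurrently on cluster resources.

// Toy data-flow scheduler: tasks name their reads/writes, and a task
// runs only when every input it reads has been produced.
#include <cstdio>
#include <functional>
#include <map>
#include <string>
#include <vector>

struct Task {
    std::vector<std::string> reads, writes;
    std::function<void()> body;
};

int main() {
    std::map<std::string, bool> ready = {{"A", true}};  // A exists up front
    std::vector<Task> tasks = {
        {{"A"}, {"B"}, [] { std::puts("B = f(A)"); }},
        {{"A"}, {"C"}, [] { std::puts("C = g(A)"); }},  // independent of B
        {{"B", "C"}, {"D"}, [] { std::puts("D = h(B, C)"); }},
    };
    // Trivial scheduler: repeatedly run any task whose inputs are ready.
    // A real runtime would run the independent tasks (B, C) concurrently.
    std::vector<bool> done(tasks.size(), false);
    for (bool progress = true; progress;) {
        progress = false;
        for (size_t i = 0; i < tasks.size(); ++i) {
            if (done[i]) continue;
            bool ok = true;
            for (auto& r : tasks[i].reads) ok = ok && ready[r];
            if (!ok) continue;
            tasks[i].body();
            for (auto& w : tasks[i].writes) ready[w] = true;
            done[i] = progress = true;
        }
    }
}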
Proceedings of the Third Conference on Partitioned Global Address Space Programming Models | 2009
Guojing Cong; Gheorghe Almasi; Vijay A. Saraswat
Irregular graph algorithms for distributed-memory systems are hard to implement and optimize. Recent developments in PGAS languages make the implementation of irregular algorithms easier. In this paper we present our study of a PRAM-based parallel connected-components algorithm implemented in UPC for distributed-memory systems, and discuss optimization techniques for such settings. Our optimized version achieved a speedup of more than 100 times over the straightforward implementation. Remarkable speedups are also achieved over the best SMP implementation for the same input. As the memory access patterns of these algorithms are representative of those of many other PRAM algorithms, we expect our techniques to be applicable to optimizing a wide range of PRAM graph algorithms on distributed-memory machines.
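For readers unfamiliar with the approach, the following serial C++ sketch shows the hook-and-compress scheme (in the style of Shiloach-Vishkin) that underlies many PRAM connected-components algorithms; the paper's UPC version distributes the parent array across nodes and optimizes the resulting remote accesses.

// Serial sketch of hook-and-compress connected components; each pass
// hooks roots together over the edges, then flattens trees by
// pointer jumping until no edge joins two different components.
#include <algorithm>
#include <cstdio>
#include <vector>

int main() {
    int n = 6;
    std::vector<std::pair<int, int>> edges = {{0, 1}, {1, 2}, {3, 4}};
    std::vector<int> parent(n);
    for (int v = 0; v < n; ++v) parent[v] = v;  // each vertex its own root

    for (bool changed = true; changed;) {
        changed = false;
        for (auto [u, v] : edges) {              // hook: point the larger
            int ru = parent[u], rv = parent[v];  // root at the smaller one
            if (ru != rv) {
                parent[std::max(ru, rv)] = std::min(ru, rv);
                changed = true;
            }
        }
        for (int v = 0; v < n; ++v)              // compress: pointer jumping
            while (parent[v] != parent[parent[v]])
                parent[v] = parent[parent[v]];
    }
    for (int v = 0; v < n; ++v)
        std::printf("vertex %d -> component %d\n", v, parent[v]);
}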
International Conference on Supercomputing | 2012
Gabriel Tanase; Gheorghe Almasi; Hanhong Xue; Charles J. Archer
The Power7 IH (P7IH) is one of IBM's latest-generation supercomputers. Like most modern parallel machines, it has a hierarchical organization consisting of simultaneous multithreading (SMT) within a core, multiple cores per processor, multiple processors per node (SMP), and multiple SMPs per cluster. A low-latency, high-bandwidth network with specialized accelerators is used to interconnect the SMP nodes. System software is tuned to exploit the hierarchical organization of the machine. In this paper we present a novel set of collective operations that take advantage of the P7IH hardware. We discuss non-blocking collective operations implemented using point-to-point messages, shared memory, and accelerator hardware. We show how collectives can be composed to exploit the hierarchical organization of the P7IH to provide low-latency, high-bandwidth operations. We demonstrate the scalability of the collectives we designed with experimental results on a P7IH system with up to 4096 cores.
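The composition idea can be sketched in a few lines of C++ (structure illustrative only; the real implementation uses the P7IH network accelerators and shared-memory transports): reduce within each SMP node first, then combine the per-node partial results across nodes.

// Sketch of a hierarchical reduction: an on-node phase over shared
// memory, then an inter-node phase over the network.
#include <cstdio>
#include <numeric>
#include <vector>

double hierarchical_sum(const std::vector<std::vector<double>>& nodes) {
    std::vector<double> per_node;
    for (auto& ranks : nodes)                 // phase 1: on-node reduce
        per_node.push_back(std::accumulate(ranks.begin(), ranks.end(), 0.0));
    return std::accumulate(per_node.begin(),  // phase 2: across nodes
                           per_node.end(), 0.0);
}

int main() {
    // two SMP nodes, four ranks each
    std::vector<std::vector<double>> nodes = {{1, 2, 3, 4}, {5, 6, 7, 8}};
    std::printf("global sum = %g\n", hierarchical_sum(nodes));  // 36
}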
Concurrency and Computation: Practice and Experience | 2012
Enric Tejedor; Montse Farreras; David Grove; Rosa M. Badia; Gheorghe Almasi; Jesús Labarta
Programming for large-scale, multicore-based architectures requires adequate tools that offer ease of programming and do not hinder application performance. StarSs is a family of parallel programming models based on automatic function-level parallelism that targets productivity. StarSs deploys a data-flow model: it analyzes dependencies between tasks and manages their execution, exploiting their concurrency as much as possible.
Languages and Compilers for Parallel Computing | 2014
Barnaby Dalton; Gabriel Tanase; Michail Alvanos; Gheorghe Almasi; Ettore Tiotto
Partitioned Global Address Space (PGAS) languages are a popular alternative when building applications to run on large-scale parallel machines. Unified Parallel C (UPC) is a well-known PGAS language that is available on most high-performance computing systems. Good performance of UPC applications is often an important requirement for a system acquisition. This paper presents the memory management techniques employed by the IBM XL UPC compiler to achieve optimal performance on systems with Remote Direct Memory Access (RDMA). Additionally, we describe a novel technique employed by the UPC runtime that transforms remote memory accesses on the same shared-memory node into local memory accesses, further improving performance. We evaluate the proposed memory allocation policies for various UPC benchmarks on the IBM® Power® 775 supercomputer [1].
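A hypothetical C++ sketch of the remote-to-local transformation (types and names are ours, not the XL UPC runtime's): a global pointer carries its owning node, and when the owner is the caller's own node the runtime bypasses the communication path and performs a plain local load.

// Sketch of the same-node fast path: a "global" pointer is checked
// against the caller's node before any communication is issued.
#include <cstdio>

struct GlobalPtr { int node; double* local_addr; };

double load(const GlobalPtr& gp, int my_node) {
    if (gp.node == my_node)     // same shared-memory node:
        return *gp.local_addr;  // plain load, no RDMA involved
    // otherwise issue a one-sided get (stubbed out in this sketch)
    std::printf("rdma_get from node %d\n", gp.node);
    return 0.0;
}

int main() {
    double x = 42.0;
    GlobalPtr gp{0, &x};
    std::printf("%g\n", load(gp, /*my_node=*/0));  // local fast path: 42
    load(gp, /*my_node=*/1);                       // remote path (stub)
}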
Concurrency and Computation: Practice and Experience | 1993
Gheorghe Almasi; D. Hale; T. McLuckie; Jean Luc Bell; A. Gordon
We report significant speed-up for seismic migration running in parallel on network-connected IBM RISC/6000 workstations. A sustained performance of 15 MFLOPS is obtained on a single entry-level Model 320, and speed-ups as high as 5 are obtained for six workstations connected by Ethernet or token ring. Our parallel software uses remote procedure calls provided by NCS (Network Computing System). We have run over a dozen workstations in parallel, but speed-ups become limited by the network data rate. Fiber-optic communication should allow much greater speed-ups, and we describe our preliminary results with the fiber-optic serial link adapter of the RISC/6000. We also present a simple theoretical model that agrees well with our measurements and allows speed-up to be predicted from knowledge of the ratio of computation to communication, which can be determined empirically before the program is parallelized. We conclude with a brief discussion of alternative software approaches and programming models for network-connected parallel systems. In particular, our program was recently ported to PVM and Linda, and preliminary measurements yield speed-ups very close to those described here.
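A minimal model consistent with this description (the paper's exact formula may differ): if r is the empirically measured ratio of computation time to communication time and communication is not overlapped with computation, then

\[
S(p) = \frac{T_{\mathrm{comp}} + T_{\mathrm{comm}}}{T_{\mathrm{comp}}/p + T_{\mathrm{comm}}} = \frac{p\,(r+1)}{r+p}, \qquad r = \frac{T_{\mathrm{comp}}}{T_{\mathrm{comm}}}.
\]

For illustration, an assumed r = 30 gives S(6) = 6 x 31 / 36, or about 5.2, in line with the speed-ups of about 5 reported above for six workstations.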