Jelena Pješivac-Grbović

Publication

Featured research published by Jelena Pješivac-Grbović.

international parallel and distributed processing symposium | 2005

Performance analysis of MPI collective operations

Jelena Pješivac-Grbović; Thara Angskun; George Bosilca; Graham E. Fagg; Edgar Gabriel; Jack J. Dongarra

Previous studies of application usage show that the performance of collective communications is critical for high-performance computing, yet it is often overlooked in favor of point-to-point performance. In this paper, we analyze and attempt to improve intra-cluster collective communication in the context of the widely deployed MPI programming paradigm by extending accepted models of point-to-point communication, such as Hockney, LogP/LogGP, and PLogP. The predictions from the models were compared to the experimentally gathered data, and our findings were used to optimize the implementation of collective operations in the FT-MPI library.
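As a rough illustration of how a point-to-point model can be extended to a collective, the sketch below applies the Hockney model (time = latency + message size / bandwidth) to two textbook broadcast algorithms. The cost formulas are the standard simplified ones, not necessarily the exact expressions derived in the paper, and the function names and the parameters `alpha` (latency) and `beta` (time per byte) are assumptions made for this example.

```python
def hockney_p2p(m_bytes, alpha, beta):
    """Hockney model: time to send one m-byte message point to point."""
    return alpha + beta * m_bytes


def linear_bcast(p, m_bytes, alpha, beta):
    """Flat-tree broadcast: the root sends to each of the other p - 1 ranks in turn."""
    return (p - 1) * hockney_p2p(m_bytes, alpha, beta)


def binomial_bcast(p, m_bytes, alpha, beta):
    """Binomial-tree broadcast: ceil(log2(p)) rounds, one full message per round."""
    rounds = (p - 1).bit_length()  # integer ceil(log2(p)) for p >= 1
    return rounds * hockney_p2p(m_bytes, alpha, beta)


def predicted_best(p, m_bytes, alpha, beta):
    """Return the name of the algorithm with the lower predicted time."""
    costs = {
        "linear": linear_bcast(p, m_bytes, alpha, beta),
        "binomial": binomial_bcast(p, m_bytes, alpha, beta),
    }
    return min(costs, key=costs.get)
```

Comparing such predictions against measured timings is the kind of check the abstract describes. Under this simplified model the binomial tree is predicted to win for any communicator with more than three ranks, which is one reason richer models such as LogP/LogGP and PLogP are also considered.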

ieee international conference on high performance computing data and analytics | 2005

Process Fault Tolerance: Semantics, Design and Applications for High Performance Computing

Graham E. Fagg; Edgar Gabriel; Zizhong Chen; Thara Angskun; George Bosilca; Jelena Pješivac-Grbović; Jack J. Dongarra

With increasing numbers of processors on current machines, the probability of node or link failures is also increasing. Therefore, application-level fault tolerance is becoming a more important issue for both end-users and the institutions running the machines. In this paper we present the semantics of a fault-tolerant version of the message passing interface (MPI), the de facto standard for communication in scientific applications, which gives applications the possibility to recover from a node or link error and continue execution in a well-defined way. We present the architecture of fault-tolerant MPI, an implementation of MPI using the semantics presented above, as well as benchmark results with various applications. An example of a fault-tolerant parallel equation solver, performance results, and the time for recovering from a process failure are also detailed.

parallel computing | 2007

MPI collective algorithm selection and quadtree encoding

Jelena Pješivac-Grbović; George Bosilca; Graham E. Fagg; Thara Angskun; Jack J. Dongarra

We explore the applicability of the quadtree encoding method to the run-time MPI collective algorithm selection problem. Measured algorithm performance data was used to construct quadtrees with different properties. The quality and performance of generated decision functions and in-memory decision systems were evaluated. Experimental data shows that in some cases, a decision function based on a quadtree structure with a mean depth of three incurs on average as little as a 5% performance penalty. In all cases, experimental data can be fully represented with a quadtree containing a maximum of six levels. Our results indicate that quadtrees may be a feasible choice for both processing of the performance data and automatic decision function generation.
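The idea of encoding a decision map as a quadtree can be sketched as follows. The sketch assumes the measured data has already been reduced to a square grid (side length a power of two) whose cells hold the name of the fastest algorithm for a given communicator-size and message-size bucket. The grid layout, the majority-vote rule at the depth limit, and all names are assumptions made for illustration, not the paper's implementation.

```python
from collections import Counter


def build_quadtree(grid, x, y, size, max_depth, depth=0):
    """Encode the size-by-size region at (x, y); leaves are algorithm names."""
    cells = [grid[y + j][x + i] for j in range(size) for i in range(size)]
    counts = Counter(cells)
    majority = counts.most_common(1)[0][0]
    # Stop when the region is uniform, cannot be split, or the depth limit is hit.
    if len(counts) == 1 or size == 1 or depth == max_depth:
        return majority
    h = size // 2
    return [
        build_quadtree(grid, x, y, h, max_depth, depth + 1),          # top-left
        build_quadtree(grid, x + h, y, h, max_depth, depth + 1),      # top-right
        build_quadtree(grid, x, y + h, h, max_depth, depth + 1),      # bottom-left
        build_quadtree(grid, x + h, y + h, h, max_depth, depth + 1),  # bottom-right
    ]


def lookup(tree, x, y, size):
    """Decision function: walk from the root down to the leaf covering (x, y)."""
    node = tree
    while isinstance(node, list):
        size //= 2
        index = (1 if x >= size else 0) + (2 if y >= size else 0)
        x, y = x % size, y % size
        node = node[index]
    return node


def mismatches(grid, tree):
    """Count grid cells where the tree's answer differs from the measured best."""
    n = len(grid)
    return sum(lookup(tree, x, y, n) != grid[y][x] for y in range(n) for x in range(n))
```

Limiting `max_depth` trades accuracy for a smaller tree. Note that the paper evaluates this trade-off as a performance penalty, whereas `mismatches` only counts wrong choices.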

IBM Journal of Research and Development | 2006

Self-adapting numerical software (SANS) effort

Jack J. Dongarra; George Bosilca; Zizhong Chen; Victor Eijkhout; Graham E. Fagg; Erika Fuentes; Julien Langou; Piotr Luszczek; Jelena Pješivac-Grbović; Keith Seymour; Haihang You; Sathish S. Vadhiyar

The challenge for the development of next-generation software is the successful management of the complex computational environment while delivering to the scientist the full power of flexible compositions of the available algorithmic alternatives. Self-adapting numerical software (SANS) systems are intended to meet this significant challenge. The process of arriving at an efficient numerical solution of problems in computational science involves numerous decisions by a numerical expert. Attempts to automate such decisions distinguish three levels: algorithmic decision, management of the parallel environment, and processor-specific tuning of kernels. Additionally, at any of these levels we can decide to rearrange the user's data. In this paper we look at a number of efforts at the University of Tennessee to investigate these areas.

Future Generation Computer Systems | 2010

Self-healing network for scalable fault-tolerant runtime environments

Thara Angskun; Graham E. Fagg; George Bosilca; Jelena Pješivac-Grbović; Jack J. Dongarra

The number of processors embedded on high performance computing platforms is growing daily to satisfy users' desire to solve larger and more complex problems. Scalable and fault-tolerant runtime environments are needed to support and adapt to the underlying libraries and hardware, which require a high degree of scalability in dynamic large-scale environments. This paper presents a self-healing network (SHN) for supporting scalable and fault-tolerant runtime environments. The SHN is designed to support transmission of messages across multiple nodes while also protecting against recursive node and process failures. It automatically recovers after a failure occurs. SHN is implemented on top of a scalable fault-tolerant protocol (SFTP). The experimental results show that both the latest multicast and broadcast routing algorithms used in SHN are faster and more reliable than the original SFTP routing algorithms.

european pvm/mpi users group meeting on recent advances in parallel virtual machine and message passing interface | 2005

Scalable fault tolerant MPI: extending the recovery algorithm

Graham E. Fagg; Thara Angskun; George Bosilca; Jelena Pješivac-Grbović; Jack J. Dongarra

Fault Tolerant MPI (FT-MPI) [6] was designed as a solution to allow applications different methods to handle process failures beyond simple checkpoint-restart schemes. The initial implementation of FT-MPI included a robust, heavyweight system state recovery algorithm that was designed to manage the membership of MPI communicators during multiple failures. The algorithm and its implementation, although robust, were very conservative, and this affected scalability both on very large clusters and on distributed systems. This paper details the FT-MPI recovery algorithm and our initial experiments with new recovery algorithms that are aimed at being both scalable and latency tolerant. Our conclusions show that the use of topology-aware collective communication together with distributed consensus algorithms produces the best results.

european conference on parallel processing | 2007

Decision trees and MPI collective algorithm selection problem

Jelena Pješivac-Grbović; George Bosilca; Graham E. Fagg; Thara Angskun; Jack J. Dongarra

Selecting a close-to-optimal collective algorithm based on the parameters of the collective call at run time is an important step in achieving good performance of MPI applications. In this paper, we explore the applicability of C4.5 decision trees to the MPI collective algorithm selection problem. We construct C4.5 decision trees from the measured algorithm performance data and analyze both the decision tree properties and the expected run time performance penalty. In the cases we considered, results show that C4.5 decision trees can be used to generate a reasonably small and very accurate decision function. For example, the broadcast decision tree with only 21 leaves was able to achieve a mean performance penalty of 2.08%. Similarly, combining experimental data for reduce and broadcast and generating a decision function from the combined decision trees resulted in less than 2.5% relative performance penalty. The results indicate that C4.5 decision trees are applicable to this problem and should be more widely used in this domain.
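To make the notion of a decision function and its performance penalty concrete, here is a minimal sketch. It assumes the penalty of one call is the relative slowdown of the chosen algorithm against the fastest measured one, averaged over all measured points. That definition, the hand-written rules in `decide` (standing in for rules a tree learner might emit), and the algorithm names are assumptions for illustration rather than the paper's method.

```python
def decide(comm_size, msg_bytes):
    """Hypothetical decision function of the kind a decision tree could generate."""
    if msg_bytes < 1024:
        return "binomial"
    if comm_size <= 8:
        return "linear"
    return "pipeline"


def mean_penalty_percent(measurements, decision_function):
    """Average relative slowdown (in percent) against the best measured algorithm.

    measurements maps (comm_size, msg_bytes) to {algorithm_name: measured_time}.
    """
    penalties = []
    for (comm_size, msg_bytes), timings in measurements.items():
        best = min(timings.values())
        chosen = timings[decision_function(comm_size, msg_bytes)]
        penalties.append((chosen - best) / best)
    return 100.0 * sum(penalties) / len(penalties)
```

With real benchmark data in `measurements`, the same function could score any candidate decision function, whether it was produced by a tree learner or written by hand.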

PVM/MPI'07 Proceedings of the 14th European conference on Recent Advances in Parallel Virtual Machine and Message Passing Interface | 2007

An evaluation of Open MPI's matching transport layer on the Cray XT

Richard L. Graham; Ron Brightwell; Brian Barrett; George Bosilca; Jelena Pješivac-Grbović

Open MPI was initially designed to support a wide variety of high-performance networks and network programming interfaces. Recently, Open MPI was enhanced to support networks that have full support for MPI matching semantics. Previous Open MPI efforts focused on networks that require the MPI library to manage message matching, which is sub-optimal for some networks that inherently support matching. We describe a new matching transport layer in Open MPI, present results of micro-benchmarks and several applications on the Cray XT platform, and compare performance of the new and the existing transport layers, as well as the vendor-supplied implementation of MPI.

cluster computing and the grid | 2007

Reliability Analysis of Self-Healing Network using Discrete-Event Simulation

Thara Angskun; George Bosilca; Graham E. Fagg; Jelena Pješivac-Grbović; Jack J. Dongarra

The number of processors embedded on high performance computing platforms is continuously increasing to accommodate users' desire to solve larger and more complex problems. However, as the number of components increases, so does the probability of failure. Thus, both scalability and fault tolerance of software are important issues in this field. To ensure reliability of the software, especially under failure conditions, reliability analysis is needed. The discrete-event simulation technique offers an attractive alternative to traditional Markovian-based analytical models, which often have an intractably large state space. In this paper, we analyze the reliability of a self-healing network developed for parallel runtime environments using discrete-event simulation. The network is designed to support transmission of messages across multiple nodes and, at the same time, to protect against node and process failures. Results demonstrate the flexibility of a discrete-event simulation approach for studying the network behavior under failure conditions and various protocol parameters, message types, and routing algorithms.
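A generic discrete-event reliability simulation can be sketched in a few lines. The model below is a deliberately simple stand-in, not the paper's network model: nodes fail and are repaired after exponentially distributed times, and a run counts as a failure once more than `tolerance` nodes are down at the same moment. All parameter names (`mttf`, `mttr`, `tolerance`, `horizon`) are assumptions made for this example.

```python
import heapq
import random


def survives(n_nodes, mttf, mttr, tolerance, horizon, rng):
    """Run one simulation; True if the system stays usable until the horizon."""
    # Event queue ordered by time: each entry is (time, kind, node).
    events = [(rng.expovariate(1.0 / mttf), "fail", node) for node in range(n_nodes)]
    heapq.heapify(events)
    down = 0
    while events:
        time, kind, node = heapq.heappop(events)
        if time > horizon:
            return True
        if kind == "fail":
            down += 1
            if down > tolerance:
                return False
            # Schedule the repair ("self-healing") of this node.
            heapq.heappush(events, (time + rng.expovariate(1.0 / mttr), "repair", node))
        else:
            down -= 1
            # A repaired node can fail again later.
            heapq.heappush(events, (time + rng.expovariate(1.0 / mttf), "fail", node))
    return True


def estimate_reliability(runs, seed=0, **params):
    """Fraction of independent runs that survive until the horizon."""
    rng = random.Random(seed)
    return sum(survives(rng=rng, **params) for _ in range(runs)) / runs
```

Because each run only processes the events that actually occur, the simulation avoids enumerating the full state space that a Markov model of the same system would require.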

Biophysical Journal | 2005