Prakash Murali | Researchain

Publication

Featured researches published by Prakash Murali.

international parallel and distributed processing symposium | 2015

Matching Application Signatures for Performance Predictions Using a Single Execution

Anirudh Jayakumar; Prakash Murali; Sathish S. Vadhiyar

Performance predictions for large problem sizes and processor counts using a limited number of small scale runs are useful for a variety of purposes, including scalability projections, and help minimize the time taken to construct training data for building performance models. In this paper, we present a prediction framework that matches execution signatures for performance predictions of HPC applications using a single small scale application execution. Our framework extracts execution signatures of applications and automatically identifies the different application phases. Application signatures of the different phases are matched with the execution profiles of reference kernels stored in a kernel database. The performance of the reference kernels is then used to predict the performance of the application phases. For phases that do not match significantly, our framework performs static analysis of loops and functions in the application to provide prediction ranges. We demonstrate this integrated set of techniques in our framework with three large scale applications: GTC, a Particle-in-Cell code for turbulence simulation; Sweep3d, a 3D neutron transport application; and SMG2000, a multigrid solver. We show that our prediction ranges are accurate in most cases.

international parallel and distributed processing symposium | 2016

Subgraph Counting: Color Coding Beyond Trees

Venkatesan T. Chakaravarthy; Michael Kapralov; Prakash Murali; Fabrizio Petrini; Xinyu Que; Yogish Sabharwal; Baruch Schieber

The problem of counting occurrences of query graphs in a large data graph, known as subgraph counting, is fundamental to several domains such as genomics and social network analysis. Many important special cases (e.g. triangle counting) have received significant attention. Color coding is a very general and powerful algorithmic technique for subgraph counting. Color coding has been shown to be effective in several applications, but scalable implementations are only known for the special case of tree queries (i.e. queries of treewidth one). In this paper we present the first efficient distributed implementation for color coding that goes beyond tree queries: our algorithm applies to any query graph of treewidth 2. Since tree queries can be solved in time linear in the size of the data graph, our contribution is the first step into the realm of color coding for queries that require superlinear worst case running time. This superlinear complexity leads to significant load balancing problems on graphs with heavy tailed degree distributions. Our algorithm works around high degree nodes in the data graph, and achieves very good runtime and scalability on a diverse collection of data and query graph pairs. We also provide a theoretical analysis of our algorithmic techniques, exhibiting asymptotic improvements in runtime on random graphs with power law degree distributions, a popular model for real world graphs.
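
As a rough illustration of the color coding idea in the simpler tree-query setting mentioned above, the following minimal sketch estimates the number of paths on k vertices in a small undirected graph. It is a sequential toy version and not the distributed treewidth-2 algorithm from the paper; the function names `count_colorful_paths` and `estimate_paths` and the adjacency-list input format are assumptions made for this example.

```python
import random
from math import factorial

def count_colorful_paths(adj, colors, k):
    """Count undirected paths on k >= 2 vertices whose vertices all have distinct colors.

    adj is a list of neighbor lists; colors[v] is an integer in [0, k).
    table[v][mask] holds the number of colorful directed paths that end at v
    and use exactly the colors in the bitmask `mask`.
    """
    n = len(adj)
    table = [{1 << colors[v]: 1} for v in range(n)]  # single-vertex paths
    for _ in range(k - 1):
        new = [dict() for _ in range(n)]
        for v in range(n):
            bit = 1 << colors[v]
            for u in adj[v]:
                for mask, cnt in table[u].items():
                    if not mask & bit:  # color of v is not yet used on the path
                        new[v][mask | bit] = new[v].get(mask | bit, 0) + cnt
        table = new
    directed = sum(sum(d.values()) for d in table)
    return directed // 2  # every undirected path was counted from both ends

def estimate_paths(adj, k, trials, seed=0):
    """Average the scaled colorful counts over several random colorings."""
    rng = random.Random(seed)
    # A fixed k-vertex path is colorful with probability k! / k^k.
    scale = k ** k / factorial(k)
    total = 0.0
    for _ in range(trials):
        colors = [rng.randrange(k) for _ in adj]
        total += count_colorful_paths(adj, colors, k) * scale
    return total / trials
```

Because distinct colors force distinct vertices, the dynamic program never has to remember which vertices a partial path has visited, only which colors; this is what keeps the tree case linear in the size of the data graph for a fixed query.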

Concurrency and Computation: Practice and Experience | 2016

Qespera: an adaptive framework for prediction of queue waiting times in supercomputer systems

Prakash Murali; Sathish S. Vadhiyar

Production parallel systems are space‐shared, and resource allocation on such systems is usually performed using a batch queue scheduler. Jobs submitted to the batch queue experience a variable delay before the requested resources are granted. Predicting this delay can assist users in planning experiment time‐frames and choosing sites with lower turnaround times, and can also help meta‐schedulers make scheduling decisions. In this paper, we present an integrated adaptive framework, Qespera, for prediction of queue waiting times on parallel systems. We propose a novel algorithm based on spatial clustering for predictions using the history of job submissions and executions. The framework uses an adaptive set of strategies for choosing either distributions or summaries of features to represent the system state and to compare with history jobs, varying the weights associated with the features for each job prediction, and dynamically selecting a particular algorithm for performing the prediction depending on the characteristics of the target and history jobs. Our experiments with real workload traces from different production systems demonstrate up to 22% reduction in average absolute error and up to 56% reduction in percentage prediction error over existing techniques. We also report prediction errors of less than 1 h for a majority of the jobs.

international conference on supercomputing | 2018

On Optimizing Distributed Tucker Decomposition for Sparse Tensors

Venkatesan T. Chakaravarthy; Jee W. Choi; Douglas J. Joseph; Prakash Murali; Shivmaran S. Pandian; Yogish Sabharwal; Dheeraj Sreedhar

The Tucker decomposition generalizes the notion of Singular Value Decomposition (SVD) to tensors, the higher dimensional analogues of matrices. We study the problem of constructing the Tucker decomposition of sparse tensors on distributed memory systems via the HOOI procedure, a popular iterative method. The scheme used for distributing the input tensor among the processors (MPI ranks) critically influences the HOOI execution time. Prior work has proposed different distribution schemes: an offline scheme based on a sophisticated hypergraph partitioning method, and simple, lightweight alternatives that can be used in real time. While the hypergraph based scheme typically results in faster HOOI execution, it is complex, and the time taken to determine the distribution is an order of magnitude higher than the execution time of a single HOOI iteration. Our main contribution is a lightweight distribution scheme that achieves the best of both worlds. We show that the scheme is near-optimal on certain fundamental metrics associated with the HOOI procedure and, as a result, near-optimal on the computational load (FLOPs). Though the scheme may incur higher communication volume, the computation time is the dominant factor and, as a result, the scheme achieves better performance on the overall HOOI execution time. Our experimental evaluation on large real-life tensors (having up to 4 billion elements) shows that the scheme outperforms the prior schemes on the HOOI execution time by a factor of up to 3x. On the other hand, its distribution time is comparable to the prior lightweight schemes and is typically less than the execution time of a single HOOI iteration.

international parallel and distributed processing symposium | 2017

On Optimizing Distributed Tucker Decomposition for Dense Tensors

Venkatesan T. Chakaravarthy; Jee W. Choi; Douglas J. Joseph; Xing Liu; Prakash Murali; Yogish Sabharwal; Dheeraj Sreedhar

The Tucker decomposition expresses a given tensor as the product of a small core tensor and a set of factor matrices. Our objective is to develop an efficient distributed implementation for the case of dense tensors. The implementation is based on the HOOI (Higher Order Orthogonal Iterator) procedure, wherein the tensor-times-matrix product forms the core routine. Prior work has proposed heuristics for reducing the computational load and communication volume incurred by the routine. We study the two metrics in a formal and systematic manner, and design strategies that are optimal under the two fundamental metrics. Our experimental evaluation on a large benchmark of tensors shows that the optimal strategies provide significant reduction in load and volume compared to prior heuristics, and provide up to 7x speed-up in the overall running time.
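
For orientation, the decomposition and the HOOI update described above can be written in the usual textbook notation. The symbols below (the tensor $\mathcal{X}$, core $\mathcal{G}$, factor matrices $A^{(n)}$, target ranks $R_n$, and the mode-$n$ product $\times_n$) are assumptions of this sketch and are not taken from the abstract itself.

```latex
% Tucker decomposition of an N-mode tensor: core tensor times one factor matrix per mode
\mathcal{X} \;\approx\; \mathcal{G} \times_1 A^{(1)} \times_2 A^{(2)} \cdots \times_N A^{(N)}

% One HOOI sweep, for n = 1, ..., N: a chain of tensor-times-matrix (TTM) products
% over every mode except n, followed by a truncated SVD of the mode-n unfolding
\mathcal{Y} \;=\; \mathcal{X} \times_1 A^{(1)\top} \cdots \times_{n-1} A^{(n-1)\top}
              \times_{n+1} A^{(n+1)\top} \cdots \times_N A^{(N)\top}

A^{(n)} \;\leftarrow\; \text{the } R_n \text{ leading left singular vectors of } Y_{(n)}

% Core tensor after the sweeps
\mathcal{G} \;=\; \mathcal{X} \times_1 A^{(1)\top} \times_2 A^{(2)\top} \cdots \times_N A^{(N)\top}
```

The TTM chain that forms $\mathcal{Y}$ is the core routine mentioned in the abstract, and it is where the computational load and communication volume arise.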

european conference on parallel processing | 2018

Improved Distributed Algorithm for Graph Truss Decomposition

Venkatesan T. Chakaravarthy; Aashish Goyal; Prakash Murali; Shivmaran S. Pandian; Yogish Sabharwal

The truss decomposition provides a popular model for discovering cohesive communities in a given network (graph). The problem has been well studied in sequential, shared memory and MapReduce settings. We study the problem on distributed memory systems. Our work builds on two prior algorithms. The first algorithm is optimized in terms of the computational load and communication volume, but it involves a large number of iterations, leading to high load imbalance and synchronization costs. The second algorithm significantly reduces the number of iterations, but at the cost of increasing the load and the volume. We design an algorithm that offers a tradeoff between the two extremes, with the number of iterations being close to that of the second algorithm and load/volume being close to that of the first. We develop an efficient distributed (MPI) implementation based on the new algorithm. We present an experimental evaluation on large real-world graphs. The evaluation shows that the new algorithm outperforms the two prior algorithms on large system sizes with the performance gain ranging up to 2x.
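
To make the notion of truss decomposition concrete, here is a minimal sequential sketch of the standard peeling approach: the k-truss is the largest subgraph in which every edge lies in at least k-2 triangles, and the trussness of an edge is the largest k for which it survives. This is only a single-machine baseline for illustration, not either of the prior distributed algorithms or the new one described above; the function name `truss_decomposition` and the edge-list input with integer vertex ids are assumptions of the example.

```python
def truss_decomposition(edges):
    """Return {edge: trussness} for an undirected simple graph given as (u, v) pairs."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)

    def key(u, v):
        # Canonical form so that (u, v) and (v, u) refer to the same edge.
        return (u, v) if u < v else (v, u)

    # Support of an edge = number of triangles that contain it.
    support = {key(u, v): len(adj[u] & adj[v]) for u, v in edges}
    truss = {}
    k = 2
    while support:
        # Edges in at most k-2 triangles cannot belong to the (k+1)-truss.
        peel = [e for e, s in support.items() if s <= k - 2]
        if not peel:
            k += 1
            continue
        for u, v in peel:
            # Removing (u, v) destroys one triangle for each common neighbor w.
            for w in adj[u] & adj[v]:
                support[key(u, w)] -= 1
                support[key(v, w)] -= 1
            adj[u].discard(v)
            adj[v].discard(u)
            del support[(u, v)]
            truss[(u, v)] = k
    return truss
```

For example, on a 4-clique with one extra pendant edge, the six clique edges get trussness 4 and the pendant edge gets trussness 2. Each pass of the outer loop corresponds loosely to one peeling iteration, which is the quantity the abstract trades off against load and communication volume.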

IEEE Transactions on Parallel and Distributed Systems | 2018

Metascheduling of HPC Jobs in Day-Ahead Electricity Markets

Prakash Murali; Sathish S. Vadhiyar

High performance grid computing is a key enabler of large scale collaborative computational science. With the promise of exascale computing, high performance grid systems are expected to incur electricity bills that grow super-linearly over time. In order to achieve cost effectiveness in these systems, it is essential for the scheduling algorithms to exploit electricity price variations, both in space and time, that are prevalent in the dynamic electricity price markets. In this paper, we present a metascheduling algorithm to optimize the placement of jobs in a compute grid which consumes electricity from the day-ahead wholesale market. We formulate the scheduling problem as a Minimum Cost Maximum Flow problem and leverage queue waiting time and electricity price predictions to accurately estimate the cost of job execution at a system. Using trace based simulation with real and synthetic workload traces, and real electricity price data sets, we demonstrate our approach on two currently operational grids, XSEDE and NorduGrid. Our experimental setup collectively constitutes more than 433K processors spread across 58 compute systems in 17 geographically distributed locations. Experiments show that our approach simultaneously optimizes the total electricity cost and the average response time of the grid, without being unfair to users of the local batch systems.
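
The Minimum Cost Maximum Flow formulation can be illustrated with a small sketch that places jobs on systems. The network layout (source to jobs, jobs to systems, systems to sink), the integer per-placement costs, the slot capacities, and the names `MinCostFlow` and `place_jobs` are all assumptions made for this example; the paper derives its costs from queue waiting time and electricity price predictions, which are not modeled here.

```python
from collections import deque

class MinCostFlow:
    """Successive shortest paths with a queue-based Bellman-Ford search."""

    def __init__(self, n):
        self.n = n
        # Each edge is [to, residual capacity, cost, index of reverse edge].
        self.graph = [[] for _ in range(n)]

    def add_edge(self, u, v, cap, cost):
        self.graph[u].append([v, cap, cost, len(self.graph[v])])
        self.graph[v].append([u, 0, -cost, len(self.graph[u]) - 1])

    def solve(self, s, t):
        flow = cost = 0
        while True:
            dist = [float("inf")] * self.n
            prev = [None] * self.n
            queued = [False] * self.n
            dist[s] = 0
            queue = deque([s])
            while queue:
                u = queue.popleft()
                queued[u] = False
                for i, (v, cap, c, _) in enumerate(self.graph[u]):
                    if cap > 0 and dist[u] + c < dist[v]:
                        dist[v] = dist[u] + c
                        prev[v] = (u, i)
                        if not queued[v]:
                            queued[v] = True
                            queue.append(v)
            if prev[t] is None:  # no augmenting path left
                return flow, cost
            # Find the bottleneck capacity along the cheapest path.
            push, v = float("inf"), t
            while v != s:
                u, i = prev[v]
                push = min(push, self.graph[u][i][1])
                v = u
            # Push flow along the path and update residual capacities.
            v = t
            while v != s:
                u, i = prev[v]
                edge = self.graph[u][i]
                edge[1] -= push
                self.graph[v][edge[3]][1] += push
                v = u
            flow += push
            cost += push * dist[t]

def place_jobs(costs, slots):
    """costs[j][s]: hypothetical cost of running job j on system s; slots[s]: capacity."""
    n_jobs, n_sys = len(costs), len(slots)
    source, sink = 0, n_jobs + n_sys + 1
    net = MinCostFlow(n_jobs + n_sys + 2)
    for j in range(n_jobs):
        net.add_edge(source, 1 + j, 1, 0)
        for s in range(n_sys):
            net.add_edge(1 + j, 1 + n_jobs + s, 1, costs[j][s])
    for s in range(n_sys):
        net.add_edge(1 + n_jobs + s, sink, slots[s], 0)
    flow, total = net.solve(source, sink)
    # A saturated job-to-system edge means the job was placed on that system.
    placement = {}
    for j in range(n_jobs):
        for v, cap, _, _ in net.graph[1 + j]:
            if v > n_jobs and cap == 0:
                placement[j] = v - 1 - n_jobs
    return flow, total, placement
```

With `costs = [[4, 6], [5, 3], [2, 7]]` and `slots = [1, 2]`, all three jobs are placed at a total cost of 11: job 2 takes the single slot on system 0 and jobs 0 and 1 go to system 1. Maximizing flow first ensures as many jobs as possible are placed, and minimizing cost among maximum flows picks the cheapest such placement.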

IEEE Transactions on Parallel and Distributed Systems | 2017