
Publication


Featured research published by Ananta Tiwari.


international parallel and distributed processing symposium | 2009

A scalable auto-tuning framework for compiler optimization

Ananta Tiwari; Chun Chen; Jacqueline Chame; Mary W. Hall; Jeffrey K. Hollingsworth

We describe a scalable and general-purpose framework for auto-tuning compiler-generated code. We combine Active Harmony's parallel search backend with the CHiLL compiler transformation framework to generate, in parallel, a set of alternative implementations of computation kernels and automatically select the best-performing one. The resulting system achieves performance of compiler-generated code comparable to the fully automated version of the ATLAS library for the tested kernels. Performance for various kernels is 1.4 to 3.6 times faster than the native Intel compiler without search. Our search algorithm simultaneously evaluates different combinations of compiler optimizations and converges to solutions in only a few tens of search steps.
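The core loop of a parallel auto-tuning search like the one this abstract describes can be sketched in a few lines: evaluate candidate optimization configurations concurrently and keep the fastest. This is an illustrative sketch, not the Active Harmony/CHiLL implementation; the `benchmark` cost function and the (tile, unroll) parameter space are invented stand-ins for compiling and timing real kernel variants.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product

def benchmark(config):
    # Hypothetical stand-in for building and timing one code variant;
    # a real tuner would compile the variant and measure its runtime.
    tile, unroll = config
    return abs(tile - 64) * 0.01 + abs(unroll - 4) * 0.02 + 1.0  # seconds

# Candidate configurations: every (tile size, unroll factor) pair.
search_space = list(product([16, 32, 64, 128], [1, 2, 4, 8]))

# Evaluate variants in parallel, as the paper's search backend does across nodes.
with ThreadPoolExecutor(max_workers=4) as pool:
    timings = dict(zip(search_space, pool.map(benchmark, search_space)))

best_config = min(timings, key=timings.get)
print(best_config)
```

A real search would not exhaustively enumerate the space; the paper's point is that a heuristic search converges in a few tens of steps while evaluations proceed in parallel.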


international parallel and distributed processing symposium | 2011

Online Adaptive Code Generation and Tuning

Ananta Tiwari; Jeffrey K. Hollingsworth

In this paper, we present a runtime compilation and tuning framework for parallel programs. We extend our prior work on our auto-tuner, Active Harmony, to tunable parameters that require code generation (for example, different unroll factors). For such parameters, our auto-tuner generates and compiles new code on-the-fly. Effectively, we merge traditional feedback-directed optimization and just-in-time compilation. We show that our system can leverage the available parallelism in today's HPC platforms by evaluating different code variants on different nodes simultaneously. We evaluate our system on two parallel applications and show that it can improve runtime execution by up to 46% compared to the original version of the program.
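The "generate and compile new code on-the-fly for a tunable parameter" idea can be illustrated with a toy generator that emits a summation loop unrolled by a chosen factor. This is a hedged sketch under invented names (`make_summer`); the paper's system generates and compiles real compiled-language kernels, not Python.

```python
def make_summer(unroll):
    # Emit source for a sum loop whose body is unrolled `unroll` times,
    # plus a cleanup loop for the remainder elements.
    body = "\n".join(f"        total += data[i + {k}]" for k in range(unroll))
    src = (
        "def summer(data):\n"
        "    total = 0\n"
        f"    for i in range(0, len(data) - len(data) % {unroll}, {unroll}):\n"
        f"{body}\n"
        f"    for i in range(len(data) - len(data) % {unroll}, len(data)):\n"
        "        total += data[i]\n"
        "    return total\n"
    )
    namespace = {}
    exec(compile(src, f"<unroll_{unroll}>", "exec"), namespace)
    return namespace["summer"]

# Every variant computes the same answer; an online tuner would time each
# variant (possibly on different nodes) and keep the fastest.
data = list(range(100))
results = {u: make_summer(u)(data) for u in (1, 2, 4, 8)}
print(results)
```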


conference on high performance computing (supercomputing) | 2005

Parallel Parameter Tuning for Applications with Performance Variability

Vahid Tabatabaee; Ananta Tiwari; Jeffrey K. Hollingsworth

In this paper, we present parallel on-line optimization algorithms for parameter tuning of parallel programs. We employ direct search algorithms that update parameters based on real-time performance measurements. We discuss the impact of performance variability on the accuracy and efficiency of the optimization algorithms and propose modified versions of the direct search algorithms to cope with it. The modified versions use multiple samples instead of a single sample to estimate performance more accurately. We present preliminary results showing that the performance variability of applications on clusters is heavy-tailed. Finally, we study and demonstrate the performance of the proposed algorithms on a real scientific application.
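The multiple-sample strategy can be shown with a small numerical example: under heavy-tailed timing noise, one measurement can rank two configurations incorrectly, while the mean of several measurements recovers the true ordering. The timing values below are invented for illustration, not data from the paper.

```python
# Hypothetical timings (seconds) for two configurations. Config A is truly
# faster (~1.0 s) than config B (~1.2 s), but one of A's samples caught a
# noise spike, as happens with heavy-tailed variability on shared clusters.
samples_a = [1.02, 0.98, 1.01, 1.45, 1.00, 0.99]
samples_b = [1.21, 1.19, 1.20, 1.22, 1.18, 1.20]

# Unlucky single-sample comparison: the spike makes A look slower than B.
single_a, single_b = samples_a[3], samples_b[3]

# Multi-sample estimate: averaging several measurements per configuration.
mean_a = sum(samples_a) / len(samples_a)
mean_b = sum(samples_b) / len(samples_b)

print(single_a < single_b)  # False: a single sample mis-ranks A behind B
print(mean_a < mean_b)      # True: averaging restores the true ordering
```

With genuinely heavy-tailed noise, robust estimators (median, trimmed mean) can outperform the plain mean; the sketch only shows why more than one sample is needed at all.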


international parallel and distributed processing symposium | 2012

Modeling Power and Energy Usage of HPC Kernels

Ananta Tiwari; Michael A. Laurenzano; Laura Carrington; Allan Snavely

Compute-intensive kernels make up the majority of execution time in HPC applications. Therefore, many of the power draw and energy consumption traits of HPC applications can be characterized in terms of the power draw and energy consumption of these constituent kernels. Given that power- and energy-related constraints have emerged as major design impediments for exascale systems, it is crucial to develop a greater understanding of how kernels behave in terms of power/energy when subjected to different compiler-based optimizations and different hardware settings. In this work, we develop CPU and DIMM power and energy models for three extensively utilized HPC kernels by training artificial neural networks. These networks are trained using empirical data gathered on the target architecture. The models take kernel-specific compiler-based optimization parameters and hardware tunables as inputs and make predictions for the power draw rate and energy consumption of system components. The resulting power draw and energy usage predictions have an absolute error rate that averages less than 5.5% for three important kernels: matrix multiplication (MM), stencil computation, and LU factorization.
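The train-on-empirical-data, predict-power workflow can be illustrated with a deliberately simplified stand-in: the paper trains artificial neural networks, but a one-variable least-squares fit shows the same shape of the pipeline (measured samples in, a predictive model out). All numbers below are synthetic, and a single hardware tunable (CPU frequency) stands in for the paper's full input set.

```python
# Synthetic training data: CPU frequency (GHz) vs. measured CPU power (W).
# These points lie exactly on power = 20 * freq + 11, purely for illustration.
freqs = [1.2, 1.6, 2.0, 2.4, 2.8]
power = [35.0, 43.0, 51.0, 59.0, 67.0]

# Ordinary least-squares fit of power = slope * freq + intercept.
n = len(freqs)
mean_f = sum(freqs) / n
mean_p = sum(power) / n
slope = (sum((f - mean_f) * (p - mean_p) for f, p in zip(freqs, power))
         / sum((f - mean_f) ** 2 for f in freqs))
intercept = mean_p - slope * mean_f

def predict(freq):
    # Predict power draw (W) for an unseen frequency setting.
    return slope * freq + intercept

print(round(predict(2.2), 1))
```

A neural network replaces the linear form when the response to the tunables is nonlinear and multi-dimensional, but the fit/predict structure is the same.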


ieee international conference on high performance computing data and analytics | 2011

Auto-tuning full applications: A case study

Ananta Tiwari; Jeffrey K. Hollingsworth; Chun Chen; Mary W. Hall; Chunhua Liao; Daniel J. Quinlan; Jacqueline Chame

In this paper, we take a concrete step toward realizing our long-term goal of providing a fully automatic end-to-end tuning infrastructure for arbitrary program components and full applications. We describe a general-purpose offline auto-tuning framework and apply it to an application benchmark, SMG2000, a semi-coarsening multigrid on structured grids. The system first extracts computationally intensive loop nests into separate executable functions, a code transformation called outlining. The outlined loop nests are then tuned by the framework and subsequently integrated back into the application. Each loop nest is optimized through a series of composable code transformations, with the transformations parameterized by unbound optimization parameters that are bound during the tuning process. The values for these parameters are selected using a search-based auto-tuner, which performs a parallel heuristic search for the best-performing optimized variants of the outlined loop nests. Our system pinpoints a code variant that performs 2.37 times faster than the original loop nest. When the full application is run using the code variant found by the system, the application's performance improves by 27%.
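Outlining with an unbound tuning parameter can be sketched as follows: a hot loop nest is extracted into its own function whose tile size is left as an argument, so the auto-tuner can bind it later. The kernel and names here are hypothetical, not code from SMG2000.

```python
def outlined_kernel(a, b, tile):
    # Tiled dot-product loop nest, imagined as extracted ("outlined") from a
    # larger application. `tile` is the unbound optimization parameter that
    # the search-based auto-tuner binds during tuning.
    total = 0.0
    n = len(a)
    for start in range(0, n, tile):
        for i in range(start, min(start + tile, n)):
            total += a[i] * b[i]
    return total

a = [1.0] * 100
b = [2.0] * 100
# The tuner would time several bindings of `tile` and keep the fastest;
# every binding computes the same result, so only performance differs.
results = {t: outlined_kernel(a, b, t) for t in (8, 16, 32)}
print(results)
```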


parallel computing | 2009

Tuning parallel applications in parallel

Ananta Tiwari; Vahid Tabatabaee; Jeffrey K. Hollingsworth

In this paper, we present and evaluate a parallel algorithm for parameter tuning of parallel applications. We discuss the impact of performance variability on the accuracy and efficiency of the optimization algorithm and propose a strategy to minimize the impact of this variability. We evaluate our algorithm within the Active Harmony system, an automated online/offline tuning framework. We study its performance on three benchmark codes: PSTSWM, HPL and POP. Compared to the Nelder-Mead algorithm, our algorithm finds better configurations up to seven times faster. For POP, we were able to improve the performance of a production sized run by 59%.


Journal of Physics: Conference Series | 2008

PERI auto-tuning

David H. Bailey; Jacqueline Chame; Chun Chen; Jack J. Dongarra; Mary W. Hall; Jeffrey K. Hollingsworth; Paul D. Hovland; Shirley Moore; Keith Seymour; Jaewook Shin; Ananta Tiwari; Samuel Williams; Haihang You

The enormous and growing complexity of today's high-end systems has increased the already significant challenges of obtaining high performance on equally complex scientific applications. Application scientists are faced with a daunting challenge in tuning their codes to exploit performance-enhancing architectural features. The Performance Engineering Research Institute (PERI) is working toward the goal of automating portions of the performance tuning process. This paper describes PERI's overall strategy for auto-tuning tools and recent progress in both building auto-tuning tools and demonstrating their success on kernels, some taken from large-scale applications.


Parallel Processing Letters | 2013

Characterizing Large-Scale HPC Applications through Trace Extrapolation

Laura Carrington; Michael A. Laurenzano; Ananta Tiwari

The analysis and understanding of large-scale application behavior is critical for effectively utilizing existing HPC resources and making design decisions for upcoming systems. In this work we utilize information about the behavior of an MPI application at a series of smaller core counts to characterize its behavior at a much larger core count. Our methodology first captures the application's behavior via a set of features that are important for both performance and energy (cache hit rates, floating-point intensity, ILP, etc.). We then find the best statistical fit from among a set of canonical functions in terms of how these features change across a series of small core counts. The models for a given feature can then be utilized to generate an extrapolated trace of the application at scale. The accuracy of the extrapolated traces is evaluated by calculating the error of the extrapolated trace relative to an actual trace for two large-scale applications, UH3D and SPECFEM3D. The accuracy of the fully extrapolated traces is further evaluated by comparing the results of building performance models using the extrapolated trace and an actual trace to predict application performance. For these two full-scale HPC applications, performance models built using the extrapolated traces predicted the runtime with absolute relative errors of less than 5%.
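The fit-a-canonical-function-then-extrapolate step can be shown concretely: fit how a feature changes over small core counts, then evaluate the fitted curve at a much larger count. The model form (a + b/n) and the cache-hit-rate numbers below are invented for illustration and are not the paper's actual fits.

```python
# Feature measured at small core counts; synthesized here from the canonical
# form hit(n) = 0.95 - 2.0/n so the fit can be checked exactly.
cores = [8, 16, 32, 64]
hits = [0.95 - 2.0 / n for n in cores]

# Least-squares fit of hit(n) = a + b * (1/n), i.e. linear in x = 1/n.
xs = [1.0 / n for n in cores]
mean_x = sum(xs) / len(xs)
mean_y = sum(hits) / len(hits)
b = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, hits))
     / sum((x - mean_x) ** 2 for x in xs))
a = mean_y - b * mean_x

# Extrapolate the feature to a core count far beyond the measured range.
predicted_at_scale = a + b / 4096
print(round(predicted_at_scale, 4))
```

The paper's methodology does this per feature, choosing the best-fitting form among several canonical candidates, and assembles the extrapolated features into a full trace.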


international parallel and distributed processing symposium | 2015

Predicting Optimal Power Allocation for CPU and DRAM Domains

Ananta Tiwari; Martin Schulz; Laura Carrington

Constraints imposed by power delivery and costs will be key design impediments to the development of next-generation High-Performance Computing (HPC) systems. To remedy these impediments, solutions have been proposed that impose power bounds (or caps) on over-provisioned computing systems to remain within physical (and financial) power limits. Uninformed power capping can significantly impact performance, and power capping's success depends largely on how intelligently a given power budget is allocated across the various subsystems of the computing nodes. Since different computations put vastly different demands on system components, those variations in demand must be taken into consideration when making power allocation decisions in order to lessen performance degradation. Given a target power bound, the model-based methodology presented in this paper, which takes computation-specific properties into account, guides power allocations for the CPU and DRAM domains to maximize performance. Our methodology is accurate and can predict the performance impacts of power-capping allocation schemes for different types of computations from real applications with an absolute mean error of less than 6%.
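Model-guided power allocation can be sketched as: given a total node budget, enumerate CPU/DRAM cap splits and pick the one a performance model predicts is fastest. The slowdown model below is invented for illustration; the paper's model is learned from computation-specific properties rather than hard-coded.

```python
def predicted_slowdown(cpu_cap, dram_cap):
    # Hypothetical model for a compute-bound code: performance suffers
    # sharply when the CPU cap is tight, and only mildly under a DRAM cap.
    cpu_penalty = max(0.0, (95 - cpu_cap) * 0.02)    # CPU "wants" ~95 W
    dram_penalty = max(0.0, (25 - dram_cap) * 0.01)  # DRAM "wants" ~25 W
    return 1.0 + cpu_penalty + dram_penalty

budget = 120  # watts for CPU + DRAM combined
splits = [(cpu, budget - cpu) for cpu in range(60, 111, 5)]

# Choose the split the model predicts degrades performance least.
best = min(splits, key=lambda s: predicted_slowdown(*s))
print(best)
```

A memory-bound computation would have the opposite penalty shape, which is exactly why the allocation decision must be computation-specific.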


international conference on cluster computing | 2013

Understanding the performance of stencil computations on Intel's Xeon Phi

Joshua Peraza; Ananta Tiwari; Michael A. Laurenzano; Laura Carrington; William A. Ward; Roy L. Campbell

Accelerators are becoming prevalent in high performance computing as a way of achieving increased computational capacity within a smaller power budget. Effectively utilizing the raw compute capacity made available by these systems, however, remains a challenge because it can require a substantial investment of programmer time to port and optimize code to effectively use novel accelerator hardware. In this paper we present a methodology for isolating and modeling the performance of common performance-critical patterns of code (so-called idioms) and other relevant behavioral characteristics from large scale HPC applications which are likely to perform favorably on Intel Xeon Phi. The benefits of the methodology are twofold: (1) it directs programmer efforts toward the regions of code most likely to benefit from porting to the Xeon Phi and (2) provides speedup estimates for porting those regions of code. We then apply the methodology to the stencil idiom, showing performance improvements of up to a factor of 4.7× on stencil-based benchmark codes.
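For readers unfamiliar with the stencil idiom the paper models, a minimal example: each output point is a weighted combination of its neighbors. This is a 1D 3-point Jacobi-style sweep for illustration only; the HPC stencils targeted at Xeon Phi are typically 2D/3D and heavily vectorized.

```python
def stencil_sweep(a):
    # One relaxation sweep over the interior; boundary points stay fixed.
    # Writing into a copy makes this a Jacobi (not Gauss-Seidel) update.
    out = a[:]
    for i in range(1, len(a) - 1):
        out[i] = 0.25 * a[i - 1] + 0.5 * a[i] + 0.25 * a[i + 1]
    return out

grid = [0.0] * 4 + [4.0] + [0.0] * 4   # a spike that diffuses outward
grid = stencil_sweep(grid)
print(grid)
```

The fixed neighbor pattern and regular memory access are what make stencils amenable both to the paper's idiom-level performance modeling and to wide-SIMD hardware like the Xeon Phi.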

Collaboration


Dive into Ananta Tiwari's collaborations.

Top Co-Authors

Laura Carrington
San Diego Supercomputer Center

Adam Jundt
San Diego Supercomputer Center

Jacqueline Chame
University of Southern California

William A. Ward
United States Department of Defense

Joshua Peraza
University of California

Allan Snavely
University of California

Chunhua Liao
Lawrence Livermore National Laboratory