Publications


Featured research published by Haihang You.


Parallel Tools Workshop | 2010

Collecting Performance Data with PAPI-C

Daniel Terpstra; Heike Jagode; Haihang You; Jack J. Dongarra

Modern high performance computer systems continue to increase in size and complexity. Tools to measure application performance in these increasingly complex environments must also increase the richness of their measurements to provide insights into the increasingly intricate ways in which software and hardware interact. PAPI (the Performance API) has provided consistent platform and operating system independent access to CPU hardware performance counters for nearly a decade. Recent trends toward massively parallel multi-core systems with often heterogeneous architectures present new challenges for the measurement of hardware performance information, which is now available not only on the CPU core itself, but scattered across the chip and system. We discuss the evolution of PAPI into Component PAPI, or PAPI-C, in which multiple sources of performance data can be measured simultaneously via a common software interface. Several examples of components and component data measurements are discussed. We explore the challenges to hardware performance measurement in existing multi-core architectures. We conclude with an exploration of future directions for the PAPI interface.
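
For readers unfamiliar with the interface, below is a minimal sketch of counting hardware events through PAPI's C API; the event choices and the measured loop are illustrative, and error handling is abbreviated.

```c
#include <stdio.h>
#include <stdlib.h>
#include <papi.h>

int main(void)
{
    int eventset = PAPI_NULL;
    long long counts[2];

    /* Initialize the library and create an event set. */
    if (PAPI_library_init(PAPI_VER_CURRENT) != PAPI_VER_CURRENT)
        exit(1);
    if (PAPI_create_eventset(&eventset) != PAPI_OK)
        exit(1);

    /* Two preset CPU events; under PAPI-C, counters from other
       components (network, thermal, ...) are added through this same
       interface, one component per event set. */
    PAPI_add_event(eventset, PAPI_TOT_INS);
    PAPI_add_event(eventset, PAPI_TOT_CYC);

    PAPI_start(eventset);
    volatile double x = 0.0;                 /* region to measure */
    for (int i = 0; i < 1000000; i++)
        x += i * 0.5;
    PAPI_stop(eventset, counts);

    printf("instructions: %lld  cycles: %lld\n", counts[0], counts[1]);
    return 0;
}
```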


International Parallel and Distributed Processing Symposium | 2007

POET: Parameterized Optimizations for Empirical Tuning

Qing Yi; Keith Seymour; Haihang You; Richard W. Vuduc; Daniel J. Quinlan

The excessive complexity of both machine architectures and applications has made it difficult for compilers to statically model and predict application behavior. This observation motivates the recent interest in performance tuning using empirical techniques. We present a new embedded scripting language, POET (Parameterized Optimization for Empirical Tuning), for parameterizing complex code transformations so that they can be empirically tuned. The POET language aims to significantly improve the generality, flexibility, and efficiency of existing empirical tuning systems. We have used the language to parameterize and empirically tune three loop optimizations - interchange, blocking, and unrolling - for two linear algebra kernels. We show experimentally that the time required to tune these optimizations using POET, which does not require any program analysis, is significantly shorter than when using a full compiler-based source-code optimizer that performs sophisticated program analysis and optimizations.
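
POET scripts are too involved to excerpt here, but the transformations they parameterize are easy to picture. The sketch below hand-codes one such parameterization in C: the unroll factor UF of a simple vector-scale loop is exposed as a compile-time knob (a hypothetical stand-in, not POET syntax); an empirical tuner would generate and time one variant per candidate value.

```c
#include <stddef.h>

#ifndef UF
#define UF 4   /* unroll factor: the search would try 1, 2, 4, 8, ... */
#endif

void scale(double *x, size_t n, double a)
{
    size_t i = 0;
    /* Main loop, unrolled UF times; the inner loop has a constant
       trip count, so compilers flatten it. */
    for (; i + UF <= n; i += UF)
        for (size_t u = 0; u < UF; u++)
            x[i + u] *= a;
    /* Cleanup loop for the remaining n % UF elements. */
    for (; i < n; i++)
        x[i] *= a;
}
```

Rebuilding with, e.g., -DUF=8 yields the next candidate; POET's contribution is expressing whole families of such transformations, including blocking and interchange, in one tunable script.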


International Parallel and Distributed Processing Symposium | 2003

Experiences and lessons learned with a portable interface to hardware performance counters

Jack J. Dongarra; Kevin S. London; Shirley Moore; Philip Mucci; Daniel Terpstra; Haihang You; Min Zhou

The PAPI project has defined and implemented a cross-platform interface to the hardware counters available on most modern microprocessors. The interface has gained widespread use and acceptance from hardware vendors, users, and tool developers. This paper reports on experiences with the community-based open-source effort to define the PAPI specification and implement it on a variety of platforms. Collaborations with tool developers who have incorporated support for PAPI are described. Issues related to interpretation and accuracy of hardware counter data and to the overheads of collecting this data are discussed. The paper concludes with implications for the design of the next version of PAPI.
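
As a rough illustration of the collection-overhead question, one can issue two back-to-back reads of a running cycle counter and attribute the difference to the read itself; a sketch, assuming PAPI is installed (a real study would average many trials):

```c
#include <stdio.h>
#include <papi.h>

int main(void)
{
    int es = PAPI_NULL;
    long long a[1], b[1];

    PAPI_library_init(PAPI_VER_CURRENT);
    PAPI_create_eventset(&es);
    PAPI_add_event(es, PAPI_TOT_CYC);   /* total cycles */
    PAPI_start(es);

    PAPI_read(es, a);   /* first read */
    PAPI_read(es, b);   /* immediate second read */

    /* The counter kept running during the second read, so the
       difference approximates the cost of one PAPI_read in cycles. */
    printf("approx. cycles per PAPI_read: %lld\n", b[0] - a[0]);

    PAPI_stop(es, a);
    return 0;
}
```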


High Performance Distributed Computing | 2008

The impact of paravirtualized memory hierarchy on linear algebra computational kernels and software

Lamia Youseff; Keith Seymour; Haihang You; Jack J. Dongarra; Richard Wolski

Previous studies have revealed that paravirtualization imposes minimal performance overhead on High Performance Computing (HPC) workloads while exposing numerous benefits for this field. In this study, we investigate the memory hierarchy characteristics of paravirtualized systems and their impact on automatically tuned software systems. We present an accurate characterization of memory attributes using hardware counters and user-process accounting. To that end, we examine the proficiency of ATLAS, a quintessential example of an autotuning software system, in tuning the BLAS library routines for paravirtualized systems. In addition, we examine the effects of paravirtualization on the performance boundary. Our results show that the combination of ATLAS and Xen paravirtualization delivers native execution performance and nearly identical memory hierarchy performance profiles. Our research thus exposes new benefits to memory-intensive applications arising from the ability to slim down the guest OS without affecting system performance. In addition, our findings support a novel and very attractive deployment scenario for computational science and engineering codes on virtual clusters and computational clouds.
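
For context, ATLAS performs its probing and tuning once at install time and hides the result behind the standard BLAS entry points, so application code is identical whether the library was built on native hardware or inside a Xen guest. A minimal CBLAS call with a hard-coded 2x2 example:

```c
#include <stdio.h>
#include <cblas.h>   /* CBLAS interface, shipped by ATLAS */

/* Compute C = alpha*A*B + beta*C through the standard interface;
   which blocked kernel runs underneath is ATLAS's install-time choice. */
int main(void)
{
    double A[2 * 2] = {1, 2, 3, 4};
    double B[2 * 2] = {5, 6, 7, 8};
    double C[2 * 2] = {0, 0, 0, 0};

    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                2, 2, 2,        /* M, N, K */
                1.0, A, 2,      /* alpha, A, lda */
                B, 2,           /* B, ldb */
                0.0, C, 2);     /* beta, C, ldc */

    printf("C = [%g %g; %g %g]\n", C[0], C[1], C[2], C[3]);
    return 0;
}
```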


International Conference on Cluster Computing | 2008

A comparison of search heuristics for empirical code optimization

Keith Seymour; Haihang You; Jack J. Dongarra

This paper describes the application of various search techniques to the problem of automatic empirical code optimization. The search process is a critical aspect of auto-tuning systems because the large size of the search space and the cost of evaluating candidate implementations make it infeasible to find the true optimum by brute force. We evaluate the effectiveness of Nelder-Mead Simplex, Genetic Algorithms, Simulated Annealing, Particle Swarm Optimization, Orthogonal Search, and Random Search in terms of the performance of the best candidate found under varying time limits.
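
As a concrete baseline, Random Search is the simplest of these techniques: draw candidate parameter settings at random, evaluate each, and keep the best found within a budget. A minimal sketch in C, where time_candidate() is a hypothetical stand-in for generating, compiling, and timing one code variant:

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical stand-in for building and timing one code variant;
   here a toy cost surface with its minimum at block=64, unroll=4. */
static double time_candidate(int block, int unroll)
{
    return (double)(block - 64) * (block - 64)
         + (double)(unroll - 4) * (unroll - 4);
}

int main(void)
{
    const int blocks[]  = {16, 32, 64, 128, 256};
    const int unrolls[] = {1, 2, 4, 8};
    const int budget = 10;              /* candidate evaluations allowed */
    int best_b = 0, best_u = 0;
    double best_t = 1e300;

    srand(42);
    for (int trial = 0; trial < budget; trial++) {
        int b = blocks[rand() % (sizeof blocks / sizeof blocks[0])];
        int u = unrolls[rand() % (sizeof unrolls / sizeof unrolls[0])];
        double t = time_candidate(b, u);
        if (t < best_t) { best_t = t; best_b = b; best_u = u; }
    }
    printf("best found: block=%d unroll=%d (cost %g)\n",
           best_b, best_u, best_t);
    return 0;
}
```

The other heuristics differ in how the next candidate is chosen, not in this evaluate-and-keep-best loop.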


IBM Journal of Research and Development | 2006

Self-adapting numerical software (SANS) effort

Jack J. Dongarra; George Bosilca; Zizhong Chen; Victor Eijkhout; Graham E. Fagg; Erika Fuentes; Julien Langou; Piotr Luszczek; Jelena Pješivac-Grbović; Keith Seymour; Haihang You; Sathish S. Vadhiyar

The challenge for the development of next-generation software is the successful management of the complex computational environment while delivering to the scientist the full power of flexible compositions of the available algorithmic alternatives. Self-adapting numerical software (SANS) systems are intended to meet this significant challenge. The process of arriving at an efficient numerical solution of problems in computational science involves numerous decisions by a numerical expert. Attempts to automate such decisions distinguish three levels: algorithmic decision, management of the parallel environment, and processor-specific tuning of kernels. Additionally, at any of these levels we can decide to rearrange the user's data. In this paper we look at a number of efforts at the University of Tennessee to investigate these areas.


Environmental Modelling and Software | 2013

Coupling climate and hydrological models

Jonathan L. Goodall; Kathleen D. Saint; Mehmet B. Ercan; Laura J. Briley; Sylvia Murphy; Haihang You; Cecelia Deluca; Richard B. Rood

Understanding regional-scale water resource systems requires understanding coupled hydrologic and climate interactions. The traditional approach in the hydrologic sciences and engineering fields has been either to treat the atmosphere as a forcing condition on the hydrologic model, or to adopt a specific hydrologic model design in order to be interoperable with a climate model. We propose here a different approach that follows a service-oriented architecture and uses standard interfaces and tools: the Earth System Modeling Framework (ESMF) from the weather and climate community and the Open Modeling Interface (OpenMI) from the hydrologic community. A novel technical challenge of this work is that the climate model runs on a high performance computer while the hydrologic model runs on a personal computer; in order to complete a two-way coupling, issues with security and job scheduling had to be overcome. The resulting application demonstrates interoperability across disciplinary boundaries and has the potential to address emerging questions about climate impacts on local water resource systems. The approach also has the potential to be adapted for other climate impacts applications that involve different communities, multiple frameworks, and models running on different computing platforms. Alongside the results of our coupled modeling system, we present a scaling analysis that indicates how the system will behave as geographic extents and model resolutions are changed to address regional-scale water resources management problems.

Highlights:
- The prototyped hydro-climate testbed is an example of multi-scale modeling.
- The work demonstrates interoperability across Earth science modeling frameworks.
- The Community Atmosphere Model (CAM) dominates the total execution time for the regional-scale hydrologic system.
- Web Services communication overhead is not excessive relative to CAM.
- Service-orientation could be a useful approach for coupling across community models.


International Conference on Computational Science | 2004

Accurate cache and TLB characterization using hardware counters

Jack J. Dongarra; Shirley Moore; Philip Mucci; Keith Seymour; Haihang You

We have developed a set of microbenchmarks for accurately determining the structural characteristics of data cache memories and TLBs. These characteristics include cache size, cache line size, cache associativity, memory page size, number of data TLB entries, and data TLB associativity. Unlike previous microbenchmarks that used time-based measurements, our microbenchmarks use hardware event counts to more accurately and quickly determine these characteristics while requiring fewer limiting assumptions.
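
A sketch of one such counter-based probe (a hypothetical reconstruction, not the paper's exact benchmark): walk an array much larger than the cache at increasing strides while counting L1 data-cache misses. Misses per access grow with the stride until the stride reaches the line size, then plateau at one; the knee reveals the line size (hardware prefetching can blur the picture on some machines).

```c
#include <stdio.h>
#include <papi.h>

#define N (16 * 1024 * 1024)   /* array much larger than the cache */

int main(void)
{
    static char buf[N];
    int es = PAPI_NULL;
    long long misses;

    PAPI_library_init(PAPI_VER_CURRENT);
    PAPI_create_eventset(&es);
    PAPI_add_event(es, PAPI_L1_DCM);    /* L1 data-cache misses */

    for (int stride = 4; stride <= 512; stride *= 2) {
        long accesses = 0;
        PAPI_start(es);
        for (long i = 0; i < N; i += stride) { buf[i]++; accesses++; }
        PAPI_stop(es, &misses);
        /* Expect misses/access ~ stride/linesize, capped at 1.0. */
        printf("stride %4d: %.3f misses/access\n",
               stride, (double)misses / accesses);
    }
    return 0;
}
```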


Proceedings of the 2004 Workshop on Memory System Performance | 2004

Automatic blocking of QR and LU factorizations for locality

Qing Yi; Ken Kennedy; Haihang You; Keith Seymour; Jack J. Dongarra

QR and LU factorizations for dense matrices are important linear algebra computations that are widely used in scientific applications. To perform these computations efficiently on modern computers, the factorization algorithms need to be blocked when operating on large matrices in order to exploit the deep cache hierarchy prevalent in today's computer memory systems. Because both QR (based on Householder transformations) and LU factorization algorithms contain complex loop structures, few compilers can fully automate the blocking of these algorithms. Though linear algebra libraries such as LAPACK provide manually blocked implementations of these algorithms, automatically generating blocked versions of the computations offers additional benefits, such as automatic adaptation to different blocking strategies. This paper demonstrates how to apply an aggressive loop transformation technique, dependence hoisting, to produce efficient blockings for both QR and LU with partial pivoting. We present different blocking strategies that can be generated by our optimizer and compare the performance of auto-blocked versions with manually tuned versions in LAPACK, using reference BLAS, ATLAS BLAS, and native BLAS specially tuned for the underlying machine architectures.
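
The blocking idea itself is easiest to see on a simpler kernel; below is a sketch of a tiled matrix multiply in C (matrix multiply stands in for the factorizations here; the paper's contribution is deriving such tilings automatically for the harder QR and LU loop nests):

```c
#include <stddef.h>

#define NB 64   /* block size, chosen so tiles fit in cache */

/* C += A*B for n x n row-major matrices: the three loops are tiled so
   each NB x NB block is reused while it is resident in cache. */
void matmul_blocked(size_t n, const double *A, const double *B, double *C)
{
    for (size_t ii = 0; ii < n; ii += NB)
        for (size_t kk = 0; kk < n; kk += NB)
            for (size_t jj = 0; jj < n; jj += NB)
                /* Multiply one pair of NB x NB tiles. */
                for (size_t i = ii; i < ii + NB && i < n; i++)
                    for (size_t k = kk; k < kk + NB && k < n; k++) {
                        double a = A[i * n + k];
                        for (size_t j = jj; j < jj + NB && j < n; j++)
                            C[i * n + j] += a * B[k * n + j];
                    }
}
```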


Job Scheduling Strategies for Parallel Processing | 2012

Comprehensive Workload Analysis and Modeling of a Petascale Supercomputer

Haihang You; Hao Zhang

The performance of a supercomputer scheduler is greatly affected by the characteristics of the workload it serves. A good understanding of workload characteristics is therefore important for developing and evaluating scheduling strategies for an HPC system. In this paper, we present a comprehensive analysis of the workload characteristics of Kraken, the world's fastest academic supercomputer and 11th on the latest Top500 list, with 112,896 compute cores and a peak performance of 1.17 petaflops. In this study, we use twelve months of workload traces gathered on the system, comprising around 700 thousand jobs submitted by more than one thousand users from 25 research areas. We investigate three categories of workload characteristics: 1) general characteristics, including the distribution of jobs over research fields and queues, the distribution of job size for an individual user, job cancellation rate, job termination rate, and walltime request accuracy; 2) temporal characteristics, including monthly machine utilization, job temporal distributions for different time periods, and job inter-arrival times between temporally adjacent jobs and between jobs submitted by the same user; 3) execution characteristics, including the distributions of job attributes such as queuing time, actual runtime, job size, and memory usage, and the correlations between these attributes. This work provides a realistic basis for scheduler design and comparison by studying the supercomputer's workload with new approaches, such as Gaussian mixture models, and from new viewpoints, such as the perspective of the user community. To the best of our knowledge, this is the first study to systematically investigate the workload characteristics of a petascale supercomputer dedicated to open scientific research.
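
As an illustration of the Gaussian mixture modeling mentioned above, here is a minimal one-dimensional, two-component EM fit in C; the sample data and starting values are invented, whereas the paper fits such models to real trace attributes (e.g., job inter-arrival times).

```c
#include <stdio.h>
#include <math.h>

#ifndef M_PI
#define M_PI 3.14159265358979323846
#endif

/* Gaussian density with mean mu and variance var. */
static double gauss(double x, double mu, double var)
{
    return exp(-(x - mu) * (x - mu) / (2.0 * var)) / sqrt(2.0 * M_PI * var);
}

int main(void)
{
    /* Invented sample, e.g., log inter-arrival times in seconds. */
    double x[] = {0.9, 1.1, 1.0, 1.2, 4.8, 5.1, 5.0, 4.9, 1.05, 5.2};
    const int n = 10;
    double w = 0.5, mu1 = 0.0, mu2 = 6.0, v1 = 1.0, v2 = 1.0, r[10];

    for (int it = 0; it < 100; it++) {
        /* E-step: responsibility of component 1 for each point. */
        for (int i = 0; i < n; i++) {
            double p1 = w * gauss(x[i], mu1, v1);
            double p2 = (1.0 - w) * gauss(x[i], mu2, v2);
            r[i] = p1 / (p1 + p2);
        }
        /* M-step: re-estimate weight, means, and variances. */
        double s = 0, s1 = 0, s2 = 0, q1 = 0, q2 = 0;
        for (int i = 0; i < n; i++) {
            s += r[i];  s1 += r[i] * x[i];  s2 += (1.0 - r[i]) * x[i];
        }
        w = s / n;  mu1 = s1 / s;  mu2 = s2 / (n - s);
        for (int i = 0; i < n; i++) {
            q1 += r[i] * (x[i] - mu1) * (x[i] - mu1);
            q2 += (1.0 - r[i]) * (x[i] - mu2) * (x[i] - mu2);
        }
        v1 = q1 / s;  v2 = q2 / (n - s);
    }
    printf("component 1: weight=%.2f mean=%.2f var=%.2f\n", w, mu1, v1);
    printf("component 2: weight=%.2f mean=%.2f var=%.2f\n", 1.0 - w, mu2, v2);
    return 0;
}
```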

Collaboration


Dive into Haihang You's collaborations.

Top Co-Authors

Shirley Moore, University of Texas at El Paso
Xuan Shi, University of Arkansas
Bilel Hadri, University of Tennessee
Fei Xing, Mathematica Policy Research