Publication


Featured research published by Xiaoyang Gao.


Proceedings of the IEEE | 2005

Synthesis of High-Performance Parallel Programs for a Class of ab Initio Quantum Chemistry Models

Gerald Baumgartner; Alexander A. Auer; David E. Bernholdt; Alina Bibireata; Venkatesh Choppella; Daniel Cociorva; Xiaoyang Gao; Robert J. Harrison; So Hirata; Sriram Krishnamoorthy; Sandhya Krishnan; Chi-Chung Lam; Qingda Lu; Marcel Nooijen; Russell M. Pitzer; J. Ramanujam; P. Sadayappan; Alexander Sibiryakov

This paper provides an overview of a program synthesis system for a class of quantum chemistry computations. These computations are expressible as a set of tensor contractions and arise in electronic structure modeling. The input to the system is a high-level specification of the computation, from which the system can synthesize high-performance parallel code tailored to the characteristics of the target architecture. Several components of the synthesis system are described, focusing on the performance optimization issues that they address.


Molecular Physics | 2006

Automatic code generation for many-body electronic structure methods: the tensor contraction engine

Alexander A. Auer; Gerald Baumgartner; David E. Bernholdt; Alina Bibireata; Venkatesh Choppella; Daniel Cociorva; Xiaoyang Gao; Robert J. Harrison; Sriram Krishnamoorthy; Sandhya Krishnan; Chi-Chung Lam; Qingda Lu; Marcel Nooijen; Russell M. Pitzer; J. Ramanujam; P. Sadayappan; Alexander Sibiryakov

As both electronic structure methods and the computers on which they are run become increasingly complex, the task of producing robust, reliable, high-performance implementations of methods at a rapid pace becomes increasingly daunting. In this paper we present an overview of the Tensor Contraction Engine (TCE), a unique effort to address issues of both productivity and performance through automatic code generation. The TCE is designed to take equations for many-body methods in a convenient high-level form and acts like an optimizing compiler, producing an implementation tuned to the target computer system and even to the specific chemical problem of interest. We provide examples to illustrate the TCE approach, including the ability to target different parallel programming models, and the effects of particular optimizations.


International Parallel and Distributed Processing Symposium | 2003

Global communication optimization for tensor contraction expressions under memory constraints

Daniel Cociorva; Xiaoyang Gao; Sandhya Krishnan; Gerald Baumgartner; Chi-Chung Lam; P. Sadayappan; J. Ramanujam

The accurate modeling of the electronic structure of atoms and molecules requires computationally intensive tensor contractions over large multi-dimensional arrays. The efficient computation of complex tensor contractions usually requires the generation of temporary intermediate arrays. These intermediates could be extremely large, but they can often be generated and used in batches through appropriate loop fusion transformations. To optimize the performance of such computations on parallel computers, the total amount of inter-processor communication must be minimized, subject to the available memory on each processor. In this paper, we address the memory-constrained communication minimization problem in the context of this class of computations. Based on a framework that models the relationship between loop fusion and memory usage, we develop an approach to identify the best combination of loop fusion and data partitioning that minimizes inter-processor communication cost without exceeding the per-processor memory limit. The effectiveness of the developed optimization approach is demonstrated on a computation representative of a component used in quantum chemistry suites.
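As a rough illustration of how loop fusion can shrink an intermediate array, the sketch below evaluates a small two-step contraction twice: once with the intermediate fully materialized, and once with the loops fused so that the intermediate collapses to a scalar. The function names, the example contraction, and the "intermediate size" bookkeeping are assumptions made for illustration; they are not taken from the paper, and the sketch does not model data partitioning or inter-processor communication.

```python
def unfused(A, B, C):
    """Compute S[i] = sum_j (sum_k A[i][k] * B[k][j]) * C[j] with a full intermediate T."""
    n = len(A)
    # T is materialized in full: n * n elements of temporary storage.
    T = [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
         for i in range(n)]
    S = [sum(T[i][j] * C[j] for j in range(n)) for i in range(n)]
    return S, n * n  # result and number of intermediate elements stored


def fused(A, B, C):
    """Same computation with the i and j loops fused; T shrinks to one scalar."""
    n = len(A)
    S = [0] * n
    for i in range(n):
        for j in range(n):
            t = 0  # scalar stand-in for T[i][j], consumed right after it is produced
            for k in range(n):
                t += A[i][k] * B[k][j]
            S[i] += t * C[j]
    return S, 1  # result and number of intermediate elements stored


if __name__ == "__main__":
    A = [[1, 2], [3, 4]]
    B = [[5, 6], [7, 8]]
    C = [1, 1]
    print(unfused(A, B, C))
    print(fused(A, B, C))
```

Both versions produce the same result; only the storage needed for the intermediate differs. In a memory-constrained setting, choosing which loops to fuse is what trades temporary storage against other costs.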


Journal of Parallel and Distributed Computing | 2012

Empirical performance model-driven data layout optimization and library call selection for tensor contraction expressions

Qingda Lu; Xiaoyang Gao; Sriram Krishnamoorthy; Gerald Baumgartner; J. Ramanujam; P. Sadayappan

Empirical optimizers like ATLAS have been very effective in optimizing computational kernels in libraries. The best choice of parameters such as tile size and degree of loop unrolling is determined in ATLAS by executing different versions of the computation. In contrast, optimizing compilers use a model-driven approach to program transformation. While the model-driven approach of optimizing compilers is generally orders of magnitude faster than ATLAS-like library generators, its effectiveness can be limited by the accuracy of the performance models used. In this paper, we describe an approach where a class of computations is modeled in terms of constituent operations that are empirically measured, thereby allowing modeling of the overall execution time. The performance model with empirically determined cost components is used to select library calls and choose data layout transformations in the context of the Tensor Contraction Engine, a compiler for a high-level domain-specific language for expressing computational models in quantum chemistry. The effectiveness of the approach is demonstrated through experimental measurements on representative computations from quantum chemistry.


International Conference on Computational Science | 2006

Identifying cost-effective common subexpressions to reduce operation count in tensor contraction evaluations

Albert Hartono; Qingda Lu; Xiaoyang Gao; Sriram Krishnamoorthy; Marcel Nooijen; Gerald Baumgartner; David E. Bernholdt; Venkatesh Choppella; Russell M. Pitzer; J. Ramanujam; Atanas Rountev; P. Sadayappan

Complex tensor contraction expressions arise in accurate electronic structure models in quantum chemistry, such as the coupled cluster method. Transformations using algebraic properties of commutativity and associativity can be used to significantly decrease the number of arithmetic operations required for evaluation of these expressions. Operation minimization is an important optimization step for the Tensor Contraction Engine, a tool being developed for the automatic transformation of high-level tensor contraction expressions into efficient programs. The identification of common subexpressions among a set of tensor contraction expressions can result in a reduction of the total number of operations required to evaluate the tensor contractions. In this paper, we develop an effective algorithm for common subexpression identification and demonstrate its effectiveness on tensor contraction expressions for coupled cluster equations.
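To make the effect of associativity on operation count concrete, the following sketch counts multiplications for a three-tensor contraction, S[a,b] = sum over c,d of A[a,c] * B[c,d] * C[d,b], evaluated either as one direct loop nest or as a sequence of pairwise contractions. The cost model (one multiplication per point of the loop nest), the helper names, and the example tensors are illustrative assumptions, not the algorithm described in the paper.

```python
from math import prod


def direct_cost(tensors, extents):
    """Multiplications when all tensors are multiplied inside one loop nest."""
    indices = set().union(*tensors.values())
    return (len(tensors) - 1) * prod(extents[i] for i in indices)


def pairwise_cost(order, tensors, output, extents):
    """Multiplications when tensors are contracted two at a time, in `order`."""
    remaining = dict(tensors)
    current = remaining.pop(order[0])
    total = 0
    for name in order[1:]:
        other = remaining.pop(name)
        # One loop per distinct index appearing in either operand.
        total += prod(extents[i] for i in set(current) | set(other))
        # Keep only indices still needed by the output or by later operands;
        # all other indices are summed away in this step.
        needed = set(output).union(*remaining.values())
        current = tuple(i for i in dict.fromkeys(current + other) if i in needed)
    return total


if __name__ == "__main__":
    extents = {"a": 10, "b": 10, "c": 10, "d": 10}
    tensors = {"A": ("a", "c"), "B": ("c", "d"), "C": ("d", "b")}
    print(direct_cost(tensors, extents))
    print(pairwise_cost(["A", "B", "C"], tensors, ("a", "b"), extents))
    print(pairwise_cost(["A", "C", "B"], tensors, ("a", "b"), extents))
```

With every index of extent 10, contracting A with B first costs far fewer multiplications than the direct loop nest, while pairing A with C first (two tensors that share no index) gives no savings. Searching over such orderings is the kind of choice that operation minimization has to make.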


ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming | 2005

Performance modeling and optimization of parallel out-of-core tensor contractions

Xiaoyang Gao; Swarup Kumar Sahoo; Chi-Chung Lam; J. Ramanujam; Qingda Lu; Gerald Baumgartner; P. Sadayappan

The Tensor Contraction Engine (TCE) is a domain-specific compiler for implementing complex tensor contraction expressions arising in quantum chemistry applications that model electronic structure. This paper develops a performance model for tensor contractions, considering both disk I/O and inter-processor communication costs, to facilitate performance-model-driven loop optimization for this domain. Experimental results are provided that demonstrate the accuracy and effectiveness of the model.


Concurrency and Computation: Practice and Experience | 2007

Efficient search‐space pruning for integrated fusion and tiling transformations

Xiaoyang Gao; Sriram Krishnamoorthy; Swarup Kumar Sahoo; Chi-Chung Lam; Gerald Baumgartner; J. Ramanujam; P. Sadayappan

Compile‐time optimizations involve a number of transformations such as loop permutation, fusion, tiling, and array contraction. The selection of the appropriate transformation to minimize the execution time is a challenging task. We address this problem in the context of tensor contraction expressions involving arrays too large to fit in main memory. Domain‐specific features of the computation are exploited to develop an integrated framework that facilitates the exploration of the entire search space of optimizations. In this paper, we discuss the exploration of the space of loop fusion and tiling transformations in order to minimize the disk I/O cost. These two transformations are integrated, and pruning strategies are presented that significantly reduce the number of loop structures to be evaluated for subsequent transformations. The evaluation of the framework using representative contraction expressions from quantum chemistry shows a dramatic reduction in the size of the search space using the strategies presented.


IEEE International Conference on High Performance Computing Data and Analytics | 2004

Empirical performance-model driven data layout optimization

Qingda Lu; Xiaoyang Gao; Sriram Krishnamoorthy; Gerald Baumgartner; J. Ramanujam; P. Sadayappan

Empirical optimizers like ATLAS have been very effective in optimizing computational kernels in libraries. The best choice of parameters such as tile size and degree of loop unrolling is determined by executing different versions of the computation. In contrast, optimizing compilers use a model-driven approach to program transformation. While the model-driven approach of optimizing compilers is generally orders of magnitude faster than ATLAS-like library generators, its effectiveness can be limited by the accuracy of the performance models used. In this paper, we describe an approach where a class of computations is modeled in terms of constituent operations that are empirically measured, thereby allowing modeling of the overall execution time. The performance model with empirically determined cost components is used to perform data layout optimization in the context of the Tensor Contraction Engine, a compiler for a high-level domain-specific language for expressing computational models in quantum chemistry. The effectiveness of the approach is demonstrated through experimental measurements on some representative computations from quantum chemistry.


Lecture Notes in Computer Science | 2006

Efficient search-space pruning for integrated fusion and tiling transformations

Xiaoyang Gao; Sriram Krishnamoorthy; Swarup Kumar Sahoo; Chi-Chung Lam; Gerald Baumgartner; J. Ramanujam; P. Sadayappan


Archive | 2008

Integrated compiler optimizations for tensor contractions

P. Sadayappan; Xiaoyang Gao

Collaboration


Dive into Xiaoyang Gao's collaborations.

Top Co-Authors

J. Ramanujam

Louisiana State University

Qingda Lu

Ohio State University

David E. Bernholdt

Oak Ridge National Laboratory
