
Publications


Featured research published by Cy P. Chan.


Programming Language Design and Implementation (PLDI) | 2009

PetaBricks: a language and compiler for algorithmic choice

Jason Ansel; Cy P. Chan; Yee Lok Wong; Marek Olszewski; Qin Zhao; Alan Edelman; Saman P. Amarasinghe

It is often impossible to obtain a one-size-fits-all solution for high performance algorithms when considering different choices for data distributions, parallelism, transformations, and blocking. The best solution to these choices is often tightly coupled to different architectures, problem sizes, data, and available system resources. In some cases, completely different algorithms may provide the best performance. Current compiler and programming language techniques are able to change some of these parameters, but today there is no simple way for the programmer to express or the compiler to choose different algorithms to handle different parts of the data. Existing solutions normally can handle only coarse-grained, library-level selections or hand-coded cutoffs between base cases and recursive cases. We present PetaBricks, a new implicitly parallel language and compiler where having multiple implementations of multiple algorithms to solve a problem is the natural way of programming. We make algorithmic choice a first-class construct of the language. Choices are provided in a way that also allows our compiler to tune at a finer granularity. The PetaBricks compiler autotunes programs by making both fine-grained and algorithmic choices. Choices also include different automatic parallelization techniques, data distributions, algorithmic parameters, transformations, and blocking. Additionally, we introduce novel techniques to autotune algorithms for different convergence criteria. When choosing between various direct and iterative methods, the PetaBricks compiler is able to tune a program in such a way that it delivers near-optimal efficiency for any desired level of accuracy. The compiler has the flexibility of utilizing different convergence criteria for the various components within a single algorithm, providing the user with accuracy choice alongside algorithmic choice.
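
The core idea, treating the choice of algorithm and of recursion cutoffs as tunable parameters rather than hard-coded decisions, can be illustrated with a minimal sketch. The code below is plain Python, not PetaBricks syntax; the sort routines and candidate cutoffs are hypothetical stand-ins for the language's first-class algorithmic choices.

```python
# Illustrative sketch only: a toy autotuner that treats an algorithmic choice
# (the recursion cutoff between merge sort and insertion sort) as a tunable
# parameter, in the spirit of, but not using, PetaBricks.
import random
import time


def insertion_sort(xs):
    """Quadratic sort; typically fastest on small inputs."""
    xs = list(xs)
    for i in range(1, len(xs)):
        key, j = xs[i], i - 1
        while j >= 0 and xs[j] > key:
            xs[j + 1] = xs[j]
            j -= 1
        xs[j + 1] = key
    return xs


def merge_sort(xs, cutoff):
    """Recursive sort that switches to insertion sort below a tuned cutoff."""
    if len(xs) <= cutoff:
        return insertion_sort(xs)
    mid = len(xs) // 2
    left, right = merge_sort(xs[:mid], cutoff), merge_sort(xs[mid:], cutoff)
    merged, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            merged.append(left[i])
            i += 1
        else:
            merged.append(right[j])
            j += 1
    return merged + left[i:] + right[j:]


def autotune(trial_input, cutoffs=(4, 16, 64, 256)):
    """Time each candidate cutoff (an 'algorithmic choice') and keep the best."""
    timings = {}
    for c in cutoffs:
        start = time.perf_counter()
        merge_sort(trial_input, c)
        timings[c] = time.perf_counter() - start
    return min(timings, key=timings.get)


if __name__ == "__main__":
    data = [random.random() for _ in range(20000)]
    print("tuned recursion cutoff:", autotune(data))
```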


International Parallel and Distributed Processing Symposium (IPDPS) | 2010

An auto-tuning framework for parallel multicore stencil computations

Shoaib Kamil; Cy P. Chan; Leonid Oliker; John Shalf; Samuel Williams

Although stencil auto-tuning has shown tremendous potential in effectively utilizing architectural resources, it has hitherto been limited to single kernel instantiations; in addition, the large variety of stencil kernels used in practice makes this computation pattern difficult to assemble into a library. This work presents a stencil auto-tuning framework that significantly advances programmer productivity by automatically converting a straightforward sequential Fortran 95 stencil expression into tuned parallel implementations in Fortran, C, or CUDA, thus allowing performance portability across diverse computer architectures, including the AMD Barcelona, Intel Nehalem, Sun Victoria Falls, and the latest NVIDIA GPUs. Results show that our generalized methodology delivers significant performance gains of up to 22× speedup over the reference serial implementation. Overall we demonstrate that such domain-specific auto-tuners hold enormous promise for architectural efficiency, programmer productivity, performance portability, and algorithmic adaptability on existing and emerging multicore systems.
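
The search such an auto-tuner performs can be sketched by timing candidate block sizes for a simple 2D 5-point stencil sweep. The NumPy implementation, grid size, and candidate blocks below are assumptions made for illustration; they are not the framework's generated Fortran, C, or CUDA code.

```python
# Illustrative sketch: auto-tuning the block size of a 2D 5-point stencil sweep.
# This mimics the style of search over tuned variants; the grid size and
# candidate block sizes are made up for the example.
import time

import numpy as np


def blocked_stencil(grid, block):
    """One Jacobi-style 5-point sweep over the interior, processed block by block."""
    out = grid.copy()
    n = grid.shape[0]
    for bi in range(1, n - 1, block):
        for bj in range(1, n - 1, block):
            i1, j1 = min(bi + block, n - 1), min(bj + block, n - 1)
            out[bi:i1, bj:j1] = 0.25 * (
                grid[bi - 1:i1 - 1, bj:j1] + grid[bi + 1:i1 + 1, bj:j1]
                + grid[bi:i1, bj - 1:j1 - 1] + grid[bi:i1, bj + 1:j1 + 1]
            )
    return out


def tune_block_size(n=1024, candidates=(16, 32, 64, 128, 256)):
    """Time each candidate block size on this machine and return the fastest."""
    grid = np.random.rand(n, n)
    timings = {}
    for block in candidates:
        start = time.perf_counter()
        blocked_stencil(grid, block)
        timings[block] = time.perf_counter() - start
    return min(timings, key=timings.get)


if __name__ == "__main__":
    print("best block size on this machine:", tune_block_size())
```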


Symposium on Code Generation and Optimization (CGO) | 2011

Language and compiler support for auto-tuning variable-accuracy algorithms

Jason Ansel; Yee Lok Wong; Cy P. Chan; Marek Olszewski; Alan Edelman; Saman P. Amarasinghe

Approximating ideal program outputs is a common technique for solving computationally difficult problems, for adhering to processing or timing constraints, and for performance optimization in situations where perfect precision is not necessary. To this end, programmers often use approximation algorithms, iterative methods, data resampling, and other heuristics. However, programming such variable-accuracy algorithms presents difficult challenges since the optimal algorithms and parameters may change with different accuracy requirements and usage environments. This problem is further compounded when multiple variable-accuracy algorithms are nested together due to the complex way that accuracy requirements can propagate across algorithms and because of the size of the set of allowable compositions. As a result, programmers often deal with this issue in an ad hoc manner that can sometimes violate sound programming practices such as maintaining library abstractions. In this paper, we propose language extensions that expose trade-offs between time and accuracy to the compiler. The compiler performs fully automatic compile-time and install-time autotuning and analyses in order to construct optimized algorithms to achieve any given target accuracy. We present novel compiler techniques and a structured genetic tuning algorithm to search the space of candidate algorithms and accuracies in the presence of recursion and sub-calls to other variable-accuracy code. These techniques benefit both the library writer, by providing an easy way to describe and search the parameter and algorithmic choice space, and the library user, by allowing high-level specification of accuracy requirements which are then met automatically without the need for the user to understand any algorithm-specific parameters. Additionally, we present a new suite of benchmarks, written in our language, to examine the efficacy of our techniques. Our experimental results show that by relaxing accuracy requirements, we can easily obtain performance improvements ranging from 1.1× up to orders of magnitude of speedup.
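
The time/accuracy trade-off these extensions expose can be sketched as a tiny tuner that, for each requested accuracy, picks the cheapest algorithm and iteration count that still meets it. The square-root variants and the uniform per-iteration cost model below are invented for illustration; they are not the paper's benchmarks or its structured genetic tuner.

```python
# Illustrative sketch: choose the cheapest (algorithm, iterations) variant that
# meets each accuracy target. The candidate algorithms and cost model are
# hypothetical; only the idea of tuning for a target accuracy is the point.
import math


def bisection_sqrt(x, iters):
    """Bisection on [0, max(1, x)]; the error roughly halves every iteration."""
    lo, hi = 0.0, max(1.0, x)
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if mid * mid < x:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def newton_sqrt(x, iters):
    """Newton's method; converges quadratically from a positive initial guess."""
    guess = max(1.0, x)
    for _ in range(iters):
        guess = 0.5 * (guess + x / guess)
    return guess


def tune_for_accuracy(target_error, x=2.0, max_iters=60):
    """Return the cheapest (iterations, algorithm) pair meeting target_error."""
    exact = math.sqrt(x)
    feasible = [(iters, name)
                for name, fn in (("bisection", bisection_sqrt), ("newton", newton_sqrt))
                for iters in range(1, max_iters)
                if abs(fn(x, iters) - exact) <= target_error]
    # Cost model: one iteration of either method is treated as equally expensive.
    return min(feasible) if feasible else None


if __name__ == "__main__":
    for eps in (1e-2, 1e-6, 1e-12):
        print(f"target accuracy {eps:g} -> cheapest variant {tune_for_accuracy(eps)}")
```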


IEEE International Conference on High Performance Computing, Data, and Analytics | 2014

Abstract machine models and proxy architectures for exascale computing

James A. Ang; Richard F. Barrett; R.E. Benner; D. Burke; Cy P. Chan; Jeanine Cook; David Donofrio; Simon D. Hammond; Karl Scott Hemmert; Suzanne M. Kelly; H. Le; Vitus J. Leung; David Resnick; Arun Rodrigues; John Shalf; Dylan T. Stark; Didem Unat; Nicholas J. Wright

To achieve exascale computing, fundamental hardware architectures must change. This will significantly impact scientific applications that run on current high performance computing (HPC) systems, many of which codify years of scientific domain knowledge and refinements for contemporary computer systems. To adapt to exascale architectures, developers must be able to reason about new hardware and determine what programming models and algorithms will provide the best blend of performance and energy efficiency in the future. An abstract machine model is designed to expose to the application developers and system software only the aspects of the machine that are important or relevant to performance and code structure. These models are intended as communication aids between application developers and hardware architects during the co-design process. A proxy architecture is a parameterized version of an abstract machine model, with parameters added to elucidate potential speeds and capacities of key hardware components. These more detailed architectural models enable discussion between the developers of analytic models and simulators and the computer hardware architects, and they allow for application performance analysis, system software development, and the identification of hardware optimization opportunities. In this paper, we present a set of abstract machine models and show how they might be used to help software developers prepare for exascale. We then apply parameters to one of these models to demonstrate how a proxy architecture can enable a more concrete exploration of how well application codes map onto future architectures.
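
A minimal sketch of what a proxy architecture can look like in code: an abstract machine model carrying explicit speed and capacity parameters, used here for a roofline-style time projection. The parameter values and the two design points are placeholders invented for the example, not figures from the paper.

```python
# Illustrative sketch: a "proxy architecture" as a parameterized abstract
# machine model, used for back-of-the-envelope performance projections.
# The parameter values below are placeholders, not real design points.
from dataclasses import dataclass


@dataclass
class ProxyArchitecture:
    name: str
    peak_gflops: float          # aggregate floating-point rate per node (GFLOP/s)
    dram_bandwidth_gbs: float   # memory bandwidth per node (GB/s)
    dram_capacity_gb: float     # memory capacity per node (GB)


def roofline_time(arch, flops, bytes_moved):
    """Lower-bound execution time: limited by either compute or memory traffic."""
    compute_s = flops / (arch.peak_gflops * 1e9)
    memory_s = bytes_moved / (arch.dram_bandwidth_gbs * 1e9)
    return max(compute_s, memory_s)


if __name__ == "__main__":
    # Two hypothetical design points that differ only in memory bandwidth.
    designs = [
        ProxyArchitecture("wide-memory", peak_gflops=10_000, dram_bandwidth_gbs=4_000, dram_capacity_gb=64),
        ProxyArchitecture("narrow-memory", peak_gflops=10_000, dram_bandwidth_gbs=1_000, dram_capacity_gb=256),
    ]
    flops, bytes_moved = 1e12, 4e12   # a bandwidth-bound kernel: 0.25 flops/byte
    for arch in designs:
        print(f"{arch.name}: {roofline_time(arch, flops, bytes_moved):.2f} s")
```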


IEEE International Conference on High Performance Computing, Data, and Analytics | 2009

Autotuning multigrid with PetaBricks

Cy P. Chan; Jason Ansel; Yee Lok Wong; Saman P. Amarasinghe; Alan Edelman

Algorithmic choice is essential to realizing optimal computational performance in any problem domain. Multigrid is a prime example: not only is it possible to make choices at the highest grid resolution, but a program can switch techniques as the problem is recursively attacked on coarser grid levels to take advantage of algorithms with different scaling behaviors. Additionally, users with different convergence criteria must experiment with parameters to yield a tuned algorithm that meets their accuracy requirements. Even after a tuned algorithm has been found, users often have to start all over when migrating from one machine to another. We present an algorithm and autotuning methodology that address these issues in a near-optimal and efficient manner. The freedom of independently tuning both the algorithm and the number of iterations at each recursion level results in an exponential search space of tuned algorithms that have different accuracies and performances. To search this space efficiently, our autotuner utilizes a novel dynamic programming method to build efficient tuned algorithms from the bottom up. The results are customized multigrid algorithms that invest targeted computational power to yield the accuracy required by the user. The techniques we describe allow the user to automatically generate tuned multigrid cycles of different shapes targeted to the user's specific combination of problem, hardware, and accuracy requirements. These cycle shapes dictate the order in which grid coarsening and grid refinement are interleaved with both iterative methods, such as Jacobi or Successive Over-Relaxation, and direct methods, which tend to have superior performance for small problem sizes. The need to make choices between all of these methods brings the issue of variable accuracy to the forefront. Not only must the autotuning framework compare different possible multigrid cycle shapes against each other, but it also needs the ability to compare tuned cycles against both direct and (non-multigrid) iterative methods. We address this problem by using an accuracy metric for measuring the effectiveness of tuned cycle shapes and making comparisons over all algorithmic types based on this common yardstick. In our results, we find that the flexibility to trade performance versus accuracy at all levels of recursive computation enables us to achieve excellent performance on a variety of platforms compared to algorithmically static implementations of multigrid. Our implementation uses PetaBricks, an implicitly parallel programming language where algorithmic choices are exposed in the language. The PetaBricks compiler uses these choices to analyze, autotune, and verify the PetaBricks program. These language features, most notably the autotuner, were key in enabling our implementation to be clear, correct, and fast.
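
The shape of the bottom-up dynamic program can be sketched with an invented cost and accuracy model: for each grid level and accuracy target, keep the cheapest plan, choosing between a direct solve at that level and a recursive visit to the next coarser level followed by relaxation sweeps. Every number below is made up; only the structure of the search mirrors the approach described in the abstract.

```python
# Illustrative sketch of a bottom-up dynamic program over (level, accuracy).
# The cost and accuracy functions are invented stand-ins, not measurements.
import math

LEVELS = 5                        # level 0 is the coarsest grid
ACCURACIES = [1e-1, 1e-3, 1e-6]   # accuracy targets the tuner must satisfy


def direct_cost(level):
    """Invented cost model: direct solves grow expensive on finer grids."""
    return 8.0 ** level


def sweep_cost(level):
    """Invented cost of one relaxation sweep at this level."""
    return 2.0 ** level


def sweeps_needed(fine_acc, coarse_acc):
    """Invented number of fine-level sweeps needed to refine a coarse solution
    of accuracy coarse_acc down to fine_acc."""
    return max(1, math.ceil(3 * math.log10(coarse_acc / fine_acc)))


def tune():
    # best[(level, acc)] = (cost, human-readable plan)
    best = {(0, acc): (direct_cost(0), "direct@0") for acc in ACCURACIES}
    for level in range(1, LEVELS):
        for acc in ACCURACIES:
            choices = [(direct_cost(level), f"direct@{level}")]
            for coarse_acc in ACCURACIES:     # accuracy requested of the coarser level
                coarse_cost, coarse_plan = best[(level - 1, coarse_acc)]
                sweeps = sweeps_needed(acc, coarse_acc)
                cost = coarse_cost + sweeps * sweep_cost(level)
                choices.append((cost, f"({coarse_plan}) + {sweeps} sweeps@{level}"))
            best[(level, acc)] = min(choices)
    return best


if __name__ == "__main__":
    plans = tune()
    for acc in ACCURACIES:
        cost, plan = plans[(LEVELS - 1, acc)]
        print(f"target {acc:g}: cost {cost:.0f}, plan {plan}")
```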


Compilers, Architecture, and Synthesis for Embedded Systems (CASES) | 2012

SiblingRivalry: online autotuning through local competitions

Jason Ansel; Maciej Pacula; Yee Lok Wong; Cy P. Chan; Marek Olszewski; Una-May O'Reilly; Saman P. Amarasinghe

Modern high performance libraries, such as ATLAS and FFTW, and programming languages, such as PetaBricks, have shown that autotuning computer programs can lead to significant speedups. However, autotuning can be burdensome to the deployment of a program, since the tuning process can take a long time and should be re-run whenever the program, microarchitecture, execution environment, or tool chain changes. Failure to re-autotune programs often leads to widespread use of sub-optimal algorithms. With the growth of cloud computing, where computations can run in environments with unknown load and migrate between different (possibly unknown) microarchitectures, the need for online autotuning has become increasingly important. We present SiblingRivalry, a new model for always-on online autotuning that allows parallel programs to continuously adapt and optimize themselves to their environment. In our system, requests are processed by dividing the available cores in half and running two identical copies of each request in parallel, one on each half. Half of the cores are devoted to a known safe program configuration, while the other half are used for an experimental program configuration chosen by our self-adapting evolutionary algorithm. When the faster configuration completes, its results are returned, and the slower configuration is terminated. Over time, this constant experimentation allows programs to adapt to changing dynamic environments and often outperform the original algorithm that uses the entire system.
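
The racing mechanism itself is straightforward to sketch: run the safe and the experimental configuration on separate resources, return whichever finishes first, and terminate the loser. In the sketch below the two "configurations" are simulated with different sleep times; it illustrates the race, not the SiblingRivalry runtime or its evolutionary search.

```python
# Illustrative sketch of the race between a safe and an experimental
# configuration. Sleep times stand in for differently tuned programs.
import multiprocessing as mp
import time


def run_config(name, simulated_seconds, result_queue):
    """Stand-in for executing the request under one program configuration."""
    time.sleep(simulated_seconds)
    result_queue.put((name, simulated_seconds))


def race(safe_seconds, experimental_seconds):
    queue = mp.Queue()
    workers = [
        mp.Process(target=run_config, args=("safe", safe_seconds, queue)),
        mp.Process(target=run_config, args=("experimental", experimental_seconds, queue)),
    ]
    for w in workers:
        w.start()
    winner = queue.get()       # block until the faster configuration finishes
    for w in workers:
        w.terminate()          # kill the slower sibling
        w.join()
    return winner


if __name__ == "__main__":
    print("winner:", race(safe_seconds=0.5, experimental_seconds=0.2))
```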


International Supercomputing Conference (ISC) | 2013

Software Design Space Exploration for Exascale Combustion Co-design

Cy P. Chan; Didem Unat; Michael J. Lijewski; Weiqun Zhang; John B. Bell; John Shalf

The design of hardware for next-generation exascale computing systems will require a deep understanding of how software optimizations impact hardware design trade-offs. In order to characterize how co-tuning hardware and software parameters affects the performance of combustion simulation codes, we created ExaSAT, a compiler-driven static analysis and performance modeling framework. Our framework can evaluate hundreds of hardware/software configurations in seconds, providing an essential speed advantage over simulators and dynamic analysis techniques during the co-design process. Our analytic performance model shows that advanced code transformations, such as cache blocking and loop fusion, can have a significant impact on choices for cache and memory architecture. Our modeling helped us identify tuned configurations that achieve a 90% reduction in memory traffic, which could significantly improve performance and reduce energy consumption. These techniques will also be useful for the development of advanced programming models and runtimes, which must reason about these optimizations to deliver better performance and energy efficiency.
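
A toy version of the kind of analytic data-movement model such a framework evaluates: estimate DRAM traffic for two dependent grid sweeps with and without loop fusion. The kernel shape and byte counts are assumptions made for the example, not figures from the combustion proxy application.

```python
# Illustrative sketch of an analytic data-movement model: DRAM traffic for two
# dependent grid sweeps, with and without loop fusion. All counts are assumed.

def traffic_unfused(n, word_bytes=8):
    """Sweep 1 reads A and writes a temporary T; sweep 2 reads T and writes B.
    Unfused, the temporary spills to memory between the sweeps."""
    cells = n ** 3
    sweep1 = cells * word_bytes * 2   # read A, write T
    sweep2 = cells * word_bytes * 2   # read T, write B
    return sweep1 + sweep2


def traffic_fused(n, word_bytes=8):
    """Fused sweep reads A and writes B; the temporary stays in cache/registers."""
    return n ** 3 * word_bytes * 2


if __name__ == "__main__":
    n = 256
    before, after = traffic_unfused(n), traffic_fused(n)
    print(f"unfused: {before / 1e9:.2f} GB, fused: {after / 1e9:.2f} GB, "
          f"reduction: {100 * (1 - after / before):.0f}%")
```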


IEEE International Conference on High Performance Computing, Data, and Analytics | 2015

ExaSAT: An exascale co-design tool for performance modeling

Didem Unat; Cy P. Chan; Weiqun Zhang; Samuel Williams; John Bachan; John B. Bell; John Shalf

One of the emerging challenges to designing HPC systems is understanding and projecting the requirements of exascale applications. In order to determine the performance consequences of different hardware designs, analytic models are essential because they can provide fast feedback to the co-design centers and chip designers without costly simulations. However, current attempts to analytically model program performance typically rely on the user manually specifying a performance model. We introduce the ExaSAT framework that automates the extraction of parameterized performance models directly from source code using compiler analysis. The parameterized analytic model enables quantitative evaluation of a broad range of hardware design trade-offs and software optimizations on a variety of different performance metrics, with a primary focus on data movement as a metric. We demonstrate the ExaSAT framework’s ability to perform deep code analysis of a proxy application from the Department of Energy Combustion Co-design Center to illustrate its value to the exascale co-design process. ExaSAT analysis provides insights into the hardware and software trade-offs and lays the groundwork for exploring a more targeted set of design points using cycle-accurate architectural simulators.
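
To show how a parameterized model supports a hardware design-space sweep, the sketch below estimates the time of a 7-point stencil sweep as the cache size varies, using an assumed reuse model and a roofline-style bound. All parameters are invented for the example; ExaSAT itself extracts its models from source code rather than hard-coding them.

```python
# Illustrative sketch: evaluating a cache-size design sweep with a simple
# parameterized analytic model. Every number here is an assumption.

def stencil_traffic_bytes(n, cache_bytes, word_bytes=8):
    """Assumed reuse model: if the cache holds about three grid planes, each
    cell is loaded once per sweep; otherwise neighbouring planes are re-fetched
    and reads roughly triple."""
    planes_fit = cache_bytes >= 3 * n * n * word_bytes
    reads_per_cell = 1 if planes_fit else 3
    return n ** 3 * word_bytes * (reads_per_cell + 1)   # plus one write per cell


def sweep_time_s(n, cache_bytes, bandwidth_gbs, peak_gflops, flops_per_cell=8):
    """Roofline-style bound: the sweep is limited by compute or by traffic."""
    compute = n ** 3 * flops_per_cell / (peak_gflops * 1e9)
    memory = stencil_traffic_bytes(n, cache_bytes) / (bandwidth_gbs * 1e9)
    return max(compute, memory)


if __name__ == "__main__":
    n = 512
    for cache_mb in (1, 4, 16, 64):
        t = sweep_time_s(n, cache_mb * 2 ** 20, bandwidth_gbs=200, peak_gflops=1000)
        print(f"cache {cache_mb:>2} MB -> estimated sweep time {t * 1e3:.1f} ms")
```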


Foundations of Computational Mathematics | 2009

Sturm Sequences and Random Eigenvalue Distributions

James T. Albrecht; Cy P. Chan; Alan Edelman

This paper proposes that the study of Sturm sequences is invaluable in the numerical computation and theoretical derivation of eigenvalue distributions of random matrix ensembles. We first explore the use of Sturm sequences to efficiently compute histograms of eigenvalues for symmetric tridiagonal matrices and apply these ideas to random matrix ensembles such as the β-Hermite ensemble. Using our techniques, we reduce the time to compute a histogram of the eigenvalues of such a matrix from O(n² + m) to O(mn) time, where n is the dimension of the matrix and m is the number of bins (with arbitrary bin centers and widths) desired in the histogram (m is usually much smaller than n). Second, we derive analytic formulas in terms of iterated multivariate integrals for the eigenvalue distribution and the largest eigenvalue distribution for arbitrary symmetric tridiagonal random matrix models. As an example of the utility of this approach, we give a derivation of both distributions for the β-Hermite random matrix ensemble (for general β). Third, we explore the relationship between the Sturm sequence of a random matrix and its shooting eigenvectors. We show using Sturm sequences that, assuming the eigenvector contains no zeros, the number of sign changes in a shooting eigenvector of parameter λ is equal to the number of eigenvalues greater than λ. Finally, we use the techniques presented in the first section to experimentally demonstrate an O(log n) growth relationship between the variance of histogram bin values and the order of the β-Hermite matrix ensemble.
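
The counting idea behind the O(mn) histogram can be shown directly: for a symmetric tridiagonal matrix, the number of negative terms in the Sturm (LDLᵀ pivot) sequence at a shift λ equals the number of eigenvalues below λ, so m + 1 cumulative counts give an m-bin histogram without computing any eigenvalues. The sketch below uses a generic random tridiagonal matrix rather than the β-Hermite model and cross-checks the result against a dense eigenvalue solve.

```python
# Illustrative sketch of Sturm-sequence eigenvalue counting for a symmetric
# tridiagonal matrix, and the O(mn) histogram it enables.
import numpy as np


def count_eigs_below(diag, off, lam, tiny=1e-300):
    """Number of eigenvalues of tridiag(diag, off) strictly below lam."""
    count = 0
    d = diag[0] - lam
    if d < 0:
        count += 1
    for i in range(1, len(diag)):
        denom = d if d != 0 else tiny      # guard against a zero pivot
        d = (diag[i] - lam) - off[i - 1] ** 2 / denom
        if d < 0:
            count += 1
    return count


def eig_histogram(diag, off, edges):
    """Eigenvalue counts per bin for bin boundaries 'edges' (length m + 1)."""
    cumulative = [count_eigs_below(diag, off, e) for e in edges]
    return [cumulative[k + 1] - cumulative[k] for k in range(len(edges) - 1)]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n = 500
    diag = rng.standard_normal(n)
    off = rng.standard_normal(n - 1)
    edges = np.linspace(-3.0, 3.0, 13)
    sturm = eig_histogram(diag, off, edges)
    # Cross-check against a dense eigenvalue solve.
    T = np.diag(diag) + np.diag(off, 1) + np.diag(off, -1)
    dense = np.histogram(np.linalg.eigvalsh(T), bins=edges)[0].tolist()
    print("Sturm counts:", sturm)
    print("dense counts:", dense)
```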


IEEE International Conference on High Performance Computing, Data, and Analytics | 2016

Topology-aware performance optimization and modeling of adaptive mesh refinement codes for exascale

Cy P. Chan; John Bachan; Joseph P. Kenny; Jeremiah J. Wilke; Vincent E. Beckner; Ann S. Almgren; John B. Bell

We introduce a topology-aware performance optimization and modeling workflow for AMR simulation that includes two new modeling tools, ProgrAMR and Mota Mapper, which interface with the BoxLib AMR framework and the SSTmacro network simulator. ProgrAMR allows us to generate and model the execution of task dependency graphs from high-level specifications of AMR-based applications, which we demonstrate by analyzing two example AMR-based multigrid solvers with varying degrees of asynchrony. Mota Mapper generates multiobjective, network topology-aware box mappings, which we apply to optimize the data layout for the example multigrid solvers. While the sensitivity of these solvers to layout and execution strategy appears to be modest for balanced scenarios, the impact of better mapping algorithms can be significant when performance is highly constrained by network hop latency. Furthermore, we show that network latency in the multigrid bottom solve is the main contributing factor preventing good scaling on exascale-class machines.
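
A greatly simplified sketch of topology-aware box mapping: greedily place communicating boxes on a 2D mesh so that heavy-traffic pairs end up few hops apart, then score the placement by hop-weighted communication volume. The communication graph and mesh size are made up, and the greedy heuristic stands in for, but is not, the multiobjective algorithm in Mota Mapper.

```python
# Illustrative sketch: greedy topology-aware placement of communicating boxes
# onto a 2D mesh of nodes. The box graph and mesh dimensions are made up.
import itertools


def manhattan(p, q):
    """Hop distance between two mesh coordinates."""
    return abs(p[0] - q[0]) + abs(p[1] - q[1])


def greedy_map(comm_edges, mesh_dims):
    """comm_edges: {(box_a, box_b): message volume}. Returns box -> (x, y)."""
    nodes = list(itertools.product(range(mesh_dims[0]), range(mesh_dims[1])))
    boxes = sorted({b for edge in comm_edges for b in edge})
    placement, free = {}, set(nodes)
    for box in boxes:
        def cost(node):
            # Hop-weighted volume to partners that are already placed.
            return sum(vol * manhattan(node, placement[other])
                       for (a, b), vol in comm_edges.items()
                       for other in ((b,) if a == box else (a,) if b == box else ())
                       if other in placement)
        best = min(free, key=cost)
        placement[box] = best
        free.remove(best)
    return placement


def total_hop_volume(comm_edges, placement):
    return sum(vol * manhattan(placement[a], placement[b])
               for (a, b), vol in comm_edges.items())


if __name__ == "__main__":
    # Four boxes with a ring of heavy traffic plus one light diagonal pair.
    edges = {("A", "B"): 10, ("B", "C"): 10, ("C", "D"): 10, ("D", "A"): 10, ("A", "C"): 1}
    mapping = greedy_map(edges, mesh_dims=(2, 2))
    print(mapping, "hop-weighted volume:", total_hop_volume(edges, mapping))
```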

Collaboration


An overview of Cy P. Chan's most frequent co-authors and their affiliations.

Top Co-Authors

John Shalf (Lawrence Berkeley National Laboratory)
Alan Edelman (Massachusetts Institute of Technology)
John B. Bell (Lawrence Berkeley National Laboratory)
Jason Ansel (Massachusetts Institute of Technology)
Joseph P. Kenny (Sandia National Laboratories)
Saman P. Amarasinghe (Massachusetts Institute of Technology)
Samuel Williams (Lawrence Berkeley National Laboratory)
Vincent E. Beckner (Lawrence Berkeley National Laboratory)
Weiqun Zhang (Lawrence Berkeley National Laboratory)