Network


Latest external collaborations at the country level.

Hotspot


Dive into the research topics where David C. Wong is active.

Publications


Featured research published by David C. Wong.


International Symposium on Microarchitecture | 2009

BulkCompiler: high-performance sequential consistency through cooperative compiler and hardware support

Wonsun Ahn; Shanxiang Qi; M. Nicolaides; Josep Torrellas; Jae-Woo Lee; Xing Fang; Samuel P. Midkiff; David C. Wong

A platform that supported sequential consistency (SC) for all codes - not only the well-synchronized ones - would simplify the task of programmers. Recently, several hardware architectures that support high-performance SC by committing groups of instructions at a time have been proposed. However, for a platform to support SC, it is insufficient that the hardware does; the compiler has to support SC as well. This paper presents the hardware-compiler interface, and the main compiler ideas for BulkCompiler, a simple compiler layer that works with the group-committing hardware to provide a whole-system high-performance SC platform. We introduce ISA primitives and software algorithms for BulkCompiler to drive instruction-group formation, and to transform code to exploit the groups. Our simulation results show that BulkCompiler not only enables a whole-system SC environment, but also one that actually outperforms a conventional platform that uses the more relaxed Java Memory Model by an average of 37%. The speedups come from code optimization inside software-assembled instruction groups.
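To make the hardware-compiler interface concrete, here is a minimal C sketch of how source code might delimit an instruction group for group-committing hardware; the __begin_chunk/__end_chunk intrinsics are invented names for illustration, not BulkCompiler's actual ISA primitives.

```c
/* Hedged sketch: delimiting an atomically committed instruction group.
 * __begin_chunk()/__end_chunk() are hypothetical intrinsics invented for
 * this example; the paper defines its own ISA primitives. */
extern void __begin_chunk(void);  /* hypothetical: open an instruction group */
extern void __end_chunk(void);    /* hypothetical: commit the group as a unit */

int shared_x, shared_y;

void update(void) {
    __begin_chunk();
    shared_x = 1;   /* inside the group, the compiler may reorder and       */
    shared_y = 2;   /* optimize freely; the group commits atomically, so    */
    __end_chunk();  /* other threads never observe a partially done update. */
}
```

The point of such primitives is that the compiler gains a region where aggressive reordering cannot be observed by other threads, which is where the reported speedups come from.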


High-Performance Scientific Computing | 2012

Measuring Computer Performance

William Jalby; David C. Wong; David J. Kuck; Jean-Thomas Acquaviva; Jean Christophe Beyler

Computer performance improvement embraces many issues, but it is severely hampered by existing approaches that examine only one or a few topics at a time. Each problem solved leads to another saturation point and another serious problem. In the most frustrating cases, solving some problems exacerbates others and achieves no net performance gain. This paper discusses how to measure a large computational load globally, using as much architectural detail as needed. Besides the traditional goals of sequential and parallel system performance, these methods are useful for energy optimization.
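As one concrete way to collect the low-level measurements that such global analysis consumes, the sketch below counts CPU cycles around a code region using Linux's perf_event_open interface; it is illustrative tooling, not the authors' measurement infrastructure.

```c
/* Sketch: counting cycles for a code region via Linux perf events.
 * The workload is a placeholder; error handling is kept minimal. */
#include <linux/perf_event.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>

int main(void) {
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.type = PERF_TYPE_HARDWARE;
    attr.size = sizeof(attr);
    attr.config = PERF_COUNT_HW_CPU_CYCLES;
    attr.disabled = 1;
    attr.exclude_kernel = 1;

    /* count for this thread, on any CPU */
    int fd = syscall(__NR_perf_event_open, &attr, 0, -1, -1, 0);
    if (fd < 0) { perror("perf_event_open"); return 1; }

    volatile double x = 0.0;
    ioctl(fd, PERF_EVENT_IOC_RESET, 0);
    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
    for (long i = 0; i < 10000000L; i++)  /* region under measurement */
        x += 1e-9;
    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);

    uint64_t cycles = 0;
    if (read(fd, &cycles, sizeof(cycles)) == sizeof(cycles))
        printf("cycles: %llu\n", (unsigned long long)cycles);
    close(fd);
    return 0;
}
```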


Rapid Simulation and Performance Evaluation: Methods and Tools | 2013

Simsys: a performance simulation framework

Jose Noudohouenou; Vincent Palomares; William Jalby; David C. Wong; David J. Kuck; Jean Christophe Beyler

HW/SW codesign or computer system purchase involves many tradeoffs, including the problem data size, choice of algorithm and compiler, types of HW subsystems used, clock frequencies of each, and number of cores. Simsys is a fast simulation tool set for examining various combinations of these choices, allowing specific HW/SW performance attributions. Simsys's measurement level and approach are the keys to this operating speed and attribution. A combination of modular tools forms Simsys's automatic procedure for system simulation and analysis. The paper gives an overview of the tools and validates the proposed approach on 27 loop nest codelets extracted from Numerical Recipes. It also includes the experimental method and an error analysis. Three performance quality metrics are defined and evaluated for two simple codelets, demonstrating several modes of performance failure and the weakness of intuition in detecting them, as well as illustrating how better tools could help lead to better computer systems. Future Simsys plans include model enhancement with more HW details and much more extensive experimentation.
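For a sense of what a loop-nest codelet and its measurement harness look like, here is an illustrative C example; the kernel, sizes, and repetition count are invented, and this is not one of the paper's 27 Numerical Recipes codelets.

```c
/* Illustrative codelet plus repetition harness; repetitions amortize
 * timer overhead and warm the caches, as codelet benchmarking does. */
#define _POSIX_C_SOURCE 199309L
#include <stdio.h>
#include <time.h>

#define N    512
#define REPS 100

static double a[N][N], b[N][N];

static void codelet(void) {  /* a simple dependent loop nest */
    for (int i = 1; i < N; i++)
        for (int j = 0; j < N; j++)
            a[i][j] = a[i - 1][j] + b[i][j];
}

int main(void) {
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int r = 0; r < REPS; r++)
        codelet();
    clock_gettime(CLOCK_MONOTONIC, &t1);
    double s = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
    printf("%.2f ns per inner iteration\n",
           s * 1e9 / ((double)REPS * (N - 1) * N));
    return 0;
}
```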


International Conference on Performance Engineering | 2017

An Incremental Methodology for Energy Measurement and Modeling

Abdelhafid Mazouz; David C. Wong; David J. Kuck; William Jalby

This paper presents an empirical approach to measuring and modeling the energy consumption of multicore processors. The modeling approach allows us to find a breakdown of the energy consumption among a set of key hardware components, also called HW nodes. We explicitly model the front-end and the back-end in terms of the number of instructions executed. We also model the L1, L2, and L3 caches. Furthermore, we explicitly model the static and dynamic energy consumed by the uncore and core components. From a software perspective, our methodology allows us to correlate energy to the executed code, which helps find opportunities for code optimization and tuning. We use binary analysis and hardware counters for performance characterization. Although we use the on-chip counters (RAPL) for energy measurement, our methodology does not rely on a specific method for energy measurement; thus, it is portable and easy to deploy in various computing environments. We validate our energy model using two Intel processors with a set of HPC codelets, where data sizes are varied so that working sets reside in the L1, L2, and L3 caches, and show a 3% average modeling error. We present a comprehensive analysis, show energy consumption differences between kernels, and relate those differences to the algorithms that are implemented. Finally, we discuss how vectorization leads to energy savings compared to non-vectorized codes.
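The sketch below shows one way to obtain the RAPL energy readings that such a model consumes, through the standard Linux powercap sysfs interface; the domain index (intel-rapl:0) varies by machine, the workload is a placeholder, and this is not the authors' measurement code.

```c
/* Sketch: sampling cumulative package energy from the powercap (RAPL)
 * sysfs file before and after a workload. The counter wraps
 * periodically; production code must handle wraparound. */
#include <stdio.h>

static long long read_energy_uj(void) {
    long long uj = -1;
    FILE *f = fopen("/sys/class/powercap/intel-rapl:0/energy_uj", "r");
    if (f) {
        if (fscanf(f, "%lld", &uj) != 1)
            uj = -1;
        fclose(f);
    }
    return uj;  /* cumulative package energy in microjoules, -1 on error */
}

int main(void) {
    long long e0 = read_energy_uj();
    volatile double x = 0.0;
    for (long i = 0; i < 100000000L; i++)  /* workload under measurement */
        x += 1e-9;
    long long e1 = read_energy_uj();
    if (e0 >= 0 && e1 >= e0)
        printf("package energy: %.3f J\n", (e1 - e0) * 1e-6);
    return 0;
}
```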


IEEE International Symposium on Workload Characterization | 2017

LORE: A loop repository for the evaluation of compilers

Zhi Chen; Zhangxiaowen Gong; Justin Szaday; David C. Wong; David A. Padua; Alexandru Nicolau; Alexander V. Veidenbaum; Neftali Watkinson; Zehra Sura; Saeed Maleki; Josep Torrellas; Gerald DeJong

Although numerous loop optimization techniques have been designed and deployed in commercial compilers in the past, virtually no common experimental infrastructure or repository exists to help the compiler community evaluate the effectiveness of these techniques. This paper describes a repository, LORE, that maintains a large number of C-language for-loop nests extracted from popular benchmarks, libraries, and real applications. It also describes the infrastructure that builds and maintains the repository. Each loop nest in the repository has been compiled, transformed, executed, and measured independently. These loops cover a variety of properties that can be used by the compiler community to evaluate loop optimizations using a broad and representative collection of loops. To illustrate the usefulness of the repository, we also present two example applications. One is assessing the capabilities of the auto-vectorization features of three widely used compilers. The other is measuring the performance difference of a compiler across different versions. These applications demonstrate that the repository is valuable for identifying the strengths and weaknesses of a compiler and for quantitatively measuring the evolution of a compiler.
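To illustrate what a semantically equivalent mutation of a loop nest looks like, here is a hedged C example of one common source-level transformation, loop interchange; it mimics the kind of variants LORE stores but is not an actual repository entry.

```c
/* Two semantically equivalent versions of the same loop nest; a
 * LORE-style evaluation would compile, run, and time each variant. */
#define N 1024
double a[N][N], b[N][N];

/* Original: column-major traversal, poor spatial locality in C. */
void copy_original(void) {
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            a[i][j] = b[i][j];
}

/* Interchanged mutation: same computation, row-major traversal. */
void copy_interchanged(void) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            a[i][j] = b[i][j];
}
```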


International Conference on Parallel Processing | 2018

Power-Constrained Optimal Quality for High Performance Servers

Abdelhafid Mazouz; David C. Wong; David J. Kuck; William Jalby

Computer systems from HPC to data centers to PCs need to take running computations into account in order to maximize quality objectives while observing power constraints. We present PCOQ, a method that measures key parameters to control package (core + uncore) power, energy, and performance, using DVFS plus choices of prefetch, instruction set type, and core count. We discuss algorithms and show results on a set of HPC codelets, comparing our results to the race-to-halt and OnDemand governors. We also discuss a variation of PCOQ, an off-line approximation method for running applications. We show that the trade-off between energy savings and performance impact depends strongly on data locality: 18% and 50% CPU energy savings for LLC-resident and RAM-resident data sizes, respectively.
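As a concrete illustration of one control lever a DVFS-based method can pull, the sketch below caps a core's frequency through the Linux cpufreq sysfs interface (root required); the frequency value is an arbitrary example, and the decision logic PCOQ uses to choose it is not reproduced here.

```c
/* Sketch: capping cpu0's frequency via cpufreq sysfs. A PCOQ-style
 * controller would pick the cap from measured power/performance
 * parameters; 2.0 GHz here is an arbitrary illustration. */
#include <stdio.h>

static int cap_cpu0_khz(long khz) {
    FILE *f = fopen("/sys/devices/system/cpu/cpu0/cpufreq/scaling_max_freq", "w");
    if (!f)
        return -1;
    fprintf(f, "%ld", khz);
    return fclose(f);
}

int main(void) {
    if (cap_cpu0_khz(2000000) != 0)  /* 2,000,000 kHz = 2.0 GHz */
        perror("scaling_max_freq");
    return 0;
}
```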


Proceedings of the ACM on Programming Languages | 2018

An empirical study of the effect of source-level loop transformations on compiler stability

Zhangxiaowen Gong; Alexandru Nicolau; Josep Torrellas; Zhi Chen; Justin Szaday; David C. Wong; Zehra Sura; Neftali Watkinson; Saeed Maleki; David A. Padua; Alexander V. Veidenbaum

Modern compiler optimization is a complex process that offers no guarantees to deliver the fastest, most efficient target code. For this reason, compilers struggle to deliver stable performance across versions of code that carry out the same computation and differ only in the order of operations. This instability makes compilers much less effective program optimization tools and often forces programmers to carry out a brute-force search when tuning for performance. In this paper, we analyze the stability of the compilation process and the performance headroom of three widely used general-purpose compilers: GCC, ICC, and Clang. For the study, we extracted over 1,000 for-loop nests from well-known benchmarks, libraries, and real applications; then, we applied sequences of source-level loop transformations to these loop nests to create numerous semantically equivalent mutations; finally, we analyzed the impact of the transformations on code quality in terms of locality, dynamic instruction count, and vectorization. Our results show that, by applying source-to-source transformations and searching for the best vectorization setting, the percentage of loops sped up by at least 1.15x is 46.7% for GCC, 35.7% for ICC, and 46.5% for Clang, and on average the potential for performance improvement is estimated to be at least 23.7% for GCC, 18.1% for ICC, and 26.4% for Clang. Our stability analysis shows that, under our experimental setup, the average coefficient of variation of the execution time across all mutations is 18.2% for GCC, 19.5% for ICC, and 16.9% for Clang, and the highest coefficient of variation for a single loop nest reaches 118.9% for GCC, 124.3% for ICC, and 110.5% for Clang. We conclude that the evaluated compilers need further improvement before they can claim stable behavior.
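For reference, the stability metric reported above, the coefficient of variation, is simply the standard deviation of execution time divided by the mean; the sketch below computes it over a set of invented mutation timings.

```c
/* Coefficient of variation (stddev / mean) across mutation run times.
 * The sample times are invented for illustration. */
#include <math.h>
#include <stdio.h>

static double coeff_of_variation(const double *t, int n) {
    double mean = 0.0, var = 0.0;
    for (int i = 0; i < n; i++)
        mean += t[i];
    mean /= n;
    for (int i = 0; i < n; i++)
        var += (t[i] - mean) * (t[i] - mean);
    var /= n;
    return sqrt(var) / mean;
}

int main(void) {
    double times[] = { 1.00, 0.87, 1.15, 0.92, 1.31 };  /* hypothetical */
    printf("CV = %.1f%%\n", 100.0 * coeff_of_variation(times, 5));
    return 0;  /* compile with -lm */
}
```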


Archive | 2016

Evaluating Out-of-Order Engine Limitations Using Uop Flow Simulation

Vincent Palomares; David C. Wong; David J. Kuck; William Jalby

Out-of-order mechanisms in recent microarchitectures do a very good job of hiding latencies and improving performance. However, they come with limitations that are not easily modeled statically and are hard to quantify exactly even dynamically. This paper presents Uop Flow Simulation (UFS), a loop performance prediction technique that accounts for such restrictions by combining static analysis with cycle-driven simulation. UFS simulates the behavior of the execution pipeline when executing a loop. It handles instruction latencies, dependencies, out-of-order resource consumption, and other low-level details while completely ignoring semantics. We use a UFS prototype to validate our approach on Sandy Bridge with loops from real-world HPC applications, showing that it is both accurate and very fast (reaching simulation speeds of hundreds of thousands of cycles per second).
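A toy cycle-driven loop in the spirit of UFS is sketched below; every parameter (ROB size, issue width, latency) is invented, and real UFS additionally models dependencies, dispatch ports, and many other out-of-order resources.

```c
/* Toy cycle-driven simulation: uops occupy a finite reorder buffer,
 * issue at a fixed width, and retire in order after a fixed latency.
 * All parameters are invented for illustration. */
#include <stdio.h>

#define ROB_SIZE    8
#define ISSUE_WIDTH 2
#define LATENCY     3
#define TOTAL_UOPS  40

int main(void) {
    long retire_cycle[TOTAL_UOPS];  /* cycle at which each uop completes */
    int issued = 0, retired = 0;
    long cycle = 0;

    while (retired < TOTAL_UOPS) {
        /* retire in order: the oldest uop leaves once its latency elapsed */
        while (retired < issued && retire_cycle[retired] <= cycle)
            retired++;
        /* issue up to ISSUE_WIDTH uops if ROB entries are free */
        for (int w = 0; w < ISSUE_WIDTH; w++)
            if (issued < TOTAL_UOPS && issued - retired < ROB_SIZE)
                retire_cycle[issued++] = cycle + LATENCY;
        cycle++;
    }
    printf("%d uops in %ld cycles (%.2f uops/cycle)\n",
           TOTAL_UOPS, cycle, (double)TOTAL_UOPS / cycle);
    return 0;
}
```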


Dagstuhl Seminar Proceedings | 2007

07361 Introduction -- Programming Models for Ubiquitous Parallelism

David C. Wong; Albert Cohen; María Jesús Garzarán; Christian Lengauer; Samuel P. Midkiff


Dagstuhl Seminar Proceedings | 2007

07361 Abstracts Collection -- Programming Models for Ubiquitous Parallelism

David C. Wong; Albert Cohen; María Jesús Garzarán; Christian Lengauer; Samuel P. Midkiff

Collaboration


Dive into David C. Wong's collaborations.

Top Co-Authors

Zhi Chen

University of California


Jean Christophe Beyler

Versailles Saint-Quentin-en-Yvelines University
