David J. Kuck
Intel
Publication
Featured research published by David J. Kuck.
High-Performance Scientific Computing | 2012
William Jalby; David C. Wong; David J. Kuck; Jean-Thomas Acquaviva; Jean Christophe Beyler
Computer performance improvement embraces many issues, but is severely hampered by existing approaches that examine only one or a few topics at a time. Each problem solved leads to another saturation point and a new serious problem. In the most frustrating cases, solving some problems exacerbates others and achieves no net performance gain. This paper discusses how to measure a large computational load globally, using as much architectural detail as needed. Besides the traditional goals of sequential and parallel system performance, these methods are useful for energy optimization.
rapid simulation and performance evaluation methods and tools | 2013
Jose Noudohouenou; Vincent Palomares; William Jalby; David C. Wong; David J. Kuck; Jean Christophe Beyler
HW/SW codesign or computer system purchase involves many tradeoffs, including the problem data size, choice of algorithm and compiler, types of HW subsystems used, clock frequencies of each, and number of cores. Simsys is a fast simulation tool set for examining various combinations of these choices, allowing specific HW/SW performance attributions. Simsys's measurement level and approach are key to its operating speed and attribution.
A combination of modular tools forms Simsys's automatic procedure for system simulation and analysis. The paper overviews the tools and validates the proposed approach on 27 loop-nest codelets extracted from Numerical Recipes. It also includes the experimental method and an error analysis.
Three performance quality metrics are defined and evaluated for two simple codelets, demonstrating several modes of performance failure and the weakness of intuition in detecting them, as well as illustrating how better tools could lead to better computer systems. Future Simsys plans include model enhancement with more HW details and much more extensive experimentation.
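The abstract does not name the three quality metrics, but the validation it describes rests on comparing simulated against measured behavior per codelet. A minimal sketch of that kind of error analysis follows; the codelet names and cycle counts are illustrative assumptions, not the paper's data:

```python
# Hypothetical simulated-vs-measured error analysis for a set of codelets.
# Names and cycle counts are illustrative, not taken from the paper.
codelets = {
    "toeplz_like_1": (1520, 1475),   # (measured cycles, simulated cycles)
    "svdcmp_like_6": (9840, 10210),
    "realft_like_4": (3310, 3190),
}

for name, (measured, simulated) in codelets.items():
    rel_err = abs(simulated - measured) / measured * 100.0
    print(f"{name}: measured={measured} simulated={simulated} error={rel_err:.1f}%")

# Aggregate accuracy: mean absolute percentage error across the codelet set.
mape = sum(abs(s - m) / m for m, s in codelets.values()) / len(codelets) * 100.0
print(f"MAPE: {mape:.1f}%")
```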
High-Performance Scientific Computing | 2012
David J. Kuck
This paper proposes a fast, novel approach for the HW/SW codesign of computer systems based on a computational capacity model. System node bandwidths and bandwidths used by the SW load underlie three sets of linear equations: a model system representing a load running on a computer, a design equation and objective function with goals as inputs, and a capacity sensitivity equation. These are augmented with nonlinear techniques to analyze multirate HW nodes as well as to synthesize system nodes when codesign goals exceed feasible engineering HW choices. Solving the equations rapidly finds the optimal costs of a broad class of architectures for a given computational load. The performance of each component can be determined globally and for each computational phase. The ideas are developed theoretically and illustrated by numerical examples plus results produced by a prototype CAPE tool implementation.
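The abstract does not spell out the equations, but a hedged sketch of what a bandwidth-based capacity formulation can look like is given below; the symbols B_i, beta_i, w_i, and c_i are assumptions for illustration, not the paper's notation:

```latex
% Hedged sketch of a capacity-style formulation (assumed notation):
%   B_i    engineered (peak) bandwidth of HW node i
%   beta_i bandwidth the SW load actually uses on node i
%   w_i    work the load places on node i (bytes or operations)
%   c_i    engineering cost of provisioning bandwidth B_i
\[
  \beta_i \le B_i \qquad \text{(no node can exceed its HW bandwidth)}
\]
\[
  T = \max_i \frac{w_i}{\beta_i} \qquad \text{(the most loaded node bounds running time)}
\]
\[
  \min \sum_i c_i(B_i) \quad \text{subject to} \quad T \le T_{\mathrm{goal}}
\]
```

Read one way (fixed B_i), such a system models a load running on a given computer; read the other way (B_i as unknowns under the cost objective), it becomes a design problem, mirroring the model/design/sensitivity split described above.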
international conference on performance engineering | 2017
Abdelhafid Mazouz; David C. Wong; David J. Kuck; William Jalby
This paper presents an empirical approach to measuring and modeling the energy consumption of multicore processors. The modeling approach allows us to find a breakdown of the energy consumption among a set of key hardware components, also called HW nodes. We explicitly model the front-end and the back-end in terms of the number of instructions executed. We also model the L1, L2, and L3 caches. Furthermore, we explicitly model the static and dynamic energy consumed by the uncore and core components. From a software perspective, our methodology allows us to correlate energy to the executed code, which helps find opportunities for code optimization and tuning. We use binary analysis and hardware counters for performance characterization. Although we use the on-chip counters (RAPL) for energy measurement, our methodology does not rely on a specific method for energy measurement; it is thus portable and easy to deploy in various computing environments. We validate our energy model using two Intel processors with a set of HPC codelets, where data sizes are varied so that data come from the L1, L2, and L3 caches, and show a 3% average modeling error. We present a comprehensive analysis, show energy consumption differences between kernels, and relate those differences to the algorithms they implement. Finally, we discuss how vectorization leads to energy savings compared to non-vectorized codes.
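As a hedged sketch of how a counter-based energy breakdown of this kind can be fit (the node set, activity counts, and energy values below are illustrative assumptions, not the paper's model or data):

```python
import numpy as np

# Hedged sketch of a linear, counter-based energy breakdown (illustrative,
# not the paper's exact formulation): total energy is modeled as a weighted
# sum of per-node activity counts plus a static term proportional to runtime.
# Columns: front-end uops, back-end uops, L1/L2/L3 accesses, runtime seconds.
activity = np.array([
    [1.0e9, 1.0e9, 4.0e8, 2.0e7, 1.0e6, 0.50],
    [2.1e9, 2.0e9, 7.5e8, 9.0e7, 4.0e6, 0.95],
    [0.8e9, 0.9e9, 3.1e8, 1.5e7, 8.0e5, 0.41],
    [1.6e9, 1.5e9, 6.0e8, 5.0e7, 2.5e6, 0.72],
    [3.0e9, 2.9e9, 1.1e9, 1.3e8, 6.0e6, 1.40],
    [0.5e9, 0.6e9, 2.0e8, 9.0e6, 5.0e5, 0.28],
    [2.5e9, 2.4e9, 9.0e8, 1.0e8, 5.0e6, 1.10],
    [1.2e9, 1.1e9, 4.5e8, 3.0e7, 1.5e6, 0.55],
])
measured_joules = np.array([12.1, 24.9, 9.8, 17.6, 35.2, 6.9, 28.0, 13.5])

# Least-squares fit: one energy cost per node event, plus static power (W)
# appearing as the coefficient of the runtime column.
coeffs, *_ = np.linalg.lstsq(activity, measured_joules, rcond=None)

predicted = activity @ coeffs
rel_err = np.abs(predicted - measured_joules) / measured_joules
print(f"mean modeling error: {rel_err.mean():.1%}")
```

Because the fit is agnostic to where the energy numbers come from, swapping RAPL for an external power meter changes only the measurement column, which is the portability argument the abstract makes.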
international conference on parallel processing | 2018
Abdelhafid Mazouz; David C. Wong; David J. Kuck; William Jalby
Computer systems from HPC to data centers to PCs need to take running computations into account in order to maximize quality objectives while observing power constraints. We present PCOQ, a method that measures key parameters to control package (core + uncore) power, energy, and performance, using DVFS plus choices of prefetch, instruction set type, and core count. We discuss algorithms and show results on a set of HPC codelets, comparing our results to the race-to-halt and OnDemand governors. We also discuss a variation of PCOQ, an off-line approximation method for running applications. We show that energy savings vs. performance impact depend strongly on data locality: 18% vs. 50% CPU energy savings for LLC and RAM data sizes, respectively.
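A hedged sketch of the kind of tradeoff such a method navigates (this is not the published PCOQ algorithm; the frequencies, runtimes, and energies below are illustrative assumptions):

```python
# Toy DVFS selection: pick the frequency that minimizes package energy
# while keeping the slowdown vs. the fastest setting within a tolerance.
# Measurements are illustrative: frequency (GHz) -> (seconds, joules).
measurements = {
    3.5: (1.00, 40.0),
    2.8: (1.10, 31.0),
    2.2: (1.35, 27.5),
    1.6: (1.80, 29.0),   # too slow: static energy starts to dominate
}

max_slowdown = 0.25  # accept up to 25% longer runtime
t_best = min(t for t, _ in measurements.values())

feasible = {f: (t, e) for f, (t, e) in measurements.items()
            if t <= t_best * (1.0 + max_slowdown)}
f_opt = min(feasible, key=lambda f: feasible[f][1])
print(f"chosen frequency: {f_opt} GHz, energy: {feasible[f_opt][1]} J")
```

The non-monotonic energy column reflects the locality effect the abstract reports: when time stretches enough, static energy erodes the dynamic savings, so the best frequency depends on where the data lives.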
Archive | 2016
Vincent Palomares; David C. Wong; David J. Kuck; William Jalby
Out-of-order mechanisms in recent microarchitectures do a very good job at hiding latencies and improving performance. However, they come with limitations not easily modeled statically, and hard to quantify exactly even dynamically. This paper will present Uop Flow Simulation (UFS), a loop performance prediction technique accounting for such restrictions by combining static analysis and cycle-driven simulation. UFS simulates the behavior of the execution pipeline when executing a loop. It handles instruction latencies, dependencies, out-of-order resource consumption and other low-level details while completely ignoring semantics. We will use a UFS prototype to validate our approach on Sandy Bridge using loops from real-world HPC applications, showing it is both accurate and very fast (reaching simulation speeds of hundreds of thousands of cycles per second).
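A toy, hedged illustration of cycle-driven uop simulation in this spirit (not the UFS implementation; the ports, latencies, and uop chain are assumptions for illustration):

```python
# Toy cycle-driven uop simulator: each uop has a latency, a required
# execution port, and data dependencies; issue is gated by port
# availability and operand readiness, ignoring semantics entirely.
from dataclasses import dataclass

@dataclass
class Uop:
    name: str
    port: int          # execution port this uop needs
    latency: int       # cycles until its result is ready
    deps: tuple        # names of uops it depends on

# A tiny dependent chain: load -> multiply -> add (values illustrative).
uops = [
    Uop("load", port=2, latency=4, deps=()),
    Uop("mul",  port=0, latency=5, deps=("load",)),
    Uop("add",  port=1, latency=1, deps=("mul",)),
]

ready_at = {}                    # uop name -> cycle its result is available
port_free = {0: 0, 1: 0, 2: 0}   # cycle at which each port is next free
cycle = 0
pending = list(uops)

while pending:
    issued = []
    for u in pending:
        deps_ready = all(ready_at.get(d, float("inf")) <= cycle for d in u.deps)
        if deps_ready and port_free[u.port] <= cycle:
            ready_at[u.name] = cycle + u.latency
            port_free[u.port] = cycle + 1   # one uop per port per cycle
            issued.append(u)
    for u in issued:
        pending.remove(u)
    cycle += 1

print(f"chain completes at cycle {max(ready_at.values())}")
```

A real simulator of this class also models finite out-of-order resources (ROB entries, reservation stations, load/store buffers), which is exactly the hard-to-quantify restriction the paper targets.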
ieee international symposium on workload characterization | 2015
Al Rashid; Bob Kuhn; Bijan Arbab; David J. Kuck
For 25 years, industry-standard benchmarks have proliferated, attempting to approximate user activities. This has helped drive the success of PCs to commodity levels by characterizing apps for designers and offering performance information for users. However, the many new configurations of each PC release cycle often leave users unsure about how to choose one. This paper takes a different approach, with tools based on new metrics to analyze real usage by millions of people. Our goal is to develop a methodology for deeper understanding of usage that can help designers satisfy users. These metrics demonstrate that usage differs consistently between high- and low-end CPU-based systems, regardless of why a user bought a given system. We outline how this data can be used to partition markets and make more effective hardware (HW) and software (SW) design decisions, tailoring systems for prospective markets.
international conference on parallel architectures and compilation techniques | 2013
David J. Kuck
Summary form only given. Energy/performance results for parallel (and sequential) computing are still usually hard to predict and often disappointing. A model using invariant-based equations is being applied to predict energy/performance as HW and SW are changed in codesign studies. The physical model consists of HW nodes chosen to match architectural issues, together with automatically extracted SW codelets that are easy to measure and model. HW/SW measurements of computational capacity (BW used) and power [based on HW counters and SW modification (Decan)] are used by the Cape tool to evaluate tradeoffs quickly and find optimal solutions to various codesign problems. Codelets from a number of real applications are being analyzed and modeled.
international conference on supercomputing | 2002
David J. Kuck
Building HPC systems from small, cost-effective production nodes has been a favored engineering approach for many years. Technology has driven a variety of solutions over time. Today, commodity SMP nodes and interconnection networks, together with highly evolved programming models and parallel software engineering tools, make a compelling combination. An overview of these topics will be presented, together with some specific solution examples. Open problems and issues will be discussed.
acm sigplan symposium on principles and practice of parallel programming | 2001
David J. Kuck
Applications of concurrent and parallel computing continue to expand faster than programming models and tools to support them. We will present an applications taxonomy and some approaches to developing software that can help motivate developers and move the field ahead.