
Publication


Featured research published by Charles R. Yount.


IEEE Transactions on Computers | 1996

A methodology for the rapid injection of transient hardware errors

Charles R. Yount; Daniel P. Siewiorek

Ultra-dependable computing demands verification of fault-tolerant mechanisms in the hardware. The most popular class of verification methodologies, fault-injection, suffers from a host of limitations: methods that are rapid enough to be feasible are not based on actual hardware faults, while methods that are based on gate-level faults require enormous time resources. This research bridges that gap by developing a new fault-injection methodology for processors based on a register-transfer-language (RTL) fault model. The fault model is developed by abstracting the effects of low-level faults to the RTL level. This process attempts to be independent of implementation details without sacrificing coverage: the proportion of errors generated by gate-level faults that are successfully reproduced by the RTL fault model. A prototype tool, ASPHALT, is described which automates the process of generating the error patterns. The IBM RISC-Oriented Micro-Processor (ROMP) is used as the basis for experimentation. Over 1.5 million transient faults are injected using a gate-level model. Over 97% of these are reproduced with the RTL model at a speedup factor of over 500:1. These results show that the RTL fault model may be used to greatly accelerate fault-injection experiments without sacrificing accuracy.
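
To make the idea concrete, here is a minimal sketch of transient-fault injection at the register-transfer level: run a program twice on a toy register-file model, flip one bit of one register at a chosen cycle in the second run, and compare final states. The toy ISA and all names are hypothetical; ASPHALT itself operates on a full RTL model of the ROMP.

```python
# Toy RTL-level fault injection: flip one register bit at one cycle and
# compare against a golden run. Illustrative only; ASPHALT works on a
# real RTL model of the IBM ROMP, not this hypothetical mini-ISA.

WORD = 0xFFFFFFFF  # 32-bit registers

def run(program, fault_cycle=None, fault_reg=None, fault_bit=None):
    """Execute a tiny register-transfer program, optionally injecting a
    transient single-bit upset into one register at a given cycle."""
    regs = {f"r{i}": 0 for i in range(4)}
    for cycle, (op, dst, a, b) in enumerate(program):
        if op == "li":      # load immediate
            regs[dst] = a & WORD
        elif op == "add":
            regs[dst] = (regs[a] + regs[b]) & WORD
        elif op == "xor":
            regs[dst] = regs[a] ^ regs[b]
        if cycle == fault_cycle:
            regs[fault_reg] ^= 1 << fault_bit   # the transient fault
    return regs

program = [("li", "r1", 5, None), ("li", "r2", 7, None),
           ("add", "r3", "r1", "r2"), ("xor", "r0", "r3", "r1")]

golden = run(program)
faulty = run(program, fault_cycle=1, fault_reg="r1", fault_bit=3)
print("fault masked" if faulty == golden else "fault manifested as an error")
```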


IEEE International Conference on High Performance Computing, Data, and Analytics | 2016

Effective use of large high-bandwidth memory caches in HPC stencil computation via temporal wave-front tiling

Charles R. Yount; Alejandro Duran

Stencil computation is an important class of algorithms used in a large variety of scientific-simulation applications. The performance of stencil calculations is often bounded by memory bandwidth. High-bandwidth memory (HBM) on devices such as those in the Intel® Xeon Phi™ x200 processor family (code-named Knights Landing) can thus provide additional performance. In a traditional sequential time-step approach, the additional bandwidth can be best utilized when the stencil data fits into the HBM, restricting the problem sizes that can be undertaken and under-utilizing the larger DDR memory on the platform. As problem sizes become significantly larger than the HBM, the effective bandwidth approaches that of the DDR, degrading performance. This paper explores the use of temporal wave-front tiling to add an additional layer of cache-blocking to allow efficient use of both the HBM bandwidth and the DDR capacity. Details of the cache-blocking and wave-front tiling algorithms are given, and results on a Xeon Phi processor are presented, comparing performance across problem sizes and among four experimental configurations. Analyses of the bandwidth utilization and HBM-cache hit rates are also provided, illustrating the correlation between these metrics and performance. It is demonstrated that temporal wave-front tiling can provide a 2.4x speedup compared to using HBM cache without temporal tiling and a 3.3x speedup compared to only using DDR memory for large problem sizes.
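
The following is a minimal 1D sketch of temporal wave-front tiling, not the paper's production algorithm: each skewed (parallelogram) tile carries its slab of points through all time steps before the sweep moves on, so the working set can stay resident in fast memory across time steps. The stencil, tile size, and boundary handling are illustrative assumptions, and the full space-time array exists only to keep the sketch easy to verify against a naive sweep.

```python
import numpy as np

def stencil(prev_row, x):
    # 3-point averaging stencil (illustrative choice)
    return (prev_row[x - 1] + prev_row[x] + prev_row[x + 1]) / 3.0

def naive(u0, T):
    """Reference: a full spatial sweep per time step."""
    U = np.zeros((T + 1, len(u0)))
    U[0] = u0
    for t in range(1, T + 1):
        U[t, 0], U[t, -1] = U[t - 1, 0], U[t - 1, -1]  # fixed boundaries
        for x in range(1, len(u0) - 1):
            U[t, x] = stencil(U[t - 1], x)
    return U[T]

def wavefront_tiled(u0, T, tile=8):
    """Skewed (parallelogram) temporal tiling: each tile advances its slab
    of points through all T time steps before the sweep moves right, so the
    slab can stay resident in fast memory across time steps."""
    N = len(u0)
    U = np.zeros((T + 1, N))
    U[0] = u0
    U[1:, 0], U[1:, -1] = u0[0], u0[-1]          # fixed boundaries
    for x0 in range(1, N - 1 + T, tile):         # tiles, left to right
        for t in range(1, T + 1):                # tile shifts left by 1 per step
            lo = max(1, x0 - t + 1)
            hi = min(N - 1, x0 - t + 1 + tile)
            for x in range(lo, hi):
                U[t, x] = stencil(U[t - 1], x)
    return U[T]

u0 = np.random.rand(64)
assert np.allclose(naive(u0, 10), wavefront_tiled(u0, 10))
print("tiled sweep matches the naive sweep")
```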


International Symposium on Performance Analysis of Systems and Software | 2008

Characterization of SPEC CPU2006 and SPEC OMP2001: Regression Models and their Transferability

Elmoustapha Ould-Ahmed-Vall; Charles R. Yount; James Woodlee

Analysis of workload execution and identification of software and hardware performance barriers provide critical engineering benefits; these include guidance on software optimization, hardware design tradeoffs, configuration tuning, and comparative assessments for platform selection. This paper uses model trees to build statistical regression models for the SPEC CPU2006 and SPEC OMP2001 suites. These models link performance to key microarchitectural events. The models provide detailed recipes for identifying the key performance factors for each suite and for determining the contribution of each factor to performance. The paper discusses how the models can be used to understand the behaviors of the two suites on a modern processor. These models are applied to obtain a detailed performance characterization of each benchmark suite and its member workloads and to identify the commonalities and distinctions among the performance factors that affect each of the member workloads within the two suites. This paper also addresses the issue of model transferability. It explores the question: how useful is an existing performance model (built on a given suite of workloads) for studying the performance of different workloads or suites of workloads? A performance model built using data from workload suite P is considered transferable to workload suite Q if it can be used to accurately study the performance of workload suite Q. Statistical methodologies to assess model transferability are discussed. In particular, the paper explores the use of two-sample hypothesis tests and prediction-accuracy analysis techniques to assess model transferability. It is found that a model trained using only 10% of the SPEC CPU2006 data is transferable to the remaining data. This finding also holds for SPEC OMP2001. In contrast, it is found that the SPEC CPU2006 model is not transferable to SPEC OMP2001 and vice versa.
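
As an illustration of the transferability experiment, the sketch below trains a regression model on 10% of one synthetic "suite" and tests it both on the remainder of that suite and on a second suite with a different memory-cost structure. The event mix, coefficients, and data are fabricated for illustration, and sklearn's DecisionTreeRegressor stands in for the paper's model trees, which produce piecewise-linear rather than piecewise-constant fits.

```python
import numpy as np
from sklearn.metrics import r2_score
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)

def synth_suite(n, miss_penalty):
    """Fabricated per-interval event rates (per 1K instructions) and the
    CPI they induce; stands in for measured SPEC performance-counter data."""
    X = rng.uniform(0, 20, size=(n, 3))  # [L2 misses, br. mispredicts, TLB misses]
    cpi = 0.5 + 0.01 * miss_penalty * X[:, 0] + 0.02 * X[:, 1] + 0.005 * X[:, 2]
    return X, cpi + rng.normal(0, 0.02, n)

# Train on 10% of "suite P", echoing the paper's transferability experiment.
Xp, yp = synth_suite(2000, miss_penalty=3.0)
n_train = len(Xp) // 10
model = DecisionTreeRegressor(min_samples_leaf=20).fit(Xp[:n_train], yp[:n_train])
print("P -> P R^2:", r2_score(yp[n_train:], model.predict(Xp[n_train:])))

# "Suite Q" has different memory behavior: the P model transfers poorly.
Xq, yq = synth_suite(2000, miss_penalty=8.0)
print("P -> Q R^2:", r2_score(yq, model.predict(Xq)))
```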


IEEE International Conference on High Performance Computing, Data, and Analytics | 2016

YASK (Yet Another Stencil Kernel): a framework for HPC stencil code-generation and tuning

Charles R. Yount; Josh Tobin; Alexander Breuer; Alejandro Duran

Stencil computation is an important class of algorithms used in a large variety of scientific-simulation applications. While the code for many problems can certainly be written in a straightforward manner in a high-level language, this often results in sub-optimal performance on modern computing platforms. On the other hand, adding advanced optimizations such as multi-level loop interchanges and vector-folding allows the code to perform better, but at the expense of reducing readability, maintainability, and portability. This paper describes the YASK (Yet Another Stencil Kernel) framework that simplifies the tasks of defining stencil functions, generating high-performance code targeted especially for Intel® Xeon® and Intel® Xeon Phi™ processors, and running tuning experiments. The features of the framework are described, including domain-specific languages (DSLs), code generators for stencil-equation and loop code, and a genetic-algorithm-based automated tuning tool. Two practical use-cases are illustrated with real-world examples: the standalone YASK kernel is used to tune an isotropic 3D finite-difference stencil, and the generated YASK code is integrated into an external earthquake simulator.
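
The sketch below only illustrates the general DSL-to-code-generation pattern: a stencil is declared once as symbolic (coefficient, offset) terms, and a code generator emits the inner loop. YASK's real DSL is embedded in C++ and its generators apply far more machinery (vector-folding, loop interchange, threading), so everything here is a hypothetical miniature.

```python
# Hypothetical miniature of the DSL-to-code-generation pattern; YASK's
# actual DSL is C++-embedded and its generators are far more elaborate.

def define_stencil():
    """A 1D 3-point stencil declared as (coefficient, offset) pairs."""
    return [(0.25, -1), (0.5, 0), (0.25, +1)]

def generate_c(stencil, grid="u", out="v"):
    """Emit a C-style inner loop for the declared stencil."""
    terms = " + ".join(f"{c} * {grid}[i + ({o})]" for c, o in stencil)
    return f"for (int i = 1; i < n - 1; ++i)\n    {out}[i] = {terms};"

print(generate_c(define_stencil()))
```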


International Supercomputing Conference | 2017

Accelerating Seismic Simulations Using the Intel Xeon Phi Knights Landing Processor

Josh Tobin; Alexander Breuer; Alexander Heinecke; Charles R. Yount; Yifeng Cui

In this work we present AWP-ODC-OS, an end-to-end optimization of AWP-ODC targeting homogeneous, manycore supercomputers. AWP-ODC is an established community software package simulating seismic wave propagation using a staggered finite-difference scheme that is fourth-order accurate in space and second-order accurate in time. Recent production simulations, e.g. using the software for the computation of seismic hazard maps, largely relied on GPU-accelerated supercomputers. In contrast, our work gives a comprehensive overview of the required steps to achieve near-optimal performance on the Intel® Xeon Phi™ x200 processor (code-named Knights Landing), and compares our competitive performance results to the most recent GPU architectures.
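
For readers unfamiliar with the discretization, below is a minimal 1D velocity-stress sketch of a staggered-grid scheme that is fourth-order in space and second-order (leapfrog) in time, the same family of discretization AWP-ODC uses in 3D. The material constants, grid setup, and source are illustrative assumptions, not values from AWP-ODC.

```python
import numpy as np

# Hypothetical 1D velocity-stress leapfrog with 4th-order staggered
# spatial differences; constants and setup are illustrative, not AWP-ODC's.

C1, C2 = 9.0 / 8.0, -1.0 / 24.0  # standard 4th-order staggered-grid weights

def d4(f, i):
    """4th-order staggered first difference centered between f[i] and f[i+1]."""
    return C1 * (f[i + 1] - f[i]) + C2 * (f[i + 2] - f[i - 1])

n, dt, dx, rho, mu = 200, 1e-3, 1.0, 1.0, 1.0
v = np.zeros(n)      # particle velocity at integer grid points
s = np.zeros(n)      # stress at half-offset (staggered) points
s[n // 2] = 1.0      # initial stress pulse

for _ in range(500):                             # 2nd-order leapfrog in time
    for i in range(2, n - 2):
        v[i] += dt / (rho * dx) * d4(s, i - 1)   # dv/dt = (1/rho) ds/dx
    for i in range(2, n - 2):
        s[i] += dt * mu / dx * d4(v, i)          # ds/dt = mu dv/dx
print("peak velocity after 500 steps:", abs(v).max())
```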


International Symposium on Performance Analysis of Systems and Software | 2015

Graph-matching-based simulation-region selection for multiple binaries

Charles R. Yount; Harish Patil; Mohammad Shahedul Islam; Aditya Srikanth

Comparison of simulation-based performance estimates of program binaries built with different compiler settings or targeted at variants of an instruction set architecture is essential for software/hardware co-design and similar engineering activities. Commonly-used sampling techniques for selecting simulation regions do not ensure that samples from the various binaries being compared represent the same source-level work, leading to biased speedup estimates and difficulty in comparative performance debugging. The task of creating equal-work samples is made difficult by differences in structure and execution paths across multiple binaries, such as variations in libraries, inlining, and loop-iteration counts. Such complexities are addressed in this work by first applying an existing graph-matching technique to call and loop graphs for multiple binaries for the same source program. Then, a new sequence-alignment algorithm is applied to execution traces from the various binaries, using the graph-matching results to define intervals of equal work. A basic-block profile generated for these matched intervals can then be used for phase-detection and simulation-region selection across all binaries simultaneously. The resulting selected simulation regions match both in number and the work done across multiple binaries. The application of this technique is demonstrated on binaries compiled for different Intel 64 Architecture instruction-set extensions. Quality metrics for speedup estimation and an example of applying the data for performance debugging are presented.
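
A small sketch of the sequence-alignment step: a Needleman-Wunsch-style global alignment over two traces of region IDs, where the match relation between regions would in practice come from the call/loop-graph matching. The traces, scores, and "matched" table here are hypothetical stand-ins.

```python
# Hypothetical global alignment of two region-ID traces. In the paper, the
# match relation between regions comes from call/loop-graph matching; here
# it is a hand-written stand-in table.

matched = {("A1", "B1"), ("A2", "B2"), ("A3", "B3")}

def align(p, q, gap=-1, match=2, mismatch=-2):
    """Needleman-Wunsch over two traces; returns the matched region pairs."""
    n, m = len(p), len(q)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if (p[i - 1], q[j - 1]) in matched else mismatch
            score[i][j] = max(score[i - 1][j - 1] + s,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
    pairs, i, j = [], n, m               # trace back through the table
    while i > 0 and j > 0:
        s = match if (p[i - 1], q[j - 1]) in matched else mismatch
        if score[i][j] == score[i - 1][j - 1] + s:
            if s == match:
                pairs.append((p[i - 1], q[j - 1]))
            i, j = i - 1, j - 1
        elif score[i][j] == score[i - 1][j] + gap:
            i -= 1
        else:
            j -= 1
    return pairs[::-1]

# The second binary's trace has an extra inlined region "B4" with no match.
print(align(["A1", "A2", "A3"], ["B1", "B4", "B2", "B3"]))
```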


Archive | 1994

Software-Implemented Fault Injection of Transient Hardware Errors

Charles R. Yount; Daniel P. Siewiorek

As computer applications extend to areas which require extreme dependability, their designs mandate the ability to operate in the presence of faults. The problem of assuring that the design goals are achieved requires the observation and measurement of fault behavior parameters under various input conditions. One means to characterize systems is fault injection, but injection of internal faults is difficult due to the complexity and level of integration of contemporary VLSI implementations. This chapter explores the effects of gate-level faults on system operation as a basis for fault models at the program level.


Future Generation Computer Systems | 2017

Multi-level spatial and temporal tiling for efficient HPC stencil computation on many-core processors with large shared caches

Charles R. Yount; Alejandro Duran; Josh Tobin

Stencil computation is an important class of algorithms used in a large variety of scientific-simulation applications, especially those arising from finite-difference numerical solutions to differential equations representing the behavior of physical phenomena such as seismic activity. The performance of stencil calculations is often bounded by memory bandwidth, and such code benefits from vectorization and tiling techniques to reuse data as much as possible once it is loaded from memory. These tiling algorithms are especially crucial for many-core CPU products that contain caches local to the individual cores, and this work provides a review of the use of techniques such as vector-folding and spatial tiling to maximize per-core cache resources. Recent many-core products also include special memory with much higher bandwidth than traditional DDR memory that is intended to provide additional performance for bandwidth-limited applications. On such platforms that also include DDR, the high-bandwidth RAM may be configurable either as separately addressable memory or as a large shared cache for the DDR. Examples of platforms with this feature include those containing products in the Intel® Xeon Phi™ x200 processor family (code-named Knights Landing), which use Multi-Channel DRAM (MCDRAM) technology to provide the higher bandwidth memory resources. In traditional sequential time-step stencil algorithms, the additional bandwidth can most easily be exploited when the stencil data fits into the faster memory, restricting the problem sizes that can be undertaken and under-utilizing the larger DDR memory on the platform. As stencil problem sizes become significantly larger than the fast-memory capacity, the sequential time-step algorithms create an overwhelming number of misses from the fast-memory shared cache, and the effective bandwidth approaches that of the DDR, significantly degrading performance. This paper illustrates this effect and explores the application of temporal wave-front tiling to alleviate it, simultaneously leveraging both the large cache's bandwidth and the DDR capacity. Two example applications are used to illustrate the optimizations: a single-grid isotropic approximation to the wave equation and a staggered-grid formulation for earthquake simulation. Details of the various tiling algorithms are given for both applications, and results on a Xeon Phi processor are presented, comparing performance across problem sizes and among four experimental configurations. Analyses of the bandwidth utilization and MCDRAM-cache hit rates are provided for one of the example applications, illustrating the correlation between these metrics and performance. It is demonstrated that temporal wave-front tiling can provide up to a 2.4x speedup compared to using the fast-memory cache without temporal tiling and a 3.3x speedup compared to only using DDR memory for large problem sizes on the isotropic application. Respective speedups of 1.9x and 2.8x are demonstrated for the staggered-grid application.
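
Complementing the temporal wave-front sketch above, here is a minimal example of the spatial (cache-blocking) tiling this paper layers beneath it: a 2D sweep computed block by block so each block is reused from cache across its inner iterations. The stencil and block sizes are illustrative choices, not tuned values from the paper.

```python
import numpy as np

def blocked_sweep(a, out, bi=32, bj=32):
    """One 5-point-stencil sweep computed tile by tile; each (bi x bj)
    block is traversed completely before moving on, improving cache reuse.
    Block sizes are illustrative, not tuned values."""
    n, m = a.shape
    for i0 in range(1, n - 1, bi):
        for j0 in range(1, m - 1, bj):
            for i in range(i0, min(i0 + bi, n - 1)):
                for j in range(j0, min(j0 + bj, m - 1)):
                    out[i, j] = 0.2 * (a[i, j] + a[i - 1, j] + a[i + 1, j]
                                       + a[i, j - 1] + a[i, j + 1])

a = np.random.rand(128, 128)
out = a.copy()
blocked_sweep(a, out)
```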


79th EAGE Conference and Exhibition | 2017

Optimizing Fully Anisotropic Elastic Propagation on 2nd Generation Intel Xeon Phi Processors

Albert Farrés; Alejandro Duran; Claudia Rosas; Mauricio Hanzich; Charles R. Yount; Santiago Fernández

This work shows several optimization strategies evaluated and applied to an elastic wave-propagation engine, based on a Fully Staggered Grid, running on the latest Intel Xeon Phi processors, the second generation of the product (code-named Knights Landing). Our fully optimized code shows a speed-up of about 4x when compared with the same algorithm optimized for the previous-generation processor.


Archive | 2011

Instruction and logic to provide vector horizontal compare functionality

Elmoustapha Ould-Ahmed-Vall; Charles R. Yount; Suleyman Sair
