
Publication


Featured research published by Amy Wang.


International Conference on Parallel Architectures and Compilation Techniques | 2005

Optimizing Compiler for the CELL Processor

Alexandre E. Eichenberger; Kathryn M. O'Brien; Peng Wu; Tong Chen; P.H. Oden; D.A. Prener; J.C. Shepherd; Byoungro So; Zehra Sura; Amy Wang; Tao Zhang; Peng Zhao; Michael Karl Gschwind

Developed for multimedia and game applications, as well as other numerically intensive workloads, the CELL processor provides support both for highly parallel codes, which have high computation and memory requirements, and for scalar codes, which require fast response time and a full-featured programming environment. This first-generation CELL processor implements on a single chip a Power Architecture processor with two levels of cache, and eight attached streaming processors with their own local memories and globally coherent DMA engines. In addition to processor-level parallelism, each processing element has a Single Instruction Multiple Data (SIMD) unit that can process from 2 double-precision floating-point values up to 16 single-byte values per instruction. This paper describes, in the context of a research prototype, several compiler techniques that aim at automatically generating high-quality code across the wide range of heterogeneous parallelism available on the CELL processor. The techniques include compiler-supported branch prediction, compiler-assisted instruction fetch, generation of scalar code on SIMD units, automatic generation of SIMD code, and data and code partitioning across the multiple processor elements in the system. Results indicate that significant speedup can be achieved with a high level of support from the compiler.


International Conference on Parallel Architectures and Compilation Techniques | 2012

Evaluation of Blue Gene/Q hardware support for transactional memories

Amy Wang; Matthew Gaudet; Peng Wu; José Nelson Amaral; Martin Ohmacht; Christopher Barton; Raul Esteban Silvera; Maged M. Michael

This paper describes an end-to-end system implementation of the transactional memory (TM) programming model on top of the hardware transactional memory (HTM) of the Blue Gene/Q (BG/Q) machine. The TM programming model supports most C/C++ programming constructs on top of a best-effort HTM with the help of a complete software stack including the compiler, the kernel, and the TM runtime. An extensive evaluation of the STAMP benchmarks on BG/Q is the first of its kind in understanding characteristics of running coarse-grained TM workloads on HTMs. The study reveals several interesting insights on the overhead and the scalability of BG/Q HTM with respect to sequential execution, coarse-grain locking, and software TM.


Symposium on Code Generation and Optimization | 2005

Efficient SIMD Code Generation for Runtime Alignment and Length Conversion

Peng Wu; Alexandre E. Eichenberger; Amy Wang

When generating code for today's multimedia extensions, one of the major challenges is dealing with memory alignment issues. While hand programming still yields the best-performing SIMD code, it is both time consuming and error prone. Compiler technology has greatly improved, including techniques that simdize loops with misaligned accesses by automatically rearranging misaligned memory streams in registers. Current techniques are applicable to runtime alignments, but they aggressively reduce the alignment overhead only when all alignments are known at compile time. This paper presents two major enhancements to the state of the art, improving both performance and coverage. First, we propose a novel technique to simdize loops with runtime alignment nearly as efficiently as those with compile-time misalignment. Runtime alignment is pervasive in real applications because it is either part of the algorithm or an artifact of the compiler's inability to extract accurate alignment information from complex applications. Second, we incorporate length conversion operations, e.g., conversions between data of different sizes, into the alignment handling framework. Length conversions are pervasive in multimedia applications, where mixed integer types are often used. Supporting length conversion can greatly improve the coverage of simdizable loops. Experimental results indicate that our runtime alignment technique achieves a 19% to 32% speedup increase over prior art on a benchmark stressing the impact of misaligned data. We also demonstrate speedup factors of up to 8.11 over sequential execution for real benchmarks.


International Conference on Supercomputing | 2005

An integrated simdization framework using virtual vectors

Peng Wu; Alexandre E. Eichenberger; Amy Wang; Peng Zhao

Automatic simdization for multimedia extensions faces several new challenges that are not present in traditional vectorization. Some of the new issues are due to the more restrictive SIMD architectures designed for multimedia extensions. Among them are alignment constraints, lack of memory gather and scatter support, and the short and fixed-length nature of SIMD vectors. Since these constraints affect some very basic components of a program, a compiler must not only provide solid solutions to individual issues, but also take an integrated approach to address these constraints in combination. In this paper, we propose a simdization framework that addresses several orthogonal aspects of simdization, such as alignment handling, simdization of loops with mixed data lengths, and SIMD parallelism extraction from different program scopes (from basic blocks to inner loops). The novelty of this framework is its ability to facilitate interactions between different techniques based on the simple intermediate representation of virtual vectors. Measurements on a PPC970 with a VMX SIMD unit indicate speedup factors of up to 8.11 for numerical/video/communication kernels and speedup factors of up to 2.16 for benchmarks, when automatic simdization is turned on.


IEEE International Conference on High Performance Computing Data and Analytics | 2012

What scientific applications can benefit from hardware transactional memory?

Martin Schindewolf; Barna L. Bihari; John C. Gyllenhaal; Martin Schulz; Amy Wang; Wolfgang Karl

Achieving efficient and correct synchronization of multiple threads is a difficult and error-prone task at small scale and, as we march towards extreme-scale computing, will be even more challenging when the resulting application must utilize millions of cores efficiently. Transactional Memory (TM) is a promising technique to ease the burden on the programmer, but it has only recently become available on commercial hardware, in the new Blue Gene/Q system, and hence its real benefit for realistic applications has not yet been studied. This paper presents the first performance results of TM embedded into OpenMP on a prototype system of BG/Q and characterizes code properties that will likely lead to benefits when augmented with TM primitives. We first study the influence of thread count, environment variables, and memory layout on TM performance and identify code properties that will yield performance gains with TM. Second, we evaluate the combination of OpenMP with multiple synchronization primitives on top of MPI to determine suitable task-to-thread ratios per node. Finally, we condense our findings into a set of best practices. These are applied to a Monte Carlo Benchmark and a Smoothed Particle Hydrodynamics method. In both cases an optimized TM version, executed with 64 threads on one node, outperforms a simple TM implementation. MCB with optimized TM yields a speedup of 27.45 over the baseline.


IBM Journal of Research and Development | 2013

IBM Blue Gene/Q system software stack

Kyung Dong Ryu; Todd Inglett; Ralph Bellofatto; M. A. Blocksome; Thomas Gooding; Sameer Kumar; A. R. Mamidala; Mark Megerian; S. Miller; M. T. Nelson; Bryan S. Rosenburg; Brian E. Smith; J. Van Oosten; Amy Wang; Robert W. Wisniewski

The principal focus areas for system software on the IBM Blue Gene®/Q include ultrascalability and high reliability while delivering the full performance capability of the hardware to applications. The Blue Gene/Q system software has achieved these goals while adding functionality and flexibility compared with previous versions of Blue Gene®. Whereas part of the software stack was improved with innovative evolutionary progress, such as unified sub-block partitioning and the ability to overcommit hardware threads, other areas, such as transactional memory and speculative execution, represent a revolutionary step forward. In this paper, we describe the overall software architecture of Blue Gene/Q. We then describe each of the main components of the software stack. In each area, we focus on the major enhancements introduced in the Blue Gene/Q system software stack.


International Workshop on OpenMP | 2012

A case for including transactions in OpenMP II: hardware transactional memory

Barna L. Bihari; Michael Wong; Amy Wang; Bronis R. de Supinski; Wang Chen

We present recent results using Hardware Transactional Memory (HTM) on IBM's Blue Gene/Q system. By showing how this latest TM system can significantly reduce the complexity of shared-memory programming while retaining efficiency, we continue to make our case that the OpenMP language specification should include transactional language constructs. Furthermore, we argue for its support as an advanced abstraction to support mutable shared state, thus expanding OpenMP synchronization capabilities. Our results demonstrate how TM can be used to simplify modular parallel programming in OpenMP while maintaining parallel performance. We show performance advantages in the BUSTM (Benchmark for UnStructured-mesh Transactional Memory) model using the transactional memory hardware implementation on Blue Gene/Q.


IBM Journal of Research and Development | 2013

IBM Blue Gene/Q memory subsystem with speculative execution and transactional memory

Martin Ohmacht; Amy Wang; Thomas Gooding; Ben J. Nathanson; Indira Nair; Geert Janssen; Marcel Schaal; Burkhard Steinmacher-Burow

The memory subsystem of the IBM Blue Gene®/Q Compute chip features multi-versioning and access conflict detection. Its ordered and unordered transaction modes implement both speculative execution (SE) and transactional memory (TM). Blue Gene/Q's large shared second-level cache serves as storage for speculative versions, allowing up to 30 MB of speculative state for the 64 threads of a Blue Gene/Q node, which in the extreme can be associated with a single large transaction. Using the shared access to speculative data, the SE model implements forwarding, allowing data produced by one thread to be accessed by another thread while both are still speculative. This paper presents an overview of Blue Gene/Q's approach to TM and SE: the memory subsystem hardware and operating system extensions, IBM XL compiler support via OpenMP® extensions, and a cost estimation model for executing code speculatively. The model is validated using synthetic benchmarks.


IEEE Transactions on Computers | 2015

Software Support and Evaluation of Hardware Transactional Memory on Blue Gene/Q

Amy Wang; Matthew Gaudet; Peng Wu; Martin Ohmacht; José Nelson Amaral; Christopher Barton; Raul Esteban Silvera; Maged M. Michael

This paper describes an end-to-end system implementation of a transactional memory (TM) programming model on top of the hardware transactional memory (HTM) of the Blue Gene/Q machine. The TM programming model supports most C/C++ programming constructs using a best-effort HTM and the help of a complete software stack including the compiler, the kernel, and the TM runtime. An extensive evaluation of the STAMP and the RMS-TM benchmark suites on BG/Q is the first of its kind in understanding characteristics of running TM workloads on real hardware TM. The study reveals several interesting insights on the overhead and the scalability of BG/Q HTM with respect to sequential execution, coarse-grain locking, and software TM.


High Performance Computational Finance | 2013

Optimizing IBM Algorithmics' Mark-to-future Aggregation engine for real-time counterparty credit risk scoring

Amy Wang; Jan Treibig; Bob Blainey; Peng Wu; Yaoqing Gao; Barnaby Dalton; Danny Gupta; Fahham Khan; Neil Bartlett; Lior Velichover; James Sedgwick; Louis Ly

The concept of default and its associated painful repercussions have been a particular area of focus for financial institutions, especially after the 2007/2008 global financial crisis. Counterparty credit risk (CCR), i.e., the risk associated with a counterparty defaulting prior to the expiration of a contract, has gained a tremendous amount of attention, resulting in new CCR measures and regulations being introduced. In particular, users would like to measure the potential impact of each real-time trade, or potential real-time trade, against exposure limits for the counterparty using Monte Carlo simulations of the trade value, and also to calculate the Credit Value Adjustment (i.e., how much it will cost to cover the risk of default with this particular counterparty if/when the trade is made). These rapid limit checks and CVA calculations demand more compute power from the hardware. Furthermore, with the emergence of electronic trading, the extremely low latency and high throughput required for real-time computation push both software and hardware capabilities to the limit. Our work focuses on optimizing the computation of risk measures and trade processing in the existing Mark-to-future Aggregation (MAG) engine in the IBM Algorithmics product offering. We propose a new software approach that speeds up end-to-end trade processing based on pre-compilation. The net result is an impressive speedup of 3--5x over the existing MAG engine on a real client workload, for processing trades that perform limit checks and CVA reporting on exposures while taking full collateral modelling into account.
