Network


Latest external collaborations at the country level.

Hotspot


Dive into the research topics where Tipp Moseley is active.

Publications


Featured research published by Tipp Moseley.


ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming | 2008

FastForward for efficient pipeline parallelism: a cache-optimized concurrent lock-free queue

John Giacomoni; Tipp Moseley; Manish Vachharajani

Low-overhead core-to-core communication is critical for efficient pipeline-parallel software applications. This paper presents FastForward, a cache-optimized single-producer/single-consumer concurrent lock-free queue for pipeline parallelism on multicore architectures, supporting weakly to strongly ordered memory consistency models. Enqueue and dequeue times on a 2.66 GHz Opteron 2218-based system are as low as 28.5 ns, up to 5x faster than the next best solution. FastForward's effectiveness is demonstrated for real applications by applying it to line-rate soft network processing on Gigabit Ethernet with general-purpose commodity hardware.
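
The core mechanism can be sketched in a few lines: the queue is a ring of slots in which a null slot means "empty", so the producer and the consumer each keep a private index and never share head/tail counters. The sketch below is an illustration under assumptions of my own (pointer-sized payloads, C++11 atomics standing in for the paper's architecture-specific handling of memory ordering); it omits FastForward's cache-line tuning and temporal slipping.

// ffqueue_sketch.hpp -- illustrative FastForward-style SPSC lock-free queue.
// Items must be non-null pointers, since nullptr is the "slot empty" marker.
#include <atomic>
#include <cstddef>

template <std::size_t N>                     // N must be a power of two
class FFQueue {
    static_assert((N & (N - 1)) == 0, "capacity must be a power of two");
    std::atomic<void*> slots_[N];
    std::size_t head_ = 0;                   // touched only by the producer
    std::size_t tail_ = 0;                   // touched only by the consumer
public:
    FFQueue() {
        for (auto& s : slots_) s.store(nullptr, std::memory_order_relaxed);
    }
    bool enqueue(void* item) {               // producer side
        if (slots_[head_].load(std::memory_order_acquire) != nullptr)
            return false;                    // queue full
        slots_[head_].store(item, std::memory_order_release);
        head_ = (head_ + 1) & (N - 1);
        return true;
    }
    bool dequeue(void** item) {              // consumer side
        void* v = slots_[tail_].load(std::memory_order_acquire);
        if (v == nullptr)
            return false;                    // queue empty
        *item = v;
        slots_[tail_].store(nullptr, std::memory_order_release);
        tail_ = (tail_ + 1) & (N - 1);
        return true;
    }
};

Because the indices are core-private and only slot contents cross the producer/consumer boundary, the hot control variables never ping-pong between caches, which is what makes this style of queue cache-friendly.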


IEEE Transactions on Dependable and Secure Computing | 2009

PLR: A Software Approach to Transient Fault Tolerance for Multicore Architectures

Alex Shye; Joseph Blomstedt; Tipp Moseley; Vijay Janapa Reddi; Daniel A. Connors

Transient faults are emerging as a critical concern in the reliability of general-purpose microprocessors. As architectural trends point toward multicore designs, there is substantial interest in adapting such parallel hardware resources for transient fault tolerance. This paper presents process-level redundancy (PLR), a software technique for transient fault tolerance, which leverages multiple cores for low overhead. PLR creates a set of redundant processes per application process and systematically compares the processes to guarantee correct execution. Redundancy at the process level allows the operating system to freely schedule the processes across all available hardware resources. PLR uses a software-centric approach to transient fault tolerance, which shifts the focus from ensuring correct hardware execution to ensuring correct software execution. As a result, many benign faults that do not propagate to affect program correctness can be safely ignored. A real prototype is presented that is designed to be transparent to the application and can run on general-purpose single-threaded programs without modifications to the program, operating system, or underlying hardware. The system is evaluated for fault coverage and performance on a four-way SMP machine and provides improved performance over existing software transient fault tolerance techniques with a 16.9 percent overhead for fault detection on a set of optimized SPEC2000 binaries.
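
A heavily simplified sketch of the redundancy-and-compare step, under assumptions of my own: the protected "application" is a single function, redundant copies are created with fork(), and a single output value is compared over a pipe. Real PLR wraps unmodified binaries, replicates their inputs, and compares at the system-call boundary before output becomes externally visible.

// plr_sketch.cpp -- process-level redundancy in miniature (illustrative only).
#include <sys/wait.h>
#include <unistd.h>
#include <cstdio>

static long compute() {                      // stand-in for the protected application
    long sum = 0;
    for (long i = 0; i < 1000000; ++i) sum += i;
    return sum;                              // a transient bit flip would change this
}

int main() {
    const int kCopies = 3;                   // redundant processes per application process
    int fds[kCopies][2];
    for (int i = 0; i < kCopies; ++i) {
        if (pipe(fds[i]) != 0) return 1;
        if (fork() == 0) {                   // redundant copy: run, emit output, exit
            long r = compute();
            if (write(fds[i][1], &r, sizeof r) != (ssize_t)sizeof r) _exit(1);
            _exit(0);
        }
    }
    long results[kCopies];                   // monitor: gather and compare the outputs
    for (int i = 0; i < kCopies; ++i) {
        if (read(fds[i][0], &results[i], sizeof results[i]) != (ssize_t)sizeof results[i])
            return 1;
        wait(nullptr);
    }
    bool agree = true;
    for (int i = 1; i < kCopies; ++i)
        if (results[i] != results[0]) agree = false;
    if (agree) printf("outputs agree: %ld\n", results[0]);
    else       printf("output mismatch: transient fault detected\n");
    return agree ? 0 : 1;
}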


Dependable Systems and Networks | 2007

Using Process-Level Redundancy to Exploit Multiple Cores for Transient Fault Tolerance

Alex Shye; Tipp Moseley; Vijay Janapa Reddi; Joseph Blomstedt; Daniel A. Connors

Transient faults are emerging as a critical concern in the reliability of general-purpose microprocessors. As architectural trends point towards multi-threaded multi-core designs, there is substantial interest in adapting such parallel hardware resources for transient fault tolerance. This paper proposes a software-based multi-core alternative for transient fault tolerance using process-level redundancy (PLR). PLR creates a set of redundant processes per application process and systematically compares the processes to guarantee correct execution. Redundancy at the process level allows the operating system to freely schedule the processes across all available hardware resources. PLR's software-centric approach to transient fault tolerance shifts the focus from ensuring correct hardware execution to ensuring correct software execution. As a result, PLR ignores many benign faults that do not propagate to affect program correctness. A real PLR prototype for running single-threaded applications is presented and evaluated for fault coverage and performance. On a 4-way SMP machine, PLR provides improved performance over existing software transient fault tolerance techniques with 16.9% overhead for fault detection on a set of optimized SPEC2000 binaries.


Symposium on Code Generation and Optimization | 2007

Shadow Profiling: Hiding Instrumentation Costs with Parallelism

Tipp Moseley; Alex Shye; Vijay Janapa Reddi; Dirk Grunwald; Ramesh Peri

In profiling, a tradeoff exists between information and overhead. For example, hardware-sampling profilers incur negligible overhead, but the information they collect is consequently very coarse. Other profilers use instrumentation tools to gather temporal traces such as path profiles and hot memory streams, but they have high overhead. Runtime and feedback-directed compilation systems need detailed information to aggressively optimize, but the cost of gathering profiles can outweigh the benefits. Shadow profiling is a novel method for sampling long traces of instrumented code in parallel with normal execution, taking advantage of the trend of increasing numbers of cores. Each instrumented sample can be many millions of instructions in length. The primary goal is to incur negligible overhead, yet attain profile information that is nearly as accurate as a perfect profile. The profiler requires no modifications to the operating system or hardware, and is tunable to allow for greater coverage or lower overhead. We evaluate the performance and accuracy of this new profiling technique for two common types of instrumentation-based profiles: interprocedural path profiling and value profiling. Overall, profiles collected using the shadow profiling framework are 94% accurate versus perfect value profiles, while incurring less than 1% overhead. Consequently, this technique increases the viability of dynamic and continuous optimization systems by hiding the high overhead of instrumentation and enabling the online collection of many types of profiles that were previously too costly.
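
The mechanism lends itself to a short sketch: at each sampling point the running process forks, and the child executes an instrumented copy of the next region of work while the parent continues at full speed. Everything below (the work function, the "instrumentation", the sampling interval) is a placeholder of my own; the actual system forks from inside a dynamic binary instrumentation tool and handles system-call side effects in the shadow, which is omitted here.

// shadow_sketch.cpp -- fork-based shadow profiling, illustrative only.
#include <sys/wait.h>
#include <unistd.h>
#include <cstdio>
#include <map>

static std::map<int, long> value_profile;             // filled only in shadow processes

static int work_step(int i) { return (i * i) % 7; }   // placeholder application work

static void run_shadow(int begin, int end) {          // instrumented copy of the region
    for (int i = begin; i < end; ++i)
        ++value_profile[work_step(i)];                 // record observed values
    fprintf(stderr, "shadow: profiled [%d,%d), %zu distinct values\n",
            begin, end, value_profile.size());
    _exit(0);                                          // the shadow never runs ahead
}

int main() {
    const int kSampleEvery = 1000, kTotal = 5000;
    for (int i = 0; i < kTotal; ++i) {
        if (i % kSampleEvery == 0 && fork() == 0)      // child becomes the shadow
            run_shadow(i, i + kSampleEvery);
        work_step(i);                                  // parent: uninstrumented, full speed
    }
    while (wait(nullptr) > 0) {}                       // reap finished shadows
    return 0;
}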


International Conference on Computer Design | 2005

Methods for modeling resource contention on simultaneous multithreading processors

Tipp Moseley; Joshua L. Kihm; Daniel A. Connors; Dirk Grunwald

Simultaneous multithreading (SMT) seeks to improve the computation throughput of a processor core by sharing primary resources such as functional units, issue bandwidth, and caches. SMT designs increase utilization and generally improve overall throughput, but the amount of improvement is highly dependent on competition for shared resources between the scheduled threads. This variability has implications for operating system scheduling, simulation techniques, and fairness. Although these areas recognize the importance of thread interaction, existing techniques do little to profile or predict it. The modeling approach presented in this paper uses data collected from performance counters on two different hardware implementations of Pentium 4 hyper-threading processors to demonstrate the effects of thread interaction. Techniques are described for fitting linear regression models and recursive partitioning models that use the counters to make online predictions of performance (expressed as instructions per cycle); these predictions can be used by the operating system to guide scheduling decisions. A detailed analysis of the effectiveness of each of these techniques is presented.
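
As a toy illustration of the modeling step, the sketch below fits a one-variable least-squares model that predicts a thread's IPC from a single counter-derived feature (a hypothetical co-runner cache-miss rate, with made-up numbers). The paper fits multivariate linear regressions and recursive-partitioning trees over many counters; only the basic idea of "counters in, IPC prediction out" is shown.

// ipc_model.cpp -- toy linear model: IPC ~ a + b * co-runner miss rate.
#include <cstdio>
#include <utility>
#include <vector>

int main() {
    // (co-runner miss rate, observed IPC) pairs -- invented sample data.
    std::vector<std::pair<double, double>> obs = {
        {0.01, 1.45}, {0.03, 1.31}, {0.06, 1.12}, {0.09, 0.98}, {0.12, 0.83}};
    double sx = 0, sy = 0, sxx = 0, sxy = 0, n = obs.size();
    for (auto [x, y] : obs) { sx += x; sy += y; sxx += x * x; sxy += x * y; }
    double slope = (n * sxy - sx * sy) / (n * sxx - sx * sx);   // closed-form least squares
    double intercept = (sy - slope * sx) / n;
    printf("IPC ~= %.3f + %.3f * miss_rate\n", intercept, slope);
    printf("predicted IPC at miss_rate = 0.05: %.3f\n", intercept + slope * 0.05);
    return 0;
}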


Workshop on Interaction between Compilers and Computer Architectures (INTERACT) | 2005

Analysis of path profiling information generated with performance monitoring hardware

Alex Shye; Matthew Iyer; Tipp Moseley; David Hodgdon; Dan Fay; Vijay Janapa Reddi; Daniel A. Connors

Even with the breakthroughs in semiconductor technology that enable billion-transistor designs, hardware-based architecture paradigms alone cannot substantially improve processor performance. The challenge in realizing the full potential of these future machines is to find ways to adapt program behavior to application needs and processor resources. As such, run-time optimization has a distinct role in future high-performance systems. However, as these systems are dependent on accurate, fine-grain profile information, traditional approaches to collecting profiles at run time result in significant slowdowns during program execution. A novel approach to low-overhead profiling is to exploit the hardware performance monitoring units (PMUs) present in modern microprocessors. The Itanium-2 PMU can periodically sample the last few taken branches in an executing program, and this information can be used to recreate partial paths of execution. With compiler-aided analysis, the partial paths can be correlated into full paths. Because statistically hot paths are most likely to occur in PMU samples, even infrequent sampling can accurately identify these paths. While traditional path profiling techniques carry a high overhead, a PMU-based path profiler represents an effective lightweight profiling alternative. This paper characterizes the PMU-based path information and demonstrates the construction of such a PMU-based path profiler for a run-time system.
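
A small sketch of the sampling side, with everything invented for illustration: each PMU sample is treated as a short list of taken branches (source, target), consecutive branches in a sample are concatenated into a partial-path key, and frequently seen keys identify hot partial paths. The compiler-aided correlation of partial paths into full paths, which the paper describes, is not shown.

// btb_paths.cpp -- counting hot partial paths from branch-trace samples (illustrative).
#include <cstdint>
#include <cstdio>
#include <map>
#include <string>
#include <utility>
#include <vector>

using Branch = std::pair<uint64_t, uint64_t>;          // (branch PC, target PC)

static std::string partial_path(const std::vector<Branch>& sample) {
    std::string key;
    for (auto [src, dst] : sample) {                   // concatenate the taken-branch edges
        char buf[48];
        snprintf(buf, sizeof buf, "%llx->%llx;",
                 (unsigned long long)src, (unsigned long long)dst);
        key += buf;
    }
    return key;
}

int main() {
    // Three fake PMU samples; the first partial path occurs twice, so it is "hot".
    std::vector<std::vector<Branch>> samples = {
        {{0x400a10, 0x400a40}, {0x400a58, 0x400b00}},
        {{0x400a10, 0x400a40}, {0x400a58, 0x400b00}},
        {{0x400c20, 0x400c90}, {0x400ca8, 0x400a10}}};
    std::map<std::string, int> counts;
    for (const auto& s : samples) ++counts[partial_path(s)];
    for (const auto& [path, n] : counts)
        printf("%4d  %s\n", n, path.c_str());
    return 0;
}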


Computing Frontiers | 2005

Dynamic run-time architecture techniques for enabling continuous optimization

Tipp Moseley; Alex Shye; Vijay Janapa Reddi; Matthew Iyer; Dan Fay; David Hodgdon; Joshua L. Kihm; Alex Settle; Dirk Grunwald; Daniel A. Connors

Future computer systems will integrate tens of multithreaded processor cores on a single chip die, resulting in hundreds of concurrent program threads sharing system resources. These designs will be the cornerstone of improving throughput in high-performance computing and server environments. However, to date, appropriate systems software (operating system, run-time system, and compiler) technologies for these emerging machines have not been adequately explored. Future processors will require sophisticated hardware monitoring units to continuously feed resource utilization information back to the operating system, allowing it to make optimal thread co-scheduling decisions, and to software that continuously optimizes the program itself. Nevertheless, in order to continually and automatically adapt system resources to program behaviors and application needs, specific run-time information must be collected to adequately enable dynamic code optimization and operating system scheduling. Generally, run-time optimization is limited by the time required to collect profiles, the time required to perform optimization, and the inherent benefits of any optimization or decision. Initial techniques for effectively utilizing run-time information for dynamic optimization and informed thread scheduling in future multithreaded architectures are presented.


International Conference on Parallel Architectures and Compilation Techniques | 2007

FastForward for Efficient Pipeline Parallelism

John Giacomoni; Tipp Moseley; Manish Vachharajani

High-rate core-to-core communication is critical for efficient pipeline-parallel software architectures. This paper introduces FastForward, a software-only, low-overhead, high-rate queue algorithm for pipeline parallelism on multicore architectures. FastForward uses an architecturally tuned, domain-specific adaptation of concurrent lock-free queues to provide low-latency and low-overhead core-to-core communication. Enqueue and dequeue times on a 2 GHz Opteron 270-based system are as low as 36 ns, up to 4x faster than Lamport's solution.


IEEE International Symposium on Workload Characterization | 2007

Seekable Compressed Traces

Tipp Moseley; Dirk Grunwald; Ramesh Peri

Program traces are commonly used for purposes such as profiling, processor simulation, and program slicing. Uncompressed, these traces are often too large to store on disk. Although existing trace compression algorithms achieve high compression rates, they sacrifice the accessibility of uncompressed traces; typical compressed traces must be traversed linearly to reach a desired position in the stream. This paper describes seekable compressed traces that allow arbitrary positioning in the compressed data stream. Furthermore, we enhance existing value-prediction-based techniques to achieve higher compression rates, particularly for difficult-to-compress traces. Our base algorithm achieves a harmonic mean compression rate for SPEC2000 memory address traces that is 3.47 times better than existing methods. We introduce the concept of seekpoints, which enable fast seeking to positions evenly distributed throughout a compressed trace. Adding seekpoints enables rapid sampling and backwards traversal of compressed traces. At a granularity of one seekpoint every 10 million instructions, seekpoints increase trace sizes by an average factor of only 2.65.
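
The seekpoint mechanism itself is easy to sketch. Below, an address trace is delta-encoded, but every K records the encoder state is reset and the current output offset is recorded in an index, so decoding can begin at any seekpoint rather than at the start of the stream. The paper's encoder is value-prediction based and far stronger than this delta coder; K, the data, and the encoding are all placeholders of mine.

// seekable_trace.cpp -- seekpoints over a toy delta-encoded address trace.
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <vector>

struct Seekpoint { std::size_t record_index, output_offset; };

int main() {
    const std::size_t K = 4;                           // records per seekpoint
    std::vector<uint64_t> trace = {100, 104, 108, 116, 200, 204, 208, 216, 300};
    std::vector<int64_t>  encoded;                     // deltas; absolute values at seekpoints
    std::vector<Seekpoint> index;
    uint64_t prev = 0;
    for (std::size_t i = 0; i < trace.size(); ++i) {
        if (i % K == 0) {                              // seekpoint: reset encoder state
            index.push_back({i, encoded.size()});      // real streams are variable-length,
            encoded.push_back((int64_t)trace[i]);      // so the offset must be stored
        } else {
            encoded.push_back((int64_t)(trace[i] - prev));
        }
        prev = trace[i];
    }
    // "Seek" to record 5: jump to the covering seekpoint, then decode forward.
    std::size_t target = 5, sp = target / K;
    uint64_t value = 0;
    for (std::size_t j = index[sp].output_offset, i = index[sp].record_index;
         i <= target; ++j, ++i)
        value = (i % K == 0) ? (uint64_t)encoded[j] : value + (uint64_t)encoded[j];
    printf("record %zu = %llu (expected %llu)\n", target,
           (unsigned long long)value, (unsigned long long)trace[target]);
    return 0;
}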


International Conference on Parallel Architectures and Compilation Techniques | 2009

Chainsaw: Using Binary Matching for Relative Instruction Mix Comparison

Tipp Moseley; Dirk Grunwald; Ramesh Peri

With advances in hardware, instruction set architectures are undergoing continual evolution. As a result, compilers are under constant pressure to adapt and take full advantage of available features. However, current techniques for evaluating relative compiler performance only compare profiles at the application level, ignoring relative performance differences at finer granularities. To ensure that new features are put to good use, a more rigorous approach is necessary. A fundamental step in tuning compiler performance is identifying the specific code regions that can be improved. To solve this problem, we present Chainsaw, a compiler-independent binary matching technique that compares executions of differently compiled programs and identifies intervals where their behavior can be meaningfully compared. Matched intervals can be automatically analyzed to identify anomalous segments of execution where one version performs significantly differently from the other. We present case studies using Chainsaw to identify significant performance anomalies between differently compiled codes.
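
The comparison step, separated from the matching itself, can be illustrated briefly: assuming intervals of the two program versions have already been matched one-to-one (the part Chainsaw's binary matching actually solves, not shown here), each matched pair's instruction mixes are compared and pairs that diverge beyond a threshold are flagged as anomalous. The categories, numbers, distance metric, and threshold below are all placeholders of mine.

// mix_compare.cpp -- flagging anomalous matched intervals by instruction-mix distance.
#include <array>
#include <cstddef>
#include <cstdio>
#include <vector>

using Mix = std::array<double, 3>;           // fraction of {loads, stores, branches}

static double l1(const Mix& a, const Mix& b) {
    double d = 0;
    for (std::size_t i = 0; i < a.size(); ++i)
        d += (a[i] > b[i]) ? a[i] - b[i] : b[i] - a[i];
    return d;
}

int main() {
    // Matched interval pairs: versionA[i] corresponds to versionB[i].
    std::vector<Mix> versionA = {{{0.30, 0.10, 0.15}}, {{0.25, 0.12, 0.20}}, {{0.40, 0.05, 0.10}}};
    std::vector<Mix> versionB = {{{0.31, 0.10, 0.14}}, {{0.26, 0.11, 0.21}}, {{0.22, 0.05, 0.30}}};
    const double kThreshold = 0.10;          // arbitrary anomaly threshold
    for (std::size_t i = 0; i < versionA.size(); ++i) {
        double d = l1(versionA[i], versionB[i]);
        printf("interval %zu: mix distance %.3f%s\n", i, d,
               d > kThreshold ? "  <-- anomalous" : "");
    }
    return 0;
}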

Collaboration


Dive into Tipp Moseley's collaborations.

Top Co-Authors

Dirk Grunwald (University of Colorado Boulder)
Alex Shye (Northwestern University)
Daniel A. Connors (University of Colorado Boulder)
John Giacomoni (University of Colorado Boulder)
Manish Vachharajani (University of Colorado Denver)
Dan Fay (University of Colorado Boulder)
David Hodgdon (University of Colorado Boulder)
Joseph Blomstedt (University of Colorado Boulder)