Daniel A. Connors
University of Colorado Boulder
Publications
Featured research published by Daniel A. Connors.
international symposium on microarchitecture | 2005
Qiang Wu; Margaret Martonosi; Douglas W. Clark; Vijay Janapa Reddi; Daniel A. Connors; Youfeng Wu; Jin Lee; David M. Brooks
Dynamic voltage and frequency scaling (DVFS) is an effective technique for controlling microprocessor energy and performance. Existing DVFS techniques are primarily based on hardware, OS time-interrupts, or static-compiler techniques. However, substantially greater gains can be realized when control opportunities are also explored in a dynamic compilation environment. There are several advantages to deploying DVFS and managing energy/performance tradeoffs through the use of a dynamic compiler. Most importantly, dynamic-compiler-driven DVFS is fine-grained, code-aware, and adaptive to the current microarchitecture environment. This paper presents a design framework for a run-time DVFS optimizer in a general dynamic compilation system. A prototype of the DVFS optimizer is implemented and integrated into an industrial-strength dynamic compilation system. The resulting optimization system is deployed on a real hardware platform that directly measures CPU voltage and current for accurate power and energy readings. Experimental results, based on physical measurements for over 40 SPEC and Olden benchmarks, show that significant energy savings are achieved with little performance degradation. SPEC2K FP benchmarks achieve energy savings of up to 70% (with 0.5% performance loss). In addition, SPEC2K INT benchmarks show up to 44% energy savings (with 5% performance loss), SPEC95 FP benchmarks save up to 64% (with 4.9% performance loss), and Olden benchmarks save up to 61% (with 4.5% performance loss). On average, the technique leads to an energy-delay product (EDP) improvement 3x to 5x better than static voltage scaling, and more than 2x (22% vs. 9%) better than the reported DVFS results of prior static-compiler work. While the proposed technique is an effective method for microprocessor voltage and frequency control, the design framework and methodology described in this paper have broader potential to address other energy and power issues such as di/dt and thermal control.
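To make the flavor of such a run-time decision concrete, the sketch below picks the lowest frequency whose predicted slowdown for a memory-bound region stays within a performance-loss budget. It uses a common analytical DVFS model (CPU-bound time scales with frequency, memory-stall time does not); the names, frequency steps, and the model itself are illustrative assumptions, not the paper's actual algorithm.

```cpp
#include <cstdio>
#include <vector>

// Hypothetical DVFS heuristic: execution time is modeled as a
// frequency-scaled CPU component plus a frequency-independent
// memory-stall component.
struct Region {
    double cpu_time;  // seconds of CPU-bound work at f_max
    double mem_time;  // seconds stalled on memory (frequency independent)
};

double predicted_slowdown(const Region& r, double f, double f_max) {
    double base = r.cpu_time + r.mem_time;
    double scaled = r.cpu_time * (f_max / f) + r.mem_time;
    return scaled / base - 1.0;  // fractional performance loss
}

double choose_frequency(const Region& r,
                        const std::vector<double>& freqs,  // ascending, GHz
                        double loss_budget) {              // e.g. 0.05 = 5%
    for (double f : freqs)
        if (predicted_slowdown(r, f, freqs.back()) <= loss_budget)
            return f;  // lowest frequency that respects the budget
    return freqs.back();
}

int main() {
    Region r{0.2, 0.8};  // heavily memory bound
    std::vector<double> freqs{1.0, 1.4, 1.8, 2.2};
    std::printf("chosen frequency: %.1f GHz\n",
                choose_frequency(r, freqs, 0.05));  // prints 1.8
}
```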
international symposium on computer architecture | 1998
David I. August; Daniel A. Connors; Scott A. Mahlke; John W. Sias; Kevin M. Crozier; Ben Chung Cheng; Patrick R. Eaton; Qudus B. Olaniran; Wen-mei W. Hwu
Explicitly Parallel Instruction Computing (EPIC) architectures require the compiler to express program instruction level parallelism directly to the hardware. EPIC techniques which enable the compiler to represent control speculation, data dependence speculation, and predication have individually been shown to be very effective. However, these techniques have not been studied in combination with each other. This paper presents the IMPACT EPIC Architecture to address the issues involved in designing processors based on these EPIC concepts. In particular, we focus on new execution and recovery models in which microarchitectural support for predicated execution is also used to enable efficient recovery from exceptions caused by speculatively executed instructions. This paper demonstrates that a coherent framework to integrate the three techniques can be elegantly designed to achieve much better performance than each individual technique could alone provide.
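As a rough illustration of the predication idea the paper builds on, the C++ below contrasts a branchy function with an if-converted version in which both paths are computed and a predicate selects the result, mirroring in software what predicated instructions do in hardware. It is a conceptual sketch, not IMPACT EPIC code.

```cpp
#include <cstdio>

// Branchy form: control flow decides which path runs, so the
// hardware must predict the branch.
int abs_branch(int x) {
    if (x < 0) return -x;
    return x;
}

// If-converted form: both candidate values are computed
// unconditionally and a predicate selects the result, the software
// analogue of guarding complementary instructions with predicates.
int abs_predicated(int x) {
    bool p = x < 0;        // compare "sets the predicate"
    int neg = -x;          // executes regardless of p
    int pos = x;           // executes regardless of p
    return p ? neg : pos;  // select; compilers typically emit cmov here
}

int main() {
    std::printf("%d %d\n", abs_branch(-7), abs_predicated(-7));
}
```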
IEEE Transactions on Computers | 1999
Teresa L. Johnson; Daniel A. Connors; Matthew C. Merten; Wen-mei W. Hwu
The growing disparity between processor and memory performance has made cache misses increasingly expensive. Additionally, data and instruction caches are not always used efficiently, resulting in large numbers of cache misses. Therefore, the importance of cache performance improvements at each level of the memory hierarchy will continue to grow. In numeric programs, there are several known compiler techniques for optimizing data cache performance. However, integer (nonnumeric) programs often have irregular access patterns that are more difficult for the compiler to optimize. In the past, cache management techniques such as cache bypassing were implemented manually at the machine-language-programming level. As the available chip area grows, it makes sense to spend more resources to allow intelligent control over cache management. In this paper, we present an approach to improving cache effectiveness that takes advantage of the growing chip area and applies run-time adaptive cache management techniques, optimizing both performance and implementation cost. Specifically, we aim to increase data cache effectiveness for integer programs. We propose a microarchitecture scheme where the hardware determines data placement within the cache hierarchy based on dynamic referencing behavior. This scheme is fully compatible with existing instruction set architectures. This paper examines the theoretical upper bounds on the cache hit ratio that cache bypassing can provide for integer applications, including several Windows applications with OS activity. Then, detailed trace-driven simulations of the integer applications are used to show that the implementation described in this paper can achieve performance close to that of the upper bound.
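A minimal sketch of what hardware-managed cache bypassing could look like, assuming a per-load-PC table of saturating counters; the table size, counter width, and training rule here are invented for illustration and are not the paper's mechanism.

```cpp
#include <array>
#include <cstdint>
#include <cstdio>

// Hypothetical bypass predictor: loads whose fetched lines show no
// reuse before eviction are bypassed, preserving cache space for
// data with temporal locality.
class BypassPredictor {
    std::array<uint8_t, 1024> ctr{};  // 2-bit saturating counters

    static size_t index(uint64_t pc) { return (pc >> 2) & 1023; }

public:
    // Consulted at access time: bypass when strongly "no reuse".
    bool should_bypass(uint64_t load_pc) const {
        return ctr[index(load_pc)] >= 3;
    }

    // Trained when a line fetched by load_pc is evicted: move toward
    // bypassing if the line was never reused while resident.
    void train(uint64_t load_pc, bool line_was_reused) {
        uint8_t& c = ctr[index(load_pc)];
        if (line_was_reused) { if (c > 0) --c; }
        else                 { if (c < 3) ++c; }
    }
};

int main() {
    BypassPredictor bp;
    // A streaming load at a made-up PC whose lines are never reused:
    for (int i = 0; i < 4; ++i) bp.train(0x400100, /*line_was_reused=*/false);
    std::printf("bypass streaming load? %s\n",
                bp.should_bypass(0x400100) ? "yes" : "no");
}
```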
IEEE Transactions on Dependable and Secure Computing | 2009
Alex Shye; Joseph Blomstedt; Tipp Moseley; Vijay Janapa Reddi; Daniel A. Connors
Transient faults are emerging as a critical concern in the reliability of general-purpose microprocessors. As architectural trends point toward multicore designs, there is substantial interest in adapting such parallel hardware resources for transient fault tolerance. This paper presents process-level redundancy (PLR), a software technique for transient fault tolerance, which leverages multiple cores for low overhead. PLR creates a set of redundant processes per application process and systematically compares the processes to guarantee correct execution. Redundancy at the process level allows the operating system to freely schedule the processes across all available hardware resources. PLR uses a software-centric approach to transient fault tolerance, which shifts the focus from ensuring correct hardware execution to ensuring correct software execution. As a result, many benign faults that do not propagate to affect program correctness can be safely ignored. A real prototype is presented that is designed to be transparent to the application and can run on general-purpose single-threaded programs without modifications to the program, operating system, or underlying hardware. The system is evaluated for fault coverage and performance on a four-way SMP machine and provides improved performance over existing software transient fault tolerance techniques with a 16.9 percent overhead for fault detection on a set of optimized SPEC2000 binaries.
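A toy rendition of the PLR idea, assuming POSIX fork/pipe and a single integer result per replica. Real PLR intercepts system calls and compares replicas at syscall boundaries; everything below is simplified for illustration.

```cpp
#include <cstdio>
#include <sys/wait.h>
#include <unistd.h>

static long compute() {  // stand-in for the protected program
    long sum = 0;
    for (long i = 1; i <= 1000000; ++i) sum += i;
    return sum;
}

int main() {
    const int kReplicas = 3;
    int pipes[kReplicas][2];
    for (int r = 0; r < kReplicas; ++r) {
        if (pipe(pipes[r]) != 0) { perror("pipe"); return 1; }
        if (fork() == 0) {        // redundant child process
            long result = compute();
            write(pipes[r][1], &result, sizeof result);
            _exit(0);
        }
    }
    long results[kReplicas];
    for (int r = 0; r < kReplicas; ++r) {
        read(pipes[r][0], &results[r], sizeof results[r]);
        wait(nullptr);
    }
    // Majority vote: with three replicas, a single faulty copy is
    // detected and out-voted.
    long voted = (results[0] == results[1]) ? results[0] : results[2];
    bool mismatch =
        !(results[0] == results[1] && results[1] == results[2]);
    std::printf("result=%ld%s\n", voted,
                mismatch ? " (fault detected in one replica)" : "");
}
```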
workshop on computer architecture education | 2004
Vijay Janapa Reddi; Alex Settle; Daniel A. Connors; Robert Cohn
Computer architecture embraces a tremendous number of ever-changing, interconnected concepts, yet computer architecture education is very often static, seemingly motionless. Computer architecture is commonly taught using simple piecewise methods of explaining how the hardware performs a given task, rather than characterizing the interaction of software and hardware. Visualization tools allow students to interactively explore basic concepts in computer architecture but are limited in their ability to engage students in research and design concepts. Likewise, although the development of simulation models such as caches, branch predictors, and pipelines aids student understanding of architecture components, such models are limited in the workloads they can examine because of issues with execution time and environment. Overall, to effectively understand modern architectures, it is essential to experiment with the characteristics of real application workloads. Likewise, understanding program behavior is necessary for effective programming, comprehension of architecture bottlenecks, and hardware design. Computer architecture education must include experience in analyzing program behavior and workload characteristics using effective tools. To explore workload characteristic analysis in computer architecture design, we propose using Pin, a binary instrumentation tool, for computer architecture research and education projects.
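Pin is a real, publicly documented tool, and its canonical starting point is an instruction-counting Pin tool along the following lines (adapted from the style of the examples in Pin's documentation; exact build steps depend on the Pin kit version):

```cpp
// Classic "instruction count" Pin tool; build with Pin's makefile
// and run as:  pin -t icount.so -- ./your_app
#include <iostream>
#include "pin.H"

static UINT64 icount = 0;

// Analysis routine: called before every executed instruction.
VOID docount() { icount++; }

// Instrumentation routine: called once per instruction when it is
// first translated; inserts a call to the analysis routine.
VOID Instruction(INS ins, VOID* v) {
    INS_InsertCall(ins, IPOINT_BEFORE, (AFUNPTR)docount, IARG_END);
}

VOID Fini(INT32 code, VOID* v) {
    std::cerr << "Total instructions executed: " << icount << std::endl;
}

int main(int argc, char* argv[]) {
    if (PIN_Init(argc, argv)) return 1;
    INS_AddInstrumentFunction(Instruction, 0);
    PIN_AddFiniFunction(Fini, 0);
    PIN_StartProgram();  // never returns
    return 0;
}
```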
dependable systems and networks | 2007
Alex Shye; Tipp Moseley; Vijay Janapa Reddi; Joseph Blomstedt; Daniel A. Connors
Transient faults are emerging as a critical concern in the reliability of general-purpose microprocessors. As architectural trends point towards multi-threaded multi-core designs, there is substantial interest in adapting such parallel hardware resources for transient fault tolerance. This paper proposes a software-based multi-core alternative for transient fault tolerance using process-level redundancy (PLR). PLR creates a set of redundant processes per application process and systematically compares the processes to guarantee correct execution. Redundancy at the process level allows the operating system to freely schedule the processes across all available hardware resources. PLR's software-centric approach to transient fault tolerance shifts the focus from ensuring correct hardware execution to ensuring correct software execution. As a result, PLR ignores many benign faults that do not propagate to affect program correctness. A real PLR prototype for running single-threaded applications is presented and evaluated for fault coverage and performance. On a 4-way SMP machine, PLR provides improved performance over existing software transient fault tolerance techniques with 16.9% overhead for fault detection on a set of optimized SPEC2000 binaries.
high-performance computer architecture | 2006
David A. Penry; Dan Fay; David Hodgdon; Ryan Wells; Graham Schelle; David I. August; Daniel A. Connors
Simulation is an important means of evaluating new microarchitectures. Current trends toward chip multiprocessors (CMPs) challenge the ability of designers to develop efficient simulators. CMP simulation speed can be improved by exploiting parallelism in the CMP simulation model. This may be done either by running the simulation on multiple processors or by integrating multiple processors into the simulation to replace simulated processors. Doing so usually requires tedious manual parallelization or re-design to encapsulate processors. This paper presents techniques to perform automated simulator parallelization and hardware integration for CMP structural models. We show that automated parallelization can achieve a 7.60x speedup for a 16-processor CMP model on a conventional 4-processor shared-memory multiprocessor. We demonstrate the power of hardware integration by integrating eight hardware PowerPC cores into a CMP model, achieving a speedup of up to 5.82x.
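To give a feel for the kind of parallelism being automated, the C++20 sketch below runs one host thread per simulated core and uses barriers to separate core-local evaluation from a serialized shared-state update each cycle. This is a generic cycle-driven pattern, not the paper's system; the core count, cycle count, and per-core work are placeholders.

```cpp
#include <barrier>   // requires C++20
#include <cstdio>
#include <thread>
#include <vector>

struct Core { unsigned long retired = 0; };

int main() {
    const int kCores = 16, kCycles = 1000;
    std::vector<Core> cores(kCores);
    std::barrier sync(kCores);  // one host thread per simulated core

    auto worker = [&](int id) {
        for (int cycle = 0; cycle < kCycles; ++cycle) {
            cores[id].retired += 1;  // phase 1: core-local evaluation (toy)
            sync.arrive_and_wait();  // all cores finish the cycle...
            if (id == 0) {
                // phase 2: shared structures (bus, L2) updated serially
            }
            sync.arrive_and_wait();  // ...before the next cycle begins
        }
    };

    std::vector<std::thread> threads;
    for (int id = 0; id < kCores; ++id) threads.emplace_back(worker, id);
    for (auto& t : threads) t.join();
    std::printf("core 0 retired %lu instructions\n", cores[0].retired);
}
```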
international conference on parallel architectures and compilation techniques | 2004
Alex Settle; Joshua L. Kihm; Andrew Janiszewski; Daniel A. Connors
By converting thread-level parallelism to instruction-level parallelism, simultaneous multithreaded (SMT) processors are emerging as effective ways to utilize the resources of modern superscalar architectures. However, the full potential of SMT has not yet been reached, as most modern operating systems use existing single-thread or multiprocessor algorithms to schedule threads, neglecting contention for resources between threads. To date, even the best SMT scheduling algorithms simply try to group threads for co-residency based on each thread's expected resource utilization but do not take into account variance in thread behavior. As such, we introduce architectural support that enables new thread scheduling algorithms to group threads for co-residency based on fine-grain memory system activity information. The proposed memory monitoring framework centers on the concept of a cache activity vector, which exposes runtime cache resource information to the operating system to improve job scheduling. Using this scheduling technique, we experimentally evaluate the overall performance improvement of workloads on an SMT machine compared against the most recent Linux job scheduler. This work is first motivated with experiments in a simulated environment, then validated on a hyperthreading-enabled Intel Pentium-4 Xeon microprocessor running a modified version of the latest Linux kernel.
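The scheduling idea can be sketched as follows, assuming each thread's activity vector is simply an array of per-cache-region access counts and using a dot product as the contention estimate; both the vector layout and the metric are illustrative, whereas the paper's framework exposes real hardware counters to the OS.

```cpp
#include <array>
#include <cstdio>
#include <vector>

constexpr int kRegions = 8;  // cache divided into 8 "super-set" regions
using ActivityVector = std::array<unsigned, kRegions>;

// Dot product as a rough proxy for predicted cache contention:
// two threads that hammer the same regions score high.
unsigned overlap(const ActivityVector& a, const ActivityVector& b) {
    unsigned s = 0;
    for (int i = 0; i < kRegions; ++i) s += a[i] * b[i];
    return s;
}

int main() {
    std::vector<ActivityVector> threads = {
        {9, 8, 9, 7, 0, 0, 0, 0},  // thread 0: hot in low regions
        {8, 9, 7, 9, 0, 0, 0, 0},  // thread 1: conflicts with thread 0
        {0, 0, 0, 0, 9, 8, 9, 7},  // thread 2: hot in high regions
    };
    // Pick the least-conflicting co-resident partner for thread 0.
    int best = -1;
    unsigned best_overlap = ~0u;
    for (int t = 1; t < (int)threads.size(); ++t) {
        unsigned o = overlap(threads[0], threads[t]);
        if (o < best_overlap) { best_overlap = o; best = t; }
    }
    std::printf("co-schedule thread 0 with thread %d (overlap %u)\n",
                best, best_overlap);  // picks thread 2
}
```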
international conference on computer design | 2005
Tipp Moseley; Joshua L. Kihm; Daniel A. Connors; Dirk Grunwald
Simultaneous multithreading (SMT) seeks to improve the computation throughput of a processor core by sharing primary resources such as functional units, issue bandwidth, and caches. SMT designs increase utilization and generally improve overall throughput, but the amount of improvement is highly dependent on competition for shared resources between the scheduled threads. This variability has implications for operating system scheduling, simulation techniques, and fairness. Although work in these areas recognizes the implications of thread interaction, it does little to profile or predict that interaction. The modeling approach presented in this paper uses data collected from performance counters on two different hardware implementations of Pentium-4 hyper-threading processors to demonstrate the effects of thread interaction. Techniques are described for fitting linear regression models and recursive partitioning to use the counters to make online predictions of performance (expressed as instructions per cycle); these predictions can be used by the operating system to guide scheduling decisions. A detailed analysis of the effectiveness of each of these techniques is presented.
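As a stripped-down example of the regression side, the sketch below fits a one-variable least-squares model relating a single counter value to observed IPC and uses it for an online prediction. The paper's models use multiple counters and also recursive partitioning; the single predictor and all data values here are synthetic.

```cpp
#include <cstdio>
#include <vector>

// Ordinary least squares for y = a + b*x, where x might be a
// hardware counter (say, co-runner misses per cycle) and y the
// observed IPC of a thread.
struct Fit { double a, b; };

Fit least_squares(const std::vector<double>& x,
                  const std::vector<double>& y) {
    double n = (double)x.size(), sx = 0, sy = 0, sxx = 0, sxy = 0;
    for (size_t i = 0; i < x.size(); ++i) {
        sx += x[i]; sy += y[i];
        sxx += x[i] * x[i]; sxy += x[i] * y[i];
    }
    double b = (n * sxy - sx * sy) / (n * sxx - sx * sx);
    double a = (sy - b * sx) / n;
    return {a, b};
}

int main() {
    // Synthetic samples: co-runner miss rate vs. this thread's IPC.
    std::vector<double> miss{0.01, 0.02, 0.04, 0.08, 0.12};
    std::vector<double> ipc {1.10, 1.02, 0.91, 0.70, 0.55};
    Fit f = least_squares(miss, ipc);
    // The kind of online prediction an OS scheduler could consult:
    std::printf("predicted IPC at miss rate 0.06: %.2f\n",
                f.a + f.b * 0.06);
}
```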
Proceedings of the 2007 Symposium on Library-Centric Software Design | 2007
Justin Emile Gottschlich; Daniel A. Connors
Transactional memory (TM) is a recent concept that reduces the challenges found in parallel programming. TM offers numerous advantages over other synchronization mechanisms, yet many current TM systems require complex hardware, programming language extensions, or specific compiler support, or enforce impractical software designs, making these models unrealistic as an immediate TM solution for early adopters. Our new software transactional memory (STM) system, DracoSTM, is a high-performance lock-based C++ STM research library. DracoSTM uses only native object-oriented language semantics, increasing its intuitiveness for developers while maintaining high programmability via automatic handling of composition, locks, and transaction termination. DracoSTM is the first STM solution to (1) implement both direct and deferred updating and (2) enable run-time alternation between these updating policies. DracoSTM requires no language extension, specific development environment, or platform, widening its usability and increasing the novelty of its design. This paper describes DracoSTM from an architectural infrastructure viewpoint. TM-specific and library-specific aspects are discussed, as are their cross-cutting design concerns. Finally, performance benchmarks are presented, showing that DracoSTM outperforms another high-performing C++ STM library by upwards of two orders of magnitude.
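For readers new to STM, the sketch below illustrates the deferred-updating policy in generic C++: a transaction buffers writes in a private log and publishes them at commit under a lock. This is not DracoSTM's API, and it omits the validation and conflict detection a real STM needs.

```cpp
#include <cstdio>
#include <map>
#include <mutex>

// Deferred updating: writes never touch shared memory until commit,
// so abort is trivial (discard the log). Direct updating would
// instead write in place and keep an undo log for rollback.
class Transaction {
    std::map<int*, int> write_log;  // address -> buffered value
    static std::mutex commit_lock;  // single global lock (simplest case)

public:
    int read(int* addr) {
        auto it = write_log.find(addr);  // read-your-own-writes
        return it != write_log.end() ? it->second : *addr;
    }
    void write(int* addr, int value) { write_log[addr] = value; }
    void commit() {
        std::lock_guard<std::mutex> g(commit_lock);
        for (auto& [addr, value] : write_log) *addr = value;  // publish
        write_log.clear();
    }
    void abort() { write_log.clear(); }  // nothing escaped; nothing to undo
};
std::mutex Transaction::commit_lock;

int main() {
    int balance = 100;
    Transaction tx;
    tx.write(&balance, tx.read(&balance) + 50);
    std::printf("before commit: %d\n", balance);  // 100: still buffered
    tx.commit();
    std::printf("after commit:  %d\n", balance);  // 150
}
```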