
Publication


Featured research published by Derek R. Hower.


ACM SIGARCH Computer Architecture News | 2011

The gem5 simulator

Nathan L. Binkert; Bradford M. Beckmann; Gabriel Black; Steven K. Reinhardt; Ali G. Saidi; Arkaprava Basu; Joel Hestness; Derek R. Hower; Tushar Krishna; Somayeh Sardashti; Rathijit Sen; Korey Sewell; Muhammad Shoaib; Nilay Vaish; Mark D. Hill; David A. Wood

The gem5 simulation infrastructure is the merger of the best aspects of the M5 [4] and GEMS [9] simulators. M5 provides a highly configurable simulation framework, multiple ISAs, and diverse CPU models. GEMS complements these features with a detailed and flexible memory system, including support for multiple cache coherence protocols and interconnect models. Currently, gem5 supports most commercial ISAs (ARM, ALPHA, MIPS, Power, SPARC, and x86), including booting Linux on three of them (ARM, ALPHA, and x86). The project is the result of the combined efforts of many academic and industrial institutions, including AMD, ARM, HP, MIPS, Princeton, MIT, and the Universities of Michigan, Texas, and Wisconsin. Over the past ten years, M5 and GEMS have been used in hundreds of publications and have been downloaded tens of thousands of times. The high level of collaboration on the gem5 project, combined with the previous success of the component parts and a liberal BSD-like license, make gem5 a valuable full-system simulation tool.
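
To give a flavor of what driving gem5 looks like in practice, here is a minimal sketch of a syscall-emulation (SE) mode configuration script in the style of the learning-gem5 "simple" example. It assumes a recent (v21+) x86 gem5 build; object names such as X86TimingSimpleCPU, SystemXBar, and MemCtrl follow the modern Python API, and older releases used different port names.

```python
import m5
from m5.objects import *

# One simple timing CPU, no caches, one DDR3 memory channel.
system = System()
system.clk_domain = SrcClockDomain(clock="1GHz", voltage_domain=VoltageDomain())
system.mem_mode = "timing"
system.mem_ranges = [AddrRange("512MB")]

system.cpu = X86TimingSimpleCPU()
system.membus = SystemXBar()

# Without caches, the CPU ports connect straight to the memory bus.
system.cpu.icache_port = system.membus.cpu_side_ports
system.cpu.dcache_port = system.membus.cpu_side_ports

# x86 needs an interrupt controller wired to the bus.
system.cpu.createInterruptController()
system.cpu.interrupts[0].pio = system.membus.mem_side_ports
system.cpu.interrupts[0].int_requestor = system.membus.cpu_side_ports
system.cpu.interrupts[0].int_responder = system.membus.mem_side_ports
system.system_port = system.membus.cpu_side_ports

system.mem_ctrl = MemCtrl()
system.mem_ctrl.dram = DDR3_1600_8x8()
system.mem_ctrl.dram.range = system.mem_ranges[0]
system.mem_ctrl.port = system.membus.mem_side_ports

# SE-mode workload: the "hello" test binary shipped in the gem5 tree.
binary = "tests/test-progs/hello/bin/x86/linux/hello"
system.workload = SEWorkload.init_compatible(binary)
process = Process(cmd=[binary])
system.cpu.workload = process
system.cpu.createThreads()

root = Root(full_system=False, system=system)
m5.instantiate()
exit_event = m5.simulate()
print(f"Exiting @ tick {m5.curTick()} because {exit_event.getCause()}")
```

The script is run with the gem5 binary itself, e.g. build/X86/gem5.opt my_config.py, which is what makes gem5's Python-driven configuration so flexible.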


High-Performance Computer Architecture | 2011

Calvin: Deterministic or not? Free will to choose

Derek R. Hower; Polina Dudnik; Mark D. Hill; David A. Wood

Most shared memory systems maximize performance by unpredictably resolving memory races. Unpredictable memory races can lead to nondeterminism in parallel programs, which can suffer from hard-to-reproduce heisenbugs. We introduce Calvin, a shared memory model capable of executing in a conventional nondeterministic mode when performance is paramount and a deterministic mode when execution repeatability is important. Unlike prior hardware proposals for deterministic execution, Calvin exploits the flexibility of a memory consistency model weaker than sequential consistency. Specifically, Calvin logically orders memory operations into strata that are compatible with the Total Store Order (TSO) memory model. Calvin is also designed with the needs of future power-aware processors in mind, and does not require any speculation support. We develop a Calvin-MIST implementation that uses an unordered coalescing write cache, multiple-write coherence protocol, and delayed (timebomb) invalidations while maintaining TSO compatibility. Results show that Calvin-MIST can execute workloads in conventional mode at speeds comparable to a conventional system (providing compatibility) or execute deterministically for a modest average slowdown of less than 20% (when determinism is valued).
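
To make the stratum idea concrete, below is a toy software model of deterministic commit. It is a sketch under the assumption that buffered stores commit at stratum boundaries in fixed core-ID order; the names (StratifiedMemory, end_stratum) are invented for illustration and the real Calvin is a hardware memory system, not a Python class.

```python
# Toy model of Calvin-style deterministic strata (illustrative only).
class StratifiedMemory:
    def __init__(self, num_cores):
        self.mem = {}                                # committed shared state
        self.wbufs = [{} for _ in range(num_cores)]  # per-core store buffers

    def load(self, core, addr):
        # TSO-style bypassing: a core sees its own buffered stores first.
        return self.wbufs[core].get(addr, self.mem.get(addr, 0))

    def store(self, core, addr, value):
        # Stores stay private to the core until the stratum ends.
        self.wbufs[core][addr] = value

    def end_stratum(self):
        # Commit buffered stores in fixed core-ID order. Write-write races
        # within a stratum therefore resolve identically on every run,
        # which is what makes execution repeatable.
        for buf in self.wbufs:
            self.mem.update(buf)
            buf.clear()

# Two cores race on address 0; core 1 deterministically wins because it
# commits last, no matter how the calls interleave in real time.
m = StratifiedMemory(num_cores=2)
m.store(0, addr=0, value="from core 0")
m.store(1, addr=0, value="from core 1")
m.end_stratum()
assert m.load(0, addr=0) == "from core 1"
```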


Communications of the ACM | 2009

Two hardware-based approaches for deterministic multiprocessor replay

Derek R. Hower; Pablo Montesinos; Luis Ceze; Mark D. Hill; Josep Torrellas

Modern computer systems are inherently nondeterministic due to a variety of events that occur during an execution, including I/O, interrupts, and DMA fills. The lack of repeatability that arises from this nondeterminism can make it difficult to develop and maintain correct software. Furthermore, the impact of nondeterminism is likely to grow in the coming years, as commodity systems are now shared-memory multiprocessors: such systems are affected not only by the sources of nondeterminism in uniprocessors, but also by the outcomes of memory races among concurrent threads. In an effort to ease the pain of developing software in a nondeterministic environment, researchers have proposed adding deterministic replay capabilities to computer systems. A system with a deterministic replay capability can record sufficient information during an execution to enable a replayer to (later) create an equivalent execution despite the inherent sources of nondeterminism. With the ability to replay an execution verbatim, many new applications become possible:

Debugging: Deterministic replay could provide the illusion of a time-travel debugger with the ability to selectively execute both forward and backward in time.

Security: Deterministic replay could also enhance the security of software by enabling an in-depth analysis of an attack, hopefully leading to rapid patch deployment and a reduction in the economic impact of new threats.

Fault tolerance: With the ability to replay an execution, it may also be possible to build hot-standby systems for critical service providers using commodity hardware. A virtual machine (VM) could, for example, be fed, in real time, the replay log of a primary server running on a physically separate machine. The standby VM could use the replay log to mimic the primary's execution, so that if the primary fails, the backup can take over with almost zero downtime.
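
As a minimal illustration of the record/replay split, the sketch below logs and then re-enforces the winner of a single memory race, modeled here as a lock. All names (RecordingLock, ReplayingLock) are hypothetical; real hardware recorders, like those in the paper, capture equivalent ordering information far more compactly.

```python
import threading

class RecordingLock:
    """Recording mode: log the order in which threads win the lock."""
    def __init__(self):
        self._lock = threading.Lock()
        self.log = []                 # recorded acquisition order (thread ids)

    def acquire(self, tid):
        self._lock.acquire()
        self.log.append(tid)          # tid won this race

    def release(self):
        self._lock.release()

class ReplayingLock:
    """Replay mode: force threads to win the lock in the recorded order.

    Assumes the replayed execution performs the same acquisitions as the
    recorded one, which replaying all other nondeterminism would ensure.
    """
    def __init__(self, log):
        self._order = list(log)
        self._cv = threading.Condition()

    def acquire(self, tid):
        with self._cv:
            # Wait until the recorded log says it is this thread's turn.
            while self._order[0] != tid:
                self._cv.wait()

    def release(self):
        with self._cv:
            self._order.pop(0)        # consume this turn
            self._cv.notify_all()
```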


International Test Conference | 2006

Self-Checking and Self-Diagnosing 32-bit Microprocessor Multiplier

Mahmut Yilmaz; Derek R. Hower; Sule Ozev; Daniel J. Sorin

In this paper, we propose a low-cost fault tolerance technique for microprocessor multipliers, both non-pipelined (NP) and pipelined (P). Our fault tolerant multiplier designs are capable of detecting and correcting errors, diagnosing hard faults, and reconfiguring to take the faulty sub-unit off-line. We utilize the branch misprediction recovery mechanism in the microprocessor core to take the error detection process off the critical path. Our analysis shows that our scheme provides 99% fault security and, compared to a baseline unprotected multiplier, achieves this fault tolerance with low performance overhead (5% for NP and 2.5% for P multiplier) and reasonably low area (38% NP and 26% P) and power consumption (36% NP and 28.5% P) overheads.
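
The flavor of self-checking arithmetic can be shown with a residue code: a tiny independent checker verifies the big multiplier. The sketch below uses a mod-3 residue check, a standard technique but not necessarily the paper's exact checker (which also diagnoses and reconfigures around faulty sub-units); checked_multiply and the injected fault are invented for illustration. Since 2^k mod 3 is never 0, a mod-3 check catches any single-bit error in the product.

```python
def residue3(x):
    return x % 3

def checked_multiply(a, b, hw_multiply=lambda a, b: a * b):
    product = hw_multiply(a, b)       # the (possibly faulty) wide datapath
    # A small, independent checker: residues multiply like the operands do.
    if residue3(product) != residue3(residue3(a) * residue3(b)):
        # In hardware, this would trigger recovery through the branch
        # misprediction squash mechanism rather than raising an exception.
        raise ArithmeticError("multiplier fault detected")
    return product

assert checked_multiply(6, 7) == 42   # healthy multiplier passes the check

# A stuck-at-1 fault on product bit 0 is caught by the residue checker.
faulty = lambda a, b: (a * b) | 1
try:
    checked_multiply(6, 6, hw_multiply=faulty)   # 36 becomes 37
except ArithmeticError:
    pass                                         # fault detected
```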


International Conference on Computer Design | 2013

FreshCache: Statically and dynamically exploiting dataless ways

Arkaprava Basu; Derek R. Hower; Mark D. Hill; Michael M. Swift

Last-level caches (LLCs) account for a substantial fraction of the area and power budget in many modern processors. Two recent trends, dwindling die yield that falls off sharply with larger chips and increasing static power, make a strong case for a fresh look at LLC design. Inclusive caches are particularly interesting because many commercially successful processors use inclusion to ease coherence, at the cost of some data being stale or redundant. Prior work has demonstrated that LLC designs can be improved through static (design-time) or dynamic (runtime) use of "dataless ways". Static dataless ways remove the data, but not the tags, from some cache ways to save energy and area without complicating inclusive-LLC coherence. A dynamic version (dynamic dataless ways) can turn off data, but not tags, at runtime, effectively adapting the classic selective cache ways idea to save energy, but not area, in the LLC. We find that (a) all our benchmarks benefit from dataless ways, but (b) the best number of dataless ways varies by workload. Thus, a purely static design leaves energy-saving opportunity on the table, while a purely dynamic design misses the area-saving opportunity. To surpass both, we develop the FreshCache LLC design, which exploits dataless ways both statically and dynamically, including a predictor that adapts the number of dynamic dataless ways as well as detailed cache management policies. Results show that FreshCache saves more energy than static dataless ways alone (e.g., 72% vs. 9% of LLC energy) and more area than dynamic dataless ways alone (e.g., 8% vs. 0% of LLC area).
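
A toy model of one LLC set may help picture a dataless way: every way keeps a tag, so inclusion (and hence coherence filtering) is preserved, but some ways have no usable data array. The LLCSet class and its methods below are invented for illustration; the real design adds a predictor and detailed management policies around this core idea.

```python
# Toy model of one set of an inclusive LLC with dataless ways.
class LLCSet:
    def __init__(self, num_ways, dataless):
        self.tags = [None] * num_ways
        self.data = [None] * num_ways
        self.dataless = dataless               # count of tag-only ways

    def _is_dataless(self, way):
        # The last `dataless` ways keep tags but have their data powered off.
        return way >= len(self.tags) - self.dataless

    def lookup(self, tag):
        for way, t in enumerate(self.tags):
            if t == tag:
                if self._is_dataless(way):
                    # Tag hit without data: inclusion still holds, so the
                    # block can be served from a private cache above the LLC
                    # (or from memory) without breaking coherence.
                    return "hit-tag-only"
                return self.data[way]
        return "miss"

    def set_dataless(self, n):
        # FreshCache's dynamic side: a predictor would call this to grow or
        # shrink the number of dataless ways as the workload changes.
        self.dataless = n
```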


Measurement and Modeling of Computer Systems | 2006

Applying architectural vulnerability analysis to hard faults in the microprocessor

Fred A. Bower; Derek R. Hower; Mahmut Yilmaz; Daniel J. Sorin; Sule Ozev

In this paper, we present a new metric, Hard-Fault Architectural Vulnerability Factor (H-AVF), to allow designers to more effectively compare alternate hard-fault tolerance schemes. In order to provide intuition on the use of H-AVF as a metric, we evaluate fault-tolerant level-1 data cache and register file implementations using error correcting codes and a fault-tolerant adder using triple-modular redundancy (TMR). For each of the designs, we compute its H-AVF. We then use these H-AVF values in conjunction with other properties of the design, such as die area and power consumption, to provide composite metrics. The derived metrics provide simple, quantitative measures of the cost-effectiveness of the evaluated designs.
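
As a hedged sketch of how such composite metrics might be used, the snippet below multiplies H-AVF by area and power in the spirit of energy-delay-style products; the combination rule and every number here are invented for illustration, not taken from the paper.

```python
def composite_cost(h_avf, area_mm2, power_w):
    # Lower is better: a design pays for vulnerability, silicon, and power.
    # One illustrative composite; other weightings are equally plausible.
    return h_avf * area_mm2 * power_w

# Hypothetical design points (all values made up for illustration).
candidates = {
    "ECC L1 data cache": dict(h_avf=0.020, area_mm2=1.8, power_w=0.50),
    "ECC register file": dict(h_avf=0.015, area_mm2=1.2, power_w=0.35),
    "TMR adder":         dict(h_avf=0.005, area_mm2=2.7, power_w=0.90),
}
for name, d in sorted(candidates.items(),
                      key=lambda kv: composite_cost(**kv[1])):
    print(f"{name}: {composite_cost(**d):.4f}")
```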


High-Performance Computer Architecture | 2017

PABST: Proportionally Allocated Bandwidth at the Source and Target

Derek R. Hower; Harold W. Cain; Carl A. Waldspurger

Higher integration lowers total cost of ownership (TCO) in the data center by reducing equipment cost and lowering energy consumption. However, higher integration also makes it difficult to achieve guaranteed quality of service (QoS) for shared resources. Unlike many other resources, memory bandwidth cannot be finely controlled by software in existing systems. As a result, many systems running critical, bandwidth-sensitive applications remain underutilized to protect against bandwidth interference. In this paper, we propose a novel hardware architecture allowing practical, software-controlled partitioning of memory bandwidth. Proportionally Allocated Bandwidth at the Source and Target (PABST) precisely controls the bandwidth of applications by throttling request rates at the source and prioritizes requests at the target. We show that PABST is work conserving, such that excess bandwidth beyond the requested allocation will not go unused. For applications sensitive to memory latency, we pair PABST with a simple priority scheme at the memory controller. We show that when combined, the system is able to lower TCO by providing performance isolation across a wide range of workloads, even when co-located with memory-intensive background jobs.
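
The work-conserving property can be illustrated with a small allocator: give each source its proportional share, cap it at demand, then hand the surplus to still-hungry sources by weight. The allocate function below is an invented software analogue; PABST itself throttles request rates in hardware at the source and prioritizes at the target.

```python
def allocate(shares, demand, total_bw):
    """shares: per-source weight; demand: bandwidth each source can use."""
    weight = sum(shares.values())
    alloc = {s: total_bw * w / weight for s, w in shares.items()}
    # Cap each source at its demand, then redistribute the surplus to
    # still-hungry sources by weight: this is the work-conserving part.
    for s in alloc:
        alloc[s] = min(alloc[s], demand[s])
    spare = total_bw - sum(alloc.values())
    while spare > 1e-9:
        hungry = {s: shares[s] for s in alloc if demand[s] > alloc[s] + 1e-9}
        if not hungry:
            break                      # everyone satisfied; bandwidth idles
        w = sum(hungry.values())
        for s, ws in hungry.items():
            alloc[s] += min(spare * ws / w, demand[s] - alloc[s])
        spare = total_bw - sum(alloc.values())
    return alloc

# A 2:1:1 split of 32 GB/s; B soaks up what A and C cannot use.
print(allocate({"A": 2, "B": 1, "C": 1},
               {"A": 10, "B": 20, "C": 5}, total_bw=32))
# -> roughly {'A': 10, 'B': 17.0, 'C': 5}
```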


International Conference on Computer Design | 2017

Jenga: Efficient Fault Tolerance for Stacked DRAM

Georgios Mappouras; Alireza Vahid; A. Robert Calderbank; Derek R. Hower; Daniel J. Sorin

In this paper, we introduce Jenga, a new scheme for protecting 3D DRAM, specifically high bandwidth memory (HBM), from failures in bits, rows, banks, channels, dies, and TSVs. By providing redundancy at the granularity of a cache block, rather than across blocks as in the current state of the art, Jenga achieves greater error-free performance and lower error-recovery latency. We show that Jenga's runtime is on average only 1.03× the runtime of our baseline across a range of benchmarks. Additionally, for memory-intensive benchmarks, Jenga is on average 1.11× faster than prior work.
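
To illustrate block-granularity redundancy, the sketch below stripes a 64-byte block across channels with one XOR parity fragment so a single failed channel can be rebuilt on the read path. Jenga's actual codes are stronger (plain parity tolerates only one erased fragment), and all names here are invented.

```python
DATA_FRAGS = 4                                  # data fragments per block

def write_block(block: bytes):
    n = len(block) // DATA_FRAGS
    frags = [block[i * n:(i + 1) * n] for i in range(DATA_FRAGS)]
    # Parity fragment: bytewise XOR across the data fragments.
    parity = bytes(a ^ b ^ c ^ d for a, b, c, d in zip(*frags))
    return frags + [parity]                     # one fragment per channel

def read_block(frags, failed_channel=None):
    if failed_channel is not None:
        # Rebuild the missing fragment from the four survivors.
        survivors = [f for i, f in enumerate(frags) if i != failed_channel]
        rebuilt = bytes(w ^ x ^ y ^ z for w, x, y, z in zip(*survivors))
        frags = frags[:failed_channel] + [rebuilt] + frags[failed_channel + 1:]
    return b"".join(frags[:DATA_FRAGS])

block = bytes(range(64))                        # a 64-byte cache block
frags = write_block(block)
assert read_block(frags, failed_channel=2) == block   # survives channel loss
```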


International Symposium on Computer Architecture | 2008

Rerun: Exploiting Episodes for Lightweight Memory Race Recording

Derek R. Hower; Mark D. Hill


Dagstuhl Seminar Proceedings | 2007

A Case for Deconstructing Hardware Transactional Memory Systems

Mark D. Hill; Derek R. Hower; Kevin E. Moore; Michael M. Swift; Haris Volos; David A. Wood

Collaboration


Dive into Derek R. Hower's collaborations.

Top Co-Authors

Mark D. Hill

University of Wisconsin-Madison

David A. Wood

University of Wisconsin-Madison

Michael M. Swift

University of Wisconsin-Madison
