Howard S. David | Researchain

Archive Network Publication Hotspot Collaboration

Network

Latest external collaboration on country level. Dive into details by clicking on the dots.

Explore More

Hotspot

Dive into the research topics where Howard S. David is active.

Explore More

Publication

Featured researches published by Howard S. David.

international symposium on low power electronics and design | 2010

RAPL: memory power estimation and capping

Howard S. David; Eugene Gorbatov; Ulf R. Hanebutte; Rahul Khanna; Christian Le

The drive for higher performance and energy efficiency in data-centers has influenced trends toward increased power and cooling requirements in the facilities. Since enterprise servers rarely operate at their peak capacity, efficient power capping is deemed as a critical component of modern enterprise computing environments. In this paper we propose a new power measurement and power limiting architecture for main memory. Specifically, we describe a new approach for measuring memory power and demonstrate its applicability to a novel power limiting algorithm. We implement and evaluate our approach in the modern servers and show that we achieve up to 40% lower performance impact when compared to the state-of-art baseline across the power limiting range.

international conference on autonomic computing | 2011

Memory power management via dynamic voltage/frequency scaling

Howard S. David; Chris Fallin; Eugene Gorbatov; Ulf R. Hanebutte; Onur Mutlu

Energy efficiency and energy-proportional computing have become a central focus in enterprise server architecture. As thermal and electrical constraints limit system power, and datacenter operators become more conscious of energy costs, energy efficiency becomes important across the whole system. There are many proposals to scale energy at the datacenter and server level. However, one significant component of server power, the memory system, remains largely unaddressed. We propose memory dynamic volt age/frequency scaling (DVFS) to address this problem, and evaluate a simple algorithm in a real system. As we show, in a typical server platform, memory consumes 19% of system power on average while running SPEC CPU2006 workloads. While increasing core counts demand more bandwidth and drive the memory frequency upward, many workloads require much less than peak bandwidth. These workloads suffer minimal performance impact when memory frequency is reduced. When frequency reduces, voltage can be reduced as well. We demonstrate a large opportunity for memory power reduction with a simple control algorithm that adjusts memory voltage and frequency based on memory bandwidth utilization. We evaluate memory DVFS in a real system, emulating reduced memory frequency by altering timing registers and using an analytical model to compute power reduction. With an average of 0.17% slowdown, we show 10.4% average (20.5% max) memory power reduction, yielding 2.4% average (5.2% max) whole-system energy improvement.

international symposium on microarchitecture | 2008

Mini-rank: Adaptive DRAM architecture for improving memory power efficiency

Hongzhong Zheng; Jiang Lin; Zhao Zhang; Eugene Gorbatov; Howard S. David; Zhichun Zhu

The widespread use of multicore processors has dramatically increased the demand on high memory bandwidth and large memory capacity. As DRAM subsystem designs stretch to meet the demand, memory power consumption is now approaching that of processors. However, the conventional DRAM architecture prevents any meaningful power and performance trade-offs for memory-intensive workloads. We propose a novel idea called mini-rank for DDRx (DDR/DDR2/DDR3) DRAMs, which uses a small bridge chip on each DRAM DIMM to break a conventional DRAM rank into multiple smaller mini-ranks so as to reduce the number of devices involved in a single memory access. The design dramatically reduces the memory power consumption with only a slight increase on the memory idle latency. It does not change the DDRx bus protocol and its configuration can be adapted for the best performance-power trade-offs. Our experimental results using four-core multiprogramming workloads show that using x32 mini-ranks reduces memory power by 27.0% with 2.8% performance penalty and using x16 mini-ranks reduces memory power by 44.1% with 7.4% performance penalty on average for memory-intensive workloads, respectively.

high-performance computer architecture | 2013

Runnemede: An architecture for Ubiquitous High-Performance Computing

Nicholas P. Carter; Aditya Agrawal; Shekhar Borkar; Romain Cledat; Howard S. David; Dave Dunning; Joshua B. Fryman; Ivan Ganev; Roger A. Golliver; Rob C. Knauerhase; Richard Lethin; Benoît Meister; Asit K. Mishra; Wilfred R. Pinfold; Justin Teller; Josep Torrellas; Nicolas Vasilache; Ganesh Venkatesh; Jianping Xu

DARPAs Ubiquitous High-Performance Computing (UHPC) program asked researchers to develop computing systems capable of achieving energy efficiencies of 50 GOPS/Watt, assuming 2018-era fabrication technologies. This paper describes Runnemede, the research architecture developed by the Intel-led UHPC team. Runnemede is being developed through a co-design process that considers the hardware, the runtime/OS, and applications simultaneously. Near-threshold voltage operation, fine-grained power and clock management, and separate execution units for runtime and application code are used to reduce energy consumption. Memory energy is minimized through application-managed on-chip memory and direct physical addressing. A hierarchical on-chip network reduces communication energy, and a codelet-based execution model supports extreme parallelism and fine-grained tasks. We present an initial evaluation of Runnemede that shows the design process for our on-chip network, demonstrates 2-4x improvements in memory energy from explicit control of on-chip memory, and illustrates the impact of hardware-software co-design on the energy consumption of a synthetic aperture radar algorithm on our architecture.

international symposium on computer architecture | 2007

Thermal modeling and management of DRAM memory systems

Jiang Lin; Hongzhong Zheng; Zhichun Zhu; Howard S. David; Zhao Zhang

With increasing speed and power density, high-performance memories, including FB-DIMM (Fully Buffered DIMM) and DDR2 DRAM, now begin to require dynamic thermal management(DTM) as processors and hard drives did. The DTM of memories, nevertheless, is different in that it should take the processor performance and power consumption into consideration. Existing schemes have ignored that. In this study, we investigate a new approach that controls the memory thermal issues from the source generating memory activities - the processor. It will smooth the program execution when compared with shutting down memory abruptly, and therefore improve the overall system performance and power efficiency. For multicore systems, we propose two schemes called adaptive core gating and coordinated DVFS. The first scheme activates clock gating on selected processor cores and the second one scales down the frequency and voltage levels of processor cores when the memory is to be over-heated. They can successfully control the memory activities and handle thermal emergency. More importantly, they improve performance significantly under the given thermal envelope. Our simulation results show that adaptive coregating improves performance by up to 23.3% (16.3% on average) on a four-core system with FB-DIMM when compared with DRAM thermal shutdown; and coordinated DVFS with control-theoretic methods improves the performance by up to 18.5% (8.3% on average).

measurement and modeling of computer systems | 2008

Software thermal management of dram memory for multicore systems

Jiang Lin; Hongzhong Zheng; Zhichun Zhu; Eugene Gorbatov; Howard S. David; Zhao Zhang

Thermal management of DRAM memory has become a critical issue for server systems. We have done, to our best knowledge, the first study of software thermal management for memory subsystem on real machines. Two recently proposed DTM (Dynamic Thermal Management) policies have been improved and implemented in Linux OS and evaluated on two multicore servers, a Dell PowerEdge 1950 server and a customized Intel SR1500AL server testbed. The experimental results first confirm that a system-level memory DTM policy may significantly improve system performance and power efficiency, compared with existing memory bandwidth throttling scheme. A policy called DTM-ACG (Adaptive Core Gating) shows performance improvement comparable to that reported previously. The average performance improvements are 13.3% and 7.2% on the PowerEdge 1950 and the SR1500AL (vs. 16.3% from the previous simulation-based study), respectively. We also have surprising findings that reveal the weakness of the previous study: the CPU heat dissipation and its impact on DRAM memories, which were ignored, are significant factors. We have observed that the second policy, called DTM-CDVFS (Coordinated Dynamic Voltage and Frequency Scaling), has much better performance than previously reported for this reason. The average improvements are 10.8% and 15.3% on the two machines (vs. 3.4% from the previous study), respectively. It also significantly reduces the processor power by 15.5% and energy by 22.7% on average.

international symposium on performance analysis of systems and software | 2007

DRAM-Level Prefetching for Fully-Buffered DIMM: Design, Performance and Power Saving

Jiang Lin; Hongzhong Zheng; Zhichun Zhu; Zhao Zhang; Howard S. David

We have studied DRAM-level prefetching for the fully buffered DIMM (FB-DIMM) designed for multi-core processors. FB-DIMM has a unique two-level interconnect structure, with FB-DIMM channels at the first-level connecting the memory controller and advanced memory buffers (AMBs); and DDR2 buses at the second-level connecting the AMBs with DRAM chips. We propose an AMB prefetching method that prefetches memory blocks from DRAM chips to AMBs. It utilizes the redundant bandwidth between the DRAM chips and AMBs but does not consume the crucial channel bandwidth. The proposed method fetches K memory blocks of L2 cache block sizes around the demanded block, where K is a small value ranging from two to eight. The method may also reduce the DRAM power consumption by merging some DRAM precharges and activations. Our cycle-accurate simulation shows that the average performance improvement is 16% for single-core and multi-core workloads constructed from memory-intensive SPEC2000 programs with software cache prefetching enabled; and no workload has negative speedup. We have found that the performance gain comes from the reduction of idle memory latency and the improvement of channel bandwidth utilization. We have also found that there is only a small overlap between the performance gains from the AMB prefetching and the software cache prefetching. The average of estimated power saving is 15%

Scopus | 2010

A novel approach to memory power estimation using machine learning

Christian Le; Eugene Gorbatov; Ulf R. Hanebutte; Mariette Awad; Rahul Khanna; Melissa Stockman; Howard S. David

Reducing power consumption has become a priority in microprocessor design as more devices become mobile and as the density and speed of components lead to power dissipation issues. Power allocation strategies for individual components within a chip are being researched to determine optimal configurations to balance power and performance. Modelling and estimation tools are necessary in order to understand the behaviour of energy consumption in a run time environment. This paper discusses a novel approach to power metering by estimating it using a set of observed variables that share a linear or non-linear correlation to the power consumption. The machine learning approaches exploit the statistical relationship among potential variables and power consumption. We show that Support Vector Machine regression (SVR), Genetic Algorithms (GA) and Neural Networks (NN) can all be used to cheaply and easily predict memory power usage based on these observed variables.

Archive | 2003