Publication


Featured research published by Benjamin C. Lee.


International Symposium on Computer Architecture | 2009

Architecting phase change memory as a scalable DRAM alternative

Benjamin C. Lee; Engin Ipek; Onur Mutlu; Doug Burger

Memory scaling is in jeopardy as charge storage and sensing mechanisms become less reliable for prevalent memory technologies, such as DRAM. In contrast, phase change memory (PCM) storage relies on scalable current and thermal mechanisms. To exploit PCM's scalability as a DRAM alternative, PCM must be architected to address its relatively long latencies, high-energy writes, and finite endurance. Drawing on a fundamental understanding of PCM technology parameters, we propose area-neutral architectural enhancements that address these limitations and make PCM competitive with DRAM. A baseline PCM system is 1.6x slower and requires 2.2x more energy than a DRAM system. Buffer reorganizations reduce this delay and energy gap to 1.2x and 1.0x, using narrow rows to mitigate write energy and multiple rows to improve locality and write coalescing. Partial writes enhance memory endurance, providing 5.6 years of lifetime. Process scaling will further reduce PCM energy costs and improve endurance.
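
The endurance claim follows from simple arithmetic: lifetime is roughly capacity times per-cell write endurance divided by the write traffic that actually reaches cells, which is what partial writes reduce. A back-of-envelope sketch in Python, with every parameter an assumed placeholder rather than a value from the paper:

```python
# Back-of-envelope PCM lifetime estimate under idealized wear leveling.
# Every parameter is an illustrative assumption, not a value from the paper.

CAPACITY_BITS = 8 * 2**33       # 8 GB module
ENDURANCE = 1e8                 # writes per cell before wear-out (assumed)
WRITE_BW_BITS = 8e9             # 1 GB/s of sustained write traffic (assumed)
WRITE_FRACTION = 0.5            # fraction of bits actually rewritten per
                                # array write; partial writes lower this

effective_bw = WRITE_BW_BITS * WRITE_FRACTION
lifetime_s = CAPACITY_BITS * ENDURANCE / effective_bw
print(f"estimated lifetime: {lifetime_s / 3.15e7:.1f} years")
```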


Symposium on Operating Systems Principles | 2009

Better I/O through byte-addressable, persistent memory

Jeremy Condit; Edmund B. Nightingale; Christopher Frost; Engin Ipek; Benjamin C. Lee; Doug Burger; Derrick Coetzee

Modern computer systems have been built around the assumption that persistent storage is accessed via a slow, block-based interface. However, new byte-addressable, persistent memory technologies such as phase change memory (PCM) offer fast, fine-grained access to persistent storage. In this paper, we present a file system and a hardware architecture that are designed around the properties of persistent, byte-addressable memory. Our file system, BPFS, uses a new technique called short-circuit shadow paging to provide atomic, fine-grained updates to persistent storage. As a result, BPFS provides strong reliability guarantees and offers better performance than traditional file systems, even when both are run on top of byte-addressable, persistent memory. Our hardware architecture enforces atomicity and ordering guarantees required by BPFS while still providing the performance benefits of the L1 and L2 caches. Since these memory technologies are not yet widely available, we evaluate BPFS on DRAM against NTFS on both a RAM disk and a traditional disk. Then, we use microarchitectural simulations to estimate the performance of BPFS on PCM. Despite providing strong safety and consistency guarantees, BPFS on DRAM is typically twice as fast as NTFS on a RAM disk and 4-10 times faster than NTFS on disk. We also show that BPFS on PCM should be significantly faster than a traditional disk-based file system.
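
Short-circuit shadow paging is the heart of BPFS: small aligned writes commit in place with a single atomic store, while larger writes copy the affected blocks and commit by swinging one pointer. A toy Python sketch of that idea (the real BPFS tree layout, epochs, and hardware ordering support are not modeled):

```python
# Minimal shadow-paging sketch in the spirit of BPFS's short-circuit
# shadow paging, for illustration only. A file is a flat list of 8-byte
# blocks; 8-byte aligned stores are assumed atomic.

BLOCK = 8

class File:
    def __init__(self, nblocks):
        self.blocks = [bytearray(BLOCK) for _ in range(nblocks)]

    def write(self, offset, payload):
        first = offset // BLOCK
        last = (offset + len(payload) - 1) // BLOCK
        if first == last and offset % BLOCK == 0 and len(payload) == BLOCK:
            # Short circuit: one atomic in-place store, no copying at all.
            self.blocks[first][:] = payload
            return
        # Shadow path: copy every affected block, apply the update to
        # the copies, then publish the new block list in one step
        # (modeling the single atomic pointer swing at the commit point).
        shadow = list(self.blocks)                 # unmodified blocks shared
        for i in range(first, last + 1):
            shadow[i] = bytearray(self.blocks[i])  # copy-on-write
        for j, b in enumerate(payload):
            blk, off = divmod(offset + j, BLOCK)
            shadow[blk][off] = b
        self.blocks = shadow                       # atomic commit

f = File(4)
f.write(0, b"ABCDEFGH")     # short-circuit in-place commit
f.write(4, b"XXXXXXXX")     # spans two blocks: copy-on-write commit
print(b"".join(f.blocks))
```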


Architectural Support for Programming Languages and Operating Systems | 2006

Accurate and efficient regression modeling for microarchitectural performance and power prediction

Benjamin C. Lee; David M. Brooks

We propose regression modeling as an efficient approach for accurately predicting performance and power for various applications executing on any microprocessor configuration in a large microarchitectural design space. This paper addresses fundamental challenges in microarchitectural simulation cost by reducing the number of required simulations and using simulated results more effectively via statistical modeling and inference. Specifically, we derive and validate regression models for performance and power. Such models enable computationally efficient statistical inference, requiring the simulation of only 1 in 5 million points of a joint microarchitecture-application design space while achieving median error rates as low as 4.1 percent for performance and 4.3 percent for power. Although both models achieve similar accuracy, the sources of accuracy are strikingly different. We present optimizations for a baseline regression model to obtain (1) application-specific models to maximize accuracy in performance prediction and (2) regional power models leveraging only the most relevant samples from the microarchitectural design space to maximize accuracy in power prediction. Assessing sensitivity to the number of samples simulated for model formulation, we find fewer than 4,000 samples from a design space of approximately 22 billion points are sufficient. Collectively, our results suggest significant potential in accurate and efficient statistical inference for microarchitectural design space exploration via regression models.
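
The workflow is: sample a sliver of the design space, simulate only those points, fit a model, and predict everywhere else. A minimal numpy sketch with a synthetic stand-in for the simulator and a plain linear fit (the paper's models use splines and interaction terms):

```python
# Regression-modeling sketch: simulate a tiny sample of a large design
# space, fit a model, predict the rest. Features and the "simulator"
# below are synthetic placeholders.

import numpy as np

rng = np.random.default_rng(0)

# Design space: (pipeline depth, issue width, L2 size in MB), say.
n_space, n_sampled = 200_000, 40            # simulate ~1 in 5,000 points
X = rng.uniform([8, 1, 0.25], [30, 8, 8], size=(n_space, 3))

def simulate(x):                            # stand-in for a slow simulator
    depth, width, l2 = x.T
    return (2.0 + 0.05 * width - 0.01 * depth + 0.3 * np.log2(l2)
            + rng.normal(0, 0.05, size=len(x)))

idx = rng.choice(n_space, n_sampled, replace=False)
y = simulate(X[idx])

# Fit performance ~ features via least squares.
A = np.column_stack([np.ones(n_sampled), X[idx]])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)

pred = np.column_stack([np.ones(n_space), X]) @ coef
true = simulate(X)
err = np.abs(pred - true) / true
print(f"median prediction error: {100 * np.median(err):.1f}%")
```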


International Symposium on Microarchitecture | 2010

Phase-Change Technology and the Future of Main Memory

Benjamin C. Lee; Ping Zhou; Jun Yang; Youtao Zhang; Bo Zhao; Engin Ipek; Onur Mutlu; Doug Burger

Phase-change memory may enable continued scaling of main memories, but PCM has higher access latencies, incurs higher power costs, and wears out more quickly than DRAM. This article discusses how to mitigate these limitations through buffer sizing, row caching, write reduction, and wear leveling, to make PCM a viable DRAM alternative for scalable main memories.
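
As one example of the wear-leveling ingredient, consider a toy rotation scheme that periodically shifts the logical-to-physical mapping so no single line absorbs all the writes. This is a generic illustration, not the specific mechanism the article evaluates, and the data copying on each remap step is elided:

```python
# Toy wear leveling: rotate the logical-to-physical line mapping every
# `period` writes so hot lines migrate across the whole memory.

class RotatingMemory:
    def __init__(self, nlines, period=100):
        self.nlines = nlines
        self.period = period        # writes between rotation steps
        self.shift = 0              # current rotation amount
        self.writes = 0
        self.wear = [0] * nlines    # per-physical-line write counts

    def physical(self, logical):
        return (logical + self.shift) % self.nlines

    def write(self, logical):
        self.wear[self.physical(logical)] += 1
        self.writes += 1
        if self.writes % self.period == 0:
            self.shift = (self.shift + 1) % self.nlines  # remap step

mem = RotatingMemory(nlines=64)
for _ in range(100_000):
    mem.write(7)                    # pathological single-line hot spot
print(max(mem.wear), min(mem.wear)) # wear spread instead of one hot line
```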


International Symposium on Computer Architecture | 2010

Web search using mobile cores: quantifying and mitigating the price of efficiency

Vijay Janapa Reddi; Benjamin C. Lee; Trishul M. Chilimbi; Kushagra Vaid

The commoditization of hardware, data center economies of scale, and Internet-scale workload growth all demand greater power efficiency to sustain scalability. Traditional enterprise workloads, which are typically memory and I/O bound, have been well served by chip multiprocessors comprising small, power-efficient cores. Recent advances in mobile computing have led to modern small cores capable of delivering even better power efficiency. While these cores can deliver performance-per-watt efficiency for data center workloads, small cores impact application quality of service, robustness, and flexibility, as these workloads increasingly invoke computationally intensive kernels. These challenges constitute the price of efficiency. We quantify this price for an industry-strength online web search engine in production at both the microarchitecture and system levels, evaluating search on server- and mobile-class architectures using Xeon and Atom processors.
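
The efficiency side of the trade is simple arithmetic: queries per joule is throughput divided by power. A tiny sketch with made-up numbers (the paper reports measured Xeon and Atom results; these values only illustrate the shape of the trade-off):

```python
# Illustrative price-of-efficiency arithmetic with made-up numbers; the
# paper reports measured results for production search on Xeon and Atom.

servers = {
    #              queries/sec, watts
    "Xeon-class": (100.0, 80.0),
    "Atom-class": (35.0, 10.0),
}

for name, (qps, watts) in servers.items():
    print(f"{name}: {qps / watts:.2f} queries per joule")
# The mobile core wins on efficiency, but slower per-query execution
# lengthens latency tails: the quality-of-service price of efficiency.
```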


High-Performance Computer Architecture | 2006

CMP design space exploration subject to physical constraints

Yingmin Li; Benjamin C. Lee; David M. Brooks; Zhigang Hu; Kevin Skadron

This paper explores the multi-dimensional design space for chip multiprocessors, investigating the inter-related variables of core count, pipeline depth, superscalar width, L2 cache size, and operating voltage and frequency, under various area and thermal constraints. The results show the importance of joint optimization. Thermal constraints dominate other physical constraints such as pin-bandwidth and power delivery, demonstrating the importance of considering thermal constraints while optimizing these other parameters. For aggressive cooling solutions, reducing power density is at least as important as reducing total power, while for low-cost cooling solutions, reducing total power is more important. Finally, the paper shows the challenges of accommodating both CPU-bound and memory-bound workloads on the same design. Their respective preferences for more cores and larger caches lead to increasingly irreconcilable configurations as area and other constraints are relaxed; rather than converging on a happy medium, the extra resources simply encourage more extreme optimization points.
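
Conceptually the study is a constrained search: enumerate configurations, discard those violating the physical budgets, and rank the rest. A toy Python version with placeholder area and throughput models (the paper's models are far more detailed and include thermal constraints):

```python
# Toy constrained CMP design-space search. The area and throughput
# models are made-up placeholders, not the paper's models.

import itertools

AREA_BUDGET = 400  # mm^2, assumed

def area(cores, width, l2_mb):
    return cores * (10 + 6 * width) + 25 * l2_mb

def throughput(cores, width, l2_mb):            # toy analytical model
    per_core = width ** 0.5 * (1 + 0.1 * l2_mb / cores)
    return cores * per_core

space = itertools.product([2, 4, 8, 16],        # cores
                          [1, 2, 4],            # superscalar width
                          [1, 2, 4, 8])         # L2 MB
feasible = [c for c in space if area(*c) <= AREA_BUDGET]
best = max(feasible, key=lambda c: throughput(*c))
print("best (cores, width, L2 MB):", best)
```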


ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming | 2007

Methods of inference and learning for performance modeling of parallel applications

Benjamin C. Lee; David M. Brooks; Bronis R. de Supinski; Martin Schulz; Karan Singh; Sally A. McKee

Increasing system and algorithmic complexity, combined with a growing number of tunable application parameters, poses significant challenges for analytical performance modeling. We propose a series of robust techniques to address these challenges. In particular, we apply statistical techniques such as clustering, association, and correlation analysis to better understand the application parameter space. We construct and compare two classes of effective predictive models: piecewise polynomial regression and artificial neural networks. We compare these techniques with theoretical analyses and experimental results. Overall, both regression and neural networks are accurate, with median error rates ranging from 2.2 to 10.5 percent. The comparable accuracy of these models suggests that differentiating features will arise from ease of use, transparency, and computational efficiency.
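
A compressed view of the regression half of the methodology: measure a handful of configurations, fit a polynomial in the tunable parameter, and check median error on the rest. The synthetic runtime curve below is a placeholder for real measurements:

```python
# Polynomial-regression sketch for performance modeling of a parallel
# application across a tunable thread count. Data is synthetic.

import numpy as np

rng = np.random.default_rng(1)
threads = np.arange(1, 65)
runtime = 100 / threads + 0.05 * threads + rng.normal(0, 0.2, 64)

train = rng.choice(64, 16, replace=False)       # measure only 16 configs
coef = np.polyfit(np.log2(threads[train]), runtime[train], deg=3)
pred = np.polyval(coef, np.log2(threads))

err = np.abs(pred - runtime) / runtime
print(f"median error: {100 * np.median(err):.1f}%")
```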


International Symposium on Computer Architecture | 2012

Towards energy-proportional datacenter memory with mobile DRAM

Krishna T. Malladi; Benjamin C. Lee; Frank A. Nothaft; Christos Kozyrakis; Karthika Periyathambi; Mark Horowitz

To increase datacenter energy efficiency, we need memory systems that keep pace with processor efficiency gains. Currently, servers use DDR3 memory, which is designed for high bandwidth but not for energy proportionality. A system using 20% of peak DDR3 bandwidth consumes 2.3× the energy per bit of a system with fully utilized memory bandwidth. Nevertheless, many datacenter applications stress memory capacity and latency but not memory bandwidth. In response, we architect server memory systems using mobile DRAM devices, trading peak bandwidth for lower energy consumption per bit and more efficient idle modes. We demonstrate 3-5× lower memory power, better proportionality, and negligible performance penalties for datacenter workloads.
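
The 2.3× figure falls out of a fixed background power that is amortized over less traffic at low utilization. A worked sketch with assumed watt values chosen only to reproduce the shape of the argument:

```python
# Worked proportionality arithmetic. Only the 2.3x-at-20%-utilization
# shape comes from the paper; the watt values below are assumed.

PEAK_BW_BITS = 12.8e9 * 8       # one DDR3-1600 channel, bits/s

def energy_per_bit(active_w, idle_w, utilization):
    # Active power scales with traffic; background power does not.
    total_w = active_w * utilization + idle_w
    return total_w / (PEAK_BW_BITS * utilization)

ddr3 = energy_per_bit(4.5, 2.2, 0.20) / energy_per_bit(4.5, 2.2, 1.0)
print(f"DDR3-style energy/bit, 20% vs. full utilization: {ddr3:.1f}x")

# Mobile LPDDR trades peak bandwidth for much lower background power,
# so the same ratio stays near 1 -- i.e., closer to proportional:
lpddr = energy_per_bit(3.0, 0.3, 0.20) / energy_per_bit(3.0, 0.3, 1.0)
print(f"LPDDR-style energy/bit, 20% vs. full utilization: {lpddr:.1f}x")
```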


Conference on High Performance Computing (Supercomputing) | 2002

Performance Optimizations and Bounds for Sparse Matrix-Vector Multiply

Richard W. Vuduc; James Demmel; Katherine A. Yelick; Shoaib Kamil; Rajesh Nishtala; Benjamin C. Lee

We consider performance tuning, by code and data structure reorganization, of sparse matrix-vector multiply (SpM×V), one of the most important computational kernels in scientific applications. This paper addresses the fundamental questions of what limits exist on such performance tuning, and how closely tuned code approaches these limits. Specifically, we develop upper and lower bounds on the performance (Mflop/s) of SpM×V when tuned using our previously proposed register blocking optimization. These bounds are based on the non-zero pattern in the matrix and the cost of basic memory operations, such as cache hits and misses. We evaluate our tuned implementations with respect to these bounds using hardware counter data on four different platforms and a test set of 44 sparse matrices. We find that we can often get within 20% of the upper bound, particularly on a class of matrices from finite element modeling (FEM) problems; on non-FEM matrices, performance improvements of 2× are still possible. Lastly, we present a new heuristic that selects optimal or near-optimal register block sizes (the key tuning parameters) more accurately than our previous heuristic. Using the new heuristic, we show improvements in SpM×V performance (Mflop/s) by as much as 2.5× over an untuned implementation. Collectively, our results suggest that future performance improvements, beyond those that we have already demonstrated for SpM×V, will come from two sources: (1) consideration of higher-level matrix structures (e.g., exploiting symmetry, matrix reordering, multiple register block sizes), and (2) optimizing kernels with more opportunity for data reuse (e.g., sparse matrix-multiple vector multiply, multiplication of AᵀA by a vector).
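
For reference, the untuned kernel in question is a plain CSR sparse matrix-vector multiply; register blocking replaces single entries with small dense r×c blocks to improve register reuse (the blocking itself is omitted here for brevity):

```python
# Plain compressed-sparse-row (CSR) sparse matrix-vector multiply, the
# kernel being tuned in the paper.

import numpy as np

def spmv_csr(values, col_idx, row_ptr, x):
    """y = A @ x with A in compressed sparse row form."""
    y = np.zeros(len(row_ptr) - 1)
    for i in range(len(y)):
        for k in range(row_ptr[i], row_ptr[i + 1]):
            y[i] += values[k] * x[col_idx[k]]
    return y

# A = [[2, 0, 1],
#      [0, 3, 0],
#      [4, 0, 5]]
values  = np.array([2.0, 1.0, 3.0, 4.0, 5.0])
col_idx = np.array([0, 2, 1, 0, 2])
row_ptr = np.array([0, 2, 3, 5])
print(spmv_csr(values, col_idx, row_ptr, np.array([1.0, 1.0, 1.0])))
# -> [3. 3. 9.]
```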


International Symposium on Computer Architecture | 2010

Energy-performance tradeoffs in processor architecture and circuit design: a marginal cost analysis

Omid Azizi; Aqeel Mahesri; Benjamin C. Lee; Sanjay J. Patel; Mark Horowitz

Power consumption has become a major constraint in the design of processors today. Optimizing a processor for energy efficiency requires an examination of energy-performance trade-offs in all aspects of the processor design space, including both architectural and circuit design choices. In this paper, we apply an integrated architecture-circuit optimization framework to map out energy-performance trade-offs of several different high-level processor architectures. We show how the joint architecture-circuit space provides a trade-off range of approximately 6.5x in performance for 4x energy, and we identify the optimal architectures for different design objectives. We then show that many of the designs in this space come at very high marginal costs. Our results show that, for a large range of design objectives, voltage scaling is effective in efficiently trading off performance and energy, and that the choice of optimal architecture and circuits does not change much during voltage scaling. Finally, we show that with only two designs--a dual-issue in-order design and a dual-issue out-of-order design, both properly optimized--a large part of the energy-performance trade-off space can be covered within 3% of the optimal energy efficiency.
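
The marginal-cost analysis amounts to walking the energy-performance Pareto frontier and pricing each additional unit of performance in energy. A small sketch over made-up design points:

```python
# Marginal-cost sketch: filter (performance, energy) design points down
# to the Pareto frontier, then report dE/dPerf between neighbors. The
# points below are invented, not the paper's data.

designs = {                      # name: (normalized perf, energy/op)
    "in-order-1w":  (1.0, 1.0),
    "in-order-2w":  (1.6, 1.3),
    "ooo-2w":       (2.4, 2.0),
    "ooo-4w":       (3.0, 3.2),
    "ooo-4w-deep":  (3.2, 4.0),
}

# Pareto frontier: drop any point that is slower yet no cheaper than a
# faster point.
pts = sorted(designs.items(), key=lambda kv: kv[1][0])
frontier = []
for name, (perf, energy) in pts:
    while frontier and frontier[-1][1][1] >= energy:
        frontier.pop()           # dominated by the current point
    frontier.append((name, (perf, energy)))

for (n0, (p0, e0)), (n1, (p1, e1)) in zip(frontier, frontier[1:]):
    print(f"{n0} -> {n1}: {(e1 - e0) / (p1 - p0):.2f} energy per perf")
```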

Collaboration


Dive into Benjamin C. Lee's collaborations.

Top Co-Authors

Engin Ipek (University of Rochester)
James Demmel (University of California)
Katherine A. Yelick (Lawrence Berkeley National Laboratory)