David H. Albonesi | Researchain

Publication

Featured research published by David H. Albonesi.

International Symposium on Microarchitecture | 1999

Selective cache ways: on-demand cache resource allocation

David H. Albonesi

Increasing levels of microprocessor power dissipation call for new approaches at the architectural level that save energy by better matching of on-chip resources to application requirements. Selective cache ways provides the ability to disable a subset of the ways in a set associative cache during periods of modest cache activity, while the full cache may remain operational for more cache-intensive periods. Because this approach leverages the subarray partitioning that is already present for performance reasons, only minor changes to a conventional cache are required and therefore, full-speed cache operation can be maintained. Furthermore, the tradeoff between performance and energy is flexible, and can be dynamically tailored to meet changing application and machine environmental conditions. We show that trading off a small performance degradation for energy savings can produce a significant reduction in cache energy dissipation using this approach.
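The way-disabling idea in this abstract can be illustrated with a toy model. The following is a minimal sketch, not the paper's hardware design: the class name `SelectiveWaysCache`, its parameters, the LRU replacement policy, and the use of a probe counter as an energy proxy are all assumptions made for illustration. In particular, the sketch simply invalidates lines in disabled ways, whereas a real design would have to handle any data held there.

```python
class SelectiveWaysCache:
    """Toy set-associative cache whose ways can be enabled or disabled.

    Hypothetical model for illustration only; it tracks tags, not data.
    """

    def __init__(self, num_sets=4, num_ways=4, line_size=16):
        self.num_sets = num_sets
        self.num_ways = num_ways
        self.line_size = line_size
        self.enabled = [True] * num_ways
        # tags[s][w] holds the tag stored in way w of set s, or None.
        self.tags = [[None] * num_ways for _ in range(num_sets)]
        # lru[s] lists way indices from least to most recently used.
        self.lru = [list(range(num_ways)) for _ in range(num_sets)]
        # Number of ways probed so far: a crude stand-in for access energy.
        self.way_probes = 0

    def set_enabled_ways(self, count):
        """Enable the first `count` ways and invalidate the rest.

        Assumption: contents of disabled ways are simply dropped.
        """
        if not 1 <= count <= self.num_ways:
            raise ValueError("count must be between 1 and num_ways")
        self.enabled = [w < count for w in range(self.num_ways)]
        for s in range(self.num_sets):
            for w in range(count, self.num_ways):
                self.tags[s][w] = None

    def access(self, address):
        """Look up an address; return True on a hit, False on a miss."""
        block = address // self.line_size
        s = block % self.num_sets
        tag = block // self.num_sets
        active = [w for w in range(self.num_ways) if self.enabled[w]]
        self.way_probes += len(active)  # only enabled ways are probed
        way = next((w for w in active if self.tags[s][w] == tag), None)
        hit = way is not None
        if not hit:
            # Evict the least recently used way among the enabled ones.
            way = next(w for w in self.lru[s] if self.enabled[w])
            self.tags[s][way] = tag
        self.lru[s].remove(way)
        self.lru[s].append(way)
        return hit


if __name__ == "__main__":
    trace = [0, 64, 128, 192] * 2  # four blocks that all map to set 0
    for ways in (4, 2):
        cache = SelectiveWaysCache()
        cache.set_enabled_ways(ways)
        hits = sum(cache.access(a) for a in trace)
        print(ways, "ways:", hits, "hits,", cache.way_probes, "way probes")
```

On this small trace, halving the enabled ways halves the probe count but loses every hit, which is the kind of performance and energy tradeoff the abstract describes; with a smaller working set the reduced configuration would lose little or nothing.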

High-Performance Computer Architecture | 2002

Energy-efficient processor design using multiple clock domains with dynamic voltage and frequency scaling

Greg Semeraro; Grigorios Magklis; Rajeev Balasubramonian; David H. Albonesi; Sandhya Dwarkadas; Michael L. Scott

As clock frequency increases and feature size decreases, clock distribution and wire delays present a growing challenge to the designers of singly-clocked, globally synchronous systems. We describe an alternative approach, which we call a multiple clock domain (MCD) processor, in which the chip is divided into several clock domains, within which independent voltage and frequency scaling can be performed. Boundaries between domains are chosen to exploit existing queues, thereby minimizing inter-domain synchronization costs. We propose four clock domains, corresponding to the front end, integer units, floating point units, and load-store units. We evaluate this design using a simulation infrastructure based on SimpleScalar and Wattch. In an attempt to quantify potential energy savings independent of any particular on-line control strategy, we use off-line analysis of traces from a single-speed run of each of our benchmark applications to identify profitable reconfiguration points for a subsequent dynamic scaling run. Using applications from the MediaBench, Olden, and SPEC2000 benchmark suites, we obtain an average energy-delay product improvement of 20% with MCD compared to a modest 3% savings from voltage scaling a single clock and voltage system.

International Symposium on Microarchitecture | 2000

Memory hierarchy reconfiguration for energy and performance in general-purpose processor architectures

Rajeev Balasubramonian; David H. Albonesi; Alper Buyuktosunoglu; Sandhya Dwarkadas

Conventional microarchitectures choose a single memory hierarchy design point targeted at the average application. In this paper, we propose a cache and TLB layout and design that leverages repeater insertion to provide dynamic low-cost configurability trading off size and speed on a per application phase basis. A novel configuration management algorithm dynamically detects phase changes and reacts to an application's hit and miss intolerance in order to improve memory hierarchy performance while taking energy consumption into consideration. When applied to a two-level cache and TLB hierarchy at 0.1 µm technology, the result is an average 15% reduction in cycles per instruction (CPI), corresponding to an average 27% reduction in memory-CPI, across a broad class of applications compared to the best conventional two-level hierarchy of comparable size. Projecting to sub-0.1 µm technology design considerations that call for a three-level conventional cache hierarchy for performance reasons, we demonstrate that a configurable L2/L3 cache hierarchy coupled with a conventional L1 results in an average 43% reduction in memory hierarchy energy in addition to improved performance.

IEEE Journal of Selected Topics in Quantum Electronics | 2006

On-Chip Optical Interconnect Roadmap: Challenges and Critical Directions

Mikhail Haurylau; Guoqing Chen; Hui Chen; Jidong Zhang; Nicholas A. Nelson; David H. Albonesi; Eby G. Friedman; Philippe M. Fauchet

Intrachip optical interconnects (OIs) have the potential to outperform electrical wires and to ultimately solve the communication bottleneck in high-performance integrated circuits. Performance targets and critical directions for OI progress are yet to be fully explored. In this paper, the International Technology Roadmap for Semiconductors (ITRS) is used as a reference to explore the requirements that silicon-based OIs must satisfy to successfully outperform copper electrical interconnects (EIs). Considering the state-of-the-art devices, these requirements are extended to specific OI components.

International Symposium on Microarchitecture | 2006

Leveraging Optical Technology in Future Bus-based Chip Multiprocessors

Nevin Kirman; Meyrem Kirman; Rajeev K. Dokania; Jose F. Martinez; Alyssa B. Apsel; Matthew A. Watkins; David H. Albonesi

Although silicon optical technology is still in its formative stages, and the more near-term application is chip-to-chip communication, rapid advances have been made in the development of on-chip optical interconnects. In this paper, we investigate the integration of CMOS-compatible optical technology to on-chip cache-coherent buses in future CMPs. While not exhaustive, our investigation yields a hierarchical opto-electrical system that exploits the advantages of optical technology while abiding by projected limitations. Our evaluation shows that, for the applications considered, compared to an aggressive all-electrical bus of similar power and area, significant performance improvements can be achieved using an opto-electrical bus. This performance improvement is largely dependent on the application's bandwidth demand and on the number of implemented wavelengths per optical waveguide. We also present a number of critical areas for future work that we discover in the course of our research.

International Symposium on Microarchitecture | 2002

Dynamic frequency and voltage control for a multiple clock domain microarchitecture

Greg Semeraro; David H. Albonesi; Steven G. Dropsho; Grigorios Magklis; Sandhya Dwarkadas; Michael L. Scott

We describe the design, analysis, and performance of an on-line algorithm to dynamically control the frequency/voltage of a Multiple Clock Domain (MCD) microarchitecture. The MCD microarchitecture allows the frequency/voltage of microprocessor regions to be adjusted independently and dynamically, allowing energy savings when the frequency of some regions can be reduced without significantly impacting performance. Our algorithm achieves on average a 19.0% reduction in Energy Per Instruction (EPI), a 3.2% increase in Cycles Per Instruction (CPI), a 16.7% improvement in Energy-Delay Product, and a Power Savings to Performance Degradation ratio of 4.6. Traditional frequency/voltage scaling techniques that apply reductions globally to a fully synchronous processor achieve a Power Savings to Performance Degradation ratio of only 2-3. Our Energy-Delay Product improvement is 85.5% of what has been achieved using an off-line algorithm. These results were achieved using a broad range of applications from the MediaBench, Olden, and SPEC2000 benchmark suites using an algorithm we show to require minimal hardware resources.
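The relationship between the energy, delay, and energy-delay product figures quoted in this abstract can be checked with a short calculation. This is a rough sketch under stated assumptions, not the authors' methodology: the helper `relative_edp` is hypothetical, and it assumes that delay scales directly with the reported CPI (that is, CPI is measured against a fixed reference clock).

```python
def relative_edp(energy_change, delay_change):
    """Fractional change in energy-delay product (energy x delay).

    Both arguments are fractional changes relative to a baseline,
    e.g. -0.19 for a 19% reduction and +0.032 for a 3.2% increase.
    """
    return (1 + energy_change) * (1 + delay_change) - 1


# Average figures quoted in the abstract: EPI down 19.0%, CPI up 3.2%.
edp_change = relative_edp(-0.190, +0.032)
print(f"EDP improvement from the averages: {-edp_change:.1%}")  # 16.4%
```

Combining the two averages gives roughly 16.4%, close to but not identical to the reported 16.7%. One plausible reason is that the abstract's figure averages per-benchmark energy-delay products, and an average of products generally differs from the product of averages.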

International Symposium on Microarchitecture | 2001

Reducing the complexity of the register file in dynamic superscalar processors

Rajeev Balasubramonian; Sandhya Dwarkadas; David H. Albonesi

Dynamic superscalar processors execute multiple instructions out-of-order by looking for independent operations within a large window. The number of physical registers within the processor has a direct impact on the size of this window as most in-flight instructions require a new physical register at dispatch. A large multi-ported register file helps improve the instruction-level parallelism (ILP), but may have a detrimental effect on clock speed, especially in future wire-limited technologies. In this paper, we propose a register file organization that reduces register file size and port requirements for a given amount of ILP. We use a two-level register file organization to reduce register file size requirements, and a banked organization to reduce port requirements. We demonstrate empirically that the resulting register file organizations have reduced latency and (in the case of the banked organization) energy requirements for similar instructions per cycle (IPC) performance and improved instructions per second (IPS) performance in comparison to a conventional monolithic register file. The choice of organization is dependent on design goals.

International Symposium on Computer Architecture | 2009

Phastlane: a rapid transit optical routing network

Mark J. Cianchetti; Joseph C. Kerekes; David H. Albonesi

Tens and eventually hundreds of processing cores are projected to be integrated onto future microprocessors, making the global interconnect a key component to achieving scalable chip performance within a given power envelope. While CMOS-compatible nanophotonics has emerged as a leading candidate for replacing global wires beyond the 22nm timeframe, on-chip optical interconnect architectures proposed thus far are either limited in scalability or are dependent on comparatively slow electrical control networks. In this paper, we present Phastlane, a hybrid electrical/optical routing network for future large scale, cache coherent multicore microprocessors. The heart of the Phastlane network is a low-latency optical crossbar that uses simple predecoded source routing to transmit cache-line-sized packets several hops in a single clock cycle under contentionless conditions. When contention exists, the router makes use of electrical buffers and, if necessary, a high speed drop signaling network. Overall, Phastlane achieves 2X better network performance than a state-of-the-art electrical baseline while consuming 80% less network power.

System-Level Interconnect Prediction | 2005

Predictions of CMOS compatible on-chip optical interconnect

Guoqing Chen; Hui Chen; Mikhail Haurylau; Nicholas A. Nelson; Philippe M. Fauchet; Eby G. Friedman; David H. Albonesi

Interconnect has become a primary bottleneck in integrated circuit design. As CMOS technology is scaled, it will become increasingly difficult for conventional copper interconnect to satisfy the design requirements of delay, power, bandwidth, and noise. On-chip optical interconnect has been considered as a potential substitute for electrical interconnect over the past two decades. In this paper, predictions of the performance of CMOS compatible optical devices are made based on current state-of-the-art optical technologies. Electrical and optical interconnects are compared for various design criteria based on these predictions. The critical dimensions beyond which optical interconnect becomes advantageous over electrical interconnect are shown to be approximately one tenth of the chip edge length at the 22 nm technology node.

IEEE Computer | 2003

Dynamically tuning processor resources with adaptive processing

David H. Albonesi; Rajeev Balasubramonian; S.G. Dropsho; Sandhya Dwarkadas; Eby G. Friedman; Michael C. Huang; Volkan Kursun; Grigorios Magklis; Michael L. Scott; Greg Semeraro; Pradip Bose; Alper Buyuktosunoglu; Peter W. Cook; Stanley E. Schuster

By using adaptive processing to dynamically tune major microprocessor resources, developers can achieve greater energy efficiency with reasonable hardware and software overhead while avoiding undue performance loss. Adaptive processors require few additional transistors. Further, because adaptation occurs only in response to infrequent trigger events, the decision logic can be placed into a low-leakage state until such events occur.
