
Publication


Featured research published by Babak Falsafi.


International Symposium on Low Power Electronics and Design | 2000

Gated-Vdd: a circuit technique to reduce leakage in deep-submicron cache memories

Michael D. Powell; Se-Hyun Yang; Babak Falsafi; Kaushik Roy; T. N. Vijaykumar

Deep-submicron CMOS designs have resulted in large leakage energy dissipation in microprocessors. While SRAM cells in on-chip cache memories always contribute to this leakage, there is a large variability in active cell usage both within and across applications. This paper explores an integrated architectural and circuit-level approach to reducing leakage energy dissipation in instruction caches. We propose gated-Vdd, a circuit-level technique to gate the supply voltage and reduce leakage in unused SRAM cells. Our results indicate that gated-Vdd together with a novel resizable cache architecture reduces energy-delay by 62% with minimal impact on performance.
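The leakage argument lends itself to a quick estimate. Below is a back-of-envelope model, not drawn from the paper: the per-line leakage figure and the residual leakage through the gating transistor are hypothetical, and the sketch only shows how savings scale with the fraction of unused lines.

```python
# Back-of-envelope model of leakage savings from gating unused SRAM lines.
# All parameters are illustrative stand-ins, not figures from the paper.

def leakage_energy(total_lines, active_lines, leak_per_line_nj,
                   gated_residual=0.03):
    """Leakage per cycle: active lines leak fully; gated (unused) lines
    retain a small residual leakage through the gating transistor."""
    gated_lines = total_lines - active_lines
    return (active_lines * leak_per_line_nj
            + gated_lines * leak_per_line_nj * gated_residual)

baseline = leakage_energy(4096, 4096, 0.01)   # no gating: every line leaks
gated    = leakage_energy(4096, 1024, 0.01)   # only 1/4 of lines in active use
print(f"leakage reduced by {100 * (1 - gated / baseline):.1f}%")
```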


International Symposium on Computer Architecture | 2003

SMARTS: accelerating microarchitecture simulation via rigorous statistical sampling

Roland E. Wunderlich; Thomas F. Wenisch; Babak Falsafi; James C. Hoe

Current software-based microarchitecture simulators are many orders of magnitude slower than the hardware they simulate. Hence, most microarchitecture design studies draw their conclusions from drastically truncated benchmark simulations that are often inaccurate and misleading. This paper presents the Sampling Microarchitecture Simulation (SMARTS) framework as an approach to enable fast and accurate performance measurements of full-length benchmarks. SMARTS accelerates simulation by selectively measuring in detail only an appropriate benchmark subset. SMARTS prescribes a statistically sound procedure for configuring a systematic sampling simulation run to achieve a desired quantifiable confidence in estimates. Analysis of 41 of the 45 possible SPEC2K benchmark/input combinations shows that CPI and energy per instruction (EPI) can be estimated to within ±3% with 99.7% confidence by measuring fewer than 50 million instructions per benchmark. In practice, inaccuracy in microarchitectural state initialization introduces an additional uncertainty, which we empirically bound to ∼2% for the tested benchmarks. Our implementation of SMARTS achieves an actual average error of only 0.64% on CPI and 0.59% on EPI for the tested benchmarks, running with average speedups of 35x and 60x over detailed simulation of 8-way and 16-way out-of-order processors, respectively.
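A minimal sketch of the systematic-sampling idea, using synthetic stand-in data rather than a real simulator: measure every k-th unit of a run and form a confidence interval around the mean CPI. The unit size, sampling period, and CPI distribution below are all illustrative assumptions.

```python
import math
import random

random.seed(42)
# Synthetic per-unit CPI values standing in for detailed measurement
# units; the population size and CPI shape are made up for illustration.
population = [1.0 + 0.4 * math.sin(i / 50) + random.gauss(0, 0.2)
              for i in range(100_000)]

period = 100                                # measure every 100th unit in detail
start = random.randrange(period)            # random starting offset
sample = population[start::period]          # systematic sample

n = len(sample)
mean = sum(sample) / n
var = sum((x - mean) ** 2 for x in sample) / (n - 1)
half_width = 3 * math.sqrt(var / n)         # z = 3 -> ~99.7% confidence
print(f"CPI ~= {mean:.3f} +/- {half_width:.3f} "
      f"from {n} of {len(population)} units")
```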


Architectural Support for Programming Languages and Operating Systems | 2012

Clearing the clouds: a study of emerging scale-out workloads on modern hardware

Michael Ferdman; Almutaz Adileh; Onur Kocberber; Stavros Volos; Mohammad Alisafaee; Djordje Jevdjic; Cansu Kaynak; Adrian Popescu; Anastasia Ailamaki; Babak Falsafi

Emerging scale-out workloads require extensive amounts of computational resources. However, data centers using modern server hardware face physical constraints in space and power, limiting further expansion and calling for improvements in the computational density per server and in the per-operation energy. Continuing to improve the computational resources of the cloud while staying within physical constraints mandates optimizing server efficiency to ensure that server hardware closely matches the needs of scale-out workloads. In this work, we introduce CloudSuite, a benchmark suite of emerging scale-out workloads. We use performance counters on modern servers to study scale-out workloads, finding that today's predominant processor micro-architecture is inefficient for running these workloads. We find that inefficiency comes from the mismatch between the workload needs and modern processors, particularly in the organization of instruction and data memory systems and the processor core micro-architecture. Moreover, while today's predominant micro-architecture is inefficient when executing scale-out workloads, we find that continuing the current trends will further exacerbate the inefficiency in the future. In this work, we identify the key micro-architectural needs of scale-out workloads, calling for a change in the trajectory of server processors that would lead to improved computational density and power efficiency in data centers.


International Symposium on Computer Architecture | 2009

Reactive NUCA: near-optimal block placement and replication in distributed caches

Nikos Hardavellas; Michael Ferdman; Babak Falsafi; Anastasia Ailamaki

Increases in on-chip communication delay and the large working sets of server and scientific workloads complicate the design of the on-chip last-level cache for multicore processors. The large working sets favor a shared cache design that maximizes the aggregate cache capacity and minimizes off-chip memory requests. At the same time, the growing on-chip communication delay favors core-private caches that replicate data to minimize delays on global wires. Recent hybrid proposals offer lower average latency than conventional designs, but they address the placement requirements of only a subset of the data accessed by the application, require complex lookup and coherence mechanisms that increase latency, or fail to scale to high core counts. In this work, we observe that the cache access patterns of a range of server and scientific workloads can be classified into distinct classes, where each class is amenable to different block placement policies. Based on this observation, we propose Reactive NUCA (R-NUCA), a distributed cache design which reacts to the class of each cache access and places blocks at the appropriate location in the cache. R-NUCA cooperates with the operating system to support intelligent placement, migration, and replication without the overhead of an explicit coherence mechanism for the on-chip last-level cache. In a range of server, scientific, and multiprogrammed workloads, R-NUCA matches the performance of the best cache design for each workload, improving performance by 14% on average over competing designs and by 32% at best, while achieving performance within 5% of an ideal cache design.
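A hypothetical sketch of classification-driven placement in the spirit of R-NUCA: private data maps to the requesting core's local slice, shared data is address-interleaved across all slices, and instructions are replicated within a small cluster of slices. The slice count, cluster size, and reclassification rule are illustrative assumptions, not the paper's exact mechanism.

```python
from dataclasses import dataclass

NUM_SLICES = 16   # last-level cache slices, one per core (illustrative)
CLUSTER = 4       # replication cluster size for instruction blocks

@dataclass
class Block:
    addr: int
    is_instruction: bool = False

owner = {}  # data block addr -> first core that touched it

def classify(block, core):
    """Instructions form their own class; a data block touched by more
    than one core is reclassified from private to shared."""
    if block.is_instruction:
        return "instruction"
    first = owner.setdefault(block.addr, core)
    return "private" if first == core else "shared"

def home_slice(block, core):
    cls = classify(block, core)
    if cls == "private":
        return core % NUM_SLICES               # requesting core's local slice
    if cls == "shared":
        return (block.addr >> 6) % NUM_SLICES  # interleave across all slices
    # Instructions: replicate at cluster granularity near the core.
    base = (core // CLUSTER) * CLUSTER
    return base + ((block.addr >> 6) % CLUSTER)

data = Block(addr=0x8040)
print(home_slice(data, core=3))   # private: placed at slice 3
print(home_slice(data, core=9))   # second sharer: now address-interleaved
code = Block(addr=0x1000, is_instruction=True)
print(home_slice(code, core=9))   # replicated within cores 8-11's cluster
```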


International Symposium on Microarchitecture | 2007

Multi-bit Error Tolerant Caches Using Two-Dimensional Error Coding

Jangwoo Kim; Nikos Hardavellas; Ken Mai; Babak Falsafi; James C. Hoe

In deep sub-micron ICs, growing amounts of on-die memory and scaling effects make embedded memories increasingly vulnerable to reliability and yield problems. As scaling progresses, soft and hard errors in the memory system will increase and single error events are more likely to cause large-scale multi-bit errors. However, conventional memory protection techniques can neither detect nor correct large-scale multi-bit errors without incurring large performance, area, and power overheads. We propose two-dimensional (2D) error coding in embedded memories, a scalable multi-bit error protection technique to improve memory reliability and yield. The key innovation is the use of vertical error coding across words, used only for error correction, in combination with conventional per-word horizontal error coding. We evaluate this scheme in the cache hierarchies of two representative chip multiprocessor designs and show that 2D error coding can correct clustered errors up to 32×32 bits with significantly smaller performance, area, and power overheads than conventional techniques.
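The row-times-column localization idea can be illustrated with plain even parity, though the paper's design uses real error codes and targets large clustered errors. In this toy sketch, a failing horizontal (per-word) parity pinpoints the row, and the vertical parity word recovers which bits to flip:

```python
# Toy 2D coding: one parity bit per word (horizontal) plus one parity
# word XORed across all words (vertical). A single flipped bit is
# located by intersecting the failing row with the failing columns.

def encode(words):
    h = [bin(w).count("1") & 1 for w in words]  # per-word (row) parity
    v = 0
    for w in words:
        v ^= w                                  # column-wise parity word
    return h, v

def correct_single_bit(words, h, v):
    bad_row = next((i for i, w in enumerate(words)
                    if (bin(w).count("1") & 1) != h[i]), None)
    if bad_row is None:
        return words                            # no error detected
    col_syndrome = v
    for w in words:
        col_syndrome ^= w                       # set bits mark bad columns
    words[bad_row] ^= col_syndrome              # flip the faulty bit back
    return words

mem = [0b1011_0010, 0b0110_1100, 0b0001_1111]
h, v = encode(mem)
mem[1] ^= 0b0000_0100                           # inject one bit flip
assert correct_single_bit(mem, h, v) == [0b1011_0010, 0b0110_1100, 0b0001_1111]
print("single-bit error located and corrected")
```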


International Symposium on Microarchitecture | 2006

SimFlex: Statistical Sampling of Computer System Simulation

Thomas F. Wenisch; Roland E. Wunderlich; Michael Ferdman; Anastassia Ailamaki; Babak Falsafi; James C. Hoe

Timing-accurate full-system multiprocessor simulations can take years because of architecture and application complexity. Statistical sampling makes simulation-based studies feasible by providing ten-thousand-fold reductions in simulation runtime and enabling thousand-way simulation parallelism.


International Symposium on Computer Architecture | 2001

Dead-block prediction & dead-block correlating prefetchers

An-Chow Lai; Cem Fide; Babak Falsafi

Effective data prefetching requires accurate mechanisms to predict both “which” cache blocks to prefetch and “when” to prefetch them. This paper proposes the Dead-Block Predictors (DBPs), trace-based predictors that accurately identify “when” an L1 data cache block becomes evictable or “dead”. Predicting a dead block significantly enhances prefetching lookahead and opportunity, and enables placing data directly into L1, obviating the need for auxiliary prefetch buffers. This paper also proposes Dead-Block Correlating Prefetchers (DBCPs) that use address correlation to predict “which” subsequent block to prefetch when a block becomes evictable. A DBCP enables effective data prefetching in a wide spectrum of pointer-intensive, integer, and floating-point applications. We use cycle-accurate simulation of an out-of-order superscalar processor and memory-intensive benchmarks to show that: (1) dead-block prediction enhances prefetching lookahead by at least an order of magnitude as compared to previous techniques, (2) a DBP can predict dead blocks on average with a coverage of 90%, mispredicting only 4% of the time, (3) a DBCP offers an address prediction coverage of 86%, mispredicting only 3% of the time, and (4) DBCPs improve performance by 62% on average and 282% at best in the benchmarks we studied.
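A hypothetical sketch of the trace-based idea: each resident block accumulates a signature of the PCs that touch it, the block's final signature at eviction is learned as a "dead" trace, and a later block whose running signature matches a learned trace is predicted dead. The hash and table organization here are illustrative, not the paper's.

```python
# Sketch of trace-based dead-block prediction. A real DBP uses fixed-size
# hardware tables; this uses unbounded Python containers for clarity.

class DeadBlockPredictor:
    def __init__(self):
        self.dead_traces = set()   # signatures observed at eviction
        self.live_sig = {}         # block address -> running signature

    def on_access(self, addr, pc):
        """Update the block's trace signature; return True if the block
        is predicted dead (its trace matches a known dead trace)."""
        sig = self.live_sig.get(addr, 0)
        sig = (sig * 31 + pc) & 0xFFFF          # truncated trace hash
        self.live_sig[addr] = sig
        return sig in self.dead_traces

    def on_evict(self, addr):
        """Learn that this block's trace ended in eviction."""
        sig = self.live_sig.pop(addr, None)
        if sig is not None:
            self.dead_traces.add(sig)

dbp = DeadBlockPredictor()
dbp.on_access(0x1000, pc=0x400)
dbp.on_evict(0x1000)                     # trace learned as dead
print(dbp.on_access(0x1000, pc=0x400))   # True: same trace -> predicted dead
```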


Architectural Support for Programming Languages and Operating Systems | 1994

Fine-grain access control for distributed shared memory

Ioannis Schoinas; Babak Falsafi; Alvin R. Lebeck; Steven K. Reinhardt; James R. Larus; David A. Wood

This paper discusses implementations of fine-grain memory access control, which selectively restricts reads and writes to cache-block-sized memory regions. Fine-grain access control forms the basis of efficient cache-coherent shared memory. This paper focuses on low-cost implementations that require little or no additional hardware. These techniques permit efficient implementation of shared memory on a wide range of parallel systems, thereby providing shared-memory codes with a portability previously limited to message passing. This paper categorizes techniques based on where access control is enforced and where access conflicts are handled. We incorporated three techniques that require no additional hardware into Blizzard, a system that supports distributed shared memory on the CM-5. The first adds a software lookup before each shared-memory reference by modifying the program's executable. The second uses the memory's error correcting code (ECC) as cache-block valid bits. The third is a hybrid. The software technique ranged from slightly faster to two times slower than the ECC approach. Blizzard's performance is roughly comparable to a hardware shared-memory machine. These results argue that clusters of workstations or personal computers with networks comparable to the CM-5's will be able to support the same shared-memory interfaces as supercomputers.
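A minimal sketch of the software-lookup technique, with a stub in place of the coherence protocol: every shared-memory load or store first consults a per-block state table and "traps" to a handler on an access conflict. In Blizzard these checks are inserted into the executable by binary rewriting; here they are ordinary function calls, and the block size and handler behavior are assumptions.

```python
from enum import Enum

BLOCK = 64  # access-control granularity in bytes (illustrative)

class State(Enum):
    INVALID = 0
    READONLY = 1
    WRITABLE = 2

state = {}                 # block number -> State (default INVALID)
memory = bytearray(4096)   # toy shared-memory segment

def handle_fault(block, want):
    # Stand-in for the coherence protocol: fetch or upgrade the block.
    state[block] = want

def checked_load(addr):
    block = addr // BLOCK
    if state.get(block, State.INVALID) == State.INVALID:
        handle_fault(block, State.READONLY)   # software "trap"
    return memory[addr]

def checked_store(addr, value):
    block = addr // BLOCK
    if state.get(block, State.INVALID) != State.WRITABLE:
        handle_fault(block, State.WRITABLE)   # read-only or invalid block
    memory[addr] = value

checked_store(130, 7)      # faults INVALID -> WRITABLE, then writes
print(checked_load(130))   # no fault: block already writable
```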


International Symposium on Microarchitecture | 2001

Reducing set-associative cache energy via way-prediction and selective direct-mapping

Michael D. Powell; Amit Agarwal; T. N. Vijaykumar; Babak Falsafi; Kaushik Roy

Set-associative caches achieve low miss rates for typical applications but result in significant energy dissipation. Set-associative caches minimize access time by probing all the data ways in parallel with the tag lookup, although the output of only the matching way is used. The energy spent accessing the other ways is wasted. Eliminating the wasted energy by performing the data lookup sequentially following the tag lookup substantially increases cache access time, and is unacceptable for high-performance L1 caches. In this paper, we apply two previously proposed techniques, way-prediction and selective direct-mapping, to reducing L1 cache dynamic energy while maintaining high performance. The techniques predict the matching way and probe only the predicted way and not all the ways, achieving energy savings. While these techniques were originally proposed to improve set-associative cache access times, this is the first paper to apply them to reducing cache energy. We evaluate the effectiveness of these techniques in reducing L1 d-cache, L1 i-cache, and overall processor energy. Using these techniques, our caches achieve the energy-delay of sequential access while maintaining the performance of parallel access. Relative to parallel-access L1 i- and d-caches, the techniques achieve an overall processor energy-delay reduction of 8%, while perfect way-prediction with no performance degradation achieves a 10% reduction. The performance degradation of the techniques is less than 3%, compared to an aggressive, 1-cycle, 4-way, parallel-access cache.
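An illustrative model of the energy argument, with stand-in numbers: probe only the predicted (most recently used) way first, fall back to probing the remaining ways on a mispredict, and compare the data-way probes spent against an always-parallel cache.

```python
import random

WAYS, SETS = 4, 256
mru = [0] * SETS   # per-set predicted way (most recently used)

def access(set_idx, actual_way):
    """Return data-way probes spent: 1 on a correct prediction, or
    1 + (WAYS - 1) = WAYS when the remaining ways must also be probed."""
    probes = 1 if mru[set_idx] == actual_way else WAYS
    mru[set_idx] = actual_way   # train the predictor
    return probes

random.seed(1)
predicted = parallel = 0
for _ in range(10_000):
    s = random.randrange(SETS)
    w = random.choices(range(WAYS), weights=[8, 1, 1, 1])[0]  # skewed reuse
    predicted += access(s, w)
    parallel += WAYS            # a parallel-access cache probes all ways
print(f"data-array energy vs parallel access: {100 * predicted / parallel:.0f}%")
```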


International Symposium on Microarchitecture | 2011

Toward Dark Silicon in Servers

Nikos Hardavellas; Michael Ferdman; Babak Falsafi; Anastasia Ailamaki

Server chips will not scale beyond a few tens to low hundreds of cores, and an increasing fraction of the chip in future technologies will be dark silicon that we cannot afford to power. Specialized multicore processors, however, can leverage the underutilized die area to overcome the initial power barrier, delivering significantly higher performance for the same bandwidth and power envelopes.

Collaboration


Dive into Babak Falsafi's collaborations.

Top Co-Authors

James C. Hoe, Carnegie Mellon University
Boris Grot, University of Edinburgh
David A. Wood, University of Wisconsin-Madison
Anastasia Ailamaki, École Polytechnique Fédérale de Lausanne