Nikos Hardavellas
Northwestern University
Publications
Featured research published by Nikos Hardavellas.
international symposium on computer architecture | 2009
Nikos Hardavellas; Michael Ferdman; Babak Falsafi; Anastasia Ailamaki
Increases in on-chip communication delay and the large working sets of server and scientific workloads complicate the design of the on-chip last-level cache for multicore processors. The large working sets favor a shared cache design that maximizes the aggregate cache capacity and minimizes off-chip memory requests. At the same time, the growing on-chip communication delay favors core-private caches that replicate data to minimize delays on global wires. Recent hybrid proposals offer lower average latency than conventional designs, but they address the placement requirements of only a subset of the data accessed by the application, require complex lookup and coherence mechanisms that increase latency, or fail to scale to high core counts. In this work, we observe that the cache access patterns of a range of server and scientific workloads can be classified into distinct classes, where each class is amenable to different block placement policies. Based on this observation, we propose Reactive NUCA (R-NUCA), a distributed cache design which reacts to the class of each cache access and places blocks at the appropriate location in the cache. R-NUCA cooperates with the operating system to support intelligent placement, migration, and replication without the overhead of an explicit coherence mechanism for the on-chip last-level cache. In a range of server, scientific, and multiprogrammed workloads, R-NUCA matches the performance of the best cache design for each workload, improving performance by 14% on average over competing designs and by 32% at best, while achieving performance within 5% of an ideal cache design.
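To make the placement policy concrete, here is a minimal sketch of class-based block placement in the spirit of R-NUCA; the class names, cluster size, and address interleaving are simplifications assumed for illustration, not the paper's exact mechanism.

```python
# Illustrative class-based placement (simplified; not the paper's exact mechanism).

def place_block(block_addr, access_class, requesting_core, num_slices, cluster_size=4):
    """Return the cache slice(s) where a block of the given access class should live."""
    if access_class == "private_data":
        # Keep private data in the requesting core's local slice.
        return [requesting_core]
    if access_class == "shared_data":
        # Spread shared read-write data across all slices by address interleaving.
        return [block_addr % num_slices]
    if access_class == "instructions":
        # Place instructions at a slice within the requesting core's cluster,
        # so each cluster of neighboring cores keeps its own nearby copy.
        cluster_base = (requesting_core // cluster_size) * cluster_size
        return [cluster_base + block_addr % cluster_size]
    raise ValueError(f"unknown access class: {access_class}")

# Example: core 5 of a 16-slice chip fetching an instruction block.
print(place_block(0x7F3A40, "instructions", requesting_core=5, num_slices=16))
```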
international symposium on microarchitecture | 2007
Jangwoo Kim; Nikos Hardavellas; Ken Mai; Babak Falsafi; James C. Hoe
In deep sub-micron ICs, growing amounts of on-die memory and scaling effects make embedded memories increasingly vulnerable to reliability and yield problems. As scaling progresses, soft and hard errors in the memory system will increase and single error events are more likely to cause large-scale multi-bit errors. However, conventional memory protection techniques can neither detect nor correct large-scale multi-bit errors without incurring large performance, area, and power overheads. We propose two-dimensional (2D) error coding in embedded memories, a scalable multi-bit error protection technique to improve memory reliability and yield. The key innovation is the use of vertical error coding across words that is used only for error correction in combination with conventional per-word horizontal error coding. We evaluate this scheme in the cache hierarchies of two representative chip multiprocessor designs and show that 2D error coding can correct clustered errors up to 32×32 bits with significantly smaller performance, area, and power overheads than conventional techniques.
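As a rough illustration of the horizontal-detect / vertical-correct division of labor, the toy code below pairs per-word parity with a single XOR parity word across the array; the real design uses stronger codes and protects far larger clustered errors.

```python
def hparity(word):
    return bin(word).count("1") & 1               # even parity over one word

def encode(words):
    vparity = 0
    for w in words:
        vparity ^= w                              # column-wise XOR across all words
    return [hparity(w) for w in words], vparity

def correct(words, hpars, vparity):
    bad = [i for i, w in enumerate(words) if hparity(w) != hpars[i]]
    if len(bad) == 1:                             # one corrupted word, any number of bits
        i = bad[0]
        rebuilt = vparity
        for j, w in enumerate(words):
            if j != i:
                rebuilt ^= w                      # recover the word from the column code
        words[i] = rebuilt
    return words

data = [0b10110010, 0b00011110, 0b11110000, 0b01010101]
hpars, vpar = encode(data)
data[2] ^= 0b00011100        # inject a 3-bit clustered error (odd count, so parity sees it)
print([bin(w) for w in correct(data, hpars, vpar)])
```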
international symposium on microarchitecture | 2011
Nikos Hardavellas; Michael Ferdman; Babak Falsafi; Anastasia Ailamaki
Server chips will not scale beyond a few tens to low hundreds of cores, and an increasing fraction of the chip in future technologies will be dark silicon that we cannot afford to power. Specialized multicore processors, however, can leverage the underutilized die area to overcome the initial power barrier, delivering significantly higher performance for the same bandwidth and power envelopes.
extending database technology | 2009
Ryan Johnson; Ippokratis Pandis; Nikos Hardavellas; Anastasia Ailamaki; Babak Falsafi
Database storage managers have long been able to efficiently handle multiple concurrent requests. Until recently, however, a computer contained only a few single-core CPUs, and therefore only a few transactions could simultaneously access the storage manager's internal structures. This allowed storage managers to use non-scalable approaches without any penalty. With the arrival of multicore chips, however, this situation is rapidly changing. More and more threads can run in parallel, stressing the internal scalability of the storage manager. Systems optimized for high performance at a limited number of cores are not assured similarly high performance at a higher core count, because unanticipated scalability obstacles arise. We benchmark four popular open-source storage managers (Shore, BerkeleyDB, MySQL, and PostgreSQL) on a modern multicore machine, and find that they all suffer in terms of scalability. We briefly examine the bottlenecks in the various storage engines. We then present Shore-MT, a multithreaded and highly scalable version of Shore which we developed by identifying and successively removing internal bottlenecks. When compared to other DBMSs, Shore-MT exhibits superior scalability and 2--4 times higher absolute throughput than its peers. We also show that designers should favor scalability over single-thread performance, and highlight important principles for writing scalable storage engines, illustrated with real examples from the development of Shore-MT.
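One of the scalability principles the paper highlights can be illustrated with a small sketch (not Shore-MT code): replacing a single global latch on a shared structure with striped, per-partition latches so that concurrent threads rarely collide.

```python
import threading

class StripedCounterTable:
    """Toy shared structure: a hash table of counters with one lock per stripe
    instead of one global lock for the whole table."""
    def __init__(self, num_stripes=64):
        self.locks = [threading.Lock() for _ in range(num_stripes)]
        self.buckets = [dict() for _ in range(num_stripes)]

    def increment(self, key):
        s = hash(key) % len(self.locks)           # contention limited to one stripe
        with self.locks[s]:
            self.buckets[s][key] = self.buckets[s].get(key, 0) + 1

table = StripedCounterTable()
threads = [threading.Thread(target=lambda: [table.increment(k) for k in range(1000)])
           for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(sum(sum(b.values()) for b in table.buckets))   # 8000 increments in total
```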
symposium on operating systems principles | 1997
Robert J. Stets; Sandhya Dwarkadas; Nikos Hardavellas; Galen C. Hunt; Leonidas I. Kontothanassis; Srinivasan Parthasarathy; Michael L. Scott
Low-latency remote-write networks, such as DEC's Memory Channel, provide the possibility of transparent, inexpensive, large-scale shared-memory parallel computing on clusters of shared memory multiprocessors (SMPs). The challenge is to take advantage of hardware shared memory for sharing within an SMP, and to ensure that software overhead is incurred only when actively sharing data across SMPs in the cluster. In this paper, we describe a two-level software coherent shared memory system, Cashmere-2L, that meets this challenge. Cashmere-2L uses hardware to share memory within a node, while exploiting the Memory Channel's remote-write capabilities to implement moderately lazy release consistency with multiple concurrent writers, directories, home nodes, and page-size coherence blocks across nodes. Cashmere-2L employs a novel coherence protocol that allows a high level of asynchrony by eliminating global directory locks and the need for TLB shootdown. Remote interrupts are minimized by exploiting the remote-write capabilities of the Memory Channel network. Cashmere-2L currently runs on an 8-node, 32-processor DEC AlphaServer system. Speedups range from 8 to 31 on 32 processors for our benchmark suite, depending on the application's characteristics. We quantify the importance of our protocol optimizations by comparing performance to that of several alternative protocols that do not share memory in hardware within an SMP, and require more synchronization. In comparison to a one-level protocol that does not share memory in hardware within an SMP, Cashmere-2L improves performance by up to 46%.
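Multiple-concurrent-writer support in page-based software DSM is commonly built on twins and diffs; the sketch below shows only that generic idea, not Cashmere-2L's actual protocol or its Memory Channel mechanics.

```python
def make_twin(page):
    return bytes(page)                            # pristine copy taken at the first write

def compute_diff(twin, page):
    # (offset, new_byte) pairs for every byte this node modified since the twin was made
    return [(i, page[i]) for i in range(len(page)) if page[i] != twin[i]]

def apply_diff(home_page, diff):
    for offset, value in diff:
        home_page[offset] = value                 # merge only the modified bytes

page = bytearray(16)                              # this node's working copy of one page
twin = make_twin(page)
page[3] = 0xAA                                    # this node's writes
page[7] = 0xBB
home = bytearray(16)                              # the home node's copy
apply_diff(home, compute_diff(twin, page))        # writers to disjoint bytes never conflict
print(home.hex())
```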
acm symposium on parallel algorithms and architectures | 2007
Shimin Chen; Phillip B. Gibbons; Michael Kozuch; Vasileios Liaskovitis; Anastassia Ailamaki; Guy E. Blelloch; Babak Falsafi; Limor Fix; Nikos Hardavellas; Todd C. Mowry; Chris Wilkerson
In chip multiprocessors (CMPs), limiting the number of off-chip cache misses is crucial for good performance. Many multithreaded programs provide opportunities for constructive cache sharing, in which concurrently scheduled threads share a largely overlapping working set. In this paper, we compare the performance of two state-of-the-art schedulers proposed for fine-grained multithreaded programs: Parallel Depth First (PDF), which is specifically designed for constructive cache sharing, and Work Stealing (WS), which is a more traditional design. Our experimental results indicate that PDF scheduling yields a 1.3--1.6X performance improvement relative to WS for several fine-grain parallel benchmarks on projected future CMP configurations; we also report several issues that may limit the advantage of PDF in certain applications. These results also indicate that PDF more effectively utilizes off-chip bandwidth, making it possible to trade off on-chip cache for a larger number of cores. Moreover, we find that task granularity plays a key role in cache performance. Therefore, we present an automatic approach for selecting effective grain sizes, based on a new working set profiling algorithm that is an order of magnitude faster than previous approaches. This is the first paper demonstrating the effectiveness of PDF on real benchmarks, providing a direct comparison between PDF and WS, revealing the limiting factors for PDF in practice, and presenting an approach for overcoming these factors.
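The following toy comparison (not the schedulers' real implementations) illustrates why PDF promotes constructive cache sharing: it hands cores tasks that are adjacent in the sequential depth-first order, whereas steady-state work stealing tends to scatter cores across distant parts of the computation.

```python
from collections import deque

def pdf_schedule(tasks, num_cores):
    """Greedy sketch: at each step the cores take the earliest remaining tasks in
    sequential (depth-first) order, so co-scheduled tasks stay close together."""
    order, queue = [], deque(tasks)
    while queue:
        order.append([queue.popleft() for _ in range(min(num_cores, len(queue)))])
    return order

def ws_schedule(tasks, num_cores):
    """Sketch of steady-state work stealing: the work splits into large per-core
    chunks, so cores run over far-apart regions of the computation."""
    chunk = (len(tasks) + num_cores - 1) // num_cores
    lanes = [deque(tasks[i * chunk:(i + 1) * chunk]) for i in range(num_cores)]
    order = []
    while any(lanes):
        order.append([lane.popleft() for lane in lanes if lane])
    return order

tasks = list(range(16))
print("PDF co-scheduled groups:", pdf_schedule(tasks, 4))
print("WS  co-scheduled groups:", ws_schedule(tasks, 4))
```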
high-performance computer architecture | 2011
Song Liu; Brian Leung; Alexander Neckar; Seda Ogrenci Memik; Gokhan Memik; Nikos Hardavellas
The performance of the main memory is an important factor in overall system performance. To improve DRAM performance, designers have been increasing chip densities and the number of memory modules. However, these approaches increase power consumption and operating temperatures: temperatures in existing DRAM modules can rise to over 95°C. Another important property of DRAM temperature is the large variation across DRAM chips. In this paper, we present our analysis collected from measurements on a real system indicating that temperatures across DRAM chips can vary by over 10°C. This work aims to minimize this variation as well as the peak DRAM temperature. We first develop a thermal model to estimate the temperature of DRAM chips and validate this model against real temperature measurements. We then propose three hardware and software schemes to reduce peak temperatures. The first technique introduces a new cache line replacement policy that reduces the number of accesses to the overheating DRAM chips. The second technique utilizes a Memory Write Buffer to improve the access efficiency of the overheated chips. The third scheme intelligently allocates pages to relatively cooler ranks of the DIMM. Our experiments show that in a high performance memory system, our schemes reduce the peak DRAM chip temperature by as much as 8.39°C over 10 workloads (5.36°C on average). Our schemes also improve performance, mainly due to a reduction in thermal emergencies: for a baseline system with a memory bandwidth throttling scheme, the IPC is improved by as much as 15.8% (4.1% on average).
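The third scheme lends itself to a compact sketch: the function below steers each newly allocated page to the coolest rank that still has free frames. Names and data are illustrative assumptions, and the replacement-policy and write-buffer techniques are not shown.

```python
def allocate_page(rank_temps, free_frames):
    """rank_temps: current temperature per rank (deg C); free_frames: free-frame
    count per rank. Returns the rank chosen for the new page, or None if full."""
    candidates = [r for r, free in enumerate(free_frames) if free > 0]
    if not candidates:
        return None
    rank = min(candidates, key=lambda r: rank_temps[r])    # coolest rank with space wins
    free_frames[rank] -= 1
    return rank

temps = [71.5, 78.2, 69.8, 83.0]
free = [120, 340, 15, 500]
print(allocate_page(temps, free))    # -> 2, the coolest rank that still has free frames
```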
international symposium on microarchitecture | 2010
Nikos Hardavellas; Michael Ferdman; Babak Falsafi; Anastasia Ailamaki
The growing core counts and caches of modern processors result in data access latency becoming a function of the data's physical location in the cache. Thus, the placement of cache blocks determines the cache's performance. Reactive nonuniform cache architectures (R-NUCA) achieve near-optimal cache block placement by classifying blocks online and placing data close to the cores that use them.
design, automation, and test in europe | 2012
Abhishek Das; Matthew Schuchhardt; Nikos Hardavellas; Gokhan Memik; Alok N. Choudhary
On-chip interconnection networks consume a significant fraction of the chip's power, and the rapidly increasing core counts in future technologies are going to further aggravate their impact on the chip's overall power consumption. A large fraction of the traffic originates not from data messages exchanged between sharing cores, but from the communication between the cores and intermediate hardware structures (i.e., directories) for the purpose of maintaining coherence in the presence of conflicting updates. In this paper, we propose Dynamic Directories, a method that allows directories to be placed arbitrarily on the chip by piggybacking on the virtual-to-physical address translation. This eliminates a large fraction of the on-chip interconnect traversals, thereby reducing power consumption. Through trace-driven and cycle-accurate simulation of a range of scientific and Map-Reduce applications, we show that our technique reduces the power and energy expended by the on-chip interconnect by up to 37% (16.4% on average) with negligible hardware overhead and a small improvement in performance (1.3% on average).
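A minimal sketch of the idea follows; the field names and structures are assumptions for illustration, not the paper's exact format. The directory home of a page is recorded alongside its translation, so coherence requests can be routed to a slice near the page's actual sharers rather than to a slice fixed by an address hash.

```python
class PageTableEntry:
    def __init__(self, ppn, dir_slice):
        self.ppn = ppn                    # physical page number
        self.dir_slice = dir_slice        # directory home chosen for this page's sharers

def conventional_dir_home(phys_addr, num_slices):
    return (phys_addr >> 12) % num_slices          # fixed, address-interleaved home

def dynamic_dir_home(pte):
    return pte.dir_slice                           # home delivered with the translation

# Example: a page used only by cores 4-7 gets a directory home inside that cluster.
pte = PageTableEntry(ppn=0x4C2, dir_slice=5)
print(conventional_dir_home(0x4C2000, num_slices=64))   # may land far from the sharers
print(dynamic_dir_home(pte))                             # stays near the sharers
```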
design automation conference | 2015
G. Tziantzioulis; A. M. Gok; S. M. Faisal; Nikos Hardavellas; Seda Ogrenci-Memik; Srinivasan Parthasarathy
Existing timing error models for voltage-scaled functional units ignore the effect of history and correlation among outputs, and the variation in the error behavior at different bit locations. We propose b-HiVE, a model for voltage-scaling-induced timing errors that incorporates these attributes and demonstrates their impact on the overall model accuracy. On average across several operations, b-HiVE's estimates are within 1-3% of comprehensive analog simulations, which corresponds to 5-17x higher accuracy (6-10x on average) than the error models currently used in approximate computing research. To the best of our knowledge, we present the first bit-level error models of arithmetic units, and the first error models for voltage scaling of bitwise logic operations and floating-point units.
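A toy bit-level, history-dependent error model in the spirit of b-HiVE is sketched below; all probabilities are invented for illustration, with each output bit given its own error rate and bits that toggle relative to the previous output made more likely to violate timing under voltage scaling.

```python
import random

def noisy_output(correct, prev_output, p_toggle_err, p_hold_err):
    """p_toggle_err[i] / p_hold_err[i]: error probability of bit i when it toggles /
    stays the same relative to the previous cycle's output."""
    out = 0
    for i in range(len(p_toggle_err)):
        bit = (correct >> i) & 1
        toggles = bit != ((prev_output >> i) & 1)
        p = p_toggle_err[i] if toggles else p_hold_err[i]
        if random.random() < p:
            bit ^= 1                      # a timing violation flips this bit
        out |= bit << i
    return out

random.seed(0)
width = 8
p_tog = [0.02 + 0.01 * i for i in range(width)]    # higher-order bits fail more often
p_hold = [0.001] * width
print(bin(noisy_output(0b10110101, prev_output=0b01010101,
                       p_toggle_err=p_tog, p_hold_err=p_hold)))
```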