Adrián Cristal | Researchain

Archive Network Publication Hotspot Collaboration

Network

Latest external collaboration on country level. Dive into details by clicking on the dots.

Explore More

Hotspot

Dive into the research topics where Adrián Cristal is active.

Explore More

Publication

Featured researches published by Adrián Cristal.

international conference on computer design | 2012

Flash correct-and-refresh: Retention-aware error management for increased flash memory lifetime

Yu Cai; Gulay Yalcin; Onur Mutlu; Erich F. Haratsch; Adrián Cristal; Osman S. Unsal; Ken Mai

With the continued scaling of NAND flash and multi-level cell technology, flash-based storage has gained widespread use in systems ranging from mobile platforms to enterprise servers. However, the robustness of NAND flash cells is an increasing concern, especially at nanometer-regime process geometries. NAND flash memory bit error rate increases exponentially with the number of program/erase cycles. Stronger error correcting codes (ECC) can be used to tolerate higher error rates, but these have diminishing returns with increasing P/E cycles and can have prohibitively high power, area, and latency overheads. The goal of this paper is to develop new techniques that can tolerate high bit error rates without requiring prohibitively strong ECC. Our techniques, called Flash Correct-and-Refresh (FCR) exploit the observation that the dominant error source in NAND flash memory is retention errors, caused by flash cells losing charge over time. The key idea is to periodically read, correct, and reprogram (in-place) or remap the stored data before it accumulates more retention errors than can be corrected by simple ECC. Detailed simulations of a solid-state drive (SSD) storage system driven by measured experimental data from error characterization on real flash memory chips show that our techniques provide 46× average lifetime improvement on a variety of workloads at no additional hardware cost. We also find that our techniques achieve lifetime improvements that cannot feasibly be achieved with stronger ECC.

international symposium on microarchitecture | 2009

EazyHTM: eager-lazy hardware transactional memory

Sasa Tomic; Cristian Perfumo; Chinmay Eishan Kulkarni; Adrià Armejach; Adrián Cristal; Osman S. Unsal; Tim Harris; Mateo Valero

Transactional memory aims to provide a programming model that makes parallel programming easier. Hardware implementations of transactional memory (HTM) suffer from fewer overheads than implementations in software, and refinements in conflict management strategies for HTM allow for even larger improvements. In particular, lazy conflict management has been shown to deliver better performance, but it has hitherto required complex protocols and implementations. In this paper we show a new scalable HTM architecture that performs comparably to the state-of-the-art and can be implemented by minor modifications to the MESI protocol rather than re-engineering it from the ground up. Our approach detects conflicts eagerly while a transaction is running, but defers the resolution lazily until commit time. We evaluate this EAger-laZY system, EazyHTM, by comparing it with the scalable-TCC-like approach and a system employing ideal lazy conflict management with a zero-cycle transaction validation and fully-parallel commits. We show that EazyHTM performs on average 7% faster than scalable-TCC. In addition, EazyHTM has fast commits and aborts, can commit in parallel even if there is only one directory present, and does not suffer from cascading waits.

international symposium on microarchitecture | 2007

Transactional Memory: An Overview

Tim Harris; Adrián Cristal; Osman S. Unsal; Eduard Ayguadé; Fabrizio Gagliardi; Burton J. Smith; Mateo Valero

Writing applications that benefit from the massive computational power of future multicore chip multiprocessors will not be an easy task for mainstream programmers accustomed to sequential algorithms rather than parallel ones. This article presents a survey of transactional memory, a mechanism that promises to enable scalable performance while freeing programmers from some of the burden of modifying their parallel code.

ACM Transactions on Architecture and Code Optimization | 2004

Toward kilo-instruction processors

Adrián Cristal; Oliverio J. Santana; Mateo Valero; Jose F. Martinez

The continuously increasing gap between processor and memory speeds is a serious limitation to the performance achievable by future microprocessors. Currently, processors tolerate long-latency memory operations largely by maintaining a high number of in-flight instructions. In the future, this may require supporting many hundreds, or even thousands, of in-flight instructions. Unfortunately, the traditional approach of scaling up critical processor structures to provide such support is impractical at these levels, due to area, power, and cycle time constraints.In this paper we show that, in order to overcome this resource-scalability problem, the way in which critical processor resources are managed must be changed. Instead of simply upsizing the processor structures, we propose a smarter use of the available resources, supported by a selective checkpointing mechanism. This mechanism allows instructions to commit out of order, and makes a reorder buffer unnecessary. We present a set of techniques such as multilevel instruction queues, late allocation and early release of registers, and early release of load/store queue entries. All together, these techniques constitute what we call a kilo-instruction processor, an architecture that can support thousands of in-flight instructions, and thus may achieve high performance even in the presence of large memory access latencies.

acm sigplan symposium on principles and practice of parallel programming | 2009

Atomic quake: using transactional memory in an interactive multiplayer game server

Ferad Zyulkyarov; Vladimir Gajinov; Osman S. Unsal; Adrián Cristal; Eduard Ayguadé; Tim Harris; Mateo Valero

Transactional Memory (TM) is being studied widely as a new technique for synchronizing concurrent accesses to shared memory data structures for use in multi-core systems. Much of the initial work on TM has been evaluated using microbenchmarks and application kernels; it is not clear whether conclusions drawn from these workloads will apply to larger systems. In this work we make the first attempt to develop a large, complex, application that uses TM for all of its synchronization. We describe how we have taken an existing parallel implementation of the Quake game server and restructured it to use transactions. In doing so we have encountered examples where transactions simplify the structure of the program. We have also encountered cases where using transactions occludes the structure of the existing code. Compared with existing TM benchmarks, our workload exhibits non-block-structured transactions within which there are I/Ooperations and system call invocations. There are long and short running transactions (200-1.3M cycles) with small and large read and write sets (a few bytes to 1.5MB). There are nested transactions reaching up to 9 levels at runtime. There are examples where error handling and recovery occurs inside transactions. There are also examples where data changes between being accessed transactionally and accessed non-transactionally. However, we did not see examples where the kind of access to one piece of data depended on the value of another.

international conference on parallel architectures and compilation techniques | 2011

DiDi: Mitigating the Performance Impact of TLB Shootdowns Using a Shared TLB Directory

Carlos Villavieja; Vasileios Karakostas; Lluis Vilanova; Yoav Etsion; Alex Ramirez; Avi Mendelson; Nacho Navarro; Adrián Cristal; Osman S. Unsal

Translation Look aside Buffers (TLBs) are ubiquitously used in modern architectures to cache virtual-to-physical mappings and, as they are looked up on every memory access, are paramount to performance scalability. The emergence of chip-multiprocessors (CMPs) with per-core TLBs, has brought the problem of TLB coherence to front stage. TLBs are kept coherent at the software-level by the operating system (OS). Whenever the OS modifies page permissions in a page table, it must initiate a coherency transaction among TLBs, a process known as a TLB shoot down. Current CMPs rely on the OS to approximate the set of TLBs caching a mapping and synchronize TLBs using costly Inter-Proceessor Interrupts (IPIs) and software handlers. In this paper, we characterize the impact of TLB shoot downs on multiprocessor performance and scalability, and present the design of a scalable TLB coherency mechanism. First, we show that both TLB shoot down cost and frequency increase with the number of processors and project that software-based TLB shoot downs would thwart the performance of large multiprocessors. We then present a scalable architectural mechanism that couples a shared TLB directory with load/store queue support for lightweight TLB invalidation, and thereby eliminates the need for costly IPIs. Finally, we show that the proposed mechanism reduces the fraction of machine cycles wasted on TLB shoot downs by an order of magnitude.

international conference on parallel architectures and compilation techniques | 2007

A Flexible Heterogeneous Multi-Core Architecture

Miquel Pericàs; Adrián Cristal; Francisco J. Cazorla; Ruben Gonzalez; Daniel A. Jiménez; Mateo Valero

Multi-core processors naturally exploit thread-level parallelism (TLP). However, extracting instruction-level parallelism (ILP) from individual applications or threads is still a challenge as application mixes in this environment are nonuniform. Thus, multi-core processors should be flexible enough to provide high throughput for uniform parallel applications as well as high performance for more general workloads. Heterogeneous architectures are a first step in this direction, but partitioning remains static and only roughly fits application requirements. This paper proposes the Flexible Heterogeneous Mul-tiCore processor (FMC), the first dynamic heterogeneous multi-core architecture capable of reconfiguring itself to fit application requirements without programmer intervention. The basic building block of this microarchitecture is a scalable, variable-size window microarchitecture that exploits the concept of Execution Locality to provide large-window capabilities. This allows to overcome the memory wall for applications with high memory-level parallelism (MLP). The microarchitecture contains a set of small and fast cache processors that execute high locality code and a network of small in-order memory engines that together exploit low locality code. Single-threaded applications can use the entire network of cores while multi-threaded applications can efficiently share the resources. The sizing of critical structures remains small enough to handle current power envelopes. In single-threaded mode this processor is able to outperform previous state-of-the-art high-performance processor research by 12% on SpecFP. We show how in a quad- threaded/quad-core environment the processor outperforms a statically allocated configuration in both throughput and harmonic mean, two commonly used metrics to evaluate SMTperformance, by around 2-4%. This is achieved while using a very simple sharing algorithm.

computing frontiers | 2008

The limits of software transactional memory (STM): dissecting Haskell STM applications on a many-core environment

Cristian Perfumo; Nehir Sonmez; Srdjan Stipic; Osman S. Unsal; Adrián Cristal; Tim Harris; Mateo Valero

In this paper, we present a Haskell Transactional Memory benchmark to provide a comprehensive application suite for the use of Software Transactional Memory (STM) researchers. We develop a framework to profile the execution of the benchmark applications and to collect detailed runtime data on their transactional behavior, running them on a 128-core multiprocessor. Using a composite of the collected raw data, we propose new transactional performance metrics. We analyze key statistics related to scalability, atomic sections, transactional events, overall transactional overhead and the relative hardware performance, accordingly drawing conclusions on the results. Our findings advance our comprehension on the STM runtime and the characteristics of different applications under the transactional management of the pure, functional programming language, Haskell.

measurement and modeling of computer systems | 2014

Neighbor-cell assisted error correction for MLC NAND flash memories

Yu Cai; Gulay Yalcin; Onur Mutlu; Erich F. Haratsch; Osman S. Unsal; Adrián Cristal; Ken Mai

Continued scaling of NAND flash memory to smaller process technology nodes decreases its reliability, necessitating more sophisticated mechanisms to correctly read stored data values. To distinguish between different potential stored values, conventional techniques to read data from flash memory employ a single set of reference voltage values, which are determined based on the overall threshold voltage distribution of flash cells. Unfortunately, the phenomenon of program interference, in which a cells threshold voltage unintentionally changes when a neighboring cell is programmed, makes this conventional approach increasingly inaccurate in determining the values of cells. This paper makes the new empirical observation that identifying the value stored in the immediate-neighbor cell makes it easier to determine the data value stored in the cell that is being read. We provide a detailed statistical and experimental characterization of threshold voltage distribution of flash memory cells conditional upon the immediate-neighbor cell values, and show that such conditional distributions can be used to determine a set of read reference voltages that lead to error rates much lower than when a single set of reference voltage values based on the overall distribution are used. Based on our analyses, we propose a new method for correcting errors in a flash memory page, neighbor-cell assisted correction (NAC). The key idea is to re-read a flash memory page that fails error correction codes (ECC) with the set of read reference voltage values corresponding to the conditional threshold voltage distribution assuming a neighbor cell value and use the re-read values to correct the cells that have neighbors with that value. Our simulations show that NAC effectively improves flash memory lifetime by 33% while having no (at nominal lifetime) or very modest (less than 5% at extended lifetime) performance overhead.

international symposium on computer architecture | 2015

Redundant memory mappings for fast access to large memories

Vasileios Karakostas; Jayneel Gandhi; Furkan Ayar; Adrián Cristal; Mark D. Hill; Kathryn S. McKinley; Mario Nemirovsky; Michael M. Swift; Osman S. Unsal

Page-based virtual memory improves programmer productivity, security, and memory utilization, but incurs performance overheads due to costly page table walks after TLB misses. This overhead can reach 50% for modern workloads that access increasingly vast memory with stagnating TLB sizes. To reduce the overhead of virtual memory, this paper proposes Redundant Memory Mappings (RMM), which leverage ranges of pages and provides an efficient, alternative representation of many virtual-to-physical mappings. We define a range be a subset of processs pages that are virtually and physically contiguous. RMM translates each range with a single range table entry, enabling a modest number of entries to translate most of the processs address space. RMM operates in parallel with standard paging and uses a software range table and hardware range TLB with arbitrarily large reach. We modify the operating system to automatically detect ranges and to increase their likelihood with eager page allocation. RMM is thus transparent to applications. We prototype RMM software in Linux and emulate the hardware. RMM performs substantially better than paging alone and huge pages, and improves a wider variety of workloads than direct segments (one range per program), reducing the overhead of virtual memory to less than 1% on average.

Explore More