Is this you? Create Your Porfile

Adrian Cristal

Spanish National Research Council

Archive Network Publication Hotspot Collaboration

Network

Latest external collaboration on country level. Dive into details by clicking on the dots.

Explore More

Hotspot

Dive into the research topics where Adrian Cristal is active.

Explore More

Publication

Featured researches published by Adrian Cristal.

high-performance computer architecture | 2004

Out-of-order commit processors

Adrian Cristal; Daniel Ortega; Josep Llosa; Mateo Valero

Modern out-of-order processors tolerate long latency memory operations by supporting a large number of in-flight instructions. This is particularly useful in numerical applications where branch speculation is normally not a problem and where the cache hierarchy is not capable of delivering the data soon enough. In order to support more in-flight instructions, several resources have to be up-sized, such as the reorder buffer (ROB), the general purpose instructions queues, the load/store queue and the number of physical registers in the processor. However, scaling-up the number of entries in these resources is impractical because of area, cycle time, and power consumption constraints. We propose to increase the capacity of future processors by augmenting the number of in-flight instructions. Instead of simply up-sizing resources, we push for new and novel microarchitectural structures that achieve the same performance benefits but with a much lower need for resources. Our main contribution is a new checkpointing mechanism that is capable of keeping thousands of in-flight instructions at a practically constant cost. We also propose a queuing mechanism that takes advantage of the differences in waiting time of the instructions in the flow. Using these two mechanisms our processor has a performance degradation of only 10% for SPEC2000fp over a conventional processor requiring more than an order of magnitude additional entries in the ROB and instruction queues, and about a 200% improvement over a current processor with a similar number of entries.

international symposium on microarchitecture | 2005

Kilo-instruction processors: overcoming the memory wall

Adrian Cristal; Oliverio J. Santana; Francisco J. Cazorla; Marco Galluzzi; Tanausu Ramirez; Miquel Pericàs; Mateo Valero

Historically, advances in integrated circuit technology have driven improvements in processor microarchitecture and led to todays microprocessors with sophisticated pipelines operating at very high clock frequencies. However, performance improvements achievable by high-frequency microprocessors have become seriously limited by main-memory access latencies because main-memory speeds have improved at a much slower pace than microprocessor speeds. Its crucial to deal with this performance disparity, commonly known as the memory wall, to enable future high-frequency microprocessors to achieve their performance potential. To overcome the memory wall, we propose kilo-instruction processors-superscalar processors that can maintain a thousand or more simultaneous in-flight instructions. Doing so means designing key hardware structures so that the processor can satisfy the high resource requirements without significantly decreasing processor efficiency or increasing energy consumption.

international symposium on computer architecture | 2004

A Content Aware Integer Register File Organization

Gonzalez Gonzalez; Adrian Cristal; Daniel Ortega; Alexander V. Veidenbaum; Mateo Valero

A register file is a critical component of a modern superscalar processor. It has a large number of entries and read/write ports in order to enable high levels of instruction parallelism. As a result, the register files area, access time, and energy consumption increase dramatically, significantly affecting the overall superscalar processors performance and energy consumption. This is especially true in 64-bit processors. This paper presents a new integer register file organization, which reduces energy consumption, area, and access time of the register file with a minimal effect on overall IPC. This is accomplished by exploiting a new concept, partial value locality, which is defined as occurrence of multiple live value instances identical in a subset of their bits. A possible implementation of the new register file is described and shown to obtain proposed optimized register file designs. Overall, an energy reduction of over 50%, a 18% decrease in area, and a 15% reduction in the access time are achieved in the new register file. The energy and area savings are achieved with a 1.7% reduction in IPC for integer applications and a negligible 0.3% in numerical applications, assuming the same clock frequency. A performance increase of up to 13% is possible if the clock frequency can be increases due to a reduction in the register file access time. This approach enables other, very promising optimizations, three of which are outlined in the paper.

international conference on pervasive services | 2005

Implementing Kilo-Instruction Multiprocessors

Enrique Vallejo; Marco Galluzzi; Adrian Cristal; Fernando Vallejo; Ramón Beivide; Per Stenström; James E. Smith; Mateo Valero

Multiprocessors are coming into wide-spread use in many application areas, yet there are a number of challenges to achieving a good tradeoff between complexity and performance. For example, while implementing memory coherence and consistency is essential for correctness, efficient implementation of critical sections and synchronization points is desirable for performance. The multi-checkpointing mechanisms of Kilo-Instruction Processors can be leveraged to achieve good complexity-effective multiprocessor designs. We describe how to implement a Kilo-Instruction Multiprocessor that transparently, i.e. without any software support, uses transaction-based memory updates. Our model not only simplifies memory coherence and consistency hardware, but at the same time, it provides the potential for implementing high performance speculative mechanisms for commonly occurring synchronization constructs.

high-performance computer architecture | 2006

A decoupled KILO-instruction processor

Miquel Pericàs; Adrian Cristal; Ruben Gonzalez; Daniel A. Jiménez; Mateo Valero

Building processors with large instruction windows has been proposed as a mechanism for overcoming the memory wall, but finding a feasible and implementable design has been an elusive goal. Traditional processors are composed of structures that do not scale to large instruction windows because of timing and power constraints. However, the behavior of programs executed with large instruction windows gives rise to a natural and simple alternative to scaling. We characterize this phenomenon of execution locality and propose a microarchitecture to exploit it to achieve the benefit of a large instruction window processor with low implementation cost. Execution locality is the tendency of instructions to exhibit high or low latency based on their dependence on memory operations. In this paper we propose a decoupled microarchitecture that executes low latency instructions on a cache processor and high latency instructions on a memory processor. We demonstrate that such a design, using small structures and many in-order components, can achieve the same performance as much more aggressive proposals while minimizing design complexity.

international conference on parallel architectures and compilation techniques | 2010

Discovering and understanding performance bottlenecks in transactional applications

Ferad Zyulkyarov; Srdjan Stipic; Tim Harris; Osman S. Unsal; Adrian Cristal; Ibrahim Hur; Mateo Valero

Many researchers have developed applications using transactional memory (TM) with the purpose of benchmarking different implementations, and studying whether or not TM is easy to use. However, comparatively little has been done to provide general-purpose tools for profiling and tuning programs which use transactions.

high performance computing and communications | 2009

Dynamically Filtering Thread-Local Variables in Lazy-Lazy Hardware Transactional Memory

Sutirtha Sanyal; Sourav Roy; Adrian Cristal; Osman S. Unsal; Mateo Valero

Transactional Memory (TM) is an emerging technology which promises to make parallel programming easier. However, to be efficient, underlying TM system should protect only true shared data and leave thread-local data out of the transaction. This speed-up the commit phase of the transaction which is a bottleneck for a lazily versioned HTM. This paper proposes a scheme in the context of a lazy-lazy(lazy conflict detection and lazy data versioning) Hardware Transactional Memory (HTM) system to identify dynamically variables which are local to a thread and exclude them from the commitset of the transaction. Our proposal covers sharing of both stack and heap but also filters out local accesses to both of them. We also propose, in the same scheme, to identify local variables for which versioning need not be maintained.For evaluation, we have implemented a lazy-lazy model of HTM in line with the conventional and the scalable version of the TCC in a full system simulator. For operating system, we have modified the Linux kernel. We got an average speed-up of 31% for the conventional TCC, on applications from the STAMP benchmark suite. For the scalable TCC we got an average speedup of 16%. Also, we found that on average 99% of the local variables can be safely omitted when recording their old values to handle aborts.

international symposium on microarchitecture | 2010

The Velox Transactional Memory Stack

Pascal Felber; E Rivière; W M Moreira; Derin Harmanci; Patrick Marlier; Stephan Diestelhorst; Michael P. Hohmuth; Martin T. Pohlack; Adrian Cristal; I Hur; Osman S. Unsal; P Stenström; A Dragojevic; Rachid Guerraoui; M Kapalka; Vincent Gramoli; U Drepper; S Tomić; Yehuda Afek; Guy Korland; Nir Shavit; Christof Fetzer; Martin Nowack; Torvald Riegel

The adoption of multi- and many-core architectures for mainstream computing undoubtedly brings profound changes in the way software is developed. In particular, the use of fine grained locking as the multi-core programmers coordination methodology is considered by more and more experts as a dead-end. The transactional memory (TM) programming paradigm is a strong contender to become the approach of choice for replacing locks and implementing atomic operations in concurrent programming. Combining sequences of concurrent operations into atomic transactions allows a great reduction in the complexity of both programming and verification, by making parts of the code appear to execute sequentially without the need to program using fine-grained locking. Transactions remove from the programmer the burden of figuring out the interaction among concurrent operations that happen to conflict when accessing the same locations in memory. The EU-funded FP7 VELOX project designs, implements and evaluates an integrated TM stack, spanning from programming language to the hardware support, and including runtime and libraries, compilers, and application environments. This paper presents an overview of the VELOX TM stack and its associated challenges and contributions.

international symposium on quality electronic design | 2014

PETS: Power and energy estimation tool at system-level

Santhosh-Kumar Rethinagiri; Oscar Palomar; Osman S. Unsal; Adrian Cristal; Rabie Ben-Atitallah; Smail Niar

In this paper, we introduce PETS, a simulation based tool to estimate, analyse and optimize power/energy consumption of an application running on complex state-of-the-art heterogeneous embedded processor based platforms. This tool is integrated with power and energy models in order to support comprehensive design space exploration for low power multi-core and heterogeneous multiprocessor platforms such as OMAP, CARMA, Zynq 7000 and Virtex II Pro. Moreover, PETS is equipped with power optimization techniques such as dynamic slack reduction and work load balancing. The development of PETS involves two steps. First step: power model generation. For the power model development, functional-level parameters are used to set up generic power models for the different components of the system. So far, seven power models have been developed for different architectures, starting from the simple low power architecture ARM9 to the very complex DSP TI C64x. Second step: a simulation based virtual platform framework is developed using SystemC IPs and JIT/ISS compilers to accurately grab the activities to estimate power. The accuracy of our proposed tool is evaluated by using a variety of industrial benchmarks. Estimated power and energy values are compared to real board measurements. The power estimation results are less than 4% of error for single core processor, 4.6% for dual-core processor, 5% for quad-core, 6.8% multi-processor based system and effective optimisation of power/energy for the applications.

international symposium on microarchitecture | 2012

Vector Extensions for Decision Support DBMS Acceleration

Timothy Hayes; Oscar Palomar; Osman S. Unsal; Adrian Cristal; Mateo Valero

Database management systems (DBMS) have become an essential tool for industry and research and are often a significant component of data centres. As a result of this criticality, efficient execution of DBMS engines has become an important area of investigation. This work takes a top-down approach to accelerating decision support systems (DSS) on x86-64 microprocessors using vector ISA extensions. In the first step, a leading DSS DBMS is analysed for potential data-level parallelism. We discuss why the existing multimedia SIMD extensions (SSE/AVX) are not suitable for capturing this parallelism and propose a complementary instruction set reminiscent of classical vector architectures. The instruction set is implemented using unintrusive modifications to a modern x86-64 micro architecture tailored for DSS DBMS. The ISA and micro architecture are evaluated using a cycle-accurate x86-64 micro architectural simulator coupled with a highly-detailed memory simulator. We have found a single operator is responsible for 41% of total execution time for the TPC-H DSS benchmark. Our results show performance speedups between 1.94x and 4.56x for an implementation of this operator run with our proposed hardware modifications.

Explore More