Luis Piñuel

Complutense University of Madrid

Publication

Featured research published by Luis Piñuel.

IEEE Transactions on Parallel and Distributed Systems | 2008

Parallel Implementation of the 2D Discrete Wavelet Transform on Graphics Processing Units: Filter Bank versus Lifting

Christian Tenllado; Javier Setoain; Manuel Prieto; Luis Piñuel; Francisco Tirado

The widespread usage of the discrete wavelet transform (DWT) has motivated the development of fast DWT algorithms and their tuning on all sorts of computer systems. Several studies have compared the performance of the most popular schemes, known as filter bank scheme (FBS) and lifting scheme (LS), and have always concluded that LS is the most efficient option. However, there is no such study on streaming processors such as modern Graphics Processing Units (GPUs). Current trends have transformed these devices into powerful stream processors with enough flexibility to perform intensive and complex floating-point calculations. The opportunities opened up by these platforms, as well as the growing popularity of the DWT within the computer graphics field, make a new performance comparison of great practical interest. Our study indicates that FBS outperforms LS in current-generation GPUs. In our experiments, the actual FBS gains range between 10 percent and 140 percent, depending on the problem size and the type and length of the wavelet filter. Moreover, design trends suggest higher gains in future-generation GPUs.
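The difference between the two schemes compared in this abstract can be illustrated with a minimal sketch. It assumes the (unnormalized) Haar wavelet purely for brevity; the abstract does not say which filters were benchmarked, and the function names here are hypothetical. The filter bank version computes each output directly from the input samples, while the lifting version splits the signal into even and odd samples and then applies a predict step followed by an update step that reuses the predicted details.

```python
def haar_filter_bank(x):
    """One decomposition level via the filter bank scheme (FBS).

    Low-pass and high-pass outputs are computed directly from the input
    and only every second output is kept (downsampling by 2).
    """
    low = [(x[i] + x[i + 1]) / 2 for i in range(0, len(x), 2)]
    high = [x[i + 1] - x[i] for i in range(0, len(x), 2)]
    return low, high


def haar_lifting(x):
    """One decomposition level via the lifting scheme (LS)."""
    even, odd = x[0::2], x[1::2]
    # Predict: each odd sample is predicted from its even neighbour.
    detail = [o - e for e, o in zip(even, odd)]
    # Update: the even samples are corrected using the details just computed.
    approx = [e + d / 2 for e, d in zip(even, detail)]
    return approx, detail


signal = [4, 6, 10, 12, 8, 6, 5, 5]  # even length assumed
print(haar_filter_bank(signal))
print(haar_lifting(signal))
```

Both functions produce the same coefficients. Lifting needs fewer arithmetic operations, but its update step depends on the result of the predict step, whereas every filter bank output depends only on the input samples. That data dependence is one plausible reason the two schemes can rank differently on a stream processor; the sketch itself makes no performance claim.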

International Symposium on Low Power Electronics and Design | 2003

Branch prediction on demand: an energy-efficient solution

Daniel Chaver; Luis Piñuel; Manuel Prieto; Francisco Tirado; Michael C. Huang

High-end processors typically incorporate complex branch predictors consisting of many large structures that together consume a notable fraction of total chip power (more than 10% in some cases). Depending on the applications, some of these resources may remain underused for long periods of time. We propose a methodology to reduce the energy consumption of the branch predictor by characterizing prediction demand using profiling and dynamically adjusting predictor resources accordingly. Specifically, we disable components of the hybrid direction predictor and resize the branch target buffer. Detailed simulations show that this approach reduces the energy consumption in the branch predictor by an average of 72% and up to 89% with virtually no impact on prediction accuracy and performance.

International Parallel and Distributed Processing Symposium | 2003

Vectorization of the 2D wavelet lifting transform using SIMD extensions

Daniel Chaver; Christian Tenllado; Luis Piñuel; Manuel Prieto; Francisco Tirado

This paper addresses the vectorization of the lifting-based wavelet transform on general-purpose microprocessors in the context of JPEG2000. Since SIMD exploitation strongly depends on efficient use of the memory hierarchy, this research builds on previous work on cache-conscious DWT implementations. The experimental platform chosen to study the benefits of the SIMD extensions is an Intel Pentium-4 (P-4) based PC. However, unlike other authors, we have performed the vectorization without assembly language programming, in order to improve code portability and reduce development cost.

International Symposium on Microarchitecture | 2003

Customizing the branch predictor to reduce complexity and energy consumption

Michael C. Huang; Daniel Chaver; Luis Piñuel; Manuel Prieto; Francisco Tirado

To exploit instruction-level parallelism, high-end processors use branch predictors consisting of many large, often underutilized structures that cause unnecessary energy waste and high power consumption. By adapting the branch target buffer's size and dynamically disabling a hybrid predictor's components, the authors create a customized branch predictor that saves a significant amount of energy with little performance degradation.

International Parallel and Distributed Processing Symposium | 2002

Parallel wavelet transform for large scale image processing

Daniel Chaver; Manuel Prieto; Luis Piñuel; Francisco Tirado

In this paper we discuss several issues relevant to the parallel implementation of a 2-D discrete wavelet transform (DWT) on general-purpose multiprocessors. Our interest in this transform is motivated by its use in an image fusion application that has to manage large image sizes, making parallel computing highly advisable. We have also paid close attention to memory hierarchy exploitation, since it has a tremendous impact on performance due to the lack of spatial locality when the DWT processes image columns.

Design, Automation, and Test in Europe | 2013

Reducing writes in phase-change memory environments by using efficient cache replacement policies

Roberto Rodríguez-Rodríguez; Fernando Castro; Daniel Chaver; Luis Piñuel; Francisco Tirado

Phase Change Memory (PCM) is currently postulated as the best alternative for replacing Dynamic Random Access Memory (DRAM) as the technology used for implementing main memories, thanks to significant advantages such as good scalability and low leakage. However, PCM also presents some drawbacks compared to DRAM, such as its lower endurance. This work analyzes the behavior of conventional cache replacement policies in terms of the number of writes to main memory. In addition, it presents new last-level cache (LLC) replacement algorithms aimed at reducing the number of writes to PCM, and hence increasing its lifetime, without significantly degrading system performance.
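To make concrete how a replacement policy influences the number of writes to main memory, here is a minimal simulation sketch. It is not one of the algorithms from the paper, which the abstract does not describe: the `prefer_clean` option is a hypothetical policy used only for illustration, and the cache is modeled as a single fully associative set.

```python
from collections import OrderedDict


def count_writebacks(accesses, capacity, prefer_clean):
    """Count dirty evictions (writes to main memory) for a small cache.

    accesses: list of (address, is_write) tuples.
    prefer_clean: if True, evict the least recently used *clean* block when
    one exists (hypothetical policy); otherwise use plain LRU.
    Dirty blocks still cached at the end of the trace are not counted.
    """
    cache = OrderedDict()  # address -> dirty flag, ordered from LRU to MRU
    writebacks = 0
    for addr, is_write in accesses:
        if addr in cache:
            # Hit: a block stays dirty once it has been written.
            dirty = cache.pop(addr) or is_write
        else:
            if len(cache) >= capacity:
                victim = next(iter(cache))  # LRU block
                if prefer_clean:
                    # Pick the oldest clean block, falling back to plain LRU.
                    victim = next((a for a, d in cache.items() if not d), victim)
                if cache.pop(victim):
                    writebacks += 1  # evicting a dirty block costs a memory write
            dirty = is_write
        cache[addr] = dirty  # (re)insert at the MRU position
    return writebacks


trace = [(0, True), (1, False), (2, False), (0, True)]
print(count_writebacks(trace, capacity=2, prefer_clean=False))
print(count_writebacks(trace, capacity=2, prefer_clean=True))
```

On this short trace plain LRU evicts the dirty block 0 and pays one write-back, while the clean-first variant keeps it cached and pays none. A policy of this kind can also raise the miss rate, which is the performance trade-off the abstract alludes to.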

International Conference on Computer Design | 2005

Load-store queue management: an energy-efficient design based on a state-filtering mechanism

Fernando Castro; Daniel Chaver; Luis Piñuel; Manuel Prieto; Francisco Tirado; Michael C. Huang

Modern microprocessors incorporate sophisticated techniques to allow early execution of loads without compromising program correctness. To do so, the structures that hold the memory instructions (load and store queues) implement several complex mechanisms to dynamically resolve memory-based dependences. Our main objective in this paper is to design an efficient LQ-SQ structure that saves energy without sacrificing much performance. We propose a new design that divides the load queue into two structures: a conventional associative queue and a simpler FIFO queue that does not allow associative searching. A dependence predictor predicts whether a load instruction has a memory dependence on any in-flight store instruction. If so, the load is sent to the conventional associative queue. Otherwise, it is sent to the non-associative queue, which can only detect dependences in an inexact and conservative way. In addition, the load will not check the store queue at execution time. Together, these measures reduce energy consumption. We explore different predictor designs and runtime policies. Our experiments indicate that such a design can reduce the energy consumption in the load-store queue by 35-50% with an insignificant performance penalty of about 1%. When the energy cost of the increased execution time is factored in, the processor still makes net energy savings of about 3-4%.
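The steering decision described in this abstract can be sketched as follows. This is a toy model under stated assumptions: the one-bit, PC-indexed predictor table, the table size, and all names are hypothetical, since the abstract only says that several predictor designs were explored.

```python
class LoadSteering:
    """Toy model of steering loads between two load-queue structures."""

    def __init__(self, table_size=64):
        # Hypothetical predictor: one bit per entry, indexed by load PC.
        self.table = [False] * table_size
        self.associative_queue = []  # supports associative searches (costly)
        self.fifo_queue = []         # simple FIFO, no associative search

    def dispatch(self, pc):
        """Place a load in one of the two queues based on the prediction."""
        if self.table[pc % len(self.table)]:
            self.associative_queue.append(pc)
            return "associative"
        self.fifo_queue.append(pc)
        return "fifo"

    def train(self, pc, had_dependence):
        """Record whether this load turned out to depend on an in-flight store."""
        self.table[pc % len(self.table)] = had_dependence


lsq = LoadSteering()
print(lsq.dispatch(0x40))           # no history yet, predicted independent
lsq.train(0x40, had_dependence=True)
print(lsq.dispatch(0x40))           # now predicted dependent
```

The sketch omits what happens on a misprediction. According to the abstract, the non-associative queue can still detect dependences in an inexact and conservative way, so correctness does not rely on the predictor being right.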

International Symposium on Microarchitecture | 2006

DMDC: Delayed Memory Dependence Checking through Age-Based Filtering

Fernando Castro; Luis Piñuel; Daniel Chaver; Manuel Prieto; Michael C. Huang; Francisco Tirado

One of the main challenges of modern processor design is the implementation of a scalable and efficient mechanism to detect memory access order violations that result from out-of-order execution of memory instructions. Traditional CAM-based associative queues can be very slow and energy hungry. In this paper we introduce two new management schemes. The first is a filtering scheme based on simple age-tracking. This scheme can easily avoid 95-98% of associative load queue (LQ) searches using only a few registers, which translates into significant power savings. More importantly, however, this filtering makes our second scheme, delayed memory dependence checking (DMDC), practical. With a small hash table, DMDC completely avoids the need for an associative LQ and relies on indexing-based checking at the commit phase, and hence cuts the energy spent on the LQ by an average of 95%. At an average of about 0.3%, the performance impact is negligible. When the energy cost of the increased execution time is factored in, the processor still makes net energy savings of about 3-8%, depending on the configuration and the applications.
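One way an age-based filter of this kind could work is sketched below. The abstract only states that the filter relies on simple age-tracking with a few registers, so the specific rule here is an assumption: a single register holds the age of the youngest load that has already issued, and a store skips the associative LQ search whenever no younger load can have issued before it. Sequence numbers and class names are hypothetical.

```python
class AgeFilter:
    """Filters load-queue searches using one age register (illustrative only).

    Ages are program-order sequence numbers: a larger number means younger.
    """

    def __init__(self):
        self.youngest_issued_load = -1  # no load has issued yet

    def load_issued(self, seq):
        """Called when a load executes, possibly out of program order."""
        self.youngest_issued_load = max(self.youngest_issued_load, seq)

    def store_needs_search(self, seq):
        """A store can only be violated by a younger load that already issued.

        If the store is younger than every issued load, the search is skipped.
        """
        return seq < self.youngest_issued_load


f = AgeFilter()
print(f.store_needs_search(5))  # nothing issued yet, so no search
f.load_issued(8)
print(f.store_needs_search(5))  # load 8 is younger and has issued
print(f.store_needs_search(9))  # store 9 is younger than all issued loads
```

This sketch never lowers the register when loads commit or are squashed. That keeps it safe, because a stale value can only trigger unnecessary searches, but a practical design would presumably reset or update the register to keep the filter effective.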

International Parallel and Distributed Processing Symposium | 2003

Vectorization of multigrid codes using SIMD ISA extensions

Carlos García; Roberto Lario; Manuel Prieto; Luis Piñuel; Francisco Tirado

Motivated by the recent trend towards small-scale SIMD processing, in this paper we address the vectorization of multigrid codes on modern microprocessors. The aim is to demonstrate that this relatively new feature can be beneficial not only for multimedia programs but also for such numerical codes. As target kernels we have considered both standard and robust multigrid algorithms, which demand different vectorization strategies. Furthermore, we have also studied the well-known NAS-MG program from the NAS Parallel Benchmarks. In all cases, the performance benefits are quite satisfactory. This research is particularly relevant if we envisage using in-processor parallelism as a way to scale up the speedup of other optimizations such as efficient memory-hierarchy exploitation or multiprocessor parallelization.

European Conference on Parallel Processing | 1999

Implementation of Hybrid Context Based Value Predictors Using Value Sequence Classification

Luis Piñuel; R. Moreno; Francisco Tirado

Value prediction is as yet a very novel technique whose efficiency has still to be proved. To take advantage of this emerging technique in the short term, it is essential to design accurate and low-cost value predictors. This work presents a new approach to implementing hybrid predictors that allows the maximum sharing of information between predictors. We show that the new hybrid predictor surpasses the other predictors not only in accuracy but also in hardware utilization.
