Luis Piñuel
Complutense University of Madrid
Publication
Featured research published by Luis Piñuel.
IEEE Transactions on Parallel and Distributed Systems | 2008
Christian Tenllado; Javier Setoain; Manuel Prieto; Luis Piñuel; Francisco Tirado
The widespread usage of the discrete wavelet transform (DWT) has motivated the development of fast DWT algorithms and their tuning on all sorts of computer systems. Several studies have compared the performance of the most popular schemes, known as filter bank scheme (FBS) and lifting scheme (LS), and have always concluded that LS is the most efficient option. However, there is no such study on streaming processors such as modern Graphics Processing Units (GPUs). Current trends have transformed these devices into powerful stream processors with enough flexibility to perform intensive and complex floating-point calculations. The opportunities opened up by these platforms, as well as the growing popularity of the DWT within the computer graphics field, make a new performance comparison of great practical interest. Our study indicates that FBS outperforms LS in current-generation GPUs. In our experiments, the actual FBS gains range between 10 percent and 140 percent, depending on the problem size and the type and length of the wavelet filter. Moreover, design trends suggest higher gains in future-generation GPUs.
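As a rough illustration of the two schemes compared in the abstract above, the sketch below computes one level of the Haar wavelet, the simplest case, both as a filter bank and with lifting steps. It is plain Python for exposition, not the GPU implementation from the paper; the function names and the choice of the Haar filter are assumptions made for this example.

```python
import math


def haar_filter_bank(x):
    """One DWT level as a filter bank: apply a low-pass and a high-pass
    filter, keeping one output per pair of inputs (downsampling by 2).
    Assumes len(x) is even."""
    r = 1 / math.sqrt(2)
    low, high = (r, r), (-r, r)
    approx = [low[0] * x[i] + low[1] * x[i + 1] for i in range(0, len(x), 2)]
    detail = [high[0] * x[i] + high[1] * x[i + 1] for i in range(0, len(x), 2)]
    return approx, detail


def haar_lifting(x):
    """The same transform as lifting steps: split, predict, update, scale.
    Assumes len(x) is even."""
    even, odd = x[0::2], x[1::2]
    detail = [o - e for e, o in zip(even, odd)]        # predict step
    approx = [e + d / 2 for e, d in zip(even, detail)]  # update step
    s = math.sqrt(2)
    return [a * s for a in approx], [d / s for d in detail]


signal = [4.0, 6.0, 10.0, 12.0]
fb_approx, fb_detail = haar_filter_bank(signal)
ls_approx, ls_detail = haar_lifting(signal)
```

Both functions return the same coefficients up to rounding. The lifting version reuses intermediate results and needs fewer arithmetic operations, while the filter bank version computes each output independently of the others.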
International Symposium on Low Power Electronics and Design | 2003
Daniel Chaver; Luis Piñuel; Manuel Prieto; Francisco Tirado; Michael C. Huang
High-end processors typically incorporate complex branch predictors consisting of many large structures that together consume a notable fraction of total chip power (more than 10% in some cases). Depending on the applications, some of these resources may remain underused for long periods of time. We propose a methodology to reduce the energy consumption of the branch predictor by characterizing prediction demand using profiling and dynamically adjusting predictor resources accordingly. Specifically, we disable components of the hybrid direction predictor and resize the branch target buffer. Detailed simulations show that this approach reduces the energy consumption in the branch predictor by an average of 72% and up to 89% with virtually no impact on prediction accuracy and performance.
International Parallel and Distributed Processing Symposium | 2003
Daniel Chaver; Christian Tenllado; Luis Piñuel; Manuel Prieto; Francisco Tirado
This paper addresses the vectorization of the lifting-based wavelet transform on general-purpose microprocessors in the context of JPEG2000. Since SIMD exploitation strongly depends on efficient memory hierarchy usage, this research is based on previous work on cache-conscious DWT implementations. The experimental platform on which we have chosen to study the benefits of the SIMD extensions is an Intel Pentium-4 (P-4) based PC. However, unlike other authors, we have performed the vectorization without assembly-language programming in order to improve code portability and reduce development cost.
International Symposium on Microarchitecture | 2003
Michael C. Huang; Daniel Chaver; Luis Piñuel; Manuel Prieto; Francisco Tirado
To exploit instruction-level parallelism, high-end processors use branch predictors consisting of many large, often underutilized structures that cause unnecessary energy waste and high power consumption. By adapting the branch target buffer's size and dynamically disabling a hybrid predictor's components, the authors create a customized branch predictor that saves a significant amount of energy with little performance degradation.
International Parallel and Distributed Processing Symposium | 2002
Daniel Chaver; Manuel Prieto; Luis Piñuel; Francisco Tirado
In this paper we discuss several issues relevant to the parallel implementation of a 2-D discrete wavelet transform (DWT) on general-purpose multiprocessors. Our interest in this transform is motivated by its usage in an image fusion application which has to manage large image sizes, making parallel computing highly advisable. We have also paid much attention to memory hierarchy exploitation, since it has a tremendous impact on performance due to the lack of spatial locality when the DWT processes image columns.
Design, Automation, and Test in Europe | 2013
Roberto Rodríguez-Rodríguez; Fernando Castro; Daniel Chaver; Luis Piñuel; Francisco Tirado
Phase Change Memory (PCM) is currently postulated as the best alternative for replacing Dynamic Random Access Memory (DRAM) as the technology used for implementing main memories, thanks to significant advantages such as good scalability and low leakage. However, PCM also presents some drawbacks compared to DRAM, such as its lower endurance. This work analyzes the behavior of conventional cache replacement policies in terms of the number of writes to main memory. In addition, new last-level cache (LLC) replacement algorithms are proposed, aimed at reducing the number of writes to PCM and hence increasing its lifetime, without significantly degrading system performance.
International Conference on Computer Design | 2005
Fernando Castro; Daniel Chaver; Luis Piñuel; Manuel Prieto; Francisco Tirado; Michael C. Huang
Modern microprocessors incorporate sophisticated techniques to allow early execution of loads without compromising program correctness. To do so, the structures that hold the memory instructions (load and store queues) implement several complex mechanisms to dynamically resolve the memory-based dependences. Our main objective in this paper is to design an efficient LQ-SQ structure, which saves energy without sacrificing much performance. We propose a new design that divides the load queue into two structures, a conventional associative queue and a simpler FIFO queue that does not allow associative searching. A dependence predictor predicts whether a load instruction has a memory dependence on any in-flight store instruction. If so, the load is sent to the conventional associative queue. Otherwise, it is sent to the non-associative queue, which can only detect dependences in an inexact and conservative way. In addition, the load will not check the store queue at execution time. Combined, these measures reduce energy consumption. We explore different predictor designs and runtime policies. Our experiments indicate that such a design can reduce the energy consumption in the load-store queue by 35-50% with an insignificant performance penalty of about 1%. When the energy cost of the increased execution time is factored in, the processor still makes net energy savings of about 3-4%.
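The routing decision described in the abstract above can be sketched as a toy model. This is only an illustration under stated assumptions: the class name, the one-bit table indexed by load PC, and the training rule are hypothetical and are not taken from the paper, which explores several predictor designs.

```python
from collections import deque


class SplitLoadQueue:
    """Toy model of a load queue split into an associative part and a FIFO part."""

    def __init__(self, table_size=64):
        # Hypothetical dependence predictor: one bit per entry, indexed by load PC.
        self.table_size = table_size
        self.predicts_dependent = [False] * table_size
        self.associative = []  # entries here can be searched by address
        self.fifo = deque()    # entries here are never searched associatively

    def _index(self, pc):
        return pc % self.table_size

    def dispatch(self, pc, address):
        """Send a load to one of the two queues based on the prediction."""
        if self.predicts_dependent[self._index(pc)]:
            self.associative.append((pc, address))
            return "associative"
        self.fifo.append((pc, address))
        return "fifo"

    def train(self, pc, was_dependent):
        """Assumed training rule: remember the last observed outcome for this PC."""
        self.predicts_dependent[self._index(pc)] = was_dependent


lq = SplitLoadQueue()
first = lq.dispatch(0x40, 0x1000)   # no dependence predicted yet
lq.train(0x40, True)                # a dependence on an in-flight store was seen
second = lq.dispatch(0x40, 0x1000)  # now routed to the associative queue
```

In this model the energy argument comes from keeping most loads in the FIFO, which never performs an associative search.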
International Symposium on Microarchitecture | 2006
Fernando Castro; Luis Piñuel; Daniel Chaver; Manuel Prieto; Michael C. Huang; Francisco Tirado
One of the main challenges of modern processor design is the implementation of a scalable and efficient mechanism to detect memory access order violations as a result of out-of-order execution of memory instructions. Traditional CAM-based associative queues can be very slow and energy hungry. In this paper we introduce two new management schemes. The first one is a filtering scheme based on simple age-tracking. This scheme can easily avoid 95-98% of associative load queue (LQ) searches using only a few registers. This translates into significant power savings. More importantly, however, this filtering makes our second scheme, delayed memory dependence checking (DMDC), practical. With a small hash table, DMDC completely avoids the need for an associative LQ and relies on indexing-based checking at the commit phase and hence cuts the energy spent on LQ by an average of 95%. At an average of about 0.3%, the performance impact is negligible. When the energy cost of the increased execution time is factored in, the processor still makes net energy savings of about 3-8%, depending on the configuration and the applications.
International Parallel and Distributed Processing Symposium | 2003
Carlos García; Roberto Lario; Manuel Prieto; Luis Piñuel; Francisco Tirado
Motivated by the recent trend towards small-scale SIMD processing, in this paper we have addressed the vectorization of multigrid codes on modern microprocessors. The aim is to demonstrate that this relatively new feature can be beneficial not only for multimedia programs but also for such numerical codes. As target kernels we have considered both standard and robust multigrid algorithms, which demand different vectorization strategies. Furthermore, we have also studied the well-known NAS-MG program from the NAS Parallel Benchmarks. In all cases, the performance benefits are quite satisfactory. This research is particularly relevant if we envisage using in-processor parallelism as a way to scale up the speedup of other optimizations such as efficient memory-hierarchy exploitation or multiprocessor parallelization.
European Conference on Parallel Processing | 1999
Luis Piñuel; R. Moreno; Francisco Tirado
Value prediction is as yet a very novel technique, whose efficiency has still to be proved. To take advantage of this emerging technique in the short term it is essential to design accurate and low-cost value predictors. This work presents a new approach to implementing hybrid predictors that allows the maximum sharing of information between predictors. We show that the new hybrid predictor surpasses the other predictors not only in accuracy but also in hardware utilization.
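To make the idea of a hybrid value predictor concrete, here is a minimal sketch that combines a last-value component and a stride component. It is not the design from the paper: the class name, the per-PC dictionaries, and the one-bit selector are assumptions for illustration. The one point it borrows from the abstract is sharing, since both components read the same stored last value instead of keeping separate copies.

```python
class HybridValuePredictor:
    """Toy hybrid of last-value and stride prediction with shared state."""

    def __init__(self):
        self.last = {}        # pc -> last produced value, shared by both components
        self.stride = {}      # pc -> most recent difference between values
        self.use_stride = {}  # pc -> selector bit (assumed policy)

    def predict(self, pc):
        if pc not in self.last:
            return None  # nothing known about this instruction yet
        if self.use_stride.get(pc, False):
            return self.last[pc] + self.stride.get(pc, 0)
        return self.last[pc]

    def update(self, pc, actual):
        if pc in self.last:
            last_pred = self.last[pc]
            stride_pred = self.last[pc] + self.stride.get(pc, 0)
            # Switch component only when exactly one of them was correct.
            if stride_pred == actual and last_pred != actual:
                self.use_stride[pc] = True
            elif last_pred == actual and stride_pred != actual:
                self.use_stride[pc] = False
            self.stride[pc] = actual - self.last[pc]
        self.last[pc] = actual


predictor = HybridValuePredictor()
for value in (10, 20, 30):
    predictor.update(pc=1, actual=value)
next_value = predictor.predict(pc=1)  # stride component selected for this PC
```

After observing 10, 20, 30 for the same instruction, the selector settles on the stride component and the next prediction is 40; a constant sequence would stay on the last-value component.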