Daniel Chaver
Complutense University of Madrid
Publications
Featured research published by Daniel Chaver.
international symposium on low power electronics and design | 2003
Daniel Chaver; Luis Piñuel; Manuel Prieto; Francisco Tirado; Michael C. Huang
High-end processors typically incorporate complex branch predictors consisting of many large structures that together consume a notable fraction of total chip power (more than 10% in some cases). Depending on the applications, some of these resources may remain underused for long periods of time. We propose a methodology to reduce the energy consumption of the branch predictor by characterizing prediction demand using profiling and dynamically adjusting predictor resources accordingly. Specifically, we disable components of the hybrid direction predictor and resize the branch target buffer. Detailed simulations show that this approach reduces the energy consumption in the branch predictor by an average of 72% and up to 89% with virtually no impact on prediction accuracy and performance.
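As a rough illustration of the idea (not the paper's actual hardware design or code), the sketch below models a hybrid direction predictor whose global component is simply not consulted when an offline profile indicates it adds little accuracy; the `HybridPredictor` class, its interface, and all table sizes are hypothetical.

```cpp
// Minimal sketch: a hybrid direction predictor whose global (gshare) component
// can be disabled, mimicking profile-guided adaptation of predictor resources.
#include <cstddef>
#include <cstdint>
#include <vector>

class HybridPredictor {
public:
    explicit HybridPredictor(bool use_global)
        : use_global_(use_global),
          bimodal_(1 << 12, 2), gshare_(1 << 14, 2), chooser_(1 << 12, 2) {}

    bool predict(uint64_t pc) const {
        bool bim = bimodal_[idx(pc, bimodal_.size())] >= 2;
        if (!use_global_)                    // component disabled: no lookup energy
            return bim;
        bool gsh = gshare_[gidx(pc)] >= 2;
        bool use_gsh = chooser_[idx(pc, chooser_.size())] >= 2;
        return use_gsh ? gsh : bim;
    }

    void update(uint64_t pc, bool taken) {
        bump(bimodal_[idx(pc, bimodal_.size())], taken);
        if (use_global_) {
            bump(gshare_[gidx(pc)], taken);
            // chooser training omitted for brevity
        }
        history_ = (history_ << 1) | static_cast<uint64_t>(taken);
    }

private:
    static void bump(uint8_t& ctr, bool taken) {
        if (taken)  { if (ctr < 3) ++ctr; }
        else        { if (ctr > 0) --ctr; }
    }
    static std::size_t idx(uint64_t pc, std::size_t n) { return (pc >> 2) & (n - 1); }
    std::size_t gidx(uint64_t pc) const { return ((pc >> 2) ^ history_) & (gshare_.size() - 1); }

    bool use_global_;        // decided offline from a profiling run of the application
    uint64_t history_ = 0;   // global branch history
    std::vector<uint8_t> bimodal_, gshare_, chooser_;  // 2-bit saturating counters
};
```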
international parallel and distributed processing symposium | 2003
Daniel Chaver; Christian Tenllado; Luis Piñuel; Manuel Prieto; Francisco Tirado
This paper addresses the vectorization of the lifting-based wavelet transform on general-purpose microprocessors in the context of JPEG2000. Since SIMD exploitation strongly depends on efficient memory hierarchy usage, this research builds on previous work on cache-conscious DWT implementations. The experimental platform on which we have chosen to study the benefits of the SIMD extensions is an Intel Pentium-4 (P-4) based PC. However, unlike other authors, we have performed the vectorization without resorting to assembly-language programming, in order to improve code portability and reduce development cost.
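The sketch below illustrates the kind of portable formulation the abstract alludes to: one predict/update pair of the reversible 5/3 lifting scheme written as plain C++ loops over deinterleaved even/odd sample arrays, which a compiler can auto-vectorize for SSE without any assembly. The function names, data layout, and omitted boundary handling are assumptions, not the paper's code.

```cpp
// Minimal sketch of one 5/3 lifting pass over a deinterleaved image row:
// even[] holds the n+1 even-indexed samples, odd[] the n odd-indexed samples.
// Unit-stride loops like these are friendly to compiler auto-vectorization.
#include <cstddef>
#include <vector>

// Predict step: odd samples become detail coefficients.
// d[i] = odd[i] - floor((even[i] + even[i+1]) / 2)
void lifting_predict(const std::vector<int>& even, std::vector<int>& odd) {
    for (std::size_t i = 0; i < odd.size(); ++i)
        odd[i] -= (even[i] + even[i + 1]) >> 1;
}

// Update step: even samples become approximation coefficients
// (boundary samples skipped for simplicity).
// s[i] = even[i] + floor((d[i-1] + d[i] + 2) / 4)
void lifting_update(std::vector<int>& even, const std::vector<int>& odd) {
    for (std::size_t i = 1; i < odd.size(); ++i)
        even[i] += (odd[i - 1] + odd[i] + 2) >> 2;
}
```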
international symposium on microarchitecture | 2003
Michael C. Huang; Daniel Chaver; Luis Piñuel; Manuel Prieto; Francisco Tirado
To exploit instruction-level parallelism, high-end processors use branch predictors consisting of many large, often underutilized structures that cause unnecessary energy waste and high power consumption. By adapting the branch target buffer's size and dynamically disabling a hybrid predictor's components, the authors create a customized branch predictor that saves a significant amount of energy with little performance degradation.
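As an illustration only (not the authors' design), a branch target buffer whose effective size is adjusted by masking index bits might look like the following; names and sizes are hypothetical, and a real implementation would power-gate the unused banks rather than merely ignore them.

```cpp
// Minimal sketch of a resizable direct-mapped BTB: shrinking it means using
// fewer index bits, so only a fraction of the sets remain active.
#include <cstddef>
#include <cstdint>
#include <vector>

class ResizableBTB {
public:
    explicit ResizableBTB(unsigned log2_sets)
        : active_log2_(log2_sets),
          tags_(std::size_t{1} << log2_sets, 0),
          targets_(std::size_t{1} << log2_sets, 0),
          valid_(std::size_t{1} << log2_sets, false) {}

    // Use only 2^k sets; in hardware the remaining banks would be power-gated.
    void resize(unsigned k) { active_log2_ = k; }

    bool lookup(uint64_t pc, uint64_t& target) const {
        std::size_t s = set(pc);
        if (!valid_[s] || tags_[s] != pc) return false;
        target = targets_[s];
        return true;
    }

    void insert(uint64_t pc, uint64_t target) {
        std::size_t s = set(pc);
        valid_[s] = true;
        tags_[s] = pc;
        targets_[s] = target;
    }

private:
    std::size_t set(uint64_t pc) const {
        return (pc >> 2) & ((std::size_t{1} << active_log2_) - 1);
    }
    unsigned active_log2_;                 // current log2 of the active set count
    std::vector<uint64_t> tags_, targets_;
    std::vector<bool> valid_;
};
```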
international parallel and distributed processing symposium | 2002
Daniel Chaver; Manuel Prieto; Luis Piñuel; Francisco Tirado
In this paper we discuss several issues relevant to the parallel implementation of a 2-D discrete wavelet transform (DWT) on general-purpose multiprocessors. Our interest in this transform is motivated by its use in an image fusion application that has to manage large images, making parallel computing highly advisable. We have also paid close attention to memory hierarchy exploitation, since it has a tremendous impact on performance due to the lack of spatial locality when the DWT processes image columns.
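A minimal sketch of one locality-friendly formulation, assuming a row-major image layout (illustrative, not the paper's implementation): the vertical lifting step is written with a unit-stride inner loop across a row, so the column transform never strides through memory, and independent odd rows are distributed among processors with OpenMP.

```cpp
// Vertical predict step of a 5/3 lifting DWT over a row-major image
// (image[r * width + c]); interior rows only. Compile with -fopenmp.
#include <cstddef>
#include <vector>

void vertical_predict(std::vector<int>& image, std::size_t width, std::size_t height) {
    const std::ptrdiff_t h = static_cast<std::ptrdiff_t>(height);
    #pragma omp parallel for
    for (std::ptrdiff_t r = 1; r < h - 1; r += 2) {
        int* odd        = &image[static_cast<std::size_t>(r) * width]; // odd (detail) row
        const int* up   = odd - width;                                 // even row above
        const int* down = odd + width;                                 // even row below
        for (std::size_t c = 0; c < width; ++c)                        // unit-stride inner loop
            odd[c] -= (up[c] + down[c]) >> 1;
    }
}
```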
design, automation, and test in europe | 2013
Roberto Rodríguez-Rodríguez; Fernando Castro; Daniel Chaver; Luis Piñuel; Francisco Tirado
Phase Change Memory (PCM) is currently postulated as the best alternative for replacing Dynamic Random Access Memory (DRAM) as the technology used for implementing main memory, thanks to significant advantages such as good scalability and low leakage. However, PCM also presents some drawbacks compared to DRAM, such as its lower endurance. This work presents an analysis of the behavior of conventional cache replacement policies in terms of the number of writes to main memory. In addition, new last-level cache (LLC) replacement algorithms are proposed, aimed at reducing the number of writes to PCM and hence increasing its lifetime, without significantly degrading system performance.
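As a hedged sketch of the kind of write-aware victim selection this line of work suggests (not the specific algorithms proposed in the paper): among the lines of a set, prefer evicting the least-recently-used clean line, since evicting a clean line causes no write-back to PCM, and fall back to plain LRU only when every line is dirty. All structure names below are illustrative.

```cpp
// Write-aware victim selection for one cache set.
#include <cstddef>
#include <cstdint>
#include <vector>

struct CacheLine {
    uint64_t tag = 0;
    bool valid = false;
    bool dirty = false;     // evicting a dirty line triggers a write-back to PCM
    uint64_t last_use = 0;  // timestamp used for LRU ordering
};

// Returns the way to evict: the LRU clean line if any exists, else the LRU line.
std::size_t select_victim(const std::vector<CacheLine>& set) {
    std::size_t best = set.size();
    for (std::size_t w = 0; w < set.size(); ++w) {
        if (!set[w].valid) return w;                 // free way: no eviction needed
        if (set[w].dirty) continue;
        if (best == set.size() || set[w].last_use < set[best].last_use)
            best = w;
    }
    if (best != set.size()) return best;             // clean victim found
    best = 0;                                        // all lines dirty: plain LRU
    for (std::size_t w = 1; w < set.size(); ++w)
        if (set[w].last_use < set[best].last_use)
            best = w;
    return best;
}
```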
international conference on computer design | 2005
Fernando Castro; Daniel Chaver; Luis Piñuel; Manuel Prieto; Francisco Tirado; Michael C. Huang
Modern microprocessors incorporate sophisticated techniques to allow early execution of loads without compromising program correctness. To do so, the structures that hold the memory instructions (the load and store queues) implement several complex mechanisms to dynamically resolve memory-based dependences. Our main objective in this paper is to design an efficient LQ-SQ structure that saves energy without sacrificing much performance. We propose a new design that divides the load queue into two structures: a conventional associative queue and a simpler FIFO queue that does not allow associative searching. A dependence predictor predicts whether a load instruction has a memory dependence on any in-flight store instruction. If so, the load is sent to the conventional associative queue. Otherwise, it is sent to the non-associative queue, which can only detect dependences in an inexact and conservative way. In addition, the load will not check the store queue at execution time. Together, these measures reduce energy consumption. We explore different predictor designs and runtime policies. Our experiments indicate that such a design can reduce the energy consumption in the load-store queue by 35-50% with an insignificant performance penalty of about 1%. When the energy cost of the increased execution time is factored in, the processor still achieves a net energy saving of about 3-4%.
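The steering decision described above could be sketched as follows, assuming a small PC-indexed table of 2-bit saturating counters; the table size, indexing, and class interface are illustrative assumptions rather than the paper's exact predictor.

```cpp
// Minimal sketch: a PC-indexed predictor decides whether a load goes to the
// associative load queue (predicted dependent) or to a cheaper FIFO queue.
#include <array>
#include <cstddef>
#include <cstdint>

class LoadDependencePredictor {
public:
    enum class Queue { Associative, Fifo };

    Queue steer(uint64_t load_pc) const {
        return table_[index(load_pc)] >= 2 ? Queue::Associative : Queue::Fifo;
    }

    // Trained when the pipeline later learns whether the load actually
    // conflicted with an older in-flight store.
    void train(uint64_t load_pc, bool had_dependence) {
        uint8_t& ctr = table_[index(load_pc)];
        if (had_dependence) { if (ctr < 3) ++ctr; }
        else                { if (ctr > 0) --ctr; }
    }

private:
    static std::size_t index(uint64_t pc) { return (pc >> 2) & (kEntries - 1); }
    static constexpr std::size_t kEntries = 4096;
    std::array<uint8_t, kEntries> table_{};   // 2-bit saturating counters
};
```

A misprediction in the safe direction (steering an independent load to the associative queue) only costs energy, while the FIFO queue's conservative checking preserves correctness for loads steered the other way.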
international symposium on microarchitecture | 2006
Fernando Castro; Luis Piñuel; Daniel Chaver; Manuel Prieto; Michael C. Huang; Francisco Tirado
One of the main challenges of modern processor design is the implementation of a scalable and efficient mechanism to detect memory access order violations that result from out-of-order execution of memory instructions. Traditional CAM-based associative queues can be very slow and energy-hungry. In this paper we introduce two new management schemes. The first is a filtering scheme based on simple age tracking. This scheme can easily avoid 95-98% of associative load queue (LQ) searches using only a few registers, which translates into significant power savings. More importantly, this filtering makes our second scheme, delayed memory dependence checking (DMDC), practical. With a small hash table, DMDC completely avoids the need for an associative LQ and relies on indexing-based checking at the commit stage, cutting the energy spent on the LQ by an average of 95%. At an average of about 0.3%, the performance impact is negligible. When the energy cost of the increased execution time is factored in, the processor still achieves net energy savings of about 3-8%, depending on the configuration and the applications.
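A minimal sketch of an age-based filter in this spirit (the paper's actual mechanism may differ in detail): a single register tracks the youngest load that has already executed, and a resolving store skips the associative LQ search whenever no executed load is younger than it.

```cpp
// Illustrative age-based filter for store-triggered load-queue searches.
#include <cstdint>

class LoadSearchFilter {
public:
    // Called when a load executes (seq = program-order sequence number).
    void on_load_executed(uint64_t seq) {
        if (!any_executed_ || seq > youngest_executed_load_) {
            youngest_executed_load_ = seq;
            any_executed_ = true;
        }
    }

    // Called when a store's address resolves: is the associative search needed?
    bool must_search_lq(uint64_t store_seq) const {
        return any_executed_ && youngest_executed_load_ > store_seq;
    }

    // Conservative reset, e.g. on a pipeline flush (a real design also has to
    // account for loads leaving the queue at commit).
    void reset() { any_executed_ = false; youngest_executed_load_ = 0; }

private:
    bool any_executed_ = false;
    uint64_t youngest_executed_load_ = 0;
};
```

Because most stores resolve before any younger load has executed, a filter of this kind lets the common case bypass the expensive associative structure entirely.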
ieee international conference on high performance computing data and analytics | 2002
Daniel Chaver; Christian Tenllado; Luis Piñuel; Manuel Prieto; Francisco Tirado
In this paper we discuss several issues relevant to the vectorization of a 2-D discrete wavelet transform (DWT) on current microprocessors. Our research builds on previous studies of efficient memory hierarchy exploitation, given its tremendous impact on performance. We have extended this work with a more detailed analysis based on hardware performance counters and with a study of vectorization; in particular, we have used the SSE instruction set of the Intel Pentium III. Most of our optimizations are performed at the source-code level to allow automatic vectorization, though some compiler intrinsic functions have been introduced to enhance performance. Taking into account the level of abstraction at which the optimizations are performed, the results obtained on an Intel Pentium III microprocessor are quite satisfactory, although further improvement could be obtained through a more extensive use of compiler intrinsics.
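For the intrinsics part, a lifting step expressed with Pentium III SSE compiler intrinsics (no assembler) might look like the sketch below; the coefficient `alpha`, the deinterleaved layout, and the function name are illustrative assumptions, not the paper's code.

```cpp
// One floating-point lifting step, four samples per iteration, using SSE
// compiler intrinsics. 'even' must hold n+1 samples, 'odd' holds n samples.
#include <xmmintrin.h>   // SSE (Pentium III) intrinsics
#include <cstddef>

// odd[i] += alpha * (even[i] + even[i+1])
void lifting_step_sse(float* odd, const float* even, std::size_t n, float alpha) {
    const __m128 a = _mm_set1_ps(alpha);
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        __m128 e0 = _mm_loadu_ps(even + i);
        __m128 e1 = _mm_loadu_ps(even + i + 1);
        __m128 d  = _mm_loadu_ps(odd + i);
        d = _mm_add_ps(d, _mm_mul_ps(a, _mm_add_ps(e0, e1)));
        _mm_storeu_ps(odd + i, d);
    }
    for (; i < n; ++i)               // scalar tail
        odd[i] += alpha * (even[i] + even[i + 1]);
}
```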
acm symposium on applied computing | 2015
Juan Carlos Saez; Adrián Pousa; Fernando Castro; Daniel Chaver; Manuel Prieto-Matías
Single-ISA (instruction set architecture) asymmetric multicore processors (AMPs) have been shown to deliver higher performance per watt and area than symmetric CMPs (chip multiprocessors) for applications with diverse architectural requirements. A large body of work has demonstrated that this potential of AMP systems can be realized via OS scheduling. Yet existing schedulers that seek to deliver fairness on AMPs do not ensure that equal-priority applications experience the same slowdown when sharing the system. Moreover, most of these schemes are also subject to high throughput degradation and fail to deal effectively with user priorities. In this work we propose ACFS, an asymmetry-aware completely fair scheduler that seeks to optimize fairness while ensuring acceptable throughput. Our evaluation on real AMP hardware, using scheduler implementations in a general-purpose OS, demonstrates that ACFS achieves an average 11% fairness improvement over state-of-the-art schemes while providing better system throughput.
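As a rough sketch of the bookkeeping idea behind an asymmetry-aware fair scheduler (illustrative only, not ACFS's actual kernel code): each thread's progress is charged in small-core-equivalent time scaled by a per-thread speedup estimate and its priority weight, and the thread with the least charged progress runs next. All names and fields below are assumptions.

```cpp
// Minimal user-level model of asymmetry-aware fair accounting.
#include <algorithm>
#include <cstdint>
#include <vector>

struct Thread {
    int id;
    double weight;             // derived from user priority (higher = larger share)
    double speedup_on_big;     // estimated big-core speedup for this thread
    double charged_progress = 0.0;
};

// Account for a slice of 'cycles' executed by 'thr' on a big or small core:
// big-core time is charged at the thread's speedup rate, so threads that
// benefit more from the fast cores are charged more for using them.
void account(Thread& thr, uint64_t cycles, bool on_big_core) {
    double equivalent = on_big_core ? cycles * thr.speedup_on_big
                                    : static_cast<double>(cycles);
    thr.charged_progress += equivalent / thr.weight;
}

// Pick the runnable thread that has made the least weighted progress so far.
Thread* pick_next(std::vector<Thread>& runnable) {
    if (runnable.empty()) return nullptr;
    return &*std::min_element(runnable.begin(), runnable.end(),
        [](const Thread& a, const Thread& b) {
            return a.charged_progress < b.charged_progress;
        });
}
```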
international symposium on low power electronics and design | 2006
Alok Garg; Fernando Castro; Michael C. Huang; Daniel Chaver; Luis Piñuel; Manuel Prieto
Buffering more in-flight instructions in an out-of-order microprocessor is a straightforward and effective method to help tolerate the long latencies generally associated with off-chip memory accesses. One of the main challenges of buffering a large number of instructions, however, is the implementation of a scalable and efficient mechanism to detect memory access order violations that result from out-of-order scheduling of load and store instructions. Traditional CAM-based associative queues can be very slow and energy-consuming. In this paper, instead of using the traditional age-based load queue to record load addresses, we explicitly record age information in address-indexed hash tables to achieve the same functionality of detecting premature loads. This alternative design eliminates associative searches and significantly reduces the energy consumption of the load queue. With simple techniques to reduce the number of false positives, performance degradation is kept to a minimum.
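The sketch below illustrates the general idea (not the paper's exact design): an address-indexed table records, per bucket, the youngest in-flight load seen at a matching address, and a resolving store indexes the same bucket to detect a possible premature load. Hash aliasing within a bucket can only add false positives, never hide a violation that maps to that bucket; table size and indexing are assumptions.

```cpp
// Illustrative address-indexed replacement for an associative load queue.
#include <array>
#include <cstddef>
#include <cstdint>

class LoadAgeTable {
public:
    // Record that a load with sequence number 'seq' executed to 'addr'.
    void on_load_executed(uint64_t addr, uint64_t seq) {
        Entry& e = table_[index(addr)];
        if (!e.valid || seq > e.youngest_load_seq) {
            e.youngest_load_seq = seq;
            e.valid = true;
        }
    }

    // A store to 'addr' with sequence number 'store_seq' resolves its address:
    // returns true if a younger load may have executed prematurely.
    bool possible_violation(uint64_t addr, uint64_t store_seq) const {
        const Entry& e = table_[index(addr)];
        return e.valid && e.youngest_load_seq > store_seq;
    }

private:
    struct Entry { uint64_t youngest_load_seq = 0; bool valid = false; };
    static std::size_t index(uint64_t addr) { return (addr >> 3) & (kBuckets - 1); }
    static constexpr std::size_t kBuckets = 1024;
    std::array<Entry, kBuckets> table_{};
};
```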