Carlos Molina
Polytechnic University of Catalonia
Publications
Featured research published by Carlos Molina.
international conference on parallel processing | 1999
Antonio González; Jordi Tubella; Carlos Molina
Trace-level reuse is based on the observation that some traces (dynamic sequences of instructions) are frequently repeated during the execution of a program, and in many cases, the instructions that make up such traces have the same source operand values. The execution of such traces will obviously produce the same outcome and thus, their execution can be skipped if the processor records the outcome of previous executions. This paper presents an analysis of the performance potential of trace-level reuse and discusses a preliminary realistic implementation. Like instruction-level reuse, trace-level reuse can improve performance by decreasing resource contention and the latency of some instructions. However, we show that trace-level reuse is more effective than instruction-level reuse because the former can avoid fetching the instructions of reused traces. This has two important benefits: it reduces the fetch bandwidth requirements, and it increases the effective instruction window size since these instructions do not occupy window entries. Moreover, trace-level reuse can compute all at once the result of a chain of dependent instructions, which may allow the processor to avoid the serialization caused by data dependences and thus, to potentially exceed the dataflow limit.
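The mechanism described above can be sketched in software as a memo table keyed by a trace's starting PC and its live-in operand values. This is a minimal illustrative model under assumed names (`reuse_table`, `execute_trace`, a list-of-tuples trace encoding), not the paper's actual hardware design:

```python
# Hypothetical model of trace-level reuse: a reuse table maps
# (trace start PC, captured live-in values) to the outcomes of a previous
# execution. On a hit, the whole trace's results are applied at once and
# fetch/execution of its instructions is skipped.

reuse_table = {}

def execute_trace(start_pc, regs, trace_body):
    """trace_body: list of (dest_reg, fn, src_regs) tuples describing the trace."""
    # Live-ins: registers read before being written inside the trace.
    written, live_ins = set(), []
    for dest, _, srcs in trace_body:
        for s in srcs:
            if s not in written and s not in live_ins:
                live_ins.append(s)
        written.add(dest)
    key = (start_pc, tuple(regs[r] for r in live_ins))
    if key in reuse_table:                # trace-level reuse hit:
        regs.update(reuse_table[key])     # apply the recorded outcome at once,
        return True                       # skipping fetch and execution
    outputs = {}
    for dest, fn, srcs in trace_body:     # normal execution of the trace
        regs[dest] = fn(*(regs[s] for s in srcs))
        outputs[dest] = regs[dest]
    reuse_table[key] = outputs            # record the outcome for future reuse
    return False
```

Note how a chain of dependent instructions (the second instruction reads `r3`, produced by the first) is resolved in a single table lookup on a hit, which is the source of the "exceed the dataflow limit" observation.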
international conference on supercomputing | 1999
Carlos Molina; Antonio González; Jordi Tubella
A mechanism for dynamic instruction-level reuse in superscalar microprocessors is presented. The underlying concept that the mechanism exploits is the run-time removal of redundant computations, in particular the elimination of common subexpressions and invariants. Removing redundant computation is a target of optimizing compilers, but they sometimes do not succeed due to their limited knowledge of the data. Moreover, the proposed mechanism can also remove quasi-redundant computations, such as subexpressions that often produce the same result but sometimes differ depending on the data values, and thus cannot be eliminated by the compiler. Experimental results for the SPEC95 benchmarks show that on average the mechanism can avoid the execution of about 32% of the dynamic instructions and provides a 1.10 speedup in a superscalar microprocessor. An extensive evaluation of different configurations and a comparison with previous schemes are presented, as well as the performance potential of a perfect reuse engine.
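The instruction-level variant of this idea can be sketched as a reuse buffer indexed by PC. This is an assumption-laden model for illustration (the names `reuse_buffer` and `execute` are hypothetical), not the paper's exact hardware:

```python
# Sketch of dynamic instruction-level reuse: for each PC, remember the last
# source operand values and the result. If the current operands match, the
# recorded result is reused and the functional unit is bypassed.

reuse_buffer = {}   # pc -> (operand values, result)

def execute(pc, op, operands):
    entry = reuse_buffer.get(pc)
    if entry is not None and entry[0] == operands:
        return entry[1], True              # reuse hit: execution skipped
    result = op(*operands)                 # normal execution
    reuse_buffer[pc] = (operands, result)  # record for future reuse
    return result, False
```

A loop-invariant computation hits this buffer on every iteration after the first, which is exactly the class of redundancy the compiler may fail to remove when the invariance depends on run-time data.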
international symposium on low power electronics and design | 2003
Carlos Molina; C. Aliagas; M. Garcia; A. González; J. Tubella
Current microprocessors spend a huge percentage of the die area on the memory hierarchy. Moreover, cache memory is responsible for a significant percentage of the total energy consumption. This paper presents a novel data cache design that reduces die area, power dissipation and latency. The new scheme, called Non Redundant Cache (NRC), exploits the immense amount of value replication observed in traditional data caches, and significantly reduces storage requirements by avoiding the replication of values. Results show that the NRC cache reduces die area by 32%, power dissipation by 14% and latency by 25%, while maintaining the miss ratio of a conventional cache.
international parallel and distributed processing symposium | 2011
Javier Lira; Carlos Molina; Antonio González
The exponential increase in the cache sizes of multicore processors (CMPs), accompanied by growing on-chip wire delays, makes it difficult to implement traditional caches with single and uniform access latencies. Non-Uniform Cache Architecture (NUCA) designs have been proposed to address this problem. NUCA divides the whole cache memory into smaller banks and allows nearer cache banks to have lower access latencies than farther banks, thus mitigating the effects of the cache's internal wires. Traditionally, NUCA organizations have been classified as static (S-NUCA) and dynamic (D-NUCA). While S-NUCA maps a data block to a unique bank in the NUCA cache, D-NUCA allows a data block to be mapped to multiple banks. Besides, D-NUCA designs are dynamic in the sense that data blocks may migrate towards the cores that access them most frequently. Recent works consider D-NUCA a promising design; however, in order to obtain significant performance benefits, they rely on an unaffordable access scheme to find data in the NUCA cache. In this paper, we propose a novel and implementable data search algorithm for D-NUCA designs in CMP architectures, called HK-NUCA (Home Knows where to find data within the NUCA cache). It exploits migration features by providing fast and power-efficient access to data located close to the requesting core. Moreover, HK-NUCA implements an efficient and cost-effective search mechanism to reduce miss latency and on-chip network contention. We show that using HK-NUCA as the data search mechanism in a D-NUCA design reduces the energy consumed per memory request by about 40%, and achieves an average performance improvement of 6%.
ieee international conference on high performance computing data and analytics | 1999
Carlos Molina; Antonio González; Jordi Tubella
Some memory writes have the particular behaviour of not modifying memory, since the value they write is equal to the value already stored. These stores are what we call Redundant Stores. In this paper we study the behaviour of these stores and show that a significant fraction of the memory traffic between the first- and second-level caches can be avoided by exploiting this feature. We show that with no additional hardware (just a simple comparator) and without increasing the cache latency, we can achieve a 10% reduction in memory traffic on average.
international conference on supercomputing | 2010
Javier Lira; Carlos Molina; Antonio González
The growing influence of wire delay in cache design means that access latencies to last-level cache banks are no longer constant. Non-Uniform Cache Architectures (NUCAs) have been proposed to address this problem. Furthermore, an efficient last-level cache is crucial in chip multiprocessor (CMP) architectures to reduce requests to the off-chip memory, because of the significant speed gap between processor and memory and the limited memory bandwidth. Therefore, a bank replacement policy that efficiently manages the NUCA cache is desirable. However, the decentralized nature of NUCA has prevented previously proposed replacement policies from being effective in this kind of cache: as banks operate independently of each other, their replacement decisions are restricted to a single NUCA bank. We propose a novel mechanism based on the bank replacement policy for NUCA caches on CMPs, called The Auction. This mechanism enables the replacement decisions taken in a single bank to be spread to the whole NUCA cache. Thus, global replacement policies that rely on the current state of the NUCA cache, such as evicting the least frequently accessed data in the whole NUCA cache, become feasible. Moreover, The Auction adapts to current program behaviour in order to relocate a line that is being evicted from a bank to the most suitable position in the whole cache. We propose, implement and evaluate three approaches of The Auction mechanism. We also show that The Auction manages the cache efficiently and significantly reduces the requests to the off-chip memory by increasing the hit ratio in the NUCA cache. This translates into an average IPC improvement of 8%, and reduces the energy consumed by the memory system by 4%.
international conference on computer design | 2009
Javier Lira; Carlos Molina; Antonio González
The increasing speed gap between processor and memory and the limited memory bandwidth make last-level cache performance crucial for CMP architectures. Non-Uniform Cache Architectures (NUCA) have been introduced to deal with this problem. This memory organization divides the whole memory space into smaller pieces or banks, allowing nearer banks to have better access latencies than farther banks. Moreover, an adaptive replacement policy that efficiently reduces misses in the last-level cache could boost performance, particularly if set associativity is adopted. Unfortunately, traditional replacement policies do not behave properly, as they were designed for single processors. This paper focuses on bank replacement. This policy involves three key decisions when there is a miss: where to place a data block within the cache set, which data to evict from the cache set and, finally, where to place the evicted data. We propose a novel replacement technique that enables more intelligent replacement decisions. This technique is based on the observation that some types of data are less commonly accessed depending on which bank they reside in. We call this technique LRU-PEA (Least Recently Used with a Priority Eviction Approach). We show that the proposed technique significantly reduces the requests to the off-chip memory by increasing the hit ratio in the NUCA cache. This translates into an average IPC improvement of 8% and an Energy per Instruction (EPI) reduction of 5%.
International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems | 2007
Miguel Delgado; Carlos Molina; Lázaro Rodríguez-Ariza; Daniel Sánchez; M. Amparo Vila
The special needs of OLAP technology were the main reason for adopting a multidimensional view of the data. Crisp models are not suitable for modelling complex or ill-defined domains. They also fail to integrate data from semi-structured or unstructured sources (e.g. the Internet), or data with incompatibilities in their schemata. In these situations, imprecision appears as a result of the modelling and/or integration, so we need a model able to manage imprecision in both the structures and the data. If we want to use expert knowledge in the analysis, we have to keep in mind that expert users are more comfortable using linguistic expressions instead of exact values. In this paper we present an extension of a fuzzy multidimensional model that supports the use of linguistic labels in the definition of the hierarchies, and the OLAP system that implements this model.
international conference on computer design | 2002
Carlos Molina; A. González; Jordi Tubella
ieee international conference on high performance computing, data, and analytics | 2011
Javier Lira; Carlos Molina; David M. Brooks; Antonio González
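The redundant-store idea described in one of the abstracts above (a write whose value equals the value already cached changes nothing, so the write-through traffic to the next level can be dropped by a simple comparator) can be sketched as follows. This is an illustrative software model with hypothetical names (`WriteThroughCache`, `store`), not the paper's hardware:

```python
# Model of a redundant-store filter: before forwarding a write to the next
# cache level, compare the new value against the currently cached value.
# If they are equal, the store is redundant and generates no traffic.

class WriteThroughCache:
    def __init__(self):
        self.data = {}               # address -> value (first-level contents)
        self.writes_forwarded = 0    # traffic actually sent to the next level

    def store(self, addr, value):
        if addr in self.data and self.data[addr] == value:
            return False             # redundant store: filtered out
        self.data[addr] = value
        self.writes_forwarded += 1   # non-redundant: forward the write
        return True
```

The only hardware cost this models is the value comparison on the existing cache read path, which matches the abstract's claim of "just a simple comparator" with no added latency.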