Network


Latest external collaboration at the country level. Dive into details by clicking on the dots.

Hotspot


Dive into the research topics where Manuel E. Acacio is active.

Publication


Featured research published by Manuel E. Acacio.


Conference on High Performance Computing (Supercomputing) | 2002

Owner Prediction for Accelerating Cache-to-Cache Transfer Misses in a cc-NUMA Architecture

Manuel E. Acacio; José González; José M. García; José Duato

Cache misses for which data must be obtained from a remote cache (cache-to-cache transfer misses) account for an important fraction of the total miss rate. Unfortunately, cc-NUMA designs place the access to the directory information on the critical path of 3-hop misses, which significantly penalizes them compared to SMP designs. This work studies the use of owner prediction as a means of providing cc-NUMA multiprocessors with more efficient support for cache-to-cache transfer misses. Our proposal comprises an effective prediction scheme as well as a coherence protocol designed to support the use of prediction. Results indicate that owner prediction can significantly reduce the latency of cache-to-cache transfer misses, which translates into application speed-ups of up to 12%. In order to also accelerate most of the 3-hop misses that are either not predicted or mispredicted, the inclusion of a small and fast directory cache in every node is evaluated, leading to improvements of up to 16% in final performance.
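
As a rough illustration of the idea, here is a minimal C++ sketch of an address-indexed owner predictor: a small direct-mapped table maps a block address to the node believed to hold the owned copy, so a 3-hop miss can become a direct 2-hop transfer when the prediction hits. The class name, indexing function, and sizes are assumptions for illustration, not the paper's exact scheme.

```cpp
// Minimal sketch of an address-indexed owner predictor, assuming a
// direct-mapped table; all names and sizes are illustrative.
#include <cstddef>
#include <cstdint>
#include <vector>

class OwnerPredictor {
public:
    explicit OwnerPredictor(unsigned index_bits)
        : entries_(std::size_t{1} << index_bits) {}

    // Predicted owner node for the block, or -1 for "no prediction", in
    // which case the miss takes the usual 3-hop path through the directory.
    int predict(uint64_t block_addr) const {
        const Entry& e = entries_[index(block_addr)];
        return (e.valid && e.tag == tag(block_addr)) ? e.owner : -1;
    }

    // Train with the true owner once the miss has been resolved.
    void update(uint64_t block_addr, int true_owner) {
        entries_[index(block_addr)] = {true, tag(block_addr), true_owner};
    }

private:
    struct Entry { bool valid = false; uint64_t tag = 0; int owner = -1; };
    std::size_t index(uint64_t a) const {
        return (a >> 6) & (entries_.size() - 1);  // drop 64-byte line offset
    }
    uint64_t tag(uint64_t a) const { return (a >> 6) / entries_.size(); }
    std::vector<Entry> entries_;
};
```

In a real design the predictor would sit beside the L2 miss path and be trained on every resolved cache-to-cache transfer miss, with the coherence protocol handling mispredictions safely.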


Parallel, Distributed and Network-Based Processing | 2009

A Parallel Implementation of the 2D Wavelet Transform Using CUDA

Joaquín Franco; Gregorio Bernabé; Juan C. Fernandez; Manuel E. Acacio

One multicore platform is currently attracting enormous attention due to its tremendous potential in terms of sustained performance: the NVIDIA Tesla boards. These cards, intended for general-purpose computing on graphics processing units (GPGPU), are used as data-parallel computing devices. They are based on the Compute Unified Device Architecture (CUDA), which is common to the latest NVIDIA GPUs. The bottom line is a multicore platform that provides an enormous potential performance benefit driven by a non-traditional programming model. In this paper we try to provide some insight into the peculiarities of CUDA for scientific computing by means of a specific example. In particular, we show that the parallelization of the two-dimensional fast wavelet transform for the NVIDIA Tesla C870 achieves a speedup of 20.8 for an image size of 8192x8192 when compared with the fastest host-only implementation using OpenMP, including the data transfers between main memory and device memory.
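
The row/column structure that makes the transform GPU-friendly can be sketched compactly. The host-only C++ below performs one decomposition level of a separable 2D wavelet transform, using unnormalized Haar filters for brevity; the paper's filter bank, tiling, and CUDA kernel organization are not reproduced here.

```cpp
// One decomposition level over an n x n row-major image. Every row, and
// then every column, is processed independently: exactly the data
// parallelism that the GPU version assigns to CUDA threads.
#include <cstddef>
#include <vector>

void dwt2d_level(std::vector<float>& img, std::size_t n) {
    std::vector<float> tmp(n);
    // Row pass: averages (approximation) in the left half, details right.
    for (std::size_t r = 0; r < n; ++r) {
        for (std::size_t c = 0; c < n / 2; ++c) {
            float a = img[r * n + 2 * c], b = img[r * n + 2 * c + 1];
            tmp[c]         = (a + b) * 0.5f;
            tmp[n / 2 + c] = (a - b) * 0.5f;
        }
        for (std::size_t c = 0; c < n; ++c) img[r * n + c] = tmp[c];
    }
    // Column pass: the same filter applied down each column.
    for (std::size_t c = 0; c < n; ++c) {
        for (std::size_t r = 0; r < n / 2; ++r) {
            float a = img[(2 * r) * n + c], b = img[(2 * r + 1) * n + c];
            tmp[r]         = (a + b) * 0.5f;
            tmp[n / 2 + r] = (a - b) * 0.5f;
        }
        for (std::size_t r = 0; r < n; ++r) img[r * n + c] = tmp[r];
    }
}
```

On the GPU, the host would copy the image to device memory, launch one kernel per pass, and copy the result back; the 20.8x figure quoted above includes those transfers.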


International Conference on Parallel Architectures and Compilation Techniques | 2002

The use of prediction for accelerating upgrade misses in cc-NUMA multiprocessors

Manuel E. Acacio; José González; José M. García; José Duato

This work focuses on accelerating upgrade misses in cc-NUMA multiprocessors. These misses are caused by store instructions for which a read-only copy of the line is found in the L2 cache. Upgrade misses require a message to be sent from the missing node to the directory, a directory lookup to find the set of sharers, invalidation messages to be sent to the sharers, and responses to the invalidations to be sent back. Therefore, the penalty paid by these misses is not negligible, especially considering that they account for a high percentage of the total miss rate. We propose the use of prediction as a means of providing cc-NUMA multiprocessors with more efficient support for upgrade misses by directly invalidating sharers from the missing node. Our proposal comprises an effective prediction scheme achieving high hit rates as well as a coherence protocol extended to support the use of prediction. Our work is motivated by two key observations: first, upgrade misses exhibit repetitive behavior and, second, the total number of sharers to be invalidated is small (one, in some cases). Using execution-driven simulations, we show that prediction can significantly accelerate upgrade misses (latency reductions of more than 40% in some cases). These improvements translate into application speed-ups of up to 14%. Finally, these results are obtained with a predictor whose total size is less than 48 KB per node.
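
A minimal sketch of the kind of structure involved, exploiting the repetitive behavior noted above: remember the sharer set observed the last time a line suffered an upgrade miss and invalidate those nodes directly. The names are illustrative, and std::unordered_map stands in for a small set-associative hardware table.

```cpp
// Illustrative sharer predictor for upgrade misses; not the paper's design.
#include <bitset>
#include <cstdint>
#include <unordered_map>

constexpr unsigned kNodes = 64;
using SharerSet = std::bitset<kNodes>;

class UpgradePredictor {
public:
    // Predicted sharers to invalidate directly from the missing node; an
    // empty set means "no prediction", falling back to the directory.
    SharerSet predict(uint64_t block_addr) const {
        auto it = table_.find(block_addr >> 6);   // index by line address
        return it != table_.end() ? it->second : SharerSet{};
    }

    // Record the true sharer set once the directory confirms the upgrade.
    void train(uint64_t block_addr, const SharerSet& sharers) {
        table_[block_addr >> 6] = sharers;
    }

private:
    std::unordered_map<uint64_t, SharerSet> table_;
};
```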


IEEE Transactions on Parallel and Distributed Systems | 2005

A two-level directory architecture for highly scalable cc-NUMA multiprocessors

Manuel E. Acacio; José González; José M. García; José Duato

One important issue the designer of a scalable shared-memory multiprocessor must deal with is the amount of extra memory required to store the directory information. It is desirable that the directory memory overhead be kept as low as possible and that it scale very slowly with the size of the machine. Unfortunately, current directory architectures provide scalability at the expense of performance. This work presents a scalable directory architecture that significantly reduces the size of the directory for large-scale configurations of a multiprocessor without degrading performance. First, we propose multilayer clustering as an effective approach to reducing the width of directory entries. Based on this concept, we derive three new compressed sharing codes, some of them with a space complexity of O(log₂(log₂ N)) for an N-node system. Then, we present a novel two-level directory architecture to eliminate the penalty caused by compressed directories in general. The proposed organization consists of a small full-map first-level directory (which provides precise information for the most recently referenced lines) and a compressed second-level directory (which provides in-excess information for all the lines). The proposals are evaluated through extensive execution-driven simulations (using RSIM) of a 64-node cc-NUMA multiprocessor. Results demonstrate that a system with a two-level directory architecture achieves the same performance as a multiprocessor with a large and nonscalable full-map directory, with a very significant reduction of the memory overhead.
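
To make the O(log₂(log₂ N)) figure concrete, here is a hedged C++ sketch of a tree-based compressed sharing code in the spirit of multilayer clustering: instead of N presence bits, a directory entry stores only the level of the smallest cluster of nodes around the home that covers every sharer. The encoding below is illustrative rather than one of the paper's three codes.

```cpp
// Illustrative tree-based compressed sharing code for a 64-node system:
// precision is traded for in-excess invalidations.
#include <bitset>

constexpr unsigned kNodes = 64;

// Encode: smallest level l such that every sharer lies in the home node's
// cluster of size 2^l (nodes agreeing with home on all but the low l bits).
unsigned encode_level(unsigned home, const std::bitset<kNodes>& sharers) {
    unsigned level = 0;
    for (unsigned s = 0; s < kNodes; ++s)
        while (sharers[s] && (s >> level) != (home >> level))
            ++level;
    return level;  // 0 means the home node alone
}

// Decode: the (possibly in-excess) set of nodes covered by the stored level.
std::bitset<kNodes> decode_level(unsigned home, unsigned level) {
    std::bitset<kNodes> covered;
    unsigned base = (home >> level) << level;   // first node in the cluster
    for (unsigned n = base; n < base + (1u << level); ++n)
        covered.set(n);
    return covered;
}
```

Storing the level takes ⌈log₂(log₂ N + 1)⌉ bits per entry (3 bits for 64 nodes), which is where the doubly logarithmic space complexity comes from.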


High-Performance Computer Architecture | 2001

A new scalable directory architecture for large-scale multiprocessors

Manuel E. Acacio; José González; José M. García; José Duato

The memory overhead introduced by directories constitutes a major hurdle to the scalability of cc-NUMA architectures, making the shared-memory paradigm infeasible for very large-scale systems. This work focuses on improving the scalability of shared-memory multiprocessors by significantly reducing the size of the directory. We propose multilayer clustering as an effective approach to reducing the directory-entry width. A detailed evaluation for 64 processors shows that this approach drastically reduces the memory overhead while suffering a performance degradation similar to that of previous compressed schemes (such as Coarse Vector). In addition, a novel two-level directory architecture is proposed in order to eliminate the penalty caused by these compressed directories. This organization consists of a small full-map first-level directory (which provides precise information for the most recently referenced lines) and a compressed second-level directory (which provides in-excess information). Results show that a system with this directory architecture can achieve the same performance as a multiprocessor with a large and non-scalable full-map directory, with a very significant reduction of the memory overhead.
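
The two-level organization can be sketched as a superset-returning lookup: precise when the line hits in the small full-map first level, in-excess otherwise. The structure names and map-based modeling below are assumptions for illustration.

```cpp
// Minimal sketch of the two-level directory lookup. The first level is
// small and precise; the second level stores a compressed cluster level
// and returns a superset of the sharers (in-excess, but always safe).
#include <bitset>
#include <cstdint>
#include <unordered_map>

constexpr unsigned kNodes = 64;
using Sharers = std::bitset<kNodes>;

struct TwoLevelDirectory {
    std::unordered_map<uint64_t, Sharers> l1;  // full-map, recent lines only
    std::unordered_map<uint64_t, uint8_t> l2;  // cluster level, all lines

    Sharers lookup(uint64_t line, unsigned home) const {
        if (auto it = l1.find(line); it != l1.end())
            return it->second;                 // precise information
        auto it = l2.find(line);
        if (it == l2.end()) return Sharers{};  // line not shared anywhere
        unsigned level = it->second;
        Sharers covered;                       // expand to the whole cluster
        unsigned base = (home >> level) << level;
        for (unsigned n = base; n < base + (1u << level); ++n)
            covered.set(n);
        return covered;                        // superset of actual sharers
    }
};
```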


International Parallel and Distributed Processing Symposium | 2008

DiCo-CMP: Efficient cache coherency in tiled CMP architectures

Alberto Ros; Manuel E. Acacio; José M. García

Future CMP designs that will integrate tens of processor cores on-chip will be constrained by area and power. Area constraints make the use of a bus or a crossbar as the on-chip interconnection network impractical, and tiled CMPs organized around a direct interconnection network will probably be the architecture of choice. Power constraints make it impractical to rely on broadcasts (as Token-CMP does) or any other brute-force method for keeping cache coherence, and directory-based cache coherence protocols are currently being employed. Unfortunately, directory protocols introduce indirection to access the directory information, which negatively impacts performance. In this work, we present DiCo-CMP, a novel cache coherence protocol especially suited to future tiled CMP architectures. In DiCo-CMP the role of storing up-to-date sharing information and ensuring totally ordered accesses for every memory block is assigned to the cache that must provide the block on a miss. Therefore, DiCo-CMP reduces the miss latency compared to a directory protocol by sending coherence messages directly from the requesting caches to those that must observe them (as brute-force protocols would do), and reduces the network traffic compared to Token-CMP (and consequently, power consumption in the interconnection network) by sending just one request message for each miss. Using an extended version of the GEMS simulator, we show that DiCo-CMP achieves improvements in execution time of up to 8% on average over a directory protocol, and reductions in network traffic of up to 42% on average compared to Token-CMP.
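
A hedged sketch of the direct-coherence idea follows: sharing information travels with the owner cache rather than a home directory, and the requester sends its single request message straight to the tile it believes is the owner, paying indirection only when it has no hint. Names and structures are illustrative, not the protocol's actual implementation.

```cpp
// Assumption-level sketch of direct coherence in a tiled CMP.
#include <bitset>
#include <cstdint>
#include <unordered_map>

constexpr unsigned kTiles = 16;

struct DiCoTile {
    // Sharing information travels with the owned block, not a directory.
    std::unordered_map<uint64_t, std::bitset<kTiles>> owned_sharers;

    // Owner hints learned from previous misses (a "coherence cache").
    std::unordered_map<uint64_t, unsigned> owner_hint;

    // Destination of the single request message for a miss on `block`:
    // the hinted owner if known, otherwise the home tile, which forwards
    // the request (regaining indirection only on a missing or stale hint).
    unsigned request_destination(uint64_t block, unsigned home_tile) const {
        auto it = owner_hint.find(block);
        return it != owner_hint.end() ? it->second : home_tile;
    }
};
```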


IEEE Transactions on Parallel and Distributed Systems | 2010

A Direct Coherence Protocol for Many-Core Chip Multiprocessors

Alberto Ros; Manuel E. Acacio; José M. García

Future many-core CMP designs that will integrate tens of processor cores on-chip will be constrained by area and power. Area constraints make the use of a bus or a crossbar as the on-chip interconnection network impractical, and tiled CMPs organized around a direct interconnection network will probably be the architecture of choice. Power constraints make it impractical to rely on broadcasts (as, for example, Token-CMP does) or any other brute-force method for keeping cache coherence, and directory-based cache coherence protocols are currently being employed. Unfortunately, directory protocols introduce indirection to access the directory information, which negatively impacts performance. In this work, we present DiCo-CMP, a novel cache coherence protocol especially suited to future many-core tiled CMP architectures. In DiCo-CMP, the task of storing up-to-date sharing information and ensuring ordered accesses for every memory block is assigned to the cache that must provide the block on a miss. Therefore, DiCo-CMP reduces the miss latency compared to a directory protocol by sending requests directly to the cache that provides the block on a cache miss. These latency reductions result in improvements in execution time of up to 6 percent, on average, over a directory protocol. In comparison with Token-CMP, our protocol sends only one request message for each cache miss and is thus able to reduce network traffic by 43 percent.
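
Complementing the sketch after the previous entry, the snippet below illustrates, as an assumption-level mechanism rather than the published one, how stale owner hints could be repaired when ownership migrates, keeping subsequent misses on the direct two-hop path.

```cpp
// Hypothetical hint-maintenance companion to the direct-coherence sketch:
// a request arriving at a tile that no longer owns the block is forwarded
// onward, and the eventual reply reveals the current owner so the
// requester can refresh its hint.
#include <cstdint>
#include <unordered_map>

struct OwnerHints {
    std::unordered_map<uint64_t, unsigned> hint;  // block -> believed owner

    // Called when a reply reveals the block's current owner.
    void refresh(uint64_t block, unsigned current_owner) {
        hint[block] = current_owner;
    }

    // A stale entry costs one forwarding hop before being repaired.
    void invalidate(uint64_t block) { hint.erase(block); }
};
```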


IEEE Transactions on Computers | 2010

Heterogeneous Interconnects for Energy-Efficient Message Management in CMPs

Antonio Flores; Juan L. Aragón; Manuel E. Acacio

Continuous improvements in integration scale have led major microprocessor vendors to move to designs that integrate several processing cores on the same chip. Chip multiprocessors (CMPs) constitute a good alternative to traditional monolithic designs for several reasons, among them better levels of performance, scalability, and performance/energy ratio. On the other hand, higher clock frequencies and increasing transistor density have revealed power dissipation and temperature as critical design issues in current and future architectures. Previous studies have shown that the interconnection network of a CMP has a significant impact on both overall performance and energy consumption. Moreover, the wires used in such interconnects can be designed with varying latency, bandwidth, and power characteristics. In this work, we show how messages can be efficiently managed, from the point of view of both performance and energy, in tiled CMPs using a heterogeneous interconnect. Our proposal consists of two approaches. The first is Reply Partitioning, a technique that splits replies with data into a short Partial Reply message that carries a subblock of the cache line containing the word requested by the processor, plus an Ordinary Reply with the full cache line. This technique allows all messages used to ensure coherence between the L1 caches of a CMP to be classified into two groups: critical and short, and noncritical and long. The second approach is the use of a heterogeneous interconnection network composed of low-latency wires for critical messages and low-energy wires for noncritical ones. Detailed simulations of 8- and 16-core CMPs show that our proposal obtains average savings of 7 percent in execution time and 70 percent in the Energy-Delay squared Product (ED2P) metric of the interconnect over previous works (from 24 to 30 percent average ED2P improvement for the full CMP). Additionally, a sensitivity analysis shows that although execution time is minimized with 16-byte subblocks, the best choice from the point of view of the ED2P metric is the 4-byte subblock configuration, with an additional improvement of 2 percent over the 16-byte one for the ED2P metric of the full CMP.
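
The classification at the heart of the proposal can be sketched in a few lines of C++: replies are split so that every critical message is short and can be steered to the low-latency wires, while long messages are never critical and use the low-energy wires. Field names and sizes below are illustrative.

```cpp
// Sketch of Reply Partitioning plus wire-plane selection; an assumption-
// level model, not the simulated implementation.
#include <cstddef>
#include <cstdint>
#include <utility>

constexpr std::size_t kLineBytes = 64;
constexpr std::size_t kSubblockBytes = 16;  // the paper also evaluates 4

enum class WirePlane { LowLatency, LowEnergy };

struct Message {
    bool critical;        // needed to unblock the requesting core?
    std::size_t payload;  // bytes of data carried
    WirePlane plane() const {
        // With replies split, critical messages are always short and ride
        // the fast wires; long messages are never critical.
        return critical ? WirePlane::LowLatency : WirePlane::LowEnergy;
    }
};

// Split a full-line reply at the subblock containing the requested word.
std::pair<Message, Message> split_reply(uint64_t req_addr) {
    (void)req_addr;  // the subblock offset would be derived from req_addr
    Message partial  {true,  kSubblockBytes};  // critical and short
    Message ordinary {false, kLineBytes};      // completes the line later
    return {partial, ordinary};
}
```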


High-Performance Computer Architecture | 2007

A Low Overhead Fault Tolerant Coherence Protocol for CMP Architectures

Ricardo Fernández-Pascual; José M. García; Manuel E. Acacio; José Duato

It is widely accepted that transient failures will appear more frequently in chips designed in the near future due to several factors, such as the increased integration scale. On the other hand, chip multiprocessors (CMPs), which integrate several processor cores on a single chip, are nowadays the best alternative for making more efficient use of the increasing number of transistors that can be placed on a single die. Hence, it is necessary to design new techniques to deal with these faults in order to build sufficiently reliable CMPs. In this work, we present a coherence protocol aimed at dealing with transient failures that affect the interconnection network of a CMP, thus assuming that the network is no longer reliable. In particular, our proposal extends a token-based cache coherence protocol so that no data can be lost and no deadlock can occur due to any dropped message. Using the GEMS full-system simulator, we compare our proposal against a similar protocol without fault tolerance (TOKENCMP). We show that in the absence of failures our proposal does not introduce overhead in terms of increased execution time over TOKENCMP. Additionally, our protocol can tolerate message loss rates much higher than those likely to be found in the real world without increasing execution time by more than 15%.
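
One ingredient of such a protocol can be sketched as follows: a sender keeps a backup of the data and tokens carried by each unacknowledged message, and a timeout triggers retransmission, so a dropped message can lose neither. This is a hedged illustration of the general mechanism, not the paper's exact protocol extension; names, the timeout value, and the callback shape are assumptions.

```cpp
// Assumption-level sketch of acknowledgment/backup-based fault tolerance.
#include <cstdint>
#include <unordered_map>

struct InFlight {
    int tokens;            // tokens carried by the unacknowledged message
    uint64_t data_backup;  // data kept until the receiver acknowledges
    int timeout_cycles;    // remaining cycles before assuming a loss
};

struct FaultTolerantSender {
    std::unordered_map<uint64_t, InFlight> pending;  // block -> message

    void on_ack(uint64_t block) { pending.erase(block); }  // safe to drop

    // Called once per cycle: retransmit anything whose timeout expired.
    template <typename ResendFn>
    void tick(ResendFn resend) {
        for (auto& [block, msg] : pending) {
            if (--msg.timeout_cycles <= 0) {
                resend(block, msg);         // tokens and data re-sent
                msg.timeout_cycles = 1000;  // illustrative timeout
            }
        }
    }
};
```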


IEEE Transactions on Parallel and Distributed Systems | 2004

An architecture for high-performance scalable shared-memory multiprocessors exploiting on-chip integration

Manuel E. Acacio; José González; José M. García; José Duato

Recent technology improvements allow multiprocessor designers to put some key components inside the processor chip, such as the memory controller, the coherence hardware, and the network interface/router. In this paper, we exploit this integration scale, presenting a novel node architecture aimed at reducing the long L2 miss latencies and the memory overhead of using directories that characterize cc-NUMA machines and limit their scalability. Our proposal replaces the traditional directory with a novel three-level directory architecture and adds a small shared data cache to each of the nodes of a multiprocessor system. Due to their small size, the first-level directory and the shared data cache are integrated into the processor chip of every node, which enhances performance by saving accesses to the slower main memory. Scalability is guaranteed by keeping the second- and third-level directories outside the processor chip and using compressed data structures. A taxonomy of L2 misses, according to the actions performed by the directory to satisfy them, is also presented. Using execution-driven simulations, we show that significant latency reductions can be obtained with the proposed node architecture, which translates into reductions of more than 30 percent in application execution time in several cases.
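
The probe order is easy to illustrate: the on-chip first-level directory is consulted first, and only on a miss do the off-chip levels get involved, letting a simulator charge the appropriate latency. The sketch below models all three levels as maps and keeps them precise for brevity; in the proposal the off-chip levels would use compressed data structures. All names are assumptions.

```cpp
// Hypothetical lookup order for a three-level directory node.
#include <bitset>
#include <cstdint>
#include <unordered_map>
#include <utility>

constexpr unsigned kNodes = 64;
using Sharers = std::bitset<kNodes>;

struct ThreeLevelDirectory {
    std::unordered_map<uint64_t, Sharers> l1_onchip;   // precise, few lines
    std::unordered_map<uint64_t, Sharers> l2_offchip;  // compressed in HW;
                                                       // modeled precise here
    std::unordered_map<uint64_t, Sharers> l3_memory;   // one entry per line

    // Returns the sharer set and the level it came from, so the caller can
    // charge on-chip vs. off-chip vs. main-memory latency.
    std::pair<Sharers, int> lookup(uint64_t line) const {
        if (auto it = l1_onchip.find(line); it != l1_onchip.end())
            return {it->second, 1};   // fast on-chip hit
        if (auto it = l2_offchip.find(line); it != l2_offchip.end())
            return {it->second, 2};   // off-chip, compressed level
        if (auto it = l3_memory.find(line); it != l3_memory.end())
            return {it->second, 3};   // backing structure in main memory
        return {Sharers{}, 3};        // uncached line: no sharers
    }
};
```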

Collaboration


Dive into Manuel E. Acacio's collaborations.

Top Co-Authors

José Duato

Polytechnic University of Valencia


Juan C. Fernandez

Los Alamos National Laboratory


Anurag Negi

Chalmers University of Technology


Per Stenström

Chalmers University of Technology
