Pablo Abad | Researchain

Archive Network Publication Hotspot Collaboration

Network

Latest external collaboration on country level. Dive into details by clicking on the dots.

Explore More

Hotspot

Dive into the research topics where Pablo Abad is active.

Explore More

Publication

Featured researches published by Pablo Abad.

international symposium on computer architecture | 2007

Rotary router: an efficient architecture for CMP interconnection networks

Pablo Abad; Valentin Puente; José A. Gregorio; Pablo Prieto

The trend towards increasing the number of processor cores and cache capacity in future Chip-Multiprocessors (CMPs), will require scalable packet-switched interconnection networks adapted to the restrictions imposed by the CMP environment. This paper presents an innovative router design, which successfully addresses CMP cost/performance constraints. The router structure is based on two independent rings, which force packets to circulate either clockwise or anti-clockwise, traveling through every port of the router. It uses a completely decentralized scheduling scheme, which allows the design to: (1) take advantage of wide links, (2) reduce Head of Line blocking, (3) use adaptive routing, (4) be topology agnostic, (5) scale with network degree, and (6) have reasonable power consumption and implementation cost. A thorough comparative performance analysis against competitive conventional routers shows an advantage for our proposal of up to 50 % in terms of raw performance and nearly 60 % in terms of energy-delay product.

networks on chips | 2012

TOPAZ: An Open-Source Interconnection Network Simulator for Chip Multiprocessors and Supercomputers

Pablo Abad; Pablo Prieto; Lucia G. Menezo; AdriÂ´n Colaso; Valentin Puente; José-Ángel Gregorio

As in other computer architecture areas, interconnection networks research relies most of the times on simulation tools. This paper announces the release of an open-source tool suitable to be used for accurate modeling from small CMP to large supercomputer interconnection networks. The cycle-accurate modeling of TOPAZ can be used standalone through synthetic traffic patterns and application-traces or within full-system evaluation systems such as GEMS or GEM5 effortlessly. In fact, we provide an advanced interface that enables the replacement of the original lightweight but optimistic GEMS and GEM5 network simulator with limited performance impact on the simulation time. Our tests indicate that in this context, underestimating network modeling could induce up to 50% error in the performance estimation of the simulated system. To minimize the impact of detailed network modeling on simulation time, we incorporate mechanisms able to attenuate the higher computational effort, reducing in this way the slowdown of the full system simulation with accurate performance estimations. Additionally, in order to evaluate large-scale networks, we parallelize the simulator to be able to optimize memory resources with the growing number of cores available per chip in the simulation farms. This allows us to simulate node networks exceeding one million of routers with up to 70% efficiency in a multithreaded simulation running on twelve cores.

high-performance computer architecture | 2009

MRR: Enabling fully adaptive multicast routing for CMP interconnection networks

Pablo Abad; Valentin Puente; José-Ángel Gregorio

On-network hardware support for multi-destination traffic is a desirable feature in most multiprocessor machines. Multicast hardware capabilities enable much more effective bandwidth utilization as multi-destination packets do not need to repeatedly use the same resources, as occurs when multicast traffic must be decomposed in unicast packets. Although Chip Multiprocessors are not an exception in this interest, up to date, few fitting proposals exist. The combination of the scarcity of available resources and the common idea that multicast support requires a substantial amount of extra resources is responsible for this situation. In this work, we propose a new approach suitable for on-chip networks capable of managing multi-destination traffic via hardware in an efficient way with negligible complexity. We introduce the Multicast Rotary Router (MRR), a router able to: (1) perform on-network multicast support with almost zero cost over the Rotary Router, (2) use a fully adaptive tree to distribute multicast traffic, (3) perform on-network congestion control extending network utilization range. The performance results, using a state-of-the-art full system simulation framework, show that it improves average full system performance of a CMP using a unicast Rotary Router in its interconnection network by 25%, and an input buffered router with multicast support by 20%.

networks on chips | 2008

Reducing the Interconnection Network Cost of Chip Multiprocessors

Pablo Abad; Valentin Puente; José A. Gregorio

This paper introduces a cost-effective technique to deal with CMP coherence protocol requirements from the interconnection network point of view. A mechanism is presented to avoid the end-to-end deadlock that arises from the dependency chains created at the network interfaces between the different message types handled by coherence protocols. Our proposal is designed to guarantee a fraction of end-to- end bandwidth for the highest priority messages and makes it unnecessary to employ several virtual networks or complex mechanisms for dealing with the limited capacity of the endpoint buffers. The presented approach uses the Rotary Router as its starting point, extending the original mechanism for the routing- dependent deadlock to the message-dependent deadlock. We also propose a solution that guarantees point-to-point message ordering in this router, which is a common requirement in some coherence protocols. Results for synthetic and parallel applications show that the proposal improves the performance of previous solutions with a much lower hardware cost.

IEEE Transactions on Parallel and Distributed Systems | 2012

Balancing Performance and Cost in CMP Interconnection Networks

Pablo Abad; Valentin Puente; José A. Gregorio

This paper presents an innovative router design, called Rotary Router, which successfully addresses CMP cost/performance constraints. The router structure is based on two independent rings, which force packets to circulate either clockwise or counterclockwise, traveling through every port of the router. These two rings constitute a completely decentralized arbitration scheme that enables a simple, but efficient way to connect every input port to every output port. The proposed router is able to avoid network deadlock, livelock, and starvation without requiring data-path modifications. The organization of the router permits the inclusion of throughput enhancement techniques without significantly penalizing the implementation cost. In particular, the router performs adaptive routing, eliminates HOL blocking, and carries out implicit congestion control using simple arbitration and buffering strategies. Additionally, the proposal is capable of avoiding end-to-end deadlock at coherence protocol level with no physical or virtual resource replication, while guaranteeing in-order packet delivery. This facilitates router management and improves storage utilization. Using a comprehensive evaluation framework that includes full-system simulation and hardware description, the proposal is compared with two representative router counterparts. The results obtained demonstrate the Rotary Routers substantial performance and efficiency advantages.

computing frontiers | 2012

Improving coherence protocol reactiveness by trading bandwidth for latency

Lucia G. Menezo; Valentin Puente; Pablo Abad; José A. Gregorio

This paper describes how on-chip network particularities could be used to improve coherence protocol responsiveness. In order to achieve this, a new coherence protocol, named LOCKE, is proposed. LOCKE successfully exploits large on-chip bandwidth availability to improve cache-coherent chip multiprocessor performance and energy efficiency. Provided that the interconnection network is designed to support multicast traffic and the protocol maximizes the potential advantages that direct coherence brings, we demonstrate that a multicast-based coherence protocol could reduce energy requirements in the CMP memory hierarchy. The key idea presented is to establish a suitable level of on-chip network throughput to accelerate synchronization by two means: avoiding the protocol serialization, inherent to directory-based coherence protocol, and reducing average access time more than in other snoop-based coherence protocols, when shared data is truly contended. LOCKE is developed on top of a Token coherence performance substrate, with a new set of simple proactive policies that speeds up data synchronization and eliminates the passive token starvation avoidance mechanism. Using a full-system simulator that faithfully models on-chip interconnection, aggressive core architecture and precise memory hierarchy details, while running a broad spectrum of workloads, our proposal can improve both directory-based and token-based coherence protocols both in terms of energy and performance, at least in systems with up to 16 aggressive out-of-order processors in the chip.

IEEE Transactions on Parallel and Distributed Systems | 2012

Adaptive-Tree Multicast: Efficient Multidestination Support for CMP Communication Substrate

Pablo Abad; Valentin Puente; Lucia G. Menezo; J. Angel Gregorio

Multidestination communications are a highly necessary capability for many coherence protocols in order to minimize on-chip hit latency. Although CMPs share this necessity, up to now few suitable proposals have been developed. The combination of resource scarcity and the common idea that multicast support requires a substantial amount of extra resources is responsible for this situation. In this work, we propose a new approach suitable for on-chip networks capable of managing multidestination traffic via hardware in an efficient way with negligible complexity. We introduce a novel multicast routing mechanism, able to circumvent many of the limitations of conventional multicast schemes. Adaptive-tree multicasting is able to maintain correctness for multiflit multicast messages without routing restrictions, while also coupling correctness and performance in a natural way. Replication restrictions not only guarantee the presence of enough resources to avoid deadlock, but also dynamically adapt tree shape to network conditions, routing multicast messages through noncongested paths. The performance results, using a state-of-the-art full system simulation framework, show that it improves the average full system performance of a CMP by 20 percent and network ED2P by 15 percent, when compared to a state-of-the-art router with conventional multicast support and similar implementation cost.

IEEE Transactions on Parallel and Distributed Systems | 2016

AC-WAR: Architecting the Cache Hierarchy to Improve the Lifetime of a Non-Volatile Endurance-Limited Main Memory

Pablo Abad; Pablo Prieto; Valentin Puente; José-Ángel Gregorio

This work shows how by adapting replacement policies in contemporary cache hierarchies it is possible to extend the lifespan of a write endurance-limited main memory by almost one order of magnitude. The inception of this idea is that during cache residency 1) blocks are modified in a bimodal way: either most of the content of the block is modified or most of the content of the block never changes, and 2) in most applications, the majority of blocks are only slightly modified. When those facts are considered by the cache replacement algorithms, it is possible to significantly reduce the number of bit-flips per write-back to main memory. Our proposal favors the off-chip eviction of slightly modified blocks according to an adaptive replacement algorithm that operates coordinately in L2 and L3. This way it is possible to improve significantly system memory lifetime, with negligible performance degradation. We found that using a few bits per block to track changes in cache blocks with respect to the main memory content is enough. With a slightly modified sectored LRU and a simple cache performance predictor it is possible to achieve a simple implementation with minimal cost in area and no impact on cache access time. On average, our proposal increases the memory lifetime obtained with an LRU policy up to 10 times (10×) and 15 times (15×) when combined with other memory centric techniques. In both cases, the performance degradation could be considered negligible.

parallel computing | 2015

Improving last level shared cache performance through mobile insertion policies (MIP)

Pablo Abad; Pablo Prieto; Valentin Puente; José-Ángel Gregorio

We show the high variability in last-level cache access patterns.When applications interact, current dynamic policies have not enough flexibility.We present MIP, a replacement policy based on mobile insertion position.MIP rapidly adapts to sudden changes in access patterns during application runtime.Our solution provides 30% better hit-rate than LRU and 10% than DRRIP. For those cache hierarchy levels where program locality is not as evident as in L1, LRU replacement does not seem to be the optimal solution to determine which blocks will be requested soon. The literature is prolific on alternative reuse-distance estimations at last on-chip cache level, proving the difficulty of achieving an optimal hit rate. One of the key aspects for performance is knowing inter and intra application reuse-distance variability. Many solutions already do this, but most of them rely on a simple choice among a few alternative policies. The experiments performed to motivate the proposal confirm application variability, but also show that the behavior of applications is much more than bimodal. This means that there is a performance gap that current hybrid policies are not able to cover. In this paper we propose a mobile insertion position replacement policy (MIP), which combines well known LRU ordering and promotion policies with a completely adaptive insertion mechanism. The dynamic behavior of insertion is able to capture hit-rate variability in a more accurate way. Making use of set dueling and dynamic set sampling for prediction, our mechanism continuously estimates the insertion position that maximizes the cache hit rate. The hardware overhead compared to a LRU replacement algorithm is merely three 3-bit saturating counters per LLC bank. Our experiments show that for a wide range of applications, MIP is able to improve the hit rate of LRU by 30% on average. MIP outperforms current state-of-the-art replacement policies with a similar implementation cost by 10% on average and in single-thread or multi-thread workloads by 20%.

International Journal of Parallel Programming | 2018

M osaic : A Scalable Coherence Protocol

Lucia G. Menezo; Valentin Puente; Pablo Abad; José-Ángel Gregorio

The coherence protocol presented in this work, denoted Mosaic, introduces a new approach to face the challenges of complex multilevel cache hierarchies in future many-core systems. The essential aspect of the proposal is to eliminate the condition of inclusiveness through the different levels of the memory hierarchy while maintaining the complexity of the protocol limited. Cost reduction decisions taken to reduce this complexity may introduce artificial inefficiencies in the on-chip cache hierarchy, especially when the number of cores and private cache size is large. Our approach trades area and complexity for on-chip bandwidth, employing an integrated broadcast mechanism in a directory structure. In energy terms, the protocol scales like a conventional directory coherence protocol, but relaxes the shared information inclusiveness. This allows the performance implications of directory size and associativity reduction to be overcome. As it is even simpler than a conventional directory, the results of our evaluation show that the approach is quite insensitive, in terms of performance and energy expenditure, to the size and associativity of the directory.

Explore More