Publications


Featured research published by Ricardo Fernández-Pascual.


High-Performance Computer Architecture | 2007

A Low Overhead Fault Tolerant Coherence Protocol for CMP Architectures

Ricardo Fernández-Pascual; José M. García; Manuel E. Acacio; José Duato

It is widely accepted that transient failures will appear more frequently in future chips due to several factors, such as the increased integration scale. At the same time, chip multiprocessors (CMPs), which integrate several processor cores in a single chip, are nowadays the best alternative for making efficient use of the increasing number of transistors that can be placed on a single die. Hence, it is necessary to design new techniques to deal with these faults in order to build sufficiently reliable CMPs. In this work, we present a coherence protocol aimed at dealing with transient failures that affect the interconnection network of a CMP, thus assuming that the network is no longer reliable. In particular, our proposal extends a token-based cache coherence protocol so that no data can be lost and no deadlock can occur due to any dropped message. Using the GEMS full-system simulator, we compare our proposal against a similar protocol without fault tolerance (TokenCMP). We show that in the absence of failures our proposal introduces no overhead in terms of increased execution time over TokenCMP. Additionally, our protocol can tolerate message loss rates much higher than those likely to be found in the real world without increasing execution time by more than 15%.
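
To make the mechanism concrete, here is a minimal C sketch of the token-counting invariant such protocols build on, plus a timeout check that could trigger recovery when token-carrying messages are dropped. All names, constants, and thresholds are illustrative assumptions, not the authors' implementation.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-block state for a token-coherence protocol with one
 * token per core: a cache may read with at least one token and write
 * only when holding all tokens, so a dropped token-carrying message
 * can block progress but never create incoherence. */
#define NUM_TOKENS 16          /* assumption: 16 cores, one token each */

typedef struct {
    uint32_t tokens;           /* tokens currently held by this cache  */
    bool     owner;            /* holds the owner token (valid data)?  */
    uint64_t wait_start;       /* cycle when we began waiting          */
} block_state_t;

static bool can_read(const block_state_t *b)  { return b->tokens >= 1; }
static bool can_write(const block_state_t *b) { return b->tokens == NUM_TOKENS; }

/* Timeout-driven fault detection: if a request has waited longer than
 * FAULT_TIMEOUT cycles, assume the unreliable network dropped a
 * message and start a serialized token-recreation process that
 * restores the lost tokens without losing data. */
#define FAULT_TIMEOUT 100000   /* assumption: well above worst-case latency */

static bool fault_suspected(const block_state_t *b, uint64_t now)
{
    return (now - b->wait_start) > FAULT_TIMEOUT;
}
```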


IEEE International Conference on High Performance Computing, Data, and Analytics | 2010

EMC²: Extending Magny-Cours coherence for large-scale servers

Alberto Ros; Blas Cuesta; Ricardo Fernández-Pascual; María Engracia Gómez; Manuel E. Acacio; Antonio Robles; José M. García; José Duato

The demand for larger and more powerful high-performance shared-memory servers has been growing over the last few years. To meet this need, AMD has recently launched the twelve-core Magny-Cours processors. They include a directory cache (Probe Filter) that increases the scalability of the coherence protocol used by Opteron processors, which is based on the coherent HyperTransport interconnect (cHT). cHT limits the number of addressable nodes to eight; the recent High Node Count HyperTransport specification overcomes this limitation. However, the 3-bit pointer used by the Probe Filter still prevents Magny-Cours-based servers from being built beyond 8 nodes. In this paper, we propose and develop external logic to extend the coherence domain of Magny-Cours processors beyond the 8-node limit while maintaining the advantages provided by the Probe Filter. Evaluation results for systems of up to 32 nodes show that the performance offered by our solution scales with the number of nodes, enhancing the Probe Filter's effectiveness by filtering additional messages. In particular, we reduce runtime by 47% in a 32-die system with respect to the 8-die Magny-Cours system.
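
The pointer limitation is easy to see in code. Below is a sketch of a 3-bit owner pointer and one possible indirection that external logic could add; the reserved encoding and all field names are assumptions for illustration, not AMD's or the paper's exact design.

```c
#include <stdint.h>

/* A 3-bit owner pointer can name at most 2^3 = 8 nodes. One way to go
 * beyond that (assumed here for illustration): reserve one encoding to
 * mean "owned outside this 8-die domain" and let external logic track
 * the real global owner. */

#define PTR_BITS   3
#define MAX_LOCAL  (1u << PTR_BITS)   /* 8 addressable dies per domain */
#define PTR_REMOTE (MAX_LOCAL - 1)    /* assumed reserved encoding     */

typedef struct {
    unsigned owner_ptr : PTR_BITS;    /* local owner, or PTR_REMOTE    */
    uint16_t global_owner;            /* kept by the external logic    */
} pf_entry_t;

/* Resolve the node a probe must be forwarded to. */
static uint16_t resolve_owner(const pf_entry_t *e, uint16_t domain_base)
{
    if (e->owner_ptr != PTR_REMOTE)
        return (uint16_t)(domain_base + e->owner_ptr); /* fast local path */
    return e->global_owner;  /* resolved by the external coherence logic */
}
```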


High Performance Embedded Architectures and Compilers | 2012

DAPSCO: Distance-aware partially shared cache organization

Antonio García-Guirado; Ricardo Fernández-Pascual; Alberto Ros; José M. García

Many-core tiled CMP proposals often assume a partially shared last-level cache (LLC), since this provides a good compromise between access latency and cache utilization. In this paper, we propose a novel way to map memory addresses to LLC banks that takes into account the average distance between the banks and the tiles that access them. Contrary to traditional approaches, our mapping does not group the tiles into clusters within which all the cores access the same bank for the same addresses. Instead, two neighboring cores access different sets of banks, minimizing the average distance travelled by cache requests. Results for a 64-core CMP show that our proposal improves both execution time and the energy consumed by the network by 13% when compared to a traditional mapping. Moreover, our proposal comes at a negligible cost in terms of hardware, and its benefits in both energy and execution time increase with the number of cores.
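
As an illustration of the idea, the sketch below shows a per-tile bank-mapping table of the kind a distance-aware organization could use; table sizes, contents, and names are assumptions for the example, not the paper's actual configuration.

```c
#include <stdint.h>

/* Hypothetical per-tile mapping for a 64-tile CMP where each address
 * may live in one of SHARERS candidate banks. A clustered mapping
 * would give every tile in a cluster the same table; a distance-aware
 * mapping instead fills each tile's table (offline) with banks close
 * to that tile, minimizing the average mesh distance of LLC accesses. */

#define TILES   64
#define SHARERS 8      /* assumption: degree of sharing of the LLC */

static uint8_t bank_of[TILES][SHARERS];  /* filled by an offline solver */

static uint8_t llc_bank(uint8_t tile, uint64_t paddr)
{
    /* A few block-address bits select one of the SHARERS candidate
     * slices (64 B blocks assumed); the per-tile table then names the
     * concrete bank this tile should use for that slice. */
    uint8_t slice = (uint8_t)((paddr >> 6) & (SHARERS - 1));
    return bank_of[tile][slice];
}
```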


Concurrency and Computation: Practice and Experience | 2014

Managing resources dynamically in hybrid photonic-electronic networks-on-chip

Antonio García-Guirado; Ricardo Fernández-Pascual; José M. García; Sandro Bartolini

Nanophotonics promises to solve the scalability problems of current electrical interconnects thanks to its low sensitivity to distance in terms of latency and energy consumption. Until this technology reaches maturity, hybrid photonic-electronic networks are a viable alternative. Ideally, ordinary electrical meshes and ring-based photonic networks should cooperate to minimize overall latency and energy consumption, but we currently lack mechanisms to do this efficiently. In this paper, we present novel fine-grain policies to manage the photonic resources in a tiled chip multiprocessor (CMP) scenario. Our policies are dynamic and make per-message decisions based on parameters such as message size, ring availability, and distance between endpoints. The resulting network behavior is also fairer to all cores, reducing processor idle time thanks to faster thread synchronization. All these policies improve performance when compared to the same CMP without the photonic ring, and the most elaborate ones reduce overall network latency by 50%, execution time by 36%, and network energy consumption by 52% on average in a 16-core CMP running the PARSEC benchmark suite. Larger hybrid networks with 64 endpoints for 256-core CMPs, based on the Corona and Firefly designs, also show far superior throughput and lower latency when managed by one of the proposed policies.
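
A minimal sketch of a per-message routing decision for such a hybrid NoC follows. The thresholds and the decision order are assumptions; the paper's policies likewise weigh message size, ring availability, and endpoint distance, but dynamically and in more elaborate ways.

```c
#include <stdbool.h>
#include <stdint.h>

#define MIN_HOPS_FOR_RING 4   /* assumed distance threshold           */
#define MAX_RING_FLITS    8   /* assumed size cap to keep ring fair   */

/* Decide, per message, whether to take the photonic ring or fall back
 * to the electrical mesh. */
static bool use_photonic_ring(uint32_t mesh_hops, uint32_t msg_flits,
                              bool ring_slot_free)
{
    if (!ring_slot_free)
        return false;  /* never stall: the electrical mesh always works */
    if (mesh_hops < MIN_HOPS_FOR_RING)
        return false;  /* short trips are cheap on the mesh             */
    return msg_flits <= MAX_RING_FLITS;  /* reserve the ring for small,
                                            long-distance messages      */
}
```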


Dependable Systems and Networks | 2008

A fault-tolerant directory-based cache coherence protocol for CMP architectures

Ricardo Fernández-Pascual; José M. García; Manuel E. Acacio; José Duato

Current technology trends of increased integration scale are pushing CMOS technology into the deep-submicron domain, enabling the creation of chips with a significantly greater number of transistors but also more prone to transient failures. Hence, computer architects will have to consider reliability as a prime concern for future chip multiprocessor (CMP) designs. Since the interconnection network of future CMPs will use a significant portion of the chip real estate, it will be especially affected by transient failures. We propose to deal with this kind of failure at the level of the cache coherence protocol instead of ensuring the reliability of the network itself. In particular, we have extended a directory-based cache coherence protocol to ensure correct program semantics even in the presence of transient failures in the interconnection network. Additionally, we show that our proposal has virtually no impact on execution time with respect to a non-fault-tolerant protocol and entails only modest hardware and network traffic overhead.
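
Below is a sketch of the kind of bookkeeping a fault-tolerant coherence protocol needs once the network may drop messages; field names and the timeout value are illustrative assumptions, not the paper's exact mechanism.

```c
#include <stdbool.h>
#include <stdint.h>

/* Critical coherence messages carry a sequence number and stay
 * buffered at the sender until acknowledged, so a loss is detected by
 * timeout and the message is reissued rather than trusting the
 * network to deliver it. */
typedef struct {
    uint64_t seq;          /* per-destination sequence number        */
    uint64_t sent_cycle;   /* cycle the message entered the network  */
    bool     acked;        /* entry is freed once acknowledged       */
} pending_msg_t;

#define RETX_TIMEOUT 50000 /* assumption: far above worst-case RTT */

static bool needs_retransmit(const pending_msg_t *m, uint64_t now)
{
    return !m->acked && (now - m->sent_cycle) > RETX_TIMEOUT;
}
```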


IEEE Transactions on Computers | 2015

ICCI: In-Cache Coherence Information

Antonio García-Guirado; Ricardo Fernández-Pascual; José M. García

In this paper, we introduce ICCI, a new cache organization that leverages shared cache resources and flat coherence protocols to provide inexpensive hardware cache coherence for large core counts (e.g., 512), achieving execution times close to a non-scalable sparse directory while noticeably reducing the energy consumption of the memory system. Very simple changes to the system with respect to traditional bit-vector directories are enough to implement ICCI. Moreover, ICCI introduces no storage overhead with respect to a broadcast-based protocol, yet it provides large storage space for coherence information. ICCI makes smarter use of cache resources by dynamically allowing last-level cache entries to store either blocks or sharing codes; this way, just the minimum number of directory entries required at runtime is allocated. In addition, ICCI suffers a negligible amount of directory-induced invalidations. Results for a 512-core CMP show that ICCI reduces the energy consumption of the memory system by up to 48 percent compared to a tag-embedded directory, up to 15 percent compared to a sparse directory, and up to 8 percent compared to the state-of-the-art Scalable Coherence Directory, which ICCI also outperforms in execution time. Furthermore, ICCI can be combined with elaborate sharing codes to apply it to extremely large core counts. We also show analytically that ICCI's dynamic allocation of entries makes it a suitable candidate for storing coherence information efficiently at very large core counts (e.g., over 200K cores), based on the observation that data sharing makes fewer directory entries necessary per core as the core count increases.
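
The central idea lends itself to a tagged-union sketch: an LLC entry holds either a data block or, reusing the same storage, a sharing code. The layout below is a simplified illustration with assumed field names, not the paper's hardware format.

```c
#include <stdint.h>

/* An LLC entry can store the block itself or a bit-vector sharing
 * code, so directory information only consumes an entry while a block
 * actually has L1 sharers. A 64 B data field fits a 512-bit vector,
 * i.e., one bit per core at 512 cores. */
typedef enum { ENTRY_DATA, ENTRY_SHARERS } entry_kind_t;

typedef struct {
    uint64_t     tag;
    entry_kind_t kind;           /* chosen dynamically at runtime */
    union {
        uint8_t  data[64];       /* the cached block              */
        uint64_t sharers[8];     /* 512-bit sharing vector        */
    } u;
} llc_entry_t;

/* Mark core 'c' as a sharer in an entry currently used as directory. */
static void add_sharer(llc_entry_t *e, uint16_t c)
{
    e->kind = ENTRY_SHARERS;
    e->u.sharers[c >> 6] |= 1ull << (c & 63);
}
```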


IEEE International Conference on High Performance Computing, Data, and Analytics | 2008

Fault-tolerant cache coherence protocols for CMPs: evaluation and trade-offs

Ricardo Fernández-Pascual; José M. García; Manuel E. Acacio; José Duato

One way of dealing with the transient faults that will affect the interconnection network of future large-scale chip multiprocessor (CMP) systems is to extend the cache coherence protocol. Fault tolerance at the level of the cache coherence protocol has been proven to achieve very low performance overhead in the absence of faults while being able to support very high fault rates. In this work, we compare two previously proposed fault-tolerant cache coherence protocols in a common framework and present a new one based on the cache coherence protocol used in AMD Opteron processors. We also thoroughly evaluate the performance of the three protocols, show how to adjust their fault tolerance parameters to achieve a desired level of fault tolerance, and measure the overhead incurred to support very high transient fault rates.


IEEE Transactions on Parallel and Distributed Systems | 2008

Extending the TokenCMP Cache Coherence Protocol for Low Overhead Fault Tolerance in CMP Architectures

Ricardo Fernández-Pascual; José M. García; Manuel E. Acacio; José Duato

It is widely accepted that transient failures will appear more frequently in future chips due to several factors, such as the increased integration scale. At the same time, chip multiprocessors (CMPs), which integrate several processor cores in a single chip, are nowadays the best alternative for making efficient use of the increasing number of transistors that can be placed on a single die. Hence, it is necessary to design new techniques to deal with these faults in order to build sufficiently reliable CMPs. In this work, we present a coherence protocol aimed at dealing with transient failures that affect the interconnection network of a CMP, thus assuming that the network is no longer reliable. In particular, our proposal extends a token-based cache coherence protocol so that no data can be lost and no deadlock can occur due to any dropped message. Using the GEMS full-system simulator, we compare our proposal against TokenCMP. We show that in the absence of failures our proposal introduces no overhead in terms of increased execution time over TokenCMP. Additionally, our protocol can tolerate message loss rates much higher than those likely to be found in the real world without increasing execution time by more than 15 percent.


International Conference on e-Science | 2015

Early Experiences with Separate Caches for Private and Shared Data

Juan M. Cebrián; Alberto Ros; Ricardo Fernández-Pascual; Manuel E. Acacio

Shared-memory architectures have become predominant in modern multi-core microprocessors in all market segments, from embedded to high-performance computing. Correctness of these architectures is ensured by means of coherence protocols and consistency models. The performance and scalability of shared-memory systems are usually limited by the amount and size of the messages used to keep the memory subsystem coherent. Moreover, we believe that blindly keeping coherence for all memory accesses can be counterproductive, since it incurs unnecessary overhead for data that will remain coherent after the access. With this in mind, in this paper we propose the use of dedicated caches for private (plus shared read-only) and shared data. The private cache (L1P) is independent for each core, while the shared cache (L1S) is logically shared but physically distributed across all cores. This separation should allow us to simplify the coherence protocol, reduce the on-chip area requirements, and reduce invalidation time with minimal impact on performance. The dedicated cache design requires a classification mechanism to detect private and shared data; in our evaluation we use a mechanism that operates at the operating system (OS) level, that is, at page granularity. Results show two drawbacks to this approach: first, the selected classification mechanism has too many false positives, which becomes an important limiting factor; second, a traditional interconnection network is not optimal for accessing the L1S, and a custom network design is needed. These drawbacks lead to significant performance degradation due to the additional latency when accessing shared data.
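
A minimal sketch of an OS-level, page-granularity classifier of the kind the evaluation relies on is shown below; the fields are assumed for illustration (a real system would keep them in page-table entries). It also makes visible where the false positives come from: classification is sticky and per page, not per cache line.

```c
#include <stdbool.h>
#include <stdint.h>

/* A page starts private to its first accessor; the first access from
 * any other core reclassifies it as shared permanently. One shared
 * line drags every other line on the page along with it, which is the
 * page-granularity false-positive problem. */
typedef struct {
    bool     touched;     /* page has been accessed at least once  */
    bool     shared;      /* sticky: once shared, always shared    */
    uint16_t first_core;  /* core that first touched the page      */
} page_class_t;

/* Returns true if the access should be routed to the shared L1S. */
static bool classify_access(page_class_t *p, uint16_t core)
{
    if (!p->touched) {
        p->touched = true;
        p->first_core = core;       /* page becomes private to 'core'  */
    } else if (core != p->first_core) {
        p->shared = true;           /* second accessor: flip to shared */
    }
    return p->shared;
}
```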


European Conference on Parallel Processing | 2014

Characterization of a List-Based Directory Cache Coherence Protocol for Manycore CMPs

Ricardo Fernández-Pascual; Alberto Ros; Manuel E. Acacio

The development of efficient and scalable cache coherence protocols is a key aspect of the design of many-core chip multiprocessors. In this work, we revisit a class of cache coherence protocols that, despite having been implemented in the 1990s to build large-scale commodity multiprocessors, have not been seriously considered in the current context of chip multiprocessors. In particular, we evaluate a directory-based cache coherence protocol that employs distributed singly-linked lists to encode the information about the sharers of each memory block. We compare this organization with two protocols that use centralized sharing codes with different directory memory overheads: one implementing a non-scalable bit-vector sharing code and the other implementing a more scalable limited-pointer scheme with a single pointer. Simulation results show that for large-scale chip multiprocessors, the protocol based on distributed linked lists performs worse than the centralized approaches. This is due principally to increased contention at the directory controller, which remains blocked for longer while updating the distributed sharing information.
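
The serialization that the evaluation identifies is easy to see in a sketch of a distributed singly-linked sharer list; the layout below is assumed for illustration, not the protocol's exact encoding.

```c
#include <stdint.h>

/* The directory stores only the list head; each L1 stores the next
 * sharer for the block. Storage scales well with the number of
 * caches, but an invalidation must walk the list one sharer at a
 * time while the directory entry stays blocked. */
#define NO_SHARER 0xFFFFu

typedef struct { uint16_t head; } dir_entry_t;  /* at the directory */
typedef struct { uint16_t next; } l1_link_t;    /* one per L1 cache */

/* Count the serialized hops an invalidation needs: one per sharer. */
static unsigned invalidate_walk(const dir_entry_t *d,
                                const l1_link_t l1[], unsigned ncores)
{
    unsigned hops = 0;
    uint16_t cur = d->head;
    while (cur != NO_SHARER && hops <= ncores) { /* guard vs. corruption */
        hops++;                  /* send INV to 'cur' and await its ack */
        cur = l1[cur].next;
    }
    return hops;
}
```

A bit-vector directory, by contrast, can issue all invalidations in parallel from a single entry, which is why the centralized approaches win at scale despite their storage overhead.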

Collaboration


Dive into Ricardo Fernández-Pascual's collaboration.

Top Co-Authors

José Duato

Polytechnic University of Valencia

Antonio Robles

Polytechnic University of Valencia
