Blas Cuesta
Intel
Publications
Featured research published by Blas Cuesta.
International Symposium on Computer Architecture | 2011
Blas Cuesta; Alberto Ros; María Engracia Gómez; Antonio Robles; José Duato
To meet the demand for more powerful high-performance shared-memory servers, multiprocessor systems must incorporate efficient and scalable cache coherence protocols, such as those based on directory caches. However, the limited directory cache size of increasingly large systems may cause frequent evictions of directory entries and, consequently, invalidations of cached blocks, which severely degrades system performance. A significant percentage of the referenced memory blocks are accessed by only one processor (even in parallel applications) and therefore do not require coherence maintenance. Taking advantage of techniques that dynamically identify those private blocks, we propose to deactivate the coherence protocol for them and to treat them as uniprocessor systems do. This deactivation allows directory caches to omit the tracking of an appreciable fraction of blocks, which reduces their load and increases their effective size. Since the operating system collaborates in the detection of private blocks, our proposal requires only minor modifications. Simulation results show that, thanks to our proposal, directory caches can avoid tracking about 57% of the accessed blocks, so their capacity can be better exploited. This contributes either to shortening the runtime of parallel applications by 15% while keeping the directory cache size, or to maintaining system performance while using directory caches 8 times smaller.
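As a rough illustration of the idea, the following sketch shows how an OS could classify pages as private on first touch and re-enable coherence when a second sharer appears. Every name here (PageState, needs_coherence, and so on) is invented for illustration; the actual mechanism lives in the page-table walk and TLB miss handling.

    // Sketch of OS-assisted private-page detection (all names invented).
    #include <cstdint>
    #include <unordered_map>

    enum class PageState { Uncached, Private, Shared };

    struct PageInfo {
        PageState state = PageState::Uncached;
        int       owner = -1;   // first core that touched the page
    };

    std::unordered_map<uint64_t, PageInfo> page_table;  // keyed by page number

    // Called on a TLB miss / page-table walk for 'page' by 'core'.
    // Returns true when the access must be tracked by the directory.
    bool needs_coherence(uint64_t page, int core) {
        PageInfo& p = page_table[page];
        switch (p.state) {
        case PageState::Uncached:          // first touch: assume private
            p.state = PageState::Private;
            p.owner = core;
            return false;                  // directory skips this block
        case PageState::Private:
            if (core == p.owner) return false;
            p.state = PageState::Shared;   // second core: re-enable coherence
            // (a real system must flush the owner's cached copies here)
            return true;
        case PageState::Shared:
            return true;
        }
        return true;
    }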
IEEE Transactions on Computers | 2013
Blas Cuesta; Alberto Ros; María Engracia Gómez; Antonio Robles; José Duato
A key aspect in the design of efficient multiprocessor systems is the cache coherence protocol. Although directory-based protocols constitute the most scalable approach, the limited size of directory caches together with the growing size of systems may cause frequent evictions and, consequently, the invalidation of cached blocks, which jeopardizes system performance. Directory caches keep track of every memory block stored in processor caches in order to provide coherent access to the shared memory. However, a significant fraction of the cached memory blocks do not require coherence maintenance (even in parallel applications) because they are either accessed by just one processor or never modified. In this paper, we propose to deactivate the coherence protocol for those blocks that do not require coherence. With this deactivation, directory caches no longer have to keep track of noncoherent blocks, which reduces directory cache occupancy and increases their effectiveness. Since the detection of noncoherent blocks is carried out by the operating system, our proposal requires only minor hardware modifications. Simulation results show that, thanks to our proposal, directory caches can avoid tracking about 66 percent (on average) of the blocks accessed by a wide range of applications, thereby improving the efficiency of directory caches. This contributes either to shortening the runtime of parallel applications by 15 percent (on average) while keeping the directory cache size, or to maintaining performance while using directory caches 16 times smaller.
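A minimal sketch of the directory-side consequence, under assumed names and an assumed page geometry: blocks belonging to pages the OS has flagged as noncoherent never allocate a directory entry, so the same hardware capacity tracks fewer blocks.

    // Directory cache that skips noncoherent blocks (names invented).
    #include <cstdint>
    #include <unordered_map>
    #include <unordered_set>

    struct DirEntry { uint64_t sharers = 0; };          // bit-vector of caches

    std::unordered_set<uint64_t> noncoherent_pages;     // filled in by the OS
    std::unordered_map<uint64_t, DirEntry> directory;   // bounded in hardware

    void on_memory_access(uint64_t block, int cache_id) {
        uint64_t page = block >> 6;                     // assumed 64 blocks/page
        if (noncoherent_pages.count(page))
            return;                                     // no tracking needed
        directory[block].sharers |= (1ull << cache_id); // normal coherent path
    }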
International Conference on Parallel Processing | 2013
Alberto Ros; Blas Cuesta; María Engracia Gómez; Antonio Robles; José Duato
Most of the data referenced by sequential and parallel applications running on current chip multiprocessors are referenced by only one thread and can be considered private. Many recent proposals leverage this observation to improve several aspects of chip multiprocessors, such as reducing coherence overhead or the access latency to distributed caches. The effectiveness of those proposals depends to a large extent on the amount of detected private data. However, the mechanisms proposed so far do not consider thread migration or the private use of data within different application phases. As a result, a considerable amount of data is not detected as private. To make this detection more accurate and achieve more significant improvements, we propose a mechanism that accounts for both thread migration and data that is private within an application phase. Simulation results for 16-core systems show that, thanks to our mechanism, the average fraction of pages detected as private increases significantly, from 43% with previous proposals to 74% with ours. Finally, when our detection mechanism is used to deactivate coherence for private data in a directory protocol, it improves execution time by 13% with respect to previous proposals.
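The following sketch illustrates, with invented names and an assumed phase-boundary hook, how keying ownership on threads rather than cores tolerates migration, and how forgetting past sharing at phase boundaries lets per-phase private data be reclassified.

    // Migration-aware private-page detection (illustrative only).
    #include <cstdint>
    #include <unordered_map>

    struct PageOwner { int thread = -1; bool shared = false; };
    std::unordered_map<uint64_t, PageOwner> owners;

    bool is_private_access(uint64_t page, int thread_id) {
        PageOwner& o = owners[page];
        if (o.shared) return false;
        if (o.thread < 0) { o.thread = thread_id; return true; }  // first touch
        if (o.thread == thread_id) return true;  // same thread, even if migrated
        o.shared = true;                         // genuinely shared
        return false;
    }

    // Called at an application phase boundary (detection heuristic assumed):
    // forget past sharing so per-phase private data can be reclassified.
    void on_phase_boundary() { owners.clear(); }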
Parallel, Distributed and Network-Based Processing | 2007
Blas Cuesta; Antonio Robles; José Duato
Shared-memory multiprocessors are being built with an increasingly large number of nodes. In these systems, implementing cache coherence is a key issue. Token coherence is a low-latency cache coherence protocol that avoids indirection for cache-to-cache misses and does not require a totally ordered interconnect. When races are rare, the protocol performs well thanks to its performance policy. Unfortunately, in medium- and large-scale systems, and with applications that often access the same data simultaneously, races become more common. As a result, the protocol does not perform as well as it could, because it relies on the persistent request mechanism to prevent starvation. This mechanism is slow and inflexible because it overrides the performance policy; consequently, the protocol slows down the system and fails to exploit the flexibility and speed of the common case. We propose a new mechanism, priority requests, which replaces persistent requests. Our mechanism resolves races while still respecting the performance policy, simply by ordering requests that suffer from starvation and giving them higher priority. Thus, it handles tokens more efficiently and reduces network traffic.
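A hedged sketch of the priority-request idea follows; the types and the ordering rule are illustrative rather than the paper's actual encoding.

    // Starving requests are reissued with a priority; token holders serve
    // the highest-priority starving requester first instead of falling
    // back to persistent requests.
    #include <cstdint>
    #include <queue>
    #include <vector>

    struct PriorityRequest {
        uint64_t block;
        int      requester;
        uint32_t priority;   // e.g. grows with how long the miss has starved
        bool operator<(const PriorityRequest& o) const {
            return priority < o.priority;   // max-heap: highest priority on top
        }
    };

    // Simplification: a real design would keep one ordered queue per block.
    std::priority_queue<PriorityRequest> starving;

    // A node holding tokens decides whom to send them to.
    int next_token_target() {
        if (starving.empty()) return -1;   // no starvation: performance policy
        int target = starving.top().requester;
        starving.pop();                    // serve the most starved request
        return target;
    }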
IEEE International Conference on High Performance Computing, Data, and Analytics | 2010
Alberto Ros; Blas Cuesta; Ricardo Fernández-Pascual; María Engracia Gómez; Manuel E. Acacio; Antonio Robles; José M. García; José Duato
The demand for larger and more powerful high-performance shared-memory servers has been growing over the last few years. To meet this need, AMD has recently launched the twelve-core Magny-Cours processors. They include a directory cache (Probe Filter) that increases the scalability of the coherence protocol used by Opterons, which is based on the coherent HyperTransport interconnect (cHT). cHT limits the number of addressable nodes to 8. The recent High Node Count HT specification overcomes this limitation. However, the 3-bit pointer used by the Probe Filter prevents Magny-Cours-based servers from being built beyond 8 nodes. In this paper, we propose and develop external logic to extend the coherence domain of Magny-Cours processors beyond the 8-node limit while maintaining the advantages provided by the Probe Filter. Evaluation results for up to a 32-node system show that the performance offered by our solution scales with the number of nodes, enhancing the Probe Filter's effectiveness by filtering additional messages. In particular, we reduce runtime by 47% in a 32-die system with respect to the 8-die Magny-Cours system.
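The pointer-extension idea can be sketched as follows, with assumed names and encodings: the external device occupies one of the eight local node IDs, and a block whose Probe Filter pointer names the device is resolved through the device's own extended table.

    // Extending a 3-bit owner pointer beyond 8 dies (encodings assumed).
    #include <cstdint>
    #include <unordered_map>

    constexpr int EXT_DEVICE_ID = 7;                 // local ID the bridge uses

    std::unordered_map<uint64_t, int> extended_dir;  // block -> real owner (0..31)

    // Inside the 8-die domain the Probe Filter stores 3 bits as usual;
    // resolving an owner consults the bridge when the pointer names it.
    int resolve_owner(uint64_t block, int probe_filter_ptr) {
        if (probe_filter_ptr != EXT_DEVICE_ID)
            return probe_filter_ptr;                 // owner is a local die
        auto it = extended_dir.find(block);
        return it != extended_dir.end() ? it->second : -1;  // remote owner
    }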
International Symposium on Parallel and Distributed Processing and Applications | 2012
Alberto Ros; Blas Cuesta; María Engracia Gómez; Antonio Robles; José Duato
There is a growing trend towards building large-scale cache-coherent systems out of commodity symmetric multiprocessors, which requires extending their coherence protocol. In such systems, cache coherence transactions issued on cache misses traverse interconnection networks with very different topologies and latencies. In this work, we perform a cache miss characterization aimed at analyzing the benefits that can be expected from a specialized coherence controller able to resolve cache misses locally, thus saving traffic across long-latency links. Results show that there is high potential for reducing miss latency in these systems, and that this potential grows as the number of nodes in the system increases. In particular, in a system with just two boards, 40% of the cache misses do not need expensive inter-board communication. This percentage increases to up to 67.5% for an 8-board system.
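As a toy version of the characterization (the board size and the sharer encoding are assumptions of the sketch), a miss counts as locally resolvable when some die on the requester's board already holds a copy:

    // Classify a miss as intra-board vs. inter-board (layout assumed).
    #include <cstdint>

    constexpr int DIES_PER_BOARD = 4;  // assumed system layout

    bool resolvable_on_board(int requester_die, uint64_t sharer_mask) {
        int board = requester_die / DIES_PER_BOARD;
        uint64_t board_mask = ((1ull << DIES_PER_BOARD) - 1)
                              << (board * DIES_PER_BOARD);
        // Some die on the same board holds a copy: no inter-board traffic.
        return (sharer_mask & board_mask) != 0;
    }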
Parallel, Distributed and Network-Based Processing | 2008
Blas Cuesta; Antonio Robles; José Duato
Token coherence is a cache coherence protocol that combines the main advantages of traditional protocols. Unlike them, however, token coherence does not handle messages in order, which may lead to races that prevent some cache misses from being resolved. To ensure their completion, an inefficient mechanism named persistent requests is used. We recently proposed the priority request mechanism to handle races efficiently. As acknowledgements are not required, a single node can resolve several misses for the same memory block at the same time. When resolving many misses, however, that node may become a bottleneck. To avoid this, in this work we propose the multicast coherence message, which allows several misses to be resolved simultaneously with a single response message. It reduces network traffic and average response latency, significantly improving overall performance.
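A sketch of the message format (field names invented): a single response carries the data, the tokens, and a destination set, so one message answers every pending requester at once.

    // One multicast response instead of N unicast responses.
    #include <cstdint>
    #include <vector>

    struct MulticastResponse {
        uint64_t block;
        uint64_t dest_mask;   // one bit per requesting node
        int      tokens;      // tokens handed out with the data
    };

    // Instead of emitting one unicast response per waiting requester...
    MulticastResponse build_response(uint64_t block,
                                     const std::vector<int>& requesters) {
        MulticastResponse r{block, 0, 0};
        for (int node : requesters)
            r.dest_mask |= (1ull << node);  // ...fold them into one message
        return r;
    }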
Parallel and Distributed Computing: Applications and Technologies | 2008
Blas Cuesta; Antonio Robles; José Duato
Traditional cache coherence protocols provide either low-latency cache misses (snooping protocols) or bandwidth efficiency (directory protocols). To capture the best attributes of both at once, Token Coherence has recently been proposed. This protocol can quickly resolve cache misses by means of transient requests. However, since transient requests are unordered messages, they may sometimes fail to resolve cache misses, mainly due to protocol races. Thus, when transient requests cannot complete a cache miss, Token Coherence uses a starvation prevention mechanism to ensure its completion. Although several implementations of starvation prevention mechanisms have been proposed, all of them are broadcast-based, which is a major detriment to Token Coherence's scalability. To tackle this problem, in this work we apply a switch-based packing technique that alleviates the cost of broadcast messages and improves the protocol's scalability.
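The packing technique can be sketched roughly as follows; the envelope format is an assumption, and the real mechanism is implemented in the switch datapath rather than in software.

    // Several short coherence messages queued at the same output port are
    // wrapped into one envelope, amortizing header and arbitration cost.
    #include <cstdint>
    #include <vector>

    struct CoherenceMsg { uint64_t block; int src, dst; };

    struct PackedEnvelope {
        std::vector<CoherenceMsg> payload;  // unpacked again at the next hop
    };

    PackedEnvelope pack_output_queue(std::vector<CoherenceMsg>& queue,
                                     size_t max_batch) {
        PackedEnvelope env;
        while (!queue.empty() && env.payload.size() < max_batch) {
            env.payload.push_back(queue.back());  // drain messages bound for
            queue.pop_back();                     // the same output link
        }
        return env;
    }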
IEEE Transactions on Computers | 2012
Alberto Ros; Blas Cuesta; Ricardo Fernández-Pascual; María Engracia Gómez; Manuel E. Acacio; Antonio Robles; José M. García; José Duato
One cost-effective way to meet the increasing demand for larger high-performance shared-memory servers is to build clusters of off-the-shelf processors connected by low-latency point-to-point interconnects like HyperTransport. Unfortunately, HyperTransport's addressing limitations prevent building systems with more than eight nodes. While the recent High Node Count HyperTransport specification overcomes this limitation, the recently launched twelve-core Magny-Cours processors have already inherited it: they provide only 3 bits to encode the pointers used by the directory cache that they include to increase the scalability of their coherence protocol. In this work, we propose and develop an external device to extend the coherence domain of Magny-Cours processors beyond the 8-node limit while maintaining the advantages provided by the directory cache. Evaluation results for systems with up to 32 nodes show that the performance offered by our solution scales with the number of nodes, enhancing the directory cache's effectiveness by filtering additional messages. In particular, we reduce execution time by 47 percent in a 32-die system with respect to the 8-die Magny-Cours configuration.
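Complementing the pointer sketch given for the conference version above, the device's filtering role can be sketched like this (names assumed): a probe crosses the long inter-domain links only when the extended directory records a remote sharer.

    // The external device filters probes to the remote coherence domain.
    #include <cstdint>
    #include <unordered_map>

    std::unordered_map<uint64_t, uint32_t> remote_sharers;  // block -> die mask

    bool must_probe_remote_domain(uint64_t block) {
        auto it = remote_sharers.find(block);
        return it != remote_sharers.end() && it->second != 0;
        // No entry or empty mask: the probe is filtered, saving a round trip.
    }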
Journal of Parallel and Distributed Computing | 2012
Blas Cuesta; Antonio Robles; José Duato
Token Coherence is a cache coherence protocol able to capture the best attributes of traditional protocols at once: low latency and scalability. However, it may lose these desirable features when (1) several nodes contend for the same memory block and (2) nodes write highly shared blocks. The first situation leads to the issuing of simultaneous broadcast requests, which threatens the protocol's scalability. The second results in a burst of token responses directed at the writer, which turns it into a bottleneck and increases latency. To address these problems, we propose a switch-based packing technique able to encapsulate several messages, while in transit, into just one. Applied to simultaneous broadcasts, it significantly reduces their bandwidth requirements (by up to 45%). Applied to token responses, it lowers their transmission latency (by 70%). Thus, the packing technique decreases both latency and coherence traffic, thereby improving system performance (about a 15% reduction in runtime).
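The application to token responses can be sketched as a merge rule (inferred loosely from the abstract, with invented types): responses for the same block converging on the same writer combine in transit, summing their token counts, so the writer receives one message instead of a burst.

    // Merging two in-transit token responses into one (rule assumed).
    #include <cstdint>
    #include <optional>

    struct TokenResponse { uint64_t block; int dst; int tokens; bool owner; };

    std::optional<TokenResponse> try_merge(const TokenResponse& a,
                                           const TokenResponse& b) {
        if (a.block != b.block || a.dst != b.dst)
            return std::nullopt;          // only same-block, same-target merge
        TokenResponse m = a;
        m.tokens += b.tokens;             // token counts add up
        m.owner  = a.owner || b.owner;    // the owner token travels with one
        return m;
    }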