José Duato

Polytechnic University of Valencia

Publication

Featured research published by José Duato.

IEEE Transactions on Parallel and Distributed Systems | 1993

A new theory of deadlock-free adaptive routing in wormhole networks

José Duato

The theoretical background for the design of deadlock-free adaptive routing algorithms for wormhole networks is developed. The author proposes some basic definitions and two theorems. These create the conditions to verify that an adaptive algorithm is deadlock-free, even when there are cycles in the channel dependency graph. Two design methodologies are also proposed. The first supplies algorithms with a high degree of freedom, without increasing the number of physical channels. The second methodology is intended for the design of fault-tolerant algorithms. Some examples are given to show the application of the methodologies. Simulations show the performance improvement that can be achieved by designing the routing algorithms with the new theory.

IEEE Transactions on Parallel and Distributed Systems | 1995

A necessary and sufficient condition for deadlock-free adaptive routing in wormhole networks

José Duato

Deadlock avoidance is a key issue in wormhole networks. A first approach by W.J. Dally and C.L. Seitz (1987) consists of removing the cyclic dependencies between channels. Many deterministic and adaptive routing algorithms have been proposed based on that approach. Although the absence of cyclic dependencies is a necessary and sufficient condition for deadlock-free deterministic routing, it is only a sufficient condition for deadlock-free adaptive routing. A more powerful approach by J. Duato (1991) only requires the absence of cyclic dependencies on a connected channel subset. The remaining channels can be used in almost any way. In this paper, we show that the previously mentioned approach is also a sufficient condition. Moreover, we propose a necessary and sufficient condition for deadlock-free adaptive routing. This condition is the key for the design of fully adaptive routing algorithms with minimum restrictions. An example shows the application of the new theory.
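
The Dally and Seitz condition mentioned above can be illustrated with a minimal sketch that tests a channel dependency graph for cycles. The function name `has_cycle`, the channel labels, and the edge-list representation are assumptions made for illustration; they do not come from the paper. The sketch covers only the plain acyclicity test. A check along the lines of Duato's condition would be run on an extended dependency graph of the chosen channel subset, which this sketch does not construct.

```python
from collections import defaultdict


def has_cycle(dependencies):
    """Return True if the directed channel dependency graph contains a cycle.

    `dependencies` is an iterable of (channel_a, channel_b) pairs meaning
    that a message holding channel_a may request channel_b next.
    """
    graph = defaultdict(list)
    nodes = set()
    for src, dst in dependencies:
        graph[src].append(dst)
        nodes.update((src, dst))

    WHITE, GREY, BLACK = 0, 1, 2  # unvisited, on the current path, finished
    colour = {n: WHITE for n in nodes}

    for start in nodes:
        if colour[start] != WHITE:
            continue
        colour[start] = GREY
        stack = [(start, iter(graph[start]))]
        while stack:
            node, successors = stack[-1]
            for nxt in successors:
                if colour[nxt] == GREY:
                    return True  # back edge: a cyclic dependency exists
                if colour[nxt] == WHITE:
                    colour[nxt] = GREY
                    stack.append((nxt, iter(graph[nxt])))
                    break
            else:
                colour[node] = BLACK
                stack.pop()
    return False


# Hypothetical 4-node unidirectional ring with one channel per link.
ring_deps = [("c0", "c1"), ("c1", "c2"), ("c2", "c3"), ("c3", "c0")]
# The same ring with the wrap-around dependency removed.
acyclic_deps = ring_deps[:-1]

print(has_cycle(ring_deps))
print(has_cycle(acyclic_deps))
```

In this toy ring, removing the single wrap-around dependency is enough to make the graph acyclic; in a real network the dependencies follow from the routing function rather than being edited by hand.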

International Conference on High Performance Computing and Simulation | 2010

rCUDA: Reducing the number of GPU-based accelerators in high performance clusters

José Duato; Antonio J. Peña; Federico Silla; Rafael Mayo; Enrique S. Quintana-Ortí

The increasing computing requirements for GPUs (Graphics Processing Units) have favoured the design and marketing of commodity devices that nowadays can also be used to accelerate general purpose computing. Therefore, future high performance clusters intended for HPC (High Performance Computing) will likely include such devices. However, high-end GPU-based accelerators used in HPC feature a considerable energy consumption, so that attaching a GPU to every node of a cluster has a strong impact on its overall power consumption. In this paper we detail a framework that enables remote GPU acceleration in HPC clusters, thus allowing a reduction in the number of accelerators installed in the cluster. This leads to energy, acquisition, maintenance, and space savings.

International Conference on Parallel Architectures and Compilation Techniques | 2002

Efficient interconnects for clustered microarchitectures

Joan-Manuel Parcerisa; Julio Sahuquillo; Antonio González; José Duato

Clustering is an effective microarchitectural technique for reducing the impact of wire delays, the complexity, and the power requirements of microprocessors. In this work, we investigate the design of on-chip interconnection networks for clustered microarchitectures. This new class of interconnects has different demands and characteristics than traditional multiprocessor networks. In a clustered microarchitecture, a low inter-cluster communication latency is essential for high performance. We propose point-to-point interconnects together with an effective latency-aware instruction steering scheme and show that they achieve much better performance than bus-based interconnects. The results show that the connectivity of the network together with latency-aware steering schemes are key for high performance. We also show that these interconnects can be built with simple hardware and achieve a performance close to that of an idealized contention-free model.

IEEE Transactions on Parallel and Distributed Systems | 1996

A necessary and sufficient condition for deadlock-free routing in cut-through and store-and-forward networks

José Duato

This paper develops the theoretical background for the design of deadlock-free adaptive routing algorithms for virtual cut-through and store-and-forward switching. This theory is valid for networks using either central buffers or edge buffers. Some basic definitions and three theorems are proposed, developing conditions to verify that an adaptive algorithm is deadlock-free, even when there are cyclic dependencies between routing resources. Moreover, we propose a necessary and sufficient condition for deadlock-free routing. Also, a design methodology is proposed. It supplies fully adaptive, minimal and non-minimal routing algorithms, guaranteeing that they are deadlock-free. The theory proposed in this paper extends the necessary and sufficient condition for wormhole switching previously proposed by us. The resulting routing algorithms are more flexible than the ones for wormhole switching. Also, the design methodology is much easier to apply because it automatically supplies deadlock-free routing algorithms.

International Parallel and Distributed Processing Symposium | 2007

Deterministic versus Adaptive Routing in Fat-Trees

Crispín Gómez; F. Gilabert; María Engracia Gómez; Pedro López; José Duato

Clusters of PCs have become very popular to build high performance computers. These machines use commodity PCs linked by a high speed interconnect. Routing is one of the most important design issues of interconnection networks. Adaptive routing usually better balances network traffic, thus allowing the network to obtain a higher throughput. However, adaptive routing introduces out-of-order packet delivery, which is unacceptable for some applications. Concerning topology, most of the commercially available interconnects are based on fat-trees. Fat-trees offer a rich connectivity among nodes, making it possible to obtain paths between all source-destination pairs that do not share any link. We exploit this idea to propose a deterministic routing algorithm for fat-trees, comparing it with adaptive routing in several workloads. The results show that deterministic routing can achieve a similar, and in some scenarios higher, level of performance than adaptive routing, while providing in-order packet delivery.

International Symposium on Computer Architecture | 2011

Increasing the effectiveness of directory caches by deactivating coherence for private memory blocks

Blas Cuesta; Alberto Ros; María Engracia Gómez; Antonio Robles; José Duato

To meet the demand for more powerful high-performance shared-memory servers, multiprocessor systems must incorporate efficient and scalable cache coherence protocols, such as those based on directory caches. However, the limited directory cache size of the increasingly larger systems may cause frequent evictions of directory entries and, consequently, invalidations of cached blocks, which severely degrades system performance. A significant percentage of the referenced memory blocks are only accessed by one processor (even in parallel applications) and, therefore, do not require coherence maintenance. Taking advantage of techniques that dynamically identify those private blocks, we propose to deactivate the coherence protocol for them and to treat them as uniprocessor systems do. The protocol deactivation allows directory caches to omit the tracking of an appreciable quantity of blocks, which reduces their load and increases their effective size. Since the operating system collaborates on the detection of private blocks, our proposal only requires minor modifications. Simulation results show that, thanks to our proposal, directory caches can avoid the tracking of about 57% of the accessed blocks and their capacity can be better exploited. This helps either to shorten the runtime of parallel applications by 15% while keeping the directory cache size or to maintain system performance while using directory caches 8 times smaller.
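
The idea of identifying private blocks dynamically can be pictured with a small sketch. Everything here is an illustrative assumption rather than the mechanism of the paper: the class name `PageClassifier`, the choice of page granularity, and the rule that a page is private until a second core touches it. A real system would also have to recover coherence for copies already cached when a page turns shared, which the sketch leaves out.

```python
class PageClassifier:
    """Toy private/shared classifier (hypothetical, for illustration only)."""

    def __init__(self):
        self.owner = {}      # page -> first core that accessed it
        self.shared = set()  # pages accessed by more than one core

    def access(self, core, page):
        """Record an access; return True if the page needs coherence tracking."""
        if page in self.shared:
            return True
        first_core = self.owner.setdefault(page, core)
        if first_core != core:
            # A second core touched the page: it stops being private.
            self.shared.add(page)
            return True
        return False

    def untracked_fraction(self):
        """Fraction of touched pages that are still private."""
        if not self.owner:
            return 0.0
        private = len(self.owner) - len(self.shared)
        return private / len(self.owner)


classifier = PageClassifier()
# Hypothetical access trace of (core, page) pairs.
for core, page in [(0, "A"), (0, "A"), (1, "B"), (1, "A"), (0, "C")]:
    classifier.access(core, page)

print(sorted(classifier.shared))
print(classifier.untracked_fraction())
```

In this toy trace, page A becomes shared once core 1 touches it, while B and C stay private, so a directory would only need to track one of the three pages.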

International Parallel and Distributed Processing Symposium | 2006

Segment-based routing: an efficient fault-tolerant routing algorithm for meshes and tori

Andres Mejia; Jose Flich; José Duato; Sven-Arne Reinemo; Tor Skeie

Computers get faster every year, but the demand for computing resources seems to grow at an even faster rate. Depending on the problem domain, this demand for more power can be satisfied by either massively parallel computers or clusters of computers. Common to both approaches is the dependence on high performance interconnect networks such as Myrinet, Infiniband, or 10 Gigabit Ethernet. While high throughput and low latency are key features of interconnection networks, the issue of fault-tolerance is now becoming increasingly important. As the number of network components grows, so does the probability of failure; thus it becomes important to also consider the fault-tolerance mechanism of interconnection networks. The main challenge then lies in combining performance and fault-tolerance, while still keeping cost and complexity low. This paper proposes a new deterministic routing methodology for tori and meshes, which achieves high performance without the use of virtual channels. Furthermore, it is topology agnostic in nature, meaning it can handle any topology derived from any combination of faults when combined with static reconfiguration. The algorithm, referred to as segment-based routing (SR), works by partitioning a topology into subnets, and subnets into segments. This allows us to place bidirectional turn restrictions locally within a segment. As segments are independent, we gain the freedom to place turn restrictions within a segment independently from other segments. This results in a larger degree of freedom when placing turn restrictions compared to other routing strategies. In this paper a way to compute segment-based routing tables is presented and applied to meshes and tori. Evaluation results show that SR increases performance by a factor of 1.8 over FX and up*/down* routing.

IEEE Transactions on Parallel and Distributed Systems | 1997

A theory of fault-tolerant routing in wormhole networks

José Duato

Fault-tolerant systems aim at providing continuous operation in the presence of faults. Multicomputers rely on an interconnection network between processors to support the message-passing mechanism. Therefore, the reliability of the interconnection network is very important for the reliability of the whole system. This paper analyzes the effective redundancy available in a wormhole network by combining connectivity and deadlock freedom. Redundancy is defined at the channel level. We propose a sufficient condition for channel redundancy, also computing the set of redundant channels. The redundancy level of the network is also defined, proposing a theorem that supplies its value. This theory is developed on top of our necessary and sufficient condition for deadlock-free adaptive routing. The new theory also considers the failure of physical channels when virtual channels are used. Finally, we propose a methodology for the design of fault-tolerant routing algorithms, showing its application to n-dimensional meshes.

High-Performance Computer Architecture | 2005

A new scalable and cost-effective congestion management strategy for lossless multistage interconnection networks

José Duato; I. Johnson; Jose Flich; F. Naven; Pedro Javier García; T. Nachiondo

In this paper, we propose a new congestion management strategy for lossless multistage interconnection networks that scales as network size and/or link bandwidth increase. Instead of eliminating congestion, our strategy avoids performance degradation beyond the saturation point by eliminating the HOL blocking produced by congestion trees. This is achieved in a scalable manner by using separate queues for congested flows. These are dynamically allocated only when congestion arises, and deallocated when congestion subsides. Performance evaluation results show that our strategy responds to congestion immediately and completely eliminates the performance degradation produced by HOL blocking while using only a small number of additional queues.
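
The effect of using separate queues for congested flows can be shown with a toy model of one switch port. This is a sketch under stated assumptions, not the mechanism of the paper: the class `Port`, its methods, the per-destination notion of a congested flow, and the fixed pool size are all hypothetical. Packets are `(destination, id)` tuples, and `blocked` stands for destinations that cannot currently accept packets.

```python
from collections import deque


class Port:
    """Toy switch port with a small pool of set-aside queues (hypothetical)."""

    def __init__(self, max_extra_queues=2):
        self.normal = deque()  # shared FIFO for uncongested traffic
        self.extra = {}        # congested destination -> dedicated FIFO
        self.max_extra = max_extra_queues

    def mark_congested(self, dest):
        """Allocate a dedicated queue for dest; return True if one is in place."""
        if dest in self.extra:
            return True
        if len(self.extra) >= self.max_extra:
            return False  # pool exhausted: traffic stays in the shared FIFO
        self.extra[dest] = deque()
        return True

    def enqueue(self, packet):
        """Store the packet in its dedicated queue if any, else in the shared FIFO."""
        dest = packet[0]
        self.extra.get(dest, self.normal).append(packet)

    def dequeue(self, blocked):
        """Return the next packet whose destination is not blocked, or None."""
        if self.normal and self.normal[0][0] not in blocked:
            return self.normal.popleft()
        for dest, queue in self.extra.items():
            if queue and dest not in blocked:
                return queue.popleft()
        return None

    def clear_congestion(self, dest):
        """Deallocate the dedicated queue for dest once it has drained."""
        queue = self.extra.get(dest)
        if queue is not None and not queue:
            del self.extra[dest]
            return True
        return False


packets = [("hot", 1), ("cold", 2), ("hot", 3)]

fifo_only = Port(max_extra_queues=0)
fifo_only.mark_congested("hot")
for p in packets:
    fifo_only.enqueue(p)

isolated = Port(max_extra_queues=1)
isolated.mark_congested("hot")
for p in packets:
    isolated.enqueue(p)

# With "hot" blocked, the single FIFO stalls behind its head packet,
# while the port with a dedicated queue can still forward the "cold" packet.
print(fifo_only.dequeue(blocked={"hot"}))
print(isolated.dequeue(blocked={"hot"}))
```

The single-FIFO port returns nothing because the packet at its head is blocked, which is the head-of-line blocking the abstract refers to. The second port forwards the cold packet and can release its dedicated queue once the hot flow drains.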
