José-Ángel Gregorio

Publication

Featured research published by José-Ángel Gregorio.

Journal of Parallel and Distributed Computing | 2001

The Adaptive Bubble Router

Valentin Puente; Cruz Izu; Ramón Beivide; José-Ángel Gregorio; Fernando Vallejo; J. M. Prellezo

The design of a new adaptive virtual cut-through router for torus networks is presented in this paper. With much lower VLSI costs than adaptive wormhole routers, the adaptive Bubble router is even faster than deterministic wormhole routers based on virtual channels. This has been achieved by combining a low-cost deadlock avoidance mechanism for virtual cut-through networks, called Bubble flow control, with an adequate design of the router's arbiter. A thorough methodology has been employed to quantify the impact that this router design has at all levels, from its hardware cost to the system performance when running parallel applications. At the VLSI level, our proposal is the adaptive router with the shortest clock cycle and node delay when compared with other state-of-the-art alternatives. This translates into the lowest latency and highest throughput under standard synthetic loads. At system level, these gains reduce the execution time of the benchmarks considered. Compared with current adaptive wormhole routers, the execution time is reduced by up to 27%. Furthermore, this is the only router that improves system performance when compared with simpler static designs.

IEEE International Conference on High Performance Computing, Data, and Analytics | 1997

A flow control mechanism to avoid message deadlock in k-ary n-cube networks

Carmen Carrión; Ramón Beivide; José-Ángel Gregorio; Fernando Vallejo

We propose a flow control algorithm for k-ary n-cube networks which avoids deadlock without using virtual channels. Some basic definitions and theorems are proposed in order to establish the necessary and sufficient conditions to verify that an algorithm is deadlock-free. Our proposal is based on a restriction of the virtual cut-through flow control rather than of the routing algorithm, and it can be applied to either central buffers or edge buffers. A minimum free buffer space of two packets is required. According to Chien's (1993) model, the resulting router is much simpler and faster than one using virtual channels. Network simulations that take the router complexity into account show the performance achieved by this new algorithm. The results display a latency improvement of 20% to 35% compared with the use of virtual channels, depending on the load of the network.
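
The buffer-space restriction described above can be illustrated with a minimal sketch. The abstract states only that a minimum free space of two packets is required; the distinction drawn here between packets entering a ring (which need two free slots) and packets already travelling along it (which need one) is an assumption based on how this kind of rule is commonly described, and the names `can_advance` and `try_inject` are hypothetical.

```python
def can_advance(free_slots: int, entering_ring: bool) -> bool:
    """Return True if a packet may move into the next buffer.

    free_slots    -- free packet-sized slots in the destination buffer
    entering_ring -- True for an injection or a change of dimension,
                     False for a packet continuing along the same ring
    """
    # Assumed rule: an entering packet must leave one slot free behind it,
    # so it needs two free slots; an in-transit packet needs only one.
    required = 2 if entering_ring else 1
    return free_slots >= required


def try_inject(ring_free: list, node: int) -> bool:
    """Inject one packet at `node` if the rule allows it.

    ring_free[i] holds the number of free slots in the buffer at node i.
    """
    if can_advance(ring_free[node], entering_ring=True):
        ring_free[node] -= 1
        return True
    return False


if __name__ == "__main__":
    ring = [2, 2, 2, 2]
    first = [try_inject(ring, n) for n in range(4)]
    second = [try_inject(ring, n) for n in range(4)]
    print(first, second, ring)
```

Under this assumed rule, injection alone can never fill a buffer completely, so some free slot always remains for packets already in the ring to advance into.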

Networks on Chips | 2012

TOPAZ: An Open-Source Interconnection Network Simulator for Chip Multiprocessors and Supercomputers

Pablo Abad; Pablo Prieto; Lucia G. Menezo; Adrián Colaso; Valentin Puente; José-Ángel Gregorio

As in other computer architecture areas, interconnection network research relies most of the time on simulation tools. This paper announces the release of an open-source tool suitable for accurate modeling of interconnection networks ranging from small CMPs to large supercomputers. The cycle-accurate modeling of TOPAZ can be used standalone, through synthetic traffic patterns and application traces, or effortlessly within full-system evaluation systems such as GEMS or GEM5. In fact, we provide an advanced interface that enables the replacement of the original lightweight but optimistic GEMS and GEM5 network simulator with limited performance impact on the simulation time. Our tests indicate that, in this context, underestimating network modeling could induce up to 50% error in the performance estimation of the simulated system. To minimize the impact of detailed network modeling on simulation time, we incorporate mechanisms able to attenuate the higher computational effort, thereby reducing the slowdown of the full-system simulation while keeping performance estimations accurate. Additionally, in order to evaluate large-scale networks, we parallelize the simulator to be able to optimize memory resources with the growing number of cores available per chip in the simulation farms. This allows us to simulate node networks exceeding one million routers with up to 70% efficiency in a multithreaded simulation running on twelve cores.

High-Performance Computer Architecture | 2009

MRR: Enabling fully adaptive multicast routing for CMP interconnection networks

Pablo Abad; Valentin Puente; José-Ángel Gregorio

On-network hardware support for multi-destination traffic is a desirable feature in most multiprocessor machines. Multicast hardware capabilities enable much more effective bandwidth utilization, as multi-destination packets do not need to repeatedly use the same resources, as occurs when multicast traffic must be decomposed into unicast packets. Although Chip Multiprocessors are no exception to this interest, few suitable proposals exist to date. The combination of the scarcity of available resources and the common idea that multicast support requires a substantial amount of extra resources is responsible for this situation. In this work, we propose a new approach suitable for on-chip networks, capable of managing multi-destination traffic via hardware in an efficient way with negligible complexity. We introduce the Multicast Rotary Router (MRR), a router able to: (1) perform on-network multicast support with almost zero cost over the Rotary Router, (2) use a fully adaptive tree to distribute multicast traffic, (3) perform on-network congestion control, extending the network utilization range. The performance results, using a state-of-the-art full-system simulation framework, show that it improves average full-system performance of a CMP using a unicast Rotary Router in its interconnection network by 25%, and an input-buffered router with multicast support by 20%.

High-Performance Computer Architecture | 2010

ESP-NUCA: A low-cost adaptive Non-Uniform Cache Architecture

Javier Merino; Valentin Puente; José-Ángel Gregorio

This paper introduces a cost-effective cache architecture called Enhanced Shared-Private Non-Uniform Cache Architecture (ESP-NUCA), which is suitable for high-performance Chip MultiProcessors (CMPs). This architecture enhances system stability by combining the advantages of private and shared caches. Starting from a shared NUCA, ESP-NUCA introduces a low-cost mechanism to dynamically allocate private cache blocks closer to their owner processor. In this way, average on-chip access latency is reduced and inter-core interference minimized. ESP-NUCA synergistically integrates victims and replicas, thus making it possible to take advantage of multiple readers for shared data, and to maximize cache usage under unbalanced core utilization. This architecture leads to stable behavior within the whole system across a broad spectrum of working scenarios. ESP-NUCA not only outperforms architectures with similar implementation costs such as private and shared caches by up to 20% and 40% respectively, but even outperforms much costlier architectures such as D-NUCA [13] by up to 28%, Adaptive Selective Replication [3] by up to 19%, and Cooperative Caching [5] by up to 15%. Moreover, performance variance throughout the set of benchmarks is 37% lower than with ASR, 87% lower than with D-NUCA, and 43% lower than with Cooperative Caching.

Performance Evaluation | 2008

Improving the performance of large interconnection networks using congestion-control mechanisms

José Miguel-Alonso; Cruz Izu; José-Ángel Gregorio

Interconnection networks in current parallel systems do not only increase in size; their buffer capacity and number of source ports have increased as well. All these factors result in a significant rise of network congestion compared with their predecessors. Consequently, packet injection must be restricted in order to prevent throughput degradation at high loads. This work evaluates, via simulation, three congestion control mechanisms on adaptive cut-through torus networks, using two different deadlock-avoidance methods, under various synthetic traffic patterns. Workload is generated using bursts of data exchanges (instead of a Bernoulli process) to reflect the synchronized nature of data interchanges in parallel applications. Results show that large networks perform their best when most network resources are dedicated to in-transit traffic. Besides, local congestion-control mechanisms are nearly as effective as the more costly global ones for both uniform and nonuniform traffic patterns.

IEEE International Conference on High Performance Computing, Data, and Analytics | 2003

Chordal Topologies for Interconnection Networks

Ramón Beivide; Carmen Martínez; Cruz Izu; Jaime Gutierrez; José-Ángel Gregorio; José Miguel-Alonso

The class of dense circulant graphs of degree four with optimal distance-related properties is analyzed in this paper. An algebraic study of this class is done. Two geometric characterizations are given, one in the plane and the other in space. Both characterizations facilitate the analysis of their topological properties and corroborate their suitability for implementing interconnection networks for distributed and parallel computers. Also, a distance-hereditary non-disjoint decomposition of these graphs into rings is computed. Besides its practical consequences, this decomposition allows us to present these optimal circulant graphs as a particular evolution of the traditional ring topology.
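
To make the distance properties of degree-four circulant graphs concrete, the following sketch computes the diameter of a circulant graph by breadth-first search. The abstract does not give the graph parameters; the example assumes the family with N = 2k^2 + 2k + 1 nodes and jumps k and k + 1, and the function name `circulant_diameter` is hypothetical.

```python
from collections import deque


def circulant_diameter(n: int, jumps: tuple) -> int:
    """Diameter of the circulant graph on n nodes with the given jumps.

    Node i is adjacent to (i + j) mod n and (i - j) mod n for each jump j.
    Circulant graphs are vertex-transitive, so one BFS from node 0 suffices.
    """
    dist = {0: 0}
    queue = deque([0])
    while queue:
        node = queue.popleft()
        for j in jumps:
            for nxt in ((node + j) % n, (node - j) % n):
                if nxt not in dist:
                    dist[nxt] = dist[node] + 1
                    queue.append(nxt)
    if len(dist) != n:
        raise ValueError("graph is not connected")
    return max(dist.values())


if __name__ == "__main__":
    k = 2
    n = 2 * k * k + 2 * k + 1            # 13 nodes (assumed family)
    print(circulant_diameter(n, (k, k + 1)))  # degree-four circulant
    print(circulant_diameter(n, (1,)))        # plain ring of the same size
```

With 13 nodes, the degree-four circulant with jumps 2 and 3 reaches every node within two hops, whereas a plain ring of the same size needs up to six.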

IEEE Transactions on Parallel and Distributed Systems | 2003

On the design of a high-performance adaptive router for CC-NUMA multiprocessors

Valentin Puente; José-Ángel Gregorio; Ramón Beivide; Cruz Izu

This work presents the design and evaluation of an adaptive packet router aimed at supporting CC-NUMA traffic. We exploit a simple and efficient packet injection mechanism to avoid deadlock, which leads to fully adaptive routing by employing only three virtual channels. In addition, we selectively use output buffers for implementing the most utilized virtual paths in order to reduce head-of-line blocking. The careful implementation of these features has resulted in a good trade-off between network performance and hardware cost. The outcome of this research is a high-performance adaptive router (HPAR), which adequately balances the needs of parallel applications: minimal network latency at low loads and high throughput at heavy loads. The paper includes an evaluation process in which HPAR is compared with other adaptive routers using FIFO input buffering, with or without additional virtual channels to reduce head-of-line blocking. This evaluation considers both the VLSI costs of each router and their performance under synthetic and real application workloads. To make the comparison fair, all the routers use the same efficient deadlock avoidance mechanism. In all the experiments, HPAR exhibited the best response among all the routers tested. Moreover, the observed packet latencies were comparable to those exhibited by simpler routers. Therefore, HPAR can be considered a suitable candidate to implement packet interchange in the next generations of CC-NUMA multiprocessors.

Computing Frontiers | 2004

A first glance at Kilo-instruction based multiprocessors

Marco Galluzzi; Valentin Puente; Adrián Cristal; Ramón Beivide; José-Ángel Gregorio; Mateo Valero

The ever-increasing gap between processor and memory speed, sometimes referred to as the Memory Wall problem [42], has a very negative impact on performance. This mismatch will be more severe in future processor generations. Modern cache organizations and prefetching techniques will not be able to solve this problem. A novel and promising technique to deal with the Memory Wall consists of designing processors able to maintain thousands of in-flight instructions. One example of this kind of processor is the Kilo-instruction processor [8]. When running numerical applications, Kilo-instruction processors have demonstrated their ability to effectively maintain high IPC values as memory latencies increase. In this paper, we study, for the first time, the influence of Kilo-instruction processors on the performance of small-scale CC-NUMA multiprocessors. Our first results, using an ideal network, show the enormous potential of Kilo-instruction processors, when used as computing nodes, for hiding not only local DRAM latencies but also remote ones. A deeper analysis, using realistic networks, reveals heavy demands on the packet throughput required by each node, since larger re-order buffers translate into a higher density of remote accesses. Next, we show that current interconnection networks cannot cope with these high traffic levels, so newer and faster networks have to be designed. In short, our results show dramatic performance gains over multiprocessors based on current microprocessors and indicate a possible way to build future shared-memory multiprocessor systems.

IEEE Computer Architecture Letters | 2011

Multilevel Cache Modeling for Chip-Multiprocessor Systems

Pablo Prieto; Valentin Puente; José-Ángel Gregorio

This paper presents a simple analytical model for predicting on-chip cache hierarchy effectiveness in chip multiprocessors (CMP) for a state-of-the-art architecture. Given the complexity of this type of system, we use rough approximations, such as the empirical observation that the re-reference timing pattern follows a power law and the assumption of a simplistic delay model for the cache, in order to provide a useful model for the memory hierarchy responsiveness. This model enables the analytical determination of average access time, which makes design space pruning useful before sweeping the vast design space of this class of systems. The model is also useful for predicting cache hierarchy behavior in future systems. The fidelity of the model has been validated using a state-of-the-art, full-system simulation environment, on a system with up to sixteen out-of-order processors with coherent caches and using a broad spectrum of applications, including complex multithreaded workloads. This simple model can predict a near-optimal on-chip cache distribution while also estimating how future systems running future applications might behave.
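
As a rough illustration of what an analytically determined average access time looks like, the sketch below evaluates the standard textbook recurrence for a multilevel hierarchy. It is not the power-law model from the paper: the per-level miss rates are simply taken as inputs, and the function name and the example latencies are hypothetical.

```python
def average_access_time(levels, memory_latency):
    """Average access time of a multilevel cache hierarchy.

    levels         -- list of (hit_time, local_miss_rate) pairs, ordered
                      from the level closest to the processor outwards
    memory_latency -- latency of main memory, in the same time unit

    Uses the recurrence T_i = hit_time_i + miss_rate_i * T_(i+1),
    with main memory as the final level.
    """
    penalty = memory_latency
    # Walk from the last cache level back to the first.
    for hit_time, miss_rate in reversed(levels):
        penalty = hit_time + miss_rate * penalty
    return penalty


if __name__ == "__main__":
    # Illustrative numbers only: two cache levels plus main memory.
    hierarchy = [(1, 0.1), (10, 0.5)]
    print(average_access_time(hierarchy, memory_latency=100))
```

A closed form of this kind can be evaluated for many candidate configurations far more cheaply than simulating each one.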
