Diego Lugones
Bell Labs
Publications
Featured research published by Diego Lugones.
Cluster Computing and the Grid | 2009
Diego Lugones; Daniel Franco; Emilio Luque
The increasing communication demands of parallel applications in cluster computing require interconnection networks that provide low and bounded communication delays. However, message congestion appears when the communication load between nodes is not fairly distributed over the network. Congestion spreading increases latency and reduces network throughput, causing significant performance degradation. In this paper we present Dynamic Routing Balancing with Multipath Distribution (DRB-MD), a new method developed to control network congestion based on uniform balancing of the communication load. DRB-MD distributes the traffic load according to a gradual and load-controlled path expansion. It monitors message latency in the network switches, decides how many alternative paths should be used, and finally selects which path (or paths) to use between each source-destination pair. Experiments with permutation patterns and hotspot traffic were conducted to evaluate DRB-MD performance under conditions commonly created by parallel scientific applications.
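The abstract describes a control loop that watches switch latency and gradually widens or narrows the set of paths used by each source-destination pair. The following is a minimal sketch of that idea, assuming simple latency thresholds and a fixed path limit; it is not the authors' implementation.

```python
# Hypothetical sketch of a gradual, latency-driven path expansion in the
# spirit of DRB-MD. Thresholds, path limit and latency samples are
# illustrative assumptions, not values from the paper.

LOW_LATENCY = 1.0    # below this, contract toward minimal routing (arbitrary units)
HIGH_LATENCY = 3.0   # above this, open one more alternative path

def update_path_count(current_paths, measured_latency, max_paths=4):
    """Decide how many alternative paths a source-destination pair should use."""
    if measured_latency > HIGH_LATENCY and current_paths < max_paths:
        return current_paths + 1      # congestion detected: expand gradually
    if measured_latency < LOW_LATENCY and current_paths > 1:
        return current_paths - 1      # load dropped: contract toward a single path
    return current_paths

if __name__ == "__main__":
    paths_in_use = 1
    for latency in [0.8, 2.5, 3.4, 3.9, 1.2, 0.6]:   # synthetic latency samples
        paths_in_use = update_path_count(paths_in_use, latency)
        print(f"latency={latency:.1f} -> paths in use: {paths_in_use}")
```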
Computing Frontiers | 2012
Diego Lugones; Kostas Katrinis; Martin Collier
Hybrid optical/electrical interconnects, using commercially available optical circuit switches at the core of the network, have recently been proposed as an attractive alternative to fully connected electronically switched networks in terms of port density, bandwidth per port, cabling and energy efficiency. Although the shift from a traditionally packet-switched core to switching between server aggregations (or servers) at circuit granularity requires system redesign, the approach has been shown to fit well with the traffic requirements of certain classes of high-performance computing applications, as well as with the traffic patterns exhibited by typical data center workloads. Recent proposals for such system designs have looked at small- and medium-scale hybrid interconnects. In this paper, we present a hybrid optical/electrical interconnect architecture intended for large-scale deployments of high-performance computing systems and server co-locations. To reduce complexity, our architecture employs a regular shuffle network topology that allows for simple management and cabling. Thanks to its single-stage core interconnect and multiple optical planes, our design can be both incrementally scaled up (in capacity) and scaled out (in the number of racks) without requiring major re-cabling or network re-configuration. We are also, to our knowledge, the first to explore the benefit of multi-hopping in the optical domain as a means to avoid constant reconfiguration of optical circuit switches. We have prototyped our architecture at packet-level detail in a simulation framework to evaluate this concept. Our results demonstrate that our hybrid interconnect, by adapting to the changing nature of application traffic, can significantly exceed the throughput of a static interconnect of equal degree, while at times attaining a throughput comparable to that of a costly fully connected network. We also show a further benefit of multi-hopping: it reduces performance drops by reducing the frequency of reconfiguration.
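The multi-hopping idea above (forwarding over already-configured optical circuits instead of reconfiguring the switch) can be illustrated with a toy decision routine. This is a hedged sketch under assumed parameters (rack names, a ring-like circuit set, a hop budget), not the paper's control algorithm.

```python
# Illustrative only: prefer multi-hopping over existing optical circuits and
# fall back to circuit reconfiguration when no sufficiently short path exists.
from collections import deque

def shortest_hops(circuits, src, dst):
    """BFS over currently configured circuits; returns hop count or None."""
    seen, queue = {src}, deque([(src, 0)])
    while queue:
        node, hops = queue.popleft()
        if node == dst:
            return hops
        for nxt in circuits.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, hops + 1))
    return None

def handle_demand(circuits, src, dst, max_hops=3):
    hops = shortest_hops(circuits, src, dst)
    if hops is not None and hops <= max_hops:
        return f"route {src}->{dst} over {hops} optical hop(s), no reconfiguration"
    return f"reconfigure a circuit for {src}->{dst}"

if __name__ == "__main__":
    # toy circuit connectivity between racks R0..R3 (assumed, not the shuffle wiring)
    circuits = {"R0": ["R1"], "R1": ["R2"], "R2": ["R3"], "R3": ["R0"]}
    print(handle_demand(circuits, "R0", "R3"))   # reachable in 3 optical hops
    print(handle_demand(circuits, "R3", "R2"))   # reachable around the ring
```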
IEEE/OSA Journal of Optical Communications and Networking | 2015
Kostas Christodoulopoulos; Diego Lugones; Kostas Katrinis; Marco Ruffini; Donal O'Mahony
In response to the need for higher-speed and affordable datacenter interconnection networks, hybrid optical/electrical architectures have been proposed. Still, a number of implementation issues remain open, and little is known about the performance of real applications over such networks. To fill this gap, we present a hybrid network architecture, called HydRA, built from commodity off-the-shelf equipment, calculate its total price using current list prices, and show its price competitiveness against fat-tree alternatives. We also report on a prototype implementation of our HydRA network that uses Ethernet switches and a custom-built software controller. We show that our HydRA network prototype accelerates the execution of real parallel workloads when compared to equal-cost, electronic-only fat-tree networks.
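The abstract mentions a custom-built software controller that drives the hybrid fabric. As a rough illustration only, one plausible controller step is to map the heaviest rack-to-rack demands onto optical circuits with a greedy matching; the actual HydRA controller logic is not given here, and the rack names and demand figures below are invented.

```python
# Hedged sketch of a single controller step: pick non-conflicting circuits for
# the largest demands first. Not the HydRA algorithm, just a plausible stand-in.

def greedy_circuit_assignment(traffic):
    """traffic: dict {(src_rack, dst_rack): bytes}. Returns chosen circuits."""
    circuits, used_src, used_dst = [], set(), set()
    for (src, dst), volume in sorted(traffic.items(), key=lambda kv: -kv[1]):
        if src not in used_src and dst not in used_dst:
            circuits.append((src, dst, volume))
            used_src.add(src)
            used_dst.add(dst)
    return circuits

if __name__ == "__main__":
    demands = {("R0", "R2"): 900, ("R0", "R1"): 300,
               ("R1", "R2"): 700, ("R2", "R0"): 500}
    for src, dst, vol in greedy_circuit_assignment(demands):
        print(f"optical circuit {src} -> {dst} (demand {vol})")
```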
Future Generation Computer Systems | 2014
Diego Lugones; Kostas Katrinis; Georgios K. Theodoropoulos; Martin Collier
Hybrid optical/electrical interconnects using commercial optical circuit switches have previously been proposed as an attractive alternative to fully connected electronically switched networks. Among other advantages, such a design offers increased port density, bandwidth per port, cabling and energy efficiency compared to conventional packet-switched counterparts. Recent proposals for such system designs have looked at small- and/or medium-scale networks employing hybrid interconnects. In our previous work, we presented a hybrid optical/electrical interconnect architecture targeting large-scale deployments in high-performance computing and datacenter environments. To reduce complexity, our architecture employs a regular shuffle network topology that allows for simple management and cabling. Thanks to its single-stage core interconnect and multiple optical planes, our design can be both incrementally scaled up (in capacity) and scaled out (in the number of racks) without requiring major re-cabling or network re-configuration. In this paper, we extend our existing work towards quantifying and understanding the performance of this type of system against more diverse workload communication patterns and system design parameters. In this context, we evaluate, among other characteristics, the overhead of the proposed reconfiguration (decomposition and routing) scheme and extend our simulations to highly adversarial flow generation rate/duration values that challenge the reconfiguration latency of the system. Highlights: We present an optical/electrical interconnect for large-scale clusters and datacenters. The system uses reconfigurable optical planes to optimize network communications. Low-cost electronic planes bypass traffic during the optical plane reconfiguration. A de Bruijn-based topology ensures a good tradeoff between radix and latency at scale. Performance approaches that of a fully connected network for stable traffic at lower cost.
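The highlights point to a de Bruijn-based topology as the radix/latency tradeoff. The snippet below shows standard de Bruijn connectivity, where a graph B(k, n) has k**n nodes, out-degree k and diameter n; it illustrates the general construction, not necessarily the paper's exact wiring.

```python
# Standard de Bruijn graph connectivity: node x connects to (k*x + s) mod k**n
# for each symbol s in 0..k-1. Small parameters chosen purely for illustration.

def de_bruijn_neighbors(node, k, n):
    """Successors of `node` in the de Bruijn graph B(k, n) on k**n nodes."""
    return [(node * k + symbol) % (k ** n) for symbol in range(k)]

if __name__ == "__main__":
    k, n = 2, 3                      # 8 nodes, degree 2, diameter 3
    for node in range(k ** n):
        print(node, "->", de_bruijn_neighbors(node, k, n))
```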
European Conference on Parallel Processing | 2009
Gonzalo Zarza; Diego Lugones; Daniel Franco; Emilio Luque
The intensive and continuous use of high-performance computers for executing computationally intensive applications, coupled with the large number of elements that make them up, dramatically increases the likelihood of failures during their operation. The interconnection network is a critical part of high-performance computer systems that communicates and links together the processing units. Network faults have an extremely high impact because the occurrence of a single fault may prevent the correct finalization of applications. This work focuses on the problem of fault tolerance for high-speed interconnection networks by designing a fault-tolerant routing method. The goal is to tolerate a certain number of link and node failures, considering their impact and occurrence probability. To accomplish this task we take advantage of communication path redundancy, by means of adaptive multipath routing approaches that fulfill the four phases of fault tolerance: error detection, damage confinement, error recovery, and fault treatment and continued service. Experiments show that our method allows applications to successfully finish their execution in the presence of several faults, with an average performance of 97% with respect to the fault-free scenarios.
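As a small illustration of exploiting path redundancy, the sketch below filters a set of candidate multipaths against a set of known failed links. The path representation and the failure set are assumptions for illustration; the paper's routing method is more involved.

```python
# Toy example: keep only the precomputed alternative paths that avoid every
# currently known failed link. Node names and failures are made up.

def usable_paths(candidate_paths, failed_links):
    """Keep only paths whose links avoid every known failure."""
    good = []
    for path in candidate_paths:                       # path = list of nodes
        links = set(zip(path, path[1:]))               # consecutive node pairs
        if not (links & failed_links):
            good.append(path)
    return good

if __name__ == "__main__":
    paths = [["A", "B", "D"], ["A", "C", "D"], ["A", "B", "C", "D"]]
    failed = {("B", "D")}                              # one faulty link
    print(usable_paths(paths, failed))                 # paths avoiding B-D survive
```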
Parallel, Distributed and Network-Based Processing | 2010
Gonzalo Zarza; Diego Lugones; Daniel Franco; Emilio Luque
The intensive and continuous use of high-performance computing systems for executing computationally intensive applications, coupled with the large number of elements that make them up, dramatically increases the likelihood of failures during their operation. The interconnection network is a critical part of such systems; therefore, network faults have an extremely high impact because most routing algorithms are not designed to tolerate faults. In such algorithms, just a single fault may stall messages in the network, preventing the finalization of applications, or may lead to deadlocked configurations. This paper introduces a novel fault-tolerant routing method provided with a new deadlock-avoidance technique designed to tolerate an unbounded number of faults appearing at random during system operation. Our method provides escape paths for the stalled messages. In addition, the routing algorithm configures alternative paths to avoid the faulty areas, taking advantage of communication path redundancy by means of multipath routing approaches. Deadlock avoidance is achieved by adding a small queue and applying a simple set of actions when accessing output buffers with limited free space. Experiments show that our method allows applications to successfully finish their execution in the presence of several faults, with an average performance of 96% compared to the fault-free scenarios.
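The deadlock-avoidance rule described above (a small extra queue plus a simple action when an output buffer runs low on free space) can be caricatured as follows. Buffer sizes and the free-space threshold are assumed values, not the paper's parameters.

```python
# Minimal sketch: when a message finds its output buffer nearly full, divert it
# to a small escape queue instead of blocking, so dependency cycles cannot form.

class OutputPort:
    def __init__(self, capacity=8, escape_capacity=2, min_free=1):
        self.buffer, self.escape = [], []
        self.capacity, self.escape_capacity = capacity, escape_capacity
        self.min_free = min_free

    def enqueue(self, message):
        """Returns where the message went, or 'blocked' if nothing is free."""
        if self.capacity - len(self.buffer) > self.min_free:
            self.buffer.append(message)
            return "buffer"
        if len(self.escape) < self.escape_capacity:
            self.escape.append(message)        # escape resource breaks the cycle
            return "escape"
        return "blocked"                        # caller must retry or reroute

if __name__ == "__main__":
    port = OutputPort(capacity=3, escape_capacity=1)
    print([port.enqueue(m) for m in ["m0", "m1", "m2", "m3"]])
```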
Parallel, Distributed and Network-Based Processing | 2010
Gonzalo Zarza; Diego Lugones; Daniel Franco; Emilio Luque
The intensive and continuous use of high-performance computing systems for executing computationally intensive applications, coupled with the large number of elements that make them up, dramatically increases the likelihood of failures during their operation. Network faults have an extremely high impact because most routing algorithms are not designed to tolerate them. In such algorithms, just a single fault may lead to deadlocked configurations, thus preventing the correct finalization of applications. This paper introduces a new deadlock-avoidance mechanism for routing algorithms designed to deal with multiple dynamic faults. The mechanism is based on adding a small buffer and applying a simple set of actions when accessing output buffers with limited free space. Unlike typical static solutions, this proposal allows the design of routing algorithms capable of treating an unbounded number of dynamic faults.
International Conference on Cloud Computing and Services Science | 2014
Tommaso Cucinotta; Diego Lugones; Davide Cherubini; Karsten Oberle
In this paper, we present a brokering logic for providing precise end-to-end QoS levels to cloud applications distributed across a number of different business actors, such as network service providers (NSPs) and cloud service providers (CSPs). The broker composes available offerings from each provider in a way that respects the application's QoS constraints while minimizing the costs incurred by cloud consumers.
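A minimal sketch of the brokering idea, assuming each provider exposes offerings as (name, latency, cost) tuples and the application constraint is a single end-to-end latency bound; the paper's actual optimisation model is not detailed in this abstract, and all figures below are invented.

```python
# Toy broker: choose one NSP offering and one CSP offering so their combined
# latency meets the end-to-end bound, at minimum total cost.
from itertools import product

def broker(nsp_offers, csp_offers, max_latency_ms):
    """Each offer is (name, latency_ms, cost). Returns the cheapest feasible pair."""
    feasible = [
        (n, c) for n, c in product(nsp_offers, csp_offers)
        if n[1] + c[1] <= max_latency_ms
    ]
    if not feasible:
        return None
    return min(feasible, key=lambda pair: pair[0][2] + pair[1][2])

if __name__ == "__main__":
    nsps = [("NSP-basic", 30, 10), ("NSP-premium", 10, 25)]
    csps = [("CSP-small", 15, 20), ("CSP-large", 5, 40)]
    print(broker(nsps, csps, max_latency_ms=30))   # -> premium network + small VM
```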
International Conference on Cloud Computing | 2014
Tommaso Cucinotta; Diego Lugones; Davide Cherubini; Eric Jul
Contemporary cloud computing infrastructures are being challenged by an increasing demand for evolved cloud services characterised by heterogeneous performance requirements, including real-time, data-intensive and highly dynamic workloads. The classical way to deal with dynamicity is to scale computing and network resources horizontally. However, these techniques must be coupled effectively with advanced routing and switching in a multi-path environment, combined with a high degree of flexibility to support dynamic adaptation and live migration of virtual machines (VMs). We propose a management strategy to jointly optimise computing and networking resources in cloud infrastructures, in which Software Defined Networking (SDN) plays a key enabling role.
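As a toy illustration of jointly weighing computing and networking resources, the sketch below scores candidate hosts for a VM by combining remaining CPU with the hop distance to the VM's communication peers. The hosts, hop counts and weights are assumptions, not the paper's formulation.

```python
# Hypothetical joint placement score: favour hosts with CPU headroom that are
# also close, in network hops, to the VMs this VM talks to. Lower score wins.

def place_vm(hosts, cpu_demand, hops_to_peers, alpha=1.0, beta=2.0):
    """hosts: {name: free_cpu}; hops_to_peers: {name: hops}."""
    candidates = {h: free for h, free in hosts.items() if free >= cpu_demand}
    if not candidates:
        return None
    return min(candidates,
               key=lambda h: alpha / candidates[h] + beta * hops_to_peers[h])

if __name__ == "__main__":
    hosts = {"host-a": 8, "host-b": 4, "host-c": 16}
    hops = {"host-a": 1, "host-b": 0, "host-c": 3}     # distance to the VM's peers
    print(place_vm(hosts, cpu_demand=4, hops_to_peers=hops))   # -> host-b
```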
International Conference on Cluster Computing | 2009
Diego Lugones; Daniel Franco; Emilio Luque
Communication requirements in high-performance computing systems demand the use of high-speed interconnection networks to connect processing nodes. However, when the communication load is unfairly distributed across the network resources, message congestion appears. Congestion spreading increases latency and reduces network throughput, causing significant performance degradation. Fast-Response Dynamic Routing Balancing (FR-DRB) is a method developed to perform a uniform balancing of the communication load over the interconnection network. FR-DRB distributes the message traffic based on a gradual and load-controlled path expansion. The method monitors network message latency and decides how many alternative paths to use between each source-destination pair for message delivery. FR-DRB performance has been compared with other routing policies under a representative set of traffic patterns commonly created by parallel scientific applications. Experimental results show a significant improvement in latency and throughput.
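Complementing the path-count decision sketched earlier, a latency-aware balancer might also split traffic across the open paths in inverse proportion to their measured latency. This is a hedged illustration, not the FR-DRB implementation; the latency figures are invented.

```python
# Toy sketch: once several alternative paths are open, send a larger share of
# messages over the paths that currently report lower latency.

def path_weights(latencies):
    """latencies: {path_id: measured_latency}. Returns normalised send ratios."""
    inverse = {p: 1.0 / lat for p, lat in latencies.items()}
    total = sum(inverse.values())
    return {p: inv / total for p, inv in inverse.items()}

if __name__ == "__main__":
    measured = {"minimal": 1.0, "alt-1": 2.0, "alt-2": 4.0}
    for path, share in path_weights(measured).items():
        print(f"{path}: {share:.0%} of the messages")
```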