Is this you? Create Your Porfile

Federico Silla

Polytechnic University of Valencia

Archive Network Publication Hotspot Collaboration

Network

Latest external collaboration on country level. Dive into details by clicking on the dots.

Explore More

Hotspot

Dive into the research topics where Federico Silla is active.

Explore More

Publication

Featured researches published by Federico Silla.

international conference on high performance computing and simulation | 2010

rCUDA: Reducing the number of GPU-based accelerators in high performance clusters

José Duato; Antonio J. Peña; Federico Silla; Rafael Mayo; Enrique S. Quintana-Ortí

The increasing computing requirements for GPUs (Graphics Processing Units) have favoured the design and marketing of commodity devices that nowadays can also be used to accelerate general purpose computing. Therefore, future high performance clusters intended for HPC (High Performance Computing) will likely include such devices. However, high-end GPU-based accelerators used in HPC feature a considerable energy consumption, so that attaching a GPU to every node of a cluster has a strong impact on its overall power consumption. In this paper we detail a framework that enables remote GPU acceleration in HPC clusters, thus allowing a reduction in the number of accelerators installed in the cluster. This leads to energy, acquisition, maintenance, and space savings.

IEEE Transactions on Parallel and Distributed Systems | 2000

High-performance routing in networks of workstations with irregular topology

Federico Silla; José Duato

Networks of workstations are rapidly emerging as a cost-effective alternative to parallel computers. Switch-based interconnects with irregular topology allow the wiring flexibility, scalability, and incremental expansion capability required in this environment. However, the irregularity also makes routing and deadlock avoidance on such systems quite complicated. In current proposals, many messages are routed following nonminimal paths, increasing latency and wasting resources. In this paper, we propose two general methodologies for the design of adaptive routing algorithms for networks with irregular topology. Routing algorithms designed according to these methodologies allow messages to follow minimal paths in most cases, reducing message latency and increasing network throughput. As an example of application, we propose two adaptive routing algorithms for ANI (previously known as Autonet). They can be implemented either by duplicating physical channels or by splitting each physical channel into two virtual channels. In the former case, the implementation does not require a new switch design. It only requires changing the routing tables and adding links in parallel with existing ones, taking advantage of spare switch ports. In the latter case, a new switch design is required, but the network topology is not changed. Evaluation results for several different tapologies and message distributions show that the new routing algorithms are able to increase throughput for random traffic by a factor of up to 4 with respect to the original up*/down* algorithm, also reducing latency significantly. For other message distributions, throughput is increased more than seven times. We also show that most of the improvement comes from the use of minimal routing.

ieee international conference on high performance computing, data, and analytics | 1997

Improving the efficiency of adaptive routing in networks with irregular topology

Federico Silla; José Duato

Networks of workstations are emerging as a cost-effective alternative to parallel computers. The interconnection between workstations usually relies on switch-based networks with irregular topologies. This irregularity makes routing and deadlock avoidance quite complicated. Current proposals avoid deadlock by removing cyclic dependencies between channels and therefore, many messages are routed along non-minimal paths, increasing latency and wasting resources. We propose a general methodology for the design of adaptive routing algorithms for networks with irregular topology that improves a previously proposed one by reducing the probability of routing over non-minimal paths. The resulting routing algorithms allow messages to follow minimal paths in most cases, reducing message latency and increasing network throughput. As an example of application, we propose an improved adaptive routing algorithm for Autonet.

parallel computing | 1997

Efficient Adaptive Routing in Networks of Workstations with Irregular Topology

Federico Silla; Manuel P. Malumbres; Antonio Robles; Pedro López; José Duato

Networks of workstations are rapidly emerging as a cost-effective alternative to parallel computers. Switch-based interconnects with irregular topologies allow the wiring flexibility, scalability and incremental expansion capability required in this environment. The irregularity also makes routing and deadlock avoidance on such systems quite complicated. Current proposals avoid deadlock by removing cyclic dependencies between channels. As a consequence, many messages are routed following non-minimal paths, increasing latency and wasting resources. In this paper, we propose a general methodology for the design of adaptive routing algorithms for networks with irregular topology. These routing algorithms allow messages to follow minimal paths in most cases, reducing message latency and increasing network throughput. The methodology is based on the application of the theory of deadlock avoidance proposed in [14], which increases routing flexibility by allowing cyclic dependencies between channels. As an example of application, we propose an adaptive routing algorithm for Autonet. It can be implemented either by duplicating physical channels or by splitting each physical channel into two virtual channels. In the former case, the implementation does not require a new switch design. It only requires changing the routing tables and adding links in parallel with existing ones, taking advantage of spare switch ports. In the latter case, a new switch design is required but the network topology is not changed. Preliminary evaluation results show that the new routing algorithm is able to increase throughput for random traffic by a factor of up to 2.8 with respect to the original algorithm, also reducing latency.

ieee international conference on high performance computing, data, and analytics | 2011

Enabling CUDA acceleration within virtual machines using rCUDA

José Duato; Antonio J. Peña; Federico Silla; Juan Carlos Fernández; Rafael Mayo; Enrique S. Quintana-Ortí

The hardware and software advances of Graphics Processing Units (GPUs) have favored the development of GPGPU (General-Purpose Computation on GPUs) and its adoption in many scientific, engineering, and industrial areas. Thus, GPUs are increasingly being introduced in high-performance computing systems as well as in datacenters. On the other hand, virtualization technologies are also receiving rising interest in these domains, because of their many benefits on acquisition and maintenance savings. There are currently several works on GPU virtualization. However, there is no standard solution allowing access to GPGPU capabilities from virtual machine environments like, e.g., VMware, Xen, VirtualBox, or KVM. Such lack of a standard solution is delaying the integration of GPGPU into these domains. In this paper, we propose a first step towards a general and open source approach for using GPGPU features within VMs. In particular, we describe the use of rCUDA, a GPGPU (General-Purpose Computation on GPUs) virtualization framework, to permit the execution of GPU-accelerated applications within virtual machines (VMs), thus enabling GPGPU capabilities on any virtualized environment. Our experiments with rCUDA in the context of KVM and VirtualBox on a system equipped with two NVIDIA GeForce 9800 GX2 cards illustrate the overhead introduced by the rCUDA middleware and prove the feasibility and scalability of this general virtualizing solution. Experimental results show that the overhead is proportional to the dataset size, while the scalability is similar to that of the native environment.

parallel computing | 2014

A complete and efficient CUDA-sharing solution for HPC clusters

Antonio J. Peña; Federico Silla; Rafael Mayo; Enrique S. Quintana-Ortí; José Duato

We detail a complete and up-to-date GPGPU remote virtualization solution.We describe efficient mechanisms for remote GPU data transfers.We detail an optimized InfiniBand implementation of rCUDA communications.We provide an extensive experimental evaluation of remote GPU acceleration. In this paper we detail the key features, architectural design, and implementation of rCUDA, an advanced framework to enable remote and transparent GPGPU acceleration in HPC clusters. rCUDA allows decoupling GPUs from nodes, forming pools of shared accelerators, which brings enhanced flexibility to cluster configurations. This opens the door to configurations with fewer accelerators than nodes, as well as permits a single node to exploit the whole set of GPUs installed in the cluster. In our proposal, CUDA applications can seamlessly interact with any GPU in the cluster, independently of its physical location. Thus, GPUs can be either distributed among compute nodes or concentrated in dedicated GPGPU servers, depending on the cluster administrators policy. This proposal leads to savings not only in space but also in energy, acquisition, and maintenance costs. The performance evaluation in this paper with a series of benchmarks and a production application clearly demonstrates the viability of this proposal. Concretely, experiments with the matrix-matrix product reveal excellent performance compared with regular executions on the local GPU; on a much more complex application, the GPU-accelerated LAMMPS, we attain up to 11x speedup employing 8 remote accelerators from a single node with respect to a 12-core CPU-only execution. GPGPU service interaction in compute nodes, remote acceleration in dedicated GPGPU servers, and data transfer performance of similar GPU virtualization frameworks are also evaluated.

international conference on parallel processing | 1996

A high performance router architecture for interconnection networks

José Duato; Pedro López; Federico Silla; Sudhakar Yalamanchili

We propose a new router architecture that supports wormhole switching and circuit switching concurrently. This architecture has been designed to take advantage of temporal communication locality. This can be done by establishing a circuit between nodes that are going to communicate frequently. Messages using those circuits face no contention. By combining circuit switching, pre-established physical circuits and wave pipelining across channels and switches, it is possible to increase network bandwidth considerably, also reducing latency for communications that use pre-established physical circuits. This router architecture also allows to reduce the overhead of the software messaging layer in multicomputers by offering a better hardware support. Preliminary performance evaluation results show a drastic reduction in latency and increment in throughput when messages are long enough, even if circuits are established for a single transmission and locality is not exploited.

architectural support for programming languages and operating systems | 2002

A comparative study of arbitration algorithms for the Alpha 21364 pipelined router

Shubhendu S. Mukherjee; Federico Silla; Peter J. Bannon; Joel S. Emer; Steven Lang; David Wightman Webb

Interconnection networks usually consist of a fabric of interconnected routers, which receive packets arriving at their input ports and forward them to appropriate output ports. Unfortunately, network packets moving through these routers are often delayed due to conflicting demand for resources, such as output ports or buffer space. Hence, routers typically employ arbiters that resolve conflicting resource demands to maximize the number of matches between packets waiting at input ports and free output ports. Efficient design and implementation of the algorithm running on these arbiters is critical to maximize network performance.This paper proposes a new arbitration algorithm called SPAA (Simple Pipelined Arbitration Algorithm), which is implemented in the Alpha 21364 processors on-chip router pipeline. Simulation results show that SPAA significantly outperforms two earlier well-known arbitration algorithms: PIM (Parallel Iterative Matching) and WFA (Wave-Front Arbiter) implemented in the SGI Spider switch. SPAA outperforms PIM and WFA because SPAA exhibits matching capabilities similar to PIM and WFA under realistic conditions when many output ports are busy, incurs fewer clock cycles to perform the arbitration, and can be pipelined effectively. Additionally, we propose a new prioritization policy called the Rotary Rule, which prevents the networks adverse performance degradation from saturation at high network loads by prioritizing packets already in the network over new packets generated by caches or memory.

Journal of Parallel and Distributed Computing | 2001

A Comparison of Router Architectures for Virtual Cut-Through and Wormhole Switching in a NOW Environment

José Duato; Antonio Robles; Federico Silla; Ramón Beivide

Most multicomputer interconnection networks use wormhole switching, leading to fast and compact routers. Current routers incorporate virtual channels and even fully adaptive routing. Networks of workstations (NOWs) inherited multicomputer technology. Most commercial routers designed for NOWs implement wormhole switching. However, wormhole switching is not well suited for NOWs. The long wires required in this environment lead to large buffers to prevent buffer overflow during flow control signaling. Moreover, wire length is limited by buffer size. Virtual cut-through (VCT) achieves a higher throughput than wormhole switching. However, buffer requirements and packetizing overhead prevented its widespread use in multicomputers. Nevertheless, wormhole and VCT switching require similar buffer capacity in NOWs. Moreover, some messaging layers such as Illinois Fast Messages (FM) and BIP split messages into packets for increased performance. Therefore, the traditional disadvantages of VCT switching disappear in NOWs. In this paper, we show that VCT routers can be simpler than wormhole routers, while still achieving the advantages of using virtual channels and adaptive routing. We also propose a fully adaptive routing algorithm for VCT switching in a NOW environment. Moreover, we show that VCT routers outperform wormhole routers in a NOW environment at a lower cost. Also, VCT routers require buffer capacity independent of wire length, making them suitable for networks of workstations.

international conference on parallel processing | 2011

Performance of CUDA Virtualized Remote GPUs in High Performance Clusters

José Duato; Antonio J. Peña; Federico Silla; Rafael Mayo; Enrique S. Quintana-Ortí

In a previous work we presented the architecture of rCUDA, a middleware that enables CUDA remoting over a commodity network. That is, the middleware allows an application to use a CUDA-compatible Graphics Processor (GPU) installed in a remote computer as if it were installed in the computer where the application is being executed. This approach is based on the observation that GPUs in a cluster are not usually fully utilized, and it is intended to reduce the number of GPUs in the cluster, thus lowering the costs related with acquisition and maintenance while keeping performance close to that of the fully-equipped configuration. In this paper we model rCUDA over a series of high throughput networks in order to assess the influence of the performance of the underlying network on the performance of our virtualization technique. For this purpose, we analyze the traces of two different case studies over two different networks. Using this data, we calculate the expected performance for these same case studies over a series of high throughput networks, in order to characterize the expected behavior of our solution in high performance clusters. The estimations are validated using real 1 Gbps Ethernet and 40 Gbps InfiniBand networks, showing an error rate in the order of 1% for executions involving data transfers above 40 MB. In summary, although our virtualization technique noticeably increases execution time when using a 1 Gbps Ethernet network, it performs almost as efficiently as a local GPU when higher performance interconnects are used. Therefore, the small overhead incurred by our proposal because of the remote use of GPUs is worth the savings that a cluster configuration with less GPUs than nodes reports.

Explore More