Publication


Featured research published by Salvador Coll.


High Performance Interconnects | 2001

The Quadrics network (QsNet): high-performance clustering technology

Fabrizio Petrini; Wu-chun Feng; Adolfy Hoisie; Salvador Coll; Eitan Frachtenberg

The Quadrics interconnection network (QsNet) contributes two novel innovations to the field of high-performance interconnects: (1) integration of the virtual-address spaces of individual nodes into a single, global, virtual-address space and (2) network fault tolerance via link-level and end-to-end protocols that can detect faults and automatically retransmit packets. QsNet achieves these feats by extending the native operating system in the nodes with a network operating system and specialized hardware support in the network interface. As these and other important features of QsNet can be found in the InfiniBand specification, QsNet can be viewed as a precursor to InfiniBand. In this paper, we present an initial performance evaluation of QsNet. We first describe the main hardware and software features of QsNet, followed by the results of benchmarks that we ran on our experimental, Intel-based, Linux cluster built around QsNet. Our initial analysis indicates that QsNet performs remarkably well, e.g., user-level latency under 2 µs and bandwidth over 300 MB/s.
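
The fault-tolerance mechanism described above is, at its core, an acknowledge-and-retransmit protocol. The C sketch below illustrates that idea in the abstract; it is not QsNet's firmware, and names such as `on_timeout`, `WINDOW`, and `MAX_RETRIES` are invented for the example.

```c
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

#define WINDOW      8   /* unacked packets buffered by the sender        */
#define MAX_RETRIES 4   /* give up and report a link fault after this    */

struct pkt {
    int  seq;
    bool acked;
    int  retries;
    char payload[64];
};

static struct pkt window[WINDOW];

/* Timeout for sequence number `seq`: retransmit the buffered copy,
 * or escalate to a fault if the retry budget is exhausted. */
static bool on_timeout(int seq, void (*send)(const struct pkt *))
{
    struct pkt *p = &window[seq % WINDOW];
    if (p->acked)
        return true;              /* the ack raced the timer: no-op */
    if (++p->retries > MAX_RETRIES)
        return false;             /* escalate: link declared faulty */
    send(p);                      /* automatic retransmission       */
    return true;
}

/* End-to-end ack for `seq` arrived: release the sender-side copy. */
static void on_ack(int seq)
{
    window[seq % WINDOW].acked = true;
}

static void fake_send(const struct pkt *p)
{
    printf("retransmitting seq %d (attempt %d)\n", p->seq, p->retries);
}

int main(void)
{
    struct pkt *p = &window[3 % WINDOW];
    p->seq = 3;
    strcpy(p->payload, "hello");
    on_timeout(3, fake_send);   /* first timeout: packet resent      */
    on_ack(3);                  /* ack arrives                       */
    on_timeout(3, fake_send);   /* later timeout: already acked, no-op */
    return 0;
}
```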


International Symposium on Microarchitecture | 2002

The Quadrics network: high-performance clustering technology

Fabrizio Petrini; Wu-chun Feng; Adolfy Hoisie; Salvador Coll; Eitan Frachtenberg

The Quadrics network extends the native operating system in processing nodes with a network operating system and specialized hardware support in the network interface. Doing so integrates the individual nodes' address spaces into a single, global, virtual address space and provides network fault tolerance.


Conference on High Performance Computing (Supercomputing) | 2002

STORM: Lightning-Fast Resource Management

Eitan Frachtenberg; Fabrizio Petrini; Juan C. Fernandez; Scott Pakin; Salvador Coll

Although workstation clusters are a common platform for high-performance computing (HPC), they remain more difficult to manage than sequential systems or even symmetric multiprocessors. Furthermore, as cluster sizes increase, the quality of the resource-management subsystem — essentially, all of the code that runs on a cluster other than the applications — increasingly impacts application efficiency. In this paper, we present STORM, a resource-management framework designed for scalability and performance. The key innovation behind STORM is a software architecture that enables resource management to exploit low-level network features. As a result of this HPC-application-like design, STORM is orders of magnitude faster than the best reported results in the literature on two sample resource-management functions: job launching and process scheduling.
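
To make the "HPC-application-like design" concrete, here is a minimal sketch of how a cluster-wide context switch can ride on two network primitives. The `hw_bcast` and `hw_barrier` stubs are assumptions standing in for the low-level hardware collectives STORM exploits; this illustrates the idea, not STORM's actual code.

```c
#include <stdio.h>

/* Stubs standing in for hardware collectives; on a real Quadrics
 * cluster each would be a single NIC-level operation. (The names are
 * invented for this sketch.) */
static void hw_bcast(const void *buf, int len) { (void)buf; (void)len; }
static void hw_barrier(void) { }

struct heartbeat { int epoch; int job_to_run; };

/* Master side of a global context switch: one multicast tells every
 * node which job owns the next timeslice, one barrier confirms that
 * all nodes have switched. */
static void master_tick(int epoch, int job)
{
    struct heartbeat hb = { epoch, job };
    hw_bcast(&hb, sizeof hb);
    hw_barrier();
    printf("epoch %d: job %d scheduled cluster-wide\n", epoch, job);
}

int main(void)
{
    for (int epoch = 0; epoch < 3; epoch++)
        master_tick(epoch, epoch % 2);   /* alternate two gangs */
    return 0;
}
```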


Network Computing and Applications | 2001

Hardware- and software-based collective communication on the Quadrics network

Fabrizio Petrini; Salvador Coll; Eitan Frachtenberg; Adolfy Hoisie

The efficient implementation of collective communication patterns in a parallel machine is a challenging design effort that requires the solution of many problems. In this paper we present an in-depth description of how the Quadrics network supports both hardware- and software-based collectives. We describe the main features of the two building blocks of this network: a network interface that can perform zero-copy user-level communication and a wormhole routing switch. We also focus our attention on the routing and flow-control algorithms, deadlock avoidance, and how the processing nodes are integrated in a global, virtual shared memory. Experimental results conducted on a 64-node AlphaServer cluster indicate that the time to complete the hardware-based barrier synchronization on the whole network is as low as 6 µs, with very good scalability. Good latency and scalability are also achieved with the software-based synchronization, which takes about 15 µs. With the broadcast, similar performance is achieved by the hardware- and software-based implementations, which can deliver messages of up to 256 bytes in 13 µs and can sustain an asymptotic bandwidth of 288 MB/s to all the nodes. The hardware-based barrier is almost insensitive to network congestion, with 93% of the synchronizations taking less than 20 µs when the network is flooded with background traffic of unicast messages. On the other hand, the software-based implementation suffers a significant performance degradation. Under high load the hardware broadcast maintains a reasonably good latency, delivering messages of up to 2 KB in 200 µs, while the software broadcast suffers slightly higher latencies inherited from its synchronization mechanism. Both broadcast algorithms experience a significant degradation of the sustained bandwidth with large messages.
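
For intuition about the software-based barrier figure, the sketch below models a textbook dissemination barrier, which synchronizes n nodes in ceil(log2 n) rounds. The 2.5 µs per-round latency is an assumed number chosen only to show how a logarithmic scheme lands near the reported 15 µs on 64 nodes; it is not a measurement from the paper.

```c
#include <math.h>
#include <stdio.h>

/* Dissemination barrier: in round k each node signals its partner at
 * distance 2^k, so ceil(log2(n)) rounds synchronize everyone. The
 * per-round latency below is an assumption for illustration only. */
int main(void)
{
    const double round_us = 2.5;   /* assumed one-way latency per round */
    for (int n = 2; n <= 64; n *= 2) {
        int rounds = (int)ceil(log2(n));
        printf("%2d nodes: %d rounds, ~%.1f us\n",
               n, rounds, rounds * round_us);
    }
    return 0;
}
```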


Concurrency and Computation: Practice and Experience | 2003

Using multirail networks in high‐performance clusters

Salvador Coll; Eitan Frachtenberg; Fabrizio Petrini; Adolfy Hoisie; Leonid Gurvits

Using multiple independent networks (also known as rails) is an emerging technique for overcoming the bandwidth limitations and enhancing the fault tolerance of current high-performance parallel computers. In this paper, we present and analyze various algorithms to allocate multiple communication rails, including static and dynamic allocation schemes. An analytical lower bound on the number of rails required for static rail allocation is shown. We also present an extensive experimental comparison of the behavior of various algorithms in terms of bandwidth and latency. We show that striping messages over multiple rails can substantially reduce network latency, depending on average message size, network load, and allocation scheme. The methods compared include static rail allocation, basic round-robin rail allocation, local-dynamic allocation based on local knowledge, and dynamic rail allocation that reserves both communication endpoints of a message before sending it. The last method is shown to perform better than the others at higher loads: up to 49% better than local-knowledge allocation and 37% better than round-robin allocation. This allocation scheme also shows lower latency and saturates at higher loads (for long enough messages). Most importantly, the proposed allocation scheme scales well with the number of rails and with message size. In addition, we propose a hybrid algorithm that combines the benefits of the local-dynamic allocation for short messages with those of the dynamic algorithm for large messages.
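
As a rough illustration of two of the allocation schemes compared above, the C sketch below contrasts round-robin allocation with a local-dynamic policy that picks the locally least-loaded rail. It is a toy model under assumed bookkeeping (the `outstanding` counters are invented for the example), not the paper's implementation. Note that the local policy sees only this node's traffic, which is why the fully dynamic scheme that reserves both endpoints wins at high load.

```c
#include <stdio.h>

#define NRAILS 4

/* Round-robin: cycle through rails regardless of their state. */
static int rr_next = 0;
static int alloc_round_robin(void)
{
    int r = rr_next;
    rr_next = (rr_next + 1) % NRAILS;
    return r;
}

/* Local-dynamic: pick the rail with the fewest locally outstanding
 * sends. The counters reflect only this node's view of the load. */
static int outstanding[NRAILS];
static int alloc_local_dynamic(void)
{
    int best = 0;
    for (int r = 1; r < NRAILS; r++)
        if (outstanding[r] < outstanding[best])
            best = r;
    outstanding[best]++;
    return best;
}

int main(void)
{
    outstanding[0] = 3;            /* pretend rails 0 and 1 are busy */
    outstanding[1] = 1;
    printf("round-robin picks rail %d\n", alloc_round_robin());    /* 0 */
    printf("local-dynamic picks rail %d\n", alloc_local_dynamic()); /* 2 */
    return 0;
}
```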


Foundations of Computer Science | 2001

Using multirail networks in high-performance clusters

Salvador Coll; Eitan Frachtenberg; Fabrizio Petrini; Adolfy Hoisie; Leonid Gurvits

Using multiple independent networks (also known as rails) is an emerging technique to overcome bandwidth limitations and enhance fault tolerance of current high-performance clusters. We present an extensive experimental comparison of the behavior of various allocation schemes in terms of bandwidth and latency. We show that striping messages over multiple rails can substantially reduce network latency, depending on average message size, network load, and allocation scheme. The compared methods include a basic round-robin rail allocation, a local-dynamic allocation based on local knowledge, and a dynamic rail allocation that reserves both endpoints of a message before sending it. The last method is shown to perform better than the others at higher loads: up to 49% better than local-knowledge allocation and 37% better than the round-robin allocation. This allocation scheme also shows lower latency and saturates at higher loads (for sufficiently large messages). Most importantly, this proposed allocation scheme scales well with the number of rails and with message size. In addition, we propose a hybrid algorithm that combines the benefits of the local-dynamic allocation for short messages with those of the dynamic algorithm for large messages.


High Performance Interconnects | 2003

Scalable collective communication on the ASCI Q machine

Fabrizio Petrini; Juan C. Fernandez; Eitan Frachtenberg; Salvador Coll

Scientific codes spend a considerable part of their run time executing collective communication operations. Such operations can also be critical for efficient resource management in large-scale machines. Scalable collective communication is therefore a key factor in achieving good performance on large-scale parallel computers. In this paper we describe the performance and scalability of some common collective communication patterns on the ASCI Q machine. Experimental results conducted on a 1024-node/4096-processor segment show that the network is fast and scalable: it can barrier-synchronize in a few tens of µs, perform a broadcast with an aggregate bandwidth of more than 100 GB/s, and sustain heavy hot-spot traffic with limited performance degradation.
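
One way to read the broadcast figure: aggregate bandwidth counts every receiver, so dividing by the number of nodes gives the per-node delivered rate. The arithmetic below uses only the numbers quoted in the abstract (and assumes 1 GB = 1024 MB).

```c
#include <stdio.h>

/* Back-of-the-envelope: per-node broadcast rate = aggregate / receivers. */
int main(void)
{
    const double aggregate_gb_s = 100.0;   /* reported aggregate bandwidth */
    const int    nodes          = 1024;    /* receivers in the segment     */
    printf("per-node broadcast rate: ~%.0f MB/s\n",
           aggregate_gb_s * 1024.0 / nodes);
    return 0;
}
```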


International Conference on Parallel Processing | 2001

Gang scheduling with lightweight user-level communication

Eitan Frachtenberg; Fabrizio Petrini; Salvador Coll; Wu-chun Feng

In this paper, we explore the performance of gang scheduling on a cluster using the Quadrics interconnection network. In such a cluster, the scheduler can take advantage of this network's unique capabilities, including a network-interface-card-based processor and memory and efficient user-level communication libraries. We developed a micro-benchmark to test the scheduler's performance under various aspects of parallel job workloads: memory usage, bandwidth- and latency-bound communication, number of processes, timeslice quantum, and multiprogramming level. Our experiments show that the gang scheduler performs relatively well under most workload conditions, is largely insensitive to the number of concurrent jobs in the system, and scales almost linearly with the number of nodes. On the other hand, the scheduler is very sensitive to the timeslice quantum, and values under 30 seconds can incur large overheads and fairness problems.
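
The timeslice sensitivity has a simple back-of-the-envelope explanation: if each cluster-wide context switch costs a fixed amount of time, the fraction of time lost to scheduling is the switch cost divided by (quantum + switch cost). The switch cost below is an assumed figure for illustration, not a measurement from the paper.

```c
#include <stdio.h>

/* Illustrative overhead model for gang scheduling with an assumed
 * fixed cost per global context switch. */
int main(void)
{
    const double switch_s = 2.0;                 /* assumed switch cost */
    const double quanta[] = { 1, 5, 30, 120 };   /* timeslice, seconds  */
    for (int i = 0; i < 4; i++) {
        double q = quanta[i];
        printf("quantum %5.0f s -> %4.1f%% overhead\n",
               q, 100.0 * switch_s / (q + switch_s));
    }
    return 0;
}
```

With these assumed numbers, a 1 s quantum loses about two thirds of the machine to switching, while quanta above 30 s keep the overhead in the single digits, consistent with the sensitivity the paper reports.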


IEEE Transactions on Parallel and Distributed Systems | 2009

Efficient and Scalable Hardware-Based Multicast in Fat-Tree Networks

Salvador Coll; F.J. Mora; José Duato; Fabrizio Petrini

This article presents an efficient and scalable mechanism to overcome the limitations of collective communication in switched interconnection networks in the presence of faults. As current trends in supercomputing move toward massively parallel computers with many thousands of components, reliability becomes a challenge. In such a scenario, fat-tree networks that provide hardware support for collective communication suffer serious performance degradation in the presence of even a single faulty node. This paper describes a new mechanism to provide high-performance collective communication in such situations. The feasibility of the proposed technique is formally demonstrated. We present the design of a new hardware-based routing algorithm for multicast, which is the basis of our proposal. The proposed mechanism is implemented and experimentally evaluated. Our experimental results show that hardware-based multicast trees provide an efficient and scalable solution for collective communication in fat-tree networks, significantly outperforming traditional solutions.
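
The following sketch conveys one ingredient of the problem setting: if a hardware multicast can only address a contiguous interval of nodes, a faulty node inside the interval can be skipped by covering the interval with sub-ranges, each served by its own multicast. This is an illustrative simplification, not the routing algorithm proposed in the paper.

```c
#include <stdio.h>

/* Stand-in for one hardware multicast to a contiguous node interval. */
static void multicast_range(int lo, int hi)
{
    printf("hw multicast to nodes [%d, %d]\n", lo, hi);
}

/* Cover [lo, hi] with sub-intervals that skip the faulty nodes,
 * issuing one hardware multicast per sub-interval. `faulty` must be
 * sorted in ascending order. */
static void multicast_avoiding(int lo, int hi, const int *faulty, int nf)
{
    int start = lo;
    for (int i = 0; i < nf; i++) {
        if (faulty[i] < lo || faulty[i] > hi)
            continue;
        if (faulty[i] > start)
            multicast_range(start, faulty[i] - 1);
        start = faulty[i] + 1;
    }
    if (start <= hi)
        multicast_range(start, hi);
}

int main(void)
{
    const int faulty[] = { 5, 9 };
    multicast_avoiding(0, 15, faulty, 2);   /* -> [0,4], [6,8], [10,15] */
    return 0;
}
```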


International Conference on Cluster Computing | 2002

Scalable resource management in high performance computers

Eitan Frachtenberg; Fabrizio Petrini; Juan C. Fernandez; Salvador Coll

Clusters of workstations have emerged as an important platform for building cost-effective, scalable, and highly available computers. Although many hardware solutions are available today, the largest challenge in making large-scale clusters usable lies in the system software. In this paper we present STORM, a resource-management tool designed to provide scalability, low overhead, and the flexibility necessary to efficiently support and analyze a wide range of job-scheduling algorithms. STORM achieves these feats by using a small set of primitive mechanisms that are common in modern high-performance interconnects. The architecture of STORM is based on three main technical innovations. First, part of the scheduler runs in the thread processor located on the network interface. Second, we use hardware collectives, which are highly scalable, both to implement control heartbeats and to distribute the binary of a parallel job in near-constant time. Third, we use an I/O bypass protocol that allows fast data movement from the file system to the communication buffers in the network interface and vice versa. The experimental results show that STORM can launch a job with a binary of 12 MB on a 64-processor, 32-node cluster in less than 250 ms. This paper provides experimental and analytical evidence that these results scale to a much larger number of nodes. To the best of our knowledge, STORM significantly outperforms existing production schedulers in launching jobs, performing resource-management tasks, and gang-scheduling tasks.
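
The "near-constant time" claim for binary distribution follows from using a hardware multicast: transfer time depends on the binary size and the sustained multicast bandwidth, not on the node count. The model below uses the 12 MB binary from the abstract together with an assumed multicast bandwidth; it is a rough estimate, not a reproduction of the paper's measurements.

```c
#include <stdio.h>

/* Rough launch-time model for multicast-based binary distribution. */
int main(void)
{
    const double binary_mb  = 12.0;    /* job binary size from the paper */
    const double bcast_mb_s = 150.0;   /* assumed sustained multicast BW */
    double t_ms = binary_mb / bcast_mb_s * 1000.0;
    printf("estimated transfer time: ~%.0f ms, independent of node count\n",
           t_ms);
    return 0;
}
```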

Collaboration

Salvador Coll's top co-authors and their affiliations:

Adolfy Hoisie, Pacific Northwest National Laboratory
Juan C. Fernandez, Los Alamos National Laboratory
Leonid Gurvits, Los Alamos National Laboratory
Scott Pakin, Los Alamos National Laboratory
F.J. Mora, Polytechnic University of Valencia
José Duato, Polytechnic University of Valencia