José G. Castaños
IBM
Publication
Featured research published by José G. Castaños.
High-Performance Computer Architecture (HPCA) | 2002
Călin Caşcaval; José G. Castaños; Luis Ceze; Monty M. Denneau; Manish Gupta; Derek Lieber; José E. Moreira; Karin Strauss; Henry S. Warren
Cyclops is a new architecture for high-performance parallel computers that is being developed at the IBM T. J. Watson Research Center. The basic cell of this architecture is a single-chip SMP (symmetric multiprocessor) system with multiple threads of execution, embedded memory and integrated communications hardware. Massive intra-chip parallelism is used to tolerate memory and functional unit latencies. Large systems with thousands of chips can be built by replicating this basic cell in a regular pattern. In this paper, we describe the Cyclops architecture and evaluate two of its new hardware features: a memory hierarchy with a flexible cache organization and fast barrier hardware. Our experiments with the STREAM benchmark show that a particular design can achieve a sustainable memory bandwidth of 40 GB/s, equal to the peak hardware bandwidth and similar to the performance of a 128-processor SGI Origin 3800. For small vectors, we have observed in-cache bandwidth above 80 GB/s. We also show that the fast barrier hardware can improve the performance of the Splash-2 FFT kernel by up to 10%. Our results demonstrate that the Cyclops approach of integrating a large number of simple processing elements and multiple memory banks in the same chip is an effective alternative for designing high-performance systems.
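The bandwidth figures above come from the STREAM benchmark. As a rough, self-contained illustration of what STREAM measures, the C sketch below times the "triad" kernel and reports the implied memory bandwidth; the array size and timing harness are illustrative choices, not the configuration used in the paper.

```c
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N (8L * 1024 * 1024)   /* illustrative array size, not the paper's setup */

int main(void)
{
    double *a = malloc(N * sizeof *a);
    double *b = malloc(N * sizeof *b);
    double *c = malloc(N * sizeof *c);
    const double q = 3.0;

    for (long i = 0; i < N; i++) { b[i] = 1.0; c[i] = 2.0; }

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);

    /* STREAM "triad": one store and two loads per element */
    for (long i = 0; i < N; i++)
        a[i] = b[i] + q * c[i];

    clock_gettime(CLOCK_MONOTONIC, &t1);
    double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;

    /* three arrays of N doubles cross the memory interface */
    double gbytes = 3.0 * N * sizeof(double) / 1e9;
    printf("triad bandwidth: %.2f GB/s\n", gbytes / secs);

    free(a); free(b); free(c);
    return 0;
}
```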
IBM Journal of Research and Development | 2005
George S. Almasi; Charles J. Archer; José G. Castaños; John A. Gunnels; C. Christopher Erway; Philip Heidelberger; Xavier Martorell; José E. Moreira; Kurt Walter Pinnow; Joe Ratterman; Burkhard Steinmacher-Burow; William Gropp; Brian R. Toonen
The Blue Gene®/L (BG/L) supercomputer, with 65,536 dual-processor compute nodes, was designed from the ground up to support efficient execution of massively parallel message-passing programs. Part of this support is an optimized implementation of the Message Passing Interface (MPI), which leverages the hardware features of BG/L. MPI for BG/L is implemented on top of a more basic message-passing infrastructure called the message layer. This message layer can be used both to implement other higher-level libraries and directly by applications. MPI and the message layer are used in the two BG/L modes of operation: the coprocessor mode and the virtual node mode. Performance measurements show that our message-passing services deliver performance close to the hardware limits of the machine. They also show that dedicating one of the processors of a node to communication functions (coprocessor mode) greatly improves the message-passing bandwidth, whereas running two processes per compute node (virtual node mode) can have a positive impact on application performance.
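Message-passing bandwidth of the kind reported above is typically measured with a ping-pong microbenchmark between two ranks. The sketch below is a generic MPI version of such a test, using only standard MPI calls rather than the BG/L message layer; the message size and repetition count are illustrative assumptions.

```c
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

/* Ping-pong between ranks 0 and 1; size and repetition count are
   illustrative, not those used in the BG/L measurements. */
#define MSG_BYTES (1 << 20)
#define REPS      100

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    char *buf = malloc(MSG_BYTES);
    double t0 = MPI_Wtime();

    for (int i = 0; i < REPS; i++) {
        if (rank == 0) {
            MPI_Send(buf, MSG_BYTES, MPI_BYTE, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, MSG_BYTES, MPI_BYTE, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(buf, MSG_BYTES, MPI_BYTE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(buf, MSG_BYTES, MPI_BYTE, 0, 0, MPI_COMM_WORLD);
        }
    }

    double elapsed = MPI_Wtime() - t0;
    if (rank == 0) {
        /* two messages cross the network per iteration */
        double gbytes = 2.0 * REPS * MSG_BYTES / 1e9;
        printf("ping-pong bandwidth: %.2f GB/s\n", gbytes / elapsed);
    }

    free(buf);
    MPI_Finalize();
    return 0;
}
```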
High-Performance Computer Architecture (HPCA) | 2006
Hao Yu; Ramendra K. Sahoo; C. Howson; George S. Almasi; José G. Castaños; Manish Gupta; José E. Moreira; Jeffrey J. Parker; Thomas Eugene Engelsiepen; Robert B. Ross; Rajeev Thakur; Robert Latham; William Gropp
Parallel I/O plays a crucial role for most data-intensive applications running on massively parallel systems like Blue Gene/L, which promise enormous computational capability. We designed and implemented a highly scalable parallel file I/O architecture for Blue Gene/L that leverages the hierarchical and functional partitioning of the system software, with separate computational and I/O cores. The architecture exploits the scalability of GPFS (General Parallel File System) at the back end, while using MPI I/O as an interface between application I/O and the file system. We demonstrate the impact of our high-performance I/O solution for Blue Gene/L with a comprehensive evaluation consisting of a number of widely used parallel I/O benchmarks and I/O-intensive applications. Our design and implementation not only delivers at least an order of magnitude speedup in I/O bandwidth for the real-scale application HOMME (achieving aggregate bandwidths of 1.8 GB/s and 2.3 GB/s for write and read accesses, respectively), but also supports high-level parallel I/O data interfaces such as parallel HDF5 and parallel NetCDF scaling up to a large number of processors.
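MPI I/O, the interface between application I/O and the file system mentioned above, lets every rank write its piece of a shared file through collective calls. The sketch below is a minimal, generic example of a collective MPI-IO write; the file name and block size are illustrative assumptions, unrelated to the benchmarks in the paper.

```c
#include <mpi.h>
#include <stdlib.h>

/* Each rank writes its own contiguous block of a shared file using a
   collective MPI-IO call; file name and block size are illustrative. */
#define BLOCK_DOUBLES (1 << 20)

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double *block = malloc(BLOCK_DOUBLES * sizeof *block);
    for (int i = 0; i < BLOCK_DOUBLES; i++)
        block[i] = (double)rank;

    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, "shared.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

    /* Offsets place the ranks' blocks one after another in the file. */
    MPI_Offset offset = (MPI_Offset)rank * BLOCK_DOUBLES * sizeof(double);
    MPI_File_write_at_all(fh, offset, block, BLOCK_DOUBLES, MPI_DOUBLE,
                          MPI_STATUS_IGNORE);

    MPI_File_close(&fh);
    free(block);
    MPI_Finalize();
    return 0;
}
```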
ACM SIGARCH Computer Architecture News | 2003
George S. Almasi; Cǎlin Caşcaval; José G. Castaños; Monty M. Denneau; Derek Lieber; José E. Moreira; Henry S. Warren
Multiprocessor systems-on-a-chip offer a structured approach to managing complexity in chip design. Cyclops is a new family of multithreaded architectures that integrates processing logic, main memory, and communications hardware on a single chip. Its simple, hierarchical design allows the hardware architect to manage a large number of components to meet design constraints in terms of performance, power, or application domain. This paper evaluates several alternative Cyclops designs with different relative costs and trade-offs. We compare the performance of several scientific kernels running on different configurations of this architecture. We show that by increasing the number of threads sharing a floating-point unit we can hide fairly high cache and memory latencies. We prove that we can reach the theoretical peak performance of the chip, and we identify the optimal balance of components for each application. We demonstrate that the design is well adapted to problems that are difficult to optimize; for example, sparse matrix-vector multiplication achieves 16 GFlops out of a peak of 32 GFlops.
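For reference, sparse matrix-vector multiplication in the commonly used compressed sparse row (CSR) format looks like the kernel below; its indirect, irregular loads of the input vector are what make it hard to optimize on conventional memory hierarchies. This is a generic sketch, not the Cyclops implementation evaluated in the paper.

```c
#include <stddef.h>

/* y = A*x for a sparse matrix A stored in compressed sparse row (CSR) form.
   row_ptr has nrows+1 entries; col_idx and vals hold the nonzeros. */
void spmv_csr(size_t nrows,
              const size_t *row_ptr,
              const size_t *col_idx,
              const double *vals,
              const double *x,
              double *y)
{
    for (size_t i = 0; i < nrows; i++) {
        double sum = 0.0;
        /* Indirect loads of x through col_idx are the latency-bound part
           of this kernel on conventional memory hierarchies. */
        for (size_t k = row_ptr[i]; k < row_ptr[i + 1]; k++)
            sum += vals[k] * x[col_idx[k]];
        y[i] = sum;
    }
}
```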
Job Scheduling Strategies for Parallel Processing | 2002
Elie Krevat; José G. Castaños; José E. Moreira
BlueGene/L is a massively parallel cellular architecture system with a toroidal interconnect. Cellular architectures with a toroidal interconnect are effective at producing highly scalable computing systems, but typically require job partitions to be both rectangular and contiguous. These restrictions introduce fragmentation issues that affect the utilization of the system and the wait time and slowdown of queued jobs. We propose to solve these problems for the BlueGene/L system through scheduling algorithms that augment a baseline first-come, first-served (FCFS) scheduler. Restricting ourselves to space-sharing techniques, which constitute a simpler solution to the requirements of cellular computing, we present simulation results for migration and backfilling techniques on BlueGene/L. These techniques are explored individually and jointly to determine their impact on the system. Our results demonstrate that migration can be effective for a pure FCFS scheduler but that backfilling produces even more benefits. We also show that migration can be combined with backfilling to produce more opportunities to better utilize a parallel machine.
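As a rough illustration of the backfilling idea discussed above, the sketch below shows a simplified test for whether a queued job may start ahead of the job at the head of an FCFS queue. It ignores BlueGene/L's requirement that partitions be rectangular and contiguous, and it is not the scheduler simulated in the paper.

```c
#include <stdbool.h>

/* Simplified backfilling test: a queued job may jump ahead of the first
   job in line if it fits in the currently free nodes and, by its
   user-supplied runtime estimate, will finish before the time at which
   enough nodes become free for that first job.  BlueGene/L's contiguous
   rectangular-partition constraint is ignored here for clarity. */
typedef struct {
    int    nodes;         /* nodes requested by the job         */
    double est_runtime;   /* user-supplied runtime estimate (s) */
} job_t;

bool can_backfill(const job_t *candidate,
                  int free_nodes,
                  double now,
                  double head_job_reservation_time)
{
    if (candidate->nodes > free_nodes)
        return false;   /* does not fit in the free nodes right now      */
    if (now + candidate->est_runtime <= head_job_reservation_time)
        return true;    /* finishes before the head job is due to start  */
    return false;
}
```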
Conference on High Performance Computing (Supercomputing) | 2006
José E. Moreira; Michael Brutman; José G. Castaños; Thomas Eugene Engelsiepen; Mark E. Giampapa; Tom Gooding; Roger L. Haskin; Todd Inglett; Derek Lieber; Patrick McCarthy; Michael Mundy; Jeffrey J. Parker; Brian Paul Wallenfelt
Blue Gene/L is currently the world's fastest and most scalable supercomputer. It has demonstrated essentially linear scaling all the way to 131,072 processors in several benchmarks and real applications. The operating systems for the compute and I/O nodes of Blue Gene/L are among the components responsible for that scalability. Compute nodes are dedicated to running application processes, whereas I/O nodes are dedicated to performing system functions. The operating systems adopted for each of these nodes reflect this separation of function. Compute nodes run a lightweight operating system called the compute node kernel. I/O nodes run a port of the Linux operating system. This paper discusses the architecture and design of this solution for Blue Gene/L in the context of the hardware characteristics that led to the design decisions. It also explains and demonstrates how those decisions are instrumental in achieving the performance and scalability for which Blue Gene/L is famous.
IBM Journal of Research and Development | 2005
José E. Moreira; George S. Almasi; Charles J. Archer; Ralph Bellofatto; Peter Bergner; José R. Brunheroto; Michael Brutman; José G. Castaños; Paul G. Crumley; Manish Gupta; Todd Inglett; Derek Lieber; David Limpert; Patrick McCarthy; Mark Megerian; Mark P. Mendell; Michael Mundy; Don Reed; Ramendra K. Sahoo; Alda Sanomiya; Richard Shok; Brian E. Smith; Greg Stewart
With up to 65,536 compute nodes and a peak performance of more than 360 teraflops, the Blue Gene®/L (BG/L) supercomputer represents a new level of massively parallel systems. The system software stack for BG/L creates a programming and operating environment that harnesses the raw power of this architecture with great effectiveness. The design and implementation of this environment followed three major principles: simplicity, performance, and familiarity. By specializing the services provided by each component of the system architecture, we were able to keep each one simple and leverage the BG/L hardware features to deliver high performance to applications. We also implemented standard programming interfaces and programming languages that greatly simplified the job of porting applications to BG/L. The effectiveness of our approach has been demonstrated by the operational success of several prototype and production machines, which have already been scaled to 16,384 nodes.
International Journal of Parallel Programming | 2002
George S. Almasi; Calin Cascaval; José G. Castaños; Monty M. Denneau; Wilm E. Donath; Maria Eleftheriou; Mark E. Giampapa; C. T. Howard Ho; Derek Lieber; José E. Moreira; Dennis M. Newns; Marc Snir; Henry S. Warren
The IBM Blue Gene/C parallel computer aims to demonstrate the feasibility of a cellular architecture computer with millions of concurrent threads of execution. One of the major challenges in this project is showing that applications can successfully scale to this massive amount of parallelism. In this paper we demonstrate that the simulation of protein folding using classical molecular dynamics falls in this category. Starting from the sequential version of a well known molecular dynamics code, we developed a new parallel implementation that exploited the multiple levels of parallelism present in the Blue Gene/C cellular architecture. We performed both analytical and simulation studies of the behavior of this application when executed on a very large number of threads. As a result, we demonstrate that this class of applications can execute efficiently on a large cellular machine.
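One of the levels of parallelism such a molecular dynamics code exposes is the loop over atoms in the force computation. The sketch below illustrates that idea with a generic Lennard-Jones force loop parallelized across threads with an OpenMP pragma; it is a stand-in illustration only, not the Blue Gene/C implementation developed in the paper.

```c
/* Generic Lennard-Jones force loop: the outer loop over atoms is divided
   among threads, each thread accumulating forces for its own atoms.
   Compile with OpenMP support (e.g. -fopenmp) to enable the pragma. */
void compute_forces(int n, const double (*pos)[3], double (*force)[3],
                    double epsilon, double sigma)
{
    #pragma omp parallel for schedule(dynamic)
    for (int i = 0; i < n; i++) {
        double fi[3] = {0.0, 0.0, 0.0};
        for (int j = 0; j < n; j++) {
            if (j == i) continue;
            double d[3], r2 = 0.0;
            for (int k = 0; k < 3; k++) {
                d[k] = pos[i][k] - pos[j][k];
                r2 += d[k] * d[k];
            }
            double sr2 = sigma * sigma / r2;
            double sr6 = sr2 * sr2 * sr2;
            /* Lennard-Jones force magnitude divided by r, so f = coef * d */
            double coef = 24.0 * epsilon * sr6 * (2.0 * sr6 - 1.0) / r2;
            for (int k = 0; k < 3; k++)
                fi[k] += coef * d[k];
        }
        for (int k = 0; k < 3; k++)
            force[i][k] = fi[k];
    }
}
```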
Lecture Notes in Computer Science | 2003
George S. Almasi; Charles J. Archer; José G. Castaños; Manish Gupta; Xavier Martorell; José E. Moreira; William Gropp; Silvius Rus; Brian R. Toonen
The BlueGene/L computer uses system-on-a-chip integration and a highly scalable 65,536-node cellular architecture to deliver 360 Tflops of peak computing power. Efficient operation of the machine requires a fast, scalable, and standards-compliant MPI library. In this paper, we discuss our efforts to port the MPICH2 library to BlueGene/L.
International Journal of Parallel Programming | 2007
José E. Moreira; Valentina Salapura; George S. Almasi; Charles J. Archer; Ralph Bellofatto; Peter Edward Bergner; Randy Bickford; Matthias A. Blumrich; José R. Brunheroto; Arthur A. Bright; Michael Brian Brutman; José G. Castaños; Dong Chen; Paul W. Coteus; Paul G. Crumley; Sam Ellis; Thomas Eugene Engelsiepen; Alan Gara; Mark E. Giampapa; Tom Gooding; Shawn A. Hall; Ruud A. Haring; Roger L. Haskin; Philip Heidelberger; Dirk Hoenicke; Todd A. Inglett; Gerard V. Kopcsay; Derek Lieber; David Roy Limpert; Patrick Joseph McCarthy
The Blue Gene/L system at the Department of Energy's Lawrence Livermore National Laboratory in Livermore, California, is the world's most powerful supercomputer. It has achieved groundbreaking performance in both standard benchmarks and real scientific applications. In the process, it has enabled new science that simply could not be done before. Blue Gene/L was developed by a relatively small team of dedicated scientists and engineers. This article is both a description of the Blue Gene/L supercomputer and an account of how that system was designed, developed, and delivered. It reports on the technical characteristics of the system that made it possible to build such a powerful supercomputer. It also reports on how teams across the world worked around the clock to accomplish this milestone of high-performance computing.