Publications


Featured research published by José E. Moreira.


international conference on supercomputing | 2005

Optimization of MPI collective communication on BlueGene/L systems

George S. Almasi; Philip Heidelberger; Charles J. Archer; Xavier Martorell; C. Christopher Erway; José E. Moreira; Burkhard Steinmacher-Burow; Yili Zheng

BlueGene/L is currently the world's fastest supercomputer. It consists of a large number of low-power dual-processor compute nodes interconnected by high-speed torus and collective networks. Because compute nodes do not have shared memory, MPI is the natural programming model for this machine. The BlueGene/L MPI library is a port of MPICH2. In this paper we discuss the implementation of MPI collectives on BlueGene/L. The MPICH2 implementation of MPI collectives is based on point-to-point communication primitives, which turns out to be suboptimal for a number of reasons. Machine-optimized MPI collectives are necessary to harness the performance of BlueGene/L. We discuss these optimized collectives, describing the algorithms and presenting performance results measured with targeted micro-benchmarks on real BlueGene/L hardware with up to 4096 compute nodes.
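The point-to-point collectives the paper improves on can be illustrated with recursive doubling, a standard allreduce algorithm over a power-of-two number of ranks (MPICH2 uses it, among other algorithms, for allreduce). The sketch below simulates the exchange pattern in plain Python rather than real MPI, purely to show the log2(n)-round structure:

```python
def allreduce_recursive_doubling(values):
    """Simulate a recursive-doubling allreduce (sum) over 2^k ranks.

    In each round, rank r exchanges its partial sum with rank r XOR mask,
    so after log2(n) rounds every rank holds the global sum.
    """
    n = len(values)
    assert n & (n - 1) == 0, "rank count must be a power of two"
    partial = list(values)          # partial[r] = rank r's current partial sum
    mask = 1
    while mask < n:
        # Every pair (r, r ^ mask) swaps and combines its partial sums.
        partial = [partial[r] + partial[r ^ mask] for r in range(n)]
        mask <<= 1
    return partial                  # every rank now holds the same total

print(allreduce_recursive_doubling([1, 2, 3, 4]))  # [10, 10, 10, 10]
```

The machine-optimized BlueGene/L collectives instead exploit the torus and collective networks directly; this baseline is what they are measured against.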


IEEE Transactions on Parallel and Distributed Systems | 2003

An integrated approach to parallel scheduling using gang-scheduling, backfilling, and migration

Yanyong Zhang; Hubertus Franke; José E. Moreira; Anand Sivasubramaniam

Effective scheduling strategies to improve response times, throughput, and utilization are an important consideration in large supercomputing environments. Parallel machines in these environments have traditionally used space-sharing strategies to accommodate multiple jobs at the same time by dedicating the nodes to a single job until it completes. This approach, however, can result in low system utilization and large job wait times. This paper discusses three techniques that can be used beyond simple space-sharing to improve the performance of large parallel systems. The first technique we analyze is backfilling, the second is gang-scheduling, and the third is migration. The main contribution of this paper is an analysis of the effects of combining the above techniques. Using extensive simulations based on detailed models of realistic workloads, the benefits of combining the various techniques are shown over a spectrum of performance criteria.
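Of the three techniques, backfilling is the most algorithmically self-contained: a later job may jump ahead in the queue if doing so does not delay the first queued job. A simplified sketch of one EASY-style scheduling pass (an illustration, not the paper's exact policy; job tuples and the single-pass structure are assumptions):

```python
def easy_backfill(now, free, queue, running):
    """One pass of EASY-style backfilling (a simplified sketch).

    now:     current time
    free:    number of idle nodes
    queue:   FCFS list of (name, nodes, estimated_runtime)
    running: list of (end_time, nodes_released) for executing jobs
    Returns names of queued jobs that may start immediately.
    """
    queue, started = list(queue), []
    # Start jobs from the head of the queue while they fit.
    while queue and queue[0][1] <= free:
        name, nodes, _ = queue.pop(0)
        free -= nodes
        started.append(name)
    if not queue:
        return started
    # Head job is blocked: find its reservation ("shadow") time, i.e. the
    # earliest instant at which enough running jobs will have finished.
    head_nodes = queue[0][1]
    avail, shadow, extra = free, None, 0
    for end, nodes in sorted(running):
        avail += nodes
        if avail >= head_nodes:
            shadow = end
            extra = avail - head_nodes      # nodes the head job leaves spare
            break
    if shadow is None:
        return started
    # A later job may backfill if it finishes before the shadow time, or if
    # it only uses nodes the head job will not need.
    for name, nodes, est in queue[1:]:
        if nodes <= free and now + est <= shadow:
            free -= nodes
            started.append(name)
        elif nodes <= min(free, extra):
            free -= nodes
            extra -= nodes
            started.append(name)
    return started
```

With 4 idle nodes, a blocked 6-node head job, and 4 more nodes freeing at t=5, a short 2-node job can backfill while a long 4-node job cannot.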


international parallel and distributed processing symposium | 2007

Scale-up x Scale-out: A Case Study using Nutch/Lucene

Maged M. Michael; José E. Moreira; Doron Shiloach; Robert W. Wisniewski

Scale-up solutions in the form of large SMPs have represented the mainstream of commercial computing for the past several years. The major server vendors continue to provide increasingly larger and more powerful machines. More recently, scale-out solutions, in the form of clusters of smaller machines, have gained increased acceptance for commercial computing. Scale-out solutions are particularly effective in high-throughput Web-centric applications. In this paper, we investigate the behavior of two competing approaches to parallelism, scale-up and scale-out, in an emerging search application. Our conclusions show that a scale-out strategy can be the key to good performance even on a scale-up machine. Furthermore, scale-out solutions offer better price/performance, although at an increase in management complexity.


dependable systems and networks | 2005

Filtering failure logs for a BlueGene/L prototype

Yinglung Liang; Yanyong Zhang; Anand Sivasubramaniam; Ramendra K. Sahoo; José E. Moreira; Manish Gupta

The growing computational and storage needs of several scientific applications mandate the deployment of extreme-scale parallel machines, such as IBM's BlueGene/L, which can accommodate as many as 128K processors. In this paper, we present our experiences in collecting and filtering error event logs from an 8192-processor BlueGene/L prototype at IBM Rochester, which is currently ranked #8 in the Top-500 list. We analyze the logs collected from this machine over a period of 84 days starting from August 26, 2004. We perform a three-step filtering algorithm on these logs: extracting and categorizing failure events; temporal filtering to remove duplicate reports from the same location; and finally coalescing failure reports of the same error across different locations. Using this approach, we can substantially compress these logs, removing over 99.96% of the 828,387 original entries, and more accurately portray the failure occurrences on this system.
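The second filtering step can be sketched concisely: an event is a duplicate if the same location reported the same failure category within some threshold window. This is a minimal illustration of the idea (the event fields, and the choice to refresh the last-seen time on every report, are assumptions, not the paper's exact rules):

```python
def temporal_filter(events, window):
    """Sketch of temporal filtering: drop repeat reports of the same
    failure category from the same location.

    events: (timestamp, location, category) tuples, sorted by time
    window: gap (in seconds) under which a repeat counts as a duplicate
    """
    last_seen = {}
    kept = []
    for ts, loc, cat in events:
        key = (loc, cat)
        # Keep the event only if this location has been quiet in this
        # category for `window` seconds; always refresh the last-seen time.
        if key not in last_seen or ts - last_seen[key] > window:
            kept.append((ts, loc, cat))
        last_seen[key] = ts
    return kept
```

On hardware that emits the same error message many times per second, this single step already accounts for most of the 99.96% compression the paper reports.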


Ibm Systems Journal | 2000

Java programming for high-performance numerical computing

José E. Moreira; Samuel P. Midkiff; Manish Gupta; Pedro V. Artigas; Marc Snir; Richard D. Lawrence

First proposed as a mechanism for enhancing Web content, the Java™ language has taken off as a serious general-purpose programming language. Industry and academia alike have expressed great interest in using the Java language as a programming language for scientific and engineering computations. Applications in these domains are characterized by intensive numerical computing and often have very high performance requirements. In this paper we discuss programming techniques that lead to Java numerical codes with performance comparable to FORTRAN or C, the more traditional languages for this field. The techniques are centered around the use of a high-performance numerical library, written entirely in the Java language, and on compiler technology. The numerical library takes the form of the Array package for Java. Proper use of this package, and of other appropriate tools for compiling and running a Java application, results in code that is clean, portable, and fast. We illustrate the programming and performance issues through case studies in data mining and electromagnetism.


international parallel and distributed processing symposium | 2000

Improving parallel job scheduling by combining gang scheduling and backfilling techniques

Yanyong Zhang; Hubertus Franke; José E. Moreira; Anand Sivasubramaniam

Two different approaches have been commonly used to address problems associated with space sharing scheduling strategies: (a) augmenting space sharing with backfilling, which performs out of order job scheduling; and (b) augmenting space sharing with time sharing, using a technique called coscheduling or gang scheduling. With three important experimental results-impact of priority queue order on backfilling, impact of overestimation of job execution times, and comparison of scheduling techniques-this paper presents an integrated strategy that combines backfilling with gang scheduling. Using extensive simulations based on detailed models of realistic workloads, the benefits of combining backfilling and gang scheduling are clearly demonstrated over a spectrum of performance criteria.
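The gang-scheduling half of the combined strategy is usually pictured as an Ousterhout matrix: rows are time slices, columns are processors, and all tasks of a job occupy one row so they can be context-switched in and out together. A small packing sketch (the first-fit-decreasing heuristic here is illustrative, not the paper's algorithm):

```python
def build_gang_matrix(jobs, nodes, slots):
    """Pack jobs into a gang-scheduling (Ousterhout) matrix of `slots`
    time slices over `nodes` processors, widest job first.

    jobs: list of (name, width); returns one job list per time slice,
    or None if some job fits in no slice.
    """
    rows = [[] for _ in range(slots)]   # jobs assigned to each time slice
    used = [0] * slots                  # processors occupied in each slice
    for name, width in sorted(jobs, key=lambda j: -j[1]):
        # First-fit over time slices: the whole job must land in one row.
        for s in range(slots):
            if used[s] + width <= nodes:
                rows[s].append(name)
                used[s] += width
                break
        else:
            return None                 # job does not fit in any slice
    return rows
```

Backfilling then operates within this matrix, moving jobs into rows they would not reach under strict FCFS order.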


conference on high performance computing (supercomputing) | 2006

Topology mapping for Blue Gene/L supercomputer

Hao Yu; I-Hsin Chung; José E. Moreira

Mapping virtual processes onto physical processors is one of the most important issues in parallel computing. The problem of mapping processes/tasks onto processors is equivalent to the graph embedding problem, which has been studied extensively. Although many techniques have been proposed for embeddings of two-dimensional grids, hypercubes, etc., there have been few efforts on embeddings of three-dimensional grids and tori. Motivated by the need for better task-mapping support on the Blue Gene/L supercomputer, in this paper we present embedding and integration techniques for three-dimensional grids and tori. The topology-mapping library based on these techniques generates high-quality embeddings of two- and three-dimensional grids and tori. In addition, the library is used in the BG/L MPI library for scalable support of MPI topology functions. Through extensive empirical studies on large-scale systems with popular benchmarks and real applications, we demonstrate that the library can significantly improve the communication performance and scalability of applications.
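The quality metric behind such embeddings is hop count on the torus, where wraparound links make distance the minimum of going either way around each dimension. Below is a hop-distance helper plus a naive row-major baseline mapping of a 2-D process grid onto a 3-D torus; this baseline is an assumption for illustration, not the paper's embedding algorithm:

```python
def torus_distance(a, b, dims):
    """Hop count between two nodes of a torus with wraparound links."""
    return sum(min((x - y) % d, (y - x) % d) for x, y, d in zip(a, b, dims))

def map_grid_to_torus(grid, torus):
    """Map a 2-D process grid onto a 3-D torus in row-major order,
    filling the torus line by line, plane by plane.

    grid:  (gx, gy) process-grid shape
    torus: (tx, ty, tz) torus shape
    Returns {(grid_x, grid_y): (torus_x, torus_y, torus_z)}.
    """
    gx, gy = grid
    tx, ty, tz = torus
    assert gx * gy <= tx * ty * tz, "process grid does not fit on the torus"
    mapping = {}
    for rank in range(gx * gy):
        g = (rank // gy, rank % gy)                        # row-major grid coord
        mapping[g] = (rank % tx, (rank // tx) % ty, rank // (tx * ty))
    return mapping
```

A good embedding minimizes the total torus_distance over all communicating grid neighbors; row-major filling keeps one grid dimension contiguous but can stretch the other across planes, which is exactly the problem the paper's techniques address.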


international parallel and distributed processing symposium | 2004

Fault-aware job scheduling for BlueGene/L systems

Adam J. Oliner; Ramendra K. Sahoo; José E. Moreira; Manish Gupta; Anand Sivasubramaniam

Large-scale systems like BlueGene/L are susceptible to a number of software and hardware failures that can affect system performance. We evaluate the effectiveness of a previously developed job-scheduling algorithm for BlueGene/L in the presence of faults. We have developed two new job-scheduling algorithms that take failures into account when scheduling jobs. We have also evaluated the impact of these algorithms on average bounded slowdown, average response time, and system utilization, considering different levels of proactive failure prediction and prevention techniques reported in the literature. Our simulation studies show that the use of these new algorithms with even trivial fault-prediction confidence or accuracy levels (as low as 10%) can significantly improve the performance of the BlueGene/L system.
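The core idea, using even a weak failure predictor to steer jobs away from suspect nodes, can be sketched as a placement preference (this helper and its interface are hypothetical illustrations, not the paper's algorithms):

```python
def fault_aware_assign(job_nodes, free_nodes, predicted_faulty):
    """Place a job on free nodes, preferring those not predicted to fail.

    job_nodes:        number of nodes the job needs
    free_nodes:       list of idle node names, in allocation order
    predicted_faulty: set of node names the predictor has flagged
    Returns the chosen nodes, or None if not enough are free.
    """
    healthy = [n for n in free_nodes if n not in predicted_faulty]
    suspect = [n for n in free_nodes if n in predicted_faulty]
    pool = healthy + suspect        # fall back to suspect nodes only if needed
    if len(pool) < job_nodes:
        return None
    return pool[:job_nodes]
```

Because a job killed by a node failure loses all its work, even a predictor that is right only 10% of the time shifts enough failures onto idle or short-lived allocations to pay off, which matches the paper's simulation finding.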


ACM Transactions on Programming Languages and Systems | 2000

From flop to megaflops: Java for technical computing

José E. Moreira; Samuel P. Midkiff; Manish Gupta

Although there has been some experimentation with Java as a language for numerically intensive computing, there is a perception by many that the language is unsuited for such work because of performance deficiencies. In this article we show how optimizing array bounds checks and null pointer checks creates loop nests on which aggressive optimizations can be used. Applying these optimizations by hand to a simple matrix-multiply test case leads to Java-compliant programs whose performance is in excess of 500 Mflops on a four-processor 332MHz RS/6000 model F50 computer. We also report in this article the effect that various optimizations have on the performance of six floating-point-intensive benchmarks. Through these optimizations we have been able to achieve with Java at least 80% of the peak Fortran performance on the same benchmarks. Since all of these optimizations can be automated, we conclude that Java will soon be a serious contender for numerically intensive computing.


high-performance computer architecture | 2002

Evaluation of a multithreaded architecture for cellular computing

Călin Caşcaval; José G. Castaños; Luis Ceze; Monty M. Denneau; Manish Gupta; Derek Lieber; José E. Moreira; Karin Strauss; Henry S. Warren

Cyclops is a new architecture for high-performance parallel computers that is being developed at the IBM T. J. Watson Research Center. The basic cell of this architecture is a single-chip SMP (symmetric multiprocessor) system with multiple threads of execution, embedded memory and integrated communications hardware. Massive intra-chip parallelism is used to tolerate memory and functional unit latencies. Large systems with thousands of chips can be built by replicating this basic cell in a regular pattern. In this paper, we describe the Cyclops architecture and evaluate two of its new hardware features: a memory hierarchy with a flexible cache organization and fast barrier hardware. Our experiments with the STREAM benchmark show that a particular design can achieve a sustainable memory bandwidth of 40 GB/s, equal to the peak hardware bandwidth and similar to the performance of a 128-processor SGI Origin 3800. For small vectors, we have observed in-cache bandwidth above 80 GB/s. We also show that the fast barrier hardware can improve the performance of the Splash-2 FFT kernel by up to 10%. Our results demonstrate that the Cyclops approach of integrating a large number of simple processing elements and multiple memory banks in the same chip is an effective alternative for designing high-performance systems.
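The benefit of fast barrier hardware shows up in phased algorithms like the FFT, where every thread must finish one phase before any thread starts the next. A minimal software analogue using Python's `threading.Barrier` (Cyclops would replace the barrier wait with a single hardware operation; the two-phase reduction here is an illustrative stand-in for the FFT's phase structure):

```python
import threading

def phased_sum(data, nthreads=4):
    """Two-phase parallel reduction: each thread sums a strided slice,
    all threads meet at a barrier, then thread 0 combines the partials.
    """
    barrier = threading.Barrier(nthreads)
    partial = [0] * nthreads
    result = []

    def worker(tid):
        partial[tid] = sum(data[tid::nthreads])  # phase 1: local sums
        barrier.wait()           # all partial sums are ready past this point
        if tid == 0:
            result.append(sum(partial))          # phase 2: combine

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(nthreads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return result[0]
```

The shorter the per-phase work, the larger the fraction of time spent in the barrier itself, which is why hardware support yields the up-to-10% FFT improvement the paper measures.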
