
Publication


Featured research published by Rod Fatoohi.


Supercomputing '91: Proceedings of the 1991 ACM/IEEE Conference on Supercomputing | 1991

The NAS parallel benchmarks summary and preliminary results

David H. Bailey; Eric Barszcz; John T. Barton; D. S. Browning; Robert L. Carter; Leonardo Dagum; Rod Fatoohi; Paul O. Frederickson; T. A. Lasinski; Robert Schreiber; Horst D. Simon; V. Venkatakrishnan; Sisira Weeratunga

No abstract available


conference on high performance computing (supercomputing) | 1994

Performance evaluation of three distributed computing environments for scientific applications

Rod Fatoohi; Sisira Weeratunga

Presents performance results for three distributed computing environments using the three simulated computational fluid dynamics applications in the NAS Parallel Benchmark suite. These environments are the Distributed Computing Facility (DCF) cluster, the LACE cluster, and an Intel iPSC/860 machine. The DCF is a prototype cluster of loosely coupled SGI R3000 machines connected by Ethernet. The LACE cluster is a tightly coupled cluster of 92 IBM RS6000/560 machines connected by Ethernet as well as by either FDDI or an IBM Allnode switch. Results of several parallel algorithms for the three simulated applications are presented and analyzed, based on the interplay between the communication requirements of an algorithm and the characteristics of the communication network of a distributed system.


conference on high performance computing supercomputing | 1991

The NAS parallel benchmarks

David H. Bailey; Eric Barszcz; Horst D. Simon; V. Venkatakrishnan; Sisira Weeratunga; John T. Barton; D. S. Browning; Robert L. Carter; Leonardo Dagum; Rod Fatoohi; Paul O. Frederickson; T. A. Lasinski; Robert Schreiber

No abstract available


international conference on supercomputing | 1990

Vector performance analysis of the NEC SX-2

Rod Fatoohi

This paper presents the results of a series of experiments to study the vector performance of the NEC SX-2. The main objective of this study is to understand the architecture and identify its bottlenecks and limiting factors. A simple performance model is used to examine the impact of certain architectural features on the performance of a set of basic operations. The results of implementing this set on the machine for four vector lengths and three memory strides are presented and compared. These results show that the vector length and the ratio of floating point operations to memory references have a great impact on the performance of the machine. Two numerical algorithms are also employed, and the results of these algorithms and the basic operations are compared to early results on one processor of the Cray-2 and Cray Y-MP. These comparisons show that the SX-2 is faster than the Cray Y-MP by up to 86% for short vectors and by 2 to 4 times for long vectors. It also outperformed the Cray-2 by even larger factors. Finally, the architecture of the SX-X is presented, and some predictions about its performance are given.
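The abstract does not reproduce the performance model itself; the sketch below is the classic startup-plus-streaming timing model, which captures the same point about vector length (all constants here are illustrative placeholders, not measured SX-2 values).

```python
# Generic vector-timing model (illustrative constants, not SX-2 measurements):
# a vector operation of length n costs a fixed pipeline startup plus one
# element per clock, so short vectors amortize the startup poorly.
def mflops(n, startup_ns=300.0, per_elem_ns=6.0, flops_per_elem=1):
    """Sustained MFLOPS for one vector operation of length n."""
    total_ns = startup_ns + n * per_elem_ns
    return flops_per_elem * n * 1e3 / total_ns

short_rate = mflops(10)       # startup-dominated
long_rate = mflops(10_000)    # near the asymptotic rate of 1000 / per_elem_ns
```

Under this model the sustained rate climbs monotonically with vector length toward an asymptote set by the per-element time, which is why the 86% short-vector and 2-to-4x long-vector gaps quoted above can differ so much on the same pair of machines.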


distributed memory computing conference | 1990

Performance Results on the Intel Touchstone Gamma Prototype

David H. Bailey; Eric Barszcz; Rod Fatoohi; Horst D. Simon; Sisira Weeratunga

This paper describes the Intel Touchstone Gamma Prototype, a distributed memory MIMD parallel computer based on the new Intel i860 floating point processor. With 128 nodes, this system has a theoretical peak performance of over seven GFLOPS. This paper presents some initial performance results on this system, including results for individual node computation, message passing, and complete applications using multiple nodes. The highest rate achieved on a multiprocessor Fortran application program is 844 MFLOPS.

Overview of the Touchstone Gamma System

In spring of 1989, DARPA and Intel Scientific Computers announced the Touchstone project. This project calls for the development of a series of prototype machines by Intel Scientific Computers, based on hardware and software technologies being developed by Intel in collaboration with research teams at Caltech, MIT, UC Berkeley, Princeton, and the University of Illinois. The eventual goal of this project is the Sigma prototype, a 150 GFLOPS peak parallel supercomputer with 2000 processing nodes. One of the milestones towards the Sigma prototype is the Gamma prototype. At the end of December 1989, the Numerical Aerodynamic Simulation (NAS) Systems Division at NASA Ames Research Center took delivery of one of the first two Touchstone Gamma systems, and it became available for testing in January 1990.

The Touchstone Gamma system is based on the new 64 bit i860 microprocessor by Intel [4]. The i860 has over 1 million transistors and runs at 40 MHz (the initial Touchstone Gamma systems were delivered with 33 MHz processors, but these have since been upgraded to 40 MHz). The theoretical peak speed is 80 MFLOPS in 32 bit floating point and 60 MFLOPS for 64 bit floating point operations. The i860 features 32 integer address registers, with 32 bits each, and 16 floating point registers with 64 bits each (or 32 floating point registers with 32 bits each). It also features an 8 kilobyte on-chip data cache and a 4 kilobyte instruction cache. There is a 128 bit data path between cache and registers and a 64 bit data path between main memory and registers. The i860 has a number of advanced features to facilitate high execution rates. First, a number of important operations, including floating point add, multiply, and fetch from main memory, are pipelined: they are segmented into three stages, and in most cases a new operation can be initiated every 25 nanosecond clock period. Another advanced feature is that multiple instructions can be executed in a single clock period. For example, a memory fetch, a floating add, and a floating multiply can all be initiated in a single clock period.

A single node of the Touchstone Gamma system consists of the i860, 8 megabytes (MB) of dynamic random access memory, and hardware for communication to other nodes. The Touchstone Gamma system at NASA Ames consists of 128 computational nodes. The theoretical peak performance of this system is thus approximately 7.5 GFLOPS on 64 bit data. The 128 nodes are arranged in a seven dimensional hypercube using the direct connect routing module and the hypercube interconnect technology of the iPSC/2. The point to point aggregate bandwidth of the interconnect system, which is 2.8 MB/sec per channel, is the same as on the iPSC/2. However, the latency for message passing is reduced from about 350 microseconds to about 90 microseconds. This reduction is mainly obtained through the increased speed of the i860 on the Touchstone Gamma machine, when compared to the processor of the iPSC/2.
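The peak and sustained figures quoted above can be checked with a line of arithmetic; every number below comes directly from the text.

```python
# Peak/sustained arithmetic using the figures quoted in the abstract.
nodes = 128
peak_per_node_mflops = 60                 # i860 64-bit peak at 40 MHz
system_peak_gflops = nodes * peak_per_node_mflops / 1000
# 128 * 60 MFLOPS = 7.68 GFLOPS, matching the stated "approximately 7.5 GFLOPS"

best_app_mflops = 844                     # best multiprocessor Fortran result
efficiency = best_app_mflops / (nodes * peak_per_node_mflops)
# roughly 11% of theoretical peak, typical of early distributed-memory systems
```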


international parallel and distributed processing symposium | 2006

Performance evaluation of supercomputers using HPCC and IMB benchmarks

Subhash Saini; Robert Ciotti; Brian T. N. Gunney; Thomas E. Spelce; Alice Koniges; Don Dossa; Panagiotis Adamidis; Rolf Rabenseifner; Sunil R. Tiyyagura; Matthias S. Mueller; Rod Fatoohi

The HPC Challenge (HPCC) benchmark suite and the Intel MPI Benchmark (IMB) are used to compare and evaluate the combined performance of the processor, memory subsystem, and interconnect fabric of five leading supercomputers: SGI Altix BX2, Cray X1, Cray Opteron Cluster, Dell Xeon cluster, and NEC SX-8. These five systems use five different networks (SGI NUMALINK4, Cray network, Myrinet, InfiniBand, and NEC IXS). The complete set of HPCC benchmarks is run on each of these systems. Additionally, we present Intel MPI Benchmarks (IMB) results to study the performance of 11 MPI communication functions on these systems.


conference on high performance computing supercomputing | 1989

Vector performance analysis of three supercomputers: Cray 2, Cray Y-MP, and ETA 10-Q

Rod Fatoohi

This paper presents the results of a series of experiments to study the single processor performance of three supercomputers: the Cray-2, Cray Y-MP, and ETA10-Q. The main objective of this study is to determine the impact of certain architectural features on the performance of modern supercomputers. Features such as clock period, memory links, memory organization, multiple functional units, and chaining are considered here. A simple performance model is used to examine the impact of these features on the performance of a set of basic operations. The results of implementing this set on these machines for three vector lengths and three memory strides are presented and compared. For unit stride operations, the Cray Y-MP outperformed the Cray-2 by as much as three times and the ETA10-Q by as much as four times. Moreover, unlike the Cray-2 and ETA10-Q, even-numbered strides do not cause a major performance degradation on the Cray Y-MP. Two numerical algorithms are also used for comparison. For three problem sizes of both algorithms, the Cray Y-MP outperformed the Cray-2 by 43% to 68% and the ETA10-Q by four to eight times.
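The even-stride degradation mentioned above is the classic memory-bank-conflict effect. A hedged illustration (generic interleaved memory, not either machine's actual bank count):

```python
# With B interleaved memory banks, a vector access of stride s touches only
# B // gcd(B, s) distinct banks, so even strides on a power-of-two bank
# count concentrate traffic on fewer banks and cut usable memory bandwidth.
from math import gcd

def banks_touched(num_banks, stride):
    """Number of distinct banks hit by addresses 0, s, 2s, ... (mod B)."""
    return num_banks // gcd(num_banks, stride)
```

With 16 banks, stride 1 uses all 16, stride 2 only 8, and stride 3 all 16 again, which is why odd strides tend to behave like unit stride while even strides do not.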


international conference on computer communications and networks | 1995

Performance evaluation of communication networks for distributed computing

Rod Fatoohi

We present performance results for several high-speed networks in distributed computing environments. These networks are: HiPPI, ATM, Fibre Channel, the IBM Allnode switch, FDDI, and Ethernet. These networks are part of two testbeds: DaVinci, a cluster of 16 SGI R8000 workstations at NASA Ames, and LACE, a cluster of 96 IBM RS6000 workstations at NASA Lewis. An IBM SP2 machine is also considered for comparison. Several communication tests are performed and the results are presented for two programming levels: the BSD socket programming interface, using the program ttcp, and the PVM message passing library. These results show that the emerging network technologies can achieve reasonable performance under certain conditions. However, the achievable performance is still far behind the theoretical peak rates.
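ttcp measured socket throughput by streaming a fixed amount of data and timing it. The sketch below is a minimal modern rendition of that idea over loopback; it is not the ttcp program itself, and the buffer size and transfer volume are arbitrary choices.

```python
# Minimal ttcp-style throughput probe: stream total_mb megabytes of zeros
# over a loopback TCP connection and report MB/s.
import socket
import threading
import time

def measure_mb_per_s(total_mb=16):
    srv = socket.socket()
    srv.bind(("127.0.0.1", 0))       # OS-assigned port avoids collisions
    srv.listen(1)
    port = srv.getsockname()[1]

    def sink():                      # receiver: drain bytes until sender closes
        conn, _ = srv.accept()
        while conn.recv(65536):
            pass
        conn.close()

    t = threading.Thread(target=sink, daemon=True)
    t.start()

    buf = b"\0" * 65536              # 64 KB per send; 16 sends = 1 MB
    s = socket.create_connection(("127.0.0.1", port))
    t0 = time.perf_counter()
    for _ in range(total_mb * 16):
        s.sendall(buf)
    s.close()
    t.join()
    srv.close()
    return total_mb / (time.perf_counter() - t0)
```

Loopback numbers only bound the software-stack overhead; as the abstract notes, real network measurements sit well below both this and the hardware's peak rate.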


conference on high performance computing (supercomputing) | 1994

NAS experiences with a prototype cluster of workstations

K. Castagnera; D. Cheng; Rod Fatoohi; E. Hook; B. Kramer; C. Manning; J. Musch; C. Niggley; William Saphir; D. Sheppard; M. Smith; Ian Stockdale; S. Welch; R. Williams; D. Yip

This paper discusses the year-long activity at NAS to implement a large, loose cluster of workstations from the existing Silicon Graphics, Inc. (SGI) pool of systems. Issues related to establishing a loosely coupled cluster of workstations are presented, including the steps needed to resolve system management issues intended to provide reasonable cycle recovery from these systems without disrupting the primary system users. Performance evaluation tests were run based on the NAS Parallel Benchmarks (NPB) and other codes, including OVERFLOW-PVM, a full-fledged computational fluid dynamics (CFD) application. This paper summarizes the activities related to the prototype cluster and identifies areas that need improvement, development, and research in order to make workstation clusters a viable computing environment for solving aeroscience problems.


international conference on multimedia and expo | 2003

Scalable coded image transmissions over peer-to-peer networks

Xiao Su; Rod Fatoohi

In this paper, we study the transmission of scalable coded images over peer-to-peer networks. Scalable coded images share a common prefix of their resulting bit streams even when coded at different bit rates. This property has two important consequences for the peer-to-peer system when compared to the transmission of non-scalable coded images: (1) there exists a many-to-one relationship between supplying and requesting peers, as multiple peers holding the coded image at different bit rates become eligible as supplying peers; and (2) the set of supplying peers is dynamic over time, as the peers in the supplying set may finish transmission at different times. When we transmit the requested image from multiple supplying peers to a requesting peer, it is important to design optimal peer assignment algorithms that minimize the overall transmission time for the requesting peer. For this purpose, we first establish a sufficient property for the optimal peer assignment vector, and then design an optimal media segmentation algorithm based on this property. Finally, we compare the performance of the proposed optimal media segmentation algorithm with two heuristics and verify its superior performance.
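The abstract does not reproduce the paper's segmentation algorithm; the sketch below shows only the simplest underlying intuition, deliberately ignoring the prefix-length constraint of scalable coding and assuming a static peer set: splitting the bitstream in proportion to peer bandwidth makes all peers finish at the same time, which minimizes the overall transmission time.

```python
# Bandwidth-proportional segmentation (a simplification of the paper's
# peer assignment problem: no prefix constraints, static supplying set).
def segment(total_bytes, bandwidths):
    """Bytes to request from each supplying peer so all finish together."""
    total_bw = sum(bandwidths)
    shares = [total_bytes * bw // total_bw for bw in bandwidths]
    shares[0] += total_bytes - sum(shares)   # hand rounding remainder to peer 0
    return shares
```

For example, segment(1000, [1, 1, 2]) assigns [250, 250, 500]: the peer with twice the bandwidth carries twice the bytes, so all three peers finish simultaneously.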

Collaboration


Dive into Rod Fatoohi's collaborations.

Top Co-Authors


David H. Bailey

Lawrence Berkeley National Laboratory


Horst D. Simon

Lawrence Berkeley National Laboratory
