Publication


Featured research published by Ahmed H. Sameh.


IEEE International Conference on High Performance Computing, Data, and Analytics | 1989

The Perfect Club Benchmarks: Effective Performance Evaluation of Supercomputers

Michael W. Berry; Da-Ren Chen; Peter F. Koss; David J. Kuck; Sy-Shin Lo; Yingxin Pang; Lynn Pointer; R. Roloff; Ahmed H. Sameh; E. Clementi; Shaoan Chin; David J. Schneider; Geoffrey C. Fox; Paul C. Messina; David Walker; C. Hsiung; Jim Schwarzmeier; K. Lue; Steven A. Orszag; F. Seidl; O. Johnson; R. Goodrum; Joanne L. Martin

This report presents a methodology for measuring the performance of supercomputers. It includes 13 Fortran programs that total over 50,000 lines of source code. They represent applications in several areas of engineering and scientific computing, and in many cases the codes are currently being used by computational research and development groups. We also present the PERFECT Fortran standard, a set of guidelines that allow portability to several types of machines. Furthermore, we present some performance measures and a methodology for recording and sharing results among diverse users on different machines. The results presented in this paper should not be used to compare machines, except in a preliminary sense. Rather, they are presented to show how the methodology has been applied, and to encourage others to join us in this effort. The results should be regarded as the first step toward our objective, which is to develop a publicly accessible database of performance information of this type.


Proceedings of the IEEE | 1972

The Illiac IV system

W.J. Bouknight; S.A. Denenberg; D.E. McIntyre; J.M. Randall; Ahmed H. Sameh; D.L. Slotnick

The reasons for the creation of Illiac IV are described and the history of the Illiac IV project is recounted. The architecture, or hardware structure, of the Illiac IV is discussed: the Illiac IV array is an array processor with a specialized control unit (CU) that can be viewed as a small stand-alone computer. The Illiac IV software strategy is described in terms of current user habits and needs. Brief descriptions are given of the systems software itself, its history, and the major lessons learned during its development. Some ideas for future development are suggested. Applications of Illiac IV are discussed in terms of evaluating the function f(x) simultaneously on up to 64 distinct argument sets x_i. Many of the time-consuming problems in scientific computation involve repeated evaluation of the same function on different argument sets. The argument sets which compose the problem data base must be structured in such a fashion that they can be distributed among 64 separate memories. Two matrix applications are discussed in detail: Jacobi's algorithm for finding the eigenvalues and eigenvectors of real symmetric matrices, and the reduction of a real nonsymmetric matrix to upper Hessenberg form using Householder transformations. The ARPA network, a highly sophisticated and wide-ranging experiment in the remote access and sharing of computer resources, is briefly described and its current status discussed. Many researchers located around the country who will use Illiac IV in solving problems will do so via the network. The various systems, hardware, and procedures they will use are discussed.
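The "same function, many argument sets" pattern described above maps directly onto array-style (SIMD) evaluation. A minimal sketch in NumPy (the function f is hypothetical; this is an illustration of the programming model, not Illiac IV code):

```python
import numpy as np

# 64 distinct argument sets, one per (conceptual) processing element.
x = np.linspace(0.0, 2.0 * np.pi, 64)

# A single "instruction stream" applied to all 64 arguments at once,
# mirroring the array-processor model described above.
f = lambda x: np.sin(x) ** 2 + np.cos(x) ** 2
y = f(x)

print(np.allclose(y, 1.0))
```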


Science | 1986

Parallel Supercomputing Today and the Cedar Approach

David J. Kuck; Edward S. Davidson; Duncan H. Lawrie; Ahmed H. Sameh

More and more scientists and engineers are becoming interested in using supercomputers. Earlier barriers to using these machines are disappearing as software for their use improves. Meanwhile, new parallel supercomputer architectures are emerging that may provide rapid growth in performance. These systems may use a large number of processors with an intricate memory system that is both parallel and hierarchical; they will require even more advanced software. Compilers that restructure user programs to exploit the machine organization seem to be essential. A wide range of algorithms and applications is being developed in an effort to provide high parallel processing performance in many fields. The Cedar supercomputer, presently operating with eight processors in parallel, uses advanced system and applications software developed at the University of Illinois during the past 12 years. This software should allow the number of processors in Cedar to be doubled annually, providing rapid performance advances in the next decade.


SIAM Review | 1990

Parallel algorithms for dense linear algebra computations

Kyle A. Gallivan; Robert J. Plemmons; Ahmed H. Sameh

Scientific and engineering research is becoming increasingly dependent upon the development and implementation of efficient parallel algorithms on modern high-performance computers. Numerical linear algebra is an indispensable tool in such research and this paper attempts to collect and describe a selection of some of its more important parallel algorithms. The purpose is to review the current status and to provide an overall perspective of parallel algorithms for solving dense, banded, or block-structured problems arising in the major areas of direct solution of linear systems, least squares computations, eigenvalue and singular value computations, and rapid elliptic solvers. A major emphasis is given here to certain computational primitives whose efficient execution on parallel and vector computers is essential in order to obtain high performance algorithms.


ACM SIGARCH Computer Architecture News | 1983

CEDAR: a large scale multiprocessor

Daniel D. Gajski; David J. Kuck; Duncan H. Lawrie; Ahmed H. Sameh

This paper presents an overview of Cedar, a large-scale multiprocessor being designed at the University of Illinois. The machine is designed to accommodate several thousand high-performance processors, which can work together on a single job or be partitioned into groups of processors, where each group of one or more processors can work on a separate job. Various aspects of the machine are described, including the control methodology, communication network, optimizing compiler, and plans for construction.


SIAM Journal on Numerical Analysis | 1982

A Trace Minimization Algorithm for the Generalized Eigenvalue Problem

Ahmed H. Sameh; John A. Wisniewski

An algorithm for computing a few of the smallest (or largest) eigenvalues and associated eigenvectors of the large sparse generalized eigenvalue problem Ax = λBx is presented. The matrices A and B are assumed to be symmetric and haphazardly sparse, with B being positive definite. The problem is treated as one of constrained optimization, and an inverse iteration is developed which requires the solution of linear algebraic systems only to the accuracy demanded by a given subspace. The rate of convergence of the method is established, and a technique for improving it is discussed. Numerical experiments and comparisons with other methods are presented.
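For readers unfamiliar with the problem class: a dense reference computation of the symmetric-definite pencil Ax = λBx, done the textbook way via a Cholesky reduction to a standard eigenproblem. This is a sketch of the *problem* the paper solves, not the trace-minimization algorithm itself (which targets large sparse matrices and a few extreme eigenpairs); the matrices here are small illustrative examples:

```python
import numpy as np

# Small symmetric A and symmetric positive definite B (illustrative).
rng = np.random.default_rng(0)
M = rng.standard_normal((6, 6))
A = M + M.T                    # symmetric
B = M @ M.T + 6 * np.eye(6)    # symmetric positive definite

# Reduce A x = lambda B x to a standard problem via Cholesky B = L L^T:
# (L^-1 A L^-T) y = lambda y, with x = L^-T y.
L = np.linalg.cholesky(B)
W = np.linalg.solve(L, A)               # L^-1 A
C = np.linalg.solve(L, W.T).T           # L^-1 A L^-T
C = (C + C.T) / 2                       # re-symmetrize against round-off
lam, Y = np.linalg.eigh(C)              # ascending eigenvalues
X = np.linalg.solve(L.T, Y)             # eigenvectors of the pencil

# Verify the generalized eigen-relation for the smallest eigenpair.
resid = A @ X[:, 0] - lam[0] * (B @ X[:, 0])
print(np.max(np.abs(resid)) < 1e-10)
```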


IEEE International Conference on High Performance Computing, Data, and Analytics | 1988

Impact of Hierarchical Memory Systems On Linear Algebra Algorithm Design

Kyle A. Gallivan; William Jalby; Ulrike Meier; Ahmed H. Sameh

Linear algebra algorithms based on the BLAS or extended BLAS do not achieve high performance on multivector processors with a hierarchical memory system because of a lack of data locality. For such machines, block linear algebra algorithms must be implemented in terms of matrix-matrix primitives (BLAS3). Designing efficient linear algebra algorithms for these architectures requires analysis of the behavior of the matrix-matrix primitives and the resulting block algorithms as a function of certain system parameters. The analysis must identify the limits of performance improvement possible via blocking and any contradictory trends that require trade-off consideration. We propose a methodology that facilitates such an analysis and use it to analyze the performance of the BLAS3 primitives used in block methods. A similar analysis of the block size versus performance relationship is also performed at the algorithm level for block versions of the LU decomposition and the Gram-Schmidt orthogonalization procedures.
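The blocking idea above can be sketched concretely. Tiling a matrix product so that each bs-by-bs tile is reused while it sits in fast memory is the data-locality argument behind BLAS3-style primitives; a minimal sketch (not the authors' methodology, and the tile size bs is an arbitrary illustrative choice):

```python
import numpy as np

def blocked_matmul(A, B, bs=32):
    """Blocked matrix-matrix product (a BLAS3-style primitive).

    Operating on bs x bs tiles keeps each tile resident in fast memory
    while it is reused, instead of streaming whole rows and columns as
    BLAS1/BLAS2 formulations do.
    """
    n, k = A.shape
    k2, m = B.shape
    assert k == k2
    C = np.zeros((n, m))
    for i in range(0, n, bs):
        for j in range(0, m, bs):
            for p in range(0, k, bs):
                # NumPy slicing handles ragged edge tiles automatically.
                C[i:i+bs, j:j+bs] += A[i:i+bs, p:p+bs] @ B[p:p+bs, j:j+bs]
    return C

rng = np.random.default_rng(1)
A = rng.standard_normal((96, 80))
B = rng.standard_normal((80, 64))
print(np.allclose(blocked_matmul(A, B, bs=32), A @ B))
```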


Mathematics of Computation | 1971

On Jacobi and Jacobi-like algorithms for a parallel computer

Ahmed H. Sameh

Many existing algorithms for obtaining the eigenvalues and eigenvectors of matrices would make poor use of such a powerful parallel computer as the ILLIAC IV. In this paper, Jacobi's algorithm for real symmetric or complex Hermitian matrices, and a Jacobi-like algorithm for real nonsymmetric matrices developed by P. J. Eberlein, are modified so as to achieve maximum efficiency for parallel computation.

1. Introduction. With the advent of parallel computers, the study of computationally massive problems became economically possible. Such problems include, for example, the solution of sets of partial differential equations over sizable grids, and the multiplication, inversion, or determination of eigenvalues and eigenvectors of large matrices. An example of a parallel computer is the ILLIAC IV. This computer is essentially an array of coupled arithmetic units driven by instructions from a common control unit. Each of the arithmetic units, called processing elements (PEs), has 2048 words of 64-bit memory with an access time under 420 nanoseconds. Each PE is capable of 64-bit floating-point multiplication in about 550 nanoseconds. Two 32-bit floating-point operations may be performed in each PE in approximately the same time. The PE instruction set is similar to that of conventional machines, with two exceptions. First, the PEs are capable of communicating data to four neighboring PEs by means of routing instructions. Second, the PEs are able to set their own mode registers to effectively disable or enable themselves. For a more detailed
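For reference, the classical serial formulation of Jacobi's algorithm that the paper parallelizes: each plane rotation zeroes one off-diagonal pair, and repeated sweeps drive a real symmetric matrix to diagonal form. This sketch is the textbook cyclic version, not the parallel ordering developed in the paper (whose contribution is choosing rotations that can be applied simultaneously):

```python
import numpy as np

def jacobi_eigen(A, sweeps=10):
    """Cyclic Jacobi eigenvalue iteration for a real symmetric matrix.

    Returns the eigenvalues (ascending) and the accumulated rotations.
    """
    A = A.astype(float).copy()
    n = A.shape[0]
    V = np.eye(n)
    for _ in range(sweeps):
        for p in range(n - 1):
            for q in range(p + 1, n):
                if abs(A[p, q]) < 1e-15:
                    continue
                # Angle that annihilates A[p, q]: tan(2t) = 2a_pq/(a_qq - a_pp).
                theta = 0.5 * np.arctan2(2 * A[p, q], A[q, q] - A[p, p])
                c, s = np.cos(theta), np.sin(theta)
                J = np.eye(n)
                J[p, p] = J[q, q] = c
                J[p, q], J[q, p] = s, -s
                A = J.T @ A @ J      # similarity transform, zeroes A[p, q]
                V = V @ J
    return np.sort(np.diag(A)), V

M = np.array([[4.0, 1.0, 0.5],
              [1.0, 3.0, 0.2],
              [0.5, 0.2, 1.0]])
vals, _ = jacobi_eigen(M)
print(np.allclose(vals, np.linalg.eigvalsh(M)))
```

The parallel observation the paper exploits is that rotations on disjoint index pairs (p, q) commute, so up to n/2 of them can be applied in one step on an array machine.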


ACM Transactions on Mathematical Software | 1978

Practical Parallel Band Triangular System Solvers

Shyh-Ching Chen; David J. Kuck; Ahmed H. Sameh

We present a new algorithm for the fast solution of linear recurrence systems, which we discuss in the form of band triangular linear systems. Parallel linear recurrence system solvers have also been discussed by several authors, e.g., [2-7, 11, 12]. The algorithm presented here is well suited to recurrences of low order. When solving such systems on a limited number of processors, the method presented obtains speed improvements of the order of 2 to 4 over previous algorithms. Throughout the paper we assume that any number of processors can be used at any time, but we give bounds on this number. All processors are assumed to perform the same operation on each time step, and each arithmetic operation can be performed in one step. If p is the number of processors used, we denote the computation time by T_p. We also define the speedup of the parallel algorithm by S_p = T_1/T_p, where T_1 is the minimum time required by the algorithm using only one processor, and we denote the efficiency by E_p = S_p/p. In Section 2 we present our algorithm and give variations on it which hold for certain special cases of practical interest. These include the computation of only the last few elements of the solution to a recurrence and the case of Toeplitz
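A first-order linear recurrence (equivalently, a bidiagonal triangular solve) looks inherently serial, but it can be recast as a composition of affine maps, which is associative and therefore parallelizable. The sketch below shows recursive doubling, a standard technique of this family; it is an illustration of why such recurrences parallelize, not the specific algorithm of this paper:

```python
import numpy as np

def recurrence_serial(a, b, x0=0.0):
    """x_i = a_i * x_{i-1} + b_i, computed serially."""
    x = np.empty(len(a))
    prev = x0
    for i in range(len(a)):
        prev = a[i] * prev + b[i]
        x[i] = prev
    return x

def recurrence_doubling(a, b, x0=0.0):
    """Same recurrence by recursive doubling.

    Each x_i equals A_i * x0 + B_i, where (A_i, B_i) is the composition
    of the affine maps x -> a_j x + b_j for j <= i.  Composition is
    associative, so all prefixes can be formed in O(log n) doubling
    steps; each step below is a fully vectorized "parallel" update.
    """
    A, B = a.astype(float).copy(), b.astype(float).copy()
    shift, n = 1, len(a)
    while shift < n:
        A_new, B_new = A.copy(), B.copy()
        # Compose map at i with the prefix map ending at i - shift.
        A_new[shift:] = A[shift:] * A[:-shift]
        B_new[shift:] = A[shift:] * B[:-shift] + B[shift:]
        A, B = A_new, B_new
        shift *= 2
    return A * x0 + B

rng = np.random.default_rng(2)
a = rng.uniform(0.5, 1.5, 100)
b = rng.standard_normal(100)
print(np.allclose(recurrence_doubling(a, b, 1.0), recurrence_serial(a, b, 1.0)))
```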


SIAM Journal on Scientific and Statistical Computing | 1992

Row projection methods for large nonsymmetric linear systems

Randall Bramley; Ahmed H. Sameh

Three conjugate gradient accelerated row projection (RP) methods for nonsymmetric linear systems are presented and their properties described. One method is based on Kaczmarz's method and has an iteration matrix that is the product of orthogonal projectors; another is based on Cimmino's method and has an iteration matrix that is the sum of orthogonal projectors. A new RP method, which requires fewer matrix-vector operations, explicitly reduces the problem size, is error reducing in the two-norm, and consistently produces better solutions than other RP algorithms, is also introduced. Using comparisons with the method of conjugate gradient applied to the normal equations, the properties of RP methods are explained. A row partitioning approach is described that yields parallel implementations suitable for a wide range of computer architectures, requires only a few vectors of extra storage, and allows computing the necessary projections with small errors. Numerical testing verifies the robustness of this approach.
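The building block of the first RP method is Kaczmarz's iteration: the current iterate is orthogonally projected onto the hyperplane of each row equation in turn, so a full sweep is a product of orthogonal projectors. A minimal unaccelerated sketch (the paper's methods add conjugate gradient acceleration and block row partitioning, which are not shown here; the 3-by-3 system is an arbitrary example):

```python
import numpy as np

def kaczmarz(A, b, sweeps=200, x0=None):
    """Cyclic Kaczmarz iteration for a consistent linear system Ax = b.

    Each inner step projects x onto the hyperplane a_i^T x = b_i;
    a full sweep applies the product of these orthogonal projectors.
    """
    m, n = A.shape
    x = np.zeros(n) if x0 is None else x0.astype(float).copy()
    for _ in range(sweeps):
        for i in range(m):
            ai = A[i]
            x += (b[i] - ai @ x) / (ai @ ai) * ai
    return x

# A small nonsymmetric, nonsingular system (illustrative).
A = np.array([[3.0, 1.0, 0.0],
              [1.0, 4.0, 1.0],
              [0.0, 2.0, 5.0]])
b = np.array([1.0, 2.0, 3.0])
x = kaczmarz(A, b)
print(np.allclose(A @ x, b))
```

Note that convergence holds for any consistent system regardless of symmetry, which is what makes RP methods attractive for nonsymmetric problems; the acceleration studied in the paper addresses their often slow basic rate.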

Collaboration


Dive into Ahmed H. Sameh's collaborations.

Top Co-Authors

Murat Manguoglu

Middle East Technical University


Vipin Kumar

University of Minnesota
