Network


Latest external collaborations at the country level.

Hotspot


Dive into the research topics where Marco Zagha is active.

Publication


Featured research published by Marco Zagha.


ACM Symposium on Parallel Algorithms and Architectures | 1991

A comparison of sorting algorithms for the connection machine CM-2

Guy E. Blelloch; Charles E. Leiserson; Bruce M. Maggs; C. Greg Plaxton; Stephen J. Smith; Marco Zagha

Sorting is arguably the most studied problem in computer science, both because it is used as a substep in many applications and because it is a simple, combinatorial problem with many interesting and diverse solutions. Sorting is also an important benchmark for parallel supercomputers. It requires significant communication bandwidth among processors, unlike many other supercomputer benchmarks, and the most efficient sorting algorithms communicate data in irregular patterns. Parallel algorithms for sorting have been studied since at least the 1960s. An early advance in parallel sorting came in 1968 when Batcher discovered the elegant Θ(lg² n)-depth bitonic sorting network [3]. For certain families of fixed interconnection networks, such as the hypercube and shuffle-exchange, Batcher's bitonic sorting technique provides a parallel algorithm for sorting n numbers in Θ(lg² n) time with n processors. The question of the existence of an o(lg² n)-depth sorting network remained open until 1983, when Ajtai, Komlós, and Szemerédi [1] provided an optimal Θ(lg n)-depth sorting network, but unfortunately, their construction leads to larger networks than those given by bitonic sort for all "practical" values of n. Leighton [15] has shown that any Θ(lg n)-depth family of sorting networks can be used to sort n numbers in Θ(lg n) time in the bounded-degree fixed interconnection network domain. Not surprisingly, the optimal Θ(lg n)-time fixed interconnection sorting networks implied by the AKS construction are also impractical. In 1983, Reif and Valiant proposed a more practical O(lg n)-time randomized algorithm for sorting [19], called flashsort. Many other parallel sorting algorithms have been proposed in the literature, including parallel versions of radix sort and quicksort [5], a variant of quicksort called hyperquicksort [23], smoothsort [18], column sort [15], Nassimi and Sahni's sort [17], and parallel merge sort [6]. This paper reports the findings of a project undertaken at Thinking Machines Corporation to develop a fast sorting algorithm for the Connection Machine Supercomputer model CM-2. The primary goals of this project were:
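
The abstract's central contrast is between the Θ(lg² n)-depth bitonic network, which is practical, and the asymptotically better but impractical AKS construction. As a hedged illustration (not code from the paper), here is a minimal sequential Python rendering of Batcher's bitonic sort; the two nested loops over k and j correspond to the network's Θ(lg² n) depth.

```python
# Illustrative sketch: Batcher's bitonic merge sort on a list whose
# length is a power of two. The compare-exchange pattern below is what
# gives the network its Theta(lg^2 n) depth: lg n merge stages, each
# with up to lg n compare steps.

def bitonic_sort(a):
    n = len(a)
    assert n & (n - 1) == 0, "length must be a power of two"
    k = 2
    while k <= n:                    # lg n merge stages
        j = k // 2
        while j >= 1:                # up to lg n compare steps per stage
            for i in range(n):
                partner = i ^ j      # wire i is compared with wire i XOR j
                if partner > i:
                    ascending = (i & k) == 0
                    if (a[i] > a[partner]) == ascending:
                        a[i], a[partner] = a[partner], a[i]
            j //= 2
        k *= 2
    return a

print(bitonic_sort([7, 3, 6, 1, 8, 2, 5, 4]))  # [1, 2, 3, 4, 5, 6, 7, 8]
```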


Journal of Parallel and Distributed Computing | 1994

Implementation of a portable nested data-parallel language

Guy E. Blelloch; Jonathan C. Hardwick; Jay Sipelstein; Marco Zagha; Siddhartha Chatterjee

This paper gives an overview of the implementation of NESL, a portable nested data-parallel language. This language and its implementation are the first to fully support nested data structures as well as nested data-parallel function calls. These features allow the concise description of parallel algorithms on irregular data, such as sparse matrices and graphs. In addition, they maintain the advantages of data-parallel languages: a simple programming model and portability. The current NESL implementation is based on an intermediate language called VCODE and a library of vector routines called CVL. It runs on the Connection Machines CM-2 and CM-5, the Cray Y-MP C90, and serial workstations. We compare initial benchmark results of NESL with those of machine-specific code on these machines for three algorithms: least-squares line-fitting, median finding, and a sparse matrix-vector product. These results show that NESL's performance is competitive with that of machine-specific codes for regular dense data, and is often superior for irregular data.
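
To give a feel for what "nested data parallelism" buys, here is a hedged sketch in Python rather than NESL: the sparse matrix-vector product from the paper's benchmarks, written as a nested traversal over irregular-length rows. In NESL both levels would be parallel apply-to-each constructs that the implementation flattens; the data layout here is an assumption chosen for illustration.

```python
# Illustrative sketch (not NESL itself): the flavor of nested data
# parallelism, using Python comprehensions in place of NESL's
# apply-to-each. Each row of a sparse matrix is an irregular-length
# sequence of (column, value) pairs, and the outer "parallel" loop
# contains an inner one -- exactly the nesting NESL supports directly.

def sparse_matvec(rows, x):
    """y = A @ x for A stored as a list of rows of (col, val) pairs."""
    return [sum(v * x[c] for c, v in row) for row in rows]

# A 3x3 example: [[2, 0, 0], [0, 0, 1], [3, 0, 4]]
rows = [[(0, 2.0)], [(2, 1.0)], [(0, 3.0), (2, 4.0)]]
print(sparse_matvec(rows, [1.0, 1.0, 1.0]))  # [2.0, 1.0, 7.0]
```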


Conference on High Performance Computing (Supercomputing) | 1996

Performance Analysis Using the MIPS R10000 Performance Counters

Marco Zagha; Brond Larson; Steve Turner; Marty Itzkowitz

Tuning supercomputer application performance often requires analyzing the interaction of the application and the underlying architecture. In this paper, we describe support in the MIPS R10000 for non-intrusively monitoring a variety of processor events, support that is particularly useful for characterizing the dynamic behavior of multi-level memory hierarchies, hardware-based cache coherence, and speculative execution. We first explain how performance data is collected using an integrated set of hardware mechanisms, operating system abstractions, and performance tools. We then describe several examples drawn from scientific applications that illustrate how the counters and profiling tools provide information that helps developers analyze and tune applications.
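
As a rough illustration of how raw counter values become tuning insight, the sketch below derives instructions-per-cycle and a cache-miss rate from event counts. The counter names and numbers are invented for illustration; on a real R10000 system the counts would come from the OS-level counter interface the paper describes, not from a Python dict.

```python
# Hypothetical sketch: the kind of derived metric hardware performance
# counters enable. All names and values here are made-up placeholders.

counts = {
    "cycles":            2_000_000_000,
    "graduated_instrs":  1_500_000_000,
    "L1_dcache_misses":     40_000_000,
    "L2_cache_misses":       5_000_000,
}

ipc = counts["graduated_instrs"] / counts["cycles"]
l2_per_kinstr = 1000 * counts["L2_cache_misses"] / counts["graduated_instrs"]
print(f"IPC = {ipc:.2f}, L2 misses per 1000 instructions = {l2_per_kinstr:.2f}")
```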


Conference on High Performance Computing (Supercomputing) | 1990

Scan primitives for vector computers

Siddhartha Chatterjee; Guy E. Blelloch; Marco Zagha

The authors describe an optimized implementation of a set of scan (also called all-prefix-sums) primitives on a single processor of a CRAY Y-MP, and demonstrate that their use leads to greatly improved performance for several applications that cannot be vectorized with existing compiler technology. The algorithm used to implement the scans is based on an algorithm for parallel computers. A set of segmented versions of these scans is only marginally more expensive than the unsegmented versions. The authors describe a radix sorting routine based on the scans that is 13 times faster than a Fortran version and within 20% of a highly optimized library sort routine, three operations on trees that are between 10 and 20 times faster than the corresponding C versions, and a connectionist learning algorithm that is 10 times faster than the corresponding C version for sparse and irregular networks.
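
For readers unfamiliar with the primitives being vectorized, here is a minimal sequential Python reference for an exclusive plus-scan and its segmented variant (segments marked by start flags). The paper's contribution is implementing these efficiently on vector hardware, which this sketch does not attempt.

```python
# Sequential reference versions of the scan primitives named above.

def plus_scan(a):
    """Exclusive prefix sums: [a0, a1, a2] -> [0, a0, a0+a1]."""
    out, running = [], 0
    for v in a:
        out.append(running)
        running += v
    return out

def segmented_plus_scan(a, flags):
    """Like plus_scan, but restart the sum wherever flags[i] is true."""
    out, running = [], 0
    for v, f in zip(a, flags):
        if f:
            running = 0
        out.append(running)
        running += v
    return out

print(plus_scan([3, 1, 4, 1, 5]))                             # [0, 3, 4, 8, 9]
print(segmented_plus_scan([3, 1, 4, 1, 5], [1, 0, 1, 0, 0]))  # [0, 3, 0, 4, 5]
```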


Conference on High Performance Computing (Supercomputing) | 1991

Radix sort for vector multiprocessors

Marco Zagha; Guy E. Blelloch

No abstract available


Theory of Computing Systems / Mathematical Systems Theory | 1998

An Experimental Analysis of Parallel Sorting Algorithms

Guy E. Blelloch; Charles E. Leiserson; Bruce M. Maggs; C. G. Plaxton; Stephen J. Smith; Marco Zagha

We have developed a methodology for predicting the performance of parallel algorithms on real parallel machines. The methodology consists of two steps. First, we characterize a machine by enumerating the primitive operations that it is capable of performing, along with the cost of each operation. Next, we analyze an algorithm by making a precise count of the number of times the algorithm performs each type of operation. We have used this methodology to evaluate many of the parallel sorting algorithms proposed in the literature. Of these, we selected the three most promising: Batcher's bitonic sort, a parallel radix sort, and a sample sort similar to Reif and Valiant's flashsort, and implemented them on the Connection Machine model CM-2. This paper analyzes the three algorithms in detail and discusses the issues that led us to our particular implementations. On the CM-2 the predicted performance of the algorithms closely matches the observed performance, and hence our methodology can be used to tune the algorithms for optimal performance. Although our programs were designed for the CM-2, our conclusions about the merits of the three algorithms apply to other parallel machines as well.
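
The two-step methodology lends itself to a very small sketch: tabulate per-primitive costs for a machine, count the primitives an algorithm executes, and multiply. The sketch below uses invented cost and count numbers purely to show the shape of the calculation; they are not measurements from the CM-2.

```python
# Sketch of the methodology: machine characterization (cost per
# primitive) times algorithm characterization (count per primitive)
# gives a predicted running time. All numbers are hypothetical.

machine_cost = {            # microseconds per primitive (placeholders)
    "arithmetic":    0.1,
    "local_memory":  0.2,
    "general_route": 5.0,   # irregular interprocessor communication
}

def predict_time(op_counts):
    return sum(machine_cost[op] * n for op, n in op_counts.items())

# Hypothetical operation counts for one pass of a parallel radix sort:
radix_pass = {"arithmetic": 4000, "local_memory": 2000, "general_route": 100}
print(f"predicted time: {predict_time(radix_pass):.1f} microseconds")
```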


International Parallel Processing Symposium | 1995

Performance evaluation of a new parallel preconditioner

Keith D. Gremban; Gary L. Miller; Marco Zagha

The linear systems associated with large, sparse, symmetric, positive definite matrices are often solved iteratively using the preconditioned conjugate gradient method. We have developed a new class of preconditioners, support tree preconditioners, that are based on the connectivity of the graphs corresponding to the matrices and are well-structured for parallel implementation. We evaluate the performance of support tree preconditioners by comparing them against two common types of preconditioners: diagonal scaling and incomplete Cholesky. Support tree preconditioners require less overall storage and less work per iteration than incomplete Cholesky preconditioners. In terms of total execution time, support tree preconditioners outperform both diagonal scaling and incomplete Cholesky preconditioners.
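
For context, here is a sketch of the preconditioned conjugate gradient loop these preconditioners plug into. For brevity it uses diagonal (Jacobi) scaling, one of the paper's baselines, as the preconditioner; a support tree preconditioner would replace the `precondition` step, and the small SPD test matrix is an arbitrary example.

```python
# Sketch of preconditioned conjugate gradient with Jacobi scaling.
import numpy as np

def pcg(A, b, tol=1e-8, max_iter=1000):
    precondition = lambda r: r / np.diag(A)  # solve M z = r with M = diag(A)
    x = np.zeros_like(b)
    r = b - A @ x
    z = precondition(r)
    p = z.copy()
    rz = r @ z
    for _ in range(max_iter):
        Ap = A @ p
        alpha = rz / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        if np.linalg.norm(r) < tol:
            break
        z = precondition(r)
        rz_new = r @ z
        p = z + (rz_new / rz) * p
        rz = rz_new
    return x

A = np.array([[4.0, 1.0], [1.0, 3.0]])   # arbitrary SPD test matrix
print(pcg(A, np.array([1.0, 2.0])))      # approx [0.0909, 0.6364]
```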


ACM Symposium on Parallel Algorithms and Architectures | 1995

Accounting for memory bank contention and delay in high-bandwidth multiprocessors

Guy E. Blelloch; Phillip B. Gibbons; Yossi Matias; Marco Zagha

For years, the computation rate of processors has been much faster than the access rate of memory banks, and this divergence in speeds has been constantly increasing in recent years. As a result, several shared-memory multiprocessors consist of more memory banks than processors. The object of this paper is to provide a simple model (with only a few parameters) for the design and analysis of irregular parallel algorithms that will give a reasonable characterization of performance on such machines. For this purpose, we extend Valiant's bulk-synchronous parallel (BSP) model with two parameters: a parameter for memory bank delay, the minimum time for servicing requests at a bank, and a parameter for memory bank expansion, the ratio of the number of banks to the number of processors. We call this model the (d, x)-BSP. We show experimentally that the (d, x)-BSP captures the impact of bank contention and delay on the CRAY C90 and J90 for irregular access patterns, without modeling machine-specific details of these machines. The model has clarified the performance characteristics of several unstructured algorithms on the CRAY C90 and J90, and allowed us to explore tradeoffs and optimizations for these algorithms. In addition to modeling individual algorithms directly, we also consider the use of the (d, x)-BSP as a bridging model for emulating a very high-level abstract model, the Parallel Random Access Machine (PRAM). We provide matching upper and lower bounds for emulating the EREW and QRQW PRAMs on the (d, x)-BSP.
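
A toy rendering of the model's two parameters: with p processors and x·p banks, a round of memory requests is gated by the hottest bank, each of whose requests costs the delay d. The cost function below is a simplified placeholder meant to show how bank expansion and contention interact, not the paper's exact (d, x)-BSP charging scheme.

```python
# Simplified illustration of bank contention under a (d, x)-BSP-style
# cost model: time for a round of requests = d * (load on busiest bank).
from collections import Counter
import random

def superstep_time(requests, p, x, d):
    """requests: list of addresses; each maps to one of x*p banks."""
    banks = Counter(addr % (x * p) for addr in requests)
    max_load = max(banks.values())   # contention at the hottest bank
    return d * max_load              # one request per d time units per bank

p, x, d = 4, 8, 5
random_reqs = [random.randrange(10**6) for _ in range(64)]  # irregular pattern
strided_reqs = [i * (x * p) for i in range(64)]             # worst case: one bank
print(superstep_time(random_reqs, p, x, d))   # modest: load spread over 32 banks
print(superstep_time(strided_reqs, p, x, d))  # 5 * 64 = 320: every request collides
```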


Journal of Parallel and Distributed Computing | 1995

Solving linear recurrences with loop raking

Guy E. Blelloch; Siddhartha Chatterjee; Marco Zagha

We present a variation of the partition method for solving linear recurrences that is well suited to vector multiprocessors. The algorithm fully utilizes both vector and multiprocessor capabilities, and reduces the number of memory accesses and temporary memory requirements as compared to the more commonly used version of the partition method. Our variation uses a general loop restructuring technique called loop raking. We describe an implementation of this technique on the CRAY Y-MP C90, and present performance results for first- and second-order linear recurrences. On a single processor of the C90 our implementations are up to 7.3 times faster than the corresponding optimized library routines in SCILIB, an optimized mathematical library supplied by Cray Research. On four processors, we gain an additional speedup of at least 3.7.
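
A hedged sketch of the underlying partition method for a first-order recurrence x[i] = a[i]·x[i-1] + b[i]: each block is summarized by one composed affine map, a short scan over block summaries propagates values across boundaries, and the blocks are then solved independently. Loop raking is a vectorization-friendly scheduling of these steps; this sequential version shows only the numerical decomposition.

```python
# Partition method for x[i] = a[i] * x[i-1] + b[i], with x[-1] = 0.

def recurrence_partitioned(a, b, nblocks=4):
    n = len(a)
    bounds = [n * k // nblocks for k in range(nblocks + 1)]

    # Step 1 (independent per block): compose the block into one affine
    # map (A, B) so that x[end-1] = A * x[start-1] + B.
    summaries = []
    for s, e in zip(bounds, bounds[1:]):
        A, B = 1.0, 0.0
        for i in range(s, e):
            A, B = a[i] * A, a[i] * B + b[i]
        summaries.append((A, B))

    # Step 2 (short sequential scan): carry x across block boundaries.
    carries, xin = [], 0.0
    for A, B in summaries:
        carries.append(xin)
        xin = A * xin + B

    # Step 3 (independent per block): solve each block from its carry-in.
    x = [0.0] * n
    for (s, e), xin in zip(zip(bounds, bounds[1:]), carries):
        prev = xin
        for i in range(s, e):
            prev = a[i] * prev + b[i]
            x[i] = prev
    return x

a, b = [0.5] * 8, [1.0] * 8
print(recurrence_partitioned(a, b))  # 1.0, 1.5, 1.75, ... converging to 2.0
```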


Parallel Computing | 1996

Optimizing the NAS parallel BT application for the POWER CHALLENGEarray

John Brown; Marco Zagha

The POWER CHALLENGEarray is a coarse-grained collection of large SMP nodes, which creates interesting parallelization opportunities for scalable applications. The NAS BT benchmark is a classical ADI-like application with non-trivial communication requirements. The coarse-grained distributed nature of the POWER CHALLENGEarray enables unique parallelization strategies. We explore the implementation of this benchmark on this machine and discuss the general implications for scalable application development.

Collaboration


Dive into Marco Zagha's collaborations.

Top Co-Authors

Guy E. Blelloch (Carnegie Mellon University)
Jay Sipelstein (Carnegie Mellon University)
Charles E. Leiserson (Massachusetts Institute of Technology)
C. Greg Plaxton (University of Texas at Austin)
Fritz Knabe (Carnegie Mellon University)
Gary L. Miller (Carnegie Mellon University)