
Publication


Featured research published by Federico Bassetti.


conference on high performance computing (supercomputing) | 1997

Performance Evaluation of the SGI Origin2000: A Memory-Centric Characterization of LANL ASCI Applications

Harvey J. Wassermann; Olaf M. Lubeck; Yong Luo; Federico Bassetti

We compare single-processor performance of the SGI Origin and PowerChallenge and utilize a previously reported performance model for hierarchical memory systems to explain the results. Both the Origin and PowerChallenge use the same microprocessor (MIPS R10000) but have significant differences in their memory subsystems. Our memory model includes the effect of overlap between CPU and memory operations and allows us to infer the individual contributions of all three improvements in the Origin's memory architecture and relate the effectiveness of each improvement to application characteristics.


workshop on software and performance | 1998

Development and validation of a hierarchical memory model incorporating CPU- and memory-operation overlap model

Yong Luo; Olaf M. Lubeck; Harvey J. Wasserman; Federico Bassetti; Kirk W. Cameron

In this paper, we characterize application performance with a "memory-centric" view. Using a simple strategy and performance data measured on actual machines, we model the performance of a simple memory hierarchy and infer the contribution of each level in the memory system to an application's overall cycles per instruction (CPI). Included are results confirming the usefulness of the memory model over several platforms, namely the SGI Origin 2000, SGI PowerChallenge, and the Intel ASCI Red TFLOPS supercomputers. We account for the overlap of processor execution with memory accesses as a key parameter, which is not directly measurable on most systems. Given the system similarities between the Origin 2000 and the PowerChallenge, we infer the separate contributions of three major architectural features in the memory subsystem of the Origin 2000: cache size, outstanding loads-under-miss, and memory latency.
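The memory-centric model described in this abstract treats overall CPI as an ideal in-cache CPI plus a per-level miss contribution, reduced by the cycles hidden through CPU/memory overlap. A minimal sketch of that accounting, with illustrative numbers rather than the paper's measured values:

```cpp
#include <vector>

// Memory-centric CPI model (sketch): total CPI is an ideal in-cache CPI
// plus, for each memory level, (misses per instruction) * (latency in
// cycles), minus the cycles per instruction hidden by CPU/memory overlap.
struct Level {
    double misses_per_instr;  // misses per instruction at this level
    double latency_cycles;    // access latency in cycles
};

double model_cpi(double cpi_ideal, const std::vector<Level>& levels,
                 double overlap_cycles_per_instr) {
    double cpi = cpi_ideal;
    for (const Level& lv : levels)
        cpi += lv.misses_per_instr * lv.latency_cycles;
    return cpi - overlap_cycles_per_instr;
}
```

With illustrative numbers (L1 misses 0.05/instr at 10 cycles, L2 misses 0.01/instr at 80 cycles, ideal CPI 1.0, overlap 0.2 cycles/instr), the model gives 1.0 + 0.5 + 0.8 - 0.2 = 2.1 cycles per instruction.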


Lecture Notes in Computer Science | 1998

Optimizing Transformations of Stencil Operations for Parallel Object-Oriented Scientific Frameworks on Cache-Based Architectures

Federico Bassetti; Kei Davis; Daniel J. Quinlan

High-performance scientific computing relies increasingly on high-level, large-scale, object-oriented software frameworks to manage both algorithmic complexity and the complexities of parallelism: distributed data management, process management, inter-process communication, and load balancing. This encapsulation of data management, together with the prescribed semantics of a typical fundamental component of such object-oriented frameworks--a parallel or serial array class library--provides an opportunity for increasingly sophisticated compile-time optimization techniques. This paper describes two optimizing transformations suitable for certain classes of numerical algorithms, one for reducing the cost of inter-processor communication, and one for improving cache utilization; demonstrates and analyzes the resulting performance gains; and indicates how these transformations are being automated.
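The cache-utilization transformation described here is, in essence, loop tiling: the stencil traversal is restructured so each tile of the array is reused while it is still cache-resident. A minimal serial sketch of a tiled 3-point stencil (the tile size and the 3-point kernel are illustrative choices, not taken from the paper):

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Tiled 1D 3-point stencil: process the interior in tiles so the working
// set of each tile stays cache-resident. The results are identical to the
// straightforward loop; only the traversal order changes.
void stencil_blocked(const std::vector<double>& in, std::vector<double>& out,
                     std::size_t tile = 64) {
    const std::size_t n = in.size();
    if (n < 3) return;  // no interior points to update
    for (std::size_t start = 1; start < n - 1; start += tile) {
        const std::size_t end = std::min(start + tile, n - 1);
        for (std::size_t i = start; i < end; ++i)
            out[i] = (in[i - 1] + in[i] + in[i + 1]) / 3.0;
    }
}
```

Boundary points are left untouched; an automated transformation, as envisioned in the paper, would derive the tiling from the array-class semantics rather than hand-coding it.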


conference on high performance computing (supercomputing) | 1998

OVERTURE: An Object-Oriented Framework for High Performance Scientific Computing

Federico Bassetti; David L. Brown; Kei Davis; William D. Henshaw; Daniel J. Quinlan

The Overture Framework is an object-oriented environment for solving PDEs on serial and parallel architectures. It is a collection of C++ libraries that enables the use of finite difference and finite volume methods at a level that hides the details of the associated data structures, as well as the details of the parallel implementation. It is based on the A++/P++ array class library and is designed for solving problems on a structured grid or a collection of structured grids. In particular, it can use curvilinear grids, adaptive mesh refinement and the composite overlapping grid methods to represent problems with complex moving geometry. This paper introduces Overture, its motivation, and specifically the aspects of the design central to portability and high performance. In particular we focus on the mechanisms within Overture that permit a hierarchy of abstractions and those mechanisms which permit their efficiency on advanced serial and parallel architectures. We expect that these same mechanisms will become increasingly important within other object-oriented frameworks in the future.


merged international parallel processing symposium and symposium on parallel and distributed processing | 1998

C++ expression template performance issues in scientific computing

Federico Bassetti; Kei Davis; Daniel J. Quinlan

The ever-increasing size and complexity of software applications and libraries in parallel scientific computing is making implementation in the programming languages traditional for this field, FORTRAN 77 and C, impractical. The major impediment to the progression to a higher-level language such as C++ is attaining FORTRAN 77 or C performance, which is considered absolutely necessary by many practitioners. The use of template metaprogramming in C++, in the form of so-called expression templates to generate custom C++ code, holds great promise for getting C performance from C++ in the context of operations on array-like objects. Several sophisticated expression template implementations of parallel array-class libraries exist, and in certain circumstances their promise of performance is realized. Unfortunately this is not uniformly the case; this paper explores the major reasons why. A more complete version of this paper may be obtained from http://www.c3.lanl.gov/~kei/ipps98.html.
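The expression-template technique discussed here can be illustrated in miniature: operator+ returns a lightweight expression object rather than a new array, and the whole expression is evaluated in a single loop at assignment time. A minimal sketch, not the A++/P++ implementation:

```cpp
#include <cstddef>
#include <vector>

// Minimal expression-template sketch: Add<L,R> merely records its
// operands; no temporary array exists until assignment, where the whole
// expression tree is evaluated element-by-element in one loop.
template <class L, class R>
struct Add {
    const L& l;
    const R& r;
    double operator[](std::size_t i) const { return l[i] + r[i]; }
    std::size_t size() const { return l.size(); }
};

struct Array {
    std::vector<double> data;
    explicit Array(std::size_t n, double v = 0.0) : data(n, v) {}
    double operator[](std::size_t i) const { return data[i]; }
    double& operator[](std::size_t i) { return data[i]; }
    std::size_t size() const { return data.size(); }
    template <class E>
    Array& operator=(const E& e) {  // single evaluation loop, no temporaries
        for (std::size_t i = 0; i < size(); ++i) data[i] = e[i];
        return *this;
    }
};

template <class L, class R>
Add<L, R> operator+(const L& l, const R& r) { return {l, r}; }
```

In `a = b + c + b`, no intermediate Array is ever allocated; the nested Add expression is walked once per element inside operator=. The paper's point is that this promise does not always survive contact with real compilers.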


conference on scientific computing | 1997

Optimization of Data-Parallel Field Expressions in the POOMA Framework

William Humphrey; Steve Karmesin; Federico Bassetti; John Reynders

The POOMA framework is a C++ class library for the development of large-scale parallel scientific applications. POOMA's Field class implements a templated, multidimensional, data-parallel array that partitions data in a simulation domain into sub-blocks. These subdomain blocks are used on a parallel computer in data-parallel Field expressions. In this paper we describe the design of Fields, their implementation in the POOMA framework, and their performance on a Silicon Graphics Inc. Origin 2000. We focus on the aspects of the Field implementation which relate to efficient memory use and improvement of run-time performance: reducing the number of temporaries through expression templates, reducing the total memory used by compressing constant regions, and performing calculations on sparsely populated Fields by using sparse index lists.
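One of the memory optimizations mentioned, compressing constant regions, can be sketched as a block that stores a single value while all of its elements are equal and expands to full storage only on a non-uniform write. A hypothetical miniature, not the POOMA implementation:

```cpp
#include <cstddef>
#include <vector>

// Sketch of a compressible block: while every element holds the same
// value, only that value is stored; the block expands to full storage on
// the first write that breaks uniformity.
class CompressedBlock {
    std::size_t n_;
    bool compressed_ = true;
    double value_;                  // representative value when compressed
    std::vector<double> data_;     // allocated only after expansion
public:
    explicit CompressedBlock(std::size_t n, double v = 0.0)
        : n_(n), value_(v) {}
    bool compressed() const { return compressed_; }
    double get(std::size_t i) const { return compressed_ ? value_ : data_[i]; }
    void set(std::size_t i, double v) {
        if (compressed_) {
            if (v == value_) return;       // still uniform: stay compressed
            data_.assign(n_, value_);      // expand on first differing write
            compressed_ = false;
        }
        data_[i] = v;
    }
};
```

A large Field whose blocks are mostly constant (e.g. zero-initialized regions) then costs one double per block instead of one per element.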


conference on scientific computing | 1997

A Comparison of Performance-Enhancing Strategies for Parallel Numerical Object-Oriented Frameworks

Federico Bassetti; Kei Davis; Daniel J. Quinlan

Performance short of that of C or FORTRAN 77 is a significant obstacle to general acceptance of object-oriented C++ frameworks in high-performance parallel scientific computing; nonetheless, their value in simplifying complex computations is inarguable. Examples of good performance for object-oriented libraries/frameworks are interesting, but a systematic analysis of performance issues has not been done. This paper explores a few of these issues and reports on three mechanisms for enhancing the performance of object-oriented frameworks for numerical computation. The first is binary operator overloading implemented with substantial internal optimizations, the second is expression templates, and the third is an optimizing preprocessor. The first two have been completely implemented and are available in the A++/P++ array class library; the third, ROSE++, is a work in progress. This paper provides some perspective on the types of optimizations that we consider important in our numerical applications using OVERTURE involving complex geometry and AMR on parallel architectures.
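The cost that all three mechanisms attack is visible in the naive alternative: a plainly overloaded binary operator allocates a full temporary and makes its own pass over memory for every subexpression. A miniature illustration (not the A++/P++ code):

```cpp
#include <cstddef>
#include <vector>

int naive_temporaries = 0;  // counts allocations, for illustration only

// Naive overloaded array addition: every operator+ builds a full
// temporary array in its own loop, so `a + b + c` costs two allocations
// and two passes over memory instead of one.
struct NaiveArray {
    std::vector<double> data;
    explicit NaiveArray(std::size_t n, double v = 0.0) : data(n, v) {}
};

NaiveArray operator+(const NaiveArray& x, const NaiveArray& y) {
    ++naive_temporaries;
    NaiveArray r(x.data.size());
    for (std::size_t i = 0; i < x.data.size(); ++i)
        r.data[i] = x.data[i] + y.data[i];
    return r;
}
```

Internally optimized operator overloading, expression templates, and a preprocessor are three routes to eliminating these extra temporaries and loops.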


Lecture Notes in Computer Science | 1999

Improving Cache Utilization of Linear Relaxation Methods: Theory and Practice

Federico Bassetti; Kei Davis; Madhav V. Marathe; Daniel J. Quinlan; Bobby Philip

Application codes reliably achieve performance far less than the advertised capabilities of existing architectures, and this problem is worsening with increasingly parallel machines. For large-scale numerical applications, stencil operations often impose the greater part of the computational cost, and the primary sources of inefficiency are the costs of message passing and poor cache utilization. This paper proposes and demonstrates optimizations for stencil and stencil-like computations, for both serial and parallel environments, that ameliorate these sources of inefficiency. Additionally, we argue that when stencil-like computations are encoded at a high level using object-oriented parallel array class libraries, these optimizations, which are beyond the capability of compilers, may be automated. The automation of these optimizations is particularly important because the transformations behind cache-based optimizations can be complicated unreasonably by architecture-specific peculiarities. This paper briefly presents the approach toward the automation of these transformations.
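The cache optimization for relaxation methods is commonly a form of temporal blocking: each tile is copied with a halo wide enough to run several sweeps locally, so the tile enters cache once rather than once per sweep. A serial sketch for a 1D Jacobi relaxation with fixed endpoint boundaries; the naming and the tile/sweep sizes are illustrative, not the paper's:

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Temporal blocking ("overlapped tiling") for 1D Jacobi relaxation: each
// tile is copied with a halo of width `steps`, then `steps` sweeps run on
// the local copy, so the tile is brought into cache once rather than once
// per sweep. Endpoints u[0] and u[n-1] are fixed (Dirichlet) boundaries.
// Results are identical to `steps` plain whole-array sweeps.
void jacobi_blocked(std::vector<double>& u, int steps, std::size_t tile) {
    const std::size_t n = u.size();
    if (n < 3) return;
    std::vector<double> result(u);
    for (std::size_t s = 1; s < n - 1; s += tile) {
        const std::size_t e = std::min(s + tile, n - 1);
        const std::size_t lo = (s > std::size_t(steps)) ? s - steps : 0;
        const std::size_t hi = std::min(e + std::size_t(steps), n);
        std::vector<double> a(u.begin() + lo, u.begin() + hi);
        std::vector<double> b(a);
        for (int t = 0; t < steps; ++t) {      // all sweeps on the local copy
            for (std::size_t i = 1; i + 1 < a.size(); ++i)
                b[i] = 0.5 * (a[i - 1] + a[i + 1]);
            std::swap(a, b);
        }
        // Only the halo-free interior [s, e) is valid after `steps` sweeps.
        std::copy(a.begin() + (s - lo), a.begin() + (e - lo),
                  result.begin() + s);
    }
    u = result;
}
```

The halo points are recomputed redundantly by neighboring tiles; that redundancy, and its interaction with architecture-specific cache geometry, is exactly the complexity the paper argues should be automated.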


Computers & Electrical Engineering | 2000

A lower bound for quantifying overlap effects: An empirical approach

Federico Bassetti

Among the many features that are implemented in today’s microprocessors there are some that have the capability of reducing the execution time via overlapping of different operations. Overlapping of instructions with other instructions, and overlapping of computation with memory activities, are the main ways in which execution time is reduced. In this paper we will introduce a notion of overlap and its definition, and a few different ways to capture its effects. We will characterize some of the ASCI benchmarks using the overlap and some other quantities related to it. Also, we will present a characterization of the overlap effects using a lower bound derived empirically from measured data. We will conclude by using the lower bound to estimate other components of the overall execution time.
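The empirical lower bound can be stated simply: with no overlap at all, execution would take the sum of pure-CPU cycles and memory-stall cycles, so any measured time below that sum must have been hidden by overlap, giving overlap >= T_cpu + T_mem - T_measured. A one-line sketch with illustrative numbers, not the paper's measurements:

```cpp
#include <algorithm>

// Empirical lower bound on CPU/memory overlap (sketch): without any
// overlap, execution would take t_cpu + t_mem cycles; any shortfall in
// the measured time must have been hidden by overlapped operation.
double overlap_lower_bound(double t_cpu, double t_mem, double t_measured) {
    return std::max(0.0, t_cpu + t_mem - t_measured);
}
```

For example, 100 pure-CPU cycles plus 60 memory-stall cycles with a measured 130 cycles implies at least 30 cycles of overlap; the true overlap may be larger, which is why the quantity is only a lower bound.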


international performance computing and communications conference | 1998

A lower bound for quantifying overlap effects: an empirical approach

Federico Bassetti

Among the many features that are implemented in today's microprocessors there are some that have the capability of reducing the execution time via overlapping of different operations. Overlapping of instructions with other instructions, and overlapping of computation with memory activities, are the main ways in which execution time is reduced. In this paper we will introduce a notion of overlap and its definition, and a few different ways to capture its effects. We will characterize some of the ASCI benchmarks using the overlap and some other quantities related to it. Also, we will present a characterization of the overlap effects using a lower bound derived empirically from measured data. We will conclude by using the lower bound to estimate other components of the overall execution time.

Collaboration


Dive into Federico Bassetti's collaborations.

Top Co-Authors

Kei Davis (Los Alamos National Laboratory)
Daniel J. Quinlan (Los Alamos National Laboratory)
Olaf M. Lubeck (Los Alamos National Laboratory)
Yong Luo (Los Alamos National Laboratory)
Harvey J. Wasserman (Los Alamos National Laboratory)
Harvey Wasserman (Lawrence Berkeley National Laboratory)
Adolfy Hoisie (Pacific Northwest National Laboratory)
Bobby Philip (Lawrence Livermore National Laboratory)
David L. Brown (Los Alamos National Laboratory)