Calin Cascaval | Researchain


Publication

Featured research published by Calin Cascaval.

international symposium on computer architecture | 2006

Bulk Disambiguation of Speculative Threads in Multiprocessors

Luis Ceze; James Tuck; Josep Torrellas; Calin Cascaval

Transactional Memory (TM), Thread-Level Speculation (TLS), and Checkpointed multiprocessors are three popular architectural techniques based on the execution of multiple, cooperating speculative threads. In these environments, correctly maintaining data dependences across threads requires mechanisms for disambiguating addresses across threads, invalidating stale cache state, and making committed state visible. These mechanisms are both conceptually involved and hard to implement. In this paper, we present Bulk, a novel approach to simplify these mechanisms. The idea is to hash-encode a thread's access information in a concise signature, and then support in hardware signature operations that efficiently process sets of addresses. Such operations implement the mechanisms described. Bulk operations are inexact but correct, and provide substantial conceptual and implementation simplicity. We evaluate Bulk in the context of TLS using SPECint2000 codes and TM using multithreaded Java workloads. Despite its simplicity, Bulk performs competitively with more complex schemes. We also find that signature configuration is a key design parameter.
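The signature idea in the abstract above can be illustrated with a small software sketch. This is not the hardware design from the paper: the `Signature` class, its bit width, and the SHA-256-based hash are all assumptions chosen for illustration, in the style of a Bloom filter. It shows why such operations are "inexact but correct": an overlap test may report a conflict that does not exist, but it never misses a real one.

```python
import hashlib


class Signature:
    """Fixed-size bit-vector summary of a set of addresses (Bloom-filter style).

    Hypothetical illustration; the bit width and hash functions are assumptions.
    """

    def __init__(self, bits=256, hashes=2):
        self.bits = bits
        self.hashes = hashes
        self.vector = 0  # a Python int used as a bit vector

    def _positions(self, address):
        # Derive `hashes` bit positions from the address.
        for i in range(self.hashes):
            digest = hashlib.sha256(f"{i}:{address}".encode()).digest()
            yield int.from_bytes(digest[:4], "big") % self.bits

    def add(self, address):
        for pos in self._positions(address):
            self.vector |= 1 << pos

    def may_contain(self, address):
        # True for every added address; may also be True for others.
        return all((self.vector >> pos) & 1 for pos in self._positions(address))

    def may_overlap(self, other):
        # A shared address sets the same bits in both vectors, so an empty
        # bitwise intersection proves the two address sets are disjoint.
        return (self.vector & other.vector) != 0


# Usage: compare a committing thread's write set with another thread's read set.
writes = Signature()
writes.add(0x1000)
writes.add(0x2040)
reads = Signature()
reads.add(0x2040)
conflict = writes.may_overlap(reads)  # True: the sets share 0x2040
```

A larger `bits` value lowers the false-positive rate at the cost of a bigger signature, which is one way to read the abstract's remark that signature configuration is a key design parameter.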

international conference on parallel architectures and compilation techniques | 2003

Characterizing and predicting program behavior and its variability

Evelyn Duesterwald; Calin Cascaval; Sandhya Dwarkadas

To reach the next level of performance and energy efficiency, optimizations are increasingly applied in a dynamic and adaptive manner. Current adaptive systems are typically reactive and optimize hardware or software in response to detecting a shift in program behavior. We argue that program behavior variability requires adaptive systems to be predictive rather than reactive. In order to be effective, systems need to adapt according to future rather than most recent past behavior. We explore the potential of incorporating prediction into adaptive systems. We study the time-varying behavior of programs using metrics derived from hardware counters on two different microarchitectures. Our evaluation shows that programs do indeed exhibit significant behavior variation even at a granularity of millions of instructions. In addition, while the actual behavior across metrics may be different, periodicity in the behavior is shared across metrics. We exploit these characteristics in the design of on-line statistical and table-based predictors. We introduce a new class of predictors, cross-metric predictors, that use one metric to predict another, thus making possible an efficient coupling of multiple predictors. We evaluate these predictors on the SPECcpu2000 benchmark suite and show that table-based predictors outperform statistical predictors by as much as 69% on benchmarks with high variability.
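The contrast between reactive and table-based prediction can be sketched as follows. This is a minimal illustration under stated assumptions, not the predictors evaluated in the paper: metric values are assumed to be already discretized, and the class names, the history length, and the `count_hits` helper are hypothetical.

```python
from collections import deque


class LastValuePredictor:
    """Reactive baseline: assume the next interval repeats the latest one."""

    def __init__(self):
        self.last = None

    def predict(self):
        return self.last

    def update(self, value):
        self.last = value


class TablePredictor:
    """Maps a short history of values to the value that followed it last time."""

    def __init__(self, history_length=2):
        self.history = deque(maxlen=history_length)
        self.table = {}

    def predict(self):
        key = tuple(self.history)
        if key in self.table:
            return self.table[key]
        # Unseen pattern: fall back to the most recent value, if any.
        return self.history[-1] if self.history else None

    def update(self, value):
        # Record what followed the current history once the history is full.
        if len(self.history) == self.history.maxlen:
            self.table[tuple(self.history)] = value
        self.history.append(value)


def count_hits(predictor, trace):
    """Count intervals whose value was predicted exactly."""
    hits = 0
    for value in trace:
        if predictor.predict() == value:
            hits += 1
        predictor.update(value)
    return hits


# A periodic trace: the last-value predictor is always one step behind,
# while the table learns the repeating pattern after one full period.
trace = [1, 2, 3] * 10
last_value_hits = count_hits(LastValuePredictor(), trace)
table_hits = count_hits(TablePredictor(), trace)
```

A cross-metric predictor in this style would key the table on the history of one metric and store the observed value of another; that variant is not shown here.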

acm sigplan symposium on principles and practice of parallel programming | 2009

How much parallelism is there in irregular applications?

Milind Kulkarni; Martin Burtscher; Rajasekhar Inkulu; Keshav Pingali; Calin Cascaval

Irregular programs are programs organized around pointer-based data structures such as trees and graphs. Recent investigations by the Galois project have shown that many irregular programs have a generalized form of data-parallelism called amorphous data-parallelism. However, in many programs, amorphous data-parallelism cannot be uncovered using static techniques, and its exploitation requires runtime strategies such as optimistic parallel execution. This raises a natural question: how much amorphous data-parallelism actually exists in irregular programs? In this paper, we describe the design and implementation of a tool called ParaMeter that produces parallelism profiles for irregular programs. Parallelism profiles are an abstract measure of the amount of amorphous data-parallelism at different points in the execution of an algorithm, independent of implementation-dependent details such as the number of cores, cache sizes, load-balancing, etc. ParaMeter can also generate constrained parallelism profiles for a fixed number of cores. We show parallelism profiles for seven irregular applications, and explain how these profiles provide insight into the behavior of these applications.

acm sigplan symposium on principles and practice of parallel programming | 2007

Implicit parallelism with ordered transactions

Christoph von Praun; Luis Ceze; Calin Cascaval

Implicit Parallelism with Ordered Transactions (IPOT) is an extension of sequential or explicitly parallel programming models to support speculative parallelization. The key idea is to specify opportunities for parallelization in a sequential program using annotations similar to transactions. Unlike explicit parallelism, IPOT annotations do not require the absence of data dependence, since the parallelization relies on runtime support for speculative execution. IPOT as a parallel programming model is determinate, i.e., program semantics are independent of the thread scheduling. For optimization, non-determinism can be introduced selectively. We describe the programming model of IPOT and an online tool that recommends boundaries of ordered transactions by observing a sequential execution. On three example HPC workloads we demonstrate that our method is effective in identifying opportunities for fine-grain parallelization. Using the automated task recommendation tool, we were able to perform the parallelization of each program within a few hours.

SIGPLAN Notices | 2003

Calculating stack distances efficiently

George S. Almasi; Calin Cascaval; David A. Padua

This paper describes our experience using the stack processing algorithm [6] for estimating the number of cache misses in scientific programs. By using a new data structure and various optimization techniques we obtain instrumented run-times within 50 to 100 times the original optimized run-times of our benchmarks.
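For readers unfamiliar with stack distances, the following naive sketch shows what is being computed and how it relates to cache misses. It deliberately uses a plain list with linear search, so it is not the efficient data structure the paper describes; the function names are hypothetical, and the cache is assumed to be fully associative with LRU replacement.

```python
def stack_distances(trace):
    """Return the LRU stack distance of each reference.

    The distance is the number of distinct addresses touched since the
    previous reference to the same address; first touches yield None.
    """
    stack = []  # most recently used address at the end
    distances = []
    for address in trace:
        if address in stack:
            index = stack.index(address)
            distances.append(len(stack) - 1 - index)
            stack.pop(index)
        else:
            distances.append(None)  # cold (compulsory) miss
        stack.append(address)
    return distances


def lru_misses(trace, cache_lines):
    """Count misses in a fully associative LRU cache with `cache_lines` entries."""
    return sum(
        1 for d in stack_distances(trace) if d is None or d >= cache_lines
    )


# Usage: one pass over the trace gives miss counts for any cache size.
trace = list("abcab")
distances = stack_distances(trace)
misses_with_two_lines = lru_misses(trace, 2)
misses_with_three_lines = lru_misses(trace, 3)
```

Each reference costs time linear in the number of distinct addresses here, which is the overhead that a better data structure is meant to reduce.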

languages and compilers for parallel computing | 1999

Compile-Time Based Performance Prediction

Calin Cascaval; Luiz De Rose; David A. Padua; Daniel A. Reed

In this paper we present results we obtained using a compiler to predict performance of scientific codes. The compiler, Polaris [3], is both the primary tool for estimating the performance of a range of codes, and the beneficiary of the results obtained from predicting the program behavior at compile time. We show that a simple compile-time model, augmented with profiling data obtained using very light instrumentation, can be accurate within 20% (on average) of the measured performance for codes using both dense and sparse computational methods.

international conference on parallel architectures and compilation techniques | 2005

Multiple page size modeling and optimization

Calin Cascaval; Evelyn Duesterwald; Peter F. Sweeney; Robert W. Wisniewski

With the growing awareness that individual hardware cores will not continue to produce the same level of performance improvement, there is a need to develop an integrated approach to performance optimization. In this paper we present a paradigm for continuous program optimization (CPO), whereby automatic agents monitor and optimize application and system performance. The monitoring data is used to analyze and create models of application and system behavior. Using this analysis, we describe how CPO agents can improve the performance of both the application and the underlying system. Using the CPO paradigm, we implemented cooperating page size optimization agents that automatically optimize large page usage. An offline agent uses vertically integrated performance data to produce a page size benefit analysis for different categories of data structures within an application. We show how an online CPO agent can use the results of the predictive analysis to automatically improve application performance. We validate that the predictions made by the CPO agent reflect the actual performance gains of up to 60% across a range of scientific applications including the SPEC-cpu2000 floating point benchmarks and two large high performance computing (HPC) applications.

acm sigplan symposium on principles and practice of parallel programming | 2008

Modeling optimistic concurrency using quantitative dependence analysis

Christoph von Praun; Rajesh Bordawekar; Calin Cascaval

This work presents a quantitative approach to analyze parallelization opportunities in programs with irregular memory access where potential data dependencies mask available parallelism. The model captures data and causal dependencies among critical sections as algorithmic properties and quantifies them as a density computed over the number of executed instructions. The model abstracts from runtime aspects such as scheduling, the number of threads, and concurrency control used in a particular parallelization. We illustrate the model on several applications requiring ordered and unordered execution of critical sections. We describe a run-time tool that computes the dependence densities from a deterministic single-threaded program execution. This density metric provides insights into the potential for optimistic parallelization, opportunities for algorithmic scheduling, and performance defects due to synchronization bottlenecks. Based on the results of our analysis, we classify applications into three categories with low, medium, and high dependence densities. Applications with low dependence density are naturally good candidates for optimistic concurrency, applications with medium density may require a scheduler that is aware of the algorithmic dependencies for optimistic concurrency to be effective, and applications with high dependence density may not be suitable for parallelization.

International Journal of Parallel Programming | 2002

Demonstrating the Scalability of a Molecular Dynamics Application on a Petaflops Computer

George S. Almasi; Calin Cascaval; José G. Castaños; Monty M. Denneau; Wilm E. Donath; Maria Eleftheriou; Mark E. Giampapa; C. T. Howard Ho; Derek Lieber; José E. Moreira; Dennis M. Newns; Marc Snir; Henry S. Warren

The IBM Blue Gene/C parallel computer aims to demonstrate the feasibility of a cellular architecture computer with millions of concurrent threads of execution. One of the major challenges in this project is showing that applications can successfully scale to this massive amount of parallelism. In this paper we demonstrate that the simulation of protein folding using classical molecular dynamics falls in this category. Starting from the sequential version of a well known molecular dynamics code, we developed a new parallel implementation that exploited the multiple levels of parallelism present in the Blue Gene/C cellular architecture. We performed both analytical and simulation studies of the behavior of this application when executed on a very large number of threads. As a result, we demonstrate that this class of applications can execute efficiently on a large cellular machine.

IBM Journal of Research and Development | 2006

Performance and environment monitoring for continuous program optimization

Calin Cascaval; Evelyn Duesterwald; Peter F. Sweeney; Robert W. Wisniewski

Our research is aimed at characterizing, understanding, and exploiting the interactions between hardware and software to improve system performance. We have developed a paradigm for continuous program optimization (CPO) that assists in and automates the challenging task of performance tuning, and we have implemented an initial prototype of this paradigm. At the core of our implementation is a performance- and environment-monitoring (PEM) component that vertically integrates performance events from various layers in the execution stack. CPO agents use the data provided by PEM to detect, diagnose, and alleviate performance problems on existing systems. In addition, CPO can be used to improve future architecture designs by analyzing PEM data collected on a whole-system simulator while varying architectural characteristics. In this paper, we present the CPO paradigm, describe an initial implementation that includes PEM as a component, and discuss two CPO clients.
