Barbara Kreaseck
La Sierra University
Publications
Featured research published by Barbara Kreaseck.
IEEE International Conference on High Performance Computing, Data, and Analytics | 2004
Michelle Mills Strout; Larry Carter; Jeanne Ferrante; Barbara Kreaseck
In modern computers, a program’s data locality can affect performance significantly. This paper details full sparse tiling, a run-time reordering transformation that improves the data locality for stationary iterative methods such as Gauss–Seidel operating on sparse matrices. In scientific applications such as finite element analysis, these iterative methods dominate the execution time. Full sparse tiling chooses a permutation of the rows and columns of the sparse matrix, and then an order of execution that achieves better data locality. We prove that full sparse-tiled Gauss–Seidel generates a solution that is bitwise identical to traditional Gauss–Seidel on the permuted matrix. We also present measurements of the performance improvements and the overheads of full sparse tiling and of cache blocking for irregular grids, a related technique developed by Douglas et al.
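For readers unfamiliar with the kernel in question, a minimal sketch of one Gauss–Seidel sweep over a compressed sparse row (CSR) matrix follows; function and variable names are ours. Full sparse tiling, the paper's contribution, additionally chooses the permutation and execution order around this kernel, which the sketch does not attempt.

```python
# Minimal sketch: one in-place Gauss-Seidel sweep over a CSR matrix.
# Illustrative names; full sparse tiling would additionally permute the
# matrix and reorder the schedule of these updates across iterations.

def gauss_seidel_sweep(rowptr, colidx, values, b, x):
    """Relax A x = b once, row by row, updating x in place."""
    n = len(b)
    for i in range(n):
        acc, diag = b[i], 0.0
        for k in range(rowptr[i], rowptr[i + 1]):
            j = colidx[k]
            if j == i:
                diag = values[k]
            else:
                acc -= values[k] * x[j]  # rows j < i already hold new values
        x[i] = acc / diag
    return x
```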
International Conference on Parallel Processing | 2006
Michelle Mills Strout; Barbara Kreaseck; Paul D. Hovland
Message passing via MPI is widely used in single-program, multiple-data (SPMD) parallel programs. Existing data-flow frameworks do not model the semantics of message-passing SPMD programs, which can result in less precise and even incorrect analysis results. We present a data-flow analysis framework for performing interprocedural analysis of message-passing SPMD programs. The framework is based on the MPI-ICFG representation, which is an interprocedural control-flow graph (ICFG) augmented with communication edges between possible send and receive pairs and partial context sensitivity. We show how to formulate nonseparable data-flow analyses within our framework using reaching constants as a canonical example. We also formulate and provide experimental results for the nonseparable analysis, activity analysis. Activity analysis is a domain-specific analysis used to reduce the computation and storage requirements for automatically differentiated MPI programs. Automatic differentiation is important for application domains such as climate modeling, electronic device simulation, oil reservoir simulation, medical treatment planning, and computational economics, to name a few. Our experimental results show that using the MPI-ICFG data-flow analysis framework improves the precision of activity analysis and, as a result, significantly reduces memory requirements for the automatically differentiated versions of a set of parallel benchmarks, including some of the NAS parallel benchmarks.
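The shape of such an analysis can be sketched as a standard worklist fixpoint; the version below uses reaching constants, the paper's canonical example, over a generic graph of basic blocks. All names are illustrative, and in the MPI-ICFG setting the send/receive communication edges would simply appear as additional edges in the same graph.

```python
# Hedged sketch: a worklist fixpoint for reaching constants over a graph of
# basic blocks. In the paper's MPI-ICFG, communication edges between send
# and receive pairs would feed this same propagation as extra edges.

TOP, BOTTOM = "unknown", "not-a-constant"

def meet(a, b):
    if a == TOP: return b
    if b == TOP: return a
    return a if a == b else BOTTOM

def reaching_constants(blocks, edges, entry):
    # blocks: name -> list of (var, rhs) assignments; rhs is an int or a var name
    # edges: list of (src, dst) pairs
    preds = {n: [p for p, s in edges if s == n] for n in blocks}
    state = {n: {} for n in blocks}          # per-block: var -> constant / BOTTOM
    worklist = [entry]
    while worklist:
        n = worklist.pop()
        env = {}
        for p in preds[n]:                   # meet over all predecessors
            for v, c in state[p].items():
                env[v] = meet(env.get(v, TOP), c)
        for var, rhs in blocks[n]:           # transfer function for this block
            env[var] = rhs if isinstance(rhs, int) else env.get(rhs, BOTTOM)
        if env != state[n]:                  # changed: re-queue successors
            state[n] = env
            worklist.extend(s for p, s in edges if p == n)
    return state
```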
International Parallel and Distributed Processing Symposium | 2003
Barbara Kreaseck; Larry Carter; Henri Casanova; Jeanne Ferrante
In this paper, we investigate protocols for scheduling applications that consist of large numbers of identical, independent tasks on large-scale computing platforms. By imposing a tree structure on an overlay network of computing nodes, our previous work showed that it is possible to compute the schedule that leads to the optimal steady-state task completion rate. However, implementing this optimal schedule in practice, without prohibitive global coordination of all the computing nodes or unlimited buffers, remained an open question. To address this question, in this paper we develop autonomous scheduling protocols, i.e., distributed scheduling algorithms by which each node makes scheduling decisions based solely on locally available information. Our protocols have two variants: with non-interruptible and with interruptible communications. Further, we evaluate both protocols using simulations on randomly generated trees. We show that the non-interruptible communication version may need a prohibitive number of buffers at each node. However, our autonomous protocol with interruptible communication and only 3 buffers per node reaches the optimal steady-state performance in over 99.5% of our simulations. The autonomous scheduling approach is inherently scalable and adaptable, and thus ideally suited to currently emerging computing platforms. In particular, this work has a direct impact on the deployment of large applications on Grid and peer-to-peer computing platforms.
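To give a flavor of what "autonomous" means here, the toy rule below makes every decision from purely local state: a node accepts tasks only while its bounded buffer has room and forwards surplus tasks to whichever child currently has the most free space. The policy, names, and thresholds are ours for illustration; this is not the published protocol.

```python
# Toy sketch of an autonomous, purely local scheduling rule on a tree overlay.
# The 3-buffer bound echoes the paper's setting, but the concrete policy
# below is illustrative, not the protocol evaluated in the paper.

from collections import deque

class Node:
    def __init__(self, compute_time, buffers=3):
        self.compute_time = compute_time   # seconds per task (local speed)
        self.queue = deque()               # locally buffered tasks
        self.buffers = buffers             # bounded buffer space
        self.children = []

    def wants_task(self):
        # Local decision: accept work only while buffer space remains.
        return len(self.queue) < self.buffers

    def receive(self, task):
        self.queue.append(task)

    def step(self):
        """One scheduling step using only locally visible information."""
        if not self.queue:
            return None
        task = self.queue.popleft()        # keep one task to compute locally
        # Forward surplus tasks toward the child with the emptiest buffer.
        for child in sorted(self.children, key=lambda c: len(c.queue)):
            if self.queue and child.wants_task():
                child.receive(self.queue.popleft())
        return task
```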
Languages and Compilers for Parallel Computing | 2002
Michelle Mills Strout; Larry Carter; Jeanne Ferrante; Jonathan Freeman; Barbara Kreaseck
Finite element problems are often solved using multigrid techniques. The most time-consuming part of multigrid is the iterative smoother, such as Gauss-Seidel. To improve performance, iterative smoothers can exploit parallelism, intra-iteration data reuse, and inter-iteration data reuse. Current methods for parallelizing Gauss-Seidel on irregular grids, such as multi-coloring and owner-computes based techniques, exploit parallelism and possibly intra-iteration data reuse but not inter-iteration data reuse. Sparse tiling techniques were developed to improve intra-iteration and inter-iteration data locality in iterative smoothers. This paper describes how sparse tiling can additionally provide parallelism. Our results show the effectiveness of Gauss-Seidel parallelized with sparse tiling techniques on shared-memory machines, specifically compared to owner-computes based Gauss-Seidel methods. The latter employ only parallelism and intra-iteration locality. Our results support the premise that better performance occurs when all three performance aspects (parallelism, intra-iteration data locality, and inter-iteration data locality) are combined.
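Where the parallelism comes from can be pictured as follows: sparse tiling partitions the iteration space into tiles whose dependences form a DAG, and tiles with no unfinished predecessors can run concurrently. The schematic driver below assumes the tiles and their dependence DAG are already given (constructing them is the hard part the paper addresses); all names are illustrative.

```python
# Schematic driver: execute sparse tiles in parallel, wavefront by wavefront.
# Building the tiles and their dependence DAG is what sparse tiling does;
# here both are assumed as inputs.

from concurrent.futures import ThreadPoolExecutor

def run_tiles(tiles, deps, execute_tile, workers=4):
    """tiles: dict tile_id -> payload; deps: dict tile_id -> set of prereq ids."""
    prereqs = {t: set(deps.get(t, ())) for t in tiles}
    done = set()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        while len(done) < len(tiles):
            # Tiles whose predecessors all finished form one parallel wavefront.
            ready = [t for t, pre in prereqs.items()
                     if t not in done and pre <= done]
            list(pool.map(lambda t: execute_tile(tiles[t]), ready))
            done.update(ready)
```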
Parallel Computing | 2016
Michelle Mills Strout; Alan LaMielle; Larry Carter; Jeanne Ferrante; Barbara Kreaseck; Catherine Olschanowsky
Highlights: the Sparse Polyhedral Framework (SPF) to specify loop transformations for irregular codes; a code generator prototype built on SPF; experimental results comparing generated code against hand-coded inspectors and executors. Applications that manipulate sparse data structures contain memory reference patterns that are unknown at compile time due to indirect accesses such as A[B[i]]. To exploit parallelism and improve locality in such applications, prior work has developed a number of Run-Time Reordering Transformations (RTRTs). This paper presents the Sparse Polyhedral Framework (SPF) for specifying RTRTs and compositions thereof, and algorithms for automatically generating efficient inspector and executor code to implement such transformations. Experimental results indicate that the performance of automatically generated inspectors and executors competes with the performance of hand-written ones when further optimization is done.
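The inspector/executor pattern that such generated code follows is easy to illustrate in miniature: a run-time inspector examines the index array and builds a data permutation, and an executor then runs the loop through the reordered data. The sketch below is ours; the first-touch ordering heuristic and all names are illustrative, not SPF's generated code.

```python
# Minimal inspector/executor sketch for a loop with indirect accesses A[B[i]].
# The inspector runs once at run time to build a locality-improving data
# permutation; the executor computes the same results on reordered data.
# The "first-touch" ordering heuristic here is illustrative.

def inspector(B, n_data):
    order, seen = [], set()
    for j in B:                              # order data by first touch
        if j not in seen:
            seen.add(j)
            order.append(j)
    order.extend(j for j in range(n_data) if j not in seen)
    new_index = {old: new for new, old in enumerate(order)}
    B_prime = [new_index[j] for j in B]      # remapped index array
    return order, B_prime

def executor(A, B_prime, order, f):
    A_prime = [A[old] for old in order]      # reordered copy of the data
    return [f(A_prime[j]) for j in B_prime]  # same values as f(A[B[i]])
```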
International Parallel and Distributed Processing Symposium | 2004
Lori Carter; Henri Casanova; Jeanne Ferrante; Barbara Kreaseck
Overlapping communication with computation is a well-known technique to increase application performance. While it is commonly assumed that communication and computation can be overlapped at no cost, in reality they do contend for resources and thus interfere with each other. Here we present an empirical quantification of the interference rate of communication on computation. We measure this rate on a single processor communicating with both local and remote processors via Java sockets. Among other results, we find that the computation rate can suffer by as much as 50%, and that the reduction is approximately proportional to the communication rate. We conclude that interference deserves further study.
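The measurement methodology is simple to sketch: time a fixed computation alone, then again while a background thread streams bytes over a socket, and compare. The paper's experiments use Java sockets; the Python sketch below (loopback address, port, and sizes are placeholders) only illustrates the idea, and in CPython much of the observed slowdown would come from the interpreter lock rather than the effects the paper measures.

```python
# Hedged sketch of the methodology: compare the rate of a fixed computation
# run alone vs. run while a background thread streams data over a socket.
# Host, port, and payload sizes are placeholders.

import socket, threading, time

def compute(n=2_000_000):
    s = 0.0
    for i in range(1, n):
        s += 1.0 / i
    return s

def stream(host="127.0.0.1", port=50007, chunks=500):
    srv = socket.socket()
    srv.bind((host, port))
    srv.listen(1)
    def sink():
        conn, _ = srv.accept()
        while conn.recv(65536):      # drain until sender closes
            pass
    threading.Thread(target=sink, daemon=True).start()
    cli = socket.create_connection((host, port))
    payload = b"x" * 65536
    for _ in range(chunks):
        cli.sendall(payload)
    cli.close()
    srv.close()

t0 = time.perf_counter(); compute(); alone = time.perf_counter() - t0

sender = threading.Thread(target=stream)
sender.start()
t0 = time.perf_counter(); compute(); overlapped = time.perf_counter() - t0
sender.join()

print(f"slowdown under communication: {overlapped / alone:.2f}x")
```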
International Conference on Computational Science | 2006
Barbara Kreaseck; Luis Ramos; Scott Easterday; Michelle Mills Strout; Paul D. Hovland
In forward mode Automatic Differentiation (AD), the derivative program computes a function f and its derivatives, f′. Activity analysis, which identifies the variables that actually require derivative computations, is important for AD. We investigate static activity analysis combined with dynamic activity analysis as a technique for reducing the overhead of dynamic activity analysis. Our results show that when all variables are active, the runtime checks required for dynamic activity analysis incur a significant overhead. However, when as few as half of the input variables are inactive, dynamic activity analysis enables an average speedup of 28% on a set of benchmark problems.
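Forward mode AD propagates a derivative value alongside each ordinary value, and dynamic activity analysis guards that propagation with a run-time check so derivative arithmetic is skipped for values known to be inactive. The toy dual-number sketch below shows where such a check sits; the class and field names are ours, not the paper's implementation.

```python
# Toy forward-mode AD value with a dynamic activity flag. When `active` is
# False, the derivative field is known to be zero, so derivative arithmetic
# can be skipped: this is the run-time check whose cost/benefit the paper
# measures. Names are illustrative.

class Dual:
    __slots__ = ("val", "dot", "active")

    def __init__(self, val, dot=0.0, active=False):
        self.val, self.dot, self.active = val, dot, active

    def __mul__(self, other):
        out = Dual(self.val * other.val)
        if self.active or other.active:      # dynamic activity check
            out.dot = self.dot * other.val + self.val * other.dot
            out.active = True
        return out

    def __add__(self, other):
        out = Dual(self.val + other.val)
        if self.active or other.active:
            out.dot = self.dot + other.dot
            out.active = True
        return out

# Seed x as an active independent variable; c stays inactive.
x = Dual(3.0, dot=1.0, active=True)
c = Dual(10.0)
y = x * x + c      # y.val == 19.0, y.dot == 6.0 (d/dx of x^2 + 10 at x = 3)
```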
ACM SIGARCH Computer Architecture News | 2000
Barbara Kreaseck; Dean M. Tullsen; Brad Calder
Tomorrow's microprocessors will be able to handle multiple flows of control. Applications that exhibit task-level parallelism (TLP) and can be decomposed into parallel tasks will perform well on these platforms. TLP arises when a task is independent of its neighboring code. Traditional parallel compilers exploit one variety of TLP, loop-level parallelism (LLP), where loop iterations are executed in parallel. LLP is overwhelmingly found in numeric, typically FORTRAN, programs with regular patterns of data accesses. In contrast, irregular applications, typified by general-purpose integer applications, exhibit little LLP as they tend to access data in irregular patterns through pointers. Without pointer disambiguation to analyze data access dependences, traditional parallel compilers cannot parallelize these irregular applications and ensure correct execution.

We focus on a different variety of TLP, namely Speculative Task Parallelism (STP). STP arises when a task (either a leaf procedure, a non-leaf procedure, or an entire loop) is control- and memory-independent of its preceding code, and thus could be executed in parallel. Two sections of code are memory-independent when neither contains a store to a memory location that the other accesses. To exploit STP, we assume a hypothetical speculative machine that supports speculative futures (a parallel programming construct that executes a task early on a different thread or processor) with mechanisms for resolving incorrect speculation when the task is not, after all, independent. This allows us to speculatively parallelize code when there is a high probability of independence, but no guarantee.

Figure 1 illustrates STP, showing a task Y in the dynamic instruction stream of an irregular application that has no memory access conflicts with a group of instructions, X, that precede Y. The shorter of X and Y determines the overlap of memory-independent instructions, as seen in Figures 1(b) and 1(c). In the absence of any register dependences, X and Y may be executed in parallel, resulting in shorter execution time. It is hard for traditional parallel compilers of pointer-based languages to expose this parallelism.

The goals of this paper are to identify such regions as X and Y within irregular applications and to find the number of instructions that may thus be removed from the critical path. This number represents the maximum STP when the cost of exploiting STP is zero.

Because the biggest barrier to detecting independence in irregular codes is memory disambiguation, we identify memory-independent tasks using a profile-based approach and measure the amount of STP by estimating the number of memory-independent instructions those tasks expose. We vary the level of control dependence and memory dependence to investigate their effect on the amount of memory-independence we find. We profile at different memory granularities and introduce synchronization to expose higher levels of memory-independence. Across this variety of speculation assumptions, we find that on the SPECint95 benchmarks, a set of irregular applications for which traditional methods of parallelization are ineffective, 7 to 22% of dynamic instructions are within tasks found to be memory-independent.
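The profiling step can be pictured as recording the read and write sets of a candidate task and of the code preceding it, then declaring the two memory-independent if neither stores to a location the other accesses; the profiling granularity is a knob, as in the paper's experiments. The sketch below is a toy version with illustrative names and trace format.

```python
# Toy sketch of the memory-independence test behind STP profiling: two code
# regions are memory-independent iff neither stores to a location the other
# accesses. Granularity (e.g., word vs. cache line) is modeled by grouping
# addresses into blocks. The trace format is illustrative.

def to_blocks(accesses, granularity):
    """accesses: iterable of (kind, addr) with kind in {'load', 'store'}."""
    reads, writes = set(), set()
    for kind, addr in accesses:
        block = addr // granularity
        (writes if kind == "store" else reads).add(block)
    return reads, writes

def memory_independent(trace_x, trace_y, granularity=8):
    rx, wx = to_blocks(trace_x, granularity)
    ry, wy = to_blocks(trace_y, granularity)
    # X's stores must not touch anything Y accesses, and vice versa.
    return wx.isdisjoint(ry | wy) and wy.isdisjoint(rx)

X = [("load", 0x1000), ("store", 0x1008)]
Y = [("load", 0x2000), ("store", 0x2040)]
assert memory_independent(X, Y)     # disjoint at 8-byte granularity
```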
IEEE International Conference on High Performance Computing, Data, and Analytics | 2006
Barbara Kreaseck; Larry Carter; Henri Casanova; Jeanne Ferrante; Sagnik Nandy
Overlapping communication with computation is a well-known technique to increase application performance. While it is commonly assumed that communication and computation can be overlapped at no cost, in reality they interfere with each other. In this paper, we empirically evaluate the interference rate of communication on computation via measurements on a single processor communicating with a heterogeneous collection of local and remote processors, in both Java and C. We then present a model of interference, which can be used for more effective application scheduling, as demonstrated by real-world experiments.
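A minimal version of such a model, assuming the roughly linear degradation suggested by the earlier measurements, might look like the following; the functional form and coefficient are illustrative stand-ins, not the paper's fitted model.

```python
# Illustrative linear interference model: effective compute rate degrades in
# proportion to the sustained communication rate. The coefficient k would be
# calibrated per platform from measurements; the value here is made up.

def effective_compute_rate(peak_flops, comm_rate, peak_comm_rate, k=0.5):
    """Compute rate available while communicating at comm_rate."""
    utilization = comm_rate / peak_comm_rate        # fraction of link used
    return peak_flops * max(0.0, 1.0 - k * utilization)

# E.g., communicating at full bandwidth with k = 0.5 costs half the flops:
print(effective_compute_rate(1e9, 100e6, 100e6))    # 5e8
```

A scheduler can use such a model to decide whether overlapping a transfer with computation actually shortens the schedule on a given node.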
IEEE International Conference on High Performance Computing, Data, and Analytics | 2000
Barbara Kreaseck; Dean M. Tullsen; Brad Calder
Traditional parallel compilers do not effectively parallelize irregular applications because they contain little loop-level parallelism. We explore Speculative Task Parallelism (STP), where tasks are full procedures and entire natural loops. Through profiling and compiler analysis, we find tasks that are speculatively memory- and control-independent of their neighboring code. Via speculative futures, these tasks may be executed in parallel with preceding code when there is a high probability of independence. We estimate the amount of STP in irregular applications by measuring the number of memory-independent instructions these tasks expose. We find that 7 to 22% of dynamic instructions are within memory-independent tasks, depending on assumptions.