Larry Carter
University of California, San Diego
Publications
Featured research published by Larry Carter.
IEEE Transactions on Software Engineering | 1987
Andrew P. Black; Norman C. Hutchinson; Eric Jul; Henry M. Levy; Larry Carter
Emerald is an object-based language for programming distributed subsystems and applications. Its novel features include 1) a single object model that is used both for programming in the small and in the large, 2) support for abstract types, and 3) an explicit notion of object location and mobility. This paper outlines the goals of Emerald, relates Emerald to previous work, and describes its type system and distribution support. We are currently constructing a prototype implementation of Emerald.
IEEE Transactions on Parallel and Distributed Systems | 2004
Cyril Banino; Olivier Beaumont; Larry Carter; Jeanne Ferrante; Arnaud Legrand; Yves Robert
We consider the problem of allocating a large number of independent, equal-sized tasks to a heterogeneous computing platform. We use a nonoriented graph to model the platform, where resources can have different speeds of computation and communication. Because the number of tasks is large, we focus on the question of determining the optimal steady state scheduling strategy for each processor (the fraction of time spent computing and the fraction of time spent communicating with each neighbor). In contrast to minimizing the total execution time, which is NP-hard in most formulations, we show that finding the optimal steady state can be solved using a linear programming approach and, thus, in polynomial time. Our result holds for a quite general framework, allowing for cycles and multiple paths in the interconnection graph, and allowing for several masters. We also consider the simpler case where the platform is a tree. While this case can also be solved via linear programming, we show how to derive a closed-form formula to compute the optimal steady state, which gives rise to a bandwidth-centric scheduling strategy. The advantage of this approach is that it can directly support autonomous task scheduling based only on information local to each node; no global information is needed. Finally, we provide a theoretical comparison of the computing power of tree-based versus arbitrary platforms.
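As a rough illustration of the bandwidth-centric principle described above, the following Python sketch allocates steady-state task rates for a single master and its children, serving children in order of increasing communication cost until the master's outgoing bandwidth is saturated. It is a simplified single-level (star) case with made-up parameters, not the paper's general tree algorithm or linear program.

```python
# Minimal sketch of bandwidth-centric steady-state allocation for one master
# and its children. Assumes: c = time for the master to send one task to a
# child, w = time for that child to compute one task, and the master can talk
# to only one child at a time. All parameters are illustrative.

def bandwidth_centric_rates(children):
    """children: list of (comm_time, compute_time). Returns task rates per child."""
    # Serve children in order of increasing communication time: cheap links
    # first, regardless of how fast the child computes.
    order = sorted(range(len(children)), key=lambda i: children[i][0])
    rates = [0.0] * len(children)
    comm_budget = 1.0  # fraction of the master's time available for sending
    for i in order:
        c, w = children[i]
        # Child i can absorb at most 1/w tasks per unit time; sending them
        # costs c per task out of the master's remaining communication budget.
        rate = min(1.0 / w, comm_budget / c)
        rates[i] = rate
        comm_budget -= rate * c
        if comm_budget <= 0:
            break
    return rates

if __name__ == "__main__":
    # Three children given as (comm_time, compute_time) per task.
    print(bandwidth_centric_rates([(1.0, 3.0), (2.0, 2.0), (5.0, 1.0)]))
```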
Algorithmica | 1993
Bowen Alpern; Larry Carter; Ephraim Feig; Ted Selker
The Uniform Memory Hierarchy (UMH) model introduced in this paper captures performance-relevant aspects of the hierarchical nature of computer memory. It is used to quantify architectural requirements of several algorithms and to ratify the faster speeds achieved by tuned implementations that use improved data-movement strategies. A sequential computer's memory is modeled as a sequence <M_0, M_1, ...> of increasingly large memory modules. Computation takes place in M_0. Thus, M_0 might model a computer's central processor, while M_1 might be cache memory, M_2 main memory, and so on. For each module M_u, a bus B_u connects it with the next larger module M_{u+1}. All buses may be active simultaneously. Data is transferred along a bus in fixed-sized blocks. The size of these blocks, the time required to transfer a block, and the number of blocks that fit in a module are larger for modules farther from the processor. The UMH model is parameterized by the rate at which the block sizes increase and by the ratio of the block count to the block size. A third parameter, the transfer-cost (inverse bandwidth) function, determines the time to transfer blocks at the different levels of the hierarchy. UMH analysis refines traditional methods of algorithm analysis by including the cost of data movement throughout the memory hierarchy. The communication efficiency of a program is a ratio measuring the portion of UMH running time during which M_0 is active. An algorithm that can be implemented by a program whose communication efficiency is nonzero in the limit is said to be communication-efficient. The communication efficiency of a program depends on the parameters of the UMH model, most importantly on the transfer-cost function. A threshold function separates those transfer-cost functions for which an algorithm is communication-efficient from those that are too costly. Threshold functions for matrix transpose, standard matrix multiplication, and Fast Fourier Transform algorithms are established by exhibiting communication-efficient programs at the threshold and showing that more expensive transfer-cost functions are too costly. A parallel computer can be modeled as a tree of memory modules with computation occurring at the leaves. Threshold functions are established for multiplication of N×N matrices using up to N^2 processors in a tree with constant branching factor.
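To make the parameterization concrete, here is a small Python sketch of a UMH-style hierarchy as a list of levels with geometrically growing block sizes, plus a helper that totals the block-transfer time needed to bring data down to M_0. The growth rate, block counts, and cost function are invented example values, not the paper's thresholds, and the serial summation ignores that the model lets buses operate in parallel.

```python
# Illustrative sketch of a UMH-style hierarchy: level u uses blocks of
# alpha * rho**u words, and moving one block across bus B_u costs t(u) time.
# All parameters below are arbitrary examples chosen only to show how the
# model is parameterized.

def blocksize(u, alpha=2, rho=4):
    return alpha * rho ** u

def transfer_cost(u):
    # Inverse bandwidth of bus B_u: time per block moved between M_u and M_{u+1}.
    return 2 ** u   # example cost function that grows with the level

def time_to_fetch(n_words, from_level):
    """Total time to move n_words from M_{from_level} down to M_0, one bus at
    a time. (The UMH model allows all buses to be active simultaneously; this
    serial sum is a deliberate simplification.)"""
    total = 0
    for u in range(from_level):               # buses B_0 .. B_{from_level-1}
        blocks = -(-n_words // blocksize(u))  # ceiling division
        total += blocks * transfer_cost(u)
    return total

if __name__ == "__main__":
    print(time_to_fetch(10_000, from_level=3))
```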
Programming Language Design and Implementation | 2003
Michelle Mills Strout; Larry Carter; Jeanne Ferrante
Many important applications, such as those using sparse data structures, have memory reference patterns that are unknown at compile-time. Prior work has developed run-time reorderings of data and computation that enhance locality in such applications. This paper presents a compile-time framework that allows the explicit composition of run-time data and iteration-reordering transformations. Our framework builds on the iteration-reordering framework of Kelly and Pugh to represent the effects of a given composition. To motivate our extension, we show that new compositions of run-time reordering transformations can result in better performance on three benchmarks. We show how to express a number of run-time data and iteration-reordering transformations that focus on improving data locality. We also describe the space of possible run-time reordering transformations and how existing transformations fit within it. Since sparse tiling techniques are included in our framework, they become more generally applicable, both to a larger class of applications and in their composition with other reordering transformations. Finally, within the presented framework, data need be remapped only once at run time for a given composition, one example of the overhead reductions the framework can express.
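The abstract treats run-time data reordering as one of the composable transformations; the Python sketch below shows one simple instance, consecutive packing of data elements in the order they are first touched through an index array. The heuristic and names are illustrative, not the framework's representation.

```python
# Minimal inspector/executor-style data reordering. A loop accesses data[]
# indirectly through idx[]; the "inspector" builds a permutation that packs
# data elements in the order they are first touched, the data is remapped
# once, and idx[] is rewritten so the "executor" loop body is unchanged.
# This is only an illustrative instance of run-time data reordering.

def first_touch_permutation(idx, n_data):
    new_of_old = [-1] * n_data
    nxt = 0
    for i in idx:                  # order of first use in the loop
        if new_of_old[i] == -1:
            new_of_old[i] = nxt
            nxt += 1
    for old in range(n_data):      # elements never touched go at the end
        if new_of_old[old] == -1:
            new_of_old[old] = nxt
            nxt += 1
    return new_of_old

def remap(data, idx):
    new_of_old = first_touch_permutation(idx, len(data))
    new_data = [None] * len(data)
    for old, new in enumerate(new_of_old):
        new_data[new] = data[old]
    new_idx = [new_of_old[i] for i in idx]
    return new_data, new_idx

if __name__ == "__main__":
    data = [10, 20, 30, 40, 50]
    idx = [4, 1, 4, 0, 1]
    print(remap(data, idx))   # first-touched elements now lead the array
```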
International Conference on Parallel Architectures and Compilation Techniques | 1999
Nick Mitchell; Larry Carter; Jeanne Ferrante
Existing techniques can enhance the locality of arrays indexed by affine functions of induction variables. This paper presents a technique to localize non-affine array references, such as the indirect memory references common in sparse-matrix computation. Our optimization combines elements of tiling, data-centric tiling, data remapping and inspector-executor parallelization. We describe our technique, bucket tiling, which includes the tasks of permutation generation, data remapping, and loop regeneration. We show that profitability cannot generally be determined at compile-time, but requires an extension to run-time. We demonstrate our technique on three codes: integer sort, conjugate gradient, and a kernel used in simulating a beating heart. We observe speedups of 1.91 on integer sort, 1.57 on conjugate gradient, and 2.69 on the heart kernel.
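As a rough illustration of the bucketing idea, this Python sketch bins loop iterations by which cache-sized "bucket" of the indirectly accessed array they touch, then executes bucket by bucket. The bucket size, loop body, and two-phase structure are hypothetical simplifications, not the paper's generated code.

```python
# Illustrative bucket-tiled execution of a loop with an indirect reference,
# roughly: for i in range(n): y[col[i]] += x[i].
# Iterations are binned by which bucket of y they update, then each bucket's
# iterations run together so the touched region of y stays cache resident.
# The bucket size and the specific loop body are illustrative choices.

def bucket_tiled_scatter_add(y, x, col, bucket_size=1024):
    # "Inspector": assign each iteration to the bucket of y it will update.
    buckets = {}
    for i, c in enumerate(col):
        buckets.setdefault(c // bucket_size, []).append(i)
    # "Executor": process one bucket of y at a time.
    for b in sorted(buckets):
        for i in buckets[b]:
            y[col[i]] += x[i]
    return y

if __name__ == "__main__":
    y = [0.0] * 8
    x = [1.0, 2.0, 3.0, 4.0]
    col = [5, 0, 5, 2]
    print(bucket_tiled_scatter_add(y, x, col, bucket_size=4))
```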
International Journal of Parallel Programming | 1998
Nick Mitchell; Karin Högstedt; Larry Carter; Jeanne Ferrante
Optimizations, including tiling, often target a single level of memory or parallelism, such as cache. These optimizations usually operate on a level-by-level basis, guided by a cost function parameterized by features of that single level. The benefit of optimizations guided by these one-level cost functions decreases as architectures tend towards a hierarchy of memory and of parallelism. We have identified three common architectural scenarios where a single tiling choice could be improved by using information from multiple levels in concert. For each scenario, we derive multi-level cost functions which guide the optimal choice of tile size and shape, and quantify the improvement gained. We give both analysis and simulation results to support our points.
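The fragment below sketches what a multi-level cost function for choosing a tile size might look like: each memory level contributes estimated traffic times a per-word cost, and the tile size minimizing the sum is chosen. The capacities, costs, and traffic estimates are invented placeholders, not the cost functions derived in the paper.

```python
# Hypothetical multi-level cost function for choosing a tile size T for a
# tiled n x n matrix multiplication. Each memory level contributes
# (estimated words moved) * (cost per word); a single-level model would keep
# only one term. All constants and the traffic estimates are illustrative.

def traffic_estimate(n, T, capacity_words):
    # Textbook blocked-matmul estimate: if three T x T tiles fit in a level,
    # roughly n**3 / T words cross it; otherwise it sees on the order of n**3.
    return n ** 3 / T if 3 * T * T <= capacity_words else float(n ** 3)

def multi_level_cost(T, n=1024, levels=((4096, 1.0), (262144, 8.0))):
    # levels: (capacity in words, cost per word moved) for each memory level.
    return sum(traffic_estimate(n, T, cap) * cost for cap, cost in levels)

if __name__ == "__main__":
    candidates = [8, 16, 32, 64, 128]
    best = min(candidates, key=multi_level_cost)
    print("best tile size under this toy model:", best)
```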
International Parallel Processing Symposium | 1995
Larry Carter; Jeanne Ferrante; Susan Flynn Hummel
It takes more than a good algorithm to achieve high performance: inner-loop performance and data locality are also important. Tiling is a well-known method for parallelization and for improving data locality. However, tiling has the potential of being even more beneficial. At the finest granularity, it can be used to guide register allocation and instruction scheduling; at the coarsest level, it can help manage magnetic storage media. It also can be useful in overlapping data movement with computation, for instance by prefetching data from archival storage, disks and main memory into cache and registers, or by choreographing data movement between processors. Hierarchical tiling is a framework for applying both known tiling methods and new techniques to an expanded set of uses. It eases the burden on several compiler phases that are traditionally treated separately, such as scalar replacement, register allocation, generation of message passing calls, and storage mapping. By explicitly naming and copying data, it takes control of the mapping of data to memory and of the movement of data between processing elements and up and down the memory hierarchy. This paper focuses on using hierarchical tiling to exploit superscalar pipelined processors. On a simple example, it improves performance by a factor of 3, achieving perfect use of the superscalar processor's pipeline. Hierarchical tiling is presented here as a method of hand-tuning performance; while outside the scope of this paper, the ideas can be incorporated into an automatic preprocessor or optimizing compiler.
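As a loose illustration of the "explicitly naming and copying data" aspect mentioned above, the Python sketch below tiles a matrix multiplication and copies each tile into small local arrays before the inner loops run; at the lowest level the same structure corresponds to scalar replacement and register tiling, which Python cannot express directly. Tile sizes and names are illustrative, and the sketch assumes the matrix dimension divides evenly by the tile size.

```python
# Illustrative tiled matrix multiply with explicit tile copies, echoing the
# hierarchical-tiling idea of naming and copying data at each level. In a
# real implementation the innermost level would be unrolled so copied values
# live in registers (scalar replacement); here the copies only model that
# structure. Assumes n is a multiple of T.

def tiled_matmul(A, B, n, T=4):
    C = [[0.0] * n for _ in range(n)]
    for ii in range(0, n, T):
        for jj in range(0, n, T):
            # Explicitly named tile of C, accumulated locally.
            c_tile = [[0.0] * T for _ in range(T)]
            for kk in range(0, n, T):
                # Copy the A and B tiles once; the inner loops then touch
                # only these small arrays.
                a_tile = [A[ii + i][kk:kk + T] for i in range(T)]
                b_tile = [B[kk + k][jj:jj + T] for k in range(T)]
                for i in range(T):
                    for k in range(T):
                        a = a_tile[i][k]
                        for j in range(T):
                            c_tile[i][j] += a * b_tile[k][j]
            for i in range(T):
                C[ii + i][jj:jj + T] = c_tile[i]
    return C

if __name__ == "__main__":
    n = 8
    A = [[float(i == j) for j in range(n)] for i in range(n)]  # identity
    B = [[float(i * n + j) for j in range(n)] for i in range(n)]
    assert tiled_matmul(A, B, n) == B
    print("ok")
```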
Proceedings of Workshop on Programming Models for Massively Parallel Computers | 1993
Bowen Alpern; Larry Carter; Jeanne Ferrante
A parameterized generic model that captures the features of diverse computer architectures would facilitate the development of portable programs. Specific models appropriate to particular computers are obtained by specifying parameters of the generic model. A generic model should be simple, and for each machine that it is intended to represent, it should have a reasonably accurate specific model. The Parallel Memory Hierarchy (PMH) model of computation uses a single mechanism to model the costs of both interprocessor communication and memory hierarchy traffic. A computer is modeled as a tree of memory modules with processors at the leaves. All data movement takes the form of block transfers between children and their parents. The paper assesses the strengths and weaknesses of the PMH model as a generic model.
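A small Python sketch of the shape of the PMH model described above: memory modules form a tree, processors sit at the leaves, and the only data movement is block transfer between a child and its parent. The field names and cost accounting are illustrative placeholders, not the model's actual parameterization.

```python
# Toy representation of a Parallel Memory Hierarchy: memory modules form a
# tree, processors are the leaves, and all data movement is block transfer
# between a child and its parent. Field names, block sizes, and costs below
# are illustrative placeholders.

class Module:
    def __init__(self, name, blocksize, transfer_cost, children=()):
        self.name = name
        self.blocksize = blocksize          # words per block on the bus to the parent
        self.transfer_cost = transfer_cost  # time per block on that bus
        self.children = list(children)

    def is_processor(self):
        return not self.children            # leaves do the computing

    def fetch_time(self, n_words):
        # Time to pull n_words from the parent module into this one.
        blocks = -(-n_words // self.blocksize)   # ceiling division
        return blocks * self.transfer_cost

if __name__ == "__main__":
    leaves = [Module(f"p{i}", blocksize=8, transfer_cost=1) for i in range(4)]
    root = Module("shared_memory", blocksize=64, transfer_cost=4, children=leaves)
    print(leaves[0].fetch_time(100))   # blocks of 8 words over the leaf's bus
```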
IEEE International Conference on High Performance Computing, Data, and Analytics | 2004
Michelle Mills Strout; Larry Carter; Jeanne Ferrante; Barbara Kreaseck
In modern computers, a program’s data locality can affect performance significantly. This paper details full sparse tiling, a run-time reordering transformation that improves the data locality for stationary iterative methods such as Gauss–Seidel operating on sparse matrices. In scientific applications such as finite element analysis, these iterative methods dominate the execution time. Full sparse tiling chooses a permutation of the rows and columns of the sparse matrix, and then an order of execution that achieves better data locality. We prove that full sparse-tiled Gauss–Seidel generates a solution that is bitwise identical to traditional Gauss–Seidel on the permuted matrix. We also present measurements of the performance improvements and the overheads of full sparse tiling and of cache blocking for irregular grids, a related technique developed by Douglas et al.
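For reference, here is a plain (untiled) Gauss–Seidel sweep over a sparse matrix stored in the standard CSR layout; full sparse tiling reorders and tiles these updates at run time, which is not shown here. The variable names and the tiny test system are illustrative.

```python
# One Gauss-Seidel sweep for A x = b with A stored in CSR form
# (row_ptr, col_idx, vals). This is the baseline computation whose updates
# full sparse tiling reorders for locality; the reordering itself is not shown.

def gauss_seidel_sweep(row_ptr, col_idx, vals, b, x):
    n = len(b)
    for i in range(n):
        sigma = 0.0
        diag = 0.0
        for k in range(row_ptr[i], row_ptr[i + 1]):
            j = col_idx[k]
            if j == i:
                diag = vals[k]
            else:
                sigma += vals[k] * x[j]    # already-updated values are used for j < i
        x[i] = (b[i] - sigma) / diag
    return x

if __name__ == "__main__":
    # 2x2 example: [[4, 1], [1, 3]] x = [1, 2]
    row_ptr, col_idx, vals = [0, 2, 4], [0, 1, 0, 1], [4.0, 1.0, 1.0, 3.0]
    x = [0.0, 0.0]
    for _ in range(25):
        gauss_seidel_sweep(row_ptr, col_idx, vals, [1.0, 2.0], x)
    print(x)   # converges toward the solution [1/11, 7/11]
```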
ACM Symposium on Parallel Algorithms and Architectures | 1999
Karin Högstedt; Larry Carter; Jeanne Ferrante
Many computationally-intensive programs, such as those for differential equations, spatial interpolation, and dynamic programming, spend a large portion of their execution time in multiply-nested loops which have a regular stencil of data dependences. Tiling is a well-known optimization that improves performance on such loops, particularly for computers with a multi-levelled hierarchy of parallelism and memory. Most previous work on tiling restricts the tile shape to be rectangular. Our previous work and its extension by Desprez, Dongarra, Rastello and Robert showed that for doubly nested loops, using parallelograms can improve parallel execution time by decreasing the idle time, the time that a processor spends waiting for data or synchronization. In this technical report, we extend that work to more deeply nested loops, as well as to more complex loop bounds. We introduce a model which allows us to demonstrate the equivalence in complexity of linear programming and determining the execution time of a tiling in the model. We then identify a sub-class of these loops that constitute rectilinear iteration spaces for which we derive a closed form formula for their execution time. This formula can be used by a compiler to predict the execution time of a loop nest. We then derive the tile shape that minimizes this formula. Using the duality property of linear programming, we also study how the longest path of dependent tiles within a rectilinear iteration space depends on the slope of only four of the facets defining the iteration space, independent of its dimensionality.
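As a simple special case of the kind of execution-time analysis discussed above, the Python snippet below counts the longest path of dependent tiles for a rectangular 2D iteration space covered by rectangular tiles with elementwise dependences (1,0) and (0,1). This standard wavefront count is illustrative only; it is not the paper's closed-form formula for general rectilinear spaces or non-rectangular tile shapes.

```python
# Longest chain of dependent rectangular tiles over an N x M iteration space
# with dependences (1,0) and (0,1): each tile depends on its left and lower
# neighbors, so the critical path runs along a diagonal wavefront. This is a
# standard special case, not the paper's general rectilinear formula.

from math import ceil

def critical_path_tiles(N, M, tx, ty):
    return ceil(N / tx) + ceil(M / ty) - 1

if __name__ == "__main__":
    # With enough processors, parallel time is roughly the critical path
    # length times the work per tile: larger tiles shorten the path but make
    # the execution coarser grained.
    for tx, ty in [(10, 10), (50, 10), (100, 100)]:
        print((tx, ty), critical_path_tiles(1000, 1000, tx, ty))
```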
Collaboration
Dive into Larry Carter's collaborations.
French Institute for Research in Computer Science and Automation