Publication


Featured research published by Monica S. Lam.


Programming Language Design and Implementation | 1991

A data locality optimizing algorithm

Monica S. Lam; Michael E. Wolf

This paper proposes an algorithm that improves the locality of a loop nest by transforming the code via interchange, reversal, skewing, and tiling. The loop transformation algorithm is based on two concepts: a mathematical formulation of reuse and locality, and a loop transformation theory that unifies the various transforms as unimodular matrix transformations. The algorithm has been implemented in the SUIF (Stanford University Intermediate Format) compiler, and is successful in optimizing codes such as matrix multiplication, successive over-relaxation (SOR), LU decomposition without pivoting, and Givens QR factorization. Performance evaluation indicates that locality optimization is especially crucial for scaling up the performance of parallel code.
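
To make the role of tiling concrete, here is a minimal sketch (in Python, for readability) of the transformation applied to matrix multiplication. The matrix size and tile size are illustrative assumptions, not values from the paper:

```python
# A minimal sketch of loop tiling for locality, in the spirit of the
# transformations described above. Not the SUIF algorithm itself;
# N and TILE are arbitrary assumptions.
N = 256
TILE = 32  # assumed small enough that three TILE x TILE blocks fit in cache

def matmul_naive(A, B, C):
    # i-j-k loop nest: B is traversed column-wise, so each element of B
    # is evicted before it is reused.
    for i in range(N):
        for j in range(N):
            for k in range(N):
                C[i][j] += A[i][k] * B[k][j]

def matmul_tiled(A, B, C):
    # Tiling (plus interchange) keeps TILE x TILE blocks of A, B, and C
    # resident in cache while they are reused, improving temporal locality.
    for ii in range(0, N, TILE):
        for jj in range(0, N, TILE):
            for kk in range(0, N, TILE):
                for i in range(ii, min(ii + TILE, N)):
                    for k in range(kk, min(kk + TILE, N)):
                        a = A[i][k]
                        for j in range(jj, min(jj + TILE, N)):
                            C[i][j] += a * B[k][j]
```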


Programming Language Design and Implementation | 1988

Software pipelining: an effective scheduling technique for VLIW machines

Monica S. Lam

The basic idea behind software pipelining was first developed by Patel and Davidson for scheduling hardware pipelines. As instruction-level parallelism made its way into general-purpose computing, it became necessary to automate scheduling. How and whether instructions can be scheduled statically have major ramifications on the design of computer architectures. Rau and Glaeser were the first to use software pipelining in a compiler for a machine with specialized hardware designed to support software pipelining. In the meantime, trace scheduling was touted to be the scheduling technique of choice for VLIW (Very Long Instruction Word) machines. The most important contribution of this paper is to show that software pipelining is effective on VLIW machines without complicated hardware support. Our understanding of software pipelining subsequently deepened with the work of many others. Today, software pipelining is used in all advanced compilers for machines with instruction-level parallelism, none of which, except the Intel Itanium, relies on any specialized support for software pipelining. This paper shows that software pipelining is an effective and viable scheduling technique for VLIW processors. In software pipelining, iterations of a loop in the source program are continuously initiated at constant intervals, before the preceding iterations complete. The advantage of software pipelining is that optimal performance can be achieved with compact object code. This paper extends previous results on software pipelining in two ways. First, it shows that by using an improved algorithm, near-optimal performance can be obtained without specialized hardware. Second, we propose a hierarchical reduction scheme whereby entire control constructs are reduced to an object similar to an operation in a basic block. With this scheme, all innermost loops, including those containing conditional statements, can be software pipelined. It also diminishes the start-up cost of loops with a small number of iterations. Hierarchical reduction complements the software pipelining technique, permitting a consistent performance improvement to be obtained. The techniques proposed have been validated by an implementation of a compiler for Warp, a systolic array consisting of 10 VLIW processors. This compiler has been used for developing a large number of applications in the areas of image, signal, and scientific processing.
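
The overlapped-iterations idea can be visualized with a small sketch. The three-stage loop body and initiation interval below are assumptions for illustration, not the Warp compiler's scheduler:

```python
# A minimal sketch of software pipelining as a schedule, assuming a
# 3-stage loop body (LOAD, MUL, STORE) and an initiation interval of 1.
# Purely illustrative: real modulo scheduling must respect resource and
# dependence constraints when choosing the interval.
STAGES = ["LOAD", "MUL", "STORE"]
II = 1            # initiation interval: a new iteration starts every cycle
ITERATIONS = 5

schedule = {}     # cycle -> list of (iteration, stage)
for it in range(ITERATIONS):
    for s, stage in enumerate(STAGES):
        cycle = it * II + s
        schedule.setdefault(cycle, []).append((it, stage))

for cycle in sorted(schedule):
    ops = ", ".join(f"iter{it}:{st}" for it, st in schedule[cycle])
    print(f"cycle {cycle}: {ops}")
# Cycles 0-1 form the prologue, cycles 2-4 the steady state (one full
# iteration's worth of work completes per cycle), and cycles 5-6 the epilogue.
```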


Architectural Support for Programming Languages and Operating Systems | 1992

Design and evaluation of a compiler algorithm for prefetching

Todd C. Mowry; Monica S. Lam; Anoop Gupta

Software-controlled data prefetching is a promising technique for improving the performance of the memory subsystem to match today's high-performance processors. While prefetching is useful in hiding the latency, issuing prefetches incurs an instruction overhead and can increase the load on the memory subsystem. As a result, care must be taken to ensure that such overheads do not exceed the benefits. This paper proposes a compiler algorithm to insert prefetch instructions into code that operates on dense matrices. Our algorithm identifies those references that are likely to be cache misses, and issues prefetches only for them. We have implemented our algorithm in the SUIF (Stanford University Intermediate Format) optimizing compiler. By generating fully functional code, we have been able to measure not only the improvements in cache miss rates, but also the overall performance of a simulated system. We show that our algorithm significantly improves the execution speed of our benchmark programs; some of the programs improve by as much as a factor of two. When compared to an algorithm that indiscriminately prefetches all array accesses, our algorithm can eliminate many of the unnecessary prefetches without any significant decrease in the coverage of the cache misses.
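
The selectivity idea (prefetch only the leading reference of each cache line, a fixed distance ahead) can be sketched for a unit-stride loop; all constants below are assumptions, not the paper's parameters:

```python
# A minimal sketch of selective prefetch insertion for a unit-stride
# array walk. The real algorithm uses locality analysis over the whole
# loop nest to prove which accesses hit; this toy version just shows
# which prefetches survive the pruning.
LINE = 64                 # cache line size in bytes (assumption)
ELEM = 8                  # element size in bytes (assumption)
DIST = 16                 # prefetch distance in iterations (assumption)
PER_LINE = LINE // ELEM   # iterations that touch the same cache line

def emit_loop_body(n):
    """Print pseudo-instructions for: for i in range(n): total += a[i]."""
    for i in range(n):
        # Only the leading reference of each line needs a prefetch, so the
        # loop is conceptually split: one prefetch per PER_LINE iterations,
        # issued DIST iterations ahead to hide the miss latency.
        if i % PER_LINE == 0 and i + DIST < n:
            print(f"  prefetch a[{i + DIST}]")
        print(f"  load a[{i}]; add")

emit_loop_body(32)
```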


IEEE Transactions on Parallel and Distributed Systems | 1991

A loop transformation theory and an algorithm to maximize parallelism

Michael E. Wolf; Monica S. Lam

An approach to transformations for general loops, in which dependence vectors represent precedence constraints on the iterations of a loop, is presented. Therefore, dependences extracted from a loop nest must be lexicographically positive. This leads to a simple test for the legality of compound transformations: any code transformation that leaves the dependences lexicographically positive is legal. The loop transformation theory is applied to the problem of maximizing the degree of coarse- or fine-grain parallelism in a loop nest. It is shown that the maximum degree of parallelism can be achieved by transforming the loops into a nest of coarsest fully permutable loop nests and wavefronting the fully permutable nests. The canonical form of coarsest fully permutable nests can be transformed mechanically to yield maximum degrees of coarse- and/or fine-grain parallelism. Efficient heuristics can find the maximum degrees of parallelism for loops whose nesting level is less than five.
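
The legality test is simple enough to sketch directly. The dependence vectors and transformation matrices below are made-up examples, not drawn from the paper:

```python
# A minimal sketch of the legality test described above: a compound loop
# transformation, expressed as a unimodular matrix T, is legal iff T.d
# remains lexicographically positive for every dependence vector d.

def lex_positive(v):
    """True if the first nonzero component of v is positive."""
    for x in v:
        if x != 0:
            return x > 0
    return False  # the all-zero vector carries no ordering constraint here

def apply(T, d):
    return [sum(T[i][j] * d[j] for j in range(len(d))) for i in range(len(T))]

deps = [[1, 0], [0, 1], [1, -1]]   # example dependence vectors of a 2-deep nest

interchange = [[0, 1], [1, 0]]     # swap the two loops
skew = [[1, 0], [1, 1]]            # skew the inner loop by the outer index

for T, name in [(interchange, "interchange"), (skew, "skew")]:
    ok = all(lex_positive(apply(T, d)) for d in deps)
    print(f"{name}: {'legal' if ok else 'illegal'}")
# Interchange maps (1,-1) to (-1,1), which is lexicographically negative,
# so it is illegal here; skewing maps it to (1,0), making all dependence
# components nonnegative, so the skewed nest is fully permutable.
```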


International Conference on Software Engineering | 2002

Tracking down software bugs using automatic anomaly detection

Sudheendra Hangal; Monica S. Lam

This paper introduces DIDUCE, a practical and effective tool that aids programmers in detecting complex program errors and identifying their root causes. By instrumenting a program and observing its behavior as it runs, DIDUCE dynamically formulates hypotheses of invariants obeyed by the program. DIDUCE hypothesizes the strictest invariants at the beginning, and gradually relaxes the hypotheses as violations are detected to allow for new behavior. The violations reported help users catch software bugs as soon as they occur. They also give programmers new visibility into the behavior of their programs, such as identifying rare corner cases in the program logic or locating hidden errors that corrupt a program's results. We implemented the DIDUCE system for Java programs and applied it to four programs of significant size and complexity. DIDUCE succeeded in identifying the root causes of programming errors in each of the programs quickly and automatically. In particular, DIDUCE was effective in isolating a timing-dependent bug in a released JSSE (Java Secure Socket Extension) library that would have taken an experienced programmer days to find. Our experience suggests that detecting and checking program invariants dynamically is a simple and effective methodology for debugging many different kinds of program errors across a wide variety of application domains.
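
A toy sketch of the hypothesis-relaxation loop follows. DIDUCE tracks invariants per instrumentation point in Java programs; the single-value, mask-based version below is a simplified illustration:

```python
# A minimal sketch of DIDUCE-style invariant relaxation for one
# instrumented value: start with the strictest hypothesis ("the value is
# always this constant") and relax it, bit by bit, as violations appear.

class Hypothesis:
    def __init__(self):
        self.first = None   # first observed value
        self.mask = 0       # bits already known to vary across observations

    def observe(self, value):
        """Return the bits that violate the current hypothesis, then relax."""
        if self.first is None:
            self.first = value            # strictest hypothesis
            return 0
        violation = (value ^ self.first) & ~self.mask
        self.mask |= violation            # relax: these bits may now vary
        return violation                  # nonzero => report an anomaly

h = Hypothesis()
for v in [8, 8, 12, 8, 9]:
    bad = h.observe(v)
    if bad:
        print(f"value {v}: new behavior in bits {bad:#b}")
# Reports fire only the first time each bit varies, so rare corner cases
# surface as late, isolated violations.
```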


IEEE Computer | 1996

Maximizing multiprocessor performance with the SUIF compiler

Mary W. Hall; Jennifer-Ann M. Anderson; Saman P. Amarasinghe; Brian R. Murphy; Shih-Wei Liao; Edouard Bugnion; Monica S. Lam

This article describes automatic parallelization techniques in the SUIF (Stanford University Intermediate Format) compiler that result in good multiprocessor performance for array-based numerical programs. Parallelizing compilers for multiprocessors face many hurdles. However, SUIF's robust analysis and memory optimization techniques enabled speedups on three-fourths of the NAS and SPECfp95 benchmark programs.
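
At its core, loop parallelization hinges on proving the absence of loop-carried dependences. The brute-force check below is a toy illustration of that question, not SUIF's dependence analysis:

```python
# A minimal sketch of the question a parallelizer asks of each loop: can
# two different iterations touch the same array element? Accesses are
# modeled as affine index functions i -> stride*i + offset. Real compilers
# answer this symbolically (e.g., dependence tests), not by enumeration.

def carried_dependence(write, read, n):
    """write/read are (stride, offset) pairs over iterations 0..n-1."""
    ws, wc = write
    rs, rc = read
    return any(ws * i + wc == rs * j + rc
               for i in range(n) for j in range(n) if i != j)

# a[i] = a[i] + b[i]   -> write (1, 0), read (1, 0): iterations independent
print(carried_dependence((1, 0), (1, 0), 100))   # False: loop is parallel
# a[i] = a[i-1] + b[i] -> write (1, 0), read (1, -1): carried dependence
print(carried_dependence((1, 0), (1, -1), 100))  # True: must run serially
```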


SIGPLAN Notices | 1994

SUIF: an infrastructure for research on parallelizing and optimizing compilers

Robert P. Wilson; Robert S. French; Christopher S. Wilson; Saman P. Amarasinghe; Jennifer-Ann M. Anderson; Steven W. K. Tjiang; Shih-Wei Liao; Chau-Wen Tseng; Mary W. Hall; Monica S. Lam; John L. Hennessy

Compiler infrastructures that support experimental research are crucial to the advancement of high-performance computing. New compiler technology must be implemented and evaluated in the context of a complete compiler, but developing such an infrastructure requires a huge investment in time and resources. We have spent a number of years building the SUIF compiler into a powerful, flexible system, and we would now like to share the results of our efforts. SUIF consists of a small, clearly documented kernel and a toolkit of compiler passes built on top of the kernel. The kernel defines the intermediate representation, provides functions to access and manipulate the intermediate representation, and structures the interface between compiler passes. The toolkit currently includes C and Fortran front ends, a loop-level parallelism and locality optimizer, an optimizing MIPS back end, a set of compiler development tools, and support for instructional use. Although we do not expect SUIF to be suitable for everyone, we think it may be useful for many other researchers. We thus invite you to use SUIF and welcome your contributions to this infrastructure. Directions for obtaining the SUIF software are included at the end of this paper.
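
The kernel-plus-passes architecture can be sketched abstractly. The IR encoding and pass below are illustrative assumptions, not SUIF's actual API:

```python
# A minimal sketch of the structure described above: a shared intermediate
# representation and independent passes that each rewrite it.

# Toy IR: a program maps function names to lists of (op, args...) tuples.
ir = {"main": [("const", 2), ("const", 3), ("add",), ("ret",)]}

def constant_fold(program):
    """Fold const-const-add triples; stands in for one optimizer pass."""
    out = {}
    for fn, ops in program.items():
        new_ops, i = [], 0
        while i < len(ops):
            if (i + 2 < len(ops) and ops[i][0] == "const"
                    and ops[i + 1][0] == "const" and ops[i + 2] == ("add",)):
                new_ops.append(("const", ops[i][1] + ops[i + 1][1]))
                i += 3
            else:
                new_ops.append(ops[i])
                i += 1
        out[fn] = new_ops
    return out

PIPELINE = [constant_fold]  # a real toolkit chains many such passes

for p in PIPELINE:
    ir = p(ir)
print(ir)  # {'main': [('const', 5), ('ret',)]}
```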


Programming Language Design and Implementation | 2004

Cloning-based context-sensitive pointer alias analysis using binary decision diagrams

John Whaley; Monica S. Lam

This paper presents the first scalable context-sensitive, inclusion-based pointer alias analysis for Java programs. Our approach to context sensitivity is to create a clone of a method for every context of interest, and run a context-insensitive algorithm over the expanded call graph to get context-sensitive results. For precision, we generate a clone for every acyclic path through a program's call graph, treating methods in a strongly connected component as a single node. Normally, this formulation is hopelessly intractable, as a call graph often has 10^14 acyclic paths or more. We show that these exponential relations can be computed efficiently using binary decision diagrams (BDDs). Key to the scalability of the technique is a context numbering scheme that exposes the commonalities across contexts. We applied our algorithm to the most popular applications available on SourceForge, and found that the largest programs, with hundreds of thousands of Java bytecodes, can be analyzed in under 20 minutes. This paper also shows that pointer analysis, and many other queries and algorithms, can be described succinctly and declaratively using Datalog, a logic programming language. We have developed a system called bddbddb that automatically translates Datalog programs into highly efficient BDD implementations. We used this approach to develop a variety of context-sensitive algorithms including side effect analysis, type analysis, and escape analysis.
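
The Datalog formulation is compact enough to sketch. Below is a toy inclusion-based points-to fixpoint over two rules; Python sets stand in for the BDD-encoded relations, and the relation names are illustrative:

```python
# A minimal sketch of inclusion-based points-to analysis in the Datalog
# style the paper feeds to bddbddb. This toy version handles only
# "v = new h" and "v = w", and ignores fields, calls, and contexts.

news = {("a", "h1"), ("b", "h2")}    # v = new h
assigns = {("c", "a"), ("d", "c")}   # v = w

# Datalog:  pts(v, h) :- news(v, h).
#           pts(v, h) :- assigns(v, w), pts(w, h).
pts = set(news)
changed = True
while changed:
    changed = False
    for v, w in assigns:
        for x, h in list(pts):
            if x == w and (v, h) not in pts:
                pts.add((v, h))
                changed = True

print(sorted(pts))
# [('a', 'h1'), ('b', 'h2'), ('c', 'h1'), ('d', 'h1')]
```

In the paper's approach, the same rules run over a call graph whose methods have been cloned per context; the BDD encoding is what makes the exponential number of contexts tractable.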


Conference on Object-Oriented Programming, Systems, Languages, and Applications | 2005

Finding application errors and security flaws using PQL: a program query language

Michael C. Martin; V. Benjamin Livshits; Monica S. Lam

A number of effective error detection tools have been built in recent years to check if a program conforms to certain design rules. An important class of design rules deals with sequences of events associated with a set of related objects. This paper presents a language called PQL (Program Query Language) that allows programmers to express such questions easily in an application-specific context. A query looks like a code excerpt corresponding to the shortest amount of code that would violate a design rule. Details of the target application's precise implementation are abstracted away. The programmer may also specify actions to perform when a match is found, such as recording relevant information or even correcting an erroneous execution on the fly. We have developed both static and dynamic techniques to find solutions to PQL queries. Our static analyzer finds all potential matches conservatively using a context-sensitive, flow-insensitive, inclusion-based pointer alias analysis. Static results are also useful in reducing the number of instrumentation points for dynamic analysis. Our dynamic analyzer instruments the source program to catch all violations precisely as the program runs and to optionally perform user-specified actions. We have implemented the techniques described in this paper and found 206 errors in 6 large real-world open-source Java applications containing a total of nearly 60,000 classes. These errors include important security flaws, resource leaks, and violations of consistency invariants. The combination of static and dynamic analysis proves effective at addressing a wide range of debugging and program comprehension queries. We have found that dynamic analysis is especially suitable for preventing errors such as security vulnerabilities at runtime.
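
A toy sketch of the dynamic-matching idea for one classic query, "an object is opened but never closed", follows. The event format and names are assumptions for illustration, not PQL syntax:

```python
# A minimal sketch of the dynamic side of a PQL-style query: track
# sequences of events on related objects and report when a sequence
# reaches a violating state. Instrumentation would emit these events.

events = [
    ("open",  "conn1"),
    ("open",  "conn2"),
    ("close", "conn1"),
    ("exit",  None),
]

open_objects = set()
for kind, obj in events:
    if kind == "open":
        open_objects.add(obj)          # start tracking this object
    elif kind == "close":
        open_objects.discard(obj)      # the sequence completed correctly
    elif kind == "exit" and open_objects:
        # the query matched: some object reached program exit unclosed
        print("violation: never closed:", sorted(open_objects))
```

Static analysis plays the role of pruning: only program points that might participate in a match need to be instrumented to emit events at all.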


International Symposium on Computer Architecture | 1992

Limits of control flow on parallelism

Monica S. Lam; Robert P. Wilson

This paper discusses three techniques useful in relaxing the constraints imposed by control flow on parallelism: control dependence analysis, executing multiple flows of control simultaneously, and speculative execution. We evaluate these techniques by using trace simulations to find the limits of parallelism for machines that employ different combinations of these techniques. We have three major results. First, local regions of code have limited parallelism, and control dependence analysis is useful in extracting global parallelism from different parts of a program. Second, a superscalar processor is fundamentally limited because it cannot execute independent regions of code concurrently. Higher performance can be obtained with machines, such as multiprocessors and dataflow machines, that can simultaneously follow multiple flows of control. Finally, without speculative execution to allow instructions to execute before their control dependences are resolved, only modest amounts of parallelism can be obtained for programs with complex control flow.
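
Control dependence, the first of these techniques, has a standard postdominator-based formulation that is small enough to sketch; the CFG below is a made-up if-then example:

```python
# A minimal sketch of control dependence analysis: node Y is control
# dependent on branch X if Y postdominates some successor of X but does
# not postdominate X itself. Self-dependences from loops are ignored here.

succ = {"entry": ["cond"], "cond": ["then", "join"],
        "then": ["join"], "join": ["exit"], "exit": []}
nodes = list(succ)

# Postdominator sets via a dataflow fixpoint on the reverse CFG.
postdom = {n: set(nodes) for n in nodes}
postdom["exit"] = {"exit"}
changed = True
while changed:
    changed = False
    for n in nodes:
        if n != "exit":
            new = set.intersection(*(postdom[s] for s in succ[n])) | {n}
            if new != postdom[n]:
                postdom[n], changed = new, True

for x in nodes:
    for z in succ[x]:
        for y in nodes:
            if y != x and y in postdom[z] and y not in postdom[x]:
                print(f"{y} is control dependent on {x}")
# Prints "then is control dependent on cond": the branch, not the
# straight-line code around it, is what constrains parallelism.
```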

Collaboration


Dive into Monica S. Lam's collaborations.

Top Co-Authors


Saman P. Amarasinghe

Massachusetts Institute of Technology
