Publications


Featured research published by David Grove.


Conference on Object-Oriented Programming Systems, Languages, and Applications | 2000

Adaptive optimization in the Jalapeño JVM

Matthew Arnold; Stephen J. Fink; David Grove; Michael Hind; Peter F. Sweeney

Future high-performance virtual machines will improve performance through sophisticated online feedback-directed optimizations. This paper presents the architecture of the Jalapeño Adaptive Optimization System, a system to support leading-edge virtual machine technology and enable ongoing research on online feedback-directed optimizations. We describe the extensible system architecture, based on a federation of threads with asynchronous communication. We present an implementation of the general architecture that supports adaptive multi-level optimization based purely on statistical sampling. We empirically demonstrate that this profiling technique has low overhead and can improve startup and steady-state performance, even in the absence of online feedback-directed optimizations. The paper also describes and evaluates an online feedback-directed inlining optimization based on statistical edge sampling. The system is written completely in Java, applying the described techniques not only to application code and standard libraries, but also to the virtual machine itself.
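
To make the sampling-driven design concrete, here is a minimal Java sketch of the interaction the abstract describes: a sampling thread counts method samples and hands hot methods to an asynchronous recompilation thread. All names and the threshold are hypothetical assumptions, not the Jalapeño implementation.

    import java.util.Map;
    import java.util.concurrent.BlockingQueue;
    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.LinkedBlockingQueue;

    final class SamplingController {
        private final Map<String, Integer> samples = new ConcurrentHashMap<>();
        private final BlockingQueue<String> recompileQueue = new LinkedBlockingQueue<>();
        private static final int HOT_THRESHOLD = 100; // assumed tuning knob

        // Called by the sampling thread with the method observed on top of the stack.
        void takeSample(String methodId) {
            int n = samples.merge(methodId, 1, Integer::sum);
            if (n == HOT_THRESHOLD) {
                recompileQueue.offer(methodId); // asynchronous hand-off between threads
            }
        }

        // Run by a separate compilation thread: drain the queue, recompile hot methods.
        void compilationLoop() throws InterruptedException {
            while (true) {
                String method = recompileQueue.take();
                System.out.println("recompile " + method + " at a higher optimization level");
            }
        }
    }

The queue models the asynchronous communication between federated threads; sampling stays cheap because the common case only increments a counter.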


Proceedings of the ACM 1999 Conference on Java Grande | 1999

The Jalapeño dynamic optimizing compiler for Java

Michael G. Burke; Jong-Deok Choi; Stephen J. Fink; David Grove; Michael Hind; Vivek Sarkar; Mauricio J. Serrano; Vugranam C. Sreedhar; Harini Srinivasan; John Whaley

[Figure 4: Overview of the BC2IR algorithm (main loop: choose a basic block from the set; abstract interpretation loop: parse bytecode, update state, rectify state with successor basic blocks).]

    class t1 {
      static float foo(A a, B b, float c1, float c3) {
        float c2 = c1/c3;
        return(c1*a.f1 + c2*a.f2 + c3*b.f1);
      }
    }

Figure 5: An example Java program

An element-wise meet operation is used on the stack operands to update the symbolic state [38]. When a backward branch whose target is the middle of an already-generated basic block is encountered, the basic block is split at that point. If the stack is not empty at the start of the split BB, the basic block must be regenerated because the initial states may be incorrect. The initial state of a BB may also be incorrect due to as-of-yet-unseen control flow joins. To minimize the number of times HIR is generated for a BB, a simple greedy algorithm is used for selecting BBs in the main loop. When selecting a BB to generate the HIR, the BB with the lowest starting bytecode index is chosen. This simple heuristic relies on the fact that, except for loops, all control flow constructs are generated in topological order, and that the control flow graph is reducible. Surprisingly, for programs compiled with current Java compilers, the greedy algorithm can always find the optimal ordering in practice. (The optimal order for basic block generation that minimizes the number of regenerations is a topological order, ignoring the back edges. However, because BC2IR computes the control flow graph in the same pass, it cannot compute the optimal order a priori.)

Example: Figure 5 shows an example Java source program of class t1, and Figure 6 shows the HIR for method foo of the example. The number in the first column of each HIR instruction is the index of the bytecode from which the instruction is generated. Before compiling class t1, we compiled and loaded class B, but not class A. As a result, the HIR instructions for accessing fields of class A, bytecode indices 7 and 14 in Figure 6, are getfield_unresolved, while the HIR instruction accessing a field of class B, bytecode index 21, is a regular getfield instruction. Also notice that there is only one null_check instruction that covers both getfield_unresolved instructions; this is a result of BC2IR's on-the-fly optimizations.

     0 LABEL0 B0@0
     2 float_div            l4(float) = l2(float), l3(float)
     7 null_check           l0(A, NonNull)
     7 getfield_unresolved  t5(float) = l0(A), <A.f1>
    10 float_mul            t6(float) = l2(float), t5(float)
    14 getfield_unresolved  t7(float) = l0(A, NonNull), <A.f2>
    17 float_mul            t8(float) = l4(float), t7(float)
    18 float_add            t9(float) = t6(float), t8(float)
    21 null_check           l1(B, NonNull)
    21 getfield             t10(float) = l1(B), <B.f1>
    24 float_mul            t11(float) = l3(float), t10(float)
    25 float_add            t12(float) = t9(float), t11(float)
    26 float_return         t12(float)
       END_BBLOCK B0@0

Figure 6: HIR of method foo(). l and t are virtual registers for local variables and temporary operands, respectively.

5.2 On-the-Fly Analyses and Optimizations

To illustrate our approach to on-the-fly optimizations, we consider copy propagation as an example. Java bytecode often contains sequences that perform a calculation and store the result into a local variable (see Figure 7). A simple copy propagation can eliminate most of the unnecessary temporaries. When storing from a temporary into a local variable, BC2IR inspects the most recently generated instruction. If its result is the same temporary, the instruction is modified to write the value directly to the local variable instead.
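
The store-elimination step just described can be sketched in Java as follows; this is a hedged illustration with hypothetical types, not the Jalapeño source. Figure 7, reconstructed below, shows the paper's own before/after example.

    // Hypothetical IR types; not the Jalapeño source.
    class Operand {}

    class Instruction {
        Operand result;
        Instruction(Operand result) { this.result = result; }
    }

    class Bc2IrSketch {
        private Instruction lastGenerated; // most recently emitted HIR instruction

        // Translate "istore local" when the value to store lives in temp.
        void storeLocal(Operand temp, Operand local) {
            if (lastGenerated != null && lastGenerated.result == temp) {
                lastGenerated.result = local;            // retarget: write the local directly
            } else {
                lastGenerated = new Instruction(local);  // otherwise emit an explicit move
            }
        }
    }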
    Java bytecode    Generated IR               Generated IR
                     (optimization off)         (optimization on)
    -------------    ----------------------     ---------------------
    iload x          INT_ADD tint, xint, 5      INT_ADD yint, xint, 5
    iconst 5         INT_MOVE yint, tint
    iadd
    istore y

Figure 7: Example of limited copy propagation and dead code elimination

Other optimizations such as constant propagation, dead code elimination, register renaming for local variables, method inlining, etc. are performed during the translation process. Further details are provided in [38].

6 Jalapeño Optimizing Compiler Back-end

In this section, we describe the back-end of the Jalapeño Optimizing Compiler.

6.1 Lowering of the IR

After high-level analyses and optimizations are performed, HIR is lowered to low-level IR (LIR). In contrast to HIR, the LIR expands instructions into operations that are specific to the Jalapeño JVM implementation, such as object layouts or parameter-passing mechanisms of the Jalapeño JVM. For example, operations in HIR to invoke methods of an object or of a class consist of a single instruction, closely matching the corresponding bytecode instructions such as invokevirtual/invokestatic. These single-instruction HIR operations are lowered (i.e., converted) into multiple-instruction LIR operations that invoke the methods based on the virtual-function-table layout. These multiple LIR operations expose more opportunities for low-level optimizations.

     0 LABEL0 B0@0
     2 float_div            l4(float) = l2(float), l3(float)         (n1)
     7 null_check           l0(A, NonNull)                           (n2)
     7 getfield_unresolved  t5(float) = l0(A), <A.f1>                (n3)
    10 float_mul            t6(float) = l2(float), t5(float)         (n4)
    14 getfield_unresolved  t7(float) = l0(A, NonNull), <A.f2>       (n5)
    17 float_mul            t8(float) = l4(float), t7(float)         (n6)
    18 float_add            t9(float) = t6(float), t8(float)         (n7)
    21 null_check           l1(B, NonNull)                           (n8)
    21 float_load           t10(float) = @{ l1(B), -16 }             (n9)
    24 float_mul            t11(float) = l3(float), t10(float)       (n10)
    25 float_add            t12(float) = t9(float), t11(float)       (n11)
    26 return               t12(float)                               (n12)
       END_BBLOCK B0@0

Figure 8: LIR of method foo()

Example: Figure 8 shows the LIR for method foo of the example in Figure 5. The labels (n1) through (n12) on the far right of each instruction indicate the corresponding node in the data dependence graph shown in Figure 9.

6.2 Dependence Graph Construction

We construct an instruction-level dependence graph, used during BURS code generation (Section 6.3), for each basic block that captures register true/anti/output dependences, memory true/anti/output dependences, and control dependences.

[Figure 9: Dependence graph of basic block in method foo() (nodes n1-n12 connected by reg_true, excep, and control edges).]

The current implementation of memory dependences makes conservative assumptions about alias information. Synchronization constraints are modeled by introducing synchronization dependence edges between synchronization operations (monitor_enter and monitor_exit) and memory operations. These edges prevent code motion of memory operations across synchronization points. Java exception semantics [29] is modeled by exception dependence edges, which connect different exception points in a basic block. Exception dependence edges are also added between register write operations of local variables and exception points in the basic block. Exception dependence edges between register operations and exception points need not be added if the corresponding method does not have catch blocks.
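
As a rough illustration of the exception dependence edges just described, the following Java sketch (hypothetical types; not the Jalapeño code) chains each potentially excepting instruction in a block to the next, which prevents reordering across exception points.

    import java.util.ArrayList;
    import java.util.List;

    // Hypothetical dependence-graph node; "mayThrow" marks exception points.
    class DepNode {
        final String op;
        final boolean mayThrow;
        final List<DepNode> succs = new ArrayList<>();
        DepNode(String op, boolean mayThrow) { this.op = op; this.mayThrow = mayThrow; }
    }

    class ExceptionEdgesSketch {
        // Chain consecutive exception points so they cannot be reordered.
        static void addExceptionEdges(List<DepNode> blockInOrder) {
            DepNode lastExceptionPoint = null;
            for (DepNode n : blockInOrder) {
                if (n.mayThrow) {
                    if (lastExceptionPoint != null) lastExceptionPoint.succs.add(n);
                    lastExceptionPoint = n;
                }
            }
        }
    }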
This precise modeling of dependence constraints allows us to perform more aggressive code generation.

Example: Figure 9 shows the dependence graph for the single basic block in method foo() of Figure 5. The graph, constructed from the LIR for the method, shows register-true dependence edges, exception dependence edges, and a control dependence edge from the first instruction to the last instruction in the basic block. There are no memory dependence edges because the basic block contains only loads and no stores, and we do not currently model load-load input dependences. (The addition of load-load memory dependences will be necessary to correctly support the Java memory model for multithreaded programs that contain data races.) An exception dependence edge is created between an instruction that tests for an exception (such as null_check) and an instruction that depends on the result of the test (such as getfield).

6.3 BURS-based Retargetable Code Generation

In this section, we address the problem of using tree-pattern-matching systems to perform retargetable code generation after code optimization in the Jalapeño Optimizing Compiler [33]. Our solution is based on partitioning a basic block dependence graph (defined in Section 6.2) into trees that can be given as input to a BURS-based tree-pattern-matching system [15]. Unlike previous approaches to partitioning DAGs for tree-pattern-matching (e.g., [17]), our approach considers partitioning in the presence of memory and exception dependences (not just register-true dependences). We have defined legality constraints for this partitioning, and developed a partitioning algorithm that incorporates code duplication.

Figure 10 shows a simple example of pattern matching for the PowerPC. The data dependence graph is partitioned into trees before using BURS. Then, pattern matching is applied on the trees using a grammar (relevant fragments are illustrated in Figure 10). Each grammar rule has an associated cost, in this case the number of instructions that the rule will generate. For example, rule 2 has a zero cost because it is used to eliminate unnecessary register moves, i.e., coalescing. Although rules 3, 4, 5, and 6 could be used to parse the tree, the pattern matching selects rules 1, 2, and 7 as the ones with the least cost to cover the tree. Once these rules are selected as the least cover of the tree, the selected code is emitted as MIR instructions. Thus, for our example, only two PowerPC instructions are emitted for five input LIR instructions.

    input LIR:              input grammar (relevant rules):
      move r2=r0              RULE  PATTERN                 COST
      not  r3=r1              1     reg: REGISTER           0
      and  r4=r2,r3           2     reg: MOVE(reg)          0
      cmp  r5=r4,0            3     reg: NOT(reg)           1
      if   r5,!=,LBL          4     reg: AND(reg,reg)       1
                              5     reg: CMP(reg,INTEGER)   1
    emitted instructions:     6     stm: IF(reg)            1
      andc. r4,r0,r1          7     stm: IF(CMP(AND(reg,    2
      bne   LBL                            NOT(reg)),ZERO))

Figure 10: Example of tree pattern matching for PowerPC (the intermediate DAG/tree built from the LIR is omitted).
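
The least-cost covering described above can be illustrated with a toy recursive matcher in Java. A real BURS system is table-driven, so this is only a sketch of the cost arithmetic for the Figure 10 grammar, with hypothetical names.

    // Toy cost-based tree matcher in the spirit of Figure 10. Each operator
    // normally costs one instruction; registers and moves are free; rule 7
    // covers the whole IF(CMP(AND(x, NOT(y)), ZERO)) spine for cost 2.
    class TreeNode {
        final String op;
        final TreeNode[] kids;
        TreeNode(String op, TreeNode... kids) { this.op = op; this.kids = kids; }
    }

    class BursSketch {
        static int minCost(TreeNode n) {
            int childSum = 0;
            for (TreeNode k : n.kids) childSum += minCost(k);
            if (n.op.equals("REGISTER") || n.op.equals("MOVE")) return childSum; // rules 1-2
            int generic = childSum + 1;                                          // rules 3-6
            if (isAndcSpine(n)) {                                                // rule 7
                TreeNode and = n.kids[0].kids[0];
                int special = 2 + minCost(and.kids[0]) + minCost(and.kids[1].kids[0]);
                return Math.min(generic, special);
            }
            return generic;
        }

        private static boolean isAndcSpine(TreeNode n) {
            return n.op.equals("IF") && n.kids.length == 1
                && n.kids[0].op.equals("CMP") && n.kids[0].kids.length == 2
                && n.kids[0].kids[1].op.equals("ZERO")
                && n.kids[0].kids[0].op.equals("AND") && n.kids[0].kids[0].kids.length == 2
                && n.kids[0].kids[0].kids[1].op.equals("NOT")
                && n.kids[0].kids[0].kids[1].kids.length == 1;
        }
    }

On the tree from Figure 10, the generic rules would cost five instructions while the rule-7 cover costs two, matching the two emitted PowerPC instructions.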


Proceedings of the IEEE | 2005

A Survey of Adaptive Optimization in Virtual Machines

Matthew Arnold; Stephen J. Fink; David Grove; Michael Hind; Peter F. Sweeney

Virtual machines face significant performance challenges beyond those confronted by traditional static optimizers. First, portable program representations and dynamic language features, such as dynamic class loading, force the deferral of most optimizations until runtime, inducing runtime optimization overhead. Second, modular program representations preclude many forms of whole-program interprocedural optimization. Third, virtual machines incur additional costs for runtime services such as security guarantees and automatic memory management. To address these challenges, vendors have invested considerable resources into adaptive optimization systems in production virtual machines. Today, mainstream virtual machine implementations include substantial infrastructure for online monitoring and profiling, runtime compilation, and feedback-directed optimization. As a result, adaptive optimization has begun to mature as a widespread production-level technology. This paper surveys the evolution and current state of adaptive optimization technology in virtual machines.


IBM Systems Journal | 2005

The Jikes research virtual machine project: building an open-source research community

Bowen Alpern; S. Augart; Stephen M. Blackburn; Maria A. Butrico; A. Cocchi; Pau-Chen Cheng; Julian Dolby; Stephen J. Fink; David Grove; Michael Hind; Kathryn S. McKinley; Mark F. Mergen; J. E. B. Moss; Ton Ngo; Vivek Sarkar

This paper describes the evolution of the Jikes™ Research Virtual Machine project from an IBM internal research project, called Jalapeño, into an open-source project. After summarizing the original goals of the project, we discuss the motivation for releasing it as an open-source project and the activities performed to ensure the success of the project. Throughout, we highlight the unique challenges of developing and maintaining an open-source project designed specifically to support a research community.


Workshop on Program Analysis for Software Tools and Engineering | 1999

Efficient and precise modeling of exceptions for the analysis of Java programs

Jong-Deok Choi; David Grove; Michael Hind; Vivek Sarkar

The Factored Control Flow Graph, FCFG, is a novel representation of a program's intraprocedural control flow, which is designed to efficiently support the analysis of programs written in languages, such as Java, that have frequently occurring operations whose execution may result in exceptional control flow. The FCFG is more compact than traditional CFG representations for exceptional control flow, yet there is no loss of precision in using the FCFG. In this paper, we introduce the FCFG representation and outline how standard forward and backward data flow analysis algorithms can be adapted to work on this representation. We also present empirical measurements of FCFG sizes for a large number of methods obtained from a variety of Java programs, and compare these sizes with those of a traditional CFG representation.
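
As a rough illustration of the idea (hypothetical Java types; the paper defines the representation precisely), a factored block can record its exceptional successors once, instead of being split at every potentially excepting instruction as in a traditional CFG:

    import java.util.List;

    // Hypothetical types: exceptional flow is attached once to the whole block.
    class FactoredBlock {
        List<String> instructions;         // may contain many potentially excepting instructions
        List<FactoredBlock> normalSuccs;   // ordinary fall-through and branch edges
        List<FactoredBlock> factoredSuccs; // exceptional edges to handlers, stored once per block
    }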


Programming Language Design and Implementation | 1995

Selective specialization for object-oriented languages

Jeffrey Dean; Craig Chambers; David Grove

Dynamic dispatching is a major source of run-time overhead in object-oriented languages, due both to the direct cost of method lookup and to the indirect effect of preventing other optimizations. To reduce this overhead, optimizing compilers for object-oriented languages analyze the classes of objects stored in program variables, with the goal of bounding the possible classes of message receivers enough so that the compiler can uniquely determine the target of a message send at compile time and replace the message send with a direct procedure call. Specialization is one important technique for improving the precision of this static class information: by compiling multiple versions of a method, each applicable to a subset of the possible argument classes of the method, more precise static information about the classes of the method's arguments is obtained. Previous specialization strategies have not been selective about where this technique is applied, and therefore tended to significantly increase compile time and code space usage, particularly for large applications. In this paper, we present a more general framework for specialization in object-oriented languages and describe a goal-directed specialization algorithm that makes selective decisions to apply specialization to those cases where it provides the highest benefit. Our results show that our algorithm improves the performance of a group of sizeable programs by 65% to 275% while increasing compiled code space requirements by only 4% to 10%. Moreover, when compared to the previous state-of-the-art specialization scheme, our algorithm improves performance by 11% to 67% while simultaneously reducing code space requirements by 65% to 73%.
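
As a hedged illustration of what specialization produces (an assumed Java example; the paper's system targets Cecil in the Vortex compiler), consider a method whose argument class becomes statically known in a specialized copy, so the dynamic dispatch inside can be bound and inlined:

    abstract class Shape { abstract double area(); }

    final class Circle extends Shape {
        final double r;
        Circle(double r) { this.r = r; }
        double area() { return Math.PI * r * r; }
    }

    class Renderer {
        // General version: area() is a dynamically dispatched send.
        static double scaledArea(Shape s, double k) { return k * s.area(); }

        // Compiler-generated specialization for Circle: the receiver class is
        // known, so the send becomes a direct, inlinable call.
        static double scaledArea$Circle(Circle c, double k) {
            return k * (Math.PI * c.r * c.r); // Circle.area() inlined
        }
    }

Selectivity is the point of the paper: such copies pay off only where profile or call-graph evidence says the specialized case is hot.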


Conference on Object-Oriented Programming Systems, Languages, and Applications | 1995

Profile-guided receiver class prediction

David Grove; Jeffrey Dean; Charles Garrett; Craig Chambers

The use of dynamically-dispatched procedure calls is a key mechanism for writing extensible and flexible code in object-oriented languages. Unfortunately, dynamic dispatching imposes a runtime performance penalty. Some recent implementations of pure object-oriented languages have utilized profile-guided receiver class prediction to reduce this performance penalty, and some researchers have argued for applying receiver class prediction in hybrid languages like C++. We performed a detailed examination of the dynamic profiles of eight large object-oriented applications written in C++ and Cecil, determining that the receiver class distributions are strongly peaked and stable across both inputs and program versions through time. We describe techniques for gathering and manipulating profile information at varying degrees of precision, particularly in the presence of optimizations such as inlining. Our implementation of profile-guided receiver class prediction improves the performance of large Cecil applications by more than a factor of two over solely static optimizations.
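
A sketch of the code shape this technique produces (illustrative Java, reusing the Shape and Circle classes from the specialization sketch above; the actual systems compile Cecil and C++): test the profiled receiver class, inline the hot case, and fall back to normal dispatch.

    class Prediction {
        // Compiled shape when the profile says the receiver is almost always Circle.
        static double area(Shape s) {
            if (s instanceof Circle) {      // predicted receiver class test
                Circle c = (Circle) s;
                return Math.PI * c.r * c.r; // hot case inlined
            }
            return s.area();                // cold path: ordinary dynamic dispatch
        }
    }

The strongly peaked, stable distributions the paper measures are what make the single class test profitable.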


ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming | 2011

Lifeline-based global load balancing

Vijay A. Saraswat; Prabhanjan Kambadur; Sreedhar B. Kodali; David Grove; Sriram Krishnamoorthy

On shared-memory systems, Cilk-style work-stealing has been used to effectively parallelize irregular task-graph based applications such as Unbalanced Tree Search (UTS). There are two main difficulties in extending this approach to distributed memory. First, in the shared-memory approach, thieves (nodes without work) constantly attempt to asynchronously steal work from randomly chosen victims until they find work; in distributed memory, thieves cannot autonomously steal work from a victim without disrupting its execution. When work is sparse, this results in performance degradation. In essence, a direct extension of traditional work-stealing to distributed memory violates the work-first principle underlying work-stealing. Further, thieves spend useless CPU cycles attacking victims that have no work, resulting in system inefficiencies in multi-programmed contexts. Second, it is non-trivial to detect active distributed termination (detect that programs at all nodes are looking for work, hence there is no work). This problem is well-studied and requires careful design for good performance. Unfortunately, in most existing languages/frameworks, application developers are forced to implement their own distributed termination detection. In this paper, we develop a simple set of ideas that allow work-stealing to be efficiently extended to distributed memory. First, we introduce lifeline graphs: low-degree, low-diameter, fully connected directed graphs. Such graphs can be constructed from k-dimensional hypercubes. When a node is unable to find work after w unsuccessful steals, it quiesces after informing the outgoing edges in its lifeline graph. Quiescent nodes do not disturb other nodes. A quiesced node is reactivated when work arrives from a lifeline, and itself shares this work with those of its incoming lifelines that are activated. Termination occurs precisely when computation at all nodes has quiesced. In a language such as X10, such passive distributed termination can be detected automatically using the finish construct -- no application code is necessary. Our design is implemented in a few hundred lines of X10. On the binomial tree described in [olivier:08], the program achieves 87% efficiency on an Infiniband cluster of 1024 Power7 cores, with a peak throughput of 2.37 GNodes/s. It achieves 87% efficiency on a Blue Gene/P with 2048 processors, and a peak throughput of 0.966 GNodes/s. All numbers are relative to single-core sequential performance. This implementation has been refactored into a reusable global load balancing framework. Applications can use this framework to obtain global load balance with minimal code changes. In summary, we claim: (a) the first formulation of UTS that does not involve application-level global termination detection, (b) the introduction of lifeline graphs to reduce failed steals, (c) the demonstration of simple lifeline graphs based on k-hypercubes, and (d) performance with superior efficiency (or the same efficiency but over a wider range) than published results on UTS. In particular, our framework can deliver the same or better performance as an unrestricted random work-stealing implementation, while reducing the number of attempted steals.
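
The lifeline protocol can be summarized in a skeleton like the following (illustrative Java; the paper's system is a few hundred lines of X10, and every name here is a hypothetical placeholder). The key point is that after w failed random steals a worker informs its lifelines and quiesces instead of spinning.

    import java.util.List;

    // All names hypothetical; transport, deque, and topology are left abstract.
    abstract class LifelineWorker {
        static final int W = 2; // assumed bound on random steal attempts

        abstract boolean hasLocalWork();
        abstract void processOneTask();
        abstract boolean tryRandomSteal();                 // one attempt at a random victim
        abstract List<LifelineWorker> outgoingLifelines(); // e.g. k-hypercube neighbors
        abstract void noteIdleNeighbor(LifelineWorker idle);
        abstract void waitForLifelineDelivery();           // block until a lifeline sends work

        final void run() {
            while (true) {
                while (hasLocalWork()) processOneTask();
                boolean found = false;
                for (int i = 0; i < W && !found; i++) found = tryRandomSteal();
                if (!found) {
                    // Inform outgoing lifelines, then quiesce rather than spin;
                    // global termination holds exactly when every worker is quiescent.
                    for (LifelineWorker l : outgoingLifelines()) l.noteIdleNeighbor(this);
                    waitForLifelineDelivery();
                }
            }
        }
    }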


Symposium on Principles of Programming Languages | 1998

Fast interprocedural class analysis

Greg DeFouw; David Grove; Craig Chambers

Previous algorithms for interprocedural control flow analysis of higher-order and/or object-oriented languages have been described that perform propagation or constraint satisfaction and take O(N³) time (such as Shivers's 0-CFA and Heintze's set-based analysis), or unification and take O(Nα(N,N)) time (such as Steensgaard's pointer analysis), or optimistic reachability analysis and take O(N) time (such as Bacon and Sweeney's Rapid Type Analysis). We describe a general parameterized analysis framework that integrates propagation-based and unification-based analysis primitives and optimistic reachability analysis, whose instances mimic these existing algorithms as well as several new algorithms taking O(N), O(Nα(N,N)), O(N²), and O(N²α(N,N)) time; our O(N) and O(Nα(N,N)) algorithms produce more precise results than the previous algorithms with these complexities. We implemented our algorithm framework in the Vortex optimizing compiler, and we measured the cost and benefit of these interprocedural analysis algorithms in practice on a collection of substantial Cecil and Java programs.
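
As a hedged sketch of the unification end of this design space (hypothetical Java names; the paper's framework is more general), each program variable can map to a union-find node carrying a set of possible classes, so an assignment unifies two nodes rather than propagating sets:

    import java.util.HashSet;
    import java.util.Set;

    // Hypothetical sketch: one node per program variable, carrying a class set.
    class ClassSetNode {
        private ClassSetNode parent = this;
        final Set<String> classes = new HashSet<>();

        ClassSetNode find() {
            if (parent != this) parent = parent.find(); // path compression
            return parent;
        }

        // An assignment "x = y" unifies the nodes for x and y.
        static void unify(ClassSetNode a, ClassSetNode b) {
            ClassSetNode ra = a.find(), rb = b.find();
            if (ra == rb) return;
            rb.classes.addAll(ra.classes); // merge the class sets
            ra.parent = rb;
        }
    }

Unification trades precision for speed: merged variables share one class set, which is where the near-linear bounds in the abstract come from.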


European Conference on Object-Oriented Programming | 2002

Space- and Time-Efficient Implementation of the Java Object Model

David F. Bacon; Stephen J. Fink; David Grove

While many object-oriented languages impose space overhead of only one word per object to support features like virtual method dispatch, Java's richer functionality has led to implementations that require two or three header words per object. This space overhead increases memory usage and attendant garbage collection costs, reduces cache locality, and constrains programmers who might naturally solve a problem by using large numbers of small objects. In this paper, we show that with careful engineering, a high-performance virtual machine can instantiate most Java objects with only a single-word object header. The single header word provides fast access to the virtual method table, allowing for quick method invocation. The implementation represents other per-object data (lock state, hash code, and garbage collection flags) using heuristic compression techniques. The heuristic retains two-word headers, containing thin lock state, only for objects that have synchronized methods. We describe the implementation of various object models in the IBM Jikes Research Virtual Machine, by introducing a pluggable object model abstraction into the virtual machine implementation. We compare an object model with a two-word header with three different object models with single-word headers. Experimental results show that the object header compression techniques give a mean space savings of 7%, with savings of up to 21%. Compared to the two-word headers, the compressed space-encodings result in application speedups ranging from -1.5% to +2.2%. Performance on synthetic micro-benchmarks ranges from +23%, due to benefits from reduced object size, to -12% on a stress test of virtual method invocation.
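
The header-compression idea can be illustrated with an assumed bit layout in Java (a sketch only, not the Jikes RVM encoding): keep the aligned virtual-method-table (TIB) pointer in the word and borrow its low, always-zero bits for per-object state.

    // Assumed layout: the word holds an aligned TIB pointer; its low zero
    // bits are borrowed for per-object state such as a GC mark.
    final class HeaderSketch {
        static final long GC_MARK_BIT = 1L;      // bit 0: GC mark/flag
        static final long HASHED_BIT  = 1L << 1; // bit 1: hash-code state
        static final long STATE_MASK  = GC_MARK_BIT | HASHED_BIT;

        static long makeHeader(long tibAddress) {
            assert (tibAddress & STATE_MASK) == 0 : "TIB must be 4-byte aligned";
            return tibAddress;                    // state bits start cleared
        }
        static long    tib(long header)      { return header & ~STATE_MASK; }
        static long    setMark(long header)  { return header | GC_MARK_BIT; }
        static boolean isMarked(long header) { return (header & GC_MARK_BIT) != 0; }
    }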
