Michael Hind
IBM
Network
Latest external collaboration on country level. Dive into details by clicking on the dots.
Publication
Featured researches published by Michael Hind.
conference on object-oriented programming systems, languages, and applications | 2000
Matthew Arnold; Stephen J. Fink; David Grove; Michael Hind; Peter F. Sweeney
Future high-performance virtual machines will improve performance through sophisticated online feedback-directed optimizations. This paper presents the architecture of the Jalapeno Adaptive Optimization System, a system to support leading-edge virtual machine technology and enable ongoing research on online feedback-directed optimizations. We describe the extensible system architecture, based on a federation of threads with asynchronous communication. We present an implementation of the general architecture that supports adaptive multi-level optimization based purely on statistical sampling. We empirically demonstrate that this profiling technique has low overhead and can improve startup and steady-state performance, even without the presence of online feedback-directed optimizations. The paper also describes and evaluates an online feedback-directed inlining optimization based on statistical edge sampling. The system is written completely in Java, applying the described techniques not only to application code and standard libraries, but also to the virtual machine itself.
Proceedings of the ACM 1999 conference on Java Grande | 1999
Michael G. Burke; Jong-Deok Choi; Stephen J. Fink; David Grove; Michael Hind; Vivek Sarkar; Mauricio J. Serrano; Vugranam C. Sreedhar; Harini Srinivasan; John Whaley
interpretation Loop: Parse bytecode Update state Rectify state with Successor basic blocks Main Initialization Choose basic block from set Figure 4: Overview of BC2IR algorithm class t1 { static float foo(A a, B b, float c1, float c3) { float c2 = c1/c3; return(c1*a.f1 + c2*a.f2 + c3*b.f1); } } Figure 5: An example Java program element-wise meet operation is used on the stack operands to update the symbolic state [38]. When a backward branch whose target is the middle of an already-generated basic block is encountered, the basic block is split at that point. If the stack is not empty at the start of the split BB, the basic block must be regenerated because the initial states may be incorrect. The initial state of a BB may also be incorrect due to as-of-yet-unseen control ow joins. To minimize the number of a times HIR is generated for a BB a simple greedy algorithm is used for selecting BBs in the main loop. When selecting a BB to generate the HIR, the BB with the lowest starting bytecode index is chosen. This simple heuristic relies on the fact that, except for loops, all controlow constructs are generated in topological order, and that the control ow graph is reducible. Surprisingly, for programs compiled with current Java compilers, the greedy algorithm can always nd the optimal ordering in practice.5 5The optimal order for basic block generation that minimizes number of regeneration is a topological order (ignoring the back edges). However, because BC2IR computes the control ow graph in the same pass, it cannot compute the optimal order a priori. Example: Figure 5 shows an example Java source program of class t1, and Figure 6 shows the HIR for method foo of the example. The number on the rst column of each HIR instruction is the index of the bytecode from which the instruction is generated. Before compiling class t1, we compiled and loaded class B, but not class A. As a result, the HIR instructions for accessing elds of class A, bytecode indices 7 and 14 in Figure 6, are getfield unresolved, while the HIR instruction accessing a eld of class B, bytecode index 21, is a regular getfield instruction. Also notice that there is only one null check instruction that covers both getfield unresolved instructions; this is a result of BC2IRs on-they optimizations. 0 LABEL0 B0@0 2 float_div l4(float) = l2(float), l3(float) 7 null_check l0(A, NonNull) 7 getfield_unresolved t5(float) = l0(A), < A.f1> 10 float_mul t6(float) = l2(float), t5(float) 14 getfield_unresolved t7(float) = l0(A, NonNull), < A.f2> 17 float_mul t8(float) = l4(float), t7(float) 18 float_add t9(float) = t6(float), t8(float) 21 null_check l1(B, NonNull) 21 getfield t10(float) = l1(B), < B.f1> 24 float_mul t11(float) = l3(float), t10(float) 25 float_add t12(float) = t9(float), t11(float) 26 float_return t12(float) END_BBLOCK B0@0 Figure 6: HIR of method foo(). l and t are virtual registers for local variables and temporary operands, respectively. 5.2 On-the-Fly Analyses and Optimizations To illustrate our approach to on-they optimizations we consider copy propagation as an example. Java bytecode often contains sequences that perform a calculation and store the result into a local variable (see Figure 7). A simple copy propagation can eliminate most of the unnecessary temporaries. When storing from a temporary into a local variable, BC2IR inspects the most recently generated instruction. If its result is the same temporary, the instruction is modi ed to write the value directly to the local variable instead. Other optimizations such as constant propagation, dead Java bytecode Generated IR Generated IR (optimization off) (optimization on) ---------------------------------------------iload x INT_ADD tint, xint, 5 INT_ADD yint, xint, 5 iconst 5 INT_MOVE yint, tint iadd istore y Figure 7: Example of limited copy propagation and dead code elimination code elimination, register renaming for local variables, method inlining, etc. are performed during the translation process. Further details are provided in [38]. 6 Jalape~ no Optimizing Compiler Back-end In this section, we describe the back-end of the Jalape~ no Optimizing Compiler. 6.1 Lowering of the IR After high-level analyses and optimizations are performed, HIR is lowered to low-level IR (LIR). In contrast to HIR, the LIR expands instructions into operations that are speci c to the Jalape~ no JVM implementation, such as object layouts or parameter-passing mechanisms of the Jalape~ no JVM. For example, operations in HIR to invoke methods of an object or of a class consist of a single instruction, closely matching the corresponding bytecode instructions such as invokevirtual/invokestatic. These single-instruction HIR operations are lowered (i.e., converted) into multiple-instruction LIR operations that invoke the methods based on the virtualfunction-table layout. These multiple LIR operations expose more opportunities for low-level optimizations. 0 LABEL0 B0@0 2 float_div l4(float) = l2(float), l3(float) (n1) 7 null_check l0(A, NonNull) (n2) 7 getfield_unresolved t5(float) = l0(A), <A.f1> (n3) 10 float_mul t6(float) = l2(float), t5(float) (n4) 14 getfield_unresolved t7(float) = l0(A, NonNull), <A.f2>(n5) 17 float_mul t8(float) = l4(float), t7(float) (n6) 18 float_add t9(float) = t6(float), t8(float) (n7) 21 null_check l1(B, NonNull) (n8) 21 float_load t10(float) = @{ l1(B), -16 } (n9) 24 float_mul t11(float) = l3(float), t10(float) (n10) 25 float_add t12(float) = t9(float), t11(float) (n11) 26 return t12(float) (n12) END_BBLOCK B0@0 Figure 8: LIR of method foo() Example: Figure 8 shows the LIR for method foo of the example in Figure 5. The labels (n1) through (n12) on the far right of each instruction indicate the corresponding node in the data dependence graph shown in Figure 9. 6.2 Dependence Graph Construction We construct an instruction-level dependence graph, used during BURS code generation (Section 6.3), for each basic block that captures register true/anti/output dependences, reg_true n12 excep reg_true excep reg_true reg_true reg_true control reg_true excep reg_true excep reg_true reg_true n1 n2 n3 n4 n5 n6 n7 n8 n9 n10 n11 Figure 9: Dependence graph of basic block in method foo() memory true/anti/output dependences, and control dependences. The current implementation of memory dependences makes conservative assumptions about alias information. Synchronization constraints are modeled by introducing synchronization dependence edges between synchronization operations (monitor enter and monitor exit) and memory operations. These edges prevent code motion of memory operations across synchronization points. Java exception semantics [29] is modeled by exception dependence edges, which connect di erent exception points in a basic block. Exception dependence edges are also added between register write operations of local variables and exception points in the basic block. Exception dependence edges between register operations and exceptions points need not be added if the corresponding method does not have catch blocks. This precise modeling of dependence constraints allows us to perform more aggressive code generation. Example: Figure 9 shows the dependence graph for the single basic block in method foo() of Figure 5. The graph, constructed from the LIR for the method, shows registertrue dependence edges, exception dependence edges, and a control dependence edge from the rst instruction to the last instruction in the basic block. There are no memory dependence edges because the basic block contains only loads and no stores, and we do not currently model load-load input dependences6 . An exception dependence edge is created between an instruction that tests for an exception (such as null check) and an instruction that depends on the result of the test (such as getfield). 6.3 BURS-based Retargetable Code Generation In this section, we address the problem of using tree-patternmatching systems to perform retargetable code generation after code optimization in the Jalape~ no Optimizing Compiler [33]. Our solution is based on partitioning a basic 6The addition of load-load memory dependences will be necessary to correctly support the Java memory model for multithreaded programs that contain data races. input LIR: DAG/tree: input grammar (relevant rules): move r2=r0 not r3=r1 and r4=r2,r3 cmp r5=r4,0 if r5,!=,LBL emitted instructions: andc. r4,r0,r1 bne LBL IF CMP AND MOVE r0 0 NOT r1 RULE PATTERN COST ------------1 reg: REGISTER 0 2 reg: MOVE(reg) 0 3 reg: NOT(reg) 1 4 reg: AND(reg,reg) 1 5 reg: CMP(reg,INTEGER) 1 6 stm: IF(reg) 1 7 stm: IF(CMP(AND(reg, 2 NOT(reg)),ZERO))) Figure 10: Example of tree pattern matching for PowerPC block dependence graph (de ned in Section 6.2) into trees that can be given as input to a BURS-based tree-patternmatching system [15]. Unlike previous approaches to partitioning DAGs for tree-pattern-matching (e.g., [17]), our approach considers partitioning in the presence of memory and exception dependences (not just register-true dependences). We have de ned legality constraints for this partitioning, and developed a partitioning algorithm that incorporates code duplication. Figure 10 shows a simple example of pattern matching for the PowerPC. The data dependence graph is partitioned into trees before using BURS. Then, pattern matching is applied on the trees using a grammar (relevant fragments are illustrated in Figure 10). Each grammar rule has an associated cost, in this case the number of instructions that the rule will generate. For example, rule 2 has a zero cost because it is used to eliminate unnecessary register moves, i.e., coalescing. Although rules 3, 4, 5, and 6 could be used to parse the tree, the pattern matching selects rules 1, 2, and 7 as the ones with the least cost to cover the tree. Once these rules are selected as the least cover of the tree, the selected code is emitted as MIR instructions. Thus, for our example, only two PowerPC instructions are emitted for v
ACM Transactions on Programming Languages and Systems | 1999
Michael Hind; Michael G. Burke; Paul R. Carini; Jong-Deok Choi
We present practical approximation methods for computing and representing interprocedural aliases for a program written in a language that includes pointers, reference parameters, and recursion. We present the following contributions: (1) a framework for interprocedural pointer alias analysis that handles function pointers by constructing the program call graph while alias analysis is being performed; (2) a flow-sensitive interprocedural pointer alias analysis algorithm; (3) a flow-insensitive interprocedural pointer alias analysis algorithm; (4) a flow-insensitive interprocedural pointer alias analysis algorithm that incorporates kill information to improve precision; (5) empirical measurements of the efficiency and precision of the three interprocedural alias analysis algorithms.
Proceedings of the IEEE | 2005
Matthew Arnold; Stephen J. Fink; David Grove; Michael Hind; Peter F. Sweeney
Virtual machines face significant performance challenges beyond those confronted by traditional static optimizers. First, portable program representations and dynamic language features, such as dynamic class loading, force the deferral of most optimizations until runtime, inducing runtime optimization overhead. Second, modular program representations preclude many forms of whole-program interprocedural optimization. Third, virtual machines incur additional costs for runtime services such as security guarantees and automatic memory management. To address these challenges, vendors have invested considerable resources into adaptive optimization systems in production virtual machines. Today, mainstream virtual machine implementations include substantial infrastructure for online monitoring and profiling, runtime compilation, and feedback-directed optimization. As a result, adaptive optimization has begun to mature as a widespread production-level technology. This paper surveys the evolution and current state of adaptive optimization technology in virtual machines.
Ibm Systems Journal | 2005
Bowen Alpern; S. Augart; Stephen M. Blackburn; Maria A. Butrico; A. Cocchi; Pau-Chen Cheng; Julian Dolby; Stephen J. Fink; David Grove; Michael Hind; Kathryn S. McKinley; Mark F. Mergen; J. E. B. Moss; Ton Ngo; Vivek Sarkar
This paper describes the evolution of the JikesTM Research Virtual Machine project from an IBM internal research project, called Jalapeno, into an open-source project. After summarizing the original goals of the project, we discuss the motivation for releasing it as an open-source project and the activities performed to ensure the success of the project. Throughout, we highlight the unique challenges of developing and maintaining an open-source project designed specifically to support a research community.
international symposium on software testing and analysis | 2000
Michael Hind; Anthony Pioli
During the past two decades many different pointer analysis algorithms have been published. Although some descriptions include measurements of the effectiveness of the algorithm, qualitative comparisons among algorithms are difficult because of varying infrastructure, benchmarks, and performance metrics. Without such comparisons it is not only difficult for an implementor to determine which pointer analysis is appropriate for their application, but also for a researcher to know which algorithms should be used as a basis for future advances. This paper describes an empirical comparison of the effectiveness of five pointer analysis algorithms on C programs. The algorithms vary in their use of control flow information (flow-sensitivity) and alias data structure, resulting in worst-case complexity from linear to polynomial. The effectiveness of the analyses is quantified in terms of compile-time precision and efficiency. In addition to measuring the direct effects of pointer analysis, precision is also reported by determining how the information computed by the five pointer analyses affects typical client analyses of pointer information: Mod/Ref analysis, live variable analysis and dead assignment identification, reaching definitions analysis, dependence analysis, and conditional constant propagation and unreachable code identification. Efficiency is reported by measuring analysis time and memory consumption of the pointer analyses and their clients.
workshop on program analysis for software tools and engineering | 1999
Jong-Deok Choi; David Grove; Michael Hind; Vivek Sarkar
The Factored Control Flow Graph, FCFG, is a novel representation of a programs intraprocedural control flow, which is designed to efficiently support the analysis of programs written in languages, such as Java, that have frequently occurring operations whose execution may result in exceptional control flow. The FCFG is more compact than traditional CFG representations for exceptional control flow, yet there is no loss of precision in using the FCFG. In this paper, we introduce the FCFG representation and outline how standard forward and backward data flow analysis algorithms can be adapted to work on this representation. We also present empirical measurements of FCFG sizes for a large number of methods obtained from a variety of Java programs, and compare these sizes with those of a traditional CFG representation.
conference on object-oriented programming systems, languages, and applications | 2004
Matthias Hauswirth; Peter F. Sweeney; Amer Diwan; Michael Hind
Object-oriented programming languages provide a rich set of features that provide significant software engineering benefits. The increased productivity provided by these features comes at a justifiable cost in a more sophisticated runtime system whose responsibility is to implement these features efficiently. However, the virtualization introduced by this sophistication provides a significant challenge to understanding complete system performance, not found in traditionally compiled languages, such as C or C++. Thus, understanding system performance of such a system requires profiling that spans all levels of the execution stack, such as the hardware, operating system, virtual machine, and application. In this work, we suggest an approach, called <i>vertical profiling</i>, that enables this level of understanding. We illustrate the efficacy of this approach by providing deep understandings of performance problems of Java applications run on a VM with vertical profiling support. By incorporating vertical profiling into a programming environment, the programmer will be able to understand how their program interacts with the underlying abstraction levels, such as application server, VM, operating system, and hardware.
conference on object-oriented programming systems, languages, and applications | 2002
Matthew Arnold; Michael Hind; Barbara G. Ryder
This paper describes the implementation of an online feedback-directed optimization system. The system is fully automatic; it requires no prior (offline) profiling run. It uses a previously developed low-overhead instrumentation sampling framework to collect control flow graph edge profiles. This profile information is used to drive several traditional optimizations, as well as a novel algorithm for performing feedback-directed control flow graph node splitting. We empirically evaluate this system and demonstrate improvements in peak performance of up to 17% while keeping overhead low, with no individual execution being degraded by more than 2% because of instrumentation.
languages and compilers for parallel computing | 1994
Michael G. Burke; Paul R. Carini; Jong-Deok Choi; Michael Hind
Data-flow analysis algorithms can be classified into two categories: flow-sensitive and flow-insensitive. To improve efficiency, flow insensitive interprocedural analyses do not make use of the intraprocedural control flow information associated with individual procedures. Since pointer-induced aliases can change within a procedure, applying known flow-insensitive analyses can result in either incorrect or overly conservative solutions. In this paper, we present a flow-insensitive data flow analysis algorithm that computes interprocedural pointer-induced aliases. We improve the precision of our analysis by (1) making use of certain types of kill information that can be precomputed efficiently, and (2) computing aliases generated in each procedure instead of holding at the exit of each procedure. We improve the efficiency of our algorithm by introducing a technique called deferred evaluation.