Publications
Featured research published by Mauricio J. Serrano.
Conference on Object-Oriented Programming Systems, Languages, and Applications | 1999
Jong-Deok Choi; Manish Gupta; Mauricio J. Serrano; Vugranam C. Sreedhar; Samuel P. Midkiff
This paper presents a simple and efficient data flow algorithm for escape analysis of objects in Java programs to determine (i) if an object can be allocated on the stack, and (ii) if an object is accessed only by a single thread during its lifetime, so that synchronization operations on that object can be removed. We introduce a new program abstraction for escape analysis, the connection graph, that is used to establish reachability relationships between objects and object references. We show that the connection graph can be summarized for each method such that the same summary information may be used effectively in different calling contexts. We present an interprocedural algorithm that uses the above property to efficiently compute the connection graph and identify the non-escaping objects for methods and threads. The experimental results, from a prototype implementation of our framework in the IBM High Performance Compiler for Java, are very promising. The percentage of objects that may be allocated on the stack exceeds 70% of all dynamically created objects in three out of the ten benchmarks (with a median of 19%); 11% to 92% of all lock operations are eliminated in those ten programs (with a median of 51%); and the overall execution time reduction ranges from 2% to 23% (with a median of 7%) on a 333 MHz PowerPC workstation with 128 MB memory.
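A minimal Java illustration (hypothetical code, not from the paper) of the two properties the analysis detects: the Point below never escapes its creating method, so it is a stack-allocation candidate, and the StringBuffer is reachable only from the creating thread, so its internal synchronization is removable.

// Hypothetical illustration of the two properties escape analysis detects.
class EscapeExamples {
    static final class Point { double x, y; }

    // The Point never escapes this method: no field, static, or return value
    // exposes it, so an escape-analyzing compiler may allocate it on the stack.
    static double distanceFromOrigin(double x, double y) {
        Point p = new Point();
        p.x = x;
        p.y = y;
        return Math.sqrt(p.x * p.x + p.y * p.y);
    }

    // The StringBuffer is reachable only from the creating thread, so the
    // synchronization performed inside append() can be removed.
    static String greet(String name) {
        StringBuffer sb = new StringBuffer(); // synchronized methods, thread-local use
        sb.append("Hello, ").append(name);
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(distanceFromOrigin(3.0, 4.0)); // 5.0
        System.out.println(greet("world"));
    }
}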
Proceedings of the ACM 1999 Conference on Java Grande | 1999
Michael G. Burke; Jong-Deok Choi; Stephen J. Fink; David Grove; Michael Hind; Vivek Sarkar; Mauricio J. Serrano; Vugranam C. Sreedhar; Harini Srinivasan; John Whaley
Figure 4: Overview of the BC2IR algorithm (after initialization, a main loop chooses a basic block from the work set, and an abstract-interpretation loop parses its bytecode, updates the symbolic state, and rectifies the state with successor basic blocks).

class t1 {
  static float foo(A a, B b, float c1, float c3) {
    float c2 = c1/c3;
    return (c1*a.f1 + c2*a.f2 + c3*b.f1);
  }
}
Figure 5: An example Java program.

An element-wise meet operation is used on the stack operands to update the symbolic state [38]. When a backward branch whose target is the middle of an already-generated basic block is encountered, the basic block is split at that point. If the stack is not empty at the start of the split BB, the basic block must be regenerated because the initial states may be incorrect. The initial state of a BB may also be incorrect due to as-of-yet-unseen control flow joins. To minimize the number of times HIR is generated for a BB, a simple greedy algorithm is used for selecting BBs in the main loop. When selecting a BB to generate the HIR, the BB with the lowest starting bytecode index is chosen. This simple heuristic relies on the fact that, except for loops, all control-flow constructs are generated in topological order, and that the control flow graph is reducible. Surprisingly, for programs compiled with current Java compilers, the greedy algorithm can always find the optimal ordering in practice. (Footnote 5: The optimal order for basic block generation that minimizes the number of regenerations is a topological order, ignoring the back edges. However, because BC2IR computes the control flow graph in the same pass, it cannot compute the optimal order a priori.)

Example: Figure 5 shows an example Java source program of class t1, and Figure 6 shows the HIR for method foo of the example. The number in the first column of each HIR instruction is the index of the bytecode from which the instruction is generated. Before compiling class t1, we compiled and loaded class B, but not class A. As a result, the HIR instructions for accessing fields of class A, bytecode indices 7 and 14 in Figure 6, are getfield_unresolved, while the HIR instruction accessing a field of class B, bytecode index 21, is a regular getfield instruction. Also notice that there is only one null_check instruction that covers both getfield_unresolved instructions; this is a result of BC2IR's on-the-fly optimizations.

0 LABEL0 B0@0
2 float_div l4(float) = l2(float), l3(float)
7 null_check l0(A, NonNull)
7 getfield_unresolved t5(float) = l0(A), <A.f1>
10 float_mul t6(float) = l2(float), t5(float)
14 getfield_unresolved t7(float) = l0(A, NonNull), <A.f2>
17 float_mul t8(float) = l4(float), t7(float)
18 float_add t9(float) = t6(float), t8(float)
21 null_check l1(B, NonNull)
21 getfield t10(float) = l1(B), <B.f1>
24 float_mul t11(float) = l3(float), t10(float)
25 float_add t12(float) = t9(float), t11(float)
26 float_return t12(float)
END_BBLOCK B0@0
Figure 6: HIR of method foo(). l and t are virtual registers for local variables and temporary operands, respectively.

5.2 On-the-Fly Analyses and Optimizations

To illustrate our approach to on-the-fly optimizations, we consider copy propagation as an example. Java bytecode often contains sequences that perform a calculation and store the result into a local variable (see Figure 7). A simple copy propagation can eliminate most of the unnecessary temporaries. When storing from a temporary into a local variable, BC2IR inspects the most recently generated instruction. If its result is the same temporary, the instruction is modified to write the value directly to the local variable instead.

Java bytecode   Generated IR (optimization off)   Generated IR (optimization on)
iload x         INT_ADD tint, xint, 5             INT_ADD yint, xint, 5
iconst 5        INT_MOVE yint, tint
iadd
istore y
Figure 7: Example of limited copy propagation and dead code elimination.
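A small Java sketch of this peephole (a simplified model with invented names such as Instr and storeLocal, not the Jalapeño sources): when the bytecode stores a temporary into a local, the generator retargets the most recently emitted instruction if that instruction produced the temporary, reproducing the effect shown in Figure 7.

import java.util.ArrayList;
import java.util.List;

// Minimal sketch of BC2IR's limited copy propagation: when a temporary is
// stored into a local variable, the last generated instruction is retargeted
// to write the local directly, so no INT_MOVE is emitted.
class CopyPropSketch {
    static final class Instr {
        String op;
        String result;
        String[] operands;
        Instr(String op, String result, String... operands) {
            this.op = op; this.result = result; this.operands = operands;
        }
        public String toString() {
            return op + " " + result
                    + (operands.length > 0 ? ", " + String.join(", ", operands) : "");
        }
    }

    final List<Instr> emitted = new ArrayList<>();

    void emit(String op, String result, String... operands) {
        emitted.add(new Instr(op, result, operands));
    }

    // Called when the bytecode stores the temporary 'temp' into local 'local'.
    void storeLocal(String local, String temp) {
        if (!emitted.isEmpty()) {
            Instr last = emitted.get(emitted.size() - 1);
            if (last.result.equals(temp)) {
                last.result = local;   // write the value directly to the local
                return;                // the explicit move disappears
            }
        }
        emit("INT_MOVE", local, temp); // fallback: keep the move
    }

    public static void main(String[] args) {
        CopyPropSketch gen = new CopyPropSketch();
        gen.emit("INT_ADD", "tint", "xint", "5"); // iload x; iconst 5; iadd
        gen.storeLocal("yint", "tint");           // istore y
        gen.emitted.forEach(System.out::println); // prints: INT_ADD yint, xint, 5
    }
}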
Other optimizations such as constant propagation, dead code elimination, register renaming for local variables, method inlining, etc., are performed during the translation process. Further details are provided in [38].

6 Jalapeño Optimizing Compiler Back-end

In this section, we describe the back-end of the Jalapeño Optimizing Compiler.

6.1 Lowering of the IR

After high-level analyses and optimizations are performed, HIR is lowered to low-level IR (LIR). In contrast to HIR, the LIR expands instructions into operations that are specific to the Jalapeño JVM implementation, such as object layouts or parameter-passing mechanisms of the Jalapeño JVM. For example, operations in HIR to invoke methods of an object or of a class consist of a single instruction, closely matching the corresponding bytecode instructions such as invokevirtual/invokestatic. These single-instruction HIR operations are lowered (i.e., converted) into multiple-instruction LIR operations that invoke the methods based on the virtual-function-table layout. These multiple LIR operations expose more opportunities for low-level optimizations.

0 LABEL0 B0@0
2 float_div l4(float) = l2(float), l3(float)              (n1)
7 null_check l0(A, NonNull)                               (n2)
7 getfield_unresolved t5(float) = l0(A), <A.f1>           (n3)
10 float_mul t6(float) = l2(float), t5(float)             (n4)
14 getfield_unresolved t7(float) = l0(A, NonNull), <A.f2> (n5)
17 float_mul t8(float) = l4(float), t7(float)             (n6)
18 float_add t9(float) = t6(float), t8(float)             (n7)
21 null_check l1(B, NonNull)                              (n8)
21 float_load t10(float) = @{ l1(B), -16 }                (n9)
24 float_mul t11(float) = l3(float), t10(float)           (n10)
25 float_add t12(float) = t9(float), t11(float)           (n11)
26 return t12(float)                                      (n12)
END_BBLOCK B0@0
Figure 8: LIR of method foo().

Example: Figure 8 shows the LIR for method foo of the example in Figure 5. The labels (n1) through (n12) on the far right of each instruction indicate the corresponding node in the data dependence graph shown in Figure 9.

6.2 Dependence Graph Construction

We construct an instruction-level dependence graph, used during BURS code generation (Section 6.3), for each basic block that captures register true/anti/output dependences, memory true/anti/output dependences, and control dependences.

Figure 9: Dependence graph of the basic block in method foo().

The current implementation of memory dependences makes conservative assumptions about alias information. Synchronization constraints are modeled by introducing synchronization dependence edges between synchronization operations (monitor enter and monitor exit) and memory operations. These edges prevent code motion of memory operations across synchronization points. Java exception semantics [29] is modeled by exception dependence edges, which connect different exception points in a basic block. Exception dependence edges are also added between register write operations of local variables and exception points in the basic block. Exception dependence edges between register operations and exception points need not be added if the corresponding method does not have catch blocks.
This precise modeling of dependence constraints allows us to perform more aggressive code generation.

Example: Figure 9 shows the dependence graph for the single basic block in method foo() of Figure 5. The graph, constructed from the LIR for the method, shows register-true dependence edges, exception dependence edges, and a control dependence edge from the first instruction to the last instruction in the basic block. There are no memory dependence edges because the basic block contains only loads and no stores, and we do not currently model load-load input dependences. (Footnote 6: The addition of load-load memory dependences will be necessary to correctly support the Java memory model for multithreaded programs that contain data races.) An exception dependence edge is created between an instruction that tests for an exception (such as null_check) and an instruction that depends on the result of the test (such as getfield).

6.3 BURS-based Retargetable Code Generation

In this section, we address the problem of using tree-pattern-matching systems to perform retargetable code generation after code optimization in the Jalapeño Optimizing Compiler [33]. Our solution is based on partitioning a basic block dependence graph (defined in Section 6.2) into trees that can be given as input to a BURS-based tree-pattern-matching system [15]. Unlike previous approaches to partitioning DAGs for tree-pattern-matching (e.g., [17]), our approach considers partitioning in the presence of memory and exception dependences (not just register-true dependences). We have defined legality constraints for this partitioning, and developed a partitioning algorithm that incorporates code duplication.

input LIR:
move r2=r0
not r3=r1
and r4=r2,r3
cmp r5=r4,0
if r5,!=,LBL

DAG/tree: IF(CMP(AND(MOVE(r0), NOT(r1)), 0))

input grammar (relevant rules):
RULE  PATTERN                               COST
1     reg: REGISTER                         0
2     reg: MOVE(reg)                        0
3     reg: NOT(reg)                         1
4     reg: AND(reg,reg)                     1
5     reg: CMP(reg,INTEGER)                 1
6     stm: IF(reg)                          1
7     stm: IF(CMP(AND(reg,NOT(reg)),ZERO))  2

emitted instructions:
andc. r4,r0,r1
bne LBL
Figure 10: Example of tree pattern matching for PowerPC.

Figure 10 shows a simple example of pattern matching for the PowerPC. The data dependence graph is partitioned into trees before using BURS. Then, pattern matching is applied on the trees using a grammar (relevant fragments are illustrated in Figure 10). Each grammar rule has an associated cost, in this case the number of instructions that the rule will generate. For example, rule 2 has a zero cost because it is used to eliminate unnecessary register moves, i.e., coalescing. Although rules 3, 4, 5, and 6 could be used to parse the tree, the pattern matching selects rules 1, 2, and 7 as the ones with the least cost to cover the tree. Once these rules are selected as the least cover of the tree, the selected code is emitted as MIR instructions. Thus, for our example, only two PowerPC instructions are emitted for the five input LIR instructions.
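As a quick sanity check of the cost arithmetic (a toy sketch, not the BURS implementation), the following Java snippet totals the rule costs for the two alternative covers of the Figure 10 tree; the cover using the combined rule 7 emits 2 instructions versus 4 for covering each operator separately.

import java.util.List;
import java.util.Map;

// Toy illustration of least-cost rule selection for the tree
// IF(CMP(AND(MOVE(r0), NOT(r1)), 0)) from Figure 10.
class BursCostSketch {
    // Rule number -> cost, taken from the grammar fragment in Figure 10.
    static final Map<Integer, Integer> COST = Map.of(
            1, 0,  // reg: REGISTER
            2, 0,  // reg: MOVE(reg)
            3, 1,  // reg: NOT(reg)
            4, 1,  // reg: AND(reg,reg)
            5, 1,  // reg: CMP(reg,INTEGER)
            6, 1,  // stm: IF(reg)
            7, 2); // stm: IF(CMP(AND(reg,NOT(reg)),ZERO))

    static int coverCost(List<Integer> rules) {
        return rules.stream().mapToInt(COST::get).sum();
    }

    public static void main(String[] args) {
        // Cover each operator with its own rule: four instructions would be emitted.
        List<Integer> separate = List.of(6, 5, 4, 3, 2, 1, 1);
        // Use the combined pattern of rule 7: only the andc./bne pair is emitted.
        List<Integer> combined = List.of(7, 2, 1, 1);
        System.out.println("separate cover cost = " + coverCost(separate)); // 4
        System.out.println("combined cover cost = " + coverCost(combined)); // 2
    }
}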
Programming Language Design and Implementation | 1998
David F. Bacon; Ravi B. Konuru; Chet Murthy; Mauricio J. Serrano
Language-supported synchronization is a source of serious performance problems in many Java programs. Even single-threaded applications may spend up to half their time performing useless synchronization due to the thread-safe nature of the Java libraries. We solve this performance problem with a new algorithm that allows lock and unlock operations to be performed with only a few machine instructions in the most common cases. Our locks only require a partial word per object, and were implemented without increasing object size. We present measurements from our implementation in the JDK 1.1.2 for AIX, demonstrating speedups of up to a factor of 5 in micro-benchmarks and up to a factor of 1.7 in real programs.
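A rough Java sketch of the fast-path idea (it follows the general thin-lock scheme rather than the paper's exact bit layout in the object header, and it omits nested locking and lock inflation): in the uncontended case, lock and unlock each need only a single compare-and-swap on a partial-word lock field.

import java.util.concurrent.atomic.AtomicInteger;

// Sketch of a thin-lock fast path: the common case is one CAS on a small
// lock word; anything else falls back to a slow path.
class ThinLockSketch {
    private final AtomicInteger lockWord = new AtomicInteger(0); // 0 = unlocked

    void lock(int threadId) {
        // Common case: the lock is free; one CAS acquires it.
        if (lockWord.compareAndSet(0, threadId)) {
            return;
        }
        slowPathLock(threadId); // contention or recursion: would inflate/queue
    }

    void unlock(int threadId) {
        // Common case: we hold the lock with no contention; one CAS releases it.
        if (lockWord.compareAndSet(threadId, 0)) {
            return;
        }
        slowPathUnlock(threadId);
    }

    private void slowPathLock(int threadId) {
        while (!lockWord.compareAndSet(0, threadId)) {
            Thread.onSpinWait(); // placeholder for the real inflated-lock protocol
        }
    }

    private void slowPathUnlock(int threadId) {
        lockWord.set(0); // placeholder for the real slow-path release
    }

    public static void main(String[] args) {
        ThinLockSketch l = new ThinLockSketch();
        l.lock(42);
        System.out.println("acquired");
        l.unlock(42);
        System.out.println("released");
    }
}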
ACM Transactions on Programming Languages and Systems | 2003
Jong-Deok Choi; Manish Gupta; Mauricio J. Serrano; Vugranam C. Sreedhar; Samuel P. Midkiff
This article presents an escape analysis framework for Java to determine (1) if an object is not reachable after its method of creation returns, allowing the object to be allocated on the stack, and (2) if an object is reachable only from a single thread during its lifetime, allowing unnecessary synchronization operations on that object to be removed. We introduce a new program abstraction for escape analysis, the connection graph, that is used to establish reachability relationships between objects and object references. We show that the connection graph can be succinctly summarized for each method such that the same summary information may be used in different calling contexts without introducing imprecision into the analysis. We present an interprocedural algorithm that uses the above property to efficiently compute the connection graph and identify the nonescaping objects for methods and threads. The experimental results, from a prototype implementation of our framework in the IBM High Performance Compiler for Java, are very promising. The percentage of objects that may be allocated on the stack exceeds 70% of all dynamically created objects in the user code in three out of the ten benchmarks (with a median of 19%); 11% to 92% of all mutex lock operations are eliminated in those 10 programs (with a median of 51%), and the overall execution time reduction ranges from 2% to 23% (with a median of 7%) on a 333-MHz PowerPC workstation with 512 MB memory.
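A minimal Java sketch of the connection-graph abstraction (an illustration only; it omits field nodes, deferred edges, and the per-method summaries described in the article): reference nodes point to object nodes, and an object must stay on the heap if it is reachable from an escaping root.

import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Simplified connection graph: string-named nodes, points-to edges, and a
// reachability query from escaping roots (e.g., static fields, thread objects).
class ConnectionGraphSketch {
    final Map<String, Set<String>> pointsTo = new HashMap<>(); // node -> successors
    final Set<String> escapedRoots = new HashSet<>();          // globally escaping nodes

    void addEdge(String from, String to) {
        pointsTo.computeIfAbsent(from, k -> new HashSet<>()).add(to);
    }

    void markEscaped(String node) {
        escapedRoots.add(node);
    }

    // An object escapes if it is reachable from some escaping root.
    boolean escapes(String object) {
        Deque<String> work = new ArrayDeque<>(escapedRoots);
        Set<String> seen = new HashSet<>(escapedRoots);
        while (!work.isEmpty()) {
            String n = work.poll();
            if (n.equals(object)) return true;
            for (String succ : pointsTo.getOrDefault(n, Set.of())) {
                if (seen.add(succ)) work.add(succ);
            }
        }
        return false;
    }

    public static void main(String[] args) {
        ConnectionGraphSketch cg = new ConnectionGraphSketch();
        cg.addEdge("p", "obj1");          // p = new Point()
        cg.addEdge("Cache.last", "obj2"); // Cache.last = new Point()
        cg.markEscaped("Cache.last");     // static fields escape
        System.out.println(cg.escapes("obj1")); // false -> stack-allocation candidate
        System.out.println(cg.escapes("obj2")); // true  -> must stay on the heap
    }
}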
Programming Language Design and Implementation | 2006
Xiaotong Zhuang; Mauricio J. Serrano; Harold W. Cain; Jong-Deok Choi
Calling context profiles are used in many inter-procedural code optimizations and in overall program understanding. Unfortunately, the collection of profile information is highly intrusive due to the high frequency of method calls in most applications. Previously proposed calling-context profiling mechanisms consequently suffer from either low accuracy, high overhead, or both. We have developed a new approach for building the calling context tree at runtime, called adaptive bursting. By selectively inhibiting redundant profiling, this approach dramatically reduces overhead while preserving profile accuracy. We first demonstrate the drawbacks of previously proposed calling context profiling mechanisms. We show that a low-overhead solution using sampled stack-walking alone is less than 50% accurate, based on degree of overlap with a complete calling-context tree. We also show that a static bursting approach collects a highly accurate profile, but causes an unacceptable application slowdown. Our adaptive solution achieves 85% degree of overlap and provides an 88% hot-edge coverage when using a 0.1 hot-edge threshold, while dramatically reducing overhead compared to the static bursting approach.
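A minimal Java sketch of a calling context tree (illustrative only, with an invented CctSketch class; the single bursting flag stands in for the adaptive decision of when to profile, and toggling it safely mid-run would also require resynchronizing the shadow stack).

import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

// Calling context tree: each node is a callee reached along a particular
// call path; counts are only updated while profiling is enabled.
class CctSketch {
    static final class Node {
        final String method;
        long count;
        final Map<String, Node> children = new HashMap<>();
        Node(String method) { this.method = method; }
    }

    final Node root = new Node("<root>");
    final Deque<Node> stack = new ArrayDeque<>();
    boolean bursting = true; // an adaptive profiler would toggle this at runtime

    CctSketch() { stack.push(root); }

    void onEnter(String method) {
        if (!bursting) return; // profiling inhibited: no CCT update
        Node parent = stack.peek();
        Node child = parent.children.computeIfAbsent(method, Node::new);
        child.count++;
        stack.push(child);
    }

    void onExit() {
        if (!bursting) return;
        if (stack.size() > 1) stack.pop();
    }

    void dump(Node n, String indent) {
        for (Node c : n.children.values()) {
            System.out.println(indent + c.method + " x" + c.count);
            dump(c, indent + "  ");
        }
    }

    public static void main(String[] args) {
        CctSketch cct = new CctSketch();
        cct.onEnter("main"); cct.onEnter("parse"); cct.onExit();
        cct.onEnter("compile"); cct.onExit(); cct.onExit();
        cct.dump(cct.root, "");
    }
}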
Programming Language Design and Implementation | 2004
Ali-Reza Adl-Tabatabai; Richard L. Hudson; Mauricio J. Serrano; Sreenivas Subramoney
Cache miss stalls hurt performance because of the large gap between memory and processor speeds - for example, the popular server benchmark SPEC JBB2000 spends 45% of its cycles stalled waiting for memory requests on the Itanium® 2 processor. Traversing linked data structures causes a large portion of these stalls. Prefetching for linked data structures remains a major challenge because serial data dependencies between elements in a linked data structure preclude the timely materialization of prefetch addresses. This paper presents Mississippi Delta (MS Delta), a novel technique for prefetching linked data structures that closely integrates the hardware performance monitor (HPM), the garbage collector's global view of heap and object layout, the type-level metadata inherent in type-safe programs, and JIT compiler analysis. The garbage collector uses the HPM's data cache miss information to identify cache-miss-intensive traversal paths through linked data structures, and then discovers regular distances (deltas) between these linked objects. JIT compiler analysis injects prefetch instructions using deltas to materialize prefetch addresses. We have implemented MS Delta in a fully dynamic profile-guided optimization system: the StarJIT dynamic compiler [1] and the ORP Java virtual machine [9]. We demonstrate a 28-29% reduction in stall cycles attributable to the high-latency cache misses targeted by MS Delta and a speedup of 11-14% on the cache-miss-intensive SPEC JBB2000 benchmark.
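A small Java sketch of the delta-discovery step described above (an assumption about the mechanism, not the MS Delta implementation): given sampled addresses of consecutive objects on a cache-miss-intensive traversal, the most common distance becomes the delta a JIT could use to materialize prefetch addresses.

import java.util.HashMap;
import java.util.Map;

// Delta discovery: histogram the distances between consecutive objects on a
// hot linked traversal and pick the most frequent one.
class DeltaDiscoverySketch {
    static long mostCommonDelta(long[] addresses) {
        Map<Long, Integer> histogram = new HashMap<>();
        for (int i = 1; i < addresses.length; i++) {
            histogram.merge(addresses[i] - addresses[i - 1], 1, Integer::sum);
        }
        return histogram.entrySet().stream()
                .max(Map.Entry.comparingByValue())
                .map(Map.Entry::getKey)
                .orElse(0L);
    }

    public static void main(String[] args) {
        // Addresses of list nodes laid out mostly at a regular 48-byte stride by the GC.
        long[] sampled = {0x1000, 0x1030, 0x1060, 0x1090, 0x1200, 0x1230};
        System.out.println("delta = " + mostCommonDelta(sampled)); // 48
    }
}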
Conference on Object-Oriented Programming Systems, Languages, and Applications | 2000
Mauricio J. Serrano; Rajesh Bordawekar; Samuel P. Midkiff; Manish Gupta
This paper presents the design and implementation of the Quicksilver quasi-static compiler for Java. Quasi-static compilation is a new approach that combines the benefits of static and dynamic compilation, while maintaining compliance with the Java standard, including support of its dynamic features. A quasi-static compiler relies on the generation and reuse of persistent code images to reduce the overhead of compilation during program execution, and to provide identical, testable and reliable binaries over different program executions. At runtime, the quasi-static compiler adapts pre-compiled binaries to the current JVM instance, and uses dynamic compilation of the code when necessary to support dynamic Java features. Our system allows interprocedural program optimizations to be performed while maintaining binary compatibility. Experimental data obtained using a preliminary implementation of a quasi-static compiler in the Jalapeño JVM clearly demonstrates the benefits of our approach: we achieve a runtime compilation cost comparable to that of baseline (fast, non-optimizing) compilation, and deliver the runtime program performance of the highest optimization level supported by the Jalapeño optimizing compiler. For the SPECjvm98 benchmark suite, we obtain a factor of 104 to 158 reduction in the runtime compilation overhead relative to the Jalapeño optimizing compiler. Relative to the better of the baseline and the optimizing Jalapeño compilers, the overall performance (taking into account both runtime compilation and execution costs) is increased by 9.2% to 91.4% for the SPECjvm98 benchmarks with size 100, and by 54% to 356% for the (shorter running) SPECjvm98 benchmarks with size 10.
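A hypothetical Java sketch of the reuse decision such a compiler must make (the digest-based keying is an assumption for illustration, not the Quicksilver mechanism): a persisted code image is reused only if the class it was compiled from is unchanged; otherwise the method falls back to dynamic compilation.

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.HashMap;
import java.util.Map;

// Persistent code images keyed by a digest of the originating class bytes;
// a stale image triggers recompilation instead of reuse.
class QuasiStaticReuseSketch {
    record CodeImage(String classDigest, byte[] machineCode) {}

    final Map<String, CodeImage> persistedImages = new HashMap<>();

    static String digest(byte[] classBytes) throws Exception {
        byte[] d = MessageDigest.getInstance("SHA-256").digest(classBytes);
        StringBuilder sb = new StringBuilder();
        for (byte b : d) sb.append(String.format("%02x", b));
        return sb.toString();
    }

    byte[] codeFor(String className, byte[] classBytes) throws Exception {
        CodeImage image = persistedImages.get(className);
        String current = digest(classBytes);
        if (image != null && image.classDigest().equals(current)) {
            return image.machineCode();                 // reuse the pre-compiled binary
        }
        byte[] fresh = dynamicallyCompile(classBytes);  // slow path: compile now
        persistedImages.put(className, new CodeImage(current, fresh));
        return fresh;
    }

    private byte[] dynamicallyCompile(byte[] classBytes) {
        return new byte[]{0x60}; // stand-in for generated machine code
    }

    public static void main(String[] args) throws Exception {
        QuasiStaticReuseSketch jit = new QuasiStaticReuseSketch();
        byte[] classFile = "class t1 ...".getBytes(StandardCharsets.UTF_8);
        jit.codeFor("t1", classFile);                             // first run: compile and persist
        System.out.println(jit.codeFor("t1", classFile).length);  // later run: image reused
    }
}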
IBM Journal of Research and Development | 2015
Ravi Nair; Samuel F. Antao; Carlo Bertolli; Pradip Bose; José R. Brunheroto; Tong Chen; Chen-Yong Cher; Carlos H. Andrade Costa; J. Doi; Constantinos Evangelinos; Bruce M. Fleischer; Thomas W. Fox; Diego S. Gallo; Leopold Grinberg; John A. Gunnels; Arpith C. Jacob; P. Jacob; Hans M. Jacobson; Tejas Karkhanis; Choon Young Kim; Jaime H. Moreno; John Kevin Patrick O'Brien; Martin Ohmacht; Yoonho Park; Daniel A. Prener; Bryan S. Rosenburg; Kyung Dong Ryu; Olivier Sallenave; Mauricio J. Serrano; Patrick Siegl
Many studies point to the difficulty of scaling existing computer architectures to meet the needs of an exascale system (i.e., capable of executing 10^18 floating-point operations per second), consuming no more than 20 MW in power, by around the year 2020. This paper outlines a new architecture, the Active Memory Cube, which reduces the energy of computation significantly by performing computation in the memory module, rather than moving data through large memory hierarchies to the processor core. The architecture leverages a commercially demonstrated 3D memory stack called the Hybrid Memory Cube, placing sophisticated computational elements on the logic layer below its stack of dynamic random-access memory (DRAM) dies. The paper also describes an Active Memory Cube tuned to the requirements of a scientific exascale system. The computational elements have a vector architecture and are capable of performing a comprehensive set of floating-point and integer instructions, predicated operations, and gather-scatter accesses across memory in the Cube. The paper outlines the software infrastructure used to develop applications and to evaluate the architecture, and describes results of experiments on application kernels, along with performance and power projections.
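A tiny Java illustration (not AMC code) of the gather-scatter access pattern that the Cube's computational elements support directly: values are read from, and written back to, arbitrary indexed locations in memory.

class GatherScatterSketch {
    static double[] gather(double[] memory, int[] indices) {
        double[] out = new double[indices.length];
        for (int i = 0; i < indices.length; i++) {
            out[i] = memory[indices[i]]; // gather: read from scattered locations
        }
        return out;
    }

    static void scatter(double[] memory, int[] indices, double[] values) {
        for (int i = 0; i < indices.length; i++) {
            memory[indices[i]] = values[i]; // scatter: write back to scattered locations
        }
    }

    public static void main(String[] args) {
        double[] memory = new double[16];
        int[] indices = {3, 7, 11, 2};
        scatter(memory, indices, new double[]{1.0, 2.0, 3.0, 4.0});
        System.out.println(java.util.Arrays.toString(gather(memory, indices))); // [1.0, 2.0, 3.0, 4.0]
    }
}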
Hawaii International Conference on System Sciences | 1994
Wayne Yamamoto; Mauricio J. Serrano; Adam R. Talcott; Roger C. Wood; M. Nemirosky
IEEE Transactions on Parallel and Distributed Systems | 1993
Mauricio J. Serrano; Behrooz Parhami