
Publications

Featured research published by Harini Srinivasan.


Measurement and Modeling of Computer Systems | 1998

Deterministic replay of Java multithreaded applications

Jong-Deok Choi; Harini Srinivasan

A multithreaded program includes sequences of events wherein each sequence is associated with one of a plurality of execution threads. In a record mode, the software tool of the present invention records a run-time representation of the program by distinguishing critical events from non-critical events of the program and identifying the execution order of such critical events. Groups of critical events are generated wherein, for each group Gi, critical events belonging to the group Gi belong to a common execution thread, critical events belonging to the group Gi are consecutive, and only non-critical events occur between any two consecutive critical events in the group Gi. In addition, the groups are ordered and no two adjacent groups include critical events that belong to a common execution thread. For each execution thread, a logical thread schedule is generated that identifies a sequence of said groups associated with the execution thread. The logical thread schedules are stored in persistent storage for subsequent reuse. In a replay mode, for each execution thread, the logical thread schedule associated with the execution thread is loaded from persistent storage and the critical events identified by the logical thread schedule are executed.
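The logical-thread-schedule idea in the abstract can be sketched as follows: a global clock orders critical events, and each maximal run of consecutive critical events from one thread forms a group Gi, recorded as an interval. This is a minimal illustrative sketch, not DejaVu's actual implementation; the class and method names are invented for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicLong;

public class ScheduleRecorder {
    private final AtomicLong globalClock = new AtomicLong();

    /** One group Gi: consecutive critical events of a single thread. */
    public record Interval(long threadId, long first, long last) {}

    private final List<Interval> schedule = new ArrayList<>();

    /** Called at every critical event (e.g., a shared-variable access). */
    public synchronized void onCriticalEvent(long threadId) {
        long tick = globalClock.getAndIncrement();
        int n = schedule.size();
        if (n > 0 && schedule.get(n - 1).threadId() == threadId) {
            // Extend the current group: same thread, consecutive events.
            Interval prev = schedule.remove(n - 1);
            schedule.add(new Interval(threadId, prev.first(), tick));
        } else {
            // Start a new group; adjacent groups belong to different threads.
            schedule.add(new Interval(threadId, tick, tick));
        }
    }

    public List<Interval> schedule() { return schedule; }
}
```

In replay mode, the stored intervals would be consulted to force each thread to execute exactly the critical events in its recorded clock ranges, in order.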


Proceedings of the ACM 1999 Conference on Java Grande | 1999

The Jalapeño dynamic optimizing compiler for Java

Michael G. Burke; Jong-Deok Choi; Stephen J. Fink; David Grove; Michael Hind; Vivek Sarkar; Mauricio J. Serrano; Vugranam C. Sreedhar; Harini Srinivasan; John Whaley

[Figure 4: Overview of the BC2IR algorithm: a main loop that chooses a basic block from a work set, and an interpretation loop that parses bytecode, updates the symbolic state, and rectifies that state with successor basic blocks.]

[Figure 5: An example Java program:
class t1 {
  static float foo(A a, B b, float c1, float c3) {
    float c2 = c1/c3;
    return(c1*a.f1 + c2*a.f2 + c3*b.f1);
  }
}]

An element-wise meet operation is used on the stack operands to update the symbolic state [38]. When a backward branch whose target is the middle of an already-generated basic block is encountered, the basic block is split at that point. If the stack is not empty at the start of the split BB, the basic block must be regenerated because the initial states may be incorrect. The initial state of a BB may also be incorrect due to as-of-yet-unseen control flow joins. To minimize the number of times HIR is generated for a BB, a simple greedy algorithm is used for selecting BBs in the main loop: the BB with the lowest starting bytecode index is chosen. This simple heuristic relies on the fact that, except for loops, all control flow constructs are generated in topological order, and that the control flow graph is reducible. Surprisingly, for programs compiled with current Java compilers, the greedy algorithm can always find the optimal ordering in practice. (The optimal order for basic block generation, minimizing the number of regenerations, is a topological order ignoring back edges; however, because BC2IR computes the control flow graph in the same pass, it cannot compute the optimal order a priori.)

Example: Figure 5 shows an example Java source program of class t1, and Figure 6 shows the HIR for method foo of the example. The number in the first column of each HIR instruction is the index of the bytecode from which the instruction is generated. Before compiling class t1, we compiled and loaded class B, but not class A. As a result, the HIR instructions for accessing fields of class A (bytecode indices 7 and 14 in Figure 6) are getfield_unresolved, while the HIR instruction accessing a field of class B (bytecode index 21) is a regular getfield instruction. Also notice that there is only one null_check instruction covering both getfield_unresolved instructions; this is a result of BC2IR's on-the-fly optimizations.

[Figure 6: HIR of method foo(). l and t are virtual registers for local variables and temporary operands, respectively:
0  LABEL0 B0@0
2  float_div l4(float) = l2(float), l3(float)
7  null_check l0(A, NonNull)
7  getfield_unresolved t5(float) = l0(A), <A.f1>
10 float_mul t6(float) = l2(float), t5(float)
14 getfield_unresolved t7(float) = l0(A, NonNull), <A.f2>
17 float_mul t8(float) = l4(float), t7(float)
18 float_add t9(float) = t6(float), t8(float)
21 null_check l1(B, NonNull)
21 getfield t10(float) = l1(B), <B.f1>
24 float_mul t11(float) = l3(float), t10(float)
25 float_add t12(float) = t9(float), t11(float)
26 float_return t12(float)
   END_BBLOCK B0@0]

5.2 On-the-Fly Analyses and Optimizations

To illustrate our approach to on-the-fly optimizations, we consider copy propagation as an example. Java bytecode often contains sequences that perform a calculation and store the result into a local variable (see Figure 7). A simple copy propagation can eliminate most of the unnecessary temporaries. When storing from a temporary into a local variable, BC2IR inspects the most recently generated instruction. If its result is the same temporary, the instruction is modified to write the value directly to the local variable instead.

[Figure 7: Example of limited copy propagation and dead code elimination. For the bytecode sequence "iload x; iconst 5; iadd; istore y", the generated IR with the optimization off is "INT_ADD tint, xint, 5; INT_MOVE yint, tint"; with the optimization on it is simply "INT_ADD yint, xint, 5".]

Other optimizations, such as constant propagation, dead code elimination, register renaming for local variables, and method inlining, are performed during the translation process. Further details are provided in [38].

6 Jalapeño Optimizing Compiler Back-end

In this section, we describe the back-end of the Jalapeño Optimizing Compiler.

6.1 Lowering of the IR

After high-level analyses and optimizations are performed, HIR is lowered to low-level IR (LIR). In contrast to HIR, the LIR expands instructions into operations that are specific to the Jalapeño JVM implementation, such as object layouts or parameter-passing mechanisms of the Jalapeño JVM. For example, operations in HIR to invoke methods of an object or of a class consist of a single instruction, closely matching the corresponding bytecode instructions such as invokevirtual/invokestatic. These single-instruction HIR operations are lowered (i.e., converted) into multiple-instruction LIR operations that invoke the methods based on the virtual-function-table layout. These multiple LIR operations expose more opportunities for low-level optimizations.

[Figure 8: LIR of method foo(). The labels (n1) through (n12) at the far right of each instruction indicate the corresponding node in the data dependence graph shown in Figure 9:
0  LABEL0 B0@0
2  float_div l4(float) = l2(float), l3(float)             (n1)
7  null_check l0(A, NonNull)                              (n2)
7  getfield_unresolved t5(float) = l0(A), <A.f1>          (n3)
10 float_mul t6(float) = l2(float), t5(float)             (n4)
14 getfield_unresolved t7(float) = l0(A, NonNull), <A.f2> (n5)
17 float_mul t8(float) = l4(float), t7(float)             (n6)
18 float_add t9(float) = t6(float), t8(float)             (n7)
21 null_check l1(B, NonNull)                              (n8)
21 float_load t10(float) = @{ l1(B), -16 }                (n9)
24 float_mul t11(float) = l3(float), t10(float)           (n10)
25 float_add t12(float) = t9(float), t11(float)           (n11)
26 return t12(float)                                      (n12)
   END_BBLOCK B0@0]

Example: Figure 8 shows the LIR for method foo of the example in Figure 5.

6.2 Dependence Graph Construction

We construct an instruction-level dependence graph, used during BURS code generation (Section 6.3), for each basic block. It captures register true/anti/output dependences, memory true/anti/output dependences, and control dependences.

[Figure 9: Dependence graph of the basic block in method foo(): nodes n1 through n12 connected by register-true, exception, and control dependence edges.]

The current implementation of memory dependences makes conservative assumptions about alias information. Synchronization constraints are modeled by introducing synchronization dependence edges between synchronization operations (monitor_enter and monitor_exit) and memory operations. These edges prevent code motion of memory operations across synchronization points. Java exception semantics [29] is modeled by exception dependence edges, which connect different exception points in a basic block. Exception dependence edges are also added between register write operations of local variables and exception points in the basic block; these edges need not be added if the corresponding method has no catch blocks. This precise modeling of dependence constraints allows us to perform more aggressive code generation.

Example: Figure 9 shows the dependence graph for the single basic block in method foo() of Figure 5. The graph, constructed from the LIR for the method, shows register-true dependence edges, exception dependence edges, and a control dependence edge from the first instruction to the last instruction in the basic block. There are no memory dependence edges because the basic block contains only loads and no stores, and we do not currently model load-load input dependences. (The addition of load-load memory dependences will be necessary to correctly support the Java memory model for multithreaded programs that contain data races.) An exception dependence edge is created between an instruction that tests for an exception (such as null_check) and an instruction that depends on the result of the test (such as getfield).

6.3 BURS-based Retargetable Code Generation

In this section, we address the problem of using tree-pattern-matching systems to perform retargetable code generation after code optimization in the Jalapeño Optimizing Compiler [33]. Our solution is based on partitioning a basic block dependence graph (defined in Section 6.2) into trees that can be given as input to a BURS-based tree-pattern-matching system [15]. Unlike previous approaches to partitioning DAGs for tree-pattern-matching (e.g., [17]), our approach considers partitioning in the presence of memory and exception dependences (not just register-true dependences). We have defined legality constraints for this partitioning, and developed a partitioning algorithm that incorporates code duplication.

[Figure 10: Example of tree pattern matching for PowerPC.
Input LIR:            Input grammar (relevant rules):
  move r2=r0            RULE  PATTERN                           COST
  not  r3=r1            1     reg: REGISTER                     0
  and  r4=r2,r3         2     reg: MOVE(reg)                    0
  cmp  r5=r4,0          3     reg: NOT(reg)                     1
  if   r5,!=,LBL        4     reg: AND(reg,reg)                 1
DAG/tree:               5     reg: CMP(reg,INTEGER)             1
  IF(CMP(AND(MOVE(r0),  6     stm: IF(reg)                      1
     NOT(r1)),0))       7     stm: IF(CMP(AND(reg,NOT(reg)),ZERO)) 2
Emitted instructions:
  andc. r4,r0,r1
  bne   LBL]

Figure 10 shows a simple example of pattern matching for the PowerPC. The data dependence graph is partitioned into trees before using BURS. Then, pattern matching is applied on the trees using a grammar (relevant fragments are illustrated in Figure 10). Each grammar rule has an associated cost, in this case the number of instructions that the rule will generate. For example, rule 2 has a zero cost because it is used to eliminate unnecessary register moves, i.e., coalescing. Although rules 3, 4, 5, and 6 could be used to parse the tree, the pattern matching selects rules 1, 2, and 7 as the ones with the least cost to cover the tree. Once these rules are selected as the least cover of the tree, the selected code is emitted as MIR instructions. Thus, for our example, only two PowerPC instructions are emitted.
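The limited copy propagation performed during BC2IR translation, described above, can be sketched as follows: when a move from a temporary into a local variable is emitted, and the most recently generated instruction defined exactly that temporary, the earlier instruction is rewritten to target the local directly and the move is dropped. The Instruction class is an illustrative IR stand-in, not Jalapeño's actual data structures, and the sketch omits the liveness bookkeeping a real compiler would need.

```java
import java.util.ArrayList;
import java.util.List;

public class Bc2IrCopyProp {
    public static final class Instruction {
        final String op;        // e.g. "INT_ADD", "INT_MOVE"
        String result;          // destination register
        final List<String> uses;
        Instruction(String op, String result, List<String> uses) {
            this.op = op; this.result = result; this.uses = uses;
        }
        public String toString() { return op + " " + result + " = " + uses; }
    }

    private final List<Instruction> code = new ArrayList<>();

    public void emit(Instruction insn) {
        Instruction last = code.isEmpty() ? null : code.get(code.size() - 1);
        if (insn.op.equals("INT_MOVE")
                && last != null
                && last.result.equals(insn.uses.get(0))) {
            // Fold the move into the producing instruction: write the
            // value directly to the local instead of the temporary.
            last.result = insn.result;
            return;
        }
        code.add(insn);
    }

    public List<Instruction> code() { return code; }
}
```

Feeding it the Figure 7 sequence (an add into temporary t1, then a move of t1 into local y) leaves a single instruction that adds directly into y.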


International Parallel and Distributed Processing Symposium | 2000

Deterministic replay of distributed Java applications

Ravi B. Konuru; Harini Srinivasan; Jong-Deok Choi

Execution behavior of a Java application can be nondeterministic due to concurrent threads of execution, thread scheduling, and variable network delays. This nondeterminism makes understanding and debugging multi-threaded distributed Java applications a difficult and laborious process. It is well accepted that providing deterministic replay of application execution is a key step towards programmer productivity and program understanding. Towards this goal, we developed a replay framework based on logical thread schedules and logical intervals. An application of this framework was previously published in the context of a system called DejaVu that provides deterministic replay of multi-threaded Java programs on a single Java Virtual Machine (JVM). In contrast, this paper focuses on distributed DejaVu, which provides deterministic replay of distributed Java applications running on multiple JVMs. We describe the issues and present the design, implementation, and preliminary performance results of distributed DejaVu, which supports both multi-threaded and distributed Java applications.


IEEE Transactions on Parallel and Distributed Systems | 1996

A unified framework for optimizing communication in data-parallel programs

Manish Gupta; Edith Schonberg; Harini Srinivasan

This paper presents a framework, based on global array data-flow analysis, to reduce communication costs in a program being compiled for a distributed memory machine. We introduce available section descriptor, a novel representation of communication involving array sections. This representation allows us to apply techniques for partial redundancy elimination to obtain powerful communication optimizations. With a single framework, we are able to capture optimizations like (1) vectorizing communication, (2) eliminating communication that is redundant on any control flow path, (3) reducing the amount of data being communicated, (4) reducing the number of processors to which data must be communicated, and (5) moving communication earlier to hide latency, and to subsume previous communication. We show that the bidirectional problem of eliminating partial redundancies can be decomposed into simpler unidirectional problems even in the context of an array section representation, which makes the analysis procedure more efficient. We present results from a preliminary implementation of this framework, which are extremely encouraging, and demonstrate the effectiveness of this analysis in improving the performance of programs.
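Optimization (1), vectorizing communication, can be illustrated with a toy simulation (not the paper's framework): per-element sends inside a loop are replaced by a single message carrying the whole array section, cutting the message count from n to 1. The send() method here is a hypothetical stand-in that just counts messages.

```java
public class CommVectorize {
    public static int messages = 0;

    // Stand-in for sending one message to another processor.
    static void send(double[] payload) { messages++; }

    /** Naive: one message per array element inside the loop. */
    public static void elementWise(double[] a) {
        for (double v : a) send(new double[] { v });
    }

    /** Vectorized: the whole section travels in one message. */
    public static void vectorized(double[] a) { send(a); }
}
```

The per-message startup cost on distributed memory machines is typically far larger than the per-byte cost, which is why collapsing n small messages into one large one pays off.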


ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming | 1993

Data flow equations for explicitly parallel programs

Dirk Grunwald; Harini Srinivasan

We present a solution to the reaching definitions problem for programs with explicit lexically specified parallel constructs, such as cobegin/coend or parallel_sections, both with and without explicit synchronization operations, such as Post, Wait or Advance. The reaching definitions information for sequential programs is used to solve many standard optimization problems. In parallel programs, this information can also be used to explicitly direct communication and data ownership. Although work has been done on analyzing parallel programs to detect data races, little work has been done on optimizing such programs. We show how the memory consistency model specified by an explicitly parallel programming language can influence the complexity of the reaching definitions problem. By selecting the “weakest” memory consistency semantics, we can efficiently solve the reaching definitions problem for correct programs.
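For reference, here is a minimal iterative solver for the classical sequential reaching definitions problem that the abstract builds on. The paper's contribution is extending these equations to cobegin/coend parallelism, which this sketch does not attempt; block ids and the GEN/KILL sets are simply inputs.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class ReachingDefs {
    /**
     * Iterates IN[b] = union of OUT[p] over predecessors p, and
     * OUT[b] = GEN[b] + (IN[b] - KILL[b]), until a fixed point.
     * Returns OUT for every block.
     */
    public static Map<Integer, Set<String>> solve(
            Map<Integer, List<Integer>> preds,
            Map<Integer, Set<String>> gen,
            Map<Integer, Set<String>> kill) {
        Map<Integer, Set<String>> out = new HashMap<>();
        for (Integer b : preds.keySet()) out.put(b, new HashSet<>(gen.get(b)));
        boolean changed = true;
        while (changed) {
            changed = false;
            for (Integer b : preds.keySet()) {
                Set<String> in = new HashSet<>();
                for (Integer p : preds.get(b)) in.addAll(out.get(p));
                in.removeAll(kill.get(b));   // definitions killed in b
                in.addAll(gen.get(b));       // definitions generated in b
                if (!in.equals(out.get(b))) { out.put(b, in); changed = true; }
            }
        }
        return out;
    }
}
```

In a straight-line CFG 0 → 1 → 2 where block 0 defines x (d1) and block 1 redefines it (d2), only d2 reaches block 2, as the test below checks.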


European Conference on Object-Oriented Programming | 2006

Modeling runtime behavior in framework-based applications

Nick Mitchell; Gary Sevitsky; Harini Srinivasan

Our research group has analyzed many industrial, framework-based applications. In these applications, simple functionality often requires excessive runtime activity, and it is increasingly difficult to assess if and how inefficiencies can be fixed. Much of this activity involves the transformation of information, due to framework couplings. We present an approach to modeling and quantifying behavior in terms of what transformations accomplish. We structure activity into dataflow diagrams that capture the flow between transformations. Across disparate implementations, we observe commonalities in how transformations use and change their inputs. We introduce a vocabulary of common phenomena of use and change, and four ways to classify data and transformations using this vocabulary. The structuring and classification enable evaluation and comparison in terms abstracted from implementation specifics. We introduce metrics of complexity and cost, including behavior signatures that attribute measures to phenomena. We demonstrate the approach on a benchmark, a library, and two industrial applications.


International Symposium on Software Testing and Analysis | 2004

SABER: smart analysis based error reduction

Darrell C. Reimer; Edith Schonberg; Kavitha Srinivas; Harini Srinivasan; Bowen Alpern; Robert D. Johnson; Aaron Kershenbaum; Larry Koved

In this paper, we present an approach to automatically detect high-impact coding errors in large Java applications that use frameworks. These high-impact errors cause serious performance degradation and outages in real-world production environments, are very time-consuming to detect, and potentially cost businesses thousands of dollars. Based on three years of experience working with IBM customer production systems, we have identified over 400 high-impact coding patterns, from which we have been able to distill a small set of pattern detection algorithms. These algorithms use deep static analysis, thus moving problem detection earlier in the development cycle from production to development. Additionally, we have developed an automatic false positive filtering mechanism based on domain-specific knowledge to achieve a level of usability acceptable to IBM field engineers. Our approach also provides the necessary contextual information around the sources of the problems to help in problem remediation. We outline how our approach to problem determination can be extended to multiple programming models and domains. We have implemented this problem determination approach in the SABER tool and have used it successfully to detect many serious code defects in several large commercial applications. This paper shows results from four such applications that had over 60 coding defects.


Foundations of Software Engineering | 2005

Summarizing application performance from a components perspective

Kavitha Srinivas; Harini Srinivasan

In the era of distributed development, it is common for large applications to be assembled from multiple component layers that are developed by different development teams. Layered applications have deep call paths and numerous invocations (average call stack depth of up to 75, and up to 35 million invocations in the applications we studied), making summarization of performance problems a critical issue. Summarization of performance by classes, methods, invocations, or packages is usually inadequate because they are often at the wrong level of granularity. We propose a technique that uses thresholding and filtering to identify a small set of interesting method invocations in components deemed interesting by the user. We show the utility of this technique with a set of 7 real-life applications, where the technique was used to identify a small set (10-93) of expensive invocations that accounted for 82-99% of the overall performance costs of the application. Our experience shows that this type of characterization can help quickly isolate the specific parts of a large system that can benefit most from performance tuning.
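The thresholding-and-filtering idea can be sketched as follows: keep only invocations that fall in components the user marked interesting and whose cost exceeds a fraction of the total cost. The Invocation record and its fields are illustrative, not the paper's data model.

```java
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

public class InvocationFilter {
    /** One profiled method invocation with its inclusive cost. */
    public record Invocation(String component, String method, long cost) {}

    /**
     * Returns invocations in user-selected components whose cost is at
     * least `threshold` (a fraction, e.g. 0.05) of the total cost.
     */
    public static List<Invocation> interesting(List<Invocation> all,
                                               Set<String> components,
                                               double threshold) {
        long total = all.stream().mapToLong(Invocation::cost).sum();
        return all.stream()
                  .filter(i -> components.contains(i.component()))
                  .filter(i -> i.cost() >= threshold * total)
                  .collect(Collectors.toList());
    }
}
```

With a 5% threshold, a profile of millions of invocations collapses to the handful that dominate the cost, which is the effect the abstract reports (10-93 invocations covering 82-99% of cost).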


Languages and Compilers for Parallel Computing | 1994

A Unified Data-Flow Framework for Optimizing Communication

Manish Gupta; Edith Schonberg; Harini Srinivasan

This paper presents a framework, based on global array data flow analysis, to reduce communication costs in a program being compiled for a distributed memory machine. This framework applies techniques for partial redundancy elimination to available section descriptors, a novel representation of communication involving array sections. With a single framework, we are able to capture numerous optimizations like (i) vectorizing communication, (ii) eliminating communication that is redundant on any control flow path, (iii) reducing the amount of data being communicated, (iv) reducing the number of processors to which data must be communicated, and (v) moving communication earlier to hide latency, and to subsume previous communication. Further, the explicit representation of availability of data in our framework allows processors other than the owners also to send values needed by other processors, leading to additional opportunities for optimizing communication. Another contribution of this paper is to show that the bidirectional problem of eliminating partial redundancies can be decomposed into simpler unidirectional problems, in the context of communication as well.


Concurrency and Computation: Practice and Experience | 1997

SPMD programming in Java

Susan Flynn Hummel; Ton Ngo; Harini Srinivasan

We consider the suitability of the Java concurrency constructs for writing high-performance SPMD code for parallel machines. More specifically, we investigate implementing a financial application in Java on a distributed-memory parallel machine. Although Java was not expressly targeted at such applications and architectures, we conclude that efficient implementations are feasible. Finally, we propose a library of Java methods to facilitate SPMD programming.
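A flavor of the SPMD style discussed here: every thread runs the same code on its own block of the iteration space and synchronizes at a barrier. This sketch uses java.util.concurrent, which postdates the paper (the paper proposes its own method library), so treat it as a modern analogue rather than the paper's API.

```java
import java.util.concurrent.CyclicBarrier;

public class SpmdSum {
    public static long[] partial;

    /** Each of nThreads workers sums its own block of data, then waits at a barrier. */
    public static void run(int nThreads, int[] data) {
        partial = new long[nThreads];
        CyclicBarrier barrier = new CyclicBarrier(nThreads);
        Thread[] workers = new Thread[nThreads];
        for (int id = 0; id < nThreads; id++) {
            final int me = id;
            workers[me] = new Thread(() -> {
                // Block-partition the index space among the threads.
                int chunk = (data.length + nThreads - 1) / nThreads;
                int lo = me * chunk, hi = Math.min(data.length, lo + chunk);
                long s = 0;
                for (int i = lo; i < hi; i++) s += data[i];
                partial[me] = s;
                // After the barrier, every thread can safely read all partials.
                try { barrier.await(); } catch (Exception e) { throw new RuntimeException(e); }
            });
            workers[me].start();
        }
        for (Thread t : workers) {
            try { t.join(); } catch (InterruptedException e) { throw new RuntimeException(e); }
        }
    }
}
```

Summing 1..100 across four threads yields per-thread partials that total 5050; the barrier is what lets a subsequent reduction phase read them without races.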
