Publication


Featured research published by James M. Stichnoth.


Programming Language Design and Implementation | 2000

Practicing JUDO: Java under dynamic optimizations

Michal Cierniak; Guei-Yuan Lueh; James M. Stichnoth

A high-performance implementation of a Java Virtual Machine (JVM) consists of efficient implementation of Just-In-Time (JIT) compilation, exception handling, synchronization mechanisms, and garbage collection (GC). These components are tightly coupled to achieve high performance. In this paper, we present some static and dynamic techniques implemented in the JIT compilation and exception handling of the Microprocessor Research Lab Virtual Machine (MRL VM), i.e., lazy exceptions, lazy GC mapping, dynamic patching, and bounds checking elimination. Our experiments used IA-32 as the hardware platform, but the optimizations can be generalized to other architectures.
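
The abstract does not detail how the MRL VM implements bounds checking elimination; the sketch below, in C with hypothetical function names, only illustrates the general transformation, where a single guard hoisted outside a loop replaces a per-iteration check once the index range is known to be safe.

```c
#include <stddef.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical illustration (not MRL VM code): if the compiler can prove
 * that every index in [lo, hi) is within bounds, one guard outside the
 * loop replaces a bounds check on every access. */

static void sum_checked(const int *a, size_t len, size_t lo, size_t hi, long *out) {
    long s = 0;
    for (size_t i = lo; i < hi; i++) {
        if (i >= len) {                    /* per-iteration bounds check */
            fprintf(stderr, "index out of bounds\n");
            exit(1);
        }
        s += a[i];
    }
    *out = s;
}

static void sum_hoisted(const int *a, size_t len, size_t lo, size_t hi, long *out) {
    if (hi > len) {                        /* single guard hoisted out of the loop */
        sum_checked(a, len, lo, hi, out);  /* slow path keeps full checking */
        return;
    }
    long s = 0;
    for (size_t i = lo; i < hi; i++)       /* fast path: no checks inside the loop */
        s += a[i];
    *out = s;
}

int main(void) {
    int a[8] = {1, 2, 3, 4, 5, 6, 7, 8};
    long s;
    sum_hoisted(a, 8, 0, 8, &s);
    printf("%ld\n", s);
    return 0;
}
```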


ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming | 1993

Exploiting task and data parallelism on a multicomputer

Jaspal Subhlok; James M. Stichnoth; David R. O'Hallaron; Thomas R. Gross

For many applications, achieving good performance on a private memory parallel computer requires exploiting data parallelism as well as task parallelism. Depending on the size of the input data set and the number of nodes (i.e., processors), different tradeoffs between task and data parallelism are appropriate for a parallel system. Most existing compilers exploit either data parallelism or task parallelism, but not both. Therefore, to achieve the desired results, the programmer must separately program the data and task parallelism. We have taken a unified approach to exploiting both kinds of parallelism in a single framework with an existing language. This approach eases the task of programming and exposes the tradeoffs between data and task parallelism to the compiler. We have implemented a parallelizing Fortran compiler for the iWarp system based on this approach. We discuss the design of our compiler, and present performance results to validate our approach.
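
As a loose illustration of mixing the two kinds of parallelism (this is OpenMP in C, not the Fortran/iWarp system described in the paper), the sketch below runs two independent tasks, each of which is itself a data-parallel loop.

```c
#include <stdio.h>

#define N 1000000
static double a[N], b[N];

/* Illustrative sketch only: two independent stages run as parallel tasks
 * (task parallelism), and each stage is itself a data-parallel loop over
 * its array (data parallelism).  The paper's compiler derives this kind
 * of mapping automatically from a single source program. */
int main(void) {
    #pragma omp parallel sections
    {
        #pragma omp section                /* task 1: data-parallel over a[] */
        {
            #pragma omp parallel for
            for (int i = 0; i < N; i++)
                a[i] = i * 0.5;
        }
        #pragma omp section                /* task 2: data-parallel over b[] */
        {
            #pragma omp parallel for
            for (int i = 0; i < N; i++)
                b[i] = i * 2.0;
        }
    }
    printf("%f %f\n", a[N - 1], b[N - 1]);
    return 0;
}
```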


Journal of Parallel and Distributed Computing | 1994

Generating communication for array statements: design, implementation, and evaluation

James M. Stichnoth; David R. O'Hallaron; Thomas R. Gross

Array statements as included in Fortran 90 or High Performance Fortran (HPF) are a well-accepted way to specify data parallelism in programs. When generating code for such a data parallel program for a private memory parallel system, the compiler must determine when array elements must be moved from one processor to another. This paper describes a practical method to compute the set of array elements that are to be moved; it covers all the distributions that are included in HPF: block, cyclic, and block-cyclic. This method is the foundation for an efficient protocol for modern private memory parallel systems: for each block of data to be sent, the sender processor computes the local address in the receiver's address space, and the address is then transmitted together with the data. This strategy increases the communication load but reduces the overhead on the receiving processor. We implemented this optimization in an experimental Fortran compiler, and this paper reports an empirical evaluation on a 64-node private memory iWarp system, using a number of different distributions.
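
The sketch below shows only the block-cyclic ownership rule that such communication generation builds on, not the paper's algorithm itself; the function names and the parameters n, b, and p are illustrative.

```c
#include <stdio.h>

/* Minimal sketch of the HPF block-cyclic owner/offset computation that
 * communication generation builds on (block size b, p processors).
 * block distribution: b = ceil(n/p); cyclic distribution: b = 1. */

static int  owner(long i, long b, int p)       { return (int)((i / b) % p); }
static long local_index(long i, long b, int p) {
    long course = i / (b * p);                 /* which round of blocks */
    return course * b + i % b;                 /* position in the owner's local array */
}

int main(void) {
    long n = 20, b = 3;                        /* global size, block size */
    int  p = 4;                                /* number of processors */
    for (long i = 0; i < n; i++)
        printf("global %2ld -> proc %d, local %ld\n",
               i, owner(i, b, p), local_index(i, b, p));
    return 0;
}
```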


Concurrency and Computation: Practice and Experience | 2005

The Open Runtime Platform: a flexible high‐performance managed runtime environment

Michal Cierniak; Marsha Eng; Neal Glew; Brian T. Lewis; James M. Stichnoth

The Open Runtime Platform (ORP) is a high-performance managed runtime environment (MRTE) that features exact generational garbage collection, fast thread synchronization, and multiple coexisting just-in-time compilers (JITs). ORP was designed for flexibility in order to support experiments in dynamic compilation, garbage collection, synchronization, and other technologies. It can be built to run either Java or Common Language Infrastructure (CLI) applications, to run under the Windows or Linux operating systems, and to run on the IA-32 or Itanium processor family (IPF) architectures. Achieving high performance in an MRTE presents many challenges, particularly when flexibility is a major goal. First, to enable the use of different garbage collectors and JITs, each component must be isolated from the rest of the environment through a well-defined software interface. Without careful attention, this isolation could easily harm performance. Second, MRTEs have correctness and safety requirements that traditional languages such as C++ lack. These requirements, including null pointer checks, array bounds checks, and type checks, impose additional runtime overhead. Finally, the dynamic nature of MRTEs makes some traditional compiler optimizations, such as devirtualization of method calls, more difficult to implement or more limited in applicability. To get full performance, JITs and the core virtual machine (VM) must cooperate to reduce or eliminate (where possible) these MRTE-specific overheads. In this paper, we describe the structure of ORP in detail, paying particular attention to how it supports flexibility while preserving high performance. We describe the interfaces between the garbage collector, the JIT, and the core VM; how these interfaces enable multiple garbage collectors and JITs without sacrificing performance; and how they allow the JIT and the core VM to reduce or eliminate MRTE-specific performance issues.
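
As a rough sketch of the component-isolation idea (this is not ORP's actual garbage collection interface, and the names are invented), a collector can be hidden behind a table of function pointers so the core VM never depends on a particular implementation:

```c
#include <stddef.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical sketch: the core VM sees only a table of function
 * pointers, so different collectors can be plugged in without changing
 * any VM code. */

typedef struct GCInterface {
    void *(*alloc)(size_t size);       /* allocate an object */
    void  (*collect)(void);            /* force a collection  */
    const char *name;
} GCInterface;

/* One trivial "collector" that just forwards to malloc. */
static void *malloc_alloc(size_t size) { return malloc(size); }
static void  malloc_collect(void)      { /* nothing to do */ }

static const GCInterface malloc_gc = { malloc_alloc, malloc_collect, "malloc-gc" };

/* The VM is written entirely against the interface. */
static void vm_run(const GCInterface *gc) {
    void *obj = gc->alloc(64);
    printf("allocated 64 bytes with %s at %p\n", gc->name, obj);
    gc->collect();
    free(obj);                         /* only valid for this toy collector */
}

int main(void) {
    vm_run(&malloc_gc);
    return 0;
}
```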


international conference on supercomputing | 1995

Decoupling synchronization and data transfer in message passing systems of parallel computers

Thomas M. Stricker; James M. Stichnoth; David R. O'Hallaron; Susan Hinrichs; Thomas R. Gross

Synchronization is an important issue for the design of a scalable parallel computer, and some systems include special hardware support for control messages or barriers. The cost of synchronization has a high impact on the design of the message passing (communication) services. In this paper, we investigate three different communication libraries that are tailored toward the synchronization services available: (1) a version of generic send-receive message passing (PVM), which relies on traditional flow control and buffering to synchronize the data transfers; (2) message passing with pulling, i.e. a message is transferred only when the recipient is ready and requests it (as, e.g., used in NX for large messages); and (3) the decoupled direct deposit message passing that uses separate, global synchronization to ensure that nodes send messages only when the message data can be deposited directly into the final destination in the memory of the remote recipient. Measurements of these three styles on a Cray T3D demonstrate the benefits of the decoupled message passing with direct deposit. The performance advantage of this style is made possible by (1) preemptive synchronization to avoid unnecessary copies of the data, (2) high-speed barrier synchronization, and (3) improved congestion control in the network. The designers of the communication system of future parallel computers are therefore strongly encouraged to provide good synchronization facilities in addition to high throughput data transfers to support high performance message passing.
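
The sketch below imitates the direct-deposit idea in shared memory with pthreads (the paper's target is a Cray T3D with hardware barriers, so this is only an analogy): a barrier establishes that the destination buffer is ready, after which the sender writes data straight into its final location.

```c
#include <pthread.h>
#include <stdio.h>

/* Shared-memory analogy of "direct deposit" messaging: a global barrier
 * first guarantees that the receiver's destination buffer is ready, so
 * the sender can write the data straight into its final location, with
 * no intermediate buffering or per-message flow control. */

#define N 8
static double dest[N];                 /* receiver's final buffer */
static pthread_barrier_t ready;        /* "buffer is ready" synchronization */

static void *sender(void *arg) {
    (void)arg;
    pthread_barrier_wait(&ready);      /* wait until the deposit is safe */
    for (int i = 0; i < N; i++)        /* deposit directly, no copy */
        dest[i] = i * 1.5;
    return NULL;
}

int main(void) {
    pthread_barrier_init(&ready, NULL, 2);
    pthread_t t;
    pthread_create(&t, NULL, sender, NULL);

    for (int i = 0; i < N; i++)        /* receiver prepares its buffer ... */
        dest[i] = 0.0;
    pthread_barrier_wait(&ready);      /* ... then signals readiness */

    pthread_join(t, NULL);             /* join stands in for the completion barrier */
    printf("dest[N-1] = %f\n", dest[N - 1]);
    return 0;
}
```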


Symposium on Code Generation and Optimization | 2004

Improving 64-bit Java IPF performance by compressing heap references

Ali-Reza Adl-Tabatabai; Jay Bharadwaj; Michal Cierniak; Marsha Eng; Jesse Fang; Brian T. Lewis; Brian R. Murphy; James M. Stichnoth

64-bit processor architectures like the Intel® Itanium® processor family are designed for large applications that need large memory addresses. When running applications that fit within a 32-bit address space, 64-bit CPUs are at a disadvantage compared to 32-bit CPUs because of the larger memory footprints for their data. This results in worse cache and TLB utilization, and consequently lower performance because of increased miss ratios. This paper considers software techniques for virtual machines that allow 32-bit pointers to be used on 64-bit CPUs for managed runtime applications that do not need the full 64-bit address space. We describe our pointer compression techniques and discuss our experience implementing these for Java applications. In addition, we give performance results with our techniques for both the SPEC JVM98 and SPEC JBB2000 benchmarks. We demonstrate a 12% performance improvement on SPEC JBB2000 and a reduction in the number of garbage collections required for a given heap size.
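
A minimal sketch of the compression idea, assuming all objects fit in a region of at most 4 GB (the helper names are hypothetical, not the paper's implementation): reference fields store 32-bit offsets from the heap base and are widened back to full pointers on use.

```c
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* Simplified sketch of heap-reference compression: if all objects live
 * within one 4 GB region, a 64-bit pointer can be stored in the heap as
 * a 32-bit offset from the region base, halving the footprint of
 * reference fields. */

static char     *heap_base;            /* base of the managed heap region */
typedef uint32_t compressed_ref;       /* 32-bit reference field */

static compressed_ref compress(void *p) {
    return (compressed_ref)((char *)p - heap_base);
}
static void *decompress(compressed_ref r) {
    return heap_base + r;
}

int main(void) {
    heap_base = malloc(1 << 20);       /* stand-in for the VM's heap */
    void *obj = heap_base + 4096;      /* an "object" inside the heap */

    compressed_ref field = compress(obj);          /* stored in 4 bytes */
    printf("offset = %u, round trip ok = %d\n",
           field, decompress(field) == obj);

    free(heap_base);
    return 0;
}
```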


Conference on High Performance Computing (Supercomputing) | 1996

Runtime Performance of Parallel Array Assignment: An Empirical Study

Lei Wang; James M. Stichnoth; Siddhartha Chatterjee

Generating code for the array assignment statement of High Performance Fortran (HPF) in the presence of block-cyclic distributions of data arrays is considered difficult, and several algorithms have been published to solve this problem. We present a comprehensive study of the run-time performance of the code these algorithms generate. We classify these algorithms into several families, identify several issues of interest in the generated code, and present experimental performance data for the various algorithms. We demonstrate that the code generated for block-cyclic distributions runs almost as efficiently as that generated for block or cyclic distributions.
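
For orientation, the sketch below shows the local iteration sets a processor owns under the three distributions (illustrative constants, not code from the study); block and cyclic reduce to simple loops, while block-cyclic needs a nested loop over blocks, which is why its generated code is harder to make efficient.

```c
#include <stdio.h>

/* Illustrative local loops for processor "me" out of P, covering the
 * elements it owns of A(0:N-1) under the three HPF distributions. */

#define N 24
#define P 4
#define B 3                                    /* block size for block-cyclic */

static void local_iterations(int me) {
    /* BLOCK: one contiguous chunk (assumes N divisible by P) */
    int chunk = N / P;
    for (int i = me * chunk; i < (me + 1) * chunk; i++)
        printf("block        proc %d owns %d\n", me, i);

    /* CYCLIC: a single stride-P loop */
    for (int i = me; i < N; i += P)
        printf("cyclic       proc %d owns %d\n", me, i);

    /* BLOCK-CYCLIC(B): loop over owned blocks, then within each block */
    for (int start = me * B; start < N; start += P * B)
        for (int i = start; i < start + B && i < N; i++)
            printf("block-cyclic proc %d owns %d\n", me, i);
}

int main(void) {
    local_iterations(1);                       /* show processor 1's iterations */
    return 0;
}
```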


Proceedings of the 2002 Joint ACM-ISCOPE Conference on Java Grande | 2002

Open runtime platform: flexibility with performance using interfaces

Michal Cierniak; Brian T. Lewis; James M. Stichnoth

According to conventional wisdom, interfaces provide flexibility at the cost of performance. Most high-performance Java virtual machines today tightly integrate their core virtual machines with their just-in-time compilers and garbage collectors to get the best performance. The Open Runtime Platform (ORP) is unusual in that it reconciles high performance with the extensive use of well-defined interfaces between its components. ORP was developed to support experiments in dynamic compilation, garbage collection, synchronization, and other technologies. To achieve this, two key interfaces were designed: one for garbage collection and another for just-in-time compilation. This paper describes some interesting features of these interfaces and discusses lessons learned in their use. One lesson we learned was to selectively expose small but frequently accessed data structures in our interfaces; this improves performance while minimizing the number of interface crossings.
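
A hypothetical sketch of that lesson (not ORP's real interface): the collector exposes a small per-thread allocation buffer so the common allocation path is a few inlined instructions, and only the rare refill crosses the interface.

```c
#include <stddef.h>
#include <stdio.h>
#include <stdlib.h>

/* Sketch of "expose small, frequently accessed data structures": rather
 * than crossing the GC interface on every allocation, the collector
 * publishes a per-thread bump-pointer buffer; only the slow refill path
 * is an interface call. */

typedef struct AllocBuffer {           /* exposed in the interface header */
    char *free;                        /* next free byte    */
    char *limit;                       /* end of the buffer */
} AllocBuffer;

void *gc_alloc_slow(AllocBuffer *ab, size_t size);    /* interface call */

static inline void *alloc_fast(AllocBuffer *ab, size_t size) {
    if (ab->free != NULL && ab->free + size <= ab->limit) {
        void *p = ab->free;            /* inlined fast path */
        ab->free += size;
        return p;
    }
    return gc_alloc_slow(ab, size);    /* rare interface crossing */
}

/* Toy slow path: grab a fresh buffer from malloc (never freed here). */
void *gc_alloc_slow(AllocBuffer *ab, size_t size) {
    size_t cap = size > 4096 ? size : 4096;
    ab->free   = malloc(cap);
    ab->limit  = ab->free + cap;
    void *p = ab->free;
    ab->free += size;
    return p;
}

int main(void) {
    AllocBuffer ab = { NULL, NULL };
    void *a = alloc_fast(&ab, 64);
    void *b = alloc_fast(&ab, 64);
    printf("%p %p\n", a, b);
    return 0;
}
```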


Languages and Compilers for Parallel Computing | 2007

Pillar: A Parallel Implementation Language

Todd A. Anderson; Neal Glew; Peng Guo; Brian T. Lewis; Wei Liu; Zhanglin Liu; Leaf Petersen; Mohan Rajagopalan; James M. Stichnoth; Gansha Wu; Dan Zhang

As parallelism in microprocessors becomes mainstream, new programming languages and environments are emerging to meet the challenges of parallel programming. To support research on these languages, we are developing a low-level language infrastructure called Pillar (derived from Parallel Implementation Language). Although Pillar programs are intended to be automatically generated from source programs in each parallel language, Pillar programs can also be written by expert programmers. The language is defined as a small set of extensions to C. As a result, Pillar is familiar to C programmers, but more importantly, it is practical to reuse an existing optimizing compiler like gcc [1] or Open64 [2] to implement a Pillar compiler. Pillar's concurrency features include constructs for threading, synchronization, and explicit data-parallel operations. The threading constructs focus on creating new threads only when hardware resources are idle, and otherwise executing parallel work within existing threads, thus minimizing thread creation overhead. In addition to the usual synchronization constructs, Pillar includes transactional memory. Its sequential features include stack walking, second-class continuations, support for precise garbage collection, tail calls, and seamless integration of Pillar and legacy code. This paper describes the design and implementation of the Pillar software stack, including the language, compiler, runtime, and high-level converters (that translate high-level language programs into Pillar programs). It also reports on early experience with three high-level languages that target Pillar.
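
The sketch below is plain C with pthreads, not Pillar syntax, and only illustrates the stated threading policy: work is handed to a new thread while hardware contexts appear idle, and is otherwise executed inline in the caller.

```c
#include <pthread.h>
#include <stdio.h>

/* Hypothetical sketch of "create threads only when hardware is idle":
 * spawn() starts a new thread while fewer than MAX_WORKERS are running,
 * and otherwise runs the work inline, avoiding thread-creation overhead. */

#define MAX_WORKERS 4                  /* stand-in for idle hardware contexts */

typedef void (*work_fn)(int);
struct job { work_fn fn; int arg; };

static int             active = 0;
static pthread_mutex_t lock   = PTHREAD_MUTEX_INITIALIZER;

static void *worker(void *p) {
    struct job *j = p;
    j->fn(j->arg);
    pthread_mutex_lock(&lock);
    active--;                          /* this context is idle again */
    pthread_mutex_unlock(&lock);
    return NULL;
}

/* Returns 1 and fills *out if a thread was created, 0 if run inline. */
static int spawn(struct job *j, pthread_t *out) {
    pthread_mutex_lock(&lock);
    int idle = active < MAX_WORKERS;
    if (idle) active++;
    pthread_mutex_unlock(&lock);

    if (idle) {
        pthread_create(out, NULL, worker, j);
        return 1;
    }
    j->fn(j->arg);                     /* no idle resources: execute inline */
    return 0;
}

static void hello(int i) { printf("work item %d\n", i); }

int main(void) {
    struct job jobs[8];
    pthread_t  threads[8];
    int        spawned[8];
    for (int i = 0; i < 8; i++) {
        jobs[i] = (struct job){ hello, i };
        spawned[i] = spawn(&jobs[i], &threads[i]);
    }
    for (int i = 0; i < 8; i++)
        if (spawned[i]) pthread_join(threads[i], NULL);
    return 0;
}
```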


Languages and Compilers for Parallel Computing | 1993

Do&Merge: Integrating Parallel Loops and Reductions

Bwolen Yang; Jon A. Webb; James M. Stichnoth; David R. O'Hallaron; Thomas R. Gross

Many computations perform operations that match this pattern: first, a loop iterates over an input array, producing an array of (partial) results. The loop iterations are independent of each other and can be done in parallel. Second, a reduction operation combines the elements of the partial result array to produce the single final result. We call these two steps a Do&Merge computation. The most common way to effectively parallelize such a computation is for the programmer to apply a DOALL operation across the input array, and then to apply a reduction operator to the partial results. We show that combining the Do phase and the Merge phase into a single Do&Merge computation can lead to improved execution time and memory usage. In this paper we describe a simple and efficient construct (called the Pdo loop) that is included in an experimental HPF-like compiler for private-memory parallel systems.
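
A minimal sketch of the Do&Merge idea using an OpenMP reduction (this is not the Pdo loop construct, which is an HPF-style extension): each thread folds its iterations into a private partial value instead of materializing a partial-result array, and the per-thread values are merged once at the end.

```c
#include <stdio.h>

/* Illustrative sketch: the Do phase and the per-thread part of the Merge
 * phase are fused, so no array of partial results is ever materialized;
 * only one value per thread is combined at the end. */

#define N 1000000

int main(void) {
    double merged = 0.0;

    #pragma omp parallel for reduction(+:merged)
    for (int i = 0; i < N; i++) {
        double partial = i * 0.001;    /* the independent "Do" work */
        merged += partial;             /* folded immediately into the private copy */
    }

    printf("result = %f\n", merged);
    return 0;
}
```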

Collaboration


Dive into James M. Stichnoth's collaborations.

Top Co-Authors

Jaspal Subhlok

Carnegie Mellon University


Susan Hinrichs

Carnegie Mellon University
