Publication


Featured research published by Jimmy Su.


Conference on High Performance Computing (Supercomputing) | 2005

Making Sequential Consistency Practical in Titanium

Amir Kamil; Jimmy Su; Katherine A. Yelick

The memory consistency model in shared memory parallel programming controls the order in which memory operations performed by one thread may be observed by another. The most natural model for programmers is to have memory accesses appear to take effect in the order specified in the original program. Language designers have been reluctant to use this strong semantics, called sequential consistency, due to concerns over the performance of memory fence instructions and related mechanisms that guarantee order. In this paper, we provide evidence for the practicality of sequential consistency by showing that advanced compiler analysis techniques are sufficient to eliminate the need for most memory fences and enable high-level optimizations. Our analyses eliminated over 97% of the memory fences that were needed by a naïve implementation, accounting for 87 to 100% of the dynamically encountered fences in all but one benchmark. The impact of the memory model and analysis on runtime performance depends on the quality of the optimizations: more aggressive optimizations are likely to be invalidated by a strong memory consistency semantics. We consider two specific optimizations, pipelining of bulk memory copies and communication aggregation and scheduling for irregular accesses, and show that our most aggressive analysis is able to obtain the same performance as the relaxed model when applied to two linear algebra kernels. While additional work on parallel optimizations and analyses is needed, we believe these results provide important evidence on the viability of using a simple memory consistency model without sacrificing performance.
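The core idea, that fences are only needed where two threads can actually conflict on the same location, can be illustrated with a toy model. The sketch below is a hypothetical simplification, not the paper's analysis (which runs at compile time over program text rather than over a runtime access log): it keeps fences only for accesses to variables touched by more than one thread with at least one write.

```python
# Toy model of conflict-based fence elimination. Hypothetical simplification:
# the real analysis is a static compiler analysis, not a runtime log scan.
from collections import defaultdict

def fences_needed(accesses):
    """accesses: list of (thread_id, var, op) with op in {'read', 'write'}.
    A naive implementation fences every shared access; the analysis keeps
    fences only for accesses to variables with a cross-thread conflict
    (touched by more than one thread, with at least one write)."""
    threads = defaultdict(set)
    writers = defaultdict(set)
    for t, v, op in accesses:
        threads[v].add(t)
        if op == 'write':
            writers[v].add(t)
    conflicting = {v for v in threads if len(threads[v]) > 1 and writers[v]}
    naive = len(accesses)
    kept = sum(1 for _, v, _ in accesses if v in conflicting)
    return naive, kept

naive, kept = fences_needed([
    (0, 'x', 'write'), (1, 'x', 'read'),      # real conflict: fences stay
    (0, 'tmp', 'write'), (0, 'tmp', 'read'),  # thread-local: fences removed
])
print(f"eliminated {100 * (naive - kept) // naive}% of fences")  # prints 50%
```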


International Parallel and Distributed Processing Symposium | 2005

Automatic support for irregular computations in a high-level language

Jimmy Su; Katherine A. Yelick

The problem of writing high performance parallel applications becomes even more challenging when irregular, sparse or adaptive methods are employed. In this paper we introduce compiler and runtime support for programs with indirect array accesses into Titanium, a high-level language that combines an explicit SPMD parallelism model with implicit communication through a global shared address space. By combining the well-known inspector-executor technique with high level multi-dimensional array constructs, compiler analysis and performance modeling, we demonstrate optimizations that are entirely hidden from the programmer. The global address space makes the programs easier to write than in message passing, with remote array accesses used in place of explicit messages with data packing and unpacking. The programs are also faster than message passing programs: using sparse matrix-vector multiplication programs, we show that the Titanium code is an average of 21% faster across several matrices and machines, with a best-case speedup of more than 2x. The performance advantages are due to both the lightweight RDMA (remote direct memory access) communication model that underlies the Titanium implementation and automatic optimization selection that adapts the communication to the machine and workload, in some cases using different communication models for different processors within a single computation.
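The inspector-executor technique mentioned above can be sketched in a few lines (illustrative Python with hypothetical names; the real implementation is compiler-generated Titanium/GASNet code): the inspector scans the runtime index array once, and the executor fetches the needed remote elements in a single aggregated transfer before running the loop against the local copy.

```python
# Minimal inspector-executor sketch (illustrative; names are hypothetical).

def inspector(index_array):
    """Collect the distinct remote indices the loop will access."""
    return sorted(set(index_array))

def executor(remote_fetch, index_array, needed):
    """Bulk-fetch the needed elements, then run the irregular loop locally."""
    local = {i: remote_fetch(i) for i in needed}  # one aggregated transfer
    return [local[i] for i in index_array]

remote = [10, 20, 30, 40]   # stands in for an array on another processor
idx = [3, 0, 3, 1]          # irregular access pattern, known only at runtime
needed = inspector(idx)     # -> [0, 1, 3]: fetch each element once
result = executor(lambda i: remote[i], idx, needed)
print(result)               # prints [40, 10, 40, 20]
```

The point of the split is that the inspector amortizes communication: duplicate indices are fetched once, and the transfer can be packed into one message instead of one round trip per access.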


IEEE International Conference on High Performance Computing, Data and Analytics | 2007

Parallel Languages and Compilers: Perspective From the Titanium Experience

Katherine A. Yelick; Paul N. Hilfinger; Susan L. Graham; Dan Bonachea; Jimmy Su; Amir Kamil; Kaushik Datta; Phillip Colella; Tong Wen

We describe the rationale behind the design of key features of Titanium—an explicitly parallel dialect of Java for high-performance scientific programming—and our experiences in building applications with the language. Specifically, we address Titanium's partitioned global address space model, single program multiple data parallelism support, multi-dimensional arrays and array-index calculus, memory management, immutable classes (class-like types that are value types rather than reference types), operator overloading, and generic programming. We provide an overview of the Titanium compiler implementation, covering various parallel analyses and optimizations, Titanium runtime technology and the GASNet network communication layer. We summarize results and lessons learned from implementing the NAS parallel benchmarks, elliptic and hyperbolic solvers using adaptive mesh refinement, and several applications of the immersed boundary method.


Archive | 2006

Titanium Language Reference Manual, Version 2.20

Dan Bonachea; Paul N. Hilfinger; Kaushik Datta; Susan L. Graham; Amir Kamil; Ben Liblit; Geoff Pike; Jimmy Su; Katherine A. Yelick

The Titanium language is a Java dialect for high-performance parallel scientific computing. Titanium’s differences from Java include multi-dimensional arrays, an explicitly parallel SPMD model of computation with a global address space, a form of value class, and zone-based memory management. This reference manual describes the differences between Titanium and Java.


Conference on High Performance Computing (Supercomputing) | 2007

An adaptive mesh refinement benchmark for modern parallel programming languages

Tong Wen; Jimmy Su; Phillip Colella; Katherine A. Yelick; Noel Keen

We present an Adaptive Mesh Refinement benchmark for evaluating programmability and performance of modern parallel programming languages. Benchmarks employed today by language development teams, originally designed for performance evaluation of computer architectures, do not fully capture the complexity of state-of-the-art computational software systems running on today's parallel machines, or on emerging ones ranging from multicores to peta-scale High Productivity Computing Systems. This benchmark, extracted from a real application framework, presents challenges for a programming language in both expressiveness and performance. It consists of an infrastructure for finite difference calculations on block-structured adaptive meshes and a solver for elliptic Partial Differential Equations built on this infrastructure. Adaptive Mesh Refinement algorithms are challenging to implement due to the irregularity introduced by local mesh refinement. We describe the challenges posed by this benchmark through two reference implementations (C++/Fortran/MPI and Titanium) and in the context of three programming models.
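The irregularity that makes AMR hard to express can be seen even in a one-dimensional tag-and-refine step (a toy sketch under strong simplifications; the benchmark itself uses multi-level, block-structured 3D meshes): cells whose local jump exceeds an error threshold are split into finer cells, so the shape of the refined mesh is data-dependent and cannot be laid out statically.

```python
# Toy 1D tag-and-refine step, illustrating the data-dependent irregularity
# of local mesh refinement. Hypothetical simplification of block-structured
# AMR: real codes refine whole patches on multiple levels, not single cells.

def refine(values, threshold, ratio=2):
    """Tag cells whose jump to the next cell exceeds `threshold` and split
    each tagged cell into `ratio` finer cells (linear interpolation);
    untagged cells pass through unchanged."""
    out = []
    for i, v in enumerate(values):
        nb = values[i + 1] if i + 1 < len(values) else v
        if abs(nb - v) > threshold:          # error indicator: local jump
            step = (nb - v) / ratio
            out.extend(v + k * step for k in range(ratio))
        else:
            out.append(v)
    return out

coarse = [0.0, 0.0, 4.0, 4.0]        # sharp jump between cells 1 and 2
print(refine(coarse, threshold=1.0))  # prints [0.0, 0.0, 2.0, 4.0, 4.0]
```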


International Parallel and Distributed Processing Symposium | 2004

Array prefetching for irregular array accesses in Titanium

Jimmy Su; Katherine A. Yelick

Compiling irregular applications, such as sparse matrix-vector multiply and particle/mesh methods, in an SPMD parallel language is a challenging problem. These applications contain irregular array accesses, for which the array access pattern is not known until runtime. Numerous research projects have approached this problem under the inspector-executor paradigm in the last 15 years. The value added by the work described in this paper is in using performance modeling to choose the best data communication method in the inspector-executor model. We explore our ideas in a compiler for Titanium, a dialect of Java designed for high performance computing. For a sparse matrix-vector multiply benchmark, experimental results show that the optimized Titanium code has comparable performance to C code with MPI using the Aztec library.
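The performance-modeling idea, choosing among candidate communication methods from estimated costs, might look like the following sketch. The method names and constants here are made up for illustration; the paper's actual models are calibrated per machine and workload.

```python
# Sketch of model-driven communication selection (hypothetical constants).
# alpha = per-message latency, beta = per-element bandwidth cost,
# gamma = per-element packing overhead.

def pick_method(n_elems, density, alpha=10.0, beta=0.01, gamma=0.02):
    """Estimate the cost of each candidate method and pick the cheapest.
    density = fraction of the bounding box that is actually needed."""
    costs = {
        'fine-grained': n_elems * (alpha + beta),            # one msg each
        'pack':         alpha + n_elems * (beta + gamma),    # pack + 1 msg
        'bounding-box': alpha + (n_elems / density) * beta,  # fetch it all
    }
    return min(costs, key=costs.get)

print(pick_method(n_elems=1000, density=0.9))   # dense: bounding box wins
print(pick_method(n_elems=1000, density=0.05))  # sparse: packing wins
print(pick_method(n_elems=1, density=0.5))      # tiny: fine-grained wins
```

Even this toy version shows why no single method is best: the winner flips with the access density and transfer size, which is why the selection must happen at runtime.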


Languages and Compilers for Parallel Computing | 2007

Automatic Communication Performance Debugging in PGAS Languages

Jimmy Su; Katherine A. Yelick

Recent studies have shown that programming in a Partitioned Global Address Space (PGAS) language can be more productive than programming in a message passing model. One reason for this is the ability to access remote memory implicitly through shared memory reads and writes. But this benefit does not come without a cost. It is very difficult to spot communication by looking at the program text, since remote reads and writes look exactly the same as local reads and writes. This makes manual communication performance debugging an arduous task. In this paper, we describe a tool called ti-trend-prof that performs automatic performance debugging using only program traces from small processor configurations and small input sizes in Titanium [13], a PGAS language. ti-trend-prof presents trends to the programmer to help spot possible communication performance bugs even for processor configurations and input sizes that have not been run. We used ti-trend-prof on two of the largest Titanium applications and found, in under an hour, bugs that would otherwise have taken days to locate.
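The trend idea can be sketched as a log-log fit over traces from small runs (a hypothetical reconstruction of the approach, not ti-trend-prof's actual code): fit count ≈ c·n^k by least squares in log space, then flag counters whose exponent k indicates superlinear growth in communication.

```python
# Sketch of trend-based performance debugging: fit count ~ c * n^k from
# small-scale traces and extrapolate. Trace data below is made up.
import math

def fit_exponent(samples):
    """samples: list of (scale, count). Returns exponent k of count ~ c*n^k,
    via least-squares on the log-log points."""
    xs = [math.log(n) for n, _ in samples]
    ys = [math.log(c) for _, c in samples]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    return (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
            / sum((x - mx) ** 2 for x in xs))

# message counts observed at 2, 4, and 8 processors (fabricated example):
# counts quadruple when the processor count doubles, so k = 2
msgs = [(2, 40), (4, 160), (8, 640)]
k = fit_exponent(msgs)
if k > 1.5:
    print(f"possible communication bug: messages grow like p^{k:.1f}")
```

The payoff is that a quadratic-in-processors message count is visible from runs on 2 to 8 processors, long before anyone pays for a 1000-processor job that mysteriously crawls.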


IEEE Transactions on Parallel and Distributed Systems | 2012

Optimization of Parallel Particle-to-Grid Interpolation on Leading Multicore Platforms

Kamesh Madduri; Jimmy Su; Samuel Williams; Leonid Oliker; Stephane Ethier; Katherine A. Yelick

We are now in the multicore revolution, which is witnessing a rapid evolution of architectural designs due to power constraints and correspondingly limited microprocessor clock speeds. Understanding how to efficiently utilize these systems in the context of demanding numerical algorithms is an urgent challenge to meet the ever growing computational needs of high-end computing. In this work, we examine multicore parallel optimization of the particle-to-grid interpolation step in particle-mesh methods, an inherently complex optimization problem due to its low computational intensity, irregular data accesses, and potential fine-grained data hazards. Our evaluated kernels are derived from two important numerical computations: a biological simulation of the heart using the Immersed Boundary (IB) method, and a Gyrokinetic Particle-in-Cell (PIC)-based application for studying fusion plasma microturbulence. We develop several novel synchronization and grid decomposition schemes, as well as low-level optimization techniques, to maximize performance on three modern multicore platforms: Intel's Xeon X5550 (Nehalem), AMD's Opteron 2356 (Barcelona), and Sun's UltraSPARC T2+ (Niagara). Results show that our optimizations lead to significant performance improvements, achieving up to a 5.6× speedup compared to the reference parallel implementation. Our work also provides valuable insight into the design of future autotuning frameworks for particle-to-grid interpolation on next-generation systems.
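The central data hazard, multiple particles depositing into the same grid cells, and one lock-free way around it, an owner-computes grid decomposition, can be sketched in one dimension. This is illustrative only: the paper evaluates several synchronization and decomposition schemes on real PIC and IB kernels in 2D/3D.

```python
# Toy 1D particle-to-grid charge deposit. The hazard: two particles near the
# same cell update the same grid entries. The owner-computes scheme sketched
# here partitions the grid into slabs; each "thread" deposits only into
# cells it owns, so no locks or atomics are needed (hypothetical sketch).

def deposit(particles, n_cells, n_threads=2):
    grid = [0.0] * n_cells
    slab = n_cells / n_threads
    for t in range(n_threads):            # each iteration models one thread
        lo, hi = t * slab, (t + 1) * slab
        for x, charge in particles:       # every thread scans all particles
            i = int(x)                    # nearest lower grid point
            w = x - i                     # linear interpolation weight
            if lo <= i < hi:              # deposit only into owned cells
                grid[i] += (1 - w) * charge
            if lo <= i + 1 < hi:
                grid[i + 1] += w * charge
    return grid

# two coincident particles scatter onto cells 1 and 2 without conflicts
print(deposit([(1.5, 1.0), (1.5, 1.0)], n_cells=4))  # [0.0, 1.0, 1.0, 0.0]
```

The trade-off this sketch makes visible: correctness comes free of synchronization, but every thread reads every particle, so the scheme pays redundant streaming in exchange for hazard-free writes.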


Archive | 2006

Compiler and Runtime Support for Scaling Adaptive Mesh Refinement Computations in Titanium

Jimmy Su; Tong Wen; Katherine A. Yelick


Archive | 2010

Optimizing irregular data accesses for cluster and multicore architectures

Katherine A. Yelick; Jimmy Su

Collaboration


Top co-authors of Jimmy Su:

- Katherine A. Yelick (Lawrence Berkeley National Laboratory)
- Amir Kamil (University of California)
- Dan Bonachea (University of California)
- Tong Wen (Lawrence Berkeley National Laboratory)
- Kaushik Datta (University of California)
- Phillip Colella (Lawrence Berkeley National Laboratory)
- Ben Liblit (University of Wisconsin-Madison)
- Costin Iancu (Lawrence Berkeley National Laboratory)