Publication


Featured research published by Jun Shirako.


principles and practice of programming in java | 2011

Habanero-Java: the new adventures of old X10

Vincent Cavé; Jisheng Zhao; Jun Shirako; Vivek Sarkar

In this paper, we present the Habanero-Java (HJ) language developed at Rice University as an extension to the original Java-based definition of the X10 language. HJ includes a powerful set of task-parallel programming constructs that can be added as simple extensions to standard Java programs to take advantage of today's multi-core and heterogeneous architectures. The language puts a particular emphasis on the usability and safety of parallel constructs. For example, no HJ program using async, finish, isolated, and phaser constructs can create a logical deadlock cycle. In addition, the future and data-driven task variants of the async construct facilitate a functional approach to parallel programming. Finally, any HJ program written with async, finish, and phaser constructs that is data-race free is guaranteed to also be deterministic. HJ also features two key enhancements that address well-known limitations in the use of Java in scientific computing --- the inclusion of complex numbers as a primitive data type, and the inclusion of array-views that support multidimensional views of one-dimensional arrays. The HJ compiler generates standard Java class-files that can run on any JVM for Java 5 or higher. The HJ runtime is responsible for orchestrating the creation, execution, and termination of HJ tasks, and features both work-sharing and work-stealing schedulers. HJ is used at Rice University as an introductory parallel programming language for second-year undergraduate students. A wide variety of benchmarks have been ported to HJ, including a full application that was originally written in Fortran 90. HJ has a rich development and runtime environment that includes integration with DrJava, the addition of a data race detection tool, and service as a target platform for the Intel Concurrent Collections coordination language.
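
For readers unfamiliar with HJ, the finish/async pattern mentioned above can be approximated in plain Java. The sketch below is not HJ syntax and not the HJ API; it only illustrates the semantics that a finish scope waits for every task spawned inside it, using a JDK Phaser, and the class and method names are illustrative.

    import java.util.concurrent.Phaser;

    public class FinishAsyncSketch {
        // One registered "party" stands for the enclosing finish scope itself.
        private final Phaser pending = new Phaser(1);

        // Spawn a child task; the enclosing finish will wait for it.
        void async(Runnable body) {
            pending.register();
            new Thread(() -> {
                try {
                    body.run();
                } finally {
                    pending.arriveAndDeregister(); // signal completion of this task
                }
            }).start();
        }

        // Run the body, then block until every async spawned inside has completed.
        void finish(Runnable body) {
            body.run();
            pending.arriveAndAwaitAdvance();
        }

        public static void main(String[] args) {
            FinishAsyncSketch hj = new FinishAsyncSketch();
            hj.finish(() -> {
                for (int i = 0; i < 4; i++) {
                    final int id = i;
                    hj.async(() -> System.out.println("task " + id));
                }
            });
            System.out.println("after finish"); // printed only after all four tasks complete
        }
    }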


international conference on supercomputing | 2008

Phasers: a unified deadlock-free construct for collective and point-to-point synchronization

Jun Shirako; David M. Peixotto; Vivek Sarkar; William N. Scherer

Coordination and synchronization of parallel tasks is a major source of complexity in parallel programming. These constructs take many forms in practice including mutual exclusion in accesses to shared resources, termination detection of child tasks, collective barrier synchronization, and point-to-point synchronization. In this paper, we introduce phasers, a new coordination construct that unifies collective and point-to-point synchronizations. We establish two safety properties for phasers: deadlock-freedom and phase-ordering. Performance results obtained from a portable implementation of phasers on three different SMP platforms demonstrate that phasers can deliver superior performance to existing barrier implementations, in addition to the productivity benefits that result from their generality and safety properties.
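
As a concrete illustration of the collective (barrier) side of such synchronization, the JDK class java.util.concurrent.Phaser exposes a similar phase-based model. The following plain-Java sketch (not the HJ phaser construct itself) shows several tasks repeatedly meeting at a phase boundary before moving on together.

    import java.util.concurrent.Phaser;

    public class PhasedBarrierSketch {
        public static void main(String[] args) {
            final int numTasks = 4;
            Phaser barrier = new Phaser(numTasks); // each task is one registered party

            for (int t = 0; t < numTasks; t++) {
                final int id = t;
                new Thread(() -> {
                    for (int phase = 0; phase < 3; phase++) {
                        System.out.println("task " + id + " finished work for phase " + phase);
                        // Collective synchronization: block until all tasks reach this point,
                        // then every task advances to the next phase together.
                        barrier.arriveAndAwaitAdvance();
                    }
                }).start();
            }
        }
    }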


international solid-state circuits conference | 2008

An 8640 MIPS SoC with Independent Power-Off Control of 8 CPUs and 8 RAMs by An Automatic Parallelizing Compiler

Masayuki Ito; Toshihiro Hattori; Yutaka Yoshida; Kiyoshi Hayase; Tomoichi Hayashi; Osamu Nishii; Yoshihiko Yasu; Atsushi Hasegawa; Masashi Takada; Hiroyuki Mizuno; Kunio Uchiyama; Toshihiko Odaka; Jun Shirako; Masayoshi Mase; Keiji Kimura; Hironori Kasahara

Power-efficient SoC design for embedded applications requires several independent power domains so that the power of unused blocks can be turned off. An SoC for mobile phones defines 23 hierarchical power domains, but most of the power domains are assigned to peripheral IPs that mainly use low-leakage high-Vt transistors. Since high-performance multiprocessor SoCs use leaky low-Vt transistors for CPU sections, leakage power savings in these CPU sections is a primary objective. We develop an SoC with 8 processor cores and 8 user RAMs (1 per core) targeted for power-efficient high-performance embedded applications. We assign these 16 blocks to separate power domains so that they can be powered off independently. A resume mode is also introduced where the power of the CPU is off and the user RAM is on for fast resume operation. An automatic parallelizing compiler schedules tasks for each CPU core and also performs power management for each CPU core. With the help of this compiler, each processor core can operate at a different frequency or even dynamically stop the clock to maintain processing performance while reducing average operating power consumption. The compiler also executes power-off control of unnecessary CPU cores.


conference on object-oriented programming systems, languages, and applications | 2009

The habanero multicore software research project

Rajkishore Barik; Zoran Budimlic; Vincent Cavé; Sanjay Chatterjee; Yi Guo; David M. Peixotto; Raghavan Raman; Jun Shirako; Sagnak Tasirlar; Yonghong Yan; Yisheng Zhao; Vivek Sarkar

Multiple programming models are emerging to address an increased need for dynamic task parallelism in multicore shared-memory multiprocessors. This poster describes the main components of Rice University's Habanero Multicore Software Research Project, which proposes a new approach to multicore software enablement based on a two-level programming model consisting of a higher-level coordination language for domain experts and a lower-level parallel language for programming experts.


languages and compilers for parallel computing | 2005

Compiler control power saving scheme for multi core processors

Jun Shirako; Naoto Oshiyama; Yasutaka Wada; Hiroaki Shikano; Keiji Kimura; Hironori Kasahara

With the increase in the number of transistors integrated onto a chip, multi-core processor architectures have attracted much attention as a way to achieve high effective performance, shorten development periods, and reduce power consumption. To this end, the compiler for a multi-core processor is expected not only to parallelize programs effectively, but also to carefully control the voltage and clock frequency of processors and storage inside an application program. This paper proposes a compilation scheme for reducing power consumption in a multigrain parallel processing environment that controls the voltage/frequency and power supply of each processor core on a chip. In the evaluation, the OSCAR compiler with the proposed scheme achieves 60.7% energy savings for SPEC CFP95 applu without performance degradation, 45.4% energy savings for SPEC CFP95 tomcatv with a real-time deadline constraint, and 46.5% energy savings for SPEC CFP95 swim with the same deadline constraint, all on 4 processors.
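
As general background (not taken from the paper), the reason voltage/frequency control saves energy is that the dynamic power of CMOS logic grows with the square of the supply voltage:

    % Dynamic (switching) power with activity factor alpha, switched capacitance C,
    % supply voltage V, and clock frequency f:
    \[
      P_{\mathrm{dyn}} \approx \alpha \, C \, V^{2} f
    \]
    % Scaling both V and f to 70% cuts power to 0.7^2 * 0.7 = 0.343 of the original;
    % execution time grows by 1/0.7 ~= 1.43, so energy (power * time) drops to about
    % 0.343 * 1.43 ~= 0.49, i.e. roughly half, which is why lowering voltage/frequency
    % on non-critical cores can save energy while still meeting deadlines.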


international parallel and distributed processing symposium | 2009

Phaser accumulators: A new reduction construct for dynamic parallelism

Jun Shirako; David M. Peixotto; Vivek Sarkar; William N. Scherer

A reduction is a computation in which a common operation, such as a sum, is to be performed across multiple pieces of data, each supplied by a separate task. We introduce phaser accumulators, a new reduction construct that meshes seamlessly with phasers to support dynamic parallelism in a phased (iterative) setting. By separating reduction computations into the parts of sending data, performing the computation itself, and retrieving the result, we enable overlap of communication and computation in a manner analogous to that of split-phase barriers. Additionally, this separation enables exploration of implementation strategies that differ as to when the reduction itself is performed: eagerly when the data is supplied, or lazily when a synchronization point is reached. We implement accumulators as extensions to phasers in the Habanero dialect of the X10 programming language. Performance evaluations of the EPCC Syncbench, Spectral-norm, and CG benchmarks on AMD Opteron, Intel Xeon, and Sun UltraSPARC T2 multicore SMPs show superior performance and scalability over OpenMP reductions (on two platforms) and X10 code (on three platforms) written with atomic blocks, with improvements of up to 2.5× on the Opteron and 14.9× on the UltraSPARC T2 relative to OpenMP and 16.5× on the Opteron, 26.3× on the Xeon and 94.8× on the UltraSPARC T2 relative to X10 atomic blocks. To the best of our knowledge, no prior reduction construct supports the dynamic parallelism and asynchronous capabilities of phaser accumulators.
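
A rough plain-Java sketch of the send/result separation described above is given below. It uses JDK classes (LongAdder for eager combining, Phaser for the phase boundary) rather than the HJ accumulator API, and the "send"/"result" labels in the comments are only illustrative.

    import java.util.concurrent.Phaser;
    import java.util.concurrent.atomic.LongAdder;

    public class AccumulatorSketch {
        public static void main(String[] args) throws InterruptedException {
            final int numTasks = 4;
            LongAdder sum = new LongAdder();        // eager strategy: combine as data arrives
            Phaser barrier = new Phaser(numTasks);  // phase boundary shared by all tasks

            Thread[] workers = new Thread[numTasks];
            for (int t = 0; t < numTasks; t++) {
                final int id = t;
                workers[t] = new Thread(() -> {
                    sum.add(id + 1);                 // "send": contribute this task's value
                    barrier.arriveAndAwaitAdvance(); // end of phase: all contributions are in
                    // "result": after the phase boundary every task sees the full reduction
                    System.out.println("task " + id + " sees sum = " + sum.sum());
                });
                workers[t].start();
            }
            for (Thread w : workers) w.join();
        }
    }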


international conference on supercomputing | 2009

Chunking parallel loops in the presence of synchronization

Jun Shirako; Jisheng M. Zhao; V. Krishna Nandivada; Vivek Sarkar

Modern languages for shared-memory parallelism are moving from a bulk-synchronous Single Program Multiple Data (SPMD) execution model to lightweight Task Parallel execution models for improved productivity. This shift is intended to encourage programmers to express the ideal parallelism in an application at a fine granularity that is natural for the underlying domain, while delegating to the compiler and runtime system the job of extracting coarser-grained useful parallelism for a given target system. A simple and important example of this separation of concerns between ideal and useful parallelism can be found in chunking of parallel loops, where the programmer expresses ideal parallelism by declaring all iterations of a loop to be parallel and the implementation exploits useful parallelism by executing iterations of the loop in sequential chunks. Though chunking of parallel loops has been used as a standard transformation for several years, it poses some interesting challenges when the parallel loop may directly or indirectly (via procedure calls) perform synchronization operations such as barrier, signal or wait statements. In such cases, a straightforward transformation that attempts to execute a chunk of loops in sequence in a single thread may violate the semantics of the original parallel program. In this paper, we address the problem of chunking parallel loops that may contain synchronization operations. We present a transformation framework that uses a combination of transformations from past work (e.g., loop strip-mining, interchange, distribution, unswitching) to obtain an equivalent set of parallel loops that chunk together statements from multiple iterations while preserving the semantics of the original parallel program. These transformations result in reduced synchronization and scheduling overheads, thereby improving performance and scalability. Our experimental results for 11 benchmark programs on an UltraSPARC II multicore processor showed a geometric mean speedup of 0.52x for the unchunked case and 9.59x for automatic chunking using the techniques described in this paper. This wide gap underscores the importance of using these techniques in future compiler and runtime systems for programming models with lightweight parallelism.
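
To make the ideal-versus-useful distinction concrete, the following small Java sketch (illustrative only, not taken from the paper) chunks an "all iterations parallel" loop into sequential chunks; this is the straightforward transformation that is valid when the loop body performs no synchronization.

    import java.util.stream.IntStream;

    public class ChunkingSketch {
        public static void main(String[] args) {
            final int n = 100;        // ideal parallelism: n independent iterations
            final int chunkSize = 16;
            final int numChunks = (n + chunkSize - 1) / chunkSize;

            // Useful parallelism: one task per chunk; iterations inside a chunk run sequentially.
            IntStream.range(0, numChunks).parallel().forEach(c -> {
                int lo = c * chunkSize;
                int hi = Math.min(lo + chunkSize, n);
                for (int i = lo; i < hi; i++) {
                    System.out.println("iteration " + i + " executed by chunk " + c);
                }
            });
        }
    }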


compiler construction | 2012

Analytical bounds for optimal tile size selection

Jun Shirako; Kamal Sharma; Naznin Fauzia; Louis-Noël Pouchet; J. Ramanujam; P. Sadayappan; Vivek Sarkar

In this paper, we introduce a novel approach to guide tile size selection by employing analytical models to limit empirical search within a subspace of the full search space. Two analytical models are used together: 1) an existing conservative model, based on the data footprint of a tile, which ignores intra-tile cache block replacement, and 2) an aggressive new model that assumes optimal cache block replacement within a tile. Experimental results on multiple platforms demonstrate the practical effectiveness of the approach by reducing the search space for the optimal tile size by 1,307× to 11,879× for an Intel Core-2-Quad system; 358× to 1,978× for an Intel Nehalem system; and 45× to 1,142× for an IBM Power7 system. The execution of rectangularly tiled code tuned by a search of the subspace identified by our model achieves speed-ups of up to 1.40× (Intel Core-2 Quad), 1.28× (Nehalem) and 1.19× (Power 7) relative to the best possible square tile sizes on these different processor architectures. We also demonstrate the integration of the analytical bounds with existing search optimization algorithms. Our approach not only reduces the total search time from Nelder-Mead Simplex and Parallel Rank Ordering methods by factors of up to 4.95× and 4.33×, respectively, but also finds better tile sizes that yield higher performance in tuned tiled code.
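
As a worked example of how a data-footprint constraint bounds tile size (an illustration of the general idea only, not the specific models from the paper): for square T x T tiles in a tiled matrix multiplication over double-precision data, requiring the three tiles touched per tile iteration to fit in a cache of capacity C bytes gives

    % Conservative footprint constraint for T x T double-precision tiles of the three arrays:
    \[
      3\,T^{2} \cdot 8 \;\le\; C
      \quad\Longrightarrow\quad
      T \;\le\; \sqrt{C / 24}
    \]
    % For a 32 KiB level-1 cache this caps T at roughly 36, so an empirical search
    % only needs to explore tile sizes below that bound rather than the full space.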


languages and compilers for parallel computing | 2002

Hierarchical parallelism control for multigrain parallel processing

Motoki Obata; Jun Shirako; Hiroki Kaminaga; Kazuhisa Ishizaka; Hironori Kasahara

To improve the effective performance and usability of shared-memory multiprocessor systems, a multigrain compilation scheme, which hierarchically exploits coarse-grain parallelism among loops, subroutines, and basic blocks, conventional loop parallelism, and near-fine-grain parallelism among statements inside a basic block, is important. In order to efficiently use the hierarchical parallelism of each nest level, or layer, in multigrain parallel processing, it is necessary to determine how many processors, or groups of processors, should be assigned to each layer according to the parallelism of that layer. This paper proposes an automatic hierarchical parallelism control scheme that assigns a suitable number of processors to each layer so that the parallelism of each hierarchy can be used efficiently. The performance of the proposed scheme is evaluated on an IBM RS6000 SMP server with 8 processors using 8 programs from SPEC95FP.


languages and compilers for parallel computing | 2009

OSCAR API for real-time low-power multicores and its performance on multicores and SMP servers

Keiji Kimura; Masayoshi Mase; Hiroki Mikami; Takamichi Miyamoto; Jun Shirako; Hironori Kasahara

The OSCAR (Optimally Scheduled Advanced Multiprocessor) API has been designed for real-time embedded low-power multicores to generate parallel programs for various multicores from different vendors by using the OSCAR parallelizing compiler. The OSCAR API has been developed by Waseda University in collaboration with Fujitsu Laboratory, Hitachi, NEC, Panasonic, Renesas Technology, and Toshiba in a METI/NEDO project entitled “Multicore Technology for Realtime Consumer Electronics.” By using the OSCAR API as an interface between the OSCAR compiler and backend compilers, the OSCAR compiler enables hierarchical multigrain parallel processing with memory optimization under capacity restrictions for cache memory, local memory, distributed shared memory, and on-chip/off-chip shared memory; data transfer using a DMA controller; and power reduction control using DVFS (Dynamic Voltage and Frequency Scaling), clock gating, and power gating for various embedded multicores. In addition, a parallelized program automatically generated by the OSCAR compiler with the OSCAR API can be compiled by ordinary OpenMP compilers, since the OSCAR API is designed based on a subset of OpenMP. This paper describes the OSCAR API and its compatibility with the OSCAR compiler by showing code examples. Performance evaluations of the OSCAR compiler and the OSCAR API are carried out using an IBM Power5+ workstation, an IBM Power6 high-end SMP server, and RP2, a newly developed consumer electronics multicore chip by Renesas, Hitachi, and Waseda. The scalability evaluation shows that, on average, the OSCAR compiler with the OSCAR API achieves a 5.8× speedup over sequential execution on the Power5+ workstation with eight cores and a 2.9× speedup on RP2 with four cores. In addition, the OSCAR compiler can accelerate an IBM XL Fortran compiler by up to 3.3× on the Power6 SMP server. Due to low-power optimization on RP2, the OSCAR compiler with the OSCAR API achieves a maximum power reduction of 84% in the real-time execution mode.
