Tatsushi Inagaki
IBM
Publication
Featured research published by Tatsushi Inagaki.
programming language design and implementation | 2003
Tatsushi Inagaki; Tamiya Onodera; Hideaki Komatsu; Toshio Nakatani
Software prefetching is a promising technique to hide cache miss latencies, but it remains challenging to effectively prefetch pointer-based data structures because obtaining the memory address to be prefetched requires pointer dereferences. The recently proposed stride prefetching overcomes this problem, but it only exploits inter-iteration stride patterns and relies on an off-line profiling method. We propose a new algorithm for stride prefetching which is intended for use in a dynamic compiler. We exploit both inter- and intra-iteration stride patterns, which we discover using an ultra-lightweight profiling technique, called object inspection. This is a kind of partial interpretation that only a dynamic compiler can perform. During the compilation of a method, the dynamic compiler gathers the profile information by partially interpreting the method using the actual values of parameters and causing no side effects. We evaluated an implementation of our prefetching algorithm in a production-level Java just-in-time compiler. The results show that the algorithm achieved speedups of up to 18.9% and 25.1% in industry-standard benchmarks on the Pentium 4 and the Athlon MP, respectively, while it increased the compilation time by less than 3.0%.
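The distinction between inter- and intra-iteration stride patterns can be illustrated with a minimal sketch. This is not the paper's object inspection algorithm: it assumes an address trace has already been collected (one dictionary of load addresses per loop iteration), and the function names, the `trace` layout, and the 0.75 dominance threshold are all hypothetical choices made for illustration.

```python
from collections import Counter


def dominant_value(values, min_ratio=0.75):
    """Return the most frequent value if it covers at least min_ratio of the samples."""
    if not values:
        return None
    value, count = Counter(values).most_common(1)[0]
    return value if count / len(values) >= min_ratio else None


def inter_iteration_stride(trace, load):
    """Stride between the addresses of the same load in consecutive iterations."""
    addrs = [iteration[load] for iteration in trace]
    deltas = [b - a for a, b in zip(addrs, addrs[1:])]
    stride = dominant_value(deltas)
    # A stride of zero (or no dominant stride) gives nothing useful to prefetch.
    return stride if stride else None


def intra_iteration_stride(trace, base, dependent):
    """Constant offset between two loads issued within the same iteration."""
    return dominant_value([iteration[dependent] - iteration[base] for iteration in trace])


# Hypothetical trace: each iteration loads a node and the payload it points to.
trace = [
    {"node": 0x1000, "payload": 0x1020},
    {"node": 0x1040, "payload": 0x1060},
    {"node": 0x1080, "payload": 0x10A0},
    {"node": 0x10C0, "payload": 0x10E0},
]

print(inter_iteration_stride(trace, "node"))             # 64
print(intra_iteration_stride(trace, "node", "payload"))  # 32
```

With both strides known, a compiler could in principle compute the address of a future payload from the current node address without dereferencing the intermediate pointer, which is the difficulty the abstract describes for pointer-based structures.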
conference on object oriented programming systems languages and applications | 2003
Kazuaki Ishizaki; Mikio Takeuchi; Kiyokuni Kawachiya; Toshio Suganuma; Osamu Gohda; Tatsushi Inagaki; Akira Koseki; Kazunori Ogata; Motohiro Kawahito; Toshiaki Yasue; Takeshi Ogasawara; Tamiya Onodera; Hideaki Komatsu; Toshio Nakatani
This paper describes the system overview of our Java Just-In-Time (JIT) compiler, which is the basis for the latest production version of the IBM Java JIT compiler, which supports a diversity of processor architectures including both 32-bit and 64-bit modes, CISC, RISC, and VLIW architectures. In particular, we focus on the design and evaluation of the cross-platform optimizations that are common across different architectures. We studied the effectiveness of each optimization by selectively disabling it in our JIT compiler on three different platforms: IA-32, IA-64, and PowerPC. Our detailed measurements allowed us to rank the optimizations in terms of the greatest performance improvements with the smallest compilation times. The identified set includes method inlining only for tiny methods, exception check eliminations using forward dataflow analysis and partial redundancy elimination, scalar replacement for instance and class fields using dataflow analysis, optimizations for type inclusion checks, and the elimination of merge points in the control flow graphs. These optimizations can achieve 90% of the peak performance for two industry-standard benchmark programs on these platforms with only 34% of the compilation time compared to the case for using all of the optimizations.
IBM Journal of Research and Development | 2004
Toshio Suganuma; Takeshi Ogasawara; Kiyokuni Kawachiya; Mikio Takeuchi; Kazuaki Ishizaki; Akira Koseki; Tatsushi Inagaki; Toshiaki Yasue; Motohiro Kawahito; Tamiya Onodera; Hideaki Komatsu; Toshio Nakatani
Java™ has gained widespread popularity in the industry, and an efficient Java virtual machine (JVM™) and just-in-time (JIT) compiler are crucial in providing high performance for Java applications. This paper describes the design and implementation of our JIT compiler for IA-32 platforms by focusing on the recent advances achieved in the past several years. We first present the dynamic optimization framework, which focuses the expensive optimization efforts only on performance-critical methods, thus helping to manage the total compilation overhead. We then describe the platform-independent features, which include the conversion from the stack-semantic Java bytecode into our register-based intermediate representation (IR) and a variety of aggressive optimizations applied to the IR. We also present some techniques specific to the IA-32 used to improve code quality, especially for the efficient use of the small number of registers on that platform. Using several industry-standard benchmark programs, the experimental results show that our approach offers high performance with low compilation overhead. Most of the techniques presented here are included in the IBM JIT compiler product, integrated into the IBM Development Kit for Microsoft Windows®, Java Technology Edition Version 1.4.0.
symposium on code generation and optimization | 2010
Rei Odaira; Takuya Nakaike; Tatsushi Inagaki; Hideaki Komatsu; Toshio Nakatani
Graph coloring register allocation tries to minimize the total cost of spilled live ranges of variables. Live-range splitting and coalescing are often performed before the coloring to further reduce the total cost. Coalescing of split live ranges, called sub-ranges, can decrease the total cost by lowering the interference degrees of their common interference neighbors. However, it can also increase the total cost because the coalesced sub-ranges can become uncolorable. In this paper, we propose coloring-based coalescing, which first performs trial coloring and next coalesces all copy-related sub-ranges that were assigned the same color. The coalesced graph is then colored again with the graph coloring register allocation. The rationale is that coalescing of differently colored sub-ranges could result in spilling because there are some interference neighbors that prevent them from being assigned the same color. Experiments on Java programs show that the combination of live-range splitting and coloring-based coalescing reduces the static spill cost by more than 6% on average, compared to the baseline coloring without splitting. In contrast, well-known iterated and optimistic coalescing algorithms, when combined with splitting, increase the cost by more than 20%. Coloring-based coalescing improves the execution time by up to 15% and 3% on average, while the existing algorithms improve by up to 12% and 1% on average.
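The three steps described above (trial coloring, coalescing same-colored copy-related sub-ranges, recoloring) can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: a simple degree-ordered greedy coloring stands in for the graph coloring register allocator, spill costs are ignored, and the helper names and the example graph are hypothetical.

```python
def build_graph(nodes, edges):
    """Undirected interference graph as a dict: node -> set of neighbors."""
    graph = {n: set() for n in nodes}
    for a, b in edges:
        graph[a].add(b)
        graph[b].add(a)
    return graph


def greedy_color(graph, k):
    """Stand-in allocator: color highest-degree nodes first; uncolored nodes are spills."""
    colors = {}
    for node in sorted(graph, key=lambda n: len(graph[n]), reverse=True):
        used = {colors[n] for n in graph[node] if n in colors}
        free = [c for c in range(k) if c not in used]
        if free:
            colors[node] = free[0]
    return colors


def coalesce_same_colored(graph, copies, colors):
    """Merge copy-related sub-ranges that the trial coloring gave the same color."""
    rep = {n: n for n in graph}

    def find(n):
        while rep[n] != n:
            n = rep[n]
        return n

    for a, b in copies:
        ra, rb = find(a), find(b)
        if ra != rb and a in colors and b in colors and colors[a] == colors[b]:
            rep[rb] = ra
    merged = {}
    for n, nbrs in graph.items():
        merged.setdefault(find(n), set()).update(find(m) for m in nbrs)
    for n in merged:
        merged[n].discard(n)  # defensive: same-colored nodes never interfere
    return merged, {n: find(n) for n in graph}


# Hypothetical example with k = 2 registers; a1/a2 and d1/d2 are split sub-ranges.
graph = build_graph(
    ["a1", "a2", "b", "c", "d1", "d2"],
    [("a1", "b"), ("a2", "c"), ("b", "c"), ("d1", "b"), ("d2", "b")],
)
copies = [("a1", "a2"), ("d1", "d2")]

trial = greedy_color(graph, k=2)
merged, mapping = coalesce_same_colored(graph, copies, trial)
final = greedy_color(merged, k=2)

print(mapping["d1"] == mapping["d2"])  # True: same trial color, so coalesced
print(mapping["a1"] == mapping["a2"])  # False: different trial colors, kept split
```

In this example, merging `a1` and `a2` unconditionally would produce a node that interferes with both `b` and `c`, which interfere with each other, so two registers would no longer suffice. The trial coloring assigns them different colors, and that is the signal the sketch uses to leave them split.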
international conference on parallel architectures and compilation techniques | 2002
Kazuaki Ishizaki; Tatsushi Inagaki; Hideaki Komatsu; Toshio Nakatani
Java exception checks are designed to ensure that any faulting instruction causing a hardware exception does not terminate the program abnormally. These checks, however, impose some constraints upon the execution order between an instruction potentially raising a Java exception and a faulting instruction causing a hardware exception. This reduces the effectiveness of instruction reordering optimization. We propose a new framework to effectively perform speculation for the Java language using a directed acyclic graph representation based on the SSA form. Using this framework, we apply a well-known speculation technique to a faulting load instruction to eliminate such constraints. We use edges to represent exception constraints. This allows us to accurately estimate the potential reduction of the critical path length for applying speculation. We also propose an approach to avoid extra copy instructions and to generate efficient code with minimum register pressure. We have implemented the technique in the IBM Java Just-In-Time compiler, and observed performance improvements up to 25% for micro-benchmark programs, up to 10% for Java Grande Benchmark Suite, and up to 12% for SPECjvm98 on an Itanium processor.
ieee international symposium on workload characterization | 2016
Tatsushi Inagaki; Yohei Ueda; Moriyoshi Ohara
Operating-system-level virtualization is becoming increasingly important for server applications since it provides containers as a foundation of the emerging microservice architecture, which enables agile application development, deployment, and operation - the essential characteristics in modern cloud-based services. Agility in the microservice architecture heavily depends on fast management operations for containers, such as create, start, and stop. Since containers rely on administrative kernel services provided by the host operating system, the microservice architecture can be considered as a new workload for an operating system as it stresses those services differently from traditional workloads. We studied the scalability of container management operations for Docker, one of the most popular container management systems, from two aspects: core and container scalability, which indicate how much the number of processor cores and number of containers affect container management performance, respectively. We propose a hierarchical analysis approach to identify scalability bottlenecks where we analyze multiple layers of a software stack from the top to bottom layer. Our analysis reveals that core scalability has bottlenecks at a virtualization layer for storage and network devices, and that container scalability has bottlenecks at various components that inquire mount points. While those bottlenecks exist in a daemon process of Docker, the root causes are a couple of interfaces of the underlying kernel. This implies the operating system has room for improvement to more efficiently host emerging microservice applications.
symposium on code generation and optimization | 2003
Tatsushi Inagaki; Hideaki Komatsu; Toshio Nakatani
We present a new integrated prepass scheduling (IPS) algorithm for a Java just-in-time (JIT) compiler which integrates register minimization into list scheduling. We use backtracking in the list scheduling when we have used up all the available registers. To reduce the overhead of backtracking, we incrementally maintain a set of candidate instructions for undoing scheduling. To maximize the ILP after undoing scheduling, we select an instruction chain with the smallest increase in the total execution time. We implemented our new algorithm in a production-level Java JIT compiler for the Intel Itanium processor. The experiment showed that, compared to the best known algorithm by Govindarajan et al., our IPS algorithm improved the performance by up to +1.8% while it reduced the compilation time for IPS by 58% on average.
network and parallel computing | 2004
Clement Richard Attanasio; Jong-Deok Choi; Niteesh Dubey; Kattamuri Ekanadham; Manish Gupta; Tatsushi Inagaki; Kazuaki Ishizaki; Joefon Jann; Robert D. Johnson; Toshio Nakatani; Il Park; Pratap Pattnaik; Mauricio J. Serrano; Stephen Edwin Smith; Ian Steiner; Yefim Shuf
The evolution of the Web as an enabling tool for e-business introduces a challenge to understanding the execution behavior of large-scale middleware systems, such as J2EE [2], and their commercial workloads. This paper presents a brief description of the whole-stack analysis and optimization system – being developed at IBM Research – for commercial workloads on WebSphere Application Server (WAS) [5] – IBM’s implementation of J2EE – running on IBM’s pSeries [4] and zSeries [3] server systems.
Archive | 2008
Tatsushi Inagaki; Hideaki Komatsu; Takuya Nakaike; Rei Odaira
Archive | 2009
Tatsushi Inagaki; Takuya Nakaike; Takeshi Ogasawara; Toshio Suganuma