Hiroki Murata
IBM
Publications
Featured research published by Hiroki Murata.
ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming | 2014
David Cunningham; David Grove; Benjamin Herta; Arun Iyengar; Kiyokuni Kawachiya; Hiroki Murata; Vijay A. Saraswat; Mikio Takeuchi; Olivier Tardieu
Scale-out programs run on multiple processes in a cluster. In scale-out systems, processes can fail. Computations using traditional libraries such as MPI fail when any component process fails. The advent of MapReduce, Resilient Distributed Datasets, and MillWheel has shown that dramatic improvements in productivity are possible when a high-level programming framework handles scale-out and resilience automatically. We are concerned with the development of general-purpose languages that support resilient programming. In this paper we show how the X10 language and implementation can be extended to support resilience. In Resilient X10, places may fail asynchronously, causing loss of the data and tasks at the failed place. Failure is exposed through exceptions. We identify a Happens Before Invariance Principle and require the runtime to automatically repair the global control structure of the program to maintain this principle. We show that this reduces much of the burden of resilient programming. The programmer is only responsible for continuing execution with fewer computational resources and the loss of part of the heap, and can do so while taking advantage of domain knowledge. We build a complete implementation of the language, capable of executing benchmark applications on hundreds of nodes. We describe the algorithms required to make the language runtime resilient. We then give three applications, each with a different approach to fault tolerance (replay, decimation, and domain-level checkpointing). These can be executed at scale and survive node failure. We show that for these programs the overhead of resilience is a small fraction of overall runtime by comparing them with equivalent non-resilient X10 programs. On one program we show that the end-to-end performance of Resilient X10 is ~100x faster than Hadoop.
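The failure model lends itself to a simple coordination pattern. The following is a minimal C sketch, not X10 and not the authors' runtime, of the "replay" fault-tolerance approach: a task's failure is surfaced to the coordinating process (analogous to the exception Resilient X10 raises when a place fails), and the lost work is simply re-executed. The task function and the injected failure are hypothetical.

#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#define NTASKS 8

/* Hypothetical unit of work; task 3 aborts on its first attempt to
 * simulate the loss of a place. */
static void run_task(int id) {
    if (id == 3 && getenv("REPLAY") == NULL)
        abort();
    printf("task %d done by pid %d\n", id, (int)getpid());
}

/* Run one task in a child process; report whether it survived. */
static int run_in_child(int id, int replay) {
    pid_t pid = fork();
    if (pid < 0)
        return 0;
    if (pid == 0) {
        if (replay)
            setenv("REPLAY", "1", 1);
        run_task(id);
        _exit(0);
    }
    int status = 0;
    waitpid(pid, &status, 0);
    /* Failure is exposed to the coordinator, analogous to the
     * exception raised when a place fails. */
    return WIFEXITED(status) && WEXITSTATUS(status) == 0;
}

int main(void) {
    for (int id = 0; id < NTASKS; id++) {
        if (!run_in_child(id, 0)) {
            fprintf(stderr, "task %d lost; replaying\n", id);
            run_in_child(id, 1); /* re-execute the lost work */
        }
    }
    return 0;
}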
Future Generation Computer Systems | 2005
Shu Tezuka; Hiroki Murata; Shuji Tanaka; Shoji Yumae
Due to reduced profitability, increased price competition, and strengthened regulation, financial institutions in all countries are now upgrading their financial analytics based on Monte Carlo simulation. In this article, we propose three key technologies, namely data protection, integrity, and deadline scheduling, which are indispensable for building a secure PC-grid for financial risk management. We constructed a PC-grid by scavenging the unused CPU cycles of about 50 PCs in a real office environment and obtained an 80-fold speedup: for 100,000 Monte Carlo scenarios, 95 hours of computation on a single server was reduced to 70 minutes. Finally, we discuss future research directions.
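As a rough illustration of the workload being distributed, here is a hedged C sketch of one PC's share of such a run: each machine evaluates an independent chunk of the 100,000 scenarios and reports a partial result for later aggregation. The one-factor loss model and the chunk size are invented for illustration; the article's data protection, integrity, and deadline-scheduling machinery is not shown.

#include <stdio.h>
#include <stdlib.h>
#include <math.h>

#define PI 3.14159265358979323846

/* Box-Muller transform: one standard normal sample from two uniforms. */
static double std_normal(void) {
    double u1 = (rand() + 1.0) / ((double)RAND_MAX + 2.0);
    double u2 = (rand() + 1.0) / ((double)RAND_MAX + 2.0);
    return sqrt(-2.0 * log(u1)) * cos(2.0 * PI * u2);
}

int main(void) {
    const int total = 100000; /* scenarios, as in the article */
    const int chunk = total / 50; /* one PC's share across ~50 PCs */
    srand(42);
    double sum = 0.0;
    for (int i = 0; i < chunk; i++) {
        double shock = std_normal(); /* market scenario */
        double loss = fmax(0.0, -(0.05 + 0.2 * shock)); /* toy loss model */
        sum += loss;
    }
    printf("partial mean loss over %d scenarios: %f\n", chunk, sum / chunk);
    return 0;
}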
IEEE Transactions on Parallel and Distributed Systems | 2012
Guojing Cong; I-Hsin Chung; Hui-Fang Wen; David J. Klepacki; Hiroki Murata; Yasushi Negishi; Takao Moriyama
High productivity is critical in harnessing the power of high-performance computing systems to solve science and engineering problems. It is a challenge to bridge the gap between the hardware complexity and the software limitations. Despite significant progress in programming language, compiler, and performance tools, tuning an application remains largely a manual task, and is done mostly by experts. In this paper, we propose a systematic approach toward automated performance analysis and tuning that we expect to improve the productivity of performance debugging significantly. Our approach seeks to build a framework that facilitates the combination of expert knowledge, compiler techniques, and performance research for performance diagnosis and solution discovery. With our framework, once a diagnosis and tuning strategy has been developed, it can be stored in an open and extensible database and thus be reused in the future. We demonstrate the effectiveness of our approach through the automated performance analysis and tuning of two scientific applications. We show that the tuning process is highly automated, and the performance improvement is significant.
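The reusable diagnosis-and-tuning database at the heart of this framework can be pictured, in miniature, as a table of predicate/remedy pairs. The C sketch below is an illustration under invented assumptions (the metric names and thresholds are hypothetical), not the authors' implementation.

#include <stdio.h>
#include <stddef.h>

struct metrics { double cache_miss_rate; double comm_fraction; };

struct rule {
    const char *diagnosis;
    int (*matches)(const struct metrics *);
    const char *remedy;
};

static int high_misses(const struct metrics *m) { return m->cache_miss_rate > 0.10; }
static int comm_bound(const struct metrics *m)  { return m->comm_fraction > 0.50; }

/* "Database" of diagnosis/remedy pairs; new rules are appended over time. */
static const struct rule rules[] = {
    { "poor data locality", high_misses, "apply loop interchange / blocking" },
    { "communication bound", comm_bound, "aggregate messages, overlap with compute" },
};

int main(void) {
    struct metrics observed = { 0.18, 0.12 }; /* profiled values */
    for (size_t i = 0; i < sizeof rules / sizeof rules[0]; i++)
        if (rules[i].matches(&observed))
            printf("%s -> %s\n", rules[i].diagnosis, rules[i].remedy);
    return 0;
}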
European Conference on Parallel Processing | 2009
Guojing Cong; I-Hsin Chung; Hui-Fang Wen; David J. Klepacki; Hiroki Murata; Yasushi Negishi; Takao Moriyama
High productivity for the end user is critical in harnessing the power of high performance computing systems to solve science and engineering problems. It is a challenge to bridge the gap between the hardware complexity and the software limitations. Despite significant progress in language, compiler, and performance tools, tuning an application remains largely a manual task, and is done mostly by experts. In this paper we propose a holistic approach towards automated performance analysis and tuning that we expect to greatly improve the productivity of performance debugging. Our approach seeks to build a framework that facilitates the combination of expert knowledge, compiler techniques, and performance research for performance diagnosis and solution discovery. With our framework, once a diagnosis and tuning strategy has been developed, it can be stored in an open and extensible database and thus be reused in the future. We demonstrate the effectiveness of our approach through the automated performance analysis and tuning of two scientific applications. We show that the tuning process is highly automated, and the performance improvement is significant.
Parallel and Distributed Systems: Testing, Analysis, and Debugging | 2012
Yasushi Negishi; Hiroki Murata; Guojing Cong; Hui-Fang Wen; I-Hsin Chung
Multicore processors are becoming dominant in the high performance computing (HPC) area, so multithreaded programming with OpenMP is becoming key to good performance on such processors, though debugging remains problematic. In particular, data races among threads are difficult to detect because they produce nondeterministic results, which calls for tools that detect them. Because HPC programs tend to run for long periods, detection tools that do not need to run the target programs are strongly preferred. We developed a static program analysis tool to detect data races in OpenMP loops in FORTRAN programs. Programmers can quickly use the tool at compile time without executing the target program. Because static analysis tools tend to report many false positives, we counted the false positives in some large applications to assess the utility and limits of static analysis tools. We have devised a new approach to detecting data races that combines existing program analysis methods with a new analysis. We experimented with the NAS parallel benchmarks and two real applications, GTC for plasma physics and GFMC for nuclear physics. Our new analysis reduces the number of reported candidates in these applications from 97 to 33. We found 13 previously unknown bugs among the 33 candidates reported by our prototype. Our analysis is fast enough for practical use: the analysis time for the NAS parallel benchmarks was shorter than the compilation time (18.5 seconds versus 33.0 seconds).
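For readers unfamiliar with the defect class, the following minimal C/OpenMP example (the tool itself analyzes FORTRAN) shows the kind of loop a static detector would flag, together with the standard fix. Compile with -fopenmp.

#include <stdio.h>

int main(void) {
    const int n = 1000000;
    double sum_racy = 0.0, sum_ok = 0.0;

    /* Racy: all threads perform an unsynchronized read-modify-write
     * of the shared variable sum_racy, so results are nondeterministic. */
    #pragma omp parallel for
    for (int i = 0; i < n; i++)
        sum_racy += 1.0 / (i + 1);

    /* Fixed: the reduction clause gives each thread a private copy,
     * combined deterministically at the end of the loop. */
    #pragma omp parallel for reduction(+ : sum_ok)
    for (int i = 0; i < n; i++)
        sum_ok += 1.0 / (i + 1);

    printf("racy=%f correct=%f\n", sum_racy, sum_ok);
    return 0;
}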
IEEE International Conference on High Performance Computing, Data, and Analytics | 2012
Guojing Cong; Hui-Fang Wen; Hiroki Murata; Yasushi Negishi
UPC is designed to improve user productivity when programming distributed-memory machines. Yet the shared-memory abstraction also makes performance analysis hard, as it introduces extra overhead for local accesses and implicit communication for remote ones. As far as we know, there are no mature software utilities for systematic analysis and tuning of shared-memory access performance in UPC programs. We develop a mechanism to track shared-memory accesses and correlate them to the UPC source lines, functions, and data structures. We then apply tool-assisted analysis to a set of UPC programs. For the NAS UPC benchmark we achieve dramatic performance improvements over the unoptimized implementation, as well as speedups of up to 2x over the fully hand-tuned implementation. We expect our approach to be effective in tuning a wide range of UPC programs.
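The general mechanism can be approximated in plain C: wrap each shared access in a macro that records the source location and variable name before performing the access, so hot access sites can be correlated back to the program. This is only a sketch of the idea, not the authors' UPC instrumentation; the macro name is invented.

#include <stdio.h>

static long access_count = 0;

/* Log file, line, and variable name for every tracked read,
 * then yield the accessed value via the comma operator. */
#define TRACKED_READ(arr, idx)                                   \
    (access_count++,                                             \
     fprintf(stderr, "%s:%d read %s[%d]\n", __FILE__, __LINE__,  \
             #arr, (int)(idx)),                                  \
     (arr)[idx])

int main(void) {
    int data[4] = { 3, 1, 4, 1 };
    int total = 0;
    for (int i = 0; i < 4; i++)
        total += TRACKED_READ(data, i); /* each access is logged */
    printf("total=%d accesses=%ld\n", total, access_count);
    return 0;
}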
International Parallel and Distributed Processing Symposium | 2012
Guojing Cong; Hui-Fang Wen; I-Hsin Chung; David J. Klepacki; Hiroki Murata; Yasushi Negishi
Deploying an application onto a target platform for high performance oftentimes demands manual tuning by experts. As machine architectures grow increasingly complex, tuning becomes even more challenging and calls for systematic approaches. In our earlier work we presented a prototype that efficiently combines expert knowledge, static analysis, and runtime observation for bottleneck detection, and employs refactoring and compiler feedback for mitigation. In this study, we develop a software tool that facilitates fast searching for bottlenecks and effective mitigation of problems along the major dimensions of computing (e.g., computation, communication, and I/O). The impact of our approach is demonstrated by the tuning of the LBMHD code and a Poisson solver code, representing traditional scientific codes, and a graph analysis code in UPC, representing emerging programming paradigms. In the experiments, our framework detects intricate bottlenecks in memory access, I/O, and communication with a single run of the application. Moreover, the automated solution implementation yields significant overall performance improvements on the target platforms. The improvement for LBMHD is up to 45%, and the speedup for the UPC code is up to 5x. These results suggest that our approach is a concrete step towards systematic tuning of high performance computing applications.
IEEE International Symposium on Parallel & Distributed Processing, Workshops and PhD Forum | 2010
Guojing Cong; I-Hsin Chung; Hui-Fang Wen; David J. Klepacki; Hiroki Murata; Yasushi Negishi; Takao Moriyama
To fully utilize the power of current high performance computing systems, high productivity for the end user is critical. It is a challenge to map an application to the target architecture efficiently. Tuning an application for high performance remains a daunting task and frequently involves manual changes to the program. Recently, refactoring techniques have been proposed to rewrite or reorganize programs for various software engineering purposes. In our research we explore combining performance analysis with refactoring techniques for automated tuning, which we expect to greatly improve the productivity of application deployment. We seek to build a system that can apply the appropriate refactoring according to the bottleneck discovered. We demonstrate the effectiveness of this approach through the tuning of several scientific applications and kernels.
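A representative bottleneck/refactoring pair makes the idea concrete: for a cache-miss diagnosis, a classic mechanical refactoring is loop interchange. The C sketch below is illustrative only (the array size is invented) and is not taken from the paper.

#include <stdio.h>

#define N 1024
static double a[N][N];

/* Before: column-order traversal of a row-major array, so the inner
 * loop strides N elements apart and misses the cache constantly. */
static double sum_column_order(void) {
    double s = 0.0;
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            s += a[i][j];
    return s;
}

/* After: loops interchanged, so the inner loop walks contiguous memory. */
static double sum_row_order(void) {
    double s = 0.0;
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            s += a[i][j];
    return s;
}

int main(void) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            a[i][j] = 1.0;
    printf("%f %f\n", sum_column_order(), sum_row_order());
    return 0;
}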
Archive | 1998
Naotaka Kato; Itiro Siio; Hiroki Murata; Toru Aihara
Archive | 2004
Yasushi Negishi; Hiroki Murata; Kenichi Okuyama; Kazuya Tago