Takeshi Fukaya
Nagoya University
Publications
Featured research published by Takeshi Fukaya.
parallel computing technologies | 2007
Yusaku Yamamoto; Takeshi Fukaya; Takashi Uneyama; Masami Takata; Kinji Kimura; Masashi Iwasaki; Yoshimasa Nakamura
We propose an approach to speed up the singular value decomposition (SVD) of very large rectangular matrices using the CSX600 floating point coprocessor. The CSX600-based acceleration board we use offers 50 GFLOPS of sustained performance, which is many times greater than that provided by standard microprocessors. However, this performance can be achieved only when a vendor-supplied matrix-matrix multiplication routine is used and the matrix size is sufficiently large. In this paper, we optimize two of the major components of rectangular SVD, namely, QR decomposition of the input matrix and back-transformation of the left singular vectors by matrix Q, so that large-size matrix multiplications can be used efficiently. In addition, we use the Integrable SVD algorithm to compute the SVD of an intermediate bidiagonal matrix. This helps to further speed up the computation and reduce the memory requirements. As a result, we achieved a speedup of up to 3.5 times over the Intel Math Kernel Library running on a 3.2 GHz Xeon processor when computing the SVD of a 100,000 × 4000 matrix.
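The overall structure of rectangular SVD described above (QR decomposition of the tall input, SVD of the small triangular factor, then back-transformation of the left singular vectors by Q) can be sketched in NumPy. This is a minimal illustration using standard library routines; the paper's CSX600-tuned kernels and the Integrable SVD bidiagonal solver are not reproduced here, and the function name is ours:

```python
import numpy as np

def rect_svd(A):
    """SVD of a tall-skinny m x n matrix (m >> n) via QR preprocessing."""
    Q, R = np.linalg.qr(A)           # step 1: QR decomposition of the input
    Ur, s, Vt = np.linalg.svd(R)     # step 2: SVD of the small n x n factor
    U = Q @ Ur                       # step 3: back-transformation by Q
    return U, s, Vt

rng = np.random.default_rng(0)
A = rng.standard_normal((1000, 40))
U, s, Vt = rect_svd(A)
print(np.allclose((U * s) @ Vt, A))  # True
```

Steps 1 and 3 are rich in large matrix-matrix products, which is what makes them the natural targets for an accelerator that performs well only on large multiplications.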
international conference on cluster computing | 2008
Takeshi Fukaya; Yusaku Yamamoto; Shao-Liang Zhang
In this paper, we present a new approach to optimizing the blocking strategy for the Householder QR decomposition. In high-performance implementations of the Householder QR algorithm, it is common to use a blocking technique for efficient use of the cache memory. There are several well-known blocking strategies, such as fixed-size blocking and recursive blocking, and usually their parameters, such as the block size and the recursion level, are tuned according to the target machine and the problem size. However, strategies generated with this kind of parameter optimization constitute only a small fraction of all possible blocking strategies. Given the complex performance characteristics of modern microprocessors, non-standard strategies may prove effective on some machines. Considering this situation, we first propose a new universal model that can express a far larger class of blocking strategies than has been considered so far. Next, we give an algorithm to find a near-optimal strategy from this class using dynamic programming. As a result of this approach, we found an effective blocking strategy that has never been reported. Performance evaluation on the Opteron and Core2 processors shows that our strategy achieves about 1.2 times speedup over recursive blocking when computing the QR decomposition of a 6000 × 6000 matrix.
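The dynamic-programming idea can be illustrated with a toy example. Here `cost(j, b)` stands in for the machine-dependent time of processing one block of `b` columns starting at column `j`; the real model in the paper is built from the target machine's characteristics, so both the cost function and the function name below are illustrative assumptions:

```python
def best_blocking(n, cost, max_block):
    """Toy dynamic program over variable-size column blockings.

    dp[j] holds the minimal cost of factoring the remaining columns j..n-1;
    choice[j] records the block size that achieves it.
    """
    INF = float("inf")
    dp = [INF] * (n + 1)
    choice = [0] * (n + 1)
    dp[n] = 0.0
    for j in range(n - 1, -1, -1):
        for b in range(1, min(max_block, n - j) + 1):
            c = cost(j, b) + dp[j + b]
            if c < dp[j]:
                dp[j], choice[j] = c, b
    blocks, j = [], 0          # recover the optimal block-size sequence
    while j < n:
        blocks.append(choice[j])
        j += choice[j]
    return dp[0], blocks

# Toy cost: quadratic work per block plus a fixed per-block overhead.
total, blocks = best_blocking(12, lambda j, b: b * b + 5, 6)
print(total, blocks)  # 54.0 [2, 2, 2, 2, 2, 2]
```

Because the number of candidate blockings is exponential in the number of columns while the table has only O(n · max_block) entries, the dynamic program makes the search tractable.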
Proceedings of the 5th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems | 2014
Takeshi Fukaya; Yuji Nakatsukasa; Yuka Yanagisawa; Yusaku Yamamoto
Designing communication-avoiding algorithms is crucial for high performance computing on a large-scale parallel system. The TSQR algorithm is a communication-avoiding algorithm for computing a tall-skinny QR factorization, and TSQR is known to be much faster than, and as stable as, the classical Householder QR algorithm. The Cholesky QR algorithm is another very simple and fast communication-avoiding algorithm, but it is rarely used in practice because of its numerical instability. Our recent work points out that an algorithm that simply repeats Cholesky QR twice, which we call CholeskyQR2, gives excellent accuracy for a wide range of matrices arising in practice. Although the communication cost of CholeskyQR2 is twice that of TSQR, it has the advantage that its reduction operation is addition, whereas that of TSQR is a QR factorization, whose high-performance implementation is more difficult. Thus, CholeskyQR2 can potentially be significantly faster than TSQR. Indeed, in our experiments using 16384 nodes of the K computer, CholeskyQR2 ran about three times faster than TSQR for a 4194304 × 64 matrix.
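A minimal serial sketch of CholeskyQR2 in NumPy follows. The function names are ours; in the parallel setting the Gram-matrix product corresponds to the addition-based all-reduce that the abstract contrasts with TSQR's QR-based reduction:

```python
import numpy as np

def cholesky_qr(A):
    """One Cholesky QR step: Q = A R^{-1} with R from the Gram matrix."""
    G = A.T @ A                       # in parallel: a local product + all-reduce (addition)
    R = np.linalg.cholesky(G).T       # upper-triangular Cholesky factor
    Q = np.linalg.solve(R.T, A.T).T   # Q = A R^{-1} via a triangular system, not an inverse
    return Q, R

def cholesky_qr2(A):
    """Repeat the step once more; this restores orthogonality for
    matrices that are not too ill-conditioned, which is the point of CholeskyQR2."""
    Q1, R1 = cholesky_qr(A)
    Q, R2 = cholesky_qr(Q1)
    return Q, R2 @ R1

rng = np.random.default_rng(0)
A = rng.standard_normal((500, 20))
Q, R = cholesky_qr2(A)
print(np.allclose(Q @ R, A))  # True
```

The instability of a single Cholesky QR step grows with the squared condition number of A; the second pass is applied to the already nearly orthonormal Q1, which is why the repetition recovers full accuracy.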
international parallel and distributed processing symposium | 2015
Takeshi Fukaya; Toshiyuki Imamura
The solution of real symmetric dense eigenvalue problems is one of the fundamental matrix computations. To date, several new high-performance eigensolvers have been developed for peta- and post-peta-scale systems. One of these, the EigenExa eigensolver, has been developed in Japan. EigenExa provides two routines: eigen_s, which is based on traditional tridiagonalization, and eigen_sx, which employs a new method via a pentadiagonal matrix. Recently, we conducted a detailed performance evaluation of EigenExa using 4,800 nodes of the Oakleaf-FX supercomputer system. In this paper, we report the results of our evaluation, which mainly focuses on investigating the differences between the two routines. The results clearly indicate both the advantages and disadvantages of eigen_sx over eigen_s, which will contribute to further performance improvement of EigenExa. The obtained results are also expected to be useful for other parallel dense matrix computations beyond eigenvalue problems.
ieee international conference on high performance computing data and analytics | 2014
Takeshi Fukaya; Toshiyuki Imamura; Yusaku Yamamoto
We consider computing tall-skinny QR factorizations on a large-scale parallel machine. We present a realistic performance model and analyze the difference in parallel execution time between Householder QR and TSQR. Our analysis indicates the possibility that TSQR becomes slower than Householder QR as the number of columns of the target matrix increases. We aim to estimate this difference and select the faster algorithm using models, an approach that falls under auto-tuning. Numerical experiments on the K computer support our analysis and show our success in determining the faster algorithm.
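The TSQR algorithm being modeled above can be sketched serially, with row blocks standing in for compute nodes. This is a simplified one-level reduction tree (the function name and block count are ours), but it shows why TSQR's reduction cost grows with the column count: the reduction step is itself a QR factorization of stacked triangular factors.

```python
import numpy as np

def tsqr(A, p=4):
    """One-level TSQR sketch: p local QRs, then a QR of the stacked R factors."""
    n = A.shape[1]
    blocks = np.array_split(A, p, axis=0)        # rows distributed over p "nodes"
    local = [np.linalg.qr(B) for B in blocks]    # local (reduced) QR factorizations
    R_stack = np.vstack([R for _, R in local])   # gather the p stacked n x n R factors
    Q2, R = np.linalg.qr(R_stack)                # the reduction is itself a QR
    Q2_parts = np.array_split(Q2, p, axis=0)     # n rows of Q2 go back to each node
    Q = np.vstack([Qi @ Q2i for (Qi, _), Q2i in zip(local, Q2_parts)])
    return Q, R

rng = np.random.default_rng(1)
A = rng.standard_normal((800, 30))
Q, R = tsqr(A)
print(np.allclose(Q @ R, A))  # True
```

Each reduction QR operates on a stack of n x n triangles, so its work scales with the number of columns n, which is consistent with the observation that TSQR can fall behind Householder QR as n grows.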
Software Automatic Tuning, From Concepts to State-of-the-Art Results | 2011
Yusaku Yamamoto; Takeshi Fukaya
In this chapter, we survey several approaches to optimizing the blocking strategy for basic matrix decompositions, such as LU, Cholesky, and QR. Conventional blocking strategies such as fixed-size blocking and recursive blocking are widely used to optimize the performance of these decompositions. However, these strategies have only a small number of parameters such as the block size or the level of recursion and are not sufficiently flexible to exploit the performance of modern high-performance architectures. As such, several attempts have been made to define a much larger class of strategies and to choose the best strategy among them according to the target machine and the matrix size. The number of candidate strategies is usually exponential in the size of the matrix. However, with the use of dynamic programming, the cost of optimization can be reduced to a realistic level. As representatives of such approaches, we survey variable-size blocking, generalized recursive blocking, and the combination of variable-size blocking and the TSQR algorithm. Directions for future research are also discussed.
JSIAM Letters | 2009
Yusaku Yamamoto; Takeshi Fukaya
JSIAM Letters | 2010
Yusaku Yamamoto; Takeshi Fukaya
International journal of networking and computing | 2011
Jun-ichi Muramatsu; Takeshi Fukaya; Shao-Liang Zhang; Kinji Kimura; Yusaku Yamamoto
international parallel and distributed processing symposium | 2018
Takeshi Fukaya; Toshiyuki Imamura; Yusaku Yamamoto