
Publication


Featured research published by Takahiro Katagiri.


IEEE International Conference on High Performance Computing, Data and Analytics | 2006

Parallel processing of matrix multiplication in a CPU and GPU heterogeneous environment

Satoshi Ohshima; Kenji Kise; Takahiro Katagiri; Toshitsugu Yuba

GPUs are becoming an attractive alternative for numerical computations in research. In this paper, we propose a new parallel processing environment for matrix multiplication that uses both CPUs and GPUs. With our method, the execution time of matrix multiplication can be decreased to 40.1% of the faster of the CPU-only and GPU-only cases. Our method performs well when matrix sizes are large.
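
The core idea of splitting one multiplication across two devices can be sketched as follows. This is a hypothetical illustration, not the paper's implementation: the function names and the 0.7/0.3 split are invented, and a real system would dispatch the larger row block to a GPU kernel rather than a second Python call.

```python
# Sketch: partition the rows of A between two workers ("GPU" and "CPU")
# in proportion to an assumed relative throughput, then concatenate the
# resulting row blocks of C = A * B.

def matmul(a, b):
    """Plain triple-loop matrix multiply for a block of rows."""
    n, k, m = len(a), len(b), len(b[0])
    return [[sum(a[i][x] * b[x][j] for x in range(k)) for j in range(m)]
            for i in range(n)]

def split_matmul(a, b, gpu_fraction=0.7):
    """Compute C = A*B, giving the faster device the larger row block."""
    cut = int(len(a) * gpu_fraction)      # rows handed to the "GPU"
    c_gpu = matmul(a[:cut], b)            # in a real system: a CUDA kernel
    c_cpu = matmul(a[cut:], b)            # remaining rows stay on the CPU
    return c_gpu + c_cpu                  # concatenate row blocks

a = [[1, 2], [3, 4], [5, 6], [7, 8]]
b = [[1, 0], [0, 1]]                      # identity, so the result equals a
print(split_matmul(a, b))
```

In practice the split fraction itself is a tuning parameter: it should track the measured throughput ratio of the two devices, which is exactly the kind of quantity an auto-tuner can estimate.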


Parallel Computing | 2005

A time-to-live based reservation algorithm on fully decentralized resource discovery in Grid computing

Sanya Tangpongprasit; Takahiro Katagiri; Kenji Kise; Hiroki Honda; Toshitsugu Yuba

We present an alternative algorithm for fully decentralized resource discovery in Grid computing, which enables the sharing, selection, and aggregation of a wide variety of geographically distributed computational resources. Our algorithm is based on simple unicast request transmission and can be easily implemented. The addition of a reservation algorithm enables the resource discovery mechanism to find more available matching resources. The deadline for resource discovery is set by a time-to-live value. With our algorithm, a single resource is selected automatically when multiple available resources are found on the forward path of resource discovery, so the user need not select a resource manually from a large list of available matches. We evaluated the performance of our algorithm by comparing it with a first-found-first-served algorithm. The experimental results show that the percentage of requests that can be supported is about the same for both algorithms. However, our algorithm can improve either resource utilization or turnaround time, depending on how the resource is selected: choosing the available matching resource whose attributes are closest to the required attributes improves resource utilization, whereas choosing the one with the highest performance improves turnaround time. We also found that the performance of our algorithm depends on the density of resources in the network; it performs well only in environments where resources are plentiful relative to the density of requests.
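
A much-simplified sketch of the discovery scheme follows, assuming a chain of nodes and a single numeric "capacity" attribute (both invented here for illustration): the request is forwarded by unicast, each hop decrements the time-to-live, matching resources on the forward path are collected, and one resource is chosen automatically at the end.

```python
# Sketch of TTL-bounded discovery with automatic selection; the
# closest-fit policy mirrors the paper's utilization-oriented variant.

def discover(nodes, required, ttl):
    """nodes: list of (name, capacity); returns the chosen resource name."""
    candidates = []
    for name, capacity in nodes:
        if ttl == 0:                       # time-to-live deadline reached
            break
        if capacity >= required:           # resource matches the request
            candidates.append((name, capacity))
        ttl -= 1
    if not candidates:
        return None
    # Closest fit: the smallest capacity that still satisfies the request,
    # which tends to improve overall resource utilization.
    return min(candidates, key=lambda nc: nc[1])[0]

nodes = [("n1", 2), ("n2", 8), ("n3", 4), ("n4", 16)]
print(discover(nodes, required=4, ttl=3))   # only n1..n3 are visited
```

Swapping `min` for `max` over a performance attribute would give the turnaround-time-oriented policy instead.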


IEEE International Conference on High Performance Computing, Data and Analytics | 2003

FIBER: A Generalized Framework for Auto-tuning Software

Takahiro Katagiri; Kenji Kise; Hiroaki Honda; Toshitsugu Yuba

This paper proposes a new software architecture framework, named FIBER, to generalize auto-tuning facilities and obtain highly accurate estimated parameters. The FIBER framework also provides a loop-unrolling function, with code generation and parameter registration processes, to support code development by library developers. FIBER has three kinds of parameter optimization layers: installation, before execution-invocation, and run-time. An eigensolver parameterized with the FIBER framework is described and evaluated on three kinds of parallel computers: the HITACHI SR8000/MPP, Fujitsu VPP800/63, and a Pentium 4 PC cluster. The evaluation indicated a 28.7% speed increase in the computation kernel of the eigensolver when the new before execution-invocation optimization layer was applied.
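
The installation-layer idea can be sketched with a toy unrolling example. This is not FIBER's actual API: the kernel, the candidate depths, and the registration step are all invented here to show the shape of install-time tuning (measure each generated variant once, register the fastest, reuse it later).

```python
import time

# Toy install-time auto-tuning: time each candidate unroll depth of a
# summation kernel, then register the fastest depth for later runs.

def kernel_unrolled(data, depth):
    """Sum `data`, accumulating `depth` elements per loop iteration."""
    total, i, n = 0, 0, len(data)
    while i + depth <= n:
        total += sum(data[i:i + depth])
        i += depth
    total += sum(data[i:])                 # leftover elements
    return total

def install_time_tuning(candidates, data):
    """Return the unroll depth with the smallest measured time."""
    timings = {}
    for depth in candidates:
        start = time.perf_counter()
        kernel_unrolled(data, depth)
        timings[depth] = time.perf_counter() - start
    return min(timings, key=timings.get)   # registered parameter

data = list(range(10000))
best = install_time_tuning([1, 2, 4, 8], data)
print(best, kernel_unrolled(data, best))
```

The before execution-invocation layer differs only in *when* this measurement happens: after the user's problem size is known but before the solver runs, so the estimate can use the actual input characteristics.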


Parallel Computing | 2006

ABCLib_DRSSED: A parallel eigensolver with an auto-tuning facility

Takahiro Katagiri; Kenji Kise; Hiroki Honda; Toshitsugu Yuba

Conventional auto-tuning numerical software has two drawbacks: (1) fixed sampling points for performance estimation; (2) inadequate adaptation to heterogeneous environments. To address these, we developed ABCLib_DRSSED, a parallel eigensolver with an auto-tuning facility. ABCLib_DRSSED has (1) functions based on sampling points constructed through an end-user interface; (2) a load balancer for the data to be distributed; (3) a new auto-tuning optimization timing called Before Execute-time Optimization (BEO). In our performance evaluation of the BEO, we obtained speedups from 10% to 90%, and 340% in the case of a failed estimation. In the evaluation of the load balancer, performance improved by 220%.


Parallel Computing | 2006

ABCLibScript: a directive to support specification of an auto-tuning facility for numerical software

Takahiro Katagiri; Kenji Kise; Hiroki Honda; Toshitsugu Yuba

We describe the design and implementation of ABCLibScript, a directive that supports the addition of an auto-tuning facility. ABCLibScript limits the scope of auto-tuning to numerical computations, where block-length adjustment for blocked algorithms, loop-unrolling depth adjustment, and algorithm selection are crucial functions. To provide these three functions, we define three instruction operators: variable, unroll, and select, respectively. Our performance evaluation showed that a non-expert user obtained a maximum speedup of 4.3 times by applying ABCLibScript to a program, compared with the same program without it.
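
To make the directive idea concrete, here is a toy preprocessor for a select-style operator. The `#pragma select` syntax, the marker names, and the Gram-Schmidt variant names are invented for this sketch and are not ABCLibScript's real notation; the point is only that a directive marks alternative regions from which one variant is emitted.

```python
# Toy "select" directive expansion: keep only the variant named `choice`
# between select/end markers; all unmarked lines pass through unchanged.

def expand_select(source_lines, choice):
    out, current = [], None
    for line in source_lines:
        if line.startswith("#pragma select "):
            current = line.split()[-1]     # a named variant region opens
        elif line.startswith("#pragma end"):
            current = None                 # the region closes
        elif current is None or current == choice:
            out.append(line)               # plain line, or chosen variant
    return out

src = [
    "setup()",
    "#pragma select mgs",
    "modified_gram_schmidt()",
    "#pragma end",
    "#pragma select cgs",
    "classical_gram_schmidt()",
    "#pragma end",
    "finish()",
]
print(expand_select(src, "cgs"))
```

An auto-tuner closes the loop by timing each expanded variant and fixing `choice` to the fastest one, which is how algorithm selection becomes a tunable parameter.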


Computing Frontiers | 2004

Effect of auto-tuning with user's knowledge for numerical software

Takahiro Katagiri; Kenji Kise; Hiroki Honda; Toshitsugu Yuba

This paper evaluates the effect of an auto-tuning facility that uses the user's knowledge for numerical software. We proposed a new software architecture framework, named FIBER, to generalize auto-tuning facilities and obtain highly accurate estimated parameters. The FIBER framework also provides a loop-unrolling function and an algorithm selection function, with code generation and parameter registration processes, to support code development by library developers. FIBER offers three kinds of parameter optimization layers: install-time, before execute-time, and run-time. The user's knowledge is needed in the before execute-time optimization layer. In this paper, eigensolver parameters that apply the FIBER framework are described and evaluated on three kinds of parallel computers: the HITACHI SR8000/MPP, Fujitsu VPP800/63, and a Pentium 4 PC cluster. Our evaluation of the before execute-time layer indicated a maximum speed increase of 3.4 times for eigensolver parameters, and a maximum increase of 17.1 times for the algorithm selection of orthogonalization in the computation kernel of the eigensolver.


Innovative Architecture for Future Generation High-Performance Processors and Systems (IWIA'05) | 2005

The bimode++ branch predictor

Kenji Kise; Takahiro Katagiri; Hiroki Honda; Toshitsugu Yuba

Modern wide-issue superscalar processors tend to adopt deeper pipelines in order to attain high clock rates. This trend increases the number of in-flight instructions, so a mispredicted branch can result in a substantial amount of wasted work. To mitigate this wasted work, accurate branch prediction is required in high-performance processors. To improve prediction accuracy, we propose the bimode++ branch predictor, an enhanced version of the bimode branch predictor. Some branch instructions have the same outcome every time they execute, from the start to the end of a program; we call these extremely biased branches. The bimode++ branch predictor is unique in predicting the outcome of an extremely biased branch with a simple hardware structure. In addition, it improves accuracy using refined indexing and a fusion function. Our experimental results with benchmarks from the SpecFP, SpecINT, multimedia, and server areas show that the bimode++ branch predictor reduces the misprediction rate by 13.2% relative to the bimode predictor and by 32.5% relative to gshare.
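
The extremely-biased-branch idea can be sketched in a few lines. This is a behavioral simplification, not the paper's hardware: the unbounded dictionaries stand in for fixed-size tables, and the fallback here is a plain 2-bit saturating counter rather than the full bimode structure.

```python
# Sketch: a branch whose outcome has been identical on every execution
# ("extremely biased") is predicted with that constant; once the bias is
# broken, prediction falls back to a 2-bit saturating counter.

class BiasedPredictor:
    def __init__(self):
        self.bias = {}       # pc -> (constant_outcome, still_biased)
        self.counter = {}    # pc -> 2-bit saturating counter (0..3)

    def predict(self, pc):
        if pc in self.bias and self.bias[pc][1]:
            return self.bias[pc][0]          # extremely biased path
        return self.counter.get(pc, 2) >= 2  # weakly-taken default

    def update(self, pc, taken):
        if pc not in self.bias:
            self.bias[pc] = (taken, True)    # first outcome seen
        elif self.bias[pc][0] != taken:
            self.bias[pc] = (self.bias[pc][0], False)  # bias broken
        c = self.counter.get(pc, 2)
        self.counter[pc] = min(3, c + 1) if taken else max(0, c - 1)

p = BiasedPredictor()
for outcome in [True] * 5:                   # e.g. a loop-back branch
    p.update(0x40, outcome)
print(p.predict(0x40))
```

The attraction of the scheme is that extremely biased branches need almost no state to predict perfectly, freeing the main predictor tables for the genuinely hard branches.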


Parallel Computing | 2006

d-spline based incremental parameter estimation in automatic performance tuning

Teruo Tanaka; Takahiro Katagiri; Toshitsugu Yuba

In this paper, we introduce a new d-Spline based Incremental Performance Parameter Estimation (IPPE) method. We first define a fitting function, d-Spline, which adapts flexibly to given data and can be computed easily; its complexity is O(n). We introduce a procedure for incremental performance parameter estimation and an example of data fitting using d-Spline. We applied the IPPE method to automatic performance tuning and ran several experiments. The experimental results illustrate the advantages of this method, such as high accuracy with a relatively small estimation time and high efficiency for large problem sizes.
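
The incremental-estimation loop can be sketched as follows. For brevity, a piecewise-linear interpolant stands in for the actual d-Spline fit, and the cost function and sampling budget are invented; the structure (fit the samples so far, measure next at the point the fit predicts to be best) is the part being illustrated.

```python
# Simplified stand-in for IPPE: maintain a cheap fitted curve over the
# sampled (parameter, time) points and incrementally sample where the
# current fit predicts the minimum cost.

def interp(samples, x):
    """Piecewise-linear estimate of cost at x from the samples so far."""
    pts = sorted(samples.items())
    if x <= pts[0][0]:
        return pts[0][1]
    for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
        if x0 <= x <= x1:
            return y0 + (y1 - y0) * (x - x0) / (x1 - x0)
    return pts[-1][1]

def tune(cost, domain, budget):
    """Sample the endpoints, then the predicted-best point, `budget` times."""
    samples = {domain[0]: cost(domain[0]), domain[-1]: cost(domain[-1])}
    for _ in range(budget):
        x = min((p for p in domain if p not in samples),
                key=lambda p: interp(samples, p), default=None)
        if x is None:
            break
        samples[x] = cost(x)               # one real measurement per step
    return min(samples, key=samples.get)

cost = lambda p: (p - 6) ** 2 + 1          # toy cost, minimum at p = 6
print(tune(cost, list(range(1, 11)), budget=4))
```

The appeal of the d-Spline itself is that, unlike this linear stand-in, it smooths measurement noise while staying O(n) to update, which keeps estimation time small even for large parameter spaces.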


Workshop on Computer Architecture Education | 2004

The SimCore/Alpha Functional Simulator

Kenji Kise; Takahiro Katagiri; Hiroki Honda; Toshitsugu Yuba

We have developed a function-level processor simulator, the SimCore/Alpha Functional Simulator Version 2.0 (SimCore Version 2.0), for processor architecture research and processor education. This paper describes the design and implementation of SimCore Version 2.0. Its main features are as follows: (1) It offers a rich set of functions as a function-level simulator. (2) It is implemented compactly in 2,800 lines of C++. (3) It separates out the program-loader function. (4) It uses no global variables, which improves readability. (5) It offers a powerful verification mechanism. (6) It operates on many platforms. (7) Compared with sim-fast in the SimpleScalar Tool Set, SimCore Version 2.0 attains a 19% improvement in simulation speed.
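
What "function-level" means in practice can be shown with a toy interpreter: instruction semantics are executed, but no pipeline, cache, or timing is modeled. The three-operand mini-ISA below is invented for illustration; it is not the Alpha ISA that SimCore implements.

```python
# Toy function-level simulation: fetch, decode by opcode, execute the
# instruction's semantics, and advance the program counter.

def simulate(program, regs=None):
    """Run until 'halt'; regs maps register names to integer values."""
    regs = dict(regs or {})
    pc = 0
    while program[pc][0] != "halt":
        op, dst, a, b = program[pc]
        if op == "addi":
            regs[dst] = regs.get(a, 0) + b          # add immediate
        elif op == "mul":
            regs[dst] = regs.get(a, 0) * regs.get(b, 0)
        elif op == "bne":                           # branch if dst != a
            if regs.get(dst, 0) != regs.get(a, 0):
                pc = b
                continue
        pc += 1
    return regs

# 5! by repeated multiply: r2 counts 1..5, r1 accumulates the product.
prog = [
    ("addi", "r1", "r1", 1),     # r1 = 1
    ("addi", "r2", "r2", 1),     # r2 = 1
    ("mul",  "r1", "r1", "r2"),  # r1 *= r2
    ("addi", "r2", "r2", 1),     # r2 += 1
    ("bne",  "r2", "r3", 2),     # loop while r2 != r3
    ("halt", None, None, None),
]
print(simulate(prog, {"r3": 6})["r1"])
```

Because only architectural state is tracked, such a simulator is fast and easy to verify against a reference run, which is what makes the function-level design attractive for education and for co-simulation checks.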


Innovative Architecture for Future Generation High-Performance Processors and Systems (IWIA'04) | 2004

A Super Instruction-Flow Architecture for High Performance and Low Power Processors

Kenji Kise; Takahiro Katagiri; Hiroki Honda; Toshitsugu Yuba

Microprocessor performance has improved at about 55% per year for the past three decades. To maintain this growth rate, next-generation processors must achieve higher levels of instruction-level parallelism. However, conditional branches pose serious performance problems in modern processors, and as instruction pipelines become deeper and issue widths wider, the problem worsens. The goal of this study is to develop a novel processor architecture that mitigates the performance degradation caused by branch instructions. To this end, we propose a super instruction-flow architecture and describe its concept. The architecture has a mechanism that processes multiple instruction flows efficiently, mitigating the performance degradation. Preliminary evaluation results with small benchmark programs show that the first-generation super instruction-flow processor efficiently mitigates branch overhead.

Collaboration


Dive into Takahiro Katagiri's collaborations.

Top Co-Authors

- Toshitsugu Yuba (University of Electro-Communications)
- Hiroki Honda (University of Electro-Communications)
- Kenji Kise (Tokyo Institute of Technology)
- Sanya Tangpongprasit (University of Electro-Communications)
- Satoshi Ohshima (University of Electro-Communications)
- Teruo Tanaka (University of Electro-Communications)
- James Demmel (University of California)