Publication


Featured research published by Kenji Kise.


IEEE International Conference on High Performance Computing, Data and Analytics | 2006

Parallel processing of matrix multiplication in a CPU and GPU heterogeneous environment

Satoshi Ohshima; Kenji Kise; Takahiro Katagiri; Toshitsugu Yuba

GPUs are becoming an attractive alternative for numerical computation in research. In this paper, we propose a new parallel processing environment for matrix multiplication that uses both CPUs and GPUs. With our method, the execution time of matrix multiplication is reduced to 40.1% of the time taken by the faster of the CPU-only and GPU-only cases. Our method performs well when matrix sizes are large.
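The core idea, splitting one matrix product between two compute units in proportion to their relative speeds, can be sketched as follows. This is a generic illustration, not the paper's actual scheduler; the throughput parameters and the row-wise partitioning are assumptions for the sketch.

```python
# Sketch of heterogeneous work splitting for C = A x B (hypothetical
# speed ratios, not the paper's implementation). Rows of A are
# partitioned between two workers in proportion to their measured
# throughput, so both finish at roughly the same time.

def matmul(A, B):
    """Plain triple-loop matrix multiply for one partition."""
    n, k, m = len(A), len(B), len(B[0]) if B else 0
    C = [[0.0] * m for _ in range(n)]
    for i in range(n):
        for p in range(k):
            a = A[i][p]
            for j in range(m):
                C[i][j] += a * B[p][j]
    return C

def split_matmul(A, B, cpu_gflops, gpu_gflops):
    """Assign the first chunk of rows to the CPU and the rest to the
    GPU, proportionally to their relative speeds, then merge."""
    cut = round(len(A) * cpu_gflops / (cpu_gflops + gpu_gflops))
    cpu_part = matmul(A[:cut], B)   # would run on the CPU
    gpu_part = matmul(A[cut:], B)   # would run on the GPU
    return cpu_part + gpu_part
```

Because both partitions share the full matrix B, only the rows of A need to be distributed, which keeps the merge step a simple concatenation.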


Parallel Computing | 2005

A time-to-live based reservation algorithm on fully decentralized resource discovery in Grid computing

Sanya Tangpongprasit; Takahiro Katagiri; Kenji Kise; Hiroki Honda; Toshitsugu Yuba

We present an alternative algorithm for fully decentralized resource discovery in Grid computing, which enables the sharing, selection, and aggregation of a wide variety of geographically distributed computational resources. Our algorithm is based on a simple unicast request transmission that can be easily implemented. The addition of a reservation algorithm enables the resource discovery mechanism to find more available matching resources. The deadline for resource discovery is decided by a time-to-live value. With our algorithm, a single resource is automatically selected for any request when multiple available resources are found on the forward path of resource discovery, so the user does not need to manually select a resource from a large list of available matches. We evaluated the performance of our algorithm by comparing it with a first-found-first-served algorithm. The experimental results show that the percentage of requests that can be supported is similar for both algorithms. However, our algorithm can improve either resource utilization or turnaround time, depending on how the resource is selected. Selecting the available matching resource whose attributes are closest to the required attributes improves resource utilization, whereas selecting the available matching resource with the highest performance improves turnaround time. However, the performance of our algorithm depends on the density of resources in the network: it performs well only in environments with enough resources relative to the density of requests.
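The TTL-bounded discovery with a closest-fit reservation policy can be sketched as below. The data model (a linear forward path of nodes, each with a single numeric capacity attribute) is an assumption made for illustration; the paper's protocol is richer than this.

```python
# Sketch of TTL-bounded unicast resource discovery (hypothetical data
# model, not the paper's protocol). The request travels hop by hop
# along the forward path, the time-to-live is decremented at each
# node, and among the matching resources found before the TTL expires
# the closest-fit one is reserved.

def discover(path, required, ttl):
    """path: list of (node_name, capacity) along the forward path.
    Returns the name of the reserved node, or None if no match."""
    candidates = []
    for node, capacity in path:
        if ttl == 0:              # deadline reached: stop forwarding
            break
        ttl -= 1
        if capacity >= required:  # resource matches the request
            candidates.append((node, capacity))
    if not candidates:
        return None
    # Closest-fit policy: the smallest capacity that still satisfies
    # the request, which tends to improve resource utilization.
    return min(candidates, key=lambda nc: nc[1])[0]
```

Swapping the `min` for a `max` over capacity would give the highest-performance policy described in the abstract, trading utilization for turnaround time.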


IEEE International Conference on High Performance Computing, Data and Analytics | 2003

FIBER: A Generalized Framework for Auto-tuning Software

Takahiro Katagiri; Kenji Kise; Hiroaki Honda; Toshitsugu Yuba

This paper proposes a new software architecture framework, named FIBER, to generalize auto-tuning facilities and obtain highly accurate estimated parameters. The FIBER framework also provides a loop unrolling function, with code generation and parameter registration processes, to support code development by library developers. FIBER has three kinds of parameter optimization layers: installation, before execution-invocation, and run-time. An eigensolver parameter to which the FIBER framework is applied is described and evaluated on three kinds of parallel computers: the HITACHI SR8000/MPP, Fujitsu VPP800/63, and a Pentium 4 PC cluster. Evaluation indicated a 28.7% speed increase in the computation kernel of the eigensolver when the new before execution-invocation optimization layer was applied.
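The installation layer of such a framework amounts to benchmarking candidate parameter values once on the target machine and recording the winner. The sketch below is a generic illustration of that idea, not FIBER's actual interface; the kernel, the candidate list, and the function names are assumptions.

```python
# Sketch of an install-time auto-tuning layer (generic illustration,
# not FIBER's API): each candidate parameter value is benchmarked
# once on the target machine, and the best-performing value is kept
# for later runs.

import time

def tune(kernel, candidates, *args):
    """Run kernel(param, *args) for each candidate parameter and
    return the parameter with the lowest measured wall-clock time."""
    best_param, best_time = None, float("inf")
    for param in candidates:
        start = time.perf_counter()
        kernel(param, *args)
        elapsed = time.perf_counter() - start
        if elapsed < best_time:
            best_param, best_time = param, elapsed
    return best_param

def chunked_sum(unroll, data):
    """Toy kernel standing in for an unrolled loop: sum the data in
    chunks of `unroll` elements."""
    total = 0.0
    for i in range(0, len(data), unroll):
        total += sum(data[i:i + unroll])
    return total
```

In a real library the chosen parameter would be written to a configuration file at install time, so the search cost is paid once per machine rather than per run.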


Parallel Computing | 2006

ABCLib_DRSSED: A parallel eigensolver with an auto-tuning facility

Takahiro Katagiri; Kenji Kise; Hiroki Honda; Toshitsugu Yuba

Conventional auto-tuning numerical software has two drawbacks: (1) fixed sampling points for performance estimation; and (2) inadequate adaptation to heterogeneous environments. To address these drawbacks, we developed ABCLib_DRSSED, a parallel eigensolver with an auto-tuning facility. ABCLib_DRSSED has (1) functions based on sampling points that can be specified through an end-user interface; (2) a load balancer for the data to be distributed; and (3) a new auto-tuning optimization timing called Before Execute-time Optimization (BEO). In our performance evaluation of the BEO, we obtained speedup factors of 10% to 90%, and of 340% in the case of a failed estimation. In the evaluation of the load balancer, performance improved by 220%.


Parallel Computing | 2006

ABCLibScript: a directive to support specification of an auto-tuning facility for numerical software

Takahiro Katagiri; Kenji Kise; Hiroki Honda; Toshitsugu Yuba

We describe the design and implementation of ABCLibScript, a directive that supports the addition of an auto-tuning facility. ABCLibScript limits the scope of auto-tuning to numerical computations, where block length adjustment for blocked algorithms, loop unrolling depth adjustment, and algorithm selection are crucial functions. To provide these three functions, we define three kinds of instruction operators: variable, unroll, and select, respectively. Our performance evaluation showed that a non-expert user obtained a maximum speedup of 4.3 times by applying ABCLibScript to a program, compared to the same program without it.
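The select operator's job, choosing the fastest of several interchangeable implementations at tuning time, can be illustrated as below. This is a hypothetical runtime sketch, not ABCLibScript itself, which works as source-code directives processed before compilation; the variant functions here are stand-ins.

```python
# Generic illustration of "select"-style algorithm selection
# (hypothetical API, not ABCLibScript's directive syntax): several
# interchangeable implementations are registered, benchmarked once on
# a sample input, and the fastest is bound for subsequent calls.

import time

def autoselect(variants, sample):
    """Benchmark each variant on the sample input and return the
    fastest one."""
    timed = []
    for f in variants:
        start = time.perf_counter()
        f(sample)
        timed.append((time.perf_counter() - start, f))
    return min(timed, key=lambda t: t[0])[1]

def sort_builtin(xs):
    """Variant 1: the library sort."""
    return sorted(xs)

def sort_insertion(xs):
    """Variant 2: a simple insertion sort."""
    out = []
    for x in xs:
        i = len(out)
        while i > 0 and out[i - 1] > x:
            i -= 1
        out.insert(i, x)
    return out
```

Because all variants compute the same result, the selection is purely a performance decision, which is what makes it safe to automate.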


Computing Frontiers | 2004

Effect of auto-tuning with user's knowledge for numerical software

Takahiro Katagiri; Kenji Kise; Hiroki Honda; Toshitsugu Yuba

This paper evaluates the effect of an auto-tuning facility that incorporates the user's knowledge of numerical software. We proposed a new software architecture framework, named FIBER, to generalize auto-tuning facilities and obtain highly accurate estimated parameters. The FIBER framework also provides a loop-unrolling function and an algorithm selection function, with code generation and parameter registration processes, to support code development by library developers. FIBER offers three kinds of parameter optimization layers: install-time, before execute-time, and run-time. The user's knowledge is needed in the before execute-time optimization layer. In this paper, eigensolver parameters to which the FIBER framework is applied are described and evaluated on three kinds of parallel computers: the HITACHI SR8000/MPP, Fujitsu VPP800/63, and a Pentium 4 PC cluster. Our evaluation of the before execute-time layer indicated a maximum speed increase of 3.4 times for eigensolver parameters, and a maximum increase of 17.1 times for the algorithm selection of orthogonalization in the computation kernel of the eigensolver.


Parallel and Distributed Computing: Applications and Technologies | 2009

A Study of an Infrastructure for Research and Development of Many-Core Processors

Koh Uehara; Shimpei Sato; Takefumi Miyoshi; Kenji Kise

Many-core processors with thousands of cores on a chip will soon be realized. We developed an infrastructure that accelerates the research and development of such many-core processors. This paper describes the three main elements provided by our infrastructure. The first element is the definition of a simple many-core processor architecture called M-Core. The second is SimMc, a software simulator of M-Core. The third is the software library MClib, which helps developers write application programs for M-Core. The simulation speed of SimMc and the parallelization efficiency of M-Core are evaluated using several benchmark programs. We show that our infrastructure accelerates the research and development of many-core processors.


Innovative Architecture for Future Generation High-Performance Processors and Systems (IWIA'05) | 2005

The bimode++ branch predictor

Kenji Kise; Takahiro Katagiri; Hiroki Honda; Toshitsugu Yuba

Modern wide-issue superscalar processors tend to adopt deeper pipelines in order to attain high clock rates. This trend increases the number of in-flight instructions in the processor, so a mispredicted branch can result in a substantial amount of wasted work. To mitigate this wasted work, accurate branch prediction is required for high-performance processors. To improve prediction accuracy, we propose the bimode++ branch predictor, an enhanced version of the bimode branch predictor. Some branch instructions produce the same result every time throughout the execution of a program, from start to end; we define these as extremely biased branches. The bimode++ branch predictor is unique in predicting the outcome of an extremely biased branch with a simple hardware structure. In addition, the bimode++ branch predictor improves accuracy using refined indexing and a fusion function. Our experimental results with benchmarks from the SpecFP, SpecINT, multimedia, and server areas show that the bimode++ branch predictor reduces the misprediction rate by 13.2% relative to the bimode predictor and by 32.5% relative to gshare.
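The extremely biased branch idea can be modeled in a few lines: branches whose outcome has never varied are predicted statically, and all others fall back to a conventional per-branch 2-bit saturating counter. This is a simplified software model for intuition, not the bimode++ hardware, which also uses refined indexing and a fusion function.

```python
# Simplified model of the "extremely biased branch" idea (not the
# bimode++ hardware): a branch whose observed outcome has never
# changed is predicted statically; other branches use a per-branch
# 2-bit saturating counter, as in a bimodal predictor.

class SimplePredictor:
    def __init__(self):
        self.history = {}   # pc -> set of outcomes seen so far
        self.counter = {}   # pc -> 2-bit saturating counter (0..3)

    def predict(self, pc):
        seen = self.history.get(pc)
        if seen is not None and len(seen) == 1:
            return next(iter(seen))          # extremely biased so far
        return self.counter.get(pc, 2) >= 2  # weakly-taken default

    def update(self, pc, taken):
        self.history.setdefault(pc, set()).add(taken)
        c = self.counter.get(pc, 2)
        self.counter[pc] = min(3, c + 1) if taken else max(0, c - 1)
```

Real hardware cannot keep an unbounded outcome set per branch, so an implementation would detect bias with a small table of sticky bits; the model above only captures the prediction policy.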


Asia Pacific Conference on Circuits and Systems | 2014

An NoC-based evaluation platform for safety-critical automotive applications

Tomohiro Yoneda; Masashi Imai; Hiroshi Saito; Takahiro Hanyu; Kenji Kise; Yuichi Nakamura

We have been developing an NoC (Network-on-Chip) based platform for a centralized ECU (Electronic Control Unit), in which a many-core system functions as a set of several conventional automotive ECUs. The outcome of this research project is an evaluation platform that includes a hardware board, a dependable task execution scheme, a support tool for Simulink programs, and a hardware-in-the-loop simulation capability using a built-in plant model. This paper presents an overview of our evaluation platform and shows preliminary evaluation results, obtained using the platform, for an integrated attitude control system of a four-wheel-drive electric vehicle.


Field Programmable Logic and Applications | 2015

Ultra-fast NoC emulation on a single FPGA

Thiem Van Chu; Shimpei Sato; Kenji Kise

Network-on-Chip (NoC) has become the de facto on-chip communication architecture for many-core systems. This paper proposes novel methods for emulating large-scale NoC designs on a single FPGA. Since FPGAs offer a highly parallel platform, FPGA-based emulation can be much faster than the software-based approach. However, emulating NoC designs with up to thousands of nodes is challenging due to FPGA capacity constraints. We first describe how to accurately model synthetic workloads on an FPGA by separating the time of the emulated network from the times of the traffic generation units. We next present a novel use of time-multiplexing to emulate the entire network using several physical nodes. Finally, we show the basic steps for applying the proposed methods to different NoC architectures. The proposed methods enable ultra-fast emulation of large-scale NoC designs with up to thousands of nodes using only the on-chip resources of a single FPGA. In particular, a simulation speedup of more than 5,000× over BookSim, a widely used software-based NoC simulator, is achieved.
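The time-multiplexing step can be pictured as follows: with N logical nodes and only P physical nodes, one emulated network cycle takes ceil(N / P) physical passes, each pass stepping P logical node states. This is a conceptual software model of the scheduling, not the paper's FPGA design; the state representation and step function are assumptions.

```python
# Conceptual model of time-multiplexed NoC emulation (not the paper's
# FPGA design): N logical nodes are emulated by P physical nodes, so
# each emulated cycle costs ceil(N / P) physical passes in which every
# physical node advances one logical node's state.

def emulate_cycle(logical_states, num_physical, step):
    """Advance every logical node state by one emulated cycle using
    num_physical physical nodes; return the number of passes used."""
    n = len(logical_states)
    passes = 0
    for base in range(0, n, num_physical):
        # One physical pass: each physical node steps one logical node.
        for i in range(base, min(base + num_physical, n)):
            logical_states[i] = step(logical_states[i])
        passes += 1
    return passes
```

The emulated-cycle count stays exact regardless of P; only the wall-clock cost per emulated cycle grows as the ratio N / P increases, which is the trade-off that lets thousands of nodes fit on one FPGA.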

Collaboration


Dive into Kenji Kise's collaborations.

Top Co-Authors

Toshitsugu Yuba, University of Electro-Communications
Hiroki Honda, University of Electro-Communications
Takahiro Katagiri, University of Electro-Communications
Shimpei Sato, Tokyo Institute of Technology
Thiem Van Chu, Tokyo Institute of Technology
Naoki Fujieda, Tokyo Institute of Technology
Takefumi Miyoshi, Tokyo Institute of Technology