Publication


Featured research published by Tomoaki Tsumura.


International Conference on Networking and Computing | 2012

Dynamic Processing Slots Scheduling for I/O Intensive Jobs of Hadoop MapReduce

Shiori Kurazumi; Tomoaki Tsumura; Shoichi Saito; Hiroshi Matsuo

Hadoop, consisting of Hadoop MapReduce and the Hadoop Distributed File System (HDFS), is a platform for large-scale data storage and processing. Distributed processing has become common as the amount of data worldwide increases rapidly and the scale of processing grows, so Hadoop has attracted many cloud computing enterprises and technology enthusiasts, and its user base keeps expanding. Our goal is to speed up the execution of Hadoop jobs. In this paper, we propose dynamic processing slot scheduling for I/O-intensive jobs of Hadoop MapReduce, focusing on I/O wait during job execution. When CPU resources with a high rate of I/O wait are detected on an active TaskTracker node, free slots are added and more tasks are assigned to them, which improves CPU utilization. We implemented our method on Hadoop 1.0.3, achieving an improvement of up to about 23% in execution time.
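
The scheduling idea can be illustrated with a small sketch. Everything below is invented for illustration (the threshold, slot limits, and function name are not from the paper or from Hadoop's code); it only shows the feedback rule: grow the slot count while I/O wait is high, shrink it when the CPU becomes busy again.

```python
# Invented sketch of the feedback rule (not Hadoop code): each
# TaskTracker watches its CPU I/O-wait ratio and temporarily opens
# extra map slots while tasks are stalled on I/O.

IOWAIT_THRESHOLD = 0.30   # hypothetical cutoff for "I/O intensive"
BASE_SLOTS = 2            # statically configured slots per node
MAX_EXTRA_SLOTS = 2       # cap on dynamically added slots

def next_slot_count(iowait_ratio, current_slots):
    """Slot count for the next scheduling interval on one node."""
    if iowait_ratio > IOWAIT_THRESHOLD and current_slots < BASE_SLOTS + MAX_EXTRA_SLOTS:
        return current_slots + 1   # CPU idles on I/O: overlap more tasks
    if iowait_ratio <= IOWAIT_THRESHOLD and current_slots > BASE_SLOTS:
        return current_slots - 1   # CPU busy again: shrink back
    return current_slots

print(next_slot_count(0.45, 2))   # node at 45% I/O wait grows to 3 slots
```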


Annual Simulation Symposium | 2006

Design and implementation of a workload specific simulator

Takashi Nakada; Tomoaki Tsumura; Hiroshi Nakashima

This paper proposes a simple but efficient technique for instruction set simulators. Our simulator is made workload specific by a simple process that generates a set of C functions from a workload binary. It is as portable and retargetable as ordinary instruction emulators because the translation targets C code and works well with well-abstracted instruction definitions. The translation is also easy to implement, requiring neither complicated analysis nor profiling. We also propose a set of simple optimization techniques for cache simulation that cooperate with the workload-specific technique. A SimpleScalar-based implementation of these techniques yields a significantly large performance improvement. Our evaluations with SPEC CPU95 show that the maximum speedups over sim-fast, sim-cache and sim-outorder are 38-fold, 14-fold and 9.7-fold respectively, while the averages are 19-fold, 8.3-fold and 3.8-fold.
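
A toy sketch of the workload-specific idea, assuming a made-up two-instruction ISA with invented templates: each static instruction of the workload binary is translated once into its own C function, so the simulator no longer fetches and decodes at run time.

```python
# Toy translator for an imaginary two-instruction ISA; templates and
# encoding are invented. Each static instruction becomes one C function.

TEMPLATES = {
    "add":  "regs[{rd}] = regs[{rs1}] + regs[{rs2}];",
    "addi": "regs[{rd}] = regs[{rs1}] + {imm};",
}

def emit_c(pc, op, **fields):
    """Emit a dedicated C function for one decoded instruction."""
    body = TEMPLATES[op].format(**fields)
    return (f"static void insn_{pc:x}(int64_t *regs) {{\n"
            f"    {body}  /* {op} at pc=0x{pc:x} */\n"
            f"}}\n")

# Two decoded instructions of an imaginary workload binary:
print(emit_c(0x400, "addi", rd=1, rs1=0, imm=10))
print(emit_c(0x404, "add",  rd=2, rs1=1, rs2=1))
```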


International Conference on Networking and Computing | 2012

A Speed-up Technique for an Auto-Memoization Processor by Reusing Partial Results of Instruction Regions

Kazutaka Kamimura; Ryosuke Oda; Tatsuhiro Yamada; Tomoaki Tsumura; Hiroshi Matsuo; Yasuhiko Nakashima

We have proposed an auto-memoization processor based on computation reuse. The auto-memoization processor dynamically detects functions and loop iterations as reusable blocks and memoizes them automatically. In the previous model, computation reuse cannot be applied if the current input sequence differs from the past input sequences by even a single value, since the processing results will differ. This paper proposes a new partial reuse model, which can apply computation reuse to the early part of a reusable block as long as the early part of the current input sequence matches one of the past sequences. In addition, to gain sufficient benefit from the partial reuse model, we also propose a technique that reduces the search overhead of the memoization table by partitioning it. The experiment with the SPEC CPU95 benchmark suite shows that the new method improves the maximum speedup from 40.6% to 55.1%, and the average speedup from 10.6% to 22.8%.
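
A minimal software model of partial reuse, with invented data structures (the real mechanism lives in hardware tables): each past entry keeps a checkpoint of the partial result after every input, and a lookup returns the longest matching prefix so execution can resume just past it.

```python
# Invented software model: each past entry pairs its input sequence
# with a checkpoint of the partial result after every input consumed.

past_entries = [
    ([3, 7, 5], ["r1", "r2", "r3"]),   # inputs, partial results
]

def lookup(current_inputs):
    """Return (matched prefix length, partial result to resume from)."""
    best_len, best_result = 0, None
    for inputs, results in past_entries:
        n = 0
        while (n < min(len(inputs), len(current_inputs))
               and inputs[n] == current_inputs[n]):
            n += 1
        if n > best_len:
            best_len, best_result = n, results[n - 1]
    return best_len, best_result

# The third input differs, but the first two still hit:
print(lookup([3, 7, 9]))   # -> (2, 'r2'): resume just after input 2
```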


International Conference on Networking and Computing | 2010

A Speed-Up Technique for an Auto-Memoization Processor by Collectively Reusing Continuous Iterations

Tomoki Ikegaya; Tomoaki Tsumura; Hiroshi Matsuo; Yasuhiko Nakashima

We have proposed an auto-memoization processor based on computation reuse, and merged it with speculative multithreading based on value prediction into a parallel early computation mechanism. In the previous model, the parallel early computation detects each iteration of a loop as a reusable block. This paper proposes a new parallel early computation model, which automatically and dynamically integrates multiple continuous iterations into one reusable block without modifying executable binaries. We also propose a model for automatically detecting how many iterations should be integrated into one reusable block. Our model reduces the overhead of computation reuse and further exploits the reuse tables. The experiment with the SPEC CPU95 FP benchmark suite shows that the new model improves the maximum speedup from 40.5% to 57.6%, and the average speedup from 15.0% to 26.0%.
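
A rough sketch of the adaptation idea, with an invented cost model and invented numbers: treat U consecutive iterations as one reusable unit, growing U while a single reuse test keeps paying for many skipped iterations and shrinking it when mismatches dominate. This is only a plausible rule in the spirit of the paper, not its actual model.

```python
# Invented cost model: treat U consecutive iterations as one reusable
# unit, so a single reuse test can skip U iterations at once.

def choose_unit(hit_rate, test_overhead, work_per_iter, u):
    """Grow the unit while collective reuse pays off, else shrink."""
    benefit = hit_rate * work_per_iter * u - test_overhead
    if benefit > 0:
        return min(u * 2, 64)   # amortize the test over more iterations
    return max(u // 2, 1)       # mismatches dominate: fall back

u = 1
for hit_rate in (0.9, 0.9, 0.1):   # observed hit rate per interval
    u = choose_unit(hit_rate, test_overhead=20, work_per_iter=30, u=u)
    print(u)                        # -> 2, 4, 2
```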


Parallel and Distributed Computing: Applications and Technologies | 2009

A Speculative Technique for Auto-Memoization Processor with Multithreading

Yushi Kamiya; Tomoaki Tsumura; Hiroshi Matsuo; Yasuhiko Nakashima

We have proposed an auto-memoization processor. This processor automatically and dynamically memoizes both functions and loop iterations, and skips their execution by reusing their results. Meanwhile, multi/many-core processors have come into wide use, and the number of cores is expected to grow to a hundred or more. However, many programs do not have that much inherent parallelism, so it becomes very important to consider how to utilize many cores effectively. This paper describes a speedup technique for the auto-memoization processor using speculative multithreading. Two speculative threads are forked on each reuse test. One assumes that the reuse test will succeed and speculatively executes the code following the reuse target block; the other assumes that the test will fail and executes the reuse target block itself. These two threads conceal the overhead of the auto-memoization processor. The experiment with the SPEC CPU95 benchmark suite shows that the proposed method improves the maximum speedup from 13.9% to 36.0%.
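
The two-thread scheme can be mimicked with ordinary software threads; the sketch below is purely illustrative (the real mechanism forks hardware threads, and the function names and toy block body are invented). One future recomputes the block assuming the reuse test fails, while the main path runs ahead with the memoized outputs when the test succeeds.

```python
# Illustrative sketch with software threads; the real mechanism forks
# hardware threads, and all names and the toy block body are invented.
from concurrent.futures import ThreadPoolExecutor

def reuse_test(inputs, table):
    return table.get(tuple(inputs))       # memoized outputs, or None

def run_block(inputs):
    return sum(inputs)                    # stand-in for the reusable block

def continuation(outputs):
    return outputs * 2                    # stand-in for the code after it

def execute(inputs, table):
    with ThreadPoolExecutor(max_workers=2) as pool:
        # One thread assumes the reuse test will fail and re-executes
        # the block itself, hiding the latency of the test.
        recompute = pool.submit(run_block, inputs)
        cached = reuse_test(inputs, table)
        if cached is not None:
            # Test succeeded: run ahead with the memoized outputs and
            # squash the recomputation (best effort with a thread pool).
            recompute.cancel()
            return continuation(cached)
        return continuation(recompute.result())

print(execute([1, 2, 3], {(1, 2, 3): 6}))   # reuse hit  -> 12
print(execute([4, 5, 6], {}))               # reuse miss -> 30
```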


International Symposium on System-on-Chip | 2014

An implementation of Auto-Memoization mechanism on ARM-based superscalar processor

Yuuki Shibata; Takanori Tsumura; Tomoaki Tsumura; Yasuhiko Nakashima

We have proposed a processor called the Auto-Memoization Processor, which is based on computation reuse. Until now, we have implemented the auto-memoization mechanism on a single-issue, non-pipelined SPARC processor and studied that design. The processor dynamically detects functions and loop iterations as reusable blocks and memoizes them automatically, and it can apply computation reuse to these blocks with little reuse overhead. However, good evaluation results on such a simple architecture do not guarantee practicality, because superscalar architectures, rather than such simple ones, are now widely used in general-purpose processors for PCs, embedded processors, and many others. Hence, we examine the problems that arise when implementing the auto-memoization mechanism on an ARM-based superscalar processor, and design an ARM-based Auto-Memoization Processor. For example, one such problem is that the reuse overhead causes pipeline stalls. To solve this problem, we implement a mechanism that overlaps the reuse overhead with the pipeline execution of the processor. The evaluation with the SPEC CPU95 benchmark suite shows that the ARM-based Auto-Memoization Processor achieves speedups comparable to the previous SPARC-based Auto-Memoization Processor. In this paper, we describe the implementation and the evaluation results of the ARM-based Auto-Memoization Processor.
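
A back-of-the-envelope model shows why the overlap matters; all cycle counts and the hit rate below are invented. If the reuse test is serialized, every block pays for it in full; if it is overlapped with pipeline execution, only the excess beyond the remaining work shows up.

```python
# All numbers invented. With a 50% hit rate, a block that costs 100
# cycles to execute and a reuse test that costs 30 cycles:
block_cycles = 100
test_cycles = 30
hit_rate = 0.5

serialized = test_cycles + (1 - hit_rate) * block_cycles      # test stalls pipeline
overlapped = max(test_cycles, (1 - hit_rate) * block_cycles)  # test hidden
print(serialized, overlapped)   # 80.0 vs 50.0 cycles per block on average
```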


International Conference on Networking and Computing | 2012

An Automatic Host and Device Memory Allocation Method for OpenMPC

Hiroaki Uchiyama; Tomoaki Tsumura; Hiroshi Matsuo

The CUDA programming model provides a better abstraction for GPU programming. However, it is still hard to write programs with CUDA because both specific techniques and knowledge of the GPU architecture are required. Hence, many programming frameworks for CUDA have been developed; OpenMPC, based on OpenMP, is one of them. OpenMPC is an easy-to-write framework for programmers familiar with traditional OpenMP, but it still requires programmers to use special directives to utilize the fast device memories. To solve this problem, this paper proposes a method for allocating appropriate device memories automatically. This paper also proposes a method for automatically allocating page-locked memory for the data transferred between host and device. The evaluation results with several programs show that the proposed methods can reduce execution time by up to 52%.
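
The kind of placement decision the proposed analysis automates might look like the sketch below; the rules, limit, and function names are invented for illustration, and the real method works inside the OpenMPC compiler rather than at run time.

```python
# Invented placement rules in the spirit of the proposal; the real
# analysis runs inside the OpenMPC compiler.

CONSTANT_MEM_LIMIT = 64 * 1024   # bytes; typical CUDA constant memory size

def place_on_device(size, read_only, reused_within_block):
    if read_only and size <= CONSTANT_MEM_LIMIT:
        return "constant memory"    # cached and broadcast to all threads
    if reused_within_block:
        return "shared memory"      # staged per thread block
    return "global memory"

def host_allocation(transferred):
    # Page-locked (pinned) host memory speeds up host-device DMA.
    return "pinned (page-locked)" if transferred else "pageable"

print(place_on_device(4096, read_only=True, reused_within_block=False))
print(host_allocation(True))   # data crossing the PCIe bus gets pinned
```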


International Conference on Networking and Computing | 2011

Input Entry Integration for an Auto-Memoization Processor

Ryosuke Oda; Tatsuhiro Yamada; Tomoki Ikegaya; Tomoaki Tsumura; Hiroshi Matsuo; Yasuhiko Nakashima

We have proposed an auto-memoization processor based on computation reuse. The table for registering inputs and outputs is implemented with a ternary CAM, and the input sequences are stored in the table, folded into tree forms. This paper proposes a new registration model that merges multiple input entries into a single entry. The new model can store input values efficiently and can reduce the search cost. The experiment with the SPEC CPU95 benchmark suite shows that the new model improves the maximum speedup ratio from 40.5% to 50.0%, and the average speedup ratio from 10.5% to 16.4%.
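
A toy model of entry integration, assuming a simplified ternary-match semantics where None stands for a don't-care position: two entries that differ in one position (and, in the real design, lead to the same outputs) can be widened into a single entry that matches both.

```python
# Toy ternary-match model: None marks a don't-care position, so one
# stored entry can cover several concrete input sequences.

def matches(entry, inputs):
    return all(e is None or e == v for e, v in zip(entry, inputs))

def merge(a, b):
    """Widen the positions where two entries differ into don't-cares."""
    return tuple(x if x == y else None for x, y in zip(a, b))

e1, e2 = (3, 7, 5), (3, 9, 5)
merged = merge(e1, e2)              # -> (3, None, 5): one entry, not two
print(matches(merged, (3, 8, 5)))   # True: a single lookup covers it
```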


International Symposium on Computing and Networking | 2015

An Approximate Computing Stack Based on Computation Reuse

Yuuki Sato; Takanori Tsumura; Tomoaki Tsumura; Yasuhiko Nakashima

Approximate computing has been studied widely in computing systems ranging from hardware to software. Approximate computing is a paradigm that reduces execution time and power consumption by tolerating some quality loss in the computed results. Meanwhile, we have proposed a processor called the auto-memoization processor, which is based on computation reuse. The processor dynamically detects functions as reusable blocks and automatically stores their inputs and outputs in a lookup table. Then, when the processor detects the same block, it compares the current input sequence with the past input sequences stored in the table. If the current input sequence matches one of them, the processor writes back the associated outputs and skips the execution of the function. Here, by tolerating partial input mismatches in computation reuse, approximate computing can be achieved. In this paper, we propose an approximate computing stack based on computation reuse. The stack includes a programming framework that allows programmers to easily apply approximate computing to various applications, a compiler, and the modified auto-memoization processor. In an evaluation with cjpeg from MediaBench, tolerating partial input mismatches in computation reuse reduces execution cycles by up to 22.3% and improves the reuse rate by up to 29.5%, with negligible quality deterioration in the outputs.
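
A minimal sketch of reuse with tolerated mismatch, assuming a per-element absolute-difference test; the threshold, distance metric, and data are invented (in the proposed stack, the programmer annotates which inputs may be approximated).

```python
# Invented tolerance test: reuse past outputs when every input element
# is within TOLERANCE of a stored entry.

TOLERANCE = 2

def approx_lookup(inputs, table):
    for past_inputs, outputs in table:
        if all(abs(a - b) <= TOLERANCE for a, b in zip(inputs, past_inputs)):
            return outputs          # close enough: skip the execution
    return None                     # no match: execute normally

table = [((120, 64, 200), "block_A_outputs")]
print(approx_lookup((121, 63, 199), table))  # hit despite small mismatch
print(approx_lookup((10, 10, 10), table))    # miss -> None
```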


International Symposium on Computing and Networking | 2014

Hinting for Auto-Memoization Processor Based on Static Binary Analysis

Takanori Tsumura; Yuuki Shibata; Kazutaka Kamimura; Tomoaki Tsumura; Yasuhiko Nakashima

We have proposed a processor called the Auto-Memoization Processor, which is based on computation reuse, and merged it with speculative multithreading based on value prediction into a mechanism called Parallel Speculative Execution. The processor dynamically detects functions and loop iterations as reusable blocks, and automatically registers their inputs and outputs in a table called the Reuse Table. Then, when the processor detects the same block, it compares the current input sequence with the previous input sequences registered in the Reuse Table in order to apply computation reuse to the block. In this paper, we propose a hinting technique for the Auto-Memoization Processor based on static binary analysis. The hints indicate two distinctive types of input for loop bodies. One type is unchanging values: when applying computation reuse to a loop, the processor can skip comparing such unchanging inputs with the values in the Reuse Table. The other type is non-monotonically changing values: loops with such inputs will not benefit from computation reuse, so the processor can stop applying useless computation reuse to those loop iterations. By hinting these input types to the processor, the overhead of the Auto-Memoization Processor can be reduced. The experiment with the SPEC CPU95 benchmark suite shows that the hinting technique improves the maximum speedup from 40.6% to 51.8%, and the average speedup from 11.9% to 16.5%.
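
The two hint types can be illustrated by classifying a trace of the values an input takes across iterations; the function below is an invented dynamic stand-in, whereas the paper derives the same classification statically from the binary.

```python
# Invented dynamic stand-in: classify how an input to a loop body
# changes across iterations, mirroring the two hint types.

def classify(trace):
    if all(v == trace[0] for v in trace):
        return "unchanging"       # hint: skip comparing this input
    diffs = [b - a for a, b in zip(trace, trace[1:])]
    if all(d > 0 for d in diffs) or all(d < 0 for d in diffs):
        return "monotonic"        # changes, but predictably
    return "non-monotonic"        # hint: reuse is unlikely to pay off

print(classify([8, 8, 8, 8]))   # unchanging
print(classify([0, 1, 2, 3]))   # monotonic
print(classify([5, 2, 9, 1]))   # non-monotonic
```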

Collaboration


Dive into Tomoaki Tsumura's collaborations.

Top Co-Authors

Hiroshi Matsuo, Nagoya Institute of Technology
Yasuhiko Nakashima, Nara Institute of Science and Technology
Katsuhiko Kondo, Nagoya Institute of Technology
Kazutaka Kamimura, Nagoya Institute of Technology
Ryohei Yamada, Nagoya Institute of Technology
Ryosuke Oda, Nagoya Institute of Technology
Takafumi Inaba, Nagoya Institute of Technology