Publication


Featured research published by Takatsugu Ono.


International SoC Design Conference | 2009

Adaptive cache-line size management on 3D integrated microprocessors

Takatsugu Ono; Koji Inoue; Kazuaki Murakami

Memory bandwidth can be dramatically improved by stacking the main memory (DRAM) on processor cores and connecting them with wide on-chip buses composed of through-silicon vias (TSVs). 3D stacking reduces the cache miss penalty because a large amount of data can be transferred from main memory to the cache at once. If a large cache line size is employed, we can expect a prefetching effect; however, it can worsen system performance when programs lack sufficient spatial locality of memory references. To solve this problem, we introduce a software-controllable variable line-size cache scheme. In this paper, we apply it to an L1 data cache with a 3D-stacked DRAM organization. Our evaluation shows that this approach reduces L1 data cache and stacked DRAM energy consumption by up to 75% compared to a conventional cache.
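The line-size decision can be sketched as a simple policy: measure how much of each fetched line is actually used, then widen or narrow the line accordingly. A minimal illustration (thresholds and sizes are hypothetical, not the paper's parameters):

```python
def choose_line_size(words_used, words_fetched, sizes=(32, 64, 128, 256)):
    """Pick a wider cache line when fetched lines are mostly consumed
    (high spatial locality), a narrower one when most words go unused."""
    locality = words_used / words_fetched if words_fetched else 0.0
    if locality > 0.75:
        return sizes[3]   # strong spatial locality: prefetch aggressively
    if locality > 0.50:
        return sizes[2]
    if locality > 0.25:
        return sizes[1]
    return sizes[0]       # weak locality: avoid wasted wide-bus transfers

choose_line_size(90, 100)  # -> 256
choose_line_size(10, 100)  # -> 32
```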


International Conference on Big Data | 2014

FlexDAS: A flexible direct attached storage for I/O intensive applications

Takatsugu Ono; Yotaro Konishi; Teruo Tanimoto; Noboru Iwamatsu; Takashi Miyoshi; Jun Tanaka

Big data analysis and data-storing applications require huge volumes of storage and high I/O performance. Applications can achieve high performance and cost efficiency by exploiting the high I/O performance of direct-attached storage (DAS) such as internal HDDs. With the size of stored data ever increasing, however, replacing servers becomes difficult because the internal HDDs contain huge amounts of data. In response, we propose FlexDAS, which improves the flexibility of direct-attached storage by using a disk area network (DAN) without degrading I/O performance. We developed a prototype FlexDAS switch and quantitatively evaluated the architecture. Results show that the FlexDAS switch can disconnect and connect an HDD to a server in just 1.16 seconds, while the I/O performance of disks connected via the FlexDAS switch is almost the same as with a conventional DAS architecture.
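The key operation a DAN adds over plain DAS is re-homing a disk without moving any data. A toy model of the switch's connect/disconnect bookkeeping (class and method names are illustrative, not the actual FlexDAS interface):

```python
class FlexDASSwitch:
    """Toy model of a disk area network (DAN) switch: each HDD is owned by
    at most one server, and ownership moves without copying any data."""
    def __init__(self):
        self.owner = {}  # disk id -> server id

    def connect(self, disk, server):
        if disk in self.owner:
            raise RuntimeError(f"{disk} already attached to {self.owner[disk]}")
        self.owner[disk] = server

    def disconnect(self, disk):
        return self.owner.pop(disk)

# Re-home a disk full of data from one server to another: only the
# switch's ownership table changes, not the data on the disk.
switch = FlexDASSwitch()
switch.connect("hdd0", "server-a")
switch.disconnect("hdd0")
switch.connect("hdd0", "server-b")
```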


International Symposium on Computing and Networking | 2017

CPCI Stack: Metric for Accurate Bottleneck Analysis on OoO Microprocessors

Teruo Tanimoto; Takatsugu Ono; Koji Inoue

Correctly understanding microarchitectural bottlenecks is important for optimizing the performance and energy of OoO (Out-of-Order) processors. Although the CPI (Cycles Per Instruction) stack has been used for this purpose, it stacks architectural events heuristically by counting how many times each event occurs, and the order of stacking affects the result, which can be misleading. This is because the CPI stack does not consider the execution path of dynamic instructions. Critical path analysis (CPA) is a well-known method for identifying the critical execution path of dynamic instruction execution on OoO processors; the critical path is the sequence of events that determines the execution time of a program on a given processor. We develop the CPCI stack (Cycles Per Critical Instruction stack), a novel representation of the CPI stack based on CPA. The main challenge in constructing the CPCI stack is analyzing the enormous number of paths, since CPA often yields numerous critical paths. In this paper, we show that there are more than 10^10 critical paths in the execution of only one thousand instructions in 35 of the 48 benchmarks from SPEC CPU2006. We then propose a statistical method to analyze all the critical paths and present a case study using these benchmarks.
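The path explosion reported above is easy to reproduce on a toy dependence graph: every "diamond" of two equal-latency alternatives doubles the number of critical paths. A sketch of the longest-path length/count computation, assuming nodes are already topologically numbered (an illustration of the phenomenon, not the paper's algorithm):

```python
from collections import defaultdict

def critical_paths(n, edges):
    """Return (length, count) of longest paths from node 0 to node n-1 in a
    weighted DAG whose nodes are numbered in topological order."""
    adj = defaultdict(list)
    for u, v, w in edges:
        adj[u].append((v, w))
    dist = [float("-inf")] * n
    count = [0] * n
    dist[0], count[0] = 0, 1
    for u in range(n):
        if count[u] == 0:
            continue  # unreachable from the start node
        for v, w in adj[u]:
            if dist[u] + w > dist[v]:
                dist[v], count[v] = dist[u] + w, count[u]
            elif dist[u] + w == dist[v]:
                count[v] += count[u]  # another equally long path arrives
    return dist[n - 1], count[n - 1]

# A chain of k diamonds (two equal-latency paths each) has 2**k critical paths.
k = 10
edges = []
for i in range(k):
    s = 3 * i
    edges += [(s, s + 1, 1), (s, s + 2, 1), (s + 1, s + 3, 1), (s + 2, s + 3, 1)]
length, count = critical_paths(3 * k + 1, edges)  # -> (20, 1024)
```

With only ten diamonds there are already 1,024 equally critical paths, which is why enumerating them outright does not scale.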


Journal of Information Processing | 2017

Dependence graph model for accurate critical path analysis on out-of-order processors

Teruo Tanimoto; Takatsugu Ono; Koji Inoue

The dependence graph model of out-of-order (OoO) instruction execution is a powerful representation used for critical path analysis. However, most, if not all, previous models are either out of date and lack the detail needed to model modern OoO processors, or are too specific and complicated, which limits their generality and applicability. In this paper, we propose an enhanced dependence graph model that remains simple yet greatly improves accuracy over prior models. Evaluation with the gem5 simulator, using configurations similar to Intel's Haswell and Silvermont architectures, shows that the proposed model achieves CPI errors of 2.1% and 4.4%, improvements of 90.3% and 77.1% over the state-of-the-art model.
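In models of this family, execution time is the longest path through a graph whose edges encode instruction dependences. Under an idealized unlimited-resource assumption (far simpler than the paper's enhanced model), the cycle count can be sketched as:

```python
def modelled_cpi(producers, latency):
    """producers[i]: indices of instructions whose results instruction i reads;
    latency[i]: execution latency in cycles. Each instruction starts once all
    of its producers finish; total cycles is the latest finish time."""
    finish = []
    for i in range(len(latency)):
        start = max((finish[p] for p in producers[i]), default=0)
        finish.append(start + latency[i])
    return max(finish) / len(latency)

# Four dependent 1-cycle instructions form a serial chain: CPI = 1.0.
assert modelled_cpi([[], [0], [1], [2]], [1, 1, 1, 1]) == 1.0
# Four independent instructions overlap completely: CPI = 0.25.
assert modelled_cpi([[], [], [], []], [1, 1, 1, 1]) == 0.25
```

Real models additionally constrain the graph with fetch/dispatch width, ROB size, and miss events, which is where the modelling accuracy differences come from.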


International SoC Design Conference | 2016

Single-flux-quantum cache memory architecture

Koki Ishida; Masamitsu Tanaka; Takatsugu Ono; Koji Inoue

Single-flux-quantum (SFQ) logic is a promising technology for realizing microprocessors that operate at over 100 GHz, owing to its ultra-fast, ultra-low-power nature. Although previous work has demonstrated a prototype SFQ microprocessor, the SFQ-based L1 cache memory has not been well optimized: it suffers from a large access latency and strictly limited scalability. This paper proposes a novel SFQ cache architecture that supports fast accesses. A sub-arrayed structure gives the cache better scalability in terms of capacity. Evaluation results show that the proposed cache achieves 1.8x faster access.
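The intuition behind sub-arraying can be seen with a first-order latency model: splitting a memory array into independent sub-arrays shortens the lines each access must traverse, at a fixed decoding cost. The constants below are arbitrary placeholders, not SFQ circuit parameters:

```python
def access_latency(capacity_rows, subarrays, decode=2.0, per_row=0.01):
    """First-order model: fixed decode cost plus a propagation term that
    grows with the number of rows each sub-array spans."""
    rows_per_subarray = capacity_rows / subarrays
    return decode + per_row * rows_per_subarray

mono = access_latency(4096, 1)   # monolithic array: long lines dominate
sub = access_latency(4096, 8)    # 8 sub-arrays: much shorter lines
```

Under this model, doubling capacity in a monolithic array doubles the propagation term, while a sub-arrayed design keeps per-access latency nearly flat by adding sub-arrays instead.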


International Conference on Big Data | 2016

Evaluating the impacts of code-level performance tunings on power efficiency

Satoshi Imamura; Keitaro Oka; Yuichiro Yasui; Yuichi Inadomi; Katsuki Fujisawa; Toshio Endo; Koji Ueno; Keiichiro Fukazawa; Nozomi Hata; Yuta Kakibuka; Koji Inoue; Takatsugu Ono

As the power consumption of HPC systems will be a primary constraint for exascale computing, a main objective in HPC communities is now to maximize power efficiency (i.e., performance per watt) rather than raw performance. Although programmers have spent considerable effort improving performance by tuning HPC programs at the code level, tuning for power efficiency is now also required. In this work, we select two representative HPC programs (Graph500 and SDPARA) and evaluate how traditional code-level performance tunings applied to these programs affect power efficiency. We also investigate the impact of the tunings on power efficiency at various operating frequencies of CPUs and/or GPUs. The results show that the tunings significantly improve power efficiency, and that different types of tunings exhibit different power-efficiency trends as CPU frequency varies. Finally, we explore the scalability and power efficiency of state-of-the-art Graph500 implementations on both a single-node platform and a 960-node supercomputer. With their high scalability, they achieve 27.43 MTEPS/Watt at 129.76 GTEPS on the single-node system and 4.39 MTEPS/Watt at 1,085.24 GTEPS on the supercomputer.
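Since power efficiency is simply performance divided by power, the reported figures pin down the implied average system power. Converting the numbers above:

```python
def implied_power_watts(gteps, mteps_per_watt):
    """Power = performance / efficiency; convert GTEPS to MTEPS first."""
    return gteps * 1000.0 / mteps_per_watt

single_node = implied_power_watts(129.76, 27.43)     # ~4.7 kW
supercomputer = implied_power_watts(1085.24, 4.39)   # ~247 kW
```

The comparison shows the usual scaling trade-off: the 960-node system delivers roughly 8.4x the traversal rate at about 1/6 the MTEPS per watt.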


Proceedings of the First International Workshop on High Performance Graph Data Management and Processing | 2016

Power-efficient breadth-first search with DRAM row buffer locality-aware address mapping

Satoshi Imamura; Yuichiro Yasui; Koji Inoue; Takatsugu Ono; Hiroshi Sasaki; Katsuki Fujisawa

Graph analysis applications are widely used in real services such as road-traffic analysis and social network services. Breadth-first search (BFS) is one of the most representative algorithms for such applications, so many researchers have tuned it to maximize performance. On the other hand, owing to the strict power constraints of modern HPC systems, improving power efficiency (i.e., performance per watt) when executing BFS is also necessary. In this work, we focus on the power efficiency of DRAM and investigate the memory access pattern of a state-of-the-art BFS implementation using a cycle-accurate processor simulator. The results reveal that the conventional address mapping schemes of modern memory controllers do not exploit DRAM row buffers efficiently. We therefore propose a new scheme, per-row channel interleaving, which improves DRAM power efficiency by 30.3% over a conventional scheme for a particular simulator setting. Moreover, we demonstrate that the proposed scheme is effective across various memory controller configurations.
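The difference between the two mapping schemes comes down to which address bits select the channel. A sketch with a hypothetical 4-channel system, 64 B cache lines, and 2 KiB DRAM rows (parameters chosen for illustration, not those evaluated in the paper):

```python
LINE_BITS, ROW_BITS, CH_MASK = 6, 11, 0x3  # 64 B lines, 2 KiB rows, 4 channels

def channel_line_interleaved(addr):
    """Conventional: channel bits sit just above the line offset, so
    consecutive lines scatter across channels and their row buffers."""
    return (addr >> LINE_BITS) & CH_MASK

def channel_per_row_interleaved(addr):
    """Per-row: channel bits sit above the row offset, so a whole DRAM row's
    worth of lines stays on one channel and keeps its row buffer open."""
    return (addr >> ROW_BITS) & CH_MASK

a, b = 0x0000, 0x0040  # two consecutive 64 B cache lines
# conventional mapping sends them to different channels;
# per-row mapping keeps them on the same channel (same open row)
```

Sequential BFS traffic therefore hits an already-open row far more often under per-row interleaving, which is where the DRAM energy saving comes from.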


2016 Fourth International Japan-Egypt Conference on Electronics, Communications and Computers (JEC-ECC) | 2016

Accuracy analysis of machine learning-based performance modeling for microprocessors

Yoshihiro Tanaka; Keitaro Oka; Takatsugu Ono; Koji Inoue

This paper analyzes the accuracy of performance models generated by a machine learning-based empirical modeling methodology. Although the accuracy strongly depends on the quality of the learning procedure, it is not clear what kind of learning algorithm and training data set (or features) should be used. This paper comprehensively explores the learning space of processor performance modeling as a case study. We focus on static architectural parameters, such as cache size and clock frequency, as the training features. Experimental results show that tree-based non-linear regression modeling is superior to stepwise linear regression modeling. Another observation is that clock frequency is the most important feature for improving prediction accuracy.
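The importance of clock frequency can be illustrated with the basic mechanism a regression tree uses to rank features: choose the single split with the greatest variance reduction. On a synthetic data set where performance scales mostly with frequency (entirely made-up numbers, not the paper's data):

```python
def variance(ys):
    m = sum(ys) / len(ys)
    return sum((y - m) ** 2 for y in ys) / len(ys)

def best_split_gain(xs, ys):
    """Best variance reduction from one threshold split on feature xs,
    as a depth-1 regression tree (stump) would choose it."""
    base, best = variance(ys), 0.0
    for t in sorted(set(xs))[:-1]:
        left = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        weighted = (len(left) * variance(left) + len(right) * variance(right)) / len(ys)
        best = max(best, base - weighted)
    return best

# Synthetic configurations: performance grows linearly with frequency but
# only weakly with cache size.
freqs, caches, perf = [], [], []
for f in (1, 2, 3, 4):                    # GHz
    for i, c in enumerate((1, 2, 4, 8)):  # MiB
        freqs.append(f); caches.append(c)
        perf.append(f * (1.0 + 0.05 * i))

# Splitting on frequency explains far more variance than cache size,
# so a tree learner would pick it first.
assert best_split_gain(freqs, perf) > best_split_gain(caches, perf)
```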


International Conference on Supercomputing | 2014

Hardware-assisted scalable flow control of shared receive queue

Teruo Tanimoto; Takatsugu Ono; Kohta Nakashima; Takashi Miyoshi

The total number of processor cores in supercomputers is increasing, while memory size per core is decreasing owing to the adoption of multi-core processors. The Shared Receive Queue (SRQ) is a technique that effectively reduces the memory usage of buffers, but the absence of flow control results in excess buffer pools. We propose a hardware-assisted flow control that reduces flow-control latency by 95.1%, enabling scalable supercomputers with multi-core processors.
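A standard way to supply the missing flow control is credits: a sender may only post a message while it holds a credit, and the receiver returns one credit per freed buffer, so the shared pool can never overflow. A toy sketch of generic credit-based flow control (the paper's hardware mechanism itself is not detailed here):

```python
class CreditedReceiveQueue:
    """Receiver-side buffer pool guarded by sender credits."""
    def __init__(self, buffers):
        self.credits = buffers  # credits initially equal the buffer count
        self.queue = []

    def try_send(self, msg):
        """Sender side: transmit only while credits remain."""
        if self.credits == 0:
            return False        # back-pressure instead of a buffer overrun
        self.credits -= 1
        self.queue.append(msg)
        return True

    def consume(self):
        """Receiver side: freeing a buffer returns one credit."""
        msg = self.queue.pop(0)
        self.credits += 1
        return msg

srq = CreditedReceiveQueue(buffers=2)
assert srq.try_send("a") and srq.try_send("b")
assert not srq.try_send("c")   # out of credits: sender must wait
srq.consume()
assert srq.try_send("c")       # credit returned, send proceeds
```

The cost of this scheme is the round-trip delay of returning credits, which is exactly the latency a hardware-assisted implementation targets.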


Archive | 2010

In-vehicle display system

Takatsugu Ono; Noriyuki Kamikawa; Tatsuki Kubo

Collaboration


Dive into Takatsugu Ono's collaborations.

Top Co-Authors

Avatar

Koji Inoue

Amirkabir University of Technology

View shared research outputs
Top Co-Authors

Avatar
Top Co-Authors

Avatar
Top Co-Authors

Avatar
Top Co-Authors

Avatar
Top Co-Authors

Avatar
Top Co-Authors

Avatar

Kazuaki Murakami

Association for Computing Machinery

View shared research outputs