Publication


Featured research published by Shinobu Miwa.


international conference on computer design | 2013

Power capping of CPU-GPU heterogeneous systems through coordinating DVFS and task mapping

Toshiya Komoda; Shingo Hayashi; Takashi Nakada; Shinobu Miwa; Hiroshi Nakamura

Future computer systems will be built under much more stringent power budgets due to the limitations of power delivery and cooling systems, so sophisticated power management techniques are required. Power capping, a technique that limits the power consumption of a system to a predetermined level, has been extensively studied in homogeneous systems; however, few studies have addressed power capping of CPU-GPU heterogeneous systems. In this paper, we propose an efficient power capping technique that coordinates DVFS and task mapping in a single computing node equipped with GPUs. In CPU-GPU heterogeneous systems, device frequency settings must be considered together with task mapping between the CPUs and the GPUs, because frequency scaling can incur load imbalance between them. To guide the DVFS and task mapping settings so as to avoid power violations and load imbalance, we develop new empirical models of the performance and the maximum power consumption of a CPU-GPU heterogeneous system. The models enable us to choose near-optimal device frequencies and task mappings in advance of application execution. We evaluate the proposed technique with five data-parallel applications on a machine equipped with a single CPU and a single GPU. The experimental results show that the performance achieved by the proposed power capping technique is comparable to the ideal one.
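Concretely, the coordination the paper describes can be pictured as a search over candidate device frequencies and task splits, scored by the empirical models. The sketch below is illustrative only: predictPerf and predictMaxPower stand in for the paper's fitted models, and the placeholder formulas inside them are assumptions, not the paper's.

```cpp
#include <algorithm>
#include <limits>
#include <vector>

struct Config {
    double cpuFreqGHz;  // candidate CPU frequency (DVFS setting)
    double gpuFreqGHz;  // candidate GPU frequency (DVFS setting)
    double cpuShare;    // fraction of the data-parallel work mapped to the CPU
};

// Placeholder performance model: overall time is bounded by the slower device,
// which is exactly where frequency scaling induces load imbalance.
double predictPerf(const Config& c) {
    double cpuTime = c.cpuShare / c.cpuFreqGHz;                  // illustrative
    double gpuTime = (1.0 - c.cpuShare) / (4.0 * c.gpuFreqGHz);  // illustrative
    return 1.0 / std::max(cpuTime, gpuTime);
}

// Placeholder peak-power model; real coefficients come from profiling runs.
double predictMaxPower(const Config& c) {
    return 20.0 + 15.0 * c.cpuFreqGHz * c.cpuFreqGHz
                + 40.0 * c.gpuFreqGHz * c.gpuFreqGHz;
}

// Before launching the application, pick the best-performing configuration
// whose predicted peak power stays under the cap.
Config selectConfig(const std::vector<Config>& candidates, double powerCapW) {
    Config best{};
    double bestPerf = -std::numeric_limits<double>::infinity();
    for (const auto& c : candidates) {
        if (predictMaxPower(c) > powerCapW) continue;  // would violate the cap
        double perf = predictPerf(c);
        if (perf > bestPerf) { bestPerf = perf; best = c; }
    }
    return best;
}
```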


international conference on parallel processing | 2013

Integrating Multi-GPU Execution in an OpenACC Compiler

Toshiya Komoda; Shinobu Miwa; Hiroshi Nakamura; Naoya Maruyama

GPUs have become promising computing devices in current and future computer systems due to their high performance, high energy efficiency, and low price, but the lack of high-level GPU programming models hinders the widespread adoption of GPU applications. To resolve this issue, OpenACC was developed as the first industry-standard directive-based GPU programming model, and several implementations are now available. Although early evaluations of OpenACC systems showed significant performance improvements with modest programming effort, they also revealed the systems' limitations. One of the biggest limitations is that current OpenACC compilers do not automate the utilization of multiple GPUs. In this paper, we present an OpenACC compiler with the capability to execute single-GPU OpenACC programs on multiple GPUs. By orchestrating the compiler and the runtime system, the proposed system can efficiently manage the necessary data movements among the memories of multiple GPUs. To enable advanced communication optimizations, we propose a small set of directives as extensions to the OpenACC API. The directives allow programmers to express the memory access patterns of the parallel loops to be offloaded. Inserting a few directives into an OpenACC program can eliminate a large amount of unnecessary data movement and thus helps the proposed system draw high performance from multi-GPU systems. We implemented and evaluated a prototype system on top of CUDA with three data-parallel applications. The proposed system achieves up to 6.75x the performance of OpenMP on a machine with one CPU and two GPUs, and up to 2.95x on a machine with two CPUs and three GPUs. In addition, in two of the three applications, the multi-GPU OpenACC compiler outperforms hand-written CUDA programs running on a single GPU.
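To make the idea concrete, here is an ordinary single-GPU OpenACC loop annotated with a hypothetical access-pattern hint in the spirit of the paper's extensions; the commented-out directive's spelling is invented for illustration and is not the paper's actual syntax.

```cpp
// Illustrative only: a 1-D stencil expressed as a standard OpenACC loop.
void stencil(int n, const float* in, float* out) {
    // Hypothetical hint: iteration i reads only in[i-1..i+1], so when the
    // runtime splits the loop across GPUs it needs to exchange just a
    // one-element halo at each partition boundary, not the whole arrays.
    // #pragma acc access in(in[i-1:3]) out(out[i:1])   // invented syntax
    #pragma acc parallel loop copyin(in[0:n]) copyout(out[0:n])
    for (int i = 1; i < n - 1; ++i)
        out[i] = (in[i - 1] + in[i] + in[i + 1]) / 3.0f;
}
```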


design, automation, and test in europe | 2013

D-MRAM cache: enhancing energy efficiency with 3T-1MTJ DRAM/MRAM hybrid memory

Hiroki Noguchi; Kumiko Nomura; Keiko Abe; Shinobu Fujita; Eishi Arima; Kyundong Kim; Takashi Nakada; Shinobu Miwa; Hiroshi Nakamura

This paper proposes a non-volatile cache architecture based on a novel cell-level hybrid DRAM/MRAM memory (D-MRAM) that enables effective power reduction for high-performance mobile SoCs without area overhead. The key to reducing active power is the intermittent refresh process in DRAM mode. D-MRAM reduces static power consumption compared to conventional SRAM because the D-MRAM cell has no static leakage paths and needs no supply voltage when operated in MRAM mode. Moreover, with advanced perpendicular magnetic tunnel junctions (p-MTJ), which decrease write energy and latency without shortening retention time, D-MRAM can reduce power by replacing traditional SRAM caches. In a 65-nm CMOS technology, the access latencies of a 1 MB memory macro are 2.2 ns / 1.5 ns for read / write in DRAM mode and 2.2 ns / 4.5 ns in MRAM mode, while those of SRAM are 1.17 ns. Evaluation with the SPEC CPU2006 benchmarks reveals that the energy per instruction (EPI) of the total cache memory is reduced by 71% on average, while the instructions per cycle (IPC) of the D-MRAM cache architecture degrades by only approximately 4% on average despite the latency overhead.
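As a rough illustration of the two operating modes, consider the toy model below. The latencies are the macro figures quoted above, but the write-rate threshold and the mode-selection policy are illustrative assumptions, not the paper's mechanism.

```cpp
// Toy model of the two operating modes of a 3T-1MTJ D-MRAM cell
// (65-nm, 1 MB macro figures from the abstract).
struct ModeCost { double readNs, writeNs; bool needsRefresh; };

constexpr ModeCost kDramMode{2.2, 1.5, true};   // fast writes, periodic refresh
constexpr ModeCost kMramMode{2.2, 4.5, false};  // non-volatile, no refresh power

// Illustrative policy: prefer DRAM mode for write-hot data (its 1.5 ns writes
// amortize the refresh energy); fall back to MRAM mode for cold data so the
// cell can sit entirely unpowered.
ModeCost chooseMode(double writesPerMs) {
    const double kHotThreshold = 100.0;  // illustrative threshold
    return writesPerMs > kHotThreshold ? kDramMode : kMramMode;
}
```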


ieee international symposium on parallel & distributed processing, workshops and phd forum | 2013

Predict-More Router: A Low Latency NoC Router with More Route Predictions

Yuan He; Hiroshi Sasaki; Shinobu Miwa; Hiroshi Nakamura

Network-on-Chip (NoC) is a critical part of the memory hierarchy of emerging multicores. Lowering its communication latency while preserving its bandwidth is key to achieving high system performance. One of the most effective methods for achieving this goal is the prediction router (PR). PR predicts the route an incoming packet will take, speculatively allocates resources (virtual channels and the switch crossbar) to the packet, and traverses the packet's flits along the predicted route in a single cycle without waiting for route computation; if the prediction misses, however, the packet is processed in the conventional pipeline (four cycles in our work) and the speculatively allocated router resources are wasted. Prediction accuracy therefore determines the number of successful predictions, the latency reduction, and the bandwidth consumption. We find that predictions hit only around 65% of the time for most applications, even under the best algorithm, so PR can accelerate at most about 65% of the packets, while the remaining 35% consume extra router resources and bandwidth. To increase the prediction accuracy, we propose a technique that applies multiple prediction algorithms simultaneously to each incoming packet, yielding more accurate predictions. Based on this idea, we design and implement the predict-more router (PmR). Besides effectively increasing the prediction accuracy, PmR also utilizes the router's remaining internal bandwidth more productively. When both PmR and PR are evaluated under their best algorithm(s), PmR achieves over 15% higher prediction accuracy than PR, which helps PmR outperform PR by 3.5% on average in system speedup. We also find that although PmR creates more contention among predictions, this contention is well resolved and stays within the router, so neither the router's internal bandwidth nor the link bandwidth is degraded.
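The core idea, consulting several predictors and speculatively setting up all of their guesses instead of betting on one, can be sketched as follows; the data types and predictor set are illustrative, not PmR's actual microarchitecture.

```cpp
#include <set>
#include <vector>

enum class Port { North, South, East, West, Local };
struct Packet { int dstX, dstY; /* header fields ... */ };

// Each algorithm guesses the output port before route computation finishes;
// these are stand-ins for the kinds of predictors PR/PmR evaluate.
using Predictor = Port (*)(const Packet&);

// Predict-more: take the union of several predictions and speculatively
// allocate crossbar paths to all of them.
std::set<Port> predictMore(const Packet& p, const std::vector<Predictor>& algs) {
    std::set<Port> guesses;
    for (auto alg : algs) guesses.insert(alg(p));
    return guesses;  // the flit is forwarded to every guessed port in one cycle
}

// After route computation completes: a hit means the real output port was
// among the guesses; a miss falls back to the conventional 4-cycle pipeline.
bool predictionHit(Port realRoute, const std::set<Port>& guesses) {
    return guesses.count(realRoute) > 0;
}
```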


IEICE Transactions on Information and Systems | 2008

A Dynamic Control Mechanism for Pipeline Stage Unification by Identifying Program Phases

Jun Yao; Shinobu Miwa; Hajime Shimada; Shinji Tomita

Recently, a method called pipeline stage unification (PSU) has been proposed to reduce the energy consumption of mobile processors by deactivating and bypassing some of the pipeline registers, thereby adopting a shallower pipeline. It is designed to be an energy-efficient method, especially for processors built with future process technologies. In this paper, we present a mechanism for the PSU controller that dynamically predicts a suitable configuration based on program phase detection. Our results show that the designed predictor achieves a PSU degree prediction accuracy of 84.0%, averaged over the SPEC CPU2000 integer benchmarks. With this dynamic control mechanism, we obtain an 11.4% Energy-Delay-Product (EDP) reduction in a processor that adopts a PSU pipeline, compared to the baseline processor, even after the application of complex clock gating.
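A minimal sketch of phase-directed PSU control follows. The phase signature (a quantized per-interval IPC) and the table-based predictor are deliberate simplifications; the paper's detector may use a richer program-phase signature.

```cpp
#include <unordered_map>

// Sketch of a dynamic PSU controller driven by program phases.
class PsuController {
public:
    // Called once per sampling interval with that interval's measured IPC.
    // Returns the predicted PSU degree (1 = full depth, 2 = unified/shallow).
    int predictDegree(double intervalIpc) {
        int phase = static_cast<int>(intervalIpc * 10.0);  // quantize to a phase id
        auto it = table_.find(phase);
        if (it != table_.end()) return it->second;  // phase seen before
        return 1;  // unknown phase: default to full depth (performance-safe)
    }

    // After the interval runs, record which degree actually minimized EDP for
    // this phase, so the next occurrence of the phase is predicted correctly.
    void update(double intervalIpc, int bestDegree) {
        table_[static_cast<int>(intervalIpc * 10.0)] = bestDegree;
    }

private:
    std::unordered_map<int, int> table_;  // phase id -> preferred PSU degree
};
```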


ACM Sigarch Computer Architecture News | 2007

Optimal pipeline depth with pipeline stage unification adoption

Jun Yao; Shinobu Miwa; Hajime Shimada; Shinji Tomita

Finding the optimal pipeline design point with respect to both performance and power objectives has been a focus of recent research. However, previous papers did not consider dynamically deepening or shallowing the pipeline during program execution. In this paper, adopting the previously proposed Pipeline Stage Unification (PSU) method, we study the relationship between power/performance and pipeline depth in processors whose pipelines support multiple usable depths. Our evaluation results on the SPECint2000 benchmarks illustrate that adopting PSU achieves good efficiency for platforms concerned with both energy and performance, even after the utilization of complex clock gating.
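For reference, the metric being traded off here is the energy-delay product, written below in standard notation (not the paper's): E(d) is total energy and T(d) execution time at effective pipeline depth d.

```latex
% Unifying stages lowers E but stretches T, so the optimal depth
% minimizes the product rather than either factor alone:
\mathrm{EDP}(d) = E(d)\,T(d), \qquad d^{*} = \arg\min_{d}\; E(d)\,T(d)
```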


international conference on computer design | 2015

Immediate sleep: Reducing energy impact of peripheral circuits in STT-MRAM caches

Eishi Arima; Hiroki Noguchi; Takashi Nakada; Shinobu Miwa; Susumu Takeda; Shinobu Fujita; Hiroshi Nakamura

Implementing last-level caches (LLCs) with STT-MRAM is a promising approach to designing energy-efficient microprocessors due to the high density and low leakage power of its memory cells. However, the peripheral circuits of an STT-MRAM cache still suffer from leakage power because large, leaky transistors are required to drive the large write current to the STT-MRAM elements. To overcome this problem, we propose a new power management scheme called Immediate Sleep (IS). IS immediately turns off a subarray of an STT-MRAM cache if the next access is predicted not to be performance-critical. Thus, IS can effectively reduce leakage energy with little impact on performance. Our experimental results show that our technique reduces the leakage energy of an STT-MRAM LLC by 32% compared to an STT-MRAM LLC with the conventional scheme, at the same performance.
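In pseudocode terms, the IS policy amounts to the following; the criticality predictor is a hypothetical placeholder, since the abstract does not detail the paper's actual predictor.

```cpp
// Sketch of Immediate Sleep: after servicing an access, turn the subarray
// off right away unless its next access is predicted to be performance-critical.
struct Subarray {
    bool powered = true;
    void powerOff() { powered = false; }  // cuts peripheral-circuit leakage
    void powerOn()  { powered = true;  }  // pays a wakeup latency penalty
};

bool nextAccessPredictedCritical(int /*subarrayId*/) {
    // Placeholder: a real predictor would consult access history
    // (e.g., reuse patterns) to judge criticality.
    return false;
}

void onAccessComplete(Subarray& s, int id) {
    if (!nextAccessPredictedCritical(id))
        s.powerOff();  // sleep immediately; wakeup stays off the critical path
}
```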


asia and south pacific design automation conference | 2014

Normally-off computing project: Challenges and opportunities

Hiroshi Nakamura; Takashi Nakada; Shinobu Miwa

Normally-Off is a style of computing that aggressively powers off components of computer systems when they do not need to operate. Simple power gating cannot fully exploit the opportunities for power reduction because volatile memories lose data when power is turned off. Recently, new non-volatile memories (NVMs) have appeared, and normally-off computing using these NVMs has attracted considerable attention. In this paper, we address its promise and challenges, with a brief introduction to our project, which started in 2011.


international conference on parallel architectures and compilation techniques | 2013

McRouter: multicast within a router for high performance network-on-chips

Yuan He; Hiroshi Sasaki; Shinobu Miwa; Hiroshi Nakamura

The inevitable advent of the multi-core era has driven an increasing demand for low-latency on-chip interconnection networks (NoCs). As a critical part of the memory hierarchy of modern chip multi-processors (CMPs), these networks face stringent design constraints: they must provide fast communication within a tight power budget. A modern NoC's first-order concern is clearly latency, while we find that the internal bandwidth of its routers is relatively plentiful; thus, we present a low-latency router design based on a technique we call "multicast within a router," or McRouter, which productively utilizes the remaining bandwidth inside a NoC router. McRouter allows single-cycle transfer of flits, which shortens communication latency whenever there is enough remaining bandwidth within the router. The key idea is to transmit a header flit to all possible output ports (multicast) so that it always reaches the correct output port without relying on route computation. In addition, we find McRouter affordable, with marginal power overhead, while remaining a stand-alone design that preserves portability and modularity (unlike designs based on look-ahead routing). Our evaluation with application traffic shows that McRouter achieves system speedups of 1.28, 1.17, and 1.05 over the conventional router (CR), the VSA router (VSAR), and the prediction router (PR), respectively.
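The single-cycle path can be pictured as below; the five-port mesh router and the two-phase speculate/resolve split are illustrative simplifications of the actual hardware.

```cpp
#include <array>

// "Multicast within a router," sketched for an illustrative 5-port mesh router.
constexpr int kPorts = 5;  // north, south, east, west, local

// Cycle 1: while route computation is in flight, broadcast the header flit to
// every output port with spare crossbar bandwidth this cycle.
std::array<bool, kPorts> speculativeMulticast(const std::array<bool, kPorts>& portIdle) {
    std::array<bool, kPorts> sent{};
    for (int p = 0; p < kPorts; ++p)
        sent[p] = portIdle[p];  // never steal bandwidth from real traffic
    return sent;
}

// Cycle 2: route computation has finished. A hit means the flit already sits
// at the right port (one-cycle transfer); every other copy is squashed.
bool resolve(std::array<bool, kPorts>& sent, int computedRoute) {
    bool hit = sent[computedRoute];
    for (int p = 0; p < kPorts; ++p)
        if (p != computedRoute) sent[p] = false;  // squash speculative copies
    return hit;  // on a miss the flit proceeds through the conventional pipeline
}
```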


international parallel and distributed processing symposium | 2012

Communication Library to Overlap Computation and Communication for OpenCL Application

Toshiya Komoda; Shinobu Miwa; Hiroshi Nakamura

User-friendly parallel programming environments such as CUDA and OpenCL are widely used for accelerators. They provide programmers with useful APIs, but these APIs are still low-level primitives. Therefore, to apply communication optimization techniques such as double buffering, programmers have to write their programs manually with these primitives. Manual communication optimization requires significant knowledge of both application characteristics and CPU-accelerator architecture, which prevents many application developers from utilizing accelerators effectively. In addition, managing communication is a tedious and error-prone task even for expert programmers. Thus, a communication system is needed that is highly abstracted yet still capable of optimization. For this purpose, this paper proposes an OpenCL-based communication library. To maximize the performance improvement, the proposed library provides a simple but effective programming interface based on a stream graph for specifying an application's communication pattern. We have implemented a prototype system on an OpenCL platform and applied it to several image processing applications. Our evaluation shows that the library successfully hides the details of accelerator memory management while achieving speedups comparable to manual optimization using existing low-level interfaces.
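For context, the double-buffering pattern the library automates looks roughly like this when hand-written against the raw OpenCL API. This is a simplified sketch: error checking and clReleaseEvent calls are omitted, kernel/context/buffer setup is assumed, and the queue is assumed to be created with CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE so copies and kernels can overlap.

```cpp
#include <CL/cl.h>
#include <cstddef>

// While chunk i is being processed on the device, chunk i+1 is staged in the
// background; the event chains enforce correctness despite the overlap.
void processChunks(cl_command_queue q, cl_kernel k, cl_mem buf[2],
                   const float* host, size_t chunk, int nChunks) {
    const size_t bytes = chunk * sizeof(float);
    cl_event copied[2] = {nullptr, nullptr};  // "chunk is on the device"
    cl_event done[2]   = {nullptr, nullptr};  // "kernel finished with buffer"

    // Prime the pipeline with a non-blocking write of the first chunk.
    clEnqueueWriteBuffer(q, buf[0], CL_FALSE, 0, bytes, host,
                         0, nullptr, &copied[0]);
    for (int i = 0; i < nChunks; ++i) {
        const int cur = i % 2, nxt = (i + 1) % 2;
        if (i + 1 < nChunks) {
            // Stage the next chunk; it must wait until the kernel that last
            // read buf[nxt] (two iterations ago) has finished.
            clEnqueueWriteBuffer(q, buf[nxt], CL_FALSE, 0, bytes,
                                 host + (size_t)(i + 1) * chunk,
                                 done[nxt] ? 1 : 0,
                                 done[nxt] ? &done[nxt] : nullptr,
                                 &copied[nxt]);
        }
        clSetKernelArg(k, 0, sizeof(cl_mem), &buf[cur]);
        size_t gsize = chunk;
        // The kernel launch waits only for its own chunk's copy to complete.
        clEnqueueNDRangeKernel(q, k, 1, nullptr, &gsize, nullptr,
                               1, &copied[cur], &done[cur]);
    }
    clFinish(q);  // drain all outstanding work
}
```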

Collaboration


Dive into Shinobu Miwa's collaborations.

Top Co-Authors

Hironori Nakajo

Tokyo University of Agriculture and Technology

Jun Yao

Nara Institute of Science and Technology
