Hiroki Matsutani | Researchain

Publication

Featured research published by Hiroki Matsutani.

networks on chips | 2008

A Lightweight Fault-Tolerant Mechanism for Network-on-Chip

Michihiro Koibuchi; Hiroki Matsutani; Hideharu Amano; T. Mark Pinkston

Survival capability is becoming a crucial factor in designing multicore processors built with on-chip packet networks, or networks on chip (NoCs). In this paper, we propose a lightweight fault-tolerant mechanism for NoCs based on default backup paths (DBPs) designed to maintain, in the presence of failures, network connectivity of both non-faulty routers and healthy processor cores that may be connected to faulty routers. The mechanism provides default paths as backup between certain router ports, which serve as alternative datapaths to circumvent failed components within a faulty router. Along with a minimal subset of normal network channels, the set of default backup paths internal to faulty routers forms, in the worst case, a unidirectional ring topology that provides network-wide connectivity to all processor cores. Routing using the DBP mechanism is proved to be deadlock-free with only two virtual channels even for fault scenarios in which regular networks degrade to irregular (arbitrary) topologies. Evaluation results show that, for a 2-D mesh wormhole NoC, only 12.6% additional hardware resources are needed to implement the proposed DBP mechanism in order to provide graceful performance degradation without chip-wide failure as the number of faults increases to the maximum needed to form a ring.

high-performance computer architecture | 2009

Prediction router: Yet another low latency on-chip router architecture

Hiroki Matsutani; Michihiro Koibuchi; Hideharu Amano; Tsutomu Yoshinaga

Networks-on-chip (NoCs) are quite latency sensitive, since their communication latency strongly affects the application performance on recent many-core architectures. To reduce the communication latency, we propose a low-latency router architecture that predicts the output channel to be used by the next packet transfer and speculatively completes the switch arbitration. In the prediction routers, incoming packets are transferred without waiting for the routing computation and switch arbitration if the prediction hits. Thus, the primary concern for reducing the communication latency is the hit rate of the prediction algorithms, which varies with the network environment, such as the network topology, routing algorithm, and traffic pattern. Although typical low-latency routers that speculatively skip one or more pipeline stages use a bypass datapath for specific packet transfers (e.g., packets moving on the same dimension), our prediction router predictively forwards packets based on a prediction algorithm selected from several candidates in response to the network environment. In this paper, we analyze the prediction hit rates of six prediction algorithms on meshes, tori, and fat trees. Then we provide three case studies, each of which assumes a different many-core architecture. We have implemented a prediction router for each case study by using a 65nm CMOS process, and evaluated them in terms of the prediction hit rate, zero load latency, hardware amount, and energy consumption. The results show that although the area and energy are increased by 6.4–15.9% and 8.0–9.5% respectively, a prediction hit rate of up to 89.8% is achieved in real applications, which provides favorable trade-offs between the modest hardware/energy overheads and the latency saving.
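
As a rough illustration of how a prediction hit rate can be measured, the sketch below assumes a hypothetical "latest port" predictor that guesses each packet will leave through the same output port as the previous packet on the same input channel. The predictor, the function name, and the trace are illustrative assumptions and are not claimed to be among the six algorithms analyzed in the paper.

```python
def latest_port_hit_rate(output_ports):
    """Fraction of packets whose output port matches the previous packet's.

    Assumption: a hypothetical "latest port" predictor; the first packet
    has no history and is counted as a miss.
    """
    hits = 0
    predicted = None  # no prediction before the first packet
    for port in output_ports:
        if port == predicted:
            hits += 1
        predicted = port  # predict the same port for the next packet
    return hits / len(output_ports) if output_ports else 0.0


# Illustrative trace of output ports seen on one input channel.
trace = ["E", "E", "E", "N", "N", "E"]
print(latest_port_hit_rate(trace))  # 3 hits out of 6 packets
```

A trace with long runs in one direction scores high under this predictor, while frequent turns lower the hit rate, which is one way to see why the hit rate depends on topology, routing algorithm, and traffic pattern.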

asia and south pacific design automation conference | 2008

Run-time power gating of on-chip routers using look-ahead routing

Hiroki Matsutani; Michihiro Koibuchi; Daihan Wang; Hideharu Amano

Since on-chip routers in networks-on-chip play a key role in on-chip communication between cores, they must always be ready for packet injections even if some of the cores are in standby mode, resulting in a larger standby power for routers than for cores. Run-time power gating of individual channels in a router is one attractive solution for reducing the standby power of the chip without affecting the on-chip communication. However, a state transition between sleep and active mode incurs a performance penalty, and turning a power switch on or off dissipates overhead energy, which means a short-term sleep adversely increases the power consumption. In this paper, we propose a sleep control method based on look-ahead routing that detects the arrival of packets two hops ahead, so as to hide the wake-up delay and reduce the short-term sleeps of channels. Simulation results using real application traces show that the proposed method conceals wake-up delays of less than five cycles, and that more leakage power can be saved compared with the original naive method.

international conference on parallel processing | 2007

Tightly-Coupled Multi-Layer Topologies for 3-D NoCs

Hiroki Matsutani; Michihiro Koibuchi; Hideharu Amano

Three-dimensional network-on-chip (3-D NoC) is an emerging research topic exploring the network architecture of 3-D ICs that stack several smaller wafers for reducing wire length and wire delay. Although the network topology of 3-D NoC has been explored for a couple of years, there is still only a narrow range of choices. In this paper, we propose a class of 3-D topologies called Xbar-connected network-on-tiers (XNoTs), which consist of multiple network layers tightly connected via crossbar switches. To make the best use of the short delay and high density of inter-wafer links, XNoT topologies have crossbar switches that connect different layers and their cores. The planar topology on every layer can be independently customized so as to meet the cost-performance requirements, as long as network connectivity is guaranteed at least by the bottom layer. We also propose their routing algorithm, which guarantees deadlock-freedom by restricting the inter-layer packet transfer from a lower-numbered layer to a higher-numbered layer. Path sets at the bottom layer close to the heat sink of the chip can be selectively employed in order to mitigate the heat-dissipation problem of 3-D ICs. Several forms of XNoT topologies including meshes, tori, and/or trees are created, and they are evaluated in terms of performance, cost, and energy consumption. As a result, we show that even with the flexibilities mentioned above, XNoTs achieve at least as high throughput as existing 3-D topologies for equivalent chip sizes.

international symposium on computer architecture | 2012

A case for random shortcut topologies for HPC interconnects

Michihiro Koibuchi; Hiroki Matsutani; Hideharu Amano; D. Frank Hsu; Henri Casanova

As the scales of parallel applications and platforms increase, the negative impact of communication latencies on performance becomes large. Fortunately, modern High Performance Computing (HPC) systems can exploit low-latency topologies of high-radix switches. In this context, we propose the use of random shortcut topologies, which are generated by augmenting classical topologies with random links. Using graph analysis we find that these topologies, when compared to non-random topologies of the same degree, lead to drastically reduced diameter and average shortest path length. The best results are obtained when adding random links to a ring topology, meaning that good random shortcut topologies can easily be generated for arbitrary numbers of switches. Using flit-level discrete event simulation we find that random shortcut topologies achieve throughput comparable to and latency lower than that of existing non-random topologies such as hypercubes and tori. Finally, we discuss and quantify practical challenges for random shortcut topologies, including routing scalability and larger physical cable lengths.
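
The graph analysis described in this abstract can be sketched in a few lines: build a ring, add random links, and compare diameter and average shortest path length. The generator below is a simplified assumption (it picks random endpoints without enforcing a uniform switch degree, unlike the same-degree comparison in the paper), and all function names and parameters are hypothetical.

```python
import random
from collections import deque


def ring(n):
    """Adjacency sets for a bidirectional ring of n switches."""
    return {i: {(i - 1) % n, (i + 1) % n} for i in range(n)}


def add_random_shortcuts(adj, extra_per_node, seed=0):
    """Return a copy of adj with random links added.

    Assumption: each node draws `extra_per_node` random endpoints; node
    degrees are therefore not uniform in this simplified generator.
    """
    rng = random.Random(seed)
    n = len(adj)
    out = {u: set(vs) for u, vs in adj.items()}
    for u in range(n):
        for _ in range(extra_per_node):
            v = rng.randrange(n)
            if v != u:  # skip self-loops
                out[u].add(v)
                out[v].add(u)
    return out


def path_stats(adj):
    """Diameter and average shortest path length via BFS from every node."""
    n = len(adj)
    diameter, total = 0, 0
    for src in adj:
        dist = {src: 0}
        queue = deque([src])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        diameter = max(diameter, max(dist.values()))
        total += sum(dist.values())
    return diameter, total / (n * (n - 1))


base = ring(64)
shortcut = add_random_shortcuts(base, extra_per_node=2)
print("ring:    ", path_stats(base))
print("shortcut:", path_stats(shortcut))
```

Because the shortcut graph keeps every ring link, it stays connected and its distances can only shrink relative to the plain ring.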

networks on chips | 2010

Ultra Fine-Grained Run-Time Power Gating of On-chip Routers for CMPs

Hiroki Matsutani; Michihiro Koibuchi; Daisuke Ikebuchi; Kimiyoshi Usami; Hiroshi Nakamura; Hideharu Amano

This paper proposes an ultra fine-grained run-time power gating of on-chip routers, in which power supply to each router component (e.g., VC queue, crossbar MUX, and output latch) can be individually controlled in response to the applied workload. As only the router components which are just transferring a packet are activated, the leakage power of the on-chip network can be reduced to the near-optimal level. However, a certain amount of wakeup latency is required to activate the sleeping components, and the application performance will be degraded. In this paper, we estimate the wakeup latency for each component based on circuit simulations using a 65nm process. Then we propose four early wakeup methods to overcome the wakeup latency. The proposed router with the early wakeup methods is evaluated in terms of the application performance, area, and leakage power. As a result, it reduces the leakage power by 78.9%, at the expense of the 4.3% area and 4.0% performance when we assume a 1GHz operation.

networks on chips | 2008

Adding Slow-Silent Virtual Channels for Low-Power On-Chip Networks

Hiroki Matsutani; Michihiro Koibuchi; Daihan Wang; Hideharu Amano

In this paper, we introduce the use of slow-silent virtual channels to reduce the switching power of on-chip networks while keeping the leakage power small. Adding virtual channels to a network improves the throughput until each link bandwidth is saturated. This enables us to reduce the switching power of on-chip networks by decreasing their operating frequency and supply voltage. However, adding virtual channels increases the leakage power of routers as well as the area due to their large buffers, so run-time power gating is applied to individual virtual channels to eliminate this problem. We evaluate the performance of slow-silent virtual channels by using real application traces, and their power consumption (switching and leakage) is evaluated based on the detailed design of a virtual-channel router placed and routed with a 90 nm technology. These evaluation results show that a network with three or four virtual channels achieves the best energy efficiency under uniform traffic. In the cases of neighboring communications, a network with two virtual channels is better than the other networks with more virtual channels, because the performance improvement from no virtual channel to two virtual channels is the largest and their frequency and supply voltage can also be reduced substantially in these cases.
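
The frequency and voltage argument in this abstract rests on the standard first-order model of CMOS switching power, which is textbook background rather than a formula taken from the paper. Here $\alpha$ is the activity factor, $C$ the switched capacitance, $V_{dd}$ the supply voltage, and $f$ the clock frequency:

```latex
P_{\text{switching}} = \alpha \, C \, V_{dd}^{2} \, f
```

Under this model, and assuming for illustration that voltage and frequency are both scaled by the same factor $s$, switching power falls roughly as $s^{3}$; for example, $s = 0.8$ gives $0.8^{3} = 0.512$, about half the original switching power. Leakage power is not captured by this expression, which is why the abstract treats it separately through power gating.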

asia and south pacific design automation conference | 2013

A case for wireless 3D NoCs for CMPs

Hiroki Matsutani; Paul Bogdan; Radu Marculescu; Yasuhiro Take; Daisuke Sasaki; Hao Zhang; Michihiro Koibuchi; Tadahiro Kuroda; Hideharu Amano

Inductive-coupling is yet another 3D integration technique that can be used to stack more than three known-good-dies in a SiP without wire connections. We present a topology-agnostic 3D CMP architecture using inductive-coupling that offers great flexibility in customizing the number of processor chips, SRAM chips, and DRAM chips in a SiP after chips have been fabricated. In this paper, first, we propose a routing protocol that exchanges the network information between all chips in a given SiP to establish efficient deadlock-free routing paths. Second, we propose its optimization technique that analyzes the application traffic patterns and selects different spanning tree roots so as to minimize the average hop counts and improve the application performance.

IEEE Transactions on Computers | 2014

3D NoC with Inductive-Coupling Links for Building-Block SiPs

Yasuhiro Take; Hiroki Matsutani; Daisuke Sasaki; Michihiro Koibuchi; Tadahiro Kuroda; Hideharu Amano

A wireless 3D NoC architecture is described for building-block SiPs, in which the number of hardware components (or chips) in a package can be changed after chips have been fabricated. The architecture uses inductive-coupling links that can connect more than two examined dies without wire connections. Each chip has data transceivers for the uplink and downlink in order to communicate with its neighboring chips in the package. These chips form a vertical unidirectional ring network so as to fully exploit the flexibility of the wireless approach that enables us to add, remove, and swap the chips in the ring. To avoid protocol and structural deadlocks in the ring, we use bubble flow control, which does not rely on the conventional VC-based deadlock avoidance mechanism. In addition, we propose a bidirectional communication scheme to form a bidirectional ring network by using the inductive-coupling transceivers that can dynamically change the communication modes, such as TX, RX, and Idle modes. This paper illustrates the inductive-coupling transceiver circuits, which can carry high data transfer rates of up to 8 Gbps per channel, for the wireless 3D NoC. It also illustrates an implementation of a wireless 3D NoC that has on-chip routers and transceivers implemented with a 65 nm process in order to show the feasibility of our proposal. The vertical bubble flow control and conventional VC-based approach on the uni- and bidirectional ring networks are compared with the vertical broadcast bus in terms of throughput, hardware amount, and application performance using a full system multiprocessor simulator. The results show that the proposed bidirectional communication scheme efficiently improves application performance without adding any inductive-coupling transceivers. In addition, the proposed vertical bubble flow network outperforms the conventional VC-based approach by 7.9-12.5 percent with a 33.5 percent smaller router area for building-block SiPs connecting up to eight chips.
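
The bubble flow control mentioned in this abstract is commonly described by a simple admission rule, sketched below. This is the general formulation for ring networks with whole-packet buffer slots, stated here as an assumption; the paper's vertical variant may differ in detail, and the function name and parameters are hypothetical.

```python
def can_forward(free_packet_slots, injecting):
    """Decide whether a packet may move into the next ring buffer.

    Assumption: buffers are counted in whole-packet slots. A packet already
    in the ring needs one free slot; a newly injected packet needs two, so
    that at least one empty slot (the "bubble") always remains in the ring.
    """
    required = 2 if injecting else 1
    return free_packet_slots >= required
```

Keeping one slot empty means some packet in the ring can always advance, which is how this scheme avoids deadlock without dedicating virtual channels to that purpose.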

international parallel and distributed processing symposium | 2007

Performance, Cost, and Energy Evaluation of Fat H-Tree: A Cost-Efficient Tree-Based On-Chip Network

Hiroki Matsutani; Michihiro Koibuchi; Hideharu Amano

The topological exploration of on-chip networks is important for efficiently using their enormous wire resources for low-latency and high-throughput communications using a modest silicon budget. In this paper, we propose a novel tree-based interconnection network called Fat H-Tree that meets these requirements. A Fat H-Tree provides a torus structure by combining two folded H-Tree networks and is an attractive alternative to tree-based networks such as the Fat Trees in the microarchitecture domain. We introduce its chip layout schemes based on a folding technique for 2D and 3D ICs. Three deadlock-free routing schemes are proposed for Fat H-Tree. We evaluate the performance of Fat H-Tree and other tree-based networks using real application traces. In addition, the network logic area, wire resource, and energy consumption of Fat H-Tree are compared with other topologies, based on a typical implementation of on-chip routers synthesized with a 90-nm standard cell library. The results show that (1) a Fat H-Tree outperforms a Fat Tree with two upward and four downward connections in terms of the throughput and average hop count, (2) a Fat H-Tree requires a 19.8 to 27.8 percent smaller network logic area than the Fat Tree, (3) a Fat H-Tree consumes slightly less energy than the Fat Tree does, and (4) a Fat H-Tree uses slightly more wire resources than the Fat Tree, but the current process technology can provide sufficient wire resources for implementing Fat-H-Tree-based on-chip networks.
