Thiem Van Chu
Tokyo Institute of Technology
Publications
Featured research published by Thiem Van Chu.
field programmable logic and applications | 2015
Thiem Van Chu; Shimpei Sato; Kenji Kise
Network-on-Chip (NoC) has become the de facto on-chip communication architecture for many-core systems. This paper proposes novel methods for emulating large-scale NoC designs on a single FPGA. Since FPGAs offer a highly parallel platform, FPGA-based emulation can be much faster than the software-based approach. However, emulating NoC designs with up to thousands of nodes is challenging due to FPGA capacity constraints. We first describe how to accurately model synthetic workloads on an FPGA by separating the time of the emulated network from the times of the traffic generation units. We next present a novel use of time-multiplexing that emulates the entire network using several physical nodes. Finally, we show the basic steps for applying the proposed methods to emulate different NoC architectures. The proposed methods enable ultra-fast emulation of large-scale NoC designs with up to thousands of nodes using only the on-chip resources of a single FPGA. In particular, a simulation speedup of more than 5,000× over BookSim, a widely used software-based NoC simulator, is achieved.
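To illustrate the idea of separating the emulated network's time from the traffic generators' time, the following Python sketch precomputes injection timestamps per node and lets packets wait with those timestamps until the emulated network clock reaches them. The node count, injection model, and fixed-latency toy network are illustrative assumptions, not the paper's design.

```python
# Minimal sketch of decoupled time domains (illustrative, not the paper's design).
import random
from collections import deque

NUM_NODES = 16          # assumed small network for illustration
INJECTION_RATE = 0.1    # packets per node per emulated cycle
HORIZON = 1000          # emulated cycles to run
TOY_LATENCY = 20        # stand-in for the real router-pipeline latency

def make_injection_times(seed=0):
    """Generator time domain: precompute per-node injection timestamps."""
    rng = random.Random(seed)
    q = [deque() for _ in range(NUM_NODES)]
    for t in range(HORIZON):
        for src in range(NUM_NODES):
            if rng.random() < INJECTION_RATE:
                q[src].append(t)          # timestamp in generator time
    return q

def emulate():
    pending = make_injection_times()
    in_flight = []                        # (delivery_cycle, generation_cycle)
    latencies = []
    for now in range(HORIZON):            # emulated-network time domain
        for src in range(NUM_NODES):
            # Packets keep their generation timestamps and wait at the source
            # until the network (here: one injection per node per cycle) accepts them.
            if pending[src] and pending[src][0] <= now:
                born = pending[src].popleft()
                in_flight.append((now + TOY_LATENCY, born))
        latencies += [now - born for dlv, born in in_flight if dlv == now]
        in_flight = [(dlv, born) for dlv, born in in_flight if dlv > now]
    return sum(latencies) / len(latencies) if latencies else 0.0

print("average packet latency (cycles):", emulate())
```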
field programmable logic and applications | 2014
Hiroshi Nakatsuka; Yuichiro Tanaka; Thiem Van Chu; Shinya Takamaeda-Yamazaki; Kenji Kise
Soft processors are commonly used in FPGA-based designs to perform various useful functions. Some of these functions are not performance-critical and are required to be implemented with very few FPGA resources. In such cases, it is desirable to reduce the circuit area of the soft processor as much as possible. This paper proposes Ultrasmall, a small soft processor for FPGAs. Ultrasmall supports a subset of the MIPS-I ISA and is designed for microcontrollers in FPGA-based SoCs. Ultrasmall employs an area-efficient architecture to minimize the use of FPGA resources. While supporting the 32-bit ISA, Ultrasmall adopts a 2-bit-wide serial ALU architecture. This approach significantly reduces FPGA resource usage. In addition to device-independent optimizations applicable to any FPGA, we apply primitive-based optimizations for the Xilinx Spartan-3E FPGA series with 4-input LUTs, further reducing the total number of occupied slices. The evaluation results show that, on the Xilinx Spartan-3E XC3S500E FPGA, Ultrasmall occupies only 137 slices, which is 84% of the number of occupied slices of Supersmall, a very small soft processor with the same design concept as Ultrasmall. Moreover, in terms of performance, Ultrasmall is 2.9× faster than Supersmall.
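A minimal software sketch of the 2-bit-wide serial ALU idea: a 32-bit addition is computed two bits per cycle over 16 cycles, with the carry propagated between steps. This only illustrates the concept and is not Ultrasmall's actual datapath.

```python
def serial_add_2bit(a: int, b: int, width: int = 32) -> int:
    """Add two `width`-bit integers 2 bits at a time, like a 2-bit serial ALU."""
    result, carry = 0, 0
    for step in range(width // 2):        # 16 steps (cycles) for a 32-bit add
        sh = 2 * step
        a2 = (a >> sh) & 0b11             # 2-bit slice of each operand
        b2 = (b >> sh) & 0b11
        s = a2 + b2 + carry               # small 2-bit adder with carry in
        result |= (s & 0b11) << sh        # write back the 2 result bits
        carry = s >> 2                    # carry out feeds the next step
    return result & ((1 << width) - 1)    # wrap around like the hardware

assert serial_add_2bit(0xFFFFFFFF, 1) == 0
assert serial_add_2bit(123456789, 987654321) == (123456789 + 987654321) & 0xFFFFFFFF
```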
field programmable custom computing machines | 2017
Susumu Mashimo; Thiem Van Chu; Kenji Kise
State-of-the-art studies show that FPGA-based hardware merge sorters (HMSs) can achieve superior performance compared with optimized algorithms on CPUs and GPUs. The performance of an HMS is proportional to its operating frequency (F) and the number of records it can output each cycle (E). However, all existing HMSs suffer from a significant drop in F as E increases because the number of levels of gates grows. In this paper, we propose novel HMS architectures in which the number of levels of gates remains constant as E increases. We implement several HMSs based on the proposed architectures on a Virtex-7 FPGA. The evaluation shows that an HMS with E = 32 operates at 311 MHz and achieves 3.13× higher throughput than the state-of-the-art HMS.
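The following behavioral sketch shows only the interface implied above: a hardware merge sorter with parameter E emits the E smallest records from two sorted input streams every cycle. The frequency argument (constant gate levels) concerns the circuit structure and cannot be captured in software; the names and stream contents here are illustrative.

```python
from collections import deque

def merge_E_per_cycle(stream_a, stream_b, E=4):
    """Behavioural model: each 'cycle' outputs the E smallest remaining records."""
    a, b = deque(stream_a), deque(stream_b)   # two pre-sorted input streams
    cycles = []
    while a or b:
        emitted = []
        for _ in range(E):                    # one emulated cycle: up to E outputs
            if a and (not b or a[0] <= b[0]):
                emitted.append(a.popleft())
            elif b:
                emitted.append(b.popleft())
        cycles.append(emitted)
    return cycles

out = merge_E_per_cycle([1, 3, 5, 7], [2, 4, 6, 8], E=4)
assert [x for cycle in out for x in cycle] == [1, 2, 3, 4, 5, 6, 7, 8]
```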
field-programmable custom computing machines | 2015
Thiem Van Chu; Shimpei Sato; Kenji Kise
Network-on-Chip (NoC) has become the de facto on-chip communication architecture of many-core systems. This paper proposes an FPGA-based NoC emulator that achieves ultra-fast simulation speed. We improve the scalability of the NoC emulator without simplifying the emulated architectures or using off-chip resources. We introduce a novel method that enables accurate emulation of NoC designs under synthetic workloads without a large amount of memory by decoupling the time of the emulated NoC from the time of the traffic generators. Additionally, we propose a method based on the time-division multiplexing technique to emulate the behavior of the entire network using several physical nodes while using FPGA resources effectively. We show that an implementation of the proposed NoC emulator on a Virtex-7 FPGA achieves a 2,745× simulation speedup over BookSim, one of the most widely used software-based NoC simulators, while maintaining the simulation accuracy.
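A minimal sketch of the time-division multiplexing idea mentioned above: the state of every logical node is kept in a memory (standing in for FPGA block RAM) and a small number of physical node models update the logical nodes in turn, so one emulated cycle costs roughly N/P physical cycles. The node counts and the trivial update function are illustrative assumptions, not the emulator's actual design.

```python
NUM_LOGICAL = 1024      # e.g. a 32x32 mesh of emulated routers
NUM_PHYSICAL = 4        # physical node models instantiated on the FPGA

# One context record per logical node, standing in for BRAM contents.
contexts = [{"node_id": i, "cycles_done": 0} for i in range(NUM_LOGICAL)]

def step_node(ctx):
    """Placeholder for advancing one logical router by one cycle."""
    ctx["cycles_done"] += 1

def emulated_cycle():
    physical_cycles = 0
    for base in range(0, NUM_LOGICAL, NUM_PHYSICAL):
        # Each physical model loads one logical node's context, advances it,
        # and writes it back; P logical nodes are served per physical cycle.
        for p in range(NUM_PHYSICAL):
            step_node(contexts[base + p])
        physical_cycles += 1
    return physical_cycles

print("physical cycles per emulated cycle:", emulated_cycle())   # 1024 / 4 = 256
```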
ACM Sigarch Computer Architecture News | 2017
Susumu Mashimo; Thiem Van Chu; Kenji Kise
High-performance sorting is used in various areas such as database transactions and genomic feature operations. To improve sorting performance, in addition to the conventional approach of using general-purpose processors or GPUs, using FPGAs is becoming a promising solution. Casper and Olukotun recently proposed the fastest known FPGA sorting accelerator. In their study, they proposed a merge network that can merge two sorted data sequences at a throughput of 6 data elements per 200 MHz clock cycle. If an FPGA sorting accelerator is constructed from merge networks, the overall throughput is mainly determined by the throughputs of the merge networks. This motivates us to design a merge network that outputs more than 6 data elements per 200 MHz clock cycle. In this paper, we propose a cost-effective, high-throughput merge network for the fastest FPGA sorting accelerator. The evaluation shows that our proposal achieves a throughput of 8 data elements per 200 MHz clock cycle.
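A quick back-of-the-envelope reading of the throughput figures above: the sustained throughput of a merge network is the number of elements emitted per cycle times the clock frequency. The 64-bit element width used below is an assumption for illustration.

```python
CLOCK_HZ = 200e6   # both designs are quoted at 200 MHz

def throughput_gbit_per_s(elements_per_cycle, element_bits=64):
    # Elements per second times element width, reported in Gbit/s.
    return elements_per_cycle * CLOCK_HZ * element_bits / 1e9

print(throughput_gbit_per_s(6))   # prior merge network:  76.8 Gbit/s
print(throughput_gbit_per_s(8))   # proposed network:    102.4 Gbit/s
```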
2014 IEEE 8th International Symposium on Embedded Multicore/Manycore SoCs | 2014
Thiem Van Chu; Shimpei Sato; Kenji Kise
Many-core architectures are becoming mainstream in both processor and System-on-Chip (SoC) designs. With the growing number of cores on a chip, Network-on-Chip (NoC) has become the de facto on-chip communication infrastructure. Since near-future many-core architectures are expected to integrate thousands of cores on a single chip, both full-system simulators and stand-alone NoC simulators are essential for supporting architectural design exploration and performance evaluation of such kilo-node-scale many-core systems. This paper proposes KNoCEmu, an FPGA emulator that achieves fast and cycle-accurate kilo-node-scale NoC simulations. To overcome the limited FPGA resources, we propose a method that reduces the amount of required FPGA resources while maintaining simulation accuracy. The time-division multiplexing technique is adopted to emulate the behavior of the entire network using one or several physical nodes. Our design is optimized to efficiently use FPGA resources such as block RAM and distributed RAM. We have implemented KNoCEmu for emulating a 32×32 mesh NoC with a conventional input-buffered router on a Virtex-7 XC7VX485T FPGA and evaluated the amount of occupied FPGA resources and the simulation speedup over a software-based simulator. The evaluation results show that KNoCEmu achieves a 134× simulation speedup over the software-based simulator while using less than 8% of the available slices of the Virtex-7 XC7VX485T FPGA.
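As an illustration of the kind of per-node logic such an emulator models, the sketch below implements dimension-order (XY) routing for a 32×32 mesh, a common routing algorithm for input-buffered mesh routers. Whether KNoCEmu uses exactly this routing function is an assumption, and the node indexing is illustrative.

```python
MESH_WIDTH = 32     # 32x32 mesh -> 1,024 nodes, index = y * MESH_WIDTH + x

def xy_route(current: int, dest: int) -> str:
    """Return the output port a packet takes at node `current` toward `dest`."""
    cx, cy = current % MESH_WIDTH, current // MESH_WIDTH
    dx, dy = dest % MESH_WIDTH, dest // MESH_WIDTH
    if dx != cx:                        # travel along X first...
        return "EAST" if dx > cx else "WEST"
    if dy != cy:                        # ...then along Y
        return "NORTH" if dy > cy else "SOUTH"
    return "LOCAL"                      # arrived at the destination node

assert xy_route(0, 0) == "LOCAL"
assert xy_route(0, 3) == "EAST"
assert xy_route(3, 3 + MESH_WIDTH) == "NORTH"
```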
ACM Transactions on Reconfigurable Technology and Systems | 2017
Thiem Van Chu; Shimpei Sato; Kenji Kise
Modeling and simulation/emulation play a major role in research and development of novel Networks-on-Chip (NoCs). However, conventional software simulators are so slow that studying NoCs for emerging many-core systems with hundreds to thousands of cores is challenging. State-of-the-art FPGA-based NoC emulators have shown great potential in speeding up the NoC simulation, but they cannot emulate large-scale NoCs due to the FPGA capacity constraints. Moreover, emulating large-scale NoCs under synthetic workloads on FPGAs typically requires a large amount of memory and thus involves the use of off-chip memory, which makes the overall design much more complicated and may substantially degrade the emulation speed. This article presents methods for fast and cycle-accurate emulation of NoCs with up to thousands of nodes using a single FPGA. We first describe how to emulate a NoC under a synthetic workload using only FPGA on-chip memory (BRAMs). We next present a novel use of time-division multiplexing where BRAMs are effectively used for emulating a network using a small number of nodes, thereby overcoming the FPGA capacity constraints. We propose methods for emulating both direct and indirect networks, focusing on the commonly used meshes and fat-trees (k-ary n-trees). This is different from prior work that considers only direct networks. Using the proposed methods, we build a NoC emulator, called FNoC, and demonstrate the emulation of some mesh-based and fat-tree-based NoCs with canonical router architectures. Our evaluation results show that (1) the size of the largest NoC that can be emulated depends on only the FPGA on-chip memory capacity; (2) a mesh-based NoC with 16,384 nodes (128×128 NoC) and a fat-tree-based NoC with 6,144 switch nodes and 4,096 terminal nodes (4-ary 6-tree NoC) can be emulated using a single Virtex-7 FPGA; and (3) when emulating these two NoCs, we achieve, respectively, 5,047× and 232× speedups over BookSim, one of the most widely used software-based NoC simulators, while maintaining the same level of accuracy.
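The fat-tree sizes quoted above can be checked directly: a k-ary n-tree has k^n terminal nodes and n levels of k^(n-1) switch nodes each. The short sketch below verifies the 4-ary 6-tree and 128×128 mesh figures from the evaluation.

```python
def kary_ntree_sizes(k: int, n: int):
    """Node counts of a k-ary n-tree: (terminal nodes, switch nodes)."""
    terminals = k ** n            # k^n terminals
    switches = n * k ** (n - 1)   # n levels of k^(n-1) switches
    return terminals, switches

terminals, switches = kary_ntree_sizes(4, 6)
assert terminals == 4096          # terminal nodes of the 4-ary 6-tree
assert switches == 6144           # switch nodes of the 4-ary 6-tree
assert 128 * 128 == 16384         # nodes of the 128x128 mesh
```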
international symposium on computing and networking | 2016
Takuma Usui; Thiem Van Chu; Kenji Kise
Sorting is an important computation kernel used in many fields such as image processing, data compression, and database operations. There have been many attempts to accelerate sorting using FPGAs, most of which are based on the merge sort algorithm. Merge sorter trees are tree-structured architectures for large-scale sorting. When a merge sorter tree with K input leaves merges N elements, the merge phases are performed recursively, so its time complexity is O(N log_K N). Hence, to achieve higher sorting performance, it is effective to increase the number of input leaves K. However, the hardware resource usage is O(K), so it is difficult to efficiently implement a merge sorter tree with many input leaves. Ito et al. recently proposed an algorithm that reduces the hardware complexity of a merge sorter tree with K input leaves from O(K) to O(log K). However, they report evaluation results only for K = 8 and K = 16. In this paper, we propose a cost-effective and scalable merge sorter tree architecture based on their algorithm. We show that our design achieves almost the same performance as the conventional design, whose hardware complexity is O(K). We implement a merge sorter tree with 1,024 input leaves on a Xilinx XC7VX485T-2 FPGA and show that the proposed architecture has 52.4× better logic slice utilization with only 1.31× performance degradation compared with the conventional design. We also succeed in implementing a very large merge sorter tree with 4,096 input leaves, which cannot be implemented using the conventional design. This tree achieves a merging throughput of 149 million 64-bit elements per second while using 1.72% of the slices and 7.48% of the block RAMs of the FPGA.
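A software sketch of the K-leaf merge-sorter-tree behavior: sorting N elements by repeated K-way merge passes takes about log_K N passes, each touching all N elements, which is where the O(N log_K N) bound above comes from. This models only the algorithmic behavior, not the hardware architecture.

```python
import heapq, math, random

def kway_merge_sort(data, K):
    """Sort by repeated K-way merge passes, counting the passes."""
    runs = [[x] for x in data]                     # start from single-element runs
    passes = 0
    while len(runs) > 1:
        runs = [list(heapq.merge(*runs[i:i + K]))  # one pass of the K-leaf tree
                for i in range(0, len(runs), K)]
        passes += 1
    return (runs[0] if runs else []), passes

data = [random.randrange(10**6) for _ in range(4096)]
result, passes = kway_merge_sort(data, K=1024)
assert result == sorted(data)
assert passes == math.ceil(math.log(len(data), 1024))   # 2 passes here
```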
networks on chips | 2016
Masashi Imai; Thiem Van Chu; Kenji Kise; Tomohiro Yoneda
field programmable custom computing machines | 2018
Makoto Saitoh; Elsayed A. Elsayed; Thiem Van Chu; Susumu Mashimo; Kenji Kise