Thiem Van Chu
Tokyo Institute of Technology
Publications
Featured research published by Thiem Van Chu.
field programmable logic and applications | 2015
Thiem Van Chu; Shimpei Sato; Kenji Kise
Network-on-Chip (NoC) has become the de facto on-chip communication architecture for many-core systems. This paper proposes novel methods for emulating large-scale NoC designs on a single FPGA. Since FPGAs offer a highly parallel platform, FPGA-based emulation can be much faster than the software-based approach. However, emulating NoC designs with up to thousands of nodes is challenging due to FPGA capacity constraints. We first describe how to accurately model synthetic workloads on an FPGA by separating the time of the emulated network from the times of the traffic generation units. We next present a novel use of time-multiplexing that emulates the entire network using several physical nodes. Finally, we show the basic steps for applying the proposed methods to emulate different NoC architectures. The proposed methods enable ultra-fast emulation of large-scale NoC designs with up to thousands of nodes using only the on-chip resources of a single FPGA. In particular, a simulation speedup of more than 5,000× over BookSim, a widely used software-based NoC simulator, is achieved.
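To illustrate the idea of separating the emulated network's time from the traffic generators' time, the following Python sketch precomputes injection timestamps per node and lets packets wait with those timestamps until the emulated network clock reaches them. The node count, injection model, and fixed-latency toy network are illustrative assumptions, not the paper's design.

```python
# Minimal sketch of decoupled time domains (illustrative, not the paper's design).
import random
from collections import deque

NUM_NODES = 16          # assumed small network for illustration
INJECTION_RATE = 0.1    # packets per node per emulated cycle
HORIZON = 1000          # emulated cycles to run
TOY_LATENCY = 20        # stand-in for the real router-pipeline latency

def make_injection_times(seed=0):
    """Generator time domain: precompute per-node injection timestamps."""
    rng = random.Random(seed)
    q = [deque() for _ in range(NUM_NODES)]
    for t in range(HORIZON):
        for src in range(NUM_NODES):
            if rng.random() < INJECTION_RATE:
                q[src].append(t)          # timestamp in generator time
    return q

def emulate():
    pending = make_injection_times()
    in_flight = []                        # (delivery_cycle, generation_cycle)
    latencies = []
    for now in range(HORIZON):            # emulated-network time domain
        for src in range(NUM_NODES):
            # Packets keep their generation timestamps and wait at the source
            # until the network (here: one injection per node per cycle) accepts them.
            if pending[src] and pending[src][0] <= now:
                born = pending[src].popleft()
                in_flight.append((now + TOY_LATENCY, born))
        latencies += [now - born for dlv, born in in_flight if dlv == now]
        in_flight = [(dlv, born) for dlv, born in in_flight if dlv > now]
    return sum(latencies) / len(latencies) if latencies else 0.0

print("average packet latency (cycles):", emulate())
```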
field programmable logic and applications | 2014
Hiroshi Nakatsuka; Yuichiro Tanaka; Thiem Van Chu; Shinya Takamaeda-Yamazaki; Kenji Kise
Soft processors are commonly used in FPGA-based designs to perform various useful functions. Some of these functions are not performance-critical and are required to be implemented with very few FPGA resources. In such cases, it is desirable to reduce the circuit area of the soft processor as much as possible. This paper proposes Ultrasmall, a small soft processor for FPGAs. Ultrasmall supports a subset of the MIPS-I ISA and is designed for microcontrollers in FPGA-based SoCs. Ultrasmall employs an area-efficient architecture to minimize the use of FPGA resources. While supporting the 32-bit ISA, Ultrasmall adopts a 2-bit-wide serial ALU architecture. This approach significantly reduces FPGA resource usage. In addition to device-independent optimizations applicable to any FPGA, we apply primitive-based optimizations for the Xilinx Spartan-3E FPGA series with 4-input LUTs, further reducing the total number of occupied slices. The evaluation results show that, on the Xilinx Spartan-3E XC3S500E FPGA, Ultrasmall occupies only 137 slices, which is 84% of the number of occupied slices of Supersmall, a very small soft processor with the same design concept as Ultrasmall. Moreover, in terms of performance, Ultrasmall is 2.9× faster than Supersmall.
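A minimal software sketch of the 2-bit-wide serial ALU idea: a 32-bit addition is computed two bits per cycle over 16 cycles, with the carry propagated between steps. This only illustrates the concept and is not Ultrasmall's actual datapath.

```python
def serial_add_2bit(a: int, b: int, width: int = 32) -> int:
    """Add two `width`-bit integers 2 bits at a time, like a 2-bit serial ALU."""
    result, carry = 0, 0
    for step in range(width // 2):        # 16 steps (cycles) for a 32-bit add
        sh = 2 * step
        a2 = (a >> sh) & 0b11             # 2-bit slice of each operand
        b2 = (b >> sh) & 0b11
        s = a2 + b2 + carry               # small 2-bit adder with carry in
        result |= (s & 0b11) << sh        # write back the 2 result bits
        carry = s >> 2                    # carry out feeds the next step
    return result & ((1 << width) - 1)    # wrap around like the hardware

assert serial_add_2bit(0xFFFFFFFF, 1) == 0
assert serial_add_2bit(123456789, 987654321) == (123456789 + 987654321) & 0xFFFFFFFF
```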
field programmable custom computing machines | 2017
Susumu Mashimo; Thiem Van Chu; Kenji Kise
State-of-the-art studies show that FPGA-based hardware merge sorters (HMSs) can achieve superior performance compared with optimized algorithms on CPUs and GPUs. The performance of an HMS is proportional to its operating frequency (F) and the number of records it can output each cycle (E). However, all existing HMSs suffer from a significant drop in F as E increases because the number of levels of gates grows. In this paper, we propose novel HMS architectures in which the number of levels of gates remains constant as E increases. We implement several HMSs based on the proposed architectures on a Virtex-7 FPGA. The evaluation shows that an HMS with E = 32 operates at 311 MHz and achieves 3.13× higher throughput than the state-of-the-art HMS.
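The following behavioral sketch shows only the interface implied above: a hardware merge sorter with parameter E emits the E smallest records from two sorted input streams every cycle. The frequency argument (constant gate levels) concerns the circuit structure and cannot be captured in software; the names and stream contents here are illustrative.

```python
from collections import deque

def merge_E_per_cycle(stream_a, stream_b, E=4):
    """Behavioural model: each 'cycle' outputs the E smallest remaining records."""
    a, b = deque(stream_a), deque(stream_b)   # two pre-sorted input streams
    cycles = []
    while a or b:
        emitted = []
        for _ in range(E):                    # one emulated cycle: up to E outputs
            if a and (not b or a[0] <= b[0]):
                emitted.append(a.popleft())
            elif b:
                emitted.append(b.popleft())
        cycles.append(emitted)
    return cycles

out = merge_E_per_cycle([1, 3, 5, 7], [2, 4, 6, 8], E=4)
assert [x for cycle in out for x in cycle] == [1, 2, 3, 4, 5, 6, 7, 8]
```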
field-programmable custom computing machines | 2015
Thiem Van Chu; Shimpei Sato; Kenji Kise
Network-on-Chip (NoC) has become the de facto on-chip communication architecture of many-core systems. This paper proposes an FPGA-based NoC emulator that achieves ultra-fast simulation speed. We improve the scalability of the NoC emulator without simplifying the emulated architectures or using off-chip resources. We introduce a novel method that enables accurate emulation of NoC designs under synthetic workloads without a large amount of memory by decoupling the time of the emulated NoC from the time of the traffic generators. Additionally, we propose a method based on the time-division multiplexing technique to emulate the behavior of the entire network using several physical nodes while using FPGA resources effectively. We show that an implementation of the proposed NoC emulator on a Virtex-7 FPGA achieves a 2,745× simulation speedup over BookSim, one of the most widely used software-based NoC simulators, while maintaining the simulation accuracy.
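A minimal sketch of the time-division multiplexing idea mentioned above: the state of every logical node is kept in a memory (standing in for FPGA block RAM) and a small number of physical node models update the logical nodes in turn, so one emulated cycle costs roughly N/P physical cycles. The node counts and the trivial update function are illustrative assumptions, not the emulator's actual design.

```python
NUM_LOGICAL = 1024      # e.g. a 32x32 mesh of emulated routers
NUM_PHYSICAL = 4        # physical node models instantiated on the FPGA

# One context record per logical node, standing in for BRAM contents.
contexts = [{"node_id": i, "cycles_done": 0} for i in range(NUM_LOGICAL)]

def step_node(ctx):
    """Placeholder for advancing one logical router by one cycle."""
    ctx["cycles_done"] += 1

def emulated_cycle():
    physical_cycles = 0
    for base in range(0, NUM_LOGICAL, NUM_PHYSICAL):
        # Each physical model loads one logical node's context, advances it,
        # and writes it back; P logical nodes are served per physical cycle.
        for p in range(NUM_PHYSICAL):
            step_node(contexts[base + p])
        physical_cycles += 1
    return physical_cycles

print("physical cycles per emulated cycle:", emulated_cycle())   # 1024 / 4 = 256
```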
ACM Sigarch Computer Architecture News | 2017
Susumu Mashimo; Thiem Van Chu; Kenji Kise
High-performance sorting is used in various areas such as database transactions and genomic feature operations. To improve sorting performance, in addition to the conventional approach of using general-purpose processors or GPUs, using FPGAs is becoming a promising solution. Casper and Olukotun recently proposed the fastest known FPGA sorting accelerator. In their study, they proposed a merge network that can merge two sorted data sequences at a throughput of 6 data elements per 200 MHz clock cycle. If an FPGA sorting accelerator is constructed from merge networks, the overall throughput is mainly determined by the throughputs of the merge networks. This motivates us to design a merge network that outputs more than 6 data elements per 200 MHz clock cycle. In this paper, we propose a cost-effective, high-throughput merge network for the fastest FPGA sorting accelerator. The evaluation shows that our proposal achieves a throughput of 8 data elements per 200 MHz clock cycle.
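A quick back-of-the-envelope reading of the throughput figures above: the sustained throughput of a merge network is the number of elements emitted per cycle times the clock frequency. The 64-bit element width used below is an assumption for illustration.

```python
CLOCK_HZ = 200e6   # both designs are quoted at 200 MHz

def throughput_gbit_per_s(elements_per_cycle, element_bits=64):
    # Elements per second times element width, reported in Gbit/s.
    return elements_per_cycle * CLOCK_HZ * element_bits / 1e9

print(throughput_gbit_per_s(6))   # prior merge network:  76.8 Gbit/s
print(throughput_gbit_per_s(8))   # proposed network:    102.4 Gbit/s
```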
2014 IEEE 8th International Symposium on Embedded Multicore/Manycore SoCs | 2014
Thiem Van Chu; Shimpei Sato; Kenji Kise
Many-core architectures are becoming mainstream in both processor and System-on-Chip (SoC) designs. With the growing number of cores on a chip, Network-on-Chip (NoC) has become the de facto on-chip communication infrastructure. Since near-future many-core architectures are expected to integrate thousands of cores on a single chip, both full-system simulators and stand-alone NoC simulators are essential for supporting architectural design exploration and performance evaluation of such kilo-node-scale many-core systems. This paper proposes KNoCEmu, an FPGA emulator that achieves fast and cycle-accurate kilo-node-scale NoC simulations. To overcome the limited FPGA resources, we propose a method that reduces the amount of required FPGA resources while maintaining simulation accuracy. The time-division multiplexing technique is adopted to emulate the behavior of the entire network using one or several physical nodes. Our design is optimized to efficiently use FPGA resources such as block RAM and distributed RAM. We have implemented KNoCEmu for emulating a 32×32 mesh NoC with a conventional input-buffered router on a Virtex-7 XC7VX485T FPGA and evaluated the amount of occupied FPGA resources and the simulation speedup over a software-based simulator. The evaluation results show that KNoCEmu achieves a 134× simulation speedup over the software-based simulator while using less than 8% of the available slices of the Virtex-7 XC7VX485T FPGA.
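As an illustration of the kind of per-node logic such an emulator models, the sketch below implements dimension-order (XY) routing for a 32×32 mesh, a common routing algorithm for input-buffered mesh routers. Whether KNoCEmu uses exactly this routing function is an assumption, and the node indexing is illustrative.

```python
MESH_WIDTH = 32     # 32x32 mesh -> 1,024 nodes, index = y * MESH_WIDTH + x

def xy_route(current: int, dest: int) -> str:
    """Return the output port a packet takes at node `current` toward `dest`."""
    cx, cy = current % MESH_WIDTH, current // MESH_WIDTH
    dx, dy = dest % MESH_WIDTH, dest // MESH_WIDTH
    if dx != cx:                        # travel along X first...
        return "EAST" if dx > cx else "WEST"
    if dy != cy:                        # ...then along Y
        return "NORTH" if dy > cy else "SOUTH"
    return "LOCAL"                      # arrived at the destination node

assert xy_route(0, 0) == "LOCAL"
assert xy_route(0, 3) == "EAST"
assert xy_route(3, 3 + MESH_WIDTH) == "NORTH"
```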
ACM Transactions on Reconfigurable Technology and Systems | 2017
Thiem Van Chu; Shimpei Sato; Kenji Kise
Modeling and simulation/emulation play a major role in research and development of novel Networks-on-Chip (NoCs). However, conventional software simulators are so slow that studying NoCs for emerging many-core systems with hundreds to thousands of cores is challenging. State-of-the-art FPGA-based NoC emulators have shown great potential in speeding up the NoC simulation, but they cannot emulate large-scale NoCs due to the FPGA capacity constraints. Moreover, emulating large-scale NoCs under synthetic workloads on FPGAs typically requires a large amount of memory and thus involves the use of off-chip memory, which makes the overall design much more complicated and may substantially degrade the emulation speed. This article presents methods for fast and cycle-accurate emulation of NoCs with up to thousands of nodes using a single FPGA. We first describe how to emulate a NoC under a synthetic workload using only FPGA on-chip memory (BRAMs). We next present a novel use of time-division multiplexing where BRAMs are effectively used for emulating a network using a small number of nodes, thereby overcoming the FPGA capacity constraints. We propose methods for emulating both direct and indirect networks, focusing on the commonly used meshes and fat-trees (k-ary n-trees). This is different from prior work that considers only direct networks. Using the proposed methods, we build a NoC emulator, called FNoC, and demonstrate the emulation of some mesh-based and fat-tree-based NoCs with canonical router architectures. Our evaluation results show that (1) the size of the largest NoC that can be emulated depends on only the FPGA on-chip memory capacity; (2) a mesh-based NoC with 16,384 nodes (128×128 NoC) and a fat-tree-based NoC with 6,144 switch nodes and 4,096 terminal nodes (4-ary 6-tree NoC) can be emulated using a single Virtex-7 FPGA; and (3) when emulating these two NoCs, we achieve, respectively, 5,047× and 232× speedups over BookSim, one of the most widely used software-based NoC simulators, while maintaining the same level of accuracy.
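The fat-tree sizes quoted above can be checked directly: a k-ary n-tree has k^n terminal nodes and n levels of k^(n-1) switch nodes each. The short sketch below verifies the 4-ary 6-tree and 128×128 mesh figures from the evaluation.

```python
def kary_ntree_sizes(k: int, n: int):
    """Node counts of a k-ary n-tree: (terminal nodes, switch nodes)."""
    terminals = k ** n            # k^n terminals
    switches = n * k ** (n - 1)   # n levels of k^(n-1) switches
    return terminals, switches

terminals, switches = kary_ntree_sizes(4, 6)
assert terminals == 4096          # terminal nodes of the 4-ary 6-tree
assert switches == 6144           # switch nodes of the 4-ary 6-tree
assert 128 * 128 == 16384         # nodes of the 128x128 mesh
```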
international symposium on computing and networking | 2016
Takuma Usui; Thiem Van Chu; Kenji Kise
Sorting is an important computation kernel used in many fields such as image processing, data compression, and database operations. There have been many attempts to accelerate sorting using FPGAs, most of which are based on the merge sort algorithm. Merge sorter trees are tree-structured architectures for large-scale sorting. When a merge sorter tree with K input leaves merges N elements, the merge phases are performed recursively, so its time complexity is O(N log_K N). Hence, to achieve higher sorting performance, it is effective to increase the number of input leaves K. However, the hardware resource usage is O(K), so it is difficult to efficiently implement a merge sorter tree with many input leaves. Ito et al. recently proposed an algorithm that reduces the hardware complexity of a merge sorter tree with K input leaves from O(K) to O(log K). However, they report evaluation results only for K = 8 and K = 16. In this paper, we propose a cost-effective and scalable merge sorter tree architecture based on their algorithm. We show that our design achieves almost the same performance as the conventional design, whose hardware complexity is O(K). We implement a merge sorter tree with 1,024 input leaves on a Xilinx XC7VX485T-2 FPGA and show that the proposed architecture has 52.4× better logic slice utilization with only 1.31× performance degradation compared with the conventional design. We also succeed in implementing a very large merge sorter tree with 4,096 input leaves, which cannot be implemented using the conventional design. This tree achieves a merging throughput of 149 million 64-bit elements per second while using 1.72% of the slices and 7.48% of the block RAMs of the FPGA.
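A software sketch of the K-leaf merge-sorter-tree behavior: sorting N elements by repeated K-way merge passes takes about log_K N passes, each touching all N elements, which is where the O(N log_K N) bound above comes from. This models only the algorithmic behavior, not the hardware architecture.

```python
import heapq, math, random

def kway_merge_sort(data, K):
    """Sort by repeated K-way merge passes, counting the passes."""
    runs = [[x] for x in data]                     # start from single-element runs
    passes = 0
    while len(runs) > 1:
        runs = [list(heapq.merge(*runs[i:i + K]))  # one pass of the K-leaf tree
                for i in range(0, len(runs), K)]
        passes += 1
    return (runs[0] if runs else []), passes

data = [random.randrange(10**6) for _ in range(4096)]
result, passes = kway_merge_sort(data, K=1024)
assert result == sorted(data)
assert passes == math.ceil(math.log(len(data), 1024))   # 2 passes here
```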
networks on chips | 2016
Masashi Imai; Thiem Van Chu; Kenji Kise; Tomohiro Yoneda
field programmable custom computing machines | 2018
Makoto Saitoh; Elsayed A. Elsayed; Thiem Van Chu; Susumu Mashimo; Kenji Kise