
Publication


Featured research published by Wanming Chu.


Field-Programmable Custom Computing Machines | 1997

Implementation of single precision floating point square root on FPGAs

Yamin Li; Wanming Chu

The square root operation is hard to implement on FPGAs because of the complexity of the algorithms. In this paper, we present a non-restoring square root algorithm and two very simple single precision floating point square root implementations on FPGAs based on the algorithm. One is a low-cost iterative implementation that uses a traditional adder/subtracter; its operation latency is 25 clock cycles and its issue rate is 24 clock cycles. The other is a high-throughput pipelined implementation that uses multiple adder/subtracters; its operation latency is 15 clock cycles and its issue rate is one clock cycle, meaning the pipelined implementation can accept a new square root instruction on every clock cycle.
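The non-restoring algorithm the paper builds on is a classic technique; the following is a minimal software sketch of non-restoring integer square root for illustration only. It is not the authors' FPGA implementation, which operates on floating-point mantissas in hardware.

```python
def nr_isqrt(d: int, bits: int = 32) -> tuple[int, int]:
    """Non-restoring integer square root of d (0 <= d < 2**bits).

    Returns (root, remainder) such that root**2 + remainder == d.
    Two bits of the radicand are consumed per iteration; the partial
    remainder is allowed to go negative instead of being restored.
    """
    q = 0  # partial root
    r = 0  # partial remainder (may be negative between iterations)
    for i in range(bits // 2 - 1, -1, -1):
        pair = (d >> (2 * i)) & 0b11      # next two bits of the radicand
        r = (r << 2) | pair               # shift them into the remainder
        if r >= 0:
            r -= (q << 2) | 1             # subtract (4q + 1)
        else:
            r += (q << 2) | 3             # add (4q + 3) instead of restoring
        q = (q << 1) | (1 if r >= 0 else 0)
    if r < 0:                             # one final restoration step
        r += (q << 1) | 1
    return q, r
```

Because each iteration performs exactly one add or subtract, the loop body maps naturally onto the single adder/subtracter of the iterative design, or onto one pipeline stage per bit in the pipelined design.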


International Conference on Computer Design | 1997

Parallel-array implementations of a non-restoring square root algorithm

Yamin Li; Wanming Chu

In this paper, we present a parallel-array implementation of a new non-restoring square root algorithm (PASQRT). A carry-save adder (CSA) is used in the parallel array. The PASQRT has several features that distinguish it from other implementations. First, it does not use a redundant representation for the square root result. Second, each iteration generates an exact result value. Third, it does not require any conversion of the CSA inputs. Last, a precise remainder can be obtained immediately. Furthermore, we present an improved version, a root-select parallel-array implementation (RS-PASQRT), for fast result generation. The RS-PASQRT achieves a speedup of up to about 150% over the PASQRT. The simplicity of the implementations indicates that the proposed approach is an alternative to consider when designing a fully pipelined square root unit.
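The carry-save adder at the heart of the array is a 3:2 compressor: it reduces three operands to a sum vector and a carry vector without propagating carries. A one-line software model of the identity it relies on:

```python
def csa(x: int, y: int, z: int) -> tuple[int, int]:
    """Carry-save adder (3:2 compressor) modeled on Python integers
    treated as bit-vectors. Returns (s, c) with s + c == x + y + z."""
    s = x ^ y ^ z                              # bitwise sum, carries ignored
    c = ((x & y) | (x & z) | (y & z)) << 1     # majority bits become carries
    return s, c
```

Because no carry ripples through the word, each CSA stage has constant delay regardless of operand width, which is what makes a fully pipelined array of them fast.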


Annual Computer Security Applications Conference | 2001

Exploiting Java instruction/thread level parallelism with horizontal multithreading

Kenji Watanabe; Wanming Chu; Yamin Li

Java bytecodes can be executed in three ways: a Java interpreter running on a particular machine interprets the bytecodes; a Just-in-Time (JIT) compiler translates the bytecodes to the native primitives of the machine, which then executes the translated code; or a Java processor executes the bytecodes directly. The first two methods require no special hardware support for the execution of Java bytecodes and are widely used today. The last method requires an embedded Java processor, such as picoJavaI or picoJavaII. The picoJavaI and picoJavaII are simple pipelined processors with no support for ILP (instruction-level parallelism) or TLP (thread-level parallelism). The MAJC (microprocessor architecture for Java computing) design can exploit ILP and TLP using a modified VLIW (very long instruction word) architecture and vertical multithreading, but it has its own instruction set and cannot execute Java bytecodes directly. In this paper, we investigate a processor architecture that executes Java bytecodes directly while exploiting Java ILP and TLP simultaneously. The proposed processor consists of multiple slots, implementing horizontal multithreading, and multiple functional units shared by all threads executing in parallel. Our architectural simulation results show that the Java processor can achieve an average of 20 IPC (instructions per cycle), or 7.33 EIPC (effective IPC), with 8 slots and a 4-instruction scheduling window per slot. We also examine other configurations and report the utilization of the functional units as well as the performance improvement under various workloads.
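The idea of horizontal multithreading, where several slots compete each cycle for a shared pool of functional units, can be illustrated with a toy probabilistic issue model. All parameters and names here (p_ready, the issue policy) are hypothetical simplifications, not the paper's simulator:

```python
import random

def simulate(slots: int, window: int, fus: int, cycles: int = 10_000,
             p_ready: float = 0.6, seed: int = 1) -> float:
    """Toy horizontal-multithreading issue model.

    Each cycle, every slot (thread) offers up to `window` instructions,
    each independently ready with probability p_ready; slots compete in
    order for `fus` shared functional units. Returns mean instructions
    issued per cycle (a rough IPC figure for this toy model).
    """
    rng = random.Random(seed)
    issued = 0
    for _ in range(cycles):
        free = fus
        for _slot in range(slots):
            ready = sum(rng.random() < p_ready for _ in range(window))
            take = min(ready, free)
            issued += take
            free -= take
            if free == 0:
                break
    return issued / cycles
```

Even this crude model shows the qualitative effect the paper measures: adding slots raises shared-unit utilization until the functional units saturate.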


The Journal of Supercomputing | 2010

Metacube—a versatile family of interconnection networks for extremely large-scale supercomputers

Yamin Li; Shietung Peng; Wanming Chu

In the next decade, high-performance supercomputers will consist of several million CPUs. The interconnection networks in such supercomputers play an important role in achieving high performance. In this paper, we introduce the Metacube (MC), a versatile family of interconnection networks that can connect an extremely large number of nodes with a small number of links per node while keeping the diameter rather low. An MC network has a 2-level hypercube structure. An MC(k,m) network can connect 2^(2^k·m+k) nodes with m+k links per node, where k is the dimension of the high-level hypercubes (classes) and m is the dimension of the low-level hypercubes (clusters). An MC is a symmetric network with a short diameter and easy, efficient routing and broadcasting similar to that of the hypercube. However, the MC network can connect millions of nodes with up to 6 links per node. An MC(2,3) with 5 links per node has 16,384 nodes, and an MC(3,3) with 6 links per node has 134,217,728 nodes. We describe the MC network's structure, topological properties, routing and broadcasting algorithms, and Hamiltonian cycle embedding in Metacube networks.


IEEE International Conference on High Performance Computing, Data and Analytics | 2005

Fault-tolerant cycle embedding in dual-cube with node faults

Yamin Li; Shietung Peng; Wanming Chu

A low-degree dual-cube was proposed as an alternative to the hypercube. A dual-cube DC(m) has m+1 links per node, where m is the degree of a cluster (an m-cube) and one more link is used for connecting to a node in another cluster. There are 2^(m+1) clusters, and hence the total number of nodes in a DC(m) is 2^(2m+1). In this paper, by using Gray codes, we show that there exists a fault-free cycle containing at least 2^(2m+1) − 2f nodes in a DC(m), m ≥ 3, with f ≤ m faulty nodes.


Mobile Adhoc and Sensor Systems | 2006

An Efficient Algorithm for Finding an Almost Connected Dominating Set of Small Size on Wireless Ad Hoc Networks

Yamin Li; Shietung Peng; Wanming Chu

In this paper, we propose an efficient, distributed, and localized algorithm for finding an almost connected dominating set of small size on wireless ad hoc networks. Broadcasting and routing based on a connected dominating set (CDS) is a promising approach. A set is dominating if every node of the network is either in the set or a neighbor of a node in the set. The efficiency of dominating-set-based broadcasting or routing depends mainly on the overhead of constructing the dominating set and on the size of the dominating set. Our algorithm finds a CDS faster, and the set it finds is smaller, than the previous algorithms proposed in the literature. Although our algorithm cannot guarantee that the found set is actually a CDS, our simulation results show that the probability that the found set is a CDS is higher than 99.96% in all cases.


Annual Computer Security Applications Conference | 2000

Cost/performance tradeoff of n-select square root implementations

Wanming Chu; Yamin Li

Hardware square-root units require large numbers of gates even for iterative implementations. In this paper, we present four low-cost, high-performance, fully pipelined n-select implementations (nS-Root) based on a non-restoring-remainder square root algorithm. The nS-Root uses a parallel array of carry-save adders (CSAs). Each CSA is used once per square root bit calculation, so the calculations can be fully pipelined. The design also uses the n-way root-select technique to speed up the square root calculation. The cost/performance evaluation shows that n=2 or n=2.5 is a suitable choice for designing a high-speed, fully pipelined square root unit while keeping the cost low.


High Performance Computer Architecture | 1995

The effects of STEF in finely parallel multithreaded processors

Yamin Li; Wanming Chu

The throughput of a multiple-pipelined processor suffers from a lack of sufficient instructions to keep the multiple pipelines busy and from the delays associated with pipeline dependencies. Finely Parallel Multithreaded Processor (FPMP) architectures address these problems by dispatching multiple instructions from multiple instruction threads in parallel. This paper proposes an analytic model that quantifies the advantage of FPMP architectures. The effects of four important parameters in FPMP, S, T, E, and F (STEF), are evaluated. Unlike previous analytic models of multithreaded architectures, the model presented here addresses the performance of multiple pipelines: it deals not only with pipeline dependencies but also with structural conflicts. The model accepts the configuration parameters of an FPMP, the distribution of instruction types, and the distribution of interlock delay cycles, and provides quick performance and utilization predictions that are helpful in processor design.


Parallel and Distributed Computing: Applications and Technologies | 2007

Efficient Algorithms for Finding a Trunk on a Tree Network and Its Applications

Yamin Li; Shietung Peng; Wanming Chu


Workshop on Computer Architecture Education | 1996

Using FPGA for computer architecture/organization education

Yamin Li; Wanming Chu
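The node-count formulas quoted in the Metacube and dual-cube abstracts above can be checked with a few lines. This is an illustrative sketch, not code from the papers:

```python
def mc_nodes(k: int, m: int) -> int:
    """Metacube MC(k,m): 2^(2^k * m + k) nodes, m + k links per node."""
    return 2 ** (2 ** k * m + k)

def dc_nodes(m: int) -> int:
    """Dual-cube DC(m): 2^(2m+1) nodes, m + 1 links per node."""
    return 2 ** (2 * m + 1)

def dc_cycle_lower_bound(m: int, f: int) -> int:
    """Lower bound on fault-free cycle length in DC(m) with f <= m faults."""
    return dc_nodes(m) - 2 * f

# Figures quoted in the abstracts:
assert mc_nodes(2, 3) == 16_384          # MC(2,3), 5 links per node
assert mc_nodes(3, 3) == 134_217_728     # MC(3,3), 6 links per node
```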
