Ming-Der Shieh
National Cheng Kung University
Publication
Featured research published by Ming-Der Shieh.
IEEE Transactions on Circuits and Systems | 2008
Chin-Long Wey; Ming-Der Shieh; Shin-Yo Lin
Given a set of numbers X, finding the minimum value of X, min_1st, is a very easy task. However, efficiently finding its second minimum value, min_2nd, requires deriving min_1st and then finding the minimum of the remaining numbers. Efficient algorithms and cost-effective hardware for finding the two smallest values of X are greatly needed for low-density parity-check (LDPC) decoder design. The following two architectures are developed in this paper: (1) a sorting-based (XS) approach and (2) a tree structure (TS) approach. Experimental results show that the XS approach requires fewer comparisons, while the TS approach achieves higher speed at lower hardware cost. Since the hardware unit is used repeatedly in the LDPC decoder design, the proposed high-speed, low-cost TS approach is strongly recommended.
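As a rough software illustration of the tree-structured idea (not the paper's hardware design), the sketch below merges (smallest, second smallest) pairs up a binary tree. The function name `two_smallest_tree` and the merge rule are assumptions made for this example.

```python
from typing import List, Tuple


def two_smallest_tree(values: List[int]) -> Tuple[int, int]:
    """Return (min_1st, min_2nd) by merging candidate pairs up a binary tree."""
    if len(values) < 2:
        raise ValueError("need at least two values")
    inf = float("inf")
    # Each leaf holds (smallest, second smallest); a single value has no runner-up yet.
    level = [(v, inf) for v in values]
    while len(level) > 1:
        merged = []
        for i in range(0, len(level) - 1, 2):
            (a1, a2), (b1, b2) = level[i], level[i + 1]
            if a1 <= b1:
                # a1 wins; the runner-up is the smaller of a's runner-up and b's winner.
                merged.append((a1, min(a2, b1)))
            else:
                merged.append((b1, min(b2, a1)))
        if len(level) % 2:
            # An unpaired node passes through to the next level unchanged.
            merged.append(level[-1])
        level = merged
    return level[0]
```

Each merge node needs only two comparisons, and the tree depth grows logarithmically with the number of inputs, which is the property that makes a tree arrangement attractive when speed matters.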
IEEE Transactions on Very Large Scale Integration Systems | 2008
Ming-Der Shieh; Jun-Hong Chen; Hao-Hsuan Wu; Wen-Ching Lin
Modular exponentiation with a large modulus, which is usually accomplished by repeated modular multiplications, has been widely used in public key cryptosystems for secure data communications. To speed up the computation, the Montgomery modular multiplication algorithm is used to relax the process of quotient determination, and carry-save addition (CSA) is employed to reduce the critical path delay. In this paper, based on the inherent data dependency between the modular multiplication and square operations in the H-algorithm of modular exponentiation, we present a new modular exponentiation architecture with a unified modular multiplication/square module and show how to reduce the number of input operands for the CSA tree by mathematical manipulation. The developed architecture has the following advantages. 1) There is no need to convert the carry-save form of an operand into its binary representation at the end of each modular multiplication. In this way, except for the final step that produces the result of the modular exponentiation, the time-consuming carry propagation can be eliminated. 2) The number of input operands for the CSA tree is reduced in a very efficient way. 3) The hardware saving is achieved with very limited impact on the original critical path delay of a design with two distinct modular multiplication and square components. Experimental results show that our modular exponentiation design has the lowest hardware complexity among existing work and also outperforms it in terms of area-time (AT) complexity.
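The flow described above can be sketched in software, assuming that the H-algorithm refers to the left-to-right (most significant bit first) binary method. The names `montgomery_product` and `mod_exp_h` are hypothetical, and the sketch uses ordinary integers, so it does not model the carry-save representation or the unified multiply/square module.

```python
def montgomery_product(a: int, b: int, n: int, k: int) -> int:
    """Return a * b * 2^(-k) mod n for odd n < 2^k and a, b < n (radix-2 form)."""
    s = 0
    for i in range(k):
        s += ((a >> i) & 1) * b
        # Quotient determination: add n when s is odd so the halving is exact.
        if s & 1:
            s += n
        s >>= 1
    return s - n if s >= n else s


def mod_exp_h(m: int, e: int, n: int) -> int:
    """Compute m^e mod n for odd n, scanning the exponent from its top bit."""
    if n % 2 == 0:
        raise ValueError("Montgomery multiplication needs an odd modulus")
    k = n.bit_length()
    r2 = (1 << (2 * k)) % n                  # R^2 mod n with R = 2^k
    x = montgomery_product(m % n, r2, n, k)  # m in Montgomery form
    acc = montgomery_product(1, r2, n, k)    # 1 in Montgomery form
    for bit in bin(e)[2:]:
        # The square always runs; the multiply depends on the square's result.
        acc = montgomery_product(acc, acc, n, k)
        if bit == "1":
            acc = montgomery_product(acc, x, n, k)
    return montgomery_product(acc, 1, n, k)  # leave Montgomery form
```

In this scan order each multiplication consumes the result of the square that precedes it, which is the data dependency between the two operations that the abstract refers to.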
IEEE Transactions on Computers | 2004
Chien-Hsing Wu; Chien-Ming Wu; Ming-Der Shieh; Yin-Tsung Hwang
We extend the binary algorithm invented by Stein and propose novel iterative division algorithms over GF(2^m) for systolic VLSI realization. While algorithm EBg is a basic prototype with guaranteed convergence in at most 2m - 1 iterations, its variants, algorithms EBd and EBdf, are designed for reduced complexity and fixed critical path delay, respectively. We show that algorithms EBd and EBdf can be mapped to parallel-in parallel-out systolic circuits with low area-time complexities of O(m^2 log log m) and O(m^2), respectively. Compared to the systolic designs based on the extended Euclid's algorithm, our circuits exhibit significant speed and area advantages.
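To make the idea of Stein-style division concrete, here is a minimal sketch of a textbook binary division over GF(2^m) that uses only shifts and XORs. It is not the paper's EBg, EBd, or EBdf algorithm; the function names and the bit-vector encoding of polynomials are assumptions for illustration.

```python
def gf2m_divide(b: int, a: int, f: int) -> int:
    """Return b / a in GF(2)[x]/(f); a, b are reduced bit vectors, f is irreducible."""
    if a == 0:
        raise ZeroDivisionError("division by zero in GF(2^m)")
    u, v = a, f
    g1, g2 = b, 0  # invariants: g1 * a = u * b and g2 * a = v * b (mod f)
    while u != 1 and v != 1:
        while u & 1 == 0:
            u >>= 1  # divide u by x
            # Divide g1 by x modulo f, adding f first when g1 has a constant term.
            g1 = g1 >> 1 if g1 & 1 == 0 else (g1 ^ f) >> 1
        while v & 1 == 0:
            v >>= 1
            g2 = g2 >> 1 if g2 & 1 == 0 else (g2 ^ f) >> 1
        # Subtract (XOR) the lower-degree polynomial from the higher-degree one.
        if u.bit_length() > v.bit_length():
            u ^= v
            g1 ^= g2
        else:
            v ^= u
            g2 ^= g1
    return g1 if u == 1 else g2


def gf2m_multiply(x: int, y: int, f: int) -> int:
    """Shift-and-add multiplication modulo f, used here only to check the division."""
    m = f.bit_length() - 1
    result = 0
    while y:
        if y & 1:
            result ^= x
        y >>= 1
        x <<= 1
        if (x >> m) & 1:
            x ^= f
    return result
```

A quick check is to multiply the quotient back: with f = 0x11B (the degree-8 polynomial used by AES), `gf2m_multiply(gf2m_divide(b, a, f), a, f)` should return `b` for every nonzero `a`.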
IEEE Transactions on Computers | 2010
Ming-Der Shieh; Wen-Ching Lin
Modular multiplication is a crucial operation in public key cryptosystems like RSA and elliptic curve cryptography (ECC). This paper presents a new word-based Montgomery modular multiplication algorithm which can be used to achieve a low-latency scalable architecture for efficient hardware implementations. We show how to relax the data dependency in conventional word-based algorithms so that a latency of exactly one cycle can be obtained regardless of the chosen word size w (w > 1). With the presented operand reduction scheme, the proposed scalable architecture can operate at high speeds and suitable data paths can be chosen for specific applications. Complexity analysis shows that the proposed architecture has the lowest latency and area complexity compared to related scalable architectures. Experimental results demonstrate that our design has area, speed, and flexibility advantages over related schemes.
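For context, the sketch below shows a conventional word-by-word, radix-2 Montgomery multiplication of the general kind that scalable designs build on. It is an assumed baseline for illustration, not the algorithm proposed in the paper, and the name `word_montgomery` and its parameters are hypothetical.

```python
def word_montgomery(x: int, y: int, m: int, k: int, w: int) -> int:
    """Return x * y * 2^(-k) mod m for odd m < 2^k and x, y < m, using w-bit words."""
    mask = (1 << w) - 1
    words = (k + w) // w  # enough words to hold intermediate sums below 2^(k+1)
    y_words = [(y >> (w * j)) & mask for j in range(words)]
    m_words = [(m >> (w * j)) & mask for j in range(words)]
    s = [0] * words
    for i in range(k):
        x_i = (x >> i) & 1
        # Quotient bit: decides whether m is added to make the sum even.
        q = (s[0] + x_i * y_words[0]) & 1
        carry = 0
        for j in range(words):
            t = s[j] + x_i * y_words[j] + q * m_words[j] + carry
            carry = t >> w
            word = t & mask
            if j > 0:
                # Right shift by one: the LSB of word j becomes the MSB of word j-1.
                s[j - 1] |= (word & 1) << (w - 1)
            s[j] = word >> 1
        s[words - 1] |= carry << (w - 1)
    total = sum(word << (w * j) for j, word in enumerate(s))
    return total - m if total >= m else total
```

The one-bit right shift means word j-1 of the result cannot be completed until the least significant bit of word j is known. This is one example of the kind of data dependency between words that a low-latency design has to relax.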
International Symposium on Circuits and Systems | 2001
Hsin-Fu Lo; Ming-Der Shieh; Chien-Ming Wu
This paper describes the design of Fast Fourier Transform (FFT) for the Eureka-147 DAB system. We investigate several possible FFT implementations based on the single butterfly architecture, including an in-place memory structure, to minimize the hardware requirement. We also describe a unified approach toward partitioning the whole memory into several banks so as to increase the equivalent memory bandwidth between the memory unit and the butterfly unit, which can be implemented in either radix-2 or high-radix arithmetic. Implementation results demonstrate the applicability of our work to the targeted channel demodulator and the advantages over previous solutions.
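The in-place, single-butterfly organization can be illustrated in software with a radix-2 decimation-in-time FFT that overwrites its input. This is a generic textbook sketch, not the memory-banked design from the paper, and `fft_in_place` is a hypothetical name.

```python
import cmath
from typing import List


def fft_in_place(x: List[complex]) -> List[complex]:
    """Radix-2 FFT that overwrites x; len(x) must be a power of two."""
    n = len(x)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    # Bit-reversal permutation so every butterfly can write back where it read.
    j = 0
    for i in range(1, n):
        bit = n >> 1
        while j & bit:
            j ^= bit
            bit >>= 1
        j |= bit
        if i < j:
            x[i], x[j] = x[j], x[i]
    size = 2
    while size <= n:
        half = size // 2
        step = cmath.exp(-2j * cmath.pi / size)
        for start in range(0, n, size):
            w = 1 + 0j
            for k in range(half):
                # One butterfly: two reads, two writes to the same addresses.
                a = x[start + k]
                b = x[start + k + half] * w
                x[start + k] = a + b
                x[start + k + half] = a - b
                w *= step
        size *= 2
    return x
```

Because each butterfly reads and writes the same two locations, a single buffer of n samples is enough. In hardware, those two accesses per butterfly are what make memory bandwidth the limiting factor.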
IEEE Transactions on Circuits and Systems | 2009
Ming-Der Shieh; Jun-Hong Chen; Wen-Ching Lin; Hao-Hsuan Wu
Modular exponentiation in public-key cryptosystems is usually achieved by repeated modular multiplications on large integers. Designing high-speed modular multiplication is thus crucial to speeding up the encryption/decryption process. In this paper, we first explore how to relax the data dependency that exists between multiplication, quotient determination, and modular reduction in the conventional Montgomery modular multiplication algorithm. Then, we propose a new modular multiplication algorithm for high-speed hardware design. The speed improvement is achieved by reducing the critical path delay from that of a 4-to-2 carry-save addition to that of a 3-to-2 carry-save addition. The resulting time complexity is further decreased by performing the multiplication and modular reduction processes simultaneously. Experimental results show that the developed modular multiplication can operate at speeds higher than those of related work. When the proposed modular multiplication is applied to modular exponentiation, both time and area-time advantages are obtained.
IEEE Transactions on Very Large Scale Integration Systems | 2010
Jun Hong Chen; Ming-Der Shieh; Wen Ching Lin
With rapid increases in communication and network applications, cryptography has become a crucial issue to ensure the security of transmitted data. In this paper, we propose a microcode-based architecture with a novel reconfigurable datapath which can perform either prime field GF(p) operations or binary extension field GF(2^m) operations for arbitrary prime numbers, irreducible polynomials, and precision. Using these field arithmetic units, users are capable of programming cryptographic algorithms in microcode sequences for full compliance with a majority of public-key cryptographic algorithms such as Rivest-Shamir-Adleman (RSA) and elliptic curve cryptosystems. An algorithmic optimization or refinement can thus be made at a higher level based on the reconfigurable datapath. Experimental results show that the developed processor has full cryptography algorithm flexibility, high hardware utilization, and high performance.
International Conference on Consumer Electronics | 1999
Ming-Der Shieh; Chien-Ming Wu; Hsiao-Hsing Chou; Min-Hui Chen; Chia-Liang Liu
This paper describes the design of the de-interleaver and punctured Viterbi decoder for the Eureka-147 DAB system and their corresponding VLSI implementations. We emphasize how to efficiently handle the four DAB transmission modes, time/frequency de-interleaving, and path metric/survivor memory management in our development. Results show that our implementation is modular, consumes less silicon area, and is easy to extend to high transmission rate requirements. The core size of the resulting chip implementation is 4990 × 4930 μm^2 based on the Taiwan Semiconductor Manufacturing Company (TSMC) 0.6 μm single-polysilicon-triple-metal CMOS process.
International Symposium on Circuits and Systems | 2001
Chien-Hsing Wu; Chien-Ming Wu; Ming-Der Shieh; Yin-Tsung Hwang
We present a parallel-in parallel-out systolic division circuit over GF(2^m) based on the novel extended Stein's algorithm that provides guaranteed convergence in 2^m - 1 iterations. The area-time (AT) complexity of our design is O(m^2) and the achievable maximum clock rate is 1 GHz based on the 0.6 μm technology. Compared to the best systolic design known to date based on the extended Euclid's algorithm, the proposed circuit exhibits significant area and speed advantages.
IEEE Transactions on Computers | 1998
Chin-Long Wey; Ming-Der Shieh
Given a binary number N, the simplest way of evaluating its square N^2 is to use ROM look-up tables. For example, the squares of 12-bit numbers can be stored in a ROM of (2^12 × 24) bits, which takes an area of 3.5 mm^2 and an access time of 9.96 ns with a 0.8 μm CMOS process. However, conventional ROM table approaches are limited to small bit size applications due to the unmanageable increase of the ROM table size. A novel design of a square generator circuit using a folding approach is presented for high-speed applications. Results show that, with the same process, the proposed square generator circuit takes 12.27 ns to generate the squares of 40-bit numbers with an area of about 2.88 times that of the (2^12 × 24) ROM, i.e., 10 mm^2, a design trade-off between speed and area. A nested structure is also presented to achieve a 103-bit square generator with a delay of 15.82 ns. The bit size can be further increased by adding more levels of the nested structure. The results are promising, and the proposed approach is thus well suited to large bit size and high-speed applications.
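The abstract does not spell out the folding approach. One common reading, assumed here, is that the symmetric partial-product matrix of a squarer is folded along its diagonal, so each off-diagonal pair n_i n_j is generated once and shifted left by one bit. The function name `square_folded` is hypothetical, and the loop models only the arithmetic identity, not the circuit.

```python
def square_folded(n: int, bits: int) -> int:
    """Square a bits-wide number from the folded partial-product matrix."""
    total = 0
    for i in range(bits):
        if not (n >> i) & 1:
            continue
        # Diagonal term: n_i * n_i equals n_i, weighted by 2^(2i).
        total += 1 << (2 * i)
        for j in range(i + 1, bits):
            if (n >> j) & 1:
                # n_i*n_j and n_j*n_i fold into one term with doubled weight.
                total += 1 << (i + j + 1)
    return total
```

Under this reading, folding roughly halves the number of partial-product bits compared with a general multiplier array, since only the diagonal and one triangle of the matrix remain.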